## Extract reviews from TripAdvisor
Steps to download the files:
1. Go to the webpage: https://www.tripadvisor.com/Attraction_Review-g186525-d213530-Reviews-or2000-Edinburgh_Zoo-Edinburgh_Scotland.html
2. Save the page with Save as > HTML only, and rename the file so that it contains the review page number.
3. Move to the next page (the "orNNN" segment of the URL changes with each page).
4. Put all the saved files into the folder called "edinburgh-zoo-reviews".
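
The page URLs in the steps above differ only in the number that follows `or`, so the list of pages to visit can be generated rather than typed by hand. This is a minimal sketch: `page_url` is a hypothetical helper, and the step of 10 reviews per page is an assumption that should be checked against the links the site actually produces.

```python
# Hypothetical helper: builds the review-page URL for a given review offset.
BASE = "https://www.tripadvisor.com/Attraction_Review-g186525-d213530-Reviews"
SLUG = "Edinburgh_Zoo-Edinburgh_Scotland.html"

def page_url(offset: int) -> str:
    """Return the URL whose 'orNNN' segment equals the given offset."""
    return f"{BASE}-or{offset}-{SLUG}"

# Assumption: each page shows 10 reviews, so the offset advances by 10.
REVIEWS_PER_PAGE = 10
urls = [page_url(n * REVIEWS_PER_PAGE) for n in range(1, 4)]
for url in urls:
    print(url)
```

Calling `page_url(2000)` reproduces the link given in step 1.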

In [2]:
# Install packages
!pip install beautifulsoup4 pandas



In [3]:
# SETUP
# Import packages
from bs4 import BeautifulSoup
import pandas as pd
import re

# Functions
def extract_reviews(html_path) -> list:
    '''
    Extracts unique reviews from a given HTML file.
    Returns a list of dicts with the keys "title", "review" and "date".
    "date" will be in the format D Month YYYY (e.g. 1 June 2024), or None if no date was found.
    '''
    with open(html_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")

    all_divs = soup.find_all("div")
    matched = []
    for div in all_divs:
        text = div.get_text(strip=True)
        classes = div.get("class", [])
        # A title div contains a link to the full review page
        link = div.find("a")
        is_title = link is not None and link.get("href", "").startswith("/ShowUserReviews")
        # Review bodies carry both of these CSS classes in the saved pages
        is_review = "biGQs" in classes and "pZUbB" in classes

        if is_title or is_review:
            matched.append({"type": "title" if is_title else "review", "text": text})

        # Date lines start with "Written", e.g. "Written 1 June 2024"
        is_date = text.startswith("Written")
        if is_date:
            matched.append({"type": "date", "text": text})

    tuples = []
    seen = set()  # track seen (title, review, date) tuples

    i = 0
    while i < len(matched):
        if matched[i]["type"] == "title":
            title = matched[i]["text"]
            review = matched[i + 1]["text"] if i + 1 < len(matched) and matched[i + 1]["type"] == "review" else None

            # Get the date
            date = matched[i + 2]["text"] if i + 2 < len(matched) and matched[i + 2]["type"] == "date" else None
            pattern = r"Written\s+(\d{1,2}\s+[A-Z][a-z]+\s+\d{4})"
            if date:
                match = re.search(pattern, date)
                date = match.group(1) if match else None

            key = (title, review, date)
            if key not in seen:
                tuples.append({"title": title, "review": review, "date": date})
                seen.add(key)
            # Advance by one so a title directly following another title is not skipped
            i += 1
        else:
            i += 1

    return tuples
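
The date handling inside `extract_reviews` can be tried in isolation. The sketch below applies the same pattern to a made-up string and converts the match to a `datetime.date`. The helper name `parse_written_date` and the sample text are illustrative assumptions, not part of the notebook, and the conversion assumes English month names.

```python
import re
from datetime import date, datetime
from typing import Optional

# Same pattern as in extract_reviews
PATTERN = r"Written\s+(\d{1,2}\s+[A-Z][a-z]+\s+\d{4})"

def parse_written_date(text: str) -> Optional[date]:
    """Hypothetical helper: pull the date out of a 'Written ...' line."""
    match = re.search(PATTERN, text)
    if match is None:
        return None
    # %B expects a full month name in the current locale (assumed English)
    return datetime.strptime(match.group(1), "%d %B %Y").date()

# Made-up sample; get_text(strip=True) can glue the following text onto the date
sample = "Written 28 June 2025This review is the opinion of a member."
print(parse_written_date(sample))
```

Returning `None` for text without a match mirrors how `extract_reviews` leaves the date empty when the pattern is not found.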

In [None]:
reviews = []
base_folder = 'edinburgh-zoo-reviews/'

for page_i in range(1, 11):  # Pages 1-10 were saved under a different page title (file name) than later pages
    html_path = f"{base_folder}EDINBURGH ZOO (2025) All You Should Know BEFORE You Go (w_ Reviews)-{page_i}.html"
    reviews.extend(extract_reviews(html_path))
    print(f"Processed page {page_i}", end="\r")
for page_i in range(11, 200):
    html_path = f"{base_folder}Edinburgh Zoo (2025) - All You Need to Know BEFORE You Go (with Reviews)-{page_i}.html"
    reviews.extend(extract_reviews(html_path))
    print(f"Processed page {page_i}", end="\r")

# Create and display final DataFrame
pd.set_option('display.max_colwidth', 3000)
df = pd.DataFrame(reviews)
display(df.head())

# Save the DataFrame to a CSV file
num_reviews = len(df)
df.to_csv(f"edinburgh-zoo-reviews-{num_reviews}.csv", index=False)



Unnamed: 0,title,review,date
0,Family Day Out At The Zoo,"Had an amazing time here with my little girl and her gran on a sunny day in June.Variety of animals the zoo has is incredible, plenty of places to eat and drink, great selection of play parks for kids, and the staff do an amazing job of keeping the entire zoo immaculate.We all had a magical day and would highly recommend it for a day out with the family.It's a little bit hilly (especially on way up to see lions and tigers) but well worth the climb uphill to see such beautiful animals.The animal talks were highly engaging and would also highly recommend attending these when visiting.Keep up the brilliant work RZSS.",28 June 2025
1,An excellent day out,We visited using the season ticket from our local zoo. From arrival we were made very welcome. Staff took time out to talk to us and tell us about the animals. We loved the large enclosures that the animals had to roam in and had great views of many during the day. Thank you everyone for a great day!,25 June 2025
2,Our favourite place for our days off,We’re members and love the zoo.So many different animals too see and easy to get to by the 26 or 31 bus from the Center.It is a challenging zoo as it’s all on hill so be prepared to get your hike on but it’s all worth it.Staff are great and always happy to helpWorth a visit we almost go every other month,20 June 2025
3,Youth club day out,"Brilliant day out with the Youth Club at the Zoo, we travelled all the way up from County Durham and the zoo staff couldnt have been more helpful from initial contact to collecting the tickets on the day. The zoo was clean and tidy and the group loved it with the penguins being a firm favourite. We even took advantage of the lunch voucher system and the visit triggered some additional workshops following the visit.",14 June 2025
4,Best zoo,"This was without a doubt one of the best zoo visits we've ever had (helped by the fact we stayed at the Holiday Inn on sight and were offered discounted tickets after 14:00, so therefore the crowds had also dispersed) Despite the layout being steep, the views over Edinburgh were superb once at the top, everything was well laid out and easy to find. Just wish we'd had longer, but I'm sure we will return!",12 June 2025
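
The loading cell above depends on the exact file names the browser produced. As an alternative, here is a sketch that collects every `.html` file in the folder and orders the files by the page number at the end of the name. It assumes each file name ends in `-<page number>.html`, as in the paths used above; `page_number` and `list_pages` are hypothetical helpers.

```python
import re
from pathlib import Path

def page_number(name: str) -> int:
    """Return the trailing page number from a name like '...-12.html'."""
    match = re.search(r"-(\d+)\.html$", name)
    if match is None:
        raise ValueError(f"No page number found in file name: {name!r}")
    return int(match.group(1))

def list_pages(folder: str) -> list:
    """Return the saved HTML files in `folder`, sorted by page number."""
    return sorted(Path(folder).glob("*.html"), key=lambda p: page_number(p.name))

# Sorting by the parsed number avoids the "10 before 2" order of a plain text sort
names = ["Zoo-10.html", "Zoo-2.html", "Zoo-1.html"]
print(sorted(names, key=page_number))
```

Each path returned by `list_pages("edinburgh-zoo-reviews")` could then be passed to `extract_reviews` in a single loop, whatever title the browser gave the file.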
