# Web Scraping and Data Analysis of the Top 6 Luxurious Hotels in Tunisia

In a world fueled by data, extracting meaningful insights from the vast expanse of the internet has become a valuable skill. This project marks the inception of a personal endeavor, where the focus is on unraveling the world of Tunisia's most luxurious hotels through web scraping and data analysis.

As of August 15, 2023, the journey begins with the art of web scraping, a technique that allows for the automated extraction of data from websites. The aim is clear: to collect and dissect information from TripAdvisor about the top 6 luxury hotels in Tunisia. This data will serve as the building blocks for an in-depth analysis, shedding light on the nuances that define these opulent establishments.

In a solo pursuit of knowledge, the project delves into the realm of Python programming, employing libraries like BeautifulSoup and Requests. The objective is to navigate the web, collecting essential details such as hotel names, ratings, reviews, and potentially even more intricate insights. This process is a gateway to unraveling the essence of luxury and hospitality offered by these remarkable hotels.

Armed with this data, the project takes a turn into the realm of data analysis. The focus shifts from scraping to interpretation, as patterns, trends, and correlations within the collected information are meticulously unearthed. This analysis aims not only to satiate curiosity but also to contribute to the knowledge base, providing potential travelers with a glimpse into what these hotels have to offer.

Through this solo expedition, the project creator dives into the heart of luxury, curiosity as their compass, and data as their guide. The journey promises to uncover a wealth of information, offering a unique perspective on Tunisia's most luxurious hotels, as of the project's commencement on August 15, 2023.

# Installing Necessary Libraries

Let us start by making sure all the necessary Python libraries are installed. In a notebook, the following one-line command installs them (openpyxl is included because pandas needs it to write the .xlsx files saved later in this article):

```python
!pip install requests bs4 numpy pandas openpyxl
```

After installation, you can now import the following libraries:

In [1]:
import requests
import bs4
import numpy as np
import pandas as pd
import re

# Defining Webscraping Functions

To begin web scraping, we first set the base URL of the website we want to extract data from, which here is the tripadvisor.com.ph domain of TripAdvisor. We then set the path of the specific hotel page whose reviews we want to collect.
For this example, the hotel is 'La_Cigale_Tabarka', and the first page of its reviews serves as our starting point.

In [9]:
# La_Cigale_Tabarka
main = 'https://www.tripadvisor.com.ph'
url_first = '/Hotel_Review-g297953-d7309360-Reviews-La_Cigale_Tabarka_Hotel_Thalasso_Spa_Golf-Tabarka_Jendouba_Governorate.html'
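
The base URL and the hotel path are later combined by plain string concatenation. As a small aside, the standard library's `urljoin` gives the same result here and also copes with links that are already absolute. This is only a sketch; the name `full_url` is an illustrative choice, not part of the original notebook.

```python
from urllib.parse import urljoin

main = 'https://www.tripadvisor.com.ph'
url_first = ('/Hotel_Review-g297953-d7309360-Reviews-'
             'La_Cigale_Tabarka_Hotel_Thalasso_Spa_Golf-'
             'Tabarka_Jendouba_Governorate.html')

# Join the base URL and the site-relative path into one absolute URL.
full_url = urljoin(main, url_first)
print(full_url)

# An href that is already absolute is returned unchanged.
print(urljoin(main, 'https://www.tripadvisor.com.ph/Hotels'))
```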

We then define a function, 'get_soup', which downloads a page and returns its HTML parsed as a BeautifulSoup object.

In [10]:
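def get_soup(url):
    """
    Return a BeautifulSoup from the input `url`.
    """
    headers = {'User-Agent': 'Mozilla/5.0'}
    # Note: an 'http' entry only proxies http:// requests; the https://
    # URLs used in this article are not routed through it.
    proxies = {'http': 'http://206.189.157.23'}
    response = requests.get(url, headers=headers, proxies=proxies)
    # Name the parser explicitly so the result does not depend on which
    # optional parsers happen to be installed.
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    return soup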

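Next, 'get_reviews' pulls the review text, the rating, and the date of stay out of a parsed page and returns them as a DataFrame.

In [11]: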
def get_reviews(soup):
    """
    Extracts reviews, ratings, and dates from the provided BeautifulSoup
    object.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object representing
        the parsed HTML content.

    Returns:
        pandas.DataFrame: A DataFrame containing the extracted reviews,
        ratings, and dates.

    """
    review_list = []
    for review in soup.find_all('span', class_='QewHA H4 _a'):
        review_list.append(review.getText())

    rating_list = []
    for rating in soup.find_all('div', class_="Hlmiy F1"):
        rating_list.append(rating.find('span').get('class')[1])
        
    date_list = []
    for date in soup.find_all('span', class_='teHYY _R Me S4 H3'):
        date_list.append(date.getText())
    
    # Keep the three lists aligned: if the page yielded unequal counts,
    # trim all of them to the shortest length.
    min_len = min(len(review_list), len(rating_list), len(date_list))
    review_list = review_list[:min_len]
    rating_list = rating_list[:min_len]
    date_list = date_list[:min_len]
    df = pd.DataFrame({'reviews': review_list,
                       'ratings': rating_list,
                       'date': date_list})
    return df
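
To see what the selectors in 'get_reviews' pick up, the sketch below runs the same `find_all` calls on a tiny hand-written HTML snippet. The snippet is an assumption made for illustration: the class names come from the function above, but the surrounding markup and the `ui_bubble_rating bubble_50` classes on the rating span are hypothetical stand-ins for the real page.

```python
import bs4

# Hypothetical HTML that mimics the structure the selectors expect.
SAMPLE_HTML = """
<div>
  <div class="Hlmiy F1"><span class="ui_bubble_rating bubble_50"></span></div>
  <span class="QewHA H4 _a">Lovely stay, great spa.</span>
  <span class="teHYY _R Me S4 H3">Date of stay: June 2023</span>
</div>
"""

sample_soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A multi-word class_ string matches the exact value of the class attribute.
review_text = [tag.get_text()
               for tag in sample_soup.find_all('span', class_='QewHA H4 _a')]

# The rating is read from the second class name of the inner span.
rating_class = [div.find('span').get('class')[1]
                for div in sample_soup.find_all('div', class_='Hlmiy F1')]

stay_date = [tag.get_text()
             for tag in sample_soup.find_all('span', class_='teHYY _R Me S4 H3')]

print(review_text, rating_class, stay_date)
```

Because these class names are generated by the site, they are likely to change over time, so the selectors may need updating before the function is reused.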

Lastly, we define 'scrape_reviews', which walks through the review pages of a hotel by following the 'next' link until enough reviews are collected or no further page exists. The function scrape_reviews(main, url_first, total_reviews) takes three parameters: 'main', the main URL of the website; 'url_first', the URL of the first page of the hotel; and 'total_reviews', the maximum number of reviews to scrape.

In [12]:
def scrape_reviews(main, url_first, total_reviews):
    """
    Scrape reviews from multiple pages of a hotel on TripAdvisor.

    Args:
        main (str): The main URL of the website.
        url_first (str): The URL of the first page of the hotel.
        total_reviews (int): The maximum number of reviews to scrape.

    Returns:
        pandas.DataFrame: A DataFrame containing the scraped reviews,
        ratings, and dates.
    """
    df_all = pd.DataFrame()
    url = main + url_first
    soup = get_soup(url)
    df_all = get_reviews(soup)

    while len(df_all) < total_reviews:
        try:
            next_url = main + (soup.find('a',
                                         class_='ui_button nav next primary')
                               .get('href'))
            next_soup = get_soup(next_url)
            df = get_reviews(next_soup)
            df_all = pd.concat([df_all, df])
            soup = next_soup
        except AttributeError:
            break
    df_all = df_all.reset_index(drop=True)
    df_all['ratings'] = df_all['ratings'].apply(lambda x:
                                                int(re.findall(r'\d', x)[0]))
    df_all['date'] = df_all['date'].apply(lambda x:
                            re.findall(r'Date of stay: (\w+ \d+)', x)[0])
    return df_all
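
The last two steps of 'scrape_reviews' clean the raw strings with regular expressions. The sketch below isolates that logic in two small helpers so it can be tried on sample values. The helper names and the sample strings ('bubble_50', 'Date of stay: June 2023') are assumptions for illustration; unlike the original lambdas, these helpers return None instead of raising when nothing matches.

```python
import re


def parse_rating(css_class):
    """Return the first digit found in a class name such as 'bubble_50'."""
    match = re.search(r'\d', css_class)
    return int(match.group()) if match else None


def parse_stay_date(text):
    """Return the 'Month Year' part of a 'Date of stay: ...' string."""
    match = re.search(r'Date of stay: (\w+ \d+)', text)
    return match.group(1) if match else None


print(parse_rating('bubble_50'))
print(parse_stay_date('Date of stay: June 2023'))
print(parse_stay_date('no date here'))
```

Since only the first digit is kept, a half-step class such as 'bubble_45' would be read as 4.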
    

# Samples of Web Scraping Results

## Hotel Name: La_Cigale_Tabarka

In [14]:
# main and url_first are defined above
df_La_Cigale_Tabarka = scrape_reviews(main, url_first, 1500)

We can check how many reviews we were able to scrape with 'len'.

In [15]:
len(df_La_Cigale_Tabarka)

65
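
The call asked for up to 1500 reviews but returned 65, because the loop in 'scrape_reviews' stops as soon as a page has no 'next' link. The toy sketch below reproduces that stopping logic without any network access. Everything in it is hypothetical: `FAKE_PAGES` stands in for the website and `collect_reviews` for the scraper, and, unlike the original, it trims the result to the requested maximum.

```python
# Hypothetical in-memory "site": each page has reviews and an optional next link.
FAKE_PAGES = {
    '/page1': {'reviews': ['r1', 'r2'], 'next': '/page2'},
    '/page2': {'reviews': ['r3', 'r4'], 'next': '/page3'},
    '/page3': {'reviews': ['r5'], 'next': None},
}


def collect_reviews(first_page, max_reviews):
    """Follow 'next' links until the cap is reached or the pages run out."""
    collected = []
    page = first_page
    while page is not None and len(collected) < max_reviews:
        data = FAKE_PAGES[page]
        collected.extend(data['reviews'])
        page = data['next']
    # A whole page is added at a time, so trim any overshoot.
    return collected[:max_reviews]


all_reviews = collect_reviews('/page1', 1500)  # cap never reached
capped = collect_reviews('/page1', 3)          # cap reached on page 2
print(len(all_reviews), capped)
```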

We can then save the scraped reviews as an Excel file. Simply change 'file_directory' to the path where you want the file to be saved.

In [20]:
# Specify the Excel filename
file_directory = 'La_Cigale_Tabarka.xlsx'

# Write the DataFrame to an Excel file
df_La_Cigale_Tabarka.to_excel(file_directory, index=False)
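
If a plain-text CSV file is preferred over Excel, pandas offers 'to_csv' as the counterpart of 'to_excel'. The sketch below writes a small, made-up DataFrame to an in-memory buffer and reads it back; the sample rows are hypothetical and the buffer stands in for a real file path such as 'La_Cigale_Tabarka.csv'.

```python
import io

import pandas as pd

# Hypothetical rows with the same columns as the scraped DataFrame.
sample = pd.DataFrame({
    'reviews': ['Great stay', 'Too noisy'],
    'ratings': [5, 2],
    'date': ['June 2023', 'May 2023'],
})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read the CSV text back to confirm the round trip.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```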

Here's a display of the first few rows of our dataset.

In [21]:
df_La_Cigale_Tabarka.head()

| | reviews | ratings | date |
|---|---|---|---|
| 0 | The hotel is terrific. However, the spa is ex... | 5 | June 2023 |
| 1 | Veldig dissappointed of being exempt from my r... | 1 | June 2023 |
| 2 | We come to Tunisia from Texas every year in Ma... | 1 | May 2023 |
| 3 | I have stayed in many luxurious hotels in diff... | 5 | January 2023 |
| 4 | La cigale Tabarka is the most amazing and wond... | 5 | December 2022 |


As a sanity check, we can count the number of unique values in each column of our dataset.

In [22]:
df_La_Cigale_Tabarka.nunique()

reviews    65
ratings     5
date       39
dtype: int64

Here, we display the number of reviews per rating.

In [23]:
df_La_Cigale_Tabarka.groupby('ratings').count()

| ratings | reviews | date |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 5 | 5 |
| 3 | 6 | 6 |
| 4 | 5 | 5 |
| 5 | 39 | 39 |


Looks great!
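
The counts per rating can be turned into shares and an average rating with a few lines of arithmetic. The sketch below uses the La Cigale Tabarka counts shown above; the variable names are illustrative only.

```python
# Review counts per star rating, copied from the table above.
rating_counts = {1: 10, 2: 5, 3: 6, 4: 5, 5: 39}

total = sum(rating_counts.values())

# Fraction of all reviews that gave each rating.
shares = {stars: count / total for stars, count in rating_counts.items()}

# Weighted average of the star ratings.
mean_rating = sum(stars * count for stars, count in rating_counts.items()) / total

print(f'total reviews: {total}')
for stars, share in shares.items():
    print(f'{stars} stars: {share:.1%}')
print(f'mean rating: {mean_rating:.2f}')
```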

## Hotel Name: La Badira

In [24]:
# La Badira
url_La_Badira = '/Hotel_Review-g297943-d6677898-Reviews-La_Badira-Hammamet_Nabeul_Governorate.html'
df_La_Badira = scrape_reviews(main, url_La_Badira, 1500)

In [25]:
# Specify the Excel filename
file_directory = 'Badira.xlsx'

# Write the DataFrame to an Excel file
df_La_Badira.to_excel(file_directory, index=False)

In [26]:
df_La_Badira.head()

| | reviews | ratings | date |
|---|---|---|---|
| 0 | Great stay at « la badira », the overall exper... | 5 | July 2023 |
| 1 | Best hotel in Tunisia for couples: best breakf... | 5 | August 2023 |
| 2 | A lovely hotel a little out of town but close ... | 5 | July 2023 |
| 3 | I spent almost a week in la Badira hôtel and I... | 5 | July 2023 |
| 4 | Have stayed at some pretty posh hotels like th... | 5 | June 2023 |


In [27]:
df_La_Badira.nunique()

reviews    434
ratings      5
date        96
dtype: int64

In [28]:
df_La_Badira.groupby('ratings').count()

| ratings | reviews | date |
|---|---|---|
| 1 | 8 | 8 |
| 2 | 4 | 4 |
| 3 | 11 | 11 |
| 4 | 46 | 46 |
| 5 | 365 | 365 |
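
The scrape-and-save steps are repeated cell by cell for every hotel. One way to avoid that repetition is to keep the hotel paths in a dictionary and loop over it. The sketch below shows the idea under stated assumptions: `scrape_all` and `fake_scrape` are hypothetical names, and `fake_scrape` is a stand-in so the example runs offline. In the notebook, 'scrape_reviews' would be passed instead.

```python
main = 'https://www.tripadvisor.com.ph'

# Hotel name -> site-relative review URL (two of the hotels from this article).
HOTEL_URLS = {
    'La_Cigale_Tabarka': (
        '/Hotel_Review-g297953-d7309360-Reviews-'
        'La_Cigale_Tabarka_Hotel_Thalasso_Spa_Golf-'
        'Tabarka_Jendouba_Governorate.html'),
    'La_Badira': (
        '/Hotel_Review-g297943-d6677898-Reviews-'
        'La_Badira-Hammamet_Nabeul_Governorate.html'),
}


def scrape_all(hotel_urls, scrape, main, total_reviews=1500):
    """Call `scrape(main, path, total_reviews)` once per hotel."""
    results = {}
    for name, path in hotel_urls.items():
        results[name] = scrape(main, path, total_reviews)
    return results


def fake_scrape(main, path, total_reviews):
    """Offline stand-in that just reports which URL would be fetched."""
    return main + path


results = scrape_all(HOTEL_URLS, fake_scrape, main)
for name, value in results.items():
    print(name, '->', value)
```

With the real scraper, each value would be a DataFrame that could then be written out with `to_excel(f'{name}.xlsx', index=False)`. Driving every call from one dictionary also keeps each hotel paired with its own URL.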


## Hotel Name: Four_Seasons_Hotel

In [34]:
# Four_Seasons_Hotel_Tunis Hotel
url_Four_Seasons_Hotel_Tunis = '/Hotel_Review-g297942-d13144748-Reviews-Four_Seasons_Hotel_Tunis-Gammarth_Tunis_Governorate.html'
df_Four_Seasons_Hotel_Tunis = scrape_reviews(main, url_Four_Seasons_Hotel_Tunis, 1500)

In [35]:
df_Four_Seasons_Hotel_Tunis.head()

| | reviews | ratings | date |
|---|---|---|---|
| 0 | This is the 8th Four Seasons we have stayed at... | 3 | August 2023 |
| 1 | The best hotel in Tunisia without a doubt.The ... | 5 | July 2023 |
| 2 | Calm, clean with attentive and well trained st... | 5 | June 2023 |
| 3 | This is a very special place and what makes it... | 5 | June 2023 |
| 4 | We enjoyed an absolutely wonderful 4 night sta... | 5 | June 2023 |


In [36]:
df_Four_Seasons_Hotel_Tunis.groupby('ratings').count()

| ratings | reviews | date |
|---|---|---|
| 1 | 8 | 8 |
| 2 | 11 | 11 |
| 3 | 16 | 16 |
| 4 | 31 | 31 |
| 5 | 158 | 158 |


In [37]:
# Specify the Excel filename
file_directory = 'Four_Seasons_Hotel_Tunis.xlsx'

# Write the DataFrame to an Excel file
df_Four_Seasons_Hotel_Tunis.to_excel(file_directory, index=False)

## Hotel Name: Hasdrubal_Thalassa_Spa_Yasmine_Hammamet

In [38]:
# Hasdrubal_Thalassa_Spa_Yasmine_Hammamet hotel
url_Hasdrubal_Thalassa_Spa_Yasmine_Hammamet = '/Hotel_Review-g297943-d302773-Reviews-Hasdrubal_Thalassa_Spa_Yasmine_Hammamet-Hammamet_Nabeul_Governorate.html'
df_Hasdrubal_Thalassa_Spa_Yasmine_Hammamet = scrape_reviews(main, url_Hasdrubal_Thalassa_Spa_Yasmine_Hammamet, 1500)

In [39]:
df_Hasdrubal_Thalassa_Spa_Yasmine_Hammamet.head()

In [40]:
df_Hasdrubal_Thalassa_Spa_Yasmine_Hammamet.groupby('ratings').count()

Note: the output recorded for these two cells was produced with url_Four_Seasons_Hotel_Tunis and duplicated the Four Seasons tables shown above, so it is omitted here. Re-run the cells to obtain the Hasdrubal results.


In [48]:
# Specify the Excel filename
file_directory = 'Hasdrubal_Thalassa_Spa_Yasmine_Hammamet.xlsx'

# Write the DataFrame to an Excel file
df_Hasdrubal_Thalassa_Spa_Yasmine_Hammamet.to_excel(file_directory, index=False)

## Hotel Name: The_Residence_Tunis-Gammarth

In [44]:
# The_Residence_Tunis-Gammarth hotel
url_The_Residence_Tunis_Gammarth = '/Hotel_Review-g297942-d302774-Reviews-The_Residence_Tunis-Gammarth_Tunis_Governorate.html'
df_The_Residence_Tunis_Gammarth = scrape_reviews(main, url_The_Residence_Tunis_Gammarth, 1500)

In [45]:
df_The_Residence_Tunis_Gammarth.head()

| | reviews | ratings | date |
|---|---|---|---|
| 0 | I wanted to post a comment to thank the team a... | 5 | July 2023 |
| 1 | We stay at the residence every summer since 5 ... | 5 | August 2023 |
| 2 | I highly recommend staying at The Residence if... | 5 | July 2023 |
| 3 | Very friendly and helpful personnel, great swi... | 5 | July 2023 |
| 4 | This is a beautiful hotel with great views fro... | 4 | July 2023 |


In [46]:
df_The_Residence_Tunis_Gammarth.groupby('ratings').count()

| ratings | reviews | date |
|---|---|---|
| 1 | 14 | 14 |
| 2 | 14 | 14 |
| 3 | 31 | 31 |
| 4 | 109 | 109 |
| 5 | 326 | 326 |


In [49]:
# Specify the Excel filename
file_directory = 'The_Residence_Tunis_Gammarth.xlsx'

# Write the DataFrame to an Excel file
df_The_Residence_Tunis_Gammarth.to_excel(file_directory, index=False)

## Hotel Name: Radisson_Blu_Palace_Resort_Thalasso_Djerba

In [50]:
# Radisson_Blu_Palace_Resort_Thalasso_Djerba hotel
url_Radisson_Blu_Palace_Resort_Thalasso_Djerba = '/Hotel_Review-g1951333-d614737-Reviews-Radisson_Blu_Palace_Resort_Thalasso_Djerba-Mezraia_Djerba_Island_Medenine_Governorate.html'
df_Radisson_Blu_Palace_Resort_Thalasso_Djerba = scrape_reviews(main, url_Radisson_Blu_Palace_Resort_Thalasso_Djerba, 1500)

In [51]:
df_Radisson_Blu_Palace_Resort_Thalasso_Djerba.head()

| | reviews | ratings | date |
|---|---|---|---|
| 0 | Djerba Music Land festival poorly organised. ... | 1 | August 2023 |
| 1 | А wonderful holiday thanks to Sophien who who ... | 5 | June 2023 |
| 2 | On the beach Mezraria you may have a wonderful... | 5 | July 2023 |
| 3 | We have been to the radisson blu djerba plenty... | 5 | July 2023 |
| 4 | Women with muslim swimwear are being discrimin... | 1 | July 2023 |


In [52]:
df_Radisson_Blu_Palace_Resort_Thalasso_Djerba.groupby('ratings').count()

| ratings | reviews | date |
|---|---|---|
| 1 | 16 | 16 |
| 2 | 18 | 18 |
| 3 | 34 | 34 |
| 4 | 74 | 74 |
| 5 | 118 | 118 |


In [53]:
# Specify the Excel filename
file_directory = 'Radisson_Blu_Palace_Resort_Thalasso_Djerba.xlsx'

# Write the DataFrame to an Excel file
df_Radisson_Blu_Palace_Resort_Thalasso_Djerba.to_excel(file_directory, index=False)
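
With the per-rating counts collected, a first comparison across hotels is the review total and the average rating of each. The sketch below uses the counts displayed in this article. Hasdrubal is left out because the table recorded for it is identical to the Four Seasons one. The names `RATING_COUNTS`, `mean_rating` and `ranking` are illustrative.

```python
# Review counts per star rating, copied from the tables in this article.
RATING_COUNTS = {
    'La_Cigale_Tabarka': {1: 10, 2: 5, 3: 6, 4: 5, 5: 39},
    'La_Badira': {1: 8, 2: 4, 3: 11, 4: 46, 5: 365},
    'Four_Seasons_Hotel_Tunis': {1: 8, 2: 11, 3: 16, 4: 31, 5: 158},
    'The_Residence_Tunis_Gammarth': {1: 14, 2: 14, 3: 31, 4: 109, 5: 326},
    'Radisson_Blu_Palace_Resort_Thalasso_Djerba':
        {1: 16, 2: 18, 3: 34, 4: 74, 5: 118},
}


def mean_rating(counts):
    """Weighted average star rating for one hotel."""
    total = sum(counts.values())
    return sum(stars * n for stars, n in counts.items()) / total


totals = {name: sum(counts.values()) for name, counts in RATING_COUNTS.items()}

# Hotels sorted from highest to lowest average rating.
ranking = sorted(
    ((name, mean_rating(counts)) for name, counts in RATING_COUNTS.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, mean in ranking:
    print(f'{name}: {totals[name]} reviews, mean rating {mean:.2f}')
```

Averages based on very different sample sizes (65 reviews for La Cigale Tabarka against several hundred for others) should be compared with some caution.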