# Data Preparation Workbook

## Import libraries and modules

In [1]:
import pandas as pd
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import numpy as np
import json

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

<spacytextblob.spacytextblob.SpacyTextBlob at 0x1111a5af0>

## 1 - Data Extraction, Cleaning and Joining

### 1.1 'Travel+Leisure World's Best Hotels 2022' dataset

In [2]:
hotels_2022_df = pd.read_csv('./data/100_hotels_2022.csv', encoding='latin1')
hotels_2022_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Hotel      101 non-null    object 
 1   Location   101 non-null    object 
 2   Country    101 non-null    object 
 3   Region     101 non-null    object 
 4   Company    101 non-null    object 
 5   Score      101 non-null    float64
 6   Rank       101 non-null    int64  
 7   Rooms      101 non-null    int64  
 8   Theme      101 non-null    object 
 9   Year       101 non-null    int64  
 10  2021       101 non-null    int64  
 11  Past_rank  101 non-null    int64  
dtypes: float64(1), int64(5), object(6)
memory usage: 9.6+ KB


In [3]:
hotels_2022_df.head()

Unnamed: 0,Hotel,Location,Country,Region,Company,Score,Rank,Rooms,Theme,Year,2021,Past_rank
0,Rosewood Castiglion del Bosco,Montalcino,Italy,Europe,Massimo and Chiara Ferragamo,99.25,1,53,Countryside,2000,0,0
1,Grace Hotel,Santorini,Greece,Europe,Auberge Resorts Collection,99.22,2,20,Coastal,2000,1,6
2,Waldorf Astoria Maldives Ithaafushi,Ithaafushi Island,Maldives,Southeast Asia,Hilton,99.11,3,119,Island,2019,1,80
3,Pickering House Inn,Wolfeboro,United States,North America,Peter and Patty Cooke,98.95,4,10,Boutique,1813,1,34
4,One&Only Reethi Rah,North Malé Atoll,Maldives,Southeast Asia,Kerzner International,98.93,5,130,Island,2005,0,0


`hotels_2022_df` is a clean dataset with no missing values.

### 1.2 The scraped hotel info dataset

In [4]:
ta_hotels_df = pd.read_pickle('./data/ta_hotels.pickle.zip')
ta_hotels_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   hotel    101 non-null    object
 1   url      101 non-null    object
 2   info     101 non-null    object
 3   price    101 non-null    object
 4   reviews  101 non-null    object
 5   page     101 non-null    object
dtypes: object(6)
memory usage: 4.9+ KB


In [5]:
ta_hotels_df.head()

Unnamed: 0,hotel,url,info,price,reviews,page
0,Rosewood Castiglion del Bosco,https://www.tripadvisor.com//Hotel_Review-g635...,"{'id': 1147343, 'name': 'Rosewood Castiglion D...","[{'perNight': 1446, 'vendorName': 'Hotels.com'...","[{'id': 862724074, 'url': '/ShowUserReviews-g6...",{'redux': {'api': {'requests': {'_data_1_0_new...
1,Grace Hotel,https://www.tripadvisor.com//Hotel_Review-g635...,"{'id': 1088707, 'name': 'Grace Hotel, Auberge ...","[{'perNight': 1294, 'vendorName': 'Hotels.com'...","[{'id': 862555965, 'url': '/ShowUserReviews-g6...",{'redux': {'api': {'requests': {'_data_1_0_new...
2,Waldorf Astoria Maldives Ithaafushi,https://www.tripadvisor.com//Hotel_Review-g239...,"{'id': 15618284, 'name': 'Waldorf Astoria Mald...","[{'perNight': 2970, 'vendorName': 'Hotels.com'...","[{'id': 862726636, 'url': '/ShowUserReviews-g2...",{'redux': {'api': {'requests': {'_data_1_0_new...
3,Pickering House Inn,https://www.tripadvisor.com//Hotel_Review-g462...,"{'id': 14785276, 'name': 'Pickering House Inn'...",[],"[{'id': 845175111, 'url': '/ShowUserReviews-g4...",{'redux': {'api': {'requests': {'_data_1_0_new...
4,One&Only Reethi Rah,https://www.tripadvisor.com//Hotel_Review-g685...,"{'name': 'One&Only Reethi Rah', 'id': 563828, ...","[{'vendorName': 'Expedia.com', 'perNight': 205...","[{'id': 862375316, 'date': '2022-09-28', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...


#### 1.2.1 Handling missing prices

We can already see that some values in the `price` column are missing.

Let's find out which hotels are affected.

In [6]:
ta_hotels_df.loc[~ta_hotels_df.price.apply(lambda row: any([price['perNight'] for price in row])), ['hotel', 'price']]

Unnamed: 0,hotel,price
3,Pickering House Inn,[]
13,Wilderness Safaris Bisate Lodge,[]
51,Cavas Wine Lodge,"[{'vendorName': 'eDreams', 'perNight': None}, ..."
54,Lodge on Little St. Simons Island,[]
59,Gibb's Farm,"[{'vendorName': 'eDreams', 'perNight': None}]"
63,San Ysidro Ranch,"[{'vendorName': 'Booking.com', 'perNight': None}]"
72,Twin Farms,[]
95,andBeyond Ngorongoro Crater Lodge,[]
98,Wentworth Mansion,"[{'vendorName': 'Expedia.com', 'perNight': Non..."
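
The filter above treats a hotel as unpriced when `any(...)` finds no truthy `perNight` value, which covers both an empty offer list and a list whose offers are all `None`. A minimal sketch on made-up offer lists (the sample data and the helper name `has_price` are illustrative and not part of the notebook):

```python
# Toy offer lists mirroring the shape of the `price` column (values are illustrative).
offers_by_hotel = {
    "no offers": [],
    "offers without a rate": [{"vendorName": "eDreams", "perNight": None}],
    "one usable rate": [
        {"vendorName": "eDreams", "perNight": None},
        {"vendorName": "Hotels.com", "perNight": 1446},
    ],
}


def has_price(offers):
    """Return True if at least one offer carries a truthy nightly rate."""
    return any(offer["perNight"] for offer in offers)


missing = [name for name, offers in offers_by_hotel.items() if not has_price(offers)]
print(missing)
```

Because the test is on truthiness, a rate of `0` would also count as missing; testing `is not None` instead would keep the two cases apart.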


Nine hotels have no price offers on TripAdvisor.com.

We will retrieve price offers for these from the hotels' official websites.

First, though, we check which traveler information TripAdvisor used, so that we submit the same details when retrieving prices and get consistent results.

In [7]:
ta_hotels_df.page.iloc[3]['redux']['travelerInfo']['hotels']

{'guests': '1_2',
 'stayDates': '2022_10_16_2022_10_17',
 'defaultDates': True,
 'travelerType': None}
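
The `stayDates` string appears to pack the check-in and check-out dates as `YYYY_MM_DD_YYYY_MM_DD`; that reading is an assumption based on the value shown above, not on any TripAdvisor documentation. Under that assumption, a small sketch (the helper `parse_stay_dates` is hypothetical) recovers the dates so that a manual look-up can use a matching stay:

```python
from datetime import date


def parse_stay_dates(stay_dates):
    """Split 'YYYY_MM_DD_YYYY_MM_DD' into (check_in, check_out) dates.

    The field layout is assumed from the sample value, not documented.
    """
    parts = [int(part) for part in stay_dates.split("_")]
    if len(parts) != 6:
        raise ValueError(f"unexpected stayDates format: {stay_dates!r}")
    return date(*parts[:3]), date(*parts[3:])


check_in, check_out = parse_stay_dates("2022_10_16_2022_10_17")
nights = (check_out - check_in).days
print(check_in.strftime("%A %d %b %Y"), "for", nights, "night(s)")
```

For the value above this describes a one-night stay starting on a Sunday, which is consistent with the manual look-ups falling back to the next available Sunday where 16 October could not be booked.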

We manually looked up the prices on the official websites:

* Pickering House Inn: The starting rate for the next available Sunday (30-Oct, 2022) is USD 615 per night, for up to two occupants.
    * secure.thinkreservations.com/pickeringhousewolfeboro/reservations/
* Wilderness Safaris Bisate Lodge: The recommended selling rate for the period 01-Jun to 31-Oct 2022 is USD 2,678 per person per night.
    * wilderness-safaris.com/compare-rates
* Cavas Wine Lodge: The starting rate for 16-Oct, 2022 is USD 900 per night, for 1 Adult.
    * cavaswinelodge.com
* Lodge on Little St. Simons Island: The starting price for weekends in prime season is USD 675 for one person per night (USD 775 less a USD 100 deduction for single occupancy).
    * www.littlestsimonsisland.com/lodging/georgia-beach-cabin-rentals
* Gibb's Farm: The starting rate for 16-Oct, 2022 is USD 761.50 per night, for 1 Adult.
    * www.gibbsfarm.com/stay/rates
* San Ysidro Ranch: The starting rate for 16-Oct, 2022 is USD 2495 per night, for 1 Adult.
    * www.sanysidroranch.com
* Twin Farms: The price for the next available Sunday (30-Oct, 2022) is USD 3,135 for one adult in one room.
    * www.twinfarms.com
* andBeyond Ngorongoro Crater Lodge: The starting price for the period 01 - 31 Oct 2022 is USD 1,300 per person per night.
    * www.andbeyond.com/rates/ngorongoro-crater-lodge-suite-rate/
* Wentworth Mansion: The starting price for the next available Sunday (13-Nov, 2022) is USD 585 for one adult in one room.
    * res.wentworthmansion.com

All prices are starting prices for standard inclusive packages, to be consistent with the TripAdvisor prices.

Now we can insert the above prices into the DataFrame with `vendorName` 'Official':

In [8]:
ta_hotels_df.at[3, 'price'] = [{'perNight': 615, 'vendorName': 'Official'}]
ta_hotels_df.at[13, 'price'] = [{'perNight': 2678, 'vendorName': 'Official'}]
ta_hotels_df.at[51, 'price'] = [{'perNight': 900, 'vendorName': 'Official'}]
ta_hotels_df.at[54, 'price'] = [{'perNight': 675, 'vendorName': 'Official'}]
ta_hotels_df.at[59, 'price'] = [{'perNight': 761.50, 'vendorName': 'Official'}]
ta_hotels_df.at[63, 'price'] = [{'perNight': 2495, 'vendorName': 'Official'}]
ta_hotels_df.at[72, 'price'] = [{'perNight': 3135, 'vendorName': 'Official'}]
ta_hotels_df.at[95, 'price'] = [{'perNight': 1300, 'vendorName': 'Official'}]
ta_hotels_df.at[98, 'price'] = [{'perNight': 585, 'vendorName': 'Official'}]

In [9]:
ta_hotels_df.iloc[[3, 13, 51, 54, 59, 63, 72, 95, 98]]

Unnamed: 0,hotel,url,info,price,reviews,page
3,Pickering House Inn,https://www.tripadvisor.com//Hotel_Review-g462...,"{'id': 14785276, 'name': 'Pickering House Inn'...","[{'perNight': 615, 'vendorName': 'Official'}]","[{'id': 845175111, 'url': '/ShowUserReviews-g4...",{'redux': {'api': {'requests': {'_data_1_0_new...
13,Wilderness Safaris Bisate Lodge,https://www.tripadvisor.com//Hotel_Review-g317...,"{'name': 'Wilderness Safaris Bisate Lodge', 'i...","[{'perNight': 2678, 'vendorName': 'Official'}]","[{'id': 856162624, 'date': '2022-08-24', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...
51,Cavas Wine Lodge,https://www.tripadvisor.com//Hotel_Review-g110...,"{'name': 'Cavas Wine Lodge', 'id': 600077, 'ty...","[{'perNight': 900, 'vendorName': 'Official'}]","[{'id': 863022437, 'date': '2022-10-03', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...
54,Lodge on Little St. Simons Island,https://www.tripadvisor.com//Hotel_Review-g352...,{'name': 'The Lodge on Little St. Simons Islan...,"[{'perNight': 675, 'vendorName': 'Official'}]","[{'id': 859821336, 'date': '2022-09-12', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...
59,Gibb's Farm,https://www.tripadvisor.com//Hotel_Review-g317...,"{'name': 'Gibb's Farm', 'id': 597670, 'type': ...","[{'perNight': 761.5, 'vendorName': 'Official'}]","[{'id': 858983701, 'date': '2022-09-07', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...
63,San Ysidro Ranch,https://www.tripadvisor.com//Hotel_Review-g330...,"{'name': 'San Ysidro Ranch', 'id': 81955, 'typ...","[{'perNight': 2495, 'vendorName': 'Official'}]","[{'id': 853114915, 'date': '2022-08-09', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...
72,Twin Farms,https://www.tripadvisor.com//Hotel_Review-g571...,"{'name': 'Twin Farms', 'id': 265474, 'type': '...","[{'perNight': 3135, 'vendorName': 'Official'}]","[{'id': 860894270, 'date': '2022-09-19', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...
95,andBeyond Ngorongoro Crater Lodge,https://www.tripadvisor.com//Hotel_Review-g317...,"{'name': 'andBeyond Ngorongoro Crater Lodge', ...","[{'perNight': 1300, 'vendorName': 'Official'}]","[{'id': 860614644, 'date': '2022-09-17', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...
98,Wentworth Mansion,https://www.tripadvisor.com//Hotel_Review-g541...,"{'name': 'Wentworth Mansion', 'id': 111475, 't...","[{'perNight': 585, 'vendorName': 'Official'}]","[{'id': 862087227, 'date': '2022-09-26', 'rati...",{'messages': {'common_Cookie_consent_14f6': 'C...


#### 1.2.2 Handling multiple prices (from different vendors)

For most hotels, TripAdvisor.com lists prices from several vendors.

We take the mean of these prices for each hotel and concatenate it with the 'Travel+Leisure World's Best Hotels 2022' dataset to form our main dataset, `hotels_data`:

In [10]:
# Define a `get_mean_price` function that takes one hotel's list of offers
# and returns the mean price, rounded to the nearest integer
def get_mean_price(row):
    prices = [price['perNight'] for price in row if price['perNight'] is not None]
    return round(np.mean(prices))


# Get the mean prices and concat with the 'Travel+Leisure World's Best Hotels 2022' dataset
hotels_data = pd.concat([
    hotels_2022_df,
    pd.DataFrame(data=ta_hotels_df.price.apply(get_mean_price)).rename(columns={'price': 'PricePerNight'})
], axis=1)
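
As a plain-Python illustration of the averaging step, the sketch below drops offers without a rate, averages the rest, and rounds to the nearest integer. The function name `mean_price` and the rates of the second and third vendors are made up for the example. Returning `None` for an empty list is a choice made here; the notebook avoids that case by filling in official prices first.

```python
from statistics import mean


def mean_price(offers):
    """Average the non-missing nightly rates, rounded to the nearest integer."""
    rates = [offer["perNight"] for offer in offers if offer["perNight"] is not None]
    if not rates:
        return None  # nothing to average; the caller decides how to handle this
    return round(mean(rates))


offers = [
    {"vendorName": "Hotels.com", "perNight": 1446},
    {"vendorName": "Vendor B", "perNight": 1596},  # illustrative value
    {"vendorName": "Vendor C", "perNight": None},  # skipped: no rate
]
print(mean_price(offers))
```

Python's built-in `round` rounds to the nearest integer and sends exact ties to the even neighbour, so it does not round up.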

#### 1.2.3 Extract hotel geographical information

The geographical information of each hotel is embedded under the `urqlCache` key in the scraped TripAdvisor.com page data.

We can extract it and merge it into the main dataset `hotels_data`:

In [11]:
# Define a `get_geo` function that is applied to each row and returns the hotel's latitude and longitude
def get_geo(row):
    for key,val in row['page']['urqlCache'].items():
        for k,v in val.items():
            if 'currentLocation' in v:
                location_info  = json.loads(v)['currentLocation'][0]
                if location_info['placeType'] == 'ACCOMMODATION':
                    #print(key, k)
                    row['Latitude'] = location_info['latitude']
                    row['Longitude'] = location_info['longitude']
    
    return row[['hotel', 'Latitude', 'Longitude']]


# Get the geo info and merge to the main dataset `hotels_data`
hotels_data = hotels_data.merge(
    ta_hotels_df.apply(get_geo, axis=1),
    how='left',
    left_on='Hotel',
    right_on='hotel',
    validate='one_to_one'
).drop(columns=['hotel'])

hotels_data.head()

Unnamed: 0,Hotel,Location,Country,Region,Company,Score,Rank,Rooms,Theme,Year,2021,Past_rank,PricePerNight,Latitude,Longitude
0,Rosewood Castiglion del Bosco,Montalcino,Italy,Europe,Massimo and Chiara Ferragamo,99.25,1,53,Countryside,2000,0,0,1521,43.083687,11.421272
1,Grace Hotel,Santorini,Greece,Europe,Auberge Resorts Collection,99.22,2,20,Coastal,2000,1,6,1291,36.43266,25.421421
2,Waldorf Astoria Maldives Ithaafushi,Ithaafushi Island,Maldives,Southeast Asia,Hilton,99.11,3,119,Island,2019,1,80,2970,4.013715,73.38303
3,Pickering House Inn,Wolfeboro,United States,North America,Peter and Patty Cooke,98.95,4,10,Boutique,1813,1,34,615,43.584614,-71.20888
4,One&Only Reethi Rah,North Malé Atoll,Maldives,Southeast Asia,Kerzner International,98.93,5,130,Island,2005,0,0,2073,4.520508,73.36693
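
The extraction pattern used by `get_geo` (scan the cache entries, parse only the JSON strings that mention `currentLocation`, keep the accommodation record) can be tried on a tiny stand-in. The cache layout below is an assumption inferred from the function itself, and the keys `"111"`/`"222"` and the helper `find_coordinates` are hypothetical:

```python
import json

# Hypothetical miniature of a page's `urqlCache`: each entry holds a JSON string
# under "data". The keys and the non-matching entry are made up for illustration.
toy_cache = {
    "111": {"data": json.dumps({"locations": [{"name": "unrelated entry"}]})},
    "222": {"data": json.dumps({
        "currentLocation": [{
            "placeType": "ACCOMMODATION",
            "latitude": 43.083687,
            "longitude": 11.421272,
        }]
    })},
}


def find_coordinates(cache):
    """Return (latitude, longitude) of the first accommodation entry, else None."""
    for entry in cache.values():
        for raw in entry.values():
            # Cheap substring test first, so only relevant strings are parsed.
            if isinstance(raw, str) and "currentLocation" in raw:
                location = json.loads(raw)["currentLocation"][0]
                if location.get("placeType") == "ACCOMMODATION":
                    return location["latitude"], location["longitude"]
    return None


print(find_coordinates(toy_cache))
```

Returning `None` when nothing matches is a design choice of this sketch; it makes a page without location data visible as a missing value rather than as an error.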


#### 1.2.4 Extract hotel style information

The hotel style labels are also embedded under the `urqlCache` key in the scraped TripAdvisor.com page data.

We can extract them and merge them into the main dataset `hotels_data`:

In [12]:
# Define a `get_style` function that is applied to each row and returns the hotel's style labels
def get_style(row):
    hotel_cache = json.loads(next(
        v["data"] for k, v in row['page']['urqlCache'].items()
        if 'locationDescription' in v["data"] and
        ('"styleRankings"' in v["data"])
    ))
    hotel_info = hotel_cache['locations'][0]
    row['Styles'] = [style['translatedName'] for style in hotel_info['detail']['styleRankings']]

    return row[['hotel', 'Styles']]


# Get the styles label and merge to the main dataset `hotels_data`
hotels_data = hotels_data.merge(
    ta_hotels_df.apply(get_style, axis=1),
    how='left',
    left_on='Hotel',
    right_on='hotel',
    validate='one_to_one'
).drop(columns=['hotel'])

hotels_data.head()

Unnamed: 0,Hotel,Location,Country,Region,Company,Score,Rank,Rooms,Theme,Year,2021,Past_rank,PricePerNight,Latitude,Longitude,Styles
0,Rosewood Castiglion del Bosco,Montalcino,Italy,Europe,Massimo and Chiara Ferragamo,99.25,1,53,Countryside,2000,0,0,1521,43.083687,11.421272,"[Luxury, Romantic, Classic, Quiet, Family]"
1,Grace Hotel,Santorini,Greece,Europe,Auberge Resorts Collection,99.22,2,20,Coastal,2000,1,6,1291,36.43266,25.421421,"[Luxury, Modern, Classic, Trendy, Great View]"
2,Waldorf Astoria Maldives Ithaafushi,Ithaafushi Island,Maldives,Southeast Asia,Hilton,99.11,3,119,Island,2019,1,80,2970,4.013715,73.38303,[Trendy]
3,Pickering House Inn,Wolfeboro,United States,North America,Peter and Patty Cooke,98.95,4,10,Boutique,1813,1,34,615,43.584614,-71.20888,"[Business, Charming]"
4,One&Only Reethi Rah,North Malé Atoll,Maldives,Southeast Asia,Kerzner International,98.93,5,130,Island,2005,0,0,2073,4.520508,73.36693,[Modern]
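
The merges above pass `validate='one_to_one'`, which makes pandas raise an error if a hotel name appears more than once on either side. A minimal sketch with made-up frames (all names and values are illustrative):

```python
import pandas as pd

# Toy frames standing in for `hotels_data` and an extracted feature table.
left = pd.DataFrame({"Hotel": ["Hotel A", "Hotel B"], "Rank": [1, 2]})
right_ok = pd.DataFrame({"hotel": ["Hotel A", "Hotel B"],
                         "Styles": [["Luxury"], ["Modern"]]})
right_dup = pd.DataFrame({"hotel": ["Hotel A", "Hotel A"],
                          "Styles": [["Luxury"], ["Quiet"]]})

# Unique keys on both sides: the merge goes through.
merged = left.merge(right_ok, how="left", left_on="Hotel", right_on="hotel",
                    validate="one_to_one").drop(columns=["hotel"])
print(merged)

# Duplicate keys on the right: pandas refuses to merge.
try:
    left.merge(right_dup, how="left", left_on="Hotel", right_on="hotel",
               validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError as err:
    duplicate_detected = True
    print("merge rejected:", err)
```

Note that `validate` only guards against duplicated keys. A hotel whose name is spelled differently in the two sources still passes and simply receives missing values from the left join, so a quick `isna()` check on the merged columns is a useful complement.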


#### 1.2.5 Extract hotel type, stars, customer rating and description

The hotel type, stars, customer rating and description are stored as dictionaries in the `info` column.

We can extract them and add them to the main dataset `hotels_data`:

In [13]:
# Get the hotel info and concat with the main dataset `hotels_data`
hotels_data = pd.concat([
    hotels_data,
    pd.DataFrame(ta_hotels_df['info'].tolist()).loc[:, ['type', 'stars', 'rating', 'description']]\
        .rename(columns={
            'type': 'Type',
            'stars': 'Stars',
            'rating': 'CustomerRating',
            'description': 'Description'
        })
], axis=1)

hotels_data.head()

Unnamed: 0,Hotel,Location,Country,Region,Company,Score,Rank,Rooms,Theme,Year,2021,Past_rank,PricePerNight,Latitude,Longitude,Styles,Type,Stars,CustomerRating,Description
0,Rosewood Castiglion del Bosco,Montalcino,Italy,Europe,Massimo and Chiara Ferragamo,99.25,1,53,Countryside,2000,0,0,1521,43.083687,11.421272,"[Luxury, Romantic, Classic, Quiet, Family]",T_HOTEL,5 Star,5.0,Situated within the UNESCO-listed Val d'Orcia ...
1,Grace Hotel,Santorini,Greece,Europe,Auberge Resorts Collection,99.22,2,20,Coastal,2000,1,6,1291,36.43266,25.421421,"[Luxury, Modern, Classic, Trendy, Great View]",T_HOTEL,4 Star,5.0,"Whitewashed abodes, cobalt-domed churches, and..."
2,Waldorf Astoria Maldives Ithaafushi,Ithaafushi Island,Maldives,Southeast Asia,Hilton,99.11,3,119,Island,2019,1,80,2970,4.013715,73.38303,[Trendy],T_HOTEL,,5.0,Escape to the unforgettable surrounded by asto...
3,Pickering House Inn,Wolfeboro,United States,North America,Peter and Patty Cooke,98.95,4,10,Boutique,1813,1,34,615,43.584614,-71.20888,"[Business, Charming]",T_BEDANDBREAKFAST,,5.0,Named TRAVEL + LEISURE Magazine's # 1 Resort H...
4,One&Only Reethi Rah,North Malé Atoll,Maldives,Southeast Asia,Kerzner International,98.93,5,130,Island,2005,0,0,2073,4.520508,73.36693,[Modern],T_RESORT,5 Star,5.0,"Surrounded by the crystal blue waters, this ul..."


The `Stars` column contains strings such as '5 Star'; we keep only the number and convert the column to float:

In [14]:
hotels_data['Stars'] = hotels_data.Stars.str.removesuffix(' Star').astype('float')
hotels_data.Stars.describe()

count    90.000000
mean      4.838889
std       0.416161
min       2.500000
25%       5.000000
50%       5.000000
75%       5.000000
max       5.000000
Name: Stars, dtype: float64
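
When stripping the ` Star` suffix, it is worth knowing that `rstrip` treats its argument as a set of characters, whereas `removesuffix` removes one literal suffix. For labels like `'5 Star'` both give the same result, but they can differ on other inputs. A short sketch (the third label is a made-up edge case):

```python
labels = ["5 Star", "2.5 Star", "Alta Star"]  # the last label is a made-up edge case

for label in labels:
    # rstrip removes any trailing run of the characters ' ', 'S', 't', 'a', 'r'.
    stripped = label.rstrip(" Star")
    # removesuffix removes the exact suffix once (available from Python 3.9).
    trimmed = label.removesuffix(" Star")
    print(f"{label!r}: rstrip -> {stripped!r}, removesuffix -> {trimmed!r}")
```

pandas exposes the same two operations as `Series.str.rstrip` and, from version 1.4, `Series.str.removesuffix`.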

#### 1.2.6 Handling missing descriptions

In [15]:
hotels_data.loc[hotels_data.Description.isna(), ['Hotel', 'Location', 'Description']]

Unnamed: 0,Hotel,Location,Description
20,Monasterio,Cusco,
70,"Morrison House Old Town Alexandria, Autograph ...",Alexandria,
75,W Santiago,Las Condes,


Three hotels have no description on TripAdvisor.

We fill these in manually with the introductions from the hotels' official websites:

In [16]:
hotels_data.at[20, 'Description'] = """
Stay at Monasterio, one of the most unique hotels in Cusco, and experience the thrill of staying in a protected national monument. Discover inspired restaurants and boutique rooms and suites, all clustered around a tranquil central courtyard.

The ideal hotel from which to discover the delights of the city, its doors open to a vibrant scene of old and modern architecture, markets, galleries and cafes serving exciting Andean cuisine.
""".strip()

hotels_data.at[70, 'Description'] = """
Escape to our boutique hotel in Old Town Alexandria.
Embark on a historic getaway at Morrison House Old Town Alexandria, Autograph Collection. Settled in the heart of the city, our hotel provides an unrivaled location moments from the cobblestone roads of King Street, as well as destinations such as Waterfront Park, the National Science Foundation and George Washington's Mount Vernon. After a day of work or play, unwind in our spacious accommodations, where stylish décor, plush bedding and Italian marble bathrooms offer ultimate relaxation. Peer out your room's expansive windows to admire breathtaking views of the sites surrounding our charming hotel. Our restaurant features upscale fine dining with a world view. A Mesoamerican influence captures the cuisine and culture that traveled the trade routes to and through Central and North America. Curated cocktails are celebrated through the use of fresh juices, local herbs and pure science. With three elegant event spaces, we can host professional and social gatherings that are sure to impress.
""".strip()

hotels_data.at[75, 'Description'] = """
STEAL THE SCENE AT OUR GLAM HOTEL.
Against a panoramic backdrop of the snow-capped Andes, W Santiago reinvents style and sophistication in El Golf. The city's most fashionable and trendsetting enclave is a fascinating universe of urban innovation amid cobblestone streets and glitzy skyscrapers. Mix it up with sizzling international cuisines in destination restaurants, signature cocktails, and dancing in Whiskey Blue and cozy conversations in W Lounge. Work out at O2Fit; Wellness Club, then swim in WET®, our rooftop pool with dazzling city views. Follow that dream in one of 196 luxuriously comfortable rooms and suites, featuring the signature W Bed, fully-wired technology, the delightful W MixBar, and state-of-the-art entertainment, and let our Whatever/Whenever® service take care of everything else.
""".strip()

In [17]:
hotels_data.iloc[[20, 70, 75]]

Unnamed: 0,Hotel,Location,Country,Region,Company,Score,Rank,Rooms,Theme,Year,2021,Past_rank,PricePerNight,Latitude,Longitude,Styles,Type,Stars,CustomerRating,Description
20,Monasterio,Cusco,Peru,Latin America,LVMH,97.87,21,122,Palace,1592,0,0,580,-13.515289,-71.97697,"[Luxury, Charming, Romantic, Green, Quaint]",T_HOTEL,5.0,5.0,"Stay at Monasterio, one of the most unique hot..."
70,"Morrison House Old Town Alexandria, Autograph ...",Alexandria,United States,North America,Marriott International,96.42,71,45,Contemporary,1864,0,0,304,38.8047,-77.0491,"[Charming, Romantic, Boutique, Quiet, Green]",T_HOTEL,4.0,4.5,Escape to our boutique hotel in Old Town Alexa...
75,W Santiago,Las Condes,Chile,Latin America,Marriott International,96.36,75,196,Contemporary,2009,0,0,289,-33.41411,-70.59854,"[City View, Trendy, Luxury, Business, Modern]",T_HOTEL,5.0,4.0,STEAL THE SCENE AT OUR GLAM HOTEL.\nAgainst a ...


#### 1.2.7 Customer review sentiment processing

Customer reviews capture guests' feedback after their stay as free text, complementing the numeric customer ratings.

The overall customer ratings are already provided by TripAdvisor. We process the review texts to obtain sentiment scores that represent the polarity of each customer's feedback (positive if above 0, negative if below 0).

In [18]:
# Explode the lists in `reviews` column to rows
reviews = ta_hotels_df.loc[:, ['hotel', 'reviews']].copy().explode('reviews').reset_index(drop=True)

# Convert the dictionaries from `reviews` column to columns containing each key
reviews = pd.concat(
    [
        reviews[['hotel']],
        pd.DataFrame(reviews.reviews.to_list())
    ],
    axis=1
    )[['hotel', 'text']]

reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61155 entries, 0 to 61154
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   hotel   61155 non-null  object
 1   text    61155 non-null  object
dtypes: object(2)
memory usage: 955.7+ KB
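
The reshaping above has two steps: `explode` turns each hotel's list of reviews into one row per review, and building a DataFrame from the resulting dictionaries turns their keys into columns. A toy version with made-up reviews:

```python
import pandas as pd

# Toy stand-in for `ta_hotels_df`: one list of review dicts per hotel (made-up data).
toy = pd.DataFrame({
    "hotel": ["Hotel A", "Hotel B"],
    "reviews": [
        [{"id": 1, "text": "Lovely stay"}, {"id": 2, "text": "Great view"}],
        [{"id": 3, "text": "Too noisy"}],
    ],
})

# One row per review; the hotel name is repeated for each of its reviews.
exploded = toy.explode("reviews").reset_index(drop=True)

# Expand each review dict into columns, then keep only the hotel and the text.
expanded = pd.concat(
    [exploded[["hotel"]], pd.DataFrame(exploded["reviews"].tolist())],
    axis=1,
)[["hotel", "text"]]
print(expanded)
```

The `reset_index(drop=True)` matters: `explode` repeats the original row label for every review, and the column-wise `concat` aligns on the index, so the two pieces only line up once the index is unique again.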


In [19]:
# Define a `get_sentiment_score` function that is applied to each row and adds the sentiment score of its review
def get_sentiment_score(row, text_col):
    doc = nlp(row[text_col])
    row['SentimentScore'] = doc._.blob.polarity
    return row


# Get the sentiment score
reviews = reviews.apply(lambda row: get_sentiment_score(row, 'text'), axis=1)
reviews.head()

Unnamed: 0,hotel,text,SentimentScore
0,Rosewood Castiglion del Bosco,10/10. \n\nWhere do I begin! We spent 2 nights...,0.324754
1,Rosewood Castiglion del Bosco,I’ve just spent a week with my family at The R...,0.233929
2,Rosewood Castiglion del Bosco,My best hotel experience ever. I’m happy to ha...,0.345833
3,Rosewood Castiglion del Bosco,"Amazing location, excellent and exceptionally ...",0.564063
4,Rosewood Castiglion del Bosco,Most Beautiful Hotel Ever !\nI can say that Ro...,0.673214


In [20]:
reviews.groupby(by=['hotel']).text.count().describe()

count     101.000000
mean      605.495050
std       360.336971
min        20.000000
25%       380.000000
50%       751.000000
75%       820.000000
max      2789.000000
Name: text, dtype: float64

Because the number of reviews varies considerably between hotels, we apply bootstrap sampling to the sentiment scores and construct a 95% confidence interval of the mean score for each hotel, in order to represent the uncertainty in each estimate.

In [21]:
# Define a `bootstrapping` function that is applied to each hotel's group of reviews
# and returns the mean, lower & upper bound of that hotel's bootstrapped sentiment score
def bootstrapping(group):
    bootstrap_means = [
        np.random.choice(group['SentimentScore'], len(group['SentimentScore']), replace=True).mean()
        for _ in range(B)
    ]
    mean = np.mean(bootstrap_means).round(4)
    lower = np.percentile(bootstrap_means, 2.5).round(4)
    upper = np.percentile(bootstrap_means, 97.5).round(4)

    return [mean, lower, upper]

# Get the bootstrapping statistics
B = 5000
np.random.seed(42)  # for reproducibility
bootstraps = reviews.groupby(by=['hotel']).apply(bootstrapping).rename('bootstrapping').reset_index()

bootstraps = pd.concat([
    bootstraps['hotel'],
    pd.DataFrame(bootstraps.bootstrapping.tolist(), columns=['SentiMean', 'SentiLower', 'SentiUpper'])
], axis=1)
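
The percentile bootstrap used here can also be written as a single vectorized draw. The sketch below is an alternative formulation, not the notebook's code: the function name `bootstrap_ci`, its parameters, and the synthetic scores are assumptions made for illustration, and it uses NumPy's `default_rng` generator instead of the global seed.

```python
import numpy as np


def bootstrap_ci(scores, n_resamples=5000, seed=42):
    """Percentile-bootstrap mean and 95% interval for the mean of `scores`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Draw all resamples at once: one row per resample, each of the original size.
    resamples = rng.choice(scores, size=(n_resamples, scores.size), replace=True)
    means = resamples.mean(axis=1)
    lower, upper = np.percentile(means, [2.5, 97.5])
    return round(float(means.mean()), 4), round(float(lower), 4), round(float(upper), 4)


# Synthetic polarity scores, only to exercise the function.
demo_scores = np.random.default_rng(0).uniform(-1, 1, size=200)
mean, lower, upper = bootstrap_ci(demo_scores)
print(mean, lower, upper)
```

Drawing everything at once trades memory for speed, since the resample matrix holds `n_resamples × len(scores)` values; for groups with thousands of reviews the loop-based version is the more memory-friendly choice.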

In [22]:
# Merge the bootstrapping statistics to the main dataset `hotels_data`
hotels_data = hotels_data.merge(
    bootstraps,
    how='left',
    left_on='Hotel',
    right_on='hotel',
    validate='one_to_one'
).drop(columns=['hotel'])

hotels_data.head()

Unnamed: 0,Hotel,Location,Country,Region,Company,Score,Rank,Rooms,Theme,Year,...,Latitude,Longitude,Styles,Type,Stars,CustomerRating,Description,SentiMean,SentiLower,SentiUpper
0,Rosewood Castiglion del Bosco,Montalcino,Italy,Europe,Massimo and Chiara Ferragamo,99.25,1,53,Countryside,2000,...,43.083687,11.421272,"[Luxury, Romantic, Classic, Quiet, Family]",T_HOTEL,5.0,5.0,Situated within the UNESCO-listed Val d'Orcia ...,0.3564,0.34,0.3724
1,Grace Hotel,Santorini,Greece,Europe,Auberge Resorts Collection,99.22,2,20,Coastal,2000,...,36.43266,25.421421,"[Luxury, Modern, Classic, Trendy, Great View]",T_HOTEL,4.0,5.0,"Whitewashed abodes, cobalt-domed churches, and...",0.3462,0.334,0.3582
2,Waldorf Astoria Maldives Ithaafushi,Ithaafushi Island,Maldives,Southeast Asia,Hilton,99.11,3,119,Island,2019,...,4.013715,73.38303,[Trendy],T_HOTEL,,5.0,Escape to the unforgettable surrounded by asto...,0.3696,0.3566,0.3827
3,Pickering House Inn,Wolfeboro,United States,North America,Peter and Patty Cooke,98.95,4,10,Boutique,1813,...,43.584614,-71.20888,"[Business, Charming]",T_BEDANDBREAKFAST,,5.0,Named TRAVEL + LEISURE Magazine's # 1 Resort H...,0.3627,0.3272,0.3988
4,One&Only Reethi Rah,North Malé Atoll,Maldives,Southeast Asia,Kerzner International,98.93,5,130,Island,2005,...,4.520508,73.36693,[Modern],T_RESORT,5.0,5.0,"Surrounded by the crystal blue waters, this ul...",0.3303,0.3218,0.339


#### 1.2.8 Export the main dataset `hotels_data` to a pickle file for analysis

In [23]:
hotels_data.to_pickle('./data/hotels_data.pickle')

### 1.3 'Trip Advisor Hotel Reviews' dataset

In [24]:
ta_reviews = pd.read_csv('./data/tripadvisor_hotel_reviews.csv.zip')
ta_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  20491 non-null  object
 1   Rating  20491 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 320.3+ KB


In [25]:
ta_reviews.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


#### 1.3.1 Calculate `SentimentScore` of each review

In [26]:
# Get the sentiment score
ta_reviews = ta_reviews.apply(lambda row: get_sentiment_score(row, 'Review'), axis=1)
ta_reviews.head()

Unnamed: 0,Review,Rating,SentimentScore
0,nice hotel expensive parking got good deal sta...,4,0.208744
1,ok nothing special charge diamond member hilto...,2,0.214923
2,nice rooms not 4* experience hotel monaco seat...,3,0.29442
3,"unique, great stay, wonderful time hotel monac...",5,0.504825
4,"great stay great stay, went seahawk game aweso...",5,0.384615


In [27]:
ta_reviews.groupby(by=['Rating']).Review.count()

Rating
1    1421
2    1793
3    2184
4    6039
5    9054
Name: Review, dtype: int64

We can see that the number of reviews per rating group is also skewed. Let's apply bootstrap sampling to the sentiment scores and construct a 95% confidence interval of the mean score for each rating group, to represent the uncertainty.

In [28]:
# Define a `bootstrapping` function that is applied to each rating group
# and returns the mean, lower & upper bound of that group's bootstrapped sentiment score
def bootstrapping(group):
    bootstrap_means = [
        np.random.choice(group['SentimentScore'], len(group['SentimentScore']), replace=True).mean()
        for _ in range(B)
    ]
    mean = np.mean(bootstrap_means).round(4)
    lower = np.percentile(bootstrap_means, 2.5).round(4)
    upper = np.percentile(bootstrap_means, 97.5).round(4)

    return [mean, lower, upper]

# Get the bootstrapping statistics
B = 5000
np.random.seed(42)  # for reproducibility
ta_senti_bootstraps = ta_reviews.groupby(by=['Rating']).apply(bootstrapping).rename('bootstrapping').reset_index()

ta_senti_bootstraps = pd.concat([
    ta_senti_bootstraps['Rating'],
    pd.DataFrame(ta_senti_bootstraps.bootstrapping.tolist(), columns=['SentiMean', 'SentiLower', 'SentiUpper'])
], axis=1)

In [29]:
ta_senti_bootstraps

Unnamed: 0,Rating,SentiMean,SentiLower,SentiUpper
0,1,-0.0406,-0.0502,-0.0313
1,2,0.0929,0.0868,0.0992
2,3,0.1975,0.1921,0.2027
3,4,0.2962,0.2932,0.2994
4,5,0.3635,0.3606,0.3662
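
One quick reading of the table above is whether the intervals of neighbouring ratings overlap. The sketch below copies the bounds from the table and checks that each interval ends before the next one begins:

```python
# (rating, lower, upper) copied from the bootstrap table above.
intervals = [
    (1, -0.0502, -0.0313),
    (2, 0.0868, 0.0992),
    (3, 0.1921, 0.2027),
    (4, 0.2932, 0.2994),
    (5, 0.3606, 0.3662),
]

# Adjacent ratings are separated when one interval ends before the next begins.
separated = [
    prev_upper < next_lower
    for (_, _, prev_upper), (_, next_lower, _) in zip(intervals, intervals[1:])
]
print(all(separated))
```

With these bounds every adjacent pair is separated, so the mean sentiment score increases with the rating and the differences exceed the bootstrap uncertainty.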


In [30]:
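# Save as a pickle file (the path keeps a `.csv` extension, but the content is a pickle; read it back with `pd.read_pickle`)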
ta_senti_bootstraps.to_pickle('./data/ta_senti_bootstraps.csv')

## Record library dependencies

In [37]:
%load_ext watermark
%watermark -u -i -d -v -iv -w -p spacytextblob

Last updated: 2022-10-18T14:40:15.356378+08:00

Python implementation: CPython
Python version       : 3.9.13
IPython version      : 8.5.0

spacytextblob: 4.0.0

numpy : 1.23.3
sys   : 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:52) 
[Clang 13.0.1 ]
spacy : 3.2.4
pandas: 1.5.0
json  : 2.0.9

Watermark: 2.3.1

