# Analysis Workbook

## Import Libraries and Modules

In [1]:
import pandas as pd
import numpy as np
import random
import altair as alt
from vega_datasets import data

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

<spacytextblob.spacytextblob.SpacyTextBlob at 0x170694940>

In [2]:
# Load the main dataset
hotels_data = pd.read_pickle('./data/hotels_data.pickle')
hotels_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101 entries, 0 to 100
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Hotel           101 non-null    object 
 1   Location        101 non-null    object 
 2   Country         101 non-null    object 
 3   Region          101 non-null    object 
 4   Company         101 non-null    object 
 5   Score           101 non-null    float64
 6   Rank            101 non-null    int64  
 7   Rooms           101 non-null    int64  
 8   Theme           101 non-null    object 
 9   Year            101 non-null    int64  
 10  2021            101 non-null    int64  
 11  Past_rank       101 non-null    int64  
 12  PricePerNight   101 non-null    int64  
 13  Latitude        101 non-null    float64
 14  Longitude       101 non-null    float64
 15  Styles          101 non-null    object 
 16  Type            101 non-null    object 
 17  Stars           90 non-null     flo

## 1 - Exploratory Data Analysis

### 1.1 Geographical Exploration

A geographical visualization gives an intuitive overview of all the hotels.

#### 1.1.1 First we extract the words and word counts from the descriptions for use in a word cloud:

In [3]:
# A function to return the top n words from a text (here, the hotel description)
def top_n_words(row, n):
    doc = nlp(row)
    # Keep lemmas, filtering out stop words, punctuation, numbers and whitespace
    lemma = [
        token.lemma_ for token in doc
        if not token.is_stop
        and token.pos_ not in ['PUNCT', 'NUM', 'SPACE']
        and token.lemma_ not in ['#', '&']
    ]
    
    word_dict = {
        word: lemma.count(word) for word in set(lemma)
    }
    
    return sorted(word_dict.items(), key=lambda x: x[-1], reverse=True)[:n]


# Get the top n words from the hotel descriptions and output a new dataframe with word counts
top_words = hotels_data.Description.apply(lambda row: top_n_words(row, 20))
top_words = pd.concat([hotels_data.loc[:, 'Hotel'], top_words], axis=1).explode('Description').reset_index()
top_words = pd.concat([
    top_words.loc[:, 'Hotel'],
    pd.DataFrame(top_words.Description.tolist(), columns=['Word', 'Count'])
], axis=1)

top_words.head()

                           Hotel        Word  Count
0  Rosewood Castiglion del Bosco        home      2
1  Rosewood Castiglion del Bosco       Bosco      2
2  Rosewood Castiglion del Bosco          di      2
3  Rosewood Castiglion del Bosco      estate      2
4  Rosewood Castiglion del Bosco  Montalcino      2
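
The counting step inside `top_n_words` can be tried out without loading a spaCy model. The sketch below is a simplified stand-in, not the notebook's function: it assumes the lemmas have already been extracted into a plain list, and the helper name `top_n_from_lemmas` is hypothetical. It uses `collections.Counter`, which produces the same counts as the dictionary comprehension above in a single pass.

```python
from collections import Counter


def top_n_from_lemmas(lemmas, n):
    """Return the n most frequent items as (word, count) pairs."""
    # Counter tallies every lemma in one pass; most_common sorts by count, descending
    return Counter(lemmas).most_common(n)


# Hypothetical lemmas, standing in for the output of the spaCy filtering step
sample_lemmas = ['estate', 'home', 'estate', 'villa', 'home', 'estate']
top_two = top_n_from_lemmas(sample_lemmas, 2)
print(top_two)
```

With `Counter`, words that tie on count come back in first-seen order, whereas the set-based version above leaves the order of ties unspecified.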


In [4]:
# Get random x and y positions for the words
def shuffled_range(n):
    return random.sample(range(n), k=n)


n = len(top_words)
x = shuffled_range(n)
y = shuffled_range(n)

words_data = top_words.assign(x=x, y=y)
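
A quick check of what `shuffled_range` produces: every integer from 0 to n - 1 exactly once, in random order, so no two words share an x position (and likewise for y). The seed and the value of n below are arbitrary choices for illustration.

```python
import random


def shuffled_range(n):
    # Sampling all n items without replacement yields a random permutation
    return random.sample(range(n), k=n)


random.seed(0)  # arbitrary seed, only to make the illustration repeatable
positions = shuffled_range(6)

# Every integer in 0..5 appears exactly once
print(sorted(positions) == list(range(6)))
```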

#### 1.1.2 Then we plot the hotels together with the regional distribution and word cloud:

In [5]:
def geo_plot(data=data, hotel_df=hotels_data, word_df=words_data):
    # Import countries data and plot the map
    countries = alt.topo_feature(data.world_110m.url, 'countries')
    
    map = alt.Chart(countries).mark_geoshape(
        fill='lightgray',
        stroke='white'
    ).project(
        "equirectangular"
    ).encode(
        opacity=alt.value(0.4)
    )

    # Unique value lists 
    country = hotel_df.Country.unique()
    region = hotel_df.Region.unique()
    theme = hotel_df.Theme.unique()

    # Selectors for interactivity
    selection_theme = alt.selection_multi(fields=['Theme'], bind='legend')
    selection_country = alt.selection_single(fields=['Country'])
    selection_hotel = alt.selection_single(fields=['Hotel'])

    region_dropdown = alt.binding_select(
        options=[None] + list(region),
        labels = ['All'] + list(region),
        name='Region'
    )
    selection_region = alt.selection_single(fields=['Region'], bind=region_dropdown)

    slider = alt.binding_range(
        min=hotel_df.PricePerNight.min(),
        max=hotel_df.PricePerNight.max(),
        step=100,
        name='Price per night below'
    )
    selection_price = alt.selection_single(
        name='selection_price',
        fields=['PricePerNight'],
        bind=slider,
        init={'PricePerNight': hotel_df.PricePerNight.max()}
    )

    # Plot the hotel circles on the map
    hotels = alt.Chart(hotel_df).mark_circle().encode(
        longitude='Longitude:Q',
        latitude='Latitude:Q',
        color=alt.condition(
            selection_hotel | selection_theme,
            alt.Color('Theme:N', scale=alt.Scale(domain=theme, scheme='tableau20')),
            alt.value('lightgray')
        ),
        opacity=alt.condition(selection_theme, alt.value(1), alt.value(0)),
        size=alt.condition(
            alt.datum.PricePerNight <= selection_price.PricePerNight,
            alt.value(48),
            alt.value(0)
        ),
        tooltip=['Rank', 'Hotel', 'Location', 'Country', 'Theme', 'PricePerNight']
    ).add_selection(
        selection_theme,
        selection_country,
        selection_hotel,
        selection_region,
        selection_price
    ).transform_filter(
        selection_country | selection_region
    ).properties(
        title=alt.TitleParams(
            text='The 100 Best Hotels in the World',
            subtitle="Click on the Theme Legend to filter the hotels by `Theme`"
        )
    )

    map_chart = alt.layer(map, hotels).properties(
        width=1000,
        height=400
    )

    # Plot the dotplot for hotels distribution among the countries
    region_dots = alt.Chart(hotel_df).transform_window(
        rank_in_country='rank(Score)',
        groupby=['Country'],
        #sort=[alt.SortField('Score', order='ascending')]
    ).mark_circle().encode(
        x=alt.X('Country:N', axis=alt.Axis(labelLimit=80)),
        y=alt.Y('rank_in_country:Q', title='Number of Hotels per Country'),
        color=alt.condition(
            selection_hotel | selection_theme,
            alt.Color('Theme:N', scale=alt.Scale(domain=theme, scheme='tableau20'), legend=None),
            alt.value('lightgray')
        ),
        size=alt.Size('PricePerNight:Q', legend=alt.Legend(direction='vertical', orient='right')),
        tooltip=['Rank', 'Hotel', 'Location', 'Theme', 'PricePerNight']
    ).transform_filter(
        selection_region
    ).add_selection(
        selection_theme,
        selection_country,
        selection_hotel
    ).properties(
        width=600,
        height=300
    )

    # Plot the word cloud
    word_cloud = alt.Chart(word_df).mark_text(baseline='middle').encode(
        x=alt.X('x:O', axis=None),
        y=alt.Y('y:O', axis=None),
        text='Word:N',
        color=alt.Color('Count:Q', scale=alt.Scale(scheme='lighttealblue'), legend=None),
        size=alt.Size('Count:Q', legend=None)
    ).transform_filter(
        selection_hotel
    ).properties(
        width=360,
        height=360
    )

    # Horizontal concatenation to generate the bottom chart
    below_chart = alt.hconcat(region_dots, word_cloud).resolve_scale(
        size='independent'
    )

    # Vertical concatenation for the final chart
    chart = alt.vconcat(map_chart, below_chart).configure(
        axis=alt.AxisConfig(
            title=None, domain=False, ticks=False, labelPadding=10, labelColor='gray', labelFont='Helvetica Neue',
            #labelAngle=0
        ),
        #autosize = alt.AutoSizeParams(
        #    resize = True
        #),
        legend=alt.LegendConfig(
            direction='horizontal',
            orient='top'
        ),
        lineBreak='\n',
        title=alt.TitleConfig(
            fontSize=24,
            fontWeight=400,
            subtitleFontSize=12,
            subtitleLineHeight=8,
            anchor='start',
        ),
        view=alt.ViewConfig(
            strokeWidth=0
        )
    )

    return chart


geo_plot(data, hotels_data, words_data)

* `Theme` is encoded as the color of the circle.
* `PricePerNight` is encoded as the size of the circle in the regional distribution.
* The number of hotels per country is encoded as the position on the Y-axis in the regional distribution.
* Word counts are encoded as the size of the words, as well as the color scale (the higher the count, the darker the color).

### 1.2 Year of Operation Exploration

We also want to see how the hotels are distributed by the year they started operating.

In [6]:
def year_plot(df=hotels_data):
    # Selectors for interactivity
    theme = df.Theme.unique()
    selection_theme = alt.selection_multi(fields=['Theme'], bind='legend')

    chart = alt.Chart(df).mark_circle().encode(
        x=alt.X(
            'binned:Q',
            scale=alt.Scale(domain=[1580, 2020]),
            axis=alt.Axis(
                grid=False,
                labelFontSize=alt.condition("(datum.value >= 2000) & (datum.value <= 2022)", alt.value(13), alt.value(12)),
                labelFontWeight=alt.condition("(datum.value >= 2000) & (datum.value <= 2022)", alt.value(800), alt.value(400))
            )
        ),
        y='rank_in_bin:Q',
        color=alt.condition(
            selection_theme,
            alt.Color('Theme:N', scale=alt.Scale(domain=theme, scheme='tableau20')),
            alt.value('lightgray')
        ),
        size=alt.Size('PricePerNight:Q', legend=alt.Legend(direction='horizontal', orient='bottom')),
        tooltip=['Rank', 'Hotel', 'Country','Year', 'Theme', 'PricePerNight']
    ).transform_bin(
        as_='binned',
        field='Year',
        bin=alt.BinParams(step=10)
    ).transform_window(
        rank_in_bin='rank(Score)',
        groupby=['binned'],
        #sort=[alt.SortField('Score', order='ascending')]
    ).add_selection(
        selection_theme
    ).properties(
        width=1000,
        height=400,
        title=alt.TitleParams(
            text='Almost half of the hotels started operating after the year 2000',
            #subtitle="Click on the Theme Legend to filter the hotels by `Theme`"
        )
    ).configure(
        axis=alt.AxisConfig(
            title=None, domain=False, ticks=False, labelPadding=10, labelColor='gray', labelFont='Helvetica Neue',
            labelFontSize=12, format='d'
        ),
        legend=alt.LegendConfig(
            direction='horizontal',
            orient='top'
        ),
        lineBreak='\n',
        title=alt.TitleConfig(
            fontSize=36,
            fontWeight=400,
            subtitleFontSize=12,
            subtitlePadding=10,
            anchor='start',
        ),
        view=alt.ViewConfig(
            strokeWidth=0
        )
    )

    return chart


year_plot(hotels_data)
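
The stacked-dot layout comes from the two transforms: `transform_bin` assigns each hotel to a 10-year bin, and the window transform numbers the hotels within each bin. Because no `sort` is given, the window's `rank` should act as a running count within each bin rather than a ranking by `Score`. That reading is based on the Vega-Lite window transform documentation and is worth verifying against the rendered chart. The plain-Python sketch below mimics the behavior on made-up years, assuming bins aligned to multiples of 10; `stack_by_decade` is a hypothetical helper, not part of Altair.

```python
from collections import defaultdict


def stack_by_decade(years, step=10):
    """Return (bin_start, position) pairs, numbering items within each bin."""
    counters = defaultdict(int)
    stacked = []
    for year in years:
        bin_start = (year // step) * step  # e.g. 1987 -> 1980
        counters[bin_start] += 1           # running count within the bin
        stacked.append((bin_start, counters[bin_start]))
    return stacked


# Made-up opening years, for illustration only
sample_years = [1987, 2003, 1981, 2009, 2001]
stacked = stack_by_decade(sample_years)
print(stacked)
```

Each pair gives the x position (start of the decade) and the y position (height in the stack) of one dot.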

### 1.3 Correlation Exploration

We want to find out how the different variables relate to the ranking scores of the hotels.

In [7]:
# Get the pairwise correlation of the variables
df_corr = hotels_data[['Score', 'Rooms', 'Year', 'PricePerNight', 'Stars', 'CustomerRating', 'SentiMean']]
corr_matrix = df_corr.corr().reset_index().melt('index')

corr_matrix.columns = ['var1', 'var2', 'correlation']
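
To make the reshaping step concrete, here is a plain-Python sketch of what `.corr().reset_index().melt('index')` yields: one row per ordered pair of variables together with their Pearson correlation. It uses made-up numbers and hypothetical helper names (`pearson`, `long_corr`), the row order may differ from pandas, and it does not replicate pandas' handling of missing values, which matters for `Stars` because that column has only 90 non-null entries.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def long_corr(columns):
    """Return (var1, var2, correlation) rows for every ordered pair of columns."""
    return [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in product(columns, repeat=2)
    ]


# Made-up values, for illustration only
toy = {
    'Score': [90.0, 92.0, 94.0, 96.0],
    'Rooms': [40, 30, 20, 10],
    'Year': [1990, 2005, 2000, 2015],
}
rows = long_corr(toy)
for var1, var2, corr in rows:
    print(var1, var2, round(corr, 2))
```

In this long format, a filter such as `var1 < var2` (a string comparison) keeps each unordered pair once and drops the diagonal.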

In [8]:
# Plot the correlation heatmap
def corr_plot(data=corr_matrix):
    base = alt.Chart(data).transform_filter(
        alt.datum.var1 < alt.datum.var2
    ).encode(
        x=alt.X(
            'var1:N',
            axis=alt.Axis(
                labelAngle=-45,
                labelColor=alt.condition("datum.value == 'Score'", alt.value('black'), alt.value('grey')),
                labelFontSize=alt.condition("datum.value == 'Score'", alt.value(20), alt.value(18)),
                labelFontWeight=alt.condition("datum.value == 'Score'", alt.value(800), alt.value(400))
            )
        ),
        y=alt.Y(
            'var2:N',
            axis=alt.Axis(
                labelColor=alt.condition("datum.value == 'Score'", alt.value('black'), alt.value('grey')),
                labelFontSize=alt.condition("datum.value == 'Score'", alt.value(20), alt.value(18)),
                labelFontWeight=alt.condition("datum.value == 'Score'", alt.value(800), alt.value(400))
            )
            ),
    ).properties(
        width=alt.Step(100),
        height=alt.Step(100))

    rects = base.mark_rect().encode(
        color=alt.Color(
            'correlation',
            scale = alt.Scale(
                domain = [-0.5, 0, 0.5],
                range = ['#1276CE', '#D9D9D9', '#F06400']
            )
        )
    )

    text = base.mark_text(
        size=24
    ).encode(
        text=alt.Text('correlation', format=".2f"),
        color=alt.condition(
            "(datum.var1 == 'Score') | (datum.var2 == 'Score')",
            alt.value('white'),
            alt.value('black')
        )
    )

    chart = alt.layer(rects, text).configure(
        axis=alt.AxisConfig(
            title=None, domain=False, ticks=False, labelPadding=10, labelColor='gray', labelFont='Helvetica Neue',
            labelFontSize=18
        ),
        legend=alt.LegendConfig(
            direction='vertical',
            orient='right'
        ),
        lineBreak='\n',
        title=alt.TitleConfig(
            fontSize=24,
            fontWeight=400,
            subtitleFontSize=12,
            subtitlePadding=10,
            anchor='start',
        ),
        view=alt.ViewConfig(
            strokeWidth=0
        )
    )

    return chart


corr_plot(corr_matrix)

### 1.4 Sentiment Score Exploration

We want to compare the sentiment scores of the Top 100 hotels with the sentiment score distribution, by rating, of reviews from randomly picked hotels.

In [9]:
# Load the 'ta_senti_bootstraps' dataset (read as a pickle despite the .csv extension)
ta_senti_bootstraps = pd.read_pickle('./data/ta_senti_bootstraps.csv')
ta_senti_bootstraps

   Rating  SentiMean  SentiLower  SentiUpper
0       1    -0.0406     -0.0502     -0.0313
1       2     0.0929      0.0868      0.0992
2       3     0.1975      0.1921      0.2027
3       4     0.2962      0.2932      0.2994
4       5     0.3635      0.3606      0.3662
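
Judging by the dataset name, the `SentiLower` and `SentiUpper` columns look like bootstrap confidence bounds around `SentiMean`; how they were actually computed is not shown in this notebook. As a hedged sketch, a percentile bootstrap for the mean could look like the following. The function name, the 95% level, the number of resamples, and the sample scores are all assumptions for illustration.

```python
import random


def bootstrap_mean_ci(values, n_resamples=2000, level=0.95, seed=42):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Made-up sentiment polarity scores, for illustration only
scores = [0.31, 0.42, 0.28, 0.39, 0.35, 0.44, 0.30, 0.37]
mean = sum(scores) / len(scores)
lower, upper = bootstrap_mean_ci(scores)
print(round(lower, 4), round(mean, 4), round(upper, 4))
```

The interval narrows as the number of reviews grows, which would be consistent with the tight bounds in the table above.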


In [10]:
def senti_chart(ci_data=ta_senti_bootstraps, mean_data=hotels_data):
    ci = alt.Chart(ci_data).mark_rule(size=6).encode(
        x=alt.X('Rating:Q', axis=alt.Axis(values=[1, 2, 3, 4, 5])),
        y='SentiLower:Q',
        y2='SentiUpper:Q'
    )

    mean = alt.Chart(mean_data).mark_circle(color='red').encode(
        x='CustomerRating:Q',
        y='SentiMean:Q'
    )

    line = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(strokeDash=[2, 5], color='gray', size=3).encode(
        y='y'
    )

    chart = alt.layer(mean, ci, line).properties(
        width=600,
        height=800,
        title=alt.TitleParams(
            text=[
                'Comparison of the reviews’ sentiment scores',
                'of the Top 100 hotels and random hotels'
            ],
            subtitle="\n"
        )
    ).configure(
        axis=alt.AxisConfig(
            title=None, domain=False, grid=False, ticks=False, labelPadding=10, labelColor='gray', labelFont='Helvetica Neue',
            labelFontSize=12
        ),
        legend=alt.LegendConfig(
            direction='horizontal',
            orient='top'
        ),
        lineBreak='\n',
        title=alt.TitleConfig(
            fontSize=24,
            fontWeight=400,
            subtitleFontSize=12,
            subtitlePadding=10,
            anchor='start',
        ),
        view=alt.ViewConfig(
            strokeWidth=0
        )
    )

    return chart

senti_chart(ci_data=ta_senti_bootstraps, mean_data=hotels_data)

* All the Top 100 hotels have a higher sentiment score than the rating-4 reviews, indicating that customers in general have a satisfying experience.
* However, a closer look at the rating-5 group shows that some hotels still have a lower sentiment score than the reviews of randomly picked hotels, a reminder that there is still room for improvement.

## 2 - Future Work

* Prices depend heavily on dates and hotel policies, so we need a more consistent way to collect them. For example, find a future date on which all hotels have availability and average the per-night price over five days.
* The reviews retrieved are all in English, and the review dates are not taken into account. Reviews in other languages should be retrieved and processed, and the reviews may also need to be grouped by year.
* We planned to analyse the correlation between amenities and the ranking score. However, the amenities data we scraped from Tripadvisor only includes basic amenities; the special amenities that fit particular themes and make hotels stand out are missing. We also noticed that booking.com has more information about amenities, but the time limit of this project kept us from scraping more data. In the future, collecting data from other related websites would allow a more comprehensive analysis.
* The numerical variables have not shown a strong relationship with the ranking score. Further correlation analysis should encode the categorical features and normalize the numerical data.
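
The last point can be sketched in plain Python: one-hot encode a categorical column such as `Theme` and standardize a numerical column such as `PricePerNight`, so that both can enter a correlation analysis on comparable scales. One-hot encoding and z-score standardization are only one possible choice here, not something this notebook prescribes; the helper names and sample values are made up.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each category to a 0/1 indicator column."""
    categories = sorted(set(values))
    return {cat: [int(v == cat) for v in values] for cat in categories}


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up sample values, for illustration only
themes = ['Beach', 'City', 'Beach', 'Safari']
prices = [400, 600, 800, 1000]

encoded = one_hot(themes)
scaled = z_scores(prices)
print(encoded['Beach'])
print([round(z, 2) for z in scaled])
```

Each indicator column can then be correlated with `Score` in the same way as the numerical variables.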

## Record Library Dependency

In [11]:
%load_ext watermark
%watermark -u -i -d -v -iv -w -p vega_datasets,spacytextblob

Last updated: 2022-10-18T14:42:24.995687+08:00

Python implementation: CPython
Python version       : 3.9.13
IPython version      : 8.5.0

vega_datasets: 0.9.0
spacytextblob: 4.0.0

numpy : 1.23.3
spacy : 3.2.4
altair: 4.2.0
sys   : 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:52) 
[Clang 13.0.1 ]
pandas: 1.5.0

Watermark: 2.3.1

