# Reviews analysis on European Luxury Hotels

# 0. Background and Introduction

Whether hotel managers like it or not, guest reviews are becoming an essential factor influencing people's booking/purchase. Think about our past experiences. When we are looking for vacation accommodation on Booking.com, we may unknowingly scroll down the screen to view reviews.

As reviews become more and more important, hotel owners need to start using them. The analysis of reviews can help hotels understand what customers think about themselves and understand whether the hotel meets customer expectations. This analysis can also contribute to building up user portraits based on the reviews, which is essential for formulating marketing strategies based on the role of the customer. In addition, hotels can also compare themselves with similar hotels in the same region to measure their competitiveness and better locate themselves.

The review on Booking.com has a significant feature. Reviewers should give positive and negative feelings separately when writing, as shown in the figure. For potential customers, this can help them better decide which hotel to choose. For owners, this can better help estimate customers' overall rating of the hotel through review text.

fig.1 - Reviewing process on Booking.com

This report will focus on luxury hotels in six famous European tourism nations and analyze their reviews on Booking.com. The analysis will mainly involve and try to answer these questions: What do customers care about hotels in different countries? Are customers' reviews of each hotel positive or negative? How to predict the guest's rating based on hotels' basic information, the reviewers' information, and their text evaluations? How to classify and locate hotels based on their ratings and reviews. How to cluster consumers and roughly describe their portraits?

## 1. Data loading and wrangling

In [756]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import os
import geoviews as gv
import geoviews.tile_sources as gvts
import hvplot.pandas
import holoviews as hv
import seaborn as sns
import altair as alt

np.random.seed(42)

palette5 = ["#61a262","#1a7328","#f2b4ae","#f26a4b","#f25d50"]
palette4 = ["#61a262","#1a7328","#f2b4ae","#f25d50"]
palette2 = ["#61a262","#f26a4b"]


### Load hospital reviews data with API 

Set up environment variable for kaggle API.

In [757]:
# set Kaggle key to environment variable
os.environ["KAGGLE_USERNAME"] = "lanxiao1120"
os.environ["KAGGLE_KEY"] = "dc6f594f7f5b866e3aec880f88948cb8"

# !!users need to install kaggle package first
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

Load data from Kaggle with API.

In [758]:
#Signature: dataset_download_file(dataset, file_name, path=None, force=False, quiet=True)
# download file via API
api.dataset_download_file('jiashenliu/515k-hotel-reviews-data-in-europe','Hotel_Reviews.csv', path='data')
# read it
hotel_raw = pd.read_csv('data/Hotel_Reviews.csv.zip')

#hotel_raw.head()

### Wrangle data and join with country Boundaries

Transform data into geo data frame, transform date columns into date type, and trim data into those of 2016.

In [759]:
# set up coord
hotel = hotel_raw.copy()
hotel['geometry'] = gpd.points_from_xy(hotel['lng'], hotel['lat'])

# to geo df
hotel = gpd.GeoDataFrame(hotel, geometry="geometry", crs="EPSG:4326")

# convert crs to 3857
hotel = hotel.to_crs(epsg=3857)

# transform to date
hotel['Review_Date'] = pd.to_datetime(hotel['Review_Date'] ,format='%m/%d/%Y')

# trim into 2016
hotel = hotel.loc[hotel['Review_Date'] >= '2016-01-01']
hotel = hotel.loc[hotel['Review_Date'] < '2017-01-01']

hotel.head(n=2)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng,geometry
66,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-29,7.7,Hotel Arena,United Kingdom,Asked for more coffee and sugars only got giv...,40,1403,Nice open room Bed plenty of room Bath room c...,28,7,9.2,"[' Leisure trip ', ' Couple ', ' Large King Ro...",217 day,52.360576,4.915968,POINT (547243.088 6865586.634)
67,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-28,7.7,Hotel Arena,South Africa,Room was not cleaned correctly Wine Champagne...,30,1403,To begin with we were upgraded which made my ...,92,6,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",218 day,52.360576,4.915968,POINT (547243.088 6865586.634)


In [760]:
%%opts WMTS [width=800, height=800, xaxis=None, yaxis=None]

hotel_2017 = hotel.loc[hotel['Review_Date']>'2017-01-01']

#hotel_2017.hvplot(geo=True, tiles=True, crs=3857)

Load data of European countries, and spatial join with hotel data.

In [761]:
# get country data 
url = 'https://gisco-services.ec.europa.eu/distribution/v2/countries/geojson/CNTR_RG_01M_2020_3857.geojson'
country = gpd.read_file(url)
#select col
country = country[['NAME_ENGL','geometry']]
# spatial join
hotel_joined = gpd.sjoin(hotel, country, op='within', how='left').drop(['index_right'], axis=1)


Change the name of the column for country.

In [762]:
# change column name
hotel_joined = hotel_joined.rename(
    columns={"NAME_ENGL": "Country"}
)

hotel_joined.head(n=2)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng,geometry,Country
66,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-29,7.7,Hotel Arena,United Kingdom,Asked for more coffee and sugars only got giv...,40,1403,Nice open room Bed plenty of room Bath room c...,28,7,9.2,"[' Leisure trip ', ' Couple ', ' Large King Ro...",217 day,52.360576,4.915968,POINT (547243.088 6865586.634),Netherlands
67,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-28,7.7,Hotel Arena,South Africa,Room was not cleaned correctly Wine Champagne...,30,1403,To begin with we were upgraded which made my ...,92,6,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",218 day,52.360576,4.915968,POINT (547243.088 6865586.634),Netherlands


## 2. Exploratory analysis

##  3. Words frequency analysis & Interactive word clouds

What are the most commonly used words in hotel reviews? Analyzing it can help hotel owners understand the most valuable services that reviewers care about. 

So let's start with the comprehensive data. After removing the meaningless pause words, the four most common terms are in order: room, staff, location, and breakfast, which represent the four fields consumers care about most about hotel accommodation.

### Lower and split reviews

In [763]:
hotel_review = hotel_joined.copy()

hotel_review['Negative_Review'] = [review.lower().split() for review in hotel_review['Negative_Review']]
hotel_review['Positive_Review'] = [review.lower().split() for review in hotel_review['Positive_Review']]

### Remove stop words and punctuation

Load stop words and create a list.

In [764]:
import nltk
# download stop words
nltk.download('stopwords');

#Get the list of common stop words
stop_words = list(set(nltk.corpus.stopwords.words('english')))

[nltk_data] Downloading package stopwords to /Users/lexi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Load common punctuation and create a list.

In [765]:
import string

punctuation = list(string.punctuation)

Remove stop words and punctuation from our reviews.

In [766]:
# list to remove
ignored = stop_words + punctuation

# Remove from each review column
hotel_review['Negative_Review'] = [[word for word in review if word not in ignored]
              for review in hotel_review['Negative_Review']]
hotel_review['Positive_Review'] = [[word for word in review if word not in ignored]
              for review in hotel_review['Positive_Review']]

### Count word frequencies first time

Create a new column containing both positive review and negative review.

In [767]:
hotel_review['Total_Review'] = pd.concat([hotel_review['Negative_Review'],hotel_review['Positive_Review']], 
                                         ignore_index=True)

hotel_review.head(n=2)

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng,geometry,Country,Total_Review
66,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-29,7.7,Hotel Arena,United Kingdom,"[asked, coffee, sugars, got, given, two, satch...",40,1403,"[nice, open, room, bed, plenty, room, bath, ro...",28,7,9.2,"[' Leisure trip ', ' Couple ', ' Large King Ro...",217 day,52.360576,4.915968,POINT (547243.088 6865586.634),Netherlands,"[would, nice, one, responsible, cleaning, room..."
67,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-28,7.7,Hotel Arena,South Africa,"[room, cleaned, correctly, wine, champagne, gl...",30,1403,"[begin, upgraded, made, wife, happy, room, spa...",92,6,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",218 day,52.360576,4.915968,POINT (547243.088 6865586.634),Netherlands,"[first, impressions, dark, reception, made, us..."


Define a helper function to calculate word frequencies from our data frame with other information.

In [768]:
def count_word(review_col, top=20):
    """
    Given a column of all words for every rhotel, count word frequencies across all reviews.
    
    By default, this returns the top 20 words, but you can specify a different value for `top`.
    """
    import itertools, collections
    
    # List of all words across hotels
    all_words = list(itertools.chain(*review_col))

    # Create counter
    counter = collections.Counter(all_words)
    
    return pd.DataFrame(counter.most_common(top),
                        columns=['words', 'count'])

In [769]:
counts_reviews = count_word(hotel_review['Total_Review'], top=20)
counts_reviews

Unnamed: 0,words,count
0,room,81858
1,staff,60017
2,location,52072
3,hotel,49687
4,breakfast,37343
5,good,34250
6,negative,33003
7,great,30074
8,friendly,22985
9,bed,21364


### Remove words that are not helpful

Remove words that are not very helpful for analysis.

In [770]:
neutral_words = ["nothing", "hotel", "would","could","one","bit","little","us","get","time","really","also","even"]
hotel_review['Total_Review'] = [[word for word in review if word not in neutral_words]
              for review in hotel_review['Total_Review']]

### Count words frequency for the final cleaned reviews

In [771]:
counts_reviews = count_word(hotel_review['Total_Review'], top=20)
counts_reviews

Unnamed: 0,words,count
0,room,81858
1,staff,60017
2,location,52072
3,breakfast,37343
4,good,34250
5,negative,33003
6,great,30074
7,friendly,22985
8,bed,21364
9,helpful,20890


Plot the words frequency result.

### Define functions to plot word clouds from word frequency

However, the focus does vary with the country and its tourism characteristics. So an interactive word cloud is made by the country and the largest number of words to show. 

As the results show, people are most concerned about the location and staff services in the Netherlands, Italy, Spain, and Austria, and there are more positive words in the evaluation (good, great, friendly, Etc.). In the UK, people are more concerned about rooms conditions and staff services, and the frequency of 'negative' is higher in reviews. In France, people pay more attention to rooms and breakfast, and negative words (negative, small, Etc.) appear more frequently. 

What's more, each country has its unique characteristics. For example, reviewers in France pay more attention to the night experience and shower experience.

In [773]:
import multidict as multidict
import numpy as np
import os
import re
from PIL import Image
from os import path
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import itertools, collections


def getFrequencyDictForText(sentence): 
    
    """
    Given a list of all words from all reviews, count word frequencies across all reviews, 
    return a dictionary of word and frequency.
    
    """

    # get frequency
    fullTermsDict = multidict.MultiDict()
    tmpDict = {}

    # making dict for counting frequencies
    for text in sentence.split(" "):
        if re.match("a|the|an|the|to|in|for|of|or|by|with|is|on|that|be", text):
            continue
        val = tmpDict.get(text, 0)
        tmpDict[text.lower()] = val + 1
    for key in tmpDict:
        fullTermsDict.add(key, tmpDict[key])
    return fullTermsDict


def plot_cloud(Country, df, select_col='Country', text_col='Total_Review', Maximum=[20,50,100,150], title="Most used words in reviews by country"):
    """
    Given a dataframe, the column to interact with, the column with review text, 
    selected the country to analyze, maxmium word to display, title to show,
    generate a word cloud.
    
    """
        
    # select country 
    df_filtered = df.loc[df[select_col]==Country]
    # put all review words into one string object.
    all_words_list = list(itertools.chain(*df_filtered[text_col]))
    all_words = ''
    for word in all_words_list:
        all_words += " " + word
    
    # load base pic
    #alice_mask = np.array(Image.open("image/europe.png"))

    # set up word clooud
    wc = WordCloud(background_color="white", 
                   max_words=Maximum,
                   width=1500,       
                   height=960,          
                   margin=10)
    
    # generate word cloud
    wc.generate_from_frequencies(getFrequencyDictForText(all_words))
    
    # show
    fig, ax = plt.subplots(figsize=(15, 10))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title, fontsize=40,fontweight="bold")
    plt.figure()
    

### Implement interactivity

Obtain a list of all countries.

In [774]:
list1 = list(hotel_review['Country'].unique())

# remove nan
list1 = list1[0:3]+list1[4:7]

list1

['Netherlands', 'United Kingdom', 'France', 'Spain', 'Austria', 'Italy']

Interact Functions.

## 4. Sentiment analysis

Are customers' reviews of each hotel positive or negative, subjective or objective? Let's analyze negative, positive, and integrated reviews.

### Create "text blobs" and pass text to

We create three text bolbs here, one for negative reviews, one for positive reviews, and one for total reviews.

In [775]:
import textblob

# concat positive and negative reviews together
hotel_review_raw = hotel_joined.copy()
hotel_review_raw['Total_Review'] = pd.concat([hotel_review_raw['Negative_Review'],hotel_review_raw['Positive_Review']], 
                                             ignore_index=True)

blobs_total = [textblob.TextBlob(review) for review in hotel_review_raw['Total_Review']]
blobs_neg = [textblob.TextBlob(review) for review in hotel_review_raw['Negative_Review']]
blobs_pos = [textblob.TextBlob(review) for review in hotel_review_raw['Positive_Review']]

### Combine all sentiment data into a DataFrame

In [776]:
hotel_sentiment = hotel_review_raw.copy()

hotel_sentiment['total_polarity'] = [blob.sentiment.polarity for blob in blobs_total]
hotel_sentiment['total_subjectivity'] = [blob.sentiment.subjectivity for blob in blobs_total]

hotel_sentiment['neg_polarity'] = [blob.sentiment.polarity for blob in blobs_neg]
hotel_sentiment['neg_subjectivity'] = [blob.sentiment.subjectivity for blob in blobs_neg]

hotel_sentiment['pos_polarity'] = [blob.sentiment.polarity for blob in blobs_pos]
hotel_sentiment['pos_subjectivity'] = [blob.sentiment.subjectivity for blob in blobs_pos]

hotel_sentiment.head()

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,...,lng,geometry,Country,Total_Review,total_polarity,total_subjectivity,neg_polarity,neg_subjectivity,pos_polarity,pos_subjectivity
66,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-29,7.7,Hotel Arena,United Kingdom,Asked for more coffee and sugars only got giv...,40,1403,Nice open room Bed plenty of room Bath room c...,...,4.915968,POINT (547243.088 6865586.634),Netherlands,It would have been nice if the one responsibl...,0.232963,0.708025,0.054667,0.517333,0.366667,0.772222
67,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-28,7.7,Hotel Arena,South Africa,Room was not cleaned correctly Wine Champagne...,30,1403,To begin with we were upgraded which made my ...,...,4.915968,POINT (547243.088 6865586.634),Netherlands,First impressions of the dark reception made ...,0.05,0.388889,-0.4,0.533333,0.451547,0.686869
68,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-21,7.7,Hotel Arena,United Arab Emirates,The bathroom was amazing Though we were two f...,35,1403,Beautiful design comfortable room friendly st...,...,4.915968,POINT (547243.088 6865586.634),Netherlands,Foyer was a mess Only place to relax was the ...,-0.063297,0.584928,0.05,0.95,0.391667,0.61
69,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-20,7.7,Hotel Arena,United Kingdom,Our room didn t have a sofa in but these two ...,46,1403,The bath and shower room was amazing Beautifu...,...,4.915968,POINT (547243.088 6865586.634),Netherlands,the club sandwiches cold bacon with congealed...,-0.573333,0.91,-0.051389,0.497222,0.425,0.8
70,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,2016-12-19,7.7,Hotel Arena,Germany,The breakfast was rather average from quality...,55,1403,The building was fantastic All renovated area...,...,4.915968,POINT (547243.088 6865586.634),Netherlands,The only complaint we had was the iron in the...,-0.044444,0.511111,0.211111,0.443519,0.534375,0.7875


### Count how many reviews are overall unbiased

In [777]:
zero = (hotel_sentiment['total_polarity']==0).sum()
print("Number of unbiased reviews: ", zero)
print("Proportion of unbiased reviews: ", round(zero/len(hotel_sentiment)*100,2),"%")

Number of unbiased reviews:  45595
Proportion of unbiased reviews:  17.24 %


### The distribution of polarity

Polarity runs from -1 (most negative) to +1 (most positive). Plot the histograms of the polarities of positive, negative and integrated reviews together with different colors. We can see that the polarity of negative reviews is concentrated in 0-0.2, while the polarity of positive reviews is concentrated in the range of 0.3-0.7, which proves that the algorithm's evaluation value of review sentiment is larger than the normal one. Therefore, the polarity of the true neutral evaluation should be slightly greater than 0. 

Then we look at the polarity distribution of the integrated reviews. Most polarities are near 0, which represents slightly negative feelings, and some are in the range of  -0.2 to -0.1, which means strongly negative emotions. Also, some polarities are more evenly distributed in the field of 0.3-0.7, while a small number of polarities are more significant than 0.7. 

In summary, positive reviews are a bit more than negative reviews overall, but some negative reviews are very extreme. This result enlightens hotel managers that the most important thing to pay attention to is to reduce the appearance of these extremely negative comments, that is, to prioritize improving the shortest board rather than further optimizing the existing advantages.

In [633]:
columns = ['neg_polarity', 'pos_polarity', 'total_polarity']

polarity_dist = hotel_sentiment.hvplot.hist(y=columns, 
                            alpha=0.3, 
                            legend='right',
                            title = "The distribution of polarity by review classification")

polarity_dist

In [619]:
hvplot.save(polarity_dist, 'html/polarity_dist.html')

In [620]:
print("The average polarity of integrated reviews:",round(hotel_sentiment['total_polarity'].mean(),4))
print("The median polarity of integrated reviews:",round(hotel_sentiment['total_polarity'].median(),4))

The average polarity of integrated reviews: 0.2052
The median polarity of integrated reviews: 0.15


In [621]:
print("The average polarity of positive reviews:",round(hotel_sentiment['pos_polarity'].mean(),4))
print("The median polarity of positive reviews:",round(hotel_sentiment['pos_polarity'].median(),4))

The average polarity of positive reviews: 0.3801
The median polarity of positive reviews: 0.4


In [622]:
print("The average polarity of negative reviews:",round(hotel_sentiment['neg_polarity'].mean(),4))
print("The median polarity of negative reviews:",round(hotel_sentiment['neg_polarity'].median(),4))

The average polarity of negative reviews: 0.0355
The median polarity of negative reviews: 0.0


### The distribution of subjectivity

The subjectivity of negative reviews is more concentrated at 0.4 and below, while positive reviews are more distributed above 0.5. It indicates that negative consumer reviews are often more objective than positive ones. Therefore, if hotels want to increase the number of positive reviews, it's best to provide customers with higher emotional value, such as low-cost surprise gifts or services, and a significantly warm attitude.

In [634]:
columns = ['neg_subjectivity', 'pos_subjectivity', 'total_subjectivity']

subjectivity_dist = hotel_sentiment.hvplot.hist(y=columns, 
                            alpha=0.3, 
                            legend='right',
                            title = "The distribution of subjectivity by review classification")

subjectivity_dist

In [635]:
hvplot.save(subjectivity_dist, 'html/subjectivity_dist.html')

### Explore the monthly trend of polarity

At last, explore the monthly trend of polarity and subjectivity. Interestingly, the reviewers' comments text will be more positive and subjective in the summer. This phenomenon may be because the weather and temperature will affect the users' mood or experience. In any case, it means that European hotels need to pay more attention to maintaining user experience, complaints, and evaluations in winter.

Sort the reviews in chronological order.

In [784]:
# sort 
hotel_sentiment = hotel_sentiment.sort_values(by='Review_Date', ascending=True)

# get month
import datetime as dt
hotel_sentiment['month'] = hotel_sentiment['Review_Date'].dt.month_name()


## 6. Clustering hotels based on DBSCAN

### Select and engineer features

In [785]:
feature_columns = [
    'Hotel_Name',
    'Average_Score', 
    'Total_Number_of_Reviews', 
    'total_subjectivity'
]

# select features
hotel_cluster = hotel_sentiment[feature_columns].copy()

# group by hotel and calculate mean
hotel_cluster = hotel_cluster.groupby('Hotel_Name').mean().reset_index().dropna()

hotel_cluster.head()

Unnamed: 0,Hotel_Name,Average_Score,Total_Number_of_Reviews,total_subjectivity
0,11 Cadogan Gardens,8.7,393.0,0.488381
1,1K Hotel,7.7,663.0,0.399792
2,25hours Hotel beim MuseumsQuartier,8.8,4324.0,0.604034
3,41,9.6,244.0,0.32468
4,45 Park Lane Dorchester Collection,9.4,68.0,0.44277


### Normalize features

In [786]:
from sklearn.preprocessing import StandardScaler

# drop hotel name
hotel_scaled = hotel_cluster.drop(['Hotel_Name'], axis=1)

# Scale these features
scaler = StandardScaler()
hotel_scaled = scaler.fit_transform(hotel_scaled)

hotel_scaled

array([[ 0.44364563, -0.67491521,  0.01049934],
       [-1.39199299, -0.47888846, -0.84116048],
       [ 0.62720949,  2.17908913,  1.12234468],
       ...,
       [ 1.17790108,  2.43174583, -0.93601578],
       [ 0.62720949,  0.26310911, -0.58306978],
       [ 0.07651791, -0.3757929 ,  1.37191062]])

### Run DBSCAN to extract high-density clusters

In [787]:
from sklearn.cluster import dbscan 

# Run DBSCAN 
cores, labels = dbscan(hotel_scaled, eps=0.38, min_samples=30)

# Add the labels back to the original (unscaled) dataset
hotel_cluster['label'] = labels

# Extract the number of clusters 
num_clusters = hotel_cluster['label'].nunique() - 1
print("The number of clusters", num_clusters)

The number of clusters 3


In [788]:
hotel_cluster.groupby('label').size()

label
-1    813
 0    409
 1    195
 2     40
dtype: int64

### Get mean statistics for clusters

In [789]:
# groupby by the label
grps = hotel_cluster.groupby('label')

# calculate average pickup hour and trip distance per cluster
avg_values = grps[[ 'Average_Score', 'Total_Number_of_Reviews', 'total_subjectivity']].mean().reset_index()

avg_values

Unnamed: 0,label,Average_Score,Total_Number_of_Reviews,total_subjectivity
0,-1,8.376015,1719.720787,0.502717
1,0,8.663081,663.550122,0.397988
2,1,8.304951,1122.51757,0.586961
3,2,8.785,965.425,0.600886


### Visualize the clusters

Join with other information that not involves clustering process. 

In [790]:
# merge with geometry
hotel_cluster = pd.merge(hotel_cluster, hotel_joined[['geometry','Hotel_Name','Country']],on=['Hotel_Name'], how='left').drop_duplicates()
hotel_cluster = gpd.GeoDataFrame(hotel_cluster, geometry="geometry", crs="EPSG:3857")

hotel_cluster.head()

Unnamed: 0,Hotel_Name,Average_Score,Total_Number_of_Reviews,total_subjectivity,label,geometry,Country
0,11 Cadogan Gardens,8.7,393.0,0.488381,0,POINT (-17725.926 6709077.580),United Kingdom
81,1K Hotel,7.7,663.0,0.399792,-1,POINT (263367.933 6251804.926),France
149,25hours Hotel beim MuseumsQuartier,8.8,4324.0,0.604034,-1,POINT (1820589.050 6141273.832),Austria
481,41,9.6,244.0,0.32468,-1,POINT (-15990.956 6709887.750),United Kingdom
529,45 Park Lane Dorchester Collection,9.4,68.0,0.44277,-1,POINT (-16868.899 6711358.422),United Kingdom


### Plot relation between scores, subjectivity, and number of reviews, coloring by  labels

Make plots interactive.

In [803]:
import panel as pn

pn.extension("vega")

import param as pm

class hotelsByCountry(pm.Parameterized):

    Countries = pm.ObjectSelector(default='Netherlands', objects=['Netherlands', 'United Kingdom', 'France', 'Spain', 'Austria', 'Italy'])

    @pm.depends('Countries')
    def scatter(self):
        """
        Return an altair scatter plot of the x and y by label.
        """

        p1 = (
            alt.Chart(hotel_cluster.loc[hotel_cluster['Country']==self.Countries]
                     ).mark_circle().encode(
                alt.X('total_subjectivity:Q', scale=alt.Scale(zero=False)),
                alt.Y('Average_Score:Q', scale=alt.Scale(zero=False)),
                size='Total_Number_of_Reviews:Q',
                color=alt.Color('label:N', scale=alt.Scale(scheme='dark2'))
            ).properties(width=500)
        )
        return p1

app = hotelsByCountry(name="")



Layout.

In [801]:
# The title
title = pn.Pane("<h1>Relationships between scores, subjectivity, and number of reviews, coloring by labels</h1>", width=1000)

# Layout the dashboard
panel = pn.Column(
    pn.Row(title),
    pn.Row(pn.Param(app.param, width=300)),
    pn.Row(app.scatter),
)

In [802]:
panel.servable()

### Plot relation between scores, subjectivity, and number of reviews, coloring by countries

In [688]:
out2 = alt.Chart(hotel_cluster.loc[hotel_cluster['label']!=-1]
         ).mark_circle().encode(
    alt.X('total_subjectivity:Q', scale=alt.Scale(zero=False)),
    alt.Y('Average_Score:Q', scale=alt.Scale(zero=False)),
    size='Total_Number_of_Reviews:Q',
    color=alt.Color('Country:N', scale=alt.Scale(scheme='dark2')),
    tooltip=list(hotel_cluster.columns)
).interactive().properties(width=500)

out2

In [692]:
# save the chart as JSON
#out2.save("countryAltair.json") 

### Plot the radar map for hotels

Normalize features of cluster centers.

In [665]:
hotel_centers = avg_values.copy().drop(['label',], axis=1)

hotel_centers = scaler.fit_transform(hotel_centers)

hotel_centers = pd.DataFrame(hotel_centers)

hotel_centers = hotel_centers.rename(
    columns={
        0: "Average_Score", 
        1: "Total_Number_of_Reviews",
        2: "total_subjectivity"
    }
)

hotel_centers.head()

Unnamed: 0,Average_Score,Total_Number_of_Reviews,total_subjectivity
0,-0.788479,1.564753,-0.239995
1,0.66016,-1.180883,-1.534188
2,-1.147091,0.012255,0.80105
3,1.27541,-0.396125,0.973133


Plot.

## 7. Clustering reviewers based on K-mean

### Select and engineer features

In [669]:
# create a col for total number of review words
hotel_cluster2 = hotel_sentiment.copy()
hotel_cluster2['Review_Total_Word_Counts'] = hotel_cluster2['Review_Total_Negative_Word_Counts']+hotel_cluster2['Review_Total_Positive_Word_Counts']


feature_columns = [
    'Review_Total_Word_Counts',
    'Total_Number_of_Reviews_Reviewer_Has_Given', 
    'days_since_review',
]

# select features
hotel_cluster2 = hotel_cluster2[feature_columns]

# convert days_since_review	 from string to Integers 
hotel_cluster2['days_since_review'] = [row.split() for row in hotel_cluster2['days_since_review']]
hotel_cluster2['days_since_review'] = [row[0] for row in hotel_cluster2['days_since_review']]
hotel_cluster2['days_since_review'] = hotel_cluster2['days_since_review'].astype(float)

# group by hotel and calculate mean
hotel_cluster2 = hotel_cluster2.dropna()

hotel_cluster2.head(n=2)

Unnamed: 0,Review_Total_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,days_since_review
79826,8,2,580.0
444094,9,4,580.0


### Fit clustering

In [670]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)

kmeans.fit(hotel_cluster2[['Review_Total_Word_Counts','Total_Number_of_Reviews_Reviewer_Has_Given', 'days_since_review']])

# Extract the labels
hotel_cluster2['label'] = kmeans.labels_

## Have a glimpse of cluster size

In [671]:
# Extract the number of clusters 
num_clusters2 = hotel_cluster2['label'].nunique() - 1
print("The number of clusters", num_clusters2)

The number of clusters 4


In [672]:
hotel_cluster2.groupby('label').size()

label
0    67097
1    61468
2     8060
3    72188
4    55590
dtype: int64

### Get mean statistics for clusters

In [673]:
# groupby by the label
grps = hotel_cluster2.groupby('label')

# calculate average pickup hour and trip distance per cluster
avg_values2 = grps[['Review_Total_Word_Counts','Total_Number_of_Reviews_Reviewer_Has_Given', 'days_since_review']].mean().reset_index()

avg_values2

Unnamed: 0,label,Review_Total_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,days_since_review
0,0,31.92484,7.195612,265.140006
1,1,31.10503,7.11022,449.403836
2,2,178.679653,7.483375,377.728164
3,3,29.975328,6.886435,358.850377
4,4,33.233801,7.555783,536.851196


### Visualize the clusters

### Plot relation between scores, subjectivity, and number of reviews, coloring by  labels

In [674]:
"""
to_plot = hotel_cluster2.loc[hotel_cluster2['label']!=-1].drop_duplicates()

# data is so large, select some
to_plot = to_plot.sample(n=150, replace=False, random_state=42, axis=0)

to_plot.hvplot.scatter(x='Total_Number_of_Reviews_Reviewer_Has_Given', y='Review_Total_Word_Counts', by='label', 
                      legend='right', height=400, width=800,
                       size='days_since_review', alpha=0.3,
                      hover_cols=['Review_Total_Word_Counts','Total_Number_of_Reviews_Reviewer_Has_Given', 'days_since_review'])


### Plot the radar map for reviewers

Normalize features of cluster centers.

In [675]:
cluster_centers = scaler.fit_transform(kmeans.cluster_centers_)

cluster_centers = pd.DataFrame(cluster_centers)

cluster_centers = cluster_centers.rename(
    columns={
        0: "Review_Total_Word_Counts", 
        1: "Total_Number_of_Reviews_Reviewer_Has_Given",
        2: "days_since_review"
    }
)

cluster_centers.head()

Unnamed: 0,Review_Total_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,days_since_review
0,-0.493508,-0.204612,-1.454396
1,-0.508405,-0.547683,0.567568
2,1.999668,0.943884,-0.214799
3,-0.526656,-1.464901,-0.426343
4,-0.471099,1.273312,1.52797


Plot.