#### IACH PROJECT 
Hotel Recommendation System 

Authors:

Diogo Dória 

Mariana Paulino


## Data Analysis

In [2]:
#import required libraries
import pandas as pd
import numpy as np
from langdetect import detect
from sklearn.feature_extraction import _stop_words
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import folium
from folium import plugins
import ipywidgets
import geocoder
import geopy
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import reverse_geocode

In [3]:
# importing data
df= pd.read_csv('Hotel_Reviews.csv')
# changing column to lower case
df.columns=[x.lower() for x in df.columns]

In [4]:
#shape of dataframe
df.shape

(515738, 17)

In [5]:
# The dataset comprises 17 columns.
df.columns

Index(['hotel_address', 'additional_number_of_scoring', 'review_date',
       'average_score', 'hotel_name', 'reviewer_nationality',
       'negative_review', 'review_total_negative_word_counts',
       'total_number_of_reviews', 'positive_review',
       'review_total_positive_word_counts',
       'total_number_of_reviews_reviewer_has_given', 'reviewer_score', 'tags',
       'days_since_review', 'lat', 'lng'],
      dtype='object')

The dataset contains 515,738 observations (reviews) described by 17 variables and covers 1,492 hotels across Europe. The geographical distribution of the hotels is discussed in a later section.

In [6]:
#there are 1492 hotels
hotel=list(df.hotel_name.unique())
len(hotel)

1492

In [7]:
# Some hotel names in the dataset contain "H tel"
df[df['hotel_name'].str.contains("H tel")]['hotel_name'].unique()[:10]

array(['H tel De Vend me', 'H tel des Ducs D Anjou',
       'H tel Juliana Paris', 'H tel de Jos phine BONAPARTE',
       'H tel Keppler', 'H tel Chaplain Paris Rive Gauche',
       'H tel Regina Op ra Grands Boulevards', 'H tel Diva Opera',
       'H tel Duo', 'H tel Le Marianne'], dtype=object)

<br>
Through research, it became evident that the scraped data had lost its accented Latin characters: each one appears as a space.

> <p><b>Example:</b>
> </p>
>Relais Hôtel du Vieux Paris --> Relais H tel du Vieux Paris
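
The effect can be reproduced with a minimal sketch. It assumes, without confirmation from the data source, that the scraper replaced every non-ASCII character with a space; the regular expression is only an illustration of that assumption.

```python
import re

# "\u00f4" is the accented "o" in "Hotel" as written in French.
original = "Relais H\u00f4tel du Vieux Paris"

# Assumption: every non-ASCII character was replaced by a single space.
stripped = re.sub(r"[^\x00-\x7F]", " ", original)

print(stripped)
```

The same mechanism would explain names such as 'H tel De Vend me', where more than one accented character was lost.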

In [8]:
#replace "H tel" with "Hotel"
df['hotel_name']=df['hotel_name'].apply(lambda x:x.replace('H tel','Hotel'))

In [9]:
# sanity check on one affected row
df[df.index==12608]['hotel_name']

12608    Hotel De Vend me
Name: hotel_name, dtype: object

## Geocode based on the dataset

In this section, we look at the hotels with respect to their geolocation.

In [10]:
#group by hotelname and aggregate
geocode_df = df.groupby('hotel_name').agg({'lat': 'first','lng':'first'}).reset_index()

In [11]:
#check if there is null
geocode_df.isnull().sum()

hotel_name     0
lat           17
lng           17
dtype: int64

17 hotels have no latitude or longitude in the dataframe. Since they account for only about 1% of the hotels, well below 15%, we decided to drop them.
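
As a rough check of that share, the sketch below computes the fraction of missing coordinates on a toy frame (the hotel names and coordinates are made up for illustration) and then repeats the arithmetic with the counts reported in this notebook: 17 hotels without coordinates out of the 1491 rows obtained after grouping by hotel name.

```python
import pandas as pd

# Toy stand-in for geocode_df: one row per hotel, one of them without coordinates.
toy = pd.DataFrame({
    "hotel_name": ["A", "B", "C", "D"],
    "lat": [51.49, None, 48.86, 48.21],
    "lng": [-0.16, None, 2.37, 16.35],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy[["lat", "lng"]].isna().mean()
print(missing_share)

# Same calculation with the counts from this notebook, as a percentage.
share_in_dataset = round(17 / 1491 * 100, 2)
print(share_in_dataset)
```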

In [12]:
#drop null and reset index
geocode_df.dropna(subset=['lat','lng'],inplace=True)
geocode_df.reset_index(drop=True,inplace=True)

In [13]:
#sanity check if there is null
geocode_df.isnull().sum()

hotel_name    0
lat           0
lng           0
dtype: int64

The section below uses reverse geocoding to look up the city for each latitude and longitude pair.
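
Conceptually, reverse geocoding maps a coordinate to the nearest known place. The sketch below illustrates the idea with a brute-force nearest-neighbour search over a small, hypothetical table of city centres; it is a simplified illustration and not how the `reverse_geocode` package is implemented. The names `CITIES`, `haversine_km` and `nearest_city` are introduced here for the example only.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical reference table: (city, country, lat, lng) of a few city centres.
CITIES = [
    ("London", "United Kingdom", 51.5074, -0.1278),
    ("Paris", "France", 48.8566, 2.3522),
    ("Vienna", "Austria", 48.2082, 16.3738),
    ("Barcelona", "Spain", 41.3851, 2.1734),
    ("Milan", "Italy", 45.4642, 9.1900),
    ("Amsterdam", "Netherlands", 52.3676, 4.9041),
]

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def nearest_city(lat, lng):
    """Return the table entry closest to the given coordinate."""
    city, country, _, _ = min(CITIES, key=lambda c: haversine_km(lat, lng, c[2], c[3]))
    return {"city": city, "country": country}

# Coordinates of '1K Hotel' as listed in the dataset.
print(nearest_city(48.863932, 2.365874))
```

A real gazetteer holds far more places than this table, which is why the package can return neighbourhood-level names rather than only the main city.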

In [14]:
#reverse-geocode every (lat, lng) pair and store the result in a new column
#each call receives a one-element tuple of coordinates, so every cell holds
#a list with a single result dict
#building the column in one assignment avoids chained indexing and the
#SettingWithCopyWarning it triggers
geocode_df['location'] = [
    reverse_geocode.search(((lat, lng),))
    for lat, lng in zip(geocode_df['lat'], geocode_df['lng'])
]

In [15]:
geocode_df

Unnamed: 0,hotel_name,lat,lng,location
0,11 Cadogan Gardens,51.493616,-0.159235,"[{'country_code': 'GB', 'city': 'Chelsea', 'co..."
1,1K Hotel,48.863932,2.365874,"[{'country_code': 'FR', 'city': 'Paris', 'coun..."
2,25hours Hotel beim MuseumsQuartier,48.206474,16.354630,"[{'country_code': 'AT', 'city': 'Vienna', 'cou..."
3,41,51.498147,-0.143649,"[{'country_code': 'GB', 'city': 'West End of L..."
4,45 Park Lane Dorchester Collection,51.506371,-0.151536,"[{'country_code': 'GB', 'city': 'West End of L..."
...,...,...,...,...
1469,citizenM London Bankside,51.505151,-0.100472,"[{'country_code': 'GB', 'city': 'City of Londo..."
1470,citizenM London Shoreditch,51.524137,-0.078698,"[{'country_code': 'GB', 'city': 'Barbican', 'c..."
1471,citizenM Tower of London,51.510237,-0.076443,"[{'country_code': 'GB', 'city': 'City of Londo..."
1472,every hotel Piccadilly,51.510146,-0.131506,"[{'country_code': 'GB', 'city': 'London', 'cou..."


In [16]:
#function to search for country
def search_country(x):
    return x[0]['country']

In [17]:
#map the country
geocode_df['country']=geocode_df['location'].map(search_country)

In [18]:
geocode_df.country.unique()

array(['United Kingdom', 'France', 'Austria', 'Spain', 'Italy',
       'Netherlands'], dtype=object)

In [19]:
#function to search for city
def search_city(x):
    return x[0]['city']

In [20]:
#map the city
geocode_df['city']=geocode_df['location'].map(search_city)

In [21]:
geocode_df.city.value_counts()

city
Paris               224
Vienna              137
Milan               135
Amsterdam            83
Levallois-Perret     82
                   ... 
Buckhurst Hill        1
la Teixonera          1
Sagrada Família       1
Sants                 1
Harringay             1
Name: count, Length: 111, dtype: int64

In [22]:
geocode_df.city

0                  Chelsea
1                    Paris
2                   Vienna
3       West End of London
4       West End of London
               ...        
1469        City of London
1470              Barbican
1471        City of London
1472                London
1473                Vienna
Name: city, Length: 1474, dtype: object

### Map

In [23]:
#generating a map to plot the hotel locations
map2 = folium.Map(location=[41.3851,2.1734], zoom_start=4)
folium.raster_layers.TileLayer('Open Street Map').add_to(map2)
for la,lo in zip(geocode_df.lat,geocode_df.lng):
    folium.Marker(
        location=[la,lo],
        icon=folium.Icon(icon_color='white')
    ).add_to(map2)
# Plotting 
map2.save('../image/Europe_overview.html')
map2

In [24]:
#save to pickle
geocode_df.to_pickle('../data/geocode.pkl')

## Tags a.k.a Attributes

In this section, we clean and prepare the dataset for modelling in the recommender engine.

In [25]:
tag_df=df[['hotel_name','tags','lat','lng']]

In [26]:
tag_df.head()

Unnamed: 0,hotel_name,tags,lat,lng
0,Hotel Arena,"[' Leisure trip ', ' Couple ', ' Duplex Double...",52.360576,4.915968
1,Hotel Arena,"[' Leisure trip ', ' Couple ', ' Duplex Double...",52.360576,4.915968
2,Hotel Arena,"[' Leisure trip ', ' Family with young childre...",52.360576,4.915968
3,Hotel Arena,"[' Leisure trip ', ' Solo traveler ', ' Duplex...",52.360576,4.915968
4,Hotel Arena,"[' Leisure trip ', ' Couple ', ' Suite ', ' St...",52.360576,4.915968
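
Each `tags` value is stored as the text of a Python list rather than as a list. The sketch below shows one possible way to turn such strings into clean tags and count them, using `ast.literal_eval` on per-review strings. The two sample strings are modelled on the rows above, and `parse_tags` is a helper name introduced only for this example.

```python
import ast
from collections import Counter

# Sample per-review tag strings, modelled on the rows shown above.
raw_tags = [
    "[' Leisure trip ', ' Couple ', ' Suite ']",
    "[' Leisure trip ', ' Solo traveler ']",
]

def parse_tags(text):
    """Parse the list literal and strip the padding spaces around each tag."""
    return [tag.strip() for tag in ast.literal_eval(text)]

# Count how often each tag occurs across the reviews.
counts = Counter(tag for text in raw_tags for tag in parse_tags(text))
print(counts.most_common(2))
```

Parsing before joining keeps brackets, quotes and padding spaces out of the tag text, which can simplify later counting.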


In [27]:
tag_df = tag_df.groupby('hotel_name').agg({'tags': ', '.join,'lat':'first','lng':'first'}).reset_index()

In [28]:
tag_df.shape

(1491, 4)

In [29]:
tag_df.isnull().sum()

hotel_name     0
tags           0
lat           17
lng           17
dtype: int64

In [30]:
tag_df.head()

Unnamed: 0,hotel_name,tags,lat,lng
0,11 Cadogan Gardens,"[' Leisure trip ', ' Couple ', ' Superior Quee...",51.493616,-0.159235
1,1K Hotel,"[' Leisure trip ', ' Couple ', ' Superior M Do...",48.863932,2.365874
2,25hours Hotel beim MuseumsQuartier,"[' Leisure trip ', ' Solo traveler ', ' Standa...",48.206474,16.35463
3,41,"[' Leisure trip ', ' Couple ', ' Executive Kin...",51.498147,-0.143649
4,45 Park Lane Dorchester Collection,"[' Leisure trip ', ' Solo traveler ', ' Execut...",51.506371,-0.151536


In [31]:
tag_df.dropna(subset=['lat','lng'],inplace=True)

In [32]:
tag_df.to_pickle("../data/hoteltag.pkl")

## Reviews

In this section, we will be preparing the dataset for hotel recommender based on reviews.<br>

In the positive and negative review columns, the placeholders "No Positive" and "No Negative" appear wherever a reviewer left that part empty. Because they occur many times, they could distort the vectorization of words.

In [33]:
#these positive reviews are supposed to be empty
df[df['positive_review']=='No Positive'][['negative_review','positive_review']].head(5)

Unnamed: 0,negative_review,positive_review
8,Even though the pictures show very clean room...,No Positive
32,Our bathroom had an urine order Shower was ve...,No Positive
98,Got charged 50 for a birthday package when it...,No Positive
121,The first room had steep steps to a loft bed ...,No Positive
134,Foyer was a mess Only place to relax was the ...,No Positive


In [34]:
# these negative reviews are supposed to be empty

df[df['negative_review']=='No Negative'][['negative_review','positive_review']].head(5)

Unnamed: 0,negative_review,positive_review
1,No Negative,No real complaints the hotel was great great ...
13,No Negative,This hotel is being renovated with great care...
15,No Negative,This hotel is awesome I took it sincirely bec...
18,No Negative,Public areas are lovely and the room was nice...
48,No Negative,The quality of the hotel was brilliant and ev...


We decided to replace these placeholders with "No", since it will be removed as a stop word. Replacing them with an empty string instead could turn the value into NaN, which would affect later computation.
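
One way an empty string can turn into NaN is a round trip through CSV. The sketch below illustrates this on a toy frame, assuming the data is written and read back with pandas' default settings; the hotel names and review texts are made up.

```python
import io
import pandas as pd

# Toy frame: one empty review, one "No" placeholder, one real review.
toy = pd.DataFrame({
    "hotel_name": ["A", "B", "C"],
    "positive_review": ["", "No", "Great location"],
})

# Write to an in-memory CSV and read it back with default settings.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)

# The empty string comes back as NaN, while "No" survives as text.
is_missing = round_tripped["positive_review"].isna().tolist()
print(is_missing)
```

A NaN in a text column would break a later `', '.join` aggregation, which expects every value to be a string.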

In [35]:
#replace No negative and no positive
df["negative_review"] = df["negative_review"].apply(lambda x: x.replace("No Negative", "No"))
df["positive_review"] = df["positive_review"].apply(lambda x: x.replace("No Positive", "No"))


Another issue with the dataset is that some reviews carry identical text in the positive and negative fields, and we are interested in keeping only one copy. The function below checks for such duplicates and labels them.

In [36]:
#function to check negative & positive duplicates
def check(x):
    pos, neg = x
    if pos ==  neg:
        return 1
    return 0

In [37]:
df['check_dup'] = [check(x) for x in df[['positive_review','negative_review']].values]
index_col= df[df['check_dup']==1].index

In [38]:
df[df['check_dup']==1][['positive_review','negative_review']].head()

Unnamed: 0,positive_review,negative_review
1403,No,No
2451,The hotel good location and clean but some st...,The hotel good location and clean but some st...
2872,Ok,Ok
3067,Standard Hotel,Standard Hotel
5839,I was completely disappointed and mad since t...,I was completely disappointed and mad since t...


Based on this labelling, we dedupe by replacing the duplicated positive review with 'nothing', since it is going to be removed by stopwords.
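
The relabelling can be sketched on a toy frame with a boolean mask and a single `.loc` assignment. This is a minimal illustration with made-up reviews, not a replacement for the labelling function above; `same_text` is a name introduced for the example.

```python
import pandas as pd

# Toy reviews: rows 0 and 2 repeat the same text in both columns.
toy = pd.DataFrame({
    "positive_review": ["Ok", "Great staff", "Standard Hotel"],
    "negative_review": ["Ok", "Small room", "Standard Hotel"],
})

# Boolean mask marking rows whose positive and negative texts are identical.
same_text = toy["positive_review"].eq(toy["negative_review"])

# A single .loc assignment writes to the frame itself, not to a temporary copy.
toy.loc[same_text, "positive_review"] = "nothing"
print(toy)
```

Only the positive side is overwritten, so the original text is still kept once in `negative_review`.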

In [39]:
#assign through .loc so the write reaches df itself rather than a temporary copy
df.loc[index_col, 'positive_review'] = 'nothing'

In [40]:
#dedupe duplicates
df[df['check_dup']==1][['positive_review','negative_review']].head()

Unnamed: 0,positive_review,negative_review
1403,nothing,No
2451,nothing,The hotel good location and clean but some st...
2872,nothing,Ok
3067,nothing,Standard Hotel
5839,nothing,I was completely disappointed and mad since t...


In [41]:
#group by hotel name and join each hotel's reviews into one observation, separated by ', '
df1 = df.groupby('hotel_name').agg({'negative_review': ', '.join,'positive_review': ', '.join,
                                    'lat': 'first','lng':'first','hotel_address':'first',
                                    'tags': ', '.join}).reset_index()

In [42]:
#check if there is null
df1.isnull().sum()

hotel_name          0
negative_review     0
positive_review     0
lat                17
lng                17
hotel_address       0
tags                0
dtype: int64

In [43]:
df1.dropna(subset=['lat','lng'],inplace=True)

In [44]:
#import pickle with city
geocode=pd.read_pickle('../data/geocode.pkl')

In [45]:
# merging the dataset based on hotel name
new_df= pd.merge(df1,geocode,on='hotel_name',how='outer')

In [46]:
# exporting to pickle
new_df.to_pickle('../data/review.pkl')

In [47]:
#Plot the amount of hotels by country present in the dataset with plotly
import plotly.express as px
import plotly.graph_objects as go

#group by country
country_df = new_df.groupby('country').agg({'hotel_name':'count'}).reset_index()

#sort by count
country_df=country_df.sort_values(by='hotel_name',ascending=False)

#plot
fig = px.bar(country_df, x='country', y='hotel_name',color='hotel_name', color_continuous_scale='Magma_r')
#Update background color to white
fig.update_layout(plot_bgcolor='white')
fig.update_layout(title='Number of Hotels by Country', xaxis_title="Country", yaxis_title="Number of Hotels")
fig.show()

#Save the plot
fig.write_html("../image/Number_of_Hotels_by_Country.html")


In [48]:
#Show columns in the dataframe
new_df.columns


Index(['hotel_name', 'negative_review', 'positive_review', 'lat_x', 'lng_x',
       'hotel_address', 'tags', 'lat_y', 'lng_y', 'location', 'country',
       'city'],
      dtype='object')
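
The `_x` and `_y` suffixes appear because both frames carry `lat` and `lng` columns. A minimal sketch of one way to avoid the duplicated columns, assuming the coordinates in the two frames agree, is to drop them from one side before merging. The toy frames below stand in for `df1` and `geocode`.

```python
import pandas as pd

# Toy stand-ins for the review frame and the geocode frame.
reviews = pd.DataFrame({"hotel_name": ["A", "B"], "lat": [1.0, 2.0], "lng": [3.0, 4.0]})
geo = pd.DataFrame({
    "hotel_name": ["A", "B"],
    "lat": [1.0, 2.0],
    "lng": [3.0, 4.0],
    "city": ["Paris", "Vienna"],
})

# Merging as-is duplicates the shared columns with _x / _y suffixes.
with_suffixes = pd.merge(reviews, geo, on="hotel_name", how="outer")

# Dropping the shared columns from one side keeps a single lat / lng pair.
without_suffixes = pd.merge(
    reviews, geo.drop(columns=["lat", "lng"]), on="hotel_name", how="outer"
)

print(list(with_suffixes.columns))
print(list(without_suffixes.columns))
```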

In [57]:
#Plot the count of different tags in the dataset with plotly

#Create a list of all the tags
tag_list=[]
for x in new_df.tags:
    tag_list.extend(x.split(', '))

#Create a dataframe with the tags and their count
tag_df = pd.DataFrame(tag_list,columns=['tag'])
tag_df = tag_df.groupby('tag').agg({'tag':'count'}).rename(columns={'tag':'count'}).reset_index()
tag_df = tag_df.sort_values(by='count',ascending=False)

#Take the [ and ] out of the text
tag_df['tag'] = tag_df['tag'].apply(lambda x: x.replace('[',''))
tag_df['tag'] = tag_df['tag'].apply(lambda x: x.replace(']',''))

#Take the ' out of the text
tag_df['tag'] = tag_df['tag'].apply(lambda x: x.replace("'",''))

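#Normalise wording so tags with the same meaning can be merged
#match whole words only, so 'and' inside another word is left alone,
#and handle 'without' before 'with' so it is not shortened to 'w/out'
tag_df['tag'] = tag_df['tag'].str.replace(r'\bwithout\b', 'w/o', regex=True)
tag_df['tag'] = tag_df['tag'].str.replace(r'\bwith\b', 'w/', regex=True)
tag_df['tag'] = tag_df['tag'].str.replace(r'\band\b', '&', regex=True)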


#Plot the top 10 tags
fig = px.bar(tag_df[:10], x='count', y='tag',color='count', color_continuous_scale='Magma_r')
#Update background color to white
fig.update_layout(plot_bgcolor='white')
fig.update_layout(title='Top 10 Tags', xaxis_title="Count", yaxis_title="Tag")
fig.show()

#Save the plot
fig.write_html("../image/Top_10_Tags.html")
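
Plain `str.replace` matches substrings, so abbreviating `and` also rewrites words that merely contain it, and replacing `with` before `without` leaves the latter as `w/out`. The sketch below shows a word-boundary alternative with `re.sub`; the tag texts and the helper name `abbreviate` are made up for illustration.

```python
import re

def abbreviate(tag):
    """Abbreviate whole words only, handling 'without' before 'with'."""
    tag = re.sub(r"\bwithout\b", "w/o", tag)
    tag = re.sub(r"\bwith\b", "w/", tag)
    tag = re.sub(r"\band\b", "&", tag)
    return tag

# Substring replacement damages words that contain "and".
naive = "Standard Room".replace("and", "&")
print(naive)

# Word-boundary replacement leaves "Standard" intact.
clean = abbreviate("Standard Room with Bath and Shower")
print(clean)
print(abbreviate("Room without Window"))
```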


In [64]:
#If a tag name is repeated, sum its counts
tag_df = tag_df.groupby('tag').agg({'count':'sum'}).reset_index()
tag_df = tag_df.sort_values(by='count',ascending=False)

#Plot the top 15 tags
fig = px.bar(tag_df[:15], x='count', y='tag',color='count', color_continuous_scale='Magma_r')
#Update background color to white
fig.update_layout(plot_bgcolor='white')
fig.update_layout(title='Top 15 Tags', xaxis_title="Count", yaxis_title="Tag")
fig.show()
