# Analysing the most used words in reviews

Our goal in this analysis is to go through all the reviews and comments for all the rooms and identify what Airbnb customers care about most when rating a room. Our main strategy is a very simple form of Natural Language Processing (NLP), in which we correlate word frequencies with room ratings. We also try a simple multiple linear regression model to predict the rating from the words.

In [1]:
import pandas

#get dataframe from pickle
df = pandas.read_pickle("Data/final.pkl")
df.head()

id,accuracyRating,allowsChildren,allowsEvents,allowsInfants,allowsPets,allowsSmoking,bathrooms,bedrooms,beds,checkinRating,...,reviewsCount,roomType,serviceFee,totalPrice,url,distToCenter,monthsSinceCreation,daysSinceUpdate,totalAmenities,totalLanguages
21888626,9.0,1,1,1,1,1,1.0,2.0,2.0,9.0,...,7,Entire home/apt,56.0,406.0,https://www.airbnb.com/rooms/21888626?query=Al...,43.209876,18.833333,544.0,10,0
23720498,10.0,1,0,1,0,1,2.0,3.0,4.0,10.0,...,5,Entire home/apt,63.0,463.0,https://www.airbnb.com/rooms/23720498?query=Al...,45.932316,53.333333,3.0,15,5
22981563,9.0,1,0,1,1,0,1.0,3.0,3.0,10.0,...,3,Entire home/apt,39.0,284.0,https://www.airbnb.com/rooms/22981563?query=Al...,45.391489,16.8,25.0,17,1
5371649,10.0,1,0,1,1,0,2.5,3.0,7.0,10.0,...,31,Entire home/apt,29.0,209.0,https://www.airbnb.com/rooms/5371649?query=Ale...,41.607718,62.5,3.0,14,0
18837008,8.0,1,1,1,1,0,1.0,1.0,2.0,10.0,...,3,Entire home/apt,67.0,492.0,https://www.airbnb.com/rooms/18837008?query=Al...,50.17207,24.966667,11.0,15,4


## Cleaning The Data

Before analysing text data, we first need to apply some text pre-processing. Note that this cleaning could easily go on forever, since there is always an exception to every cleaning step. We therefore follow a minimum viable approach with clear, simple cleaning steps. Our ultimate goal is a “bag of words” for each room, i.e., a list of the relevant words used in that room's reviews.

## Regular Expressions
We applied the following steps, using regular expressions:
* Merge all comments into one string per room;
* Remove everything from the text data except letters and spaces;
* Lowercase every letter;
* Remove any word with fewer than 3 letters.


In [2]:
import re

print(df.iloc[1,:]["reviews"])

def mergeAllReviews(row):
    merged = " ".join(list(row))
    merged = merged.lower()
    merged = re.sub(r'[^\w ]', ' ', merged) #replaces everything except word characters (letters, digits, underscore) and spaces
    merged = re.sub(r'\d', ' ', merged) #replaces any digit
    merged = re.sub(r'\b\w{1,2}\b', ' ', merged) #replaces words of 1 or 2 letters, e.g. the fragments left after removing apostrophes
    merged = re.sub(r'\s+', ' ', merged) #collapses any run of whitespace characters into a single space
    merged = merged.strip() #removes whitespace at the beginning and end of the string
    
    #stop words are removed in the next step
    
    return merged

df["filteredReviews"] = df["reviews"].apply(lambda x: mergeAllReviews(x))
print(df.iloc[1,:]["filteredReviews"])

['Amazing place very nice lady recommend to anywone', 'Lovely home situated in the quiet village of Labrugeria.  Perfect for our needs and wonderful hosts.  A great place to come home to after days out exploring all the lovely sites, beaches and towns of the area.  45 minutes by car to Lisbon and necessary for days out.']
amazing place very nice lady recommend anywone lovely home situated the quiet village labrugeria perfect for our needs and wonderful hosts great place come home after days out exploring all the lovely sites beaches and towns the area minutes car lisbon and necessary for days out
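
In Python, `\w` also matches digits, underscores and accented letters, which is why the cell above needs a separate digit-removal step. As a minimal sketch of a stricter variant (an assumption, not the cleaning used in this notebook), the same steps can be written with an explicit `a-z` character class. `clean_text` and `sample` are hypothetical names, and unlike the `\w` version this variant also drops non-ASCII letters.

```python
import re

def clean_text(text):
    """Lowercase, keep only a-z and spaces, drop words shorter than 3 letters."""
    text = text.lower()
    text = re.sub(r"[^a-z ]", " ", text)         # anything that is not a-z or a space
    text = re.sub(r"\b[a-z]{1,2}\b", " ", text)  # words of 1 or 2 letters
    return " ".join(text.split())                # collapse whitespace and trim

# Hypothetical sample, loosely based on the review printed above
sample = "45 minutes by car to Lisbon, it's great_value!"
print(clean_text(sample))
```

Whether dropping accented letters is acceptable depends on the data; here it would matter little, since non-English words are filtered out in the next step anyway.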


## Remove stopwords, foreign words and typos
Next we went through a slightly more advanced cleaning process. We used the nltk library, which provides a list of stopwords (very common words that carry little meaning) and a list of “all” English words. We split each string on spaces to obtain a list of words, then iterated over the list to remove all the stop words and any words that are not in the English word list.

In [3]:
#remove stop words and words that dont exist in english dictionary
print(df.iloc[1,:]["filteredReviews"])
import nltk
#nltk.download('stopwords')
#nltk.download('words')
from nltk.corpus import stopwords 

stopWords = set(stopwords.words('english')) 
englishWords = set(nltk.corpus.words.words())

#tokenize words and remove stop words
def tokenizeAndRemoveStopWords(row):
    wordTokens = row.split(" ")
    newList=[]
    for word in wordTokens:
        if (word not in stopWords) and (word in englishWords):
            newList.append(word)
    return newList

df["filteredReviews"] = df["filteredReviews"].apply(lambda x: tokenizeAndRemoveStopWords(x))
print(df.iloc[1,:]["filteredReviews"])

amazing place very nice lady recommend anywone lovely home situated the quiet village labrugeria perfect for our needs and wonderful hosts great place come home after days out exploring all the lovely sites beaches and towns the area minutes car lisbon and necessary for days out
['amazing', 'place', 'nice', 'lady', 'recommend', 'lovely', 'home', 'situated', 'quiet', 'village', 'perfect', 'needs', 'wonderful', 'great', 'place', 'come', 'home', 'days', 'exploring', 'lovely', 'area', 'car', 'necessary', 'days']


## Keep only nouns and adjectives
Next, we used nltk once again to tag each word with its grammatical category (part of speech). We decided that only nouns and adjectives add value to our analysis.

In [4]:
#keep only adjectives (tags starting with J) and common nouns (NN and NNS) -> THIS TAKES A LONG TIME
import nltk
#nltk.download('averaged_perceptron_tagger')

print(df.iloc[1, :]["filteredReviews"])

#tag each word with its part of speech and keep only adjectives and common nouns
def tagAndRemoveWords(row):
    newList=[]
    tagged = nltk.pos_tag(row)
    for entry in tagged :
        if entry[1].startswith("J") or entry[1] == "NN" or entry[1] == "NNS":
            newList.append(entry[0])
    return newList

df["filteredReviews"] = df["filteredReviews"].apply(lambda x: tagAndRemoveWords(x))
print(df.iloc[1, :]["filteredReviews"])

['amazing', 'place', 'nice', 'lady', 'recommend', 'lovely', 'home', 'situated', 'quiet', 'village', 'perfect', 'needs', 'wonderful', 'great', 'place', 'come', 'home', 'days', 'exploring', 'lovely', 'area', 'car', 'necessary', 'days']
['amazing', 'place', 'nice', 'lady', 'home', 'quiet', 'village', 'perfect', 'wonderful', 'great', 'place', 'home', 'days', 'area', 'car', 'necessary', 'days']
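
`nltk.pos_tag` returns `(word, tag)` pairs using Penn Treebank tags, where `JJ`, `JJR` and `JJS` are adjectives, `NN` and `NNS` are common nouns, and `NNP` and `NNPS` are proper nouns. The filter itself can be sketched without running the tagger. The tags below are written by hand for illustration and are not tagger output, and `keep_adjectives_and_common_nouns` is a hypothetical name.

```python
# Hand-written (word, tag) pairs, for illustration only (not real tagger output)
tagged = [
    ("amazing", "JJ"),    # adjective
    ("place", "NN"),      # singular common noun
    ("recommend", "VB"),  # verb
    ("days", "NNS"),      # plural common noun
    ("lisbon", "NNP"),    # proper noun
    ("nicest", "JJS"),    # superlative adjective
]

def keep_adjectives_and_common_nouns(pairs):
    """Keep words tagged as adjectives (J*) or common nouns (NN, NNS)."""
    return [word for word, tag in pairs
            if tag.startswith("J") or tag in ("NN", "NNS")]

kept = keep_adjectives_and_common_nouns(tagged)
print(kept)
```

Comparing against `NN` and `NNS` exactly, rather than using `startswith("NN")`, is what excludes proper nouns.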


## Stemming

We again used nltk to stem the words, i.e., reduce every word to its most elementary stem. For instance, “communication” stems to “commun”. However, since these stems are often not English words and are difficult to read, we replaced each stem, for simplicity, with the first word that was reduced to it.

In [5]:
#stemmatize words
print(df.iloc[1, :]["filteredReviews"])
import nltk
from nltk.stem.porter import PorterStemmer
englishWords = set(nltk.corpus.words.words())
stemmedWords = {}

def stemmatizeWords(row):
    stemmer = PorterStemmer()
    newList = []
    for word in row:
        stemmedWord = stemmer.stem(word)
        if stemmedWord not in stemmedWords: #remember the first word seen for each stem, so we output a readable English word instead of the stem
            stemmedWords[stemmedWord] = word
        newList.append(stemmedWords[stemmedWord])
    return newList

df["filteredReviews"] = df["filteredReviews"].apply(lambda x: stemmatizeWords(x))
df = df[df['filteredReviews'].map(lambda d: len(d)) > 0] #remove all rooms with empty list of words
print(df.iloc[1, :]["filteredReviews"])

['amazing', 'place', 'nice', 'lady', 'home', 'quiet', 'village', 'perfect', 'wonderful', 'great', 'place', 'home', 'days', 'area', 'car', 'necessary', 'days']
['amazing', 'place', 'nice', 'lady', 'home', 'quiet', 'village', 'perfect', 'wonderful', 'great', 'place', 'home', 'days', 'area', 'car', 'necessary', 'days']
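
The "first word seen per stem" substitution can be illustrated in isolation. This is a minimal sketch in which `toy_stem` is a hypothetical stand-in for the Porter stemmer (it only strips a few suffixes), so its stems will not match nltk's; `canonicalize` is likewise an illustrative name.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def canonicalize(words, stem=toy_stem):
    """Replace each word with the first word seen that shares its stem."""
    first_seen = {}
    result = []
    for word in words:
        # setdefault stores the word only if the stem has not been seen yet
        result.append(first_seen.setdefault(stem(word), word))
    return result

canonical = canonicalize(["days", "day", "walking", "walked", "walk"])
print(canonical)
```

One consequence of this design is that the representative word depends on the order in which rooms are processed: here "day" is reported as "days" simply because "days" came first.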


## MatrixTerm and word frequency

Using sklearn, we converted the “bag of words” format to a matrix with the rooms in the rows and every word in the columns. Each value is the frequency of the word for that room:

In [6]:
#create document term matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data_cv = cv.fit_transform([' '.join(map(str, l)) for l in df['filteredReviews']])
matrixTerm = pandas.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out()) #requires scikit-learn >= 1.0
matrixTerm.index = df.index
matrixTerm.head()

id,abb,ability,able,abnormal,abode,abrasive,abrupt,absence,absent,absolute,...,yummy,zag,zenith,zero,zest,zig,zigzag,zip,zone,zoo
21888626,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23720498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22981563,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5371649,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18837008,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
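
To make the structure of this matrix concrete, here is a minimal sketch that builds the same kind of table with the standard library only. The room names and words are made up for illustration, and this is not how `CountVectorizer` is implemented.

```python
from collections import Counter

# Hypothetical bags of words for two rooms
docs = {
    "room_a": ["clean", "quiet", "clean"],
    "room_b": ["quiet", "beach"],
}

# Vocabulary: every distinct word, sorted alphabetically (the matrix columns)
vocab = sorted({word for words in docs.values() for word in words})

# One row per room, one count per vocabulary word
matrix = {}
for room, words in docs.items():
    counts = Counter(words)
    matrix[room] = [counts[word] for word in vocab]  # missing words count as 0

print(vocab)
for room, row in matrix.items():
    print(room, row)
```

Most entries in a real document-term matrix are zero, as the output above shows, because each room uses only a small part of the vocabulary.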


In [7]:
#remove rooms with too few words
from GeneralFunctions import removeOutliers
print(len(matrixTerm))
matrixTerm["totalWords"] = matrixTerm.sum(axis=1) #total number of words per room
matrixTerm=removeOutliers(matrixTerm, "totalWords", onlyLower=True)
print(len(matrixTerm))
del matrixTerm["totalWords"]

1115
1115


## Most common words

For each word, we count the number of rooms in which it appears at least once.

In [8]:
# first we are going to transpose the dataframe
matrixTermT = matrixTerm.transpose()
matrixTermT.head()

id,21888626,23720498,22981563,5371649,18837008,10419816,7354154,21877376,10677809,23156578,...,18092477,17005931,19756855,27901668,15006723,14574388,3167172,11964076,11790067,3671486
abb,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ability,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
able,0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
abnormal,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abode,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# now we count, for each word, the number of rooms in which it appears at least once
wordCount = matrixTermT.astype(bool).sum(axis=1)
pandas.DataFrame(wordCount.sort_values(ascending=False), columns = ["count"]).transpose()

word,great,place,nice,clean,stay,good,host,location,everything,apartment,...,loss,lord,longue,longish,lonesome,loin,logia,loggia,loco,abb
count,1003,1000,951,943,931,887,887,887,877,850,...,1,1,1,1,1,1,1,1,1,1


In [10]:
#we do not want to analyse words that appear in too many or too few rooms: keep those present in more than 20% and fewer than 80% of rooms
rn = matrixTermT.shape[1]
commonWords = wordCount[(wordCount/rn > 0.2) & (wordCount/rn < 0.8)]
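
The counting and band filter from the last two cells can be sketched on toy data. The rooms and words below are made up; the 20% and 80% thresholds are the ones used above.

```python
from collections import Counter

# Hypothetical bags of words for five rooms
rooms = {
    "r1": ["great", "clean", "price"],
    "r2": ["great", "clean"],
    "r3": ["great", "quiet"],
    "r4": ["great", "clean", "quiet"],
    "r5": ["great", "loggia"],
}

# Document frequency: number of rooms in which each word appears at least once
doc_freq = Counter(word for words in rooms.values() for word in set(words))

# Keep words present in more than 20% and fewer than 80% of rooms
n_rooms = len(rooms)
common_words = sorted(
    word for word, count in doc_freq.items()
    if 0.2 < count / n_rooms < 0.8
)
print(common_words)
```

Words that appear almost everywhere (like "great" here) cannot discriminate between rooms, and words that appear in only a handful of rooms give correlations that rest on too few observations.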

In [11]:
#correlation between the words and ratingC

df['ratingC'].value_counts().plot(kind='bar')

#join the word counts with ratingC (rooms with ratingC<4 are very few, but no filtering is applied in this cell)
result = pandas.concat([matrixTerm, df["ratingC"]], axis=1, join='inner')

newDf = pandas.DataFrame()

for word in commonWords.index.values:
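    newDf.loc[word,"corr"] = result[word].corr(result["ratingC"])

#keep words with |corr| > 0.1; .abs() is applied before sorting, so the output shows magnitudes only and the sign is lost
relevantWords = newDf[(newDf["corr"]>0.1) | (newDf["corr"]<-0.1)].abs().sort_values(by=["corr"], ascending=False)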
relevantWords

word,corr
price,0.170882
super,0.163266
hospitality,0.156517
best,0.153108
welcome,0.151044
home,0.146134
thoughtful,0.146067
wonderful,0.142164
stylish,0.141753
excellent,0.137698


In [12]:
#we now have to manually select words that do not give us any information
removeWords=["anyone", "full", "everything", "thank", "many", "much", "lots", "perfect", "feel", "make", "gem", "local",
             "visit", "gorgeous", "return", "superb", "incredible", "trip", "wish", "enjoy", "amazing", "recommend",
             "fantastic", "excellent", "wonderful", "home", "welcome", "best", "outstanding", "super", "stay", "felt", "great",
            "sure", "anything", "hope", "plenty", "love", "apartment", "everyone", "hesitate", "awesome", "host", "hostess", "husband",
            "way", "walk", "want", "pleasure", "explore", "real", "enough", "times", "better", "travel", "spent", "moment", "fine",
             "experience", "place"]

removeWords = [c for c in removeWords if c in relevantWords.index.values]

relevantWords = relevantWords.drop(removeWords)
relevantWords

word,corr
price,0.170882
hospitality,0.156517
thoughtful,0.146067
stylish,0.141753
beautiful,0.135555
special,0.126767
attentive,0.124101
comfortable,0.120728
warm,0.113756
fresh,0.113617


In [13]:
#create wordcloud with most relevant words
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import requests
import numpy as np
import matplotlib.pyplot as plt


words = " ".join(relevantWords.index.values)
mask = np.array(Image.open(requests.get('http://www.clker.com/cliparts/O/i/x/Y/q/P/yellow-house-hi.png', stream=True).raw))

word_cloud = WordCloud(width = 512, height = 512, background_color='white', stopwords=STOPWORDS, mask=mask).generate(words)
plt.figure(figsize=(10,8),facecolor = 'white', edgecolor='blue')
plt.imshow(word_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.savefig("WordCloudHouse.png")
plt.show()
    

ModuleNotFoundError: No module named 'wordcloud'

## Scatter plot
We can immediately see that the only word negatively correlated with the rating is “price”. This probably means that people only talk about price when they are unhappy.
It is also interesting to note that, of the 18 words, the “hospitality” topic is the most common, including “thoughtful”, “kind”, “helpful” and “attentive”. So, for an Airbnb host, hospitality and giving away food, namely wine and breakfast, would be an excellent combination.

Note, however, that all these words correlate very weakly with the rating, with absolute values between 0.1 and 0.2. This is somewhat expected given the nature of text variables, and of word frequencies in particular.


In [None]:
import seaborn as sns

for word in relevantWords.index.values:
    sns.lmplot(x=word, y='ratingC', data=result, fit_reg=True)

## Multiple Linear regression

This gave very bad results, so we decided not to include it in the report.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

X = result.drop(columns=result.columns.difference(relevantWords.index.values)) #keep only the relevant word columns
Y = result["ratingC"]

columns = X.columns.values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                  random_state=1)

reg = LinearRegression()
reg.fit(X_train[columns], Y_train)


Y_predicted = reg.predict(X_test[columns])
print("Mean squared error: %.2f" % mean_squared_error(Y_test, Y_predicted))
print('R²: %.2f' % r2_score(Y_test, Y_predicted))


fig, ax = plt.subplots()
ax.scatter(Y_test, Y_predicted)
ax.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], 'k--', lw=4)
ax.set_xlabel('measured')
ax.set_ylabel('predicted')
plt.show()
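
To help read the R² printed by the cell above, here is a minimal sketch of how the coefficient of determination is defined, on made-up numbers. The function name `r_squared` and the data are hypothetical; `sklearn.metrics.r2_score` computes the same quantity for a single target.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up ratings
y_true = [5, 5, 4, 3, 5]

perfect = r_squared(y_true, y_true)          # predictions equal to the truth
baseline = r_squared(y_true, [4.4] * 5)      # always predicting the mean rating
worse = r_squared(y_true, [3, 3, 5, 5, 3])   # predictions worse than the mean

print(perfect, baseline, worse)
```

An R² close to 0 means the model does no better than always predicting the mean rating, and a negative value means it does worse.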