# AirBnB Sentiment Analysis

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Project Scope
 - Apply sentiment analysis to the reviews of each ad to assess how the ad itself is
    rated.

 - Search for relationships between the price of a room and the day of the week, holidays,
    and time of year, and between the price and the characteristics of a room, in order
    to forecast prices.

  Note that the analysis is unsupervised.

Dataset: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data

In [2]:
import pandas as pd
import gensim
import nltk



## Import of the reviews' dataset

The dataset used throughout this analysis is the one generated in the previous step, in
which each review was labelled with its detected language.

In [3]:
dfReviews = pd.read_csv('reviews_summary_langs.csv')
dfReviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,Lang
0,2015,69544350,2016-04-11,7178145,Rahel,mein freund und ich hatten gute gemütliche vie...,de
1,2015,69990732,2016-04-15,41944715,Hannah,jan was very friendly and welcoming host! the ...,en
2,2015,71605267,2016-04-26,30048708,Victor,un appartement tres bien situé dans un quartie...,fr
3,2015,73819566,2016-05-10,63697857,Judy,"it is really nice area, food, park, transport ...",en
4,2015,74293504,2016-05-14,10414887,Romina,"buena ubicación, el departamento no está orden...",es


In [4]:
dfReviews['Lang'].unique()

array(['de', 'en', 'fr', 'es', 'no', 'ro', 'ca', 'sv', 'pt', 'it', 'ko',
       'nl', 'af', 'ru', 'zh-cn', 'fi', 'da', 'hu', 'None', 'vi', 'ja',
       'pl', 'cy', 'id', 'cs', 'et', 'hr', 'el', 'tr', 'sl', 'so',
       'zh-tw', 'tl', 'sk', 'sq', 'sw', 'uk', 'lv', 'mk', 'he', 'lt',
       'bg', 'th', 'ar'], dtype=object)

The rows whose 'Lang' column has the value 'None' are those for which language detection
failed in the previous step.
The likely causes are that the detection technique could not identify the language or that
the review was too short.

In [5]:
dfNoneLangReviews = dfReviews[dfReviews['Lang'] == 'None']
print(f'Number of reviews with None language: {dfNoneLangReviews.shape[0]}')
print(f'Percentage of reviews with None language: '
      f'{round(dfNoneLangReviews.shape[0] * 100 / dfReviews.shape[0],2)}%')

Number of reviews with None language: 648
Percentage of reviews with None language: 0.16%


Only the reviews written in English are of interest for this analysis.

In [6]:
dfEnglishReviews = dfReviews[dfReviews['Lang'] == 'en']
dfEnglishReviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,Lang
1,2015,69990732,2016-04-15,41944715,Hannah,jan was very friendly and welcoming host! the ...,en
3,2015,73819566,2016-05-10,63697857,Judy,"it is really nice area, food, park, transport ...",en
6,2015,76603178,2016-05-28,29323516,Laurent,"we had a very nice stay in berlin, thanks to j...",en
7,2015,77296201,2016-05-31,9025122,Rasmus,"great location close to mauerpark, kastanienal...",en
9,2015,82322683,2016-06-27,73902920,Mag,"apartment very well located, close to everythi...",en


In [7]:
dfEnglishReviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 268508 entries, 1 to 401466
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     268508 non-null  int64 
 1   id             268508 non-null  int64 
 2   date           268508 non-null  object
 3   reviewer_id    268508 non-null  int64 
 4   reviewer_name  268508 non-null  object
 5   comments       268508 non-null  object
 6   Lang           268508 non-null  object
dtypes: int64(3), object(4)
memory usage: 16.4+ MB


## Duplicates removal

The first step requires the removal of the duplicated reviews.

In [8]:
print('Number of English reviews: {}'.format(dfEnglishReviews.shape[0]))
print('Number of unique English reviews: {}'.format(len(dfEnglishReviews['comments'].unique())))

Number of English reviews: 268508
Number of unique English reviews: 258767


In [9]:
dfEnglishReviews = dfEnglishReviews.drop_duplicates(subset='comments')
print(f'Number of reviews after duplicate removal: {dfEnglishReviews.shape[0]}')

Number of reviews after duplicate removal: 258767


## Non-English words removal

In [33]:
dfEnglishReviews['comments'].iloc[172]

"britta’s apartment is perfect! clean, bright, fully-appointed , cozy. there is very convenience area, many cafes and restaurant around but quiet. it is situated few minutes from train and tram station. it made us perfect stay in berlin. we didn't have chance to meet britta but she is very kindness and has strong sense of responsibility, prepared  many tips and information for us. thanks a lot.\r\nとても便利な場所にあり､ﾄﾗﾑや駅から数分で､きれいで明るく､設備の整った､居心地の良い完璧な部屋です｡周辺には沢山のｶﾌｪやﾚｽﾄﾗﾝがあり､暮らすように旅する滞在にﾋﾟｯﾀﾘです｡brittaに会う機会はありませんでしたが､周辺情報などをまとめたものを用意してくれていて､大変親切にして頂きました｡"

In [34]:
from re import sub

# Replace every character that is not an ASCII letter with a space.
dfEnglishReviews['comments'] = dfEnglishReviews['comments'].apply(
    lambda comment: sub(r"[^A-Za-z]", " ", comment))
dfEnglishReviews['comments'].iloc[172]

'britta s apartment is perfect  clean  bright  fully appointed   cozy  there is very convenience area  many cafes and restaurant around but quiet  it is situated few minutes from train and tram station  it made us perfect stay in berlin  we didn t have chance to meet britta but she is very kindness and has strong sense of responsibility  prepared  many tips and information for us  thanks a lot                                                                                               britta                                                     '
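
The substitution above replaces every character outside 'A-Za-z' with a space. This removes the Japanese text, but also punctuation, digits, and apostrophes (hence 'didn t' in the output). A minimal sketch of the same idea on a short string follows; 'keep_latin_letters' and the sample text are illustrative names that do not appear in the notebook, and the whitespace collapsing is an optional extra step that the notebook does not perform.

```python
import re


def keep_latin_letters(text):
    # Replace every character that is not an ASCII letter with a space.
    return re.sub(r"[^A-Za-z]", " ", text)


sample = "we didn't meet britta.\r\nとても便利"
cleaned = keep_latin_letters(sample)

# Optional: collapse the runs of spaces left behind by the substitution.
collapsed = " ".join(cleaned.split())
print(collapsed)
```

Because accented letters also fall outside 'A-Za-z', this approach would split words such as place names with umlauts; that may be acceptable here since only English reviews are kept.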

## Tokenization

To prepare the data for the model, the reviews must first be tokenized.
The 'gensim' library is used for this purpose.

In [35]:
tokenizedEnglishReviews = dfEnglishReviews['comments'].apply(
    lambda comment: gensim.utils.simple_preprocess(str(comment)))
tokenizedEnglishReviews

1         [jan, was, very, friendly, and, welcoming, hos...
3         [it, is, really, nice, area, food, park, trans...
6         [we, had, very, nice, stay, in, berlin, thanks...
7         [great, location, close, to, mauerpark, kastan...
9         [apartment, very, well, located, close, to, ev...
                                ...                        
401442    [great, place, to, stay, and, bit, far, though...
401443    [the, place, is, great, very, spacey, and, cle...
401445    [this, appartment, is, super, comfortable, and...
401453          [nice, quite, close, to, the, center, walk]
401462    [the, host, canceled, this, reservation, days,...
Length: 258767, dtype: object
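
The tokenizer used above lowercases the text and, with its default settings, discards tokens shorter than 2 or longer than 15 characters, which is why single-letter words such as 'a' are missing from the output. The following is only a rough pure-Python approximation of that behaviour, assuming input that already contains nothing but ASCII letters and spaces; 'simple_tokenize' is a hypothetical helper, not the gensim implementation.

```python
import re


def simple_tokenize(text, min_len=2, max_len=15):
    # Lowercase, extract runs of ASCII letters, and drop tokens whose
    # length falls outside [min_len, max_len].
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if min_len <= len(token) <= max_len]


tokens = simple_tokenize("we had a very nice stay in berlin")
print(tokens)
```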

## Normalization

Another important step is the normalization of the reviews.
The 'nltk' library is used for this purpose.

In particular, the 'wordnet' and 'averaged_perceptron_tagger' packages are downloaded from
the 'nltk' resources.
The first package provides the data behind the 'WordNetLemmatizer', which converts a word
into its base form given the word and its part of speech.
The second package provides a part-of-speech tagger that assigns each word a tag
representing its grammatical category.

In [11]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [12]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag

def lemmatize_reviews(tokenized_reviews):
    """Lemmatize every token, using its POS tag to choose the WordNet word class."""
    lemmatizer = WordNetLemmatizer()
    lemmatized_reviews = []
    for tokens_review in tokenized_reviews:
        lemmatized_review = []
        for word, tag in pos_tag(tokens_review):
            # Map the Penn Treebank tag to a WordNet POS: noun, verb, or
            # adjective as the fallback for every other tag.
            if tag.startswith('NN'):
                pos = 'n'
            elif tag.startswith('VB'):
                pos = 'v'
            else:
                pos = 'a'
            lemmatized_review.append(lemmatizer.lemmatize(word, pos))
        lemmatized_reviews.append(lemmatized_review)

    return lemmatized_reviews
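
The tag-to-word-class mapping inside 'lemmatize_reviews' can be looked at in isolation. The sketch below reproduces only that mapping, without calling 'nltk'; 'penn_to_wordnet_pos' is a hypothetical helper name, and the example tags are standard Penn Treebank tags chosen for illustration.

```python
def penn_to_wordnet_pos(tag):
    # Nouns (NN, NNS, NNP, ...) map to 'n', verbs (VB, VBD, VBG, ...) to 'v',
    # and every other tag falls back to 'a' (adjective), as in the notebook.
    if tag.startswith('NN'):
        return 'n'
    if tag.startswith('VB'):
        return 'v'
    return 'a'


for tag in ['NNS', 'VBD', 'JJ', 'RB']:
    print(tag, '->', penn_to_wordnet_pos(tag))
```

One consequence of the fallback is that adverbs, pronouns, and determiners are all lemmatized as if they were adjectives, which in most cases leaves the word unchanged.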

In [36]:
lemmatizedTokenizedEnglishReviews = lemmatize_reviews(tokenizedEnglishReviews)
lemmatizedTokenizedEnglishReviews[:5]

[['jan',
  'be',
  'very',
  'friendly',
  'and',
  'welcome',
  'host',
  'the',
  'apartment',
  'be',
  'great',
  'and',
  'the',
  'area',
  'be',
  'sooo',
  'amaze',
  'lot',
  'of',
  'nice',
  'cafe',
  'and',
  'shop',
  'enjoy',
  'my',
  'time',
  'there',
  'lot'],
 ['it',
  'be',
  'really',
  'nice',
  'area',
  'food',
  'park',
  'transport',
  'be',
  'perfect'],
 ['we',
  'have',
  'very',
  'nice',
  'stay',
  'in',
  'berlin',
  'thanks',
  'to',
  'jan',
  'premium',
  'situate',
  'apartment',
  'the',
  'place',
  'isn',
  'big',
  'but',
  'be',
  'quiet',
  'and',
  'functional',
  'also',
  'it',
  'situate',
  'in',
  'perfect',
  'neighbourhood',
  'jan',
  'be',
  'very',
  'welcome',
  'host',
  'eager',
  'to',
  'help',
  'you',
  'if',
  'need',
  'or',
  'to',
  'provide',
  'you',
  'any',
  'kind',
  'of',
  'information',
  'he',
  'also',
  'have',
  'very',
  'good',
  'advice',
  'on',
  'biergarten'],
 ['great',
  'location',
  'close',
  'to',

## Stop words removal

Next, the stop words are removed.
For this purpose, the 'stopwords' corpus of the 'nltk' library is used.

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
from nltk.corpus import stopwords

stopWords = stopwords.words('english')
stopWords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [16]:
def remove_stop_words(tokenized_reviews, stop_words):
    # A set gives constant-time membership checks, which matters on ~260k reviews.
    stop_words = set(stop_words)
    tokenized_reviews_without_stopwords = []
    for tokenized_review in tokenized_reviews:
        tokenized_reviews_without_stopwords.append(
            [word for word in tokenized_review if word not in stop_words]
        )
    return tokenized_reviews_without_stopwords

In [39]:
lemmatizedTokenizedEnglishReviewsWithoutStopWords = remove_stop_words(
    lemmatizedTokenizedEnglishReviews, stopWords)
lemmatizedTokenizedEnglishReviewsWithoutStopWords[:5]

[['jan',
  'friendly',
  'welcome',
  'host',
  'apartment',
  'great',
  'area',
  'sooo',
  'amaze',
  'lot',
  'nice',
  'cafe',
  'shop',
  'enjoy',
  'time',
  'lot'],
 ['really', 'nice', 'area', 'food', 'park', 'transport', 'perfect'],
 ['nice',
  'stay',
  'berlin',
  'thanks',
  'jan',
  'premium',
  'situate',
  'apartment',
  'place',
  'big',
  'quiet',
  'functional',
  'also',
  'situate',
  'perfect',
  'neighbourhood',
  'jan',
  'welcome',
  'host',
  'eager',
  'help',
  'need',
  'provide',
  'kind',
  'information',
  'also',
  'good',
  'advice',
  'biergarten'],
 ['great',
  'location',
  'close',
  'mauerpark',
  'kastanienallee',
  'rosenthaler',
  'platz',
  'lot',
  'bar',
  'restaurant',
  'nearby',
  'jan',
  'friendly',
  'service',
  'mind'],
 ['apartment',
  'well',
  'locate',
  'close',
  'everything',
  'supermarket',
  'transport',
  'city',
  'center',
  'quiet',
  'night',
  'apartment',
  'locate',
  'inside',
  'building',
  'basic',
  'equipment',

## Bigrams generation

In [42]:
from gensim.models.phrases import Phrases, Phraser

phrases = Phrases(lemmatizedTokenizedEnglishReviewsWithoutStopWords, min_count=3, progress_per=50000)
bigram = Phraser(phrases)
sentences = bigram[lemmatizedTokenizedEnglishReviewsWithoutStopWords]
sentences[1]

['really', 'nice', 'area', 'food', 'park', 'transport', 'perfect']
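
To give an intuition for how two adjacent words end up merged into a single token such as 'light_airy', the sketch below scores word pairs with the formula used by the default scorer of gensim's 'Phrases': (pair count - min_count) * vocabulary size / (count of first word * count of second word). It is a simplified illustration, not the library code: 'bigram_scores' and the toy corpus are hypothetical, and the vocabulary size is taken here as the number of distinct single words, which is an assumption. In gensim, pairs whose score exceeds a threshold are joined with an underscore.

```python
from collections import Counter


def bigram_scores(sentences, min_count=3):
    # Count single words and adjacent word pairs across all sentences.
    unigrams = Counter(word for sentence in sentences for word in sentence)
    bigrams = Counter(
        pair for sentence in sentences for pair in zip(sentence, sentence[1:])
    )
    vocab_size = len(unigrams)  # assumption: distinct single words only
    return {
        (first, second): (count - min_count) * vocab_size
        / (unigrams[first] * unigrams[second])
        for (first, second), count in bigrams.items()
    }


# Toy corpus (made up for illustration): 'light airy' co-occurs often.
toy_corpus = [["light", "airy", "flat"]] * 5 + [["light", "room"], ["airy", "room"]]
scores = bigram_scores(toy_corpus, min_count=3)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs seen fewer than 'min_count' times receive a negative score, so they can never be merged regardless of the threshold.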

In [43]:
from collections import defaultdict

word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
len(word_freq)

64487

In [44]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['great',
 'apartment',
 'stay',
 'place',
 'berlin',
 'nice',
 'location',
 'host',
 'clean',
 'good']

## Word2Vec model

In [45]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(min_count=20,
                     window=4,
                     vector_size=300,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=4)

In [46]:
from time import time

t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

Time to build vocab: 0.13 mins


In [47]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 5.65 mins


In [48]:
# w2v_model.save("word2vec.model")

## Exploring the model

In [49]:
w2v_model.wv.most_similar(positive=["apartment"])

[('flat', 0.6814600825309753),
 ('spacious', 0.5264761447906494),
 ('modern', 0.48734188079833984),
 ('studio', 0.4537934958934784),
 ('clean', 0.4438861906528473),
 ('bright', 0.4140062928199768),
 ('beautifully_appoint', 0.3940298855304718),
 ('light_airy', 0.39114996790885925),
 ('modern_furnishing', 0.3764360547065735),
 ('location', 0.37172961235046387)]
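
The scores returned above are cosine similarities between word vectors. As a minimal sketch of what that number means, the function below computes the cosine similarity of two vectors with NumPy; the vectors are made-up two-dimensional examples, not values from the trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

A value close to 1 means the two words appear in similar contexts, while a value close to 0 means their contexts are largely unrelated.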

## Clustering model

In [50]:
from sklearn.cluster import KMeans
import numpy as np

# Use an explicit integer seed so that the clustering is reproducible.
kmeans_model = KMeans(n_clusters=2, max_iter=1000, random_state=1, n_init=50)
kmeans_model.fit(X=w2v_model.wv.vectors.astype('double'))

KMeans(max_iter=1000, n_clusters=2, n_init=50, random_state=1)

In [55]:
w2v_model.wv.similar_by_vector(kmeans_model.cluster_centers_[0], topn=10, restrict_vocab=None)

[('useless', 0.6690021753311157),
 ('unacceptable', 0.6566799283027649),
 ('due_lack', 0.6558582782745361),
 ('replacement', 0.6488563418388367),
 ('rug', 0.6485123634338379),
 ('trap', 0.634070634841919),
 ('smelly', 0.6288523077964783),
 ('take_trash', 0.6276218891143799),
 ('electricity', 0.6265416145324707),
 ('mop', 0.6223761439323425)]

In [56]:
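# Cluster 0 is treated as the negative one because the words closest to its
# centre (see the previous output) carry a negative meaning.
negative_cluster_index = 0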
negative_cluster_center = kmeans_model.cluster_centers_[negative_cluster_index]
positive_cluster_center = kmeans_model.cluster_centers_[1-negative_cluster_index]

In [57]:
# One row per vocabulary word (gensim >= 4 exposes the vocabulary as key_to_index).
dfWords = pd.DataFrame({'words': list(w2v_model.wv.key_to_index.keys())})
dfWords['vectors'] = dfWords.words.apply(lambda x: w2v_model.wv[x])
# Assign every word vector to its nearest cluster in a single vectorized call.
dfWords['cluster'] = kmeans_model.predict(
    np.vstack(dfWords.vectors.to_list()).astype('double'))
dfWords.head()

Unnamed: 0,words,vectors,cluster
0,great,"[0.023204772, 0.059004717, -0.3098381, 0.60022...",1
1,apartment,"[0.0007390261, -0.41159153, 0.41341403, -0.402...",1
2,stay,"[0.33254883, 0.40867707, 0.19273868, -0.435026...",1
3,place,"[0.011042707, -0.16829014, -0.29300955, 0.0513...",1
4,berlin,"[0.24693859, 0.07494845, -0.36934593, -0.03560...",1


In [59]:
dfWords['cluster_value'] = [-1 if i==negative_cluster_index else 1 for i in dfWords.cluster]
dfWords['closeness_score'] = dfWords.apply(lambda x: 1/(kmeans_model.transform([x.vectors]).min()), axis=1)
dfWords['sentiment_coeff'] = dfWords.closeness_score * dfWords.cluster_value
dfWords[dfWords['cluster_value'] == -1].head()

Unnamed: 0,words,vectors,cluster,cluster_value,closeness_score,sentiment_coeff
11,room,"[-0.6147242, -0.74645656, 0.66465837, 0.408310...",0,-1,0.136778,-0.136778
20,need,"[0.7595736, -0.007738052, 0.30257878, 0.175887...",0,-1,0.130511,-0.130511
40,like,"[0.2302345, 0.010540499, 0.1864814, 0.1680768,...",0,-1,0.131209,-0.131209
45,one,"[-0.3450254, -0.36648145, 1.055815, -0.6519758...",0,-1,0.161927,-0.161927
46,bed,"[-0.49814302, -0.77044404, 0.9306575, -0.22167...",0,-1,0.107062,-0.107062
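
The sentiment coefficient computed above is the inverse of the distance between a word vector and its nearest cluster centre, with a negative sign when that centre belongs to the negative cluster. The sketch below restates this calculation with plain NumPy on made-up two-dimensional data; 'sentiment_coefficient' and the example centres are hypothetical and stand in for 'kmeans_model.cluster_centers_'.

```python
import numpy as np


def sentiment_coefficient(vector, centers, negative_index=0):
    # Euclidean distance from the vector to every cluster centre.
    distances = np.linalg.norm(centers - vector, axis=1)
    nearest = int(np.argmin(distances))
    # The sign comes from the cluster, the magnitude from the closeness.
    sign = -1 if nearest == negative_index else 1
    return sign / distances[nearest]


centers = np.array([[0.0, 0.0], [10.0, 0.0]])  # made-up centres
near_negative = sentiment_coefficient(np.array([3.0, 4.0]), centers)
near_positive = sentiment_coefficient(np.array([10.0, 2.0]), centers)
print(near_negative, near_positive)
```

Words lying close to a centre therefore receive a large absolute score, while words far from both centres receive a score near zero.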
