# AirBnB Sentiment Analysis

In [8]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Project Scope
 - Apply sentiment analysis to the reviews of each listing to estimate how the listing
    itself is rated.

 - Search for relationships between the price of a room and the day of the week, holidays,
    and time of year, as well as between the price and the characteristics of a room, in
    order to build a price forecast.

  Note: the analysis is unsupervised.

Dataset: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data

In [9]:
import pandas as pd
import gensim
import nltk

## Import of the reviews' dataset

The dataset used throughout this analysis is the one generated in the previous step.

In [10]:
dfReviews = pd.read_csv('reviews_summary_langs.csv')
dfReviews.head()

Unnamed: 0.1,Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,Lang
0,0,2015,69544350,2016-04-11,7178145,Rahel,mein freund und ich hatten gute gemütliche vie...,de
1,1,2015,69990732,2016-04-15,41944715,Hannah,jan was very friendly and welcoming host! the ...,en
2,2,2015,71605267,2016-04-26,30048708,Victor,un appartement tres bien situé dans un quartie...,fr
3,3,2015,73819566,2016-05-10,63697857,Judy,"it is really nice area, food, park, transport ...",en
4,4,2015,74293504,2016-05-14,10414887,Romina,"buena ubicación, el departamento no está orden...",es


In [11]:
dfReviews['Lang'].unique()

array(['de', 'en', 'fr', 'es', 'no', 'ro', 'ca', 'sv', 'pt', 'it', 'ko',
       'nl', 'af', 'ru', 'zh-cn', 'fi', 'da', 'hu', 'None', 'vi', 'ja',
       'pl', 'cy', 'id', 'cs', 'et', 'hr', 'el', 'tr', 'sl', 'so',
       'zh-tw', 'tl', 'sk', 'sq', 'sw', 'uk', 'lv', 'mk', 'he', 'lt',
       'bg', 'th', 'ar'], dtype=object)

The rows whose 'Lang' column contains the value 'None' are those that caused problems in
the previous step.
The possible causes are that the technique used could not detect the language of the
review, or that the review was too short.

In [18]:
dfNoneLangReviews = dfReviews[dfReviews['Lang'] == 'None']
print(f'Number of reviews with None language: {dfNoneLangReviews.shape[0]}')
print(f'Percentage of reviews with None language: '
      f'{round(dfNoneLangReviews.shape[0] * 100 / dfReviews.shape[0],2)}%')

Number of reviews with None language: 648
Percentage of reviews with None language: 0.16%


The reviews written in English are the ones of interest for this analysis.

In [13]:
dfEnglishReviews = dfReviews[dfReviews['Lang'] == 'en']
dfEnglishReviews.head()

Unnamed: 0.1,Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,Lang
1,1,2015,69990732,2016-04-15,41944715,Hannah,jan was very friendly and welcoming host! the ...,en
3,3,2015,73819566,2016-05-10,63697857,Judy,"it is really nice area, food, park, transport ...",en
6,6,2015,76603178,2016-05-28,29323516,Laurent,"we had a very nice stay in berlin, thanks to j...",en
7,7,2015,77296201,2016-05-31,9025122,Rasmus,"great location close to mauerpark, kastanienal...",en
9,9,2015,82322683,2016-06-27,73902920,Mag,"apartment very well located, close to everythi...",en


In [14]:
dfEnglishReviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 268508 entries, 1 to 401466
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Unnamed: 0     268508 non-null  int64 
 1   listing_id     268508 non-null  int64 
 2   id             268508 non-null  int64 
 3   date           268508 non-null  object
 4   reviewer_id    268508 non-null  int64 
 5   reviewer_name  268508 non-null  object
 6   comments       268508 non-null  object
 7   Lang           268508 non-null  object
dtypes: int64(4), object(4)
memory usage: 18.4+ MB


## Duplicates removal

The first step is to remove the duplicated reviews.

In [24]:
print('Number of English reviews: {}'.format(dfEnglishReviews.shape[0]))
print('Number of unique English reviews: {}'.format(len(dfEnglishReviews['comments'].unique())))

Number of English reviews: 268508
Number of unique English reviews: 258767


In [27]:
dfEnglishReviews = dfEnglishReviews.drop_duplicates(subset='comments')
print(f'Number of reviews after the duplicated removal: {dfEnglishReviews.shape[0]}')

Number of reviews after the duplicated removal: 258767
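
As a minimal sketch of what `drop_duplicates(subset='comments')` does with its default `keep='first'`, the same logic can be written in plain Python on a toy list of rows. The rows and the helper name `drop_duplicate_comments` are illustrative assumptions; they are not part of the dataset or of 'pandas'.

```python
def drop_duplicate_comments(rows):
    """Keep the first row for each distinct 'comments' value, preserving order."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row['comments'] not in seen:
            seen.add(row['comments'])
            unique_rows.append(row)
    return unique_rows


# Toy rows, invented for illustration only.
rows = [
    {'id': 1, 'comments': 'great location'},
    {'id': 2, 'comments': 'the host canceled this reservation'},
    {'id': 3, 'comments': 'the host canceled this reservation'},
]
unique_rows = drop_duplicate_comments(rows)
print([row['id'] for row in unique_rows])
```

The comparison is an exact string match, so two reviews that differ only in punctuation or spacing are still counted as distinct.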


## Tokenization

To prepare the data for the analysis model, the reviews must first be tokenized.
For this purpose, the 'gensim' library is used.

In [15]:
tokenizedEnglishReviews = dfEnglishReviews.apply(
    lambda x: gensim.utils.simple_preprocess(str(x['comments'])), axis=1)
tokenizedEnglishReviews

1         [jan, was, very, friendly, and, welcoming, hos...
3         [it, is, really, nice, area, food, park, trans...
6         [we, had, very, nice, stay, in, berlin, thanks...
7         [great, location, close, to, mauerpark, kastan...
9         [apartment, very, well, located, close, to, ev...
                                ...                        
401462    [the, host, canceled, this, reservation, days,...
401463    [the, host, canceled, this, reservation, days,...
401464    [the, host, canceled, this, reservation, days,...
401465    [the, host, canceled, this, reservation, days,...
401466    [the, host, canceled, this, reservation, days,...
Length: 268508, dtype: object
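
For intuition, the default behaviour of `simple_preprocess` can be approximated with the standard library: lowercase the text, extract runs of non-digit word characters, and keep only tokens between 2 and 15 characters long. The function `approx_simple_preprocess` below is a hypothetical stand-in written for illustration; it is not the 'gensim' implementation and may differ on edge cases such as underscores or undecodable input.

```python
import re

# Runs of word characters that are not digits (letters and underscore).
TOKEN_PATTERN = re.compile(r'[^\W\d]+')


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if min_len <= len(token) <= max_len]


tokens = approx_simple_preprocess("The place isn't big, but it is a quiet flat!")
print(tokens)
```

This explains two features of the output above: one-letter words such as "a" are missing, and contractions such as "isn't" are reduced to the fragment 'isn'.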

## Normalization

Another important step is normalization, which here means reducing each token to its lemma.
For this purpose, the 'nltk' library is used.

In particular, the 'wordnet' and 'averaged_perceptron_tagger' packages are downloaded from
the 'nltk' resources.

In [None]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

In [35]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag

def lemmatize_sentences(tokenized_sentences):
    """Lemmatize every token of every sentence, using its POS tag as a hint."""
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentences = []
    for tokens_sentence in tokenized_sentences:
        lemmatized_sentence = []
        # pos_tag returns (word, Penn Treebank tag) pairs
        for word, tag in pos_tag(tokens_sentence):
            # Map the tag to a WordNet POS code; adjective is the fallback
            if tag.startswith('NN'):
                pos = 'n'
            elif tag.startswith('VB'):
                pos = 'v'
            else:
                pos = 'a'
            lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
        lemmatized_sentences.append(lemmatized_sentence)

    return lemmatized_sentences
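
The tag handling inside `lemmatize_sentences` can be read in isolation as a mapping from the Penn Treebank tags that `pos_tag` returns by default for English to WordNet part-of-speech codes. A minimal sketch of just that mapping follows; `penn_to_wordnet_pos` is a hypothetical helper name, and the snippet does not call 'nltk'.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the WordNet POS code used for lemmatization."""
    if tag.startswith('NN'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return 'a'      # every other tag is treated as an adjective


for tag in ['NNS', 'VBD', 'JJ', 'RB']:
    print(tag, '->', penn_to_wordnet_pos(tag))
```

With this mapping, adverbs (tag 'RB') also fall into the adjective branch. WordNet has a separate code 'r' for adverbs, so a finer mapping would be possible if adverbs turned out to matter for the analysis.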

In [40]:
lemmatizedTokenizedEnglishReviews = lemmatize_sentences(tokenizedEnglishReviews)
lemmatizedTokenizedEnglishReviews[:5]

[['jan',
  'be',
  'very',
  'friendly',
  'and',
  'welcome',
  'host',
  'the',
  'apartment',
  'be',
  'great',
  'and',
  'the',
  'area',
  'be',
  'sooo',
  'amaze',
  'lot',
  'of',
  'nice',
  'cafe',
  'and',
  'shop',
  'enjoy',
  'my',
  'time',
  'there',
  'lot'],
 ['it',
  'be',
  'really',
  'nice',
  'area',
  'food',
  'park',
  'transport',
  'be',
  'perfect'],
 ['we',
  'have',
  'very',
  'nice',
  'stay',
  'in',
  'berlin',
  'thanks',
  'to',
  'jan',
  'premium',
  'situate',
  'apartment',
  'the',
  'place',
  'isn',
  'big',
  'but',
  'be',
  'quiet',
  'and',
  'functional',
  'also',
  'it',
  'situate',
  'in',
  'perfect',
  'neighbourhood',
  'jan',
  'be',
  'very',
  'welcome',
  'host',
  'eager',
  'to',
  'help',
  'you',
  'if',
  'need',
  'or',
  'to',
  'provide',
  'you',
  'any',
  'kind',
  'of',
  'information',
  'he',
  'also',
  'have',
  'very',
  'good',
  'advice',
  'on',
  'biergarten'],
 ['great',
  'location',
  'close',
  'to',

## Stop words removal


In [37]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [41]:
from nltk.corpus import stopwords

stopWords = stopwords.words('english')
stopWords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [42]:
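def remove_stop_words(tokenized_sentences, stop_words):
    """Return a copy of each tokenized sentence with the stop words filtered out."""
    stop_words = set(stop_words)  # set membership is faster than list membership
    tokenized_sentences_without_stopwords = []
    for tokenized_sentence in tokenized_sentences:
        tokenized_sentences_without_stopwords.append(
            [word for word in tokenized_sentence if word not in stop_words]
        )
    return tokenized_sentences_without_stopwords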
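
Stop-word lists can contain negation fragments: in the outputs of this notebook, the token 'isn' from the third review disappears together with the other stop words. Since negations may matter for sentiment, one possible variant keeps a small set of tokens even when they appear in the stop list. The sketch below assumes a toy stop list and a hypothetical `keep` parameter; neither comes from 'nltk'.

```python
def remove_stop_words_keeping(tokenized_sentences, stop_words, keep=()):
    """Drop stop words from each sentence, except the tokens listed in `keep`."""
    effective_stop_words = set(stop_words) - set(keep)
    return [
        [word for word in sentence if word not in effective_stop_words]
        for sentence in tokenized_sentences
    ]


# Toy stop list and sentence, invented for illustration only.
toy_stop_words = ['the', 'be', 'but', 'isn', 'not']
sentences = [['the', 'place', 'isn', 'big', 'but', 'be', 'quiet']]

without_negation = remove_stop_words_keeping(sentences, toy_stop_words)
with_negation = remove_stop_words_keeping(sentences, toy_stop_words, keep={'isn', 'not'})
print(without_negation)
print(with_negation)
```

Whether keeping such tokens helps depends on the model used downstream, so this is an option to evaluate rather than a required step.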

In [43]:
lemmatizedTokenizedEnglishReviewsWithoutStopWords = remove_stop_words(
    lemmatizedTokenizedEnglishReviews, stopWords)
lemmatizedTokenizedEnglishReviewsWithoutStopWords[:5]


[['jan',
  'friendly',
  'welcome',
  'host',
  'apartment',
  'great',
  'area',
  'sooo',
  'amaze',
  'lot',
  'nice',
  'cafe',
  'shop',
  'enjoy',
  'time',
  'lot'],
 ['really', 'nice', 'area', 'food', 'park', 'transport', 'perfect'],
 ['nice',
  'stay',
  'berlin',
  'thanks',
  'jan',
  'premium',
  'situate',
  'apartment',
  'place',
  'big',
  'quiet',
  'functional',
  'also',
  'situate',
  'perfect',
  'neighbourhood',
  'jan',
  'welcome',
  'host',
  'eager',
  'help',
  'need',
  'provide',
  'kind',
  'information',
  'also',
  'good',
  'advice',
  'biergarten'],
 ['great',
  'location',
  'close',
  'mauerpark',
  'kastanienallee',
  'rosenthaler',
  'platz',
  'lot',
  'bar',
  'restaurant',
  'nearby',
  'jan',
  'friendly',
  'service',
  'mind'],
 ['apartment',
  'well',
  'locate',
  'close',
  'everything',
  'supermarket',
  'transport',
  'city',
  'center',
  'quiet',
  'night',
  'apartment',
  'locate',
  'inside',
  'building',
  'basic',
  'equipment',