# AirBnB Sentiment Analysis - Dataset generation

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Project Scope
 - Use sentiment analysis on the reviews of each listing to estimate how the listing
    itself is rated.

 - Search for relationships between the price of a room and the day of the week, holidays,
    and time of year, and relationships between the price and the characteristics of a
    room to make a forecast.

Dataset: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data

In [None]:
import pandas as pd
import zipfile36 as zipfile
import langdetect
import os
import matplotlib.pyplot as plt
import gensim
import nltk

## 1. Import of the reviews' dataset

The first step is to download the dataset.
The Kaggle API is used for this purpose.

In [None]:
!kaggle datasets download -d brittabettendorf/berlin-airbnb-data

In [None]:
zf = zipfile.ZipFile('berlin-airbnb-data.zip')
dfReviews = pd.read_csv(zf.open('reviews_summary.csv'))
dfReviews.head()

## 2. Data preprocessing

### 2.1 Null data-points removal

Once the dataset is available, we check whether it contains null data points.

In [None]:
dfReviews.info()

In [None]:
dfNullReviews = dfReviews[dfReviews['comments'].isnull()]
print(f'Number of null comments: {dfNullReviews.shape[0]}')
dfNullReviews.head()

In [None]:
dfReviews.dropna(axis=0, how='any', inplace=True)
dfReviews.info()
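
Note that `dropna(how='any')` without `subset` drops a row when any of its columns is null, not only `comments`. The following is a minimal sketch on a toy dataframe (the `reviewer_name` column and all values are assumptions made for illustration) that compares it with dropping only the rows whose comment is null:

```python
import pandas as pd

# Toy data: column names and values are invented for illustration only.
toy = pd.DataFrame({
    "reviewer_name": ["Ann", None, "Bob"],
    "comments": ["Nice flat", "Great host", None],
})

# Drops every row that has a null in ANY column (two rows here).
dropped_any = toy.dropna(axis=0, how="any")

# Drops only the rows whose 'comments' value is null (one row here).
dropped_comments = toy.dropna(subset=["comments"])

print(len(dropped_any), len(dropped_comments))
```

Which variant is preferable depends on whether rows with other missing fields should still take part in the sentiment analysis.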

### 2.2 Lowercase conversion

After removing the null data points, all the comments are converted to lowercase
strings.

In [None]:
dfReviews['comments'] = dfReviews['comments'].str.lower()
dfReviews.head()

### 2.3 Reviews' language detection

Since the comments are written in many languages, it is useful to detect the language
of each comment.
This makes it possible to select comments by language (and, if needed, to translate
all the comments into a common language).

The langdetect library is used to detect the language of the comments.

The first step is to define a function that returns the detected language of each comment in the dataframe.

In [None]:
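def get_lang_from_comment(dataframe):
    """Detect the language of each comment from its first 50 characters."""
    list_langs = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for index, comment in dataframe['comments'].items():
        if index % 5000 == 0:
            print(f'Processed {index} rows...')
        try:
            comment_lang = langdetect.detect(comment[:50])
            list_langs.append(comment_lang)
        except Exception:
            # Detection failed (e.g. no usable text): store a placeholder
            list_langs.append("None")

    return list_langs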
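
The fallback logic of the function above can be tried without running the real detector. The sketch below is only an illustration: `detect_languages` and `fake_detect` are hypothetical names, and the detector is passed in as an argument (in this notebook it would be `langdetect.detect`).

```python
def detect_languages(comments, detect_fn, fallback="None", prefix_len=50):
    """Return one language code per comment, or `fallback` when detection fails."""
    langs = []
    for comment in comments:
        try:
            langs.append(detect_fn(comment[:prefix_len]))
        except Exception:
            # Covers detector errors and non-string inputs such as NaN.
            langs.append(fallback)
    return langs


def fake_detect(text):
    """Hypothetical stand-in for langdetect.detect, used only in this demo."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no usable text")
    return "de" if "sehr" in text else "en"


result = detect_languages(
    ["sehr gut", "very nice", "!!!", float("nan")], fake_detect)
print(result)
```

As far as the langdetect README describes it, detection is non-deterministic unless `DetectorFactory.seed` is set before the first call, so results on short comments may vary between runs.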

Once the language of each comment is detected, it is added as a new column to the
existing dataframe. The resulting dataframe is then saved to a .csv file.

Since this operation is very time-consuming, we first check whether it has already
been executed and its results saved to a .csv file.

In [None]:
if os.path.exists('reviews_summary_langs.csv'):
    dfReviews = pd.read_csv('reviews_summary_langs.csv')
else:
    dfReviews['Lang'] = get_lang_from_comment(dfReviews)
    dfReviews.to_csv('reviews_summary_langs.csv', sep=",", index=False, header=True)

dfReviews.head()

In [None]:
dfReviews['Lang'].unique()

The rows in which the 'Lang' column has the value 'None' are the ones for which the
previous step raised an error.
In particular, the possible causes are that the library could not detect the language
or that the review was too short.

In [None]:
dfNoneLangReviews = dfReviews[dfReviews['Lang'] == 'None']
print(f'Number of reviews with None language: {dfNoneLangReviews.shape[0]}')
print(f'Percentage of reviews with None language: '
      f'{round(dfNoneLangReviews.shape[0] * 100 / dfReviews.shape[0],2)}%')

#### 2.3.1 English reviews selection

Only the reviews written in English are of interest for this analysis.

In [None]:
# .copy() avoids SettingWithCopyWarning when the 'comments' column is modified later
dfEnglishReviews = dfReviews[dfReviews['Lang'] == 'en'].copy()
dfEnglishReviews.head()

In [None]:
dfEnglishReviews.info()

### 2.4 Duplicates removal

The next step is to remove the duplicated reviews.

In [None]:
print('Number of English reviews: {}'.format(dfEnglishReviews.shape[0]))
print('Number of unique English reviews: {}'.format(len(dfEnglishReviews['comments'].unique())))

In [None]:
dfEnglishReviews = dfEnglishReviews.drop_duplicates(subset='comments')
print(f'Number of reviews after the duplicated removal: {dfEnglishReviews.shape[0]}')

### 2.5 Non-alphabetic characters removal

In [None]:
dfEnglishReviews['comments'].iloc[172]

In [None]:
from re import sub

dfEnglishReviews['comments'] = dfEnglishReviews.apply(
    lambda x: sub(r"[^A-Za-z]", " ", x['comments']), axis=1)
dfEnglishReviews['comments'].iloc[172]
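
To make the effect of the regular expression concrete, here is a small sketch on made-up strings. Every character outside `A-Z` and `a-z` becomes a space, which also affects accented letters; the whitespace collapsing step is an addition for readability, not part of the cell above.

```python
import re

raw = "great flat!! 10/10, would stay again :)"

# Same pattern as above: anything that is not an ASCII letter becomes a space.
letters_only = re.sub(r"[^A-Za-z]", " ", raw)

# Optional extra step (an assumption, not in the notebook): collapse space runs.
collapsed = " ".join(letters_only.split())
print(collapsed)  # great flat would stay again

# Accented letters are outside A-Za-z, so they are replaced as well.
accented = re.sub(r"[^A-Za-z]", " ", "caf\u00e9")
print(repr(accented))  # 'caf '
```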

### 2.6 Tokenization

To prepare the data for the analysis model, the reviews must be tokenized.
The 'gensim' library is used for this purpose.

In [None]:
tokenizedEnglishReviews = dfEnglishReviews.apply(
    lambda x: gensim.utils.simple_preprocess(str(x['comments'])), axis=1)
tokenizedEnglishReviews

### 2.7 Normalization

Another important step is the normalization of the reviews.
The 'nltk' library is used for this purpose.

In particular, the 'wordnet' and 'averaged_perceptron_tagger' packages are downloaded from
the 'nltk' resources.
The first package provides the data for a lemmatizer that, given a word, converts it into its base form.
The second package provides a tagger that, given a word, returns a tag representing its
part of speech.

In [None]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag

def lemmatize_reviews(tokenized_reviews):
    lemmatizer = WordNetLemmatizer()
    lemmatized_reviews = []
    for tokens_review in tokenized_reviews:
        lemmatized_review = []
        for word, tag in pos_tag(tokens_review):
            if tag.startswith('NN'):
                pos = 'n'
            elif tag.startswith('VB'):
                pos = 'v'
            else:
                pos = 'a'
            lemmatized_review.append(lemmatizer.lemmatize(word, pos))
        lemmatized_reviews.append(lemmatized_review)

    return lemmatized_reviews
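
The tag mapping in `lemmatize_reviews` sends every tag that is neither a noun nor a verb to `'a'` (adjective). A possible variant, sketched below as an assumption rather than as the notebook's behaviour, also handles adverbs and falls back to `'n'`, the default part of speech of `WordNetLemmatizer.lemmatize`. The helper name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a' or 'r')."""
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fallback: the lemmatizer's default part of speech


tags = ["NNS", "VBD", "JJ", "RB", "DT"]
mapped = [penn_to_wordnet(t) for t in tags]
print(mapped)  # ['n', 'v', 'a', 'r', 'n']
```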

In [None]:
lemmatizedTokenizedEnglishReviews = lemmatize_reviews(tokenizedEnglishReviews)
lemmatizedTokenizedEnglishReviews[:5]

## 3. Sentiment Analysis

### 3.1 Bigrams generation

In [None]:
from gensim.models.phrases import Phrases, Phraser

phrases = Phrases(lemmatizedTokenizedEnglishReviews, min_count=3, progress_per=50000)

bigram = Phraser(phrases)

bigramReviews = bigram[lemmatizedTokenizedEnglishReviews]

bigramReviews[0]

In [None]:
from collections import defaultdict

dictWordFreq = defaultdict(int)
for review in bigramReviews:
    for i in review:
        dictWordFreq[i] += 1

len(dictWordFreq)

In [None]:
# TODO: optionally show an example item from dictWordFreq

In [None]:
sorted(dictWordFreq, key=dictWordFreq.get, reverse=True)[:10]
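
The same frequency count can also be written with `collections.Counter`. This is a minimal sketch on toy token lists, not a replacement for the cells above:

```python
from collections import Counter
from itertools import chain

# Toy tokenized reviews, invented for illustration.
toy_reviews = [["great", "location"], ["great", "host"], ["clean", "great"]]

# Flatten the reviews and count every token in one pass.
word_freq = Counter(chain.from_iterable(toy_reviews))

top = word_freq.most_common(1)
print(len(word_freq), top)  # 4 [('great', 3)]
```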

### 3.2 Word2Vec model

In [None]:
from gensim.models import Word2Vec

w2vModel = Word2Vec(min_count=20,
                    window=4,
                    vector_size=300,
                    sample=6e-5,
                    alpha=0.03,
                    min_alpha=0.0007,
                    negative=20,
                    workers=4)

In [None]:
from time import time

t = time()

w2vModel.build_vocab(bigramReviews, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
t = time()

w2vModel.train(bigramReviews,
               total_examples=w2vModel.corpus_count,
               epochs=30,
               report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
# w2vModel.save("word2vec.model")

As an example, we can show the words most similar to a given word.
This gives a first impression of the quality of the Word2Vec model.

In [None]:
w2vModel.wv.most_similar(positive=["apartment"])
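
`most_similar` ranks words by the cosine similarity of their vectors. The sketch below reproduces that ranking on three made-up 2-dimensional vectors; the words and values are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy embeddings, invented for illustration.
toy_vectors = {
    "apartment": np.array([1.0, 0.0]),
    "flat": np.array([2.0, 0.0]),
    "noisy": np.array([0.0, 1.0]),
}

query = toy_vectors["apartment"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "apartment"),
    key=lambda pair: pair[1],
    reverse=True)
print(ranked)
```

Because cosine similarity ignores vector length, "flat" gets the maximum score even though its vector is twice as long as the query.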

### 3.3 Clustering model

In [None]:
from sklearn.cluster import KMeans
import numpy as np

kmeansModel2Clusters = KMeans(n_clusters=2, max_iter=1000, random_state=42, n_init=50)
kmeansModel2Clusters.fit(X=w2vModel.wv.vectors.astype('double'))

In [None]:
w2vModel.wv.similar_by_vector(kmeansModel2Clusters.cluster_centers_[0],
                              topn=10,
                              restrict_vocab=None)

In [None]:
negativeClusterIndex = 0

In [None]:
dfWords2Clusters = pd.DataFrame(
    w2vModel.wv.key_to_index.keys())

dfWords2Clusters.columns = ['words']

dfWords2Clusters['vectors'] = \
    dfWords2Clusters['words'].apply(
        lambda x: w2vModel.wv[f'{x}'])

dfWords2Clusters['cluster'] = \
    dfWords2Clusters['vectors'].apply(
        lambda x: kmeansModel2Clusters.predict([np.array(x)]))

dfWords2Clusters['cluster'] = \
    dfWords2Clusters['cluster'].apply(
        lambda x: x[0])

dfWords2Clusters.head()

In [None]:
dfWords2Clusters['cluster_value'] = [
    -1 if i==negativeClusterIndex else 1
    for i in dfWords2Clusters['cluster']]

dfWords2Clusters['closeness_score'] = \
    dfWords2Clusters.apply(
        lambda x: 1/(kmeansModel2Clusters.transform([x.vectors]).min()),
        axis=1)

dfWords2Clusters['sentiment_coeff'] = \
    dfWords2Clusters['closeness_score'] * \
    dfWords2Clusters['cluster_value']

dfWords2Clusters[
    dfWords2Clusters['cluster_value'] == -1].head()
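
The sentiment coefficient computed above is the inverse of the distance to the nearest cluster centre, with a sign that depends on the assigned cluster. A minimal NumPy sketch with made-up centroids and word vectors (all values are assumptions) shows the same arithmetic:

```python
import numpy as np

# Toy cluster centres and word vectors, invented for illustration.
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
negative_cluster_index = 0  # mirrors negativeClusterIndex
word_vectors = np.array([[1.0, 0.0], [4.0, 2.0]])

# distances[i, j] = Euclidean distance from word i to centroid j
distances = np.linalg.norm(
    word_vectors[:, None, :] - centroids[None, :, :], axis=2)

cluster = distances.argmin(axis=1)       # nearest centroid per word
closeness = 1.0 / distances.min(axis=1)  # larger when closer to the centre
cluster_value = np.where(cluster == negative_cluster_index, -1, 1)

sentiment_coeff = closeness * cluster_value
print(sentiment_coeff)  # [-1.   0.5]
```

A word whose vector coincides exactly with a centroid would lead to a division by zero, so a small epsilon in the denominator may be worth considering.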

### 3.4 TF-IDF

In [None]:
dfCleanedReviews = pd.DataFrame(
    [' '.join(review) for review in lemmatizedTokenizedEnglishReviews],
    columns=['comments'])

dfCleanedReviews.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(norm=None)
transformed = tfidf.fit_transform(
    dfCleanedReviews['comments'].tolist())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = pd.Series(tfidf.get_feature_names_out())

In [None]:
def create_tfidf_dictionary(x, transformed_file, features_file):
    """
    Create a dictionary for the input sentence x in which each word is mapped to its TF-IDF score.

    Inspired by a function from this article:
    https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34

    x - row of the dataframe, containing the sentence and its index
    transformed_file - all sentences transformed with TfidfVectorizer
    features_file - names of all words in the corpus used in TfidfVectorizer
    """
    vector_coo = transformed_file[x.name].tocoo()
    vector_coo.col = features_file.iloc[vector_coo.col].values
    dict_from_coo = dict(zip(vector_coo.col, vector_coo.data))
    return dict_from_coo

def replace_tfidf_words(x, transformed_file, features_file):
    """
    Replace each word of the sentence with its TF-IDF score.
    x - row of the dataframe, containing the sentence and its index
    transformed_file - all sentences transformed with TfidfVectorizer
    features_file - names of all words in the corpus used in TfidfVectorizer
    """
    dictionary = create_tfidf_dictionary(x, transformed_file, features_file)
    try:
        res = list(map(lambda y:dictionary[f'{y}'], x['comments'].split()))
    except KeyError:
        res = [0 for i in x['comments'].split()]
    return res

In [None]:
tfidfScores = dfCleanedReviews.apply(
    lambda x: replace_tfidf_words(x, transformed, features), axis=1)
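
To see what the scores look like, the sketch below computes them by hand on two toy documents. It assumes the documented defaults of `TfidfVectorizer` (smoothed idf, raw term counts) together with `norm=None` as used above, which gives `tf * (ln((1 + n) / (1 + df)) + 1)`. The documents and the helper `tfidf_scores` are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized documents, invented for illustration.
docs = [["great", "flat"], ["great", "host", "host"]]
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf_scores(doc):
    """Map each word of `doc` to tf * smoothed idf, without normalization."""
    counts = Counter(doc)
    return {
        word: counts[word] * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
        for word in counts
    }


scores = tfidf_scores(docs[1])
print(scores)
```

A word that occurs in every document ("great") gets an idf of exactly 1, while a rarer, repeated word ("host") receives a higher weight.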

### 3.5 Closeness score

In [None]:
dictSentiment2Clusters = dict(zip(
    dfWords2Clusters['words'].values,
    dfWords2Clusters['sentiment_coeff'].values))

In [None]:
def replace_sentiment_words(word, sentiment_dict):
    """
    replacing each word with its associated sentiment score from sentiment dict
    """
    try:
        out = sentiment_dict[word]
    except KeyError:
        out = 0
    return out

In [None]:
closenessScores2Clusters = \
    dfCleanedReviews['comments'].apply(
        lambda x: list(map(
            lambda y: replace_sentiment_words(y, dictSentiment2Clusters),
            x.split())))

### 3.6 Sentiment score computation

In [None]:
dfSentiment2ClustersTfidfReviews = \
    pd.DataFrame([closenessScores2Clusters,
                  tfidfScores,
                  dfCleanedReviews['comments']]).T

dfSentiment2ClustersTfidfReviews.columns = \
    ['sentiment_coeff', 'tfidf_scores', 'review']

dfSentiment2ClustersTfidfReviews['sentiment_rate'] = \
    dfSentiment2ClustersTfidfReviews.apply(
        lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']),
        axis=1)

dfSentiment2ClustersTfidfReviews['prediction'] =\
    (dfSentiment2ClustersTfidfReviews['sentiment_rate'] > 0)\
        .astype('int8')

dfSentiment2ClustersTfidfReviews.head()
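
The sentiment rate of a review is the dot product between the per-word sentiment coefficients and the per-word TF-IDF scores. A small sketch with made-up numbers for a three-word review:

```python
import numpy as np

# One value per word of a toy review; all numbers are invented.
sentiment_coeff = np.array([0.5, -2.0, 0.0])  # 0.0 = word missing from the Word2Vec vocabulary
tfidf_weights = np.array([1.0, 1.5, 3.0])

# 0.5 * 1.0 + (-2.0) * 1.5 + 0.0 * 3.0 = -2.5
sentiment_rate = float(sentiment_coeff @ tfidf_weights)
prediction = int(sentiment_rate > 0)
print(sentiment_rate, prediction)  # -2.5 0
```

With the strict `> 0` threshold, a review whose rate is exactly zero (for example, one made only of out-of-vocabulary words) is labelled as negative.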

As an example, we can also show the five most negative reviews according to our
sentiment prediction.

In [None]:
dfNegativeSentiment = dfSentiment2ClustersTfidfReviews[
    dfSentiment2ClustersTfidfReviews['prediction'] == 0].sort_values(
        by=['sentiment_rate'])

print('Top-5 negative reviews:')
dfNegativeSentiment['review'].head().tolist()

Finally, the dataset is saved into a .csv file.

In [None]:
# dfSentiment2ClustersTfidfReviews.to_csv(
#     'sentiment_dataset_2_clusters.csv',
#     sep=',', index=False, header=True)

### 3.7 TextBlob

In [None]:
from textblob import TextBlob

In [None]:
textblobSentiment = dfCleanedReviews['comments'].apply(
    lambda x: TextBlob(x).sentiment.polarity)

textblobSentiment.head()

In [None]:
dfSentiment2ClustersTfidfReviews['textblob_sentiment'] = \
    textblobSentiment

dfSentiment2ClustersTfidfReviews['textblob_prediction'] = \
    (dfSentiment2ClustersTfidfReviews['textblob_sentiment'] > 0)\
        .astype('int8')

dfSentiment2ClustersTfidfReviews.head()

### 3.8 Sentiment Analysis Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def compute_test_scores(predictions, labels):

    df_conf_matrix = pd.DataFrame(confusion_matrix(labels, predictions))

    print(df_conf_matrix)

    test_scores = accuracy_score(labels, predictions), \
                  precision_score(labels, predictions), \
                  recall_score(labels, predictions), \
                  f1_score(labels, predictions)

    return test_scores

In [None]:
testScores2ClustersSentiment = compute_test_scores(
    dfSentiment2ClustersTfidfReviews['prediction'],
    dfSentiment2ClustersTfidfReviews['textblob_prediction'])

dfTestScores2ClustersSentiment = pd.DataFrame([testScores2ClustersSentiment])
dfTestScores2ClustersSentiment.columns = ['accuracy', 'precision', 'recall', 'f1']
dfTestScores2ClustersSentiment = dfTestScores2ClustersSentiment.T
dfTestScores2ClustersSentiment.columns = ['scores']

print('Scores for sentiment analysis with 2 clusters and no stopwords: ')
dfTestScores2ClustersSentiment
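
Since the labels passed to `compute_test_scores` are the TextBlob predictions rather than human annotations, the scores measure agreement with TextBlob. The sketch below recomputes the four metrics by hand on made-up labels to show how they follow from the confusion counts; `binary_scores` is a hypothetical helper.

```python
def binary_scores(labels, predictions):
    """Accuracy, precision, recall and F1 for binary labels (positive class = 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Toy reference labels and predictions, invented for illustration.
toy_labels = [1, 1, 1, 0, 0]
toy_predictions = [1, 0, 1, 1, 0]
toy_scores = binary_scores(toy_labels, toy_predictions)
print(toy_scores)
```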

The sentiment analysis with 2 clusters agrees poorly with the TextBlob predictions used as reference.
So, we apply both the elbow and the silhouette methods to check whether a different
number of word clusters gives better results.

### 3.9 Clustering evaluation

In [None]:
def kmeans_elbow_method(vectors):
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
        kmeans.fit(X=vectors)

        # inertia_ is the sum of squared distances of the samples to their closest cluster center.
        wcss.append(kmeans.inertia_)
        print("inertia_", kmeans.inertia_)

    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()

In [None]:
kmeans_elbow_method(w2vModel.wv.vectors.astype('double'))
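
The quantity plotted by the elbow method, the within-cluster sum of squares (WCSS), can be checked by hand. The sketch below computes it with NumPy for two fixed groupings of four made-up points; the `wcss` helper is hypothetical and takes the cluster labels as given instead of running KMeans.

```python
import numpy as np

# Four toy points forming two well-separated pairs.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])


def wcss(points, labels):
    """Sum of squared distances of each point to the mean of its cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total


one_cluster = wcss(points, np.array([0, 0, 0, 0]))
two_clusters = wcss(points, np.array([0, 0, 1, 1]))
print(one_cluster, two_clusters)  # 104.0 4.0
```

The sharp drop from one to two clusters is the kind of "elbow" the plot is meant to reveal; a curve without such a bend suggests no clear cluster structure.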

In [None]:
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X,range_clusters):
    for i, k in range_clusters :

        # Run the Kmeans algorithm
        km = KMeans(n_clusters = k, init = 'k-means++', random_state = 42)

        km.fit(X)
        labels = km.predict(X)

        print("For n_clusters =", k,
              "The computed average silhouette_score is :",
              silhouette_score(X, labels, metric='euclidean'))

In [None]:
# rangeClusters = enumerate([2,3,4,5,6,7,8,9,10])
# kmeans_silhouette(w2vModel.wv.vectors.astype('double'), rangeClusters)

## 4. Sentiment analysis with 3 clusters

### 4.1 Clustering model

In [None]:
kmeansModel3Clusters = KMeans(n_clusters=3, max_iter=1000, random_state=42, n_init=50)
kmeansModel3Clusters.fit(X=w2vModel.wv.vectors.astype('double'))

In [None]:
w2vModel.wv.similar_by_vector(
    kmeansModel3Clusters.cluster_centers_[2], topn=10, restrict_vocab=None)

In [None]:
negativeClusterIndex = 2
positiveClusterIndex = 0

In [None]:
dfWords3Clusters = pd.DataFrame(
    w2vModel.wv.key_to_index.keys())

dfWords3Clusters.columns = ['words']

dfWords3Clusters['vectors'] = \
    dfWords3Clusters['words'].apply(
        lambda x: w2vModel.wv[f'{x}'])

dfWords3Clusters['cluster'] = \
    dfWords3Clusters['vectors'].apply(
        lambda x: kmeansModel3Clusters.predict([np.array(x)]))

dfWords3Clusters['cluster'] = \
    dfWords3Clusters['cluster'].apply(
        lambda x: x[0])

dfWords3Clusters.head()

In [None]:
dfWords3Clusters['cluster_value'] = \
    [-1 if i==negativeClusterIndex
     else 1 if i==positiveClusterIndex else 0
     for i in dfWords3Clusters['cluster']]

dfWords3Clusters['closeness_score'] = \
    dfWords3Clusters.apply(
        lambda x: 1/(kmeansModel3Clusters.transform([x.vectors]).min()),
        axis=1)

dfWords3Clusters['sentiment_coeff'] = \
    dfWords3Clusters['closeness_score'] * \
    dfWords3Clusters['cluster_value']

dfWords3Clusters[
    dfWords3Clusters['cluster_value'] == -1].head()

### 4.2 Closeness score

TF-IDF is not recomputed: the set of words composing the reviews has not changed,
so the scores would be the same.

...

In [None]:
dictSentiment3Clusters = dict(zip(
    dfWords3Clusters['words'].values,
    dfWords3Clusters['sentiment_coeff'].values))

In [None]:
closenessScores3Clusters = \
    dfCleanedReviews['comments'].apply(
        lambda x: list(map(
            lambda y: replace_sentiment_words(y, dictSentiment3Clusters),
            x.split())))

### 4.3 Sentiment score computation

In [None]:
dfSentiment3ClustersTfidfReviews = pd.DataFrame(
    [closenessScores3Clusters,
     tfidfScores,
     dfCleanedReviews['comments']]).T

dfSentiment3ClustersTfidfReviews.columns = \
    ['sentiment_coeff', 'tfidf_scores', 'review']

dfSentiment3ClustersTfidfReviews['sentiment_rate'] = \
    dfSentiment3ClustersTfidfReviews.apply(
        lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']),
        axis=1)

dfSentiment3ClustersTfidfReviews['prediction'] = \
    (dfSentiment3ClustersTfidfReviews['sentiment_rate']>0)\
        .astype('int8')

dfSentiment3ClustersTfidfReviews.head()

In [None]:
# dfSentiment3ClustersTfidfReviews.to_csv(
#     'sentiment_dataset_3_clusters.csv',
#     sep=',', index=False, header=True)

In [None]:
dfNegativeSentiment = dfSentiment3ClustersTfidfReviews[
    dfSentiment3ClustersTfidfReviews['prediction'] == 0].sort_values(
        by=['sentiment_rate'])

dfNegativeSentiment['review'].head().tolist()

### 4.4 TextBlob

As before, the TextBlob sentiment predictions have not changed,
so the same values are reused.

In [None]:
dfSentiment3ClustersTfidfReviews['textblob_sentiment'] = \
    textblobSentiment

dfSentiment3ClustersTfidfReviews['textblob_prediction'] = \
    (dfSentiment3ClustersTfidfReviews['textblob_sentiment'] > 0)\
        .astype('int8')

dfSentiment3ClustersTfidfReviews.head()

### 4.5 Sentiment evaluation

In [None]:
testScores3ClustersSentiment = compute_test_scores(
    dfSentiment3ClustersTfidfReviews['prediction'],
    dfSentiment3ClustersTfidfReviews['textblob_prediction'])

dfTestScores3ClustersSentiment = pd.DataFrame([testScores3ClustersSentiment])
dfTestScores3ClustersSentiment.columns = ['accuracy', 'precision', 'recall', 'f1']
dfTestScores3ClustersSentiment = dfTestScores3ClustersSentiment.T
dfTestScores3ClustersSentiment.columns = ['scores']

print('Scores for sentiment analysis with 3 clusters and no stopwords: ')
dfTestScores3ClustersSentiment

## 5. Sentiment Analysis without stop words

### 5.1 Stop words removal

Finally, it is important to remove the stop-words.
For this purpose, the 'stopwords' package of the 'nltk' library is used.

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

stopWords = stopwords.words('english')
stopWords[:10]

In [None]:
def remove_stop_words(tokenized_reviews, stop_words):
    # A set gives constant-time membership checks
    stop_words = set(stop_words)
    tokenized_reviews_without_stopwords = []
    for tokenized_review in tokenized_reviews:
        tokenized_reviews_without_stopwords.append(
            [word for word in tokenized_review if word not in stop_words]
        )
    return tokenized_reviews_without_stopwords
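
A quick way to check the filtering is to run the same logic on toy input. In the sketch below the stop-word list is a small made-up set, not the 'nltk' one:

```python
# Hypothetical stop-word list and tokenized reviews, for illustration only.
toy_stop_words = {"the", "a", "is", "was"}
toy_tokenized = [["the", "flat", "was", "clean"], ["a", "great", "host"]]

# Keep only the words that are not stop words, review by review.
toy_filtered = [
    [word for word in review if word not in toy_stop_words]
    for review in toy_tokenized
]
print(toy_filtered)  # [['flat', 'clean'], ['great', 'host']]
```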

In [None]:
lemmatizedTokenizedEnglishReviewsWithoutStopWords = remove_stop_words(
    lemmatizedTokenizedEnglishReviews, stopWords)
lemmatizedTokenizedEnglishReviewsWithoutStopWords[:5]

### 5.2 Bigrams generation

from notebook reviews_without_stopwords_analysis
