# AirBnB Sentiment Analysis - Dataset generation

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Project Scope
 - Use sentiment analysis on the reviews of each ad to assess how the ad itself is
    rated.

 - Search for relationships between the price of a room and the day of the week, holidays,
    and time of year, and between the price and the characteristics of a room, in order
    to make a forecast.

Dataset: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data

In [None]:
import pandas as pd
import zipfile36 as zipfile
import langdetect
import os
import matplotlib.pyplot as plt
import gensim
import nltk
import seaborn as sns

## 1. Import of the reviews' dataset

The first step is to download the dataset.
The Kaggle API is used for this purpose.

In [None]:
!kaggle datasets download -d brittabettendorf/berlin-airbnb-data

In [None]:
zf = zipfile.ZipFile('berlin-airbnb-data.zip')
dfReviews = pd.read_csv(zf.open('reviews_summary.csv'))
dfReviews.head()

## 2. Data preprocessing

### 2.1 Null data-points removal

Once the dataset is available, it must be checked for null data points.

In [None]:
dfReviews.info()

In [None]:
dfNullReviews = dfReviews[dfReviews['comments'].isnull()]
print(f'Number of null comments: {dfNullReviews.shape[0]}')
dfNullReviews.head()

In [None]:
dfReviews.dropna(axis=0, how='any', inplace=True)
dfReviews.info()

### 2.2 Lowercase conversion

After the null data points have been removed, all the comments are converted to
lowercase strings.

In [None]:
dfReviews['comments'] = dfReviews['comments'].str.lower()
dfReviews.head()

### 2.3 Reviews' language detection

Since the comments are written in many languages, it is useful to detect the language
of each comment.
This makes it possible to select comments by language (and, if needed, to translate
all the comments into a common language).

The langdetect library is used to detect the language of the comments.

The first step is to define a function that detects the language of each comment from
its first 50 characters and returns the list of detected language codes, using "None"
when detection fails.

In [None]:
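def get_lang_from_comment(dataframe):
    """Detect the language of each comment from its first 50 characters."""
    list_langs = []
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent
    for index, comment in dataframe['comments'].items():
        if index % 5000 == 0:
            print(f'Processed {index} rows...')
        try:
            comment_lang = langdetect.detect(comment[:50])
            list_langs.append(comment_lang)
        except Exception:
            # Detection failed for this comment; keep a placeholder value
            list_langs.append("None")

    return list_langs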

Once the language of each comment has been detected, it is added as a new column to the
existing dataframe. The resulting dataframe is then saved to a .csv file.

Since this operation is very time-consuming, the notebook first checks whether it has
already been run and its results saved to a .csv file.

In [None]:
if os.path.exists('reviews_summary_langs.csv'):
    dfReviews = pd.read_csv('reviews_summary_langs.csv')
else:
    dfReviews['Lang'] = get_lang_from_comment(dfReviews)
    dfReviews.to_csv('reviews_summary_langs.csv', sep=",", index=False, header=True)

dfReviews.head()
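
The compute-once check above can be isolated in a small helper. The following is a minimal sketch and not part of the original notebook: `load_or_compute`, `slow_detection` and the file name are hypothetical, and the standard `csv` module is used instead of pandas so that the example runs on its own.

```python
import csv
import os
import tempfile


def load_or_compute(path, compute_rows):
    """Return the rows cached in `path`, or compute them and write the cache."""
    if os.path.exists(path):
        with open(path, newline='') as f:
            return list(csv.reader(f))
    rows = compute_rows()
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return rows


calls = []


def slow_detection():
    # Stand-in for the expensive language detection (hypothetical data)
    calls.append(1)
    return [['nice flat', 'en'], ['sehr gut', 'de']]


cache_path = os.path.join(tempfile.mkdtemp(), 'langs.csv')
first = load_or_compute(cache_path, slow_detection)   # computes and saves
second = load_or_compute(cache_path, slow_detection)  # reads the cached file
```

The second call never invokes `slow_detection`, which is the behaviour the notebook relies on to skip the language detection on later runs.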

In [None]:
dfReviews['Lang'].unique()

The rows whose 'Lang' column has the value 'None' are the ones for which the detection
failed in the previous step.
The likely causes are that langdetect could not identify the language or that the
review is too short.

In [None]:
dfNoneLangReviews = dfReviews[dfReviews['Lang'] == 'None']
print(f'Number of reviews with None language: {dfNoneLangReviews.shape[0]}')
print(f'Percentage of reviews with None language: '
      f'{round(dfNoneLangReviews.shape[0] * 100 / dfReviews.shape[0],2)}%')

#### 2.3.1 English reviews selection

The reviews written in English are the ones of interest for this analysis.

In [None]:
dfEnglishReviews = dfReviews[dfReviews['Lang'] == 'en']
dfEnglishReviews.head()

In [None]:
dfEnglishReviews.info()

### 2.4 Duplicates removal

Another required step is the removal of duplicate reviews.

In [None]:
print('Number of English reviews: {}'.format(dfEnglishReviews.shape[0]))
print('Number of unique English reviews: {}'.format(len(dfEnglishReviews['comments'].unique())))

In [None]:
dfEnglishReviews = dfEnglishReviews.drop_duplicates(subset='comments')
print(f'Number of reviews after duplicate removal: {dfEnglishReviews.shape[0]}')

### 2.5 Non-alphabetic characters removal

In [None]:
dfEnglishReviews['comments'].iloc[172]

In [None]:
from re import sub

dfEnglishReviews['comments'] = dfEnglishReviews.apply(
    lambda x: sub(r"[^A-Za-z]", " ", x['comments']), axis=1)
dfEnglishReviews['comments'].iloc[172]
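
To make the effect of the substitution concrete, here is a small sketch on two made-up strings (the sample texts are assumptions, not rows of the dataset). It applies the same pattern as the cell above: every character that is not an ASCII letter is replaced by a single space.

```python
from re import sub

sample = "great flat, 5 min to the u-bahn!"

# Same pattern as above: anything outside A-Z and a-z becomes a space
cleaned = sub(r"[^A-Za-z]", " ", sample)
tokens = cleaned.split()

# Non-ASCII letters are replaced as well, which splits words such as this one
german = sub(r"[^A-Za-z]", " ", "straße")
```

Digits and punctuation disappear, hyphenated words are split in two, and letters such as "ß" are also removed, so words containing them are broken apart.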

### 2.6 Tokenization

To prepare the data for the analysis model, the reviews must be tokenized.
The 'gensim' library is used for this purpose.

In [None]:
tokenizedEnglishReviews = dfEnglishReviews.apply(
    lambda x: gensim.utils.simple_preprocess(str(x['comments'])), axis=1)
tokenizedEnglishReviews

### 2.7 Normalization

Another important step is the normalization of the reviews.
The 'nltk' library is used for this purpose.

In particular, the 'wordnet' and 'averaged_perceptron_tagger' packages are downloaded from
the 'nltk' resources.
The first package backs a 'Lemmatizer' that, given a word, converts it into its base form.
The second package backs a tagger that, given a word, returns a tag representing its
grammatical category.

In [None]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag

def lemmatize_reviews(tokenized_reviews):
    lemmatizer = WordNetLemmatizer()
    lemmatized_reviews = []
    for tokens_review in tokenized_reviews:
        lemmatized_review = []
        for word, tag in pos_tag(tokens_review):
            if tag.startswith('NN'):
                pos = 'n'
            elif tag.startswith('VB'):
                pos = 'v'
            else:
                pos = 'a'
            lemmatized_review.append(lemmatizer.lemmatize(word, pos))
        lemmatized_reviews.append(lemmatized_review)

    return lemmatized_reviews
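
The tag-to-POS mapping inside `lemmatize_reviews` can be read in isolation. The sketch below reproduces only that mapping; `penn_to_wordnet` is a hypothetical helper name, and the tags in `tagged` are written by hand for illustration rather than produced by the tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the WordNet POS letter used above."""
    if tag.startswith('NN'):
        return 'n'   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return 'a'       # everything else is treated as an adjective


# Hand-written (word, tag) pairs, assumed for the example
tagged = [('rooms', 'NNS'), ('was', 'VBD'), ('cleaner', 'JJR'), ('very', 'RB')]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
```

Note that adverbs and all other remaining categories fall into the adjective branch, exactly as in the function above.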

In [None]:
lemmatizedTokenizedEnglishReviews = lemmatize_reviews(tokenizedEnglishReviews)
lemmatizedTokenizedEnglishReviews[:5]

## 3. Sentiment Analysis

The goal of the sentiment analysis is to determine whether the reviews of Airbnb
activity in Berlin are positive or negative.

### 3.1 Bigrams generation

To take short sequences of words into account, bigrams are introduced.
The gensim Phrases model is used to automatically detect bigrams from a list
of sentences.

In [None]:
from gensim.models.phrases import Phrases, Phraser

phrases = Phrases(lemmatizedTokenizedEnglishReviews, min_count=3, progress_per=50000)

bigram = Phraser(phrases)

bigramReviews = bigram[lemmatizedTokenizedEnglishReviews]

bigramReviews[0]
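
The idea behind bigram detection can be sketched without gensim. The example below is a simplified illustration, assuming the scoring formula score(a, b) = (count(a b) - min_count) * N / (count(a) * count(b)), with N the vocabulary size, which to my understanding is what the default gensim scorer is based on. The toy corpus, `min_count` and `threshold` values are assumptions chosen for the example, and the real implementation handles more cases.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical reviews)
sentences = [
    ['great', 'location', 'near', 'public', 'transport'],
    ['public', 'transport', 'is', 'close'],
    ['great', 'host', 'and', 'public', 'transport', 'nearby'],
    ['location', 'is', 'great'],
]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))


def bigram_score(a, b, min_count=1):
    """Score a candidate bigram; frequent pairs of otherwise rare words score high."""
    return (bigrams[(a, b)] - min_count) * len(unigrams) / (unigrams[a] * unigrams[b])


threshold = 1.0
detected = sorted(pair for pair in bigrams if bigram_score(*pair) > threshold)
joined = ['_'.join(pair) for pair in detected]
```

Only the pair that occurs repeatedly passes the threshold, and it is then treated as the single token `public_transport`.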

In [None]:
from collections import defaultdict

dictWordFreq = defaultdict(int)
for review in bigramReviews:
    for i in review:
        dictWordFreq[i] += 1

len(dictWordFreq)

In [None]:
sorted(dictWordFreq, key=dictWordFreq.get, reverse=True)[:10]

### 3.2 Word2Vec model

Word2Vec is a group of models that represent each word of a large text as
a vector in an N-dimensional space (whose dimensions we will call features), so that
similar words lie close to each other.
In this particular case the CBOW architecture is used (the gensim default, sg=0).
With this architecture, each word in the corpus is predicted from its surrounding context.

In [None]:
from gensim.models import Word2Vec
from time import time

if os.path.exists('word2vec.model'):
    w2vModel = Word2Vec.load('word2vec.model')
else:
    w2vModel = Word2Vec(min_count=20,
                        window=4,
                        vector_size=300,
                        sample=6e-5,
                        alpha=0.03,
                        min_alpha=0.0007,
                        negative=20,
                        workers=4)

    t = time()
    w2vModel.build_vocab(bigramReviews, progress_per=10000)
    print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

    t = time()
    w2vModel.train(bigramReviews,
                   total_examples=w2vModel.corpus_count,
                   epochs=30,
                   report_delay=1)
    print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

    w2vModel.save("word2vec.model")
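
To illustrate what "predicting a word from its context" means in CBOW, the following sketch builds the (context, target) pairs for one review with a fixed window. It is only an illustration: `cbow_pairs` is a hypothetical helper, the review is made up, and the training itself (as well as details such as subsampling of frequent words) is left out.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) pairs for one tokenized review."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


# Hypothetical lemmatized review
review = ['host', 'be', 'very', 'friendly', 'and', 'helpful']
pairs = list(cbow_pairs(review, window=2))
```

For the target `friendly`, the context is `['be', 'very', 'and', 'helpful']`; the model learns vectors such that this context predicts the target.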

As an example, the words most similar to a given word can be shown.
This step gives a first impression of the quality of the Word2Vec model.

In [None]:
w2vModel.wv.most_similar(positive=["apartment"])

### 3.3 Clustering model

For the clustering, the K-means technique is used.

In [None]:
from sklearn.cluster import KMeans
import numpy as np

kmeansModel2Clusters = KMeans(n_clusters=2, max_iter=1000, random_state=42, n_init=50)
kmeansModel2Clusters.fit(X=w2vModel.wv.vectors.astype('double'))

To check which cluster is relatively positive and which is negative, the words
nearest to each cluster center can be shown.
In particular, the cosine similarity between each word vector and the center of the
first cluster is used to measure how close a word is to that cluster.

In [None]:
w2vModel.wv.similar_by_vector(kmeansModel2Clusters.cluster_centers_[0],
                              topn=10,
                              restrict_vocab=None)
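
Cosine similarity itself is simple to compute by hand. The sketch below ranks three made-up two-dimensional word vectors against a made-up cluster center; all the values are assumptions for illustration, whereas the real vectors have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


center = [1.0, 0.0]  # hypothetical cluster center
word_vectors = {
    'dirty': [2.0, 0.1],
    'noisy': [1.0, 1.0],
    'lovely': [-1.0, 0.5],
}

# Words sorted from most to least similar to the center
ranked = sorted(word_vectors,
                key=lambda w: cosine_similarity(word_vectors[w], center),
                reverse=True)
```

The measure depends only on direction, not on vector length, so `dirty` ranks first even though it is farther from the center than `noisy` in Euclidean terms.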

The 10 words closest to the center of cluster 0 in terms of cosine similarity are shown.
Looking at these words, cluster 0 can be identified as the cluster of negative words.

In [None]:
negativeClusterIndex = 0

The next step assigns a sentiment score to each word. The score is computed from the
word's cluster value, negative (-1) or positive (1), and its closeness score, which is
the inverse of the distance between the word and its nearest cluster center.

In [None]:
dfWords2Clusters = pd.DataFrame(
    w2vModel.wv.key_to_index.keys())

dfWords2Clusters.columns = ['words']

dfWords2Clusters['vectors'] = \
    dfWords2Clusters['words'].apply(
        lambda x: w2vModel.wv[f'{x}'])

dfWords2Clusters['cluster'] = \
    dfWords2Clusters['vectors'].apply(
        lambda x: kmeansModel2Clusters.predict([np.array(x)]))

dfWords2Clusters['cluster'] = \
    dfWords2Clusters['cluster'].apply(
        lambda x: x[0])

dfWords2Clusters.head()

In [None]:
dfWords2Clusters['cluster_value'] = [
    -1 if i==negativeClusterIndex else 1
    for i in dfWords2Clusters['cluster']]

dfWords2Clusters['closeness_score'] = \
    dfWords2Clusters.apply(
        lambda x: 1/(kmeansModel2Clusters.transform([x.vectors]).min()),
        axis=1)

dfWords2Clusters['sentiment_coeff'] = \
    dfWords2Clusters['closeness_score'] * \
    dfWords2Clusters['cluster_value']

dfWords2Clusters[
    dfWords2Clusters['cluster_value'] == -1].head()
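
The computation of the sentiment coefficient can be followed on a tiny example. This is a minimal sketch with two hypothetical two-dimensional cluster centers; it mirrors the logic of the cells above (nearest center, signed cluster value, inverse distance) but does not use the fitted K-means model.

```python
import math

# Hypothetical cluster centers; cluster 0 is assumed to be the negative one
centers = {0: [0.0, 0.0], 1: [4.0, 0.0]}
negative_cluster_index = 0


def sentiment_coeff(vector):
    """Signed inverse distance of a word vector to its nearest cluster center."""
    distances = {k: math.dist(vector, c) for k, c in centers.items()}
    cluster = min(distances, key=distances.get)
    cluster_value = -1 if cluster == negative_cluster_index else 1
    closeness_score = 1 / distances[cluster]
    return cluster_value * closeness_score


coeff_near_negative = sentiment_coeff([0.5, 0.0])  # 0.5 away from center 0
coeff_near_positive = sentiment_coeff([4.0, 2.0])  # 2.0 away from center 1
```

Words close to a center get a large absolute coefficient; a word lying exactly on a center would cause a division by zero, which this sketch does not guard against.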

With the steps above, a complete dataframe of words is created,
in which each word has its own cluster value, closeness score and sentiment
coefficient.

### 3.4 TF-IDF

TF-IDF is used to show how important a word is to a review.

In [None]:
dfCleanedReviews = pd.DataFrame(
    [' '.join(review) for review in lemmatizedTokenizedEnglishReviews],
    columns=['comments'])

dfCleanedReviews.head()

The sklearn library is used to calculate the tfidf score of each word.
This step accounts for how specific each word is to each review, and strengthens the
positive or negative signal of words that are highly specific to a given review
compared with the whole corpus.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(norm=None)
transformed = tfidf.fit_transform(
    dfCleanedReviews['comments'].tolist())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = pd.Series(tfidf.get_feature_names_out())
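
The scores produced with `norm=None` can be reproduced by hand on a toy corpus. The sketch below assumes the smoothed formula that scikit-learn documents as its default, tfidf(t, d) = tf(t, d) * (ln((1 + n) / (1 + df(t))) + 1), where tf is the raw count in the review, n the number of reviews and df the number of reviews containing the word. The three reviews are made up.

```python
import math
from collections import Counter

# Hypothetical cleaned reviews
reviews = ['great location great host', 'noisy location', 'great flat']

n_docs = len(reviews)
tokenized = [r.split() for r in reviews]
# Document frequency: in how many reviews each word appears
doc_freq = Counter(w for tokens in tokenized for w in set(tokens))


def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf_scores(tokens):
    """Map each word of one review to its unnormalized tfidf score."""
    counts = Counter(tokens)
    return {w: counts[w] * idf(w) for w in counts}


scores = tfidf_scores(tokenized[0])
```

In the first review, `host` appears in only one review and therefore gets a higher score than `location`, while `great` scores highest because it occurs twice.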

Finally, each review is turned into a vector of tfidf scores by replacing
each of its words with the corresponding tfidf score.

In [None]:
def create_tfidf_dictionary(x, transformed_file, features_file):
    """
    Create a dictionary for the input sentence x that maps each word to its tfidf score.

    Inspired by a function from this wonderful article:
    https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34

    x - row of the dataframe, containing a sentence; x.name is its row index,
    transformed_file - all sentences transformed with TfidfVectorizer
    features_file - names of all words in the corpus used in TfidfVectorizer
    """
    vector_coo = transformed_file[x.name].tocoo()
    # Look the words up without overwriting the matrix's column indices
    words = features_file.iloc[vector_coo.col].values
    dict_from_coo = dict(zip(words, vector_coo.data))
    return dict_from_coo

def replace_tfidf_words(x, transformed_file, features_file):
    """
    Replace each word of the sentence with its tfidf score, taken from the
    dictionary computed for that sentence.

    x - row of the dataframe, containing a sentence; x.name is its row index,
    transformed_file - all sentences transformed with TfidfVectorizer
    features_file - names of all words in the corpus used in TfidfVectorizer
    """
    dictionary = create_tfidf_dictionary(x, transformed_file, features_file)
    try:
        res = [dictionary[word] for word in x['comments'].split()]
    except KeyError:
        # A word missing from the tfidf vocabulary zeroes the whole sentence
        res = [0 for _ in x['comments'].split()]
    return res

In [None]:
tfidfScores = dfCleanedReviews.apply(
    lambda x: replace_tfidf_words(x, transformed, features), axis=1)

### 3.5 Closeness score

As with the TF-IDF scores, each word in every review is replaced by its own
sentiment coefficient, that is, its closeness score signed by its cluster value.

In [None]:
dictSentiment2Clusters = dict(zip(
    dfWords2Clusters['words'].values,
    dfWords2Clusters['sentiment_coeff'].values))

In [None]:
def replace_sentiment_words(word, sentiment_dict):
    """
    Replace a word with its associated sentiment score from sentiment_dict;
    words that are not in the dictionary get a score of 0.
    """
    return sentiment_dict.get(word, 0)

In [None]:
closenessScores2Clusters = \
    dfCleanedReviews['comments'].apply(
        lambda x: list(map(
            lambda y: replace_sentiment_words(y, dictSentiment2Clusters),
            x.split())))

### 3.6 Sentiment score computation

A new dataframe is created by associating all the previously obtained values with
each review.

In [None]:
dfSentiment2ClustersTfidfReviews = \
    pd.DataFrame([closenessScores2Clusters,
                  tfidfScores,
                  dfCleanedReviews['comments']]).T

dfSentiment2ClustersTfidfReviews.columns = \
    ['sentiment_coeff', 'tfidf_scores', 'review']

dfSentiment2ClustersTfidfReviews['sentiment_rate'] = \
    dfSentiment2ClustersTfidfReviews.apply(
        lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']),
        axis=1)

dfSentiment2ClustersTfidfReviews['prediction'] =\
    (dfSentiment2ClustersTfidfReviews['sentiment_rate'] > 0)\
        .astype('int8')

dfSentiment2ClustersTfidfReviews.head()
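
The sentiment rate of a review is the dot product of its two per-word vectors. The following sketch computes it for one review with made-up values (both lists are assumptions for illustration), using plain Python instead of NumPy.

```python
# Per-word values for a four-word review (hypothetical numbers)
sentiment_coeff = [0.0, -1.8, 0.4, 0.0]   # 0.0 for words outside the Word2Vec vocabulary
tfidf_scores = [1.0, 3.2, 1.5, 1.0]

# Dot product: each word's sentiment is weighted by its tfidf score
sentiment_rate = sum(c * t for c, t in zip(sentiment_coeff, tfidf_scores))

# Same rule as above: strictly positive rates are predicted as positive (1)
prediction = int(sentiment_rate > 0)
```

Here the strongly negative word also has the highest tfidf weight, so it dominates the sum and the review is predicted as negative. A rate of exactly 0 is also mapped to 0 by this rule.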

As an example, the 5 most negative reviews according to our sentiment prediction
can also be shown.

In [None]:
dfNegativeSentiment = dfSentiment2ClustersTfidfReviews[
    dfSentiment2ClustersTfidfReviews['prediction'] == 0].sort_values(
        by=['sentiment_rate'])

print('Top-5 negative reviews:')
dfNegativeSentiment['review'].head().tolist()

Finally, the dataset can be saved into a .csv file.

In [None]:
# dfSentiment2ClustersTfidfReviews.to_csv(
#     'sentiment_dataset_2_clusters.csv',
#     sep=',', index=False, header=True)

### 3.7 TextBlob

To assess the quality of the obtained results, they can be compared with
the ones produced by an existing external library.
In particular the TextBlob library is used, which is built on top of NLTK.

In [None]:
from textblob import TextBlob

For each review, the sentiment score is computed by using TextBlob.

In [None]:
textblobSentiment = dfCleanedReviews['comments'].apply(
    lambda x: TextBlob(x).sentiment.polarity)

textblobSentiment.head()

The TextBlob sentiment score is associated with each review.

In [None]:
dfSentiment2ClustersTfidfReviews['textblob_sentiment'] = \
    textblobSentiment

dfSentiment2ClustersTfidfReviews['textblob_prediction'] = \
    (dfSentiment2ClustersTfidfReviews['textblob_sentiment'] > 0)\
        .astype('int8')

dfSentiment2ClustersTfidfReviews.head()

### 3.8 Sentiment Analysis Evaluation

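Finally, the obtained results can be evaluated with standard classification metrics
(accuracy, precision, recall and F1).
In particular, the TextBlob results are used as ground-truth labels, and the results of
the clustering-based approach are used as predictions, so the scores measure agreement
with TextBlob rather than agreement with human-annotated sentiment.
The sklearn library is used here as well.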

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def compute_test_scores(predictions, labels):

    df_conf_matrix = pd.DataFrame(confusion_matrix(labels, predictions))

    print(df_conf_matrix)

    test_scores = accuracy_score(labels, predictions), \
                  precision_score(labels, predictions), \
                  recall_score(labels, predictions), \
                  f1_score(labels, predictions)

    return test_scores
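
The four scores returned by `compute_test_scores` follow directly from the confusion-matrix counts. The sketch below recomputes them by hand for the positive class; `binary_scores` is a hypothetical helper, the labels and predictions are made up, and division by zero (for example when nothing is predicted positive) is not handled.

```python
def binary_scores(labels, predictions):
    """Accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l == 1 and p == 1)  # true positives
    fp = sum(1 for l, p in pairs if l == 0 and p == 1)  # false positives
    fn = sum(1 for l, p in pairs if l == 1 and p == 0)  # false negatives
    tn = sum(1 for l, p in pairs if l == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical ground truth and predictions
labels = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall, f1 = binary_scores(labels, predictions)
```

When most reviews are positive, accuracy alone can look good even for a weak classifier, which is why precision, recall and F1 are reported alongside it.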

In [None]:
testScores2ClustersSentiment = compute_test_scores(
    dfSentiment2ClustersTfidfReviews['prediction'],
    dfSentiment2ClustersTfidfReviews['textblob_prediction'])

dfTestScores2ClustersSentiment = pd.DataFrame([testScores2ClustersSentiment])
dfTestScores2ClustersSentiment.columns = ['accuracy', 'precision', 'recall', 'f1']
dfTestScores2ClustersSentiment = dfTestScores2ClustersSentiment.T
dfTestScores2ClustersSentiment.columns = ['scores']

print('Scores for sentiment analysis with 2 clusters: ')
dfTestScores2ClustersSentiment

The sentiment analysis with 2 clusters shows poor results.
Both the elbow and the silhouette methods can therefore be used to check
whether a different number of clusters gives a better clustering of the words.

### 3.9 Clustering evaluation

The Elbow Method is used to find the best number of clusters.

In [None]:
def kmeans_elbow_method(vectors):
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
        kmeans.fit(X=vectors)

        # inertia_ is the sum of squared distances of samples to their closest cluster center.
        wcss.append(kmeans.inertia_)
        print("inertia_", kmeans.inertia_)

    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()

In [None]:
kmeans_elbow_method(w2vModel.wv.vectors.astype('double'))
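
The quantity plotted by the Elbow Method, the within-cluster sum of squares (WCSS), can be computed by hand on a toy dataset. In this sketch the points and the cluster centers are chosen by hand (no K-means fit is run), so it only illustrates what `inertia_` measures.

```python
import math

# Four hypothetical points forming two obvious groups
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]


def inertia(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


wcss_k1 = inertia(points, [[5.0, 1.0]])                 # one center in the middle
wcss_k2 = inertia(points, [[0.0, 1.0], [10.0, 1.0]])    # one center per group
wcss_k4 = inertia(points, points)                       # one center per point
```

The drop from 1 to 2 clusters is large and the drop afterwards is small, which is the "elbow" the plot is inspected for. With real word vectors the curve is often smoother, and the elbow is harder to locate.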

From the Elbow Method plot above it is not easy to establish a definitive
number of clusters.
To be more precise, the Silhouette method can help identify
the right number of clusters.

In [None]:
from sklearn.metrics import silhouette_score

def kmeans_silhouette(x,range_clusters):
    for i, k in range_clusters :

        # Run the Kmeans algorithm
        km = KMeans(n_clusters = k, init = 'k-means++', random_state = 42)

        km.fit(x)
        labels = km.predict(x)

        print("For n_clusters =", k,
                  "The computed average silhouette_score is :",
              silhouette_score(x, labels, metric='euclidean'))

In [None]:
rangeClusters = enumerate([2,3,4,5,6,7,8,9,10])
kmeans_silhouette(w2vModel.wv.vectors.astype('double'), rangeClusters)
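
The average silhouette score can also be reproduced by hand. For each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The sketch below is a plain-Python illustration on made-up one-dimensional points; `silhouette` is a hypothetical helper, and points in single-element clusters are given a score of 0.

```python
import math


def silhouette(points, labels):
    """Average silhouette coefficient using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # single-element cluster
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == k)
            / labels.count(k)
            for k in set(labels) if k != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette(points, [0, 1, 0, 1])   # groups mixed together
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores indicate points that sit closer to another cluster than to their own.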

The obtained values show that the clustering with n_clusters = 3 could give better
results.

## 4. Sentiment analysis with 3 clusters

The steps performed are the same as in the previous analysis.
The steps concerning Word2Vec, TF-IDF and TextBlob are not repeated because their
results would be the same as before.

### 4.1 Clustering model

In [None]:
kmeansModel3Clusters = KMeans(n_clusters=3, max_iter=1000, random_state=42, n_init=50)
kmeansModel3Clusters.fit(X=w2vModel.wv.vectors.astype('double'))

In [None]:
w2vModel.wv.similar_by_vector(
    kmeansModel3Clusters.cluster_centers_[2], topn=10, restrict_vocab=None)

In [None]:
negativeClusterIndex = 2
positiveClusterIndex = 0

In [None]:
dfWords3Clusters = pd.DataFrame(
    w2vModel.wv.key_to_index.keys())

dfWords3Clusters.columns = ['words']

dfWords3Clusters['vectors'] = \
    dfWords3Clusters['words'].apply(
        lambda x: w2vModel.wv[f'{x}'])

dfWords3Clusters['cluster'] = \
    dfWords3Clusters['vectors'].apply(
        lambda x: kmeansModel3Clusters.predict([np.array(x)]))

dfWords3Clusters['cluster'] = \
    dfWords3Clusters['cluster'].apply(
        lambda x: x[0])

dfWords3Clusters.head()

In [None]:
dfWords3Clusters['cluster_value'] = \
    [-1 if i==negativeClusterIndex
     else 1 if i==positiveClusterIndex else 0
     for i in dfWords3Clusters['cluster']]

dfWords3Clusters['closeness_score'] = \
    dfWords3Clusters.apply(
        lambda x: 1/(kmeansModel3Clusters.transform([x.vectors]).min()),
        axis=1)

dfWords3Clusters['sentiment_coeff'] = \
    dfWords3Clusters['closeness_score'] * \
    dfWords3Clusters['cluster_value']

dfWords3Clusters[
    dfWords3Clusters['cluster_value'] == -1].head()

### 4.2 Closeness score

In [None]:
dictSentiment3Clusters = dict(zip(
    dfWords3Clusters['words'].values,
    dfWords3Clusters['sentiment_coeff'].values))

In [None]:
closenessScores3Clusters = \
    dfCleanedReviews['comments'].apply(
        lambda x: list(map(
            lambda y: replace_sentiment_words(y, dictSentiment3Clusters),
            x.split())))

### 4.3 Sentiment score computation

In [None]:
dfSentiment3ClustersTfidfReviews = pd.DataFrame(
    [closenessScores3Clusters,
     tfidfScores,
     dfCleanedReviews['comments']]).T

dfSentiment3ClustersTfidfReviews.columns = \
    ['sentiment_coeff', 'tfidf_scores', 'review']

dfSentiment3ClustersTfidfReviews['sentiment_rate'] = \
    dfSentiment3ClustersTfidfReviews.apply(
        lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']),
        axis=1)

dfSentiment3ClustersTfidfReviews['prediction'] = \
    (dfSentiment3ClustersTfidfReviews['sentiment_rate']>0)\
        .astype('int8')

dfSentiment3ClustersTfidfReviews.head()

In [None]:
# dfSentiment3ClustersTfidfReviews.to_csv(
#     'sentiment_dataset_3_clusters.csv',
#     sep=',', index=False, header=True)

In [None]:
dfNegativeSentiment = dfSentiment3ClustersTfidfReviews[
    dfSentiment3ClustersTfidfReviews['prediction'] == 0].sort_values(
        by=['sentiment_rate'])

dfNegativeSentiment['review'].head().tolist()

### 4.4 TextBlob

In [None]:
dfSentiment3ClustersTfidfReviews['textblob_sentiment'] = \
    textblobSentiment

dfSentiment3ClustersTfidfReviews['textblob_prediction'] = \
    (dfSentiment3ClustersTfidfReviews['textblob_sentiment'] > 0)\
        .astype('int8')

dfSentiment3ClustersTfidfReviews.head()

### 4.5 Sentiment analysis evaluation

In [None]:
testScores3ClustersSentiment = compute_test_scores(
    dfSentiment3ClustersTfidfReviews['prediction'],
    dfSentiment3ClustersTfidfReviews['textblob_prediction'])

dfTestScores3ClustersSentiment = pd.DataFrame([testScores3ClustersSentiment])
dfTestScores3ClustersSentiment.columns = ['accuracy', 'precision', 'recall', 'f1']
dfTestScores3ClustersSentiment = dfTestScores3ClustersSentiment.T
dfTestScores3ClustersSentiment.columns = ['scores']

print('Scores for sentiment analysis with 3 clusters: ')
dfTestScores3ClustersSentiment

The sentiment analysis with 3 clusters shows better results compared to the
sentiment analysis with 2 clusters.
However, it could be interesting to investigate the results that can be obtained by
removing the stop words.

## 5. Sentiment Analysis without stop words

The performed steps are the same as before.

### 5.1 Stop words removal

To obtain a list of common English stop words, the 'stopwords'
package of the 'nltk' library is used.

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

stopWords = stopwords.words('english')
stopWords[:10]

In [None]:
def remove_stop_words(tokenized_reviews, stop_words):
    # A set makes each membership test constant-time instead of a list scan
    stop_words = set(stop_words)
    tokenized_reviews_without_stopwords = []
    for tokenized_review in tokenized_reviews:
        tokenized_reviews_without_stopwords.append(
            [word for word in tokenized_review if word not in stop_words]
        )
    return tokenized_reviews_without_stopwords

In [None]:
lemmatizedTokenizedEnglishReviewsWithoutStopWords = remove_stop_words(
    lemmatizedTokenizedEnglishReviews, stopWords)
lemmatizedTokenizedEnglishReviewsWithoutStopWords[:5]

### 5.2 Bigrams generation

In [None]:
phrasesWithoutStopWords = Phrases(lemmatizedTokenizedEnglishReviewsWithoutStopWords,
                                  min_count=3,
                                  progress_per=50000)

bigramWithoutStopWords = Phraser(phrasesWithoutStopWords)

bigramReviewsWithoutStopWords = bigramWithoutStopWords[
    lemmatizedTokenizedEnglishReviewsWithoutStopWords]

bigramReviewsWithoutStopWords[0]

In [None]:
dictWordFreq = defaultdict(int)
for review in bigramReviewsWithoutStopWords:
    for i in review:
        dictWordFreq[i] += 1

len(dictWordFreq)

In [None]:
sorted(dictWordFreq, key=dictWordFreq.get, reverse=True)[:10]

### 5.3 Word2Vec model

In [None]:
if os.path.exists('word2vec_no_stopwords.model'):
    w2vModelWithoutStopWords = Word2Vec.load('word2vec_no_stopwords.model')
else:
    w2vModelWithoutStopWords = Word2Vec(min_count=20,
                                 window=4,
                                 vector_size=300,
                                 sample=6e-5,
                                 alpha=0.03,
                                 min_alpha=0.0007,
                                 negative=20,
                                 workers=4)

    t = time()
    w2vModelWithoutStopWords.build_vocab(bigramReviewsWithoutStopWords, progress_per=10000)
    print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

    t = time()
    w2vModelWithoutStopWords.train(bigramReviewsWithoutStopWords,
                                   total_examples=w2vModelWithoutStopWords.corpus_count,
                                   epochs=30,
                                   report_delay=1)
    print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

    w2vModelWithoutStopWords.save("word2vec_no_stopwords.model")

In [None]:
w2vModelWithoutStopWords.wv.most_similar(positive=["apartment"])

### 5.4 Clustering model

In [None]:
kmeansModel2Clusters = KMeans(n_clusters=2, max_iter=1000, random_state=42, n_init=50)
kmeansModel2Clusters.fit(X=w2vModelWithoutStopWords.wv.vectors.astype('double'))

In [None]:
w2vModelWithoutStopWords.wv.similar_by_vector(kmeansModel2Clusters.cluster_centers_[0],
                                              topn=10,
                                              restrict_vocab=None)

In [None]:
negativeClusterIndex = 0

In [None]:
dfWords2ClustersWithoutStopWords = pd.DataFrame(
    w2vModelWithoutStopWords.wv.key_to_index.keys())

dfWords2ClustersWithoutStopWords.columns = ['words']

dfWords2ClustersWithoutStopWords['vectors'] = \
    dfWords2ClustersWithoutStopWords['words'].apply(
        lambda x: w2vModelWithoutStopWords.wv[f'{x}'])

dfWords2ClustersWithoutStopWords['cluster'] = \
    dfWords2ClustersWithoutStopWords['vectors'].apply(
        lambda x: kmeansModel2Clusters.predict([np.array(x)]))

dfWords2ClustersWithoutStopWords['cluster'] = \
    dfWords2ClustersWithoutStopWords['cluster'].apply(
        lambda x: x[0])

dfWords2ClustersWithoutStopWords.head()

In [None]:
dfWords2ClustersWithoutStopWords['cluster_value'] = [
    -1 if i==negativeClusterIndex else 1
    for i in dfWords2ClustersWithoutStopWords['cluster']]

dfWords2ClustersWithoutStopWords['closeness_score'] = \
    dfWords2ClustersWithoutStopWords.apply(
        lambda x: 1/(kmeansModel2Clusters.transform([x.vectors]).min()),
        axis=1)

dfWords2ClustersWithoutStopWords['sentiment_coeff'] = \
    dfWords2ClustersWithoutStopWords['closeness_score'] * \
    dfWords2ClustersWithoutStopWords['cluster_value']

dfWords2ClustersWithoutStopWords[
    dfWords2ClustersWithoutStopWords['cluster_value'] == -1].head()

### 5.5 TF-IDF

In [None]:
dfCleanedReviewsWithoutStopWords = pd.DataFrame(
    [' '.join(review) for review in lemmatizedTokenizedEnglishReviewsWithoutStopWords],
    columns=['comments'])

dfCleanedReviewsWithoutStopWords.head()

In [None]:
tfidfWithoutStopWords = TfidfVectorizer(norm=None)
transformed = tfidfWithoutStopWords.fit_transform(
    dfCleanedReviewsWithoutStopWords['comments'].tolist())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = pd.Series(tfidfWithoutStopWords.get_feature_names_out())

In [None]:
tfidfScoresWithoutStopWords = dfCleanedReviewsWithoutStopWords.apply(
    lambda x: replace_tfidf_words(x, transformed, features), axis=1)

### 5.6 Closeness score

In [None]:
dictSentiment2ClustersWithoutStopWords = dict(zip(
    dfWords2ClustersWithoutStopWords['words'].values,
    dfWords2ClustersWithoutStopWords['sentiment_coeff'].values))

In [None]:
closenessScores2ClustersWithoutStopWords = \
    dfCleanedReviewsWithoutStopWords['comments'].apply(
        lambda x: list(map(
            lambda y: replace_sentiment_words(y, dictSentiment2ClustersWithoutStopWords),
            x.split())))

### 5.7 Sentiment score computation

In [None]:
dfSentiment2ClustersTfidfReviewsWithoutStopWords = \
    pd.DataFrame([closenessScores2ClustersWithoutStopWords,
                  tfidfScoresWithoutStopWords,
                  dfCleanedReviewsWithoutStopWords['comments']]).T

dfSentiment2ClustersTfidfReviewsWithoutStopWords.columns = \
    ['sentiment_coeff', 'tfidf_scores', 'review']

dfSentiment2ClustersTfidfReviewsWithoutStopWords['sentiment_rate'] = \
    dfSentiment2ClustersTfidfReviewsWithoutStopWords.apply(
        lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']),
        axis=1)

dfSentiment2ClustersTfidfReviewsWithoutStopWords['prediction'] =\
    (dfSentiment2ClustersTfidfReviewsWithoutStopWords['sentiment_rate'] > 0)\
        .astype('int8')

dfSentiment2ClustersTfidfReviewsWithoutStopWords.head()

In [None]:
dfNegativeSentiment = dfSentiment2ClustersTfidfReviewsWithoutStopWords[
    dfSentiment2ClustersTfidfReviewsWithoutStopWords['prediction'] == 0].sort_values(
        by=['sentiment_rate'])

print('Top-5 negative reviews:')
dfNegativeSentiment['review'].head().tolist()

In [None]:
# dfSentiment2ClustersTfidfReviewsWithoutStopWords.to_csv(
#     'sentiment_dataset_2_clusters_no_stopwords.csv',
#     sep=',', index=False, header=True)

### 5.8 TextBlob

In [None]:
textblobSentimentWithoutStopWords = dfCleanedReviewsWithoutStopWords['comments'].apply(
    lambda x: TextBlob(x).sentiment.polarity)

textblobSentimentWithoutStopWords.head()

In [None]:
dfSentiment2ClustersTfidfReviewsWithoutStopWords['textblob_sentiment'] = \
    textblobSentimentWithoutStopWords

dfSentiment2ClustersTfidfReviewsWithoutStopWords['textblob_prediction'] = \
    (dfSentiment2ClustersTfidfReviewsWithoutStopWords['textblob_sentiment'] > 0)\
        .astype('int8')

dfSentiment2ClustersTfidfReviewsWithoutStopWords.head()

### 5.9 Sentiment analysis evaluation

In [None]:
testScores2ClustersWithoutStopWordsSentiment = compute_test_scores(
    dfSentiment2ClustersTfidfReviewsWithoutStopWords['prediction'],
    dfSentiment2ClustersTfidfReviewsWithoutStopWords['textblob_prediction'])

dfTestScores2ClustersWithoutStopWordsSentiment = pd.DataFrame(
    [testScores2ClustersWithoutStopWordsSentiment])

dfTestScores2ClustersWithoutStopWordsSentiment.columns = \
    ['accuracy', 'precision', 'recall', 'f1']

dfTestScores2ClustersWithoutStopWordsSentiment = \
    dfTestScores2ClustersWithoutStopWordsSentiment.T

dfTestScores2ClustersWithoutStopWordsSentiment.columns = ['scores']

print('Scores for sentiment analysis with 2 clusters and no stopwords: ')
dfTestScores2ClustersWithoutStopWordsSentiment

The obtained results seem to be very good.
Here too, it is possible to check whether a better value exists for the
number of clusters.

### 5.10 Clustering evaluation

In [None]:
kmeans_elbow_method(w2vModelWithoutStopWords.wv.vectors.astype('double'))

In [None]:
rangeClusters = enumerate([2,3,4,5,6,7,8,9,10])
kmeans_silhouette(w2vModelWithoutStopWords.wv.vectors.astype('double'), rangeClusters)

Although the silhouette method yields a worse score for n_clusters = 3, it is still
interesting to check how much the model accuracy changes.

## 6. Sentiment analysis without stop words with 3 clusters

The performed steps are the same as before.

### 6.1 Clustering model

In [None]:
kmeansModel3Clusters = KMeans(n_clusters=3, max_iter=1000, random_state=42, n_init=50)
kmeansModel3Clusters.fit(X=w2vModelWithoutStopWords.wv.vectors.astype('double'))

In [None]:
w2vModelWithoutStopWords.wv.similar_by_vector(
    kmeansModel3Clusters.cluster_centers_[2], topn=10, restrict_vocab=None)

In [None]:
negativeClusterIndex = 2
positiveClusterIndex = 0

In [None]:
dfWords3ClustersWithoutStopWords = pd.DataFrame(
    w2vModelWithoutStopWords.wv.key_to_index.keys())

dfWords3ClustersWithoutStopWords.columns = ['words']

dfWords3ClustersWithoutStopWords['vectors'] = \
    dfWords3ClustersWithoutStopWords['words'].apply(
        lambda x: w2vModelWithoutStopWords.wv[f'{x}'])

dfWords3ClustersWithoutStopWords['cluster'] = \
    dfWords3ClustersWithoutStopWords['vectors'].apply(
        lambda x: kmeansModel3Clusters.predict([np.array(x)]))

dfWords3ClustersWithoutStopWords.cluster = \
    dfWords3ClustersWithoutStopWords['cluster'].apply(
        lambda x: x[0])

dfWords3ClustersWithoutStopWords.head()

In [None]:
dfWords3ClustersWithoutStopWords['cluster_value'] = \
    [-1 if i==negativeClusterIndex
     else 1 if i==positiveClusterIndex else 0
     for i in dfWords3ClustersWithoutStopWords['cluster']]

dfWords3ClustersWithoutStopWords['closeness_score'] = \
    dfWords3ClustersWithoutStopWords.apply(
        lambda x: 1/(kmeansModel3Clusters.transform([x.vectors]).min()),
        axis=1)

dfWords3ClustersWithoutStopWords['sentiment_coeff'] = \
    dfWords3ClustersWithoutStopWords['closeness_score'] * \
    dfWords3ClustersWithoutStopWords['cluster_value']

dfWords3ClustersWithoutStopWords[
    dfWords3ClustersWithoutStopWords['cluster_value'] == -1].head()

### 6.2 Closeness score

In [None]:
dictSentiment3ClustersWithoutStopWords = dict(zip(
    dfWords3ClustersWithoutStopWords['words'].values,
    dfWords3ClustersWithoutStopWords['sentiment_coeff'].values))

In [None]:
closenessScores3ClustersWithoutStopWords = \
    dfCleanedReviewsWithoutStopWords['comments'].apply(
        lambda x: list(map(
            lambda y: replace_sentiment_words(y, dictSentiment3ClustersWithoutStopWords),
            x.split())))

### 6.3 Sentiment score computation

In [None]:
dfSentiment3ClustersTfidfReviewsWithoutStopWords = pd.DataFrame(
    [closenessScores3ClustersWithoutStopWords,
     tfidfScoresWithoutStopWords,
     dfCleanedReviewsWithoutStopWords['comments']]).T

dfSentiment3ClustersTfidfReviewsWithoutStopWords.columns = \
    ['sentiment_coeff', 'tfidf_scores', 'review']

dfSentiment3ClustersTfidfReviewsWithoutStopWords['sentiment_rate'] = \
    dfSentiment3ClustersTfidfReviewsWithoutStopWords.apply(
        lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']),
        axis=1)

dfSentiment3ClustersTfidfReviewsWithoutStopWords['prediction'] = \
    (dfSentiment3ClustersTfidfReviewsWithoutStopWords['sentiment_rate']>0)\
        .astype('int8')

dfSentiment3ClustersTfidfReviewsWithoutStopWords.head()
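
The sentiment rate computed above is the dot product between the per-word sentiment
coefficients and the per-word TF-IDF scores of a review. The following is a minimal
sketch of that arithmetic for a single three-word review; all the numbers are
hypothetical and only illustrate the computation, they do not come from the dataset.

```python
import numpy as np

# Hypothetical per-word values for a three-word review.
cluster_values = np.array([1, -1, 0])            # +1 positive, -1 negative, 0 neutral
centroid_distances = np.array([0.5, 0.25, 2.0])  # distance to the nearest centroid
tfidf_scores = np.array([0.5, 0.75, 1.0])

# Words closer to their centroid get a larger weight (closeness = 1 / distance).
sentiment_coeff = cluster_values / centroid_distances

# The review-level score is the TF-IDF weighted sum of the word coefficients.
sentiment_rate = float(sentiment_coeff @ tfidf_scores)
prediction = int(sentiment_rate > 0)

print(sentiment_coeff, sentiment_rate, prediction)
```

In this sketch the negative word lies very close to its centroid, so it dominates the
sum and the review is labelled as negative (prediction 0).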

In [None]:
# dfSentiment3ClustersTfidfReviewsWithoutStopWords.to_csv(
#     'sentiment_dataset_3_clusters_no_stopwords.csv',
#     sep=',', index=False, header=True)

In [None]:
dfNegativeSentiment = dfSentiment3ClustersTfidfReviewsWithoutStopWords[
    dfSentiment3ClustersTfidfReviewsWithoutStopWords['prediction'] == 0].sort_values(
        by=['sentiment_rate'])

dfNegativeSentiment['review'].head().tolist()

### 6.4 TextBlob

In [None]:
dfSentiment3ClustersTfidfReviewsWithoutStopWords['textblob_sentiment'] = \
    textblobSentimentWithoutStopWords

dfSentiment3ClustersTfidfReviewsWithoutStopWords['textblob_prediction'] = \
    (dfSentiment3ClustersTfidfReviewsWithoutStopWords['textblob_sentiment'] > 0)\
        .astype('int8')

dfSentiment3ClustersTfidfReviewsWithoutStopWords.head()

### 6.5 Sentiment analysis evaluation

In [None]:
testScores3ClustersWithoutStopWordsSentiment = compute_test_scores(
    dfSentiment3ClustersTfidfReviewsWithoutStopWords['prediction'],
    dfSentiment3ClustersTfidfReviewsWithoutStopWords['textblob_prediction'])

dfTestScores3ClustersWithoutStopWordsSentiment = pd.DataFrame(
    [testScores3ClustersWithoutStopWordsSentiment])

dfTestScores3ClustersWithoutStopWordsSentiment.columns = \
    ['accuracy', 'precision', 'recall', 'f1']

dfTestScores3ClustersWithoutStopWordsSentiment = \
    dfTestScores3ClustersWithoutStopWordsSentiment.T

dfTestScores3ClustersWithoutStopWordsSentiment.columns = ['scores']

print('Scores for sentiment analysis with 3 clusters and no stopwords: ')
dfTestScores3ClustersWithoutStopWordsSentiment


The obtained results are considerably worse than expected.
A decrease in the accuracy score was reasonable to expect, but not such a marked one.

In [None]:
import calendar
from datetime import date

## 7. Relationships' investigation between price and calendar features

This analysis concerns the investigation of some relationships between the price
of a room, or listing, and some time periods.
In particular, the considered time periods are the days of the week, the seasons
of the year and the holidays periods.

### 7.1 Import of the dataset

In [None]:
zf = zipfile.ZipFile('berlin-airbnb-data.zip')
dfPricesDates = pd.read_csv(zf.open('calendar_summary.csv'))
dfPricesDates.head()

### 7.2 Columns exploration

Once the dataset is imported, its columns need to be explored to check whether
some data preprocessing technique should be applied.

The first step concerns the removal of the null data-points.

In [None]:
dfPricesDates.info()

In [None]:
print('Number of null rows:', dfPricesDates['price'].isnull().sum())

In [None]:
dfPricesDates.dropna(axis=0, how='any', inplace=True)
dfPricesDates.info()

Since the 'available' column only contains a single value, it can be dropped.

In [None]:
dfPricesDates['available'].unique()

In [None]:
dfPricesDates.drop(columns=['available'], inplace=True)
dfPricesDates.head()

In order to deal with the 'price' column, its values need to be converted into
numeric values.

In [None]:
type(dfPricesDates['price'].iloc[0])

In [None]:
dfPricesDates['price'] = dfPricesDates['price'].apply(
    lambda x: x.replace(',', ''))

dfPricesDates['price'] = dfPricesDates['price'].apply(
    lambda x: float(x[1:]))

dfPricesDates.info()
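
As an alternative to the two `apply` calls, the same conversion can be sketched with
the vectorised string methods of pandas. The helper name `parse_price` and the sample
values are hypothetical; the sketch assumes that every price is a string such as
'$1,250.00'.

```python
import pandas as pd

def parse_price(prices: pd.Series) -> pd.Series:
    """Convert strings such as '$1,250.00' into floats."""
    # Remove the currency symbol and the thousands separator in one pass.
    return prices.str.replace(r'[$,]', '', regex=True).astype(float)

example = pd.Series(['$45.00', '$1,250.00'])  # hypothetical sample values
parsed = parse_price(example)
print(parsed.tolist())
```

Unlike slicing with `x[1:]`, the regular expression does not depend on the currency
symbol being the first character of the string.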

Finally, the type of the 'date' column is also converted into datetime.

In [None]:
dfPricesDates.info()

In [None]:
dfPricesDates['date'] = pd.to_datetime(dfPricesDates['date'])
dfPricesDates.info()

In [None]:
dfPricesDates.head()

### 7.3 Price vs Day of the week

The first step concerns the creation of a new column for the day of the week.
In particular, through the use of the calendar library, the day of the week is
computed from the corresponding date.

In [None]:
dfPricesDates['weekday'] = dfPricesDates.apply(
    lambda x: calendar.day_name[x['date'].weekday()], axis=1)

dfPricesDates.head()
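
Since the 'date' column has already been converted into datetime, the day names can
also be obtained with the vectorised `dt.day_name()` accessor, which avoids the
row-wise `apply`. The sketch below uses a small hypothetical sample in place of
dfPricesDates and checks that both approaches agree.

```python
import calendar
import pandas as pd

# Hypothetical sample standing in for dfPricesDates['date'].
dates = pd.Series(pd.to_datetime(['2018-11-05', '2018-11-10', '2018-11-11']))

# Vectorised accessor: no Python-level loop over the rows.
weekday_vectorised = dates.dt.day_name()

# Row-wise variant based on the calendar module, as used above.
weekday_calendar = dates.apply(lambda d: calendar.day_name[d.weekday()])

print(weekday_vectorised.tolist())
```

On a large calendar table the vectorised accessor is usually noticeably faster, while
the result is the same as long as the default English locale is in use.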

Then, it is possible to show, for each day of the week, the average price of
all the listings.

In [None]:
weekdayMeans = dfPricesDates.groupby('weekday').price.mean()
plt.bar(weekdayMeans.index, weekdayMeans)

A new dataframe is created by computing, for each listing and for each day of the
week, the average price.

In [None]:
listingWeekdayMeans = dfPricesDates.groupby(['listing_id', 'weekday']).price.mean()
listingWeekdayMeans.head()

In order to work on the dataframe, the days of the week are converted into columns.

In [None]:
dfListingWeekdayMeans = listingWeekdayMeans.unstack(level=1)
dfListingWeekdayMeans.head()

Finally, the correlation between the days of the week is computed.
In particular, the kendall's tau coefficient is used to measure the correlation
degree.

In [None]:
dfListingWeekdayMeans.corr(method='kendall')

The correlation matrix shows that all the features are very highly correlated.
This suggests that, on average, the prices of the listings differ only very slightly
across the days of the week.
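
When reading the matrix, it may help to recall that Kendall's tau measures how well the
ranking of the listings is preserved between two columns, not how close the price
levels are. The following sketch implements the simple tau-a variant in pure Python
(pandas uses the tie-corrected tau-b) on hypothetical prices, where the Saturday price
is always 20% higher than the Monday price.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

monday = [30.0, 45.0, 60.0, 120.0]        # hypothetical mean prices per listing
saturday = [p * 1.2 for p in monday]      # same ranking, higher level

tau_same_ranking = kendall_tau(monday, saturday)
tau_reversed = kendall_tau(monday, saturday[::-1])
print(tau_same_ranking, tau_reversed)
```

For this reason, a comparison of the average prices, such as the bar plot shown
earlier, is a useful complement to the correlation matrix.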

### 7.4 Price vs Season

Even in this case, the first step concerns the creation of a new column for the
season of the year.

In [None]:
defaultYear = 2000
seasons = [('winter', (date(defaultYear,  1,  1),  date(defaultYear,  3, 20))),
           ('spring', (date(defaultYear,  3, 21),  date(defaultYear,  6, 20))),
           ('summer', (date(defaultYear,  6, 21),  date(defaultYear,  9, 22))),
           ('autumn', (date(defaultYear,  9, 23),  date(defaultYear, 12, 20))),
           ('winter', (date(defaultYear, 12, 21),  date(defaultYear, 12, 31)))]

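def get_season(curr_date):
    # Rebuild the date in the reference year as a plain datetime.date, so that the
    # comparison also works when curr_date is a pandas Timestamp.
    curr_date = date(defaultYear, curr_date.month, curr_date.day)
    return next(season for season, (start, end) in seasons
                if start <= curr_date <= end)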

In [None]:
dfPricesDates['season'] = dfPricesDates['date'].apply(get_season)

dfPricesDates.head()

Then, it is possible to show, for each season of the year, the average price of
all the listings.

In [None]:
seasonMeans = dfPricesDates.groupby('season').price.mean()
plt.bar(seasonMeans.index, seasonMeans)

A new dataframe is created by computing, for each listing and for each season of the
year, the average price.

In [None]:
listingSeasonMeans = dfPricesDates.groupby(['listing_id', 'season']).price.mean()
listingSeasonMeans.head()

In order to work on the dataframe, the seasons of the year are converted into columns.

In [None]:
dfListingSeasonMeans = listingSeasonMeans.unstack(level=1)
dfListingSeasonMeans.head()

Some listings show NaN values when the mean of the price for a season is considered.
This may be caused by the unavailability of the listing during that considered period.
So, these listings are not considered in the correlation analysis.

In [None]:
dfListingSeasonMeans.info()

In [None]:
dfListingSeasonNonNullMeans = dfListingSeasonMeans.dropna(axis=0, how='any')
dfListingSeasonNonNullMeans.info()

Finally, the correlation between the seasons of the year is computed.

In [None]:
dfListingSeasonNonNullMeans.corr(method='kendall')

Even in this case, the correlation matrix shows that all the features are very highly
correlated.
This suggests that, on average, the prices of the listings differ only very slightly
across the seasons.

### 7.5 Price vs Holiday

Even in this case, the first step concerns the creation of a new column for the
holidays.
In order to check whether a certain date is considered as holiday, the holidays
library is used.

In [None]:
import holidays

germanHolidays = holidays.Germany()

In [None]:
dfPricesDates['holidays'] = dfPricesDates.apply(
    lambda x: int(x['date'] in germanHolidays or x['weekday'] == 'Sunday'), axis=1)

dfPricesDates.head()

Then, it is possible to show, for both the holidays and the non-holidays dates, the
average price of all the listings.

In [None]:
holidaysMeans = dfPricesDates.groupby('holidays').price.mean()
plt.bar(holidaysMeans.index, holidaysMeans)

A new dataframe is created by computing, for each listing and for both holiday and
non-holiday days, the average price.

In [None]:
listingHolidaysMeans = dfPricesDates.groupby(['listing_id', 'holidays']).price.mean()
listingHolidaysMeans.head()

In order to work on the dataframe, the holidays' values are converted into columns.

In [None]:
dfListingHolidaysMeans = listingHolidaysMeans.unstack(level=1)
dfListingHolidaysMeans.head()

Some listings show NaN values when the mean price for holidays or non-holidays is
considered.
This may be caused by the unavailability of the listing during that period.
So, these listings are not considered in the correlation analysis.

In [None]:
dfListingHolidaysMeans.info()

In [None]:
dfListingHolidaysNonNullMeans = dfListingHolidaysMeans.dropna(axis=0, how='any')
dfListingHolidaysNonNullMeans.info()

Finally, the correlation between the holidays and non-holidays dates is computed.

In [None]:
dfListingHolidaysNonNullMeans.corr(method='kendall')

Even in this case, the correlation matrix shows that all the features are very highly
correlated.
This suggests that, on average, the prices of the listings differ only very slightly
between holiday and non-holiday dates.

## 8. Relationships' investigation between price and rooms' characteristics

This analysis concerns the investigation of some relationships between the price
of a room, or listing, and its own characteristics.
Then, the relationships should be used to create a model that is able to make
predictions.

### 8.1 Import of the datasets

In [None]:
dfListingsSummary = pd.read_csv(zf.open('listings_summary.csv'))
dfListingsSummary.head()

### 8.2 Columns exploration

Once the dataset is imported, its columns need to be explored to check which
of them describe a listing's characteristics.

In [None]:
dfListingsSummary.columns

Firstly, the columns concerning the room characteristics and the prices are selected.

In [None]:
dfPricesListingTypes = dfListingsSummary[['id', 'room_type', 'accommodates',
                                          'bathrooms', 'bedrooms', 'beds', 'bed_type',
                                          'amenities', 'square_feet', 'price',
                                          'weekly_price', 'monthly_price',
                                          'security_deposit', 'cleaning_fee']]

dfPricesListingTypes.head()

In [None]:
dfPricesListingTypes.info()

Looking at the null data-points, the columns 'square_feet', 'weekly_price',
'monthly_price', 'security_deposit' and 'cleaning_fee' can be dropped.

In [None]:
dfPricesListingTypes = dfPricesListingTypes.drop(
    columns=['square_feet', 'weekly_price', 'monthly_price', 'security_deposit',
             'cleaning_fee'])

dfPricesListingTypes.head()

Since it is not needed in order to make predictions, the column 'id' is dropped.

In [None]:
dfPricesListingTypes = dfPricesListingTypes.drop(columns=['id'])
dfPricesListingTypes.head()

The column 'price' must be converted into numeric values.

In [None]:
dfPricesListingTypes['price'] = dfPricesListingTypes['price'].apply(
    lambda x: x.replace(',', ''))

dfPricesListingTypes['price'] = dfPricesListingTypes['price'].apply(
    lambda x: float(x[1:]))

dfPricesListingTypes.info()

Instead of directly dealing with the column 'amenities', it can be replaced
by the number of amenities each listing contains.

In [None]:
dfPricesListingTypes['amenities'].unique()

In [None]:
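# The amenities string is assumed to look like '{TV,Wifi,Kitchen}': strip the braces
# and count the comma-separated entries (an empty set '{}' counts as 0).
dfPricesListingTypes['amenities_nr'] = dfPricesListingTypes['amenities'].apply(
    lambda x: len(x.strip('{}').split(',')) if x.strip('{}') else 0)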

dfPricesListingTypes = dfPricesListingTypes.drop(columns=['amenities'])

dfPricesListingTypes.head()

In order to deal with the column 'room_type', the corresponding dummies are generated
and inserted into the dataframe.

In [None]:
dummiesRoomType = pd.get_dummies(dfPricesListingTypes['room_type'])
dummiesRoomType.head()

In [None]:
dfPricesListingTypes = dfPricesListingTypes.drop(columns=['room_type'])
dfPricesListingTypes = pd.concat([dfPricesListingTypes, dummiesRoomType], axis=1)
dfPricesListingTypes.head()
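
The dummy encoding used above can be illustrated on a tiny hypothetical dataframe:
each distinct value of the categorical column becomes its own indicator column, and
the original column is removed.

```python
import pandas as pd

# Hypothetical listings standing in for dfPricesListingTypes.
listings = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Shared room', 'Private room'],
    'beds': [1, 2, 1, 1]})

# One indicator column per distinct room type.
dummies = pd.get_dummies(listings['room_type'])

# Replace the categorical column with its indicator columns.
encoded = pd.concat([listings.drop(columns=['room_type']), dummies], axis=1)
print(encoded.columns.tolist())
```

Each row has exactly one active indicator, so the indicator columns always sum to 1.
For a linear model this makes one of them redundant; passing `drop_first=True` to
`pd.get_dummies` is one common way to remove it.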

In order to deal with the column 'bed_type', the corresponding dummies are generated
and inserted into the dataframe.

In [None]:
dummiesBedType = pd.get_dummies(dfPricesListingTypes['bed_type'])
dummiesBedType.head()

In [None]:
dfPricesListingTypes = dfPricesListingTypes.drop(columns=['bed_type'])
dfPricesListingTypes = pd.concat([dfPricesListingTypes, dummiesBedType], axis=1)
dfPricesListingTypes.head()

Finally, the data-points containing null values must be removed.

In [None]:
dfPricesListingTypes.dropna(axis=0, how='any', inplace=True)
dfPricesListingTypes.info()

### 8.3 Outliers analysis

Another important step concerns the analysis of the column 'price' in order to check
whether it contains some outliers.

Firstly, the heatmap concerning the correlation between the selected columns is shown.

In [None]:
def plot_correlation_matrix(df):
    sns.set(font_scale = 1.4)
    plt.figure(figsize = (18,18))
    sns.heatmap(df.corr(method='kendall'),
                cmap=plt.cm.Blues,
                annot = True)
    plt.gca().set_title('Price VS Listing characteristics', fontsize = 15)
    plt.show()

In [None]:
plot_correlation_matrix(dfPricesListingTypes)

According to the heatmap, the column 'price' shows a slight degree of correlation with
only a few columns in the dataframe.
In particular, it shows slight positive correlation with both 'accommodates' and
'Entire home/apt', and a slight negative correlation with 'Private room'.

In [None]:
sns.boxplot(x=dfPricesListingTypes['price'])

The boxplot shows the existence of some outliers.

So, based on the mean and the standard deviation of the prices, an upper and a lower
limit are computed.

In [None]:
def get_outliers_limits(df_column, factor):
    # The limits lie `factor` standard deviations above and below the mean.
    upper_lim = df_column.mean() + df_column.std() * factor
    lower_lim = df_column.mean() - df_column.std() * factor

    return upper_lim, lower_lim

In [None]:
upperLim, lowerLim = get_outliers_limits(dfPricesListingTypes['price'], 3)

print("upper_lim:", upperLim)
print("lower_lim:", lowerLim)

Since the values in the column 'price' are always positive, the lower limit can be
discarded.

In [None]:
dfPricesListingTypesWithoutOutliers = \
    dfPricesListingTypes[dfPricesListingTypes['price'] < upperLim]

print('Number of outliers removed:',
      dfPricesListingTypes.shape[0] - dfPricesListingTypesWithoutOutliers.shape[0])

In [None]:
sns.boxplot(x=dfPricesListingTypesWithoutOutliers['price'])

### 8.4 Regression

In order to make predictions on the column 'price', a regression model can be
built.

#### 8.4.1 Regression with numerical features

Firstly, it is possible to start building the regression model by using just the
features that were initially provided as numerical.

In [None]:
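# Work on an explicit copy, so that columns can be added later without a
# SettingWithCopyWarning.
dfPricesNumericalListingTypes = \
    dfPricesListingTypesWithoutOutliers[['accommodates', 'bathrooms', 'bedrooms',
                                         'beds', 'price']].copy()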

dfPricesNumericalListingTypes.head()

Once the numerical features are selected, the heatmap concerning the correlation
between them can be shown.

In [None]:
plot_correlation_matrix(dfPricesNumericalListingTypes)

Before the creation of the regression model, the available data-points should be
split into train and test data.

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

def get_train_test(features, target):
    # Hold out 30% of the data-points for testing; the seed makes the split reproducible.
    train_x, test_x, train_y, test_y = train_test_split(
        features, target, test_size=0.3, shuffle=True, random_state=42)

    return train_x, test_x, train_y, test_y

In [None]:
trainXNumerical, testXNumerical, trainYNumerical, testYNumerical = \
    get_train_test(dfPricesNumericalListingTypes.iloc[:,:4],
                   dfPricesNumericalListingTypes['price'])

trainXNumerical.head()

Once the data has been split, a linear regression model can be built and trained.

In [None]:
linearRegressionNumerical = linear_model.LinearRegression()
linearRegressionNumerical.fit(trainXNumerical, trainYNumerical)

Finally, the values concerning the model coefficients and the model accuracy can be
shown.
The score coefficient, representing the accuracy of the model, is the coefficient
of determination R^2 of the prediction.
It is computed as

$$ R^2 = 1 - u/v $$

where u is the residual sum of squares and v is the total sum of squares.

In [None]:
print('Coefficients:', linearRegressionNumerical.coef_)
print('Intercept:', linearRegressionNumerical.intercept_)
print('Score:', linearRegressionNumerical.score(testXNumerical, testYNumerical))
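
The formula for R^2 can be checked by hand on a few values. The sketch below is a
plain-Python illustration with hypothetical observed and predicted prices; the function
name `r_squared` is introduced here only for the example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - u/v."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - u / v

y_true = [3.0, 5.0, 7.0, 9.0]   # hypothetical observed prices
y_pred = [2.5, 5.5, 7.5, 8.5]   # hypothetical predicted prices

score = r_squared(y_true, y_pred)                  # u = 1.0, v = 20.0
baseline = r_squared(y_true, [6.0] * len(y_true))  # always predicting the mean
print(score, baseline)
```

A model that always predicts the mean of the observed values obtains a score of 0, so
the score can be read as the improvement over that baseline; it becomes negative when
the model performs worse than the baseline.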

Looking at the score, the regression model does not reach a good value of accuracy.

It is also possible to plot both the training and the test errors.

In [None]:
def plot_regression_error(regression_model, train_x, train_y, test_x, test_y):

    ## plotting residual errors in training data
    plt.scatter(regression_model.predict(train_x),
                regression_model.predict(train_x) - train_y,
                color = "green", s = 10, label = 'Train data')

    ## plotting residual errors in test data
    plt.scatter(regression_model.predict(test_x),
                regression_model.predict(test_x) - test_y,
                color = "blue", s = 10, label = 'Test data')

    ## plotting legend
    plt.legend(loc = 'upper right')

    ## plot title
    plt.title("Residual errors")

    ## method call for showing the plot
    plt.show()

In [None]:
plot_regression_error(linearRegressionNumerical, trainXNumerical, trainYNumerical,
                      testXNumerical, testYNumerical)

#### 8.4.2 Regression with all the features

Since the results obtained by using only the numerical features are not satisfactory,
the regression model is now built with all the features.

The performed steps are the same as before.

In [None]:
dfPricesAllListingTypes = dfPricesListingTypesWithoutOutliers.copy()
dfPricesAllListingTypes.head()

In [None]:
plot_correlation_matrix(dfPricesAllListingTypes)

In [None]:
trainX, testX, trainY, testY = get_train_test(
    dfPricesAllListingTypes.loc[
        :, dfPricesAllListingTypes.columns != 'price'],
    dfPricesAllListingTypes['price'])

trainX.head()

In [None]:
linearRegression = linear_model.LinearRegression()
linearRegression.fit(trainX, trainY)

In [None]:
print('Coefficients:', linearRegression.coef_)
print('Intercept:', linearRegression.intercept_)
print('Score:', linearRegression.score(testX, testY))

Looking at the score, the regression model reaches a better value of accuracy,
but even in this case the results are not satisfactory.

In [None]:
plot_regression_error(linearRegression, trainX, trainY, testX, testY)

### 8.5 Classification

Instead of trying to predict the column 'price' with a regression, it is also
possible to transform it into a categorical feature and try to predict the
category.

#### 8.5.1 Transformation of price for classification

The 'price' feature needs to be binned in order to generate categorical values.

In [None]:
dfPricesAllListingTypes['price'].describe()

In order to obtain an optimized binning of the feature, a clustering technique
on the feature is used.

In [None]:
kmeans_elbow_method(dfPricesAllListingTypes['price'].values.reshape(-1,1))

In [None]:
rangeClusters = enumerate([2,3,4,5])
kmeans_silhouette(dfPricesAllListingTypes['price'].values.reshape(-1,1), rangeClusters)

The value n_clusters = 3 is chosen because it allows a more flexible
classification.

In [None]:
kmeansModelPrice = KMeans(n_clusters=3, max_iter=1000, random_state=42, n_init=50)
kmeansModelPrice.fit(dfPricesAllListingTypes['price'].values.reshape(-1,1))

In [None]:
kmeansModelPrice.cluster_centers_

Looking at the model clusters, it is possible to assign the cluster index to
each binning value.

In [None]:
lowClusterIndex = 1
mediumClusterIndex = 0
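
The cluster indices are assigned by hand after inspecting the cluster centres. As a
hedged alternative, the mapping can be derived by sorting the centres, so that it stays
correct even if a different run numbers the clusters differently. The centre values
below are hypothetical and only mimic the shape of `cluster_centers_` for a single
feature.

```python
import numpy as np

# Hypothetical centres, shaped like KMeans.cluster_centers_ for a single feature.
centers = np.array([[95.0], [38.0], [210.0]])

# Cluster indices ordered from the cheapest to the most expensive centre.
order = np.argsort(centers.ravel())
labels = ['low', 'medium', 'high']

# Map each cluster index to its price category.
index_to_label = {int(cluster): label for cluster, label in zip(order, labels)}
print(index_to_label)
```

With these hypothetical centres, cluster 1 is labelled 'low' and cluster 0 is labelled
'medium', which corresponds to the manual assignment used above.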

#### 8.5.2 Classification with numerical features

Firstly, it is possible to start building the classification model by using just the
features that were initially provided as numerical.

In [None]:
dfPricesNumericalListingTypes['price_cluster'] = kmeansModelPrice.predict(
    dfPricesNumericalListingTypes['price'].values.reshape(-1,1))

dfPricesNumericalListingTypes['price_cluster'] = \
    dfPricesNumericalListingTypes['price_cluster'].apply(
        lambda x: 'low' if x==lowClusterIndex else
            'medium' if x==mediumClusterIndex else 'high')

dfPricesNumericalListingTypes.head()

Before the creation of the classification model, the available data-points should be
split into train and test data.

In [None]:
trainXNumericalClass, testXNumericalClass, trainYNumericalClass, testYNumericalClass = \
    get_train_test(dfPricesNumericalListingTypes.iloc[:,:4],
                   dfPricesNumericalListingTypes['price_cluster'])

trainXNumericalClass.head()

Once the data has been split, a decision tree classifier can be built and trained.

In [None]:
from sklearn.tree import DecisionTreeClassifier

decisionTreeNumerical = DecisionTreeClassifier()
decisionTreeNumerical.fit(trainXNumericalClass, trainYNumericalClass)

Then, the model is used in order to generate the predictions.

In [None]:
predictionsNumerical = decisionTreeNumerical.predict(testXNumericalClass)

Once the predictions have been generated, it is possible to evaluate the model.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(confusion_matrix(testYNumericalClass, predictionsNumerical))
print(classification_report(testYNumericalClass, predictionsNumerical))

The classification with only the numerical features shows good overall results, with
an accuracy of 78%.
In particular, the model performs very well on the "low" class, but poorly on the
"medium" and "high" classes.
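
A high overall accuracy can coexist with poor per-class results when one class contains
most of the data-points, which may be the case here. The sketch below computes the
accuracy and the per-class recall from a confusion matrix; the matrix values are
hypothetical and are not the ones produced by the model above.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
classes = ['high', 'low', 'medium']
cm = np.array([[  5,   2,  13],
               [  1, 160,   9],
               [  4,  20,  36]])

# Accuracy: correctly classified data-points over all data-points.
accuracy = np.trace(cm) / cm.sum()

# Recall: for each true class, the fraction of its data-points that is recovered.
recall = np.diag(cm) / cm.sum(axis=1)
recall_per_class = dict(zip(classes, recall.round(3).tolist()))

print(accuracy, recall_per_class)
```

In this example the dominant "low" class drives the accuracy to about 80%, while only
a quarter of the "high" data-points are recovered.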

#### 8.5.3 Classification with all the features

To check whether the remaining features improve the results obtained with only the
numerical features, the classification model is also built with all the features.

The performed steps are the same as before.

In [None]:
dfPricesAllListingTypes['price_cluster'] = kmeansModelPrice.predict(
    dfPricesAllListingTypes['price'].values.reshape(-1,1))

dfPricesAllListingTypes['price_cluster'] = \
    dfPricesAllListingTypes['price_cluster'].apply(
        lambda x: 'low' if x==lowClusterIndex else
            'medium' if x==mediumClusterIndex else 'high')

dfPricesAllListingTypes.head()

In [None]:
trainXClass, testXClass, trainYClass, testYClass = get_train_test(
    dfPricesAllListingTypes.drop(columns=['price', 'price_cluster']),
    dfPricesAllListingTypes['price_cluster'])

trainXClass.head()

In [None]:
decisionTree = DecisionTreeClassifier()
decisionTree.fit(trainXClass, trainYClass)

In [None]:
predictions = decisionTree.predict(testXClass)

In [None]:
print(confusion_matrix(testYClass, predictions))
print(classification_report(testYClass, predictions))

Even in this case, the classification with all the features shows good overall
results, which are only slightly lower than those obtained with only the numerical
features.
In particular, the model shows an overall accuracy of 76% and very good values for the
"low" class, but poor results for the "medium" and "high" classes.

### 8.6 Outliers removal with new threshold

Since the previously obtained results are not very satisfactory, it is possible to
increase the number of removed outliers.
This operation is performed by decreasing the multiplication factor (from 3 to 2) in
the formula used to obtain the border limits.

In [None]:
upperLim2, lowerLim2 = get_outliers_limits(dfPricesListingTypes['price'], 2)

print("upper_lim:", upperLim2)
print("lower_lim:", lowerLim2)

Since the values in the column 'price' are always positive, the lower limit can be
discarded.

In [None]:
dfPricesListingTypesWithoutOutliers2 = \
    dfPricesListingTypes[dfPricesListingTypes['price'] < upperLim2]

print('Number of outliers removed:',
      dfPricesListingTypes.shape[0] - dfPricesListingTypesWithoutOutliers2.shape[0])
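
The effect of the multiplication factor can be illustrated on a small hypothetical
price series: a smaller factor moves the upper limit closer to the mean, so more
data-points are treated as outliers. The helper is restated here only to keep the
sketch self-contained.

```python
import pandas as pd

def get_outliers_limits(df_column, factor):
    center, spread = df_column.mean(), df_column.std()
    return center + spread * factor, center - spread * factor

# Hypothetical prices: twenty ordinary listings and two expensive ones.
prices = pd.Series([40, 45, 50, 55, 60] * 4 + [600, 1000], dtype=float)

removed = {}
for factor in (3, 2):
    upper, _ = get_outliers_limits(prices, factor)
    # Count the data-points that the filter `price < upper` would discard.
    removed[factor] = int((prices >= upper).sum())

print(removed)
```

With factor 3 only the most expensive listing is discarded, while with factor 2 both
expensive listings are discarded.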

In [None]:
sns.boxplot(x=dfPricesListingTypesWithoutOutliers2['price'])

#### 8.6.1 Regression with numerical features and new threshold

The steps performed are the same as before.

In [None]:
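# Work on an explicit copy, so that columns can be added later without a
# SettingWithCopyWarning.
dfPricesNumericalListingTypes2 = \
    dfPricesListingTypesWithoutOutliers2[['accommodates', 'bathrooms', 'bedrooms',
                                          'beds', 'price']].copy()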

dfPricesNumericalListingTypes2.head()

In [None]:
plot_correlation_matrix(dfPricesNumericalListingTypes2)

In [None]:
trainXNumerical2, testXNumerical2, trainYNumerical2, testYNumerical2 = \
    get_train_test(dfPricesNumericalListingTypes2.iloc[:,:4],
                   dfPricesNumericalListingTypes2['price'])

trainXNumerical2.head()

In [None]:
linearRegressionNumerical2 = linear_model.LinearRegression()
linearRegressionNumerical2.fit(trainXNumerical2, trainYNumerical2)

In [None]:
print('Coefficients:', linearRegressionNumerical2.coef_)
print('Intercept:', linearRegressionNumerical2.intercept_)
print('Score:', linearRegressionNumerical2.score(testXNumerical2, testYNumerical2))

Compared with the results obtained by using more data-points, the model accuracy
increases slightly.
However, the results are still not satisfactory.

In [None]:
plot_regression_error(linearRegressionNumerical2, trainXNumerical2, trainYNumerical2,
                      testXNumerical2, testYNumerical2)

#### 8.6.2 Regression with all the features and new threshold

In [None]:
dfPricesAllListingTypes2 = dfPricesListingTypesWithoutOutliers2.copy()
dfPricesAllListingTypes2.head()

In [None]:
plot_correlation_matrix(dfPricesAllListingTypes2)

In [None]:
trainX2, testX2, trainY2, testY2 = get_train_test(
    dfPricesAllListingTypes2.loc[
        :, dfPricesAllListingTypes2.columns != 'price'],
    dfPricesAllListingTypes2['price'])

trainX2.head()

In [None]:
linearRegression2 = linear_model.LinearRegression()
linearRegression2.fit(trainX2, trainY2)

In [None]:
print('Coefficients:', linearRegression2.coef_)
print('Intercept:', linearRegression2.intercept_)
print('Score:', linearRegression2.score(testX2, testY2))

Even in this case, the accuracy value is slightly greater than the one obtained
by considering a higher number of data-points, but it is still not high enough to
make reliable predictions.

In [None]:
plot_regression_error(linearRegression2, trainX2, trainY2, testX2, testY2)

#### 8.6.3 Transformation of price with new threshold for classification

The dataframe with more outliers removed can also be used to create a new
classification model and check its performance.

Even in this case, the steps performed are the same as before.

In [None]:
dfPricesAllListingTypes2['price'].describe()

In [None]:
kmeans_elbow_method(dfPricesAllListingTypes2['price'].values.reshape(-1,1))

In [None]:
rangeClusters = enumerate([2,3,4,5])
kmeans_silhouette(dfPricesAllListingTypes2['price'].values.reshape(-1,1),
                  rangeClusters)

The value n_clusters = 3 is chosen because it allows a more flexible
classification.

In [None]:
kmeansModelPrice2 = KMeans(n_clusters=3, max_iter=1000, random_state=42, n_init=50)
kmeansModelPrice2.fit(dfPricesAllListingTypes2['price'].values.reshape(-1,1))

In [None]:
kmeansModelPrice2.cluster_centers_

In [None]:
lowClusterIndex2 = 0
mediumClusterIndex2 = 1

#### 8.6.4 Classification with numerical features and new threshold

In [None]:
dfPricesNumericalListingTypes2['price_cluster'] = kmeansModelPrice2.predict(
    dfPricesNumericalListingTypes2['price'].values.reshape(-1,1))

dfPricesNumericalListingTypes2['price_cluster'] = \
    dfPricesNumericalListingTypes2['price_cluster'].apply(
        lambda x: 'low' if x==lowClusterIndex2 else
            'medium' if x==mediumClusterIndex2 else 'high')

dfPricesNumericalListingTypes2.head()

In [None]:
trainXNumericalClass2, testXNumericalClass2, trainYNumericalClass2, testYNumericalClass2 = \
    get_train_test(dfPricesNumericalListingTypes2.iloc[:,:4],
                   dfPricesNumericalListingTypes2['price_cluster'])

trainXNumericalClass2.head()

In [None]:
decisionTreeNumerical2 = DecisionTreeClassifier()
decisionTreeNumerical2.fit(trainXNumericalClass2, trainYNumericalClass2)

In [None]:
predictionsNumerical2 = decisionTreeNumerical2.predict(testXNumericalClass2)

In [None]:
print(confusion_matrix(testYNumericalClass2, predictionsNumerical2))
print(classification_report(testYNumericalClass2, predictionsNumerical2))

Even in this case, the classification with only the numerical features shows good
overall results, with an accuracy of 78%.
The results are comparable to the ones obtained previously.

#### 8.6.5 Classification with all the features and new threshold

In [None]:
dfPricesAllListingTypes2['price_cluster'] = kmeansModelPrice2.predict(
    dfPricesAllListingTypes2['price'].values.reshape(-1,1))

dfPricesAllListingTypes2['price_cluster'] = \
    dfPricesAllListingTypes2['price_cluster'].apply(
        lambda x: 'low' if x==lowClusterIndex2 else
            'medium' if x==mediumClusterIndex2 else 'high')

dfPricesAllListingTypes2.head()

In [None]:
trainXClass2, testXClass2, trainYClass2, testYClass2 = get_train_test(
    dfPricesAllListingTypes2.drop(columns=['price', 'price_cluster']),
    dfPricesAllListingTypes2['price_cluster'])

trainXClass2.head()

In [None]:
decisionTree2 = DecisionTreeClassifier()
decisionTree2.fit(trainXClass2, trainYClass2)

In [None]:
predictions2 = decisionTree2.predict(testXClass2)

In [None]:
print(confusion_matrix(testYClass2, predictions2))
print(classification_report(testYClass2, predictions2))

Here too, the classification accuracy is comparable to the one obtained
by using the dataframe containing more data-points.

### 8.7 Outliers removal with percentiles

The outliers can also be removed by using the percentiles: only the prices that lie
strictly between the 1st and the 99th percentile are kept.

In [None]:
upperLim3 = dfPricesListingTypes['price'].quantile(.99) # 99th percentile of the price
lowerLim3 = dfPricesListingTypes['price'].quantile(.01) # 1st percentile of the price

print("upper_lim:", upperLim3)
print("lower_lim:", lowerLim3)

In [None]:
dfPricesListingTypesWithoutOutliers3 = \
    dfPricesListingTypes[dfPricesListingTypes['price'] > lowerLim3]

dfPricesListingTypesWithoutOutliers3 = dfPricesListingTypesWithoutOutliers3[
    dfPricesListingTypesWithoutOutliers3['price'] < upperLim3]

print('Number of outliers removed:',
      dfPricesListingTypes.shape[0] - dfPricesListingTypesWithoutOutliers3.shape[0])
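
The two filtering steps above can also be written as a single boolean mask. The sketch below is only an illustration under stated assumptions: the helper `remove_outliers_by_percentile` and the toy prices are hypothetical, and the strict inequalities mirror the ones used in this notebook.

```python
import pandas as pd


def remove_outliers_by_percentile(df, column, lower_q=0.01, upper_q=0.99):
    """Keep the rows whose value lies strictly between the two quantiles."""
    lower_lim = df[column].quantile(lower_q)
    upper_lim = df[column].quantile(upper_q)
    mask = (df[column] > lower_lim) & (df[column] < upper_lim)
    # .copy() returns an independent dataframe, so that columns can be
    # added later without triggering a SettingWithCopyWarning.
    return df[mask].copy()


# Toy data: prices 1, 2, ..., 100 (illustrative values only).
toy = pd.DataFrame({'price': [float(p) for p in range(1, 101)]})
toy_filtered = remove_outliers_by_percentile(toy, 'price')
print('Number of outliers removed:', toy.shape[0] - toy_filtered.shape[0])
```

On the toy data the limits fall between the two smallest and the two largest prices, so exactly the minimum and the maximum are discarded.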

In [None]:
sns.boxplot(x=dfPricesListingTypesWithoutOutliers3['price'])

#### 8.7.1 Regression with numerical features and percentiles

In [None]:
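# .copy() creates an independent dataframe, so that the 'price_cluster' column
# can be added later without triggering a SettingWithCopyWarning
dfPricesNumericalListingTypes3 = \
    dfPricesListingTypesWithoutOutliers3[['accommodates', 'bathrooms', 'bedrooms',
                                         'beds', 'price']].copy()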

dfPricesNumericalListingTypes3.head()

In [None]:
plot_correlation_matrix(dfPricesNumericalListingTypes3)

In [None]:
trainXNumerical3, testXNumerical3, trainYNumerical3, testYNumerical3 = \
    get_train_test(dfPricesNumericalListingTypes3.iloc[:,:4],
                   dfPricesNumericalListingTypes3['price'])

trainXNumerical3.head()

In [None]:
linearRegressionNumerical3 = linear_model.LinearRegression()
linearRegressionNumerical3.fit(trainXNumerical3, trainYNumerical3)

In [None]:
print('Coefficients:', linearRegressionNumerical3.coef_)
print('Intercept:', linearRegressionNumerical3.intercept_)
print('Score:', linearRegressionNumerical3.score(testXNumerical3, testYNumerical3))

Compared with the results of the previous analysis, the model score (the
coefficient of determination R² returned by `score`) is slightly higher.
However, the results are still not satisfactory.
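
The value printed as 'Score' is the one returned by `LinearRegression.score`, which is the coefficient of determination R² rather than a classification-style accuracy. As a reminder of what this number measures, here is a minimal sketch that computes R² by hand; the function `r_squared` and the prices are illustrative assumptions, not part of the original analysis.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative prices, not taken from the dataset.
prices = [40.0, 55.0, 70.0, 120.0]
mean_price = sum(prices) / len(prices)

perfect = r_squared(prices, prices)
baseline = r_squared(prices, [mean_price] * len(prices))
print(perfect, baseline)
```

A value of 1 corresponds to perfect predictions, while 0 corresponds to a model that is no better than always predicting the mean price; on a test set the value can even be negative.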

In [None]:
plot_regression_error(linearRegressionNumerical3, trainXNumerical3, trainYNumerical3,
                      testXNumerical3, testYNumerical3)

#### 8.7.2 Regression with all the features and percentiles

In [None]:
dfPricesAllListingTypes3 = dfPricesListingTypesWithoutOutliers3.copy()
dfPricesAllListingTypes3.head()

In [None]:
plot_correlation_matrix(dfPricesAllListingTypes3)

In [None]:
trainX3, testX3, trainY3, testY3 = get_train_test(
    dfPricesAllListingTypes3.loc[
        :, dfPricesAllListingTypes3.columns != 'price'],
    dfPricesAllListingTypes3['price'])

trainX3.head()

In [None]:
linearRegression3 = linear_model.LinearRegression()
linearRegression3.fit(trainX3, trainY3)

In [None]:
print('Coefficients:', linearRegression3.coef_)
print('Intercept:', linearRegression3.intercept_)
print('Score:', linearRegression3.score(testX3, testY3))

Here too, the score is slightly better than the one obtained
in the previous analysis.
However, it is still not good enough to make reliable predictions.

In [None]:
plot_regression_error(linearRegression3, trainX3, trainY3, testX3, testY3)

#### 8.7.3 Transformation of price with percentiles for classification

The 'price' feature is binned in order to generate categorical values.

In [None]:
dfPricesAllListingTypes3['price'].describe()

In [None]:
kmeans_elbow_method(dfPricesAllListingTypes3['price'].values.reshape(-1,1))

In [None]:
rangeClusters = enumerate([2,3,4,5])
kmeans_silhouette(dfPricesAllListingTypes3['price'].values.reshape(-1,1),
                  rangeClusters)

The value n_clusters = 3 is chosen because three price classes ('low', 'medium'
and 'high') allow a more flexible classification.

In [None]:
kmeansModelPrice3 = KMeans(n_clusters=3, max_iter=1000, random_state=42, n_init=50)
kmeansModelPrice3.fit(dfPricesAllListingTypes3['price'].values.reshape(-1,1))

In [None]:
kmeansModelPrice3.cluster_centers_

In [None]:
lowClusterIndex3 = 0
mediumClusterIndex3 = 1

#### 8.7.4 Classification with numerical features and percentiles

In [None]:
dfPricesNumericalListingTypes3['price_cluster'] = kmeansModelPrice3.predict(
    dfPricesNumericalListingTypes3['price'].values.reshape(-1,1))

dfPricesNumericalListingTypes3['price_cluster'] = \
    dfPricesNumericalListingTypes3['price_cluster'].apply(
        lambda x: 'low' if x==lowClusterIndex3 else
            'medium' if x==mediumClusterIndex3 else 'high')

dfPricesNumericalListingTypes3.head()

In [None]:
trainXNumericalClass3, testXNumericalClass3, trainYNumericalClass3, testYNumericalClass3 = \
    get_train_test(dfPricesNumericalListingTypes3.iloc[:,:4],
                   dfPricesNumericalListingTypes3['price_cluster'])

trainXNumericalClass3.head()

In [None]:
decisionTreeNumerical3 = DecisionTreeClassifier()
decisionTreeNumerical3.fit(trainXNumericalClass3, trainYNumericalClass3)

In [None]:
predictionsNumerical3 = decisionTreeNumerical3.predict(testXNumericalClass3)

In [None]:
print(confusion_matrix(testYNumericalClass3, predictionsNumerical3))
print(classification_report(testYNumericalClass3, predictionsNumerical3))

Here too, the classification with only the numerical features shows good overall
results, with an accuracy of 71%.
The results are slightly worse than the ones obtained in the previous analysis.

#### 8.7.5 Classification with all the features and percentiles

In [None]:
dfPricesAllListingTypes3['price_cluster'] = kmeansModelPrice3.predict(
    dfPricesAllListingTypes3['price'].values.reshape(-1,1))

dfPricesAllListingTypes3['price_cluster'] = \
    dfPricesAllListingTypes3['price_cluster'].apply(
        lambda x: 'low' if x==lowClusterIndex3 else
            'medium' if x==mediumClusterIndex3 else 'high')

dfPricesAllListingTypes3.head()

In [None]:
trainXClass3, testXClass3, trainYClass3, testYClass3 = get_train_test(
    dfPricesAllListingTypes3.drop(columns=['price', 'price_cluster']),
    dfPricesAllListingTypes3['price_cluster'])

trainXClass3.head()

In [None]:
decisionTree3 = DecisionTreeClassifier()
decisionTree3.fit(trainXClass3, trainYClass3)

In [None]:
predictions3 = decisionTree3.predict(testXClass3)

In [None]:
print(confusion_matrix(testYClass3, predictions3))
print(classification_report(testYClass3, predictions3))

Here too, the classification with all the features shows good overall
results, with an accuracy of 71%.
The results are slightly worse than the ones obtained in the previous analysis.

### 8.8 Classification with balanced classes

Since all the classification reports have shown that the considered classes are
strongly imbalanced, a balancing operation is attempted.
In this way, the classification model should be able to better predict the labels
of the classes that were initially represented by a lower number of samples.
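
To see why the overall accuracy alone can hide this problem, the following sketch derives the accuracy and the per-class precision and recall from a confusion matrix. The function `per_class_report` is hypothetical and the counts are made up for illustration; they are not results of this notebook.

```python
def per_class_report(matrix, labels):
    """Return the accuracy and per-class (precision, recall).

    Rows are true classes and columns are predicted classes, which is the
    layout used by sklearn.metrics.confusion_matrix.
    """
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
    report = {}
    for i, label in enumerate(labels):
        predicted_as_label = sum(row[i] for row in matrix)
        actually_label = sum(matrix[i])
        precision = matrix[i][i] / predicted_as_label if predicted_as_label else 0.0
        recall = matrix[i][i] / actually_label if actually_label else 0.0
        report[label] = (precision, recall)
    return accuracy, report


# Made-up counts: 'high' is rare and mostly misclassified.
labels = ['high', 'low', 'medium']
matrix = [
    [5, 5, 40],     # true 'high'
    [0, 700, 100],  # true 'low'
    [5, 95, 50],    # true 'medium'
]
accuracy, report = per_class_report(matrix, labels)
print(f'accuracy = {accuracy:.2f}')
for label, (precision, recall) in report.items():
    print(f'{label}: precision = {precision:.2f}, recall = {recall:.2f}')
```

In this made-up example about three quarters of the predictions are correct, yet only one 'high' listing out of ten is recognized, because the majority class dominates the accuracy.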

#### 8.8.1 Dataframe balancing

The starting dataframe is the one on which the best classification model was
built, that is, the dataframe whose outliers were removed on the basis of the
mean and the standard deviation.

In [None]:
dfPricesListingTypesUnbalanced = dfPricesNumericalListingTypes.copy()
dfPricesListingTypesUnbalanced.head()

In [None]:
dfPricesListingTypesUnbalanced.groupby('price_cluster').price.count()

In order to balance the classes, the data-points with label 'high' need to be
over-sampled, while the data-points with label 'low' need to be under-sampled.
The desired number of data-points for each label of 'price_cluster' is the number
of data-points with the label 'medium'.

In [None]:
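from sklearn.utils import resample

def over_sample(df, label, length):
    """Draw `length` rows of the given class, sampling with replacement."""
    df_label = df[df['price_cluster'] == label]
    df_label_oversampled = resample(df_label, replace=True,
                                    n_samples=length, random_state=42)

    return df_label_oversampled

def under_sample(df, label, length):
    """Draw `length` rows of the given class, sampling without replacement."""
    df_label = df[df['price_cluster'] == label]
    # Fixed seed, as in over_sample, so that the balanced dataframe is reproducible
    df_label_undersampled = df_label.sample(n=length, random_state=42)

    return df_label_undersampled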

In [None]:
def create_balanced_df(df):
    df_medium_price_cluster_balanced = df[df['price_cluster'] == 'medium']

    df_low_price_cluster_balanced = \
        under_sample(df, 'low', df_medium_price_cluster_balanced.shape[0])

    df_high_price_cluster_balanced = \
        over_sample(df, 'high', df_medium_price_cluster_balanced.shape[0])

    return pd.concat([df_low_price_cluster_balanced,
                      df_medium_price_cluster_balanced,
                      df_high_price_cluster_balanced])

In [None]:
dfPricesListingTypesBalanced = create_balanced_df(dfPricesListingTypesUnbalanced)
dfPricesListingTypesBalanced.head()

In [None]:
dfPricesListingTypesBalanced.groupby('price_cluster').price.count()

#### 8.8.2 Classification with balanced dataframe

Once the balanced dataframe is available, a new classification experiment can be
performed.

The steps are the same as in the previous experiments.

In [None]:
trainXBalancedClass, testXBalancedClass, trainYBalancedClass, testYBalancedClass = \
    get_train_test(dfPricesListingTypesBalanced.iloc[:,:4],
                   dfPricesListingTypesBalanced['price_cluster'])

trainXBalancedClass.head()

In [None]:
decisionTreeBalanced = DecisionTreeClassifier()
decisionTreeBalanced.fit(trainXBalancedClass, trainYBalancedClass)

In [None]:
predictionsBalanced = decisionTreeBalanced.predict(testXBalancedClass)

In [None]:
print(confusion_matrix(testYBalancedClass, predictionsBalanced))
print(classification_report(testYBalancedClass, predictionsBalanced))

With the balanced classes, the model shows better precision and recall values
than the other classification models.
However, the overall accuracy of the model is around 10 percentage points lower
than that of the other ones.
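
One caveat about this experiment: `create_balanced_df` resamples the data before the train/test split. Assuming that `get_train_test` splits the rows at random, copies of the same over-sampled 'high' row may end up in both sets, so the test metrics for that class may be optimistic. A hedged sketch of the alternative, in which only the training portion is resampled, follows; `split_then_balance` and the toy data are hypothetical, and the split relies on plain `DataFrame.sample` rather than on the notebook's `get_train_test`.

```python
import pandas as pd


def split_then_balance(df, label_col='price_cluster', test_frac=0.2, seed=42):
    """Hold out a test set first, then resample each class of the training
    set to the size of its 'medium' class.

    Assumes that the dataframe index is unique and that 'medium' is present.
    """
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    target = int((train[label_col] == 'medium').sum())

    parts = []
    for _, group in train.groupby(label_col):
        # Sample with replacement only when the class is smaller than the target.
        parts.append(group.sample(n=target, replace=len(group) < target,
                                  random_state=seed))
    return pd.concat(parts), test


# Toy data with imbalanced classes (illustrative values only).
toy = pd.DataFrame({
    'beds': list(range(100)),
    'price_cluster': ['low'] * 70 + ['medium'] * 20 + ['high'] * 10,
})
train_balanced, test = split_then_balance(toy)
print(train_balanced['price_cluster'].value_counts())
print('Rows shared by train and test:',
      len(set(train_balanced.index) & set(test.index)))
```

The test set keeps the original class proportions, so the reported metrics describe the behaviour of the model on data distributed like the real listings.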

## 9. Conclusions

This section reports the final considerations about the performed analysis.

### 9.1 Sentiment analysis

Concerning the sentiment analysis, the obtained results show that it is possible to
reach an accuracy value that is very close to the one obtained by an already-existing
external library.
In particular, these results have been reached through the use of several pre-processing
techniques, such as the tokenization of the reviews, the normalization of the words
and the removal of the stop words.
Furthermore, the use of the Word2Vec technique, combined with both the clustering
for the detection of the sentiment of each word and the TF-IDF technique for
weighting the words inside each review, has shown very good results.

### 9.2 Correlation analysis

Concerning the correlation analysis performed on the 'price' feature, the obtained
results are not very satisfactory.

In particular, concerning the relationship between the 'price' and the calendar
features, the correlation matrices have always shown a very high correlation among
the average prices of the listings computed for the different values of each calendar
feature.
Only in one case is the correlation value lower than 0.9: the average prices during
the 'winter' and the 'summer' seasons show a correlation value of 0.89.

This suggests that a listing that is expensive for one value of a calendar feature
tends to be expensive for the other values too, so the calendar features carry little
information that distinguishes the prices.
So, it is not possible to identify a meaningful relationship between the 'price'
of a room and the considered calendar features.

Concerning the correlation between the 'price' and the room's characteristics, the
initial dataframe shows a weak correlation between the 'price' and all the considered
features.
After the outliers removal operation, the correlation values have not changed
significantly, and the effects can be noticed in the scores of the generated
models.
In particular, the best regression model is built by considering all the features
after the outliers removal operation with the percentiles, and its score (the
coefficient of determination R²) is 0.45.
The best classification model is built by only considering the numerical features
after the outliers removal operation based on the mean and the standard deviation,
and it reaches an accuracy value of 0.78.
Even after balancing the classes, the classification accuracy has not improved,
with an overall score of 0.67.

So, it is not possible to predict with good confidence the price of a room given
its characteristics.
