# AirBnB Sentiment Analysis - Dataset generation

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Project Scope
 - Apply sentiment analysis to the reviews of each ad to estimate how guests evaluate the ad
    itself.

 - Search for relationships between the price of a room and the day of the week, holidays,
    and time of year, as well as between the price and the characteristics of a room,
    in order to forecast prices.

Dataset: https://www.kaggle.com/brittabettendorf/berlin-airbnb-data

In [3]:
import langdetect
import pandas as pd
import zipfile  # the standard-library module provides the same ZipFile API

## 1. Import of the reviews' dataset

The first step is to download the dataset.
This is done through the Kaggle API.

In [3]:
!kaggle datasets download -d brittabettendorf/berlin-airbnb-data

berlin-airbnb-data.zip: Skipping, found more recently modified local copy (use --force to force download)


In [17]:
zf = zipfile.ZipFile('berlin-airbnb-data.zip')
dfReviews = pd.read_csv(zf.open('reviews_summary.csv'))
dfReviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2015,69544350,2016-04-11,7178145,Rahel,Mein Freund und ich hatten gute gemütliche vie...
1,2015,69990732,2016-04-15,41944715,Hannah,Jan was very friendly and welcoming host! The ...
2,2015,71605267,2016-04-26,30048708,Victor,Un appartement tres bien situé dans un quartie...
3,2015,73819566,2016-05-10,63697857,Judy,"It is really nice area, food, park, transport ..."
4,2015,74293504,2016-05-14,10414887,Romina,"Buena ubicación, el departamento no está orden..."
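
Before reading a file out of the downloaded archive, it can help to check which CSV files it actually contains. The following is a minimal sketch using only the standard library; the helper name `list_csv_members` is hypothetical and not part of the original notebook.

```python
import zipfile


def list_csv_members(zip_path):
    """Return the names of all .csv files stored in the archive at zip_path."""
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist() if name.lower().endswith(".csv")]
```

Calling `list_csv_members('berlin-airbnb-data.zip')` should then list `reviews_summary.csv` among the available files, assuming the archive has been downloaded as shown above.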


## 2. Data preprocessing

Once the dataset is available, the next step is to check whether it contains null values.

In [5]:
dfReviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401963 entries, 0 to 401962
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     401963 non-null  int64 
 1   id             401963 non-null  int64 
 2   date           401963 non-null  object
 3   reviewer_id    401963 non-null  int64 
 4   reviewer_name  401963 non-null  object
 5   comments       401467 non-null  object
dtypes: int64(3), object(3)
memory usage: 18.4+ MB
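
The `info()` output above shows non-null counts, from which the number of nulls has to be worked out by subtraction. As an alternative, the sketch below reports the null count and percentage per column directly. The helper `null_report` and the toy frame `demo` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def null_report(df):
    """Return the count and percentage of nulls for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    return report[report["null_count"] > 0]


# Toy data standing in for dfReviews (values are made up for illustration).
demo = pd.DataFrame({"id": [1, 2, 3, 4], "comments": ["nice", None, "ok", None]})
print(null_report(demo))
```

Applied to `dfReviews`, this would be expected to list only the `comments` column, since it is the only one whose non-null count is below the number of rows.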


In [6]:
dfNullReviews = dfReviews[dfReviews['comments'].isnull()]
print(f'Number of null comments: {dfNullReviews.shape[0]}')
dfNullReviews.head()

Number of null comments: 496


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
3049,42809,160781,2010-12-31,226667,Frank,
4316,63468,75695155,2016-05-22,26052219,Sebastian,
7984,139769,10110711,2014-01-31,10977586,Mark,
8411,153015,234852734,2018-02-14,165971645,Chiara,
10960,183918,11107030,2014-03-21,11014142,Andrea,


In [7]:
dfReviews.dropna(axis=0, how='any', inplace=True)
dfReviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 401467 entries, 0 to 401962
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     401467 non-null  int64 
 1   id             401467 non-null  int64 
 2   date           401467 non-null  object
 3   reviewer_id    401467 non-null  int64 
 4   reviewer_name  401467 non-null  object
 5   comments       401467 non-null  object
dtypes: int64(3), object(3)
memory usage: 21.4+ MB


After the rows with null values have been removed, all the comments are converted
to lowercase.

In [8]:
dfReviews['comments'] = dfReviews['comments'].str.lower()
dfReviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2015,69544350,2016-04-11,7178145,Rahel,mein freund und ich hatten gute gemütliche vie...
1,2015,69990732,2016-04-15,41944715,Hannah,jan was very friendly and welcoming host! the ...
2,2015,71605267,2016-04-26,30048708,Victor,un appartement tres bien situé dans un quartie...
3,2015,73819566,2016-05-10,63697857,Judy,"it is really nice area, food, park, transport ..."
4,2015,74293504,2016-05-14,10414887,Romina,"buena ubicación, el departamento no está orden..."


Since the comments are written in many languages, it is useful to detect the language
of each comment.
This makes it possible to select comments by language (and, if needed, to translate all
the comments into a common language).

The langdetect library is used to detect the language of the comments.

The first step is to define a function that detects the language of each comment, based on
its first 50 characters, and returns the detected languages as a list.

In [9]:
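def get_lang_from_comment(dataframe):
    """Return the detected language code of each comment, in row order."""
    list_langs = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    for index, comment in dataframe['comments'].items():
        if index % 5000 == 0:
            print(f'Processed {index} rows...')
        try:
            # Only the first 50 characters of each comment are passed to the detector.
            comment_lang = langdetect.detect(comment[:50])
            list_langs.append(comment_lang)
        except langdetect.lang_detect_exception.LangDetectException:
            # Raised when the text has no detectable features (e.g. only digits).
            list_langs.append("None")

    return list_langs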
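
Because detection is slow, one possible variation is to cache results for repeated snippets and to pass the detector in as a parameter, which also makes the logic testable without langdetect installed. This is only a sketch: the function name `detect_languages` and its parameters `detect`, `max_chars`, and `fallback` are assumptions introduced for illustration.

```python
def detect_languages(comments, detect, max_chars=50, fallback="None"):
    """Detect the language of each comment with a caller-supplied detector.

    `detect` is any callable mapping a string to a language code; in this
    notebook it would be `langdetect.detect`.
    """
    cache = {}  # snippet -> detected language, so repeats are detected once
    langs = []
    for comment in comments:
        snippet = comment[:max_chars]
        if snippet not in cache:
            try:
                cache[snippet] = detect(snippet)
            except Exception:  # e.g. text with no detectable features
                cache[snippet] = fallback
        langs.append(cache[snippet])
    return langs
```

With langdetect, the call would be `detect_languages(dfReviews['comments'], langdetect.detect)`. The langdetect README notes that detection is non-deterministic and suggests setting `langdetect.DetectorFactory.seed = 0` before the first call to get reproducible results.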

Once the language of each comment has been detected, it is added as a new column to the
existing dataframe.

In [15]:
dfReviews['Lang'] = get_lang_from_comment(dfReviews)
dfReviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2015,69544350,2016-04-11,7178145,Rahel,mein freund und ich hatten gute gemütliche vie...
1,2015,69990732,2016-04-15,41944715,Hannah,jan was very friendly and welcoming host! the ...
2,2015,71605267,2016-04-26,30048708,Victor,un appartement tres bien situé dans un quartie...
3,2015,73819566,2016-05-10,63697857,Judy,"it is really nice area, food, park, transport ..."
4,2015,74293504,2016-05-14,10414887,Romina,"buena ubicación, el departamento no está orden..."
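
With a language column in place, selecting comments by language becomes a simple filter. The sketch below uses a small made-up frame in place of `dfReviews`; the variable names `demo`, `lang_counts`, and `english` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for dfReviews after the 'Lang' column is added.
demo = pd.DataFrame({
    "comments": ["great place", "sehr schön", "nice host", "très bien"],
    "Lang": ["en", "de", "en", "fr"],
})

# Number of reviews detected per language.
lang_counts = demo["Lang"].value_counts()

# Keep only the reviews detected as English.
english = demo[demo["Lang"] == "en"].reset_index(drop=True)
```

The same two expressions applied to `dfReviews` would give the language distribution of the reviews and an English-only subset for the sentiment analysis.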


Finally, the dataframe is saved to a .csv file.
In this way, the language detection, which is very time-consuming, is performed
only once.

In [16]:
dfReviews.to_csv('reviews_summary_langs.csv', sep=",", index=False, header=True)
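
To actually benefit from the saved file on later runs, the notebook could load it when it exists and recompute only otherwise. The following is a minimal sketch of that pattern; the helper `load_or_build` and its `build` callback are hypothetical names, not part of the original notebook.

```python
import os

import pandas as pd


def load_or_build(csv_path, build):
    """Load csv_path if it exists; otherwise call build(), save, and return it."""
    if os.path.exists(csv_path):
        # keep_default_na=False keeps placeholder strings such as "None"
        # as text instead of converting them to NaN on reload.
        return pd.read_csv(csv_path, keep_default_na=False)
    df = build()
    df.to_csv(csv_path, index=False)
    return df
```

Here `build` would be a function that runs the preprocessing and language detection steps and returns the resulting dataframe. The `keep_default_na=False` option is a design choice for this sketch: the `"None"` placeholder used for undetected languages is among the strings that `read_csv` may otherwise interpret as missing values.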
