# Capstone Project 
## Part 1 - Predicting Airbnb rental prices

## Sentiment Analysis - Dec 20 reviews

In [1]:
import pandas as pd
import numpy as np
import psycopg2
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

sns.set_theme(context='notebook', style='darkgrid', palette='Set2', font_scale=1.5)

### Load data

In [2]:
# connect to database
db_user = 'postgres'
db_password = ''
db_host = 'localhost'
db_port = 5432
database = 'airbnb'

conn_str = f'postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{database}'
conn = psycopg2.connect(conn_str)

# load data for Dec 20 reviews from database
df = pd.read_sql('SELECT * FROM dec20_reviews', conn)
df.shape

(1178933, 6)

### Clean data
- Drop empty reviews
- Drop automated reviews (posted when a booking is cancelled)
- Drop non-English reviews, because the VADER lexicon used for the sentiment analysis is English-only
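
As a rough sketch, the three rules above can be expressed as one predicate. `keep_review`, `toy_detector` and the sample reviews are hypothetical and for illustration only. The notebook itself applies the rules step by step with pandas and `langdetect.detect`.

```python
from typing import Callable, Optional

# Marker text that appears in automated cancellation reviews (taken from the notebook).
AUTOMATED_MARKER = 'This is an automated posting'


def keep_review(comment: Optional[str], detect_lang: Callable[[str], str]) -> bool:
    """Return True if a review should be kept for sentiment analysis."""
    # Rule 1: drop empty reviews (missing or whitespace-only text).
    if comment is None or not comment.strip():
        return False
    # Rule 2: drop automated cancellation notices.
    if AUTOMATED_MARKER in comment:
        return False
    # Rule 3: drop reviews the detector does not label as English.
    try:
        return detect_lang(comment) == 'en'
    except Exception:
        # Assumption: a failed detection is treated like a non-English review.
        return False


# Stand-in detector for illustration only; it is NOT how langdetect works.
def toy_detector(text: str) -> str:
    return 'pt' if 'muito' in text else 'en'


reviews = [
    'Great flat, very clean.',
    None,
    'The host canceled this reservation. This is an automated posting.',
    'Ficamos muito bem acomodados.',
]
kept = [r for r in reviews if keep_review(r, toy_detector)]
print(kept)
```

Passing the detector in as an argument keeps the sketch independent of any particular language-detection library.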

In [7]:
# drop null comments (note: dropna() removes rows with a missing value in any column)
df.dropna(inplace=True)

In [8]:
# drop reviews containing 'This is an automated posting'
automated = df[df.comments.str.contains('This is an automated posting')].index
df.drop(index=automated, inplace=True)

In [9]:
# example of non-English review
df.iloc[79,5]

'Nossa estadia em Londres foi fantástica, e ficamos muito bem acomodados em Brixton! Adriano foi um anfitrião muito acolhedor e prestativo. Seu apartamento acomoda muito bem uma família, com supermercado próximo e, principalmente, pelo acesso fácil ao metrô e ônibus. Quando retornarmos a Londres, com certeza, ficaremos neste apartamento novamente.'

In [10]:
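# import language detection module
from langdetect import detect, DetectorFactory

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0
print(detect(df.iloc[79,5]))
print(detect(df.iloc[0,5]))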

pt
en


In [11]:
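def detect_language(review):
    """Return the detected language code, or NaN if detection fails."""
    try:
        return detect(review)
    except Exception:
        # langdetect raises when the text has no usable features (e.g. only digits)
        return np.nan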

In [15]:
df['language'] = df.comments.map(detect_language)

In [24]:
(df['language'] != 'en').sum()

151488

In [29]:
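# keep English reviews only; copy so that new columns can be added without SettingWithCopyWarning
df = df[df.language == 'en'].copy()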

In [30]:
df.shape

(1006574, 7)

### Sentiment Analysis

In [31]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [32]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/julia/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [33]:
sid = SentimentIntensityAnalyzer()

In [34]:
# check sentiment analysis scores on a few individual reviews
# (positional access: index labels 0-4 may have been dropped during cleaning)
for review in df.comments.iloc[:5]:
    print(review)
    ss = sid.polarity_scores(review)
    for k in sorted(ss):
        print(f'{k}: {ss[k]}, ', end='')
    print('\n---------------------------------------------------')

The flat was bright, comfortable and clean and Adriano was pleasant and gracious about accommodating us at the last minute. The Brixton tube was a very short walk away and there were plenty of buses. There are lots of fast food restaurants, banks, and shops along the main street.
compound: 0.9413, neg: 0.0, neu: 0.731, pos: 0.269, 
---------------------------------------------------
We stayed with Adriano and Valerio for a week when first moving to London. The apartment is great and very clean compared to a lot of places we've seen in London. Situated very close to Brixton tube and good bus links to central London. Thanks guys!
compound: 0.9214, neg: 0.0, neu: 0.752, pos: 0.248, 
---------------------------------------------------
Adriano was a fantastic host. We felt very at home while staying there. Our first morning we woke up and saw the dining table set for breakfast which was much appreciated. His flat is conveniently located a block from the tube station, with a number of shops 

In [35]:
# score each review once, then expand the four VADER scores into columns
scores = pd.DataFrame(df.comments.map(sid.polarity_scores).tolist(), index=df.index)
scores = scores[['compound', 'neg', 'neu', 'pos']].add_prefix('sa_')
df = df.join(scores)
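
To help read the scores: the `compound` value is a normalised score between -1 and 1, while `neg`, `neu` and `pos` are the proportions of the text falling into each category. The sketch below maps a compound score to a coarse label using the ±0.05 cut-offs commonly suggested for VADER. The `label_sentiment` helper and its `threshold` parameter are hypothetical and are not used elsewhere in this notebook.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    # The 0.05 default is a conventional cut-off, not a value tuned on this data.
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Illustrative scores only; 0.9413 is the compound score of the first review above.
for score in (0.9413, 0.0, -0.4215):
    print(score, label_sentiment(score))
```

Under these cut-offs, a review with a compound score of exactly 0.0 (the minimum for some listings) counts as neutral rather than negative.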

In [105]:
# per-listing summary statistics of the compound sentiment score
agg_funcs = ['mean', 'median', 'std', 'skew', stats.kurtosis, 'count', 'max', 'min']
df_agg = df.groupby('listing_id').sa_compound.agg(agg_funcs).reset_index()
print(df_agg.shape)
df_agg.head()

(51836, 9)


| | listing_id | mean | median | std | skew | kurtosis | count | max | min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 11551 | 0.901849 | 0.953 | 0.169497 | -3.902532 | 15.830288 | 171 | 0.9961 | 0.0 |
| 1 | 13913 | 0.878995 | 0.9477 | 0.21567 | -3.915228 | 12.311691 | 20 | 0.9954 | 0.0 |
| 2 | 15400 | 0.904411 | 0.9542 | 0.125688 | -2.439843 | 5.861301 | 83 | 0.9974 | 0.34 |
| 3 | 17402 | 0.8969 | 0.9241 | 0.095037 | -1.342989 | 1.080238 | 33 | 0.9938 | 0.6249 |
| 4 | 25123 | 0.945812 | 0.97225 | 0.084463 | -4.271175 | 19.66698 | 108 | 0.9958 | 0.4199 |
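
A note on the `kurtosis` column: with its default arguments, `scipy.stats.kurtosis` returns excess (Fisher) kurtosis computed from biased moments, so a normal distribution scores 0. The pure-Python sketch below reproduces that definition on a toy sample to show what the number measures. The `excess_kurtosis` helper and the sample values are hypothetical and not part of the analysis.

```python
def excess_kurtosis(values):
    """Excess (Fisher) kurtosis from biased central moments: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    # Second and fourth central moments, each divided by n (biased estimators).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# Toy sample: evenly spread values have lighter tails than a normal distribution,
# so the excess kurtosis is negative.
sample = [1, 2, 3, 4, 5]
print(excess_kurtosis(sample))
```

Large positive values, such as those in the table above, indicate that a listing's scores cluster tightly with a few far-off outliers, which fits mostly glowing reviews with the occasional neutral one.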


In [89]:
# write aggregated data to CSV
df_agg.to_csv('../capstone-data-airbnb/dec20-data/reviews_sentiment_analysis_aggregated.csv')

In [94]:
# write clean data to CSV
df.to_csv('../capstone-data-airbnb/dec20-data/reviews_sentiment_analysis.csv')