# Section 1: Business Understanding

This project uses the Airbnb dataset for Seattle, which consists of three files: calendar.csv, listings.csv, and reviews.csv.

The dataset covers Airbnb listings, guest reviews, and calendar data (frequency of travel) for the city of Seattle. It can be used to study travel trends to Seattle by month, as well as the pricing and reviews of the listings.

This file contains the sentiment analysis of reviews used to answer Question 4.

## Question 4: Are the review comments reflective of the review_scores_rating?




In [10]:
import re
import string

import nltk
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import AirBnB as ab  # project helper module

# Section 1: Data Understanding - Question 4
## Are the review comments reflective of the review_scores_rating?
## Gather

In [11]:
# Read in the reviews.csv file. 

file_path="./reviews.csv"
reviews_df=ab.read_file(file_path) 
five_lines=ab.get_five_lines(reviews_df)
print(five_lines)

   listing_id        id        date  reviewer_id reviewer_name  \
0     7202016  38917982  2015-07-19     28943674        Bianca   
1     7202016  39087409  2015-07-20     32440555         Frank   
2     7202016  39820030  2015-07-26     37722850           Ian   
3     7202016  40813543  2015-08-02     33671805        George   
4     7202016  41986501  2015-08-10     34959538          Ming   

                                            comments  
0  Cute and cozy place. Perfect location to every...  
1  Kelly has a great room in a very central locat...  
2  Very spacious apartment, and in a great neighb...  
3  Close to Seattle Center and all it has to offe...  
4  Kelly was a great host and very accommodating ...  


In [12]:
# Read in the listings.csv file.
file_path="./listings.csv"
listings_df=ab.read_file(file_path) 
five_lines=ab.get_five_lines(listings_df)
print(five_lines)

        id                           listing_url       scrape_id last_scraped  \
0   241032   https://www.airbnb.com/rooms/241032  20160104002432   2016-01-04   
1   953595   https://www.airbnb.com/rooms/953595  20160104002432   2016-01-04   
2  3308979  https://www.airbnb.com/rooms/3308979  20160104002432   2016-01-04   
3  7421966  https://www.airbnb.com/rooms/7421966  20160104002432   2016-01-04   
4   278830   https://www.airbnb.com/rooms/278830  20160104002432   2016-01-04   

                                  name  \
0         Stylish Queen Anne Apartment   
1   Bright & Airy Queen Anne Apartment   
2  New Modern House-Amazing water view   
3                   Queen Anne Chateau   
4       Charming craftsman 3 bdm house   

                                             summary  \
0                                                NaN   
1  Chemically sensitive? We've removed the irrita...   
2  New modern house built in 2013.  Spectacular s...   
3  A charming apartment that sits at

# Section 2: Data Preparation - Question 4
## Are the review comments reflective of the review_scores_rating?

In [13]:
# Check and clean the comments in the reviews file
cols = ['listing_id', 'reviewer_id', 'reviewer_name', 'comments']
listing_reviews_df = reviews_df[cols]
listing_reviews_df.shape
reviews_df.dtypes
# Cast to str so that every value passed to clean_text is a string
reviews_df['comments'] = reviews_df['comments'].astype(str)
reviews_df['review_clean'] = reviews_df['comments'].apply(ab.clean_text)
reviews_df.head()

| | listing_id | id | date | reviewer_id | reviewer_name | comments | review_clean |
|---|---|---|---|---|---|---|---|
| 0 | 7202016 | 38917982 | 2015-07-19 | 28943674 | Bianca | Cute and cozy place. Perfect location to every... | cute cozy place perfect location everything |
| 1 | 7202016 | 39087409 | 2015-07-20 | 32440555 | Frank | Kelly has a great room in a very central locat... | kelly great room central location \r\nbeautifu... |
| 2 | 7202016 | 39820030 | 2015-07-26 | 37722850 | Ian | Very spacious apartment, and in a great neighb... | spacious apartment great neighborhood kind apa... |
| 3 | 7202016 | 40813543 | 2015-08-02 | 33671805 | George | Close to Seattle Center and all it has to offe... | close seattle center offer ballet theater muse... |
| 4 | 7202016 | 41986501 | 2015-08-10 | 34959538 | Ming | Kelly was a great host and very accommodating ... | kelly great host accommodate great neighborhoo... |
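
The notebook does not show the body of `ab.clean_text`. As a rough illustration of this kind of cleaning step, the sketch below lowercases the text, strips punctuation from the edges of each token, and drops stop words and pure digits. The function name `clean_text_sketch` and the tiny stop-word set are hypothetical, and the sketch skips the lemmatization that the NLTK imports suggest the real helper performs.

```python
import string

# Tiny illustrative stop-word list (an assumption); a real pipeline would more
# likely use nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to", "was"}


def clean_text_sketch(text: str) -> str:
    """Lowercase, strip edge punctuation, and drop stop words and digit-only tokens."""
    tokens = text.lower().split()
    # Only leading/trailing punctuation is removed, so "maija's" keeps its apostrophe
    tokens = [tok.strip(string.punctuation) for tok in tokens]
    tokens = [
        tok for tok in tokens
        if tok and tok not in STOP_WORDS and not tok.isdigit()
    ]
    return " ".join(tokens)


print(clean_text_sketch("Great host, and the room was very clean!"))
# great host room very clean
```

Stripping only edge punctuation is consistent with the `review_clean` column above, where tokens such as `maija's` keep their inner apostrophe.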


In [14]:
# Keep only the listing id and review score, and rename id so it matches the reviews file
cols=['id','review_scores_rating']
listings_df=listings_df[cols]
listings_df=listings_df.rename(columns={"id":'listing_id'})

In [15]:
# Merge listings with reviews, then bin the review scores to create the score_sentiment column
listings_reviews_df = pd.merge(left=listings_df, right=reviews_df, left_on='listing_id', right_on='listing_id')
listings_reviews_df.shape
listings_reviews_df['review_scores_rating'] = listings_reviews_df['review_scores_rating'].astype('float')
listings_reviews_df.head()
# pd.cut builds right-closed intervals: (0, 85], (85, 90], (90, 95], (95, 97], (97, 100]
bins = [0, 85.0, 90.0, 95.0, 97.0, 100.0]
listings_reviews_df['review_scores_rating_bins'] = pd.cut(listings_reviews_df.review_scores_rating, bins=bins)
listings_reviews_df.head()
listings_reviews_df['review_scores_rating_bins'].value_counts().sort_index()
listings_reviews_df['score_sentiment'] = pd.cut(listings_reviews_df.review_scores_rating, bins=bins, labels=['Bad', 'Fair', 'Good', 'Average', 'Best'])
listings_reviews_df.head()


| | listing_id | review_scores_rating | id | date | reviewer_id | reviewer_name | comments | review_clean | review_scores_rating_bins | score_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 241032 | 95.0 | 682061 | 2011-11-01 | 479824 | Bro | Excellent all the way around. \r\n\r\nMaija wa... | excellent way around \r\n\r\nmaija excellent h... | (90.0, 95.0] | Good |
| 1 | 241032 | 95.0 | 691712 | 2011-11-04 | 357699 | Megan | Maija's apartment was a wonderful place to sta... | maija's apartment wonderful place stay extreme... | (90.0, 95.0] | Good |
| 2 | 241032 | 95.0 | 702999 | 2011-11-08 | 1285567 | Marylee | one of the most pleasant stays i've had in my ... | one pleasant stay i've travel maija wonderful ... | (90.0, 95.0] | Good |
| 3 | 241032 | 95.0 | 717262 | 2011-11-14 | 647857 | Graham | Maija's suite is beautiful, cozy and convenien... | maija's suite beautiful cozy conveniently loca... | (90.0, 95.0] | Good |
| 4 | 241032 | 95.0 | 730226 | 2011-11-19 | 1389821 | Franka | Our stay was short and pleasant. With its own ... | stay short pleasant porch space flat boost lot... | (90.0, 95.0] | Good |
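
To make the bucket boundaries concrete, here is a small standard-library sketch of the same right-closed binning that `pd.cut` applies with these edges. The names `UPPER_EDGES`, `LABELS` and `score_to_sentiment` are illustrative and not part of the notebook. Scores that are missing or fall outside (0, 100] get no label, which mirrors the NaN that `pd.cut` produces for them.

```python
from bisect import bisect_left
from typing import Optional

# Upper edges of the right-closed intervals (0, 85], (85, 90], (90, 95], (95, 97], (97, 100]
UPPER_EDGES = [85.0, 90.0, 95.0, 97.0, 100.0]
LABELS = ["Bad", "Fair", "Good", "Average", "Best"]


def score_to_sentiment(score: Optional[float]) -> Optional[str]:
    """Return the bucket label, or None when the score is missing or outside (0, 100]."""
    if score is None or score != score:  # None or NaN (NaN is never equal to itself)
        return None
    if not 0.0 < score <= 100.0:
        return None
    # bisect_left returns the index of the first edge >= score,
    # so a score equal to an edge falls into the lower bucket (right-closed)
    return LABELS[bisect_left(UPPER_EDGES, score)]


for s in (80.0, 85.0, 95.0, 96.0, 100.0):
    print(s, score_to_sentiment(s))
```

With these edges a score of 95.0 maps to `Good`, matching the rows shown above, while 96.0 maps to `Average` and anything above 97.0 maps to `Best`.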


In [16]:
# Check for missing data in score_sentiment and drop the rows that have no label
listings_reviews_df['score_sentiment'].isnull().sum()
listings_reviews_df.dropna(subset=['score_sentiment'], how='all', inplace=True)
listings_reviews_df['score_sentiment'].isnull().sum()
listings_reviews_df.shape

(84829, 10)

# Section 3: Data Modeling - Question 4
## Are the review comments reflective of the review_scores_rating?

In [8]:
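# Run the model and get the predictions.
# Note: the returned confusion_matrix, classification_report and accuracy_score
# shadow the sklearn.metrics functions of the same names imported earlier.
predictions, confusion_matrix, classification_report, accuracy_score = ab.clean_fit_tfdf_random_forest(listings_reviews_df, 'review_clean', 'score_sentiment', 0.4, 0, 100, 2000, 5, 0.7)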
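
The helper `ab.clean_fit_tfdf_random_forest` is not shown in this notebook. The sketch below is one plausible shape for such a function using scikit-learn: split the data, build TF-IDF features, fit a random forest, and score the held-out split. The function name `fit_tfidf_random_forest` is hypothetical. Reading the positional arguments `0.4, 0, 100, 2000, 5, 0.7` as `test_size`, `random_state`, `n_estimators`, `max_features`, `min_df` and `max_df` is a guess, not something the source confirms.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def fit_tfidf_random_forest(texts, labels, test_size=0.4, random_state=0,
                            n_estimators=100, max_features=2000,
                            min_df=5, max_df=0.7):
    """Hypothetical stand-in for ab.clean_fit_tfdf_random_forest (not the author's code)."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state)
    # Fit the vocabulary on the training split only, so no test text leaks in
    vectorizer = TfidfVectorizer(max_features=max_features, min_df=min_df, max_df=max_df)
    train_matrix = vectorizer.fit_transform(x_train)
    test_matrix = vectorizer.transform(x_test)
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    model.fit(train_matrix, y_train)
    predictions = model.predict(test_matrix)
    return (predictions,
            confusion_matrix(y_test, predictions),
            accuracy_score(y_test, predictions))


# Toy data for illustration only; min_df and max_df are relaxed because the corpus is tiny
toy_texts = ["great clean room", "dirty noisy room", "lovely host great stay",
             "rude host dirty place", "clean quiet lovely", "noisy rude awful"] * 5
toy_labels = ["Best", "Bad", "Best", "Bad", "Best", "Bad"] * 5
preds, cm, acc = fit_tfidf_random_forest(toy_texts, toy_labels, min_df=1, max_df=1.0)
print(cm)
print(acc)
```

Fitting the vectorizer on the training split alone is a design choice in this sketch. Whether the original helper does the same cannot be determined from the notebook.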

# Section 4: Modeling Evaluation - Question 4
## Are the review comments reflective of the review_scores_rating?

In [9]:
# printing out the output from the modelling
print(confusion_matrix)  

[[3322    8 1254   16 4510]
 [  65  150   59   46  883]
 [ 824    8 3997   18 4048]
 [ 217   34  216  224 1977]
 [ 940   11 1185   42 9878]]


In [10]:
print(classification_report)

[[3322    8 1254   16 4510]
 [  65  150   59   46  883]
 [ 824    8 3997   18 4048]
 [ 217   34  216  224 1977]
 [ 940   11 1185   42 9878]]


In [11]:
print(accuracy_score)

0.5178297772014617
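
As a cross-check, the accuracy can be recomputed from the printed confusion matrix: it is the sum of the diagonal divided by the total number of evaluated reviews. The sketch below also computes a majority-class baseline for context. It assumes the rows are the true classes, which is scikit-learn's convention but is not confirmed for the `ab` helper.

```python
# Confusion matrix copied from the notebook output
cm = [
    [3322, 8, 1254, 16, 4510],
    [65, 150, 59, 46, 883],
    [824, 8, 3997, 18, 4048],
    [217, 34, 216, 224, 1977],
    [940, 11, 1185, 42, 9878],
]

total = sum(sum(row) for row in cm)              # number of evaluated reviews
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
accuracy = correct / total

# Assumption: rows are true classes, so the largest row sum is the most common class
majority_baseline = max(sum(row) for row in cm) / total

print(correct, total)               # 17571 33932
print(round(accuracy, 4))           # 0.5178
print(round(majority_baseline, 4))  # 0.3553
```

Under that assumption, always predicting the most common bucket would score about 35.5%, so the model's 51.78% is a real but modest improvement.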


# Section 5: Conclusion - Question 4
## Are the review comments reflective of the review_scores_rating?

Using TF-IDF features with a random forest model, we get 51.78% accuracy in predicting the binned review_scores_rating (accuracy_score = 0.5178). In other words, the model assigns the correct rating bucket to roughly half of the reviews it was evaluated on, so the review comments reflect the rating only moderately well.
The review_scores_rating values are binned as follows: (0, 85] = 'Bad', (85, 90] = 'Fair', (90, 95] = 'Good', (95, 97] = 'Average', (97, 100] = 'Best'.
Some buckets, such as Average, Best, and Good, were predicted more reliably than the others, based on the f1-score and precision numbers.
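
Per-class precision, recall and F1 can be derived directly from the confusion matrix printed in Section 4. The sketch below does this in plain Python. It assumes the rows are true classes and the columns are predicted classes (scikit-learn's convention). The notebook does not state which bucket each row corresponds to, so the sketch reports class indices rather than bucket names. `per_class_metrics` is an illustrative helper, not part of the `ab` module.

```python
# Confusion matrix copied from the notebook output (rows assumed to be true classes)
cm = [
    [3322, 8, 1254, 16, 4510],
    [65, 150, 59, 46, 883],
    [824, 8, 3997, 18, 4048],
    [217, 34, 216, 224, 1977],
    [940, 11, 1185, 42, 9878],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class index."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        actual = sum(matrix[k])                          # row sum: reviews truly in class k
        predicted = sum(matrix[i][k] for i in range(n))  # column sum: reviews predicted as k
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


for k, (p, r, f1) in enumerate(per_class_metrics(cm)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Classes with a high recall are the buckets the model recovers most often. A high precision with a low recall means the model rarely predicts that bucket but is usually right when it does.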