# Analyzing Hotel Ratings on Tripadvisor

In this homework, we will analyze the data scraped in Part 1 by fitting regression models to it.

**Task 1 (20 pts)**

Now, we will use regression to analyze this information. First, we will fit a linear regression model that predicts the average rating.

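The average rating of a hotel is the star value (1 to 5) weighted by the number of ratings at each level. For example, a hotel with 15 Terrible, 21 Poor, 55 Average, 228 Very good, and 1271 Excellent ratings (1590 in total) has the following average rating: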

![Information to be scraped](traveler_ratings.png)

$$ \text{AVG_SCORE} = \frac{1 \times 15 + 2 \times 21 + 3 \times 55 + 4 \times 228 + 5 \times 1271}{1590} $$

Use the model to identify which factors most strongly influence $\text{AVG_SCORE}$.
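As a quick check of the formula above, the sketch below recomputes the example's average in plain Python. The `STARS` mapping, the `counts` dictionary, and the `average_score` helper are illustrative names, not part of the assignment code; the mapping (Terrible = 1 up to Excellent = 5) follows the weighting used in this notebook.

```python
# Illustrative helper (not from the assignment): star-weighted mean of rating counts.
STARS = {'Terrible': 1, 'Poor': 2, 'Average': 3, 'Very good': 4, 'Excellent': 5}

def average_score(counts):
    """Return the weighted average star rating for a dict of label -> count."""
    total = sum(counts.values())
    if total == 0:
        return float('nan')  # no ratings: the average is undefined
    weighted = sum(STARS[label] * n for label, n in counts.items())
    return weighted / total

# Counts taken from the worked example above (1590 ratings in total).
counts = {'Terrible': 15, 'Poor': 21, 'Average': 55, 'Very good': 228, 'Excellent': 1271}
avg_score = average_score(counts)
print(round(avg_score, 5))  # 7489 / 1590, about 4.71006
```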

In [30]:
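import numpy as np
import pandas as pd

df = pd.read_csv('traveler_ratings.csv')

# star-weighted sum of rating counts (Excellent = 5 ... Terrible = 1)
df['weighted total rating'] = (df['Excellent'] * 5 + df['Very good'] * 4 + df['Average'] * 3
                               + df['Poor'] * 2 + df['Terrible'] * 1)

# total number of ratings per hotel
df['total rating'] = df['Excellent'] + df['Very good'] + df['Average'] + df['Poor'] + df['Terrible']
df['Average rating'] = (df['weighted total rating'] / df['total rating']).round(5)

# the index is written as the first column; the csv reader below relies on this layout
df.to_csv('new_traveler_ratings.csv')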

# get the average rating for all hotels
# this is y vector
hotel_rating_df = df[['hotel_name', 'Average rating']]




In [31]:
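# from the new traveler ratings file, keep only the hotel name and average rating
# and store them in the 'data' dictionary (hotel name -> rating as a string)
import csv
with open('new_traveler_ratings.csv', 'r') as f:
    reader = csv.reader(f)
    data = {}
    for row in reader:
        # row[2] is the hotel name, row[10] the average rating (the header row is stored too)
        data[row[2]] = row[10]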
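The positional lookups `row[2]` and `row[10]` depend on the exact column order of the CSV, and the header row ends up in the dictionary as well. A minimal sketch of the same lookup by column name, using `csv.DictReader` on a small in-memory sample; the sample rows and hotel names are made up for illustration, and only the `hotel_name` and `Average rating` column names come from the notebook.

```python
import csv
import io

# Made-up sample standing in for new_traveler_ratings.csv (hypothetical hotels).
sample_csv = io.StringIO(
    ",hotel_name,Excellent,Average rating\n"
    "0,Hotel A,120,4.5\n"
    "1,Hotel B,40,3.25\n"
)

# DictReader consumes the header row and keys each row by column name,
# so the mapping does not depend on column positions.
ratings_by_hotel = {}
for row in csv.DictReader(sample_csv):
    ratings_by_hotel[row['hotel_name']] = float(row['Average rating'])

print(ratings_by_hotel)
```

Converting to `float` while reading also removes the need for a later conversion when the target vector is built.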


In [32]:
"""
    We now have the average rating for each hotel
    and all reviews for all hotels.
    
    Next: compute the average attribute ratings for each hotel and append the
    average rating (the target) to each entry, giving a complete training
    data set.
"""


# remove prefix 'Review of' from the hotel_name column
attribute_df = pd.read_csv('attribute_ratings.csv')
# drop the first two words ('Review of') from every hotel name
attribute_df['hotel_name'] = [' '.join(x.split(' ')[2:]) for x in attribute_df['hotel_name']]


unique_hotel = set(attribute_df['hotel_name'].tolist())
# average ratings for 48 hotels -- will be used as y
hotel_rating_df= hotel_rating_df[hotel_rating_df['hotel_name'].isin(unique_hotel)]

# (hotel name, average rating) pairs for the same hotels, looked up in the 'data' dictionary
average_rating = [ (h,data[h]) for h in unique_hotel]


In [33]:
avgAtr_hotel_rating =[]

In [34]:
# split the full review table into per-hotel blocks
# (assumes rows for the same hotel are contiguous and the index is 0..n-1)
def get_all_avgAtr_review(total_review):
    hotel_name = total_review['hotel_name'][0]
    start_index = 0
    end_index = 0
    r = 0

    while r < total_review.shape[0]:
        row = total_review[r:r+1]
        if row['hotel_name'].values[0] == hotel_name:
            end_index += 1
            # last row reached: flush the final hotel's block
            if end_index == total_review.shape[0]:
                result = get_hotel_avgAtr(total_review[start_index:end_index], start_index)
                avgAtr_hotel_rating.append(result)
            r += 1
        else:
            # hotel name changed: average the block that just ended,
            # then start a new block at the current row (r is not advanced)
            result = get_hotel_avgAtr(total_review[start_index:end_index], start_index)
            avgAtr_hotel_rating.append(result)
            start_index = end_index
            hotel_name = total_review['hotel_name'][start_index]
    print('done')

In [35]:
# for each hotel, calculate the average attribute ratings
# and return a single row with hotel name and average attribute ratings
def get_hotel_avgAtr(hotel_review, line_num):
    hotel_avgAtr = [hotel_review['hotel_name'][line_num]]
    # attribute columns start at position 3
    for atr in hotel_review.columns.values[3:]:
        sumof = 0
        counter = 0
        for x in hotel_review[atr].values:
            # skip missing ratings
            if not np.isnan(x):
                sumof += x
                counter += 1
        if counter == 0:
            # no rating at all for this attribute: fall back to 0.0
            hotel_avgAtr.append(0.0)
        else:
            hotel_avgAtr.append(sumof / counter)
    return hotel_avgAtr
    
    

In [36]:
get_all_avgAtr_review(attribute_df)

done
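The per-hotel averaging done by the two functions above can also be expressed with a pandas `groupby`. The sketch below is an alternative under stated assumptions, not the notebook's method: the toy `reviews` frame and its values are invented, and only two attribute columns are shown. `mean()` skips NaN by default, and `fillna(0.0)` reproduces the fallback of 0.0 for attributes with no ratings.

```python
import numpy as np
import pandas as pd

# Invented toy data: one row per review, NaN where an attribute was not rated.
reviews = pd.DataFrame({
    'hotel_name': ['Hotel A', 'Hotel A', 'Hotel B', 'Hotel B'],
    'Value':      [4.0, 5.0, np.nan, np.nan],
    'Rooms':      [3.0, np.nan, 2.0, 4.0],
})

# sort=False keeps hotels in order of first appearance, like the loop above.
avg_attr = (reviews.groupby('hotel_name', sort=False)[['Value', 'Rooms']]
                   .mean()        # NaN entries are skipped within each group
                   .fillna(0.0)   # hotels with no rating for an attribute get 0.0
                   .reset_index())

print(avg_attr)
```

Unlike the index-based loop, `groupby` does not require the rows of one hotel to be contiguous.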


In [37]:
# we have all average ratings for each hotel, form each entry for training 
training_set = pd.DataFrame(avgAtr_hotel_rating, columns=['hotel_name','Value','Rooms','Location','Cleanliness','Sleep Quality','Service','star_rating']) 

In [38]:
training_set.to_csv('hotel_avg_attrib_ratings.csv')

In [39]:
# for each hotel name in the training set, look up its average rating in the 'data' dictionary
train_y = [ [x,float(data[x])] for x in training_set['hotel_name'].tolist()]

In [40]:
# form target_y set
train_y_df = pd.DataFrame(train_y,columns=['hotel_name','avg_rating'])

In [41]:
from sklearn import linear_model

features = ['Value','Rooms','Location','Cleanliness','Sleep Quality', 'Service','star_rating']
# first 56 hotels for training, the remaining hotels for testing
X_train = training_set[features][0:56]
X_test = training_set[features][56:]
y_train = train_y_df[['avg_rating']][0:56]
y_test = train_y_df[['avg_rating']][56:]

In [42]:
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
# The coefficients, in the column order of X_train
print('Coefficients: \n', regr.coef_)

# The mean squared error on the held-out hotels
print("Mean squared error: %.2f" % np.mean((regr.predict(X_test) - y_test.values) ** 2))

# score() returns the coefficient of determination R^2: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(X_test, y_test))


Coefficients: 
 [[ 0.10937083  0.25820411  0.12114313 -0.00245773  0.15811094  0.25501095
   0.21788678]]
Mean squared error: 0.00
Variance score: 0.98
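
For reference, `LinearRegression.score` reports the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$, which is the number printed on the "Variance score" line. The sketch below computes the mean squared error and $R^2$ by hand on made-up numbers; `y_true` and `y_pred` are illustrative values, not the notebook's data.

```python
# Made-up targets and predictions, for illustration only.
y_true = [4.0, 3.0, 5.0, 4.0]
y_pred = [3.5, 3.0, 4.5, 4.0]

n = len(y_true)
mean_y = sum(y_true) / n

# Residual sum of squares and total sum of squares.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

mse = ss_res / n
r2 = 1 - ss_res / ss_tot

print("Mean squared error: %.3f" % mse)
print("R^2: %.3f" % r2)
```

Note that an error printed as 0.00 at two decimals only shows that it is below 0.005; printing more decimals gives a more informative figure.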


-------

**Task 3 (30 pts)**

Finally, we will use logistic regression to decide if a hotel is _excellent_ or not. We classify a hotel as _excellent_ if more than **60%** of its ratings are 5 stars. This is a binary attribute on which we can fit a logistic regression model. As before, use the model to analyze the data.
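
The labeling rule can be sketched as a small function. The `is_excellent` helper and its `threshold` parameter are hypothetical names; the strict comparison follows the "more than 60%" wording, so a hotel at exactly 60% is not labeled excellent.

```python
def is_excellent(excellent, very_good, average, poor, terrible, threshold=0.6):
    """Return 1 if the share of 5-star ratings exceeds the threshold, else 0."""
    total = excellent + very_good + average + poor + terrible
    if total == 0:
        return 0  # assumption: a hotel with no ratings is not excellent
    return int(excellent / total > threshold)

# Counts from the Task 1 example: 1271 / 1590 is about 0.80.
label_high = is_excellent(1271, 228, 55, 21, 15)
# Invented counts at exactly 60%: not "more than" 60%.
label_edge = is_excellent(60, 40, 0, 0, 0)
print(label_high, label_edge)
```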

In [43]:
# mark all hotels with 1:excellent, 0:not excellent
df['excellent_ratio'] = df['Excellent']/(df['Excellent']+df['Very good']+df['Average']+df['Poor']+df['Terrible']) 

In [44]:
# 1 if more than 60% of the ratings are 5 stars, else 0
df['is_excellent'] = (df['excellent_ratio'] > 0.6).astype(int)

In [45]:
X_train = training_set[['Value','Rooms','Location','Cleanliness','Sleep Quality', 'Service','star_rating']][0:56]
X_test = training_set[['Value','Rooms','Location','Cleanliness','Sleep Quality', 'Service','star_rating']][56:]


In [46]:
hotel_ex_df = df[['hotel_name','is_excellent']]

In [47]:
# make the dictionary about y
ex_data={}
for x in hotel_ex_df.values.tolist():
    ex_data[x[0]] = x[1]


In [48]:
# get the y target
y_target = [ ex_data[h] for h in training_set['hotel_name'].values.tolist()]

In [49]:
# make the training & test y
y_train = y_target[0:56]
y_test = y_target[56:]

In [50]:
# a large C means very weak regularization
logreg = linear_model.LogisticRegression(C=10000)
logreg.fit(X_train, y_train)
print('Coefficients: \n', logreg.coef_)
# with 0/1 labels, the mean squared error is the fraction of misclassified hotels
print("Mean squared error: %.2f" % np.mean((logreg.predict(X_test) - y_test) ** 2))
# for a classifier, score() returns the mean accuracy
print('Variance score: %.2f' % logreg.score(X_test, y_test))

Coefficients: 
 [[ -6.11756393  20.31178923   1.68199926  -3.2453248   17.0542242
   10.34300276  -4.05162057]]
Mean squared error: 0.08
Variance score: 0.92
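
For 0/1 labels, each squared error is either 0 or 1, so the mean squared error equals the misclassification rate and the classifier's `score` (mean accuracy) is its complement; this is consistent with the 0.08 and 0.92 reported above. A small sketch with invented labels:

```python
# Invented labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

n = len(y_true)
# Squared error is 1 for a wrong prediction and 0 for a correct one.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(mse, accuracy)
```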


In [1]:
import json

In [3]:
data = []

In [5]:
import pandas as pd
    

In [6]:
pf = pd.read_csv('attribute_ratings.csv')

In [9]:
data = pf['review_id'].values

In [12]:
len(set(data))

109461

-------