# Purpose of the Notebook

- **Section 1:** we read the data frames created in the data preprocessing notebook, parse the data, and store it in two dictionaries. One dictionary holds the positive reviews (more than 3 stars) and the other holds the negative reviews (fewer than 4 stars).

- **Section 2:** we train a classifier per hotel, take the review the classifier ranks as most helpful for each hotel, and store its helpfulness score in a list. We then calculate the average helpfulness over the list and, similarly, the percentage of recommended reviews that are helpful.

- **Section 3:** we do the same as in Section 2, except that instead of training and testing a classifier we recommend the most recently written review and a randomly chosen review.

In [None]:
import pandas as pd
import pickle
import re
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import resample
from datetime import datetime
from statistics import mean
from random import seed
from random import randint

## Section 1

Read the csv files created in the data processing notebook

In [None]:
user_decision = int(input('Enter 1 to read the Chicago dataset or 2 to read the Las Vegas dataset: '))

In the data preprocessing notebook we created a list of all the reviews in the datasets used for this project and wrote that list to a file. Here we use the Python pickle library to load the list directly from that file and store it in review_count.

In [None]:
if(user_decision==1):
    df = pd.read_csv('chicago_data_frame_with_fewer_rows.csv')
    df = df.set_index(['hotel_id', 'reviewer_id'])
    with open("reviews_file_chicago.txt", "rb") as fp:   # Unpickling
        review_count = pickle.load(fp)
elif(user_decision ==2):
    df = pd.read_csv('las-vegas_data_frame_with_fewer_rows.csv')
    df = df.set_index(['hotel_id', 'reviewer_id'])
    with open("reviews_file_las-vegas.txt", "rb") as fp:   # Unpickling
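        review_count = pickle.load(fp)
else:
    # Fail early instead of leaving df and review_count undefined
    raise ValueError('Expected 1 (Chicago) or 2 (Las Vegas)')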

Parse the reviews to collect all the unique hotel ids in the dataset. We use a set because sets don't allow duplicates.

In [None]:
regex = r'<hotelUrl>\n(.*?)\n</hotelUrl>'
unique_hotel_ids = set()
for i in review_count:
    hotel_name = re.search(regex, i)
    hotel_name = hotel_name[1]
    unique_hotel_ids.add(hotel_name)

Function to convert string to date object

In [None]:
def convert_to_date(date):
    ans = datetime.strptime(date, "%b %d, %Y")
    return ans
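
As a quick check of the format string, the snippet below parses a made-up date. The sample value "Jan 5, 2012" is an assumption about what the review files contain, chosen only because it matches "%b %d, %Y".

```python
from datetime import datetime

def convert_to_date(date):
    # "%b" = abbreviated month name, "%d" = day of month, "%Y" = four-digit year
    return datetime.strptime(date, "%b %d, %Y")

sample = "Jan 5, 2012"  # hypothetical value, not taken from the dataset
parsed = convert_to_date(sample)
print(parsed.date())
```

Note that "%b" depends on the locale, so this assumes English month abbreviations.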

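Rebuild the set of hotel ids, this time ignoring reviews that have fewer than 5 helpfulness opinions. A hotel is kept only if at least one of its reviews has 5 or more opinions.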

In [None]:
regex = r'<hotelUrl>\n(.*?)\n</hotelUrl>\n[\s\S]+?\n<helpfulness>\n(.*?)\n</helpfulness>'
unique_hotel_ids = set()
for i in review_count:
    hotel_name = re.search(regex, i)
    hotel_id = hotel_name[1]
    helpful_score = (hotel_name[2]).split(' of ')
    if (int(helpful_score[1])) > 4:
        unique_hotel_ids.add(hotel_id)
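
The filter above relies on the helpfulness field having the form "X of Y". A minimal sketch on a made-up review string follows; the tag layout mirrors the regular expression above, but the hotel id, the votes and the `<date>` line are invented for illustration.

```python
import re

regex = r'<hotelUrl>\n(.*?)\n</hotelUrl>\n[\s\S]+?\n<helpfulness>\n(.*?)\n</helpfulness>'

# Hypothetical review text; only the tags used by the regex matter here
sample_review = (
    "<hotelUrl>\nhotel_123\n</hotelUrl>\n"
    "<date>\nJan 5, 2012\n</date>\n"
    "<helpfulness>\n4 of 6\n</helpfulness>"
)

match = re.search(regex, sample_review)
hotel_id = match[1]
# "4 of 6" means 4 of the 6 people who voted found the review helpful
helpful_votes, total_votes = (int(n) for n in match[2].split(' of '))

# A review counts only if at least 5 people gave a helpfulness opinion
has_enough_opinions = total_votes > 4
print(hotel_id, helpful_votes, total_votes, has_enough_opinions)
```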


Create a list of lists, where each sublist has the format ['hotel id', 'reviewer id', 'date']

In [None]:
# Each entry is [hotel id, reviewer id, date string]
obj = []
date_member_hotel_regex = r'<memberUrl>\n([\s\S]*?)\n</memberUrl>\n<hotelUrl>\n([\s\S]*?)\n</hotelUrl>\n[\s\S]*?\n<date>\n(.*?)\n</date>'
for i in review_count:
    matched = re.search(date_member_hotel_regex, i)
    member = matched[1]
    hotel = matched[2]
    date = matched[3]
    obj.append([hotel, member, date])

Create a dictionary where the key is the hotel id and the values are all reviews with that hotel id

In [None]:
rows_with_hotel_as_key = dict()

for i in unique_hotel_ids:
    if i in df.index.get_level_values('hotel_id'):
        if i in rows_with_hotel_as_key:
            rows_with_hotel_as_key[i] = pd.merge(rows_with_hotel_as_key[i], df.loc[[i]], how = 'right', on=['hotel_id', 'reviewer_id'])
        else:
            rows_with_hotel_as_key[i] = pd.DataFrame(df.loc[[i]])

Create a dictionary keyed by hotel id that keeps only the hotels with at least 5 reviews rated below 4 stars. The value is the data frame of all reviews for that hotel; the negative reviews themselves are filtered out later, after classification.

In [None]:
rows_with_hotel_as_key_culled_neg = dict()

for i in rows_with_hotel_as_key:
    if len(rows_with_hotel_as_key[i]) < 5:
        continue
    else:
        neg = 0
        for j,row in rows_with_hotel_as_key[i].iterrows():
            if (row['ST1']) < 4:
                neg+=1
        if neg > 4:
            rows_with_hotel_as_key_culled_neg[i] = rows_with_hotel_as_key[i]
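
The row-by-row count above can also be written as a vectorised comparison. The sketch below uses small made-up data frames (the hotel ids and ratings are invented; only the ST1 column name comes from the notebook) and is an alternative formulation, not the code used for the results.

```python
import pandas as pd

# Hypothetical per-hotel frames with only the star-rating column
frames = {
    "hotel_a": pd.DataFrame({"ST1": [1, 2, 3, 3, 2, 5]}),  # 5 negative reviews
    "hotel_b": pd.DataFrame({"ST1": [5, 4, 4, 5, 2, 1]}),  # 2 negative reviews
}

culled_neg = {}
for hotel_id, frame in frames.items():
    # Boolean mask of reviews below 4 stars; summing it counts the True values
    n_negative = int((frame["ST1"] < 4).sum())
    if n_negative > 4:
        culled_neg[hotel_id] = frame

print(list(culled_neg))
```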

Create a dictionary keyed by hotel id that keeps only the hotels with at least 5 reviews rated above 3 stars. As before, the value is the data frame of all reviews for that hotel; the positive reviews are filtered out later, after classification.

In [None]:
rows_with_hotel_as_key_culled_pos = dict()

for i in rows_with_hotel_as_key:
    if len(rows_with_hotel_as_key[i]) < 5:
        continue
    else:
        pos = 0
        for j,row in rows_with_hotel_as_key[i].iterrows():
            if (row['ST1']) >= 4:
                pos+=1
        if pos > 4:
            rows_with_hotel_as_key_culled_pos[i] = rows_with_hotel_as_key[i]

## Section 2

For each hotel that survived the filtering above we train a classifier on the reviews of all the other hotels, which is essentially a leave-one-out train/test split at the hotel level. We use a random forest classifier. Its output for each review is a confidence score that represents how confident the classifier is that the review is helpful. We store the predictions for the held-out hotel on each iteration of the loop, and later pick the highest-ranked review per hotel.

- This is only for the negative reviews (reviews that received less than 4 stars)
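
The split described above can be sketched independently of the classifier. The helper name `leave_one_hotel_out` and the toy data are assumptions made for illustration; the notebook builds the same split inline with data frames.

```python
def leave_one_hotel_out(reviews_by_hotel):
    """Yield (test_hotel, train_reviews, test_reviews) for every hotel.

    Each hotel is held out once; the training set is every review
    belonging to the remaining hotels.
    """
    for test_hotel, test_reviews in reviews_by_hotel.items():
        train_reviews = [
            review
            for hotel, reviews in reviews_by_hotel.items()
            if hotel != test_hotel
            for review in reviews
        ]
        yield test_hotel, train_reviews, test_reviews

# Hypothetical data: hotel id -> list of review ids
toy = {"h1": ["r1", "r2"], "h2": ["r3"], "h3": ["r4", "r5"]}
splits = list(leave_one_hotel_out(toy))
for test_hotel, train, test in splits:
    print(test_hotel, train, test)
```

Because the held-out hotel never contributes to its own training set, the confidence scores for its reviews come from a model that has not seen them.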

In [None]:
x_train = pd.DataFrame(columns=df.columns)
x_train['hotel_id'] = ''
x_train['reviewer_id'] = ''
x_train = x_train.set_index(['hotel_id', 'reviewer_id'])

In [None]:
classifier = RandomForestClassifier()
hotel_prediction = dict()
y_pred = []
y_test = []
y_pred_df_neg = dict()
for i in rows_with_hotel_as_key_culled_neg:
    #create a dataframe I can use to train using leave one out approach
    # i is the hotel we want to use to test the classifier which is trained using the other hotels
    x_test = rows_with_hotel_as_key_culled_neg[i].copy()
    y_train = []
    x_train = pd.DataFrame(columns=df.columns)
    for j in rows_with_hotel_as_key_culled_neg:
        if i != j:
            new_df = rows_with_hotel_as_key_culled_neg[j].copy()
            x_train = pd.concat([x_train, new_df])
    y_train = x_train.pop('helpful_of_not')
    y_train = (y_train.to_frame()['helpful_of_not']).astype(str).astype(int)
    classifier.fit(X=x_train.values, y=y_train.values)
    y_test.append(x_test.pop('helpful_of_not'))
    # Pass a plain array, matching the input type used in fit above
    prediction = classifier.predict_proba(x_test.values)
    y_pred_df_neg[x_test.index[0][0]] = pd.DataFrame(prediction, columns = ['not helpful', 'helpfulness'], index=x_test.index)


In [None]:
for i in y_pred_df_neg:
    y_pred_df_neg[i] = pd.merge(y_pred_df_neg[i], df, how = 'left', on=['hotel_id', 'reviewer_id'])

In [None]:
y_pred_fin_neg = dict()

for i in y_pred_df_neg:
    # Keep only the negative reviews (fewer than 4 stars); copy so that
    # later in-place changes do not act on a view of the original frame
    y_pred_fin_neg[i] = y_pred_df_neg[i][y_pred_df_neg[i]['ST1'] < 4].copy()

In [None]:
for i in (y_pred_fin_neg):
    y_pred_fin_neg[i].sort_values(by=['helpfulness'], inplace=True, ascending=False)

In [None]:
neg_class_avg = []

for i in (y_pred_fin_neg):
    top = y_pred_fin_neg[i].iloc[0]
    neg_class_avg.append((top['helpfulness']))

In [None]:
neg_class_total = []

for i in (y_pred_fin_neg):
    top = y_pred_fin_neg[i].iloc[0]
    helpful = (top['helpfulness'])
    if helpful >= 0.5:
        neg_class_total.append(1)
    else:
        neg_class_total.append(0)

The average helpfulness (classifier confidence score) of the reviews recommended by the classification recommendation technique for negative reviews

In [None]:
mean(neg_class_avg)

The percentage (expressed as a fraction between 0 and 1) of helpful reviews recommended by the classification recommendation technique for negative reviews

In [None]:
mean(neg_class_total)
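
The two summary figures differ only in whether the confidence score is thresholded first. A small sketch with made-up scores (the values are invented; the 0.5 cut-off is the one used above):

```python
from statistics import mean

# Hypothetical confidence scores of the top-ranked review for four hotels
top_scores = [0.9, 0.4, 0.7, 0.5]

# Average helpfulness: mean of the raw confidence scores
average_helpfulness = mean(top_scores)

# Share of helpful recommendations: threshold at 0.5, then average the 0/1 flags
helpful_flags = [1 if score >= 0.5 else 0 for score in top_scores]
share_helpful = mean(helpful_flags)

print(average_helpfulness, share_helpful)
```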

We now carry out the same training and testing split, but only for positive reviews.

In [None]:
x_train_pos = pd.DataFrame(columns=df.columns)
x_train_pos['hotel_id'] = ''
x_train_pos['reviewer_id'] = ''
x_train_pos = x_train_pos.set_index(['hotel_id', 'reviewer_id'])

In [None]:
classifier = RandomForestClassifier(n_estimators=39, max_depth=20)
hotel_prediction = dict()
y_pred = []
y_test = []
y_pred_df_pos = dict()
for i in rows_with_hotel_as_key_culled_pos:
    #create a dataframe I can use to train using leave one out approach
    # i is the hotel we want to use to test the classifier which is trained using the other hotels
    x_test = rows_with_hotel_as_key_culled_pos[i].copy()
    y_train = []
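    x_train_pos = pd.DataFrame(columns=df.columns)
    for j in rows_with_hotel_as_key_culled_pos:
        if i != j:
            new_df = rows_with_hotel_as_key_culled_pos[j].copy()
            # Accumulate every other hotel's reviews; concatenating onto the
            # empty x_train kept only the last hotel in the training set
            x_train_pos = pd.concat([x_train_pos, new_df])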
    y_train_pos = x_train_pos.pop('helpful_of_not')
    y_train = (y_train_pos.to_frame()['helpful_of_not']).astype(str).astype(int)
    classifier.fit(X=x_train_pos.values, y=y_train.values)
    y_test.append(x_test.pop('helpful_of_not'))
    # Pass a plain array, matching the input type used in fit above
    prediction = classifier.predict_proba(x_test.values)
    y_pred_df_pos[x_test.index[0][0]] = pd.DataFrame(prediction, columns = ['not helpful', 'helpfulness'], index=x_test.index)
    

In [None]:
for i in y_pred_df_pos:
    y_pred_df_pos[i] = pd.merge(y_pred_df_pos[i], df, how = 'left', on=['hotel_id', 'reviewer_id'])

In [None]:
y_pred_fin_pos = dict()

for i in y_pred_df_pos:
    # Keep only the positive reviews (4 stars or more); copy so that
    # later in-place changes do not act on a view of the original frame
    y_pred_fin_pos[i] = y_pred_df_pos[i][y_pred_df_pos[i]['ST1'] >= 4].copy()

In [None]:
for i in (y_pred_fin_pos):
    y_pred_fin_pos[i].sort_values(by=['helpfulness'], inplace=True, ascending=False)

In [None]:
pos_class_avg = []

for i in (y_pred_fin_pos):
    top = y_pred_fin_pos[i].iloc[0]
    pos_class_avg.append((top['helpfulness']))

In [None]:
pos_class_total = []

for i in (y_pred_fin_pos):
    top = y_pred_fin_pos[i].iloc[0]
    helpful = (top['helpfulness'])
    if helpful >= 0.5:
        pos_class_total.append(1)
    else:
        pos_class_total.append(0)

The mean helpfulness score (classifier confidence score) of the reviews recommended by the classification recommendation technique for positive reviews

In [None]:
mean(pos_class_avg)

The percentage (expressed as a fraction between 0 and 1) of helpful reviews recommended by the classification recommendation technique for positive reviews

In [None]:
mean(pos_class_total)

## Section 3

### Date Negative Mean Review Helpfulness

Below we first add a date column to the data frame of each hotel. We did not add the dates earlier because they would have interfered with the classification results. Once the date column is filled in, we sort each hotel's data frame by date so that the most recently written review comes first. We then add the helpfulness of that top review to a list and calculate the percentage of helpful reviews and the mean helpfulness over the list. We do this first for negative reviews and then repeat it for positive reviews.
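
Selecting the most recent review amounts to sorting by the parsed date in descending order and taking the first row. A minimal sketch on plain tuples follows; the reviewer ids, dates and labels are invented for illustration, and the notebook does the same with `sort_values` on each data frame.

```python
from datetime import datetime

# Hypothetical reviews for one hotel: (reviewer id, date string, helpful_of_not)
reviews = [
    ("reviewer_a", "Mar 14, 2011", 1),
    ("reviewer_b", "Jan 2, 2012", 0),
    ("reviewer_c", "Dec 30, 2010", 1),
]

# Parse the date strings first; sorting the raw strings would order them alphabetically
by_date = sorted(
    reviews,
    key=lambda review: datetime.strptime(review[1], "%b %d, %Y"),
    reverse=True,
)
most_recent = by_date[0]
print(most_recent)
```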

In [None]:
for i in y_pred_fin_neg:
    y_pred_fin_neg[i]['date'] = ''

In [None]:
for j in y_pred_fin_neg:
    for i in obj: 
        test_tuple = tuple([i[0],i[1]])
        if test_tuple in y_pred_fin_neg[j].index:
            date = convert_to_date(i[2])
            # DataFrame.set_value was removed in pandas 1.0; .loc performs the same assignment
            y_pred_fin_neg[j].loc[test_tuple, 'date'] = date

In [None]:
for i in (y_pred_fin_neg):
    y_pred_fin_neg[i].sort_values(by=['date'], inplace=True, ascending=False)

In [None]:
hold = y_pred_fin_neg.copy()
for i in hold:
    if len(hold[i]) < 1:
        del y_pred_fin_neg[i]

In [None]:
values_helpful_of_not_neg = df['helpful_of_not']

In [None]:
for i in y_pred_fin_neg:
    y_pred_fin_neg[i] = pd.merge(y_pred_fin_neg[i], values_helpful_of_not_neg, how = 'left', on=['hotel_id', 'reviewer_id'])

In [None]:
for i in (y_pred_fin_neg):
    y_pred_fin_neg[i].sort_values(by=['date'], inplace=True, ascending=False)

In [None]:
total = []
for i in (y_pred_fin_neg):
    top = (y_pred_fin_neg[i].iloc[0])['helpful_of_not']
    total.append(top)

total_fin = list(map(int, total))

Mean of the binary helpful_of_not label over the most recently written negative review of each hotel, i.e. the fraction of those reviews that are helpful.

In [None]:
mean(total_fin)

### Positive Date Mean Review Helpfulness

In [None]:
for i in y_pred_fin_pos:
    y_pred_fin_pos[i]['date'] = ''

In [None]:
for j in y_pred_fin_pos:
    for i in obj: 
        test_tuple = tuple([i[0],i[1]])
        if test_tuple in y_pred_fin_pos[j].index:
            date = convert_to_date(i[2])
            # DataFrame.set_value was removed in pandas 1.0; .loc performs the same assignment
            y_pred_fin_pos[j].loc[test_tuple, 'date'] = date

In [None]:
for i in (y_pred_fin_pos):
    y_pred_fin_pos[i].sort_values(by=['date'], inplace=True, ascending=False)

In [None]:
hold = y_pred_fin_pos.copy()
for i in hold:
    if len(hold[i]) < 1:
        del y_pred_fin_pos[i]

In [None]:
values_helpful_of_not_pos = df['helpful_of_not']

In [None]:
for i in y_pred_fin_pos:
    y_pred_fin_pos[i] = pd.merge(y_pred_fin_pos[i], values_helpful_of_not_pos, how = 'left', on=['hotel_id', 'reviewer_id'])

In [None]:
for i in (y_pred_fin_pos):
    y_pred_fin_pos[i].sort_values(by=['date'], inplace=True, ascending=False)

In [None]:
total = []
for i in (y_pred_fin_pos):
    top = (y_pred_fin_pos[i].iloc[0])['helpful_of_not']
    total.append(top)

total_fin = list(map(int, total))

Mean of the binary helpful_of_not label over the most recently written positive review of each hotel, i.e. the fraction of those reviews that are helpful.

In [None]:
mean(total_fin)

### Random Negative Percentage of recommended reviews that are helpful

In [None]:
for i in y_pred_fin_neg:
    del y_pred_fin_neg[i]['date']

In [None]:
total = []  # start from an empty list so results from the previous section are not mixed in
for i in y_pred_fin_neg:
    rand_int = randint(0, len(y_pred_fin_neg[i]) - 1)
    top = (y_pred_fin_neg[i].iloc[rand_int])['helpful_of_not']
    total.append(top)
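
`seed` is imported at the top of the notebook but never called, so the random baseline changes from run to run. Below is a sketch of a reproducible pick, assuming a fixed seed is acceptable for the experiment; the seed value 42 and the helper name `pick_random_index` are arbitrary choices, not part of the original notebook.

```python
from random import randint, seed

def pick_random_index(n_reviews):
    # randint is inclusive at both ends, hence the -1
    return randint(0, n_reviews - 1)

seed(42)  # arbitrary fixed seed so the run can be repeated
first_run = [pick_random_index(10) for _ in range(5)]

seed(42)  # resetting the seed reproduces the same sequence
second_run = [pick_random_index(10) for _ in range(5)]

print(first_run == second_run)
```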

In [None]:
total_fin_neg = list(map(int, total))

In [None]:
mean(total_fin_neg)

In [None]:
df.pop('helpful_of_not')

In [None]:
vals_df = pd.read_csv('vals_df.csv')

In [None]:
vals_df = vals_df.set_index(['hotel_id', 'reviewer_id'])

In [None]:
df = pd.merge(vals_df, df, how = 'left', on=['hotel_id', 'reviewer_id'])

In [None]:
rows_with_hotel_as_key = dict()

for i in unique_hotel_ids:
    if i in df.index.get_level_values('hotel_id'):
        if i in rows_with_hotel_as_key:
            rows_with_hotel_as_key[i] = pd.merge(rows_with_hotel_as_key[i], df.loc[[i]], how = 'right', on=['hotel_id', 'reviewer_id'])
        else:
            rows_with_hotel_as_key[i] = pd.DataFrame(df.loc[[i]])

In [None]:
rows_with_hotel_as_key_culled_neg = dict()

for i in rows_with_hotel_as_key:
    if len(rows_with_hotel_as_key[i]) < 5:
        continue
    else:
        neg = 0
        for j,row in rows_with_hotel_as_key[i].iterrows():
            if (row['ST1']) < 4:
                neg+=1
        if neg > 4:
            rows_with_hotel_as_key_culled_neg[i] = rows_with_hotel_as_key[i]

In [None]:
for i in y_pred_fin_neg:
    y_pred_fin_neg[i]['date'] = ''

In [None]:
for j in y_pred_fin_neg:
    for i in obj: 
        test_tuple = tuple([i[0],i[1]])
        if test_tuple in y_pred_fin_neg[j].index:
            date = convert_to_date(i[2])
            # DataFrame.set_value was removed in pandas 1.0; .loc performs the same assignment
            y_pred_fin_neg[j].loc[test_tuple, 'date'] = date

In [None]:
for i in (y_pred_fin_neg):
    y_pred_fin_neg[i].sort_values(by=['date'], inplace=True, ascending=False)

In [None]:
total = []
for i in (y_pred_fin_neg):
    top = (y_pred_fin_neg[i].iloc[0])['helpfulness']
    total.append(top)

In [None]:
print(type(total))
total_fin = list(map(float, total))

Mean helpfulness score of the reviews recommended by the most recently written review recommender for negative reviews.

In [None]:
mean(total_fin)

### Random Positive Percentage of recommended reviews that are helpful

In [None]:
total = []  # start from an empty list so results from the previous section are not mixed in
for i in y_pred_fin_pos:
    rand_int = randint(0, len(y_pred_fin_pos[i]) - 1)
    top = (y_pred_fin_pos[i].iloc[rand_int])['helpful_of_not']
    total.append(top)

In [None]:
total_fin_pos = list(map(int, total))

In [None]:
mean(total_fin_pos)

In [None]:
df.pop('helpful_of_not')

In [None]:
vals_df = pd.read_csv('vals_df.csv')

In [None]:
vals_df = vals_df.set_index(['hotel_id', 'reviewer_id'])

In [None]:
df = pd.merge(vals_df, df, how = 'left', on=['hotel_id', 'reviewer_id'])

In [None]:
rows_with_hotel_as_key = dict()

for i in unique_hotel_ids:
    if i in df.index.get_level_values('hotel_id'):
        if i in rows_with_hotel_as_key:
            rows_with_hotel_as_key[i] = pd.merge(rows_with_hotel_as_key[i], df.loc[[i]], how = 'right', on=['hotel_id', 'reviewer_id'])
        else:
            rows_with_hotel_as_key[i] = pd.DataFrame(df.loc[[i]])

In [None]:
rows_with_hotel_as_key_culled_pos = dict()

for i in rows_with_hotel_as_key:
    if len(rows_with_hotel_as_key[i]) < 5:
        continue
    else:
        pos = 0
        for j,row in rows_with_hotel_as_key[i].iterrows():
            if (row['ST1']) >= 4:
                pos+=1
        if pos > 4:
            rows_with_hotel_as_key_culled_pos[i] = rows_with_hotel_as_key[i]

In [None]:
for i in y_pred_fin_pos:
    y_pred_fin_pos[i]['date'] = ''

In [None]:
for j in y_pred_fin_pos:
    for i in obj: 
        test_tuple = tuple([i[0],i[1]])
        if test_tuple in y_pred_fin_pos[j].index:
            date = convert_to_date(i[2])
            # DataFrame.set_value was removed in pandas 1.0; .loc performs the same assignment
            y_pred_fin_pos[j].loc[test_tuple, 'date'] = date

In [None]:
for i in (y_pred_fin_pos):
    y_pred_fin_pos[i].sort_values(by=['date'], inplace=True, ascending=False)

In [None]:
total = []
for i in (y_pred_fin_pos):
    top = (y_pred_fin_pos[i].iloc[0])['helpfulness']
    total.append(top)

In [None]:
print(type(total))
total_fin = list(map(float, total))

Mean helpfulness score of the reviews recommended by the most recently written review recommender for positive reviews.

In [None]:
mean(total_fin)