# FIT5120 - Industry Experience Studio Project S1 2022

### Project Name: HOTEL REVIEW ASSISTANT
### Task Name: Data Wrangling and Machine Learning



Team information
- Team Name: AntiFake
- Team Number: TA 36

Date: 07/04/2022

Version: 1.0

Programming Language: Python 3.8 and Jupyter notebook

Python Libraries used:
- pandas (For data manipulation and analysis)
- numpy (For building the fake detection algorithm)
- re (For data extraction)
- transformers (For the pre-trained sentiment model used in Section 3)

## Table of Contents

* [1. Import Library](#sec_1)
* [2. Data Wrangling](#sec_2)
* [3. Analysis of Review Reliability](#sec_3)
* [4. Building the Algorithm for Review Reliability](#sec_4)
* [5. Data Normalisation and Generating Outputs](#sec_5)

### 1. Import Library

In [1]:
import pandas as pd
import numpy as np
import re

### 2. Data Wrangling

In [2]:
df_review = pd.read_csv('reviews.csv')
df_review.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,9835,279854,2011-05-24,560832,Miriam,"Very hospitable, much appreciated.\r<br/>"
1,9835,3640746,2013-02-26,5143343,Michelle,A beautiful house in a lovely quiet neighbourh...
2,9835,23731188,2014-12-08,2478713,Karyn,This was my first time using airbnb and it was...
3,9835,46588875,2015-09-12,26184717,Rosalind,I was visiting Melbourne to spend time with my...
4,12936,73473,2010-08-04,111479,Brian,Perfect apartment in a perfect location!!!! \r...


In [46]:
df_listing = pd.read_csv('listings.csv')
df_listing = df_listing.loc[~df_listing.name.isnull()] # remove null in names
df_listing = df_listing.rename(columns={'id':'listing_id'})
df_listing.shape

(17832, 74)

In [5]:
# Merge the listing details onto every review (a left join keeps all reviews)
df_all = df_review.merge(df_listing, on='listing_id', how='left')
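
A left merge keeps every review even when its `listing_id` has no match in the listings table (for example, a listing removed by the null-name filter above), and the listing columns for those rows come back as NaN. A minimal sketch of how such rows could be counted, using toy frames with invented values rather than the real data:

```python
import pandas as pd

# Toy stand-ins for df_review and df_listing (values are illustrative only)
reviews = pd.DataFrame({"listing_id": [1, 1, 2, 3], "comments": ["a", "b", "c", "d"]})
listings = pd.DataFrame({"listing_id": [1, 2], "name": ["Loft", "Cottage"]})

# indicator=True adds a `_merge` column; for a left merge it is 'both' or 'left_only'
merged = reviews.merge(listings, on="listing_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(merged), len(unmatched))
```

Checking the `left_only` count before the later steps shows how many reviews will carry a missing `review_scores_rating`.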

In [None]:
# If the accuracy score is NaN, replace it with a placeholder message; otherwise store it as a string
not_available = 'Property information not available at the moment'
df_all.review_scores_accuracy = df_all.apply(lambda x: not_available if pd.isnull(x['review_scores_accuracy']) else str(x['review_scores_accuracy']), axis=1)

### 3. Analysis of Review Reliability

In [None]:
# Import pre-trained NLP model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

# Use the pipeline to generate results for the test data
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
generator = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

In [None]:
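# Build the function that generates the predicted review star rating and its probability
def get_stars(message, generator):
    """
    Output: (stars: number, probability)
    """
    # The comments column contains null values: return a default without calling the model
    if pd.isnull(message):
        return (5, 1)
    # Replace line-break residue first so truncation cannot split it
    message = re.sub(r'\r<br/>', ' ', message)
    # Rough length guard: keep only the first 512 characters
    message = message[:512]
    # The pipeline returns one dict per input with 'label' and 'score' keys
    result = generator(message)[0]
    # The label starts with the star count, so its first character is the rating
    stars = int(result['label'][0])
    prob = float(result['score'])
    return (stars, prob)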
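
The `text-classification` pipeline returns a list with one dictionary per input, holding a `label` and a `score`, and the function above relies on the label beginning with the star count. The sketch below exercises only that parsing step with a stand-in generator, so it runs without downloading the model. `fake_generator` and `parse_stars` are hypothetical names used for illustration, and the label and score values are invented.

```python
import re

def fake_generator(text):
    # Hypothetical stand-in that mimics the shape of the pipeline's output
    return [{"label": "4 stars", "score": 0.62}]

def parse_stars(result):
    # Read the leading number of the label and the model's confidence
    stars = int(re.match(r"\d+", result["label"]).group())
    return stars, float(result["score"])

stars, prob = parse_stars(fake_generator("Clean room, friendly host.")[0])
print(stars, prob)
```

Reading the dictionary by key rather than by position keeps the parsing independent of the order in which the keys are returned.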

In [None]:
# Set the default star rating to 1 (with probability 1) if the review is null
star_prob_lst = df_all.comments.apply(lambda x: get_stars(x[:512], generator) if not pd.isnull(x) else (1, 1))
star_lst = [each[0] for each in star_prob_lst.values]
prob_lst = [each[1] for each in star_prob_lst.values]

# Append the predictions to the merged dataframe
df_all['Predicted_star'] = star_lst
df_all['Predicted_star_Probability'] = prob_lst

### 4. Building the Algorithm for Review Reliability

In [14]:
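# Define the function that scores how closely the predicted star matches review_scores_rating
def reliable(review_scores_rating, predicted_star):
    # No usable rating (missing or zero): treat the review as fully reliable
    if pd.isnull(review_scores_rating) or not review_scores_rating:
        return 1
    # Relative difference between the predicted star and the listing rating
    diff = abs(predicted_star - review_scores_rating) / review_scores_rating
    if diff > 1:
        return 0
    return 1 - diff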

# Generating the Reliable probability
df_all['Reliable_probability'] = df_all.apply(lambda x: reliable(x['review_scores_rating'], x['Predicted_star']), axis=1)
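
As a worked example of the score above: reliability is one minus the relative gap between the predicted star and the listing's `review_scores_rating`, floored at zero. The standalone sketch below repeats that arithmetic on made-up numbers; `reliability_score` is a hypothetical name and not part of the notebook.

```python
def reliability_score(rating, predicted_star):
    # Relative difference between the predicted star and the listing rating
    diff = abs(predicted_star - rating) / rating
    return 0 if diff > 1 else 1 - diff

print(reliability_score(4.0, 5))  # gap of 1 on a rating of 4: 1 - 1/4
print(reliability_score(4.0, 2))  # gap of 2 on a rating of 4: 1 - 2/4
print(reliability_score(2.0, 5))  # relative gap of 1.5 exceeds 1, so the score is 0
```

Because the gap is divided by the rating, the same absolute difference is penalised more heavily for low-rated listings than for high-rated ones.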

In [17]:
# Calculate the overall review reliability for each listing (grouped on the merge key)
Overall_Reliable_probability = df_all.groupby('listing_id')['Reliable_probability'].mean().reset_index()
# Merge the per-listing mean back onto every review row
df_all = df_all.merge(Overall_Reliable_probability, on='listing_id', how='left')
# Rename the suffixed columns produced by the merge
df_all = df_all.rename(columns={"Reliable_probability_x": "Reliable_probability", "Reliable_probability_y": "Overall_Reliable_probability"})
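
The per-listing average can also be attached in a single step with `groupby(...).transform('mean')`, which returns a value for every original row and so needs neither the merge nor the rename. This is an alternative sketch on toy data with invented values, not the notebook's method:

```python
import pandas as pd

toy = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "Reliable_probability": [0.8, 0.6, 0.9],
})

# transform returns a result aligned with the original rows
toy["Overall_Reliable_probability"] = (
    toy.groupby("listing_id")["Reliable_probability"].transform("mean")
)
print(toy)
```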


### 5. Data Normalisation and Generating Outputs

In [23]:
# Subset and normalise the dataframe
iteration1_columns = ['listing_id', 'date', 'comments', 'listing_url','last_scraped', 'name', 'description', 'neighborhood_overview',
                     'picture_url', 'host_name', 'host_since', 'host_about', 'host_location', 'host_response_time',
                     'host_acceptance_rate', 'latitude', 'longitude', 'property_type', 'price', 
                      'review_scores_accuracy', 'Predicted_star', 'Predicted_star_Probability', 'Reliable_probability', 
                      'Overall_Reliable_probability']

df_all = df_all[iteration1_columns]
df_all = df_all.round(2)

# Generate two subsets to show the most reliable and the less reliable reviews for each listing
df1 = df_all.sort_values('Reliable_probability',ascending = False).groupby('name').head(5).reset_index()
df2 = df_all.sort_values('Reliable_probability',ascending = False).groupby('name').tail(5).reset_index()

# Express the probabilities as percentages
df1['Reliable_probability'] = df1['Reliable_probability'] *100
df1['Overall_Reliable_probability'] = df1['Overall_Reliable_probability'] *100
df2['Reliable_probability'] = df2['Reliable_probability'] *100
df2['Overall_Reliable_probability'] = df2['Overall_Reliable_probability'] *100
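
One property of the `head(5)` / `tail(5)` split above is that a listing with fewer than ten reviews contributes some of the same rows to both subsets. A minimal sketch on toy data (listing names and values are invented) that counts the shared rows:

```python
import pandas as pd

# Listing A has 3 reviews, listing B has 12 (illustrative values only)
toy = pd.DataFrame({
    "name": ["A"] * 3 + ["B"] * 12,
    "Reliable_probability": [0.9, 0.5, 0.1] + [i / 12 for i in range(12)],
})

ranked = toy.sort_values("Reliable_probability", ascending=False)
top = ranked.groupby("name").head(5)
bottom = ranked.groupby("name").tail(5)

# Rows that land in both subsets (all of listing A, none of listing B)
overlap = top.index.intersection(bottom.index)
print(len(top), len(bottom), len(overlap))
```

If the two output files are meant to be disjoint, the overlapping rows would need to be dropped from one of them; whether that matters depends on how the files are displayed.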

In [27]:
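# Calculate the description accuracy range
def rank(x):
    try:
        x = float(x)
    except (TypeError, ValueError):
        # Non-numeric placeholder text (set in Section 2) is treated like a
        # missing score, which also falls through to 'low'
        return 'low'
    if x == 5.0:
        return 'high'
    elif 4.9 < x < 5:
        return 'medium-high'
    elif 4.75 <= x <= 4.9:
        return 'medium-low'
    else:
        return 'low'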

df1['descrption_accuracy_range'] = df1.apply(lambda x : rank(x['review_scores_accuracy']), axis=1)
df2['descrption_accuracy_range'] = df2.apply(lambda x : rank(x['review_scores_accuracy']), axis=1)
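
The same thresholds can be applied to a whole column at once with `numpy.select` instead of a row-wise `apply`. This is a sketch under the assumption that the scores are numeric (NaN for missing values); the sample scores are invented.

```python
import numpy as np
import pandas as pd

scores = pd.Series([5.0, 4.95, 4.9, 4.75, 4.5, np.nan])

# Conditions are checked in order; the first match wins
conditions = [
    scores == 5.0,
    (scores > 4.9) & (scores < 5.0),
    (scores >= 4.75) & (scores <= 4.9),
]
labels = ["high", "medium-high", "medium-low"]

# Anything not matched above, including NaN, falls through to 'low'
ranges = np.select(conditions, labels, default="low")
print(list(ranges))
```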

In [32]:
# Generating the outputs
df1.to_csv('high_reliability.csv', index=False)
df2.to_csv('low_reliability.csv', index=False)