# 05-318 Human AI Interaction ML Project
#### Name: Joanne Tsai

### Overview & Datasets
This project builds a classifier that distinguishes truthful from deceptive hotel reviews. The data come from the Deceptive Opinion Spam Corpus, which is described in the two papers cited below. The corpus contains 800 truthful reviews from TripAdvisor, Expedia, and Hotels.com and 800 deceptive reviews from Mechanical Turk. Positive and negative reviews are balanced, and all files are plain text.

### References:
[1] M. Ott, Y. Choi, C. Cardie, and J.T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

[2] M. Ott, C. Cardie, and J.T. Hancock. 2013. Negative Deceptive Opinion Spam. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

## Importing libraries

In [466]:
import pandas as pd
import glob
import string

import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import pos_tag, pos_tag_sents
from textblob import TextBlob

import sklearn
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier

import pickle

## Importing Data

In [467]:
# Get all the text files
folder_path = ['data/pos_truthful_from_TripAdvisor', 'data/neg_truthful_from_Web']
truthful_reviews = []
for folder in folder_path:
    for i in range(1, 6):
        files = glob.glob(folder + f"/fold{i}" + "/*.txt")
        for file in files:
            # Use a context manager so each file handle is closed after reading
            with open(file, "r") as f:
                content = f.read()
            truthful_reviews.append([content, "truthful"])

# Check if reviews are read correctly (should be 800)
# print(len(truthful_reviews))

truthful_reviews_df = pd.DataFrame(truthful_reviews, columns =['review', 'label'])
truthful_reviews_df.head()

Unnamed: 0,review,label
0,took a weekend trip with my wife. got a great ...,truthful
1,"Thirty years ago, we had a tiny ""room"" and ind...",truthful
2,"Dispite what other are saying, this was one, i...",truthful
3,We loved the hotel. When I see other posts abo...,truthful
4,We have just returned from a week at the James...,truthful


In [468]:
# Get all the text files
folder_path = ['data/pos_deceptive_from_MTurk', 'data/neg_deceptive_from_MTurk']
deceptive_reviews = []
for folder in folder_path:
    for i in range(1, 6):
        files = glob.glob(folder + f"/fold{i}" + "/*.txt")
        for file in files:
            # Use a context manager so each file handle is closed after reading
            with open(file, "r") as f:
                content = f.read()
            deceptive_reviews.append([content, "deceptive"])

# Check if reviews are read correctly (should be 800)
# print(len(deceptive_reviews))

deceptive_reviews_df = pd.DataFrame(deceptive_reviews, columns =['review', 'label'])
deceptive_reviews_df.head()

Unnamed: 0,review,label
0,I traveled to Chicago with my husband for a ro...,deceptive
1,I stayed in the Sofitel Chicago Water Tower ho...,deceptive
2,This hotel was gorgeous! I really enjoyed my s...,deceptive
3,"This is an absolutely exquisite hotel, at a gr...",deceptive
4,I recently traveled up to Chicago for business...,deceptive
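
The two loading cells above share the same loop and differ only in the folder list and the label. A minimal sketch of a shared helper follows, assuming the same `fold1` to `fold5` layout; the name `load_reviews` and its `n_folds` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_reviews(folders, label, n_folds=5):
    """Read every .txt file under <folder>/fold1 .. fold<n_folds> and pair it with `label`."""
    rows = []
    for folder in folders:
        for i in range(1, n_folds + 1):
            # sorted() makes the row order reproducible across runs and platforms
            for path in sorted(Path(folder, f"fold{i}").glob("*.txt")):
                rows.append([path.read_text(), label])
    return rows

# Possible usage with the notebook's folders (assumes the data/ directory exists):
# truthful_reviews = load_reviews(
#     ['data/pos_truthful_from_TripAdvisor', 'data/neg_truthful_from_Web'], "truthful")
```

The returned list has the same `[content, label]` shape as before, so it can be passed to `pd.DataFrame(..., columns=['review', 'label'])` unchanged.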


In [469]:
# Merge the two datasets
dataframes = [truthful_reviews_df, deceptive_reviews_df]
reviews = pd.concat(dataframes)

reviews.head()

# print(len(reviews))

Unnamed: 0,review,label
0,took a weekend trip with my wife. got a great ...,truthful
1,"Thirty years ago, we had a tiny ""room"" and ind...",truthful
2,"Dispite what other are saying, this was one, i...",truthful
3,We loved the hotel. When I see other posts abo...,truthful
4,We have just returned from a week at the James...,truthful


## Extracting & Tagging Data

### Remove stopwords from reviews
We remove stop words and punctuation from the reviews so that only content-bearing words remain. The filtered text is stored in a new column.

In [470]:
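stop = set(stopwords.words('english'))  # a set gives fast membership tests
def remove_stop_words(col):
    new_col = []
    for review in col:
        new_val = []
        # Strip punctuation before splitting the review into words
        review = review.translate(str.maketrans('', '', string.punctuation))
        for word in review.split():
            # The NLTK list is lowercase and this comparison is case-sensitive,
            # so capitalized stop words such as "We" or "The" are kept
            if word not in stop:
                new_val.append(word)
        new_col.append(' '.join(new_val))
    return new_col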

reviews['review_without_stopwords'] = remove_stop_words(reviews['review'])

In [471]:
reviews.head()

Unnamed: 0,review,label,review_without_stopwords
0,took a weekend trip with my wife. got a great ...,truthful,took weekend trip wife got great rate valet in...
1,"Thirty years ago, we had a tiny ""room"" and ind...",truthful,Thirty years ago tiny room indifferent service...
2,"Dispite what other are saying, this was one, i...",truthful,Dispite saying one best Hotel stay Chicago I I...
3,We loved the hotel. When I see other posts abo...,truthful,We loved hotel When I see posts shabby I cant ...
4,We have just returned from a week at the James...,truthful,We returned week James Chicago The hotel fabul...
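
The output above still contains capitalized stop words such as "We", "When", and "I", because the NLTK stop list is lowercase and the comparison is case-sensitive. The sketch below shows one way to make the filter case-insensitive. `remove_stop_words_ci` and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop list is a stand-in so the example runs without downloading NLTK data.

```python
import string

# Tiny stand-in stop list so the sketch runs without NLTK data;
# in the notebook this would be stopwords.words('english').
SAMPLE_STOP_WORDS = {"we", "the", "when", "i", "a", "with", "my", "other", "about"}

def remove_stop_words_ci(review, stop_words=SAMPLE_STOP_WORDS):
    """Drop punctuation, then drop words whose lowercase form is a stop word."""
    no_punct = review.translate(str.maketrans('', '', string.punctuation))
    kept = [w for w in no_punct.split() if w.lower() not in stop_words]
    return ' '.join(kept)

print(remove_stop_words_ci("We loved the hotel. When I see other posts"))
# prints: loved hotel see posts
```

Switching to this filter would change the tagged text and therefore the scores reported later; that effect has not been measured here.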


### Tag reviews using TextBlob
We tag each word in the review with its part of speech (POS) using TextBlob. The tagged text is stored in a new column.

In [489]:
def tag(review_without_stopwords):
    return TextBlob(review_without_stopwords).tags

tagged_arr = reviews.review_without_stopwords.apply(tag)
temp = pd.DataFrame(tagged_arr)

In [473]:
def reformat_tagged_words(col):
    new_col = []
    for review in col:
        new_val = []
        for tagged_tuple in review:
            new_val.append("/".join(tagged_tuple))
        new_col.append(' '.join(new_val))
    return new_col

reviews['tagged_review'] = reformat_tagged_words(temp['review_without_stopwords'])
reviews.head()

Unnamed: 0,review,label,review_without_stopwords,tagged_review
0,took a weekend trip with my wife. got a great ...,truthful,took weekend trip wife got great rate valet in...,took/VBD weekend/NN trip/NN wife/NN got/VBD gr...
1,"Thirty years ago, we had a tiny ""room"" and ind...",truthful,Thirty years ago tiny room indifferent service...,Thirty/CD years/NNS ago/RB tiny/JJ room/NN ind...
2,"Dispite what other are saying, this was one, i...",truthful,Dispite saying one best Hotel stay Chicago I I...,Dispite/NNP saying/VBG one/CD best/JJS Hotel/N...
3,We loved the hotel. When I see other posts abo...,truthful,We loved hotel When I see posts shabby I cant ...,We/PRP loved/VBD hotel/NN When/WRB I/PRP see/V...
4,We have just returned from a week at the James...,truthful,We returned week James Chicago The hotel fabul...,We/PRP returned/VBD week/NN James/NNP Chicago/...


In [474]:
reviews['tagged_review']

0      took/VBD weekend/NN trip/NN wife/NN got/VBD gr...
1      Thirty/CD years/NNS ago/RB tiny/JJ room/NN ind...
2      Dispite/NNP saying/VBG one/CD best/JJS Hotel/N...
3      We/PRP loved/VBD hotel/NN When/WRB I/PRP see/V...
4      We/PRP returned/VBD week/NN James/NNP Chicago/...
                             ...                        
795    A/DT decent/JJ place/NN stay/VB The/DT people/...
796    The/DT InterContinental/NNP Chicago/NNP hotel/...
797    When/WRB I/PRP first/RB made/VBD reservations/...
798    I/PRP would/MD stay/VB hotel/VB The/DT rooms/N...
799    I/PRP awful/VBP experience/VB The/DT staff/NN ...
Name: tagged_review, Length: 1600, dtype: object

## Train/Test Split

In [475]:
x_train, x_test, y_train, y_test = train_test_split(reviews['tagged_review'], reviews['label'], test_size=0.3, random_state=42)
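
The split above holds out 30% of the reviews but is not stratified, so the class proportions in the test set can drift from 50/50 (the classification report further down lists 229 deceptive and 251 truthful test reviews). The sketch below illustrates a stratified hold-out split in plain Python. `stratified_split` is a hypothetical helper, not the notebook's method; in scikit-learn the same idea is available through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) so each class sends ~test_size of its rows to the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["truthful"] * 800 + ["deceptive"] * 800
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 1120 480
```

The returned values are positional indices, so with the merged `reviews` frame (whose index labels repeat) they would be used with `.iloc` rather than `.loc`.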

## Vectorizing data
We vectorize the text with TF-IDF, which weights each term by its frequency in a review and by how rare it is across the corpus.
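
To make the weighting concrete, here is a minimal hand-rolled sketch of TF-IDF with a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which corresponds to the `use_idf=True, smooth_idf=True, sublinear_tf=False` settings. The function `tfidf_rows` and the two sample sentences are hypothetical, and whitespace tokenization is a simplification of what the vectorizer actually does.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalization for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}

    rows = []
    for tokens in tokenized:
        # Raw term count times IDF, then scale the row to unit length
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["great hotel great staff", "hotel room dirty"]
for row in tfidf_rows(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

In the first sample, "great" receives the largest weight because it occurs twice and appears in only one document, while "hotel" is down-weighted because it appears in both.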

In [476]:
vectorizer = TfidfVectorizer(lowercase=True, use_idf=True, smooth_idf=True, sublinear_tf=False)

# Vectorize to assign different weights to different types of text
x_train_tf = vectorizer.fit_transform(x_train)
x_test_tf = vectorizer.transform(x_test)

first_vector = x_train_tf[0]

# Show the vectorized results
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
df = pd.DataFrame(first_vector.T.todense(), index=vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)



Unnamed: 0,tfidf
nn,0.459930
jj,0.238482
nnp,0.212978
nns,0.177165
rb,0.157750
...,...
everyone,0.000000
everyday,0.000000
everybody,0.000000
every,0.000000
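
The top features above are bare tags such as `nn` and `jj`, not `word/TAG` pairs. This is because the vectorizer's default token pattern treats `/` as a separator, so each tagged token is split into a word and a tag. The sketch below reproduces that behaviour with the standard `re` module. `WHITESPACE_TOKEN_PATTERN` is a hypothetical alternative that keeps each pair together; it has not been evaluated on this dataset.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: treat every whitespace-separated chunk as one token
WHITESPACE_TOKEN_PATTERN = r"\S+"

def tokenize(text, pattern):
    """Lowercase the text (as lowercase=True does) and extract tokens with a regex."""
    return re.findall(pattern, text.lower())

sample = "took/VBD weekend/NN trip/NN I/PRP"
print(tokenize(sample, DEFAULT_TOKEN_PATTERN))
# ['took', 'vbd', 'weekend', 'nn', 'trip', 'nn', 'prp']
print(tokenize(sample, WHITESPACE_TOKEN_PATTERN))
# ['took/vbd', 'weekend/nn', 'trip/nn', 'i/prp']
```

With the default pattern, single-character words such as "I" are dropped, and the tag counts act as features of their own alongside the words.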


## Building the model

We train several types of classifiers and compare their results to decide which one to use.

- x (independent variable) -- `tagged_review`
- y (dependent variable) -- `label`

### 1. Decision Tree Classifier

In [477]:
from sklearn import tree
dt_clf = tree.DecisionTreeClassifier()
dt_clf.fit(x_train_tf, y_train)
dt_pred = dt_clf.predict(x_test_tf)

# Calculate accuracy score
dt_accuracy = metrics.accuracy_score(y_test, dt_pred)

print("The accuracy score for decision tree classifier is", dt_accuracy)

The accuracy score for decision tree classifier is 0.7416666666666667


### 2. Random Forest Classifier

In [478]:
rf_clf = RandomForestClassifier()
rf_clf.fit(x_train_tf, y_train)
rf_pred = rf_clf.predict(x_test_tf)

# Calculate accuracy score
rf_accuracy = metrics.accuracy_score(y_test, rf_pred)

print("The accuracy score for random forest classifier is", rf_accuracy)

The accuracy score for random forest classifier is 0.8458333333333333


### 3. Support Vector Machine (SVM) Classifier

In [479]:
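def svc_param_selection(X, y, nfolds):
    Cs = [0.001, 0.01, 0.1, 1, 10]
    gammas = [0.001, 0.01, 0.1, 1]
    # Note: gamma is ignored by the linear kernel, so only C affects the search here
    params = {'C': Cs, 'gamma' : gammas, 'kernel' : ['linear']}
    # Exhaustive search over the grid with nfolds-fold cross-validation
    grid_search = GridSearchCV(SVC(), params, cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_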

best_params = svc_param_selection(x_train_tf, y_train, 5)
clf = SVC(C=best_params['C'], gamma=best_params['gamma'], kernel=best_params['kernel'])
clf.fit(x_train_tf, y_train)
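
As a rough illustration of the cost of this search, the sketch below enumerates the parameter grid with `itertools.product` and counts the cross-validation fits. Because the kernel is fixed to `linear`, for which `gamma` has no effect, the four `gamma` values multiply the work without changing the result. The `reduced` grid is a suggested alternative, not part of the original notebook.

```python
from itertools import product

Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [0.001, 0.01, 0.1, 1]
kernels = ['linear']
nfolds = 5

# Every combination the grid asks the search to evaluate
grid = list(product(Cs, gammas, kernels))
print(len(grid), "combinations,", len(grid) * nfolds, "cross-validation fits")
# 20 combinations, 100 cross-validation fits

# With a linear kernel gamma has no effect, so only C needs searching
reduced = list(product(Cs, kernels))
print(len(reduced), "combinations,", len(reduced) * nfolds, "cross-validation fits")
# 5 combinations, 25 cross-validation fits
```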

### SVM model evaluation

In [480]:
pred = clf.predict(x_test_tf)
accuracy = metrics.accuracy_score(y_test, pred)  
print("The accuracy score for SVM classifier is", accuracy)

The accuracy score for SVM classifier is 0.88125


Comparing the accuracy scores of the decision tree (0.74), random forest (0.85), and SVM (0.88) models, the SVM achieves the highest accuracy, so we use it for the next steps.

In [481]:
# Store the result under a new name so the imported confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, pred)
print(cm)

[[208  21]
 [ 36 215]]


In [482]:
report = classification_report(y_test, pred)
print(report)

              precision    recall  f1-score   support

   deceptive       0.85      0.91      0.88       229
    truthful       0.91      0.86      0.88       251

    accuracy                           0.88       480
   macro avg       0.88      0.88      0.88       480
weighted avg       0.88      0.88      0.88       480
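
The figures in the report can be checked by hand from the confusion matrix printed earlier. The sketch below recomputes precision, recall, F1, and accuracy in plain Python, assuming rows are true labels and columns are predictions in the order (deceptive, truthful), which matches the support counts of 229 and 251. `per_class_metrics` is a hypothetical helper name.

```python
# Confusion matrix from the SVM evaluation: rows are true labels,
# columns are predictions, in the order (deceptive, truthful)
labels = ["deceptive", "truthful"]
cm = [[208, 21],
      [36, 215]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class index k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k, name in enumerate(labels):
    p, r, f1 = per_class_metrics(cm, k)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# deceptive: precision=0.85 recall=0.91 f1=0.88
# truthful: precision=0.91 recall=0.86 f1=0.88

accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.5f}")  # accuracy=0.88125
```

The model misses 36 truthful reviews (labelled deceptive) and 21 deceptive reviews (labelled truthful), which is why recall is lower for the truthful class.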



## Exporting Model

In [515]:
# Export the model and vectorizer
data = {"model": clf, "vectorizer": vectorizer}
with open('model.pkl', 'wb') as file:
    pickle.dump(data, file)

In [516]:
with open('model.pkl', 'rb') as file:
    data = pickle.load(file)

clf_loaded = data["model"]
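
The pickle stores both the model and the vectorizer, and a new raw review has to pass through the same preprocessing and the same fitted vectorizer before it reaches the model. A minimal sketch of that flow follows. `predict_review` and its `preprocess` argument are hypothetical names; `preprocess` stands for whatever callable repeats the stop-word removal and POS tagging steps on a single string.

```python
def predict_review(raw_review, bundle, preprocess):
    """Classify one raw review with a loaded {"model", "vectorizer"} bundle.

    `preprocess` is a hypothetical callable that must repeat the training-time
    steps (stop-word removal and POS tagging) on a single review string.
    """
    prepared = preprocess(raw_review)
    # transform (not fit_transform): reuse the vocabulary and IDF weights learned in training
    features = bundle["vectorizer"].transform([prepared])
    return bundle["model"].predict(features)[0]
```

With the notebook's own helpers, `preprocess` could plausibly be `lambda r: reformat_tagged_words([tag(remove_stop_words([r])[0])])[0]`, and `bundle` would be the dictionary loaded from `model.pkl`.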

In [514]:
# Test running on data after reloading the model
y_pred = clf_loaded.predict(x_test_tf)
y_pred

array(['deceptive', 'truthful', 'truthful', 'truthful', 'deceptive',
       'deceptive', 'deceptive', 'truthful', 'truthful', 'truthful',
       'truthful', 'deceptive', 'deceptive', 'deceptive', 'deceptive',
       'truthful', 'truthful', 'deceptive', 'truthful', 'deceptive',
       'deceptive', 'truthful', 'truthful', 'truthful', 'truthful',
       'deceptive', 'deceptive', 'deceptive', 'truthful', 'truthful',
       'truthful', 'truthful', 'truthful', 'deceptive', 'truthful',
       'deceptive', 'truthful', 'deceptive', 'truthful', 'truthful',
       'truthful', 'deceptive', 'deceptive', 'deceptive', 'deceptive',
       'deceptive', 'deceptive', 'truthful', 'truthful', 'deceptive',
       'deceptive', 'deceptive', 'deceptive', 'truthful', 'deceptive',
       'deceptive', 'truthful', 'truthful', 'truthful', 'truthful',
       'truthful', 'truthful', 'deceptive', 'deceptive', 'truthful',
       'truthful', 'truthful', 'deceptive', 'deceptive', 'truthful',
       'deceptive', 'truthful

