# BazaarVoice Challenge

## Background
Machine learning models can generate text that reads as authentic to humans. That ability could be used to produce fake reviews at massive scale and cause problems for review systems. The challenge provides two datasets: one of fake reviews produced by a model, and one of real reviews. Our task is to develop a model that classifies a review as fake or authentic.

In [1]:
import sklearn
import sklearn.metrics  # used later for recall_score and precision_score
import pandas as pd
import numpy as np
import matplotlib
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

%matplotlib inline
import matplotlib.pyplot as plt

## Data Preprocess

In [2]:
# Read each file with a context manager so the handle is closed afterwards.
with open("fake_reviews.txt") as f:
    fakes = f.readlines()
with open("real_reviews.txt") as f:
    reals = f.readlines()

# Drop the newline characters from every review.
fakes2 = [rev.replace('\n', '') for rev in fakes]
reals2 = [rev.replace('\n', '') for rev in reals]


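# Label each review: 0 = fake, 1 = real. Dictionary keys are unique, so
# duplicate reviews collapse into one entry.
fakeD = {key: 0 for key in fakes2}
realD = {key: 1 for key in reals2}

# Merge both dictionaries; a review found in both files keeps the label 1.
aRevs = {**fakeD, **realD}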
print(len(aRevs))


9998
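The count printed above may be lower than the number of lines in the two input files, because building dictionaries keyed by review text removes duplicates, and a review that appears in both files ends up labeled as real. The following is a minimal sketch of that behavior on made-up data; the names `fake_reviews`, `real_reviews`, `fake_labels`, `real_labels`, `overlap`, and `merged` are hypothetical and not part of the notebook.

```python
# Toy data (hypothetical); the notebook reads its reviews from text files.
fake_reviews = ["great product", "works fine", "great product"]
real_reviews = ["works fine", "arrived late"]

fake_labels = {text: 0 for text in fake_reviews}  # 0 = fake
real_labels = {text: 1 for text in real_reviews}  # 1 = real

# Reviews present in both sets; after the merge they keep the label 1
# because the right-hand dictionary wins on key collisions.
overlap = set(fake_labels) & set(real_labels)

merged = {**fake_labels, **real_labels}

print(len(fake_reviews) + len(real_reviews))  # raw number of lines
print(len(merged))                            # unique reviews after merging
print(sorted(overlap))                        # reviews with conflicting labels
```

Checking the size of `overlap` before merging is one way to see how many reviews would be silently relabeled.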


In [3]:
daf = shuffle(pd.DataFrame(list(aRevs.items()), columns=["Review", "Class"]), random_state=12)

In [4]:
daf.to_pickle("./opDF.pkl")

## Example

In [21]:
daf.head(5)

| | Review | Class |
|---|---|---|
| 5669 | Bought this a week ago and so far, very impres... | 1 |
| 8798 | I just got this today in the mail and it’s so ... | 1 |
| 3205 | Bought these for my use as a Christmas present... | 0 |
| 8729 | absolutely love these headphones- sound qualit... | 1 |
| 6412 | I have had my unit for twenty years now and it... | 1 |


In [7]:
x, y = daf["Review"], daf["Class"]

In [8]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=12 )

In [9]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(7998,)
(2000,)
(7998,)
(2000,)


## TF-IDF Vectorization

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
x_train = cv.fit_transform(x_train)
x_test = cv.transform(x_test)

In [11]:
x_train.shape

(7998, 9538)
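The shape above means each of the 7998 training reviews is represented by a vector with one column per vocabulary term. As a rough illustration of why the vectorizer is fitted on the training set only and then reused on the test set, here is a small sketch on a made-up corpus; `train_docs`, `test_docs`, `vectorizer`, `train_matrix`, and `test_matrix` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only.
train_docs = ["great sound quality", "great battery life", "poor sound"]
test_docs = ["great sound and amazing bass"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training text.
train_matrix = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are ignored.
test_matrix = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary terms
print(train_matrix.shape)              # (documents, vocabulary size)
print(test_matrix.shape)               # same number of columns as training
print(test_matrix.nnz)                 # non-zero entries in the test row
```

Because the test document only shares "great" and "sound" with the training vocabulary, the other words contribute nothing to its vector.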

## Building The Classifier
After trying more than 10 algorithms, we settled on a basic AdaBoost classifier, which offered the best mix of accuracy, efficiency, and speed.
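The comparison itself is not shown in this notebook. One way such a comparison could be run is with cross-validation over a few candidate models, as in the sketch below. The corpus, the `candidates` dictionary, the hyperparameters, and the choice of three classifiers are all assumptions made for illustration; they are not the settings or the results behind the statement above.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical corpus; 0 = fake, 1 = real, as in the notebook.
texts = [
    "best product ever buy now", "amazing amazing amazing deal",
    "best deal ever buy now", "amazing product buy it now",
    "best best best product", "buy now amazing deal ever",
    "battery lasted two days on my trip", "the strap broke after a month",
    "sound is fine but the cable is short", "arrived late and the box was dented",
    "fits my desk and the stand is sturdy", "the manual was unclear but it works",
]
labels = [0] * 6 + [1] * 6

# Hypothetical shortlist of candidate models.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=12),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=12),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=12),
}

results = {}
for name, model in candidates.items():
    # Putting the vectorizer inside the pipeline refits it on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
    results[name] = scores.mean()

# Print the candidates from highest to lowest mean accuracy.
for name, mean_accuracy in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {mean_accuracy:.2f}")
```

On a corpus this small the scores say little; the point is only the structure of the comparison, with every model evaluated on the same folds.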

In [12]:
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=130, learning_rate=.5)
clf.fit(x_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=130, random_state=None)

In [14]:
ypreds = clf.predict(x_test)
print(ypreds)

[1 1 0 ... 0 1 1]


## Custom Accuracy Metric

In [15]:
# Tally the confusion matrix by hand (1 = real review, 0 = fake review).
tpos, tneg, fpos, fneg = 0, 0, 0, 0

for prediction, correct_value in zip(ypreds, y_test):
    if prediction == 1 and correct_value == 1:
        tpos += 1
    elif prediction == 1 and correct_value == 0:
        fpos += 1
    elif prediction == 0 and correct_value == 0:
        tneg += 1
    elif prediction == 0 and correct_value == 1:
        fneg += 1

# Hand-computed metrics, cross-checked against scikit-learn below.
recall = tpos / (tpos + fneg)
precision = tpos / (tpos + fpos)
accuracy = (tpos + tneg) / (tpos + tneg + fpos + fneg)

skrecall = sklearn.metrics.recall_score(y_test, ypreds)
skpres = sklearn.metrics.precision_score(y_test, ypreds)
skac = clf.score(x_test, y_test)

print(f'Recall: {recall:.2f}')
print(f'Sklearn recall: {skrecall:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Sklearn precision: {skpres:.2f}')
print(f'Accuracy: {accuracy:.2f}')
print(f'Sklearn accuracy: {skac:.2f}')
print(f'Average score={((recall+skrecall+skpres+precision+skac+accuracy)/6)*100}')

Recall: 0.87
Sklearn recall: 0.87
Precision: 0.90
Sklearn precision: 0.90
Accuracy: 0.89
Sklearn accuracy: 0.89
Average score=88.74451358637052
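The hand-counted true and false positives above can also be obtained from scikit-learn's `confusion_matrix`. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's test set) and shows that the manual formulas agree with `recall_score` and `precision_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration; 1 = real, 0 = fake.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"recall={recall:.2f} (sklearn: {recall_score(y_true, y_pred):.2f})")
print(f"precision={precision:.2f} (sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"accuracy={accuracy:.2f}")
```

Since the hand-computed and scikit-learn values are the same quantities, averaging all six counts each metric twice; the average is equivalent to the mean of recall, precision, and accuracy.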


## Exporting All The Things

In [16]:
# Read the unlabeled test reviews and strip the newline characters.
with open("mixed_test_reviews.txt") as f:
    tests = f.readlines()

newTests = [rvw.replace("\n", "") for rvw in tests]

In [17]:
preds = clf.predict(cv.transform(newTests))

In [18]:
# The predictions are already 0 or 1; convert them to plain Python ints.
preds2 = [int(pred) for pred in preds]

In [19]:
with open('res.txt', 'w') as f:
    for prid in preds2:
        f.write("%s\n" % prid)
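The export step relies on applying the same fitted vectorizer to the new reviews before calling the classifier. One way to make that pairing harder to get wrong is to bundle both steps in a scikit-learn `Pipeline`. The sketch below is an assumption about how that might look, not the notebook's actual code; the training texts, `model`, `new_reviews`, and `output_lines` are hypothetical.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up training data; 0 = fake, 1 = real.
train_texts = [
    "best product ever buy now", "amazing amazing amazing deal",
    "best deal ever buy now", "buy now amazing deal ever",
    "battery lasted two days on my trip", "the strap broke after a month",
    "sound is fine but the cable is short", "the manual was unclear but it works",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline vectorizes raw text and then classifies it in one call.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", AdaBoostClassifier(n_estimators=20, random_state=12)),
])
model.fit(train_texts, train_labels)

new_reviews = ["buy now best deal", "the cable is short but it works"]
predictions = model.predict(new_reviews)

# One label per line, matching the format written to res.txt above.
output_lines = [f"{int(label)}\n" for label in predictions]
print("".join(output_lines), end="")
```

With this layout, `model.predict` accepts raw strings directly, so there is no separate `transform` call to forget.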
