# Machine learning and regression

## Exploring Classifier Transferability
### Data sets

__Data Set 1:__ This data set is built around distinguishing real from fake reviews. The opinion SPAM corpus was produced by Myle Ott and colleagues. In addition to collecting real hotel reviews from the web and TripAdvisor, Ott et al. used Amazon Mechanical Turk to have real people write both positive and negative fake reviews of the hotels:

- http://myleott.com/op-spam.html

The goal of the data set is to train classifiers to detect which reviews are real and which are fake. The reviews are provided in the following nested file structure:

- `./data/op_spam_v1.4/negative_polarity/deceptive_from_MTurk/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/negative_polarity/truthful_from_Web/fold[1-5]/*.txt`
- `./data/op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/fold[1-5]/*.txt`

__Data Set 2:__ The purpose of this code is to train an opinion SPAM classifier on the _curated_ __Data Set 1__ and apply it to a completely different, _real-world_ hotel [booking website's](booking.com) data, to get an idea of how prevalent SPAM is there. The data from this website also live in the assignment's data directory:

- `./data/Hotel_Reviews.csv`
    
and were taken from [Kaggle](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe).

The first step is to load the Op SPAM data. As the data are spread across multiple files, I built a full list of the review files using the `glob` module's `glob.glob(pattern)` function, which returns a list (`all_files`) of all paths matching the provided wildcard pattern (shell-style wildcards, not a regular expression).

In [1]:
import glob
from pprint import pprint
all_files = glob.glob('./data/op_spam_v1.4/*_polarity/*/*/*.txt')
pprint(all_files[:5])

['./data/op_spam_v1.4\\negative_polarity\\deceptive_from_MTurk\\fold1\\d_hilton_1.txt',
 './data/op_spam_v1.4\\negative_polarity\\deceptive_from_MTurk\\fold1\\d_hilton_10.txt',
 './data/op_spam_v1.4\\negative_polarity\\deceptive_from_MTurk\\fold1\\d_hilton_11.txt',
 './data/op_spam_v1.4\\negative_polarity\\deceptive_from_MTurk\\fold1\\d_hilton_12.txt',
 './data/op_spam_v1.4\\negative_polarity\\deceptive_from_MTurk\\fold1\\d_hilton_13.txt']
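To make the wildcard behaviour concrete, here is a minimal sketch that builds a throwaway directory tree with the same nesting and runs the same kind of pattern against it. The file names are made up for illustration and are not part of the corpus.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree that mimics the layout described above
# (the file names here are made up for illustration).
with tempfile.TemporaryDirectory() as root:
    rel_paths = [
        "negative_polarity/deceptive_from_MTurk/fold1/d_example_1.txt",
        "positive_polarity/truthful_from_TripAdvisor/fold2/t_example_1.txt",
        "positive_polarity/truthful_from_TripAdvisor/fold2/notes.md",
    ]
    for rel in rel_paths:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as fh:
            fh.write("placeholder review text")

    # '*' matches any run of characters within one path component;
    # it is a shell-style wildcard, not a regular expression.
    pattern = os.path.join(root, "*_polarity", "*", "*", "*.txt")
    matched = sorted(glob.glob(pattern))
    matched_names = [os.path.basename(p) for p in matched]

print(matched_names)
```

Only the two `.txt` files are matched; the `.md` file is skipped because the last path component must end in `.txt`.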


This is a supervised learning task, but the data do not come with explicit labels. To construct labels from the file names, I used a regex search on each path in `all_files`. Since the first task is sentiment classification, a path containing `positive` (from the `positive_polarity` directory) gets a positive label (value `1`) and any other path gets a negative label (value `0`). The values are stored in a NumPy array called `labels`.


In [6]:
import re
import numpy as np

mask_files = []

for t in all_files:
    # paths under 'positive_polarity' are positive (1); all others are negative (0)
    if re.search('positive', t):
        mask_files.append(1)
    else:
        mask_files.append(0)

labels = np.array(mask_files)

print(len(labels)- sum(labels)) # count of 0s
print(sum(labels)) #count of 1s

800
800
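Because the keyword is a fixed string, a plain substring test gives the same labels as the regex search. The sketch below uses four hypothetical paths (one per directory type, with made-up file names) and a helper name of my own, `label_from_path`, to derive both the sentiment labels used here and the deception labels used later.

```python
# Hypothetical paths in the same layout as the corpus (file names are made up).
example_paths = [
    "./data/op_spam_v1.4/negative_polarity/deceptive_from_MTurk/fold1/a.txt",
    "./data/op_spam_v1.4/negative_polarity/truthful_from_Web/fold1/b.txt",
    "./data/op_spam_v1.4/positive_polarity/deceptive_from_MTurk/fold1/c.txt",
    "./data/op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/fold1/d.txt",
]

def label_from_path(path, keyword):
    """Return 1 if `keyword` occurs anywhere in the path, else 0."""
    return int(keyword in path)

sentiment_labels = [label_from_path(p, "positive_polarity") for p in example_paths]
deception_labels = [label_from_path(p, "deceptive") for p in example_paths]

print(sentiment_labels)
print(deception_labels)
```

The substring test does not depend on the path separator, so it behaves the same on the Windows-style paths shown in the earlier output.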


Next, I imported `sklearn`'s term-document matrix (TDM) builder, `CountVectorizer`, from `sklearn.feature_extraction.text`, and initialized an instance called `vectorizer`:
- `CountVectorizer(input = 'filename')` 

Calling its `.fit_transform()` method on `all_files` (equivalent to `.fit()` followed by `.transform()`) produces the `TDM`, with one row per review and one column per term. I then converted the matrix to a dense representation using `TDM.toarray()`.

In [2]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input = 'filename')

TDM = vectorizer.fit_transform(all_files)
TDM = TDM.toarray()
print(TDM.shape)

(1600, 9571)
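To illustrate what the fit and transform steps do conceptually, here is a small pure-Python sketch. It is not `sklearn`'s implementation; it only mirrors the default token rule (lowercased words of two or more characters) and the alphabetical column order, and the two example sentences and function names are made up.

```python
import re
from collections import Counter

# Same default token rule as CountVectorizer: lowercase, words of 2+ characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count vocabulary tokens per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

docs = ["The room was clean", "The staff was rude, the room was loud"]
vocab = fit_vocabulary(docs)
tdm = transform(docs, vocab)

print(list(vocab))
for row in tdm:
    print(row)
```

Each row has one count per vocabulary term, which is why the real matrix above has 1600 rows (reviews) and 9571 columns (distinct terms).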


Next, I used `train_test_split` (imported from `sklearn.model_selection`) to split the `TDM` and `labels` into $75\%$ training and $25\%$ test sets. I fixed `random_state = 0` so that the split is reproducible; the same value is reused for the later splits.

In [7]:
from sklearn.model_selection import train_test_split

train, test, train_labels, test_labels = train_test_split(TDM,labels,test_size=0.25, random_state=0)

I then imported, initialized, and `.fit()` a series of binary logistic regression classifiers (one per solver) with `sklearn` on the training split, to compare their precision, recall, and $F_1$ on the test split.

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

classifiers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
models = []

for clf in classifiers:
    Logistic_classifier = LogisticRegression(solver=clf)
    Logistic_classifier.fit(train, train_labels)
    score = Logistic_classifier.score(test, test_labels)
    predictions = Logistic_classifier.predict(test)
    models.append([clf, score, predictions, Logistic_classifier.predict_proba(test), precision_score(predictions, test_labels),recall_score(predictions, test_labels),f1_score(predictions, test_labels)])
    print(clf)
    print(score)
    print(predictions[:6])
    print(Logistic_classifier.predict_proba(test)[:6])
    print("Precision, recall, and F1 were:")
    print(precision_score(predictions, test_labels))
    print(recall_score(predictions, test_labels))
    print(f1_score(predictions, test_labels))  

newton-cg
0.9225
[1 0 1 1 1 0]
[[4.12962150e-02 9.58703785e-01]
 [9.97808290e-01 2.19170995e-03]
 [8.98474284e-02 9.10152572e-01]
 [2.52990737e-05 9.99974701e-01]
 [5.81132751e-04 9.99418867e-01]
 [9.49928921e-01 5.00710788e-02]]
Precision, recall, and F1 were:
0.9128205128205128
0.9270833333333334
0.9198966408268734


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


lbfgs
0.9225
[1 0 1 1 1 0]
[[4.12494446e-02 9.58750555e-01]
 [9.97801105e-01 2.19889502e-03]
 [8.99812843e-02 9.10018716e-01]
 [2.51895718e-05 9.99974810e-01]
 [5.80446675e-04 9.99419553e-01]
 [9.49977673e-01 5.00223272e-02]]
Precision, recall, and F1 were:
0.9128205128205128
0.9270833333333334
0.9198966408268734
liblinear
0.9225
[1 0 1 1 1 0]
[[4.05146895e-02 9.59485310e-01]
 [9.97719456e-01 2.28054411e-03]
 [9.19005313e-02 9.08099469e-01]
 [2.41680569e-05 9.99975832e-01]
 [5.67617652e-04 9.99432382e-01]
 [9.48333526e-01 5.16664740e-02]]
Precision, recall, and F1 were:
0.9128205128205128
0.9270833333333334
0.9198966408268734




sag
0.9275
[1 0 1 1 1 0]
[[0.23455944 0.76544056]
 [0.97222867 0.02777133]
 [0.14620265 0.85379735]
 [0.00185253 0.99814747]
 [0.00331635 0.99668365]
 [0.89274615 0.10725385]]
Precision, recall, and F1 were:
0.9179487179487179
0.9322916666666666
0.9250645994832041
saga
0.9275
[1 0 1 1 1 0]
[[0.33750908 0.66249092]
 [0.92871187 0.07128813]
 [0.17840061 0.82159939]
 [0.00779389 0.99220611]
 [0.00815339 0.99184661]
 [0.90116798 0.09883202]]
Precision, recall, and F1 were:
0.9179487179487179
0.9322916666666666
0.9250645994832041




Accuracy is the fraction of test reviews classified correctly, (true positive + true negative) / total. It is about 92% to 93% across the solvers here, which can be good enough if the cases being misclassified are not dangerous to misclassify. Precision is (true positive / (true positive + false positive)), or (true positive / total predicted positive), so it says how many of the reviews called positive were actually positive; it matters most when the cost of false positives is high. Recall is (true positive / (true positive + false negative)), or (true positive / total actual positive), so it says how many of the actual positives were found; it matters most when the cost of false negatives is high. For the `saga` model the printed precision is about 91.8% and the printed recall about 93.2%, so both kinds of error are relatively rare. One caveat: `sklearn`'s metric functions expect `(y_true, y_pred)`, while the calls above pass the predictions first, so the printed precision and recall are swapped relative to that convention. $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ is unaffected by the swap (about 92.5% for `saga`). It is most useful when you want a balance between precision and recall and the classes are unevenly distributed (i.e., many more actual positives than negatives, or the reverse); in this data set the two classes are balanced, with 800 reviews each.
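The definitions above can be checked with a few lines of plain Python. The sketch below uses made-up labels and helper names of my own, and assumes there is at least one predicted and one actual positive (otherwise the divisions fail). It also shows what happens when the two arguments are passed in the opposite order, which matters because `sklearn`'s metric functions take the true labels first.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

p, r, f = precision_recall_f1(y_true, y_pred)
p_swapped, r_swapped, f_swapped = precision_recall_f1(y_pred, y_true)

print(p, r, f)
print(p_swapped, r_swapped, f_swapped)
```

Swapping the arguments exchanges false positives and false negatives, so precision and recall trade places while $F_1$ stays the same.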

In [9]:
print(models[4][-3:]) #precision, recall, F1
print(models[4][1]) #accuracy

[0.9179487179487179, 0.9322916666666666, 0.9250645994832041]
0.9275
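Indexing results as `models[4][-3:]` works, but it relies on remembering the position of each field. A possible alternative, sketched here with field names of my own choosing and two rows of values copied from the printed output above, is to store each result under a name.

```python
from collections import namedtuple

# Field names are my own choice; the values are copied from the printed output above.
Result = namedtuple("Result", ["solver", "accuracy", "precision", "recall", "f1"])

results = {
    "newton-cg": Result("newton-cg", 0.9225, 0.9128205128205128,
                        0.9270833333333334, 0.9198966408268734),
    "saga": Result("saga", 0.9275, 0.9179487179487179,
                   0.9322916666666666, 0.9250645994832041),
}

# Pick the solver with the highest F1 among the stored results.
best = max(results.values(), key=lambda r: r.f1)
print(best.solver, round(best.f1, 4))
```

With named fields, a lookup such as `results["saga"].accuracy` reads the same way the prose describes it.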


As mentioned above, I also wanted to see how well this sentiment polarity classifier does on a different data set:

- `./data/Hotel_Reviews.csv`

which is hosted on Kaggle but came from Booking.com:

- https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

There's a decent description of the data there. A customer can leave a positive and a negative comment in parallel. To get started, I loaded these data with pandas; the columns and first rows are shown below.

In [10]:
import pandas as pd

hotels = pd.read_csv('./data/Hotel_Reviews.csv')

In [11]:
list(hotels)

['Hotel_Address',
 'Additional_Number_of_Scoring',
 'Review_Date',
 'Average_Score',
 'Hotel_Name',
 'Reviewer_Nationality',
 'Negative_Review',
 'Review_Total_Negative_Word_Counts',
 'Total_Number_of_Reviews',
 'Positive_Review',
 'Review_Total_Positive_Word_Counts',
 'Total_Number_of_Reviews_Reviewer_Has_Given',
 'Reviewer_Score',
 'Tags',
 'days_since_review',
 'lat',
 'lng']

In [12]:
hotels.head(2)

| | Hotel_Address | Additional_Number_of_Scoring | Review_Date | Average_Score | Hotel_Name | Reviewer_Nationality | Negative_Review | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Positive_Review | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | Tags | days_since_review | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Russia | I am so angry that i made this post available... | 397 | 1403 | Only the park outside of the hotel was beauti... | 11 | 7 | 2.9 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |
| 1 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Ireland | No Negative | 0 | 1403 | No real complaints the hotel was great great ... | 105 | 7 | 7.5 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |


One difficulty is that a reviewer sometimes leaves only a positive or only a negative comment. The empty field is then not filled with a conventional N/A value. Referring back to the data dictionary:

- https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe

If there is no positive review, the field says `No Positive`; similarly, if there is no negative review, it says `No Negative`. This has to be dealt with. To do so, I made a list of the non-placeholder reviews with a parallel list of labels, again `1` for positive reviews and `0` for negative reviews.

In [13]:
Hotel_labels = []
Hotel_review = []

# negative comments first (label 0), skipping the "No Negative" placeholder
for row in hotels['Negative_Review']:
    if row != "No Negative":
        Hotel_labels.append(0)
        Hotel_review.append(row)

# then positive comments (label 1), skipping the "No Positive" placeholder
for row in hotels['Positive_Review']:
    if row != "No Positive":
        Hotel_labels.append(1)
        Hotel_review.append(row)
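The same filtering can be written once for both columns by describing each column as a (name, placeholder, label) triple. This is only a sketch on three made-up rows held in plain dictionaries, not the real data frame, and the variable names are my own.

```python
# Three made-up rows with the same two text columns as the Booking.com data.
rows = [
    {"Negative_Review": "No Negative", "Positive_Review": " Great location "},
    {"Negative_Review": " Noisy room ", "Positive_Review": "No Positive"},
    {"Negative_Review": " Slow lift ", "Positive_Review": " Friendly staff "},
]

# (column name, placeholder meaning "nothing written", label for that column)
COLUMNS = [
    ("Negative_Review", "No Negative", 0),
    ("Positive_Review", "No Positive", 1),
]

reviews, review_labels = [], []
for column, placeholder, label in COLUMNS:
    for row in rows:
        text = row[column]
        if text != placeholder:
            reviews.append(text)
            review_labels.append(label)

print(reviews)
print(review_labels)
```

As in the cell above, all negative comments come first and all positive comments second, so the two lists stay parallel.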

There are more positive reviews than negative reviews (about 480k compared to 388k, or 55% compared to 45%). This is not a severe class imbalance, but it could be corrected by resampling if desired.

In [14]:
print((len(Hotel_review)-sum(Hotel_labels))/len(Hotel_review)) # fraction of zeros (negative reviews)
print(sum(Hotel_labels)/len(Hotel_review)) # fraction of ones (positive reviews)

# peek at the boundary where the negative reviews end and the positive reviews begin
testing123=zip(Hotel_review,Hotel_labels)
print(list(testing123)[387846:387850])

0.4470148909686045
0.5529851090313955
[(' The ac was useless It was a hot week in vienna and it only gave more hot air', 0), (' I was in 3rd floor It didn t work Free Wife ', 0), (' Only the park outside of the hotel was beautiful ', 1), (' No real complaints the hotel was great great location surroundings rooms amenities and service Two recommendations however firstly the staff upon check in are very confusing regarding deposit payments and the staff offer you upon checkout to refund your original payment and you can make a new one Bit confusing Secondly the on site restaurant is a bit lacking very well thought out and excellent quality food for anyone of a vegetarian or vegan background but even a wrap or toasted sandwich option would be great Aside from those minor minor things fantastic spot and will be back when i return to Amsterdam ', 1)]


Using `CountVectorizer()` again, I then created a TDM for the new hotel data. To do so I reused the vectorizer fitted earlier (and therefore its vocabulary); this is what makes it an exercise in classifier transferability.

In [15]:
vectorizer.input = 'content'
TDM2 = vectorizer.transform(Hotel_review)
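Reusing a fitted vocabulary means that any word in the new reviews that never appeared in the training corpus is silently dropped. A rough way to gauge how much is lost is to measure what share of the new tokens the old vocabulary covers. The sketch below does this in plain Python on made-up snippets; the function names are my own, and the token rule mirrors `CountVectorizer`'s default.

```python
import re

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default token rule

def tokens(text):
    return TOKEN.findall(text.lower())

def vocabulary_coverage(train_docs, new_docs):
    """Fraction of token occurrences in new_docs that the training vocabulary knows."""
    vocab = {tok for doc in train_docs for tok in tokens(doc)}
    new_tokens = [tok for doc in new_docs for tok in tokens(doc)]
    known = sum(tok in vocab for tok in new_tokens)
    return known / len(new_tokens)

# Made-up snippets standing in for the two corpora.
train_docs = ["The hotel room was clean and quiet", "The staff was rude"]
new_docs = ["Room was tiny and the wifi was broken"]

coverage = vocabulary_coverage(train_docs, new_docs)
print(coverage)
```

Here five of the eight new tokens are known; words such as "tiny", "wifi" and "broken" would contribute nothing to the transformed row.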


When applying the classifier to this new Booking.com TDM, I also computed the accuracy, precision, recall, and $F_1$. Because of the class imbalance noted earlier, $F_1$ is the metric to look at; it falls between roughly 82% and 85% depending on the solver.

In [16]:
train2, test2, train_labels2, test_labels2 = train_test_split(TDM2,Hotel_labels,test_size=0.25, random_state=0)
classifiers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
models2 = []

for clf in classifiers:
    Logistic_classifier = LogisticRegression(solver=clf)
    # fit on the Op SPAM training split, then evaluate on the Booking.com test split
    Logistic_classifier.fit(train, train_labels)
    score = Logistic_classifier.score(test2, test_labels2)
    predictions = Logistic_classifier.predict(test2)
    models2.append([clf, score, predictions, Logistic_classifier.predict_proba(test2), precision_score(predictions, test_labels2),recall_score(predictions, test_labels2),f1_score(predictions, test_labels2)])
    print(clf)
    print(score)
    print(predictions[:6])
    print(Logistic_classifier.predict_proba(test2)[:6])
    print("Precision, recall, and F1 were:")
    print(precision_score(predictions, test_labels2))
    print(recall_score(predictions, test_labels2))
    print(f1_score(predictions, test_labels2))  

newton-cg
0.7883684477433037
[1 0 1 1 1 0]
[[1.61442581e-01 8.38557419e-01]
 [5.33155204e-01 4.66844796e-01]
 [2.21882104e-01 7.78117896e-01]
 [3.53371636e-01 6.46628364e-01]
 [3.99005573e-04 9.99600994e-01]
 [5.34914439e-01 4.65085561e-01]]
Precision, recall, and F1 were:
0.8518240754628935
0.7842451303060293
0.8166389058649188


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


lbfgs
0.7887003826471809
[1 0 1 1 1 0]
[[1.62130203e-01 8.37869797e-01]
 [5.34498945e-01 4.65501055e-01]
 [2.22848508e-01 7.77151492e-01]
 [3.54122477e-01 6.45877523e-01]
 [3.97728000e-04 9.99602272e-01]
 [5.36314321e-01 4.63685679e-01]]
Precision, recall, and F1 were:
0.8508074596270186
0.7852127569579562
0.8166951291208902
liblinear
0.7931400119865382
[1 0 1 1 1 0]
[[1.70012340e-01 8.29987660e-01]
 [5.50046837e-01 4.49953163e-01]
 [2.34190074e-01 7.65809926e-01]
 [3.64255036e-01 6.35744964e-01]
 [3.92536915e-04 9.99607463e-01]
 [5.52288419e-01 4.47711581e-01]]
Precision, recall, and F1 were:
0.835833208339583
0.7994086424279133
0.8172152517516702




sag
0.8220921119358259
[1 0 1 1 1 0]
[[0.24029359 0.75970641]
 [0.58612999 0.41387001]
 [0.29471479 0.70528521]
 [0.31779801 0.68220199]
 [0.00344781 0.99655219]
 [0.50384282 0.49615718]]
Precision, recall, and F1 were:
0.9064380114327617
0.7990157191126781
0.8493437336519016




saga
0.7992347056382831
[1 0 1 1 1 1]
[[0.27653388 0.72346612]
 [0.58240944 0.41759056]
 [0.32673295 0.67326705]
 [0.30332667 0.69667333]
 [0.00645372 0.99354628]
 [0.49036772 0.50963228]]
Precision, recall, and F1 were:
0.9323283835808209
0.7595104268491365
0.8370929005903082


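Comparing these results with the results from earlier, the printed recall is much worse across the board (roughly 0.76 to 0.80, versus about 0.93), while the printed precision is only a little worse for most solvers and slightly higher for `saga` (0.93 versus 0.92). Thus the $F_1$ scores are lower overall.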

I was also interested in going back to the original data set and relabeling it by truthfulness rather than sentiment, using the `deceptive` directory name in the file paths: deceptive reviews are labeled `1` and all other (truthful) reviews are labeled `0`.

In [17]:
mask_files = []

for t in all_files:
    # paths under a 'deceptive_from_MTurk' directory are deceptive (1); all others are truthful (0)
    if re.search('deceptive', t):
        mask_files.append(1)
    else:
        mask_files.append(0)

labels = np.array(mask_files)

print(len(labels)- sum(labels)) # count of 0s
print(sum(labels)) #count of 1s

800
800


For this I had to train a new classifier on the Opinion SPAM deception labels, pooling the positive and negative reviews and reusing the same $75\%$/$25\%$ split.

In [18]:
train3, test3, train_labels3, test_labels3 = train_test_split(TDM,labels,test_size=0.25, random_state=0)

classifiers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
models3 = []

for clf in classifiers:
    Logistic_classifier = LogisticRegression(solver=clf)
    Logistic_classifier.fit(train3, train_labels3)
    score = Logistic_classifier.score(test3, test_labels3)
    predictions = Logistic_classifier.predict(test3)
    models3.append([clf, score, predictions, Logistic_classifier.predict_proba(test3), precision_score(predictions, test_labels3),recall_score(predictions, test_labels3),f1_score(predictions, test_labels3)])
    print(clf)
    print(score)
    print(predictions[:6])
    print(Logistic_classifier.predict_proba(test3)[:6])
    print("Precision, recall, and F1 were:")
    print(precision_score(predictions, test_labels3))
    print(recall_score(predictions, test_labels3))
    print(f1_score(predictions, test_labels3))  

newton-cg
0.855
[1 1 0 1 1 0]
[[2.26589458e-01 7.73410542e-01]
 [6.42914939e-05 9.99935709e-01]
 [6.55221297e-01 3.44778703e-01]
 [4.59268999e-03 9.95407310e-01]
 [3.24572157e-01 6.75427843e-01]
 [9.99819875e-01 1.80125010e-04]]
Precision, recall, and F1 were:
0.8469387755102041
0.8556701030927835
0.8512820512820514


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


lbfgs
0.855
[1 1 0 1 1 0]
[[2.27332236e-01 7.72667764e-01]
 [6.44517766e-05 9.99935548e-01]
 [6.57542922e-01 3.42457078e-01]
 [4.43444580e-03 9.95565554e-01]
 [3.32764753e-01 6.67235247e-01]
 [9.99826871e-01 1.73128848e-04]]
Precision, recall, and F1 were:
0.8469387755102041
0.8556701030927835
0.8512820512820514
liblinear
0.85
[1 1 0 1 1 0]
[[2.29299554e-01 7.70700446e-01]
 [6.37693281e-05 9.99936231e-01]
 [6.50379387e-01 3.49620613e-01]
 [4.89342760e-03 9.95106572e-01]
 [3.29705880e-01 6.70294120e-01]
 [9.99821697e-01 1.78302973e-04]]
Precision, recall, and F1 were:
0.8367346938775511
0.8541666666666666
0.845360824742268




sag
0.85
[1 1 0 1 1 0]
[[0.27528864 0.72471136]
 [0.00238204 0.99761796]
 [0.65144443 0.34855557]
 [0.01148668 0.98851332]
 [0.32291663 0.67708337]
 [0.96409129 0.03590871]]
Precision, recall, and F1 were:
0.8418367346938775
0.8505154639175257
0.846153846153846
saga
0.8425
[1 1 0 1 1 0]
[[0.32270378 0.67729622]
 [0.00647821 0.99352179]
 [0.64522627 0.35477373]
 [0.02588351 0.97411649]
 [0.30724531 0.69275469]
 [0.85130072 0.14869928]]
Precision, recall, and F1 were:
0.8418367346938775
0.8375634517766497
0.8396946564885496




I then ran this newly trained classifier against the Booking.com hotel reviews data set, using a classification threshold of $0.5$ (the default behavior of `.predict()`) to flag SPAM reviews.

In [19]:
classifiers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
models4 = []

for clf in classifiers:
    Logistic_classifier = LogisticRegression(solver=clf)
    Logistic_classifier.fit(train3, train_labels3)
    # NOTE: test_labels2 are sentiment labels (1 = positive, 0 = negative), not SPAM labels,
    # so this score and the metrics below do not measure SPAM-detection quality
    score = Logistic_classifier.score(test2, test_labels2)
    predictions = Logistic_classifier.predict(test2)
    models4.append([clf, score, predictions, test2, Logistic_classifier.predict_proba(test2), precision_score(predictions, test_labels2),recall_score(predictions, test_labels2),f1_score(predictions, test_labels2)])
    print(clf)
    print(score)
    print(predictions[:6])
    print(Logistic_classifier.predict_proba(test2)[:6])
    print("Precision, recall, and F1 were:")
    print(precision_score(predictions, test_labels2))
    print(recall_score(predictions, test_labels2))
    print(f1_score(predictions, test_labels2))  
    
# .predict() labels a review as SPAM when its predicted SPAM probability exceeds 0.5

newton-cg
0.44417039325065694
[1 0 0 0 0 0]
[[0.35642289 0.64357711]
 [0.76411942 0.23588058]
 [0.66325492 0.33674508]
 [0.87126427 0.12873573]
 [0.9753977  0.0246023 ]
 [0.63757917 0.36242083]]
Precision, recall, and F1 were:
0.09638684732430045
0.4882032667876588
0.16098929011336194


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


lbfgs
0.4439721543497303
[1 0 0 0 0 0]
[[0.35436375 0.64563625]
 [0.76410586 0.23589414]
 [0.65999324 0.34000676]
 [0.8698751  0.1301249 ]
 [0.97565119 0.02434881]
 [0.6340093  0.3659907 ]]
Precision, recall, and F1 were:
0.09809509524523774
0.4875341671498385
0.1633275986458738
liblinear
0.4418975612004979
[1 0 0 0 0 0]
[[0.33150996 0.66849004]
 [0.73822121 0.26177879]
 [0.63347064 0.36652936]
 [0.85954438 0.14045562]
 [0.97588647 0.02411353]
 [0.60424642 0.39575358]]
Precision, recall, and F1 were:
0.1183190840457977
0.4821392190152801
0.19000909966812973




sag
0.4367295191554101
[1 0 0 0 0 0]
[[0.41420406 0.58579594]
 [0.65319014 0.34680986]
 [0.58794645 0.41205355]
 [0.78645479 0.21354521]
 [0.96428004 0.03571996]
 [0.56748047 0.43251953]]
Precision, recall, and F1 were:
0.08687065646717664
0.45280806150371367
0.14577463311636102




saga
0.4338619704024711
[1 0 0 0 0 0]
[[0.43568092 0.56431908]
 [0.6151047  0.3848953 ]
 [0.56332884 0.43667116]
 [0.73669076 0.26330924]
 [0.94870189 0.05129811]
 [0.54290986 0.45709014]]
Precision, recall, and F1 were:
0.08933719980667633
0.4423402236250361
0.14865192764986862


In [20]:
models4[0][4],

(array([[0.35642289, 0.64357711],
        [0.76411942, 0.23588058],
        [0.66325492, 0.33674508],
        ...,
        [0.73626411, 0.26373589],
        [0.97437101, 0.02562899],
        [0.95579896, 0.04420104]]),)
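The second column of `predict_proba` is the probability of class `1` (SPAM), and a threshold turns it into a flag. The sketch below applies a threshold to the six SPAM probabilities printed for `newton-cg` above. The helper name `flag_spam` and the stricter cut-off of 0.7 are my own, chosen only to show how the flags change.

```python
def flag_spam(probabilities, threshold=0.5):
    """Return 1 where the SPAM-class probability exceeds the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]

# Second column of the first six predict_proba rows printed for 'newton-cg' above.
spam_probs = [0.64357711, 0.23588058, 0.33674508, 0.12873573, 0.0246023, 0.36242083]

default_flags = flag_spam(spam_probs)       # matches the printed predictions
strict_flags = flag_spam(spam_probs, 0.7)   # a hypothetical stricter cut-off
flagged_share = sum(default_flags) / len(default_flags)

print(default_flags)
print(strict_flags)
print(flagged_share)
```

Averaging the flags gives the share of reviews flagged as SPAM at a given threshold, which is the quantity of interest when no SPAM labels are available.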

Scored against the Booking.com sentiment labels, this classifier does poorly, but those labels say nothing about deception, so the scores are not a good indicator of how well it separates truthful from paid reviews in the Booking.com data set. To examine this further, I sorted the Booking.com reviews by their predicted SPAM probability; `sorted()` orders them from low to high.

In [21]:
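# just selecting one of the classifiers to sort
# NOTE: models4[0][4] holds probabilities for the shuffled test split (test2), while
# Hotel_review is the full list in its original order, so zip() does not pair each
# probability with its own review; the texts would need to go through the same split

Sorted_values = sorted(zip(models4[0][4][:,1],Hotel_review))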
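For a probability to stay attached to its own review, the review texts have to pass through the same shuffle as the feature rows. `train_test_split` accepts several arrays and splits them consistently, so passing `Hotel_review` alongside `TDM2` would be one route. The plain-Python sketch below shows the underlying idea on made-up data: split the row indices once, then index every parallel sequence with them. The helper `split_indices` is my own and is not how `sklearn` shuffles.

```python
import random

def split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and return (train_idx, test_idx)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

# Made-up reviews and a stand-in "feature" per row (its length), for illustration.
reviews = ["tiny room", "great stay", "rude staff", "lovely view",
           "no hot water", "perfect location", "dirty sheets", "would return"]
features = [len(text) for text in reviews]

train_idx, test_idx = split_indices(len(reviews))

# Index every parallel sequence with the same test indices.
test_features = [features[i] for i in test_idx]
test_reviews = [reviews[i] for i in test_idx]

# Anything computed from test_features stays attached to the right text.
pairs = list(zip(test_features, test_reviews))
print(pairs)
```

Each pair printed here holds a feature and the text it was computed from, which is the property the sorted list of probabilities and reviews needs.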

In [22]:
print(Sorted_values[0])

(2.6204199698427492e-11, ' The room was really small With my suitcase open I had to climb over the bed to get to the bathroom ')


As there are no SPAM labels for the Booking.com data, I inspected a few of the most and least spammy reviews. To the human eye there is not much difference between the 'most' and 'least' spammy reviews.

In [23]:
print(Sorted_values[-1])

(0.9996488261298903, ' The Deluxe Room was pricey and extremely small It was so small that a big suitcase couldn t be moved around The bed was too narrow for an average sized adult to roll from one side to the other ')


In [24]:
print(Sorted_values[-3])

(0.9987647855698705, ' Not enough lifts we were on 7th floor took 15 mins from room to reception Saturday morning not much better late afternoon Too many light switches 13 in 1 room ')


In [25]:
print(Sorted_values[1])

(3.903642702145739e-10, ' The room was a little snug but we went for the cheapest option so that was to be expected Plus the aircon wasn t too good and didn t come with instructions luckily we weren t too hot ')
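When only a handful of extremes are needed, sorting the whole list is not required. A possible alternative is `heapq.nlargest` and `heapq.nsmallest`, sketched here on made-up (probability, review) pairs that stand in for the scored reviews.

```python
import heapq

# Made-up (probability, review) pairs standing in for the scored reviews.
scored = [
    (0.91, "review A"),
    (0.07, "review B"),
    (0.55, "review C"),
    (0.99, "review D"),
    (0.02, "review E"),
]

k = 2
most_spammy = heapq.nlargest(k, scored)    # highest probabilities first
least_spammy = heapq.nsmallest(k, scored)  # lowest probabilities first

print(most_spammy)
print(least_spammy)
```

Tuples compare by their first element, so the pairs are ranked by probability, just as in the `sorted(zip(...))` call above.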
