<a href="https://colab.research.google.com/github/lupis30puc/yelp_bert_random_forest/blob/update-5/Yelp_RF_mimic_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Yelp Polarity on Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset)

The data are 12,993 samples from the Yelp Dataset Challenge 2020, divided into train, validation and test subsets.
Their sizes are 10,394 train samples, 1,949 validation samples and 650 test samples.


Tutorial this notebook builds on: 
[Sentiment Analysis Yelp with Random Forest](https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset)

## Set Up

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=dfa17d42611469bc06d

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import string
import math

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve
%matplotlib inline
import time
import pickle

## Loading the ids and predicted labels from BERT

The goal is a boolean matrix built from the input ids of each dataset: one row per review, one column per unique token id, and a 1 wherever that id occurs in the review.
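
As a minimal sketch of this idea, assuming a toy list of token-id sequences (the names `toy_ids`, `vocab` and `matrix` are illustrative and do not appear in the notebook), the presence matrix can be built like this:

```python
import numpy as np

# Toy stand-in for the tokenized reviews (hypothetical values).
toy_ids = [[101, 2204, 2204, 102], [101, 4997, 102, 0, 0]]

# Sorted array of every id that appears in any review: one column per id.
vocab = np.unique(np.concatenate([np.unique(r) for r in toy_ids]))

# Presence matrix: rows are reviews, columns follow the order of `vocab`.
matrix = np.zeros((len(toy_ids), len(vocab)), dtype=np.int8)
for row, review in enumerate(toy_ids):
    # np.isin flags which vocabulary ids occur in this review.
    matrix[row] = np.isin(vocab, review)

print(vocab)
print(matrix)
```

Repeated ids within a review collapse to a single 1, so the matrix records presence and not counts.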

In [None]:
with open('/content/drive/MyDrive/Yelp/tensors_yelp/train_ids_99.pkl', 'rb') as f:
    train_ids = pickle.load(f)

with open('/content/drive/MyDrive/Yelp/tensors_yelp/test_ids_99.pkl', 'rb') as d:
    test_ids = pickle.load(d)

In [None]:
len(train_ids)

10394

In [None]:
len(test_ids)

2599

In [None]:
test_ids[0][:10]

[101, 3893, 2015, 2204, 3295, 3095, 16286, 5379, 4997, 2015]

In [None]:
import torch
bert_train_pred = torch.load('/content/drive/MyDrive/Yelp/model_99/flat_pred_labels_train')
bert_test_pred = torch.load('/content/drive/MyDrive/Yelp/model_99/flat_pred_labels_test')

In [None]:
bert_train_pred[:5]

[1, 0, 0, 1, 1]

In [None]:
bert_test_pred[:5]

[0, 1, 1, 0, 0]

## Initializing BERT tokenizer and reviews texts:

In [None]:
# The review texts are loaded so that individual samples can be inspected later on
train_df = pd.read_pickle('/content/drive/MyDrive/Yelp/sample_train_10394.pkl')
test_df = pd.read_pickle('/content/drive/MyDrive/Yelp/sample_test_2599.pkl')

train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)
#train_df.head()

In [None]:
# launching the saved model tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('/content/drive/MyDrive/Yelp/model_99/')

In [None]:
tokenizer.vocab_size

30522

## Creation of boolean dataframes

In [None]:
%%time
# build a dict with the unique ids that appear in each train review,
# and an array of all unique ids of the train set
isin_ids_tr = {i:np.unique(train_ids[i]) for i in range(len(train_ids))}
all_uni_ids_tr = np.concatenate(list(isin_ids_tr.values()), axis=0)
unique_ids_train = np.unique(all_uni_ids_tr)

CPU times: user 240 ms, sys: 409 µs, total: 241 ms
Wall time: 247 ms


In [None]:
len(unique_ids_train)

15574

In [None]:
%%time
# build a dict with the unique ids that appear in each test review,
# and an array of all unique ids of the test set
isin_ids_ts = {i:np.unique(test_ids[i]) for i in range(len(test_ids))}
all_uni_ids_ts = np.concatenate(list(isin_ids_ts.values()), axis=0)
unique_ids_test = np.unique(all_uni_ids_ts)

CPU times: user 60 ms, sys: 908 µs, total: 60.9 ms
Wall time: 64.9 ms


In [None]:
len(unique_ids_test)

10163

In [None]:
%%time
all_unique_ids = np.concatenate((unique_ids_train, unique_ids_test), axis=0)
unique_ids = np.unique(all_unique_ids)

CPU times: user 2.62 ms, sys: 0 ns, total: 2.62 ms
Wall time: 2.4 ms


In [None]:
len(unique_ids)

16368

In [None]:
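# a function to fill in the dataframes in a boolean way:
# for every review, set the columns of the ids it contains to 1
def is_word_in(isin_ids, df):
  for i in range(len(isin_ids)):
    # .loc writes into df itself; chained indexing (df[id][i] = 1) may write to a copy
    df.loc[i, list(isin_ids[i])] = 1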

In [None]:
%%time
# make a dataframe whose index is the review index and whose columns are the unique token ids found in the reviews
x_train = pd.DataFrame(index=range(len(train_ids)), columns=unique_ids)
x_train.fillna(0, inplace=True)

CPU times: user 48.8 s, sys: 1.34 s, total: 50.2 s
Wall time: 50.2 s


In [None]:
%%time
is_word_in(isin_ids_tr, x_train)

CPU times: user 28.2 s, sys: 1.04 s, total: 29.3 s
Wall time: 29.5 s


In [None]:
x_train.tail()

Unnamed: 0,0,100,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1067,1087,1094,1095,1107,1635,1636,1646,1647,1651,1656,...,29578,29582,29584,29589,29591,29592,29593,29597,29602,29610,30173,30174,30177,30179,30180,30181,30182,30183,30186,30187,30189,30191,30192,30194,30197,30198,30200,30203,30207,30211,30212,30217,30219,30220,30221,30228,30239,30251,30257,30263
10389,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10390,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10391,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10392,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10393,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
%%time
x_train.describe()

CPU times: user 34 s, sys: 257 ms, total: 34.2 s
Wall time: 34.3 s


Unnamed: 0,0,100,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1067,1087,1094,1095,1107,1635,1636,1646,1647,1651,1656,...,29578,29582,29584,29589,29591,29592,29593,29597,29602,29610,30173,30174,30177,30179,30180,30181,30182,30183,30186,30187,30189,30191,30192,30194,30197,30198,30200,30203,30207,30211,30212,30217,30219,30220,30221,30228,30239,30251,30257,30263
count,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,...,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0
mean,0.856937,0.000481,1.0,1.0,0.004137,0.010583,9.6e-05,0.001828,0.002982,0.00635,0.017991,0.008755,0.00202,0.000673,0.003945,0.002213,0.000192,0.004329,0.001347,0.000577,0.000673,0.007216,0.004137,0.004233,0.000673,0.008274,0.001155,0.002309,0.003656,0.0,0.0,9.6e-05,9.6e-05,9.6e-05,9.6e-05,0.000289,0.0,9.6e-05,9.6e-05,9.6e-05,...,0.0,9.6e-05,9.6e-05,0.000577,0.000577,0.000673,0.002598,0.000192,0.000192,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,0.0,0.0,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,0.0,9.6e-05,9.6e-05,0.0,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05,9.6e-05
std,0.350154,0.021929,0.0,0.0,0.064189,0.102333,0.009809,0.042718,0.054533,0.079436,0.132925,0.093162,0.044906,0.025944,0.062685,0.046991,0.013871,0.065659,0.036678,0.02402,0.025944,0.084642,0.064189,0.064928,0.025944,0.090589,0.03396,0.047999,0.060357,0.0,0.0,0.009809,0.009809,0.009809,0.009809,0.016987,0.0,0.009809,0.009809,0.009809,...,0.0,0.009809,0.009809,0.02402,0.02402,0.025944,0.050903,0.013871,0.013871,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.0,0.0,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.0,0.009809,0.009809,0.0,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809,0.009809
min,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
%%time
# make a dataframe whose index is the review index and whose columns are the unique token ids found in the reviews
x_test = pd.DataFrame(index=range(len(test_ids)), columns=unique_ids)
x_test.fillna(0, inplace=True)


CPU times: user 14 s, sys: 52.8 ms, total: 14.1 s
Wall time: 14.1 s


In [None]:
%%time
# fill the test dataframe in the same way as the train dataframe
is_word_in(isin_ids_ts, x_test)

CPU times: user 7.02 s, sys: 250 ms, total: 7.27 s
Wall time: 7.29 s


In [None]:
x_test.tail()

Unnamed: 0,0,100,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1067,1087,1094,1095,1107,1635,1636,1646,1647,1651,1656,...,29578,29582,29584,29589,29591,29592,29593,29597,29602,29610,30173,30174,30177,30179,30180,30181,30182,30183,30186,30187,30189,30191,30192,30194,30197,30198,30200,30203,30207,30211,30212,30217,30219,30220,30221,30228,30239,30251,30257,30263
2594,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2595,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2596,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2597,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2598,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
%%time
x_test.describe()

Unnamed: 0,0,100,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1067,1087,1094,1095,1107,1635,1636,1646,1647,1651,1656,...,29578,29582,29584,29589,29591,29592,29593,29597,29602,29610,30173,30174,30177,30179,30180,30181,30182,30183,30186,30187,30189,30191,30192,30194,30197,30198,30200,30203,30207,30211,30212,30217,30219,30220,30221,30228,30239,30251,30257,30263
count,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,...,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0
mean,0.850327,0.00077,1.0,1.0,0.003848,0.010004,0.0,0.001154,0.005002,0.005387,0.015775,0.00808,0.002309,0.0,0.006541,0.003463,0.0,0.005771,0.000385,0.000385,0.00077,0.007695,0.004232,0.005002,0.000385,0.010773,0.001539,0.001539,0.003078,0.000385,0.000385,0.0,0.0,0.0,0.0,0.000385,0.000385,0.0,0.0,0.0,...,0.000385,0.0,0.0,0.000385,0.000385,0.001154,0.002309,0.0,0.0,0.0,0.000385,0.000385,0.000385,0.0,0.0,0.000385,0.000385,0.000385,0.0,0.000385,0.0,0.0,0.000385,0.0,0.0,0.0,0.0,0.000385,0.0,0.0,0.000385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,0.356819,0.027735,0.0,0.0,0.061922,0.099537,0.0,0.033962,0.070561,0.07321,0.124629,0.089542,0.048001,0.0,0.080627,0.058755,0.0,0.075765,0.019615,0.019615,0.027735,0.087401,0.064932,0.070561,0.019615,0.103254,0.039208,0.039208,0.055406,0.019615,0.019615,0.0,0.0,0.0,0.0,0.019615,0.019615,0.0,0.0,0.0,...,0.019615,0.0,0.0,0.019615,0.019615,0.033962,0.048001,0.0,0.0,0.0,0.019615,0.019615,0.019615,0.0,0.0,0.019615,0.019615,0.019615,0.0,0.019615,0.0,0.0,0.019615,0.0,0.0,0.0,0.0,0.019615,0.0,0.0,0.019615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
x_train.to_pickle('/content/drive/MyDrive/Yelp/binary_train_rf')

In [None]:
x_test.to_pickle('/content/drive/MyDrive/Yelp/binary_test_rf')

## Training

In [None]:
x_train = pd.read_pickle('/content/drive/MyDrive/Yelp/binary_train_rf')
x_test = pd.read_pickle('/content/drive/MyDrive/Yelp/binary_test_rf')

In [None]:
# No new split is made here: the train and test subsets are the ones prepared for BERT.
# A fresh split with the same random state would look like the commented line below.
#x_train,x_test,y_train,y_test = train_test_split(df1,test_labels,test_size=0.2,random_state=42)

In [None]:
y_train = bert_train_pred.copy()
y_test = bert_test_pred.copy()
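
Because the targets are BERT's predicted labels and not the gold labels, the scores in this section measure how closely the forest reproduces BERT (often called fidelity), not accuracy on the true sentiment. A small sketch with made-up label arrays (`gold`, `bert_pred` and `rf_pred` are hypothetical) shows the two quantities side by side:

```python
import numpy as np

gold = np.array([1, 0, 0, 1, 1, 0])       # hypothetical true labels
bert_pred = np.array([1, 0, 1, 1, 1, 0])  # hypothetical teacher (BERT) predictions
rf_pred = np.array([1, 0, 1, 0, 1, 0])    # hypothetical student (forest) predictions

# Fidelity: share of samples where the student agrees with the teacher.
fidelity = np.mean(rf_pred == bert_pred)
# Accuracy: share of samples where the student matches the true label.
accuracy = np.mean(rf_pred == gold)
print(f"fidelity={fidelity:.3f} accuracy={accuracy:.3f}")
```

In this notebook the gold labels would come from `test_df.label`, so the same comparison could be made against the real data.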

In [None]:
len(y_train)

10394

In [None]:
%%time
# Random Forest
from sklearn.ensemble import RandomForestClassifier
#rmfr = RandomForestClassifier(random_state=42)

CPU times: user 57.2 ms, sys: 12.3 ms, total: 69.5 ms
Wall time: 247 ms


### First attempt: random forest with n_estimators=300, min_samples_leaf=3

In [None]:
%%time
rmfr2 = RandomForestClassifier(n_estimators=300, min_samples_leaf=3, random_state=42)
rmfr2.fit(x_train,y_train)

CPU times: user 1min 18s, sys: 18.1 ms, total: 1min 18s
Wall time: 1min 18s


In [None]:
%%time
predrmfr2 = rmfr2.predict(x_test)

CPU times: user 570 ms, sys: 14 ms, total: 584 ms
Wall time: 589 ms


In [None]:
%%time
print("Confusion Matrix for Random Forest Classifier:")
print(confusion_matrix(y_test,predrmfr2))
print("Score:",round(accuracy_score(y_test,predrmfr2)*100,2))
print("Classification Report:")
print(classification_report(y_test,predrmfr2))

Confusion Matrix for Random Forest Classifier:
[[1160  121]
 [ 151 1167]]
Score: 89.53
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.91      0.90      1281
           1       0.91      0.89      0.90      1318

    accuracy                           0.90      2599
   macro avg       0.90      0.90      0.90      2599
weighted avg       0.90      0.90      0.90      2599

CPU times: user 18.2 ms, sys: 0 ns, total: 18.2 ms
Wall time: 29.7 ms


In [None]:
import joblib
# save
joblib.dump(rmfr2, "/content/drive/MyDrive/Yelp/model_99/rf_est300_leaf3.joblib", compress=3)

# load, no need to initialize the loaded_rf
#loaded_rf = joblib.load("/content/drive/MyDrive/Yelp/model_99/rf_est300_leaf3.joblib")

['/content/drive/MyDrive/Yelp/model_99/rf_est300_leaf3.joblib']

## Grid search (approx. 4 hours)

In [None]:
param_grid = {'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'n_estimators': [200, 300, 1000] }
# The parameter grid to explore, as a dictionary mapping estimator parameters to sequences of allowed values.

In [None]:
# TODO: plot train and test accuracy for each parameter setting and look for settings where the two are close to each other or cross
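
One possible way to get the numbers for such a comparison is to ask `GridSearchCV` for training scores as well. The following is only a sketch on synthetic data with a deliberately tiny grid (`X`, `y`, `toy_grid` and `search` are illustrative names, not the notebook's variables):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the sketch runs in seconds (not the Yelp data).
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

toy_grid = {"min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid=toy_grid,
    cv=3,
    return_train_score=True,  # required for mean_train_score in cv_results_
)
search.fit(X, y)

results = search.cv_results_
leaves = [p["min_samples_leaf"] for p in results["params"]]
# Gap between mean training accuracy and mean cross-validated accuracy.
gap = results["mean_train_score"] - results["mean_test_score"]
for leaf, train, cv_acc, g in zip(
    leaves, results["mean_train_score"], results["mean_test_score"], gap
):
    print(f"min_samples_leaf={leaf:2d} train={train:.3f} cv={cv_acc:.3f} gap={g:.3f}")
```

The `mean_train_score` and `mean_test_score` arrays can be passed to matplotlib to draw the two curves. Note that `return_train_score=True` adds the cost of scoring every training fold.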

In [None]:
rmfr = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator = rmfr, param_grid = param_grid, cv = 10)
# Exhaustive search over specified parameter values for an estimator.
# The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

In [None]:
%%time
grid_search.fit(x_train, y_train) # Run fit with all sets of parameters.
#rmfr.fit(x_train,y_train)

CPU times: user 4h 6min 9s, sys: 7.75 s, total: 4h 6min 17s
Wall time: 4h 6min 58s


GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=42,
                                 

In [None]:
best_model = grid_search.best_params_  # Parameter setting that gave the best results on the hold out data.
print(best_model)

{'min_samples_leaf': 1, 'n_estimators': 1000}


In [None]:
random_f = grid_search.best_estimator_ 

In [None]:
grid_search.best_score_ 

0.8932072258828757

In [None]:
import joblib
# save
joblib.dump(random_f, "/content/drive/MyDrive/Yelp/model_99/rf_est1000_leaf1.joblib", compress=3)

['/content/drive/MyDrive/Yelp/model_99/rf_est1000_leaf1.joblib']

In [None]:
%%time
predrmfr = random_f.predict(x_test)
print("Confusion Matrix for Random Forest Classifier:")
print(confusion_matrix(y_test,predrmfr))
print("Score:",round(accuracy_score(y_test,predrmfr)*100,2))
print("Classification Report:")
print(classification_report(y_test,predrmfr))

Confusion Matrix for Random Forest Classifier:
[[1166  115]
 [ 148 1170]]
Score: 89.88
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.91      0.90      1281
           1       0.91      0.89      0.90      1318

    accuracy                           0.90      2599
   macro avg       0.90      0.90      0.90      2599
weighted avg       0.90      0.90      0.90      2599

CPU times: user 3.09 s, sys: 3 ms, total: 3.09 s
Wall time: 3.1 s


## Analyzing feature importance

In [None]:
%%time
# Converting unique_ids into words, in the same order as the columns of x_train / x_test:
feature_names_test = tokenizer.convert_ids_to_tokens(unique_ids.tolist())

In [None]:
importance = rmfr2.feature_importances_

In [None]:
feature_importance = pd.DataFrame({'keys': feature_names_test, 'imp': importance})

In [None]:
feature_importance.head()

| | keys | imp |
|---|---|---|
| 0 | [PAD] | 0.002289 |
| 1 | [UNK] | 1e-05 |
| 2 | [CLS] | 0.0 |
| 3 | [SEP] | 0.0 |
| 4 | a | 4.5e-05 |


In [None]:
# newest one, random_state=42
top10 = list(feature_importance.sort_values(by=['imp'], ascending=False)['keys'][:10])
top10

['great',
 'amazing',
 'delicious',
 'worst',
 'love',
 'asked',
 'horrible',
 'friendly',
 'told',
 'best']

In [None]:
print("Test  Accuracy : %.2f"%rmfr2.score(x_test, y_test))
print("Train Accuracy : %.2f"%rmfr2.score(x_train, y_train))

Test  Accuracy : 0.90
Train Accuracy : 0.95


In [None]:
!pip install treeinterpreter
#https://pypi.org/project/treeinterpreter/

Collecting treeinterpreter
  Downloading https://files.pythonhosted.org/packages/af/19/fa8556093f6b8c7374825118e05cf5a99c71262392382c3642ab1fd8a742/treeinterpreter-0.2.3-py2.py3-none-any.whl
Installing collected packages: treeinterpreter
Successfully installed treeinterpreter-0.2.3


In [None]:
from treeinterpreter import treeinterpreter as ti

In [None]:
%%time
preds, bias, contributions = ti.predict(rmfr2, x_test)

CPU times: user 4min 1s, sys: 5.1 s, total: 4min 6s
Wall time: 4min 6s


In [None]:
preds.shape, bias.shape, contributions.shape

((2599, 2), (2599, 2), (2599, 16368, 2))

In [None]:
y_test[0]

0

In [None]:
# from http://blog.datadive.net/random-forest-interpretation-with-scikit-learn/
# positions of the 20 most important features; they match the column order of x_test
top20_idx = feature_importance.sort_values(by=['imp'], ascending=False).index[:20]
top20 = list(feature_importance['keys'].loc[top20_idx])
print("Prediction", preds[0])
print(np.argmax(preds[0], axis=0).flatten())
print("Bias (trainset prior)", bias[0])
print('')
print("Feature contributions:")
for idx, feature in zip(top20_idx, top20):
  # look the contribution up by column position, not by rank
  print(feature, contributions[0][idx])

Prediction [0.51573443 0.48426557]
[0]
Bias (trainset prior) [0.49902989 0.50097011]

Feature contributions:
great [-0.00218083  0.00218083]
amazing [0. 0.]
delicious [0. 0.]
worst [0. 0.]
love [-1.67814874e-05  1.67814874e-05]
asked [ 0.00265121 -0.00265121]
horrible [0. 0.]
friendly [-7.04680712e-06  7.04680712e-06]
told [0. 0.]
best [ 9.97701573e-06 -9.97701573e-06]
rude [0. 0.]
said [0. 0.]
minutes [0. 0.]
excellent [0. 0.]
terrible [ 1.30336735e-05 -1.30336735e-05]
awesome [0. 0.]
definitely [0. 0.]
poor [0. 0.]
bland [0. 0.]
bad [0. 0.]
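
The contributions of one sample have one row per column of `x_test`, in column order, so the contribution of a feature has to be looked up by its column position and not by its importance rank. A minimal sketch with toy arrays (all names and values here are illustrative):

```python
import numpy as np

names = np.array(["[PAD]", "great", "worst", "told"])  # column order
importances = np.array([0.01, 0.40, 0.35, 0.24])
# One (class 0, class 1) contribution pair per column, for a single sample.
sample_contrib = np.array(
    [[0.00, 0.00], [-0.20, 0.20], [0.10, -0.10], [0.05, -0.05]]
)

# Column positions sorted from most to least important.
order = np.argsort(importances)[::-1]
# Pair each top feature with the contribution stored at its own position.
top2 = [(str(names[i]), sample_contrib[i].tolist()) for i in order[:2]]
print(top2)
```

Zipping the first rows of the contribution array with a list of names sorted by importance would instead pair each name with the contribution of an unrelated column.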


In [None]:
test_df.text[0]

'positives good location staff reasonably friendly negatives dry manis pedis advertise website update site base coat coat blobbed attention coat cover color nail nails filed totally random lenths shapes shorter end fingers longer pedi woman wasn paying attention ended filing skin bled cute tea menu offer definitely expensive getting answer phone pedi station isn comfortable slightly padded bench notes staff looks surprised sure come ask work'

In [None]:
test_df.columns

Index(['text', 'label', 'categories'], dtype='object')

In [None]:
test_df.label[0]

1

In [None]:
print("Bias For Sample 0                        : %s"%bias[0])
print("Contributions For Sample 0               : %s"%contributions[0])
# sum over the feature axis so that one value per class remains
print("Prediction Based on Bias & Contributions : %s"%(bias[0] + contributions[0].sum(axis=0)))
print("Actual Target Value                      : %s"%y_test[0])
print("Target Value As Per Treeinterpreter      : %s"%preds[0][0])

Bias For Sample 0                        : [0.50452564 0.49547436]
Contributions For Sample 0               : [[ 0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00]
 [ 4.13365671e-06 -4.13365671e-06]
 ...
 [ 0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00]]
Prediction Based on Bias & Contributions : [0.50452564 0.49547436]
Actual Target Value                      : 1
Target Value As Per Treeinterpreter      : 0.4116563342349329
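
treeinterpreter decomposes each prediction into the bias plus the per-feature contributions, summed over the feature axis. A sketch with toy arrays shaped like the ones for a single sample (the values are made up, and `contrib`, `reconstructed` and `collapsed` are illustrative names):

```python
import numpy as np

bias = np.array([0.5, 0.5])         # class prior for one sample
contrib = np.array([[-0.10, 0.10],  # shape (n_features, n_classes)
                    [0.03, -0.03],
                    [0.00, 0.00]])

# Summing over features (axis 0) keeps one value per class.
reconstructed = bias + contrib.sum(axis=0)
# Summing every element instead gives a scalar that is 0 here, because the
# two class columns cancel, so the result is just the bias again.
collapsed = bias + contrib.sum()
print(reconstructed, collapsed)
```

For the full arrays returned by `ti.predict`, the equivalent check is `bias + contributions.sum(axis=1)`, which should match `preds` up to floating-point error.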


In [None]:
# Build a table of per-feature contributions for one sample: the bias is the first
# row ("Base") and the column sums are appended as the last row ("Prediction").
# Relies on the global `bias` returned by ti.predict.
def create_contrbutions_df(contributions, random_sample, feature_names):
    contribs = contributions[random_sample].tolist()
    contribs.insert(0, bias[random_sample])
    contribs = np.array(contribs)
    contrib_df = pd.DataFrame(data=contribs, index=["Base"] + feature_names, columns=["Contributions_0", "Contributions_1"])
    prediction = contrib_df[["Contributions_0", "Contributions_1"]].sum()
    contrib_df.loc["Prediction"] = prediction
    return contrib_df

In [None]:
import random
# randrange excludes the upper bound, so the index is always valid (0 .. len(x_test) - 1)
random_sample = random.randrange(len(x_test))
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %s"%y_test[random_sample])
print("Predicted Value     : %s"%np.argmax(preds[random_sample]))

Selected Sample     : 73
Actual Target Value : 0
Predicted Value     : 1


In [None]:
print("Prediction", preds[73])

Prediction [0.35911367 0.64088633]


In [None]:
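# 0 negative, 1 positive
# contributions follow the column order of x_test, so look them up by column position
top20_idx = feature_importance.sort_values(by=['imp'], ascending=False).index[:20]
for idx in top20_idx:
  print(feature_importance['keys'].loc[idx], contributions[73][idx])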

great [0. 0.]
not [0. 0.]
delicious [ 1.66684837e-07 -1.66684837e-07]
worst [-0.000156  0.000156]
no [-0.0044492  0.0044492]
to [-0.0034821  0.0034821]
excellent [-2.60448061e-05  2.60448061e-05]
then [ 0.00065689 -0.00065689]
horrible [-0.00024421  0.00024421]
was [-0.00241072  0.00241072]
good [-0.01901613  0.01901613]
when [-0.00904529  0.00904529]
asked [-0.00430954  0.00430954]
ordered [-0.0040156  0.0040156]
t [ 0.00080385 -0.00080385]
order [-0.00598097  0.00598097]
don [-0.0004334  0.0004334]
after [ 0.00104962 -0.00104962]
love [ 0.00656338 -0.00656338]
friendly [-0.00126679  0.00126679]


In [None]:
# from: https://coderzcolumn.com/tutorials/machine-learning/treeinterpreter-interpreting-tree-based-models-prediction-of-individual-sample
contrib_df = create_contrbutions_df(contributions, random_sample, feature_names_test)
contrib_df


| | Contributions_0 | Contributions_1 |
|---|---|---|
| Base | 5.045256e-01 | 4.954744e-01 |
| [CLS] | 0.000000e+00 | 0.000000e+00 |
| [SEP] | 0.000000e+00 | 0.000000e+00 |
| a | 1.666848e-07 | -1.666848e-07 |
| d | -1.559995e-04 | 1.559995e-04 |
| ... | ... | ... |
| int | 0.000000e+00 | 0.000000e+00 |
| indifferent | 0.000000e+00 | 0.000000e+00 |
| amplified | 0.000000e+00 | 0.000000e+00 |
| vowed | 0.000000e+00 | 0.000000e+00 |


In [None]:
contrib_df.sort_values(by=['Contributions_0'], ascending=False)[:20]

| | Contributions_0 | Contributions_1 |
|---|---|---|
| Base | 0.504526 | 0.495474 |
| Prediction | 0.359114 | 0.640886 |
| got | 0.014659 | -0.014659 |
| great | 0.014593 | -0.014593 |
| under | 0.011655 | -0.011655 |
| how | 0.011444 | -0.011444 |
| things | 0.009973 | -0.009973 |
| only | 0.007946 | -0.007946 |
| ##s | 0.006563 | -0.006563 |
| ##co | 0.005638 | -0.005638 |


In [None]:
test_df.text[73]

'i walked out after being ignored a friend of mine said the place was good but he is indian apparently a single white female doesn t rate to even get a glass of water when i was seated the host informed me it was a vegetarian restaurant ohkay i said fine with me and i smiled he turned on his heel as if it was a huge imposition to seat me i looked over the menu and waited for quite some time i closed the menu thinking that might bring the waiter i was disappointed i walked out and went to passage to india where i enjoyed their weekend buffet if you are a large group you might get some service i have no ideal how the food is but if you want to eat someone does have to serve you first i recommend going to any of the other dozen indian restaurants in the area instead '

In [None]:
test_df.label[73]

0

In [None]:
test_df.summa