<a href="https://colab.research.google.com/github/lupis30puc/BERT_interpretation_with_RF/blob/main/RF_input_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Yelp dataset on kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset)


Tutorial this notebook builds on:
[Sentiment Analysis Yelp with Random Forest](https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset)


The goal is to mimic the results of the previously trained BertForSequenceClassification model, so I will use the token ids and predicted labels obtained from it.
First, I will use the token ids to create binary dataframes.
Then I will use these dataframes as the x values and pair them with the predicted labels as the y values.
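
To make the idea concrete, here is a minimal sketch on made-up data. The token ids and labels below are toy values, not taken from the Yelp sample, and the names `toy_ids`, `toy_pred`, `toy_x` and `toy_y` are only illustrative. Each row is a review, each column is a token id, and a cell holds 1 if that id appears in the review.

```python
import numpy as np
import pandas as pd

# Toy input: three "reviews" as padded lists of token ids (made-up values).
toy_ids = [
    [101, 2204, 2833, 102, 0, 0],
    [101, 2919, 2833, 102, 0, 0],
    [101, 2204, 2204, 2326, 102, 0],
]
# Toy predicted labels, one per review (hypothetical stand-ins for BERT's output).
toy_pred = [1, 0, 1]

# Columns: every id that occurs anywhere; rows: one per review.
toy_vocab = np.unique(np.concatenate(toy_ids))
toy_x = pd.DataFrame(0, index=range(len(toy_ids)), columns=toy_vocab)
for row, ids in enumerate(toy_ids):
    toy_x.loc[row, np.unique(ids)] = 1  # presence only, repeated ids are not counted

toy_y = pd.Series(toy_pred, name="predicted_label")
print(toy_x)
```

Note that the matrix records presence, not frequency: the id 2204 occurs twice in the third toy review but still yields a single 1.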

## Set Up

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=dc0e3eb89f0

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import string
import math
import time
import pickle

## Loading the ids and predicted labels from BERT
I want to create a boolean matrix from the token ids (input ids) of each dataset.

In [None]:
with open('/content/drive/MyDrive/Yelp/model_128_/train_ids_128.pkl', 'rb') as f:
    train_ids = pickle.load(f)

with open('/content/drive/MyDrive/Yelp/model_128_/test_ids_128.pkl', 'rb') as d:
    test_ids = pickle.load(d)

In [None]:
len(train_ids), len(test_ids), len(test_ids[0]) # 128 is the max sequence length used for the BERT model

(10394, 2599, 128)

In [None]:
test_ids[0][:10]

[101, 3893, 2015, 2204, 3295, 3095, 16286, 5379, 4997, 2015]

## Creation of binary dataframes

I will identify the unique ids that appear in the train and test datasets. At the same time, I want to keep a record of which ids appear in each review.

In [None]:
%%time
# get a dict with the appearance of ids in each review,
# and a list of all the unique train ids
isin_ids_tr = {i: np.unique(train_ids[i]) for i in range(len(train_ids))}
all_uni_ids_tr = np.concatenate(list(isin_ids_tr.values()), axis=0)
unique_ids_train = np.unique(all_uni_ids_tr)
print('the train appearance dictionary has a length of ' + str(len(isin_ids_tr)))
print('number of unique ids on the train set: ' + str(len(unique_ids_train)))
print('number of repeated ids: ' + str(len(all_uni_ids_tr) - len(unique_ids_train)))

the train appearance dictionary has a length of 10394
number of unique ids on the train set: 15784
number of repeated ids: 459936
CPU times: user 256 ms, sys: 3.65 ms, total: 260 ms
Wall time: 272 ms


In [None]:
%%time
# get a dict with the appearance of ids in each review,
# and a list of all the unique test ids
isin_ids_ts = {i: np.unique(test_ids[i]) for i in range(len(test_ids))}
all_uni_ids_ts = np.concatenate(list(isin_ids_ts.values()), axis=0)
unique_ids_test = np.unique(all_uni_ids_ts)
print('the test appearance dictionary has a length of ' + str(len(isin_ids_ts)))
print('number of unique ids on the test set: ' + str(len(unique_ids_test)))
print('number of repeated ids: ' + str(len(all_uni_ids_ts) - len(unique_ids_test)))

the test appearance dictionary has a length of 2599
number of unique ids on the test set: 10386
number of repeated ids: 107989
CPU times: user 65.5 ms, sys: 101 µs, total: 65.6 ms
Wall time: 70.3 ms


In [None]:
%%time
# join the train and test unique-id lists,
# then drop the ids that appear in both
all_unique_ids = np.concatenate((unique_ids_train, unique_ids_test), axis=0)
unique_ids = np.unique(all_unique_ids)
print('number of final unique ids: ' + str(len(unique_ids)))

number of final unique ids: 16563
CPU times: user 2.71 ms, sys: 0 ns, total: 2.71 ms
Wall time: 7.55 ms
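
As a side note, the concatenate-then-unique step can be written as a single set union. The sketch below uses made-up arrays `a` and `b` as stand-ins for `unique_ids_train` and `unique_ids_test`. It also shows how to list the ids that occur only in the test set; those ids end up as all-zero columns in the train dataframe.

```python
import numpy as np

# Toy stand-ins for unique_ids_train and unique_ids_test (made-up values).
a = np.array([0, 101, 102, 2204])
b = np.array([0, 101, 102, 2919])

# Two-step version, as in the cell above.
two_step = np.unique(np.concatenate((a, b), axis=0))
# One-step version: sorted union of both arrays.
one_step = np.union1d(a, b)

# ids seen in the test set but never in the train set
only_in_b = np.setdiff1d(b, a)
print(one_step, only_in_b)
```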


In [None]:
# saving the dictionaries of appearance
with open('/content/drive/MyDrive/Yelp/model_128_/isin_ids_tr.pkl', 'wb') as d:
  pickle.dump(isin_ids_tr, d)

with open('/content/drive/MyDrive/Yelp/model_128_/isin_ids_ts.pkl', 'wb') as d:
  pickle.dump(isin_ids_ts, d)

### Saving feature names

In [None]:
# launching the saved model tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('/content/drive/MyDrive/Yelp/model_128_/tokenizer_128')

In [None]:
tokenizer.vocab_size

30522

In [None]:
check = tokenizer.convert_ids_to_tokens(unique_ids[26:35].tolist())
check

['y', 'z', 'the', 'of', 'and', 'in', 'to', 'was', 'he']

In [None]:
# Converting unique_ids into words (tokens):
feature_names = tokenizer.convert_ids_to_tokens(unique_ids.tolist())

In [None]:
# checking that they have the same length
print(len(unique_ids), len(feature_names))

16563 16563
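
The id-to-word conversion is a lookup in the inverse of the tokenizer vocabulary. The sketch below illustrates it with a hypothetical five-entry vocabulary (`toy_vocab`) in the same token-to-id layout as `tokenizer.vocab`. The entries and ids are made up for illustration. Inverting the mapping once keeps every lookup cheap and keeps the names aligned with the ids.

```python
# Hypothetical miniature vocabulary, token -> id (made-up entries).
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "good": 3, "##ly": 4}
toy_unique_ids = [0, 1, 2, 4]

# Invert once, then look up each id directly.
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}
toy_feature_names = [id_to_token[i] for i in toy_unique_ids]

# Feature names must stay aligned one-to-one with the ids.
assert len(toy_feature_names) == len(toy_unique_ids)
print(toy_feature_names)
```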


In [None]:
# saving the unique_ids converted into words; it will be useful for the feature analysis
with open('/content/drive/MyDrive/Yelp/model_128_/feature_names_feb_03.pkl', 'wb') as d:
  pickle.dump(feature_names, d)

In [None]:
# saving the unique_ids themselves; they are the column labels used while filling in the dataframes
with open('/content/drive/MyDrive/Yelp/model_128_/unique_ids_feb_03.pkl', 'wb') as d:
  pickle.dump(unique_ids, d)

## Finalizing the dataframes

In [None]:
# a function to fill in the dataframes in a boolean way
def is_word_in(isin_dict, df):
  """
  Helper that fills in a dataframe based on the appearance of a word/token id in a set of documents.
  It needs:
    a dictionary with review indexes as keys and arrays of words/token ids as values;
    and a dataframe with the review indexes as rows and the words/token ids as columns.

  For every key (review) in the dictionary, it writes a 1 in the columns
  of all the ids that appear in that review. The dataframe is modified in place.
  """
  for key, token_ids in isin_dict.items():
    # one .loc assignment per review instead of chained indexing (df[id][key] = 1),
    # which can end up writing to a temporary copy
    df.loc[key, token_ids] = 1
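
The same fill can also be done on a plain NumPy array and wrapped in a dataframe afterwards. This is only a sketch under stated assumptions, not the method used in this notebook: the dictionary keys are assumed to be the row positions 0..n-1, the id array is assumed to be sorted (as `np.unique` returns it), and `toy_isin` / `toy_unique` are made-up stand-ins for `isin_ids_tr` and `unique_ids`.

```python
import numpy as np
import pandas as pd

# Toy inputs in the same layout as isin_ids_tr and unique_ids (made-up values).
toy_isin = {0: np.array([0, 101, 2204]), 1: np.array([101, 2919])}
toy_unique = np.array([0, 101, 2204, 2919])  # sorted, as returned by np.unique

# Allocate the zero matrix directly with a small integer dtype.
matrix = np.zeros((len(toy_isin), len(toy_unique)), dtype=np.uint8)
for row, ids in toy_isin.items():
    # searchsorted maps each id to its column position in the sorted id array
    matrix[row, np.searchsorted(toy_unique, ids)] = 1

toy_df = pd.DataFrame(matrix, index=list(toy_isin), columns=toy_unique)
print(toy_df)
```

Working on the array avoids per-cell dataframe indexing, and allocating the zeros up front removes the need for a separate `fillna` step. I have not timed this variant on the Yelp sample.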

In [None]:
import pickle
with open('/content/drive/MyDrive/Yelp/model_128_/feature_names_feb_03.pkl', 'rb') as f:
  feature_names = pickle.load(f)

In [None]:
with open('/content/drive/MyDrive/Yelp/model_128_/unique_ids_feb_03.pkl', 'rb') as f:
  unique_ids = pickle.load(f)

In [None]:
%%time
# dataframe whose index is the review index and whose columns are the unique ids/words in the reviews
x_train = pd.DataFrame(index=range(len(train_ids)), columns=unique_ids)
x_train.fillna(0, inplace=True) # fill with 0, meaning the id does not appear in that review

CPU times: user 50.8 s, sys: 826 ms, total: 51.6 s
Wall time: 51.6 s


In [None]:
%%time
is_word_in(isin_ids_tr, x_train)

CPU times: user 27.6 s, sys: 605 ms, total: 28.2 s
Wall time: 28.2 s


In [None]:
x_train.columns = feature_names
x_train.head()

Unnamed: 0,[PAD],[CLS],[SEP],a,b,c,d,e,f,g,h,i,j,k,l,n,o,p,q,r,s,t,u,v,w,x,y,z,the,of,and,in,to,was,he,is,as,for,on,with,...,polka,starbucks,adamant,inspecting,##ducted,##pone,##roids,##ppet,##lib,colossal,foreigner,vet,freaks,rosewood,upstate,ks,vo,##tzer,##werk,binoculars,enthusiast,squeak,inflated,bonuses,##rco,penitentiary,##etched,##lster,##nsor,##toy,inhuman,tbs,inspections,disgrace,infused,pudding,stalks,leases,##wil,thyroid
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
%%time
x_train.describe()

CPU times: user 30.8 s, sys: 133 ms, total: 30.9 s
Wall time: 31 s


Unnamed: 0,[PAD],[CLS],[SEP],a,b,c,d,e,f,g,h,i,j,k,l,n,o,p,q,r,s,t,u,v,w,x,y,z,the,of,and,in,to,was,he,is,as,for,on,with,...,polka,starbucks,adamant,inspecting,##ducted,##pone,##roids,##ppet,##lib,colossal,foreigner,vet,freaks,rosewood,upstate,ks,vo,##tzer,##werk,binoculars,enthusiast,squeak,inflated,bonuses,##rco,penitentiary,##etched,##lster,##nsor,##toy,inhuman,tbs,inspections,disgrace,infused,pudding,stalks,leases,##wil,thyroid
count,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,...,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0
mean,0.921301,1.0,1.0,0.004329,0.011353,9.6e-05,0.001636,0.003271,0.006542,0.018568,0.008947,0.002117,0.000673,0.004233,0.002213,0.000481,0.004618,0.001443,0.00077,0.00077,0.007697,0.004426,0.004618,0.000866,0.009044,0.001347,0.002213,0.003848,0.000673,0.000481,0.000481,0.011256,0.012603,0.001924,0.005003,0.00279,0.004907,0.001347,0.002694,0.000289,...,0.000192,0.00635,0.000192,0.000192,0.000385,9.6e-05,0.000192,0.001058,9.6e-05,0.000192,9.6e-05,0.00279,0.000192,9.6e-05,0.0,9.6e-05,0.003945,9.6e-05,0.0,9.6e-05,0.0,0.000481,0.000192,9.6e-05,9.6e-05,9.6e-05,0.000192,9.6e-05,9.6e-05,0.0,9.6e-05,9.6e-05,0.000577,0.000577,0.000962,0.002982,9.6e-05,0.000192,0.000289,9.6e-05
std,0.269282,0.0,0.0,0.065659,0.105948,0.009809,0.040411,0.057103,0.080623,0.135001,0.094171,0.04596,0.025944,0.064928,0.046991,0.021929,0.067802,0.037963,0.027734,0.027734,0.087397,0.066381,0.067802,0.029415,0.094672,0.036678,0.046991,0.061919,0.025944,0.021929,0.021929,0.105503,0.111561,0.043825,0.070557,0.05275,0.069879,0.036678,0.051835,0.016987,...,0.013871,0.079436,0.013871,0.013871,0.019614,0.009809,0.013871,0.032516,0.009809,0.013871,0.009809,0.05275,0.013871,0.009809,0.0,0.009809,0.062685,0.009809,0.0,0.009809,0.0,0.021929,0.013871,0.009809,0.009809,0.009809,0.013871,0.009809,0.009809,0.0,0.009809,0.009809,0.02402,0.02402,0.031004,0.054533,0.009809,0.013871,0.016987,0.009809
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
%%time
# dataframe whose index is the review index and whose columns are the unique words/ids in the reviews
x_test = pd.DataFrame(index=range(len(test_ids)), columns=unique_ids)
x_test.fillna(0, inplace=True)

CPU times: user 14.3 s, sys: 22.7 ms, total: 14.3 s
Wall time: 14.3 s


In [None]:
%%time
is_word_in(isin_ids_ts, x_test)

CPU times: user 7.52 s, sys: 316 ms, total: 7.84 s
Wall time: 7.84 s


In [None]:
x_test.columns = feature_names
x_test.head()

Unnamed: 0,[PAD],[CLS],[SEP],a,b,c,d,e,f,g,h,i,j,k,l,n,o,p,q,r,s,t,u,v,w,x,y,z,the,of,and,in,to,was,he,is,as,for,on,with,...,polka,starbucks,adamant,inspecting,##ducted,##pone,##roids,##ppet,##lib,colossal,foreigner,vet,freaks,rosewood,upstate,ks,vo,##tzer,##werk,binoculars,enthusiast,squeak,inflated,bonuses,##rco,penitentiary,##etched,##lster,##nsor,##toy,inhuman,tbs,inspections,disgrace,infused,pudding,stalks,leases,##wil,thyroid
0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
%%time
x_test.describe()

CPU times: user 28.1 s, sys: 130 ms, total: 28.2 s
Wall time: 28.2 s


Unnamed: 0,[PAD],[CLS],[SEP],a,b,c,d,e,f,g,h,i,j,k,l,n,o,p,q,r,s,t,u,v,w,x,y,z,the,of,and,in,to,was,he,is,as,for,on,with,...,polka,starbucks,adamant,inspecting,##ducted,##pone,##roids,##ppet,##lib,colossal,foreigner,vet,freaks,rosewood,upstate,ks,vo,##tzer,##werk,binoculars,enthusiast,squeak,inflated,bonuses,##rco,penitentiary,##etched,##lster,##nsor,##toy,inhuman,tbs,inspections,disgrace,infused,pudding,stalks,leases,##wil,thyroid
count,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,...,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0
mean,0.91843,1.0,1.0,0.003848,0.010004,0.0,0.001154,0.005387,0.006541,0.01616,0.00808,0.002309,0.0,0.006541,0.003463,0.0,0.005771,0.000385,0.000385,0.00077,0.00808,0.004232,0.005387,0.000385,0.011158,0.001539,0.001539,0.003078,0.001539,0.000385,0.000385,0.009234,0.013082,0.003463,0.003463,0.003848,0.006156,0.001154,0.00077,0.000385,...,0.0,0.005002,0.0,0.0,0.0,0.000385,0.00077,0.000385,0.0,0.00077,0.0,0.00077,0.0,0.0,0.000385,0.0,0.003463,0.0,0.000385,0.000385,0.000385,0.0,0.000385,0.0,0.00077,0.0,0.0,0.0,0.0,0.000385,0.0,0.0,0.000385,0.000385,0.001154,0.003078,0.0,0.0,0.0,0.0
std,0.273761,0.0,0.0,0.061922,0.099537,0.0,0.033962,0.07321,0.080627,0.126115,0.089542,0.048001,0.0,0.080627,0.058755,0.0,0.075765,0.019615,0.019615,0.027735,0.089542,0.064932,0.07321,0.019615,0.105061,0.039208,0.039208,0.055406,0.039208,0.019615,0.019615,0.095669,0.113648,0.058755,0.058755,0.061922,0.078235,0.033962,0.027735,0.019615,...,0.0,0.070561,0.0,0.0,0.0,0.019615,0.027735,0.019615,0.0,0.027735,0.0,0.027735,0.0,0.0,0.019615,0.0,0.058755,0.0,0.019615,0.019615,0.019615,0.0,0.019615,0.0,0.027735,0.0,0.0,0.0,0.0,0.019615,0.0,0.0,0.019615,0.019615,0.033962,0.055406,0.0,0.0,0.0,0.0
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0


In [None]:
# saving the x values dataframes
x_train.to_pickle('/content/drive/MyDrive/Yelp/model_128_/binary_tr_words')
x_test.to_pickle('/content/drive/MyDrive/Yelp/model_128_/binary_ts_words')

## Creating a dataframe with all reviews

In [None]:
# Loading the train and test datasets obtained in the preprocessing notebook
train_df = pd.read_pickle('/content/drive/MyDrive/Yelp/sample_train_10394.pkl')
test_df = pd.read_pickle('/content/drive/MyDrive/Yelp/sample_test_2599.pkl')

In [None]:
ts_dict = dict(zip(list(test_df.index), list(isin_ids_ts.values()))) 
tr_dict = dict(zip(list(train_df.index), list(isin_ids_tr.values()))) 
final_dict = {**tr_dict, **ts_dict}
len(final_dict.keys())

12993
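
Merging with `{**a, **b}` silently keeps the second value when a key occurs in both dictionaries. The length above (12993 = 10394 + 2599) suggests the train and test indexes do not overlap. A minimal sketch of an explicit check, using made-up dictionaries `toy_tr` and `toy_ts` as stand-ins for `tr_dict` and `ts_dict`:

```python
# Toy stand-ins for tr_dict and ts_dict, keyed by review index (made-up values).
toy_tr = {0: [101, 2204], 2: [101, 2919]}
toy_ts = {1: [101, 2833], 3: [101, 2326]}

# Any shared key would be overwritten by the merge, so fail loudly instead.
overlap = toy_tr.keys() & toy_ts.keys()
assert not overlap, f"indexes present in both splits: {sorted(overlap)}"

toy_final = {**toy_tr, **toy_ts}
print(len(toy_final))
```

The `zip` in the cell above also assumes that the rows of `train_df` and `test_df` are in the same order as the id lists the dictionaries were built from.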

In [None]:
%%time
# dataframe whose index is the review index and whose columns are the unique words/ids in the reviews
x_sample = pd.DataFrame(index=range(len(final_dict)), columns=unique_ids)
x_sample.fillna(0, inplace=True)

CPU times: user 1min 6s, sys: 3.46 s, total: 1min 9s
Wall time: 1min 9s


In [None]:
x_sample.head()

Unnamed: 0,0,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,...,29499,29500,29502,29508,29510,29513,29514,29519,29521,29523,29524,29525,29526,29528,29530,29535,29536,29546,29548,29549,29550,29552,29561,29563,29566,29569,29574,29576,29577,29578,29582,29584,29589,29591,29592,29593,29594,29597,29602,29610
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
%%time
is_word_in(final_dict, x_sample)

CPU times: user 34.9 s, sys: 2.92 s, total: 37.8 s
Wall time: 37.8 s


In [None]:
x_sample.columns = feature_names
x_sample.describe()

Unnamed: 0,[PAD],[CLS],[SEP],a,b,c,d,e,f,g,h,i,j,k,l,n,o,p,q,r,s,t,u,v,w,x,y,z,the,of,and,in,to,was,he,is,as,for,on,with,...,polka,starbucks,adamant,inspecting,##ducted,##pone,##roids,##ppet,##lib,colossal,foreigner,vet,freaks,rosewood,upstate,ks,vo,##tzer,##werk,binoculars,enthusiast,squeak,inflated,bonuses,##rco,penitentiary,##etched,##lster,##nsor,##toy,inhuman,tbs,inspections,disgrace,infused,pudding,stalks,leases,##wil,thyroid
count,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,...,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0,12993.0
mean,0.920727,1.0,1.0,0.004233,0.011083,7.7e-05,0.001539,0.003694,0.006542,0.018087,0.008774,0.002155,0.000539,0.004695,0.002463,0.000385,0.004849,0.001231,0.000693,0.00077,0.007773,0.004387,0.004772,0.00077,0.009467,0.001385,0.002078,0.003694,0.000847,0.000462,0.000462,0.010852,0.012699,0.002232,0.004695,0.003002,0.005157,0.001308,0.002309,0.000308,...,0.000154,0.00608,0.000154,0.000154,0.000308,0.000154,0.000308,0.000924,7.7e-05,0.000308,7.7e-05,0.002386,0.000154,7.7e-05,7.7e-05,7.7e-05,0.003848,7.7e-05,7.7e-05,0.000154,7.7e-05,0.000385,0.000231,7.7e-05,0.000231,7.7e-05,0.000154,7.7e-05,7.7e-05,7.7e-05,7.7e-05,7.7e-05,0.000539,0.000539,0.001001,0.003002,7.7e-05,0.000154,0.000231,7.7e-05
std,0.270175,0.0,0.0,0.064927,0.104694,0.008773,0.039205,0.060671,0.080621,0.13327,0.093261,0.046374,0.023206,0.06836,0.049568,0.019614,0.069467,0.035072,0.026311,0.027733,0.087827,0.066091,0.068916,0.027733,0.096839,0.037196,0.04554,0.060671,0.029085,0.021485,0.021485,0.10361,0.111977,0.047193,0.06836,0.054707,0.071627,0.036149,0.047998,0.017544,...,0.012406,0.077741,0.012406,0.012406,0.017544,0.012406,0.017544,0.030377,0.008773,0.017544,0.008773,0.048789,0.012406,0.008773,0.008773,0.008773,0.061917,0.008773,0.008773,0.012406,0.008773,0.019614,0.015194,0.008773,0.015194,0.008773,0.012406,0.008773,0.008773,0.008773,0.008773,0.008773,0.023206,0.023206,0.031617,0.054707,0.008773,0.012406,0.015194,0.008773
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
# saving the x values dataframes
x_sample.to_pickle('/content/drive/MyDrive/Yelp/model_128_/binary_all')