<a href="https://colab.research.google.com/github/lupis30puc/yelp_bert_random_forest/blob/update-6/RF_input_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Yelp Polarity on Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset)

The data consist of 12,993 samples from the Yelp Dataset Challenge 2020, 
split into train and test subsets 
of 10,394 and 2,599 samples respectively.


Tutorial this notebook is based on: 
[Sentiment Analysis Yelp with Random Forest](https://www.kaggle.com/omkarsabnis/sentiment-analysis-on-the-yelp-reviews-dataset)


The goal is to mimic the results of the previously obtained BertForSequenceClassification model, so I will reuse its token ids and predicted labels.
First, I will use the token ids to create binary dataframes. 
Then I will take these dataframes as the x values and pair them with the predicted labels, which serve as the y values.
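
As a rough picture of the target format, here is a minimal sketch on made-up data. The names `toy_ids` and `toy_labels` and all their values are hypothetical and only stand in for the real token ids and BERT predictions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three "reviews" as lists of token ids,
# plus made-up labels standing in for the BERT predictions.
toy_ids = [
    [101, 2204, 2833, 102, 0],
    [101, 2919, 2326, 102, 0],
    [101, 2204, 2326, 102, 0],
]
toy_labels = [1, 0, 1]

# Columns are the sorted unique ids over all reviews.
vocab = np.unique(np.concatenate(toy_ids))

# One row per review: 1 if the id occurs in the review, 0 otherwise.
x = pd.DataFrame(
    [np.isin(vocab, review).astype(int) for review in toy_ids],
    columns=vocab,
)
y = pd.Series(toy_labels, name="predicted_label")
print(x)
```

Each row of `x` lines up by position with one entry of `y`, which is the pairing a random forest would be fitted on.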

## Set Up

In [None]:
#!pip install transformers

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import string
import math
import time
import pickle

## Loading the ids and predicted labels from BERT
I want to create a boolean matrix from each dataset's token ids (input ids).

In [3]:
with open('/content/drive/MyDrive/Yelp/model_128_/train_ids_128.pkl', 'rb') as f:
    train_ids = pickle.load(f)

with open('/content/drive/MyDrive/Yelp/model_128_/test_ids_128.pkl', 'rb') as d:
    test_ids = pickle.load(d)

In [4]:
len(train_ids)

10394

In [5]:
len(test_ids)

2599

In [None]:
len(test_ids[0])  # the max sequence length used for the BERT model

99

In [None]:
test_ids[0][:10]

[101, 3893, 2015, 2204, 3295, 3095, 16286, 5379, 4997, 2015]

## Creation of boolean dataframes

I will identify the unique ids that appear in the train and test datasets. At the same time, I want to keep a record of which ids appear in each review. 
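
The record described here maps a review index to the ids it contains. A minimal sketch on hypothetical toy data shows that mapping, together with an inverted view (token id to review indexes) that is not part of this notebook and is added only for illustration.

```python
from collections import defaultdict

import numpy as np

# Hypothetical toy reviews, padded with 0 like the real input ids may be.
toy_ids = [
    [101, 2204, 2204, 102, 0, 0],
    [101, 2919, 102, 0, 0, 0],
]

# Review index -> sorted unique ids found in that review.
isin_toy = {i: np.unique(ids) for i, ids in enumerate(toy_ids)}

# Inverted view (illustrative addition): token id -> indexes of the reviews containing it.
reviews_with_id = defaultdict(list)
for review_idx, ids in isin_toy.items():
    for token_id in ids:
        reviews_with_id[int(token_id)].append(review_idx)

print(isin_toy[0])
print(reviews_with_id[101])
```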

In [6]:
%%time
# get a dict with the unique ids that appear in each review,
# and an array of all the unique train ids
isin_ids_tr = {i: np.unique(ids) for i, ids in enumerate(train_ids)}
all_uni_ids_tr = np.concatenate(list(isin_ids_tr.values()), axis=0)
unique_ids_train = np.unique(all_uni_ids_tr)

CPU times: user 264 ms, sys: 11.2 ms, total: 275 ms
Wall time: 287 ms


In [7]:
print('the train appearance dictionary has a length of ' + str(len(isin_ids_tr)))
print('number of unique ids on the train set: ' + str(len(unique_ids_train)))
print('number of repeated ids: ' + str(len(all_uni_ids_tr) - len(unique_ids_train)))

the train appearance dictionary has a length of 10394
number of unique ids on the train set: 15784
number of repeated ids: 459936


In [8]:
%%time
# get a dict with the unique ids that appear in each review,
# and an array of all the unique test ids
isin_ids_ts = {i: np.unique(ids) for i, ids in enumerate(test_ids)}
all_uni_ids_ts = np.concatenate(list(isin_ids_ts.values()), axis=0)
unique_ids_test = np.unique(all_uni_ids_ts)

CPU times: user 68.4 ms, sys: 0 ns, total: 68.4 ms
Wall time: 75.4 ms


In [9]:
print('the test appearance dictionary has a length of ' + str(len(isin_ids_ts)))
print('number of unique ids on the test set: ' + str(len(unique_ids_test)))
print('number of repeated ids: ' + str(len(all_uni_ids_ts) - len(unique_ids_test)))

the test appearance dictionary has a length of 2599
number of unique ids on the test set: 10386
number of repeated ids: 107989


In [10]:
%%time
# now I join the two arrays of unique ids
# and then drop the ids that appear in both
all_unique_ids = np.concatenate((unique_ids_train, unique_ids_test), axis=0)
unique_ids = np.unique(all_unique_ids)

CPU times: user 1.73 ms, sys: 0 ns, total: 1.73 ms
Wall time: 1.86 ms


In [11]:
print('number of final unique ids: ' + str(len(unique_ids)))

number of final unique ids: 16563
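
The counts printed so far also tell how much the two vocabularies overlap. The sketch below uses hypothetical toy arrays (`toy_train`, `toy_test`) to show NumPy's set helpers, then applies inclusion-exclusion to the three counts reported above.

```python
import numpy as np

# Toy stand-ins for unique_ids_train and unique_ids_test (hypothetical values).
toy_train = np.array([0, 101, 102, 2204, 2833])
toy_test = np.array([0, 101, 102, 2204, 2919])

union = np.union1d(toy_train, toy_test)        # same result as concatenate + unique
shared = np.intersect1d(toy_train, toy_test)   # ids seen in both subsets
test_only = np.setdiff1d(toy_test, toy_train)  # ids never seen in train

# Inclusion-exclusion on the reported counts:
# |train and test| = |train| + |test| - |train or test|
shared_count = 15784 + 10386 - 16563
test_only_count = 10386 - shared_count

print(len(union), len(shared), len(test_only))
print(shared_count, test_only_count)
```

By this arithmetic, 9,607 ids are shared and 779 ids occur only in the test set, so their columns are all zeros in the train dataframe.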


### Checking if symbols and punctuation from ids were removed properly and saving feature names

In [None]:
# loading the saved model tokenizer
#from transformers import BertTokenizer
#tokenizer = BertTokenizer.from_pretrained('/content/drive/MyDrive/Yelp/model_99_/tokenizer_99')

In [None]:
#tokenizer.vocab_size

30522

In [None]:
#check = [list(tokenizer.vocab.keys())[id] for id in unique_ids[26:35]]
#check

['y', 'z', 'the', 'of', 'and', 'in', 'to', 'was', 'he']

In [None]:
# Converting unique_ids into words:
#feature_names = [list(tokenizer.vocab.keys())[id] for id in unique_ids]

In [None]:
# checking that they have the same length
#print(len(unique_ids), len(feature_names))

In [None]:
# saving the unique_ids converted into words; it will be useful for the feature analysis
#with open('/content/drive/MyDrive/Yelp/model_99_/feature_names_feb_01.pkl', 'wb') as d:
#  pickle.dump(feature_names, d)

## Finalizing the dataframes

In [12]:
# a function to fill in the dataframes in a boolean way
def is_word_in(isin_dict, df):
  """
  Helper function that fills in a dataframe, in place, based on the appearance
  of a word/token id in a set of documents.
  It needs:
    a dictionary with review indexes as keys and arrays of words/token ids as values;
    and a dataframe with the review indexes as rows and the words/token ids as columns.

  For every review (key) in the dictionary, it writes a 1 in the columns
  of the ids (values) that appear in that review.
  """
  for key, ids in isin_dict.items():
    # one .loc assignment per review avoids chained indexing (df[id][key]),
    # which pandas does not guarantee will write to the original dataframe
    df.loc[key, ids] = 1
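
As an alternative to filling an existing dataframe, the matrix can be built in NumPy first and wrapped afterwards. This is only a sketch: the function name `binary_matrix` is hypothetical, and it assumes the dictionary keys are 0..n-1, the columns are sorted (as `np.unique` returns them), and every id in the dictionary is among the columns.

```python
import numpy as np
import pandas as pd

def binary_matrix(isin_dict, columns):
    """Return a 0/1 dataframe with one row per review and one column per id."""
    columns = np.asarray(columns)
    matrix = np.zeros((len(isin_dict), len(columns)), dtype=np.uint8)
    for row, ids in isin_dict.items():
        # columns is sorted, so searchsorted maps each id to its column position
        matrix[row, np.searchsorted(columns, ids)] = 1
    return pd.DataFrame(matrix, columns=columns)

# Toy usage with hypothetical ids.
toy_isin = {0: np.array([0, 101, 2204]), 1: np.array([101, 2919])}
toy_columns = np.array([0, 101, 2204, 2919])
toy_x = binary_matrix(toy_isin, toy_columns)
print(toy_x)
```

At the train shape reported here (10,394 rows by 16,563 columns, or 172,155,822 cells), a `uint8` matrix takes about 172 MB, whereas an `int64` matrix of the same shape would take about 1.38 GB.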

In [13]:
%%time
# making a dataframe where the index is the review index and the columns are the unique ids/words in the reviews
x_train = pd.DataFrame(index=range(len(train_ids)), columns=unique_ids)
x_train.fillna(0, inplace=True)  # fill with 0, meaning the id does not appear in that review

CPU times: user 50.7 s, sys: 2.01 s, total: 52.7 s
Wall time: 52.8 s


In [14]:
%%time
is_word_in(isin_ids_tr, x_train)

CPU times: user 27.6 s, sys: 715 ms, total: 28.4 s
Wall time: 28.4 s


In [15]:
x_train.tail()

Unnamed: 0,0,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,...,29499,29500,29502,29508,29510,29513,29514,29519,29521,29523,29524,29525,29526,29528,29530,29535,29536,29546,29548,29549,29550,29552,29561,29563,29566,29569,29574,29576,29577,29578,29582,29584,29589,29591,29592,29593,29594,29597,29602,29610
10389,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10390,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10391,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10392,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10393,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [16]:
%%time
x_train.describe()

CPU times: user 35.2 s, sys: 214 ms, total: 35.4 s
Wall time: 35.5 s


Unnamed: 0,0,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,...,29499,29500,29502,29508,29510,29513,29514,29519,29521,29523,29524,29525,29526,29528,29530,29535,29536,29546,29548,29549,29550,29552,29561,29563,29566,29569,29574,29576,29577,29578,29582,29584,29589,29591,29592,29593,29594,29597,29602,29610
count,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,...,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0,10394.0
mean,0.921301,1.0,1.0,0.004329,0.011353,9.6e-05,0.001636,0.003271,0.006542,0.018568,0.008947,0.002117,0.000673,0.004233,0.002213,0.000481,0.004618,0.001443,0.00077,0.00077,0.007697,0.004426,0.004618,0.000866,0.009044,0.001347,0.002213,0.003848,0.000673,0.000481,0.000481,0.011256,0.012603,0.001924,0.005003,0.00279,0.004907,0.001347,0.002694,0.000289,...,0.000192,0.00635,0.000192,0.000192,0.000385,9.6e-05,0.000192,0.001058,9.6e-05,0.000192,9.6e-05,0.00279,0.000192,9.6e-05,0.0,9.6e-05,0.003945,9.6e-05,0.0,9.6e-05,0.0,0.000481,0.000192,9.6e-05,9.6e-05,9.6e-05,0.000192,9.6e-05,9.6e-05,0.0,9.6e-05,9.6e-05,0.000577,0.000577,0.000962,0.002982,9.6e-05,0.000192,0.000289,9.6e-05
std,0.269282,0.0,0.0,0.065659,0.105948,0.009809,0.040411,0.057103,0.080623,0.135001,0.094171,0.04596,0.025944,0.064928,0.046991,0.021929,0.067802,0.037963,0.027734,0.027734,0.087397,0.066381,0.067802,0.029415,0.094672,0.036678,0.046991,0.061919,0.025944,0.021929,0.021929,0.105503,0.111561,0.043825,0.070557,0.05275,0.069879,0.036678,0.051835,0.016987,...,0.013871,0.079436,0.013871,0.013871,0.019614,0.009809,0.013871,0.032516,0.009809,0.013871,0.009809,0.05275,0.013871,0.009809,0.0,0.009809,0.062685,0.009809,0.0,0.009809,0.0,0.021929,0.013871,0.009809,0.009809,0.009809,0.013871,0.009809,0.009809,0.0,0.009809,0.009809,0.02402,0.02402,0.031004,0.054533,0.009809,0.013871,0.016987,0.009809
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [17]:
%%time
# making a dataframe where the index is the review index and the columns are the unique words/ids in the reviews
x_test = pd.DataFrame(index=range(len(test_ids)), columns=unique_ids)
x_test.fillna(0, inplace=True)  # fill with 0, meaning the id does not appear in that review

CPU times: user 14.5 s, sys: 135 ms, total: 14.6 s
Wall time: 14.5 s


In [18]:
%%time
is_word_in(isin_ids_ts, x_test)

CPU times: user 7.48 s, sys: 129 ms, total: 7.6 s
Wall time: 7.63 s


In [19]:
x_test.tail()

Unnamed: 0,0,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,...,29499,29500,29502,29508,29510,29513,29514,29519,29521,29523,29524,29525,29526,29528,29530,29535,29536,29546,29548,29549,29550,29552,29561,29563,29566,29569,29574,29576,29577,29578,29582,29584,29589,29591,29592,29593,29594,29597,29602,29610
2594,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2595,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2596,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2597,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2598,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [20]:
%%time
x_test.describe()

CPU times: user 31.8 s, sys: 144 ms, total: 32 s
Wall time: 32.1 s


Unnamed: 0,0,101,102,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,...,29499,29500,29502,29508,29510,29513,29514,29519,29521,29523,29524,29525,29526,29528,29530,29535,29536,29546,29548,29549,29550,29552,29561,29563,29566,29569,29574,29576,29577,29578,29582,29584,29589,29591,29592,29593,29594,29597,29602,29610
count,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,...,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0,2599.0
mean,0.91843,1.0,1.0,0.003848,0.010004,0.0,0.001154,0.005387,0.006541,0.01616,0.00808,0.002309,0.0,0.006541,0.003463,0.0,0.005771,0.000385,0.000385,0.00077,0.00808,0.004232,0.005387,0.000385,0.011158,0.001539,0.001539,0.003078,0.001539,0.000385,0.000385,0.009234,0.013082,0.003463,0.003463,0.003848,0.006156,0.001154,0.00077,0.000385,...,0.0,0.005002,0.0,0.0,0.0,0.000385,0.00077,0.000385,0.0,0.00077,0.0,0.00077,0.0,0.0,0.000385,0.0,0.003463,0.0,0.000385,0.000385,0.000385,0.0,0.000385,0.0,0.00077,0.0,0.0,0.0,0.0,0.000385,0.0,0.0,0.000385,0.000385,0.001154,0.003078,0.0,0.0,0.0,0.0
std,0.273761,0.0,0.0,0.061922,0.099537,0.0,0.033962,0.07321,0.080627,0.126115,0.089542,0.048001,0.0,0.080627,0.058755,0.0,0.075765,0.019615,0.019615,0.027735,0.089542,0.064932,0.07321,0.019615,0.105061,0.039208,0.039208,0.055406,0.039208,0.019615,0.019615,0.095669,0.113648,0.058755,0.058755,0.061922,0.078235,0.033962,0.027735,0.019615,...,0.0,0.070561,0.0,0.0,0.0,0.019615,0.027735,0.019615,0.0,0.027735,0.0,0.027735,0.0,0.0,0.019615,0.0,0.058755,0.0,0.019615,0.019615,0.019615,0.0,0.019615,0.0,0.027735,0.0,0.0,0.0,0.0,0.019615,0.0,0.0,0.019615,0.019615,0.033962,0.055406,0.0,0.0,0.0,0.0
min,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0


In [21]:
# saving the x values dataframes
x_train.to_pickle('/content/drive/MyDrive/Yelp/model_128_/binary_train_rf_feb_02')
x_test.to_pickle('/content/drive/MyDrive/Yelp/model_128_/binary_test_rf_feb_02')
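
To read the dataframes back in a later notebook, `pd.read_pickle` is the counterpart of `to_pickle`. The sketch below checks the round trip on a toy dataframe in a temporary directory; the file name `binary_toy_rf` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy binary dataframe with token ids as column labels.
toy = pd.DataFrame([[1, 0], [0, 1]], columns=[101, 2204])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "binary_toy_rf")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored dataframe keeps the values, the index and the integer column labels.
print(restored.equals(toy))
```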