# Preprocessing

Now that the EDA is wrapped up, I can move on to preprocessing. In this notebook, I'll focus solely on the NLP task at hand. Luckily, I have already removed stop words and lemmatized the text, so all that remains is converting the text to a numeric format for modelling.

As per usual, I'll start with loading all the relevant libraries.

In [1]:
import pandas as pd
from library.sb_utils import save_file
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

## Loading The Data

Now I'll load in all the data. This data comes as a result of work done in the [`1_data_wrangling.ipynb`](https://github.com/isabelanyc/Ice-Cream-Reviews/blob/main/notebooks/1_data_wrangling.ipynb) notebook file.

In [2]:
data = pd.read_csv('../data/ice_cream_data.csv', index_col=0)

In [3]:
data.head(3)

Unnamed: 0,author,brand,name,rating,rating_count,stars,text,good_review
0,Ilovebennjerry,bj,Salted Caramel Core,3.7,208,3,not enough brownie super good dont get wrong b...,Bad
1,Sweettooth909,bj,Salted Caramel Core,3.7,208,5,im obsessed pint i decided try although im hug...,Good
2,LaTanga71,bj,Salted Caramel Core,3.7,208,3,my favoritemore caramel please my caramel core...,Bad


To reiterate, I am only interested in the `text` and `good_review` columns, so I will keep only these two and store them in a variable called `preprocessed_data`.

In [5]:
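# .copy() makes an independent frame, so later in-place edits don't trigger SettingWithCopyWarning
preprocessed_data = data[['text', 'good_review']].copy()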
preprocessed_data.head(3)

Unnamed: 0,text,good_review
0,not enough brownie super good dont get wrong b...,Bad
1,im obsessed pint i decided try although im hug...,Good
2,my favoritemore caramel please my caramel core...,Bad


## Randomize the Data

Right now, the data is ordered by brand. `train_test_split` shuffles rows by default, but to be safe I will mix up the data first so that the brand ordering cannot affect the split.

In [6]:
# shuffle all rows, then discard the old index in favour of a fresh 0..n-1 index
preprocessed_data = preprocessed_data.sample(frac=1).reset_index(drop=True)

In [7]:
preprocessed_data.head(5)

Unnamed: 0,text,good_review
0,fall party mouth i absolutely love blueberry c...,Good
1,change recipe i love cookie dough ice cream an...,Bad
2,delicious i sensitive sugar trying reduce adde...,Good
3,very soft consistency melt fast i like ice cre...,Bad
4,so badi got website write review lol it even s...,Bad


## TF-IDF Vectorizer

The dataframe currently contains only textual data. Most models cannot take raw text as input, since they expect numeric features. To address this, I will use the TF-IDF method to transform the text into numeric vectors. Here is a useful video on how TF-IDF works from [Krish Naik's YouTube channel](https://www.youtube.com/watch?v=D2V1okCEsiE).
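To make the idea concrete, here is a minimal pure-Python sketch of the weighting on a made-up three-document corpus. It assumes the smoothed formula documented as scikit-learn's default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document vector. The corpus and the helper name `tfidf_vector` are illustrative only and are not part of this project.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (not taken from the review data)
docs = [
    "love caramel ice cream",
    "caramel too sweet",
    "love love chocolate",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each term at least once
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return {term: weight} for one document, scaled to unit L2 norm."""
    term_freq = Counter(tokens)
    # Raw weight = term count * smoothed inverse document frequency
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]

for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in sorted(vec.items())})
```

Within the first document, `ice` and `cream` end up with a higher weight than `caramel` or `love`, because they appear in only one document and are therefore more distinctive.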

I will restrict the number of features to 2000 to keep the computation time down. Then I'll store the independent features in `X` and the dependent column in `y`.
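As a rough picture of what capping the feature count does, the sketch below keeps only the most frequent terms across a made-up corpus and counts those terms per document. This is an illustration of the idea behind `max_features` (a vocabulary limited to the top terms by corpus frequency), not scikit-learn's actual implementation; the corpus and the tiny cap of 2 are assumptions chosen so the result is easy to read.

```python
from collections import Counter

# Toy corpus, made up for illustration
docs = [
    "love caramel ice cream",
    "love caramel swirl",
    "love chocolate",
]

max_features = 2  # hypothetical cap, far smaller than in the notebook

# Count every term across the whole corpus
term_counts = Counter(term for doc in docs for term in doc.split())

# Keep only the most frequent terms; everything else is dropped
vocabulary = sorted(term for term, _ in term_counts.most_common(max_features))

# Represent each document by its counts over the reduced vocabulary
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for doc, row in zip(docs, rows):
    print(row, "<-", doc)
```

Rare terms such as `swirl` vanish from the representation, which is the trade-off of a smaller feature matrix.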

In [8]:
tfidf = TfidfVectorizer()

In [17]:
tfidf = TfidfVectorizer(max_features=2000)

In [18]:
X = tfidf.fit_transform(preprocessed_data['text']).toarray()
y = preprocessed_data['good_review']

In [19]:
# X is a NumPy array (the result of .toarray()), so it has no .head(); slice the first rows instead
X[:5]

## Split Into Test and Training Sets

I'll use `train_test_split()` from `sklearn` to split the data into training and testing sets, 70-30.
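For intuition, a shuffled 70-30 split can be sketched in plain Python as follows. This is not how `sklearn` implements it; the helper name `split_70_30`, the toy rows, and the rounding rule are assumptions made for the illustration. The key point is that features and labels are indexed by the same shuffled positions, so they stay aligned.

```python
import random


def split_70_30(rows, labels, test_size=0.3, seed=1):
    """Shuffle positions once, then slice them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Use the same positions for rows and labels to keep them paired
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test


# Toy data: row i is just the number i, labelled by whether it is even
demo_rows = list(range(10))
demo_labels = ["Good" if i % 2 == 0 else "Bad" for i in demo_rows]

demo_train, demo_test, demo_y_train, demo_y_test = split_70_30(demo_rows, demo_labels)
print(len(demo_train), len(demo_test))
```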

In [12]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X = pd.DataFrame(X, columns=tfidf.get_feature_names_out())

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [14]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(X_train, columns=tfidf.get_feature_names_out())

## Save Data

Now, I'll just save the training and testing data so I can use them for modelling.

In [None]:
# save the data to a new csv file
datapath = '../data'
save_file(preprocessed_data, 'preprocessed_data.csv', datapath)

In [None]:
save_file(X_test, 'X_test.csv', datapath)

In [None]:
save_file(X_train, 'X_train.csv', datapath)

In [None]:
save_file(y_train, 'y_train.csv', datapath)

In [None]:
save_file(y_test, 'y_test.csv', datapath)