# Preprocessing

Now that the EDA is wrapped up, I can move on to preprocessing. In this notebook, I'll focus solely on the NLP task at hand. Luckily, I have already removed stop words and lemmatized the text, so all that remains is converting the text to a numeric format for modelling.

As per usual, I'll start with loading all the relevant libraries.

In [1]:
import pandas as pd
from library.sb_utils import save_file
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import sagemaker
import boto3

## Loading The Data

Now I'll load in all the data. This data comes as a result of work done in the [`1_data_wrangling.ipynb`](https://github.com/isabelanyc/Ice-Cream-Reviews/blob/main/notebooks/1_data_wrangling.ipynb) notebook file. I will also create the SageMaker session and name the S3 bucket that will hold the processed files.

In [2]:
data = pd.read_csv('../data/ice_cream_data.csv', index_col=0)

In [3]:
sm_boto3 = boto3.client('sagemaker')
sess = sagemaker.Session()
bucket = 'ice-cream-reviews-sagemakerbucket'
print('Using bucket ' + bucket)

Using bucket ice-cream-reviews-sagemakerbucket


In [4]:
data.head(3)

Unnamed: 0,author,brand,name,rating,rating_count,stars,text,good_review
0,Ilovebennjerry,bj,Salted Caramel Core,3.7,208,3,not enough brownie super good dont get wrong b...,Bad
1,Sweettooth909,bj,Salted Caramel Core,3.7,208,5,im obsessed pint i decided try although im hug...,Good
2,LaTanga71,bj,Salted Caramel Core,3.7,208,3,my favoritemore caramel please my caramel core...,Bad


To reiterate, I am only interested in `text` and `good_review`, so I will keep just these two columns and store them in a variable called `preprocessed_data`.

In [5]:
preprocessed_data = data[['text', 'good_review']]
preprocessed_data.head(3)

Unnamed: 0,text,good_review
0,not enough brownie super good dont get wrong b...,Bad
1,im obsessed pint i decided try although im hug...,Good
2,my favoritemore caramel please my caramel core...,Bad


## Randomize the Data

Right now, the data is ordered by brand. To keep that ordering from influencing later steps such as the train/test split, I will shuffle the rows.

In [6]:
# Shuffle all rows, then replace the old index with a fresh 0..n-1 index
preprocessed_data = preprocessed_data.sample(frac=1).reset_index(drop=True)
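
The shuffle above produces a different row order on every run. Here is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this project. The seed value `42` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for preprocessed_data (rows are made up).
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "good_review": ["Bad", "Good", "Bad", "Good", "Good"],
})

# random_state fixes the permutation; drop=True discards the old index
# instead of adding it back as an 'index' column.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
print(shuffled.equals(shuffled_again))
```

With the same seed, both calls return the rows in the same order, so downstream results can be reproduced.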

In [7]:
preprocessed_data.head(5)

Unnamed: 0,text,good_review
0,taste change i totally agree everyone say tast...,Bad
1,vanilla goodness who doesnt love icecream im o...,Good
2,excellent high quality flavor the ice cream si...,Good
3,great description i really like cinnabon flavo...,Good
4,best ice cream i tried haagendazs bar ice crea...,Good


## TF-IDF Vectorizer

The dataframe currently contains only text. Most models cannot take raw text as input, so I will use TF-IDF (term frequency-inverse document frequency) to transform the text into numeric features. For a useful explanation of how TF-IDF works, see this video from [Krish Naik's YouTube](https://www.youtube.com/watch?v=D2V1okCEsiE).
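
To make the idea concrete, here is a minimal sketch of the TF-IDF computation on a made-up three-review corpus. It assumes the smoothed IDF formula and L2 normalization that `TfidfVectorizer` applies by default, as I understand them. The `docs` list and the `tfidf_row` helper are hypothetical and are not part of this notebook's pipeline.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the review data).
docs = [
    "great caramel flavor",
    "great ice cream",
    "not enough caramel",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one document over vocab."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw


weights = dict(zip(vocab, tfidf_row(tokenized[0])))
print(weights)
```

In the first review, `flavor` ends up with a larger weight than `great` because it appears in only one document, while `great` appears in two. Terms that are absent from the review get a weight of zero, which is why most entries in the real matrix are `0.0`.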

I will cap the vocabulary at 1000 features (`max_features=1000`), which keeps only the most frequent terms and makes computation much faster. Then I'll store the independent features in `X` and the dependent column in `y`.

In [8]:
tfidf = TfidfVectorizer()

In [9]:
tfidf = TfidfVectorizer(max_features=1000)

In [10]:
X = tfidf.fit_transform(preprocessed_data['text']).toarray()
y = preprocessed_data['good_review']

In [11]:
feature_names = tfidf.get_feature_names_out()
X = pd.DataFrame(X, columns=feature_names)

In [12]:
X.head()

Unnamed: 0,able,absolute,absolutely,across,actual,actually,add,added,addicted,addicting,...,yesterday,yet,you,youll,your,youre,youve,yum,yummy,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.127193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Split Into Test and Training Sets

I'll use `train_test_split()` from `sklearn` to split the data 70-30 into training and test sets.

In [13]:
# X is already a DataFrame, so it can be concatenated with y directly
train, test = train_test_split(pd.concat([X, y], axis=1), test_size=0.3, random_state=42)
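
As a quick sanity check on the split, the sketch below applies the same kind of call to a made-up frame of ten rows and confirms the 70-30 proportions. The `stratify` argument is an optional addition that this notebook does not use; it keeps the Good/Bad ratio similar in both sets, which can matter when one class is much rarer than the other.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: one feature column plus the label (values are made up).
toy = pd.DataFrame({
    "feature": range(10),
    "good_review": ["Good"] * 6 + ["Bad"] * 4,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.3,
    random_state=42,
    stratify=toy["good_review"],  # optional: preserve the class ratio
)

print(toy_train.shape, toy_test.shape)
print(toy_test["good_review"].value_counts())
```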

## Save Data

Now I'll save the training and testing data so I can use them for modelling. I will save the files locally and also upload them to my AWS S3 bucket.

In [14]:
preprocessed_data.to_csv("../data/preprocessed_data.csv")
train.to_csv("../data/train.csv")
test.to_csv("../data/test.csv")
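
Because `to_csv` writes the row index as an unnamed first column by default, anything that reads these files back will likely want `index_col=0` (as used when loading `ice_cream_data.csv` earlier) to avoid a stray `Unnamed: 0` column. Here is a small sketch using an in-memory buffer and made-up rows:

```python
import io

import pandas as pd

# Toy stand-in for the saved data (rows are made up).
toy = pd.DataFrame({
    "text": ["so good", "too sweet"],
    "good_review": ["Good", "Bad"],
})

buffer = io.StringIO()
toy.to_csv(buffer)  # default index=True writes the index as the first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index comes back as a regular column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index is restored as the index

print(list(naive.columns))
print(list(restored.columns))
```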

In [15]:
sk_prefix = "sagemaker/ice-cream-reviews/sklearncontainer"
preprocessed_data_path = sess.upload_data(
    path="../data/preprocessed_data.csv", bucket=bucket, key_prefix=sk_prefix
)

In [16]:
train_path = sess.upload_data(
    path="../data/train.csv", bucket=bucket, key_prefix=sk_prefix
)

In [17]:
test_path = sess.upload_data(
    path="../data/test.csv", bucket=bucket, key_prefix=sk_prefix
)

In [18]:
print(train_path)
print(test_path)

s3://ice-cream-reviews-sagemakerbucket/sagemaker/ice-cream-reviews/sklearncontainer/train.csv
s3://ice-cream-reviews-sagemakerbucket/sagemaker/ice-cream-reviews/sklearncontainer/test.csv
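
The printed paths follow the pattern `s3://<bucket>/<key_prefix>/<file name>`. The helper below is a hypothetical illustration of that pattern. It is not a SageMaker function and makes no AWS calls; it only shows how to predict where a file should land before uploading it.

```python
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Build the S3 URI that matches the pattern seen in the output above."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{file_name}"


uri = expected_s3_uri(
    "ice-cream-reviews-sagemakerbucket",
    "sagemaker/ice-cream-reviews/sklearncontainer",
    "../data/train.csv",
)
print(uri)
```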
