# 3.0 Pre-processing and Training

Contents

3.1 [Introduction](#3.1)

  * [3.1.1 Problem Recap](#3.1.1)
  * [3.1.2 Notebook Goals](#3.1.2)
 
   
3.2 [Load the data](#3.2)

  * [3.2.1 Imports](#3.2.1)
  * [3.2.2 Load the data](#3.2.2)


3.3 [Pre-processing](#3.3)

  * [3.3.1 Examine Positive/Negative Split](#3.3.1)
  * [3.3.2 Set Random Seed for Reproducibility](#3.3.2)
  * [3.3.3 Train/Test Split](#3.3.3)

3.5 [Vectorizing the Text](#3.5)

  * [3.5.1 Count Vectorization](#3.5.1)
  * [3.5.2 Term-Frequency Inverse-Document Frequency](#3.5.2)

3.6 [Building a Simple Model](#3.6)

  * [3.6.1 Logistic Regression](#3.6.1)
    * [3.6.1.1 Training the Model](#3.6.1.1)
    * [3.6.1.2 Fitting the Model](#3.6.1.2)
    * [3.6.1.3 Evaluating the Model](#3.6.1.3)
  * [3.6.2 Naive Bayes](#3.6.2)
    * [3.6.2.1 Training the Model](#3.6.2.1)
    * [3.6.2.2 Fitting the Model](#3.6.2.2)
    * [3.6.2.3 Evaluating the Model](#3.6.2.3)
  * [3.6.3 SVM Classifier](#3.6.3)
    * [3.6.3.1 Training the Model](#3.6.3.1)
    * [3.6.3.2 Fitting the Model](#3.6.3.2)
    * [3.6.3.3 Evaluating the Model](#3.6.3.3)



## 3.1 Introduction <a name="3.1"></a>

### 3.1.1 Problem Recap <a name="3.1.1"></a>

Using customer review text for Amazon products, we will build, evaluate, and compare models that estimate the probability that a given review is “positive” or “negative”.

Our goal is to build a text classifier from Amazon product review data that can then be used to analyze customer sentiment in text that has no accompanying numeric data.

### 3.1.2 Notebook Goals <a name="3.1.2"></a>

1. Normalize the text using lemmatization, which maps different forms of the same word to a common base form (lemma) where possible.

2. Split the data into train and test sets.

3. Convert the review text into numeric features (count vectorization, TF-IDF, and word2vec are options).

4. Build a simple, initial predictive model using logistic regression.

5. Logistic regression predicts probabilities, so with reviews labeled "positive" (0) or "negative" (1) we can treat the task as binary classification.

6. Measure the performance of the model with a classification report, a confusion matrix, and ROC AUC.
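The evaluation step in goal 6 could look roughly like the sketch below. The labels and probabilities are made up for illustration, and the 0.5 decision threshold is an assumption, not a value taken from this notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Made-up labels and predicted probabilities of the "negative" class (1).
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.65, 0.30, 0.20, 0.90]

# Turn probabilities into hard labels with an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, and F1.
print(classification_report(y_true, y_pred, target_names=["positive", "negative"]))

# ROC AUC is computed from the probabilities, not the thresholded labels.
auc = roc_auc_score(y_true, y_prob)
print(f"ROC AUC: {auc:.3f}")
```

With a fitted classifier, the probabilities would typically come from `predict_proba(...)[:, 1]`.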

## 3.2 Load the data <a name="3.2"></a>

### 3.2.1 Imports <a name="3.2.1"></a>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow.parquet as pq
from spacy.lang.en import English
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from random import seed
from sklearn.linear_model import LogisticRegression

### 3.2.2 Load the data <a name="3.2.2"></a>

In [3]:
data = pq.read_table("../data/edited/fashion.parquet")
fashion = data.to_pandas()

In [4]:
fashion.head()

| | review | neg_sentiment | stars | review_length |
|---|---|---|---|---|
| 0 | exactly needed | 0 | 5 | 4 |
| 1 | agree review opening small bent hook expensiv... | 1 | 2 | 49 |
| 2 | love going order pack work including losing ea... | 0 | 4 | 50 |
| 3 | tiny opening | 1 | 2 | 4 |
| 4 | okay | 1 | 3 | 1 |


In [5]:
fashion.describe()

| | neg_sentiment | stars | review_length |
|---|---|---|---|
| count | 873352.0 | 873352.0 | 873352.0 |
| mean | 0.304939 | 3.904786 | 29.131591 |
| std | 0.460382 | 1.419361 | 39.372047 |
| min | 0.0 | 1.0 | 1.0 |
| 25% | 0.0 | 3.0 | 7.0 |
| 50% | 0.0 | 5.0 | 17.0 |
| 75% | 1.0 | 5.0 | 36.0 |
| max | 1.0 | 5.0 | 2196.0 |


In [6]:
fashion.shape

(873352, 4)

## 3.3 Pre-processing <a name="3.3"></a>

### 3.3.1 Examine Positive/Negative Split <a name="3.3.1"></a>

In [7]:
print("0 as positive reviews, 1 as negative")
print()
print(fashion["neg_sentiment"].value_counts())



0 as positive reviews, 1 as negative

0    607033
1    266319
Name: neg_sentiment, dtype: int64
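The counts above imply roughly a 70/30 split between positive and negative reviews. As a quick arithmetic check, the negative share should match the `neg_sentiment` mean of 0.304939 reported by `describe()`. The variable names below are illustrative only.

```python
# Counts copied from the value_counts() output above.
n_positive = 607033  # neg_sentiment == 0
n_negative = 266319  # neg_sentiment == 1
n_total = n_positive + n_negative

share_positive = n_positive / n_total
share_negative = n_negative / n_total

print(f"total reviews:  {n_total}")
print(f"positive share: {share_positive:.3f}")
print(f"negative share: {share_negative:.3f}")
```

On the DataFrame itself, `fashion["neg_sentiment"].value_counts(normalize=True)` gives the same proportions directly.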


In [8]:
cv = CountVectorizer(min_df = 0.001)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(fashion["review"], fashion["neg_sentiment"], test_size = .1)
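The split above passes no `random_state`, so it can differ from run to run. A minimal sketch of a reproducible, class-balanced split on toy data follows. The toy lists, the helper name `split`, the 50% test size, and the use of `stratify` are all assumptions for illustration, not settings used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for fashion["review"] and fashion["neg_sentiment"]:
# 14 positive (0) and 6 negative (1) reviews.
reviews = [f"review {i}" for i in range(20)]
labels = [0] * 14 + [1] * 6

def split(random_state):
    # random_state fixes the shuffle; stratify keeps the 70/30 class ratio in both halves.
    return train_test_split(
        reviews, labels, test_size=0.5, random_state=random_state, stratify=labels
    )

X_train_a, X_test_a, y_train_a, y_test_a = split(42)
X_train_b, X_test_b, y_train_b, y_test_b = split(42)

print(X_test_a == X_test_b)  # compares the two test halves produced with the same seed
print(sum(y_test_a), "negative reviews out of", len(y_test_a), "in the test half")
```

Applied to the real data, the equivalent change would be adding `random_state=42` (and optionally `stratify=fashion["neg_sentiment"]`) to the existing `train_test_split` call.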

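### 3.3.2 Set Random Seed for Reproducibility <a name="3.3.2"></a>

In [10]:
# Note: this seeds only Python's built-in `random` module; the scikit-learn split above is unaffected.
seed(42)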

### 3.3.3 Train/Test Split <a name="3.3.3"></a>

In [11]:
y_train.value_counts()

0    546209
1    239807
Name: neg_sentiment, dtype: int64

In [12]:
y_test.value_counts()

0    60824
1    26512
Name: neg_sentiment, dtype: int64

In [13]:
from sklearn.pipeline import make_pipeline  # LogisticRegression cannot fit raw strings, so vectorize with cv first
clf = make_pipeline(cv, LogisticRegression(max_iter = 1000)).fit(X_train, y_train)
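`LogisticRegression` only accepts numeric feature matrices, so review text has to pass through a vectorizer, such as the `cv` defined earlier, before fitting. Below is a minimal sketch of the explicit two-step form on made-up reviews. The texts, labels, and variable names are illustrative assumptions, and the default `CountVectorizer()` is used here instead of `min_df = 0.001` because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up reviews standing in for X_train / X_test; labels follow the notebook (1 = negative).
train_text = [
    "love it fits great",
    "tiny opening broke fast",
    "great quality love",
    "broke after one day",
]
train_labels = [0, 1, 0, 1]
test_text = ["love the quality", "opening broke"]

vectorizer = CountVectorizer()
# Learn the vocabulary from the training text only, then reuse it for the test text.
X_train_vec = vectorizer.fit_transform(train_text)
X_test_vec = vectorizer.transform(test_text)

model = LogisticRegression(max_iter=1000).fit(X_train_vec, train_labels)
test_preds = model.predict(X_test_vec)

print(X_train_vec.shape)  # (number of reviews, vocabulary size)
print(test_preds)
```

Fitting the vocabulary on the training split alone keeps information from the test reviews out of the features.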

In [None]:
y_pred = clf.predict(X_train)

In [None]:
(y_pred == y_train).shape

(786016,)

In [None]:
# Training accuracy: 688,882 correct predictions out of 786,016
688882/786016

0.8764223629035541

In [None]:
y_test_preds = clf.predict(X_test)

In [None]:
(y_test_preds == y_test).shape

(87336,)

In [None]:
(y_test_preds == y_test).sum()

76338

In [None]:
# Test accuracy: 76,338 correct predictions out of 87,336
76338/87336

0.8740725474031328
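To put these accuracies in context, it can help to compare them with a majority-class baseline, that is, the accuracy of always predicting "positive" (0). The sketch below only redoes the arithmetic from the counts shown in the cells above; the baseline comparison is an addition for illustration, not something the notebook computes.

```python
# Counts copied from the cells above.
train_correct, train_total = 688882, 786016
test_correct, test_total = 76338, 87336

train_accuracy = train_correct / train_total
test_accuracy = test_correct / test_total

# Baseline: always predict the majority class (0, positive) on the test set.
test_majority = 60824  # positive reviews in y_test
baseline_accuracy = test_majority / test_total

print(f"train accuracy:    {train_accuracy:.4f}")
print(f"test accuracy:     {test_accuracy:.4f}")
print(f"majority baseline: {baseline_accuracy:.4f}")
```

In the notebook itself, `(y_test_preds == y_test).mean()` yields the test accuracy without copying counts by hand.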

## 3.5 Initial Modeling <a name="3."></a>