# Amazon Baby Products Dataset

Sentiment analysis is one of the most important tasks in Natural Language Processing. It differs from machine learning on numeric data because an algorithm cannot process raw text directly: the text must first be transformed into a numeric form. So text data are vectorized before they are fed into a machine learning model.
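To make "vectorized" concrete, here is a minimal sketch using scikit-learn's `CountVectorizer` on two made-up reviews (the sentences are illustrative and are not taken from the dataset). Each review becomes a row of word counts, one column per vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy reviews, invented for illustration only
docs = ["great product, great price", "bad product"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Vocabulary words ordered by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()

print(vocab)   # column labels
print(dense)   # one row per review, one column per word
```

The word "great" appears twice in the first review, so its column holds a 2 in the first row and a 0 in the second.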

## About the dataset

The original dataset has three features: name (the name of the product), review (the customer's review of the product), and rating (the customer's rating of the product, from 1 to 5). The review column is the input, and the rating column is used to derive the sentiment of each review. Here are the important data preprocessing steps:

- The dataset has 183,531 rows. There are 1,147 null values (318 in name and 829 in review). I simply drop the rows that contain them, which leaves 182,384 rows.

- As the dataset is pretty big, some machine learning algorithms take a long time to run on it. So I used a 30% random sample for this project, which is still 54,715 rows. Because the sample was drawn at random, it should be representative of the full dataset.

- A rating of 1 or 2 is considered a bad (negative) review, and a rating of 3, 4, or 5 is considered a good (positive) review. So I added a new column named ‘sentiments’ to the dataset that uses 1 for positive reviews and 0 for negative reviews.
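The labelling rule can be sketched on a toy frame as follows. The ratings below are invented for illustration. The class-share check at the end is an assumption about how one might compare a sample against the full data, not a step the notebook performs.

```python
import pandas as pd

# Toy ratings standing in for the real `rating` column (illustrative only)
toy = pd.DataFrame({"rating": [1, 2, 3, 4, 5, 5, 5, 4, 1, 5]})

# Ratings 1-2 -> 0 (negative), ratings 3-5 -> 1 (positive)
toy["sentiments"] = (toy["rating"] >= 3).astype(int)

# Share of each class. Computing this on both the full data and the
# 30% sample is one simple way to see whether the sample looks similar.
class_share = toy["sentiments"].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, a large gap between the two class shares would also be a warning that accuracy alone can be misleading.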


## Import the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('amazon_baby.csv')
df

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
...,...,...,...
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,5
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,5
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,5
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,5


In [3]:
df.shape

(183531, 3)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    183213 non-null  object
 1   review  182702 non-null  object
 2   rating  183531 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.2+ MB


In [5]:
df.describe()

Unnamed: 0,rating
count,183531.0
mean,4.120448
std,1.285017
min,1.0
25%,4.0
50%,5.0
75%,5.0
max,5.0


In [6]:
df = df.dropna()

## Pre-processing the data

In [7]:
df['review']

0         These flannel wipes are OK, but in my opinion ...
1         it came early and was not disappointed. i love...
2         Very soft and comfortable and warmer than it l...
3         This is a product well worth the purchase.  I ...
4         All of my kids have cried non-stop when I trie...
                                ...                        
183526    Such a great idea! very handy to have and look...
183527    This product rocks!  It is a great blend of fu...
183528    This item looks great and cool for my kids.......
183529    I am extremely happy with this product. I have...
183530    I love this product very mush . I have bought ...
Name: review, Length: 182384, dtype: object

In [8]:
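# Replace every non-alphanumeric character with a space.
# regex=True is passed explicitly because pandas 2.0 changed the default to a literal match.
df['review_processed'] = df['review'].str.replace("[^a-zA-Z0-9]", " ", regex=True)

# Reorder columns
df = df[['review','review_processed','rating']]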

  df['review_processed'] = df['review'].str.replace("[^a-zA-Z0-9]", " ")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review_processed'] = df['review'].str.replace("[^a-zA-Z0-9]", " ")
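The warning above is pandas' `SettingWithCopyWarning`. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` after `dropna()` is acceptable, is shown here on a made-up three-row frame. It also passes `regex=True` so the pattern is treated as a regular expression regardless of pandas version.

```python
import pandas as pd

# Made-up reviews, including one missing value
raw = pd.DataFrame({"review": ["Great, non-stop fun!", None, "OK... I guess"]})

# An explicit copy makes it clear that we own this frame
clean = raw.dropna().copy()

# Replace every non-alphanumeric character with a single space
clean["review_processed"] = clean["review"].str.replace(
    r"[^a-zA-Z0-9]", " ", regex=True
)

processed = clean["review_processed"].tolist()
print(processed)
```

Each punctuation character becomes one space, so runs of punctuation leave runs of spaces, just as in the notebook output.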


In [9]:
df['review_processed']

0         These flannel wipes are OK  but in my opinion ...
1         it came early and was not disappointed  i love...
2         Very soft and comfortable and warmer than it l...
3         This is a product well worth the purchase   I ...
4         All of my kids have cried non stop when I trie...
                                ...                        
183526    Such a great idea  very handy to have and look...
183527    This product rocks   It is a great blend of fu...
183528    This item looks great and cool for my kids    ...
183529    I am extremely happy with this product  I have...
183530    I love this product very mush   I have bought ...
Name: review_processed, Length: 182384, dtype: object

In [10]:
np.random.seed(34)
df1 = df.sample(frac = 0.3)

In [11]:
df1['sentiments'] = df1.rating.apply(lambda x: 0 if x in [1, 2] else 1)

In [12]:
df1

Unnamed: 0,review,review_processed,rating,sentiments
165191,An off-white or cream sheet that is so soft. I...,An off white or cream sheet that is so soft I...,5,1
108775,I was skeptical about how well these will work...,I was skeptical about how well these will work...,5,1
162820,It soft and material appears to be excellent. ...,It soft and material appears to be excellent ...,5,1
148217,This is a very nice cover. I have two because ...,This is a very nice cover I have two because ...,5,1
46428,"I love these Lovies. They are cute, soft and d...",I love these Lovies They are cute soft and d...,5,1
...,...,...,...,...
137608,Fit my Jenny Lind crib perfectly. Water proof ...,Fit my Jenny Lind crib perfectly Water proof ...,5,1
156932,I purchased this and returned it immediately b...,I purchased this and returned it immediately b...,1,0
171309,I love this diaper bag. Everywhere I go with m...,I love this diaper bag Everywhere I go with m...,5,1
57598,This is Pancake Bear number 4 for our house. ...,This is Pancake Bear number 4 for our house ...,5,1


## Sentiment Analysis

In [13]:
X = df1['review']
y = df1['sentiments']

## Train Test Split

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=20)

In [15]:
X

165191    An off-white or cream sheet that is so soft. I...
108775    I was skeptical about how well these will work...
162820    It soft and material appears to be excellent. ...
148217    This is a very nice cover. I have two because ...
46428     I love these Lovies. They are cute, soft and d...
                                ...                        
137608    Fit my Jenny Lind crib perfectly. Water proof ...
156932    I purchased this and returned it immediately b...
171309    I love this diaper bag. Everywhere I go with m...
57598     This is Pancake Bear number 4 for our house.  ...
99994     The Snuzzler is perfect for my baby boy. It ma...
Name: review, Length: 54715, dtype: object

In [16]:
y

165191    1
108775    1
162820    1
148217    1
46428     1
         ..
137608    1
156932    0
171309    1
57598     1
99994     1
Name: sentiments, Length: 54715, dtype: int64

In [17]:
X.shape

(54715,)

In [18]:
y.shape

(54715,)

## Vectorizing the text data

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [20]:
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

## Model Development

### 1. Logistic Regression

In [21]:
from sklearn.linear_model import LogisticRegression
#Training the model
lr = LogisticRegression()
lr.fit(ctmTr, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
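The message above is a convergence warning: the solver stopped at its iteration limit before fully converging. As the warning itself suggests, one option is to raise `max_iter`. The sketch below does this on a tiny made-up corpus; the value 1000 is an arbitrary assumption, and the reported scores in this notebook come from the default settings, not from this variant.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented corpus, for illustration only
texts = ["love it", "great product", "works great",
         "terrible quality", "broke fast", "hate it"]
labels = [1, 1, 1, 0, 0, 0]

X_counts = CountVectorizer().fit_transform(texts)

# Allow more solver iterations than the default
clf = LogisticRegression(max_iter=1000)
clf.fit(X_counts, labels)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged within the limit.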

### Accuracy score

In [22]:
lr_score = lr.score(X_test_dtm, y_test)
print("Results for Logistic Regression with CountVectorizer")
print(lr_score)

Results for Logistic Regression with CountVectorizer
0.9045965457370008


### Predicting the labels for test data

In [23]:
y_pred_lr = lr.predict(X_test_dtm)
y_pred_lr

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

### Confusion matrix

In [24]:
from sklearn.metrics import confusion_matrix

In [25]:
cm_lr = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print(tn, fp, fn, tp)

934 668 376 8965


### True positive and true negative rates

In [26]:
tpr_lr = round(tp/(tp + fn), 4)
tnr_lr = round(tn/(tn+fp), 4)
print(tpr_lr, tnr_lr)

0.9597 0.583
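As a worked check, the rates and the accuracy can be recomputed directly from the four confusion-matrix counts printed above. The true positive rate is tp / (tp + fn), the true negative rate is tn / (tn + fp), and accuracy is (tp + tn) divided by the total.

```python
# Counts reported for logistic regression above
tn, fp, fn, tp = 934, 668, 376, 8965

tpr = tp / (tp + fn)                        # share of positive reviews caught
tnr = tn / (tn + fp)                        # share of negative reviews caught
accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all reviews classified correctly

print(round(tpr, 4), round(tnr, 4), round(accuracy, 4))
# → 0.9597 0.583 0.9046
```

The accuracy matches the score reported earlier, while the much lower true negative rate shows that the model misses a large share of the negative reviews.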


### 2. Support Vector Machine

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=12)

### Vectorizing the text data

In [28]:
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

### Training the model

In [29]:
from sklearn import svm

svcl = svm.SVC()
svcl.fit(ctmTr, y_train)
svcl_score = svcl.score(X_test_dtm, y_test)
print("Results for Support Vector Machine with CountVectorizer")
print(svcl_score)
y_pred_sv = svcl.predict(X_test_dtm)

Results for Support Vector Machine with CountVectorizer
0.9039568674038198


### Confusion matrix

In [30]:
cm_sv = confusion_matrix(y_test, y_pred_sv)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_sv).ravel()
print(tn, fp, fn, tp)

566 948 103 9326


### True positive and true negative rates

In [31]:
tpr_sv = round(tp/(tp + fn), 4)
tnr_sv = round(tn/(tn+fp), 4)
print(tpr_sv, tnr_sv)

0.9891 0.3738


### 3. K Nearest Neighbor

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=148)

### Vectorizing the text data

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

### Training the model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(ctmTr, y_train)
knn_score = knn.score(X_test_dtm, y_test)
print("Results for KNN Classifier with CountVectorizer")
print(knn_score)
y_pred_knn = knn.predict(X_test_dtm)

Results for KNN Classifier with CountVectorizer
0.8536050443205703


### Confusion matrix

In [None]:
cm_knn = confusion_matrix(y_test, y_pred_knn)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()
print(tn, fp, fn, tp)

### True positive and true negative rates

In [None]:
tpr_knn = round(tp/(tp + fn), 4)
tnr_knn = round(tn/(tn+fp), 4)
print(tpr_knn, tnr_knn)

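## Conclusion

Logistic regression was the best of the three classifiers overall. It had the highest accuracy (0.9046, versus 0.9040 for the support vector machine and 0.8536 for KNN) and a much higher true negative rate than the support vector machine (0.583 versus 0.3738). The support vector machine had the higher true positive rate (0.9891 versus 0.9597). No output is shown for the KNN confusion matrix, so its true positive and true negative rates are not available here.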
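One caveat on the comparison: each model above was scored on a different train/test split (`random_state` 20, 12, and 148). A minimal sketch of scoring all three on one shared split follows. It uses a tiny invented corpus in place of `X` and `y`, so its scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Tiny made-up corpus; the notebook would use X and y here instead
pos = ["love it", "great product", "works great", "very soft", "my baby loves it"]
neg = ["broke fast", "terrible quality", "waste of money", "returned it", "do not buy"]
texts = pos * 4 + neg * 4
labels = [1] * 20 + [0] * 20

# One split shared by every model
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the training text only
cv = CountVectorizer()
X_tr_dtm = cv.fit_transform(X_tr)
X_te_dtm = cv.transform(X_te)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the same training data and score on the same test data
scores = {name: m.fit(X_tr_dtm, y_tr).score(X_te_dtm, y_te)
          for name, m in models.items()}
print(scores)
```

With a shared split, differences between the scores reflect the models rather than which reviews happened to land in each test set.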