# Natural Language Processing

## Importing the libraries

In this section, we import the libraries that we will be using in the code.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

## Importing the dataset

The dataset is read and saved as a dataframe using the following snippet.

In [None]:
dataset = pd.read_csv('reviews_triage.csv')

The column names, non-null counts, and data types can be inspected with the info() method.

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8227 entries, 0 to 8226
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              8227 non-null   int64 
 1   Age                      8227 non-null   int64 
 2   Title                    6963 non-null   object
 3   Review Text              8227 non-null   object
 4   Rating                   8227 non-null   int64 
 5   Recommended IND          8227 non-null   int64 
 6   Positive Feedback Count  8227 non-null   int64 
 7   Division Name            8227 non-null   object
 8   Department Name          8227 non-null   object
 9   Class Name               8227 non-null   object
dtypes: int64(5), object(5)
memory usage: 642.9+ KB


In [None]:
dataset.describe()

Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,8227.0,8227.0,8227.0,8227.0,8227.0
mean,923.779263,42.983834,3.449496,0.501519,2.943114
std,195.685255,12.083563,1.369485,0.500028,6.41359
min,1.0,19.0,1.0,0.0,0.0
25%,860.0,34.0,2.0,0.0,0.0
50%,936.0,41.0,3.0,1.0,1.0
75%,1078.0,51.0,5.0,1.0,3.0
max,1204.0,94.0,5.0,1.0,117.0


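The isnull() method flags every missing value, and chaining sum() counts the missing values in each column. Only the Title column has missing entries (1,264 of them).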

In [None]:
dataset.isnull().sum()

Clothing ID                   0
Age                           0
Title                      1264
Review Text                   0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                 0
Department Name               0
Class Name                    0
dtype: int64

In [None]:
dataset.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Clothes,Dresses
1,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Clothes,Dresses
2,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Clothes,Dresses
3,1077,31,Not what it looks like,"First of all, this is not pullover styling. th...",2,0,7,General,Clothes,Dresses
4,697,31,Falls flat,"Loved the material, but i didnt really look at...",3,0,0,Initmates,Intimate,Lounge


Text processing requires string input, so we cast every entry of the Review Text column to str.

In [None]:
dataset['Review Text']=dataset['Review Text'].apply(str)
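One caveat with casting via apply(str) is that a missing value becomes the literal string 'nan'. Review Text has no missing values here, so the notebook is unaffected, but the Title column would be. The following is a minimal sketch on a hypothetical two-row frame (the `toy` data is made up for illustration) showing the difference between casting directly and filling missing values first.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second Title is missing
toy = pd.DataFrame({
    "Title": ["Falls flat", np.nan],
    "Review Text": ["Loved the material", "Runs small"],
})

# Count missing values per column, as dataset.isnull().sum() does above
missing_per_column = toy.isnull().sum()

# Casting directly turns NaN into the literal string 'nan'
as_text = toy["Title"].apply(str)

# Filling first keeps missing titles as empty strings instead
filled = toy["Title"].fillna("").apply(str)

print(missing_per_column)
print(as_text.tolist())
print(filled.tolist())
```

If the Title column were ever added to the text features, filling missing values before the cast would avoid introducing a spurious 'nan' token.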

In [None]:
# Combine department and class name into one space-separated label string
dataset['class'] = dataset['Department Name'] + " " + dataset['Class Name']

In [None]:
dataset.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,class
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Clothes,Dresses,Clothes Dresses
1,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Clothes,Dresses,Clothes Dresses
2,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Clothes,Dresses,Clothes Dresses
3,1077,31,Not what it looks like,"First of all, this is not pullover styling. th...",2,0,7,General,Clothes,Dresses,Clothes Dresses
4,697,31,Falls flat,"Loved the material, but i didnt really look at...",3,0,0,Initmates,Intimate,Lounge,Intimate Lounge


## Cleaning the texts

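In the snippet below, we strip every non-letter character from each review, convert it to lowercase, and remove English stopwords using the nltk package. The word 'not' is kept so that negations survive the cleaning. We then reduce each remaining word to its root form with the Porter stemmer, which we chose because it is simple and efficient.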

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep 'not' so negations survive the cleaning
all_stopwords = set(all_stopwords)

corpus = []
for text in dataset['Review Text']:
  review = re.sub('[^a-zA-Z]', ' ', text)  # replace non-letters with spaces
  review = review.lower().split()
  review = [ps.stem(word) for word in review if word not in all_stopwords]
  corpus.append(' '.join(review))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
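To see what the cleaning step does to a single review, here is a minimal sketch that wraps the same operations in a function. `clean_review` and `DEMO_STOPWORDS` are hypothetical names introduced for illustration, and the tiny hand-written stopword set is an assumption made so the sketch runs without downloading NLTK data. The notebook itself uses NLTK's full English list minus 'not'.

```python
import re
from nltk.stem.porter import PorterStemmer

# Hypothetical mini stopword list, used only to keep this sketch offline;
# the notebook uses NLTK's full English stopword list with 'not' removed.
DEMO_STOPWORDS = {"i", "had", "such", "for", "this", "it", "did"}

stemmer = PorterStemmer()

def clean_review(text, stopwords=DEMO_STOPWORDS):
    """Strip non-letters, lowercase, drop stopwords, and stem what is left."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stopwords)

cleaned = clean_review("I had such high hopes for this dress!")
negated = clean_review("It did not zip.")
print(cleaned)
print(negated)
```

Because 'not' is absent from the stopword set, the negation in the second sentence is preserved in the cleaned text.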


In [None]:
print(corpus)

['high hope dress realli want work initi order petit small usual size found outrag small small fact could not zip reorder petit medium ok overal top half comfort fit nice bottom half tight layer sever somewhat cheap net layer imo major design flaw net layer sewn directli zipper c', 'love traci rees dress one not petit feet tall usual wear p brand dress pretti packag lot dress skirt long full overwhelm small frame not stranger alter shorten narrow skirt would take away embellish garment love color idea style not work return dress', 'dress run small esp zipper area run order sp typic fit tight materi top look feel cheap even pull caus rip fabric pretti disappoint go christma dress year needless say go back', 'first not pullov style side zipper purchas knew side zipper larg bust side zipper next imposs second tull feel look cheap slip awkward tight shape underneath not look like describ sadli return sure find someth exchang', 'love materi didnt realli look long dress purchas larg medium i

In [None]:
len(corpus)

8227

## Creating the Bag of Words model

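CountVectorizer is scikit-learn's bag-of-words feature extractor: it converts a collection of text documents into a matrix of token counts, with one row per document and one column per token. Setting max_features = 5000 keeps only the 5,000 most frequent tokens in the corpus.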

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000)
X = cv.fit_transform(corpus).toarray()
y = dataset[['class']]

In [None]:
len(X[0])

5000
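The bag-of-words representation is easier to follow on a small input. The sketch below fits a CountVectorizer on three hypothetical, already-cleaned reviews (`toy_corpus` is made up for illustration) and prints the learned vocabulary next to the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical reviews, already cleaned and stemmed
toy_corpus = ["dress run small", "love dress", "small small fit"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each token to its column index
col = toy_cv.vocabulary_
tokens_in_column_order = sorted(col, key=col.get)

print(tokens_in_column_order)
print(toy_counts)
```

Each row counts how often every vocabulary token occurs in one review, so the repeated word in the third review shows up as a 2 in its column.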

In [None]:
y.shape

(8227, 1)

In [None]:
# Split each label string on spaces so every word becomes a separate label
y = dataset['class'].str.split(' ')

In [None]:
y

0       [Clothes, Dresses]
1       [Clothes, Dresses]
2       [Clothes, Dresses]
3       [Clothes, Dresses]
4       [Intimate, Lounge]
               ...        
8222      [Bottoms, Pants]
8223      [Tops, Sweaters]
8224      [Tops, Sweaters]
8225         [Tops, Knits]
8226     [Bottoms, Skirts]
Name: class, Length: 8227, dtype: object

MultiLabelBinarizer converts a collection of label sequences into a binary indicator matrix. Each unique label becomes a column, and each entry records whether that label is present (1) or absent (0) for the sample. Because the label strings were split on spaces, multi-word names appear to be broken into separate labels (for example 'Fine' and 'gauge', or 'Casual' and 'bottoms'), which contributes to the 27 columns shown below.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
multilabel = MultiLabelBinarizer()

In [None]:
y = multilabel.fit_transform(y)

In [None]:
y.shape

(8227, 27)

In [None]:
multilabel.classes_

array(['Blouses', 'Bottoms', 'Casual', 'Clothes', 'Dresses', 'Fine',
       'Intimate', 'Jackets', 'Jacks', 'Jeans', 'Knits', 'Layering',
       'Legwear', 'Lounge', 'Lower', 'Outerwear', 'Pants', 'Shorts',
       'Skirts', 'Sleep', 'Style', 'Sweaters', 'Swim', 'Tops', 'Trend',
       'bottoms', 'gauge'], dtype=object)
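The class list above contains fragments such as 'Fine' and 'gauge' because the labels were split on spaces. As a hedged alternative, and not what the notebook does, the sketch below builds one (department, class) tuple per row so that multi-word names stay intact. The three rows in `toy` are hypothetical and only illustrate the idea.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows using the same column names as the dataset
toy = pd.DataFrame({
    "Department Name": ["Clothes", "Tops", "Bottoms"],
    "Class Name": ["Dresses", "Fine gauge", "Casual bottoms"],
})

# One (department, class) tuple per row; nothing is split on spaces
toy_labels = list(zip(toy["Department Name"], toy["Class Name"]))

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)

print(toy_mlb.classes_)
print(toy_y)
```

Every row still carries exactly two labels, but 'Fine gauge' and 'Casual bottoms' each occupy a single column. Adopting this in the notebook would change the number of label columns and therefore the reported scores.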

## Splitting the dataset into the Training set and Test set

In [None]:
# train_test_split was imported at the top; hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

## Training the Naive Bayes model on the Training set

In [None]:
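def j_score(y_true, y_pred):
  # Per-sample Jaccard index: size of the intersection of the true and
  # predicted label sets divided by the size of their union
  jaccard = np.minimum(y_true, y_pred).sum(axis = 1)/np.maximum(y_true, y_pred).sum(axis = 1)
  return jaccard.mean()*100  # average over samples, as a percentage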
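The score is easier to check on a small worked example. The sketch below recomputes the same per-sample Jaccard average on two hypothetical label rows. `jaccard_percent` is a name introduced here, and scoring an empty prediction against an empty truth as a perfect match is an assumption added to avoid a division by zero. That case cannot occur with the notebook's labels, since every review has at least one true label.

```python
import numpy as np

def jaccard_percent(y_true, y_pred):
    """Mean per-sample Jaccard index of two binary indicator matrices, in percent."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    intersection = np.minimum(y_true, y_pred).sum(axis=1)
    union = np.maximum(y_true, y_pred).sum(axis=1)
    # Assumption: empty truth and empty prediction count as a perfect match
    scores = np.where(union == 0, 1.0, intersection / np.maximum(union, 1))
    return scores.mean() * 100

# Row 1: one of two labels matched            -> 1/2
# Row 2: both labels matched, one extra guess -> 2/3
toy_true = [[1, 1, 0, 0], [0, 1, 1, 0]]
toy_pred = [[1, 0, 0, 0], [0, 1, 1, 1]]

toy_score = jaccard_percent(toy_true, toy_pred)
print(round(toy_score, 2))
```

The mean of 1/2 and 2/3 is 7/12, so the sketch reports roughly 58.33. scikit-learn's jaccard_score with average='samples' computes a comparable sample-averaged value as a fraction rather than a percentage.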

In [None]:

def print_score(y_pred, clf):
  print("Clf: ", clf.__class__.__name__)
  print('Jacard score: {}'.format(j_score(y_test, y_pred)))
  print('----')

## Predicting the Test set results

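TfidfVectorizer is another feature extraction tool commonly used in NLP and information retrieval. TF-IDF stands for Term Frequency-Inverse Document Frequency: a term's weight grows with its frequency inside a document and shrinks with the number of documents that contain it, so the value reflects how characteristic the term is of that document relative to the whole collection. Note that the vectorizer below is fitted on the raw Review Text column rather than on the cleaned corpus, relies on its own stop_words='english' filter, and keeps the 1,500 most frequent unigrams, bigrams, and trigrams.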

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', max_features=1500, ngram_range=(1,3), stop_words='english')
X_new = tfidf.fit_transform(dataset['Review Text']).toarray()
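To make the weighting concrete, the sketch below fits a TfidfVectorizer with default settings on three hypothetical documents (`toy_docs` is made up for illustration) and checks the learned inverse document frequencies against the smoothed formula scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical documents
toy_docs = ["dress runs small", "dress fits well", "pants fit well"]

toy_tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
toy_weights = toy_tfidf.fit_transform(toy_docs).toarray()
col = toy_tfidf.vocabulary_  # token -> column index

n_docs = len(toy_docs)
idf_dress = np.log((1 + n_docs) / (1 + 2)) + 1  # 'dress' occurs in 2 documents
idf_small = np.log((1 + n_docs) / (1 + 1)) + 1  # 'small' occurs in 1 document

# Compare the fitted idf values with the hand-computed ones
print(toy_tfidf.idf_[col["dress"]], idf_dress)
print(toy_tfidf.idf_[col["small"]], idf_small)

# Rows are L2-normalized; in document 0 the rarer 'small' outweighs 'dress'
print(toy_weights[0, col["small"]] > toy_weights[0, col["dress"]])
```

The rarer a term is across the collection, the larger its idf, which is why 'small' receives more weight than 'dress' in the first document even though both occur once.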

In [None]:
# Re-split the data, this time using the TF-IDF features instead of the raw counts
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size = 0.20, random_state = 0)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


We instantiate several classifiers so that their results can be compared and the best-performing model selected.

In [None]:
classifier = GaussianNB()
classifier1 = LogisticRegression()
classifierRF = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifierDT = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

OneVsRestClassifier, part of scikit-learn's multiclass module, is a strategy for extending binary classifiers to multiclass and multilabel problems. Each label is treated as a separate binary classification problem: for every label, one binary classifier is trained to distinguish the instances that have the label from those that do not.

In [None]:
from sklearn.multiclass import OneVsRestClassifier

In [None]:
# Add the other classifiers to this list to compare them
for model in [classifierDT]:
  clf = OneVsRestClassifier(model)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print_score(y_pred, model)

Clf:  DecisionTreeClassifier
Jacard score: 51.44368743852341
----
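The one-classifier-per-label idea can be verified on a tiny problem. In the sketch below, the hypothetical arrays `X_toy` and `Y_toy` are constructed so that label j is present exactly when feature j equals 1. After fitting, the wrapper should hold one fitted decision tree per label column.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: label j is present exactly when feature j is 1
X_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

toy_ovr = OneVsRestClassifier(DecisionTreeClassifier(random_state=0))
toy_ovr.fit(X_toy, Y_toy)

# One binary estimator is fitted per label column
n_estimators = len(toy_ovr.estimators_)
toy_pred = toy_ovr.predict(np.array([[1, 0]]))

print(n_estimators)
print(toy_pred)
```

Because each label gets its own independent classifier, a prediction can contain any number of labels, including none. This is why some of the predicted tuples in this notebook are empty.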


In [None]:
y_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

To convert the labels from their binary indicator format back to the original text labels, we use the inverse_transform method.

In [None]:
multilabel.inverse_transform(clf.predict(X_test))

[('Tops',),
 ('Knits', 'Tops'),
 ('Bottoms',),
 ('Knits', 'Sweaters', 'Tops'),
 ('Knits', 'Tops'),
 ('Knits', 'Tops'),
 (),
 ('Lounge',),
 ('Blouses', 'Jackets', 'Tops'),
 ('Jackets', 'Jacks', 'Sweaters', 'Tops'),
 ('Bottoms', 'Pants'),
 ('Jackets',),
 ('Sweaters', 'Tops'),
 ('Skirts', 'Tops'),
 ('Blouses', 'Tops'),
 ('Tops',),
 ('Clothes', 'Dresses'),
 ('Bottoms', 'Pants'),
 ('Clothes', 'Dresses', 'Tops'),
 ('Clothes', 'Dresses'),
 ('Bottoms', 'Clothes', 'Dresses', 'Jackets', 'Jacks'),
 ('Clothes', 'Dresses'),
 ('Knits', 'Tops'),
 ('Clothes', 'Dresses'),
 (),
 ('Blouses', 'Fine', 'Jackets', 'Jacks', 'Knits', 'Tops', 'gauge'),
 ('Bottoms',),
 (),
 ('Clothes', 'Dresses'),
 ('Legwear', 'Tops'),
 ('Bottoms', 'Skirts', 'Sweaters'),
 ('Intimate', 'Jeans'),
 ('Blouses', 'Tops'),
 ('Clothes', 'Dresses'),
 ('Blouses', 'Knits', 'Tops'),
 (),
 ('Tops',),
 ('Blouses', 'Clothes', 'Dresses', 'Knits', 'Tops'),
 ('Tops',),
 ('Clothes', 'Dresses', 'Knits'),
 ('Knits', 'Tops'),
 (),
 ('Jackets', 'Jacks

Now we test the classification with our own review below.

In [None]:
Y = ['i hate this pant']

In [None]:
Yt = tfidf.transform(Y).toarray()

In [None]:
Yt

array([[0., 0., 0., ..., 0., 0., 0.]])

In [None]:
clf.predict(Yt)

array([[0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0]])

In [None]:
multilabel.inverse_transform(clf.predict(Yt))

[('Bottoms', 'Clothes', 'Dresses', 'Pants')]
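The three steps used above (vectorize, predict, inverse-transform) can be bundled into one helper. The sketch below is a self-contained illustration under stated assumptions: the four `toy_reviews`, their labels, and the `predict_labels` helper are hypothetical, and a scikit-learn pipeline stands in for the separately fitted tfidf and clf objects of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy reviews and labels, standing in for the real dataset
toy_reviews = [
    "this dress is lovely",
    "the dress runs small",
    "these pants are lovely",
    "the pants run small",
]
toy_labels = [
    ("Clothes", "Dresses"),
    ("Clothes", "Dresses"),
    ("Bottoms", "Pants"),
    ("Bottoms", "Pants"),
]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)

# The pipeline vectorizes raw text and classifies it in a single call
toy_model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(DecisionTreeClassifier(random_state=0)),
)
toy_model.fit(toy_reviews, toy_y)

def predict_labels(texts):
    """Return one tuple of label names per input text."""
    return toy_mlb.inverse_transform(toy_model.predict(texts))

toy_result = predict_labels(["i hate these pants"])
print(toy_result)
```

Keeping the vectorizer and the classifier in one pipeline ensures that new reviews are always transformed with the vocabulary learned during training.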

**Conclusion:**
We have built a model that classifies garment reviews by the department and class they refer to. With TF-IDF features and a one-vs-rest decision tree, the model reaches a Jaccard score of about 51.4% on the test set. The code is structured so that other classifiers can be swapped in, which makes it possible to compare their scores and select the best-performing one.