## Week 3 - Exercise 2

Author: Khushee Kapoor

Last Updated: 3/4/22

### Setting Up

To start, we have imported the following libraries:

- NumPy: to work with the data
- Pandas: to manipulate the dataframe
- Matplotlib: for data visualization
- Seaborn: for data visualization

In [1]:
# importing the libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

Next, we read the dataset and store it into a dataframe using the read_csv() function from the Pandas library.

In [2]:
# reading the dataset
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')

After that, we view the first few rows of the dataframe to get a glimpse of it. To do this, we use the head() function from the Pandas library.

In [3]:
# viewing the first 5 rows
df.head()

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Positive Feedback Count | Division Name | Department Name | Category | Recommended IND |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 0 | Initmates | Intimate | Intimates | 1 |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 4 | General | Dresses | Dresses | 1 |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | General | Dresses | Dresses | 0 |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 0 | General Petite | Bottoms | Pants | 1 |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 6 | General | Tops | Blouses | 1 |


### Q1. Preprocessing Pipeline

**a. Check whether any null values are present. If so, remove those rows.**

**b. Remove the data for products that have fewer than 5 reviews.**

**c. Clean the data: remove the special characters, replace the contractions with their expansions, convert uppercase characters to lowercase, and remove the punctuation.**

To solve Question 1a, we call the isnull() function from the Pandas library on the dataframe and then apply the sum() function to count the missing values in each column.

In [4]:
# checking for missing values
df.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Positive Feedback Count       0
Division Name                14
Department Name              14
Category                     14
Recommended IND               0
dtype: int64

As we can see, there are missing values in the Title, Review Text, Division Name, Department Name, and Category columns. To remove them, we use the dropna() function from the Pandas library.

In [5]:
# dropping the missing values
df = df.dropna()
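As a small illustration of the two calls above, the sketch below runs them on a hypothetical three-row frame (the data and the names `toy`, `missing_per_column`, and `cleaned` are assumptions made for this example, not part of the dataset). It shows that `dropna()` with no arguments drops a row as soon as any one of its columns is missing.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame, not the real dataset
toy = pd.DataFrame({
    'Title': ['Great', np.nan, 'Too small'],
    'Review Text': ['Love it', 'Nice fabric', np.nan],
    'Rating': [5, 4, 2],
})

# number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))
```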

To solve Question 1b, we create a new column that stores the number of reviews for each product. To do this, we iterate over the unique clothing IDs using the unique() function from the Pandas library, use the loc indexer to select the rows for each unique ID, and use the value_counts() function to look up its count. Then we use boolean indexing to keep only the rows for products that have at least 5 reviews.

In [6]:
# creating a new column
df['number'] = 0

# storing the count of every clothing id
for cloth in df['Clothing ID'].unique():
    df.loc[df['Clothing ID']==cloth, 'number'] = df['Clothing ID'].value_counts()[cloth]

# filtering out the products with fewer than 5 reviews
df = df[df.number>=5]
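The same per-product count can likely be computed without a Python loop. The sketch below is one possible alternative, not the approach used above: it uses `groupby(...).transform('count')` on a hypothetical toy frame. The names `toy`, `min_reviews`, and `filtered` are illustrative, and the threshold is lowered to 3 only to keep the example small.

```python
import pandas as pd

# hypothetical toy frame, not the real dataset
toy = pd.DataFrame({
    'Clothing ID': [1, 1, 1, 2, 2, 3],
    'Rating': [5, 4, 3, 5, 2, 1],
})
min_reviews = 3  # the notebook uses 5; 3 keeps the toy example small

# number of rows sharing each Clothing ID, aligned with the original rows
toy['number'] = toy.groupby('Clothing ID')['Clothing ID'].transform('count')

# keep only the products with at least min_reviews reviews
filtered = toy[toy['number'] >= min_reviews]
print(filtered)
```

Because `transform` returns one value per original row, the result can be assigned to a column directly, and `value_counts()` is not recomputed for every ID.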

To solve Question 1c, we first create a dictionary with all the contractions and their expansions. 

In [7]:
# creating a dictionary with contractions and their expansions
contractions = {
"a'ight":"alright",
"ain't":"are not",
"amn't":"am not",
"aren't":"are not",
"can't":"cannot",
"'cause": "because",
"could've":"could have",
"couldn't":"could not",
"couldn't've":"could not have",
"daren't":"dare not",
"daresn't":"dare not",
"dasn't":"dare not",
"didn't":"did not",
"doesn't":"does not",
"don't":"do not",
"everybody's":"everybody is",
"everyone's":"everyone is",
"giv'n":"given",
"gonna":"going to",
"gon't":"go not", 
"gotta":"got to",
"hadn't":"had not",
"had've":"had have",
"hasn't":"has not",
"haven't":"have not",
"he'd":"he had", 
"he'll":"he will",
"he's":"he is",
"here's":"here is",
"how'd":"how did",
"how'll":"how will",
"how're":"how are",
"how's":"how is",
"I'd":"I had",
"I'd've":"I would have",
"I'd'nt":"I would not",
"I'd'nt've":"I would not have",
"I'll":"I will",
"I'm":"I am",
"I've":"I have",
"isn't":"is not",
"it'd":"it would",
"it'll":"it will",
"it's":"it is",
"let's":"let us",
"ma'am":"madam",
"mayn't":"may not",
"may've":"may have",
"mightn't":"might not",
"might've":"might have",
"mustn't":"must not",
"mustn't've":"must not have",
"must've":"must have",
"needn't":"need not",
"needn't've":"need not have",
"o'clock":"of the clock",
"oughtn't":"ought not",
"oughtn't've":"ought not have",
"shan't":"shall not",
"she'd":"she would",
"she'll":"she will",
"she's":"she is",
"should've":"should have",
"shouldn't":"should not",
"shouldn't've":"should not have",
"somebody's":"somebody is",
"someone's":"someone is",
"something's":"something is",
"so're":"so are",
"so's":"so is",
"so've":"so have",
"that'll":"that will",
"that're":"that are",
"that's":"that is",
"that'd":"that would",
"there'd":"there would",
"there'll":"there will",
"there're":"there are",
"there's":"there is",
"these're":"these are",
"these've":"these have",
"they'd":"they would",
"they'll":"they will",
"they're":"they are",
"they've":"they have",
"this's":"this is",
"those're":"those are",
"those've":"those have",
"to've":"to have",
"wasn't":"was not",
"we'd":"we would",
"we'd've":"we would have",
"we'll":"we will",
"we're":"we are",
"we've":"we have",
"weren't":"were not",
"what'd":"what did",
"what'll":"what will",
"what're":"what are",
"what's":"what is",
"what've":"what have",
"when's":"when is",
"where'd":"where did",
"where'll":"where will",
"where're":"where are",
"where's":"where is",
"where've":"where have",
"which'd":"which would",
"which'll":"which will",
"which're":"which are",
"which's":"which is",
"which've":"which have",
"who'd":"who would",
"who'd've":"who would have",
"who'll":"who will",
"who're":"who are",
"who's":"who is",
"who've":"who have",
"why'd":"why did",
"why're":"why are",
"why's":"why is",
"won't":"will not",
"would've":"would have",
"wouldn't":"would not",
"wouldn't've":"would not have",
"y'at":"you at",
"yes'm":"yes madam",
"you'd":"you would",
"you'll":"you will",
"you're":"you are",
"you've":"you have"}

Next, we define a custom function to convert the contractions to their expansions. This function loops over the previously created dictionary and uses the replace() method of Python strings to substitute each contraction with its expansion.

In [8]:
# custom made function to convert contraction to expansion
def cont_to_exp(x):
    if type(x) is str:
        x = x.replace('\\','')
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x
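Plain `replace()` calls are case-sensitive and depend on the order of the dictionary: a key such as "I'm" does not match "i'm", and "couldn't" is replaced before "couldn't've" can match. The sketch below shows one way this could be handled with a single regular expression. It is an alternative under stated assumptions, not the function used in this notebook; `sample_contractions`, `lookup`, `pattern`, and `expand_contractions` are hypothetical names, and only a small subset of the dictionary is used.

```python
import re

# small hypothetical subset of the contractions dictionary
sample_contractions = {
    "couldn't": "could not",
    "couldn't've": "could not have",
    "she'd": "she would",
    "he'd": "he had",
    "I'm": "I am",
}

# lower-case the keys so lookups work regardless of the input's casing
lookup = {key.lower(): value for key, value in sample_contractions.items()}

# longest keys first, so "couldn't've" is tried before "couldn't"
ordered_keys = sorted(lookup, key=len, reverse=True)

# the lookarounds stop "he'd" from matching inside "she'd"
pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in ordered_keys) + r")(?!\w)",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # leave non-string values (for example NaN) untouched
    if not isinstance(text, str):
        return text
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)

expanded = expand_contractions("i'm sure she'd say it couldn't've fit")
print(expanded)
```

The text is scanned once instead of once per dictionary entry, and the replacement is taken from the dictionary as written, so "i'm" becomes "I am".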

After that, we apply the function to every observation in the Review Text and Title columns. To do that, we use the apply() function from the Pandas library with a lambda function.

In [9]:
# converting contraction to expansion in every row of the review and title column
df['Review Text'] = df['Review Text'].apply(lambda x:cont_to_exp(x))
df['Title'] = df['Title'].apply(lambda x:cont_to_exp(x))

Next, we define a custom function to convert upper case to lower case. This function calls the lower() method of Python strings.

In [10]:
# custom made function to convert upper case to lower
def upper_to_lower(x):
    if type(x) is str:
        return x.lower()
    else:
        return x

After that, we apply the function to every observation in the Review Text and Title columns. To do that, we use the apply() function from the Pandas library with a lambda function.

In [11]:
# converting upper case to lower from every row of the review and title column
df['Review Text'] = df['Review Text'].apply(lambda x:upper_to_lower(x))
df['Title'] = df['Title'].apply(lambda x:upper_to_lower(x))

Next, we define a custom function to remove punctuation. This function loops over every punctuation character in the string.punctuation constant from the string module and uses the replace() method to replace each one with an empty string.

In [12]:
# custom made function to remove punctuations
import string
def remove_punctuations(text):
    if type(text) is str:
        for p in string.punctuation:
            text = text.replace(p, '')
        return text
    else:
        return text 
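The same removal can be expressed as a single pass with a translation table. The sketch below is an alternative to the loop above, not a replacement for it. It relies on `str.maketrans` and `str.translate` from the standard library; `punctuation_table` and `strip_punctuation` are hypothetical names.

```python
import string

# translation table that deletes every character in string.punctuation
punctuation_table = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    # leave non-string values (for example NaN) untouched
    if not isinstance(text, str):
        return text
    return text.translate(punctuation_table)

stripped = strip_punctuation("love, love, love this jumpsuit!")
print(stripped)
```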

After that, we apply the function to every observation in the Review Text and Title columns. To do that, we use the apply() function from the Pandas library with a lambda function.

In [13]:
# removing punctuations from every row of the review and title column
df['Review Text'] = df['Review Text'].apply(lambda x:remove_punctuations(x))
df['Title'] = df['Title'].apply(lambda x:remove_punctuations(x))

Next, we define a custom function to remove the special characters. This function loops over every character in a custom string containing the special characters and uses the replace() method to replace each one with an empty string.

In [14]:
# custom made function to remove special characters
def remove_special_chars(text):
    # raw string so that the backslash is kept as a literal character
    special_chars = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    if type(text) is str:
        for p in special_chars:
            text = text.replace(p, '')
        return text
    else:
        return text

After that, we apply the function to every observation in the Review Text and Title columns. To do that, we use the apply() function from the Pandas library with a lambda function.

In [15]:
# removing special characters from every row of the review and title column
df['Review Text'] = df['Review Text'].apply(lambda x:remove_special_chars(x))
df['Title'] = df['Title'].apply(lambda x:remove_special_chars(x))

To further analyze the sentiment regarding the product, we use the TextBlob package to calculate the polarity of the title and the review.

In [16]:
# importing the text blob package
from textblob import TextBlob

# computing the polarity of the title and review text
df['polarity_title'] = df['Title'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['polarity_review'] = df['Review Text'].apply(lambda x: TextBlob(x).sentiment.polarity)

### Q2. Separate the columns into dependent and independent variables (or features and labels). Then you split those variables into train and test sets (80:20).

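To solve Question 2, we first drop the columns that we do not use as features using the drop() function from the Pandas library. These are the identifier columns (Unnamed: 0 and Clothing ID), the newly created number column, and the Department Name, Division Name, and Category columns.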

In [17]:
# dropping the irrelevant columns
df = df.drop(columns=['Unnamed: 0', 'Clothing ID', 'number', 'Department Name', 'Division Name', 'Category'])

Next, we split the data into the independent variables (x) and the dependent variable (y).

In [18]:
# splitting the data into independent and dependent variables
x = df.drop(columns=['Recommended IND'])
y = df['Recommended IND']

Next, we use the train_test_split() function from the sklearn library and divide the dataset into training and testing sets. We set the train_size parameter to 0.8 to ensure that 80% of the dataset is put into training.

In [19]:
# dividing the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=105)
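To make the effect of `train_size=0.8` concrete, the sketch below splits a hypothetical ten-row frame with the same arguments and checks the resulting sizes. The `toy_` names and data are assumptions made for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy features and labels, not the real dataset
toy_x = pd.DataFrame({'Age': range(20, 30)})
toy_y = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], name='Recommended IND')

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, train_size=0.8, random_state=105
)

# 80% of the rows go to training, the remaining 20% to testing
print(len(toy_x_train), len(toy_x_test))
```

Features and labels are shuffled together, so each row of `toy_x_train` keeps the same index as its label in `toy_y_train`.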

After that, we segregate the numerical and categorical columns to easily perform the appropriate preprocessing on the variables.

In [20]:
# selecting the numerical columns
numerical_cols = [cname for cname in x.columns if x[cname].dtype in ['int64', 'float64']]

# selecting the categorical columns
categorical_cols = [cname for cname in x.columns if x[cname].dtype == 'object']
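The same split by data type can also be written with `select_dtypes()`. The sketch below is a possible alternative on a hypothetical toy frame (`toy`, `numerical`, and `non_numerical` are illustrative names). It selects everything that is not numeric rather than testing for the `object` dtype, on the assumption that the text columns are the only non-numeric ones.

```python
import pandas as pd

# hypothetical toy frame, not the real dataset
toy = pd.DataFrame({
    'Age': [33, 34],
    'polarity_review': [0.63, 0.34],
    'Title': ['my favorite buy', 'flattering shirt'],
})

# integer and float columns
numerical = toy.select_dtypes(include='number').columns.tolist()

# every remaining column (here, the text column)
non_numerical = toy.select_dtypes(exclude='number').columns.tolist()

print(numerical, non_numerical)
```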

### Q3. Apply the Naïve Bayes Classification Algorithm on Sentiment category to predict if item is recommended.

To solve Question 3, we form the following pipeline:

                                      Independent Variables
                                    ___________|__________
                                   |                      |
                               Numerical             Categorical
                                   |                      |
                                Scaling              Vectorizing
                                   |                      |
                                   |               One Hot Encoding
                                   |______________________|
                                               |
               Dependent Variable -- Naive Bayes Classifier
                                               |
                                             Output

To do this, we use the following classes from the scikit-learn library:

- Pipeline: to create the pipeline
- ColumnTransformer: to aggregate the preprocessing steps for the numerical and categorical columns
- RobustScaler: to scale the numerical values
- CountVectorizer: to vectorize the text and generate the top 10 unigrams and bigrams
- OneHotEncoder: to one-hot-encode the categorical values
- GaussianNB: to build the Naive Bayes classifier model

In [21]:
# importing libraries for preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# scaling the numerical columns 
numerical_transformer = Pipeline(steps = [
    ('scaler', RobustScaler())
])

# vectorizing and one-hot encoding the categorical columns
categorical_transformer = Pipeline(steps=[
    ('vectorizer', CountVectorizer(stop_words='english', ngram_range=(1,2), max_features=10)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols),
      ])

# building model for prediction
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# bundle preprocessing and modeling code in a pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor), 
                              ('model', model)])
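CountVectorizer works on a one-dimensional sequence of documents, so inside a ColumnTransformer it is usually given a single column name as a string rather than a list of columns. GaussianNB, in turn, expects dense input. The sketch below illustrates both points on hypothetical toy data. It is a minimal example under those assumptions, not the pipeline built above; `toy`, `sketch_preprocessor`, and `sketch_pipe` are illustrative names, and the one-hot encoding step is left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    'Age': [25, 40, 33, 51, 29, 46],
    'Rating': [5, 1, 4, 2, 5, 1],
    'Review Text': ['love this dress', 'poor fit and cheap fabric',
                    'nice soft fabric', 'cheap and itchy',
                    'love the fit', 'poor quality'],
    'Recommended IND': [1, 0, 1, 0, 1, 0],
})
toy_x = toy.drop(columns=['Recommended IND'])
toy_y = toy['Recommended IND']

sketch_preprocessor = ColumnTransformer(
    transformers=[
        # a list of names passes a 2D frame to the scaler
        ('num', RobustScaler(), ['Age', 'Rating']),
        # a single string passes a 1D column of documents to the vectorizer
        ('text', CountVectorizer(max_features=10), 'Review Text'),
    ],
    sparse_threshold=0,  # always return a dense array for GaussianNB
)

sketch_pipe = Pipeline(steps=[('preprocessor', sketch_preprocessor),
                              ('model', GaussianNB())])
sketch_pipe.fit(toy_x, toy_y)

features = sketch_pipe.named_steps['preprocessor'].transform(toy_x)
predictions = sketch_pipe.predict(toy_x)
print(features.shape, predictions)
```

With more than one text column, one option is to add a separate vectorizer entry per column, each with its own column name.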

Following that, we fit the pipeline to the training data using its fit() method.

In [22]:
# fitting the training data 
pipe.fit(x_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   RobustScaler())]),
                                                  ['Age', 'Rating',
                                                   'Positive Feedback Count',
                                                   'polarity_title',
                                                   'polarity_review']),
                                                 ('cat',
                                                  Pipeline(steps=[('vectorizer',
                                                                   CountVectorizer(max_features=10,
                                                                                   ngram_range=(1,
                                                                                                2),
                       

### Q4. Tabulate accuracy in terms of accuracy, precision, recall, and F1 score.

To solve Question 4, we use the pipeline's score() method and the precision_score(), recall_score(), and f1_score() functions from the SKLearn library to calculate the accuracy, precision, recall, and F1 score on the test data.

In [23]:
# importing libraries for obtaining evaluation metrics
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# computing and printing the accuracy
print(str.format('Accuracy: {:.2f}%', pipe.score(x_test, y_test)*100))

# predicting the value of y from the testing set
y_pred = pipe.predict(x_test)

# computing and printing the precision
print(str.format('Precision: {:.2f}%', precision_score(y_test, y_pred)*100))

# computing and printing the recall
print(str.format('Recall: {:.2f}%', recall_score(y_test, y_pred)*100))

# computing and printing the f1 score
print(str.format('F1 Score: {:.2f}%', f1_score(y_test, y_pred)*100))

Accuracy: 92.47%
Precision: 97.69%
Recall: 92.97%
F1 Score: 95.27%
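Since the question asks for the metrics to be tabulated, one option is to collect them in a small dataframe. The sketch below shows the layout on hypothetical labels and predictions (`toy_true`, `toy_pred`, and `metrics_table` are illustrative names); with the notebook's data, `y_test` and `y_pred` would take their place.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# hypothetical labels and predictions, only to show the table layout
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# one row per metric, scores expressed as percentages
metrics_table = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score (%)': [
        accuracy_score(toy_true, toy_pred) * 100,
        precision_score(toy_true, toy_pred) * 100,
        recall_score(toy_true, toy_pred) * 100,
        f1_score(toy_true, toy_pred) * 100,
    ],
}).round(2)

print(metrics_table.to_string(index=False))
```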


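As we can see, all four scores are above 92% on the test set. Precision (97.69%) is higher than recall (92.97%), which means the model is rarely wrong when it predicts that an item is recommended, but it misses about 7% of the recommended items.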