# Introduction to Computer Vision: E-Commerce Text Classification

## Problem Statement

### Context

Product categorization, also referred to as product classification, is a field of study within natural language processing (NLP). It is also one of the biggest challenges for e-commerce companies. With the advancement of AI technology, researchers have been applying machine learning to product categorization problems.

Product categorization is the placement and organization of products into their respective categories. In that sense, it sounds simple: choose the correct department for a product. However, this process is complicated by the sheer volume of products on many e-commerce platforms. Furthermore, many products could belong to multiple categories.

There are many reasons why product categorization is important for e-commerce and marketing. Through the accurate classification of your products, you can increase conversion rates, strengthen your search engine, and improve your site’s Google ranking.

### Objective

The aim of the project is to build a classification model that can accurately classify product descriptions into the predefined categories of "Household," "Clothing & Accessories," "Electronics," and "Books."

### Data Dictionary

* Label - Class name
* Text - Description from the e-commerce website  



## **Please read the instructions carefully before starting the project.**

This is a commented Python Notebook file in which all the instructions and tasks to be performed are mentioned.

* Blanks '_______' are provided in the notebook and need to be filled with the appropriate code to get the correct result

* With every '_______' blank, there is a comment that briefly describes what needs to be filled in the blank space

* Identify the task to be performed correctly and only then proceed to write the required code

* Fill the code wherever asked by the commented lines like "# write your code here" or "# complete the code"

* Running incomplete code may throw an error

* Please run the code cells sequentially from the beginning to avoid unnecessary errors

* Add the results/observations derived from the analysis in the presentation and submit the same in .pdf format

## Importing necessary libraries

In [None]:
# Install and import the necessary libraries.

!pip install contractions

import re, string, unicodedata                          # Import Regex, string and unicodedata.
import contractions                                     # Import contractions library.
from bs4 import BeautifulSoup                           # Import BeautifulSoup.

import numpy as np                                      # Import numpy.
import pandas as pd                                     # Import pandas.
import nltk                                             # Import Natural Language Toolkit.
import seaborn as sns                                   # Import seaborn.
import matplotlib.pyplot as plt                         # Import Matplotlib.

nltk.download('stopwords')                              # Download the stop word lists.
nltk.download('punkt')                                  # Download the tokenizer models.
nltk.download('punkt_tab')                              # Tokenizer tables that newer NLTK releases look for in word_tokenize.
nltk.download('wordnet')                                # Download WordNet for the lemmatizer.
nltk.download('omw-1.4')                                # Download the Open Multilingual WordNet data.

from nltk.corpus import stopwords                       # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize  # Import tokenizers.
from nltk.stem import LancasterStemmer, WordNetLemmatizer   # Import stemmer and lemmatizer.
from wordcloud import WordCloud, STOPWORDS              # Import WordCloud and its stop words.
from sklearn.feature_extraction.text import CountVectorizer # Import CountVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer # Import TF-IDF vectorizer.
from sklearn.model_selection import train_test_split    # Import train test split.
from sklearn.model_selection import cross_val_score     # Import cross val score.
from sklearn.ensemble import RandomForestClassifier     # Import Random Forest Classifier.
from sklearn.metrics import confusion_matrix            # Import confusion matrix.

## Loading the dataset

In [None]:
# Mount Google drive to access the dataset  (Run this code, if you are using Google Colab)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data = pd.read_csv("________")               # Complete the code to read the dataset

## Data Overview

The initial steps to get an overview of any dataset are to:
- Observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
- Get information about the number of rows and columns in the dataset
- Find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.

### Check the head and tail of the data

In [None]:
data.__________()                            # Complete the code to display the first 5 rows of the dataset
data.__________()                            # Complete the code to display the last 5 rows of the dataset

### Understand the shape of the dataset

In [None]:
data.___________                             # Complete the code to get the shape of the data (shape is an attribute, so it takes no parentheses)

### Checking for Missing Values

In [None]:
data._______                                 # Complete the code to check for missing values in the data

## Exploratory Data Analysis

### Univariate Analysis

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

#### Percentage of text for each label

In [None]:
labeled_barplot(data, "_________", perc=True)         # Complete the code to plot the labeled barplot for Label
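The percentages that the labeled barplot prints are simply each class count divided by the total number of rows. A minimal standard-library sketch of that arithmetic is shown here; the `toy_labels` list and the `label_percentages` helper are made up for illustration and are not part of the dataset or the notebook.

```python
from collections import Counter


def label_percentages(labels):
    """Return the share of each label as a percentage rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * count / total, 1) for label, count in counts.items()}


# Made-up labels (assumption): 10 rows spread unevenly over the four categories.
toy_labels = (
    ["Household"] * 4
    + ["Books"] * 3
    + ["Electronics"] * 2
    + ["Clothing & Accessories"] * 1
)

percentages = label_percentages(toy_labels)
print(percentages)
```

A strongly uneven distribution in the real data would be worth noting before modeling, since plain accuracy can look good while a small class is classified poorly.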

### Bivariate Analysis

#### Wordcloud for each product category

In [None]:
def show_wordcloud(data, title):
    text = ' '.join(data['Text'].astype(str).tolist())                               # Join every entry of the Text column into one string
    stop_words = set(STOPWORDS)                                                      # Stop words shipped with the wordcloud package

    fig_wordcloud = WordCloud(stopwords=stop_words, background_color='white',        # Configure and generate the word cloud
                              colormap='viridis', width=800, height=600).generate(text)

    plt.figure(figsize=(14,11), frameon=True)
    plt.imshow(fig_wordcloud)
    plt.axis('off')
    plt.title(title, fontsize=30)
    plt.show()

**Household product**

In [None]:
show_wordcloud(data[data.Label=='_________'],'Summary Word_Cloud')    # Complete the code to get the word cloud for Household

**Clothing & Accessories product**

In [None]:
show_wordcloud(data[data.Label=='_________'],'Summary Word_Cloud')    # Complete the code to get the word cloud for Clothing & Accessories

**Electronics product**

In [None]:
show_wordcloud(data[data.Label=='_________'],'Summary Word_Cloud')    # Complete the code to get the word cloud for Electronics

**Books product**

In [None]:
show_wordcloud(data[data.Label=='_________'],'Summary Word_Cloud')    # Complete the code to get the word cloud for Books

## Data Preparation for Modeling

- HTML tag removal.
- Number removal.
- Tokenization.
- Removal of special characters and punctuation.
- Conversion to lowercase.
- Lemmatization or stemming.
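To make the order of these steps concrete, here is a small standard-library sketch that runs them on one made-up description. It is only an illustration under stated assumptions: the regex tokenizer, the `TOY_STOP_WORDS` set, and the `clean_description` helper are hypothetical stand-ins for the BeautifulSoup and NLTK calls used in the notebook, and lemmatization is left out because it needs the WordNet data.

```python
import re
import string
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Tiny illustrative stop word set (assumption); NLTK's English list is much longer.
TOY_STOP_WORDS = {"a", "an", "the", "for", "with", "and", "of", "is"}


def clean_description(raw_text):
    """Apply the listed steps (except lemmatization) to one description."""
    extractor = _TextExtractor()
    extractor.feed(raw_text)
    extractor.close()
    text = " ".join(extractor.parts)             # 1. HTML tag removal
    text = re.sub(r"\d+", "", text)              # 2. number removal
    tokens = re.findall(r"\w+|[^\w\s]", text)    # 3. naive tokenization
    cleaned = []
    for token in tokens:
        # Fold accented characters to their closest ASCII form.
        token = unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")
        token = token.strip(string.punctuation)  # 4. punctuation removal
        token = token.lower()                    # 5. conversion to lowercase
        if token and token not in TOY_STOP_WORDS:
            cleaned.append(token)
    return cleaned


sample = "<p>The <b>Caf\u00e9</b> table, 120 cm, with 4 chairs!</p>"
cleaned_tokens = clean_description(sample)
print(cleaned_tokens)
```

Removing numbers before tokenizing, as the list does, means sizes such as "120 cm" lose their value but keep the unit, which may or may not be desirable for a given category.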


### Remove HTML Tags

In [None]:
# Code to remove the HTML tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

data['text'] = data['___'].apply(____________)                        # Complete the code to apply the strip_html function on the text column
data._______()                                                        # Complete the code to display the head of the data

### Remove numbers

In [None]:
def remove_numbers(text):
    # Delete every run of digits from the text
    return re.sub(r'\d+', '', text)

data['Text'] = data['_____'].apply(remove_numbers)                    # Complete the code to apply the above function on the text column
data.head()

### Apply Tokenization

In [None]:
data['_____'] = data.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1) # Complete the code to apply tokenization on text column

### Applying lowercase and removing stopwords and punctuation

**All the preprocessing steps in one function**

In [None]:
stop_words = set(stopwords.words('english'))        # Named stop_words so the imported nltk stopwords module is not overwritten

In [None]:
def remove_non_ascii(words):
    # Remove non-ASCII characters from a list of tokenized words
    return [
        unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        for word in words
    ]

def remove_punctuation(words):
    # Remove punctuation from a list of tokenized words, dropping tokens that become empty
    stripped = (re.sub(r'[^\w\s]', '', word) for word in words)
    return [word for word in stripped if word != '']

def to_lowercase(words):
    # Convert every token in the list to lowercase
    return [word.lower() for word in words]


def remove_stopwords(words):
    # Remove English stop words from a list of tokenized words
    return [word for word in words if word not in stop_words]


def lemmatize_list(words):
    # Lemmatize each token with WordNet (the default part of speech is noun)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]


def normalize(words):
    # Run every cleaning step on a list of tokens and join the result back into one string
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

data['Text'] = data.apply(lambda row: normalize(row['Text']), axis=1)
data.head()

## Model Building

### Using CountVectorizer

In [None]:
# Vectorization (convert text data to numbers).

Count_vec = ______________(max_features=_____)                # Complete the code to initialize the CountVectorizer with max_features = 5000.
data_features = Count_vec._____(data['_____'])                # Complete the code to fit and transform Count_vec on the text column

data_features = data_features._______()                       # Complete the code to convert the sparse matrix into an array

In [None]:
data_features.___________                                     # Complete the code to check the shape of the data features
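For intuition about what the count-based features contain, the following standard-library sketch builds a small bag-of-words matrix by hand. It is a simplified illustration, not scikit-learn's implementation: the `count_vectorize` helper, the whitespace tokenization, the alphabetical tie-breaking, and the toy documents are all assumptions made for this example.

```python
from collections import Counter


def count_vectorize(documents, max_features=None):
    """Return (vocabulary, matrix) of raw term counts for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    corpus_counts = Counter(token for tokens in tokenized for token in tokens)

    # Keep the most frequent terms across the corpus (ties broken alphabetically),
    # then sort the surviving terms so column order is stable.
    ranked = sorted(corpus_counts.items(), key=lambda item: (-item[1], item[0]))
    vocabulary = sorted(term for term, _ in ranked[:max_features])

    matrix = []
    for tokens in tokenized:
        doc_counts = Counter(tokens)
        matrix.append([doc_counts[term] for term in vocabulary])
    return vocabulary, matrix


# Made-up, already-cleaned descriptions (assumption).
toy_docs = ["cotton shirt cotton", "steel pan", "cotton pan cover"]
vocab, counts = count_vectorize(toy_docs, max_features=3)

print(vocab)
for row in counts:
    print(row)
```

Each row corresponds to one description and each column to one vocabulary term, so the shape of the matrix is (number of documents, number of kept features). Terms that fall outside the `max_features` cut, such as "steel" here, simply do not get a column.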

#### Store Independent and Dependent variables

In [None]:
X = _____________                                             # Complete the code to get the independent variable (data_features) stored as X

y = data.__________                                           # Complete the code to get the dependent variable (Label) stored as Y

#### Split the data into train and test

In [None]:
# Split data into training and testing set.

X_train, X_test, y_train, y_test = _________(__, __, test_size=___, random_state=42)   # Complete the code to split X and y into train and test data
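The idea behind a seeded train/test split can be sketched with the standard library alone: shuffle the row indices once with a fixed seed, then set aside a fraction of them for testing. The `simple_train_test_split` helper and the toy rows are hypothetical, and the exact rows selected will differ from what scikit-learn picks for the same seed.

```python
import math
import random


def simple_train_test_split(features, labels, test_size=0.25, seed=42):
    """Shuffle row indices once, then carve off the first chunk as the test set."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # seeded, so the split is repeatable
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up feature rows and labels (assumption); the label depends on the first value.
toy_X = [[i, i * 2] for i in range(8)]
toy_y = ["Books" if i % 2 else "Household" for i in range(8)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(toy_X, toy_y, test_size=0.25)
print(len(X_tr), len(X_te))
```

Fixing the seed is what makes results comparable between runs: the same rows land in the test set every time, so a change in the score reflects a change in the model rather than in the split.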

#### Random Forest Model

In [None]:
# Using Random Forest to build a model for the classification of product descriptions.

forest = ____________(n_estimators=____, n_jobs=4)            # Initialize the Random Forest Classifier

forest = ______.____(______, _______)                         # Fit the forest variable on X_train and y_train

print(forest)

print(np.mean(_______________(forest, X, y, cv=10)))          # Calculate cross validation score

##### Optimize the parameter: the number of trees in the random forest model (n_estimators)

In [None]:
# Find the optimal number of base learners using k-fold CV
base_ln = list(range(1, 25))

In [None]:
# K-fold cross-validation
cv_scores = []                                                                             # Initialize an empty list to store the scores
for b in base_ln:
    clf = _______________(n_estimators = b)                                                # Complete the code to apply the Random Forest Classifier
    scores = ___________(_____, ______, _______, cv = 5, scoring = '___________')          # Complete the code to find the cross-validation score on the classifier (clf) for accuracy
    cv_scores.append(scores.mean())                                                        # Append the mean score to the cv_scores list

In [None]:
# Plot the error as the number of trees increases
error = [1 - x for x in cv_scores]                                 # Error corresponding to each number of estimators
optimal_learners = base_ln[error.index(min(error))]                # Select the number of estimators with the minimum error
plt.plot(base_ln, error)                                           # Plot the misclassification error against the number of estimators
xy = (optimal_learners, min(error))
plt.annotate('(%s, %s)' % xy, xy = xy, textcoords='data')
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()
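The scores averaged in the loop above come from k-fold cross-validation: the rows are partitioned into k folds, and each fold takes one turn as the validation set while the rest are used for training. A minimal sketch of that partitioning follows. The `k_fold_indices` helper is hypothetical and uses plain contiguous folds, whereas scikit-learn stratifies the folds by class when an integer `cv` is passed with a classifier.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra row so every row is used once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Because every row is held out exactly once, the mean of the k validation scores is a steadier estimate than a single split, which is why the loop stores `scores.mean()` for each candidate number of trees.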

In [None]:
# Train the best model and calculate its accuracy on the test data.
clf = _________(n_estimators = _____________)                     # Initialize the Random Forest classifier with the optimal number of learners
___.____(____, ___)                                               # Fit the classifier on X_train and y_train
___.____(____, ___)                                               # Find the score on X_test and y_test

In [None]:
# Predict the result for test data using the model built above.
result = _____.predict(_______)                                   # Complete the code to predict the X_test data using the tuned model built above (clf)

In [None]:
# Print the confusion matrix

conf_mat = ________(___________, _________)                       # Complete the code to calculate the confusion matrix between test data and result

print(conf_mat)                                                   # Print confusion matrix
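When reading the printed matrix, it helps to remember that scikit-learn puts the true classes on the rows and the predicted classes on the columns, so correct predictions sit on the diagonal. The sketch below rebuilds such a matrix by hand for a few made-up predictions; the `confusion_counts` helper and the toy label lists are hypothetical and only meant to show how the cells are counted.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in the order given by labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(true, pred)] for pred in labels] for true in labels]


# Made-up ground truth and predictions (assumption).
toy_true = ["Books", "Books", "Electronics", "Household", "Household", "Household"]
toy_pred = ["Books", "Household", "Electronics", "Household", "Household", "Books"]
toy_labels = ["Books", "Electronics", "Household"]

toy_matrix = confusion_counts(toy_true, toy_pred, toy_labels)
for label, row in zip(toy_labels, toy_matrix):
    print(label, row)

# Accuracy is the diagonal (correct predictions) divided by the total count.
correct = sum(toy_matrix[i][i] for i in range(len(toy_labels)))
toy_accuracy = correct / len(toy_true)
print(toy_accuracy)
```

Off-diagonal cells show which pairs of categories the model confuses, which is often more informative for the presentation than the accuracy figure alone.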

#### Wordcloud of top 40 important features from the CountVectorizer + Random Forest model

In [None]:
all_features = Count_vec.get_feature_names_out()                 # Get the feature names (vocabulary terms) from the vectorizer
top_features=''                                                  # String that collects the top 40 features after training the model
feat=clf.feature_importances_
features=np.argsort(feat)[::-1]                                  # Feature indices sorted from most to least important
for i in features[0:40]:
    top_features+=all_features[i]
    top_features+=','

print(top_features)

print(" ")
print(" ")

# Complete the code by applying wordcloud on top features
# (stored as top_features_cloud so the wordcloud package name is not overwritten)
top_features_cloud = ________(background_color="white",colormap='viridis',width=2000,height=1000).generate(_______)

In [None]:
# Display the generated image:
plt.figure(figsize=(14, 11))                                     # Create the figure before drawing so the size takes effect
plt.imshow(top_features_cloud, interpolation='bilinear')
plt.title('Top 40 features WordCloud', fontsize=20)
plt.axis("off")
plt.show()

### Using TF-IDF (Term Frequency- Inverse Document Frequency)

In [None]:
# Using TfidfVectorizer to convert text data to numbers.

tfidf_vect = ______________(max_features=_____)                          # Complete the code to initialize the TF-IDF vectorizer with max_features = 5000.
data_features = tfidf_vect.fit_transform(data['Text'])                   # Fit and transform the TF-IDF vectorizer on the preprocessed Text column

data_features = data_features._______()                                  # Complete the code to convert the sparse matrix into an array

In [None]:
data_features.___________                                                # Complete the code to check the shape of the data features
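TF-IDF differs from raw counts in that a term's count is scaled down when the term appears in many documents. The sketch below computes the weights by hand, assuming the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as its default. The `tfidf_vectorize` helper, the whitespace tokenization, and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf_vectorize(documents):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = [term_freq[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0   # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocabulary, rows


# Made-up, already-cleaned descriptions (assumption).
toy_docs = ["cotton shirt", "cotton pan", "steel pan"]
tfidf_vocab, tfidf_rows = tfidf_vectorize(toy_docs)

print(tfidf_vocab)
for row in tfidf_rows:
    print([round(w, 3) for w in row])
```

In the first toy document, "shirt" occurs in only one description while "cotton" occurs in two, so "shirt" receives the larger weight even though both appear once. That is the behavior that lets rarer, more category-specific words stand out.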

#### Store Independent and Dependent variables

In [None]:
X = _____________                                                        # Complete the code to get the independent variable (data_features) stored as X

y = data.__________                                                      # Complete the code to get the dependent variable (Label) stored as Y

#### Split the data into train and test

In [None]:
# Split data into training and testing set.

X_train, X_test, y_train, y_test = _________(__, __, test_size=___, random_state=____)   # Complete the code to split X and y into train and test data

#### Random Forest Model

In [None]:
# Using Random Forest to build a model for the classification of product descriptions.

forest = ____________(n_estimators=____, n_jobs=4)            # Initialize the Random Forest Classifier

forest = ______.____(______, _______)                         # Fit the forest variable on X_train and y_train

print(forest)

print(np.mean(_______________(forest, X, y, cv=10)))          # Calculate cross validation score

##### Optimize the parameter: the number of trees in the random forest model (n_estimators)

In [None]:
# Find the optimal number of base learners using k-fold CV
base_ln = list(range(1, 25))

In [None]:
# K-fold cross-validation
cv_scores = []                                                                             # Initialize an empty list to store the scores
for b in base_ln:
    clf = _______________(n_estimators = b)                                                # Complete the code to apply the Random Forest Classifier
    scores = ___________(_____, ______, _______, cv = 5, scoring = '___________')          # Complete the code to find the cross-validation score on the classifier (clf) for accuracy
    cv_scores.append(scores.mean())                                                        # Append the mean score to the cv_scores list

In [None]:
# Plot the error as the number of trees increases
error = [1 - x for x in cv_scores]                                 # Error corresponding to each number of estimators
optimal_learners = base_ln[error.index(min(error))]                # Select the number of estimators with the minimum error
plt.plot(base_ln, error)                                           # Plot the misclassification error against the number of estimators
xy = (optimal_learners, min(error))
plt.annotate('(%s, %s)' % xy, xy = xy, textcoords='data')
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()

**Plot the misclassification error for each number of estimators (Hint: reuse the code that plotted the misclassification error for the CountVectorizer model)**

In [None]:
# Train the best model and calculate its accuracy on the test data.
clf = _________(n_estimators = _____________)                     # Initialize the Random Forest classifier with the optimal number of learners
___.____(____, ___)                                               # Fit the classifier on X_train and y_train
___.____(____, ___)                                               # Find the score on X_test and y_test

In [None]:
# Predict the result for test data using the model built above.
result = _____.predict(_______)                                   # Complete the code to predict the X_test data using the tuned model built above (clf)

#### Wordcloud of top 40 important features from the TF-IDF + Random Forest model

In [None]:
all_features = tfidf_vect.get_feature_names_out()          # Get the feature names (vocabulary terms) from the vectorizer
top_features=''                                            # String that collects the top 40 features after training the model
feat=clf.feature_importances_
features=np.argsort(feat)[::-1]                            # Feature indices sorted from most to least important
for i in features[0:40]:
    top_features+=all_features[i]
    top_features+=', '

print(top_features)

print(" ")
print(" ")

# Complete the code by applying wordcloud on top features
# (stored as top_features_cloud so the wordcloud package name is not overwritten)
top_features_cloud = ________(background_color="white",colormap='viridis',width=2000,height=1000).generate(_______)

In [None]:
# Display the generated image:
plt.figure(figsize=(14, 11))                               # Create the figure before drawing so the size takes effect
plt.imshow(top_features_cloud, interpolation='bilinear')
plt.title('Top 40 features WordCloud', fontsize=20)
plt.axis("off")
plt.show()

## Summary

---



## Happy Learning!