In [29]:
import pandas as pd
import numpy as np
import time
import re
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk

nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Adeel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Adeel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# About Dataset

---

The dataset used contains customer reviews from a women’s clothing e-commerce platform. It includes various columns such as Review Text, Rating, Age, and Clothing ID. For this task, the primary focus was on the **Review Text** column, which holds customer feedback about different clothing products. These reviews were used to perform natural language processing tasks like One Hot Encoding and Bag of Words, helping to convert the text data into numerical form for further analysis.

In [30]:

data=pd.read_csv(r"C:\Users\Adeel\Desktop\NLP\Womens Clothing E-Commerce Reviews.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


## Preprocessing
---
Preprocessing is an essential step in natural language processing that cleans and standardizes text data. In this case, the text from customer reviews was converted to lowercase so that words like "Great" and "great" are treated as the same token. Punctuation was removed to focus only on meaningful words. The text was then split on whitespace into individual words (tokens), and common English stopwords like "the", "and", and "is" were removed to reduce noise. The cleaned tokens give the One Hot Encoding and Bag of Words representations a smaller, more consistent vocabulary.

In [40]:

# Use a sample of non-null review texts
sample_reviews = data['Review Text'].dropna().sample(10, random_state=1).tolist()

custom_stopwords = set(stopwords.words('english'))

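# Preprocessing function
def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stopwords."""
    text = text.lower()
    # Remove punctuation. Apostrophes go too, so "wouldn't" becomes "wouldnt",
    # which no longer matches NLTK's stopword entry "wouldn't" and is kept.
    text = re.sub(r'[^\w\s]', '', text)
    tokens = text.split()
    return [word for word in tokens if word not in custom_stopwords]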
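
To make the effect of each preprocessing step visible, here is a minimal sketch on a single made-up sentence. It is self-contained, so it uses a tiny stand-in stopword set (`DEMO_STOPWORDS`, an assumption for illustration) instead of NLTK's English list, and the function name `preprocess_demo` is hypothetical.

```python
import re

# Stand-in stopword set (assumption): the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"the", "and", "is", "this", "it", "a"}

def preprocess_demo(text, stop_words=DEMO_STOPWORDS):
    """Same steps as preprocess(): lowercase, strip punctuation, split, filter."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drops commas, apostrophes, "!" and so on
    tokens = text.split()
    return [word for word in tokens if word not in stop_words]

example = "This dress is GREAT, and the fit wouldn't be better!"
tokens = preprocess_demo(example)
print(tokens)
# ['dress', 'great', 'fit', 'wouldnt', 'be', 'better']
```

Note that the contraction survives as `wouldnt`, which matches the `wouldnt` column that appears in the Bag of Words output later in the notebook.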







### 🔹 **Task 1: One Hot Encoding**
---

**What was the task?**  
The goal was to implement **One Hot Encoding** for a dataset of consumer reviews on clothing products. Each review had to be represented as a binary vector over the vocabulary, where each position indicates whether the corresponding word is present in that review.

**How did you do that?**  
- First, we dropped null values from the `Review Text` column and randomly sampled 10 of the remaining reviews (`random_state=1`).
- Each review was **preprocessed**: converted to lowercase, punctuation removed, and stopwords filtered out.
- A **vocabulary** of unique words was created from the cleaned text.
- For each review, we generated a binary vector where each index represents a word from the vocabulary. If the word exists in the review, that index is marked `1`; otherwise, `0`.
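
The steps above can be sketched on a toy corpus small enough to check by hand. The two token lists in `toy_reviews` are made up for illustration, and the sketch uses plain Python lists rather than NumPy arrays so that it runs without extra dependencies.

```python
# Two already-preprocessed reviews (hypothetical data for illustration).
toy_reviews = [
    ["soft", "dress", "soft"],
    ["dress", "runs", "small"],
]

# Vocabulary: sorted unique words across all reviews.
toy_vocab = sorted({word for review in toy_reviews for word in review})
toy_word2idx = {word: idx for idx, word in enumerate(toy_vocab)}

def one_hot(tokens, word2idx):
    """Binary presence vector: 1 if the word occurs at least once, else 0."""
    vector = [0] * len(word2idx)
    for token in tokens:
        vector[word2idx[token]] = 1
    return vector

toy_one_hot = [one_hot(review, toy_word2idx) for review in toy_reviews]

print(toy_vocab)    # ['dress', 'runs', 'small', 'soft']
print(toy_one_hot)  # [[1, 0, 0, 1], [1, 1, 1, 0]]
```

"soft" occurs twice in the first review but is still encoded as `1`, because this representation records presence only, not frequency.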

**Library functions used:**  
- `pandas` for data loading and manipulation  
- `numpy` for array and vector operations  
- `re` (regular expressions) for punctuation removal  
- `time` to measure execution time  
- `OneHotEncoder` from `scikit-learn` is imported but not used; the encoding is built manually with `numpy`

**What are the results?**  
- The one-hot encoded matrix has a shape of `(10, N)`, where 10 is the number of reviews and `N` is the number of unique words (vocabulary size). In the recorded run, `N` is 256.
- Each row in the matrix represents the word presence (1 or 0) for a given review.

**Execution Time:**  
Approximately **0.006 seconds** in the recorded run.


In [38]:

start_time_1hot = time.time()

# Preprocess and create vocabulary
processed_reviews = [preprocess(review) for review in sample_reviews]
vocab = sorted(set(word for review in processed_reviews for word in review))
word2idx = {word: idx for idx, word in enumerate(vocab)}

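# One-hot encode each review
def one_hot_encode(tokens, word2idx):
    """Return a binary vector marking which vocabulary words occur in tokens."""
    vector = np.zeros(len(word2idx), dtype=int)
    for token in tokens:
        idx = word2idx.get(token)  # dict lookup instead of scanning the vocab list
        if idx is not None:
            vector[idx] = 1
    return vector

one_hot_matrix = np.array([one_hot_encode(review, word2idx) for review in processed_reviews])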

end_time_1hot = time.time()

print("Task 1: One Hot Encoding")
print("Vocabulary Size:", len(vocab))
print("One Hot Matrix Shape:", one_hot_matrix.shape)
print("One Hot Matrix:\n", one_hot_matrix)
print("Execution Time:", round(end_time_1hot - start_time_1hot, 4), "seconds\n")



Task 1: One Hot Encoding
Vocabulary Size: 256
One Hot Matrix Shape: (10, 256)
One Hot Matrix:
 [[0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [0 1 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 1]
 [0 0 0 ... 0 0 0]]
Execution Time: 0.006 seconds
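
A single measurement of a few milliseconds can vary noticeably between runs. As a hedged alternative to the `time.time()` calls used in this notebook, the sketch below uses `time.perf_counter()`, which the standard library documents as the clock for measuring short durations, and keeps the best of several repeats. The helper name `time_call` and the `repeats` parameter are hypothetical.

```python
import time

def time_call(func, *args, repeats=5):
    """Run func(*args) several times; return its result and the fastest duration in seconds."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# Usage on a trivial workload; any function and arguments can be passed instead.
result, seconds = time_call(sorted, [3, 1, 2])
print(result, f"{seconds:.6f} seconds")
```

Taking the minimum rather than the mean reduces the influence of one-off delays from other processes, at the cost of reporting a best case.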



### 🔹 **Task 2: Bag of Words (BoW)**
---
**What was the task?**  
The objective was to implement a **Bag of Words** model for customer feedback. We needed to count how frequently each word appears in each review.

**How did you do that?**  
- We used the same preprocessed text from Task 1.
- Each cleaned review was converted back to a string (from token list).
- We used `CountVectorizer` from `sklearn` to build the BoW matrix.
- The matrix counts how many times each word from the vocabulary appears in each review.
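
To show what counting means here, the following sketch builds a Bag of Words matrix by hand for a made-up two-review corpus, using `collections.Counter` from the standard library. It illustrates the idea only; it is not how `CountVectorizer` is implemented, and the names `toy_reviews` and `bag_of_words` are hypothetical.

```python
from collections import Counter

# Two already-preprocessed reviews (hypothetical data for illustration).
toy_reviews = [
    ["soft", "dress", "soft"],
    ["dress", "runs", "small"],
]

toy_vocab = sorted({word for review in toy_reviews for word in review})

def bag_of_words(tokens, vocab):
    """Count vector: how many times each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]  # Counter returns 0 for missing words

toy_bow = [bag_of_words(review, toy_vocab) for review in toy_reviews]

print(toy_vocab)  # ['dress', 'runs', 'small', 'soft']
print(toy_bow)    # [[1, 0, 0, 2], [1, 1, 1, 0]]
```

The first row ends in `2` because "soft" appears twice in that review; a one-hot encoding of the same review would record `1` in that position.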

**Library functions used:**  
- `CountVectorizer` from `sklearn.feature_extraction.text`  
- `pandas`, `numpy`, `re` as before  
- `time` to measure execution time  

**What are the results?**  
- The resulting matrix shape is `(10, M)`, where 10 is the number of reviews and `M` is the number of distinct tokens kept by `CountVectorizer` (255 in the recorded run, one fewer than the Task 1 vocabulary).
- Each cell in the matrix shows how many times a word appears in a specific review.

**Execution Time:**  
Approximately **0.0081 seconds** in the recorded run.




In [28]:
start_time_bow = time.time()

# Re-join processed tokens to string for CountVectorizer
cleaned_strings = [" ".join(review) for review in processed_reviews]

# Apply CountVectorizer
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(cleaned_strings).toarray()
bow_df = pd.DataFrame(bow_matrix, columns=vectorizer.get_feature_names_out())

end_time_bow = time.time()

print("Task 2: Bag of Words")
print("BoW Matrix Shape:", bow_df.shape)
print(bow_df.head())
print("Execution Time:", round(end_time_bow - start_time_bow, 4), "seconds")

Task 2: Bag of Words
BoW Matrix Shape: (10, 255)
   102  135lbs  32e  40  52  54  628  absolutely  according  actually  ...  \
0    0       0    0   0   0   0    0           0          0         0  ...   
1    1       0    0   0   1   0    0           0          0         0  ...   
2    0       1    1   0   0   1    0           1          0         0  ...   
3    0       0    0   0   0   0    0           0          0         0  ...   
4    0       0    0   0   0   0    0           0          1         0  ...   

   worth  would  wouldnt  xl  xs  xxs  yay  yeah  years  zipper  
0      0      0        0   0   0    0    0     0      0       0  
1      0      0        0   0   1    2    0     0      0       0  
2      0      0        0   0   0    0    0     0      0       0  
3      0      0        0   0   0    0    1     0      0       0  
4      0      0        0   0   0    0    0     0      0       0  

[5 rows x 255 columns]
Execution Time: 0.0081 seconds
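
The BoW matrix has 255 columns, while the Task 1 vocabulary has 256 words. One plausible cause, not verified against this sample, is tokenization: `CountVectorizer` uses the default `token_pattern` `r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more word characters, whereas Task 1 splits on whitespace and keeps single-character tokens. The sketch below reproduces that difference with the `re` module on a made-up string.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

cleaned = "size 4 fits xs"  # hypothetical cleaned review text

whitespace_tokens = cleaned.split()                           # Task 1 style
pattern_tokens = re.findall(DEFAULT_TOKEN_PATTERN, cleaned)   # CountVectorizer style

print(whitespace_tokens)  # ['size', '4', 'fits', 'xs']
print(pattern_tokens)     # ['size', 'fits', 'xs']
```

If the two vocabularies need to match, passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` keeps single-character tokens as well.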


# --------------------------The End-----------------------------