# **Assignment 2: Milestone I Natural Language Processing**
## **Task 2 & 3**
#### **Student Name**:
#### **Student ID**:


**Environment**: Python 3 and Jupyter notebook

**Libraries used**: 
* pandas
* numpy
* collections.Counter
* sklearn.feature_extraction.text.CountVectorizer
* sklearn.feature_extraction.text.TfidfVectorizer
* gensim.models.KeyedVectors
* gensim.scripts.glove2word2vec.glove2word2vec

## **Introduction**

The first stage of the notebook addresses **Task 2: Generating Feature Representations** from Milestone I of the assignment. The goal is to convert the cleaned reviews from Task 1 into **numerical representations** that machine learning models can use directly for classification.  

Specifically, I implement two categories of representations:  

- **Bag-of-Words Model** – each review is transformed into a sparse vector of token counts, aligned with the curated vocabulary from Task 1. This representation is exported as `count_vectors.txt`.  
- **Word Embedding Models** – each review is encoded using pretrained GloVe embeddings, with two variants:  
  - **Unweighted average embeddings**, where all tokens contribute equally.  
  - **TF-IDF weighted embeddings**, where more informative tokens receive higher importance.  

By the end of this stage, I obtain multiple complementary feature representations of the reviews. These representations capture both surface-level token frequencies and deeper semantic meaning, providing the foundation for classification experiments in Task 3.


## Importing libraries 

In [None]:
import pandas as pd
import numpy as np
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

## Task 2. Generating Feature Representations for Clothing Items Reviews

With the preprocessing completed in Task 1, I now move on to generating feature representations of the reviews.  
The goal of this step is to transform the cleaned text into numerical formats that can be used by machine learning models in Task 3. 

### 2.1 Load Processed Data

Before I start building feature representations, I need to load the output from Task 1. The file `processed.csv` contains the cleaned and tokenized reviews that I generated earlier. From this file, I focus on the `Review Text` column, which holds the processed reviews.  

I will convert it into a list of strings so it can be directly used in the different representation methods that follow in Task 2.

In [21]:
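df = pd.read_csv("processed.csv")
# Replace missing reviews with an empty string first; otherwise astype(str)
# would turn NaN into the literal token "nan".
reviews = df["Review Text"].fillna("").astype(str).tolist()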

print("Head of reviews:")
print(reviews[:5])

Head of reviews:
['high hopes wanted work initially petite usual found outrageously fact zip reordered petite medium half nicely bottom half tight layer cheap net layers imo major design flaw net layer sewn directly zipper', 'jumpsuit fun flirty fabulous time compliments', 'shirt due adjustable front tie length leggings sleeveless pairs cardigan shirt', 'tracy reese dresses petite feet tall brand pretty package lot skirt long full overwhelmed frame stranger alterations shortening skirt embellishment garment idea style work returned', 'basket hte person store pick teh pale hte gorgeous turns prefectly baggy hte xs hte bummer petite decided ejans pants skirts oops']


Next, I load the vocabulary file that I created in Task 1.  

The file `vocab.txt` stores all valid tokens from the processed reviews, each mapped to a unique integer index. This mapping is essential because it ensures that every feature representation I build in Task 2 will use a **consistent and reproducible index space**.  

Here, I read the file line by line, split each entry into the token and its index, and store them in a Python dictionary called `vocab`. This dictionary will serve as the reference for constructing bag-of-words vectors and other feature encodings.

In [22]:
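vocab = {}
with open("vocab.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:  # skip blank lines, e.g. an empty last line in the file
            continue
        # Split on the last colon only, so a token may itself contain ":"
        word, idx = line.rsplit(":", 1)
        vocab[word] = int(idx)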

print("\nHead of vocab:")
for i, (word, idx) in enumerate(vocab.items()):
    if i >= 5:
        break
    print(f"{word}: {idx}")


Head of vocab:
a-cup: 0
a-flutter: 1
a-frame: 2
a-kind: 3
a-line: 4
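
The preview shows that the vocabulary contains hyphenated tokens such as `a-line`. One detail I would double-check before passing this vocabulary to a scikit-learn vectorizer is the tokenizer: the documented default `token_pattern` of `CountVectorizer` and `TfidfVectorizer` is `(?u)\b\w\w+\b`, which splits on hyphens and drops single-character tokens. The sketch below applies that pattern with plain `re` to a made-up review. The sample text and the whitespace pattern are assumptions for illustration only, not part of the assignment.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers.
SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: treat every whitespace-separated chunk as one token.
WHITESPACE_PATTERN = r"(?u)\S+"

sample_review = "a-line dress fits a-cup nicely"  # made-up example text

default_tokens = re.findall(SKLEARN_DEFAULT_PATTERN, sample_review)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, sample_review)

print(default_tokens)     # ['line', 'dress', 'fits', 'cup', 'nicely']
print(whitespace_tokens)  # ['a-line', 'dress', 'fits', 'a-cup', 'nicely']
```

If the hyphenated vocabulary entries are meant to receive counts, passing `token_pattern=r"(?u)\S+"` (or a custom `analyzer`) to the vectorizer is one option. Whether this matches the Task 1 tokenizer would need to be verified against the Task 1 code.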


### 2.2 Bag-of-Words Model: Count Vector Representation

I begin by generating the **bag-of-words (BoW) representation** for each review. This method transforms each review into a sparse vector where: 
- The **index** corresponds to a token from the vocabulary created in Task 1.  
- The **value** represents how many times that token appears in the review.  

To ensure consistency, I explicitly pass in the vocabulary so the feature indices match exactly with the saved `vocab.txt`. After constructing the count vectors, I save them in a sparse format (`count_vectors.txt`).  

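Each line in this file corresponds to one review and follows the format `#<review index>,<token index>:<count>,<token index>:<count>,...`, listing only the tokens whose count is non-zero.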

In [23]:
vectorizer = CountVectorizer(vocabulary=vocab)
X_counts = vectorizer.fit_transform(reviews)

# Save in sparse format
with open("count_vectors.txt", "w", encoding="utf-8") as f:
    for i, row in enumerate(X_counts):
        # row is a sparse vector
        indices = row.nonzero()[1]
        counts = row.data
        entries = [f"{idx}:{cnt}" for idx, cnt in zip(indices, counts)]
        f.write(f"#{i}," + ",".join(entries) + "\n")


This structure allows me to preserve the vector representation in a compact form, while still making it easy to interpret and reload later.
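
As a minimal sketch of how such a line could be read back, the helper below parses one line into the review index and a dictionary of token counts. The function name `parse_count_line` is hypothetical, and the example string is the second line of the file preview printed later in this notebook.

```python
def parse_count_line(line):
    """Parse one line of the form '#<review index>,<token index>:<count>,...'."""
    head, *entries = line.strip().split(",")
    review_index = int(head.lstrip("#"))
    counts = {}
    for entry in entries:
        if not entry:  # a review with no vocabulary tokens leaves an empty tail
            continue
        token_index, count = entry.split(":")
        counts[int(token_index)] = int(count)
    return review_index, counts

# Second line of the count_vectors.txt preview shown in this notebook.
example = "#1,1287:1,2284:1,2502:1,2667:1,3403:1,6739:1"
review_index, counts = parse_count_line(example)
print(review_index, counts)
```

A dictionary keeps the representation sparse. Rebuilding a full matrix from these dictionaries would additionally require the vocabulary size.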

### 2.3 Models Based on Word Embeddings

#### 2.3.1 Pretrained Embedding Model

For the embedding-based representations, I use the pretrained **GloVe (Global Vectors for Word Representation)** model. The file `glove.6B.300d.txt` contains 300-dimensional word vectors in GloVe's own text format, so I first convert it with `glove2word2vec` from `gensim` into the **Word2Vec format**, which `gensim` can load directly.
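
To make the difference between the two text formats concrete: as far as I understand, the Word2Vec text format is the GloVe format with one extra header line holding the vocabulary size and the vector dimension. The sketch below reproduces that idea on a tiny made-up file using only the standard library. The helper `add_word2vec_header` and the toy vectors are hypothetical and not part of `gensim`.

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe-style text file and prepend '<num_words> <dim>' as a header."""
    with open(glove_path, "r", encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    num_words = len(lines)
    dim = len(lines[0].split()) - 1  # the first field is the word itself
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_words} {dim}\n")
        dst.writelines(lines)
    return num_words, dim

# Toy 3-dimensional "embeddings" (made-up values, for illustration only).
toy_glove = "dress 0.1 0.2 0.3\nskirt 0.4 0.5 0.6\n"

with tempfile.TemporaryDirectory() as tmp:
    glove_file = os.path.join(tmp, "toy_glove.txt")
    w2v_file = os.path.join(tmp, "toy_word2vec.txt")
    with open(glove_file, "w", encoding="utf-8") as f:
        f.write(toy_glove)
    shape = add_word2vec_header(glove_file, w2v_file)
    with open(w2v_file, "r", encoding="utf-8") as f:
        header = f.readline().strip()

print(shape, header)
```

Recent `gensim` releases (4.x) also accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which, if available in the installed version, should read the original GloVe file directly without a separate conversion step.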

In [None]:
glove_input_file = "glove.6B.300d.txt" 
word2vec_output_file = "glove.6B.300d.word2vec.txt" 
glove2word2vec(glove_input_file, word2vec_output_file)


(400000, 300)

After converting GloVe into Word2Vec format, I now load the embeddings into memory using `gensim`. The model provides a dictionary-like structure where each word is mapped to its 300-dimensional vector representation.

In [None]:
embedding_path = "glove.6B.300d.word2vec.txt"
w2v_model = KeyedVectors.load_word2vec_format(embedding_path, binary=False)

embedding_dim = w2v_model.vector_size
print("Embedding dimension:", embedding_dim)

Embedding dimension: 300


This embedding model will later be used to construct both unweighted and TF-IDF weighted feature representations for the reviews.

#### 2.3.2 Unweighted Embedding Representation

The first embedding-based representation I build is the **unweighted average of word vectors**. In this approach, each review is represented by taking the mean of all the word embeddings found in the review.

Concept Overview:
- For every token in a review, I check if it exists in the pretrained GloVe model.  
- If it does, I collect its 300-dimensional vector.  
- I then compute the simple average across all available word vectors in that review.  
- If a review contains no known tokens, I return a zero vector of the same dimension.  

This results in one fixed-size vector per review, providing a straightforward way to capture the overall semantic meaning of the text without applying any weighting scheme.
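
Before applying this to the real model, the steps above can be checked on a toy example. The sketch below uses a plain dictionary of made-up 3-dimensional vectors as a stand-in for the GloVe model. The names `toy_model` and `toy_avg_vector` are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the GloVe model.
toy_model = {
    "dress": np.array([1.0, 0.0, 2.0]),
    "pretty": np.array([3.0, 2.0, 0.0]),
}
toy_dim = 3

def toy_avg_vector(tokens, model, dim):
    """Mean of the vectors of known tokens, or a zero vector if none is known."""
    vectors = [model[t] for t in tokens if t in model]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# The unknown token "xyzzy" is skipped, so only two vectors are averaged.
print(toy_avg_vector(["pretty", "dress", "xyzzy"], toy_model, toy_dim))  # [2. 1. 1.]
# No known tokens at all: the fallback zero vector is returned.
print(toy_avg_vector(["xyzzy"], toy_model, toy_dim))                     # [0. 0. 0.]
```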

In [26]:
def get_avg_vector(tokens):
    vectors = [w2v_model[word] for word in tokens if word in w2v_model]
    if not vectors:
        return np.zeros(embedding_dim)
    return np.mean(vectors, axis=0)

unweighted_embeddings = np.array([get_avg_vector(review.split()) for review in reviews])

To confirm the structure of the result, I preview the first few embeddings:

In [27]:
print("Head of unweighted embeddings:")
print(unweighted_embeddings[:5])

Head of unweighted embeddings:
[[-0.13585053  0.16533092 -0.09622453 ... -0.12143915  0.03876616
  -0.06293373]
 [-0.0585745  -0.23308516 -0.04500633 ...  0.19549799  0.16249667
  -0.07392985]
 [-0.12962705 -0.04213319 -0.08205573 ...  0.15072682  0.36137545
  -0.07374223]
 [-0.16844843 -0.14789265 -0.0754644  ...  0.00641364  0.22104792
   0.11514494]
 [-0.06090566 -0.12363915 -0.17390808 ...  0.12790534  0.1592525
   0.12398838]]


The preview shows one dense vector of real values per review, and none of the first five rows is all zeros, so each of these reviews contains at least one token known to GloVe.

#### 2.3.3 TF-IDF Weighted Embedding Representation

Next, I improve on the simple averaging method by applying **TF-IDF weighting** to the word embeddings. The idea here is to give more importance to informative words while reducing the influence of very common ones.


Concept Overview:

- I first fit a `TfidfVectorizer` on the reviews, using the same vocabulary from Task 1, and extract the IDF weight of every token.  
- For each review, I look up the embeddings of the tokens that appear in both the GloVe model and the IDF dictionary.  
- Each word vector is weighted by the IDF of its token. Because a repeated token contributes once per occurrence, its total weight is effectively its term frequency multiplied by its IDF.  
- I then take the weighted average of these vectors to form the final review embedding.  
- If a review has no matching tokens, I assign a zero vector of the same dimension.  

This weighted representation helps capture not only the semantic meaning of words, but also their relative importance across the entire collection of reviews.
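
The cell below weights every token occurrence by its IDF, so a token that appears twice contributes its IDF twice. The sketch here, on made-up data, illustrates that this should coincide with weighting each unique token once by term frequency times IDF. The toy corpus, the 2-dimensional vectors, and the function names are all hypothetical. The IDF formula is the smoothed variant documented for scikit-learn, which I assume matches the default settings used in this notebook.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical tokenized corpus and 2-dimensional embeddings (made-up values).
toy_reviews = [["dress", "dress", "pretty"], ["dress", "long"], ["dress", "long", "long"]]
toy_model = {
    "dress": np.array([1.0, 0.0]),
    "pretty": np.array([0.0, 1.0]),
    "long": np.array([1.0, 1.0]),
}

# Smoothed IDF as documented for scikit-learn: ln((1 + n) / (1 + df)) + 1.
n_docs = len(toy_reviews)
doc_freq = Counter(token for review in toy_reviews for token in set(review))
toy_idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def per_occurrence_average(tokens):
    """Weight every token occurrence by its IDF (repeats are counted again)."""
    vectors = np.array([toy_model[t] for t in tokens])
    weights = np.array([toy_idf[t] for t in tokens])
    return np.average(vectors, axis=0, weights=weights)

def tf_idf_average(tokens):
    """Weight every unique token once, by term frequency times IDF."""
    tf = Counter(tokens)
    vectors = np.array([toy_model[t] for t in tf])
    weights = np.array([tf[t] * toy_idf[t] for t in tf])
    return np.average(vectors, axis=0, weights=weights)

review = toy_reviews[0]
print(per_occurrence_average(review))
print(tf_idf_average(review))
```

In this toy corpus, "dress" occurs in every review and therefore has the lowest IDF, so the weighted average of the first review moves away from "dress" and towards the rarer "pretty" compared with a plain mean.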

In [28]:
tfidf = TfidfVectorizer(vocabulary=vocab)
tfidf_matrix = tfidf.fit_transform(reviews)
idf_dict = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def get_weighted_vector(tokens):
    vectors = []
    weights = []
    for word in tokens:
        if word in w2v_model and word in idf_dict:
            vectors.append(w2v_model[word])
            weights.append(idf_dict[word])
    if not vectors:
        return np.zeros(embedding_dim)
    vectors = np.array(vectors)
    weights = np.array(weights)
    return np.average(vectors, axis=0, weights=weights)

weighted_embeddings = np.array([get_weighted_vector(review.split()) for review in reviews])

To confirm the structure of the result, I preview the first few embeddings:

In [29]:
print("Head of weighted embeddings:")
print(weighted_embeddings[:5])

Head of weighted embeddings:
[[-0.11905508  0.16162047 -0.07476648 ... -0.12663434  0.04404344
  -0.09266121]
 [-0.05981297 -0.28089033 -0.06609663 ...  0.19276437  0.19332011
  -0.07271967]
 [-0.12593867 -0.03113094 -0.04780481 ...  0.13701874  0.36218534
  -0.06994134]
 [-0.14754247 -0.17754009 -0.04563789 ...  0.01184478  0.18642595
   0.10937177]
 [-0.01375175 -0.12893063 -0.13782937 ...  0.11002514  0.11803029
   0.1281771 ]]


The preview again shows one dense vector per review. The values differ slightly from the unweighted embeddings, which is expected because the IDF weights shift each average towards rarer tokens.

### Saving outputs

I already saved `count_vectors.txt` in Section 2.2. Here I print its first five lines to verify the format.

In [30]:
file_path = "count_vectors.txt"

try:
    with open(file_path, "r", encoding="utf-8") as file:
        for _ in range(5):  # Display the first 5 lines
            print(file.readline().strip())
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

#0,687:1,1028:1,1716:1,1792:1,2289:1,2481:1,2602:1,2892:2,3010:1,3087:1,3193:1,3258:1,3549:2,3552:1,3832:1,3934:1,4224:2,4234:1,4427:1,4639:2,5260:1,5668:1,6726:1,7092:1,7207:1,7406:1,7520:1,7522:1
#1,1287:1,2284:1,2502:1,2667:1,3403:1,6739:1
#2,86:1,925:1,1988:1,2646:1,3584:1,3595:1,4506:1,5736:2,5924:1,6716:1
#3,179:1,721:1,1950:1,2083:1,2373:1,2610:1,2657:1,2711:1,3168:1,3707:1,3748:1,4472:1,4484:1,4639:1,4912:1,5176:1,5332:1,5764:1,5900:2,6290:1,6366:1,6548:1,6809:1,7406:1
#4,408:1,449:1,818:1,1630:1,2050:1,2803:1,3120:4,4363:1,4513:1,4528:1,4628:1,4639:1,4663:1,4883:1,5903:1,6274:1,6601:1,6907:1,7469:1


Upon inspection, each previewed line starts with `#` and the review index, followed by comma-separated `token index:count` pairs, so the file follows the required format.
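
Since the preview covers only five lines, a programmatic check over the whole file could complement it. The sketch below validates the shape of a single line with a regular expression and checks that every token index falls inside the vocabulary. The names `LINE_PATTERN`, `is_valid_count_line`, and the placeholder vocabulary size are assumptions for illustration.

```python
import re

# One line: '#<review index>' followed by zero or more ',<token index>:<count>' entries.
# A trailing comma with no entries (as written for an empty review) is also accepted.
LINE_PATTERN = re.compile(r"#\d+(,\d+:\d+)*,?")

def is_valid_count_line(line, vocab_size):
    """Check the line shape and that every token index lies in [0, vocab_size)."""
    line = line.strip()
    if LINE_PATTERN.fullmatch(line) is None:
        return False
    indices = [int(entry.split(":")[0]) for entry in line.split(",")[1:] if entry]
    return all(0 <= idx < vocab_size for idx in indices)

assumed_vocab_size = 8000  # placeholder; in the notebook this would be len(vocab)

print(is_valid_count_line("#1,1287:1,2284:1,2502:1", assumed_vocab_size))  # True
print(is_valid_count_line("#1,1287-1", assumed_vocab_size))                # False
print(is_valid_count_line("#2,9000:1", assumed_vocab_size))                # False
```

Applied to every line of `count_vectors.txt`, a check like this would flag malformed lines or out-of-range indices that a five-line preview cannot reveal.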

## Task 3. Clothing Review Classification

...... Sections and code blocks on building classification models based on different document feature representations. 
Detailed comparisons and evaluations of different models to answer each question as per specification. 

<span style="color: red"> You might have complex notebook structure in this section, please feel free to create your own notebook structure. </span>

In [31]:
# Code to perform the task...


## Summary
Give a short summary of the assessment tasks here, along with anything else you would like to discuss.

## Couple of notes for all code blocks in this notebook
- Please provide proper comments on your code.
- Please restart and run all cells to make sure the code is runnable, and include your output in the submission.   
<span style="color: red"> This markdown block can be removed once the task is completed. </span>