# **Assignment 2: Milestone I Natural Language Processing**
## **Task 2 & 3**
#### **Student Name**:
#### **Student ID**:


**Environment**: Python 3 and Jupyter notebook

**Libraries used**: 
* pandas
* numpy
* collections.Counter
* sklearn.feature_extraction.text.CountVectorizer
* sklearn.feature_extraction.text.TfidfVectorizer
* gensim.models.KeyedVectors
* gensim.scripts.glove2word2vec.glove2word2vec

## **Introduction**

The first stage of the notebook addresses **Task 2: Generating Feature Representations** from Milestone 1 of the assignment. The goal is to convert the cleaned reviews from Task 1 into **numerical representations** that can be directly used by machine learning models for classification.  

Specifically, I implement two categories of representations:  

- **Bag-of-Words (BoW)** – Each review is converted into a sparse vector of token counts using the curated vocabulary built in Task 1. This frequency-based representation is saved in count_vectors.txt.
- **Embedding-based Representations** – Each review is encoded using pretrained FastText embeddings (cc.en.300.vec), with two variants:
  - **Unweighted average embeddings** – All known tokens contribute equally to the final review vector.
  - **TF-IDF weighted embeddings** – Tokens are weighted by their inverse document frequency (IDF) to emphasize more informative words.

By the end of this stage, I obtain multiple complementary feature representations of the reviews. These representations capture both surface-level token frequencies and deeper semantic meaning, providing the foundation for classification experiments in Task 3.


## Importing libraries 

In [23]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy import sparse
from gensim.models import KeyedVectors

## Task 2. Generating Feature Representations for Clothing Items Reviews

With the preprocessing completed in Task 1, I now move on to generating feature representations of the reviews.  
The goal of this step is to transform the cleaned text into numerical formats that can be used by machine learning models in Task 3. 

### 2.1 Load Processed Data

Before I start building feature representations, I need to load the output from Task 1. The file `processed.csv` contains the cleaned and tokenized reviews that I generated earlier. From this file, I focus on the `Review Text` column, which holds the processed reviews.

I will convert it into a list of strings so it can be directly used in the different representation methods that follow in Task 2.

In [24]:
df = pd.read_csv("t1_outputs/processed.csv")
df = df.dropna(subset=["Review Text"])
df = df[df["Review Text"].str.strip().astype(bool)]
reviews = df["Review Text"].astype(str).tolist()

print("Head of reviews:")
print(reviews[:5])

Head of reviews:
['high hopes wanted work initially petite usual found outrageously fact zip reordered petite medium half nicely bottom half tight layer cheap net layers imo major design flaw net layer sewn directly zipper', 'jumpsuit fun flirty fabulous time compliments', 'shirt due adjustable front tie length leggings sleeveless pairs cardigan shirt', 'tracy reese dresses petite feet tall brand pretty package lot skirt long full overwhelmed frame stranger alterations shortening skirt embellishment garment idea style work returned', 'basket hte person store pick teh pale hte gorgeous turns prefectly baggy hte xs hte bummer petite decided ejans pants skirts oops']


Next, I load the vocabulary file that I created in Task 1.  

The file `vocab.txt` stores all valid tokens from the processed reviews, each mapped to a unique integer index. This mapping is essential because it ensures that every feature representation I build in Task 2 will use a **consistent and reproducible index space**.  

Here, I read the file line by line, split each entry into the token and its index, and store them in a Python dictionary called `vocab`. This dictionary will serve as the reference for constructing bag-of-words vectors and other feature encodings.

In [25]:
with open("t1_outputs/vocab.txt", "r") as f:
    vocab_lines = f.readlines()

vocab = {line.split(":")[0]: int(line.strip().split(":")[1]) for line in vocab_lines}

print("\nHead of vocab:")
for i, (word, idx) in enumerate(vocab.items()):
    if i >= 5:
        break
    print(f"{word}: {idx}")


Head of vocab:
a-cup: 0
a-flutter: 1
a-frame: 2
a-kind: 3
a-line: 4


### 2.2 Bag-of-Words Model: Count Vector Representation

I begin by generating the **bag-of-words (BoW) representation** for each review. This method transforms each review into a sparse vector where: 
- The **index** corresponds to a token from the vocabulary created in Task 1.  
- The **value** represents how many times that token appears in the review.  

In [26]:
# Build CountVectorizer using fixed vocab
vectorizer = CountVectorizer(vocabulary=vocab)
X_counts = vectorizer.fit_transform(reviews)

### 2.3 Load FastText Word Vectors

In this step, I load a pre-trained word embedding model to support the construction of vector-based representations for each review in later sections.

The embedding model used here is **FastText** (`cc.en.300.vec`), which maps each word to a dense 300-dimensional vector based on how it appears in a wide range of real-world text. One key advantage of FastText is its ability to handle **out-of-vocabulary (OOV)** words — that is, words not seen during training. 

It does this by breaking words into smaller overlapping sequences of characters called **n-grams** ("dress" contains trigrams like "dre", "res", "ess"). This means even if a word is misspelled, rare, or completely new, FastText can still generate a meaningful embedding based on its subword structure — an important feature when working with noisy or user-generated text like product reviews.


The embedding file is already in a compatible `.vec` format, so I load it directly using `gensim`'s `KeyedVectors`. Once loaded, each word can be queried to retrieve its vector. These vectors will be used in the next stages to build:
- **Unweighted document vectors**: average of all word vectors in a review.
- **Weighted document vectors**: average of word vectors weighted by their TF-IDF score.

This setup allows each review to be numerically represented in a form suitable for downstream classification models.

In [27]:
# Load FastText .vec file directly
ft_model = KeyedVectors.load_word2vec_format("embedding_model/cc.en.300.vec", binary=False)
embedding_dim = ft_model.vector_size

print("Embedding dimension:", embedding_dim)
print("Sample vector for 'quality':", ft_model['quality'][:5])


Embedding dimension: 300
Sample vector for 'quality': [ 0.0416  0.0178  0.0021  0.0199 -0.0519]


### 2.4 Tokenization Consistency

For feature extraction methods like Bag-of-Words and embeddings, I need to re-tokenize the text into word lists to ensure correct vocabulary matching and vector lookup. I will re-tokenize the reviews using the same regex pattern defined in Task 1. This guarantees alignment with the vocabulary and preprocessing applied earlier.

In [28]:
tokenizer = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?") 

def tokenize(text): 
    tokens = tokenizer.findall(text.lower()) 
    clean_tokens = [t.strip("-'") for t in tokens] # remove trailing punctuation if needed 
    return clean_tokens 

tokenized_reviews = [tokenize(r) for r in reviews] 

### 2.5 OOV Diagnostics

To assess how well the FastText model covers the dataset vocabulary, I calculate the out-of-vocabulary (OOV) rate across all tokens. This helps identify whether any significant preprocessing adjustments are needed before embedding lookup.

In [29]:
def compute_coverage(token_lists, model):
    total = sum(len(toks) for toks in token_lists)
    known = sum(sum(1 for tok in toks if tok in model) for toks in token_lists)
    return {
        "total_tokens": total,
        "covered_tokens": known,
        "coverage_pct": round(100 * known / total, 2)
    }

coverage = compute_coverage(tokenized_reviews, ft_model)
print("FastText OOV Coverage:", coverage)


FastText OOV Coverage: {'total_tokens': 355505, 'covered_tokens': 354155, 'coverage_pct': 99.62}


The FastText model achieves a coverage of **99.62%**, leaving only **0.38%** of tokens unmatched. To ensure that the remaining unmatched tokens do not significantly impact downstream results, I inspect the OOV cases more closely before deciding whether additional handling is necessary.

In [30]:
missing = Counter( 
    tok 
    for review in tokenized_reviews 
    for tok in review 
    if tok not in ft_model 
) 

print("Top OOV tokens:", missing.most_common(10))
print("All missing tokens:")
for token, count in missing.items():
    print(f"{token}: {count}")

Top OOV tokens: [('pilcro', 267), ("would've", 83), ('xxsp', 77), ("retailer's", 56), ("could've", 38), ("model's", 35), ('pxs', 34), ('cartonnier', 23), ('true-to', 21), ("should've", 19)]
All missing tokens:
ejans: 2
would've: 83
could've: 38
eptite: 3
should've: 19
square-apple: 2
cami's: 3
swtr: 5
that'll: 2
xxsp: 77
pants-they: 2
pilcro: 267
woman's: 9
xs-s: 13
retailer's: 56
pxs: 34
maternity-ish: 8
maternity-esque: 2
season's: 4
as-pictured: 3
xspetite: 16
peek-a: 7
fit-and: 7
valentine's: 5
cartonnier: 23
true-to: 21
model's: 35
camisol: 3
denimy: 2
and-go: 3
husband's: 11
skinny's: 3
moulinette: 9
sweatercoat: 5
d-dd: 7
compliements: 2
fit's: 5
dind't: 16
d's: 4
xs's: 2
flattrering: 2
brother's: 5
lot's: 2
canvas-y: 2
lyocel: 2
liekd: 2
pettie: 3
evanthe: 3
floreat: 14
charlie's: 4
grandma's: 4
above-the: 3
deletta: 19
x-s: 3
maxi's: 3
higher-waisted: 2
pxxs: 14
dress's: 6
seafolly: 7
lnever: 2
maeve's: 5
stevies: 12
non-petite: 6
reviewer's: 10
mother's: 13
day's: 2
chrolox: 

Upon inspecting the missing tokens, most are either **misspellings** (`ejans`, `eptite`, `swtr`) or **words containing apostrophes** (`"would've"`, `"model's"`), which FastText may not handle directly in this form.

Given the very low OOV rate and the non-critical nature of most missing tokens, I choose not to apply additional corrections or manual filtering at this stage. The model is sufficiently robust for embedding generation without modification.

### 2.6 Generate Unweighted FastText Embedding

The first embedding-based representation I build is the unweighted average of word vectors. In this approach, each review is represented by taking the mean of all the word embeddings found in the review.

Concept Overview:
- For every token in a review, I check if it exists in the pretrained FastText model.
- If it does, I collect its 300-dimensional vector.
- I then compute the simple average across all available word vectors in that review.
- If a review contains no known tokens, I return a zero vector of the same dimension.

In [31]:
def avg_embedding(tokens, model):
    vectors = [model[w] for w in tokens if w in model]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

unweighted_embeds = np.array([avg_embedding(tokens, ft_model) for tokens in tokenized_reviews])

This results in one fixed-size vector per review, providing a straightforward way to capture the overall semantic meaning of the text without applying any weighting scheme.

To confirm the structure of the result, I preview the first few embeddings from a sample review using FastText:

In [32]:
def validator_token_embeddings(tokens, ft_model, dims=5, limit=5):
    print(f"FastText token-wise vectors (first {dims} dims, showing up to {limit} tokens):\n")
    shown = 0
    for token in tokens:
        if token in ft_model:
            vec = ft_model[token][:dims]
            rounded = [f"{v:.4f}" for v in vec]
            print(f"{token}: [{', '.join(rounded)}]")
            shown += 1
        else:
            print(f"{token}: [OOV]")
            shown += 1
        if shown >= limit:
            break

test_idx = 42
validator_token_embeddings(tokenized_reviews[test_idx], ft_model, dims=8, limit=5)


FastText token-wise vectors (first 8 dims, showing up to 5 tokens):

armholes: [0.0003, -0.0838, -0.0788, -0.0090, 0.1114, 0.0431, -0.0207, -0.0507]
bit: [-0.0657, -0.0629, -0.1529, 0.0636, 0.0471, 0.0937, 0.0748, -0.0786]
oversized: [0.0227, -0.0610, -0.0377, 0.0574, 0.0435, -0.0276, 0.0018, -0.0600]
older: [0.0349, 0.0054, 0.0081, 0.0161, 0.0178, -0.0276, -0.0159, -0.0264]
woman: [0.1328, -0.0516, 0.0142, 0.0771, -0.0610, -0.0764, 0.0576, 0.0009]


Below is a sample output from index 42, demonstrating that all 5 selected tokens are recognized and mapped to meaningful vectors.

After reviewing the output, the embeddings appear to be correctly computed. I will now proceed to the next step.

### 2.7 Generate TF-IDF Weighted FastText Embedding

Next, I improve on the simple averaging method by applying TF-IDF weighting to the word embeddings. The idea here is to give more importance to informative words while reducing the influence of very common ones.

Concept Overview:
- I first compute TF-IDF values for all tokens using the same vocabulary built in Task 1.
- For each review, I look up the FastText embeddings of the tokens that appear in both the embedding model and the TF-IDF dictionary.
- Each word vector is scaled by its corresponding IDF weight.
- I then take the weighted average of these vectors to form the final review embedding.
- If a review has no matching tokens, I assign a zero vector of the same dimension.

In [33]:
# Use same vocab as before
tfidf_vectorizer = TfidfVectorizer(vocabulary=vocab)
X_tfidf = tfidf_vectorizer.fit_transform(reviews)
idf_weights = dict(zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_))

def tfidf_weighted_embedding(tokens, model, idf_dict):
    vectors = []
    weights = []
    for token in tokens:
        if token in model and token in idf_dict:
            vectors.append(model[token] * idf_dict[token])
            weights.append(idf_dict[token])
    if not vectors:
        return np.zeros(model.vector_size)
    return np.sum(vectors, axis=0) / np.sum(weights)

weighted_embeds = np.array([
    tfidf_weighted_embedding(tokens, ft_model, idf_weights)
    for tokens in tokenized_reviews
])

This weighted representation captures both the semantic meaning of words and their relative importance across the entire set of reviews—leading to potentially more discriminative features for downstream classification.

To confirm the structure of the result, I preview a few token embeddings with their corresponding TF-IDF weight:

In [34]:
def validator_weighted_tokens(tokens, model, idf_dict, dims=5, limit=5):
    print(f"TF-IDF Weighted token-wise vectors (first {dims} dims, up to {limit} tokens):\n")
    shown = 0
    for token in tokens:
        if token in model and token in idf_dict:
            vec = model[token] * idf_dict[token]
            rounded = [f"{v:.4f}" for v in vec[:dims]]
            print(f"{token}: [{', '.join(rounded)}]  (weight: {idf_dict[token]:.4f})")
            shown += 1
        elif token in model:
            print(f"{token}: [valid token, missing IDF]")
            shown += 1
        else:
            print(f"{token}: [OOV]")
            shown += 1
        if shown >= limit:
            break

test_idx = 42
validator_weighted_tokens(tokenized_reviews[test_idx], ft_model, idf_weights, dims=8, limit=5)


TF-IDF Weighted token-wise vectors (first 8 dims, up to 5 tokens):

armholes: [0.0017, -0.4834, -0.4545, -0.0519, 0.6426, 0.2486, -0.1194, -0.2924]  (weight: 5.7680)
bit: [-0.2066, -0.1978, -0.4809, 0.2000, 0.1481, 0.2947, 0.2353, -0.2472]  (weight: 3.1453)
oversized: [0.1243, -0.3339, -0.2064, 0.3142, 0.2381, -0.1511, 0.0099, -0.3285]  (weight: 5.4743)
older: [0.2636, 0.0408, 0.0612, 0.1216, 0.1345, -0.2085, -0.1201, -0.1994]  (weight: 7.5538)
woman: [0.8354, -0.3246, 0.0893, 0.4850, -0.3837, -0.4806, 0.3624, 0.0057]  (weight: 6.2909)


After reviewing the output, the embeddings appear to be correctly computed. I will now proceed to the next step.

### 2.8 Save Outputs

After generating the three types of document-level representations, I now save them into their required output formats. Each line in the output files corresponds to one review and starts with a #index followed by the data values, separated by commas.

#### 1 Bag-of-Words

- Saves the sparse Bag-og-words representation to `count_vectors.txt`
- Format: #reviewIndex,tokenIndex1:count1,tokenIndex2:count2,...

In [35]:
# Save sparse BoW counts
with open("t2_outputs/count_vectors.txt", "w") as f:
    for i, row in enumerate(X_counts):
        entries = [
            f"{idx}:{val}"
            for idx, val in zip(row.indices, row.data)
        ]
        f.write(f"#" + str(i) + "," + ",".join(entries) + "\n")

#### 2 Unweighted Embeddings

- Saves the average FastText embeddings to `unweighted_vectors.txt`
- Format: #reviewIndex,val1,val2,...,val300

In [36]:
# Save unweighted FastText embeddings
with open("t2_outputs/unweighted_vectors.txt", "w") as f:
    for i, vec in enumerate(unweighted_embeds):
        vec_str = ",".join(map(str, vec))
        f.write(f"#{i},{vec_str}\n")

#### 3 Weighted Embeddings
- Saves the TF_IDF weighted FastText embeddings to `weighted_vectors.txt`
- Format: #reviewIndex,val1,val2,...,val300

In [37]:
# Save weighted FastText embeddings
with open("t2_outputs/weighted_vectors.txt", "w") as f:
    for i, vec in enumerate(weighted_embeds):
        vec_str = ",".join(map(str, vec))
        f.write(f"#{i},{vec_str}\n")

Upon inspection, the file format meets all specified requirements and appears correctly structured.

## Task 3. Clothing Review Classification

...... Sections and code blocks on buidling classification models based on different document feature represetations. 
Detailed comparsions and evaluations on different models to answer each question as per specification. 

<span style="color: red"> You might have complex notebook structure in this section, please feel free to create your own notebook structure. </span>

In [38]:
# Code to perform the task...


## Summary
Give a short summary and anything you would like to talk about the assessment tasks here.

## Couple of notes for all code blocks in this notebook
- please provide proper comment on your code
- Please re-start and run all cells to make sure codes are runable and include your output in the submission.   
<span style="color: red"> This markdown block can be removed once the task is completed. </span>