# **Assignment 3: Milestone I | Natural Language Processing**
## **Task 1. Basic Text Pre-processing**

**Group 01:**
- Tran Tu Tam (s3999159)
- Phan Nhat Minh (s3978598)
- Le Thien Son (s3977955)

**Environment**: 

To ensure full reproducibility, the Python environment used in this assignment is managed with **conda**. We worked with **Python 3.10** on **conda 25.7.0** inside Jupyter Notebook.  

The exact environment can be recreated by running the following command with the provided `environment.yaml` file:

```bash
conda env create -f environment.yaml
```

After creating the environment, activate it with:

```bash
conda activate ap4ds-a3
```

**Libraries used**: 
* pandas
* re
* collections (`Counter`)
* nltk

## **Introduction**

This notebook addresses **Task 1: Basic Text Pre-processing** from Milestone 1 of the assignment.  
The goal is to transform raw customer clothing reviews into a **clean, standardized, and machine-readable form** that can later be used for feature extraction and classification tasks.

Specifically, I focus on the `"Review Text"` field and apply the following operations:

- **Tokenization** using a regex pattern that preserves meaningful hyphenated and apostrophized words,  
- **Normalization** by lowercasing tokens and filtering out very short words,  
- **Noise reduction** by removing generic stopwords,  
- **Frequency-based filtering** by discarding both extremely rare words (those occurring only once in the dataset) and overly dominant words (the top 20 by document frequency),  
- **Final export** of cleaned reviews to `processed.csv` and a curated vocabulary to `vocab.txt`.

By the end of this stage, the dataset is stripped of inconsistencies and irrelevant tokens, leaving behind a corpus that better represents the signal needed for text classification.

## **Importing Libraries**

In [144]:
import pandas as pd
import re
from collections import Counter
import nltk

## 1.1 Loading and Examining the Data

In [145]:
# Define file paths. Adjust file path based on your actual directory structure.
original_data_path = "../data/assignment3.csv"
stopwords_path = "../data/stopwords_en.txt"

try:
    # Load the dataset
    df = pd.read_csv(original_data_path)
    # Load the stopwords from the provided file
    with open(stopwords_path, 'r', encoding='utf-8') as f:
        stopwords = set(line.strip().lower() for line in f if line.strip())
    print("Successfully loaded dataset and stopwords.")
except FileNotFoundError as e:
    print(f"Error: {e}. Please revise the file paths.")
    # Re-raise rather than calling exit(), which would shut down the notebook kernel
    raise

Successfully loaded dataset and stopwords.


In [146]:
df.head()

| | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 2 | 0 | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 5 | 1 | 1 | General Petite | Tops | Knits |


In [147]:
df.describe(include='all')

| | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 19662.0 | 19662.0 | 19662 | 19662 | 19662.0 | 19662.0 | 19662.0 | 19662 | 19662 | 19662 |
| unique | | | 13983 | 19656 | | | | 3 | 6 | 20 |
| top | | | Love it! | Perfect fit and i've gotten so many compliment... | | | | General | Tops | Dresses |
| freq | | | 136 | 3 | | | | 11664 | 8713 | 5371 |
| mean | 921.297274 | 43.260808 | | | 4.183145 | 0.818177 | 2.652477 | | | |
| std | 200.227528 | 12.258122 | | | 1.112224 | 0.385708 | 5.834285 | | | |
| min | 1.0 | 18.0 | | | 1.0 | 0.0 | 0.0 | | | |
| 25% | 861.0 | 34.0 | | | 4.0 | 1.0 | 0.0 | | | |
| 50% | 936.0 | 41.0 | | | 5.0 | 1.0 | 1.0 | | | |
| 75% | 1078.0 | 52.0 | | | 5.0 | 1.0 | 3.0 | | | |


Lastly, we double-check that the stopword list was imported correctly.

In [148]:
for w in sorted(stopwords):
    print(w)

a
a's
able
about
above
according
accordingly
across
actually
after
afterwards
again
against
ain't
all
allow
allows
almost
alone
along
already
also
although
always
am
among
amongst
an
and
another
any
anybody
anyhow
anyone
anything
anyway
anyways
anywhere
apart
appear
appreciate
appropriate
are
aren't
around
as
aside
ask
asking
associated
at
available
away
awfully
b
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
believe
below
beside
besides
best
better
between
beyond
both
brief
but
by
c
c'mon
c's
came
can
can't
cannot
cant
cause
causes
certain
certainly
changes
clearly
co
com
come
comes
concerning
consequently
consider
considering
contain
containing
contains
corresponding
could
couldn't
course
currently
d
definitely
described
despite
did
didn't
different
do
does
doesn't
doing
don't
done
down
downwards
during
e
each
edu
eg
eight
either
else
elsewhere
enough
entirely
especially
et
etc
even
ever
every
everybody
everyone
everything
everywhere
ex
exactly
example
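
Printing every entry works, but the result is hard to scan. A more compact check is sketched below, assuming the file holds one stopword per line: it reports the size of the set and tests a few properties. The in-memory sample and the helper name `load_stopwords` are illustrative assumptions and not part of the assignment files.

```python
import io

def load_stopwords(handle):
    """Read one stopword per line, ignoring blank lines and letter case."""
    return {line.strip().lower() for line in handle if line.strip()}

# Hypothetical stand-in for stopwords_en.txt (the real file is much longer)
sample_file = io.StringIO("a\nA's\nable\n\nabout\n  above  \n")
sample_stopwords = load_stopwords(sample_file)

print(len(sample_stopwords))      # number of distinct entries
print("a's" in sample_stopwords)  # apostrophized entries are kept intact
# No entry should carry stray whitespace or capital letters
print(all(w == w.strip().lower() for w in sample_stopwords))
```

The same three checks could be run on the real `stopwords` set in place of the full printout.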

## 1.2 Pre-processing the Data

In [149]:
import nltk
nltk.download('wordnet')
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Regex-based tokenizer
pattern = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

sample_reviews = df['Review Text'].head(10)

# Tokenize each review into words
tokenized_reviews = [pattern.findall(str(review).lower()) for review in sample_reviews]

# Flatten tokens
sample_tokens = [token for review_tokens in tokenized_reviews for token in review_tokens]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems = {token: stemmer.stem(token) for token in sample_tokens}
lemmas = {token: lemmatizer.lemmatize(token) for token in sample_tokens}

print("Original Tokens and their Stems:")
print(stems)
print("\nOriginal Tokens and their Lemmas:")
print(lemmas)


Original Tokens and their Stems:
{'i': 'i', 'had': 'had', 'such': 'such', 'high': 'high', 'hopes': 'hope', 'for': 'for', 'this': 'thi', 'dress': 'dress', 'and': 'and', 'really': 'realli', 'wanted': 'want', 'it': 'it', 'to': 'to', 'work': 'work', 'me': 'me', 'initially': 'initi', 'ordered': 'order', 'the': 'the', 'petite': 'petit', 'small': 'small', 'my': 'my', 'usual': 'usual', 'size': 'size', 'but': 'but', 'found': 'found', 'be': 'be', 'outrageously': 'outrag', 'so': 'so', 'in': 'in', 'fact': 'fact', 'that': 'that', 'could': 'could', 'not': 'not', 'zip': 'zip', 'up': 'up', 'reordered': 'reorder', 'medium': 'medium', 'which': 'which', 'was': 'wa', 'just': 'just', 'ok': 'ok', 'overall': 'overal', 'top': 'top', 'half': 'half', 'comfortable': 'comfort', 'fit': 'fit', 'nicely': 'nice', 'bottom': 'bottom', 'a': 'a', 'very': 'veri', 'tight': 'tight', 'under': 'under', 'layer': 'layer', 'several': 'sever', 'somewhat': 'somewhat', 'cheap': 'cheap', 'net': 'net', 'over': 'over', 'layers': 'laye

[nltk_data] Downloading package wordnet to /Users/pnm/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [150]:
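# Cell 1: POS-aware lemmatizer helpers
import re
import nltk
from nltk.corpus import wordnet

# Download the tagger and WordNet resources (skipped if already up to date).
# Both tagger names are requested because the resource name differs across NLTK versions.
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Same tokenization pattern as in the previous cell
pattern = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

def penn_to_wn(tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant (None if unmapped)."""
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('N'):
        return wordnet.NOUN
    if tag.startswith('R'):
        return wordnet.ADV
    return None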

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/pnm/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pnm/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/pnm/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/pnm/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [151]:
# Cell 2: Tokenize -> stem vs. POS-aware lemma
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

sample_reviews = df['Review Text'].head(10)
tokenized_reviews = [pattern.findall(str(r).lower()) for r in sample_reviews]
sample_tokens = [tok for toks in tokenized_reviews for tok in toks]

# POS tag once for all tokens (works best per sentence; for quick check it's fine)
pos_tags = nltk.pos_tag(sample_tokens)
wn_pos = [penn_to_wn(tag) for (_, tag) in pos_tags]

stems = {tok: stemmer.stem(tok) for tok in sample_tokens}

lemmas_pos_aware = {}
for tok, tag in zip(sample_tokens, wn_pos):
    if tag is None:
        lemmas_pos_aware[tok] = lemmatizer.lemmatize(tok)     # fall back to default
    else:
        lemmas_pos_aware[tok] = lemmatizer.lemmatize(tok, pos=tag)

print("Original → Stem (sample):")
print({k: stems[k] for k in list(stems)[:60]})
print("\nOriginal → Lemma (POS-aware, sample):")
print({k: lemmas_pos_aware[k] for k in list(lemmas_pos_aware)[:60]})

Original → Stem (sample):
{'i': 'i', 'had': 'had', 'such': 'such', 'high': 'high', 'hopes': 'hope', 'for': 'for', 'this': 'thi', 'dress': 'dress', 'and': 'and', 'really': 'realli', 'wanted': 'want', 'it': 'it', 'to': 'to', 'work': 'work', 'me': 'me', 'initially': 'initi', 'ordered': 'order', 'the': 'the', 'petite': 'petit', 'small': 'small', 'my': 'my', 'usual': 'usual', 'size': 'size', 'but': 'but', 'found': 'found', 'be': 'be', 'outrageously': 'outrag', 'so': 'so', 'in': 'in', 'fact': 'fact', 'that': 'that', 'could': 'could', 'not': 'not', 'zip': 'zip', 'up': 'up', 'reordered': 'reorder', 'medium': 'medium', 'which': 'which', 'was': 'wa', 'just': 'just', 'ok': 'ok', 'overall': 'overal', 'top': 'top', 'half': 'half', 'comfortable': 'comfort', 'fit': 'fit', 'nicely': 'nice', 'bottom': 'bottom', 'a': 'a', 'very': 'veri', 'tight': 'tight', 'under': 'under', 'layer': 'layer', 'several': 'sever', 'somewhat': 'somewhat', 'cheap': 'cheap', 'net': 'net', 'over': 'over', 'layers': 'layer', 'im

In [152]:
# Cell 3: Vocabulary impact (unique token counts)
orig_vocab = set(sample_tokens)
stem_vocab = set(stems.values())
lemma_vocab = set(lemmas_pos_aware.values())

print(f"Unique tokens (original): {len(orig_vocab)}")
print(f"Unique tokens (stemmed): {len(stem_vocab)}")
print(f"Unique tokens (lemmatized): {len(lemma_vocab)}")

# A few examples where stem vs lemma differs
diff_examples = []
for tok in orig_vocab:
    s = stemmer.stem(tok)
    l = lemmas_pos_aware.get(tok, tok)
    if s != l:
        diff_examples.append((tok, s, l))
    if len(diff_examples) >= 25:
        break

print("\nExamples (token, stem, lemma):")
for t, s, l in diff_examples:
    print(f"{t:>15}  →  {s:>12}  |  {l}")

Unique tokens (original): 305
Unique tokens (stemmed): 293
Unique tokens (lemmatized): 281

Examples (token, stem, lemma):
       leggings  →           leg  |  legging
    everythiing  →      everythi  |  everythiing
           feet  →          feet  |  foot
    reminiscent  →      reminisc  |  reminiscent
     everything  →       everyth  |  everything
         really  →        realli  |  really
    beautifully  →        beauti  |  beautifully
           aded  →           ade  |  aded
       retailer  →        retail  |  retailer
      initially  →         initi  |  initially
           said  →          said  |  say
           kept  →          kept  |  keep
          ejans  →          ejan  |  ejans
        several  →         sever  |  several
          tulle  →          tull  |  tulle
         you're  →         you'r  |  you're
        decided  →         decid  |  decide
            try  →           tri  |  try
  embellishment  →     embellish  |  embellishment
    comfortable  →    

#### 1.2.1 Initial Text Cleaning

To begin the text processing pipeline, I perform a series of cleaning and normalization steps on the `"Review Text"` column to standardize the data for frequency-based filtering and downstream modeling. These steps include:

- **Tokenization** using the regex `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"` to capture valid word patterns (including hyphenated and apostrophized words),
- **Lowercasing** all words to standardize word forms,
- **Removing short words** with fewer than 2 characters,
- **Filtering out stopwords** using the provided stopword list.

The function `initial_clean()` encapsulates these transformations. Before applying it, I also ensure that any missing review text values are replaced with empty strings to prevent processing errors.
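
The tokenization rule can be checked in isolation before it is wrapped in a function. The snippet below is a small sketch that applies the same regex to a made-up sentence (not taken from the dataset) to show which word shapes survive.

```python
import re

# Same pattern as in the notebook: letters, optionally followed by ONE
# hyphen- or apostrophe-joined group of letters
pattern = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

# Made-up sentence, not taken from the dataset
text = "Well-made top, but it doesn't fit my mother-in-law in size 4."
tokens = [t.lower() for t in pattern.findall(text)]
print(tokens)
```

This prints `['well-made', 'top', 'but', 'it', "doesn't", 'fit', 'my', 'mother-in', 'law', 'in', 'size']`. Because the optional group can match only once, a word with two hyphens such as "mother-in-law" is split into `mother-in` and `law`, and digits and punctuation are dropped entirely.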


In [154]:
def initial_clean(text):
    """
    Performs tokenization, lowercasing, and removes short words and stopwords.
    """
    if not isinstance(text, str):
        return []
    
    # Tokenize using regex that handles hyphens and apostrophes
    tokens = re.findall(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?", text)
    
    # Lowercase all tokens
    tokens = [word.lower() for word in tokens]
    
    # Filter out words with length less than 2
    tokens = [word for word in tokens if len(word) >= 2]
    
    # Filter out stopwords
    tokens = [word for word in tokens if word not in stopwords]
    
    return tokens

I then apply `initial_clean()` to the `Review Text` column, chaining `fillna('')` first so that missing values are passed in as empty strings.

In [155]:
df['processed_text'] = df['Review Text'].fillna('').apply(initial_clean)
print("Initial cleaning (tokenization, lowercase, short words, stopwords) complete.")
df['processed_text'].head(10)

Initial cleaning (tokenization, lowercase, short words, stopwords) complete.


0    [high, hopes, dress, wanted, work, initially, ...
1    [love, love, love, jumpsuit, fun, flirty, fabu...
2    [shirt, flattering, due, adjustable, front, ti...
3    [love, tracy, reese, dresses, petite, feet, ta...
4    [aded, basket, hte, mintue, person, store, pic...
5    [ordered, carbon, store, pick, ton, stuff, top...
6    [love, dress, xs, runs, snug, bust, ordered, s...
7    [lbs, ordered, petite, make, length, long, typ...
8    [dress, runs, small, esp, zipper, area, runs, ...
9    [find, reliant, reviews, written, savvy, shopp...
Name: processed_text, dtype: object

Upon inspection, the `processed_text` column contains token lists that reflect all expected transformations: proper tokenization, lowercase conversion, and removal of short or common stopwords.

#### 1.2.2 Filter Rare and Dominant Words

In this step, I refine the cleaned tokens further by removing both **rare words** and **dominant words**, which can negatively impact the performance of downstream models:

- **Rare words** (term frequency = 1) are often typos, misspellings, or highly specific terms that add noise but little generalizable value.
- **Dominant words** (top 20 in document frequency) appear too frequently across reviews and may dilute meaningful patterns.

First, I identify and filter out words that occur only once in the full dataset based on **term frequency**.

In [156]:
# Create a flat list of all tokens from all reviews
all_tokens_tf = [token for review_tokens in df['processed_text'] for token in review_tokens]

# Calculate the frequency of each term
term_freq = Counter(all_tokens_tf)

# Identify words that appear only once
words_to_remove_once = {word for word, count in term_freq.items() if count == 1}
print(f"Identified {len(words_to_remove_once)} words that appear only once.")
print(words_to_remove_once)

Identified 6734 words that appear only once.


Next, I identify overly common words based on **document frequency** — how many reviews each word appears in — and select the top 20.

In [157]:
# Use sets to count each word only once per review
doc_freq_counter = Counter()
for review_tokens in df['processed_text']:
    doc_freq_counter.update(set(review_tokens))
print(doc_freq_counter)

top_20_words = {word for word, count in doc_freq_counter.most_common(20)}
print(f"Identified top 20 most frequent words: {sorted(list(top_20_words))}")

Identified top 20 most frequent words: ['back', 'bought', 'color', 'comfortable', 'cute', 'dress', 'fabric', 'fit', 'fits', 'flattering', 'great', 'love', 'nice', 'ordered', 'perfect', 'size', 'small', 'soft', 'top', 'wear']
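
The two counts used in this step answer different questions, which is easy to see on a toy corpus. The sketch below uses three made-up tokenized reviews; the `toy_` names are illustrative and chosen so that they do not overwrite the notebook's own variables.

```python
from collections import Counter

# Three made-up tokenized reviews
toy_reviews = [
    ["love", "love", "love", "jumpsuit"],
    ["love", "fabric"],
    ["fabric", "itchy"],
]

# Term frequency: every occurrence counts
toy_term_freq = Counter(tok for review in toy_reviews for tok in review)

# Document frequency: each review contributes a word at most once
toy_doc_freq = Counter()
for review in toy_reviews:
    toy_doc_freq.update(set(review))

print(toy_term_freq["love"], toy_doc_freq["love"])      # 4 2
print(toy_term_freq["fabric"], toy_doc_freq["fabric"])  # 2 2

# Words that would be treated as rare (term frequency of exactly 1)
toy_singletons = sorted(w for w, c in toy_term_freq.items() if c == 1)
print(toy_singletons)  # ['itchy', 'jumpsuit']
```

Repeating a word within one review raises its term frequency but not its document frequency, which is why the dominant-word filter is based on the number of reviews rather than on raw counts.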


Now, I combine these rare and dominant words and apply a final filtering pass to clean the token list in each review.

In [158]:
# Combine all unwanted words into one removal set
words_to_remove = stopwords.union(words_to_remove_once, top_20_words)

# Final cleaning function
def final_clean(tokens):
    """
    Removes the combined set of unwanted words from a list of tokens.
    """
    return [token for token in tokens if token not in words_to_remove]

# Apply to each review
df['final_processed_text'] = df['processed_text'].apply(final_clean)
print("Final cleaning pass complete.")
df['final_processed_text']

Final cleaning pass complete.


0        [high, hopes, wanted, work, initially, petite,...
1        [jumpsuit, fun, flirty, fabulous, time, compli...
2        [shirt, due, adjustable, front, tie, length, l...
3        [tracy, reese, dresses, petite, feet, tall, br...
4        [basket, hte, person, store, pick, teh, pale, ...
                               ...                        
19657         [happy, snag, price, easy, slip, cut, combo]
19658    [reminds, maternity, clothes, stretchy, shiny,...
19659                 [worked, glad, store, order, online]
19660    [wedding, summer, medium, waist, perfectly, lo...
19661    [lovely, feminine, perfectly, easy, comfy, hig...
Name: final_processed_text, Length: 19662, dtype: object

Upon inspection, the `final_processed_text` column aligns with expectations. For example, the word `"dress"`, identified as one of the top 20 dominant words, has been removed from the first review, while `"dresses"`, which is not in the top 20, remains in the fourth.

With this, the review tokens are now clean, filtered, and ready for feature representation.

#### 1.2.3 Save the Cleaned Data

After filtering out both rare and overly common words, I join the remaining tokens back into space-separated strings and replace the original `Review Text` column.

The cleaned dataset is then exported to `processed.csv`, which will be used as the input for generating feature representations in the next task.
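
One consequence of joining is that a review whose tokens were all filtered out becomes an empty string. The notebook handles this with pandas by replacing empty strings with nulls and dropping those rows; the plain-Python sketch below shows the same idea on made-up token lists.

```python
# Made-up token lists after the final filtering pass; the second review
# lost all of its tokens
filtered_reviews = [
    ["high", "hopes", "wanted", "work"],
    [],
    ["jumpsuit", "fun", "flirty"],
]

# Join tokens back into space-separated strings
joined = [" ".join(tokens) for tokens in filtered_reviews]

# Keep only reviews that still contain text (mirrors replacing "" with NA
# and dropping the resulting null rows)
kept = [text for text in joined if text]

print(joined)
print(len(joined) - len(kept))  # number of reviews removed because they became empty
```

Here one of the three toy reviews is removed. Exporting without this step would leave rows whose text is blank in `processed.csv`.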

In [159]:
# Join the final tokens of each review back into a space-separated string
cleaned_text = df['final_processed_text'].apply(' '.join)

# Drop intermediate processing columns to match the original structure,
# then overwrite 'Review Text' with the cleaned strings
final_output_df = df.drop(columns=['processed_text', 'final_processed_text'])
final_output_df['Review Text'] = cleaned_text



In [160]:
# Replace empty strings with NaN
final_output_df.replace("", pd.NA, inplace=True)

# Check for null values
final_output_df.isnull().sum()

Clothing ID                 0
Age                         0
Title                       0
Review Text                10
Rating                      0
Recommended IND             0
Positive Feedback Count     0
Division Name               0
Department Name             0
Class Name                  0
dtype: int64

In [161]:
# Drop rows with null values
final_output_df.dropna(inplace=True)

In [162]:
# Check for duplication
final_output_df.duplicated().sum()

0

In [163]:
# Save the cleaned dataset
final_output_df.to_csv('../output/processed.csv', index=False)
print("Saved the processed data to '../output/processed.csv'.")

Saved the processed data to '../output/processed.csv'.


In [164]:
# count null values
final_output_df.isnull().sum()

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

## Saving Required Outputs

Finally, I generate and save the required vocabulary file `vocab.txt` based on the cleaned token list.

Each word is assigned a unique integer ID, starting from 0. The vocabulary is sorted in alphabetical order, as per the assignment specification.

This file will be used to interpret vector representations in the next steps.

In [165]:
# Flatten all tokens from the final cleaned reviews
all_final_tokens = [token for review_tokens in df['final_processed_text'] for token in review_tokens]

# Build sorted unique vocabulary
vocabulary = sorted(set(all_final_tokens))

# Write vocab to file with format: word:index (explicit encoding for portability)
with open('../output/vocab.txt', 'w', encoding='utf-8') as f:
    for i, word in enumerate(vocabulary):
        f.write(f"{word}:{i}\n")

print(f"Built and saved a vocabulary of {len(vocabulary)} words to 'vocab.txt'.")
print("\nTask 1 successfully completed!")

Built and saved a vocabulary of 7529 words to 'vocab.txt'.

Task 1 successfully completed!
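
A later step that consumes `vocab.txt` needs to turn it back into a word-to-index mapping. The sketch below shows one way to do that, assuming the `word:index` line format written above; the helper name `read_vocab` and the three sample entries are hypothetical and only illustrate the format.

```python
import io

def read_vocab(handle):
    """Parse lines of the form word:index into a dict."""
    vocab = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Split on the last colon so the index is always the final field
        word, _, index = line.rpartition(":")
        vocab[word] = int(index)
    return vocab

# Hypothetical three-line excerpt in the same format as vocab.txt
sample = io.StringIO("a-line:0\nabsolutely:1\nzipper:2\n")
vocab = read_vocab(sample)

# Words ordered by index should already be in alphabetical order
words = sorted(vocab, key=vocab.get)
print(words == sorted(words))
# Indices should run from 0 to n-1 without gaps
print(sorted(vocab.values()) == list(range(len(vocab))))
```

Both checks print `True` for the sample. Running them on the real file would confirm the ordering and zero-based indexing described above.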


## **Summary**

#### **Implemented Pre-processing Steps**

- **Tokenization**  
  Applied a regex-based tokenizer (`r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"`) to split reviews into meaningful words while preserving hyphenated and apostrophized terms such as *“well-made”* or *“don’t”*.  

- **Normalization**  
  Standardized tokens by converting all text to lowercase and removing very short tokens (length < 2), ensuring consistency and eliminating stray characters.  

- **Noise Reduction**  
  Removed generic stopwords using the provided stopword list to filter out high-frequency but semantically weak terms like *“the”* or *“and”*.  

- **Frequency-based Filtering**  
  Refined the vocabulary by discarding:  
  - Rare words (term frequency = 1), often typos or one-off mentions.  
  - Overly dominant words (top 20 by document frequency), which appear across many reviews but add little discriminative power.  

- **Final Export**  
  Produced two required outputs:  
  - `processed.csv`, containing the cleaned version of the reviews.  
  - `vocab.txt`, a curated and alphabetically sorted vocabulary with word-to-index mappings.  


#### **Reflection**

This pipeline produced a **denoised, standardized corpus** with a reduced vocabulary. Regex tokenization kept hyphenated and apostrophized words intact, lowercasing and the length filter enforced consistency, and the frequency filters removed 6,734 single-occurrence words and the 20 most widespread words, leaving a vocabulary of 7,529 words. Ten of the 19,662 reviews lost all of their tokens in the process and were dropped.  
Together, these choices make the dataset **cleaner and more meaningful**, ready for feature representation and modeling in the next milestone.