# **Assignment 2: Milestone I Natural Language Processing**
## **Task 1. Basic Text Pre-processing**
#### **Student Name**: Tran Tu Tam
#### **Student ID**: s3999159


**Environment**: Python 3 and Jupyter notebook

**Libraries used**: 
* pandas
* re
* collections (Counter)
* nltk

## **Introduction**

This notebook performs text preprocessing on customer clothing reviews as required in **Milestone I** of the assignment. The objective is to normalize and clean the review texts in preparation for feature extraction and machine learning classification.

The steps covered in this notebook include:

- Cleaning the review text using a defined regex tokenizer,
- Removing short words and stopwords,
- Removing rare and overly frequent words,
- Creating a final vocabulary file,
- Saving the cleaned reviews for downstream tasks.

These steps ensure that the text data is standardized, denoised, and ready for feature representation in the next stage.

## **Importing libraries**

In [None]:
import pandas as pd
import re
from collections import Counter
import nltk

### 1.1 Examining and loading data
- Examine the data and explain your findings
- Load the data into proper data structures and get it ready for processing.

In [48]:
try:
    # Load the dataset
    df = pd.read_csv('../data/assignment3.csv')
    # Load the stopwords from the provided file
    with open('../data/stopwords_en.txt', 'r') as f:
        stopwords = set(f.read().splitlines())
    print("Successfully loaded dataset and stopwords.")
except FileNotFoundError as e:
    print(f"Error: {e}. Please make sure 'assignment3.csv' and 'stopwords_en.txt' are in the '../data/' directory.")
    # Re-raise so that later cells do not run with undefined variables
    raise

Successfully loaded dataset and stopwords.


### 1.2 Pre-processing data

#### 1.2.1 Initial Text Cleaning

To begin the text processing pipeline, I perform a series of cleaning and normalization steps on the `"Review Text"` column to standardize the data for frequency-based filtering and downstream modeling. These steps include:

- **Tokenization** using the regex `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"` to capture valid word patterns (including hyphenated and apostrophized words),
- **Lowercasing** all words to standardize word forms,
- **Removing short words** with fewer than 2 characters,
- **Filtering out stopwords** using the provided stopword list.

The function `initial_clean()` encapsulates these transformations. Before applying it, I also ensure that any missing review text values are replaced with empty strings to prevent processing errors.
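As a quick illustration of what the tokenizer pattern does and does not capture, the sketch below runs the same regex on a made-up sentence. The sentence, the helper `demo_clean()`, and the two-word `demo_stopwords` set are assumptions introduced only for this example; they are not part of the assignment data.

```python
import re

# Same pattern as used in this notebook: a run of letters, optionally
# followed by ONE hyphen/apostrophe plus another run of letters.
TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical mini stopword list, for illustration only
demo_stopwords = {"the", "and"}

def demo_clean(text):
    """Mirror the cleaning steps: tokenize, lowercase, drop short words, drop stopwords."""
    tokens = re.findall(TOKEN_PATTERN, text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if len(t) >= 2]
    return [t for t in tokens if t not in demo_stopwords]

sentence = "I didn't like the off-white T-shirt, size 4!"
raw_tokens = re.findall(TOKEN_PATTERN, sentence)
cleaned = demo_clean(sentence)

# The optional group matches at most once, so a word with two hyphens is split
multi = re.findall(TOKEN_PATTERN, "mother-in-law")

print(raw_tokens)  # ['I', "didn't", 'like', 'the', 'off-white', 'T-shirt', 'size']
print(cleaned)     # ["didn't", 'like', 'off-white', 't-shirt', 'size']
print(multi)       # ['mother-in', 'law']
```

Digits and punctuation are dropped, and words with more than one hyphen or apostrophe are split after the first joined part, which is a property of the pattern rather than of the data.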


In [49]:
def initial_clean(text):
    """
    Performs tokenization, lowercasing, and removes short words and stopwords.
    """
    if not isinstance(text, str):
        return []
    
    # Tokenize using regex that handles hyphens and apostrophes
    tokens = re.findall(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?", text)
    
    # Lowercase all tokens
    tokens = [word.lower() for word in tokens]
    
    # Filter out words with length less than 2
    tokens = [word for word in tokens if len(word) >= 2]
    
    # Filter out stopwords
    tokens = [word for word in tokens if word not in stopwords]
    
    return tokens

The cell below fills missing values in the `Review Text` column with empty strings and then applies `initial_clean()` to every review.

In [50]:
df['processed_text'] = df['Review Text'].fillna('').apply(initial_clean)
print("Initial cleaning (tokenization, lowercase, short words, stopwords) complete.")
df['processed_text'].head(10)

Initial cleaning (tokenization, lowercase, short words, stopwords) complete.


0    [high, hopes, dress, wanted, work, initially, ...
1    [love, love, love, jumpsuit, fun, flirty, fabu...
2    [shirt, flattering, due, adjustable, front, ti...
3    [love, tracy, reese, dresses, petite, feet, ta...
4    [aded, basket, hte, mintue, person, store, pic...
5    [ordered, carbon, store, pick, ton, stuff, top...
6    [love, dress, xs, runs, snug, bust, ordered, s...
7    [lbs, ordered, petite, make, length, long, typ...
8    [dress, runs, small, esp, zipper, area, runs, ...
9    [find, reliant, reviews, written, savvy, shopp...
Name: processed_text, dtype: object

Upon inspection, the `processed_text` column contains token lists that reflect all expected transformations: proper tokenization, lowercase conversion, and removal of short words and stopwords.

#### 1.2.2 Filter Rare and Dominant Words

In this step, I refine the cleaned tokens further by removing both **rare words** and **dominant words**, which can negatively impact the performance of downstream models:

- **Rare words** (term frequency = 1) are often typos, misspellings, or highly specific terms that add noise but little generalizable value.
- **Dominant words** (top 20 in document frequency) appear too frequently across reviews and may dilute meaningful patterns.

First, I identify and filter out words that occur only once in the full dataset based on **term frequency**.

In [51]:
# Create a flat list of all tokens from all reviews
all_tokens_tf = [token for review_tokens in df['processed_text'] for token in review_tokens]

# Calculate the frequency of each term
term_freq = Counter(all_tokens_tf)

# Identify words that appear only once
words_to_remove_once = {word for word, count in term_freq.items() if count == 1}
print(f"Identified {len(words_to_remove_once)} words that appear only once.")
print(words_to_remove_once)

Identified 6734 words that appear only once.


Next, I identify overly common words based on **document frequency** (the number of reviews in which each word appears) and select the top 20.
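To make the difference between term frequency and document frequency concrete, here is a minimal sketch on a three-review toy corpus. The names `toy_reviews`, `toy_term_freq`, `toy_doc_freq`, and `toy_rare` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenized review
toy_reviews = [
    ["love", "love", "dress"],
    ["dress", "fit"],
    ["love", "zipper"],
]

# Term frequency: every occurrence counts
toy_term_freq = Counter(tok for review in toy_reviews for tok in review)

# Document frequency: each word counts at most once per review
toy_doc_freq = Counter()
for review in toy_reviews:
    toy_doc_freq.update(set(review))

# Words with term frequency 1 are the "rare" candidates
toy_rare = {word for word, count in toy_term_freq.items() if count == 1}

print(toy_term_freq["love"], toy_doc_freq["love"])  # 3 2
print(sorted(toy_rare))                             # ['fit', 'zipper']
```

Here "love" occurs three times but in only two reviews, so its term frequency is 3 and its document frequency is 2. Note that `Counter.most_common` orders equal counts by the order in which they were first encountered, so if two words tie at the cutoff, which one is selected depends on insertion order.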

In [None]:
# Use sets to count each word only once per review
doc_freq_counter = Counter()
for review_tokens in df['processed_text']:
    doc_freq_counter.update(set(review_tokens))
print(doc_freq_counter)

top_20_words = {word for word, count in doc_freq_counter.most_common(20)}
print(f"Identified top 20 most frequent words: {sorted(list(top_20_words))}")

Identified top 20 most frequent words: ['back', 'bought', 'color', 'comfortable', 'cute', 'dress', 'fabric', 'fit', 'fits', 'flattering', 'great', 'love', 'nice', 'ordered', 'perfect', 'size', 'small', 'soft', 'top', 'wear']


Now, I combine these rare and dominant words and apply a final filtering pass to clean the token list in each review.

In [53]:
# Combine all unwanted words into one removal set
words_to_remove = stopwords.union(words_to_remove_once, top_20_words)

# Final cleaning function
def final_clean(tokens):
    """
    Removes the combined set of unwanted words from a list of tokens.
    """
    return [token for token in tokens if token not in words_to_remove]

# Apply to each review
df['final_processed_text'] = df['processed_text'].apply(final_clean)
print("Final cleaning pass complete.")
df['final_processed_text']

Final cleaning pass complete.


0        [high, hopes, wanted, work, initially, petite,...
1        [jumpsuit, fun, flirty, fabulous, time, compli...
2        [shirt, due, adjustable, front, tie, length, l...
3        [tracy, reese, dresses, petite, feet, tall, br...
4        [basket, hte, person, store, pick, teh, pale, ...
                               ...                        
19657         [happy, snag, price, easy, slip, cut, combo]
19658    [reminds, maternity, clothes, stretchy, shiny,...
19659                 [worked, glad, store, order, online]
19660    [wedding, summer, medium, waist, perfectly, lo...
19661    [lovely, feminine, perfectly, easy, comfy, hig...
Name: final_processed_text, Length: 19662, dtype: object

Upon inspection, the `final_processed_text` column aligns with expectations. For example, the word `"dress"`, identified as one of the top 20 dominant words, has been removed from the first review.
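The visual inspection can be complemented with a programmatic check. The sketch below uses a hypothetical helper, `find_leftovers()`, and toy data to show the idea: after filtering, no token from the removal set should remain.

```python
def find_leftovers(token_lists, removal_set):
    """Return the set of tokens that are in removal_set but still present."""
    return {tok for tokens in token_lists for tok in tokens if tok in removal_set}

# Toy data, for illustration only
toy_removal = {"dress", "love"}
toy_before = [["high", "hopes", "dress"], ["love", "jumpsuit"]]
toy_after = [[t for t in tokens if t not in toy_removal] for tokens in toy_before]

leftovers_before = find_leftovers(toy_before, toy_removal)
leftovers_after = find_leftovers(toy_after, toy_removal)

print(sorted(leftovers_before))  # ['dress', 'love']
print(leftovers_after)           # set()
```

In this notebook, the equivalent call would presumably be `find_leftovers(df['final_processed_text'], words_to_remove)`, which is expected to return an empty set.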

With this, the review tokens are now clean, filtered, and ready for feature representation.

#### 1.2.3 Save the Cleaned Data

After filtering out both rare and overly common words, I join the remaining tokens back into space-separated strings and replace the original `Review Text` column.

The cleaned dataset is then exported to `processed.csv`, which will be used as the input for generating feature representations in the next task.
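One detail worth keeping in mind for the next task: a review whose tokens are all filtered out is written as an empty field, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below demonstrates this on a toy frame written to an in-memory buffer; the `Title` column and its values are made up for illustration.

```python
import io
import pandas as pd

# Toy frame: the second review became empty after cleaning
toy_df = pd.DataFrame({
    "Title": ["Great fabric", "Fine"],
    "Review Text": ["soft lining", ""],
})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings
buffer.seek(0)
reloaded_safe = pd.read_csv(buffer, keep_default_na=False)

print(reloaded_default["Review Text"].isna().tolist())  # [False, True]
print(repr(reloaded_safe.loc[1, "Review Text"]))        # ''
```

Either `keep_default_na=False` or a `fillna('')` after loading avoids passing `NaN` values to a vectorizer that expects strings.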

In [54]:
# Drop the intermediate token columns to restore the original structure
final_output_df = df.drop(columns=['processed_text', 'final_processed_text'])

# Replace 'Review Text' with the cleaned tokens joined into space-separated strings
final_output_df['Review Text'] = df['final_processed_text'].apply(' '.join)

# Save the cleaned dataset
final_output_df.to_csv('processed.csv', index=False)
print("Saved the processed data to 'processed.csv'.")

Saved the processed data to 'processed.csv'.


## **Saving required outputs**
Finally, I generate and save the required vocabulary file `vocab.txt` based on the cleaned token list.

Each word is assigned a unique integer ID, starting from 0. The vocabulary is sorted in alphabetical order, as per the assignment specification.

This file will be used to interpret vector representations in the next steps.
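As a small illustration of the `word:index` format and how it could be read back later, the sketch below builds a vocabulary from toy tokens, writes it to an in-memory buffer, and parses it into a dictionary. The helper `parse_vocab()` and the toy tokens are assumptions for this example, not part of the assignment specification.

```python
import io

# Toy token lists, for illustration only
toy_tokens = [["zipper", "fabric"], ["fabric", "hem"]]

# Sorted unique vocabulary, indexed from 0
toy_vocab = sorted({tok for tokens in toy_tokens for tok in tokens})

# Write in the word:index format, one entry per line
buffer = io.StringIO()
for index, word in enumerate(toy_vocab):
    buffer.write(f"{word}:{index}\n")
vocab_text = buffer.getvalue()

def parse_vocab(text):
    """Parse word:index lines into a {word: index} dictionary."""
    mapping = {}
    for line in text.splitlines():
        # Split on the last colon so the index is always the final field
        word, _, index = line.rpartition(":")
        mapping[word] = int(index)
    return mapping

word_to_index = parse_vocab(vocab_text)

print(vocab_text, end="")
print(word_to_index)  # {'fabric': 0, 'hem': 1, 'zipper': 2}
```

Because the vocabulary is sorted before indexing, the index of each word equals its alphabetical rank, which makes the mapping reproducible across runs.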

In [55]:
# Flatten all tokens from the final cleaned reviews
all_final_tokens = [token for review_tokens in df['final_processed_text'] for token in review_tokens]

# Build sorted unique vocabulary
vocabulary = sorted(list(set(all_final_tokens)))

# Write vocab to file with format: word:index
with open('vocab.txt', 'w') as f:
    for i, word in enumerate(vocabulary):
        f.write(f"{word}:{i}\n")

print(f"Built and saved a vocabulary of {len(vocabulary)} words to 'vocab.txt'.")
print("\nTask 1 successfully completed!")

Built and saved a vocabulary of 7529 words to 'vocab.txt'.

Task 1 successfully completed!


## **Summary**

In this task, I implemented a complete text preprocessing pipeline for the clothing review dataset.  
Key outcomes include:

- Tokenized and normalized the review text,  
- Removed noise such as short words, stopwords, rare words, and overly frequent words,  
- Saved the cleaned dataset to `processed.csv`,  
- Built and exported a sorted vocabulary of unique tokens to `vocab.txt`.  

This preprocessing ensures the text data is clean, consistent, and suitable for feature extraction and model building in the next milestone.  

Overall, the task reinforced the importance of careful data cleaning and normalization in natural language processing, as even small details (like stopword handling or frequency-based filtering) can significantly influence model performance.