# Assignment 2: Milestone I Natural Language Processing
## Task 1
#### Group DS_G3
#### Members:
*   To Minh Tuan  s4055570
*   Huynh Huu Tri s4079860
*   Tran Viet Duc s4106117
*   Tran Minh Quang s4098857

Environment: Python 3 and Jupyter notebook

Libraries used: please include all the libraries you used in your assignment:
```
import gensim.downloader as api
from gensim.models import FastText
import lightgbm as lgb
from matplotlib import pyplot as plt
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, RegexpTokenizer
import numpy as np
import os
import pandas as pd
import re
from scipy.sparse import hstack, csr_matrix
from scipy.stats import chi2_contingency
import seaborn as sns
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer, classification_report
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
import warnings
```

## 1. Introduction

This notebook implements a comprehensive text pre-processing pipeline for customer clothing reviews, forming the foundation for subsequent analysis and classification tasks. Our approach systematically transforms raw review text into structured, normalized representations suitable for natural language processing models. The pipeline includes tokenization with sentence segmentation and word-level parsing, followed by cleaning operations including stopword removal, rare term filtering, and lemmatization to normalize word forms. We handle fashion-specific vocabulary challenges by preserving hyphenated product descriptors (e.g., "well-made") while removing common English stopwords.

The implementation produces three distinct vocabularies—for Title text, Review Text, and their combination—with each vocabulary storing normalized tokens in alphabetical order with unique identifiers. These vocabularies support the feature representation and classification tasks in subsequent notebooks. Special attention is given to the cleaning steps that enhance signal-to-noise ratio in the text data, particularly important for clothing reviews where product descriptions contain domain-specific terminology. The resulting cleaned text and vocabulary structures provide a solid foundation for generating the vector representations needed for the machine learning classification models in Tasks 2 and 3.

### 1.1. Project Scope & Requirements

This assignment focuses on developing a natural language processing (NLP) pipeline for analyzing and classifying clothing reviews. The project encompasses three interconnected tasks:

1. **Text Pre-processing**: Implementing a comprehensive cleaning pipeline for review text data, including tokenization, stopword removal, and vocabulary construction according to specific formatting requirements.

2. **Feature Representation**: Generating multiple vector representations of the review texts using both traditional bag-of-words approaches and modern word embeddings.

3. **Classification Modeling**: Building and evaluating machine learning models that can predict whether a customer recommends a product based on their review text.

The assignment is implemented in 2 files:

1. "task1.ipynb": Implement the **Text Pre-processing** task.

2. "task2_3.ipynb": Using the result of task 1 to implement the **Feature Representation** and **Classification Modeling** tasks.

### 1.2. Table of content

The outline of this Notebook "task1.ipynb" is described as below:

1. Introduction

    1.1 Project Scope & Requirements

    1.2. Table of content

    1.3. Importing libraries

2. Examining and loading data

3. Pre-processing data

    3.1. Tokenization

    3.2. Text removal and frequency calculation

    3.3. Lemmatization

    3.4. Fix typos

    3.5. Full text preprocessing pipeline

4. Saving required outputs

5. Summary

### 1.3. Importing libraries

In [5]:
# install gensim if the package is not existed -- MAY NEED TO RESTART KERNEL AFTER INSTALLATION FOR GENSIM TO WORK
!pip install gensim



In [6]:
# Code to import libraries ??? FIX ME: remove all unused libs before submission
import pandas as pd
import re
import numpy as np
from bs4 import BeautifulSoup
from collections import Counter
import difflib
import html
import nltk
from nltk.tokenize import sent_tokenize, RegexpTokenizer
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures, TrigramAssocMeasures, TrigramCollocationFinder, QuadgramCollocationFinder, QuadgramAssocMeasures
from nltk.corpus import words, wordnet
from nltk.probability import *
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
from nltk.metrics import edit_distance as levenshtein_distance

# load default dataset
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
nltk.download('averaged_perceptron_tagger_eng') # Add this line to download the missing resource

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\tomin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tomin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\tomin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\tomin\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\tomin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\tomin\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already u

True

## 2. Examining and loading data
+ Examine the data and explain the findings.
+ Load the data into proper data structures and get it ready for processing.

In [7]:
# load dataset
df = pd.read_csv('assignment3_II.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19662 entries, 0 to 19661
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              19662 non-null  int64 
 1   Age                      19662 non-null  int64 
 2   Title                    19662 non-null  object
 3   Review Text              19662 non-null  object
 4   Rating                   19662 non-null  int64 
 5   Recommended IND          19662 non-null  int64 
 6   Positive Feedback Count  19662 non-null  int64 
 7   Division Name            19662 non-null  object
 8   Department Name          19662 non-null  object
 9   Class Name               19662 non-null  object
 10  Clothes Title            19662 non-null  object
 11  Clothes Description      19662 non-null  object
dtypes: int64(5), object(7)
memory usage: 1.8+ MB


Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Clothes Title,Clothes Description
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Elegant A-Line Dress,"A classic A-line dress that flows gracefully, ..."
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Petite High-Waisted Trousers,"Chic, high-waisted trousers designed to elonga..."
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Silk Button-Up Blouse,A luxurious silk blouse with a timeless button...
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,Elegant A-Line Dress,"A classic A-line dress that flows gracefully, ..."
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits,Petite Cable Knit Sweater,A cozy cable knit sweater tailored specificall...


## 3. Pre-processing data
This section performs the required text pre-processing steps.

In order to generate an auditable process, each single step will be implemented by a distinct functions. At the end, all the functions will be combined together by a main function to execute the whole process.

To ensure the most efficient model training process later, some extra steps (such as lemmatization, typos handling) will be added into the required process to make the data as clean as possible.

### 3.1. Tokenization
This section generate a function to:
+ Perform sentence segmentation.
+ Extract tokens from *Review Text* attribute of the `df` dataset and transform into lower case format.
+ Store result in `corpus`, of which each row store a list of tokens for a review.

In [8]:
# Function to tokenize pd.DataFrame to corpus (a 2D list, each row stores tokens of a review)
def tokenize(df, attribute, get_vocab = False, print_process = False):
  '''
  Perform sentence segmentation and word tokenization.

  Args:
    df (pd.DataFrame): DataFrame containing text column
    attribute (str): Name of text column
    print_process (bool): Print the result of the tokenization process or not

  Returns:
    corpus (list of list): 2D list of tokens per row
  '''
  regex_tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
  ENTITY_RE = re.compile(r"&(?:[A-Za-z]+|#[0-9]+|#x[0-9A-Fa-f]+);")
  corpus = []

  for text in df[attribute]:
    tokens = []

    # drop HTML encode
    unescaped = html.unescape(text)
    soup = BeautifulSoup(unescaped, 'html.parser')
    text = soup.get_text()
    text = ENTITY_RE.sub(' ', text)

    # tokenization
    for sent in sent_tokenize(text): # sentence segmentation
      words = regex_tokenizer.tokenize(sent) # word tokenization
      words = [w.lower() for w in words] # lowercase transform
      tokens.extend(words)
    corpus.append(tokens)

  # Build vocab dict (alphabetical order)
  if get_vocab:
    unique_tokens = sorted({t for doc in corpus for t in doc})
    vocab = {token: idx for idx, token in enumerate(unique_tokens)}
    return corpus, vocab

  # Print process
  if print_process:
    print(f"Finish tokenize df[{attribute}]: {sum([len(tokens) for tokens in corpus])} token extracted")

  return corpus

# demo use of tokenize() function
print('*** This is just demo only ***')
corpus = tokenize(df, 'Review Text', print_process = True) # each row in corpus store all tokens of a review

*** This is just demo only ***
Finish tokenize df[Review Text]: 1206545 token extracted


### 3.2. Detect collocations

In [9]:
def find_collocations(corpus, ngram_range = [2, 3, 4], num_collocations = 20, min_freq = 10, print_process = False):
    '''
    Find n-gram collocations (bigrams, trigrams, fourgrams) in the corpus using PMI measure

    Args:
        corpus (list of list): 2D token list
        num_collocations (int): Number of top collocations to return per n-gram type
        min_freq (int): Minimum frequency for collocations
        print_process (bool): Whether to print results
        ngram_range (tuple): Range of n-grams to find (min_n, max_n)

    Returns:
        collocations_dict (dict): Dictionary of top collocations by n-gram type
    '''
    # Import necessary modules

    # Flatten corpus for collocation finding
    all_tokens = [token for doc in corpus for token in doc]

    # Dictionary to store collocations by n-gram size
    collocations = []

    # Find quadgrams if in range
    if 4 in ngram_range:
        finder = QuadgramCollocationFinder.from_words(all_tokens)
        finder.apply_freq_filter(min_freq)
        quadgram_measures = QuadgramAssocMeasures()
        top_quadgrams = finder.nbest(quadgram_measures.pmi, num_collocations)
        collocations.extend([f"{w1}-{w2}-{w3}-{w4}" for w1, w2, w3, w4 in top_quadgrams])

        if print_process:
            print(f"\n=== Top {num_collocations} Quadgram Collocations ===")
            for i, collocation in enumerate(top_quadgrams, 1):
                print(f"{i}. '{' '.join(collocation)}'")

    # Find trigrams if in range
    if 3 in ngram_range:
        finder = TrigramCollocationFinder.from_words(all_tokens)
        finder.apply_freq_filter(min_freq)
        trigram_measures = TrigramAssocMeasures()
        top_trigrams = finder.nbest(trigram_measures.pmi, num_collocations)
        collocations.extend([f"{w1}-{w2}-{w3}" for w1, w2, w3 in top_trigrams])

        if print_process:
            print(f"\n=== Top {num_collocations} Trigram Collocations ===")
            for i, collocation in enumerate(top_trigrams, 1):
                print(f"{i}. '{' '.join(collocation)}'")

    # Find bigrams if in range
    if 2 in ngram_range:
        finder = BigramCollocationFinder.from_words(all_tokens)
        finder.apply_freq_filter(min_freq)
        bigram_measures = BigramAssocMeasures()
        top_bigrams = finder.nbest(bigram_measures.pmi, num_collocations)
        collocations.extend([f"{w1}-{w2}" for w1, w2 in top_bigrams])

        if print_process:
            print(f"\n=== Top {num_collocations} Bigram Collocations ===")
            for i, collocation in enumerate(top_bigrams, 1):
                print(f"{i}. '{' '.join(collocation)}'")

    # Calculate optimal min_freq based on corpus size
    if print_process:
        total_tokens = len(all_tokens)
        recommended_min_freq = max(3, int(total_tokens * 0.0001))
        if min_freq < recommended_min_freq:
            print(f"\nNote: Your min_freq={min_freq} may be too low for a corpus with {total_tokens} tokens.")
            print(f"Consider using min_freq≥{recommended_min_freq} for more meaningful collocations.")

    return collocations

# demo use of find_collocations() function
print('*** This is just demo only ***')
corpus = tokenize(df, 'Review Text')
collocations = find_collocations(corpus, print_process = True, ngram_range = [2, 3, 4])  # Find bigrams through quadgrams
print(collocations)

*** This is just demo only ***

=== Top 20 Quadgram Collocations ===
1. 'had such high hopes'
2. 'those of us who'
3. 'you raise your arms'
4. 'as another reviewer noted'
5. 'a few weeks ago'
6. 'angel of the north'
7. 'after reading other reviews'
8. 'was pleasantly surprised by'
9. 'if you're between sizes'
10. 'as another reviewer mentioned'
11. 'have paid full price'
12. 'received tons of compliments'
13. 'inches above my knee'
14. 'paid full price for'
15. 'say enough good things'
16. 'get compliments every time'
17. 'deal breaker for me'
18. 'other reviewers have noted'
19. 'love at first sight'
20. 'i've received many compliments'

=== Top 20 Trigram Collocations ===
1. 'wide rib cage'
2. 'worth every penny'
3. 'few weeks ago'
4. 'start by saying'
5. 'hd in paris'
6. 'exceeded my expectations'
7. 'i've ever owned'
8. 'with flip flops'
9. 'dry clean only'
10. 'pleasantly surprised by'
11. 'paid full price'
12. 'received numerous compliments'
13. 'paying full price'
14. 'pay full 

In [None]:
def add_collocations(corpus, collocations, print_process = False):
  '''
  Add collocations to the corpus

  Args:
      corpus (list of list): 2D token list
      collocations_dict (dict): Dictionary of top collocations by n-gram type

  Returns:
      corpus (list of list): Corpus with detected collocations treated as single tokens (e.g., 'new-york')
      replaced_tokens (dict): All tokens that have been replaced by collocations in a form {token_be_replaced: collocation}
  '''
  result_corpus = []

  for doc in corpus:
    doc = ' '.join(doc) # doc is transfromed from ['he', 'work', 'out', ...] to 'he work out ...' so all collocations will be separated with space
    for collocation in collocations:
      collocation_with_space = collocation.replace('-', ' ')
      doc = doc.replace(collocation_with_space, collocation) # replace all collocation with space to collocation with "-", like 'work out' to 'work-out'
    doc = doc.split(' ') # doc is transformed to ['he', 'work-out', ...]
    result_corpus.append(doc)

  if print_process:
    print(f"Finish add collocations:")
    print(f"+ Before: {sum([len(review_tokens) for review_tokens in corpus])} tokens: {corpus[:5]} ...")
    print(f"+ Now:    {sum([len(review_tokens) for review_tokens in result_corpus])} tokens: {result_corpus[:5]} ...")

  return result_corpus

# demo use of add_collocations() function
print('*** This is just demo only ***')
corpus = tokenize(df, 'Review Text')
collocations = find_collocations(corpus, ngram_range = [2, 3, 4])  # Find bigrams through quadgrams
new_corpus = add_collocations(corpus, collocations, print_process = True)

### 3.2. Text removal and frequency calculation
This section generates 2 functions:
+ A function `calc_frequency()` to calculate term frequency and document frequency of `corpus` (2D list, each row store tokens of a review).
+ A function `remove_tokens()` to remove tokens from `corpus`.

In [11]:
# Function to calculate frequency
def calc_frequency(corpus, type = 'term', print_process = False):
  '''
  Calculate token frequency. Term frequency (type = "term") or document frequency (type = "document")

  Args:
    corpus (list of list): 2D token list
    type (str): "term" or "document"
    print_process (bool): Print the result of the tokenization process or not

  Returns:
    frequency (dict): token -> frequency
  '''
  if type == 'term': # if type is 'term'
    all_tokens = [token for doc in corpus for token in doc] # extract all tokens of all documents into 1 list
    frequency = FreqDist(all_tokens)
  else:
    all_unique_tokens_in_each_doc = [token for doc in corpus for token in set(doc)] # filter unique tokens in each document (set(doc)), then extract them into 1 list
    frequency = FreqDist(all_unique_tokens_in_each_doc)

  # Print process
  if print_process:
    print(f"Finish calculate {type} frequency of {sum([len(tokens) for tokens in corpus])} tokens")

  return frequency

# demo use of calc_frequency() function
print('*** This is just demo only ***')
term_frequency = calc_frequency(corpus, type = 'term', print_process = True)
document_frequency = calc_frequency(corpus, type = 'document', print_process = True)
print('*** Term frequency ***\n', term_frequency)
print('*** Document frequency ***\n', document_frequency)

*** This is just demo only ***
Finish calculate term frequency of 1206545 tokens
Finish calculate document frequency of 1206545 tokens
*** Term frequency ***
 <FreqDist with 14806 samples and 1206545 outcomes>
*** Document frequency ***
 <FreqDist with 14806 samples and 898450 outcomes>


In [12]:
# Function to remove invalid tokens
def remove_tokens(corpus, tokens_to_remove, remove_single_char = False, print_process = False):
  '''
  Remove the tokens of `corpus` that are in `tokens_to_remove`

  Args:
    corpus (list of list): tokenized text
    tokens_to_remove (list): list of tokens (str)
    remove_single_char (bool): whether removing tokens with length = 1 or not

  Returns:
    corpus (list of list): cleaned tokenized text
  '''

  tokens_to_remove = set(tokens_to_remove)
  cleaned_corpus = []
  for doc in corpus:
    cleaned_doc = [w for w in doc if (w not in tokens_to_remove) and ((not remove_single_char) or len(w) >= 2)]
    cleaned_corpus.append(cleaned_doc)

  # Print process
  if print_process:
    print("Finish removal:")
    print(f"+ Before: {sum([len(review_tokens) for review_tokens in corpus])} tokens: {corpus[:5]} ...")
    print(f"+ Now:    {sum([len(review_tokens) for review_tokens in cleaned_corpus])} tokens: {cleaned_corpus[:5]} ...")

  return cleaned_corpus

# demo use of remove_tokens() function
print('*** This is just demo only ***')
with open("stopwords_en.txt", "r", encoding="utf-8") as file: # Download stopwords_en.txt
  stop_words = set(word for word in file)
  corpus_cleaned = remove_tokens(corpus, stop_words, remove_single_char = True, print_process = True)

*** This is just demo only ***
Finish removal:
+ Before: 1206545 tokens: [['i', 'had', 'such', 'high', 'hopes', 'for', 'this', 'dress', 'and', 'really', 'wanted', 'it', 'to', 'work', 'for', 'me', 'i', 'initially', 'ordered', 'the', 'petite', 'small', 'my', 'usual', 'size', 'but', 'i', 'found', 'this', 'to', 'be', 'outrageously', 'small', 'so', 'small', 'in', 'fact', 'that', 'i', 'could', 'not', 'zip', 'it', 'up', 'i', 'reordered', 'it', 'in', 'petite', 'medium', 'which', 'was', 'just', 'ok', 'overall', 'the', 'top', 'half', 'was', 'comfortable', 'and', 'fit', 'nicely', 'but', 'the', 'bottom', 'half', 'had', 'a', 'very', 'tight', 'under', 'layer', 'and', 'several', 'somewhat', 'cheap', 'net', 'over', 'layers', 'imo', 'a', 'major', 'design', 'flaw', 'was', 'the', 'net', 'over', 'layer', 'sewn', 'directly', 'into', 'the', 'zipper', 'it', 'c'], ['i', 'love', 'love', 'love', 'this', 'jumpsuit', "it's", 'fun', 'flirty', 'and', 'fabulous', 'every', 'time', 'i', 'wear', 'it', 'i', 'get', 'noth

### 3.3. Lemmatization
This section generates a function to:
+ Lemmatize each token in `corpus` (2D list, each row stores tokens of a review in the dataset)

In [13]:
# Lemmatization
def lemmatize(corpus, print_process = False):
  '''
  Apply lemmatization to 2D token list.

  Args:
    corpus (list of list)

  Returns:
    corpus (list of list): lemmatized tokens
  '''
  result_corpus = []
  pos_map = {
    'ADJ': 'a',
    'ADP': 's',
    'ADV': 'r',
    'NOUN': 'n', # assume any undefined tags (like DET, PRON, ...) is n (NOUN)
    'VERB': 'v',
  }

  lemmatizer = WordNetLemmatizer()
  for doc in corpus:
    doc_with_tag = nltk.pos_tag(doc, tagset = 'universal') # set POS tag for all tokens in doc (tag is the type of word: NOUN, ADJ, ...)
    lemmatized_doc = [lemmatizer.lemmatize(token, pos_map.get(tag, 'n')) for token, tag in doc_with_tag] # assume any undefined tags (like DET, PRON, ...) is n (NOUN)
    result_corpus.append(lemmatized_doc)

  # Print process
  if print_process:
    print("Finish lemmatize:")
    print(f"+ Before: {sum([len(review_tokens) for review_tokens in corpus])} tokens: {corpus[:5]} ...")
    print(f"+ Now:    {sum([len(review_tokens) for review_tokens in result_corpus])} tokens: {result_corpus[:5]} ...")
  return result_corpus

# demo use of lemmatize() function
print('*** This is just demo only ***')
corpus = tokenize(df, 'Review Text')
corpus_cleaned = lemmatize(corpus, print_process = True)

*** This is just demo only ***
Finish lemmatize:
+ Before: 1206545 tokens: [['i', 'had', 'such', 'high', 'hopes', 'for', 'this', 'dress', 'and', 'really', 'wanted', 'it', 'to', 'work', 'for', 'me', 'i', 'initially', 'ordered', 'the', 'petite', 'small', 'my', 'usual', 'size', 'but', 'i', 'found', 'this', 'to', 'be', 'outrageously', 'small', 'so', 'small', 'in', 'fact', 'that', 'i', 'could', 'not', 'zip', 'it', 'up', 'i', 'reordered', 'it', 'in', 'petite', 'medium', 'which', 'was', 'just', 'ok', 'overall', 'the', 'top', 'half', 'was', 'comfortable', 'and', 'fit', 'nicely', 'but', 'the', 'bottom', 'half', 'had', 'a', 'very', 'tight', 'under', 'layer', 'and', 'several', 'somewhat', 'cheap', 'net', 'over', 'layers', 'imo', 'a', 'major', 'design', 'flaw', 'was', 'the', 'net', 'over', 'layer', 'sewn', 'directly', 'into', 'the', 'zipper', 'it', 'c'], ['i', 'love', 'love', 'love', 'this', 'jumpsuit', "it's", 'fun', 'flirty', 'and', 'fabulous', 'every', 'time', 'i', 'wear', 'it', 'i', 'get', 'no

### 3.4. Fix typos
This section generates a function to fix all potential typos in `corpus` (2D list, each row stores tokens of a review in the dataset). The method used is the combination of frequent-based strategy and built-in dictionary look up:
1. Extract vocabulary of `corpus` with term frequencies.
2. If a token is in built-in dictionary (from NLTK library), it is a correct word. Otherwise, it is a potential typos (still need to check further since it may be a correct special terminology).
3. If a token is similar to a word that is supposed to be correct, and the token appears less frequently than that word, the token is supposed to be a typo (frequent-based approach).
4. If a token is not similar to any other correct words which appear more frequent than it, that token may be correct (frequent-based approach).


In [25]:
def fix_typos(corpus, cutoff = 0.95, print_fixes = False, print_process = False):
  """
  Fix rare typos in a corpus using difflib similarity and edit distance.

  Parameters:
      corpus: 2D list of tokens
      min_freq: words with freq >= min_freq are considered correct dictionary
      cutoff: similarity threshold for difflib
      max_edit: maximum edit distance to accept as a fix
      print_fixes: whether to print each fix

  Returns:
      fixed_corpus: corpus with typos corrected
  """
  # ---- Build vocab ----
  # Flatten corpus and compute term frequency
  freq = calc_frequency(corpus, 'term')

  # Build dictionary of frequent words
  vocab = sorted(freq.keys(), key = lambda w: freq[w], reverse = True)
  builtin_dict = words.words()
  fixed_vocab = {}

  potential_typos = vocab[::-1] # reverse vocab to sort freq ascendingly

  # ---- Fix typos ----
  for token in vocab:
    if token in fixed_vocab: # token is fixed before
      continue
    fixed_vocab[token] = token # token has not been fixed by more frequent words => token is correct

    while len(potential_typos) > 0 and freq[potential_typos[-1]] >= freq[token]: # potential typos of token should not have freq >= token
      potential_typos.pop()

    matches = difflib.get_close_matches(token, potential_typos, n=1, cutoff=cutoff) # Use difflib to find close matches
    if matches:
      nearest = matches[0]
      if nearest in builtin_dict: # if nearest in dictionary, it is just a correct word which is similar to token
        fixed_vocab[nearest] = nearest
      else:
        fixed_vocab[nearest] = token # Fix nearest to token

  # ---- Generate new corpus based on correct tokens ----
  fixed_corpus = []
  for tokens in corpus:
    fixed_tokens = []
    for token in tokens:
        fixed_tokens.append(fixed_vocab[token])
    fixed_corpus.append(fixed_tokens)

  # ---- Print uncorrect tokens and New corpus ----
  if print_fixes:
    uncorrect_vocab = sorted(fixed_vocab.items())
    for token, fixed in uncorrect_vocab:
      if token != fixed:
        print(f"Fix: {token} -> {fixed} (similar score: {difflib.SequenceMatcher(None, token, fixed).ratio()})")
  
  with open('typos.txt', 'wt') as f:
    for token, replacement in fixed_vocab.items():
      if token != replacement:
        f.write(f"{token}:{replacement}\n")

  if print_process:
    print("Finish fix typos:")
    print(f"+ Before: {sum([len(review_tokens) for review_tokens in corpus])} tokens: {corpus[:5]} ...")
    print(f"+ Now:    {sum([len(review_tokens) for review_tokens in fixed_corpus])} tokens: {fixed_corpus[:5]} ...")

  return fixed_corpus

# demo use
print('*** This is just demo only ***')
corpus = tokenize(df, 'Title')
corpus_cleaned = fix_typos(corpus, print_fixes = True, print_process = True)

*** This is just demo only ***
Fix: buttondowns -> buttondown (similar score: 0.9523809523809523)
Fix: comnfortable -> confortable (similar score: 0.9565217391304348)
Fix: descriptions -> description (similar score: 0.9565217391304348)
Fix: differences -> difference (similar score: 0.9523809523809523)
Fix: disapppointing -> disappointing (similar score: 0.9629629629629629)
Fix: dissappointed -> dissapointed (similar score: 0.96)
Fix: embellishments -> embellishment (similar score: 0.9629629629629629)
Fix: emroidered -> embroidered (similar score: 0.9523809523809523)
Fix: eyecatching -> eye-catching (similar score: 0.9565217391304348)
Fix: pear-shape -> pear-shaped (similar score: 0.9523809523809523)
Fix: shortwaisted -> short-waisted (similar score: 0.96)
Fix: silhouettes -> silhouette (similar score: 0.9523809523809523)
Fix: suprisingly -> surprisingly (similar score: 0.9565217391304348)
Fix: sweatshirts -> sweatshirt (similar score: 0.9523809523809523)
Fix: turtlenceck -> turtleneck 

### 3.5. Full text preprocessing pipeline
This section generates `text_preprocessing()` function to define a pipeline:
1. Sentence segmentation and word tokenization -> use `tokenize()`.
2. Text removal (tokens with length 1 and stop words) -> use `remove_tokens()` with "stopword_en.txt" file.
3. Lemmatization -> use `lemmatize()`. This is optional, only implement if `apply_lemmatize = True`.
4. Remove tokens with term frequency = 1 -> use `calc_frequency(type = 'term')`. This is optional, only implement if `remove_outlier = True`.
5. Remove tokens with document frequency in top 20 highest -> use `calc_frequency(type = 'document')`. This is optional, only implement if `remove_outlier = True`.

In [26]:
# Function to execute all text preprocessing pipeline
def text_preprocessing(original_df, attribute, remove_outlier = False, ngram_range = [2, 3, 4], print_process = False):
  '''
    Implement full process of text preprocessing with the steps.

    Agrs:
      df (pd.DataFrame): Dataset
      attribute (str): the text attribute of df for preprocessing
      detect_outliers (bool): Implement step 4 or not

    Returns:
      df (pd.DataFrame): Dataset after preprocessing
      vocab (dict): The vocabulary of all unique tokens extracted, with their id (in alphabetical order)
  '''
  df = original_df.copy()
  if print_process:
    print(f"\n*** Proceed df[{attribute}] ***")

  # ---- Sentence segmentation -> word tokenization ----
  corpus = tokenize(df, attribute, print_process = print_process)
  removed_tokens = set()

  # ---- Find and add collocations ----
  collocations = find_collocations(corpus, ngram_range = ngram_range)
  corpus = add_collocations(corpus, collocations, print_process = print_process)
  with open('collocations.txt', 'wt') as f:
    for collocation in collocations:
      f.write(f"{collocation}\n")

  # ---- Fix typos ----
  corpus = fix_typos(corpus, cutoff = 0.9, print_process = print_process)

  # ---- Text removal (single char + stop words) ----
  if print_process:
    print('Removing stopwords and tokens with length 1')

  with open("stopwords_en.txt", "r", encoding="utf-8") as f: # Download stopwords_en.txt
    stop_words = set(w.strip().lower() for w in f if w.strip())
  corpus = remove_tokens(corpus, stop_words, remove_single_char = True, print_process = print_process) # Text removal: stopwords + tokens with length = 1
  removed_tokens.update(stop_words)

  # ---- Remove rare tokens + Remove frequent tokens ----
  if remove_outlier: # Only run if the user allow (when detect_outliers = True)

    # Remove tokens with term frequency = 1
    term_freq = calc_frequency(corpus, type="term") # calculate term frequency
    rare_tokens = set(t for t, f in term_freq.items() if f == 1) # filter tokens with term frequency = 1

    if print_process:
      print('Removing tokens with term frequency = 1')
    corpus = remove_tokens(corpus, rare_tokens, print_process = print_process) # remove rare tokens
    removed_tokens.update(rare_tokens)

    # Remove tokens with document frequency in top 20
    doc_freq = calc_frequency(corpus, type="document") # calculate document frequency
    top_tokens = set(sorted(doc_freq, key=lambda x: doc_freq[x], reverse=True)[:20]) # filter top 20 most frequent tokens (document frequency)

    if print_process:
      print('Removing tokens with document frequency in top 20')
    corpus = remove_tokens(corpus, top_tokens, print_process = print_process) # remove frequent tokens
    removed_tokens.update(top_tokens)

  # ---- Lemmatization ----
  corpus = lemmatize(corpus, print_process = print_process)

  if print_process:
    print('Remove stopwords again after lemmatization')
  corpus = remove_tokens(corpus, stop_words, remove_single_char = True, print_process = print_process)

  # ----- Output -----
  # Apply final cleaned text to the dataset
  processed_texts = [' '.join([t for t in doc]) for doc in corpus] # each row in df[attribute] will be a string (combination of tokens separated by " ")
  df[attribute] = processed_texts

  # Build vocab dict (alphabetical order)
  unique_tokens = sorted({t for doc in corpus for t in doc})
  vocab = {token: idx for idx, token in enumerate(unique_tokens)}

  # Save removed tokens
  with open('removed_tokens.txt', 'wt') as f:
    for token in removed_tokens:
      f.write(f"{token}\n")

  print(f"Finish proceed df[{attribute}]: remain {len(vocab)} unique tokens")

  return df, vocab

# demo use
print('*** This is just demo only ***')
df['Title And Review'] = (df['Title'] + ' ' + df['Review Text']).str.strip()
processed_df, vocab_both = text_preprocessing(df, 'Title And Review', remove_outlier = True, print_process = True)
processed_df

*** This is just demo only ***

*** Proceed df[Title And Review] ***
Finish tokenize df[Title And Review]: 1271913 token extracted
Finish add collocations:
+ Before: 1271913 tokens: [['some', 'major', 'design', 'flaws', 'i', 'had', 'such', 'high', 'hopes', 'for', 'this', 'dress', 'and', 'really', 'wanted', 'it', 'to', 'work', 'for', 'me', 'i', 'initially', 'ordered', 'the', 'petite', 'small', 'my', 'usual', 'size', 'but', 'i', 'found', 'this', 'to', 'be', 'outrageously', 'small', 'so', 'small', 'in', 'fact', 'that', 'i', 'could', 'not', 'zip', 'it', 'up', 'i', 'reordered', 'it', 'in', 'petite', 'medium', 'which', 'was', 'just', 'ok', 'overall', 'the', 'top', 'half', 'was', 'comfortable', 'and', 'fit', 'nicely', 'but', 'the', 'bottom', 'half', 'had', 'a', 'very', 'tight', 'under', 'layer', 'and', 'several', 'somewhat', 'cheap', 'net', 'over', 'layers', 'imo', 'a', 'major', 'design', 'flaw', 'was', 'the', 'net', 'over', 'layer', 'sewn', 'directly', 'into', 'the', 'zipper', 'it', 'c'], ['

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Clothes Title,Clothes Description,Title And Review
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Elegant A-Line Dress,"A classic A-line dress that flows gracefully, ...",major design flaw had-such-high-hopes work ini...
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,Petite High-Waisted Trousers,"Chic, high-waisted trousers designed to elonga...",favorite buy jumpsuit fun flirty fabulous time...
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Silk Button-Up Blouse,A luxurious silk blouse with a timeless button...,shirt shirt due adjustable front tie length le...
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,Elegant A-Line Dress,"A classic A-line dress that flows gracefully, ...",petite tracy-reese dress petite foot tall bran...
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits,Petite Cable Knit Sweater,A cozy cable knit sweater tailored specificall...,shimmer fun basket hte person store pick teh d...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
19657,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses,Petite Floral Midi Dress,A beautiful floral midi dress designed for pet...,occasion happy snag price easy slip cut combo
19658,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits,Petite Cable Knit Sweater,A cozy cable knit sweater tailored specificall...,make cotton remind maternity clothes stretchy ...
19659,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses,Petite Floral Midi Dress,A beautiful floral midi dress designed for pet...,work glad store order online
19660,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses,Elegant A-Line Dress,"A classic A-line dress that flows gracefully, ...",summer party buy wed summer medium waist perfe...


In [None]:
stop here

## 4. Saving required outputs
Save the requested information as per specification.
+ "vocab.txt": unigram, cleaned vocabulary of *Review Text* attribute of "assignment3.csv" dataset (stored in `df`).
+ "processed.csv": The dataset, of which attribute *Review Text* is implemented through text processing, and only the words including in the vobabulary (stored in "vocab.txt") are stored, separated by space.

In [None]:
# add "Title And Review" column as a combination of "Title" and "Review Text"
df['Title And Review'] = (df['Title'] + ' ' + df['Review Text']).str.strip()
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Title And Review
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Some major design flaws I had such high hopes ...
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,"My favorite buy! I love, love, love this jumps..."
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Flattering shirt This shirt is very flattering...
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,Not for the very petite I love tracy reese dre...
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits,Cagrcoal shimmer fun I aded this in my basket ...


In [None]:
# Function to save df and vocab after text preprocessing pipeline
def save_processed(df, vocab, df_filename, vocab_filename):
  df.to_csv(df_filename)
  with open(vocab_filename, 'wt') as f:
    for token, i in vocab.items():
      f.write(f'{token}:{i}\n')

In [None]:
# save result when implement whole pipeline using n-grams collocation detection with n = [2, 3]

# processed_df = df.copy()
# processed_df, vocab_text = text_preprocessing(processed_df, 'Review Text', remove_outlier = True, ngram_range = [2, 3])
# processed_df, vocab_title = text_preprocessing(processed_df, 'Title', remove_outlier = True, ngram_range = [2, 3])
# processed_df, vocab_both = text_preprocessing(processed_df, 'Title And Review', remove_outlier = True, ngram_range = [2, 3])

# save_processed(processed_df, vocab_text, 'processed_23.csv', 'vocab_text_23.txt')
# save_processed(processed_df, vocab_title, 'processed_23.csv', 'vocab_title_23.txt')
# save_processed(processed_df, vocab_both, 'processed_23.csv', 'vocab_both_23.txt')

Finish proceed df[Review Text]: remain 6347 unique tokens
Finish proceed df[Title]: remain 1464 unique tokens
Finish proceed df[Title And Review]: remain 6641 unique tokens


In [None]:
# view the dataset again to check
# processed_df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Title And Review
0,1077,60,major design flaw,high hope work initially petite usual find out...,3,0,0,General,Dresses,Dresses,major design flaws high hope work initially pe...
1,1049,50,favorite buy,jumpsuit fun flirty fabulous time compliment,5,1,0,General Petite,Bottoms,Pants,favorite buy jumpsuit fun flirty fabulous time...
2,847,47,shirt,shirt due adjustable front tie length legging ...,5,1,6,General,Tops,Blouses,shirt shirt due adjustable front tie length le...
3,1080,49,petite,tracy-reese dress petite foot tall brand prett...,2,0,4,General,Dresses,Dresses,petite tracy-reese dress petite foot tall bran...
4,858,39,shimmer fun,basket hte person store pick teh darker pale h...,5,1,1,General Petite,Tops,Knits,shimmer fun basket hte person store pick teh d...


In [None]:
# save result when implement whole pipeline using n-grams collocation detection with n = [2, 3, 4]

processed_df = df.copy()
processed_df, vocab_text = text_preprocessing(processed_df, 'Review Text', remove_outlier = True, ngram_range = [2, 3, 4])
processed_df, vocab_title = text_preprocessing(processed_df, 'Title', remove_outlier = True, ngram_range = [2, 3, 4])
processed_df, vocab_both = text_preprocessing(processed_df, 'Title And Review', remove_outlier = True, ngram_range = [2, 3, 4])

# support model_experiment.ipynb and task2_3.ipynb
# save_processed(processed_df, vocab_text, 'processed_234.csv', 'vocab_text_234.txt')
# save_processed(processed_df, vocab_title, 'processed_234.csv', 'vocab_title_234.txt')
# save_processed(processed_df, vocab_both, 'processed_234.csv', 'vocab_both_234.txt')

# save result for Task 1
# save_processed(processed_df, vocab_text, 'processed.csv', 'vocab.txt') # vocab.txt store vocab of Review Text -- as required by Task 1 requirements
# save_processed(processed_df, vocab_text, 'processed.csv', 'vocab_text.txt')
# save_processed(processed_df, vocab_title, 'processed.csv', 'vocab_title.txt')
# save_processed(processed_df, vocab_both, 'processed.csv', 'vocab_both.txt')

Finish proceed df[Review Text]: remain 6362 unique tokens
Finish proceed df[Title]: remain 1477 unique tokens
Finish proceed df[Title And Review]: remain 6658 unique tokens


In [None]:
# view the dataset again to check
processed_df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Title And Review
0,1077,60,major design flaw,had-such-high-hopes work initially petite usua...,3,0,0,General,Dresses,Dresses,major design flaw had-such-high-hopes work ini...
1,1049,50,favorite buy,jumpsuit fun flirty fabulous time compliment,5,1,0,General Petite,Bottoms,Pants,favorite buy jumpsuit fun flirty fabulous time...
2,847,47,shirt,shirt due adjustable front tie length legging ...,5,1,6,General,Tops,Blouses,shirt shirt due adjustable front tie length le...
3,1080,49,petite,tracy-reese dress petite foot tall brand prett...,2,0,4,General,Dresses,Dresses,petite tracy-reese dress petite foot tall bran...
4,858,39,shimmer fun,basket hte person store pick teh darker pale h...,5,1,1,General Petite,Tops,Knits,shimmer fun basket hte person store pick teh d...


In [1]:
from gensim.models import FastText
import joblib
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Function to read vocabulary file to a Python dict()
def read_vocab(filename):
  vocab = {}
  with open(filename, 'r') as f:
    vocab = {line.split(':')[0]: int(line.split(':')[1]) for line in f} # Convert index to integer
  return vocab

# Function to calculate weighted vectors (document representation) based on an embedding model loaded in advance
def calc_weighted_vectors(df, attribute, vocab_dict, model, tfidf_vectorizer = None):
  '''
  Calculates TF-IDF weighted document vectors.

  Args:
    df: The DataFrame containing the text data.
    attribute: The column name in the DataFrame with the text.
    vocab_dict: A dictionary mapping vocabulary tokens to their unique IDs.
    model: The pre-trained word embedding model (e.g., Word2Vec, FastText).

  Returns:
    numpy.ndarray: A 2D array where each row is the weighted vector for a document.
  '''
  # Use TfidfVectorizer with the predefined vocabulary to get TF-IDF scores
  if tfidf_vectorizer is None:
    tfidf_vectorizer = TfidfVectorizer(analyzer = 'word', vocabulary = vocab_dict, lowercase = True)
    tfidf_matrix = tfidf_vectorizer.fit_transform(df[attribute].fillna('')) # Fill NaN values with empty strings before vectorization
  else:
    tfidf_matrix = tfidf_vectorizer.transform(df[attribute].fillna(''))

  # Precompute embedding matrix aligned with vocab_dict
  embedding_matrix = np.zeros((len(vocab_dict), model.wv.vector_size))
  for token, idx in vocab_dict.items():
    if token in model.wv.key_to_index:  # Check if token exists in pretrained model
      embedding_matrix[idx] = model.wv[token]
    # else remains zero vector

  # Compute Weighted Review Vectors (TF-IDF weighted mean)
  weighted_vectors = []
  for doc_idx in range(tfidf_matrix.shape[0]):
    row = tfidf_matrix.getrow(doc_idx)
    indices = row.indices # only get element that is not = 0
    weights = row.data

    if len(indices) == 0:
      weighted_vectors.append(np.zeros(model.wv.vector_size))
      continue

    # Get the corresponding word vectors from the precomputed embedding matrix
    word_vecs = embedding_matrix[indices]

    # Perform a dot product to get the weighted sum
    weighted_sum = np.dot(weights, word_vecs)
    weighted_avg = weighted_sum / weights.sum()
    weighted_vectors.append(weighted_avg)

  return np.vstack(weighted_vectors)  # shape: (n_docs, vector_size)

# ---- Define model to predict (vote of 3 models) ----

# df = pd.read_csv('processed.csv')
vocab_both = read_vocab('vocab_both.txt')
fasttext_both_model = FastText.load('fasttext_model.model')

tfidf_vectorizer = joblib.load('tfidf_vectorizer.pkl')

vote_clf = joblib.load('voting_classifier.pkl')

# demo use of pipeline to fit and predict
data = pd.DataFrame({
  'New Review': ['I love this skirt', 'I hate this', 'had such high hopes', ' I dont like it', 'Yuck', 'Would love to recommend it']
})
wv = calc_weighted_vectors(data, 'New Review', vocab_both, fasttext_both_model, tfidf_vectorizer)
print(vote_clf.predict(wv))

[1 0 0 0 0 1]
