# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Arya Ramesh Patil
#### Student ID: S4060675


Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* numpy
* nltk
* `itertools` (for `chain`)
* `__future__` (for `division`; has no effect on Python 3)

## Introduction
This part of the assessment focuses on basic text pre-processing. Pre-processing can also include sentence segmentation, stemming and lemmatisation, but here the focus is on tokenisation, case normalisation, and the removal of stop words and of the most and least frequent words. Without these basic steps, it is difficult to build a working machine learning model on text. The week 7 activities and lecture slides helped me understand the pre-processing steps in depth and develop the code for this part of the assignment.


## Importing libraries 

In [1]:
# importing libraries
from __future__ import division # no effect on Python 3, where / is already true division
import pandas as pd
import numpy as np
import nltk
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from itertools import chain
from nltk.probability import FreqDist # explicit import instead of a wildcard import

### 1.1 Examining and loading data
On loading the CSV file into a pandas DataFrame, I observed that it has 19662 rows and 10 columns. As per the specification, Task 1 works on the 'Review Text' column, which holds the free-text review of a clothing item as a string. I extracted this column into the review_text variable to carry out the pre-processing steps below.

In [2]:
# assigning file name to csv_file variable
csv_file = 'assignment3.csv'

In [3]:
# importing the said csv file into a dataframe
clothes_review_data = pd.read_csv(csv_file, sep = ',')

In [4]:
# checking the data
clothes_review_data

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits
...,...,...,...,...,...,...,...,...,...,...
19657,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
19658,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
19659,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
19660,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


In [5]:
# extracting the 'Review Text' column for pre-processing
review_text = clothes_review_data['Review Text']

In [6]:
# checking the data
review_text

0        I had such high hopes for this dress and reall...
1        I love, love, love this jumpsuit. it's fun, fl...
2        This shirt is very flattering to all due to th...
3        I love tracy reese dresses, but this one is no...
4        I aded this in my basket at hte last mintue to...
                               ...                        
19657    I was very happy to snag this dress at such a ...
19658    It reminds me of maternity clothes. soft, stre...
19659    This fit well, but the top was very see throug...
19660    I bought this dress for a wedding i have this ...
19661    This dress in a lovely platinum is feminine an...
Name: Review Text, Length: 19662, dtype: object

### 1.2 Pre-processing data
The required text pre-processing steps are:
* Case Normalisation: All the text is converted into lowercase.
* Tokenisation: Splitting the reviews into tokens.
* Removing words based on given conditions:
  - Removing words with length less than 2
  - Removing stopwords from given stopwords_en.txt file
  - Removing words based on term frequency
  - Removing words based on document frequency

#### Case Normalisation and Tokenisation
Here I defined a function 'tokenize_reviews' to perform case normalisation and split the reviews into tokens.

In [7]:
# [1] w07_act1_gen_feat_vec - cell 5
# function for case normalisation and tokenisation
def tokenize_reviews(review_text):
    nl_review = review_text.lower() # converting reviews to lowercase
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?" # regex pattern to tokenise the reviews
    tokenizer = RegexpTokenizer(pattern) # tokeniser to split the reviews based on the regex pattern
    tokenized_review = tokenizer.tokenize(nl_review) # passing normalised reviews to get list of tokens
    return tokenized_review # returning cleaned list of tokens

In [8]:
tokenized_reviews = review_text.apply(tokenize_reviews) # applying the function to get cleaned list of tokens # [2]
tokenized_reviews # checking the data

0        [i, had, such, high, hopes, for, this, dress, ...
1        [i, love, love, love, this, jumpsuit, it's, fu...
2        [this, shirt, is, very, flattering, to, all, d...
3        [i, love, tracy, reese, dresses, but, this, on...
4        [i, aded, this, in, my, basket, at, hte, last,...
                               ...                        
19657    [i, was, very, happy, to, snag, this, dress, a...
19658    [it, reminds, me, of, maternity, clothes, soft...
19659    [this, fit, well, but, the, top, was, very, se...
19660    [i, bought, this, dress, for, a, wedding, i, h...
19661    [this, dress, in, a, lovely, platinum, is, fem...
Name: Review Text, Length: 19662, dtype: object
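To see what the pattern in `tokenize_reviews` accepts, the following minimal sketch applies the same regex with the standard-library `re.findall`. This assumes that `re.findall` behaves like `RegexpTokenizer.tokenize` for a pattern without capturing groups. The sample sentence and the helper name `tokenize_sample` are made up for illustration.

```python
import re

# same pattern as in tokenize_reviews: a word, optionally followed by
# ONE hyphen- or apostrophe-joined part
PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"


def tokenize_sample(text):
    """Lowercase the text and return every match of the pattern (hypothetical helper)."""
    return re.findall(PATTERN, text.lower())


sample = "It's an A-line, off-the-shoulder dress!"
tokens = tokenize_sample(sample)
print(tokens)
```

Because the optional group can match only once, a doubly hyphenated word such as `off-the-shoulder` is split into `off-the` and `shoulder`. This would explain entries such as `above-the` in the final vocabulary.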

In [9]:
# [1] w07_act1_gen_feat_vec.ipynb - cell 8
# statistics to get an idea of document length, number of tokens, vocabulary size
def stats_print(tokenized_reviews):
    words = list(chain.from_iterable(tokenized_reviews)) # putting tokens in single list
    vocab = set(words) # getting set of unique words
    lexical_diversity = len(vocab)/len(words) # ratio of unique words to total words
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of reviews:", len(tokenized_reviews))
    lens = [len(article) for article in tokenized_reviews] # calculating length of each review
    print("Average document length:", np.mean(lens))
    print("Maximum document length:", np.max(lens))
    print("Minimum document length:", np.min(lens))
    print("Standard deviation of document length:", np.std(lens))

In [10]:
stats_print(tokenized_reviews)

Vocabulary size:  14806
Total number of tokens:  1206688
Lexical diversity:  0.012269948818584423
Total number of reviews: 19662
Average document length: 61.37157969687723
Maximum document length: 113
Minimum document length: 2
Standard deviation of document length: 27.802596969841698
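As a small worked example of the ratio computed in `stats_print`, the sketch below calculates lexical diversity on a hypothetical two-review corpus (`toy_reviews` is invented for illustration and is not part of the dataset).

```python
from itertools import chain

# hypothetical toy corpus: each inner list is one tokenised review
toy_reviews = [
    ["love", "this", "dress"],
    ["love", "the", "fit", "love", "it"],
]

words = list(chain.from_iterable(toy_reviews))  # all tokens, duplicates kept
vocab = set(words)                              # unique tokens only
lexical_diversity = len(vocab) / len(words)     # unique tokens / total tokens

print(len(words), len(vocab), lexical_diversity)
```

Here 6 unique words over 8 tokens give a diversity of 0.75. For the real data, 14806 / 1206688 ≈ 0.0123, so each distinct word occurs roughly 81 times on average.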


In [11]:
stopwords_file = "stopwords_en.txt" # path to the stopwords file

In [12]:
# [1] w07_act1_gen_feat_vec.ipynb - cell 16
# reading the file
with open(stopwords_file, 'r') as file: # opens file in read mode
    stopwords = file.read().splitlines() # splits on line breaks; the file has one stopword per line
stopwords # checking the data

['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 ...

In [13]:
# getting unique stopwords
stopwords = set(stopwords)
len(stopwords) # checking the number of stopwords

570

#### Removing words
Here I defined a function 'remove_words' to remove words based on the given conditions.

In [14]:
# function for removal of words
def remove_words(tokens, stopwords):
    tokens = [token for token in tokens if len(token) >= 2] # removing words with the length less than 2
    tokens = [token for token in tokens if token not in stopwords] # removing stopwords 
    return tokens # returning cleaned list of tokens

In [15]:
# applying the function on tokenized reviews to remove the words on said condition
cleaned_reviews = tokenized_reviews.apply(lambda tokens: remove_words(tokens, stopwords)) # [2]
cleaned_reviews # checking the data

0        [high, hopes, dress, wanted, work, initially, ...
1        [love, love, love, jumpsuit, fun, flirty, fabu...
2        [shirt, flattering, due, adjustable, front, ti...
3        [love, tracy, reese, dresses, petite, feet, ta...
4        [aded, basket, hte, mintue, person, store, pic...
                               ...                        
19657    [happy, snag, dress, great, price, easy, slip,...
19658    [reminds, maternity, clothes, soft, stretchy, ...
19659    [fit, top, worked, glad, store, order, online,...
19660    [bought, dress, wedding, summer, cute, fit, pe...
19661    [dress, lovely, platinum, feminine, fits, perf...
Name: Review Text, Length: 19662, dtype: object
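The two filters in `remove_words` can be illustrated on a small, made-up token list. In this sketch, `toy_stopwords`, `toy_tokens` and `remove_words_sample` are hypothetical names, and the two conditions are merged into one list comprehension, which should give the same result as applying them one after the other.

```python
# hypothetical stopword set and token list, for illustration only
toy_stopwords = {"this", "is", "so", "and"}
toy_tokens = ["this", "dress", "is", "so", "s", "cute", "and", "comfy", "x"]


def remove_words_sample(tokens, stopwords):
    """Keep tokens that have at least 2 characters and are not stopwords."""
    return [t for t in tokens if len(t) >= 2 and t not in stopwords]


kept = remove_words_sample(toy_tokens, toy_stopwords)
print(kept)
```

Storing the stopwords in a `set` keeps each membership test at constant time on average, which matters when about 1.2 million tokens have to be checked.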

In [16]:
stats_print(cleaned_reviews) # statistics on cleaned reviews

Vocabulary size:  14283
Total number of tokens:  452692
Lexical diversity:  0.031551253390826435
Total number of reviews: 19662
Average document length: 23.023700539110976
Maximum document length: 51
Minimum document length: 1
Standard deviation of document length: 10.165913222944233


In [17]:
# [1] w07_act1_gen_feat_vec.ipynb - cell 11
words = list(chain.from_iterable(cleaned_reviews)) # putting tokens in single list
vocab = set(words) # getting set of unique words

In [18]:
len(words) # checking the number of tokens

452692

In [19]:
len(vocab) # checking the number of unique tokens

14283

In [20]:
# [1] w07_act1_gen_feat_vec.ipynb - cell 12
term_fd = FreqDist(words) # computing term frequency for each unique word

In [21]:
term_fd # checking the data

FreqDist({'dress': 9334, 'size': 7860, 'love': 7722, 'fit': 6582, 'top': 6542, 'wear': 5715, 'great': 5302, 'fabric': 4306, 'color': 4099, 'small': 4097, ...})

In [22]:
# removing the words that only appear once in the document collection
term_freq = cleaned_reviews.apply(lambda tokens: [word for word in tokens if term_fd[word] > 1]) # [2]

In [23]:
term_freq # checking the data

0        [high, hopes, dress, wanted, work, initially, ...
1        [love, love, love, jumpsuit, fun, flirty, fabu...
2        [shirt, flattering, due, adjustable, front, ti...
3        [love, tracy, reese, dresses, petite, feet, ta...
4        [basket, hte, person, store, pick, teh, color,...
                               ...                        
19657    [happy, snag, dress, great, price, easy, slip,...
19658    [reminds, maternity, clothes, soft, stretchy, ...
19659    [fit, top, worked, glad, store, order, online,...
19660    [bought, dress, wedding, summer, cute, fit, pe...
19661    [dress, lovely, feminine, fits, perfectly, eas...
Name: Review Text, Length: 19662, dtype: object

In [24]:
# [1] w07_act1_gen_feat_vec.ipynb - cell 15
words_2 = list(chain.from_iterable([set(review) for review in term_freq])) # each review contributes each of its words at most once
doc_fd = FreqDist(words_2)  # document frequency: number of reviews in which each word appears
doc_fd # checking the data

FreqDist({'love': 6416, 'size': 5888, 'fit': 5537, 'dress': 5346, 'wear': 4900, 'top': 4670, 'great': 4497, 'fabric': 3712, 'color': 3604, 'small': 3265, ...})
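The difference between term frequency and document frequency can be seen on a toy corpus. The sketch below uses `collections.Counter` as a stand-in for `FreqDist` (an assumption made to keep the example free of external dependencies), and `toy_reviews` is invented for illustration.

```python
from collections import Counter
from itertools import chain

# hypothetical toy corpus of three tokenised reviews
toy_reviews = [
    ["love", "love", "love", "jumpsuit"],
    ["love", "dress"],
    ["dress", "dress", "fit"],
]

# term frequency: every occurrence counts
term_counts = Counter(chain.from_iterable(toy_reviews))

# document frequency: converting each review to a set first means
# a word counts at most once per review
doc_counts = Counter(chain.from_iterable(set(review) for review in toy_reviews))

print(term_counts["love"], doc_counts["love"])
print(term_counts["dress"], doc_counts["dress"])
```

In the toy data, `love` occurs 4 times but in only 2 reviews. The same effect is visible in the real output above: `dress` has the highest term frequency (9334), while `love` has the highest document frequency (6416).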

In [25]:
# [1] w07_act1_gen_feat_vec.ipynb - cell 15
top20_freq_words = doc_fd.most_common(20) # top 20 (word, document frequency) pairs
top20_freq_words = [word for word, freq in top20_freq_words] # keeping only the words, without their frequencies
top20_freq_words # checking the data

['love',
 'size',
 'fit',
 'dress',
 'wear',
 'top',
 'great',
 'fabric',
 'color',
 'small',
 'ordered',
 'perfect',
 'flattering',
 'soft',
 'comfortable',
 'back',
 'cute',
 'fits',
 'nice',
 'bought']

In [26]:
# removing the top 20 most frequent words based on document frequency
processed_reviews = term_freq.apply(lambda tokens: [word for word in tokens if word not in top20_freq_words]) # [2]

In [27]:
processed_reviews # checking the data

0        [high, hopes, wanted, work, initially, petite,...
1        [jumpsuit, fun, flirty, fabulous, time, compli...
2        [shirt, due, adjustable, front, tie, length, l...
3        [tracy, reese, dresses, petite, feet, tall, br...
4        [basket, hte, person, store, pick, teh, pale, ...
                               ...                        
19657         [happy, snag, price, easy, slip, cut, combo]
19658    [reminds, maternity, clothes, stretchy, shiny,...
19659                 [worked, glad, store, order, online]
19660    [wedding, summer, medium, waist, perfectly, lo...
19661    [lovely, feminine, perfectly, easy, comfy, hig...
Name: Review Text, Length: 19662, dtype: object
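The following sketch shows the same removal step on a toy corpus and also how a review can end up empty afterwards. `Counter` again stands in for `FreqDist`, `TOP_N` is a hypothetical cut-off (the notebook uses 20), and the top words are held in a set rather than a list purely for faster lookups.

```python
from collections import Counter
from itertools import chain

# hypothetical toy corpus of three tokenised reviews
toy_reviews = [
    ["love", "dress", "soft"],
    ["love", "dress"],
    ["love", "runs", "small"],
]
TOP_N = 2  # hypothetical cut-off; the notebook uses 20

# document frequency of each word
doc_counts = Counter(chain.from_iterable(set(r) for r in toy_reviews))

# the TOP_N words with the highest document frequency, as a set
top_words = {word for word, _ in doc_counts.most_common(TOP_N)}

filtered = [[w for w in review if w not in top_words] for review in toy_reviews]
print(sorted(top_words))
print(filtered)
```

The second toy review consists only of the most frequent words, so it becomes an empty list after filtering.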

In [28]:
len(processed_reviews) # checking the number of reviews

19662

In [29]:
stats_print(processed_reviews) # statistics on processed reviews

Vocabulary size:  7529
Total number of tokens:  355505
Lexical diversity:  0.021178323792914306
Total number of reviews: 19662
Average document length: 18.080815786796865
Maximum document length: 47
Minimum document length: 0
Standard deviation of document length: 8.833524535391433


In [30]:
# noticed that minimum document length is 0, hence, computed the number of empty lists
empty_lists = processed_reviews.apply(lambda x: len(x) == 0).sum() # [2]
empty_lists

10

In [31]:
# appending processed_reviews column to the original dataframe
clothes_review_data['Processed Review Text'] = processed_reviews

In [32]:
clothes_review_data # checking the data

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Processed Review Text
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,"[high, hopes, wanted, work, initially, petite,..."
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,"[jumpsuit, fun, flirty, fabulous, time, compli..."
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,"[shirt, due, adjustable, front, tie, length, l..."
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,"[tracy, reese, dresses, petite, feet, tall, br..."
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits,"[basket, hte, person, store, pick, teh, pale, ..."
...,...,...,...,...,...,...,...,...,...,...,...
19657,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses,"[happy, snag, price, easy, slip, cut, combo]"
19658,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits,"[reminds, maternity, clothes, stretchy, shiny,..."
19659,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses,"[worked, glad, store, order, online]"
19660,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses,"[wedding, summer, medium, waist, perfectly, lo..."


I drop the reviews that end up as empty lists after pre-processing. Because the data is text, there is no meaningful value to put in place of an empty review, and techniques for handling empty instances have not been covered yet. We were also instructed to drop empty instances during the lectorial session.

In [33]:
# dropping the rows with empty lists
df_cleaned = clothes_review_data[clothes_review_data['Processed Review Text'].apply(lambda x: len(x) != 0)] # [2]
df_cleaned # checking the data

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Processed Review Text
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,"[high, hopes, wanted, work, initially, petite,..."
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,"[jumpsuit, fun, flirty, fabulous, time, compli..."
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,"[shirt, due, adjustable, front, tie, length, l..."
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,"[tracy, reese, dresses, petite, feet, tall, br..."
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits,"[basket, hte, person, store, pick, teh, pale, ..."
...,...,...,...,...,...,...,...,...,...,...,...
19657,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses,"[happy, snag, price, easy, slip, cut, combo]"
19658,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits,"[reminds, maternity, clothes, stretchy, shiny,..."
19659,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses,"[worked, glad, store, order, online]"
19660,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses,"[wedding, summer, medium, waist, perfectly, lo..."


In [34]:
# cross-checking that no empty reviews remain in the cleaned dataframe
empty_lists = df_cleaned['Processed Review Text'].apply(lambda x: len(x) == 0).sum() # [2]
empty_lists

0

In [35]:
stats_print(df_cleaned['Processed Review Text']) # statistics on processed reviews

Vocabulary size:  7529
Total number of tokens:  355505
Lexical diversity:  0.021178323792914306
Total number of reviews: 19652
Average document length: 18.09001628332994
Maximum document length: 47
Minimum document length: 1
Standard deviation of document length: 8.826348342078324


Dropping the rows with empty reviews also removes their index values, which leaves gaps in the DataFrame index, so I reset it. It is good practice to do this before exporting the data: a sequential index keeps the DataFrame readable and consistent, and avoids potential issues in later tasks.

In [36]:
# resetting the index of the dataframe
df_cleaned = df_cleaned.reset_index(drop=True) # [3]
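A minimal sketch of the drop-and-reset step on a made-up three-row DataFrame follows. `toy_df` and its contents are hypothetical, and `map(len) > 0` is used here as an equivalent of the `apply(lambda x: len(x) != 0)` filter above.

```python
import pandas as pd

# hypothetical miniature of the dataframe; row 1 has an empty token list
toy_df = pd.DataFrame({
    "Review Text": ["so soft", "love it", "runs small"],
    "Processed Review Text": [["soft"], [], ["runs", "small"]],
})

non_empty = toy_df["Processed Review Text"].map(len) > 0  # boolean mask
kept = toy_df[non_empty]
index_before = list(kept.index)  # gap where row 1 was dropped

# drop=True discards the old index instead of keeping it as a new column
kept = kept.reset_index(drop=True)
index_after = list(kept.index)

print(index_before, index_after)
```

Before the reset the index reads `[0, 2]`; afterwards it is sequential again.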

## Saving required outputs
Saving the requested information as per specification.
- processed.csv
- vocab.txt

In [37]:
# saving the processed data to the 'processed.csv' file
df_cleaned.to_csv('processed.csv', index=False)
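One point to keep in mind for later tasks is that a CSV cell can only hold text, so a list of tokens is stored as its string representation. The sketch below illustrates this with the standard-library `csv` module on an in-memory buffer. As far as I know, `to_csv` stringifies list cells in the same way, but that is an assumption here rather than something verified in this notebook. The token list is made up.

```python
import ast
import csv
import io

tokens = ["high", "hopes", "wanted"]  # hypothetical processed review

# write a header and one row; csv.writer converts non-string cells with str()
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Processed Review Text"])
writer.writerow([tokens])

# read it back: the cell is now a plain string, not a list
buffer.seek(0)
rows = list(csv.reader(buffer))
cell = rows[1][0]

# ast.literal_eval safely parses the list literal back into a list
restored = ast.literal_eval(cell)

print(type(cell).__name__, cell)
print(restored)
```

If the token lists are needed again after reloading `processed.csv`, a conversion step such as `ast.literal_eval` would be one way to recover them.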

In [38]:
# flattening the reviews into a single list, keeping only the unique words and sorting them alphabetically
vocabulary = sorted(set(chain.from_iterable(processed_reviews))) # [4]
vocabulary # checking the data

['a-cup',
 'a-flutter',
 'a-frame',
 'a-kind',
 'a-line',
 'a-lines',
 'a-symmetric',
 'aa',
 'ab',
 'abbey',
 'abby',
 'abdomen',
 'ability',
 'abnormally',
 'abo',
 'abou',
 'above-the',
 'abroad',
 'abs',
 'absolute',
 'absolutely',
 'absolutley',
 'absolutly',
 'abstract',
 'absurd',
 'abt',
 'abundance',
 'ac',
 'accent',
 'accented',
 'accenting',
 'accents',
 'accentuate',
 'accentuated',
 'accentuates',
 'accentuating',
 'accept',
 'acceptable',
 'accepted',
 'access',
 'accessories',
 'accessorize',
 'accessorized',
 'accessorizing',
 'accessory',
 'accident',
 'accidental',
 'accidentally',
 'accommodate',
 'accommodated',
 'accommodates',
 'accommodating',
 'accomodate',
 'accompanying',
 'accomplish',
 'accordian',
 'account',
 'accurate',
 'accurately',
 'acetate',
 'achieve',
 'acrylic',
 'act',
 'action',
 'active',
 'activewear',
 'activities',
 'acts',
 'actual',
 'actuality',
 'ad',
 'ada',
 'add',
 'add-on',
 'added',
 'addict',
 'addicted',
 'adding',
 'addition',
 

In [39]:
# [1] w07_act1_gen_feat_vec.ipynb - cell 55
# creating the file and opening it in write mode; the with block closes it automatically
with open("./vocab.txt", 'w') as out_file:
    # looping through each word in the vocabulary together with its index
    for ind, word in enumerate(vocabulary):
        out_file.write(f"{word}:{ind}\n") # writing in 'word_string:word_integer_index' format
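To check the `word_string:word_integer_index` format, the sketch below writes a made-up three-word vocabulary to an in-memory buffer (instead of `vocab.txt`) and parses it back into a dictionary. The names `buffer` and `word_to_index` and the sample words are hypothetical.

```python
import io

vocabulary = sorted({"dress", "a-line", "comfy"})  # hypothetical mini vocabulary

# write 'word:index' lines to an in-memory buffer instead of vocab.txt
buffer = io.StringIO()
for index, word in enumerate(vocabulary):
    buffer.write(f"{word}:{index}\n")

# parse the lines back into a word -> index mapping
word_to_index = {}
for line in buffer.getvalue().splitlines():
    word, index = line.rsplit(":", 1)  # split on the last colon only
    word_to_index[word] = int(index)

print(buffer.getvalue())
print(word_to_index)
```

Since the tokeniser pattern only produces letters, hyphens and apostrophes, no word can contain a colon, so the separator is unambiguous.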

## Summary
This part of the assignment highlights the importance of text pre-processing in building machine learning models. It prepares the text data for later steps, such as creating document vectors and feeding them into classification models. It is crucial to follow the exact formatting requirements, especially for the vocabulary file, so that the outputs are compatible with subsequent tasks. The activities and lectorial material were well suited to these tasks and helped me understand the concepts thoroughly.

## References

[1] Canvas/Modules/Week 7 - Activities/w07_activities/w07_act1_gen_feat_vec.ipynb https://rmit.instructure.com/courses/125024/pages/week-7-activities-2?module_item_id=6449422 <br>
[2] Usage of apply(): https://stackoverflow.com/questions/36213383/pandas-dataframe-how-to-apply-function-to-a-specific-column <br>
[3] Usage of reset_index(): https://stackoverflow.com/questions/20490274/how-to-reset-index-in-a-pandas-dataframe <br>
[4] Usage of sorted(): https://stackoverflow.com/questions/32072076/find-the-unique-values-in-a-column-and-then-sort-them <br>