# CZ4034 Information Retrieval Project
### Crawling Additional Steps: Preprocessing Code for Crawled Corpus
###### Note: Please select "Runtime" > "Run All" in the navigation bar when running this file in **Google Colab** to ensure smooth execution.

- Read the original crawled corpus file, reviews_combined.csv, and create a dataframe.

  Note: Upload the reviews_combined.csv file to Google Colab using the "Upload to Session Storage" button on the left.

In [1]:
import pandas as pd

df_og = pd.read_csv('reviews_combined.csv')

- Import and download necessary libraries for tokenization and lemmatization.

In [2]:
import nltk

# for tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# for lemmatization
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

- Apply the following preprocessing steps to the `Review` column: 
conversion to lowercase, decoding of HTML entities, removal of non-ASCII characters, removal of punctuation, tokenization, removal of a custom set of stopwords, removal of hashtags, and lemmatization.

In [4]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
import re
import html

# load the CSV file
df = pd.read_csv('reviews_combined.csv')

# define a function to perform the pre-processing steps
def preprocess_text(text):
    # convert to lowercase
    text = text.lower()

    # Decode HTML entities
    text = html.unescape(text)

    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Replace " t " with "t " to rejoin words like "couldn t ", which appear when a
    # non-ASCII (curly) apostrophe has been replaced by a space in the step above
    text = text.replace(" t ", "t ")
    
    # tokenization
    tokens = word_tokenize(text)
    
    # remove custom set of stopwords
    stop_words = set(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'if', 'or', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'nor', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 'will', 'just', 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ma'])
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
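    # remove hashtags
    # (note: '#' is part of string.punctuation and was already stripped above,
    # so no token starts with '#' here and this filter leaves the list unchanged)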
    filtered_tokens = [token for token in filtered_tokens if not token.startswith('#')]
    
    # lemmatization
    lemmatizer = WordNetLemmatizer()
    filtered_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    
    # join the filtered tokens into a string
    text = ' '.join(filtered_tokens)
    
    return text

# apply the pre-processing function to the review column in the DataFrame
df['Review'] = df['Review'].apply(preprocess_text)

# output the file as preprocessed_reviews_combined.csv
df.to_csv('preprocessed_reviews_combined.csv', index=False)
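
The character-level steps in the cell above can be tried in isolation, without NLTK. The following is a minimal sketch, not the project's code: `clean_characters` and its `drop_hashtags` flag are hypothetical names introduced here, and the sample sentence is made up. It illustrates two things: a curly apostrophe becomes a space (hence the `" t "` fix), and a hashtag only disappears if it is dropped before punctuation is stripped.

```python
import html
import re
import string

def clean_characters(text, drop_hashtags=False):
    """Hypothetical helper: the steps of preprocess_text that run before tokenization."""
    # lowercase and decode HTML entities such as &amp;
    text = html.unescape(text.lower())
    if drop_hashtags:
        # assumption: a hashtag is '#' followed by word characters
        text = re.sub(r'#\w+', ' ', text)
    # replace runs of non-ASCII characters (e.g. a curly apostrophe) with a space
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # delete ASCII punctuation, including '#'
    text = text.translate(str.maketrans('', '', string.punctuation))
    # rejoin "couldn t" into "couldnt"
    return text.replace(" t ", "t ")

# made-up sample review, not taken from the corpus
sample = "I couldn’t fault the room &amp; staff. #BestStay"

print(clean_characters(sample).split())
# ['i', 'couldnt', 'fault', 'the', 'room', 'staff', 'beststay']
print(clean_characters(sample, drop_hashtags=True).split())
# ['i', 'couldnt', 'fault', 'the', 'room', 'staff']
```

With the default order, the hashtag text survives as an ordinary word because only the `#` is removed. Whether hashtag words should be kept or dropped is a design choice for the corpus; the sketch only shows that the order of the steps decides it.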

- Find the total number of records in the corpus (it will be the same before and after preprocessing).

In [5]:
# Load CSV file into a Pandas dataframe
df = pd.read_csv('reviews_combined.csv')

# Get the number of rows in the dataframe
num_rows = len(df)

# Print the total number of records
print('Total number of records:', num_rows)

Total number of records: 17669


- Find the total number of words in column `Review` before and after preprocessing.

In [6]:
# Load the dataset into a DataFrame
df = pd.read_csv('reviews_combined.csv')
df2 = pd.read_csv('preprocessed_reviews_combined.csv')

# Define the column that contains the text
text_column = 'Review'

# Concatenate all the texts in the column into a single string
all_text = ' '.join(df[text_column].tolist())
all_text2 = ' '.join(df2[text_column].tolist())

# Split the text into words
words = all_text.split()
words2 = all_text2.split()

# Count the number of words
num_words = len(words)
num_words2 = len(words2)

# Print the total number of words in column Review before and after preprocessing
print(f'Total number of words in column "{text_column}" (before preprocessing): {num_words}')
print(f'Total number of words in column "{text_column}" (after preprocessing): {num_words2}')

Total number of words in column "Review" (before preprocessing): 1281072
Total number of words in column "Review" (after preprocessing): 715643
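
Joining every review into one string works when all values are strings, but `' '.join(...)` raises a `TypeError` if the column contains a missing value (for example, an empty review read back from CSV as `NaN`). A minimal alternative sketch, assuming only that the reviews can be iterated one by one; `count_words` is a hypothetical helper and the sample list is made up:

```python
def count_words(reviews):
    """Sum whitespace-separated word counts, skipping non-string values such as NaN."""
    total = 0
    for review in reviews:
        if isinstance(review, str):
            total += len(review.split())
    return total

# made-up sample standing in for df['Review']
sample_reviews = ["good room", float("nan"), "", "clean  and quiet"]
print(count_words(sample_reviews))  # 5
```

A pandas Series yields its values when iterated, so under this assumption `count_words(df['Review'])` should give the same total as the cell above whenever the column has no missing values.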


- Find the total number of unique words in column `Review` before and after preprocessing.

In [7]:
# Load the data as a pandas dataframe
df = pd.read_csv("reviews_combined.csv")
df2 = pd.read_csv("preprocessed_reviews_combined.csv")

# Define the column that we want to analyze
column_to_analyze = "Review"

# Concatenate all the reviews into a single string
all_reviews = " ".join(df[column_to_analyze].tolist())
all_reviews2 = " ".join(df2[column_to_analyze].tolist())

# Split the string into words
words = all_reviews.split()
words2 = all_reviews2.split()

# Count the unique words
unique_words = set(words)
num_unique_words = len(unique_words)
unique_words2 = set(words2)
num_unique_words2 = len(unique_words2)

# Print the total number of unique words in column Review before and after preprocessing
print("Number of unique words in the column '{}' (before preprocessing): {}".format(column_to_analyze, num_unique_words))
print("Number of unique words in the column '{}' (after preprocessing): {}".format(column_to_analyze, num_unique_words2))

Number of unique words in the column 'Review' (before preprocessing): 62475
Number of unique words in the column 'Review' (after preprocessing): 26760
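
To put the reported figures in proportion, the relative reduction can be computed directly from the four counts printed above. This is a small arithmetic sketch; `reduction` is a hypothetical helper name.

```python
# counts copied from the outputs of the two cells above
words_before, words_after = 1281072, 715643
unique_before, unique_after = 62475, 26760

def reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100

print(f"Total words reduced by {reduction(words_before, words_after):.1f}%")     # 44.1%
print(f"Unique words reduced by {reduction(unique_before, unique_after):.1f}%")  # 57.2%
```

The vocabulary shrinks proportionally more than the total word count.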
