# In-class Lab 2: Text Data Preprocessing
**Overview:** In this lab, we build a text preprocessing pipeline for English text. The pipeline consists of data cleaning, lowercasing, punctuation removal, tokenization, stop word removal, and lemmatization. The exercise requires Python programming skills and familiarity with the libraries string, re, nltk, and numpy.
    

## Import Libraries

In [50]:
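import re  # regular expressions, used in clean_text
import string  # string.punctuation, used in remove_punctuation

import nltk  # the Natural Language Toolkit
import numpy as np  # used for managing NaNs
from nltk.corpus import stopwords  # English stop word list
from nltk.stem import WordNetLemmatizer  # lemmatization
from nltk.tokenize import word_tokenize  # tokenization

# One-time resource downloads (uncomment on first run):
# nltk.download('punkt')      # tokenizer models for word_tokenize (newer NLTK releases use 'punkt_tab')
# nltk.download('stopwords')  # stop word lists
# nltk.download('wordnet')    # WordNet data for WordNetLemmatizer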

## Question 1: Exploring the Dataset

The raw text data that needs preprocessing is in the file "wiki.txt". This file contains short documents (docs), with each document on a separate line. In this question, we will explore the following:

- The number of docs in the corpus
- Observing a few docs from the corpus


In [51]:
#### YOUR CODE HERE ####
with open("wiki.txt", "r", encoding="utf-8") as file:
    text_lst = [line.strip() for line in file if line.strip()]
    
print(f"Number of docs: {len(text_lst)}")

# Try to print a few docs from the corpus
for i, doc in enumerate(text_lst[:10], start = 1):
    print(f"Document {i}: {doc}\n")
#### END YOUR CODE #####

Number of docs: 2362
Document 1: Madhuca utilis is a tree in the Sapotaceae family. It grows up to 40 metres (130 ft) tall, with a trunk diameter of up to 70 centimetres (28 in). The bark is greyish brown. The fruits are ellipsoid, up to 5.5 centimetres (2.2 in) long. The specific epithet utilis is from the Latin meaning \"useful\", referring to the timber. Habitat is swamps and lowland kerangas forests. M. utilis is found in Sumatra, Peninsular Malaysia and Borneo.

Document 2: The Lycée Edmond Perrier (Edmond Perrier high school) is a general and technical secondary education institution, located in Tulle, Correze. It is dedicated to zoologist Edmond Perrier, born in Tulle in 1844. It was built by Anatole de Baudot, and has many similarities with the Lycée Lakanal, due to the same architect. His motto is \"Sint rupes virtutis iter\", identical to that of Tulle which means \"The difficulties are the path of virtue\".

Document 3: Shareh Khvor (Persian: شره خور‎‎) is a village in Bask-

## Question 2: Building Text Processing Functions

### Question 2.1: Building a Data Cleaning Function

**Description:** The function removes digits and every other character outside the set "A-Za-z(),!?\'\`". Whitespace is also kept so that words stay separated.


In [52]:
# clean text
def clean_text(text):
    #### YOUR CODE HERE ####
    if text is None:
        return ""
    
    cleaned_text = re.sub(r"[^A-Za-z(),!?\'\`\s]", "", text)
    return cleaned_text
    #### END YOUR CODE #####
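
The cleaning rule above is ASCII-only, so an accented letter is deleted rather than replaced, which is why "Lycée" shows up as `lyce` in the Question 3 output. The sketch below repeats the same rule and compares it with an optional accent-folding step. `fold_accents` is a hypothetical helper that is not part of the lab, and the sample sentence is made up from phrases in Documents 1 and 2.

```python
import re
import unicodedata


def clean_text(text):
    """Same rule as the lab's clean_text: keep letters, (),!?'` and whitespace."""
    if text is None:
        return ""
    return re.sub(r"[^A-Za-z(),!?'`\s]", "", text)


def fold_accents(text):
    """Hypothetical helper (assumption): map accented letters to their base letter."""
    decomposed = unicodedata.normalize("NFKD", text)
    # After NFKD, an accent is a separate combining character that can be dropped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = "The Lyc\u00e9e Edmond Perrier grows up to 40 metres (130 ft)."

print(clean_text(sample))                # The Lyce Edmond Perrier grows up to  metres ( ft)
print(clean_text(fold_accents(sample)))  # The Lycee Edmond Perrier grows up to  metres ( ft)
```

Note the double space left behind where "40" was removed. Whether accent folding is desirable depends on the downstream task, so treat it as one possible refinement rather than a required step.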


### Question 2.2: Function to Convert Text to Lowercase

In [53]:
# make all text lowercase
#### YOUR CODE HERE ####
def text_lowercase(text):
    if text is None:
        return ""
    
    return text.lower()
#### END YOUR CODE #####

### Question 2.3: Building a Function to Remove Punctuation

In [54]:
# remove punctuation
def remove_punctuation(text):
    #### YOUR CODE HERE ####
    if text is None:
        return ""
    
    # Remove all punctuation characters
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
    #### END YOUR CODE #####
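
As a quick check of the function above, the sketch below runs the same `str.translate` approach on two made-up strings. It also shows a limitation worth knowing: `string.punctuation` covers ASCII punctuation only, so typographic quotes pass through unchanged. In this pipeline that is not a problem, because `clean_text` has already removed them.

```python
import string


def remove_punctuation(text):
    """Same approach as the lab's function: delete every ASCII punctuation character."""
    if text is None:
        return ""
    return text.translate(str.maketrans("", "", string.punctuation))


print(remove_punctuation("isn't it (really) useful?"))  # isnt it really useful
print(remove_punctuation("\u201cuseful\u201d"))         # curly quotes are not in string.punctuation
```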

### Question 2.4: Tokenization

In [55]:
# tokenize
def tokenize(text):
    #### YOUR CODE HERE ####    
    if text is None:
        return []
    
    return word_tokenize(text)
    #### END YOUR CODE #####
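
By the time text reaches `tokenize` in this pipeline, it is lowercase and contains only letters and whitespace. For that kind of input, plain whitespace splitting is a reasonable approximation of `word_tokenize` and needs no downloaded tokenizer models. The sketch below is a hypothetical stand-in, not the lab's solution, and `word_tokenize` may still split some word forms differently.

```python
def whitespace_tokenize(text):
    """Hypothetical stand-in for tokenize(): split on runs of whitespace."""
    if text is None:
        return []
    # str.split() with no arguments collapses repeated whitespace and
    # never produces empty strings.
    return text.split()


print(whitespace_tokenize("it grows up to  metres  ft tall"))
# ['it', 'grows', 'up', 'to', 'metres', 'ft', 'tall']
```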

### Question 2.5: Removing Stopwords


In [56]:
# remove stopwords
stop_words = set(stopwords.words('english'))
#### YOUR CODE HERE ####
def remove_stopwords(text):
    if text is None:
        return []
    return [token for token in text if token not in stop_words]
#### END YOUR CODE #####
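
The membership test in `remove_stopwords` is case-sensitive, which is one reason the pipeline lowercases text before this step. The sketch below illustrates this with a tiny stand-in set. `DEMO_STOP_WORDS` is an assumption made for the example and is much smaller than NLTK's English list.

```python
# Stand-in stop word set (assumption); the lab uses stopwords.words('english').
DEMO_STOP_WORDS = {"is", "a", "in", "the", "it", "to", "of", "up", "with"}


def remove_stopwords(tokens, stop_words=DEMO_STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    if tokens is None:
        return []
    return [token for token in tokens if token not in stop_words]


tokens = "madhuca utilis is a tree in the sapotaceae family".split()
print(remove_stopwords(tokens))
# ['madhuca', 'utilis', 'tree', 'sapotaceae', 'family']

# "The" is not filtered because the set only contains lowercase "the".
print(remove_stopwords(["The", "tree"]))
# ['The', 'tree']
```

Storing the stop words in a `set`, as the lab code does, keeps each lookup constant-time on average, which matters when filtering thousands of documents.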

### Question 2.6: Building a Lemmatization Function


In [57]:
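# lemmatize
# Note: WordNetLemmatizer treats every token as a noun by default (pos='n'),
# so plural nouns are reduced ("metres" -> "metre") but verb forms such as
# "grows" are left unchanged.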
lemmatizer = WordNetLemmatizer()
#### YOUR CODE HERE ####
def lemmatize(text):
    if text is None:
        return []
    
    return [lemmatizer.lemmatize(token.lower()) for token in text]
#### END YOUR CODE #####

### Question 2.7: Building a Preprocessing Function
**Hint:** This function will call the functions written above.

In [58]:

def preprocessing(text):
#### YOUR CODE HERE ####
    # Check if text is available
    if text is None:
        return []
    
    # Clean text
    cleaned = clean_text(text)
    
    # Convert to lowercase
    lowercased = text_lowercase(cleaned)
    
    # Remove punctuation
    punct_removed = remove_punctuation(lowercased)
    
    # Tokenizing text
    tokenized = tokenize(punct_removed)
    
    # Remove stopwords
    stopwords_removed = remove_stopwords(tokenized)
    
    # Lemmatizing text
    lemmatized = lemmatize(stopwords_removed)
    
    final_tokens = [token for token in lemmatized if token.strip()]
    return final_tokens
#### END YOUR CODE #####

## Question 3: Preprocessing for Input Text
**Overview:** Using the functions above, preprocess every document in the corpus loaded in Question 1 and print the first few results.

In [61]:
#### YOUR CODE HERE ####
processed_corpus = []

for text in text_lst:
    tokens = preprocessing(text)
    processed_corpus.append(tokens)

for i, doc in enumerate(processed_corpus[:10], 1):
    print(f"Document {i}: {doc}")
#### END YOUR CODE #####

Document 1: ['madhuca', 'utilis', 'tree', 'sapotaceae', 'family', 'grows', 'metre', 'ft', 'tall', 'trunk', 'diameter', 'centimetre', 'bark', 'greyish', 'brown', 'fruit', 'ellipsoid', 'centimetre', 'long', 'specific', 'epithet', 'utilis', 'latin', 'meaning', 'useful', 'referring', 'timber', 'habitat', 'swamp', 'lowland', 'kerangas', 'forest', 'utilis', 'found', 'sumatra', 'peninsular', 'malaysia', 'borneo']
Document 2: ['lyce', 'edmond', 'perrier', 'edmond', 'perrier', 'high', 'school', 'general', 'technical', 'secondary', 'education', 'institution', 'located', 'tulle', 'correze', 'dedicated', 'zoologist', 'edmond', 'perrier', 'born', 'tulle', 'built', 'anatole', 'de', 'baudot', 'many', 'similarity', 'lyce', 'lakanal', 'due', 'architect', 'motto', 'sint', 'rupes', 'virtutis', 'iter', 'identical', 'tulle', 'mean', 'difficulty', 'path', 'virtue']
Document 3: ['shareh', 'khvor', 'persian', 'village', 'baske', 'kuleseh', 'rural', 'district', 'central', 'district', 'sardasht', 'county', 'wes