# In-class Lab 2: Text Data Preprocessing
**Overview:** In this lab, we build a text preprocessing pipeline for English text. The pipeline consists of data cleaning, lowercasing, punctuation removal, tokenization, stop word removal, and lemmatization. The exercise requires Python programming skills and familiarity with the string, re, nltk, and numpy libraries.
    

### Import Libraries

In [54]:
import string # punctuation constants, used when removing punctuation
import re # regular expressions, used for cleaning and tokenizing
import numpy as np # used for managing NaNs
import nltk # the Natural Language Toolkit
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords # English stop word list
from nltk.stem import WordNetLemmatizer # WordNet-based lemmatizer
from nltk.corpus import wordnet # POS constants expected by the lemmatizer
from nltk import pos_tag # part-of-speech tagger
# nltk.download('all')
# nltk.download('stopwords')

## Question 1: Exploring the Dataset

The raw text data that needs preprocessing is in the file "wiki.txt". This file contains short documents (docs), with each document on a separate line. In this question, we will explore the following:

- The number of docs in the corpus
- Observing a few docs from the corpus


In [55]:
#### YOUR CODE HERE ####
file_path = "wiki.txt"

with open(file_path, 'r', encoding='utf-8') as file: # open file
    docs = file.readlines() # read all the lines in the file and store them in a list
    num_docs = len(docs) # number of docs, also the number of lines in file

# print the number of docs
print("The number of docs in the corpus: ", num_docs)

num_preview = 10 # suppose we want to observe 10 docs from the corpus
sample_docs = docs[:num_preview] # extracts the first 10 lines

print("\nSample documents: \n")
for i, doc in enumerate(sample_docs,1):
    print(f"{i}: {doc.strip()}") # print each document

#### END YOUR CODE #####

The number of docs in the corpus:  2362

Sample documents: 

1: Madhuca utilis is a tree in the Sapotaceae family. It grows up to 40 metres (130 ft) tall, with a trunk diameter of up to 70 centimetres (28 in). The bark is greyish brown. The fruits are ellipsoid, up to 5.5 centimetres (2.2 in) long. The specific epithet utilis is from the Latin meaning \"useful\", referring to the timber. Habitat is swamps and lowland kerangas forests. M. utilis is found in Sumatra, Peninsular Malaysia and Borneo.
2: The Lycée Edmond Perrier (Edmond Perrier high school) is a general and technical secondary education institution, located in Tulle, Correze. It is dedicated to zoologist Edmond Perrier, born in Tulle in 1844. It was built by Anatole de Baudot, and has many similarities with the Lycée Lakanal, due to the same architect. His motto is \"Sint rupes virtutis iter\", identical to that of Tulle which means \"The difficulties are the path of virtue\".
3: Shareh Khvor (Persian: شره خور‎‎) is a villa

## Question 2: Building Text Processing Functions

### Question 2.1: Building a Data Cleaning Function

**Description:** The function replaces every character outside the set "A-Za-z(),!?\'\`" (digits included) with a space, then trims leading and trailing whitespace.


In [56]:
import re
# clean text
def clean_text(text):
    #### YOUR CODE HERE ####
    pattern = r"[^A-Za-z(),!?'`]" # matches any character outside the allowed set A-Za-z(),!?'`
    cleaned = re.sub(pattern, " ", text) # replace each disallowed character with a space
    return cleaned.strip() # trim leading and trailing whitespace
    #### END YOUR CODE #####

# test
# text = "hello ? '456 `    ?456"
# print(clean_text(text))
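
To see what the character whitelist does in practice, the following minimal sketch applies the same pattern to a made-up sentence. The helper name `clean_text_demo` and the sample string are illustrative assumptions, not part of the lab.

```python
import re

# Same whitelist as in clean_text: anything outside it becomes a space.
PATTERN = r"[^A-Za-z(),!?'`]"

def clean_text_demo(text):
    """Replace every character outside the whitelist with a space, then trim."""
    return re.sub(PATTERN, " ", text).strip()

sample = "It grows up to 40 metres (130 ft) tall."
cleaned = clean_text_demo(sample)
print(repr(cleaned))  # digits and the final period are now spaces

# Optional extra step (not in the lab): collapse the runs of spaces left behind.
collapsed = " ".join(cleaned.split())
print(collapsed)  # It grows up to metres ( ft) tall
```

Because each removed character is replaced by a space rather than deleted, neighbouring words never get glued together. The extra spaces are harmless here, since the later tokenization step ignores whitespace.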

### Question 2.2: Function to Convert Text to Lowercase

In [57]:
# make all text lowercase
#### YOUR CODE HERE ####
def text_lowercase(text):
    return text.lower() # function to convert text to lowercase
#### END YOUR CODE #####


### Question 2.3: Building a Function to Remove Punctuation

In [58]:
# remove punctuation
def remove_punctuation(text):
    #### YOUR CODE HERE ####
    trans = str.maketrans('', '', string.punctuation) # translation table that deletes every character in string.punctuation
    return text.translate(trans)
    #### END YOUR CODE #####
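
As a quick illustration of the translation-table approach, the sketch below runs `str.maketrans` and `str.translate` on a made-up string. The sample text is an assumption chosen to include several punctuation characters.

```python
import string

# The third argument lists characters to delete; the first two (empty) arguments
# mean that no character is mapped to a replacement.
table = str.maketrans("", "", string.punctuation)

sample = "hello, world! (it's `fine`?)"
stripped = sample.translate(table)
print(stripped)  # hello world its fine
```

Note that punctuation is deleted, not replaced by a space, so "it's" becomes "its".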

### Question 2.4: Tokenization

In [59]:
# tokenize
def tokenize(text):
    #### YOUR CODE HERE ####
    tokens = re.findall(r'\b\w+\b', text) # split text into words
    return tokens
    #### END YOUR CODE #####

# text = "How are you ?"
# print(tokenize(text))
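
The following short sketch shows how the pattern `\b\w+\b` behaves on two made-up inputs. The name `tokenize_demo` is hypothetical and simply mirrors the function above.

```python
import re

def tokenize_demo(text):
    """Return every maximal run of word characters (letters, digits, underscore)."""
    return re.findall(r"\b\w+\b", text)

print(tokenize_demo("How are you ?"))  # ['How', 'are', 'you']
print(tokenize_demo("don't stop"))     # ['don', 't', 'stop']
```

The second call shows that an apostrophe splits a word in two. In this lab's pipeline that case should not arise, because punctuation is removed before tokenization.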

### Question 2.5: Removing Stopwords







In [60]:
# remove stopwords
nltk.download('stopwords') # download the stopwords dataset from nltk
stop_words = set(stopwords.words('english')) # load the set of english stopwords
#### YOUR CODE HERE ####
def remove_stopwords(tokens):
    # tokens is the list of words produced by tokenize(), not a raw string
    filter_words = [word for word in tokens if word.lower() not in stop_words] # keep only the words that are not stopwords
    filtered_text = ' '.join(filter_words) # join the remaining words into a single space-separated string
    return filtered_text
#### END YOUR CODE #####

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
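
The filtering logic can be tried without downloading anything. The sketch below uses a tiny hand-written stop list as a stand-in for NLTK's English list; `DEMO_STOP_WORDS` and `remove_stopwords_demo` are hypothetical names introduced only for this illustration.

```python
# Tiny hand-written stop list, standing in for stopwords.words('english').
DEMO_STOP_WORDS = {"is", "a", "in", "the", "it", "to", "of"}

def remove_stopwords_demo(tokens, stop_words=DEMO_STOP_WORDS):
    """Drop tokens found in stop_words and join the rest with spaces."""
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)

tokens = ["madhuca", "utilis", "is", "a", "tree", "in", "the", "sapotaceae", "family"]
result = remove_stopwords_demo(tokens)
print(result)  # madhuca utilis tree sapotaceae family
```

The input must be a list of tokens. Passing a raw string would make the comprehension iterate over individual characters instead of words.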


### Question 2.6: Building a Lemmatization Function







In [61]:
# lemmatize
lemmatizer = WordNetLemmatizer()
# Uncomment on the first run to download the NLTK resources used below:
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
#### YOUR CODE HERE ####
def lemmatize(text):
    words = tokenize(text)  # Tokenizing the input text into words

    # map POS tag to WordNet format
    def get_wordnet_pos(tag):
        if tag.startswith('NN'):
            return wordnet.NOUN
        elif tag.startswith('VB'):
            return wordnet.VERB
        elif tag.startswith('JJ'):
            return wordnet.ADJ
        elif tag.startswith('RB'):
            return wordnet.ADV
        else:
            return wordnet.NOUN  # Default to NOUN if unsure

    # Get part-of-speech tags for the words
    tagged_words = pos_tag(words)

    # Lemmatize each word using its POS tag
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_words]

    # Join the lemmatized words into a single string
    lemmatized_text = ' '.join(lemmatized_words)
    return lemmatized_text
#### END YOUR CODE #####
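
The POS mapping matters because `WordNetLemmatizer.lemmatize` treats a word as a noun unless told otherwise. The sketch below isolates that mapping so it can be run without NLTK data. It hard-codes the string values that `nltk.corpus.wordnet` defines for its constants (NOUN is 'n', VERB is 'v', ADJ is 'a', ADV is 'r'); the names `PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# WordNet POS constants written as plain strings so the sketch runs without NLTK.
PREFIX_TO_WORDNET = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return PREFIX_TO_WORDNET.get(tag[:2], "n")

for tag in ["NNS", "VBD", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
# NNS -> n, VBD -> v, JJR -> a, RB -> r, IN -> n
```

Only the first two characters of the tag are inspected, which matches the `startswith` checks in `get_wordnet_pos`. Tags outside the four groups, such as prepositions (IN), fall back to noun.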

### Question 2.7: Building a Preprocessing Function
**Hint:** This function will call the functions written above.

In [62]:

def preprocessing(text):
    #### YOUR CODE HERE ####
    final_text = clean_text(text) # replace disallowed characters with spaces
    final_text = text_lowercase(final_text) # convert to lowercase
    final_text = remove_punctuation(final_text) # delete the remaining punctuation
    tokens = tokenize(final_text) # list of word tokens
    final_text = remove_stopwords(tokens) # returns a space-separated string
    final_text = lemmatize(final_text) # re-tokenizes, POS-tags, and lemmatizes
    return final_text
    #### END YOUR CODE #####
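
To check the order of the steps on a small input, here is a minimal standard-library sketch of the same pipeline. It is a simplification under stated assumptions: lemmatization is left out so that no NLTK data is needed, and `DEMO_STOP_WORDS` is a hand-written stand-in for NLTK's English stop list. The names `demo_pipeline` and `DEMO_STOP_WORDS` are hypothetical.

```python
import re
import string

# Hand-written stand-in for stopwords.words('english').
DEMO_STOP_WORDS = {"the", "is", "a", "in", "it", "up", "to", "of", "with"}

def demo_pipeline(text):
    """Clean, lowercase, strip punctuation, tokenize, and drop stop words."""
    text = re.sub(r"[^A-Za-z(),!?'`]", " ", text).strip()              # clean
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = re.findall(r"\b\w+\b", text)                             # tokenize
    kept = [t for t in tokens if t not in DEMO_STOP_WORDS]            # stop words
    return " ".join(kept)

sample = "The bark is greyish brown. It grows up to 40 metres (130 ft) tall."
result = demo_pipeline(sample)
print(result)  # bark greyish brown grows metres ft tall
```

In the full pipeline, the lemmatization step would additionally reduce inflected forms to their dictionary form.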

## Question 3: Preprocessing for Input Text
**Overview:** Using the functions above, preprocess every document in the corpus.

In [63]:
#### YOUR CODE HERE ####
for i, doc in enumerate(docs, 1):  # iterate through all the documents
    preprocessed_doc = preprocessing(doc.strip())  # apply the full preprocessing pipeline
    print(f"{i}: {preprocessed_doc}")
#### END YOUR CODE #####

1: madhuca utilis tree sapotaceae family grow metre ft tall trunk diameter centimetre bark greyish brown fruit ellipsoid centimetre long specific epithet utilis latin meaning useful refer timber habitat swamp lowland kerangas forest utilis find sumatra peninsular malaysia borneo
2: lyc e edmond perrier edmond perrier high school general technical secondary education institution locate tulle correze dedicate zoologist edmond perrier bear tulle build anatole de baudot many similarity lyc e lakanal due architect motto sint rupes virtutis iter identical tulle mean difficulty path virtue
3: shareh khvor persian village bask e kuleseh rural district central district sardasht county west azerbaijan province iran census population family
4: st rise roman catholic church complex roman catholic church complex locate lima livingston county new york complex consist four contributing building st rise church construct brendan hall construct parochial school rectory convent list national register his