# In-class Lab 2: Text Data Preprocessing
**Overview:** In this lab, we build a text preprocessing pipeline for English text. The steps are: data cleaning, lowercasing, punctuation removal, tokenization, stop-word removal, and lemmatization. The exercise requires Python programming skills and familiarity with the `string`, `re`, `nltk`, and `numpy` libraries.
    

### Import Libraries

In [1]:
import string # used for preprocessing
import re # used for preprocessing
import nltk # the Natural Language Toolkit, used for preprocessing
import numpy as np # used for managing NaNs
from nltk.tokenize import word_tokenize, sent_tokenize # used for preprocessing
from nltk.corpus import stopwords # used for preprocessing
from nltk.stem import WordNetLemmatizer # used for preprocessing

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
import sys
import os
import platform

# Python environment details
print("Python executable being used:", sys.executable)
print("Python version:", sys.version)

# Operating System details
print("Operating System:", platform.system())
print("OS Version:", platform.version())
print("OS Release:", platform.release())

# Machine and architecture details
print("Machine:", platform.machine())

# Visual Studio Code details (based on environment variable)
vscode_info = os.environ.get('VSCODE_PID', None)
if vscode_info:
    print("Running in Visual Studio Code")
else:
    print("Not running in Visual Studio Code")

Python executable being used: c:\Python312\python.exe
Python version: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
Operating System: Windows
OS Version: 10.0.19045
OS Release: 10
Machine: AMD64
Running in Visual Studio Code


## Question 1: Exploring the Dataset

The raw text data that needs preprocessing is in the file "wiki.txt". This file contains short documents (docs), with each document on a separate line. In this question, we will explore the following:

- The number of docs in the corpus
- Observing a few docs from the corpus


In [4]:
#### YOUR CODE HERE ####
# Step 1: Read the dataset
with open("wiki.txt", "r", encoding="utf-8") as file:
    docs = file.readlines()

# Step 2: Observing a few documents
print("Sample Documents from the Dataset:")

for i, doc in enumerate(docs[:5], start=1):
    print(f"Document {i}: {doc}")

# Step 3: Count the number of documents
print(f"Total Number of Documents: {len(docs)}")

#### END YOUR CODE #####

Sample Documents from the Dataset:
Document 1: Madhuca utilis is a tree in the Sapotaceae family. It grows up to 40 metres (130 ft) tall, with a trunk diameter of up to 70 centimetres (28 in). The bark is greyish brown. The fruits are ellipsoid, up to 5.5 centimetres (2.2 in) long. The specific epithet utilis is from the Latin meaning \"useful\", referring to the timber. Habitat is swamps and lowland kerangas forests. M. utilis is found in Sumatra, Peninsular Malaysia and Borneo.

Document 2: The Lycée Edmond Perrier (Edmond Perrier high school) is a general and technical secondary education institution, located in Tulle, Correze. It is dedicated to zoologist Edmond Perrier, born in Tulle in 1844. It was built by Anatole de Baudot, and has many similarities with the Lycée Lakanal, due to the same architect. His motto is \"Sint rupes virtutis iter\", identical to that of Tulle which means \"The difficulties are the path of virtue\".

Document 3: Shareh Khvor (Persian: شره خور‎‎) is a vi
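
`readlines()` keeps the trailing newline of every line, which is why a blank line follows each printed document in the output above. The following is a minimal sketch of one way to load the corpus without those newlines. `read_docs` is a hypothetical helper, and the demo file content is made up for illustration; it is not the real "wiki.txt".

```python
import os
import tempfile


def read_docs(path):
    """Return one stripped document per non-empty line of the file at `path`."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demo on a throwaway file (hypothetical content, not the real wiki.txt).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("First document.\n\nSecond document.\n")
    demo_docs = read_docs(demo_path)

print(f"Total Number of Documents: {len(demo_docs)}")
for i, doc in enumerate(demo_docs, start=1):
    print(f"Document {i}: {doc}")
```

Skipping empty lines also keeps a stray blank line at the end of the file from being counted as a document.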

## Question 2: Building Text Processing Functions

### Question 2.1: Building a Data Cleaning Function

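**Description:** The function removes digits and every other character except ASCII letters, spaces, the punctuation marks `(),!?'`, and the backtick. Accented letters are removed as well, so "Lycée" becomes "Lyce".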


In [5]:
# clean text
def clean_text(text):
    #### YOUR CODE HERE ####
    text_clean = re.sub(r'[^A-Za-z(),!?\`\' ]', "", text)
    return text_clean
    #### END YOUR CODE #####
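
To see what the cleaning step keeps and drops, the sketch below repeats the same regular expression (without the unnecessary escapes) and applies it to a made-up sample sentence. Note that deleting characters rather than replacing them can leave double spaces behind.

```python
import re


def clean_text(text):
    # Drop every character that is not an ASCII letter, a space,
    # or one of ( ) , ! ? ` '
    return re.sub(r"[^A-Za-z(),!?`' ]", "", text)


# Hypothetical sample sentence, loosely modelled on the corpus.
sample = "It grows up to 40 metres (130 ft) tall; see the Lycée!"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The digits, the semicolon, and the accented "é" are removed, while the parentheses and the exclamation mark survive until the punctuation step.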

### Question 2.2: Function to Convert Text to Lowercase

In [6]:
# make all text lowercase
#### YOUR CODE HERE ####
# Use Python's built-in str.lower() method
def text_lowercase(text):
    lower_text = text.lower()
    return lower_text

#### END YOUR CODE #####

### Question 2.3: Building a Function to Remove Punctuation

In [7]:
# Remove punctuation because:
# - punctuation carries little meaning for this task
# - removing it reduces the size of the data
# Implemented with str.translate() and a deletion table built by str.maketrans()
def remove_punctuation(text):
    #### YOUR CODE HERE ####
    translator = str.maketrans("", "", string.punctuation)
    return text.translate(translator)
    #### END YOUR CODE #####
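
The sketch below shows the `str.maketrans` / `str.translate` approach on a made-up sentence. It also illustrates one limitation: `string.punctuation` contains ASCII punctuation only, so typographic quotes pass through unchanged. In this lab that does not matter, because `clean_text` has already removed non-ASCII characters by this point.

```python
import string

# Translation table whose third argument lists the characters to delete.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    return text.translate(PUNCT_TABLE)


ascii_result = remove_punctuation("It's (really) useful, isn't it?")
curly_result = remove_punctuation("“quoted”")

print(ascii_result)
print(curly_result)
```

Building the table once at module level avoids rebuilding it on every call, which adds up when the function runs over a whole corpus.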

In [8]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


### Question 2.4: Tokenization

In [9]:
# tokenize
# - For English: NLTK, SpaCy, TextBlob
# - For Vietnamese: VnCoreNLP, underthesea, coccoc-tokenizer
def tokenize(text):
    #### YOUR CODE HERE ####    
    tokens = word_tokenize(text)
    return tokens
    
    #### END YOUR CODE #####
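
By the time `tokenize` is called in this lab's pipeline, the text contains only lowercase letters and spaces. A plain whitespace split is therefore a reasonable baseline to compare against. `simple_tokenize` below is a hypothetical helper for that comparison, not a replacement for `word_tokenize` on raw text, which also handles punctuation and contractions.

```python
def simple_tokenize(text):
    # Split on runs of whitespace; assumes punctuation was already removed.
    return text.split()


demo_tokens = simple_tokenize("the bark is   greyish brown")
print(demo_tokens)
```

Because `str.split()` with no argument collapses repeated spaces, the double spaces left behind by the cleaning step do not produce empty tokens.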

### Question 2.5: Removing Stopwords







In [10]:
# remove stopwords
# - For English: the stopwords corpus from the nltk library
# - For Vietnamese: e.g. the vietnamese-stopwords package (npm install vietnamese-stopwords)
stop_words = set(stopwords.words('english'))
#### YOUR CODE HERE ####
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]
#### END YOUR CODE #####
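
The filter compares the lowercased token against a `set`, so membership checks take constant time on average and capitalised tokens such as "The" are still caught. The sketch below uses a small hand-picked stop-word set (an assumption for illustration, not NLTK's full English list) so that it runs without any downloads.

```python
# Hypothetical, hand-picked subset of English stop words (not NLTK's full list).
DEMO_STOP_WORDS = {"the", "is", "a", "in", "of", "to"}


def remove_stopwords(tokens, stop_words=DEMO_STOP_WORDS):
    # Compare in lowercase, but return the tokens in their original form.
    return [word for word in tokens if word.lower() not in stop_words]


filtered = remove_stopwords(["The", "bark", "is", "greyish", "brown"])
print(filtered)
```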

### Question 2.6: Building a Lemmatization Function







In [11]:
# lemmatize
lemmatizer = WordNetLemmatizer()
#### YOUR CODE HERE ####
def lemmatize(tokens):
    # Lemmatize each token. lemmatize() uses pos="n" by default and returns
    # the word unchanged when WordNet cannot reduce it as a noun.
    return [lemmatizer.lemmatize(word) for word in tokens]
    
#### END YOUR CODE #####

### Question 2.7: Building a Preprocessing Function
**Hint:** This function will call the functions written above.

In [12]:
def preprocess_sentence(sentence):
    sentence = clean_text(sentence)
    sentence = text_lowercase(sentence)
    sentence = remove_punctuation(sentence)
    tokens = tokenize(sentence)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize(tokens)
    return tokens

def preprocess_document(document):
    # Split the document into sentences
    sentences = sent_tokenize(document)
    
    # Preprocess each sentence
    preprocessed_sentences = [" ".join(preprocess_sentence(sentence)) for sentence in sentences]
    
    # Reconstruct the preprocessed document
    return " ".join(preprocessed_sentences)
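
The order of the steps matters. In `preprocess_sentence`, apostrophes are stripped before stop words are filtered, so a contraction such as "don't" becomes "dont" and may no longer match the stop-word list. The sketch below illustrates the effect with a hypothetical three-word stop list and a whitespace split; it does not use the NLTK list or `word_tokenize`.

```python
import string

# Hypothetical mini stop-word list; NLTK's English list also contains
# contractions such as "don't".
DEMO_STOP_WORDS = {"don't", "the", "do"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punct(token):
    return token.translate(PUNCT_TABLE)


tokens = "don't cut the tree".split()

# Order used in preprocess_sentence: punctuation first, then stop words.
punct_first = [t for t in map(strip_punct, tokens) if t not in DEMO_STOP_WORDS]

# Alternative order: stop words first, then punctuation.
stop_first = [strip_punct(t) for t in tokens if t not in DEMO_STOP_WORDS]

print(punct_first)
print(stop_first)
```

With punctuation removed first, "dont" survives the filter; with stop words removed first, the contraction is dropped. Which order is preferable depends on the downstream task.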

## Question 3: Preprocessing for Input Text
**Overview:** Using the functions above, preprocess every document in the corpus loaded in Question 1.

In [13]:
#### YOUR CODE HERE ####
preprocessed_docs = [preprocess_document(doc) for doc in docs]

# Example: Display preprocessed versions of the first 5 documents
print("Preprocessed Sample Documents:")
for i, doc in enumerate(preprocessed_docs[:5], start=1):
    print(f"Document {i}: {doc}")
#### END YOUR CODE #####


Preprocessed Sample Documents:
Document 1: madhuca utilis tree sapotaceae family grows metre ft tall trunk diameter centimetre bark greyish brown fruit ellipsoid centimetre long specific epithet utilis latin meaning useful referring timber habitat swamp lowland kerangas forest utilis found sumatra peninsular malaysia borneo
Document 2: lyce edmond perrier edmond perrier high school general technical secondary education institution located tulle correze dedicated zoologist edmond perrier born tulle built anatole de baudot many similarity lyce lakanal due architect motto sint rupes virtutis iter identical tulle mean difficulty path virtue
Document 3: shareh khvor persian village baske kuleseh rural district central district sardasht county west azerbaijan province iran census population family
Document 4: st rose roman catholic church complex roman catholic church complex located lima livingston county new york complex consists four contributing building st rose church constructed brenda