# Preprocessing the data
### 1. Noise reduction techniques
Raw text often contains special characters such as punctuation and symbols. In NLP, these characters can degrade the performance of machine learning algorithms and make the text harder to process. Depending on the task, remove or replace them before further processing.

In [1]:
import re
sentence = 'Harper is a good girl!'

# remove all non-word and non-space characters
print(re.sub(r'[^\w\s]', '', sentence))

# replace all occurrences of 'good' with 'nice'
print(re.sub(r'\bgood\b', 'nice', sentence))

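# replace 'a good' with 'the best': '.' matches any character and '.*' is greedy,
# so this works only because 'good' holds the last 'd' in the sentence
print(re.sub(r'..g.*d', 'the best', sentence))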

# replace all lowercase vowels with '*'
print(re.sub(r'[aeiou]', '*', sentence))

# replace runs of exactly two word characters with '**':
# \b...\b matches whole two-letter words, \B...\B matches two characters strictly inside a word
print(re.sub(r'\b\w{2}\b', '**', sentence))
print(re.sub(r'\B\w{2}\B', '**', sentence))

Harper is a good girl
Harper is a nice girl!
Harper is the best girl!
H*rp*r *s * g**d g*rl!
Harper ** a good girl!
H****r is a g**d g**l!
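
One caveat on the '..g.*d' pattern above: '.*' is greedy, so it extends to the last 'd' in the string. A minimal sketch of the difference, assuming a made-up longer sentence (the variable names `longer`, `greedy`, `lazy`, and `literal` are illustrative and not part of the notebook):

```python
import re

# Hypothetical longer sentence with more 'd' characters to the right of 'good'.
longer = 'Harper is a good girl and a kind friend!'

# Greedy: '.*' runs to the LAST 'd' in the string.
greedy = re.sub(r'..g.*d', 'the best', longer)

# Non-greedy: '.*?' stops at the FIRST 'd' after the 'g'.
lazy = re.sub(r'..g.*?d', 'the best', longer)

# Literal pattern: usually the clearest choice when the target text is known.
literal = re.sub(r'\ba good\b', 'the best', longer)

print(greedy)   # Harper is the best!
print(lazy)     # Harper is the best girl and a kind friend!
print(literal)  # Harper is the best girl and a kind friend!
```

When the text to replace is known in advance, the literal pattern with word boundaries is the least surprising of the three.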


In [2]:
sentence = 'NTU contact number is 6791       1744.'

# replace capital letters with '*'
print(re.sub(r'[A-Z]', '*', sentence))

# replace all digits with 'X'
print(re.sub(r'\d', 'X', sentence))

# collapse each run of whitespace into a single space
print(re.sub(r'\s+', ' ', sentence))

# swap the two 4-digit groups, keeping the original whitespace between them
print(re.sub(r'(\d{4})(\s+)(\d{4})', r'\3\2\1', sentence))

# swap the two 4-digit groups and collapse the whitespace between them into one space
print(re.sub(r'(\d{4})\s+(\d{4})', r'\2 \1', sentence))

*** contact number is 6791       1744.
NTU contact number is XXXX       XXXX.
NTU contact number is 6791 1744.
NTU contact number is 1744       6791.
NTU contact number is 1744 6791.
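
Numbered backreferences such as \1 and \2 become hard to read as patterns grow. As a sketch of an alternative, the same swap can be written with named groups. The group names `first` and `second` are chosen here for illustration only:

```python
import re

sentence = 'NTU contact number is 6791       1744.'

# Named groups document what each part of the pattern captures.
pattern = re.compile(r'(?P<first>\d{4})\s+(?P<second>\d{4})')

# \g<name> refers to a named group inside the replacement string.
swapped = pattern.sub(r'\g<second> \g<first>', sentence)
print(swapped)  # NTU contact number is 1744 6791.
```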


### 2. Tokenization
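Tokenization is the process of splitting text into smaller units such as sentences, words, or subwords. Words that do not appear in a model's vocabulary are called out-of-vocabulary (OOV) words; the example below replaces them with a placeholder token. You will learn more about handling OOV words with subword tokenization in later weeks.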

In [3]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing is amazing! Let's learn it step by step."
# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokens:", sentence_tokens)

# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

# Replacement of OOV words with UNK_TOKEN
UNK_TOKEN = '<UNK>'

# Vocabulary containing known words
vocabulary = set('Natural Language Processing is amazing'.split())

# Function to handle OOV words
def handle_oov(word):
    return word if word in vocabulary else UNK_TOKEN

for word in word_tokens:
    print(f'Original Word: {word}, Processed Word: {handle_oov(word)}')

Sentence Tokens: ['Natural Language Processing is amazing!', "Let's learn it step by step."]
Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'Let', "'s", 'learn', 'it', 'step', 'by', 'step', '.']
Original Word: Natural, Processed Word: Natural
Original Word: Language, Processed Word: Language
Original Word: Processing, Processed Word: Processing
Original Word: is, Processed Word: is
Original Word: amazing, Processed Word: amazing
Original Word: !, Processed Word: <UNK>
Original Word: Let, Processed Word: <UNK>
Original Word: 's, Processed Word: <UNK>
Original Word: learn, Processed Word: <UNK>
Original Word: it, Processed Word: <UNK>
Original Word: step, Processed Word: <UNK>
Original Word: by, Processed Word: <UNK>
Original Word: step, Processed Word: <UNK>
Original Word: ., Processed Word: <UNK>
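
Note that the membership test in handle_oov is case-sensitive, so 'natural' would be treated as OOV even though 'Natural' is in the vocabulary. The sketch below shows one common way to address this, assuming lowercased tokens and integer ids with id 0 reserved for the unknown token. The helper names `build_vocab` and `encode` and the `min_count` parameter are hypothetical, not part of NLTK:

```python
from collections import Counter

UNK_TOKEN = '<UNK>'

def build_vocab(tokens, min_count=1):
    """Map each lowercased token seen at least `min_count` times to an integer id."""
    counts = Counter(tok.lower() for tok in tokens)
    vocab = {UNK_TOKEN: 0}  # id 0 is reserved for unknown words
    for tok, count in counts.items():  # Counter keeps first-seen order
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its id, falling back to the UNK id."""
    return [vocab.get(tok.lower(), vocab[UNK_TOKEN]) for tok in tokens]

train_tokens = ['Natural', 'Language', 'Processing', 'is', 'amazing']
vocab = build_vocab(train_tokens)
ids = encode(['natural', 'language', 'is', 'fun'], vocab)
print(ids)  # [1, 2, 4, 0]
```

Raising `min_count` would also send rare training words to the unknown id, which is a common way to keep the vocabulary small.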


### 3. Stop word removal
Stop words are very common words in a language that carry little meaning on their own. They include articles (the, a, an), pronouns (he, she, it), conjunctions (and, or, but), and prepositions (in, on, at).

In natural language processing, stop words are often removed from text data to reduce the dimensionality of the data, which can also improve the accuracy of the analysis for some tasks.

In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

text = "This is an example showing how to remove stop words."
word_tokens = word_tokenize(text)

filtered_words = [word for word in word_tokens if word.lower() not in stop_words]
print(filtered_words)

['example', 'showing', 'remove', 'stop', 'words', '.']
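
Stop word removal is not always harmless. NLTK's English list includes negations such as 'not', and dropping them can flip the meaning of a sentence in tasks like sentiment analysis. The sketch below uses a small hand-written stop list (an illustrative subset, not the NLTK list) and a hypothetical `keep` parameter to retain selected words:

```python
# Illustrative subset only; NLTK's English list is much longer.
STOP_WORDS = {'this', 'is', 'a', 'an', 'the', 'not', 'to', 'how'}
NEGATIONS = {'not', 'no', 'nor'}

def remove_stop_words(tokens, keep=frozenset()):
    """Drop stop words, except any listed in `keep`."""
    active = STOP_WORDS - set(keep)
    return [tok for tok in tokens if tok.lower() not in active]

tokens = ['This', 'movie', 'is', 'not', 'good']
print(remove_stop_words(tokens))                  # ['movie', 'good']
print(remove_stop_words(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```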


### 4. Normalization
Normalization converts text into a standard format through steps such as lowercasing, expanding contractions, and stemming or lemmatizing.

In [5]:
text = "I'm learning NLP, and it isn't easy!"

# convert the text to lowercase and to uppercase
print(text.lower())
print(text.upper())

i'm learning nlp, and it isn't easy!
I'M LEARNING NLP, AND IT ISN'T EASY!


In [6]:
from contractions import fix

# expand contractions, e.g. "I'm" -> "I am"
normalized_text = fix(text)
print(normalized_text)

I am learning NLP, and it is not easy!
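
To illustrate the idea behind contraction expansion, here is a minimal dictionary-based sketch. It is not how the contractions package is implemented; the mapping `CONTRACTIONS` and the helper `expand_contractions` are hypothetical and cover only a few cases:

```python
import re

# Tiny illustrative mapping; a real list would be far longer.
CONTRACTIONS = {
    "i'm": "I am",
    "isn't": "is not",
    "can't": "cannot",
    "it's": "it is",
}

# One alternation pattern that matches any key as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("I'm learning NLP, and it isn't easy!")
print(expanded)  # I am learning NLP, and it is not easy!
```

This sketch does not preserve the capitalization of the original contraction, so "Isn't" at the start of a sentence would become "is not".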


In [7]:
words = word_tokenize(normalized_text)
print(words)

['I', 'am', 'learning', 'NLP', ',', 'and', 'it', 'is', 'not', 'easy', '!']


In [8]:
# stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['i', 'am', 'learn', 'nlp', ',', 'and', 'it', 'is', 'not', 'easi', '!']
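
Stemmers apply suffix rules without consulting a dictionary, which is why the output can contain non-words such as 'easi'. The toy sketch below is not the Porter algorithm; it strips three hard-coded suffixes to show the general idea. The function `toy_stem` and its `min_stem` parameter are made up for this illustration:

```python
SUFFIXES = ('ing', 'ed', 's')

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` characters remain."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[:-len(suffix)]
    return lowered

stems = [toy_stem(w) for w in ['learning', 'learned', 'learns', 'is', 'studies']]
print(stems)  # ['learn', 'learn', 'learn', 'is', 'studie']
```

The last result, 'studie', shows the same weakness as 'easi': rule-based stripping is fast but does not guarantee valid words, which is the gap lemmatization addresses.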


In [9]:
# lemmatization
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# pos="v" tells the lemmatizer to treat every token as a verb
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words]
print(lemmatized_words)

['I', 'be', 'learn', 'NLP', ',', 'and', 'it', 'be', 'not', 'easy', '!']


For the lemmatizer to work as intended, it needs the part of speech (POS) of each word. This information comes from POS tagging, which will be covered in greater detail next week. If no POS is given, WordNetLemmatizer treats every word as a noun.

In [10]:
# POS tagging labels each word with its grammatical role.
from nltk import pos_tag
from nltk.tokenize import word_tokenize
# nltk.download('averaged_perceptron_tagger')

text = "NLP is amazing and fun to learn."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
print(pos_tags)

[('NLP', 'NNP'), ('is', 'VBZ'), ('amazing', 'JJ'), ('and', 'CC'), ('fun', 'NN'), ('to', 'TO'), ('learn', 'VB'), ('.', '.')]


In [11]:
from nltk.corpus import wordnet

def pos_tagger(nltk_tag):
    # Map a Penn Treebank tag to a WordNet POS constant; return None if there is no match.
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

wordnet_tagged = [(word, pos_tagger(pos)) for (word, pos) in pos_tags]
print(wordnet_tagged)

lemmatized_sentence = []
for word, tag in wordnet_tagged:
    if tag is None:
        lemmatized_sentence.append(word)
    else:       
        lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
lemmatized_sentence = " ".join(lemmatized_sentence)
print(lemmatized_sentence)

[('NLP', 'n'), ('is', 'v'), ('amazing', 'a'), ('and', None), ('fun', 'n'), ('to', None), ('learn', 'v'), ('.', None)]
NLP be amazing and fun to learn .
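
The if/elif chain in pos_tagger can also be written as a lookup on the first letter of the tag. This is a sketch of an equivalent mapping that avoids importing NLTK by using the one-letter codes directly ('a', 'v', 'n', 'r' are the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN, and wordnet.ADV). The names `TAG_MAP` and `to_wordnet_pos` are illustrative:

```python
# First letter of the Penn Treebank tag -> WordNet POS code.
TAG_MAP = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_pos(nltk_tag):
    """Return the WordNet POS code for a tag, or None if there is no match."""
    return TAG_MAP.get(nltk_tag[:1])

# Tags copied from the pos_tag output shown earlier.
pos_tags = [('NLP', 'NNP'), ('is', 'VBZ'), ('amazing', 'JJ'), ('and', 'CC'),
            ('fun', 'NN'), ('to', 'TO'), ('learn', 'VB'), ('.', '.')]

wordnet_tagged = [(word, to_wordnet_pos(tag)) for word, tag in pos_tags]
print(wordnet_tagged)
```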


### 5. Vectorization 
Vectorization converts a text, sentence, or word into a numerical representation that can be used for further processing. This can be done using techniques such as Bag-of-Words, N-grams, and Word Embeddings. In this notebook, we will focus on Bag-of-Words and N-grams. You will learn the others in later weeks.

In [12]:
# Bag-of-Words
text = """
/W Never willing to see you fall. /s Never hesitating to be your rock.
"""

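filtered_text = []
for sent in sent_tokenize(text):
    # '/W' and '/s' are literal markers in the sample text, not the regex classes \W and \s
    sent = re.sub(r'/W', ' ', sent)
    sent = re.sub(r'/s+', ' ', sent)
    # replace every character that is not a letter or a double quote with a space
    sent = re.sub(r'[^a-zA-Z"]', ' ', sent)
    # lowercase the tokens and drop stop words
    filtered_text.append(' '.join([word.lower() for word in word_tokenize(sent) if word.lower() not in stop_words]))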
print(filtered_text)

['never willing see fall', 'never hesitating rock']


In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data to get the BoW representation
X = vectorizer.fit_transform(filtered_text)

# Get the list of unique words in the vocabulary
words = vectorizer.get_feature_names_out()

# Print the BoW matrix
print("Bag-of-Words Matrix:")
print(X.toarray())

# Print the vocabulary
print("\nVocabulary:")
print(words)

Bag-of-Words Matrix:
[[1 0 1 0 1 1]
 [0 1 1 1 0 0]]

Vocabulary:
['fall' 'hesitating' 'never' 'rock' 'see' 'willing']
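
To make the matrix above less of a black box, the same counts can be reproduced with the standard library. This is a sketch of the idea only, not how CountVectorizer is implemented; note that CountVectorizer by default also lowercases the text and ignores single-character tokens, which this sketch does not do:

```python
from collections import Counter

docs = ['never willing see fall', 'never hesitating rock']

# Vocabulary: sorted unique tokens across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['fall', 'hesitating', 'never', 'rock', 'see', 'willing']
print(matrix)  # [[1, 0, 1, 0, 1, 1], [0, 1, 1, 1, 0, 0]]
```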


In [14]:
# N-grams
def generate_ngrams(text, n):
    words = text.split()
    # zip n shifted copies of the word list so each tuple holds n consecutive words
    ngrams = zip(*[words[i:] for i in range(n)])
    return [' '.join(ngram) for ngram in ngrams]

# Sample text
text = "I am the one who knocks"
# Print the generated n-grams
print("Original Text:")
print(text)
print("\nUnigram:")
print(generate_ngrams(text, 1))
print("\nBigram:")
print(generate_ngrams(text, 2))
print("\nTrigram:")
print(generate_ngrams(text, 3))

Original Text:
I am the one who knocks

Unigram:
['I', 'am', 'the', 'one', 'who', 'knocks']

Bigram:
['I am', 'am the', 'the one', 'one who', 'who knocks']

Trigram:
['I am the', 'am the one', 'the one who', 'one who knocks']
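
N-grams become features once they are counted. As a sketch, the generator above can be combined with collections.Counter to get n-gram frequencies; the sample sentence is made up so that one bigram repeats:

```python
from collections import Counter

def generate_ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated words."""
    words = text.split()
    return [' '.join(gram) for gram in zip(*[words[i:] for i in range(n)])]

text = 'to be or not to be'
bigram_counts = Counter(generate_ngrams(text, 2))
print(bigram_counts.most_common(1))  # [('to be', 2)]
```

If n is larger than the number of words, the function returns an empty list, because zip stops at the shortest of its inputs.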


### Practice for the week
Explore these preprocessing techniques on the given movie reviews. The comments and code below will guide you through. Add to or modify the code where necessary.

In [28]:
# install and import the necessary libraries/packages/modules
# !pip install contractions
import re
import nltk
import contractions
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# download the necessary resources (run once)
# nltk.download('punkt_tab')
# nltk.download('stopwords')

# raw examples
raw_data = [
    "Excellent performance by Mary KAy Place, Steve Sandvoss, Jacqueline Bissett and Rebekah Johnson. Superb story that reels you into the movie, emotional, yet light-hearted. I own this movie, and everyone that I've shared it with, loves it. Great for mixed company, 18+ crowd.<br /><br />Nice production, not a cheap budget, well organized, and keeps your interest. Uses dome newer ideas, for flashbacks, and at one point keeps the viewer at the edge of their seat.<br /><br />No matter what walk of life a viewer is from, they will buy-in to one or more viewpoints in this film.<br /><br />Would love to see a sequel!",
    "This movie is totally wicked! It's really great to see MJH in a different role than her Sabrina character! The plot is totally cool, and the characters are excellently written. Definitely one of the best movies!!"
]

cleaned_data = []
# noise reduction, normalization, tokenization, stopword removal

# normalization and noise reduction

# Expand contractions & clean text in one pass
cleaned_data = [
    re.sub(r'[^\w\s]', '', contractions.fix(text))  # expand → remove punctuation
    for text in raw_data
]

# Output
print("Cleaned Data:", cleaned_data)
print('\n')  

# tokenization and stopword removal

stop_words = set(stopwords.words('english'))

# One-pass: build vocabulary directly from raw_data
vocabulary = {
    word.lower()
    for text in raw_data
    for sentence in sent_tokenize(text)
    for word in word_tokenize(sentence)
    if word.isalpha() and word.lower() not in stop_words         # keep only alphabetic tokens (drops punctuation/numbers)
}

# (Optional) see it sorted for readability
print("Vocabulary:", sorted(vocabulary))

for review in raw_data:
    pass

# compare your output with the original text; modify the code if it makes the comparison clearer
for idx, review in enumerate(cleaned_data):
    print(f"Preprocessed Review {idx + 1}: {review}")

# vectorization
for review in cleaned_data:
    pass

Cleaned Data: ['Excellent performance by Mary KAy Place Steve Sandvoss Jacqueline Bissett and Rebekah Johnson Superb story that reels you into the movie emotional yet lighthearted I own this movie and everyone that I have shared it with loves it Great for mixed company 18 crowdbr br Nice production not a cheap budget well organized and keeps your interest Uses dome newer ideas for flashbacks and at one point keeps the viewer at the edge of their seatbr br No matter what walk of life a viewer is from they will buyin to one or more viewpoints in this filmbr br Would love to see a sequel', 'This movie is totally wicked It is really great to see MJH in a different role than her Sabrina character The plot is totally cool and the characters are excellently written Definitely one of the best movies']
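
The cleaned output above still contains fragments such as 'crowdbr br' and 'seatbr br': stripping punctuation from the HTML tag `<br />` leaves 'br' behind and glues it to the preceding word. One possible fix, sketched below with the standard library only, is to replace tags with a space before removing punctuation. The helper name `clean_review` is hypothetical, and contraction expansion is left out to keep the sketch self-contained:

```python
import re

def clean_review(text):
    """Strip HTML tags, then punctuation, then collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)      # replace each HTML tag with a space
    text = re.sub(r'[^\w\s]', '', text)       # drop punctuation
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace

sample = 'Great for mixed company, 18+ crowd.<br /><br />Nice production.'
cleaned_sample = clean_review(sample)
print(cleaned_sample)  # Great for mixed company 18 crowd Nice production
```

In the full pipeline, contractions would still need to be expanded before the punctuation step, since removing apostrophes first would turn "I've" into "Ive".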


Vocabulary: ['best', 'bissett', 'br', 'budget', 'character', 'characters', 'cheap', 'company', 'cool', 'definitely', 'different', 'dome', 'edge', 'emotional', 'everyone', 'excellent', 'excellen