<h1 style="text-align: center;">Text Preprocessing 1 - Tutorial</h1> 

Common text preprocessing steps:

1. Text Cleaning: remove special characters, HTML tags, and newlines.
2. Tokenization: split text into sentences and words.
3. Stop Words Removal: remove words of little value like "the", "and", "a", "an".
4. Stemming: stripping the affixes from words.
5. Lemmatization: converting words to their base form.

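## Install Dependencies

This tutorial uses Python's built-in `re` module, NLTK, and spaCy with the `en_core_web_md` model. The first cell defines the example text used throughout.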

In [1]:
example_sentence = """A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. <br> Top speed attained, CPU rated speed,
add on cards & adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4m floppies are especially requested."""

## Cleaning

In [2]:
import re 

def clean_text(text):
    text = re.sub(r"<.*?>", " ", text)  # Remove HTML tags
    text = re.sub(r"\\n", " ", text)  # Remove literal "\n" sequences (backslash + n); real newlines are collapsed by \s+ below
    text = re.sub(r"[^\w\s.]", " ", text)  # Replace special characters with spaces, keeping periods (sentence ends and decimal points)
    text = re.sub(r"\s+", " ", text)  # Replace multiple spaces with a single space
    return text.strip().lower()

In [3]:
cleaned_text = clean_text(example_sentence)
print(cleaned_text)

a fair number of brave souls who upgraded their si clock oscillator have shared their experiences for this poll. please send a brief message detailing your experiences with the procedure. top speed attained cpu rated speed add on cards adapters heat sinks hour of usage per day floppy disk functionality with 800 and 1.4m floppies are especially requested.
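To make the second substitution in `clean_text` concrete, here is a small sketch that runs the same function on two short strings invented for illustration: one holds a real newline character, the other holds the two literal characters backslash and `n` (as often seen in scraped or escaped text). Both inputs are assumptions and are not part of the original example.

```python
import re

def clean_text(text):
    text = re.sub(r"<.*?>", " ", text)     # Remove HTML tags
    text = re.sub(r"\\n", " ", text)       # Remove literal backslash + n sequences
    text = re.sub(r"[^\w\s.]", " ", text)  # Replace special characters, keeping periods
    text = re.sub(r"\s+", " ", text)       # Collapse all whitespace, including real newlines
    return text.strip().lower()

# Made-up inputs: a real newline versus the escaped two-character form.
real_newline = "Top speed,\nCPU <br> rated & 1.4m"
literal_escape = r"Top speed,\nCPU <br> rated & 1.4m"

cleaned_real = clean_text(real_newline)
cleaned_literal = clean_text(literal_escape)
print(cleaned_real)
print(cleaned_literal)
```

Both calls should give the same cleaned string. The real newline is handled by the `\s+` step, while the `\\n` step is what stops the escaped form from leaving a stray `n` glued to the next word.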


## Tokenization

In [4]:
import nltk
nltk.download('punkt_tab')  # Punkt models used by sent_tokenize and word_tokenize

[nltk_data] Downloading package punkt_tab to C:\Users\Khor Kean
[nltk_data]     Teng\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
# Sentence tokenizer
from nltk.tokenize import sent_tokenize
tokenized_sent = sent_tokenize(cleaned_text)
print('number of sentences: ', len(tokenized_sent))
print(tokenized_sent)

number of sentences:  3
['a fair number of brave souls who upgraded their si clock oscillator have shared their experiences for this poll.', 'please send a brief message detailing your experiences with the procedure.', 'top speed attained cpu rated speed add on cards adapters heat sinks hour of usage per day floppy disk functionality with 800 and 1.4m floppies are especially requested.']
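A rough sketch of why sentence splitting needs more than `str.split(".")`: the short text below is adapted from the example sentence, and the regular expression is an assumed stand-in for illustration only. It is not how `sent_tokenize` works internally.

```python
import re

# Short text adapted from the example; the decimal in "1.4m" is the tricky part.
text = "please send a brief message. floppy disk functionality with 800 and 1.4m floppies. thanks."

# Naive approach: split on every period, which also cuts "1.4m" in two.
naive = [piece.strip() for piece in text.split(".") if piece.strip()]

# Slightly better: split only on whitespace that follows ., ! or ?
better = re.split(r"(?<=[.!?])\s+", text.strip())

print(len(naive), naive)
print(len(better), better)
```

The lookbehind version keeps `1.4m` intact because no whitespace follows that period. It would still fail on abbreviations such as "e.g. this", which is one reason to prefer a dedicated sentence tokenizer.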


In [6]:
# Word tokenizer
from nltk.tokenize import word_tokenize
tokenized = word_tokenize(cleaned_text)
print(tokenized)

['a', 'fair', 'number', 'of', 'brave', 'souls', 'who', 'upgraded', 'their', 'si', 'clock', 'oscillator', 'have', 'shared', 'their', 'experiences', 'for', 'this', 'poll', '.', 'please', 'send', 'a', 'brief', 'message', 'detailing', 'your', 'experiences', 'with', 'the', 'procedure', '.', 'top', 'speed', 'attained', 'cpu', 'rated', 'speed', 'add', 'on', 'cards', 'adapters', 'heat', 'sinks', 'hour', 'of', 'usage', 'per', 'day', 'floppy', 'disk', 'functionality', 'with', '800', 'and', '1.4m', 'floppies', 'are', 'especially', 'requested', '.']


In [7]:
# TweetTokenizer compared to word_tokenize
from nltk.tokenize import TweetTokenizer
tweet = "Dont take cryptocurrency advice from people on Twitter 😃👍 #crypto"
tokenizer = TweetTokenizer()
tokenized_tweet = tokenizer.tokenize(tweet)
print(tokenized_tweet)
print(word_tokenize(tweet))

['Dont', 'take', 'cryptocurrency', 'advice', 'from', 'people', 'on', 'Twitter', '😃', '👍', '#crypto']
['Dont', 'take', 'cryptocurrency', 'advice', 'from', 'people', 'on', 'Twitter', '😃👍', '#', 'crypto']
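The difference between the two outputs comes down to which character sequences a tokenizer treats as one unit. The sketch below uses a hypothetical regular expression (`TOKEN_PATTERN` and `simple_tokenize` are names invented here, not NLTK APIs) to show how a rule that keeps hashtags whole changes the result.

```python
import re

# Hypothetical pattern for illustration; this is not the pattern NLTK uses.
TOKEN_PATTERN = re.compile(
    r"#\w+"            # hashtags kept whole, e.g. "#crypto"
    r"|\w+(?:\.\w+)*"  # words and numbers, allowing inner periods ("1.4m")
    r"|[^\w\s]"        # any other single non-space symbol
)

def simple_tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Dont take advice on 1.4m floppies from Twitter. #crypto")
print(tokens)
```

Removing the first alternative would make `#` and `crypto` come out as separate tokens, which mirrors the `word_tokenize` output above.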


## Stemming and Lemmatization

### 1- NLTK

In [8]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

Remember that:
* The Porter stemmer removes suffixes in a rule-based manner.
* It does not always return valid English words (e.g., `argu`, `happi`).
* Some stems are still recognizable words (e.g., `run`, `nation`).

In [9]:
# Standard cases
print(porter.stem('argue'))
print(porter.stem('argued'))
print(porter.stem('argues'))
print(porter.stem('arguing'))

argu
argu
argu
argu


In [10]:
# Plurals and derivational forms
print(porter.stem('running'))    
print(porter.stem('runner'))     
print(porter.stem('flies'))      
print(porter.stem('fly'))
print(porter.stem('crying'))

run
runner
fli
fli
cri
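To illustrate what "rule-based" means here, the sketch below implements only step 1a of the original Porter algorithm, which handles plural endings. The function name `porter_step_1a` is made up for this example. NLTK's `PorterStemmer` applies several further steps and its own extensions, so this is a partial illustration and not a replacement.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm (plural suffixes only)."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # flies -> fli
    if word.endswith("ss"):
        return word       # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word           # no plural suffix found

words = ["caresses", "flies", "caress", "cats", "fly"]
stems = [porter_step_1a(w) for w in words]
print(stems)
```

The rules are tried in order and the first matching suffix wins, which is why `flies` becomes `fli` and not `flie`. Later steps, not shown here, handle endings such as `-ed`, `-ing`, and a final `y`.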


In [11]:
# Complex endings
print(porter.stem('happiness'))  
print(porter.stem('university'))
print(porter.stem('national'))
print(porter.stem('generalization'))

happi
univers
nation
gener


## When to Use PorterStemmer?
* Use it for fast, lightweight, rule-based stemming in tasks such as text classification and information retrieval (IR), where stems do not need to be valid words.
* For more linguistically accurate results, consider lemmatization instead (e.g., WordNetLemmatizer).

In [12]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\Khor Kean
[nltk_data]     Teng\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("argue", 'v'))
print(lemmatizer.lemmatize("argued", 'v')) 
print(lemmatizer.lemmatize("argues", 'v'))
print(lemmatizer.lemmatize("arguing", 'n'))  # tagged as a noun, so the -ing form is kept

argue
argue
argue
arguing


In [14]:
print(lemmatizer.lemmatize("better", 'a'))
print(lemmatizer.lemmatize("running", 'v'))
print(lemmatizer.lemmatize("running", 'n'))
print(lemmatizer.lemmatize("flies", 'n'))
print(lemmatizer.lemmatize("flies", 'v'))
print(lemmatizer.lemmatize("mice", 'n'))

good
run
running
fly
fly
mouse


In [15]:
# WordNetLemmatizer needs the correct POS tag to be accurate; the default is noun ('n')
print(lemmatizer.lemmatize("went"))

went
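The effect of the POS tag can be pictured as a lookup keyed by both the word and its part of speech. The sketch below uses a tiny table invented for illustration (`TOY_LEMMAS` and `toy_lemmatize` are hypothetical names). It is not WordNet's data or NLTK's implementation, only a model of why the default noun tag leaves `went` unchanged.

```python
# Toy lookup table, invented for illustration; WordNet's real data is far larger.
TOY_LEMMAS = {
    ("went", "v"): "go",
    ("mice", "n"): "mouse",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("went"))       # looked up as a noun: no entry, so unchanged
print(toy_lemmatize("went", "v"))  # looked up as a verb: entry found
```

In practice the tag is usually supplied by a POS tagger so that it does not have to be hard-coded per word.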


### 2- spaCy

In [16]:
import spacy
nlp = spacy.load('en_core_web_md')  # load the medium English pipeline

In [17]:
doc = nlp('After the cats fell asleep, the mice went out to play.')
for token in doc:
    print(token, '-->', token.lemma_)

After --> after
the --> the
cats --> cat
fell --> fall
asleep --> asleep
, --> ,
the --> the
mice --> mouse
went --> go
out --> out
to --> to
play --> play
. --> .


In [18]:
# Lemmatize our original example sentence
doc = nlp(cleaned_text)

# Extract original words and their lemmatized forms
original = [token.text for token in doc]
lemmatized = [token.lemma_.lower() for token in doc]

# Display results in aligned format
print(f"{'Original':<15} {'Lemmatized':<15}")
print("=" * 30)
for orig, lem in zip(original, lemmatized):
    print(f"{orig:<15} {lem:<15}")

Original        Lemmatized     
==============================
a               a              
fair            fair           
number          number         
of              of             
brave           brave          
souls           soul           
who             who            
upgraded        upgrade        
their           their          
si              si             
clock           clock          
oscillator      oscillator     
have            have           
shared          share          
their           their          
experiences     experience     
for             for            
this            this           
poll            poll           
.               .              
please          please         
send            send           
a               a              
brief           brief          
message         message        
detailing       detail         
your            your           
experiences     experience     
with            with           
the             the            
procedure       procedure      
...

## Stop Word Removal


In [19]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Khor Kean
[nltk_data]     Teng\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
from nltk.corpus import stopwords

# Get stop words list
stop = stopwords.words('english')

# Convert to a set to remove duplicates
unique_stopwords = set(stop)

# Convert back to a list
stop = list(unique_stopwords)

print("Total unique stop words:", len(stop))
print("Sample stop words:", stop[:100])  # Print 100 stop words (set order is arbitrary)

Total unique stop words: 198
Sample stop words: ['until', 'being', 'as', "it'd", "shan't", 'can', 'both', 'his', 'isn', 'its', 'such', 't', 'now', 'some', 'the', 'then', 'won', 'very', 'other', 'should', 'doing', "should've", "he'd", 'hers', 'i', 'their', "they'd", 'couldn', 'any', 'so', "hasn't", 'above', 'doesn', 'out', 'it', 'do', 'at', 'ma', "haven't", 'our', 'weren', 'few', 'about', "she's", 'why', 'there', 'yours', 'to', "weren't", "don't", 'on', 'don', 'where', 'ain', 'didn', 'your', 'with', 'further', 've', "he's", 'a', 'all', "isn't", 'just', 'an', 'himself', 'these', "he'll", 'before', 'does', 'wouldn', 'll', 'theirs', 'them', 'wasn', 'off', 'myself', 'y', "wasn't", "you've", "mightn't", 'having', 'ours', 're', 'when', "doesn't", 'of', 'how', "that'll", 'me', 'aren', 'who', 'him', 'nor', 'against', 'too', 'itself', 'because', 'or', 'which']


In [21]:
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS
stop_words = list(STOP_WORDS)

print("Total unique stop words:", len(stop_words))
print("Sample stop words:", stop_words[:100])  # First 100 stop words

Total unique stop words: 326
Sample stop words: ['say', 'now', 'very', 'other', 'should', 'last', 'formerly', 'nine', 'at', 'seem', 'others', 'few', 'third', 'why', 'there', 'many', 'became', 'on', 'eight', '’m', '’d', 'further', "'d", 'himself', 'wherein', 'becoming', 'four', 'every', 'mostly', 'off', 'would', 'somewhere', 'one', 'how', 'around', 'be', 'am', 'had', 'between', '‘d', 'moreover', 'amongst', 'into', 'serious', 'whether', 'else', 'move', 'empty', 'whenever', 'nevertheless', 'were', 'eleven', 'n‘t', 'down', 'put', 'while', 'though', 'through', 'have', 'has', 'back', 'besides', 'she', 'but', 'will', 'using', 'no', 'really', 'not', 'by', 'seems', 'otherwise', 'together', 'neither', 'either', 'along', 'can', 'also', 'both', 'his', 'its', 'then', 'done', 'per', 'top', 'might', 'indeed', 'something', 'becomes', 'former', 'n’t', 'yet', 'towards', 'please', 'upon', 'thence', 'ca', 'seemed', 'with', 'whence']


In [22]:
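# Keep the lemmas that are not in spaCy's stop word list
stop_words_removed = [word for word in lemmatized if word not in stop_words]
# Collect the lemmas that were filtered out as stop words
removed_arr = [word for word in lemmatized if word in stop_words]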

In [23]:
print(stop_words_removed)

['fair', 'number', 'brave', 'soul', 'upgrade', 'si', 'clock', 'oscillator', 'share', 'experience', 'poll', '.', 'send', 'brief', 'message', 'detail', 'experience', 'procedure', '.', 'speed', 'attain', 'cpu', 'rate', 'speed', 'add', 'card', 'adapter', 'heat', 'sink', 'hour', 'usage', 'day', 'floppy', 'disk', 'functionality', '800', '1.4', 'm', 'floppy', 'especially', 'request', '.']


In [24]:
print(removed_arr)

['a', 'of', 'who', 'their', 'have', 'their', 'for', 'this', 'please', 'a', 'your', 'with', 'the', 'top', 'on', 'of', 'per', 'with', 'and', 'be']
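The filtered list above still contains punctuation tokens such as `.`. A minimal sketch of one way to drop them in the same pass is shown below. It uses the first few lemmas from the table earlier and a small hand-picked stop word set (`toy_stop_words`, an assumption standing in for spaCy's `STOP_WORDS`), and stores both lookups as sets so each membership test is fast.

```python
import string

# Lemmas copied from the first sentence of the lemmatized output above.
lemmas = ["a", "fair", "number", "of", "brave", "soul", "who", "upgrade",
          "their", "si", "clock", "oscillator", "have", "share", "their",
          "experience", "for", "this", "poll", "."]

# Hand-picked subset for illustration; a real run would use the full stop word list.
toy_stop_words = {"a", "of", "who", "their", "have", "for", "this"}
punctuation = set(string.punctuation)

# Keep tokens that are neither stop words nor single punctuation marks.
kept = [tok for tok in lemmas if tok not in toy_stop_words and tok not in punctuation]
print(kept)
```

Whether punctuation should be removed depends on the downstream task. Sentence-level models may still need it, while bag-of-words features usually do not.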
