<h3>Basic Text Preprocessing</h3>

<ol>
    <li><b>Stop Word Removal</b> We remove stop words, common words such as "a" and "an" that add little meaning to a sentence. Dropping them reduces the vocabulary size.</li>
    <li><b>Lower Casing</b> We convert every word to lower case so that variants such as "The" and "the" are not treated as separate tokens.</li>
    <li><b>Stemming</b> We reduce words to their stem (base form). For example, running -> run</li>
    <li><b>Tokenization</b> We split the text into tokens. For example, after lower casing, "Natural Language Processing" -> ["natural", "language", "processing"]</li>
</ol>
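
<p>The four steps above can be sketched without any external library. This is only an illustration under stated assumptions: <code>STOP_WORDS</code> is a tiny hand-picked list and <code>toy_stem</code> is a crude suffix stripper standing in for a real stemmer. Both names, and <code>basic_preprocess</code>, are hypothetical and are not part of NLTK or of this notebook.</p>

```python
import re

# Hypothetical, hand-picked stop-word list for illustration only;
# a real pipeline would use a much larger list (e.g. NLTK's English list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of"}


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Snowball."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def basic_preprocess(sentence):
    # Lower casing + tokenization: keep runs of word characters
    tokens = re.findall(r"\w+", sentence.lower())
    # Stop word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming (toy version)
    return [toy_stem(t) for t in tokens]


print(basic_preprocess("The cats are chasing a ball"))
# ['cat', 'chas', 'ball']
```

<p>Note that "chasing" becomes "chas": a stem does not have to be a dictionary word, it only has to be shared by related word forms.</p>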

In [3]:
import nltk

In [12]:
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer 
from nltk.tokenize import RegexpTokenizer
import re

In [13]:
nltk.download("stopwords")
# Use a distinct name so the imported `stopwords` module is not shadowed,
# and a set so membership checks are fast
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'\w+')

# Matches @mentions (user handles) such as "@VirginAmerica"
tags = r"@\w*"

def preprocess_text(sentence, stem=False):
    """Return lower-cased tokens with @mentions and stop words removed."""
    sentence = re.sub(tags, "", sentence)  # Strip @mentions
    text = []
    # Tokenize first so that stop words are checked word by word
    for word in tokenizer.tokenize(sentence.lower()):
        if word not in stop_words:
            text.append(stemmer.stem(word) if stem else word)
    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\YH\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
text = "@VirginAmerica I &lt;3 pretty graphics. so much better than minimal iconography. :D"
print(f"{text}")
print()
print(f"Preprocessed Text : {preprocess_text(text)}")

@VirginAmerica I &lt;3 pretty graphics. so much better than minimal iconography. :D

Preprocessed Text : ['lt', '3', 'pretty', 'graphics', 'much', 'better', 'minimal', 'iconography']
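
<p>The stray 'lt' token in the output comes from an HTML entity (the escaped less-than sign) in the raw tweet. One possible way to avoid it, shown here only as a sketch, is to decode entities with the standard library's <code>html.unescape</code> before tokenizing. The helper name <code>decode_and_tokenize</code> is hypothetical, and plain <code>re.findall</code> stands in for the notebook's <code>RegexpTokenizer</code> so the snippet runs without NLTK.</p>

```python
import html
import re


def decode_and_tokenize(raw):
    # Turn HTML entities back into characters ("&lt;" becomes "<")
    decoded = html.unescape(raw)
    # Strip @mentions, as in preprocess_text
    no_mentions = re.sub(r"@\w*", "", decoded)
    # Lower-case and keep runs of word characters
    return re.findall(r"\w+", no_mentions.lower())


print(decode_and_tokenize("@VirginAmerica I &lt;3 pretty graphics."))
# ['i', '3', 'pretty', 'graphics']
```

<p>Stop word removal is left out of this sketch, which is why 'i' is still present.</p>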
