# Common Text Preprocessing Steps

### 1. Tokenization

Splitting a document into sentences and words, and separating punctuation marks from the words they are attached to.


In [1]:
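import string
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# the tokenizers need the Punkt models, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead)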

# sample document that consists of a few sentences
entry = "Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting \
subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms \
to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) \
of the practicians use primarily Python for their tasks."

# print the original document
print(entry, "\n")

# tokenize and print the list of sentences
sentences = sent_tokenize(entry)
print(sentences, "\n")

# tokenize and print the list of words
words = word_tokenize(entry)
print(words)

Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks. 

['Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university!', 'They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data.', 'It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks.'] 

['Data', 'Mining', ',', 'Text', 'Mining', ',', 'and', 'Machine', 'Learning', 'are', 'probably', 'the', '3', 'most', 'interesting', 'subjects', 'to', 'learn', 'in', '2021', 'at', 'the', 'university', '!', 'They', 'utilize', 'supervised', 'and', 'unsupervised', 'algorithms', 'to', 'extract', 'and', 'infer', 'knowledge'
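
To make the idea of "spacing out punctuation" concrete, here is a minimal sketch of a regex-based word tokenizer that uses only the standard library. The pattern and the name `simple_tokenize` are assumptions made for illustration; `word_tokenize` applies more elaborate rules, so the two will not agree on every input.

```python
import re

# Hypothetical pattern for illustration only; NLTK's word_tokenize is more elaborate.
TOKEN_PATTERN = re.compile(
    r"\d+(?:\.\d+)?"    # integers and decimals such as 3 or 66.7
    r"|\w+(?:-\w+)*"    # words, including hyphenated ones such as two-thirds
    r"|[^\w\s]"         # any single punctuation mark becomes its own token
)

def simple_tokenize(text):
    """Split text into number, word, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("More than 66.7 % (two-thirds) use Python!"))
# ['More', 'than', '66.7', '%', '(', 'two-thirds', ')', 'use', 'Python', '!']
```

The number alternative comes first so that `66.7` stays a single token instead of being split at the decimal point.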

### 2. Lowercasing
Converting words to lower case reduces the vocabulary size (dimensionality), because all case variants of a word collapse into a single token (NLP, Nlp, nlp, NlP, nLp, nLP -> nlp).

In [2]:
# lowercasing 
entry_lower = entry.lower()

# uppercasing
entry_upper = entry.upper()

# print the original, lowercased, and uppercased documents
print(entry, "\n")
print(entry_lower, "\n")
print(entry_upper, "\n")

Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks. 

data mining, text mining, and machine learning are probably the 3 most interesting subjects to learn in 2021 at the university! they utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. it is estimated that more than 66.7 % (two-thirds) of the practicians use primarily python for their tasks. 

DATA MINING, TEXT MINING, AND MACHINE LEARNING ARE PROBABLY THE 3 MOST INTERESTING SUBJECTS TO LEARN IN 2021 AT THE UNIVERSITY! THEY UTILIZE SUPERVISED AND UNSUPERVISED ALGORITHMS TO EXTRACT AND INFER KNOWLEDGE FROM THE DATA. IT IS ESTIMATED THAT MORE THAN 66.7 % (TWO-THIRDS) OF THE PRACTICIANS USE PRIMARILY PYTHON FOR THEIR T
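
The effect on the vocabulary can be checked directly. The sketch below counts distinct tokens before and after lowercasing, using the case variants listed above as a made-up token list. It also shows `str.casefold()`, a more aggressive alternative that may be worth considering for non-English text.

```python
# Case variants of the same word, treated as separate tokens before normalization.
variants = ["NLP", "Nlp", "nlp", "NlP", "nLp", "nLP"]

vocab_raw = set(variants)
vocab_lower = {v.lower() for v in variants}

print(len(vocab_raw), len(vocab_lower))  # 6 1

# casefold() goes further than lower() for some characters, e.g. the German sharp s.
print("Straße".lower(), "Straße".casefold())  # straße strasse
```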

### 3. Stop words removal
Stop words are very common words (a, an, the, etc.) that occur in almost every document. They carry little information because they do not help to distinguish one document from another.

In [3]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

# select and print the English stop words (requires nltk.download('stopwords'))
stop = set(stopwords.words('english')) 
print("Stop words of English language:\n")
print(stop)


Stop words of English language:

{'so', 'where', 'only', "mightn't", "you'll", 'themselves', 'on', 's', 'aren', 'if', 'with', "didn't", 'needn', 'in', 'here', 'under', "wasn't", 'we', 'didn', 'up', 'their', 'the', 'can', 'me', 'itself', 'have', 've', "she's", 'they', 'those', 'be', 'to', 'from', 'all', 'just', 'further', 'himself', "wouldn't", 'not', 'against', 'our', "weren't", 'wouldn', 'what', 'is', 'do', "that'll", "you'd", 'but', 'through', 'll', 'after', 'most', 'had', 'won', 'out', 'ours', "needn't", 'own', 'him', 'shan', 'down', 'doing', 'shouldn', "isn't", 'his', 'for', 'into', 'couldn', 'again', 'it', 'don', 'ain', "won't", 'been', 'while', 'off', "shan't", 'she', 'now', 'nor', 'them', 'very', 'ma', 'd', 'isn', 'which', 'yourself', 'below', 'each', 'hers', 'has', 'and', 'its', 'whom', 'these', 're', "hadn't", "aren't", "should've", 'over', "it's", 'should', 'yours', 'or', 'some', 'such', 'am', 'haven', 'above', 'until', "doesn't", 'once', 'having', 'were', 'ourselves', 'herse

In [4]:
print("Original text:")
print(entry, "\n")  

# removing stop words (the comparison is case-sensitive, so 'They' and 'It' are kept)
words_no_stops = [w for w in words if w not in stop]

# join the filtered words back into a single string
entry_no_stops = ' '.join(words_no_stops)

print("Text without stop words:")
print(entry_no_stops)

Original text:
Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks. 

Text without stop words:
Data Mining , Text Mining , Machine Learning probably 3 interesting subjects learn 2021 university ! They utilize supervised unsupervised algorithms extract infer knowledge data . It estimated 66.7 % ( two-thirds ) practicians use primarily Python tasks .
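
In the output above, "They" and "It" survive because the stop word list is lowercase while the tokens are not. A minimal sketch of a case-insensitive filter follows; the tiny `stop_sample` set and the token list are stand-ins chosen for illustration, not the full NLTK list.

```python
# Small stand-in for the NLTK stop word set (illustrative only).
stop_sample = {"they", "and", "it", "is"}

tokens = ["They", "utilize", "supervised", "and", "unsupervised",
          "algorithms", ".", "It", "is", "estimated"]

# Case-sensitive filtering keeps capitalized stop words.
kept_sensitive = [w for w in tokens if w not in stop_sample]

# Lowercasing each token only for the comparison removes them, while keeping original casing elsewhere.
kept_insensitive = [w for w in tokens if w.lower() not in stop_sample]

print(kept_sensitive)
print(kept_insensitive)
```

Comparing on `w.lower()` leaves the surviving tokens in their original case, which can matter if a later step relies on capitalization.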


### 4. Punctuation and number removal
Punctuation can be useless or even confusing for some tasks, so the symbols are often simply deleted. Numbers are either deleted or replaced by a special symbol such as the # sign.

In [5]:
# removing punctuation 
entry_no_punct = entry.translate(str.maketrans('', '', string.punctuation))

# removing numbers (every digit character is dropped)
entry_no_num = ''.join(symb for symb in entry if not symb.isdigit())

# replacing numbers with # symbol (every digit character becomes one #)
entry_hash_num = ''.join('#' if symb.isdigit() else symb for symb in entry)
        
print("Original text:")
print(entry, "\n")      

print("Text without punctuation:")
print(entry_no_punct, "\n")

print("Text without numbers:")
print(entry_no_num, "\n")

print("Text with # in the place of numbers:")
print(entry_hash_num, "\n")

Original text:
Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks. 

Text without punctuation:
Data Mining Text Mining and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data It is estimated that more than 667  twothirds of the practicians use primarily Python for their tasks 

Text without numbers:
Data Mining, Text Mining, and Machine Learning are probably the  most interesting subjects to learn in  at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than . % (two-thirds) of the
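
The cell above works character by character, so `2021` would become `####`, and deleting punctuation merges `two-thirds` into `twothirds`. As a hedged alternative, the sketch below replaces each whole number with a single `#` and replaces punctuation with a space instead of deleting it. The sample sentence and variable names are made up for illustration.

```python
import re
import string

sample = "More than 66.7 % (two-thirds) use Python in 2021!"

# Replace each whole number (integer or decimal) with a single '#'.
hashed = re.sub(r"\d+(?:\.\d+)?", "#", sample)

# Map every punctuation character to a space, then collapse repeated whitespace.
punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(sample.translate(punct_to_space).split())

print(hashed)  # More than # % (two-thirds) use Python in #!
print(spaced)  # More than 66 7 two thirds use Python in 2021
```

Note that the second variant still splits `66.7` into two tokens, so numbers are usually best handled before punctuation is stripped.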

### 5. Stemming
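Reducing a word to its stem by stripping suffixes with heuristic rules. The result is not necessarily a valid word (e.g. machine -> machin).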

In [6]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

print("Original text:")
print(entry, "\n")  

# using Porter stemmer to stem our sentence
# note: split() leaves punctuation attached, so tokens like 'Mining,' and 'tasks.' are not stemmed
stemmed_entry = ' '.join([ps.stem(w) for w in entry.split()])

print("Stemmed text:")
print(stemmed_entry)

Original text:
Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks. 

Stemmed text:
data mining, text mining, and machin learn are probabl the 3 most interest subject to learn in 2021 at the university! they util supervis and unsupervis algorithm to extract and infer knowledg from the data. it is estim that more than 66.7 % (two-thirds) of the practician use primarili python for their tasks.
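
In the stemmed text above, `mining,` and `tasks.` are unchanged apart from casing because whitespace splitting leaves the punctuation attached to the word. A minimal sketch of stemming separated tokens instead is shown below. The regex tokenizer and the sample sentence are assumptions for illustration; the `words` list produced by `word_tokenize` earlier could be used in the same way.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sample = "Data Mining, Text Mining, and their tasks."

# Whitespace splitting keeps "Mining," as one token, which hides the -ing suffix.
stems_split = [stemmer.stem(w) for w in sample.split()]

# Separating punctuation first lets the stemmer see the bare words.
tokens = re.findall(r"\w+|[^\w\s]", sample)
stems_tokens = [stemmer.stem(t) for t in tokens]

print(stems_split)
print(stems_tokens)
```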


### 6. Lemmatization
Unlike stemming, lemmatization reduces each word to its lemma, a word that actually exists in the language. Lemmatization is often preferred over stemming because it performs a morphological analysis of the words.


In [7]:
from nltk.stem import WordNetLemmatizer 

print("Original sentence:")
print(entry, "\n")  

# using WordNet lemmatizer (every token is treated as a noun unless a pos argument is given)
lemmatizer = WordNetLemmatizer()

lemmatized_entry = ' '.join([lemmatizer.lemmatize(w) for w in words])
print(lemmatized_entry)

Original sentence:
Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks. 

Data Mining , Text Mining , and Machine Learning are probably the 3 most interesting subject to learn in 2021 at the university ! They utilize supervised and unsupervised algorithm to extract and infer knowledge from the data . It is estimated that more than 66.7 % ( two-thirds ) of the practician use primarily Python for their task .
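
In the output above only nouns change (subjects -> subject), while verb forms such as "estimated" are left as they are, because `lemmatize` assumes `pos='n'` by default. One possible way to pass the part of speech is sketched below. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the input is assumed to be a list of `(token, tag)` pairs with Penn Treebank tags, as returned by `nltk.pos_tag`.

```python
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet expects."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

def lemmatize_tagged(tagged_tokens, lemmatizer=None):
    """Lemmatize (token, tag) pairs using each token's part of speech.

    Calling this requires the WordNet corpus, e.g. nltk.download('wordnet').
    """
    lemmatizer = lemmatizer or WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
            for token, tag in tagged_tokens]
```

With the notebook's variables this could be called as `lemmatize_tagged(nltk.pos_tag(words))`, so that verbs are looked up as verbs rather than as nouns.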


### 7. Part of speech tagging
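Part-of-speech (POS) tagging assigns each token its part of speech (noun, verb, adjective, etc.). Several language-related applications rely on it, for example lemmatization. NLTK's `pos_tag` returns Penn Treebank tags such as NNP (proper noun) or VB (verb).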

In [8]:
from nltk import pos_tag

print("Original text:")
print(entry, "\n")  

print("Tagged text:")
# print the POS tagged list of tokens
print(pos_tag(words))

Original text:
Data Mining, Text Mining, and Machine Learning are probably the 3 most interesting subjects to learn in 2021 at the university! They utilize supervised and unsupervised algorithms to extract and infer knowledge from the data. It is estimated that more than 66.7 % (two-thirds) of the practicians use primarily Python for their tasks. 

Tagged text:
[('Data', 'NNP'), ('Mining', 'NNP'), (',', ','), ('Text', 'NNP'), ('Mining', 'NNP'), (',', ','), ('and', 'CC'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('are', 'VBP'), ('probably', 'RB'), ('the', 'DT'), ('3', 'CD'), ('most', 'JJS'), ('interesting', 'JJ'), ('subjects', 'NNS'), ('to', 'TO'), ('learn', 'VB'), ('in', 'IN'), ('2021', 'CD'), ('at', 'IN'), ('the', 'DT'), ('university', 'NN'), ('!', '.'), ('They', 'PRP'), ('utilize', 'VBP'), ('supervised', 'JJ'), ('and', 'CC'), ('unsupervised', 'JJ'), ('algorithms', 'NN'), ('to', 'TO'), ('extract', 'VB'), ('and', 'CC'), ('infer', 'VB'), ('knowledge', 'NN'), ('from', 'IN'), ('the', 'DT'
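
A tagged list like this is typically filtered or summarized by tag. The sketch below uses the first tagged tokens from the output above, copied by hand so that it runs without NLTK, to extract the nouns and count the tags.

```python
from collections import Counter

# First tokens of the tagged output, copied verbatim from the cell above.
tagged = [('Data', 'NNP'), ('Mining', 'NNP'), (',', ','), ('Text', 'NNP'),
          ('Mining', 'NNP'), (',', ','), ('and', 'CC'), ('Machine', 'NNP'),
          ('Learning', 'NNP'), ('are', 'VBP'), ('probably', 'RB'), ('the', 'DT'),
          ('3', 'CD'), ('most', 'JJS'), ('interesting', 'JJ'), ('subjects', 'NNS')]

# All Penn Treebank noun tags start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in tagged if tag.startswith("NN")]

# Frequency of each tag in the sample.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts.most_common(3))
```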