Word Tokenization: This is usually the first step in an NLP pipeline that works with text data. Tokenization breaks a piece of text into smaller units called tokens (words, subwords, or characters), which gives the machine learning model discrete units to work with. The goal is to turn raw text into a sequence of tokens from which a vocabulary can be built.

In [11]:
sentence = 'I love NLP!!'
tokens = sentence.split(" ")
print(tokens)

['I', 'love', 'NLP!!']
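
Splitting on spaces leaves punctuation attached to words ('NLP!!'). As a minimal sketch, assuming a simple regular expression is good enough for the text at hand, the standard-library re module can separate words from punctuation marks and then build a vocabulary from the tokens. The pattern below is an illustrative choice, not the only reasonable one.

```python
import re

sentence = 'I love NLP!!'

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)

# A vocabulary is the set of distinct tokens seen in the text
vocabulary = sorted(set(tokens))
print(vocabulary)
```

Each exclamation mark becomes its own token here, so a later punctuation-removal step can drop it without touching the word 'NLP'.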


Lower casing: This step reduces complexity. We convert all text to a single case, usually lowercase, so that the model does not treat 'NLP' and 'nlp' as two different tokens.

In [13]:
sentence = 'I love NLP!!'
sentence = sentence.lower()
print(sentence)

i love nlp!!
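
To see why lower casing reduces complexity, the following sketch counts distinct tokens before and after the conversion. The example sentence is made up for illustration, and collections.Counter is used only as a convenient way to count tokens.

```python
from collections import Counter

text = 'NLP is fun and nlp is everywhere'

# Without lowercasing, 'NLP' and 'nlp' are counted as two different tokens
raw_counts = Counter(text.split())

# After lowercasing, both spellings collapse into one vocabulary entry
lower_counts = Counter(text.lower().split())

print(len(raw_counts), len(lower_counts))
print(lower_counts['nlp'])
```

The vocabulary shrinks by one entry, and the two spellings are counted together.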


Punctuation removal: In this step, all punctuation marks are removed from the text.

In [14]:
import string
print(string.punctuation)
text = "Messi is great player. However, he is yet to win a world cup"
text_p = "".join([char for char in text if char not in string.punctuation])
print(text_p)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Messi is great player However he is yet to win a world cup
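
An alternative to the character-by-character list comprehension is a translation table, sketched below with str.maketrans and str.translate from the standard library. It should give the same result for ASCII punctuation; like the version above, it does not cover non-ASCII marks such as curly quotes.

```python
import string

text = "Messi is great player. However, he is yet to win a world cup"

# Build a translation table that deletes every ASCII punctuation character
remove_punct = str.maketrans('', '', string.punctuation)
text_p = text.translate(remove_punct)
print(text_p)
```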


Stop word removal: Very common words such as "the", "is", and "a" are called stop words. They contribute very little to predictions and add little analytical value, so removing them reduces noise and vocabulary size for our models. NLTK ships an English stop word list in its stopwords corpus, which we use here.

In [15]:
import nltk
#nltk.download('stopwords')

In [16]:
import nltk
#nltk.download('punkt')

In [17]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
example_sent = """This is a sample sentence,
                  showing off the stop words filtration."""
 
stop_words = set(stopwords.words('english'))
 
word_tokens = word_tokenize(example_sent)
# Case-insensitive filter: lower-case each token before checking it against stop_words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
# Case-sensitive filter: this overwrites the result above, so 'This' is kept
# because only the lower-case 'this' is in stop_words
filtered_sentence = []
 
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
 
print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
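
The output above still contains 'This' and the punctuation tokens, because the final filter is case-sensitive and only checks stop words. The sketch below shows one way to handle both in a single pass. STOP_WORDS is a small hand-written set assumed for illustration so the example runs without NLTK; a real stop word list is much longer.

```python
import string

# Hypothetical, hand-written stop word list used only for illustration;
# a real list (such as NLTK's English list) is much longer.
STOP_WORDS = {'this', 'is', 'a', 'off', 'the'}

tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'filtration', '.']

# Compare in lower case so 'This' matches 'this', and skip punctuation tokens
filtered = [t for t in tokens
            if t.lower() not in STOP_WORDS and t not in string.punctuation]
print(filtered)
```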


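Stemming: Stemming reduces a word to its stem by stripping suffixes with heuristic rules. The result is not always a dictionary word (for example, "happily" becomes "happili"). NLTK provides the PorterStemmer for this purpose.

In [18]: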
from nltk.stem import PorterStemmer
 
# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()
 
# Example words for stemming
words = ["running", "jumps", "happily", "running", "happily"]
 
# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]
 
# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)

Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']


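Lemmatization: Lemmatization maps a word to its dictionary form (lemma), optionally using its part of speech. NLTK provides the WordNetLemmatizer for this purpose; it requires the WordNet corpus, which can be fetched with nltk.download('wordnet').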

In [19]:
# import these modules
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
 
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
 
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))

rocks : rock
corpora : corpus
better : good


TF-IDF: TF-IDF (term frequency-inverse document frequency) weights a term by how often it occurs in a document, discounted by how many documents contain it. Here’s how you can implement TF-IDF in Python using scikit-learn. The steps are as follows:
1. Preprocessing: The text data is preprocessed by removing stop words, punctuation, and other non-alphabetic characters.
2. Tokenization: The text is tokenized into individual words.
3. Instantiate TfidfVectorizer and fit it to the corpus.
4. Transform the corpus to get its TF-IDF representation.
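
To make the weighting in steps 3 and 4 concrete, here is a hand-rolled sketch of the computation using only the standard library. It assumes the defaults that scikit-learn's TfidfVectorizer is documented to use (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each document vector); the helper name tfidf_vector is made up for this example. The input is the corpus after preprocessing.

```python
import math
from collections import Counter

# Preprocessed documents (stop words and punctuation already removed)
docs = ['quick brown fox jumps lazy dog',
        'lazy dog likes sleep day',
        'brown fox prefers eat cheese',
        'red fox jumps brown fox',
        'brown dog chases fox']

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF, assuming scikit-learn's default: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}

def tfidf_vector(tokens):
    """Return {term: weight} for one document, L2-normalised."""
    tf = Counter(tokens)
    raw = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

tfidf = [tfidf_vector(tokens) for tokens in tokenized]
print(round(tfidf[4]['chases'], 4))  # weight of 'chases' in the last document
```

Under these assumptions, 'chases' gets the largest weight in the last document because it appears in only one document, while 'brown' and 'fox' appear in four and are weighted down.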

In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus of documents
corpus = ['The quick brown fox jumps over the lazy dog.',
          'The lazy dog likes to sleep all day.',
          'The brown fox prefers to eat cheese.',
          'The red fox jumps over the brown fox.',
          'The brown dog chases the fox'
         ]

# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

# Define a function to preprocess the text
def preprocess_text(text):
    # Replace punctuation, digits, and other non-alphabetic characters with spaces
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Tokenize the text into words
    words = word_tokenize(text.lower())
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    # Join the words back into a string
    return ' '.join(words)

# Preprocess the corpus
corpus = [preprocess_text(doc) for doc in corpus]
print('Corpus: \n{}'.format(corpus))

# Create a TfidfVectorizer object and fit it to the preprocessed corpus
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Transform the preprocessed corpus into a TF-IDF matrix
tf_idf_matrix = vectorizer.transform(corpus)

# Get list of feature names that correspond to the columns in the TF-IDF matrix
print("Feature Names:\n", vectorizer.get_feature_names_out())

# Print the resulting matrix
print("TF-IDF Matrix:\n",tf_idf_matrix.toarray())

Corpus: 
['quick brown fox jumps lazy dog', 'lazy dog likes sleep day', 'brown fox prefers eat cheese', 'red fox jumps brown fox', 'brown dog chases fox']
Feature Names:
 ['brown' 'chases' 'cheese' 'day' 'dog' 'eat' 'fox' 'jumps' 'lazy' 'likes'
 'prefers' 'quick' 'red' 'sleep']
TF-IDF Matrix:
 [[0.30620672 0.         0.         0.         0.36399815 0.
  0.30620672 0.43850426 0.43850426 0.         0.         0.54351473
  0.         0.        ]
 [0.         0.         0.         0.49389914 0.33077001 0.
  0.         0.         0.39847472 0.49389914 0.         0.
  0.         0.49389914]
 [0.29550385 0.         0.52451722 0.         0.         0.52451722
  0.29550385 0.         0.         0.         0.52451722 0.
  0.         0.        ]
 [0.31309104 0.         0.         0.         0.         0.
  0.62618207 0.44836297 0.         0.         0.         0.
  0.55573434 0.        ]
 [0.39032474 0.69282362 0.         0.         0.46399205 0.
  0.39032474 0.         0.         0.         0. 