In [2]:
# !pip install nltk

## Text Preprocessing

Text preprocessing is an essential step in natural language processing (NLP) tasks. It involves transforming raw text data into a format that is more suitable for analysis and machine learning algorithms. In this tutorial, we will cover various common techniques for text preprocessing. Let's dive in!


### Lowercasing
Converting all text to lowercase can help to normalize the data and reduce the vocabulary size. It ensures that words in different cases are treated as the same word. For example, "apple" and "Apple" will both be transformed to "apple".

In [8]:
sent = "Nepotism is the act of granting an [advantage], privilege, or position to relatives in an occupation or field."

In [9]:
lower_sent = sent.lower()
lower_sent

'nepotism is the act of granting an [advantage], privilege, or position to relatives in an occupation or field.'
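For text that contains non-English characters, Python also offers `str.casefold()`, a more aggressive variant of `lower()` intended for caseless comparison. Here is a minimal sketch using only the standard library; the helper name `normalize_case` and the sample strings are made up for this illustration.

```python
def normalize_case(text, aggressive=False):
    """Lowercase text; with aggressive=True use casefold() instead of lower()."""
    return text.casefold() if aggressive else text.lower()

# lower() covers the common English case
print(normalize_case("Apple"))                      # apple

# casefold() additionally maps characters such as the German sharp s
print(normalize_case("Straße"))                     # straße
print(normalize_case("Straße", aggressive=True))    # strasse
```

For plain English text the two methods give the same result, so `lower()` is usually enough.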

### Removal of Punctuation and Special Characters
Punctuation marks and special characters often add little meaning to the text, and for many tasks they can be removed. Common punctuation marks include periods, commas, question marks, and exclamation marks. You can use regular expressions or string operations to remove them.

In [14]:
result = ""
# Characters to strip; extend this list to cover other punctuation marks
punctuations = [".", ",", ":", ";", "!", "?", "[", "]", "'"]
for char in lower_sent:
    if char not in punctuations:
        result += char

result

'nepotism is the act of granting an advantage privilege or position to relatives in an occupation or field'

In [15]:
import re

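# Drop every character that is neither a word character nor whitespace
# (\w already matches digits, so a separate \d is not needed)
cleaned = re.sub(r'[^\w\s]', '', lower_sent)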
cleaned

'nepotism is the act of granting an advantage privilege or position to relatives in an occupation or field'
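As a third option, the string-operations route mentioned above can be sketched with `str.translate` and the standard library's `string.punctuation` constant. Note that `string.punctuation` covers ASCII punctuation only, and the helper name `strip_punctuation` is made up for this example.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

sample = "nepotism is the act of granting an [advantage], privilege, or position."
print(strip_punctuation(sample))
```

One difference from the regular expression approach: `string.punctuation` includes the underscore, whereas `\w` treats the underscore as a word character and keeps it.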

### Stop Word Removal
Stop words are commonly occurring words in a language, such as "a," "an," "the," "is," and "in." These words provide little semantic value and can be removed to reduce noise in the data. Libraries like NLTK provide a list of predefined stop words for different languages.

Before running the code, make sure you have downloaded the stop word list using the first cell below.

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/shailesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
from nltk.corpus import stopwords

In [20]:
stopwords_eng = stopwords.words('english')

In [22]:
cleaned = [word for word in cleaned.split(" ") if word not in stopwords_eng]

In [24]:
cleaned = " ".join(cleaned)
cleaned

'nepotism act granting advantage privilege position relatives occupation field'
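The filtering step itself does not depend on NLTK. The sketch below uses a small hand-picked stop word set (an assumption for illustration, far shorter than NLTK's English list) and stores it in a `set`, so each membership test is a hash lookup rather than a scan through a list. Calling `split()` with no argument also avoids empty tokens when the text contains repeated spaces.

```python
# Hypothetical mini stop word list; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "or", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words from whitespace-separated text."""
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)

# Note the double space: split() without arguments handles it cleanly
sample = "nepotism is the act of granting an advantage  to relatives"
print(remove_stop_words(sample))
```

The same pattern should work with the NLTK list by passing `set(stopwords.words('english'))` as `stop_words`.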

## Tokenization

Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the level of granularity desired. Tokenization is a fundamental step in text preprocessing and is crucial for various natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and language generation.

### Word Tokenization

Word tokenization is the most common form of tokenization, where the text is split into individual words. For example, given the sentence "Tokenization is important for NLP tasks," the word tokens would be: ["Tokenization", "is", "important", "for", "NLP", "tasks"].

Word tokenization is typically performed using whitespace as the delimiter. However, it's important to handle cases like punctuation marks, contractions, and hyphenated words correctly. For example, "don't" should be tokenized as ["do", "n't"] instead of ["don", "'", "t"].

Libraries like NLTK, spaCy, and the tokenizers package provide ready-to-use word tokenization functions.
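To see why plain whitespace splitting falls short, the sketch below compares it with two regular-expression tokenizers written with the standard `re` module. The patterns and the sample sentence are simplified illustrations, not the rules that NLTK or spaCy actually apply.

```python
import re

text = "Tokenization isn't trivial, is it?"

# 1. Whitespace split: punctuation stays glued to the neighbouring word
whitespace_tokens = text.split()

# 2. Words and punctuation as separate tokens; contractions get broken apart
naive_tokens = re.findall(r"\w+|[^\w\s]", text)

# 3. Same idea, but an apostrophe between word characters keeps the word whole
contraction_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(whitespace_tokens)   # ['Tokenization', "isn't", 'trivial,', 'is', 'it?']
print(naive_tokens)        # ['Tokenization', 'isn', "'", 't', 'trivial', ',', 'is', 'it', '?']
print(contraction_tokens)  # ['Tokenization', "isn't", 'trivial', ',', 'is', 'it', '?']
```

The third pattern keeps contractions as a single token, which is a different convention from the Treebank-style split into pieces such as "do" and "n't" described above. Which convention is appropriate depends on the downstream task.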


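Before running any of these tokenization techniques, make sure you have `punkt` downloaded. `punkt` refers to the Punkt tokenizer, a sentence tokenizer whose models are trained with an unsupervised algorithm; NLTK ships pre-trained models for several languages. Punkt learns abbreviations, collocations, and words that tend to start sentences from raw text, and uses those statistics to decide whether a period marks a sentence boundary. `nltk.word_tokenize` first uses Punkt to split the text into sentences and then splits each sentence into words, which is why the download is needed even for word tokenization. Newer NLTK releases load the model from the `punkt_tab` package instead, so the cell below downloads both.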

In [29]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /home/shailesh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/shailesh/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [28]:
tokens = nltk.word_tokenize(cleaned)
tokens

['nepotism',
 'act',
 'granting',
 'advantage',
 'privilege',
 'position',
 'relatives',
 'occupation',
 'field']

## Stemming and Lemmatization

Stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their base or root forms. Both approaches aim to normalize words and reduce inflectional variations, enabling better analysis and comparison of words. However, they differ in their methods and outputs. Let's dive into each technique in detail:

### Stemming

Stemming reduces words to a base form by stripping affixes, most often suffixes. The result is called a stem, and it may not be an actual word. The primary goal of stemming is to simplify the vocabulary and group together words with the same base meaning.

For example, a stemmer such as Porter reduces both "running" and "runs" to the stem "run" by stripping the suffixes "-ing" and "-s" (and undoubling the trailing "n"). The irregular form "ran" has no suffix to strip, so it is left as "ran"; irregular forms are one of the cases that lemmatization handles better.

Stemming algorithms follow simple rules and heuristics based on linguistic patterns, rather than considering the context or part of speech of the word. Some popular stemming algorithms include the Porter stemming algorithm, the Snowball stemmer (which supports multiple languages), and the Lancaster stemming algorithm.

Stemming is a computationally lightweight approach and can be useful in certain cases where the exact word form is not crucial. However, it may produce stems that are not actual words, leading to potential loss of meaning and ambiguity.
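To make the rule-based nature of stemming concrete, here is a deliberately tiny suffix-stripping sketch. It is a toy written for this tutorial, not the Porter, Snowball, or Lancaster algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, then undouble a trailing consonant."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            # "runn" -> "run"; vowels and l, s, z are left doubled ("fall")
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouylsz":
                stem = stem[:-1]
            return stem
    return word  # no rule applied

for word in ["running", "runs", "ran", "happily", "studies"]:
    print(word, "->", toy_stem(word))
```

The outputs "happi" and "studi" show how purely mechanical rules produce stems that are not dictionary words, and "ran" shows that irregular forms pass through untouched.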

In [30]:
from nltk.stem import LancasterStemmer, PorterStemmer

In [40]:
stemmer = LancasterStemmer()
stemmer.stem("happily")

'happy'
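The three stemmers named earlier can disagree on the same input, so it can be worth comparing them before picking one. The sketch below runs NLTK's `PorterStemmer`, `SnowballStemmer`, and `LancasterStemmer` over a short word list chosen for illustration; none of them needs an extra `nltk.download` call.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["running", "runs", "ran", "happily"]

# Build {stemmer name: [stem of each word]} so the outputs can be compared
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, result in stems.items():
    print(f"{name:10s} {result}")
```

Lancaster is generally the most aggressive of the three, so its stems tend to be shorter.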

### Lemmatization

Lemmatization, on the other hand, aims to reduce words to their canonical or dictionary forms, known as lemmas. Unlike stemming, lemmatization considers the context and part of speech (POS) of the word to generate meaningful lemmas. The resulting lemmas are actual words found in the language's dictionary.

For example, when lemmatizing the words "running," "runs," and "ran," the lemma for each would be "run." Lemmatization takes into account the POS information to accurately determine the base form of the word.

Lemmatization algorithms use linguistic rules and morphological analysis to identify the appropriate lemma. They often rely on language-specific resources, such as word lists and morphological databases. Some popular lemmatization tools include the WordNet lemmatizer and the spaCy library (which supports lemmatization for multiple languages).

Lemmatization typically produces more accurate and meaningful results compared to stemming because it retains the core meaning of words. It is especially useful in tasks that require precise word analysis, such as information retrieval, question answering, and sentiment analysis.

However, lemmatization can be more computationally intensive compared to stemming due to its reliance on POS tagging and language-specific resources.
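The core idea, that the lemma depends on both the word and its part of speech and is looked up rather than computed by chopping suffixes, can be sketched with a plain dictionary. The mini lexicon and the name `toy_lemmatize` below are hypothetical; real lemmatizers rely on much larger resources such as WordNet.

```python
# Hypothetical mini lexicon: (word, part of speech) -> lemma
LEXICON = {
    ("running", "verb"): "run",
    ("runs", "verb"): "run",
    ("ran", "verb"): "run",
    ("running", "noun"): "running",   # as in "running is healthy"
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("ran", "verb"))        # run
print(toy_lemmatize("running", "verb"))    # run
print(toy_lemmatize("running", "noun"))    # running
print(toy_lemmatize("tasty", "adjective")) # tasty (not in the lexicon)
```

Because the irregular form "ran" is stored explicitly, the lookup recovers "run", which suffix stripping alone cannot do.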

Before running the lemmatization examples, make sure you have `wordnet` downloaded.

In [41]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/shailesh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [42]:
from nltk.stem import WordNetLemmatizer

In [48]:
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("tasty")

'tasty'

In [52]:
lemmatizer.lemmatize("runs")

'run'

When deciding between stemming and lemmatization, consider the trade-off between simplicity and accuracy. If you require speed and a broad reduction of word forms, stemming may be sufficient. However, if you need more accurate analysis and want to preserve the semantic meaning of words, lemmatization is generally the preferred choice.

It's important to note that both stemming and lemmatization have limitations. They may not always produce the correct base forms, especially for irregular words or those not present in the chosen language's dictionary. Contextual information, such as word sense disambiguation, can further enhance the accuracy of both techniques.

In [53]:
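# lemmatize() treats each word as a noun by default, so verb forms
# such as "granting" are returned unchanged
cleaned_tokens = [lemmatizer.lemmatize(word) for word in tokens]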

In [54]:
cleaned_tokens

['nepotism',
 'act',
 'granting',
 'advantage',
 'privilege',
 'position',
 'relative',
 'occupation',
 'field']