December 2, 2024

# Natural Language Processing with `nltk`

`nltk` is one of the most popular Python packages for Natural Language Processing. It provides tools for importing, cleaning, and pre-processing text written in human language, and for applying computational linguistics algorithms such as sentiment analysis.

It also includes many easy-to-use datasets in the `nltk.corpus` package. For example, the `movie_reviews` corpus can be downloaded with the `nltk.download` function.

## Inspecting the Movie Reviews Dataset

In [None]:
# Importing the required libraries
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [None]:
# Import the sentiment intensity analyzer
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [None]:
#Downloading the dataset
nltk.download("movie_reviews")

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [None]:
#Running this cell to import the dataset
from nltk.corpus import movie_reviews

In [None]:
#Running this cell for later use in tokenization
nltk.download('vader_lexicon')  # for sentiment analysis
nltk.download('punkt_tab')  # for tokenizing

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## Tokenizing Text into Words

In [None]:
romeo_text = """Why then, O brawling love! O loving hate!
O any thing, of nothing first create!
O heavy lightness, serious vanity,
Misshapen chaos of well-seeming forms,
Feather of lead, bright smoke, cold fire, sick health,
Still-waking sleep, that is not what it is!
This love feel I, that feel no love in this."""

The first step in Natural Language Processing is generally to split the text into words. This might appear simple, but handling all the corner cases is tedious. Consider, for example, all the punctuation issues we have to solve if we start with a plain split on whitespace.

**Splitting `romeo_text` by spaces and storing the resultant list of words in the variable `romeo_tokens`**

In [None]:
# Split the romeo_text into a list of words and store it in romeo_tokens
romeo_tokens = romeo_text.split()
romeo_tokens[:15]

['Why',
 'then,',
 'O',
 'brawling',
 'love!',
 'O',
 'loving',
 'hate!',
 'O',
 'any',
 'thing,',
 'of',
 'nothing',
 'first',
 'create!']

In [None]:
assert type(romeo_tokens) == list
assert len(romeo_tokens) == 52
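To make the punctuation problem concrete, here is a minimal sketch that uses only the standard library. The `simple_tokenize` helper and its regular expression are hypothetical and only approximate what a real tokenizer does: they keep hyphenated or apostrophe-joined words together and turn every other punctuation character into its own token.

```python
import re

# Hypothetical helper, not part of nltk: a word (optionally joined by
# hyphens or apostrophes) is one token, and any other non-space,
# non-word character is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return a list of word and punctuation tokens found in text."""
    return TOKEN_PATTERN.findall(text)

line = "Why then, O brawling love! O loving hate!"

# Whitespace splitting leaves punctuation glued to the words.
print(line.split())

# The regex-based sketch separates the punctuation.
print(simple_tokenize(line))
```

With whitespace splitting, `'love!'` and `'love'` would count as two different words, which is exactly the kind of corner case a proper tokenizer has to handle.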

`nltk` ships a more sophisticated tokenizer for English. It relies on the `punkt` models (`punkt_tab`) that we downloaded earlier in the notebook.

**Using the `nltk.word_tokenize(text)` function to properly tokenize `romeo_text` and storing the result as `romeo_words`.**

In [None]:
# Use word_tokenize to properly tokenize romeo_text into individual words and punctuation
romeo_words = word_tokenize(romeo_text)
romeo_words[:15]

['Why',
 'then',
 ',',
 'O',
 'brawling',
 'love',
 '!',
 'O',
 'loving',
 'hate',
 '!',
 'O',
 'any',
 'thing',
 ',']

In [None]:
assert type(romeo_words) == list
assert len(romeo_words) == 68

## Building a bag-of-words model

The simplest model for analyzing text treats it as an unordered collection of words (a bag of words). This is often enough to infer the category, the topic, or the sentiment of a text.

From the bag-of-words model we can build features for a classifier. Here we assume that each word is a feature that is either `True` or `False`.
We implement this in Python as a dictionary that maps each word in a sentence to `True`.

**Writing a function `build_bag_of_words_features(words)` which returns a dictionary of the form {word: True} given a list of words. Calling the function with `romeo_words` and storing the resulting dictionary as `romeo_word_dict`.**

In [None]:
# Creating a bag-of-words dictionary where each word is a key with a value of True
def build_bag_of_words_features(words):
    return {word: True for word in words}

# Generating the bag-of-words dictionary for romeo_words
romeo_word_dict = build_bag_of_words_features(romeo_words)
romeo_word_dict

{'Why': True,
 'then': True,
 ',': True,
 'O': True,
 'brawling': True,
 'love': True,
 '!': True,
 'loving': True,
 'hate': True,
 'any': True,
 'thing': True,
 'of': True,
 'nothing': True,
 'first': True,
 'create': True,
 'heavy': True,
 'lightness': True,
 'serious': True,
 'vanity': True,
 'Misshapen': True,
 'chaos': True,
 'well-seeming': True,
 'forms': True,
 'Feather': True,
 'lead': True,
 'bright': True,
 'smoke': True,
 'cold': True,
 'fire': True,
 'sick': True,
 'health': True,
 'Still-waking': True,
 'sleep': True,
 'that': True,
 'is': True,
 'not': True,
 'what': True,
 'it': True,
 'This': True,
 'feel': True,
 'I': True,
 'no': True,
 'in': True,
 'this': True,
 '.': True}

In [None]:
# Sanity check
assert type(build_bag_of_words_features(romeo_words)) == dict
assert sum(value for value in romeo_word_dict.values() if value) == 45
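Because the dictionary only records whether a word is present, word order and repetition are both discarded. The following sketch illustrates this with two made-up sentences; `presence_features` is a hypothetical stand-in that mirrors `build_bag_of_words_features` so the example can run on its own.

```python
def presence_features(words):
    # Same idea as build_bag_of_words_features: one True per distinct word.
    return {word: True for word in words}

# Two invented sentences with the same vocabulary but different order
# and different repetition counts.
first = presence_features("love is not hate".split())
second = presence_features("hate is not love not love".split())

# Dictionary equality ignores insertion order, so the two bags match.
print(first == second)
print(len(second))
```

This is why 68 tokens in `romeo_words` collapse into only 45 dictionary entries: repeated tokens such as `'O'` or `','` are stored once.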

This is what we wanted, but we notice that punctuation like "!" and words that are useless for classification, like "of" or "that", are also included.
Such words are called "stopwords", and `nltk` has a convenient corpus of them that we can download:

In [None]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Using the Python `string.punctuation` string and the English stopwords, we can build better features by filtering out the tokens that would not help in the classification.

**Creating a list `useless_words` that combines the English stopwords and the punctuation characters.**

In [None]:
# Create a list of useless words combining English stopwords and punctuation characters
useless_words = list(set(stopwords.words('english')).union(set(string.punctuation)))

In [None]:
assert type(useless_words) == list
assert len(useless_words) == 211

**Writing a function `build_bag_of_words_features_filtered(words)` that returns a filtered bag of words: a dictionary with only the useful words as keys and `True` as the value. Calling this function with `romeo_words` and storing the resulting dictionary as `romeo_useful_word_dict`.**

In [None]:
# Creating a filtered bag-of-words dictionary excluding useless words (stopwords and punctuation)
def build_bag_of_words_features_filtered(words):
    return {word: True for word in words if word not in useless_words}

# Generating the filtered bag-of-words dictionary for romeo_words
romeo_useful_word_dict = build_bag_of_words_features_filtered(romeo_words)
romeo_useful_word_dict

{'Why': True,
 'O': True,
 'brawling': True,
 'love': True,
 'loving': True,
 'hate': True,
 'thing': True,
 'nothing': True,
 'first': True,
 'create': True,
 'heavy': True,
 'lightness': True,
 'serious': True,
 'vanity': True,
 'Misshapen': True,
 'chaos': True,
 'well-seeming': True,
 'forms': True,
 'Feather': True,
 'lead': True,
 'bright': True,
 'smoke': True,
 'cold': True,
 'fire': True,
 'sick': True,
 'health': True,
 'Still-waking': True,
 'sleep': True,
 'This': True,
 'feel': True,
 'I': True}

In [None]:
# Sanity check
assert type(build_bag_of_words_features_filtered(romeo_words)) == dict
assert len(romeo_useful_word_dict) == 31
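Note that `'This'` and `'I'` survive the filter above while `'this'` is removed: the comparison is case-sensitive and the stopwords are lowercase. One possible variant, sketched below, compares the lowercased token instead. The `sample_stopwords` set is a tiny hypothetical stand-in for the `nltk` list so that the example runs without any download, and whether case-folding is desirable depends on the task.

```python
import string

# Hypothetical, tiny stand-in for the nltk English stopword list.
sample_stopwords = {"this", "i", "that", "no", "in", "why"}
sample_useless = sample_stopwords | set(string.punctuation)

def filter_case_insensitive(words):
    # Keep the original spelling as the key, but test the lowercase form.
    return {word: True for word in words if word.lower() not in sample_useless}

words = ["This", "love", "feel", "I", ",", "that", "feel",
         "no", "love", "in", "this", "."]
print(filter_case_insensitive(words))
```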

## Frequencies of Words

It is common to explore a dataset before starting the analysis. In this section we will find the most common words and their frequencies.

Calling `movie_reviews.words()` (from the nltk corpus we imported previously) with no argument extracts the words of the entire dataset. We store them as `all_words`; there should be about 1.6 million of them.

In [None]:
# Extracting all words from the movie_reviews dataset and store them in a list
all_words = list(movie_reviews.words())
all_words[:15]

['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive']

Next we filter out the `useless_words` defined in the previous section and create a new list, `filtered_words`. This reduces the length of the dataset by more than a factor of 2.

In [None]:
# Filtering out useless words (stopwords and punctuation) from all_words to create a cleaned list
filtered_words = [word for word in all_words if word not in useless_words]
filtered_words[:15]

['plot',
 'two',
 'teen',
 'couples',
 'go',
 'church',
 'party',
 'drink',
 'drive',
 'get',
 'accident',
 'one',
 'guys',
 'dies',
 'girlfriend']

In [None]:
assert type(filtered_words) == list

The `collections` module of the standard library contains a `Counter` class that is handy for counting the frequencies of the words in our list:

In [None]:
from collections import Counter
word_counter = Counter(filtered_words)

`Counter` also has a [most_common()](https://pythontic.com/containers/counter/most_common) method. We call it on `word_counter` and store the 10 most used words of the corpus in `most_common_words`.

In [None]:
# Retrieving the top 10 most common words and their frequencies from the word_counter
most_common_words = word_counter.most_common(10)
most_common_words

[('film', 9517),
 ('one', 5852),
 ('movie', 5771),
 ('like', 3690),
 ('even', 2565),
 ('good', 2411),
 ('time', 2411),
 ('story', 2169),
 ('would', 2109),
 ('much', 2049)]

In [None]:
assert type(most_common_words) == list
assert len(most_common_words) == 10
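To see how `Counter` behaves on a small input, here is a sketch with a made-up word list. It also shows how ties are handled: words with equal counts (like `'good'` and `'time'` in the corpus output above) are returned in the order they were first encountered.

```python
from collections import Counter

# Invented toy data, chosen so that "good" and "time" tie.
toy_words = ["film", "good", "time", "film", "time", "good", "film", "story"]
toy_counter = Counter(toy_words)

# The three most frequent words with their counts.
print(toy_counter.most_common(3))

# A relative frequency: the count divided by the total number of tokens.
total = sum(toy_counter.values())
print(toy_counter["film"] / total)
```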


## Sentiment Analysis

Using the sentiment intensity analyzer, we loop over `list_sentences` and print the polarity scores of each sentence.

In [None]:
# Analyze the sentiment of each sentence in the list and print the polarity scores and sentiment
analyzer = SentimentIntensityAnalyzer()

list_sentences = [
    "Hello, how are you?",
    "Today is a nice day",
    "I don't like the food at the cafe",
    "This is the worst pizza I have ever had.",
    "The orange juice is delicious!",
    "I am late to class.",
]

def calculate_sentiment(sentence):
    scores = analyzer.polarity_scores(sentence)
    sentiment = (
        "Positive" if scores["compound"] >= 0.05
        else "Negative" if scores["compound"] <= -0.05
        else "Neutral"
    )
    return scores, sentiment

# Loop through sentences, calculate sentiment, and print results
for sentence in list_sentences:
    scores, sentiment = calculate_sentiment(sentence)
    print(f"Sentence: {sentence}")
    print(f"Scores: {scores}")
    print(f"Sentiment: {sentiment}\n")

Sentence: Hello, how are you?
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Sentiment: Neutral

Sentence: Today is a nice day
Scores: {'neg': 0.0, 'neu': 0.588, 'pos': 0.412, 'compound': 0.4215}
Sentiment: Positive

Sentence: I don't like the food at the cafe
Scores: {'neg': 0.232, 'neu': 0.768, 'pos': 0.0, 'compound': -0.2755}
Sentiment: Negative

Sentence: This is the worst pizza I have ever had.
Scores: {'neg': 0.339, 'neu': 0.661, 'pos': 0.0, 'compound': -0.6249}
Sentiment: Negative

Sentence: The orange juice is delicious!
Scores: {'neg': 0.0, 'neu': 0.501, 'pos': 0.499, 'compound': 0.6114}
Sentiment: Positive

Sentence: I am late to class.
Scores: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Sentiment: Neutral
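The printed scores can be read as follows: `neg`, `neu`, and `pos` behave like proportions that add up to 1 in every row above, while `compound` is a single summary value that the code compares against the ±0.05 thresholds. The sketch below re-checks this on the recorded output without calling the analyzer; `label_from_compound` is a hypothetical helper that repeats the thresholding from `calculate_sentiment`.

```python
# Scores copied from the output above, in the same order as list_sentences.
recorded_scores = [
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.0, "neu": 0.588, "pos": 0.412, "compound": 0.4215},
    {"neg": 0.232, "neu": 0.768, "pos": 0.0, "compound": -0.2755},
    {"neg": 0.339, "neu": 0.661, "pos": 0.0, "compound": -0.6249},
    {"neg": 0.0, "neu": 0.501, "pos": 0.499, "compound": 0.6114},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

def label_from_compound(compound, threshold=0.05):
    # Hypothetical helper: same thresholds as calculate_sentiment.
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

labels = []
for scores in recorded_scores:
    proportions = scores["neg"] + scores["neu"] + scores["pos"]
    labels.append(label_from_compound(scores["compound"]))
    print(round(proportions, 3), labels[-1])
```

The last sentence, "I am late to class.", is labeled Neutral because none of its words carries a score in the lexicon, even though a human reader might find it mildly negative.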

