# Lesson 1 Assignment: Sentiment Analysis with NLTK




The Natural Language Toolkit (NLTK) is a popular Python library for natural language processing (NLP), used worldwide to develop NLP applications and analyze text data.

It provides a simple interface for various NLP tasks, including tokenization, stemming, lemmatization, parsing, and sentiment analysis. NLTK also ships with corpora, a rich source of data for training and evaluating NLP models.


Sentiment analysis, also called opinion mining, is a technique used to determine the emotional tone or sentiment expressed in a text. It involves analyzing the words and phrases in the text to identify whether the underlying sentiment is positive, neutral, or negative. Sentiment analysis is applied in a wide range of areas, including social media monitoring, customer feedback analysis, and market research.

There are various approaches to sentiment analysis, but the most common are a lexicon-based approach, a machine learning (ML) based approach, and a pre-trained transformer-based deep learning approach.

Lexicon-based analysis relies on lexical features of a text, such as the presence of positive or negative words and phrases. This tutorial is a step-by-step guide to lexicon-based sentiment analysis using the NLTK library in Python.
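
To make the idea concrete, here is a minimal sketch of a lexicon-based scorer. The word list and the `lexicon_score` and `lexicon_label` helpers are hypothetical and exist only for illustration; VADER's actual lexicon is far larger and also accounts for intensifiers, negation, and punctuation.

```python
# Toy sentiment lexicon: hypothetical word scores chosen only for illustration.
TOY_LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "love": 2.0,
    "bad": -1.0,
    "terrible": -2.0,
    "hate": -2.0,
}


def lexicon_score(text):
    """Sum the lexicon scores of the words in `text`; unknown words count as 0."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


def lexicon_label(text):
    """Map the summed score to 'positive', 'negative', or 'neutral'."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_label("I love this great game"))    # positive
print(lexicon_label("terrible app and bad ads"))  # negative
print(lexicon_label("it is a game"))              # neutral
```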


# Sentiment Analysis example in Python
To perform sentiment analysis in Python using NLTK, we first need to install the library. The text data must then be preprocessed using techniques such as tokenization, stop word removal, and stemming or lemmatization. Once the text has been preprocessed, we pass it to the VADER sentiment analyzer to classify its sentiment as positive or negative.


## Installing NLTK
To use NLTK, we first install it with pip from the command prompt or terminal:

In [1]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Importing libraries and loading dataset

First, we’ll import the necessary libraries for text analysis and sentiment analysis, such as pandas for data handling, nltk for natural language processing, and SentimentIntensityAnalyzer for sentiment analysis:

In [3]:
# import libraries
import pandas as pd

import nltk

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer

NLTK also requires some additional data to be downloaded before it can be used effectively. This data includes pre-trained models, corpora, and other resources that NLTK uses to perform various NLP tasks. To download this data, run:

In [4]:
# download nltk corpus (first time only)
import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    |   Package bcp47 is already up-to-dat

True
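
Downloading `'all'` pulls every NLTK corpus and model, which is far more than this tutorial uses. A lighter alternative is to fetch only the resources the pipeline needs. The sketch below is one way to do that. The resource list is an assumption about what the tokenizer, stop word list, lemmatizer, and VADER require (the exact set can vary between NLTK versions), and `download_resources` is a hypothetical helper, not part of NLTK.

```python
# Resource ids assumed to cover this tutorial's pipeline; adjust for your NLTK version.
REQUIRED_RESOURCES = [
    "punkt",          # tokenizer models used by word_tokenize
    "punkt_tab",      # tokenizer tables expected by newer NLTK releases
    "stopwords",      # stop word lists
    "wordnet",        # lexical database used by WordNetLemmatizer
    "omw-1.4",        # multilingual WordNet data some versions also need
    "vader_lexicon",  # lexicon used by SentimentIntensityAnalyzer
]


def download_resources(download, resources=REQUIRED_RESOURCES):
    """Call `download(resource_id, quiet=True)` for each id; return the ids that failed."""
    return [name for name in resources if not download(name, quiet=True)]


# Usage in the notebook (needs nltk installed and network access):
#   import nltk
#   failed = download_resources(nltk.download)
#   print("Failed downloads:", failed)
```

Passing the downloader in as an argument keeps the helper independent of NLTK itself, so it can be checked without network access.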

Once the environment is set up, we will load a dataset of Amazon reviews using pd.read_csv(). This creates a DataFrame object that we can use to analyze the data. We'll display the first few rows of the DataFrame using df.head():

In [6]:
# Load the amazon review dataset

df = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv')

df.head()

Unnamed: 0,reviewText,Positive
0,This is a one of the best apps acording to a b...,1
1,This is a pretty good version of the game for ...,1
2,this is a really cool game. there are a bunch ...,1
3,"This is a silly game and can be frustrating, b...",1
4,This is a terrific game on any pad. Hrs of fun...,1


## Text Preprocessing:
We will create a function preprocess_text that performs the text preprocessing for us: tokenization, stop word removal, and lemmatization. The result of this function is a cleaned and preprocessed text that is suitable for sentiment analysis:

In [7]:
# create preprocess_text function
def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words (build the set once per call instead of once per token)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    processed_text = ' '.join(lemmatized_tokens)
    return processed_text

Now we apply the function to preprocess the review column of our DataFrame:

In [8]:
df['reviewText'] = df['reviewText'].apply(preprocess_text)
df

Unnamed: 0,reviewText,Positive
0,one best apps acording bunch people agree bomb...,1
1,pretty good version game free . lot different ...,1
2,really cool game . bunch level find golden egg...,1
3,"silly game frustrating , lot fun definitely re...",1
4,terrific game pad . hr fun . grandkids love . ...,1
...,...,...
19995,app fricken stupid.it froze kindle wont allow ...,0
19996,please add ! ! ! ! ! need neighbor ! ginger101...,1
19997,love ! game . awesome . wish free stuff house ...,1
19998,love love love app side fashion story fight wo...,1


## NLTK Sentiment Analysis:
First, we’ll instantiate a SentimentIntensityAnalyzer object from the nltk.sentiment.vader module:

In [9]:
# instantiate NLTK sentiment analyzer

analyzer = SentimentIntensityAnalyzer()

Next, we’ll define a function called get_sentiment that takes a text string as its input. The function calls the polarity_scores method of the analyzer object to obtain a dictionary of sentiment scores for the text, which includes scores for negative (neg), neutral (neu), and positive (pos) sentiment, plus an overall compound score.

The function then checks whether the positive score is greater than 0 and returns 1 if it is, and 0 otherwise. This means that any text with a positive score above 0 will be classified as having a positive sentiment, and any other text will be classified as having a negative sentiment:

In [10]:
# create get_sentiment function

def get_sentiment(text):

    scores = analyzer.polarity_scores(text)

    sentiment = 1 if scores['pos'] > 0 else 0

    return sentiment

Finally, we’ll apply the get_sentiment function to the reviewText column of our DataFrame using the apply method. This creates a new column called sentiment in the DataFrame, which stores the sentiment score for each review. We’ll then display the first 10 rows of the updated DataFrame:

In [12]:

# apply get_sentiment function

df['sentiment'] = df['reviewText'].apply(get_sentiment)

df.head(10)

Unnamed: 0,reviewText,Positive,sentiment
0,one best apps acording bunch people agree bomb...,1,1
1,pretty good version game free . lot different ...,1,1
2,really cool game . bunch level find golden egg...,1,1
3,"silly game frustrating , lot fun definitely re...",1,1
4,terrific game pad . hr fun . grandkids love . ...,1,1
5,entertaining game ! n't smart play . guess 's ...,1,1
6,awesome n't need wi ti play trust . really fun...,1,1
7,awesome bet one even read review know game goo...,1,1
8,basicly free version ad . 's actually awesome ...,1,1
9,far best free app available anywhere . helped ...,1,1


VADER's compound score ranges from -1 to +1, while the pos, neu, and neg scores each range from 0 to 1. The get_sentiment function above uses the pos score with a cut-off of 0: any review with a pos score above 0 is classified as 1 (positive).
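
An alternative to the `pos > 0` rule is to threshold VADER's `compound` value, which summarizes the whole text in a single number between -1 and +1. A convention suggested in the VADER documentation treats `compound >= 0.05` as positive and `compound <= -0.05` as negative. The sketch below shows how such a rule could look. `label_from_scores` is a hypothetical helper, and the example dictionaries are made-up values in the shape that `polarity_scores` returns.

```python
def label_from_scores(scores, threshold=0.05):
    """Return 1 (positive), 0 (negative), or None (neutral) from a VADER score dict."""
    compound = scores["compound"]
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return 0
    return None  # too close to zero to call either way


# Made-up score dictionaries, shaped like the output of analyzer.polarity_scores(text).
example_positive = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8}
example_negative = {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6}
example_neutral = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

print(label_from_scores(example_positive))  # 1
print(label_from_scores(example_negative))  # 0
print(label_from_scores(example_neutral))   # None
```

In the notebook this would be called as `label_from_scores(analyzer.polarity_scores(text))`. Because the dataset has only two labels, neutral results would still need to be mapped to 0 or 1 before computing any metrics.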

Since we have actual labels, we can evaluate the performance of this method by building a confusion matrix:

In [13]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(df['Positive'], df['sentiment']))

[[ 1131  3636]
 [  576 14657]]
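
For binary labels, scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so the four cells are true negatives, false positives, false negatives, and true positives. As a sanity check, the headline metrics can be recomputed by hand from the counts printed above (the variable names are illustrative):

```python
# Counts copied from the confusion matrix above: rows = true label, columns = predicted label.
tn, fp = 1131, 3636    # true label 0 (negative reviews)
fn, tp = 576, 14657    # true label 1 (positive reviews)

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of reviews predicted positive, share that are positive
recall_pos = tp / (tp + fn)      # of positive reviews, share that were found
recall_neg = tn / (tn + fp)      # of negative reviews, share that were found

print(f"accuracy:      {accuracy:.2f}")       # 0.79
print(f"precision (1): {precision_pos:.2f}")  # 0.80
print(f"recall (1):    {recall_pos:.2f}")     # 0.96
print(f"recall (0):    {recall_neg:.2f}")     # 0.24
```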


We can also check the classification report:

In [14]:
from sklearn.metrics import classification_report

print(classification_report(df['Positive'], df['sentiment']))

              precision    recall  f1-score   support

           0       0.66      0.24      0.35      4767
           1       0.80      0.96      0.87     15233

    accuracy                           0.79     20000
   macro avg       0.73      0.60      0.61     20000
weighted avg       0.77      0.79      0.75     20000



As you can see, the overall accuracy of this rule-based sentiment analysis model is 79%.
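
That figure is easier to interpret next to a naive baseline. Because about three quarters of the reviews are labeled positive, a classifier that always predicts 1 already scores well. The short calculation below uses the support counts from the classification report:

```python
# Support counts from the classification report above.
n_negative = 4767
n_positive = 15233
total = n_negative + n_positive

# Accuracy of always predicting the majority class (1 = positive).
baseline_accuracy = n_positive / total
vader_accuracy = 0.79  # overall accuracy reported above

print(f"majority-class baseline: {baseline_accuracy:.2f}")                   # 0.76
print(f"gain over the baseline:  {vader_accuracy - baseline_accuracy:.2f}")  # 0.03
```

The gap is small, and the 0.24 recall on class 0 shows that most negative reviews are still being labeled positive.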

Since this is labeled data, you can also try building an ML model to see whether an ML-based approach results in better accuracy.
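
One possible starting point is a bag-of-words classifier from scikit-learn. The sketch below is an assumption about how such a model could look, not part of the original assignment. It trains on a handful of made-up reviews so that it runs on its own; in the notebook, you would instead split `df['reviewText']` and `df['Positive']` with `train_test_split` and evaluate on the held-out part.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up reviews and labels (1 = positive, 0 = negative), for illustration only.
train_texts = [
    "love this game so much fun",
    "great app works perfectly",
    "awesome and entertaining",
    "terrible app keeps crashing",
    "waste of time do not download",
    "boring and full of ads",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each review into a weighted word-count vector;
# logistic regression then learns a weight for each word.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(["fun game love it", "crashing all the time"])
print(predictions)
```

The same `confusion_matrix` and `classification_report` calls used above can then be applied to the held-out predictions for a like-for-like comparison with VADER.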