# SENTIMENT ANALYSIS


Sentiment analysis is a popular task in natural language processing. Its goal is to classify a text by the attitude or emotion it expresses, which can be positive, negative, or neutral.

# IMPORT PACKAGES

In [16]:
# Standard library
import os
import re
import warnings

# Third-party libraries
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

warnings.filterwarnings('ignore')

# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


# FILE DIRECTORY AND FILE LOADING

In [10]:
os.chdir(r"C:\Users\Dell\Desktop\DataScience\machine learning\Datasets")

df = pd.read_csv("movie_review.csv")
df.head()

| | fold_id | cv_tag | html_id | sent_id | text | tag |
|---|---|---|---|---|---|---|
| 0 | 0 | cv000 | 29590 | 0 | films adapted from comic books have had plenty... | pos |
| 1 | 0 | cv000 | 29590 | 1 | for starters , it was created by alan moore ( ... | pos |
| 2 | 0 | cv000 | 29590 | 2 | to say moore and campbell thoroughly researche... | pos |
| 3 | 0 | cv000 | 29590 | 3 | the book ( or " graphic novel , " if you will ... | pos |
| 4 | 0 | cv000 | 29590 | 4 | in other words , don't dismiss this film becau... | pos |


# EXPLORATION

In [22]:
df.describe()

| | fold_id | html_id | sent_id |
|---|---|---|---|
| count | 64720.0 | 64720.0 | 64720.0 |
| mean | 4.549382 | 16074.097373 | 18.98118 |
| std | 2.853176 | 7175.282521 | 15.08369 |
| min | 0.0 | 42.0 | 0.0 |
| 25% | 2.0 | 10613.0 | 8.0 |
| 50% | 5.0 | 15091.0 | 16.0 |
| 75% | 7.0 | 21865.0 | 27.0 |
| max | 9.0 | 29867.0 | 111.0 |


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64720 entries, 0 to 64719
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   fold_id         64720 non-null  int64 
 1   cv_tag          64720 non-null  object
 2   html_id         64720 non-null  int64 
 3   sent_id         64720 non-null  int64 
 4   text            64720 non-null  object
 5   tag             64720 non-null  object
 6   preprocess_txt  64720 non-null  object
dtypes: int64(3), object(4)
memory usage: 3.5+ MB


In [25]:
df.shape

(64720, 7)

# PREPROCESSING

In [17]:
lemma = WordNetLemmatizer()
# A set gives fast membership tests when filtering tokens
stop_words = set(stopwords.words('english'))

EXPLANATION OF CODES

These two statements prepare the tools used for text preprocessing.
WordNetLemmatizer() will be used for lemmatization, that is, reducing words to their base (dictionary) forms.
The list returned by stopwords.words('english') will be used to filter common stop words out of the text data.

In [18]:
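def text_prep(x):
    # Lowercase, then replace every run of non-letters with a single space
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    # Split into tokens and drop English stop words
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    # Reduce each remaining word to its WordNet lemma (default POS: noun)
    lemmatize = [lemma.lemmatize(w) for w in words]

    return lemmatize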

EXPLANATION OF CODES

The text_prep() function is the text preprocessing pipeline. It converts the input text (x) to lowercase,
replaces non-alphabetic characters with spaces, tokenizes the result into words, removes stop words,
and finally lemmatizes the remaining words to their base forms. It returns the result as a list of tokens.
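
The first steps of this pipeline can be tried without NLTK. The sketch below is only an illustration under stated assumptions: `basic_clean` and `DEMO_STOP_WORDS` are hypothetical names, the stop word set is a small hand-picked subset rather than the full NLTK list, `str.split()` stands in for `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical, hand-picked stop words for illustration only; the notebook
# uses the full list from stopwords.words('english').
DEMO_STOP_WORDS = {"the", "a", "an", "of", "from", "in", "this", "it", "is", "don", "t"}

def basic_clean(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep letters only, split on whitespace, drop stop words."""
    lowered = str(text).lower()
    # Every run of non-letter characters becomes a single space
    letters_only = re.sub(r"[^a-zA-Z]+", " ", lowered).strip()
    # Whitespace split stands in for word_tokenize (an assumption of this sketch)
    tokens = letters_only.split()
    return [t for t in tokens if t not in stop_words]

sample = "In other words, don't dismiss this film!"
print(basic_clean(sample))  # ['other', 'words', 'dismiss', 'film']
```

Note that the regex turns the apostrophe in "don't" into a space, so the word becomes "don" and "t", and both are then removed as stop words. The negation is lost in the cleaned text.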

In [21]:
preprocess_tag = [" ".join(text_prep(i)) for i in df['text']]
df["preprocess_txt"] = preprocess_tag
df["preprocess_txt"]

0        film adapted comic book plenty success whether...
1        starter created alan moore eddie campbell brou...
2        say moore campbell thoroughly researched subje...
3        book graphic novel page long includes nearly c...
4                                 word dismiss film source
                               ...                        
64715       lack inspiration traced back insipid character
64716    like many skit current incarnation saturday ni...
64717    watching one roxbury skit snl come away charac...
64718                              bump unsuspecting woman
64719                  watching night roxbury left exactly
Name: preprocess_txt, Length: 64720, dtype: object

EXPLANATION OF CODES

These lines apply the text_prep() function to each document in the 'text' column of the DataFrame df and join the resulting tokens back into a single string per document. The list of preprocessed strings is then stored in a new column, preprocess_txt. This kind of preprocessing is common before text analysis tasks such as text classification, sentiment analysis, or topic modeling.

In [26]:
sent = SentimentIntensityAnalyzer()

In [27]:
polarity = [round(sent.polarity_scores(i)['compound'], 2) for i in df['preprocess_txt']]
df['sentiment_score'] = polarity
df.head()

| | fold_id | cv_tag | html_id | sent_id | text | tag | preprocess_txt | sentiment_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cv000 | 29590 | 0 | films adapted from comic books have had plenty... | pos | film adapted comic book plenty success whether... | -0.11 |
| 1 | 0 | cv000 | 29590 | 1 | for starters , it was created by alan moore ( ... | pos | starter created alan moore eddie campbell brou... | 0.25 |
| 2 | 0 | cv000 | 29590 | 2 | to say moore and campbell thoroughly researche... | pos | say moore campbell thoroughly researched subje... | 0.13 |
| 3 | 0 | cv000 | 29590 | 3 | the book ( or " graphic novel , " if you will ... | pos | book graphic novel page long includes nearly c... | 0.32 |
| 4 | 0 | cv000 | 29590 | 4 | in other words , don't dismiss this film becau... | pos | word dismiss film source | 0.0 |
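
The compound score is a number between -1 and 1, so it still has to be turned into a class before it can be compared with the tag column. The sketch below shows one possible way to do that. The function name `label_from_compound` is hypothetical, and the ±0.05 cutoff is the convention commonly suggested for VADER, not something this notebook specifies. The scores and tags are copied from the five rows shown above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to 'pos', 'neg', or 'neu'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# Compound scores and tags copied from the five rows of df.head() above
scores = [-0.11, 0.25, 0.13, 0.32, 0.0]
tags = ["pos", "pos", "pos", "pos", "pos"]

predicted = [label_from_compound(s) for s in scores]
# Fraction of rows where the derived label matches the dataset tag
agreement = sum(p == t for p, t in zip(predicted, tags)) / len(tags)

print(predicted)  # ['neg', 'pos', 'pos', 'pos', 'neu']
print(agreement)  # 0.6
```

On these five rows the derived label matches the tag in three cases. Five rows are far too few to judge the approach; the same comparison over the whole DataFrame would be needed for a meaningful figure.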


In [30]:
text = df['text'][0]
text

"films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before ."

In [31]:
sent.polarity_scores(text)

{'neg': 0.165, 'neu': 0.719, 'pos': 0.115, 'compound': -0.5346}