## Sentiment Analysis using NLTK & Naïve Bayes  

##### Introduction  
- This project analyzes movie reviews using NLP techniques.  
- It classifies reviews as **positive** or **negative**.  

##### Dataset  
- Uses the `movie_reviews` dataset from `nltk.corpus`.  

##### Preprocessing  
- Tokenization, stopword removal, and stemming.  

##### Feature Extraction  
- Bag of Words (BoW) presence features over a 3000-word vocabulary.  

##### Model Training  
- Uses the **Naïve Bayes classifier** from `nltk.classify`.  

##### Results  
- Accuracy: **83.80%** on the 500-review held-out test set.  
- Displays the top informative features.  



In [2]:
!pip install spacy



In [3]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
import spacy

# Load the small English pipeline downloaded above.
nlp = spacy.load("en_core_web_sm")

text = "Ines Agrebi is 24 old year from Tunisia .She is an telecommunication engineer ."

doc = nlp(text)

# Print the tokens, then every named entity with its label.
print([token.text for token in doc])
for ent in doc.ents:
    print(f"Entity: {ent.text}, label: {ent.label_}")

['Ines', 'Agrebi', 'is', '24', 'old', 'year', 'from', 'Tunisia', '.She', 'is', 'an', 'telecommunication', 'engineer', '.']
Entity: Ines Agrebi, label: PERSON
Entity: 24 old year, label: DATE
Entity: Tunisia, label: GPE


In [5]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

## **Sentiment Analysis Project with NLTK**

In [6]:
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
from nltk.corpus import movie_reviews

In [8]:
positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')

print(f"Number of Positive Reviews:{len(positive_reviews)}")
print(f"Number of Negative Reviews:{len(negative_reviews)}")

print("\n Example of a Positive Review:\n")
print(movie_reviews.raw(positive_reviews[0]))
print("\n Example of a Negative Review:\n")
print(movie_reviews.raw(negative_reviews[0]))

Number of Positive Reviews:1000
Number of Negative Reviews:1000

 Example of a Positive Review:

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's d

In [9]:
type(positive_reviews)

list

In [10]:
print(positive_reviews)

['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt', 'pos/cv010_29198.txt', 'pos/cv011_12166.txt', 'pos/cv012_29576.txt', 'pos/cv013_10159.txt', 'pos/cv014_13924.txt', 'pos/cv015_29439.txt', 'pos/cv016_4659.txt', 'pos/cv017_22464.txt', 'pos/cv018_20137.txt', 'pos/cv019_14482.txt', 'pos/cv020_8825.txt', 'pos/cv021_15838.txt', 'pos/cv022_12864.txt', 'pos/cv023_12672.txt', 'pos/cv024_6778.txt', 'pos/cv025_3108.txt', 'pos/cv026_29325.txt', 'pos/cv027_25219.txt', 'pos/cv028_26746.txt', 'pos/cv029_18643.txt', 'pos/cv030_21593.txt', 'pos/cv031_18452.txt', 'pos/cv032_22550.txt', 'pos/cv033_24444.txt', 'pos/cv034_29647.txt', 'pos/cv035_3954.txt', 'pos/cv036_16831.txt', 'pos/cv037_18510.txt', 'pos/cv038_9749.txt', 'pos/cv039_6170.txt', 'pos/cv040_8276.txt', 'pos/cv041_21113.txt', 'pos/cv042_10982.txt', 'pos/cv043_15013.tx

In [11]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [12]:
from nltk.tokenize import word_tokenize


example_review = movie_reviews.raw(positive_reviews[0])

ines = word_tokenize(example_review)

print(ines[:50])

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'re", 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', 'there', "'s", 'never', 'really', 'been', 'a', 'comic', 'book', 'like']


In [13]:
len(ines)

826

In [14]:
from nltk.corpus import stopwords
import string

stop_words = set(stopwords.words('english'))
punctuations = set(string.punctuation)

def clean_review(words):
    # Lowercase each token and drop English stopwords and single-character punctuation.
    return [word.lower() for word in words if word.lower() not in stop_words and word not in punctuations]

cleaned_words = clean_review(ines)
print(cleaned_words[:50])




['films', 'adapted', 'comic', 'books', 'plenty', 'success', 'whether', "'re", 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'toward', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', "'s", 'never', 'really', 'comic', 'book', 'like', 'hell', 'starters', 'created', 'alan', 'moore', 'eddie', 'campbell', 'brought', 'medium', 'whole', 'new', 'level', 'mid', "'80s", '12-part', 'series', 'called', 'watchmen', 'say', 'moore', 'campbell', 'thoroughly', 'researched', 'subject']
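The introduction lists stemming among the preprocessing steps, while `clean_review` above only lowercases and filters tokens. A minimal sketch of how stemming could be layered on top, assuming NLTK's `PorterStemmer` is an acceptable choice (the `stem_words` helper is hypothetical and not part of the notebook):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(words):
    """Reduce each token to its Porter stem (hypothetical helper)."""
    return [stemmer.stem(word) for word in words]

# A few tokens taken from the cleaned review shown above.
sample = ["films", "adapted", "comic", "books"]
print(stem_words(sample))
```

If this were adopted, the same stemming would have to be applied to the vocabulary used for feature extraction, otherwise stemmed review tokens would no longer match the vocabulary entries.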


In [20]:
print(punctuations)

{'#', '?', '-', '"', '+', '(', '&', ')', '^', '*', ':', ']', '%', '>', '<', '!', '\\', '|', '_', '`', '[', ',', '~', '=', '}', '$', ';', '{', "'", '/', '@', '.'}


In [15]:
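# Data processing: examine the corpus, then select the words to use as model features.
all_words = [word.lower() for word in movie_reviews.words()]
all_words_freq = nltk.FreqDist(all_words)

# Note: keys() follows insertion order, so this slice keeps the first 3000
# distinct words encountered, not the 3000 most frequent ones
# (all_words_freq.most_common(3000) would return those).
word_features = list(all_words_freq.keys())[:3000]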

#####################

def extract_features(review_words):
    # Map every vocabulary word to True/False depending on whether the review contains it.
    review_words_set = set(review_words)
    return {word: (word in review_words_set) for word in word_features}


features_example = extract_features(cleaned_words)
print(list(features_example.items())[:20])

[('plot', False), (':', False), ('two', False), ('teen', False), ('couples', False), ('go', True), ('to', False), ('a', False), ('church', False), ('party', False), (',', False), ('drink', False), ('and', False), ('then', False), ('drive', False), ('.', False), ('they', False), ('get', True), ('into', False), ('an', False)]
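The feature names printed above (`'plot'`, `':'`, `'two'`, ...) follow the order in which words first appear in the corpus, because slicing `list(all_words_freq.keys())` does not sort by frequency. The toy example below, using made-up tokens rather than the movie review corpus, sketches the difference between that slice and `FreqDist.most_common`:

```python
import nltk

# Toy token stream (illustrative data, not from movie_reviews).
tokens = ["a", "b", "b", "c", "c", "c"]
freq = nltk.FreqDist(tokens)

# Insertion order: the first two distinct tokens seen.
first_seen = list(freq.keys())[:2]

# Frequency order: the two most frequent tokens.
most_frequent = [word for word, count in freq.most_common(2)]

print(first_seen)     # ['a', 'b']
print(most_frequent)  # ['c', 'b']
```

Switching the notebook to `most_common(3000)` would change the vocabulary and therefore the reported accuracy, so the figures shown here apply only to the insertion-order slice.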


In [19]:
print(all_words_freq[':'])

3042


In [17]:
len(all_words)

1583820

In [22]:
import random
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Prepare the labeled dataset (labels are the strings "pos" and "neg")
reviews = []
for fileid in positive_reviews:
    words = clean_review(movie_reviews.words(fileid))
    reviews.append((extract_features(words), "pos"))

for fileid in negative_reviews:
    words = clean_review(movie_reviews.words(fileid))
    reviews.append((extract_features(words), "neg"))

# Shuffle the dataset
random.shuffle(reviews)

# Split into training and testing sets
train_data = reviews[:1500]
test_data = reviews[1500:]

# Train the classifier
classifier = NaiveBayesClassifier.train(train_data)

# Check the accuracy
print(f"Model Accuracy: {accuracy(classifier, test_data) * 100:.2f}%")


Model Accuracy: 83.80%
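Because `random.shuffle` is called without a seed, the train/test split (and so the accuracy) can differ from run to run. A minimal sketch of a reproducible 1500/500 split, assuming a fixed seed is acceptable; the `split_dataset` helper, the seed value, and the toy data are illustrative and not part of the notebook:

```python
import random

def split_dataset(items, train_fraction=0.75, seed=42):
    """Shuffle a copy of items with a fixed seed, then split into train/test."""
    rng = random.Random(seed)   # private generator, leaves the global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 2000 labeled reviews.
toy = list(range(2000))
train, test = split_dataset(toy)
print(len(train), len(test))  # 1500 500
```

Calling `split_dataset` again with the same seed returns the same partition, which makes an accuracy figure comparable across runs.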


In [27]:
test_review = "This movie was amazing! The acting was great, and the story was touching."
test_tokens = word_tokenize(test_review)
test_cleaned = clean_review(test_tokens)
test_features = extract_features(test_cleaned)

print("Review Sentiment:", classifier.classify(test_features))


Review Sentiment: neg
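The `neg` label for an apparently positive sentence is easier to investigate when the class probabilities are visible, since `classify` returns only the winning label. The sketch below trains on a tiny made-up dataset (not the movie reviews) to show how `prob_classify` exposes those probabilities:

```python
from nltk.classify import NaiveBayesClassifier

# Tiny illustrative training set in the same (features, label) format as the notebook.
toy_train = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
    ({"good": False, "bad": True}, "neg"),
]
toy_classifier = NaiveBayesClassifier.train(toy_train)

# prob_classify returns a probability distribution over the labels.
dist = toy_classifier.prob_classify({"good": True, "bad": False})
for label in sorted(dist.samples()):
    print(label, round(dist.prob(label), 3))
print("Predicted:", dist.max())
```

Applied to the notebook's `classifier` and `test_features`, the same call would show whether the `neg` prediction is a confident one or a near tie.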


In [24]:
classifier.show_most_informative_features(10)


Most Informative Features
                   sucks = True              neg : pos    =     14.7 : 1.0
           unimaginative = True              neg : pos    =      8.6 : 1.0
                  turkey = True              neg : pos    =      7.2 : 1.0
              whatsoever = True              neg : pos    =      7.2 : 1.0
                 idiotic = True              neg : pos    =      6.9 : 1.0
                  annual = True              pos : neg    =      6.8 : 1.0
                 advised = True              neg : pos    =      6.5 : 1.0
                 martian = True              neg : pos    =      6.5 : 1.0
                  shoddy = True              neg : pos    =      6.5 : 1.0
             silverstone = True              neg : pos    =      6.5 : 1.0
