### Sentiment Analysis of Movie Reviews (NLP Example)

Data Collection: We need a labeled dataset of movie reviews. This example uses NLTK's built-in movie_reviews corpus, in which every review is pre-labeled as positive or negative.

Preprocessing: Lowercase the review tokens and drop stop words and non-alphabetic tokens such as punctuation.

Modeling: Train a simple sentiment classifier (using Naive Bayes) to classify movie reviews as positive or negative.

In [1]:
import nltk
nltk.download('movie_reviews')
nltk.download('punkt')  # tokenizer models for word_tokenize; recent NLTK releases may need 'punkt_tab' instead
nltk.download('stopwords')


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\mindf\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mindf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mindf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import random

In [3]:
# Load the dataset
fileids = movie_reviews.fileids()

In [4]:
# Prepare labeled data: (list of words, label); the label ('pos' or 'neg') is the first part of the file id
documents = [(list(movie_reviews.words(fileid)), fileid.split('/')[0]) for fileid in fileids]


In [5]:
# Shuffle the documents for a random training/test split.
# No seed is set, so the split (and the accuracy below) changes from run to run.
random.shuffle(documents)

# Prepare stopwords to remove unnecessary words
stop_words = set(stopwords.words("english"))

In [6]:
# Function to extract features from text
def extract_features(words):
    words = [word.lower() for word in words if word.isalpha()]  # Remove non-alphabetic tokens and convert to lowercase
    words = [word for word in words if word not in stop_words]  # Remove stop words
    return {word: True for word in words}

# Create feature sets for training
featuresets = [(extract_features(doc), label) for doc, label in documents]

# Split into training and test sets (80% training, 20% testing)
split_index = int(len(featuresets) * 0.8)
train_set, test_set = featuresets[:split_index], featuresets[split_index:]

In [7]:
# Train the Naive Bayes classifier
classifier = NaiveBayesClassifier.train(train_set)

# Test the classifier
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Classifier Accuracy: {accuracy * 100:.2f}%")


Classifier Accuracy: 71.25%


In [8]:
def classify_review(review):
    # Tokenize and extract features from the review text
    features = extract_features(word_tokenize(review))
    return classifier.classify(features)

# Example movie reviews
review1 = "The movie was fantastic! I absolutely loved the storyline and the acting was superb."
review2 = "It was a terrible movie. The plot was boring and the characters were flat."

# Classify the reviews
print(f"Review 1 sentiment: {classify_review(review1)}")  # Expected: pos
print(f"Review 2 sentiment: {classify_review(review2)}")  # Expected: neg

Review 1 sentiment: pos
Review 2 sentiment: neg


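Data Preprocessing: We load the movie_reviews corpus, which consists of 2,000 labeled reviews (1,000 positive and 1,000 negative). The corpus lists reviews grouped by label, so we shuffle them before splitting; otherwise the test set would contain only one class. The 80/20 split leaves 1,600 reviews for training and 400 for testing.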
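
The shuffle-and-split step can be sketched on placeholder data as follows. The `split_dataset` helper and the `seed` value are illustrative assumptions, not part of the notebook above; fixing a seed is one way to make the split, and therefore the reported accuracy, repeatable between runs.

```python
import random

def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    rng = random.Random(seed)   # local RNG, so the global random state is untouched
    shuffled = list(items)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder stand-ins for labeled reviews, grouped by label like the corpus
items = [(f"review_{i}", "neg" if i < 5 else "pos") for i in range(10)]
train, test = split_dataset(items)
print(len(train), len(test))  # 8 2
```

Calling `split_dataset(items)` twice with the same seed returns the same partition, which makes it easier to compare classifiers on identical data.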

Feature Extraction: For each review, we build a feature dictionary from its words, ignoring stop words (common words like "and" and "the") and non-alphabetic tokens such as punctuation and numbers. Each remaining word becomes a feature whose value is True to indicate its presence. Word counts are discarded: a word that appears ten times produces the same feature as one that appears once.
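
To make the feature format concrete, here is a minimal sketch of the same word-presence idea on a hand-tokenized sentence. The tiny `STOP_WORDS` set and the `presence_features` name are assumptions for illustration; the notebook itself uses NLTK's full English stop-word list.

```python
# Small stand-in stop-word set (assumption; NLTK's English list is much longer)
STOP_WORDS = {"the", "and", "was", "a", "i"}

def presence_features(tokens):
    """Map each lowercased alphabetic, non-stop-word token to True."""
    cleaned = (t.lower() for t in tokens if t.isalpha())
    return {t: True for t in cleaned if t not in STOP_WORDS}

tokens = ["The", "acting", "was", "superb", "!", "The", "plot", "was", "superb", "."]
features = presence_features(tokens)
print(features)
# {'acting': True, 'superb': True, 'plot': True}
```

Note that "superb" occurs twice in the input but only once in the result, since a dictionary records presence rather than frequency.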

Training: We train a Naive Bayes classifier on the training set. NLTK provides an easy-to-use implementation, NaiveBayesClassifier, whose train method takes a list of (features, label) pairs.
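
For intuition about what the classifier learns, the following is a simplified plain-Python sketch of Naive Bayes over word-presence features. It is not NLTK's implementation: the function names, the toy training data, and the add-one smoothing are all assumptions chosen for brevity, and NLTK's own probability estimator differs in its details.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count, per label, how many documents contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def classify_nb(model, features):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)  # log prior P(label)
        for word in features:
            if word in vocab:  # skip words never seen in training
                # Add-one smoothing (assumption) avoids log(0) for unseen pairs
                p = (word_counts[label][word] + 1) / (n_docs + 2)
                score += math.log(p)  # log P(word present | label)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical), in the same (features, label) format
toy_train = [
    ({"great": True, "acting": True}, "pos"),
    ({"superb": True, "plot": True}, "pos"),
    ({"boring": True, "plot": True}, "neg"),
    ({"terrible": True, "acting": True}, "neg"),
]
model = train_nb(toy_train)
print(classify_nb(model, {"superb": True, "acting": True}))  # pos
print(classify_nb(model, {"boring": True, "film": True}))    # neg
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numeric underflow on long reviews.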

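Testing: We measure the classifier's accuracy on the held-out test set; the run shown above scored 71.25%. Because the shuffle is not seeded, this figure varies from run to run. Finally, we use the trained model to predict the sentiment of two new movie reviews.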
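
Accuracy here is simply the fraction of test items whose predicted label matches the true label. A minimal sketch of that calculation follows; the `accuracy` helper, the rule-based `toy_predict`, and the toy test set are hypothetical and exist only to exercise the arithmetic.

```python
def accuracy(predict, labeled_items):
    """Fraction of (features, label) pairs that `predict` labels correctly."""
    if not labeled_items:
        raise ValueError("need at least one labeled item")
    correct = sum(1 for features, label in labeled_items
                  if predict(features) == label)
    return correct / len(labeled_items)

# Hypothetical rule-based predictor, used only to exercise the function
def toy_predict(features):
    return "neg" if "boring" in features else "pos"

toy_test = [
    ({"superb": True}, "pos"),
    ({"boring": True}, "neg"),
    ({"flat": True}, "neg"),    # toy_predict gets this one wrong
    ({"great": True}, "pos"),
]
print(f"{accuracy(toy_predict, toy_test) * 100:.2f}%")  # 75.00%
```

A single accuracy number hides which class the errors fall on, so for unbalanced data it is worth inspecting per-class results as well.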