# Sentiment Analysis Demo

This notebook demonstrates how to use the sentiment analysis system we've built from scratch.

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add the parent directory to the path so we can import our modules.
# Notebooks don't define __file__, so we resolve the parent of the working directory instead.
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))

from src.preprocess import TextPreprocessor, download_sample_dataset
from src.features import FeatureExtractor
from src.model import SentimentModel

%matplotlib inline

## 1. Download Sample Dataset

First, we'll download a sample dataset for sentiment analysis. This is a small collection of movie reviews.

In [None]:
# Download sample dataset
data_path = '../data/imdb_sample.csv'
df = download_sample_dataset(data_path)
df

## 2. Explore the Data

Let's explore the dataset to understand what we're working with.

In [None]:
# Show basic statistics
print("Dataset shape:", df.shape)
print("\nSentiment value counts:")
print(df['sentiment'].value_counts())

# Visualize sentiment distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='sentiment', data=df)
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment (0 = Negative, 1 = Positive)')
plt.ylabel('Count')
plt.show()

Let's also look at some examples of positive and negative reviews:

In [None]:
# Show examples of positive reviews
print("Positive review examples:")
for review in df[df['sentiment'] == 1]['review'].head(2):
    print(f"\n{review}")

# Show examples of negative reviews
print("\n\nNegative review examples:")
for review in df[df['sentiment'] == 0]['review'].head(2):
    print(f"\n{review}")

## 3. Preprocess the Text Data

Next, we'll preprocess the text data to prepare it for feature extraction.

In [None]:
# Initialize the text preprocessor
preprocessor = TextPreprocessor(remove_stopwords=True, lemmatize=True)

# Preprocess the data
processed_df = preprocessor.preprocess_data(df, 'review', 'sentiment')

# Display the original and cleaned text for a few examples
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Original: {df['review'].iloc[i]}")
    print(f"Cleaned: {processed_df['cleaned_text'].iloc[i]}")
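
For intuition, text cleaning of this kind usually lowercases the text, strips markup and punctuation, and drops stopwords. The sketch below is a standalone illustration using only the standard library. `simple_clean` and the tiny `STOPWORDS` set are hypothetical names introduced here; they do not reproduce `TextPreprocessor`, and no lemmatization is performed.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "the", "is", "it", "was", "this", "and", "of", "to", "i"}

def simple_clean(text):
    """Lowercase, strip HTML tags and non-letters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(simple_clean("This movie was GREAT!<br />I loved it."))
# prints: movie great loved
```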

## 4. Extract Features

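We'll use TF-IDF to convert the text into numerical features. Each review becomes a vector of weights over at most 1,000 unigrams and bigrams, where terms that are frequent in a review but rare across the dataset score highest.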

In [None]:
# Initialize the feature extractor
feature_extractor = FeatureExtractor(method='tfidf', max_features=1000, ngram_range=(1, 2))

# Extract features
X = feature_extractor.fit_transform(processed_df['cleaned_text'])
y = processed_df['sentiment']

print(f"Feature matrix shape: {X.shape}")

# Show a few feature names
feature_names = feature_extractor.get_feature_names()
print("\nFirst 10 features:")
print(feature_names[:10])
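
To make the weighting concrete, here is a minimal sketch of textbook TF-IDF on unigrams, written with the standard library only. The `tfidf` helper is hypothetical and is not how `FeatureExtractor` is implemented. Library implementations such as scikit-learn's `TfidfVectorizer` apply IDF smoothing and L2 normalization by default, so their numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed textbook TF-IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency (share of the document) times inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["great movie", "terrible movie", "great acting"]
for doc, w in zip(docs, tfidf(docs)):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

In the second toy document, "terrible" appears in only one document and so receives a higher weight than "movie", which appears in two.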

## 5. Train and Evaluate Model

Now we'll hold out 20% of the data as a stratified test set, train our model on the remaining 80%, and evaluate its performance on the held-out reviews.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
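
The `stratify=y` argument keeps the ratio of positive to negative reviews roughly the same in both splits. The sketch below illustrates the idea with the standard library only. `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and the toy labels are made up for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same label ratio in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    # Sample the test share separately within each class
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 30 + [0] * 10   # imbalanced toy labels: 75% positive
train_idx, test_idx = stratified_split(labels)
print("test positives:", sum(labels[i] for i in test_idx), "of", len(test_idx))
# prints: test positives: 6 of 8
```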

In [None]:
# Initialize and train the model
model = SentimentModel(model_type='logistic', class_weight='balanced')
model.train(X_train, y_train)

# Evaluate the model
results = model.evaluate(X_test, y_test)

print("Model evaluation results:")
print(f"Accuracy: {results['accuracy']:.4f}")
print(f"Precision: {results['precision']:.4f}")
print(f"Recall: {results['recall']:.4f}")
print(f"F1 Score: {results['f1']:.4f}")
print("\nClassification Report:")
print(results['report'])
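
As a reference for reading these numbers, the sketch below computes the four metrics for the positive class directly from true and predicted labels. `binary_metrics` is a hypothetical helper and the labels are toy data. Whether `model.evaluate` averages across classes differently is not shown here, so treat this as an illustration of the definitions only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

With these toy labels there are 3 true positives, 1 false positive, and 1 false negative, so all four metrics come out to 0.75.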

## 6. Visualize Model Results

Let's visualize the confusion matrix to better understand our model's performance.

In [None]:
from sklearn.metrics import confusion_matrix

# Get predictions
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
labels = ['Negative', 'Positive']  # class 0, class 1
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
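
To read the heatmap, recall that `confusion_matrix` puts true labels on the rows and predicted labels on the columns, with labels in sorted order. The sketch below rebuilds that layout by hand on toy labels. `confusion_counts` is a hypothetical helper written for illustration.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Build a matrix where rows are true labels and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
# With labels (0, 1): top row is true negatives and false positives,
# bottom row is false negatives and true positives
(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
# prints: TN=3 FP=1 FN=1 TP=3
```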

## 7. Save the Model and Feature Extractor

Let's save our trained model and feature extractor so we can use them later.

In [None]:
import pickle

# Create directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save the model
model.save('../models/notebook_sentiment_model.pkl')

# Save the feature extractor
with open('../models/notebook_feature_extractor.pkl', 'wb') as f:
    pickle.dump(feature_extractor, f)
    
print("Model and feature extractor saved successfully!")
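
Loading a pickled object later is the mirror image of saving it. The sketch below shows the round trip with a stand-in dictionary and a temporary directory, so it runs on its own. The stand-in object and file name are assumptions for illustration; for the real feature extractor, `src.features` must be importable when unpickling.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object; any picklable Python object behaves the same way
fitted_object = {"vocabulary": {"great": 0, "movie": 1}, "max_features": 1000}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feature_extractor.pkl")
    # Save the object to disk
    with open(path, "wb") as f:
        pickle.dump(fitted_object, f)
    # Load it back, as a later session would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == fitted_object)
# prints: True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.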

## 8. Interactive Sentiment Analysis

Finally, let's try our model on some custom text inputs!

In [None]:
def analyze_sentiment(text):
    """Clean the text, predict its sentiment, and print the class probabilities."""
    # Preprocess the text
    cleaned_text = preprocessor.clean_text(text)
    
    # Extract features
    features = feature_extractor.transform([cleaned_text])
    
    # Make prediction
    prediction = model.predict(features)[0]
    probabilities = model.predict_proba(features)[0]
    
    # Print results
    print(f"Text: {text}")
    print(f"Cleaned Text: {cleaned_text}")
    print(f"Sentiment: {'POSITIVE' if prediction == 1 else 'NEGATIVE'}")
    print(f"Confidence: {max(probabilities):.2%}")
    print("Probability Distribution:")
    print(f"  Negative: {probabilities[0]:.2%}")
    print(f"  Positive: {probabilities[1]:.2%}")

In [None]:
# Try with some examples
analyze_sentiment("I absolutely loved this movie! It was fantastic.")

In [None]:
analyze_sentiment("The movie was terrible. I wasted my money.")

In [None]:
analyze_sentiment("It was okay. Not great but not terrible either.")

In [None]:
# Try your own examples!
your_text = input("Enter text to analyze: ")
analyze_sentiment(your_text)

## Conclusion

In this notebook, we've walked through a complete sentiment analysis workflow:

1. Downloading and exploring the data
2. Preprocessing the text
3. Extracting features using TF-IDF
4. Training and evaluating a sentiment classification model
5. Visualizing the results with a confusion matrix
6. Saving the model and feature extractor
7. Using the model to analyze custom text interactively

This approach can be extended to more complex datasets and different domains by adjusting the preprocessing steps, feature extraction methods, and model architectures.