# Twitter Tweet Sentiment Analysis using Naive Bayes

This notebook implements entity-level sentiment analysis using a Naive Bayes classifier with Bag-of-Words (BoW) vectorization. The model predicts the sentiment (Positive, Negative, or Neutral) of tweets about specific entities.

## Dataset
- Twitter training dataset with columns: ID, Entity, Sentiment, Tweet
- Entities include companies, products, or people mentioned in tweets
- Sentiments are labeled as Positive, Negative, or Neutral
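
As a rough illustration of this layout, the sketch below reads two made-up rows from an in-memory string rather than the real file. The row values and the names `sample_csv` and `sample_df` are hypothetical; only the four column names come from this notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows (not taken from the real dataset), in the same column order.
sample_csv = io.StringIO(
    '1,Nvidia,Positive,"Loving the new drivers"\n'
    '2,Amazon,Negative,"My parcel is late again"\n'
)

# The data has no header row, so the column names are supplied via `names`.
sample_df = pd.read_csv(
    sample_csv, header=None, names=["ID", "Entity", "Sentiment", "Tweet"]
)
print(sample_df)
```

Passing `names=` to `read_csv` has the same effect as assigning `df.columns` after loading; either form works for a headerless file.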

## Methodology
1. Data preprocessing and cleaning
2. Text vectorization using Bag-of-Words
3. Naive Bayes classification
4. Model evaluation and prediction
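
The vectorization and classification steps above can also be chained in a scikit-learn `Pipeline`. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: the toy texts, labels, and the name `toy_model` are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy examples in the "entity : tweet" format.
train_texts = [
    "nvidia : love the new gpu",
    "nvidia : great drivers love it",
    "amazon : terrible delivery hate it",
    "amazon : hate the awful support",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

# CountVectorizer builds the Bag-of-Words features; MultinomialNB classifies them.
# Inside a pipeline, the vocabulary is learned only from the texts passed to fit().
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(train_texts, train_labels)

# "google" was never seen in training, so only the remaining words carry evidence.
preds = toy_model.predict(["google : love great", "google : hate awful"])
print(list(preds))
```

One reason to consider a pipeline is that, when it is fitted on the training split only, the vocabulary is never built from test tweets.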

In [None]:
# ==============================
#  Entity-Level Sentiment Analysis using Naive Bayes (BoW)
# ==============================

import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer       # To convert text into numeric vectors (BoW)
from sklearn.preprocessing import LabelEncoder                    # To convert text labels into numbers
from sklearn.model_selection import train_test_split              # To split data into train/test sets
from sklearn.naive_bayes import MultinomialNB                     # Naive Bayes classifier for text
from sklearn.metrics import classification_report                 # To evaluate model performance

# =========================================
# 1. 📥 Load and prepare the dataset
# =========================================
# The CSV file has no header row, so we assign the column names ourselves
df = pd.read_csv("dataset/twitter_training.csv", header=None)
df.columns = ["ID", "Entity", "Sentiment", "Tweet"]

# =========================================
# 2. Clean and combine 'Entity' with 'Tweet'
# =========================================
# Define a function to clean tweet text
def clean_tweet(text):
    if not isinstance(text, str):
        return text  # Leave non-strings (like NaNs) untouched
    text = re.sub(r'@\w+', '', text)                             # Remove @mentions like @user123
    text = re.sub(r'http\S+|www\.\S+|pic\.twitter\S+', '', text) # Remove URLs and Twitter image links (dots escaped so they match literally)
    text = re.sub(r'\s+', ' ', text).strip()                     # Remove extra spaces
    return text

# Combine the Entity name and Tweet for training (e.g., "Apple : I love their products")
combined = (df["Entity"] + " : " + df["Tweet"]).apply(clean_tweet).dropna()

# Also keep the matching Sentiment labels (ensures same indexes)
labels = df["Sentiment"].loc[combined.index]

# =========================================
# 3. Convert text to numerical features using Bag-of-Words
# =========================================
vectorizer = CountVectorizer()                     # Initialize BoW converter
X_bow = vectorizer.fit_transform(combined)         # Convert text to vector format (sparse matrix)

# =========================================
# 4. 🏷️ Encode string labels (Positive, Negative, Neutral) as numbers
# =========================================
le = LabelEncoder()                                # Initialize label encoder
y = le.fit_transform(labels)                       # Converts: ['Positive', 'Negative', ...] → [2, 0, ...]

# Print label classes for reference
# print(le.classes_)   → ['Negative' 'Neutral' 'Positive']

# =========================================
# 5. Split into training and testing sets
# =========================================
# We use 80% data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_bow, y, test_size=0.2, random_state=42
)

# =========================================
# 6. Train Naive Bayes classifier
# =========================================
model = MultinomialNB()            # Suitable for text classification (word counts)
model.fit(X_train, y_train)        # Train the model using BoW input and encoded labels

# =========================================
# 7. Evaluate model performance
# =========================================
y_pred = model.predict(X_test)     # Predict sentiment for the test set

# Show precision, recall, f1-score, and accuracy
print(classification_report(y_test, y_pred, target_names=le.classes_))

# =========================================
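# 8. Predict sentiment from an entity name alone (no tweet text)
# =========================================
# Note: an entity whose tokens never appeared in training yields an all-zero
# vector, so the prediction then falls back to the class priors.
entity_name = "nvidia"                                  # Example entity input (CountVectorizer lowercases text by default)
test_vec = vectorizer.transform([entity_name + " :"])   # Format matches training input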
pred = model.predict(test_vec)                          # Predict sentiment
print(f"Sentiment for '{entity_name}': {le.inverse_transform(pred)[0]}")

## Results and Analysis

The model prepends the entity name to each tweet before vectorization, so the entity's tokens become Bag-of-Words features alongside the tweet's words. The classifier can therefore pick up how often each entity co-occurs with each sentiment, which a model trained on tweet text alone would not see.
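
To make the effect of the entity prefix concrete, the sketch below vectorizes the example string from the code comments with a default `CountVectorizer`. The names `demo_vectorizer`, `tokens`, and `entity_only` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_vectorizer = CountVectorizer()  # defaults: lowercase text, keep tokens of 2+ word characters
demo_vectorizer.fit(["Apple : I love their products"])

# The " : " separator and the one-letter word "I" are dropped by the default tokenizer.
tokens = sorted(demo_vectorizer.vocabulary_)
print(tokens)  # ['apple', 'love', 'products', 'their']

# An entity-only query activates just the entity's column.
entity_only = demo_vectorizer.transform(["apple :"]).toarray().tolist()
print(entity_only)  # [[1, 0, 0, 0]]
```

With default settings the entity name is simply one more token, so a query containing only the entity is scored from that token's per-class counts plus the class priors.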

### Key Features:
- **Text Preprocessing**: Removes mentions, URLs, and extra whitespace
- **Entity-Tweet Combination**: Concatenates entity and tweet for context
- **Bag-of-Words Vectorization**: Converts text to numerical features
- **Naive Bayes Classification**: Probabilistic approach suitable for text classification
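
The preprocessing step can be checked in isolation. The sketch below is a standalone variant of the cleaning function, renamed `clean_tweet_demo` to mark it as illustrative, with the dots in the URL patterns escaped so they match literally. The sample strings are made up.

```python
import re


def clean_tweet_demo(text):
    """Illustrative copy of the cleaning step; non-strings pass through unchanged."""
    if not isinstance(text, str):
        return text
    text = re.sub(r'@\w+', '', text)                              # drop @mentions
    text = re.sub(r'http\S+|www\.\S+|pic\.twitter\S+', '', text)  # drop URLs and image links
    text = re.sub(r'\s+', ' ', text).strip()                      # collapse whitespace
    return text


raw = "@user123 check this out https://t.co/abc pic.twitter.com/xyz   so   good"
cleaned = clean_tweet_demo(raw)
print(cleaned)  # check this out so good

# With an unescaped dot, `www.` would also match ordinary words starting with "www".
untouched = clean_tweet_demo("wwwhat a day")
print(untouched)  # wwwhat a day
```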

### Model Performance:
The classification report lists precision, recall, and F1-score for each sentiment class, together with overall accuracy and the macro and weighted averages. Comparing the per-class rows shows whether the model performs unevenly across sentiment categories.
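
A confusion matrix can complement the report by showing which classes are mistaken for which. The sketch below uses hand-written toy labels, not results from this notebook; `y_true_demo`, `y_pred_demo`, and the other names are hypothetical.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for six tweets (not real model output).
y_true_demo = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
y_pred_demo = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Neutral"]
class_names = ["Negative", "Neutral", "Positive"]

# output_dict=True returns the report as a nested dict instead of a formatted string.
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
for name in class_names:
    print(name, "recall:", round(report[name]["recall"], 2))

# Rows are true classes, columns are predicted classes, both in class_names order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
print(cm)
```

In this toy case, the off-diagonal entries show one Negative tweet predicted as Neutral and one Neutral tweet predicted as Positive.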