# Twitter Sentiment Analysis Using Jupyter Notebook

This project analyzes real Twitter data to understand public sentiment. The dataset contains tweets labeled as positive, neutral, or negative. The goal is to explore sentiment distribution, analyze how tweet length varies across sentiment categories, and compute a custom sentiment score inspired by lexicon-based methods such as VADER.

## Research Questions

1. What is the distribution of positive, neutral, and negative tweets?
2. Which sentiment dominates public opinion?
3. Do positive, neutral, and negative tweets differ in length?
4. Can a simple custom sentiment score separate positive and negative tweets?

In [None]:
# Import required libraries
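# pandas, NumPy, and matplotlib are third-party packages, not part of the Python standard library.
# Install them in your Python 3.8 environment before running this notebook.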
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Loading the Dataset

The dataset is stored locally as a CSV file and is loaded using pandas, so no internet connection is required. Make sure the file **Twitter_Data.csv** is in the same directory as this notebook.

In [None]:
# Load the local CSV file
df = pd.read_csv("Twitter_Data.csv")

# Display the first few rows
df.head()

## Understanding the Data

We inspect the number of rows, columns, and data types to understand the structure of the dataset.

In [None]:
# Show basic information about the dataset
df.info()

## Data Cleaning

We remove rows with missing values and cast the sentiment labels to integers.

In [None]:
# Drop any rows with missing values
df = df.dropna()

# Ensure the category column is of integer type
df["category"] = df["category"].astype(int)

# Check for remaining missing values
df.isnull().sum()
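
The cast to `int` converts the labels but does not by itself confirm that every value is one of -1, 0, or 1. A minimal sketch of such a check follows, using a small made-up frame (`toy`) in place of the real data; the rows and the stray label `2` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for Twitter_Data.csv
toy = pd.DataFrame({
    "clean_text": ["good day", None, "bad news", "just a tweet", "odd label"],
    "category": [1.0, 0.0, -1.0, 0.0, 2.0],
})

# Same cleaning steps as in the cell above
toy = toy.dropna()
toy["category"] = toy["category"].astype(int)

# Keep only rows whose label is one of the three expected values
valid_labels = [-1, 0, 1]
is_valid = toy["category"].isin(valid_labels)
print("Rows with unexpected labels:", (~is_valid).sum())
toy = toy[is_valid]
```

Filtering before the label mapping step avoids rows whose text label would otherwise come out as `NaN`.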

## Sentiment Label Mapping

The dataset uses numerical sentiment values:
- -1 = Negative
- 0 = Neutral
- 1 = Positive

We convert these numeric values into readable text labels.

In [None]:
# Map numeric sentiment labels to text labels
label_map = {-1: "Negative", 0: "Neutral", 1: "Positive"}
df["Sentiment"] = df["category"].map(label_map)

# Show the updated dataframe
df.head()

## Custom Sentiment Score (Lexicon-Based)

Inspired by lexicon-based methods such as VADER (available through NLTK), we create our own small sentiment lexicon. We then compute a simple sentiment score for each tweet in the `clean_text` column based on the count of positive and negative words.

The idea:
- Define a list of positive words and a list of negative words.
- For each tweet, count how many positive and negative words it contains.
- Compute a sentiment score: `score = positive_count - negative_count`.
- Higher scores indicate more positive language; lower scores indicate more negative language.

In [None]:
# Define small custom sentiment lexicons (all lowercase)
positive_words = {
    "good", "great", "love", "like", "support", "win", "victory",
    "happy", "excellent", "wonderful", "amazing", "awesome", "progress",
    "strong", "leader", "truth", "hope", "peace"
}

negative_words = {
    "bad", "worst", "hate", "corrupt", "lie", "liar", "fake",
    "failure", "problem", "issues", "weak", "stupid", "angry",
    "crime", "violence", "scam", "fraud", "dirty"
}

def compute_sentiment_score(text):
    """Return a simple sentiment score based on custom word lists.
    Score = (#positive words) - (#negative words).
    """
    if not isinstance(text, str):
        return 0
    words = text.lower().split()
    pos_count = 0
    neg_count = 0
    for w in words:
        # Strip basic punctuation from the ends of words
        w = w.strip(".,!?;:\"'()[]{}")
        if w in positive_words:
            pos_count += 1
        elif w in negative_words:
            neg_count += 1
    return pos_count - neg_count

# Apply the custom score function to each tweet
df["score_custom"] = df["clean_text"].apply(compute_sentiment_score)

# Preview the new column
df[["clean_text", "Sentiment", "score_custom"]].head()
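
To make the scoring rule concrete, the sketch below applies the same counting logic to four made-up tweets. The word lists are shortened subsets of the lexicons above, and the names `toy_positive`, `toy_negative`, and `toy_score` are hypothetical helpers used only for this illustration.

```python
# Shortened, illustrative word lists (subsets of the lexicons above)
toy_positive = {"good", "great", "love", "hope"}
toy_negative = {"bad", "hate", "fake", "fraud"}

def toy_score(text):
    """Score = (#positive words) - (#negative words), as in the cell above."""
    score = 0
    for w in text.lower().split():
        # Strip basic punctuation from the ends of words
        w = w.strip(".,!?;:\"'()[]{}")
        if w in toy_positive:
            score += 1
        elif w in toy_negative:
            score -= 1
    return score

examples = [
    "Great speech, I love it!",      # two positive words
    "Fake promises and bad policy",  # two negative words
    "Good idea but bad timing",      # one of each, so they cancel out
    "He loved the rally",            # "loved" is not in the list, so it is missed
]
toy_scores = [toy_score(t) for t in examples]
print(toy_scores)  # [2, -2, 0, 0]
```

The last example shows a limitation of exact-match lookup: inflected forms such as "loved" are not counted unless they are added to the lexicon or the words are stemmed first.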

## Using the Custom Score to Classify Sentiment

We now convert the numeric score into a simple rule-based prediction:

- If `score_custom > 0` → Predicted sentiment = Positive
- If `score_custom < 0` → Predicted sentiment = Negative
- If `score_custom == 0` → Predicted sentiment = Neutral

In [None]:
def score_to_label(score):
    if score > 0:
        return "Positive"
    elif score < 0:
        return "Negative"
    else:
        return "Neutral"

df["Sentiment_Pred_Score"] = df["score_custom"].apply(score_to_label)

# Show a few rows comparing true vs predicted sentiment
df[["clean_text", "Sentiment", "Sentiment_Pred_Score", "score_custom"]].head()
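
The same three-way rule can also be written without calling a Python function once per row, which may help on larger datasets. This is a sketch under the assumption that the score column is numeric; the frame `toy` and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, for illustration only
toy = pd.DataFrame({"score_custom": [2, -1, 0, 5, -3]})

# np.sign maps positive scores to 1, negative scores to -1, and zero to 0
sign = np.sign(toy["score_custom"])
toy["Sentiment_Pred_Score"] = sign.map({1: "Positive", -1: "Negative", 0: "Neutral"})
print(toy)
```

Both versions produce the same labels; the function-based cell above is easier to read, while this one keeps the work inside pandas and NumPy.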

## Evaluating the Custom Sentiment Score

We compare the original labeled sentiment with the sentiment predicted by our custom score. This is a simple way to see how well our rule-based method aligns with the labeled data.

In [None]:
# Create a confusion-style table
comparison_table = pd.crosstab(df["Sentiment"], df["Sentiment_Pred_Score"])
comparison_table

In [None]:
# Calculate simple accuracy of the custom score-based prediction
accuracy = (df["Sentiment"] == df["Sentiment_Pred_Score"]).mean()
accuracy
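
Overall accuracy can hide differences between classes, for example when many tweets score 0 and are therefore predicted as Neutral. One way to look closer is to normalize the comparison table by row, which gives the share of each true class that received each predicted label. The sketch below uses made-up labels, not results from the dataset.

```python
import pandas as pd

# Hypothetical true and predicted labels, for illustration only
toy = pd.DataFrame({
    "Sentiment":            ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"],
    "Sentiment_Pred_Score": ["Positive", "Neutral",  "Negative", "Neutral",  "Neutral", "Neutral"],
})

# Each row sums to 1: the fraction of a true class assigned to each prediction
row_shares = pd.crosstab(toy["Sentiment"], toy["Sentiment_Pred_Score"], normalize="index")

# The diagonal entries are the per-class recall
# (this assumes every true label also appears at least once as a prediction)
recall = pd.Series({label: row_shares.loc[label, label] for label in row_shares.index})
print(row_shares)
print(recall)
```

In this toy example, half of the Positive and Negative tweets fall back to Neutral, which is the pattern to look for when the lexicon is small.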

## Sentiment Distribution (Original Labels)

Here we count how many tweets fall into each sentiment category based on the original labels.

In [None]:
# Count the number of tweets in each sentiment category (original labels)
sentiment_counts = df["Sentiment"].value_counts()
sentiment_counts
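
Raw counts address the first research question. For the second, proportions and the most frequent label are often easier to read. A sketch on made-up labels (`toy_labels` is hypothetical):

```python
import pandas as pd

# Hypothetical labels, for illustration only
toy_labels = pd.Series(["Positive", "Neutral", "Positive", "Negative", "Positive", "Neutral"])

# normalize=True returns each category's share instead of its count
shares = toy_labels.value_counts(normalize=True)

# idxmax returns the label with the largest share
dominant = shares.idxmax()
print(shares.round(3))
print("Most frequent sentiment:", dominant)
```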

In [None]:
# Visualize the sentiment distribution using a bar chart
plt.figure()
sentiment_counts.plot(kind="bar")
plt.title("Distribution of Twitter Sentiments (Original Labels)")
plt.xlabel("Sentiment")
plt.ylabel("Number of Tweets")
plt.show()

## Tweet Length Analysis

We calculate the number of characters in each tweet to compare how tweet length varies across sentiment categories.

In [None]:
# Create a new column for tweet length (number of characters)
df["tweet_length"] = df["clean_text"].astype(str).apply(len)

# Show a preview of sentiment and tweet length
df[["Sentiment", "tweet_length"]].head()

## Average Tweet Length by Sentiment

We compute the average tweet length for each sentiment category to see if there are noticeable differences.

In [None]:
# Calculate average tweet length per sentiment
avg_length = df.groupby("Sentiment")["tweet_length"].mean()
avg_length
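
A mean alone can be skewed by a few very long tweets. Reporting the median and the group size alongside it gives a fuller picture. The sketch below uses made-up tweets, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical tweets, for illustration only
toy = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Negative", "Negative", "Neutral"],
    "clean_text": ["love it", "great win for the team", "bad", "worst scam ever", "meeting at noon"],
})
toy["tweet_length"] = toy["clean_text"].str.len()

# One row per sentiment, with group size, mean length, and median length
length_summary = toy.groupby("Sentiment")["tweet_length"].agg(["count", "mean", "median"])
print(length_summary)
```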

In [None]:
# Visualize the average tweet length per sentiment
plt.figure()
avg_length.plot(kind="bar")
plt.title("Average Tweet Length by Sentiment")
plt.xlabel("Sentiment")
plt.ylabel("Average Characters per Tweet")
plt.show()

## Distribution of Custom Sentiment Scores

We now visualize the distribution of our custom sentiment scores across all tweets.

In [None]:
plt.figure()
df["score_custom"].hist(bins=21)
plt.title("Distribution of Custom Sentiment Scores")
plt.xlabel("Custom Sentiment Score")
plt.ylabel("Number of Tweets")
plt.show()
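
Because the score is always an integer, a fixed `bins=21` may not line up with whole-number scores, depending on the observed range. One option is to place bin edges at half-integers so that each bar covers exactly one score value. A sketch with made-up scores (`toy_scores` is hypothetical):

```python
import numpy as np

# Hypothetical integer scores, for illustration only
toy_scores = np.array([0, 0, 1, -1, 2, 0, -2, 1, 0])

# Edges at ..., -0.5, 0.5, 1.5, ... so that each bin holds one integer score
edges = np.arange(toy_scores.min() - 0.5, toy_scores.max() + 1.5, 1.0)
counts, _ = np.histogram(toy_scores, bins=edges)
print(edges)
print(counts)
```

The same `edges` array can be passed as `bins=edges` to the `hist` call above in place of `bins=21`.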

## Conclusion

The analysis shows how sentiment is distributed in the Twitter dataset and how tweet length varies across sentiment categories. In addition, a custom lexicon-based sentiment score was implemented to approximate sentiment without relying on an external sentiment analysis library.

Key observations:
- The original labels reveal the overall balance of positive, neutral, and negative tweets.
- Average tweet length can differ across sentiment categories.
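- The simple custom sentiment score gives a rough separation of positive and negative tweets; the comparison table and accuracy value show how closely it matches the original labels. Tweets that contain no lexicon words score 0 and are predicted as Neutral, which limits how well the method can do. Even so, it shows how basic NLP techniques can be used to engineer features for sentiment analysis with only a hand-built word list.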

This project demonstrates how to work with real-world text data, clean it, engineer additional features, and generate visual insights using Python in a Jupyter notebook.