<a href="https://colab.research.google.com/github/prabhakar-gh/Twitter-Sentiment-Analysis/blob/main/SentimentAnalysis_Twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Twitter Data using DistilBERT

This project demonstrates the application of deep learning for sentiment analysis using the DistilBERT model. It focuses on classifying tweets into positive and negative sentiment categories. The project leverages the Hugging Face Transformers library for model implementation and the Kaggle API for accessing a large dataset of pre-processed tweets.

**Key Features:**

* **Data:** Utilizes a substantial dataset from Kaggle containing 1.6 million tweets labeled for sentiment.
* **Model:** Employs the DistilBERT model, a smaller and faster variant of BERT, for efficient sentiment classification.
* **Training:** Fine-tunes DistilBERT using the training dataset to adapt it specifically for sentiment analysis.
* **Evaluation:** Assesses the model's performance using the evaluation loss, the F1-score, and a confusion matrix.
* **Deployment:** Provides a simple UI using Gradio for real-time sentiment prediction.

This project was developed as a personal exploration of natural language processing and deep learning techniques for sentiment analysis.

# Setup
Install the required libraries.


In [None]:
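# scikit-learn is the PyPI package name; the `sklearn` alias on PyPI is deprecated
!pip install transformers datasets torch pandas scikit-learn kagglehub matplotlib seaborn wordcloud
!pip install gradio --upgrade  # Optional, for UI demo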



# Kaggle Integration
This section sets up access to Kaggle datasets within Google Colab.
- You'll be prompted to upload your `kaggle.json` file, containing your Kaggle API credentials.
- The code creates the necessary Kaggle directory and moves the `kaggle.json` file into it.
- File permissions are set for security.
- Kaggle access is verified by listing available datasets.
---



In [None]:
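from google.colab import files

# Opens a file picker; select your kaggle.json
files.upload()
# Move the token to the location the Kaggle CLI reads from
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
# Restrict the token to owner read/write
!chmod 600 ~/.kaggle/kaggle.json
# Verify that the credentials work
!kaggle datasets list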
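
The shell steps above can also be expressed in plain Python, which is handy outside a notebook. This is a minimal sketch, assuming the token has already been uploaded to the working directory; `install_kaggle_credentials` is a hypothetical helper, not part of the Kaggle API.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", kaggle_dir=None):
    """Copy the API token into the Kaggle config directory with owner-only access.

    Hypothetical helper for illustration; not part of the Kaggle API.
    """
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; upload kaggle.json first")

    # Default location used in the notebook: ~/.kaggle/kaggle.json
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`

    dest = kaggle_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)  # same effect as `chmod 600`
    return dest
```

Unlike `mv`, this copies the file, so the original upload stays in place and may need to be deleted separately.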

# Downloading the Dataset
This section downloads the sentiment analysis dataset from Kaggle using the `kagglehub` library.
- It stores the path to the downloaded dataset files.
- The full path to the dataset file is defined for later use.

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ferno2/training1600000processednoemoticoncsv")

print("Path to dataset files:", path)

In [None]:
import os

DATAFILE = os.path.join(path, "training.1600000.processed.noemoticon.csv")
print(DATAFILE)
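
If the download layout ever differs from what the notebook expects, the failure only shows up later in `pd.read_csv`. The sketch below checks the path up front and lists what was actually downloaded; `resolve_datafile` is a hypothetical helper name, and the default filename is the one used in this notebook.

```python
from pathlib import Path


def resolve_datafile(dataset_dir, filename="training.1600000.processed.noemoticon.csv"):
    """Return the CSV path inside the dataset directory, or fail with a listing."""
    dataset_dir = Path(dataset_dir)
    candidate = dataset_dir / filename
    if not candidate.is_file():
        found = sorted(p.name for p in dataset_dir.iterdir())
        raise FileNotFoundError(f"{filename} not in {dataset_dir}; found: {found}")
    return candidate
```

With the variables from the cells above, `DATAFILE = resolve_datafile(path)` would be a drop-in alternative.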

# Data Loading and Preprocessing
This section loads the dataset into a pandas DataFrame and preprocesses it:
- It reads the CSV file, specifying the encoding and column names.
- Target labels are mapped to 0 (negative) and 1 (positive).
- A smaller sample of the dataset is taken for faster training (5k positive, 5k negative).
- Only the relevant columns ('text' and 'target') are kept.
- The head of the DataFrame is printed for a preview.

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv(DATAFILE, encoding='latin-1',
                 names=['target', 'id', 'date', 'flag', 'user', 'text'])

# Map target: 0 (negative) -> 0, 4 (positive) -> 1
df['target'] = df['target'].map({0: 0, 4: 1})

# Sample 10k rows (5k positive, 5k negative)
df = df.groupby('target').sample(n=5000, random_state=42).reset_index(drop=True)

# Keep only relevant columns
df = df[['text', 'target']]
print(df.head())
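
The label mapping and balanced sampling can be checked on a toy frame before running them on 1.6 million rows. This is a sketch with made-up rows; the assertion guards against a label outside `{0, 4}`, which `map` would silently turn into `NaN`.

```python
import pandas as pd

# Toy frame standing in for the raw CSV: the source labels are 0 and 4.
toy = pd.DataFrame({
    "target": [0, 0, 0, 4, 4, 4, 4],
    "text": ["bad", "awful", "meh", "good", "great", "nice", "love it"],
})

toy["target"] = toy["target"].map({0: 0, 4: 1})

# Any label missing from the mapping becomes NaN, so fail early if that happens.
assert toy["target"].notna().all(), "unexpected label value in 'target'"

# Draw the same number of rows per class (2 here, 5000 in the notebook).
balanced = toy.groupby("target").sample(n=2, random_state=42).reset_index(drop=True)
counts = balanced["target"].value_counts().to_dict()
print(counts)
```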

# Distribution of Sentiment
This creates a bar chart showing the number of positive and negative tweets in your dataset. The plot is saved as 'sentiment_distribution.png'.


In [None]:
# Visualize the class distribution
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame
sns.countplot(x='target', data=df)
plt.title('Distribution of Sentiment')
plt.xlabel('Sentiment (0: Negative, 1: Positive)')
plt.ylabel('Count')
plt.savefig('sentiment_distribution.png')
plt.show()

# Word Cloud
This generates word clouds for positive and negative tweets, saved as 'positive_wordcloud.png' and 'negative_wordcloud.png' respectively.


In [None]:
from wordcloud import WordCloud, STOPWORDS

# Assuming 'df' is your DataFrame
positive_words = ' '.join(df[df['target'] == 1]['text'].astype(str))
negative_words = ' '.join(df[df['target'] == 0]['text'].astype(str))

wordcloud_positive = WordCloud(width=800, height=500, background_color='white', stopwords=STOPWORDS).generate(positive_words)
wordcloud_negative = WordCloud(width=800, height=500, background_color='white', stopwords=STOPWORDS).generate(negative_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.axis("off")
plt.title('Positive Word Cloud')
plt.savefig('positive_wordcloud.png')
plt.show()

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.axis("off")
plt.title('Negative Word Cloud')
plt.savefig('negative_wordcloud.png')
plt.show()

# Data Splitting
This section splits the data into training and testing sets using `train_test_split` from scikit-learn:
- 80% of the data is used for training, and 20% for testing.
- A random state is set for reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
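
The split above is random but not stratified, so the 50/50 class balance of the sample is only approximately preserved in each part. The pure-Python sketch below illustrates what a stratified split does; it is a stand-in for illustration, not scikit-learn's implementation, and `stratified_split` is a hypothetical name.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with the same class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20
```

In the notebook itself, the same effect should be available by passing `stratify=df['target']` to `train_test_split`.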

# Tokenization and Dataset Preparation
This section tokenizes the text data and prepares it for the model:
- The `datasets` library is installed if needed.
- A DistilBERT tokenizer is loaded from Hugging Face Transformers.
- A function is defined to tokenize the text data, padding and truncating as necessary.
- Pandas DataFrames are converted to Hugging Face Datasets.
- The datasets are tokenized using the defined function.
- The 'target' column is renamed to 'labels' and the format is set for PyTorch.

In [None]:
!pip install datasets

In [None]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

# Convert to Hugging Face Dataset
from datasets import Dataset
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenize
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Rename 'target' to 'labels' and format for PyTorch
train_dataset = train_dataset.rename_column("target", "labels")
test_dataset = test_dataset.rename_column("target", "labels")

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
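
To make `padding='max_length'` and `truncation=True` concrete, here is a toy encoder that produces the same two fields the model consumes. It is a deliberately simplified sketch: it splits on whitespace and uses a made-up vocabulary, whereas the real DistilBERT tokenizer uses WordPiece and adds special tokens.

```python
def toy_encode(text, vocab, max_length=8, pad_id=0, unk_id=1):
    """Whitespace-tokenize, map to ids, then truncate or right-pad to max_length."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_length]               # analogous to truncation=True
    attention_mask = [1] * len(ids)      # 1 marks real tokens
    padding = max_length - len(ids)      # analogous to padding='max_length'
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


vocab = {"i": 2, "love": 3, "this": 4, "product": 5}  # hypothetical vocabulary
encoded = toy_encode("I love this product", vocab, max_length=6)
print(encoded)
# {'input_ids': [2, 3, 4, 5, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
```

Every example ends up with exactly `max_length` entries, which is what lets the datasets be stacked into fixed-size PyTorch tensors.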

# Model Training
This section defines and trains the sentiment analysis model:
- A pre-trained DistilBERT model for sequence classification is loaded.
- Training arguments are defined, including output directory, epochs, batch size, etc.
- A `Trainer` instance is created using the model, arguments, and datasets.
- The model is trained using `trainer.train()`.

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,  # 1 epoch to save time
    per_device_train_batch_size=8,  # Small batch size for free tier
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
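    # Recent transformers releases name this argument `eval_strategy`;
    # switch to that spelling if `evaluation_strategy` raises a TypeError.
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to="none",  # Disable wandb integration
)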

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
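
It is worth checking how `warmup_steps=500` relates to the length of the run. With 8,000 training rows (80% of the 10k sample), a batch size of 8, and one epoch, training takes about 1,000 optimizer steps, so half of the run is spent warming up. The sketch below assumes a single device, no gradient accumulation, and a linear warmup followed by linear decay with a peak learning rate of 5e-5; the function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


total = total_training_steps(num_examples=8000, batch_size=8, epochs=1)
print(total)  # 1000
for step in (0, 250, 500, 750, 1000):
    print(step, linear_warmup_lr(step, 5e-5, 500, total))
```

If the peak learning rate should be reached earlier, a smaller `warmup_steps` value (for example 10% of the total) is a common alternative.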

# Model Evaluation
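This section evaluates the trained model's performance using `trainer.evaluate()`:
- The evaluation results are printed. Because no `compute_metrics` function is passed to the `Trainer`, they contain the evaluation loss and runtime statistics only.

# Compute the Model's F1-Score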

In [None]:
eval_results = trainer.evaluate()
print(eval_results)
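
To have `trainer.evaluate()` report accuracy as well, a metrics function can be supplied when the `Trainer` is created. This is a minimal sketch, assuming the function receives a `(logits, labels)` pair of NumPy arrays.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class = highest logit
    return {"accuracy": float(np.mean(preds == np.asarray(labels)))}
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`, after which the results dictionary should include an `eval_accuracy` entry.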

In [None]:
from sklearn.metrics import f1_score

# Get predictions
predictions = trainer.predict(test_dataset)
predicted_labels = predictions.predictions.argmax(-1)  # Get predicted labels

# Calculate F1-score
f1 = f1_score(predictions.label_ids, predicted_labels)  # Compare to true labels

print(f"F1-score: {f1:.4f}")

# Model Saving
This section saves the trained model and tokenizer for later use using `save_pretrained()`:
- The model is saved to the 'sentiment_model' directory.
- The tokenizer is saved to the same directory.

In [None]:
model.save_pretrained('sentiment_model')
tokenizer.save_pretrained('sentiment_model')

# Confusion Matrix
This generates a confusion matrix plot and saves it as 'confusion_matrix.png'.



In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

predictions = trainer.predict(test_dataset)
y_pred = predictions.predictions.argmax(-1)
y_true = predictions.label_ids

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot()
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
plt.show()
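
The confusion matrix also contains everything needed to derive the summary metrics by hand. For binary labels `[0, 1]`, scikit-learn lays it out as `[[tn, fp], [fn, tp]]` (rows are true labels, columns are predictions). The counts below are made up for illustration and are not results from this notebook.

```python
def binary_scores(cm):
    """Accuracy, precision, recall and F1 for the positive class.

    Expects cm in the layout [[tn, fp], [fn, tp]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
scores = binary_scores([[80, 20], [10, 90]])
print({name: round(value, 4) for name, value in scores.items()})
```

Calling `binary_scores(cm.tolist())` on the matrix computed above would give the values for the actual test set.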

# Sentiment Prediction
This section demonstrates how to use the trained model for sentiment prediction:
- A sentiment analysis pipeline is created using the trained model and tokenizer.
- The pipeline is tested with an example text.
- A function is defined to predict sentiment and format the output.
- The prediction function is tested with another example.

In [None]:
from transformers import pipeline

sentiment_classifier = pipeline('sentiment-analysis', model='sentiment_model', tokenizer='sentiment_model')

# Test it
text = "I love this product!"
result = sentiment_classifier(text)
print(result)  # Example shape: [{'label': 'LABEL_1', 'score': 0.95}] (LABEL_1 = positive; the exact score varies)

In [None]:
def predict_sentiment(text):
    result = sentiment_classifier(text)[0]
    label = 'Positive' if result['label'] == 'LABEL_1' else 'Negative'
    return f"{label} (confidence: {result['score']:.2f})"

print(predict_sentiment("I hate this weather"))
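
The confidence value comes from the model's two output logits. For a two-label classifier, the pipeline's `score` should correspond to the softmax probability of the winning class. The sketch below reproduces that step in plain Python with hypothetical logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, names=("Negative", "Positive")):
    """Pick the most probable class and return its name and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]


label, score = label_from_logits([-1.2, 2.3])  # hypothetical logits
print(f"{label} (confidence: {score:.2f})")
# Positive (confidence: 0.97)
```

A score near 0.5 therefore means the two logits were almost equal, i.e. the model was close to undecided.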

# Gradio UI
This section creates a simple UI using Gradio for interactive sentiment prediction:
- Gradio is installed or upgraded if needed.
- A function is defined for Gradio to use for prediction.
- A Gradio interface is created and launched.

In [None]:
!pip install gradio --upgrade

In [None]:
import gradio as gr

def gradio_predict(text):
    return predict_sentiment(text)

interface = gr.Interface(fn=gradio_predict, inputs="text", outputs="text")
interface.launch()
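
A public text box will receive empty and very long inputs, so it can help to guard the prediction function before exposing it. This is a sketch with a hypothetical wrapper, `make_safe_predictor`; the 1,000-character limit is an arbitrary assumption, and the stub predictor only exists so the example runs without the trained model.

```python
def make_safe_predictor(predict_fn, max_chars=1000):
    """Wrap a text -> str predictor so blank or oversized input never reaches the model."""
    def safe_predict(text):
        text = (text or "").strip()
        if not text:
            return "Please enter some text."
        return predict_fn(text[:max_chars])  # crude character-level cap
    return safe_predict


# Stub predictor standing in for predict_sentiment.
demo = make_safe_predictor(lambda t: t.upper(), max_chars=5)
print(demo("   "))           # Please enter some text.
print(demo("hello world"))   # HELLO
```

In the notebook, the wrapper would be used as `gr.Interface(fn=make_safe_predictor(predict_sentiment), inputs="text", outputs="text")`.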

# Model Packaging
This section packages the trained model for sharing or deployment:
- The 'sentiment_model' directory is zipped into an archive.

In [None]:
!zip -r sentiment_model.zip sentiment_model
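
Where the `zip` command is not available, the same archive can be built with the standard library. This is a minimal sketch; `package_model` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def package_model(model_dir="sentiment_model", archive_name=None):
    """Zip model_dir so the archive unpacks into a folder of the same name."""
    model_dir = Path(model_dir).resolve()
    if not model_dir.is_dir():
        raise FileNotFoundError(f"{model_dir} does not exist; save the model first")
    archive_name = archive_name or model_dir.name
    # root_dir/base_dir keep 'sentiment_model/...' as the paths inside the archive.
    return shutil.make_archive(
        str(archive_name), "zip", root_dir=model_dir.parent, base_dir=model_dir.name
    )
```

Calling `package_model()` after the model has been saved should produce `sentiment_model.zip` in the working directory.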