<a href="https://colab.research.google.com/github/kundanpk/emotional-test/blob/main/GoEmotions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis Project

This project performs sentiment analysis on textual data using a multinomial Naive Bayes classifier. The pipeline loads the data, preprocesses the text with NLTK, vectorizes it with TF-IDF, trains the model, and evaluates it on a held-out test split.

## Table of Contents
- [Prerequisites](#prerequisites)
- [Data Files](#data-files)
- [Installation](#installation)
- [Running the Project](#running-the-project)
- [Evaluation](#evaluation)
- [License](#license)

## Prerequisites
Before running the project, ensure you have the following installed:
- Python 3.x
- Pandas
- NLTK
- scikit-learn

You can install the necessary Python packages using pip:
```
pip install pandas nltk scikit-learn
```

## Data Files
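Place the following files where the notebook can read them. The code below opens them by bare filename, that is, from the working directory; if you keep them in `/mnt/data/` instead, prefix the paths accordingly: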
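1. `train.tsv`: A TSV file with no header row containing the training data. The loading code expects a text column and an emotion column; if your copy carries additional columns (for example a comment ID), list them in `names` as well.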
2. `sentiment_mapping.json`: A JSON file containing the mapping of emotions.
3. `emotions.txt`: A text file containing the list of emotions.
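
Because the loader assigns column names by position, it can help to confirm how many tab-separated fields each row has before loading. The following is a minimal sketch using only the standard library; `count_tsv_columns` and the sample rows are hypothetical and not part of the project.

```python
import csv
import io
from collections import Counter


def count_tsv_columns(handle, max_rows=1000):
    """Return a Counter mapping field count -> number of rows with that count."""
    # QUOTE_NONE treats quote characters as ordinary text, since free-form
    # comments often contain unbalanced quotes.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row_number, row in enumerate(reader):
        if row_number >= max_rows:
            break
        counts[len(row)] += 1
    return counts


# Hypothetical sample standing in for the first lines of a TSV file.
sample = io.StringIO("I love this\t17\tabc123\nSo sad today\t25\tdef456\n")
column_counts = count_tsv_columns(sample)
print(dict(column_counts))
```

To run the same check on the real file, pass an open handle instead of the sample, for example `open('train.tsv', newline='', encoding='utf-8')`. A field count that differs from the number of entries in `names` is a sign that the columns will be misaligned.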

## Installation
1. Clone the repository or download the project files.
2. Ensure the data files are in the directory the code reads from (see [Data Files](#data-files)).
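3. Download the NLTK data used for tokenization and stop-word removal (recent NLTK releases may also need the `punkt_tab` resource for `word_tokenize`):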
    ```python
    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
    ```

## Running the Project
To run the sentiment analysis, execute the following notebook cells in order:

In [None]:
import pandas as pd
import json
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
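# Load data. The recorded output below lists comment-ID-like labels (e.g. ee103bs), which
# suggests a third column in train.tsv; with only two names, pandas can shift the columns.
# 'comment_id' is an assumed name for that column.
data_df = pd.read_csv('train.tsv', sep='\t', header=None, names=['text', 'emotion', 'comment_id'], dtype=str)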

In [None]:
# Load sentiment mapping and emotions
with open('sentiment_mapping.json', 'r') as f:
    sentiment_mapping = json.load(f)
with open('emotions.txt', 'r') as f:
    emotions = f.read().splitlines()
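
The two files loaded above are not used again in the remaining cells. One plausible use is to translate raw label IDs into emotion names and coarser sentiment groups. The sketch below rests on several assumptions that should be checked against the actual files: that `emotions.txt` lists one emotion per line with the line index acting as the numeric ID, that `sentiment_mapping.json` maps each sentiment group to a list of emotion names, and that a label may hold several comma-separated IDs. The sample data and the `decode_label` helper are hypothetical.

```python
import json

# Hypothetical stand-ins for the contents of emotions.txt and sentiment_mapping.json.
sample_emotions = "admiration\nanger\njoy\nsadness".splitlines()
sample_mapping = json.loads(
    '{"positive": ["admiration", "joy"], "negative": ["anger", "sadness"]}'
)

# Invert the assumed group -> [emotion names] layout into emotion -> group.
emotion_to_sentiment = {
    emotion: group
    for group, members in sample_mapping.items()
    for emotion in members
}


def decode_label(raw_label):
    """Turn a label such as "0,2" into (emotion names, sorted sentiment groups)."""
    names = [sample_emotions[int(i)] for i in raw_label.split(",")]
    groups = sorted({emotion_to_sentiment[name] for name in names})
    return names, groups


decoded = decode_label("0,2")
print(decoded)
```

Collapsing labels to sentiment groups in this way would reduce the number of classes the classifier has to separate, at the cost of losing the finer emotion distinctions.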

In [None]:
# Preprocess data
nltk.download('punkt')
nltk.download('stopwords')

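# Build the stop-word set once; rebuilding the list for every token is very slow.
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text: str) -> str:
    """Lowercase, tokenize, and remove English stop words from text."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return ' '.join(tokens)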

texts = data_df['text'].apply(preprocess_text)
labels = data_df['emotion']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Vectorization
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
features = vectorizer.fit_transform(texts)
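
To make the vectorization step less opaque, here is a library-free sketch of TF-IDF weighting. It uses raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which is the form scikit-learn documents for its default settings. It is a simplification: it handles unigrams only and ignores `max_features`, and `tfidf_rows` is a hypothetical helper, not the library's implementation.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Compute L2-normalised TF-IDF weights per document (unigrams only)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


rows = tfidf_rows(["good movie", "bad movie", "good good plot"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the second document, "bad" appears in fewer documents than "movie", so it receives the larger weight; that is the effect the IDF term is meant to produce.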

In [None]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
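
Note that the cells above fit the vectorizer on all rows before splitting, so vocabulary and IDF statistics from the test rows feed into the training features. One way to avoid this is to split the raw texts first and derive the vocabulary from the training portion only. The sketch below shows that ordering without any libraries; `split_rows` and `build_vocabulary` are hypothetical helpers and the rows are made-up examples. With scikit-learn, the usual equivalent is to call `train_test_split` on the texts, then `fit_transform` on the training part and `transform` on the test part.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    return shuffled[n_test:], shuffled[:n_test]


def build_vocabulary(texts):
    """Map each distinct whitespace-separated token to a column index."""
    vocab = sorted({token for text in texts for token in text.split()})
    return {token: index for index, token in enumerate(vocab)}


rows = [("good movie", "joy"), ("bad plot", "anger"), ("great cast", "joy"),
        ("awful pacing", "anger"), ("lovely score", "joy")]
train_rows, test_rows = split_rows(rows)

# The vocabulary is derived from the training texts only.
vocabulary = build_vocabulary(text for text, _ in train_rows)

# Test tokens absent from the training vocabulary are dropped.
test_tokens = [[t for t in text.split() if t in vocabulary] for text, _ in test_rows]
print(len(train_rows), len(test_rows), test_tokens)
```

Every sample text here uses unique words, so the single test row ends up with no known tokens; on real data most test tokens would already be in the training vocabulary.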


In [None]:
# Train model
model = MultinomialNB()
model.fit(X_train, y_train)
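
For readers unfamiliar with the classifier, the following is a compact sketch of multinomial Naive Bayes with additive (Laplace) smoothing, where `alpha=1.0` mirrors the default of `MultinomialNB`. It works on raw token counts rather than TF-IDF weights, and `train_nb` and `predict_nb` are hypothetical helpers written for illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenised documents."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    model = {}
    for label, n in class_counts.items():
        # Smoothed denominator: class token total plus alpha per vocabulary term.
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model[label] = (
            math.log(n / len(labels)),  # log prior
            {t: math.log((term_counts[label][t] + alpha) / total) for t in vocab},
        )
    return model


def predict_nb(model, tokens):
    """Return the label with the highest log posterior; unseen tokens are skipped."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[t] for t in tokens if t in log_likelihood)
    return max(model, key=score)


docs = [["good", "movie"], ["great", "movie"], ["bad", "movie"], ["awful", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
nb_model = train_nb(docs, labels)
print(predict_nb(nb_model, ["great", "movie"]), predict_nb(nb_model, ["awful", "movie"]))
```

The smoothing term keeps a word that never occurred with a class from driving that class's probability to zero.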

In [None]:
# Predict on test set
predictions = model.predict(X_test)

In [None]:
# Evaluate model
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")
print("Classification Report:")
print(classification_report(y_test, predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))

Accuracy: 0.000
Classification Report:


  _warn_prf(average, modifier, msg_start, len(result))

Streaming output truncated to the last 5000 lines.
     ee103bs       0.00      0.00      0.00       1.0
     ee10724       0.00      0.00      0.00       1.0
     ee10us6       0.00      0.00      0.00       1.0
     ee122e4       0.00      0.00      0.00       1.0
     ee13zpk       0.00      0.00      0.00       1.0
     ee141l5       0.00      0.00      0.00       1.0
     ee1456h       0.00      0.00      0.00       1.0
     ee147f5       0.00      0.00      0.00       1.0
     ee14has       0.00      0.00      0.00       1.0
     ee14log       0.00      0.00      0.00       1.0
     ee15t18       0.00      0.00      0.00       1.0
     ee1630u       0.00      0.00      0.00       1.0
     ee169pr       0.00      0.00      0.00       1.0
     ee16es8       0.00      0.00      0.00       1.0
     ee16q3k       0.00      0.00      0.00       1.0
     ee16twd       0.00      0.00      0.00       1.0
     ee179t6       0.00      0.00      0.00       1.0
     ee18h9f     

## Evaluation
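The final cell evaluates the model on the held-out 20% split and prints the accuracy, a per-class classification report, and the confusion matrix. In the recorded run above, the accuracy is 0.000 and the report lists labels that look like comment IDs (for example `ee103bs`), each with a support of 1. That pattern indicates the `emotion` column held per-row identifiers rather than emotion labels during that run, so check the columns of `train.tsv` against the `names` passed to `pd.read_csv` before interpreting the numbers.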
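
A quick sanity check on the label column can catch this kind of problem before training. The sketch below counts distinct labels and the share that occur only once, and also shows how accuracy is computed. `label_summary` and `accuracy` are hypothetical helpers; the "healthy" labels are made up, and the ID-like labels are taken from the recorded report above.

```python
from collections import Counter


def label_summary(labels):
    """Return (number of distinct labels, share of labels that occur only once)."""
    counts = Counter(labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons / len(counts)


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# One plausible label column, and one that actually holds row identifiers.
healthy = ["joy", "anger", "joy", "sadness", "joy", "anger"]
id_like = ["ee103bs", "ee10724", "ee10us6", "ee122e4", "ee13zpk", "ee141l5"]

print(label_summary(healthy))
print(label_summary(id_like))
print(accuracy(healthy, ["joy", "joy", "joy", "sadness", "joy", "anger"]))
```

When nearly every label is unique, as in the second case, no test label can have appeared in training, so an accuracy of zero is the expected outcome rather than a property of the model.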
