
# 🧠 Task 3 – Level 3: Sentiment Classification using TF-IDF & Logistic Regression (Local Fix)

This notebook fulfills the full requirements of Task 3 – Level 3 by:

- Preprocessing social media text (tokenization, stopword removal, lemmatization)
- Converting text into numerical representation using **TF-IDF**
- Training a classification model using **Logistic Regression**
- Evaluating model performance using **Precision**, **Recall**, and **F1-score**

> 🔧 This version includes a lemmatization fix using `pos='v'` to avoid errors with `omw-1.4`.


## 📚 Import Libraries

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import re

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.data.path.append('/Users/lpn/nltk_data')  # custom path if needed


[nltk_data] Downloading package stopwords to /Users/lpn/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/lpn/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/lpn/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 📥 Load and Inspect Dataset

In [2]:

df = pd.read_csv("3) Sentiment dataset.csv")
df.columns = df.columns.str.strip()
df["Sentiment"] = df["Sentiment"].str.strip().str.lower()
df = df[["Text", "Sentiment"]].dropna()
df.head()


Unnamed: 0,Text,Sentiment
0,Enjoying a beautiful day at the park! ...,positive
1,Traffic was terrible this morning. ...,negative
2,Just finished an amazing workout! 💪 ...,positive
3,Excited about the upcoming weekend getaway! ...,positive
4,Trying out a new recipe for dinner tonight. ...,neutral


## 🎯 Encode Sentiment Labels

In [3]:

label_map = {'positive': 2, 'neutral': 1, 'negative': 0}
df["label"] = df["Sentiment"].map(label_map)
df["label"].value_counts()


label
2.0    45
1.0    18
0.0     4
Name: count, dtype: int64

## 🧼 Text Preprocessing (Lemmatization with pos='v')

In [5]:
import nltk
nltk.download('wordnet', download_dir='/Users/lpn/nltk_data')

[nltk_data] Downloading package wordnet to /Users/lpn/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"[^a-zA-Z ]", "", text)
    tokens = nltk.word_tokenize(text)
    tokens = [
        lemmatizer.lemmatize(w, pos='v') 
        for w in tokens 
        if w not in stop_words and w not in string.punctuation
    ]
    return " ".join(tokens)

df["clean_text"] = df["Text"].apply(preprocess)
df[["Text", "clean_text"]].head()


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/Users/lpn/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.13/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.13/share/nltk_data'
    - '/Library/Frameworks/Python.framework/Versions/3.13/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/lpn/nltk_data'
**********************************************************************


## 🔠 Convert Text to TF-IDF Features

In [None]:

tfidf = TfidfVectorizer(max_features=1000)
X = tfidf.fit_transform(df["clean_text"])
y = df["label"]


## 🧪 Train-Test Split

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## 🤖 Train Logistic Regression Classifier

In [None]:

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)


## 📊 Model Evaluation

In [None]:

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Negative', 'Neutral', 'Positive']))


In [None]:

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Neg", "Neu", "Pos"], yticklabels=["Neg", "Neu", "Pos"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
