
# Sentiment Analysis Using Logistic Regression

This project performs sentiment analysis on the Sentiment140 dataset. The goal is to classify tweets as either positive or negative based on their content.

## Workflow Overview
1. **Data Extraction**: Downloading and loading the dataset using the Kaggle API.
2. **Preprocessing**: Cleaning and preparing the text data.
3. **Feature Extraction**: Using TF-IDF to transform text into numerical features.
4. **Model Training**: Training a Logistic Regression model for sentiment classification.
5. **Evaluation**: Evaluating the model using accuracy, a classification report, and a confusion matrix.

---


## Data Extraction

Download the dataset from Kaggle using the Kaggle API. This requires a `kaggle.json` API token, which the cells below copy from the working directory to `~/.kaggle/`.
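
Before running the download cells, it can help to confirm that the token file is in place. The sketch below is a hypothetical helper (`check_kaggle_credentials` is not part of the Kaggle package); it assumes the token is a JSON object with `username` and `key` fields stored at `~/.kaggle/kaggle.json`.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return a list of problems found with the token file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        credentials = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(credentials, dict):
        return [f"{path} does not contain a JSON object"]
    # Assumption: the token holds non-empty "username" and "key" fields
    return [
        f"missing '{field}' field"
        for field in ("username", "key")
        if not credentials.get(field)
    ]


print(check_kaggle_credentials() or "kaggle.json looks usable")
```

An empty list means the file was found and both fields are present; it does not confirm that the key itself is still valid.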

In [None]:
! pip install kaggle

In [None]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Fetch the dataset from Kaggle via the API
! kaggle datasets download -d kazanova/sentiment140


In [None]:
from zipfile import ZipFile

dataset = r"D:\study\.vscode\NLP\project\sentiment140.zip"

with ZipFile(dataset, "r") as archive:  # avoid shadowing the built-in zip()
    archive.extractall()
    print("The dataset is extracted")

In [50]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
nltk.download("stopwords")
nltk.download("punkt_tab")
nltk.download("wordnet")

In [None]:
column_names = ["target", "id", "date", "flag", "user", "text"]
data = pd.read_csv(
    r"D:\study\.vscode\NLP\project\training.1600000.processed.noemoticon.csv",
    names=column_names,
    encoding="ISO-8859-1",
)
data.head()

In [None]:
data = data.drop(["id", "date", "flag", "user"], axis=1)
data.head()

In [None]:
data["target"].unique()  # negative -> 0, positive -> 4

In [None]:
data["target"] = data["target"].replace(4, 1)
data["target"].unique()

In [None]:
print(data["target"].value_counts())

In [None]:
sample_size_per_class = 250000
positive_samples = data[data["target"] == 1].sample(
    n=sample_size_per_class, random_state=42
)
negative_samples = data[data["target"] == 0].sample(
    n=sample_size_per_class, random_state=42
)
balanced_data = pd.concat([positive_samples, negative_samples])
print(balanced_data["target"].value_counts())

In [None]:
balanced_data.head()

## Data Preprocessing

Clean the text data by removing URLs, mentions, hashtags, non-alphabetic tokens, and stop words, then lemmatizing and stemming the remaining words.
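
To make the first cleaning step concrete, here is a minimal sketch that uses only the standard library. `strip_twitter_noise` is a hypothetical name, the sample tweet is made up, and the whitespace collapsing is added here for readability (the notebook relies on tokenization for that).

```python
import re

URL_RE = re.compile(r"http\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def strip_twitter_noise(text):
    """Remove URLs, @mentions, and #hashtags, then collapse extra whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


sample = "@bob I don't like this phone http://example.com #fail"  # made-up tweet
print(strip_twitter_noise(sample))  # I don't like this phone
```

Note that removing a hashtag also removes the word attached to it, so a tweet such as "#love this" loses its sentiment-bearing term.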

In [None]:
# Build these once: calling stopwords.words("english") for every token is very slow
# when the function is applied to hundreds of thousands of tweets.
STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
porter_stemmer = PorterStemmer()


def process_text(content):
    # Remove URLs, @mentions, and #hashtags
    content = re.sub(r"http\S+", "", content)
    content = re.sub(r"@\w+", "", content)
    content = re.sub(r"#\w+", "", content)
    # Strip the apostrophe from negations (e.g. "don't" -> "dont") before tokenizing
    content = re.sub(
        r"\b(don't|didn't|isn't|aren't|wasn't|weren't|haven't|hasn't|hadn't|won't|wouldn't|shouldn't|cannot|can't|not)\b",
        lambda x: x.group(0).replace("'", ""),
        content,
    )
    # Tokenize, keep alphabetic tokens only, and lowercase them
    tokens = word_tokenize(content)
    tokens = [token.lower() for token in tokens if token.isalpha()]
    # Drop stop words, then lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in STOP_WORDS]
    # Stem
    tokens = [porter_stemmer.stem(word) for word in tokens]
    return " ".join(tokens)


balanced_data["text"] = balanced_data["text"].apply(process_text)
balanced_data.head()

## Feature Extraction

Transform the cleaned text data into numerical representations using TF-IDF.
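
As a rough illustration of what TF-IDF computes, the sketch below weights each term by its count in the document times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit length. It assumes whitespace tokenization and a made-up three-document corpus, so the numbers are illustrative and will not reproduce the notebook's features exactly.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (whitespace tokenization)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so every non-empty document has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["good movi", "bad movi", "good day"]  # made-up, already cleaned text
for vector in tfidf(docs):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Terms that appear in fewer documents receive a larger weight: in the second document, "bad" outweighs "movi" because "movi" also occurs in the first document.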

In [None]:
X = balanced_data["text"].values
Y = balanced_data["target"].values
X

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
print(x_train.shape, x_test.shape)
print(x_train)

In [29]:
vectorizer = TfidfVectorizer()

In [None]:
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
print(x_train)
print(x_test)

## Model Training

Train a Logistic Regression model on the extracted features.
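
For intuition about what the trained model does at prediction time, the sketch below scores a feature vector with a weighted sum and passes the result through the sigmoid function. The weights here are hypothetical and chosen by hand; in the notebook they are learned by `model.fit`.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(features, weights, bias):
    """Probability of the positive class for a sparse {term: value} vector."""
    score = bias + sum(
        weights.get(term, 0.0) * value for term, value in features.items()
    )
    return sigmoid(score)


# Hypothetical weights, chosen by hand for illustration only
weights = {"good": 2.0, "bad": -2.0}
p = predict_proba({"good": 0.7, "movi": 0.7}, weights, bias=0.0)
print("Positive" if p >= 0.5 else "Negative")
```

Terms without a learned weight, such as "movi" in this example, contribute nothing to the score.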

In [None]:
model = LogisticRegression(max_iter=1500)
model.fit(x_train, y_train)

In [None]:
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
print("Accuracy on training data:", training_data_accuracy)

In [None]:
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)
print("Accuracy on test data:", test_data_accuracy)

In [None]:
y_pred = model.predict(x_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))

### Sentiment Analysis Code

The cell below predicts the sentiment of a user-provided comment.

**Example Usage**:
- Input: "This product is amazing!"
- Output: "The predicted sentiment is: Positive"


In [None]:
user_comment = input("Enter a comment to analyze sentiment: ")
# Apply the same preprocessing that was used on the training data
cleaned_comment = process_text(user_comment)
# Vectorize the comment
comment_vector = vectorizer.transform([cleaned_comment])
# Predict sentiment
prediction = model.predict(comment_vector)
sentiment = "Positive" if prediction[0] == 1 else "Negative"
print(f"The predicted sentiment is: {sentiment}")

## Evaluation

Plot the class distribution of the full dataset, then evaluate the model's performance using a confusion matrix.
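
A binary confusion matrix also yields accuracy, precision, and recall directly. The sketch below assumes the layout `[[TN, FP], [FN, TP]]` (rows are actual labels, columns are predicted labels, classes ordered 0 then 1). The counts are made up for illustration and are not results from this notebook.

```python
def metrics_from_confusion_matrix(cm):
    """Compute accuracy, precision, and recall for the positive class.

    cm is laid out as [[TN, FP], [FN, TP]] (rows = actual, columns = predicted).
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


example_cm = [[40, 10], [5, 45]]  # made-up counts, not results from this notebook
print(metrics_from_confusion_matrix(example_cm))
```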

In [None]:
# Plot the class distribution of the full dataset.
# sort_index() keeps the bars in label order (0, 1) so they match the tick labels below.
data["target"].value_counts().sort_index().plot(
    kind="bar", title="Class Distribution", color=["red", "blue"]
)
plt.xticks(ticks=[0, 1], labels=["Negative", "Positive"], rotation=0)
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.show()

In [None]:
# Visualizing the Confusion Matrix

cm = confusion_matrix(y_test, y_pred)

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["Negative", "Positive"],
    yticklabels=["Negative", "Positive"],
)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()