<a href="https://colab.research.google.com/github/jagadeeshkadal/DataAnalyticsInternship-codetechit/blob/master/Task_4%5BSENTIMENT_ANALYSIS%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentiment Analysis Using Natural Language Processing (NLP)

### Objective

The goal of this project is to build a sentiment analysis model using Natural Language Processing (NLP) techniques. The model will classify text-based movie reviews as either positive or negative.

### What We Will Do

- Load a publicly available dataset of labeled text (the code below downloads a CSV of tweets with binary labels).
- Preprocess the raw text (cleaning, normalization, tokenization).
- Convert the cleaned text into numerical features using TF-IDF.
- Train a machine learning model (e.g., Logistic Regression) to classify sentiment.
- Evaluate the model's performance using metrics such as accuracy, precision, and recall, plus a confusion matrix.
- Visualize results and interpret model insights.
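
To make the TF-IDF step in the list above concrete, here is a minimal sketch on a hypothetical three-document corpus (the documents are invented for illustration and are not from the notebook's dataset). It shows that a word appearing in fewer documents tends to receive a larger weight than a more common word in the same document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the notebook's dataset.
docs = [
    "great movie great acting",
    "terrible movie",
    "great fun",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = sorted(vectorizer.vocabulary_)   # learned terms, alphabetical
print(vocab)
print(matrix.shape)

# "movie" appears in 2 of 3 documents and "terrible" in only 1, so within the
# second document the rarer word "terrible" gets the larger weight.
idx = vectorizer.vocabulary_
row = matrix[1].toarray().ravel()
print(row[idx["terrible"]] > row[idx["movie"]])
```

Each row of `matrix` is the numeric feature vector a classifier would receive for the corresponding document.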

### Prediction Task

We will predict the sentiment (`positive` or `negative`) of a given movie review based on its textual content.


For sentiment analysis, we use Machine Learning (ML), and sometimes Deep Learning (DL) when the task is more complex.

### Why Machine Learning Is Used in Sentiment Analysis

We want the model to learn from labeled text and make predictions on new, unseen text. For example, if we train it on movie reviews labeled "positive" or "negative", it learns the language patterns associated with each sentiment and applies them to new reviews.

### Which ML Models Can Be Used?

For basic sentiment analysis with classical ML (not deep learning), we can use:

| Algorithm | Why it's useful |
| --- | --- |
| Logistic Regression | Simple yet effective for binary classification. |
| Naive Bayes | Great for text classification, especially with word frequencies. |
| Support Vector Machine (SVM) | Works well with high-dimensional data like TF-IDF. |
| Random Forest | Good for interpretability and performance on many problems. |

This notebook uses Logistic Regression, which is simple and fast to train in Colab; Naive Bayes would be a comparable lightweight choice.
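
As a rough illustration of how interchangeable these models are in scikit-learn, the sketch below fits all four on the same TF-IDF features. The six texts and their labels are hypothetical, `LinearSVC` is used as one possible SVM variant, and the Random Forest settings are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = ["loved it", "great film", "wonderful acting",
         "hated it", "awful film", "boring plot"]
labels = [1, 1, 1, 0, 0, 0]

features = TfidfVectorizer().fit_transform(texts)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

fitted = {}
for name, clf in candidates.items():
    fitted[name] = clf.fit(features, labels)
    # Training accuracy only; on six samples it says nothing about generalization.
    print(name, fitted[name].score(features, labels))
```

Because every estimator exposes the same `fit` and `predict` methods, swapping the classifier later in the notebook would only require changing one line.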

In [1]:
# Step 1: Install and Import Required Libraries
!pip install -q nltk scikit-learn

import pandas as pd
import numpy as np
import nltk
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
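# Step 2: Download the dataset (a CSV of tweets with 'tweet' and 'label' columns)
# Note: if train.csv already exists, wget saves the new copy as train.csv.1 and
# read_csv below loads the older file; pass -O train.csv to overwrite instead.
!wget https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv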

# Load the dataset
df = pd.read_csv("train.csv")
df = df[['tweet', 'label']]  # Only keep relevant columns
df.columns = ['text', 'label']
df.head()


--2025-04-08 13:10:15--  https://raw.githubusercontent.com/dD2405/Twitter_Sentiment_Analysis/master/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3103165 (3.0M) [text/plain]
Saving to: ‘train.csv.1’


2025-04-08 13:10:15 (20.1 MB/s) - ‘train.csv.1’ saved [3103165/3103165]



|   | text | label |
| --- | --- | --- |
| 0 | @user when a father is dysfunctional and is s... | 0 |
| 1 | @user @user thanks for #lyft credit i can't us... | 0 |
| 2 | bihday your majesty | 0 |
| 3 | #model i love u take with u all the time in ... | 0 |
| 4 | factsguide: society now #motivation | 0 |


In [4]:
# Step 3: Preprocess the Text Data
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)  # remove links
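    # Strips '#' symbols only: '\@w+' matches '@' followed by literal 'w' characters,
    # so @user mentions are not removed as a whole (r'@\w+|#' would remove them).
    text = re.sub(r'\@w+|\#','', text)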
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)  # remove digits
    text = text.strip()
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)

df['clean_text'] = df['text'].apply(preprocess)
df[['text', 'clean_text', 'label']].head()


|   | text | clean_text | label |
| --- | --- | --- | --- |
| 0 | @user when a father is dysfunctional and is s... | user father dysfunctional selfish drags kids d... | 0 |
| 1 | @user @user thanks for #lyft credit i can't us... | user user thanks lyft credit cant use cause do... | 0 |
| 2 | bihday your majesty | bihday majesty | 0 |
| 3 | #model i love u take with u all the time in ... | model love u take u time urð ðððð ððð | 0 |
| 4 | factsguide: society now #motivation | factsguide society motivation | 0 |
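
The cleaned output above still contains the word `user` because the pattern `r'\@w+|\#'` in `preprocess` does not match `@user`. The sketch below compares that pattern with a corrected one; `strip_mentions_and_hashes` is a hypothetical helper name, and the sample string is invented.

```python
import re

def strip_mentions_and_hashes(text):
    """Remove whole @mentions and drop the '#' symbol while keeping the tag word."""
    return re.sub(r"@\w+|#", "", text)

sample = "@user thanks for #lyft credit"

# Pattern as written in preprocess: only the '#' is removed.
original = re.sub(r"\@w+|\#", "", sample)
print(original)  # @user thanks for lyft credit

# Corrected pattern: the whole mention is removed as well.
cleaned = " ".join(strip_mentions_and_hashes(sample).split())
print(cleaned)   # thanks for lyft credit
```

Swapping this pattern into `preprocess` would change the cleaned text and therefore the model's features, so the scores reported later would need to be regenerated.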


In [5]:
# Step 4: Vectorize the Text using TF-IDF
X = df['clean_text']
y = df['label']

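# Note: fitting TF-IDF on the full dataset before splitting lets test-set
# vocabulary and IDF statistics influence the features (mild data leakage).
tfidf = TfidfVectorizer(max_features=5000)
X_vec = tfidf.fit_transform(X)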

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)
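
One way to keep the test set out of the vectorizer is to split the raw text first and wrap TF-IDF and the classifier in a scikit-learn pipeline. This is a minimal sketch under assumptions: the toy texts stand in for `df['clean_text']` and `df['label']`, and `stratify` is added so both splits keep the same class proportions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['clean_text'] and df['label'].
texts = ["good film", "great story", "nice acting", "loved every scene",
         "bad film", "awful story", "poor acting", "hated every scene"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping class proportions equal in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
pipeline.fit(train_texts, y_train)

predictions = pipeline.predict(test_texts)
print(len(train_texts), len(test_texts))

# Every learned term comes from the training texts.
learned_terms = set(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
train_words = {word for text in train_texts for word in text.split()}
print(learned_terms <= train_words)
```

The fitted pipeline also accepts raw strings at prediction time, which removes the need to call the vectorizer separately.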


In [6]:
# Step 5: Train Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9496324104489285
Classification Report:
               precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.92      0.32      0.48       456

    accuracy                           0.95      6393
   macro avg       0.93      0.66      0.73      6393
weighted avg       0.95      0.95      0.94      6393
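
The report above shows high accuracy but a recall of only 0.32 for label 1, which is the minority class (456 of 6393 test rows). A confusion matrix makes this kind of gap visible, and `class_weight="balanced"` is one common adjustment that may raise minority-class recall, usually at some cost in precision. The sketch below uses synthetic imbalanced data as a stand-in for the TF-IDF features; it does not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 90% class 0), standing in for TF-IDF features.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

for weighting in (None, "balanced"):
    clf = LogisticRegression(class_weight=weighting, max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Rows are true labels, columns are predicted labels.
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    print(weighting, "recall for class 1:", recall_score(y_test, y_pred))
    print(cm)
```

In the matrix, the bottom-left cell counts label-1 rows predicted as 0, which is the error type behind the low recall.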



In [7]:
# Step 6: Test on Custom Input
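def predict_sentiment(text):
    """Clean the input, vectorize it with the fitted TF-IDF, and map the predicted label to a string."""
    cleaned = preprocess(text)
    vector = tfidf.transform([cleaned])
    prediction = model.predict(vector)
    # Assumes label 1 means "positive"; check the dataset's label definition,
    # since both sample inputs below come back as "Negative".
    return "Positive" if prediction[0] == 1 else "Negative"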

# Try
print(predict_sentiment("I love this movie, it's awesome!"))
print(predict_sentiment("Worst movie ever, total waste of time."))


Negative
Negative
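
When a prediction looks surprising, as with the first input above, inspecting class probabilities can show how confident the model is. This is a minimal sketch on hypothetical toy data, using `predict_proba` from scikit-learn's `LogisticRegression`; the texts and labels are invented and the printed values will differ from anything the notebook's model would produce.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative.
texts = ["loved it", "great film", "wonderful acting",
         "hated it", "awful film", "boring plot"]
labels = [1, 1, 1, 0, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

# One probability per class, in the order given by classes_.
classes = clf.named_steps["logisticregression"].classes_
probs = clf.predict_proba(["great film"])[0]
for label, prob in zip(classes, probs):
    print(label, round(float(prob), 3))
```

Probabilities close to 0.5 suggest the model has little evidence either way, which is a useful signal before trusting a hard label.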
