Name:- Mahesh Marathe

Roll No:- 391035

PRN:- 22211533

Batch:- A2

Title:- Apply advanced modeling techniques for sentiment analysis


Objective:
The objective of this assignment is to explore and implement advanced machine learning and deep learning techniques for sentiment analysis. The goal is to train models that can automatically classify textual data (e.g., movie reviews or tweets) into sentiment categories such as positive or negative. By using models like Support Vector Machine (SVM), Long Short-Term Memory (LSTM) networks, and BERT (Bidirectional Encoder Representations from Transformers), we aim to improve the accuracy and performance of sentiment classification.

Theory:
1. Sentiment Analysis:
Sentiment analysis is a type of natural language processing (NLP) that involves determining whether a given text expresses a positive, negative, or neutral sentiment. It has applications in fields like customer feedback, social media monitoring, and product review analysis.

2. Preprocessing Techniques:

Text cleaning: Converting text to lowercase and removing punctuation and stopwords.

Tokenization: Splitting sentences into words.

Vectorization: Converting text into numerical format using techniques like TF-IDF or word embeddings.
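
The three preprocessing steps above can be sketched in plain Python on a toy corpus. This is only an illustration: the tiny stopword list and the example sentences are assumptions, and the weighting is the unsmoothed textbook TF-IDF formula, which differs from the smoothed, normalized variant that scikit-learn's TfidfVectorizer applies by default.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "was", "a", "and", "it", "is"}

def preprocess(text):
    """Lowercase, replace non-letters with spaces, split on whitespace, drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["The movie was great!", "The movie was boring.", "Great acting, great plot."]
vectors = tfidf(docs)
print(vectors[1])
```

In the second document, "boring" receives a higher weight than "movie" because it appears in fewer documents of the corpus.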

3. Machine Learning Approaches:

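Support Vector Machine (SVM): A classifier that separates classes with a maximum-margin hyperplane. This assignment uses the linear variant (LinearSVC) with TF-IDF vectors as input features. Linear SVMs tend to perform well on high-dimensional, sparse text features, and the regularized margin objective helps limit overfitting.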
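
To make the hyperplane idea concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the L2-regularized hinge loss. It is not what LinearSVC does internally (scikit-learn delegates to the liblinear solver and defaults to the squared hinge loss); the toy 2D points, learning rate, and regularization strength are assumptions chosen for illustration.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=20):
    """Fit weights w and bias b on the regularized hinge loss.

    Labels in y must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: hinge term is active.
                w = [wi - lr * (lam * wi - label * xi) for wi, xi in zip(w, x)]
                b += lr * label
            else:
                # Only the regularizer contributes to the update.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (hypothetical).
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

With text data, each x would be a TF-IDF vector with thousands of dimensions rather than a 2D point, but the decision rule is the same.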

4. Deep Learning Approaches:

LSTM (Long Short-Term Memory): A type of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies. It is widely used in NLP tasks as it can remember the context of previous words in a sentence.
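
How an LSTM "remembers" earlier words can be sketched with a single scalar cell. The gate equations below are the standard LSTM formulation; the weights are hypothetical values picked by hand, whereas a real layer such as Keras's LSTM(128) learns weight matrices over vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each gate name ("f", "i", "o", "g") to a (w_x, w_h, bias) tuple.
    """
    def pre(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("f"))     # forget gate: how much of the old cell state to keep
    i = sigmoid(pre("i"))     # input gate: how much of the candidate to write
    o = sigmoid(pre("o"))     # output gate: how much of the cell state to expose
    g = math.tanh(pre("g"))   # candidate cell content
    c = f * c_prev + i * g    # updated cell state (long-term memory)
    h = o * math.tanh(c)      # updated hidden state
    return h, c

# Hypothetical weights: a forget-gate bias of 2.0 keeps most of the memory.
params = {"f": (0.0, 0.0, 2.0), "i": (1.0, 0.0, 0.0),
          "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:     # a signal at the first step, then two empty steps
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The cell state is still clearly non-zero two steps after the only non-zero input, which is the mechanism that lets the network carry context across a sentence.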

BERT (Bidirectional Encoder Representations from Transformers): A transformer-based model pre-trained on a large text corpus. It reads text bidirectionally, so each token's representation draws on both its left and right context, unlike unidirectional models that read only from left to right. Fine-tuning BERT on sentiment data can lead to state-of-the-art results.
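
The bidirectional reading comes from unmasked self-attention, in which every token attends to every other token in the sequence. The sketch below shows scaled dot-product self-attention in plain Python under simplifying assumptions: the 2-dimensional token embeddings are hypothetical, and the learned query, key, and value projections of a real transformer layer are omitted.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Unmasked scaled dot-product self-attention with queries = keys = values."""
    d = len(vectors[0])
    weights, outputs = [], []
    for q in vectors:
        # Score the query against every position, left and right alike.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all token vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, vectors)) for j in range(d)])
    return weights, outputs

# Hypothetical embeddings for the tokens "not", "good", "movie".
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix is non-zero at all positions, so the first token's output already depends on tokens that come after it.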

In [4]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset
df = pd.read_csv('/content/IMDB Dataset.csv')

# Preprocessing
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

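def clean_text(text):
    """Lowercase a review and drop stopwords and non-alphabetic tokens."""
    text = text.lower()
    # Note: text.split() splits on whitespace only, so isalpha() also discards
    # words with attached punctuation (e.g. "great.") or digits.
    words = [word for word in text.split() if word.isalpha() and word not in stop_words]
    return " ".join(words)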

df['review'] = df['review'].apply(clean_text)

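# Train-test split (80% train, 20% test).
# No random_state is set, so the split and the scores below vary between runs.
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2)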


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
# TF-IDF + Support Vector Machine (SVM)

from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

svm = LinearSVC()
svm.fit(X_train_tfidf, y_train)
pred = svm.predict(X_test_tfidf)

print(classification_report(y_test, pred))


              precision    recall  f1-score   support

    negative       0.87      0.86      0.86      5004
    positive       0.86      0.87      0.86      4996

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000
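
The per-class figures in the report above are derived from counts of correct and incorrect predictions. The sketch below computes precision, recall, and F1 for one class from scratch; the five toy labels are made up for illustration and are not taken from the IMDB results.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return (precision, recall, f1) for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical).
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

The support column in the report is simply the number of true examples of each class in the test set.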



In [6]:
# LSTM-based Deep Learning Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_seq, maxlen=200)
X_test_pad = pad_sequences(X_test_seq, maxlen=200)

model = Sequential([
    # input_length is omitted: it is deprecated in Keras 3 and the model infers it from the data.
    Embedding(input_dim=10000, output_dim=128),
    LSTM(128),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train_pad, y_train.map({'positive':1, 'negative':0}), epochs=3, batch_size=64)
model.evaluate(X_test_pad, y_test.map({'positive':1, 'negative':0}))


Epoch 1/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 16s 18ms/step - accuracy: 0.7816 - loss: 0.4520
Epoch 2/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 16s 15ms/step - accuracy: 0.8988 - loss: 0.2582
Epoch 3/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 8s 12ms/step - accuracy: 0.9270 - loss: 0.1931
313/313 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.8648 - loss: 0.3488


[0.3600291311740875, 0.8592000007629395]
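
The tokenization and padding step that feeds the LSTM can be sketched without Keras. This simplified version assumes whitespace splitting and omits the lowercasing and punctuation filtering that the Keras Tokenizer performs; it does mirror the defaults of pad_sequences, which pads and truncates at the start of each sequence. The function names are hypothetical.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the most frequent words to ids starting at 1; id 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    top = counts.most_common(num_words - 1)
    return {word: idx for idx, (word, _) in enumerate(top, start=1)}

def to_padded(texts, vocab, maxlen):
    """Convert texts to fixed-length id sequences."""
    padded = []
    for text in texts:
        seq = [vocab[w] for w in text.split() if w in vocab]  # unknown words are dropped
        seq = seq[-maxlen:]                                   # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)        # left-pad with zeros
    return padded

texts = ["great movie", "boring movie really boring"]
vocab = build_vocab(texts, num_words=10)
padded = to_padded(texts, vocab, maxlen=3)
print(padded)
```

Every row has the same length, which is what allows the reviews to be stacked into a single array for the Embedding layer.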

In [7]:
# Using a pre-trained DistilBERT model (transformer-based), without fine-tuning

from transformers import pipeline

# Pretrained sentiment-analysis pipeline
classifier = pipeline("sentiment-analysis")

print(classifier("This movie was absolutely amazing!"))
print(classifier("I didn’t like the acting. It was boring."))


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9998781681060791}]
[{'label': 'NEGATIVE', 'score': 0.9993508458137512}]
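
The cell above only runs the pipeline on two sentences. To compare it with the SVM and LSTM, its predictions would have to be scored against the test labels. The helper below is a sketch of that step: pipeline_accuracy is a hypothetical function, and a keyword rule stands in for the real classifier so that the example runs without downloading a model.

```python
def pipeline_accuracy(classifier, texts, labels, batch_size=32):
    """Fraction of texts whose predicted label matches the dataset label.

    `classifier` is any callable that takes a list of strings and returns a
    list of {'label': ..., 'score': ...} dicts, like the pipeline output above.
    """
    correct = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        gold = labels[start:start + batch_size]
        for result, label in zip(classifier(batch), gold):
            # The pipeline returns 'POSITIVE'/'NEGATIVE'; the dataset uses lowercase.
            correct += result["label"].lower() == label
    return correct / len(texts)

# Stand-in classifier (hypothetical keyword rule), not a real model.
def keyword_classifier(batch):
    return [{"label": "NEGATIVE" if "boring" in t else "POSITIVE", "score": 1.0}
            for t in batch]

acc = pipeline_accuracy(
    keyword_classifier,
    ["amazing film", "boring plot", "so boring"],
    ["positive", "negative", "positive"],
)
print(acc)
```

With the real pipeline, the texts and labels would come from X_test.tolist() and y_test.tolist(). Long reviews may exceed the model's input limit, so some form of truncation would probably be needed.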


Conclusion:
In this assignment, we applied three modeling approaches to sentiment analysis on IMDB movie reviews. The TF-IDF + linear SVM baseline reached 0.86 accuracy on the 10,000-review test set. The LSTM reached about 0.86 test accuracy after three epochs, on par with the SVM, while its training accuracy rose to 0.93, which suggests it was beginning to overfit. The pre-trained DistilBERT pipeline classified both example sentences correctly with high confidence, but it was not evaluated on the test set, so its accuracy cannot be compared with the other two models here. Models such as LSTM and BERT can capture word order and context that bag-of-words features miss, which makes them well suited to real-world applications that require a deeper understanding of text, such as chatbots, feedback systems, and opinion mining tools. On this dataset, however, the simpler SVM remained competitive.

