# **Solution Notebook of NLP assignment**

## **Name: Mudassar khan**

## **Roll no: 10260**

## **Semester: 7th (B)**



### **Downloading IMDB movie review data from Kaggle**

In [4]:
from zipfile import ZipFile
import os

In [1]:
!pip install kaggle



In [2]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 54% 14.0M/25.7M [00:00<00:00, 146MB/s]
100% 25.7M/25.7M [00:00<00:00, 188MB/s]


In [5]:
# Unzip the downloaded archive into the current directory
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", 'r') as zip_ref:
    zip_ref.extractall()
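
Before extracting, it can help to check what the archive actually contains. The sketch below assumes the same archive name as above and uses a hypothetical helper, `extract_csvs`, that extracts only the CSV members and returns their names. It is an illustration, not part of the original notebook.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_csvs(zip_path, target_dir="."):
    """Extract only the .csv members of an archive and return their names.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    with ZipFile(zip_path, "r") as zip_ref:
        csv_members = [n for n in zip_ref.namelist() if n.lower().endswith(".csv")]
        zip_ref.extractall(path=target_dir, members=csv_members)
    return csv_members


if __name__ == "__main__":
    archive = Path("imdb-dataset-of-50k-movie-reviews.zip")
    # Only run when the archive has already been downloaded
    if archive.exists():
        print(extract_csvs(archive))
```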

In [8]:
import pandas as pd
df=pd.read_csv('IMDB Dataset.csv')
df.head(5)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


**Note:** The preview above confirms that the dataset was downloaded and loaded successfully.

### **1. Text Preprocessing**

**Importing the required libraries**

In [9]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords

In [10]:
# Download stopwords if not already downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove stop words
    text = ' '.join(word for word in text.split() if word not in stop_words)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text


In [12]:
# Apply the preprocessing function to the 'review' column
df['cleaned_review'] = df['review'].apply(preprocess_text)

# Display the cleaned data
print(df[['review', 'cleaned_review']].head())

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                      cleaned_review  
0  one reviewers mentioned watching oz episode yo...  
1  wonderful little production br br filming tech...  
2  thought wonderful way spend time hot summer we...  
3  basically theres family little boy jake thinks...  
4  petter matteis love time money visually stunni...  
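
The cleaned output above still contains `br` tokens. They are leftovers from the `<br />` tags in the raw reviews: removing punctuation strips the angle brackets but keeps the tag name. One way to handle this is to drop HTML tags before the other steps. The following is a minimal sketch; the function name `preprocess_text_no_html` and the tiny stop-word set are assumptions for illustration, and in the notebook the NLTK `stop_words` set would be passed instead.

```python
import re
import string

# Small illustrative stop-word set (assumption); use the NLTK set in practice.
DEMO_STOP_WORDS = {"a", "the", "is", "of", "this"}


def preprocess_text_no_html(text, stop_words=DEMO_STOP_WORDS):
    # Replace HTML tags such as <br /> with a space so adjacent words stay separate
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only letters and whitespace
    text = re.sub(r"[^a-z\s]", "", text)
    # Drop stop words; split() also collapses repeated whitespace
    return " ".join(w for w in text.split() if w not in stop_words)


sample = "A wonderful little production. <br /><br />The filming is great!"
print(preprocess_text_no_html(sample))  # wonderful little production filming great
```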


### **2. Word Embedding using word2vec**

**Importing Word2Vec from gensim**

In [13]:
from gensim.models import Word2Vec

In [14]:
df['tokens'] = df['cleaned_review'].apply(lambda x: x.split())

**Training the Word2Vec model**

In [15]:
word2vec_model = Word2Vec(sentences=df['tokens'], vector_size=100, window=5, min_count=2, workers=4)

In [17]:
word_vector = word2vec_model.wv['good']
print(word_vector)

[-1.7154952   0.6417364  -0.30999613 -3.0457346  -0.47795495  2.293713
 -0.63472193 -0.35454476 -1.4667492  -1.3532614  -1.7795126  -1.4642916
 -1.0570786  -1.5238519   1.1260267  -2.6576605   1.097982    0.24733886
  2.2284338   1.1135254   0.80360436  0.21412462  0.308577    0.04193586
 -0.71608317 -0.72204125 -1.1183072  -0.0182648   2.0218627   0.95968384
 -0.92807806 -0.6848925   0.07588099  0.83811027  1.3377154  -0.05879513
 -0.8885456  -0.57654756 -0.4015836   0.77918833  2.3095179  -3.247325
 -0.2939537  -1.5530536   0.8152297   0.08413505  0.9592549  -0.93815506
 -0.59899193 -0.07745868  0.39567703 -0.28419855 -0.01901226  1.6164235
  0.40974343 -1.687031    0.43243513 -1.7413058   1.3463613   0.30651614
 -0.2587312  -0.47344962  0.4237697  -0.27557105  2.3875575  -2.3749666
 -1.9605715   0.922255   -0.63407516 -1.9177153  -1.0672848  -0.04913205
  1.5715446   3.2378075   0.10219207  2.3821464  -0.79610425  0.9107277
 -0.7166902   0.5499417  -2.2525835   1.1703598   1.4726392
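
A single 100-dimensional vector is hard to interpret on its own; embeddings become useful when vectors are compared. The sketch below computes cosine similarity on small made-up vectors (the numbers are not taken from the trained model). With the trained model, gensim's `word2vec_model.wv.most_similar('good')` performs the same kind of comparison against the whole vocabulary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (made-up values, for illustration only)
toy_vectors = {
    "good": [1.0, 2.0, 0.5],
    "great": [0.9, 2.1, 0.4],
    "terrible": [-1.0, -1.5, 0.2],
}

for word in ("great", "terrible"):
    sim = cosine_similarity(toy_vectors["good"], toy_vectors[word])
    print(f"good vs {word}: {sim:.3f}")
```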

### **3. One-hot encoding**

In [20]:
import numpy as np

In [None]:
# Assuming the sentiment column exists, let's see its unique values
print(df['sentiment'].unique())

In [22]:
# Apply one-hot encoding to the 'sentiment' column
df_encoded = pd.get_dummies(df, columns=['sentiment'], prefix='sentiment')
# Display the dataframe with one-hot encoded sentiment
print(df_encoded)

                                                  review  \
0      One of the other reviewers has mentioned that ...   
1      A wonderful little production. <br /><br />The...   
2      I thought this was a wonderful way to spend ti...   
3      Basically there's a family where a little boy ...   
4      Petter Mattei's "Love in the Time of Money" is...   
...                                                  ...   
49995  I thought this movie did a down right good job...   
49996  Bad plot, bad dialogue, bad acting, idiotic di...   
49997  I am a Catholic taught in parochial elementary...   
49998  I'm going to have to disagree with the previou...   
49999  No one expects the Star Trek movies to be high...   

                                          cleaned_review  \
0      one reviewers mentioned watching oz episode yo...   
1      wonderful little production br br filming tech...   
2      thought wonderful way spend time hot summer we...   
3      basically theres family little b
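
`pd.get_dummies` replaces the `sentiment` column with one indicator column per category (here `sentiment_negative` and `sentiment_positive`). The pure-Python sketch below shows the same idea on a few labels; the helper `one_hot` is hypothetical and only meant to illustrate the encoding.

```python
def one_hot(labels):
    """Return (sorted categories, list of indicator rows) for a list of labels."""
    categories = sorted(set(labels))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # mark the column that matches this label
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["positive", "positive", "negative"])
print(categories)  # ['negative', 'positive']
print(rows)        # [[0, 1], [0, 1], [1, 0]]
```

With only two classes, the two indicator columns carry the same information, so a single 0/1 column would be enough for a binary classifier.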

### **4. Part-of-speech tagging**

In [23]:
import nltk

In [24]:
# Tokenize the cleaned reviews into words
df['tokens'] = df['cleaned_review'].apply(lambda x: x.split())

In [25]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [26]:
# Define a function to perform POS tagging
def pos_tagging(tokens):
    return nltk.pos_tag(tokens)

In [27]:
df['pos_tags'] = df['tokens'].apply(pos_tagging)

In [28]:
# Display the dataframe with tokens and their corresponding POS tags
print(df[['cleaned_review', 'tokens', 'pos_tags']].head())

                                      cleaned_review  \
0  one reviewers mentioned watching oz episode yo...   
1  wonderful little production br br filming tech...   
2  thought wonderful way spend time hot summer we...   
3  basically theres family little boy jake thinks...   
4  petter matteis love time money visually stunni...   

                                              tokens  \
0  [one, reviewers, mentioned, watching, oz, epis...   
1  [wonderful, little, production, br, br, filmin...   
2  [thought, wonderful, way, spend, time, hot, su...   
3  [basically, theres, family, little, boy, jake,...   
4  [petter, matteis, love, time, money, visually,...   

                                            pos_tags  
0  [(one, CD), (reviewers, NNS), (mentioned, VBD)...  
1  [(wonderful, JJ), (little, JJ), (production, N...  
2  [(thought, VBN), (wonderful, JJ), (way, NN), (...  
3  [(basically, RB), (theres, NNS), (family, NN),...  
4  [(petter, NN), (matteis, RBS), (love, JJ), (ti..
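
Once each review carries a list of `(word, tag)` pairs, the tags can be summarized. As a sketch, the snippet below counts tag frequencies for a short list built from pairs visible in the output above; the helper name `tag_counts` is an assumption.

```python
from collections import Counter


def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


# Same shape as one entry of df['pos_tags']; pairs copied from rows 0 and 3 above
sample = [
    ("one", "CD"), ("reviewers", "NNS"), ("mentioned", "VBD"),
    ("basically", "RB"), ("theres", "NNS"), ("family", "NN"),
]
counts = tag_counts(sample)
print(counts["NNS"])  # 2
```

Note that `theres` is tagged as a plural noun (`NNS`): stripping the apostrophe during preprocessing turns `there's` into a token the tagger does not recognize.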

## **Question 2:**
### **1. Sentiment Analysis using VADER**

In [29]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

In [30]:
nltk.download('vader_lexicon')
# Initialize the Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [31]:
# Define a function to get sentiment scores
def get_sentiment_score(review):
    score = sia.polarity_scores(review)
    return score

df['sentiment_scores'] = df['cleaned_review'].apply(get_sentiment_score)


In [32]:
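df['compound'] = df['sentiment_scores'].apply(lambda x: x['compound'])
# Note: this assignment overwrites the original IMDB labels in 'sentiment' with the VADER-derived labels
df['sentiment'] = df['compound'].apply(lambda x: 'positive' if x > 0.05 else ('negative' if x < -0.05 else 'neutral'))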

# Display the dataframe with sentiment scores
print(df[['cleaned_review', 'sentiment_scores', 'compound', 'sentiment']].head())

                                      cleaned_review  \
0  one reviewers mentioned watching oz episode yo...   
1  wonderful little production br br filming tech...   
2  thought wonderful way spend time hot summer we...   
3  basically theres family little boy jake thinks...   
4  petter matteis love time money visually stunni...   

                                    sentiment_scores  compound sentiment  
0  {'neg': 0.295, 'neu': 0.605, 'pos': 0.1, 'comp...   -0.9934  negative  
1  {'neg': 0.075, 'neu': 0.657, 'pos': 0.268, 'co...    0.9582  positive  
2  {'neg': 0.148, 'neu': 0.549, 'pos': 0.302, 'co...    0.9520  positive  
3  {'neg': 0.213, 'neu': 0.658, 'pos': 0.129, 'co...   -0.8858  negative  
4  {'neg': 0.03, 'neu': 0.722, 'pos': 0.249, 'com...    0.9871  positive  
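
The thresholding rule used above can be isolated in a small function, which also makes it easy to compare VADER's labels with the labels that ship with the dataset. The names `label_from_compound` and `agreement` are assumptions for illustration; the compound scores are the five printed above, and the reference labels are the first five rows of the original dataset shown earlier.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


def agreement(predicted, reference):
    """Fraction of positions where two label lists agree."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


compounds = [-0.9934, 0.9582, 0.9520, -0.8858, 0.9871]
imdb_labels = ["positive", "positive", "positive", "negative", "positive"]

vader_labels = [label_from_compound(c) for c in compounds]
print(vader_labels)
print(agreement(vader_labels, imdb_labels))  # 0.8
```

In the notebook, storing the result under a separate name, for example `df['vader_sentiment'] = df['compound'].apply(label_from_compound)`, would keep the original `sentiment` column available for such a comparison.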


### **2. Logistic Regression**

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [34]:
print(df[['cleaned_review', 'sentiment']].head())

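# Split the data into features and labels
# Note: 'sentiment' holds the VADER-derived labels from the previous section,
# which is why a 'neutral' class appears in the classification report
X = df['cleaned_review']
y = df['sentiment']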

                                      cleaned_review sentiment
0  one reviewers mentioned watching oz episode yo...  negative
1  wonderful little production br br filming tech...  positive
2  thought wonderful way spend time hot summer we...  positive
3  basically theres family little boy jake thinks...  negative
4  petter matteis love time money visually stunni...  positive


### **Splitting the data into training and test sets**

In [35]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### **Convert Text to Numerical Features using TF-IDF**

In [36]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the training data, and transform the test data
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
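
To make the TF-IDF step concrete, the sketch below recomputes the weights by hand for a toy corpus. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses plain whitespace tokenization, which is simpler than the vectorizer's own tokenizer. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows


vocab, rows = tfidf_rows(["good movie", "bad movie", "good good plot"])
print(vocab)
for doc_row in rows:
    print([round(w, 3) for w in doc_row])
```

In the second document, `bad` receives a higher weight than `movie` because it appears in fewer documents, which is the effect the IDF term is meant to capture.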


### **Model Training**

In [37]:
model = LogisticRegression()

# Train the model
model.fit(X_train_tfidf, y_train)


In [45]:
# Make predictions on the test set
y_pred = model.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Generate a classification report
report = classification_report(y_test, y_pred)
print(report)


Accuracy: 0.87
              precision    recall  f1-score   support

    negative       0.84      0.73      0.78      3020
     neutral       0.00      0.00      0.00       124
    positive       0.88      0.95      0.91      6856

    accuracy                           0.87     10000
   macro avg       0.57      0.56      0.57     10000
weighted avg       0.86      0.87      0.86     10000
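
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows, which shows why they differ so much here: the macro average gives the `neutral` class (124 samples, precision and recall of 0.00) the same weight as the two large classes. The sketch below recomputes both averages for precision and recall from the rounded values printed above, so the last digit can differ slightly from what scikit-learn computes on unrounded values.

```python
# Per-class values copied from the classification report above
report_rows = {
    "negative": {"precision": 0.84, "recall": 0.73, "support": 3020},
    "neutral": {"precision": 0.00, "recall": 0.00, "support": 124},
    "positive": {"precision": 0.88, "recall": 0.95, "support": 6856},
}


def macro_avg(rows, metric):
    """Unweighted mean of a metric over classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)


def weighted_avg(rows, metric):
    """Mean of a metric over classes, weighted by class support."""
    total = sum(r["support"] for r in rows.values())
    return sum(r[metric] * r["support"] for r in rows.values()) / total


for metric in ("precision", "recall"):
    print(metric,
          round(macro_avg(report_rows, metric), 2),
          round(weighted_avg(report_rows, metric), 2))
```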




