<a href="https://colab.research.google.com/github/kittu679/Sentiment-Analysis/blob/main/Lenovo_case_study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of K8 Reviews

This project performs sentiment analysis on Lenovo K8 phone reviews using lexicon-based scoring (VADER and Afinn) and a Logistic Regression model trained on TF-IDF features.

## Project Steps

1.  **Setup and Data Loading**:
    *   Install necessary libraries (`nltk`, `afinn`, `gensim`).
    *   Load the review data from the provided CSV file (`K8 Reviews.csv`).
    *   Initial data inspection (`data.tail()`, `data.info()`, `data["sentiment"].value_counts()`).

2.  **Text Preprocessing**:
    *   Define and apply a function to remove punctuation from the reviews.
    *   Tokenize the processed reviews using `nltk.word_tokenize`.
    *   Perform Part-of-Speech (POS) tagging on the tokens.
    *   Remove stop words from the tokenized reviews.
    *   Apply stemming using `nltk.PorterStemmer`.
    *   Apply lemmatization using `nltk.WordNetLemmatizer`.

3.  **Sentiment Scoring (Lexicon-based)**:
    *   Use VADER (Valence Aware Dictionary and sEntiment Reasoner) and Afinn lexicons to calculate sentiment scores for both stemmed and lemmatized reviews.
    *   Classify the sentiment scores into binary categories (0 or 1) based on defined thresholds.
    *   Evaluate the accuracy of the lexicon-based sentiment scores against the original sentiment labels.

4.  **Sentiment Analysis (Machine Learning)**:
    *   Prepare the data for machine learning by separating features (lemmatized reviews) and the target variable (sentiment).
    *   Vectorize the text data using TF-IDF (`TfidfVectorizer`) with a specified ngram range and maximum features.
    *   Split the data into training and testing sets.
    *   Train a Logistic Regression model on the TF-IDF transformed training data.
    *   Evaluate the performance of the Logistic Regression model on both the training and testing sets using accuracy scores.

5.  **Prediction on New Data**:
    *   Demonstrate how to preprocess new, unseen reviews using the defined preprocessing steps.
    *   Transform the preprocessed new reviews into TF-IDF features using the trained vectorizer.
    *   Predict the sentiment of the new reviews using the trained Logistic Regression model.

6.  **Model and Vectorizer Saving**:
    *   Save the trained Logistic Regression model and the TF-IDF vectorizer using `pickle` for future use.

## Libraries Used

*   `pandas`
*   `numpy`
*   `matplotlib.pyplot`
*   `nltk`
*   `afinn`
*   `gensim`
*   `scikit-learn` (imported as `sklearn`)

## Data

The project uses the `K8 Reviews.csv` dataset containing customer reviews and their corresponding sentiment labels.

## Usage

To run this project, execute the code cells in the notebook sequentially. The notebook performs data loading, preprocessing, lexicon-based and machine learning-based sentiment analysis, and demonstrates prediction on new data.

The trained model and vectorizer are saved as `classifier.pickle` and `tfidfmodel.pickle` respectively, allowing for the sentiment prediction of new reviews without retraining.

In [80]:
!pip install nltk
!pip install afinn
!pip install gensim



In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [82]:
data = pd.read_csv(r"/content/K8 Reviews.csv")

In [83]:
data.tail()
#data["sentiment"].value_counts()

Unnamed: 0,sentiment,review
14670,1,"I really like the phone, Everything is working..."
14671,1,The Lenovo K8 Note is awesome. It takes best p...
14672,1,Awesome Gaget.. @ this price
14673,1,This phone is nice processing will be successf...
14674,1,Good product but the pakeging was not enough.


In [84]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14675 entries, 0 to 14674
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  14675 non-null  int64 
 1   review     14675 non-null  object
dtypes: int64(1), object(1)
memory usage: 229.4+ KB


In [85]:
import string
import nltk

In [86]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [87]:
#Remove punctuation
def remove_punc(text):
  text_non_punc="".join([char for char in text if char not in string.punctuation])
  return text_non_punc
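
For long reviews, the same filtering can also be done with `str.translate`, which avoids building a list character by character. A minimal sketch, assuming only the standard library; the name `remove_punc_translate` is hypothetical and not part of the notebook:

```python
import string

# Translation table that deletes every character listed in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_translate(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

# Review text taken from the data.tail() output above.
print(remove_punc_translate("Good product but the pakeging was not enough."))
```

Like `remove_punc`, this deletes the characters without inserting spaces, so text joined only by punctuation (for example `ever.They`) still collapses into a single token.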

In [88]:
#Updating the data
data["punctuation_removed"]=data["review"].apply(remove_punc)
data.head()


Unnamed: 0,sentiment,review,punctuation_removed
0,1,Good but need updates and improvements,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr...",Worst mobile i have bought ever Battery is dra...
2,1,when I will get my 10% cash back.... its alrea...,when I will get my 10 cash back its already 15...
3,1,Good,Good
4,0,The worst phone everThey have changed the last...,The worst phone everThey have changed the last...


In [89]:
data["sentiment"].value_counts()

sentiment,count
0,7712
1,6963


In [90]:
#Tokenization
from nltk.tokenize import word_tokenize
def token(text):
    tokenized=word_tokenize(text)
    return tokenized

In [91]:
data["Tokenized"]=data["punctuation_removed"].apply(token)
data.head()

Unnamed: 0,sentiment,review,punctuation_removed,Tokenized
0,1,Good but need updates and improvements,Good but need updates and improvements,"[Good, but, need, updates, and, improvements]"
1,0,"Worst mobile i have bought ever, Battery is dr...",Worst mobile i have bought ever Battery is dra...,"[Worst, mobile, i, have, bought, ever, Battery..."
2,1,when I will get my 10% cash back.... its alrea...,when I will get my 10 cash back its already 15...,"[when, I, will, get, my, 10, cash, back, its, ..."
3,1,Good,Good,[Good]
4,0,The worst phone everThey have changed the last...,The worst phone everThey have changed the last...,"[The, worst, phone, everThey, have, changed, t..."


In [92]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [93]:
#POS tagging
def pos_tag(tokens):
    # Tag each token with its part of speech using NLTK's tagger
    # (nltk is already imported above).
    pos=nltk.pos_tag(tokens)
    return pos

In [94]:
POS=data["Tokenized"].apply(pos_tag)
POS

Unnamed: 0,Tokenized
0,"[(Good, JJ), (but, CC), (need, VBP), (updates,..."
1,"[(Worst, NNP), (mobile, NN), (i, NN), (have, V..."
2,"[(when, WRB), (I, PRP), (will, MD), (get, VB),..."
3,"[(Good, JJ)]"
4,"[(The, DT), (worst, JJS), (phone, NN), (everTh..."
...,...
14670,"[(I, PRP), (really, RB), (like, IN), (the, DT)..."
14671,"[(The, DT), (Lenovo, NNP), (K8, NNP), (Note, N..."
14672,"[(Awesome, NNP), (Gaget, NNP), (this, DT), (pr..."
14673,"[(This, DT), (phone, NN), (is, VBZ), (nice, JJ..."


In [95]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [96]:
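#Stop words removing
stop_words=stopwords.words("english")
def stop_word(tokens):
    # Keep only the tokens that are not in NLTK's English stop-word list.
    # The comparison is case-sensitive, so capitalised tokens such as "The" are kept.
    stop_words_removed=[word for word in tokens if word not in stop_words]
    return stop_words_removed
re_sw=data["Tokenized"].apply(stop_word)
data.head()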

Unnamed: 0,sentiment,review,punctuation_removed,Tokenized
0,1,Good but need updates and improvements,Good but need updates and improvements,"[Good, but, need, updates, and, improvements]"
1,0,"Worst mobile i have bought ever, Battery is dr...",Worst mobile i have bought ever Battery is dra...,"[Worst, mobile, i, have, bought, ever, Battery..."
2,1,when I will get my 10% cash back.... its alrea...,when I will get my 10 cash back its already 15...,"[when, I, will, get, my, 10, cash, back, its, ..."
3,1,Good,Good,[Good]
4,0,The worst phone everThey have changed the last...,The worst phone everThey have changed the last...,"[The, worst, phone, everThey, have, changed, t..."
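
Because the filter above compares tokens case-sensitively, a capitalised token such as `The` is not matched by a lower-case stop-word list. One possible variant lower-cases each token before the lookup. The sketch below assumes a tiny hand-written stop-word set (`DEMO_STOP_WORDS`, hypothetical) in place of NLTK's list so that it runs without any downloads:

```python
# Hypothetical stand-in for stopwords.words("english"); the real list is much longer.
DEMO_STOP_WORDS = {"the", "i", "have", "but", "and", "is", "my", "will"}

def remove_stop_words_ci(tokens, stop_words=DEMO_STOP_WORDS):
    """Drop tokens whose lower-cased form is in `stop_words` (a set gives fast lookups)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "worst", "phone", "I", "have", "bought"]
print(remove_stop_words_ci(tokens))  # ['worst', 'phone', 'bought']
```

Applying a change like this to the notebook would alter the `stemmed` and `lemmatized` columns, so the scores reported further down would need to be recomputed.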


In [97]:
#stemming using porterstemmer
import nltk
ps=nltk.PorterStemmer()
def stem(tokens):
    # Stem each token and join the stems back into a single string.
    stemming=" ".join([ps.stem(word) for word in tokens])
    return stemming
data["stemmed"]=re_sw.apply(stem)
data.head()

Unnamed: 0,sentiment,review,punctuation_removed,Tokenized,stemmed
0,1,Good but need updates and improvements,Good but need updates and improvements,"[Good, but, need, updates, and, improvements]",good need updat improv
1,0,"Worst mobile i have bought ever, Battery is dr...",Worst mobile i have bought ever Battery is dra...,"[Worst, mobile, i, have, bought, ever, Battery...",worst mobil bought ever batteri drain like hel...
2,1,when I will get my 10% cash back.... its alrea...,when I will get my 10 cash back its already 15...,"[when, I, will, get, my, 10, cash, back, its, ...",i get 10 cash back alreadi 15 januari
3,1,Good,Good,[Good],good
4,0,The worst phone everThey have changed the last...,The worst phone everThey have changed the last...,"[The, worst, phone, everThey, have, changed, t...",the worst phone everthey chang last phone prob...
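
To see why the `stemmed` column contains non-words such as `updat` and `batteri`, the stemmer can be called on a few individual tokens. A small sketch, assuming `nltk` is installed (the Porter stemmer itself needs no extra data downloads); the helper name `stem_tokens` is illustrative:

```python
from nltk.stem import PorterStemmer

ps_demo = PorterStemmer()

def stem_tokens(tokens):
    """Apply the Porter stemmer to each token and return the list of stems."""
    return [ps_demo.stem(tok) for tok in tokens]

# Tokens taken from the first two reviews shown above.
print(stem_tokens(["Good", "need", "updates", "improvements", "Battery"]))
```

Stems are truncated word forms rather than dictionary words, which is one reason lexicon lookups can miss them.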


In [98]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [99]:
#Lemmatizing
lm=nltk.WordNetLemmatizer()
def lemmatize(tokens):
    # Lemmatize each token (default noun part of speech), lower-case it,
    # and join the results back into a single string.
    lemmatized=" ".join([lm.lemmatize(word).lower() for word in tokens])
    return lemmatized
data["lemmatized"]=re_sw.apply(lemmatize)
data.head()
#data.to_csv("data_cleaned.csv",index=0)

Unnamed: 0,sentiment,review,punctuation_removed,Tokenized,stemmed,lemmatized
0,1,Good but need updates and improvements,Good but need updates and improvements,"[Good, but, need, updates, and, improvements]",good need updat improv,good need update improvement
1,0,"Worst mobile i have bought ever, Battery is dr...",Worst mobile i have bought ever Battery is dra...,"[Worst, mobile, i, have, bought, ever, Battery...",worst mobil bought ever batteri drain like hel...,worst mobile bought ever battery draining like...
2,1,when I will get my 10% cash back.... its alrea...,when I will get my 10 cash back its already 15...,"[when, I, will, get, my, 10, cash, back, its, ...",i get 10 cash back alreadi 15 januari,i get 10 cash back already 15 january
3,1,Good,Good,[Good],good,good
4,0,The worst phone everThey have changed the last...,The worst phone everThey have changed the last...,"[The, worst, phone, everThey, have, changed, t...",the worst phone everthey chang last phone prob...,the worst phone everthey changed last phone pr...


In [100]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [101]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from afinn import Afinn
analyser = SentimentIntensityAnalyzer()

afinnscore = Afinn(emoticons = True)

In [102]:
def get_vader_sentiment(text):
    return analyser.polarity_scores(text)['compound']
def get_afinn_sentiment(text):
    return afinnscore.score(text)

In [103]:
#vader and afinn score for stemmed data
data['stemm_score_vader'] = data["stemmed"].apply(get_vader_sentiment)
data['stemm_score_afinn'] = data["stemmed"].apply(get_afinn_sentiment)


#vader and afinn score for lemmatized data
data['lemm_score_vader'] = data["lemmatized"].apply(get_vader_sentiment)
data['lemm_score_afinn'] = data["lemmatized"].apply(get_afinn_sentiment)

data.head()


Unnamed: 0,sentiment,review,punctuation_removed,Tokenized,stemmed,lemmatized,stemm_score_vader,stemm_score_afinn,lemm_score_vader,lemm_score_afinn
0,1,Good but need updates and improvements,Good but need updates and improvements,"[Good, but, need, updates, and, improvements]",good need updat improv,good need update improvement,0.4404,3.0,0.7096,5.0
1,0,"Worst mobile i have bought ever, Battery is dr...",Worst mobile i have bought ever Battery is dra...,"[Worst, mobile, i, have, bought, ever, Battery...",worst mobil bought ever batteri drain like hel...,worst mobile bought ever battery draining like...,-0.6973,-8.0,-0.6973,-8.0
2,1,when I will get my 10% cash back.... its alrea...,when I will get my 10 cash back its already 15...,"[when, I, will, get, my, 10, cash, back, its, ...",i get 10 cash back alreadi 15 januari,i get 10 cash back already 15 january,0.0,0.0,0.0,0.0
3,1,Good,Good,[Good],good,good,0.4404,3.0,0.4404,3.0
4,0,The worst phone everThey have changed the last...,The worst phone everThey have changed the last...,"[The, worst, phone, everThey, have, changed, t...",the worst phone everthey chang last phone prob...,the worst phone everthey changed last phone pr...,-0.7964,-7.0,-0.8357,-7.0


In [104]:
#classifying vader and afinn based on threshold for stemmed
#(note: the raw scores in these columns are overwritten by the 0/1 labels)
data['stemm_score_vader'] = data.stemm_score_vader.apply(lambda x:1 if x>0.1 else 0)
data['stemm_score_afinn'] = data.stemm_score_afinn.apply(lambda x:1 if x>0 else 0)
#classifying vader and afinn based on threshold for lemmatized
data['lemm_score_vader'] = data.lemm_score_vader.apply(lambda x:1 if x>0.1 else 0)
data['lemm_score_afinn'] = data.lemm_score_afinn.apply(lambda x:1 if x>0 else 0)
data.head()

Unnamed: 0,sentiment,review,punctuation_removed,Tokenized,stemmed,lemmatized,stemm_score_vader,stemm_score_afinn,lemm_score_vader,lemm_score_afinn
0,1,Good but need updates and improvements,Good but need updates and improvements,"[Good, but, need, updates, and, improvements]",good need updat improv,good need update improvement,1,1,1,1
1,0,"Worst mobile i have bought ever, Battery is dr...",Worst mobile i have bought ever Battery is dra...,"[Worst, mobile, i, have, bought, ever, Battery...",worst mobil bought ever batteri drain like hel...,worst mobile bought ever battery draining like...,0,0,0,0
2,1,when I will get my 10% cash back.... its alrea...,when I will get my 10 cash back its already 15...,"[when, I, will, get, my, 10, cash, back, its, ...",i get 10 cash back alreadi 15 januari,i get 10 cash back already 15 january,0,0,0,0
3,1,Good,Good,[Good],good,good,1,1,1,1
4,0,The worst phone everThey have changed the last...,The worst phone everThey have changed the last...,"[The, worst, phone, everThey, have, changed, t...",the worst phone everthey chang last phone prob...,the worst phone everthey changed last phone pr...,0,0,0,0
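
The thresholding step can be sketched on the five VADER compound scores shown for the lemmatized reviews above. This version writes the binary labels to a separate column (the name `lemm_label_vader` is an assumption, not a column from the notebook) so that the raw scores remain available:

```python
import pandas as pd

# Compound scores copied from the `lemm_score_vader` column of the first five rows.
demo = pd.DataFrame({"lemm_score_vader": [0.7096, -0.6973, 0.0, 0.4404, -0.8357]})

VADER_THRESHOLD = 0.1  # same cut-off as the notebook: score > 0.1 counts as positive

# The boolean comparison cast to int gives 1 (positive) or 0 (negative).
demo["lemm_label_vader"] = (demo["lemm_score_vader"] > VADER_THRESHOLD).astype(int)
print(demo)
```

A neutral score of exactly 0.0, as in the third review, falls below the threshold and is therefore labelled 0 even though its true label is 1.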


In [105]:
#Accuracy
from sklearn.metrics import confusion_matrix, accuracy_score
print("Accuracy of vader score for stemmed data",accuracy_score(data.sentiment,
data.stemm_score_vader))
print("Accuracy of afinn score for stemmed data",accuracy_score(data.sentiment,
data.stemm_score_afinn))

print("Accuracy of vader score for lemmatized data",accuracy_score(data.sentiment,
data.lemm_score_vader))
print("Accuracy of afinn score for lemmatized data",accuracy_score(data.sentiment,
data.lemm_score_afinn))

Accuracy of vader score for stemmed data 0.7336967632027257
Accuracy of afinn score for stemmed data 0.7204088586030665
Accuracy of vader score for lemmatized data 0.7600681431005111
Accuracy of afinn score for lemmatized data 0.7524361158432709
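
Accuracy alone hides which class the lexicon gets wrong. `confusion_matrix` is imported above but not used; a minimal sketch of how it could be applied, using only the five labels and thresholded VADER predictions visible in the `data.head()` output rather than the full dataset:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# True labels and thresholded VADER labels for the first five rows shown above.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))
```

On these five rows the only error is a positive review predicted as negative, which appears in the bottom-left cell of the matrix.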


In [106]:
#Separating the features (lemmatized reviews) from the target (sentiment)
X = data.lemmatized.values
y = data.sentiment.values
print(X)

['good need update improvement'
 'worst mobile bought ever battery draining like hell backup 6 7 hour internet us even i put mobile idle getting dischargedthis biggest lie amazon lenove expected making full saying battery 4000mah booster charger fake take least 4 5 hour fully chargeddont know lenovo survive making full usplease dont go else regret like'
 'i get 10 cash back already 15 january' ... 'awesome gaget price'
 'this phone nice processing successful dual camera successfully dual mod'
 'good product pakeging enough']


In [107]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [108]:
vectorizer = TfidfVectorizer(ngram_range=(1,2),max_features=100)
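
To illustrate what `ngram_range=(1,2)` and `max_features` do, the sketch below fits a vectorizer on three of the lemmatized reviews printed above. The cap of 5 features is chosen only to keep the example small; the notebook itself uses 100.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "good need update improvement",
    "awesome gaget price",
    "good product pakeging enough",
]

# Unigrams and bigrams, no cap: every distinct 1- and 2-word sequence becomes a feature.
full = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(full.vocabulary_), "features without a cap")

# With max_features, only the most frequent terms across the corpus are kept.
capped = TfidfVectorizer(ngram_range=(1, 2), max_features=5)
X_demo = capped.fit_transform(docs)
print(X_demo.shape)
print(sorted(capped.vocabulary_))
```

With a low cap, many terms are discarded, and a review that contains none of the retained terms is transformed into an all-zero row.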

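Word2Vec is imported below as a possible alternative to TF-IDF for vectorizing, but the remaining cells vectorize with TF-IDF and train Logistic Regression on those features.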


In [109]:
from gensim.models import Word2Vec

In [110]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.1,random_state=42)


In [111]:
X_train_bow = vectorizer.fit_transform(x_train)
X_test_bow = vectorizer.transform(x_test)


In [112]:
print(X_test_bow)

  (0, 5)	0.7830613590825709
  (0, 76)	0.6219444572562385
  (1, 3)	0.841729023499503
  (1, 6)	0.539900223188112
  (2, 5)	0.23054130569174114
  (2, 6)	0.1628675221069598
  (2, 13)	0.1662647777434439
  (2, 15)	0.2678822653882503
  (2, 32)	0.24788431569203584
  (2, 43)	0.27796616511110533
  (2, 45)	0.25460165342085755
  (2, 47)	0.235174776557976
  (2, 48)	0.2529065463967829
  (2, 50)	0.19601890847674733
  (2, 51)	0.26229250426330125
  (2, 65)	0.22666980605495934
  (2, 67)	0.2271767127560297
  (2, 70)	0.12945710363656793
  (2, 81)	0.2558757424960804
  (2, 86)	0.2726362809640473
  (2, 90)	0.28706250074169265
  (2, 98)	0.22122631450548505
  (3, 2)	0.34784492673444467
  (3, 13)	0.23146291028792398
  (3, 20)	0.40557617607263224
  :	:
  (1465, 40)	0.5080633611200696
  (1465, 70)	0.24447430806826576
  (1466, 70)	0.48635779902316556
  (1466, 89)	0.8737597446262572
  (1467, 6)	0.07276103004170432
  (1467, 12)	0.4749425999011228
  (1467, 13)	0.2228362597729554
  (1467, 15)	0.1196763437387703
  (1467

In [113]:
X_train_bow.shape, X_test_bow.shape


((13207, 100), (1468, 100))

In [114]:
from sklearn.linear_model import LogisticRegression


In [115]:
log_reg = LogisticRegression(max_iter = 10000000)


In [116]:
log_reg.fit(X_train_bow,y_train)


In [117]:
y_test_pred = log_reg.predict(X_test_bow)
y_train_pred = log_reg.predict(X_train_bow)

In [118]:
from sklearn.metrics import confusion_matrix, accuracy_score
# First printed value: test accuracy; second printed value: training accuracy
print(accuracy_score(y_test, y_test_pred))
print(accuracy_score(y_train, y_train_pred))

0.8337874659400545
0.8188839251911865
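
The vectorizer and classifier can also be chained in a scikit-learn `Pipeline`, so that fitting and prediction always apply the same transformation. This is an alternative arrangement, not what the notebook does; the four training sentences and their labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = positive, 0 = negative); not taken from K8 Reviews.csv.
toy_texts = ["good phone nice camera", "awesome battery good price",
             "worst phone battery drain", "bad screen worst experience"]
toy_labels = [1, 1, 0, 0]

sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=100)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns the TF-IDF vocabulary and the classifier weights in one call.
sentiment_pipeline.fit(toy_texts, toy_labels)

# predict() vectorizes raw strings with the fitted vocabulary before classifying.
toy_pred = sentiment_pipeline.predict(["good camera", "worst battery"])
print(toy_pred)
```

A single pipeline object also means only one file would need to be saved and reloaded, instead of a separate model and vectorizer.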


In [119]:
# Sample new reviews
new_reviews = [
    "The phone is amazing! Battery life is great.",
    "Worst experience ever. The screen broke in two days.",
    "It is okay, but could be better.",
    "amazing and worst"
]

# Step 1: Preprocess the new reviews (reuses remove_punc, word_tokenize, stop_words and lm from above)
def preprocess_text(text):
    text = remove_punc(text)  # Remove punctuation
    text = word_tokenize(text)  # Tokenize text
    # Lower-case each token, then drop stopwords (the training cells filtered before lower-casing)
    text = [word.lower() for word in text if word.lower() not in stop_words]
    text = [lm.lemmatize(word) for word in text]  # Lemmatization
    return " ".join(text)  # Convert list back to string

# Apply preprocessing
processed_reviews = [preprocess_text(review) for review in new_reviews]

# Step 2: Convert new reviews into TF-IDF features
new_reviews_tfidf = vectorizer.transform(processed_reviews)

# Step 3: Predict sentiment (0 = Negative, 1 = Positive)
predictions = log_reg.predict(new_reviews_tfidf)

# Print results
for review, sentiment in zip(new_reviews, predictions):
    print(f"Review: {review}\nPredicted Sentiment: {'Positive' if sentiment == 1 else 'Negative'}\n")

Review: The phone is amazing! Battery life is great.
Predicted Sentiment: Positive

Review: Worst experience ever. The screen broke in two days.
Predicted Sentiment: Negative

Review: It is okay, but could be better.
Predicted Sentiment: Negative

Review: amazing and worst
Predicted Sentiment: Negative
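
For mixed reviews such as "amazing and worst", the hard 0/1 label hides how close the decision was. `LogisticRegression.predict_proba` returns class probabilities that could be inspected instead. A sketch on made-up toy data (the texts, labels, and variable names below are assumptions, not the notebook's trained model):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus (1 = positive, 0 = negative).
texts = ["amazing phone great battery", "amazing camera",
         "worst phone ever", "worst battery broke"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

# Columns follow clf.classes_, i.e. [P(label 0), P(label 1)].
proba = clf.predict_proba(vec.transform(["amazing worst", "amazing camera"]))
for row in np.round(proba, 3):
    print(row)
```

Probabilities near 0.5 could be reported as "uncertain" rather than forced into one class, if the application allows a third outcome.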



In [120]:
import pickle

In [121]:
with open('classifier.pickle','wb') as f:
    pickle.dump(log_reg,f)

In [122]:
with open('tfidfmodel.pickle','wb') as f:
    pickle.dump(vectorizer,f)
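
The saved files can be loaded back in a later session to score new text without retraining. The round trip below is a self-contained sketch: it fits a toy model on made-up sentences and writes to a temporary directory, so the file names match the notebook but the model does not.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the real training set.
texts = ["good phone", "awesome camera", "worst battery", "bad screen"]
labels = [1, 1, 0, 0]
vectorizer_demo = TfidfVectorizer()
model_demo = LogisticRegression(max_iter=1000).fit(
    vectorizer_demo.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    clf_path = os.path.join(tmp, "classifier.pickle")
    vec_path = os.path.join(tmp, "tfidfmodel.pickle")

    # Save both objects, as in the cells above.
    with open(clf_path, "wb") as f:
        pickle.dump(model_demo, f)
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer_demo, f)

    # Load them back; the vectorizer must be the fitted one used at training time.
    with open(clf_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

new_text = ["good camera bad battery"]
before = model_demo.predict(vectorizer_demo.transform(new_text))
after = loaded_model.predict(loaded_vectorizer.transform(new_text))
print(before, after)
```

New text should pass through the same preprocessing as the training data (as `preprocess_text` does above) before it is handed to the loaded vectorizer.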