# Introduction

**Sentiment analysis is the process of computationally identifying and categorizing the opinions expressed in a piece of text, and determining whether the attitude towards a product or topic is positive, negative, or neutral. It is one of the main applications of Natural Language Processing (NLP).**

# Objective

**In this kernel, let's walk through the basic steps of building a simple sentiment analysis system that predicts whether a news headline is positive or negative.**

Steps followed

* Importing Dataset and required libraries
* Text Data pre-processing (Cleaning)
* One Hot representation (converting words to vectors)
* Padding sequence and embedding layer
* Creating LSTM model


# A] Importing Dataset and libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/covid19-india-news-headlines-for-nlp/raw_data.csv',usecols=['Headline','Sentiment','Description'])
df.head()

**Here, positive sentiment is represented by '1' and negative sentiment by '0'. Sentiment is our target variable, and the remaining columns are the independent (X) variables.**

In [None]:
df.shape

In [None]:
df.info()

# B] Text Data pre-processing and cleaning

* 1) The first step is to remove stop words such as 'the', 'of', 'is', and 'a' from every sentence, since they carry little signal for the model. We then stem each remaining word, stripping its suffix to reduce it to a root form, e.g. history → histori.

* 2) Stemming brings uniformity to the corpus: many words share a meaning but differ in suffix, so we reduce them to one root. The stems it produces are not always dictionary words, so lemmatization, which returns dictionary forms, is sometimes used instead. Stemming is common in sentiment analysis, while lemmatization is often preferred in Q/A applications and chatbots.

* 3) We also convert all words to the same case (lowercase here) so that the same word written in different cases is not treated as two distinct words.

* 4) We use the NLTK library, the regular expressions (`re`) library, and `PorterStemmer` for stemming.
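
The steps in the list above can be tried on a single headline. The following is a minimal sketch, not the notebook's pipeline: the stop-word set is a tiny hand-picked sample, and `toy_stem` is a hypothetical, deliberately naive suffix stripper standing in for NLTK's `PorterStemmer` so that the snippet runs without any downloads. The example headline is made up.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"the", "of", "is", "a", "in", "to", "and"}

def toy_stem(word):
    """Naive suffix stripper, a stand-in for a real stemmer such as PorterStemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of root remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_headline(text):
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # replace digits and punctuation with spaces
    tokens = letters_only.lower().split()           # normalise case, then tokenise on whitespace
    # Drop stop words first, then stem what is left.
    return " ".join(toy_stem(t) for t in tokens if t not in STOP_WORDS)

example = clean_headline("The number of COVID-19 cases is rising in 5 states")
print(example)  # number covid cas ris stat
```

Stems such as `cas` and `ris` are not dictionary words, which is the trade-off described in step 2.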

**First, let's define the independent and dependent variables.**

In [None]:
X=df.drop('Sentiment',axis=1)
y=df['Sentiment']

In [None]:
messages=X.copy()

In [None]:
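import re
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch the stop-word list if it is not already installed
ps=PorterStemmer()
stop_words=set(stopwords.words('english'))  # build the set once instead of once per word
corpus=[]
for headline in messages['Headline']:
    review=re.sub('[^a-zA-Z]',' ',headline)  # keep letters only
    review=review.lower().split()  # lowercase and tokenise
    review=[ps.stem(word) for word in review if word not in stop_words]  # drop stop words, then stem
    corpus.append(' '.join(review))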

In [None]:
corpus

# C] One Hot Representation

One of the most important steps in NLP is converting text into numbers that a model can work with. One such word representation technique is the One Hot representation, where each word is assigned an index based on a fixed vocabulary size. Keras' `one_hot` assigns that index by hashing the word, so two different words can occasionally share an index. You can refer to my Medium article for the intuition: https://medium.com/p/6b633f296337/edit

One disadvantage is that semantic information is not captured, and the representation is very large. It does not treat 'good' and 'great' as similar.
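
To make the index assignment concrete, here is a rough sketch of hashing words into a fixed vocabulary. It assumes an MD5-based hash so that the result is repeatable across runs; `word_to_index` and `encode` are illustrative names, and this is not necessarily how Keras' `one_hot` computes its indices internally.

```python
import hashlib

def word_to_index(word, voc_size):
    """Map a word to an integer in [1, voc_size - 1]; 0 is left free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (voc_size - 1) + 1

def encode(sentence, voc_size=5000):
    # Lowercase and split on whitespace, then hash each word independently.
    return [word_to_index(w, voc_size) for w in sentence.lower().split()]

encoded = encode("good news great news")
print(encoded)
# Both occurrences of "news" receive the same index, while "good" and "great"
# get unrelated indices: the encoding carries no notion of similar meaning.
```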

In [None]:
voc_size=5000

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

In [None]:
onehot_repr=[one_hot(words,voc_size) for words in corpus]
onehot_repr

# D] Padding sequence and embedding layer

**Before passing the One Hot representation to the embedding layer, we need to make sure all sentences have the same length. We first fix a sentence length, then pre-pad shorter sentences with zeros until they reach it (longer sentences are truncated).**

In [None]:
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)
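
What the padding step does to a single sequence can be sketched in plain Python. `pre_pad` is a hypothetical helper written for illustration; it mirrors the `padding='pre'` setting used above together with the default behaviour of `pad_sequences`, which drops values from the start of sequences that are too long.

```python
def pre_pad(seq, maxlen, value=0):
    """Left-pad seq with `value` up to maxlen; keep only the last maxlen items if longer."""
    seq = list(seq)[-maxlen:]                     # truncate from the front when too long
    return [value] * (maxlen - len(seq)) + seq    # fill the front with zeros when too short

print(pre_pad([5, 9, 2], 5))               # [0, 0, 5, 9, 2]
print(pre_pad([1, 2, 3, 4, 5, 6, 7], 5))   # [3, 4, 5, 6, 7]
```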

# E] LSTM Model building

In [None]:
from tensorflow.keras.layers import Dropout
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Dropout(0.5))
model.add(LSTM(200))
model.add(Dropout(0.5))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
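
The `Embedding` layer at the start of the model is, in essence, a trainable lookup table: each integer index selects one row of `embedding_vector_features` numbers. The sketch below illustrates only that lookup with the standard library. The table values are random stand-ins for weights that Keras would learn during training, and the padded sequence is hypothetical.

```python
import random

voc_size = 5000       # same vocabulary size as in the notebook
embedding_dim = 40    # same as embedding_vector_features
rng = random.Random(0)

# One row of embedding_dim floats per vocabulary index (random, untrained values).
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(voc_size)]

def embed(padded_sequence):
    """Replace every index in the sequence with its row from the table."""
    return [embedding_table[idx] for idx in padded_sequence]

padded = [0, 0, 0, 4021, 17, 933]   # hypothetical pre-padded headline
vectors = embed(padded)
print(len(vectors), len(vectors[0]))  # 6 40
```

Every padding position (index 0) maps to the same row, so a headline of length `sent_length` becomes a `sent_length` by 40 matrix that the LSTM then reads step by step.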

In [None]:
import numpy as np

X_final=np.array(embedded_docs)
y_final=np.array(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_final,y_final,test_size=0.33,random_state=42)

In [None]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

In [None]:
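# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
y_pred=(model.predict(X_test)>0.5).astype('int32')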

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))
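
The final cell turns the model's sigmoid outputs into class labels and scores them against the true labels. As a rough illustration of those two steps without TensorFlow or scikit-learn, assuming a 0.5 decision threshold and made-up probabilities (`to_labels` and `accuracy` are hypothetical helpers):

```python
def to_labels(probabilities, threshold=0.5):
    """Label 1 (positive) when the probability exceeds the threshold, else 0 (negative)."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

probs = [0.91, 0.12, 0.55, 0.40]      # hypothetical model outputs
labels = to_labels(probs)
print(labels)                          # [1, 0, 1, 0]
print(accuracy([1, 0, 0, 0], labels))  # 0.75
```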