# **NLP Sentiment Analysis**

<br>

---

**Problem statement:** Develop a deep learning model using TensorFlow and Keras to perform sentiment analysis on a dataset of tweets related to various candidates.

<br>

---

# Data Loading

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import re
import math

In [None]:
# Load and read the datasets
dataset = pd.read_csv(r"/content/drive/MyDrive/Colab Notebooks/Sentiment.csv", encoding="latin")

In [None]:
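# Keep only the columns needed for the analysis
# (.copy() avoids pandas' SettingWithCopyWarning on later assignments)
dataset1 = dataset[['candidate','sentiment','text']].copy()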

In [None]:
# Print the head of dataset
dataset1.head()

| | candidate | sentiment | text |
|---|---|---|---|
| 0 | No candidate mentioned | Neutral | RT @NancyLeeGrahn: How did everyone feel about... |
| 1 | Scott Walker | Positive | RT @ScottWalker: Didn't catch the full #GOPdeb... |
| 2 | No candidate mentioned | Neutral | RT @TJMShow: No mention of Tamir Rice and the ... |
| 3 | No candidate mentioned | Positive | RT @RobGeorge: That Carly Fiorina is trending ... |
| 4 | Donald Trump | Positive | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... |


# Data Preprocessing

Data preprocessing is crucial in natural language processing (NLP) models to ensure that the text data is in a suitable format for analysis and modeling.

<br>

This includes converting the text to lowercase and removing punctuation and stopwords, so that words and characters that carry little meaning for sentiment are dropped before modeling.
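As a rough illustration, the three steps can be chained in a single function. This is a minimal sketch, not the notebook's exact code: `preprocess` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and the sample tweet is made up.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOPWORDS = {"how", "did", "about", "the", "a", "to"}

def preprocess(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, and drop stopwords from one tweet."""
    text = text.lower()                      # step 1: lowercase
    text = re.sub(r"[^\w\s]", "", text)      # step 2: remove punctuation
    tokens = [word for word in text.split() if word not in stopwords]  # step 3
    return " ".join(tokens)

# Made-up sample tweet
sample = "RT @SomeUser: Didn't catch the full #GOPDebate!"
print(preprocess(sample))
```

Note that removing punctuation also strips the `@` and `#` markers and the apostrophe in contractions, so mentions, hashtags, and words like "didn't" become plain tokens.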



In [None]:
# Convert all the text to lowercase
dataset1['text'] = dataset1['text'].apply(lambda x: x.lower())



In [None]:
# Remove punctuation
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

dataset1['text'] = dataset1['text'].apply(remove_punctuation)



In [None]:
# Remove stopwords
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
def remove_stop_words(text):
  tokens = text.split()
  tokens = [word for word in tokens if word not in stop]
  cleaned_text = ' '.join(tokens)
  return cleaned_text

dataset1['text'] = dataset1['text'].apply(remove_stop_words)



In [None]:
# Updated dataset
dataset1.head()

| | candidate | sentiment | text |
|---|---|---|---|
| 0 | No candidate mentioned | Neutral | rt nancyleegrahn everyone feel climate change ... |
| 1 | Scott Walker | Positive | rt scottwalker didnt catch full gopdebate last... |
| 2 | No candidate mentioned | Neutral | rt tjmshow mention tamir rice gopdebate held c... |
| 3 | No candidate mentioned | Positive | rt robgeorge carly fiorina trending hours deba... |
| 4 | Donald Trump | Positive | rt danscavino gopdebate w realdonaldtrump deli... |


# Tokenization and Padding

The preprocessed text data is converted into a numerical format using tokenization and padding so that it can be fed into a deep learning model.
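To make the two steps concrete, here is a pure-Python sketch of what they do conceptually. It is not the Keras implementation: `build_word_index`, `texts_to_sequences`, and `pad_post` are hypothetical helpers, and the two sample texts are made up. The sketch assumes the conventions the Keras utilities follow: more frequent words get lower indices, index 0 is reserved for padding, and sequences longer than `maxlen` lose tokens from the front (the `pad_sequences` default, `truncating='pre'`).

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first (indices start at 1)."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_post(seq, maxlen):
    """Truncate from the front, then pad with zeros at the end."""
    seq = seq[-maxlen:]                      # keep the last `maxlen` tokens
    return seq + [0] * (maxlen - len(seq))   # zero-pad on the right

# Made-up sample texts
texts = ["rt gopdebate trump", "gopdebate fox news debate last night"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_post(seq, maxlen=4) for seq in sequences]
print(word_index)
print(padded)
```

With `padding='post'` and the default truncation, a tweet longer than `maxlen` keeps its final tokens rather than its first ones; passing `truncating='post'` to `pad_sequences` would keep the beginning instead.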

In [None]:
dataset1['text'][0]

'rt nancyleegrahn everyone feel climate change question last night exactly gopdebate'

In [None]:
# Perform tokenization
from keras.preprocessing.text import Tokenizer

def tokenize_tweets(sentences):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    return tokenizer

tokenizer = tokenize_tweets(dataset1['text'])
word_index = tokenizer.word_index

word_index

{'gopdebate': 1,
 'rt': 2,
 'gopdebates': 3,
 'rwsurfergirl': 4,
 'ãåââºãåââ': 5,
 'trump': 6,
 'fox': 7,
 'realdonaldtrump': 8,
 'debate': 9,
 'amp': 10,
 'news': 11,
 'last': 12,
 'candidates': 13,
 'like': 14,
 'gop': 15,
 'megynkelly': 16,
 'night': 17,
 'dont': 18,
 'people': 19,
 'foxnews': 20,
 'jeb': 21,
 'bush': 22,
 'ãââ': 23,
 'one': 24,
 'would': 25,
 'think': 26,
 'im': 27,
 'get': 28,
 'republican': 29,
 'president': 30,
 'god': 31,
 'chris': 32,
 'donald': 33,
 'cruz': 34,
 'need': 35,
 'ask': 36,
 'rubio': 37,
 'want': 38,
 'really': 39,
 'questions': 40,
 'question': 41,
 'said': 42,
 'know': 43,
 'time': 44,
 'carson': 45,
 'next': 46,
 'watching': 47,
 'candidate': 48,
 'huckabee': 49,
 'wallace': 50,
 'tedcruz': 51,
 'right': 52,
 'doesnt': 53,
 'nights': 54,
 'women': 55,
 'tonight': 56,
 'anyone': 57,
 'job': 58,
 'see': 59,
 'didnt': 60,
 'fair': 61,
 'us': 62,
 'megyn': 63,
 'trying': 64,
 'say': 65,
 'face': 66,
 'hear': 67,
 'america': 68,
 'tcot': 69,
 'ameri

In [None]:
sequences = tokenizer.texts_to_sequences(dataset1['text'])

In [None]:
# Perform padding
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, maxlen=15, padding='post')

dataset1['text'] = list(padded_sequences)



In [None]:
dataset1.head()

| | candidate | sentiment | text |
|---|---|---|---|
| 0 | No candidate mentioned | Neutral | [2, 2427, 246, 351, 318, 272, 41, 12, 17, 909,... |
| 1 | Scott Walker | Positive | [2, 256, 60, 1827, 486, 1, 12, 17, 5058, 97, 1... |
| 2 | No candidate mentioned | Neutral | [2, 7854, 394, 5059, 5060, 1, 1676, 583, 414, ... |
| 3 | No candidate mentioned | Positive | [2, 7855, 208, 153, 5061, 621, 9, 102, 7856, 1... |
| 4 | Donald Trump | Positive | [2, 1156, 1, 212, 8, 2009, 1557, 150, 555, 99,... |


In [None]:
# One-hot encoding
from sklearn.preprocessing import OneHotEncoder

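# Note: scikit-learn 1.2+ uses `sparse_output`; the older `sparse` argument was removed in 1.4
onehot_encoder = OneHotEncoder(sparse_output=False)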
onehot_encoded = onehot_encoder.fit_transform(dataset1[['sentiment']])

dataset1['ohe'] = list(onehot_encoded)



In [None]:
# Updated dataset
dataset1.head()

| | candidate | sentiment | text | ohe |
|---|---|---|---|---|
| 0 | No candidate mentioned | Neutral | [2, 2427, 246, 351, 318, 272, 41, 12, 17, 909,... | [0.0, 1.0, 0.0] |
| 1 | Scott Walker | Positive | [2, 256, 60, 1827, 486, 1, 12, 17, 5058, 97, 1... | [0.0, 0.0, 1.0] |
| 2 | No candidate mentioned | Neutral | [2, 7854, 394, 5059, 5060, 1, 1676, 583, 414, ... | [0.0, 1.0, 0.0] |
| 3 | No candidate mentioned | Positive | [2, 7855, 208, 153, 5061, 621, 9, 102, 7856, 1... | [0.0, 0.0, 1.0] |
| 4 | Donald Trump | Positive | [2, 1156, 1, 212, 8, 2009, 1557, 150, 555, 99,... | [0.0, 0.0, 1.0] |


# Model Development

A deep learning model has been developed using TensorFlow and Keras.

<br>

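The model consists of a frozen Embedding layer initialized with pretrained GloVe vectors, a SpatialDropout1D layer to reduce overfitting, two stacked LSTM layers for sequence processing, a hidden Dense layer, and a softmax Dense output layer. Dropout is applied after the first LSTM layer and after the hidden Dense layer. The model classifies each tweet into one of three sentiment categories.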
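As a sanity check on the architecture, the parameter counts that `model.summary()` reports can be reproduced by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers with the sizes from this notebook (vocabulary of 20004, 300-dimensional vectors as shown in the summary output). The helper names are hypothetical and not part of Keras.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embedding_dim = 20004, 300

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "lstm_128": lstm_params(embedding_dim, 128),
    "lstm_64": lstm_params(128, 64),
    "dense_64": dense_params(64, 64),
    "dense_3": dense_params(64, 3),
}

for name, n in counts.items():
    print(f"{name}: {n}")
```

Because the Embedding layer is created with `trainable=False`, its roughly six million weights stay fixed during training; only the LSTM and Dense weights are updated.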

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense, Dropout

vocab_size = 20004
max_length = 15
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')

# Load the pretrained GloVe vectors into a word -> vector lookup
embeddings_index = {}

with open('/content/drive/MyDrive/Colab Notebooks/glove.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Infer the embedding dimension from the last vector read
embedding_dim = len(coefs)
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Copy each known word's vector into its row; words missing from GloVe stay all-zero
for word, i in tokenizer.word_index.items():
    if i < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, weights=[embedding_matrix], input_length=max_length, trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(64))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 15, 300)           6001200   
                                                                 
 spatial_dropout1d_2 (Spati  (None, 15, 300)           0         
 alDropout1D)                                                    
                                                                 
 lstm_3 (LSTM)               (None, 15, 128)           219648    
                                                                 
 dropout_2 (Dropout)         (None, 15, 128)           0         
                                                                 
 lstm_4 (LSTM)               (None, 64)                49408     
                                                                 
 dense_2 (Dense)             (None, 64)                4160      
                                                      

In [None]:
x = dataset1['text'].to_numpy()
y = dataset1['ohe'].to_numpy()

x = np.array([arr.tolist() for arr in x])
y = np.array([arr.tolist() for arr in y])

# Model Training and Evaluation

The model has been trained on the processed text data, using categorical cross-entropy as the loss function, and accuracy as the evaluation metric.

<br>

The data is split 70/30 into training and test sets, and the test set is passed as validation data during training so that overfitting can be monitored after each epoch.
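To show what the loss measures, here is a small pure-Python sketch of categorical cross-entropy for a single example. The function name and the `eps` clipping value are assumptions made for illustration, not the Keras internals. The probabilities are rounded from the first two rows of `preds` shown further down, both of which have the true label `[0, 0, 1]`.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Loss for one example: minus the log of the probability given to the true class."""
    # eps guards against log(0); the exact value is an assumption of this sketch
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 0.0, 1.0]

# Rounded from the first two rows of `preds`
pred_correct = [0.0728, 0.2829, 0.6443]   # most probability on the true class
pred_wrong = [0.1087, 0.8913, 0.0001]     # confident in the wrong class

loss_correct = categorical_cross_entropy(y_true, pred_correct)
loss_wrong = categorical_cross_entropy(y_true, pred_wrong)
print(loss_correct, loss_wrong)
```

The loss is small when the model puts most of its probability on the true class and grows sharply when it is confident in a wrong class, which is why a few confident mistakes can dominate the average loss.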

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [None]:
y_test

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [None]:
model.fit(X_train, y_train, epochs=15, validation_data=(X_test,y_test), batch_size=32)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7dad7351e350>

In [None]:
preds = model.predict(X_test)



In [None]:
preds

array([[7.2773539e-02, 2.8288287e-01, 6.4434350e-01],
       [1.0865737e-01, 8.9126718e-01, 7.5433381e-05],
       [3.9899576e-02, 5.2288465e-02, 9.0781188e-01],
       ...,
       [9.7804499e-01, 1.9305365e-02, 2.6496907e-03],
       [9.3570985e-02, 8.7061429e-01, 3.5814732e-02],
       [9.9854457e-01, 1.4091266e-03, 4.6284713e-05]], dtype=float32)

In [None]:
y_test

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [None]:
preds_labels = np.argmax(preds, axis=1)
preds_labels

array([2, 1, 2, ..., 0, 1, 0])

In [None]:
y_test_labels = np.argmax(y_test, axis=1)
y_test_labels

array([2, 2, 2, ..., 0, 1, 0])

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

accuracy = accuracy_score(y_test_labels, preds_labels)

In [None]:
accuracy

0.664103796251802
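Accuracy alone hides how the model does on each class. A minimal sketch of per-class precision, recall, and F1 follows, written in pure Python so it runs without extra dependencies. `per_class_metrics` is a hypothetical helper, and the six labels are only the entries visible in the truncated arrays above, not the full test set, so the numbers are illustrative. In the notebook itself, the already imported `precision_score`, `recall_score`, and `f1_score` could be called on `y_test_labels` and `preds_labels` instead.

```python
def per_class_metrics(y_true, y_pred, num_classes=3):
    """Compute precision, recall, and F1 for each class label."""
    metrics = {}
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Illustrative only: the six labels visible in the truncated outputs above
y_true_demo = [2, 2, 2, 0, 1, 0]
y_pred_demo = [2, 1, 2, 0, 1, 0]

demo_metrics = per_class_metrics(y_true_demo, y_pred_demo)
for label, scores in demo_metrics.items():
    print(label, scores)
```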

# Conclusion

The deep learning model was trained and evaluated for sentiment classification on the tweets dataset. It reaches an accuracy of about 66.4% on the held-out test set, a moderate result that shows the model can discern sentiment across a diverse set of tweets but leaves room for improvement. Refinements to the model architecture, hyperparameter tuning, and additional training data could improve accuracy and robustness. Reporting further metrics such as per-class precision, recall, and F1 would give a clearer picture of the model's strengths and limitations before it is used for sentiment classification in real-world contexts.