### Introduction to Natural Language Processing (NLP) - Assignment
#### by Gaurav Singh (grv08singh@gmail.com)

In [1]:
import numpy as np
import pandas as pd
import re, string
import emoji

import tensorflow as tf
from tensorflow import keras
from keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import TextVectorization
from keras import Sequential
from keras.layers import Dense, Embedding, SpatialDropout1D, LSTM

import nltk
from nltk.corpus import stopwords

import warnings as wr
wr.filterwarnings('ignore')

#### Problem Statement:
You are a Data Scientist in a big firm. You have to develop a `deep learning model` to perform `sentiment analysis` on a dataset of `tweets` related to various candidates.

#### Tasks to be Performed:
##### 1) Data Loading and Preprocessing:
* __Load__ the `tweet` data from a `CSV file`.
* __Filter out__ the relevant columns: `candidate`, `sentiment`, and `text`.
* __Preprocess__ the text data by __removing__ `stop words`, `punctuation`, converting to `lowercase`, and other cleaning steps.

In [2]:
nltk.download('stopwords')          #if running for the very first time
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Grv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
df = (
    pd
    .read_csv("Tweets.csv")[['name','text','airline_sentiment']]
    .drop_duplicates()
    .apply(lambda col : col.str.strip().str.lower())                                 #lowercase
    .assign(text = lambda x : (
        x['text']
        .str.replace(r"http\S+|www\S+|@[a-zA-Z0-9_]+|#|\d+","",regex=True)           #remove urls, @mentions, hash symbols, numbers
        .apply(lambda y : " ".join([w for w in y.split() if w not in stop_words]))   #remove stop words (punctuation is still attached here, so "me." or "you!" are kept)
        .apply(lambda y : emoji.replace_emoji(y, ""))                                #remove emojis
        .str.translate(str.maketrans("","",string.punctuation))                      #remove ASCII punctuation (curly quotes and "…" remain)
        .str.strip()
        .str.replace(r"\s+"," ",regex=True)                                          #collapse runs of whitespace into one space
        )
    )
)
for t in df['text']:
    print(t)

said
plus added commercials experience tacky
today must mean need take another trip
really aggressive blast obnoxious entertainment guests faces amp little recourse
really big bad thing
seriously would pay flight seats playing really bad thing flying va
yes nearly every time fly vx “ear worm” won’t go away
really missed prime opportunity men without hats parody there
well didnt…but do d
amazing arrived hour early good me
know suicide second leading cause death among teens
lt pretty graphics much better minimal iconography d
great deal already thinking nd trip amp even gone st trip yet p
flying fabulous seductive skies again u take stress away travel
thanks
sfopdx schedule still mia
excited first cross country flight lax mco heard nothing great things virgin america daystogo
flew nyc sfo last week fully sit seat due two large gentleman either side me help
flying
know would amazingly awesome bosfll please want fly you
first fares may three times carriers seats available select
love graph
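
The cleaning steps can be traced on a single string. The following is a minimal sketch, not the notebook's pipeline: `clean_tweet`, `STOP_WORDS` (a hypothetical stand-in for NLTK's English list), the `strip_punct_first` flag, and the sample tweet are all assumptions made for illustration. It shows why words such as "me" survive in the output above: stop words are filtered while punctuation is still attached, so "me!" does not match "me".

```python
import re
import string

# Hypothetical mini stop-word list, standing in for NLTK's English list.
STOP_WORDS = {"i", "me", "you", "the", "a", "to", "it", "was", "for"}

# URLs, @mentions, hash symbols, and digits (same pattern idea as the cell above).
NOISE = re.compile(r"http\S+|www\S+|@[a-zA-Z0-9_]+|#|\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text, strip_punct_first=False):
    """Clean one tweet; optionally strip punctuation before stop-word removal."""
    text = NOISE.sub("", text.strip().lower())
    if strip_punct_first:
        text = text.translate(PUNCT_TABLE)
    text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    text = text.translate(PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

sample = "@VirginAmerica it was amazing, arrived 1 hour early. Good for me!"
print(clean_tweet(sample))                          # amazing arrived hour early good me
print(clean_tweet(sample, strip_punct_first=True))  # amazing arrived hour early good
```

Stripping punctuation first removes the trailing "me", at the cost that contractions such as "don't" become "dont" and no longer match an apostrophe-bearing stop-word entry.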

##### 2) Text Vectorization:
__Convert__ the preprocessed text data into numerical format using `tokenization` and `padding`, so that it can be fed into a deep learning model.

In [4]:
max_len = 500                                   #sequence length after padding; cleaned tweets are far shorter, so most of each row is padding
tokenizer = Tokenizer(oov_token='<oov>')
tokenizer.fit_on_texts(df['text'])
vocab_size = len(tokenizer.word_index) + 1      #+1 because index 0 is reserved for padding

X = tokenizer.texts_to_sequences(df['text'])
X_pad = pad_sequences(X, maxlen=max_len)
X_pad

array([[    0,     0,     0, ...,     0,     0,   130],
       [    0,     0,     0, ...,  2257,   112,  5694],
       [    0,     0,     0, ...,    75,    70,   107],
       ...,
       [    0,     0,     0, ...,   349,   150, 12672],
       [    0,     0,     0, ...,  1298,    53,  2411],
       [    0,     0,     0, ...,    66,    96,     2]])
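
What `Tokenizer` and `pad_sequences` do can be approximated in plain Python. This is a rough sketch under the Keras defaults (index 0 reserved for padding, index 1 for the OOV token, more frequent words receiving smaller indices, padding and truncation at the front); `build_word_index`, `to_padded`, and the sample texts are hypothetical names and data, not part of Keras.

```python
from collections import Counter

def build_word_index(texts, oov_token="<oov>"):
    """Map each word to an integer id; frequent words get smaller ids."""
    counts = Counter(w for t in texts for w in t.split())
    word_index = {oov_token: 1}          # id 0 is left free for padding
    for idx, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = idx
    return word_index

def to_padded(texts, word_index, maxlen):
    """Convert texts to id sequences, then left-pad / left-truncate to maxlen."""
    rows = []
    for t in texts:
        seq = [word_index.get(w, 1) for w in t.split()]   # unknown words -> OOV id 1
        seq = seq[-maxlen:]                               # keep the last maxlen ids
        rows.append([0] * (maxlen - len(seq)) + seq)      # zeros go in front
    return rows

texts = ["really bad thing", "really big bad thing", "thanks"]
word_index = build_word_index(texts)
padded = to_padded(texts + ["great thing"], word_index, maxlen=5)
for row in padded:
    print(row)
# [0, 0, 2, 3, 4]
# [0, 2, 5, 3, 4]
# [0, 0, 0, 0, 6]
# [0, 0, 0, 1, 4]   <- "great" was never seen, so it maps to the OOV id
```

Because ids run from 1 to `len(word_index)` and 0 is used for padding, an `Embedding` layer over these sequences needs `input_dim = len(word_index) + 1`.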

##### 3) Model Development:
__Develop__ a deep learning model using `TensorFlow` and `Keras`. The model consists of an `Embedding` layer, a `SpatialDropout1D` layer to reduce overfitting, an `LSTM` layer to process the token sequence, and a `Dense` softmax layer for output. It classifies the sentiment of each tweet into one of `three categories`: negative, neutral, or positive.

In [5]:
#map each sentiment string to an integer class id
sentiment_mapping = {"negative": 0, "neutral": 1, "positive": 2}
df['label'] = df['airline_sentiment'].map(sentiment_mapping).astype(int)
#one-hot encode the class ids, as categorical cross-entropy expects
y_cat = to_categorical(df['label'], num_classes = 3)

In [6]:
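#build model
model = Sequential()
model.add(keras.Input(shape=(max_len,)))                      #fixed-length sequences of token ids
model.add(Embedding(input_dim=vocab_size, output_dim=32))     #32-dimensional vector per token id
model.add(SpatialDropout1D(0.2))                              #drops whole embedding channels
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))      #final hidden state summarizes the tweet
model.add(Dense(3, activation='softmax'))                     #probabilities for negative / neutral / positive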

model.summary()
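
The parameter counts that `model.summary()` reports can be checked by hand. The sketch below assumes the standard Keras layouts (an embedding matrix of `vocab_size × output_dim`, four LSTM gates each with input weights, recurrent weights, and a bias, and a dense layer with a bias); the helper names and the vocabulary size of 10,000 are hypothetical, chosen only to make the arithmetic concrete.

```python
def embedding_params(vocab_size, output_dim):
    # One output_dim-sized vector per token id.
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size = 10_000   # hypothetical; the real value comes from the tokenizer
counts = {
    "embedding": embedding_params(vocab_size, 32),
    "spatial_dropout": 0,                 # dropout layers have no weights
    "lstm": lstm_params(32, 128),
    "dense": dense_params(128, 3),
}
print(counts)                 # {'embedding': 320000, 'spatial_dropout': 0, 'lstm': 82432, 'dense': 387}
print(sum(counts.values()))   # 402819
```

Only the embedding term depends on the vocabulary size, so it usually dominates the total for a large vocabulary.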

##### 4) Model Training and Evaluation:
* __Train__ the model on the processed text data, using `categorical cross-entropy` as the loss function, and `accuracy` as the evaluation metric.
* Use a `validation split` to __evaluate__ the model's performance on held-out data and to detect overfitting.
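
Keras's `validation_split` argument to `fit` does not sample at random: it holds out the last fraction of the arrays it is given, before any shuffling. The sketch below mimics that behavior in plain Python; `split_like_keras`, the toy airline list, and the batch size of 3 are hypothetical. If the rows of the CSV are grouped (for example by airline), the held-out slice may not be representative, in which case shuffling the rows before calling `fit` is one option.

```python
import math

def split_like_keras(samples, validation_split):
    """Hold out the LAST fraction of the samples, without shuffling."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

# Hypothetical rows that happen to be grouped by airline.
airlines = ["virgin"] * 4 + ["united"] * 4 + ["american"] * 2
train, val = split_like_keras(airlines, validation_split=0.2)

print(len(train), len(val))   # 8 2
print(set(val))               # {'american'}

# Steps per epoch are the number of training batches, rounded up.
batch_size = 3                # hypothetical
print(math.ceil(len(train) / batch_size))   # 3
```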

In [7]:
#compile model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
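
To make the loss and metric concrete, here is a small plain-Python sketch of categorical cross-entropy and accuracy on one-hot labels. It is an illustration under simplifying assumptions (no batching, a fixed `eps` clip), not Keras's implementation; the function names and the example predictions are hypothetical.

```python
import math

def one_hot(label, num_classes=3):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # -sum(true * log(pred)); eps guards against log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def accuracy(y_true_rows, y_pred_rows):
    # Fraction of rows whose most probable class equals the true class.
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true_rows, y_pred_rows))
    return hits / len(y_true_rows)

labels = [0, 1, 2]                       # negative, neutral, positive
y_true = [one_hot(label) for label in labels]
y_pred = [[0.7, 0.2, 0.1],               # correct
          [0.3, 0.5, 0.2],               # correct
          [0.6, 0.3, 0.1]]               # wrong: true class is 2

mean_loss = sum(categorical_crossentropy(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)
acc = accuracy(y_true, y_pred)           # 2 of 3 rows correct

# A model that predicts all three classes uniformly scores ln(3).
baseline = categorical_crossentropy(one_hot(0), [1 / 3, 1 / 3, 1 / 3])
print(round(baseline, 4))                # 1.0986
```

The uniform-prediction loss of ln 3 ≈ 1.0986 is a useful reference point when reading early-epoch training logs for a three-class problem.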

In [None]:
#train model
history = model.fit(x=X_pad, y=y_cat, batch_size=64, epochs=20, validation_split=0.2)

Epoch 1/20
 77/182 ━━━━━━━━━━━━━━━━━━━━ 53s 507ms/step - accuracy: 0.5874 - loss: 0.9796