# Text Classification using TensorFlow 2.x and Keras

[source](https://www.analyticsvidhya.com/blog/2020/03/tensorflow-2-tutorial-deep-learning/)

We will pick up a [text classification problem](https://www.analyticsvidhya.com/blog/2020/03/6-pretrained-models-text-classification/?utm_source=blog&utm_source=tensorflow-2-tutorial-deep-learning) where the task is to identify whether a tweet contains hate speech or not. You can access the dataset and problem statement for this here – DataHack Practice Problem: [Twitter Sentiment Analysis](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/?utm_source=blog&utm_source=tensorflow-2-tutorial-deep-learning).

We are using tf.keras, the high-level API to build and train models in TensorFlow. 

## Problem Statement

Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. (Source: Wikipedia)

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

### Data

Our overall collection of tweets was split in the ratio of 65:35 into training and testing data. Out of the testing data, 30% is public and the rest is private.

#### Data Files 

- train.csv - The training file: a labelled dataset of 31,962 tweets. Each line of the CSV file stores a tweet id, its label, and the tweet text.
- test_tweets.csv - The single (public) test file. It contains only tweet ids and the tweet text, one tweet per line.

### Evaluation Metric

The metric used to evaluate the performance of the classification model is the F1 score.

The metric is defined in terms of the following quantities:
- True Positives (TP) - These are the correctly predicted positive values which means that the value of actual class is yes and the value of predicted class is also yes.
- True Negatives (TN) - These are the correctly predicted negative values which means that the value of actual class is no and value of predicted class is also no.
- False Positives (FP) – When actual class is no and predicted class is yes.
- False Negatives (FN) – When actual class is yes but predicted class is no.

Precision = TP/(TP+FP)

Recall = TP/(TP+FN)

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

F1 is usually more useful than accuracy, especially when the class distribution is uneven.
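To see why, here is a small worked example. It uses the class counts of the validation split that appears later in this walkthrough (5,961 negative and 432 positive tweets). The second set of counts (`tp=300, fp=100, fn=132`) is made up purely for illustration, and `scores` is a hypothetical helper name.

```python
def scores(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# A model that labels every tweet as 0 (not hate speech):
# all 5,961 negatives are right, all 432 positives are missed.
acc_all_zero, _, _, f1_all_zero = scores(tp=0, tn=5961, fp=0, fn=432)
print(f"always-0 model: accuracy={acc_all_zero:.3f}, F1={f1_all_zero:.3f}")

# Hypothetical model that finds 300 of the 432 positives with 100 false alarms.
acc, prec, rec, f1 = scores(tp=300, tn=5861, fp=100, fn=132)
print(f"hypothetical model: accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}, F1={f1:.3f}")
```

The always-0 model reaches roughly 93% accuracy while its F1 score is 0, which is the kind of failure that accuracy alone hides on imbalanced data.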

## Upgrade pip

In [1]:
! pip install --upgrade pip



## Install Necessary Libraries including TensorFlow and Keras

In [2]:
! pip install numpy pandas scikit-learn matplotlib nltk tensorflow==2.0.0

Collecting nltk
  Downloading nltk-3.5.zip (1.4 MB)
Collecting click
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting regex
  Downloading regex-2020.11.13-cp36-cp36m-manylinux2014_x86_64.whl (723 kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-py3-none-any.whl size=1434675 sha256=927f5682b3334a19f3a6d2e3d8c23ca89faab9ceb68c034bead279bbcdd5f48f
  Stored in directory: /root/.cache/pip/wheels/de/5e/42/64abaeca668161c3e2cecc24f864a8fc421e3d07a104fc8a51
Successfully built nltk
Installing collected packages: regex, click, nltk
Successfully installed click-7.1.2 nltk-3.5 regex-2020.11.13


## Import Necessary Libraries including TensorFlow and Keras

In [2]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

In [4]:
print(tf.__version__)

2.0.0


In [5]:
print(keras.__version__)

2.2.4-tf


In [3]:
# Helper libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [4]:
# Import NLTK & Download Required Module
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Import the train tweets

Download the train and test data from [the practice problem page](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/) and then upload them to the storage/ directory.

In [5]:
train = pd.read_csv('train.csv')

In [6]:
train

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |
| ... | ... | ... | ... |
| 31957 | 31958 | 0 | ate @user isz that youuu?ðððððð... |
| 31958 | 31959 | 0 | to see nina turner on the airwaves trying to... |
| 31959 | 31960 | 0 | listening to sad songs on a monday morning otw... |
| 31960 | 31961 | 1 | @user #sikh #temple vandalised in in #calgary,... |


In [11]:
train.shape

(31962, 3)

In [12]:
type(train)

pandas.core.frame.DataFrame

Separate the tweet texts and the labels:

In [7]:
X = train.iloc[:, 2].values
y = train.iloc[:,1].values

In [15]:
X

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty', ...,
       'listening to sad songs on a monday morning otw to work is sad  ',
       '@user #sikh #temple vandalised in in #calgary, #wso condemns  act  ',
       'thank you @user for you follow  '], dtype=object)

In [16]:
type(X)

numpy.ndarray

In [17]:
len(X)

31962

In [18]:
y

array([0, 0, 0, ..., 0, 1, 0])

In [19]:
len(y)

31962

In [20]:
type(y)

numpy.ndarray

## Text cleaning & Preprocessing

Here, we define a function to clean the text. Tweets contain many acronyms, slang terms, digits, and stray characters; removing them reduces the noise for our sequence model:

In [9]:
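def clean_corpus(text):
    """Return a list of cleaned, lower-cased tweets."""
    corpus = []
    for i in range(len(text)):
        # Remove t.co links at the start, in the middle, and at the end of the tweet
        tweet = re.sub(r"^https://t.co/[a-zA-Z0-9]*\s"," ", str(text[i]))
        tweet = re.sub(r"\s+https://t.co/[a-zA-Z0-9]*\s"," ", tweet)
        tweet = re.sub(r"\s+https://t.co/[a-zA-Z0-9]*$"," ", tweet)
        tweet = tweet.lower()
        # Expand contractions and abbreviations. Note that "hv" and "ur" have no
        # word boundaries, so they also match inside words ("your" -> "yoyour").
        tweet = re.sub(r"can't","can not", tweet)
        tweet = re.sub(r"hv","have", tweet)
        tweet = re.sub(r"ur","your", tweet)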
        tweet = re.sub(r"ain't","is not", tweet)
        tweet = re.sub(r"don't","do not", tweet)
        tweet = re.sub(r"couldn't","could not", tweet)
        tweet = re.sub(r"shouldn't","should not", tweet )
        tweet = re.sub(r"won't","will not", tweet)
        tweet = re.sub(r"there's", "there is", tweet)
        tweet = re.sub(r"it's","it is", tweet)
        tweet = re.sub(r"that's","that is", tweet)
        tweet = re.sub(r"where's","where is", tweet)
        tweet = re.sub(r"who's","who is", tweet)
        tweet = re.sub(r"\W"," ", tweet)
        tweet = re.sub(r"\d"," ", tweet)
        tweet = re.sub(r"[ðâï¼½³ªãºæååçæåä¹µó¾_ëìêè]"," ", tweet)
        # Drop stray single letters
        tweet = re.sub(r"\s[a-z]\s"," ", tweet)
        tweet = re.sub(r"\s+[a-z]\s+"," ", tweet)
        tweet = re.sub(r"^[a-z]\s"," ", tweet)
        tweet = re.sub(r"^[a-z]\s+"," ", tweet)
        # Collapse repeated whitespace and trim both ends
        tweet = re.sub(r"\s+"," ", tweet)
        tweet = re.sub(r"^\s","", tweet)
        tweet = re.sub(r"\s$","", tweet)
        corpus.append(tweet)
        
    #return the corpus
    return corpus

In [10]:
corpus = clean_corpus(X)

In [12]:
len(corpus)

31962

In [14]:
corpus[:5]

['user when father is dysfunctional and is so selfish he drags his kids into his dysfunction run',
 'user user thanks for lyft credit can not use cause they do not offer wheelchair vans in pdx disapointed getthanked',
 'bihday yoyour majesty',
 'model love take with all the time in your',
 'factsguide society now motivation']
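The `yoyour` in the third cleaned tweet comes from the `ur` rule matching inside the word `your`. A minimal sketch of one way to avoid this, assuming the intent is to expand only standalone abbreviations, is to anchor the patterns with `\b` word boundaries. The helper name `expand_abbreviations` and the `ABBREVIATIONS` table are illustrative and not part of the original notebook; all recorded outputs in this walkthrough come from the original function.

```python
import re

# Illustrative abbreviation table (the same two rules used in clean_corpus).
ABBREVIATIONS = {"hv": "have", "ur": "your"}

def expand_abbreviations(tweet):
    """Expand abbreviations only when they appear as whole words."""
    for short, full in ABBREVIATIONS.items():
        # \b matches at a word boundary, so "ur" inside "your" is left alone.
        tweet = re.sub(rf"\b{re.escape(short)}\b", full, tweet)
    return tweet

print(re.sub(r"ur", "your", "bihday your majesty"))   # unanchored rule, as in clean_corpus
print(expand_abbreviations("bihday your majesty"))    # anchored rule
print(expand_abbreviations("all the time in ur"))
```

Switching to the anchored rules would change the vocabulary, so the token ids and parameter counts shown in the following sections would no longer match exactly.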

## Tokenizing the text to feed into the model

Next, we need to tokenize the text. We first count the distinct words in the corpus with NLTK, and then use the `Tokenizer` class from the Keras text preprocessing module to convert each tweet into a sequence of integer word indices:

In [15]:
#check how many individual words present in the corpus
word_dict = {}
for doc in corpus:
    words = nltk.word_tokenize(doc)
    for word in words:
        if word not in word_dict:
            word_dict[word] = 1
        else:
            word_dict[word] += 1
            
len(word_dict)

37579

In [16]:
#tokenising the texts
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(corpus)
corpus_tokens = tokenizer.texts_to_sequences(corpus)

In [17]:
type(corpus_tokens)

list

In [19]:
len(corpus_tokens)

31962

In [23]:
corpus_tokens[:3]

[[1, 35, 68, 5, 14946, 6, 5, 21, 2610, 70, 6257, 93, 244, 249, 93, 7643, 442],
 [1, 1, 161, 8, 5345, 2343, 32, 12, 426, 623, 60, 23, 12, 1481, 7644, 9812, 7, 7645, 14947, 9813],
 [54, 27, 3230]]
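As a rough mental model of what `fit_on_texts` and `texts_to_sequences` do, the sketch below builds a frequency-ranked vocabulary in plain Python. It is a simplified stand-in, not the Keras implementation: it splits on whitespace only, and the function names `fit_vocab` and `to_sequences` are hypothetical. It does mirror two properties visible in the output above: the most frequent word gets index 1 (here `user`), and index 0 is left free for padding.

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an integer rank, 1 = most frequent. Index 0 is reserved."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count, highest first; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

toy_corpus = ["user user thanks for lyft credit", "thanks user"]
word_index = fit_vocab(toy_corpus)
print(word_index)
print(to_sequences(toy_corpus + ["user says hello"], word_index))
```

Words never seen during fitting are dropped, which is also what the Keras `Tokenizer` does when no `oov_token` is set. This matters later, when the same tokenizer is applied to the test tweets.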

## Padding Text Sequences

Padding makes every input sequence the same length, which the model requires. Shorter tweets are filled with zeros; here we pad at the end (`padding='post'`) to a fixed length of 25 tokens, which also truncates any longer tweets:

In [24]:
# Find the average number of words per tweet
print(corpus[0])
print(corpus_tokens[0:2])

num_of_words_in_doc =[]
for doc in corpus_tokens:
    num_of_words_in_doc.append(len(doc))
print("Average number of words: ", np.average(num_of_words_in_doc))


# Padding the sequences
corpus_pad = keras.preprocessing.sequence.pad_sequences(corpus_tokens,maxlen=25,padding='post')

user when father is dysfunctional and is so selfish he drags his kids into his dysfunction run
[[1, 35, 68, 5, 14946, 6, 5, 21, 2610, 70, 6257, 93, 244, 249, 93, 7643, 442], [1, 1, 161, 8, 5345, 2343, 32, 12, 426, 623, 60, 23, 12, 1481, 7644, 9812, 7, 7645, 14947, 9813]]
Average number of words:  12.311182028659033


In [26]:
type(corpus_pad)

numpy.ndarray

In [30]:
corpus_pad.shape

(31962, 25)
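The effect of `pad_sequences(..., maxlen=25, padding='post')` can be sketched in plain Python. This is an illustrative re-implementation under the assumption of the Keras default `truncating='pre'`, which keeps the last `maxlen` tokens of an over-long sequence. The name `pad_post` is hypothetical, and a shorter `maxlen` is used so the result is easy to read.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; truncate over-long sequences from the front."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                      # keep the last `maxlen` tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

short = [54, 27, 3230]                            # third tweet from the token output
long = [1, 35, 68, 5, 14946, 6, 5, 21]            # first 8 tokens of the first tweet
for row in pad_post([short, long], maxlen=5):
    print(row)
```

The short sequence gains two trailing zeros, while the long one loses its first three tokens.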

## Create Validation Set

Before building the model, we hold out 20% of the training data as a validation set so that we can check the performance of the trained model:

In [31]:
# Creating Validation Set
X_train,X_test,y_train,y_test = train_test_split(corpus_pad,y,test_size=0.2,random_state=101)

X_train.shape, X_test.shape

((25569, 25), (6393, 25))

## Building & Compiling the Model

Here, we will build and compile an LSTM model. The hyperparameters below were arrived at after several iterations of experimentation:

In [57]:
# Building & Compiling the model

vocab_size = len(tokenizer.word_index) + 1
max_length = 25
model = keras.Sequential()
model.add(keras.layers.Embedding(input_dim=vocab_size,output_dim=50,input_length=max_length))
model.add(keras.layers.LSTM(units=50,dropout=0.2,recurrent_dropout=0.2))
model.add(keras.layers.Dense(units=1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 25, 50)            1879200   
_________________________________________________________________
lstm_4 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 51        
Total params: 1,899,451
Trainable params: 1,899,451
Non-trainable params: 0
_________________________________________________________________
None
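The parameter counts in the summary can be reproduced by hand. The arithmetic below uses the standard formulas for embedding, LSTM, and dense layers. The vocabulary size of 37,584 is inferred from the embedding's parameter count (1,879,200 / 50) rather than printed anywhere in the notebook.

```python
vocab_size = 37584      # inferred: 1,879,200 embedding params / 50 dimensions
embed_dim = 50
lstm_units = 50

# Embedding: one 50-dimensional vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 weight sets (input, forget, and output gates plus the cell candidate),
# each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of this model.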


## Training the Deep Learning Model

Now, it is time to train the model. In this run each epoch took a little over a minute (see the log below), so I have trained it for only 2 epochs:

In [58]:
# Train the model
model.fit(X_train,y_train,batch_size=10,epochs=2, verbose=2)
# history = model.fit(X_train, y_train, batch_size=10, validation_split=0.3, epochs=3, verbose=2)

Train on 25569 samples
Epoch 1/2
25569/25569 - 69s - loss: 0.1736 - acc: 0.9463
Epoch 2/2
25569/25569 - 63s - loss: 0.0818 - acc: 0.9749


<tensorflow.python.keras.callbacks.History at 0x7fbec5f9a588>

## Evaluate the model

In [59]:
# evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=2)

6393/1 - 2s - loss: 0.1267 - acc: 0.9656


In [60]:
print(loss, accuracy)

0.10941458632972866 0.9655874


To report precision, recall, and F1 per class, we can use the [scikit-learn classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html); see also [precision_recall_fscore_support](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html).

If your labels are not already numeric, the [scikit-learn label encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) can convert them to a numerical format.

In [61]:
from sklearn.metrics import classification_report

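y_pred = model.predict(X_test, batch_size=10, verbose=2)
# Caution: y_pred has shape (n, 1), one sigmoid probability per tweet, so
# argmax over axis 1 always returns 0. This is why class 1 scores 0.00 in the
# report below. Thresholding, e.g. (y_pred > 0.5).astype(int).ravel(), would
# give usable 0/1 labels.
y_pred_bool = np.argmax(y_pred, axis=1)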

print(classification_report(y_test, y_pred_bool))

6393/1 - 3s
              precision    recall  f1-score   support

           0       0.93      1.00      0.97      5961
           1       0.00      0.00      0.00       432

    accuracy                           0.93      6393
   macro avg       0.47      0.50      0.48      6393
weighted avg       0.87      0.93      0.90      6393



  _warn_prf(average, modifier, msg_start, len(result))


In [62]:
y_test

array([0, 0, 0, ..., 0, 0, 0])

In [63]:
y_pred

array([[0.01629123],
       [0.01713055],
       [0.00183874],
       ...,
       [0.00344735],
       [0.00276003],
       [0.01535853]], dtype=float32)

In [64]:
y_pred_bool

array([0, 0, 0, ..., 0, 0, 0])
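Because the model ends in a single sigmoid unit, each row of `y_pred` holds one probability, and `argmax` over a one-element row is always 0. That explains the all-zero `y_pred_bool` and the zero scores for class 1. Here is a minimal sketch of the difference, written in plain Python so it runs without NumPy. The first three probabilities are copied from the `y_pred` output above, and the last one (0.91) is a made-up value for illustration:

```python
# Each row holds one sigmoid probability, like the (n, 1) array from model.predict.
y_prob = [[0.01629123], [0.01713055], [0.00183874], [0.91]]

def argmax(row):
    """Index of the largest entry in a row (what np.argmax(..., axis=1) does per row)."""
    return max(range(len(row)), key=row.__getitem__)

# argmax over a one-element row can only ever return index 0.
labels_argmax = [argmax(row) for row in y_prob]

# Thresholding the probability at 0.5 gives the intended 0/1 labels.
labels_threshold = [int(row[0] > 0.5) for row in y_prob]

print(labels_argmax)
print(labels_threshold)
```

With NumPy arrays, the equivalent thresholding step would be `(y_pred > 0.5).astype(int).ravel()`.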

In [65]:
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
y_pred2 = model.predict(X_test)
# Note: y_pred2 has shape (n, 1), so argmax over axis 1 always returns 0
y_pred2_pool = np.argmax(y_pred2, axis=1)

In [66]:
# Print f1, precision, and recall scores
print(precision_score(y_test, y_pred2_pool , average="macro"))
print(recall_score(y_test, y_pred2_pool , average="macro"))
print(f1_score(y_test, y_pred2_pool , average="macro"))

0.4662130455185359
0.5
0.48251578436134046


  _warn_prf(average, modifier, msg_start, len(result))


#### Manually calculate metrics

In [68]:
K = keras.backend

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
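To check what these functions compute, here is the same arithmetic on plain Python lists. This is a sketch for illustration only: the labels and probabilities are invented, `EPSILON` is set to 1e-7 (the Keras backend default), the clipping step is omitted because the inputs already lie in [0, 1], and the `_plain` function names are hypothetical.

```python
EPSILON = 1e-7  # same role as K.epsilon(): avoids division by zero

def recall_plain(y_true, y_pred):
    true_positives = sum(round(t * p) for t, p in zip(y_true, y_pred))
    possible_positives = sum(round(t) for t in y_true)
    return true_positives / (possible_positives + EPSILON)

def precision_plain(y_true, y_pred):
    true_positives = sum(round(t * p) for t, p in zip(y_true, y_pred))
    predicted_positives = sum(round(p) for p in y_pred)
    return true_positives / (predicted_positives + EPSILON)

def f1_plain(y_true, y_pred):
    p = precision_plain(y_true, y_pred)
    r = recall_plain(y_true, y_pred)
    return 2 * (p * r) / (p + r + EPSILON)

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.8, 0.2, 0.7]   # rounds to predictions [1, 1, 0, 1]
print(recall_plain(y_true, y_prob), precision_plain(y_true, y_prob), f1_plain(y_true, y_prob))

# If every prediction is 0, all three metrics are 0.
print(f1_plain(y_true, [0, 0, 0, 0]))
```

Rounding the product of label and probability counts a true positive only when the label is 1 and the probability is above 0.5, so the functions behave like thresholded precision and recall.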

In [69]:
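y_pred3 = model.predict(X_test)
# Note: y_pred3 has shape (n, 1), so argmax over axis 1 always returns 0,
# which is why the metrics computed from it below are all 0.0
y_pred3_bool = np.argmax(y_pred3, axis=1)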

In [71]:
type(y_test)

numpy.ndarray

In [75]:
len(y_test)

6393

In [72]:
type(y_pred3_bool)

numpy.ndarray

In [76]:
len(y_pred3_bool)

6393

In [98]:
y_test_tensor = tf.convert_to_tensor(y_test.astype('float32'))
y_pred_tensor = tf.convert_to_tensor(y_pred3_bool.astype('float32'))

In [99]:
true_positives = K.sum(K.round(K.clip(y_test_tensor * y_pred_tensor, 0, 1)))

In [100]:
print(K.get_value( recall_m(y_test_tensor, y_pred_tensor) ))

0.0


In [101]:
print(K.get_value( precision_m(y_test_tensor, y_pred_tensor) ))

0.0


In [102]:
print(K.get_value( f1_m(y_test_tensor, y_pred_tensor) ))

0.0


## Importing test data

In [103]:
#Loading the test data
test_tweets = pd.read_csv("test_tweets.csv")
test_tweets.shape

(17197, 2)

In [105]:
#cleaning the text
test_data = test_tweets['tweet']
clean_test_data  = clean_corpus(test_data)

In [106]:
#text to sequence and padding
clean_test_data_token = tokenizer.texts_to_sequences(clean_test_data)
clean_test_data_pad = keras.preprocessing.sequence.pad_sequences(clean_test_data_token,maxlen=25,padding='post')

## Prediction on the test set and creating Submission File

In [107]:
# preparing the submission file    
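# predict_classes thresholds the sigmoid output at 0.5. It exists in TensorFlow 2.0
# but was removed in later releases, where (model.predict(x) > 0.5).astype("int32")
# is the equivalent.
final_prediction = model.predict_classes(clean_test_data_pad)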

test_tweets['label'] = final_prediction
test_predictions = test_tweets[['id','label']]
test_predictions.to_csv('LSTM3.csv',index=False)

Once you upload this file to the solution checker, you should get an F1 score close to 0.75. You can check it for yourself at this [link](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/?utm_source=blog&utm_source=tensorflow-2-tutorial-deep-learning).