# What
Classify text.  In this case, binary classification.


# Why
There are many applications for text classification, 
e.g. is the text offensive, is it potentially a scam, et cetera.
In this case the text is Reddit posts and the question is whether
a post involves depression.  I can easily imagine using a classifier like this
on texts and email to flag early warning signs, although there are 
privacy challenges there.


# Background

I have been getting more comfortable with text applications so 
I wanted something to show that.
This dataset originated on Reddit, but I got it from the Kaggle NLP data sets.
It consists of Reddit posts that have been labeled as either related to depression or not.

Given that September is Suicide Awareness month this seemed like a good data set  
to start my NLP journey.


In [48]:
import os
import datetime
import re
import string
import nltk
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from IPython.display import display
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf

## Data
The dataset is actually a CSV file with one column for the text and another for the label.

In [49]:
data = pd.read_csv("depression_dataset_reddit_cleaned.csv")
print(data.shape)
display(data.head(3))

(7731, 2)


Unnamed: 0,clean_text,is_depression
0,we understand that most people who reply immed...,1
1,welcome to r depression s check in post a plac...,1
2,anyone else instead of sleeping more when depr...,1


## Cleaning the text
One of the Kaggle code examples had the code below for "cleaning" the text.  
I tried using it, but it seemed to make the text less legible.  

Given that the text column is named "clean_text",  
such cleaning may have been needed originally,  
and someone then posted an already "cleaned" version of the text.

I don't know, but I did not use the clean function below.

In [50]:
# I copied this from one of the Kaggle submissions
nltk.download("stopwords")
stemmer = nltk.SnowballStemmer("english")
stopword=set(stopwords.words('english'))
def clean(text):
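    assert False, "do not use this function"
    text = str(text).lower()
    # raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)                # drop [bracketed] spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>+', '', text)                 # drop HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)               # drop words containing digits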
    text = [word for word in text.split(' ') if word not in stopword]
    text=" ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text=" ".join(text)
    return text


[nltk_data] Downloading package stopwords to /home/john/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Split into train and val subsets
I have used the sklearn function to do this before, but in this case
I preferred using pandas and numpy to keep the data as data frames.

In [51]:
#  First split the indices
all_idx = data.index
train_size = int(np.floor(0.8*data.shape[0]))
train_idx = np.random.choice(data.index, train_size, replace=False)

# take the difference of original and train to get val
other_idx = list(set(all_idx).difference(set(train_idx)))

val_size = int(np.floor(0.5*len(other_idx)))
val_idx = np.random.choice(other_idx, val_size, replace=False)

test_idx = list(set(other_idx).difference(set(val_idx)))

# and now use the indices to get the data sets
train = data.loc[train_idx].copy()
val = data.loc[val_idx].copy()
test = data.loc[test_idx].copy()
print(f" train shape {train.shape}, val shape {val.shape},  test shape {test.shape}")

 train shape (6184, 2), val shape (773, 2),  test shape (774, 2)
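
The split above draws from `np.random.choice` without a seed, so the rows in each subset change from run to run. Here is a minimal sketch of a repeatable version, assuming a seeded NumPy generator is acceptable. The helper name `split_indices` and the seed value are my own choices for illustration, not part of the notebook.

```python
import numpy as np

def split_indices(n_rows, train_frac=0.8, seed=42):
    """Return disjoint train/val/test position arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)       # seeded, so the shuffle is repeatable
    shuffled = rng.permutation(n_rows)      # every row position exactly once
    train_size = int(np.floor(train_frac * n_rows))
    val_size = (n_rows - train_size) // 2   # half of the remainder, rounded down
    train_idx = shuffled[:train_size]
    val_idx = shuffled[train_size:train_size + val_size]
    test_idx = shuffled[train_size + val_size:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(7731)
print(len(train_idx), len(val_idx), len(test_idx))  # 6184 773 774
```

Because these are row positions, they can be passed to `data.iloc[...]` to build the three data frames.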


In [52]:
# have a look at the head of each dataset
display(train.head(3))
display(val.head(3))
print(f" % true in train {np.round(train['is_depression'].sum()/train.shape[0], 2)}")
print(f" % true in val {np.round(val['is_depression'].sum()/val.shape[0], 2)}")

Unnamed: 0,clean_text,is_depression
4636,mtsiaklides aw i wish i could i can t really s...,0
3492,heartbreaking to see kid taking their life out...,1
2332,maybe i should have been locked away for the r...,1


Unnamed: 0,clean_text,is_depression
6594,http twitpic com y z see where we ve been move...,0
1515,for about a week now i ve been experiencing ex...,1
209,someone pls tell me how to get over this i m c...,1


 % true in train 0.5
 % true in val 0.48


## Parameters for the tokenizer

In [59]:
# Vocabulary size of the tokenizer
vocab_size = 10000

# Maximum length of the padded sequences
max_length = 100

# Output dimensions of the Embedding layer
embedding_dim = 10


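# Parameters for padding and OOV tokens
# Note: padding_type is defined here but is not passed to pad_sequences below,
# so the sequences are padded with the default padding='pre'.
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"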


## Final setup
Run the tokenizer to get the sequences for train, val, and test,
as well as the labels for each.

In [60]:

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)

# Generate the word index dictionary for the training sentences
tokenizer.fit_on_texts(train["clean_text"])
word_index = tokenizer.word_index

# Generate and pad the training sequences
train_sequences = tokenizer.texts_to_sequences(train["clean_text"])
train_padded = pad_sequences(train_sequences,maxlen=max_length, truncating=trunc_type)

# Generate and pad the val sequences
val_sequences = tokenizer.texts_to_sequences(val["clean_text"])
val_padded = pad_sequences(val_sequences,maxlen=max_length, truncating=trunc_type)

# Generate and pad the test sequences
test_sequences = tokenizer.texts_to_sequences(test["clean_text"])
test_padded = pad_sequences(test_sequences,maxlen=max_length, truncating=trunc_type)

# Convert the label columns into numpy arrays
train_labels = np.array(train["is_depression"])
val_labels = np.array(val["is_depression"])
test_labels = np.array(test["is_depression"])
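
To make the tokenizing and padding steps concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation (the real `Tokenizer` also lowercases and strips punctuation by default), and the helper names `build_word_index`, `to_sequence`, and `pad` are hypothetical. The defaults in `pad` mirror the calls above: `truncating='post'` and the default `padding='pre'`.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids, most frequent first; id 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    # Stable sort by descending count, so ties keep first-appearance order
    ordered = [w for w, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    index = {oov_token: 1}
    for i, word in enumerate(ordered, start=2):
        index[word] = i
    return index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its id; unknown or too-rare words become the OOV id."""
    oov_id = word_index[oov_token]
    seq = []
    for word in text.split():
        i = word_index.get(word)
        seq.append(i if i is not None and i < num_words else oov_id)
    return seq

def pad(seq, maxlen, padding="pre", truncating="post"):
    """Cut or zero-fill a sequence to exactly maxlen entries."""
    seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    fill = [0] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

# Toy sentences for illustration only, not rows from the dataset
word_index = build_word_index(["i feel sad", "i feel fine today"])
seq = to_sequence("i feel great", word_index, num_words=100)
print(seq)                              # [2, 3, 1]  ("great" is out of vocabulary)
print(pad(seq, 5))                      # [0, 0, 2, 3, 1]
print(pad(seq, 5, padding="post"))      # [2, 3, 1, 0, 0]
```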

## The model
The model is fairly simple:
* A single embedding layer  
* A flattening layer  
* A Dense layer with ReLU activation
* A Dense layer with sigmoid for the binary prediction

In [61]:


# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Setup the training parameters
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=.0015),
              metrics=['accuracy'])


In [62]:
# Print the model summary
model.summary()
print(model.optimizer.lr)

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 100, 10)           100000    
                                                                 
 flatten_6 (Flatten)         (None, 1000)              0         
                                                                 
 dense_12 (Dense)            (None, 4)                 4004      
                                                                 
 dense_13 (Dense)            (None, 1)                 5         
                                                                 
Total params: 104,009
Trainable params: 104,009
Non-trainable params: 0
_________________________________________________________________
<tf.Variable 'learning_rate:0' shape=() dtype=float32, numpy=0.0015>
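
The parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, using the values set earlier in the notebook (the variable names for the intermediate counts are mine):

```python
vocab_size, embedding_dim, max_length = 10000, 10, 100
hidden_units = 4

# Embedding: one vector of length embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim            # 100000

# Flatten has no weights; it turns (100, 10) into 1000 values
flattened = max_length * embedding_dim                   # 1000

# Dense layers: inputs * units weights, plus one bias per unit
dense_relu_params = flattened * hidden_units + hidden_units   # 4004
dense_sigmoid_params = hidden_units * 1 + 1                    # 5

total = embedding_params + dense_relu_params + dense_sigmoid_params
print(total)  # 104009
```

Almost all of the weights sit in the embedding layer, so `vocab_size` and `embedding_dim` dominate the model size.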


## Run the model
I tried a few variations on the number of epochs.  
Every run tends to be different,  
but I found that I got pretty good results  
with the number of epochs between 6 and 12.

That is, pretty good validation accuracy.  
The train accuracy seemed pretty good in all the runs.

I also tried a few different learning rates.  
The default for Adam is .001, and I tried .01  
and .0015.   
With this small data set and the small number of epochs  
I did not see a huge difference.

In [67]:
num_epochs = 10

# Train the model
model.fit(train_padded, train_labels, epochs=num_epochs, validation_data=(val_padded, val_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fb8ba26dbd0>
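
`model.fit` returns a `History` object whose `history` attribute is a dict of per-epoch metric lists. Here is a small sketch, assuming that return value had been kept in a variable, of picking the epoch with the lowest validation loss rather than comparing runs by eye. The helper `best_epoch` is hypothetical, and the numbers in the example dict are made-up placeholders, not results from this notebook.

```python
def best_epoch(history, metric="val_loss"):
    """Return (1-based epoch, value) where the metric is lowest."""
    values = history[metric]
    best = min(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]

# Placeholder values for illustration only (NOT output from the run above)
example_history = {
    "loss":     [0.60, 0.30, 0.15, 0.08],
    "val_loss": [0.50, 0.28, 0.25, 0.31],
}
epoch, value = best_epoch(example_history)
print(epoch, value)  # 3 0.25
```

Keras also provides a `tf.keras.callbacks.EarlyStopping` callback that can stop training once `val_loss` stops improving.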

## Comments on training

The model appears to be overfitting as the val_loss is considerably higher than the train_loss.

Still, the accuracy on the validation set is 96%, which might be good enough for some applications.

Let's see how it does on the holdout test set.

In [70]:
model.evaluate(test_padded, test_labels)



[0.2321440577507019, 0.9560723304748535]

## Test performance
The numbers for the test set are in line with those for the validation set. That is not surprising, since the validation set was not used to fit the model's weights, so both are held-out data.
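
Accuracy alone does not show how the errors split between missed depression posts and false alarms, which matters for an early-warning use. Here is a minimal NumPy sketch of the confusion counts, precision, and recall, assuming the probabilities come from something like `model.predict(test_padded)`. The helper `binary_metrics`, the 0.5 threshold, and the toy arrays are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Confusion counts plus accuracy, precision and recall for 0/1 labels."""
    y_true = np.asarray(y_true).astype(int).ravel()
    # ravel() also handles the (n, 1) shape returned by a sigmoid output layer
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Toy labels and probabilities for illustration only
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [0.9, 0.2, 0.4, 0.8, 0.6, 0.1])
print(metrics["tp"], metrics["fp"], metrics["fn"], metrics["tn"])  # 2 1 1 2
```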

# Summary
Building a text classifier with Keras is fairly straightforward.
I did not experiment too much with the hyperparameters, in part because the ones I chose at the start performed fairly well on the training set.

For more challenging data sets, the Keras Tuner looks like an easy-to-use option for 
searching over the parameter space.