# Text Classification

This dataset originated on Reddit, but I got it from Kaggle, where it is one of the NLP datasets.
It consists of Reddit posts that have been labeled as either related to depression or not.

Given that September is Suicide Awareness month, this seemed like a good dataset  
to start my NLP journey.


In [1]:
import os
import datetime
import re
import string
import nltk
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from IPython.display import display
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf


## Data
The dataset is a CSV file with one column for the text and another for the label.

In [2]:
data = pd.read_csv("depression_dataset_reddit_cleaned.csv")
print(data.shape)
display(data.head(3))

(7731, 2)


| | clean_text | is_depression |
|---|---|---|
| 0 | we understand that most people who reply immed... | 1 |
| 1 | welcome to r depression s check in post a plac... | 1 |
| 2 | anyone else instead of sleeping more when depr... | 1 |
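
Since the label is binary, a quick look at the class balance can help when reading accuracy numbers later. The sketch below uses a tiny made-up frame (`toy`) in place of the real CSV, so the printed shares are illustrative only; just the column names are taken from the dataset.

```python
import pandas as pd

# Made-up stand-in for the real CSV; only the column names match the dataset.
toy = pd.DataFrame({
    "clean_text": ["feeling low again", "great day at the park",
                   "cannot sleep or eat", "new phone arrived"],
    "is_depression": [1, 0, 1, 0],
})

# Share of rows per label; values near 0.5 mean the classes are balanced.
label_share = toy["is_depression"].value_counts(normalize=True).sort_index()
print(label_share)
```

On the real data the same call would be `data["is_depression"].value_counts(normalize=True)`.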


## Cleaning the text
One of the Kaggle code examples used the function below to "clean" the text.  
I tried it, but it made the text less legible.  

The text column is named "clean_text", so perhaps this kind of cleaning  
was needed on the original data and someone later posted  
an already cleaned version.

I don't know for sure, but I did not use the `clean` function below.

In [3]:
# I copied this from one of the Kaggle submissions
nltk.download("stopwords")
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    assert False, "do not use this function"  # kept for reference only
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # remove text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>+', '', text)                 # remove HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\n', '', text)                     # remove newlines
    text = re.sub(r'\w*\d\w*', '', text)               # remove words containing digits
    text = [word for word in text.split(' ') if word not in stopword]  # drop stopwords
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]  # stem each word
    text = " ".join(text)
    return text


[nltk_data] Downloading package stopwords to /home/john/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
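
For comparison, here is a lighter cleaning pass that only strips URLs, HTML tags and punctuation, and leaves stopwords and word endings alone. This is a sketch of an alternative, not the Kaggle code and not something this notebook ran; `light_clean` is a hypothetical helper name.

```python
import re
import string

# Hypothetical helper, not from the Kaggle submission: it removes URLs,
# HTML tags and punctuation but keeps stopwords and does no stemming.
_PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))

def light_clean(text):
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)                # drop HTML tags
    text = _PUNCT_RE.sub(" ", text)                     # punctuation -> space
    return " ".join(text.split())                       # collapse whitespace

print(light_clean("I can't sleep... see <b>this</b>: https://example.com/post"))
# i can t sleep see this
```

The result looks a lot like the rows already in `clean_text` (for example "i ve been struggling"), which fits the guess that the posted dataset had already been through a similar pass.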


## Split into train and val subsets
I have used the sklearn function for this before,  
but here I just use numpy and pandas to achieve the same result.

In [4]:
# First split the indices
all_idx = data.index
train_size = int(np.floor(0.8 * data.shape[0]))
train_idx = np.random.choice(all_idx, train_size, replace=False)
# take the difference of original and train to get val
# (sorted, because set iteration order is not guaranteed)
val_idx = sorted(set(all_idx).difference(set(train_idx)))
print(f"{len(all_idx)}  {len(train_idx)}  {len(val_idx)}")

# and now use the indices to get the data sets
train = data.loc[train_idx].copy()
val = data.loc[val_idx].copy()
print(train.shape, val.shape)

7731  6184  1547
(6184, 2) (1547, 2)
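
A shorter way to get the same kind of split, sketched here on a made-up ten-row frame, is `DataFrame.sample` followed by `DataFrame.drop`. Passing `random_state` makes the split repeatable between runs. The names `toy`, `train_toy` and `val_toy` and the seed 42 are placeholders for illustration.

```python
import pandas as pd

# Toy frame standing in for `data`; ten rows with the same column names.
toy = pd.DataFrame({
    "clean_text": [f"post {i}" for i in range(10)],
    "is_depression": [i % 2 for i in range(10)],
})

# Sample 80% of the rows for training; random_state fixes the shuffle.
train_toy = toy.sample(frac=0.8, random_state=42)
# Every row that was not sampled goes to the validation set.
val_toy = toy.drop(train_toy.index)

print(train_toy.shape, val_toy.shape)
# (8, 2) (2, 2)
```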


In [5]:
# have a look at the head of each dataset
display(train.head(3))
display(val.head(3))

| | clean_text | is_depression |
|---|---|---|
| 5649 | ilovedt that s what i thought bummer | 0 |
| 3709 | adewunmitemit 9 weirdpeace olumurewa the sound... | 1 |
| 6781 | isnt very happy with twitter at the moment won... | 0 |


| | clean_text | is_depression |
|---|---|---|
| 0 | we understand that most people who reply immed... | 1 |
| 2 | anyone else instead of sleeping more when depr... | 1 |
| 10 | i ve been struggling with depression for a lon... | 1 |


## Parameters for the tokenizer

In [6]:
# I did not experiment with changing these.
# I took them from a Coursera course on NLP with TensorFlow.

# Vocabulary size of the tokenizer
vocab_size = 10000

# Maximum length of the padded sequences
max_length = 32

# Output dimensions of the Embedding layer
embedding_dim = 16


# Parameters for padding and OOV tokens
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
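
To make the 'pre' and 'post' settings concrete, here is a small NumPy sketch of what they mean for a single sequence. `pad_or_truncate` is a hypothetical helper written only for illustration, not the Keras implementation. Keras's `pad_sequences` defaults to 'pre' for both `padding` and `truncating`, so 'post' has to be passed explicitly.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre"):
    """Illustrative helper (hypothetical): zero-pad or cut `seq` to `maxlen`."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [0] * (maxlen - len(seq))
    # 'pre' puts the zeros in front, 'post' appends them.
    return np.array(pad + seq if padding == "pre" else seq + pad)

short_seq, long_seq = [5, 6, 7], [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short_seq, 4, padding="post"))     # [5 6 7 0]
print(pad_or_truncate(short_seq, 4))                     # [0 5 6 7]
print(pad_or_truncate(long_seq, 4, truncating="post"))   # [1 2 3 4]
print(pad_or_truncate(long_seq, 4))                      # [3 4 5 6]
```

With `max_length = 32` and 'post' truncation, only the first 32 tokens of each post are kept.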


## Final setup
Run the tokenizer to get the padded sequences for train and val,
as well as the labels for each.

In [7]:

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)

# Generate the word index dictionary for the training sentences
tokenizer.fit_on_texts(train["clean_text"])
word_index = tokenizer.word_index

# Generate and pad the training sequences
# (padding_type is passed explicitly; pad_sequences pads at the front by default)
train_sequences = tokenizer.texts_to_sequences(train["clean_text"])
train_padded = pad_sequences(train_sequences, maxlen=max_length,
                             padding=padding_type, truncating=trunc_type)

# Generate and pad the validation sequences
val_sequences = tokenizer.texts_to_sequences(val["clean_text"])
val_padded = pad_sequences(val_sequences, maxlen=max_length,
                           padding=padding_type, truncating=trunc_type)

# Convert the labels lists into numpy arrays
train_labels = np.array(train["is_depression"])
val_labels = np.array(val["is_depression"])
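
The roles of `num_words` and the OOV token can be sketched in plain Python. This is a rough imitation of the idea, not the Keras `Tokenizer` code: it skips the lowercasing and punctuation filtering that the real class does, and `build_word_index`, `to_sequence` and the example sentences are made up for illustration.

```python
from collections import Counter

OOV = "<OOV>"

def build_word_index(texts, oov_token=OOV):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(text, word_index, num_words, oov_token=OOV):
    """Map words to ids; unseen or too-rare words become the OOV id."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word, oov_id)
        # Only ids below num_words are kept, the rest collapse to OOV.
        ids.append(idx if idx < num_words else oov_id)
    return ids

toy_texts = ["i feel sad", "i feel fine", "i am fine"]
toy_index = build_word_index(toy_texts)
print(toy_index)
# {'<OOV>': 1, 'i': 2, 'feel': 3, 'fine': 4, 'sad': 5, 'am': 6}
print(to_sequence("i feel anxious", toy_index, num_words=4))
# [2, 3, 1]
```

Because the word index is built from the training text only, any word that appears only in the validation posts ends up as the OOV id, just like "anxious" here.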

## The model
The model is fairly simple:
* a single embedding layer  
* a flattening layer  
* a dense layer with 6 units and ReLU activation
* a single-unit dense layer with a sigmoid for the binary prediction

In [8]:


# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Setup the training parameters
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=.0015),
              metrics=['accuracy'])



In [9]:
# Print the model summary
model.summary()
print(model.optimizer.lr)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 32, 16)            160000    
                                                                 
 flatten (Flatten)           (None, 512)               0         
                                                                 
 dense (Dense)               (None, 6)                 3078      
                                                                 
 dense_1 (Dense)             (None, 1)                 7         
                                                                 
Total params: 163,085
Trainable params: 163,085
Non-trainable params: 0
_________________________________________________________________
<tf.Variable 'learning_rate:0' shape=() dtype=float32, numpy=0.0015>
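
The parameter counts in the summary can be checked by hand from the tokenizer and model settings. The short calculation below reproduces them; `hidden_units` is just a name introduced here for the 6 units of the first dense layer.

```python
vocab_size, embedding_dim, max_length, hidden_units = 10000, 16, 32, 6

embedding_params = vocab_size * embedding_dim           # one vector per token id
flat_size = max_length * embedding_dim                  # width of the Flatten output
dense_params = flat_size * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                    # weights + bias

total = embedding_params + dense_params + output_params
print(embedding_params, flat_size, dense_params, output_params, total)
# 160000 512 3078 7 163085
```

Almost all of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` dominate the model size.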


## Run the model
I tried a few variations on the number of epochs.  
Every run tends to be different,  
but I found that I got pretty good validation accuracy  
with between 6 and 12 epochs.

The train accuracy was pretty good in all the runs.

I also tried a few different learning rates.  
The default for Adam is .001, and I tried .01  
and .0015.  
With this small dataset and the small number of epochs  
I did not see a huge difference.

In [10]:
num_epochs = 10

# Train the model
model.fit(train_padded, train_labels, epochs=num_epochs, validation_data=(val_padded, val_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f208013dd80>

## Summary

With this small number of epochs it was not clear whether I was overfitting.  
On some runs it seemed so, but on others not.  

Maybe in my next project I will use a larger dataset and also 
try a hyperparameter tuning tool.
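
One way to make the overfitting question less of a judgment call is to look at the per-epoch numbers that `model.fit` returns in its `History` object. The sketch below finds the epoch with the lowest validation loss and the gap between train and validation accuracy at that point. `summarize_history` is a hypothetical helper, and the numbers in `fake_history` are made up for illustration; they are not results from this notebook.

```python
def summarize_history(history):
    """Return (best_epoch, train/val accuracy gap at that epoch)."""
    val_loss = history["val_loss"]
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = history["accuracy"][best] - history["val_accuracy"][best]
    return best + 1, gap  # epochs are 1-based, as Keras prints them

# Made-up numbers for illustration only; NOT results from this notebook.
fake_history = {
    "accuracy":     [0.80, 0.93, 0.97, 0.99],
    "val_accuracy": [0.85, 0.94, 0.95, 0.94],
    "val_loss":     [0.40, 0.20, 0.18, 0.22],
}

best_epoch, gap = summarize_history(fake_history)
print(best_epoch, round(gap, 2))
# 3 0.02
```

With the real model this would be called as `summarize_history(model.fit(...).history)`. If validation loss starts rising while train accuracy keeps climbing, that is the usual sign of overfitting, and Keras's `EarlyStopping` callback can stop training at that point.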

In [11]:
print(datetime.datetime.now())

2022-09-28 16:07:52.847285
