Run the following commands only if the GloVe file is not already in the working directory; otherwise, skip the command-line section.

In [1]:
!wget "http://nlp.stanford.edu/data/glove.twitter.27B.zip"

--2019-08-04 14:57:36--  http://nlp.stanford.edu/data/glove.twitter.27B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.twitter.27B.zip [following]
--2019-08-04 14:57:36--  https://nlp.stanford.edu/data/glove.twitter.27B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.twitter.27B.zip [following]
--2019-08-04 14:57:36--  http://downloads.cs.stanford.edu/nlp/data/glove.twitter.27B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1520408563 (1.4G) [appli

In [4]:
!unzip -j glove.twitter.27B.zip

Archive:  glove.twitter.27B.zip
  inflating: glove.twitter.27B.25d.txt  
  inflating: glove.twitter.27B.50d.txt  
  inflating: glove.twitter.27B.100d.txt  
  inflating: glove.twitter.27B.200d.txt  
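
The check described at the top of this section can be automated. A minimal sketch, assuming the 100-dimensional file loaded later in this notebook is the one that matters; `needs_glove_download` is a hypothetical helper, not part of the original notebook:

```python
import os

def needs_glove_download(path='glove.twitter.27B.100d.txt'):
    """Return True when the GloVe file is missing and must be fetched."""
    return not os.path.isfile(path)

if needs_glove_download():
    print('GloVe file not found: run the wget and unzip cells above.')
else:
    print('GloVe file found: skip the command-line section.')
```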


In [2]:
import emoji
from emoji import UNICODE_EMOJI
import pandas as pd
import glob
import json
import numpy as np
import nltk
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Conv1D
from keras.layers import MaxPool1D
from keras.layers import LSTM
from keras.layers import Bidirectional
from keras.layers import Dropout
from keras.layers import SpatialDropout1D
from keras.callbacks import EarlyStopping
from keras.layers.embeddings import Embedding
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

Using TensorFlow backend.


# Conventional Cleaning
The function `clean_txt(tweet)` cleans the text. It converts the tweet to ASCII, removes usernames, converts to lowercase, removes links, strips punctuation, removes tokens that contain digits or other non-alphabetic characters, and returns the tweet as a single string.

In [62]:
import re
import string
from unicodedata import normalize
def clean_txt(tweet):
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    tweet = normalize('NFD', tweet).encode('ascii', 'ignore')
    tweet = tweet.decode('UTF-8')
    tweet = tweet.split()
    tweet = [str(word) for word in tweet]
    tweet = [word for word in tweet if word[0] != '@']
    tweet = [word.lower() for word in tweet]
    tweet = [word for word in tweet if word[0:4] != 'http']
    tweet = [re_punc.sub('', w) for w in tweet]
    tweet = [re_print.sub('', w) for w in tweet]
    tweet = [word for word in tweet if word.isalpha()]
    tweet = (' '.join(tweet))
    return tweet
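
The ASCII conversion at the start of `clean_txt` can be seen in isolation. A small sketch, where `to_ascii` is a hypothetical helper name that wraps the same two calls:

```python
from unicodedata import normalize

def to_ascii(text):
    """Decompose accented characters (NFD), then drop every non-ASCII code point."""
    return normalize('NFD', text).encode('ascii', 'ignore').decode('UTF-8')

print(repr(to_ascii('café naïve')))   # accents are stripped, base letters stay
print(repr(to_ascii('so sad 😭')))    # the emoji is dropped entirely
```

Because emojis are discarded at this step, any pipeline that wants to keep them has to skip the conversion.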

Discard retweets: any tweet that starts with 'RT' (or 'rt') is replaced with an empty string:

In [63]:
def rm_rt(tweet):
    rt = re.compile(r'^(RT|rt)')
    if rt.search(tweet):
        return ''
    return tweet
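
Because `rm_rt` matches any tweet whose first two letters are "rt", a cleaned tweet such as "rtfm ..." would also be blanked. A minimal sketch of a stricter check, assuming that only the standalone token should count as a retweet marker; `rm_rt_strict` and `RT_PREFIX` are hypothetical names, not part of the original notebook:

```python
import re

# \b requires a word boundary, so only the standalone token "rt" matches
RT_PREFIX = re.compile(r'^rt\b', flags=re.IGNORECASE)

def rm_rt_strict(tweet):
    """Blank out a retweet; keep tweets that merely start with the letters 'rt'."""
    return '' if RT_PREFIX.search(tweet) else tweet

print(repr(rm_rt_strict('rt so tired of everything')))
print(repr(rm_rt_strict('rtfm before you ask')))
```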

Import the data: this includes all tweets that have been classified (both manually and by an automated labeling service):

In [64]:
tweets = pd.read_csv('depression_final_data.csv')
tweets.head()

Unnamed: 0.1,Unnamed: 0,label,text
0,0,0.0,"In my personal opinion, if you plan on committ..."
1,1,0.0,@samtripoli they really going to allow the has...
2,2,0.0,That feel when you can't wait for work to be o...
3,3,0.0,"So, Let it be said that Amonute Matoaka Powhat..."
4,4,3.0,"https://t.co/t9BIoDCnm4 Also, Your Turn to Di..."


Drop any rows with NaN values, then clean the text column:

In [65]:
tweets = tweets.dropna()
tweets.text = tweets.text.apply(clean_txt).apply(rm_rt)
tweets.shape

(11538, 3)

In [66]:
tweets.to_csv('cleaned_data.csv')

# SMOTE Oversampling

SMOTE is a technique for generating additional samples of the minority class, in this case depressive tweets. While useful in theory, it did not improve results much compared with simply specifying class weights, so it **is not used**; it is included nonetheless to document the work done.

Inspired by: Mr. Theo Viel (Original article published under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) ) 
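
This section does not show how the class weights were specified. As an illustration only, the sketch below computes weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`; the helper name `balanced_class_weights` and the toy labels are assumptions:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# toy labels: three negatives for every positive
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

A dictionary of this shape is the kind of mapping that Keras `fit` accepts through its `class_weight` argument, so the rarer class contributes more to the loss.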

In [4]:
classified = pd.read_csv('cleaned_data.csv')[['label','text']]
classified.text = classified.text.astype(str)
classified.head()

Unnamed: 0,label,text
0,0.0,in my personal opinion if you plan on committi...
1,0.0,they really going to allow the hashtag clinton...
2,0.0,that feel when you cant wait for work to be ov...
3,0.0,so let it be said that amonute matoaka powhata...
4,3.0,also your turn to die is now officially finish...


Drop any rows with NaN values:

In [5]:
classified = classified.dropna()

Use pretrained GloVe embeddings to seed the embedding layer weights. Load the GloVe weights trained on the Twitter dataset.

In [8]:
embeddings_idx = dict()
# each line holds a token followed by its 100 coefficients; the file is UTF-8 text
with open('glove.twitter.27B.100d.txt', encoding='utf-8') as glove_file:
    for line in glove_file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_idx[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_idx))

Loaded 1193514 word vectors.
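
The GloVe text format is one token per line followed by its space-separated coefficients. A self-contained sketch of the same parsing loop on made-up data; `parse_glove` is a hypothetical name, the two sample lines are invented for illustration, and plain Python lists stand in for the `float32` arrays used above:

```python
import io

def parse_glove(lines):
    """Map each token to its list of float coefficients."""
    vectors = {}
    for line in lines:
        values = line.split()
        if not values:          # skip blank lines
            continue
        vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors

# invented three-dimensional vectors, only to show the file layout
sample = io.StringIO('sad 0.1 -0.2 0.3\nhappy 0.4 0.5 -0.6\n')
toy = parse_glove(sample)
print(toy['sad'])
```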


For simplicity, merge the "depressed" and "suicidal" labels into one category. This first iteration uses binary classification; future models can refine it:

In [9]:
text = np.asarray(classified.text)
labels_binary = classified.label.replace(to_replace=3, value=1)
label = np.asarray(classified.label)

np.unique(labels_binary)

array([0., 1.])

Tokenize the tweet texts and pad every tweet to a maximum length of 50 tokens. Given the 280-character limit and the cleaning above, this cap seems reasonable: by observation, most tweets still require padding.

In [10]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
# integer encode the documents
encoded = tokenizer.texts_to_sequences(text)
# pad documents to a max length of 50 tokens
max_length = 50
padded_tweets = pad_sequences(encoded, maxlen=max_length, padding='post')

In [11]:
padded_tweets

array([[  9,   8, 800, ...,   0,   0,   0],
       [ 34,  88,  90, ...,   0,   0,   0],
       [ 14, 100,  53, ...,   0,   0,   0],
       ...,
       [ 31,  18, 777, ...,   0,   0,   0],
       [686, 195,  12, ...,   0,   0,   0],
       [378,  10, 986, ...,   0,   0,   0]], dtype=int32)
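
`pad_sequences` with `padding='post'` appends zeros after the tokens, and its default `truncating='pre'` drops tokens from the front of sequences that are too long. A pure-Python sketch of that behaviour for a single sequence; `pad_post` is a hypothetical stand-in, not the Keras implementation, and it assumes `maxlen` is positive:

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the last maxlen tokens, then pad at the end up to maxlen."""
    kept = list(sequence)[-maxlen:]               # truncate from the front
    return kept + [value] * (maxlen - len(kept))  # pad at the back

print(pad_post([9, 8, 800], 5))
print(pad_post([1, 2, 3, 4, 5, 6, 7], 5))
```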

Split into train/validation sets of 90%/10%:

In [12]:
text_train, text_test, label_train, label_test = train_test_split(padded_tweets, labels_binary, test_size = 0.1, random_state = 0)

Initialize embedding matrix to seed embedding weights from the loaded GloVe embeddings:

In [13]:
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_idx.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
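
Words that have no GloVe vector keep an all-zero row in `embedding_matrix`, so it can be useful to know how much of the vocabulary is covered. A minimal sketch; `glove_coverage` and the toy dictionaries are hypothetical and only mimic the shapes of `tokenizer.word_index` and `embeddings_idx`:

```python
def glove_coverage(word_index, embeddings):
    """Return the fraction of vocabulary words that have a pretrained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in embeddings)
    return found / len(word_index)

# toy stand-ins for tokenizer.word_index and embeddings_idx
toy_word_index = {'sad': 1, 'tired': 2, 'ohrwurm': 3, 'welli': 4}
toy_embeddings = {'sad': [0.1], 'tired': [0.2], 'welli': [0.3]}
print(glove_coverage(toy_word_index, toy_embeddings))
```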

In [14]:
# Place word indices in a dictionary
index_word = {0: ''}
for word in tokenizer.word_index.keys():
    index_word[tokenizer.word_index[word]] = word

In [15]:
from sklearn.neighbors import NearestNeighbors

# initialise Nearest Neighbor from Sklearn and fit on the embedding matrix created
nn = NearestNeighbors(n_neighbors=6).fit(embedding_matrix) 

In [17]:
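# find the 6 nearest rows for every word; kneighbors returns (distances, indices), keep the indices
neighbours_mat = nn.kneighbors(embedding_matrix[1:vocab_size])[1]
# the closest row is normally the word itself, so it becomes the key and the other 5 are its neighbours
neighbours = {entry[0]: entry[1:] for entry in neighbours_mat}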
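
Keying the dictionary on `entry[0]` assumes that each word's closest row is the word itself. That holds for words with a distinct vector, but words without a GloVe vector all share the zero row with the padding index, and the order among exact ties should not be relied upon. A brute-force sketch on invented 2-d vectors illustrates the effect; `nearest_indices` is a hypothetical helper, not the scikit-learn implementation:

```python
import math

def nearest_indices(vectors, query_idx, k):
    """Return the indices of the k rows closest to vectors[query_idx] (Euclidean)."""
    query = vectors[query_idx]
    order = sorted(range(len(vectors)), key=lambda j: math.dist(query, vectors[j]))
    return order[:k]

toy_vectors = [
    [0.0, 0.0],   # index 0: padding row, all zeros
    [1.0, 1.0],   # index 1
    [1.1, 0.9],   # index 2: close to index 1
    [5.0, 5.0],   # index 3
    [0.0, 0.0],   # index 4: a word without a pretrained vector, also all zeros
]

print(nearest_indices(toy_vectors, 1, 2))   # the word itself comes first
print(nearest_indices(toy_vectors, 4, 2))   # a tie: index 0 is listed before index 4
```

This is one reason the lookup that follows checks `if index in neighbours` before printing.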

In [18]:
# show how it works by providing some examples
for index in np.random.randint(1, vocab_size, 20):
    if index in neighbours:
        print("{} : {}".format(str(index_word[index]), str([index_word[neighbours[index][i]] for i in range(5)])))

welli : ['naaa', 'hahahhahahahahah', 'yame', 'daaaang', 'jhn']
nowt : ['noting', 'nothig', 'sth', 'reckon', 'tbf']
apni : ['apne', 'kaam', 'karta', 'nahi', 'pehlay']
hummed : ['exclaimed', 'groaned', 'audibly', 'nudged', 'nonchalantly']
mole : ['pow', 'poo', 'schumachers', 'youpitch', 'wmgcamp']
buried : ['trapped', 'bury', 'burying', 'laid', 'coffin']
ohrwurm : ['caeczka', 'thedemocrats', 'tootime', 'suicidality', 'supportsmallstreamers']
safely : ['landed', 'arrived', 'returned', 'heading', 'safe']
colloidal : ['categorical', 'methamphetamine', 'hrk', 'rbg', 'strawman']
hurt : ['hurts', 'hurting', 'feel', 'wont', 'reason']
gallon : ['bucket', 'container', 'bottle', 'keg', 'pump']
yee : ['hhaha', 'hah', 'coy', 'yah', 'yeh']
stepped : ['slipped', 'jumped', 'stood', 'pushed', 'kicked']
aquemini : ['qotsa', 'impetus', 'firstclass', 'onetime', 'tbqh']
destructive : ['damaging', 'chaotic', 'partisan', 'dangerous', 'rhetoric']
tolerant : ['intolerant', 'egocentric', 'economically', 'likable

In [19]:
text_train_pos = text_train[label_train==1]

In [20]:
# loop over the sentence and, with probability 0.5, swap each word for one of its nearest neighbours
# note: the sentence is modified in place
def change_sentence(sentence, neighbours):
    for i in range(len(sentence)):
        # never replace the padding index 0
        if sentence[i] != 0 and np.random.random() > 0.5:
            try:
                syns = neighbours[sentence[i]]
                sentence[i] = np.random.choice(syns)
            except KeyError:
                pass
    return sentence
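
`change_sentence` writes into the array it receives, which is safe only when the caller passes a copy. A minimal sketch of a variant that returns a new list, leaves padding (index 0) alone, and takes an optional seed for reproducibility; `change_sentence_copy`, its `p` and `seed` parameters, and the toy neighbour table are all hypothetical:

```python
import random

def change_sentence_copy(sentence, neighbours, p=0.5, seed=None):
    """Return a new list where each token is swapped for a neighbour with probability p."""
    rng = random.Random(seed)
    result = list(sentence)                  # work on a copy; the input stays untouched
    for i, token in enumerate(result):
        candidates = neighbours.get(token)
        has_candidates = candidates is not None and len(candidates) > 0
        if token != 0 and has_candidates and rng.random() < p:
            result[i] = rng.choice(list(candidates))
    return result

toy_neighbours = {1: [2, 3], 2: [1, 3]}
original = [1, 2, 0, 0]
augmented = change_sentence_copy(original, toy_neighbours, seed=0)
print(original)    # unchanged
print(augmented)
```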

In [27]:
indicies = np.random.randint(0, text_train_pos.shape[0], 10)

In [28]:
for sentence in text_train_pos[indicies]:
    sample =  np.trim_zeros(sentence)
    original_sentence = ' '.join([index_word[index] for index in sample])
    print(original_sentence)

    modified = change_sentence(sample, neighbours)
    sentence_modified = ' '.join([index_word[index] for index in modified])
    print(sentence_modified)
    
    print(' ')

just wanna sleep forever
think wanna sleep forever
 
never again never again youre killing me slow but i aint ready to die mtvhottest camila cabello
never yet never right youre killing me low but but aint ready going als moots camila celeste
 
im better off dead thats the damn truth right now
im than up dead its the shit truth way again
 
if i commit suicide at least a real nigga killed me
know you committed attack from either another real bruh killed me
 
my car fucked up at the worst time possible not only in the freeway but also a day before clean culture i throw the white flag someone please kill me now
this truck fucks up at the worst break certain not if in the freeway but also a year after clean culture i throw the red parade someone pls hell me still
 
im tired of living in florida this bland ass state
im bored of living the florida this crass ass ohio
 
my suicide note
my suicide note
 
cleaned my room still depressed will try again next week
cleaned it floor really depressed 

In [23]:
n_texts = 5000
indexes = np.random.randint(0, text_train_pos.shape[0], n_texts)
X_gen = np.array([change_sentence(x, neighbours) for x in text_train_pos[indexes]])
y_gen = np.ones(n_texts)

In [24]:
text_train = np.concatenate((X_gen, text_train),axis=0)
label_train = np.concatenate((y_gen, label_train),axis=0)

In [25]:
print(text_train.shape)
print(label_train.shape)

(15384, 50)
(15384,)


In [85]:
df = pd.DataFrame(text_train)
df['label'] = label_train
df.to_csv('overclassed_data_trained.csv')

# Emoji Embedding

Beyond the pre-trained GloVe weights for conventional English words, emojis carry meaning of their own and have been shown to aid sentiment analysis on Twitter data, so some model designs took them into account. Every emoji that appears in the Twitter dataset is assigned a random vector of length 100 in the GloVe vector space. With the embedding layer set to be trainable, the emoji vectors could, in theory, be learned so that their relationship to other words emerges. In experiments, however, these models showed no significant difference in predictive ability from their counterparts that used the pretrained vector space alone, so the emoji embeddings **are ultimately not used.**


Check whether a token is an emoji:

In [2]:
def is_emoji(s):
    return s in emoji.UNICODE_EMOJI

In [3]:
import re
import string
from unicodedata import normalize


# clean a single tweet, keeping each emoji as its own token
def clean_txt(tweet):
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # put a space between adjacent emojis so that each one becomes a separate token
    tweet = emoji.emojize(emoji.demojize(tweet).replace('::',': :'))
    # ASCII conversion is disabled here because it would strip the emojis
    #     tweet = normalize('NFD', tweet).encode('ascii', 'ignore')
    #     tweet = tweet.decode('UTF-8')
    # tokenize on white space
    tweet = tweet.split()
    tweet = [str(word) for word in tweet]
    # remove usernames
    tweet = [word for word in tweet if word[0] != '@']
    # convert to lowercase
    tweet = [word.lower() for word in tweet]
    # remove links
    tweet = [word for word in tweet if word[0:4] != 'http']
    # remove punctuation from each token
    tweet = [re_punc.sub('', w) for w in tweet]
    # remove non-printable chars from each token, except emojis
    tweet = [re_print.sub('', w) if not is_emoji(w) else w for w in tweet]
    # blank out tokens with numbers in them, except emojis
    tweet = ['' if not word.isalpha() and not is_emoji(word) else word for word in tweet ]
    # store as string and collapse the spaces left by blanked tokens
    tweet = (' '.join(tweet))
    return ' '.join(tweet.split())

In [4]:
clean_txt('@HoForBangtan Pain is bad makes me wann3a die 😭😩🥺\nI eagerly await the next update')

'pain is bad makes me die 😭 😩 🥺 i eagerly await the next update'
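
The `'::'` to `': :'` replacement is what splits emojis typed with no space between them: once converted to `:shortcode:` form, two adjacent emojis meet at `::`. A sketch on a hand-written shortcode string, so it runs without the `emoji` package; `separate_shortcodes` and the sample text are hypothetical:

```python
def separate_shortcodes(demojized):
    """Insert a space wherever one shortcode ends and the next one begins."""
    return demojized.replace('::', ': :')

# hand-written stand-in for demojized text with two adjacent emojis
text = 'makes me cry :loudly_crying_face::weary_face:'
print(separate_shortcodes(text).split())
```

After the split, each shortcode is a separate whitespace-delimited token, which is why the emojis in the cleaned tweet above appear as individual tokens.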

Discard retweets: any tweet that starts with 'RT' (or 'rt') is replaced with an empty string:

In [5]:
#remove RT:
def rm_rt(tweet):
    rt = re.compile(r'^(RT|rt)')
    if rt.search(tweet):
        return ''
    return tweet

In [None]:
classified = pd.read_csv('cleaned_data.csv')[['label','text']]
classified.text = classified.text.astype(str)
classified.head()

Use pretrained GloVe embeddings to seed the embedding layer weights. Load the GloVe weights trained on the Twitter dataset.

In [7]:
embeddings_idx = dict()
# each line holds a token followed by its 100 coefficients; the file is UTF-8 text
with open('glove.twitter.27B.100d.txt', encoding='utf-8') as glove_file:
    for line in glove_file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_idx[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_idx))

Loaded 1193514 word vectors.


For simplicity, merge the "depressed" and "suicidal" labels into one category. This first iteration uses binary classification; future models can refine it:

In [69]:
text = np.asarray(classified.text)
labels_binary = classified.label.replace(to_replace=3, value=1)
label = np.asarray(classified.label)

np.unique(labels_binary)

array([0., 1.])

Tokenize the tweet texts and pad every tweet to a maximum length of 50 tokens. Given the 280-character limit and the cleaning above, this cap seems reasonable: by observation, most tweets still require padding.

In [70]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
# integer encode the documents
encoded = tokenizer.texts_to_sequences(text)
# pad documents to a max length of 50 tokens
max_length = 50
padded_tweets = pad_sequences(encoded, maxlen=max_length, padding='post')

In [71]:
padded_tweets

array([[  9,   8, 800, ...,   0,   0,   0],
       [ 34,  88,  90, ...,   0,   0,   0],
       [ 14, 100,  53, ...,   0,   0,   0],
       ...,
       [ 31,  18, 777, ...,   0,   0,   0],
       [686, 195,  12, ...,   0,   0,   0],
       [378,  10, 986, ...,   0,   0,   0]], dtype=int32)

Split into train/validation sets of 90%/10%:

In [72]:
text_train, text_test, label_train, label_test = train_test_split(padded_tweets, labels_binary, test_size = 0.1, random_state = 0)

Initialize the embedding matrix to seed the embedding weights from the loaded GloVe embeddings.

When emojis are included in training the embedding layer, any emoji without a GloVe vector is initialised with a random vector:

In [73]:
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_idx.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    elif is_emoji(word):
        # seed from the emoji's code points so that its random vector is reproducible
        np.random.seed(np.sum([ord(char) for char in word]))
        embedding_matrix[i] = np.random.uniform(-1, 1, 100)

The resulting embedding matrix can be used in the embedding layer with `trainable = True`, but, again, this did not improve the results meaningfully and is therefore discarded.
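
Seeding the generator from the sum of an emoji's code points makes its initial vector reproducible across runs, at the cost that tokens with equal sums start from the same vector. A sketch of the idea using the standard-library `random` module in place of `np.random`, so the actual numbers differ from the notebook's; `emoji_seed` and `emoji_vector` are hypothetical names:

```python
import random

def emoji_seed(token):
    """Derive a deterministic seed from the token's Unicode code points."""
    return sum(ord(char) for char in token)

def emoji_vector(token, dim=100):
    """Draw a reproducible vector with components between -1 and 1."""
    rng = random.Random(emoji_seed(token))
    return [rng.uniform(-1, 1) for _ in range(dim)]

first = emoji_vector('😭')
second = emoji_vector('😭')
print(first == second)                        # same token, same vector
print(emoji_seed('ab') == emoji_seed('ba'))   # different tokens can share a seed
```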