# **Group 14 Apply AI Project: _________________________________**                                   
Members: Torrence Barbour, Satwik Nijampudi, Shanthan Gunti, and Ryan Edwards

Given a set of Reddit posts labeled by whether or not the author had depression, can we train a model to detect depression in a social media user?

In [1]:
#We need to import some libraries to do this, including some big ones like numpy, pandas, and matplotlib
import os
import time
import re
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt


import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim

## Further Cleaning Our Data
We got our data from a Kaggle dataset: https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned. It came pre-cleaned, but we want to clean it further to suit our purposes (lemmatization, removing filler words, and more). We will do this now.

In [2]:
#First, we will need to bring in our dataset. We can use pandas for this
text = pd.read_csv('depression_dataset_reddit_cleaned.csv')
text.head()

Unnamed: 0,clean_text,is_depression
0,we understand that most people who reply immed...,1
1,welcome to r depression s check in post a plac...,1
2,anyone else instead of sleeping more when depr...,1
3,i ve kind of stuffed around a lot in my life d...,1
4,sleep is my greatest and most comforting escap...,1


In [3]:
#Next, we need a tool to tokenize and lemmatize our data. We will use spaCy
import spacy
!python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 18.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
#Just to confirm, let's print out one of the posts
print(text.iloc[2])

clean_text       anyone else instead of sleeping more when depr...
is_depression                                                    1
Name: 2, dtype: object


Now that we have our dataset and an NLP library loaded, we can start preparing the data. The first step is tokenization: splitting each post into its individual words and punctuation marks, called 'tokens', so that each one can be processed on its own.

In [5]:
#Let's tokenize the sample post from before as an example
sample_Tokens = nlp(text.iloc[2]['clean_text'])
for part in sample_Tokens: 
    print(part.text)

anyone
else
instead
of
sleeping
more
when
depressed
stay
up
all
night
to
avoid
the
next
day
from
coming
sooner
may
be
the
social
anxiety
in
me
but
life
is
so
much
more
peaceful
when
everyone
else
is
asleep
and
not
expecting
thing
of
you


Alongside tokenization, there is also lemmatization. In linguistics, a lemma is the base (dictionary) form of a word, before inflections such as conjugation or pluralization are applied. Reducing words to their lemmas lets different forms of the same word, such as 'sleeping' and 'sleep', be treated as one. spaCy also tags each token with its part of speech, which tells us what role the word plays in the sentence. Is it a noun? A verb? Let's see a few examples of both in code.

In [6]:
#The .lemma_ attribute of a spaCy token holds the lemma of that word. This will come in handy
print('The lemma of', sample_Tokens[18], 'is', sample_Tokens[18].lemma_)
print('The lemma of', sample_Tokens[4], 'is', sample_Tokens[4].lemma_)
print('The lemma of', sample_Tokens[19], 'is', sample_Tokens[19].lemma_)

The lemma of coming is come
The lemma of sleeping is sleep
The lemma of sooner is soon


In [7]:
#The .pos_ attribute holds the part of speech the word plays in the sentence.
print('The part of speech of the word', sample_Tokens[18].lemma_, 'is', sample_Tokens[18].pos_)
print('The part of speech of the word', sample_Tokens[19].lemma_, 'is', sample_Tokens[19].pos_)
print('The part of speech of the word', sample_Tokens[24].lemma_, 'is', sample_Tokens[24].pos_)

The part of speech of the word come is VERB
The part of speech of the word soon is ADV
The part of speech of the word anxiety is NOUN


Now, let's use these tools to clean our dataset further. What remains should be easy for later tools to work with.
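Before running the full cleaning loop, it can help to try the text-normalization step on a single string. The sketch below is only an illustration: `normalize` and the tiny `STOP_WORDS` set are hypothetical names made up for this example (spaCy ships its own, much larger stop-word list), and lemmatization is left out so the snippet runs without any language model.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# spaCy's `is_stop` flag relies on a much larger built-in list.
STOP_WORDS = {"i", "to", "and", "a", "the", "my", "it", "of", "me", "is", "so"}

def normalize(post):
    """Lowercase a post, replace non-letter runs with a space, drop stop words."""
    # Every run of characters that is not a letter or an apostrophe becomes one space
    letters_only = re.sub(r"[^A-Za-z']+", " ", post).lower()
    # split() with no arguments also discards leading, trailing, and repeated spaces
    return [tok for tok in letters_only.split() if tok not in STOP_WORDS]

example = "Life is SO much more peaceful... when everyone else is asleep!!"
tokens = normalize(example)
print(tokens)
```

Keeping the apostrophe in the character class means contractions such as "can't" survive as a single token instead of being split in two.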

In [8]:
#Create an empty list to store the cleaned posts in
cleaned_Posts = []

#Parse through every post
for uncleaned_Post in range(len(text)):
    to_Clean = text.iloc[uncleaned_Post].copy() #We need to make a copy or else we will mess with the original data
    
    #Replace anything that is not a letter or apostrophe with a space (using Python's re module), then lowercase
    #everything so capitalization is no longer a concern. Note that calling .replace('', ' ') on the post first
    #would put a space between every character and break each word into single letters, so we do not do that here
    to_Clean_More = re.sub("[^A-Za-z']+", ' ', to_Clean['clean_text']).lower()
    
    #Tokenize and lemmatize the post now, dropping stop words and whitespace tokens
    to_Clean_Even_More = nlp(to_Clean_More)
    absurdly_Clean_Now = [word.lemma_ for word in to_Clean_Even_More if not (word.is_stop or word.is_space)]
    
    #If more than one token is left in absurdly_Clean_Now then add it to the new column
    if len(absurdly_Clean_Now) > 1:
        to_Clean['really_Clean'] = ' '.join(absurdly_Clean_Now)
    
    #Now we just append the row to our cleaned_Posts list
    cleaned_Posts.append(to_Clean)
    
#Create a new data frame with the new column in it
clean_Text = pd.DataFrame(cleaned_Posts)

In [9]:
#If we save the dataset now we can save a lot of time later, so let's do that
clean_Text.to_csv('cleaned_depression_dataset.csv')
clean_Text.head()
print(clean_Text.iloc[0])


# Prepare for Training

In [10]:
#Count the words to get an idea of word frequency
from collections import Counter 

reviews = [review.split(' ') for review in list(clean_Text['clean_text'])]
word_freq = Counter([token for review in reviews for token in review]).most_common()

In [11]:
#let's run a check
word_freq[:10]

[('i', 40409),
 ('to', 17965),
 ('and', 16326),
 ('a', 12636),
 ('the', 11932),
 ('my', 11430),
 ('it', 9976),
 ('of', 7738),
 ('t', 7695),
 ('me', 6941)]

In [12]:
#words that were only seen once 
word_freq[-25:]

[('skooool', 1),
 ('bradie', 1),
 ('vegies', 1),
 ('vickybeeching', 1),
 ('warranty', 1),
 ('voided', 1),
 ('ooooooooooooh', 1),
 ('headddd', 1),
 ('shedfire', 1),
 ('mrsshedfire', 1),
 ('bleeeech', 1),
 ('cuprohastes', 1),
 ('advert', 1),
 ('misleading', 1),
 ('civ', 1),
 ('andn', 1),
 ('ophelia', 1),
 ('sorreh', 1),
 ('spek', 1),
 ('normalz', 1),
 ('moulin', 1),
 ('rouge', 1),
 ('pirro', 1),
 ('swatching', 1),
 ('cardi', 1)]

In [13]:
# map words that appear infrequently to a shared "unknown" index
word_freq = dict(word_freq)
print(len(word_freq))
min_freq = 5
word_dict = {}

# words seen more than min_freq times get their own index (starting at 1); all other words are sent to 0
i = 1
for word in word_freq:
    if word_freq[word] > min_freq:
        word_dict[word] = i
        i += 1
    else:
        word_dict[word] = 0

# dictionary length        
dict_length = max(word_dict.values()) + 1
dict_length



18848


4038
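The same thresholding logic can be checked on a toy corpus where the answer is easy to verify by hand. This is a minimal sketch: `build_vocab` and `toy_posts` are hypothetical names introduced for the example, not part of the notebook. As in the cell above, the comparison is strict, so with `min_freq = 5` a word needs at least 6 occurrences to get its own index.

```python
from collections import Counter

def build_vocab(tokenized_posts, min_freq):
    """Give each token seen more than `min_freq` times its own id (>= 1); all others get 0."""
    counts = Counter(tok for post in tokenized_posts for tok in post)
    vocab = {}
    next_id = 1
    # most_common() orders tokens from most to least frequent
    for token, count in counts.most_common():
        if count > min_freq:
            vocab[token] = next_id
            next_id += 1
        else:
            vocab[token] = 0  # shared "unknown" id for rare tokens
    return vocab

toy_posts = [["i", "feel", "tired"], ["i", "feel", "fine"], ["i", "sleep"]]
toy_vocab = build_vocab(toy_posts, min_freq=1)
# ids run from 0 to the largest id, so an embedding table needs max + 1 rows
vocab_size = max(toy_vocab.values()) + 1
print(toy_vocab, vocab_size)
```

One thing to keep in mind with this scheme is that index 0 ends up shared by rare words and, later on, by padding positions, so the model cannot tell the two apart.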

In [14]:
#To collect the tensors into batches, every post must be the same length, so we find the longest post and pad the others as needed
max_length = 0
for idx in tqdm(range(len(clean_Text))):
    row = clean_Text.iloc[idx]
    length = len(row['clean_text'].split(' '))
    if length > max_length:
        max_length = length
print(max_length)


  0%|          | 0/7731 [00:00<?, ?it/s]

4239
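Padding every post to the longest one means each input will have 4239 positions. The sketch below shows front padding on plain Python lists so the behaviour is easy to inspect. `front_pad` is a hypothetical helper, and its truncation of over-long inputs is an assumption added for illustration; the notebook itself only pads and never truncates.

```python
def front_pad(token_ids, max_len, pad_id=0):
    """Left-pad (or left-truncate) a list of ids so it has exactly `max_len` entries."""
    # Assumption for this sketch: if the post is too long, keep only its last `max_len` ids
    trimmed = token_ids[-max_len:] if max_len > 0 else []
    # Padding goes in front so the real tokens sit at the end of the sequence
    return [pad_id] * (max_len - len(trimmed)) + trimmed

short_post = front_pad([7, 3, 9], max_len=5)
long_post = front_pad([1, 2, 3, 4, 5, 6], max_len=4)
print(short_post)
print(long_post)
```

Front padding places the real words at the end of the sequence, so a recurrent model reads them last rather than finishing on a long run of padding.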


In [23]:


class DepressionDataSet(Dataset):
    def __init__(self, df, word_dict, max_length):
        self.df = df
        self.word_dict = word_dict
        self.max_len = max_length
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        review = row['clean_text'].split(' ')
        # embedding likes long tensors; index 0 doubles as the padding value
        x = torch.zeros(self.max_len, dtype=torch.long)
        
        # we want to front pad for RNN, so the word indices fill the end of the tensor
        offset = self.max_len - len(review)
        for pos, word in enumerate(review):
            # words missing from the dictionary fall back to the unknown index 0
            x[offset + pos] = self.word_dict.get(word, 0)
        
        # is_depression is stored as an integer (0 or 1), so convert it directly;
        # looking it up in a dictionary with string keys ('0', '1') raises KeyError: 1
        y = torch.tensor(float(row['is_depression']))
        
        return x, y

ds = DepressionDataSet(clean_Text, word_dict, max_length)
next(iter(ds))