<a href="https://colab.research.google.com/github/irxjxv/DataMining_Assessment/blob/main/Twitter_Depression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The goal of this project is to detect depression from tweets. The training dataset was scraped from Twitter using Twint. The testing dataset comes from Kaggle and is a collection of random tweets with sentiment scores.

The training dataset will be cleaned and preprocessed for exploratory data analysis, and then used to build a model that is applied to the testing dataset.

In [4]:
# install Twint for scraping data from Twitter

import os

!pip install twint
!pip uninstall twint -y
!git clone --depth=1 https://github.com/twintproject/twint.git
%cd twint/
!pip3 install . -r requirements.txt
!pip install neattext

import twint
import pandas as pd
import nest_asyncio
nest_asyncio.apply()
import time 
import datetime as dt
from glob import glob
import neattext.functions as neat_text
import string


Collecting twint
  Using cached twint-2.1.20-py3-none-any.whl
Installing collected packages: twint
Successfully installed twint-2.1.20


Found existing installation: twint 2.1.20
Uninstalling twint-2.1.20:
  Successfully uninstalled twint-2.1.20
fatal: destination path 'twint' already exists and is not an empty directory.
/content/twint/twint/twint
Processing /content/twint/twint/twint
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: twint
  Building wheel for twint (setup.py) ... done
  Created wheel for twint: filename=twint-2.1.21-py3-none-any.whl size=38871 sha256=7e4f97a7585413f86f4b2ca59fddc4f5661cec8118a8a08532c23a81f35f4449
  Stored in directory: /tmp/pip-ephem-wheel-cache-xs5l69e8/wheels/01/9e/ce/3e992f856c29a9



In [None]:
# creating a function to get tweets using Twint

def twint_search(search_term, since, until, save_path):
    c = twint.Config()
    c.Search = search_term
    c.Lang = "en"
    c.Since = since.strftime('%Y-%m-%d %H:%M:%S')
    c.Until = until.strftime('%Y-%m-%d %H:%M:%S')
    c.Hide_output = True
    c.Store_csv = True
    c.Output = save_path
    twint.run.Search(c)
    
def twint_search_loop(search_term, start_date, end_date, save_dir):
    try:
        os.makedirs(os.path.join(os.getcwd(),save_dir,search_term))
        print(f'Successfully created the directory {os.path.join(os.getcwd(),save_dir,search_term)}')
    except FileExistsError:
        print(f'Directory {os.path.join(os.getcwd(),save_dir,search_term)} already exists')
    
    date_range = pd.date_range(start_date, end_date)
    
    for single_date in date_range:
        since = single_date
        until = single_date + dt.timedelta(days=1)
        save_path = os.path.join(save_dir, search_term, f'{single_date:%Y%m%d}.csv')
        print(f"Searching for tweets containing '{search_term}' from {single_date:%Y-%m-%d} and saving into {save_path}")
        twint_search(search_term, since, until, save_path)

In [None]:
# collect tweets containing keywords such as depressed, suicide, etc.
# tweets are gathered from 2008 to 2022 to reach about 100k rows and are saved to Google Drive, one CSV file per day.

search_term = "depressed OR kill me OR suicide OR want to die OR lonely OR antidepressant OR hopeless OR sadness OR death OR worthless"
start_date = dt.datetime(2008, 1, 1)
end_date = dt.datetime(2022, 2, 13)
save_dir = 'drive/MyDrive/data/'

# run search
twint_search_loop(search_term, start_date, end_date, save_dir)
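
The loop above queries one day at a time, so each CSV holds the tweets from a single 24-hour window. A minimal sketch of how those windows and file names are derived, using only the standard library in place of `pd.date_range`. The helper name `daily_windows`, the three-day range and the `'data'`/`'example term'` path parts are hypothetical and chosen for illustration only:

```python
import datetime as dt
import os


def daily_windows(start_date, end_date):
    """Yield (since, until) pairs for each day from start_date to end_date inclusive."""
    day = start_date
    while day <= end_date:
        yield day, day + dt.timedelta(days=1)
        day += dt.timedelta(days=1)


# Illustrative three-day range; the notebook itself spans 2008-01-01 to 2022-02-13.
windows = list(daily_windows(dt.datetime(2022, 2, 11), dt.datetime(2022, 2, 13)))

# One CSV per day, named after the day the window starts on (hypothetical directory names).
paths = [os.path.join('data', 'example term', f'{since:%Y%m%d}.csv') for since, _ in windows]

for (since, until), path in zip(windows, paths):
    print(f'{since:%Y-%m-%d %H:%M:%S} -> {until:%Y-%m-%d %H:%M:%S}  {path}')
```

Splitting the search into daily windows keeps each request small and means an interrupted run loses at most one day of results.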



In [34]:
# get the location of where the tweets are stored and combine them together in 1 csv file
from glob import glob
import os
import pandas as pd

search_term = "depressed OR kill me OR suicide OR want to die OR lonely OR antidepressant OR hopeless OR sadness OR death OR worthless"
save_dir = 'drive/MyDrive/data/'


csv_files = glob(os.path.join(save_dir, search_term, '*.csv'))

# create DataFrames for each CSV file and combine into a single df
dfs = [pd.read_csv(csv_file) for csv_file in csv_files]
tweets_df = pd.concat(dfs).reset_index(drop=True)



**The combined DataFrame contains 101387 rows and 36 columns**



In [None]:
tweets_df.shape

(101387, 36)

In [None]:
# These are the columns

for col in tweets_df.columns:
    print(col)

**The first pre-processing step is to remove unnecessary columns**

In [35]:
tweets_df.drop(['conversation_id',
          'created_at',
          'timezone',
          'user_id',
          'username',
          'name',
          'place',
          'language',
          'mentions',
          'urls',
          'photos',
          'replies_count',
          'retweets_count',
          'likes_count',
          'hashtags',
          'cashtags',
          'link',
          'retweet',
          'quote_url',
          'video',
          'thumbnail',
          'near',
          'geo',
          'source',
          'user_rt_id',
          'user_rt',
          'retweet_id',
          'reply_to',
          'retweet_date',
          'translate',
          'trans_src',
          'trans_dest'], 
               axis = 1, inplace = True)

In [None]:
# remaining columns will be id, date, time and tweet

tweets_df.head()

In [None]:
# check if there are any missing values

tweets_df.isnull().any().any()  

In [None]:
# show_counts replaces the deprecated null_counts argument (pandas >= 1.2)
tweets_df.info(show_counts=True)
tweets_df.dtypes

Cleaning up the data further by removing noise such as:

*   hashtags
*   @ or userhandles
*   URLs
*   multiple spaces
*   numbers

In [36]:
import neattext.functions as neat_text

# strip hashtags, user handles and URLs, then collapse the repeated spaces left behind
tweets_df['clean_tweet'] = tweets_df['tweet'].apply(neat_text.remove_hashtags)
tweets_df['clean_tweet'] = tweets_df['clean_tweet'].apply(neat_text.remove_userhandles)
tweets_df['clean_tweet'] = tweets_df['clean_tweet'].apply(neat_text.remove_urls)
tweets_df['clean_tweet'] = tweets_df['clean_tweet'].apply(neat_text.remove_multiple_spaces)
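
For readers without `neattext`, the four steps above can be approximated with plain regular expressions. This is a rough sketch, not neattext's actual implementation: the patterns and the function name `clean_tweet_text` are assumptions and may differ from the library on edge cases.

```python
import re

# Assumed patterns; neattext's own expressions may be stricter or looser.
HASHTAG = re.compile(r'#\w+')
HANDLE = re.compile(r'@\w+')
URL = re.compile(r'https?://\S+')
SPACES = re.compile(r'\s{2,}')


def clean_tweet_text(text):
    """Remove hashtags, user handles and URLs, then collapse runs of whitespace."""
    text = HASHTAG.sub('', text)
    text = HANDLE.sub('', text)
    text = URL.sub('', text)
    text = SPACES.sub(' ', text)
    return text


sample = '@someone I feel   so tired #mood https://t.co/abc123 today'
print(repr(clean_tweet_text(sample)))
```

A single leading space survives in this sketch because only runs of two or more whitespace characters are collapsed, which is consistent with the leading spaces visible in the notebook's `clean_tweet` output. Calling `.strip()` afterwards would remove it.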


Turning texts into lowercase to maintain consistency:

In [37]:
# lowercase the cleaned text; no loop over the columns is needed
tweets_df['clean_tweet'] = tweets_df['clean_tweet'].str.lower()

After removing noise, duplicates are dropped so that only one copy of each distinct clean_tweet value is kept. About 90K tweets remain after deduplication.


In [49]:
tweets_df.sort_values("clean_tweet", inplace=True)
unique_tweets_df = tweets_df.drop_duplicates(subset=["clean_tweet"],keep = 'first')
unique_tweets_df

Unnamed: 0,id,date,time,tweet,clean_tweet
63683,280071086589231104,2012-12-15,22:05:25,#i #hate #myself #and #want #to #die #suicide ...,
49786,1063552739160973312,2018-11-16,22:01:47,@ItsTimiDuhh !! Y'all don't want every girl to...,!! y'all don't want every girl to come to you...
43817,953035217039712257,2018-01-15,22:44:36,@tonykill_ !!! i want to see tony kill live be...,!!! i want to see tony kill live before i die
18765,101795350565765120,2011-08-11,23:21:17,@libertyflintxx !!! like ever!! I'm scared I'm...,!!! like ever!! i'm scared i'm gonna fall off...
92853,588737680323645440,2015-04-16,16:16:15,@trashboatTsukki !!!!!!!!! SHET ARE YOU AIMING...,!!!!!!!!! shet are you aiming to kill me. bec...
...,...,...,...,...,...
40938,900117904351535104,2017-08-22,22:10:06,🤤🤤 death by asphyxiation .. talk dirty to me R...,🤤🤤 death by asphyxiation .. talk dirty to me r...
47950,1035301077971820544,2018-08-30,22:59:46,🥀 One of my weaknesses is my lack of a green t...,🥀 one of my weaknesses is my lack of a green t...
73853,1254494584819810306,2020-04-26,19:36:31,🥺🥺 I’m too selfish to die for someone I love m...,🥺🥺 i’m too selfish to die for someone i love m...
84944,1454950380379508738,2021-10-31,23:16:17,🦎♊🐉😈🌞🌚🌏🌍🌎🐕 saying I can make people depressed ...,🦎♊🐉😈🌞🌚🌏🌍🌎🐕 saying i can make people depressed ...
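
How `drop_duplicates` with `keep='first'` behaves can be seen on a toy frame. The rows below are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Two different raw tweets that become identical once the user handle is removed.
toy = pd.DataFrame({
    'tweet': ['@a so lonely', '@b so lonely', 'feeling hopeless'],
    'clean_tweet': [' so lonely', ' so lonely', 'feeling hopeless'],
})

# Only clean_tweet is compared; the first row of each duplicate group is kept.
unique_toy = toy.drop_duplicates(subset=['clean_tweet'], keep='first')
print(unique_toy)
```

The surviving rows keep their original index labels, which is why the index in the output above is not consecutive. `reset_index(drop=True)` can be chained on if a clean 0..n-1 index is wanted.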


Contractions need to be expanded for text standardisation.

In [5]:
# a list of contractions and their expanded form

contractions = { 
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}


In [19]:

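# expand longer contractions first so that e.g. "can't've" is not partially replaced by "can't"
mapping = dict(sorted(contractions.items(), key=lambda kv: len(kv[0]), reverse=True))
tweets_df['clean_tweet'] = tweets_df['clean_tweet'].replace(mapping, regex=True)
print(tweets_df)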



                        id  ...                                        clean_tweet
0       438460348073795584  ...  gogeta told morrigan to let him help her to ki...
1       438459161240539136  ...   to beautiful to be depressed. they have no re...
2       438458283452428289  ...  am alone, lonely, &amp; depressed because no o...
3       438458006800330752  ...  “ "hmm “ before i die i want to kill chris sma...
4       438455764886097920  ...  “ before i die i want to kill chris smalling”o...
...                    ...  ...                                                ...
101382  759861573888258050  ...  "kill me. please" "screw you" damon begging to...
101383  759859411397414912  ...   do you want me to die? cause like my parents ...
101384  759855097786478592  ...   by trying to kill me in secret for my father'...
101385  759853846457831424  ...  this isnt real right just kill me im going to ...
101386  759853700416311296  ...  guy walking near me has the kill bill whistle ...

[10

In [None]:
tweets_df['clean_tweet']


0         gogeta told morrigan to let him help her to ki...
1          to beautiful to be depressed. they have no re...
2         am alone, lonely, &amp; depressed because no o...
3         “ "hmm “ before i die i want to kill chris sma...
4         “ before i die i want to kill chris smalling”o...
                                ...                        
101382    "kill me. please" "screw you" damon begging to...
101383     do you want me to die? cause like my parents ...
101384     by trying to kill me in secret for my father'...
101385    this isnt real right just kill me im going to ...
101386    guy walking near me has the kill bill whistle ...
Name: clean_tweet, Length: 101387, dtype: object

In [None]:
tweets_df.iloc[647]

id                                            448966550192930816
date                                                  2014-03-26
time                                                    23:35:41
tweet          i’m depressed as fuck, i’m tired of  being lik...
clean_tweet    i’m depressed as fuck, i’m tired of being like...
Name: 647, dtype: object
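
Row 647 above still contains `i’m` because the tweet uses a typographic apostrophe (’), while the keys of `contractions` use the straight one ('). One way to handle this, sketched here with the standard `re` module on a small subset of the dictionary, is to normalise the apostrophe first and then try longer contractions before shorter ones. The helper name `expand_contractions` and the three-entry dictionary are illustrative assumptions, and the input is assumed to be lowercased already:

```python
import re

# Small illustrative subset of the contractions dictionary.
small_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "i'm": "i am",
}

# Sort keys longest-first so "can't've" is tried before its prefix "can't".
pattern = re.compile(
    '|'.join(re.escape(key) for key in sorted(small_contractions, key=len, reverse=True))
)


def expand_contractions(text):
    """Normalise curly apostrophes, then expand every known contraction."""
    text = text.replace('\u2019', "'")
    return pattern.sub(lambda match: small_contractions[match.group(0)], text)


print(expand_contractions('i\u2019m depressed, i can\u2019t sleep'))
```

This sketch does not add word boundaries around the contractions, so a fuller version might wrap the alternation in `\b` anchors to avoid matching inside longer tokens.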

In [52]:
unique_tweets_df[unique_tweets_df['clean_tweet'].str.contains("dont")] 

Unnamed: 0,id,date,time,tweet,clean_tweet
43616,948669246434873344,2018-01-03,21:35:48,"@AngelOfEreri ""but you'll die first. i want to...","""but you'll die first. i want to hear you apo..."
64374,289875944322236416,2013-01-11,23:26:25,"http://t.co/gHUfc0Zg ""Fat, Worthless, Whore,...","""fat, worthless, whore, slut, immature, disgu..."
92355,577610776183074816,2015-03-16,23:21:55,"@SC_Erwin_Smith ""Help me LEVI IS GOING TO KILL...","""help me levi is going to kill me please hell..."
56915,1208174701425676295,2019-12-20,23:57:50,"@djoats02 @Csillabubu ""hey im seriously strugg...","""hey im seriously struggling and ive been con..."
94506,616693828259459073,2015-07-02,19:44:00,"@Hatred_Reaper ""I dont die, no one will be abl...","""i dont die, no one will be able to kill me o..."
...,...,...,...,...,...
50431,1081688653833912320,2019-01-05,23:07:26,⑤ oh i think id die right away.. like i dont t...,⑤ oh i think id die right away.. like i dont t...
66389,332981710906073088,2013-05-10,22:13:21,♡ its funny how people always ignore me and do...,♡ its funny how people always ignore me and do...
46252,1008088724570361856,2018-06-16,20:47:35,🆘🆘🆘 IAM A VERY HAPPY BOY BUT I WILL DIE IF SOM...,🆘🆘🆘 iam a very happy boy but i will die if som...
46255,1008083954942038016,2018-06-16,20:28:38,🆘🆘🆘I DONT WANT TO DIE BUT NOBODY WANTS ME. I A...,🆘🆘🆘i dont want to die but nobody wants me. i a...


In [None]:
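# replace curly apostrophes inside each tweet; Series.replace would only match whole cell values
# run this before expanding contractions so that forms such as i’m are matched
tweets_df['clean_tweet'] = tweets_df['clean_tweet'].str.replace("’", "'", regex=False)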

NameError: ignored
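
It is worth noting that `Series.replace` with a plain string only replaces cells whose entire value equals that string, whereas `Series.str.replace` substitutes inside each string. A small sketch on made-up data:

```python
import pandas as pd

# First cell contains a curly apostrophe inside text; second cell is only the apostrophe.
s = pd.Series(['i\u2019m tired', '\u2019'])

# Whole-value replacement: only the second cell matches.
whole_cell = s.replace('\u2019', "'")

# Substring replacement: every occurrence inside every cell is replaced.
substring = s.str.replace('\u2019', "'", regex=False)

print(whole_cell.tolist())
print(substring.tolist())
```

For normalising apostrophes inside tweets, the substring form is the one that has an effect.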