## Scrape Potentially Depressive Tweets from Twitter

We would like to gather data from Twitter based on depressive hashtags such as #depressed, #depression, #loneliness and #hopelessness,
then apply various filtering techniques to remove non-depressive messages.
The result of this script is a dataset containing a filtered collection of tweets that are potentially depressive. The script also removes all hashtags from the tweet text, so that a machine learning model cannot cheat by simply looking for depressive hashtags.
The final dataset will be manually reviewed and labelled, so that both the depressive and non-depressive messages within it are correctly marked.

In [3]:
!pip install nest_asyncio



In [4]:
!pip install twint



In [0]:
import nest_asyncio
nest_asyncio.apply()
import pandas as pd
import twint

In [0]:
import re

In [7]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [8]:
# Collect tweets for each depressive hashtag.
# Note: Twint's `Year` option filters for tweets posted *before* the given year,
# which is why the rows shown further down are dated 2014.

depress_tags = ["#depressed", "#depression", "#loneliness", "#hopelessness"]

for tag in depress_tags:
    print(tag)
    c = twint.Config()

    c.Format = "Tweet id: {id} | Tweet: {tweet}"
    c.Search = tag
    c.Limit = 1000
    c.Year = 2015  # only tweets posted before 2015
    c.Store_csv = True
    c.Store_Object = True
    c.Output = "/content/gdrive/My Drive/data/dataset_en_all7.csv"
    c.Hide_output = True
    c.Stats = True
    c.Lowercase = True
    c.Filter_retweets = True
    twint.run.Search(c)

#depressed
#depression
#loneliness


CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)


#hopelessness


In [0]:
# Add more examples for the depressed and depression tags, using a different
# `Year` cut-off so that the results do not overlap with the first batch.

depress_tags = ["#depressed", "#depression"]

for tag in depress_tags:
    c = twint.Config()

    c.Format = "Tweet id: {id} | Tweet: {tweet}"
    c.Search = tag
    c.Limit = 1000
    c.Year = 2016  # only tweets posted before 2016
    c.Store_csv = True
    c.Store_Object = True
    c.Output = "/content/gdrive/My Drive/data/dataset_en_al19.csv"
    c.Hide_output = True
    c.Stats = True
    c.Lowercase = True
    twint.run.Search(c)

In [0]:
df1 = pd.read_csv("/content/gdrive/My Drive/data/dataset_en_all7.csv")
df2 = pd.read_csv("/content/gdrive/My Drive/data/dataset_en_al19.csv")
df_all = pd.concat([df1, df2])

In [11]:
# Check for the size of each dataset
len(df1), len(df2), len(df_all)

(4000, 2000, 6000)

In [12]:
df1.hashtags.value_counts()

['#depressed']                                                                                                                            366
['#loneliness']                                                                                                                           192
['#hopelessness']                                                                                                                         190
['#depression']                                                                                                                           158
['#depression', '#therapy']                                                                                                                87
['#loneliness', '#solitude']                                                                                                               57
['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']                                        25
['#hol

In [15]:
len(df_all.id.value_counts())

5928

**1. Combine the datasets and remove duplicates based on tweet id**

In [0]:
df_all = df_all.drop_duplicates(subset=["id"])

In [17]:
df_all.shape

(5928, 31)

In [0]:
# Show the full tweet text; None disables truncation (-1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)

In [19]:
df_all.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
0,550440642867130368,550440642867130368,1420070250000,2014-12-31,23:57:30,UTC,1126331382,quoteninstagram,Instagram Quotes,,New #quote : #secret_society123 #crying #depressed #selfharmmm #cutting #blood #hate #quote #anorexia #anxiety #... http://flic.kr/p/qBXPN8,[],['http://flic.kr/p/qBXPN8'],[],0,0,0,"['#quote', '#secret_society123', '#crying', '#depressed', '#selfharmmm', '#cutting', '#blood', '#hate', '#quote', '#anorexia', '#anxiety']",[],https://twitter.com/QuotenInstagram/status/550440642867130368,False,,0,,,,,,,"[{'user_id': '1126331382', 'username': 'QuotenInstagram'}]",
1,550438181943123968,550438181943123968,1420069663000,2014-12-31,23:47:43,UTC,2788182309,lisbethge91,FAITH!HOPE!LOVE1991,,DA NEW YR ALONE I AM #DEPRESSED BCAUSE OF WAT HAPPENED 2ME ND NOW YA'LL WANA MAKE MY #DEPRESSION WORSE!?!I'VE TRIED MY BEST 2GIVE CHANCES I_,[],[],[],0,1,0,"['#depressed', '#depression']",[],https://twitter.com/lisbethge91/status/550438181943123968,False,,0,,,,,,,"[{'user_id': '2788182309', 'username': 'lisbethge91'}]",
2,550437557969121280,550434282066681858,1420069515000,2014-12-31,23:45:15,UTC,2730976702,hazeidine_,Jordan,,@Venom_sR Because it stands out the fact that we are older and we are getting closer to dying #Depressed,['venom_sr'],[],[],1,0,0,['#depressed'],[],https://twitter.com/HazeIdine_/status/550437557969121280,False,,0,,,,,,,"[{'user_id': '2730976702', 'username': 'HazeIdine_'}, {'user_id': '780077008986464257', 'username': 'Venom_Sr'}]",
3,550436284653531136,550436284653531136,1420069211000,2014-12-31,23:40:11,UTC,217658803,kaylajean421,Kayla💚,,Let me just sit here and wallow in self pity #depressed,[],[],[],0,0,0,['#depressed'],[],https://twitter.com/kaylajean421/status/550436284653531136,False,,0,,,,,,,"[{'user_id': '217658803', 'username': 'kaylajean421'}]",
4,550430157136068608,550430157136068608,1420067750000,2014-12-31,23:15:50,UTC,106244401,fiona_day,Fiona.,,#depressed,[],[],[],0,0,0,['#depressed'],[],https://twitter.com/fiona_day/status/550430157136068608,False,,0,,,,,,,"[{'user_id': '106244401', 'username': 'fiona_day'}]",


In [20]:
df_all.hashtags.value_counts().head(20)

['#depressed']                                                                                                                                       649
['#depression']                                                                                                                                      296
['#loneliness']                                                                                                                                      192
['#hopelessness']                                                                                                                                    190
['#depression', '#therapy']                                                                                                                          87 
['#loneliness', '#solitude']                                                                                                                         57 
['#depression', '#mentalhealth']                                                  

Let's look at an example where the same long string of tags recurs many times. That looks suspiciously like a marketing message.

In [22]:
df_all[df_all["hashtags"] =="['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']"]

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
3571,499391453949210624,499391453949210624,1407899175000,2014-08-13,03:06:15,UTC,2279481877,thepath_forward,Anne Graham MSFT,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide g",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/thepath_forward/status/499391453949210624,False,,0,,,,,,,"[{'user_id': '2279481877', 'username': 'thepath_forward'}]",
3572,499391440670035968,499391440670035968,1407899172000,2014-08-13,03:06:12,UTC,45961430,yeeha234,Annetastic MSFT,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide g",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/yeeha234/status/499391440670035968,False,,0,,,,,,,"[{'user_id': '45961430', 'username': 'yeeha234'}]",
3573,499391409523138560,499391409523138560,1407899164000,2014-08-13,03:06:04,UTC,427988263,echelontogether,₪ ø Ms. Mars lll ·o.,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide g",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],1,1,7,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/EchelonTogether/status/499391409523138560,False,,0,,,,,,,"[{'user_id': '427988263', 'username': 'EchelonTogether'}]",
3574,499360994691280896,499360994691280896,1407891913000,2014-08-13,01:05:13,UTC,427988263,echelontogether,₪ ø Ms. Mars lll ·o.,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide f",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/EchelonTogether/status/499360994691280896,False,,0,,,,,,,"[{'user_id': '427988263', 'username': 'EchelonTogether'}]",
3575,499360975372300289,499360975372300289,1407891908000,2014-08-13,01:05:08,UTC,45961430,yeeha234,Annetastic MSFT,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide f",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/yeeha234/status/499360975372300289,False,,0,,,,,,,"[{'user_id': '45961430', 'username': 'yeeha234'}]",
3576,499360950789492736,499360950789492736,1407891903000,2014-08-13,01:05:03,UTC,2279481877,thepath_forward,Anne Graham MSFT,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide f",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/thepath_forward/status/499360950789492736,False,,0,,,,,,,"[{'user_id': '2279481877', 'username': 'thepath_forward'}]",
3579,499330553124904960,499330553124904960,1407884655000,2014-08-12,23:04:15,UTC,45961430,yeeha234,Annetastic MSFT,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide e",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/yeeha234/status/499330553124904960,False,,0,,,,,,,"[{'user_id': '45961430', 'username': 'yeeha234'}]",
3580,499330539912822786,499330539912822786,1407884652000,2014-08-12,23:04:12,UTC,427988263,echelontogether,₪ ø Ms. Mars lll ·o.,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide e",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/EchelonTogether/status/499330539912822786,False,,0,,,,,,,"[{'user_id': '427988263', 'username': 'EchelonTogether'}]",
3581,499330526017105921,499330526017105921,1407884649000,2014-08-12,23:04:09,UTC,2279481877,thepath_forward,Anne Graham MSFT,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide e",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/thepath_forward/status/499330526017105921,False,,0,,,,,,,"[{'user_id': '2279481877', 'username': 'thepath_forward'}]",
3583,499300353741836290,499300353741836290,1407877455000,2014-08-12,21:04:15,UTC,427988263,echelontogether,₪ ø Ms. Mars lll ·o.,,"… http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html … #depression #hopelessness, #invisibleillness #RobinWilliams #socialmedia #suicide d",[],['http://yeeha234insidemychronicillness.blogspot.com/2014/08/robin-williams.html'],[],0,0,0,"['#depression', '#hopelessness', '#invisibleillness', '#robinwilliams', '#socialmedia', '#suicide']",[],https://twitter.com/EchelonTogether/status/499300353741836290,False,,0,,,,,,,"[{'user_id': '427988263', 'username': 'EchelonTogether'}]",
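
The rows above all have distinct ids, so `drop_duplicates(subset=["id"])` keeps every one of them even though the text differs only in a trailing letter. One possible way to catch such near-duplicates is to deduplicate on a normalised version of the text. The following is a minimal sketch, not part of the original notebook: `normalise_for_dedup` is a hypothetical helper, the sample strings use a placeholder URL, and the rule that drops a trailing one-character token is an assumption tuned to this particular example.

```python
import re

def normalise_for_dedup(text):
    """Build a rough comparison key for spotting near-duplicate tweets."""
    key = text.lower()
    key = re.sub(r"https?://\S+", " ", key)   # ignore links
    key = re.sub(r"[^\w#\s]", " ", key)       # drop punctuation, keep hashtags
    tokens = key.split()
    # Assumption: a trailing one-character token (like the "d", "e", "f"
    # suffixes in the rows above) is a counter rather than real content.
    if tokens and len(tokens[-1]) == 1:
        tokens.pop()
    return " ".join(tokens)

a = "… http://example.com/post.html … #depression #hopelessness, #suicide e"
b = "… http://example.com/post.html … #depression #hopelessness, #suicide f"
print(normalise_for_dedup(a) == normalise_for_dedup(b))
```

In the notebook this key could be stored in a helper column via `df_all.tweet.apply(normalise_for_dedup)` and passed to `drop_duplicates(subset=...)`; whether that is too aggressive for genuine tweets would need checking by eye.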


**2. Filtering out the relevant rows**

**Ideas for cleaning / filtering**
1. remove entries that contain positive or medical-sounding tags
2. remove entries with more than three hashtags, as they may be promotional messages
3. remove entries with @-mentions, as they may be promotional messages
4. remove entries with fewer than x chars / words
5. remove entries containing URLs, again as they are likely to be promotional messages
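
The five ideas above can be sketched as a single row-level predicate. This is an illustrative sketch rather than the notebook's actual implementation: `keep_tweet`, `parse_list`, `EXCLUDED_TAGS` and the parameter names are hypothetical, the thresholds are assumptions chosen to approximate the cells that follow, and tags are matched exactly here, so the row counts may differ from the notebook's.

```python
import ast

EXCLUDED_TAGS = {"#mentalhealth", "#health", "#happiness", "#mentalillness",
                 "#happy", "#joy", "#wellbeing"}

def parse_list(cell):
    """Turn a stringified list such as "['#sad']" back into a Python list."""
    return ast.literal_eval(cell)

def keep_tweet(row, max_tags=3, min_chars=25, min_words=7):
    """Return True if a row passes all five filtering ideas."""
    tags = parse_list(row["hashtags"])
    if EXCLUDED_TAGS.intersection(tags):      # 1. positive / medical tags
        return False
    if len(tags) > max_tags:                  # 2. too many hashtags
        return False
    if parse_list(row["mentions"]):           # 3. @-mentions present
        return False
    text = row["tweet"]
    if len(text) <= min_chars or len(text.split()) < min_words:  # 4. too short
        return False
    if parse_list(row["urls"]):               # 5. contains URLs
        return False
    return True

good = {"hashtags": "['#depressed']", "mentions": "[]", "urls": "[]",
        "tweet": "Let me just sit here and wallow in self pity #depressed"}
short = dict(good, tweet="#depressed")
medical = dict(good, hashtags="['#depression', '#mentalhealth']")
print(keep_tweet(good), keep_tweet(short), keep_tweet(medical))
```

A predicate like this could be applied with `df_all[df_all.apply(keep_tweet, axis=1)]`. The step-by-step masks used in the notebook have the advantage that the effect of each filter can be inspected before it is applied.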

In [0]:
selection_to_remove = ["#mentalhealth", "#health", "#happiness", "#mentalillness", "#happy", "#joy", "#wellbeing"]

#### 1. remove entries that contain positive or medical-sounding tags


In [26]:
# `hashtags` is stored as a string, so `item in x` is a substring test
mask1 = df_all.hashtags.apply(lambda x: any(item in x for item in selection_to_remove))
df_all[mask1].tweet.tail()

1988    2015: when music destroyed #mentalhealth stigma  http://goo.gl/52eKru  #despair #depression #anxiety #suicide #bipolar via .@guardian                                                                                 
1989    Be happy in 2016. Enjoy a special #HealthyMeSummit with @taniadejong #depression & #anxiety  http://ow.ly/W0387   http://fb.me/3rRZ5rnxX                                                                              
1990    Be happy in 2016. Enjoy a special #HealthyMeSummit with @taniadejong #depression & #anxiety  http://ow.ly/W0387  pic.twitter.com/b0y5KcstCe                                                                           
1993    RT mc1748 When words don't work, #arts program can help heal #veterans  http://strib.mn/1mPKarx  #PTSD #MentalHealth #NAMI #depression #anxi…                                                                         
1994    Debunking the myth that #suicides increase over the holiday season  http://nymag.com/scienceofus/201
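
Because the `hashtags` column is read back from CSV as a string, the `item in x` check is a substring test: `#health` also matches `#healthymesummit`, which appears to be why rows 1989 and 1990 are caught above. If whole-tag matching is preferred, the string can be parsed first. The sketch below contrasts the two behaviours; the function names are hypothetical and `ast.literal_eval` is an assumption about how to parse the column, not something the notebook does.

```python
import ast

selection_to_remove = ["#mentalhealth", "#health", "#happiness",
                       "#mentalillness", "#happy", "#joy", "#wellbeing"]

def has_excluded_tag_substring(cell):
    """Substring test on the raw string, as in the cell above."""
    return any(item in cell for item in selection_to_remove)

def has_excluded_tag_exact(cell):
    """Parse the stringified list first, then compare whole hashtags."""
    tags = set(ast.literal_eval(cell))
    return not tags.isdisjoint(selection_to_remove)

cell = "['#healthymesummit', '#depression', '#anxiety']"
print(has_excluded_tag_substring(cell), has_excluded_tag_exact(cell))
```

Which behaviour is better depends on the goal: the substring version casts a wider net (it also removes tags such as `#healthymesummit`), while the exact version only removes the listed tags.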

In [27]:
# review the result of removing certain tags
df_all[~mask1].tweet.head(10)

0     New #quote : #secret_society123 #crying #depressed #selfharmmm #cutting #blood #hate #quote #anorexia #anxiety #...  http://flic.kr/p/qBXPN8 
1     DA NEW YR ALONE I AM #DEPRESSED BCAUSE OF WAT HAPPENED 2ME ND NOW YA'LL WANA MAKE MY #DEPRESSION WORSE!?!I'VE TRIED MY BEST 2GIVE CHANCES I_ 
2     @Venom_sR Because it stands out the fact that we are older and we are getting closer to dying #Depressed                                     
3     Let me just sit here and wallow in self pity #depressed                                                                                      
4     #depressed                                                                                                                                   
5     When you ask for a triple chocolate melt down and the waiter tells you no... #depressed #sosad                                               
6     First breakdown at work ),: more to come ill bet money on it #depressed                                   

In [28]:
# above results look good, let's apply mask1
df_all = df_all[~mask1]
len(df_all)

5242

#### 2. remove entries with more than three hashtags, as they may be promotional messages


In [0]:
mask2 = df_all.hashtags.apply(lambda x: x.count("#") < 4)

In [0]:
# applying the mask2
df_all = df_all[mask2]

In [37]:
#Check dataset size 
len(df_all)

3308

In [38]:
df_all.head()


Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date
1,550438181943123968,550438181943123968,1420069663000,2014-12-31,23:47:43,UTC,2788182309,lisbethge91,FAITH!HOPE!LOVE1991,,DA NEW YR ALONE I AM #DEPRESSED BCAUSE OF WAT HAPPENED 2ME ND NOW YA'LL WANA MAKE MY #DEPRESSION WORSE!?!I'VE TRIED MY BEST 2GIVE CHANCES I_,[],[],[],0,1,0,"['#depressed', '#depression']",[],https://twitter.com/lisbethge91/status/550438181943123968,False,,0,,,,,,,"[{'user_id': '2788182309', 'username': 'lisbethge91'}]",
2,550437557969121280,550434282066681858,1420069515000,2014-12-31,23:45:15,UTC,2730976702,hazeidine_,Jordan,,@Venom_sR Because it stands out the fact that we are older and we are getting closer to dying #Depressed,['venom_sr'],[],[],1,0,0,['#depressed'],[],https://twitter.com/HazeIdine_/status/550437557969121280,False,,0,,,,,,,"[{'user_id': '2730976702', 'username': 'HazeIdine_'}, {'user_id': '780077008986464257', 'username': 'Venom_Sr'}]",
3,550436284653531136,550436284653531136,1420069211000,2014-12-31,23:40:11,UTC,217658803,kaylajean421,Kayla💚,,Let me just sit here and wallow in self pity #depressed,[],[],[],0,0,0,['#depressed'],[],https://twitter.com/kaylajean421/status/550436284653531136,False,,0,,,,,,,"[{'user_id': '217658803', 'username': 'kaylajean421'}]",
4,550430157136068608,550430157136068608,1420067750000,2014-12-31,23:15:50,UTC,106244401,fiona_day,Fiona.,,#depressed,[],[],[],0,0,0,['#depressed'],[],https://twitter.com/fiona_day/status/550430157136068608,False,,0,,,,,,,"[{'user_id': '106244401', 'username': 'fiona_day'}]",
5,550429725513244672,550429725513244672,1420067647000,2014-12-31,23:14:07,UTC,1064666461,kmulaniff713,Kevin Mulaniff,,When you ask for a triple chocolate melt down and the waiter tells you no... #depressed #sosad,[],[],[],0,0,1,"['#depressed', '#sosad']",[],https://twitter.com/KMulaniff713/status/550429725513244672,False,,0,,,,,,,"[{'user_id': '1064666461', 'username': 'KMulaniff713'}]",


#### 3. remove tweets with @-mentions, as they are sometimes retweets

In [0]:
# `mentions` is a stringified list: "[]" (length 2) means the tweet has no mentions
mask3 = df_all.mentions.apply(lambda x: len(x) < 5)
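
The `len(x) < 5` test works because `mentions` is stored as a string: an empty list is the two-character string `[]`, while the shortest possible non-empty list string, such as `['a']`, already has five characters. A more explicit alternative is to parse the string and check whether the list has entries. This is a sketch only; `has_entries` is a hypothetical helper and not part of the notebook.

```python
import ast

def has_entries(cell):
    """Return True if a stringified list such as "['venom_sr']" is non-empty."""
    return len(ast.literal_eval(cell)) > 0

print(len("[]"), len("['a']"))
print(has_entries("[]"), has_entries("['venom_sr']"))
```

With such a helper the mask could be written as `~df_all.mentions.apply(has_entries)`, and the same idea would apply to the `urls` column.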

In [0]:
# applying mask3
df_all = df_all[mask3]

In [42]:
len(df_all)

2718

In [43]:
# let's check the hashtags value counts again
df_all.hashtags.value_counts().head(20)

['#depressed']                               529
['#depression']                              243
['#loneliness']                              154
['#hopelessness']                            141
['#depression', '#therapy']                  87 
['#loneliness', '#solitude']                 57 
['#anxiety', '#depression']                  19 
['#meaning', '#hopelessness']                18 
['#art', '#loneliness']                      17 
['#depression', '#anxiety']                  13 
['#loneliness', '#kill', '#myth']            12 
[]                                           11 
['#depressed', '#depression']                11 
['#depression', '#helpme', '#iwantpeace']    10 
['#tms', '#depression']                      10 
['#depression', '#alcohol', '#newyears']     10 
['#youth', '#hopelessness']                  8  
['#loneliness', '#expandedcontacts']         8  
['#sad', '#depressed']                       7  
['#depression', '#notjustsad']               7  
Name: hashtags, dtyp

In [44]:
df_all.tweet.tail(10)

1959    talked about suicidal ideation with a friend last night.she confessed to having a plan of jumping off a bridge.I had no idea.#depression                                                                                       
1967    #DEPRESSION                                                                                                                                                                                                                    
1968    ur best is plenty good enough 4 anyone or anything that is meant 4U😊Don't let ppl nor circumstances kill you😘#suicideprevention #depression                                                                                    
1971    RT talkspace #Depression costs companies $52 billion/year in absenteeism & reduced productivity; results in 400 million lost work days/year…                                                                                   
1980    Sleep is extremely important, and for this author, regulating #s

#### 4. remove entries with fewer than x chars / words

In [0]:
mask4a = df_all.tweet.apply(lambda x: len(x) > 25)


In [46]:
df_all = df_all[mask4a]
len(df_all)

2611

In [0]:
mask4b = df_all.tweet.apply(lambda x: x.count(" ") > 5)
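
Note that `x.count(" ")` counts space characters rather than words, so a tweet padded with repeated spaces can pass the filter with very few words. A minimal sketch of a word-based count follows; `word_count` is a hypothetical helper and the sample string is made up for illustration.

```python
def word_count(text):
    """Count whitespace-separated tokens, ignoring repeated spaces."""
    return len(text.split())

# Three words separated by runs of spaces
sample = "so" + " " * 3 + "tired" + " " * 6 + "#depressed"
print(sample.count(" "), word_count(sample))
```

Here the space count is above the notebook's threshold of 5 even though the text has only three tokens, one of which is a hashtag.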

In [48]:
df_all = df_all[mask4b]
len(df_all)

2366

In [49]:
df_all.tweet

1       DA NEW YR ALONE I AM #DEPRESSED BCAUSE OF WAT HAPPENED 2ME ND NOW YA'LL WANA MAKE MY #DEPRESSION WORSE!?!I'VE TRIED MY BEST 2GIVE CHANCES I_                                                                                   
3       Let me just sit here and wallow in self pity #depressed                                                                                                                                                                        
5       When you ask for a triple chocolate melt down and the waiter tells you no... #depressed #sosad                                                                                                                                 
6       First breakdown at work ),: more to come ill bet money on it #depressed                                                                                                                                                        
8       I DONT wanna cry or feel sorry for myself. Its just so hard some


#### 5. remove entries containing urls - as they are likely to be promotional messages


In [0]:
mask5 = df_all.urls.apply(lambda x: len(x) < 5)

In [51]:
# let's have a look at what we will be removing from the dataset
df_all[~mask5].tweet.head(10), df_all[~mask5].tweet.tail(10)

(49     And the worse part is......trying to hold on but, no one is there. #depressed #alone #reality…  http://instagram.com/p/xSHMegjiFy/                                                             
 54     #depressed? - discuss  your #depression feelings anonymously -  http://ow.ly/GsmGr   http://ow.ly/GsmGs   http://ow.ly/i/6mvQH                                                                 
 71     Happy new year ☺ Rayakan tahun yg baru dengan ini ! #Depressed  http://instagram.com/p/xR3sa5v22q/                                                                                             
 90     Is your child #depressed? Learn the #signs of childhood depression here:  http://bit.ly/WQg3z9                                                                                                 
 95     Do you feel low amidst the new year celebrations? Even so there is a quiet capacity for happiness within  http://innerspacetherapy.in/mindfulness/discovering-happy-feel-low-blue/ … #depressed


The above shows that tweets with URLs are indeed more likely to be promotional, informational or educational messages that do not indicate the user's actual emotional state, and thus can be removed (or marked as negative examples).

In [52]:
df_all = df_all[mask5]
len(df_all)

1351

## 3. Finally, let's create a column containing the tweet text, but with all hashtags removed

This column can be used as input to the model, or sent to other software for further emotion and linguistic analysis. The idea is that, with the hashtags removed, the model and the software must examine the text itself to clarify whether the actual emotion is negative and indicative of depression.

In [0]:
df_all["mod_text"] = df_all["tweet"].apply(lambda x: re.sub(r'#\w+', '', x))
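
Removing hashtags with `re.sub(r'#\w+', '', x)` leaves behind doubled spaces and stray spaces before punctuation. If a tidier text column is wanted, the whitespace can be normalised afterwards. The following is a sketch under that assumption; `strip_hashtags` is a hypothetical helper, not the notebook's method.

```python
import re

def strip_hashtags(text):
    """Remove hashtags, then tidy the whitespace they leave behind."""
    without_tags = re.sub(r"#\w+", "", text)
    collapsed = re.sub(r"\s+", " ", without_tags)          # merge runs of whitespace
    collapsed = re.sub(r"\s+([.,!?])", r"\1", collapsed)   # no space before punctuation
    return collapsed.strip()

print(strip_hashtags("I just feel #depressed. Hope I get through this"))
print(strip_hashtags("Stress pa sa school #doublekill #depressed"))
```

When a hashtag is used as a word inside a sentence, the sentence still loses that word ("I just feel."), which is an inherent trade-off of dropping hashtags altogether.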

In [0]:
df_all.mod_text.head(15), df_all.mod_text.tail(15)

(1      mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.                                                                                                                                                                       
 6     With all of this unnessary  family drama, I feel like moving far away and starting over again. From one thing to another I just feel . Hope I get through this                                                                                                                                   
 7     Stress na nga sa bahay, stress pa sa school😔                                                                                                                                                                                                                                                     
 8     Step 1.  Anfangen, richtig zu essen. Nicht zu wenig, nicht zu viel. & am besten ausgewogen.  Damit ich

In [0]:
# let's check the hashtags value counts again
df_all.hashtags.value_counts().head(20)

['#depressed']                               296
['#depression']                              110
['#loneliness']                              78 
['#hopelessness']                            21 
['#depressed', '#stressed', '#alone']        10 
['#sad', '#depressed']                       9  
['#depression', '#anxiety']                  9  
['#stoner', '#instahookah', '#depressed']    8  
['#depression', '#depressed']                6  
['#tms', '#depression']                      6  
['#depression', '#helpme', '#iwantpeace']    5  
['#lonely', '#depressed']                    4  
['#depressed', '#lonely']                    4  
['#anxiety', '#depression']                  4  
['#depressed', '#anxious']                   4  
['#depressed', '#positive']                  3  
['#ptsd', '#depression']                     3  
['#depression', '#notjustsad']               3  
['#loneliness', '#depression']               3  
['#depressed', '#sad']                       3  
Name: hashtags, dtyp

In [0]:
df_all.columns

Index(['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
       'user_id', 'username', 'name', 'place', 'tweet', 'mentions', 'urls',
       'photos', 'replies_count', 'retweets_count', 'likes_count', 'hashtags',
       'cashtags', 'link', 'retweet', 'quote_url', 'video', 'near', 'geo',
       'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to',
       'retweet_date', 'mod_text'],
      dtype='object')

In [0]:
col_list = ["id", "conversation_id", "date", "username", "mod_text", "hashtags", "tweet"]

In [0]:
df_final1 = df_all[col_list]
df_final1 = df_final1.rename(columns={"mod_text": "tweet_processed", "tweet": "tweet_original"})


In [0]:
df_final1["target"] = 1

In [0]:
df_final1.head()

Unnamed: 0,id,conversation_id,date,username,tweet_processed,hashtags,tweet_original,target
1,1163050916330770433,1163050916330770433,2019-08-18,lowerdepression,"mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.",['#depressed'],"#Depressed mood can be caused by infectious diseases, nutritional deficiencies, neurological conditions, and physiological problems.",1
6,1163030382360629248,1163030382360629248,2019-08-18,chrisbontheweb,"With all of this unnessary family drama, I feel like moving far away and starting over again. From one thing to another I just feel . Hope I get through this",['#depressed'],"With all of this unnessary family drama, I feel like moving far away and starting over again. From one thing to another I just feel #depressed. Hope I get through this",1
7,1163028021244133376,1163028021244133376,2019-08-18,kimberlybenedi5,"Stress na nga sa bahay, stress pa sa school😔","['#doublekill', '#depressed']","Stress na nga sa bahay, stress pa sa school😔 #doublekill #depressed",1
8,1163027065463087104,1163027065463087104,2019-08-18,ag0n1z3d,"Step 1. Anfangen, richtig zu essen. Nicht zu wenig, nicht zu viel. & am besten ausgewogen. Damit ich dann die nötige Kraft habe, um den Tag zu überstehen. In der letzten Zeit war ich viel zu schwach. Das muss sich ändern.",['#depressed'],"Step 1. Anfangen, richtig zu essen. Nicht zu wenig, nicht zu viel. & am besten ausgewogen. Damit ich dann die nötige Kraft habe, um den Tag zu überstehen. In der letzten Zeit war ich viel zu schwach. Das muss sich ändern. #depressed",1
11,1163020226977386497,1163020226977386497,2019-08-18,wildfoxtherapy,"I'm going to keep banging on about this, cos it's true. What you focus on, you get more of. Stop telling yourself you're or . Tell yourself you're happy, strong, confident, powerful. Not only cos you ARE, but cos your brilliant mind listens to what you tell it. pic.twitter.com/gBQn7yEjsJ","['#depressed', '#anxious']","I'm going to keep banging on about this, cos it's true. What you focus on, you get more of. Stop telling yourself you're #depressed or #anxious. Tell yourself you're happy, strong, confident, powerful. Not only cos you ARE, but cos your brilliant mind listens to what you tell it. pic.twitter.com/gBQn7yEjsJ",1


In [0]:
len(df_final1) 

1102

In [0]:
df_final1_1 = df_final1[:400]
df_final1_2 = df_final1[400:800]
df_final1_3 = df_final1[800:]
len(df_final1_1), len(df_final1_2), len(df_final1_3)

(400, 400, 302)
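
The three hard-coded slices above split the data into review batches of at most 400 rows. If the dataset size changes, the same split can be expressed generically. This is a small sketch; `split_into_chunks` is a hypothetical helper, and a plain list stands in for the DataFrame.

```python
def split_into_chunks(items, chunk_size):
    """Split a sliceable sequence into consecutive chunks of chunk_size."""
    return [items[start:start + chunk_size]
            for start in range(0, len(items), chunk_size)]

chunks = split_into_chunks(list(range(1102)), 400)
print([len(chunk) for chunk in chunks])
```

Since a DataFrame supports the same `[start:stop]` row slicing used in the cell above, the helper should work on `df_final1` as well.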

In [0]:
df_final1.to_csv("/content/gdrive/My Drive/data/tweets_final.csv")

In [0]:
df_final1_1.to_csv("/content/gdrive/My Drive/data/tweets_final_1.csv")
df_final1_2.to_csv("/content/gdrive/My Drive/data/tweets_final_2.csv")
df_final1_3.to_csv("/content/gdrive/My Drive/data/tweets_final_3.csv")

In [0]:
df_all.to_csv("/content/gdrive/My Drive/data/tweets_v3.csv")

In [0]:
users = df_all.username

In [0]:

content = {}
for i in users:
    c = twint.Config()
    c.Search = "#depressed"
    c.Username = i  # restrict the search to this user's tweets
    c.Format = "Tweet id: {id} | Tweet: {tweet}"
    c.Limit = 100
    c.Store_csv = True
    c.Store_Object = True
    c.Output = "/content/gdrive/My Drive/data/dataset_v3.csv"
    c.Hide_output = True
    c.Stats = True
    c.Lowercase = True
    twint.run.Search(c)
    
#     tweets = twint.output.tweets_list()
#     print(tweets)
#     for tweet in tweets:
#     # then iterate over the hashtags of that single tweet
#         for t in tweet.tweet:
#         # increment the count if the hashtag already exists, otherwise initialize it to 1
#             if tweet.username in content:
#                 content[tweet.username].append(t)
#             else:
#                 content[tweet.username] = []
#                 content[tweet.username].append(t)
        
    print(i)
#     print(content)
#     with open('dataset.csv', 'w') as output:
#         output.write('username, tweet\n')
#         for user in content:
#             for h in content[user]:
#                 output.write('{},{}\n'.format(user, content[user][h]))
    

ag0n1z3d
simonblue16
puffpuffnpass1
lowerdepression
bobymcboby
_arxn_
depressedaunty
joshstebbins2
hokey_hoke18
ericsequeira
hunterwastaken
nick63360
rimrod007
nick63360
lowlifekev
celerglersk
wildfoxtherapy
epicgabe
samanthajoule
paklongmail1
al__zaainn
janusha61949990
friedonbusiness
sadtimes0813
semsannen_
maudlinmuse
sadtimes0813
_bluenightx
puffpuffnpass1
masederealwolf
ilyseroyal
amishman9000
goodboypaden
sadtimes0813
jameswifties
briannakole19
shy91771526
aleuthemermaid
gracie_m721
lena38348916
katrinamunoz18
clarenstro
hashtagsaloobin
hashtagsaloobin
hashtagsaloobin
nctzoozeus
vaporaccessshop
masederealwolf
delzharina
hulk27watkins
therabbitchu
wildfoxtherapy
little_red2596
siddiqbetrayer
dark_swan
semsannen_
mozenkoffmich
badassid
naveentp36cq
reeteshkhadgi
trillasahbella
richerd2020
lowerdepression
airametuc09
paddasumeet
joshlaioloplays
lowerdepression
lowerdepression
wendy_ellas
alttheoalt
darkymishi
gracie_m721
chrisbontheweb
lisamonique_04
alyssamnunez
mickirei
mickirei
m

KeyboardInterrupt: ignored
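
The output above shows several usernames more than once (for example `nick63360` and `sadtimes0813`), because `df_all.username` holds one entry per tweet. The same search is therefore repeated for those users. One way to avoid this is to deduplicate the names before looping, keeping their original order. A minimal sketch, where `unique_in_order` is a hypothetical helper and the list is a small sample taken from the output above:

```python
def unique_in_order(names):
    """Drop repeated names while keeping first-seen order."""
    return list(dict.fromkeys(names))

users_sample = ["nick63360", "rimrod007", "nick63360",
                "sadtimes0813", "sadtimes0813"]
print(unique_in_order(users_sample))
```

On a pandas Series, `users.unique()` or `users.drop_duplicates()` would serve the same purpose.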

In [0]:
help(twint.output.tweets_list)

Help on list object:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate sign