<img src='../assets/graphics/domestic_violence_03.png'>

Project Notebooks:
1. **Data Collection** (Current notebook)

2. Data Cleaning & Exploratory Data Analysis 

3. Pre-Processing & Modelling

## 1. Data Collection

### Contents:
- [1.1 Extracting data from the r/domesticviolence subreddit page](#ext1)
- [1.2 Extracting data from the r/depression subreddit page](#ext2)
- [1.3 Export Data](#1.3)
- [1.4 Summary](#1.4)

For this project, we'll be using Reddit's API to collect posts from the following subreddits:
- [r/domesticviolence](https://www.reddit.com/r/domesticviolence/)
- [r/depression](https://www.reddit.com/r/depression/)

To access Reddit's API, we replaced the default user agent in our request headers and used the `time.sleep()` function to pause between requests.

Data in this notebook was requested on 20 June 2020. Re-running this code on a later date would result in a new set of posts being scraped.
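
As a minimal sketch of the two precautions described above (a custom user agent and a pause between requests), the helper below wraps a single request. The name `polite_get` and its parameters `pause_range`, `get` and `sleep` are assumptions made for illustration; they do not appear elsewhere in this notebook.

```python
import random
import time

import requests


def polite_get(url, user_agent, pause_range=(2, 6), get=requests.get, sleep=time.sleep):
    """Send one GET request with a custom User-agent, then pause.

    `polite_get`, `pause_range`, `get` and `sleep` are hypothetical names
    used for this sketch; they are not part of the notebook.
    """
    # A custom User-agent replaces the default one that `requests` would send.
    response = get(url, headers={'User-agent': user_agent}, timeout=10)
    # Pause for a random whole number of seconds before the caller continues.
    sleep(random.randint(*pause_range))
    return response
```

Passing `get` and `sleep` as parameters is only there so the helper can be tried without touching the network; in normal use the defaults would be left alone.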

## Import Libraries

In [2]:
import requests
import pandas as pd
import time
import random

### 1.1 Extracting data from the r/domesticviolence subreddit page
<a id='ext1'></a>

In [3]:
dom_url = 'https://www.reddit.com/r/domesticviolence.json'

In [4]:
res_dom = requests.get(dom_url, headers={'User-agent': 'Marianne'})

In [5]:
res_dom.status_code

200

In [6]:
# Reddit's data is organised as a dictionary
reddit_dom_dict = res_dom.json()
reddit_dom_dict.keys()

dict_keys(['kind', 'data'])

In [7]:
# Reviewing Reddit's keys
reddit_dom_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [8]:
reddit_dom_dict['data']['children'][0]['data']

{'approved_at_utc': None,
 'subreddit': 'domesticviolence',
 'selftext': 'We know many of you are struggling to manage with already traumatic events and now are dealing with a global pandemic of COVID-19. Many of you may be quarantined with an abuser or dealing with their ramped up abuse due to their proximity or need for that outlet. Abusers are coming back from years ago to get to you with hoovers. Being isolated is very hard and this is a situation completely unexpected and anxiety driving in and of itself. So we wanted to put together a listing of resources, many of which are listed in our resource listing in the sidebar for you in this difficult time. Stay safe out there, folks. We are right here with you, and we will get through this together. \n\n\nSupport for Domestic Abuse:\n\n* [Thehotline.org]( https://www.thehotline.org/) is available 24/7 for chat and calls (1800-787-3224) during this crisis for women, men as well as LGBTQ folks. Please be sure to use safe electronics to c

In [9]:
# Get subreddit name
reddit_dom_dict['data']['children'][0]['data']['subreddit_name_prefixed']

'r/domesticviolence'

In [10]:
# Get id of post - can be used to view individual posts
# e.g. https://www.reddit.com/r/domesticviolence/comments/fsrd59/
reddit_dom_dict['data']['children'][0]['data']['id']

'fsrd59'

In [11]:
# Get date of post

import datetime
created_utc = reddit_dom_dict['data']['children'][0]['data']['created_utc']
datetime.datetime.utcfromtimestamp(created_utc).strftime('%Y-%m-%d')

'2020-04-01'
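
`datetime.utcfromtimestamp` is deprecated as of Python 3.12, so a timezone-aware conversion may be preferable on newer versions. The sketch below shows one way to do it; the helper name `utc_date` is an assumption, and the timestamp is the rounded `created_utc` value shown for the first post in the dataframe preview further down.

```python
from datetime import datetime, timezone


def utc_date(created_utc):
    """Return the UTC calendar date (YYYY-MM-DD) for a Unix timestamp."""
    # `utc_date` is a hypothetical helper name used only in this sketch.
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime('%Y-%m-%d')


print(utc_date(1585710000.0))  # prints 2020-04-01
```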

In [12]:
# Check first post's title with webpage
reddit_dom_dict['data']['children'][0]['data']['title']

'COVID-19 RESOURCES FOR ABUSE VICTIMS'

In [13]:
# Check first post's content with webpage
reddit_dom_dict['data']['children'][0]['data']['selftext']

'We know many of you are struggling to manage with already traumatic events and now are dealing with a global pandemic of COVID-19. Many of you may be quarantined with an abuser or dealing with their ramped up abuse due to their proximity or need for that outlet. Abusers are coming back from years ago to get to you with hoovers. Being isolated is very hard and this is a situation completely unexpected and anxiety driving in and of itself. So we wanted to put together a listing of resources, many of which are listed in our resource listing in the sidebar for you in this difficult time. Stay safe out there, folks. We are right here with you, and we will get through this together. \n\n\nSupport for Domestic Abuse:\n\n* [Thehotline.org]( https://www.thehotline.org/) is available 24/7 for chat and calls (1800-787-3224) during this crisis for women, men as well as LGBTQ folks. Please be sure to use safe electronics to contact them or any agency if an abuser has access to them. They also offe

In [14]:
# Extract r/domesticviolence posts
dv_posts = [p['data'] for p in reddit_dom_dict['data']['children']]

# Convert to dataframe
pd.DataFrame(dv_posts).head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id
0,,domesticviolence,We know many of you are struggling to manage w...,t2_2egrzrvq,False,,0,False,COVID-19 RESOURCES FOR ABUSE VICTIMS,[],...,/r/domesticviolence/comments/fsrd59/covid19_re...,,True,https://www.reddit.com/r/domesticviolence/comm...,10667,1585710000.0,1,,False,
1,,domesticviolence,I have a friend who has been suffering abuse a...,t2_ixabg,False,,0,False,How can I help my friend?,[],...,/r/domesticviolence/comments/hclztl/how_can_i_...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592658000.0,0,,False,91cd03a2-5cda-11ea-9a5b-0e285193b139
2,,domesticviolence,I'm a 23 year old homeowner (female). I'm sell...,t2_2z1c66ln,False,,0,False,"I caught my biggest worry on tape, can anybody...",[],...,/r/domesticviolence/comments/hcnqna/i_caught_m...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592666000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455
3,,domesticviolence,Maybe this doesnt belong here Im not sure wher...,t2_6dnmfknj,False,,0,False,Im stupid. How long to feel better after minor...,[],...,/r/domesticviolence/comments/hcgbh8/im_stupid_...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592630000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455
4,,domesticviolence,My main questions are at the bottom if you jus...,t2_6yb1144r,False,,0,False,Vent but advice/knowledge is appreciated. My b...,[],...,/r/domesticviolence/comments/hcha3d/vent_but_a...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592635000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455


In [15]:
# Identifying the last post 
reddit_dom_dict['data']['after']

't3_ha1hmt'

In [16]:
# Confirming that the previous output is truly the last post
pd.DataFrame(dv_posts)['name']

0     t3_fsrd59
1     t3_hclztl
2     t3_hcnqna
3     t3_hcgbh8
4     t3_hcha3d
5     t3_hch4oy
6     t3_hcbfp4
7     t3_hbq0hr
8     t3_hbt42e
9     t3_hbjyoc
10    t3_hbx63d
11    t3_hbsmtk
12    t3_hbrbls
13    t3_hb9lzk
14    t3_hbktus
15    t3_hb4jml
16    t3_hbbt8x
17    t3_hb80rr
18    t3_hate94
19    t3_hajrlu
20    t3_hatxhi
21    t3_havmd6
22    t3_havx2u
23    t3_haut32
24    t3_ha9gjm
25    t3_ha1hmt
Name: name, dtype: object

In [17]:
# This is the new URL that gives us the next 25 posts - double checked that it works
dom_url + '?after=' + reddit_dom_dict['data']['after']

'https://www.reddit.com/r/domesticviolence.json?after=t3_ha1hmt'
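
The string concatenation above can also be written with `urllib.parse.urlencode`, which handles the `?` and `&` separators. This is a sketch only: `page_url` is a hypothetical helper, and the `limit` argument assumes Reddit's listing endpoints accept a `limit` query parameter, which should be checked against Reddit's API documentation before relying on it.

```python
from urllib.parse import urlencode


def page_url(base_url, after=None, limit=None):
    """Build a listing URL, optionally continuing after a given post."""
    # `page_url` is a hypothetical helper; `limit` is an assumed parameter.
    params = {}
    if after is not None:
        params['after'] = after
    if limit is not None:
        params['limit'] = limit
    # Only append a query string when there is something to send.
    return base_url + ('?' + urlencode(params) if params else '')


base = 'https://www.reddit.com/r/domesticviolence.json'
print(page_url(base))
print(page_url(base, after='t3_ha1hmt'))
```

The second call reproduces the URL built by hand in the cell above.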

In [18]:
# Creating a function to scrape a Reddit page
# Each request in the loop returns roughly 25 posts

def reddit_scrape(url, posts_list, num_scrapes):

    after = None

    for i in range(num_scrapes):
        if i == 0:
            print("SCRAPING {}".format(url))
    
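        # First request (or no further page reported): use the base URL.
        # Note: once Reddit returns after=None at the end of a listing,
        # this branch requests the first page again.
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after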

        res = requests.get(current_url, headers={'User-agent': 'Marianne'})
    
        if res.status_code != 200:
            print('Status error', res.status_code)
            break
    
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts_list.extend(current_posts)
        after = current_dict['data']['after']
        
        # Generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,6)
        time.sleep(sleep_duration)

    print("Number of posts downloaded: {}".format(len(posts_list)))
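
In `reddit_scrape`, an `after` value of `None` always leads back to the base URL, so if the listing runs out before `num_scrapes` requests have been made, the loop starts again from the first page. The sketch below shows one possible variant that stops at the end of the listing and skips posts it has already seen. The names `collect_posts`, `fetch_page` and `max_pages` are assumptions for illustration, and the pages are hand-made stand-ins rather than real Reddit responses.

```python
def collect_posts(fetch_page, max_pages):
    """Collect posts page by page, stopping when the listing runs out.

    `fetch_page(after)` is a hypothetical callable that returns the parsed
    JSON dictionary for one listing page.
    """
    posts = []
    seen_names = set()
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        for child in page['data']['children']:
            post = child['data']
            # Keep each post once, keyed on its unique 'name' (e.g. t3_fsrd59).
            if post['name'] not in seen_names:
                seen_names.add(post['name'])
                posts.append(post)
        after = page['data']['after']
        # No 'after' token means there is no further page to request.
        if after is None:
            break
    return posts


# Two fake pages with the same nesting as the Reddit JSON inspected earlier.
fake_pages = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}],
                      'after': None}},
}

posts = collect_posts(fake_pages.get, max_pages=5)
print([p['name'] for p in posts])  # prints ['t3_a', 't3_b', 't3_c']
```

With real requests, `fetch_page` would wrap the `requests.get(...).json()` call and the random pause used in `reddit_scrape`.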

In [19]:
# Calling reddit_scrape function on domestic violence subreddit

dom_posts = []
reddit_scrape(dom_url, dom_posts, 50)

SCRAPING https://www.reddit.com/r/domesticviolence.json
Number of posts downloaded: 1249


The aim was to download about 1000 posts, but 50 requests of roughly 25 posts each returned 1249. I suspect that some posts are repeated, so I'll check for duplicate rows next.

In [37]:
# Convert list to dataframe

dom_df = pd.DataFrame(dom_posts)
print('Shape of dom_df:', dom_df.shape)
dom_df.head()

Shape of dom_df: (1249, 107)


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,post_hint,preview
0,,domesticviolence,We know many of you are struggling to manage w...,t2_2egrzrvq,False,,0,False,COVID-19 RESOURCES FOR ABUSE VICTIMS,[],...,True,https://www.reddit.com/r/domesticviolence/comm...,10667,1585710000.0,1,,False,,,
1,,domesticviolence,I have a friend who has been suffering abuse a...,t2_ixabg,False,,0,False,How can I help my friend?,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592658000.0,0,,False,91cd03a2-5cda-11ea-9a5b-0e285193b139,,
2,,domesticviolence,I'm a 23 year old homeowner (female). I'm sell...,t2_2z1c66ln,False,,0,False,"I caught my biggest worry on tape, can anybody...",[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592666000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,
3,,domesticviolence,Maybe this doesnt belong here Im not sure wher...,t2_6dnmfknj,False,,0,False,Im stupid. How long to feel better after minor...,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592630000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,
4,,domesticviolence,My main questions are at the bottom if you jus...,t2_6yb1144r,False,,0,False,Vent but advice/knowledge is appreciated. My b...,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592635000.0,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,


In [38]:
# Check for unique entries 
print('There are {} unique posts in dom_df.'.format(len(dom_df['name'].unique())))

There are 998 unique posts in dom_df.


In [39]:
# Drop duplicate rows and reset index

dom_df.drop_duplicates(subset='name', inplace=True)
dom_df.reset_index(drop=True, inplace=True)

print('Shape of dom_df:', dom_df.shape)
dom_df

Shape of dom_df: (998, 107)


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,post_hint,preview
0,,domesticviolence,We know many of you are struggling to manage w...,t2_2egrzrvq,False,,0,False,COVID-19 RESOURCES FOR ABUSE VICTIMS,[],...,True,https://www.reddit.com/r/domesticviolence/comm...,10667,1.585710e+09,1,,False,,,
1,,domesticviolence,I have a friend who has been suffering abuse a...,t2_ixabg,False,,0,False,How can I help my friend?,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.592658e+09,0,,False,91cd03a2-5cda-11ea-9a5b-0e285193b139,,
2,,domesticviolence,I'm a 23 year old homeowner (female). I'm sell...,t2_2z1c66ln,False,,0,False,"I caught my biggest worry on tape, can anybody...",[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.592666e+09,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,
3,,domesticviolence,Maybe this doesnt belong here Im not sure wher...,t2_6dnmfknj,False,,0,False,Im stupid. How long to feel better after minor...,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.592630e+09,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,
4,,domesticviolence,My main questions are at the bottom if you jus...,t2_6yb1144r,False,,0,False,Vent but advice/knowledge is appreciated. My b...,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.592635e+09,0,,False,7d985224-5cda-11ea-aa5c-0e4c53184455,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,,domesticviolence,Help with my friends abusive Boyfriend\n\nSo m...,t2_gepdiyw,False,,0,False,Help with my friends abusive boyfriend.,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.577077e+09,0,,False,,,
994,,domesticviolence,"\n\nDear Reddit community,\n\nI’m writing th...",t2_60z60tf,False,,0,False,Four months later and still no peace,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.577041e+09,1,,False,,,
995,,domesticviolence,With my ex-husband the holidays were the time ...,t2_ya7b3,False,,0,False,I hate the holidays normally,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.577040e+09,0,,False,,,
996,,domesticviolence,This wasn't the first time.\n\nThe first time ...,t2_11e40weh,False,,0,False,I slapped my husband and he punched me in the ...,[],...,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1.577014e+09,0,,False,,,
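
The uniqueness check and de-duplication used above can be tried on a small frame to see what each call returns. The toy data here is made up for illustration and stands in for the scraped posts.

```python
import pandas as pd

# Toy frame with two repeated 'name' values (made-up data).
toy = pd.DataFrame({
    'name': ['t3_a', 't3_b', 't3_a', 't3_c', 't3_b'],
    'title': ['first', 'second', 'first', 'third', 'second'],
})

n_unique = toy['name'].nunique()                       # distinct post names
n_repeats = int(toy.duplicated(subset='name').sum())   # rows that repeat an earlier name
deduped = toy.drop_duplicates(subset='name').reset_index(drop=True)

print(n_unique, n_repeats, deduped.shape)  # prints 3 2 (3, 2)
```

`drop_duplicates` keeps the first occurrence of each `name` by default, which is why the original post order is preserved.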


### 1.2 Extracting data from the r/depression subreddit page
<a id='ext2'></a>

We will be using the same method to extract data from the r/depression subreddit page.

In [23]:
dep_url = 'https://www.reddit.com/r/depression.json'

In [24]:
# Calling reddit_scrape function on depression subreddit

dep_posts = []
reddit_scrape(dep_url, dep_posts, 50)

SCRAPING https://www.reddit.com/r/depression.json
Number of posts downloaded: 1231


As with r/domesticviolence, I will check r/depression for duplicates.

In [67]:
# Convert list to dataframe

dep_df = pd.DataFrame(dep_posts)
print('Shape of dep_df:', dep_df.shape)
dep_df.head()

Shape of dep_df: (1231, 103)


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,0,False,Our most-broken and least-understood rules is ...,[],...,/r/depression/comments/doqwow/our_mostbroken_a...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,648241,1572361000.0,0,,False,
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_64qjj,False,,0,False,Regular Check-In Post,[],...,/r/depression/comments/exo6f1/regular_checkin_...,no_ads,True,https://www.reddit.com/r/depression/comments/e...,648241,1580649000.0,0,,False,
2,,depression,Even if some posts blow up and have a bit of a...,t2_5xpk5iif,False,,1,False,This sub is counterproductive,[],...,/r/depression/comments/hcco2h/this_sub_is_coun...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1592614000.0,0,,False,
3,,depression,I know I shouldnt complain at 23 But Ive strug...,t2_6x5lojw8,False,,0,False,23 F. I feel so behind Everyone. I feel like a...,[],...,/r/depression/comments/hcl4eo/23_f_i_feel_so_b...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1592654000.0,0,,False,
4,,depression,As i go down the rabbit hole of why any of thi...,t2_564vn2mq,False,,0,False,The more depressed i get the more music i list...,[],...,/r/depression/comments/hca12w/the_more_depress...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1592605000.0,0,,False,


In [41]:
# Check for unique entries 
print('There are {} unique posts in dep_df.'.format(len(dep_df['name'].unique())))

There are 955 unique posts in dep_df.


In [68]:
# Drop duplicate rows and reset index

dep_df.drop_duplicates(subset='name', inplace=True)
dep_df.reset_index(drop=True, inplace=True)

print('Shape of dep_df:', dep_df.shape)
dep_df

Shape of dep_df: (955, 103)


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,0,False,Our most-broken and least-understood rules is ...,[],...,/r/depression/comments/doqwow/our_mostbroken_a...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,648241,1.572361e+09,0,,False,
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_64qjj,False,,0,False,Regular Check-In Post,[],...,/r/depression/comments/exo6f1/regular_checkin_...,no_ads,True,https://www.reddit.com/r/depression/comments/e...,648241,1.580649e+09,0,,False,
2,,depression,Even if some posts blow up and have a bit of a...,t2_5xpk5iif,False,,1,False,This sub is counterproductive,[],...,/r/depression/comments/hcco2h/this_sub_is_coun...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1.592614e+09,0,,False,
3,,depression,I know I shouldnt complain at 23 But Ive strug...,t2_6x5lojw8,False,,0,False,23 F. I feel so behind Everyone. I feel like a...,[],...,/r/depression/comments/hcl4eo/23_f_i_feel_so_b...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1.592654e+09,0,,False,
4,,depression,As i go down the rabbit hole of why any of thi...,t2_564vn2mq,False,,0,False,The more depressed i get the more music i list...,[],...,/r/depression/comments/hca12w/the_more_depress...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1.592605e+09,0,,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
950,,depression,Death is all I ever think about anymore. The e...,t2_5lie7iqk,False,,0,False,I think I’m getting suicidal again,[],...,/r/depression/comments/hbr3ue/i_think_im_getti...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648243,1.592529e+09,0,,False,
951,,depression,I was so excited when my boss told me he wante...,t2_9eo6qt,False,,0,False,He doesn't understand how much it wears me out.,[],...,/r/depression/comments/hbr3dj/he_doesnt_unders...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648243,1.592529e+09,0,,False,
952,,depression,,t2_33f2vqi5,False,,0,False,Is it normal to get sad when your mom threaten...,[],...,/r/depression/comments/hbkrjh/is_it_normal_to_...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648243,1.592507e+09,0,,False,
953,,depression,Now that my meds for other health issue kind o...,t2_5540e5zx,False,,0,False,"Fuck everything, i don't get joy out of anything.",[],...,/r/depression/comments/hbn4ze/fuck_everything_...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648243,1.592515e+09,0,,False,


There seems to be a difference in the number of columns between `dom_df` and `dep_df`, which I will explore further.

In [43]:
# Identify columns that are in dom_df but not dep_df
dom_df.columns.difference(dep_df.columns)

Index(['link_flair_template_id', 'post_hint', 'preview', 'thumbnail_height',
       'thumbnail_width'],
      dtype='object')

In [44]:
# Review the columns listed above
dom_df.loc[:,['link_flair_template_id', 'post_hint', 'preview', \
              'thumbnail_height','thumbnail_width']]

Unnamed: 0,link_flair_template_id,post_hint,preview,thumbnail_height,thumbnail_width
0,,,,,
1,91cd03a2-5cda-11ea-9a5b-0e285193b139,,,,
2,7d985224-5cda-11ea-aa5c-0e4c53184455,,,,
3,7d985224-5cda-11ea-aa5c-0e4c53184455,,,,
4,7d985224-5cda-11ea-aa5c-0e4c53184455,,,,
...,...,...,...,...,...
993,,,,,
994,,,,,
995,,,,,
996,,,,,


In [55]:
# Print out null value counts for the 5 extra columns

print(dom_df['link_flair_template_id'].isnull().value_counts(),'\n')
print(dom_df['post_hint'].isnull().value_counts(),'\n')
print(dom_df['preview'].isnull().value_counts(),'\n')
print(dom_df['thumbnail_height'].isnull().value_counts(),'\n')
print(dom_df['thumbnail_width'].isnull().value_counts())

True     759
False    239
Name: link_flair_template_id, dtype: int64 

True     992
False      6
Name: post_hint, dtype: int64 

True     992
False      6
Name: preview, dtype: int64 

True    998
Name: thumbnail_height, dtype: int64 

True    998
Name: thumbnail_width, dtype: int64
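
The same null counts can be gathered for several columns in one call, which is handy when the list of columns grows. This is a sketch on made-up data; `extra_cols` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Made-up frame standing in for dom_df.
toy = pd.DataFrame({
    'post_hint': [None, 'image', None, None],
    'thumbnail_height': [None, None, None, None],
    'title': ['a', 'b', 'c', 'd'],
})

extra_cols = ['post_hint', 'thumbnail_height']

# Number of nulls per column, and the share of rows that are null.
null_counts = toy[extra_cols].isnull().sum()
null_share = toy[extra_cols].isnull().mean()

print(null_counts)
print(null_share)
```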


The 5 extra columns in `dom_df` are mostly null and don't seem to hold any relevant information. They're unlikely to be used for our classifier, so I'll drop them next.

In [56]:
# Drop columns
dom_df.drop(['link_flair_template_id', 'post_hint', 'preview', \
              'thumbnail_height','thumbnail_width'], axis=1, inplace=True)
dom_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video
0,,domesticviolence,We know many of you are struggling to manage w...,t2_2egrzrvq,False,,0,False,COVID-19 RESOURCES FOR ABUSE VICTIMS,[],...,,/r/domesticviolence/comments/fsrd59/covid19_re...,,True,https://www.reddit.com/r/domesticviolence/comm...,10667,1585710000.0,1,,False
1,,domesticviolence,I have a friend who has been suffering abuse a...,t2_ixabg,False,,0,False,How can I help my friend?,[],...,,/r/domesticviolence/comments/hclztl/how_can_i_...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592658000.0,0,,False
2,,domesticviolence,I'm a 23 year old homeowner (female). I'm sell...,t2_2z1c66ln,False,,0,False,"I caught my biggest worry on tape, can anybody...",[],...,,/r/domesticviolence/comments/hcnqna/i_caught_m...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592666000.0,0,,False
3,,domesticviolence,Maybe this doesnt belong here Im not sure wher...,t2_6dnmfknj,False,,0,False,Im stupid. How long to feel better after minor...,[],...,,/r/domesticviolence/comments/hcgbh8/im_stupid_...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592630000.0,0,,False
4,,domesticviolence,My main questions are at the bottom if you jus...,t2_6yb1144r,False,,0,False,Vent but advice/knowledge is appreciated. My b...,[],...,,/r/domesticviolence/comments/hcha3d/vent_but_a...,,False,https://www.reddit.com/r/domesticviolence/comm...,10667,1592635000.0,0,,False


In [69]:
# Identify columns that are in dep_df but not dom_df
dep_df.columns.difference(dom_df.columns)

Index(['author_cakeday'], dtype='object')

In [70]:
dep_df['author_cakeday'].isnull().value_counts()

True     954
False      1
Name: author_cakeday, dtype: int64

Similarly, the `author_cakeday` column in `dep_df` is almost entirely null and doesn't seem useful, so I will drop it.

In [71]:
# Drop column
dep_df.drop('author_cakeday', axis=1, inplace=True)
dep_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video
0,,depression,We understand that most people who reply immed...,t2_1t70,False,,0,False,Our most-broken and least-understood rules is ...,[],...,,/r/depression/comments/doqwow/our_mostbroken_a...,no_ads,True,https://www.reddit.com/r/depression/comments/d...,648241,1572361000.0,0,,False
1,,depression,Welcome to /r/depression's check-in post - a p...,t2_64qjj,False,,0,False,Regular Check-In Post,[],...,,/r/depression/comments/exo6f1/regular_checkin_...,no_ads,True,https://www.reddit.com/r/depression/comments/e...,648241,1580649000.0,0,,False
2,,depression,Even if some posts blow up and have a bit of a...,t2_5xpk5iif,False,,1,False,This sub is counterproductive,[],...,,/r/depression/comments/hcco2h/this_sub_is_coun...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1592614000.0,0,,False
3,,depression,I know I shouldnt complain at 23 But Ive strug...,t2_6x5lojw8,False,,0,False,23 F. I feel so behind Everyone. I feel like a...,[],...,,/r/depression/comments/hcl4eo/23_f_i_feel_so_b...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1592654000.0,0,,False
4,,depression,As i go down the rabbit hole of why any of thi...,t2_564vn2mq,False,,0,False,The more depressed i get the more music i list...,[],...,,/r/depression/comments/hca12w/the_more_depress...,no_ads,False,https://www.reddit.com/r/depression/comments/h...,648241,1592605000.0,0,,False


In [72]:
# Confirm shape for both dataframes
print(dom_df.shape)
print(dep_df.shape)

(998, 102)
(955, 102)
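
Matching shapes show that both dataframes have the same number of columns, but not that the column names agree. A stricter check, sketched here on made-up frames, compares the column sets directly and keeps only the shared ones. The frames `a` and `b` are illustrative stand-ins for `dom_df` and `dep_df`.

```python
import pandas as pd

# Made-up frames with one column each that the other lacks.
a = pd.DataFrame({'title': ['x'], 'selftext': ['y'], 'preview': [None]})
b = pd.DataFrame({'title': ['p'], 'selftext': ['q'], 'author_cakeday': [True]})

only_in_a = a.columns.difference(b.columns)
only_in_b = b.columns.difference(a.columns)

# Keep only the columns present in both frames.
shared = a.columns.intersection(b.columns)
a_aligned, b_aligned = a[shared], b[shared]

print(list(only_in_a), list(only_in_b))
print(list(a_aligned.columns) == list(b_aligned.columns))
```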


### 1.3 Export Data
<a id='1.3'></a>

In [73]:
# Export both dataframes to csv
dom_df.to_csv('../data/domesticviolence.csv', index = False)
dep_df.to_csv('../data/depression.csv', index = False)
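
A quick way to sanity-check an export is to read the file back and compare. The sketch below does this for a made-up frame in a temporary directory, so it does not touch `../data/`. One behaviour worth keeping in mind for the next notebook: an empty string written by `to_csv` comes back from `read_csv` as a null value by default.

```python
import os
import tempfile

import pandas as pd

# Made-up frame: one multi-line post body and one empty body.
df = pd.DataFrame({
    'name': ['t3_a', 't3_b'],
    'selftext': ['line one\nline two', ''],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.csv')
    df.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.shape)
print(restored['selftext'].isnull().sum())  # the empty string is read back as null
```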

### 1.4 Summary
<a id='1.4'></a>

- We have 998 unique posts for r/domesticviolence and 955 unique posts for r/depression.
- The two subreddits returned slightly different structures, which resulted in a different number of columns. Fortunately, the extra columns held no meaningful data and have been removed.
- Datasets have been successfully exported for EDA and modelling in the next notebooks.