### Project 3 - Natural Language Processing on Parenting-Related Reddit
## Notebook 2/4: Cleaning and Modification of Data

#### Kristina Joos

Notebook 1: Obtaining Data.  
Notebook 2: Cleaning and Modifying Data.  
Notebook 3: Modeling.  
Notebook 4: Predicting.  
  

---




# 2. Data Cleaning

---
## 2.1. Importing

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:
# Read depression data set as d_df:
d_df = pd.read_csv('../data/depression.csv')
# Read control data set as c_df:
c_df = pd.read_csv('../data/control.csv')
# Read parenting data set as p_df:
p_df = pd.read_csv('../data/parenting.csv')
# Read attachment parenting data set as ap_df:
ap_df = pd.read_csv('../data/attachment.csv')
# Read traditional data set as tr_df (sleeptrain subreddit):
tr_df = pd.read_csv('../data/traditional.csv')
# Read DIY data set as diy_df:
diy_df = pd.read_csv('../data/diy.csv')


---
## 2.2. Combine DataFrames

### 2.2.1. Add class to depression and control dataframes & combine

In [12]:
# depression as class 1
d_df['class'] = 1

In [13]:
d_df['class']

0       1
1       1
2       1
3       1
4       1
       ..
2150    1
2151    1
2152    1
2153    1
2154    1
Name: class, Length: 2155, dtype: int64

In [14]:
# control as class 0
c_df['class'] = 0

In [15]:
c_df['class']

0       0
1       0
2       0
3       0
4       0
       ..
3138    0
3139    0
3140    0
3141    0
3142    0
Name: class, Length: 3143, dtype: int64

In [16]:
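# Combine d_df and c_df into depression_df.
# ignore_index=True renumbers the rows so index labels stay unique after stacking.
depression_df = pd.concat([d_df, c_df], ignore_index=True)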

In [17]:
# Print the two input shapes side by side (adding the tuples would concatenate them):
print(d_df.shape, c_df.shape)
print(depression_df.shape)

(2155, 9) (3143, 9)
(5298, 9)


In [18]:
d_df.columns

Index(['title', 'subreddit', 'score', 'id', 'url', 'comms_num', 'created',
       'body', 'class'],
      dtype='object')

In [19]:
c_df.columns

Index(['title', 'subreddit', 'score', 'id', 'url', 'comms_num', 'created',
       'body', 'class'],
      dtype='object')

In [20]:
depression_df.columns

Index(['title', 'subreddit', 'score', 'id', 'url', 'comms_num', 'created',
       'body', 'class'],
      dtype='object')
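
The labelling and stacking steps in this section can be checked on toy data. The sketch below is only an illustration: `toy_d` and `toy_c` are hypothetical stand-ins for `d_df` and `c_df`, not the real Reddit data. It confirms that the row counts add up and that the combined index has no duplicate labels when `ignore_index=True` is passed to `pd.concat`.

```python
import pandas as pd

# Hypothetical stand-ins for d_df and c_df (not the real Reddit data).
toy_d = pd.DataFrame({"title": ["a", "b"], "body": ["x", None]})
toy_c = pd.DataFrame({"title": ["c", "d", "e"], "body": ["y", "z", None]})

toy_d["class"] = 1  # depression posts
toy_c["class"] = 0  # control posts

# ignore_index=True renumbers rows so index labels stay unique after stacking.
toy_combined = pd.concat([toy_d, toy_c], ignore_index=True)

# Row counts should add up, and the index should have no duplicates.
assert len(toy_combined) == len(toy_d) + len(toy_c)
assert toy_combined.index.is_unique

# Class balance of the combined frame.
class_counts = toy_combined["class"].value_counts().to_dict()
print(class_counts)
```

A unique index matters later on, because label-based assignments with `.loc` align on index labels.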


---
## 2.3. Combine Title and Body to Document

### 2.3.1. depression_df

In [22]:
depression_df['doc'] = depression_df['title']+depression_df['body']

In [23]:
depression_df['title'].shape

(5298,)

In [24]:
depression_df['doc'].shape

(5298,)

In [25]:
# Checking for NaN
depression_df.isnull().sum()

title          0
subreddit      0
score          0
id             0
url            0
comms_num      0
created        0
body         361
class          0
doc          361
dtype: int64

In [26]:
# A lot of posts only have a title. Fill in NaN values in doc with title:
depression_df.loc[depression_df['doc'].isnull(), 'doc'] = depression_df['title']

In [27]:
depression_df.isna().sum()

title          0
subreddit      0
score          0
id             0
url            0
comms_num      0
created        0
body         361
class          0
doc            0
dtype: int64
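
One thing to keep in mind: joining `title` and `body` with a bare `+` fuses the last word of the title with the first word of the body, which can produce spurious tokens at the vectorizing stage. A minimal sketch of an alternative, assuming a space separator is acceptable; the helper name `build_doc` and its `sep` parameter are hypothetical and are not part of this notebook:

```python
import pandas as pd


def build_doc(df: pd.DataFrame, sep: str = " ") -> pd.Series:
    """Join title and body with a separator; fall back to the title when body is missing."""
    joined = df["title"] + sep + df["body"]  # NaN wherever body is NaN
    return joined.fillna(df["title"])        # title-only posts keep their title


# Hypothetical toy frame, not the real data.
toy = pd.DataFrame({
    "title": ["Need advice", "Title only"],
    "body": ["My toddler will not sleep.", None],
})
toy["doc"] = build_doc(toy)
print(toy["doc"].tolist())
```

The same helper could be applied to each of the other frames, since they share the `title` and `body` columns.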

### 2.3.2. ap_df

In [45]:
ap_df.shape

(905, 9)

In [28]:
ap_df['doc'] = ap_df['title']+ap_df['body']

In [29]:
# A lot of posts only have a title. Fill in NaN values in doc with title:
ap_df.loc[ap_df['doc'].isnull(),'doc'] = ap_df['title']

In [30]:
# Checking for NaN
ap_df.isnull().sum()

title          0
subreddit      0
score          0
id             0
url            0
comms_num      0
created        0
body         197
doc            0
dtype: int64

In [31]:
ap_df.columns

Index(['title', 'subreddit', 'score', 'id', 'url', 'comms_num', 'created',
       'body', 'doc'],
      dtype='object')

### 2.3.3. tr_df

In [47]:
tr_df.shape

(1000, 9)

In [32]:
tr_df['doc'] = tr_df['title']+tr_df['body']

In [33]:
# Posts without a body only have a title. Fill in NaN values in doc with title:
tr_df.loc[tr_df['doc'].isnull(),'doc'] = tr_df['title']

In [34]:
# Checking for NaN
tr_df.isnull().sum()

title        0
subreddit    0
score        0
id           0
url          0
comms_num    0
created      0
body         6
doc          0
dtype: int64

In [35]:
tr_df.columns

Index(['title', 'subreddit', 'score', 'id', 'url', 'comms_num', 'created',
       'body', 'doc'],
      dtype='object')

### 2.3.4. p_df

In [48]:
p_df.shape

(903, 9)

In [36]:
p_df['doc'] = p_df['title']+p_df['body']

In [37]:
# Posts without a body only have a title. Fill in NaN values in doc with title:
p_df.loc[p_df['doc'].isnull(),'doc'] = p_df['title']

In [38]:
# Checking for NaN
p_df.isnull().sum()

title        0
subreddit    0
score        0
id           0
url          0
comms_num    0
created      0
body         1
doc          0
dtype: int64

In [39]:
p_df.columns

Index(['title', 'subreddit', 'score', 'id', 'url', 'comms_num', 'created',
       'body', 'doc'],
      dtype='object')

### 2.3.5. diy_df

In [40]:
diy_df['doc'] = diy_df['title']+diy_df['body']

In [41]:
# A lot of posts only have a title. Fill in NaN values in doc with title:
diy_df.loc[diy_df['doc'].isnull(),'doc'] = diy_df['title']

In [42]:
# Checking for NaN
diy_df.isnull().sum()

title         0
subreddit     0
score         0
id            0
url           0
comms_num     0
created       0
body         86
doc           0
dtype: int64

In [43]:
diy_df.columns

Index(['title', 'subreddit', 'score', 'id', 'url', 'comms_num', 'created',
       'body', 'doc'],
      dtype='object')
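
The frames in 2.3.2 to 2.3.5 all go through the same two steps, so the repetition could be folded into a loop. A sketch on hypothetical toy frames (the dictionary `toy_frames` and its keys are illustrative only), which also collects the remaining NaN count in `doc` per frame:

```python
import pandas as pd

# Hypothetical toy frames standing in for ap_df, tr_df, p_df and diy_df.
toy_frames = {
    "ap": pd.DataFrame({"title": ["t1", "t2"], "body": ["b1", None]}),
    "tr": pd.DataFrame({"title": ["t3"], "body": ["b3"]}),
}

missing_doc = {}
for name, frame in toy_frames.items():
    # Same two steps as in the cells above: concatenate, then fall back to the title.
    frame["doc"] = frame["title"] + frame["body"]
    frame.loc[frame["doc"].isnull(), "doc"] = frame["title"]
    # Record how many doc values are still missing (expected: none).
    missing_doc[name] = int(frame["doc"].isnull().sum())

print(missing_doc)
```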

---
## 2.4. Saving Modified DataFrames

In [44]:
# Note: depression.csv, parenting.csv and diy.csv have the same paths as the raw files read in 2.1,
# so these calls overwrite the raw inputs.
depression_df.to_csv(r'../data/depression.csv', index=False)  # combined depression and control df
ap_df.to_csv(r'../data/attachmentparenting.csv', index=False)
tr_df.to_csv(r'../data/traditionalparenting.csv', index=False)
p_df.to_csv(r'../data/parenting.csv', index=False)
diy_df.to_csv(r'../data/diy.csv', index=False)
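
Because some of the output paths coincide with the raw input paths, rerunning this notebook from the top would read already-modified files. One possible way to avoid that is to write cleaned frames to a separate folder. The sketch below is an assumption, not the layout this project uses: the helper `save_frames` and the `processed` folder name are hypothetical, and a temporary directory stands in for `../data`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frames(frames: dict, out_dir: Path) -> list:
    """Write each DataFrame to out_dir/<name>.csv without the index; return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, df in frames.items():
        path = out_dir / f"{name}.csv"
        df.to_csv(path, index=False)
        written.append(path)
    return written


# Hypothetical toy frames; a temporary directory stands in for ../data.
with tempfile.TemporaryDirectory() as tmp:
    toy_frames = {
        "depression": pd.DataFrame({"doc": ["a"], "class": [1]}),
        "diy": pd.DataFrame({"doc": ["b"]}),
    }
    paths = save_frames(toy_frames, Path(tmp) / "processed")
    saved_names = sorted(p.name for p in paths)
    reloaded = pd.read_csv(paths[0])  # read back while the directory still exists

print(saved_names)
```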