# Project 3: Web APIs & Classification

## Notebook 2: Data Cleaning and EDA

### Contents:
- [EDA](#EDA)
- [Create classification column and combine data](#Create-classification-column-and-combine-data)

In [1]:
import pandas as pd
import ast

In [2]:
dems_uncleaned_df = pd.read_csv('../datasets/dems_top1000_posts.csv')
reps_uncleaned_df = pd.read_csv('../datasets/reps_top1000_posts.csv')

### EDA

In [3]:
# check to ensure the files are in the correct shape
print(dems_uncleaned_df.shape, reps_uncleaned_df.shape)

(998, 2) (999, 2)


In [4]:
dems_uncleaned_df['data'][0]

"{'approved_at_utc': None, 'subreddit': 'democrats', 'selftext': '', 'author_fullname': 't2_ttyik', 'saved': False, 'mod_reason_title': None, 'gilded': 2, 'clicked': False, 'title': 'This is President Barack Obama. He did not sell Americans out to the telecom lobby, but instead called upon on the FCC to take up the strongest possible rules to protect net neutrality, which they did at his instruction in 2015.', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/democrats', 'hidden': False, 'pwls': 6, 'link_flair_css_class': None, 'downs': 0, 'thumbnail_height': 140, 'top_awarded_type': None, 'hide_score': False, 'name': 't3_7gzh5a', 'quarantine': False, 'link_flair_text_color': 'dark', 'upvote_ratio': 0.68, 'author_flair_background_color': None, 'subreddit_type': 'public', 'ups': 54436, 'total_awards_received': 2, 'media_embed': {}, 'thumbnail_width': 140, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports': [], 'secure_media': None, 'is_reddit_media_dom

In [5]:
type(dems_uncleaned_df['data'][0])

str

Each entry in the `data` column was originally a dictionary, but it has been read back from the CSV as a string. We need to convert it back into a dictionary so that we can extract the relevant key-value pairs. We do this with the `literal_eval` function from the `ast` library.

In [6]:
dems_uncleaned_df['data'] = dems_uncleaned_df['data'].map(ast.literal_eval)
reps_uncleaned_df['data'] = reps_uncleaned_df['data'].map(ast.literal_eval)

In [7]:
# check that it has successfully changed to a dict type
type(dems_uncleaned_df['data'][0])

dict
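
As a minimal sketch of why `ast.literal_eval` is a reasonable choice here rather than `json.loads`: a dict written to CSV is stored as its Python string representation, which uses single quotes, `None` and `False`, and so is not valid JSON. The `post` dict below is a hypothetical, trimmed-down stand-in for a real record.

```python
import ast
import json

# Hypothetical, trimmed-down stand-in for one Reddit post record.
post = {'subreddit': 'democrats', 'selftext': '', 'approved_at_utc': None, 'saved': False}

# Saving to CSV stores the dict's string representation.
as_text = str(post)

# json.loads rejects it: single quotes, None and False are not valid JSON.
try:
    json.loads(as_text)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval parses Python literals back into objects without running code.
restored = ast.literal_eval(as_text)
print(json_ok, type(restored).__name__, restored == post)
```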

In [10]:
# the reddit post title
dems_uncleaned_df['data'][0]['title']

'This is President Barack Obama. He did not sell Americans out to the telecom lobby, but instead called upon on the FCC to take up the strongest possible rules to protect net neutrality, which they did at his instruction in 2015.'

In [11]:
# the reddit post text
dems_uncleaned_df['data'][0]['selftext']

''

In [12]:
# the reddit post id
dems_uncleaned_df['data'][0]['name']

't3_7gzh5a'

In [13]:
# the reddit post upvotes
dems_uncleaned_df['data'][0]['ups']

54436

In [14]:
# the reddit post downvotes
dems_uncleaned_df['data'][0]['downs']

0

In [16]:
len(dems_uncleaned_df)

998

In [17]:
# iterate through the dataset and extract these 5 key-value pairs
dems_titles = [dems_uncleaned_df['data'][i]['title'] for i in range(len(dems_uncleaned_df))]
dems_selftext = [dems_uncleaned_df['data'][i]['selftext'] for i in range(len(dems_uncleaned_df))]
dems_name = [dems_uncleaned_df['data'][i]['name'] for i in range(len(dems_uncleaned_df))]
dems_ups = [dems_uncleaned_df['data'][i]['ups'] for i in range(len(dems_uncleaned_df))]
dems_downs = [dems_uncleaned_df['data'][i]['downs'] for i in range(len(dems_uncleaned_df))]

In [18]:
print(len(dems_titles), len(dems_selftext), len(dems_name), len(dems_ups), len(dems_downs))

998 998 998 998 998


In [20]:
# convert the data into a pandas dataframe
dems_df = pd.DataFrame([dems_name, dems_titles, dems_selftext, dems_ups, dems_downs],
                       index = ['post_id', 'title', 'post_text', 'upvotes', 'downvotes']).T

In [21]:
# the same process is done on the republican dataset
reps_titles = [reps_uncleaned_df['data'][i]['title'] for i in range(len(reps_uncleaned_df))]
reps_selftext = [reps_uncleaned_df['data'][i]['selftext'] for i in range(len(reps_uncleaned_df))]
reps_name = [reps_uncleaned_df['data'][i]['name'] for i in range(len(reps_uncleaned_df))]
reps_ups = [reps_uncleaned_df['data'][i]['ups'] for i in range(len(reps_uncleaned_df))]
reps_downs = [reps_uncleaned_df['data'][i]['downs'] for i in range(len(reps_uncleaned_df))]

print(len(reps_titles), len(reps_selftext), len(reps_name), len(reps_ups), len(reps_downs))

999 999 999 999 999


In [22]:
reps_df = pd.DataFrame([reps_name, reps_titles, reps_selftext, reps_ups, reps_downs],
                       index = ['post_id', 'title', 'post_text', 'upvotes', 'downvotes']).T
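
Since the same extraction is repeated for both subreddits, it could be folded into one helper. The sketch below is one possible way to do that, not the approach used in this notebook: the helper name `posts_to_frame` and the `sample` records are hypothetical, while the five field names come from the cells above.

```python
import pandas as pd

# Maps the Reddit field names used above to the column names used above.
FIELDS = {'name': 'post_id', 'title': 'title', 'selftext': 'post_text',
          'ups': 'upvotes', 'downs': 'downvotes'}

def posts_to_frame(posts):
    """Build a DataFrame from an iterable of post dicts (hypothetical helper)."""
    rows = [{new: post[old] for old, new in FIELDS.items()} for post in posts]
    return pd.DataFrame(rows, columns=list(FIELDS.values()))

# Tiny made-up sample, not real data.
sample = [
    {'name': 't3_aaa', 'title': 'First', 'selftext': '', 'ups': 10, 'downs': 0, 'extra': 1},
    {'name': 't3_bbb', 'title': 'Second', 'selftext': 'body', 'ups': 5, 'downs': 0, 'extra': 2},
]
sample_df = posts_to_frame(sample)
print(sample_df.dtypes)
```

With the real data this would be called as `posts_to_frame(dems_uncleaned_df['data'])`. Building the frame row by row lets pandas infer an integer dtype for `upvotes` and `downvotes`, whereas transposing a frame built from mixed-type lists typically leaves every column as `object`.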

### Create classification column and combine data

Before combining the two dataframes, we add a binary column, `'is_dems'`, that records which subreddit each post came from. Note that the cell below assigns `0` to all posts from r/democrats and `1` to all posts from r/Republican, so despite its name, `is_dems == 1` marks a Republican post in the outputs that follow.

In [23]:
dems_df['is_dems'] = 0
reps_df['is_dems'] = 1

In [24]:
# ignore_index resets the index of the combined dataframe, since the original indices do not need to be preserved
df = pd.concat([dems_df, reps_df], ignore_index = True)
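
To illustrate what `ignore_index` changes, here is a small sketch on two made-up frames (the post ids are placeholders, not real data). Without it, each frame keeps its own 0-based index and the labels repeat.

```python
import pandas as pd

# Two tiny made-up frames, each with its own 0-based index.
a = pd.DataFrame({'post_id': ['t3_a1', 't3_a2']})
b = pd.DataFrame({'post_id': ['t3_b1', 't3_b2']})

kept = pd.concat([a, b])                      # index labels repeat
reset = pd.concat([a, b], ignore_index=True)  # index is renumbered from 0

print(list(kept.index), list(reset.index))
```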

In [25]:
df.head()

| | post_id | title | post_text | upvotes | downvotes | is_dems |
|---|---|---|---|---|---|---|
| 0 | t3_7gzh5a | This is President Barack Obama. He did not sel... | | 54436 | 0 | 0 |
| 1 | t3_7ekych | Join The Battle For Net Neutrality! Don't Let ... | | 30254 | 0 | 0 |
| 2 | t3_7oj9cv | Republican ‘pro-life’ congressman slept with p... | | 19729 | 0 | 0 |
| 3 | t3_85y5ja | It would not be polite to ask the President to... | | 19068 | 0 | 0 |
| 4 | t3_89dmy7 | Brian Klaas: "The President is openly attempti... | | 17387 | 0 | 0 |


In [26]:
df.tail()

| | post_id | title | post_text | upvotes | downvotes | is_dems |
|---|---|---|---|---|---|---|
| 1992 | t3_940n6h | NY Times newest editorial board member doesn't... | | 339 | 0 | 1 |
| 1993 | t3_foqc2s | Why On Earth Should Anyone Believe China’s Cor... | | 330 | 0 | 1 |
| 1994 | t3_e01nly | California court strikes down law seeking rele... | | 337 | 0 | 1 |
| 1995 | t3_a56qlx | Tom Fitton to Testify on Clinton Foundation | | 340 | 0 | 1 |
| 1996 | t3_9ah5li | Black Pastor: Trump ‘Probably Going to Be… Mos... | | 339 | 0 | 1 |


In [28]:
df.shape

(1997, 6)

In [36]:
# check for null values (this cell was re-run after `combine_text` was added below, hence the extra column in the output)
df.isnull().sum()

post_id         0
title           0
post_text       0
upvotes         0
downvotes       0
is_dems         0
combine_text    0
dtype: int64

In [30]:
df['is_dems'].value_counts()

1    999
0    998
Name: is_dems, dtype: int64

In [31]:
# collect the posts that have body text (a non-empty `post_text`)
post_with_text = [post for post in df['post_text'] if (len(post) != 0)]

In [32]:
len(post_with_text)

35
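
The null check reports zero missing values for `post_text` even though only 35 posts have body text, because an empty string is not a null value. The sketch below shows this on a made-up column and counts the non-empty entries with a vectorised string method instead of a list comprehension.

```python
import pandas as pd

# Made-up column: two empty strings and one real body.
post_text = pd.Series(['', 'some body text', ''])

n_null = int(post_text.isnull().sum())              # empty strings are not null
n_with_text = int((post_text.str.len() > 0).sum())  # count of non-empty entries

print(n_null, n_with_text)
```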

In [33]:
df['combine_text'] = df['title'] + ' ' + df['post_text']
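
Because most posts have an empty `post_text`, the concatenation above leaves a trailing space on most rows. This is harmless for most tokenizers, but if a clean string is preferred, one option is to strip it afterwards. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: one link post (no body) and one text post.
demo = pd.DataFrame({'title': ['Link post', 'Text post'],
                     'post_text': ['', 'with a body']})

# Same concatenation as above, then strip the trailing space left by empty bodies.
demo['combine_text'] = (demo['title'] + ' ' + demo['post_text']).str.strip()

print(demo['combine_text'].tolist())
```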

In [34]:
df.head()

| | post_id | title | post_text | upvotes | downvotes | is_dems | combine_text |
|---|---|---|---|---|---|---|---|
| 0 | t3_7gzh5a | This is President Barack Obama. He did not sel... | | 54436 | 0 | 0 | This is President Barack Obama. He did not sel... |
| 1 | t3_7ekych | Join The Battle For Net Neutrality! Don't Let ... | | 30254 | 0 | 0 | Join The Battle For Net Neutrality! Don't Let ... |
| 2 | t3_7oj9cv | Republican ‘pro-life’ congressman slept with p... | | 19729 | 0 | 0 | Republican ‘pro-life’ congressman slept with p... |
| 3 | t3_85y5ja | It would not be polite to ask the President to... | | 19068 | 0 | 0 | It would not be polite to ask the President to... |
| 4 | t3_89dmy7 | Brian Klaas: "The President is openly attempti... | | 17387 | 0 | 0 | Brian Klaas: "The President is openly attempti... |


In [35]:
df.to_csv('../datasets/df_eda.csv', index = False)
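
One thing to watch when this file is loaded again: by default `pd.read_csv` turns empty fields into `NaN`, so the empty `post_text` values would come back as nulls. Notebook 03 is not shown here, so whether it needs to handle this is an assumption. The sketch below uses made-up rows and an in-memory buffer to show the effect and one way to keep the empty strings.

```python
import io
import pandas as pd

# Made-up rows: one post without body text and one with.
demo = pd.DataFrame({'title': ['Link post', 'Text post'],
                     'post_text': ['', 'with a body']})

csv_text = demo.to_csv(index=False)

# Default read: the empty field becomes NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps the empty field as an empty string.
kept_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(int(default_read['post_text'].isnull().sum()))
print(kept_read['post_text'].tolist())
```

An alternative would be to read with the defaults and then call `fillna('')` on the text columns.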

## Continue to Notebook 03