# Convert JSONs to CSVs
_By Tim Dwyer_ 


In [1]:
import pandas as pd
import numpy as np
import json
from itertools import product

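The `json_to_csv` function converts a JSON file into a CSV file, keeping only the fields whose keys appear in the list `cols`. Some care must be taken regarding which API the JSON came from: the Pushshift files are flat lists of posts, while the non-Pushshift files nest each post under a `data` key that has to be unwrapped first.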
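To make the difference between the two layouts concrete, here is a minimal sketch. The records are made up for illustration, the `kind` field and the helper name `unwrap` are assumptions that do not appear in the notebook, and the only detail taken from the function below is that non-Pushshift posts sit under a `data` key.

```python
# Hypothetical records: all field values here are invented for illustration.
nested_records = [
    {"kind": "t3", "data": {"title": "First post", "score": 10, "selftext": ""}},
    {"kind": "t3", "data": {"title": "Second post", "score": 3, "selftext": "Hello"}},
]
pushshift_records = [
    {"title": "Third post", "score": 7, "selftext": ""},
]


def unwrap(records, pushshift=False):
    """Return a flat list of post dicts regardless of which layout they came in."""
    if pushshift:
        # Pushshift-style records are already flat.
        return list(records)
    # Otherwise each post is nested under a 'data' key.
    return [record["data"] for record in records]


cols = ["title", "score"]
flat = unwrap(nested_records) + unwrap(pushshift_records, pushshift=True)

# Keep only the requested keys, using ' ' as the fallback for absent ones.
rows = [{col: post.get(col, " ") for col in cols} for post in flat]
print(rows)
```

After unwrapping, both sources can be filtered with the same `dict.get` lookup, which is why the function only needs to branch once, at load time.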

In [2]:
def json_to_csv(subreddit=None, pushshift=False, cols=[]):
    if pushshift:
        with open(f'../data/json/pushshift_{subreddit}_2017.json', 'r') as file:
            json_all = json.load(file)
    else:
        with open(f'../data/json/{subreddit}.json', 'r') as file:
            json_all = json.load(file)
        json_all = [post['data'] for post in json_all]  # unwrap each post; avoids shadowing the json module

    
    df = {col:[] for col in cols}

    for json_tmp, col in product(json_all, cols):
        new_term = json_tmp.get(col, ' ')
        if new_term == '' and col == 'selftext':
            new_term = ' '
        df[col].append(new_term)

    df = pd.DataFrame(data=df, columns=cols)
    df.dropna(axis=1, how='all', inplace=True)
    df.drop_duplicates(inplace=True)
    df.to_csv(f'../data/csv/{subreddit}.csv',index=False)

It is reasonable at this point to ask what is going on in the `for` loop above. When a post has no text component, its `selftext` field is the empty string, `''`. When that value is written to a CSV and read back, pandas interprets the empty field as a missing value rather than as the string `''`. This is not ideal for my purposes, so I am encoding an empty `selftext` as a single space, `' '`, instead.

The only missing data here is `selftext` when the post has no text component. We already have a variable that encodes whether or not a post has selftext (the `is_self` feature does this), so we're really not losing any information by doing this.
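The behaviour described above can be checked with a small round trip. This is a sketch on a made-up two-row frame, using an in-memory buffer in place of the real CSV files. It assumes the default `read_csv` settings.

```python
import io

import pandas as pd

# Made-up frame: one empty selftext and one encoded as a single space.
df = pd.DataFrame({"is_self": [True, True], "selftext": ["", " "]})

# In memory, neither value counts as missing.
print(df["selftext"].isna().tolist())

# Write to an in-memory CSV and read it back with default settings.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# After the round trip, the empty string has become NaN; the space survives.
print(round_trip["selftext"].isna().tolist())
```

An alternative would be to pass `keep_default_na=False` to `pd.read_csv`, which keeps empty fields as empty strings, but that changes missing-value handling for every column, so the single-space encoding is the more targeted choice here.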

In [3]:
cols = [
    'title',
    'num_comments',
    'score',
    'over_18',
    'locked',
    'stickied', 
    'subreddit',
    'created_utc',
    'is_self',
    'selftext',
]

subreddits = [
    'math',
    'learnmath',
    
    'python',
    'learnpython',
    
    'datascience',
    'learnmachinelearning',
]


json_to_csv(subreddit='all', pushshift=False, cols=cols)

frames = []

for subreddit in subreddits:
    json_to_csv(subreddit=subreddit, pushshift=True, cols=cols)
    frames.append(pd.read_csv(f'../data/csv/{subreddit}.csv'))

# DataFrame.append was removed in pandas 2.0, so concatenate the per-subreddit frames instead
all_posts = pd.concat(frames, ignore_index=True).reindex(columns=cols)

all_posts.to_csv('../data/csv/combined_subreddits.csv', index=False)