# Part 3: Converting to CSV

In the last notebook, we created a `.json` file with all the subreddit membership and categorization info. However, deeply nested JSON is awkward to browse and analyze, so in this notebook we'll flatten it into a much more readable `.csv` file.

In [2]:
import pandas as pd
import json

In [4]:
with open("subreddits.json", "r") as infile:
    subreddit_info = json.load(infile)

We start by building a dictionary whose keys are tuples of the form (Category1, Category2, Category3, Name), one per subreddit, each mapped to a list holding that subreddit's membership info and description. We then convert this dictionary to a dataframe object using `pd.DataFrame.from_dict`, with the tuples as the dataframe's index.

In [34]:
# Flatten the three category levels: key = (Category1, Category2, Category3, Name)
subreddit_info_newkey = {
    (cat1, cat2, cat3, sub['Name']): [sub['Members'], sub['Description']]
    for cat1, level2 in subreddit_info.items()
    for cat2, level3 in level2.items()
    for cat3, subs in level3.items()
    for sub in subs
}
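
To make the flattening step easier to follow, here is a minimal sketch on a tiny made-up dictionary. The nesting (three category levels, then a list of records with `Name`, `Members` and `Description`) is inferred from the comprehension above; the subreddit names, member counts and the `flatten` helper are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical toy data mirroring the assumed nesting:
# Category1 -> Category2 -> Category3 -> list of subreddit records.
toy_info = {
    "Technology": {
        "Internet": {
            "Weblogs/Blogging": [
                {"Name": "r/example_a", "Members": "2.0k Members",
                 "Description": "First toy entry"},
                {"Name": "r/example_b", "Members": "1.5k Members",
                 "Description": "Second toy entry"},
            ]
        }
    }
}


def flatten(info):
    """Map (Category1, Category2, Category3, Name) -> [Members, Description]."""
    return {
        (cat1, cat2, cat3, sub["Name"]): [sub["Members"], sub["Description"]]
        for cat1, level2 in info.items()
        for cat2, level3 in level2.items()
        for cat3, subs in level3.items()
        for sub in subs
    }


toy_flat = flatten(toy_info)

# Each dictionary key becomes a row label; the two list items become columns 0 and 1.
toy_df = pd.DataFrame.from_dict(toy_flat, orient="index")
toy_df = toy_df.rename(columns={0: "Members", 1: "Description"})
print(toy_df)
```

One thing to keep in mind with this approach: dictionary keys are unique, so if the same subreddit name appeared twice under the same three categories, the later record would silently overwrite the earlier one.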

In [38]:
subreddit_df = pd.DataFrame.from_dict(subreddit_info_newkey, orient='index')
subreddit_df.rename({0: 'Members', 1: 'Description'}, axis=1, inplace=True)
subreddit_df

Unnamed: 0,Members,Description
"(Psychedelics/Drugs/Hallucinogens, Cannabis, CBD/cannabis/Marijuana, r/Drugs)",773.5k Members,Sourcing drugs is NOT allowed here! \n\nYOU WI...
"(Psychedelics/Drugs/Hallucinogens, Cannabis, CBD/cannabis/Marijuana, r/weed)",540.3k Members,All about weed. The most open-minded weed comm...
"(Psychedelics/Drugs/Hallucinogens, Cannabis, CBD/cannabis/Marijuana, r/Psychonaut)",344.5k Members,A psychonaut is a person who experiences inten...
"(Psychedelics/Drugs/Hallucinogens, Cannabis, CBD/cannabis/Marijuana, r/marijuanaenthusiasts)",299.5k Members,Welcome to /r/marijuanaenthusiasts.\n\nThis is...
"(Psychedelics/Drugs/Hallucinogens, Cannabis, CBD/cannabis/Marijuana, r/StonerEngineering)",282.8k Members,You give a few pot heads a bunch of weed and n...
...,...,...
"(Technology, Internet, Weblogs/Blogging, r/Emailmarketing)",21.8k Members,"In 2020, Email still has the best returns, reg..."
"(Technology, Internet, Weblogs/Blogging, r/GrowthHacking)",16.5k Members,Welcome to world's largest Growth Hackers Redd...
"(Technology, Internet, Weblogs/Blogging, r/ladybusiness)",8.9k Members,"A place to discuss, celebrate and encourage fo..."
"(Technology, Internet, Weblogs/Blogging, r/ContentMarketing)",7.5k Members,Online marketing is shifting more and more tow...


Resetting the index replaces our tuple index with integers and moves the tuples into a new column called `index`.

We want separate columns for each of the categories and the subreddit name, so we then expand the tuples in `index` into a four-column dataframe and assign it to our new column names: Category1, Category2, Category3, and Name.

In [40]:
subreddit_df = subreddit_df.reset_index()

In [43]:
subreddit_df[['Category1', 'Category2', 'Category3', 'Name']] = pd.DataFrame(subreddit_df['index'].tolist(), index=subreddit_df.index)

In [44]:
subreddit_df

Unnamed: 0,index,Members,Description,Category1,Category2,Category3,Name
0,"(Psychedelics/Drugs/Hallucinogens, Cannabis, C...",773.5k Members,Sourcing drugs is NOT allowed here! \n\nYOU WI...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana,r/Drugs
1,"(Psychedelics/Drugs/Hallucinogens, Cannabis, C...",540.3k Members,All about weed. The most open-minded weed comm...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana,r/weed
2,"(Psychedelics/Drugs/Hallucinogens, Cannabis, C...",344.5k Members,A psychonaut is a person who experiences inten...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana,r/Psychonaut
3,"(Psychedelics/Drugs/Hallucinogens, Cannabis, C...",299.5k Members,Welcome to /r/marijuanaenthusiasts.\n\nThis is...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana,r/marijuanaenthusiasts
4,"(Psychedelics/Drugs/Hallucinogens, Cannabis, C...",282.8k Members,You give a few pot heads a bunch of weed and n...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana,r/StonerEngineering
...,...,...,...,...,...,...,...
9477,"(Technology, Internet, Weblogs/Blogging, r/Ema...",21.8k Members,"In 2020, Email still has the best returns, reg...",Technology,Internet,Weblogs/Blogging,r/Emailmarketing
9478,"(Technology, Internet, Weblogs/Blogging, r/Gro...",16.5k Members,Welcome to world's largest Growth Hackers Redd...,Technology,Internet,Weblogs/Blogging,r/GrowthHacking
9479,"(Technology, Internet, Weblogs/Blogging, r/lad...",8.9k Members,"A place to discuss, celebrate and encourage fo...",Technology,Internet,Weblogs/Blogging,r/ladybusiness
9480,"(Technology, Internet, Weblogs/Blogging, r/Con...",7.5k Members,Online marketing is shifting more and more tow...,Technology,Internet,Weblogs/Blogging,r/ContentMarketing
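
The tuple-splitting step can be tried in isolation on a small frame. This is only a sketch under assumptions: the two rows are made up, and the `index` column is built by hand here rather than coming from `reset_index()`.

```python
import pandas as pd

# Hypothetical two-row frame with a column of 4-tuples, standing in for
# the `index` column produced by reset_index().
toy_df = pd.DataFrame({
    "index": [
        ("Technology", "Internet", "Weblogs/Blogging", "r/example_a"),
        ("Technology", "Internet", "Weblogs/Blogging", "r/example_b"),
    ],
    "Members": ["2.0k Members", "1.5k Members"],
    "Description": ["First toy entry", "Second toy entry"],
})

# .tolist() yields a list of tuples; the DataFrame constructor spreads each
# tuple across columns 0..3. Reusing toy_df.index keeps the rows aligned.
parts = pd.DataFrame(toy_df["index"].tolist(), index=toy_df.index)

# Assign the four positional columns to the four new named columns.
toy_df[["Category1", "Category2", "Category3", "Name"]] = parts
print(toy_df[["Name", "Category1", "Category2", "Category3"]])
```

Another route that should give the same columns is to build the index with `pd.MultiIndex.from_tuples(..., names=[...])` before calling `reset_index()`, which names the columns in one step.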


We don't need the `index` column anymore, so we drop it. This dataset would also be more immediately readable if the subreddit name and its specific info were at the front of the dataframe, with the categorization information at the end. So, finally, we reorder the columns and save the result to a CSV file.

In [46]:
subreddit_df = subreddit_df.drop(['index'], axis=1)[['Name', 'Members', 'Description', 'Category1', 'Category2', 'Category3']]

In [47]:
subreddit_df

Unnamed: 0,Name,Members,Description,Category1,Category2,Category3
0,r/Drugs,773.5k Members,Sourcing drugs is NOT allowed here! \n\nYOU WI...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana
1,r/weed,540.3k Members,All about weed. The most open-minded weed comm...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana
2,r/Psychonaut,344.5k Members,A psychonaut is a person who experiences inten...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana
3,r/marijuanaenthusiasts,299.5k Members,Welcome to /r/marijuanaenthusiasts.\n\nThis is...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana
4,r/StonerEngineering,282.8k Members,You give a few pot heads a bunch of weed and n...,Psychedelics/Drugs/Hallucinogens,Cannabis,CBD/cannabis/Marijuana
...,...,...,...,...,...,...
9477,r/Emailmarketing,21.8k Members,"In 2020, Email still has the best returns, reg...",Technology,Internet,Weblogs/Blogging
9478,r/GrowthHacking,16.5k Members,Welcome to world's largest Growth Hackers Redd...,Technology,Internet,Weblogs/Blogging
9479,r/ladybusiness,8.9k Members,"A place to discuss, celebrate and encourage fo...",Technology,Internet,Weblogs/Blogging
9480,r/ContentMarketing,7.5k Members,Online marketing is shifting more and more tow...,Technology,Internet,Weblogs/Blogging


In [49]:
subreddit_df.to_csv("subreddit_categories.csv", index=False)
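
As a quick illustration of why we pass `index=False`, the sketch below writes a small made-up frame to an in-memory buffer both ways and reads it back. The data and variable names are hypothetical; `io.StringIO` is used only so that no files are created.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for subreddit_df.
toy_df = pd.DataFrame({
    "Name": ["r/example_a", "r/example_b"],
    "Members": ["2.0k Members", "1.5k Members"],
})

# Default behaviour (index=True): the integer row labels are written as an
# extra, unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the real columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Since our integer index carries no information, leaving it out keeps the CSV limited to the six columns we actually care about.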

The above data is available, without any further manipulation, on [Kaggle](https://www.kaggle.com/morganoneka/subreddit-categorization)!