# Exploring and Transforming JSON Schemas

## Introduction

In this lesson, you'll formalize how to explore a JSON file whose structure and schema are unknown to you. This often happens in practice when you are handed a file, or stumble upon one, with little documentation.

## Objectives
You will be able to:
* Use the `json` module to load and parse JSON documents
* Load and explore unknown JSON schemas
* Convert JSON to a pandas DataFrame

## Loading the JSON file

Load the data from the file `disease_data.json`.

In [1]:
# Parse the file into Python objects (dicts, lists, strings, numbers, None)
import json

with open('disease_data.json') as f:
    disease_data = json.load(f)

## Explore the first and second levels of the schema hierarchy

In [2]:
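# peek_hierarchy is a small helper defined in the "Convert to a DataFrame" section;
# json_utils is assumed here to be a local module that contains it
from json_utils import peek_hierarchy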

peek_hierarchy(disease_data)

{'keys': ['meta', 'data'], 'types': [dict, list]}
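The output above covers only the first level. One possible way to look one level deeper is a small recursive summary, sketched below. The function name `summarize_schema`, its `max_depth` parameter, and the miniature `toy` document are all hypothetical and used only for illustration; they are not part of the lab or of any library.

```python
import json

def summarize_schema(doc, max_depth=2, depth=0):
    """Return a nested summary of keys and container types down to max_depth."""
    if depth >= max_depth or not isinstance(doc, (dict, list)):
        # Past the depth limit, or at a scalar: report only the type name
        return type(doc).__name__
    if isinstance(doc, dict):
        return {key: summarize_schema(value, max_depth, depth + 1)
                for key, value in doc.items()}
    # Lists can be long, so report the length and sample only the first element
    sample = summarize_schema(doc[0], max_depth, depth + 1) if doc else None
    return {"list_length": len(doc), "first_item": sample}

# Hypothetical miniature document with the same top-level keys as the real file
toy = json.loads(
    '{"meta": {"view": {"columns": [{"name": "sid"}]}}, "data": [[1, "a"], [2, "b"]]}'
)
summary = summarize_schema(toy, max_depth=2)
print(summary)
```

Sampling only the first list element keeps the summary short, at the cost of assuming that list items share a structure. Calling `summarize_schema(disease_data, max_depth=3)` on the real file would be the natural next step.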

In [3]:
import pandas as pd
from nested_lookup import nested_lookup

cols = nested_lookup('name', disease_data['meta']['view']['columns'])
cols

['sid',
 'id',
 'position',
 'created_at',
 'created_meta',
 'updated_at',
 'updated_meta',
 'meta',
 'YearStart',
 'YearEnd',
 'LocationAbbr',
 'LocationDesc',
 'DataSource',
 'Topic',
 'Question',
 'Response',
 'DataValueUnit',
 'DataValueType',
 'DataValue',
 'DataValueAlt',
 'DataValueFootnoteSymbol',
 'DatavalueFootnote',
 'LowConfidenceLimit',
 'HighConfidenceLimit',
 'StratificationCategory1',
 'Stratification1',
 'StratificationCategory2',
 'Stratification2',
 'StratificationCategory3',
 'Stratification3',
 'GeoLocation',
 'ResponseID',
 'LocationID',
 'TopicID',
 'QuestionID',
 'DataValueTypeID',
 'StratificationCategoryID1',
 'StratificationID1',
 'StratificationCategoryID2',
 'StratificationID2',
 'StratificationCategoryID3',
 'StratificationID3']
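`nested_lookup` searches for the key at every depth of the structure it is given. As a hedged comparison, the sketch below contrasts a plain list comprehension, which reads only the top-level `name` of each column entry, with a hand-written recursive search. The `columns_meta` list is a hypothetical miniature of `disease_data['meta']['view']['columns']`; its `dataTypeName` and `format` fields are invented to show what happens when a nested `name` key exists, and `find_all` is an illustrative helper, not a library function.

```python
# Hypothetical miniature of the column metadata; field values are invented
columns_meta = [
    {"id": -1, "name": "sid", "dataTypeName": "meta_data"},
    {"id": 1, "name": "YearStart", "dataTypeName": "number",
     "format": {"name": "plain"}},
]

# Take only the 'name' stored at the top level of each column entry
col_names = [col["name"] for col in columns_meta]

def find_all(key, doc):
    """Recursively collect every value stored under `key`, at any depth."""
    found = []
    if isinstance(doc, dict):
        for k, v in doc.items():
            if k == key:
                found.append(v)
            found.extend(find_all(key, v))
    elif isinstance(doc, list):
        for item in doc:
            found.extend(find_all(key, item))
    return found

print(col_names)
print(find_all("name", columns_meta))
```

In this toy case the recursive search also picks up the nested `'plain'` value. The 42 names returned above match the expected column count, so the recursive approach appears to work for this file, but comparing the length of the result against the width of a data row is a cheap sanity check.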

## Convert to a DataFrame

Create a DataFrame from the JSON file. Be sure to retrieve the column names for the DataFrame. (Search within the 'meta' key of the top-level dictionary.) The DataFrame should include all 42 columns.

In [7]:
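def peek_hierarchy(doc):
    """Summarize the top level of a parsed JSON document."""
    if isinstance(doc, dict):
        # For a dict, report its keys and the type of each value
        return {"keys": list(doc.keys()), "types": list(map(type, doc.values()))}
    elif isinstance(doc, list):
        # For a list, report the type of each element
        return list(map(type, doc))
    # Any other type (a scalar) falls through and returns None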

In [8]:
df = pd.json_normalize(disease_data, record_path='data')
df.columns = cols
df.head()

Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,YearStart,YearEnd,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,1,FF49C41F-CE8D-46C4-9164-653B1227CF6F,1,1527194521,959778,1527194521,959778,,2016,2016,...,59,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
1,2,F4468C3D-340A-4CD2-84A3-DF554DFF065E,2,1527194521,959778,1527194521,959778,,2016,2016,...,1,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
2,3,65609156-A343-4869-B03F-2BA62E96AC19,3,1527194521,959778,1527194521,959778,,2016,2016,...,2,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
3,4,0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3,4,1527194521,959778,1527194521,959778,,2016,2016,...,4,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
4,5,D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B,5,1527194521,959778,1527194521,959778,,2016,2016,...,5,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
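Because each record under `data` is a plain list rather than a dictionary, the same conversion can also be sketched with the `pd.DataFrame` constructor directly. The example below is an alternative shown for illustration, not the approach used in the cell above, and it runs on a hypothetical three-column miniature (`toy`) with invented values rather than on the real file.

```python
import pandas as pd

# Hypothetical miniature of the parsed file: column metadata plus row lists
toy = {
    "meta": {"view": {"columns": [
        {"name": "sid"}, {"name": "LocationAbbr"}, {"name": "DataValue"},
    ]}},
    "data": [[1, "AA", "10.0"], [2, "BB", None]],
}

col_names = [col["name"] for col in toy["meta"]["view"]["columns"]]
rows = toy["data"]

# Guard against a silent mismatch between the metadata and the rows
assert all(len(row) == len(col_names) for row in rows)

toy_df = pd.DataFrame(rows, columns=col_names)
print(toy_df.shape)
```

Passing `columns=` at construction time names the columns in one step and raises an error if the number of names does not match the row width.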


In [16]:
import numpy as np
# Replace None with NaN; assign the result, since applymap returns a new DataFrame
df = df.applymap(lambda x: np.nan if x is None else x)
df

Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,YearStart,YearEnd,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,1,FF49C41F-CE8D-46C4-9164-653B1227CF6F,1,1527194521,959778,1527194521,959778,,2016,2016,...,59,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
1,2,F4468C3D-340A-4CD2-84A3-DF554DFF065E,2,1527194521,959778,1527194521,959778,,2016,2016,...,01,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
2,3,65609156-A343-4869-B03F-2BA62E96AC19,3,1527194521,959778,1527194521,959778,,2016,2016,...,02,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
3,4,0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3,4,1527194521,959778,1527194521,959778,,2016,2016,...,04,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
4,5,D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B,5,1527194521,959778,1527194521,959778,,2016,2016,...,05,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60261,519150,1B28C1DD-B25F-457E-86E4-7D1463BE82C3,519150,1527194644,959778,1527194644,959778,,2016,2016,...,72,DIS,DIS1_0,CRDPREV,RACE,ASN,,,,
60262,519704,4FF6ADF8-CC4B-4D94-A5B0-7766346A0D3E,519704,1527194644,959778,1527194644,959778,,2016,2016,...,72,OVC,OVC3_1,CRDPREV,RACE,BLK,,,,
60263,519705,02896705-4A9F-45A2-A84B-923DEA6DC6A2,519705,1527194644,959778,1527194644,959778,,2016,2016,...,72,OVC,OVC3_1,CRDPREV,RACE,AIAN,,,,
60264,519706,4DF2E74C-5043-474B-9739-98B4D8736BDB,519706,1527194644,959778,1527194644,959778,,2016,2016,...,72,OVC,OVC3_1,CRDPREV,RACE,ASN,,,,


## Level-Up: Create a bar graph of states with the highest asthma rates for adults age 18+

In [26]:
# Count the rows for each asthma-related question
df[df.Topic == 'Asthma'].Question.value_counts()
# Keep only the rows for the adult prevalence question
question = 'Current asthma prevalence among adults aged >= 18 years'
view = df[df.Question == question][['LocationAbbr', 'StratificationCategoryID1']]
view.head()

Unnamed: 0,LocationAbbr,StratificationCategoryID1
4725,IL,GENDER
5529,IN,GENDER
5632,IA,GENDER
6777,KS,GENDER
7034,KY,GENDER
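The preview above shows several stratifications (for example `GENDER`) per location, so a ranking needs one row per location first. The sketch below is one possible way to get there, under some assumptions: that the `OVERALL` stratification and the `CRDPREV` value type seen in the earlier output are the rows of interest, and that `DataValue` may be stored as text. The helper `top_locations` and the `toy` frame with invented values are hypothetical.

```python
import pandas as pd

def top_locations(df, question, n=10):
    """Return the n locations with the highest DataValue for one question."""
    mask = (
        (df["Question"] == question)
        & (df["StratificationCategoryID1"] == "OVERALL")  # assumed filter
        & (df["DataValueTypeID"] == "CRDPREV")            # assumed filter
    )
    rows = df[mask].copy()
    # Values may be strings or missing; coerce anything non-numeric to NaN
    rows["DataValue"] = pd.to_numeric(rows["DataValue"], errors="coerce")
    return (
        rows.dropna(subset=["DataValue"])
            .set_index("LocationAbbr")["DataValue"]
            .sort_values(ascending=False)
            .head(n)
    )

# Hypothetical toy data with invented values, only to exercise the function
toy = pd.DataFrame({
    "LocationAbbr": ["AA", "AA", "BB", "CC"],
    "Question": ["Q1"] * 4,
    "StratificationCategoryID1": ["OVERALL", "GENDER", "OVERALL", "OVERALL"],
    "DataValueTypeID": ["CRDPREV"] * 4,
    "DataValue": ["9.5", "11.0", "12.1", None],
})
top = top_locations(toy, "Q1", n=2)
print(top)
```

On the real data, `top_locations(df, question)` followed by `.plot(kind='bar')` on the result would draw the chart, assuming matplotlib is installed.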


## Summary

Well done! In this lab you got some extended practice exploring the structure of JSON files, converting JSON files to pandas DataFrames, and visualizing data!