**Below are the steps we go through to load, view and visualize any JSON data.**  
  
**STEP 1: ADDING PACKAGES**  
  
**We import the Python packages we need.**  

In [2]:
import ijson                    # required to extract data from json format
import numpy  as np             # easy to play with arrays etc.
import pandas as pd             # required to load and read data and put in dataframe.
import matplotlib.pyplot as plt # required for data visualization purposes.
import seaborn as sns           # required for data visualization purposes.
import plotly.plotly as py      # required for data visualization purposes.
import plotly.graph_objs as go  # required for data visualization purposes.
from IPython.display import display, HTML
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

**STEP 2: READING IN A DATASET**  
  
**To read data from a .json file, we use ijson.items(), which yields every object found under a given prefix.**  
**Here the prefix is meta.view.columns.item, i.e. each entry of the column list.**

**Many JSON datasets use a similar structure, so the code looks much the same from one dataset to the next.**  

**Source of the data: https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-women-infant-and-child-dfe5d**  

In [3]:
filename = "data/healthdata.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)
column_names = [col["fieldName"] for col in columns]
print(column_names)

[u':sid', u':id', u':position', u':created_at', u':created_meta', u':updated_at', u':updated_meta', u':meta', u'yearstart', u'yearend', u'locationabbr', u'locationdesc', u'datasource', u'class', u'topic', u'question', u'data_value_unit', u'data_value_type', u'data_value', u'data_value_alt', u'data_value_footnote_symbol', u'data_value_footnote', u'low_confidence_limit', u'high_confidence_limit', u'sample_size', u'total', u'age_months', u'gender', u'race_ethnicity', u'geolocation', u'classid', u'topicid', u'questionid', u'datavaluetypeid', u'locationid', u'stratificationcategory1', u'stratification1', u'stratificationcategoryid1', u'stratificationid1']
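
To make the prefix easier to picture, here is a minimal sketch of the file layout using only the standard `json` module. The miniature document is made up for illustration (the real file has many more keys and columns); it only assumes that column descriptions sit under `meta.view.columns` and rows sit under `data`, which is what the prefixes used in this notebook imply.

```python
import json

# Hypothetical miniature of the file layout: column descriptions sit under
# meta.view.columns, and each row is a plain list under data.
sample = json.loads("""
{
  "meta": {"view": {"columns": [
      {"fieldName": ":sid"},
      {"fieldName": "yearstart"},
      {"fieldName": "locationabbr"}
  ]}},
  "data": [
      ["row-1", "2008", "AL"],
      ["row-2", "2008", "AK"]
  ]
}
""")

# The prefix 'meta.view.columns.item' addresses each element of this list.
columns = sample["meta"]["view"]["columns"]
column_names = [col["fieldName"] for col in columns]
print(column_names)

# The prefix 'data.item' addresses each element of this list (one row each).
rows = sample["data"]
print(rows[0])
```

`json.loads` reads the whole document into memory at once, whereas `ijson.items()` walks the same paths while streaming, which matters for large files.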


**STEP 3: DECIDING WHICH COLUMNS MATTER**  
  

**Several columns hold data we do not need, or are metadata fields whose names start with ':'. We make a list of the columns we want and discard the rest.**

In [4]:
final_columns=[u'yearstart', u'locationabbr', u'data_value', u'low_confidence_limit', u'high_confidence_limit', u'sample_size', u'stratificationcategory1', u'stratification1']

**STEP 4: EXTRACT DATA FROM EACH SELECTED COLUMN**  
  
**Once again we use ijson.items() to get the data for each column.**  
**This time the prefix is data.item, i.e. each row of the data array.**

In [5]:
data = []
with open(filename, 'r') as f:
    objects = ijson.items(f, 'data.item')
    for row in objects:
        selected_row = []
        for item in final_columns:
            selected_row.append(row[column_names.index(item)])
        data.append(selected_row)
print(data[9])

[u'2008', u'AL', u'25.0', u'19.3', u'30.7', u'228', u'Race/Ethnicity', u'American Indian/Alaska Native']
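
The loop above calls `column_names.index(item)` for every field of every row. One possible variation, sketched below with made-up sample rows, resolves each wanted column to its position once and reuses those positions. The helper name `select` and the sample values are hypothetical.

```python
# Illustrative stand-ins for the real lists built in the earlier steps.
column_names = [":sid", "yearstart", "locationabbr", "data_value"]
final_columns = ["yearstart", "data_value"]

# Resolve each wanted column to its position once, up front.
positions = [column_names.index(name) for name in final_columns]

def select(row):
    """Return only the wanted fields of one raw row, in final_columns order."""
    return [row[pos] for pos in positions]

# Made-up raw rows; a missing value arrives as None.
raw_rows = [
    ["row-1", "2008", "AL", "15.4"],
    ["row-2", "2008", "AL", None],
]
data = [select(row) for row in raw_rows]
print(data)
```

The result has the same shape as the `data` list built by the notebook, so it can be passed to `pd.DataFrame()` in the same way.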


**STEP 5: PUTTING THE DATA IN A PANDAS DATAFRAME**  
  

**We use pd.DataFrame() to put the data we extracted into a pandas df format**

In [6]:
data = pd.DataFrame(data, columns=final_columns)
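
As a small illustration of what this call does, the sketch below builds a DataFrame from a list of rows and a matching list of column names. The two rows are invented; only the pattern mirrors the step above.

```python
import pandas as pd

# Each inner list is one row; the columns argument names the positions.
rows = [
    ["2008", "AL", "15.4"],
    ["2008", "AL", None],   # a missing value stays missing in the frame
]
df = pd.DataFrame(rows, columns=["yearstart", "locationabbr", "data_value"])

print(df.shape)                        # (number of rows, number of columns)
print(df["data_value"].isnull().sum()) # how many values are missing
```

The number of names in `columns` has to match the length of each row, which is why `final_columns` is reused here.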

**STEP 6: VIEWING THE FIRST FEW ROWS**  
  

**To see the first few rows of the data and make sure we read it in correctly, we use .head()**


In [7]:
data.head(15)

| | yearstart | locationabbr | data_value | low_confidence_limit | high_confidence_limit | sample_size | stratificationcategory1 | stratification1 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2008 | AL | 15.4 | 15.1 | 15.8 | 43287 | Total | Total |
| 1 | 2008 | AL | 15.5 | 15.0 | 16.0 | 21912 | Gender | Male |
| 2 | 2008 | AL | 15.3 | 14.9 | 15.8 | 21375 | Gender | Female |
| 3 | 2008 | AL | 15.3 | 14.7 | 15.8 | 18219 | Age (months) | 24 - 35 |
| 4 | 2008 | AL | 14.9 | 14.4 | 15.5 | 14796 | Age (months) | 36 - 47 |
| 5 | 2008 | AL | 16.4 | 15.6 | 17.1 | 10272 | Age (months) | 48 - 59 |
| 6 | 2008 | AL | 15.8 | 15.2 | 16.3 | 17833 | Race/Ethnicity | Non-Hispanic White |
| 7 | 2008 | AL | 13.9 | 13.4 | 14.4 | 19170 | Race/Ethnicity | Non-Hispanic Black |
| 8 | 2008 | AL | 19.3 | 18.3 | 20.3 | 5731 | Race/Ethnicity | Hispanic |
| 9 | 2008 | AL | 25.0 | 19.3 | 30.7 | 228 | Race/Ethnicity | American Indian/Alaska Native |


**STEP 7: GET BASIC INFORMATION**  
  
**To get basic info from the dataset, we use .info()**

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7344 entries, 0 to 7343
Data columns (total 8 columns):
yearstart                  7344 non-null object
locationabbr               7344 non-null object
data_value                 7153 non-null object
low_confidence_limit       7153 non-null object
high_confidence_limit      7153 non-null object
sample_size                7153 non-null object
stratificationcategory1    7344 non-null object
stratification1            7344 non-null object
dtypes: object(8)
memory usage: 459.1+ KB


**STEP 8: SEE FURTHER DETAILS**  
  
**To get datatypes of each column, we can use .dtypes**  

**To get more details about each column, we can use .describe()**  
  
Because every column is still stored as strings (dtype object), .describe() only reports count, unique, top and freq rather than numeric statistics such as mean or std.
We deal with this later while cleaning.

In [9]:
print(data.dtypes)
print("\n")
display(HTML(data.describe().to_html()))

yearstart                  object
locationabbr               object
data_value                 object
low_confidence_limit       object
high_confidence_limit      object
sample_size                object
stratificationcategory1    object
stratification1            object
dtype: object




| | yearstart | locationabbr | data_value | low_confidence_limit | high_confidence_limit | sample_size | stratificationcategory1 | stratification1 |
|---|---|---|---|---|---|---|---|---|
| count | 7344 | 7344 | 7153.0 | 7153.0 | 7153.0 | 7153 | 7344 | 7344 |
| unique | 4 | 54 | 247.0 | 251.0 | 284.0 | 4190 | 4 | 15 |
| top | 2008 | AZ | 16.0 | 15.7 | 16.3 | 72 | Race/Ethnicity | Non-Hispanic White |
| freq | 1836 | 136 | 105.0 | 93.0 | 117.0 | 12 | 3240 | 648 |
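
The difference between describing text and describing numbers can be seen on a tiny example. The sketch below uses invented values and forces `dtype=object` so that the column behaves like the string columns in this notebook.

```python
import pandas as pd

# Invented values stored as strings, with one missing entry.
df = pd.DataFrame({"data_value": ["15.4", "15.5", "15.4", None]}, dtype=object)

# Stored as text: describe() can only count the values and find the mode.
as_text = df["data_value"].describe()
print(as_text)

# Converted to float: describe() reports mean, std, min, quartiles and max.
as_number = df["data_value"].astype(float).describe()
print(as_number)
```

In both cases `count` excludes the missing entry, which is why it can be lower than the number of rows.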


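**STEP 9: COUNT NUMBER OF EMPTY VALUES IN EACH COLUMN**  
   
**We can check the number of null values a column has by using .isnull().sum()**  
  
**For example, here data_value, low_confidence_limit, high_confidence_limit and sample_size each have 191 null values, while the other columns have none.**  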

In [10]:
print(data.isnull().sum())

yearstart                    0
locationabbr                 0
data_value                 191
low_confidence_limit       191
high_confidence_limit      191
sample_size                191
stratificationcategory1      0
stratification1              0
dtype: int64
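
Beyond counting, it is often useful to look at the rows that carry the missing values. A minimal sketch on invented data, assuming missing entries arrive as `None` as they do in this notebook:

```python
import pandas as pd

# Invented frame with two missing data_value entries.
df = pd.DataFrame(
    {"locationabbr": ["AL", "AL", "AK"], "data_value": ["15.4", None, None]},
    dtype=object,
)

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)

# The rows where data_value is missing, selected with a boolean mask.
missing_rows = df[df["data_value"].isnull()]
print(missing_rows)
```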


**STEP 10: SEE NUMBER OF UNIQUE VALUES IN EACH COLUMN**  
  
**It is useful to see the number of unique values in each column using .nunique()**  
  
**Here we see that stratificationcategory1 (4 values), stratification1 (15) and locationabbr (54) have a manageable number of unique values, so we can group by these columns and make good visualizations.**  

In [11]:
print(data.nunique())

yearstart                     4
locationabbr                 54
data_value                  247
low_confidence_limit        251
high_confidence_limit       284
sample_size                4190
stratificationcategory1       4
stratification1              15
dtype: int64
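
To illustrate why a column with few unique values makes a good grouping key, here is a small sketch on invented data. It assumes `data_value` has already been converted to float, which in this notebook only happens in the cleaning step.

```python
import pandas as pd

# Invented rows; data_value is assumed to be numeric already.
df = pd.DataFrame({
    "stratificationcategory1": ["Gender", "Gender", "Total", "Total"],
    "data_value": [15.5, 15.3, 15.4, 16.0],
})

# Few unique values in a column means few, well-populated groups.
unique_counts = df.nunique()
print(unique_counts)

# Average data_value within each group.
means = df.groupby("stratificationcategory1")["data_value"].mean()
print(means)
```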


**STEP 11: NUMBER OF OCCURRENCES OF EACH VALUE IN A COLUMN**  
  
**A good way to inspect a column you wish to group by is to use .value_counts()**  
  
**It gives a clear picture of how many rows would fall in each group.** 

In [14]:
stratifications = data.stratificationcategory1.value_counts()
print(stratifications)
stratifications1 = data.stratification1.value_counts()
print(stratifications1)
locations = data.locationabbr.value_counts()
print(locations.head(15))

Race/Ethnicity    3240
Age (months)      2160
Gender            1296
Total              648
Name: stratificationcategory1, dtype: int64
Non-Hispanic White               648
Hispanic                         648
Male                             648
Total                            648
Asian/Pacific Islander           648
American Indian/Alaska Native    648
Female                           648
Non-Hispanic Black               648
36 - 47                          432
48 - 59                          432
24 - 35                          432
12 - 17                          216
6 - 11                           216
18 - 23                          216
3 - 5                            216
Name: stratification1, dtype: int64
AZ    136
MI    136
VI    136
VA    136
LA    136
SD    136
SC    136
CT    136
CA    136
CO    136
FL    136
MS    136
MT    136
MN    136
AR    136
Name: locationabbr, dtype: int64
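
When group sizes differ, relative shares can be easier to read than raw counts. The sketch below, on invented values, uses the `normalize=True` option of `.value_counts()` for that.

```python
import pandas as pd

# Invented category labels.
s = pd.Series([
    "Gender", "Gender", "Total",
    "Age (months)", "Age (months)", "Age (months)",
])

# Raw counts, sorted from most to least frequent.
counts = s.value_counts()
print(counts)

# The same information as fractions that sum to 1.
shares = s.value_counts(normalize=True)
print(shares)
```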


**STEP 12: CLEANING DATA**  
  
**To see more from the data it has to be cleaned. Cleaning data is usually unique to each dataset.**  

**In this instance, we convert the text columns to the category dtype and the remaining columns to float.**

In [15]:
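categorical_columns = ['locationabbr', 'stratificationcategory1', 'stratification1']
for i in data.columns:
    if i in categorical_columns:
        # text columns: trim whitespace and store as categories
        data[i] = data[i].str.strip().astype('category')
    else:
        # remaining columns hold numeric strings: convert to float
        data[i] = data[i].astype(float)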
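
The conversion above works because every non-missing value in the numeric columns parses as a number. For datasets where that is not guaranteed, one possible alternative (not used in this notebook) is `pd.to_numeric` with `errors="coerce"`, sketched here on invented values.

```python
import pandas as pd

# Invented raw values: two valid numbers, one unparseable entry, one missing.
raw = pd.Series(["15.4", "16.0", "n/a", None], dtype=object)

# astype(float) would raise ValueError on "n/a"; to_numeric with
# errors="coerce" turns unparseable entries into NaN instead.
clean = pd.to_numeric(raw, errors="coerce")
print(clean)

# Text labels: strip stray whitespace first so " AL" and "AL " become one category.
labels = pd.Series([" AL", "AL ", "AK"], dtype=object).str.strip().astype("category")
print(labels.cat.categories)
```

Coercing silently hides bad values, so it is worth comparing `.isnull().sum()` before and after to see how many entries were affected.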

**Now, after cleaning, let's look at the new data types and the mean, std, min, max etc. of the numeric columns.**

In [16]:
print(data.dtypes)
print("\n")
display(HTML(data.describe().to_html()))

yearstart                   float64
locationabbr               category
data_value                  float64
low_confidence_limit        float64
high_confidence_limit       float64
sample_size                 float64
stratificationcategory1    category
stratification1            category
dtype: object




| | yearstart | data_value | low_confidence_limit | high_confidence_limit | sample_size |
|---|---|---|---|---|---|
| count | 7344.0 | 7153.0 | 7153.0 | 7153.0 | 7153.0 |
| mean | 2011.0 | 14.24949 | 12.872571 | 15.627387 | 19027.823571 |
| std | 2.23622 | 3.737871 | 4.065295 | 4.057627 | 41023.632114 |
| min | 2008.0 | 1.5 | 0.0 | 3.7 | 50.0 |
| 25% | 2009.5 | 11.9 | 10.5 | 13.1 | 1816.0 |
| 50% | 2011.0 | 14.4 | 13.3 | 15.5 | 7968.0 |
| 75% | 2012.5 | 16.5 | 15.6 | 17.6 | 20861.0 |
| max | 2014.0 | 36.1 | 34.5 | 37.7 | 620016.0 |
