<a href="https://codeimmersives.com"><img src = "https://www.codeimmersives.com/wp-content/uploads/2019/09/CodeImmersives_Logo_RGB_NYC_BW.png" width = 400> </a>


<h1 align=center><font size = 5>Agenda</font></h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">

1.  [Review -json](#0)<br>
2.  [json normalize](#2)<br>    
3.  [Exercise](#3)<br> 
</div>
<hr>

<h2>Review</h2>
json

A useful tool for dealing with JSON files is [here](https://www.freeformatter.com/json-escape.html#ad-output).
<code>
https://www.freeformatter.com/json-escape.html#ad-output
</code>
Use it to escape or unescape JSON text if you're having trouble parsing it.

A better tool for dealing with JSON files is [here](https://jsonformatter.org/json-pretty-print).
<code>
https://jsonformatter.org/json-pretty-print
</code>
Use it to pretty-print JSON text so the nesting is easier to read before you parse it.
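
The same kind of clean-up can be sketched locally with Python's standard `json` module. The one-line sample string below is made up for illustration and is not part of the course data.

```python
import json

# Hypothetical one-line JSON string, used only for illustration
raw = '{"name": "Luke Skywalker", "height": 1.72}'

parsed = json.loads(raw)               # raises json.JSONDecodeError if the text is not valid JSON
pretty = json.dumps(parsed, indent=2)  # re-serialize with 2-space indentation
print(pretty)
```

If `json.loads` fails, the error message includes the line and column where parsing stopped, which is often enough to spot a stray quote or comma.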

<h2>More complex json file formats</h2>
You will have to parse more complicated JSON structures when dealing<br>
with output from a GraphQL query. To tackle this we can <br>
use pandas and its json_normalize function.<br>
For example, suppose you are getting movie data from one of the movie<br>
data APIs (e.g. www.imdb.com, www.tmdb.com).  The response <br>
might look like the following:<br>
    <code>
    {
  "data": {
    "human": {
      "name": "Luke Skywalker",
      "height": 1.72
    }
  }
}
    </code>

In [7]:
import json

movie_info = """{
  "data": {
    "human": {
      "name": "Luke Skywalker",
      "height": 1.72
    }
  }
}"""
print(movie_info)
print(type(movie_info))
res = json.loads(movie_info)
print(res)
print(type(res))   # <--- Returns a dictionary
print('*'*35)

{
  "data": {
    "human": {
      "name": "Luke Skywalker",
      "height": 1.72
    }
  }
}
<class 'str'>
{'data': {'human': {'name': 'Luke Skywalker', 'height': 1.72}}}
<class 'dict'>
***********************************


<h2>Exercise</h2>
How would we extract the name and height values from this<br>
dictionary?<br>
With plain Python, the way to do it is to index into each layer of the data structure in turn<br>

In [None]:
# Place your code here



data <class 'dict'>
{'name': 'Luke Skywalker', 'height': 1.72} <class 'dict'>
name = Luke Skywalker
height = 1.72
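
One possible way to produce the output above is sketched here. It is only one approach: it indexes into the dictionary one layer at a time, reusing the `movie_info` string from the earlier cell.

```python
import json

movie_info = """{
  "data": {
    "human": {
      "name": "Luke Skywalker",
      "height": 1.72
    }
  }
}"""

res = json.loads(movie_info)

# Layer 1: the top-level key and the type of its value
for key, value in res.items():
    print(key, type(value))

# Layer 2: the dictionary stored under data -> human
human = res["data"]["human"]
print(human, type(human))

# Layer 3: the individual values
name = human["name"]
height = human["height"]
print("name =", name)
print("height =", height)
```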


<h2>pandas to the rescue!</h2>
We can use the <b>json_normalize</b> method to flatten the json data<br>
so that we can get to the information we want.

In [11]:
import pandas as pd

res = json.loads(movie_info)
df = pd.json_normalize(res)
print(df)

  data.human.name  data.human.height
0  Luke Skywalker               1.72


To join the levels of each column name with an underscore instead of a period,<br>
pass the optional kwarg sep = '_'

In [17]:
import pandas as pd

res = json.loads(movie_info)
df = pd.json_normalize(res,sep="_")
print(df)

  data_human_name  data_human_height
0  Luke Skywalker               1.72


<h2>Exercise</h2>
Flatten the following json data and replace the periods with <br>
underscores for the column names:<br>
<code>
data = [{'id': 1,
        'name': "Ralph Reed",'fitness': {'height': 70, 'weight': 200}},
        {'name': "Ayn Rand",'fitness': {'height': 66, 'weight': 140}},
    {'id': 2, 'name': 'Rachel Baker','fitness': 
    {'height': 62, 'weight': 120}}]
</code>

<br>
<br>
<br>





<b>The key to the solution is replacing the single quotes with double quotes, because JSON only accepts double-quoted strings!</b>

In [20]:
import json
import pandas as pd

data = """[{'id': 1,
        'name': "Ralph Reed",'fitness': {'height': 70, 'weight': 200}},
        {'name': "Ayn Rand",'fitness': {'height': 66, 'weight': 140}},
    {'id': 2, 'name': 'Rachel Baker','fitness': 
    {'height': 62, 'weight': 120}}]"""



    id          name  fitness_height  fitness_weight
0  1.0    Ralph Reed              70             200
1  NaN      Ayn Rand              66             140
2  2.0  Rachel Baker              62             120
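
A minimal sketch of one way to get the table above, assuming the blanket quote replacement is acceptable for this particular data:

```python
import json
import pandas as pd

data = """[{'id': 1,
        'name': "Ralph Reed",'fitness': {'height': 70, 'weight': 200}},
        {'name': "Ayn Rand",'fitness': {'height': 66, 'weight': 140}},
    {'id': 2, 'name': 'Rachel Baker','fitness': 
    {'height': 62, 'weight': 120}}]"""

# JSON requires double quotes. This blanket replace is safe here only because
# none of the values contain an apostrophe.
data = data.replace("'", '"')
res = json.loads(data)

# sep="_" joins the nested keys as fitness_height instead of fitness.height
df = pd.json_normalize(res, sep="_")
print(df)
```

The `id` column comes out as floats because the second record has no `id`, so pandas fills the gap with `NaN`. If the text is really a Python literal rather than JSON, `ast.literal_eval` from the standard library is another option that avoids touching the quotes.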


We can use the max_level kwarg to flatten the data only down to a <br>
given depth.  Any deeper levels remain as dictionaries

In [23]:
import json
import pandas as pd

data = """[{'id': 1,
        'name': "Ralph Reed",'fitness': {'height': 70, 'weight': 200}},
        {'name': "Ayn Rand",'fitness': {'height': 66, 'weight': 140}},
    {'id': 2, 'name': 'Rachel Baker','fitness': 
    {'height': 62, 'weight': 120}}]"""



    id          name                        fitness
0  1.0    Ralph Reed  {'height': 70, 'weight': 200}
1  NaN      Ayn Rand  {'height': 66, 'weight': 140}
2  2.0  Rachel Baker  {'height': 62, 'weight': 120}
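
As a sketch, the table above can be reproduced with `max_level=0`. The records are written directly as Python objects here, so no quote replacement is needed.

```python
import pandas as pd

# Same records as in the exercise, written as Python dictionaries
data = [
    {"id": 1, "name": "Ralph Reed", "fitness": {"height": 70, "weight": 200}},
    {"name": "Ayn Rand", "fitness": {"height": 66, "weight": 140}},
    {"id": 2, "name": "Rachel Baker", "fitness": {"height": 62, "weight": 120}},
]

# max_level=0 keeps every nested dictionary as a single column value
df = pd.json_normalize(data, sep="_", max_level=0)
print(df)
```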


If the data has embedded lists that contain other dictionaries,<br>
we have to use another approach to extract the data.<br>
In the example below the output does not change beyond max_level = 1, because max_level only expands nested dictionaries, not lists

In [30]:
import json
import pandas as pd

data = """[{'state': 'Florida',
    'shortname': 'FL',
    'info': {'governor': 'Rick Scott'},
    'counties': [
        {'name': 'Dade', 'population': 12345},
        {'name': 'Broward', 'population': 40000},
        {'name': 'Palm Beach', 'population': 60000}]},
    {'state': 'Ohio',
    'shortname': 'OH',
    'info': {'governor': 'John Kasich'},
    'counties': [
        {'name': 'Summit', 'population': 1234},
        {'name': 'Cuyahoga', 'population': 1337}]},
    {'state': 'New York',
    'shortname': 'NY',
    'info': {'governor': 'Andrew Cuuomo'},
    'counties': [
        {'name': 'Kings', 'population': 3200},
        {'name': 'New York', 'population': 2700}]}       
       ]"""
data = data.replace("'",'"')    # <--- Remember double quotes only
res = json.loads(data)
df = pd.json_normalize(res, sep = "_", max_level = 0)  # Flatten to level 0
print(df)
print('='*35)
df = pd.json_normalize(res, sep = "_", max_level = 1)  # Flatten to level 1
print(df)
print('='*35)
df = pd.json_normalize(res, sep = "_", max_level = 4)  # Flatten to level 4
print(df)

      state shortname                           info  \
0   Florida        FL     {'governor': 'Rick Scott'}   
1      Ohio        OH    {'governor': 'John Kasich'}   
2  New York        NY  {'governor': 'Andrew Cuuomo'}   

                                            counties  
0  [{'name': 'Dade', 'population': 12345}, {'name...  
1  [{'name': 'Summit', 'population': 1234}, {'nam...  
2  [{'name': 'Kings', 'population': 3200}, {'name...  
      state shortname                                           counties  \
0   Florida        FL  [{'name': 'Dade', 'population': 12345}, {'name...   
1      Ohio        OH  [{'name': 'Summit', 'population': 1234}, {'nam...   
2  New York        NY  [{'name': 'Kings', 'population': 3200}, {'name...   

   info_governor  
0     Rick Scott  
1    John Kasich  
2  Andrew Cuuomo  
      state shortname                                           counties  \
0   Florida        FL  [{'name': 'Dade', 'population': 12345}, {'name...   
1      Ohio        OH 

Let's examine the solution below<br>
1 - The first-level keys are state, shortname, info and counties<br>
2 - The second level sits under 'info', a dictionary with the key 'governor'<br>
3 - 'counties' holds a list of embedded dictionaries, so it is passed as the record path<br>
NOTE: the remaining columns are passed as a list (the meta argument); a nested key such as info -> governor is itself wrapped in a list<br>

In [69]:
import json
import pandas as pd

data = """[
    {'state': 'Florida',
    'shortname': 'FL',
        'info': {'governor': 'Rick Scott'},
        'counties': [
            {'name': 'Dade', 'population': 12345},
            {'name': 'Broward', 'population': 40000},
            {'name': 'Palm Beach', 'population': 60000}]},
    {'state': 'Ohio',
    'shortname': 'OH',
        'info': {'governor': 'John Kasich'},
        'counties': [
            {'name': 'Summit', 'population': 1234},
            {'name': 'Cuyahoga', 'population': 1337}]},
    {'state': 'New York',
    'shortname': 'NY',
        'info': {'governor': 'Andrew Cuuomo'},
        'counties': [
            {'name': 'Kings', 'population': 3200},
            {'name': 'New York', 'population': 2700}]}       
       ]"""
data = data.replace("'",'"')    # <--- Remember double quotes only
res = json.loads(data)
df = pd.json_normalize(res, 'counties',
                       ['state', 'shortname', ['info', 'governor']])  # One row per county, plus the state-level columns
print(df)
print('='*35)

# print(res)
df = pd.json_normalize(res, 'counties', ['state', 'shortname'])  # Same records, without the governor column
print(df)
print('='*35)

         name  population     state shortname  info.governor
0        Dade       12345   Florida        FL     Rick Scott
1     Broward       40000   Florida        FL     Rick Scott
2  Palm Beach       60000   Florida        FL     Rick Scott
3      Summit        1234      Ohio        OH    John Kasich
4    Cuyahoga        1337      Ohio        OH    John Kasich
5       Kings        3200  New York        NY  Andrew Cuuomo
6    New York        2700  New York        NY  Andrew Cuuomo
         name  population     state shortname
0        Dade       12345   Florida        FL
1     Broward       40000   Florida        FL
2  Palm Beach       60000   Florida        FL
3      Summit        1234      Ohio        OH
4    Cuyahoga        1337      Ohio        OH
5       Kings        3200  New York        NY
6    New York        2700  New York        NY
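
The positional arguments in the cell above correspond to the `record_path` and `meta` parameters. The sketch below spells them out by name on a trimmed copy of the state records, and also passes `sep="_"` so the nested meta key becomes `info_governor` rather than `info.governor`.

```python
import pandas as pd

# Trimmed copy of the state records used above
states = [
    {"state": "Florida", "shortname": "FL",
     "info": {"governor": "Rick Scott"},
     "counties": [{"name": "Dade", "population": 12345},
                  {"name": "Broward", "population": 40000}]},
    {"state": "Ohio", "shortname": "OH",
     "info": {"governor": "John Kasich"},
     "counties": [{"name": "Summit", "population": 1234}]},
]

df = pd.json_normalize(
    states,
    record_path="counties",                              # one row per county
    meta=["state", "shortname", ["info", "governor"]],   # repeated on every county row
    sep="_",                                             # joins nested meta keys
)
print(df)
```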


In [134]:
import json
import pandas as pd

data = """{
  "data": {
    "Comparison": {
      "name": "Luke Skywalker",
      "appearsIn": [
        "NEWHOPE",
        "EMPIRE",
        "JEDI"
      ],
      "friends": [
        {
          "name": "Han Solo"
        },
        {
          "name": "Leia Organa"
        },
        {
          "name": "C-3PO"
        },
        {
          "name": "R2-D2"
        }
      ]
    },
    "Comparison1": {
      "name": "R2-D2",
      "appearsIn": [
        "NEWHOPE",
        "EMPIRE",
        "JEDI"
      ],
      "friends": [
        {
          "name": "Luke Skywalker"
        },
        {
          "name": "Han Solo"
        },
        {
          "name": "Leia Organa"
        }
      ]
    }
  }
}"""

res = json.loads(data)
print(res['data'])
df = pd.json_normalize(res['data'])  # Expand every nested dictionary; lists are left as they are
print(df)
print('='*35)
df = pd.json_normalize(res['data'],max_level = 1, sep = '_')  # Expand dictionaries down to level 1, joining names with '_'
print(df)
print('='*70)
print(type(res['data']['Comparison'].values()))
df = pd.json_normalize(res['data']['Comparison'],'appearsIn',['name'],sep = '_')  
print(df)
print('='*70)

{'Comparison': {'name': 'Luke Skywalker', 'appearsIn': ['NEWHOPE', 'EMPIRE', 'JEDI'], 'friends': [{'name': 'Han Solo'}, {'name': 'Leia Organa'}, {'name': 'C-3PO'}, {'name': 'R2-D2'}]}, 'Comparison1': {'name': 'R2-D2', 'appearsIn': ['NEWHOPE', 'EMPIRE', 'JEDI'], 'friends': [{'name': 'Luke Skywalker'}, {'name': 'Han Solo'}, {'name': 'Leia Organa'}]}}
  Comparison.name     Comparison.appearsIn  \
0  Luke Skywalker  [NEWHOPE, EMPIRE, JEDI]   

                                  Comparison.friends Comparison1.name  \
0  [{'name': 'Han Solo'}, {'name': 'Leia Organa'}...            R2-D2   

     Comparison1.appearsIn                                Comparison1.friends  
0  [NEWHOPE, EMPIRE, JEDI]  [{'name': 'Luke Skywalker'}, {'name': 'Han Sol...  
  Comparison_name     Comparison_appearsIn  \
0  Luke Skywalker  [NEWHOPE, EMPIRE, JEDI]   

                                  Comparison_friends Comparison1_name  \
0  [{'name': 'Han Solo'}, {'name': 'Leia Organa'}...            R2-D2   

     Comp
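
The `friends` list in the same record can be flattened in a similar way. The sketch below works on a trimmed copy of the "Comparison" record and assumes we want the parent's name next to each friend. Both levels have a `name` key, which pandas rejects as a naming conflict, so the record columns are given a prefix through the `record_prefix` parameter.

```python
import pandas as pd

# Trimmed copy of the "Comparison" record from the cell above
comparison = {
    "name": "Luke Skywalker",
    "appearsIn": ["NEWHOPE", "EMPIRE", "JEDI"],
    "friends": [{"name": "Han Solo"}, {"name": "Leia Organa"},
                {"name": "C-3PO"}, {"name": "R2-D2"}],
}

df = pd.json_normalize(
    comparison,
    record_path="friends",     # one row per friend
    meta=["name"],             # the parent's name, repeated on every row
    record_prefix="friend_",   # friend_name keeps the two "name" columns apart
)
print(df)
```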

<h2>Exercise</h2>
Use the file nyc_phil.txt<br>
Create a dataframe from the useful parts of the JSON file<br>
step 1 - Go to https://jsonformatter.org/json-pretty-print, paste the data and<br>
click the 'make pretty' button<br>
step 2 - Click on the 'tree view' on the right-hand side of the window.<br>
step 3 - Explore the levels<br>
step 4 - For the parameters, use the kwarg record_path = 'path'<br>
In our case, explore 'works' and 'concerts' individually<br>
step 5 - To include the flat data that sits above the records, use the kwarg meta = [col1, col2, ...]<br>
<code>
meta=['id', 'orchestra','programID', 'season']
</code>


In [130]:
import json 
import pandas as pd 




Unnamed: 0,workTitle,conductorName,ID,soloists,composerName,movement,interval,movement.em,movement._,id,orchestra,programID,season
0,"SYMPHONY NO. 5 IN C MINOR, OP.67","Hill, Ureli Corelli",52446*,[],"Beethoven, Ludwig van",,,,,38e072a7-8fc9-4f9a-8eac-3957905c0002,New York Philharmonic,3853,1842-43
1,OBERON,"Timm, Henry C.",8834*4,"[{'soloistName': 'Otto, Antoinette', 'soloistR...","Weber, Carl Maria Von","""Ozean, du Ungeheuer"" (Ocean, thou mighty mons...",,,,38e072a7-8fc9-4f9a-8eac-3957905c0002,New York Philharmonic,3853,1842-43
2,"QUINTET, PIANO, D MINOR, OP. 74",,3642*,"[{'soloistName': 'Scharfenberg, William', 'sol...","Hummel, Johann",,,,,38e072a7-8fc9-4f9a-8eac-3957905c0002,New York Philharmonic,3853,1842-43
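
A minimal sketch of the works-level step, run on a hand-made miniature instead of the real file. The dictionary `d` below is an assumption about the layout (a top-level `programs` list whose items carry `works`), with a few values copied from the output above. The real nyc_phil.txt has many more programs and fields.

```python
import pandas as pd

# Hypothetical miniature of the file's layout, for illustration only
d = {
    "programs": [
        {
            "id": "38e072a7-8fc9-4f9a-8eac-3957905c0002",
            "orchestra": "New York Philharmonic",
            "programID": "3853",
            "season": "1842-43",
            "works": [
                {"workTitle": "SYMPHONY NO. 5 IN C MINOR, OP.67",
                 "composerName": "Beethoven, Ludwig van",
                 "soloists": []},
                {"workTitle": "OBERON",
                 "composerName": "Weber, Carl Maria Von",
                 "soloists": [{"soloistName": "Otto, Antoinette",
                               "soloistRoles": "S",
                               "soloistInstrument": "Soprano"}]},
            ],
        }
    ]
}

# One row per work, with the program-level fields repeated on each row
works_data = pd.json_normalize(
    data=d["programs"],
    record_path="works",
    meta=["id", "orchestra", "programID", "season"],
)
print(works_data.head(3))
```

The `soloists` column still holds lists at this point, which is why a deeper record path is needed to flatten it.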


In [131]:
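# d is assumed to be the dictionary loaded from nyc_phil.txt in the previous cell (e.g. with json.load)
soloist_data = pd.json_normalize(data=d['programs'], record_path=['works', 'soloists'],
                                 meta=['id'])
soloist_data.head(3)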

Unnamed: 0,soloistName,soloistRoles,soloistInstrument,id
0,"Otto, Antoinette",S,Soprano,38e072a7-8fc9-4f9a-8eac-3957905c0002
1,"Scharfenberg, William",A,Piano,38e072a7-8fc9-4f9a-8eac-3957905c0002
2,"Hill, Ureli Corelli",A,Violin,38e072a7-8fc9-4f9a-8eac-3957905c0002


Unnamed: 0,Date,eventType,Venue,Location,Time,id
0,1842-12-07T05:00:00Z,Subscription Season,Apollo Rooms,"Manhattan, NY",8:00PM,38e072a7-8fc9-4f9a-8eac-3957905c0002
1,1843-02-18T05:00:00Z,Subscription Season,Apollo Rooms,"Manhattan, NY",8:00PM,c7b2b95c-5e0b-431c-a340-5b37fc860b34
2,1843-04-07T05:00:00Z,Special,Apollo Rooms,"Manhattan, NY",8:00PM,894e1a52-1ae5-4fa7-aec0-b99997555a37
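
The concert-level table above could be built along the same lines. This is a sketch on a hand-made miniature: the dictionary `d` is a hypothetical stand-in for the loaded file, with values copied from the first two rows of the output.

```python
import pandas as pd

# Hypothetical miniature of the file's layout, for illustration only
d = {
    "programs": [
        {"id": "38e072a7-8fc9-4f9a-8eac-3957905c0002",
         "concerts": [{"Date": "1842-12-07T05:00:00Z",
                       "eventType": "Subscription Season",
                       "Venue": "Apollo Rooms",
                       "Location": "Manhattan, NY",
                       "Time": "8:00PM"}]},
        {"id": "c7b2b95c-5e0b-431c-a340-5b37fc860b34",
         "concerts": [{"Date": "1843-02-18T05:00:00Z",
                       "eventType": "Subscription Season",
                       "Venue": "Apollo Rooms",
                       "Location": "Manhattan, NY",
                       "Time": "8:00PM"}]},
    ]
}

# One row per concert, tagged with the id of the program it belongs to
concert_data = pd.json_normalize(
    data=d["programs"],
    record_path="concerts",
    meta=["id"],
)
print(concert_data.head(3))
```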


The code block labelled 'concert level data' is done a different way below

                     Date            eventType             Venue  \
0    1842-12-07T05:00:00Z  Subscription Season      Apollo Rooms   
1    1843-02-18T05:00:00Z  Subscription Season      Apollo Rooms   
2    1843-04-07T05:00:00Z              Special      Apollo Rooms   
3    1843-04-22T05:00:00Z  Subscription Season      Apollo Rooms   
4    1843-11-18T05:00:00Z  Subscription Season      Apollo Rooms   
..                    ...                  ...               ...   
116  1867-12-21T05:00:00Z  Subscription Season  Academy of Music   
117  1868-02-01T05:00:00Z  Subscription Season  Academy of Music   
118  1868-03-07T05:00:00Z  Subscription Season  Academy of Music   
119  1868-04-18T05:00:00Z  Subscription Season  Academy of Music   
120  1868-11-28T05:00:00Z  Subscription Season  Academy of Music   

          Location    Time  
0    Manhattan, NY  8:00PM  
1    Manhattan, NY  8:00PM  
2    Manhattan, NY  8:00PM  
3    Manhattan, NY  8:00PM  
4    Manhattan, NY    None  
..       

This notebook is part of a course at www.codeimmersives.com called **Python Flask and Django**. If you accessed this notebook outside the course, 
you can get more information about this course online by clicking [here](https://www.codeimmersives.com/programs/python-aws/).

<hr>

Copyright &copy; 2021