# Pandas, JSON, and Joining Data

Reading and writing simple formats with Pandas is pretty easy: `pd.read_csv()` and `pd.read_excel()` handle most of what you need.  When you're dealing with semi-structured data like JSON documents, you need to do some of the work on your own to convert the JSON from a hierarchical format to a rectangular one before getting it into Pandas and doing the rest of your analysis.

Here, we'll work through a JSON-to-Pandas example that requires some manual coding to accomplish.

## 1. Provider Reference Data

Let's take this format as an example:

```json
{
    "provider_groups": [{
        "npi": [1110000001],
        "tin": {
            "type": "ein",
            "value": "11-1111111"
        }
    },{
        "npi": [2220000001, 2220000002],
        "tin": {
            "type": "ein",
            "value": "22-2222222"
        }
    }],
    "version": "1.4.0"
}
```

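It's not too complicated, but it does have some nested lists and hierarchies.  (The `provider_sample.json` file read below also contains a third group, EIN 33-3333333, which is left out of this excerpt.)  We've got: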
* Provider group with EIN 11-1111111
  * Has 1 provider with NPI 1110000001
* Provider group with EIN 22-2222222
  * Has 2 providers with NPIs 2220000001, 2220000002


Something we probably want from this is a list of all providers with their associated provider group.  Something like:
```
group   ein          npi
------- ------------ ------------
1       11-1111111   1110000001
2       22-2222222   2220000001
2       22-2222222   2220000002
```

In [21]:
import pandas as pd

# What we'll see, though, is that just trying to read this with pd.read_json() doesn't get close to what we want.
pd.read_json('provider_sample.json')

Unnamed: 0,provider_groups,version
0,"{'npi': [1110000001], 'tin': {'type': 'ein', '...",1.4.0
1,"{'npi': [2220000001, 2220000002], 'tin': {'typ...",1.4.0
2,"{'npi': [2220000002, 3330000001], 'tin': {'typ...",1.4.0


In [22]:
# Let's take full control over how to deconstruct the JSON and make it rectangular
import json

# Use a context manager so the file handle is closed after loading
with open('provider_sample.json') as f:
    providers = json.load(f)

# We want a group number, ein, and npi
data = []
group_num = 1

# We'll loop through all the provider groups
for group in providers.get('provider_groups'):

    # Extract the EIN value
    tin = group.get('tin')
    if tin.get('type') == 'ein':
        ein = tin.get('value')
    else:
        ein = None

    # Loop through all the NPIs in the npi list and output one row per NPI for each group
    for npi in group.get('npi'):
        data.append([group_num, ein, npi])

    # Increment our group number
    group_num += 1

providers = pd.DataFrame(data, columns=['group','ein','npi'])
providers

Unnamed: 0,group,ein,npi
0,1,11-1111111,1110000001
1,2,22-2222222,2220000001
2,2,22-2222222,2220000002
3,3,33-3333333,2220000002
4,3,33-3333333,3330000001
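
The loop above spells out every step by hand. As a rough alternative sketch (not the approach used in this notebook), `pd.json_normalize()` plus `DataFrame.explode()` can produce the same shape. The inline `doc` dictionary and the name `providers_alt` are assumptions made so the example runs without `provider_sample.json`; it mirrors only the two groups shown in the JSON excerpt.

```python
import pandas as pd

# Inline copy of the two-group excerpt (assumption: stands in for provider_sample.json).
doc = {
    "provider_groups": [
        {"npi": [1110000001], "tin": {"type": "ein", "value": "11-1111111"}},
        {"npi": [2220000001, 2220000002], "tin": {"type": "ein", "value": "22-2222222"}},
    ],
    "version": "1.4.0",
}

# json_normalize flattens the nested "tin" dict into tin.type / tin.value columns;
# the "npi" lists are left intact, one list per row.
flat = pd.json_normalize(doc["provider_groups"])

# Group number = position in the provider_groups list, starting at 1.
flat["group"] = range(1, len(flat) + 1)

# Keep the TIN value only when it is an EIN (non-EIN rows become NaN rather than None).
flat["ein"] = flat["tin.value"].where(flat["tin.type"] == "ein")

# explode() turns each element of the npi lists into its own row.
providers_alt = (
    flat.explode("npi")
        .astype({"npi": "int64"})
        .reset_index(drop=True)[["group", "ein", "npi"]]
)
print(providers_alt)
```

The explicit loop is easier to extend when the extraction rules get more involved; the normalize-and-explode version is shorter when the structure is as regular as it is here.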


## 2. Pricing Data

Now, suppose that we have some pricing contract information for these providers for certain services.  An example might look like this:

```
procedure   11-1111111   22-2222222
----------- ------------ ------------
V504.1      1038.21      1933.43
X81.5       8921.54      7544.33
```

What we ultimately want is a list that shows the procedure and rate for each separate NPI (physician):

```
procedure   npi        rate
----------- ---------- --------
```

In [23]:
prices = pd.read_csv('prices.csv')
prices

Unnamed: 0,procedure,11-1111111,22-2222222,33-3333333
0,V504.1,1038.21,1933.43,1544.87
1,X81.5,8921.54,7544.33,9822.31


In [24]:
# First step will be to melt the separate provider group columns
# down to one row per procedure per group... then we can merge/join
price_by_ein = prices.melt(
    id_vars = 'procedure',
    value_vars = prices.columns[1:],
    var_name = 'ein',
    value_name = 'rate'
)

price_by_ein

Unnamed: 0,procedure,ein,rate
0,V504.1,11-1111111,1038.21
1,X81.5,11-1111111,8921.54
2,V504.1,22-2222222,1933.43
3,X81.5,22-2222222,7544.33
4,V504.1,33-3333333,1544.87
5,X81.5,33-3333333,9822.31


## 3. Join

Now we can merge/join the price-by-EIN table with the EIN/NPI list.

In [27]:
# You can adjust the join direction and inner/outer as needed 
# if there is missing data in either dataframe
price_by_npi_group = providers.merge(price_by_ein, on='ein')

price_by_npi_group

Unnamed: 0,group,ein,npi,procedure,rate
0,1,11-1111111,1110000001,V504.1,1038.21
1,1,11-1111111,1110000001,X81.5,8921.54
2,2,22-2222222,2220000001,V504.1,1933.43
3,2,22-2222222,2220000001,X81.5,7544.33
4,2,22-2222222,2220000002,V504.1,1933.43
5,2,22-2222222,2220000002,X81.5,7544.33
6,3,33-3333333,2220000002,V504.1,1544.87
7,3,33-3333333,2220000002,X81.5,9822.31
8,3,33-3333333,3330000001,V504.1,1544.87
9,3,33-3333333,3330000001,X81.5,9822.31


## 4. Average rate for each provider for each procedure

You'll notice that we've got one provider (NPI 2220000002) who is part of two different groups!  Interesting situation, but some doctors do work for multiple organizations.  So, let's compute their average rate for each procedure.

In [32]:
price_by_npi = price_by_npi_group.pivot_table(
    index=['npi','procedure'],
    columns=None,
    values='rate',
    aggfunc='mean'
).reset_index()

price_by_npi

Unnamed: 0,npi,procedure,rate
0,1110000001,V504.1,1038.21
1,1110000001,X81.5,8921.54
2,2220000001,V504.1,1933.43
3,2220000001,X81.5,7544.33
4,2220000002,V504.1,1739.15
5,2220000002,X81.5,8683.32
6,3330000001,V504.1,1544.87
7,3330000001,X81.5,9822.31


In [46]:
# And maybe we want to pivot it to be one column per procedure?
price_by_npi.pivot_table(
    index='npi',
    columns='procedure',
    values='rate'
).reset_index()

procedure,npi,V504.1,X81.5
0,1110000001,1038.21,8921.54
1,2220000001,1933.43,7544.33
2,2220000002,1739.15,8683.32
3,3330000001,1544.87,9822.31
