# Multidimensional data in pandas

Files needed = ('dogs.csv', 'CPS_March_2016.csv')

We have covered some pandas basics and learned how to plot. Now let's sort out how to deal with more complex data. We will often find ourselves with data in which the unit of observation is complex. Pandas helps us deal with this by allowing for many index variables. So far, we have only used single indexing, but that is about to change. 

Some examples that could use a multiIndex:
1. State and country
2. Team and player
3. Industry and firm
4. Country (or person, firm,...) and time

That last one is important, and it shows up a lot in economics. We call it *panel data*. Panel data is sometimes called longitudinal data. It follows the same firm/person/country over time. 
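As a preview, here is a minimal sketch of what a small panel might look like once it has a two-level index. The countries, years, and GDP figures are made up for illustration and do not come from the files used in this notebook.

```python
import pandas as pd

# Hypothetical numbers, invented for illustration only.
panel = pd.DataFrame({
    'country': ['USA', 'USA', 'Canada', 'Canada'],
    'year':    [2015, 2016, 2015, 2016],
    'gdp':     [100.0, 102.0, 50.0, 51.0],
})

# Each observation is identified by a (country, year) pair.
panel = panel.set_index(['country', 'year']).sort_index()
print(panel)

# One observation: a single country in a single year.
print(panel.loc[('Canada', 2016), 'gdp'])
```

The same country appears once per year, which is what makes this a panel rather than a simple cross section.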

In [1]:
import pandas as pd                 # load pandas and shorten it to pd
import datetime as dt               # load datetime and shorten it to dt
import matplotlib.pyplot as plt     # for making figures

In [2]:
soccer = {'team' : ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'], 
          'player' : ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
          'pos' : ['D', 'D', 'M', 'F', 'D', 'F'],
          'goals' : [1, 0, 0, 1, 0, 3],
          'assists': [0,0,0,0,0,0]
         }

prem = pd.DataFrame(soccer)
prem

       team  player pos  goals  assists
0  Man City  Walker   D      1        0
1  Man City  Stones   D      0        0
2  Man City   Foden   M      0        0
3  Man City   Jesus   F      1        0
4   Chelsea  Cahill   D      0        0
5   Chelsea   Pedro   F      3        0


### Multiple indexing
The key to working with more complex datasets is getting the index right. So far, we have considered a single index, but pandas allows for multiple indexes that nest within each other. 

**Key concept:** Hierarchical indexing takes multiple *levels* of indexes. 

Let's set up the DataFrame to take team and position as the indexes. 

In [3]:
prem.set_index(['team', 'pos'], inplace=True)
prem

              player  goals  assists
team     pos
Man City D    Walker      1        0
         D    Stones      0        0
         M     Foden      0        0
         F     Jesus      1        0
Chelsea  D    Cahill      0        0
         F     Pedro      3        0


Wow. 

Notice that the `set_index()` method is the same one we used earlier with single indexes. In this case, we passed it a list of variables to make the indexes:
```python
prem.set_index(['team', 'pos'], inplace=True)
```

In the output, the highest level of the index is team (we passed it 'team' first in the list) and the second level is position. The output does not repeat the team name for each observation. The 'missing' team name just means that the team is the same as above. \[A very Tufte-esque removal of unnecessary ink.\] 

Let's take a look under the hood. What's our index? A new kind of object: the MultiIndex.

In [4]:
print(prem.index)

MultiIndex(levels=[['Chelsea', 'Man City'], ['D', 'F', 'M']],
           codes=[[1, 1, 1, 1, 0, 0], [0, 0, 2, 1, 0, 1]],
           names=['team', 'pos'])
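The pieces of that printout can be pulled out one at a time. The sketch below builds the same index directly with `pd.MultiIndex.from_tuples()` so it runs on its own; the variable name `idx` is just for illustration.

```python
import pandas as pd

# The same (team, pos) pairs as in the DataFrame above, in the same order.
idx = pd.MultiIndex.from_tuples(
    [('Man City', 'D'), ('Man City', 'D'), ('Man City', 'M'),
     ('Man City', 'F'), ('Chelsea', 'D'), ('Chelsea', 'F')],
    names=['team', 'pos'])

print(idx.nlevels)                    # how many levels the index has
print(idx.names)                      # the name of each level
print(idx.levels[0])                  # the unique labels in level 0
print(idx.codes[0])                   # position of each row's label within levels[0]
print(idx.get_level_values('pos'))    # the 'pos' label for every row
```

The `levels` hold each distinct label once, and the `codes` say which label each row uses. That is why 'Man City' shows up as code 1 four times.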


### Subsetting with multiple indexes
With a multi index, we need two arguments to reference observations. Notice that I am using a **tuple** to pass the two values of the multiIndex.

In [5]:
# All the defenders on Man City
prem.loc[('Man City', 'D'),:] 

PerformanceWarning: indexing past lexsort depth may impact performance.
  return self._getitem_tuple(key)


              player  goals  assists
team     pos
Man City D    Walker      1        0
         D    Stones      0        0


It's always a good idea to pay attention to warnings, particularly 'PerformanceWarning'. Pandas is telling us that we are asking for something in the second index, but the second index is not ordered. If the index was big, this could slow down our program. Let's fix that with `sort_index()`.

**Important:** Sort your multiIndex. 

In [6]:
prem = prem.sort_index(axis=0)   # tell pandas which axis to sort. Could sort the columns, too...
                                 # returns a DataFrame unless we use inplace=True
prem

              player  goals  assists
team     pos
Chelsea  D    Cahill      0        0
         F     Pedro      3        0
Man City D    Walker      1        0
         D    Stones      0        0
         F     Jesus      1        0
         M     Foden      0        0


In [7]:
# Now let's ask for all the defenders on Man City
prem.loc[('Man City', 'D'), :]

              player  goals  assists
team     pos
Man City D    Walker      1        0
         D    Stones      0        0


No warnings. 
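If we would rather check than wait for a warning, one option is the `is_monotonic_increasing` attribute of the index. The sketch below rebuilds a small version of the roster (the name `df` is just for illustration) and checks the index before and after sorting.

```python
import pandas as pd

# A small copy of the roster with an unsorted (team, pos) index.
df = pd.DataFrame(
    {'goals': [1, 0, 0, 1, 0, 3]},
    index=pd.MultiIndex.from_tuples(
        [('Man City', 'D'), ('Man City', 'D'), ('Man City', 'M'),
         ('Man City', 'F'), ('Chelsea', 'D'), ('Chelsea', 'F')],
        names=['team', 'pos']))

sorted_before = df.index.is_monotonic_increasing   # 'Man City' comes before 'Chelsea'
df = df.sort_index()                               # sorts on all levels, outermost first
sorted_after = df.index.is_monotonic_increasing

print('Sorted before:', sorted_before)
print('Sorted after: ', sorted_after)
```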

### Partial indexing
With the indexes set, we can easily subset the data using only one of the indexes. In pandas, this is called *partial indexing* because we are only using part of the index to identify the data we want. 

We can use `loc[]` like we do with a single index if we want to index on the top level index.

In [8]:
print('All the Chelsea players:')
print(prem.loc['Chelsea',:])               # All the 'Chelsea' observations

print('\n\nAll the Man City players:')
print(prem.loc['Man City',:])              # All the 'Man City' observations

All the Chelsea players:
     player  goals  assists
pos                        
D    Cahill      0        0
F     Pedro      3        0


All the Man City players:
     player  goals  assists
pos                        
D    Walker      1        0
D    Stones      0        0
F     Jesus      1        0
M     Foden      0        0


Note that this kind of notation does not work if we want to index on the second index. Suppose we wanted all the defense, regardless of team. It seems like this should work:

```python
prem.loc[:,'D']
```

...but it does not.  That brings us to our next way to query a multiIndex. 

#### The xs( ) method
We can also use the `xs()` method of DataFrame. Here we specify which level we are looking into. Note that I can reference the levels either by an integer or by its name.

In [9]:
print(prem.xs('Chelsea', level = 0) )              # All the 'Chelsea' observations
print('\n')
print(prem.xs('Man City', level = 'team'))              # All the 'Man City' observations

     player  goals  assists
pos                        
D    Cahill      0        0
F     Pedro      3        0


     player  goals  assists
pos                        
D    Walker      1        0
D    Stones      0        0
F     Jesus      1        0
M     Foden      0        0


This works pretty much the same way that `loc[]` worked for the outer index.  

With `xs()`, we can partially index on the 'inner index' as well. Suppose we want all the defenders, regardless of team.

In [10]:
prem.xs('D', level=1)

          player  goals  assists
team
Chelsea   Cahill      0        0
Man City  Walker      1        0
Man City  Stones      0        0
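For completeness, `loc[]` can reach the inner level too, as long as we spell out "everything" for the outer level. The sketch below compares `xs()` with two `loc[]` spellings, one using `slice(None)` and one using `pd.IndexSlice`. It rebuilds the roster under the name `df` so that it runs on its own.

```python
import pandas as pd

# Rebuild the example roster so this block runs on its own.
df = pd.DataFrame({
    'team':   ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
    'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
    'pos':    ['D', 'D', 'M', 'F', 'D', 'F'],
    'goals':  [1, 0, 0, 1, 0, 3],
}).set_index(['team', 'pos']).sort_index()

by_xs = df.xs('D', level='pos')             # the 'pos' level is dropped from the result
by_slice = df.loc[(slice(None), 'D'), :]    # slice(None) means 'all teams'; both levels are kept

idx = pd.IndexSlice
by_indexslice = df.loc[idx[:, 'D'], :]      # same selection as by_slice, shorter spelling

print(by_xs)
print(by_slice)
```

The main difference is the shape of the result: `xs()` drops the level we selected on, while the `loc[]` versions keep the full multiIndex. These `loc[]` slices expect a sorted index, which is one more reason to sort.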


As with a single index, we can get rid of the index and replace it with a generic list of integers. This adds the index levels back into the DataFrame as columns. 

In [11]:
prem.reset_index(inplace=True)    # this moves the indexes back to columns
prem

       team pos  player  goals  assists
0   Chelsea   D  Cahill      0        0
1   Chelsea   F   Pedro      3        0
2  Man City   D  Walker      1        0
3  Man City   D  Stones      0        0
4  Man City   F   Jesus      1        0
5  Man City   M   Foden      0        0
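We do not have to reset every level at once. A minimal sketch, assuming we only want to move 'pos' back to a column and keep 'team' as the index, uses the `level` argument of `reset_index()`. The name `df` is a stand-in for the roster so the block runs on its own.

```python
import pandas as pd

# Rebuild the example roster with a (team, pos) index.
df = pd.DataFrame({
    'team':   ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
    'player': ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
    'pos':    ['D', 'D', 'M', 'F', 'D', 'F'],
    'goals':  [1, 0, 0, 1, 0, 3],
}).set_index(['team', 'pos']).sort_index()

# Move only the 'pos' level back into the columns; 'team' stays as the index.
flat_pos = df.reset_index(level='pos')
print(flat_pos)
```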


Who says two indexes are enough? Let's try three levels of indexes!

In [12]:
prem.set_index(['team', 'player', 'pos'], inplace=True)
prem

                     goals  assists
team     player pos
Chelsea  Cahill D        0        0
         Pedro  F        3        0
Man City Walker D        1        0
         Stones D        0        0
         Jesus  F        1        0
         Foden  M        0        0


#### A multiIndex in columns
There is nothing that says you can't have multiple indexes in the `axis=1` dimension. Here is a quick way to see this: transpose the DataFrame.

In [13]:
prem = prem.transpose()           # this swaps the rows for columns
print(prem)                       # print() lines up the columns well
prem

team    Chelsea       Man City                   
player   Cahill Pedro   Walker Stones Jesus Foden
pos           D     F        D      D     F     M
goals         0     3        1      0     1     0
assists       0     0        0      0     0     0


team,Chelsea,Chelsea,Man City,Man City,Man City,Man City
player,Cahill,Pedro,Walker,Stones,Jesus,Foden
pos,D,F,D,D,F,M
goals,0,3,1,0,1,0
assists,0,0,0,0,0,0


Now the rows are named 'goals' and 'assists' and the columns are ('team', 'player', 'pos'). I'm not sure this is a very useful way to look at this particular dataset, but multiIndex columns can come in handy. \[Transpose is handy, too.\] Let's change it back.

In [14]:
prem = prem.transpose()
prem

                     goals  assists
team     player pos
Chelsea  Cahill D        0        0
         Pedro  F        3        0
Man City Walker D        1        0
         Stones D        0        0
         Jesus  F        1        0
         Foden  M        0        0
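The subsetting tools we used on the rows carry over to multiIndex columns. The sketch below transposes a copy of the roster (called `wide` here, a name chosen for illustration) and selects columns by the top level, by an inner level with `xs(..., axis=1)`, and by a full tuple.

```python
import pandas as pd

# Rebuild the three-level roster, then transpose so the multiIndex is on the columns.
df = pd.DataFrame({
    'team':    ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
    'player':  ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
    'pos':     ['D', 'D', 'M', 'F', 'D', 'F'],
    'goals':   [1, 0, 0, 1, 0, 3],
    'assists': [0, 0, 0, 0, 0, 0],
}).set_index(['team', 'player', 'pos']).sort_index()

wide = df.transpose()

chelsea = wide['Chelsea']                                    # partial indexing on the top column level
defenders = wide.xs('D', level='pos', axis=1)                # inner column level, so we need axis=1
pedro_goals = wide.loc['goals', ('Chelsea', 'Pedro', 'F')]   # one cell, using the full column tuple

print(chelsea)
print(defenders)
print(pedro_goals)
```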


### Summary statistics by level
MultiIndexes provide a quick way to summarize data. We will see many different ways to do this --- getting statistics by groups --- and not all will involve a multiIndex. 

In [15]:
# When subsetting by the uppermost level, I can use xs or loc

print('Chelsea avg. goals', prem.xs('Chelsea', level='team')['goals'].mean())   # average goals for Chelsea players
print('Chelsea avg. goals', prem.loc['Chelsea','goals'].mean())   # average goals for Chelsea players

# When subsetting on the inner levels, I use xs 
print('Defender avg. goals {0:.2f}.'.format( prem.xs('D', level='pos')['goals'].mean() ) )          # average goals for defenders

Chelsea avg. goals 1.5
Chelsea avg. goals 1.5
Defender avg. goals 0.33.


  after removing the cwd from sys.path.


\[Did something go wrong? Go back and fix it up!\]

Notice the syntax with xs.
```python
 prem.xs('Chelsea', level='team')['goals']
```

The `prem.xs('Chelsea', level='team')` is returning a DataFrame with all the columns. \[Try it!\]

We then use the usual square-bracket syntax to pick off just the column 'goals' and then hit it with `mean()`.
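Computing one group at a time gets tedious when there are many groups. One alternative, sketched here and not the only way to do it, is `groupby()` with the `level` argument, which computes the statistic for every label in a level at once. The name `df` stands in for the roster so the block runs on its own.

```python
import pandas as pd

# Rebuild the three-level roster.
df = pd.DataFrame({
    'team':    ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
    'player':  ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
    'pos':     ['D', 'D', 'M', 'F', 'D', 'F'],
    'goals':   [1, 0, 0, 1, 0, 3],
    'assists': [0, 0, 0, 0, 0, 0],
}).set_index(['team', 'player', 'pos']).sort_index()

team_means = df.groupby(level='team')['goals'].mean()   # one average per team
pos_means = df.groupby(level='pos')['goals'].mean()     # one average per position

print(team_means)
print(pos_means)
```

The Chelsea and defender averages match what we computed with `xs()` above.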


### Saving multiIndex DataFrames

Saving a multiIndexed DataFrame works like before. Pandas fills in all the repeated labels, so the output is ready to go. Run the following code and then open the csv files.

In [16]:
# Multiple indexes on rows
prem.to_csv('prem.csv')

# Multiple indexes on columns
prem = prem.transpose()
prem.to_csv('prem_transposed.csv')
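Reading the file back in takes one extra step: `read_csv()` does not know which columns were index levels unless we tell it. Here is a minimal sketch of the round trip for the row multiIndex. It writes to an in-memory buffer (`io.StringIO`) instead of a file on disk, and `df` and `restored` are names chosen for illustration.

```python
import io
import pandas as pd

# Rebuild the three-level roster.
df = pd.DataFrame({
    'team':    ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
    'player':  ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
    'pos':     ['D', 'D', 'M', 'F', 'D', 'F'],
    'goals':   [1, 0, 0, 1, 0, 3],
    'assists': [0, 0, 0, 0, 0, 0],
}).set_index(['team', 'player', 'pos']).sort_index()

# Write to an in-memory text buffer rather than a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# The first three columns of the csv hold the index levels.
restored = pd.read_csv(buffer, index_col=[0, 1, 2])
print(restored)
```

Without `index_col`, the three levels would come back as ordinary columns and we would need `set_index()` again.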

## Practice

Let's take data from the [Current Population Survey](https://www.census.gov/programs-surveys/cps.html), which surveys about 60,000 households each month. We will compute some average wages. We will need to clean up a bit, then work with a multiIndex. Think of this as a mini-project.

1. Load the March CPS data, 'CPS_March_2016.csv'.  Note: the missing values are coded as '.'.

In [17]:
cps = pd.read_csv('CPS_March_2016.csv',na_values = '.')

cps.head(20)

      hrwage            educ  female  fulltimely
0  20.961538    Some college       0         1.0
1  20.192308  HS diploma/GED       1         1.0
2   6.410256    Some college       0         0.0
3        NaN    Less than HS       0         NaN
4        NaN    Some college       0         NaN
5        NaN  HS diploma/GED       1         NaN
6  14.285714  HS diploma/GED       1         1.0
7        0.0    Some college       0         0.0
8        NaN  HS diploma/GED       1         NaN
9        NaN  College degree       0         NaN


2. Keep only those with `fulltimely == 1`
3. Keep only those with `5 <= hrwage <= 200`

In [18]:
# Keep individuals who worked full time last year:
cps = cps[cps['fulltimely'] == 1]

# Keep individuals with wages between $5 and $200.
cps = cps[cps['hrwage'] <= 200]
cps = cps[cps['hrwage'] >= 5]


# Alternatively, apply both wage bounds in a single step:
# cps = cps[(cps['hrwage']<=200) & (cps['hrwage']>=5)]
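Another way to write the wage filter is the `between()` method of a Series, which includes both endpoints by default. The sketch below uses a tiny made-up DataFrame called `wages` (not the CPS file) so that we can see exactly which rows survive.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration. Column names mimic the CPS file.
wages = pd.DataFrame({
    'hrwage':     [4.0, 5.0, 20.0, 200.0, 250.0, np.nan],
    'fulltimely': [1, 1, 0, 1, 1, 1],
})

# Combine the conditions with &. between() is inclusive on both ends,
# and a missing wage is never 'between' anything, so it is dropped.
mask = (wages['fulltimely'] == 1) & wages['hrwage'].between(5, 200)
kept = wages[mask]
print(kept)
```

Only the 5.0 and 200.0 wages remain: 4.0 and 250.0 are out of range, 20.0 belongs to someone who did not work full time, and the missing wage fails the test.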

4. Rename 'female' to 'gender'
5. In column 'gender' replace 0 with 'male' and 1 with 'female'

In [19]:
cps.rename(columns={'female':'gender'}, inplace=True)
cps.loc[cps['gender']==0, 'gender'] = 'male'
cps.loc[cps['gender']==1, 'gender'] = 'female'
cps.head(20)

       hrwage             educ  gender  fulltimely
0   20.961538     Some college    male         1.0
1   20.192308   HS diploma/GED  female         1.0
6   14.285714   HS diploma/GED  female         1.0
10   18.26923     Some college  female         1.0
12   59.52381  Graduate degree    male         1.0
13  18.367348   College degree  female         1.0
14   8.653846   HS diploma/GED    male         1.0
15   59.13621  Graduate degree    male         1.0
21  19.711538   College degree    male         1.0
22  22.349272   College degree  female         1.0
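The recoding step can also be written with `map()` and a dictionary, which handles both values in one line. This is a sketch on a small made-up DataFrame called `demo`, not the CPS data.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
demo = pd.DataFrame({'female': [0, 1, 1, 0]})

demo = demo.rename(columns={'female': 'gender'})

# map() looks up each value in the dictionary.
# Any value that is not a key in the dictionary would become NaN.
demo['gender'] = demo['gender'].map({0: 'male', 1: 'female'})
print(demo)
```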


6. Set the index to 'gender' and 'educ', in that order.
7. Sort the index. 

In [20]:
cps.set_index(['gender', 'educ'], inplace=True)
cps.sort_index(axis=0, inplace = True)

cps.sample(20)

                           hrwage  fulltimely
gender educ
female Graduate degree  15.384615         1.0
male   Some college      8.741259         1.0
       Some college     55.555557         1.0
       HS diploma/GED         5.0         1.0
       Graduate degree  15.734265         1.0
       HS diploma/GED       31.25         1.0
female HS diploma/GED    5.769231         1.0
male   College degree   20.192308         1.0
       Graduate degree   24.51923         1.0
       Some college     20.673077         1.0

8. Report the average wage for males and females. Try it with the `loc[]` method. 

In [21]:
avg_wage_f = cps.loc['female', 'hrwage'].mean()
avg_wage_m = cps.loc['male', 'hrwage'].mean()

print('Average wage of females is ${0:.2f} and males is ${1:.2f}.'.format(avg_wage_f, avg_wage_m) )


Average wage of females is $22.75 and males is $28.31.


9. Report the average wage for `HS diploma/GED` and for `College degree`, regardless of gender. Use the `xs()` method.  