# Using Groupby and Transform

In [1]:
import transportation_tutorials as tt
import pandas as pd
import numpy as np

## Questions 

1.  Within each FDOT District, what fraction of the district's structurally 
    deficient bridge deck area is in each County?  
2.  Which county has the highest share of structurally deficient 
    bridge deck area within its FDOT District? *(Hint: the correct
    answer is PALM BEACH.)*

## Data

To answer the questions, use the following data files:

In [2]:
districts = pd.read_csv(tt.data('FL-COUNTY-BY-DISTRICT'))
districts.head()

Unnamed: 0,County,District
0,Charlotte,1
1,Collier,1
2,DeSoto,1
3,Glades,1
4,Hardee,1


In [3]:
bridges = pd.read_csv(tt.data('FL-BRIDGES'))

# Recall the necessary cleaning for the bridges data file
bridges = bridges.replace('-', 0)
bridges[['Poor #', 'SD #']] = bridges[['Poor #', 'SD #']].astype(int)
bridges.fillna(0, inplace=True)

bridges.head()

Unnamed: 0,County,Total #,Good #,Fair #,Poor #,SD #,Total Area,Good Area,Fair Area,Poor Area,SD Area
0,ALACHUA (001),111,64,47,0,0,64767,55794,8973,0.0,0.0
1,BAKER (003),89,30,52,7,8,32162,19369,12282,510.0,623.0
2,BAY (005),122,49,63,10,11,210039,98834,109628,1577.0,10120.0
3,BRADFORD (007),62,23,37,2,2,9330,5492,3217,620.0,620.0
4,BREVARD (009),241,160,81,0,0,364138,204179,159959,0.0,0.0


## Solution

We need to create a table that answers (1).  The first step is to attach
a district number to each row (county) of the bridges table. The `County`
field in the bridges table holds the county name in upper case followed by
a three-digit code number in parentheses, while the districts file has only
plain names in title case.  We'll strip the code numbers from `bridges`
(six characters: a space, two parentheses, and three digits) and convert
the names in `districts` to upper case, so the two tables match exactly
for merging.

In [4]:
bridges['County'] = bridges['County'].str[:-6]
districts['County'] = districts['County'].str.upper()
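The two string operations can be checked on a few values before touching the full tables. The sketch below uses small stand-in Series built from names shown in the tables above. The regular expression is an alternative offered here as an assumption, not part of the original solution: it removes only a trailing ` (ddd)` pattern, so a value without a code would be left unchanged rather than losing its last six characters.

```python
import pandas as pd

# Toy stand-ins for the two County columns (names taken from the tables above).
bridge_names = pd.Series(['ALACHUA (001)', 'BAKER (003)', 'BAY (005)'])
district_names = pd.Series(['Charlotte', 'Collier', 'DeSoto'])

# Slicing drops the last six characters: a space, two parentheses, three digits.
sliced = bridge_names.str[:-6]

# Alternative (an assumption, not the original method): strip only a
# trailing " (ddd)" pattern, leaving any other value untouched.
pattern_stripped = bridge_names.str.replace(r'\s*\(\d{3}\)$', '', regex=True)

# Upper-case the title-case names so both sides use the same spelling.
upper_names = district_names.str.upper()

print(sliced.tolist())
print(pattern_stripped.tolist())
print(upper_names.tolist())
```

On these values both approaches give the same result; the regex version is simply more forgiving if a row does not follow the `NAME (ddd)` layout.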

In [5]:
bridges_2 = pd.merge(
    bridges, 
    districts[['County','District']], 
    on='County',
)
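Because `pd.merge` performs an inner join by default, any row whose `County` value has no exact match in the other table is dropped silently. A quick way to look for such rows is a left merge with `indicator=True`. The sketch below uses hypothetical toy tables; the `TOTAL` row is invented to stand in for any unmatched value, and whether the real files contain such rows is not shown here.

```python
import pandas as pd

# Hypothetical toy tables; 'TOTAL' stands in for any row with no district match.
toy_bridges = pd.DataFrame({
    'County': ['ALACHUA', 'BAKER', 'TOTAL'],
    'SD Area': [0.0, 623.0, 623.0],
})
toy_districts = pd.DataFrame({
    'County': ['ALACHUA', 'BAKER'],
    'District': [2, 2],
})

# A left merge keeps every bridges row; indicator=True adds a '_merge'
# column recording whether each row found a match.
check = pd.merge(
    toy_bridges,
    toy_districts,
    on='County',
    how='left',
    indicator=True,
)

# Rows flagged 'left_only' exist in the bridges table but not in districts.
unmatched = check.loc[check['_merge'] == 'left_only', 'County'].tolist()
print(unmatched)
```

An empty list would mean every bridges row found a district; anything else points to a name that needs cleaning or a row that should be excluded on purpose.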

In [6]:
bridges_2.head()

Unnamed: 0,County,Total #,Good #,Fair #,Poor #,SD #,Total Area,Good Area,Fair Area,Poor Area,SD Area,District
0,ALACHUA,111,64,47,0,0,64767,55794,8973,0.0,0.0,2
1,BAKER,89,30,52,7,8,32162,19369,12282,510.0,623.0,2
2,BAY,122,49,63,10,11,210039,98834,109628,1577.0,10120.0,3
3,BRADFORD,62,23,37,2,2,9330,5492,3217,620.0,620.0,2
4,BREVARD,241,160,81,0,0,364138,204179,159959,0.0,0.0,5


Then we need to use `transform` to compute each county's share of the
total 'SD Area' in its district.

In [7]:
bridges_2['SD Area Share in District'] = (
    bridges_2.groupby('District')['SD Area']
    .transform(lambda x: x / x.sum())
)
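To see what `transform` does here, it can help to run the same pattern on a table small enough to check by hand. The sketch below uses hypothetical toy data (two districts with two counties each) and also shows an equivalent form that broadcasts the district total with `transform('sum')` and then divides.

```python
import pandas as pd

# Hypothetical toy data: two districts with two counties each.
toy = pd.DataFrame({
    'County': ['A', 'B', 'C', 'D'],
    'District': [1, 1, 2, 2],
    'SD Area': [10.0, 30.0, 5.0, 15.0],
})

# transform returns one value per input row, aligned to the original index,
# so the result can be assigned straight back as a new column.
toy['share'] = toy.groupby('District')['SD Area'].transform(lambda x: x / x.sum())

# Equivalent form: broadcast each district's total to its rows, then divide.
district_total = toy.groupby('District')['SD Area'].transform('sum')
toy['share_alt'] = toy['SD Area'] / district_total

# Shares within each district should add up to 1.
share_sums = toy.groupby('District')['share'].sum()

print(toy)
print(share_sums)
```

Unlike an aggregation, which would return one row per district, `transform` keeps the original row count. If a district had a total 'SD Area' of zero, the division would produce `NaN` for its counties.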

In [8]:
bridges_2

Unnamed: 0,County,Total #,Good #,Fair #,Poor #,SD #,Total Area,Good Area,Fair Area,Poor Area,SD Area,District,SD Area Share in District
0,ALACHUA,111,64,47,0,0,64767,55794,8973,0.0,0.0,2,0.000000
1,BAKER,89,30,52,7,8,32162,19369,12282,510.0,623.0,2,0.010061
2,BAY,122,49,63,10,11,210039,98834,109628,1577.0,10120.0,3,0.071854
3,BRADFORD,62,23,37,2,2,9330,5492,3217,620.0,620.0,2,0.010012
4,BREVARD,241,160,81,0,0,364138,204179,159959,0.0,0.0,5,0.000000
5,BROWARD,689,535,150,4,7,1192081,952849,238309,923.0,1372.0,4,0.036054
6,CALHOUN,49,19,29,1,1,76300,50437,25863,0.0,0.0,3,0.000000
7,CHARLOTTE,207,172,35,0,1,250385,229102,21284,0.0,1511.0,1,0.105179
8,CITRUS,41,32,9,0,0,21903,19948,1955,0.0,0.0,7,0.000000
9,CLAY,77,29,45,3,3,68282,19034,48860,388.0,388.0,2,0.006266


We can use `idxmax` to get the index of the highest value in this new column,
which gives the answer to part (2).

In [9]:
bridges_2.loc[bridges_2['SD Area Share in District'].idxmax()]

County                       PALM BEACH
Total #                             604
Good #                              516
Fair #                               81
Poor #                                7
SD #                                  7
Total Area                       805336
Good Area                        647923
Fair Area                        120878
Poor Area                         36535
SD Area                           36535
District                              4
SD Area Share in District      0.960083
Name: 48, dtype: object

Palm Beach County has only a few structurally deficient 
bridges, but they are big ones, including two that span
the Intracoastal Waterway.
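The same `idxmax` idea extends to finding the top county in every district, not just the overall maximum. The following is a minimal sketch on hypothetical toy data, with a made-up `share` column standing in for 'SD Area Share in District'.

```python
import pandas as pd

# Hypothetical toy data with a precomputed share column.
toy = pd.DataFrame({
    'County': ['A', 'B', 'C', 'D'],
    'District': [1, 1, 2, 2],
    'share': [0.25, 0.75, 0.96, 0.04],
})

# Overall maximum: idxmax returns the index label of the largest value.
top_overall = toy.loc[toy['share'].idxmax(), 'County']

# Per-district maximum: idxmax within each group yields one row label
# per district, which .loc then uses to pull the full rows.
top_rows = toy.loc[toy.groupby('District')['share'].idxmax()]

print(top_overall)
print(top_rows[['District', 'County', 'share']])
```

When several rows tie for the maximum, `idxmax` returns the first one it encounters.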