## Imports

In [77]:
import pandas as pd
import geopandas as gpd
import folium
import matplotlib.pyplot as plt
%matplotlib inline

## Bring in our two files to join for time comparison.  

We want to look at the change in population per age group, as well as share of total population per age group, between the two five year periods represented by the American Community Survey 5-Year estimates. Here that's 2010-2014, and 2015-2019. We're going to bring in the 2014 file and rename the columns so that it's clear which is the older set - but I'm not going to specify the year so that this code can be reused with minimal changes when the next 5 year estimates come out. I'm going to bring in the 2019 shapefile so that we can map it and we already have a geodataframe. Make sure to reproduce that shapefile in the 2019 CBSA prep notebook because these are regularly deleted for memory purposes.

In [78]:
# 2019 shapefile
new = gpd.read_file('../output/shapefiles/2019_CBSA/2019CBSA.shp')

In [79]:
#2014 csv
old = pd.read_csv('../output/csv/agegroups_2014_cbsa.csv')

In [80]:
new.head(3)

Unnamed: 0,CSAFP,CBSAFP,FullName,MetroMicro,MEMI,MTFCC,ALAND,AWATER,INTPTLAT,INTPTLON,...,Pt40s,Pm50_65,Pf50_65,Pt50_65,Pmo65,Pfo65,Pto65,Name,State,geometry
0,122,12020,"Athens-Clarke County, GA",M1,1,G3110,2654601832,26140309,33.943984,-83.2138965,...,11.7,7.6,8.3,16.0,5.6,7.2,12.8,Athens-Clarke County,GA,"POLYGON ((-83.53739 33.96591, -83.53184 33.968..."
1,122,12060,"Atlanta-Sandy Springs-Alpharetta, GA",M1,1,G3110,22494938651,387716575,33.693728,-84.3999113,...,14.2,9.0,9.8,18.8,5.1,6.8,11.9,Atlanta-Sandy Springs-Alpharetta,GA,"POLYGON ((-85.33823 33.65312, -85.33842 33.654..."
2,428,12100,"Atlantic City-Hammonton, NJ",M1,1,G3110,1438776649,301268696,39.4693555,-74.6337591,...,12.2,10.6,11.6,22.1,7.7,9.8,17.5,Atlantic City-Hammonton,NJ,"POLYGON ((-74.85675 39.42076, -74.85670 39.420..."


Before joining, look at the older file and add a tag to the column names - the groups are the same so this will allow us to tell them apart when we calculate our time series.

In [81]:
old.head(3)

Unnamed: 0,CBSA,GEOID,total,mtotal,mu5,ftotal,fu5,tu5,mschool,fschool,...,Pt40s,Pm50_65,Pf50_65,Pt50_65,Pmo65,Pfo65,Pto65,Name,State,CBSAFIPS
0,"Homosassa Springs, FL Metro Area",310M200US26140,139771,67497,2639,72274,2793,5432,8173,7883,...,10.5,10.8,12.7,23.5,16.1,17.5,33.6,Homosassa Springs,FL,26140
1,"Hickory-Lenoir-Morganton, NC Metro Area",310M200US25860,363936,180006,10129,183930,9973,20102,31202,29556,...,14.7,10.3,10.8,21.1,7.1,9.1,16.2,Hickory-Lenoir-Morganton,NC,25860
2,"Hobbs, NM Micro Area",310M200US26020,66876,34219,2909,32657,2881,5790,7194,6948,...,11.7,8.4,8.3,16.7,4.9,5.8,10.7,Hobbs,NM,26020


In [82]:
#for loop for renaming them with an O* for old at the beginning of each column
for col in old.columns:
    old.rename(columns = {col:'O*'+col}, inplace = True)

In [83]:
#check that this was effective
old.head(3)

Unnamed: 0,O*CBSA,O*GEOID,O*total,O*mtotal,O*mu5,O*ftotal,O*fu5,O*tu5,O*mschool,O*fschool,...,O*Pt40s,O*Pm50_65,O*Pf50_65,O*Pt50_65,O*Pmo65,O*Pfo65,O*Pto65,O*Name,O*State,O*CBSAFIPS
0,"Homosassa Springs, FL Metro Area",310M200US26140,139771,67497,2639,72274,2793,5432,8173,7883,...,10.5,10.8,12.7,23.5,16.1,17.5,33.6,Homosassa Springs,FL,26140
1,"Hickory-Lenoir-Morganton, NC Metro Area",310M200US25860,363936,180006,10129,183930,9973,20102,31202,29556,...,14.7,10.3,10.8,21.1,7.1,9.1,16.2,Hickory-Lenoir-Morganton,NC,25860
2,"Hobbs, NM Micro Area",310M200US26020,66876,34219,2909,32657,2881,5790,7194,6948,...,11.7,8.4,8.3,16.7,4.9,5.8,10.7,Hobbs,NM,26020


In [84]:
#ensure same datatype on joining columns
new['CBSAFP'] = new['CBSAFP'].astype(int)
old['O*CBSAFIPS'] = old['O*CBSAFIPS'].astype(int)

Now we can join the old df onto the new geodataframe

In [85]:
cbsa = new.merge(old, left_on='CBSAFP', right_on='O*CBSAFIPS')

In [86]:
cbsa.head(2)

Unnamed: 0,CSAFP,CBSAFP,FullName,MetroMicro,MEMI,MTFCC,ALAND,AWATER,INTPTLAT,INTPTLON,...,O*Pt40s,O*Pm50_65,O*Pf50_65,O*Pt50_65,O*Pmo65,O*Pfo65,O*Pto65,O*Name,O*State,O*CBSAFIPS
0,122,12020,"Athens-Clarke County, GA",M1,1,G3110,2654601832,26140309,33.943984,-83.2138965,...,11.6,7.6,8.3,15.9,4.7,6.3,10.9,Athens-Clarke County,GA,12020
1,122,12060,"Atlanta-Sandy Springs-Alpharetta, GA",M1,1,G3110,22494938651,387716575,33.693728,-84.3999113,...,15.4,8.6,9.5,18.0,4.2,5.7,9.9,Atlanta-Sandy Springs-Roswell,GA,12060


In [87]:
cbsa.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 907 entries, 0 to 906
Columns: 107 entries, CSAFP to O*CBSAFIPS
dtypes: float64(39), geometry(1), int32(2), int64(50), object(15)
memory usage: 758.2+ KB


# Now we start thinking about the kinds of change we want to look at.

With the measures we have there are a few ways to look at things:  
- real change  
- percent change in the raw numbers  
- change in population share  
- some sort of joint metric, like change in elderly population combined with under 5 and tax base populations put together into a ratio  

We'll explore all of these options, starting from the top.

## Real change, limited to 65+

In [88]:
cbsa['Telderlyrealchange'] = cbsa['to65'] - cbsa['O*to65']
cbsa['Melderlyrealchange'] = cbsa['mo65'] - cbsa['O*mo65']
cbsa['Felderlyrealchange'] = cbsa['fo65'] - cbsa['O*fo65']

In [89]:
#make another small dataframe to check out high and low values, who's around the Nashville MSA... etc.
elderlyreal = cbsa[['FullName', 'CBSAFP', 'MetroMicro', 'geometry', 'to65', 'O*to65', 'Telderlyrealchange',
                    'Melderlyrealchange', 'mo65', 'O*mo65', 
                    'Felderlyrealchange', 'fo65', 'O*fo65']]

In [90]:
#create a list of the columns you want to rank and then create a for loop to write in the rankings as integers
cols = ['to65', 'O*to65', 'Telderlyrealchange',
        'Melderlyrealchange', 'mo65', 'O*mo65',
        'Felderlyrealchange', 'fo65', 'O*fo65']

for i in cols:
    elderlyreal['rank{}.'.format(i)] = elderlyreal['{}'.format(i)].rank().astype(int)
    
elderlyreal = elderlyreal.copy()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [91]:
#index into where the Nashville MSA is to check the rankings out
nash = elderlyreal.loc[elderlyreal['CBSAFP'] == 34980].reset_index(drop = True)

In [92]:
print("The following number is Nashville's rank in 2015-2019 total elderly population:")
print('______________________________________________________________________________')
print(nash['rankto65.'])
print("The following number is Nashville's rank in 2010-2014 total elderly population:")
print('______________________________________________________________________________')
print(nash['rankO*to65.'])
print("The following number is Nashville's rank in 5 year over 5 year period growth in  total elderly population:")
print('______________________________________________________________________________')
print(nash['rankTelderlyrealchange.'])
print("The following number is Nashville's rank in 5 year over 5 year period growth in  total elderly male population:")
print('______________________________________________________________________________')
print(nash['rankMelderlyrealchange.'])
print("The following number is Nashville's rank in 2015-2019 total elderly male population:")
print('______________________________________________________________________________')
print(nash['rankmo65.'])
print("The following number is Nashville's rank in 2010-2014 total elderly male population:")
print('______________________________________________________________________________')
print(nash['rankO*mo65.'])
print("The following number is Nashville's rank in 5 year over 5 year period growth in  total elderly female population:")
print('______________________________________________________________________________')
print(nash['rankFelderlyrealchange.'])
print("The following number is Nashville's rank in 2015-2019 total elderly female population:")
print('______________________________________________________________________________')
print(nash['rankfo65.'])
print("The following number is Nashville's rank in 2010-2014 total elderly female population:")
print('______________________________________________________________________________')
print(nash['rankO*fo65.'])

The following number is Nashville's rank in 2015-2019 total elderly population:
______________________________________________________________________________
0    869
Name: rankto65., dtype: int32
The following number is Nashville's rank in 2010-2014 total elderly population:
______________________________________________________________________________
0    868
Name: rankO*to65., dtype: int32
The following number is Nashville's rank in 5 year over 5 year period growth in  total elderly population:
______________________________________________________________________________
0    870
Name: rankTelderlyrealchange., dtype: int32
The following number is Nashville's rank in 5 year over 5 year period growth in  total elderly male population:
______________________________________________________________________________
0    870
Name: rankMelderlyrealchange., dtype: int32
The following number is Nashville's rank in 2015-2019 total elderly male population:
__________________________________

###### It's interesting to see how close all of these rankings are. Let's find some peer communities:

In [93]:
range = (865, 866, 867, 869, 870, 871, 872, 873, 874, 875)

In [94]:
peers_totalelderlychange = elderlyreal.loc[elderlyreal['rankTelderlyrealchange.'].isin(range)].reset_index(drop = True)

In [95]:
print('The following are our peer communities, based on total real change in elderly population:')
print('_________________________________________________________________________________________')

print(peers_totalelderlychange['FullName'])

The following are our peer communities, based on total real change in elderly population:
_________________________________________________________________________________________
0                              Cleveland-Elyria, OH
1                         Cape Coral-Fort Myers, FL
2                  Indianapolis-Carmel-Anderson, IN
3     Myrtle Beach-Conway-North Myrtle Beach, SC-NC
4    Nashville-Davidson--Murfreesboro--Franklin, TN
5                 North Port-Sarasota-Bradenton, FL
6                                    Pittsburgh, PA
7                                  Raleigh-Cary, NC
8                       San Juan-Bayamón-Caguas, PR
9        Virginia Beach-Norfolk-Newport News, VA-NC
Name: FullName, dtype: object


In [96]:
#print the whole df to see the other associated info:
#peers_totalelderlychange

## Percent change, limited to 65+, u18, and the rest as one group aka "tax base", I'm dropping the gender aspect here I don't think it's really worth much at this point.

First add up the tax base group:

In [98]:
cbsa['O*child'] = cbsa['O*tu5']+cbsa['O*tschool']
cbsa['child'] = cbsa['tu5']+cbsa['tschool']
cbsa['O*taxbase'] = cbsa['O*t18_20s']+cbsa['O*t30s']+cbsa['O*t40s']+cbsa['O*t50_65']
cbsa['taxbase'] = cbsa['t18_20s']+cbsa['t30s']+cbsa['t40s']+cbsa['t50_65']

In [119]:
cbsa['elderlypercchange'] = round((cbsa['to65'] - cbsa['O*to65'])*100/cbsa['O*to65'], 2)
cbsa['taxbasepercchange'] = round((cbsa['taxbase'] - cbsa['O*taxbase'])*100/cbsa['O*to65'], 2)
cbsa['childpercchange'] = round((cbsa['child'] - cbsa['O*child'])*100/cbsa['O*to65'], 2)

In [120]:
#make another small dataframe to check out high and low values, who's around the Nashville MSA... etc.
groupperc = cbsa[['FullName', 'CBSAFP', 'MetroMicro', 'geometry', 'elderlypercchange',
                    'taxbasepercchange', 'childpercchange']]

In [121]:
#create a list of the columns you want to rank and then create a for loop to write in the rankings as integers
cols = ['elderlypercchange','taxbasepercchange','childpercchange']

for i in cols:
    groupperc['rank{}.'.format(i)] = groupperc['{}'.format(i)].rank().astype(int)
    
groupperc = groupperc.copy()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [122]:
#index into where the Nashville MSA is to check the rankings out
nash = groupperc.loc[groupperc['CBSAFP'] == 34980].reset_index(drop = True)

In [123]:
print("The following number is Nashville's rank in percent growth in a 5 year over 5 year period for total elderly population:")
print('______________________________________________________________________________')
print(nash['rankelderlypercchange.'])
print("The following number is Nashville's rank in percent growth in a 5 year over 5 year period for total tax base population:")
print('______________________________________________________________________________')
print(nash['ranktaxbasepercchange.'])
print("The following number is Nashville's rank in percent growth in a 5 year over 5 year period for total child population:")
print('______________________________________________________________________________')
print(nash['rankchildpercchange.'])

The following number is Nashville's rank in percent growth in a 5 year over 5 year period for total elderly population:
______________________________________________________________________________
0    726
Name: rankelderlypercchange., dtype: int32
The following number is Nashville's rank in percent growth in a 5 year over 5 year period for total tax base population:
______________________________________________________________________________
0    833
Name: ranktaxbasepercchange., dtype: int32
The following number is Nashville's rank in percent growth in a 5 year over 5 year period for total child population:
______________________________________________________________________________
0    823
Name: rankchildpercchange., dtype: int32


Let's look at our peers for total elderly:

In [124]:
range = (721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731)
peers_elderlypercchange = groupperc.loc[groupperc['rankelderlypercchange.'].isin(range)].reset_index(drop = True)
print('The following are our peer communities, based on percent change in elderly population:')
print('_________________________________________________________________________________________')

print(peers_elderlypercchange['FullName'])
print(peers_elderlypercchange['elderlypercchange'])

The following are our peer communities, based on percent change in elderly population:
_________________________________________________________________________________________
0                          College Station-Bryan, TX
1                                        Concord, NH
2                                    Idaho Falls, ID
3                                         Ithaca, NY
4                              Manchester-Nashua, NH
5                                     Morgantown, WV
6     Nashville-Davidson--Murfreesboro--Franklin, TN
7                      San Antonio-New Braunfels, TX
8                                    Sevierville, TN
9                                       Show Low, AZ
10                                   Sioux Falls, SD
Name: FullName, dtype: object
0     21.67
1     21.49
2     21.40
3     21.64
4     21.46
5     21.72
6     21.72
7     21.89
8     21.94
9     21.86
10    21.74
Name: elderlypercchange, dtype: float64


In [110]:
# #print the whole df to see the other associated info:
# peers_elderlypercchange

Let's look at our peers for total tax base:

In [125]:
range = (721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731)
peers_tbpercchange = groupperc.loc[groupperc['ranktaxbasepercchange.'].isin(range)].reset_index(drop = True)
print('The following are our peer communities, based on percent change in tax base population:')
print('_________________________________________________________________________________________')

print(peers_tbpercchange['FullName'])
print(peers_tbpercchange['taxbasepercchange'])

The following are our peer communities, based on percent change in tax base population:
_________________________________________________________________________________________
0                          Bismarck, ND
1        Boston-Cambridge-Newton, MA-NH
2                           Edwards, CO
3                           El Paso, TX
4                    Albany-Lebanon, OR
5                           Arcadia, FL
6                           Modesto, CA
7     North Port-Sarasota-Bradenton, FL
8                         Paragould, AR
9              Rio Grande City-Roma, TX
10                           Toledo, OH
Name: FullName, dtype: object
0     13.79
1     13.80
2     13.24
3     13.36
4     13.73
5     13.59
6     14.01
7     13.91
8     13.65
9     14.04
10    13.26
Name: taxbasepercchange, dtype: float64


In [126]:
range = (721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731)
peers_childpercchange = groupperc.loc[groupperc['rankchildpercchange.'].isin(range)].reset_index(drop = True)
print('The following are our peer communities, based on percent change in child population:')
print('_________________________________________________________________________________________')

print(peers_tbpercchange['FullName'])
print(peers_tbpercchange['childpercchange'])

The following are our peer communities, based on percent change in child population:
_________________________________________________________________________________________
0                          Bismarck, ND
1        Boston-Cambridge-Newton, MA-NH
2                           Edwards, CO
3                           El Paso, TX
4                    Albany-Lebanon, OR
5                           Arcadia, FL
6                           Modesto, CA
7     North Port-Sarasota-Bradenton, FL
8                         Paragould, AR
9              Rio Grande City-Roma, TX
10                           Toledo, OH
Name: FullName, dtype: object
0      7.13
1     -2.00
2     -9.05
3    -11.57
4      2.19
5     -4.14
6      3.11
7      2.48
8      4.76
9      7.86
10     3.73
Name: childpercchange, dtype: float64


## Change in population share for our same three groups

Probably move these calculations to the first files and just export groups eventually for the entire project.... gonna finish this for now brain melting.

In [142]:
cbsa['O*Pchild'] = cbsa['O*tu5'] + cbsa['O*tschool']/cbsa['O*total']
cbsa['Pchild'] = cbsa['tu5'] + cbsa['tschool']/cbsa['total']
cbsa['O*Ptaxbase'] = cbsa['O*t18_20s']+cbsa['O*t30s']+cbsa['O*t40s']+cbsa['O*t50_65']*100/cbsa['O*total']
cbsa['Ptaxbase'] = cbsa['t18_20s']+cbsa['t30s']+cbsa['t40s']+cbsa['t50_65']*100/cbsa['total']

cbsa['elderlysharechange'] = cbsa['Pto65'] - cbsa['O*Pto65']
cbsa['tbsharechange'] = cbsa['Ptaxbase'] - cbsa['O*Ptaxbase']
cbsa['childsharechange'] = cbsa['Pchild'] - cbsa['O*Pchild']

In [143]:
#make another small dataframe to check out high and low values, who's around the Nashville MSA... etc.
sharechange = cbsa[['FullName', 'CBSAFP', 'MetroMicro', 'geometry', 'elderlysharechange',
                    'tbsharechange', 'childsharechange']]

In [144]:
#create a list of the columns you want to rank and then create a for loop to write in the rankings as integers
cols = ['elderlysharechange','tbsharechange','childsharechange']

for i in cols:
    sharechange['rank{}.'.format(i)] = sharechange['{}'.format(i)].rank().astype(int)
    
sharechange = sharechange.copy()

In [145]:
#index into where the Nashville MSA is to check the rankings out
nash = sharechange.loc[sharechange['CBSAFP'] == 34980].reset_index(drop = True)

In [146]:
print("The following number is Nashville's rank in total population share change in a 5 year over 5 year period for total elderly population:")
print('______________________________________________________________________________')
print(nash['rankelderlysharechange.'])
print(nash['elderlysharechange'])
print("The following number is Nashville's rank in total population share change in a 5 year over 5 year period for total tax base population:")
print('______________________________________________________________________________')
print(nash['ranktbsharechange.'])
print(nash['tbsharechange'])
print("The following number is Nashville's rank in total population share change in a 5 year over 5 year period for total child population:")
print('______________________________________________________________________________')
print(nash['rankchildsharechange.'])
print(nash['childsharechange'])

The following number is Nashville's rank in total population share change in a 5 year over 5 year period for total elderly population:
______________________________________________________________________________
0    158
Name: rankelderlysharechange., dtype: int32
0    1.4
Name: elderlysharechange, dtype: float64
The following number is Nashville's rank in total population share change in a 5 year over 5 year period for total tax base population:
______________________________________________________________________________
0    892
Name: ranktbsharechange., dtype: int32
0    53492.867422
Name: tbsharechange, dtype: float64
The following number is Nashville's rank in total population share change in a 5 year over 5 year period for total child population:
______________________________________________________________________________
0    894
Name: rankchildsharechange., dtype: int32
0    4695.995252
Name: childsharechange, dtype: float64
