# Greater MSP 2019 Data Visualization Challenge

## Team:
* David Treiber - david.treiber@gmail.com
* Jeremy Brezovan - jeremy.brezovan@gmail.com
* Megan Menth - meganlmenth@gmail.com

## [Github repository](https://github.com/iamjeremybe/documents_to_share/tree/master/Greater%20MSP%20Challenge)

## Reformatting Greater MSP's data

For this visualization challenge, we were given an Excel workbook with multiple sheets. A significant part of the challenge was wrangling this workbook into a usable form.

We chose to reformat the data and export it as a CSV for use in Tableau.

Our team's first pass at parsing the workbook used Alteryx. We ultimately switched to Python for a cheaper, open-source solution.

### The format we chose has these advantages:
* The new format is consistent--data can easily be loaded into a database, or used as a data source by your visualization tool of choice.
* Including a new indicator, or year's worth of data, is as easy as appending those rows to the existing file in the same format.
* As indicators change over time, or data for earlier years is no longer needed, those rows can be dropped or filtered from the file with very little work.
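
To make the last two points concrete, here is a minimal sketch of appending a new year of data and filtering out older rows. The metro names and values are made up for illustration, and only a subset of the real columns is shown.

```python
import pandas as pd

# Hypothetical rows in the long format described above (values are made up).
existing = pd.DataFrame({
    "Category": ["Economy", "Economy"],
    "Indicator": ["Yearly Job Growth", "Yearly Job Growth"],
    "Metro": ["Metro A", "Metro B"],
    "Year": [2018, 2018],
    "Value": [0.021, 0.017],
})

# A new year's worth of data arrives in the same shape...
new_year = pd.DataFrame({
    "Category": ["Economy", "Economy"],
    "Indicator": ["Yearly Job Growth", "Yearly Job Growth"],
    "Metro": ["Metro A", "Metro B"],
    "Year": [2019, 2019],
    "Value": [0.019, 0.015],
})

# ...so adding it is a plain row-wise concatenation.
combined = pd.concat([existing, new_year], ignore_index=True)

# Dropping older years is a one-line filter.
recent_only = combined.loc[combined["Year"] >= 2019]
print(len(combined), len(recent_only))
```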

In [1]:
import numpy as np
import pandas as pd
import math

In [2]:
file_path = '/home/jeremy/Documents/Greater MSP Challenge'
file_name = file_path + '/Historical Data - Prior Dashboard - Color Guides/2015-2019 Dashboard Trends_all.xlsx'

In [3]:
dfs = pd.read_excel(file_name, sheet_name=None, header=None)

## Parsing the workbook

Here's a high-level look at how the Excel workbook was transformed into the CSV file we loaded into Tableau.

![Converting Excel to CSV](./Greater_MSP_fade.png)

Each sheet of the Excel workbook is represented by a key/value pair in dfs:

In [4]:
dfs.keys()

dict_keys(['Key Indicators', 'Economy', 'BV', 'Talent', 'Education', 'Infrastructure', 'Environment', 'Livability', 'Vital Stats'])
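
Because `dfs` is an ordinary dictionary of DataFrames, it can be inspected like any other dict. The sketch below uses a small made-up stand-in (`example_dfs`) rather than the real workbook, so it runs without the Excel file.

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None):
# keys are sheet names, values are DataFrames. The contents here are made up.
example_dfs = {
    "Key Indicators": pd.DataFrame([[1, 2], [3, 4]]),
    "Economy": pd.DataFrame([[5, 6, 7]]),
}

# Collect (rows, columns) for each sheet to get a quick feel for the workbook.
sheet_shapes = {name: sheet.shape for name, sheet in example_dfs.items()}
print(sheet_shapes)
```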

## Capturing Key Indicators from the first sheet

The important info to capture from the first sheet, Key Indicators, includes the name of the indicator and its category.

Stats for the Key Indicators on the first sheet are duplicated on the other sheets, so we don't need to capture them here.

The first sheet has an extra row, "Dashboard Category", that does not exist in the other sheets. This extra row tells us which sheet the indicators are coming from.

The remaining sheets represent the data behind each set of stats on the Key Indicators sheet, and additional indicators. They seem to have the same structure, so looping through them will be easier.

In [5]:
dfs['Key Indicators'].loc[[3,4]]

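| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Dashboard Category | Economy | | | | | Economy | | | | ... | Livability | | | | | Livability | | | | |
| 4 | Indicator Title | Yearly Job Growth | | | | | Employment Gap (White-Of Color) | | | | ... | Percent of Cost-Burdened Households | | | | | Annual Change in Median Apartment Rent (Overall) | | | | |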


In [6]:
kpis = dfs['Key Indicators'].loc[[3,4]].dropna(axis='columns')
kpis

| | 0 | 1 | 6 | 11 | 16 | 21 | 26 | 31 | 36 | 41 | 46 | 51 | 56 | 61 | 66 | 71 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Dashboard Category | Economy | Economy | Business Vitality | Business Vitality | Talent | Talent | Talent | Education | Education | Infrastructure | Infrastructure | Environment | Environment | Livability | Livability |
| 4 | Indicator Title | Yearly Job Growth | Employment Gap (White-Of Color) | New Establishments (State Level) | Establishments Surviving 5 Years (State Level) | Net Migration Of 25-34 Year Olds | Females Aged 16-64 Years Working | Population 25+ with Bachelor's Degree or Higher | 3-Year Graduation Rate at 2-Year Institutions ... | 6-Year Graduation Rate at 4-Year Institutions ... | Population Living Within 30 Minutes of 100,000... | Population With Commutes Less Than 30 Minutes | Energy Related Carbon Dioxide Emissions Per Ca... | Electricity Produced From Non-Carbon Sources | Percent of Cost-Burdened Households | Annual Change in Median Apartment Rent (Overall) |


### Create a dictionary for Key Performance Indicators
Capture the Dashboard Category and Indicator Title of each indicator. We'll use this later to set a value indicating whether or not an indicator is a KPI.

In [7]:
kpi_dict = {}
for col in kpis.columns:
    if col == 0:
        continue
    category = kpis[col].iloc[0]
    indicator = kpis[col].iloc[1]
    if category in kpi_dict:
        kpi_dict[category].append(indicator)
    else:
        kpi_dict[category] = [indicator]

kpi_dict

{'Economy': ['Yearly Job Growth', 'Employment Gap (White-Of Color)'],
 'Business Vitality': ['New Establishments (State Level)',
  'Establishments Surviving 5 Years (State Level)'],
 'Talent': ['Net Migration Of 25-34 Year Olds',
  'Females Aged 16-64 Years Working',
  "Population 25+ with Bachelor's Degree or Higher"],
 'Education': ['3-Year Graduation Rate at 2-Year Institutions (State-Level)',
  '6-Year Graduation Rate at 4-Year Institutions (State Level)'],
 'Infrastructure': ['Population Living Within 30 Minutes of 100,000 Jobs By Transit or Walking',
  'Population With Commutes Less Than 30 Minutes'],
 'Environment': ['Energy Related Carbon Dioxide Emissions Per Capita (State Level)',
  'Electricity Produced From Non-Carbon Sources'],
 'Livability': ['Percent of Cost-Burdened Households',
  'Annual Change in Median Apartment Rent (Overall)']}

## For the remaining sheets: capture the data.

### Value in first row/first column is the sheet name/category (ex: "Economy")

### First two rows are headers:
1. Indicator (ex: Gross Regional Product Growth). This is only populated when it changes, null otherwise.
2. "ESTIMATES". This row contains the year for each Indicator's data, mostly represented as '4-digit year\n(year of data)'. The original value was preserved (with whitespace normalized) to serve as a description field, Year_Desc, and a cleaned-up copy containing only the 4-digit year was stored separately as Year.

### First two rows give us the indicator and a year.

### Cities show up twice on each sheet--first time with the observation, second time with the rank.

A "RANK" row separates the two sections of the sheet. This row indicates whether the ranking is "highest to lowest", or "lowest to highest". This information was captured as well as the rank and value, though it was ultimately not used.
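
The split at the "RANK" row can be sketched on a toy sheet. The layout below mimics the structure described above, but the metro names, indicator name, and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy sheet: column 0 holds row labels, column 1 holds one year of data.
# Metro names and numbers are invented for illustration.
sheet = pd.DataFrame({
    0: ["Economy", "Indicator", "ESTIMATES", "Metro A", "Metro B",
        "RANK", "Metro A", "Metro B"],
    1: [None, "Some Indicator", 2019, 0.25, 0.40,
        "(highest to lowest)", 2, 1],
})

# Position of the first row whose label is exactly 'RANK'.
rank_index = np.where(sheet[0] == "RANK")[0][0]

values_half = sheet.iloc[:rank_index]   # headers plus observed values
rank_half = sheet.iloc[rank_index:]     # RANK row plus each metro's rank

# The RANK row itself carries the ordering description.
rank_order = rank_half.iloc[0, 1]
print(rank_index, rank_order)
```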

## Functions

### Cleanup, and creation of some calculated fields:

### The year field often has an appended text string, indicating the actual date ranges involved.

Preserve the existing Year to use as a descriptor, and create a new Year variable that is just the 4-digit year.

In [8]:
def cleanup_year(string,**kwargs):
    desc_or_year = kwargs.get('desc_or_year','year')
    if desc_or_year == 'year':
        try:
            # Typical case: "2015\n(using 13-14 data)" -> 2015
            return int(float(string.split()[0]))
        except Exception:
            # split() fails when the value is a lone year (ex: 2015 vs. "2015\n(using 13-14 data)")
            # because the lone year is read as a number, not a string.
            # The value may also be empty/null. A null converts to a float,
            # but that null float CANNOT be converted to an int, so return it unchanged.
            str_as_float = float(string)
            if np.isnan(str_as_float):
                return string
            else:
                return int(str_as_float)
    else:
        return " ".join(str(string).split())
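
For comparison, the same year extraction could be written with a regular expression. This is only an alternative sketch, not the code used to build the CSV; `extract_year` is a hypothetical helper name, and it returns `None` for nulls instead of passing the null through.

```python
import math
import re

def extract_year(raw):
    """Return the leading 4-digit year as an int, or None if there is none."""
    # Nulls read from Excel arrive as float NaN.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    # Lone years may arrive as numbers (2015 or 2015.0) rather than strings.
    if isinstance(raw, (int, float)):
        return int(raw)
    # Otherwise look for four digits at the start of the text.
    match = re.match(r"\s*(\d{4})", str(raw))
    return int(match.group(1)) if match else None

print(extract_year("2015\n(using 13-14 data)"))
```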

### Create a Key Indicator variable, and set it to 1 for key indicators.

In [9]:
def set_key_indc(category,indicator,**kwargs):
    my_kpi_dict = kwargs.get('kpi_dict',kpi_dict)
    if category in my_kpi_dict:
        if indicator in my_kpi_dict[category]:
            return 1
    return 0

### Replicate the Alteryx logic for deciding Data Type:

This was the code used in Alteryx to calculate Data Type. Its logic is fairly elegant, so we replicated it in the Python version.

```
if abs([Measure Value]) <=1 then 'Percent'
elseif Contains([Indicator Title], 'Cost') then 'Dollar'
elseif Contains([Indicator Title], 'Wage') then 'Dollar'
elseif Contains([Indicator Title], 'Income') then 'Dollar'
elseif Contains([Indicator Title], 'Price') then 'Dollar'
elseif [Dashboard Category] = 'Business Vitality' 
	AND !(Contains([Indicator Title], 'Patent')
	or Contains([Indicator Title], 'Establishment'))
	then 'Dollar'
else 'Numeric'
endif 
```

In [10]:
# This is where we implement Dave's logic from Alteryx.
def calculate_data_type(series):
    if series['Value'] <=1:
        return 'Percent'
    elif any(x in series['Indicator'].lower() for x in ['cost','wage','income','price']):
        return 'Dollar'
    elif (series['Category'] == 'Business Vitality') \
         and not any(x in series['Indicator'].lower() for x in ['patent','establishment']):
        return 'Dollar'
    else:
        return "Numeric"
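
To see how the rules play out, here is a standalone sketch of the same decision order that takes plain arguments instead of a dataframe row. `guess_data_type` is a hypothetical name, the indicator titles come from the workbook, and the sample values are illustrative only.

```python
def guess_data_type(value, indicator, category):
    """Apply the Percent / Dollar / Numeric rules in the same order as above."""
    title = indicator.lower()
    # Rule 1: anything within [-1, 1] is treated as a percentage.
    if abs(value) <= 1:
        return "Percent"
    # Rule 2: money-related words in the title mean dollars.
    if any(word in title for word in ("cost", "wage", "income", "price")):
        return "Dollar"
    # Rule 3: Business Vitality indicators are dollars unless they count
    # patents or establishments.
    if category == "Business Vitality" and not any(
        word in title for word in ("patent", "establishment")
    ):
        return "Dollar"
    return "Numeric"

print(guess_data_type(1149.0, "Average Weekly Wage", "Economy"))
```

Note that the order matters: "Jobs Paying a Family Sustaining Wage" contains "wage", but its values fall between -1 and 1, so the first rule classifies it as Percent before the title is ever checked.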

### Figure out the data type for each indicator.

A calculated variable, Data Type, was created to help us format the value for each indicator. We came up with three types:
* Percent
* Dollar
* Numeric

To assign a Data Type to each indicator, we:

* Build a small dataframe of unique combos of Category + Indicator, and the max Value for each combo. (I checked--max() ignores NaNs.)
* Convert that dataframe into a dictionary containing Indicator/Data Type pairs.

In [11]:
def build_data_type_df(df):
    this_category = df['Category'].unique()[0]
    df_cat = df.loc[df['Category'] == this_category]
# There is something funky about these two sheets--the .groupby() doesn't behave the same way.
# I found that if I drop category, it works better.
    if this_category in ['Environment','Livability']:
        max_df = pd.DataFrame()
        for this_indicator in df_cat['Indicator'].unique():
            indicator_data_df = df_cat.loc[df_cat['Indicator'] == this_indicator,['Indicator','Value']]
            append_df = indicator_data_df.groupby(['Indicator']).max().reset_index()
            # DataFrame.append() was removed in pandas 2.0; pd.concat() works in old and new versions.
            max_df = pd.concat([max_df,append_df],ignore_index=True)
# Add back the dropped Category column, so calculate_data_type() works.
        max_df['Category'] = [this_category] * max_df.shape[0]
    else:
        max_df = df.loc[:,['Category','Indicator','Value']].groupby(['Category','Indicator']).max().reset_index()
    max_df['Value'] = max_df['Value'].apply(lambda x: abs(x))
    max_df['Data_Type'] = max_df.apply(lambda row: calculate_data_type(row),axis='columns')

# Prep the dataframe so it can be easily transformed into a dictionary
    max_df.drop(columns=['Category','Value'],inplace=True)

    data_type_df = max_df.set_index(['Indicator']).to_dict('index')
# I couldn't find a parameter for .to_dict() that gave me exactly what I wanted--
# namely, a dict where key is Indicator, value is Data Type.
# So build a new dict that has this format, and extract what we need from data_type_df.
    return_dict = {}
    for this_key in data_type_df:
        return_dict[this_key] = data_type_df[this_key]['Data_Type']
    return return_dict
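
As a possible shortcut for the last step, selecting a single column before calling `.to_dict()` gives the Indicator-to-Data-Type mapping directly, because `Series.to_dict()` maps index labels to values. The sketch below uses a made-up stand-in for `max_df`.

```python
import pandas as pd

# Made-up stand-in for max_df after Data_Type has been calculated.
max_df = pd.DataFrame({
    "Indicator": ["Yearly Job Growth", "Average Weekly Wage"],
    "Data_Type": ["Percent", "Dollar"],
})

# Selecting one column yields a Series; Series.to_dict() maps index -> value.
indicator_types = max_df.set_index("Indicator")["Data_Type"].to_dict()
print(indicator_types)
```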

### Clean up Values.

Null values are okay. We'd prefer to work with numeric data for the rest--no strings, please.

In [12]:
def cleanup_values(x):
    if str(x) == 'no data':
        return np.nan
    elif math.isnan(x):
        return np.nan
    else:
        return float(x)
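
A vectorized alternative, sketched here on made-up values, is `pd.to_numeric` with `errors='coerce'`. It is not a drop-in replacement: it turns every unparseable string into a null, not just 'no data', which may or may not be what you want.

```python
import pandas as pd

# Made-up mix of the kinds of values seen in the sheets.
raw_values = pd.Series(["no data", 0.25, "1149", None])

# Anything that cannot be parsed as a number becomes NaN.
clean_values = pd.to_numeric(raw_values, errors="coerce")
print(clean_values)
```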

### Create a formatted value, based on data type

This calculated Formatted Value spared us from having to figure out how to display each individual indicator in Tableau.
* Dollar:
     * leading \$ sign
     * commas between thousands
     * padding to 2 decimal places if underlying data contains decimal

* Numeric:
    * commas between thousands
    * padding to 2 decimal places if underlying data contains decimal

* Percent:
    * padding to 2 decimal places
    * trailing % sign

* KEEP negative (-) signs 
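
The rules above map onto Python's format specifiers fairly directly. The sketch below is a simplified illustration, not the function used in this notebook: `format_value` is a hypothetical name, it assumes the value is already numeric and non-null, and it treats any type other than Percent or Dollar as Numeric.

```python
def format_value(value, data_type):
    """Format a numeric value according to its data type."""
    if data_type == "Percent":
        # Two decimal places and a trailing % sign; the sign is kept.
        return "{:.2%}".format(value)
    # Whole numbers skip the decimal padding; float.is_integer() detects them.
    if float(value).is_integer():
        number = "{:,.0f}".format(value)
    else:
        number = "{:,.2f}".format(value)
    return "$" + number if data_type == "Dollar" else number

print(format_value(0.1234, "Percent"))
print(format_value(1149.5, "Dollar"))
print(format_value(-23036, "Numeric"))
```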

In [13]:
def calculate_formatted_value(series):
# Capture nulls and empty strings straight away and return those values with no formatting.
    if pd.isnull(series['Value']) or series['Value'] == '':
        return series['Value']
    elif series['Data_Type'] == 'Percent':
        try:
            return "{:.2%}".format(series['Value'])
        except:
            print("This is supposed to be a Percent, but it's not:",series['Value'])
            return series['Value']

    elif series['Data_Type'] == 'Dollar':
        try:
            if str(abs(series['Value'])).isdigit():
                return "${:,}".format(series['Value'])
            else:
                return "${:,.2f}".format(series['Value'])
        except:
            print("This is supposed to be Dollar value, but it's not:",series['Value'])
            return series['Value']

    elif series['Data_Type'] == 'Numeric':
        try:
            if str(abs(series['Value'])).isdigit():
                return "{:,}".format(series['Value'])
            else:
                return "{:,.2f}".format(series['Value'])
        except:
            print("This is supposed to be Numeric, but it's not:",series['Value'])
            return series['Value']
    else:
# Thankfully didn't see any instances of an unknown Data Type, but we'll need a catch-all.
        print("UNKNOWN value type", series)
    return series['Value']

### Reshape a sheet of the workbook.

This is where the magic happens. Most of the other functions above are called from this one.

In [14]:
def reshape_indicator_sheet(df):
    out_df_columns = ['Category','Indicator','Metro','Year_Desc','Year',
                      'Value','Formatted_Value','Rank','Rank_Order','Key_Indicator','Data_Type']
    out_df = pd.DataFrame(columns=out_df_columns)

# Capture the index of the RANK row.
# This will help us split the sheet into its values vs. rank halves.
    rank_index = np.where(df[0] == 'RANK')[0][0]
    indicators_sheet = df.iloc[0:rank_index].copy()
    rank_sheet = df.iloc[rank_index:].copy()

# Loop through each city, snag its values and rank info, drop them into the correct columns
    for metro in indicators_sheet[0][3:]:
        city_df = pd.DataFrame(columns=out_df_columns)
        city_category = indicators_sheet.loc[[0]].values[0][0].strip()

# This will fill 'Category' with nulls, but it gives the dataframe the correct length.
# Then, fill it with the actual category name.
        city_df['Category'] = indicators_sheet.loc[[0]].values[0][1:]
        city_df['Category'] = city_df['Category'].fillna(value=city_category)

# Indicator has some values populated, so we can use 'pad' to copy them forward to null rows.
        city_df['Indicator'] = indicators_sheet.loc[[1]].values[0][1:]
        city_df['Indicator'] = city_df['Indicator'].ffill()
        city_df['Indicator'] = city_df['Indicator'].apply(lambda x: x.strip())
        
# If we find Indicator in the KPI dictionary under this Category, this is a Key Indicator
        city_df['Key_Indicator'] = city_df.apply(lambda row: set_key_indc(row['Category'],row['Indicator']),
                                                 axis='columns')

        city_df['Year_Desc'] = indicators_sheet.iloc[2].values[1:]
        city_df['Year_Desc'] = city_df['Year_Desc'].apply(lambda x: cleanup_year(x,desc_or_year='desc'))
        city_df['Year'] = city_df['Year_Desc'].apply(lambda x: cleanup_year(x))

        indc_metro_value = np.where(indicators_sheet[0] == metro)[0][0]
        city_df['Metro'] = [metro] * city_df.shape[0]
        city_values = indicators_sheet.iloc[indc_metro_value].values[1:]
        city_df['Value'] = ([cleanup_values(x) for x in city_values])

# Find the index for our current Metro in the rank half of the sheet, and get Rank-related values.
        rank_metro_value = np.where(rank_sheet[0] == metro)[0][0]
        city_df['Rank'] = rank_sheet.iloc[rank_metro_value].values[1:]
        city_df['Rank_Order'] = rank_sheet.iloc[[0]].values[0][1:]
        city_df['Rank_Order'] = city_df['Rank_Order'].ffill()

# Some sheets have null columns at the end, recognizable in our dataframe because Year is null.
# It's safe to drop these rows.
        city_df = city_df.loc[(city_df['Year'].astype('float') > 0)]

# We're done with this city. Append its stats to the big spreadsheet.
        out_df = pd.concat([out_df,city_df],ignore_index=True)

    data_type_dict = build_data_type_df(out_df)
    out_df.loc[:,'Data_Type'] = out_df['Indicator'].apply(lambda x: data_type_dict[x])
    out_df.loc[:,'Formatted_Value'] = out_df.apply(lambda row: calculate_formatted_value(row),axis='columns')
    return out_df

## Restructure the data for each sheet. Concatenate the restructured output from all of the sheets (minus the Key Indicator sheet).

### Each row of the data should have the following columns:
1. Category
2. Indicator
3. Metro area
4. Year description (as taken from sheet)
5. Year (integer value only)
6. Value
7. Formatted Value (based on Data Type)
8. Rank
9. Rank Order
10. Key Indicator
11. Data Type

In [15]:
# Reshape each sheet, then concatenate the results once at the end.
sheet_dfs = []
for sheet_key in dfs.keys():
    if sheet_key == 'Key Indicators':
        continue
    sheet_dfs.append(reshape_indicator_sheet(dfs[sheet_key]))
output_df = pd.concat(sheet_dfs,ignore_index=True)

## Exploring/validating the data

In [16]:
output_df.describe(include='all')

| | Category | Indicator | Metro | Year_Desc | Year | Value | Formatted_Value | Rank | Rank_Order | Key_Indicator | Data_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3624 | 3624 | 3624 | 3624 | 3624.0 | 2936.0 | 2936 | 2898.0 | 3204 | 3624.0 | 3624 |
| unique | 8 | 60 | 12 | 45 | 6.0 | | 2012 | 12.0 | 3 | 2.0 | 3 |
| top | Vital Stats | Unemployment Rate Annualized | Atlanta-Sandy Springs-Roswell, GA Metro Area | 2018 (2016 data) | 2019.0 | | 0.00% | 1.0 | (highest to lowest) | 0.0 | Percent |
| freq | 804 | 72 | 302 | 288 | 720.0 | | 21 | 261.0 | 2052 | 2724.0 | 2412 |
| mean | | | | | | 583524500.0 | | | | | |
| std | | | | | | 4398316000.0 | | | | | |
| min | | | | | | -23036.0 | | | | | |
| 25% | | | | | | 0.132 | | | | | |
| 50% | | | | | | 0.6305 | | | | | |
| 75% | | | | | | 279.4 | | | | | |


### Some things to note about the output from .describe() above

* "count" doesn't include null values, so the columns will have differing counts (the counts for Value/Formatted Value/Rank/Rank Order are lower than the others). Rows with null Values were not dropped.

* All columns besides Value are treated as strings, so only the Value column has stats like mean, standard deviation, min/max, etc. We can't gather much from the stats of the combined values across categories and indicators. It might be interesting to go back and find out what these stats look like for particular indicators within each category.

* Unique counts helped me determine whether the conversion I performed above was done correctly. 8 Categories, 12 Metros, 6 Years, 12 Ranks, etc. Looks good at a glance. We will dig into some of these variables below.

### Category and Indicator

How many rows do we have for each indicator?

In [17]:
output_df.groupby(['Category','Indicator'])['Indicator'].count()

Category           Indicator                                                                       
Business Vitality  Annual Amount of Venture Capital                                                    60
                   Establishments Surviving 5 Years (State Level)                                      60
                   Loans to Businesses With Under $1M In Revenue (000s)                                60
                   New Establishments (State Level)                                                    60
                   Patents Issued Per 1,000 Workers                                                    60
                   Value of Exports                                                                    60
Economy            Average Weekly Wage                                                                 60
                   Employment Gap (White-Of Color)                                                     60
                   Gross Regional Product Growth    

Two Indicators have more than 60 values:
* Vital Stats/Total Jobs
* Vital Stats/Unemployment Rate Annualized

These two Vital Stats indicators have an extra year of data, for 2014. So, no problem there!
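
A check like this can also be automated. The sketch below flags any Category/Indicator pair whose row count differs from the expected 12 metros * 5 years; it runs on a toy dataframe with made-up categories and indicators rather than on `output_df`.

```python
import pandas as pd

# Toy long-format frame with made-up indicators 'A' (60 rows) and 'B' (72 rows).
toy = pd.DataFrame({
    "Category": ["X"] * 60 + ["Y"] * 72,
    "Indicator": ["A"] * 60 + ["B"] * 72,
})

expected_rows = 12 * 5  # 12 metros * 5 years

# size() counts rows per group, including rows whose other columns are null.
row_counts = toy.groupby(["Category", "Indicator"]).size()
unexpected = row_counts[row_counts != expected_rows]
print(unexpected)
```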

### Metro area

In [18]:
output_df['Metro'].value_counts()

Atlanta-Sandy Springs-Roswell, GA Metro Area          302
Charlotte-Concord-Gastonia, NC-SC Metro Area          302
Minneapolis-St. Paul-Bloomington, MN-WI Metro Area    302
Boston-Cambridge-Newton, MA-NH Metro Area             302
Dallas-Fort Worth-Arlington, TX Metro Area            302
Denver-Aurora-Lakewood, CO Metro Area                 302
San Francisco-Oakland-Hayward, CA Metro Area          302
Chicago-Naperville-Elgin, IL-IN-WI Metro Area         302
Pittsburgh, PA Metro Area                             302
Austin-Round Rock, TX Metro Area                      302
Portland-Vancouver-Hillsboro, OR-WA Metro Area        302
Seattle-Tacoma-Bellevue, WA Metro Area                302
Name: Metro, dtype: int64

Our twelve metros show up the same number of times throughout. This is expected, since we didn't drop null values.

### Year/Year Description

In [19]:
output_df['Year_Desc'].value_counts()

2018 (2016 data)     288
2019 (2017 data)     288
2017 (2015 data)     276
2016 (2014 data)     276
2015 (2013 data)     252
2019 (2018 data)     240
2018 (2017 data)     240
2017 (2016 data)     228
2015 (2014 data)     204
2016 (2015 data)     204
2015                 132
2016.0               108
2017.0                84
2017 (2015 Data)      48
2018.0                48
2018 (2016 Data)      48
2016 (2014 Data)      48
2015 (2013 Data)      48
2019.0                48
2019 (2017 Data)      48
2017 (14-15 data)     36
2019 (16-17 data)     36
2015 (12-13 data)     36
2018 (15-16 data)     36
2016 (13-14 data)     36
2017 (2014 data)      24
2018 (2015 data)      24
2016 (2013 data)      24
2015 (2012 data)      24
2019 (2016 data)      12
2017 (2010 data)      12
2019 (2013-2018)      12
2014 (2013 data)      12
2019 (2015 data)      12
2016 (2010 data)      12
2018 (2012-2017)      12
2015 (2009-2014)      12
2019 (no update)      12
2016 (2010-2015)      12
2014 (2012 data)      12


Values for Year Description are all over the place (really wish those decimals weren't there), but as a description field, this is fine.

The cleaned-up Year allows us to more easily identify the correct year to use when comparing metro areas:

In [20]:
output_df['Year'].value_counts()

2019    720
2018    720
2017    720
2016    720
2015    720
2014     24
Name: Year, dtype: int64

### Data Type

We're looking at this variable out of order, but let's talk about Data Type first, then Value and Formatted Value. It'll make more sense that way.

Ideally the values in the workbook would have units of comparison associated with them up front, and we could use that information to more accurately display the data in the dashboard.

Our approach--use the values to take an educated guess at values' types...with some hints. :) Our team originally calculated this field in Alteryx, then ported that logic to Python when we settled on Python for our data wrangling. It works pretty well!

If new indicators are added to the dataset, and there's no effort to associate the units of comparison to them and to the existing indicators, this "educated guess" approach should be safe to use for them as well, provided percentages continue to be stored as fractions between -1 and 1.

In [21]:
output_df['Data_Type'].value_counts()

Percent    2412
Numeric     792
Dollar      420
Name: Data_Type, dtype: int64

### Value/Formatted Value

Categorizing Value by Data Type in our validation helps us better understand whether our values make sense.

For instance, valid percentages should be between -1 and 1.

Values include tax rates as well as percentages of metro-area population.
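
The range check for percentages can be expressed with `Series.between`. This is a minimal sketch on a toy dataframe with invented values, one of which is deliberately out of range.

```python
import pandas as pd

# Toy rows; the 1.7 is a deliberately invalid "percent".
toy = pd.DataFrame({
    "Data_Type": ["Percent", "Percent", "Percent", "Dollar"],
    "Value": [0.315, -0.026, 1.7, 1149.0],
})

# Drop nulls first so they are not reported as out of range.
percent_values = toy.loc[toy["Data_Type"] == "Percent", "Value"].dropna()

# between() is inclusive on both ends by default.
out_of_range = percent_values[~percent_values.between(-1, 1)]
print(out_of_range)
```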

In [22]:
output_df.loc[output_df['Data_Type'] == 'Percent',
              ['Category','Indicator']].groupby(['Category','Indicator']).count()

| Category | Indicator |
|---|---|
| Business Vitality | Establishments Surviving 5 Years (State Level) |
| Economy | Employment Gap (White-Of Color) |
| Economy | Gross Regional Product Growth |
| Economy | Jobs Paying a Family Sustaining Wage |
| Economy | Wage Gap (White-Of Color) |
| Economy | Yearly Job Growth |
| Education | 3-Year Graduation Rate at 2-Year Institutions (State-Level) |
| Education | 3rd Grade Students Achieving Reading Standards - OF COLOR |
| Education | 3rd Grade Students Achieving Reading Standards - White |
| Education | 6-Year Graduation Rate at 4-Year Institutions (State Level) |


In [23]:
output_df.loc[output_df['Data_Type'] == 'Percent','Value'].describe()

count    1760.000000
mean        0.315576
std         0.260224
min        -0.026000
25%         0.055925
50%         0.287000
75%         0.556000
max         0.880000
Name: Value, dtype: float64

Dollar amounts are larger--with quite a range, as we are talking about values as small as cents/kWh and as large as the annual amount of venture capital.

In [24]:
output_df.loc[output_df['Data_Type'] == 'Dollar',
              ['Category','Indicator']].groupby(['Category','Indicator']).count()

| Category | Indicator |
|---|---|
| Business Vitality | Annual Amount of Venture Capital |
| Business Vitality | Loans to Businesses With Under $1M In Revenue (000s) |
| Business Vitality | Value of Exports |
| Economy | Average Weekly Wage |
| Environment | Electricity Cost (Average Industrial) (Cents/kWh; Primary Metro Utility) |
| Livability | Median Home Purchase Price |
| Vital Stats | Median Household Income |


In [25]:
output_df.loc[output_df['Data_Type'] == 'Dollar','Value'].describe()

count    3.930000e+02
mean     4.358284e+09
std      1.132914e+10
min      5.260000e+00
25%      1.149000e+03
50%      2.958000e+05
75%      9.470000e+08
max      6.915400e+10
Name: Value, dtype: float64

Remaining values are simply numeric--they're not dollar amounts or percentages, and have varying units of comparison.

In [26]:
output_df.loc[output_df['Data_Type'] == 'Numeric',['Category','Indicator']].groupby(['Category','Indicator']).count()

| Category | Indicator |
|---|---|
| Business Vitality | New Establishments (State Level) |
| Business Vitality | Patents Issued Per 1,000 Workers |
| Environment | Energy Related Carbon Dioxide Emissions Per Capita (State Level) |
| Environment | Number of Days That Air Quality Was "Unhealthy For Sensitive Groups" (Days/Year) |
| Environment | Per Capita Water Usage (Gal/Day) |
| Infrastructure | Annual Hours of Delay Per Commuter |
| Infrastructure | Number of Direct Routes Out of Home Airport |
| Livability | Number of Violent Crimes Committed Per 100k Residents |
| Talent | Net Migration Of 25-34 Year Olds |
| Vital Stats | Gross Regional Product (Millions) |


In [27]:
output_df.loc[output_df['Data_Type'] == 'Numeric','Value'].describe()

count    7.830000e+02
mean     5.391210e+05
std      1.385281e+06
min     -2.303600e+04
25%      3.900000e+01
50%      3.053000e+02
75%      1.140151e+05
max      9.553810e+06
Name: Value, dtype: float64

Formatted Value is simply Value, with some formatting applied based on Data Type.

### Rank/Rank Order

The ranges of these two variables looked correct at a glance (12 unique values for Rank, 3 unique values for Rank Order). Let's look at these two in just a little more detail, to make sure the values themselves make sense.

In [28]:
output_df['Rank'].value_counts().sort_index()

1     261
2     247
3     229
4     249
5     251
6     238
7     235
8     245
9     240
10    249
11    234
12    220
Name: Rank, dtype: int64

There isn't an even count for each rank, and that's probably fair--cities sometimes tie for a rank, in which case that rank shows up twice, and the rank below the shared one is dropped.
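
The tie behavior described here matches what pandas calls the `'min'` ranking method. We don't know how the workbook's ranks were actually computed, so the sketch below only illustrates the pattern, using invented metros and values.

```python
import pandas as pd

# Invented values; Metro B and Metro C tie for the top spot.
scores = pd.Series(
    [0.40, 0.55, 0.55, 0.10],
    index=["Metro A", "Metro B", "Metro C", "Metro D"],
)

# method='min' gives tied values the same (best) rank and skips the next one.
ranks = scores.rank(method="min", ascending=False).astype(int)
print(ranks)
```

Rank 1 appears twice and rank 2 never appears, which is why the per-rank counts are uneven.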

In [29]:
output_df['Rank_Order'].value_counts()

(highest to lowest)      2052
(lowest to highest)      1092
(largest to smallest)      60
Name: Rank_Order, dtype: int64

In [30]:
output_df.loc[output_df['Rank_Order'] == '(largest to smallest)']

| | Category | Indicator | Metro | Year_Desc | Year | Value | Formatted_Value | Rank | Rank_Order | Key_Indicator | Data_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2872 | Vital Stats | Population | Atlanta-Sandy Springs-Roswell, GA Metro Area | 2015 (2013 data) | 2015 | 5524693.0 | 5524693.0 | 3 | (largest to smallest) | 0 | Numeric |
| 2873 | Vital Stats | Population | Atlanta-Sandy Springs-Roswell, GA Metro Area | 2016 (2014 data) | 2016 | 5611829.0 | 5611829.0 | 3 | (largest to smallest) | 0 | Numeric |
| 2874 | Vital Stats | Population | Atlanta-Sandy Springs-Roswell, GA Metro Area | 2017 (2015 data) | 2017 | 5709731.0 | 5709731.0 | 3 | (largest to smallest) | 0 | Numeric |
| 2875 | Vital Stats | Population | Atlanta-Sandy Springs-Roswell, GA Metro Area | 2018 (2016 data) | 2018 | 5790210.0 | 5790210.0 | 3 | (largest to smallest) | 0 | Numeric |
| 2876 | Vital Stats | Population | Atlanta-Sandy Springs-Roswell, GA Metro Area | 2019 (2017 data) | 2019 | 5882450.0 | 5882450.0 | 3 | (largest to smallest) | 0 | Numeric |
| 2939 | Vital Stats | Population | Austin-Round Rock, TX Metro Area | 2015 (2013 data) | 2015 | 1883051.0 | 1883051.0 | 12 | (largest to smallest) | 0 | Numeric |
| 2940 | Vital Stats | Population | Austin-Round Rock, TX Metro Area | 2016 (2014 data) | 2016 | 1943299.0 | 1943299.0 | 12 | (largest to smallest) | 0 | Numeric |
| 2941 | Vital Stats | Population | Austin-Round Rock, TX Metro Area | 2017 (2015 data) | 2017 | 2000860.0 | 2000860.0 | 12 | (largest to smallest) | 0 | Numeric |
| 2942 | Vital Stats | Population | Austin-Round Rock, TX Metro Area | 2018 (2016 data) | 2018 | 2056405.0 | 2056405.0 | 12 | (largest to smallest) | 0 | Numeric |
| 2943 | Vital Stats | Population | Austin-Round Rock, TX Metro Area | 2019 (2017 data) | 2019 | 2115827.0 | 2115827.0 | 12 | (largest to smallest) | 0 | Numeric |


Population, from the Vital Stats sheet of the workbook, uses a unique rank order. We are not using this variable in Tableau. It would likely have been useful only as a descriptor, explaining the rankings for a given indicator.

### Key Indicator flag

This calculated variable was set based on the Key Indicators pulled from the first sheet of the workbook, so it should only ever have two values (0 or 1).

__There are some things we can check here:__
1. The Key Indicators have a full complement of data (12 cities * 5 years = 60 rows). Some of the actual values can be null for these city/year combinations--that's not ideal, but it's acceptable.
2. Since I mention it, let's see whether any of these indicators have null values associated with them. Maybe this indicator is not really "key" if we don't have data for all of the metro areas.
3. We noticed that the key indicators differed slightly between the printed pamphlet and the Excel workbook. Let's list the differences. We can perform that same null check (I say this already knowing the list of differences is short).

In [31]:
# The count of each key indicator:
output_df.loc[output_df['Key_Indicator'] == 1,
              ['Category','Indicator']].groupby(['Category','Indicator'])['Indicator'].count()

Category           Indicator                                                                
Business Vitality  Establishments Surviving 5 Years (State Level)                               60
                   New Establishments (State Level)                                             60
Economy            Employment Gap (White-Of Color)                                              60
                   Yearly Job Growth                                                            60
Education          3-Year Graduation Rate at 2-Year Institutions (State-Level)                  60
                   6-Year Graduation Rate at 4-Year Institutions (State Level)                  60
Environment        Electricity Produced From Non-Carbon Sources                                 60
                   Energy Related Carbon Dioxide Emissions Per Capita (State Level)             60
Infrastructure     Population Living Within 30 Minutes of 100,000 Jobs By Transit or Walking    60
                

In [32]:
# Key Indicators with null values
output_df.loc[(output_df['Key_Indicator'] == 1) &
              (output_df['Value'].isnull()),['Category','Indicator','Metro','Year','Value']]

| | Category | Indicator | Metro | Year | Value |
|---|---|---|---|---|---|
| 1645 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Atlanta-Sandy Springs-Roswell, GA Metro Area | 2015 | |
| 1680 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Austin-Round Rock, TX Metro Area | 2015 | |
| 1715 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Boston-Cambridge-Newton, MA-NH Metro Area | 2015 | |
| 1750 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Charlotte-Concord-Gastonia, NC-SC Metro Area | 2015 | |
| 1785 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Chicago-Naperville-Elgin, IL-IN-WI Metro Area | 2015 | |
| 1820 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Dallas-Fort Worth-Arlington, TX Metro Area | 2015 | |
| 1855 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Denver-Aurora-Lakewood, CO Metro Area | 2015 | |
| 1890 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Minneapolis-St. Paul-Bloomington, MN-WI Metro ... | 2015 | |
| 1925 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Pittsburgh, PA Metro Area | 2015 | |
| 1960 | Infrastructure | Population Living Within 30 Minutes of 100,000... | Portland-Vancouver-Hillsboro, OR-WA Metro Area | 2015 | |


#### These are listed as Key Indicators in the pamphlet, but not in the Excel workbook:
* __Economy:__ 'Wage Gap (White-Of Color)' (percentage)
    * _workbook has: 'Yearly Job Growth'_


* __Talent:__ 'Population 25+ with Associate's Degree or Higher' (percentage)
    * _workbook has two additional key indicators: 'Net Migration Of 25-34 Year Olds', 'Females Aged 16-64 Years Working'_


* __Environment:__ 'Electricity Cost', cents/kWh
    * _workbook has: 'Energy Related Carbon Dioxide Emissions Per Capita (State Level)'_


In [33]:
pamphlet_kpis = ['Wage Gap (White-Of Color)',
                 "Population 25+ with Associate's Degree or Higher",
                 'Electricity Cost (Average Industrial) (Cents/kWh; Primary Metro Utility)']

In [34]:
# The count of each key indicator:
output_df.loc[(output_df['Indicator'].isin(pamphlet_kpis)),
              ['Category','Indicator']].groupby(['Category','Indicator'])['Indicator'].count()

Category     Indicator                                                               
Economy      Wage Gap (White-Of Color)                                                   60
Environment  Electricity Cost (Average Industrial) (Cents/kWh; Primary Metro Utility)    60
Talent       Population 25+ with Associate's Degree or Higher                            60
Name: Indicator, dtype: int64

In [35]:
# Key Indicators from the pamphlet. Any null values?
output_df.loc[(output_df['Indicator'].isin(pamphlet_kpis)) & (output_df['Value'].isnull()),
              ['Category','Indicator','Metro','Year','Value']]

| | Category | Indicator | Metro | Year | Value |
|---|---|---|---|---|---|


## Looks good! Write a CSV file.

In [36]:
output_df.to_csv(file_path + '/greater_msp_data.csv',index=False)