This notebook was designed to work with [Google Colab](https://colab.research.google.com/github/lokdoesdata/syracuse-assorted/blob/main/ist_652/project/lok_ngan_final_project.ipynb).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lokdoesdata/syracuse-assorted/blob/main/ist_652/project/lok_ngan_final_project.ipynb)

# IST 652 - Final Project
Lok Ngan

Due: June 11, 2021

-------------
This project analyzes and visualizes population changes in the United States. The primary dataset is the [annual residential population](https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv) estimates published by the United States Census Bureau. The analysis is conducted at the county level to understand:

1. Population changes for each county between 2010 and 2020.
2. Drivers for population changes for those counties.
3. Domestic migration within the United States between 2010 and 2020.

The data analysis will be supported with data visualization using Plotly.

## Set Up

### Install Geopandas on Google Colab

In [None]:
%pip install geopandas

### Import libraries

`Pandas`, `GeoPandas`, and `Plotly` were the primary libraries used for the analysis, with `numpy` supporting the calculations.

* `Pandas` is a data manipulation and analysis tool.
* `numpy` is a library for vectorized calculation.
* `GeoPandas` is similar to `Pandas`, but created for geospatial analysis.
* `Plotly` is a library used to create interactive visualizations.

The other library used was `pathlib`.

* `pathlib` is a filesystem library used for I/O.

In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd
import plotly.express as px
import plotly.graph_objects as go

from pathlib import Path

### I/O

In [None]:
DATA_PATH = Path.cwd().joinpath('data')
DATA_PATH.mkdir(exist_ok=True, parents=True)

OUTPUT_PATH = Path.cwd().joinpath('output')
OUTPUT_PATH.mkdir(exist_ok=True, parents=True)


## Data

### Population Change by County

The primary dataset used in this analysis is the [annual residential population](https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv) published by the United States Census Bureau. It contains population, population change, and estimated components of population change by county and state from April 1, 2010 to July 1, 2020.

This dataset has 3,193 unique rows across 179 columns.  Some of the key columns are highlighted below:

| Field                                 | Description                               | Purpose                               | Data Type |
| :------------------------------------ | :---------------------------------------- | :------------------------------------ | :-------: |
| SUMLEV                                | Geographical summary level                | Used to identify state versus county  | Numerical |
| STATE                                 | State FIPS code                           | FIPS for the state, used for visual   | Numerical |
| COUNTY                                | County FIPS code                          | FIPS for the county, used for visual  | Numerical |
| STNAME                                | Name of the state                         | Used to identify the state by name    | String    |
| CTYNAME                               | Name of the county                        | Used to identify the county by name   | String    |
| CENSUS2010POP                         | Residential population from 2010 Census   | Baseline for population               | Numerical |
| POPESTIMATE2010 (through 2020)        | Estimated total residential population    | Estimated population by year          | Numerical |
| BIRTHS2010 (through 2020)             | Births                                    | Births by year                        | Numerical |
| DEATHS2010 (through 2020)             | Deaths                                    | Deaths by year                        | Numerical |
| INTERNATIONALMIG2010 (through 2020)   | Net international migration               | Net international migration by year   | Numerical |
| DOMESTICMIG2010 (through 2020)        | Net domestic migration                    | Net domestic migration by year        | Numerical |
| NETMIG2010 (through 2020)             | Net migration                             | Net migration by year                 | Numerical |
| RESIDUAL2010 (through 2020)           | Residual                                  | Residual by year                      | Numerical |



In [None]:
if not DATA_PATH.joinpath('co-est2020-alldata.csv').is_file():
    df_census = pd.read_csv(
        r'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv',
        encoding='latin-1')
    df_census.to_csv(DATA_PATH.joinpath('co-est2020-alldata.csv'), index=False)
else:
    df_census = pd.read_csv(DATA_PATH.joinpath('co-est2020-alldata.csv'))
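
The download-or-read-from-cache logic above recurs later for the GeoJSON file, so it could be factored into a small helper. The following is only a sketch of one way to do that for CSV files: `read_csv_cached` is a hypothetical name that does not exist in this notebook, and it assumes the cached copy was written by pandas (so the default encoding is fine when reading it back).

```python
from pathlib import Path

import pandas as pd


def read_csv_cached(source, cache_path, **read_kwargs):
    """Hypothetical helper: return the cached CSV if present, else read `source` and cache it."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        # The cache was written by DataFrame.to_csv, so no encoding override is needed here.
        return pd.read_csv(cache_path)
    # read_kwargs (for example encoding='latin-1') apply only to the original source.
    df = pd.read_csv(source, **read_kwargs)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df
```

With such a helper, the cell above would reduce to a single call that passes the Census URL, `DATA_PATH.joinpath('co-est2020-alldata.csv')`, and `encoding='latin-1'`.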

#### Quick Examination

In [None]:
df_census.head(3)

#### Data Cleaning

The population change dataset is very large, and a good portion of the data will not be used.

##### SUMLEV

SUMLEV is the geographical summary level. In this dataset, it indicates whether a row describes a state or a county. Because the analysis is at the county level, the data will be filtered to county rows only (SUMLEV equal to 50), and SUMLEV will then be dropped.

In [None]:
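# Keep county rows only; copy after filtering so later in-place edits act on an independent frame
df_census = df_census[df_census['SUMLEV'] == 50].copy()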
df_census.drop(['SUMLEV'], axis=1, inplace=True)

##### FIPS Codes

The state and county FIPS codes are useful for visualization because most geospatial datasets use FIPS codes as identifiers. In the GeoJSON dataset that will be used for visualization, the FIPS code is formatted as XXYYY, where XX is the two-digit state FIPS code and YYY is the three-digit county FIPS code. The FIPS codes in the population change dataset will be adjusted to follow the same format.

In [None]:
df_census.insert(0, 'id', df_census['STATE'].astype(str).str.zfill(2) + df_census['COUNTY'].astype(str).str.zfill(3))
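
As a small illustration of the padding step, the sketch below builds the five-character identifier for a few toy rows. The DataFrame here is made up for the example and only assumes that the codes are stored as integers, as they are once the CSV is read.

```python
import pandas as pd

# Toy rows: state and county FIPS codes stored as integers.
df = pd.DataFrame({
    'STATE': [1, 6, 48],
    'COUNTY': [1, 37, 301],
})

# Left-pad the state code to two digits and the county code to three, then concatenate.
df['id'] = df['STATE'].astype(str).str.zfill(2) + df['COUNTY'].astype(str).str.zfill(3)

print(df['id'].tolist())  # ['01001', '06037', '48301']
```

Without the padding, state 1 and county 1 would produce `'11'`, which would not match any `id` in the GeoJSON file.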

##### Remove any other columns not used

In [None]:
dict_col_drop = {
    'REGION': None,
    'DIVISION': None,
    'STATE': None,
    'COUNTY': None,
    'STNAME': None,
    'CTYNAME': None,
    'NPOPCHG_': (2010, 2020),
    'NATURALINC': (2010, 2020),
    'NETMIG': (2010, 2020),
    'GQESTIMATESBASE': (2010, 2010),
    'GQESTIMATES': (2010, 2020),
    'RBIRTH': (2011, 2020),
    'RDEATH': (2011, 2020),
    'RNATURALINC': (2011, 2020),
    'RINTERNATIONALMIG': (2011, 2020),
    'RDOMESTICMIG': (2011, 2020),
    'RNETMIG': (2011, 2020)
}

cols_to_remove = []

for k, v in dict_col_drop.items():
    if v is None:
        cols_to_remove.append(k)
    else:
        for y in range(v[0], v[1]+1):
            cols_to_remove.append(f'{k}{y}')

df_census.drop(
    cols_to_remove, axis=1, inplace=True)
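
The expansion of prefixes into yearly column names can also be written as a reusable function. This is a sketch rather than the notebook's own code: `expand_columns` is a hypothetical name, and passing `errors='ignore'` to `drop` is an optional choice that skips names missing from the frame instead of raising.

```python
import pandas as pd


def expand_columns(spec):
    """Expand {name: None | (first_year, last_year)} into a flat list of column names."""
    cols = []
    for prefix, years in spec.items():
        if years is None:
            # A plain column name with no year suffix.
            cols.append(prefix)
        else:
            first, last = years
            cols.extend(f'{prefix}{year}' for year in range(first, last + 1))
    return cols


example = expand_columns({'REGION': None, 'RBIRTH': (2011, 2013)})
print(example)  # ['REGION', 'RBIRTH2011', 'RBIRTH2012', 'RBIRTH2013']

# Toy frame with only some of the listed columns present.
df = pd.DataFrame(columns=['id', 'REGION', 'RBIRTH2011', 'POPESTIMATE2010'])
df = df.drop(columns=example, errors='ignore')
print(list(df.columns))  # ['id', 'POPESTIMATE2010']
```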

### GeoJSON file of United States County

In [None]:
if not DATA_PATH.joinpath('geojson-counties-fips.json').is_file():
    gdf_county = gpd.read_file(r'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json')
    gdf_county.to_file(DATA_PATH.joinpath('geojson-counties-fips.json'), driver='GeoJSON')
else:
    gdf_county = gpd.read_file(DATA_PATH.joinpath('geojson-counties-fips.json'))

In [None]:
gdf_county.rename({'NAME': 'Name'}, axis=1, inplace=True)

#### Remove unused columns

In [None]:
gdf_county = gdf_county[['id', 'Name', 'geometry']]

### Combining datasets

In [None]:
df_census = gdf_county.merge(df_census, how='left', on='id')

In [None]:
df_census.shape

The resulting GeoDataFrame has 3,221 rows across 71 columns.
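
Because this is a left join on the geometry table, every GeoJSON feature is kept even when the census table has no matching `id`; those rows simply carry missing values. One way to check for such rows is the `indicator` option of `merge`. The sketch below uses two made-up tables (the names and population values are placeholders, not real data) to show the idea.

```python
import pandas as pd

# Toy stand-ins for the geometry table (left) and the census table (right).
left = pd.DataFrame({
    'id': ['01001', '06037', '72001'],
    'Name': ['Autauga', 'Los Angeles', 'Adjuntas'],
})
right = pd.DataFrame({
    'id': ['01001', '06037'],
    'POPESTIMATE2020': [100, 200],  # placeholder values
})

# indicator=True adds a '_merge' column recording where each row came from.
merged = left.merge(right, how='left', on='id', indicator=True)

# 'left_only' rows exist in the geometry table but have no census record.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
print(unmatched)  # ['72001']
```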

## Analysis

### Population changes for each county between 2010 and 2020

The population change between 2010 and 2020 will be approximated using the population estimates for 2010 and 2020.

In [None]:
df_pop_change = df_census.copy()[['id', 'Name', 'geometry', 'POPESTIMATE2010', 'POPESTIMATE2020']]
df_pop_change['POP_CHANGE'] = df_pop_change['POPESTIMATE2020'] - df_pop_change['POPESTIMATE2010']
df_pop_change['POP_CHANGE_PERCENT'] = round(df_pop_change['POP_CHANGE']*100/df_pop_change['POPESTIMATE2010'], 3)
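
To make the arithmetic concrete, the sketch below applies the same two formulas to a pair of made-up counties: the change is the 2020 estimate minus the 2010 estimate, and the percent change divides that difference by the 2010 estimate.

```python
import pandas as pd

# Toy population estimates (placeholder values, not real counties).
df = pd.DataFrame({
    'POPESTIMATE2010': [1000, 2000],
    'POPESTIMATE2020': [1250, 1500],
})

df['POP_CHANGE'] = df['POPESTIMATE2020'] - df['POPESTIMATE2010']
df['POP_CHANGE_PERCENT'] = (df['POP_CHANGE'] * 100 / df['POPESTIMATE2010']).round(3)

print(df['POP_CHANGE'].tolist())          # [250, -500]
print(df['POP_CHANGE_PERCENT'].tolist())  # [25.0, -25.0]
```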

In [None]:
df_pop_change.fillna(0, inplace=True)

In [None]:
df_pop_change.sort_values('POP_CHANGE_PERCENT', ascending=False).head(5)

In [None]:
df_pop_change.sort_values('POP_CHANGE_PERCENT').head(5)

The five counties with the largest percentage increase in population are:

1. McKenzie County, North Dakota (138%)
2. Loving County, Texas (115%)
3. Williams County, North Dakota (71%)
4. Hays County, Texas (53%)
5. Wasatch County, Utah (51%)

The five counties with the largest percentage decrease in population are:

1. Alexander County, Illinois (33%)
2. Concho County, Texas (31%)
3. Terrell County, Texas (30%)
3. Terrell County, Texas (30%)
4. McDowell County, West Virginia (23%)
5. Morton County, Kansas (22%)

In [None]:
percent_change_labels = [
    '-10% or lower', 
    '-9.9% to -2.6%', 
    '-2.5% to 2.5%', 
    '2.6% to 9.9%', 
    '10% or higher'
]

alpha = 0.9
percent_change_color_list = [
    f'rgba(94,60,153,{alpha})', 
    f'rgba(178,171,210,{alpha})',
    f'rgba(247,247,247,{alpha})', 
    f'rgba(253,184,99,{alpha})', 
    f'rgba(230,97,1,{alpha})'
]

In [None]:
df_pop_change['POPCHANGE_BIN'] = pd.cut(
    df_pop_change['POP_CHANGE_PERCENT'], 
    bins=[-np.inf, -10, -2.5001, 2.5, 9.9999, np.inf],
    labels=percent_change_labels,
    right=True,
    include_lowest=False)
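
The bin edges are easier to follow with a few boundary values. The sketch below reuses the same edges and labels on a made-up series; with `right=True` each interval includes its upper edge, so a value of exactly -10 falls in the lowest bin and a value of exactly 2.5 stays in the middle bin.

```python
import numpy as np
import pandas as pd

labels = ['-10% or lower', '-9.9% to -2.6%', '-2.5% to 2.5%', '2.6% to 9.9%', '10% or higher']
edges = [-np.inf, -10, -2.5001, 2.5, 9.9999, np.inf]

# Sample percent changes chosen to sit on or near the bin edges.
sample = pd.Series([-12.0, -10.0, -2.5, 0.0, 2.5, 9.9, 10.0])
binned = pd.cut(sample, bins=edges, labels=labels, right=True)

for value, label in zip(sample, binned.astype(str)):
    print(value, '->', label)
```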

In [None]:
percent_change_color_dict = {k:v for (k, v) in zip(percent_change_labels, percent_change_color_list)}

A map is better suited to answering this question than a ranked list. The map suggests an outward shift of population: many counties in the center of the US lost population, while coastal areas gained population.



In [None]:
fig = px.choropleth(
    df_pop_change,
    geojson=df_pop_change.geometry,
    locations=df_pop_change.index,
    color='POPCHANGE_BIN',
    scope='usa',
    color_discrete_map=percent_change_color_dict,
    category_orders={'POPCHANGE_BIN': percent_change_labels},
    title='Percent Population Change by County Between 2010 and 2020',
    hover_name='Name',
    hover_data=['POPESTIMATE2010','POPESTIMATE2020','POP_CHANGE','POP_CHANGE_PERCENT'],
    labels={
        'POPESTIMATE2010': '2010 Population Estimate',
        'POPESTIMATE2020': '2020 Population Estimate',
        'POP_CHANGE': 'Population Change',
        'POP_CHANGE_PERCENT': 'Population Change Percentage',
        'POPCHANGE_BIN': 'Population Change Bin'
    }
)

fig.update_layout(
    legend_title_text='',
    margin=dict(
        r=0, l=0, t=75, b=0
    ),
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=0.96,
        xanchor='center',
        x=0.5
    )
)

### Drivers for population changes for those counties.

Births and deaths are the primary drivers for many counties. Florida's coastal area appears to take in a large number of domestic migrants.

In [None]:
df_driver = df_census.copy()

In [None]:
df_driver['Births'] = abs(df_driver[[col for col in df_driver.columns if col.startswith('BIRTHS')]].sum(axis=1))
df_driver['Deaths'] = abs(df_driver[[col for col in df_driver.columns if col.startswith('DEATHS')]].sum(axis=1))
df_driver['International Migration'] = abs(df_driver[[col for col in df_driver.columns if col.startswith('INTERNATIONALMIG')]].sum(axis=1))
df_driver['Domestic Migration'] = abs(df_driver[[col for col in df_driver.columns if col.startswith('DOMESTICMIG')]].sum(axis=1))

In [None]:
df_driver = df_driver[['id', 'Name', 'geometry', 'Births', 'Deaths', 'International Migration', 'Domestic Migration']].copy()

In [None]:
df_driver['KEY_DRIVER'] = df_driver[['Births', 'Deaths', 'International Migration', 'Domestic Migration']].idxmax(axis=1)

In [None]:
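# Assign the result back; in-place fillna on a selected column may not update the frame in newer pandas
df_driver['KEY_DRIVER'] = df_driver['KEY_DRIVER'].fillna('Unknown')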
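
The sketch below shows how `idxmax(axis=1)` picks the key driver on a made-up table. One behavior worth keeping in mind, shown in the last row: when several columns tie for the maximum, the first column label is returned, so a row whose totals are all zero is labelled `'Births'` rather than left missing.

```python
import pandas as pd

cols = ['Births', 'Deaths', 'International Migration', 'Domestic Migration']

# Toy absolute totals per county (placeholder values).
df = pd.DataFrame(
    [
        [500, 300, 20, 50],   # births dominate
        [100, 400, 5, 900],   # domestic migration dominates
        [0, 0, 0, 0],         # all components tie at zero
    ],
    columns=cols,
)

# For each row, return the label of the column holding the largest value.
df['KEY_DRIVER'] = df[cols].idxmax(axis=1)
print(df['KEY_DRIVER'].tolist())  # ['Births', 'Domestic Migration', 'Births']
```

Since summing a row of missing values yields 0 by default in pandas, counties without census data may end up in this tie case; checking for all-zero rows before calling `idxmax` is one possible way to separate them.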

In [None]:
driver_color_dict = {
    'Births': f'rgba(128,177,211,{alpha})',
    'Deaths': f'rgba(251,128,114,{alpha})',
    'International Migration': f'rgba(190,186,218,{alpha})',
    'Domestic Migration': f'rgba(255,255,179,{alpha})',
    'Unknown': f'rgba(141,211,199,{alpha})',
}

In [None]:
fig = px.choropleth(
    df_driver,
    geojson=df_driver.geometry,
    locations=df_driver.index,
    color='KEY_DRIVER',
    scope='usa',
    color_discrete_map=driver_color_dict,
    category_orders={'KEY_DRIVER': list(driver_color_dict.keys())},
    title='Key Driver for Population Change between 2010 and 2020',
    hover_name='Name',
    hover_data=['Name', 'Births', 'Deaths', 'International Migration', 'Domestic Migration'],
    labels={
        'KEY_DRIVER': 'Driver'
    }
)

fig.update_layout(
    legend_title_text='',
    margin=dict(
        r=0, l=0, t=75, b=0
    ),
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=0.96,
        xanchor='center',
        x=0.5
    )
)

### Domestic migration within the United States between 2010 and 2020

The northwestern and southeastern parts of the United States gained population through domestic migration. Los Angeles had the largest net loss, with about 767 thousand more people moving out than moving in. People are likely moving to less expensive areas such as Riverside, which gained population through domestic migration.

In [None]:
df_dom_mig = df_census.copy()
df_dom_mig['Domestic Migration'] = df_dom_mig[[col for col in df_dom_mig.columns if col.startswith('DOMESTICMIG')]].sum(axis=1)
df_dom_mig = df_dom_mig[['id', 'Name', 'geometry', 'Domestic Migration']].copy()
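
Unlike the driver analysis, the sum here keeps its sign, so a negative total means more people moved out than moved in. The sketch below illustrates this on made-up values and uses `DataFrame.filter` with an anchored regular expression as an alternative way to select the yearly columns; the anchor `^` is an assumption-free safeguard that keeps rate columns such as `RDOMESTICMIG2011` out of the sum if they are still present.

```python
import pandas as pd

# Toy yearly net domestic migration counts (placeholder values) plus one rate column.
df = pd.DataFrame({
    'id': ['06037', '06065'],
    'DOMESTICMIG2010': [-10, 5],
    'DOMESTICMIG2011': [-20, 7],
    'RDOMESTICMIG2011': [-0.5, 0.2],
})

# '^DOMESTICMIG' matches only column names that start with the prefix.
yearly = df.filter(regex=r'^DOMESTICMIG')
df['Domestic Migration'] = yearly.sum(axis=1)

print(list(yearly.columns))               # ['DOMESTICMIG2010', 'DOMESTICMIG2011']
print(df['Domestic Migration'].tolist())  # [-30, 12]
```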

In [None]:
df_dom_mig.sort_values('Domestic Migration')

In [None]:
dom_mig_labels = [
    '-100,000 or lower',
    '-99,999 to -10,000',
    '-9,999 to -1,000',
    '-999 to 999',
    '1,000 to 9,999', 
    '10,000 to 99,999',
    '100,000 or higher'
]

alpha = 0.9
dom_mig_color_list = [
    f'rgba(84,39,136,{alpha})', 
    f'rgba(153,142,195,{alpha})',
    f'rgba(216,218,235,{alpha})', 
    f'rgba(247,247,247,{alpha})', 
    f'rgba(254,224,182,{alpha})',
    f'rgba(241,163,64,{alpha})', 
    f'rgba(179,88,6,{alpha})'
]

In [None]:
df_dom_mig['Domestic Migration Bin'] = pd.cut(
    df_dom_mig['Domestic Migration'], 
    bins=[-np.inf, -100000, -10000, -1000, 999, 9999, 99999, np.inf],
    labels=dom_mig_labels,
    right=True,
    include_lowest=False)

In [None]:
dom_mig_color_dict = {k:v for (k, v) in zip(dom_mig_labels, dom_mig_color_list)}

In [None]:
fig = px.choropleth(
    df_dom_mig,
    geojson=df_dom_mig.geometry,
    locations=df_dom_mig.index,
    color='Domestic Migration Bin',
    scope='usa',
    color_discrete_map=dom_mig_color_dict,
    category_orders={'Domestic Migration Bin': dom_mig_labels},
    title='Domestic Migration between Counties from 2010 to 2020',
    hover_name='Name'
)

fig.update_layout(
    legend_title_text='',
    margin=dict(
        r=0, l=0, t=75, b=0
    ),
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=0.96,
        xanchor='center',
        x=0.5
    )
)