### Population Estimate Data (2020) - Exploratory

This notebook documents the exploratory analysis of the Census Bureau county population estimates.

Here you'll find the steps taken to understand the dataset and correct potential issues before joining it with other data.

Data source template:
https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/

### Learnings

- The encoding of the source file needs to be set explicitly (ISO-8859-1) because of a decoding issue
- The dataset contains records with COUNTY FIPS = 0; these are state-level aggregates
- SUMLEV lets us exclude the state-level population totals (exclude SUMLEV = 40)
- POPESTIMATE2019 is a very clean column, but its standard deviation is large at the county level (about 333,457 against a mean of about 104,468)
    - Min population: 86
    - Max population: 10039107
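
The filtering rule from the list above can be sketched on a handful of rows copied from the preview further down. The names `sample` and `counties_only` are illustrative and not part of the notebook.

```python
import pandas as pd

# A few rows copied from the notebook's own preview of the Census file.
sample = pd.DataFrame(
    {
        "SUMLEV": [40, 50, 50, 50],
        "STATE": [1, 1, 1, 1],
        "COUNTY": [0, 1, 3, 5],
        "CTYNAME": ["Alabama", "Autauga County", "Baldwin County", "Barbour County"],
        "POPESTIMATE2019": [4903185, 55869, 223234, 24686],
    }
)

# SUMLEV 40 marks state totals, SUMLEV 50 marks counties.
counties_only = sample[sample["SUMLEV"] == 50].reset_index(drop=True)

# After filtering, no row should carry the state-level COUNTY code 0.
assert (counties_only["COUNTY"] != 0).all()
print(counties_only[["CTYNAME", "POPESTIMATE2019"]])
```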


In [2]:
import pandas as pd
import numpy as np

In [15]:
# Importing data
url="https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv"

# Fix issue with encoding
countiesData=pd.read_csv(url, parse_dates=True, keep_default_na=False, encoding='ISO-8859-1')
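
A minimal sketch of why the explicit encoding matters, assuming the failure comes from county names with non-ASCII characters stored as single Latin-1 bytes. Doña Ana County (New Mexico) is used as a plausible example; the notebook does not say which row triggered the error, and the two-line CSV below is hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-line CSV, encoded the way the Census file appears to be.
raw = "STNAME,CTYNAME\nNew Mexico,Doña Ana County\n".encode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails, because 0xF1 ("ñ" in Latin-1)
# does not start a valid UTF-8 sequence here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the encoding explicitly lets pandas read the name correctly.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_ok, sample.loc[0, "CTYNAME"])
```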

In [34]:
# Few data from important columns
countiesData[['SUMLEV','STATE','COUNTY','STNAME','CTYNAME','POPESTIMATE2019']].head(10)

Unnamed: 0,SUMLEV,STATE,COUNTY,STNAME,CTYNAME,POPESTIMATE2019
0,40,1,0,Alabama,Alabama,4903185
1,50,1,1,Alabama,Autauga County,55869
2,50,1,3,Alabama,Baldwin County,223234
3,50,1,5,Alabama,Barbour County,24686
4,50,1,7,Alabama,Bibb County,22394
5,50,1,9,Alabama,Blount County,57826
6,50,1,11,Alabama,Bullock County,10101
7,50,1,13,Alabama,Butler County,19448
8,50,1,15,Alabama,Calhoun County,113605
9,50,1,17,Alabama,Chambers County,33254


In [27]:
# COUNTY = 0 rows are the cumulative totals by state (SUMLEV = 40)
countiesData[countiesData['SUMLEV'] == 40]

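
One way to confirm that the `SUMLEV = 40` rows really are state totals is to compare each of them with the sum of the county rows in the same state. This is only a sketch: `check_state_totals` is a hypothetical helper, and it assumes the column names shown in the preview above.

```python
import pandas as pd


def check_state_totals(df: pd.DataFrame, col: str = "POPESTIMATE2019") -> pd.DataFrame:
    """Compare each state row (SUMLEV 40) with the sum of its county rows (SUMLEV 50)."""
    # One value per state, taken from the aggregate rows.
    state_rows = df[df["SUMLEV"] == 40].set_index("STATE")[col]
    # One value per state, computed from the county rows.
    county_sums = df[df["SUMLEV"] == 50].groupby("STATE")[col].sum()
    out = pd.DataFrame({"state_row": state_rows, "county_sum": county_sums})
    out["matches"] = out["state_row"] == out["county_sum"]
    return out
```

Calling `check_state_totals(countiesData)` would return one row per state, with `matches` set to `False` wherever the two figures disagree.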


### Population

In [32]:
# Restrict to county-level rows once, then report the extremes
countyPop = countiesData.loc[countiesData['SUMLEV'] == 50, 'POPESTIMATE2019']
print(f"Min population: {countyPop.min()}")
print(f"Max population: {countyPop.max()}")

Min population: 86
Max population: 10039107


In [28]:
# Stats show a big standard deviation (state-level rows are still included here)
countiesData[['POPESTIMATE2019']].describe().apply(lambda s: s.apply('{0:.2f}'.format))

Unnamed: 0,POPESTIMATE2019
count,3193.0
mean,205599.45
std,1260310.32
min,86.0
25%,11128.0
50%,26516.0
75%,73309.0
max,39512223.0


In [31]:
# Same stats restricted to county-level rows (SUMLEV = 50)
countiesData[countiesData['SUMLEV'] == 50][['STATE','COUNTY','POPESTIMATE2019']].describe().apply(lambda s: s.apply('{0:.2f}'.format))

Unnamed: 0,STATE,COUNTY,POPESTIMATE2019
count,3142.0,3142.0,3142.0
mean,30.28,103.57,104468.34
std,15.14,107.7,333456.71
min,1.0,1.0,86.0
25%,18.0,35.0,10902.5
50%,29.0,79.0,25726.0
75%,45.0,133.0,68072.75
max,56.0,840.0,10039107.0
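
The county-level figures above already quantify the "big standard deviation". A small sketch using only the reported summary values (the variable names are illustrative):

```python
# Summary values copied from the county-level describe() output above.
mean_pop = 104468.34
std_pop = 333456.71
median_pop = 25726.0

# Coefficient of variation: the standard deviation relative to the mean.
cv = std_pop / mean_pop

# A mean far above the median points to a right-skewed distribution,
# i.e. a few very large counties pull the average up.
mean_to_median = mean_pop / median_pop

print(f"CV: {cv:.2f}, mean/median: {mean_to_median:.2f}")
```

With the standard deviation at roughly three times the mean and the mean at about four times the median, raw counts are dominated by a few very large counties, which is worth keeping in mind when comparing counties directly.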


In [29]:
# Get the New York FIPS codes to find a way to match the entries missing in the NY Times dataset
countiesData[countiesData['CTYNAME'].str.contains('New York')]

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2019,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,RNETMIG2016,RNETMIG2017,RNETMIG2018,RNETMIG2019
1860,40,1,2,36,0,New York,New York,19378102,19378144,19399878,...,-9.267874,0.310701,-0.963243,-1.791701,-3.130145,-4.174457,-5.379085,-6.127425,-6.406096,-6.920598
1891,50,1,2,36,61,New York,New York County,1585873,1586381,1588767,...,-7.201265,6.150025,3.585789,-3.412067,-3.133048,-1.60527,-4.965909,-7.488348,-4.535926,-3.071435
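
For joining with other datasets, the `STATE` and `COUNTY` codes are commonly combined into a single five-digit FIPS string. The sketch below assumes the target dataset stores FIPS as zero-padded text; `make_fips` is a hypothetical helper, not something defined in this notebook.

```python
import pandas as pd


def make_fips(df: pd.DataFrame) -> pd.Series:
    """Build a 5-character FIPS code: 2-digit state + 3-digit county."""
    state = df["STATE"].astype(int).astype(str).str.zfill(2)
    county = df["COUNTY"].astype(int).astype(str).str.zfill(3)
    return state + county


# STATE/COUNTY pairs taken from the outputs shown in this notebook.
sample = pd.DataFrame(
    {
        "CTYNAME": ["Autauga County", "New York County"],
        "STATE": [1, 36],
        "COUNTY": [1, 61],
    }
)
sample["FIPS"] = make_fips(sample)
print(sample)
```

Keeping the code as a string preserves the leading zero for states such as Alabama, which an integer column would drop.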
