Collected state population data from the 2020 census on data.census.gov. We don't need the racial breakdown, so this notebook keeps only each state's total population, renames a column, adds the postal code, and saves the result.

### Data Sources
* [2020 census data](https://data.census.gov/cedsci/table?q=&t=Populations%20and%20People&y=2020&tid=DECENNIALPL2020.P1&hidePreview=true)
* [State abbreviations/fips](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696)


In [10]:
import pandas as pd
import os
project_folder = ".." + os.sep + ".." + os.sep
population_folder = os.path.join(project_folder, "data", "population")

In [11]:
# read in original census data
population_df = pd.read_csv(os.path.join(population_folder, "DECENNIALPL2020.P1_data_with_overlays_2021-11-04T171558.csv"))
population_df.head(2)

Unnamed: 0,GEO_ID,NAME,P1_001N,P1_002N,P1_003N,P1_004N,P1_005N,P1_006N,P1_007N,P1_008N,...,P1_062N,P1_063N,P1_064N,P1_065N,P1_066N,P1_067N,P1_068N,P1_069N,P1_070N,P1_071N
0,id,Geographic Area Name,!!Total:,!!Total:!!Population of one race:,!!Total:!!Population of one race:!!White alone,!!Total:!!Population of one race:!!Black or A...,!!Total:!!Population of one race:!!American I...,!!Total:!!Population of one race:!!Asian alone,!!Total:!!Population of one race:!!Native Haw...,!!Total:!!Population of one race:!!Some Other...,...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...,!!Total:!!Population of two or more races:!!P...
1,0400000US01,Alabama,5024279,4767326,3220452,1296162,33625,76660,2984,137443,...,4,187,89,78,13,0,5,2,9,9


In [12]:
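# keep only the identifier, state name, and total population columns;
# .copy() gives an independent frame, so no chained-assignment warning needs silencing
population_df = population_df[['GEO_ID', 'NAME', 'P1_001N']].copy()
population_df = population_df.rename(columns={"P1_001N": "Population"})
# row 0 holds the census column descriptions, not data
population_df = population_df.drop(0)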
population_df.head()

Unnamed: 0,GEO_ID,NAME,Population
1,0400000US01,Alabama,5024279
2,0400000US02,Alaska,733391
3,0400000US04,Arizona,7151502
4,0400000US05,Arkansas,3011524
5,0400000US06,California,39538223


In [13]:
# build a state name -> population lookup that could be used for mapping later
# (the population values are still strings at this point)
population_dictionary = dict(zip(population_df['NAME'], population_df['Population']))
population_dictionary['California']

'39538223'
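
The lookup returns the string `'39538223'` rather than a number, presumably because the description row in the raw file made pandas read the whole column as text. A minimal sketch of converting it, using a small stand-in frame (`sample_df` and `sample_lookup` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in for population_df after the description row is dropped;
# Population is still text at this point.
sample_df = pd.DataFrame({
    "NAME": ["Alabama", "California"],
    "Population": ["5024279", "39538223"],
})

# Convert the text column to integers so it can be used in arithmetic
sample_df["Population"] = pd.to_numeric(sample_df["Population"]).astype("int64")

sample_lookup = dict(zip(sample_df["NAME"], sample_df["Population"]))
print(sample_lookup["California"])
```

Doing the conversion before building the dictionary means per-capita calculations later won't need a cast at every use.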

Add postal codes, since that is how states are referred to in our COVID data.

In [16]:
codes_df = pd.read_csv(os.path.join(population_folder,  "state_abr.txt"), sep = "\t")
codes_df.head()

Unnamed: 0,Name,Postal Code,FIPS
0,Alabama,AL,1
1,Alaska,AK,2
2,Arizona,AZ,4
3,Arkansas,AR,5
4,California,CA,6


In [17]:
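# inner join on state name; rows without a match in codes_df are dropped
population_df = population_df.merge(codes_df, left_on="NAME", right_on="Name")
# NAME duplicates Name after the merge, so keep only one of them
final_population_df = population_df.drop(columns="NAME")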
final_population_df.head()

Unnamed: 0,GEO_ID,Population,Name,Postal Code,FIPS
0,0400000US01,5024279,Alabama,AL,1
1,0400000US02,733391,Alaska,AK,2
2,0400000US04,7151502,Arizona,AZ,4
3,0400000US05,3011524,Arkansas,AR,5
4,0400000US06,39538223,California,CA,6
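
`merge` defaults to an inner join, so any census row whose name has no match in the abbreviation table is silently dropped. One way to check for that, sketched on hypothetical toy frames (the frame names and the "Example Territory" row are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of population_df and codes_df
census_sample = pd.DataFrame({
    "NAME": ["Alabama", "Alaska", "Example Territory"],
    "Population": ["5024279", "733391", "100"],
})
codes_sample = pd.DataFrame({
    "Name": ["Alabama", "Alaska"],
    "Postal Code": ["AL", "AK"],
    "FIPS": [1, 2],
})

# A left join keeps every census row; indicator=True adds a "_merge" column
# that records whether each row found a partner in codes_sample
checked = census_sample.merge(
    codes_sample, left_on="NAME", right_on="Name", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "NAME"].tolist()
print(unmatched)
```

An empty list would mean every census row received a postal code.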


Save the cleaned population data.

In [19]:
final_population_df.to_csv(os.path.join(population_folder, "clean_state_population.csv"), index=False)
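
Passing `index=False` keeps the row index out of the file, so reading it back yields the same five columns with no extra unnamed column. A quick round-trip sketch using an in-memory buffer instead of the real path (`sample` is a hypothetical one-row stand-in for `final_population_df`):

```python
import io

import pandas as pd

# Hypothetical one-row stand-in for final_population_df
sample = pd.DataFrame({
    "GEO_ID": ["0400000US01"],
    "Population": ["5024279"],
    "Name": ["Alabama"],
    "Postal Code": ["AL"],
    "FIPS": [1],
})

# Write to an in-memory buffer rather than touching the data folder
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and inspect what was stored
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
print(reloaded["Population"].dtype)
```

On reload pandas infers the population column as an integer, since the saved file no longer contains the description row.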