<a href="https://colab.research.google.com/github/npr99/Archive/blob/master/CensusAPI_PopulationSize_2020_07_30.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Population Analysis Workflow

The following Jupyter Notebook provides a workflow for replicating a demographic analysis. This example uses text from Wang and vom Hofe (2006), mixed with Python code that uses the US Census Bureau's Application Programming Interface (API).

Wang, X. and vom Hofe, R. (2006). Research Methods in Urban and Regional Planning, Springer.  Available for download through [Texas A&M University Library](https://link-springer-com.lib-ezproxy.tamu.edu:9443/book/10.1007%2F978-3-540-49658-8)

# Example to replicate - Demographic Analysis Fundamental Concepts 
The following is an excerpt from Wang and vom Hofe (2006, p. 58).

> For planners and demographers alike, population analyses do not begin with immediately applying sophisticated methods in population projections. Rather, most demographic analyses start with fundamental concepts, such as describing populations by their actual size.

> The point here is to get a thorough understanding of the population of
interest by studying characteristics for periods where data are available. For
planning purposes, these first demographic analyses can already give planners
valuable and necessary information. Is the population of an area declining or
increasing? By what rate is the area declining or increasing? With such
information, school district superintendents could make some educated guesses
about expected enrollment if they know the age composition of the area’s
population. 

> Let us focus on the first of the fundamental concepts of demographic analysis,
the **population size**. Using 1990 and 2000 U.S. Census Bureau population statistics for Boone County we see immediately that the county is growing fast. Boone County had a population of 57,589 in 1990 and 85,991 in 2000.

> While the concept of population size is straightforward, it is an important
fact that, in general, people are counted according to their permanent place of
residence. For example, someone living in a neighboring county and commuting
daily to Boone County for work is not considered a resident of Boone County. As
a result this person would, of course, not show up in Boone County’s population
size in Table 3.1. The so called “de jure” approach counts people only at their
permanent place of residence.

<h6><center> Table 3.1 Boone county population size, 1990 and 2000. </center></h6>

| Boone County, Kentucky |  1990  |  2000  | Absolute Change | Percent Change |
|:----------------------:|:------:|:------:|:---------------:|:--------------:|
| Male                   | 28,111 | 42,499 | 14,388          | 51.2           |
| Female                 | 29,478 | 43,492 | 14,014          | 47.5           |
| Total                  | 57,589 | 85,991 | 28,402          | 49.3           |

<h10><center> Source: U.S. Census Bureau, Data Set: 1990 and 2000 Summary Tape File 1 (STF1). </center></h10>

## Python Workflow to replicate Table 3.1

The data in the above table come from the US Census Bureau. Replicating the table requires knowledge of the US Census Bureau's website, of how the Census Bureau collects data, and of the geographies for which the data are collected.

The following sections of this notebook provide a workflow that uses the US Census API with Python. The basic data science workflow has the following steps:
1. Obtain Data
2. Clean Data
3. Explore Data
4. Interpret Data
5. Publish Data

### Python Packages and Version Information
Python is an open source programming language, which means that it is not owned or maintained by a private company. Programmers create packages (collections of programs) and make them publicly available.
For example, to make use of the US Census API, the Python package `requests` is required. The Python package `pandas` provides the tools needed to clean and explore the obtained Census data.



In [None]:
import requests       ## Required for the Census API
import pandas as pd   ## Required to clean and explore data

There are many different versions of Python and of its packages. Recording the versions in use helps with replication: a bug may be specific to one version, and a newer version may have different features.

In [None]:
# Display versions being used - important information for replication
import sys  # For checking version of python for replication

print("Python Version     ", sys.version)
print("pandas version:    ", pd.__version__)
print("requests version:     ", requests.__version__)

Python Version      3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0]
pandas version:     1.0.5
requests version:      2.23.0


## Data Science Step 1: Obtain Data with Census API

The following section sets up and reads in data from the Census API. The code is detailed, but it makes obtaining Census data systematic and reproducible. Understanding the Census API also helps to reinforce knowledge of Census data.

### 1.1 Anatomy of a Census API Query

Here is a Census API query. Follow the link to see a webpage with the basic population information for the United States in 1990. The United States had a population of 248,709,873 people.

https://api.census.gov/data/1990/sf1?get=P0010001&for=us

The API query has two parts: the base URL and the parameters, or predicates.

The base url (`https://api.census.gov/data/1990/sf1`) has three parts:
1. The host website = `https://api.census.gov/data`
2. The year for the Census Data = `1990`
3. The Census source Data such as Summary File 1 = `sf1`

The predicates (`get=P0010001&for=us`) have two main parts:
1. The variables to get = `P0010001`
2. The geography to get the data for = `us`

The predicates can include more parameters. 
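
The anatomy described above can be sketched in a few lines of standard-library Python. This is only an illustration: the helper name `build_query_url` is hypothetical, and the sketch assembles the URL as text without contacting the API.

```python
from urllib.parse import urlencode

HOST = "https://api.census.gov/data"


def build_query_url(year, dataset, variables, geography):
    """Assemble a Census API query URL from its parts (illustrative helper)."""
    # Base URL = host + year + dataset
    base_url = "/".join([HOST, year, dataset])
    # Predicates = variables to get + geography to get them for
    predicates = {"get": ",".join(variables), "for": geography}
    # safe=",:" leaves commas and colons unescaped so the URL stays readable
    return base_url + "?" + urlencode(predicates, safe=",:")


url = build_query_url("1990", "sf1", ["P0010001"], "us")
print(url)
```

The printed URL should match the example query at the top of this section.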

For more details on Census API query see: 
1. Breakstone and Anderson. (2019-07-25). *Census Data API User Guide Version 1.6*. Retrieved from https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf 
2. DataCamp. (2020-03-04). *Python Tutorial: Using the Census API*. Retrieved from https://www.youtube.com/watch?v=l47HptzM7ao 

### 1.2 Base Url for 1990 and 2000 Census Data

The code in this section is based on helpful YouTube Videos and example code from Stack Exchange.

* DataCamp (2020-03-04) Python Tutorial: Using the Census API. Retrieved from https://youtu.be/l47HptzM7ao

* ANimator120 (2020-04-16) How to Format API Requests to Loop through All Tracts in a County. Retrieved from https://opendata.stackexchange.com/questions/17420/how-to-format-api-requests-to-loop-through-all-tracts-in-a-county


In [None]:
HOST = "https://api.census.gov/data"

In [None]:
# the request base url, 1990 Decennial Census
year = "1990"
dataset = "sf1"
base_url1990 = "/".join([HOST, year, dataset])
base_url1990

'https://api.census.gov/data/1990/sf1'

In [None]:
# the request base url, 2000 Decennial Census
year = "2000"
dataset = "dec/sf1"
base_url2000 = "/".join([HOST, year, dataset])
base_url2000

'https://api.census.gov/data/2000/dec/sf1'

### 1.3 Predicates or Parameters for 1990 and 2000 Census


#### 1.3.1 Variables to obtain for 1990 Census
For a full list of variables, see https://api.census.gov/data/1990/sf1/variables.html

| Name |  Label  |
|:----------------------:|:------:|
| ANPSADPI               | Geography Name |
| P0050001               | Sex Male |
| P0050002               | Sex Female |
| P0010001               | Total Persons |

In [None]:
# Set up the predicates for 1990 Data
predicates1990 = {}
# For a full list of variables https://api.census.gov/data/1990/sf1/variables.html
get_vars = ["ANPSADPI", "P0050001", "P0050002","P0010001"]
predicates1990["get"] = ",".join(get_vars)
predicates1990["for"] = "county:015"
predicates1990["in"] = "state:21"
predicates1990

{'for': 'county:015',
 'get': 'ANPSADPI,P0050001,P0050002,P0010001',
 'in': 'state:21'}

#### 1.3.2 Variables to obtain for 2000 Census
For a full list of variables, see https://api.census.gov/data/2000/dec/sf1/variables.html

| Name |  Label  |
|:----------------------:|:------:|
| LSAD_NAME              | Legal/Statistical Area Description name |
| P012002               | Sex Male |
| P012026               | Sex Female |
| P001001               | Total Persons |

In [None]:
# Set up the predicates for 2000 Data
predicates2000 = {}
# For a full list of variables https://api.census.gov/data/2000/dec/sf1/variables.html
get_vars = ["LSAD_NAME", "P012002", "P012026", "P001001"]
predicates2000["get"] = ",".join(get_vars)
predicates2000["for"] = "county:015"
predicates2000["in"] = "state:21"
predicates2000

{'for': 'county:015',
 'get': 'LSAD_NAME,P012002,P012026,P001001',
 'in': 'state:21'}

### 1.4 Request Census Data through the API

In [None]:
# Request data
r1990 = requests.get(base_url1990, params=predicates1990)
print(r1990.text)

[["ANPSADPI","P0050001","P0050002","P0010001","state","county"],
["Boone County","28111","29478","57589","21","015"]]


In [None]:
# Request data
r2000 = requests.get(base_url2000, params=predicates2000)
print(r2000.text)

[["LSAD_NAME","P012002","P012026","P001001","state","county"],
["Boone County","42499","43492","85991","21","015"]]


The above results provide the basic input data required to replicate Table 3.1 in Wang and vom Hofe (2006). Notice that the total population for 1990 (57589) and the total population for 2000 (85991) appear in the printed request results.

### 1.5 Comparison to data.census.gov

The US Census Bureau provides access to its data through the portal [data.census.gov](https://youtu.be/XNtvO27r2g0). This website came online in 2019 and replaced American FactFinder. The Census Bureau is working to add historical data to data.census.gov. Currently, 1990 data are not available for comparison, but 2000 data are available. The links below (called [deep links](https://www.census.gov/content/dam/Census/data/data-census-gov/data-census-gov-deep-linking-guide.pdf)) provide quick access to the same data available through the API.

https://data.census.gov/cedsci/table?g=0500000US21015&y=2000&tid=DECENNIALSF12000.P001

https://data.census.gov/cedsci/table?g=0500000US21015&y=2000&tid=DECENNIALSF12000.P012
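
The `g=` value in these links can be read as a geographic identifier built from the same FIPS codes used in the API predicates. The sketch below assumes the prefix `0500000US` marks a county-level geography, followed by the two-digit state code and three-digit county code. The variable names are illustrative.

```python
# FIPS codes used in the API predicates above (kept as strings to preserve leading zeros)
state_fips = "21"    # Kentucky
county_fips = "015"  # Boone County

# Assumed structure: county-level prefix + state FIPS + county FIPS
geoid = "0500000US" + state_fips + county_fips

# Rebuild the first deep link from its parts
deep_link = (
    "https://data.census.gov/cedsci/table"
    f"?g={geoid}&y=2000&tid=DECENNIALSF12000.P001"
)
print(deep_link)
```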

## Data Science Step 2: Clean Obtained Data
Data from the Census API is returned in a JSON format. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. (https://www.json.org/json-en.html)
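
To see what that JSON looks like once parsed, here is a minimal sketch that uses only the standard library. It parses a copy of the 1990 response printed in section 1.4 instead of calling the API, and the variable names are illustrative.

```python
import json

# Copy of the 1990 response text printed in section 1.4
response_text = """[["ANPSADPI","P0050001","P0050002","P0010001","state","county"],
["Boone County","28111","29478","57589","21","015"]]"""

# The response is a list of lists: the first row is the header, the rest are records
rows = json.loads(response_text)
header, records = rows[0], rows[1:]

# Pair each variable name with its value for the single Boone County record
boone = dict(zip(header, records[0]))
print(boone["ANPSADPI"], boone["P0010001"])
```

Note that every value, including the population counts, arrives as a string.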

To clean the data, we need to convert the JSON into a pandas DataFrame. A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. DataFrames are similar to a sheet in a spreadsheet program like Microsoft Excel or Google Sheets. Once the data is in a DataFrame, new columns can be created to generate the needed statistics.

### 2.1 Create user-friendly column names
The data from the Census API uses the variable names for the columns. The variable names can be replaced with more understandable names.

In [None]:
# create user-friendly column names
col_names = ["cnty_name", "Male", "Female", "Total", "state_fips", "county_fips"]

In [None]:
# load into pandas data frame
df1990 = pd.DataFrame(columns=col_names, data=r1990.json()[1:], index=[1])
df2000 = pd.DataFrame(columns=col_names, data=r2000.json()[1:], index=[2])

# add year variable
df1990["year"] = 1990
df2000["year"] = 2000

### 2.2 Reshape the data

The data needs to be reshaped. Right now the data is in a wide format, but it needs to be in a long format.
To reshape data from wide to long, pandas has the `melt` method. It moves all the “values” stored in a DataFrame into a single column, with the remaining columns used to hold identifying information.

For our data, the county is the identifying information.

In the end, the data will have one column with the type of population size (male, female, total) and one column with the population for the year.

https://pandas.pydata.org/docs/user_guide/reshaping.html

https://datascience.quantecon.org/pandas/reshape.html 
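
As a small, self-contained illustration of wide versus long, the sketch below melts a one-row table that mimics the 1990 data. The DataFrame is built by hand here (not fetched from the API), and the names `wide_df` and `long_df` are illustrative.

```python
import pandas as pd

# Hand-built one-row "wide" table: one column per population group
wide_df = pd.DataFrame(
    {
        "cnty_name": ["Boone County"],
        "Male": ["28111"],
        "Female": ["29478"],
        "Total": ["57589"],
    }
)

# id_vars stay as identifying columns; every other column is stacked
# into a pair of (population, 1990) columns, one row per group
long_df = wide_df.melt(id_vars=["cnty_name"], var_name="population", value_name="1990")
print(long_df)
```

Passing `var_name` and `value_name` names the new columns directly, which is an alternative to renaming `variable` and `value` afterwards.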

In [None]:
df1990_melt = df1990.melt(id_vars=['state_fips','county_fips','year','cnty_name'])
df1990_melt

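|   | state_fips | county_fips | year | cnty_name | variable | value |
|:-:|:----------:|:-----------:|:----:|:---------:|:--------:|:-----:|
| 0 | 21 | 015 | 1990 | Boone County | Male | 28111 |
| 1 | 21 | 015 | 1990 | Boone County | Female | 29478 |
| 2 | 21 | 015 | 1990 | Boone County | Total | 57589 |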


In [None]:
df1990_melt = df1990_melt.rename(columns={"value": "1990", "variable": "population"})
df1990_melt = df1990_melt.drop(columns=['year'])
df1990_melt

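|   | state_fips | county_fips | cnty_name | population | 1990 |
|:-:|:----------:|:-----------:|:---------:|:----------:|:----:|
| 0 | 21 | 015 | Boone County | Male | 28111 |
| 1 | 21 | 015 | Boone County | Female | 29478 |
| 2 | 21 | 015 | Boone County | Total | 57589 |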


In [None]:
df2000_melt = df2000.melt(id_vars=['state_fips','county_fips','year','cnty_name'])
df2000_melt = df2000_melt.rename(columns={"value": "2000",  "variable": "population"})
df2000_melt = df2000_melt.drop(columns=['year'])
df2000_melt

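|   | state_fips | county_fips | cnty_name | population | 2000 |
|:-:|:----------:|:-----------:|:---------:|:----------:|:----:|
| 0 | 21 | 015 | Boone County | Male | 42499 |
| 1 | 21 | 015 | Boone County | Female | 43492 |
| 2 | 21 | 015 | Boone County | Total | 85991 |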


### 2.3 Merge dataframes together

Now that the two DataFrames are in the correct format, we are ready to merge them together.

The merge is based on the FIPS codes for the state and county, which uniquely identify the census geography, and on the `population` column, which identifies male, female, and total.



In [None]:
# merge on the shared identifying columns
df_merged = df1990_melt.merge(df2000_melt,
                              on=['state_fips','county_fips','cnty_name','population'])
df_merged

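|   | state_fips | county_fips | cnty_name | population | 1990 | 2000 |
|:-:|:----------:|:-----------:|:---------:|:----------:|:----:|:----:|
| 0 | 21 | 015 | Boone County | Male | 28111 | 42499 |
| 1 | 21 | 015 | Boone County | Female | 29478 | 43492 |
| 2 | 21 | 015 | Boone County | Total | 57589 | 85991 |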


### 2.4 Check Data Types

pandas stores data as different [data types](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes). Values that arrive as text, such as the strings returned by the Census API, are stored with the `object` data type. The following cells check the data type of each column and then convert the population columns to `int`, the data type for integers (whole numbers). Once a column is stored as integers, it can be used in arithmetic.
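
Why the conversion matters can be shown with plain Python strings, which behave like the text values returned by the API. This is a minimal sketch with illustrative variable names. It uses the Boone County totals from the request output above.

```python
# Values as they arrive from the API: text, not numbers
pop_1990 = "57589"
pop_2000 = "85991"

# "+" on strings joins the text instead of adding the numbers
concatenated = pop_1990 + pop_2000

# "-" is not defined for strings, so subtraction raises a TypeError
try:
    pop_2000 - pop_1990
    subtraction_worked = True
except TypeError:
    subtraction_worked = False

# After conversion to int, arithmetic behaves as expected
difference = int(pop_2000) - int(pop_1990)
print(concatenated, subtraction_worked, difference)
```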


In [None]:
df_merged.dtypes

state_fips     object
county_fips    object
cnty_name      object
population     object
1990           object
2000           object
dtype: object

In [None]:
# fix data types (integer): need to change data types to run diff and pct_change
df_merged["2000"] = df_merged["2000"].astype(int)
df_merged["1990"] = df_merged["1990"].astype(int)
df_merged.dtypes

state_fips     object
county_fips    object
cnty_name      object
population     object
1990            int64
2000            int64
dtype: object

## Data Science Step 3: Explore Data - Population Changes

Now that the data is cleaned, we can explore it. Wang and vom Hofe (2006) describe four demographic concepts for population change: absolute change, percent change, average annual absolute change, and average annual percent change. We use the population totals for Boone County for 1990 and 2000 (57,589 and 85,991, respectively) to calculate the first two of these concepts:

**Absolute change:** subtract the 1990 population from the 2000 population:

$$85{,}991 - 57{,}589 = 28{,}402$$

**Percent change:** divide the absolute population change by the 1990 population to get percentages:

$${28{,}402 \over 57{,}589} = 0.493 \times 100 = 49.3\%$$
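
The two formulas can also be written as small plain-Python functions and checked against Table 3.1. The function names below are illustrative and are not part of the notebook's pandas workflow. The population counts are taken from the table.

```python
def absolute_change(start, end):
    """Later population minus earlier population."""
    return end - start


def percent_change(start, end):
    """Absolute change as a percentage of the earlier population."""
    return absolute_change(start, end) / start * 100


# (1990, 2000) populations for Boone County, from Table 3.1
boone = {
    "Male": (28111, 42499),
    "Female": (29478, 43492),
    "Total": (57589, 85991),
}

for group, (pop_1990, pop_2000) in boone.items():
    change = absolute_change(pop_1990, pop_2000)
    pct = round(percent_change(pop_1990, pop_2000), 1)
    print(group, change, pct)
```

Rounded to one decimal place, the results reproduce the Absolute Change and Percent Change columns of Table 3.1.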

In [None]:
# create new columns showing "absolute change" and "percent change"
df_merged["Absolute Change"] = df_merged["2000"] - df_merged["1990"]
df_merged["Percent Change"] = df_merged['Absolute Change']/df_merged["1990"] * 100
df_merged.head()

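|   | state_fips | county_fips | cnty_name | population | 1990 | 2000 | Absolute Change | Percent Change |
|:-:|:----------:|:-----------:|:---------:|:----------:|:----:|:----:|:---------------:|:--------------:|
| 0 | 21 | 015 | Boone County | Male | 28111 | 42499 | 14388 | 51.182811 |
| 1 | 21 | 015 | Boone County | Female | 29478 | 43492 | 14014 | 47.540539 |
| 2 | 21 | 015 | Boone County | Total | 57589 | 85991 | 28402 | 49.318446 |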


## Data Science Step 4: Interpret Data

A positive number for **absolute change** refers to a population increase while a negative number
indicates a decline in population size. Boone County's population grew by 28,402 people or 49.3% between 1990 and 2000. 




## Data Science Step 5: Publish Data

To publish the data from this notebook, we first need to format the data. Formatting includes:
*   updating the column names,
*   adding commas to numbers,
*   setting the precision, or the number of digits after the decimal place, and
*   adding percent signs.

In the previous table, notice that the numbers do not have commas. Commas help the reader compare big numbers; a comma is placed every third digit. Also notice that the percentages have many digits after the decimal place. In general, a reader will not care that the population has grown by 49.318446 percent. A well-formatted number includes just the right level of precision, usually one or two digits after the decimal point. Finally, the percent sign is important: it lets the reader know that the number is a ratio that has been multiplied by 100.
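
The format strings involved can be tried on single values before applying them to a whole table. This is a minimal sketch with illustrative variable names. It uses the total absolute change and percent change computed in Step 3.

```python
total_change = 28402
total_percent = 49.318446

# "{:,}" inserts a comma every third digit
formatted_change = "{:,}".format(total_change)

# "{:.1f}" keeps one digit after the decimal point; the trailing % is literal text
formatted_percent = "{:.1f}%".format(total_percent)

# f-strings accept the same format codes
formatted_fstring = f"{total_change:,} ({total_percent:.1f}%)"

print(formatted_change, formatted_percent, formatted_fstring)
```

After formatting, the values are strings rather than numbers, so formatting is best left as the last step, after all calculations are done.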



In [None]:
# Create a copy of the cleaned dataset
df_formatted = df_merged.copy()

# Add a comma to the population size data
df_formatted["1990"] = df_formatted.apply(lambda x: "{:,}".format(x['1990']), axis=1)
df_formatted["2000"] = df_formatted.apply(lambda x: "{:,}".format(x['2000']), axis=1)
df_formatted["Absolute Change"] = df_formatted.apply(lambda x: "{:,}".format(x['Absolute Change']), axis=1)

# Set the precision to 1 decimal place and add a % sign
df_formatted["Percent Change"] = df_formatted.apply(lambda x: "{0:.1f}%".format(x['Percent Change']), axis=1)
df_formatted

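|   | state_fips | county_fips | cnty_name | population | 1990 | 2000 | Absolute Change | Percent Change |
|:-:|:----------:|:-----------:|:---------:|:----------:|:----:|:----:|:---------------:|:--------------:|
| 0 | 21 | 015 | Boone County | Male | 28,111 | 42,499 | 14,388 | 51.2% |
| 1 | 21 | 015 | Boone County | Female | 29,478 | 43,492 | 14,014 | 47.5% |
| 2 | 21 | 015 | Boone County | Total | 57,589 | 85,991 | 28,402 | 49.3% |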


### 5.1 Save work as a CSV File
For the work to be available outside of the Jupyter Notebook, we need to save the results as a CSV file. A CSV (Comma-Separated Values) file can be opened in Microsoft Excel, and the table can then be copied into a Word document for a report.

The file will be saved to the current directory, which in Google Colab is the `content` folder. To find it, click the file-folder icon in the sidebar, locate the file in the `content` folder, and right-click it to select download.

In [None]:
df_formatted.to_csv('CensusAPI_PopulationSize_2020-07-29.csv')
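
One thing to watch for when the saved file is read back into pandas: FIPS codes such as `015` look like numbers, so the default type inference drops their leading zeros. The sketch below shows this using an in-memory buffer instead of a file on disk. The small DataFrame is hand-built for illustration and the variable names are assumptions.

```python
import io

import pandas as pd

# Hand-built stand-in for one row of the formatted table
df = pd.DataFrame({"state_fips": ["21"], "county_fips": ["015"], "2000": ["85,991"]})

# Write the CSV text to an in-memory buffer (index=False skips the row index column)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default read: county_fips is inferred as an integer, so "015" becomes 15
default_read = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every column as text, preserving the leading zero
text_read = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(default_read.loc[0, "county_fips"], text_read.loc[0, "county_fips"])
```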