<a href="https://colab.research.google.com/github/npr99/PlanningMethods/blob/master/PLAN604_Population_vs_Sample_USCounties.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Application of Survey Samples and Confidence Intervals in American Community Survey Data
 
---
This Google Colab Notebook provides a complete workflow (sequence of steps from start to finish) that will allow you to explore [US Census County and County Equivalents](https://www.census.gov/glossary/#term_Countyandequivalententity). 

This notebook keeps the number of code blocks and the amount of discussion to a minimum. It is designed to be modified and rerun for different proportions (% Hispanic, % vacant housing units, etc.) in the United States.

This notebook compares population proportions found in the 2010 Decennial Census and the 2010 1-year ACS.

This notebook introduces the Python concepts of functions (or modules) and merging data.

A function is a block of organized, reusable code that performs a single, related action. Functions make your application more modular and allow a high degree of code reuse.

Python provides many built-in functions, such as `print()`, but you can also create your own. These are called user-defined functions.
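
As a minimal sketch of a user-defined function, the example below defines a small helper and calls it together with the built-in `print()`. The name `percent_of_total` and the numbers are hypothetical and chosen only for illustration.

```python
def percent_of_total(part: float, total: float) -> float:
    """Return part as a percentage of total (hypothetical helper)."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return part / total * 100


# Call the user-defined function, then show the result with the built-in print()
share = percent_of_total(25, 200)
print(share)  # 12.5
```

Once defined, the function can be reused for any pair of counts without rewriting the calculation.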

For more information on Python Functions see: 
1. https://towardsdatascience.com/function-definition-in-python-bae11c29f4cd

For more help on how to merge data see:
1. https://stackoverflow.com/questions/53645882/pandas-merging-101
2. https://realpython.com/pandas-merge-join-and-concat/
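
A merge joins two tables on a shared key column. The sketch below uses two small made-up tables; the column names `census_count` and `survey_estimate` are hypothetical, and `GEO_ID` is used as the key only because both Census tables in this notebook carry that column.

```python
import pandas as pd

# Two small, made-up tables that share a GEO_ID key (values are illustrative only)
census_df = pd.DataFrame({
    "GEO_ID": ["A", "B", "C"],
    "census_count": [100, 200, 300],
})
survey_df = pd.DataFrame({
    "GEO_ID": ["B", "C", "D"],
    "survey_estimate": [210, 290, 400],
})

# how="inner" keeps only keys found in both tables
inner = pd.merge(census_df, survey_df, on="GEO_ID", how="inner")

# how="outer" keeps every key; indicator=True adds a _merge column
# showing which table each row came from
outer = pd.merge(census_df, survey_df, on="GEO_ID", how="outer", indicator=True)

print(inner)
print(outer)
```

Checking the `_merge` column after an outer merge is a quick way to see which records failed to match before deciding how to handle them.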



In [2]:
# Python packages required to read in Census API data
import requests ## Required for the Census API
import pandas as pd # For reading, writing and wrangling data

## Step 1: Obtain Data
The following is a function that provides reusable code to obtain data from the Census API. Run the code block that defines the function first; later code blocks can then call it.

In [11]:
def obtain_census_api(
                    state: str = "*",
                    county: str = "*",
                    census_geography: str = 'county:*',
                    vintage: str = "2010", 
                    dataset_name: str = 'dec/sf1',
                    get_vars: str = 'GEO_ID'):

        """General utility for obtaining census from Census API.

        Args:
            state (str): 2-digit FIPS code. Default * for all states
            county (str): 3-digit FIPS code. Default * for all counties
            census_geography (str): geography for the API's "for" clause;
              example 'block:*' would be for all blocks.
              Default is 'county:*' for all counties
            vintage (str): Census year. Default 2010
            dataset_name (str): Census dataset name. Default Decennial SF1
            for a list of all Census API
            get_vars (str): comma-separated list of variables to get from the API.

        Returns:
            pandas.DataFrame: A dataframe with the requested Census data

        """
        # Set up hyperlink for Census API
        api_hyperlink = ('https://api.census.gov/data/' + vintage + '/'+dataset_name + '?get=' + get_vars +
                        '&in=state:' + state + '&in=county:' + county + '&for=' + census_geography)

        print("Census API data from: " + api_hyperlink)

        # Obtain Census API JSON Data
        apijson = requests.get(api_hyperlink)

        # Convert the requested json into pandas dataframe
        # (first row holds the column names, remaining rows hold the data)
        census_json = apijson.json()
        df = pd.DataFrame(columns=census_json[0], data=census_json[1:])

        return df
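
The conversion step in the function relies on the shape of the JSON response: a list of rows in which the first row holds the column names. The sketch below reproduces that step without a network request. `sample_json` is a hand-written stand-in for `apijson.json()`, and representing every value as a string is an assumption about the response.

```python
import pandas as pd

# Hand-written stand-in for apijson.json(): row 0 holds the column names,
# the remaining rows hold the values (assumed to arrive as strings).
sample_json = [
    ["GEO_ID", "NAME", "P005001", "P005010", "state", "county"],
    ["0500000US05131", "Sebastian County, Arkansas", "125744", "15445", "05", "131"],
    ["0500000US05133", "Sevier County, Arkansas", "17058", "5220", "05", "133"],
]

# Same conversion as in obtain_census_api, without the request itself
sample_df = pd.DataFrame(columns=sample_json[0], data=sample_json[1:])

print(sample_df)
print(sample_df.dtypes)  # the count columns hold text, not numbers, at this point
```

Because the counts come back as text, they need to be converted to a numeric type before any arithmetic, which is what the cleaning step later in the notebook does.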

## Run Obtain Census API for 2010 Decennial Census
The next block of code calls the function and gets variables related to total population and counts of population by race and ethnicity.

For more variables related to race and ethnicity see:

https://api.census.gov/data/2010/dec/sf1/groups/P5.html


In [17]:
get_vars = 'GEO_ID,NAME,P005001,P005010'
        # GEO_ID  = Geographic ID
        # NAME    = Geographic Area Name
        # P005001 = Total
        # P005010 = Total!!Hispanic or Latino
dec10_df = obtain_census_api(get_vars = get_vars)
dec10_df.head()

Census API data from: https://api.census.gov/data/2010/dec/sf1?get=GEO_ID,NAME,P005001,P005010&in=state:*&in=county:*&for=county:*


Unnamed: 0,GEO_ID,NAME,P005001,P005010,state,county
0,0500000US05131,"Sebastian County, Arkansas",125744,15445,5,131
1,0500000US05133,"Sevier County, Arkansas",17058,5220,5,133
2,0500000US05135,"Sharp County, Arkansas",17264,290,5,135
3,0500000US05137,"Stone County, Arkansas",12394,157,5,137
4,0500000US05139,"Union County, Arkansas",41639,1460,5,139


## Run Obtain Census API for 2010 1-year American Community Survey
The next block of code calls the function and gets variables related to total population and counts of population by race and ethnicity, this time as survey estimates with margins of error.
The American Community Survey (ACS) is a survey conducted monthly with a random sample of the population. Because it is based on a sample, the ACS provides estimates of the population characteristics that the 2010 Census counted as of April 1, 2010.
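
Because ACS values are sample estimates, each one is published with a margin of error (the variables ending in `M`). The Census Bureau publishes ACS margins of error at the 90 percent confidence level, so an estimate and its margin of error can be converted into a confidence interval and a standard error. The sketch below is a minimal illustration with made-up numbers. The helper names are hypothetical, and the margin of error for a proportion follows the approximation in the Census Bureau's guidance for derived proportions.

```python
import math

Z_90 = 1.645  # z-value for the 90% confidence level used for ACS margins of error


def confidence_interval(estimate: float, moe: float) -> tuple:
    """Lower and upper bound of the 90% confidence interval."""
    return estimate - moe, estimate + moe


def standard_error(moe: float) -> float:
    """Convert a 90% margin of error to a standard error."""
    return moe / Z_90


def proportion_moe(num_est, num_moe, den_est, den_moe) -> float:
    """Approximate margin of error for the proportion num_est / den_est."""
    p = num_est / den_est
    radicand = num_moe**2 - (p**2) * (den_moe**2)
    if radicand < 0:
        # Fallback described in Census guidance: use the ratio formula (plus sign)
        radicand = num_moe**2 + (p**2) * (den_moe**2)
    return math.sqrt(radicand) / den_est


# Made-up county: 12,000 Hispanic residents (+/- 900) out of 100,000 (+/- 1,500)
low, high = confidence_interval(12_000, 900)
p_moe = proportion_moe(12_000, 900, 100_000, 1_500)

print(low, high)
print(round(12_000 / 100_000 * 100, 1), "+/-", round(p_moe * 100, 2), "percentage points")
```

If a decennial census proportion falls inside the interval built this way, the ACS estimate is consistent with the census count at the 90 percent confidence level.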

In [16]:
# The default for the function is the decennial census.
# For the ACS we need to change the dataset name
# This information is from https://www.census.gov/data/developers/data-sets/acs-1year.2010.html

dataset_name = 'acs/acs1'
get_vars = 'GEO_ID,NAME,B03002_001E,B03002_001M,B03002_012E,B03002_012M,B00001_001E'
        # GEO_ID      = Geographic ID
        # NAME        = Geographic Area Name
        # B03002_001E = Total (estimate); B03002_001M = its margin of error
        # B03002_012E = Total!!Hispanic or Latino (estimate); B03002_012M = its margin of error
        # B00001_001E = Sample Size
acs10_df = obtain_census_api(get_vars = get_vars, dataset_name = dataset_name)
acs10_df.head()

Census API data from: https://api.census.gov/data/2010/acs/acs1?get=GEO_ID,NAME,B03002_001E,B03002_001M,B03002_012E,B03002_012M,B00001_001E&in=state:*&in=county:*&for=county:*


Unnamed: 0,GEO_ID,NAME,B03002_001E,B03002_001M,B03002_012E,B03002_012M,B00001_001E,state,county
0,0500000US39109,"Miami County, Ohio",,,,,1789,39,109
1,0500000US39113,"Montgomery County, Ohio",535059.0,-555555555.0,12345.0,-555555555.0,7457,39,113
2,0500000US39119,"Muskingum County, Ohio",,,,,1524,39,119
3,0500000US39133,"Portage County, Ohio",,,,,2400,39,133
4,0500000US39139,"Richland County, Ohio",,,,,1896,39,139


## Step 2: Clean Data
Data cleaning is an important step in the data science process. It is often the hardest and most time-consuming step.

In [20]:
### 2.1 Set the variable type
dec10_df["P005001"] = dec10_df["P005001"].astype(int)
dec10_df["P005010"] = dec10_df["P005010"].astype(int)

# Generate new variable - Proportion
dec10_df['Percent Hispanic'] = dec10_df['P005010'] / dec10_df['P005001'] * 100
dec10_df.head()

Unnamed: 0,GEO_ID,NAME,P005001,P005010,state,county,Percent Hispanic
0,0500000US05131,"Sebastian County, Arkansas",125744,15445,5,131,12.282892
1,0500000US05133,"Sevier County, Arkansas",17058,5220,5,133,30.601477
2,0500000US05135,"Sharp County, Arkansas",17264,290,5,135,1.679796
3,0500000US05137,"Stone County, Arkansas",12394,157,5,137,1.266742
4,0500000US05139,"Union County, Arkansas",41639,1460,5,139,3.506328
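
The ACS table needs similar cleaning before proportions can be calculated. In the ACS output above, several counties have no estimate, and the margin of error column contains the value -555555555, which the Census Bureau documents as an annotation code rather than a real margin of error. The sketch below shows one possible approach on a hand-written sample; the sample rows and the choice to treat negative margins of error as missing are assumptions, not the notebook's own method.

```python
import pandas as pd

# Hand-written rows imitating the ACS output above (assumed to arrive as text or None)
acs_sample = pd.DataFrame({
    "B03002_001E": [None, "535059", None],
    "B03002_001M": [None, "-555555555", None],
    "B00001_001E": ["1789", "7457", "1524"],
})

# Convert text to numbers; anything that cannot be parsed becomes NaN
numeric_cols = ["B03002_001E", "B03002_001M", "B00001_001E"]
for col in numeric_cols:
    acs_sample[col] = pd.to_numeric(acs_sample[col], errors="coerce")

# Assumption: a negative margin of error is an annotation code, so mark it missing
moe_cols = ["B03002_001M"]
for col in moe_cols:
    acs_sample[col] = acs_sample[col].mask(acs_sample[col] < 0)

# Keep only the rows that actually have a population estimate
usable = acs_sample.dropna(subset=["B03002_001E"])
print(usable)
```

Counting how many rows survive `dropna` is a useful check, because it shows how many counties the 1-year ACS covers compared with the full decennial census.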


## Step 3: Describe the data
Descriptive statistics summarize data with numbers, tables, and graphs. The following block of code creates and formats a table using the pandas `describe` method. The table provides eight descriptive statistics for each variable: the count, the mean, the standard deviation (std), the minimum (min), the lower quartile (25%), the median (50%), the upper quartile (75%), and the maximum (max).

In [26]:
table1 = dec10_df[['P005001','P005010','Percent Hispanic']].describe().T
varformat = "{:,.0f}" # Adds a thousands separator and rounds to the nearest whole number
table_title = "Table 1. Descriptive statistics for total population, Hispanic population, and percent Hispanic by county, 2010 Decennial Census."
table1 = table1.style.set_caption(table_title).format(varformat).set_properties(**{'text-align': 'right'})
table1

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
P005001,3221,97011,309299,82,11310,26076,65880,9818605
P005010,3221,16817,114594,0,269,943,4710,4687889
Percent Hispanic,3221,10,19,0,2,3,9,100
