<a href="https://colab.research.google.com/github/npr99/PlanningMethods/blob/master/PLAN604_Descriptive_Statistics_CensusTracts_Round2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applied Example: Descriptive Statistics to Find US Census Tract Outliers
 
---
This Google Colab Notebook provides a complete workflow (sequence of steps from start to finish) that will allow you to explore [US Census Tracts](https://www.census.gov/glossary/#term_Censustract?term=Tract). 

This notebook uses as few code blocks as possible and keeps discussion to a minimum. It is designed to be modified and rerun for different states in the United States.

For a more detailed notebook, [click here](https://github.com/npr99/PlanningMethods/blob/master/PLAN604_Descriptive_Statistics_2020CensusTracts.ipynb).


# Instructions

1.   Pick a state from the [shared Google Sheet](https://docs.google.com/spreadsheets/d/1pM7gCHYsLicagsF5EjZ2xHsje63F9FQsi5CFUV8-usc/edit?usp=sharing) - *be sure to type your name in the first column.* 

*Notice that the data from 2010 and 2020 were obtained and cleaned by the previous class.*

2.   Modify the first code block so that the state FIPS code and state name match your selected state
3.   Run all of the code blocks (from the Runtime menu, click Run all)
4.   Look at the results in Tables 1 and 2 (at the bottom of the notebook)
5.   Copy and paste your results for *Total Population* into the [shared Google Sheet](https://docs.google.com/spreadsheets/d/1pM7gCHYsLicagsF5EjZ2xHsje63F9FQsi5CFUV8-usc/edit?usp=sharing) 

*Notice you are adding data from the 2000 Census*

*The goal is to have the Google Sheet completely filled in for all states and all years.*

*If you have time, add more states and check the numbers for 2010 and 2020.*



# CHANGE THE FOLLOWING VARIABLES

In [1]:
# Change the following variables
state_FIPS = '48'
state_name = 'Texas'
decennial_year = '2000'

# Once you have changed the above variables, run all of the code. (Runtime -> Run all)
# Scroll to the bottom of the page to see the output.

### Background Information
Each state in the United States has a unique two-digit FIPS code.

For [a list of State FIPS codes click here.](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696)

In the code block above, notice that the first line reads `state_FIPS = '48'`.

This code will get data for Texas (FIPS = 48).

To get data for a different state change the FIPS code.

For example, if you want data for California, change `state_FIPS = '48'` to `state_FIPS = '06'`.

Modify the code block above to get data for your selected state and selected year.

Change the FIPS code and the name of the state.

You can also change the year to get data from 2000, 2010, or 2020.

After you change these variables, run all of the code blocks (Runtime -> Run all).

Scroll to the bottom of the notebook to see the results for Table 1 and Table 2.
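
As a small illustration of the point about FIPS codes, the sketch below stores a few state codes as two-character strings. The `example_state_fips` dictionary and the `format_state_fips` helper are hypothetical names added for this example and are not part of the notebook. The helper assumes the code may have been typed as a number and restores the leading zero.

```python
# Hypothetical lookup of a few state FIPS codes (illustration only)
example_state_fips = {'Alabama': '01', 'California': '06', 'Texas': '48'}

def format_state_fips(code):
    """Return a state FIPS code as a two-character string, e.g. 6 -> '06'."""
    return str(code).zfill(2)

example_name = 'California'
example_fips = format_state_fips(example_state_fips[example_name])
print(example_name, example_fips)
```

Keeping the code as a string matters because an integer would drop the leading zero in codes such as `'06'`.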

# THE REMAINING CODE GENERATES TABLE 1 AND TABLE 2

## Step 1: Obtain Data

In [2]:
# Python packages required to read in Census API data
import requests ## Required for the Census API
import pandas as pd # For reading, writing and wrangling data

In [3]:
# Possible Census API Hyperlinks by Decennial Year
api_hyperlink = {'2000' : 'https://api.census.gov/data/2000/dec/sf1?get=H001001,P001001'
                          +'&for=tract:*&in=state:'+state_FIPS,
                 '2010' : 'https://api.census.gov/data/2010/dec/sf1?get=H001001,P001001'
                          +'&for=tract:*&in=state:'+state_FIPS,
                 '2020' : 'https://api.census.gov/data/2020/dec/pl?get=H1_001N,P1_001N'
                          +'&for=tract:*&in=state:'+state_FIPS}
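
The three hyperlinks above differ only in the year, the dataset name, and the variable names. A minimal sketch of that pattern follows; the `CENSUS_DATASETS` dictionary and the `build_census_url` function are hypothetical names introduced for this example, not part of the notebook.

```python
# Hypothetical helper: build the tract-level Census API URL for one year and state
CENSUS_DATASETS = {
    '2000': ('dec/sf1', 'H001001,P001001'),
    '2010': ('dec/sf1', 'H001001,P001001'),
    '2020': ('dec/pl', 'H1_001N,P1_001N'),
}

def build_census_url(year, state_fips):
    """Return the API URL for housing units and population by census tract."""
    dataset, variables = CENSUS_DATASETS[year]
    return (f'https://api.census.gov/data/{year}/{dataset}'
            f'?get={variables}&for=tract:*&in=state:{state_fips}')

print(build_census_url('2000', '48'))
```

Writing the pattern once makes it easier to see that only the 2020 entry uses a different dataset and different variable names.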

In [4]:
# Request the data from the Census API for the selected decennial year
print('Census API Hyperlink for Decennial Year: '+api_hyperlink[decennial_year])
apijson = requests.get(api_hyperlink[decennial_year])
# Convert the returned JSON into a pandas DataFrame:
# the first row holds the column names, the remaining rows hold the data
census_rows = apijson.json()
tractdf = pd.DataFrame(columns=census_rows[0], data=census_rows[1:])
tractdf.head()

Census API Hyperlink for Decennial Year: https://api.census.gov/data/2000/dec/sf1?get=H001001,P001001&for=tract:*&in=state:48


| | H001001 | P001001 | state | county | tract |
|---|---|---|---|---|---|
| 0 | 2077 | 4449 | 48 | 1 | 9501 |
| 1 | 1557 | 3371 | 48 | 1 | 9502 |
| 2 | 386 | 738 | 48 | 1 | 9503 |
| 3 | 170 | 14381 | 48 | 1 | 9504 |
| 4 | 1689 | 3954 | 48 | 1 | 9505 |
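
To see how the JSON response becomes a table without calling the API, the sketch below uses a small made-up list of lists in the same shape as the response: the first inner list holds the column names and each later list is one tract. The name `sample_response` and its rows are assumptions for illustration. The sketch also assumes the API returns every value as text, which is why the next step converts the counts to integers.

```python
import pandas as pd

# Made-up rows in the same shape as the API response: a list of lists,
# where the first inner list holds the column names.
sample_response = [
    ['H001001', 'P001001', 'state', 'county', 'tract'],
    ['2077', '4449', '48', '001', '950100'],
    ['1557', '3371', '48', '001', '950200'],
]

sample_df = pd.DataFrame(columns=sample_response[0], data=sample_response[1:])

# Every value arrives as text, so arithmetic needs a type conversion first
print(type(sample_df.loc[0, 'P001001']))
print(sample_df['P001001'].astype(int).sum())
```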


## Step 2: Clean Data
Data cleaning is an important step in the data science process. This step is often the hardest and most time-consuming.

In [5]:
### 2.1 Set the variable type
housing_var = 'H001001'
population_var = 'P001001'
## Note for 2020 the variables are H1_001N and P1_001N
if decennial_year == '2020':
    housing_var = 'H1_001N'
    population_var = 'P1_001N'

tractdf[housing_var] = tractdf[housing_var].astype(int)
tractdf[population_var] = tractdf[population_var].astype(int)

### 2.2 Label variables
tractdf = tractdf.rename(columns={housing_var: "Total Housing Units", 
                                population_var: "Total Population"})
tractdf.head()

| | Total Housing Units | Total Population | state | county | tract |
|---|---|---|---|---|---|
| 0 | 2077 | 4449 | 48 | 1 | 9501 |
| 1 | 1557 | 3371 | 48 | 1 | 9502 |
| 2 | 386 | 738 | 48 | 1 | 9503 |
| 3 | 170 | 14381 | 48 | 1 | 9504 |
| 4 | 1689 | 3954 | 48 | 1 | 9505 |


## Step 3: Describe the data
Descriptive statistics summarize data with numbers, tables, and graphs. The following block of code creates and formats a table using the pandas `describe` method. The table provides eight descriptive statistics: the count, the mean, the standard deviation (std), the minimum (min), the lower quartile (25%), the median (50%), the upper quartile (75%), and the maximum (max).
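
To make the eight statistics concrete, here is a minimal sketch that runs `describe` on five hypothetical tract populations. The series `example_pop` and its values are made up for illustration and do not come from the Census data.

```python
import pandas as pd

# Hypothetical populations for five tracts (illustration only)
example_pop = pd.Series([1000, 2000, 3000, 4000, 5000], name='Total Population')

# describe() returns the count, mean, std, min, quartiles, and max
summary = example_pop.describe()
print(summary)

# The median is the 50% row; the lower and upper quartiles are the 25% and 75% rows
print(summary['50%'], summary['25%'], summary['75%'])
```

With evenly spaced values like these, the mean and the median coincide. In real tract data a few very large tracts can pull the mean above the median.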

In [6]:
table1 = tractdf[['Total Population']].describe().T
varformat = "{:,.0f}" # Adds thousands separators and rounds to the nearest whole number
table_title = "Table 1. Descriptive statistics for total population " +\
              f"by census tract, {decennial_year} {state_name}"
table1 = table1.style.set_caption(table_title).format(varformat).set_properties(**{'text-align': 'right'})

#### 3.1 Z-Score Outliers
One way to identify outliers is by looking at the z-score, or the number of standard deviations an observation falls from the mean.
The formula for z-score is

>$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$

If a census tract's z-score is greater than 3 or less than -3, the tract is considered an outlier.
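
The rule can be checked on a small made-up sample before applying it to real tracts. The sketch below assumes 20 hypothetical tracts of typical size plus one very large tract; the series `example_pop` and its values are illustrative only.

```python
import pandas as pd

# Hypothetical tract populations: 20 typical tracts and one very large tract
example_pop = pd.Series([3000, 3500, 4000, 4500, 5000] * 4 + [30000])

# z-score = (observation - mean) / standard deviation
z_scores = (example_pop - example_pop.mean()) / example_pop.std()

# Keep only the observations more than 3 standard deviations from the mean
outliers = example_pop[z_scores.abs() > 3]
print(outliers.tolist())
```

Note that pandas `std` computes the sample standard deviation by default, and that in very small samples no observation can reach a z-score of 3, so the rule is only meaningful with enough tracts.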

In [7]:
mean = tractdf['Total Population'].mean()
standard_deviation = tractdf['Total Population'].std()
tractdf['Total Population Z-score'] = (tractdf['Total Population'] - mean)/standard_deviation
# Create a new variable to identify outliers
tractdf['Z-score Outlier Total Population'] = 0
tractdf.loc[abs(tractdf['Total Population Z-score']) > 3, 
            'Z-score Outlier Total Population'] = 1

In [8]:
table2 = tractdf[['Total Population','Z-score Outlier Total Population']].\
    loc[tractdf['Z-score Outlier Total Population'] == 1].describe().T
varformat = "{:,.0f}" # Adds thousands separators and rounds to the nearest whole number
table_title = "Table 2. Descriptive statistics for Z-score outliers " +\
              f"by census tract, {decennial_year} {state_name}"
table2 = table2.style.set_caption(table_title).format(varformat).set_properties(**{'text-align': 'right'})

# RESULTS TO COPY AND PASTE INTO THE SHARED GOOGLE SHEET

In [9]:
# Display Table 1
table1

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Total Population | 4388 | 4752 | 2430 | 0 | 3022 | 4392 | 6019 | 22368 |


In [10]:
# Display Table 2
table2

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Total Population | 49 | 14615 | 2308 | 12145 | 13098 | 13701 | 15318 | 22368 |
| Z-score Outlier Total Population | 49 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
