<a href="https://colab.research.google.com/github/npr99/PlanningMethods/blob/master/PLAN604_Descriptive_Statistics_CensusTracts_Round2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Application of Descriptive Statistics: Finding US Census Tract Outliers Round 2
 
---
This Google Colab Notebook provides a complete workflow (sequence of steps from start to finish) that will allow you to explore [US Census Tracts](https://www.census.gov/glossary/#term_Censustract?term=Tract). 

This notebook uses as few code blocks as possible and keeps discussion to a minimum. It is designed to be modified and rerun for different states in the United States.

# Instructions


1.   Read the text in Step 1
2.   Pick a state from the [shared Google Sheet](https://docs.google.com/spreadsheets/d/1pM7gCHYsLicagsF5EjZ2xHsje63F9FQsi5CFUV8-usc/edit?usp=sharing) - *be sure to type your name in the first column.* 

*Notice that the data from 2000 and 2000 were obtained and cleaned by the previous class.*

3.   Modify the Census API code so that the state FIPS code matches your selected state
4.   Run all of the code blocks (Runtime -> Run All) 
5.   Look at the results in Tables 1 and 2
6.   Copy and paste your results for *Total Population* into the [shared Google Sheet](https://docs.google.com/spreadsheets/d/1pM7gCHYsLicagsF5EjZ2xHsje63F9FQsi5CFUV8-usc/edit?usp=sharing) 

*Notice you are adding data from the 2000 Census*



In [1]:
# Python packages required to request and read in Census API data
import requests ## Required for the Census API
import pandas as pd # For reading, writing and wrangling data

## Step 1: Obtain Data
Each state in the United States has a unique two-digit FIPS code.

For [a list of State FIPS codes click here.](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696)

In the code block below, notice that the first line contains the text `in=state:48`.

This code will get data for Texas (FIPS = 48).

To get data for a different state, change the FIPS code. Keep both digits, including any leading zero.

For example, if you want data for California, change `in=state:48` to `in=state:06` 
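The state substitution described above can be sketched with a small helper, so that only one value changes between runs. The function name `build_tract_url` and the parameter `state_fips` are hypothetical and not part of the original notebook; the URL is the one this notebook already uses.

```python
def build_tract_url(state_fips):
    """Return the 2000 Census SF1 tract query URL for one state (hypothetical helper)."""
    # Keep the FIPS code as a two-character string so a leading zero survives ("06", not 6)
    state_fips = str(state_fips).zfill(2)
    return ('https://api.census.gov/data/2000/dec/sf1'
            '?get=H001001,P001001&for=tract:*&in=state:' + state_fips)

texas_url = build_tract_url('48')       # same URL as in the notebook
california_url = build_tract_url('06')  # only the state code changes
print(california_url)
```

The resulting string could then be passed to `requests.get` in place of the hard-coded URL.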

In [2]:
apijson = requests.get('https://api.census.gov/data/2000/dec/sf1?get=H001001,P001001&for=tract:*&in=state:48')
apijson.raise_for_status()  # Stop early if the request failed
# Convert the requested JSON into a pandas dataframe (the first row holds the column names)
rows = apijson.json()
tractdf = pd.DataFrame(columns=rows[0], data=rows[1:])
tractdf.head()

Unnamed: 0,H001001,P001001,state,county,tract
0,2077,4449,48,1,9501
1,1557,3371,48,1,9502
2,386,738,48,1,9503
3,170,14381,48,1,9504
4,1689,3954,48,1,9505
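To see why the conversion above treats the first row differently from the rest, here is a minimal sketch that uses a small hand-made list in place of the live API response. The name `sample_rows` is hypothetical, and its values are typed in from the output above as text for illustration; nothing here is fetched from the API.

```python
import pandas as pd

# Hand-made stand-in for apijson.json(): a list of lists whose first row is the header
sample_rows = [
    ["H001001", "P001001", "state", "county", "tract"],
    ["2077", "4449", "48", "1", "9501"],
    ["1557", "3371", "48", "1", "9502"],
]

# First row becomes the column names, the remaining rows become the data
sampledf = pd.DataFrame(columns=sample_rows[0], data=sample_rows[1:])
print(sampledf.shape)
print(list(sampledf.columns))
```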


## Step 2: Clean Data
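Data cleaning is an important step in the data science process. This step is often the hardest and most time-consuming. Here, the Census API returns every value as text, so the counts must be converted to integers before any statistics can be computed.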

In [3]:
### 2.1 Set the variable type
tractdf["H001001"] = tractdf["H001001"].astype(int)
tractdf["P001001"] = tractdf["P001001"].astype(int)

### 2.2 Label variables
tractdf = tractdf.rename(columns={"H001001": "Total Housing Units", 
                                  "P001001": "Total Population"})
tractdf.head()

Unnamed: 0,Total Housing Units,Total Population,state,county,tract
0,2077,4449,48,1,9501
1,1557,3371,48,1,9502
2,386,738,48,1,9503
3,170,14381,48,1,9504
4,1689,3954,48,1,9505
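A short sketch of why step 2.1 matters, assuming the counts arrive as text. The toy values and the names `toydf`, `numeric_before`, and `numeric_after` are illustrative only. Until the column is converted, pandas does not treat it as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy column with counts stored as text, as an assumed stand-in for the API result
toydf = pd.DataFrame({"P001001": ["4449", "3371", "738"]})
numeric_before = is_numeric_dtype(toydf["P001001"])  # checked while the column is text

# Same conversion as step 2.1
toydf["P001001"] = toydf["P001001"].astype(int)
numeric_after = is_numeric_dtype(toydf["P001001"])   # checked after conversion

toy_mean = toydf["P001001"].mean()  # arithmetic now works as expected
print(numeric_before, numeric_after, toy_mean)
```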


## Step 3: Describe the data
Descriptive statistics summarize data with numbers, tables, and graphs. The following block of code creates and formats a table using the pandas `describe` function. The table provides eight descriptive statistics: the count, the mean, the standard deviation (std), the minimum (min), the lower quartile (25%), the median (50%), the upper quartile (75%), and the maximum (max).

In [22]:
table1 = tractdf[['Total Population']].describe().T
varformat = "{:,.0f}" # Adds a thousands separator and rounds to the nearest whole number
table_title = "Table 1. Descriptive statistics for total population by census tract, 2000."
table1 = table1.style.set_caption(table_title).format(varformat).set_properties(**{'text-align': 'right'})
table1

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Total Population,6896,4226,2026,0,2860,3957,5194,30199
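As a rough illustration of what `describe` reports, the sketch below computes the same eight statistics one at a time on a small made-up series and prints them next to the `describe` output. The values in `toy` are invented for illustration and are not census data.

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 100])  # made-up values with one large observation

manual = pd.Series({
    "count": toy.count(),
    "mean": toy.mean(),
    "std": toy.std(),            # sample standard deviation (divides by n - 1)
    "min": toy.min(),
    "25%": toy.quantile(0.25),   # lower quartile
    "50%": toy.median(),         # median
    "75%": toy.quantile(0.75),   # upper quartile
    "max": toy.max(),
})

print(manual)
print(toy.describe())
```

In this toy series the single large value pulls the mean well above the median, which is one reason to look at both.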


## Step 4: Identify Z-Score Outliers
One way to identify outliers is by looking at the z-score, or the number of standard deviations an observation falls from the mean. 
The formula for the z-score is

>$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$

If a census tract's z-score is greater than 3 or less than -3 (that is, the tract is more than 3 standard deviations from the mean), it is considered an outlier.
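As a worked example, the rounded values from Table 1 (mean 4,226 and standard deviation 2,026) give the population cutoffs below. Because the inputs are rounded, the results are approximate. The names `mean_pop`, `std_pop`, and `z_score` are chosen here for illustration.

```python
mean_pop = 4226   # mean from Table 1 (rounded)
std_pop = 2026    # standard deviation from Table 1 (rounded)

upper_cutoff = mean_pop + 3 * std_pop   # tracts above this have z > 3
lower_cutoff = mean_pop - 3 * std_pop   # tracts below this have z < -3

def z_score(observation):
    """Number of standard deviations an observation falls from the mean."""
    return (observation - mean_pop) / std_pop

print(upper_cutoff, lower_cutoff)
print(round(z_score(30199), 2))   # z-score of the largest tract in Table 1
```

Since the lower cutoff is negative and a population count cannot be, only unusually large tracts can be flagged with these values.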

In [4]:
mean = tractdf['Total Population'].mean()
standard_deviation = tractdf['Total Population'].std()
tractdf['Total Population Z-score'] = (tractdf['Total Population'] - mean)/standard_deviation
# Create a new variable to identify outliers
tractdf['Z-score Outlier Total Population'] = 0
tractdf.loc[abs(tractdf['Total Population Z-score']) > 3, 
            'Z-score Outlier Total Population'] = 1

In [6]:
table2 = tractdf[['Total Population','Z-score Outlier Total Population']].\
    loc[tractdf['Z-score Outlier Total Population'] == 1].describe().T
varformat = "{:,.0f}" # Adds a thousands separator and rounds to the nearest whole number
table_title = "Table 2. Descriptive statistics for Z-score outlier census tracts, 2000."
table2 = table2.style.set_caption(table_title).format(varformat).set_properties(**{'text-align': 'right'})
table2

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Total Population,49,14615,2308,12145,13098,13701,15318,22368
Z-score Outlier Total Population,49,1,0,1,1,1,1,1
