# MSDS 692 Practicum 1
# Mary Hollon 2-24-2025

# Overview
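This notebook pulls data from the U.S. Census Bureau's **American Community Survey (ACS)** through the ACS API, using the `census` Python package. The data retrieved includes key demographic, economic, and housing indicators for all 50 U.S. states plus the District of Columbia and Puerto Rico (52 rows per year).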

This codebook serves as a reference for understanding and working with the Census data pull script.


# Required Python Packages:

- `requests`
- `pandas`
- `census`
- `us`


# Variables Pulled:

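The script extracts 42 ACS estimates. All of them are raw counts, medians, averages, or aggregates; rates such as poverty or unemployment are not pulled directly and have to be derived from the counts. The variables fall into six groups:

- **Demographics**: Total population, median age, population by sex, race and Hispanic origin, foreign-born population
- **Income**: Median household income, per capita income, population below the poverty level, labor force and unemployed counts
- **Housing**: Owner- and renter-occupied units, vacant units, median home value, median rent, average household size
- **Transportation**: Commute mode counts (car, public transit, walking, working from home) and travel time to work
- **Health Insurance**: Health insurance coverage counts
- **Education**: Educational attainment for the population 25 and over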
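Because the script stores counts rather than rates, percentages have to be computed after the pull. The sketch below is one possible way to do that with the column names used in this notebook; the helper name `add_rates` and the `*_Pct` column names are illustrative and not part of the original script. The sample row uses the Pennsylvania 2020 values printed later in this notebook. Note that `Labor_Force_Population` (B23025_002E) includes the armed forces, so the ratio here is only an approximation of the official civilian unemployment rate.

```python
import pandas as pd


def add_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with percentage columns derived from raw counts."""
    out = df.copy()
    out["Poverty_Rate_Pct"] = (
        100 * out["Population_in_Poverty"] / out["Total_Population_for_Poverty"]
    )
    # Approximation: the denominator includes armed forces members.
    out["Unemployment_Rate_Pct"] = (
        100 * out["Unemployed_Population"] / out["Labor_Force_Population"]
    )
    occupied = out["Owner_Occupied_Homes"] + out["Renter_Occupied_Homes"]
    out["Home_Ownership_Rate_Pct"] = 100 * out["Owner_Occupied_Homes"] / occupied
    return out


# Pennsylvania, 2020, copied from the sample output in this notebook.
sample = pd.DataFrame(
    {
        "State": ["Pennsylvania"],
        "Population_in_Poverty": [1480430.0],
        "Total_Population_for_Poverty": [12387061.0],
        "Unemployed_Population": [351248.0],
        "Labor_Force_Population": [6566126.0],
        "Owner_Occupied_Homes": [3522269.0],
        "Renter_Occupied_Homes": [1584332.0],
    }
)
rates = add_rates(sample)
print(rates.filter(like="_Pct").round(2))
```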



# Data Processing Steps

1. **Establish API Connection**: Creates a `Census` client from the `census` package, using the Census API key to authenticate requests.
2. **Define API Parameters**: Selects the dataset (`acs5`), geography (`state:*`), and variables of interest.
3. **Retrieve Data**: Requests the selected variables once for each ACS year from 2013 through 2022.
4. **Convert to Pandas DataFrame**: Structures the API response in a tabular format.
5. **Data Cleaning**: Renames columns to readable names, drops the geography code columns, and indexes rows by state and year.
6. **Save to CSV**: Saves the processed data as one CSV file per year (`acs_data_<year>.csv`) for further analysis.



# Example API Request:

The script leaves request construction to the `census` package, but an equivalent raw request might look like:

https://api.census.gov/data/2022/acs/acs5?get=NAME,B01001_001E,B19013_001E&for=state:*

This retrieves the **total population (B01001_001E)** and **median household income (B19013_001E)** for all U.S. states.
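For readers who want to call the endpoint without the `census` package, the sketch below shows how such a URL could be assembled and how the response could be turned into a DataFrame. The Census API returns JSON as a list of lists whose first row is the header, with values as strings. The helper names `build_acs_url` and `rows_to_frame` are hypothetical, and the payload is hand-written in the API's shape (using the 2020 figures printed later in this notebook) so the example runs offline.

```python
import pandas as pd

ACS_API_BASE = "https://api.census.gov/data"


def build_acs_url(year, variables, geography="state:*", dataset="acs/acs5"):
    """Assemble a request URL in the same shape as the sample request."""
    return f"{ACS_API_BASE}/{year}/{dataset}?get={','.join(variables)}&for={geography}"


def rows_to_frame(rows):
    """Convert a header row followed by data rows into a DataFrame."""
    header, *records = rows
    return pd.DataFrame(records, columns=header)


url = build_acs_url(2022, ["NAME", "B01001_001E", "B19013_001E"])
print(url)

# A live call would be: rows = requests.get(url, timeout=30).json()
# Hand-written payload in the API's list-of-lists shape (assumption for the demo).
rows = [
    ["NAME", "B01001_001E", "B19013_001E", "state"],
    ["Pennsylvania", "12794885", "63627", "42"],
    ["Utah", "3151239", "74197", "49"],
]
frame = rows_to_frame(rows)

# The API returns strings, so numeric columns need an explicit conversion.
value_cols = ["B01001_001E", "B19013_001E"]
frame[value_cols] = frame[value_cols].apply(pd.to_numeric)
print(frame.dtypes)
```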


# Notes and Miscellaneous:

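- The ACS 5-Year dataset pools survey responses collected over 60 months, so each release is a period estimate (the 2022 release covers 2018 through 2022) and consecutive releases share four years of underlying data.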
- Ensure API rate limits are respected to avoid request failures.
- Missing values or errors may occur due to incomplete Census data.
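One practical form of the missing-value problem: the ACS API reports unavailable or suppressed estimates as large negative placeholder numbers (for example `-666666666`) rather than blanks. They are uncommon at the state level but frequent for small geographies. The sketch below shows one way to convert them to `NaN` before analysis. The list of codes is an assumption based on the Census Bureau's published estimate and annotation values and should be checked against the current documentation; `mask_sentinels` is a hypothetical helper, and the second row is made up to trigger the replacement.

```python
import numpy as np
import pandas as pd

# Assumed placeholder codes; verify against current ACS documentation.
ACS_SENTINELS = [
    -666666666, -999999999, -888888888,
    -222222222, -333333333, -555555555,
]


def mask_sentinels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with ACS placeholder codes replaced by NaN."""
    out = df.copy()
    numeric = out.select_dtypes(include="number").columns
    out[numeric] = out[numeric].mask(out[numeric].isin(ACS_SENTINELS))
    return out


demo = pd.DataFrame(
    {
        "State": ["Pennsylvania", "Hypothetical area"],
        # First value is from this notebook; second is a made-up placeholder.
        "Median_Household_Income": [63627, -666666666],
    }
)
cleaned = mask_sentinels(demo)
print(cleaned)
```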


# Future Enhancements:

- Implement error handling for failed API requests.
- Add functionality for county-level data retrieval.
- Automate periodic updates of the dataset.
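As a starting point for the first enhancement, the sketch below wraps any fetch call in a simple retry loop with a growing pause between attempts. It is not part of the current script; `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the demo uses a simulated failure so it runs without network access. Catching every `Exception` keeps the sketch short; a real implementation would probably narrow this to network and API errors.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds or retries run out, then re-raise."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose for the sketch
            last_error = exc
            print(f"Attempt {attempt} of {retries} failed: {exc}")
            if attempt < retries:
                sleep(delay * attempt)  # wait longer after each failure
    raise last_error


# Simulated fetch that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated outage")
    return [{"NAME": "Utah"}]


result = fetch_with_retry(flaky_fetch, retries=3, delay=0, sleep=lambda seconds: None)
print(result)
```

In the extraction loop, the same wrapper could be applied as `fetch_with_retry(lambda: c.acs5.state(var_list + ["NAME"], "*", year=int(year)))`.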


In [None]:
# This file pulls all of the variables and writes one CSV file per ACS year

In [1]:
import requests
import pandas as pd
pd.set_option('display.max_rows', 200, 'display.max_columns', 200)
import time
from datetime import datetime


from census import Census
from us import states

from census_credentials import MY_API_KEY

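# Raw API settings, kept for reference only; the census client below
# builds its own request URLs and does not read these constants.
ACS_API_BASE = "https://api.census.gov/data"
DATASET = "acs/acs5"  # 5-Year ACS dataset
GEOGRAPHY = "state:*"  # Fetch data for all states

# The API key is imported from the local census_credentials module
API_KEY = MY_API_KEY
c = Census(API_KEY)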

In [2]:
ACS_YEARS = ["2022", "2021", "2020", "2019", "2018", "2017", "2016", "2015", "2014", "2013"]  # Last 10 ACS years
DATASET = "acs5"  # 5-Year ACS dataset

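# Define variables of interest with meaningful names.
# NOTE: inline comments flag codes whose meaning in the ACS table shells differs
# from the column name. Names are unchanged so the saved CSV headers still match.
VARIABLES = {
    "Demographics": {
        "B01001_001E": "Total_Population",
        "B01002_001E": "Median_Age",
        "B01001_002E": "Male_Population",
        "B01001_026E": "Female_Population",
        "B02001_002E": "White_Population",
        "B02001_003E": "Black_Population",
        "B02001_005E": "Asian_Population",
        "B03003_003E": "Hispanic_Population",
        "B21001_001E": "Veteran_Status",  # table universe: civilian population 18+; veterans are B21001_002E
        "B05002_013E": "Foreign_Born_Population",
        "B16005_001E": "Non_English_Speakers"  # table universe: population 5 years and over
    },
    "Income": {
        "B19013_001E": "Median_Household_Income",
        "B19301_001E": "Per_Capita_Income",
        "B17001_002E": "Population_in_Poverty",
        "B17001_001E": "Total_Population_for_Poverty",
        "B23025_005E": "Unemployed_Population",
        "B23025_002E": "Labor_Force_Population"
    },
    "Housing": {
        "B25002_001E": "Total_Households",  # total housing units (occupied + vacant), not households
        "B25003_002E": "Owner_Occupied_Homes",
        "B25003_003E": "Renter_Occupied_Homes",
        "B25077_001E": "Median_Home_Value",
        "B25064_001E": "Median_Rent",
        "B25070_007E": "Cost_Burdened_Households",  # only renters paying 30.0-34.9% of income on gross rent
        "B25004_001E": "Vacancy_Rate",  # count of vacant units, not a rate
        "B25010_001E": "Average_Household_Size"
    },
    "Transportation": {
        "B08013_001E": "Mean_Travel_Time",  # aggregate travel time to work in minutes, not a mean
        "B08301_002E": "Car_Commute",
        "B08301_010E": "Public_Transit_Commute",
        "B08301_019E": "Walk_Commute",
        "B08301_021E": "Work_From_Home"
    },
    "Health_Insurance": {
        "B27010_001E": "Total_Insured_Population",  # table universe: civilian noninstitutionalized population
        "B27010_017E": "Total_Uninsured_Population",  # under 19 with no coverage
        "B27010_002E": "Private_Insurance",  # total population under 19
        "B27010_033E": "Public_Insurance",  # ages 19-34 with no coverage
        "B27010_050E": "No_Health_Insurance"  # ages 35-64 with no coverage
    },
    "Education": {
        "B15003_001E": "Total_Population_25_Over",
        "B15003_002E": "Less_Than_HS",  # no schooling completed only
        "B15003_017E": "High_School_Graduate",  # regular high school diploma (GED is _018E)
        "B15003_020E": "Some_College",  # some college, 1 or more years, no degree
        "B15003_022E": "Associates_Degree",  # actually bachelor's degree (associate's is _021E)
        "B15003_023E": "Bachelors_Degree",  # actually master's degree
        "B15003_025E": "Graduate_Degree"  # doctorate only (professional degree is _024E)
    }
}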


In [9]:
start_time = datetime.now()

for year in ACS_YEARS:
    print(f"Fetching data for {year}...")
    var_list = [var for category in VARIABLES.values() for var in category.keys()]
    data = c.acs5.state(var_list + ["NAME"], "*", year=int(year))
    
    df = pd.DataFrame(data)
    df.rename(columns={"NAME": "State", **{k: v for d in VARIABLES.values() for k, v in d.items()}}, inplace=True)
    df.drop(columns=[col for col in df.columns if "GEO_ID" in col or "state" in col], inplace=True, errors='ignore')
    df.drop(columns=[col for col in df.columns if "Unnamed" in col], inplace=True, errors='ignore')  # Drop unnamed columns
    df["Year"] = year
    df.set_index(["State", "Year"], inplace=True)  # Re-index by State and Year
    df.to_csv(f"acs_data_{year}.csv")
    print(f"Saved data for {year} as 'acs_data_{year}.csv'")
    
    time.sleep(5)  # Pause to avoid rate limits

end_time = datetime.now()
print(f"ACS data extraction completed in {end_time - start_time}.")


Fetching data for 2022...
Saved data for 2022 as 'acs_data_2022.csv'
Fetching data for 2021...
Saved data for 2021 as 'acs_data_2021.csv'
Fetching data for 2020...
Saved data for 2020 as 'acs_data_2020.csv'
Fetching data for 2019...
Saved data for 2019 as 'acs_data_2019.csv'
Fetching data for 2018...
Saved data for 2018 as 'acs_data_2018.csv'
Fetching data for 2017...
Saved data for 2017 as 'acs_data_2017.csv'
Fetching data for 2016...
Saved data for 2016 as 'acs_data_2016.csv'
Fetching data for 2015...
Saved data for 2015 as 'acs_data_2015.csv'
Fetching data for 2014...
Saved data for 2014 as 'acs_data_2014.csv'
Fetching data for 2013...
Saved data for 2013 as 'acs_data_2013.csv'
ACS data extraction completed in 0:01:03.739634.
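The loop above writes one file per year. If a single multi-year table is needed, the per-year files can be stacked afterwards. The sketch below is one way to do that, assuming the files sit in the working directory under the names printed above; `combine_years` and the output name `acs_all_years.csv` are hypothetical choices. The output name deliberately does not match the `acs_data_*.csv` pattern, so a rerun does not read its own result back in.

```python
from pathlib import Path

import pandas as pd


def combine_years(paths):
    """Stack per-year CSV files into one frame sorted by State and Year."""
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    return combined.sort_values(["State", "Year"], ignore_index=True)


# Pick up whatever per-year files exist in the working directory.
paths = sorted(Path(".").glob("acs_data_*.csv"))
if paths:
    combined = combine_years(paths)
    combined.to_csv("acs_all_years.csv", index=False)
    print(combined.shape)
else:
    print("No acs_data_*.csv files found; run the extraction loop first.")
```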


### Let's Check One of the .csv files 

In [12]:
df=pd.read_csv("acs_data_2020.csv")

In [13]:
df.head()

Unnamed: 0,State,Year,Total_Population,Median_Age,Male_Population,Female_Population,White_Population,Black_Population,Asian_Population,Hispanic_Population,Veteran_Status,Foreign_Born_Population,Non_English_Speakers,Median_Household_Income,Per_Capita_Income,Population_in_Poverty,Total_Population_for_Poverty,Unemployed_Population,Labor_Force_Population,Total_Households,Owner_Occupied_Homes,Renter_Occupied_Homes,Median_Home_Value,Median_Rent,Cost_Burdened_Households,Vacancy_Rate,Average_Household_Size,Mean_Travel_Time,Car_Commute,Public_Transit_Commute,Walk_Commute,Work_From_Home,Total_Insured_Population,Total_Uninsured_Population,Private_Insurance,Public_Insurance,No_Health_Insurance,Total_Population_25_Over,Less_Than_HS,High_School_Graduate,Some_College,Associates_Degree,Bachelors_Degree,Graduate_Degree
0,Pennsylvania,2020,12794885.0,40.9,6269142.0,6525743.0,10155004.0,1419582.0,449320.0,971813.0,10137264.0,896853.0,12092654.0,63627.0,35518.0,1480430.0,12387061.0,351248.0,6566126.0,5713345.0,3522269.0,1584332.0,187500.0,958.0,125809.0,606744.0,2.42,153412435.0,5029173.0,315578.0,211881.0,433801.0,12590644.0,128825.0,2818899.0,257225.0,310278.0,8989998.0,91862.0,2726970.0,930989.0,1754311.0,821652.0,141105.0
1,California,2020,39346023.0,36.7,19562882.0,19783141.0,22053721.0,2250962.0,5834312.0,15380929.0,30248480.0,10463818.0,36936941.0,78672.0,38576.0,4853434.0,38589882.0,1229079.0,20016955.0,14210945.0,7241318.0,5861796.0,538500.0,1586.0,527570.0,1107831.0,2.94,498276585.0,14963132.0,843498.0,461980.0,1529697.0,38838726.0,308355.0,9465391.0,1045925.0,1393201.0,26665143.0,773807.0,4804099.0,4008134.0,5764827.0,2377177.0,447039.0
2,West Virginia,2020,1807426.0,42.7,893743.0,913683.0,1672255.0,64285.0,14228.0,28679.0,1440304.0,29584.0,1711734.0,48037.0,27346.0,300152.0,1755591.0,52031.0,798208.0,893615.0,540786.0,193449.0,123200.0,732.0,13895.0,159380.0,2.4,18180565.0,662221.0,6085.0,20172.0,33353.0,1778080.0,11117.0,387727.0,41006.0,56763.0,1283869.0,9967.0,430438.0,161126.0,163598.0,80602.0,12839.0
3,Utah,2020,3151239.0,31.1,1586950.0,1564289.0,2682881.0,38059.0,73190.0,446067.0,2218812.0,264538.0,2902637.0,74197.0,30986.0,283360.0,3102049.0,57829.0,1600462.0,1110369.0,707663.0,295682.0,305400.0,1090.0,27164.0,107024.0,3.09,30350155.0,1286574.0,33830.0,35471.0,138218.0,3124563.0,67725.0,977347.0,98790.0,110879.0,1868472.0,16060.0,368759.0,327524.0,429936.0,156361.0,27681.0
4,New York,2020,19514849.0,39.0,9474184.0,10040665.0,12160045.0,3002401.0,1674216.0,3720707.0,15420195.0,4372167.0,18374180.0,71117.0,40898.0,2581048.0,19009098.0,570570.0,10032721.0,8362971.0,4014516.0,3402708.0,325000.0,1315.0,292744.0,945747.0,2.55,287526290.0,5415544.0,2418334.0,548517.0,634197.0,19276809.0,108372.0,4310934.0,404307.0,500025.0,13649157.0,294862.0,2931187.0,1455960.0,2854930.0,1625827.0,218016.0


In [18]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 44 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   State                         52 non-null     object 
 1   Year                          52 non-null     int64  
 2   Total_Population              52 non-null     float64
 3   Median_Age                    52 non-null     float64
 4   Male_Population               52 non-null     float64
 5   Female_Population             52 non-null     float64
 6   White_Population              52 non-null     float64
 7   Black_Population              52 non-null     float64
 8   Asian_Population              52 non-null     float64
 9   Hispanic_Population           52 non-null     float64
 10  Veteran_Status                52 non-null     float64
 11  Foreign_Born_Population       52 non-null     float64
 12  Non_English_Speakers          52 non-null     float64
 13  Median_

In [20]:
df.shape

(52, 44)

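The data looks as expected: 52 rows (50 states, the District of Columbia, and Puerto Rico) and 44 columns (State, Year, and the 42 ACS variables).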
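The visual check can be backed by a few automated assertions. The sketch below is a minimal example, not part of the original notebook: `check_frame` is a hypothetical helper that verifies the row count, looks for duplicate states, and confirms that the male and female counts add up to the total population. The demo uses the five rows shown in the `df.head()` output, so the expected row count is set to 5 there; for a full file the default of 52 applies.

```python
import pandas as pd


def check_frame(df, expected_rows=52):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["State"].duplicated().any():
        problems.append("duplicate State rows")
    mismatch = df["Male_Population"] + df["Female_Population"] != df["Total_Population"]
    if mismatch.any():
        names = list(df.loc[mismatch, "State"])
        problems.append(f"male + female != total for: {names}")
    return problems


# The five 2020 rows shown in the df.head() output.
head_rows = pd.DataFrame(
    {
        "State": ["Pennsylvania", "California", "West Virginia", "Utah", "New York"],
        "Total_Population": [12794885.0, 39346023.0, 1807426.0, 3151239.0, 19514849.0],
        "Male_Population": [6269142.0, 19562882.0, 893743.0, 1586950.0, 9474184.0],
        "Female_Population": [6525743.0, 19783141.0, 913683.0, 1564289.0, 10040665.0],
    }
)
issues = check_frame(head_rows, expected_rows=5)
print(issues if issues else "All checks passed")
```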

# Conclusion:

This codebook serves as a reference for understanding and working with the Census data pull script. By following the outlined steps, users can efficiently extract and analyze demographic, economic, and housing data from the ACS API. Future enhancements could further streamline data retrieval and improve handling of missing or incomplete data. With this framework in place, users can confidently leverage Census data for research, policymaking, and analytical insights.


## END of NOTEBOOK