## Overview

The two datasets below are candidates for this project, each shown with a random sample of 5 rows. Both samples cover only 2022, but I may use additional years later.

### Potential Dataset 1 Example

American Community Survey 1-Year Data (2005-2024)

- https://www.census.gov/data/developers/data-sets/acs-1year.html
- https://www.census.gov/data/developers/guidance/api-user-guide.Help_&_Contact_Us.html

Variable list for 2022:
- https://api.census.gov/data/2022/acs/acs1/variables.html

In [24]:
import pandas as pd
from census import Census

# 1. Load my API key and create the Census client
with open("census_apikey.txt") as f:
    API_KEY = f.read().strip()
c = Census(API_KEY)

# 2. Educational attainment variables to request
# IMPORTANT: THESE CODES ARE FOR 2022 **ONLY**
YEAR = 2022
VARIABLE_CODES = (
    'NAME',
    'B15003_001E',  # Total Population 25 Years and Over
    'B15003_016E',  # 12th grade, no diploma
    'B15003_017E',  # Regular high school diploma
    'B15003_018E',  # GED or alternative credential
    'B15003_019E',  # Some college, less than 1 year
    'B15003_020E',  # Some college, 1 or more years, no degree
    'B15003_021E',  # Associate's degree
    'B15003_022E',  # Bachelor's degree
    'B15003_023E',  # Master's degree
    'B15003_024E',  # Professional school degree
    'B15003_025E'   # Doctorate degree
)

# request every state
GEOGRAPHY_FOR = 'state:*'

# 3. Make the API call
data = c.acs1.get(
    VARIABLE_CODES,
    {'for': GEOGRAPHY_FOR},
    year = YEAR
)

# 4. Put the result in a DataFrame with readable column names
df_education = pd.DataFrame(data)

df_education = df_education.rename(columns={
    'B15003_001E': 'Total_Pop_25_Plus',
    'B15003_016E': 'Edu_12th_No_Diploma',
    'B15003_017E': 'Edu_Regular_HS_Diploma',
    'B15003_018E': 'Edu_GED_or_Equiv',
    'B15003_019E': 'Edu_Some_College_Less_1_Year',
    'B15003_020E': 'Edu_Some_College_1_Plus_Year',
    'B15003_021E': 'Edu_Associates_Degree',
    'B15003_022E': 'Edu_Bachelors_Degree',
    'B15003_023E': 'Edu_Masters_Degree',
    'B15003_024E': 'Edu_Professional_Degree',
    'B15003_025E': 'Edu_Doctorate_Degree',
    'state': 'State_FIPS'
})

# count everyone with at least a high school diploma or equivalent
df_education['HS_Grad_or_Higher_Count'] = (
    df_education['Edu_Regular_HS_Diploma'] +
    df_education['Edu_GED_or_Equiv'] +
    df_education['Edu_Some_College_Less_1_Year'] +
    df_education['Edu_Some_College_1_Plus_Year'] +
    df_education['Edu_Associates_Degree'] +
    df_education['Edu_Bachelors_Degree'] +
    df_education['Edu_Masters_Degree'] +
    df_education['Edu_Professional_Degree'] +
    df_education['Edu_Doctorate_Degree']
)

df_education['HS_Grad_or_Higher_Pct'] = (
    df_education['HS_Grad_or_Higher_Count'] / df_education['Total_Pop_25_Plus']
) * 100
df_education['HS_Grad_or_Higher_Pct'] = df_education['HS_Grad_or_Higher_Pct'].round(2)

df_education["Year"] = YEAR

df_education.sample(n = 5)

| | NAME | Total_Pop_25_Plus | Edu_12th_No_Diploma | Edu_Regular_HS_Diploma | Edu_GED_or_Equiv | Edu_Some_College_Less_1_Year | Edu_Some_College_1_Plus_Year | Edu_Associates_Degree | Edu_Bachelors_Degree | Edu_Masters_Degree | Edu_Professional_Degree | Edu_Doctorate_Degree | State_FIPS | HS_Grad_or_Higher_Count | HS_Grad_or_Higher_Pct | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Alaska | 489218.0 | 8169.0 | 116008.0 | 26447.0 | 33863.0 | 82215.0 | 48370.0 | 94168.0 | 39551.0 | 9379.0 | 6548.0 | 2 | 456549.0 | 93.32 | 2022 |
| 36 | Oklahoma | 2661141.0 | 49878.0 | 675343.0 | 138202.0 | 209763.0 | 377407.0 | 223641.0 | 495467.0 | 187067.0 | 44357.0 | 32026.0 | 40 | 2383273.0 | 89.56 | 2022 |
| 33 | North Carolina | 7372120.0 | 106820.0 | 1536443.0 | 298449.0 | 479544.0 | 944254.0 | 742711.0 | 1678483.0 | 708100.0 | 139874.0 | 121542.0 | 37 | 6649400.0 | 90.2 | 2022 |
| 4 | California | 26866773.0 | 738320.0 | 4846859.0 | 659996.0 | 1610206.0 | 3582648.0 | 2115440.0 | 6056169.0 | 2643964.0 | 718109.0 | 517699.0 | 6 | 22751090.0 | 84.68 | 2022 |
| 17 | Kentucky | 3091499.0 | 52084.0 | 806014.0 | 196657.0 | 226878.0 | 388240.0 | 272315.0 | 508620.0 | 253773.0 | 61733.0 | 37210.0 | 21 | 2751440.0 | 89.0 | 2022 |
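
Since more years may be needed later, the single-year call in the cell above could be wrapped in a loop. The following is a minimal sketch and not part of the original notebook: `collect_years` and its `fetch` parameter are hypothetical names, and it assumes the same `B15003` codes are valid for every requested year, which should be checked against each year's variable list.

```python
from typing import Callable, Iterable


def collect_years(fetch: Callable[[int], list], years: Iterable[int]) -> list:
    """Call fetch(year) for each year and tag every returned row with that year.

    `fetch` is a hypothetical callable that returns a list of row dicts,
    which is the shape the single-year request in the notebook produces.
    """
    rows = []
    for year in years:
        for row in fetch(year):
            # Copy the row so the caller's data is not modified in place.
            rows.append({**row, "Year": year})
    return rows
```

With the objects from the cell above, `fetch` could be `lambda y: c.acs1.get(VARIABLE_CODES, {'for': GEOGRAPHY_FOR}, year=y)`, and `pd.DataFrame(rows)` would then give one frame covering all requested years. The Census Bureau did not publish its standard 1-year ACS estimates for 2020, so that year may need to be skipped.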


### Potential Dataset 2 Example

Unemployment in America, Per US State (1976-2022)

- https://www.kaggle.com/datasets/justin2028/unemployment-in-america-per-us-state?select=Unemployment+in+America+Per+US+State.csv

In [26]:
import pandas as pd

df_unemploy = pd.read_csv("Unemployment in America Per US State.csv")

df_unemploy[df_unemploy["Year"] == 2022].sample(n = 5)

| | FIPS Code | State/Area | Year | Month | Total Civilian Non-Institutional Population in State/Area | Total Civilian Labor Force in State/Area | Percent (%) of State/Area's Population | Total Employment in State/Area | Percent (%) of Labor Force Employed in State/Area | Total Unemployment in State/Area | Percent (%) of Labor Force Unemployed in State/Area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 29424 | 11 | District of Columbia | 2022 | 4 | 548713 | 389336 | 71.0 | 370343 | 67.5 | 18993 | 4.9 |
| 29397 | 37 | North Carolina | 2022 | 3 | 8424216 | 5153298 | 61.2 | 4981094 | 59.1 | 172204 | 3.3 |
| 29507 | 41 | Oregon | 2022 | 5 | 3464397 | 2171706 | 62.7 | 2092070 | 60.4 | 79636 | 3.7 |
| 29639 | 15 | Hawaii | 2022 | 8 | 1123462 | 680168 | 60.5 | 655850 | 58.4 | 24318 | 3.6 |
| 29684 | 6 | California | 2022 | 9 | 31106495 | 19240773 | 61.9 | 18471241 | 59.4 | 769532 | 4.0 |
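
Dataset 2 is monthly while dataset 1 is annual, so comparing them would probably need one unemployment figure per state and year. One possible approach, sketched here as an assumption rather than a decided method, is to average the monthly rate. `annual_mean_rate` is a hypothetical helper; it works on plain row dictionaries such as those returned by `df_unemploy.to_dict("records")`.

```python
from collections import defaultdict
from statistics import mean

# Column name as it appears in the sample output above.
RATE_COL = "Percent (%) of Labor Force Unemployed in State/Area"


def annual_mean_rate(rows, rate_col=RATE_COL):
    """Return {(state, year): mean monthly unemployment rate}, rounded to 2 decimals."""
    monthly = defaultdict(list)
    for row in rows:
        key = (row["State/Area"], int(row["Year"]))
        monthly[key].append(float(row[rate_col]))
    return {key: round(mean(values), 2) for key, values in monthly.items()}
```

A plain mean treats every month equally. Dividing the summed unemployment counts by the summed labor force would weight months by labor force size instead, which may or may not be preferable here.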
