## Overview

The two candidate datasets are shown below, each with five randomly sampled rows to illustrate what the data looks like. Both examples only cover 2022, but I may include more years in the actual analysis.

### Potential Dataset 1 Example

American Community Survey 1-Year Data (2005-2024)

- https://www.census.gov/data/developers/data-sets/acs-1year.html
- https://www.census.gov/data/developers/guidance/api-user-guide.Help_&_Contact_Us.html

For 2022:
- https://api.census.gov/data/2022/acs/acs1/variables.html

In [2]:
import pandas as pd
from census import Census

# 1. Read my API key from a file and create the Census client
with open("../census_apikey.txt") as f:
    API_KEY = f.read().strip()
c = Census(API_KEY)

# 2. List the educational attainment variables to request
# IMPORTANT: THESE CODES ARE FOR 2022 **ONLY** (see the 2022 variable list linked above)
YEAR = 2022
VARIABLE_CODES = (
    'NAME',
    'B15003_001E',  # Total Population 25 Years and Over
    'B15003_016E',  # 12th grade, no diploma
    'B15003_017E',  # Regular high school diploma
    'B15003_018E',  # GED or alternative credential
    'B15003_019E',  # Some college, less than 1 year
    'B15003_020E',  # Some college, 1 or more years, no degree
    'B15003_021E',  # Associate's degree
    'B15003_022E',  # Bachelor's degree
    'B15003_023E',  # Master's degree
    'B15003_024E',  # Professional school degree
    'B15003_025E'   # Doctorate degree
)

# Get data for every state
GEOGRAPHY_FOR = 'state:*'

# 3. Make the API call
data = c.acs1.get(
    VARIABLE_CODES,
    {'for': GEOGRAPHY_FOR},
    year = YEAR
)

# 4. Put the results in a DataFrame with readable column names
df_education = pd.DataFrame(data)

df_education = df_education.rename(columns={
    'B15003_001E': 'Total_Pop_25_Plus',
    'B15003_016E': 'Edu_12th_No_Diploma',
    'B15003_017E': 'Edu_Regular_HS_Diploma',
    'B15003_018E': 'Edu_GED_or_Equiv',
    'B15003_019E': 'Edu_Some_College_Less_1_Year',
    'B15003_020E': 'Edu_Some_College_1_Plus_Year',
    'B15003_021E': 'Edu_Associates_Degree',
    'B15003_022E': 'Edu_Bachelors_Degree',
    'B15003_023E': 'Edu_Masters_Degree',
    'B15003_024E': 'Edu_Professional_Degree',
    'B15003_025E': 'Edu_Doctorate_Degree',
    'state': 'State_FIPS'
})

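# Count everyone with a regular high school diploma or higher
# (12th grade without a diploma is deliberately left out)
df_education['HS_Grad_or_Higher_Count'] = (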
    df_education['Edu_Regular_HS_Diploma'] +
    df_education['Edu_GED_or_Equiv'] +
    df_education['Edu_Some_College_Less_1_Year'] +
    df_education['Edu_Some_College_1_Plus_Year'] +
    df_education['Edu_Associates_Degree'] +
    df_education['Edu_Bachelors_Degree'] +
    df_education['Edu_Masters_Degree'] +
    df_education['Edu_Professional_Degree'] +
    df_education['Edu_Doctorate_Degree']
)

# Share of the 25+ population with a high school diploma or higher, in percent
df_education['HS_Grad_or_Higher_Pct'] = (
    df_education['HS_Grad_or_Higher_Count'] / df_education['Total_Pop_25_Plus'] * 100
).round(2)

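# Record the survey year so rows stay identifiable if more years are added
df_education["Year"] = YEAR

# Preview five random rows
df_education.sample(n=5)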

Index,NAME,Total_Pop_25_Plus,Edu_12th_No_Diploma,Edu_Regular_HS_Diploma,Edu_GED_or_Equiv,Edu_Some_College_Less_1_Year,Edu_Some_College_1_Plus_Year,Edu_Associates_Degree,Edu_Bachelors_Degree,Edu_Masters_Degree,Edu_Professional_Degree,Edu_Doctorate_Degree,State_FIPS,HS_Grad_or_Higher_Count,HS_Grad_or_Higher_Pct,Year
40,South Carolina,3664922.0,68099.0,852142.0,163600.0,245183.0,491640.0,368850.0,725469.0,345957.0,68340.0,54737.0,45,3315918.0,90.48,2022
44,Utah,2042912.0,38007.0,390984.0,58474.0,154959.0,317269.0,203542.0,505215.0,194392.0,39257.0,36185.0,49,1900277.0,93.02,2022
6,Connecticut,2545188.0,40900.0,568906.0,91141.0,143069.0,262624.0,197009.0,584999.0,345966.0,85582.0,49857.0,9,2329153.0,91.51,2022
5,Colorado,4084004.0,62130.0,670780.0,152329.0,244723.0,517456.0,339016.0,1177891.0,508701.0,105518.0,83022.0,8,3799436.0,93.03,2022
43,Texas,19597383.0,404431.0,3852041.0,881189.0,1297863.0,2673107.0,1519767.0,4242031.0,1779032.0,366285.0,264060.0,48,16875375.0,86.11,2022
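
Since I may pull more years later, here is a minimal sketch of how the single-year call could be repeated and stacked. The helper `collect_years` and the stand-in `fake_fetch` are hypothetical names of my own, and the numbers in `fake_fetch` are made up so the sketch runs without an API key. It also assumes the same variable codes are valid for every requested year, which needs to be checked against each year's variable list.

```python
import pandas as pd


def collect_years(fetch_year, years):
    """Stack one DataFrame per year, tagging each row with its year.

    `fetch_year` is a hypothetical callable that takes a year and returns
    a list of row dicts (the shape the single-year API call returns).
    """
    frames = []
    for year in years:
        frame = pd.DataFrame(fetch_year(year))
        frame["Year"] = year
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Stand-in fetcher with made-up numbers, used only to exercise the helper.
def fake_fetch(year):
    return [
        {"NAME": "Utah", "B15003_001E": 100.0 + year, "state": "49"},
        {"NAME": "Texas", "B15003_001E": 200.0 + year, "state": "48"},
    ]


df_all_years = collect_years(fake_fetch, [2021, 2022])
print(df_all_years)
```

With the real client, `fetch_year` could wrap the call from the cell above, for example `lambda y: c.acs1.get(VARIABLE_CODES, {'for': GEOGRAPHY_FOR}, year=y)`. The renaming and percentage steps would then run once on the stacked frame.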


### Potential Dataset 2 Example

Unemployment in America, Per US State (1976-2022)

- https://www.kaggle.com/datasets/justin2028/unemployment-in-america-per-us-state?select=Unemployment+in+America+Per+US+State.csv

In [3]:
import pandas as pd

# Load the Kaggle CSV (one row per state/area per month)
df_unemploy = pd.read_csv("../Unemployment in America Per US State.csv")

# Preview five random rows from 2022
df_unemploy[df_unemploy["Year"] == 2022].sample(n=5)

Index,FIPS Code,State/Area,Year,Month,Total Civilian Non-Institutional Population in State/Area,Total Civilian Labor Force in State/Area,Percent (%) of State/Area's Population,Total Employment in State/Area,Percent (%) of Labor Force Employed in State/Area,Total Unemployment in State/Area,Percent (%) of Labor Force Unemployed in State/Area
29444,32,Nevada,2022,4,2523752,1538332,61.0,1457650,57.8,80682,5.2
29773,42,Pennsylvania,2022,10,10494489,6478553,61.7,6196359,59.0,282194,4.4
29641,17,Illinois,2022,8,10030925,6457256,64.4,6167724,61.5,289532,4.5
29768,37,North Carolina,2022,10,8510926,5162674,60.7,4960151,58.3,202523,3.9
29445,33,New Hampshire,2022,4,1156725,765496,66.2,749090,64.8,16406,2.1
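
The unemployment data is monthly while the education data is annual, so one possible way to line them up is to average the monthly rate per state and year and then join on the FIPS code and year. This is only a sketch under assumptions: the toy rows below use made-up values, `Avg_Unemployment_Pct` is a column name I invented, and I assume `State_FIPS` may arrive as a string and so cast it to an integer before merging.

```python
import pandas as pd

RATE_COL = "Percent (%) of Labor Force Unemployed in State/Area"

# Toy rows with made-up values, shaped like the two samples above.
df_unemploy_toy = pd.DataFrame({
    "FIPS Code": [49, 49, 48, 48],
    "State/Area": ["Utah", "Utah", "Texas", "Texas"],
    "Year": [2022, 2022, 2022, 2022],
    "Month": [1, 2, 1, 2],
    RATE_COL: [2.0, 3.0, 4.0, 5.0],
})
df_education_toy = pd.DataFrame({
    "NAME": ["Utah", "Texas"],
    "State_FIPS": ["49", "48"],  # assumed to be strings here
    "HS_Grad_or_Higher_Pct": [93.0, 86.0],
    "Year": [2022, 2022],
})

# Collapse the monthly rows to one average rate per state and year.
annual = (
    df_unemploy_toy
    .groupby(["FIPS Code", "Year"], as_index=False)[RATE_COL]
    .mean()
    .rename(columns={"FIPS Code": "State_FIPS", RATE_COL: "Avg_Unemployment_Pct"})
)

# Make the join keys the same type, then merge on state and year.
df_education_toy["State_FIPS"] = df_education_toy["State_FIPS"].astype(int)
merged = df_education_toy.merge(annual, on=["State_FIPS", "Year"], how="inner")
print(merged)
```

A plain mean of the monthly percentages is an approximation, since it does not weight months by labor force size. Summing `Total Unemployment in State/Area` and `Total Civilian Labor Force in State/Area` over the year and dividing would be an alternative.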
