# Obtain raw data for the "supervised" part of the project

## first, manually download the Zillow data
Download it from https://www.zillow.com/research/data/.
* Select "ZHVI All Homes (SFR, Condo/Co-op) Time Series, Smoothed, Seasonally Adjusted($)" and "ZIP Code".
* Save the file in this directory as "zillow1.csv".

This series reflects the typical value of homes in the 35th to 65th percentile range. Values are calculated "by drawing information from the full distribution of homes in a given region".
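A later step will need to read "zillow1.csv". The sketch below shows one way to load such a file and reshape it from one column per month to one row per ZIP code and month. It is only an illustration: the column names (`RegionID`, `RegionName`, `State`, and date-named columns) and the sample values are assumptions, not taken from the real download, so check them against the actual file.

```python
import io
import pandas as pd

# Tiny stand-in for zillow1.csv. Column names and values are assumptions
# made for illustration; they are not copied from the real download.
sample_csv = io.StringIO(
    "RegionID,RegionName,State,2021-01-31,2021-02-28\n"
    "1,02655,MA,500000.0,505000.0\n"
    "2,10001,NY,900000.0,\n"
)

# Read the ZIP code column as text so leading zeros survive.
zillow_wide = pd.read_csv(sample_csv, dtype={"RegionName": str})

# Reshape from one column per month to one row per (ZIP code, month).
id_cols = ["RegionID", "RegionName", "State"]
zillow_long = zillow_wide.melt(id_vars=id_cols, var_name="date", value_name="zhvi")
zillow_long["date"] = pd.to_datetime(zillow_long["date"])

# Use the same 'zipcode' name as the census data and drop months with no value.
zillow_long = zillow_long.rename(columns={"RegionName": "zipcode"})
zillow_long = zillow_long.dropna(subset=["zhvi"])

print(zillow_long[["zipcode", "date", "zhvi"]])
```

Reading the ZIP code as a string matters because codes such as those in Massachusetts start with a zero, which an integer column would silently drop.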

## the rest of this notebook pulls the raw census data (<span style="color:green">if not present</span> as csv)
* It requires an API key which should be stored in a text file in this directory (census_api_key.txt).
* If a target CSV file already exists, the notebook will not pull that data again. So **if you want any fresh data, delete the associated CSVs before running this**. Reason: pulling census data requires an API key and can be cumbersome.

Here are the CSVs generated. Delete them before running this if you want to refresh the data!
* fields1.csv
* census1.csv
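Deleting the cached files by hand works, but the same step could be scripted. The helper below is a hypothetical sketch (the name `clear_cached_csvs` is not part of this notebook) that removes whichever of the two CSVs exist and reports what it removed. The demonstration runs in a temporary directory so that no real cache file is touched.

```python
import os
import tempfile

def clear_cached_csvs(directory, filenames=("fields1.csv", "census1.csv")):
    """Delete the cached CSVs that exist in `directory`; return the names removed."""
    removed = []
    for name in filenames:
        path = os.path.join(directory, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed

# Demonstration in a throwaway directory, not in the project folder.
with tempfile.TemporaryDirectory() as tmp:
    # Create an empty stand-in for a cached file.
    open(os.path.join(tmp, "census1.csv"), "w").close()
    first = clear_cached_csvs(tmp)   # removes the stand-in file
    second = clear_cached_csvs(tmp)  # nothing left to remove

print(first, second)
```

Calling it with `"."` as the directory would clear the real cache, after which the cells below would pull everything again.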

Important lists used in this notebook:
* fields: this is the list of fields to download. An example field is B01001_001E, which is population.
* years: this is the list of years to download. As of this writing, the years available are 2011-2021.

Here is an official census list of variables, for reference: https://api.census.gov/data/2021/acs/acs5/variables.html

## imports

In [1]:
# read the Census API key; strip the trailing newline so it is not sent as part of the key
with open('census_api_key.txt') as f:
    API_KEY = f.readline().strip()

import pandas as pd
import urllib.request, json 
import requests
import os.path

## years

In [2]:
# do 10 years, even though 11 are available for the ACS dataset. Also 2011 does not have B15003_022E (bachelors_degr)
years = list(range(2012, 2022))
years

[2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]

## list the variables to download
Columns:
* The census code for the variable
* A "friendly name" for the variable, which will be used in all code going forwards
* A binary value (0/1) indicating whether it will be used "as is" in the models. "0" usually means it will be used in a formula but not "as is".
* An informational note column

In [3]:
fields = [
    ['B01001_001E', 'population', 1, 'use as is and use for calculations'], 
    # ['B01003_001E', 'populatio2', 1, 'seems to be same as B01001_001E'], 
    ['B19013_001E', 'median_household_income', 1, 'seems to be $'],
    ['B01002_001E', 'median_age', 1, 'just median age'],
    ['B23025_002E', 'labor_force', 0, 'divide by population'],
    ['B23025_005E', 'unemployed', 0, 'divide by population'],
    ['B15003_022E', 'bachelors_degr', 0, 'STARTS IN 2012. divide by population'],
    ['B15003_023E', 'masters_degr', 0, 'divide by population'],
    ['B11001_001E', 'num_households', 0, 'use it for ave household size (popn/).'],
    ['B25064_001E', 'median_rent', 1, 'seems to be in $. maybe not that important since I have cost_of_living'],
    # ['B25003_001E', 'total_units_1', 0, 'seems to match num_households'],
    ['B25003_002E', 'owner_occupied', 0, 'divide by num_households'],
    # ['B28011_001E', 'internet_access_perc', 1, 'ONLY AVAILABLE FROM 2017'],
    # ['B28011_002E', 'internet_access_with_subscrip', 1, 'ONLY AVAILABLE FROM 2017'],
    ['B25034_002E', 'housing_units_built_last_year', 0, 'do proportion. they also have 2 years ago etc'],
    # ['B11005_001E', 'num_families', 1, 'seems to equal total_units_1'],
    ['B11005_002E', 'families_with_children', 0, 'under 18. divide by num_households'],
    ['B05010_002E', 'below_poverty_level', 0, 'divide by population'],
    # ['B17001_001E', 'total_for_poverty_level_calc', 0, 'close to and tracks with population'],
    ['B11001_007E', 'non_family_households', 0, 'divide by households. Note that fams with children + non_fam households is much less than total households.'],
    ['B08303_001E', 'mean_travel_time_to_work', 1, 'use as is'],
    ['B25002_003E', 'vacant_units', 0, 'divide by total units'],
    # ['B25002_001E', 'total_units_2', 1, 'close to total_units_1 and follows same pattern over years'],
    ['B07001_033E', 'moved_fr_same_county', 0, 'divide by population'],
    ['B07001_049E', 'moved_fr_other_county', 0, 'divide by population'],
    ['B07001_065E', 'moved_fr_other_state', 0, 'divide by population'],
    ['B07001_081E', 'moved_fr_abroad', 0, 'divide by population'],
    ['B25077_001E', 'median_value', 1, 'use as is'],
    ['B25018_001E', 'ave_num_rooms', 1, ' use as is'],
    

    ['B25024_002E', 'single_family_units', 0, 'divide by total_units'],
    # ['B25024_001E', 'total_units_3', 0, 'seems to be same as tot_units_2, and close to total_units_1'],
    ['B08301_010E', 'workers_using_public_trans', 0, 'divide by workers'],
    ['B08301_001E', 'workers', 0, 'close to and tracks with labor force but can use for ratios'],
    
    ['B05002_013E', 'foreign_born', 0, 'divide by population'],
    ['B01001_002E', 'male', 0, 'use it for proportion'],
    ['B01001_026E', 'female', 0, 'use it for proportion'],
    # ['B19001_001E', 'income', 0, 'use median_household_income'],

    ['B19083_001E', 'gini', 1, 'gini index of income inequality. 0 means all equal, 1 means 1 person has all income'],
    ['B25092_001E', 'cost_of_living_perc', 1, 'cost of living percent of income'],
    ['B25103_001E', 'median_RE_tax', 1, 'median real estate taxes paid'],
]
df_fields = pd.DataFrame(fields, columns=['code', 'myname', 'use_as_is', 'note'])
df_fields.head()
# df_fields.sort_values('code')

Unnamed: 0,code,myname,use_as_is,note
0,B01001_001E,population,1,use as is and use for calculations
1,B19013_001E,median_household_income,1,seems to be $
2,B01002_001E,median_age,1,just median age
3,B23025_002E,labor_force,0,divide by population
4,B23025_005E,unemployed,0,divide by population
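To make the role of the 0/1 flag concrete, here is a minimal sketch in plain Python of how the flag could split the variables into "as is" features and formula inputs, and how a "divide by population" note might be applied. The helper `add_ratios` and the `_rate` suffix are hypothetical names chosen for this illustration; the actual feature engineering is not part of this notebook. The sample row uses the population, labor force, and unemployed counts from the first row of the census output shown at the end of this notebook.

```python
# A three-row excerpt of the `fields` list defined above.
fields_excerpt = [
    ['B01001_001E', 'population', 1, 'use as is and use for calculations'],
    ['B23025_002E', 'labor_force', 0, 'divide by population'],
    ['B23025_005E', 'unemployed', 0, 'divide by population'],
]

# Friendly names grouped by the 0/1 flag.
as_is = [name for _, name, flag, _ in fields_excerpt if flag == 1]
formula_only = [name for _, name, flag, _ in fields_excerpt if flag == 0]

def add_ratios(row, numerators, denominator='population'):
    """Return a copy of `row` with '<name>_rate' = row[name] / row[denominator].

    The rate is None when the denominator is zero or missing.
    """
    out = dict(row)
    denom = row.get(denominator)
    for name in numerators:
        out[name + '_rate'] = row[name] / denom if denom else None
    return out

# Counts taken from the first row of the census output in this notebook.
row = {'population': 3846, 'labor_force': 1626, 'unemployed': 156}
result = add_ratios(row, formula_only)

print(as_is, formula_only)
print(result)
```

Guarding against a zero denominator is worthwhile because some ZIP code areas report very small or zero populations.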


## other topics not used, but consider for future analysis
* crime (not in census)
* businesses (not in census)
* race. Separating racism from other socio-economic factors is beyond the scope of this project.
* proximity to amenities (parks, schools, shopping centers) (not in census)
* tax environment, state. Can maybe use a dummy variable for state.
* proximity to major highways or transport hubs (not in census)

## add official census descriptions to df_fields and save to fields1.csv

In [4]:
# setup some variables to be used below
textsplitter = ' zzz '
max_year = max(years)

filename1 = 'fields1.csv'
# shortname1 = filename1[:-4]
# shortname1


In [5]:
# function to get the official descriptions
def get_field_descriptions(api_key, year):
    """
    Fetch descriptions for fields for a specific year.
    
    :param api_key: Your Census API Key.
    :param year: Census year.
    :return: Dictionary with field codes as keys and descriptions as values.
    """
    base_url = f"https://api.census.gov/data/{year}/acs/acs5/variables"
    response = requests.get(base_url, params={"key": api_key})
    
    if response.status_code == 200:
        data = response.json()
        # try:
        #     # Extract field code and description from the data
        #     return {var["name"]: var["label"] for var in data["variables"].values()}
        # except TypeError as e:
        #     print(f"Unexpected data structure for year {year}: {data}")
        #     return data
        # descriptions = {row[0]: row[1] for row in data[1:]}
        descriptions = {row[0]: str(row[1]) + textsplitter + str(row[2]) for row in data[1:]}
        return descriptions
    else:
        print(f"Error {response.status_code}: {response.text}")
        return {}
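The parsing step inside `get_field_descriptions` can be checked without calling the API. The sketch below uses a hand-written stand-in for `response.json()`, assuming the endpoint returns a header row followed by `[name, label, concept]` rows; the header names are an assumption, and the two data rows copy descriptions that appear in the `df_fields` output of this notebook. The helper `split_description` is a hypothetical name for the split that the next cell performs inline.

```python
textsplitter = ' zzz '

# Hand-written stand-in for response.json(). The header row is an assumption;
# the data rows reuse label/concept pairs shown elsewhere in this notebook.
payload = [
    ['name', 'label', 'concept'],
    ['B01001_001E', 'Estimate!!Total:', 'SEX BY AGE'],
    ['B01002_001E', 'Estimate!!Median age --!!Total:', 'MEDIAN AGE BY SEX'],
]

# Same comprehension as in get_field_descriptions: skip the header row,
# then join label and concept with the splitter.
descriptions = {
    row[0]: str(row[1]) + textsplitter + str(row[2]) for row in payload[1:]
}

def split_description(code):
    """Return (descr1, descr2) for a code, or ('Unknown', '') if it is missing."""
    parts = descriptions.get(code, 'Unknown').split(textsplitter)
    descr2 = parts[1] if len(parts) > 1 else ''
    return parts[0], descr2

print(split_description('B01001_001E'))
print(split_description('NOT_A_CODE'))
```

Joining the two strings with an unusual splitter keeps the lookup a flat dictionary of strings; storing a `(label, concept)` tuple per code would be an alternative that avoids the split altogether.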


In [6]:
# call the function and put the descriptions in a dictionary
if not os.path.exists(filename1):
    descriptions = {}
    descriptions.update(get_field_descriptions(API_KEY, max_year))
    # add the descriptions to df_fields
    for field in df_fields['code'].to_list():
        this_descr = descriptions.get(field, "Unknown")
        splitted = this_descr.split(textsplitter)
        df_fields.loc[df_fields['code'] == field, 'descr1'] = splitted[0]
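        # the second part is the concept; leave it empty if the splitter was not found
        # (no trailing comma here, which would turn the value into a one-element tuple)
        descr2 = splitted[1] if len(splitted) > 1 else ''
        df_fields.loc[df_fields['code'] == field, 'descr2'] = descr2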

    # explicitly handle NaNs 
    # df_fields.note = df_fields.note.fillna('')
    # df_fields.descr2 = df_fields.descr2.fillna('')
    
    df_fields.to_csv(filename1, index=False)
else:
    print(filename1 + ' exists!')
    
df_fields.head()

Unnamed: 0,code,myname,use_as_is,note,descr1,descr2
0,B01001_001E,population,1,use as is and use for calculations,Estimate!!Total:,SEX BY AGE
1,B19013_001E,median_household_income,1,seems to be $,Estimate!!Median household income in the past ...,MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS ...
2,B01002_001E,median_age,1,just median age,Estimate!!Median age --!!Total:,MEDIAN AGE BY SEX
3,B23025_002E,labor_force,0,divide by population,Estimate!!Total:!!In labor force:,EMPLOYMENT STATUS FOR THE POPULATION 16 YEARS ...
4,B23025_005E,unemployed,0,divide by population,Estimate!!Total:!!In labor force:!!Civilian la...,EMPLOYMENT STATUS FOR THE POPULATION 16 YEARS ...


## pull the census data and save to CSV

In [7]:
download_fields = [item[0] for item in fields]
new_field_names = pd.Series(df_fields.myname.values,index=df_fields.code).to_dict()
filename2 = 'census1.csv'

In [8]:
# function to get the raw data from the census
def get_census_data_by_zip(api_key, fields, year):
    """
    Fetch census data by ZIP Code Tabulation Areas (ZCTAs) for specified fields.
    
    :param api_key: Your Census API Key.
    :param fields: List of fields to fetch.
    :param year: Census year.
    :return: DataFrame with fetched data.
    """
    base_url = f"https://api.census.gov/data/{year}/acs/acs5"
    
    # Combine the fields into a comma-separated string
    fields_str = ",".join(fields)
    
    # Construct the final URL
    url = f"{base_url}?get={fields_str}&for=zip%20code%20tabulation%20area:*"
    
    headers = {
        "Content-Type": "application/json",
    }
    
    # Make the API request
    response = requests.get(url, headers=headers, params={"key": api_key})
    
    if response.status_code == 200:
        data = response.json()
        # Convert data to DataFrame
        df = pd.DataFrame(data[1:], columns=data[0])
        df['year'] = year
        return df
    else:
        print(f"Error {response.status_code}: {response.text}")
        return None
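The function above writes part of the query string by hand (`get=` and `for=`) and passes the key separately. As a sketch of an alternative, the whole query could be assembled from one dictionary with the standard library, which also makes the URL easy to inspect without sending a request. The helper name `build_acs5_url` and the placeholder key `DEMO_KEY` are hypothetical; whether the API treats this encoding identically to the hand-built URL has not been verified here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_acs5_url(fields, year, api_key):
    """Assemble the ACS 5-year request URL for all ZCTAs from one parameter dict."""
    base_url = f"https://api.census.gov/data/{year}/acs/acs5"
    query = urlencode({
        "get": ",".join(fields),              # comma-separated variable codes
        "for": "zip code tabulation area:*",  # every ZCTA
        "key": api_key,
    })
    return f"{base_url}?{query}"

# DEMO_KEY is a placeholder, not a real key.
url = build_acs5_url(["B01001_001E", "B19013_001E"], 2021, "DEMO_KEY")
print(url)

# Decode the query string again to confirm the parameters round-trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Keeping all parameters in one dictionary avoids mixing manual percent-encoding (`%20`) with automatic encoding in the same request.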


In [9]:
if not os.path.exists(filename2):
    dfraw = pd.DataFrame()
    for year in years:
        print(year)
        temp_df = get_census_data_by_zip(API_KEY, download_fields, year)
        dfraw = pd.concat([dfraw, temp_df])
        print(dfraw.shape)
    dfraw.rename(columns=new_field_names, inplace=True)
    dfraw.rename(columns = {'zip code tabulation area':'zipcode'}, inplace = True)
    # re-order columns
    cols=['zipcode', 'year']
    dfraw = dfraw[cols + [c for c in dfraw.columns if c not in cols]]
    # save it!
    dfraw.to_csv(filename2, index=False)
else:
    print(filename2 + ' exists!')

2012
(33120, 34)
2013
(66240, 34)
2014
(99360, 34)
2015
(132480, 34)
2016
(165600, 34)
2017
(198720, 34)
2018
(231840, 34)
2019
(264960, 34)
2020
(298080, 34)
2021
(331854, 34)


In [10]:
dfraw.head()

Unnamed: 0,zipcode,year,population,median_household_income,median_age,labor_force,unemployed,bachelors_degr,masters_degr,num_households,...,single_family_units,workers_using_public_trans,workers,foreign_born,male,female,gini,cost_of_living_perc,median_RE_tax,state
0,2655,2012,3846,73323,54.6,1626,156,732,399,1699,...,2814,25,1462,323,1723,2123,0.5339,28.9,3527,25
1,2657,2012,2974,46031,52.9,1992,258,883,422,1687,...,1913,43,1699,236,1712,1262,0.5349,30.4,3493,25
2,2659,2012,741,51466,61.0,355,31,121,62,374,...,1182,15,317,86,344,397,0.3777,20.5,2146,25
3,2660,2012,5881,48617,51.3,2835,173,976,419,2699,...,3478,42,2534,208,2807,3074,0.4286,24.3,1742,25
4,2663,2012,96,21667,34.7,80,16,48,0,64,...,315,0,64,0,64,32,0.1638,50.0,4333,25
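When "census1.csv" is read back in a later step, two details are worth handling: ZIP codes should stay strings so leading zeros are kept, and the value columns should be numeric with any missing-value code turned into NaN. The sketch below illustrates this on an in-memory stand-in for the file. It assumes the file stores ZIP codes with their leading zeros and that -666666666 marks an estimate that could not be computed; both assumptions should be checked against the real file. The first data row adapts values from the output above, and the second row is invented to show the missing-value case.

```python
import io
import pandas as pd

# Stand-in for census1.csv. Row 1 adapts values from the output above
# (ZIP code written with a leading zero); row 2 is invented for illustration.
sample = io.StringIO(
    "zipcode,year,population,median_household_income,median_age\n"
    "02655,2012,3846,73323,54.6\n"
    "00000,2012,0,-666666666,-666666666\n"
)

MISSING_CODE = -666666666  # assumed code for "estimate not available"

# Keep ZIP codes as text so leading zeros are not dropped.
census = pd.read_csv(sample, dtype={"zipcode": str})

# Make every value column numeric, then blank out the missing-value code.
value_cols = [c for c in census.columns if c not in ("zipcode", "year")]
census[value_cols] = census[value_cols].apply(pd.to_numeric, errors="coerce")
census[value_cols] = census[value_cols].mask(census[value_cols] == MISSING_CODE)

print(census)
```

Replacing the code with NaN keeps it from distorting means, ratios, or model fits that treat the column as an ordinary number.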
