## Week 4: Open Data Portals

In data science, we say, "data is destiny." This means the data you have determines how far your analysis can go and how much you can understand, no matter what methods you use!


Today we will walk through where to find open, publicly available data. We will explore some portals for downloading data directly, as well as "APIs" (Application Programming Interfaces), which let us gather data through Python.

In [None]:
# imports 
import pandas as pd
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt


### 4.1 Exploring open data portals 

Many cities and other government bodies have open data portals. This means they make datasets relevant to their jurisdictions publicly available.

A few we can explore:
* New York City. https://data.cityofnewyork.us
* Los Angeles. https://data.lacity.org
* The CDC. https://www.cdc.gov/places/index.html
* San Francisco bike share. https://www.lyft.com/bikes/bay-wheels/system-data
* NYC bike share. 

* A list of open data portals: https://schoolofcities.github.io/urban-data-storytelling/urban-data-analytics/what-and-where-of-data/what-and-where-of-data.html

In [None]:
# Let's explore one:

df = pd.read_csv('')  # fill in the path or URL of a CSV file from one of the portals above

In [None]:
# Explore descriptive statistics, plot relationships
df.iloc[0]
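
As a minimal sketch of what "descriptive statistics" can look like once a CSV is loaded, the example below reads a tiny made-up table from a string instead of a real download. The column names (`station`, `borough`, `trips`) and all values are invented for illustration and do not come from any of the portals above.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV file downloaded from an open data portal.
csv_text = """station,borough,trips
A,Manhattan,120
B,Brooklyn,80
C,Manhattan,100
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

# Summary statistics (count, mean, std, min, quartiles, max) for one numeric column.
summary = toy["trips"].describe()

# Average number of trips within each borough.
by_borough = toy.groupby("borough")["trips"].mean()

print(summary)
print(by_borough)
```

With a real dataset, the same calls apply after replacing the `StringIO` object with the file path or URL.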

In [None]:
## YOUR TURN 
## Visit one of the open data portals above or find a new one and download some data
## Add a comment block below describing the data



In [None]:
## YOUR TURN
## Gather some descriptive statistics



In [None]:
## YOUR TURN
## Create a few plots



### 4.2 Census

We can also collect data through an API. An API is a way to connect your computer to a database without having to manually download the data files. 

One of these APIs is the US Census data API. This allows us to gather census data consistently. We will use the `census` package (together with the `us` package, which provides state FIPS codes) to access the API. However, we first need to get an API key. A key works somewhat like a password: it is unique to each individual and lets the data holder keep track of who is accessing the data.

Go to https://api.census.gov/data/key_signup.html and enter Cornell University and your email. Then, check your email and activate the key. 

Open the associated file `key.py` and replace the text inside the quotes with your key. 
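
Since the notebook imports `CENSUS_KEY` from `key`, the file only needs to define a module-level string with that name. A sketch of what `key.py` might contain is below; the placeholder text is an assumption, and your copy of the file may word it differently.

```python
# key.py (sketch): the notebook runs `from key import CENSUS_KEY`,
# so this file only needs one string variable with that exact name.
# The value below is a placeholder, not a working key.
CENSUS_KEY = "paste-your-key-here"
```

Because the key is personal, it is generally a good idea to keep `key.py` out of anything you share publicly.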

In [None]:
from key import CENSUS_KEY

In [None]:
from census import Census
from us import states

In [None]:
# We first initialize the API with our key; the year (2020) is passed in each query below
c = Census(CENSUS_KEY)

We can see all the possible variables we can get here:

https://api.census.gov/data/YEAR/acs/ACS_TYPE/variables.html

YEAR: Replace with the year you want to see

ACS_TYPE: Which ACS type: acs5 (5-yr), acs3 (3-yr, discontinued after the 2011-2013 release), or acs1 (1-yr) estimates

For example, the 5-yr ACS estimates from 2020:
https://api.census.gov/data/2020/acs/acs5/variables.html
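
The URL pattern above can be filled in programmatically. The helper below is a small sketch; the function name `variables_url` is hypothetical and not part of any package.

```python
def variables_url(year, acs_type):
    """Build the variables-page URL from the YEAR and ACS_TYPE placeholders."""
    return f"https://api.census.gov/data/{year}/acs/{acs_type}/variables.html"


# The 5-yr ACS estimates from 2020, matching the example link above.
print(variables_url(2020, "acs5"))
```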


In [None]:
# Let's say we want to get the total population estimate from the 5-yr ACS survey in 2020 for New York
c.acs5.state(('NAME', 'B01001_001E'), states.NY.fips, year=2020)
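
To make the idea of an API more concrete: a call like the one above corresponds roughly to a single web request to `api.census.gov`. The sketch below only builds the URL and sends nothing. The helper name `acs_request_url` is hypothetical, and the exact request the `census` package sends may differ in its details.

```python
from urllib.parse import quote, urlencode

BASE = "https://api.census.gov/data"


def acs_request_url(year, dataset, variables, geo_for, key, geo_in=None):
    """Assemble a Census API query URL (illustrative helper, not part of `census`)."""
    params = {"get": ",".join(variables), "for": geo_for}
    if geo_in is not None:
        params["in"] = geo_in
    params["key"] = key
    # Keep ':', '*' and ',' readable; spaces are encoded as %20.
    query = urlencode(params, safe=":*,", quote_via=quote)
    return f"{BASE}/{year}/acs/{dataset}?{query}"


# State-level total population for New York (FIPS code 36), 2020 5-yr ACS.
url = acs_request_url(2020, "acs5", ["NAME", "B01001_001E"], "state:36", key="YOUR_KEY")
print(url)
```

Pasting such a URL into a browser, with a real key in place of `YOUR_KEY`, returns the raw response that the package parses for us.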

In [None]:
# We can get these values for all of the tracts in New York State like this:

ny_tracts = c.acs5.get( # start here each time
    ('B01001_001E', 'B01001_004E', 'B01001_005E', 'B01001_006E'), # specify the variables we want (look up their labels on the variables page)
    {'for': 'tract:*', # specify the geographic resolution we want
     'in': 'state:36 county:*'}, # specify the geographic bounds. In this case, the state of NY (FIPS code 36)
    year=2020, # specify the year
)

What do the `*`s represent?

In [None]:
ny_tracts

In [None]:
# We can turn this into a dataframe so it is easier to read
ny_df = pd.DataFrame(ny_tracts)
ny_df.head()
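
The response is a list of dictionaries, one per tract, which is why `pd.DataFrame` can read it directly. Depending on the package version, the counts may arrive as strings or as numbers, so it can help to convert them explicitly before doing arithmetic. The sketch below uses made-up records in the same shape; the values are invented, and the `share_004` column is a hypothetical name.

```python
import pandas as pd

# Made-up records shaped like the API response (one dict per tract).
records = [
    {"B01001_001E": "4000", "B01001_004E": "120",
     "state": "36", "county": "001", "tract": "000100"},
    {"B01001_001E": "2500", "B01001_004E": "50",
     "state": "36", "county": "001", "tract": "000200"},
]
toy_df = pd.DataFrame(records)

# Convert the count columns to numbers; values that cannot be parsed become NaN.
count_cols = ["B01001_001E", "B01001_004E"]
toy_df[count_cols] = toy_df[count_cols].apply(pd.to_numeric, errors="coerce")

# With numeric columns, derived measures such as shares are simple arithmetic.
toy_df["share_004"] = toy_df["B01001_004E"] / toy_df["B01001_001E"]
print(toy_df)
```

The geography columns (`state`, `county`, `tract`) are left as strings on purpose, since their leading zeros matter.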

In [None]:
# From states, we can see the file URLs for the polygon outlines (shapefiles)
states.NY.shapefile_urls()

We can also explore the possible files by starting here:
https://www2.census.gov/geo/tiger/

In [None]:
# But we can read this directly into geopandas!

year = 2020
state_fips = "36"

url = f"https://www2.census.gov/geo/tiger/TIGER{year}/TRACT/tl_{year}_{state_fips}_tract.zip"

tracts = gpd.read_file(url)
tracts.head()

In [None]:
# Let's check the crs:
tracts.crs

In [None]:
# Convert it into our classic EPSG 4326
tracts = tracts.to_crs(epsg=4326)
tracts.head()

In [None]:
# We can join the census table and the tract geometries together.
# GEOID is the best key for the join: it combines the state, county, and tract IDs into one full ID.

# We need to create a GEOID column in our ny_df

ny_df['GEOID'] = ny_df['state'] + ny_df['county'] + ny_df['tract']
ny_df.head()
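
String concatenation works here because the three codes are fixed-width strings: 2 digits for the state, 3 for the county, and 6 for the tract, giving an 11-character tract GEOID. If the codes were ever read in as integers, the leading zeros would be lost and the join would silently fail to match. The helper below is a sketch of a padded version; `build_geoid` is a hypothetical name and the example codes are illustrative.

```python
def build_geoid(state, county, tract):
    """Combine state (2 digits), county (3), and tract (6) codes into an 11-character GEOID."""
    return str(state).zfill(2) + str(county).zfill(3) + str(tract).zfill(6)


# Codes that are already zero-padded strings pass through unchanged.
print(build_geoid("36", "001", "000100"))

# Integer codes are padded back to their fixed widths.
print(build_geoid(1, 1, 100))
```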

In [None]:
ny_df_tracts = pd.merge(left=ny_df, right=tracts, on='GEOID', how='left')
print(len(ny_df))
print(len(tracts))
print(len(ny_df_tracts))
ny_df_tracts.head()
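
Comparing the three lengths is one way to sanity-check a join. Another option, sketched below on made-up tables, is to ask pandas to flag unmatched rows with `indicator=True` and to confirm that the key is unique on both sides with `validate`. The toy GEOIDs, counts, and placeholder geometry strings are invented for illustration.

```python
import pandas as pd

# Made-up census table with three tracts.
census_toy = pd.DataFrame({
    "GEOID": ["36001000100", "36001000200", "36001000300"],
    "B01001_001E": [4000, 2500, 3100],
})

# Made-up geometry table that is missing one tract (strings stand in for polygons).
shapes_toy = pd.DataFrame({
    "GEOID": ["36001000100", "36001000200"],
    "geometry": ["polygon A", "polygon B"],
})

merged = pd.merge(
    census_toy, shapes_toy,
    on="GEOID", how="left",
    validate="one_to_one",  # raise an error if GEOID repeats in either table
    indicator=True,         # add a _merge column saying where each row was found
)

# Rows found only in the left table have no matching geometry.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```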

In [None]:
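# The merge returns a plain DataFrame, so we need to make ny_df_tracts a GeoDataFrame
# The geometries were already converted to EPSG:4326 above, so we declare that CRS here

ny_df_tracts = gpd.GeoDataFrame(ny_df_tracts, geometry='geometry', crs="EPSG:4326")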

In [None]:
ny_df_tracts.plot(column='B01001_001E')

In [None]:
## YOUR TURN
## Plot the percent of people white alone by county in Alabama in 2020
## Variable code given for this exercise: B03001_003 (check it against the variables page before using it)
