Lab 9: Tables & Charts
======================
In this notebook, we look at some basic demographic
information about school districts on Long Island, NY.


In [29]:
# install this very alpha library directly from GitHub
# !pip install https://github.com/mcuringa/cartopy/raw/refs/heads/main/dist/maptools-latest.tar.gz -q

!pip install plotly nbformat -q

from maptools import census_vars
import pandas as pd
import plotly.express as px
import us
from census import Census

# from google.colab import userdata
# api_key = userdata.get('CENSUS_API_KEY')
import os
api_key = os.getenv('CENSUS_API_KEY')

# create one Census object that we will reuse
c = Census(api_key)


In [23]:
census_vars.search("vehicles")

Unnamed: 0,group,concept,match
17175,B25044,Tenure By Vehicles Available,62.03%
13618,B08014,Sex Of Workers By Vehicles Available,56.19%
2270,B99085,Allocation Of Vehicles Available For Workers,54.98%
991,B08201,Household Size By Vehicles Available,53.72%
26842,B25045,Tenure By Vehicles Available By Age Of Householder,51.14%
5997,B25046,Aggregate Number Of Vehicles Available By Tenure,49.94%
23612,B08141,Means Of Transportation To Work By Vehicles Available,49.20%
12269,B08203,Number Of Workers In Household By Vehicles Available,49.04%
26945,B99089,Allocation Of Vehicles Available For Workers For Workplace Geography,43.81%
4303,B08541,Means Of Transportation To Work By Vehicles Available For Workplace Geography,41.82%


Tenure by Vehicles Available
============================
We're going to use table "B25044", _Tenure by Vehicles Available_, as
the data set for the examples in this lab. Briefly, this table
tells us how many vehicles are available to each household, broken
down by renter/homeowner status.

In this analysis, we will look at census tracts in New York City.
To get those tracts, we first look up the counties to get
their FIPS codes, then query the Census API for the tracts in
those counties.
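
The second step above (asking for the tracts inside each county) can be sketched without touching the network. This is a minimal sketch, not the lab's own code: `tract_geo` is a hypothetical helper, and the `'in'` string format (`state:36 county:005`) is assumed to follow the same pattern as the county query used later in this lab. The five county FIPS codes are the standard codes for the New York City boroughs.

```python
# County FIPS codes for the five NYC boroughs (New York State is FIPS 36).
NYC_COUNTIES = {
    "Bronx": "005",
    "Kings": "047",      # Brooklyn
    "New York": "061",   # Manhattan
    "Queens": "081",
    "Richmond": "085",   # Staten Island
}


def tract_geo(state_fips, county_fips):
    """Hypothetical helper: build a geo filter for every tract in one county."""
    return {"for": "tract:*", "in": f"state:{state_fips} county:{county_fips}"}


# build one geo filter per county
geo_filters = {name: tract_geo("36", fips) for name, fips in NYC_COUNTIES.items()}

for name, geo in geo_filters.items():
    print(name, geo)

# With a Census client `c`, each filter could then be passed along, e.g.:
# rows = c.acs5.get(fields=fields, geo=geo, year=2022)
```

Building the filters separately from the API call makes it easy to check the FIPS codes before spending any requests.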



In [33]:
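# look up the default field labels for table B25044
field_names = census_vars.get_table("B25044")
display(field_names)
# replace the long default labels with shorter column names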
field_names = {
    'B25044_001E': 'total_households',
    'B25044_002E': 'owner_occupied',
    'B25044_009E': 'renter_occupied',
    'B25044_003E': 'owner_no_vehicle',
    'B25044_004E': 'owner_1_vehicle',
    'B25044_005E': 'owner_2_vehicles',
    'B25044_006E': 'owner_3_vehicles',
    'B25044_007E': 'owner_4_vehicles',
    'B25044_008E': 'owner_5_or_more_vehicles',
    'B25044_010E': 'renter_no_vehicle',
    'B25044_011E': 'renter_1_vehicle',
    'B25044_012E': 'renter_2_vehicles',
    'B25044_013E': 'renter_3_vehicles',
    'B25044_014E': 'renter_4_vehicles',
    'B25044_015E': 'renter_5_or_more_vehicles'
}

{'B25044_001E': 'total',
 'B25044_002E': 'owner_occupied',
 'B25044_003E': 'owner_occupied_no_vehicle_available',
 'B25044_004E': 'owner_occupied_1_vehicle_available',
 'B25044_005E': 'owner_occupied_2_vehicles_available',
 'B25044_006E': 'owner_occupied_3_vehicles_available',
 'B25044_007E': 'owner_occupied_4_vehicles_available',
 'B25044_008E': 'owner_occupied_5_or_more_vehicles_available',
 'B25044_009E': 'renter_occupied',
 'B25044_010E': 'renter_occupied_no_vehicle_available',
 'B25044_011E': 'renter_occupied_1_vehicle_available',
 'B25044_012E': 'renter_occupied_2_vehicles_available',
 'B25044_013E': 'renter_occupied_3_vehicles_available',
 'B25044_014E': 'renter_occupied_4_vehicles_available',
 'B25044_015E': 'renter_occupied_5_or_more_vehicles_available'}

In [34]:

fields = list(field_names.keys())

# request every county in New York State from the 2022 ACS 5-year estimates
data = c.acs5.get(
    fields=fields,
    geo={'for': 'county:*', 'in': f'state:{us.states.NY.fips}'},
    year=2022,
)

df = pd.DataFrame(data)
df.rename(columns=field_names, inplace=True)
df["statefp"] = df["state"]
df["state"] = df.statefp.apply(census_vars.lookup_state)
df.head(10)

Unnamed: 0,total_households,owner_occupied,renter_occupied,owner_no_vehicle,owner_1_vehicle,owner_2_vehicles,owner_3_vehicles,owner_4_vehicles,owner_5_or_more_vehicles,renter_no_vehicle,renter_1_vehicle,renter_2_vehicles,renter_3_vehicles,renter_4_vehicles,renter_5_or_more_vehicles,state,county,statefp
0,132175.0,74707.0,57468.0,2656.0,22386.0,34782.0,10655.0,3132.0,1096.0,13821.0,28432.0,12509.0,2113.0,459.0,134.0,NY,1,36
1,16813.0,13584.0,3229.0,605.0,4275.0,5956.0,1929.0,517.0,302.0,620.0,1439.0,917.0,232.0,19.0,2.0,NY,3,36
2,525387.0,104887.0,420500.0,25658.0,46873.0,23467.0,6870.0,1498.0,521.0,290507.0,108030.0,19120.0,2217.0,457.0,169.0,NY,5,36
3,81339.0,52748.0,28591.0,2205.0,17227.0,22700.0,7460.0,2441.0,715.0,7999.0,13802.0,5086.0,1022.0,365.0,317.0,NY,7,36
4,31491.0,23220.0,8271.0,1240.0,7111.0,9961.0,3594.0,987.0,327.0,2049.0,3913.0,1939.0,231.0,63.0,76.0,NY,9,36
5,30910.0,21643.0,9267.0,781.0,6290.0,9816.0,3184.0,1117.0,455.0,2311.0,4727.0,1774.0,281.0,82.0,92.0,NY,11,36
6,53405.0,36891.0,16514.0,1727.0,11617.0,15672.0,5826.0,1321.0,728.0,4575.0,8121.0,3271.0,453.0,57.0,37.0,NY,13,36
7,34779.0,23801.0,10978.0,808.0,7617.0,10423.0,3593.0,1019.0,341.0,2730.0,5356.0,2324.0,501.0,52.0,15.0,NY,15,36
8,19886.0,15170.0,4716.0,534.0,4164.0,6793.0,2373.0,947.0,359.0,1130.0,2219.0,1108.0,235.0,18.0,6.0,NY,17,36
9,32651.0,22542.0,10109.0,808.0,6284.0,10277.0,3779.0,967.0,427.0,2286.0,4558.0,2728.0,351.0,167.0,19.0,NY,19,36
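
To see how the columns of this table relate to each other, here is a small check using one row copied from the output above (the row where `county` is 5). It is only an illustration: the helper `vehicle_sum` is a hypothetical name, and the "share with no vehicle" figures are one possible way to read the table, not part of the lab's analysis.

```python
# One county row copied from the query output above (county 5).
row = {
    "total_households": 525387,
    "owner_occupied": 104887,
    "renter_occupied": 420500,
    "owner_no_vehicle": 25658,
    "owner_1_vehicle": 46873,
    "owner_2_vehicles": 23467,
    "owner_3_vehicles": 6870,
    "owner_4_vehicles": 1498,
    "owner_5_or_more_vehicles": 521,
    "renter_no_vehicle": 290507,
    "renter_1_vehicle": 108030,
    "renter_2_vehicles": 19120,
    "renter_3_vehicles": 2217,
    "renter_4_vehicles": 457,
    "renter_5_or_more_vehicles": 169,
}


def vehicle_sum(prefix):
    """Hypothetical helper: add up the vehicle-count columns for one tenure."""
    return sum(v for k, v in row.items() if k.startswith(prefix) and "vehicle" in k)


owner_sum = vehicle_sum("owner_")
renter_sum = vehicle_sum("renter_")

# the vehicle categories add up to each tenure subtotal,
# and the two subtotals add up to the total
print(owner_sum == row["owner_occupied"])
print(renter_sum == row["renter_occupied"])
print(owner_sum + renter_sum == row["total_households"])

# share of households with no vehicle available, by tenure
owner_share = row["owner_no_vehicle"] / row["owner_occupied"]
renter_share = row["renter_no_vehicle"] / row["renter_occupied"]
print(f"owners with no vehicle:  {owner_share:.1%}")
print(f"renters with no vehicle: {renter_share:.1%}")
```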


Selecting Columns and Sorting Data
----------------------------------
Sometimes we don't want to display all of the columns, or we want to change the order of the columns.
The code below shows how to select a subset of columns from the data.

We can also sort the data by one or more columns. Below there are a couple
of examples of how to sort a `DataFrame` by one or more columns
using the `sort_values()` method.
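
Before running this on the real data, it can help to try the same steps on a tiny table. The sketch below uses a made-up `demo` DataFrame (the district names and numbers are invented for illustration) to show selecting columns in a new order, sorting by one column, and sorting by two columns at once.

```python
import pandas as pd

# made-up data, for illustration only
demo = pd.DataFrame({
    "District": ["Alpha", "Beta", "Gamma", "Delta"],
    "County": ["Nassau", "Suffolk", "Nassau", "Suffolk"],
    "Total Students": [1200, 5000, 4500, 800],
})

# select a subset of columns; the list also sets the column order
subset = demo[["County", "District", "Total Students"]]

# sort by one column, largest first
by_size = subset.sort_values("Total Students", ascending=False)

# sort by County (A to Z), then by Total Students (largest first) within each county
by_county_then_size = subset.sort_values(
    ["County", "Total Students"], ascending=[True, False]
)

print(by_size)
print(by_county_then_size)
```

Note that `sort_values()` returns a new `DataFrame` and leaves `demo` unchanged, which is why the results are saved in new variables.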

In [4]:
# show just the District and Total Students
display(df[["District", "Total Students"]])

print("Now sorting by Total Students:")
# sort by Total Students
display(df[["District", "Total Students"]].sort_values("Total Students", ascending=False))
# reading this table, we can see that Brentwood is the largest district on Long Island (by # of students)

# let's add County and Black Students, then sort by Black Students
display(df[["District", "County", "Total Students", "Black or African American Students"]].sort_values("Black or African American Students", ascending=False))

Unnamed: 0,District,Total Students
0,Academy Charter School-Uniondale,1272
1,Amagansett,125
2,Amityville,2951
3,Babylon,1558
4,Baldwin,4476
...,...,...
121,West Islip,3913
122,Westbury,4768
123,Westhampton Beach,1767
124,William Floyd,9166


Now sorting by Total Students:


Unnamed: 0,District,Total Students
11,Brentwood,18323
96,Sachem Central,11844
69,Middle Country Central,9507
124,William Floyd,9166
63,Longwood Central,8876
...,...,...
37,Fishers Island,57
35,Fire Island,34
117,Wainscott Common,26
75,New Suffolk Common,7


Unnamed: 0,District,County,Total Students,Black or African American Students
4,Baldwin,Nassau,4476,2146
101,Sewanhaka Central High,Nassau,7862,1670
63,Longwood Central,Suffolk,8876,1534
124,William Floyd,Suffolk,9166,1465
116,Valley Stream Central High,Nassau,4714,1436
...,...,...,...,...
84,Oysterponds,Suffolk,80,1
109,Springs,Suffolk,688,1
102,Shelter Island,Suffolk,176,1
75,New Suffolk Common,Suffolk,7,0


Filtering Rows
--------------
We have seen how we can select just some of the columns to work with. We can do the same
thing for rows. In these examples, though, we are going to filter the rows and
save each result into a new `DataFrame` variable that we can use later.

We're going to write a few filters that see if column values are
greater than (`>`), less than (`<`), or equal (`==`) to a certain value.

We're also going to use a **method** called `isin()` that checks if a value is in a list of values.

Last, we will use the logical operator `&` (and) to combine conditions.
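
Here is a minimal sketch of each kind of filter on a tiny, made-up `demo` table (the names and numbers are invented for illustration). Each comparison produces a column of `True`/`False` values, and putting that column inside `demo[...]` keeps only the `True` rows. When combining conditions with `&`, each condition needs its own parentheses.

```python
import pandas as pd

# made-up data, for illustration only
demo = pd.DataFrame({
    "District": ["Alpha", "Beta", "Gamma", "Delta"],
    "County": ["Nassau", "Suffolk", "Nassau", "Suffolk"],
    "Total Students": [1200, 5000, 4500, 800],
})

# greater than: districts with more than 1,000 students
big = demo[demo["Total Students"] > 1000]

# equal to: only Nassau County
nassau = demo[demo["County"] == "Nassau"]

# combine two conditions with & (note the parentheses around each one)
big_nassau = demo[(demo["County"] == "Nassau") & (demo["Total Students"] > 4000)]

# isin(): keep rows whose District appears in a list
picked = demo[demo["District"].isin(["Alpha", "Delta"])]

print(big)
print(nassau)
print(big_nassau)
print(picked)
```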


In [5]:
# get just the districts with more than 6,000 students
# let's call this big_districts
big_districts = df[df["Total Students"] > 6000]
big_districts

# only Suffolk County
suffolk = df[df["County"] == "Suffolk"]
suffolk

# get big Nassau districts using &
big_nassau = df[(df["County"] == "Nassau") & (df["Total Students"] > 6000)]
big_nassau

# let's get just Garden City and the districts around it
# first we will create a list of the school districts we want
districts = ['Garden City',
             'Carle Place',
             'Elmont',
             'Franklin Square',
             'Hempstead',
             'Mineola',
             'New Hyde Park-Garden City Park',
             'Sewanhaka Central High',
             'West Hempstead']
# create a new variable called gc (for Garden City)
gc = df[df["District"].isin(districts)]
gc[["District", "Total Students",
    "Asian or Asian/Pacific Islander Students", "Black or African American Students",
    "Hispanic Students", "White Students"]]

Unnamed: 0,District,Total Students,Asian or Asian/Pacific Islander Students,Black or African American Students,Hispanic Students,White Students
14,Carle Place,1264,139,20,313,761
32,Elmont,3394,882,1178,1042,173
39,Franklin Square,1930,317,53,484,1032
41,Garden City,3945,419,30,285,3117
49,Hempstead,6114,67,1268,4571,94
71,Mineola,2868,419,70,954,1323
74,New Hyde Park-Garden City Park,1653,1019,10,242,319
101,Sewanhaka Central High,7862,2094,1670,1816,2183
120,West Hempstead,1586,131,320,760,319


Making Bar Charts
=================
We are going to use a Python library called `plotly` to make
interactive bar charts. We will use the `px.bar()` function.

A few things to note:

- the x-axis labels the values going across the bottom of the chart
- the y-axis labels the values going up the side of the chart
- we're going to work with the `gc` variable, which is a subset of the data
  to make these example charts
- if we sort our data, we will get bars in a different order

In [6]:
# a basic bar chart showing the total students in each district
chart_title = "Garden City and Neighbors: Total Students per District"
gc = gc.sort_values("Total Students")
fig = px.bar(gc, x='District', y="Total Students", title=chart_title)
fig

In [7]:
# let's make a bar chart showing the largest demographic groups, just in Garden City
# first, create a list with the columns we want to chart
cols = ["Asian or Asian/Pacific Islander Students", "Black or African American Students", "Hispanic Students", "White Students"]
# get just the one district
just_gc = gc[gc["District"] == "Garden City"]
chart_title = "Garden City: Student Demographics"
fig = px.bar(just_gc, x='District', y=cols, title=chart_title)
fig.update_layout(
    barmode='group',
    yaxis_title="Number of Students",
    yaxis_tickformat=','
)
fig

In [8]:
# last, we will make a bar chart showing the student demographic groups in each district
gc = gc.sort_values("District")

chart_title = "Garden City and Neighbors: Student Demographics"

fig = px.bar(gc, x='District', y=cols, title=chart_title)
fig.update_layout(
    barmode='group',
    yaxis_title="Number of Students",
    yaxis_tickformat=','
)

fig