Long Island School Districts
============================
In this Notebook, we look at some basic demographic
information regarding school districts on Long Island, NY.

Goals
-----
- learn the basics of using `pandas` data tables (called `DataFrame`)
- make an interactive map of Long Island school districts
- select a subset of columns
- filter rows based on a condition
- sort data by columns
- create basic bar charts

Data
----

Our data comes from two sources:

1. The [National Center for Education Statistics (NCES)](https://nces.ed.gov/ccd/elsi/default.aspx?agree=0), from the 2022-23 school year.
2. The [NYS GIS Clearinghouse](https://data.gis.ny.gov/), [[data](https://data.gis.ny.gov/maps/b6c624c740e4476689aa60fdc4aacb8f/about)] from 2022, updated 2024.

**Our data contains the following fields:**

- **District:** the short name of the school district (e.g. "Garden City")
- **County:** Nassau or Suffolk
- **Num. Schools:** the number of schools in the district
- **Charter:** Yes or No, is this a charter school district?
- **Total Students:** the total number of students in the district
- **American Indian/Alaska Native Students:** the number of American Indian/Alaska Native students in the district
- **Asian or Asian/Pacific Islander Students:** the number of Asian or Asian/Pacific Islander students in the district
- **Hispanic Students:** the number of Latinx/Hispanic students in the district
- **Black or African American Students:** the number of Black or African American students in the district
- **White Students:** the number of White students in the district
- **Nat. Hawaiian or Other Pacific Isl. Students:** the number of Native Hawaiian or Other Pacific Islander students in the district
- **Two or More Races Students:** the number of students who identify as two or more races
- **geometry:** a special geospatial field that contains the polygon shape of the school district so that we can plot it on a map; once we create the map, we are not going to use this field anymore

_The field/column names are written the way they are reported in the NCES Common Core of Data (CCD)._

Installing libraries and loading data
=====================================
The first code block installs some
extra libraries into the Colab environment.
Then we use the Python `import` statement
to load the libraries (programs that other people wrote)
that we need to run our code.

We use `geopandas` (which builds on `pandas`) to read the data from a GeoJSON file.
Like a CSV, which is a plain-text spreadsheet, it contains
values in columns and rows, plus a shape for each district. In a CSV, the first row usually
has the column names. Our data file is stored on GitHub.
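
To make the idea of a CSV concrete, here is a minimal sketch that reads a few lines of CSV text with `pd.read_csv()`. The text is typed inline (using `io.StringIO`) instead of being downloaded, the variable names `csv_text` and `mini` are made up for this illustration, and the three rows are copied from the table shown further down. It is not how this Notebook loads its data.

```python
import io
import pandas as pd

# A CSV is plain text: the first row holds the column names, and every
# following row holds one record, with the values separated by commas.
csv_text = """District,County,Total Students
Amagansett,Suffolk,125
Babylon,Suffolk,1558
Baldwin,Nassau,4476
"""

# read_csv() usually takes a file path or URL;
# StringIO lets us hand it a string as if it were a file
mini = pd.read_csv(io.StringIO(csv_text))
print(mini)
```

Because the first row is treated as the header, `mini` ends up with three columns and three rows of data.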

In [1]:
!pip install plotly nbformat mapclassify -q
import pandas as pd
import plotly.express as px
import geopandas as gpd
import xyzservices.providers as xyz

# get the shapes
url = "https://raw.githubusercontent.com/mcuringa/cartopy/refs/heads/main/notebooks/data/long-island-districts.geojson"
gdf = gpd.read_file(url)
# some Districts have more than one row, because they have more than one shape
# on the map. We're going to drop those duplicate rows for our data in this Notebook
df = gdf.copy()
df.drop(columns=["geometry"], inplace=True)
df = df.drop_duplicates()
# show the data
display(df)


Unnamed: 0,District,County,Num. Schools,Charter,Total Students,American Indian/Alaska Native Students,Asian or Asian/Pacific Islander Students,Hispanic Students,Black or African American Students,White Students,Nat. Hawaiian or Other Pacific Isl. Students,Two or More Races Students
0,Academy Charter School-Uniondale,Nassau,1,No,1272,14,8,487,734,5,2,22
1,Amagansett,Suffolk,1,No,125,0,0,27,3,93,0,2
2,Amityville,Suffolk,5,No,2951,28,30,1729,919,149,5,91
3,Babylon,Suffolk,3,No,1558,0,39,247,44,1176,3,49
4,Baldwin,Nassau,7,No,4476,10,222,1473,2146,434,17,174
...,...,...,...,...,...,...,...,...,...,...,...,...
121,West Islip,Suffolk,7,No,3913,5,80,503,28,3186,0,110
122,Westbury,Nassau,6,No,4768,10,54,3586,945,159,3,11
123,Westhampton Beach,Suffolk,3,No,1767,1,31,580,21,1077,2,55
124,William Floyd,Suffolk,9,No,9166,28,202,4072,1465,2867,21,511


Create an interactive map
=========================
We're not focusing on making maps in this Notebook,
but on this map we plot all of the districts, using
the `explore()` function from the `geopandas` library.
This uses the `geometry` field to plot the shape of
each district on top of our base map. When you hover
over a district, you can see the district name.
This is the **tooltip**. When you click on a district,
you see the values for the "District", "Num. Schools", and "Total Students"
fields. This is the **popup**.

You may want to look at this map to find the exact name of the district in
our data set when you are filtering the data.

In [2]:
gdf.explore(tiles=xyz.CartoDB.Positron, attr=xyz.CartoDB.Positron.attribution, tooltip="District", popup=["District", "Num. Schools", "Total Students"])

Working with `DataFrame`
========================
Jupyter Notebooks make it easy to see the data we are working with.
The result of the last line of a code block is shown automatically. If it's
the variable name of a `DataFrame`, it will display as a formatted table.
If the table is long (more than 60 rows with the default `pandas` settings), it will show only the first 5 and last 5 rows.

We can use the `head()` method to see the first few rows of the data.
We can use the `tail()` method to see the last few rows of the data.
And we can use the `sample()` method to see a random sample of rows.

Every `DataFrame` has a field called `columns` that lists the column names.
We will need to use these names (_exactly as they are written_) for many of the
operations we will do.

If we want to show a table (or columns, or anything else, really) without
it being the last statement in the code block, we can use the built-in
`display()` function.
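
As a quick illustration of the methods described above, here is a minimal sketch on a tiny made-up table. The name `toy` and its columns are assumptions for this example only. It uses `print()` so that it also runs outside a Notebook; inside a Notebook you could use `display()` instead.

```python
import pandas as pd

# a tiny table with made-up column names and values, only for illustration
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "value": [10, 20, 30, 40, 50, 60],
})

first_two = toy.head(2)   # the first 2 rows
last_two = toy.tail(2)    # the last 2 rows
# 3 random rows; random_state makes the "random" choice repeatable
random_three = toy.sample(3, random_state=1)
# the exact column names, converted to a plain Python list
column_names = list(toy.columns)

print(first_two)
print(last_two)
print(random_three)
print(column_names)
```

Called without a number, `head()` and `tail()` return 5 rows, and `sample()` returns 1.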

Play around with these examples below...


In [3]:
# display the columns
print("Columns:")
display(df.columns)

# show the table four ways: the default view, head(), tail(), and sample()
# only the result of the last line is displayed, so try changing the order
# and the arguments (the number of rows to display)


print("Rows from the table:")
df
df.head()
df.tail()
df.sample(5)



Columns:


Index(['District', 'County', 'Num. Schools', 'Charter', 'Total Students',
       'American Indian/Alaska Native Students',
       'Asian or Asian/Pacific Islander Students', 'Hispanic Students',
       'Black or African American Students', 'White Students',
       'Nat. Hawaiian or Other Pacific Isl. Students',
       'Two or More Races Students'],
      dtype='object')

Rows from the table:


Unnamed: 0,District,County,Num. Schools,Charter,Total Students,American Indian/Alaska Native Students,Asian or Asian/Pacific Islander Students,Hispanic Students,Black or African American Students,White Students,Nat. Hawaiian or Other Pacific Isl. Students,Two or More Races Students
48,Hauppauge,Suffolk,5,No,3224,7,271,580,112,2136,8,105
104,Smithtown Central,Suffolk,12,No,7924,8,513,993,125,6075,15,191
74,New Hyde Park-Garden City Park,Nassau,4,No,1653,12,1019,242,10,319,2,48
110,Syosset Central,Nassau,10,No,6938,8,3317,352,39,3029,5,188
102,Shelter Island,Suffolk,1,No,176,0,2,47,1,120,0,6


Selecting Columns and Sorting Data
----------------------------------
Sometimes we don't want to display all of the columns, or we want to change the order of the columns.
The code below shows how to select a subset of columns from the data.

We can also sort the data by one or more columns. Below there are a couple
of examples of how to sort a `DataFrame` by one or more columns
using the `sort_values()` method.
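
Before working with the real data, here is a minimal sketch of both ideas on a small made-up table. The district names "A" to "D" and their numbers are invented for this example, and `toy`, `subset`, `smallest_first`, and `by_county` are illustrative variable names.

```python
import pandas as pd

# made-up data, only for illustration
toy = pd.DataFrame({
    "District": ["A", "B", "C", "D"],
    "County": ["Nassau", "Suffolk", "Nassau", "Suffolk"],
    "Total Students": [300, 100, 200, 400],
})

# select columns by putting a list of column names inside the square brackets
subset = toy[["District", "Total Students"]]

# sort by one column; smallest values come first unless ascending=False
smallest_first = subset.sort_values("Total Students")

# sort by two columns: County A-Z, then largest districts first within each county
by_county = toy.sort_values(["County", "Total Students"], ascending=[True, False])

print(smallest_first)
print(by_county)
```

Note that `sort_values()` returns a new, sorted table and leaves `toy` unchanged, so you need to save the result in a variable if you want to use it again.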

In [4]:
# show just the District and Total Students
display(df[["District", "Total Students"]])

print("Now sorting by Total Students:")
# sort by Total Students
display(df[["District", "Total Students"]].sort_values("Total Students", ascending=False))
# reading this table, we can see that Brentwood is the largest district on Long Island (by # of students)

# let's add County and Black or African American Students, and sort by Black or African American Students
display(df[["District", "County", "Total Students", "Black or African American Students"]].sort_values("Black or African American Students", ascending=False))

Unnamed: 0,District,Total Students
0,Academy Charter School-Uniondale,1272
1,Amagansett,125
2,Amityville,2951
3,Babylon,1558
4,Baldwin,4476
...,...,...
121,West Islip,3913
122,Westbury,4768
123,Westhampton Beach,1767
124,William Floyd,9166


Now sorting by Total Students:


Unnamed: 0,District,Total Students
11,Brentwood,18323
96,Sachem Central,11844
69,Middle Country Central,9507
124,William Floyd,9166
63,Longwood Central,8876
...,...,...
37,Fishers Island,57
35,Fire Island,34
117,Wainscott Common,26
75,New Suffolk Common,7


Unnamed: 0,District,County,Total Students,Black or African American Students
4,Baldwin,Nassau,4476,2146
101,Sewanhaka Central High,Nassau,7862,1670
63,Longwood Central,Suffolk,8876,1534
124,William Floyd,Suffolk,9166,1465
116,Valley Stream Central High,Nassau,4714,1436
...,...,...,...,...
84,Oysterponds,Suffolk,80,1
109,Springs,Suffolk,688,1
102,Shelter Island,Suffolk,176,1
75,New Suffolk Common,Suffolk,7,0


Filtering Rows
--------------
We have seen how we can select just some columns to work with. We can do the same
thing for rows. In these examples, though, we are going to filter the rows and
save each result into a new `DataFrame` variable that we can use later.

We're going to write a few filters that see if column values are
greater than (`>`), less than (`<`), or equal (`==`) to a certain value.

We're also going to use a **method** called `isin()` that checks if a value is in a list of values.

Last, we will use the logical operator `&` (and) to combine conditions.
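
Here is a minimal sketch of each kind of filter on a small made-up table. The districts "A" to "D" and their values are invented, and the variable names are only for illustration. Each condition inside the square brackets produces one `True`/`False` value per row, and only the `True` rows are kept.

```python
import pandas as pd

# made-up data, only for illustration
toy = pd.DataFrame({
    "District": ["A", "B", "C", "D"],
    "County": ["Nassau", "Suffolk", "Nassau", "Suffolk"],
    "Total Students": [300, 100, 200, 400],
})

# greater than: keep rows with more than 150 students
big = toy[toy["Total Students"] > 150]

# equal to: keep rows where County is exactly "Nassau"
nassau = toy[toy["County"] == "Nassau"]

# isin(): keep rows whose District appears in the list
chosen = toy[toy["District"].isin(["B", "D"])]

# &: both conditions must be true; each condition needs its own parentheses
big_nassau = toy[(toy["County"] == "Nassau") & (toy["Total Students"] > 250)]

print(big)
print(nassau)
print(chosen)
print(big_nassau)
```

The parentheses around each condition matter: without them, Python applies `&` before the comparisons and the filter fails.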


In [5]:
# get just the districts with more than 6,000 students
# let's call this big_districts
big_districts = df[df["Total Students"] > 6000]
big_districts

# only Suffolk County
suffolk = df[df["County"] == "Suffolk"]
suffolk

# get big Nassau districts using &
big_nassau = df[(df["County"] == "Nassau") & (df["Total Students"] > 6000)]
big_nassau

# let's get just Garden City and the districts around it
# first we will create a list of the school districts we want
districts = ['Garden City',
             'Carle Place',
             'Elmont',
             'Franklin Square',
             'Hempstead',
             'Mineola',
             'New Hyde Park-Garden City Park',
             'Sewanhaka Central High',
             'West Hempstead']
# create a new variable called gc (for Garden City)
gc = df[df["District"].isin(districts)]
gc[["District", "Total Students",
    "Asian or Asian/Pacific Islander Students", "Black or African American Students",
    "Hispanic Students", "White Students"]]

Unnamed: 0,District,Total Students,Asian or Asian/Pacific Islander Students,Black or African American Students,Hispanic Students,White Students
14,Carle Place,1264,139,20,313,761
32,Elmont,3394,882,1178,1042,173
39,Franklin Square,1930,317,53,484,1032
41,Garden City,3945,419,30,285,3117
49,Hempstead,6114,67,1268,4571,94
71,Mineola,2868,419,70,954,1323
74,New Hyde Park-Garden City Park,1653,1019,10,242,319
101,Sewanhaka Central High,7862,2094,1670,1816,2183
120,West Hempstead,1586,131,320,760,319


Making Bar Charts
=================
We are going to use a python library called `plotly` to make
interactive bar charts. We will use the `px.bar()` function.

A few things to note:

- the x-axis labels the values going across the bottom of the chart
- the y-axis labels the values going up the side of the chart
- we're going to work with the `gc` variable, which is a subset of the data
  to make these example charts
- if we sort our data, we will get bars in a different order

In [6]:
# a basic bar chart showing the total students in each district
chart_title = "Garden City and Neighbors: Total Students per District"
gc = gc.sort_values("Total Students")
fig = px.bar(gc, x='District', y="Total Students", title=chart_title)
fig

In [7]:
# let's make a bar chart showing the largest demographic groups, just in Garden City
# first, create a list with the columns we want to chart
cols = ["Asian or Asian/Pacific Islander Students", "Black or African American Students", "Hispanic Students", "White Students"]
# get just the one district
just_gc = gc[gc["District"] == "Garden City"]
chart_title = "Garden City: Student Demographics"
fig = px.bar(just_gc, x='District', y=cols, title=chart_title)
fig.update_layout(
    barmode='group',
    yaxis_title="Number of Students",
    yaxis_tickformat=','
)
fig

In [8]:
# last, we will make a bar chart showing the student demographic groups in each district
gc = gc.sort_values("District")

chart_title = "Garden City and Neighbors: Student Demographics"
fig = px.bar(gc, x='District', y=cols, title=chart_title)
fig.update_layout(
    barmode='group',
    yaxis_title="Number of Students",
    yaxis_tickformat=','
)

fig