Lab 6: Grouping, Merging, and Charts
====================================
We're taking a break from making maps this week to focus
on working with data a little bit more. We'll learn
some new techniques for working with data in pandas, including:

- grouping data
- using aggregate functions
- merging data sets
- creating bar charts

The data for this week includes all Census "places" in the New York
metro area. A "place" is a city, town, village, or other census
designated area.

In [1]:
# load the libraries
# these should be all that you need for the lab

!pip install mapclassify -q
import pandas as pd
import geopandas as gpd
from IPython.display import display, HTML, Markdown as md
import xyzservices.providers as xyz
import folium
import math
import matplotlib.pyplot as plt
from matplotlib import colors


Data Set: US Cities
===================
We will keep working with the US Cities data set. You can revisit
the [lab 4 data notebook](https://colab.research.google.com/drive/1i5DIOc9NVH0ek1SkVCbcRaLoJItpxBJ5#scrollTo=u3j7sfwYpXvt)
for more details.


In [2]:
# load the URL into a geopandas GeoDataFrame
url = "https://github.com/mcuringa/cartopy/raw/main/notebooks/data/us-city-demographics.geojson"
df = gpd.read_file(url)
# select just the columns we will be using
cities = df[['city', 'state', 'total_pop', 'median_inc', 'asian', 'black', 'indian',
            'latino', 'mixed', 'other', 'pacific', 'white', 'poverty']]
cities.sample(5)


Unnamed: 0,city,state,total_pop,median_inc,asian,black,indian,latino,mixed,other,pacific,white,poverty
43,Glendale,CA,194512,81219,26537,3287,168,36701,7244,665,321,119589,25792
266,Clarksville,TN,167882,62688,4384,38289,408,19946,9797,1305,588,93165,21089
12,San Tan Valley,AZ,101207,88466,2191,3465,1993,26754,3837,436,186,62345,7757
92,Visalia,CA,141466,75658,8589,3168,409,74253,3011,246,53,51737,17689
329,Spokane,WA,227922,63316,5778,5813,2233,16457,13474,854,1945,181368,32913


Group by State
==============
We can use the values of one (or more) columns to group the data,
and then use **aggregate functions** to summarize the data in
each group (i.e. the _other_ columns).

A group-by operation usually has two parts:
first, group the data on a column; then
aggregate the data within each group. We typically
do both in one statement.

In these examples we group by state and 
use the **sum** and **mean** functions.
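
To make the two parts concrete, here is a minimal sketch that does the grouping and the aggregating as separate steps and then as one statement. It uses five rows copied from the sample output above rather than the full data set; the variable names (`sample`, `grouped`, `totals`) are just for illustration.

```python
import pandas as pd

# five rows copied from the sample output above
sample = pd.DataFrame({
    "city": ["Glendale", "Clarksville", "San Tan Valley", "Visalia", "Spokane"],
    "state": ["CA", "TN", "AZ", "CA", "WA"],
    "total_pop": [194512, 167882, 101207, 141466, 227922],
})

# step 1: group the rows by state (nothing is calculated yet)
grouped = sample.groupby("state")

# step 2: aggregate the numeric column within each group
totals = grouped["total_pop"].sum()

# the same thing in one statement
totals_one_line = sample.groupby("state")["total_pop"].sum()

print(totals)
```

The two California cities end up in one group, so the CA total is the sum of both rows, while the other states keep their single-city values.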

Using `sum()`
-------------

In [3]:
# get just the columns we want to show
# we always need to include the columns used for the grouping
cols = ["state", "total_pop", "asian", "black", "latino","white" ]
state_pop = cities[cols].groupby("state").sum()

# we can call `reset_index()` to flatten the column names (index)
# after calling groupby()
state_pop.reset_index(inplace=True)
state_pop.sort_values("total_pop", ascending=False, inplace=True)

display(md("## Total population in large cities by State"))
state_pop.head()

## Total population in large cities by State

Unnamed: 0,state,total_pop,asian,black,latino,white
4,CA,20068725,3368163,1346231,8303682,6071919
41,TX,13458283,881577,1858920,6044562,4272467
31,NY,9466061,1288032,2048258,2680408,3027686
7,FL,5356116,176537,1165722,1609203,2187992
3,AZ,4304479,188365,234543,1442433,2201196
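
Bar charts are on this week's list, so here is a minimal sketch of one way to draw the table above as a chart. It rebuilds the five rows shown instead of using the full `state_pop` frame; the figure size, axis labels, and the scaling to millions are our own choices. The `matplotlib.use("Agg")` line is only there so the sketch also runs outside a notebook, and can be dropped in Colab.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# the five rows shown in the table above
top5 = pd.DataFrame({
    "state": ["CA", "TX", "NY", "FL", "AZ"],
    "total_pop": [20068725, 13458283, 9466061, 5356116, 4304479],
})

fig, ax = plt.subplots(figsize=(6, 4))
# one bar per state, with the population shown in millions
ax.bar(top5["state"], top5["total_pop"] / 1_000_000)
ax.set_xlabel("State")
ax.set_ylabel("Population in large cities (millions)")
ax.set_title("Total population in large cities by state")
fig.tight_layout()
```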


Using `mean()`
--------------
The `mean()` function calculates the arithmetic mean (the average) of the values in each group.


In [4]:

cols = ["state", "median_inc"]
state_inc = cities[cols].groupby("state").mean()

state_inc.reset_index(inplace=True)
state_inc.sort_values("median_inc", ascending=False, inplace=True)

display(md("## Average urban median income by State"))
state_inc.head()

## Average urban median income by State

Unnamed: 0,state,median_inc
4,CA,96081.894737
0,AK,95731.0
18,MD,91443.0
5,CO,89881.692308
44,WA,89465.888889


Combining aggregate functions
-----------------------------
So far, we have applied `sum()` or `mean()` to
all of the aggregated columns.

But we can also apply different functions
to different columns.

In this example we will get the total urban
population by state using `sum()` and the 
average urban median income using `mean()`.
To do this, we will call the `agg()`
function on the groupby object and pass
a dictionary that matches column names (keys)
with aggregate function names (values).


In [5]:
cols = ["state", "total_pop", "median_inc"]
# create the dict with col:function
agg = {
    "total_pop": "sum", 
    "median_inc": "mean"
}

# pass the dict to agg() so each column gets its own function
data = cities[cols].groupby("state").agg(agg)

data.reset_index(inplace=True)
data.sort_values("state", inplace=True)

display(md("## Total urban population and average median income by State"))
data.head()

## Total urban population and average median income by State

Unnamed: 0,state,total_pop,median_inc
0,AK,290674,95731.0
1,AL,907388,52057.0
2,AR,202218,58697.0
3,AZ,4304479,83457.727273
4,CA,20068725,96081.894737
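
As an alternative to the dictionary form, recent versions of pandas also accept "named aggregation", where each keyword names an output column and is paired with a `(column, function)` tuple. This is a minimal sketch on four rows copied from the sample output earlier in the lab; the output names `urban_pop` and `avg_median_inc` are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# four rows copied from the sample output earlier in the lab
sample = pd.DataFrame({
    "city": ["Glendale", "Visalia", "Spokane", "Clarksville"],
    "state": ["CA", "CA", "WA", "TN"],
    "total_pop": [194512, 141466, 227922, 167882],
    "median_inc": [81219, 75658, 63316, 62688],
})

# each keyword becomes an output column: name=(input column, function)
summary = sample.groupby("state").agg(
    urban_pop=("total_pop", "sum"),
    avg_median_inc=("median_inc", "mean"),
)
print(summary)
```

The result has one row per state, and the new column names say which function produced each value.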


Example: racial pluralities
---------------------------
We will use state aggregates
to work with a more complex example.

In [6]:
cols = ['city', 'state', 'total_pop',  'asian', 'black', 'latino', 'white', 'poverty', 'median_inc']

data = cities[cols].copy()

# idxmax returns the label of the max value; for a row, that label is a column name
def get_plurality(row):
    max_cat = row[["asian", "black", "latino", "white"]].idxmax()
    return max_cat

data["plurality"] = data.apply(get_plurality, axis=1)

# calculate the percent poverty
data["poverty_pct"] = data["poverty"] / data["total_pop"]

# calculate the racial percentages using a for loop
for col in ["asian", "black", "latino", "white"]:
    pct_col = col + "_pct"
    data[pct_col] = data[col] / data["total_pop"]

# which states have the "whitest" cities?
data[["city", "state", "white_pct", "latino_pct", "black_pct", "asian_pct"]].sort_values(by="white_pct", ascending=False).head(10)


Unnamed: 0,city,state,white_pct,latino_pct,black_pct,asian_pct
190,Dearborn,MI,0.866862,0.02805,0.032321,0.024932
205,Springfield,MO,0.841129,0.04594,0.042458,0.018398
206,Billings,MT,0.835225,0.070525,0.005261,0.008694
330,Spokane Valley,WA,0.830794,0.071135,0.014379,0.016461
149,Meridian,ID,0.821593,0.089587,0.014449,0.022207
148,Boise City,ID,0.812235,0.090571,0.014339,0.033673
241,Fargo,ND,0.802127,0.032116,0.084818,0.040206
102,Highlands Ranch,CO,0.800225,0.08981,0.010718,0.06146
194,Sterling Heights,MI,0.799086,0.02018,0.058425,0.088475
329,Spokane,WA,0.795746,0.072205,0.025504,0.025351


### Using value_counts()
The `value_counts()` function is a useful
way to count the number of occurrences of
each unique value in a column. In the example below,
after we calculate the string column "plurality",
we can use `value_counts()` to
count how many times each unique value occurs.

First, we will do it for the entire data set, on
just the new column. Then we will use it in
our aggregate function.

In [7]:
# this Series shows us the number of cities
# with a plurality for each racial/ethnic category
data.plurality.value_counts()

plurality
white     212
latino     87
black      31
asian      10
Name: count, dtype: int64
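
If we want proportions instead of raw counts, `value_counts()` takes a `normalize=True` argument. This is a minimal sketch that rebuilds a column with the same counts as the output above, since it does not load the full data set.

```python
import pandas as pd

# rebuild a column with the same counts as the output above
plurality = pd.Series(
    ["white"] * 212 + ["latino"] * 87 + ["black"] * 31 + ["asian"] * 10,
    name="plurality",
)

# number of cities per category, largest first
counts = plurality.value_counts()

# share of all cities per category (values add up to 1)
shares = plurality.value_counts(normalize=True)
print(shares.round(3))
```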

In [8]:
agg = {
    "asian_pct": "mean",
    "black_pct": "mean",
    "latino_pct": "mean",
    "white_pct": "mean",
    "poverty_pct": "mean",
    "plurality": "value_counts"
}
# we can use the keys from our agg dictionary to get the columns we want for our group by
# let's do it in steps
# 1- get the keys and assign to the cols variable
cols = agg.keys()
# 2- convert the cols to a list
cols = list(cols)
# 3- combine state with the cols -- we need it for the group
cols = ["state"] + cols

# in one statement:
# cols = ["state"] + list(agg.keys())

# when we group by 2 properties, 
# we get one row for each combination of the two columns
pluralities = data[cols].groupby(["state", "plurality"]).agg(agg)
# rename plurality, because it now tells us the number of rows (aka the number of cities)
# where that group is the plurality in a state
pluralities.rename(columns={"plurality": "num_cities"}, inplace=True)
pluralities.reset_index(inplace=True)
pluralities.head(5)


Unnamed: 0,state,plurality,asian_pct,black_pct,latino_pct,white_pct,poverty_pct,num_cities
0,AK,white,0.095623,0.050892,0.096885,0.55242,0.094315,1
1,AL,black,0.022626,0.612918,0.035629,0.303047,0.217666,3
2,AL,white,0.023194,0.364989,0.053027,0.527502,0.169214,2
3,AR,white,0.029963,0.414538,0.078262,0.446325,0.161593,1
4,AZ,latino,0.033304,0.056773,0.438589,0.417867,0.165843,2


### Format the table with styler
There are several ways to format a table.
So far, we have mostly been copying the
whole dataframe and then modifying the column data.

`pandas` allows us to apply styles without
altering the underlying data. In the next code
block we set up a dictionary called `styles`
that maps each column name to a format string.
We can use the DataFrame's `style` attribute
to apply these formats to the table.

In [9]:
# use a format dictionary
styles = {
    "total_pop": "{:,.0f}",
    "asian_pct": "{:.1%}",
    "black_pct": "{:.1%}",
    "latino_pct": "{:.1%}",
    "white_pct": "{:.1%}",
    "poverty_pct": "{:.1%}",
    "median_inc": "${:,.0f}",
}
pluralities.head(5).style.format(styles)


Unnamed: 0,state,plurality,asian_pct,black_pct,latino_pct,white_pct,poverty_pct,num_cities
0,AK,white,9.6%,5.1%,9.7%,55.2%,9.4%,1
1,AL,black,2.3%,61.3%,3.6%,30.3%,21.8%,3
2,AL,white,2.3%,36.5%,5.3%,52.8%,16.9%,2
3,AR,white,3.0%,41.5%,7.8%,44.6%,16.2%,1
4,AZ,latino,3.3%,5.7%,43.9%,41.8%,16.6%,2
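
The format strings in `styles` are ordinary Python format strings, so we can try them on single values to see what each one does. A minimal sketch, using numbers copied from the tables in this lab:

```python
# values copied from the tables in this lab
pct = "{:.1%}".format(0.095623)          # multiply by 100, one decimal, add %
count = "{:,.0f}".format(20068725)       # thousands separators, no decimals
money = "${:,.0f}".format(96081.894737)  # same, with a leading dollar sign

print(pct)    # 9.6%
print(count)  # 20,068,725
print(money)  # $96,082
```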


In [10]:
# we can keep working with the data and reuse the formats for display

display(md("## States with the most cities where the plurality is Black"))
pluralities[pluralities.plurality == "black"].sort_values("num_cities", ascending=False).style.format(styles)

## States with the most cities where the plurality is Black

Unnamed: 0,state,plurality,asian_pct,black_pct,latino_pct,white_pct,poverty_pct,num_cities
16,GA,black,2.4%,57.6%,5.2%,30.9%,17.6%,6
1,AL,black,2.3%,61.3%,3.6%,30.3%,21.8%,3
26,LA,black,2.7%,55.2%,4.3%,34.3%,22.6%,3
13,FL,black,2.9%,53.6%,34.7%,6.1%,11.7%,2
39,NC,black,4.2%,42.0%,10.9%,36.8%,17.3%,2
67,TX,black,3.5%,40.1%,24.0%,27.5%,17.9%,2
71,VA,black,2.1%,45.9%,7.1%,39.1%,15.8%,2
28,MA,black,1.8%,39.1%,12.3%,29.0%,12.7%,1
31,MD,black,2.5%,60.7%,5.9%,27.0%,18.9%,1
37,MS,black,0.3%,82.0%,1.7%,14.7%,24.8%,1


Merging Data
============
We can **merge** two data sets together using a shared
column (or combination of columns). There are several
types of merges, but for now we are most focused on an
**inner merge**, which keeps only the rows whose key
value appears in both data sets. Keep in mind that you might
lose some rows from your original data after an inner merge.

In this example we merge the US Cities data, aggregated at the state
level, with the same measures for each state's total population.
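
Here is a minimal sketch of how an inner merge can drop rows, and one way to check for that. The AK and AL numbers are copied from this lab's tables; the `"XX"` row is a made-up state code added only so that one key has no match. The `how="left"` and `indicator=True` arguments are standard `merge()` options, but using them here is our suggestion, not part of the lab's own code.

```python
import pandas as pd

# AK and AL values are copied from this lab's tables;
# "XX" is a made-up state code with no match in the second frame
urban = pd.DataFrame({"state": ["AK", "AL", "XX"],
                      "total_pop": [290674, 907388, 1000]})
statewide = pd.DataFrame({"state": ["AK", "AL"],
                          "total_pop": [734821, 5028092]})

# inner merge (the default): only keys found in both frames survive
inner = urban.merge(statewide, on="state", suffixes=("_urban", "_state"))

# left merge keeps every row of `urban`; indicator=True adds a `_merge`
# column that records where each row was found
left = urban.merge(statewide, on="state", how="left",
                   suffixes=("_urban", "_state"), indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(urban), len(inner))
print(unmatched["state"].tolist())
```

Comparing the row counts before and after the merge is a quick way to notice lost rows.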

In [29]:
# group our data at the state level
agg = {
    "total_pop": "sum",
    "asian": "sum",
    "black": "sum",
    "latino": "sum",
    "white": "sum",
    "poverty": "sum",
    "median_inc": "mean"
}
# aggregating with the dict keeps only its columns (plus state)
city_data = cities.groupby("state").agg(agg).reset_index()

# next, read the state-level data set
url = "https://raw.githubusercontent.com/mcuringa/cartopy/refs/heads/main/notebooks/data/state_demographics.csv"
state_data = pd.read_csv(url)
state_data.head()

Unnamed: 0,state,total_pop,asian,black,indian,latino,mixed,other,pacific,white,median_inc,poverty,state_name,geoid,statefp
0,AL,5028092.0,69099.0,1318388.0,14864.0,232407.0,129791.0,14724.0,1557.0,3247262.0,59609.0,768897.0,Alabama,0400000US01,1
1,AK,734821.0,46507.0,22400.0,102445.0,54890.0,65029.0,3808.0,10940.0,428802.0,86370.0,75227.0,Alaska,0400000US02,2
2,AZ,7172282.0,233864.0,307726.0,249047.0,2297513.0,247176.0,23071.0,12764.0,3801121.0,72581.0,916876.0,Arizona,0400000US04,4
3,AR,3018669.0,46593.0,454728.0,11851.0,243321.0,140175.0,7194.0,11023.0,2103784.0,56335.0,475729.0,Arkansas,0400000US05,5
4,CA,39356104.0,5861649.0,2102510.0,114271.0,15617930.0,1499338.0,176652.0,135460.0,13848294.0,91905.0,4685272.0,California,0400000US06,6


In [12]:
# keep only the matching columns in state_data
state_data = state_data[city_data.columns]

merged_data = city_data.merge(state_data, on="state", suffixes=("_urban", "_state"))
# merged_data = city_data.merge(state_data, on="state")
merged_data.head()

Unnamed: 0,state,total_pop_urban,asian_urban,black_urban,latino_urban,white_urban,poverty_urban,median_inc_urban,total_pop_state,asian_state,black_state,latino_state,white_state,poverty_state,median_inc_state
0,AK,290674,27795,14793,28162,160574,27415,95731.0,734821.0,46507.0,22400.0,54890.0,428802.0,75227.0,86370.0
1,AL,907388,20510,470876,39244,350228,178364,52057.0,5028092.0,69099.0,1318388.0,232407.0,3247262.0,768897.0,59609.0
2,AR,202218,6059,83827,15826,90255,32677,58697.0,3018669.0,46593.0,454728.0,243321.0,2103784.0,475729.0,56335.0
3,AZ,4304479,188365,234543,1442433,2201196,545690,83457.727273,7172282.0,233864.0,307726.0,2297513.0,3801121.0,916876.0,72581.0
4,CA,20068725,3368163,1346231,8303682,6071919,2523871,96081.894737,39356104.0,5861649.0,2102510.0,15617930.0,13848294.0,4685272.0,91905.0


In [13]:
black_pop = merged_data[["state", "total_pop_state", "total_pop_urban", "black_urban", "black_state"]].copy()

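# note: this ratio is the Black share of the large-city ("urban") population in each state
black_pop["black_pct_urban"] = black_pop.black_urban / black_pop.total_pop_urban
# caution: 1 - black_pct_urban is the non-Black share of that same urban population;
# the Black share of the non-urban population would instead be
# (black_state - black_urban) / (total_pop_state - total_pop_urban)
black_pop["black_pct_nonurban"] = 1 - black_pop.black_pct_urban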

# get just the columns we want to show
black_pop = black_pop[["state", "total_pop_state", "black_state", "black_pct_urban", "black_pct_nonurban"]]
black_pop.sort_values("black_pct_nonurban", inplace=True)

# give the columns more meaningful names
black_pop.columns = ["State", "State Population", "Black Population",
                     "% Black People in Urban Areas", "% Black People in non-Urban Areas"]

styles = {
    "State Population": "{:,.0f}",
    "Black Population": "{:,.0f}",
    "% Black People in Urban Areas": "{:.1%}",
    "% Black People in non-Urban Areas": "{:.1%}"
}


display(md("## Urban vs Non-Urban Demographics: Black Residence"))
display(md("#### Urban concentration of Black residents"))
display(black_pop.head(10).style.format(styles))

display(md("#### Non-Urban concentration of Black residents"))
display(black_pop.tail(5).style.format(styles))

## Urban vs Non-Urban Demographics: Black Residence

#### Urban concentration of Black residents

Unnamed: 0,State,State Population,Black Population,% Black People in Urban Areas,% Black People in non-Urban Areas
22,MS,2958846,1098675,82.0%,18.0%
18,MD,6161707,1815877,55.6%,44.4%
1,AL,5028092,1318388,51.9%,48.1%
8,GA,10722325,3334095,48.7%,51.3%
16,LA,4640546,1456107,46.5%,53.5%
2,AR,3018669,454728,41.5%,58.5%
19,MI,10057921,1346918,41.3%,58.7%
40,TN,6923772,1116871,36.5%,63.5%
35,PA,12989208,1347784,34.8%,65.2%
32,OH,11774683,1431238,33.7%,66.3%


#### Non-Urban concentration of Black residents

Unnamed: 0,State,State Population,Black Population,% Black People in Urban Areas,% Black People in non-Urban Areas
42,UT,3283809,34485,1.9%,98.1%
9,HI,1450589,26664,1.7%,98.3%
11,ID,1854109,11919,1.3%,98.7%
23,MT,1091840,5248,0.5%,99.5%
36,PR,3272382,4043,0.2%,99.8%
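
The `black_pct_nonurban` column above is computed as `1 - black_pct_urban`, which gives the non-Black share of the urban population, not the Black share of the population outside the cities. Here is a minimal sketch of how the urban and non-urban shares could be computed separately, using the five rows shown in the merged table earlier. "Urban" here only means the large cities in our data set, and the `black_in_cities` column name is our own.

```python
import pandas as pd

# five rows copied from the merged table earlier in this section
m = pd.DataFrame({
    "state": ["AK", "AL", "AR", "AZ", "CA"],
    "total_pop_urban": [290674, 907388, 202218, 4304479, 20068725],
    "black_urban": [14793, 470876, 83827, 234543, 1346231],
    "total_pop_state": [734821, 5028092, 3018669, 7172282, 39356104],
    "black_state": [22400, 1318388, 454728, 307726, 2102510],
})

# Black share of the population living in the listed large cities
m["black_pct_urban"] = m.black_urban / m.total_pop_urban

# Black share of the population living everywhere else in the state:
# subtract the city totals from the state totals first
m["black_pct_nonurban"] = (
    (m.black_state - m.black_urban) / (m.total_pop_state - m.total_pop_urban)
)

# share of the state's Black residents who live in the listed large cities
m["black_in_cities"] = m.black_urban / m.black_state

print(m[["state", "black_pct_urban", "black_pct_nonurban", "black_in_cities"]].round(3))
```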


Problems

Problem 1: NYS zip codes
========================
- load zip code demographic data
- use explore() to plot it on a map
- use the zip code as the tooltip

Problem 2:
==========
- drop the geometry column
- maybe something with quantiles or qcut
- or make some practice data
