Lab 4: Urban Demographics
=========================
In this lab we will look at more data about large
cities in the United States. This data set includes
information on racial and ethnic composition,
income, and population. We will focus this lab
on using design to create meaningful visualizations
in our maps.

In [1]:
# import the libraries that we need
import pandas as pd
import geopandas as gpd
from IPython.display import display, HTML, Markdown as md
import xyzservices.providers as xyz
import folium
import matplotlib.pyplot as plt
from matplotlib import colors

In [2]:
# load the URL into a geopandas GeoDataFrame
url = "https://github.com/mcuringa/cartopy/raw/main/notebooks/data/us-city-demographics.geojson"
cities = gpd.read_file(url)
cities.head()

Unnamed: 0,city,state,total_pop,median_inc,asian,black,indian,latino,mixed,other,pacific,white,poverty,pop_change,inc_change,area,geometry
0,Birmingham,AL,200431,42464,2967,137147,206,8039,3655,1478,64,46875,49921,-11834.0,8694.0,147.068992,POINT (-86.79704 33.52717)
1,Huntsville,AL,215025,67874,4487,64878,746,13825,7964,567,185,122373,28809,24524.0,15948.0,223.631787,POINT (-86.67667 34.69772)
2,Mobile,AL,186316,48524,3441,98613,353,5267,4064,568,85,73925,37379,-5769.0,8504.0,139.480833,POINT (-88.10023 30.66843)
3,Montgomery,AL,199819,54166,6915,124930,116,7695,4252,254,8,55649,40625,-942.0,9827.0,159.855912,POINT (-86.26731 32.3485)
4,Tuscaloosa,AL,105797,47257,2700,45308,67,4418,1778,120,0,51406,21630,8424.0,4829.0,62.149915,POINT (-87.5283 33.2344)


Review: writing functions
=========================
In our last lab we learned how to write functions in Python:

- a **function** is a reusable piece of code
- it operates on input data called **arguments**
- it can **return** a value (which can be stored in a variable, passed to another function, used to evaluate if a Boolean expression is true, etc.)
- the function definition, or **function header**, includes the function name and the **parameters** it takes
- **parameters** behave like variables that are local to the function (they only exist within the function)
- when we pass a value directly to a function, it is called a **literal**; we are more likely
  to pass variables or the result of other functions as arguments
- in Python, the **function body** is indented; all of the indented code is executed when the function is **called**
- a function doesn't do anything until it is called (used in another part of the program)
- function names should:
  - describe what the function does
  - be written in lowercase
  - start with a letter
  - contain only letters, numbers, and underscores
  - use underscores (`_`) to separate words
- arguments are passed in the same order as the parameters are defined
- Python also supports **keyword parameters**, which are optional and declare a default value; they can be passed in any order
- if a function has both positional and keyword parameters, the positional parameters must come first; 
  the order of the keyword parameters does not matter
  (see the `base_map` function below)
- functions can have a **docstring**, which is a string that describes what the function does; this can be used to automatically generate documentation 
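
As a quick illustration of the last few points, here is a minimal sketch. The `describe_city` function and its parameters are made up for this example and are not used elsewhere in the lab. It shows keyword arguments passed in any order and a docstring read back from the function.

```python
# hypothetical function, for illustration only
def describe_city(name, state, pop=0, label="residents"):
    """Return a one-line description of a city."""
    return f"{name}, {state}: {pop:,} {label}"

# positional arguments are matched in the order the parameters are defined
print(describe_city("Mobile", "AL"))

# keyword arguments can be passed in any order
print(describe_city("Mobile", "AL", label="people", pop=186316))

# the docstring is stored on the function itself
print(describe_city.__doc__)
```

Calling `help(describe_city)` in a notebook should show the same docstring along with the function header.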

Function Examples
=================
Simple Function Examples
------------------------


In [3]:
# a very basic function with no parameters or return statement
def say_hello():
    print("Hello, world!")

print("""
calling say_hello()""")
say_hello()

# a function that takes a parameter and returns a value
def perimeter_square(side):
    return side * 4

print("""
calling perimeter_square() and capturing the result""")
p = perimeter_square(5)
print(f"The perimeter of a square with side length 5 is {p}")

# a function with a positional parameter and a keyword parameter
def calc_tax(price, tax_rate=0.03):
    return price * tax_rate

print("""
calling calc_tax() with the default tax rate of 3%""")
tax = calc_tax(100)
print(f"The tax on a $100 purchase is ${tax:.2f}")

print("""
calling calc_tax() with a custom tax rate of 4.5%""")
tax = calc_tax(200, tax_rate=.045)
print(f"The tax on a $200 purchase with a 4.5% tax rate is ${tax:.2f}")



calling say_hello()
Hello, world!

calling perimeter_square() and capturing the result
The perimeter of a square with side length 5 is 20

calling calc_tax() with the default tax rate of 3%
The tax on a $100 purchase is $3.00

calling calc_tax() with a custom tax rate of 4.5%
The tax on a $200 purchase with a 4.5% tax rate is $9.00


`base_map`: a "real world" function example
-------------------------------------------
We're going to write a `base_map` function that will create
a nicely styled base map for our visualizations. If we want
to change the default look of the maps in this lab, we just
need to update the function and we don't need to change
our code in any other place.

In [11]:
# write a function that will create a base map that we can use as needed
def base_map(center=None, zoom=5, provider=xyz.CartoDB.Positron, name="US Cities"):
    """
    Create a base map using Folium.

    Parameters:
    - center: [latitude, longitude] for the center of the map
    - zoom: Initial zoom level of the map, scale is 1-18 (zoomed out --> zoomed in)
    - provider: Map tile provider from xyzservices (e.g., xyz.CartoDB.Positron, xyz.CartoDB.DarkMatter, etc.)
    - name: Name of the map (will show up if Layer Control is added)

    Returns:
    - folium.Map object
    """
    # if no center is provided, center it near the middle of the US
    if center is None:
        center = [37.886, -97.232]
    
    # check the range on our zoom level
    if zoom < 1 or zoom > 18:
        zoom = 5

    m = folium.Map(name=name, tiles=provider, attr=provider.attribution, location=center, zoom_start=zoom)
    return m


# call our base_map() function to see what it looks like
# base_map()

# get the center of the New Jersey cities in our data (the mean of their coordinates)
nj = cities[cities.state == "NJ"]
nj_center = [nj.geometry.y.mean(), nj.geometry.x.mean()]
base_map(center=nj_center, zoom=12)

Cities Data Dictionary
======================
A **data dictionary** is a document that describes the structure of a dataset.
It should indicate where and when the data was gathered (and the method of
collection, if needed).

For each field (i.e. column) it will describe the information contained and the type of data.
If some data might be missing for a particular row, it should describe how that is handled.
Typically this field would be zero, an empty string, NaN (Not a Number), or null.
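
To see how missing values show up in practice, here is a small sketch using pandas. The table below is made up for illustration; it is not the lab's data. It counts the missing values in each column and then keeps only the complete rows.

```python
import pandas as pd

# a tiny made-up table; None stands in for a missing value
df = pd.DataFrame({
    "city": ["A", "B", "C"],
    "pop_change": [1200.0, None, -300.0],
})

# count the missing (NaN) values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# keep only the rows that have a value for pop_change
complete = df[df["pop_change"].notna()]
print(complete)
```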

The data we loaded is compiled from the 2022 American Community Survey (ACS) 5-Year Estimates.
It includes all "places" that have a population of 100,000 or more. It includes
two "calculated fields", `inc_change` and `pop_change`, which represent the
five-year change in `median_inc` and `total_pop` between the 2017 and 2022 ACS estimates.
Geographic data is joined from the census 2018 TIGER/Line shapefiles, and includes
the "point" geometry for each city and the land area of the city in square miles.
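
As a rough sketch of what a calculated field is, the snippet below reproduces `pop_change` for Birmingham, AL. The 2017 figure is not in the data set; it is back-calculated here from the `total_pop` and `pop_change` values in the first row of the table shown earlier, so treat it as an assumption.

```python
# Birmingham, AL: total_pop from the first row of the table
total_pop_2022 = 200431
# assumed 2017 estimate, back-calculated as total_pop - pop_change
total_pop_2017 = 212265

# the calculated field is the difference between the two estimates
pop_change = total_pop_2022 - total_pop_2017
print(pop_change)
```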

The racial/ethnic categories in this data are based on the U.S. Census Bureau's classifications,
which count Hispanic residents separately (because Hispanic residents can be of any race or of multiple races). So
these categories are technically "Asian, non-Hispanic", "Black or African American, non-Hispanic", etc.


Data Fields:
------------
- **city**: The name of the city. This column contains the place name of the city as identified in the ACS dataset.
- **state**: The state in which the city is located, represented by the state's two-letter abbreviation.
- **total_pop**: The total population of the city in 2022 as recorded in the ACS survey.
- **median_inc**: The median household income in the city in 2022. This value represents the median income of all households.
- **asian**: The population count of individuals identified as Asian in the 2022 ACS survey within the city.
- **black**: The population count of individuals identified as Black or African American in the 2022 ACS survey within the city.
- **indian**: The population count of individuals identified as Native American or Alaska Native in the 2022 ACS survey within the city.
- **latino**: The population count of individuals identified as Hispanic or Latino in the 2022 ACS survey within the city.
- **mixed**: The population count of individuals identified as having two or more races in the 2022 ACS survey within the city.
- **other**: The population count of individuals identified as 'Other' in the 2022 ACS survey within the city. This includes respondents who did not classify themselves in the listed racial categories.
- **pacific**: The population count of individuals identified as Native Hawaiian or Other Pacific Islander in the 2022 ACS survey within the city.
- **white**: The population count of individuals identified as White in the 2022 ACS survey within the city.
- **poverty**: The population count of the city's residents living below the poverty line in 2022.
- **pop_change**: The change in the city's population from the 2017 ACS survey to the 2022 ACS survey.
- **inc_change**: The change in median household income in the city from the 2017 ACS survey to the 2022 ACS survey.
- **area**: The total land area of the city, measured in square miles.
- **geometry**: This special geometry field contains a Point geometry representing the geographic center of the city. 
  This is the column used to plot our data on maps using GeoPandas and Folium.
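
Count fields like these are often easier to compare between cities once they are turned into rates. The sketch below is one possible way to do that; the `poverty_rate` and `density` column names are made up for this example and are not part of the data set. The two rows are copied from the `cities.head()` output above.

```python
import pandas as pd

# two rows copied from the cities.head() output above
sample = pd.DataFrame({
    "city": ["Birmingham", "Huntsville"],
    "total_pop": [200431, 215025],
    "poverty": [49921, 28809],
    "area": [147.068992, 223.631787],
})

# share of residents living below the poverty line
sample["poverty_rate"] = sample["poverty"] / sample["total_pop"]
# residents per square mile of land area
sample["density"] = sample["total_pop"] / sample["area"]

print(sample[["city", "poverty_rate", "density"]])
```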


Formatting text with Markdown
=============================
Markdown is a lightweight "markup language" that you can use to add formatting elements to plain text documents.
The idea is that it makes your plain text easier to read, but also can be used to produce rich text and multimedia
documents. It's used in many places, but for us, it's a built in feature of Jupyter Notebooks and Google Colab.

If you double click to enter _this cell_, you will see the underlying Markdown code!

In Jupyter, you can create Markdown cells that allow you to explain your methods,
describe your document, and format findings. We can also use it as a lightweight
way to display formatted output by using the `display()` and `Markdown()`
functions that we imported. I renamed the `Markdown` function to `md` to make it easier to use.
F-strings make it easy to embed data in our markdown, and then display it from our code.
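
Here is a minimal sketch of that idea. The `report` variable name is made up for this example; the city and income values come from the table at the top of the lab. The code only builds and prints the Markdown text, so it runs outside a notebook too.

```python
# values taken from the Huntsville row of the table above
city = "Huntsville"
median_inc = 67874

# build Markdown text with an f-string: a heading and a bullet with bold text
report = f"""
### {city}
- **Median income:** ${median_inc:,.0f}
"""
print(report)

# in a notebook, the same string can be rendered as formatted text with:
# display(md(report))
```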

[**Read more about Markdown in Colab here**](https://colab.research.google.com/notebooks/markdown_guide.ipynb)

Note: _you will see, I like to actually number my numbered lists (1. 2. 3. 4.) and to use `-` instead of `*` for bullet lists!_

### Formatting with HTML
We will use markdown more extensively later, in combination with other libraries and HTML markup.
In this lab, though, I'm just going to use the `<b>`bold`</b>` tag to make bold text in my tooltips and popups.

Color Maps
==========
When creating visualizations, color maps can be used to represent different values or categories.
We will be using color maps to plot different colored points on our maps, that will
convey different information about each city's demographics.

Color Map Types:
----------------
1. **Sequential Color Maps**: These are used for representing ordered data that progresses 
   from low to high. For example, a color map that goes from light blue to dark blue can 
   represent income levels, where light blue indicates lower income and dark blue indicates 
   higher income.
2. **Diverging Color Maps**: These are used for representing data that has a critical midpoint, 
   such as above or below average. We will use a diverging color map to show population change
   in the cities in our data set, where blues indicate a loss in population and reds an increase.
   Cities with little or no change will be represented in a neutral color.
3. **Categorical Color Maps**: These are used for representing distinct categories or groups.
   With categorical colors, the color itself is not meaningful; it is meant to allow the
   user to easily distinguish between different categories. Categorical color maps have
   been designed to create distinct labels for each category. We will use a categorical color map
   to represent different racial/ethnic groups in our visualizations. We will also need a key
   for the categorical colors to be useful.

See the [list of matplotlib color maps here](https://matplotlib.org/stable/tutorials/colors/colormaps.html).
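
To make the three types concrete, here is a short sketch that pulls a few colors out of one map of each kind. The variable names are made up for this example. Sequential and diverging maps are sampled with a number between 0 and 1, while a categorical map like `tab10` is indexed with whole numbers.

```python
import matplotlib.pyplot as plt
from matplotlib import colors

# sequential: 0.0 is the lightest blue, 1.0 is the darkest
blues = plt.get_cmap("Blues")
low, high = colors.rgb2hex(blues(0.0)), colors.rgb2hex(blues(1.0))
print(low, high)

# diverging: the two ends are different hues, with a pale middle
seismic = plt.get_cmap("seismic")
ends = [colors.rgb2hex(seismic(0.0)), colors.rgb2hex(seismic(1.0))]
print(ends)

# categorical: each whole number picks one distinct color
tab10 = plt.get_cmap("tab10")
first_three = [colors.rgb2hex(tab10(i)) for i in range(3)]
print(first_three)
```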

Sequential Color Map Example
----------------------------


In [5]:
# use a color map to show a scale of values
# let's create a sequential color map to show
# median household income

# first, figure out the scale for our data
# the lightest blue will be at min_inc and the darkest blue will be at max_inc
min_inc = cities['median_inc'].min()
max_inc = cities['median_inc'].max()

# create a color map from matplotlib
cmap = plt.get_cmap('Blues')
# create a norm function that will scale a value in the
# range of min_inc to max_inc to a real number between 0 and 1
norm = colors.Normalize(vmin=min_inc, vmax=max_inc)

# make a copy so we don't change the original cities data
data = cities.copy()

# write a function that converts the median income to a "hex" color
def get_color(income):
    return colors.rgb2hex(cmap(norm(income)))

data["color"] = data['median_inc'].apply(get_color)



title = f"""
Median Income in the United States by City (2022)
=================================================
**A sequential color map, on a map**. Darker blues indicate higher median income.

The highest median income is ${max_inc:,.0f} and the lowest is ${min_inc:,.0f}.

"""
display(md(title))

# make a tooltip with the city name in bold and median income
def mk_tooltip(row):
    return f"<b>{row['city']}:</b> ${row['median_inc']:,.0f}"

data["tooltip"] = data.apply(mk_tooltip, axis=1)

# now let's plot it on our map
data.explore(m=base_map(), color=data['color'], tooltip="tooltip",
             tooltip_kwds={"labels": False}, style_kwds={"radius": 5, "fillOpacity": .8})



Median Income in the United States by City (2022)
=================================================
**A sequential color map, on a map**. Darker blues indicate higher median income.

The highest median income is $174,506 and the lowest is $19,076.



Divergent Color Map Example
----------------------------

In [6]:
# let's make that divergent map, showing population change

# filter out cities where the population change is not available
# we call the `notna()` method on the pop_change column to get
# only the rows that have a value for pop_change
# some cities will be left off the map
# copy() so we don't change the original cities data
data = cities[cities.pop_change.notna()].copy()

# calculate the change in population as a fraction of the 2022 population
# (a real number; negative values are losses)
data['pop_change_pct'] = data.pop_change / data.total_pop

# create the divergent color map from the min and max
max_loss = data['pop_change_pct'].min()
max_gain = data['pop_change_pct'].max()

# use the "seismic" color map from matplotlib (blue = loss, red = gain)
cmap = plt.get_cmap('seismic')
norm = colors.Normalize(vmin=max_loss, vmax=max_gain)


def get_color(pop_change):
    return colors.rgb2hex(cmap(norm(pop_change)))


data["color"] = data['pop_change_pct'].apply(get_color)


title = f"""
Population Gain and Loss in U.S. Cities 2017-2022
=================================================
**A divergent color map, on a map**.
Darker blue (cool) indicates population loss 
and darker red (warm) indicates population gain.

The highest population loss was {(max_loss * 100):,.0f}% 
and the greatest gain was {(max_gain * 100):,.0f}%.

"""
display(md(title))

# make a tooltip with both the population change percentage and the number of people


def mk_tooltip(row):
    return f"<b>{row.city}:</b> {(row.pop_change_pct * 100):,.0f}%, {row.pop_change:,.0f} people"


data["tooltip"] = data.apply(mk_tooltip, axis=1)


data.explore(m=base_map(), color=data['color'], tooltip="tooltip",
             tooltip_kwds={"labels": False}, style_kwds={"radius": 5, "fillOpacity": .8})


Population Gain and Loss in U.S. Cities 2017-2022
=================================================
**A divergent color map, on a map**.
Darker blue (cool) indicates population loss 
and darker red (warm) indicates population gain.

The highest population loss was -22% 
and the greatest gain was 31%.
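
One thing to be aware of: `colors.Normalize` maps the middle of the data range, not zero, to the middle of the color map. A possible alternative, not used in this lab, is matplotlib's `colors.TwoSlopeNorm`, which pins a chosen center value to the middle of the map. A minimal sketch, using the rounded loss and gain figures above as stand-ins for the real values:

```python
from matplotlib import colors

# rounded values from the output above: a 22% loss and a 31% gain
max_loss, max_gain = -0.22, 0.31

# plain scaling: zero change does not land on the middle of the color map
plain = colors.Normalize(vmin=max_loss, vmax=max_gain)
# two-slope scaling: zero change is pinned to the middle (0.5)
centered = colors.TwoSlopeNorm(vmin=max_loss, vcenter=0, vmax=max_gain)

print(float(plain(0.0)))
print(float(centered(0.0)))
```

With the centered version, a city with no population change would get the neutral middle color, and losses and gains are each scaled on their own side of it.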



In [7]:
# make a new map, but focus on Las Vegas
vegas = data[data.city == "Las Vegas"]
vegas_center = [vegas.geometry.y.mean(), vegas.geometry.x.mean()]

# change the center and zoom using arguments
data.explore(m=base_map(center=vegas_center, zoom=10), color=data['color'], tooltip="tooltip", 
             labels=False, tooltip_kwds={ "labels": False}, style_kwds={"radius": 5, "fillOpacity": .8})

Categorical Color Map Example
-----------------------------

In [8]:
# let's make a categorical map showing which racial/ethnic group
# has a plurality in each city

data = cities.copy()

# we have these categories in our columns
# this dict maps the column name to a numerical category
ethnic_cats = {
    'asian': 0,
    'black': 1,
    'latino': 2,
    'white': 3,
    'indian': 4,
    'mixed': 5,
    'other': 6,
    'pacific': 7
}

# because we have categories, we're not going to create a scale
# we need to:
# 1. find out which category has the highest value for each city
# 2. look up a numerical value for that category
# 3. assign a color to that category

# do it in 2 steps (creating 2 new cols) for clarity

def get_plurality(row):
    # use the built-in idxmax() method to find the column name of the max value
    max_cat = row[["asian", "black", "latino", "white"]].idxmax()
    return max_cat

data["plurality"] = data.apply(get_plurality, axis=1)

# tab10 has 10 distinct colors
cmap = plt.get_cmap('tab10')

def get_color(plurality):
    category_number = ethnic_cats[plurality]
    return colors.rgb2hex(cmap(category_number))

data["color"] = data.plurality.apply(get_color)


unique_pluralities = data.plurality.unique()
print(f"""There are {len(unique_pluralities)} unique plurality groups in the data.
They are: {unique_pluralities}
""")

# use sample() to give us 20 random cities
data[["city", "state", "plurality", "color"]].sample(20)

There are 4 unique plurality groups in the data.
They are: ['black' 'white' 'latino' 'asian']



Unnamed: 0,city,state,plurality,color
288,Grand Prairie,TX,latino,#2ca02c
304,Richardson,TX,white,#d62728
215,Sparks,NV,white,#d62728
267,Knoxville,TN,white,#d62728
83,Simi Valley,CA,white,#d62728
161,Indianapolis city (balance),IN,white,#d62728
95,Aurora,CO,white,#d62728
279,College Station,TX,white,#d62728
36,Escondido,CA,latino,#2ca02c
104,Pueblo,CO,latino,#2ca02c


In [9]:
# let's do a trick to make a quick legend from a dataframe

# use the same cmap to get the colors
# give the categories better
# human readable names
# we only need the 4 items from our data set
legend_data = {
    'Asian/Pacific Islander': colors.rgb2hex(cmap(0)),
    'Black/African American': colors.rgb2hex(cmap(1)),
    'Latinx/Chicanx/Hispanic': colors.rgb2hex(cmap(2)),
    f'White{"&nbsp;"*20}': colors.rgb2hex(cmap(3))
}

# create a dataframe from the legend
legend_df = pd.DataFrame([legend_data])

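# style the legend: give each cell a background of its own color
# (axis=0 applies the function to each column)
legend = legend_df.style.apply(lambda col: [f'background-color: {color}' for color in col], axis=0)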
legend

Unnamed: 0,Asian/Pacific Islander,Black/African American,Latinx/Chicanx/Hispanic,White
0,#1f77b4,#ff7f0e,#2ca02c,#d62728


In [10]:

title = f"""
Racial/ethnic pluralities in U.S. Cities (2022)
=================================================
**A categorical color map, on a map**.

"""
display(md(title))
display(legend)

data.explore(m=base_map(), color=data['color'], tooltip=None, style_kwds={"radius": 5, "fillOpacity": .8})


Racial/ethnic pluralities in U.S. Cities (2022)
=================================================
**A categorical color map, on a map**.



Unnamed: 0,Asian/Pacific Islander,Black/African American,Latinx/Chicanx/Hispanic,White
0,#1f77b4,#ff7f0e,#2ca02c,#d62728
