<h1 style='text-align: center'>
<div style='color: #DD3403; font-size: 60%'>Data Science DISCOVERY MicroProject</div>
<span style=''>Creating Choropleth Maps from DataFrames with folium</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/choropleth-map-dataframe/">https://discovery.cs.illinois.edu/microproject/choropleth-map-dataframe/</a></div>
</h1>

<hr style='color: #DD3403;'>

## Data Visualization: Choropleth Maps

Geographical data visualizations are some of the most impactful forms of visualization because they let viewers easily locate places that are familiar to them.  One popular geographical visualization is a **[choropleth map](https://en.wikipedia.org/wiki/Choropleth_map)** -- a visualization of data on a map where geographical regions are shaded to visually encode data about the region as a whole.  For example, population density maps and per-capita income maps are common **choropleth maps**.

In this MicroProject, you will learn to use the `folium` Python library -- [https://python-visualization.github.io/folium/](https://python-visualization.github.io/folium/) -- to create choropleth maps from a DataFrame!  Let's nerd out! :)

<hr style='color: #DD3403;'>

## Part 1: Exploring the `folium` Python library

Most widely used Python libraries come with extensive examples, and it is often easiest to get started by reading example code written by the library's authors.

The `folium` project provides a "quickstart" guide that includes a section on choropleth maps: https://python-visualization.github.io/folium/quickstart.html#Choropleth-maps

Looking at the code, which we provide below, we see that it has four distinct sections:

1. **Data Import**: The first few lines (1) reference the geographical data about US states in `us-states.json` and (2) import data about United States unemployment into a DataFrame using `pd.read_csv`.
2. **Map Creation**: The next line of code creates a blank map, and sets the initial latitude/longitude and zoom level to provide a view of the entire United States.
3. **Data Visualization**: The next several lines of code are one giant call to `folium.Choropleth`, which configures the data visualization on the map.
4. **Rendering**: The final two lines are used to display the map inside of your notebook.

Try it out, and see your first choropleth map! 🗺️

In [None]:
## "Choropleth maps" from folium's QuickStart Guide:
## - https://python-visualization.github.io/folium/quickstart.html#Choropleth-maps

import pandas as pd
import folium

# Section 1: Data Import
url = (
    "https://raw.githubusercontent.com/python-visualization/folium/main/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)

# Section 2: Map Creation
m = folium.Map(location=[48, -102], zoom_start=3)

# Section 3: Data Visualization
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)

# Section 4: Rendering
folium.LayerControl().add_to(m)
m

<hr style='color: #DD3403;'>



## Dataset: University of Illinois Demographics by State

The [Division of Management Information (DMI)](https://www.dmi.illinois.edu/) at The University of Illinois is a service unit that provides current and historical student enrollment statistics.  One of the many datasets that DMI provides is the "Final Statistical Abstract", which gives "a summary of student information on the tenth day of the term".

> Only students taking at least one on-campus, credit-bearing class are included in these reports. The following categories of students are excluded: auditors (students taking only non-credit classes); students taking only extramural or off-campus classes; Medical Scholars taking no on-campus, non-MSP classes. (Note: Illini Center MBA students are included in these statistics.)

The exact data is provided as a large, visually formatted spreadsheet that can be viewed here: https://www.dmi.illinois.edu/stuenr/abstracts/SP23_ten.htm

To help focus on building the choropleth maps, we have extracted the data shown in the teal subtable titled "By Permanent Home Address" and provided it for you as `uiuc-dmi-students-by-permanent-home-address.csv`.

## Part 2: Importing Libraries and Loading Datasets

To complete this MicroProject, you will use pandas and folium.

- As always, you should `import pandas as pd` as the convention is to use `pd` as a shorthand for `pandas`
- Unlike `pandas`, there is no shorthand for `folium`.  Since there's no shorthand, all you need is: `import folium`

Import the two libraries:

In [None]:
...

### Load the Provided Dataset of Students Attending the University of Illinois by Permanent Home Address

Use pandas to load the `uiuc-dmi-students-by-permanent-home-address.csv` dataset into a DataFrame called `df`:

In [None]:
df = ...
df

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert('df' in vars())
assert('State' in df)
assert(len(df) == 57)
assert(df.Total.sum() == 53271)
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 3: Making Our Own Choropleth Map

One of the best ways to begin to use a new library is to modify existing code to create your own visualization!

In the code below (which is identical to the code in "Part 1" from `folium`'s QuickStart guide), modify five things:

### Initial Changes (Four Different Changes)

First, in **Section 1** in the code below, remove the `state_unemployment` and `state_data` variables. Instead of looking at unemployment data, we will be using the DataFrame of home states of University of Illinois students.  *(You already loaded this into a DataFrame in the previous part of this MicroProject, so no need to replace those two lines with anything.)*

- Make sure to keep the lines with `state_geo` -- these provide data about the locations of the states in the United States.

In **Section 3**, make three changes to use the University of Illinois data instead of the QuickStart's data:

1. **`data`** attribute -- replace `data=state_data` with a new value for `data`, using the name of the DataFrame you created in Part 2 instead of `state_data`
2. **`columns`** attribute -- replace `columns=["State", "Unemployment"]` with a new value for `columns`, which should be the column names from the Illinois dataset that contain the state and the total number of students from that state who attend the University of Illinois.
3. **`legend_name`** attribute -- replace `legend_name="Unemployment Rate (%)"` with an accurate description of the graph you are creating.

### Final Change: `key_on`

The last item is a bit complex.

- In the example of unemployment data, the data identifies each state **by the two letter code** (ex: "IL" for "Illinois")
- In the dataset of students attending the University of Illinois by Permanent Home Address, the data identifies each state **by the whole state name** (ex: "Illinois")

In **Section 3**, the `key_on` field maps the data (ex: unemployment data, or students attending Illinois) to the `geo_data` (geographical location of the states in the United States).  You can view the raw geo_data by visiting the URL that is in the Python code, or by clicking here: [https://raw.githubusercontent.com/python-visualization/folium/main/examples/data/us-states.json](https://raw.githubusercontent.com/python-visualization/folium/main/examples/data/us-states.json)

The `geo_data` provides the **full state name** in the data field located at `"feature.properties.name"` (the letter code is at `"feature.id"`).  Since we have the full state name, the final change you need to make is to modify the **`key_on`** attribute to be equal to `"feature.properties.name"`.
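
To make the two `key_on` paths concrete, here is a minimal sketch using a hand-written stand-in for a single entry of `us-states.json` (the geometry is left empty on purpose). The `resolve` helper is hypothetical and written only for illustration; it is not how `folium` is implemented, it simply follows the dotted path one key at a time.

```python
# A hand-written stand-in for one feature in us-states.json.
# The real file holds the full state outline in "coordinates".
feature = {
    "type": "Feature",
    "id": "IL",
    "properties": {"name": "Illinois"},
    "geometry": {"type": "Polygon", "coordinates": []},
}


def resolve(feature, key_on):
    """Hypothetical helper: follow a dotted path such as "feature.properties.name"."""
    value = feature
    for part in key_on.split(".")[1:]:  # skip the leading "feature"
        value = value[part]
    return value


print(resolve(feature, "feature.id"))               # the two-letter code
print(resolve(feature, "feature.properties.name"))  # the full state name
```

Whichever path you choose must produce values that match the first column you pass in `columns`; otherwise the rows and the shapes cannot be paired up.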

### Let's Do It!

Make all five changes below:

In [None]:
# "Choropleth maps" from folium's QuickStart Guide:
# - https://python-visualization.github.io/folium/quickstart.html#Choropleth-maps

import pandas as pd
import folium

# Section 1: Data Import
url = (
    "https://raw.githubusercontent.com/python-visualization/folium/main/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)

# Section 2: Map Creation
m = folium.Map(location=[48, -102], zoom_start=3)

# Section 3: Data Visualization
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)

# Section 4: Rendering
folium.LayerControl().add_to(m)
m

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert( "m" in vars() ), "Ensure your map variable remains `m`."
html = m._repr_html_()

assert( "choropleth" in html ), "Ensure you have a choropleth map."
assert( "28671" in html ), "Ensure you are using Total for your data, using the University of Illinois data."
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 4: Something Doesn't Look Right...

The visualization suggests that **everyone** at Illinois comes from Illinois and nowhere else.  To be certain that there is not an error in the data, let's do a bit of analysis:

1. Find the total number of students from **just** Illinois who attend The University of Illinois and save it in a variable `illinois_total`, and
2. Find the **single US state with the largest number of students** that is not Illinois.  Save the number of students in the variable `nonillinois_max`.

In [None]:
illinois_total = ...
illinois_total

In [None]:
nonillinois_max = ...
nonillinois_max

### Analysis

From your results, you should find that the largest non-Illinois state has fewer than 10% as many students as Illinois.  Since that value is **so small** in comparison, it falls within the lowest color band of the choropleth map's scale, so every state other than Illinois ends up with the same shade.
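
As a rough illustration of why this happens, the sketch below splits a range of values into equal-width color bands. It assumes six bands, which is an assumption about the color scale rather than something stated above. Only `28671` (the Illinois total) comes from the dataset; the three smaller values are made up for the example.

```python
import numpy as np

# 28671 is the Illinois total; the other values are made-up stand-ins
# for much smaller states (assumptions for illustration only).
totals = np.array([28671, 2000, 500, 10])

# Assumption: the color scale uses 6 equal-width bands, so 7 edges.
edges = np.linspace(totals.min(), totals.max(), 7)

# Band index for each value: 0 is the lightest band, 5 is the darkest.
bands = np.digitize(totals, edges[1:-1])

print(edges.round(1))
print(bands)
```

Each band is several thousand students wide, so every value except the largest lands in band `0`.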

There are several ways to fix this, and we'll explore one in the next section! :)

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

import math

assert( "df" in vars() )
assert( "illinois_total" in vars() )
assert( "nonillinois_max" in vars() )

X = df.nlargest(4, "Total").reset_index().iloc

assert( math.isclose(illinois_total, X[0]["Total"]) ), \
    "Your value for `illinois_total` is incorrect."

assert( not math.isclose(nonillinois_max, 12749) ), \
    "There are 12,749 students from \"Other Countries\", this is not a US state."

assert( not math.isclose(nonillinois_max, 2514) ), \
    "There are 2,514 students from \"Unknown\", this is not a US state."

assert( math.isclose(nonillinois_max, X[3]["Total"]  ) ), \
    "Your value for `nonillinois_max` is incorrect."

print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 5: Scaling Data

Since the values in the dataset are very spread out, with Illinois being much higher than all the other values, we have to transform the scale of the data so that the values aren't so far apart.

One way of accomplishing this is with a "base10 logarithm" or $log_{10}$:

- The logarithm function is the inverse of the exponent function.  *(A base10 logarithm is the inverse of $10^{x}$.)*
- The base10 logarithm is easy to understand since it **counts the number of digits, minus one, in a number**:
    - 1 has one digit, and the number of digits minus one: $log_{10}(1) = 0$
    - 10 has two digits, and $log_{10}(10) = 1$
    - 100 has three digits, and $log_{10}(100) = 2$
    - 1000 has four digits, and $log_{10}(1000) = 3$
    - 10,000 has five digits, and $log_{10}(10000) = 4$
    - ...notice it's always the number of digits minus one.


For numbers that fall between powers of 10, the base10 logarithm gives a fractional value:
- 33 is between 10 and 100, so we expect the base10 logarithm to be between 1 and 2.  It's `1.52`.
- 72 is also between 10 and 100, so we expect the base10 logarithm to be between 1 and 2, and closer to 2.  It's `1.86`.
- The number of students at Illinois from Illinois, 28671, has five digits so its base10 logarithm should be between `4` and `5`.  It's `4.457`.


The base10 logarithm is a great tool to visualize ranges of data where the data spans many orders of magnitude, since it will calculate the approximate number of digits in each value and, thereby, shrink the range between values.
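
The values quoted above can be checked with a short sketch that uses only Python's built-in `math` module. The variable names are chosen just for this example.

```python
import math

# Numbers taken from the examples above.
values = [1, 10, 33, 72, 100, 1000, 10000, 28671]

# Base10 logarithm of each number.
logs = {n: math.log10(n) for n in values}

for n, log_value in logs.items():
    digits = len(str(n))
    # Rounding the logarithm down gives the number of digits minus one.
    assert math.floor(log_value) == digits - 1
    print(f"{n:>6}  digits={digits}  log10={log_value:.3f}")
```

Notice that the original values range from 1 to 28671, while their logarithms all sit between 0 and 5.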


### Using the `np.log10` function

The `numpy` library, commonly imported by data scientists as `np`, provides a function that will transform a column of data into the `log10` values.  Create a new column, `Total_log10`, that uses the `np.log10` function on the `Total` column:

In [None]:
import numpy as np

df["Total_log10"] = ...
df

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"
assert('Total_log10' in df)
assert(np.isclose(df['Total_log10'].sum(), 107.4162548530576))
assert(np.isclose(df['Total_log10'].mean(), 1.884495699176449))
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Part 6: Creating a log10-scaled Choropleth Map


Let's try remaking our choropleth map from before using our new column `Total_log10`!

- Copy and paste your code from "Part 3" above.
- Make sure to change the `columns` attribute from using `Total` to `Total_log10`

In [None]:
...

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert( "m" in vars() ), "Ensure your map variable remains `m`."
html = m._repr_html_()

assert( "choropleth" in html ), "Ensure you have a choropleth map."
assert( "4.457442" in html ), "Ensure you are using Total_log10 for your data."
print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is commit your MicroProject to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the instructions to commit and grade this MicroProject!