In [None]:
# Mounting Google Drive 
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Change the current working directory to the mounted Google Drive
%cd /content/drive/MyDrive/Colab Notebooks/KHU_Urban_Geography/Lab2

In [None]:
# This command should print "/content/drive/MyDrive/Colab Notebooks/KHU_Urban_Geography/Lab2"
!pwd

In [None]:
# Installing a package for choropleth mapping (Not available in Colab)
!pip install -U mapclassify

# Urban Hierarchies of the United States using population and Gross Domestic Product (GDP) data

This Jupyter notebook analyzes the urban hierarchy of the United States using population and Gross Domestic Product (GDP) data. For the analysis, this notebook uses various Python packages: Pandas (https://pandas.pydata.org/), GeoPandas (https://geopandas.org/en/stable/#), Matplotlib (https://matplotlib.org/) and SciPy (https://scipy.org/). 

### Data: 
- County Geometry: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html 
- County-level Population (American Community Survey): https://www.census.gov/programs-surveys/acs/data.html 
- County-level GDP (Bureau of Economic Analysis): https://www.bea.gov/data/gdp/gdp-county-metro-and-other-areas

### Steps: 
1. Read the shapefile using GeoPandas (County Geometry) <br>
2. Selecting rows (records) based on a condition <br>
3. Load Excel File with Pandas (GDP data) <br>
4. Join (Merge) county geometry and GDP data <br>
5. Make a choropleth map of GDP data <br>
6. Correlation Analysis between Population and GDP <br>

# Import Packages
A Python package is a way of organizing related Python modules into a single directory hierarchy. It provides a mechanism for grouping Python code files, resources, and configuration settings in a structured manner, making it easier to manage and distribute code. Packages also facilitate code reuse by allowing developers to bundle related functionality together and share it with others.

We will be using the following packages in this notebook: <br>
`pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. <br>
source: https://pandas.pydata.org/docs/getting_started/overview.html

`geopandas` extends `pandas` with a geometry column, making it possible to store and work with vector spatial data. <br>
source: https://geopandas.org/en/stable/getting_started/introduction.html

`matplotlib` provides a collection of functions that make plots and maps. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. <br>
source: https://matplotlib.org/stable/users/getting_started/

In [None]:
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load County Shapefile
county_gdf = gpd.read_file('./data/county_cleaned.shp')

# Load GDP Data
gdp_df = pd.read_excel('./data/GDP_data_cleaned.xlsx', dtype={'GEOID': str})

# Merge (Join) GDP Data with County Shapefile
gdp_gdf = county_gdf.merge(gdp_df, on='GEOID', how='left')

# Visualize GDP Data
# Create an empty figure and axis (canvas)
fig, ax = plt.subplots(1, 1, figsize=(10,10))  

gdp_gdf.plot(column='GDP',            # Column to visualize
             cmap='Blues',            # Color map
             scheme='NaturalBreaks',  # Classification scheme
             legend=True,             # Show legend
             legend_kwds={'loc': 'lower left', 'fontsize': 8.1}, # Legend settings
             k=7,                     # Number of classes
             ax=ax                    # Axis to plot on
             )

# Load state shapefile (Plotting purposes)
state_gdf = gpd.read_file('./data/states.shp') 

# Add state boundaries
state_gdf.boundary.plot(ax=ax, color='grey', linewidth=0.5) 

# Set title
ax.set_title('County-level Gross Domestic Product (GDP)', fontsize=16) 

# Remove axis
ax.axis('off') 

plt.show() # Show the plot


# 1. Read the shapefile using GeoPandas (County Geometry)

In [None]:
# .read_file() method is used to read various spatial data formats (shapefile, GeoJSON, etc.)
county_gdf = gpd.read_file('./data/county_cleaned.shp')
county_gdf

In [None]:
# Checking the data types of the columns (object = string, int64 = integer, float64 = float)
county_gdf.dtypes

In [None]:
# When you import spatial data, the resulting object is a GeoDataFrame
type(county_gdf)

In [None]:
# Checking the columns of the GeoDataFrame
county_gdf.columns

In [None]:
# Checking the contents of a column (e.g., GEOID) in the GeoDataFrame
county_gdf['GEOID']

In [None]:
# You can use the .plot() method to plot the GeoDataFrame.
# A GeoDataFrame draws the shapes in its 'geometry' column; a plain DataFrame (no 'geometry' column) plots its numerical values instead.
county_gdf.plot()

The code cell below shows how to check the Coordinate Reference System (CRS) of a GeoDataFrame. The CRS is typically identified by an EPSG code (https://epsg.io/).

In [None]:
# Checking the Coordinate Reference System (CRS) of the GeoDataFrame
county_gdf.crs
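
As a minimal sketch of what `.crs` reports, the snippet below builds a tiny GeoDataFrame from two made-up points (the names and coordinates are illustrative and not part of the lab data), reads its EPSG code, and reprojects it. EPSG:5070, an Albers equal-area projection for the conterminous United States, is used here only as an example target.

```python
import geopandas as gpd

# Two illustrative points in longitude/latitude (not from the lab data)
toy_gdf = gpd.GeoDataFrame(
    {'name': ['New York', 'Los Angeles']},
    geometry=gpd.points_from_xy([-74.006, -118.244], [40.713, 34.052]),
    crs='EPSG:4326',  # WGS 84 geographic coordinates
)

# .crs returns a CRS object; .to_epsg() extracts the numeric EPSG code
print(toy_gdf.crs.to_epsg())

# .to_crs() returns a reprojected copy; the original GeoDataFrame is unchanged
toy_albers = toy_gdf.to_crs(epsg=5070)
print(toy_albers.crs.to_epsg())
```

Layers that are drawn on the same map should share one CRS, so checking `.crs` before plotting is a useful habit.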

# 2. Selecting rows (records) based on a condition

GeoPandas inherits the `loc` indexer from Pandas, which selects rows (and, optionally, columns) based on a condition. 
The syntax is as shown below. 

```python
gdf.loc[row_condition, column_condition]
```

For example, the code below demonstrates how to select rows based on a condition, selecting only the counties in the State of Alabama. 

```python
county_gdf.loc[county_gdf['StateName'] == 'Alabama']
```

If you leave the column condition empty, it will select all columns. 
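
The row and column conditions can be tried out on a small table first. The sketch below uses a made-up DataFrame (`toy_df` and its values are hypothetical stand-ins for `county_gdf`) to show a row condition alone, a row condition combined with a list of columns, and `.isin()` for matching several values at once.

```python
import pandas as pd

# Toy table standing in for county_gdf (values are made up)
toy_df = pd.DataFrame({
    'GEOID': ['01001', '01003', '36001', '06001'],
    'StateName': ['Alabama', 'Alabama', 'New York', 'California'],
    'Value': [10, 20, 30, 40],
})

# Row condition only: all columns are returned
alabama_rows = toy_df.loc[toy_df['StateName'] == 'Alabama']

# Row condition plus a list of column names
alabama_ids = toy_df.loc[toy_df['StateName'] == 'Alabama', ['GEOID', 'Value']]

# .isin() keeps rows whose value appears in the given list
two_states = toy_df.loc[toy_df['StateName'].isin(['Alabama', 'New York'])]

print(alabama_rows)
print(alabama_ids)
print(two_states)
```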

In [None]:
# Comparing a Series (i.e., a column) to a single value tests every row at once.
# (To test against a list of values, use .isin() instead.)
# The result is a boolean Series.
county_gdf['StateName'] == 'Alabama'

In [None]:
# The .loc indexer accesses a group of rows and columns by label(s) or a boolean array.
county_gdf.loc[county_gdf['StateName'] == 'Alabama']

In [None]:
# You can assign the result to a new variable
alabama_gdf = county_gdf.loc[county_gdf['StateName'] == 'Alabama']
alabama_gdf

In [None]:
# Again, you can plot the GeoDataFrame using .plot() method
alabama_gdf.plot()

---
### *Exercise*
1. (6 points) The following is the syntax for the `loc` indexer in Pandas/GeoPandas. Select the rows for New York State and assign them to a new variable called `ny_gdf`. <br><br>
Note: Replace 'COLUMN NAME' with the actual column name and 'VALUE' with the actual value indicating New York. 
<br><br>
StateCode: 36, StateName: New York <br>

    ```python
    ny_gdf = county_gdf.loc[county_gdf['COLUMN NAME'] == 'VALUE']
    ```

---

In [None]:
# Your code here
ny_gdf = county_gdf.loc[county_gdf['COLUMN NAME'] == 'VALUE']
ny_gdf

In [None]:
""" Test code for the previous function. 
This cell should NOT give any errors when it is run."""

assert ny_gdf['StateCode'].unique() == '36'
assert ny_gdf['StateName'].unique() == 'New York'
assert ny_gdf.shape[0] == 62

print("Success!")

# 3. Load Excel File with Pandas (GDP data)

In [None]:
# To read an Excel file, use the .read_excel() function from the `pandas` package.
# You can also use .read_csv() to read a CSV file.
# The output of both .read_excel() and .read_csv() is a DataFrame.
gdp_df = pd.read_excel('data/GDP_data_cleaned.xlsx', dtype={'GEOID': str})
gdp_df

In [None]:
# Checking the type of the object
type(gdp_df)

In [None]:
# Checking the data types of the columns (object = string, int64 = integer, float64 = float)
gdp_df.dtypes
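
The `dtype={'GEOID': str}` argument in the `read_excel` call above matters because county codes can start with a zero. The sketch below illustrates this with a small in-memory CSV (the values are made up, and `read_csv` is used instead of `read_excel` only so the example runs without a file; the `dtype` argument works the same way in both).

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a data file (values are made up)
csv_text = "GEOID,GDP\n01001,100\n36061,900\n"

# Without dtype, pandas infers integers and the leading zero is lost
as_number = pd.read_csv(io.StringIO(csv_text))

# With dtype={'GEOID': str}, the five-character code is preserved
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'GEOID': str})

print(as_number['GEOID'].tolist())  # [1001, 36061]
print(as_text['GEOID'].tolist())    # ['01001', '36061']
```

A code read as the integer `1001` would no longer match the string `'01001'` in another table, so a later join on `GEOID` would silently miss those rows.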

---
### *Exercise*
2. (6 points) Investigate the data folder to find out the name of an Excel file that contains population data for the United States. <br> 
Then, load the Excel file using the `read_excel` function in Pandas and assign it to a new variable called `pop_df`. <br>

    ```python
    pop_df = pd.read_excel('./data/file_name.xlsx', dtype={'GEOID': str})
    ```

---

In [None]:
# Your code here

pop_df = pd.read_excel('./data/file_name.xlsx', dtype={'GEOID': str})
pop_df

In [None]:
# Your code here

pop_df = pd.read_excel('./data/population_data_cleaned.xlsx', dtype={'GEOID': str})
pop_df

In [None]:
""" Test code for the previous function. 
This cell should NOT give any errors when it is run."""

assert 'Pop' in pop_df.columns
assert pop_df.shape == (3221, 3)
assert pop_df['GEOID'].dtype == 'object'

print("Success!")

# 4. Join (Merge) county geometry and GDP data

Currently, `county_gdf` has geometry data and `gdp_df` has GDP data. We need to join (merge) these two datasets to make a choropleth map.

### Join
A join is the process of linking two sets of data based on a common attribute or field.

![](https://desktop.arcgis.com/en/arcmap/latest/tools/data-management-toolbox/GUID-C441B51F-B581-4743-A975-3EB04087838C-web.gif)

<br>
Merge (join) method syntax is as shown below. 
    
```python
join_gdf = df1.merge(df2, on='COLUMN NAME')
```

resource: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
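
The optional `how` argument of `.merge()` controls which rows survive the join. A minimal sketch with two made-up tables (`left_df` and `right_df` are hypothetical names) compares a left join with the default inner join, and uses `indicator=True` to flag rows that found no match.

```python
import pandas as pd

# Made-up tables: the key '01003' exists only in the left table
left_df = pd.DataFrame({'GEOID': ['01001', '01003', '36001'],
                        'Name': ['A', 'B', 'C']})
right_df = pd.DataFrame({'GEOID': ['01001', '36001'],
                         'GDP': [100.0, 300.0]})

# how='left' keeps every row of left_df; unmatched rows get NaN
left_join = left_df.merge(right_df, on='GEOID', how='left')

# how='inner' (the default) keeps only keys found in both tables
inner_join = left_df.merge(right_df, on='GEOID', how='inner')

# indicator=True adds a '_merge' column telling where each row came from
checked = left_df.merge(right_df, on='GEOID', how='left', indicator=True)

print(left_join)
print(inner_join)
print(checked['_merge'].tolist())
```

For mapping, a left join from the geometry table is usually the safer choice because no county shape is dropped, at the cost of `NaN` values that need to be handled afterwards.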

In [None]:
# Checking `county_gdf`
county_gdf

In [None]:
# Checking `gdp_df`
gdp_df

Both `county_gdf` and `gdp_df` have a `GEOID` column, so it can be used as the join key. 

In [None]:
# It is also recommended to check that the column(s) you want to merge on have the same data type in both tables.
gdp_df.dtypes

In [None]:
county_gdf.dtypes


In [None]:
gdp_gdf = county_gdf.merge(gdp_df, on='GEOID', how='left')
gdp_gdf

In [None]:
# Since we did the left-join, there are some missing values in the GDP column.
# The following code is to select the rows that have missing values (NULL/NaN value) in the GDP column.
gdp_gdf.loc[gdp_gdf['GDP'].isna()]

In [37]:
# We can simply replace the NaN values with 0, using .fillna() method.
gdp_gdf['GDP'] = gdp_gdf['GDP'].fillna(0)

In [None]:
# NaN values are gone!
gdp_gdf.loc[gdp_gdf['GDP'].isna()]

In [None]:
gdp_gdf

---
### *Exercise*
3. (6 points) Join `county_gdf` and `pop_df` using the `.merge()` method. Merge `pop_df` into `county_gdf` based on the `GEOID` column, and assign the result to a new GeoDataFrame named `pop_gdf`. <br><br>
resource: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html


Merge method syntax is as shown below. 

```python
    pop_gdf = df1.merge(df2, on='COLUMN NAME', how='left')
```

Expected results are as follows: <br>
![](https://github.com/jparkgeo/KHU_Urban_Geography/blob/main/Lab2/images/q3.jpg?raw=true)

---



In [None]:
# Your code here

pop_gdf = df1.merge(df2, on='COLUMN NAME', how='left')
pop_gdf

In [None]:
pop_gdf = county_gdf.merge(pop_df, on='GEOID', how='left')
pop_gdf

In [None]:
""" Test code for the previous function. 
This cell should NOT give any errors when it is run."""

assert 'GEOID' in pop_gdf.columns.to_list()
assert 'geometry' in pop_gdf.columns.to_list()
assert 'Pop' in pop_gdf.columns.to_list()
assert pop_gdf.shape == (3108, 7)

print("Success!")

# 5. Make a choropleth map of GDP data 

A GeoDataFrame has a built-in method called `plot` that can make a choropleth map. <br>

Syntax: `GeoDataFrame.plot(column='COLUMN NAME', cmap='COLOR MAP NAME', legend=True, figsize=(WIDTH, HEIGHT))`

In [None]:
gdp_gdf.plot(column='GDP', figsize=(10,5), legend=True)

The `cmap` parameter is used to change the color map. <br>
various color maps: https://matplotlib.org/stable/users/explain/colors/colormaps.html

In [None]:
gdp_gdf.plot(column='GDP', cmap='Blues', figsize=(10,5), legend=True)

The `scheme` parameter is used to change the classification method. <br>
various classification methods: https://pysal.org/mapclassify/api.html
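
To see what a classification scheme does to the data, here is a minimal sketch that uses `pd.qcut` from Pandas as a stand-in for a quantile scheme. It is not the `NaturalBreaks` algorithm from mapclassify, and the values are made up; the point is only that every observation is assigned to one of `k` classes, and each class is drawn in one color.

```python
import pandas as pd

# Made-up, strongly skewed values (one very large observation)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

k = 4  # number of classes

# Quantile classification: each class holds the same number of observations
classes = pd.qcut(values, q=k, labels=False)

print(classes.tolist())                              # class index per value
print(classes.value_counts().sort_index().tolist())  # observations per class
```

Different schemes place the class boundaries differently, so the same column can produce noticeably different maps depending on `scheme` and `k`.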

In [None]:
gdp_gdf.plot(column='GDP', cmap='Blues', scheme='NaturalBreaks', figsize=(10,5), legend=True)

In [None]:
gdp_gdf.plot(column='GDP', cmap='Blues', scheme='NaturalBreaks', figsize=(10,5), legend=True, k=7)

The current map is missing state boundaries, which makes it difficult to interpret. <br>
In Python, it is possible to overlay multiple layers on one map. To keep the lab simple, the following code shows a single example: overlaying the state boundaries on the choropleth map of GDP. <br>


In [None]:
# Get another layer of the state boundaries
state_gdf = gpd.read_file('./data/states.shp')

fig, ax = plt.subplots(figsize=(10,10))

# figsize is set on plt.subplots() above; it is ignored by .plot() when ax is given
gdp_gdf.plot(column='GDP', cmap='Blues', scheme='NaturalBreaks', legend=True, k=7, ax=ax, legend_kwds={'loc': 'lower left'})
state_gdf.boundary.plot(ax=ax, color='black', linewidth=0.5, alpha=0.5)
plt.show()

---
### *Exercise*
4. (7 points) Create a choropleth map of population for the conterminous United States. <br>
    - Use the code below as a template and fill in the proper value for each parameter <br>
    - `column`: column with the population information <br>
    - `cmap` : Green color map (resource: https://matplotlib.org/stable/users/explain/colors/colormaps.html) <br>
    - `scheme`: Natural Breaks classification method <br>
    - `legend`: True (to show the legend) <br>
    - `k`: 7 (number of classes) <br>

    ```python
    fig, ax = plt.subplots(figsize=(10,5)) # Define the canvas for the map

    # Plot the population data
    pop_gdf.plot(column=`COLUMN NAME`, cmap=`COLOR MAP NAME`, scheme=`CLASSIFICATION METHOD`, legend=True, k=`NUMBER OF CLASSES`, ax=ax)

    # Plot the state boundary
    state_gdf.boundary.plot(ax=ax, color='black', linewidth=0.5, alpha=0.5)

    # Show the map
    plt.show()
    ```

Expected results are as follows: <br>
![](https://github.com/jparkgeo/KHU_Urban_Geography/blob/main/Lab2/images/q4.jpg?raw=true)

---


In [None]:
# Your code here
fig, ax = plt.subplots(figsize=(10,5)) # Define the canvas for the map

# Plot the population data
pop_gdf.plot(column=`COLUMN NAME`, cmap=`COLOR MAP NAME`, scheme=`CLASSIFICATION METHOD`, legend=True, k=`NUMBER OF CLASSES`, ax=ax)

# Plot the state boundary
state_gdf.boundary.plot(ax=ax, color='black', linewidth=0.5, alpha=0.5)

# Show the map
plt.show()

In [None]:
# Your code here
fig, ax = plt.subplots(figsize=(10,10)) # Define the canvas for the map

# Plot the population data
pop_gdf.plot(column='Pop', cmap='Greens', scheme='NaturalBreaks', legend=True, k=7, ax=ax)

# Plot the state boundary
state_gdf.boundary.plot(ax=ax, color='black', linewidth=0.5, alpha=0.5)

# Show the map
plt.show()

# Summary

The following is the backbone of the code for the analysis. <br>

In [None]:
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt


# Load County Shapefile
county_gdf = gpd.read_file('./data/county_cleaned.shp')

# Load GDP Data
gdp_df = pd.read_excel('./data/GDP_data_cleaned.xlsx', dtype={'GEOID': str})

# Merge (Join) GDP Data with County Shapefile
gdp_gdf = county_gdf.merge(gdp_df, on='GEOID', how='left')

# Visualize GDP Data
# Create an empty figure and axis (canvas)
fig, ax = plt.subplots(1, 1, figsize=(10,10))  

gdp_gdf.plot(column='GDP',            # Column to visualize
             cmap='Blues',            # Color map
             scheme='NaturalBreaks',  # Classification scheme
             legend=True,             # Show legend
             legend_kwds={'loc': 'lower left', 'fontsize': 8.1}, # Legend settings
             k=7,                     # Number of classes
             ax=ax                    # Axis to plot on
             )

# Load state shapefile (Plotting purposes)
state_gdf = gpd.read_file('./data/states.shp') 

# Add state boundaries
state_gdf.boundary.plot(ax=ax, color='grey', linewidth=0.5) 

# Set title
ax.set_title('County-level Gross Domestic Product (GDP)', fontsize=16) 

# Remove axis
ax.axis('off') 

plt.show() # Show the plot


# 6. Correlation Analysis between Population and GDP

Pearson's r is a correlation coefficient that measures the strength and direction of the linear relationship between two continuous variables. <br><br> It ranges from -1 to +1. A correlation of -1 indicates a perfect negative correlation, +1 indicates a perfect positive correlation, and 0 indicates no linear correlation.

![](https://www.biologyforlife.com/uploads/2/2/3/9/22392738/correlation_1.jpg?688)
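
The definition above can be checked by hand on a few numbers. The sketch below (with made-up `x` and `y` values) computes r as the sum of the products of deviations from the mean, divided by the square root of the product of the summed squared deviations, and compares the result with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

# Made-up data with a strong positive linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Deviations of each value from its mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r written out from its definition
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```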

In [None]:
# Visually comparing GDP and Population Data 
fig, axes = plt.subplots(1, 2, figsize=(20,10))

gdp_gdf.plot(column='GDP', cmap='Blues', scheme='NaturalBreaks', k=7, ax=axes[0])
state_gdf.boundary.plot(ax=axes[0], color='black', linewidth=0.5, alpha=0.5)
axes[0].set_title('Gross Domestic Product (GDP)', fontsize=16)

pop_gdf.plot(column='Pop', cmap='Greens', scheme='NaturalBreaks',  k=7, ax=axes[1])
state_gdf.boundary.plot(ax=axes[1], color='black', linewidth=0.5, alpha=0.5)
axes[1].set_title('Population', fontsize=16)
plt.show()

In [None]:
import scipy.stats  # Import the stats submodule explicitly; a plain `import scipy` may not load it

# Combining GDP and Population Data into a single DataFrame
corr_df = gdp_gdf[['GEOID', 'GDP']].copy()
corr_df = corr_df.merge(pop_gdf[['GEOID', 'Pop']], on='GEOID', how='left')
corr_df = corr_df.dropna()
corr_df

In [None]:
# Conducting Pearson Correlation
corr_result = scipy.stats.pearsonr(corr_df['GDP'], corr_df['Pop'])
corr_result
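
The result of `pearsonr` holds two numbers: the correlation coefficient and a p-value. A minimal sketch with made-up lists (`pop` and `gdp` here are hypothetical values, not the lab data) shows how to unpack and print them.

```python
from scipy import stats

# Made-up values standing in for the population and GDP columns
pop = [10, 20, 30, 40, 50]
gdp = [12, 25, 29, 41, 55]

# The result unpacks into the coefficient r and the p-value
r, p_value = stats.pearsonr(gdp, pop)

print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```

A small p-value suggests that a correlation this strong would be unlikely if the two variables were unrelated; it does not by itself say that one variable causes the other.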

In [None]:
import seaborn as sns
sns.lmplot(data=corr_df, x='Pop', y='GDP')
plt.show()

# Done