# Area of Environmental Justice Concern Prediction using Random Forest

Author(s): Mattie Gisselbeck and Nikunj Chawla

**Abstract**

Redlining, a historic discriminatory practice that denied financial services such as loans to minority groups in order to keep them out of certain areas based on their race or ethnicity, still affects the Twin Cities Metropolitan Area (TCMA) today. Historically redlined neighborhoods were, and remain, areas of concentrated poverty. Low-income neighborhoods and BIPOC communities in the TCMA are heavily urbanized, have higher potential exposure to pollutants, and are more vulnerable to environmental factors. Environmental justice seeks to address this inequity in environmental protection and to reduce the resulting disparity in these communities.

The objective of this project is to use a Random Forest model to predict areas of environmental justice concern, helping to bridge the gap of historical disparity in the TCMA.

**Data Sources**

Metropolitan Council (2021). Equity Considerations for Place-Based Advocacy and Decisions in the Twin Cities Region. <https://gisdata.mn.gov/dataset/us-mn-state-metc-society-equity-considerations>

United States Census Bureau (2010). Minnesota Census Tract (2010). <https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2010&layergroup=Census+Tracts>

## 1. Random Forest Model, Evaluation, and Preparation of the Data

### 1.1 Importing the Data and Creating the Random Forest Model

1.1.1 Importing the libraries necessary to run the Random Forest Regression and evaluate the model

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

1.1.2 Importing the equity considerations dataset

In [None]:
equity_considerations_df = pd.read_csv("data/equity_considerations_full.csv")

1.1.3 Extracting only the columns deemed relevant for the random forest model, converting the TR10 column to a string, and printing the number of rows and a few rows of the dataset.

The columns deemed relevant are as follows:

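**ID**
- TR10: Census Tract ID

**Target**
- TR_EJ: Area of Environmental Justice Concern indicator (the dependent variable that the model predicts)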

**Aggregate Demographics**
- HUTOT_ACS: Total Housing Units (ACS Estimate)
- HHTOT_ACS: Total Households (ACS Estimate)
- POPTOT_ACS: Total Population

**Low-Income Population**
- PPOV185: Percentage of the Population whose Income is Below 185% of the Poverty Line


**BIPOC Population**
- PMENA_ARAB: Percentage of Arab Population
- PMENA_EGYP: Percentage of Egyptian Population
- PMENA_IRAN: Percentage of Iranian Population
- PMENA_ISRA: Percentage of Israeli Population
- PMENA_LEBA: Percentage of Lebanese Population
- PMENA_PALE: Percentage of Palestinian Population
- PMENA_TURK: Percentage of Turkish Population
- PBANC_AFRI: Percentage of Black or African American Population
- PBANC_ETHI: Percentage of Ethiopian Population
- PBANC_NIGE: Percentage of Nigerian Population
- PAMINDNH: Percentage of American Indian Population
- PBIPOC: Percentage of Black, Indigenous, and People of Color Population
- PPLURALRAC: Percentage of the Population who Identify as Multiracial
- PHISPPOP: Percentage of Hispanic or Latino Population



In [None]:
# The columns deemed relevant
# noinspection SpellCheckingInspection
environmental_justice_columns = [
    "TR10", "TR_EJ", "PMENA_ARAB", "PMENA_EGYP", "PMENA_IRAN", "PMENA_ISRA", "PMENA_LEBA", "PMENA_PALE", "PMENA_TURK",
    "PBANC_AFRI", "PBANC_ETHI", "PBANC_NIGE", "PAMINDNH", "PBIPOC", "PPLURALRAC", "PPOV185", "PHISPPOP",
    "HUTOT_ACS", "HHTOT_ACS", "POPTOT_ACS"
]

# Get the columns deemed relevant (copied so that later assignments do not act on a view of the full dataset)
environmental_justice_df = equity_considerations_df[environmental_justice_columns].copy()

# Converts the TR10 column to a string
environmental_justice_df["TR10"] = environmental_justice_df["TR10"].astype(str)

# Prints the number of rows and a few rows of the dataset
print(len(environmental_justice_df.index))
environmental_justice_df.head()

1.1.4 Drops all rows with missing values

In [None]:
environmental_justice_df = environmental_justice_df.dropna()
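
Before dropping rows, it can help to see how many values are missing per column and how many tracts are lost as a result. The following is a minimal sketch on a hypothetical toy frame; the values are made up for illustration and do not come from the equity considerations dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for environmental_justice_df
toy_df = pd.DataFrame({
    "TR10": ["27053000101", "27053000102", "27053000300"],
    "PPOV185": [0.25, None, 0.40],
    "TR_EJ": [1, 0, None],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()

# Number of rows before and after dropping incomplete rows
rows_before = len(toy_df)
rows_after = len(toy_df.dropna())

print(missing_per_column)
print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
```

If a single column accounts for most of the missing values, dropping that column instead of the rows may keep more tracts in the analysis.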

1.1.5 Splits the dataset into the independent variables (X) and the dependent variable (y) and splits the dataset into training and testing sets

In [None]:
# Splits the dataset into the independent variables (X) and the dependent variable (y)
X = environmental_justice_df.drop(columns = ["TR10", "TR_EJ"])
y = environmental_justice_df["TR_EJ"]

# Splits the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
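
If TR_EJ is a 0/1 flag (an assumption here, suggested by the 0.5 cutoff applied later in the notebook), a stratified split keeps the share of flagged tracts equal in the training and testing sets. The sketch below uses hypothetical toy data and the `stratify` parameter of `train_test_split`; it is an optional variation, not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 tracts, 20 of which are flagged (TR_EJ = 1)
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"PPOV185": rng.random(100), "PBIPOC": rng.random(100)})
toy_y = pd.Series([1] * 20 + [0] * 80, name = "TR_EJ")

# stratify keeps the share of flagged tracts the same in both subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size = 0.2, random_state = 42, stratify = toy_y
)

print(toy_y_train.mean(), toy_y_test.mean())
```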

1.1.6 Creates the Random Forest Regression model and fits it to the training set, using a fixed random state for reproducibility (performs very similarly even without a fixed random state)

In [None]:
random_forest_model = RandomForestRegressor(n_estimators = 1000, random_state = 42)
_ = random_forest_model.fit(X_train, y_train)
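
Once a forest is fitted, its `feature_importances_` attribute gives a rough indication of which columns drive the predictions. The sketch below is illustrative only: the toy data is hypothetical and built so that the target depends on a single column, and the resulting ranking says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: the target depends only on PPOV185; the other columns are noise
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({
    "PPOV185": rng.random(300),
    "PBIPOC": rng.random(300),
    "HHTOT_ACS": rng.integers(500, 3000, 300),
})
toy_y = (toy_X["PPOV185"] > 0.5).astype(float)

toy_model = RandomForestRegressor(n_estimators = 100, random_state = 42)
_ = toy_model.fit(toy_X, toy_y)

# feature_importances_ follows the column order of the training data and sums to 1
importances = pd.Series(toy_model.feature_importances_, index = toy_X.columns)
print(importances.sort_values(ascending = False))
```

The same two lines applied to `random_forest_model` and `X_train.columns` would rank the columns used in this notebook.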

### 1.2 Evaluating the Random Forest Model

1.2.1 Predicts the dependent variable (y) using the independent variables (X) in the testing set and evaluates the model

In [None]:
y_test_predicted = random_forest_model.predict(X_test)
mse = mean_squared_error(y_test, y_test_predicted)
r2 = r2_score(y_test, y_test_predicted)

1.2.2 Prints the evaluation metrics

In [None]:
print("Random Forest Model Evaluation:")
print(f"MSE: {mse:.2f}")
print(f"r2 score: {r2:.2f}")
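
MSE and r2 describe the regression scores, but if TR_EJ is a 0/1 flag (an assumption here), classification metrics on thresholded predictions can be easier to interpret. The sketch below uses hypothetical values for six tracts and the same 0.5 cutoff that section 2.9 applies.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical observed flags and regression scores for six tracts
toy_y_true = [0, 0, 1, 1, 1, 0]
toy_y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

# Same 0.5 cutoff as the classification step in section 2.9
toy_y_label = [1 if score >= 0.5 else 0 for score in toy_y_score]

accuracy = accuracy_score(toy_y_true, toy_y_label)
confusion = confusion_matrix(toy_y_true, toy_y_label)  # rows: observed, columns: predicted

print(f"Accuracy: {accuracy:.2f}")
print(confusion)
```

With the notebook's own variables, `toy_y_true` and `toy_y_score` would be replaced by `y_test` and `y_test_predicted`.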

1.2.3 Adds the predicted TR_EJ column to the environmental justice dataset

In [None]:
environmental_justice_df["TR_EJ_PREDICTED"] = random_forest_model.predict(X)
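
Because the model predicts on all of X here, about 80% of the predicted values are for tracts it was trained on, and those tend to match the observed values more closely than predictions for unseen tracts. One possible alternative, sketched below on hypothetical toy data, is `cross_val_predict`, which predicts every row with a model that did not see that row during fitting. This is a suggested variation and not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Hypothetical noisy toy data standing in for X and y
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"PPOV185": rng.random(200), "PBIPOC": rng.random(200)})
toy_y = ((toy_X["PPOV185"] + rng.normal(0, 0.3, 200)) > 0.5).astype(float)

toy_model = RandomForestRegressor(n_estimators = 100, random_state = 42)

# In-sample: the model has already seen every row it predicts
in_sample = toy_model.fit(toy_X, toy_y).predict(toy_X)

# Out-of-fold: each row is predicted by a model fitted on the other four folds
out_of_fold = cross_val_predict(toy_model, toy_X, toy_y, cv = 5)

in_sample_r2 = r2_score(toy_y, in_sample)
out_of_fold_r2 = r2_score(toy_y, out_of_fold)
print(f"In-sample r2: {in_sample_r2:.2f}, out-of-fold r2: {out_of_fold_r2:.2f}")
```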

### 1.3 Merging the environmental justice dataset with the census tract geometry

1.3.1 Imports the libraries necessary to import the tract geometry

In [None]:
# Import the libraries necessary to import the tract geometry
import os

# Set the USE_PYGEOS environment variable to 0 to use shapely instead of pygeos
os.environ["USE_PYGEOS"] = "0"

import geopandas as gpd

1.3.2 Imports the census tract geometry, printing the number of rows and a few rows of the dataset

In [None]:
# Gets the 2010 census tract geometry
census_tract_geometry_gdf = gpd.read_file("data/tl_2010_27_tract10/tl_2010_27_tract10.shp")
print(len(census_tract_geometry_gdf.index))
census_tract_geometry_gdf.head()

1.3.3 Merges the tract geometry on GEOID10 with the environmental justice dataset on TR10, printing the number of rows and a few rows of the dataset

In [None]:
environmental_justice_df = census_tract_geometry_gdf.merge(environmental_justice_df, left_on = "GEOID10", right_on = "TR10")
print(len(environmental_justice_df.index))
environmental_justice_df.head()
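
The merge above keeps only tracts whose IDs appear in both tables, so comparing row counts is one way to spot mismatches (for example, IDs stored with different types or formats). A left merge with `indicator = True` makes the check explicit. The sketch below uses plain pandas frames with hypothetical tract IDs and values in place of the real geometry and prediction tables.

```python
import pandas as pd

# Hypothetical stand-ins: three tract geometries, predictions for only two of them
toy_geometry_df = pd.DataFrame({"GEOID10": ["27053000101", "27053000102", "27137000100"]})
toy_prediction_df = pd.DataFrame({
    "TR10": ["27053000101", "27053000102"],
    "TR_EJ_PREDICTED": [0.82, 0.07],
})

# A left merge with indicator = True labels every geometry row as "both" or "left_only"
merge_check = toy_geometry_df.merge(
    toy_prediction_df, left_on = "GEOID10", right_on = "TR10", how = "left", indicator = True
)
match_counts = merge_check["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` are tracts with geometry but no prediction, which is expected for tracts outside the area covered by the equity considerations dataset.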

## 2. Data Visualization

2.1 Imports the libraries necessary to visualize the data

In [None]:
import folium
from branca.colormap import LinearColormap
import matplotlib.pyplot as plt

2.2 Converts the census tract dataset to EPSG:4326 CRS and plots it

In [None]:
census_tract_geometry_gdf = census_tract_geometry_gdf.to_crs("EPSG:4326")
census_tract_geometry_gdf.plot()

2.3 Converts the environmental justice dataset to EPSG:4326 CRS and plots it

In [None]:
environmental_justice_df = environmental_justice_df.to_crs("EPSG:4326")
environmental_justice_df.plot()

2.4 Calculates the mean centroid coordinates of the census tracts in the environmental justice dataset

In [None]:
centroid_lat = environmental_justice_df["geometry"].apply(lambda x: x.centroid.y).mean()
centroid_lon = environmental_justice_df["geometry"].apply(lambda x: x.centroid.x).mean()

2.5 Creates a folium map object centered on the mean centroid coordinates of the census tracts

In [None]:
ej_prediction_map = folium.Map(location = [centroid_lat, centroid_lon], zoom_start = 9)

2.6 Creates a linear color map with a gradient from transparent white to opaque dark red

In [None]:
colormap = LinearColormap(
    colors = [(255, 255, 255, 0), (121, 17, 27, 255)],
    index = [0, 1],
    vmin = 0,
    vmax = 1
)
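
To make the gradient concrete, the sketch below shows the linear interpolation idea in plain Python, using the two RGBA endpoints from the color map above. The function `interpolate_rgba` is a hypothetical helper written for illustration; it is not branca's implementation and only approximates what the color map does for a prediction between 0 and 1.

```python
def interpolate_rgba(value, start = (255, 255, 255, 0), end = (121, 17, 27, 255)):
    """Linearly interpolate between two RGBA colors for a value in [0, 1]."""
    clamped = min(max(value, 0.0), 1.0)  # keep out-of-range values on the scale
    return tuple(s + (e - s) * clamped for s, e in zip(start, end))

# Low predictions stay close to transparent, high predictions approach the opaque end color
for prediction in (0.0, 0.5, 1.0):
    print(prediction, interpolate_rgba(prediction))
```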

2.7 Adds the environmental justice dataset to the map, using the tract IDs and the predicted TR_EJ column as the tooltip and using the predicted TR_EJ column as a linear interpolation between transparent and dark red for the fill color

In [None]:
_ = folium.GeoJson(
    environmental_justice_df,
    name = "Area of Environmental Justice Concern Prediction",
    tooltip = folium.features.GeoJsonTooltip(
        fields = ["TR10", "TR_EJ_PREDICTED"],
        aliases = ["Census Tract ID", "Prediction"],
        localize = True
    ),
    style_function = lambda feature: {
        "fillColor": colormap(feature["properties"]["TR_EJ_PREDICTED"]),
        "color": "black",
        "weight": 1,
        "fillOpacity": 0.7
    }
).add_to(ej_prediction_map)

2.8 Displays the map of predictions for each census tract: fully dark red means the tract is predicted to be an area of environmental justice concern, and fully transparent means it is not

In [None]:
ej_prediction_map

2.9 Classifies the census tracts as areas of environmental justice concern (predicted value of at least 0.5) or not, and presents the total of each

In [None]:
environmental_justice_df["TR_EJ_PREDICTED"].apply(lambda x: 1 if x >= 0.5 else 0).value_counts()

2.10 Histogram of the predicted environmental justice values

In [None]:
environmental_justice_df["TR_EJ_PREDICTED"].hist()

plt.title("Distribution of Predicted Environmental Justice Concern", fontsize = 16)
plt.xlabel("Prediction", fontsize = 14)
plt.ylabel("Frequency", fontsize = 14)
plt.show()