# Area of Environmental Justice Concern Prediction using Random Forest

Title: Area of Environmental Justice Concern Prediction using Random Forest

Author(s): Mattie Gisselbeck and Nikunj Chawla

**Description**

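As the title implies, this project uses a Random Forest model to predict whether a census tract in the Twin Cities is an area of environmental justice concern (a binary outcome: 1 = yes, 0 = no) based on demographic, income, and housing factors that we deemed relevant. The model is fit as a regressor on the 0/1 label, so each prediction is a score between 0 and 1, which we threshold at 0.5 when classifying tracts.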

**Data Sources**

Metropolitan Council (2021). Equity Considerations for Place-Based Advocacy and Decisions in the Twin Cities Region. <https://gisdata.mn.gov/dataset/us-mn-state-metc-society-equity-considerations>

United States Census Bureau (2010). Minnesota Census Tract (2010). <https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2010&layergroup=Census+Tracts>

## 1. Creating and Evaluating the Random Forest Model

In [None]:
# Import the libraries necessary to run the Random Forest Regression and evaluate the model
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

In [None]:
# Import the dataset
equity_considerations = pd.read_csv("data/equity_considerations_full.csv")

In [None]:
# noinspection SpellCheckingInspection
"""
Uses only the columns deemed relevant, which are as follows:

TR10: Census tract ID
TR_EJ: Area of Environmental Justice Concern (1 = yes; 0 = no) (the column we are predicting)
PMENA_ARAB: Percentage of Arab population
PMENA_EGYP: Percentage of Egyptian population
PMENA_IRAN: Percentage of Iranian population
PMENA_ISRA: Percentage of Israeli population
PMENA_LEBA: Percentage of Lebanese population
PMENA_PALE: Percentage of Palestinian population
PMENA_TURK: Percentage of Turkish population
PBANC_AFRI: Percentage of Black or African American population
PBANC_ETHI: Percentage of Ethiopian population
PBANC_NIGE: Percentage of Nigerian population
PAMINDNH: Percentage of American Indian population
PBIPOC: Percentage of Black, Indigenous, and People of Color population
PPLURALRAC: Percentage of the population who identify as multiracial
PPOV185: Percentage of the population whose income is below 185% of the poverty line
PHISPPOP: Percentage of Hispanic or Latino population
HUTOT_ACS: Total housing units (American Community Survey estimate)
HHTOT_ACS: Total households, same as occupied housing units (American Community Survey estimate)
POPTOT_ACS: Total population
"""

# noinspection SpellCheckingInspection
environmental_justice_columns = [
    "TR10", "TR_EJ", "PMENA_ARAB", "PMENA_EGYP", "PMENA_IRAN", "PMENA_ISRA", "PMENA_LEBA", "PMENA_PALE", "PMENA_TURK",
    "PBANC_AFRI", "PBANC_ETHI", "PBANC_NIGE", "PAMINDNH", "PBIPOC", "PPLURALRAC", "PPOV185", "PHISPPOP",
    "HUTOT_ACS", "HHTOT_ACS", "POPTOT_ACS"
]

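# Copy the selected columns so later assignments do not act on a view of the original DataFrame
environmental_justice = equity_considerations[environmental_justice_columns].copy()
# Store the tract ID as a string so it can be joined with the GEOID10 column of the tract geometry
environmental_justice["TR10"] = environmental_justice["TR10"].astype(str)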
print(len(environmental_justice.index))
environmental_justice.head()

In [None]:
# Drops all rows with missing values
environmental_justice = environmental_justice.dropna()
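
Before dropping rows, it can help to see how many values are missing in each column and how many rows the drop removes. The sketch below uses a small made-up DataFrame (`toy`, which is not the project data) purely for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for the equity table (all values are made up)
toy = pd.DataFrame({
    "TR10": ["27053000101", "27053000102", "27053000103"],
    "TR_EJ": [1, 0, None],
    "PPOV185": [0.42, None, 0.10],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# Drop incomplete rows and record how many were removed
toy_clean = toy.dropna()
rows_removed = len(toy) - len(toy_clean)

print(missing_per_column)
print(f"Rows removed: {rows_removed}")
```

Applied to the notebook, the same two steps would run on `environmental_justice` just before the `dropna()` call.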

In [None]:
# Splits the dataset into the independent variables (X) and the dependent variable (y)
X = environmental_justice.drop(columns = ["TR10", "TR_EJ"])
y = environmental_justice["TR_EJ"]

# Splits the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
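
Because the target is a 0/1 label, one optional refinement is a stratified split, which keeps the share of each class equal in the training and testing sets. Whether the real `TR_EJ` column is imbalanced is not established here; the sketch below uses made-up data (`X_demo`, `y_demo`) only to show the `stratify` parameter of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and an imbalanced 0/1 label (20% positives), for illustration only
rng = np.random.default_rng(42)
X_demo = rng.random((100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# stratify = y_demo keeps the share of positives the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size = 0.2, random_state = 42, stratify = y_demo
)

print(y_tr.mean(), y_te.mean())
```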

In [None]:
# Creates the Random Forest Regression model and fits it to the training set, using a fixed random state for reproducibility
random_forest = RandomForestRegressor(n_estimators = 1000, random_state = 42)
_ = random_forest.fit(X_train, y_train)
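
The notebook fits a regressor to the 0/1 label. An alternative, shown here only as a sketch and not as the method used above, is scikit-learn's `RandomForestClassifier`, which returns class probabilities through `predict_proba` and hard labels through `predict`. The data below is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, and `classifier` are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the tract features (not the project data)
X_demo, y_demo = make_classification(n_samples = 200, n_features = 5, random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.2, random_state = 42)

# Fewer trees than the notebook's 1000, only to keep the example fast
classifier = RandomForestClassifier(n_estimators = 100, random_state = 42)
classifier.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1
proba_class_1 = classifier.predict_proba(X_te)[:, 1]
labels = classifier.predict(X_te)

print(proba_class_1[:5])
print(labels[:5])
```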

In [None]:
# Predicts the dependent variable (y) using the independent variables (X) in the testing set and evaluates the model
y_pred = random_forest.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [None]:
# Prints the evaluation metrics
print("Random Forest Model Evaluation:")
print(f"MSE: {mse:.2f}")
print(f"r2 score: {r2:.2f}")
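
MSE and R² describe how close the continuous scores are to the 0/1 labels. Since the outcome is binary, it may also be useful to threshold the scores and report classification metrics. The helper below, `binary_metrics`, is a hypothetical name introduced for this sketch, and the labels and scores are made up:

```python
def binary_metrics(y_true, scores, threshold = 0.5):
    """Threshold the scores and return confusion counts, accuracy, precision, and recall."""
    predicted = [1 if score >= threshold else 0 for score in scores]
    pairs = list(zip(y_true, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / total if total else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }

# Made-up labels and scores, for illustration only
example_truth = [1, 0, 1, 0, 1, 0]
example_scores = [0.9, 0.2, 0.4, 0.6, 0.5, 0.1]
example_metrics = binary_metrics(example_truth, example_scores)
print(example_metrics)
```

In the notebook this could be called as `binary_metrics(list(y_test), list(y_pred))`. `sklearn.metrics` also provides `accuracy_score` and `confusion_matrix` for the same purpose.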

In [None]:
# Predicts a score for every tract (training and testing rows alike) and stores it in a new column
environmental_justice["TR_EJ_PREDICTED"] = random_forest.predict(X)

In [None]:
# Import the libraries necessary to import the tract geometry
import os

os.environ["USE_PYGEOS"] = "0"

import geopandas as gpd

In [None]:
# Gets the 2010 census tract geometry
tract_geometry = gpd.read_file("data/tl_2010_27_tract10/tl_2010_27_tract10.shp")
print(len(tract_geometry.index))
tract_geometry.head()

In [None]:
# Inner join the tract geometry on GEOID10 with the environmental justice dataset on TR10
# (the GeoDataFrame is on the left so that the result remains a GeoDataFrame)
environmental_justice = tract_geometry.merge(environmental_justice, left_on = "GEOID10", right_on = "TR10")
print(len(environmental_justice.index))
environmental_justice.head()
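
An inner join silently drops every row whose key has no match, so the printed row counts are the only hint that something went missing. The sketch below shows one way to list the unmatched keys on each side. `unmatched_keys` is a hypothetical helper, and the tract IDs are made up:

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys found only on the left and only on the right, each sorted."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)

# Made-up tract IDs, for illustration only
geometry_ids = ["27053000101", "27053000102", "27123030100"]
equity_ids = ["27053000101", "27053000102", "27037060100"]

only_in_geometry, only_in_equity = unmatched_keys(geometry_ids, equity_ids)
print(only_in_geometry)
print(only_in_equity)
```

Run before the merge, `unmatched_keys(tract_geometry["GEOID10"], environmental_justice["TR10"])` would be the equivalent call. Because the shapefile covers all of Minnesota while the equity data covers the Twin Cities region, many geometry-only keys are expected; keys that appear only in the equity data are the ones worth investigating.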

## 2. Data Visualization using folium

In [None]:
# Import packages for data visualization
import folium
from branca.colormap import LinearColormap
import matplotlib.pyplot as plt

In [None]:
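# Reproject the GeoDataFrame to EPSG:4326 (WGS 84) and plot it
# set_crs with allow_override only relabels the CRS without transforming the coordinates, so to_crs is used instead
# (this assumes the shapefile's .prj file defines its source CRS)
tract_geometry = tract_geometry.to_crs("EPSG:4326")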

tract_geometry.plot()

In [None]:
# Convert "geometry" column to GeoSeries
environmental_justice["geometry"] = gpd.GeoSeries(environmental_justice["geometry"])

# Calculate centroid coordinates
centroid_lat = environmental_justice["geometry"].apply(lambda x: x.centroid.y).mean()
centroid_lon = environmental_justice["geometry"].apply(lambda x: x.centroid.x).mean()

In [None]:
# Create folium map object
ej_prediction_map = folium.Map(location = [centroid_lat, centroid_lon], zoom_start = 9)

In [None]:
# Define a linear color map with a gradient from white to blue
colormap = LinearColormap(
    colors = [(255, 255, 255, 0), "blue"],
    index = [0, 1],
    vmin = 0,
    vmax = 1
)

In [None]:
# Add census tracts to the map
folium.GeoJson(
    environmental_justice,
    name = "Area of Environmental Justice Concern Prediction",
    tooltip = folium.features.GeoJsonTooltip(
        fields = ["TR10", "TR_EJ_PREDICTED"],
        aliases = ["Census Tract ID", "Prediction"],
        localize = True
    ),
    style_function = lambda feature: {
        "fillColor": colormap(feature["properties"]["TR_EJ_PREDICTED"]),
        "color": "black",
        "weight": 1,
        "fillOpacity": 0.7
    }
).add_to(ej_prediction_map)

In [None]:
# Display folium map object
ej_prediction_map

In [None]:
# Classify each census tract as an area of environmental justice concern (prediction >= 0.5) or not, and count the tracts in each class
environmental_justice["TR_EJ_PREDICTED"].apply(lambda x: 1 if x >= 0.5 else 0).value_counts()

In [None]:
# Histogram of the predicted environmental justice values
environmental_justice["TR_EJ_PREDICTED"].hist()

plt.title("Distribution of Predicted Environmental Justice Concern", fontsize = 16)
plt.xlabel("Prediction", fontsize = 14)
plt.ylabel("Frequency", fontsize = 14)
plt.show()

This notebook stands as a "data demonstration" since it holds all the data required to complete the analysis and visualization of the environmental justice predictions. We cleaned the equity data by dropping rows with null values, trained the Random Forest model on the relevant columns, attached the model's predictions to the equity data, and merged the result with the census tract table to obtain the geometry for mapping.