# Area of Environmental Justice Concern Prediction using the Random Forest Model

As the title implies, this project uses a Random Forest model to predict whether a tract in the Twin Cities is an area of environmental justice concern (binary classification), based on a set of demographic, income, and housing factors that we deemed relevant.

The project uses the [Equity Considerations for Place-Based Advocacy and Decisions in the Twin Cities Region dataset](https://gisdata.mn.gov/dataset/us-mn-state-metc-society-equity-considerations).

## 1. Creating and Evaluating the Random Forest Model

In [None]:
# Import the libraries necessary to run the Random Forest Regression and evaluate the model
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

In [None]:
# Import the dataset
equity_considerations = pd.read_csv("data/equity_considerations_full.csv")

In [None]:
"""
Uses only the columns deemed relevant, which are as follows:

TR_EJ: Area of Environmental Justice Concern (1 = yes; 0 = no) (the column we are predicting)

PMENA_ARAB: Percentage of Arab population
PMENA_EGYP: Percentage of Egyptian population
PMENA_IRAN: Percentage of Iranian population
PMENA_ISRA: Percentage of Israeli population
PMENA_LEBA: Percentage of Lebanese population
PMENA_PALE: Percentage of Palestinian population
PMENA_TURK: Percentage of Turkish population
PBANC_AFRI: Percentage of Black or African American population
PBANC_ETHI: Percentage of Ethiopian population
PBANC_NIGE: Percentage of Nigerian population
PAMINDNH: Percentage of American Indian population
PBIPOC: Percentage of Black, Indigenous, and People of Color population
PPLURALRAC: Percentage of the population who identify as multiracial
PPOV185: Percentage of the population whose income is below 185% of the poverty line
PHISPPOP: Percentage of Hispanic or Latino population
HUTOT_ACS: Total housing units (American Community Survey estimate)
HHTOT_ACS: Total households, same as occupied housing units (American Community Survey estimate)
POPTOT_ACS: Total population
"""
environmental_justice_columns = [
    "TR_EJ", "PMENA_ARAB", "PMENA_EGYP", "PMENA_IRAN", "PMENA_ISRA", "PMENA_LEBA", "PMENA_PALE", "PMENA_TURK",
    "PBANC_AFRI", "PBANC_ETHI", "PBANC_NIGE", "PAMINDNH", "PBIPOC", "PPLURALRAC", "PPOV185", "PHISPPOP",
    "HUTOT_ACS", "HHTOT_ACS", "POPTOT_ACS"
]

environmental_justice = equity_considerations[environmental_justice_columns]

In [None]:
# Drops all rows with missing values
environmental_justice = environmental_justice.dropna()
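
Before dropping rows, it can help to check how many would be lost. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and all values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tract table; all values are made up
toy = pd.DataFrame({
    "TR_EJ": [1, 0, 1, 0],
    "PPOV185": [0.42, np.nan, 0.31, 0.10],
    "PBIPOC": [0.55, 0.20, np.nan, 0.15],
})

# Count missing values per column, then compare row counts before and after dropna
missing_per_column = toy.isna().sum()
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(f"Rows dropped: {rows_before - rows_after} of {rows_before}")
```

If a large share of rows disappears, imputing values or dropping a sparse column may be worth considering instead.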

In [None]:
# Splits the dataset into the independent variables (X) and the dependent variable (y)
X = environmental_justice.drop("TR_EJ", axis=1)
y = environmental_justice["TR_EJ"]

# Splits the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
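
If the two classes of the target are imbalanced, a plain random split can leave the test set with a different class mix than the training set. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels; `X_toy` and `y_toy` are hypothetical, and whether `TR_EJ` is actually imbalanced is an assumption that would need checking.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones (not real data)
rng = np.random.default_rng(42)
X_toy = rng.random((100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the share of each class the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Share of positives in train: {y_tr.mean():.2f}")
print(f"Share of positives in test:  {y_te.mean():.2f}")
```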

In [None]:
# Creates the Random Forest Regression model and fits it to the training set, using a fixed random state for reproducibility
# Note: because TR_EJ is a 0/1 label, the regressor predicts a continuous score between 0 and 1 rather than a class
random_forest = RandomForestRegressor(n_estimators=1000, random_state=42)
_ = random_forest.fit(X_train, y_train)
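
Since `TR_EJ` is a 0/1 label, one alternative worth considering is scikit-learn's `RandomForestClassifier`, which predicts the class directly. This is a minimal sketch on synthetic data from `make_classification`, not the method used in this notebook; all variable names here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tract features and the 0/1 TR_EJ label (not real data)
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Fit a classifier instead of a regressor; predictions are class labels (0 or 1)
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_syn_train, y_syn_train)
y_syn_pred = classifier.predict(X_syn_test)

accuracy = accuracy_score(y_syn_test, y_syn_pred)
print(f"Accuracy: {accuracy:.2f}")
```

With a classifier, metrics such as accuracy, precision, and recall are usually easier to interpret than MSE and R².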

In [None]:
# Predicts the dependent variable (y) using the independent variables (X) in the testing set and evaluates the model
# (the model was already fitted in the previous cell, so it is not fitted again here)
y_pred = random_forest.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
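
The regressor's predictions are continuous scores, so answering the yes/no question requires a cutoff. The sketch below thresholds made-up scores at 0.5 and compares them with made-up true labels; the arrays and the 0.5 cutoff are assumptions chosen for illustration, not results from this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and regressor scores (not output from the real model)
y_true_toy = np.array([1, 0, 1, 0, 1])
scores_toy = np.array([0.9, 0.2, 0.4, 0.6, 0.7])

# Scores at or above the cutoff become class 1, the rest become class 0
cutoff = 0.5
labels_toy = (scores_toy >= cutoff).astype(int)

# Rows of the confusion matrix are true classes, columns are predicted classes
toy_accuracy = accuracy_score(y_true_toy, labels_toy)
toy_confusion = confusion_matrix(y_true_toy, labels_toy)

print(f"Accuracy: {toy_accuracy:.2f}")
print(toy_confusion)
```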

In [None]:
# Prints the evaluation metrics
print("Random Forest Model evaluation:")
print(f"MSE: {mse:.2f}")
print(f"r2 score: {r2:.2f}")
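
To see which factors the forest relies on most, the fitted model's `feature_importances_` attribute can be paired with the column names. This is a minimal sketch on made-up data: the column names are borrowed from the list above, but the values, the target rule, and the names `toy_X`, `toy_y`, and `toy_forest` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up feature values; only the column names come from the dataset
rng = np.random.default_rng(42)
toy_X = pd.DataFrame(rng.random((200, 3)), columns=["PPOV185", "PBIPOC", "POPTOT_ACS"])

# Hypothetical target that depends only on the first column
toy_y = (toy_X["PPOV185"] > 0.5).astype(int)

toy_forest = RandomForestRegressor(n_estimators=50, random_state=42)
toy_forest.fit(toy_X, toy_y)

# Importances are impurity-based and sum to 1 across all features
importances = pd.Series(toy_forest.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can be skewed by correlated features, so they are best read as a rough ranking.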