# Predicting Inland Empire Warehouse Growth

### UDS Final Project -  Lucy Briggs, Carolyn Pugh, Monisha Reginald, Alyssa Suzukawa

### Research Question:
#### If warehouses continue growing at the same rate in Southern California, what will the region look like by the year 2030?
Warehouses have been expanding rapidly in the Inland Empire, with significant environmental justice implications. We wanted to understand where, based on trends from 2010-2020, future warehouses might be built in San Bernardino and Riverside Counties, California. To do this, we collected data that might be related to warehouse siting, such as land values, existing warehouse locations, distance to major roadways, and local demographics. We then used machine learning to train models that predict the probability that a warehouse will be built on a land parcel by 2030.

### Overview of Notebook:
Section 1. Data Collection & Cleaning

Section 2. Random Forest Models

Section 3. Neural Network Models

Section 4. Discussion of Results


## Data Collection & Cleaning

#### Assessor Parcel Data
[Initial Data Cleaning for Riverside Parcel Data](https://github.com/monishareginald/uds-warehouse-project/blob/main/riversidegeom.ipynb)<br>
[Data Wrangling Notebook](https://github.com/monishareginald/uds-warehouse-project/blob/main/Assessor%20Parcel%20Data%20Wrangling%20with%20Freight%20and%20Warehouses.ipynb)<br>
Data Sources: Assessor Parcel Data from [Riverside County GIS](https://gis.rivco.org/pages/data-distribution) and [San Bernardino Open Data Portal](https://open.sbcounty.gov/datasets/countywide-parcels/about)<br>
* First, we cleaned the assessor parcel data from Riverside and San Bernardino Counties separately. This process included renaming variables of interest to match and dissolving by APN to aggregate the data to the parcel level. 
* Next, we concatenated the two geodataframes to create a single geodataframe with all of the parcels across the two counties.
* Much of the Inland Empire is rural and unlikely to experience development, so we wanted to focus our analysis on the areas where warehouses have historically been built. To achieve this, we created a "mask" representing the area within 70 miles of the Ontario Airport and then clipped our parcel dataset to that mask. The figures below display how this changed the geographic extent of our analysis.<br><br>
![title](figures/parcel_clipping.PNG) 
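
As a quick check on the size of the mask, the sketch below converts the 70-mile radius to feet (the unit of EPSG:2229, the projection used in the mapping code in this notebook) and tests whether a point falls inside that radius. It is a simplified stand-in for the geopandas buffer-and-clip step, not the project's code: the function name `within_buffer` and the two test points are hypothetical, and it checks single points rather than clipping parcel polygons.

```python
import math

FEET_PER_MILE = 5280
BUFFER_MILES = 70
BUFFER_FEET = BUFFER_MILES * FEET_PER_MILE  # 369,600 ft, the value passed to buffer()

# Ontario Airport in EPSG:2229 coordinates (feet), as used in the mapping code
AIRPORT_X, AIRPORT_Y = 6683335.118285051, 1843271.4373799062


def within_buffer(x, y, cx=AIRPORT_X, cy=AIRPORT_Y, radius=BUFFER_FEET):
    """Return True if (x, y) lies within `radius` feet of (cx, cy)."""
    return math.hypot(x - cx, y - cy) <= radius


# Hypothetical points: 10 miles and 100 miles due east of the airport
near = within_buffer(AIRPORT_X + 10 * FEET_PER_MILE, AIRPORT_Y)
far = within_buffer(AIRPORT_X + 100 * FEET_PER_MILE, AIRPORT_Y)
print(BUFFER_FEET, near, far)
```

A planar distance test like this only makes sense in a projected coordinate system such as EPSG:2229; it would not be valid on raw longitude/latitude values.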

#### Warehouse Data (Monisha)
[Data Wrangling Notebook](https://github.com/monishareginald/uds-warehouse-project/blob/main/Assessor%20Parcel%20Data%20Wrangling%20with%20Freight%20and%20Warehouses.ipynb)<br>
Data Source: [Warehouse CITY data on warehouse APNs and year built](https://radicalresearch.shinyapps.io/WarehouseCITY/)<br>
* First, we created a dataset of warehouses that contained one row per APN. Through exploratory mapping and a review of the counties' Zoning District Maps, we determined that duplicated APNs in the dataset represented parcels with multiple warehouses located within the parcel. We wanted our analysis to take place at the parcel level, so we used groupby to aggregate our data with the appropriate functions (e.g., the minimum "year" to represent the first year that a warehouse was built on the parcel, but the sum of "sqft" to represent the total square footage of all warehouses built within a single parcel).
* Next, we used a series of lambda functions to create boolean variables that noted whether a parcel had a warehouse on it at the start of 2010, whether the parcel had a warehouse on it at the start of 2020, and whether a warehouse was specifically built during the 2010s (the dependent variable we used to train our machine learning models). 
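
The two steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the code from the wrangling notebook: the "year" and "sqft" column names follow the text, while the flag names (`wh_2010`, `wh_2020`, `built_2010s`) and the exact year cutoffs are assumptions.

```python
import pandas as pd

# Hypothetical warehouse records: APN "A" has two warehouses on one parcel
warehouses = pd.DataFrame({
    "APN": ["A", "A", "B", "C"],
    "year": [2005, 2014, 2012, 2021],
    "sqft": [100_000, 50_000, 200_000, 80_000],
})

# One row per parcel: earliest build year, total square footage, warehouse count
parcels = warehouses.groupby("APN").agg(
    year=("year", "min"),
    sqft=("sqft", "sum"),
    num_warehouses=("year", "size"),
).reset_index()

# Boolean flags (assumed rule: a warehouse exists at the start of year Y if built before Y)
parcels["wh_2010"] = parcels["year"].apply(lambda y: y < 2010)
parcels["wh_2020"] = parcels["year"].apply(lambda y: y < 2020)
parcels["built_2010s"] = parcels["year"].apply(lambda y: 2010 <= y < 2020)

print(parcels)
```

Because the aggregation keeps only the first build year, a parcel like "A" (first warehouse in 2005, second in 2014) is not flagged as built in the 2010s under this sketch.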

#### Interactive Map of Current Warehouses in Buffer

This interactive map visualizes where the 3,389 warehouses in the area of study are located. 

**Move the map around!**
NOTE: Please run the code (hidden) below to view interactive map!

In [4]:
# Run code (hidden) below to view interactive map
warehouse_map

#### **Code**: Map of Current Warehouses in Buffer 

In [3]:
# import relevant libraries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx

# import data
all_parcels = pd.read_csv('data/join_scag_to_parcels_left_2019.csv',
                          usecols=['APN','lon', 'lat', 'num_warehouses'])
current_warehouses = all_parcels[all_parcels['num_warehouses'] >= 1]

# get geometry
current_warehouses = gpd.GeoDataFrame(current_warehouses, geometry = gpd.points_from_xy(current_warehouses['lon'], current_warehouses['lat'], crs = 'EPSG:4326'))

# calculate a 70 mile buffer around the Ontario Airport & turn this into a geodataframe
airport=gpd.points_from_xy(x=[6683335.118285051], y=[1843271.4373799062], crs=2229) #2229
buffer=airport.buffer(369600)
buffer=gpd.GeoDataFrame(geometry=buffer,crs=2229)

# match projection to buffer (to_crs returns a new GeoDataFrame, so reassign)
current_warehouses = current_warehouses.to_crs(epsg=2229)

# warehouse map
warehouse_map = current_warehouses.explore(# this defines the field to "choropleth"
        legend=True,
        #cmap='RdYlGn_r', # the "_r" reverses the color
        tiles='CartoDB positron')

#### Distance to Freight Network (Monisha)
[Data Wrangling Notebook](https://github.com/monishareginald/uds-warehouse-project/blob/main/Assessor%20Parcel%20Data%20Wrangling%20with%20Freight%20and%20Warehouses.ipynb)<br>
Data Source: [Shapefile of National Highway Freight Network from Federal Highway Administration ](https://fpcb.ops.fhwa.dot.gov/tools_nhfn.aspx)<br>
We used the geopandas function sjoin_nearest to calculate the distance between each parcel and the NHFN. The map below displays the results.<br><br>
![title](figures/freight.png)

#### SCAG Data (Alyssa to fill in this section & link to data/prep notebooks) 


#### Census Data  
[Notebook for 2009 Data](https://github.com/monishareginald/uds-warehouse-project/blob/main/Census-2009-final.ipynb)<br>
[Notebook for 2019 Data](https://github.com/monishareginald/uds-warehouse-project/blob/main/Census-2019.ipynb)<br>
Data Source: [American Community Survey (5 year estimates) for 2009 and 2019](https://www.census.gov/programs-surveys.html)<br>
First, we used cenpy and the Census API to pull variables for race, homeownership, education, income, and occupation. We then merged the tables and variables for both counties and added census tract geometries.

Three of the variables that ended up having high feature importances (percent renter-occupied, percent White-alone, and percent low income) are shown below.

![title](figures/renter_map.png)
![title](figures/percent_white_alone.png)
![title](figures/percent_lowincome.png)

#### Joins
[Notebook to Create Final 2010 Data](https://github.com/monishareginald/uds-warehouse-project/blob/main/Spatial%20Join%202010%20Data.ipynb)<br>
[Notebook to Create Final 2020 Data](https://github.com/monishareginald/uds-warehouse-project/blob/main/Spatial%20Join%202020%20Data.ipynb)<br>
* *** ALYSSA TO FILL IN PART ABOUT JOINING SCAG TO WAREHOUSE DATA ***
* Next, we created two separate datasets for 2010 and 2020 that incorporated information about the Census tract that each parcel was located in. To speed up this process, we:
    - Created two geodataframes that contained just a unique ID and geometry (i.e., APN and geometry for the parcels and GEOID and geometry for the census tract data) to minimize the size of the dataframes involved in the spatial join
    - Converted each parcel's geometry to its centroid to speed up the spatial join
    - Performed a spatial join to create a crosswalk that matched each APN with the GEOID of the census tract that its centroid falls within
    - Used this crosswalk to do a series of index joins on APN and GEOID to add the census information into the parcel data
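
The last step, using the crosswalk for index joins, might look like the sketch below. It assumes the crosswalk already exists (the spatial join itself is not shown), and every value here is made up for illustration: the APNs, GEOIDs, and the `acres` and `pct_renter` columns are hypothetical.

```python
import pandas as pd

# Hypothetical crosswalk from the centroid spatial join: one GEOID per APN
crosswalk = pd.DataFrame({
    "APN": ["A", "B", "C"],
    "GEOID": ["06065000100", "06071000200", "06065000100"],
})

# Hypothetical parcel attributes and tract-level census variables
parcels = pd.DataFrame({"APN": ["A", "B", "C"],
                        "acres": [1.5, 10.0, 0.3]}).set_index("APN")
tracts = pd.DataFrame({"GEOID": ["06065000100", "06071000200"],
                       "pct_renter": [0.42, 0.61]}).set_index("GEOID")

# Index join 1: attach each parcel's tract ID via the APN index
parcels_xw = parcels.join(crosswalk.set_index("APN"))

# Index join 2: match the GEOID column against the tract index
parcels_census = parcels_xw.join(tracts, on="GEOID")
print(parcels_census)
```

Keeping the spatial join down to IDs and geometry, and then attaching the wide attribute tables with ordinary joins like these, avoids carrying dozens of columns through the most expensive operation.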

## Random Forest Models

Our goal was to develop a random forest model that would predict whether or not a parcel would develop into a warehouse within 10 years. We could thereby train our model on a dataset representing conditions in 2010, with the benefit of knowing whether or not a warehouse _did_ develop by 2020, and then apply this model to a dataset representing conditions in 2020 to predict which parcels are most likely to develop into warehouses by 2030.
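
The train-on-one-decade, predict-the-next idea can be sketched with scikit-learn as below. This is an illustrative sketch on synthetic data, not the model from the linked notebooks: the use of scikit-learn, the two made-up features, the helper `make_parcels`, and the hyperparameters are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)


def make_parcels(n):
    """Synthetic stand-in for parcel data: two features and a fairly rare label."""
    X = rng.normal(size=(n, 2))  # e.g. scaled acreage and land value (hypothetical)
    # Positives are more likely when both features are large
    p = 1 / (1 + np.exp(-(2 * X[:, 0] + 2 * X[:, 1] - 4)))
    y = rng.random(n) < p
    return X, y


X_2010, built_2010s = make_parcels(2000)  # 2010 conditions, outcome known by 2020
X_2020, _ = make_parcels(2000)            # 2020 conditions, outcome unknown

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_2010, built_2010s)

# Estimated probability that each 2020 parcel gains a warehouse by 2030
prob_2030 = model.predict_proba(X_2020)[:, 1]
n_predicted = int((prob_2030 >= 0.5).sum())
print(n_predicted, model.feature_importances_)
```

The approach rests on the assumption that the relationship between parcel conditions and warehouse development in the 2010s carries over to the 2020s.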

### Random Forest with Parcel Attribute Data

[Random Forest Notebook](https://github.com/monishareginald/uds-warehouse-project/blob/main/RandomForest%20NO%20Census.ipynb)

Our first version of the random forest model was based on our dataset that excluded Census Data. This model represented our ability to predict warehouse growth based exclusively on parcel attributes (including the parcel size, land value, existing land uses, and distance to the National Highway Freight Network). As described above, we trained the model using 2010 data and then applied the model to 2020 data to predict warehouse development by the year 2030. 

<b>Training Model Performance:</b>
The model underestimated warehouse growth, but the "True" predictions that it made were highly accurate. The model's precision performance for positive predictions (97%) was much higher than its recall performance for true positives (32%). Unsurprisingly then, the predicted fraction of parcels with a warehouse by 2020 was 0.0001, even though the actual fraction of parcels with a warehouse was 0.0003.
<br><br>
![title](figures/rf_confusion_nocensus.png)<br>
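
To see why high precision with low recall implies an underestimate, note that the ratio of predicted positives to actual positives equals recall divided by precision. The sketch below works through this with hypothetical counts chosen only to roughly match the reported 97% and 32%; they are not the counts from the confusion matrix above.

```python
def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


# Hypothetical counts, for illustration only
tp, fp, fn = 32, 1, 68
print(round(precision(tp, fp), 2), round(recall(tp, fn), 2))  # 0.97 0.32

# predicted positives / actual positives = (tp + fp) / (tp + fn) = recall / precision
ratio_from_counts = (tp + fp) / (tp + fn)
ratio_from_metrics = 0.32 / 0.97  # reported recall / reported precision
print(round(ratio_from_counts, 2), round(ratio_from_metrics, 2))  # 0.33 0.33
```

A ratio of about one third is consistent with the predicted fraction of 0.0001 against the actual fraction of 0.0003.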
Although the model had low recall performance, we could still use it to shed light on which variables are important to predicting warehouse growth. Land values, acreage, and land values per acre were the most important variables in the model. Location variables (latitude, longitude, and distance to the NHFN) were the next most important. The ratio of improvement value to land value, whether a parcel already contained a warehouse at the start of the time period, and whether the parcel's land use is industrial were also important.<br>![title](figures/rf_featureimportance_nocensus.png)<br><br>
<b>Predictions for 2030:</b>
When we applied our model to 2020 data, it predicted that 55 additional parcels would become warehouses by 2030. Given the limitations of the model described above, which led it to underestimate warehouse growth, this is very likely to be an underestimate. 

### Random Forest with Census Data 

[Random Forest with Census Data Notebook](http://localhost:8891/lab/tree/OneDrive%20-%20UCLA%20IT%20Services/Documents/GitHub/uds-warehouse-project/RF%20with%20Census%2020%20Importances.ipynb)

In our second iteration of the random forest model, we included Census demographics at the tract level using a spatial join. This model accounted for demographic factors, including race, income, employment, and other measures present in the ACS 5-year estimates from 2009 and 2019. It was intended to use demographics as well as parcel attributes to predict warehouse growth. Once again, we trained the model using 2010 conditions (with 2009 ACS data) and applied it to 2020 conditions (with 2019 ACS data).

First, we read in just the unique IDs and geometries of the Assessor-SCAG dataset and the Census data. We used a spatial join to match parcel centroids to the tracts that contain them, then rejoined the rest of the columns to create one final dataframe with all of our variables of interest. The top 20 feature importances included many of the variables listed above, as well as demographic factors such as income, race, education, and renters vs. owners.
<br>

![title](figures/rf_featureimportance_census.png)

<br><br>

This dataframe had over 100 columns, and running the 2020 dataframe through the model created memory issues. However, from the first pass we were able to see the top 20 feature importances when census data was included. We decided to rerun the model once more, only utilizing the top 20 importances for memory purposes.
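
Narrowing a wide dataframe to its most important columns can be sketched as follows. The feature names and importance values are invented for illustration; in practice the importances would come from the fitted model (for example, a scikit-learn forest's `feature_importances_`, which is an assumption about the tooling).

```python
import pandas as pd

# Hypothetical importances for 25 features named f00..f24 (they sum to 1)
importances = pd.Series(
    [i / 300 for i in range(25)],
    index=[f"f{i:02d}" for i in range(25)],
)

# Keep the 20 features with the largest importances
top20 = importances.nlargest(20).index.tolist()

# Hypothetical wide dataframe, subset to the selected columns before predicting
wide = pd.DataFrame(0.0, index=range(3), columns=importances.index)
slim = wide[top20]
print(slim.shape)
```

The same column list has to be applied to both the 2010 training data and the 2020 prediction data so that the model sees identical features in both.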

After running the random forest model with only the top 20 importances, we found that the model underpredicted warehouse growth. From the 2010 training data, the random forest model's predicted fraction of parcels with a warehouse by 2020 was 0.0001, and the actual fraction of parcels with a warehouse was 0.0003. The model predicted that only 3 additional parcels would become warehouses by 2030. This is very likely to be an underestimate, and future research can address the limitations of the model.
<br><br>

![title](figures/rf_confusion_census.png)

<br>

## Neural Networks 

### Neural Network with Parcel Data

[Initial Neural Network Notebook](https://github.com/monishareginald/uds-warehouse-project/blob/main/Neuralnetprep.ipynb)<br>

We wanted to see how a neural network model compared to our initial random forest model. To do this, we started with our dataframe of parcel-level data, which we standardized and used to train a neural network model. As with the random forest model, we trained the neural network model using 2010 data to predict whether or not a warehouse would be built on a parcel by 2020. We then applied the model to 2020 data to predict which parcels will have warehouses built on them by 2030. When our first neural network model was applied to 2020 data, it predicted 95 new warehouses by 2030. Compared to the random forest model, the neural network model's precision was lower (84%) but its recall was much better (72%). While 95 new warehouses by 2030 could still be an underestimate, it is likely moving in the right direction.

From the 2010 training data, the neural network model's predicted fraction of parcels with a warehouse by 2020 was 0.0002, and the actual fraction of parcels with a warehouse was 0.0003.  The confusion matrix is below.
![title](figures/confusion_matrix_Neuralnet.png)<br>
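
The standardization step mentioned above can be sketched as a z-score transform. The feature values are made up, and fitting the mean and standard deviation on the 2010 data and reusing them for 2020 is an assumption about the workflow rather than something the linked notebook is confirmed to do.

```python
import numpy as np

# Hypothetical parcel features (rows = parcels; columns = acres, land value)
X_2010 = np.array([[1.0, 100_000.0], [3.0, 300_000.0], [5.0, 500_000.0]])
X_2020 = np.array([[2.0, 400_000.0], [4.0, 200_000.0]])

# Fit the scaling on the training year only
mean = X_2010.mean(axis=0)
std = X_2010.std(axis=0)

# Apply the same transformation to both years
Z_2010 = (X_2010 - mean) / std
Z_2020 = (X_2020 - mean) / std
print(Z_2010.mean(axis=0), Z_2010.std(axis=0))
```

Standardizing matters more for neural networks than for random forests, because features on very different scales (acres versus dollars) can otherwise dominate the gradient updates.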

### Neural Network with Census Data
[Neural Network Notebook with Selected Census Data](https://github.com/monishareginald/uds-warehouse-project/blob/main/CensusNeuralnet.ipynb)<br>


Like the second iteration of the Random Forest model, we ran a second version of the Neural Network model that included Census demographics as well as parcel level data. This model included the top 20 feature importances (identified in the random forest models above) out of both the land parcel data and census data. This included variables such as land value, number of acres, dollars per acre, distance to freight network, whether a parcel was zoned as industrial or vacant, percent very low or very high income, race, education, and home ownership. This model did not perform as well as the initial model that did not include any census data, and had a precision performance of 80% and a recall performance of 67%. 

![title](figures/confusion_matrix_Neuralnet_census.png)<br>

Future work could include testing a neural network model that uses all the census variables and parcel data collected, rather than focusing on the top 20.

# Discussion of Results (Alyssa)

## Visualizations to make (Alyssa)

1. 2010 & 2020 Warehouse Locations (maybe over income/race demographics, or CalEnviroscreen?)

2. Using the 'best' model: (likely the random forest  top 20 feature importance one):

        -Predicted 2030 Warehouse Locations 

        -Places that are above a 25% or 50% chance 2030 Warehouse Locations

        -layer with CalEnviroscreen index (already read this in for lecture/class, alternatively could use demographics we have in census data)




#### **Map of 2030 Predicted Warehouses Using Random Forests**

The interactive map below visualizes predicted warehouses by 2030 based on the "chance" (predicted probability) generated by the RF model, which predicts **23,682 new warehouses** by 2030 across varying chance levels.

We focus on the 10%, 25%, and 50% chance levels for our map.

#### **Legend:**
- <font color='lightblue'>**Light Blue:**</font> Current warehouses
- <font color='pink'>**Pink:**</font> Parcels With Over 10% Chance = **1,531 Warehouses**
- <font color='orange'>**Orange:**</font> Parcels With Over 25% Chance = **176 Warehouses**
- <font color='red'>**Red:**</font> Parcels With Over 50% Chance = **5 Warehouses**

**Move the map around!**
NOTE: Please run the code (hidden) below to view interactive map!

In [None]:
predictionstop20_map

#### **Code**: Map of Predicted Warehouses Using Random Forests

In [11]:
# import data
predictionstop20 = pd.read_csv('data/predictionstop20.csv')

# convert to geodataframe
predictionstop20_gdf = gpd.GeoDataFrame(predictionstop20, geometry=gpd.points_from_xy(predictionstop20['lon'], predictionstop20['lat'], crs = 'EPSG:4326'))

# match projection to buffer (to_crs returns a new GeoDataFrame, so reassign)
predictionstop20_gdf = predictionstop20_gdf.to_crs(epsg=2229)

# Parcels with above 10% chance of warehouse
predictionstop20_10 = predictionstop20_gdf[predictionstop20_gdf['pred_WH'] >= 0.1]

# Parcels with above 25% chance of warehouse
predictionstop20_25 = predictionstop20_gdf[predictionstop20_gdf['pred_WH'] >= 0.25]

# Parcels with above 50% chance of warehouse
predictionstop20_50 = predictionstop20_gdf[predictionstop20_gdf['pred_WH'] >= 0.5]

# map
m = current_warehouses.explore(# this defines the field to "choropleth"
        color='skyblue',
        legend=True,
        #cmap='RdYlGn_r', # the "_r" reverses the color
        tiles='CartoDB positron')

predictionstop20_10_map = predictionstop20_10.explore(# this defines the field to "choropleth",
        m=m,
        legend=True,
        color='pink',
        tiles='CartoDB positron')

predictionstop20_25_map = predictionstop20_25.explore(# this defines the field to "choropleth"
        m=m,
        legend=True,
        color='orange',
        tiles='CartoDB positron',
        style_kwds={
            'opacity':1})

predictionstop20_map = predictionstop20_50.explore(
        m = m,
        legend=True,
        color='red',
        tiles='CartoDB positron',
        style_kwds={
            'opacity':1})



#### **Map of 2030 Predicted Warehouses Using Neural Network**

The interactive map below visualizes predicted warehouses by 2030 based on the "chance" generated by the NN model, which predicts **95 new warehouses** by 2030 at varying chance levels.

#### **Legend:**
- <font color='lightblue'>**Light Blue:**</font> Current warehouses
- <font color='mediumslateblue'>**Dark Blue:**</font> NN Predicted Warehouses

**Move the map around!**
NOTE: Please run the code (hidden) below to view interactive map!

In [None]:
predictions_neuralnet1_gdf_50_map

#### **Code**: Map of Predicted Warehouses Using Neural Network

In [None]:
# import data
predictions_neuralnet1_gdf = pd.read_csv('data/predictions_neuralnet1_gdf.csv')

# convert to geodataframe
predictions_neuralnet1_gdf = gpd.GeoDataFrame(predictions_neuralnet1_gdf, geometry=gpd.points_from_xy(predictions_neuralnet1_gdf['lon'], predictions_neuralnet1_gdf['lat'], crs = 'EPSG:4326'))

# match projection to buffer (to_crs returns a new GeoDataFrame, so reassign)
predictions_neuralnet1_gdf = predictions_neuralnet1_gdf.to_crs(epsg=2229)

# Parcels with above 50% chance of warehouse
predictions_neuralnet1_gdf_50 = predictions_neuralnet1_gdf[predictions_neuralnet1_gdf['pred_WH'] >= 0.5]

# map
m = current_warehouses.explore(# this defines the field to "choropleth"
        color='skyblue',
        legend=True,
        #cmap='RdYlGn_r', # the "_r" reverses the color
        tiles='CartoDB positron')

predictions_neuralnet1_gdf_50_map = predictions_neuralnet1_gdf_50.explore(
        m = m,
        legend=True,
        color='mediumslateblue',
        tiles='CartoDB positron',
        style_kwds={
            'opacity':1})


#### Map of All Predicted Warehouses

When we overlay all predicted warehouses, we see that there is some overlap between the predicted warehouses from the RF and NN models at the 50% or higher chance level! 

In [None]:
all_predictions_map

#### **Code**: Map of All Predicted Warehouses

In [None]:
# map
m = current_warehouses.explore(# this defines the field to "choropleth"
        color='skyblue',
        legend=True,
        #cmap='RdYlGn_r', # the "_r" reverses the color
        tiles='CartoDB positron')

predictionstop20_10_map = predictionstop20_10.explore(# this defines the field to "choropleth",
        m=m,
        legend=True,
        color='pink',
        tiles='CartoDB positron')

predictionstop20_25_map = predictionstop20_25.explore(# this defines the field to "choropleth"
        m=m,
        legend=True,
        color='orange',
        tiles='CartoDB positron',
        style_kwds={
            'opacity':1})

predictionstop20_map = predictionstop20_50.explore(
        m = m,
        legend=True,
        color='red',
        tiles='CartoDB positron',
        style_kwds={
            'opacity':1})

all_predictions_map = predictions_neuralnet1_gdf_50.explore(
        m = m,
        legend=True,
        color='mediumslateblue',
        tiles='CartoDB positron',
        style_kwds={
            'opacity':1})
