### Chicago Urban Greenspace Project Outline

#### Study Overview

Your workflow should:

1. **Define the study area:** Use the City of Chicago boundary to delimit the analysis, then gather greenspace and socioeconomic data within it.
2. **Fit a model:**
    1. **Data Download and Preprocessing:**
        - **Socioeconomic Data Collection:** Obtain URLs for the U.S. Census tract shapefiles from [the TIGER service](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html).
        - **Chicago Greenspace:** Gather data on urban greenspaces in Chicago from relevant sources, and download the City of Chicago boundary from the [City of Chicago Data Portal](https://data.cityofchicago.org/).
    2. **Merge Data:** Join the greenspace data with the socioeconomic data based on geographic identifiers.
    3. **Exploratory Data Analysis (EDA):** Explore the distribution of the greenspace and socioeconomic variables.
3. **Statistical Analysis:**
    - Use statistical methods to identify correlations between greenspace and socioeconomic variables.
4. **Linear Modeling**
5. **Interpret Results**
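
The merge step in the outline can be sketched as follows. This is a minimal illustration, assuming two tract-level tables that share a `GEOID` identifier column (the identifier carried by TIGER tract shapefiles). The table names, the other column names, and all values are made up for illustration and are not the project data.

```python
import pandas as pd

# Hypothetical tract-level tables; names and values are illustrative only.
ndvi_by_tract = pd.DataFrame({
    "GEOID": ["17031010100", "17031010201", "17031010202"],
    "mean_ndvi": [0.31, 0.24, 0.18],
})
census_by_tract = pd.DataFrame({
    "GEOID": ["17031010100", "17031010201", "17031020000"],
    "median_income": [52000, 61000, 48000],
})

# Keep tract identifiers as strings so leading zeros are never lost.
# An inner join keeps only tracts present in both tables, and
# validate="one_to_one" raises an error if a tract appears twice.
merged = ndvi_by_tract.merge(
    census_by_tract, on="GEOID", how="inner", validate="one_to_one"
)
print(merged)
```

An inner join silently drops tracts that are missing from either table, so it is worth comparing `len(merged)` against the input tables before modeling.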

#### Chicago boundary map

<embed type="text/html" src="chicago_boundary.html" width="790" height="310">

#### Census data collection

Selecting the right census variables is crucial for building a meaningful linear regression model. To analyze NDVI values, I considered variables that could plausibly affect or correlate with vegetation and green space. I decided to include the following key variables:

**Socioeconomic Variables:**
- Median Household Income: Wealthier areas might have more resources for maintaining green spaces.
- Per Capita Income

**Housing Variables:**
- Percentage of Owner-Occupied Housing Units: Homeowners might invest more in green spaces around their properties.
- Median Property Value: Higher property values might be associated with better-maintained green spaces.
    
**Demographic Variables:**
- Population and population density

<embed type="text/html" src="layout_census.html" width="1040" height="510">

#### Choropleth plots

Explore differences in both median income and mean NDVI across the city.

<embed type="text/html" src="ndvi_income_plot.html" width="1200" height="700">

The side-by-side choropleth plots show tract-level variation in median income and mean NDVI. Comparing the two maps, some tracts with higher median income also appear to have relatively higher mean NDVI. However, the plots show no striking visual evidence that higher-income tracts have higher mean NDVI overall.

### Model Description for Linear Ordinary Least Squares (OLS) Regression

#### 1. Assumptions Made About the Data
The linear ordinary least squares (OLS) regression model makes several assumptions about the data:

- **Linearity**: The relationship between the dependent variable (greenspace as measured by NDVI) and the independent variables (socioeconomic parameters) is linear.
- **Independence**: The residuals (errors) are independent. This means that the residuals of one observation are not correlated with the residuals of another.
- **Homoscedasticity**: The residuals have constant variance at every level of the independent variables. This means that the spread of the residuals should be roughly the same for all fitted values.
- **Normality**: The residuals of the model are normally distributed. This is particularly important for small sample sizes to ensure valid hypothesis testing.
- **No Multicollinearity**: The independent variables are not highly correlated with each other. Multicollinearity can inflate the variances of the parameter estimates and make the model unstable.

#### 2. Objective of the Model and Evaluation Metrics
The objective of the linear OLS regression model is to determine whether there is a statistically significant relationship between the socioeconomic parameters from the U.S. Census and greenspace, as measured by the fraction of pixels with an NDVI greater than 0.12. Specifically, the goal is to predict the greenspace (NDVI values) using the socioeconomic variables and understand the strength and direction of these relationships.
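
The response variable described above can be sketched as a small function. This is an illustration only, assuming the NDVI pixels for one tract are available as a NumPy array with `NaN` marking masked pixels. The function name `greenspace_fraction` and the sample values are hypothetical; the 0.12 threshold is the one stated above.

```python
import numpy as np

def greenspace_fraction(ndvi, threshold=0.12):
    """Fraction of valid (non-NaN) pixels whose NDVI exceeds the threshold."""
    ndvi = np.asarray(ndvi, dtype=float)
    valid = ~np.isnan(ndvi)
    if not valid.any():
        # No usable pixels in this tract, so the fraction is undefined.
        return float("nan")
    return float((ndvi[valid] > threshold).mean())

# Made-up pixel values for a single tract; one pixel is masked.
tract_pixels = np.array([[0.05, 0.20, np.nan],
                         [0.30, 0.10, 0.15]])
print(greenspace_fraction(tract_pixels))
```

Excluding masked pixels from the denominator keeps tracts with partial coverage comparable to tracts with full coverage.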

**Evaluation Metrics:**
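- **R-squared (R²)**: This metric indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. For an OLS model with an intercept evaluated on its own training data, it ranges from 0 to 1, with higher values indicating better fit. On held-out data it can be negative when the model predicts worse than the mean of the observed values.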
- **Adjusted R-squared**: Adjusted R² adjusts the R² value based on the number of predictors in the model, providing a more accurate measure of model fit for multiple regression.
- **Mean Squared Error (MSE)**: This metric measures the average of the squares of the residuals, providing a measure of the model’s prediction accuracy.
- **Root Mean Squared Error (RMSE)**: The square root of the MSE, providing a measure of the average magnitude of the errors in the same units as the dependent variable.
- **P-values of the coefficients**: These indicate the statistical significance of each independent variable in predicting the dependent variable. A low p-value (typically < 0.05) suggests that the predictor is significantly related to the response variable.
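
The fit metrics listed above can be computed directly from observed and predicted values. The sketch below is a minimal illustration with made-up numbers; the function name `regression_metrics` is hypothetical, and the MSE here divides by the number of observations.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return R², adjusted R², MSE, and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = float(np.sum((y_true - y_pred) ** 2))         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes R² for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    mse = ss_res / n
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": mse ** 0.5}

# Made-up observed and predicted values for five observations.
metrics = regression_metrics(
    y_true=[1.0, 2.0, 3.0, 4.0, 5.0],
    y_pred=[1.1, 1.9, 3.2, 3.8, 5.0],
    n_predictors=1,
)
print(metrics)
```

Because RMSE is in the units of the dependent variable, it should be read on the log scale if the model is fit to a log-transformed greenspace fraction.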

#### 3. Advantages and Potential Problems with Choosing this Model
**Advantages:**
- **Simplicity and Interpretability**: OLS regression is straightforward to implement and the results are easy to interpret. Each coefficient represents the expected change in the dependent variable for a one-unit change in the predictor, holding all other predictors constant.
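- **Efficiency**: When the Gauss-Markov conditions hold (errors with zero mean, constant variance, and no correlation, plus no perfect multicollinearity), the OLS estimator is the Best Linear Unbiased Estimator (BLUE), meaning it has the lowest variance among all linear unbiased estimators. Normality of the residuals is not required for this result.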
- **Diagnostic Tools**: There are many diagnostic tools and tests available for OLS regression to check the assumptions and identify potential problems with the model (e.g., residual plots, variance inflation factor (VIF) for multicollinearity).

**Potential Problems:**
- **Violation of Assumptions**: If the assumptions of linearity, independence, homoscedasticity, normality, and no multicollinearity are violated, the OLS estimates may be biased or inefficient, leading to invalid inference.
- **Outliers and Influential Points**: OLS regression is sensitive to outliers and influential points, which can disproportionately affect the model estimates.
- **Multicollinearity**: High correlation among independent variables can lead to unstable estimates of regression coefficients, making it difficult to assess the effect of each predictor.
- **Model Specification**: If important variables are omitted from the model or irrelevant variables are included, the model can be misspecified, leading to biased estimates.

Overall, while OLS regression is a powerful tool for understanding the relationship between variables, it is important to carefully check the assumptions and consider potential limitations when interpreting the results. My current independent variables have the potential to violate the **no multicollinearity** assumption, since median household income, per capita income, and median property value are likely to be correlated with one another.
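
One way to check the multicollinearity concern is the variance inflation factor (VIF) mentioned earlier. The sketch below is an illustration under stated assumptions: the data are synthetic, the variable names only mirror the census variables listed above, and the helper `variance_inflation_factors` is hypothetical and written with plain NumPy rather than taken from a library.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic predictors: per capita income is built to track median income,
# while the owner-occupied share is independent of both.
rng = np.random.default_rng(0)
median_income = rng.normal(60, 15, size=200)
per_capita_income = 0.5 * median_income + rng.normal(0, 2, size=200)
owner_share = rng.uniform(0, 1, size=200)

X = np.column_stack([median_income, per_capita_income, owner_share])
vifs = variance_inflation_factors(X)
print(dict(zip(["median_income", "per_capita_income", "owner_share"], vifs)))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. In this synthetic example the two income variables land well above that range, while the independent variable stays near 1.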

#### Exploratory plots to check distributions

<img src="assumptions_plot.png" width="500" height="500">

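The log-transformed greenspace fraction appears approximately normally distributed, but the log-transformed median income appears left-skewed, which has the potential to violate the normality assumption. Strictly, that assumption concerns the model residuals rather than the variables themselves, so the skew is a warning sign rather than a violation on its own.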
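
The visual check can be complemented with a numeric skewness estimate. This is a minimal sketch on synthetic data, not the project data: `sample_skewness` is a hypothetical helper that computes the moment-based skewness (third central moment divided by the second central moment raised to the power 1.5). Negative values indicate left skew and positive values indicate right skew.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return float(m3 / m2 ** 1.5)

# Synthetic right-skewed values standing in for an income-like variable.
rng = np.random.default_rng(1)
raw_values = rng.lognormal(mean=11.0, sigma=0.5, size=5000)

skew_raw = sample_skewness(raw_values)
skew_log = sample_skewness(np.log(raw_values))
print(f"raw skewness: {skew_raw:.2f}, log skewness: {skew_log:.2f}")
```

In this synthetic case the log transform removes most of the skew. If a log-transformed variable still shows clear skew, as described for median income above, inspecting the residuals of the fitted model is the more direct test of the normality assumption.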

### Fit and Predict

Use a statistical model to fit and predict NDVI values based on the median income in Chicago neighborhoods. I used the `scikit-learn` library, which takes a slightly different approach than many statistical software packages. For example, `scikit-learn` emphasizes generic model performance measures such as cross-validation and importance over coefficient p-values and correlation. This approach is meant to generalize more smoothly to machine learning (ML) models, where statistical significance is harder to derive mathematically.
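
The fit-and-predict step might look like the sketch below. It is an illustration under stated assumptions, not the exact code behind the results: the data are synthetic, the variable names are hypothetical, and the split fraction and number of cross-validation folds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-ins for the log-transformed tract-level variables.
rng = np.random.default_rng(42)
n_tracts = 300
log_median_income = rng.normal(11.0, 0.5, size=n_tracts)
log_green_fraction = (
    -3.0 + 0.2 * log_median_income + rng.normal(0, 0.3, size=n_tracts)
)

X = log_median_income.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = log_green_fraction

# Hold out part of the data so the model is scored on tracts it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_r2 = model.score(X_test, y_test)  # R² on the held-out tracts
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(f"slope: {model.coef_[0]:.3f}, intercept: {model.intercept_:.3f}")
print(f"held-out R²: {test_r2:.3f}, mean 5-fold CV R²: {cv_r2.mean():.3f}")
```

Scoring on held-out tracts and across folds reflects the emphasis on generic performance measures: the same calls work unchanged if `LinearRegression` is swapped for another estimator.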

#### <u> Results </u>

<embed type="text/html" src="error_median_income_NDVI.html" width="1337" height="700">

The error plot does not show a clear overall tendency toward over- or underprediction. The tracts where the model overpredicts tend to be the higher-median-income downtown tracts in the central area of the city. This makes sense, since space for greenery at those locations is limited. Underprediction, by contrast, seems to occur in the more suburban tracts, which have a higher fraction of greenness but tend to have lower median income.