# ATMS 523

## Module 5 Project

For this assignment, use the dataset called `radar_parameters.csv` provided in the GitHub repository in the folder `homework`.

## Dataset Description

The training data consist of polarimetric radar parameters calculated from disdrometer measurements collected over several years in Huntsville, Alabama. (A disdrometer is an instrument that measures raindrop sizes, shapes, and rainfall rate.) A model called `pytmatrix` is used to calculate the polarimetric radar parameters from the droplet observations, which provides a way to relate what a remote sensing instrument would observe to the rainfall.

## Data columns

Features (radar measurements):

`Zh` - radar reflectivity factor (dBZ) - use the formula $dBZ = 10\log_{10}(Z)$

`Zdr` - differential reflectivity

`Ldr` - linear depolarization ratio

`Kdp` - specific differential phase

`Ah` - specific attenuation

`Adp` - differential attenuation

Target:

`R` - rain rate

1. Split the data 70-30 into training and testing sets.


In [1]:
import os

# Path to the local (Jupyter) copy of the repository.
# On Colab, set repo_dir = 'ATMS-523-Module-5' instead.
repo_dir = '/Users/joeybahret/Documents/Grad_School/ATMS_523/Module 5/ATMS-523-Module-5'

# Clone the repository only if it is not already present
if not os.path.exists(repo_dir):
  !git clone https://github.com/jbb-illini/ATMS-523-Module-5.git
else:
  print("Directory 'ATMS-523-Module-5' already exists. Skipping clone.")

Directory 'ATMS-523-Module-5' already exists. Skipping clone.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

file_name = 'radar_parameters.csv'
colab_file_path = '/content/ATMS-523-Module-5/homework/'
jupyter_file = '/Users/joeybahret/Documents/Grad_School/ATMS_523/Module 5/ATMS-523-Module-5/homework/'

#data = pd.read_csv(colab_file_path+file_name, index_col=0)

data = pd.read_csv(jupyter_file+file_name, index_col=0)

X = data.drop('R (mm/hr)', axis=1)
y = data['R (mm/hr)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print('Data loaded and separated.')

Data loaded and separated.


2. Using the split created in (1), train a multiple linear regression model on the training dataset and validate it on the testing dataset. Compare the $R^2$ and root mean square error of the model on the training and testing sets to a baseline prediction of rain rate using the formula $Z = 200 R^{1.6}$.

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_train_pred_lr = model.predict(X_train)
y_test_pred_lr = model.predict(X_test)

# Evaluate Linear Regression model
r2_train_lr = r2_score(y_train, y_train_pred_lr)
rmse_train_lr = np.sqrt(mean_squared_error(y_train, y_train_pred_lr))

r2_test_lr = r2_score(y_test, y_test_pred_lr)
rmse_test_lr = np.sqrt(mean_squared_error(y_test, y_test_pred_lr))

print("Linear Regression Model Performance:")
print(f"Training R^2: {r2_train_lr:.4f}")
print(f"Training RMSE: {rmse_train_lr:.4f}")
print(f"Testing R^2: {r2_test_lr:.4f}")
print(f"Testing RMSE: {rmse_test_lr:.4f}")

Linear Regression Model Performance:
Training R^2: 0.9879
Training RMSE: 0.9229
Testing R^2: 0.9891
Testing RMSE: 0.9358
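
The baseline named in the question, $Z = 200 R^{1.6}$, can be inverted to give $R = (Z/200)^{1/1.6}$. The following is a minimal sketch of how it might be scored, assuming that the reflectivity column is stored in dBZ (so it is first converted to linear units with $Z = 10^{dBZ/10}$) and that $Z$ is in mm$^6$ m$^{-3}$ with $R$ in mm/hr. The function names and the sample values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def zr_baseline_rain_rate(zh_dbz, a=200.0, b=1.6):
    """Invert Z = a * R**b for R, given reflectivity in dBZ (assumed units)."""
    z_linear = 10.0 ** (np.asarray(zh_dbz, dtype=float) / 10.0)  # dBZ -> linear Z
    return (z_linear / a) ** (1.0 / b)

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) using the same definitions as the regression metrics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 1.0 - ss_res / ss_tot, rmse

# Hypothetical reflectivities (dBZ) and rain rates (mm/hr), for illustration only
zh_sample = np.array([20.0, 30.0, 40.0, 50.0])
r_sample = np.array([0.5, 3.0, 12.0, 50.0])

r_baseline = zr_baseline_rain_rate(zh_sample)
r2_base, rmse_base = r2_and_rmse(r_sample, r_baseline)
print("Baseline rain rates:", r_baseline)
print(f"Baseline R^2: {r2_base:.4f}, RMSE: {rmse_base:.4f}")
```

In the notebook, the same two functions could be applied to the reflectivity column of `X_train` and `X_test` together with `y_train` and `y_test`. The exact column label is an assumption here and should be checked against `data.columns`.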


3. Repeat (1), this time performing a grid search over polynomial orders 0-9 with 7-fold cross-validation. Does the best polynomial model in terms of $R^2$ outperform the baseline and the linear regression model in terms of $R^2$ and root mean square error?

In [4]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create Polynomial Features starting at the first degree
polynomial_features = PolynomialFeatures(degree=1, include_bias=False)

In [5]:
# Define the parameter grid for the grid search
param_grid = {'polynomialfeatures__degree': np.arange(0, 10)}

print("Parameter Grid:", param_grid)

Parameter Grid: {'polynomialfeatures__degree': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}


In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score, mean_squared_error, make_scorer

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

# Set up the GridSearchCV object
# Use multiple scoring metrics and set refit to 'r2'
grid_search = GridSearchCV(PolynomialRegression(),
                           param_grid,
                           cv=7,
                           scoring={'r2': make_scorer(r2_score),
                                    'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)},
                           refit='r2',
                           n_jobs=-1,
                           verbose=1)

print("Grid set-up completed for multi-metric scoring with refit='r2'.")

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

print("Grid Search completed.")

Grid set-up completed for multi-metric scoring with refit='r2'.
Fitting 7 folds for each of 10 candidates, totalling 70 fits
Grid Search completed.


In [7]:
# Get the results of the grid search
results = grid_search.cv_results_

# Create a DataFrame to display the results
# Note: the scorer returns negated mean squared error, so flipping the sign
# gives the mean cross-validated MSE (not the RMSE)
grid_search_results_df = pd.DataFrame({
    'Polynomial Degree': results['param_polynomialfeatures__degree'],
    'R^2': results['mean_test_r2'],
    'Rank': results['rank_test_r2'],
    'MSE': results['mean_test_neg_mean_squared_error'] * -1
})

# Sort by rank
grid_search_results_df = grid_search_results_df.sort_values(by='Rank')

display(grid_search_results_df)

|   | Polynomial Degree | R^2 | Rank | MSE |
|---|---|---|---|---|
| 8 | 8 | 0.999942 | 1 | 0.0045 |
| 9 | 9 | 0.9999 | 2 | 0.009506 |
| 7 | 7 | 0.9993 | 3 | 0.05648 |
| 5 | 5 | 0.998869 | 4 | 0.090624 |
| 6 | 6 | 0.997642 | 5 | 0.187519 |
| 2 | 2 | 0.996999 | 6 | 0.233414 |
| 3 | 3 | 0.991524 | 7 | 0.674303 |
| 4 | 4 | 0.988007 | 8 | 0.958182 |
| 1 | 1 | 0.985456 | 9 | 1.027472 |
| 0 | 0 | -0.000779 | 10 | 70.460662 |
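
The cross-validated scores above are computed on folds of the training data only. To compare against the linear regression, the refit best estimator can also be scored on the held-out test set. The sketch below shows one way to do that on synthetic stand-in data (a noisy quadratic, hypothetical) with a reduced degree grid so it runs quickly. In the notebook, `grid_search`, `X_test`, and `y_test` would take the place of the demo objects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical): a noisy quadratic in one feature
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = 1.5 * X_demo[:, 0] ** 2 - X_demo[:, 0] + rng.normal(0, 0.1, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Small grid for speed; the assignment searches degrees 0-9
search = GridSearchCV(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    {'polynomialfeatures__degree': [1, 2, 3]},
    cv=7,
    scoring='r2')
search.fit(Xd_train, yd_train)

# best_estimator_ is refit on the full training set by default
best_degree = search.best_params_['polynomialfeatures__degree']
yd_pred = search.best_estimator_.predict(Xd_test)
r2_test_poly = r2_score(yd_test, yd_pred)
rmse_test_poly = np.sqrt(mean_squared_error(yd_test, yd_pred))

print(f"Best degree: {best_degree}")
print(f"Test R^2: {r2_test_poly:.4f}")
print(f"Test RMSE: {rmse_test_poly:.4f}")
```

Taking the square root of the mean squared error puts the test error in the same units as the rain rate, which makes it directly comparable to the linear regression RMSE.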


4. Repeat 1 with a Random Forest Regressor, and perform a grid_search on the following parameters:
   
   ```python
   {'bootstrap': [True, False],  
   'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],  
   'max_features': ['auto', 'sqrt'],  
   'min_samples_leaf': [1, 2, 4],  
   'min_samples_split': [2, 5, 10],  
   'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
   ```

Can the best optimized Random Forest Regressor beat the baseline, the linear regression, or the best polynomial model in terms of $R^2$ and root mean square error?

In [8]:
from sklearn.ensemble import RandomForestRegressor

In [9]:
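# Note: the assignment grid lists max_features 'auto', which was removed in
# scikit-learn 1.3; this grid searches 'sqrt' and 'log2' instead.
param_grid_rf = {'bootstrap': [True, False],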
                 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
                 'max_features': ['sqrt', 'log2'],
                 'min_samples_leaf': [1, 2, 4],
                 'min_samples_split': [2, 5, 10],
                 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

print("Random Forest Parameter Grid:")
print(param_grid_rf)

Random Forest Parameter Grid:
{'bootstrap': [True, False], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None], 'max_features': ['sqrt', 'log2'], 'min_samples_leaf': [1, 2, 4], 'min_samples_split': [2, 5, 10], 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
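
A grid search fits every combination in the grid once per fold, so its cost can be worked out before anything is run. The short sketch below counts the candidates in this grid and the total number of fits for 7-fold cross-validation. It uses only the standard library, and the variable names are illustrative.

```python
import math

param_grid_rf = {'bootstrap': [True, False],
                 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
                 'max_features': ['sqrt', 'log2'],
                 'min_samples_leaf': [1, 2, 4],
                 'min_samples_split': [2, 5, 10],
                 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

n_folds = 7

# One candidate per combination of values: 2 * 11 * 2 * 3 * 3 * 10
n_candidates = math.prod(len(values) for values in param_grid_rf.values())
n_fits = n_candidates * n_folds

print(f"Candidates: {n_candidates}")  # 3960
print(f"Total fits: {n_fits}")        # 27720
```

With this many fits, running them across several cores (for example `n_jobs=-1`, as in the polynomial search) shortens the wall-clock time considerably.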


In [18]:
# Instantiate a RandomForestRegressor object
rf_model = RandomForestRegressor(random_state=42)

# Instantiate a GridSearchCV object
grid_search_rf = GridSearchCV(rf_model,
                              param_grid_rf,
                              cv=7,
                              scoring={'r2': make_scorer(r2_score),
                                       'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)},
                              refit='r2',
                              n_jobs=1,
                              verbose=2)

print("GridSearchCV setup complete for Random Forest Regressor.")

GridSearchCV setup complete for Random Forest Regressor.


In [None]:
grid_search_rf.fit(X_train, y_train)

print("Random Forest Grid Search completed.")

Fitting 7 folds for each of 3960 candidates, totalling 27720 fits
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   2.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; t