# Part 2 of the precipitation forecast project

Please refer to the first part of the project (notebook 1_data_visualization.ipynb) for more details about the data used in this study. Our aim is to develop a simple precipitation forecast model based on prevailing air temperature, relative humidity, and sea level pressure. 

## In this part, we will:
* Process the data and standardize the input parameters
* Split the data into training and testing sets
* Fit a regression model on the training set and validate it on the testing set
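
The three steps listed above could be chained as in the following minimal sketch. It runs on synthetic data, not the project's NetCDF files; the names `X_demo`, `y_demo`, and `pipe` are hypothetical, and the use of `StandardScaler` inside a pipeline is an assumption about how the standardization step might be done.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the three predictors: t2m (C), msl (hPa), r (%)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(-10, 0, 200),
    rng.uniform(995, 1010, 200),
    rng.uniform(40, 100, 200),
])
# Made-up target, only so that there is something to fit
y_demo = 0.05 * X_demo[:, 2] - 0.1 * X_demo[:, 0] + rng.normal(0, 0.5, 200)

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fit on the training split only and then reused on the test split
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# After scaling, each training column has mean 0 and standard deviation 1
X_tr_scaled = pipe.named_steps['standardscaler'].transform(X_tr)
print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
print(f'R2 on the held-out split: {pipe.score(X_te, y_te)}')
```

For ordinary least squares, standardizing the inputs does not change the predictions or the R-squared; it mainly makes the fitted coefficients comparable across predictors with very different units.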

## We build two separate models for predicting precipitation in cold and warm weather
To build a model with useful predictive skill, we draw on the data visualization results from part 1. Those results showed that the data are quite noisy, so we focus on subsets of the data instead of fitting everything at once. We build two separate models, one for cold conditions and one for warm conditions. For the first model, we mask out all data points with air temperature greater than 0 C. For the second model, we mask out all data points with air temperature less than 23 C. In other words, the data between 0 and 23 C are noisy and contribute little to the predictive skill of the model, so neither model uses them.
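
As a small illustration of the two temperature masks described above, the sketch below selects the cold and warm subsets from a toy table. The frame `toy` and its values are hypothetical; only the 0 C and 23 C thresholds come from the text.

```python
import pandas as pd

# Hypothetical toy frame; t2m is in degrees Celsius, tp in mm
toy = pd.DataFrame({
    't2m': [-5.0, -0.5, 3.0, 12.0, 22.9, 23.0, 30.0],
    'tp':  [1.2, 0.4, 0.0, 0.1, 0.3, 2.0, 0.8],
})

# Cold model: drop everything warmer than 0 C
cold = toy[toy['t2m'] <= 0]

# Warm model: drop everything colder than 23 C
warm = toy[toy['t2m'] >= 23]

# Rows strictly between 0 C and 23 C are used by neither model
unused = toy[(toy['t2m'] > 0) & (toy['t2m'] < 23)]

print(len(cold), len(warm), len(unused))
```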


## Part A: Model for cold weather

In [1]:
# Open the first NetCDF file containing t2m (2m air temperature), msl (mean sea level pressure), and tp (total precipitation) and examine its contents
import xarray as xr # xarray handles labelled n-dimensional arrays in Python
import pandas as pd # pandas handles tabular data

ds1 = xr.open_dataset('adaptor.mars.internal-1694206363.8933547-24585-11-8802618f-5def-422e-a4f5-039fe7b81380.nc')
ds1

# Open the second NetCDF file containing relative humidity (r) at the lowest model level (1000 hPa) and examine the content
ds2 = xr.open_dataset('adaptor.mars.internal-1694206982.4222376-12566-4-f578be68-688e-4dcb-88b8-8bc8fe08522c.nc')
ds2

In [2]:
# Convert the variable of interest to a Pandas DataFrame
df1 = ds1[['msl', 't2m', 'tp']].to_dataframe()

# Now you can work with the DataFrame 'df1'
df1

# Convert the variable of interest to a Pandas DataFrame
df2 = ds2[['r']].to_dataframe()

df2

time,latitude,longitude,r
2022-01-01 00:00:00,37.299999,-121.75,60.812889
2022-01-01 00:00:00,37.299999,-121.50,63.179852
2022-01-01 00:00:00,37.299999,-121.25,62.960621
2022-01-01 00:00:00,37.299999,-121.00,62.878872
2022-01-01 00:00:00,37.299999,-120.75,63.313622
...,...,...,...
2022-12-31 23:00:00,32.549999,-115.75,47.298523
2022-12-31 23:00:00,32.549999,-115.50,63.778095
2022-12-31 23:00:00,32.549999,-115.25,62.670788
2022-12-31 23:00:00,32.549999,-115.00,56.030655


In [3]:
# now combine the two dataframes together columnwise

combined_df = pd.concat([df1, df2], axis=1) # axis = 1 means along column

combined_df


# convert tp from meters to mm
combined_df['tp'] = combined_df['tp'] * 1000

import numpy as np

# mark exact zeros as NaN so that those rows are dropped below
combined_df['tp'] = combined_df['tp'].replace(0, np.nan)

# set small nonzero values (< 0.05 mm) to zero; they are noisy
# (NaN < 0.05 is False, so the NaNs created above stay NaN)
combined_df['tp'] = np.where(combined_df['tp'] < 0.05, 0, combined_df['tp'])
# remove all rows where tp is NaN; the regression model cannot handle NaNs
combined_df.dropna(subset=['tp'], inplace=True)


# convert t2m from K to C
combined_df['t2m'] = combined_df['t2m'] - 273.15
# cold-weather model: mask out points warmer than 0 C, then drop them
combined_df['t2m'] = np.where(combined_df['t2m'] > 0, np.nan, combined_df['t2m'])
combined_df.dropna(subset=['t2m'], inplace=True)

# convert msl from Pa to hPa
combined_df['msl'] = combined_df['msl'] / 100
# mask out points with sea level pressure above 1010 hPa, then drop them
combined_df['msl'] = np.where(combined_df['msl'] > 1010, np.nan, combined_df['msl'])
combined_df.dropna(subset=['msl'], inplace=True)

#combined_df['r'] = np.where(combined_df['r'] > 80, np.nan, combined_df['r'])
#combined_df.dropna(subset=['r'], inplace=True)

combined_df

combined_df.index
# time, latitude, and longitude are stored in the index, not as columns

# so we turn them into regular columns using reset_index

combined_df.reset_index(inplace=True)
combined_df

,time,latitude,longitude,msl,t2m,tp,r
0,2022-01-01 00:00:00,37.299999,-119.25,1009.831482,-1.607758,0.000000,89.745964
1,2022-01-01 00:00:00,37.299999,-119.00,1009.593689,-7.472168,0.000000,89.718094
2,2022-01-01 00:00:00,37.299999,-118.75,1006.984192,-9.078278,0.000000,77.552567
3,2022-01-01 00:00:00,37.299999,-118.50,1005.700317,-5.433868,0.000000,58.330734
4,2022-01-01 00:00:00,37.299999,-118.25,1005.957336,-1.733551,0.000000,54.522038
...,...,...,...,...,...,...,...
1773,2022-12-31 23:00:00,37.299999,-118.75,1004.403748,-1.422852,4.615088,97.967171
1774,2022-12-31 23:00:00,37.049999,-118.75,1006.025024,-0.957367,4.916537,98.873825
1775,2022-12-31 23:00:00,36.799999,-118.75,1007.875854,-0.133881,5.003395,99.895676
1776,2022-12-31 23:00:00,36.799999,-118.50,1006.132751,-0.075836,3.987195,96.044243


In [7]:
X = combined_df[['t2m', 'msl', 'r']]

y = combined_df['tp']
print(X)
print(y)

           t2m          msl           r
0    -1.607758  1009.831482   89.745964
1    -7.472168  1009.593689   89.718094
2    -9.078278  1006.984192   77.552567
3    -5.433868  1005.700317   58.330734
4    -1.733551  1005.957336   54.522038
...        ...          ...         ...
1773 -1.422852  1004.403748   97.967171
1774 -0.957367  1006.025024   98.873825
1775 -0.133881  1007.875854   99.895676
1776 -0.075836  1006.132751   96.044243
1777 -0.524139  1007.938965  103.964470

[1778 rows x 3 columns]
0       0.000000
1       0.000000
2       0.000000
3       0.000000
4       0.000000
          ...   
1773    4.615088
1774    4.916537
1775    5.003395
1776    3.987195
1777    3.725101
Name: tp, Length: 1778, dtype: float32


In [8]:
#importing sklearn models

import sklearn
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Training set: used to fit the regression model.
# Testing set: used to estimate the error of the fitted model on unseen data.
# random_state=42 fixes the split; without it, a different test set is drawn every time the code runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'Shape of X_train={X_train.shape}')
print(f'Shape of X_test={X_test.shape}')
print(f'Shape of y_train={y_train.shape}')
print(f'Shape of y_test={y_test.shape}')

#using linear regression
# Make a linear regression instance
lr=LinearRegression()
# Training the model on the data, storing the information learned from the data
# Model is learning the relationship between X and y 
lr.fit(X_train, y_train)

# Predict using the model on the training and testing data
y_pred_train_lr = lr.predict(X_train)
y_pred_test_lr = lr.predict(X_test)

#Printing the R2 score of test and train set
print(f'R2 Score of training set {lr.score(X_train, y_train)}')
print(f'R2 Score of testing  set  {lr.score(X_test, y_test)}')

# Calculate the Mean Squared Error (MSE) and print its square root (RMSE)
mse_train_lr = mean_squared_error(y_train, y_pred_train_lr)
print(f"Root Mean Squared Error on Training Data: {np.sqrt(mse_train_lr)}")

mse_test_lr = mean_squared_error(y_test, y_pred_test_lr)
print(f"Root Mean Squared Error on Testing Data: {np.sqrt(mse_test_lr)}")

Shape of X_train=(1422, 3)
Shape of X_test=(356, 3)
Shape of y_train=(1422,)
Shape of y_test=(356,)
R2 Score of training set 0.44556873791385965
R2 Score of testing  set  0.5122466949688538
Root Mean Squared Error on Training Data: 1.090506672859192
Root Mean Squared Error on Testing Data: 1.0776022672653198


### We have built a decent multiple linear regression model to predict precipitation in cold weather conditions. The R-squared on the testing data is 0.51, meaning the model explains about half of the variance in the held-out precipitation values.
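
To make the two reported metrics concrete, the sketch below computes R-squared and RMSE by hand and checks them against scikit-learn. The arrays `y_true` and `y_hat` are made-up values for illustration, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up observed and predicted precipitation values (mm)
y_true = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
y_hat = np.array([0.2, 0.4, 1.3, 1.8, 3.5])

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared residual, in the units of y (mm)
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Both should agree with the scikit-learn helpers
print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```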

In [9]:

# multiple regression with polynomial and interaction terms

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


# Create polynomial and interaction terms using PolynomialFeatures
interaction_degree = 3  # maximum degree of the generated terms; adjust as needed
poly = PolynomialFeatures(degree=interaction_degree, interaction_only=False, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit a linear regression on the expanded features (nonlinear in the original inputs) using training data
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predict using the model on the training and testing data
y_pred_train_model = model.predict(X_train_poly)
y_pred_test_model = model.predict(X_test_poly)

#Printing the R2 score of test and train set
print(f'R2 Score of training set {model.score(X_train_poly, y_train)}')
print(f'R2 Score of testing  set  {model.score(X_test_poly, y_test)}')

# Calculate the Mean Squared Error (MSE) and print its square root (RMSE)
mse_train_model = mean_squared_error(y_train, y_pred_train_model)
print(f"Root Mean Squared Error on Training Data: {np.sqrt(mse_train_model)}")
mse_test_model = mean_squared_error(y_test, y_pred_test_model)
print(f"Root Mean Squared Error on Testing Data: {np.sqrt(mse_test_model)}")

# Get the fitted coefficients and intercept, plus the names of the polynomial/interaction terms they belong to
slopes = model.coef_
intercept = model.intercept_
print(slopes)
print(intercept)
poly.get_feature_names_out()

R2 Score of training set 0.584262699944275
R2 Score of testing  set  0.6458543109035737
Root Mean Squared Error on Training Data: 0.9443086385726929
Root Mean Squared Error on Testing Data: 0.9182255268096924
[-6.0690174e-05 -1.6638312e-05  4.3246969e-06  2.1776270e-04
 -3.0764690e-02 -2.5960368e-03  2.2118965e-03  1.1218171e-02
  1.7424292e-03 -3.1395943e-03  1.4606305e-05 -8.3295902e-04
  3.1329899e-05 -2.0546317e-05  1.2198753e-04 -1.2418324e-06
 -1.0936760e-05 -5.4465263e-06  2.2973769e-05]
-978.6666


array(['t2m', 'msl', 'r', 't2m^2', 't2m msl', 't2m r', 'msl^2', 'msl r',
       'r^2', 't2m^3', 't2m^2 msl', 't2m^2 r', 't2m msl^2', 't2m msl r',
       't2m r^2', 'msl^3', 'msl^2 r', 'msl r^2', 'r^3'], dtype=object)

### We have also developed an alternative multiple regression model with added polynomial and interaction terms, which makes it nonlinear in the inputs. Including these terms raises the R-squared on the testing data from 0.51 to 0.65 and lowers the testing RMSE from 1.08 to 0.92 mm. Note that you can also keep only the interaction terms (not the power terms) by setting interaction_only=True.