In [None]:
pip install kaggle --upgrade

# Problem Statement

Extreme weather events are sweeping the globe and range from heat waves, wildfires and drought to hurricanes, extreme rainfall and flooding. These weather events have multiple impacts on agriculture, energy, transportation, as well as low resource communities and disaster planning in countries across the globe.

Accurate long-term forecasts of temperature and precipitation are crucial to help people prepare and adapt to these extreme weather events. Currently, purely physics-based models dominate short-term weather forecasting. But these models have a limited forecast horizon. The availability of meteorological data offers an opportunity for data scientists to improve sub-seasonal forecasts by blending physics-based forecasts with machine learning. Sub-seasonal forecasts for weather and climate conditions (lead-times ranging from 15 to more than 45 days) would help communities and industries adapt to the challenges brought on by climate change.

# Data Description

The WiDS Datathon 2023 focuses on a prediction task involving forecasting sub-seasonal temperatures (temperatures over a two-week period, in our case) within the United States. We are using a pre-prepared dataset consisting of weather and climate information for a number of US locations, for a number of start dates for the two-week observation, as well as the forecasted temperature and precipitation from a number of weather forecast models (we will reveal the source of our dataset after the competition closes). Each row in the data corresponds to a single location and a single start date for the two-week period. Your task is to predict the arithmetic mean of the maximum and minimum temperature over the next 14 days, for each location and start date.

You are provided with two datasets:

train_data.csv: the training dataset, where contest-tmp2m-14d__tmp2m, the arithmetic mean of the max and min observed temperature over the next 14 days for each location and start date, is provided

test_data.csv: the test dataset, where we withhold the true value of contest-tmp2m-14d__tmp2m for each row.

To participate in the Datathon, you will submit a solution file containing the predicted values of contest-tmp2m-14d__tmp2m for each row in the test dataset. The predicted values you submit will be compared against the observed values for the test dataset and this will determine your standing on the Leaderboard during the competition as well as your final standing when the competition closes.

You are also provided with an example of a solution file prepared for submission.

# Data Dictionary

The WiDS 2023 Datathon is using a subset of a pre-prepared dataset in which the variables were gathered from the following datasets (source of the WiDS Datathon dataset will be revealed after the competition closes):

Temperature: Daily maximum and minimum temperature measurements at 2 meters from 1979 onwards were obtained from NOAA’s Climate Prediction Center (CPC) Global Gridded Temperature dataset and converted to Celsius. The official contest target temperature variable is tmp2m = (tmax + tmin) / 2.

ftp://ftp.cpc.ncep.noaa.gov/precip/PEOPLE/wd52ws/global_temp/


Global precipitation: Daily precipitation data from 1979 onward were obtained from NOAA’s CPC Gauge-Based Analysis of Global Daily Precipitation [42] and converted to mm.

ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/RT/


U.S. precipitation: Daily U.S. precipitation data in mm were collected from the CPC Unified Gauge-Based Analysis of Daily Precipitation over CONUS. Measurements were replaced with sums over the ensuing two-week period.

https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/cpc_us_precip/catalog.html


Sea surface temperature and sea ice concentration: NOAA’s Optimum Interpolation Sea Surface Temperature (SST) dataset provides SST and sea ice concentration data, daily from 1981 to the present.

ftp://ftp.cdc.noaa.gov/Projects/Datasets/noaa.oisst.v2.highres/


Multivariate ENSO index (MEI): Bimonthly MEI values from 1949 to the present were obtained from NOAA/Earth System Research Laboratory. The MEI is a scalar summary of six variables (sea-level pressure, zonal and meridional surface wind components, SST, surface air temperature, and sky cloudiness) associated with El Niño/Southern Oscillation (ENSO), an ocean-atmosphere coupled climate mode.

https://www.esrl.noaa.gov/psd/enso/mei/


Madden-Julian oscillation (MJO): Daily MJO values since 1974 are provided by the Australian Government Bureau of Meteorology. MJO is a metric of tropical convection on daily to weekly timescales and can have a significant impact on the United States sub-seasonal climate. Measurements of phase and amplitude on the target date were extracted over the two-week period.

http://www.bom.gov.au/climate/mjo/graphics/rmm.74toRealtime.txt


Relative humidity, sea level pressure, and precipitable water for the entire atmosphere: NOAA’s National Center for Environmental Prediction (NCEP)/National Center for Atmospheric Research Reanalysis dataset contains daily relative humidity (rhum) near the surface (sigma level 0.995) from 1948 to the present and daily pressure at the surface (pres) from 1979 to the present.

ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis/surface/


Geopotential height, zonal wind, and longitudinal wind: To capture polar vortex variability, daily mean geopotential height at 10mb was obtained from the NCEP Reanalysis dataset.

ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/pressure/


North American Multi-Model Ensemble (NMME): The North American Multi-Model Ensemble (NMME) is a collection of physics-based forecast models from various modeling centers in North America. Forecasts issued monthly from the Cansips, CanCM3, CanCM4, CCSM3, CCSM4, GFDL-CM2.1-aer04, GFDL-CM2.5, FLOR-A06 and FLOR-B01, NASA-GMAO-062012, and NCEP-CFSv2 models were downloaded from the IRI/LDEO Climate Data Library. Each forecast contains monthly mean predictions from 0.5 to 8.5 months ahead.

https://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/


Pressure and potential evaporation: ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis/surface_gauss/


Elevation: http://research.jisao.washington.edu/data_sets/elevation/elev.1-deg.nc


Köppen-Geiger climate classifications: http://koeppen-geiger.vu-wien.ac.at/present.htm

Variable naming
Each variable name, prefix__suffix, consists of two parts (separated by a double underscore) that inform you of the meaning of the variable. The prefix indicates from which of the above-listed files the variable was derived (e.g. Madden-Julian oscillation, or pressure and potential evaporation from NOAA's surface_gauss), and the suffix indicates the specific type of information that was extracted from the file.
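
The naming convention can be used directly in code to group columns by source file. The following is a minimal sketch, not part of the competition material: the helper name `split_variable_name` is hypothetical, and the example column names are ones that appear elsewhere in this notebook.

```python
from collections import defaultdict


def split_variable_name(name):
    """Split 'prefix__suffix' into (prefix, suffix); prefix is None if absent."""
    prefix, sep, suffix = name.partition('__')
    if not sep:
        # No double underscore: the variable has no prefix (e.g. 'lat')
        return None, name
    return prefix, suffix


# Example column names taken from this notebook
columns = [
    'lat',
    'nmme0-tmp2m-34w__ccsm30',
    'nmme-prate-34w__ccsm3',
    'climateregions__climateregion',
    'contest-tmp2m-14d__tmp2m',
]

# Group the suffixes under the prefix (source file) they belong to
groups = defaultdict(list)
for col in columns:
    prefix, suffix = split_variable_name(col)
    groups[prefix].append(suffix)

print(dict(groups))
```

Columns without a prefix end up under the key `None`, which keeps them separate from the file-derived variables.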

Variable prefixes
contest-slp-14d: file containing sea level pressure (slp)

nmme0-tmp2m-34w: file containing most recent monthly NMME model forecasts for tmp2m (cancm30,
cancm40, ccsm30, ccsm40, cfsv20, gfdlflora0, gfdlflorb0, gfdl0, nasa0)
and the average forecast across those models (nmme0mean)

contest-pres-sfc-gauss-14d: pressure

mjo1d: MJO phase and amplitude

contest-pevpr-sfc-gauss-14d: potential evaporation

contest-wind-h850-14d: geopotential height at 850 millibars

contest-wind-h500-14d: geopotential height at 500 millibars

contest-wind-h100-14d: geopotential height at 100 millibars

contest-wind-h10-14d: geopotential height at 10 millibars

contest-wind-vwnd-925-14d: longitudinal wind at 925 millibars

contest-wind-vwnd-250-14d: longitudinal wind at 250 millibars
contest-wind-uwnd-250-14d: zonal wind at 250 millibars

contest-wind-uwnd-925-14d: zonal wind at 925 millibars

contest-rhum-sig995-14d: relative humidity

contest-prwtr-eatm-14d: precipitable water for entire atmosphere
nmme-prate-34w: weeks 3-4 weighted average of monthly NMME model forecasts for precipitation

nmme-prate-56w: weeks 5-6 weighted average of monthly NMME model forecasts for precipitation
nmme0-prate-56w: weeks 5-6 weighted average of most recent monthly NMME model forecasts for precipitation

nmme0-prate-34w: weeks 3-4 weighted average of most recent monthly NMME model forecasts for precipitation

nmme-tmp2m-34w: weeks 3-4 weighted average of most recent monthly NMME model forecasts for target label, contest-tmp2m-14d__tmp2m

nmme-tmp2m-56w: weeks 5-6 weighted average of monthly NMME model forecasts for target label, contest-tmp2m-14d__tmp2m

mei: MEI (mei), MEI rank (rank), and Niño Index Phase (nip)

elevation: elevation

contest-precip-14d: measured precipitation

climateregions: Köppen-Geiger climate classifications

Variables without prefix
Some variables do not have a prefix. Instead, each variable name in its entirety indicates the information the variable captures.

lat: latitude of location (anonymized)
lon: longitude of location (anonymized)
startdate: start date of the 14-day period
sst: sea surface temperature
icec: sea ice concentration
cancm30, cancm40, ccsm30, ccsm40, cfsv20, gfdlflora0, gfdlflorb0, gfdl0, nasa0, nmme0mean: most recent forecasts from weather models
Target
contest-tmp2m-14d__tmp2m: the arithmetic mean of the max and min observed temperature over the next 14 days for each location and start date, computed as (measured max temperature + measured min temperature) / 2
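
To make the target definition concrete, here is a small sketch on synthetic data. It assumes that the 14-day value is the average of the daily midpoints (tmax + tmin) / 2 over the 14 days beginning at each start date; the exact aggregation used by the organizers is not stated here, and the dates and temperatures are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures for one location (illustrative values only)
dates = pd.date_range('2014-09-01', periods=20, freq='D')
tmax = pd.Series(np.linspace(30.0, 20.5, 20), index=dates)
tmin = pd.Series(np.linspace(18.0, 8.5, 20), index=dates)

# Daily midpoint temperature: (tmax + tmin) / 2
tmp2m_daily = (tmax + tmin) / 2

# Assumed aggregation: mean of the daily midpoints over the 14 days that
# begin at each start date. Reversing the series turns pandas' backward-
# looking rolling window into a forward-looking one.
window = 14
tmp2m_14d = tmp2m_daily[::-1].rolling(window).mean()[::-1]

print(tmp2m_14d.head())
```

Start dates with fewer than 14 days of data after them come out as NaN, which is why the last 13 entries of `tmp2m_14d` are missing.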

# Evaluation Metric

The evaluation metric for this competition is Root Mean Squared Error (RMSE). The RMSE is a commonly used measure of the differences between predicted values provided by a model and the actual observed values.

RMSE is computed as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(y^{(n)} - \hat{y}^{(n)}\right)^{2}},$$

where $y^{(n)}$ is the n-th observed value, $\hat{y}^{(n)}$ is the n-th predicted value given by the model, and $N$ is the number of observations.
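
As a quick illustration of the metric, the sketch below computes RMSE with NumPy. The function name `rmse` and the sample values are made up for this example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Made-up observed and predicted temperatures
print(rmse([10.0, 12.0, 14.0], [11.0, 12.0, 12.0]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.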

# Model Development

### Environment Setup

In [None]:
# Importing Libraries

# Data Exploration
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Feature Engineering and model Building
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBRegressor

# Model Evaluation 
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, precision_recall_fscore_support

# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

# !pip install ipyleaflet
import ipyleaflet
from ipyleaflet import Map

import datetime

### Data Exploration

In [None]:
# Load Data 

# Train Data
df_train = pd.read_csv('/content/drive/MyDrive/Data Project Datasets/Sub_seasonal forecasting/train_data.csv')
# Test Data
df_test = pd.read_csv('/content/drive/MyDrive/Data Project Datasets/Sub_seasonal forecasting/test_data.csv')
# Submission Format
df_sub = pd.read_csv('/content/drive/MyDrive/Data Project Datasets/Sub_seasonal forecasting/sample_solution.csv')

In [None]:
df_train.T

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
# Viewing Columns
df_train.columns

In [None]:
df_train.info()

In [None]:
# View statistical information of the data
df_train.describe()

In [None]:
df_train.head()

In [None]:
df_train['startdate'].dtypes

In [None]:
# Change datatype of 'startdate' column to (date)
df_train['startdate'] = pd.to_datetime(df_train['startdate'])

In [None]:
df_train['startdate'].dtypes

In [None]:
# Sort Data by 'startdate' column
df_train = df_train.sort_values(by = 'startdate', ascending = True)
df_train

In [None]:
# Check for Null values
df_train.isnull().sum() 

In [None]:
def per_filter_na_cols(df):
    count_na_df = df.isna().sum() 
    if count_na_df[count_na_df > 0].tolist():
        return (count_na_df[count_na_df > 0] / len(df)) * 100 
    else:
        return 'Clean dataset'

In [None]:
per_filter_na_cols(df_train)

In [None]:
# Inspect the columns that contain null values
df_train[["nmme0-tmp2m-34w__ccsm30",'nmme-tmp2m-56w__ccsm3',"nmme-prate-34w__ccsm3","nmme0-prate-56w__ccsm30","nmme0-prate-34w__ccsm30","nmme-prate-56w__ccsm3","nmme-tmp2m-34w__ccsm3","ccsm30"]]

In [None]:
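# Fill null values in a single column with the column mean
# (assigning back avoids pandas' chained-assignment pitfalls with inplace=True)
df_train['nmme0-tmp2m-34w__ccsm30'] = df_train['nmme0-tmp2m-34w__ccsm30'].fillna(df_train['nmme0-tmp2m-34w__ccsm30'].mean())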

In [None]:
# Fill null values in every affected column with that column's mean
missing = ["nmme0-tmp2m-34w__ccsm30",'nmme-tmp2m-56w__ccsm3',"nmme-prate-34w__ccsm3","nmme0-prate-56w__ccsm30","nmme0-prate-34w__ccsm30","nmme-prate-56w__ccsm3","nmme-tmp2m-34w__ccsm3","ccsm30"]
for col in missing:
    df_train[col] = df_train[col].fillna(df_train[col].mean())
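
The loop above fills each column with its training-set mean. If the same fill values should later be applied to the test set, one option is to store the training means once and reuse them. The sketch below shows the idea on two tiny made-up frames; the names `train`, `test` and `train_means` are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for the training and test data
train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
test = pd.DataFrame({'a': [np.nan, 2.0], 'b': [np.nan, np.nan]})

# Compute the fill values from the training data only
train_means = train.mean()

# fillna with a Series matches fill values to columns by name
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(test_filled)
```

Filling the test set with training means keeps both sets on the same footing and still works when a test column is entirely missing.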

In [None]:
per_filter_na_cols(df_train)

### Exploratory Data Analytics

In [None]:
Target_value = df_train['contest-tmp2m-14d__tmp2m']

In [None]:
fig, ax = plt.subplots(figsize = (10,10))
ax.plot(df_train['startdate'],df_train['contest-tmp2m-14d__tmp2m'] )


In [None]:
Map(center = [0.833333333333333,0.0454545454545454], zoom= 10)

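Water everywhere: lat and lon are anonymized, so these coordinates do not correspond to a real US location and the map shows only open water.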


### Feature Engineering

In [None]:
df_train['startdate']

In [None]:
df_train['year'] = df_train.startdate.dt.year
df_train['month'] = df_train.startdate.dt.month
df_train['day'] = df_train.startdate.dt.day

In [None]:
df_train.head()

In [None]:
# Correlations are only defined for numeric columns, so exclude the rest
sns.heatmap(df_train.select_dtypes('number').corr())

In [None]:
# Drop Values
df_train.drop(['index','startdate'], axis = 1, inplace = True)

In [None]:
df_train.head()

In [None]:
# # Scale Values
# sc = MinMaxScaler()
# scaled_df = sc.fit_transform(df_train)

There is a String column in the dataset and we need to handle it.


In [None]:
# Find string datatype
for label, content in df_train.items():
    if pd.api.types.is_string_dtype(content):
      print(label)

In [None]:
# Change string datatype to category
df_train['climateregions__climateregion'] = df_train['climateregions__climateregion'].astype('category')


In [None]:
# Encode category datatype
df_train['climateregions__climateregion'] = df_train['climateregions__climateregion'].cat.codes
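
`cat.codes` numbers the categories in sorted order, so encoding the test set separately yields the same integers only when both sets contain exactly the same regions. A minimal sketch of reusing the training categories follows; the region labels and variable names are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative region labels (not taken from the dataset)
train_regions = pd.Series(['BSk', 'Cfa', 'BSk', 'Dfb'])
test_regions = pd.Series(['Cfa', 'Dfb', 'Csb'])  # 'Csb' was not seen in training

# Freeze the category list learned from the training data
region_dtype = pd.CategoricalDtype(categories=sorted(train_regions.unique()))

# Both sets now share one label-to-code mapping
train_codes = train_regions.astype(region_dtype).cat.codes
test_codes = test_regions.astype(region_dtype).cat.codes

print(train_codes.tolist(), test_codes.tolist())
```

A label that was not present in training is encoded as -1, which makes unseen regions easy to detect before predicting.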


In [None]:
df_train.head()

In [None]:
# Scale Values
sc = MinMaxScaler()
scaled_df = sc.fit_transform(df_train)

In [None]:
scaled_df

In [None]:
scaled_df = pd.DataFrame(scaled_df, columns = df_train.columns)

In [None]:
scaled_df.head()

In [None]:
# Split data into independent (X) and dependent (y) variables
X = scaled_df.drop('contest-tmp2m-14d__tmp2m', axis = 1)
y = scaled_df['contest-tmp2m-14d__tmp2m']

In [None]:
# Perform Dimensionality Reduction
pca = PCA(n_components = 20)
pca_features = pca.fit_transform(X)
print('Shape before PCA: ', X.shape)
print('Shape after PCA: ', pca_features.shape)
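
The choice of 20 components can be checked against the share of variance they retain. The sketch below is a hedged illustration on synthetic data (the matrix `X_demo` is a made-up stand-in for the scaled features) using the `explained_variance_ratio_` attribute of a fitted `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled feature matrix: 50 correlated columns
# generated from 5 latent factors plus a little noise
latent = rng.normal(size=(500, 5))
X_demo = latent @ rng.normal(size=(5, 50)) + 0.05 * rng.normal(size=(500, 50))

pca_demo = PCA(n_components=20).fit(X_demo)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest k whose cumulative share reaches 95%
# (if 95% is never reached, this exceeds the number of fitted components)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(n_needed, cumulative[-1])
```

On the real features, a cumulative share well below 1 after 20 components would suggest that the reduction discards information the models could use.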

In [None]:
# Column names 'pca_1' ... 'pca_20' for the principal components
pca_columns = [f'pca_{i}' for i in range(1, 21)]


In [None]:
pca_X = pd.DataFrame(pca_features, columns = pca_columns)

In [None]:
pca_X.head()

In [None]:
y

### Split into Train and Validation

In [None]:
# Split data into train and validation sets (fixed seed for reproducibility)
X_train,X_val,y_train,y_val = train_test_split(pca_X,y, 
                                               test_size= 0.2,
                                               random_state = 42)
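
A random split places rows from the same start date in both the training and validation sets. For a forecasting task, a chronological holdout is a common alternative. The sketch below shows one way to do it on synthetic data; the frame `demo` and its columns are made up, and this is not the split used in this notebook.

```python
import pandas as pd

# Synthetic data: 10 start dates with 3 locations each
dates = pd.date_range('2015-01-01', periods=10, freq='D')
demo = pd.DataFrame({
    'startdate': dates.repeat(3),
    'feature': [float(i) for i in range(30)],
    'target': [0.1 * i for i in range(30)],
})

# Hold out the most recent 20% of distinct start dates for validation
unique_dates = demo['startdate'].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

train_part = demo[demo['startdate'] < cutoff]
val_part = demo[demo['startdate'] >= cutoff]

print(train_part.shape, val_part.shape)
```

With this kind of split, the validation score reflects performance on dates the model has not seen, which is closer to how the test set is likely to be used.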

In [None]:
X_train.shape,X_val.shape, y_train.shape, y_val.shape

### Model Building

In [None]:
# Using Linear Regression as Baseline
LR = LinearRegression()
LR.fit(X_train,y_train)

In [None]:
# Using RandomForest Algorithm
RFR = RandomForestRegressor()
RFR.fit(X_train,y_train)

In [None]:
# Using XGBoost Algorithm
xg_model = XGBRegressor()
xg_model.fit(X_train,y_train)

### Model Evaluation

In [None]:
# Xgboost Algorithm
xg_y_preds = xg_model.predict(X_val)
xg_MSE = mean_squared_error(y_val,xg_y_preds)
print(f"MSE : {xg_MSE}")
xg_r2 = r2_score(y_val,xg_y_preds)
print(f"R2 : {xg_r2}")
xg_RMSE = np.sqrt(mean_squared_error(y_val,xg_y_preds))
print(f"RMSE : {xg_RMSE}")


In [None]:
# RandomForest Algorithm
RFR_y_preds = RFR.predict(X_val)
RFR_MSE = mean_squared_error(y_val,RFR_y_preds)
print(f"MSE : {RFR_MSE}")
RFR_r2 = r2_score(y_val,RFR_y_preds)
print(f"R2 : {RFR_r2}")
RFR_RMSE = np.sqrt(mean_squared_error(y_val,RFR_y_preds))
print(f"RMSE : {RFR_RMSE}")

In [None]:
# LinearRegression Algorithm
LR_y_preds = LR.predict(X_val)
LR_MSE = mean_squared_error(y_val,LR_y_preds)
print(f"MSE : {LR_MSE}")
LR_r2 = r2_score(y_val,LR_y_preds)
print(f"R2 : {LR_r2}")
LR_RMSE = np.sqrt(mean_squared_error(y_val,LR_y_preds))
print(f"RMSE : {LR_RMSE}")

#### View Validation Predictions

In [None]:
# RandomForest Predictions
predictions_RFR = pd.DataFrame()
predictions_RFR['Actual Value'] = y_val
predictions_RFR['Random Forest Predictions'] = RFR_y_preds
predictions_RFR.head()

In [None]:
# XGBoost Predictions
predictions_XGB = pd.DataFrame()
predictions_XGB['Actual Value'] = y_val
predictions_XGB['XGBoost Predictions'] = xg_y_preds
predictions_XGB.head()

In [None]:
# LinearRegression Predictions
predictions_LR = pd.DataFrame()
predictions_LR['Actual Value'] = y_val
predictions_LR['Linear Regression Predictions'] = LR_y_preds
predictions_LR.head()

### Feature Importance

In [None]:
#Xgboost
xgb.plot_importance(xg_model, ax = plt.gca())

### Save Model

In [None]:
# import joblib
# joblib.dump(RFR,"Wind_Model_joblib")


In [None]:
# Wind_Model = joblib.load("Wind_Model_joblib")

### Preprocessing Pipeline

In [None]:
# Test Data
df_test = pd.read_csv('/content/drive/MyDrive/Data Project Datasets/Sub_seasonal forecasting/test_data.csv')

In [None]:
df_predictions = df_test.copy()

In [None]:
df_test.head()

In [None]:
per_filter_na_cols(df_test)

In [None]:
# Preprocess data (bring the test dataset into the same format as the training features)
def preprocess_data(df):
    target = 'contest-tmp2m-14d__tmp2m'

    # Parse the 'startdate' column as a datetime
    df['startdate'] = pd.to_datetime(df['startdate'])

    # Fill numeric nulls with the column mean and categorical nulls with the mode
    for label, content in df.items():
        if pd.api.types.is_float_dtype(content):
            if content.isnull().any():
                df[label] = content.fillna(content.mean())
        elif isinstance(content.dtype, pd.CategoricalDtype):
            if content.isnull().any():
                df[label] = content.fillna(content.mode()[0])

    # Feature engineering on the date column (same column names as in training)
    df['year'] = df.startdate.dt.year
    df['month'] = df.startdate.dt.month
    df['day'] = df.startdate.dt.day

    # Drop columns that are not used as features
    df.drop(['index','startdate'], axis = 1, inplace = True)

    # Encode non-numeric columns as category codes, as was done for training
    # (assumes the test set contains the same categories as the training set)
    for label, content in df.items():
        if not pd.api.types.is_numeric_dtype(content):
            df[label] = content.astype('category').cat.codes

    # Scale with the training min/max instead of refitting on the test data.
    # `sc` was fitted on a frame that still contained the target, so a
    # feature-only scaler is fitted on the training features here.
    train_features = df_train.drop(target, axis = 1)
    feature_scaler = MinMaxScaler().fit(train_features)
    scaled_df = pd.DataFrame(feature_scaler.transform(df[train_features.columns]),
                             columns = train_features.columns)

    # Project onto the principal components learned from the training data
    pca_features = pca.transform(scaled_df)
    pca_test_df = pd.DataFrame(pca_features, columns = pca_columns)

    return pca_test_df
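
One way to keep training and test preprocessing consistent is to bundle the scaler and PCA into a single scikit-learn pipeline that is fitted once on the training features and then only applied to the test features. The sketch below uses synthetic matrices; `X_train_demo`, `X_test_demo` and `preprocessor` are hypothetical names, not objects from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training and test feature matrices
X_train_demo = rng.normal(size=(200, 30))
X_test_demo = rng.normal(size=(50, 30))

# Fit the scaler and PCA on the training features only...
preprocessor = make_pipeline(MinMaxScaler(), PCA(n_components=20))
train_pca = preprocessor.fit_transform(X_train_demo)

# ...then reuse the fitted steps, unchanged, on the test features
test_pca = preprocessor.transform(X_test_demo)

print(train_pca.shape, test_pca.shape)
```

Calling `transform` rather than `fit_transform` on the test data means the test rows are projected onto the same axes the models were trained on.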

In [None]:
test = preprocess_data(df_test)

In [None]:
test

In [None]:
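# The models were trained on the min-max-scaled target, so map the
# predictions back to the original temperature scale
target_train = df_train['contest-tmp2m-14d__tmp2m']
y_test = RFR.predict(test) * (target_train.max() - target_train.min()) + target_train.min()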

In [None]:
df_predictions.head()

In [None]:
df_predictions['contest-tmp2m-14d__tmp2m Predictions'] = y_test

In [None]:
df_predictions

In [None]:
df_sub

In [None]:
df_submissions = pd.DataFrame()
df_submissions['contest-tmp2m-14d__tmp2m'] = df_predictions['contest-tmp2m-14d__tmp2m Predictions']
df_submissions['index'] = df_predictions['index']

In [None]:
df_submissions.head()

In [None]:
# index=False keeps the pandas row index out of the submission file
df_submissions.to_csv('MLizzys_Wind_Submission.csv', index = False)

Estimator - **RandomForest Algorithm**

RMSE - **0.017** (validation set, on the min-max-scaled target)
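
Because `y` is taken from `scaled_df`, this RMSE is expressed in min-max-scaled units rather than degrees Celsius. Min-max scaling is linear, so the RMSE in original units equals the scaled RMSE multiplied by the target range (max minus min). The sketch below checks that relationship on made-up numbers; the temperature range and noise level are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target values in degrees Celsius
y_celsius = rng.uniform(-20.0, 35.0, size=1000)
y_min, y_max = y_celsius.min(), y_celsius.max()

# Min-max scale the target and simulate predictions on the scaled target
y_scaled = (y_celsius - y_min) / (y_max - y_min)
pred_scaled = y_scaled + rng.normal(scale=0.02, size=1000)

# RMSE on the scaled target
rmse_scaled = np.sqrt(np.mean((y_scaled - pred_scaled) ** 2))

# Undo the scaling and recompute the RMSE in degrees Celsius
pred_celsius = pred_scaled * (y_max - y_min) + y_min
rmse_celsius = np.sqrt(np.mean((y_celsius - pred_celsius) ** 2))

print(rmse_scaled, rmse_celsius, rmse_scaled * (y_max - y_min))
```

The last two printed values agree, so a scaled RMSE can be converted to degrees Celsius without re-running the models, provided the training target's minimum and maximum are known.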