# Requirements

In [7]:
import pandas as pd

In [None]:
# Add as many imports as you need.

# Laboratory Exercise - Run Mode (8 points)

## Introduction
This laboratory assignment focuses on time series forecasting: predicting the current **mean temperature** in the city of Delhi. Your task is to forecast the **mean temperature** with bagging and boosting methods, using data from the preceding three days: **mean temperature**, **humidity**, **wind speed**, and **mean pressure**.

**Note: You are required to perform this laboratory assignment on your local machine.**

## The Climate Dataset

## Downloading the Climate Dataset

In [1]:
!gdown 1kczX2FpFTH1QEsDeg6dszXM3Azwyd7XC # Download the dataset.

Downloading...
From: https://drive.google.com/uc?id=1kczX2FpFTH1QEsDeg6dszXM3Azwyd7XC
To: /content/climate-data.csv
100% 78.1k/78.1k [00:00<00:00, 107MB/s]


## Exploring the Climate Dataset
This dataset consists of daily weather records for the city of Delhi spanning a period of 4 years (from 2013 to 2017). The dataset includes the following attributes:

- date - date in the format YYYY-MM-DD,
- meantemp - mean temperature averaged from multiple 3-hour intervals in a day,
- humidity - humidity value for the day (measured in grams of water vapor per cubic meter volume of air),
- wind_speed - wind speed measured in kilometers per hour, and
- meanpressure - pressure reading of the weather (measured in atm).

*Note: The dataset is complete, with no missing values in any of its entries.*

Load the dataset into a `pandas` data frame.

In [13]:
df = pd.read_csv("climate-data.csv")
df.head()

Unnamed: 0,date,meantemp,humidity,wind_speed,meanpressure
0,2013-01-01,10.0,84.5,0.0,1015.666667
1,2013-01-02,7.4,92.0,2.98,1017.8
2,2013-01-03,7.166667,87.0,4.633333,1018.666667
3,2013-01-04,8.666667,71.333333,1.233333,1017.166667
4,2013-01-05,6.0,86.833333,3.7,1016.5


Explore the dataset using visualizations of your choice.
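
One possible starting point for the exploration is a stacked line plot of the four measurements plus their pairwise correlations. This is only a minimal sketch: it assumes `matplotlib` is installed, and it runs on a small synthetic frame (`demo`, with made-up values) that merely reuses the dataset's column names. With the real data, the loaded `df` would take the place of `demo`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: the column names mirror the dataset, the values are made up.
rng = np.random.default_rng(0)
n_days = 60
demo = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=n_days, freq="D"),
    "meantemp": 15 + 10 * np.sin(np.arange(n_days) / 10) + rng.normal(0, 1, n_days),
    "humidity": 60 + rng.normal(0, 10, n_days),
    "wind_speed": np.abs(rng.normal(6, 3, n_days)),
    "meanpressure": 1010 + rng.normal(0, 5, n_days),
})

features = ["meantemp", "humidity", "wind_speed", "meanpressure"]

# One panel per measurement, sharing the date axis.
fig, axes = plt.subplots(len(features), 1, figsize=(10, 8), sharex=True)
for ax, name in zip(axes, features):
    ax.plot(demo["date"], demo[name])
    ax.set_ylabel(name)
axes[-1].set_xlabel("date")
fig.tight_layout()

# Pairwise Pearson correlations between the measurements.
corr = demo[features].corr()
print(corr.round(2))
```

Line plots make seasonality and outliers visible at a glance, while the correlation table hints at which measurements move together with the temperature.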

In [14]:
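# Lag every column by one, two, and three days. The `date` column is
# shifted as well; its lagged copies are dropped in the next step.
cols = df.columns

for col in cols:
    for i in range(1, 4):
        # shift(i) moves values down i rows, so row t holds the value from day t-i.
        df[f"{col}-{i}"] = df[col].shift(i)

df.head()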

Unnamed: 0,date,meantemp,humidity,wind_speed,meanpressure,date-1,date-2,date-3,meantemp-1,meantemp-2,meantemp-3,humidity-1,humidity-2,humidity-3,wind_speed-1,wind_speed-2,wind_speed-3,meanpressure-1,meanpressure-2,meanpressure-3
0,2013-01-01,10.0,84.5,0.0,1015.666667,,,,,,,,,,,,,,,
1,2013-01-02,7.4,92.0,2.98,1017.8,2013-01-01,,,10.0,,,84.5,,,0.0,,,1015.666667,,
2,2013-01-03,7.166667,87.0,4.633333,1018.666667,2013-01-02,2013-01-01,,7.4,10.0,,92.0,84.5,,2.98,0.0,,1017.8,1015.666667,
3,2013-01-04,8.666667,71.333333,1.233333,1017.166667,2013-01-03,2013-01-02,2013-01-01,7.166667,7.4,10.0,87.0,92.0,84.5,4.633333,2.98,0.0,1018.666667,1017.8,1015.666667
4,2013-01-05,6.0,86.833333,3.7,1016.5,2013-01-04,2013-01-03,2013-01-02,8.666667,7.166667,7.4,71.333333,87.0,92.0,1.233333,4.633333,2.98,1017.166667,1018.666667,1017.8


## Feature Extraction
Apply a lag of one, two, and three days to each feature, creating a set of features representing the meteorological conditions from the previous three days. To maintain dataset integrity, eliminate any resulting missing values at the beginning of the dataset.

Hint: Use `df['column_name'].shift(period)`. Check the documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html.
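
The lagging step can be illustrated on a tiny frame. The sketch below is not the required solution: `add_lags` is a hypothetical helper introduced here for illustration, and the values in `toy` are made up. It shows how `shift` produces the lagged columns and why the first rows end up with missing values.

```python
import pandas as pd

# Tiny made-up frame; only the column names mirror the dataset.
toy = pd.DataFrame({
    "meantemp": [10.0, 7.4, 7.2, 8.7, 6.0],
    "humidity": [84.5, 92.0, 87.0, 71.3, 86.8],
})


def add_lags(frame, columns, n_lags):
    """Return a copy with `<col>-<k>` columns holding the value from k rows earlier."""
    out = frame.copy()
    for col in columns:
        for k in range(1, n_lags + 1):
            out[f"{col}-{k}"] = out[col].shift(k)
    # The first n_lags rows have no complete history, so they contain NaN.
    return out.dropna()


lagged = add_lags(toy, ["meantemp", "humidity"], n_lags=3)
print(lagged)
```

Only the rows with index 3 and 4 survive, because they are the first ones with three full days of history behind them.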

In [15]:
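# The first three rows lack a full three-day history and therefore contain NaN.
df.dropna(inplace=True)
# Keep only the target (meantemp) and the lagged features; same-day readings are not allowed as inputs.
df.drop(columns=['date-1', 'date-2', 'date-3', 'date', 'humidity', 'wind_speed', 'meanpressure'], inplace=True)

df.head()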

Unnamed: 0,meantemp,meantemp-1,meantemp-2,meantemp-3,humidity-1,humidity-2,humidity-3,wind_speed-1,wind_speed-2,wind_speed-3,meanpressure-1,meanpressure-2,meanpressure-3
3,8.666667,7.166667,7.4,10.0,87.0,92.0,84.5,4.633333,2.98,0.0,1018.666667,1017.8,1015.666667
4,6.0,8.666667,7.166667,7.4,71.333333,87.0,92.0,1.233333,4.633333,2.98,1017.166667,1018.666667,1017.8
5,7.0,6.0,8.666667,7.166667,86.833333,71.333333,87.0,3.7,1.233333,4.633333,1016.5,1017.166667,1018.666667
6,7.0,7.0,6.0,8.666667,82.8,86.833333,71.333333,1.48,3.7,1.233333,1018.0,1016.5,1017.166667
7,8.857143,7.0,7.0,6.0,78.6,82.8,86.833333,6.3,1.48,3.7,1020.0,1018.0,1016.5


In [16]:
df.isna().sum()

meantemp          0
meantemp-1        0
meantemp-2        0
meantemp-3        0
humidity-1        0
humidity-2        0
humidity-3        0
wind_speed-1      0
wind_speed-2      0
wind_speed-3      0
meanpressure-1    0
meanpressure-2    0
meanpressure-3    0
dtype: int64

## Dataset Splitting
Partition the dataset into training and testing sets with an 80:20 ratio.

**WARNING: DO NOT SHUFFLE THE DATASET.**
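
To see what an unshuffled split does, the sketch below compares `train_test_split(..., shuffle=False)` with plain positional slicing on a made-up ten-row frame (`X_toy` and `y_toy` are illustrative names, not part of the assignment). Both approaches keep the earliest rows for training and the latest rows for testing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows in chronological order.
X_toy = pd.DataFrame({"lag1": np.arange(10)})
y_toy = pd.Series(np.arange(10) * 2.0)

# shuffle=False keeps the row order, so the test set is the tail of the data.
x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)

# The same split written as positional slicing (8 of the 10 rows for training).
cut = 8
same_train = x_tr.equals(X_toy.iloc[:cut])
same_test = x_te.equals(X_toy.iloc[cut:])
print(same_train, same_test)
print(list(x_te.index))
```

Shuffling would let the model train on days that come after some test days, which gives an overly optimistic picture of forecasting accuracy.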



In [22]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['meantemp'])
y = df['meantemp']
x_train, x_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)

y_train

3        8.666667
4        6.000000
5        7.000000
6        7.000000
7        8.857143
          ...    
1165    25.066667
1166    24.562500
1167    24.250000
1168    22.375000
1169    24.066667
Name: meantemp, Length: 1167, dtype: float64

## Ensemble Learning Methods

### Bagging

Create an instance of a Random Forest model and train it using the `fit` function.

In [24]:
from sklearn.ensemble import  RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score

Use the trained model to make predictions for the test set.

In [25]:
# Bagging: average 1000 trees, each fit on a bootstrap sample of the training set.
rf = RandomForestRegressor(n_estimators=1000)
rf.fit(x_train, y_train)
y_pred_rf = rf.predict(x_test)

print(f"R2 for rf = {r2_score(y_test,y_pred_rf)}")

R2 for rf = 0.9052076095686947


Assess the performance of the model by using different metrics provided by the `scikit-learn` library.

In [None]:
# Write your code here. Add as many boxes as you need.
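
Beyond R2, error metrics in the units of the target are often easier to interpret. The following is a minimal sketch on made-up numbers (`y_true` and `y_hat` are illustrative). With the notebook's variables, `y_test` and `y_pred_rf` would be passed instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up temperatures and predictions, each off by exactly one degree.
y_true = np.array([20.0, 22.0, 24.0, 26.0])
y_hat = np.array([21.0, 21.0, 25.0, 27.0])

mae = mean_absolute_error(y_true, y_hat)  # mean of |error|
mse = mean_squared_error(y_true, y_hat)   # mean of error squared
rmse = np.sqrt(mse)                       # back in the units of the target
r2 = r2_score(y_true, y_hat)              # 1 - SS_res / SS_tot = 1 - 4 / 20

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
# → MAE=1.00 MSE=1.00 RMSE=1.00 R2=0.80
```

MAE and RMSE are expressed in degrees, so they state directly how far a typical forecast is from the observed temperature.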

### Boosting

Create an instance of an XGBoost model and train it using the `fit` function.

In [26]:
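# Boosting: 1000 trees fit sequentially, each one correcting the errors of the previous ones.
xgb = XGBRegressor(n_estimators=1000)
xgb.fit(x_train, y_train)
y_pred_xgb = xgb.predict(x_test)

print(f"R2 for xgb = {r2_score(y_test, y_pred_xgb)}")

R2 for xgb = 0.8771029130464275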


Use the trained model to make predictions for the test set.

In [None]:
# Write your code here. Add as many boxes as you need.

Assess the performance of the model by using different metrics provided by the `scikit-learn` library.

In [None]:
# Write your code here. Add as many boxes as you need.

# Laboratory Exercise - Bonus Task (+ 2 points)

As the bonus task in this laboratory assignment, fine-tune the number of estimators (`n_estimators`) of the XGBoost model using grid search with cross-validation based on a time series split. This means systematically trying several values for `n_estimators` and evaluating each one with cross-validation. Once the most suitable `n_estimators` value is determined, evaluate the model's performance on a test set for the final assessment.

Hints:
- For grid search use the `GridSearchCV` from the `scikit-learn` library. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
- For cross-validation use the `TimeSeriesSplit` from the `scikit-learn` library. Check the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html.
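
To make the time series split concrete, the sketch below prints the folds that `TimeSeriesSplit` produces for twelve consecutive samples. The sample count and `n_splits=3` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve consecutive "days", indexed 0..11.
samples = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in splitter.split(samples)]

for train, test in folds:
    print("train:", train, "validate:", test)
# → train: [0, 1, 2] validate: [3, 4, 5]
# → train: [0, 1, 2, 3, 4, 5] validate: [6, 7, 8]
# → train: [0, 1, 2, 3, 4, 5, 6, 7, 8] validate: [9, 10, 11]
```

Each validation fold lies strictly after its training fold, and the training window grows from fold to fold, so no fold ever validates on the past.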

## Dataset Splitting
Partition the dataset into training and testing sets with a 90:10 ratio.

**WARNING: DO NOT SHUFFLE THE DATASET.**

In [27]:
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

## Fine-tuning the XGBoost Hyperparameter
Experiment with various values for `n_estimators` and evaluate the model's performance using cross-validation.

In [36]:
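xgb = XGBRegressor()

# Candidate values for the number of boosting rounds.
params = {
    "n_estimators": [50, 100, 500, 1000]
}

# TimeSeriesSplit keeps every validation fold after its training fold in time.
tss = TimeSeriesSplit()
gsearch = GridSearchCV(estimator=xgb, cv=tss, param_grid=params)

# Note: this fits on the full X and y, so the held-out test rows also take part in tuning.
gsearch.fit(X, y)

gsearch.best_params_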

{'n_estimators': 50}

## Final Assessment of the Model Performance
Upon determining the most suitable `n_estimators` value, evaluate the model's performance on a test set for final assessment.

In [37]:
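# Note: this cell fits a Random Forest with the tuned value on the earlier 80:20 split,
# so the score below does not measure the tuned XGBoost model.
rf = RandomForestRegressor(n_estimators=50)

rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)

r2_score(y_test, y_pred)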

0.9038986225725807
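
For comparison, a leak-free version of the bonus workflow could look like the sketch below: split 90:10 without shuffling, tune on the training part only, then score the refitted best model once on the test part. Several things here are assumptions made so that the example runs standalone: the series is synthetic, the grid values are arbitrary, and scikit-learn's `GradientBoostingRegressor` is swapped in for `XGBRegressor` (both accept `n_estimators`, so the same pattern should carry over).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split

# Synthetic seasonal series standing in for the temperature (values are made up).
rng = np.random.default_rng(0)
t = np.arange(300)
series = pd.Series(25 + 8 * np.sin(2 * np.pi * t / 100) + rng.normal(0, 0.5, t.size))

# Three lagged copies of the series as features, as in the main exercise.
frame = pd.DataFrame({"target": series})
for k in range(1, 4):
    frame[f"lag-{k}"] = series.shift(k)
frame = frame.dropna()

X_demo = frame.drop(columns=["target"])
y_demo = frame["target"]

# 90:10 chronological split; the test part is never seen during tuning.
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, shuffle=False)

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),  # stand-in for XGBRegressor
    param_grid={"n_estimators": [25, 50, 100]},            # arbitrary illustrative grid
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(x_tr, y_tr)  # tuning uses the training part only

best_n = search.best_params_["n_estimators"]
# With the default refit=True, best_estimator_ is retrained on all of x_tr.
test_r2 = r2_score(y_te, search.best_estimator_.predict(x_te))
print(best_n, round(test_r2, 3))
```

Because no scoring function is passed, `GridSearchCV` ranks the candidates by the estimator's own `score` method, which is R2 for scikit-learn regressors.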