# **Time Series Analysis Task Notebook**
## From "TSA_Task.ipynb" February 12, 2025
(use venv_requirements.txt)

This notebook takes you through several Time Series Analysis tasks using the Bike Sharing dataset. The tasks cover the essential skills for handling time-based data: cleaning and preprocessing, feature engineering, and model building. You'll create lag and rolling-window features, implement linear regression for time series prediction, and apply XGBoost with hyperparameter tuning. Finally, you'll evaluate and compare the models' performance and explain what the results say about their effectiveness.

## **Exercise**

The hand-in exercise for this topic consists of Tasks 3, 4 and 6 from the notebook “TSA_Task”. This means you have to clean the dataset and then create features (at least 5 new features, each of which you should be able to justify). After that, train an XGBoost model on the dataset. You also need to make a relevant train/validation/test split and be able to explain why you chose it. Lastly, calculate the evaluation metrics RMSE and MAE to show the performance of your model. The hyperparameter tuning part is not required.


3. Clean and pre-process the dataset as required and prepare the data for modelling.
4. Create lag and rolling-window features for the "cnt" column, for example a 1-day lag, a 1-week lag, a 1-month lag, a 3-day rolling mean or a 3-hour rolling mean. Choose the features that make sense for this dataset.
6. Implement XGBoost to predict how many bikes will be rented each hour of the last week and evaluate using appropriate metrics.

In [4]:
import kagglehub
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error

# 3. Clean and pre-process the dataset as required and prepare the data for modelling.



In [3]:
# Download the dataset and read the hourly file.
# dataset_download returns the local folder, so no user-specific path is needed.
import os
dataset_dir = kagglehub.dataset_download("lakshmi25npathi/bike-sharing-dataset")
file_path = os.path.join(dataset_dir, "hour.csv")
df = pd.read_csv(file_path)



In [5]:
#Getting an overview of the structure and format
df.head()

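|   | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |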


In [89]:
#Getting an overview
df.info()

# dteday is a str (object) column; it must be converted or dropped before fitting a model.
# No null values in any column.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB
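The column list shows `casual` and `registered` next to `cnt`. In the five rows printed by `df.head()`, the two always add up to `cnt`, which matters later when choosing model inputs: a feature set that contains both gives away the target. A minimal sketch of that check, using only those five rows (the `head` frame and `leaky_cols` name are illustrative, not part of the notebook):

```python
import pandas as pd

# The five rows shown by df.head() above (only the three count columns).
head = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "cnt": [16, 40, 32, 13, 1],
})

# Does casual + registered reproduce cnt in every row?
is_sum = (head["casual"] + head["registered"]).eq(head["cnt"]).all()
print(is_sum)  # True for these five rows

# If so, both columns should be dropped together with the target when building X.
leaky_cols = ["casual", "registered"]
features = head.drop(columns=leaky_cols + ["cnt"])
```

On the full data the same check would be run on `df` instead of `head`.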


In [90]:
# Work on a copy of the raw data
temp = df.copy()

# Convert the date strings to datetime
temp['dteday'] = pd.to_datetime(temp['dteday'])
temp.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     17379 non-null  int64         
 1   dteday      17379 non-null  datetime64[ns]
 2   season      17379 non-null  int64         
 3   yr          17379 non-null  int64         
 4   mnth        17379 non-null  int64         
 5   hr          17379 non-null  int64         
 6   holiday     17379 non-null  int64         
 7   weekday     17379 non-null  int64         
 8   workingday  17379 non-null  int64         
 9   weathersit  17379 non-null  int64         
 10  temp        17379 non-null  float64       
 11  atemp       17379 non-null  float64       
 12  hum         17379 non-null  float64       
 13  windspeed   17379 non-null  float64       
 14  casual      17379 non-null  int64         
 15  registered  17379 non-null  int64         
 16  cnt         17379 non-
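`info()` confirms that no cell is null, but it cannot reveal hours that are absent altogether. With 17,379 rows for the 731 days from 2011-01-01 to 2012-12-31 (731 × 24 = 17,544), some hours must be missing. A minimal sketch of how such gaps could be located, shown on a small made-up frame (the `sample` data and the `timestamp` column are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for hour.csv: hour 2 of 2011-01-01 is deliberately missing.
sample = pd.DataFrame({
    "dteday": ["2011-01-01"] * 4,
    "hr": [0, 1, 3, 4],
    "cnt": [16, 40, 13, 1],
})

# Combine date and hour into one timestamp per row.
sample["timestamp"] = pd.to_datetime(sample["dteday"]) + pd.to_timedelta(sample["hr"], unit="h")

# Every hour between the first and the last observation.
full_range = pd.date_range(sample["timestamp"].min(), sample["timestamp"].max(), freq="h")

# Hours that should exist but have no row.
missing_hours = full_range.difference(pd.DatetimeIndex(sample["timestamp"]))
print(f"{len(missing_hours)} missing hour(s): {list(missing_hours)}")
```

Whether to leave the gaps, fill them with zeros, or interpolate is a modelling decision that should be justified in the hand-in.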

# 4. Create the lag and rolling windows features for the "cnt" column 
(For example a 1-day lag, a 1-week lag, a 1-month lag, a 3-day rolling mean or a 3-hour rolling mean. Choose the features that make sense for this dataset.)

In [91]:
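# Lag features. shift(n) moves by n rows, not n hours. The data has 17,379 rows instead of
# 731 * 24 = 17,544, so a few hours are missing and these lags are approximate.
temp['lag_1d'] = temp['cnt'].shift(1*24)   # 1 day lag: same hour yesterday (daily cycle, recent weather, events or holidays)
temp['lag_1w'] = temp['cnt'].shift(7*24)   # 1 week lag: same hour and weekday last week (workday/weekend pattern)
temp['lag_1m'] = temp['cnt'].shift(30*24)  # 1 month lag (30 days): slower, monthly level
temp['lag_1y'] = temp['cnt'].shift(365*24) # 1 year lag: same season last year; NaN for the whole first year

# Rolling means. rolling() includes the current row, so the two windows taken directly
# on 'cnt' contain the value being predicted.
temp['rolling_mean_3d'] = temp['cnt'].rolling(window=3*24).mean()  # mean of the last 72 rows (3 days), current hour included
temp['rolling_mean_30d'] = temp['cnt'].rolling(window=30*24).mean()  # mean of the last 720 rows (30 days), current hour included
temp['rolling_mean_same_month_last_year'] = temp['cnt'].shift(365*24).rolling(window=30).mean()  # mean of 30 rows (hours) ending one year back; a full month would need window=30*24
temp['rolling_mean_same_week_last_year'] = temp['cnt'].shift(365*24).rolling(window=7).mean()  # mean of 7 rows (hours) ending one year back; a full week would need window=7*24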
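Since `shift(n)` and `rolling(n)` count rows, one alternative is to index the data by timestamp, shift by a time offset, and close the rolling window on the left so it never contains the current hour. The following is a sketch on synthetic hourly data with one hour removed, not the notebook's own feature code; the `ts` frame and the column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (cnt = 0, 1, 2, ...) with one hour removed to mimic a gap.
idx = pd.date_range("2011-01-01", periods=10 * 24, freq="h")
ts = pd.DataFrame({"cnt": np.arange(len(idx), dtype=float)}, index=idx)
ts = ts.drop(idx[30])

# Row-based lag: 24 rows back, which is 25 hours back for rows shortly after the gap.
ts["lag_1d_rows"] = ts["cnt"].shift(24)

# Time-based lag: the value stamped exactly 24 hours earlier (NaN if that hour is missing).
ts["lag_1d"] = ts["cnt"].shift(freq=pd.Timedelta(hours=24))

# Mean over the previous 3 hours; closed="left" excludes the current hour,
# so the feature never contains the value being predicted.
ts["hist_mean_3h"] = ts["cnt"].rolling("3h", closed="left").mean()

print(ts.loc[idx[40], ["cnt", "lag_1d_rows", "lag_1d", "hist_mean_3h"]])
```

At hour 40 the row-based lag returns the value from hour 15, while the time-based lag returns the value from hour 16, which is the true "same hour yesterday".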



# 6. Implement XGBoost to predict how many bikes will be rented each hour of the last week and evaluate using appropriate metrics.



In [92]:
# Drop 'dteday' (a datetime, not a numeric feature), 'instant' (just a running row id) and 'cnt' (the target).
# Caveat: 'casual' and 'registered' remain in X, and casual + registered == cnt, so the model can
# reconstruct the target from them. Drop them as well for a realistic forecast.
X = temp.drop(['cnt', 'dteday', 'instant'], axis=1)

y = temp['cnt']

# Last date in the dataset minus 7 days
train_end_date = temp['dteday'].max() - pd.Timedelta(days=7)

# Use train_end_date to build boolean masks for train/test
train_mask = temp['dteday'] <= train_end_date
test_mask = temp['dteday'] > train_end_date

# Split the dataset into train/test
X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]
print(f'Last day of dataset minus 7 days: {train_end_date}')
print(f'Shape of the Xtrain:{X_train.shape}')
print(f'Shape of the Xtest:{X_test.shape}')


Last day of dataset minus 7 days: 2012-12-24 00:00:00
Shape of the Xtrain:(17212, 22)
Shape of the Xtest:(167, 22)
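The exercise asks for a train/validation/test split, while the cell above produces only train and test. One way to extend the same date-mask idea is sketched below on a toy frame. The helper `chronological_split` and its default of 7 validation days are assumptions for illustration, not a prescribed choice:

```python
import pandas as pd

def chronological_split(frame, time_col, val_days=7, test_days=7):
    """Split by date: the last test_days go to test, the val_days before them to validation."""
    last_day = frame[time_col].max()
    test_start = last_day - pd.Timedelta(days=test_days)
    val_start = test_start - pd.Timedelta(days=val_days)
    train = frame[frame[time_col] <= val_start]
    val = frame[(frame[time_col] > val_start) & (frame[time_col] <= test_start)]
    test = frame[frame[time_col] > test_start]
    return train, val, test

# Toy frame with one row per day in December 2012.
demo = pd.DataFrame({"dteday": pd.date_range("2012-12-01", "2012-12-31", freq="D")})
train, val, test = chronological_split(demo, "dteday")
print(len(train), len(val), len(test))
```

Keeping the three parts in time order, rather than shuffling, means the model is always evaluated on data that lies after everything it was trained on, which is the situation a real forecast faces.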


In [93]:
model = xgb.XGBRegressor(objective='reg:squarederror')

#training model on train data.
model.fit(X_train, y_train)

# prediction on the last 7 days.
y_pred = model.predict(X_test)

# Evaluation: RMSE and MAE on the held-out week
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

print(f"Mean absolute error: {mae}")
print(f"Root mean squared error: {rmse}")

Mean absolute error: 1.1597025394439697
Root mean squared error: 1.888529412183719
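RMSE and MAE are easier to judge against a reference forecast. One common reference is a seasonal-naive forecast that predicts each hour with the count from the same hour one week earlier. The sketch below uses synthetic counts, and the helper names `mae` and `rmse` are illustrative; it only shows how the two metrics and the baseline are computed:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Three weeks of synthetic hourly counts with a daily cycle plus noise.
hours = np.arange(3 * 7 * 24)
rng = np.random.default_rng(0)
cnt = 100 + 50 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, size=hours.size)

week = 7 * 24
y_test = cnt[-week:]                # last week = test period
naive_pred = cnt[-2 * week:-week]   # same hour one week earlier

print(f"naive MAE: {mae(y_test, naive_pred):.2f}, RMSE: {rmse(y_test, naive_pred):.2f}")
```

A model is only useful if its errors on the test week are clearly below those of such a baseline computed on the same week.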
