# 0. Loading packages

Uncomment the cell below if any of the required packages are not yet installed.

In [None]:
# Install necessary packages
# %pip install numpy
# %pip install matplotlib
# %pip install pandas
# %pip install seaborn
# %pip install scikit-learn
# %pip install missingno
# %pip install imblearn
# %pip install xgboost
# %pip install statsmodels

In [None]:
import warnings
warnings.filterwarnings('ignore')

import functions as fc

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from statsmodels.tsa.statespace.sarimax import SARIMAX

# 1. Loading data

In [None]:
train_data = pd.read_csv('Datasets/train.csv')
test_data = pd.read_csv('Datasets/test.csv')

test_data_pred_col = list(test_data['date_hour'])

# 2. Inspecting data

## 2.1 Showing datasets

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
train_data.describe()

The dataset contains no missing values.

The columns in the dataset are predominantly of data types `int` or `float`, except for the `date_hour` column, which is of type `object`. This column will need to be converted to the `datetime` format for further analysis.
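The conversion mentioned above can be sketched with `pd.to_datetime`. The three-row frame and the timestamp format below are assumptions made for illustration; the real values come from `Datasets/train.csv`.

```python
import pandas as pd

# Hypothetical sample; the actual timestamp format in train.csv may differ.
sample = pd.DataFrame({
    "date_hour": ["2011-01-01 00:00:00", "2011-01-01 01:00:00", "2011-01-01 02:00:00"],
    "cnt": [16, 40, 32],
})

# Parse the strings into timestamps so that date parts can be extracted later.
sample["date_hour"] = pd.to_datetime(sample["date_hour"])

print(sample.dtypes)
print(sample["date_hour"].dt.hour.tolist())
```

Once the column holds timestamps, the `.dt` accessor gives access to the hour, day, month, and other calendar parts.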

## 2.2 Inspecting individual columns

In [None]:
cols = ['holiday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'cnt']

dv_train = fc.DataVisualizer(train_data)

In [None]:
dv_train.plot_distribution(cols, 'train_data')

1. **Countplot for `holiday`:**
    - **Majority of entries are non-holidays**: The count for `0` (non-holidays) is significantly higher than `1` (holidays), indicating that most of the data represents regular working or non-holiday days.
2. **Countplot for `weathersit`:**
    - **Category 1 dominates**: Most observations fall into category `1`, representing favorable or clear weather.
    - **Category 2 and 3 are less common**: These represent moderate or less favorable weather conditions.
    - **Category 4 is not visible**: This implies that extreme weather conditions are very rare or absent in the dataset.
3. **Countplot for `temp`:**
    - This column is normally distributed.
4. **Countplot for `atemp`:**
    - This column is normally distributed.
5. **Countplot for `hum`:**
    - This column is left skewed.
6. **Countplot for `windspeed`:**
    - This column is right skewed.
7. **Countplot for `cnt`:**
    - Most values of `cnt` are close to zero, indicating that high values of `cnt` are reserved for specific occasions.
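The visual impression of left or right skew can be cross-checked numerically with `Series.skew()`: a positive value indicates a long right tail, a negative value a long left tail. The sketch below uses synthetic samples rather than the bike-sharing columns, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: an exponential sample has a long right tail,
# and its mirror image has a long left tail.
right_tailed = pd.Series(rng.exponential(scale=1.0, size=5_000))
left_tailed = -right_tailed

right_skew = right_tailed.skew()
left_skew = left_tailed.skew()
print(f"right-tailed skew: {right_skew:.2f}, left-tailed skew: {left_skew:.2f}")
```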

## 2.3 Relationships between variables

In [None]:
dv_train.plot_correlation('train_data', method='pearson')

The target variable `cnt` exhibits the following correlations with the other features in the dataset:

1. **`temp` (Temperature)**:
   - Correlation: **0.41** (moderate positive)
   - Interpretation: As temperature increases, the count of rentals tends to increase. This suggests that warmer weather is favorable for usage.

2. **`atemp` (Feels-like Temperature)**:
   - Correlation: **0.4** (moderate positive)
   - Interpretation: Similar to `temp`, higher feels-like temperatures are associated with more rentals. Since `temp` and `atemp` are highly correlated with each other, their impact on `cnt` is quite similar.

3. **`hum` (Humidity)**:
   - Correlation: **-0.33** (moderate negative)
   - Interpretation: Higher humidity levels are associated with a decrease in rentals. This indicates that humid weather may discourage people from renting.

4. **`windspeed`**:
   - Correlation: **0.097** (weak positive)
   - Interpretation: Windspeed shows a very weak positive correlation with rentals. This suggests that windspeed has a minimal linear relationship with the count of rentals.

5. **`weathersit` (Weather Situation)**:
   - Correlation: **-0.14** (weak negative)
   - Interpretation: Because `weathersit` is a categorical column with four classes, the Pearson correlation coefficient is not well suited to describing its relationship with `cnt`.

6. **`holiday`**:
   - Correlation: **-0.027** (very weak negative)
   - Interpretation: The correlation between holidays and rentals is negligible, indicating that the number of rentals is not significantly affected by whether it is a holiday. However, `holiday` is a binary column, so here too the Pearson correlation coefficient is not the best way to assess the relationship.

**Summary:**
- The strongest linear predictors of `cnt` are `temp` (0.41), `atemp` (0.4), and `hum` (-0.33), all of which show moderate correlations.
   - Since `temp` and `atemp` are highly correlated with each other (0.99), one of them is redundant and can be dropped.
- Features such as `windspeed`, `weathersit`, and `holiday` show weak or negligible correlations, indicating they may have limited linear influence on the target variable.
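Redundant feature pairs such as `temp` and `atemp` can also be flagged programmatically. The following is a minimal sketch on synthetic data; the 0.95 threshold and the generated columns are assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic stand-in for the weather columns: `atemp` is `temp` plus a little noise.
temp = rng.normal(0.5, 0.2, n)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.01, n),
    "hum": rng.uniform(0, 1, n),
})

corr = demo.corr(method="pearson").abs()

# Keep only the upper triangle so that each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.95
]
print(redundant_pairs)
```

For each reported pair, one of the two columns can be dropped without losing much linear information.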

## 2.4 Inspecting trends and seasonal components

<div style="border: 2px solid orange; background-color: #ffd7b3; color: #ff7b00; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> Check description
</div>

**Time Series Decomposition**

This analysis assumes an additive decomposition model, in which the data is expressed as:

$$y_t = S_t + T_t + R_t$$

Where:
- **$y_t$**: The observed data at time $t$  
- **$S_t$**: The seasonal component at time $t$  
- **$T_t$**: The trend component at time $t$  
- **$R_t$**: The residual (or irregular) component at time $t$  
*(Hyndman & Athanasopoulos, 2018)*  


**Insights from Decomposition Components**

Decomposing the time series into its primary components provides valuable insights:

1. **Trend**:  
   The trend represents the long-term movement in the data. It reveals whether the overall direction of the data is increasing, decreasing, or stable. Short-term fluctuations are ignored as they may result from noise or temporary anomalies.

2. **Seasonality**:  
   The seasonality captures periodically repeating patterns in the data. These patterns occur at consistent intervals, such as daily, weekly, or annually, and reflect regular cyclical behavior.

3. **Residuals**:  
   The residuals represent the irregular component of the data. These are deviations that cannot be explained by either the trend or the seasonality, such as unexpected peaks or outliers.  
   *(Dey, 2024)*


In [None]:
train_dc = fc.TimeSeriesDecomposer(train_data['cnt'], period=24)

In [None]:
trend, seasonal, residual = train_dc.decompose()

In [None]:
train_dc.plot_decomposition(trend, seasonal, residual)

The plot above does not clearly reveal a seasonally repeating pattern. (The plotting cell may be commented out because of its long runtime.) This is likely due to the sheer amount of data, which covers hourly observations over a two-year period. To make seasonal patterns easier to identify, a new decomposition is performed on a subset comprising the first one thirty-second of the dataset.


In [None]:
# Use only the first 1/32 of the hourly observations
train_dc_1 = fc.TimeSeriesDecomposer(train_data['cnt'].iloc[:len(train_data) // 32], period=24)

In [None]:
trend, seasonal, residual = train_dc_1.decompose()

In [None]:
train_dc_1.plot_decomposition(trend, seasonal, residual)

The plot above indicates a distinct **seasonal pattern** with a periodicity of approximately **one day**, suggesting a temporal influence on the **`cnt`** variable. However, there is no apparent trend in the data, which suggests that the dataset may already be stationary. This assumption will be further tested using the **Augmented Dickey-Fuller (ADF) test** in subsequent analysis.  

## 2.5 Inspecting time-specific relations

<div style="border: 2px solid orange; background-color: #ffd7b3; color: #ff7b00; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> Check description
</div>

To validate the presence of daily seasonal patterns, visualizations will be created using different time elements (e.g., hour, day, week) on the x-axis and the target variable on the y-axis. These plots will help to identify and observe visible trends or repeating patterns over time.
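The time elements mentioned above can be derived from a timestamp column with the pandas `.dt` accessor. The helper `add_time_features` below is hypothetical and only illustrates the idea; the actual implementation of `fc.create_timeseries_features` is not shown in this notebook and may differ, for example in how weeks or weekdays are numbered.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame, column: str = "date_hour") -> pd.DataFrame:
    """Return a copy of `df` with calendar features derived from `column`."""
    out = df.copy()
    ts = pd.to_datetime(out[column])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["week"] = ts.dt.isocalendar().week.astype(int)  # ISO week number
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    return out

# Hypothetical single observation on a Wednesday afternoon.
demo = add_time_features(pd.DataFrame({"date_hour": ["2011-06-15 14:00:00"], "cnt": [120]}))
print(demo.iloc[0])
```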

In [None]:
train_data = fc.create_timeseries_features(train_data)

In [None]:
train_data.info()

In [None]:
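cols = ['year', 'month', 'week', 'day', 'hour', 'day_of_week']

# Re-create the visualizer so that it uses the DataFrame with the new time features
dv_train = fc.DataVisualizer(train_data)

for col in cols:
    dv_train.lineplot(x=col, y='cnt', title=f'{col} vs cnt', path=f'Figures/{col}_vs_cnt.png')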

The analysis of the above plots reveals the following insights:  
- Over the two-year period, the average value of `cnt` shows an upward trend. Because the data spans only two years, the `year` column adds little value and will be dropped.  
- The monthly and weekly graphs both show a distinct peak in `cnt` during the summer months. Since both columns show the same pattern, the `month` column will be dropped.
- The day-of-the-month graph does not show a clear relationship with `cnt`. Therefore, the `day` column will be dropped.
- The hour-of-the-day graph shows pronounced peaks during the morning and evening hours.  
- The day-of-the-week graph indicates noticeable peaks on the fourth and fifth days of the week.

## 2.6 Stationarity

To assess whether the dataset exhibits stationarity, we will perform the Augmented Dickey-Fuller (ADF) test. This statistical test evaluates the null hypothesis ($H_0$) that the data contains a unit root, indicating non-stationarity. Rejection of the null hypothesis suggests that the data is stationary.

**Hypothesis:**

- $H_0$: The data contains a unit root and is non-stationary.
- $H_1$: The data does not contain a unit root and is stationary.

**Results:**

The outcome of the ADF test includes:
- The test statistic, which is compared against critical values at various significance levels (e.g., 1%, 5%, 10%).
- The p-value, indicating the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true.

Based on these results, we will determine if stationarity can be assumed for the dataset or if additional transformations (e.g., differencing) are necessary to achieve stationarity.


In [None]:
stat_tests = fc.StatisticalTests(train_data)

In [None]:
stat_tests.stationary_test('cnt')

## 2.7 Fourier analysis

<div style="border: 2px solid orange; background-color: #ffd7b3; color: #ff7b00; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> Check description
</div>

**Fourier Transform (FT) in Time Series Analysis**

A Fourier Transform (FT) converts data from the time domain into the frequency domain *(Omar, 2021)*. This transformation is specifically applicable to periodic signals in a time series format. When multiple periodic signals are combined, it can become challenging to discern where each signal begins and ends. By applying an FT, a frequency-amplitude graph is generated, allowing these components to be clearly identified.


**Visualizing Periodic Signals with Inverse Fourier Transform (IFT)**

To observe the periodic signals in their original form, an Inverse Fourier Transform (IFT) can be applied. However, before performing the IFT, the Fourier-transformed data must be cleaned to avoid merely reproducing the original time-domain data. This concept is demonstrated through the following visualizations:

1. **Periodic Components**  
   ![Periodic components](Figures/Explanations/Periodic%20components.png)  
   *This figure visualizes the three individual components that constitute the data.*

2. **Combined Data and Fourier Transform**  
   ![Combined + decomposed](Figures/Explanations/Combined%20+%20decomp.png)  
   *This figure shows the combined data alongside its Fourier Transform, highlighting three peaks at frequencies 10, 120, and 360.*

3. **Low-Pass Filter**  
   ![Low pass filter](Figures/Explanations/Low%20pass%20filter.png)  
   *A low-pass filter is applied here to retain only the low-frequency signals.*

4. **High-Pass Filter**  
   ![High pass filter](Figures/Explanations/High%20pass%20filter.png)  
   *A high-pass filter is applied to remove low-frequency signals, preserving only the high-frequency components.*

5. **Bandstop Filter**  
   ![Bandstop filter](Figures/Explanations/Bandstop%20filter.png)  
   *A bandstop filter is applied, filtering out medium-frequency signals while keeping the low and high-frequency components.*

6. **Bandpass Filter**  
   ![Bandpass filter](Figures/Explanations/Bandpass%20filter.png)  
   *A bandpass filter is applied, retaining only the medium-frequency signals.*

7. **Noisy Periodic Components**  
   ![Noise periodic components](Figures/Explanations/Noise%20periodic%20components.png)  
   *This figure visualizes the periodic components that form the data, which include significant noise.*

8. **Noisy Combined Data and Fourier Transform**  
   ![Noise combined + decomposed](Figures/Explanations/Noise%20combined%20+%20decomp.png)  
   *This figure shows the noisy combined data and its Fourier Transform. Many small peaks are visible, alongside two prominent peaks at frequencies 10 and 120.*

9. **Noise Filter**  
   ![Noise filter](Figures/Explanations/Noise%20filter.png)  
   *A noise filter is applied to remove low-amplitude peaks, ensuring only significant periodic components are retained.*
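The noise-filter idea from the last figures can be sketched with NumPy's FFT routines. The signal below is synthetic (components at frequencies 10 and 120, as in the figures), and the threshold of half the largest peak is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000
t = np.arange(n_samples) / n_samples  # one second sampled at 1000 Hz

# Two periodic components (10 Hz and 120 Hz) buried in Gaussian noise.
clean = np.sin(2 * np.pi * 10 * t) + 0.8 * np.sin(2 * np.pi * 120 * t)
noisy = clean + rng.normal(0, 0.5, n_samples)

spectrum = np.fft.rfft(noisy)
freqs = np.fft.rfftfreq(n_samples, d=1 / n_samples)
amplitude = np.abs(spectrum)

# Noise filter: zero every frequency bin whose amplitude is below half the largest peak.
keep = amplitude >= 0.5 * amplitude.max()
kept_freqs = freqs[keep]

# Inverse transform of the cleaned spectrum gives the denoised time-domain signal.
filtered = np.fft.irfft(np.where(keep, spectrum, 0), n=n_samples)

print("retained frequencies:", kept_freqs)
```

Because only the two dominant bins survive the threshold, the inverse transform returns a signal that is much closer to the noise-free components than the noisy input.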


In [None]:
stat_tests.fourier_analysis('cnt')

The Fourier analysis reveals two prominent frequency spikes:

1. A spike at a frequency of approximately **0.0001** cycles per hour, which corresponds to a period of roughly **one year**.  
2. A second spike at a frequency of approximately **0.041** cycles per hour, which corresponds to a period of approximately **24 hours**.

These findings suggest the presence of annual and daily patterns in the dataset, which may be significant for time series modeling. 
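The link between the spike frequencies and the stated periods is the relation period = 1 / frequency. The short calculation below assumes the frequency axis is expressed in cycles per hourly observation, which is consistent with the 24-hour reading above.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY  # 8760, ignoring leap days

def period_in_hours(frequency: float) -> float:
    """Convert a frequency in cycles per hour to a period in hours."""
    return 1.0 / frequency

daily_period = period_in_hours(0.041)    # about 24.4 hours
annual_period = period_in_hours(0.0001)  # 10 000 hours, about 417 days

# Going the other way: the exact frequencies of a daily and a yearly cycle.
daily_frequency = 1 / HOURS_PER_DAY      # about 0.0417 cycles per hour
annual_frequency = 1 / HOURS_PER_YEAR    # about 0.000114 cycles per hour

print(f"{daily_period:.1f} h, {annual_period / HOURS_PER_DAY:.0f} days")
print(f"{daily_frequency:.4f}, {annual_frequency:.6f}")
```

The rounded spike at 0.0001 therefore maps to a period somewhat longer than a year, which is expected given the coarse frequency resolution at the low end of the spectrum.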

## 2.8 Autocorrelation

<div style="border: 2px solid orange; background-color: #ffd7b3; color: #ff7b00; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> Check description
</div>

**Autocorrelation**

Autocorrelation represents the similarity between a time series and a lagged version of itself. It measures the relationship between the current value of a variable and its past values. The scale for autocorrelation is the same as for regular correlation:  
- **+1** indicates a perfect positive correlation,  
- **-1** indicates a perfect negative correlation, and  
- **0** indicates no correlation.  
*(Smith, 2024)*  
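As a small illustration of this scale, the sketch below computes the autocorrelation of a synthetic 24-hour sine wave with `Series.autocorr`. The series is an assumption for demonstration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

hours = np.arange(24 * 10)  # ten days of hourly observations
daily_cycle = pd.Series(np.sin(2 * np.pi * hours / 24))

for lag in (1, 6, 12, 24):
    print(f"lag {lag:>2}: autocorrelation = {daily_cycle.autocorr(lag=lag):+.2f}")
```

A lag of one full cycle (24) gives a correlation close to +1, while a lag of half a cycle (12) gives a correlation close to -1.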

**Lagging**

Lagging refers to shifting the values of a variable backward or forward in time to create new features, known as lagged features. These lagged features capture temporal dependencies and trends in the data, which can enhance the accuracy of predictive models.  
*("Analyzing the Impact of Lagged Features in Time Series Forecasting: A Linear Regression Approach," 2024)*  
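Lagged features can be created with `Series.shift`. The values and the column names `cnt_lag_1` and `cnt_lag_2` below are hypothetical and serve only to show the mechanics.

```python
import pandas as pd

# Hypothetical hourly counts.
demo = pd.DataFrame({"cnt": [16, 40, 32, 13, 1, 1]})

# Each lag column holds the value of `cnt` from `lag` rows earlier.
for lag in (1, 2):
    demo[f"cnt_lag_{lag}"] = demo["cnt"].shift(lag)

# The first rows have no history and contain NaN, so they are dropped
# before the data is passed to a model.
lagged = demo.dropna()
print(lagged)
```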


In [None]:
plot_acf(train_data['cnt'], lags=12, ax=plt.gca(), alpha=0.05)
plt.savefig('Figures/ACF.png')
plt.show()

plot_pacf(train_data['cnt'], lags=12, ax=plt.gca(), alpha=0.05)
plt.savefig('Figures/PACF.png')
plt.show()

Based on the combined autocorrelation and partial autocorrelation plots, we observe significant correlations up to **lag 5**. This indicates that past values within this lag range have a meaningful relationship with the current value, which may be important for time series modeling.

## 2.9 Conclusion

The exploratory data analysis (EDA) has provided valuable insights into the dataset, its structure, and the relationships between features. Based on the findings, the following data preprocessing steps will be applied to prepare the dataset for further analysis and modeling:

1. **Column Dropping**:
   - The following columns will be removed as they either lack meaningful contribution, exhibit high correlation with other features, or show redundant information:
     - `holiday`: Weak correlation with the target variable and limited predictive power.
     - `year`, `month`, `day_of_week`, `day`: These columns demonstrate trends or patterns already captured by other features, such as `hour` or aggregated time-series patterns.
     - `atemp`: Highly correlated with `temp` (0.99), making it redundant.
     - `windspeed`: Weak correlation with the target variable, indicating limited linear influence.

2. **Dummy Variable Creation**:
   - Dummy variables will be created for the `weathersit` column to capture its categorical nature effectively and ensure its compatibility with predictive modeling.

3. **Feature Engineering with Fourier Analysis**:
   - Fourier waves will be generated based on the following columns to capture their periodicity:
     - `week` (annual periodicity).
     - `hour` (daily periodicity).
   - After generating the Fourier waves, the original `week` and `hour` columns will be dropped.

4. **Indexing**:
   - The `date_hour` column will be converted to the `datetime` format and set as the index for the dataset to facilitate time series analysis.

These steps will ensure the dataset is optimized for modeling by retaining meaningful features, addressing redundancy, and incorporating temporal patterns effectively.
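One common way to turn a periodic column into Fourier-style features is a sine and cosine pair. The helper `add_cyclical_features` below is hypothetical; the exact transformation applied by `fc.FeatureEngineering.fourier_wave` is not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df: pd.DataFrame, column: str, period: int) -> pd.DataFrame:
    """Append sine and cosine encodings of a periodic column (hypothetical helper)."""
    out = df.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out

demo = add_cyclical_features(pd.DataFrame({"hour": [0, 6, 12, 18, 23]}), "hour", period=24)
print(demo.round(3))
```

With this encoding, hour 23 ends up close to hour 0 in feature space, which a raw integer `hour` column cannot express.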


## 2.10 Updating `test_data`

The time-based features that were added to `train_data` in section 2.5 are also added to `test_data`, so that both datasets have the same columns before feature engineering.

In [None]:
test_data = fc.create_timeseries_features(test_data)

# 3. Feature engineering

In [None]:
cols_to_drop = ['holiday', 'year', 'month', 'day_of_week', 'day', 'atemp', 'windspeed']
cols_to_dummy = ['weathersit']
cols_to_fourier = ['hour', 'week']
index_col = 'date_hour'

fe = fc.FeatureEngineering(train_data, test_data, cols_to_drop, cols_to_dummy, cols_to_fourier, index_col)

## 3.1 Dropping columns

The columns listed in `cols_to_drop`, which section 2.9 identified as redundant or uninformative, are removed.

In [None]:
fe.drop_columns()

## 3.2 Creating dummies

The categorical `weathersit` column is converted into dummy variables, as described in section 2.9.

In [None]:
fe.create_dummies()

## 3.3 Creating Fourier waves

Fourier waves are generated from the `hour` and `week` columns to capture the daily and annual periodicity found in section 2.7. As described in section 2.9, the original `hour` and `week` columns are dropped afterwards.

In [None]:
fe.fourier_wave()

## 3.4 Setting index

The `date_hour` column is converted to the `datetime` format and set as the index, as described in section 2.9.

In [None]:
train_data, test_data = fe.set_index()

In [None]:
train_data.head()

In [None]:
test_data.head()

In the training data, a dummy column named **`weathersit_4`** has been created, but it is absent from the test data. Since the **`weathersit`** variable takes values from 1 to 4, this means that category 4 does not occur in the test data.

To keep the training and test datasets consistent, the **`weathersit_4`** column is dropped from the training data. Because this category never occurs in the test data, dropping it does not remove any feature that would be needed at prediction time.
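An alternative to dropping the column from the training data is to align the test columns to the training columns and fill the missing dummy with zeros. The sketch below uses small hypothetical frames and is not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical frames: category 4 occurs only in the training data.
train = pd.DataFrame({"weathersit": [1, 2, 3, 4], "cnt": [100, 80, 30, 5]})
test = pd.DataFrame({"weathersit": [1, 2, 3, 1]})

train_dummies = pd.get_dummies(train, columns=["weathersit"], dtype=int)
test_dummies = pd.get_dummies(test, columns=["weathersit"], dtype=int)

# Give the test data exactly the feature columns of the training data;
# dummies that never occur in the test data are filled with 0.
feature_cols = [c for c in train_dummies.columns if c != "cnt"]
test_aligned = test_dummies.reindex(columns=feature_cols, fill_value=0)

print(test_aligned)
```

Both approaches lead to identical column sets in the two datasets, which is what the models require.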

In [None]:
train_data.drop('weathersit_4', axis=1, inplace=True)

In [None]:
train_data.head()

# 4. Modelling

## 4.1 Regular models

### 4.1.1 Linear Regression

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
lr = LinearRegression()
param_grid = {'fit_intercept': [True, False], 'copy_X': [True, False]}

gs = fc.GridSearch(train_data, test_data, target='cnt', model=lr, param_grid=param_grid, n_splits=5, order=1)
gs.fit()
gs.predict(test_data_pred_col)
gs.to_csv(model='lr', path_add='order_1')

### 4.1.2 KNN Regressor

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
knn = KNeighborsRegressor()
param_grid = {'n_neighbors': [1, 2, 3, 4, 5], 'weights': ['uniform', 'distance']}

gs = fc.GridSearch(train_data, test_data, target='cnt', model=knn, param_grid=param_grid, n_splits=5, order=None)
gs.fit()
gs.predict(test_data_pred_col)
gs.to_csv(model='knn', path_add='order_None')

### 4.1.3 Decision Tree Regressor

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
dt = DecisionTreeRegressor()
param_grid = {'max_depth': [1, 2, 3, 4, 5], 'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20], 'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9, 10]}

gs = fc.GridSearch(train_data, test_data, target='cnt', model=dt, param_grid=param_grid, n_splits=5, order=1)
gs.fit()
gs.predict(test_data_pred_col)
gs.to_csv(model='dt', path_add='order_1')

### 4.1.4 Random Forest Regressor

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
rf = RandomForestRegressor()
param_grid = {'n_estimators': [25, 50, 75, 100, 150, 200], 'max_depth': [1, 2, 3, 4, 5],'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20], 'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9, 10]}

gs = fc.GridSearch(train_data, test_data, target='cnt', model=rf, param_grid=param_grid, n_splits=5, order=1)
gs.fit()
gs.predict(test_data_pred_col)
gs.to_csv(model='rf', path_add='order_1')

### 4.1.5 XGB Regressor

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
xgb = XGBRegressor()
param_grid = {'n_estimators': [25, 50, 75, 100, 150, 200], 'max_depth': [1, 2, 3, 4, 5], 'learning_rate': [0.01, 0.1, 0.3, 0.5, 1.0], 'subsample': [0.01, 0.1, 0.3, 0.5, 0.7, 1], 'colsample_bytree': [0.01, 0.1, 0.3, 0.5, 0.7, 1]}

gs = fc.GridSearch(train_data, test_data, target='cnt', model=xgb, param_grid=param_grid, n_splits=5, order=1)
gs.fit()
gs.predict(test_data_pred_col)
gs.to_csv(model='xgb', path_add='order_1')

Following the initial evaluation of all models:  

- **Order 1 and Order 2** models performed significantly better than the model with no order.  
- The difference in performance between **Order 1** and **Order 2** models was negligible.  
- For the **KNN model**, however, the model with **no order** yielded the best performance.  

As a result, we will proceed with **Order 1** for all subsequent analyses to maintain simplicity without sacrificing accuracy, except for the **KNN model**, where we will use **no order**.

## 4.2 Timeseries models

### 4.2.1 SARIMAX

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
y = train_data['cnt']

param_grid = {'order': [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1), (2, 0, 0), (2, 0, 1)],
              'seasonal_order': [(0, 0, 0, 24), (0, 0, 1, 24), (1, 0, 0, 24), (1, 0, 1, 24), (2, 0, 0, 24), (2, 0, 1, 24)]}

sarimax_model = fc.SARIMAXModel(train_data, test_data, param_grid)
sarimax_model.grid_search()

sarimax_model.predict(test_data_pred_col)

sarimax_model.save_predictions()

### 4.2.2 Prophet

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
param_grid = {'seasonality_mode': ['additive', 'multiplicative'],
              'changepoint_prior_scale': [0.01, 0.05, 0.1],
              'yearly_seasonality': ['auto', True, False],
              'weekly_seasonality': ['auto', True, False],
              'daily_seasonality': ['auto', True, False]}

prophet_model = fc.ProphetModel(train_data, test_data, param_grid=param_grid)
prophet_model.grid_search()

prophet_model.predict()

prophet_model.save_predictions()

## 4.3 Hybrid model

<div style="border: 2px solid red; background-color: #f8d7da; color: #721c24; padding: 10px; border-radius: 5px; display: inline-block; max-width: 97%;">
    <strong>Warning:</strong> ADD DESCRIPTION!!!
</div>

In [None]:
hm = fc.HybridModel(train_data, test_data, 'cnt', {'lr': LinearRegression(), 'dt': DecisionTreeRegressor()})
hm.fit()
hm.predict(test_data_pred_col)
hm.save_predictions()