TODO:  
- [x] Simplify data transformation: use ColumnTransformer and pipeline  
- [ ] ~Implement dictionary mapping for temperature (same as for month and time)~
- [ ] ~Implement dictionary mapping for multiplied variables (temperature x hour)~
- [ ] Add trend for count of bikes (it grows over time)  
- [ ] GridSearchCV: first keep only the models with a small difference between train and test scores, then take the best model among them  
- [ ] Scatterplot with known and predicted points
- [ ] Incorporate all data modifications into the pipeline

- [Challenge description](#Challenge-description)
- [Data exploration](#Data-exploration)
    - [Load data](#Load-data)
    - [Train/Test split](#Train/Test-split)
    - [Heatmap](#Heatmap)
    - [Pairplot](#Pairplot)
    - [Check NaN values](#Check-NaN-values)
    - [Average count by hour](#Average-count-by-hour)
    - [Average count by day of the week](#Average-count-by-day-of-the-week)
    - [Total count by month](#Total-count-by-month)
    - [Count by temperature](#Count-by-temperature)
    - [Count of bike rents during a week](#Count-of-bike-rents-during-a-week)
- [Feature engineering](#Feature-engineering)
    - [Extract date time data](#Extract-date-time-data)
    - [Find and remove features with high p-value](#Find-and-remove-features-with-high-p-value)
    - [Apply min/max scaler](#Apply-min/max-scaler)
    - [Apply polynomial features preprocessing](#Apply-polynomial-features-preprocessing)
- [Models](#Models)
    - [Linear Regression model](#Linear-Regression-model)
        - [Predict results](#Predict-results)
        - [Evaluate model](#Evaluate-model)
        - [Feature importance](#Feature-importance)
    - [Poisson Regressor model](#Poisson-Regressor-model)
        - [Calculate RMSLR](#Calculate-RMSLR)
        - [Grid Search Cross Validation](#Grid-Search-Cross-Validation)
    - [Random Forest Regressor](#Random-Forest-Regressor)
        - [Calculate RMSLR](#Calculate-RMSLR)
        - [Grid Search Cross Validation](#Grid-Search-Cross-Validation)
        - [Feature importance](#Feature-importance)
    - [Ensemble Regressor](#Ensemble-Regressor)
- [Kaggle submission](#Kaggle-submission)
    - [Submission result](#Submission-result)

In [None]:
# !pip install --no-input plotly
# !pip install --no-input seaborn
# !pip install --no-input numpy
# !pip install --no-input pandas
# !pip install --no-input scikit-learn
# !pip install --no-input statsmodels

## Challenge description

The goal of this week's project is to build and train a regression model on the Capital Bike Share (Washington, D.C.) Kaggle data set in order to predict the demand for bicycle rentals at any given hour, based on time and weather.

**Data description**  
[https://www.kaggle.com/c/bike-sharing-demand](https://www.kaggle.com/c/bike-sharing-demand)

## Data exploration

In [None]:
RANDOM_STATE=42

### Load data

In [None]:
# !cat './data/train.csv' | wc -l

import pandas as pd
df = pd.read_csv('./data/train.csv', parse_dates=True)
df.head()

### Train/Test split

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

X = df[df.columns.difference(['count', 'registered', 'casual'])].copy(deep=True)
y = df['count']

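# Fix the seed so that the split (and every score below) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)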

### Heatmap

In [None]:
import seaborn as sns
tmp = df[['season', 'weather', 'temp', 'atemp', 'humidity', 'windspeed', 'count']]
sns.heatmap(
    tmp.corr(), 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True,
    
)

### Pairplot

In [None]:
import seaborn as sns
from pathlib import Path
from IPython.display import HTML, display

PAIRPLOT_FROM = '2012-01-01'
PAIRPLOT_TO = '2013-01-01'
filename = "./output/pairplot-" + PAIRPLOT_FROM + " --" + PAIRPLOT_TO + ".png"

# Drawing a pairplot takes several minutes, so the image is saved (cached) the first time it is created
file = Path(filename)
if not file.is_file():
    # df has a default integer index, so filter on the 'datetime' column instead of slicing the index
    dt = pd.to_datetime(df['datetime'])
    in_range = (dt >= pd.Timestamp(PAIRPLOT_FROM)) & (dt <= pd.Timestamp(PAIRPLOT_TO))
    tmp = df.loc[in_range, ['datetime', 'atemp', 'temp', 'humidity', 'season', 'weather', 'windspeed', 'count']]
    # extract_datetime_data is defined in "Extract date time data" below; run that cell first
    tmp = extract_datetime_data(tmp) # TODO: move this function to the top
    plot = sns.pairplot(tmp, hue='count')
    file.parent.mkdir(parents=True, exist_ok=True)
    plot.savefig(file)
else:
    # Importing `display` as a function (not the module) avoids the earlier name clash
    display(HTML('<img src="'+filename+'"></img>'))


Note: the heatmap above clearly shows two highly correlated variables, temp and atemp. Only one of them (temp) is used in the model.
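
The same observation can be checked numerically instead of by eye. The sketch below is only an illustration: `highly_correlated_pairs` is a hypothetical helper, the data frame is synthetic, and the 0.9 threshold is an assumption rather than a value taken from the data set.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return (col_a, col_b, corr) for every column pair with |corr| above threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs


# Synthetic stand-in for the real columns: atemp is almost a linear function of temp
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=200)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, size=200),
    "humidity": rng.uniform(0, 100, size=200),
})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

Calling the helper on the `tmp` frame used for the heatmap should list the temp/atemp pair if the correlation is as strong as the plot suggests.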

### Check NaN values

In [None]:
X_train.info()

### Average count by hour

In [None]:
df.groupby(pd.to_datetime(df['datetime']).dt.hour)['count'].mean().plot.bar()

### Average count by day of the week

In [None]:
df.groupby(pd.to_datetime(df['datetime']).dt.weekday)['count'].mean().plot.bar()

### Total count by month

In [None]:
df.groupby(pd.to_datetime(df['datetime']).dt.month)['count'].sum().plot.bar()

### Count by temperature

In [None]:
mean_temp = df.groupby(by="temp")['count'].mean()
mean_temp.plot.bar()

In [None]:
import plotly.express as px

mean_temp = df.groupby(by=["temp", "weather"])['count'].mean()
fig = px.line(mean_temp.reset_index(), x='temp', y='count', color='weather')
fig.show()

In [None]:
mean_temp = df.groupby(by=["temp"])['count'].mean()
# mean_temp

In [None]:
fig = px.line(mean_temp.reset_index(), x='temp', y='count')
fig.show()

### Count of bike rents during a week

In [None]:
fig = px.line(df.tail(7 * 24), x="datetime", y="count", title='Count of bike rents during a week')
fig.show()


## Feature engineering

### Extract date time data

In [None]:
# Create monthly weights dictionary
tmp = df.groupby(pd.to_datetime(df['datetime']).dt.month)['count'].sum().sort_values().reset_index().drop(columns='count').to_dict()
month_weight_dict = {v:k+1 for k, v in tmp['datetime'].items()}

# Create hourly weights dictionary
tmp = df.groupby(pd.to_datetime(df['datetime']).dt.hour)['count'].sum().sort_values().reset_index().drop(columns='count').to_dict()
hour_weight_dict = {v:k for k, v in tmp['datetime'].items()}

# Function for extracting date features
def extract_datetime_data(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy: pd.DataFrame(df) may share data with the caller's frame,
    # so an in-place drop could remove 'datetime' from X_train / X_test as well
    df = df.copy()
    dt = pd.to_datetime(df['datetime'])
    df['hour_weight'] = dt.dt.hour.map(hour_weight_dict)
    df['month_weight'] = dt.dt.month.map(month_weight_dict)

    return df.drop(columns="datetime")

# TODO: add assertion
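
The two dictionaries above map each month or hour to its position when the groups are ordered by total count. A possibly more readable way to get the same kind of mapping is `Series.rank`. This is only a sketch on toy data: `rank_weights` and its `start` parameter are hypothetical names, and ties may be ordered differently than with `sort_values`.

```python
import pandas as pd


def rank_weights(keys: pd.Series, counts: pd.Series, start: int = 0) -> dict:
    """Map each distinct key to the rank of its total count (lowest total -> `start`)."""
    totals = counts.groupby(keys).sum()
    # method="first" gives distinct integer ranks 1..n even when totals are tied
    ranks = totals.rank(method="first").astype(int) - 1 + start
    return ranks.to_dict()


toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 08:00", "2011-01-01 17:00",
        "2011-01-02 03:00", "2011-01-02 17:00",
    ]),
    "count": [30, 90, 2, 110],
})

hour_weights = rank_weights(toy["datetime"].dt.hour, toy["count"])
month_weights = rank_weights(toy["datetime"].dt.month, toy["count"], start=1)
print(hour_weights, month_weights)
```

With `start=1` the helper follows the `k+1` convention used for `month_weight_dict`; with the default it follows the zero-based `hour_weight_dict`.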

In [None]:
X_train_mod = extract_datetime_data(X_train)
X_test_mod = extract_datetime_data(X_test)

# # check that after transformation there are no NaN values
# def assertHasNoNa(df: pd.DataFrame):
#     assert pd.DataFrame(df).isna().sum().unique().size == 1
#     assert pd.DataFrame(df).isna().sum()[0] == 0


# assertHasNoNa(X_train_extended)
# assertHasNoNa(X_test_extended)
# assertEqualIndexes(X_train_extended, y_train)

# X_train_extended

In [None]:
# X_train_mod = X_train.copy(deep=True).sort_index().reset_index().drop(columns=['index'])
# X_test_mod = X_test.copy(deep=True).sort_index().reset_index().drop(columns=['index'])
# y_train = y_train.sort_index().reset_index().drop(columns=['index'])
# y_test = y_test.sort_index().reset_index().drop(columns=['index'])

In [None]:
# def assertEqualIndexes(df1: pd.DataFrame, df2: pd.DataFrame):
#     assert df1.sort_index().tail().index.equals(df2.sort_index().tail().index)
    

In [None]:
# assertEqualIndexes(X_train_mod, y_train)

### Find and remove features with high p-value

In [None]:
from statsmodels.api import OLS

# Reset both indexes so that y and X are aligned row by row
mod = OLS(y_train.reset_index(drop=True), X_train_mod.reset_index(drop=True))

res = mod.fit()

print(res.summary())

In [None]:
# # P-values of the following features are above 0.05, so they appear statistically insignificant.
# insignificant_features = [
#     'atemp',
#     # 'holiday',     # this feature turned out to be important
#     # 'workingday',  # this feature turned out to be important
# ]

# X_train_mod.drop(columns=insignificant_features, inplace=True)
# X_test_mod.drop(columns=insignificant_features, inplace=True)

#### Estimate rmslr

In [None]:
from sklearn.metrics import mean_squared_log_error
import numpy as np

def rmslr(y_true, y_pred):
    # Root mean squared logarithmic error (usually abbreviated RMSLE)
    return np.sqrt(mean_squared_log_error(y_true, y_pred))
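
For reference, the metric that this notebook calls `rmslr` is the square root of scikit-learn's mean squared logarithmic error, i.e. sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2)). The sketch below recomputes it with NumPy only; `rmsle_manual` is a hypothetical name and the numbers are made up for illustration.

```python
import numpy as np


def rmsle_manual(y_true, y_pred) -> float:
    """Root mean squared logarithmic error, computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    # log1p(x) = log(1 + x), so a count of zero is handled without special cases
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


actual = [0, 9, 99]
predicted = [0, 9, 9]
score = rmsle_manual(actual, predicted)
print(round(score, 4))
```

Because the error is measured on a log scale, predicting 9 instead of 99 costs as much as predicting 0 instead of 9. It also explains why negative predictions from the linear model are clipped to zero before scoring.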

### Build Pipeline

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
# X_train_mod = extract_datetime_data(X_train)
# X_test_mod = extract_datetime_data(X_test)

# insignificant_features = ['atemp']
# X_train_mod.drop(columns=insignificant_features, inplace=True)
# X_test_mod.drop(columns=insignificant_features, inplace=True)

# pipe_hour = Pipeline([
#     ('hour', PolynomialFeatures(degree=2))
# ])

# pipe_month =Pipeline(
#     [ 
        
#     ('month', PolynomialFeatures(degree=2))
# ])

# pipe_day =Pipeline(
#     [
#     ('day', PolynomialFeatures(degree=2))
# ])

# pipe_temp=Pipeline([
#     ('temp', PolynomialFeatures(degree=2)),
#     ('temp_scale',MinMaxScaler())
# ])

# pipe_weather = Pipeline([
#     ('weather_onehot',OneHotEncoder(drop='first'))
# ])

# preprocessor = ColumnTransformer([
#     ('weather_pipe',pipe_weather,['weather']),
#     ('scaler',MinMaxScaler(),['humidity','windspeed']),
#     ('holiday','passthrough',['holiday','workingday']),
#     ('poly_hour',pipe_hour,['hour']),
#     ('poly_day',pipe_day,['day']),
#     ('temp_pipe',pipe_temp,['temp'])
# ])

# regressor = RandomForestRegressor(
# #     ccp_alpha=0.05, 
# #     min_impurity_decrease=0.0, 
# #     min_samples_leaf=3, 
# #     min_samples_split=2, 
# #     n_estimators=100,
#     max_depth=25
# )

# pipeline = Pipeline([
#     ("preprocessor", preprocessor),
#     ("regressor", regressor)
# ]).fit(X_train_mod, y_train)

# y_pred = pipeline.predict(X_test_mod)
# y_pred_train = pipeline.predict(X_train_mod)

In [None]:
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import LinearRegression

# print(X_test_mod.head())
X_train_mod = extract_datetime_data(X_train)
X_test_mod = extract_datetime_data(X_test)

insignificant_features = ['atemp']
X_train_mod.drop(columns=insignificant_features, inplace=True)
X_test_mod.drop(columns=insignificant_features, inplace=True)

pipe_hour = Pipeline([
#     ('pf', PolynomialFeatures(degree=2)),
    ('scale',MinMaxScaler())
])
pipe_month = Pipeline([ 
#     ('pf', PolynomialFeatures(degree=2)),
    ('scale',MinMaxScaler())
])
pipe_temp = Pipeline([
    ('scale',MinMaxScaler()),
    ('disc', KBinsDiscretizer(n_bins=5))
])

preprocessor = ColumnTransformer([
        ('temp', pipe_temp, ['temp']),
        ('hour_weight', pipe_hour, ['hour_weight']),
        ('month_weight', pipe_month, ['month_weight']),
        ('min_max', MinMaxScaler(), ['humidity', 'windspeed']),
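        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use sparse=False on older versions
        ('ohe', OneHotEncoder(sparse_output=False, drop='first'), ['season', 'weather'])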
    ],
    remainder="passthrough",
)

regressor = RandomForestRegressor(
    ccp_alpha=0.05, 
    min_impurity_decrease=0.0, 
    min_samples_leaf=3, 
    min_samples_split=2, 
    n_estimators=100,
    max_depth=25,
    random_state=RANDOM_STATE
)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", regressor)
]).fit(X_train_mod, y_train)

y_pred = pipeline.predict(X_test_mod)
y_pred_train = pipeline.predict(X_train_mod)

X_train_fe = pipeline['preprocessor'].transform(X_train_mod)
X_test_fe = pipeline['preprocessor'].transform(X_test_mod)

# y_pred_train[y_pred_train < 0] = 0
# y_pred[y_pred < 0] = 0

# print(pipeline['preprocessor'].transform(X_train_mod))
# print(pipeline['preprocessor'].get_feature_names_out())

In [None]:
'rmslr (train data): %.4f; rmslr (test data): %.4f' % (rmslr(y_train, y_pred_train), rmslr(y_test, y_pred))
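
One way to approach the TODO item "Incorporate all data modifications into the pipeline" is to wrap the date handling in a `FunctionTransformer`, so that the raw `datetime` column can be passed straight to `fit` and `predict`. The sketch below uses toy data and a hypothetical `add_hour` function. It covers only a stateless step; the learned `hour_weight`/`month_weight` mappings would need a custom transformer with its own `fit`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_hour(frame: pd.DataFrame) -> pd.DataFrame:
    """Stateless step: derive the hour from 'datetime' and drop the raw column."""
    out = frame.copy()
    out["hour"] = pd.to_datetime(out["datetime"]).dt.hour
    return out.drop(columns="datetime")


# Toy data, not taken from train.csv
toy_X = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 08:00:00",
                 "2011-01-01 12:00:00", "2011-01-01 17:00:00"],
    "temp": [9.0, 10.5, 14.0, 13.0],
})
toy_y = pd.Series([5, 60, 40, 90])

toy_pipeline = Pipeline([
    ("datetime", FunctionTransformer(add_hour)),
    ("regressor", LinearRegression()),
]).fit(toy_X, toy_y)

toy_features = toy_pipeline.named_steps["datetime"].transform(toy_X)
toy_pred = toy_pipeline.predict(toy_X)
print(list(toy_features.columns), toy_pred.shape)
```

With this layout, the same pipeline object can be applied unchanged to the train, test and Kaggle frames.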

In [None]:
# from sklearn.preprocessing import KBinsDiscretizer
# from sklearn.compose import ColumnTransformer


# def apply_bins_discretizer(df):

#     transformer = ColumnTransformer([
#         ('', KBinsDiscretizer(n_bins=4), ['temp'])
#     ], remainder='passthrough')

#     return pd.DataFrame(transformer.fit_transform(df), columns=transformer.get_feature_names_out())


# # tmp = X_train_mod.copy(deep=True)
# # apply_bins_discretizer(tmp).value_counts()

# X_train_mod = apply_bins_discretizer(X_train_mod)
# X_test_mod = apply_bins_discretizer(X_test_mod)

# assertHasNoNa(X_train_mod)
# assertHasNoNa(X_test_mod)
# assertEqualIndexes(X_train_mod, y_train)

# X_train_mod.tail()

### Apply min/max scaler

In [None]:
# from sklearn.preprocessing import KBinsDiscretizer
# from sklearn.preprocessing import MinMaxScaler
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import PolynomialFeatures

# def apply_min_max_scaller(df):

#     transformer = ColumnTransformer([
#         ('', MinMaxScaler(), ['season', 'hour_weight', 'month_weight'])
#     ], remainder='passthrough')

#     return pd.DataFrame(transformer.fit_transform(df), columns=transformer.get_feature_names_out())

# X_train_mod = apply_min_max_scaller(X_train_mod)
# X_test_mod = apply_min_max_scaller(X_test_mod)

# assertHasNoNa(X_train_mod)
# assertHasNoNa(X_test_mod)
# assertEqualIndexes(X_train_mod, y_train)

# # X_train_mod.sort_index().tail()

### Apply polynomial features preprocessing

In [None]:
# from sklearn.preprocessing import PolynomialFeatures

# pt2 = PolynomialFeatures(interaction_only=False)
# X_train_mod_debug = pd.DataFrame(pt2.fit_transform(X_train_mod), columns=pt2.get_feature_names_out())
# X_test_mod_debug = pd.DataFrame(pt2.transform(X_test_mod), columns=pt2.get_feature_names_out())

# X_train_mod_debug

In [None]:
# from sklearn.preprocessing import PolynomialFeatures

# pt = PolynomialFeatures(interaction_only=True)
# X_train_mod = pd.DataFrame(pt.fit_transform(X_train_mod), columns=pt.get_feature_names_out())
# X_test_mod = pd.DataFrame(pt.transform(X_test_mod), columns=pt.get_feature_names_out())

# assertHasNoNa(X_train_mod)
# assertHasNoNa(X_test_mod)
# assertEqualIndexes(X_train_mod, y_train)

## Models

### Linear Regression model

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression().fit(X_train_fe, y_train)

#### Predict results

In [None]:
y_pred = lr_model.predict(X_test_fe)
y_pred[y_pred < 0] = 0

y_pred_train = lr_model.predict(X_train_fe)
y_pred_train[y_pred_train < 0] = 0

#### Evaluate model

In [None]:
from sklearn.metrics import mean_squared_log_error
import numpy as np

def rmslr(y_true, y_pred):
    # Root mean squared logarithmic error (usually abbreviated RMSLE)
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

In [None]:
'rmslr (train data): %.4f; rmslr (test data): %.4f' % (rmslr(y_train, y_pred_train), rmslr(y_test, y_pred))

#### Feature importance

In [None]:
# tmp = pd.DataFrame(np.abs(lr_model.coef_[0]), index=lr_model.feature_names_in_, columns=['coef']).sort_values(by='coef', ascending=False)

# px.bar(tmp)

In [None]:
# TODO: permutation feature importance - https://scikit-learn.org/stable/modules/permutation_importance.html
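
As a starting point for the TODO above, here is a minimal sketch of `sklearn.inspection.permutation_importance`. It runs on synthetic data in which only the first column influences the target. The data, the model choice and `n_repeats=10` are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
# Only column 0 drives the target; columns 1 and 2 are pure noise features
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Shuffle one column at a time and measure how much the score drops
result = permutation_importance(
    demo_model, X_demo, y_demo, n_repeats=10, random_state=42
)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

For the models here, the call would take a fitted estimator such as `lr_model` together with `X_test_fe` and `y_test`, so that importance is measured on held-out data.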

### Poisson Regressor model

In [None]:
from sklearn.linear_model import PoissonRegressor

pr_model = PoissonRegressor(alpha=1)
pr_model.fit(X_train_fe, y_train.values.ravel())
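# Predict on both sets so that the scores below describe this model, not the previous one
y_pred_train = pr_model.predict(X_train_fe)
y_pred = pr_model.predict(X_test_fe)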

#### Calculate RMSLR

In [None]:
from sklearn.metrics import make_scorer, mean_squared_log_error

def rmslr(y_true, y_pred, **kwargs):
    return mean_squared_log_error(y_true, y_pred, **kwargs)**0.5

rmslr_scorer = make_scorer(rmslr, greater_is_better=False)

In [None]:
'rmslr (train data): %.4f; rmslr (test data): %.4f' % (rmslr(y_train, y_pred_train), rmslr(y_test, y_pred))

#### Grid Search Cross Validation

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'alpha': [0.05, 0.1, 0.2, 0.5, 0.75, 1], 
    'fit_intercept': [True, False],
    'max_iter': [1000]
}

g = GridSearchCV(pr_model, param_grid, cv=5, scoring=rmslr_scorer, return_train_score=True )
g.fit(X_train_fe, y_train.values.ravel())

In [None]:
res = pd.DataFrame(g.cv_results_)
res.head(3)

In [None]:
res.columns

In [None]:
col_names = ['mean_test_score', 'mean_train_score', 'mean_fit_time',
            'param_alpha', 'param_fit_intercept', 'param_max_iter']

res.sort_values('mean_test_score', ascending=False)[col_names].head(10)
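
The TODO item about GridSearchCV (first keep the models whose train and test scores are close, then pick the best of them) can be expressed directly on a `cv_results_`-style table. The sketch below uses a hand-made frame with the same column names. `pick_stable_best` and the `max_gap` value are hypothetical, and the scores are negated errors as produced by a scorer with `greater_is_better=False`.

```python
import pandas as pd


def pick_stable_best(cv_results: pd.DataFrame, max_gap: float) -> pd.Series:
    """Among rows whose train/test score gap is at most max_gap, return the best by test score."""
    gap = (cv_results["mean_train_score"] - cv_results["mean_test_score"]).abs()
    stable = cv_results[gap <= max_gap]
    if stable.empty:
        raise ValueError("no candidate satisfies the gap limit")
    # Scores are negated errors, so the largest value is the best model
    return stable.loc[stable["mean_test_score"].idxmax()]


# Made-up scores: the first candidate has the best test score but overfits heavily
demo_results = pd.DataFrame({
    "param_alpha": [0.05, 0.5, 1.0],
    "mean_train_score": [-0.30, -0.58, -0.70],
    "mean_test_score": [-0.55, -0.60, -0.71],
})

best = pick_stable_best(demo_results, max_gap=0.05)
print(best["param_alpha"])
```

Applied to `res`, this would replace the plain `sort_values(...).head(10)` inspection with an explicit selection rule.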

In [None]:
g.best_params_

In [None]:
pr_model_best = g.best_estimator_

y_pred_train = pr_model_best.predict(X_train_fe)
y_pred = pr_model_best.predict(X_test_fe)

'rmslr (train data): %.4f; rmslr (test data): %.4f' % (rmslr(y_train, y_pred_train), rmslr(y_test, y_pred))

### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(X_train_fe, y_train.values.ravel())
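# Predict on both sets so that the scores below describe this model, not the previous one
y_pred_train = rf_model.predict(X_train_fe)
y_pred = rf_model.predict(X_test_fe)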

#### Calculate RMSLR

In [None]:
'rmslr (train data): %.4f; rmslr (test data): %.4f' % (rmslr(y_train, y_pred_train), rmslr(y_test, y_pred))

#### Grid Search Cross Validation

In [None]:
# Fitting GridSearchCV with these parameters took 31m 34.1s
# It output next best params: 
# {'ccp_alpha': 0.05,
#  'max_depth': None,
#  'max_features': 10,
#  'max_leaf_nodes': None,
#  'min_impurity_decrease': 0.0,
#  'min_samples_leaf': 3,
#  'min_samples_split': 2,
#  'n_estimators': 100}
# 
# param_grid = {
#     'ccp_alpha': [0.0, 0.05], 
#     'n_estimators': [1, 100],
#     'max_depth': [None, 5, 20],
#     'min_samples_split': [2, 5],
#     'min_samples_leaf': [1, 3],
#     'max_features': ['sqrt', 'log2', 1, 10],
#     'max_leaf_nodes': [None, 1, 10, 100],
#     'min_impurity_decrease': [0.0, 0.2]
# }
# 
# g = GridSearchCV(rf_model, param_grid, cv=5, scoring=rmslr_scorer, return_train_score=True )
# g.fit(X_train_mod3, y_train)
# res = pd.DataFrame(g.cv_results_)
# res.head(3)

In [None]:
# res.columns

In [None]:
# col_names = ['mean_test_score', 'mean_train_score', 'mean_fit_time', 'param_ccp_alpha', 'param_min_impurity_decrease']
# res.sort_values('mean_test_score', ascending=False)[col_names].head(10)

In [None]:
# g.best_params_

In [None]:
# rf_model_best = g.best_estimator_
rf_model_best = RandomForestRegressor(
    ccp_alpha=0.05, 
    min_impurity_decrease=0.0, 
    min_samples_leaf=3, 
    min_samples_split=2, 
    n_estimators=100,
    random_state=RANDOM_STATE  # makes the reported scores reproducible
)
rf_model_best.fit(X_train_fe, y_train.values.ravel())

y_pred_train = rf_model_best.predict(X_train_fe)
y_pred = rf_model_best.predict(X_test_fe)

'rmslr (train data): %.4f; rmslr (test data): %.4f' % (rmslr(y_train, y_pred_train), rmslr(y_test, y_pred))

#### Feature importance

In [None]:
# # TODO
# corr = df.corr()
# corr

# upper_tri = corr.where(np.triu(np.ones(corr.shape),k=1).astype(bool))

# to_drop = [column for column in upper_tri.columns if any(np.sqrt((upper_tri[column])**2) > 0.60)]
# to_drop

In [None]:
# px.bar(x = rf_model.feature_names_in_, y=rf_model.feature_importances_)

In [None]:
# tmp = pd.DataFrame(rf_model.feature_importances_, index=rf_model.feature_names_in_, columns=['coef'])
# tmp.sort_values(by='coef', ascending=False).head(5)['coef'].index.to_list()

In [None]:
# TODO: https://h1ros.github.io/posts/feature-importance/

### Ensemble Regressor

In [None]:
from sklearn.ensemble import VotingRegressor

er = VotingRegressor([
#     ('lr', lr_model), 
    ('rf', rf_model_best), 
    ('pr', pr_model)
])
er.fit(X_train_fe, y_train.values.ravel())
y_pred_train = er.predict(X_train_fe)
y_pred = er.predict(X_test_fe)

'rmslr (train data): %.4f; rmslr (test data): %.4f' % (rmslr(y_train, y_pred_train), rmslr(y_test, y_pred))
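
For context, `VotingRegressor` refits a clone of every estimator it is given and, without weights, predicts the plain average of their outputs. The sketch below checks this on synthetic data; the two estimators and the data are chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.2, size=200)

voter = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
]).fit(X_demo, y_demo)

# estimators_ holds the fitted clones; stack their predictions column by column
individual = np.column_stack([est.predict(X_demo) for est in voter.estimators_])
ensemble_pred = voter.predict(X_demo)

print(np.allclose(ensemble_pred, individual.mean(axis=1)))
```

Since the estimators are cloned and refitted, whether a previously fitted model is passed in does not matter: only its hyperparameters are carried over.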

### Early stopping of Gradient Boosting

In [None]:
# TODO: 
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html
# https://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_early_stopping.html
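
A minimal sketch of what the first link describes, assuming synthetic data and illustrative parameter values: `GradientBoostingRegressor` holds out `validation_fraction` of the training data and stops adding stages once the validation loss has not improved for `n_iter_no_change` consecutive stages.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=500)

gbr = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of boosting stages
    validation_fraction=0.2,  # share of the training data held out for the stopping check
    n_iter_no_change=10,      # stop after 10 stages without sufficient improvement
    tol=1e-4,                 # minimum improvement that counts as progress
    random_state=0,
)
gbr.fit(X_demo, y_demo)

# n_estimators_ is the number of stages actually fitted
print(f"stages fitted: {gbr.n_estimators_} of {gbr.n_estimators}")
```

Leaving `n_iter_no_change` at its default of `None` disables early stopping, in which case all `n_estimators` stages are fitted.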

In [None]:
# TODO: Ensemble Model

## Kaggle submission

In [None]:
df_kaggle = pd.read_csv('./data/test.csv', parse_dates=True)
# test.csv contains only feature columns, so there is no target to read here
X_kaggle = df_kaggle[df.columns.difference(['count', 'registered', 'casual'])].copy()

X_kaggle_mod = extract_datetime_data(X_kaggle)

insignificant_features = ['atemp']
X_kaggle_mod.drop(columns=insignificant_features, inplace=True)

# y_kaggle_pred = pipeline.predict(X_kaggle_mod)




# X_kaggle_mod = X_kaggle.sort_index().reset_index().drop(columns=['index'])
# y_kaggle = y_kaggle.sort_index().reset_index().drop(columns=['index'])

# X_kaggle_mod = extract_datetime_data(X_kaggle_mod)
# X_kaggle_mod = apply_min_max_scaller(X_kaggle_mod)
# X_kaggle_mod = pd.DataFrame(pt.fit_transform(X_kaggle_mod), columns=pt.get_modature_names_out())
# X_kaggle_mod.drop(columns=insignificant_modatures, inplace=True)

# Take 'datetime' from df_kaggle: X_kaggle may no longer contain it after feature extraction
submission = pd.DataFrame({
    "datetime": df_kaggle['datetime'],
    "count": np.ceil(pipeline.predict(X_kaggle_mod)).astype(int)
})
submission.to_csv("./output/kaggle_submission.csv", index=False)

### Submission result

Submission score **0.53651**  
position 1983 out of 3242 submissions