In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Machine Learning

## Evaluation metrics

### Classification
```
precision = TP / (TP + FP)
```
```
recall = TP / (TP + FN)
```
```
F1 = 2 * (precision * recall) / (precision + recall)
```
```
AUC-ROC = area under the ROC curve (sensitivity plotted against 1 - specificity, i.e. true positive rate against false positive rate)
```
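The formulas above can be sketched in a few lines of plain Python. The function name `classification_metrics` and the example counts are hypothetical, chosen only for illustration; returning 0.0 for an empty denominator is an assumption, not a universal convention.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Guard against empty denominators (assumed convention: score 0.0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * (precision * recall) / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In practice `sklearn.metrics` offers `precision_score`, `recall_score`, `f1_score` and `roc_auc_score`, which work directly on label arrays instead of counts.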

### Regression
```
Root Mean Squared Error: RMSE = sqrt(MSE) = sqrt(1/n * sum((y_i - y_hat_i)^2))
```
```
R^2 = 1 - MSE(model) / MSE(baseline)
```
```
Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
(penalizes a high number of features p, given n samples)
```
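As a minimal sketch, the regression metrics can be computed by hand, taking R^2 in its usual form `1 - MSE(model) / MSE(baseline)` with the mean of the targets as the baseline prediction. The function name `regression_metrics` and the toy values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return (rmse, r2, adjusted_r2) for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse_model = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    # Baseline: always predict the mean of the targets
    mse_baseline = sum((t - mean_y) ** 2 for t in y_true) / n
    rmse = math.sqrt(mse_model)
    r2 = 1 - mse_model / mse_baseline
    # The adjustment grows with the number of features relative to n
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adjusted_r2


# Toy values, assumed to come from a model with 2 features
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.5, 11.5]
rmse, r2, adjusted_r2 = regression_metrics(y_true, y_pred, n_features=2)
print(f"RMSE={rmse:.4f} R^2={r2:.4f} adjusted R^2={adjusted_r2:.4f}")
```

Adjusted R^2 is never higher than R^2, and the gap widens as features are added without a matching gain in fit.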

## Feature selection

Exclude features if:

- Too many **missing values**  
- Too little value **variance** (see [VarianceThreshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold))
- **High correlation** with other features (redundant information), or a suspiciously high correlation with the target (possible data leakage)
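
A rough sketch of these three filters with pandas follows. The function name `drop_weak_features`, the thresholds and the toy frame are all assumptions for illustration, and the correlation step only looks at correlation between features, not with the target.

```python
import numpy as np
import pandas as pd


def drop_weak_features(df, max_missing=0.5, min_variance=1e-8, max_corr=0.95):
    """Drop columns that are mostly missing, near-constant, or redundant."""
    # 1. Too many missing values
    keep = df.loc[:, df.isna().mean() <= max_missing]

    # 2. Too little variance (numeric columns only)
    numeric = keep.select_dtypes("number")
    low_variance = numeric.columns[numeric.var() <= min_variance]
    keep = keep.drop(columns=low_variance)

    # 3. High correlation with an earlier feature: inspect the upper
    #    triangle so only the later column of each pair is dropped
    corr = keep.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return keep.drop(columns=redundant)


# Hypothetical toy data
demo = pd.DataFrame({
    "temp": [0.2, 0.4, 0.6, 0.8, 1.0],
    "atemp": [0.21, 0.41, 0.61, 0.81, 1.01],   # near-duplicate of temp
    "constant": [1, 1, 1, 1, 1],                # zero variance
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 3.0],
    "hum": [0.9, 0.1, 0.5, 0.3, 0.7],
})
result = drop_weak_features(demo)
print(list(result.columns))
```

Suitable thresholds depend on the dataset; the defaults here are placeholders rather than recommendations.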

Automatic feature selection methods:

- **Statistical testing**, where features that contribute little are removed (see [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)).
- **Model-based selection**, where an (often penalized) model is used to determine and eliminate the least contributing features (see [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel)).
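
As an illustration of the statistical-testing route, the sketch below scores each feature with a univariate F-test and keeps the best two. The synthetic data from `make_regression` and the choice `k=2` are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumed for illustration): 6 features, 2 of them informative
X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=10.0, random_state=0)

# Score every feature against the target and keep the 2 highest scores
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_.round(1))   # one F-statistic per feature
print(selector.get_support())      # boolean mask of the kept features
print(X_selected.shape)
```

For classification targets, a score function such as `f_classif` or `chi2` would take the place of `f_regression`.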

In practice, `SelectFromModel` is most often used in combination with a _penalized_ model. A penalty term added to the regular cost function discourages the model from unnecessary complexity.

The two most popular selection models used with `SelectFromModel` are `Lasso` for regression problems and `LinearSVC(penalty="l1")` for classification: a linear support vector machine classifier with an additional Lasso-like penalty term. In scikit-learn the L1 penalty requires the primal formulation of `LinearSVC`, i.e. `dual=False`.

`Lasso` regression penalizes a linear model for the absolute size of its coefficients. Coefficients of variables that barely contribute to the predictions are shrunk toward zero, often to exactly zero, so the features with non-zero coefficients form a natural selection. The cost function adds an L1 term, weighted by $\lambda$, to the mean squared error:

$$ J_{L1}(\mathbf{w})= \frac{1}{n}\sum_{i=1}^n\left(y_i-y( \mathbf{w},\mathbf{x}^i) \right)^2 + \lambda \sum_{j=1}^m \left|w_j\right| $$
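
A minimal sketch of model-based selection with `Lasso`, on synthetic data that is assumed for illustration. scikit-learn's `alpha` plays the role of $\lambda$, but scikit-learn scales the squared-error term by $1/(2n)$ rather than $1/n$, so the two values are not directly interchangeable. The value `alpha=1.0` is an arbitrary choice here.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): 10 features, 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Fit the penalized model and keep the features whose coefficients survive
selector = SelectFromModel(Lasso(alpha=1.0))
X_selected = selector.fit_transform(X, y)

print(selector.estimator_.coef_.round(2))  # shrunken coefficients
print(selector.get_support())              # boolean mask of the kept features
print(X_selected.shape)
```

A larger `alpha` removes more features. A reasonable value is usually found by cross-validation, for example with `LassoCV`.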


<a id='feature-engineering'></a>

## Feature engineering

* **Date and time features** Creating features from the available dates, e.g. whether a day is a holiday, or the day of the week.
* **Group values** Grouping numeric values into a categorical variable, e.g. the months December (12), January (1) and February (2) into the season Winter.
* **Grouping sparse classes** If some values of a categorical feature have a low sample count, you might group them together under a single category. For example: if we had a column `bike_type` it would make sense to have stand-alone values such as `race`, `road` or `grandma`, whereas you might want to group values like `penny farthing`, `unicycle` and `tricycle` together under a single `other` category since they are rarely rented.
* **Group from threshold** A new categorical variable derived from thresholds on a numeric one, e.g. `warm` and `cold` based on the temperature.
* **Indicator from threshold** An indicator variable (0 or 1) based on a threshold on a column, e.g. eligible to vote/work based on age.
* **Interaction of variables** The sum, difference, product or quotient of two features, e.g. `profit` as the difference between income and expenses.
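
Several of these techniques can be sketched on a small hypothetical frame. The column names, the rarity cut-off of 2 rentals, the 15-degree temperature threshold and the voting age of 18 are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical rental records
rentals = pd.DataFrame({
    "bike_type": ["race", "road", "road", "grandma",
                  "unicycle", "race", "tricycle", "grandma"],
    "temp": [4, 12, 25, 18, 30, 8, 21, 15],
    "age": [17, 34, 52, 16, 29, 41, 18, 23],
    "income": [20, 35, 50, 15, 40, 30, 25, 45],
    "expenses": [5, 10, 20, 5, 15, 10, 5, 20],
})

# Grouping sparse classes: values seen fewer than 2 times become "other"
counts = rentals["bike_type"].value_counts()
rare = counts[counts < 2].index
rentals["bike_type_grouped"] = rentals["bike_type"].where(
    ~rentals["bike_type"].isin(rare), "other")

# Group from threshold: warm at 15 degrees or above, cold otherwise
rentals["temp_group"] = rentals["temp"].ge(15).map({True: "warm", False: "cold"})

# Indicator from threshold: 1 if old enough to vote, else 0
rentals["can_vote"] = (rentals["age"] >= 18).astype(int)

# Interaction of variables: profit as income minus expenses
rentals["profit"] = rentals["income"] - rentals["expenses"]

print(rentals)
```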


In [5]:
def get_date_values(df):
    """ Preprocessing function
    Creates year, month, day, hour columns.
    """
    df = df.assign(**{'year': df.index.year,
                      'month': df.index.month,
                      'day': df.index.day,
                      'hour': df.index.hour})
    
    return df

def is_holiday(df):
    """Add a boolean is_holiday column.
    Input: dataframe (df) with a DatetimeIndex
    True when the date is a holiday 
    False when the date is not a holiday"""
    
    # Drop the time of day, then compare against HOLIDAYS parsed as timestamps
    return df.assign(is_holiday = df.index.normalize().isin(pd.to_datetime(HOLIDAYS)))

def get_weekday(df):
    """Get the day of the week"""
    
    return df.assign(**{'weekday': df.index.day_name()})


def get_season(df):
    """Return the season based off:
    Dec, Jan, Feb = winter
    Mar, Apr, May = spring
    Jun, Jul, Aug = summer
    Sep, Oct, Nov = autumn"""
    
    season_mapping = {0: 'winter',
                      1: 'spring',
                      2: 'summer',
                      3: 'autumn'}
    
    # December wraps around to 0, so Dec/Jan/Feb -> 0, Mar/Apr/May -> 1, etc.
    season_index = (df.index.month % 12) // 3
    seasons = season_index.map(season_mapping)
    
    return df.assign(season = seasons)

HOLIDAYS = ["2021-12-25", "2021-12-26"]

bikes = (
    pd.read_csv('data/bike_rental_dataset.csv', parse_dates = ['datetime'], index_col='datetime')
    .pipe(get_date_values)
    .pipe(is_holiday)
    .pipe(get_weekday)
    .pipe(get_season)
)

bikes.head()

| datetime | weathersit | temp | atemp | hum | windspeed | cnt | year | month | day | hour | is_holiday | weekday | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 16 | 2011 | 1 | 1 | 0 | False | Saturday | winter |
| 2011-01-01 01:00:00 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 40 | 2011 | 1 | 1 | 1 | False | Saturday | winter |
| 2011-01-01 02:00:00 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 32 | 2011 | 1 | 1 | 2 | False | Saturday | winter |
| 2011-01-01 03:00:00 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 13 | 2011 | 1 | 1 | 3 | False | Saturday | winter |
| 2011-01-01 04:00:00 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 1 | 2011 | 1 | 1 | 4 | False | Saturday | winter |
