# Feature Engineering

## Time-Series

In [None]:
import pandas as pd
# create a series of datetime with a frequency of 10 hours
s = pd.date_range('2020-01-06', '2020-01-10', freq='10h').to_series()
# create some features based on datetime
features = {
    "dayofweek": s.dt.dayofweek.values,
    "dayofyear": s.dt.dayofyear.values,
    "hour": s.dt.hour.values,
    "is_leap_year": s.dt.is_leap_year.values,
    "quarter": s.dt.quarter.values,
    # .dt.weekofyear was removed in pandas 2.0; use the ISO calendar week
    "weekofyear": s.dt.isocalendar().week.values
}
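
As a minimal sketch, the datetime features can be collected into a dataframe that shares the index of the series. The name `feature_df` is only illustrative and not part of the original notes.

```python
import pandas as pd

# same 10-hour datetime series as above
s = pd.date_range("2020-01-06", "2020-01-10", freq="10h").to_series()

# collect a few datetime-based features into a dataframe
# (feature_df is a hypothetical name used for illustration)
feature_df = pd.DataFrame(
    {
        "dayofweek": s.dt.dayofweek.values,
        "hour": s.dt.hour.values,
        "quarter": s.dt.quarter.values,
    },
    index=s.index,
)
print(feature_df.head())
```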

## Aggregate features

In [None]:
def generate_features(df):
    # create a bunch of features using the date column
    df.loc[:, 'year'] = df['date'].dt.year
    # .dt.weekofyear was removed in pandas 2.0; use the ISO calendar week
    df.loc[:, 'weekofyear'] = df['date'].dt.isocalendar().week
    df.loc[:, 'month'] = df['date'].dt.month
    df.loc[:, 'dayofweek'] = df['date'].dt.dayofweek
    df.loc[:, 'weekend'] = (df['date'].dt.weekday >=5).astype(int)
    
    # create an aggregate dictionary
    aggs = {}
    # for aggregation by month, we calculate the
    # number of unique month values and also the mean
    aggs['month'] = ['nunique', 'mean']
    aggs['weekofyear'] = ['nunique', 'mean']
    
    # we aggregate by num1 and calculate sum, max, min
    # and mean values of this column
    aggs['num1'] = ['sum','max','min','mean']
    
    # for customer_id, we calculate the total count
    # and the number of unique values; both go in one list,
    # otherwise the second assignment overwrites the first
    aggs['customer_id'] = ['size', 'nunique']
    
    # we group by customer_id and calculate the aggregates
    agg_df = df.groupby('customer_id').agg(aggs)
    agg_df = agg_df.reset_index()

    return agg_df
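
To see what such an aggregation returns, here is a small sketch on toy data. The dataframe contents are made up for illustration, and flattening the two-level column index is an optional extra step that the function above does not perform.

```python
import pandas as pd

# hypothetical toy data: two customers with a few transactions each
df = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "date": pd.to_datetime(
            ["2020-01-06", "2020-02-10", "2020-02-15", "2020-03-01", "2020-03-02"]
        ),
        "num1": [10.0, 20.0, 30.0, 5.0, 15.0],
    }
)
df["month"] = df["date"].dt.month

# aggregate per customer, in the same style as the aggs dictionary above
agg_df = df.groupby("customer_id").agg(
    {"month": ["nunique", "mean"], "num1": ["sum", "max", "min", "mean"]}
)

# flatten the two-level column index: ("num1", "sum") -> "num1_sum"
agg_df.columns = ["_".join(col) for col in agg_df.columns]
agg_df = agg_df.reset_index()
print(agg_df)
```

Each customer ends up as one row, which can then be merged back onto the original data using `customer_id`.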

Sometimes, for example when dealing with time-series problems, you might have
features that are not individual values but lists of values. In these cases, you can create a range of statistical features such as:
- Mean
- Max
- Min
- Unique
- Skew
- Kurtosis
- Kstat
- Percentile
- Quantile
- Peak to peak

In [None]:
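import numpy as np

# x is the list of values for one sample; random data is used
# here purely for illustration
x = np.random.rand(100)

feature_dict = {}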

# calculate mean
feature_dict['mean'] = np.mean(x)

# calculate max
feature_dict['max'] = np.max(x)

# calculate min
feature_dict['min'] = np.min(x)

# calculate standard deviation
feature_dict['std'] = np.std(x)

# calculate variance
feature_dict['var'] = np.var(x)

# peak-to-peak
feature_dict['ptp'] = np.ptp(x)

# percentile features
feature_dict['percentile_10'] = np.percentile(x, 10)
feature_dict['percentile_60'] = np.percentile(x, 60)
feature_dict['percentile_90'] = np.percentile(x, 90)

# quantile features (np.quantile takes fractions in [0, 1])
feature_dict['quantile_5'] = np.quantile(x, 0.05)
feature_dict['quantile_95'] = np.quantile(x, 0.95)
feature_dict['quantile_99'] = np.quantile(x, 0.99)
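
The list above also mentions unique values, skew, kurtosis and kstat, which the NumPy snippet does not cover. A minimal sketch using `scipy.stats` could look like this; the values in `x` and the name `extra_features` are made up for illustration.

```python
import numpy as np
from scipy import stats

# hypothetical list of values for a single sample
x = [1.0, 2.0, 2.0, 3.0, 4.0, 10.0]

extra_features = {
    # number of distinct values
    "n_unique": len(np.unique(x)),
    # asymmetry of the distribution
    "skew": stats.skew(x),
    # excess kurtosis (Fisher definition, scipy's default)
    "kurtosis": stats.kurtosis(x),
    # second k-statistic, i.e. the unbiased variance estimate
    "kstat_2": stats.kstat(x, n=2),
}
print(extra_features)
```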

A list of time-series values can also be converted into many more features using **tsfresh**.

In [None]:
from tsfresh.feature_extraction import feature_calculators as fc
# tsfresh based features
feature_dict['abs_energy'] = fc.abs_energy(x)
feature_dict['count_above_mean'] = fc.count_above_mean(x)
feature_dict['count_below_mean'] = fc.count_below_mean(x)
feature_dict['mean_abs_change'] = fc.mean_abs_change(x)
feature_dict['mean_change'] = fc.mean_change(x)

A simple way to generate many features is to **create a bunch of polynomial features**. For
example, a second-degree polynomial expansion of two features “a” and “b” would
include “a”, “b”, “ab”, “a²” and “b²”.

In [None]:
import numpy as np
import pandas as pd

# generate a random dataframe with
# 2 columns and 100 rows
df = pd.DataFrame(
    np.random.rand(100, 2),
    columns=[f"f_{i}" for i in range(1, 3)]
)

Create second-degree polynomial features using **PolynomialFeatures** from
scikit-learn.

If you have a lot of samples in the dataset, creating these kinds of features
is going to take a while.

In [None]:
from sklearn import preprocessing

# initialize polynomial features class object
# for two-degree polynomial features
pf = preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# fit to the features
pf.fit(df)

# create polynomial features
poly_feats = pf.transform(df)

# create a dataframe with all the features
num_feats = poly_feats.shape[1]
df_transformed = pd.DataFrame(poly_feats,
                              columns=[f"f_{i}" for i in range(1, num_feats + 1)])

**Binning** generates features by dividing the values of a numerical column into N parts (bins).

Binning also enables you to treat
numerical features as categorical.

In [None]:
# create bins of the numerical columns
# 10 bins
df["f_bin_10"] = pd.cut(df["f_1"], bins=10, labels=False)
# 100 bins
df["f_bin_100"] = pd.cut(df["f_1"], bins=100, labels=False)
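
A small sketch of what `pd.cut` with `labels=False` produces: each value is replaced by the integer index of its equal-width bin, and that index can then be cast to a categorical dtype. The input values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical numerical feature with the values 0, 1, ..., 99
values = pd.Series(np.arange(100, dtype=float))

# 10 equal-width bins, returned as integer bin indices 0..9
binned = pd.cut(values, bins=10, labels=False)

# treat the bin index as a categorical feature
binned_cat = binned.astype("category")
print(binned.value_counts().sort_index())
```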

**Log transformation** can be used, for example, to reduce the variance of a feature.

Log and exponential transformations can also be used to optimize the model for the RMSLE metric. In that case,
we train on log-transformed targets and convert the predictions back to the original scale using
the exponential.

In [None]:
# assumes df has a non-negative column f_3 (it is not created above)
# np.log1p(x) computes log(1 + x)
np.log1p(df["f_3"]).var()
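
The RMSLE idea can be sketched as follows: fit on `log(1 + y)` and apply the inverse, `exp(p) - 1`, to the predictions. The synthetic data and the choice of `LinearRegression` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.random((200, 3))
# hypothetical positive, skewed target
y = np.exp(X @ np.array([1.0, 2.0, 0.5]) + rng.normal(0, 0.1, 200))

model = LinearRegression()
# train on log(1 + y)
model.fit(X, np.log1p(y))
# convert predictions back to the original scale
preds = np.expm1(model.predict(X))

# RMSLE between predictions and targets
rmsle = np.sqrt(np.mean((np.log1p(preds) - np.log1p(y)) ** 2))
print(rmsle)
```

Minimizing squared error on the log-transformed target is the same as minimizing squared log error on the original target, which is why this setup suits RMSLE.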

## Fill missing values in numerical data
We can fill missing values with:
- a value that never appears in the data, for example 0 or 9999
- the mean of the numerical feature
- the median of the numerical feature
- the mode of the numerical feature, which is the value that appears most often
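
The simple strategies in this list can be sketched with pandas `fillna`. The column values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical numerical feature with one missing value
col = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

# a constant that does not occur in the data
filled_constant = col.fillna(0)
# mean of the observed values
filled_mean = col.fillna(col.mean())
# median of the observed values
filled_median = col.fillna(col.median())
# mode: the most frequent observed value
filled_mode = col.fillna(col.mode().iloc[0])
```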

A less common way of filling missing values is to use the **k-nearest neighbours
method**: find the k nearest neighbours of the sample with the missing value, then fill it with the mean of those neighbours' values.
You can use the KNN imputer implementation from scikit-learn like this.

In [None]:
import numpy as np
from sklearn import impute

# create a random numpy array with 10 samples
# and 6 features and values ranging from 1 to 14
X = np.random.randint(1, 15, (10, 6))

# convert the array to float
X = X.astype(float)

# randomly assign 10 elements to NaN (missing)
X.ravel()[np.random.choice(X.size, 10, replace=False)] = np.nan

# use 2 nearest neighbours to fill na values
knn_imputer = impute.KNNImputer(n_neighbors=2)
knn_imputer.fit_transform(X)

Another way of imputing missing values in a column is to **train a regression
model that tries to predict the missing values** in that column based on the other columns. You
start with one column that has missing values and treat it as the
target column of the regression model. Using all the other
columns, you train the model on the samples for which there is no missing value in
the target column, and then predict the target (the same column) for the
samples where it is missing.
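
A minimal sketch of this idea, assuming a linear relationship between the columns and using `LinearRegression`. The synthetic data, the variable names and the choice of model are all illustrative; any regressor could be used instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))
# make the last column predictable from the first two
X[:, 2] = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.01, 100)
X_true = X.copy()  # kept only to check the imputation afterwards

# remove 10 values from the last column
X[rng.choice(100, 10, replace=False), 2] = np.nan

target_col = 2
other_cols = [0, 1]
missing = np.isnan(X[:, target_col])

# train on the rows where the target column is present
model = LinearRegression()
model.fit(X[~missing][:, other_cols], X[~missing, target_col])

# predict the target column for the rows where it is missing
X[missing, target_col] = model.predict(X[missing][:, other_cols])
```

This sketch assumes the other columns have no missing values themselves; if they do, those need to be handled first.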

**NOTES:**
- Imputing values is often unnecessary for tree-based models, since many implementations (for example XGBoost and LightGBM) can
handle missing values themselves.
- Always remember to scale or normalize your features if you are using linear models like logistic regression or a model like SVM. Tree-based models generally work fine without any normalization of features.
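
As a short sketch of the scaling note, `StandardScaler` from scikit-learn brings each feature to zero mean and unit variance. The synthetic two-column data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# hypothetical features on very different scales
X = np.column_stack([rng.normal(0, 1, 50), rng.normal(1000, 250, 50)])

scaler = StandardScaler()
# fit on the data and transform it in one step
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

In practice the scaler should be fit on the training data only and then applied to validation and test data with `transform`.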

## Categorical Variables

If you ever encounter missing values in categorical features, treat them as a new category! As simple as this is, it
**(almost) always works**!
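
A minimal sketch with pandas: the missing entries are replaced by a placeholder label, which then behaves like any other category. The label `"NONE"` and the example values are arbitrary choices for illustration.

```python
import pandas as pd

# hypothetical categorical feature with missing values
cat = pd.Series(["red", None, "blue", "red", None])

# treat missing values as their own category
cat_filled = cat.fillna("NONE")

counts = cat_filled.value_counts()
print(counts)
```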