# Feature engineering

Feature engineering is one of the most crucial parts of building a good machine learning model. If we have useful features, the model will perform better. Keep in mind that feature engineering is done in the best possible manner only when you have some knowledge of the problem domain, and it depends a lot on the data concerned. However, there are some techniques that we can try to create features from almost all kinds of numerical and categorical variables. **Feature engineering is not just about creating new features from data; it also includes different types of normalization and transformations.**

Let's start with the simplest but most widely used feature engineering techniques. Let's say you are dealing with **date and time data**. So, we have a pandas dataframe with a datetime type column. Using this column, we can create features like:

- Year
- Week of Year
- Month
- Day of Week
- Weekend
- Hour
- And many more

This can be done very easily using pandas:


In [None]:
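df.loc[:, 'year'] = df['datetime_column'].dt.year
# .dt.isocalendar().week replaces .dt.weekofyear, which pandas 2.0 removed
df.loc[:, 'weekofyear'] = df['datetime_column'].dt.isocalendar().week
df.loc[:, 'month'] = df['datetime_column'].dt.month
df.loc[:, 'dayofweek'] = df['datetime_column'].dt.dayofweek
# weekend flag: 1 for Saturday (5) and Sunday (6), else 0
df.loc[:, 'weekend'] = (df['datetime_column'].dt.weekday >= 5).astype(int)
df.loc[:, 'hour'] = df['datetime_column'].dt.hour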

So, we are creating a bunch of new columns using the datetime column. Let's see some sample features that can be created.

In [None]:
import pandas as pd

# create a series of datetimes with a frequency of 10 hours
# (to_series() is needed because the .dt accessor only exists on a Series)
s = pd.date_range('2020-01-06', '2020-01-10', freq='10h').to_series()

# create some features based on datetime
feature = {
    'dayofweek': s.dt.dayofweek.values,
    'dayofyear': s.dt.dayofyear.values,
    'hour': s.dt.hour.values,
    'is_leap_year': s.dt.is_leap_year.values,
    'quarter': s.dt.quarter.values,
    # ISO week number; .dt.weekofyear is no longer available in pandas 2.x
    'weekofyear': s.dt.isocalendar().week.values
}

This generates a dictionary of features from a given series. You can apply this to any datetime column in a pandas dataframe. These are some of the many datetime features that pandas offers.
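
As a minimal sketch of applying the same idea to a dataframe column, the helper below adds a few datetime features in one go. The function name `add_datetime_features`, the `prefix` parameter and the sample column name `ts` are illustrative assumptions, not part of the code above.

```python
import pandas as pd


def add_datetime_features(df, column, prefix=""):
    """Return a copy of df with features derived from a datetime column.

    Hypothetical helper for illustration; all names are assumptions.
    """
    out = df.copy()
    dt = out[column].dt
    out[prefix + "year"] = dt.year
    out[prefix + "month"] = dt.month
    out[prefix + "dayofweek"] = dt.dayofweek  # Monday=0 ... Sunday=6
    out[prefix + "hour"] = dt.hour
    # ISO week number; isocalendar() is available from pandas 1.1 onwards
    out[prefix + "weekofyear"] = dt.isocalendar().week.astype(int)
    # Saturday (5) and Sunday (6) count as the weekend
    out[prefix + "weekend"] = (dt.dayofweek >= 5).astype(int)
    return out


# small made-up frame: a Monday morning and a Saturday evening
sample = pd.DataFrame(
    {"ts": pd.to_datetime(["2020-01-06 08:00", "2020-01-11 20:00"])}
)
sample_features = add_datetime_features(sample, "ts")
print(sample_features)
```

Working on a copy keeps the input dataframe untouched, which makes the helper safe to call on a slice of a larger frame.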

For example, consider predicting the sales of a store.
We can easily extract features like the year, month, quarter, etc. from the datetime column. Then we have a *customer_id* column which might have multiple entries, so a customer is seen many times, and each date and customer id has a bunch of categorical and numerical data attached. There are a bunch of features we can create from it:

- What is the month a customer is most active in
- What is the count of cat1, cat2, cat3 for a customer
- What is the count of cat1, cat2, cat3 for a customer for a given week of the year
- What is the mean of num1 for a given customer
- And so on

Using aggregates in pandas, it is quite easy to create features like these. Let's see:

In [None]:
def generate_features(df):
    # create a bunch of features using the date column
    df.loc[:, 'year'] = df['date'].dt.year
    df.loc[:, 'month'] = df['date'].dt.month
    df.loc[:, 'weekofyear'] = df['date'].dt.isocalendar().week
    # and so on

    # create an aggregate dictionary
    aggs = {}
    # for month, we calculate the number of unique values and also the mean
    aggs['month'] = ['nunique', 'mean']
    aggs['weekofyear'] = ['nunique', 'mean']
    # for num1, we calculate the sum, max, min and mean values
    aggs['num1'] = ['sum', 'max', 'min', 'mean']
    # for customer_id, we calculate the total count and the number of unique
    # values in a single entry, since assigning the key twice would keep
    # only the last list
    aggs['customer_id'] = ['size', 'nunique']

    # we group by customer_id and calculate the aggregates
    agg_df = df.groupby('customer_id').agg(aggs)
    agg_df = agg_df.reset_index()
    return agg_df
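
The aggregates above cover the numerical and date-derived columns. For the category counts and the most active month mentioned in the list of features, one possible sketch uses `pd.crosstab` and a group-wise mode. The toy data and the column names `cat` and `month` are assumptions made up for this illustration.

```python
import pandas as pd

# made-up transactions: one categorical column and a month per row
toy = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "month": [1, 1, 3, 7, 7],
        "cat": ["cat1", "cat2", "cat1", "cat3", "cat3"],
    }
)

# how often each category value occurs for each customer
cat_counts = pd.crosstab(toy["customer_id"], toy["cat"]).add_prefix("count_")

# the month in which each customer has the most rows
# (ties resolve to the smallest month, since mode() returns sorted values)
most_active_month = (
    toy.groupby("customer_id")["month"]
    .agg(lambda s: s.mode().iloc[0])
    .rename("most_active_month")
)

customer_features = cat_counts.join(most_active_month).reset_index()
print(customer_features)
```

Counting per customer and per week of the year would follow the same pattern, with a list of two key columns passed as the first argument to `pd.crosstab`.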

After aggregating the data, we can join this dataframe with the original dataframe on the *customer_id* column.
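
A minimal sketch of that join, on made-up data, could look like the following. Aggregating with a list of functions produces two-level column names, so this sketch flattens them before merging; the flattened naming scheme (`num1_sum` and so on) is an assumption chosen for readability.

```python
import pandas as pd

# made-up data standing in for the original dataframe
df = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "num1": [10.0, 20.0, 5.0],
    }
)

# aggregate per customer; a list of functions yields two-level column names
agg_df = df.groupby("customer_id").agg({"num1": ["sum", "mean"]})

# flatten ('num1', 'sum') into 'num1_sum' so the merge sees plain column names
agg_df.columns = ["_".join(col) for col in agg_df.columns]
agg_df = agg_df.reset_index()

# a left join keeps every original row and repeats the customer-level values
merged = df.merge(agg_df, on="customer_id", how="left")
print(merged)
```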

Sometimes, for example when dealing with time-series problems, you might have features which are not individual values but a list of values, such as the transactions by a customer in a given period of time. This typically happens with numerical features: when you group on a categorical column, you get a list of values which are distributed over time. In these cases, you can create a bunch of statistical features such as:

- Mean
- Max
- Min
- Unique
- Skew
- Kurtosis
- Kstat
- Percentile
- Quantile
- Peak to Peak
- And many more

These can be created using simple numpy functions, as shown below:

In [None]:
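import numpy as np

# x is a list of numerical values; random data is a hypothetical stand-in
x = np.random.default_rng(42).normal(size=100)

feature_dict = {}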

# calculate mean
feature_dict['mean'] = np.mean(x)
# calculate max
feature_dict['max'] = np.max(x)
# calculate min
feature_dict['min'] = np.min(x)
# calculate std
feature_dict['std'] = np.std(x)
# calculate variance
feature_dict['var'] = np.var(x)
# calculate peak to peak
feature_dict['ptp'] = np.ptp(x)
# percentile features
feature_dict['percentile_10'] = np.percentile(x, 10)
feature_dict['percentile_60'] = np.percentile(x, 60)
feature_dict['percentile_90'] = np.percentile(x, 90)

# quantile features
feature_dict['quantile_5'] = np.quantile(x, 0.05)
feature_dict['quantile_95'] = np.quantile(x, 0.95)
feature_dict['quantile_99'] = np.quantile(x, 0.99)
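
The list of statistical features also mentions unique values, skew, kurtosis and kstat, which the numpy cell does not cover. One way to compute them, assuming SciPy is installed, is sketched below; the sample values and the dictionary name `extra_features` are made up for illustration.

```python
import numpy as np
from scipy import stats

# made-up list of values standing in for x
x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

extra_features = {
    # number of distinct values
    "n_unique": len(np.unique(x)),
    # sample skewness (positive here because of the long right tail)
    "skew": stats.skew(x),
    # excess kurtosis (Fisher definition, 0 for a normal distribution)
    "kurtosis": stats.kurtosis(x),
    # second k-statistic, an unbiased estimator of the variance
    "kstat_2": stats.kstat(x, 2),
}
print(extra_features)
```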

**Time series data** (a list of values) can be converted to a lot of features.
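
To make the conversion concrete, here is a small sketch that turns each customer's list of values into one row of features. The dataframe, the `amount` column and the helper name `list_features` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# made-up transactions: several amounts per customer, in time order
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "amount": [10.0, 30.0, 20.0, 5.0, 15.0],
    }
)


def list_features(values):
    """Summarise one list of values as a flat dict of statistics."""
    x = np.asarray(values, dtype=float)
    return {
        "mean": np.mean(x),
        "max": np.max(x),
        "min": np.min(x),
        "ptp": np.ptp(x),
        "percentile_90": np.percentile(x, 90),
    }


# build one row of features per customer
rows = []
for customer_id, group in transactions.groupby("customer_id"):
    feats = list_features(group["amount"].tolist())
    feats["customer_id"] = customer_id
    rows.append(feats)

list_feature_df = pd.DataFrame(rows).set_index("customer_id")
print(list_feature_df)
```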

A Python library called *tsfresh* is instrumental in this case.

In [None]:
from tsfresh.feature_extraction import feature_calculators as fc 

# tsfresh based features
feature_dict['abs_energy'] = fc.abs_energy(x)
feature_dict['count_above_mean'] = fc.count_above_mean(x)
feature_dict['count_below_mean'] = fc.count_below_mean(x)
feature_dict['mean_abs_change'] = fc.mean_abs_change(x)
feature_dict['mean_change'] = fc.mean_change(x)
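
To build some intuition for what these features measure, the sketch below recomputes them with plain numpy on a tiny made-up series. The formulas reflect my reading of the tsfresh documentation and are meant as an illustration, not as a replacement for the library.

```python
import numpy as np

# made-up series of values standing in for x
x = np.array([1.0, 3.0, 2.0, 6.0])

manual = {
    # sum of squared values
    "abs_energy": np.sum(x ** 2),
    # number of values strictly above / below the mean
    "count_above_mean": int(np.sum(x > np.mean(x))),
    "count_below_mean": int(np.sum(x < np.mean(x))),
    # mean of absolute differences between consecutive values
    "mean_abs_change": np.mean(np.abs(np.diff(x))),
    # mean of signed differences between consecutive values
    "mean_change": np.mean(np.diff(x)),
}
print(manual)
```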