# 7 Feature Engineering

Before starting this section, we need slightly more fine-grained language about our data points. A dataset consists of data points (instances, samples). Each data point is a vector of values called inputs. Processing these inputs according to some criteria yields features. In short, features are input values that have been treated and transformed according to our knowledge and priorities.

Feature engineering is the process of choosing which inputs to use and transforming existing inputs into new features. Beyond being numerical values, features carry information that domain experts, or plain common sense, can interpret. For example, a date can be turned into several pieces of information such as `dayofweek`, `weekofyear`, etc.

In [18]:
import numpy as np
import pandas as pd

s = pd.date_range('2020-01-06', '2020-01-10', freq='10h').to_series()

features = {
    'dayofweek' : s.dt.dayofweek.values,
    'dayofyear' : s.dt.dayofyear.values,
    'hour' : s.dt.hour.values,
    'is_leap_year' : s.dt.is_leap_year.values,
    'quarter' : s.dt.quarter.values,
    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    'weekofyear' : s.dt.isocalendar().week.astype(int).values
}
features

{'dayofweek': array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3]),
 'dayofyear': array([6, 6, 6, 7, 7, 8, 8, 8, 9, 9]),
 'hour': array([ 0, 10, 20,  6, 16,  2, 12, 22,  8, 18]),
 'is_leap_year': array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]),
 'quarter': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'weekofyear': array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}

- On a column that does not have a datetime dtype, the `.dt` accessor is not available.

In [19]:
# Let's use a column without a datetime dtype.
# Accessing .dt on it raises an AttributeError.
df = pd.DataFrame(data=['2012-1-1', '2012-2-1','2016-12-1'], columns=["datetime"])
print(df)
df.datetime.dt.dayofweek

    datetime
0   2012-1-1
1   2012-2-1
2  2016-12-1


AttributeError: Can only use .dt accessor with datetimelike values
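
As a small sketch of how one might guard against this error, the snippet below checks the dtype before touching `.dt`. The variable names (`raw`, `converted`, `dt_error`) are illustrative and not part of the original notebook; `is_datetime64_any_dtype` comes from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical example column holding dates as plain strings
raw = pd.Series(['2012-01-01', '2012-02-01', '2016-12-01'], name='datetime')

# A column of date-like strings is not a datetime column
is_dt_before = is_datetime64_any_dtype(raw)

# Accessing .dt on it raises AttributeError
try:
    raw.dt.dayofweek
    dt_error = None
except AttributeError as exc:
    dt_error = str(exc)

# After conversion the check passes and .dt becomes usable
converted = pd.to_datetime(raw)
is_dt_after = is_datetime64_any_dtype(converted)

print(is_dt_before, is_dt_after)
```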

Now we convert the `datetime` column to a datetime dtype with `pd.to_datetime`.

In [20]:
# Let's use a column with datetime dtype
df = pd.DataFrame(data=['2012-1-1', '2012-2-1','2016-12-1'], columns=["datetime"])
df["datetime"] = pd.to_datetime(df["datetime"])
df.datetime.dt.dayofweek

0    6
1    2
2    3
Name: datetime, dtype: int64

The example `DataFrame` is not provided in the book's repository, so we write the values down manually and add extra records for the same `customer_id`. We start with a NumPy array to show a common pitfall with mixed data types.

In [21]:
columns = ['date', 'customer_id', 'cat1', 'cat2', 'cat3', 'num1']

data = np.array([
    ['2016-09-01', 146361, 2, 2, 0, 0.518679],
    ['2016-09-02', 146361, 3, 2, 1, 4.579128],
    ['2017-04-01', 180838, 4, 1, 0, 0.415853],
    ['2017-04-06', 180838, 1, 0, 0, 2.815853],
    ['2017-08-01', 157857, 3, 3, 1, 2.061687],
    ['2017-08-05', 157857, 0, 2, 1, 9.871232],
    ['2017-10-01', 157857, 0, 0, 1, 12.061687],
    ['2017-12-01', 159772, 5, 1, 1, 0.276558],
    ['2017-09-01', 80014, 3, 2, 1, 1.456827]])
    
df = pd.DataFrame(data=data, columns=columns)

# This turns our str type dates into date type
df['date'] = pd.to_datetime(df.date)

#Checking dtypes of columns
print(df.dtypes)

date           datetime64[ns]
customer_id            object
cat1                   object
cat2                   object
cat3                   object
num1                   object
dtype: object


The output shows that every column built from the NumPy array ends up with `object` dtype. NumPy arrays hold a single data type, so mixing strings and numbers converts every value to a string, and pandas stores those strings as `object`. Numerical features such as `cat1` and `num1` are therefore strings as well. This matters because calling the `agg` method on the present `df.num1` would throw an error saying there is no numerical data to aggregate.
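
The coercion can be seen on a smaller array. The sketch below is illustrative only (the names `mixed` and `demo` are not from the original notebook); it also shows one possible way to recover numeric columns with `astype` and `pd.to_numeric` if the data has already been loaded through NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample mixing strings, ints and floats
mixed = np.array([['2016-09-01', 146361, 0.518679],
                  ['2016-09-02', 146361, 4.579128]])

# NumPy arrays hold a single dtype, so the numbers were converted to strings
array_kind = mixed.dtype.kind   # 'U' stands for a unicode string dtype

demo = pd.DataFrame(mixed, columns=['date', 'customer_id', 'num1'])

# Convert the numeric columns back explicitly
demo['customer_id'] = demo['customer_id'].astype(int)
demo['num1'] = pd.to_numeric(demo['num1'])

# Numeric aggregation works again after the conversion
num1_sum = demo['num1'].sum()
```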

Now we try again with a regular list of mixed data types.

In [22]:
columns = ['date', 'customer_id', 'cat1', 'cat2', 'cat3', 'num1']

data =[
    ['2016-09-01', 146361, 2, 2, 0, 0.518679],
    ['2016-09-02', 146361, 3, 2, 1, 4.579128],
    ['2017-04-01', 180838, 4, 1, 0, 0.415853],
    ['2017-04-06', 180838, 1, 0, 0, 2.815853],
    ['2017-08-01', 157857, 3, 3, 1, 2.061687],
    ['2017-08-05', 157857, 0, 2, 1, 9.871232],
    ['2017-10-01', 157857, 0, 0, 1, 12.061687],
    ['2017-12-01', 159772, 5, 1, 1, 0.276558],
    ['2017-09-01', 80014, 3, 2, 1, 1.456827]]
    
df = pd.DataFrame(data=data, columns=columns)

# This step turns our str type dates into date type
df['date'] = pd.to_datetime(df.date)

#Checking dtypes of columns
df.dtypes
# df.to_csv('data/feature_eng.csv', index=False)

date           datetime64[ns]
customer_id             int64
cat1                    int64
cat2                    int64
cat3                    int64
num1                  float64
dtype: object

Now every column except `date` is either `int` or `float`, so we can safely run the `agg` method on these features.

Iterating over the result of pandas' `groupby` method yields tuples consisting of the group key (the value of the grouping column) and a `DataFrame` with the rows of that group. With this in place, we can create aggregate features.
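
A minimal sketch of that behavior on a toy frame (the frame `toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two rows for customer 1, one row for customer 2
toy = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'num1': [0.5, 4.5, 2.0],
})

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs
groups = {key: group for key, group in toy.groupby('customer_id')}
for key, group in groups.items():
    print(key, len(group), group['num1'].sum())

# agg computes per-group statistics in a single call
summary = toy.groupby('customer_id')['num1'].agg(['sum', 'mean'])
```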

In [23]:
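def generate_date_features(df):
    # Uncomment the next line to avoid modifying the caller's DataFrame
    # df = df.copy(deep=True)
    df.loc[:, 'year'] = df['date'].dt.year
    # Series.dt.weekofyear was removed in pandas 2.0; use isocalendar().week instead
    df.loc[:, 'weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
    df.loc[:, 'month'] = df['date'].dt.month
    df.loc[:, 'dayofweek'] = df['date'].dt.dayofweek
    # Saturday (5) and Sunday (6) count as weekend
    df.loc[:, 'weekend'] = (df['date'].dt.weekday >= 5).astype(int)

    # Map each column to the list of aggregations to compute per customer
    aggs = {}
    aggs['month'] = ['nunique', 'mean']
    aggs['weekofyear'] = ['nunique', 'mean']
    aggs['num1'] = ['sum', 'max', 'min', 'mean']
    # A dict keeps one value per key, so an earlier ['size'] entry was overwritten;
    # add 'size' to this list to also count the rows per customer
    aggs['customer_id'] = ['nunique']

    agg_df = df.groupby('customer_id').agg(aggs)
    agg_df = agg_df.reset_index()
    return agg_df

agg_df = generate_date_features(df)
agg_df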


| | customer_id | month | month | weekofyear | weekofyear | num1 | num1 | num1 | num1 | customer_id |
|---|---|---|---|---|---|---|---|---|---|---|
| | | nunique | mean | nunique | mean | sum | max | min | mean | nunique |
| 0 | 80014 | 1 | 9.0 | 1 | 35.0 | 1.456827 | 1.456827 | 1.456827 | 1.456827 | 1 |
| 1 | 146361 | 1 | 9.0 | 1 | 35.0 | 5.097807 | 4.579128 | 0.518679 | 2.548903 | 1 |
| 2 | 157857 | 2 | 8.666667 | 2 | 33.666667 | 23.994606 | 12.061687 | 2.061687 | 7.998202 | 1 |
| 3 | 159772 | 1 | 12.0 | 1 | 48.0 | 0.276558 | 0.276558 | 0.276558 | 0.276558 | 1 |
| 4 | 180838 | 1 | 4.0 | 2 | 13.5 | 3.231706 | 2.815853 | 0.415853 | 1.615853 | 1 |


Note that the original `df` is also modified, even though the changes happen inside a function: a `DataFrame` is a mutable object, and Python passes arguments by assignment, so the function receives a reference to the same object rather than a copy (for further information check this wonderful [talk](https://www.youtube.com/watch?v=_AEJHKGk9ns) and [blog](https://nedbatchelder.com/text/names.html) post by Ned Batchelder). If this is not the desired behavior, uncomment the first line of the `generate_date_features` function.
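
The difference can be sketched with two small helper functions. Both `add_year_in_place` and `add_year_on_copy` are hypothetical names used only for this illustration.

```python
import pandas as pd

def add_year_in_place(frame):
    # Assigning a new column modifies the object the caller passed in
    frame['year'] = frame['date'].dt.year
    return frame

def add_year_on_copy(frame):
    # Rebinding the local name to a copy leaves the caller's object alone
    frame = frame.copy(deep=True)
    frame['year'] = frame['date'].dt.year
    return frame

original = pd.DataFrame({'date': pd.to_datetime(['2016-09-01', '2017-04-01'])})

result = add_year_on_copy(original)
untouched = 'year' not in original.columns   # the copy absorbed the change

add_year_in_place(original)
mutated = 'year' in original.columns         # the caller's frame now has 'year'
```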

After aggregating, the columns form a `MultiIndex`. The following code flattens it into single-level column names and sets `customer_id` as the new index.

In [24]:
agg_df.columns = agg_df.columns.to_flat_index().map(lambda x: '_'.join(x).rstrip('_'))
agg_df.set_index('customer_id', inplace=True)
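
To see what the flattening step does, here is a small sketch on a toy aggregate (the frame `toy` is made up for illustration). Each column label is a tuple; joining it with `_` and stripping the trailing underscore turns `('customer_id', '')` back into `customer_id`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'customer_id': [1, 1, 2], 'num1': [0.5, 4.5, 2.0]})
agg = toy.groupby('customer_id').agg({'num1': ['sum', 'max']}).reset_index()

# Column labels are tuples, one element per level of the MultiIndex
before = list(agg.columns)

# Join the levels with '_' and drop the trailing '_' left by empty levels
agg.columns = agg.columns.to_flat_index().map(lambda x: '_'.join(x).rstrip('_'))
after = list(agg.columns)

print(before)
print(after)
```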

To merge `agg_df` with the original data, its rows must line up with the rows of `df`. Using the previous trick, we index `agg_df` with `df['customer_id']`, which selects the matching row of `agg_df` once for every row of `df`, repeating rows as needed.

In [25]:
agg_df = agg_df.loc[df['customer_id'],:].reset_index()
agg_df = agg_df.drop('customer_id', axis=1)
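
The row-repeating behavior of `.loc` can be sketched on a tiny lookup table. The names `lookup`, `keys` and `expanded` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical per-customer aggregate, indexed by customer_id
lookup = pd.DataFrame({'num1_sum': [5.0, 2.0]},
                      index=pd.Index([1, 2], name='customer_id'))

# One key per row of the (hypothetical) original data; customer 1 appears three times
keys = pd.Series([1, 1, 2, 1], name='customer_id')

# Passing labels with repeats to .loc returns the matching row once per label
expanded = lookup.loc[keys, :].reset_index()

print(expanded)
```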

Finally, we concatenate the two `DataFrame`s column-wise.

In [26]:
df = pd.concat([df,agg_df], axis=1)
df

| | date | customer_id | cat1 | cat2 | cat3 | num1 | year | weekofyear | month | dayofweek | weekend | month_nunique | month_mean | weekofyear_nunique | weekofyear_mean | num1_sum | num1_max | num1_min | num1_mean | customer_id_nunique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-09-01 | 146361 | 2 | 2 | 0 | 0.518679 | 2016 | 35 | 9 | 3 | 0 | 1 | 9.0 | 1 | 35.0 | 5.097807 | 4.579128 | 0.518679 | 2.548903 | 1 |
| 1 | 2016-09-02 | 146361 | 3 | 2 | 1 | 4.579128 | 2016 | 35 | 9 | 4 | 0 | 1 | 9.0 | 1 | 35.0 | 5.097807 | 4.579128 | 0.518679 | 2.548903 | 1 |
| 2 | 2017-04-01 | 180838 | 4 | 1 | 0 | 0.415853 | 2017 | 13 | 4 | 5 | 1 | 1 | 4.0 | 2 | 13.5 | 3.231706 | 2.815853 | 0.415853 | 1.615853 | 1 |
| 3 | 2017-04-06 | 180838 | 1 | 0 | 0 | 2.815853 | 2017 | 14 | 4 | 3 | 0 | 1 | 4.0 | 2 | 13.5 | 3.231706 | 2.815853 | 0.415853 | 1.615853 | 1 |
| 4 | 2017-08-01 | 157857 | 3 | 3 | 1 | 2.061687 | 2017 | 31 | 8 | 1 | 0 | 2 | 8.666667 | 2 | 33.666667 | 23.994606 | 12.061687 | 2.061687 | 7.998202 | 1 |
| 5 | 2017-08-05 | 157857 | 0 | 2 | 1 | 9.871232 | 2017 | 31 | 8 | 5 | 1 | 2 | 8.666667 | 2 | 33.666667 | 23.994606 | 12.061687 | 2.061687 | 7.998202 | 1 |
| 6 | 2017-10-01 | 157857 | 0 | 0 | 1 | 12.061687 | 2017 | 39 | 10 | 6 | 1 | 2 | 8.666667 | 2 | 33.666667 | 23.994606 | 12.061687 | 2.061687 | 7.998202 | 1 |
| 7 | 2017-12-01 | 159772 | 5 | 1 | 1 | 0.276558 | 2017 | 48 | 12 | 4 | 0 | 1 | 12.0 | 1 | 48.0 | 0.276558 | 0.276558 | 0.276558 | 0.276558 | 1 |
| 8 | 2017-09-01 | 80014 | 3 | 2 | 1 | 1.456827 | 2017 | 35 | 9 | 4 | 0 | 1 | 9.0 | 1 | 35.0 | 1.456827 | 1.456827 | 1.456827 | 1.456827 | 1 |
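
The same alignment can also be obtained with a left `merge` on `customer_id`, which skips the manual row selection. This is an alternative sketch, not the approach used above; the frames `rows` and `per_customer` are toy data made up for illustration.

```python
import pandas as pd

# Hypothetical row-level data
rows = pd.DataFrame({'customer_id': [1, 1, 2], 'num1': [0.5, 4.5, 2.0]})

# Per-customer aggregates with flat, prefixed column names
per_customer = (rows.groupby('customer_id')['num1']
                    .agg(['sum', 'mean'])
                    .add_prefix('num1_')
                    .reset_index())

# A left merge attaches each customer's aggregates to every one of its rows
merged = rows.merge(per_customer, on='customer_id', how='left')

print(merged)
```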
