# 7 Feature Grouping Operations

In most machine learning algorithms, every instance is represented by one row in the training dataset, and every column
holds a different feature of that instance. Data in this form is called a **tidy dataset**: each variable is a column,
each observation is a row, and each type of observational unit is a table.

However, datasets such as transaction logs rarely fit this definition, because a single instance is spread over
multiple rows. Such data must be transformed into a tidy dataset before it can be used.

The figure below shows a dataset of city visits per user. Multiple rows represent the same user, so to use the data
we need to group it into a tidy dataset with one row per user.

![fe_grouping.png](../img/fe_grouping.png)
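Whether a dataset already has one row per instance can be checked before any grouping is attempted. The sketch below is a minimal illustration that rebuilds the seven example records used in this section; the variable names `rows_per_user` and `is_one_row_per_user` are introduced here for illustration only.

```python
import pandas as pd

# The same seven records as this section's example dataset
df = pd.DataFrame({
    "User": [1, 2, 1, 3, 2, 1, 1],
    "City": ["Roma", "Madrid", "Madrid", "Istanbul", "Istanbul", "Istanbul", "Roma"],
    "Days": [1, 2, 1, 1, 4, 3, 3],
})

# Number of rows that belong to each user
rows_per_user = df.groupby("User").size()

# True only when every user appears exactly once
is_one_row_per_user = not df["User"].duplicated().any()

print(rows_per_user)
print(is_one_row_per_user)
```

If the check reports repeated users, as it does here, one of the grouping strategies in this section is needed.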

There are many ways to do this; here we show three:

1. Highest frequency
2. Pivot table
3. Group by after one-hot encoding

In [4]:
import pandas as pd
import numpy as np

In [2]:
data=[
    {"User":1,"City":"Roma","Days":1},
    {"User":2,"City":"Madrid","Days":2},
    {"User":1,"City":"Madrid","Days":1},
    {"User":3,"City":"Istanbul","Days":1},
    {"User":2,"City":"Istanbul","Days":4},
    {"User":1,"City":"Istanbul","Days":3},
    {"User":1,"City":"Roma","Days":3}
]
df=pd.DataFrame(data)

In [3]:
df.head()

Unnamed: 0,User,City,Days
0,1,Roma,1
1,2,Madrid,2
2,1,Madrid,1
3,3,Istanbul,1
4,2,Istanbul,4


## 7.1 Highest frequency

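**Highest frequency** selects the label that occurs most often within each group. In other words, it is the max operation for categorical columns. Ordinary max functions do not return this value, so you need a lambda function for this purpose. Note that the lambda below is applied to every non-grouping column, so `Days` is also reduced to its most frequent value, and when several values tie for the highest count, only one of them is returned.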

In [7]:
df_hf=df.groupby("User").agg(lambda x:x.value_counts().index[0])

In [8]:
df_hf.head()

Unnamed: 0_level_0,City,Days
User,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Roma,1
2,Madrid,2
3,Istanbul,1
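The tie-breaking behaviour deserves attention, because a user can visit two cities equally often. The sketch below is one possible alternative, not the method used above: it relies on `Series.mode`, which returns all tied values in sorted order, and keeps the first one. The helper name `most_frequent` is hypothetical.

```python
import pandas as pd

# The same seven records as this section's example dataset
df = pd.DataFrame({
    "User": [1, 2, 1, 3, 2, 1, 1],
    "City": ["Roma", "Madrid", "Madrid", "Istanbul", "Istanbul", "Istanbul", "Roma"],
    "Days": [1, 2, 1, 1, 4, 3, 3],
})

def most_frequent(values: pd.Series):
    """Return the most frequent value; ties go to the smallest value in sort order."""
    return values.mode().iloc[0]

# Aggregate only the categorical column, one row per user
city_mode = df.groupby("User")["City"].agg(most_frequent)
print(city_mode)
```

With this rule, user 2 (one visit each to Madrid and Istanbul) is assigned Istanbul because it sorts first alphabetically. Whichever rule you choose, it is worth making it explicit, since different tie-breaking rules can give different results.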


## 7.2 Pivot table

This approach resembles the encoding method in the preceding step, with one difference: instead of binary flags, each cell holds an aggregate of a value column, computed for each combination of the grouped column and the encoded column. It is a good option if you want to go beyond binary flag columns and merge multiple features into aggregated features, which are more informative.

The general form:

```python
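# Passing the string "sum" avoids the FutureWarning that recent pandas versions raise for aggfunc=np.sum
df.pivot_table(index='column_to_group', columns='column_to_encode', values='aggregation_column', aggfunc="sum", fill_value=0)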
```

In [11]:
# In our case:
# - we group by the "User" column
# - the column to encode is "City"
# - the aggregation column is "Days"
# fill_value=0 replaces the missing User/City combinations with 0
df_pivot=df.pivot_table(index='User', columns='City', values='Days', aggfunc="sum", fill_value=0)

In [12]:
df_pivot.head()

City,Istanbul,Madrid,Roma
User,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,3,1,4
2,4,2,0
3,1,0,0
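It can help to see that a pivot table is a group-by followed by a reshape. The sketch below, written under the assumption that the same seven example records are used, builds the table both ways; the name `df_unstacked` is introduced here for illustration only.

```python
import pandas as pd

# The same seven records as this section's example dataset
df = pd.DataFrame({
    "User": [1, 2, 1, 3, 2, 1, 1],
    "City": ["Roma", "Madrid", "Madrid", "Istanbul", "Istanbul", "Istanbul", "Roma"],
    "Days": [1, 2, 1, 1, 4, 3, 3],
})

# Pivot table: one row per user, one column per city, summed days in each cell
df_pivot = df.pivot_table(index="User", columns="City", values="Days",
                          aggfunc="sum", fill_value=0)

# Same result in two explicit steps: aggregate per (User, City), then move City into columns
df_unstacked = df.groupby(["User", "City"])["Days"].sum().unstack(fill_value=0)

print(df_pivot)
print(df_unstacked)
```

The two-step form is convenient when the intermediate per-pair aggregate is also needed elsewhere.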


## 7.3 Group by after one-hot encoding

The last categorical grouping option is to apply a group-by function after one-hot encoding. This method preserves all the data (the first option loses some) and, at the same time, transforms the encoded column from categorical to numerical. You can check the next section for the explanation of numerical column grouping.

In [14]:
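encode_col_name="City"
# dtype=int keeps the flags as 0/1; recent pandas versions return booleans by default
encoded_columns = pd.get_dummies(df[encode_col_name], dtype=int)
# Attach the flag columns and drop the original categorical column
df_encoded = df.join(encoded_columns).drop(encode_col_name, axis=1)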

In [15]:
df_encoded.head()

Unnamed: 0,User,Days,Istanbul,Madrid,Roma
0,1,1,0,0,1
1,2,2,0,1,0
2,1,1,0,1,0
3,3,1,1,0,0
4,2,4,1,0,0


In [18]:
# sum_cols: list of columns to sum
# mean_cols: list of columns to average
sum_cols=["Istanbul","Madrid","Roma"]
mean_cols=["Istanbul","Madrid","Roma"]

grouped = df_encoded.groupby('User')

In [19]:
# On a GroupBy object, head() returns the first 5 rows of each group, in the original row order
grouped.head()

Unnamed: 0,User,Days,Istanbul,Madrid,Roma
0,1,1,0,0,1
1,2,2,0,1,0
2,1,1,0,1,0
3,3,1,1,0,0
4,2,4,1,0,0
5,1,3,1,0,0
6,1,3,0,0,1


In [None]:
sums = grouped[sum_cols].sum().add_suffix('_sum')
avgs = grouped[mean_cols].mean().add_suffix('_avg')

df_nc = pd.concat([sums, avgs], axis=1)

In [10]:
df_nc.head()

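Unnamed: 0_level_0,Istanbul_sum,Madrid_sum,Roma_sum,Istanbul_avg,Madrid_avg,Roma_avg
User,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,1,1,2,0.25,0.25,0.5
2,1,1,0,0.5,0.5,0.0
3,1,0,0,1.0,0.0,0.0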
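The sum and mean aggregations can also be written in a single call with pandas named aggregation, which lets each output column be named explicitly. The sketch below is one possible variant under the same example data; the output column names (`Roma_visits`, `Days_sum`, and so on) are chosen here for illustration and do not come from the steps above.

```python
import pandas as pd

# The same seven records as this section's example dataset
df = pd.DataFrame({
    "User": [1, 2, 1, 3, 2, 1, 1],
    "City": ["Roma", "Madrid", "Madrid", "Istanbul", "Istanbul", "Istanbul", "Roma"],
    "Days": [1, 2, 1, 1, 4, 3, 3],
})

# One-hot encode City as 0/1 integer flags and drop the original column
encoded = df.join(pd.get_dummies(df["City"], dtype=int)).drop(columns="City")

# Named aggregation: output_name=(input_column, aggregation)
summary = encoded.groupby("User").agg(
    Istanbul_visits=("Istanbul", "sum"),
    Madrid_visits=("Madrid", "sum"),
    Roma_visits=("Roma", "sum"),
    Days_sum=("Days", "sum"),
    Days_avg=("Days", "mean"),
)
print(summary)
```

Summing a one-hot column counts the rows in which that city appears, while `Days` is aggregated as an ordinary numerical column, so the result keeps both kinds of information in one row per user.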
