# Feature Engineering

So far, we have assumed that all input features are continuous variables. In practice, however, most machine learning problems also include categorical or discrete features, such as brands or types of clothing.

We have to decide how to convert this kind of data into numbers that a machine learning algorithm can use, and this choice has a large effect on model performance. This process is called feature engineering. It often matters more than tuning parameters or hyperparameters.

## 1. Categorical Variable

For example, a dataset may contain features such as workclass, gender and occupation. Most machine learning algorithms cannot use these directly, so they have to be converted into numeric values.

### 1.1 One-hot encoding

Assume there is one categorical feature, workclass, with four possible values: A, B, C and D.

One-hot encoding replaces this single categorical feature with four new binary features, one per value. For each sample, the feature that matches its value is 1 and the other three are 0.
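The idea can be sketched in plain Python before turning to pandas. This is only an illustration: the sample values and the helper name one_hot are made up for this example and are not part of any library.

```python
# Hypothetical toy data: one categorical feature with four possible values.
categories = ["A", "B", "C", "D"]
workclass = ["A", "C", "B", "D", "A"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == c else 0 for c in categories]


# Each sample becomes a row of four binary features.
encoded = [one_hot(v, categories) for v in workclass]

for value, row in zip(workclass, encoded):
    print(value, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.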

In [1]:
import os
import mglearn
import pandas as pd

data = pd.read_csv(os.path.join(mglearn.datasets.DATA_PATH, "adult.data"),
                   header=None, index_col=False,
                   names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                          'marital-status', 'occupation', 'relationship', 'race', 'gender',
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
                          'income'])

In [2]:
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]

display(data.head())

| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |


##### Encode using pandas

In [3]:
print("Feature:\n", list(data.columns), '\n')
data_dummies = pd.get_dummies(data)
print("get_dummies feature:\n", list(data_dummies.columns))

Feature:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

get_dummies feature:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupati

In [4]:
data_dummies.head()

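| | age | hours-per-week | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | ... | occupation_ Machine-op-inspct | occupation_ Other-service | occupation_ Priv-house-serv | occupation_ Prof-specialty | occupation_ Protective-serv | occupation_ Sales | occupation_ Tech-support | occupation_ Transport-moving | income_ <=50K | income_ >50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 50 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 38 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 53 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 28 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |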


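get_dummies replaces every categorical feature with one binary column per distinct value and leaves the numeric features (age, hours-per-week) unchanged.

Next, we separate the target (income) from the features and convert both to NumPy arrays with the .values attribute. The slice below ends at the last occupation column, so the two income columns are not included in the features.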

In [5]:
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']

X = features.values
y = data_dummies['income_ >50K'].values

Now the data is in a form that scikit-learn can use.
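As a rough sketch of what "usable by scikit-learn" means, the same steps can be run on a tiny hand-made table. The toy rows below are invented for illustration, and LogisticRegression is just one possible estimator chosen here; the original text does not prescribe a model. Scoring on the training rows only demonstrates the API and says nothing about real performance.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature version of the dataset (values are made up).
toy = pd.DataFrame({
    "age": [25, 38, 47, 52, 29, 41],
    "workclass": ["Private", "State-gov", "Private",
                  "Self-emp", "Private", "State-gov"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

# One-hot encode the string columns; 'age' stays numeric.
toy_dummies = pd.get_dummies(toy)

# Drop both income columns from the features and keep one of them as the target.
features = toy_dummies.drop(columns=["income_<=50K", "income_>50K"])
X = features.to_numpy(dtype=float)
y = toy_dummies["income_>50K"].to_numpy(dtype=int)

# Any scikit-learn estimator accepts the arrays from here on.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(X.shape, y.shape)
print(model.predict(X))
```

Passing dtype explicitly to to_numpy avoids surprises on newer pandas versions, where get_dummies returns boolean columns by default.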

### 1.2 Categorical features with numeric values

What if the workclass feature is stored as numbers, for example 0 for private, 1 for self-..., and so on?

In that case the numbers are only labels, not quantities, so the dataset still cannot be fed directly to a machine learning algorithm. We have to make it explicit that this column is categorical.

### get_dummies or something else?

The get_dummies function treats every numeric column as continuous, so it does not create dummy variables for such columns.

Instead of get_dummies, we can use OneHotEncoder from scikit-learn and apply it only to the features we choose.
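A minimal sketch of the OneHotEncoder route is shown here. The column name workclass_code and the integer codes are assumptions made for this example. Only the chosen column is passed to the encoder, so age is left alone.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: workclass stored as integer codes instead of strings.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass_code": [2, 1, 0, 0],
})

# Fit the encoder on the selected column only. The default output is a
# sparse matrix, so it is converted to a dense array for printing.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["workclass_code"]]).toarray()

print(encoder.categories_)  # the distinct codes found in the column
print(encoded)              # one binary column per code
```

Unlike get_dummies, the encoder does not care that the column is an integer type: every distinct value in the columns it is given becomes its own category.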

Alternatively, we can keep using get_dummies and tell it explicitly which columns to encode.

In [6]:
demo_df = pd.DataFrame({"Numeric Feature": [0, 1, 2, 1],
                        "Categorical Feature": ['Socks', 'Fox', 'Socks', 'Box']})

display(demo_df)

| | Categorical Feature | Numeric Feature |
|---|---|---|
| 0 | Socks | 0 |
| 1 | Fox | 1 |
| 2 | Socks | 2 |
| 3 | Box | 1 |


##### Using get_dummies directly

In [7]:
display(pd.get_dummies(demo_df))

| | Numeric Feature | Categorical Feature_Box | Categorical Feature_Fox | Categorical Feature_Socks |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 2 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 0 |


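The Numeric Feature column is left untouched; only the string column is encoded.

To encode it as well, we have to list the columns explicitly with the columns parameter. Here the column is also converted to strings first, so that its values are treated as category labels.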

In [8]:
demo_df["Numeric Feature"] = demo_df["Numeric Feature"].astype(str)

display(pd.get_dummies(demo_df, columns=["Numeric Feature", "Categorical Feature"]))

| | Numeric Feature_0 | Numeric Feature_1 | Numeric Feature_2 | Categorical Feature_Box | Categorical Feature_Fox | Categorical Feature_Socks |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 |
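As a variation on the cell above, the following sketch skips the string conversion and relies on the columns parameter alone, which should be sufficient because get_dummies encodes every column listed there regardless of its dtype. The dtype=int argument is an addition for this example: newer pandas versions return boolean columns by default, and it requests 0/1 integers like the output shown above.

```python
import pandas as pd

demo = pd.DataFrame({"Numeric Feature": [0, 1, 2, 1],
                     "Categorical Feature": ["Socks", "Fox", "Socks", "Box"]})

# The integer column is encoded because it is named in `columns`;
# no astype(str) is applied in this variant.
dummies = pd.get_dummies(demo,
                         columns=["Numeric Feature", "Categorical Feature"],
                         dtype=int)

print(list(dummies.columns))
print(dummies)
```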
