# Feature Engineering with Learn-It

In this notebook, you will learn how to develop your own feature extractor. The idea of a new feature does not come out spontaneously. Instead, you might need to analyze erros with a current feature set and consider what would be a good next step. 

The workflow is often called "feature engineering" and is considered an art of data scientists and machine learning engineers. Feature engineering is more promising work than model selection and hyper-parameter tuning. 

A general workflow of feature engineering looks like as follows:

- Step 1 (Ideation) Think of missing information that does not exist in the current feature set.
- Step 2 (Development) Develop the idea in Python code.
- Step 3 (Verification) Verify if the developed code works.
- Step 4 (Deployment) Incorporate the developed code into the framework.

Step 1 is the most important step which cannot be automated and has no silver bullet for it. ML engineers always need to carefully analyze errors with the current feature set and try to discuss if there is room for improvement following their domain knowledge on the task.

Learn-It offers intuitive workflow for Step 2 - 4 so you can focus on Step 1.

## How to Develop Your Own Feature Extraction Algorithm?

Learn-It currently uses `scikit-learn` as a machine-learning core library. In this notebook, we refer a "feature extractor" as a "transformer" so please consider these words are interchangeable. 

Conceptually, a transformer can take into account more than one columns to derive feature values from them. For simplicity, we only consider "one-column" feature extractors in this notebook. 

Below is a figure that shows the image of how a transformer extracts feature values from a column.

<image>

Here is an example of a user-defined transformer class.

## 1. Example: Name Title Extractor

The following example transformer extracts and encodes "Title" information of person names into 0/1-valued vectors.


In [1]:
from sklearn.base import BaseEstimator, TransformerMixin

class TitleEncoder(BaseEstimator, TransformerMixin):
    def __init__(self,
                 title_list=None):
        if title_list is None:
            title_list = ["Mr.",
                          "Ms."
                          "Mrs.",
                          "Miss."]
        self.title_list = title_list

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is np.array
        s = pd.Series(X[:, 0])
        s_list = []
        for title in self.title_list:
            s_list.append(s.apply(lambda x: title in x).astype("int"))
        df = pd.concat(s_list, axis=1)
        df.loc[:, "None"] = (df.sum(axis=1) == 0).astype("int")
        return df.as_matrix()

    def get_feature_names(self):
        return self.title_list + ["Others"]

Let's take a look at each function for detail.

### `__init__()`

This `__init__` function receives `title_list` so that the user can customize
the title list. For instance, someone might want to incorporate "Prof." and
"Dr." in addition to the common titles. In this case, the four titles will be
used by default.


### `fit()`

This example does nothing in the training phase. The function must return
`self` by definition. Please note that this function should store information 
if the transformer fits data dynamically to define features. For instance, Bag-of-Words transformer should keep vocaburaly information in `fit()` function. In this example, the order of features are defined by the order of elements in `self.title_list` so `fit()` has to do nothing.


### `transform()`

This is the main function of the feature transformer class. In this example, `transform()` judges if any titles appear in each person name. If no title appears, it activates "Other" flag. 

The input variable `X` is `np.ndarray` of the shape `(N, 1)` where `N` is the total number of rows in the input data. Therefore, you need to convert `np.ndarray` into any convenient data structure for your feature extraction algorithm.


## 2. Test Your Transformer Class

It is very difficult to implement a transformer class without any bug from the beginning. Learn-It offers a test function that verifies if your own transformer class can successfully extract feature values from input data.

In [2]:
import sys
import pandas as pd
sys.path.append("../")
from learnit import AutoConverter
# TODO: from learnit import check_transformer
from learnit.autoconverter.autoconverter import check_transformer

input_df = pd.read_csv("data/train.csv")
# It returns extracted `X` in the form of `np.ndarray`
X = check_transformer(input_df, "Name", TitleEncoder())

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


Feature(s) successfully extracted! :)
# of features=4


## 3. How to Incorporate Your Transformer to Learn-It

Now, `TitleEncoder` is ready to use for the Titanic dataset.

In [3]:
# This is how we use AutoConverter with default settings
ac_old = AutoConverter(target="Survived")
X_old, y_old = ac_old.fit_transform(input_df)

# This is how we make AutoConverter apply "TitleEncoder"  to "Name" column of the dataset
ac_new = AutoConverter(target="Survived",
                                             column_converters={"Name": [(TitleEncoder, {})]})
X_new, y_new = ac_new.fit_transform(input_df)



Let's take a look if the configuration changes the output

In [4]:
print("X_old.shape={}".format(X_old.shape))
print("X_new.shape={}".format(X_new.shape))

X_old.shape=(881, 1215)
X_new.shape=(881, 708)


The new `AutoConverter` instance extracts fewer features than before because the new one extracts only 5 features instead of several hundreds of TF-IDF weighted bag-of-word features.

The default setting "overwrites" transformers applied to a target column ("Name" in this example) so default transformers such as `TfIdfVectorizer` for "textual" type columns will NOT be applied to the column if you manually configure `column_converters`.

If you want to "add" your own transformer in addition to default transformers. Use `use_column_converter_only=False` option with `AutoConverter`.

In [5]:
ac_new2 = AutoConverter(target="Survived",
                                               column_converters={"Name": [(TitleEncoder, {})]},
                                               use_column_converter_only=False)
X_new2, y_new2 = ac_new2.fit_transform(input_df)
print("X_new2.shape={}".format(X_new2.shape))



X_new2.shape=(881, 1219)


## 4. Summary: Transformer Template Class

This notebook has introduced how to develop, verify and deploy your own transformer with Learn-It. The workflow will involve a lot of trial-and-errors but Learn-It should reduce the significant amount of workload on the feature engineering tasks.

Below is a transformer template so you can copy and use it.

In [6]:
class TransformerTemplate(BaseEstimator, TransformerMixin):
    def __init__(self, *args):
        ##
        ## Implement initialization step
        ##
        pass
    
    def fit(self, X, y=None):
        ##
        ## Implement training logic if necessary
        ## Leave it if the transformer does nothing in the training phase
        ##
        return self

    def transform(self, X):
        ## 
        ## Implement a transform function that returns a feature vector/matrix.
        ## The returned value should be (N, M) where N is the number of input data
        ## and M is the total number of features that are extracted by this function.
        ##
        return [[1] * len(X)]

    def get_feature_names(self):
        ##
        ## [Optional] Describe feature name(s)
        ##
        return ["feature name"]