In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

Before defining a custom transformer, note that all transformers in scikit-learn (and in scikit-learn-compatible libraries such as feature-engine) are implemented as Python classes, each with its own attributes and methods.
* Our custom transformer must therefore also be a class that exposes the same methods: fit(), transform(), fit_transform(), etc. We write fit() and transform() ourselves and pick up the rest by inheriting from two scikit-learn base classes: TransformerMixin and BaseEstimator.

For that, we will need two base classes from scikit-learn.
* `BaseEstimator`: According to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html), it is a "base class for all estimators in scikit-learn". We will not go into its internals; what matters here is that it supplies the standard estimator plumbing, such as `get_params()` and `set_params()`.
* `TransformerMixin`: According to the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html), it is a mixin class for all transformers in scikit-learn. It provides `fit_transform()`, which calls our `fit()` and then our `transform()`.
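
To make the division of labour concrete, here is a minimal sketch of what each base class contributes. `AddConstant` and the toy DataFrame are hypothetical names used only for illustration; they are not part of this notebook's dataset or of scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical toy transformer: adds a constant to every value
class AddConstant(BaseEstimator, TransformerMixin):
    def __init__(self, constant=1):
        self.constant = constant

    def fit(self, X, y=None):
        # Nothing to learn in this toy example
        return self

    def transform(self, X):
        return X + self.constant

toy = pd.DataFrame({"a": [1, 2, 3]})
adder = AddConstant(constant=10)

params = adder.get_params()        # inherited from BaseEstimator
result = adder.fit_transform(toy)  # inherited from TransformerMixin

print(params)
print(result)
```

We only wrote `__init__`, `fit` and `transform`; `get_params()` and `fit_transform()` came from the base classes.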

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin

In feature-engine (and scikit-learn), there is already a transformer that replaces missing values with the mean. But let's imagine there wasn't, and we want to create `MyCustomTransformerForMeanImputation()`.
* Let's follow the comments in the code to understand the steps

In [7]:
import pandas as pd # to use .mean()

# We will define three methods for the class: __init__, fit and transform
# fit_transform() will be inherited from TransformerMixin

# Define your transformer name, and as an argument inherit the base classes
class MyCustomTransformerForMeanImputation(BaseEstimator, TransformerMixin):

  #### Here, you define the variables you need to pass when you initialize the class
  def __init__(self, variables):
    # We make sure the variables will be a list, even if only one element
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables

  #### Here is where the learning happens. We perform the operation we are interested in
  #### In this case, calculate the mean
  # X is the input data (the features): typically a DataFrame or 2D array of shape
  # (n_samples, n_features), where n_samples is the number of rows (data points)
  # and n_features is the number of columns (features per data point).
  #
  # y=None: y holds the target (label) values in supervised learning, i.e. the answer
  # the model is trained to predict from X. It is optional: when no targets are passed
  # (for example, in unsupervised learning), it defaults to None. Our imputer ignores y,
  # but keeping it in the signature makes fit() compatible with scikit-learn pipelines.

  def fit(self, X, y=None):
   
    # We want to keep the mean value in a dictionary
    # imputer_dict_ is typically used to store information about how the missing values were imputed (filled) in a dataset.
    self.imputer_dict_ = {}
      
    # loop over each variable, calculate the mean and save it in the dictionary.  
    # For each feature, it computes the mean of the values in that feature (X[feature]) and stores it in the imputer_dict_ dictionary.
    # X[feature] accesses the values of the column (feature) feature from the dataset X, and .mean() calculates the mean of that column.
    # The resulting mean value is stored in self.imputer_dict_ with the key as the feature name and the value as the mean.
    for feature in self.variables:
        self.imputer_dict_[feature] = X[feature].mean()
    
    return self

  #### Here, you transform the variables based on what you learned in the .fit()
  #### You can transform into the train set, test set or real-time data
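  def transform(self, X):
    # Work on a copy so the caller's DataFrame is not modified
    X = X.copy()
    # loop over the variables and fill the missing values in each feature
    # with the mean learned for that feature in .fit()
    # (assigning the result back avoids chained inplace=True, which newer pandas versions warn about)
    for feature in self.variables:
      X[feature] = X[feature].fillna(self.imputer_dict_[feature])

    return X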
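
To see the fit/transform split in action, the sketch below learns a mean on one toy DataFrame and applies it to another. `MeanImputerSketch` is a compact, hypothetical stand-in for the class above so that the example runs on its own; the `price` column and its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Compact stand-in for the mean imputer defined above (hypothetical name)
class MeanImputerSketch(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn one mean per variable from the data passed to fit()
        self.imputer_dict_ = {col: X[col].mean() for col in self.variables}
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.variables:
            X[col] = X[col].fillna(self.imputer_dict_[col])
        return X

train = pd.DataFrame({"price": [1.0, 3.0, np.nan]})
test = pd.DataFrame({"price": [np.nan, 10.0]})

imputer = MeanImputerSketch(variables=["price"]).fit(train)
test_filled = imputer.transform(test)

print(imputer.imputer_dict_)
print(test_filled)
```

The missing value in `test` is filled with the mean learned from `train`, not with the mean of `test` itself, which is the behaviour we want when the same transformer is later applied to unseen data.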

You may create a custom transformer where `.fit()` has nothing to learn. For example, imagine you want to apply the upper case method to all the variables. There is nothing to learn from the data; you only need to execute the operation.
* Let's create this transformer and call it `ConvertUpperCase()`

In [12]:
# The comments relate to the new concepts for this exercise

class ConvertUpperCase(BaseEstimator, TransformerMixin):
  def __init__(self, variables):
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables

  # We don't need to learn anything here; we just return self
  # We need to do that anyway to be compatible with scikit-learn format
  def fit(self, X, y=None):
      return self

  # Here, we convert the variables using the pandas string method .str.upper()
  # We loop over all the variables, check if each one is an object column, and then...
  # ...upper-case all rows (.str.upper() leaves missing values as NaN instead of raising an error)
  def transform(self, X):
    X = X.copy()  # avoid modifying the caller's DataFrame
    for feature in self.variables:
      if X[feature].dtype == 'object':
        X[feature] = X[feature].str.upper()
      else:
        print(f"Warning: {feature} data type should be object to use ConvertUpperCase()")

    return X
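
For stateless steps like this one, scikit-learn also offers `FunctionTransformer`, which wraps a plain function so that it can be used as a pipeline step without writing a class. The sketch below is an alternative to `ConvertUpperCase()`, not a replacement for it; `upper_case_country` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical helper: upper-case the 'Country' column of a DataFrame
def upper_case_country(X):
    X = X.copy()
    X["Country"] = X["Country"].str.upper()
    return X

upper = FunctionTransformer(upper_case_country)

toy = pd.DataFrame({"Country": ["France", "united kingdom"]})
out = upper.fit_transform(toy)

print(out)
```

A custom class remains the better fit when the step needs constructor arguments such as `variables`, or when it has to learn something in `.fit()`.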

We will use the 'Online_Retail' dataset, which contains information on transactions made by customers through an online retail platform. The dataset includes data on the products that were purchased, the quantity of each product, the date and time of each transaction, the price of each product, the unique identifier for each customer who made a purchase, and the country where each customer is located. 
* We check for missing data

df = pd.read_csv('Online_Retail.csv')
df = df.astype({'CustomerID':'object'})
df.isnull().sum()
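
Because our custom transformers select columns by name (`X[feature]`), a column that is absent from the loaded DataFrame raises a `KeyError` during `.fit()`. A quick check before building the pipeline can catch this early. The sketch below uses a hypothetical helper, `missing_columns`, and a toy DataFrame rather than the real file.

```python
import pandas as pd

# Hypothetical helper: list the expected columns that are absent from a DataFrame
def missing_columns(df, expected):
    return [col for col in expected if col not in df.columns]

# Toy frame standing in for the loaded data; 'UnitPrice' is deliberately absent
toy = pd.DataFrame({"Quantity": [1, 2], "Country": ["France", "Spain"]})

absent = missing_columns(toy, ["Quantity", "UnitPrice", "Country"])
print(absent)
```

An empty list means every column the pipeline refers to is present.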

We are interested in:
* Cleaning the missing data with `MyCustomTransformerForMeanImputation()` on the numerical variables and `CategoricalImputer()` for categorical variables
* Next, we want to make all words from the 'Country' column upper case. We will use our own transformer: ConvertUpperCase()


We set the pipeline using these rules in three steps. Then we run `.fit_transform()`
* Once we inspect the data with .head(), we notice the `'Country'` variable has all letters in upper case!

In [26]:
from feature_engine.imputation import CategoricalImputer

pipeline = Pipeline([
      ( 'custom_transf', MyCustomTransformerForMeanImputation(variables=['Quantity',
                                                                         'UnitPrice'] )),
                     
      ( 'categorical_imputer', CategoricalImputer(imputation_method='missing',
                                                  fill_value='Missing',
                                                  variables=['InvoiceNo', 'StockCode',
                                                             'Description', 'InvoiceDate',
                                                             'CustomerID', 'Country']) ),
      
      ('upper_case' , ConvertUpperCase(variables=['Country'])),
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

KeyError: 'Quantity'

In [28]:
df_transformed.isnull().sum()

NameError: name 'df_transformed' is not defined

In [30]:
df[['Quantity','UnitPrice']].mean()

KeyError: "None of [Index(['Quantity', 'UnitPrice'], dtype='object')] are in the [columns]"

We can compare these with the mean values learned by `MyCustomTransformerForMeanImputation()`. We access the `'custom_transf'` step and check its `.imputer_dict_` attribute, which is the dictionary where the `.fit()` method stored the mean values.

In [33]:
pipeline['custom_transf'].imputer_dict_

{}
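
Indexing a pipeline by step name works for any fitted step, not just our custom one. As a small self-contained illustration, the sketch below uses scikit-learn's `SimpleImputer` on toy data (the step name `mean_imputer` and the values are made up) and reads its learned means from the `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

toy = pd.DataFrame({"Quantity": [2.0, np.nan, 4.0]})

pipe = Pipeline([("mean_imputer", SimpleImputer(strategy="mean"))])
pipe.fit(toy)

# Two equivalent ways to reach a fitted step
by_index = pipe["mean_imputer"].statistics_
by_name = pipe.named_steps["mean_imputer"].statistics_

print(by_index)
print(by_name)
```

Both expressions return the same fitted object, so learned attributes can be inspected either way.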