In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
pd.__version__

'1.0.1'

In [3]:
np.__version__

'1.18.1'

In [4]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.22.1.


In [5]:
# Importing the dataset
dataset = pd.read_csv('Data.csv')
dataset

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [6]:
X = dataset.iloc[:, :-1].values
X 

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [7]:
y = dataset.iloc[:, 3].values
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

The developers of the sklearn library probably noticed how often people chain __LabelEncoding__ and __OneHotEncoding__. So they added a new class, the __ColumnTransformer__ (in `sklearn.compose`), which lets you one hot encode a text column in a single line of code, with no separate label encoding step. The result is exactly the same.


## [sklearn ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

## Reference blog: [Use ColumnTransformer in SciKit instead of LabelEncoding and OneHotEncoding for data preprocessing in Machine Learning](https://towardsdatascience.com/columntransformer-in-scikit-for-labelencoding-and-onehotencoding-in-machine-learning-c6255952731b)

The first column we have here is a text field, and it is categorical. We need to one hot encode it so the model does not read a hierarchy into the encoded values, which plain label encoding (France = 0, Germany = 1, and so on) would suggest. For this, we’ll still need to import the OneHotEncoder class. But instead of the LabelEncoder class, we’ll use the new ColumnTransformer.


In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

Next, we have to create an object of the ColumnTransformer class. But before we can do that, we need to understand the constructor signature of the class. The ColumnTransformer constructor takes quite a few arguments, but we’re only interested in two. 


The first argument is called __transformers__, and it is a list of tuples. Each tuple has the following elements, in this order:

- __name__: a name for the column transformer, which will make setting of parameters and searching of the transformer easy.
- __transformer__: here we’re supposed to provide an estimator. We can also just pass the string “passthrough” or “drop” if we want. Since we’re encoding the data in this example, we’ll use the __OneHotEncoder__ here. Remember that the estimator you use here needs to support __fit__ and __transform__ (or __fit_transform__). Note that __LabelEncoder__ is designed for the target y and accepts only a single array, so it generally does not work inside a ColumnTransformer.
- __column(s)__: the list of columns which you want to be transformed. In this case, we’ll only transform the first column.
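As a minimal sketch of how the three tuple elements fit together, the snippet below builds a transformer on a small made-up array and uses the name `encoder1` to change a parameter of the wrapped encoder afterwards. The data values, the `handle_unknown='ignore'` setting, and `sparse_threshold=0` (used only to force a dense result) are illustrative choices, not part of the original walkthrough.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in for the Country/Age/Salary columns.
X_demo = np.array([['France', 44.0, 72000.0],
                   ['Spain', 27.0, 48000.0],
                   ['Germany', 30.0, 54000.0]], dtype=object)

# Each entry of `transformers` is a (name, transformer, columns) tuple.
ct = ColumnTransformer(
    transformers=[('encoder1', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,  # assumption for the demo: always return a dense array
)

# The name lets us reach the wrapped estimator's parameters: <name>__<parameter>.
ct.set_params(encoder1__handle_unknown='ignore')

encoded = ct.fit_transform(X_demo)
fitted_encoder = ct.named_transformers_['encoder1']

# With handle_unknown='ignore', a country not seen during fit encodes as all zeros.
unseen = ct.transform(np.array([['Italy', 30.0, 50000.0]], dtype=object))
print(encoded.shape)
print(unseen)
```

The same `name__parameter` convention is what makes the transformer's settings reachable from tools such as grid search.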


The second parameter we’re interested in is the __remainder__. This will tell the transformer what to do with the other columns in the dataset. By default, only the columns which are transformed will be returned by the transformer. All other columns will be dropped. But we have the option to tell the transformer what to do with the other columns. We can either drop them, pass them through unchanged, or specify another estimator if we want to do some more processing.
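The effect of __remainder__ is easiest to see by comparing output shapes. The sketch below uses a small made-up array; the helper name `encode` and the `sparse_threshold=0` setting (to get a dense array back) are assumptions made for the demo.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in for the Country/Age/Salary columns.
X_demo = np.array([['France', 44.0, 72000.0],
                   ['Spain', 27.0, 48000.0],
                   ['Germany', 30.0, 54000.0]], dtype=object)

def encode(remainder):
    """One hot encode column 0 and handle the other columns as requested."""
    ct = ColumnTransformer([('encoder1', OneHotEncoder(), [0])],
                           remainder=remainder, sparse_threshold=0)
    return ct.fit_transform(X_demo)

dropped = encode('drop')         # the default: Age and Salary are discarded
kept = encode('passthrough')     # Age and Salary are appended unchanged

print(dropped.shape)  # (3, 3): only the three country indicator columns
print(kept.shape)     # (3, 5): indicators followed by Age and Salary
```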

Now that we (somewhat) understand the signature of the constructor, let’s go ahead and create an object:

In [9]:
# observe the tuple inside
columnTransformer = ColumnTransformer([('encoder1', OneHotEncoder(), [0])], remainder='passthrough')

As you can see from the snippet above, we’ll name the transformer simply “encoder1”. We’re using the OneHotEncoder() constructor to provide a new instance as the estimator. And then we’re specifying that only the first column has to be transformed. We’re also making sure that the remainder columns are passed through without any changes.

Once we have constructed this columnTransformer object, we have to fit and transform the dataset to one hot encode the column. For this, we’ll use the following simple command:

In [10]:
X = np.array(columnTransformer.fit_transform(X))
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, nan],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, nan, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

As you can see, we have one hot encoded a column in our dataset using only the ColumnTransformer and OneHotEncoder classes, with no separate label encoding step. This is much easier and cleaner than using both the LabelEncoder and OneHotEncoder classes.
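If it is unclear which of the new columns stands for which country, the fitted encoder can tell you. This is a small sketch on made-up data; it relies on the `named_transformers_` and `categories_` attributes, and on `sparse_threshold=0` purely to get a dense array back.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column input.
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']], dtype=object)

ct = ColumnTransformer([('encoder1', OneHotEncoder(), [0])], sparse_threshold=0)
one_hot = ct.fit_transform(countries)

# categories_ holds one array per encoded column; its order is the column order.
categories = ct.named_transformers_['encoder1'].categories_[0]
print(list(categories))

# Map every encoded row back to its label to confirm the ordering.
for label, row in zip(countries[:, 0], one_hot):
    print(label, row, '->', categories[row.argmax()])
```

With automatically detected categories the order is sorted, which is why France maps to the first indicator column in the output above.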

In [11]:
del(dataset, X, y, columnTransformer)

## Important Reference blog 2: [Easier Machine Learning model with the New Column Transformer from Scikit-Learn](https://medium.com/vickdata/easier-machine-learning-with-the-new-column-transformer-from-scikit-learn-c2268ea9564c)

__ColumnTransformer__: This class allows you to combine several feature extraction or transformation methods into a single transformer. Say you are working on a machine learning problem, and you have a dataset containing a mixture of categorical and numerical columns. Rather than handling each of these separately, and perhaps writing a function to apply the same steps to new data, you can now combine them into one transformer that is easy to reapply and extend.

[Dataset used Analytics Vidhya loan prediction](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/#data_dictionary)  
__Aim of the model made:__ to predict whether or not a loan application will be successful based on a number of customer features. The data contains both categorical and numerical variables, and is a nice simple data set to practice using the new ColumnTransformer.

In [12]:
train = pd.read_csv('AV_Loan_prediction/train.csv')
test = pd.read_csv('AV_Loan_prediction/test.csv')
print(train.shape, test.shape)
print(train.dtypes)

(614, 13) (367, 12)
Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object


Before going further I am going to drop the Loan_ID column as that will not be used in the model. I am also filling any null values with the most commonly occurring value for each column. There are of course a number of methods I could choose for this but as I am just trying out a new function I am not too worried about the accuracy of the model for now.

In [13]:
train.drop(columns = 'Loan_ID', inplace = True)
test.drop(columns = 'Loan_ID', inplace = True)

In [14]:
train = train.apply(lambda x:x.fillna(x.value_counts().idxmax()))
test = test.apply(lambda x:x.fillna(x.value_counts().idxmax()))
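To see what the `fillna` line does, here is a small sketch on a made-up frame. The column names mirror the loan data, but the values are invented for the demo.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are illustrative).
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female', 'Male'],
    'LoanAmount': [120.0, np.nan, 120.0, 66.0],
})

# value_counts() ignores missing values, so idxmax() returns the most common
# observed value in each column, and fillna() puts it into the gaps.
filled = toy.apply(lambda col: col.fillna(col.value_counts().idxmax()))
print(filled)
```

scikit-learn's `SimpleImputer(strategy='most_frequent')` applies the same strategy in transformer form, should you prefer to keep this step alongside the other preprocessing.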

In [15]:
train.head()

,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,120.0,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


Next I’m going to specify the features (X) and target (y). I then use the sklearn train_test_split function to split the training data into train and test sets, ready to train and validate the model.

In [16]:
feature_set = train.drop(columns = 'Loan_Status')
X = feature_set.columns  # names of all feature columns
y = 'Loan_Status'

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(train[X], train[y], test_size = 0.25, random_state=0)

### ColumnTransformer

The next step is to apply transformation to columns to optimise them for use in the classification model. Usually I would write a function that applies these transformations, so that I can re-use it on both the test and train set, and any holdout data I may have for later validation. However, I am going to try using the ColumnTransformer to simplify these steps.

Next I transform the categorical columns using the sklearn OneHotEncoder. I also normalize the numerical columns using the Normalizer class. Note that Normalizer rescales each row (sample) to unit norm, rather than scaling each column independently.
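A quick sketch of what `Normalizer(norm='l1')` does to the numeric columns: each row is divided by the sum of the absolute values in that row. The two rows below reuse the numeric values from the first two rows of `train.head()`.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History
numeric = np.array([[5849.0, 0.0, 120.0, 360.0, 1.0],
                    [4583.0, 1508.0, 128.0, 360.0, 1.0]])

scaled = Normalizer(norm='l1').fit_transform(numeric)

# With non-negative inputs, every row now sums to 1.
print(scaled.sum(axis=1))
# The first income becomes 5849 / (5849 + 0 + 120 + 360 + 1).
print(scaled[0, 0])
```

Because the scaling is per row, a large income makes the other values in the same row small. If per-column scaling is what you want, a scaler such as `StandardScaler` may be a better fit.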




In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

In [20]:
colT = ColumnTransformer([("dummy_col", OneHotEncoder(categories=[['Male', 'Female'],
                                           ['Yes', 'No'],
                                            ['0','1', '2','3+'],
                                            ['Graduate', 'Not Graduate'],
                                            ['No', 'Yes'],
                                            ['Semiurban', 'Urban', 'Rural']]), [0,1,2,3,4,10]),
      ("norm", Normalizer(norm='l1'), [5,6,7,8,9])])

You will notice in the code above that I have used the categories argument of OneHotEncoder. This takes the possible categories for each column as a list of lists, and produces one hot encoded columns for every listed category even if a category does not appear in the data being fitted. The reason for doing this is that new data may not contain the same categories in each feature as the training data. If the encoding were derived from each dataset separately, the resulting array would not have the same shape as the data used to train the model, and you would get an error.
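The sketch below illustrates the point on a single made-up column. The training sample happens to contain no 'Rural' rows: an encoder left to infer the categories then rejects 'Rural' in new data, while an encoder given the full list handles it and always emits three columns. The variable names and data are assumptions for the demo.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative Property_Area values; the training sample lacks 'Rural'.
train_col = np.array([['Urban'], ['Semiurban'], ['Urban']], dtype=object)
new_col = np.array([['Rural'], ['Urban']], dtype=object)

# Categories inferred from the training data: 'Rural' is unknown at transform time.
auto = OneHotEncoder().fit(train_col)
try:
    auto.transform(new_col)
    auto_failed = False
except ValueError:
    auto_failed = True

# Categories listed up front: every listed category gets a column.
fixed = OneHotEncoder(categories=[['Semiurban', 'Urban', 'Rural']]).fit(train_col)
new_encoded = fixed.transform(new_col).toarray()

print(auto_failed)
print(new_encoded)
```

The output columns follow the order given in `categories`, so 'Rural' lands in the third column here.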

I apply the transformer to the training data as shown below. The first few rows of the output, which is a two-dimensional NumPy array, are shown beneath the code.

In [21]:
X_train = colT.fit_transform(X_train)
X_train

array([[1.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
        7.26792204e-03, 5.94648167e-02, 1.65180046e-04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        2.43384199e-02, 6.95383427e-02, 1.93162063e-04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.51359432e-02, 3.36354293e-02, 9.34317481e-05],
       ...,
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
        2.24845419e-02, 4.04721754e-02, 1.12422709e-04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, ...,
        2.44125725e-02, 5.49282881e-02, 1.52578578e-04],
       [0.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        2.58927301e-02, 5.12163892e-02, 1.42267748e-04]])

To transform the X_test data you simply apply the column transformer again.

In [22]:
X_test = colT.transform(X_test)

### Training the Model
The data is now ready to be used in any scikit-learn classifier. For simplicity I have just used a RandomForestClassifier model with the default parameters.

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
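# classification_report lists labels in sorted order ('N' before 'Y'),
# so take the row names from the data instead of passing target_names=['Y', 'N'].
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           N       0.81      0.58      0.68        43
           Y       0.85      0.95      0.90       111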

    accuracy                           0.84       154
   macro avg       0.83      0.76      0.79       154
weighted avg       0.84      0.84      0.84       154



### Predicting New Data
To test how the ColumnTransformer would work if we were to use this model to make predictions on previously unseen data, I took a sample of rows from the test csv file I read in earlier. I then simply re-use the column transformer to apply the preprocessing steps.

Applying the Random Forest model to predict the loan status gives the following output.

In [24]:
test_samp = test[:15]
test_samp = colT.transform(test_samp)
random_forest.predict(test_samp)

array(['Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y',
       'N', 'Y'], dtype=object)
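One way to avoid calling `colT.transform` by hand before every prediction is to bundle the transformer and the classifier in a scikit-learn `Pipeline`, the approach covered in the first link below. What follows is only a sketch on a made-up miniature of the loan data: the frame, the reduced column set, the step names, and the classifier settings are all assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, OneHotEncoder

# Hypothetical miniature of the loan data; values are illustrative only.
toy = pd.DataFrame({
    'Married': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'ApplicantIncome': [5849, 4583, 3000, 2583, 6000, 5417],
    'LoanAmount': [120.0, 128.0, 66.0, 120.0, 141.0, 267.0],
    'Loan_Status': ['Y', 'N', 'Y', 'Y', 'Y', 'N'],
})
features = toy.drop(columns='Loan_Status')
target = toy['Loan_Status']

# Same two kinds of steps as colT, selecting DataFrame columns by name.
preprocess = ColumnTransformer([
    ('dummy_col', OneHotEncoder(categories=[['Yes', 'No']]), ['Married']),
    ('norm', Normalizer(norm='l1'), ['ApplicantIncome', 'LoanAmount']),
])

# The pipeline fits the preprocessing and the model together.
model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(features, target)

# New rows go straight into predict; the preprocessing is applied automatically.
predictions = model.predict(features.head(3))
print(predictions)
```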

***

#### [IMP Using ColumnTransformer with Pipeline](https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/)

#### [Using ColumnTransformer to combine data processing steps](https://towardsdatascience.com/using-columntransformer-to-combine-data-processing-steps-af383f7d5260)