[View in Colaboratory](https://colab.research.google.com/github/paulgureghian/Colab_Notebooks/blob/master/Column_Transformer_from_Scikit_Learn.ipynb)

**Created by Paul A. Gureghian on 10/2/2018**

**This notebook demonstrates the new ColumnTransformer from scikit-learn.**

**This class lets you apply different feature extraction or transformation methods to different columns and concatenates the results, so the whole preprocessing step behaves as a single transformer.**

**The data set I am using was taken from the Analytics Vidhya loan prediction competition. The purpose of this competition is to predict whether or not a loan application will be successful based on a number of customer features. The data contains both categorical and numerical variables, which makes it a simple data set for practicing with the new ColumnTransformer.**
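
As a minimal sketch of the idea, assuming a small made-up DataFrame rather than the loan data: one transformer is applied to the categorical column, another to the numeric column, and the outputs are concatenated. The column names `colour` and `size` and the use of `StandardScaler` are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the loan data set)
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 4.0],
})

# Each tuple is (name, transformer, columns the transformer applies to)
toy_ct = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["colour"]),
    ("scale", StandardScaler(), ["size"]),
])

toy_out = toy_ct.fit_transform(toy)
# The result may be sparse or dense depending on its density; densify for display
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()
toy_out = np.asarray(toy_out)

# 3 one-hot columns for 'colour' followed by 1 scaled column for 'size'
print(toy_out.shape)
```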

In [0]:
### install pydrive 
!pip install -U -q Pydrive 

In [144]:
### install scikit-learn==0.20.0
!pip install scikit-learn==0.20.0



In [0]:
### import packages 
import pandas as pd 

from google.colab import auth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive  
from oauth2client.client import GoogleCredentials

from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score, classification_report

In [0]:
### instantiate the pydrive client 
auth.authenticate_user()
gauth = GoogleAuth() 
gauth.credentials = GoogleCredentials.get_application_default() 
drive = GoogleDrive(gauth)   

In [147]:
### get a list of the csv files and their IDs used in this demo
file_list = drive.ListFile({'q' : "'1oO3S07GjwQrs_LWc1S-KXqfoM2G-Xz7w' in parents and trashed=false"}).GetList()
for file1 in file_list:
    
    print('title: %s, id: %s' % (file1['title'], file1['id']))                           

title: test_loan_prediction.csv, id: 1jTZfVdeHPmQMjIzywBdjB4H_mrQr4ED-
title: train_loan_prediction.csv, id: 1QIaCv92FK89WneMhXDr0DQw2ZRK-jXB_


In [148]:
### pull the files into Google Colab, save the files in the notebook, and set the file names
train_downloaded = drive.CreateFile({'id' : '1QIaCv92FK89WneMhXDr0DQw2ZRK-jXB_'})
train_downloaded.GetContentFile('train_loan_prediction.csv')
test_downloaded = drive.CreateFile({'id' : '1jTZfVdeHPmQMjIzywBdjB4H_mrQr4ED-'})
test_downloaded.GetContentFile('test_loan_prediction.csv')  

print("Files successfully imported.") 

Files successfully imported.


In [149]:
### read in the data to a pandas dataframe
train =  pd.read_csv('train_loan_prediction.csv')
test = pd.read_csv('test_loan_prediction.csv')

### print the head() of both datasets
print("Train set:\n", train.head())
print('')
print("Test set:\n", test.head()) 
print('')

Train set:
     Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2       

In [150]:
### print the shapes
print(train.shape, test.shape)

(614, 13) (367, 12)


In [151]:
### print the datatypes
print("Train set data types:\n", train.dtypes)
print('')
print("Test set data types:\n", test.dtypes)
print('')   

Train set data types:
 Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

Test set data types:
 Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome      int64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
dtype: object
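
The dtype listing separates the columns into text (`object`) and numeric groups. A minimal sketch, assuming a small made-up frame with a few of the same column names, of deriving those two column lists programmatically instead of typing positions by hand:

```python
import pandas as pd

# Hypothetical toy frame reusing a few column names from the loan data
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5849, 4583],
    "LoanAmount": [120.0, 128.0],
    "Property_Area": ["Urban", "Rural"],
})

# Numeric columns (int and float), and everything else as categorical
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(cat_cols)
print(num_cols)
```

Lists like these can be passed to a ColumnTransformer in place of integer positions when the input is a DataFrame.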



In [0]:
### drop the 'Loan_ID' column (an identifier with no predictive value)
train = train.drop(['Loan_ID'], axis =1)
test = test.drop(['Loan_ID'], axis =1)  

In [0]:
### fill null values in each column with that column's most frequent value
train = train.apply(lambda x:x.fillna(x.value_counts().index[0]))
test = test.apply(lambda x:x.fillna(x.value_counts().index[0])) 
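
To illustrate what this fill does, here is a small sketch on made-up values (the frame below is hypothetical): `value_counts()` ignores missing values and sorts by frequency, so `index[0]` is the most frequent value of each column. An equivalent form using `DataFrame.mode()` is shown for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [100.0, np.nan, 100.0, 250.0],
})

# Same pattern as the notebook: fill with the most frequent value per column
filled = toy.apply(lambda col: col.fillna(col.value_counts().index[0]))

# Alternative spelling: the first row of mode() holds each column's mode
alt = toy.fillna(toy.mode().iloc[0])

print(filled)
```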

In [154]:
### define variables: features 'X' and target 'y' 
feature_set = train.drop(['Loan_Status'], axis =1) 
X = feature_set.columns
y = 'Loan_Status'

print("Head() of feature_set:\n", feature_set.head(1)) 
print('')
print("Column names in X:\n", X) 
print('') 
print("Column name in y:\n", y)
print('') 

Head() of feature_set:
   Gender Married Dependents Education Self_Employed  ApplicantIncome  \
0   Male      No          0  Graduate            No             5849   

   CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  \
0                0.0       120.0             360.0             1.0   

  Property_Area  
0         Urban  

Column names in X:
 Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')

Column name in y:
 Loan_Status



In [155]:
### split the training data into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(train[X], train[y], random_state =0) 

### print the values in the train and test sets 
print("X_train:\n", X_train.head(1))  
print('')
print("X_test:\n", X_test.head(1)) 
print('')
print("y_train:\n", y_train) 
print('')
print("y_test:\n", y_test)  
print('') 

X_train:
    Gender Married Dependents Education Self_Employed  ApplicantIncome  \
46   Male     Yes          1  Graduate            No             5649   

    CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  \
46                0.0        44.0             360.0             1.0   

   Property_Area  
46         Urban  

X_test:
     Gender Married Dependents Education Self_Employed  ApplicantIncome  \
454   Male      No          0  Graduate           Yes             7085   

     CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  \
454                0.0        84.0             360.0             1.0   

    Property_Area  
454     Semiurban  

y_train:
 46     Y
272    Y
474    Y
382    Y
283    Y
107    N
205    Y
384    N
0      Y
202    N
71     Y
170    Y
102    Y
407    Y
108    N
573    N
613    N
313    Y
509    Y
419    Y
298    N
492    Y
480    Y
26     Y
604    Y
332    Y
242    Y
515    Y
582    Y
225    N
      ..
151    Y
244    Y
543    Y
5

In [156]:
### instantiate the column transformer 
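### one-hot encode the categorical columns with OneHotEncoder, listing each column's categories explicitly
### rescale the numerical columns with Normalizer; note that Normalizer works per row, so each sample's five numeric values are divided by their l1 norm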
ct = ColumnTransformer(
    [("dummy_col", OneHotEncoder(categories =[['Male', 'Female'],
                                              ['Yes', 'No'],
                                              ['0', '1', '2', '3+'],
                                              ['Graduate', 'Not Graduate'],
                                              ['No', 'Yes'],
                                              ['Semiurban', 'Urban', 'Rural']]),
                                              [0, 1, 2, 3, 4, 10]),
     
     ("norm", Normalizer(norm ='l1'), [5, 6, 7, 8, 9])])

print(ct) 

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('dummy_col', OneHotEncoder(categorical_features=None,
       categories=[['Male', 'Female'], ['Yes', 'No'], ['0', '1', '2', '3+'], ['Graduate', 'Not Graduate'], ['No', 'Yes'], ['Semiurban', 'Urban', 'Rural']],
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True), [0, 1, 2, 3, 4, 10]), ('norm', Normalizer(copy=True, norm='l1'), [5, 6, 7, 8, 9])])
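
A short sketch of what `Normalizer(norm='l1')` does to the numeric block, using the first two rows of the training data as printed earlier (with the missing `LoanAmount` already filled by 120.0). Each row is divided by the sum of its absolute values, so the scaling is per sample rather than per column.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Columns: ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History
rows = np.array([
    [5849.0, 0.0, 120.0, 360.0, 1.0],
    [4583.0, 1508.0, 128.0, 360.0, 1.0],
])

scaled = Normalizer(norm="l1").fit_transform(rows)

# Every row now has an l1 norm of 1; large values such as income dominate the row
print(scaled)
print(np.abs(scaled).sum(axis=1))
```

If per-column scaling is what is wanted, a scaler such as `StandardScaler` or `MinMaxScaler` would be the usual choice instead.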


In [157]:
### apply the transformer to the training data 
X_train = ct.fit_transform(X_train) 

print("Output of X_train:\n", X_train)    
print('') 

Output of X_train:
 [[1.00000000e+00 0.00000000e+00 1.00000000e+00 ... 7.26792204e-03
  5.94648167e-02 1.65180046e-04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 ... 2.43384199e-02
  6.95383427e-02 1.93162063e-04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.51359432e-02
  3.36354293e-02 9.34317481e-05]
 ...
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 ... 2.24845419e-02
  4.04721754e-02 1.12422709e-04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 ... 2.44125725e-02
  5.49282881e-02 1.52578578e-04]
 [0.00000000e+00 1.00000000e+00 1.00000000e+00 ... 2.58927301e-02
  5.12163892e-02 1.42267748e-04]]



In [158]:
### apply the already-fitted transformer to the testing data (transform only, no refit)
X_test = ct.transform(X_test) 

print("Output of X_test:\n", X_test) 
print('')   

Output of X_test:
 [[1.00000000e+00 0.00000000e+00 0.00000000e+00 ... 1.11553785e-02
  4.78087649e-02 1.32802125e-04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 ... 2.38145864e-02
  7.65468850e-02 2.12630236e-04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 ... 3.02126072e-02
  3.35695636e-02 9.32487878e-05]
 ...
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 ... 1.16808383e-02
  3.09198660e-02 8.58885167e-05]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 ... 2.09756873e-02
  5.72064198e-02 1.58906722e-04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 ... 2.49484536e-02
  7.42268041e-02 0.00000000e+00]]
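
In this notebook the encoder's categories are listed explicitly and Normalizer keeps no fitted state, so refitting on another data set would give the same numbers. In general, though, a transformer should be fitted on the training data only. The sketch below uses made-up frames to show what can go wrong when an encoder that infers its categories is refitted on test data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy splits; the test split happens to contain a single category
train_toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})
test_toy = pd.DataFrame({"Property_Area": ["Urban", "Urban"]})

enc = OneHotEncoder()
train_enc = enc.fit_transform(train_toy).toarray()

# Correct: reuse the categories learned from the training split
test_ok = enc.transform(test_toy).toarray()

# Problematic: a refit learns only the categories present in the test split
test_refit = OneHotEncoder().fit_transform(test_toy).toarray()

print(train_enc.shape, test_ok.shape, test_refit.shape)  # (3, 3) (2, 3) (2, 1)
```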



In [0]:
### instantiate the RandomForestClassifier model with default parameters  
random_forest = RandomForestClassifier() 

In [160]:
### fit the model on the training dataset 
random_forest.fit(X_train, y_train) 



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
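
One way to keep preprocessing and model together is a `Pipeline`, which fits the transformer on the training data and reuses it at prediction time. This is a sketch on a hypothetical six-row frame, not the notebook's actual workflow; the values and the `random_state=0` setting are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, OneHotEncoder

# Hypothetical toy data with made-up values
toy_X = pd.DataFrame({
    "Married": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0, 6000.0, 5417.0],
    "LoanAmount": [120.0, 128.0, 66.0, 120.0, 141.0, 267.0],
})
toy_y = pd.Series(["Y", "N", "Y", "Y", "Y", "N"])

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("dummy_col", OneHotEncoder(categories=[["Yes", "No"]]), ["Married"]),
        ("norm", Normalizer(norm="l1"), ["ApplicantIncome", "LoanAmount"]),
    ])),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# fit() fits the transformer and the model; predict() only transforms, then predicts
pipe.fit(toy_X, toy_y)
toy_pred = pipe.predict(toy_X.iloc[:2])
print(toy_pred)
```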

In [161]:
### make predictions from the 'X_test' test set 
y_pred = random_forest.predict(X_test)  

print("The predictions are:\n", y_pred)  
print('') 

The predictions are:
 ['Y' 'Y' 'N' 'N' 'Y' 'N' 'Y' 'N' 'N' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'N' 'Y'
 'Y' 'N' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'N' 'Y' 'N' 'Y' 'Y' 'Y' 'N'
 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'Y' 'N' 'N' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y'
 'Y' 'Y' 'Y' 'N' 'N' 'Y' 'N' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'N'
 'Y' 'Y' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y' 'Y' 'N' 'Y'
 'Y' 'Y' 'Y' 'N' 'N' 'N' 'Y' 'Y' 'Y' 'N']



In [162]:
### the score of the model 
model_score = random_forest.score(X_test, y_test)
print("Model score is:\n", model_score) 

Model score is:
 0.8246753246753247


In [163]:
### the accuracy score of the model
### store the result under a new name so the imported accuracy_score function is not overwritten
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score is:\n", accuracy) 

Accuracy Score is:
 0.8246753246753247


In [164]:
### print the classification report
### classes are reported in sorted label order ('N' before 'Y'), so target_names must follow that order
print("Classification Report:\n", classification_report(y_test, y_pred, target_names =['N', 'Y']))    

Classification Report:
               precision    recall  f1-score   support

           N       0.72      0.60      0.66        43
           Y       0.86      0.91      0.88       111

   micro avg       0.82      0.82      0.82       154
   macro avg       0.79      0.76      0.77       154
weighted avg       0.82      0.82      0.82       154
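
`classification_report` orders its rows by sorted class label, and any `target_names` list is applied in that order, so for this target the first row is 'N' and the second is 'Y'. A minimal sketch on made-up labels, passing `labels` explicitly so the row order is unambiguous:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical labels: three approved loans, one rejected
y_true = ["Y", "Y", "Y", "N"]
y_hat = ["Y", "Y", "N", "N"]

# Sorted label order puts 'N' before 'Y'
label_order = np.unique(y_true).tolist()
print(label_order)

report = classification_report(y_true, y_hat, labels=["N", "Y"], output_dict=True)
print(report["N"]["support"], report["Y"]["support"])
```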



In [165]:
### make predictions from the first 15 rows of the test dataset 
### reuse the fitted transformer (transform only, no refit)
test_sample = test[:15] 
test_sample = ct.transform(test_sample) 

print("Test Sample output:\n", test_sample)   

Test Sample output:
 [[1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  1.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 9.23921822e-01
  0.00000000e+00 1.77677273e-02 5.81489259e-02 1.61524794e-04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 6.07544934e-01
  2.96267035e-01 2.48864310e-02 7.11040885e-02 1.97511357e-04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 6.78518116e-01
  2.44266522e-01 2.82263536e-02 4.88533044e-02 1.35703623e-04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00

In [166]:
### refit the model on the training dataset (the model was already fitted above; with random_state unset, this builds a different forest)
random_forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [167]:
### make predictions from the 'test_sample' data  
y_pred = random_forest.predict(test_sample) 

print("The predictions are:\n", y_pred)  
print('')   

The predictions are:
 ['Y' 'Y' 'Y' 'Y' 'N' 'Y' 'N' 'N' 'Y' 'Y' 'Y' 'Y' 'Y' 'N' 'Y']



In [168]:
### the score of the refitted model on the held-out split (differs from the earlier score because random_state is not set)
model_score = random_forest.score(X_test, y_test)
print("Model score is:\n", model_score) 

Model score is:
 0.8441558441558441
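
The two scores differ because `RandomForestClassifier` was created without a `random_state`, so each fit grows a different forest. A small sketch on synthetic data (generated with `make_classification`, an assumption for illustration) showing that fixing `random_state` makes repeated fits agree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the transformed loan features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Two independent fits with the same seed
rf_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

same = np.array_equal(rf_a.predict(X_demo), rf_b.predict(X_demo))
print(same)  # True
```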


In [169]:
### end of notebook
print("This is the end") 

This is the end
