# 3 -- Bonus: Working with Heterogeneous Datasets

In [1]:
%load_ext watermark

In [2]:
%watermark -a "Sebastian Raschka" -p numpy,scikit-learn

Author: Sebastian Raschka

numpy       : 1.23.5
scikit-learn: 1.2.2



In [3]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

- Suppose you have a dataset that contains both numerical and categorical features, as follows:

In [4]:
df = pd.read_csv('data/iris_mod.csv', index_col='Id')
df.head()

Id,SepalLength[cm],SepalWidth[cm],PetalLength[cm],PetalWidth[cm],Color_IMadeThisUp,Species
1,5.1,3.5,1.4,0.2,red,Iris-setosa
2,4.9,3.0,1.4,0.2,red,Iris-setosa
3,4.7,3.2,1.3,0.2,red,Iris-setosa
4,4.6,3.1,1.5,0.2,red,Iris-setosa
5,5.0,3.6,1.4,0.2,red,Iris-setosa


- As usual, we first transform the class labels into an integer format:

In [5]:
X = df.drop('Species', axis=1)
y = df['Species']

label_dict = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}

y = y.map(label_dict)
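
- A quick sanity check can be useful after this kind of dictionary mapping, because `Series.map` silently returns `NaN` for any label that is not a key in the dictionary. The sketch below uses a small made-up label series (the `'Iris-unknown'` entry is hypothetical and only there to trigger the case):

```python
import pandas as pd

# Hypothetical toy labels; 'Iris-unknown' is deliberately not in the dictionary
labels = pd.Series(['Iris-setosa', 'Iris-virginica', 'Iris-unknown'])

label_dict = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}

mapped = labels.map(label_dict)

# Labels without a dictionary entry become NaN instead of raising an error
n_unmapped = int(mapped.isna().sum())
print(f'Unmapped labels: {n_unmapped}')
```

- If the count is nonzero, the dictionary is missing a class (or a label is misspelled) and should be fixed before training.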

- Next, we are going to set up a `Pipeline` that performs certain preprocessing steps only on the numerical features:

In [6]:
numeric_features = ['SepalLength[cm]', 'SepalWidth[cm]', 'PetalLength[cm]', 'PetalWidth[cm]']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_extraction', PCA(n_components=2))])
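
- To see what this numeric pipeline does on its own, here is a minimal sketch that applies the same two steps to synthetic data. The random array `X_num` is an assumption standing in for the four numerical columns; it is not the iris data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four numerical feature columns (not the real dataset)
rng = np.random.RandomState(0)
X_num = rng.normal(loc=5.0, scale=2.0, size=(20, 4))

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),                   # zero mean, unit variance per column
    ('feature_extraction', PCA(n_components=2))])   # project onto 2 principal components

X_reduced = numeric_transformer.fit_transform(X_num)
print(X_reduced.shape)
```

- The four input columns are standardized first and then compressed into two principal components, so the output has two columns.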

- The scaling and PCA steps above are not meant for the categorical feature(s); instead, we apply **different** preprocessing steps to the categorical variable like so:

In [7]:
categorical_features = ['Color_IMadeThisUp']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder())])
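
- As a rough illustration of what the one-hot step produces, the sketch below encodes a tiny made-up color column. The color values are hypothetical and need not match the ones in `iris_mod.csv`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical color values, used only for illustration
colors = pd.DataFrame({'Color_IMadeThisUp': ['red', 'blue', 'green', 'red']})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors)  # a sparse matrix by default

# Categories are sorted, and each one gets its own binary column
print(encoder.categories_)
dense = encoded.toarray()
print(dense)
```

- Each row contains a single 1 in the column that corresponds to its category, so a feature with three distinct values becomes three columns.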

- Scikit-learn's `ColumnTransformer` now allows us to merge these two separate preprocessing pipelines, which operate on different feature sets in our dataset:

In [8]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

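- As a result, we get a 5-dimensional feature array (design matrix) if we apply this preprocessor. What are these 5 columns? The first two are the principal components of the standardized numerical features, and the remaining three are the one-hot encoding of the color feature: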

In [9]:
temp = preprocessor.fit_transform(X)
temp.shape

(150, 5)

In [10]:
temp[:5]

array([[-2.26454173,  0.5057039 ,  0.        ,  1.        ,  0.        ],
       [-2.0864255 , -0.65540473,  0.        ,  1.        ,  0.        ],
       [-2.36795045, -0.31847731,  0.        ,  1.        ,  0.        ],
       [-2.30419716, -0.57536771,  0.        ,  1.        ,  0.        ],
       [-2.38877749,  0.6747674 ,  0.        ,  1.        ,  0.        ]])

- The preprocessor can now also conveniently be used in a Scikit-learn pipeline as shown below:

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=0)

In [12]:
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', KNeighborsClassifier(p=3))])


clf.fit(X_train, y_train)
print(f'Test accuracy: {clf.score(X_test, y_test)*100}%')

Test accuracy: 100.0%
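
- One caveat with random splits: if a category happens to appear only in the test set, a default `OneHotEncoder` raises an error at transform time. Whether that can happen here depends on how the colors are distributed, so the sketch below uses a made-up column to show the behavior and the `handle_unknown='ignore'` option as one possible safeguard:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'green' never occurs in the training portion
train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['green']})

# Default behavior: an unseen category raises a ValueError
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True

# With handle_unknown='ignore', the unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(train)
row = lenient.transform(test).toarray()

print(raised)
print(row)
```

- In the pipeline above, this would amount to replacing `OneHotEncoder()` with `OneHotEncoder(handle_unknown='ignore')` inside `categorical_transformer`.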
