# Using ColumnTransformer in Scikit-Learn for Data Preprocessing
- Data preprocessing is a critical step in any machine learning workflow. It involves cleaning and transforming raw data into a format suitable for modeling. One of the challenges in preprocessing is dealing with datasets that contain different types of features, such as numerical and categorical data. Scikit-learn's ColumnTransformer lets you apply different transformations to different subsets of features within a single dataset.

## Preprocessing Strategies with ColumnTransformer
- Let's work with a dataset named CAR_SPEED_DATA, which consists of 6 columns:

    - AGE: This column contains numerical data representing the age of individuals. It's a continuous variable that may require scaling to ensure it fits well within our model.
    - GENDER: A categorical feature that denotes the gender of individuals, often represented as 'Male' or 'Female.' To make this data usable for machine learning models, we'll need to encode it into numerical values.
    - SPEED: Another numerical feature, this column represents the speed of an individual’s vehicle. Like AGE, it might need scaling or normalization.
    - AVERAGE_SPEED: This feature is an ordinal categorical value. It represents speed categories like 'high' or 'low'. Although it may look similar to numerical data, it needs special handling because the order matters but the differences between categories are not consistent.
    - CITY: A categorical feature indicating the city where the individual resides. With potentially many unique values, we'll need to apply one-hot encoding to convert it into a form suitable for modeling.
    - HAS_DRIVING_LICENSE: This binary categorical variable shows whether an individual has a driving license ('Yes' or 'No'). Simple encoding can transform this into a numerical feature.
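
To make the column types concrete, here is a minimal sketch that rebuilds five rows of the dataset by hand and separates the numeric columns from the categorical ones. The values are copied from the `df.head()` output shown later in this article; the variable names (`sample`, `numeric_cols`, `categorical_cols`) are only illustrative.

```python
import pandas as pd

# Five sample rows copied from the df.head() output shown later in this article.
sample = pd.DataFrame({
    "Age": [54, 34, 19, 45, 23],
    "Gender": ["male", "female", "female", "male", "male"],
    "Speed": [40.0, 70.0, 140.0, 120.0, 80.0],
    "Average_speed": ["low", "low", "high", "high", "low"],
    "City": ["Kolkata", "Delhi", "Delhi", "Kolkata", "Mumbai"],
    "has_driving_license": ["yes", "yes", "no", "yes", "no"],
})

# Split the column names by dtype to see which ones will need encoding.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)       # ['Age', 'Speed']
print(categorical_cols)   # ['Gender', 'Average_speed', 'City', 'has_driving_license']
```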

## Challenges Without Column Transformer
- When we work with such a diverse dataset, different preprocessing steps are required for different columns:

    - Numerical Data Handling: AGE and SPEED need to be scaled so that they don't overpower other features in the model. Without proper scaling, numerical columns with larger ranges could disproportionately affect the model's learning process.
    - Categorical Encoding: GENDER and CITY need encoding into numerical formats. With multiple categorical features, applying encoding manually can be a big task, especially when dealing with a large number of categories.
    - Ordinal Encoding: AVERAGE_SPEED, as an ordinal categorical feature, requires careful encoding to preserve the order of categories. Applying standard one-hot encoding would not respect this inherent ordering.
    - Binary Features: The HAS_DRIVING_LICENSE column is binary and relatively straightforward, but it's another step that adds complexity when handled separately.

## Implementing Column Transformer in Sklearn
- SimpleImputer: Fills in missing data with a specified strategy, such as the mean, median, or most frequent value. We use the mean (the default) for our dataset.
- OneHotEncoder: Converts categorical features into a numerical format by creating a binary column for each category.
- OrdinalEncoder: Transforms categorical features into integer values that preserve the order of the categories. In our dataset, 'low' speed is encoded as 0 and 'high' speed as 1.

In [10]:
# import the required libraries
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
import numpy as np 
import pandas as pd 


In [4]:
df = pd.read_csv('data/Car_Speed_Data.csv')
df.head()

   Age  Gender  Speed Average_speed     City has_driving_license
0   54    male   40.0           low  Kolkata                 yes
1   34  female   70.0           low    Delhi                 yes
2   19  female  140.0          high    Delhi                  no
3   45    male  120.0          high  Kolkata                 yes
4   23    male   80.0           low   Mumbai                  no


In [5]:
df.isnull().sum()

Age                    0
Gender                 0
Speed                  9
Average_speed          0
City                   0
has_driving_license    0
dtype: int64

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['has_driving_license']), 
                                                   df['has_driving_license'], 
                                                   test_size=0.2)
X_train, X_train.shape

(    Age  Gender  Speed Average_speed      City
 82   51  female  170.0          high    Mumbai
 14   37  female   70.0           low  Banglore
 77   36    male  130.0          high  Banglore
 25   35    male  100.0          high  Banglore
 33   35  female   50.0           low  Banglore
 ..  ...     ...    ...           ...       ...
 45   24  female   60.0           low   Kolkata
 46   19  female   70.0           low  Banglore
 26   47  female  130.0          high  Banglore
 28   27  female  150.0          high     Delhi
 84   46  female   50.0           low     Delhi
 
 [80 rows x 5 columns],
 (80, 5))

In [15]:
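# Impute missing values in the Speed column (SimpleImputer defaults to the mean)
si = SimpleImputer()
X_train_Speed = si.fit_transform(X_train[['Speed']])

# For test data, reuse the mean learned from the training data (transform, not fit_transform)
X_test_Speed = si.transform(X_test[['Speed']])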

X_train_Speed.shape

(80, 1)
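
A common convention is to fit an imputer on the training data only and then reuse the learned statistic on the test data, so that nothing about the test set leaks into preprocessing. The sketch below illustrates this on two small hypothetical frames (`train` and `test` are not the article's split).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames, each with one missing Speed value.
train = pd.DataFrame({"Speed": [40.0, 60.0, np.nan, 80.0]})
test = pd.DataFrame({"Speed": [200.0, np.nan]})

imputer = SimpleImputer()          # default strategy is "mean"
imputer.fit(train[["Speed"]])      # learn the mean from the training data only

train_filled = imputer.transform(train[["Speed"]])
test_filled = imputer.transform(test[["Speed"]])

print(imputer.statistics_)         # [60.]  (mean of 40, 60, 80)
print(test_filled.ravel())         # [200.  60.]  (test NaN filled with the train mean)
```

Calling `fit_transform` on the test frame instead would fill the gap with the test mean (200.0 here), which is why `transform` is the safer choice for held-out data.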

- The Average_speed column is an ordinal categorical feature, meaning its categories have a meaningful order, such as low and high. By using OrdinalEncoder, we transform these categories into numerical values that preserve their inherent order. This encoding allows machine learning models to interpret the relative significance of each category, enabling them to capture patterns associated with different speeds. For instance, if low is encoded as 0 and high as 1, the model can recognize that high represents a greater speed than low, impacting how it learns relationships in the data.
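
To see why passing an explicit `categories` list matters, here is a minimal sketch on a hypothetical four-row column. Without it, OrdinalEncoder sorts the categories alphabetically, which would put 'high' before 'low'.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with the two speed categories used in the dataset.
levels = pd.DataFrame({"Average_speed": ["low", "high", "low", "high"]})

# Default behaviour: categories are sorted alphabetically, so high -> 0, low -> 1.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(levels)

# Explicit order: low -> 0, high -> 1, matching the intended ranking.
explicit_enc = OrdinalEncoder(categories=[["low", "high"]])
explicit_codes = explicit_enc.fit_transform(levels)

print(default_codes.ravel())    # [1. 0. 1. 0.]
print(explicit_codes.ravel())   # [0. 1. 0. 1.]
```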

In [17]:
# Ordinal encoding for Average_speed (low -> 0, high -> 1)
eo = OrdinalEncoder(categories=[['low', 'high']])
X_train_Average_speed = eo.fit_transform(X_train[['Average_speed']])

# For test data, reuse the encoder fitted on the training data
X_test_Average_speed = eo.transform(X_test[['Average_speed']])

X_train_Average_speed.shape
X_train_Average_speed

array([[1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],

- One-Hot Encoding is used to convert categorical variables into a binary matrix, allowing machine learning models to interpret these features without assuming any ordinal relationship. By transforming Gender and City into binary columns, we avoid misleading the model into thinking there's a rank or order between the categories. 

In [22]:
# One-hot encoding for Gender and City
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_Gender_City = ohe.fit_transform(X_train[['Gender', 'City']])

# For test data, reuse the categories learned from the training data
X_test_Gender_City = ohe.transform(X_test[['Gender', 'City']])
X_train_Gender_City.shape, X_train_Gender_City

((80, 5),
 array([[0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [1., 0., 0., 1., 0.],
        [1., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0.],
        [1., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 1.],
        [1., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1.],
        [1., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1.],
        [0., 1., 1., 0., 0.],
        [1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 1.],
        [1., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 1.],
        [1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.],
        [1., 0., 0., 1., 0.],
        [1., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 1.],
        [1., 0., 0., 0., 0.],


- The Age feature is a numerical variable that often plays a critical role in predictive modeling. By extracting Age separately, we can keep its contribution to the model apart from the categorical features. This separation also simplifies preprocessing steps, such as scaling or normalizing numerical data, so that Age is treated with its own statistical considerations (mean, median, mode).

In [28]:
# Extracting Age as a 2D array (one column)
X_train_age = X_train[['Age']].values

X_test_age = X_test[['Age']].values

X_train_age.shape

(80, 1)

## Without Using Column Transformer

In [29]:
X_train_transformed = np.concatenate((X_train_age,X_train_Speed,X_train_Gender_City,X_train_Average_speed),axis=1)

X_test_transformed = np.concatenate((X_test_age,X_test_Speed,X_test_Gender_City,X_test_Average_speed),axis=1)

X_train_transformed.shape

(80, 8)

## Using Column Transformer

In [30]:
from sklearn.compose import ColumnTransformer



In [31]:
transformer = ColumnTransformer(transformers=[
    ('t1', SimpleImputer(), ['Speed']),
    ('t2', OrdinalEncoder(categories=[['low', 'high']]), ['Average_speed']),
    ('t3', OneHotEncoder(sparse_output=False, drop='first'), ['Gender', 'City'])
], remainder='passthrough')

In [40]:
transformer.fit_transform(X_train).shape, transformer.transform(X_test).shape

((80, 8), (20, 8))

- The remainder parameter specifies what to do with the columns that are not explicitly listed in the transformers list.

remainder='passthrough': Leaves all other columns untouched and appends them to the right of the transformed output. With the default, remainder='drop', those columns are discarded.

- Why use it? It is useful when you want to apply specific transformations to only some columns while preserving the rest as they are, so that no information is lost from the original dataset. Here, Age is not listed in any transformer, so it is passed through unchanged.
- In short, ColumnTransformer gathers the separate per-column transformations into a single object. Because a model cannot work with categorical features as raw text, we transform them into a numerical matrix before passing the data to the model for training.
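
The effect of `remainder` is easiest to see by comparing output shapes. This is a minimal sketch on a hypothetical two-column frame; `steps`, `dropped`, and `kept` are illustrative names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame: only Average_speed is listed in a transformer.
X = pd.DataFrame({
    "Age": [54, 34, 19],
    "Average_speed": ["low", "low", "high"],
})

steps = [("ord", OrdinalEncoder(categories=[["low", "high"]]), ["Average_speed"])]

# remainder='drop' (the default) discards Age; 'passthrough' keeps it.
dropped = ColumnTransformer(steps, remainder="drop").fit_transform(X)
kept = ColumnTransformer(steps, remainder="passthrough").fit_transform(X)

print(dropped.shape)   # (3, 1)
print(kept.shape)      # (3, 2)
```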

In [38]:
transformer.fit(X_train)
X_train_transformed = transformer.transform(X_train)
X_test_transformed = transformer.transform(X_test)
X_train_transformed.shape, X_test_transformed.shape

((80, 8), (20, 8))
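
The challenges listed earlier mention that AGE and SPEED may need scaling, which the transformer above does not do. One possible way to add it, sketched here under the assumption that StandardScaler is an acceptable scaler for this data, is to chain imputation and scaling in a `Pipeline` and use that pipeline as a ColumnTransformer step. The frame and step names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns with one missing Speed value.
X = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Speed": [40.0, np.nan, 80.0, 120.0],
})

# Speed is imputed first and then scaled; Age is only scaled.
speed_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

ct = ColumnTransformer([
    ("speed", speed_steps, ["Speed"]),
    ("age", StandardScaler(), ["Age"]),
])

scaled = ct.fit_transform(X)
print(scaled.shape)          # (4, 2)
print(scaled.mean(axis=0))   # approximately zero for both columns
```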

In [36]:
X_train, X_train.shape

(    Age  Gender  Speed Average_speed      City
 82   51  female  170.0          high    Mumbai
 14   37  female   70.0           low  Banglore
 77   36    male  130.0          high  Banglore
 25   35    male  100.0          high  Banglore
 33   35  female   50.0           low  Banglore
 ..  ...     ...    ...           ...       ...
 45   24  female   60.0           low   Kolkata
 46   19  female   70.0           low  Banglore
 26   47  female  130.0          high  Banglore
 28   27  female  150.0          high     Delhi
 84   46  female   50.0           low     Delhi
 
 [80 rows x 5 columns],
 (80, 5))

In [35]:
X_test, X_test.shape

(    Age  Gender  Speed Average_speed      City
 21   28  female   70.0           low  Banglore
 86   34    male   60.0           low  Banglore
 52   29    male   60.0           low    Mumbai
 18   28    male   50.0           low  Banglore
 39   26    male  150.0          high  Banglore
 9    25    male   60.0           low     Delhi
 83   39  female   70.0           low     Delhi
 44   57    male   40.0           low     Delhi
 64   28  female  180.0          high    Mumbai
 53   42    male   80.0           low   Kolkata
 36   36    male   50.0           low   Kolkata
 10   35  female   80.0           low     Delhi
 58   52    male  120.0          high  Banglore
 65   37  female  170.0          high   Kolkata
 8    42    male   70.0           low  Banglore
 99   20    male   60.0           low   Kolkata
 68   56    male    NaN           low     Delhi
 97   40  female   50.0           low    Mumbai
 90   24    male  120.0          high  Banglore
 49   48  female  130.0          high  B