# DTSC670: Foundations of Machine Learning Models
## Module 2
## Assignment 4: Custom Transformer and Transformation Pipeline

#### Name:

Begin by writing your name above.

Your task in this assignment is to create a custom transformation pipeline that takes in raw data and returns fully prepared, clean data that is ready for model training.  However, we will not actually train any models in this assignment.  This pipeline will employ an imputer class, a user-defined transformer class, and a data-normalization class.

Please note that the order of features in the final feature matrix must be correct.  See the below figure that illustrates the input and output of the transformation pipeline.  The positions of features $x_1$ and $x_2$ do not change - they remain in the first and second columns, respectvely, both before and after the transformation pipeline.  In the transformed dataset, the $x_5$ feature is next, and is followed by the newly computed feature $x_6$.  Finally, the last two columns are the remaining one-hot vectors obtained from encoding the categorical feature $x_3$.

<img src="DataTransformation.png " width ="500" />

# Import Data

Import data from the file called `CustomTransformerData.csv`.

In [1]:
# importing data

import pandas as pd 
import csv

path = '/home/qasim/Downloads/pipeline/CustomTransformerData.csv'
df = pd.read_csv(path)

df

Unnamed: 0,x1,x2,x3,x4,x5
0,1.5,2.354153,COLD,593,0.75
1,2.5,3.314048,WARM,340,2.083333
2,3.5,4.021604,COLD,551,4.083333
3,4.5,,COLD,2368,6.75
4,5.5,5.847601,WARM,2636,10.083333
5,6.5,7.22991,WARM,2779,14.083333
6,7.5,7.997255,HOT,1057,18.75
7,8.5,9.203947,COLD,819,24.083333
8,9.5,10.335348,WARM,3349,
9,10.5,11.112142,HOT,3235,36.75


# Create Custom Transformer

Create a custom transformer, just as we did in the lecture video entitled "Custom Transformers", that performs two computations: 

1. Adds an attribute to the end of the data (i.e. new last column) that is equal to $\frac{x_1^3}{x_5}$ for each observation

2. Drops the entire $x_4$ feature column.  (See further instructions below.)

You must name your custom transformer class `Assignment4Transformer`.  Your class should include a parameter with a default value of `True` that deletes the $x_4$ feature column when its value is `True`, but preserves the $x_4$ feature column when its value is `False`.

NOTE: You must handle the numeric and categorical features separately.  Accordingly, you will not pass the $x_3$ feature column through this custom transformer.  This means your calculations should reflect the absence of the $x_3$ feature column when indexing data structures.

In [37]:
# Create custom transfermers using OOP
# Finds the caluculations x6
# Drops x4 depending upon the True False Parameter


from sklearn.base import BaseEstimator, TransformerMixin
class Assignment4Transformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print("Init Done")

    def fit(self, X, y=None):
        return self

    def transform(self, X, y= True):
        #Apply function
        print(X)
        x6 = X[0]**3 / X[4]
        X[5] = x6
        if y is True:
            X.drop(3, inplace = True, axis  =1)
        return X


In [28]:

a4t = Assignment4Transformer()
#a4t.transform(df).head()


Init Done
      x1         x2    x3    x4          x5
0    1.5   2.354153  COLD   593    0.750000
1    2.5   3.314048  WARM   340    2.083333
2    3.5   4.021604  COLD   551    4.083333
3    4.5        NaN  COLD  2368    6.750000
4    5.5   5.847601  WARM  2636   10.083333
5    6.5   7.229910  WARM  2779   14.083333
6    7.5   7.997255   HOT  1057   18.750000
7    8.5   9.203947  COLD   819   24.083333
8    9.5  10.335348  WARM  3349         NaN
9   10.5  11.112142   HOT  3235   36.750000
10  11.5  11.759611  WARM   216   44.083333
11  12.5  12.629096  WARM  2529   52.083333
12  13.5  14.082589  COLD  1735   60.750000
13  14.5  14.657678   HOT  1254   70.083333
14  15.5        NaN   HOT  1245   80.083333
15  16.5  17.184114  WARM   310   90.750000
16  17.5  17.800776   HOT   201  102.083333
17  18.5  18.578861   HOT  1767  114.083333


Unnamed: 0,x1,x2,x3,x5,x6
0,1.5,2.354153,COLD,0.75,4.5
1,2.5,3.314048,WARM,2.083333,7.5
2,3.5,4.021604,COLD,4.083333,10.5
3,4.5,,COLD,6.75,13.5
4,5.5,5.847601,WARM,10.083333,16.5


In [30]:
#using simple methods
from sklearn.preprocessing import FunctionTransformer

def assign_t(df):
    x6 = df['x1']**3 / df['x5']
    df['x6'] = x6
    df.drop('x4', inplace = True, axis  =1)

assignTrans = FunctionTransformer(assign_t)

In [31]:
assignTrans.transform(df)
df.head()

Unnamed: 0,x1,x2,x3,x5,x6
0,1.5,2.354153,COLD,0.75,4.5
1,2.5,3.314048,WARM,2.083333,7.5
2,3.5,4.021604,COLD,4.083333,10.5
3,4.5,,COLD,6.75,13.5
4,5.5,5.847601,WARM,10.083333,16.5


In [11]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot',OneHotEncoder(handle_unknown='ignore'))
])

In [14]:
cat_features = df[['x3']]
num_features = df[['x1', 'x2', 'x4', 'x5']]
cat_features.head()

Unnamed: 0,x3
0,COLD
1,WARM
2,COLD
3,COLD
4,WARM


# Create Transformation Pipeline for Numerical Features

Create a custom transformation pipeline for numeric data only called `num_pipeline` that:

1. Applies the `SimpleImputer` class to the data, where the strategy is set to `mean`.

2. Applies the custom `Assignment4Transformer` class to the data.

3. Applies the `StandardScaler` class to the data.

In [33]:
# importing necessary liberaries
import numpy as np

a4t = Assignment4Transformer()
num_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy = 'mean')),
    ('transformer',a4t),
    ('scaling',StandardScaler()),
    ])

Init Done


In [38]:
num_pipeline.fit_transform(num_features)

[[1.50000000e+00 2.35415298e+00 5.93000000e+02 7.50000000e-01]
 [2.50000000e+00 3.31404772e+00 3.40000000e+02 2.08333333e+00]
 [3.50000000e+00 4.02160446e+00 5.51000000e+02 4.08333333e+00]
 [4.50000000e+00 1.05067957e+01 2.36800000e+03 6.75000000e+00]
 [5.50000000e+00 5.84760100e+00 2.63600000e+03 1.00833333e+01]
 [6.50000000e+00 7.22991004e+00 2.77900000e+03 1.40833333e+01]
 [7.50000000e+00 7.99725523e+00 1.05700000e+03 1.87500000e+01]
 [8.50000000e+00 9.20394654e+00 8.19000000e+02 2.40833333e+01]
 [9.50000000e+00 1.03353477e+01 3.34900000e+03 4.30245098e+01]
 [1.05000000e+01 1.11121419e+01 3.23500000e+03 3.67500000e+01]
 [1.15000000e+01 1.17596108e+01 2.16000000e+02 4.40833333e+01]
 [1.25000000e+01 1.26290958e+01 2.52900000e+03 5.20833333e+01]
 [1.35000000e+01 1.40825889e+01 1.73500000e+03 6.07500000e+01]
 [1.45000000e+01 1.46576780e+01 1.25400000e+03 7.00833333e+01]
 [1.55000000e+01 1.05067957e+01 1.24500000e+03 8.00833333e+01]
 [1.65000000e+01 1.71841140e+01 3.10000000e+02 9.075000

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

# Create Numeric and Categorical DataFrames

Create two new data frames.  Create one DataFrame called `data_num` that holds the numeric features.  Create another DataFrame called `data_cat` that holds the categorical features.

In [None]:
### ENTER CODE HERE ###

# Quick Testing

The full pipeline will be implemented with a `ColumnTransformer` class.  However, to be sure that our numeric pipeline is working properly, lets invoke the `fit_transform()` method of the `num_pipeline` object.  Then, take a look at the transformed data to be sure all is well.

### Run Pipeline and Create Transformed Numeric Data

In [None]:
### ENTER CODE HERE ###

### One-Hot Encode Categorical Features

Similarly, you will employ a `OneHotEncoder` class in the `ColumnTransformer` below to construct the final full pipeline.  However, let's instantiate an object of the `OneHotEncoder` class called `cat_encoder` that has the `drop` parameter set to `first`.  Next, call the `fit_transform()` method and pass it your categorical data.  Take a look at the transformed one-hot vectors to be sure all is well.

In [None]:
### ENTER CODE HERE ###

# Put it All Together with a Column Transformer

Now, we are finally ready to construct the full transformation pipeline called `full_pipeline` that will transform our raw data into clean, ready-to-train data.  Construct this ColumnTransformer below, then call the `fit_transform()` method to obtain the final, clean data.  Save this output data into a variable called `data_trans`.

In [None]:
### ENTER CODE HERE ###

# Prepare for Grading

Prepare your `data_trans` NumPy array for grading by using the NumPy [around()](https://numpy.org/doc/stable/reference/generated/numpy.around.html) function to round all the values to 2 decimal places - this will return a NumPy array.

Please note the final order of the features in your final numpy array, which is given at the top of this document.

___You MUST print your final answer, which is the NumPy array discussed above, using the `print()` function!  This MUST be the only `print()` statement in the entire notebook!  Do not print anything else using the print() function in this notebook!___

In [None]:
print(np.around(data_trans,decimals=2))