## Laboratory 3

### Task 3 (from the previous laboratory)


a) Create a pipeline that can work on different kinds of data for a binary classification problem. The following steps may serve as a hint:

1. Define sets of columns to be transformed in different ways
2. Split data to train and test sets
3. Create pipelines for numerical and categorical features
4. Create ColumnTransformer to apply pipeline for each column set
5. Add a model to a final pipeline
6. Display the pipeline
7. Pass data through the pipeline


b) A pipeline built this way can be passed to `GridSearchCV` to optimize the model's hyperparameters. Carry out this optimization.

c) How would you modify this pipeline so that it **AUTOMATICALLY** selects the best-performing preprocessing method for the given data and also selects the ML algorithm?

Test the pipeline on the data at https://www.openml.org/search?type=data&status=active&id=45068




In [1]:
### Solution
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn import set_config
set_config(transform_output = "pandas")
from sklearn.ensemble import RandomForestClassifier

from sklearn.compose import ColumnTransformer




In [29]:
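imputation_basic = make_column_transformer(
    (SimpleImputer(strategy='mean'), make_column_selector(dtype_include=np.number)),
    # the default 'mean' strategy fails on string columns, so impute the most frequent value instead
    (SimpleImputer(strategy='most_frequent'), make_column_selector(dtype_include=np.object_)))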


In [30]:


num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale',MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot',OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2

])


col_trans = ColumnTransformer(transformers=[
    ('num_pipeline',num_pipeline, make_column_selector( dtype_include= np.number)),
    ('cat_pipeline',cat_pipeline,make_column_selector( dtype_include= np.object_))
    ],
    remainder='drop',
    n_jobs=-1)
model_pipeline = Pipeline([('preprocessing', col_trans),
                           ('model', RandomForestClassifier())])


In [31]:
display(model_pipeline)

### Task 1

Implement the model stacking method used in AutoGluon. Compare its performance with that of the individual models.


In [78]:
### Solution
import pandas as pd
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import NearestCentroid, NearestNeighbors
from sklearn.ensemble import HistGradientBoostingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, GridSearchCV, KFold


from copy import deepcopy

X, y = make_hastie_10_2(random_state=0)
X = pd.DataFrame(X, columns = ['x_' + str(i) for i in range(X.shape[1])])
X_train, X_test = X.iloc[:2000], X.iloc[2000:]
y_train, y_test = y[:2000], y[2000:]




### Simplest version - without internal cross-validation


In [77]:



model_list_L1 = [RandomForestClassifier(), SVC(probability=True), 
               HistGradientBoostingClassifier(), GradientBoostingClassifier()]
model_list_L2 = deepcopy(model_list_L1)

X_1 = deepcopy(X_train)

for model in model_list_L1:
    print('L1:', type(model).__name__)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_train)
    y_pred = pd.DataFrame(y_pred[:,1], columns=['pred_'+ type(model).__name__])
    # keep the column names (no ignore_index) so they match the frame built at prediction time
    X_1 = pd.concat([X_1.reset_index(drop=True), y_pred.reset_index(drop=True)], axis=1)


l2_pred = {}
for model_l2 in model_list_L2:
    print('L2:', type(model_l2).__name__)
    model_l2.fit(X_1, y_train)
    l2_pred['l2_pred_'+ type(model_l2).__name__] = model_l2.predict_proba(X_1)[:,1]
X_2 = pd.DataFrame(l2_pred)


final_model = LogisticRegression()
final_model.fit(X_2, y_train)


L1: RandomForestClassifier
L1: SVC
L1: HistGradientBoostingClassifier
L1: GradientBoostingClassifier
L2: RandomForestClassifier
L2: SVC
L2: HistGradientBoostingClassifier
L2: GradientBoostingClassifier


In [42]:
#### To make predictions, the test set has to be passed through the same layers step by step
X_1_test = deepcopy(X_test)

for model in model_list_L1:
    y_test_pred = model.predict_proba(X_test)
    y_test_pred = pd.DataFrame(y_test_pred[:,1], columns=['pred_'+ type(model).__name__])
    # print(y_test_pred)
    X_1_test = pd.concat([X_1_test.reset_index(drop=True), y_test_pred.reset_index(drop=True)], axis=1)


l2_test_pred = {}
for model_l2 in model_list_L2:

    l2_test_pred['l2_pred_'+ type(model_l2).__name__] = model_l2.predict_proba(X_1_test)[:,1]


X_2_test = pd.DataFrame(l2_test_pred)
final_model.predict_proba(X_2_test)



array([[0.00391504, 0.99608496],
       [0.99714801, 0.00285199],
       [0.99714741, 0.00285259],
       ...,
       [0.99714872, 0.00285128],
       [0.01214134, 0.98785866],
       [0.00299535, 0.99700465]])

#### Pipeline version

In [102]:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
from sklearn.metrics import get_scorer
from sklearn import set_config
set_config(transform_output = "pandas")


#### ColumnTransformer only accepts transformers (objects with .fit and .transform methods).
#### To use estimators as transformers we need a wrapper class that adds a .transform method

class ClassifierWrapper(BaseEstimator, TransformerMixin):
    
    def __init__(self, estimator, verbose=None, fit_params=None, use_proba=True, scoring=None):
        # Only store the parameters here so that sklearn.base.clone() works as expected
        self.estimator = estimator
        self.verbose = verbose  # None/0 = silent, 1 = moderately verbose, 2 = extra verbose
        self.fit_params = fit_params
        self.use_proba = use_proba  # whether to use predict_proba in transform
        self.scoring = scoring  # scorer name accepted by sklearn.metrics.get_scorer, e.g. 'roc_auc'
        self.score = None  # keeps the training score if scoring is set

    def fit(self, X, y):
        if (self.verbose or 0) == 2:
            print("X: ", X.shape, "\nFit params:", self.fit_params)
        self.estimator.fit(X, y, **(self.fit_params or {}))
        return self

    def transform(self, X):
        # Reuse the index of X so that FeatureUnion/ColumnTransformer align the
        # predictions with the original rows (matters when the index does not start at 0)
        index = getattr(X, 'index', None)
        if self.use_proba:
            values = self.estimator.predict_proba(X)[:, 1]
        else:
            values = self.estimator.predict(X)
        return pd.DataFrame({'pred_' + type(self.estimator).__name__: values}, index=index)

    def fit_transform(self, X, y, **kwargs):
        self.fit(X, y)
        p = self.transform(X)
        if self.scoring is not None:
            # get_scorer replaces the earlier eval() call; this is a score on the training data
            self.score = get_scorer(self.scoring)(self.estimator, X, y)
            # TODO print own instance name?
            if (self.verbose or 0) > 0:
                print("score: ", self.score)
        return p

    def predict(self, X):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)


#### Define lists of models

model_list_L1 = [RandomForestClassifier(), SVC(probability=True), 
               HistGradientBoostingClassifier(), GradientBoostingClassifier()]
model_list_L2 = [RandomForestClassifier(), SVC(probability=True), 
               HistGradientBoostingClassifier(), GradientBoostingClassifier()]
#### ColumnTransformer for parallel training models
#### make_column_selector() - without any inputs use all columns

l1_pred_layer = ColumnTransformer(transformers=[('L1_pred_'+ type(model_i).__name__, 
                                      ClassifierWrapper(model_i), 
                                      make_column_selector()) for model_i in model_list_L1
],
                  remainder = 'passthrough')

#### Concatenate original data with new columns - each of them is a vector of predictions
#### make_pipeline('passthrough') - pass original data to concatenation


l1_concat_layer = FeatureUnion([("original", make_pipeline('passthrough')),
                                ("l1_pred", l1_pred_layer)])

### 

l2_pred_layer = ColumnTransformer(transformers=[('L2_pred_'+ type(model_i).__name__, 
                                      ClassifierWrapper(model_i), 
                                      make_column_selector()) for model_i in model_list_L2
],
                  remainder = 'passthrough')

# display(ct)

# l1_concat_layer.fit_transform

stack_pipe = Pipeline([('L1',l1_concat_layer),
                       ('L2', l2_pred_layer),
                       ('model', LogisticRegression())
                        ])
display(stack_pipe)



In [None]:
stack_pipe.fit(X_train, y_train)
# Inspect the shape of the data after each stacking layer
after_l1 = l1_concat_layer.transform(X_train)
after_l2 = l2_pred_layer.transform(after_l1)

print('Before transformation: ', X_train.shape)
print('After L1 transformation: ', after_l1.shape)
print('After L2 transformation: ', after_l2.shape)

Before transformation:  (2000, 10)
After L1 transformation:  (2000, 14)
After L2 transformation:  (2000, 4)

In [15]:
### One could also try scikit-learn's StackingClassifier

# L2_MODELS = []
# for model_i in model_list:
#     st_model = StackingClassifier(final_estimator= model_i, estimators= model_list, passthrough=True)
#     L2_MODELS.append(st_model)

# print(L2_MODELS)

# final_stack = StackingClassifier(final_estimator=LogisticRegression(), estimators=[L2_MODELS], passthrough=False)
# display(final_stack)

[StackingClassifier(estimators=[RandomForestClassifier(), SVC(probability=True),
                               HistGradientBoostingClassifier(),
                               GradientBoostingClassifier()],
                   final_estimator=RandomForestClassifier(), passthrough=True), StackingClassifier(estimators=[RandomForestClassifier(), SVC(probability=True),
                               HistGradientBoostingClassifier(),
                               GradientBoostingClassifier()],
                   final_estimator=SVC(probability=True), passthrough=True), StackingClassifier(estimators=[RandomForestClassifier(), SVC(probability=True),
                               HistGradientBoostingClassifier(),
                               GradientBoostingClassifier()],
                   final_estimator=HistGradientBoostingClassifier(),
                   passthrough=True), StackingClassifier(estimators=[RandomForestClassifier(), SVC(probability=True),
                               His

### Task 2

Check whether model stacking is more effective when the base models come from a single class of algorithms or from different classes of algorithms.

In [None]:
### Solution
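
One way to set up this comparison is to build two stacks that differ only in their base models and score both with the same cross-validation. The sketch below is an assumption-laden starting point. The specific models, the names `same_family`, `mixed_family`, and `stack_score`, and the data size are illustrative choices, and the outcome will depend on the dataset.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_cmp, y_cmp = make_hastie_10_2(n_samples=1500, random_state=0)

# One algorithm class: random forests that differ only in depth and seed.
same_family = [
    ("rf_depth" + str(d), RandomForestClassifier(n_estimators=50, max_depth=d, random_state=d))
    for d in (2, 4, 8)
]
# Different algorithm classes: tree ensemble, nearest neighbours, naive Bayes.
mixed_family = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("nb", GaussianNB()),
]


def stack_score(estimators):
    """Mean 3-fold accuracy of a stack with a logistic-regression combiner."""
    stack = StackingClassifier(estimators=estimators,
                               final_estimator=LogisticRegression(), cv=3)
    return cross_val_score(stack, X_cmp, y_cmp, cv=3, scoring="accuracy").mean()


scores = {
    "same family": stack_score(same_family),
    "mixed families": stack_score(mixed_family),
}
for name, value in scores.items():
    print(name, round(value, 3))
```

Repeating the run with several random seeds or datasets would make the comparison more reliable than a single split.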

### Task 3
Run AutoGluon

https://colab.research.google.com/github/gidler/autogluon-tutorials/blob/main/tutorials/tabular_prediction/tabular-quickstart.ipynb

In [None]:
# Uncomment the code below and run this cell if AutoGluon is not yet installed in the kernel.
# !pip install autogluon  # These tutorials are based on AutoGluon v0.5.0 and might not work with different versions.



In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor



In [None]:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


In [None]:
label = 'class'
# describe must be called; without parentheses only the bound method object is printed
print("Summary of class variable: \n", train_data[label].describe())


In [None]:
save_path = 'agModels-predictClass'  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)