In [None]:
#Begin!

# Masters' in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for the MLA course

by [*lufer*](mailto:lufer@ipca.pt)

---



# ML Modelling - Part VI - Pipeline
\
**Contents**:

1.  **Machine Learning Pipeline**



This notebook explores the automation of ML model creation, testing and improvement.

## What is ML Pipeline?


> "(…) is a series of interconnected data processing and modelling steps designed to automate, standardize and streamline the process of building, training, evaluating and deploying machine learning models (…)"
(IBM)


**Pipeline process:**

1. Data retrieval and ingestion
2. Data preparation
3. Model training
4. Model evaluation and tuning
5. Model deployment
6. Monitoring

A `Pipeline` sequentially applies a list of transformers to preprocess the data and, if desired, concludes the sequence with a final predictor for predictive modeling.

Intermediate steps of the pipeline must be transformers, that is, they must implement the `fit` and `transform` methods. The final estimator only needs to implement `fit`. The transformers in the pipeline can be cached using the `memory` argument.

[see more on...](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
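
As a minimal sketch of these rules (not part of the original course material), the cell below chains one transformer and one final estimator and enables caching through `memory`. The synthetic data, the names `X_demo`, `y_demo`, `demo_pipe` and the temporary cache folder are assumptions made only for illustration.

```python
from tempfile import mkdtemp
from shutil import rmtree

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

cache_dir = mkdtemp()  # temporary folder used to cache fitted transformers
demo_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),    # intermediate step: implements fit and transform
        ("clf", LogisticRegression()),   # final step: only fit is required
    ],
    memory=cache_dir,
)
demo_pipe.fit(X_demo, y_demo)
demo_accuracy = demo_pipe.score(X_demo, y_demo)
print(f"Training accuracy: {demo_accuracy:.2f}")

rmtree(cache_dir)  # remove the cache folder when it is no longer needed
```

Caching mostly pays off when the same transformers are refitted many times with identical inputs, for example during a grid search.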


## Environment preparation


**Importing necessary Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

# import libraries for training
from sklearn.model_selection import train_test_split


In [None]:
import datetime
print(f"Last updated: {datetime.datetime.now()}")

**Mounting Drive**

In [None]:

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

## 1- Using a Dummy dataset

Let's create a dummy classification dataset. `make_classification` allows that!


```
sklearn.datasets.make_classification(
  n_samples=100,
  n_features=20, *,
  n_informative=2,
  n_redundant=2,
  n_repeated=0,
  n_classes=2,
  n_clusters_per_class=2,
  weights=None,
  flip_y=0.01,
  class_sep=1.0,
  hypercube=True,
  shift=0.0,
  scale=1.0,
  shuffle=True,
  random_state=None
)
```



With these default values, the generated classification dataset has 100 samples and 20 features. Of the 20 features, only 2 are informative, 2 are redundant (random linear combinations of the informative features) and the remaining 16 are uninformative (random noise).
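
The default sizes can be checked directly. The sketch below leaves every argument at its default except `random_state`, which is fixed here only for repeatability; the names `X_default` and `y_default` are illustrative.

```python
from sklearn.datasets import make_classification

# All arguments left at their defaults, except the seed (fixed for repeatability)
X_default, y_default = make_classification(random_state=0)

print(X_default.shape)   # (n_samples, n_features)
print(set(y_default))    # class labels present in the data
```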

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
# 10000 examples (i.e., samples)
# 10 features: 2 informative, 2 redundant, 6 uninformative
# 1 cluster per class
# default class separation (class_sep=1.0)

X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=2, n_redundant=2, random_state=42, n_clusters_per_class=1
)

train_samples = 100  # Samples used for training the models
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    shuffle=False,
    test_size=10000 - train_samples,
)
#We have data!

**Check dataset insights:**

In [None]:
X.shape, y.shape

In [None]:
X

In [None]:
plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=0.05, top=0.9, left=0.05, right=0.95)
plt.subplot(111)
plt.title("Ten features, two informative features, one cluster per class", fontsize="small")
plt.scatter(X[:, 0], X[:, 1], marker="o", c=y, s=25, edgecolor="k")
plt.show()

**Another example of `make_classification`**

In [None]:
#just for understanding better!
X1, y1 = make_classification(n_samples = 200
                           ,n_features = 2
                           ,n_informative = 2
                           ,n_redundant = 0
                           ,n_clusters_per_class = 1
                           ,flip_y = 0
                           ,class_sep = 2
                           ,random_state = 7
                           )
plt.style.use('fivethirtyeight')
plt.figure(figsize = (6,6))
sns.scatterplot(x = X1[:,0], y = X1[:,1], hue = y1)

Let's continue with the Pipeline!

In [None]:
from sklearn.svm import SVC     # C-Support Vector Classification model
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)
print("Score with default C:", pipe.score(X_test, y_test))
# An estimator's parameter can be set using '__' syntax
# "svc__C=10" means: set parameter "C" of the "svc" step to 10
# set parameter, train and evaluate
print("Score with C=10:", pipe.set_params(svc__C=10).fit(X_train, y_train).score(X_test, y_test))
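
To make the "no leakage" point concrete, the following sketch (on a separate synthetic dataset, with illustrative names such as `toy_pipe`) checks that the scaler inside a fitted pipeline learned its statistics from the training rows only, and shows the `<step>__<parameter>` naming in action.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=1)

toy_pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
toy_pipe.fit(X_tr, y_tr)

# The scaler inside the pipeline was fitted on the training rows only
scaler_mean = toy_pipe.named_steps["scaler"].mean_
print(np.allclose(scaler_mean, X_tr.mean(axis=0)))

# Nested parameters are addressed as <step name>__<parameter name>
toy_pipe.set_params(svc__C=10)
print(toy_pipe.get_params()["svc__C"])
```

When `score` or `predict` is later called on the test rows, they are only transformed with these training statistics, never used to refit the scaler.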

## 2 - Using an existing Dataset

### 2.1 - Download Dataset

In [None]:
data = pd.read_csv("/content/gDrive/MyDrive/MIA/ColabNotebooks/Datasets/heart-disease.csv")
data.head()

### 2.2 - Dataset insights and Preparation



In [None]:
data.info()

In [None]:
data.isna().sum()
# Are there NaN values? No!

In [None]:
data.describe()
# there are no categorical features!

In [None]:
data.shape

In [None]:
data.head()

Look at the feature (cor)relations:

The pandas [crosstab](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html) function builds a cross-tabulation table that can show the frequency with which certain groups of data appear.
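
A tiny example may help to read such a table. The data below are invented purely for illustration (the column names mimic the dataset, the values do not come from it).

```python
import pandas as pd

# Toy data, invented only to illustrate the function
toy = pd.DataFrame({
    "target": [1, 1, 0, 0, 1, 0],
    "sex":    [0, 1, 1, 1, 0, 1],
})

# Rows: values of target; columns: values of sex; cells: number of matching rows
toy_ct = pd.crosstab(toy["target"], toy["sex"])
print(toy_ct)
```

Each cell counts how many rows share that combination of `target` and `sex`, so the cells always sum to the number of rows.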


In [None]:
pd.crosstab(data.target, data.sex)

**Relation between Heart Disease and the person's Sex:**

In [None]:
# Create a plot of this crosstab
ct=pd.crosstab(data.target, data.sex)
ct.plot(kind="bar",
        figsize=(6, 4),
        color=["lightpink", "lightblue"])

SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 12

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Qty")
plt.legend(["Female", "Male"]);
plt.xticks(rotation=0);


**Relation between Heart Disease, the person's Age and Max Heart Rate:**

In [None]:
plt.scatter(data.age[data.target==1],data.thalach[data.target==1],c="salmon")
plt.scatter(data.age[data.target==0],data.thalach[data.target==0],c="lightblue")

plt.title("Heart Disease as a function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Disease", "No Disease"]);

**Age distribution:**

In [None]:
data.age.plot.hist();


**Heart Disease Frequency per Chest Pain Type**

- cp - chest pain type:
> 0: **Typical angina**: chest pain related to decreased blood supply to the heart.
>
> 1: **Atypical angina**: chest pain not related to heart.
>
> 2: **Non-anginal pain**: typically esophageal spasms (non heart related).
>
> 3: **Asymptomatic**: chest pain not showing signs of disease.

In [None]:
ct=pd.crosstab(data.cp, data.target)
ct

In [None]:
ct.plot(kind="bar",
        figsize=(8,4),
        color=["salmon", "lightblue"])

# Add some communication
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.xlabel("Chest Pain Type: 0 - Typical angina; 1 - Atypical angina; 2 - Non-anginal pain; 3 - Asymptomatic")
plt.ylabel("Frequency")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation=0);

**Check Correlations:**

In [None]:
cm=data.corr()
cm

In [None]:
#Graphically

fig, ax= plt.subplots(figsize=(10,8))

ax= sns.heatmap(cm, annot=True, linewidths=0.5, fmt=".2f", cmap="YlGnBu")

### 2.3 - Modeling

In [None]:
#Define X and y
X=data.drop("target", axis=1)
y=data.target

In [None]:
np.random.seed(42)

# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

Let's explore three different models:
- LogisticRegression
- KNeighborsClassifier
- RandomForestClassifier


In [None]:
# Import models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
       # Fit the model to the data
       model.fit(X_train, y_train)
       # Evaluate the model and append its score to model_scores
       model_scores[name] = model.score(X_test, y_test)
    return model_scores



In [None]:
#res = list(models.keys())[0]
#analyse all models
scores = fit_and_score(models,X_train,X_test,y_train,y_test)
scores


### 2.4 - Model Comparison

In [None]:
model_compare = pd.DataFrame(scores, index=["accuracy"])
model_compare.T.plot.bar();

### 2.5 - Hyperparameter tuning

**2.5.1 - by hand**

Explore KNN model:

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data.

[See more about KNN in...](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)
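
To illustrate what "simply stores instances" means, here is a minimal NumPy sketch of the idea. It is not scikit-learn's implementation; the helper `knn_predict` and the toy points are hypothetical.

```python
import numpy as np

def knn_predict(X_stored, y_stored, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest stored points."""
    distances = np.linalg.norm(X_stored - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = np.bincount(y_stored[nearest])                 # count labels among the neighbours
    return int(np.argmax(votes))

# Toy data: two small groups of points
X_stored = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                     [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_stored = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_stored, y_stored, np.array([0.3, 0.3]))
pred_near_five = knn_predict(X_stored, y_stored, np.array([4.5, 4.5]))
print(pred_near_origin, pred_near_five)
```

There is no training step beyond keeping `X_stored` and `y_stored`; all the work happens at prediction time.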

In [None]:
# Let's tune KNN

train_scores = []
test_scores = []

# Create a list of different values for n_neighbors (1 to 9)
neighbors = range(1, 10)

# Setup KNN instance
knn = KNeighborsClassifier()

# Loop through different n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)

    # Fit the algorithm
    knn.fit(X_train, y_train)

    # Update the training scores list
    train_scores.append(knn.score(X_train, y_train))

    # Update the test scores list
    test_scores.append(knn.score(X_test, y_test))


Compare the results:

In [None]:
plt.plot(neighbors, train_scores, label="Train score")
plt.plot(neighbors, test_scores, label="Test score")
plt.xticks(np.arange(1, 10, 1))

plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")

**2.5.2 - Hyperparameter tuning with RandomizedSearchCV**

Let's explore the `RandomForestClassifier` model.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter grid for RandomizedSearchCV
param_values = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
                "max_depth": [None, 5, 10, 20, 30],
                "max_features": ["sqrt", "log2", None],
                "min_samples_split": [2, 4, 6, 8],
                "min_samples_leaf": [1, 2, 4, 8]}

rfc = RandomForestClassifier()
n_iter=5
rfc = RandomizedSearchCV(estimator=rfc,
                            param_distributions=param_values,
                            n_iter=n_iter,                          # how many models to try
                            cv=5,                                   # 5-fold cross-validation
                            verbose=2)
# Fit the RandomizedSearchCV version of rfc (it does cross-validation for us, so no separate validation set is needed)
rfc.fit(X_train, y_train);
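
After fitting, a randomized search object exposes the winning combination and its cross-validated score. The sketch below uses a deliberately small, made-up grid on synthetic data so that it runs quickly; `small_grid`, `search`, `X_rs` and `y_rs` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_rs, y_rs = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {"n_estimators": [10, 20, 50], "max_depth": [None, 3, 5]}
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=4,          # only 4 of the 9 possible combinations are sampled
    cv=3,
    random_state=0,
)
search.fit(X_rs, y_rs)

print(search.best_params_)                  # sampled combination with the best mean CV score
print(round(search.best_score_, 3))         # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))    # number of combinations actually tried
```

The same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on a fitted `GridSearchCV` object.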

**2.5.2 - Hyperparameter tuning with GridSearchCV**

Let's explore the `LogisticRegression` model:

In [None]:

from sklearn.model_selection import GridSearchCV

# Different hyperparameters for our LogisticRegression model
log_params = {"C": np.logspace(-4, 4, 30),
                "solver": ["liblinear"]}

# Setup grid hyperparameter search for LogisticRegression
gs_lr = GridSearchCV(LogisticRegression(),
                          param_grid=log_params,
                          cv=5,
                          verbose=True)

# Fit grid hyperparameter search model
gs_lr.fit(X_train, y_train);

**2.5.2.1 - Check the best parameters values**

The best model has the following parameters' values:

In [None]:
gs_lr.best_params_

**2.5.2.2 - Check accuracy**

In [None]:
gs_lr.score(X_test, y_test)

**2.5.2.3 - Evaluating**

Evaluating our tuned machine learning classifier beyond accuracy:
\

- ROC curve and AUC score
- Classification report
- Precision
- Recall
- F1-score
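
Precision, recall and F1 all follow from the four counts of a binary confusion matrix. As a small worked example with invented labels (not taken from the heart disease data):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)                           # share of predicted positives that are correct
recall = tp / (tp + fn)                              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
print(precision, recall, f1)
```

The hand-computed values agree with `precision_score`, `recall_score` and `f1_score` applied to the same labels.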

**Predicting**

In [None]:
y_preds_gs_lr = gs_lr.predict(X_test)
y_preds_gs_lr

**ROC-Curve and AUC score**

In [None]:
# we want our plots to appear inside the notebook
%matplotlib inline
#from sklearn.metrics import plot_roc_curve       # removed in scikit-learn 1.2; use RocCurveDisplay instead
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(gs_lr, X_test, y_test)


**2.5.2.4 - Confusion Matrix**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
print(confusion_matrix(y_test, y_preds_gs_lr))

Graphically

In [None]:
sns.set(font_scale=1.5)

def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap()
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    # Rows of the confusion matrix are true labels, columns are predicted labels
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True,
                     cbar=False)
    plt.xlabel("Predicted label")
    plt.ylabel("True label")

    # Workaround for heatmap rows being cut off in some older matplotlib versions
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)

plot_conf_mat(y_test, y_preds_gs_lr)

**2.5.2.5 - Classification Report**

In [None]:
print(classification_report(y_test, y_preds_gs_lr))

**2.5.2.6 - New instance of the "best" LogisticRegression model:**

In [None]:
# Create a new classifier with the best parameters
# (if your grid search reports different values, use gs_lr.best_params_ instead)
clf = LogisticRegression(C=0.20433597178569418,
                         solver="liblinear")

*Prediction without cross-validation*

In [None]:
clf.fit(X_train,y_train)
y_preds_bestmodel = clf.predict(X_test)
y_preds_bestmodel

In [None]:
print(classification_report(y_test, y_preds_bestmodel))

**Accuracy**

In [None]:
from sklearn.model_selection import  cross_val_score
# Cross-validated accuracy
cv_acc = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="accuracy")
cv_acc

In [None]:
cv_acc = np.mean(cv_acc)
cv_acc

**Precision**

In [None]:
# Cross-validated precision
cv_precision = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="precision")
cv_precision=np.mean(cv_precision)
cv_precision

**Recall**

In [None]:
# Cross-validated recall
cv_recall = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="recall")
cv_recall = np.mean(cv_recall)
cv_recall

**F1-score**

In [None]:
# Cross-validated f1-score
cv_f1 = cross_val_score(clf,
                         X,
                         y,
                         cv=5,
                         scoring="f1")
cv_f1 = np.mean(cv_f1)
cv_f1

**See all Performance Metrics:**

In [None]:
# Visualize cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
                           "Precision": cv_precision,
                           "Recall": cv_recall,
                           "F1": cv_f1},
                          index=[0])

cv_metrics.T.plot.bar(title="Cross-validated classification metrics",
                      legend=False);

**2.5.2.7 - Features relevance for this LogisticRegression model**

The model attribute `coef_` gives the array of weights estimated by the logistic regression, i.e., the coefficients of the features in the decision function.
\
`coef_` corresponds to outcome 1 (True) and `-coef_` corresponds to outcome 0 (False).
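
The role of `coef_` in the decision function can be verified numerically. The sketch below, on synthetic data with illustrative names (`lr`, `X_lr`, `y_lr`), recomputes the probability of class 1 by hand for a binary model and compares it with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_lr, y_lr = make_classification(n_samples=200, n_features=4, random_state=0)
lr = LogisticRegression(solver="liblinear").fit(X_lr, y_lr)

# Linear score: one weight per feature plus the intercept
z = X_lr @ lr.coef_[0] + lr.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))          # sigmoid of the score

sklearn_proba = lr.predict_proba(X_lr)[:, 1]     # probability of class 1
print(np.allclose(manual_proba, sklearn_proba))
```

A positive weight therefore pushes the prediction towards class 1 as the feature grows, and a negative weight pushes it towards class 0. Note that the magnitudes are only comparable across features when the features are on similar scales.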

In [None]:
clf.coef_

In [None]:
data.head()

In [None]:
# Match the coefficients to the feature columns (X excludes the target column)
feature_dict = dict(zip(X.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# Visualize feature importance
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.bar(title="Feature Importance", legend=False);

## 3 - Make a pipeline

The pipeline is built with a list of (key, value) pairs. The key is a string containing the name you want to give and the value is the estimator object.

```
class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)
```
where `steps` is the list of **(name of step, estimator)** tuples that are to be chained in sequential order. To be compatible with the scikit-learn API, all steps must define `fit`. All non-last steps must also define `transform`.

A very simple example was already used in the previous dummy example:

```
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
```
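
The step names matter because they become the prefixes used when setting parameters. As a short sketch, the cell below compares hand-picked names with the names that `make_pipeline` generates automatically (the variables `named` and `auto_named` are illustrative).

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

named = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
auto_named = make_pipeline(StandardScaler(), SVC())

named_keys = [name for name, _ in named.steps]
auto_keys = [name for name, _ in auto_named.steps]
print(named_keys)   # names chosen by hand
print(auto_keys)    # names derived from the class names, in lowercase
```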




### 3.1 - Basic pipeline

Let's create a very basic pipeline with the following sequence:

- **Scaler**: For pre-processing data, i.e., transform the data to zero mean and unit variance using the StandardScaler().
- **Feature selector**: Use *VarianceThreshold()* for discarding features whose variance is less than a certain defined threshold.
- **Classifier**: *KNeighborsClassifier()*, which implements the k-nearest neighbors classifier and assigns the majority class among the k training points closest to the test example.
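
The feature selector is probably the least familiar of the three steps, so here is a small sketch of what `VarianceThreshold` does on its own, using an invented array whose middle column is constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_var = np.array([[1.0, 0.0, 3.0],
                  [2.0, 0.0, 3.1],
                  [3.0, 0.0, 2.9]])   # the middle column is constant

selector = VarianceThreshold()        # default threshold=0.0 removes zero-variance features
X_reduced = selector.fit_transform(X_var)
kept = selector.get_support()         # boolean mask of the columns that were kept
print(X_reduced.shape, kept)
```

One design point worth keeping in mind: after `StandardScaler()` every non-constant feature has unit variance, so with that scaler a small threshold can only remove constant features. With other scalers the variances differ and the threshold has more effect.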

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import VarianceThreshold # Feature selector


pipe = Pipeline([
#('scaler', StandardScaler()),
#or
('sca', StandardScaler()),          #the key name is arbitrary
('selector', VarianceThreshold()),
('classifier', KNeighborsClassifier())
#('classifier', RandomForestClassifier())
#('classifier', LogisticRegression())
])
#This pipe object is simple to understand. It says, scale first, select features second and classify in the end

In [None]:
#pipe

In [None]:
#the pipe behaves like a model..thus it must be trained!
pipe.fit(X_train, y_train)

print('Training set score: ' + str(pipe.score(X_train,y_train)))
print('Test set score: ' + str(pipe.score(X_test,y_test)))

### 3.2 - Optimizing and Tuning the Pipeline

Optimizing means selecting different solvers, parameters, etc. For instance:

- Searching for other scalers. Instead of just the StandardScaler(), we can try MinMaxScaler(), Normalizer() and MaxAbsScaler().
- Searching for the best variance threshold to use in the selector, i.e., VarianceThreshold().
- Searching for the best value of k for the KNeighborsClassifier().
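
Before launching a grid search it helps to know how many fits it implies. The sketch below counts the candidates for a grid with the same sizes as the one used in this section; the placeholder strings stand in for the scaler objects, and `example_grid` and `cv_folds` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder strings stand in for the scaler objects; only the counts matter here
example_grid = {
    "sca": ["standard", "minmax", "normalizer", "maxabs"],
    "selector__threshold": [0, 0.001, 0.01],
    "classifier__n_neighbors": [1, 3, 5, 7, 10],
    "classifier__p": [1, 2],
    "classifier__leaf_size": [1, 5, 10, 15],
}

n_candidates = len(ParameterGrid(example_grid))   # 4 * 3 * 5 * 2 * 4 combinations
cv_folds = 2
print(n_candidates, n_candidates * cv_folds)      # candidates, total model fits
```

The number of fits grows multiplicatively with every added parameter, which is why `RandomizedSearchCV` is often preferred for larger grids.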

In [None]:
# Various pre-processing steps
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder

parameters = {
    # the key must match the step name used in the pipe ('sca' here, or 'scaler' if renamed)
    'sca': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
    'selector__threshold': [0, 0.001, 0.01],
    'classifier__n_neighbors': [1, 3, 5, 7, 10],
    'classifier__p': [1, 2],
    'classifier__leaf_size': [1, 5, 10, 15]
}


In [None]:
parameters

Apply these new parameters to the pipe "model"

In [None]:
grid = GridSearchCV(pipe, parameters, cv=2)
grid.fit(X_train, y_train)
#Don’t worry too much about the warning that you get by running the code above.
#It is generated because we have very few training samples and the cross-validation object does not
#have enough samples for a class for one of its folds.

In [None]:
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))

In [None]:
# Access the best set of parameters
best_params = grid.best_params_
print(best_params)
# Stores the optimum model in best_pipe
best_pipe = grid.best_estimator_
print(best_pipe)

Another way to analyse the pipe result

In [None]:
#grid.cv_results_
# Attention: too much data!!!

In [None]:
result_df = pd.DataFrame.from_dict(grid.cv_results_, orient='columns')
print(result_df.columns)

In [None]:

SMALL_SIZE = 8
MEDIUM_SIZE = 10
BIGGER_SIZE = 12

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title


sns.relplot(data=result_df,
 kind='line',
 x='param_classifier__n_neighbors',
 y='mean_test_score',
 hue='param_sca',
 col='param_classifier__p')
plt.show()

The plots clearly show that using MinMaxScaler(), with n_neighbors=5 and p=1, gives the best result.

## 4 - Another Pipeline

Example explored in the slides

In [None]:
ds = pd.read_csv("/content/gDrive/MyDrive/MIA/ColabNotebooks/Datasets/car-sales-extended-missing-data.csv")
ds.head()

### Dataset insights

*Data types*

In [None]:
ds.dtypes

`Make` and `Colour` are Categorical features!

*Null Values*

In [None]:
ds.isna().sum()

There are several Null values in all columns!

Let's work on the dataset, trying to:
1. Fill missing data
2. Convert data to numbers
3. Build a model on the data

For that we'll use a pipeline!

In [None]:
# Getting data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Import model and auxiliary processes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

### Pipeline for data preparation

In [None]:
# Setup random seed
import numpy as np
np.random.seed(42)

# Drop the rows with missing labels (the data was already loaded above)
ds.dropna(subset=["Price"], inplace=True)

To impute different features with different values (an arbitrary constant, the mean or the median), several *SimpleImputer* steps must be set up within pipelines and then joined with the *ColumnTransformer*.
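
As a quick illustration of the imputation strategies on their own, the sketch below fills the single missing entry of an invented column in three different ways.

```python
import numpy as np
from sklearn.impute import SimpleImputer

column = np.array([[1.0], [np.nan], [3.0], [8.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(column)
median_filled = SimpleImputer(strategy="median").fit_transform(column)
constant_filled = SimpleImputer(strategy="constant", fill_value=4).fit_transform(column)

print(mean_filled.ravel())      # missing entry replaced by the mean of 1, 3 and 8
print(median_filled.ravel())    # missing entry replaced by the median of 1, 3 and 8
print(constant_filled.ravel())  # missing entry replaced by the constant 4
```

In this toy column the mean and the chosen constant happen to coincide, while the median gives a different value; which strategy is appropriate depends on the feature.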

In [None]:
# Define different features and transformer pipelines
# The SimpleImputer class provides basic strategies for imputing missing values.
# Missing values can be imputed with a provided constant value, or using the statistics
# (mean, median or most frequent) of each column in which the missing values are located.
# see https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
# When strategy == “constant”, fill_value is used to replace all occurrences of missing_values.
# For string or object data types, fill_value must be a string.
# If None, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
# strategy: “mean”, “median”, “most_frequent”, or “constant”.

categorical_features = ["Make", "Colour"]

# this imputer imputes categorical features with "an arbitrary value"
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),# missing values are replaced by "missing"
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])                  # categorical values converted to numbers



In [None]:
door_feature = ["Doors"]
# this imputer fills missing door values with 4
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

In [None]:

numeric_features = ["Odometer (KM)"]
# this imputer fills missing numeric values with the mean
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])

In [None]:
# then we put the features list and the transformers together
# using the "ColumnTransformer"

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_feature),
        ("num", numeric_transformer, numeric_features)])

### Pipeline for model creation

Using `make_pipeline`

In [None]:
#Example:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
a=make_pipeline(StandardScaler(), GaussianNB(priors=None))

In [None]:

# Create a preprocessing and modelling pipeline
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestRegressor(n_jobs=-1))])

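# equivalent to the line below, except that make_pipeline names the steps automatically
# (the names would no longer be "preprocessor" and "model")
# model = make_pipeline(preprocessor,RandomForestRegressor(n_jobs=-1))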

In [None]:
model

The pipeline combines a series of data preprocessing steps (filling missing values, encoding categorical values as numbers) as well as a model!

In [None]:
#Or
#numerical_imputer = SimpleImputer(strategy = "mean")
#categorical_imputer = SimpleImputer(strategy="constant" , fill_value = "missing")
#door_imputer = SimpleImputer(strategy = "constant" , fill_value = 4)
#transformer = ColumnTransformer([
#    ("categorical_imputer" , categorical_imputer , categorical_features),
#    ("numerical_imputer" , numerical_imputer , numeric_features),
#    ("door_imputer" , door_imputer , door_feature),
#])

### Training

In [None]:
#Training
# Split data
X1 = ds.drop("Price", axis=1)
y1 = ds["Price"]
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2)

# Fit and score the model
model.fit(X1_train, y1_train)
scorePipe=model.score(X1_test, y1_test)
scorePipe

Predicting

In [None]:
y1_preds = model.predict(X1_test)
y1_preds.shape, y1_test.shape

### Model improvement with pipeline

Let's combine `GridSearchCV` with the `Pipeline`.

When creating a hyperparameter grid, it is necessary to add a prefix to each hyperparameter (see the [documentation for `RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) for a full list of possible hyperparameters to tune).

The prefix is the name of the `Pipeline` step you intend to alter, followed by two underscores.

For example, to adjust `n_estimators` of `"model"` in the `Pipeline`, you'd use: `"model__n_estimators"` (note the double underscore after `model`).
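
When steps are nested, the valid names can be listed with `get_params()` instead of being guessed. The sketch below builds a small stand-in pipeline that mirrors the step names used in this section (`nested`, `prep` and `num_pipe` are illustrative) and prints two of the resulting parameter names.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Small stand-in pipeline that mirrors the naming used in this section
num_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])
prep = ColumnTransformer(transformers=[("num", num_pipe, ["Odometer (KM)"])])
nested = Pipeline(steps=[("preprocessor", prep),
                         ("model", RandomForestRegressor())])

# Every key returned by get_params() is a valid name for set_params() or a search grid
param_names = sorted(nested.get_params().keys())
print([p for p in param_names if p.endswith("strategy") or p.endswith("n_estimators")])
```

Each level of nesting adds one `name__` prefix: pipeline step, then transformer name inside the `ColumnTransformer`, then step inside the inner pipeline, then the parameter itself.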

In [None]:
# Using grid search with pipeline
pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"], # note the double underscore after each prefix "preprocessor__"
    "model__n_estimators": [100, 1000],
    "model__max_depth": [None, 5],
    "model__max_features": ["sqrt"],
    "model__min_samples_split": [2, 4]
}
gs_model = GridSearchCV(model, pipe_grid, cv=5, verbose=2)  #model is the pipeline!
gs_model.fit(X1_train, y1_train)

In [None]:
# Score the best model
gsScore=gs_model.score(X1_test, y1_test)
gsScore



In [None]:
y_predPipe = gs_model.predict(X1_test)
y_predPipe

Comparing models

In [None]:
df = pd.DataFrame({"Baseline Pipe":scorePipe,
                   "GridSearchCV Pipe":gsScore}, index=[0])
df.plot.bar(figsize=(6, 5));
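
The next cell calls `evaluate_preds`, which is not defined anywhere in this notebook. A minimal sketch of such a helper is given here, assuming that regression metrics (R², MAE, MSE) are what is wanted for a price prediction task; the function body and the choice of metrics are hypothetical, not the author's original helper.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_preds(y_true, y_preds):
    """Hypothetical helper: prints and returns a dict of regression metrics."""
    metrics = {
        "r2": r2_score(y_true, y_preds),
        "mae": mean_absolute_error(y_true, y_preds),
        "mse": mean_squared_error(y_true, y_preds),
    }
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
    return metrics

# Toy check with made-up numbers
demo_metrics = evaluate_preds([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
```

Returning a dict keeps the helper compatible with `pd.DataFrame({"GridSearchCVPipe": gsPipeMetrics})`, which turns the metric names into the index.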

In [None]:
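# evaluate_preds is a helper that must already be defined; it is expected to return a dict of metrics
gsPipeMetrics=evaluate_preds(y1_test, y_predPipe)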

Comparing Performances

In [None]:
final = pd.DataFrame({"GridSearchCVPipe":gsPipeMetrics})
final

## 5 - Example of Pipeline for Data preparation

[See "Column Transformer with Mixed Types"](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py)

## 6 - Pipelines with tools

* [GitHub Actions](https://neptune.ai/blog/build-mlops-pipelines-with-github-actions-guide)

* [Kedro](https://neptune.ai/blog/data-science-pipelines-with-kedro)

* [Metaflow](https://metaflow.org/)

* [TensorFlow](https://www.tensorflow.org/?hl=pt)

* Others


Explore!

## 7 - Using ML Models in other applications

* [Using C# to run Python Scripts with Machine Learning Models](https://ernest-bonat.medium.com/using-c-to-run-python-scripts-with-machine-learning-models-a82cff74b027)

* [Using C# to call Python RESTful API Web Services with Machine Learning Models](https://ernest-bonat.medium.com/using-c-to-call-python-restful-api-web-services-with-machine-learning-models-6d1af4b7787e)
* [Machine Learning: Models to Production](https://towardsdatascience.com/how-to-prepare-scikit-learn-models-for-production-4aeb83161bc2)

## References


* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

* [Handling Missing Data with SimpleImputer](https://www.analyticsvidhya.com/blog/2022/10/handling-missing-data-with-simpleimputer/)

* [How to use Sklearn to impute missing values](https://www.educative.io/answers/how-to-use-sklearn-to-impute-missing-values)
* [Credits to...](https://www.kaggle.com/code/gunjanvermaa/prediction-model-pipeline-heart-disease/notebook)

In [None]:
#!End