In [4]:
%load_ext jupyter_black

import os
import pandas as pd
from sklearn import set_config
set_config(display='diagram')
pd.set_option('display.max_columns', 50)

# DO NOT CHANGE THIS, OR YOUR PROJECT WILL NOT WORK PROPERLY. Make sure you have run `kedro install` beforehand, and do not run this cell twice: a second os.chdir call would move the working directory again.
os.chdir("../../")
%load_ext kedro.ipython
%reload_kedro .



The jupyter_black extension is already loaded. To reload it, use:
  %reload_ext jupyter_black
The kedro.ipython extension is already loaded. To reload it, use:
  %reload_ext kedro.ipython
2024-01-07 20:32:40,562 - kedro_mlflow.config.kedro_mlflow_config - INFO - The 'tracking_uri' key in mlflow.yml is relative ('server.mlflow_(tracking|registry)_uri = mlruns'). It is converted to a valid uri: 'file:///Users/Matheus_Pinto/Desktop/quantumblack/titanic-dataset/mlruns'
2024-01-07 20:32:40,718 - kedro.ipython - INFO - Kedro project project
2024-01-07 20:32:40,719 - kedro.ipython - INFO - Defined global variable 'context', 'session', 'catalog' and 'pipelines'
2024-01-07 20:32:40,730 - kedro.ipython - INFO - Registered line magic 'run_viz'


# Titanic use case

__Tutorial notebook__ for testing the package end to end.



## Challenge Requirements

### **Object-Oriented Programming (OOP)**


<div class="alert alert-info">
<b></b>

> The project's codebase adheres to Object-Oriented Programming (OOP) principles. It follows the scikit-learn Transformers and Estimators schema and object injection, facilitating modularity and extensibility. This design enables compatibility with any machine learning model that adheres to the scikit-learn schema. The project's codebase is thoughtfully organized to support seamless object injection.

</div>

[Code Packages](https://github.com/matheus695p/titanic-dataset/tree/main/src/project/packages/README.md)
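
Object injection here means that a component receives the object it should use through configuration instead of hard-coding it. The `class`/`kwargs` dictionaries used later in this notebook suggest a loader along the following lines. This is a minimal sketch, not the package's actual code: `load_object` is a hypothetical helper name, and the standard-library `fractions.Fraction` stands in for a scikit-learn class so the example has no third-party dependency.

```python
from importlib import import_module
from typing import Any, Dict


def load_object(params: Dict[str, Any]) -> Any:
    """Instantiate the class named by a dotted path with the given kwargs.

    Hypothetical helper: expects {"class": "pkg.module.ClassName", "kwargs": {...}}.
    """
    module_path, _, class_name = params["class"].rpartition(".")
    cls = getattr(import_module(module_path), class_name)
    return cls(**params.get("kwargs", {}))


# Standard-library stand-in for a path such as "sklearn.preprocessing.MinMaxScaler".
injected = load_object(
    {"class": "fractions.Fraction", "kwargs": {"numerator": 3, "denominator": 4}}
)
print(injected)
```

Swapping the model or scaler then only requires changing the dotted path in the configuration, which is what makes this pattern convenient for trying several estimators.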


### **Command-Line Interface (CLI)**

<div class="alert alert-info">
<b></b>

> The project includes a command-line interface (CLI) that allows users to interact with the code in a streamlined manner. The CLI is built on and integrated with the Kedro pipeline framework, which separates data engineering and data science responsibilities. With Kedro, the project can be scaled in a modular way, simplifying the process of testing and evaluating numerous machine learning models. In this example, 9 different machine learning models are tuned, trained and evaluated using the package, scaled out through Kedro modular pipelines.

</div>

[Pipelines Structure and Code](https://github.com/matheus695p/titanic-dataset/blob/main/src/project/pipelines/README.md)


### **Testing and Code Coverage**

<div class="alert alert-info">
<b></b>

> The project undergoes testing, encompassing both unit and integration tests, resulting in a test coverage of over 90% in the package. Continuous Integration (CI) pipelines, orchestrated by GitHub Actions, are implemented to validate the package comprehensively. These pipelines incorporate integration tests to ensure the seamless functioning of the pipelines and perform code formatting checks to maintain code quality.

</div>

[Test code](https://github.com/matheus695p/titanic-dataset/blob/main/src/tests/README.md)




## Transformers and model imports


These imports are used to bring in specific classes or modules from different parts of the project's package structure. Here's a description of each import:

1. `RawDataProcessor` from `project.packages.preprocessing.transformers.raw`:
   - This import brings in the `RawDataProcessor` class, which is likely used for raw data preprocessing. It may handle tasks like data schema validation, data type validation, and initial data transformation.

2. `IntermediateDataProcessor` from `project.packages.preprocessing.transformers.intermediate`:
   - This import includes the `IntermediateDataProcessor` class, which is responsible for processing intermediate-level data. It might perform tasks like addressing outliers, data quality checks, and ensuring data consistency.

3. `PrimaryDataProcessor` from `project.packages.preprocessing.transformers.primary`:
   - The import statement fetches the `PrimaryDataProcessor` class, which is likely used for preprocessing primary-level data. It may involve tasks such as filling missing values in categorical columns and applying text normalization.

4. `FeatureDataProcessor` from `project.packages.preprocessing.transformers.feature`:
   - This import brings in the `FeatureDataProcessor` class, which is designed for feature engineering. It may create new features based on existing data, perform one-hot encoding, and prepare data for modeling.

5. `KMeansClusteringFeatures` from `project.packages.modelling.models.unsupervised.clustering_features`:
   - This import includes the `KMeansClusteringFeatures` class, which is likely used for unsupervised learning tasks. It may involve generating features related to K-Means clustering for data analysis.

6. `BinaryClassifierSklearnPipeline` from `project.packages.modelling.models.supervised.sklearn`:
   - The import statement fetches the `BinaryClassifierSklearnPipeline` class, which is likely used for supervised learning tasks with scikit-learn models. It may include building pipelines for binary classification tasks using scikit-learn algorithms.



In [5]:
from project.packages.preprocessing.transformers.raw import RawDataProcessor
from project.packages.preprocessing.transformers.intermediate import (
    IntermediateDataProcessor,
)
from project.packages.preprocessing.transformers.primary import PrimaryDataProcessor
from project.packages.preprocessing.transformers.feature import FeatureDataProcessor
from project.packages.modelling.models.unsupervised.clustering_features import (
    KMeansClusteringFeatures,
)
from project.packages.modelling.models.supervised.sklearn import (
    BinaryClassifierSklearnPipeline,
)

2024-01-07 20:32:50,948 - project.packages.modelling.reproducibility.set_seed - INFO - Seeding sklearn, numpy and random libraries with the seed 42


##  Titanic dataset


In [6]:
df = pd.read_csv("data/01_raw/titanic_train.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Data Engineering

### Raw data preprocessing

A more detailed description of the first import:

1. `RawDataProcessor` from `project.packages.preprocessing.transformers.raw`:

   - `RawDataProcessor` is a class or module located in the `project.packages.preprocessing.transformers.raw` package or module.
   - This class/module is likely designed for handling raw data preprocessing tasks in a machine learning or data science project.
   - Raw data preprocessing typically involves tasks performed on the initial, unprocessed data that has been collected or acquired.
   - Some common tasks that the `RawDataProcessor` class/module may handle include:
     - **Data Schema Validation**: Ensuring that the raw data adheres to a predefined schema or structure. It checks if the expected columns and data types are present.
     - **Data Type Validation**: Verifying that data types in the raw data match the expected types. This helps maintain consistency and prevent type-related errors.
     - **Initial Data Transformation**: Performing basic data transformations to prepare the raw data for further processing. This may include tasks like cleaning, filtering, or reformatting data.
   - The `RawDataProcessor` class/module is a component that aids in the initial stages of data preparation before more advanced processing or modeling steps occur.
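
The schema validation, dtype validation, and renaming steps described above can be sketched with plain pandas. This is an illustrative approximation under stated assumptions, not the `RawDataProcessor` implementation: `apply_schema` is a hypothetical function, and only the layout of the `schemas` dictionary (`dtype` and `name` per source column) is taken from this notebook.

```python
import pandas as pd


def apply_schema(df, schemas, index):
    """Validate columns, cast dtypes, rename, and set the index (illustrative only)."""
    missing = [col for col in schemas if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    # Keep only the columns declared in the schema.
    out = df[list(schemas)].copy()
    for col, spec in schemas.items():
        out[col] = out[col].astype(spec["dtype"])
    out = out.rename(columns={col: spec["name"] for col, spec in schemas.items()})
    return out.set_index(index)


toy = pd.DataFrame({"PassengerId": [1, 2], "Age": ["22", "38"], "Extra": [0, 0]})
toy_schemas = {
    "PassengerId": {"dtype": "int64", "name": "passenger_id"},
    "Age": {"dtype": "float64", "name": "passenger_age"},
}
clean = apply_schema(toy, toy_schemas, index="passenger_id")
print(clean.dtypes)
```

Failing fast on a missing column keeps schema problems at the very start of the pipeline, where they are cheapest to diagnose.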


In [8]:
raw_params = {
    "target": "Survived",
    "index": "passenger_id",
    "schemas": {
        "PassengerId": {"dtype": "int64", "name": "passenger_id"},
        "Survived": {"dtype": "int64", "name": "survived"},
        "Pclass": {"dtype": "int64", "name": "passenger_class"},
        "Name": {"dtype": "object", "name": "name"},
        "Sex": {"dtype": "object", "name": "passenger_sex"},
        "Age": {"dtype": "float64", "name": "passenger_age"},
        "Parch": {"dtype": "int64", "name": "passenger_parch"},
        "Ticket": {"dtype": "object", "name": "passenger_ticket"},
        "Fare": {"dtype": "float64", "name": "passenger_fare"},
        "Cabin": {"dtype": "object", "name": "passenger_cabin"},
        "Embarked": {"dtype": "object", "name": "passenger_embarked_port"},
        "SibSp": {"dtype": "int64", "name": "passenger_siblings"},
    },
}
raw_transformer = RawDataProcessor(raw_params)
df_raw = raw_transformer.fit_transform(df)
df_raw

passenger_id,survived,passenger_class,name,passenger_sex,passenger_age,passenger_siblings,passenger_parch,passenger_ticket,passenger_fare,passenger_cabin,passenger_embarked_port
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Intermediate data preprocessor

2. `IntermediateDataProcessor` from `project.packages.preprocessing.transformers.intermediate`:

   - The import statement brings in the `IntermediateDataProcessor` class, which is likely designed to handle intermediate-level data processing in a machine learning or data science project.
   - Intermediate-level data typically refers to data that has already undergone some initial preprocessing but may still require additional refinement and quality assurance before it is used in modeling or analysis.
   - The `IntermediateDataProcessor` class/module is responsible for several tasks related to intermediate data, including:
     - **Outlier Handling**: Identifying and addressing outliers or unusual data points that deviate significantly from the majority of the data. This is important for ensuring that outliers do not unduly influence analysis or modeling results.
     - **Data Quality Checks**: Verifying the quality of intermediate data by checking for missing values, data consistency, and adherence to predefined data quality standards.
     - **Data Consistency**: Ensuring that data across various features or columns is consistent and follows expected patterns or relationships.
   - The class/module helps in refining intermediate data and making it more suitable for subsequent modeling or analysis steps.
   - This import statement is likely used when intermediate-level data processing is required as part of the data preparation pipeline.
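
The `outlier_params` used with this processor (`iqr_alpha`, `q1_quantile`, `q3_quantile`) point to an interquartile-range rule. The sketch below shows one common form of that rule. It is an assumption that the package works this way: `mask_outliers` is a hypothetical function, and whether the real processor masks, clips, or drops outliers is not stated in this notebook.

```python
import numpy as np
import pandas as pd


def mask_outliers(series, iqr_alpha=2.5, q1_quantile=0.25, q3_quantile=0.75):
    """Replace values outside [q1 - alpha*IQR, q3 + alpha*IQR] with NaN."""
    q1 = series.quantile(q1_quantile)
    q3 = series.quantile(q3_quantile)
    iqr = q3 - q1
    lower, upper = q1 - iqr_alpha * iqr, q3 + iqr_alpha * iqr
    # Values inside the bounds are kept; everything else becomes NaN.
    return series.where(series.between(lower, upper), np.nan)


fares = pd.Series([7.25, 8.05, 13.0, 30.0, 53.1, 512.3])
masked = mask_outliers(fares)
print(masked)
```

A larger `iqr_alpha` widens the accepted band, so fewer observations are treated as outliers.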


In [9]:
intermediate_params = {
    "target": "survived",
    "outlier_params": {"iqr_alpha": 2.5, "q1_quantile": 0.25, "q3_quantile": 0.75},
    "drop_columns": ["name"],
    "categorical_features": [
        "passenger_sex",
        "passenger_ticket",
        "passenger_cabin",
        "passenger_embarked_port",
    ],
}
int_transformer = IntermediateDataProcessor(intermediate_params)
df_int = int_transformer.fit_transform(df_raw)
df_int

passenger_id,survived,passenger_class,passenger_sex,passenger_age,passenger_siblings,passenger_parch,passenger_ticket,passenger_fare,passenger_cabin,passenger_embarked_port
1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,female,35.0,1,0,113803,53.1000,C123,S
5,0,3,male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
887,0,2,male,27.0,0,0,211536,13.0000,,S
888,1,1,female,19.0,0,0,112053,30.0000,B42,S
889,0,3,female,,1,2,W./C. 6607,23.4500,,S
890,1,1,male,26.0,0,0,111369,30.0000,C148,C


### Primary data preprocessing

3. `PrimaryDataProcessor` from `project.packages.preprocessing.transformers.primary`:

   - This import statement imports the `PrimaryDataProcessor` class/module, which is typically used for preprocessing primary-level data in a machine learning or data science project.
   - Primary-level data here is the output of the intermediate step: it has already been validated and cleaned, but may still require specific transformations to make it suitable for modeling or analysis.
   - The `PrimaryDataProcessor` class/module is responsible for various preprocessing tasks related to primary data, which may include:
     - **Filling Missing Values**: Handling missing values in the primary data by imputing or replacing them with appropriate values. This is crucial for ensuring that missing data does not disrupt subsequent analysis or modeling.
     - **Text Normalization**: Applying text normalization techniques to textual data within the primary dataset. Text normalization can include tasks like removing accents, converting text to lowercase, or stemming words to reduce vocabulary variations.
     - **Categorical Data Handling**: Managing categorical variables by filling missing values, encoding them into numerical format if necessary, or applying other categorical data preprocessing techniques.
   - The class/module is responsible for preparing primary-level data in a way that makes it more suitable and reliable for downstream analysis or modeling tasks.
   - This import statement is likely used when specific preprocessing steps are required for primary data as part of the overall data preparation pipeline.
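
Filling missing categorical values and normalizing text can be sketched as follows. This is a simplified stand-in, not the `PrimaryDataProcessor` code: `fill_and_normalize` is a hypothetical function, and lowercasing plus whitespace stripping is an assumed form of normalization. Only the `categorical_columns_fillna` mapping of column name to placeholder is taken from this notebook.

```python
import pandas as pd


def fill_and_normalize(df, categorical_columns_fillna):
    """Fill missing categories with a placeholder, then strip and lowercase them."""
    out = df.copy()
    for col, placeholder in categorical_columns_fillna.items():
        out[col] = out[col].fillna(placeholder).astype(str).str.strip().str.lower()
    return out


toy = pd.DataFrame(
    {"passenger_cabin": ["C85", None], "passenger_embarked_port": ["S", "C "]}
)
filled = fill_and_normalize(
    toy, {"passenger_cabin": "unknown", "passenger_embarked_port": "unknown"}
)
print(filled)
```

Using an explicit `"unknown"` category keeps the missingness visible to later encoding steps instead of hiding it behind an imputed value.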


In [10]:
primary_params = {
    "target": "survived",
    "categorical_columns_fillna": {
        "passenger_cabin": "unknown",
        "passenger_embarked_port": "unknown",
    },
}
prm_transformer = PrimaryDataProcessor(primary_params)
df_prm = prm_transformer.fit_transform(df_int)
df_prm

passenger_id,survived,passenger_class,passenger_sex,passenger_age,passenger_siblings,passenger_parch,passenger_ticket,passenger_fare,passenger_cabin,passenger_embarked_port
1,0,3,male,22.0,1,0,A/5 21171,7.2500,unknown,s
2,1,1,female,38.0,1,0,PC 17599,71.2833,c85,c
3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,unknown,s
4,1,1,female,35.0,1,0,113803,53.1000,c123,s
5,0,3,male,35.0,0,0,373450,8.0500,unknown,s
...,...,...,...,...,...,...,...,...,...,...
887,0,2,male,27.0,0,0,211536,13.0000,unknown,s
888,1,1,female,19.0,0,0,112053,30.0000,b42,s
889,0,3,female,,1,2,W./C. 6607,23.4500,unknown,s
890,1,1,male,26.0,0,0,111369,30.0000,c148,c


### Feature engineering 


4. `FeatureDataProcessor` from `project.packages.preprocessing.transformers.feature`:

   - This import statement brings in the `FeatureDataProcessor` class/module, which typically plays a key role in feature engineering within a machine learning or data science project.
   - Feature engineering is a critical step in data preprocessing that involves creating new features or transforming existing ones to improve the performance of machine learning models or enhance the understanding of data patterns.
   - The `FeatureDataProcessor` class/module is responsible for various feature engineering tasks, which may include:
     - **Creating New Features**: Generating new features by combining or transforming existing data columns. These new features can capture valuable information or relationships within the data that may not be apparent initially.
     - **One-Hot Encoding**: Converting categorical variables into a numerical format through one-hot encoding. This is essential for including categorical data in machine learning models that require numerical input.
     - **Additional Data Transformation**: Carrying out additional data transformations or preprocessing steps specific to feature engineering requirements.
   - The class/module is instrumental in preparing the dataset with engineered features, making it more informative and suitable for training machine learning models.
   - This import statement is likely used when feature engineering steps are part of the data preparation pipeline, and new features need to be created or existing ones transformed for modeling purposes.
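
Two of the steps described above, deriving a new feature and one-hot encoding, can be sketched with pandas. This is only an approximation of what `FeatureDataProcessor` might do: the rule "first character of the cabin is the cabin level" is an assumption inferred from the column names in this notebook, and `pd.get_dummies` is a swapped-in technique, not necessarily the encoder the package uses.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "passenger_cabin": ["c85", "unknown", "b42"],
        "passenger_sex": ["female", "male", "female"],
    }
)

# Derived feature: the deck letter, or "unknown" when no cabin was recorded.
toy["passenger_cabin_level"] = toy["passenger_cabin"].where(
    toy["passenger_cabin"] == "unknown", toy["passenger_cabin"].str[0]
)

# One-hot encode the categorical columns as float indicator columns.
encoded = pd.get_dummies(toy[["passenger_cabin_level", "passenger_sex"]], dtype=float)

# Keep the original columns alongside the encoded ones.
features = pd.concat([toy, encoded], axis=1)
print(features.columns.tolist())
```

Each category becomes its own indicator column named `<column>_<category>`, which matches the naming pattern of the encoded columns shown in this notebook.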


In [11]:
feature_params = {
    "target": "survived",
    "encoding_transform": {
        "one_hot_encoder": [
            "passenger_cabin_level",
            "passenger_embarked_port",
            "passenger_sex",
        ],
        "similarity_based_encoder": None,
    },
}
feat_transformer = FeatureDataProcessor(feature_params)
df_feat = feat_transformer.fit_transform(df_prm)
df_feat

passenger_id,survived,passenger_class,passenger_sex,passenger_age,passenger_siblings,passenger_parch,passenger_ticket,passenger_fare,passenger_cabin,passenger_embarked_port,passenger_ticket_base,passenger_ticket_number,passenger_ticket_unknown_base,passenger_cabin_level,passenger_cabin_number,passenger_number_of_family_onboard,passenger_is_single,passenger_has_significant_other,passenger_has_childs,passenger_cabin_level_a,passenger_cabin_level_b,passenger_cabin_level_c,passenger_cabin_level_d,passenger_cabin_level_e,passenger_cabin_level_f,passenger_cabin_level_g,passenger_cabin_level_t,passenger_cabin_level_unknown,passenger_embarked_port_c,passenger_embarked_port_q,passenger_embarked_port_s,passenger_embarked_port_unknown,passenger_sex_female,passenger_sex_male
1,0,3,male,22.0,1,0,A/5 21171,7.2500,unknown,s,A/5,21171.0,0,unknown,,1,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,1,1,female,38.0,1,0,PC 17599,71.2833,c85,c,PC,17599.0,0,c,85.0,1,0,0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,unknown,s,STON/O2.,3101282.0,0,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1,1,female,35.0,1,0,113803,53.1000,c123,s,unknown,113803.0,1,c,123.0,1,0,0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
5,0,3,male,35.0,0,0,373450,8.0500,unknown,s,unknown,373450.0,1,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,male,27.0,0,0,211536,13.0000,unknown,s,unknown,211536.0,1,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
888,1,1,female,19.0,0,0,112053,30.0000,b42,s,unknown,112053.0,1,b,42.0,0,1,1,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
889,0,3,female,,1,2,W./C. 6607,23.4500,unknown,s,W./C.,6607.0,0,unknown,,3,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
890,1,1,male,26.0,0,0,111369,30.0000,c148,c,unknown,111369.0,1,c,148.0,0,1,1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


5. `KMeansClusteringFeatures` from `project.packages.modelling.models.unsupervised.clustering_features`:

   - The `KMeansClusteringFeatures` class/module is likely designed to support unsupervised learning tasks by providing features or utilities related to K-Means clustering:
     - **Feature Generation**: It may include methods to generate features that represent the cluster assignments or distances of data points to cluster centroids obtained through K-Means clustering. It creates new cluster features from the parameters provided by running the K-Means clustering algorithm. The number of clusters is determined by the elbow method.
     - **Feature Engineering**: The class/module may offer features that can be used as input features for downstream machine learning models or analysis tasks.
   - K-Means clustering is a popular clustering algorithm that partitions data into clusters based on similarity. The features and utilities provided by this class/module may aid in understanding the inherent patterns and structures within the data.


New features named __passenger_cabin_cluster_feature__, __passenger_embarked_port_cluster_feature__, __passenger_ticket_number_cluster_feature__, __passenger_family_cluster_feature__ and __passenger_social_status_cluster_feature__ will be created from these params: each key is the name of a new feature, and its value lists the input features used to build it.
The idea is to encode similar variables into a single, more descriptive feature for solving the problem.

**The cluster __id__ is assigned in order of increasing mean of the observations in each cluster, so the labels are monotonically increasing. This also lets us reduce the number of features, because these variables no longer need to be one-hot encoded.**


```python
cluster_feature_params = {
    "passenger_cabin_cluster_feature": [
        "passenger_cabin_level_a",
        "passenger_cabin_level_b",
        "passenger_cabin_level_c",
        "passenger_cabin_level_d",
        "passenger_cabin_level_e",
        "passenger_cabin_level_f",
        "passenger_cabin_level_g",
        "passenger_cabin_level_t",
        "passenger_cabin_level_unknown",
    ],
    "passenger_embarked_port_cluster_feature": [
        "passenger_embarked_port_c",
        "passenger_embarked_port_q",
        "passenger_embarked_port_s",
        "passenger_embarked_port_unknown",
    ],
    "passenger_ticket_number_cluster_feature": [
        "passenger_ticket_number",
        "passenger_ticket_unknown_base",
    ],
    "passenger_family_cluster_feature": [
        "passenger_siblings",
        "passenger_parch",
        "passenger_cabin_number",
        "passenger_number_of_family_onboard",
    ],
    "passenger_social_status_cluster_feature": [
        "passenger_class",
        "passenger_age",
        "passenger_sex_female",
    ],
}
```
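
The monotonic cluster id assignment described above can be sketched as a relabeling step applied after clustering. This is an assumption about the mechanism rather than the package's code: `relabel_by_mean` is a hypothetical function, and ordering clusters by the overall mean of their observations is one plausible reading of the description.

```python
import numpy as np


def relabel_by_mean(X, labels):
    """Renumber clusters so that id 0 has the lowest mean value, id 1 the next, and so on."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    # Mean over all rows and features of each cluster (an assumed ordering criterion).
    means = np.array([X[labels == cid].mean() for cid in cluster_ids])
    order = cluster_ids[np.argsort(means)]
    mapping = {old: new for new, old in enumerate(order)}
    return np.array([mapping[label] for label in labels])


X = np.array([[10.0], [11.0], [1.0], [2.0], [5.0]])
raw_labels = np.array([0, 0, 1, 1, 2])  # arbitrary ids, as a clustering step might return them
ordered = relabel_by_mean(X, raw_labels)
print(ordered)
```

With ordered ids, a downstream model can treat the cluster feature as ordinal, which is why one-hot encoding becomes optional.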




In [13]:
cluster_model_params = {
    "class": "project.packages.modelling.models.unsupervised.segmentation.KMeansElbowSelector",
    "kwargs": {"min_clusters": 1, "max_clusters": 15},
}
cluster_scaler_params = {
    "class": "project.packages.modelling.transformers.scaler.ColumnsPreserverScaler",
    "kwargs": {
        "scaler_params": {"class": "sklearn.preprocessing.MinMaxScaler", "kwargs": {}}
    },
}
cluster_imputer_params = {
    "class": "project.packages.modelling.models.unsupervised.imputer.ColumnsPreserverImputer",
    "kwargs": {
        "imputer_params": {
            "class": "sklearn.impute.KNNImputer",
            "kwargs": {"n_neighbors": 10, "weights": "distance"},
        }
    },
}

# cluster feature name and features used to create the cluster feature
cluster_feature_params = {
    "passenger_cabin_cluster_feature": [
        "passenger_cabin_level_a",
        "passenger_cabin_level_b",
        "passenger_cabin_level_c",
        "passenger_cabin_level_d",
        "passenger_cabin_level_e",
        "passenger_cabin_level_f",
        "passenger_cabin_level_g",
        "passenger_cabin_level_t",
        "passenger_cabin_level_unknown",
    ],
    "passenger_embarked_port_cluster_feature": [
        "passenger_embarked_port_c",
        "passenger_embarked_port_q",
        "passenger_embarked_port_s",
        "passenger_embarked_port_unknown",
    ],
    "passenger_ticket_number_cluster_feature": [
        "passenger_ticket_number",
        "passenger_ticket_unknown_base",
    ],
    "passenger_family_cluster_feature": [
        "passenger_siblings",
        "passenger_parch",
        "passenger_cabin_number",
        "passenger_number_of_family_onboard",
    ],
    "passenger_social_status_cluster_feature": [
        "passenger_class",
        "passenger_age",
        "passenger_sex_female",
    ],
}

cluster_transformer = KMeansClusteringFeatures(
    model_params=cluster_model_params,
    scaler_params=cluster_scaler_params,
    feature_params=cluster_feature_params,
    imputer_params=cluster_imputer_params,
)
data = cluster_transformer.fit_transform(df_feat)
data

2024-01-07 20:43:39,092 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 20:43:39,107 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 31, 'cluster_id_1': 0, 'cluster_id_2': 1}
2024-01-07 20:43:40,009 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 20:43:40,024 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 1, 'cluster_id_1': 0, 'cluster_id_2': 5}
2024-01-07 20:43:40,830 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 20:43:40,844 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 427, 'cluster_id_1': 594, 'cluster_id_2': 816}
2024-01-07 20:43:41,952 - pr

passenger_id,survived,passenger_class,passenger_sex,passenger_age,passenger_siblings,passenger_parch,passenger_ticket,passenger_fare,passenger_cabin,passenger_embarked_port,passenger_ticket_base,passenger_ticket_number,passenger_ticket_unknown_base,passenger_cabin_level,passenger_cabin_number,passenger_number_of_family_onboard,passenger_is_single,passenger_has_significant_other,passenger_has_childs,passenger_cabin_level_a,passenger_cabin_level_b,passenger_cabin_level_c,passenger_cabin_level_d,passenger_cabin_level_e,passenger_cabin_level_f,passenger_cabin_level_g,passenger_cabin_level_t,passenger_cabin_level_unknown,passenger_embarked_port_c,passenger_embarked_port_q,passenger_embarked_port_s,passenger_embarked_port_unknown,passenger_sex_female,passenger_sex_male,passenger_cabin_cluster_feature,passenger_embarked_port_cluster_feature,passenger_ticket_number_cluster_feature,passenger_family_cluster_feature,passenger_social_status_cluster_feature
1,0,3,male,22.0,1,0,A/5 21171,7.2500,unknown,s,A/5,21171.0,0,unknown,,1,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1,1,0,0,1
2,1,1,female,38.0,1,0,PC 17599,71.2833,c85,c,PC,17599.0,0,c,85.0,1,0,0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,2,0,0,0,2
3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,unknown,s,STON/O2.,3101282.0,0,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1,1,1,0,2
4,1,1,female,35.0,1,0,113803,53.1000,c123,s,unknown,113803.0,1,c,123.0,1,0,0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,2,1,2,0,2
5,0,3,male,35.0,0,0,373450,8.0500,unknown,s,unknown,373450.0,1,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1,1,2,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,male,27.0,0,0,211536,13.0000,unknown,s,unknown,211536.0,1,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1,1,2,0,0
888,1,1,female,19.0,0,0,112053,30.0000,b42,s,unknown,112053.0,1,b,42.0,0,1,1,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0,1,2,0,2
889,0,3,female,,1,2,W./C. 6607,23.4500,unknown,s,W./C.,6607.0,0,unknown,,3,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1,1,0,1,2
890,1,1,male,26.0,0,0,111369,30.0000,c148,c,unknown,111369.0,1,c,148.0,0,1,1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,2,0,2,0,0


### Data engineering in a single sklearn pipeline

All data engineering transformations can be encapsulated in a single sklearn pipeline.

The benefit of this approach is protection against data leakage: every step is fitted on the training data only, and the fitted steps are then reused to transform new data, so statistics from unseen data cannot leak in through ad hoc manipulations.
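
The leakage argument can be illustrated with a toy pipeline. This is a generic scikit-learn sketch on synthetic data, not the project's pipeline: `StandardScaler` stands in for the project's transformers, and the split sizes and random seed are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

pipe = Pipeline([("scaler", StandardScaler())])
# Statistics are learned from the training rows only.
pipe.fit(X_train)
# Test rows are transformed with those statistics and never fitted on.
X_test_scaled = pipe.transform(X_test)

learned_mean = pipe.named_steps["scaler"].mean_[0]
print(np.isclose(learned_mean, X_train.mean()))
```

The same separation applies to every fitted step in a longer pipeline, including imputers and clustering-based features: call `fit` on training data once, then call `transform` on anything else.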



In [16]:
from sklearn.pipeline import Pipeline

data_eng_pipeline = Pipeline(
    [
        ("raw_transformations", RawDataProcessor(raw_params)),
        (
            "intermediate_transformations",
            IntermediateDataProcessor(intermediate_params),
        ),
        ("primary_transformations", PrimaryDataProcessor(primary_params)),
        ("feature_transformations", FeatureDataProcessor(feature_params)),
        (
            "cluster_feature_transformations",
            KMeansClusteringFeatures(
                model_params=cluster_model_params,
                scaler_params=cluster_scaler_params,
                feature_params=cluster_feature_params,
                imputer_params=cluster_imputer_params,
            ),
        ),
    ],
)
data_eng_pipeline

In [17]:
data = data_eng_pipeline.fit_transform(df)
data

2024-01-07 20:51:18,953 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 20:51:18,969 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 31, 'cluster_id_1': 0, 'cluster_id_2': 1}
2024-01-07 20:51:19,847 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 20:51:19,864 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 1, 'cluster_id_1': 0, 'cluster_id_2': 5}
2024-01-07 20:51:20,687 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 20:51:20,702 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 427, 'cluster_id_1': 594, 'cluster_id_2': 816}
2024-01-07 20:51:21,747 - pr

Unnamed: 0_level_0,survived,passenger_class,passenger_sex,passenger_age,passenger_siblings,passenger_parch,passenger_ticket,passenger_fare,passenger_cabin,passenger_embarked_port,passenger_ticket_base,passenger_ticket_number,passenger_ticket_unknown_base,passenger_cabin_level,passenger_cabin_number,passenger_number_of_family_onboard,passenger_is_single,passenger_has_significant_other,passenger_has_childs,passenger_cabin_level_a,passenger_cabin_level_b,passenger_cabin_level_c,passenger_cabin_level_d,passenger_cabin_level_e,passenger_cabin_level_f,passenger_cabin_level_g,passenger_cabin_level_t,passenger_cabin_level_unknown,passenger_embarked_port_c,passenger_embarked_port_q,passenger_embarked_port_s,passenger_embarked_port_unknown,passenger_sex_female,passenger_sex_male,passenger_cabin_cluster_feature,passenger_embarked_port_cluster_feature,passenger_ticket_number_cluster_feature,passenger_family_cluster_feature,passenger_social_status_cluster_feature
passenger_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1
1,0,3,male,22.0,1,0,A/5 21171,7.2500,unknown,s,A/5,21171.0,0,unknown,,1,0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1,1,0,0,1
2,1,1,female,38.0,1,0,PC 17599,71.2833,c85,c,PC,17599.0,0,c,85.0,1,0,0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,2,0,0,0,2
3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,unknown,s,STON/O2.,3101282.0,0,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1,1,1,0,2
4,1,1,female,35.0,1,0,113803,53.1000,c123,s,unknown,113803.0,1,c,123.0,1,0,0,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,2,1,2,0,2
5,0,3,male,35.0,0,0,373450,8.0500,unknown,s,unknown,373450.0,1,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1,1,2,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,male,27.0,0,0,211536,13.0000,unknown,s,unknown,211536.0,1,unknown,,0,1,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1,1,2,0,0
888,1,1,female,19.0,0,0,112053,30.0000,b42,s,unknown,112053.0,1,b,42.0,0,1,1,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0,1,2,0,2
889,0,3,female,,1,2,W./C. 6607,23.4500,unknown,s,W./C.,6607.0,0,unknown,,3,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1,1,0,1,2
890,1,1,male,26.0,0,0,111369,30.0000,c148,c,unknown,111369.0,1,c,148.0,0,1,1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,2,0,2,0,0


### Model hypertune and train


For hypertuning and fitting we use the `BinaryClassifierSklearnPipeline` model wrapper from `project.packages.modelling.models.supervised.sklearn`, which is designed for supervised binary classification tasks.

This class allows end-to-end hypertuning, training and evaluation of a model.



#### __Parameters used__

The model parameters, listed below, follow an object-injection structure: whenever an object needs to be passed, it is described as

```python

object_params = {
    "class": "path.to.the.module.ClassName",
    "kwargs": {
        "arg1": ...,  # first keyword argument
        # ...
        "argn": ...,  # last keyword argument
    },
}
```
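A minimal sketch of how such a dictionary could be turned into an object is shown here. The helper name `load_object` is hypothetical and is not necessarily how the package resolves its parameters; a standard-library class is used so the example runs on its own.

```python
import importlib
from typing import Any


def load_object(object_params: dict) -> Any:
    """Instantiate the class named in "class" with the given "kwargs"."""
    module_path, class_name = object_params["class"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**object_params.get("kwargs", {}))


# Standard-library class used purely for illustration.
counter = load_object(
    {"class": "collections.Counter", "kwargs": {"a": 2, "b": 1}}
)
```

The same call would work for any importable class, which is what lets the model class be changed from configuration alone.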

This makes it possible to write code without hardcoding imports of models and objects.


1. `scoring_metrics`: A list of scoring metrics used for evaluating the model's performance.

2. `optuna`: Configuration for the Optuna hyperparameter optimization framework.
   - `kwargs_study`: Arguments for creating an Optuna study.
   - `kwargs_optimize`: Arguments for hyperparameter optimization.
   - `sampler`: Configuration for the Optuna sampler.
   - `pruner`: Configuration for the Optuna pruner.

3. `cv_strategy`: Cross-validation strategy configuration.
   - `class`: The class for the cross-validation strategy.
   - `kwargs`: Additional keyword arguments for the cross-validation strategy.

4. `cv_score`: Configuration for cross-validation scoring.
   - `scoring`: The scoring metric to use during cross-validation.
   - `class`: The class for performing cross-validation.
   - `kwargs`: Additional keyword arguments for the cross-validation process.

5. `target`: The name of the target variable in the dataset.

6. `features`: A list of feature names used for modeling.



7. `pipeline`: Configuration for the machine learning pipeline, which includes sub-configurations for the `imputer`, `scaler`, `feature_selector` and `model` steps.

   Passing the entire pipeline as a unified entity enables a holistic approach to hyperparameter tuning throughout the entire modeling process. This includes optimizing various stages such as data imputation, feature scaling, feature selection, and model parameters collectively. This approach offers several advantages over optimizing these components separately.

   1. **Holistic Optimization**: By optimizing the entire pipeline together, you ensure that the various stages of your modeling process work seamlessly in concert. This can lead to a more coherent and optimized end-to-end solution.

   2. **Consideration of Interactions**: When hyperparameter tuning is done in isolation for each component, it may not account for interactions and dependencies between these components. Optimizing them as a whole allows for the exploration of parameter combinations that yield synergistic effects, such as between the imputer settings and a model hyperparameter.

   3. **Efficiency**: Hyperparameter tuning is an iterative and resource-intensive process. Tuning the entire pipeline in one step can be more computationally efficient than repeatedly tuning individual components separately.

   4. **Reduced Risk of Overfitting**: Optimizing components separately might lead to overfitting, as each component may be tuned to perform exceptionally well on its own, but not necessarily in combination. Tuning the entire pipeline can mitigate this risk by finding a balanced configuration that works well as a whole.

   5. **Consistency**: Hyperparameter tuning for the entire pipeline ensures consistency in parameter choices across different modeling runs. This consistency can make it easier to compare model performance and reproduce results.

   6. **Simplified Workflow**: Managing and documenting hyperparameters for individual components can be complex and error-prone. Tuning the whole pipeline simplifies the workflow by consolidating hyperparameters into a single configuration.

   7. **Domain Knowledge Integration**: In some cases, domain knowledge dictates certain relationships between preprocessing steps and model parameters. Tuning the pipeline as a whole allows for the inclusion of such domain-specific insights, for example when defining the exploration space of each hyperparameter.




In [50]:
model_params = {
    "scoring_metrics": [
        "accuracy",
        "balanced_accuracy",
        "f1",
        "f1_micro",
        "f1_macro",
        "f1_weighted",
        "precision",
        "precision_micro",
        "precision_macro",
        "precision_weighted",
        "recall",
        "recall_micro",
        "recall_macro",
        "recall_weighted",
        "roc_auc",
        "roc_auc_ovr",
        "roc_auc_ovo",
        "roc_auc_ovr_weighted",
        "roc_auc_ovo_weighted",
    ],
    "optuna": {
        "kwargs_study": {
            "direction": "maximize",
            "study_name": "xgboost",
            "load_if_exists": False,
        },
        "kwargs_optimize": {"n_trials": 500},
        "sampler": {
            "class": "optuna.samplers.TPESampler",
            "kwargs": {"n_startup_trials": 0, "constant_liar": True, "seed": 42},
        },
        "pruner": {"class": "optuna.pruners.SuccessiveHalvingPruner", "kwargs": {}},
    },
    "cv_strategy": {
        "class": "sklearn.model_selection.StratifiedKFold",
        "kwargs": {"n_splits": 5, "random_state": 42, "shuffle": True},
    },
    "cv_score": {
        "scoring": "f1_weighted",
        "class": "sklearn.model_selection.cross_val_predict",
        "kwargs": {
            "estimator": None,
            "X": None,
            "y": None,
            "cv": None,
            "n_jobs": -1,
            "method": "predict",
        },
    },
    "target": "survived",
    "features": [
        "passenger_class",
        "passenger_age",
        "passenger_siblings",
        "passenger_parch",
        "passenger_fare",
        "passenger_ticket_number",
        "passenger_ticket_unknown_base",
        "passenger_cabin_number",
        "passenger_number_of_family_onboard",
        "passenger_is_single",
        "passenger_has_childs",
        "passenger_cabin_level_a",
        "passenger_cabin_level_b",
        "passenger_cabin_level_c",
        "passenger_cabin_level_d",
        "passenger_cabin_level_e",
        "passenger_cabin_level_unknown",
        "passenger_embarked_port_c",
        "passenger_embarked_port_q",
        "passenger_embarked_port_s",
        "passenger_sex_female",
        "passenger_cabin_cluster_feature",
        "passenger_embarked_port_cluster_feature",
        "passenger_ticket_number_cluster_feature",
        "passenger_family_cluster_feature",
        "passenger_social_status_cluster_feature",
    ],
    "pipeline": {
        "imputer": {
            "class": "project.packages.modelling.models.unsupervised.imputer.ColumnsPreserverImputer",
            "kwargs": {
                "imputer_params": {
                    "class": "sklearn.impute.KNNImputer",
                    "kwargs": {
                        "n_neighbors": 'trial.suggest_int("knn_imputer__n_neighbors", 2, 20, step=1)',
                        "weights": 'trial.suggest_categorical("knn_imputer__weights", ["distance", "uniform"])',
                    },
                }
            },
        },
        "scaler": {
            "class": "project.packages.modelling.transformers.scaler.ColumnsPreserverScaler",
            "kwargs": {
                "scaler_params": {
                    "class": 'trial.suggest_categorical("scaler__transformer", ["project.packages.modelling.transformers.scaler.NotScalerTransformer", "sklearn.preprocessing.PowerTransformer", "sklearn.preprocessing.QuantileTransformer"])',
                    "kwargs": {},
                }
            },
        },
        "feature_selector": {
            "class": "project.packages.modelling.feature_selection.feature_selector_pipeline.FeatureSelector",
            "kwargs": {
                "fs_params": {
                    "selectors": ["model_based"],
                    "model_based": {
                        "bypass_features": ["passenger_sex_female"],
                        "estimator": {
                            "class": "xgboost.XGBClassifier",
                            "kwargs": {
                                "n_estimators": 'trial.suggest_int("fs_mb_xgboost__n_estimators", 10, 500, step=10)',
                                "max_depth": 'trial.suggest_int("fs_mb_xgboost__max_depth", 2, 10)',
                                "random_state": 42,
                            },
                        },
                        "threshold": 'trial.suggest_float("fs_mb__threshold", 0.001, 0.1)',
                        "prefit": False,
                    },
                }
            },
        },
        "model": {
            "class": "xgboost.XGBClassifier",
            "kwargs": {
                "n_estimators": 'trial.suggest_int("xgboost__n_estimators", 10, 500, step=5)',
                "learning_rate": 'trial.suggest_float("xgboost__learning_rate", 0.0001, 1)',
                "min_child_weight": 'trial.suggest_int("xgboost__min_child_weight", 0, 500, step=1)',
                "max_depth": 'trial.suggest_int("xgboost__max_depth", 1, 8)',
                "subsample": 'trial.suggest_float("xgboost__subsample", 0.5, 1)',
                "reg_lambda": 'trial.suggest_float("xgboost__reg_lambda", 0, 5)',
                "reg_alpha": 'trial.suggest_float("xgboost__reg_alpha", 0, 1)',
                "random_state": 42,
            },
        },
    },
}

target = "survived"
model = BinaryClassifierSklearnPipeline(model_params)
model
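The `'trial.suggest_*(...)'` strings in the parameters above have to be turned into sampled values for each Optuna trial. How the package does this is not shown in this notebook; the sketch below is one possible approach under stated assumptions. `resolve_trial_strings` and `FakeTrial` are hypothetical names, and `FakeTrial` only mimics the call shape of the real `suggest_int`, `suggest_float` and `suggest_categorical` methods so that the example runs without Optuna.

```python
from typing import Any


class FakeTrial:
    """Stand-in for an Optuna trial: returns the lower bound or first choice."""

    def suggest_int(self, name, low, high, step=1):
        return low

    def suggest_float(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


def resolve_trial_strings(params: Any, trial: Any) -> Any:
    """Recursively replace 'trial.suggest_*(...)' strings with sampled values."""
    if isinstance(params, dict):
        return {k: resolve_trial_strings(v, trial) for k, v in params.items()}
    if isinstance(params, list):
        return [resolve_trial_strings(v, trial) for v in params]
    if isinstance(params, str) and params.startswith("trial.suggest_"):
        # eval is tolerable only because the strings come from trusted config.
        return eval(params, {"__builtins__": {}}, {"trial": trial})
    return params


imputer_kwargs = {
    "n_neighbors": 'trial.suggest_int("knn_imputer__n_neighbors", 2, 20, step=1)',
    "weights": 'trial.suggest_categorical("knn_imputer__weights", ["distance", "uniform"])',
}
resolved = resolve_trial_strings(imputer_kwargs, FakeTrial())
```

Plain values such as `"random_state": 42` pass through untouched, so fixed and tunable arguments can live in the same dictionary.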

#### Model fit:

The `fit` method of the model wrapper combines hyperparameter tuning and model fitting in a single call. It performs the following steps:

1. `seed_file()`: Sets the random seed for reproducibility. According to the log it emits, it seeds the sklearn, numpy and random libraries (with seed 42 in this run).

2. `self.hypertune_results = self.hypertune_cross_validated_model(...)`: This line performs hyperparameter tuning using Optuna and cross-validation. It stores the results of the hyperparameter tuning process, including the best trial parameters, in the `self.hypertune_results` attribute.

3. `self.best_params = self.hypertune_results["best_trial_params"]`: It extracts the best trial parameters from the hyperparameter tuning results and stores them in the `self.best_params` attribute.

4. `self.model = self.build_model_pipeline(self.best_params)`: This line builds a machine learning model pipeline based on the best trial parameters obtained from hyperparameter tuning, using the class's `build_model_pipeline` method.

5. `self.model = self.model.fit(X, y)`: It fits the machine learning model pipeline to the input features `X` and target variable `y`. This step trains the model using the best hyperparameters.

6. `self.is_fitted = True`: It sets the `is_fitted` attribute to `True`, indicating that the model has been fitted.

7. `self.X_train = X` and `self.y_train = y`: It stores the input features and target variable used for training in the `self.X_train` and `self.y_train` attributes, respectively.

8. Finally, the method returns the instance of the class (`self`) after fitting the model, allowing for method chaining or further use of the fitted model.

This `fit` method encapsulates the process of hyperparameter tuning, model creation, and model training within a single method call.
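The order of the steps above can be sketched with a toy wrapper. This is not the package's implementation: `TinyTunedClassifier` and `ThresholdModel` are hypothetical, the "search" is a plain loop over candidates instead of Optuna with cross-validation, and the data is made up.

```python
import random


class ThresholdModel:
    """Toy model: predicts 1 when the single feature reaches a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn once the threshold is fixed

    def predict(self, X):
        return [int(x >= self.threshold) for x in X]


class TinyTunedClassifier:
    """Hypothetical wrapper mirroring the order of the steps described above."""

    def __init__(self, candidate_thresholds, seed=42):
        self.candidate_thresholds = candidate_thresholds
        self.seed = seed
        self.is_fitted = False

    def hypertune(self, X, y):
        # Stand-in for the Optuna search: score every candidate on the data.
        def accuracy(threshold):
            predictions = ThresholdModel(threshold).predict(X)
            return sum(p == t for p, t in zip(predictions, y)) / len(y)

        best = max(self.candidate_thresholds, key=accuracy)
        return {"best_trial_params": {"threshold": best}}

    def build_model(self, params):
        return ThresholdModel(**params)

    def fit(self, X, y):
        random.seed(self.seed)  # 1. seed for reproducibility
        self.hypertune_results = self.hypertune(X, y)  # 2. tune
        self.best_params = self.hypertune_results["best_trial_params"]  # 3. keep best
        self.model = self.build_model(self.best_params)  # 4. build
        self.model = self.model.fit(X, y)  # 5. train
        self.is_fitted = True  # 6. mark as fitted
        self.X_train, self.y_train = X, y  # 7. keep training data
        return self  # 8. allow method chaining

    def predict(self, X):
        return self.model.predict(X)


clf = TinyTunedClassifier([10, 30, 50]).fit([5, 20, 40, 60], [0, 0, 1, 1])
predictions = clf.predict([15, 45])
```

Returning `self` from `fit` is what allows construction, fitting and prediction to be chained.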



In [21]:
# Training features and target
y_train = data[[target]]
X_train = data[[col for col in data.columns if col != target]]

# Fit model
model.fit(X_train, y_train)

2024-01-07 21:02:20,664 - project.packages.modelling.reproducibility.set_seed - INFO - Seeding sklearn, numpy and random libraries with the seed 42


[I 2024-01-07 21:02:20,673] A new study created in memory with name: xgboost
[I 2024-01-07 21:02:23,322] Trial 0 finished with value: 0.46982323232323225 and parameters: {'knn_imputer__n_neighbors': 10, 'knn_imputer__weights': 'distance', 'scaler__transformer': 'sklearn.preprocessing.QuantileTransformer', 'fs_mb_xgboost__n_estimators': 30, 'fs_mb_xgboost__max_depth': 4, 'fs_mb__threshold': 0.08718230210027886, 'xgboost__n_estimators': 20, 'xgboost__learning_rate': 0.3925136468134222, 'xgboost__min_child_weight': 450, 'xgboost__max_depth': 1, 'xgboost__subsample': 0.8391268998756262, 'xgboost__reg_lambda': 2.749217162380866, 'xgboost__reg_alpha': 0.17668140036133317}. Best is trial 0 with value: 0.46982323232323225.
[I 2024-01-07 21:02:25,285] Trial 1 finished with value: 0.46982323232323225 and parameters: {'knn_imputer__n_neighbors': 10, 'knn_imputer__weights': 'distance', 'scaler__transformer': 'sklearn.preprocessing.QuantileTransformer', 'fs_mb_xgboost__n_estimators': 10, 'fs_mb_xgb

2024-01-07 21:05:28,844 - project.packages.modelling.models.supervised.sklearn - INFO - final estimator: Pipeline(steps=[('columns_selector',
                 ColumnsSelector(columns=['passenger_class', 'passenger_age',
                                          'passenger_siblings',
                                          'passenger_parch', 'passenger_fare',
                                          'passenger_ticket_number',
                                          'passenger_ticket_unknown_base',
                                          'passenger_cabin_number',
                                          'passenger_number_of_family_onboard',
                                          'passenger_is_single',
                                          'passenger_has_childs',
                                          'passenger_cabin_level_a',
                                          'passeng...
                               feature_types=None, gamma=None, grow_policy=None,
     

#### Selected model pipeline

In [22]:
model.model

#### Model hypertuning results

In [29]:
model.hypertune_results


{
    'study': <optuna.study.study.Study object at 0x286db33a0>,
    'best_trial_params': {
        'scoring_metrics': [
            'accuracy',
            'balanced_accuracy',
            'f1',
            'f1_micro',
            'f1_macro',
            'f1_weighted',
            'precision',
            'precision_micro',
            'precision_macro',
            'precision_weighted',
            'recall',
            'recall_micro',
            'recall_macro',
            'recall_weighted',
            'roc_auc',
            'roc_auc_ovr',
            'roc_auc_ovo',
            'roc_auc_ovr_weighted',
            'roc_auc_ovo_weighted'
        ],
        'optuna': {
            'kwargs_

##### Cross validation results

This shows the cross-validation results of the best model for the most important binary classification metrics. The results are analyzed later on.
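As a sanity check, the metric values printed by the next cell can be reproduced from a single confusion matrix. The counts below are not printed anywhere in this notebook: they are back-solved under the assumption that the training set has 891 rows with 342 survivors and 549 non-survivors, so treat them as an inferred illustration.

```python
# Back-solved counts (assumption, see the text above): positives are survivors.
tp, fp, fn, tn = 266, 57, 76, 492

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2

# Per-class F1, then the macro and support-weighted averages.
f1_positive = 2 * tp / (2 * tp + fp + fn)
f1_negative = 2 * tn / (2 * tn + fp + fn)
f1_macro = (f1_positive + f1_negative) / 2
f1_weighted = ((tp + fn) * f1_positive + (tn + fp) * f1_negative) / total

for name, value in [
    ("accuracy", accuracy),
    ("balanced_accuracy", balanced_accuracy),
    ("f1", f1_positive),
    ("f1_macro", f1_macro),
    ("f1_weighted", f1_weighted),
    ("precision", precision),
]:
    print(f"{name}: {value}")
```

For a single-label problem, micro-averaged F1 and precision equal accuracy, which is why `f1_micro` and `precision_micro` repeat the `accuracy` value.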




In [28]:
model.hypertune_results["cross_validation_metrics"]


{
    'accuracy': {'value': 0.8507295173961841, 'step': 1},
    'balanced_accuracy': {'value': 0.836976320582878, 'step': 1},
    'f1': {'value': 0.7999999999999999, 'step': 1},
    'f1_micro': {'value': 0.8507295173961841, 'step': 1},
    'f1_macro': {'value': 0.8404655326768128, 'step': 1},
    'f1_weighted': {'value': 0.8498666160259712, 'step': 1},
    'precision': {'value': 0.8235294117647058, 'step': 1},
    'precision_micro': {'value': 0.8507295173961841, 'step': 1

##### Best trial parameters

In [30]:
model.hypertune_results["best_trial_params"]


{
    'scoring_metrics': [
        'accuracy',
        'balanced_accuracy',
        'f1',
        'f1_micro',
        'f1_macro',
        'f1_weighted',
        'precision',
        'precision_micro',
        'precision_macro',
        'precision_weighted',
        'recall',
        'recall_micro',
        'recall_macro',
        'recall_weighted',
        'roc_auc',
        'roc_auc_ovr',
        'roc_auc_ovo',
        'roc_auc_ovr_weighted',
        'roc_auc_ovo_weighted'
    ],
    'optuna': {
        'kwargs_study': {'direction': 'maximize', 'study_name': 'xgboost', 'load_if_exists': False},
        'kwargs_optimize': {'n_trials': 500},

##### Hypertuning study results

1. Optimization history
2. Most important hyperparameters during the optimization
3. Parallel coordinate plot to visualize which hyperparameter combinations lead to good objective values
4. Slice plots to visualize each hyperparameter against the objective value


In [35]:
study = model.hypertune_results["study"]
study

<optuna.study.study.Study object at 0x286db33a0>

In [36]:
study.best_value

0.8498666160259712

In [42]:
import optuna
import optuna.visualization as optuna_visualization

fig = optuna_visualization.plot_optimization_history(study)
fig.show()

In [38]:
# Plot the parameter importance
fig = optuna_visualization.plot_param_importances(study)
fig.show()

In [39]:
fig = optuna_visualization.plot_parallel_coordinate(study)
fig.show()

In [40]:
fig = optuna_visualization.plot_slice(study)
fig.show()

In [43]:
# Rank the hyperparameters by importance, most important first
param_importance = pd.DataFrame.from_dict(
    optuna.importance.get_param_importances(study),
    orient="index",
    columns=["param_importance"],
).sort_values(by="param_importance", ascending=False)

# Slice plots restricted to the four most important hyperparameters
fig = optuna.visualization.plot_slice(
    study,
    params=list(param_importance.index[:4]),
)
fig.show()

Conclusions are in another README, but we can clearly see the dependency on the search space and on the `min_child_weight` parameter of XGBoost: model performance is very sensitive to this parameter.
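A short back-of-the-envelope calculation suggests why `min_child_weight` matters so much here. In XGBoost this parameter is the minimum sum of instance hessians required in a child node, and for the logistic loss each hessian is p(1 - p), which is at most 0.25. The sketch assumes the default binary logistic objective, unit sample weights, and a training set of 891 rows split 549/342 between the classes; none of these is stated in this section.

```python
n_negative, n_positive = 549, 342  # assumed class counts of the training set
n_total = n_negative + n_positive

# Weighted F1 of a model that never splits and predicts class 0 for everyone.
f1_negative = 2 * n_negative / (2 * n_negative + n_positive)
f1_positive = 0.0  # no true positives at all
f1_weighted_constant = (n_negative * f1_negative + n_positive * f1_positive) / n_total

# Upper bound on the total hessian available in one 5-fold training split.
rows_per_training_fold = n_total * 4 / 5
max_hessian_sum = 0.25 * rows_per_training_fold

print(f1_weighted_constant, max_hessian_sum)
```

Under these assumptions a child node can never collect more than about 178 units of hessian, so much of the 0 to 500 search range rules out any split. The constant-classifier score also coincides with the value logged for trial 0 (0.46982323232323225), which sampled `xgboost__min_child_weight` = 450.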

#### Model inference
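For a binary classifier, hard labels usually correspond to the class with the highest predicted probability. The sketch below applies that rule to the first three probability rows shown in the `predict_proba` output of this section; that the wrapper's `predict` follows exactly this rule is an assumption, although the first three predicted labels in the output agree with it.

```python
# First three rows of the predict_proba output: [P(class 0), P(class 1)]
probabilities = [
    [0.98991597, 0.01008401],
    [0.02167255, 0.97832745],
    [0.2049759, 0.7950241],
]

# Pick the index of the largest probability in each row.
labels = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
```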

In [45]:
y_probs = model.predict_proba(data)
y_probs


array([[0.98991597, 0.01008401],
       [0.02167255, 0.97832745],
       [0.2049759 , 0.7950241 ],
       ...,
       [0.98532516, 0.01467485],
       [0.08064389, 0.9193561 ],
       [0.9418847 , 0.05811528]], dtype=float32)

In [46]:
model.predict(data)


array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1,

## The whole ML process in a single sklearn Pipeline


In [49]:
pipeline = Pipeline(
    [
        ("raw_transformations", RawDataProcessor(raw_params)),
        (
            "intermediate_transformations",
            IntermediateDataProcessor(intermediate_params),
        ),
        ("primary_transformations", PrimaryDataProcessor(primary_params)),
        ("feature_transformations", FeatureDataProcessor(feature_params)),
        (
            "cluster_feature_transformations",
            KMeansClusteringFeatures(
                model_params=cluster_model_params,
                scaler_params=cluster_scaler_params,
                feature_params=cluster_feature_params,
                imputer_params=cluster_imputer_params,
            ),
        ),
        ("model", BinaryClassifierSklearnPipeline(model_params)),
    ],
)
pipeline

In [51]:
pipeline.fit(df, y_train)

2024-01-07 21:21:46,656 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 21:21:46,670 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 31, 'cluster_id_1': 0, 'cluster_id_2': 1}
2024-01-07 21:21:47,586 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 21:21:47,603 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 1, 'cluster_id_1': 0, 'cluster_id_2': 5}
2024-01-07 21:21:48,433 - project.packages.modelling.models.unsupervised.segmentation - INFO - Optimal number of clusters: 3
2024-01-07 21:21:48,448 - project.packages.modelling.models.unsupervised.segmentation - INFO - Centroids dictionary -> {'cluster_id_0': 427, 'cluster_id_1': 594, 'cluster_id_2': 816}
2024-01-07 21:21:49,564 - pr

[I 2024-01-07 21:21:51,805] A new study created in memory with name: xgboost
[I 2024-01-07 21:22:00,883] Trial 0 finished with value: 0.46982323232323225 and parameters: {'knn_imputer__n_neighbors': 10, 'knn_imputer__weights': 'distance', 'scaler__transformer': 'sklearn.preprocessing.QuantileTransformer', 'fs_mb_xgboost__n_estimators': 30, 'fs_mb_xgboost__max_depth': 4, 'fs_mb__threshold': 0.08718230210027886, 'xgboost__n_estimators': 20, 'xgboost__learning_rate': 0.3925136468134222, 'xgboost__min_child_weight': 450, 'xgboost__max_depth': 1, 'xgboost__subsample': 0.8391268998756262, 'xgboost__reg_lambda': 2.749217162380866, 'xgboost__reg_alpha': 0.17668140036133317}. Best is trial 0 with value: 0.46982323232323225.
[I 2024-01-07 21:22:02,208] Trial 1 finished with value: 0.46982323232323225 and parameters: {'knn_imputer__n_neighbors': 10, 'knn_imputer__weights': 'distance', 'scaler__transformer': 'sklearn.preprocessing.QuantileTransformer', 'fs_mb_xgboost__n_estimators': 10, 'fs_mb_xgb

2024-01-07 21:25:10,776 - project.packages.modelling.models.supervised.sklearn - INFO - final estimator: Pipeline(steps=[('columns_selector',
                 ColumnsSelector(columns=['passenger_class', 'passenger_age',
                                          'passenger_siblings',
                                          'passenger_parch', 'passenger_fare',
                                          'passenger_ticket_number',
                                          'passenger_ticket_unknown_base',
                                          'passenger_cabin_number',
                                          'passenger_number_of_family_onboard',
                                          'passenger_is_single',
                                          'passenger_has_childs',
                                          'passenger_cabin_level_a',
                                          'passeng...
                               feature_types=None, gamma=None, grow_policy=None,
     

In [52]:
df_test = pd.read_csv("data/01_raw/titanic_test.csv")

In [53]:
pipeline.predict(df_test)


array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1,

In [55]:
pipeline.predict_proba(df_test)


array([[0.9585429 , 0.04145713],
       [0.98495734, 0.01504268],
       [0.91867673, 0.08132327],
       [0.6080147 , 0.39198533],
       [0.17216593, 0.82783407],
       [0.9740399 , 0.02596007],
       [0.72623605, 0.27376395],
       [0.97367996, 0.02632004],
       [0.00956863, 0.99043137],
       [0.9916237 , 0.00837629],
       [0.9882268 , 0.01177321],
       [0.86152005, 0.13847995],
       [0.00699151, 0.9930085 ],
       [0.9706892 , 0.0

# Package CLI


The Package CLI simplifies complex machine learning workflows by breaking them down into modular pipelines. It seamlessly integrates with Kedro, a data engineering framework, to provide a structured approach to ML project development.

## 2. Package Structure

The package is organized as follows:

### Pipelines

- Pipelines are the backbone of the Package CLI. Each step in the ML workflow is broken down into separate pipelines. This modular approach improves code readability and maintainability.

### Logging and Tracking

- The Package CLI utilizes MLflow for logging and tracking experiments. This integration enables comprehensive monitoring of your ML projects, including model performance, data lineage, and hyperparameter optimization.

## 3. Logging and Tracking

MLflow is at the core of our logging and tracking system:

- **Data**: Input data, transformed data, and data splits are logged to ensure complete traceability.
- **Model Artifacts**: Serialized model files are logged, simplifying model replication and deployment.
- **Metrics**: Key performance metrics, such as accuracy, precision, recall, and F1-score, are tracked.
- **Hyperparameters**: Detailed information about the hyperparameters used during training is recorded.
- **Experiment Parameters**: Parameters set for each experiment run are logged for easy reproducibility.
- **Model reporting**: All HTML reports (performance reports, hyperparameter studies, model predictive control exploration, and global model optimization reports) are logged as artifacts.

## 4. Custom Pipelines

The Package CLI provides a collection of custom Kedro pipelines, making it easy to integrate into your ML projects. These pipelines cover fundamental steps in the ML workflow, including:

- Data preprocessing and feature engineering.
- Model training and evaluation.
- Model deployment and serving.

These pipelines are designed to be highly modular, allowing you to extend or customize them to meet your specific project requirements.

## 5. Model Compatibility

The Package CLI includes the `BinaryClassifierSklearnPipeline` class, which is compatible with any machine learning model adhering to the scikit-learn API. This flexibility empowers you to experiment with a wide variety of models, including:

- Logistic Regression
- Random Forest
- Support Vector Machines
- Gradient Boosting
- Neural Networks
- XGBoost
- k-NN

Any other scikit-learn-compatible model can be used as well.
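
The snippet below sketches what "compatible with the scikit-learn API" buys you: the same wrapper accepts any estimator that implements `fit` and `predict_proba`. It uses a plain `sklearn.pipeline.Pipeline` as a stand-in; the helper `build_pipeline` and the synthetic data are assumptions for illustration and do not reproduce the package's `BinaryClassifierSklearnPipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(estimator):
    """Inject any scikit-learn-compatible classifier behind a shared preprocessing step."""
    return Pipeline([("scaler", StandardScaler()), ("model", estimator)])


# Synthetic binary classification data, used only to make the example runnable.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

probabilities = {}
for name, estimator in candidates.items():
    fitted = build_pipeline(estimator).fit(X, y)
    # Probability of the positive class for every row.
    probabilities[name] = fitted.predict_proba(X)[:, 1]
```

Swapping in another model only requires adding an entry to `candidates`; the surrounding code does not change.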

## 6. Hyperparameter Tuning

Through the Kedro CLI, you can explore different hyperparameter settings for each model to optimize its performance, using a **StratifiedKFold** cross-validation strategy. Hyperparameter tuning is integrated into the pipeline, which simplifies the search for the best model configuration.
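
A minimal sketch of tuning with stratified folds follows. It uses scikit-learn's `GridSearchCV` as a simple stand-in for the search strategy, which may differ from what the package actually uses; the parameter grid, the scoring metric, and the synthetic data are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, mildly imbalanced data, used only to make the example runnable.
X, y = make_classification(n_samples=300, n_features=10, weights=[0.6, 0.4], random_state=0)

# StratifiedKFold keeps the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean F1 across the five folds
```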




## Data engineering CLI<a name="data-engineering"></a>

In [56]:
!kedro run --pipeline data_engineering

/distelsa/lib/python3.10/site-packages/kedro/io/__init__.py:44: 'AbstractDataSet' has been renamed to 'AbstractDataset', and the alias will be removed in Kedro 0.19.0
  return getattr(kedro.io.core, name)

The data engineering pipeline runs from the raw data through the cluster transformers and saves train and test datasets that are ready to be used by the models.

# Data Science Pipelines

The Data Science Pipelines project comprises 9 modular pipelines, each uniquely designed to optimize different machine learning models. These pipelines are thoughtfully constructed to leverage the full potential of your data and to achieve the highest possible model performance.

## Model Optimization

### Model Selection

Our pipelines explore a diverse range of models, including:

- Bagging Models
- Boosting Models
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Neural Network Models

### Hyperparameter Tuning

To ensure that these models perform at their best, we employ hyperparameter tuning using cross-validation strategies. This rigorous process fine-tunes the model parameters to maximize predictive accuracy and generalization.

### Saving Trained Models

Once the models are optimized, they are trained on the entire dataset. These trained models are diligently saved for future use, facilitating easy inference on new data and seamless deployment in production environments.

## Out-of-Sample Predictions

To evaluate model performance, we generate out-of-sample predictions using the `cross_val_predict` function from scikit-learn. Each prediction comes from a model that did not see that row during training, so the predictions indicate how well the models generalize to unseen data.
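
The following sketch shows how `cross_val_predict` produces one out-of-sample prediction per row. The estimator, fold count, 0.5 threshold, and synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic data, used only to make the example runnable.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Every row is scored by the model trained on the folds that exclude it.
oos_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]

# Turn probabilities into class labels with an illustrative 0.5 threshold.
oos_pred = (oos_proba >= 0.5).astype(int)
```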

## Metrics Reporting

The final segment of the pipelines involves the creation of a comprehensive metrics report. This report encompasses key performance metrics, enabling you to thoroughly assess the models' effectiveness. Commonly included metrics are accuracy, precision, recall, F1-score, and more.
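
As a small worked example of the metrics named above, the snippet below computes them with scikit-learn on a hand-made set of labels. The labels are invented for the example and are not predictions from the project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

report = {
    "accuracy": accuracy_score(y_true, y_pred),    # (3 + 3) / 8
    "precision": precision_score(y_true, y_pred),  # 3 / (3 + 1)
    "recall": recall_score(y_true, y_pred),        # 3 / (3 + 1)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
}
```

With these labels, all four metrics come out at 0.75.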

## Production Deployment with MLflow

These pipelines integrate with MLflow. The models are ranked by their performance, and the best-performing one is automatically registered in a production environment within MLflow, so that only the top-performing model is deployed. This production model can then be used to perform inference on a test dataset.
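
The ranking step can be pictured with the short sketch below. The metric values are invented for illustration, and the helper `rank_models` is hypothetical; how the package actually selects and registers the winner is not shown in this notebook.

```python
def rank_models(results, metric="f1"):
    """Return (model_name, metrics) pairs sorted from best to worst on `metric`."""
    return sorted(results.items(), key=lambda item: item[1][metric], reverse=True)


# Illustrative numbers only, not results from the project.
results = {
    "xgboost": {"f1": 0.78, "accuracy": 0.83},
    "random_forest": {"f1": 0.76, "accuracy": 0.82},
    "knn": {"f1": 0.70, "accuracy": 0.77},
}

ranking = rank_models(results)
best_name, best_metrics = ranking[0]

# The winner is the model that would then be registered, for example with
# mlflow.register_model(model_uri, name); the URI and name depend on the project setup.
```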



<div class="alert alert-info">
<b>Execution Time Alert</b>

> The data science pipeline can take 1/2 hours to complete because it evaluates several models.

> If you want to speed up the execution you can run this command instead:

!kedro run --pipeline data_science --namespace xgboost  

</div>

In [None]:
!kedro run --pipeline data_science

## MLflow exploration


After running the data engineering and data science pipelines, you can open the MLflow UI with the command shown below (`kedro mlflow ui`). Once it is running, the UI should be available at this URL:

- **http://127.0.0.1:3001** 

Open it in a browser and navigate to the artifacts, metrics, models, and registered models sections.

**Artifacts**: This section houses a variety of assets, including input data, transformed data, model files, and other relevant files. It provides a comprehensive view of the artifacts generated during your experiments.

**Metrics**: In the "Metrics" section, you can review key performance metrics recorded during your experiments. These metrics are crucial for assessing the effectiveness of your models and pipelines. Commonly logged metrics include accuracy, precision, recall, F1-score, and more.

**Models**: Explore the models that have been trained and logged during your experiments in the "Models" section. You can access detailed information about each model, including its hyperparameters, performance metrics, and any associated artifacts.

**Registered Models**: The "Registered Models" section contains the best-performing models that have been specifically registered for production deployment. These models have undergone rigorous evaluation and are deemed ready for use in real-world applications. This section provides a streamlined view of models that meet the highest quality standards.



In [None]:
!kedro mlflow ui

## Model productionalization

After reviewing the models in MLflow, the next step is to put the production model behind an API. The package therefore also contains an API that serves the production model together with the transformers used to build it.



## Put the API into production

Execute the following command to expose a POST endpoint backed by the production model.

!python src/project/apis/model_serving/app.py
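
For readers who want a feel for what such a service looks like, here is a minimal Flask sketch of a POST endpoint that scores one passenger. The `/predict` route, the payload fields, the response format, and the `ConstantModel` placeholder are all assumptions for illustration; the real app loads the production model and its preprocessors from the Kedro catalog.

```python
from flask import Flask, jsonify, request


class ConstantModel:
    """Placeholder for the production model: always returns the same probabilities."""

    def predict_proba(self, rows):
        return [[0.8, 0.2] for _ in rows]


def create_app(model):
    """Build a Flask app around any object that exposes predict_proba."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])  # hypothetical route
    def predict():
        passenger = request.get_json()
        # The real service would apply the fitted preprocessors here before scoring.
        probability = model.predict_proba([passenger])[0][1]
        return jsonify({"name": passenger.get("Name"), "survival_probability": probability})

    return app


app = create_app(ConstantModel())

# Exercise the endpoint in-process with Flask's test client, without opening a port.
client = app.test_client()
reply = client.post("/predict", json={"Name": "Jane Doe", "Pclass": 3, "Sex": "female", "Age": 29})
payload = reply.get_json()
```

Passing the model into `create_app` keeps the web layer independent of how the model is loaded, which also makes the endpoint easy to test.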

In [57]:
# In a unix system run / sorry if you're using a different system, but you should not XD
!make deploy-model-service-api-dev

flask --app src/project/apis/model_serving run --port=5000 --debug
2024-01-07 21:34:24,093 - kedro_mlflow.config.kedro_mlflow_config - INFO - The 'tracking_uri' key in mlflow.yml is relative ('server.mlflow_(tracking|registry)_uri = mlruns'). It is converted to a valid uri: 'file:///Users/Matheus_Pinto/Desktop/quantumblack/titanic-dataset/mlruns'
  return getattr(kedro.io.core, name)
  attr = getattr(submod, name)
  attr = getattr(submod, name)
  from kedro.io.core import DataSetError
  from kedro.io.core import DataSetError
  attr = getattr(submod, name)
2024-01-07 21:34:24,307 - kedro.io.data_catalog - INFO - Loading data from 'raw_preprocessor' (MlflowPickleDataset)...
2024-01-07 21:34:24,674 - kedro.io.data_catalog - INFO - Loading data from 'int_preprocessor' (MlflowPickleDataset)...
2024-01-07 21:34:24,675 - kedro.io.data_catalog - INFO - Loading data from 'prm_preprocessor' (MlflowPickleDataset)...

## Test model API

!python src/project/apis/model_serving/ping_api.py


In [58]:
!python src/project/apis/model_serving/ping_api.py

Matheus Pinto Arratia has a survival probability of 16.013 [%]


After testing the API, you should receive this text in the terminal:


**Model response: Matheus Pinto Arratia has a survival probability of XX [%]**

So, I would probably die ...
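
The sketch below illustrates the kind of request a script like `ping_api.py` might send, using only the standard library. Because the real route and payload are not shown in this notebook, the `/predict` path, the passenger fields, and the response keys are assumptions, and the request is sent to a small stub server defined in the example rather than to the real service.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


def post_json(url, payload, timeout=5.0):
    """POST a JSON payload and return the decoded JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any configured proxy, since the target is a local address.
    opener = request.build_opener(request.ProxyHandler({}))
    with opener.open(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


class StubModelHandler(BaseHTTPRequestHandler):
    """Stand-in for the real service: echoes the name with a fixed probability."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        passenger = json.loads(self.rfile.read(length))
        body = json.dumps({"name": passenger["Name"], "survival_probability": 0.5}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        """Silence per-request logging."""


server = HTTPServer(("127.0.0.1", 0), StubModelHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/predict"  # hypothetical route
response = post_json(url, {"Name": "Jane Doe", "Pclass": 3, "Sex": "female", "Age": 29})
server.shutdown()
server.server_close()

message = (
    f"{response['name']} has a survival probability of "
    f"{100 * response['survival_probability']:.3f} [%]"
)
```

Against the real service, only `url` and the payload fields would need to change to match the deployed endpoint.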



## Package testing 

You can test the whole package using the following command

!kedro test src/project/packages


You should see the test results for the whole package, along with the modules that still lack test coverage.



## About CI/CD



# Application Deployment and Software Architecture


When considering deploying these applications into production, there are several architectural options to choose from, each with its own set of advantages and disadvantages.

## Gitflow

**Pros:**
- Ensures a well-structured development process with clear branching strategies.
- Continuous Integration (CI) pipeline ensures code quality and functionality.
- Integration tests and unit tests provide end-to-end testing coverage.
- Code versioning and history tracking.

## Docker

The API, pipelines, and packages should be containerized with Docker to maintain reproducibility.

**Pros:**
- Enables containerization, ensuring reproducibility across different environments.
- Simplifies deployment and scaling as containers can run consistently in various environments.
- Supports microservices architecture, allowing for modularization of applications.
- Easier management of dependencies within containers.


## Deployment

Using Azure Functions or AWS Lambda functions behind an API Gateway.

### Serverless

**Pros:**
- Cost-effective as you only pay for actual usage.
- Auto-scaling and automatic resource management.
- Low operational overhead as the cloud provider handles infrastructure.
- Well-suited for event-driven applications and microservices.

**Cons:**
- Limited control over infrastructure, which may be a limitation for specific requirements.
- Cold start latency can impact response times for some functions.
- Debugging and monitoring can be more challenging in a serverless environment.
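
To show what the serverless option could look like in code, here is a minimal sketch of an AWS Lambda handler for an API Gateway proxy event. The `score` function is a placeholder for loading and calling the registered model, and the payload fields and response keys are assumptions for the example.

```python
import json


def score(passenger):
    """Placeholder for scoring one passenger with the production model."""
    return 0.2


def _response(status_code, payload):
    """Build the response shape expected by API Gateway's Lambda proxy integration."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    """Handle a proxy event whose body is a JSON-encoded passenger."""
    try:
        passenger = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body must be valid JSON"})
    if "Name" not in passenger:
        return _response(400, {"error": "missing field: Name"})
    return _response(200, {"name": passenger["Name"], "survival_probability": score(passenger)})


# Local invocation with hand-built events; no AWS account is needed for this check.
ok = lambda_handler({"body": json.dumps({"Name": "Jane Doe", "Pclass": 3})}, None)
bad = lambda_handler({"body": "not json"}, None)
```

In a real function, the model would typically be loaded once at module import time rather than inside the handler, which limits the impact of the cold start latency mentioned above to the first invocation of each instance.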


### Load Balancer and Server Deployment

**Pros:**
- Provides control over the underlying infrastructure.
- Suitable for applications with specific hardware requirements.
- Can be cost-effective for steady-state workloads.
- Greater flexibility in configuring load balancing algorithms.


**Cons:**
- Increased operational complexity compared to serverless or containerized approaches.
- Limited scalability during traffic spikes without proper automation.

