Pipeline in Machine Learning
=============================
This is a simple example of how to use a pipeline in machine learning.
This repository contains a collection of Jupyter notebooks that demonstrate various aspects of machine learning. The goal is to provide an interactive environment for exploring machine learning ideas and techniques. The notebooks are written in Python 3 and use the scikit-learn library.
The notebooks are loosely inspired by the book [Python Machine Learning](https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130) by [Sebastian Raschka](https://sebastianraschka.com/).
The notebooks are intended to be used with the [Anaconda](https://www.continuum.io/downloads) Python distribution.

Here are the key components of a pipeline:

- **Data Preparation**: This step includes data cleaning, data transformation, and data reduction. Data cleaning is the process of removing or correcting noisy data. Data transformation is the process of converting the data from one format to another. Data reduction is the process of reducing the size of the data while maintaining its integrity.

- **Model Evaluation**: This step includes model evaluation, model selection, and model tuning. Model evaluation is the process of measuring the model's performance. Model selection is the process of choosing the best model. Model tuning is the process of adjusting the model parameters to improve performance.

- **Predictions**: This step includes model deployment, model monitoring, and model explainability. Model deployment is the process of putting the model into production. Model monitoring is the process of tracking the model's performance in production. Model explainability is the process of explaining the model's predictions.
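
The components above can be sketched as a single scikit-learn `Pipeline`. This is a minimal illustration on synthetic data, not the code from the notebooks: the step names, the choice of `StandardScaler` and `LogisticRegression`, and the generated dataset are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Blank out a few values so the data preparation step has something to clean
rng = np.random.default_rng(0)
X[rng.integers(0, 200, size=20), rng.integers(0, 5, size=20)] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Data preparation (imputation, scaling) followed by the model
demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Model evaluation: cross-validation refits every step inside each fold
cv_scores = cross_val_score(demo_pipeline, X_train, y_train, cv=5)

# Predictions: fit once on the training set, then predict on unseen rows
demo_pipeline.fit(X_train, y_train)
test_predictions = demo_pipeline.predict(X_test)
print("Mean CV accuracy:", cv_scores.mean())
```

Because the imputer and scaler live inside the pipeline, cross-validation fits them only on the training portion of each fold, which keeps information from the held-out rows out of the preprocessing.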

# The main advantages of using a pipeline in machine learning
- **Easy to read and understand**: A pipeline lists every step in order in a single object, so a reader can follow the workflow from raw data to prediction.

- **Simplified workflow**: The data preparation step is simplified. Each preprocessing step is defined once, is easy to understand, and can be easily implemented.
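
To make the simplified workflow concrete, the sketch below compares applying each step by hand with wrapping the same steps in a `Pipeline`. The synthetic data and the choice of scaler and classifier are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Manual workflow: every transformer is fitted and applied by hand
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics
manual_model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(X_test_scaled)

# Pipeline workflow: the same steps, declared once and run with fit/predict
workflow = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline_pred = workflow.fit(X_train, y_train).predict(X_test)

# Both routes apply the same operations in the same order
same_predictions = np.array_equal(manual_pred, pipeline_pred)
print("Identical predictions:", same_predictions)
```

The pipeline version removes the bookkeeping of calling `fit_transform` on the training data and `transform` on the test data for every step, which is where manual workflows tend to go wrong.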





- [Data Preparation](20_pipeline_machine_learning.ipynb)
- [Feature Engineering](30_feature_engineering.ipynb)
- [Feature Selection](40_feature_selection.ipynb)
- [Model Evaluation](50_model_evaluation.ipynb)
- [Model Tuning](60_model_tuning.ipynb)
- [Model Optimization](70_model_optimization.ipynb)
- [Model Deploying](80_model_deploying.ipynb)
- [Model Monitoring](90_model_monitoring.ipynb)
- [Model Explainability](100_model_explainability.ipynb)
- [Model Management](110_model_management.ipynb)

```python
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# load the titanic dataset from seaborn
titanic = sns.load_dataset('titanic')

# select Features and target variable 
X = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = titanic['survived']

# split the data into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define which columns are categorical and which are numerical
categorical_features = ['pclass', 'sex', 'embarked']
numerical_features = ['age', 'fare']

# create the numerical transformer
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# create the categorical transformer
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# create the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
])

# create the full pipeline: preprocessing followed by the classifier
# (named `pipeline` so the imported Pipeline class is not shadowed)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make the predictions on the test data
y_pred = pipeline.predict(X_test)

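# Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Output:
# Accuracy: 0.7821229050279329
```

# Hyperparameter tuning in a pipeline
- **Hyperparameter Tuning**: Hyperparameter tuning is the process of finding the best hyperparameters for a given model. Hyperparameters are settings chosen before training, such as the number of trees in a random forest, rather than values learned from the data. Inside a pipeline, each one is addressed as `<step name>__<parameter>`, for example `classifier__n_estimators`.

```python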
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# load the titanic dataset from seaborn
titanic = sns.load_dataset('titanic')

# select Features and target variable
X = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = titanic['survived']

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Note: this pipeline imputes and one-hot encodes every selected column,
# including the numerical ones (age, fare)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Define the hyperparameters to tune (keys use the <step name>__<parameter> form)
hyperparameters = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 5, 10],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}
# perform grid search cross-validation
grid_search = GridSearchCV(pipeline, hyperparameters, cv=5)
grid_search.fit(X_train, y_train)

# get the best model
best_model = grid_search.best_estimator_

# make prediction on the test data using the best model
y_pred = best_model.predict(X_test)

# calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)

# Output:
# Accuracy: 0.8212290502793296
# Best hyperparameters: {'classifier__max_depth': None, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 300}
```
