# Sepsis Prediction Analysis

### Business Understanding

Sepsis is a critical medical condition characterized by the body's extreme response to infection, often leading to severe tissue damage, multiple organ failure, and even death. Each year, approximately 30 million individuals worldwide develop sepsis, with a staggering one-fifth of them succumbing to the disease. Detecting sepsis early and initiating immediate treatment is crucial for saving lives and improving patient outcomes. This project aims to leverage machine learning to predict whether patients will develop sepsis using their physiological data.

Project Objectives:
The primary objectives of this project are as follows:
1. Early Sepsis Detection: Develop a robust machine learning model capable of accurately predicting whether patients will develop sepsis using their physiological data.
2. Life-Saving Potential: By accurately predicting whether patients will develop sepsis, this project aims to enable healthcare professionals to intervene promptly, potentially saving lives and reducing the severity of complications associated with sepsis.
3. Model Integration: Implement the trained machine learning model into a user-friendly FastAPI-based application. This integration will make the model accessible to multiple healthcare professionals, streamlining the prediction process and facilitating early sepsis diagnosis.

Data Source:
The project relies on train and test datasets derived from a modified version of a publicly available patient data source. Both datasets are publicly available on Kaggle.

Data Preprocessing:
Data preprocessing will play a crucial role in this project. The preprocessing steps include data splitting, feature scaling, and dataset balancing to improve model performance.

Machine Learning Models:
Various machine learning algorithms, such as logistic regression, random forests, support vector machines, and neural networks, will be explored to determine the most effective model for sepsis prediction. Model selection will be based on factors like accuracy, sensitivity, specificity, and interpretability.

Model Evaluation:
The performance of the developed models will be rigorously evaluated using appropriate metrics, including but not limited to precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC-ROC). Cross-validation techniques will help ensure the model's generalizability.
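
As a minimal sketch of how these metrics relate to one another, the snippet below derives accuracy, precision, recall (sensitivity), specificity, and F1 from the four confusion-matrix counts. The helper name `classification_metrics` is hypothetical, and the counts are made up for illustration; they are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'specificity': specificity,
        'f1': f1,
    }

# Illustrative counts only (not project results)
metrics = classification_metrics(tp=30, fp=10, fn=12, tn=68)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

For sepsis screening, recall is usually watched closely because a false negative (a missed sepsis case) tends to be costlier than a false positive.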

FastAPI Integration:
To make the sepsis prediction model accessible to healthcare professionals, it will be integrated into a FastAPI-based web application. This integration will provide a user-friendly interface where users can input patient data and receive predictions in real-time.

Project Impact:
Successful implementation of this project can have a profound impact on healthcare outcomes. Early sepsis detection can lead to faster intervention, reduced mortality rates, and improved patient recovery. Additionally, by making the model available through FastAPI, it becomes a valuable tool for healthcare providers worldwide, potentially saving countless lives.

Conclusion:
This Sepsis Prediction Project combines cutting-edge machine learning with user-friendly software integration to tackle a critical medical challenge. The goal is to provide healthcare professionals with a powerful tool for early sepsis detection, ultimately leading to better patient care and improved survival rates in the face of this life-threatening condition.

### Hypothesis

Null Hypothesis (H0):
The machine learning model's accuracy in predicting sepsis based on patients' physiological data is not significantly different from a baseline level, suggesting that the model's predictions are no better than random chance.

Alternative Hypothesis (H1):
The machine learning model's accuracy in predicting sepsis based on patients' physiological data is significantly better than a baseline level, indicating that the model provides valuable predictive capabilities beyond random chance.

In simpler terms:
H0: The machine learning model doesn't improve sepsis prediction beyond random chance.
H1: The machine learning model significantly improves sepsis prediction beyond random chance.

This hypothesis specifically addresses the improvement in sepsis prediction and aligns with the project's objective of determining whether the model's predictions are meaningful compared to random guessing.
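
One way such a hypothesis could be tested is an exact one-sided binomial test on the validation accuracy. The sketch below is only an illustration under stated assumptions: the helper `binomial_p_value` is hypothetical, the counts are invented, and the baseline of 0.5 stands in for random guessing between two classes (a majority-class baseline would be stricter on an imbalanced dataset).

```python
from math import comb

def binomial_p_value(n_correct, n_total, baseline):
    """One-sided exact binomial p-value: P(X >= n_correct) when the true accuracy equals baseline."""
    return sum(
        comb(n_total, k) * baseline**k * (1 - baseline)**(n_total - k)
        for k in range(n_correct, n_total + 1)
    )

# Hypothetical numbers for illustration only (not project results)
n_total = 120      # assumed number of validation samples
n_correct = 90     # assumed number of correct predictions
baseline = 0.5     # assumed accuracy of random guessing
alpha = 0.05       # significance level

p_value = binomial_p_value(n_correct, n_total, baseline)
print(f'p-value = {p_value:.3g}')
print('Reject H0' if p_value < alpha else 'Fail to reject H0')
```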

### Analytical Questions

1. How many patients on the train dataset have developed sepsis?
2. Which age group has more occurrence of sepsis?
3. Does having health insurance reduce the chances of patients developing sepsis?
4. Does Body Mass Index (BMI) have a direct correlation with sepsis development?
5. Do the blood parameters play a role in sepsis development?

### Exploratory Data Analysis

In [None]:
# Import the needed packages
import pandas as pd
import numpy as np

# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Library for testing the hypothesis
import scipy.stats as stats

# Library for pandas profiling
from pandas_profiling import ProfileReport

# Library for splitting the train data
from sklearn.model_selection import train_test_split

# Library for feature scaling
from sklearn.preprocessing import StandardScaler

# Library for feature encoding
from sklearn.preprocessing import OneHotEncoder

# Libraries for balancing the dataset
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Libraries for modelling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Libraries for model evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Library for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Library for working with operating system
import os

# Library to serialize a Python object into a flat byte stream and transform a byte stream back into a Python object
import pickle

# Library to handle warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the datasets.

train = pd.read_csv('data/Paitients_Files_Train.csv')
test = pd.read_csv('data/Paitients_Files_Test.csv')

In [None]:
# View the first five rows of the train dataset

train.head()

In [None]:
# View the first five rows of the test dataset

test.head()

The train dataset has a 'Sepssis' column which is absent in the test dataset. This 'Sepssis' column will serve as the target column when training the model. This column will be renamed to 'Sepsis'.

In [None]:
# Rename 'Sepssis' column to 'Sepsis'

train.rename(columns={'Sepssis': 'Sepsis'}, inplace=True)
train.head()

In [None]:
# Check the number of rows and columns on both datasets.

train.shape, test.shape

The train dataset has 599 rows and 11 columns, while the test dataset has 169 rows and 10 columns.

In [None]:
# Check the datatypes and the presence of missing values on the train dataset.

train.info()

In [None]:
# Check the datatypes and the presence of missing values on the test dataset.

test.info()

There are no empty cells in either the train or the test dataset, and the datatype of each column is consistent across both datasets.

In [None]:
# Confirm that both train and test datasets have no missing values

train.isna().sum().sum(), test.isna().sum().sum()

In [None]:
# Check for the presence of duplicates on the train and test datasets.

train.duplicated().sum(), test.duplicated().sum()

In [None]:
# Investigating the columns on the train dataset.

train.columns
for column in train.columns:
    print('column: {} - unique value: {}'.format(column, train[column].unique()))

In [None]:
# Obtain the numerical columns of the train dataset
train_num = train.select_dtypes(include=['float64', 'int64']).columns

# Obtain the numerical columns of the test dataset
test_num = test.select_dtypes(include=['float64', 'int64']).columns

In [None]:
# Evaluate the numerical values on the train dataset.

train[train_num].describe()

In [None]:
# Evaluate the correlation of the numerical values on the train dataset.

train[train_num].corr()

In [None]:
# Visualize the correlation with a heatmap

sns.heatmap(train[train_num].corr(), annot=True)

# Save the chart as an image file
plt.savefig('Images/Correlation of the numerical columns of the train dataset.png')

In [None]:
# Evaluate the numerical values on the test dataset.

test[test_num].describe()

In [None]:
# Evaluate the correlation of the numerical values on the test dataset.

test[test_num].corr()

In [None]:
# Visualize the correlation with a heatmap

sns.heatmap(test[test_num].corr(), annot=True)

# Save the chart as an image file
plt.savefig('Images/Correlation of the numerical columns of the test dataset.png')

### Answering Analytical Questions

1. How many patients on the train dataset have developed sepsis?

In [None]:
# Count the number of patients with and without sepsis on the train dataset
sepsis_count = train['Sepsis'].value_counts()
print(sepsis_count)

# Create a bar chart
plt.figure(figsize=(6, 4))
sepsis_count.plot(kind='bar', color=['darkgreen', 'darkred'])
plt.title('Number of Patients with and without Sepsis')
plt.xlabel('Sepsis')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

2. Which age group has more occurrence of sepsis?

In [None]:
# Create a histogram of age distribution for sepsis-positive and sepsis-negative patients

plt.figure(figsize=(10, 5))
sns.histplot(data=train, x='Age', hue='Sepsis', bins=20, kde=True)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution of Patients with and without Sepsis')
plt.show()

3. Does having health insurance reduce the chances of patients developing sepsis?

In [None]:
# Create a countplot to compare sepsis occurrence with and without insurance

plt.figure(figsize=(8, 5))
sns.countplot(data=train, x='Insurance', hue='Sepsis')
plt.xlabel('Insurance')
plt.ylabel('Count')
plt.title('Sepsis Occurrence with and without Health Insurance')
plt.xticks([0, 1], ['No Insurance', 'Has Insurance'])
plt.legend(title='Sepsis', labels=['Negative', 'Positive'])
plt.show()

4. Does Body Mass Index (BMI) have a direct correlation with sepsis development?
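
A question like this could be approached by encoding the target as 0/1 and computing its Pearson (point-biserial) correlation with BMI. The following is a sketch on a toy DataFrame, not on the project data: the values are invented and `BMI` is a hypothetical column name, since the notebook does not state which column of the dataset holds the body mass index.

```python
import pandas as pd

# Toy data for illustration only; 'BMI' is a hypothetical column name
toy = pd.DataFrame({
    'BMI': [22.1, 35.4, 28.0, 31.2, 24.5, 38.9, 26.3, 33.0],
    'Sepsis': ['Negative', 'Positive', 'Negative', 'Positive',
               'Negative', 'Positive', 'Negative', 'Positive'],
})

# Encode the target as 0/1 so a Pearson (point-biserial) correlation can be computed
sepsis_flag = (toy['Sepsis'] == 'Positive').astype(int)
bmi_corr = toy['BMI'].corr(sepsis_flag)
print(f'Correlation between BMI and sepsis: {bmi_corr:.3f}')

# Compare the average BMI of the two groups
print(toy.groupby('Sepsis')['BMI'].mean())
```

A correlation close to 0 would suggest little linear association, while values approaching 1 or -1 would suggest a strong one; correlation alone does not establish that BMI causes sepsis.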

### Feature Engineering

Feature engineering is the process that selects and transforms raw data from datasets into the desired features that can be used in supervised learning for modelling.

In order to preserve the original cleaned datasets for future analysis, copies of the train and test datasets will be created and used for feature engineering.

Also, to avoid data leakage, the copy of the train dataset will be split into a training set and a validation set before the feature engineering processes are carried out.

In [None]:
# Create a copy of the train and test datasets on which to carry out the feature engineering processes

train_data = train.copy()
test_data = test.copy()

In [None]:
# Drop the 'ID' column as it is not needed in modelling

train_data.drop(columns='ID', inplace=True)
test_data.drop(columns='ID', inplace=True)
train_data.head()

In [None]:
# Replace positive with 1 and negative with 0 in the 'Sepsis' column of the train dataset

train_data['Sepsis'] = train_data['Sepsis'].replace(['Positive', 'Negative'], [1, 0])
train_data.head()

### Split the train dataset

In [None]:
# Obtain the X and y variables of the train dataset
X = train_data.drop('Sepsis', axis=1)
y = train_data['Sepsis']

# Split train dataset into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Print the shapes of the training and validation sets
print("Train set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_val.shape, y_val.shape)

### Feature Scaling

Feature scaling is a data preprocessing technique that involves transforming the values of features or variables in a dataset to a similar scale. This is to ensure that all features contribute equally to the training of models and to prevent features with larger values from dominating the models trained.
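
A minimal sketch of standardization follows, with made-up feature values. It computes the mean and standard deviation on the training rows only and then reuses those statistics for the validation rows, which is the usual way to keep validation data from influencing the scaling. The array names are illustrative.

```python
import numpy as np

# Made-up feature values for illustration only (two features, different scales)
train_values = np.array([[50.0, 1.0], [60.0, 3.0], [70.0, 5.0], [80.0, 7.0]])
val_values = np.array([[65.0, 4.0], [90.0, 2.0]])

# Statistics are learned from the training rows only
mean = train_values.mean(axis=0)
std = train_values.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# z = (x - mean) / std, applied with the same statistics to both sets
train_scaled = (train_values - mean) / std
val_scaled = (val_values - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(val_scaled)
```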

In [None]:
# Create a scaler object using StandardScaler

scaler = StandardScaler()

In [None]:
# Use StandardScaler to scale the X_train
X_train_scaled = scaler.fit_transform(X_train)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=scaler.get_feature_names_out())

# View the scaled X_train DataFrame
X_train_scaled_df.head()

In [None]:
# Scale X_val with the scaler already fitted on X_train (transform only, to avoid data leakage)
X_val_scaled = scaler.transform(X_val)
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=scaler.get_feature_names_out())

# View the scaled X_val DataFrame
X_val_scaled_df.head()

In [None]:
# Scale the test dataset with the scaler already fitted on X_train (transform only)
test_data_scaled = scaler.transform(test_data)
test_data_scaled_df = pd.DataFrame(test_data_scaled, columns=scaler.get_feature_names_out())

# View the scaled test DataFrame
test_data_scaled_df.head()

### Balancing X_train of the scaled dataset

In [None]:
# Perform oversampling using SMOTE (stored under separate names so it is not overwritten below)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled_df, y_train)

# Perform undersampling using RandomUnderSampler; this is the balanced set used for modelling
rus = RandomUnderSampler(random_state=42)
X_train_balanced, y_train_balanced = rus.fit_resample(X_train_scaled_df, y_train)

# Print the class distribution before and after balancing
print("Before balancing:")
print(y_train.value_counts())

print("After balancing:")
print(pd.Series(y_train_balanced).value_counts())
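
To illustrate the idea behind random undersampling, the sketch below keeps every row of the smallest class and draws an equally sized random sample from each other class. It is a simplified stand-in written with the standard library, not imbalanced-learn's implementation; the helper `random_undersample` and the toy labels are hypothetical.

```python
import random
from collections import Counter

def random_undersample(labels, seed=42):
    """Return sorted row indices with each class sampled down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sampling without replacement; the smallest class is kept in full
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

# Toy labels for illustration only: 7 negatives and 3 positives
labels = [0] * 7 + [1] * 3
kept = random_undersample(labels)

print('Before:', Counter(labels))
print('After:', Counter(labels[i] for i in kept))
```

Undersampling discards majority-class rows, which can matter on a dataset of only a few hundred records; oversampling methods such as SMOTE add synthetic minority rows instead.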

### Modelling

In [None]:
# Create a list of models to train and evaluate
models = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Decision Tree', DecisionTreeClassifier(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(random_state=42)),
    ('Adaptive Boosting', AdaBoostClassifier(random_state=42)),
    ('Support Vector Machine', SVC(random_state=42)),
    ('Gaussian Naive Bayes', GaussianNB()),
    ('K-Nearest Neighbors', KNeighborsClassifier())
]

### Model training and evaluation with the unbalanced dataset

In [None]:
# Initialize the best_model and best f1 score variables
unbal_best_model = None
unbal_best_f1_score = 0.0

# Create an empty dictionary to store the performance metrics of the models after training with unbalanced dataset
unbal_performance_metrics = {}

# Model training, evaluation and result calculation
for model_name, model in models:
    # Model training with unbalanced dataset
    model.fit(X_train_scaled_df, y_train)
    
    # Using the models to make predictions on the validation set
    y_pred_unbal = model.predict(X_val_scaled_df)
    
    # Calculate the performance metrics of the models trained on the unbalanced dataset
    accuracy = accuracy_score(y_val, y_pred_unbal)
    precision = precision_score(y_val, y_pred_unbal)
    recall = recall_score(y_val, y_pred_unbal)
    f1 = f1_score(y_val, y_pred_unbal)
    roc_auc = roc_auc_score(y_val, y_pred_unbal)
    
    # Store the performance metrics results
    unbal_performance_metrics[model_name] = {
        'Unbal Accuracy': accuracy,
        'Unbal Precision': precision,
        'Unbal Recall': recall,
        'Unbal F1 Score': f1,
        'Unbal ROC_AUC': roc_auc
    }
    
    # Check if the current model has a higher F1 score than the best one found so far
    if f1 > unbal_best_f1_score:
        unbal_best_model = model  # Update the best model
        unbal_best_model_name = model_name # Update the best model name
        unbal_best_f1_score = f1  # Update the best F1 score

In [None]:
# Create a DataFrame to store the performance metrics of the models on the unbalanced dataset
unbalanced_performance_metrics = pd.DataFrame(unbal_performance_metrics).transpose()
    
# Arrange the performance metrics DataFrame in descending order according to the F1 Score
unbalanced_performance_metrics = unbalanced_performance_metrics.sort_values('Unbal F1 Score', ascending=False)

# Show the performance metrics DataFrame of the models on the unbalanced dataset
unbalanced_performance_metrics.style.set_caption('The Performance Metrics Of The Models On The Unbalanced Dataset')

Based on the F1 scores of the models, Logistic Regression is the best model for the unbalanced dataset, with an F1 score of 0.609756. Generally, Adaptive Boosting performed better in all the metrics except Precision and Recall.

### Confusion matrix for unbalanced dataset

In [None]:
# Model prediction and confusion matrix computation
for model_name, model in models:
    # Model training with unbalanced dataset
    model.fit(X_train_scaled_df, y_train)
    
    # Using the models to make predictions on the validation set
    y_pred_unbal = model.predict(X_val_scaled_df)
    
    # Compute the confusion matrix
    cm = confusion_matrix(y_val, y_pred_unbal)
    
    # Print the confusion matrix
    print(f'Confusion Matrix For {model_name} On Unbalanced Dataset:\n{cm}')
    
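    # Plot the confusion matrix (rows are actual classes, columns are predicted classes;
    # class 0 = Negative, class 1 = Positive)
    sns.heatmap(cm,
            annot=True,
            fmt='g',
            xticklabels=['Negative','Positive'],
            yticklabels=['Negative','Positive'])
    plt.ylabel('Actual',fontsize=13)
    plt.xlabel('Predicted',fontsize=13)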
    plt.title(f'Confusion Matrix For {model_name} On Unbalanced Dataset',fontsize=17)
    plt.show()

### Model training and evaluation with the balanced dataset

In [None]:
# Create an empty dictionary to store the performance metrics of the models after training with balanced dataset
bal_performance_metrics = {}

# Initialize the best_model and best f1 score variables
bal_best_model = None
bal_best_f1_score = 0.0

# Model training, evaluation and result calculation
for model_name, model in models:
    # Model training with balanced dataset
    model.fit(X_train_balanced, y_train_balanced)
    
    # Using the models to make predictions on the validation set
    y_pred_bal = model.predict(X_val_scaled_df)
    
    # Calculate the performance metrics of the models on the balanced dataset
    accuracy = accuracy_score(y_val, y_pred_bal)
    precision = precision_score(y_val, y_pred_bal)
    recall = recall_score(y_val, y_pred_bal)
    f1 = f1_score(y_val, y_pred_bal)
    roc_auc = roc_auc_score(y_val, y_pred_bal)
    
    # Store the performance metrics results
    bal_performance_metrics[model_name] = {
        'Bal Accuracy': accuracy,
        'Bal Precision': precision,
        'Bal Recall': recall,
        'Bal F1 Score': f1,
        'Bal ROC_AUC': roc_auc
    }
    
    # Check if the current model has a higher F1 score than the best one found so far
    if f1 > bal_best_f1_score:
        bal_best_model = model  # Update the best model
        bal_best_model_name = model_name # Update the best model name
        bal_best_f1_score = f1  # Update the best F1 score

In [None]:
# Create a DataFrame to store the performance metrics of the models on the balanced dataset
balanced_performance_metrics = pd.DataFrame(bal_performance_metrics).transpose()
    
# Arrange the performance metrics DataFrame in descending order according to the F1 Score
balanced_performance_metrics = balanced_performance_metrics.sort_values('Bal F1 Score', ascending=False)

# Show the performance metrics DataFrame of the models on the balanced dataset
balanced_performance_metrics.style.set_caption('The Performance Metrics Of The Models On The Balanced Dataset')

Based on the F1 scores of the models, Adaptive Boosting is the best model for the balanced dataset, with an F1 score of 0.681319. Generally, Adaptive Boosting performed better in all the metrics except Recall.

In [None]:
# Model prediction and confusion matrix computation
for model_name, model in models:
    # Model training with balanced dataset
    model.fit(X_train_balanced, y_train_balanced)
    
    # Using the models to make predictions on the validation set
    y_pred_bal = model.predict(X_val_scaled_df)
    
    # Compute the confusion matrix
    cm = confusion_matrix(y_val, y_pred_bal)
    
    # Print the confusion matrix
    print(f'Confusion Matrix For {model_name} On Balanced Dataset:\n{cm}')
    
    # Plot the confusion matrix (rows are actual classes, columns are predicted classes;
    # class 0 = Negative, class 1 = Positive)
    sns.heatmap(cm,
            annot=True,
            fmt='g',
            xticklabels=['Negative','Positive'],
            yticklabels=['Negative','Positive'])
    plt.ylabel('Actual',fontsize=13)
    plt.xlabel('Predicted',fontsize=13)
    plt.title(f'Confusion Matrix For {model_name} On Balanced Dataset',fontsize=17)
    plt.show()

### Hyper-parameter Tuning

Hyperparameters are adjustable parameters whose values control the model training process.

Hyperparameter tuning (or hyperparameter optimization) is a process used to determine the right combination of hyperparameters that maximizes the model performance. It works by running multiple trials in a single training process. The hyperparameters are set within specified limits and executed to identify the set of hyperparameter values that are best suited for a model to give optimal results.
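
As a rough illustration of what an exhaustive grid search has to evaluate, the sketch below expands a parameter grid into every candidate combination. The grid is the Logistic Regression grid used later in this notebook; the helper `expand_grid` is hypothetical and only mimics the enumeration step, not the cross-validated scoring.

```python
from itertools import product

# Logistic Regression grid from this notebook
param_grid = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'C': [10, 1.0, 0.01],
}

def expand_grid(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_grid)
n_folds = 5  # number of cross-validation folds

print(f'Number of candidates: {len(candidates)}')
print(f'Model fits during cross-validation: {len(candidates) * n_folds}')
for candidate in candidates[:3]:
    print(candidate)
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is why the grids are kept small.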

In [None]:
# Get the available parameters for each model

for model_name, model in models:
    available_params = model.get_params()
    print(f'Available Parameters For {model_name}:{available_params}\n')

In [None]:
# Create an empty dictionary to store the performance metrics of the tuned models on the balanced dataset
tun_bal_performance_metrics = {}

# Initialize the best_model and best f1 score variables
tun_bal_best_model = None
tun_bal_best_f1_score = 0.0

# Perform hyperparameter tuning
for model_name, model in models:
    params_selection = {
        'Logistic Regression' : {'solver': ['newton-cg', 'lbfgs', 'liblinear'], 'C': [10, 1.0, 0.01]},
        'Decision Tree' : {'max_depth': [1, 3], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [0.5, 1, 2]},
        'Random Forest' : {'n_estimators': [500, 700], 'max_depth': [1, 3]},
        'Gradient Boosting' : {'n_estimators': [100, 150],'learning_rate': [0.1, 1.0, 5.0]},
        'Adaptive Boosting' : {'n_estimators': [10, 50, 100],'learning_rate': [0.1, 1.0, 5.0]},
        'Support Vector Machine' : {'kernel': ['poly', 'rbf', 'sigmoid'], 'C': [0.5, 1.0, 10]},
        'Gaussian Naive Bayes' : {'var_smoothing': [1e-09]},
        'K-Nearest Neighbors' : {'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan', 'minkowski']}
    }
   
    # Get the selected parameter values for the models to tune
    param_grid = params_selection[model_name]
    
    # Perform hyperparameter tuning using GridSearchCV
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='f1', verbose=0, refit=True)
    grid_search.fit(X_train_balanced, y_train_balanced)
         
    # Get the best of each model with the best parameters
    best_params = grid_search.best_params_
    best_params_model = grid_search.best_estimator_
    
    # Show the best parameters for each model
    print(f'The best parameters for {model_name} are {best_params}\n')
    
    # Use each model with its best parameters to make predictions on the scaled validation set
    best_params_model.fit(X_train_balanced, y_train_balanced)
    y_pred_bal_tun = best_params_model.predict(X_val_scaled_df)
    
    # Calculate the performance metrics on the balanced dataset for each model with its best parameters
    accuracy = accuracy_score(y_val, y_pred_bal_tun)
    precision = precision_score(y_val, y_pred_bal_tun)
    recall = recall_score(y_val, y_pred_bal_tun)
    f1 = f1_score(y_val, y_pred_bal_tun)
    roc_auc = roc_auc_score(y_val, y_pred_bal_tun)
    
    # Store the performance metrics results
    tun_bal_performance_metrics[model_name] = {
        'Tuned-Bal Accuracy': accuracy,
        'Tuned-Bal Precision': precision,
        'Tuned-Bal Recall': recall,
        'Tuned-Bal F1 Score': f1,
        'Tuned-Bal ROC_AUC': roc_auc
    }
    
    # Check if the current model has a higher F1 score than the best one found so far
    if f1 > tun_bal_best_f1_score:
        tun_bal_best_model = best_params_model  # Update the best model
        tun_bal_best_model_name = model_name # Update the best model name
        tun_bal_best_f1_score = f1  # Update the best F1 score

In [None]:
# Create a DataFrame to store the performance metrics of the tuned models on the balanced dataset
tuned_bal_performance_metrics = pd.DataFrame(tun_bal_performance_metrics).transpose()
    
# Arrange the performance metrics DataFrame in descending order according to the F1 Score
tuned_bal_performance_metrics = tuned_bal_performance_metrics.sort_values('Tuned-Bal F1 Score', ascending=False)

# Show the performance metrics DataFrame of the tuned models on the balanced dataset
tuned_bal_performance_metrics.style.set_caption('The Performance Metrics Of The Tuned Models On The Balanced Dataset')

From the table above, it is observed that the models did not perform better after hyperparameter tuning.

### Combine the performance metrics of the models

The performance metrics obtained on the unbalanced dataset, on the balanced dataset, and after hyperparameter tuning will be combined into a single table to make it easy to compare how each model performed under the three conditions.

In [None]:
# Concatenate the DataFrames while preserving columns
combined_performance_metrics = pd.concat([unbalanced_performance_metrics, balanced_performance_metrics,
                                          tuned_bal_performance_metrics], axis=1)

# Arrange the combined performance metrics DataFrame in descending order according to the F1 Score on the balanced dataset
combined_performance_metrics = combined_performance_metrics.sort_values('Bal F1 Score', ascending=False)

# Show the combined performance metrics DataFrame of the models
combined_performance_metrics.style.set_caption('The Combined Performance Metrics Of The Models')

As shown in the table above which combines all the evaluation metrics, Adaptive Boosting is the best model after the entire modelling process.

### Exportation

The key Machine Learning objects, including the best model, will be exported and used later to develop a FastAPI.

In [None]:
# Display the best model_name with the highest F1 score

print(f"The best model based on F1 score is {bal_best_model_name} and its F1 score is {bal_best_f1_score}")

In [None]:
# Save the best model_name with the highest F1 score

best_model = bal_best_model

In [None]:
# Create a dictionary to store all the Machine Learning components

model_components = {
    'model': best_model,
    'scaler': scaler
}

In [None]:
# Create an export folder named 'export'

os.makedirs('export', exist_ok=True)

In [None]:
# Create a path to the export folder

destination = os.path.join('.', 'export')

In [None]:
# Save the components to a file using pickle

with open (os.path.join(destination, 'ml_components.pkl'), 'wb') as f:
    pickle.dump(model_components, f)
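
As a sketch of how the exported file could later be read back (for example, when a FastAPI application starts), the snippet below performs a pickle round trip with placeholder objects. The string values stand in for the real model and scaler, and a temporary directory replaces the `export` folder so the example is self-contained.

```python
import os
import pickle
import tempfile

# Placeholder components for illustration; the notebook stores the fitted model and scaler
components = {'model': 'placeholder-model', 'scaler': 'placeholder-scaler'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ml_components.pkl')

    # Serialize the dictionary to disk
    with open(path, 'wb') as f:
        pickle.dump(components, f)

    # Load it back, as a consuming application would at startup
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(sorted(loaded.keys()))
```

Pickle files should only be loaded from trusted sources, and the consuming environment should use the same library versions that were recorded when the objects were saved.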

In [None]:
# Create requirements.txt file in export folder to describe the virtual environment used for the Machine Learning processes

!pip freeze > export/requirements.txt

In [None]:
# Zip the export folder and name the zipped export folder as 'export.zip'
# !tar -a -c -f export.zip export

# Delete the original export folder, leaving only the zipped export folder
# !rmdir /s /q export