# Step by step
This notebook walks through the code on the 'functions' page one step at a time and exposes the inner workings of each function, so that anyone who wants to understand or modify the code can follow along.

In [None]:
import pandas as pd
import shap
import re
from IPython.display import Image, display
import os
import json
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import OrdinalEncoder, RobustScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

## Preprocessing
The preprocessing step gets the data ready for the model to learn from. This includes making sure the data is in the right format and cleaning it up.

### Load data

In [None]:
# Import the Excel file with the patient data
data_path = "./testData/dummy_data.xlsx"

# Read the uploaded Excel file into a Pandas DataFrame
xls = pd.ExcelFile(data_path, engine="openpyxl")

sheet_names = ['Baseline', 'TEG Values', 'Events']  # Replace with your sheet names

# Access each sheet's data using the sheet name as the key
baseline_df = pd.read_excel(xls, sheet_names[0])
tegValues_df = pd.read_excel(xls, sheet_names[1])
events_df = pd.read_excel(xls, sheet_names[2])

In [None]:
baseline_df.head()

In [None]:
events_df.head()

In [None]:
tegValues_df.head()

### Merge tables
The data is currently split into three tables. To make it usable for the model, we need to combine all the important information into two tables: one with the baseline information and one with the TEG values.

All the events for every patient will be counted and added to a column called "Events" (count encoding).
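
A minimal sketch of count encoding followed by a left merge, using made-up toy tables (the column values below are assumptions, not study data). It shows why patients without any event end up with 0 rather than a missing value:

```python
import pandas as pd

# Hypothetical toy tables; the real sheets have many more columns.
baseline = pd.DataFrame({"Record ID": [1, 2, 3], "Age": [61, 74, 58]})
events = pd.DataFrame({"Record ID": [1, 1, 3]})

# Count encoding: number of event rows per patient.
event_counts = events.groupby("Record ID").size().reset_index(name="Events")

# A left merge keeps every baseline patient; patients without events get NaN.
merged = baseline.merge(event_counts, on="Record ID", how="left")

# Replace the NaN of event-free patients with 0.
merged["Events"] = merged["Events"].fillna(0).astype(int)

print(merged)
```

Patient 2 has no row in the toy events table, so the merge leaves a NaN that `fillna(0)` turns into a count of zero.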

Here's what the data looks like.

In [None]:
# Path to image
image_path = "./data/data_structure.png"
# Display the image
display(Image(filename=image_path, width=300, height=200))


In [None]:
# Count the number of events for each 'Record ID' in events_df
event_counts = events_df['Record ID'].value_counts().reset_index()
event_counts.columns = ['Record ID', 'Events']
event_counts

In [None]:
# Merge the event counts with the baseline and teg values
tegValues_df = tegValues_df.merge(event_counts, on='Record ID', how='left')
baseline_df = baseline_df.merge(event_counts, on='Record ID', how='left')

# Fill NaN values in the 'Events' column with 0 (patients with no recorded events)
tegValues_df['Events'] = tegValues_df['Events'].fillna(0)
baseline_df['Events'] = baseline_df['Events'].fillna(0)
tegValues_df.head()

In [None]:
baseline_df.head()

In [None]:
# Save in excel
excel_file = "./testData/merged_data.xlsx"

# Create an Excel writer object
with pd.ExcelWriter(excel_file, engine='xlsxwriter') as writer:
    # Write each DataFrame to a different Excel sheet
    tegValues_df.to_excel(writer, sheet_name='TEG values', index=False)
    baseline_df.to_excel(writer, sheet_name='Baseline', index=False)



### Data transformations
Each column is converted to the format that best fits the information it holds, which also removes any typos.

In [None]:
# Clean df in new copy
clean_TEG_df = tegValues_df.copy()
clean_baseline_df = baseline_df.copy()

#### Number
Baseline:
- Age
- BMI
- Clotting Disorder
- EGFR (mL/min/1.73m2)
- BP prior to blood draw
- ABI Right
- ABI Left
- Rutherford Score

TEG:
- TEG values
- Visit Timepoint


In [None]:
# Find teg values column
columns_to_exclude = ['Record ID', 'Visit Timepoint', 'Antiplatelet Therapy within 7 Days',
                      'Anticoagulation within 24 Hours', 'Statin within 24 Hours', 'Cilostazol within 7 days',
                      'BP prior to blood draw', 'Events']

tegValues = [col for col in tegValues_df.columns.values if col not in columns_to_exclude]
tegValues

In [None]:
number_columns_baseline = ["Age","BMI", "Clotting Disorder", "EGFR (mL/min/1.73m2)", "ABI Right", "ABI left", "Rutherford Score"]
number_columns_teg = ["Visit Timepoint", "BP prior to blood draw"]+tegValues

Inspect the column types to identify the kind of changes needed.

In [None]:
clean_TEG_df[number_columns_teg].dtypes


In [None]:
clean_baseline_df[number_columns_baseline].dtypes

Visualize the values 

In [None]:
clean_TEG_df[number_columns_teg].head()

In [None]:
clean_baseline_df[number_columns_baseline].head()

Of the columns visualized, Age, BMI and Clotting Disorder are already in the right format.

BP needs to be split into systolic and diastolic values and converted to integers.

EGFR is a mix of strings and floats. The string is ">60", which can be approximated by a number above the threshold, such as 65. All the other values are floats.
TEG values need to be converted to floats. Some TEG values store their maximum as ">n"; others say "inconclusive" or contain another string when data was not collected. Strings that cannot be resolved to a number will be marked as NaN.
The boundary conditions for both the TEG values and EGFR are saved in the "./data/data_boundaries.json" file.

Visit Timepoint is stored as strings and needs to be converted to days.

ABI left and right contain some strings that will be converted to NaN values.

Split BP into two columns (systolic and diastolic) based on "/" 

In [None]:
# Split the column into 'Systolic' and 'Diastolic' columns
clean_TEG_df[['BP_Systolic', 'BP_Diastolic']] = clean_TEG_df['BP prior to blood draw'].str.split('/', expand=True)

# Convert 'Systolic' and 'Diastolic' columns to integers
clean_TEG_df['BP_Systolic'] = pd.to_numeric(clean_TEG_df['BP_Systolic'], errors='coerce').astype('Int64')
clean_TEG_df['BP_Diastolic'] = pd.to_numeric(clean_TEG_df['BP_Diastolic'], errors='coerce').astype('Int64')

# Drop the first column 'BP prior to blood draw'
clean_TEG_df.drop(columns=['BP prior to blood draw'], inplace = True)
number_columns_teg.remove('BP prior to blood draw')
number_columns_teg.append('BP_Systolic')
number_columns_teg.append('BP_Diastolic')

clean_TEG_df[['BP_Systolic', 'BP_Diastolic']].dtypes


Clean EGFR and TEG data with boundary values and convert all to floats
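
The idea behind the boundary replacement can be sketched in plain Python. The rule dictionary below is a hypothetical stand-in for the contents of `data_boundaries.json` (its real contents are not shown here), and `clean_value` is an illustrative helper, not a function from the project:

```python
import math
import re

# Hypothetical rules in the spirit of data_boundaries.json:
# column name -> {text prefix: numeric replacement}.
boundaries = {"EGFR (mL/min/1.73m2)": {">60": 65}}


def clean_value(raw, rules):
    """Return a float for one raw cell, or NaN when it cannot be parsed."""
    text = re.sub(r"\s", "", str(raw))  # drop all whitespace
    for prefix, replacement in rules.items():
        # re.escape keeps characters such as '.' or '(' literal
        if re.match(re.escape(prefix), text):
            return float(replacement)
    try:
        return float(text)
    except ValueError:
        return math.nan  # e.g. "inconclusive"


rules = boundaries["EGFR (mL/min/1.73m2)"]
cleaned = [clean_value(v, rules) for v in ["> 60", 45.2, "inconclusive"]]
print(cleaned)
```

The sketch escapes each prefix before matching. The notebook cell builds its pattern as `f'^{name}'` without escaping, which behaves the same as long as the keys contain no regular-expression metacharacters.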

In [None]:
# Import boundary values

# Get the current working directory (base directory)
base_directory = os.getcwd()

# Define the filename
filename = 'data_boundaries.json'

# Create the full file path by joining the base directory and filename
file_path = os.path.join(base_directory, 'data', filename)

with open(file_path, 'r') as json_file:
    boundaries = json.load(json_file)
boundaries

In [None]:
# Replace all boundary values with their corresponding replacement values

# EGFR
egfr_column = 'EGFR (mL/min/1.73m2)'
# Default to an empty dict so the loop below is skipped if the key is missing
egfr_replacement = boundaries.pop(egfr_column, {})
# Remove spaces in the column
clean_baseline_df[egfr_column] = clean_baseline_df[egfr_column].replace(regex={r'\s': ''})

# Use a regular expression to match and replace values
for name, replacement in egfr_replacement.items():
    clean_baseline_df[egfr_column] = clean_baseline_df[egfr_column].replace({f'^{name}': replacement}, regex=True)

# Iterate over TEG DataFrame and apply boundaries
for column, replacement_dict in boundaries.items():
    
    # Remove spaces in the column
    clean_TEG_df[column] = clean_TEG_df[column].replace(regex={r'\s': ''})
    
    # Use a regular expression to match and replace values
    for name, replacement in replacement_dict.items():
        clean_TEG_df[column] = clean_TEG_df[column].replace({f'^{name}': replacement}, regex=True)

# Show changes    
clean_TEG_df[list(boundaries.keys())].head()

In [None]:
# Show changes
clean_baseline_df[egfr_column].head()

In [None]:
# Convert Rutherford Score to numeric (the TEG values are converted in the next cell)
clean_baseline_df["Rutherford Score"] = pd.to_numeric(clean_baseline_df["Rutherford Score"], errors='coerce')
clean_baseline_df["Rutherford Score"].dtypes


In [None]:
# Loop through the columns and convert to numeric
for column in tegValues:
    clean_TEG_df[column] = pd.to_numeric(clean_TEG_df[column], errors='coerce')

clean_TEG_df[tegValues].dtypes

In [None]:
# Show values to make sure strings were changed to NaN
clean_TEG_df[tegValues].head()

Change timepoints from strings to ints that represent days after the operation.

All the values are saved in ./data/timepoints.json
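
A small sketch of the reverse mapping, assuming the JSON file maps a number of days to the list of labels used for that visit. The labels below are invented for illustration:

```python
# Hypothetical contents in the shape the reverse mapping expects:
# days after the operation -> list of labels used for that visit.
timepoints = {
    "0": ["Baseline", "Pre-op"],
    "30": ["1 month", "1 Month Follow-up"],
}

# Invert to label -> days, so each label can be looked up directly.
label_to_days = {
    label: int(days)
    for days, labels in timepoints.items()
    for label in labels
}

visits = ["Pre-op", "1 month", "Unscheduled"]
mapped = [label_to_days.get(visit) for visit in visits]
print(mapped)
```

A label that is missing from the mapping has no match. With `Series.map` such a label becomes NaN, and a later `astype(int)` raises an error, so every label in the data needs an entry in the JSON file.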

In [None]:
# Define the filename
filename = 'timepoints.json'

# Create the full file path by joining the base directory and filename
file_path = os.path.join(base_directory, 'data', filename)


with open(file_path, 'r') as json_file:
    timepoints = json.load(json_file)
timepoints

In [None]:
# Create a reverse mapping dictionary
reverse_mapping = {v: k for k, values in timepoints.items() for v in values}

# Replace values using the reverse mapping
clean_TEG_df['Days from operation'] = clean_TEG_df['Visit Timepoint'].map(reverse_mapping)

# Convert the column to integer
clean_TEG_df['Days from operation'] = clean_TEG_df['Days from operation'].astype(int)

# Drop old column
clean_TEG_df.drop(columns=['Visit Timepoint'], inplace = True)
number_columns_teg.remove('Visit Timepoint')
number_columns_teg.append('Days from operation')


In [None]:
clean_TEG_df['Days from operation'].dtype

Convert ABI values to floats

In [None]:
clean_baseline_df['ABI Right'] = pd.to_numeric(clean_baseline_df['ABI Right'], errors='coerce')
clean_baseline_df['ABI left'] = pd.to_numeric(clean_baseline_df['ABI left'], errors='coerce')

clean_baseline_df[['ABI Right', 'ABI left']].dtypes

Review the cleaned numeric columns.

In [None]:
clean_baseline_df[number_columns_baseline].head()

In [None]:
clean_TEG_df[number_columns_teg].head()

#### Booleans
Baseline:
- Sex
- White
- Diabetes
- Hypertension
- Hyperlipidemia
- Coronary Artery Disease
- History of MI 
- Functional impairment
- Does Subject Currently have cancer?
- Past hx of cancer
- Hx of  DVT
- Hx of stroke
- Hx of pulmonary embolism:
- Does the patient have a history of solid organ transplant?
- Has subject had previous intervention of the index limb? 
- Previous occluded stents

TEG values:
- Cilostazol within 7 days

In [None]:
# Create the 'Is Male' column based on the 'Sex' column
clean_baseline_df['Is Male'] = (clean_baseline_df['Sex'] == 'Male').astype(bool)

# Drop the old 'Sex' column
clean_baseline_df.drop('Sex', axis=1, inplace=True)
clean_baseline_df['Is Male']

In [None]:
# Change following columns to booleans
columns_to_convert_baseline = ['White', 'Diabetes', 'Hypertension', 'Hyperlipidemia (choice=None)', 'Coronary Artery Disease', 'History of MI',
                      'Functional impairment', 'Does Subject Currently have cancer?', 'Past hx of cancer', 'Hx of  DVT', 'Hx of stroke',
                      'Hx of pulmonary embolism', 'Does the patient have a history of solid organ transplant?', 
                      'Has subject had previous intervention of the index limb?', 'Previous occluded stents',]
columns_to_convert_TEG =['Cilostazol within 7 days']

clean_baseline_df[columns_to_convert_baseline].head()

In [None]:
clean_TEG_df[columns_to_convert_TEG].head()

In [None]:
# Dictionary for replacement
# Keys are lowercase because the columns are lowercased before the replacement
replacement_dict = {'yes': True, 'no': False, '1': True, '0': False, 'cilostazol': True, 'nan': False}

In [None]:
# Fill NaN values with '0', which the replacement dictionary maps to False
clean_baseline_df[columns_to_convert_baseline] = clean_baseline_df[columns_to_convert_baseline].fillna('0')
clean_TEG_df[columns_to_convert_TEG] = clean_TEG_df[columns_to_convert_TEG].fillna('0')

# Put all columns in lowercase
clean_baseline_df[columns_to_convert_baseline] = clean_baseline_df[columns_to_convert_baseline].astype(str)
clean_baseline_df[columns_to_convert_baseline] = clean_baseline_df[columns_to_convert_baseline].apply(lambda x: x.str.lower())
clean_TEG_df[columns_to_convert_TEG] = clean_TEG_df[columns_to_convert_TEG].astype(str)
clean_TEG_df[columns_to_convert_TEG] = clean_TEG_df[columns_to_convert_TEG].apply(lambda x: x.str.lower())

# Use the replace method to replace values in multiple columns
clean_baseline_df[columns_to_convert_baseline] = clean_baseline_df[columns_to_convert_baseline].replace(replacement_dict).astype(bool)
clean_TEG_df[columns_to_convert_TEG] = clean_TEG_df[columns_to_convert_TEG].replace(replacement_dict).astype(bool)

clean_baseline_df[columns_to_convert_baseline].head()


In [None]:
clean_TEG_df[columns_to_convert_TEG].head()

#### Categorical ordinal
Baseline:
- Tobacco Use
- Renal Status

In [None]:
# Ordinal encoding map
category_orders = {
    'Tobacco Use (1 current 2 former, 3 none)': 
    ['None',
    'Past, quit >10 year ago',
    'quit 1 to 10 years ago', 
    'current within the last year ( < 1 pack a day)',
    'current within the last year (  > or = 1 pack a day)'],

    'Renal Status': 
    ['Normal', 
    'GFR 30 to 59', 
    'GFR 15 to 29', 
    'GFR<15 or patient is on dialysis',
    '1']
}

In [None]:
# Replace renal status values. Some of the values in the data set mean the same with different words
# Define a dictionary to map old values to new values
replace_dict = {'GFR 60 to 89': 'Normal', 'Evidence of renal dysfunction ( GFR >90)': 'Normal', '0': 'Normal', 0: 'Normal', 1: "1"}

clean_baseline_df['Renal Status'] = clean_baseline_df['Renal Status'].replace(replace_dict)

# Initialize the OrdinalEncoder with specified category orders
encoder = OrdinalEncoder(categories=[category_orders[column] for column in ['Tobacco Use (1 current 2 former, 3 none)', 'Renal Status']])

# Fit and transform the selected columns to encode ordinal values
clean_baseline_df[['Tobacco Use (1 current 2 former, 3 none)', 'Renal Status']] = encoder.fit_transform(clean_baseline_df[['Tobacco Use (1 current 2 former, 3 none)', 'Renal Status']])

# Rename column
clean_baseline_df = clean_baseline_df.rename(columns={'Tobacco Use (1 current 2 former, 3 none)': 'Tobacco Use'})

In [None]:
clean_baseline_df[['Tobacco Use', 'Renal Status']].head()

#### Categorical nominal
Baseline:
- Extremity
- Artery affected
- Intervention Classification
- Intervention Type

TEG values:
- Antiplatelet Therapy within 7 Days
- Anticoagulation within 24 Hours
- Statin within 24 Hours


In [None]:
columns_to_dummy_baseline = ['Extremity',
                    'Intervention Classification']
columns_to_dummy_TEG = ['Statin within 24 Hours']

In [None]:
# Dummy encoding of categorical values
clean_baseline_df = pd.get_dummies(clean_baseline_df, columns=columns_to_dummy_baseline,
                    prefix=columns_to_dummy_baseline)
clean_TEG_df = pd.get_dummies(clean_TEG_df, columns=columns_to_dummy_TEG,
                    prefix=columns_to_dummy_TEG)

In [None]:
# Drop unnecessary columns
clean_baseline_df = clean_baseline_df.drop(columns=['Extremity_left']) # Because it is either right, left or bilateral
clean_baseline_df = clean_baseline_df.drop(columns=['Intervention Classification_Endo']) # Either endo, open or combined

In [None]:
# Show columns
# Select the columns whose names start with one of the original column names (the dummy prefixes)
dummy_columns_baseline = [col for col in clean_baseline_df.columns if any(col.startswith(prefix) for prefix in columns_to_dummy_baseline)]
clean_baseline_df[dummy_columns_baseline].head()

In [None]:
dummy_columns_TEG = [col for col in clean_TEG_df.columns if any(col.startswith(prefix) for prefix in columns_to_dummy_TEG)]
clean_TEG_df[dummy_columns_TEG].head()

The _Artery affected_, _Intervention Type_, _Antiplatelet Therapy within 7 Days_, and _Anticoagulation within 24 Hours_ columns hold multiple values in a single string. They will be normalized before being encoded.
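
As an illustration of encoding a multi-valued string column, the sketch below uses `Series.str.get_dummies`. This is a different technique from the per-value `str.contains` loop in the cells that follow, and the artery names are made up:

```python
import pandas as pd

# Hypothetical values; the real column holds the study's artery names.
df = pd.DataFrame(
    {"Artery affected": ["SFA, Popliteal", "Popliteal", "SFA, Tibial"]}
)

# str.get_dummies splits each string on the separator and builds
# one 0/1 column per distinct item.
dummies = (
    df["Artery affected"]
    .str.get_dummies(sep=", ")
    .add_prefix("Artery affected_")
)

print(dummies)
```

Because the string is split first, each item is matched as a whole, whereas a substring test would also flag an item whose name merely contains another item's name.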

Artery affected

In [None]:
# Get all unique values
unique_arteries = set()
unique_antiplatelet = set()
unique_intervention = set()
unique_anticoagulation = set()

for index, row in clean_baseline_df.iterrows():
    arteries = row['Artery affected'].split(', ')
    unique_arteries.update(arteries)

    intervention = row['Intervention Type'].split(', ')
    unique_intervention.update(intervention)
    

for index, row in clean_TEG_df.iterrows():

    antiplatelet = row['Antiplatelet Therapy within 7 Days'].split(', ')
    unique_antiplatelet.update(antiplatelet)

    anticoagulation = row['Anticoagulation within 24 Hours'].split(', ')
    # Delete items in parenthesis ex: heparin (Calciparine) to be just heparin
    anticoagulation = {re.sub(r'\s*\([^)]*\)\s*', '', item) for item in anticoagulation} 
    unique_anticoagulation.update(anticoagulation)


print(unique_arteries)
print(unique_antiplatelet)
print(unique_intervention)
print(unique_anticoagulation)

In [None]:
# Dummy encode arteries affected
selected_arteries = []
for artery in unique_arteries:
    column_name = "Artery affected_"+artery
    # regex=False treats the name as literal text, so characters such as parentheses are safe
    clean_baseline_df[column_name] = clean_baseline_df['Artery affected'].str.contains(artery, regex=False).astype(int)
    selected_arteries.append(column_name)

selected_arteries.append('Artery affected')
clean_baseline_df[selected_arteries].head()

In [None]:
# Dummy encode antiplatelet therapy
selected_antiplatelet = []
for antiplatelet in unique_antiplatelet:
    column_name = "Antiplatelet therapy_"+antiplatelet
    # regex=False treats the name as literal text, so characters such as parentheses are safe
    clean_TEG_df[column_name] = clean_TEG_df['Antiplatelet Therapy within 7 Days'].str.contains(antiplatelet, regex=False).astype(int)
    selected_antiplatelet.append(column_name)

selected_antiplatelet.append('Antiplatelet Therapy within 7 Days')
clean_TEG_df[selected_antiplatelet].head()

In [None]:
# Dummy encode intervention types
selected_intervention = []
for intervention in unique_intervention:
    column_name = 'Intervention type_'+intervention
    clean_baseline_df[column_name] = clean_baseline_df['Intervention Type'].str.contains(intervention, regex=False).astype(int)
    selected_intervention.append(column_name)

selected_intervention.append('Intervention Type')
clean_baseline_df[selected_intervention].head()

In [None]:
# Dummy encode anticoagulation meds
selected_anticoagulation = []
for anticoagulation in unique_anticoagulation:
    column_name = "Anticoagulation_"+anticoagulation
    clean_TEG_df[column_name] = clean_TEG_df['Anticoagulation within 24 Hours'].str.contains(anticoagulation, regex=False).astype(int)
    selected_anticoagulation.append(column_name)

selected_anticoagulation.append('Anticoagulation within 24 Hours')
clean_TEG_df[selected_anticoagulation].head()

In [None]:
# Drop old columns
clean_baseline_df.drop(columns=['Artery affected','Intervention Type'], inplace=True)
clean_TEG_df.drop(columns=['Antiplatelet Therapy within 7 Days', 'Anticoagulation within 24 Hours'], inplace=True)

In [None]:
# Save in excel
excel_file = "./testData/clean_data.xlsx"

# Create an Excel writer object
with pd.ExcelWriter(excel_file, engine='xlsxwriter') as writer:
    # Write each DataFrame to a different Excel sheet
    clean_TEG_df.to_excel(writer, sheet_name='TEG values', index=False)
    clean_baseline_df.to_excel(writer, sheet_name='Baseline', index=False)


## Extend data
Create columns with the rate of change of the TEG values.
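
A toy sketch of the per-patient rate of change, with invented measurements; "MA" stands in for any TEG value and the shortened column names are assumptions made for readability:

```python
import pandas as pd

# Hypothetical measurements for two patients.
df = pd.DataFrame({
    "Record ID": [1, 1, 1, 2, 2],
    "Days from operation": [0, 30, 90, 0, 30],
    "MA": [60.0, 66.0, 63.0, 55.0, 58.0],
})

df = df.sort_values(["Record ID", "Days from operation"])
grouped = df.groupby("Record ID")

# diff() restarts for each patient, so a first visit has no previous value (NaN).
df["Days Diff"] = grouped["Days from operation"].diff()
df["MA_difference"] = grouped["MA"].diff()

# Rate of change: change in the TEG value per day since the previous visit.
df["MA_rate"] = df["MA_difference"] / df["Days Diff"]

print(df)
```

The first visit of each patient has no earlier measurement, which is why the notebook needs a strategy (a backward fill) for the resulting NaN values.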

In [None]:
# User selects to extend data
user_extend_data = False

In [None]:
# Columns
tegValues

In [None]:
extended_df = clean_TEG_df.copy()

In [None]:
if user_extend_data:    
    # Sort the DataFrame by "Record ID" and "Days from operation"
    extended_df= extended_df.sort_values(by=["Record ID", "Days from operation"])
    extended_df[["Record ID", "Days from operation"]]

In [None]:
if user_extend_data:
    # Group by 'Record ID'
    grouped = extended_df.groupby('Record ID')

    #Calculate the difference in 'Days from operation'
    extended_df['Days Diff'] = grouped['Days from operation'].diff()

    # Replace 0s to avoid infinity
    extended_df["Days Diff"] = extended_df["Days Diff"].replace(0, 1)

    extended_df[["Record ID", "Days from operation", "Days Diff"]]


In [None]:
if user_extend_data:
    new_columns = []
    # Iterate TEG values
    for value in tegValues:

        # Get column names
        diff_column_name = f"{value}_difference_since_last_timepoint"
        rate_column_name = f"{value}_rate_since_last_timepoint"
        new_columns.append(diff_column_name)
        new_columns.append(rate_column_name)


        # Calculate the difference in TEG values
        extended_df[diff_column_name] = grouped[value].diff()

        # Divide  by the differences in 'Days from operation'
        extended_df[rate_column_name] = extended_df[diff_column_name] / extended_df['Days Diff']

    # Fill the first value with the next one to avoid NaN
    extended_df.bfill(inplace=True)

In [None]:
if user_extend_data:
    extended_df[new_columns]


In [None]:
if user_extend_data:
    # Drop column with diff in dates
    extended_df.drop(columns=["Days Diff"], inplace = True)

In [None]:
if user_extend_data:
    # Save in excel
    excel_file = "./testData/extended_data.xlsx"

    # Create an Excel writer object
    with pd.ExcelWriter(excel_file, engine='xlsxwriter') as writer:
        # Write each DataFrame to a different Excel sheet
        extended_df.to_excel(writer, sheet_name='TEG values', index=False)
        clean_baseline_df.to_excel(writer, sheet_name='Baseline', index=False)

## Data visualization
The goal of this section is to create the graphs that will be shown to the user to describe the general demographics of the data.
Some of the values are calculated from the total number of patients in the baseline information, and some are calculated from the TEG values.

Baseline summary:
- Age
- Gender
- Ethnicity
- BMI

TEG values:
- Number of events
- Total number of data points

In [None]:
fig_df = clean_baseline_df.copy()

In [None]:
# Define custom colors
male_colors = ['#d9ed92', '#99d98c'] 
white_colors = ['#184e77', '#1a759f'] 
events_colors = '#1a759f'
age_histogram_color = '#52b69a' 
bmi_histogram_color = '#1e6091'

In [None]:
# Count binary values in the "Male" column
male_counts = fig_df['Is Male'].value_counts()
male_labels = ['Male' if male_counts.index[0] else 'Female', 'Male' if not male_counts.index[0] else 'Female']
# Create a pie chart for "Male" with custom colors
sex_pie = go.Pie(labels=male_labels, values=male_counts, marker=dict(colors=male_colors))

# Visualize
data = [sex_pie]
fig = go.Figure(data = data)
fig.update_layout(width=300, height=300)
display(fig)

In [None]:
# Count binary values in the "White" column
white_counts = fig_df['White'].value_counts()
white_labels = ['White' if white_counts.index[0] else 'Non-White', 'White' if not white_counts.index[0] else 'Non-White']

# Create a pie chart for "White" with custom colors
white_pie = go.Pie(labels=white_labels, values=white_counts, marker=dict(colors=white_colors))

# Visualize
data = [white_pie]
fig = go.Figure(data = data)
fig.update_layout(width=300, height=300)
display(fig)

In [None]:
# BMI histogram
bmi_hist =  go.Histogram(x=fig_df["BMI"], name="BMI", marker=dict(color=bmi_histogram_color))

# Visualize
data = [bmi_hist]
fig = go.Figure(data = data)
fig.update_layout(width=300, height=300)
display(fig)

In [None]:
# Age histogram
age_hist=  go.Histogram(x=fig_df["Age"], name="Age", marker=dict(color=age_histogram_color))

# Visualize
data = [age_hist]
fig = go.Figure(data = data)
fig.update_layout(width=300, height=300)
display(fig)

The following metrics are based on the total number of TEG test values.

In [None]:
# Copy TEG df to find metrics
fig_df = clean_TEG_df.copy()

In [None]:
# Events histogram 
events_hist =  go.Histogram(x=fig_df["Events"], name="Events", marker=dict(color=events_colors))

# Visualize
data = [events_hist]
fig = go.Figure(data = data)
fig.update_layout(width=300, height=300)
display(fig)

In [None]:
# Create a summary table
unique_patients = fig_df['Record ID'].nunique()
total_data_points = len(fig_df)

data_summary = pd.DataFrame({
    'Category': ['Unique Patients', 'Total Data Points'],
    'Count': [unique_patients, total_data_points]
})

patients_table = go.Table(
    header=dict(values=["Category", "Count"]),
    cells=dict(values=[data_summary['Category'], data_summary['Count']])
)

# Visualize
data = [patients_table]
fig = go.Figure(data = data)
fig.update_layout(width=300, height=300)
display(fig)

In [None]:
# Create subplots
fig = make_subplots(rows=2, cols=3,
                    specs=[[{'type':'domain'}, {'type':'domain'},{'type':'xy'}],
                            [{'type':'xy'}, {'type':'xy'},{'type':'domain'}]],
                    subplot_titles=['Gender Distribution', 'Ethnicity Distribution', 'Thrombotic event', 'BMI',
                                    'Age', 'Data Summary'])

fig.add_trace(sex_pie, row=1, col=1)
fig.add_trace(white_pie, row=1, col=2)
fig.add_trace(events_hist, row=1, col=3)
fig.add_trace(bmi_hist, row=2, col=1)
fig.add_trace(age_hist, row=2, col=2)
fig.add_trace(patients_table, row=2, col=3)

fig.update_layout(width=900, height=600)
display(fig)

## Train model function
Three models will be trained, so the training logic is wrapped in a function that can be reused.

In [None]:
def train_model(df, target_column, drop_columns):
    """
    Trains an XGBoost regression model on the given DataFrame using grid search for hyperparameter tuning.

    Parameters:
    - df (pd.DataFrame): Input DataFrame containing the features and target variable.
    - target_column (str): The name of the target variable column.
    - drop_columns (list): List of column names to be dropped from the feature set.

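    Returns:
    - best_pipeline (Pipeline): The best-performing pipeline after hyperparameter tuning.
    - X_train (pd.DataFrame): The training features, reused later to compute SHAP values.
    - score (dict): MSE and R2 on the test and training sets.

    Example:
    best_model, X_train, score = train_model(df=my_dataframe, target_column='target', drop_columns=['column1', 'column2'])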
    """

    # Separate features (X) and target (y)
    y = df[target_column]
    X = df.drop(labels=drop_columns + [target_column], axis=1)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create the scalers used as pipeline steps
    feature_scaler = RobustScaler()
    target_scaler = MinMaxScaler()

    # Create a pipeline
    # Note: every intermediate pipeline step transforms X, so 'target_scaler'
    # rescales the features a second time; it does not scale the target y.
    pipeline = Pipeline([
        ('feature_scaler', feature_scaler),  # Robust scaling for features
        ('target_scaler', target_scaler),    # Min-Max scaling, also applied to the features
        ('xgb_regressor', XGBRegressor())    # XGBoost regressor
    ])

    # Define hyperparameter grid for tuning (adjust as needed)
    param_grid = {
        'xgb_regressor__max_depth': [3, 4, 5],
        'xgb_regressor__gamma': [0, 0.1, 0.2],
        'xgb_regressor__min_child_weight': [1, 2, 5]
    }

    # Initialize K-Fold cross-validation
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    # Initialize GridSearchCV for hyperparameter tuning
    grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                               scoring='r2', cv=kf)

    # Fit the model and perform hyperparameter tuning
    grid_search.fit(X_train, y_train)

    
    # Access the best pipeline
    best_pipeline = grid_search.best_estimator_

    # Make predictions on the test data
    y_pred = best_pipeline.predict(X_test)  
    # Evaluate the model using Mean Squared Error
    mse_test = mean_squared_error(y_test, y_pred)
    # Calculate R-squared (R2) score
    r2_test = r2_score(y_test, y_pred)

    # Make predictions on the train data
    y_pred = best_pipeline.predict(X_train)  
    # Evaluate the model using Mean Squared Error
    mse_train = mean_squared_error(y_train, y_pred)
    # Calculate R-squared (R2) score
    r2_train = r2_score(y_train, y_pred)
    
    score = {"mse test":mse_test, "r2 test": r2_test, "mse train": mse_train, "r2 train": r2_train}

    return best_pipeline, X_train, score
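
In a scikit-learn `Pipeline`, every intermediate step transforms the features, so a scaler placed there never touches the target. If scaling the target is the goal, one option is `TransformedTargetRegressor`. The sketch below is an illustration on synthetic data, not the notebook's training code, and `LinearRegression` stands in for `XGBRegressor` to keep it light:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 10 + 5 * X[:, 0] + rng.normal(scale=0.1, size=50)

features_then_model = Pipeline([
    ("feature_scaler", RobustScaler()),  # applied to X only
    ("regressor", LinearRegression()),   # stand-in for XGBRegressor
])

# The wrapper scales y before fitting and inverts the scaling on predict,
# so predictions come back in the original units of y.
model = TransformedTargetRegressor(
    regressor=features_then_model,
    transformer=MinMaxScaler(),
)
model.fit(X, y)
pred = model.predict(X)

print(pred[:3], y[:3])
```

With this wrapper, the fitted inner pipeline is available as `model.regressor_`, and grid-search parameter names gain a `regressor__` prefix, so the rest of the code would need matching adjustments.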


## Shapley value function
Shapley values are used to determine the most important features of each model. This is done several times, so a function is created.

In [None]:
def feature_importance(best_pipeline, X):
    """
    Generate SHAP (SHapley Additive exPlanations) values and a summary plot for feature importance.

    Parameters:
    - best_pipeline (Pipeline): The best-performing pipeline after hyperparameter tuning. It should have an XGBoost regressor named 'xgb_regressor'.
    - X (pd.DataFrame): Data to be tested, containing features for which SHAP values will be computed.

    Returns:
    - importance_df (pd.DataFrame): DataFrame containing feature names and their importance values.
    - shap_values (numpy.ndarray): SHAP values for the provided data.

    Example:
    importance_df, shap_values = feature_importance(best_pipeline=my_best_pipeline, X=my_test_data)
    
    Note:
    The SHAP (SHapley Additive exPlanations) values provide insights into the contribution of each feature to model predictions. The summary plot and importance DataFrame help identify the most influential features.

    Dependencies:
    - Ensure the 'shap' library is installed. You can install it using 'pip install shap'.

    Usage:
    - For the best results, pass the best-performing pipeline obtained after hyperparameter tuning. The pipeline should include an XGBoost regressor with the name 'xgb_regressor'.

    """
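    # Create a SHAP explainer for the XGBoost model
    explainer = shap.Explainer(best_pipeline.named_steps['xgb_regressor'])

    # The regressor was fitted on the output of the scaling steps, so apply the
    # same (already fitted) steps to X before computing SHAP values
    X_scaled = best_pipeline[:-1].transform(X)

    # Generate SHAP values
    shap_values = explainer.shap_values(X_scaled)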

    # Calculate feature importance using the absolute mean of SHAP values
    feature_importance = np.abs(shap_values).mean(axis=0)

    # Create a DataFrame to associate feature names with their importance values
    importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importance})

    # Sort the DataFrame by importance in descending order to find the most important features
    importance_df = importance_df.sort_values(by='Importance', ascending=False)

    return importance_df, shap_values
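
The importance used in the function above is the mean absolute SHAP value per feature. A toy example with made-up numbers suggests why the absolute value is taken before averaging:

```python
import numpy as np

# Hypothetical SHAP matrix: 3 samples (rows) x 2 features (columns).
shap_values = np.array([
    [0.5, 0.1],
    [-0.5, 0.1],
    [0.0, 0.1],
])

# Signed mean: positive and negative contributions of feature 0 cancel out.
signed_mean = shap_values.mean(axis=0)

# Mean absolute value: measures how strongly each feature moves predictions.
mean_abs = np.abs(shap_values).mean(axis=0)

print(signed_mean)
print(mean_abs)
```

Feature 0 pushes predictions up for one sample and down for another, so its signed mean is zero even though its influence is larger than that of feature 1.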

## Baseline Model
The first model will be used to determine a patient's risk based on their baseline information.

### Create model

In [None]:
best_model_baseline, baseline_train, baseline_score = train_model(clean_baseline_df, 'Events', ['Record ID'])

In [None]:
baseline_score

### Feature importance
This could serve as general background information on which baseline features matter most.

In [None]:
importance_df_baseline, shap_values_baseline  = feature_importance(best_model_baseline, baseline_train)

In [None]:
# Plot SHAP summary plot
shap.summary_plot(shap_values_baseline, baseline_train, plot_type="bar", show= False)

## TEG model 1
The first TEG model will be used to determine the feature importance, so the user can then select parameters of interest.

### Create model

In [None]:
best_model_TEG1, TEG1_train, TEG1_score = train_model(extended_df, 'Events', ['Record ID'])

In [None]:
TEG1_score

### Feature importance

In [None]:
importance_df_TEG1, shap_values_TEG1 = feature_importance(best_model_TEG1, TEG1_train)

In [None]:
# Plot SHAP summary plot
shap.summary_plot(shap_values_TEG1, TEG1_train, plot_type="bar", show= False)

## User interface
Here, the user selects the features they want to test in the next iteration of the model, based on the results from the first model.

Streamlit can read strings, so for the sake of this notebook the Streamlit outputs are replaced by printed strings.

In [None]:
user_TEG_df = extended_df.copy()
user_TEG_df.head()

In [None]:
# Keep only the most important values from teg. No need for extra created ones
if user_extend_data:
    columns_to_keep = dict.fromkeys(user_TEG_df.columns.difference(tegValues + new_columns), None)
else:
    columns_to_keep = dict.fromkeys(user_TEG_df.columns.difference(tegValues), None)

# Iterate through prefixes and select the most important column for each
for prefix in tegValues:
    # Filter the importance_df_TEG1 for the current prefix
    prefix_columns = importance_df_TEG1[importance_df_TEG1['Feature'].str.startswith(prefix)]

    if not prefix_columns.empty:
        # Find the column with the maximum importance for the current prefix
        max_importance_row = prefix_columns.loc[prefix_columns['Importance'].idxmax()]

        # Check if the maximum importance value is greater than 0
        if max_importance_row['Importance'] > 0:
            max_importance_column = max_importance_row['Feature']
            columns_to_keep[max_importance_column] =max_importance_row['Importance']

        else:
            columns_to_keep[prefix] = 0

columns_to_keep


In [None]:
# Keep only non repeated values
user_TEG_df = user_TEG_df[columns_to_keep.keys()]
user_TEG_df.head()

In [None]:
# Upload collinear TEG values

# Define the filename
filename = 'TEG_collinear.json'

# Create the full file path by joining the base directory and filename
file_path = os.path.join(base_directory, 'data', filename)

with open(file_path, 'r') as json_file:
    collinearity = json.load(json_file)
collinearity

In [None]:
# Create empty dictionary to hold selection
selected_features = {}

# Use the dictionary with columns to keep to show user their options
for group_name , elements in collinearity.items():

    print(group_name) #with st.expander(f"{group_name}"):

    # Filter keys based on prefixes
    filtered_keys = [key for key in columns_to_keep.keys() if any(key.startswith(prefix) for prefix in elements)]

    # Create a list of strings by appending keys with values multiplied by 100
    radio_labels = [f"{key} ({round(columns_to_keep[key] * 100, 2)}%)" for key in filtered_keys]

    # Create a radio button to select a feature from the group
    print(radio_labels) #selected_feature = st.radio("", radio_labels, key=group_name)
    selected_feature = radio_labels[0]  # Stand-in for the user's choice: take the first option

    # Store the selected feature in the dictionary, keyed by group name
    selected_features[group_name] = selected_feature

selected_features

## TEG model 2
After the user selects non-correlated parameters, the model is retrained without the values that were not selected.
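
The prefix logic used to decide which columns to drop can be sketched with plain Python. The group names, TEG prefixes, labels and column names below are hypothetical; the real ones come from `TEG_collinear.json` and the user's selection:

```python
# Hypothetical collinearity groups and user choices.
collinearity = {"Clot strength": ["MA", "G"], "Clot kinetics": ["K", "Angle"]}
selected_features = {
    "Clot strength": "MA (12.5%)",
    "Clot kinetics": "Angle_rate (3.1%)",
}
columns = ["MA", "G", "K_rate", "Angle_rate", "Days from operation"]

all_prefixes = [p for group in collinearity.values() for p in group]

# A prefix is kept when any selected label starts with it.
keep = {
    p
    for label in selected_features.values()
    for p in all_prefixes
    if label.startswith(p)
}
drop = set(all_prefixes) - keep

# Every column that starts with a dropped prefix is removed.
columns_to_drop = [c for c in columns if any(c.startswith(p) for p in drop)]

print(sorted(drop), columns_to_drop)
```

Matching with `startswith` means that a short prefix such as "K" also matches any other column whose name begins with that letter, so prefixes need to be distinctive enough for the column names in use.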

In [None]:
# Extract all values from selected_features and collinearity
selected_features_values = list(selected_features.values())
collinearity_values = [item for sublist in collinearity.values() for item in sublist]

# Find prefixes to drop
prefix_to_keep = [prefix for selection in selected_features_values for prefix in collinearity_values if selection.startswith(prefix)]
prefix_to_drop = list(set(collinearity_values) - set(prefix_to_keep))

print(prefix_to_keep)
prefix_to_drop

In [None]:
# Find list of columns to drop
columns_to_drop = [column for column in columns_to_keep.keys() if any(column.startswith(prefix) for prefix in prefix_to_drop)]

columns_to_drop

In [None]:
model2_df = user_TEG_df.copy()
model2_df.drop(columns=columns_to_drop, inplace=True)
model2_df.head()

### Make model

In [None]:
best_model_TEG2, TEG2_train, TEG2_score = train_model(model2_df, 'Events', ['Record ID'])

In [None]:
TEG2_score

### Feature importance

In [None]:
importance_df_TEG2, shap_values_TEG2 = feature_importance(best_model_TEG2, TEG2_train)

In [None]:
# Plot SHAP summary plot
shap.summary_plot(shap_values_TEG2, TEG2_train, plot_type="bar", show= False)

In [None]:
raise SystemExit("Stop here")  # Halts "Run All" before the export cell below

In [None]:
!jupyter nbconvert --to script steps.ipynb