# Assignment 2 - Predictive Process Monitoring

*Due: Friday, 15 December, 2023 at 14:00 CET*

In this assignment, you will learn to use several regression models to predict the remaining time of ongoing cases. You will also show that you can evaluate their performance and discuss the results in a report. The learning objectives of this assignment are:

- use data aggregation, feature encoding, and data transformation techniques to preprocess event data
- use regression models to predict the remaining time of ongoing cases
- perform cross validation and fine-tune the model parameters of each algorithm
- calculate model performance (e.g., MAE, MSE, RMSE, R^2)
- design experiments to compare the performance of algorithms
- reflect on the differences between the models


This assignment includes two algorithms: Regression Tree (or Random Forest Regression) and kNN regressor. Following a structure similar to the first assignment, your first task is to perform data exploration and data cleaning. 
In Task 2, you will implement two trace encoding techniques (covered during Lecture 07). 
In Tasks 3 and 4, you will use the two algorithms to learn regression models to forecast the remaining time of each case after each event. 
In Task 5, you will compare the algorithms and evaluate their results. 

Please note that Tasks 3 and 4 have the following structure:
1. First, find the library (e.g., sklearn examples) and try out the algorithm by simply training the model on the training data (do not consider any parameters or cross validation just yet); 
2. Train the model with the training data by using cross validation and find the best parameter setting for the parameters of interest;
3. Report the average MAE, MSE, RMSE, and R^2 of all validation sets;
4. Finally, test the optimal model that has the best fitting parameters on your held-out test data, and report its MAE, MSE, RMSE, and R^2. 

Note that, in Task 5, you will need all the MAE, MSE, RMSE, and R^2 values calculated in the previous tasks for both encodings of the data. Make sure you save these to a list or dictionary so you can easily evaluate and compare the results. 
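One possible way to keep the scores together is a dictionary keyed by encoding and model. The sketch below is only an illustration: the helper `store_scores`, the key names, and the numbers are assumptions made up for this example and are not part of the assignment template.

```python
# Hypothetical container: results[(encoding, model)][split] -> {metric: value}
results = {}

def store_scores(results, encoding, model, split, mae, mse, rmse, r2):
    """Save one set of error measures under (encoding, model) and split ('cv' or 'test')."""
    entry = results.setdefault((encoding, model), {})
    entry[split] = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
    return results

# Placeholder values for illustration only; they are not real results
store_scores(results, "last-2-state", "tree", "cv", mae=1.0, mse=1.5, rmse=1.5 ** 0.5, r2=0.5)
store_scores(results, "last-2-state", "tree", "test", mae=1.2, mse=2.0, rmse=2.0 ** 0.5, r2=0.4)

# Look up a single score later, e.g. for the overview in Task 5
print(results[("last-2-state", "tree")]["test"]["MAE"])
```

Keeping the cross-validation and test scores side by side under the same key makes it straightforward to fill in one row of the overview per encoding and model.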



## Task 1: Exploring the data set



### Data set: Sepsis

Import the file *sepsis.csv* to load the Sepsis data set. This real-life event log contains events of sepsis cases from a hospital. Sepsis is a life-threatening condition typically caused by an infection. One case represents a patient's pathway through the treatment process. The events were recorded by the ERP (Enterprise Resource Planning) system of the hospital. The original data set contains about 1000 cases with a total of 15,000 events that were recorded for 16 different activities. Moreover, 39 data attributes are recorded, e.g., the group responsible for the activity, the results of tests, and information from checklists. 

Additional information about the data can be found at:
- https://data.4tu.nl/articles/dataset/Sepsis_Cases_-_Event_Log/12707639
- http://ceur-ws.org/Vol-1859/bpmds-08-paper.pdf




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data_sepsis = pd.read_csv("./sepsis.csv", sep=";")

In [None]:
data_sepsis.info()

In [None]:
data_sepsis.describe()

In [None]:
data_sepsis.head()

In [None]:
len(data_sepsis['Case ID'].unique())


### 1.1 Exploratory data analysis

For the data set, create 2-3 figures and tables that help you understand the data.

**Use the column "remtime" (which indicates the remaining time of each case after each corresponding event) as the response variable for regression**

Note that some of these variables are categorical variables. How would you preprocess these variables?


#### Tips: ---------------

During the data exploration, you, as a team, are trying to get an impression of the data. You will create figures and/or tables that help you to get to know the data. While exploring the data, you may also consider answering the following questions, which may help you understand the data better:

- How many variables are in the data? What are the data type and the distribution of each variable? 
- What is the distribution of the response variable?
- Are the variables informative?
- Are any pair of the potential predictor variables highly correlated?
- (Should the variables be normalized or not?)
- (Any relevant, useful preprocessing steps that may be taken?)



Make sure to at least check the data type of each variable and to understand the distribution of each variable, especially the response variable. 

Try to find out what factors seem to determine whether an instance is an outlier or not. What do you conclude?

*For creating data visualizations, you may consider using the matplotlib library and visit the [matplotlib gallery](https://matplotlib.org/stable/gallery/index.html) for inspiration (e.g., histograms for distribution, or heatmaps for feature correlation).*
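Before plotting, the checks listed above can be run with plain pandas calls. The following is a minimal sketch on a tiny made-up frame; the column names mirror the Sepsis log, but the values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame standing in for the event log (values are invented)
df = pd.DataFrame({
    "Activity": ["A", "B", "A", "C", "B", "A"],
    "elapsed": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "remtime": [10.0, 8.0, 6.0, 4.0, 2.0, 0.0],
})

# Data type of every variable
print(df.dtypes)

# Summary statistics of the numeric variables, including the response
print(df.describe())

# Frequency of each level of a categorical variable
activity_counts = df["Activity"].value_counts()
print(activity_counts)

# Pairwise Pearson correlation between the numeric variables
corr = df[["elapsed", "remtime"]].corr()
print(corr)
```

In this toy frame `remtime` decreases linearly with `elapsed`, so the two columns are perfectly negatively correlated; on the real log the correlation matrix is a natural input for a heatmap.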

In [None]:
# import packages
import matplotlib.pyplot as plt


# TODO: plot figure(s)

# Get mean remtime per activity
remtime_per_activity = data_sepsis.groupby("Activity", as_index = False)["remtime"].mean()
sorted_remtime_per_activity = remtime_per_activity.sort_values("remtime", ascending = False)

plt.figure(figsize = (10,5))
plt.bar(sorted_remtime_per_activity["Activity"], sorted_remtime_per_activity["remtime"], align = "center", width = 0.6)
plt.title("Average remtime per last activity")
plt.xlabel("Activity")
plt.ylabel("Remaining time")
plt.show()

In [None]:
def plot_fig(df, x, y, title, xlabel, ylabel):
    plt.bar(df[x], df[y])
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()

    
for col in ["Hypotensie", "Oligurie", "Hypoxie", "Infusion", "DisfuncOrg"]:
    data_without_missing = data_sepsis.drop(data_sepsis[data_sepsis[col] == "missing"].index)
    
    data_group = pd.DataFrame(data_without_missing.groupby(col, as_index = False)["remtime"].mean())
    title = f"Remtime difference in {col}"
    xlabel = col
    ylabel = "remtime"
    plot_fig(df=data_group, x=col, y="remtime", title=title, xlabel=xlabel, ylabel=ylabel)

### 1.2 Data cleaning

You have now gathered some information about the data during the data exploration task. You also know from the assignment description that you will be using regression trees and kNN regression models to predict the remaining time.

Based on the above information, decide on which cleaning steps you will need to perform and implement them accordingly.
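As one illustration of how the choice of model can drive a cleaning decision: kNN relies on distances, so a feature on a large scale can dominate unless the features are rescaled, whereas tree splits are not affected by such rescaling. The sketch below uses synthetic data (an assumption, not the Sepsis log) and a scikit-learn pipeline so that the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: the first feature is informative, the second is large-scale noise
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 5 * X[:, 0]

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Without scaling, the noisy large-scale feature dominates the distance
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# With scaling inside a pipeline, the scaler only ever sees the training rows
knn_scaled = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_train, y_train)

mae_raw = np.mean(np.abs(knn_raw.predict(X_test) - y_test))
mae_scaled = np.mean(np.abs(knn_scaled.predict(X_test) - y_test))
print(mae_raw, mae_scaled)
```

On this synthetic data the scaled pipeline should reach a clearly lower MAE; whether scaling helps on the real log is something to verify in your own experiments.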


In [None]:
data_sepsis = pd.read_csv("./sepsis.csv", sep=";")

In [None]:
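def object_to_bool(df: pd.DataFrame) -> pd.DataFrame:
    # Map the string flags to 1/0 and mark 'missing' as NaN; returns a new frame
    return df.replace({'True': 1, 'False': 0, 'missing': np.nan})

In [None]:
# Apply the mapping to all columns at once; an in-place replace on a selected column may not write back to the frame
data_sepsis = object_to_bool(data_sepsis)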

In [None]:
data_sepsis['Complete Timestamp'] = pd.to_datetime(data_sepsis['Complete Timestamp'])

In [None]:
df_sorted = data_sepsis.sort_values(by="Complete Timestamp")
df_sorted.reset_index(inplace=True, drop=True)

In [None]:
df_sorted.groupby(by="Case ID")

In [None]:
pd.set_option("display.max_columns", None)

In [None]:
df_sorted.head()

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder

In [None]:
scaler = MinMaxScaler()
encoder = OrdinalEncoder()

In [None]:
df_analysis = df_sorted.copy()

In [None]:
# dtype=int yields 0/1 indicator columns directly, so no True/False replacement is needed
df_encoded = pd.get_dummies(df_analysis[['Diagnose', 'org:group']], prefix=['Diagnose=', 'org:group='], dtype=int)
df_analysis = pd.concat([df_analysis, df_encoded], axis=1)

In [None]:
df_analysis['Complete Timestamp'] = encoder.fit_transform(df_analysis['Complete Timestamp'].values.reshape(-1, 1))

In [None]:
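# Min-max scale the numeric columns so that no single feature dominates the kNN distances
# Note: 'remtime' (the response) is scaled as well, so all reported errors are in scaled units
scale_cols = ['CRP', 'LacticAcid', 'Leucocytes', 'duration', 'month', 'weekday', 'hour', 'remtime', 'elapsed', 'Complete Timestamp', "Age"]
df_analysis[scale_cols] = scaler.fit_transform(df_analysis[scale_cols])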

In [None]:
df_analysis.drop(columns=['Diagnose', 'org:group'], inplace=True)

In [None]:
df_analysis.info()

In [None]:
df_analysis.describe()

In [None]:
df_analysis.isna().sum()

In [None]:
# Activity to be encoded in task 2
# Case ID is not dropped so it can be used for visualizations and grouping. It is dropped before handing data to models

### 1.3 Process Discovery and Visualization (Optional)

This is an optional task to show you how process discovery and visualization can be applied using the pm4py library. 

(*The following code requires the graphviz library to be installed. If you have issues installing graphviz, you may try following the Install GraphViz instructions on the [pm4py](https://pm4py.fit.fraunhofer.de/install-page) install page.*)

The following code:
- fills in the columns for case id, activity, and timestamps
- converts the data set into an event log
- discovers a Directly-follows graph (DFG) and a process model for each event log

You may use the discovered process model in your report.



## Task 2: Preprocessing and Trace Encoding


### 2.1 Trace Encoding


- Implement the last-2-state encoding for the data set 
- Implement the aggregated encoding for the data set (for example, see [1], Table 6)


<span style="color:gray">[1] Ilya Verenich, Marlon Dumas, Marcello La Rosa, Fabrizio Maria Maggi, Irene Teinemaa:
Survey and Cross-benchmark Comparison of Remaining Time Prediction Methods in Business Process Monitoring. ACM Trans. Intell. Syst. Technol. 10(4): 34:1-34:34 (2019) [Section 1, 2, 4.1, 4.3, 4.6, 5.2, 5.3, 5.4, and 6] </span>

These two encodings are discussed during Lecture 07.
If you find it difficult to implement the algorithms, you may also consider using the following pandas functions:
- for the last-2-state encoding, check pandas groupby.DataFrameGroupBy.shift and see the [answer on Stack Overflow](https://stackoverflow.com/questions/53335567/use-pandas-shift-within-a-group)
- for the aggregated encoding, check pandas groupby.DataFrameGroupBy and the cumsum function, and read the [examples and answers on Stack Overflow](https://stackoverflow.com/a/49578219)
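The two pandas hints can be tried out on a toy log first. The sketch below is a minimal illustration on a made-up log with two interleaved cases; it shows the grouped `shift` and grouped `cumsum` ideas only and is not a complete implementation of either encoding.

```python
import pandas as pd

# Made-up event log, already sorted by timestamp; the two cases are interleaved
log = pd.DataFrame({
    "Case ID": ["c1", "c2", "c1", "c2", "c1"],
    "Activity": ["A", "A", "B", "C", "B"],
})

# Last-2-state idea: previous activity within the same case (NaN for a case's first event)
log["prev_activity"] = log.groupby("Case ID")["Activity"].shift(1)

# Aggregated idea: running count of each activity within the same case
counts = pd.get_dummies(log["Activity"], prefix="act", dtype=int)
counts = counts.groupby(log["Case ID"]).cumsum()

encoded = pd.concat([log, counts], axis=1)
print(encoded)
```

Because both operations are grouped by case, the third row (case `c1`) sees `A` as its previous activity even though the row directly above it belongs to case `c2`.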

In [None]:
# TODO: Implement the function that returns the last-2-state encoding of a log
def last_2_state_encoding(_data, columncase, columnactivity):
    df_sorted = _data.copy()
    # Shift within each case: the log is sorted by timestamp, so events of different
    # cases are interleaved and a plain row-wise shift would miss earlier events of a case
    grouped = df_sorted.groupby(by=columncase)[columnactivity]
    df_sorted['prev_activity'] = grouped.shift(1)
    df_sorted['prev_2_activity'] = grouped.shift(2)
    df_sorted = df_sorted.drop(columns=[columnactivity])
    return df_sorted

def onehotencoder(data: pd.DataFrame, columns: list):
    # dtype=int yields 0/1 indicator columns directly, so no True/False replacement is needed
    ohcdata = pd.get_dummies(data, columns=columns, dtype=int)
    return ohcdata
    
data_sepsis_ls = last_2_state_encoding(df_analysis, "Case ID", "Activity")
data_sepsis_ls = onehotencoder(data = data_sepsis_ls, columns = ["prev_activity", "prev_2_activity"])
data_sepsis_ls = data_sepsis_ls.drop(columns = ["Case ID"])
data_sepsis_ls.dropna(axis=1, inplace=True)

In [None]:
data_sepsis_ls

In [None]:
data_sepsis_ls.info()

In [None]:
# TODO: Implement the function that returns the aggregated state encoding of a log
def agg_state_encoding(_data: pd.DataFrame, columncase, columnactivity):
    # One indicator column per activity, as integers so that they can be summed
    df_aggstate = pd.get_dummies(_data, prefix='act', columns=[columnactivity], dtype=int)
    # Count every activity rather than a hard-coded subset
    act_columns = [col for col in df_aggstate.columns if col.startswith('act_')]
    # Running count of each activity within its case (rows are in timestamp order)
    df_aggstate[act_columns] = df_aggstate.groupby(by=columncase)[act_columns].cumsum()
    return df_aggstate

# TODO: for each of the two data sets, create a last_2_state encoding and an aggregated state encoding
data_sepsis_ag = agg_state_encoding(df_analysis, "Case ID", "Activity")
data_sepsis_ag = data_sepsis_ag.drop(columns = ["Case ID"])

In [None]:
data_sepsis_ag.dropna(axis=1, inplace=True)

### 2.2 Create Training and Held-out test data sets


Create a training and a held-out test data set. *Later in Tasks 3 and 4, the training data will be used to perform cross-validation. The held-out test data will be used to evaluate the performance of the selected models.*

Choose the size of your test data. Furthermore, how did you split the data? Motivate your choice when you discuss the experiment setup in your report. 



Tip: *You may consider reusing some of your code from Assignment 1 Task 1.2.*
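One design question for the split is whether events of the same case may end up in both the training and the test set. A random split over events allows this; a split over cases does not. The sketch below shows the case-based alternative with scikit-learn's `GroupShuffleSplit` on made-up data. It is one option to consider and motivate, not the required approach.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up data: 10 events belonging to 5 cases (two events per case)
case_ids = np.array(["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4", "c5", "c5"])
X = np.arange(20).reshape(10, 2)
y = np.arange(10, dtype=float)

# Hold out roughly 30% of the cases (not of the events)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=6)
train_idx, test_idx = next(splitter.split(X, y, groups=case_ids))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# No case contributes events to both sets
print(set(case_ids[train_idx]) & set(case_ids[test_idx]))
```

With a case-based split, the `Case ID` column has to be kept until after the split and dropped before the data is handed to the models.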

In [None]:
# Create training data and held-out test data for *data_sepsis_ls*
from sklearn.model_selection import train_test_split

X = data_sepsis_ls.drop(columns=['remtime'])
y = data_sepsis_ls['remtime']

X_lstrain, X_lstest, y_lstrain, y_lstest = train_test_split(X, y, test_size=0.30, random_state=6, shuffle=True)

In [None]:
# Create training data and held-out test data for *data_sepsis_ag*
X = data_sepsis_ag.drop(columns=['remtime'])
y = data_sepsis_ag['remtime']

X_agtrain, X_agtest, y_agtrain, y_agtest = train_test_split(X, y, test_size=0.30, random_state=6, shuffle=True)

## Task 3: Predicting Case Remaining Time - Regression Trees


In this task, you will use the regression tree (or random forest regression if you prefer) to learn a regression model to predict case remaining time. Very similar to how you have trained a classification model in Assignment 1, now perform the following steps to train a regression model. 

i) use the default values for the parameters to get a [Regression Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor) (or a [Random Forest Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)) running on the training data. (*Optional: visualize the tree, the feature importance, and compute the error measures to get an impression of the performance of the model*).

ii) use 5-fold cross-validation to determine a possibly better choice for the two parameters *min_samples_leaf* and *max_depth*
    
iii) create a 2D or 3D plot that shows how the selected parameters affect the performance. 

iv) select the best-performing regression tree (or forest), i.e., the one that achieved the lowest cross-validated errors, and report all the error measures (MAE, MSE, RMSE, R^2) of the fitted model on the held-out test data. 
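For reference, the four error measures can be computed by hand from the residuals. The sketch below uses made-up true and predicted values and plain NumPy, so that the definitions are visible.

```python
import numpy as np

# Made-up true and predicted remaining times
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 6.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))                 # mean absolute error
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # root mean squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same unit as the response, MSE is in squared units, and R^2 is unitless. For the three error measures lower is better, while for R^2 higher is better.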

    
#### TIPS:
You may consider reusing some of your code from Assignment 1 or using the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class (see an [example](https://www.dezyre.com/recipes/find-optimal-parameters-using-gridsearchcv-for-regression)). Be aware that GridSearchCV does not return MAE or the other error measures (e.g., MSE, RMSE, R^2) by default; you will need to update the scoring function.
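One way to update the scoring is to pass a dictionary of scikit-learn's built-in scorer names. The sketch below runs on synthetic data (an assumption, standing in for the encoded training data) with a small, arbitrary parameter grid. Note that scikit-learn treats higher scores as better, so its built-in error scorers are negated.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded training data
X = rng.uniform(0, 1, size=(200, 3))
y = 10 * X[:, 0] + rng.normal(0, 0.5, size=200)

# Built-in scorer names; the error measures are negated so that higher is better
scoring = {
    "MAE": "neg_mean_absolute_error",
    "MSE": "neg_mean_squared_error",
    "RMSE": "neg_root_mean_squared_error",
    "R2": "r2",
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]},
    scoring=scoring,
    refit="MAE",  # refit the setting with the highest negated MAE, i.e. the lowest MAE
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

# Flip the sign back to report the average validation MAE of the selected setting
best_cv_mae = -search.cv_results_["mean_test_MAE"][search.best_index_]
print(search.best_params_, best_cv_mae)
```

The same `cv_results_` dictionary also holds `mean_test_MSE`, `mean_test_RMSE`, and `mean_test_R2` for every parameter combination, which is what a heatmap or line plot of the search can be built from.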



In [None]:
# import packages
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, make_scorer

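# define scorers; each returns the raw (non-negated) metric value,
# whereas GridSearchCV ranks higher scores first when it is given a scorer name
mae_scorer = make_scorer(mean_absolute_error)
mse_scorer = make_scorer(mean_squared_error)
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)))
r2_scorer = make_scorer(r2_score)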

# define dictionary of scorers
scorers = {
    'MAE': mae_scorer,
    'MSE': mse_scorer,
    'RMSE': rmse_scorer,
    'R2': r2_scorer
}

# define variables for upcoming functions
main_scorer = 'MAE'
main_test = f'mean_test_{main_scorer}'

# define function for score retrieval
def measure_performance(y_test, y_pred):
    return [mean_absolute_error(y_test, y_pred), mean_squared_error(y_test, y_pred), np.sqrt(mean_squared_error(y_test, y_pred)), r2_score(y_test, y_pred)]

In [None]:
# TODO: import packages
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, GridSearchCV

In [None]:
# TODO: set the search space of the parameters *min_samples_leaf* and *max_depth*
dtr_params = {
    'min_samples_leaf': [1, 2, 4, 8, 16, 32, 64],
    'max_depth': [1, 2, 4, 8, 16, 32, 64, None]
}

# TODO: create 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)

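# TODO: learn an optimal regression tree model (random forest regressor)
dtr = DecisionTreeRegressor(random_state=0)
# the scorers return raw errors, so refit on the setting with the lowest mean validation MAE
# (refit=main_scorer would select the highest one)
dtr_ls_optimized = GridSearchCV(dtr, dtr_params, scoring=scorers, refit=lambda cv_results: int(np.argmin(cv_results[main_test])), cv=kf)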

In [None]:
# fit the model
dtr_ls_optimized.fit(X_lstrain, y_lstrain)

In [None]:
# TODO: create 2D or 3D plot that shows how the selected parameter values affect the MAE (or RMSE). 
import seaborn as sns
import matplotlib.pyplot as plt

# Create a dataframe from the grid search results
dtr_ls_results = pd.DataFrame(dtr_ls_optimized.cv_results_)

# Create a pivot table from the dataframe
dtr_pivot_table = dtr_ls_results.pivot(index='param_max_depth', columns='param_min_samples_leaf', values=f'mean_test_{main_scorer}')

# Create a heatmap from the pivot table
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(dtr_pivot_table, annot=True, fmt='.3f', cmap='Purples_r', ax=ax)
ax.set_title(f"""DT Regression on ls-encoded dataset - Metric: {main_scorer}""")
ax.set_xlabel('Min Samples Leaf')
ax.set_ylabel('Max Depth')
plt.show()

In [None]:
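# TODO: compute the performance of the model on your held-out test data
# GridSearchCV has already refit the selected model on the full training data,
# so a second fit is not needed; inspect the selected parameters instead
dtr_ls_optimized.best_params_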

In [None]:
# predict the test data
y_pred_dtr_ls_optimized = dtr_ls_optimized.predict(X_lstest)

# Retrieve scores
dtr_ls_results_test = measure_performance(y_lstest, y_pred_dtr_ls_optimized)

In [None]:
# TODO: repeat the above steps for *data_sepsis_ag* and compare the results

# learn an optimal regression tree model (random forest regressor)
dtr = DecisionTreeRegressor(random_state=0)
# the scorers return raw errors, so refit on the setting with the lowest mean validation MAE
dtr_ag_optimized = GridSearchCV(dtr, dtr_params, scoring=scorers, refit=lambda cv_results: int(np.argmin(cv_results[main_test])), cv=kf)

In [None]:
# Fit the model
dtr_ag_optimized.fit(X_agtrain, y_agtrain)

In [None]:
# create 2D or 3D plot that shows how the selected parameter values affect the MAE (or RMSE). 
# create a dataframe from the grid search results
dtr_ag_results = pd.DataFrame(dtr_ag_optimized.cv_results_)

# create a pivot table from the dataframe
dtr_pivot_table = dtr_ag_results.pivot(index='param_max_depth', columns='param_min_samples_leaf', values=f'mean_test_{main_scorer}')

# create a heatmap from the pivot table
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(dtr_pivot_table, annot=True, fmt='.3f', cmap='Purples_r', ax=ax)
ax.set_title(f"""DT Regression on ag-encoded dataset - Metric: {main_scorer}""")
ax.set_xlabel('Min Samples Leaf')
ax.set_ylabel('Max Depth')
plt.show()

In [None]:
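# compute the performance of the model on your held-out test data
# GridSearchCV has already refit the selected model on the full training data,
# so a second fit is not needed; inspect the selected parameters instead
dtr_ag_optimized.best_params_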

In [None]:
# predict the test data
y_pred_dtr_ag_optimized = dtr_ag_optimized.predict(X_agtest)

# Store all scores
dtr_ag_results_test = measure_performance(y_agtest, y_pred_dtr_ag_optimized)

## Task 4: Predicting Case Remaining Time - kNN Regression


In this task, you will use kNN Regression to learn a regression model to predict case remaining time. As in Task 3, perform the following steps to train a regression model. 

i) use the default values for the parameters to get a [kNN Regression](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) running on the training data. (*Optional: compute the error measures to get an impression of the performance of the model*).

ii) use 5-fold cross-validation to determine a possibly better choice for the two parameters *n_neighbors* and *weights* 
    
iii) create a 2D or 3D plot that shows how the selected parameters affect the performance. 

iv) select the best-performing kNN, i.e., the one that achieved the lowest cross-validated errors, and report all the error measures (MAE, MSE, RMSE, R^2) of the fitted model on the held-out test data. 

    
#### TIPS:
As before, you may consider reusing some of your code from Assignment 1 or using the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class (see an [example](https://www.dezyre.com/recipes/find-optimal-parameters-using-gridsearchcv-for-regression)). Be aware that GridSearchCV does not return MAE or the other error measures (e.g., MSE, RMSE, R^2) by default; you will need to update the scoring function.
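To see what the *weights* parameter changes, it can help to predict a single query point by hand. The sketch below uses three made-up training points; with `n_neighbors=2`, the query at 0.25 has the points at 0 and 1 as its neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Three made-up training points on a line
X_train = np.array([[0.0], [1.0], [3.0]])
y_train = np.array([0.0, 1.0, 3.0])

# Query point closer to x=0 than to x=1
X_query = np.array([[0.25]])

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X_train, y_train)
distance = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X_train, y_train)

# 'uniform' averages the two nearest targets; 'distance' weights them by 1 / distance
pred_uniform = uniform.predict(X_query)[0]
pred_distance = distance.predict(X_query)[0]
print(pred_uniform, pred_distance)
```

The uniform prediction is the plain average (0 + 1) / 2 = 0.5. The distance-weighted prediction uses weights 1 / 0.25 = 4 and 1 / 0.75 = 4/3, giving (4 * 0 + 4/3 * 1) / (4 + 4/3) = 0.25, which is pulled toward the closer neighbor.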







In [None]:
# TODO: import packages
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, GridSearchCV

In [None]:
# TODO: set the search space of the parameters *n_neighbors* and *weights* 
knr_params = {
    'n_neighbors': [1, 2, 4, 5, 8, 16, 32, 64],
    'weights': ['uniform', 'distance']
}

# TODO: create 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)

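# TODO: learn an optimal kNN regressor
knr = KNeighborsRegressor()
# the scorers return raw errors, so refit on the setting with the lowest mean validation MAE
knr_ls_optimized = GridSearchCV(knr, knr_params, scoring=scorers, refit=lambda cv_results: int(np.argmin(cv_results[main_test])), cv=kf)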

In [None]:
# fit the model
knr_ls_optimized.fit(X_lstrain, y_lstrain)

In [None]:
# TODO: create 2D or 3D plot that shows how the selected parameter values affect the MAE (or RMSE). 
import seaborn as sns
import warnings
import matplotlib.pyplot as plt

# Ignore future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Create a dataframe from the grid search results
knr_ls_results = pd.DataFrame(knr_ls_optimized.cv_results_)

# Create a line chart from the dataframe showing the mean validation score per value of n_neighbors, with one line per value of weights
fig, ax = plt.subplots(figsize=(10, 10))
sns.lineplot(data=knr_ls_results, x='param_n_neighbors', y=f'mean_test_{main_scorer}', hue='param_weights', ax=ax)
ax.set_title(f"""KN Regression on ls-encoded dataset - Metric: {main_scorer}""")
ax.set_xlabel('N Neighbors')
ax.set_ylabel(f'{main_scorer}')
plt.show()

In [None]:
# TODO: compute the performance of the model on your held-out test data

# predict the test data
y_pred_knr_ls_optimized = knr_ls_optimized.predict(X_lstest)

# Store all scores
knr_ls_results_test = measure_performance(y_lstest, y_pred_knr_ls_optimized)

In [None]:
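# learn an optimal kNN regressor
knr = KNeighborsRegressor()
# the scorers return raw errors, so refit on the setting with the lowest mean validation MAE
knr_ag_optimized = GridSearchCV(knr, knr_params, scoring=scorers, refit=lambda cv_results: int(np.argmin(cv_results[main_test])), cv=kf)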

# fit the model
knr_ag_optimized.fit(X_agtrain, y_agtrain)

In [None]:
# TODO: create 2D or 3D plot that shows how the selected parameter values affect the MAE (or RMSE). 
import seaborn as sns
import warnings
import matplotlib.pyplot as plt

# Ignore future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Create a dataframe from the grid search results
knr_ag_results = pd.DataFrame(knr_ag_optimized.cv_results_)

# Create a line chart from the dataframe showing the mean validation score per value of n_neighbors, with one line per value of weights
fig, ax = plt.subplots(figsize=(10, 10))
sns.lineplot(data=knr_ag_results, x='param_n_neighbors', y=f'mean_test_{main_scorer}', hue='param_weights', ax=ax)
ax.set_title(f"""KN Regression on ag-encoded dataset - Metric: {main_scorer}""")
ax.set_xlabel('N Neighbors')
ax.set_ylabel(f'{main_scorer}')
plt.show()

In [None]:
# TODO: compute the performance of the model on your held-out test data

# predict the test data
y_pred_knr_ag_optimized = knr_ag_optimized.predict(X_agtest)

# Store all scores
knr_ag_results_test = measure_performance(y_agtest, y_pred_knr_ag_optimized)

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Data for the bar plots
scorer_list = ['MAE', 'MSE', 'RMSE', 'R2']

result_list = [dtr_ls_results, dtr_ls_results_test, dtr_ag_results, dtr_ag_results_test, knr_ls_results, knr_ls_results_test, knr_ag_results, knr_ag_results_test]
regressor_list = ['CV DT - LS', 'Test DT - LS', 'CV DT - AG', 'Test DT - AG', 'CV KN - LS', 'Test KN - LS', 'CV KN - AG', 'Test KN - AG']

mae_values = []
mse_values = []
rmse_values = []
r2_values = []

for i, result in enumerate(result_list):
    if i % 2 == 1:
        # held-out test scores, in the order returned by measure_performance
        mae_values.append(result[0])
        mse_values.append(result[1])
        rmse_values.append(result[2])
        r2_values.append(result[3])
    else:
        # cross-validation results: best mean validation score per metric over the grid
        # (lowest value for the error measures, highest value for R2)
        mae_values.append(result[f'mean_test_{scorer_list[0]}'].min())
        mse_values.append(result[f'mean_test_{scorer_list[1]}'].min())
        rmse_values.append(result[f'mean_test_{scorer_list[2]}'].min())
        r2_values.append(result[f'mean_test_{scorer_list[3]}'].max())


# Create subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=['MAE', 'MSE', 'RMSE', 'R2'])

# Add bar plots to subplots
fig.add_trace(go.Bar(x=regressor_list, y=mae_values), row=1, col=1)
fig.add_trace(go.Bar(x=regressor_list, y=mse_values), row=1, col=2)
fig.add_trace(go.Bar(x=regressor_list, y=rmse_values), row=2, col=1)
fig.add_trace(go.Bar(x=regressor_list, y=r2_values), row=2, col=2)

# Update layout
fig.update_layout(title='Performance of the models in cross-validation and on test data using ls and ag encoding')

# One color per model/encoding pair; the CV and test bars of a pair share a color
fig.update_traces(marker_color=['#003f5c', '#003f5c', '#58508d', '#58508d', '#bc5090', '#bc5090', '#ff6361', '#ff6361'])

# Clear the default trace names (e.g. 'trace 0') shown in the legend
fig.update_traces(name='')

# Show the figure
fig.show()

## Task 5: Report your results and discuss your findings

By now, you have applied two algorithms with different parameters on the two encodings of the data set. For each algorithm and each encoding, you have created tables or figures which you can add to your report. Discuss the results and their optimal performance. 

Create an overview table or figure that shows the optimal performance of each algorithm on the data set; for example, see the table at the end of this task. 


Discuss your findings and reflect on the following questions in your report:
- According to the error measures, which one would you suggest as the optimal model? 
- Are there any discrepancies between the MAE, MSE, RMSE, and R^2 measures in terms of which model performs the best? If yes, how would you explain these discrepancies?
- Which one of the MAE, MSE, RMSE, and R^2 would you use for selecting the model? Why?
- Which of the encodings would you suggest for this data set? Why?
- Which features have a big influence on predicting the remaining time?







| Encoding | Model | CV MAE | Test MAE | CV MSE | Test MSE | CV R^2 | Test R^2 | ... |
|------|------|------|------|------|------|------|------|------|
| Last-2-state | Regression Tree | | | | | | | |
| Agg-state | Regression Tree | | | | | | | |
| Last-2-state | kNN | | | | | | | |
| ... | ... | | | | | | | |











## Bonus Tasks 

We would like to challenge you with the following bonus tasks. For each task that is successfully completed, you may obtain max. 1 extra point. 

1. Implement or use another regression algorithm (for example, [Random Forest Regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), [SVM Regression](https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html#sphx-glr-auto-examples-svm-plot-svm-regression-py)) or design your own algorithm that achieves a better MAE measure. Explain this in your report.
2. Implement techniques (e.g., preprocessing, feature engineering, feature selection, sampling) that help improve the MAE scores of existing models. For example, try out a feature selection for kNN or implement inter-case features. Explain this in your report.
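As an illustration of the second bonus task, feature selection for kNN can be placed inside a pipeline so that the number of kept features is tuned together with the model. The sketch below uses synthetic data (an assumption) and a univariate filter, `SelectKBest` with `f_regression`; it is one possible technique, not the expected solution.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic data: only the first two of ten features carry signal
X = rng.uniform(0, 1, size=(300, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, size=300)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),  # univariate filter
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])

# Treat the number of kept features as a hyperparameter
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [1, 2, 5, 10]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X, y)

# Indices of the features kept by the best pipeline
selected = search.best_estimator_.named_steps["select"].get_support(indices=True)
print(search.best_params_, selected)
```

Because the selector sits inside the pipeline, it is refitted on the training part of every fold, which keeps the validation folds out of the selection step.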

