<a href="https://colab.research.google.com/github/knschuckmann/ML_bundesliga_challange/blob/master/Original_ML_Workshop_FTL_Price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML Workshop


## Introduction to Colab

We want to welcome everybody to the annual ML Workshop in Berlin.


### Task
You require:
- Laptop
- Google Drive account
- Concentration

### Goal
To understand some basic machine learning algorithms through exercises and to explore what machine learning can do.


## Exercise
You will receive a dataset from a dummy logistics case. Your task is to explore the dataset and, more importantly, to identify the features that best predict the price and how strongly each of them influences it.

⛔️  You will need to copy this notebook to your own Google Drive account:

1. Click **File** in the menu.
2. Click 'Save a copy in Drive' as shown below. <br/>
![copy File][2] 

3. A new tab opens, and you can start running the notebook there.

[2]: https://raw.githubusercontent.com/knschuckmann/4flow_bundesliga/master/pictures/Save.PNG


### ⚠️ How Colab works

1. Colab is an online [Jupyter Notebook](https://jupyter.org/index.html) that can be used without any further installation.
2. Connecting to a runtime creates a new instance that provides the computing power. For now, we only need it to work and will not dive into the details. <br/>
![runtime][6]
3. Clicking the play button of a code cell runs the code inside that cell. <br/>
![run_o][3] <br/>
The play button only becomes visible when you hover over that area.<br/> ![run_1][4]
4. After you have run a cell, a number appears next to it, marking the order in which the cells were run.  <br/>
![run_1][5]

### ⚠️ **Possible Errors**
1. It is essential to follow the order of this notebook and **run the code cells one after the other**. The numbers tell you whether you forgot to run a cell.

[6]: https://raw.githubusercontent.com/knschuckmann/4flow_bundesliga/master/pictures/Runtime.PNG
[3]: https://raw.githubusercontent.com/knschuckmann/4flow_bundesliga/master/pictures/Run_0.PNG
[4]: https://raw.githubusercontent.com/knschuckmann/4flow_bundesliga/master/pictures/Run_1.PNG
[5]: https://raw.githubusercontent.com/knschuckmann/4flow_bundesliga/master/pictures/Run_2.PNG

# **Step 1: Collect data and import libraries**
<img src="https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/Collect_data.PNG" height="400px" width="100%"/>



In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from sklearn import model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
import seaborn as sns
import ipywidgets as widgets

!pip install jupyter-dash -q
from jupyter_dash import JupyterDash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

# declare static variables (urls)
url_orig = 'https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/data/Logistic.csv'

**Understanding the dataset**

It is essential to understand the dataset and get all the necessary insights about it.

What do the column headings mean? It would help if you described them one by one.


**The following table explains the column headings**
<br>
<br>

<table>
  <thead>
  <tr>
  <th width="200" align="left">Header</th>
  <th width="600" align="left">Description</th>
  </tr>
  </thead>
  <tr>
    <th align="left">Sample</th>
    <td> Unique ascending sample number</td>
  </tr>
  <tr>
    <th align="left">Distance in [km]</th>
    <td> Distance in kilometers between origin and destination</td>
  </tr>
  <tr>
    <th align="left">Carrier</th>
    <td> One of seven dummy carriers</td>
  </tr>
  <tr>
    <th align="left">Country pair</th>
    <td> Pair of country codes such as FR-DE (origin - destination)</td>
  </tr>
  <tr>
    <th align="left">Lead time</th>
    <td> Time needed to get from origin to destination</td>
  </tr>
  <tr>
    <th align="left">Equipment</th>
    <td> Carrier equipment, one of three categories (Standard / Mega / Jumbo)</td>
  </tr>
  <tr>
    <th align="left">Volume yearly [m3]</th>
    <td> Volume of the different customers at different points in time</td>
  </tr>
  <tr>
    <th align="left">Customer</th>
    <td> One of eight dummy customers</td>
  </tr>
  <tr>
    <th align="left">Domestic</th>
    <td> (yes / no) Does the delivery stay inside one country's borders?</td>
  </tr>
  <tr>
    <th align="left">Origin region</th>
    <td> One of four possible origin regions</td>
  </tr>
  <tr>
    <th align="left">Destination region</th>
    <td> One of four possible destination regions</td>
  </tr>
  <tr>
    <th align="left">Holiday</th>
    <td> 0 = no holiday, 1 = holiday</td>
  </tr>
  <tr>
    <th align="left">Driver stress level</th>
    <td> Measured stress level of the driver during the ride, x ∈ [10,40]</td>
  </tr>
  <tr>
    <th align="left">Loading meter</th>
    <td> How many meters of the loading area are occupied by goods</td>
  </tr>
  <tr>
    <th align="left">Driver exp in years</th>
    <td> Years of experience the driver has</td>
  </tr>
  <tr>
    <th align="left">Inflation rate</th>
    <td> Monetary inflation, grouped into six classes</td>
  </tr>
  <tr>
    <th align="left">Pick up time</th>
    <td> Time at which the carrier picked up the goods, in HH(am/pm) format</td>
  </tr>
  <tr>
    <th align="left">Carrier Hub Loc</th>
    <td> Country code of the carrier's headquarters</td>
  </tr>
  <tr>
    <th align="left">Price</th>
    <td> Final price of the transport, taking all relevant features into account</td>
  </tr>
</table>

# **Step 2: Prepare data**
<img src="https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/Prepare.PNG" height="400px" width="100%"/>

## **Step 2.1 Data transformation**
- Convert categories into numbers. **The code is not relevant here: run the cell and focus on the outcome.**
- A **machine** does **not understand strings/characters**, so we need to **convert them into numbers**. For example, the column Customer is transformed into the columns Customer1, Customer2, and so on, as shown in the animation below.

![Alt Text](https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/dummy.gif)
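
The idea can be tried on a tiny example before running the real cell. The sketch below uses a made-up three-row table (the values are illustrative and not taken from the workshop dataset) and shows the two conversions the next cell applies: dummy (one-hot) columns and integer category codes.

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the workshop dataset
toy = pd.DataFrame({'Customer': ['Customer1', 'Customer2', 'Customer1'],
                    'Equipment': ['Standard', 'Mega', 'Jumbo']})

# Dummy (one-hot) encoding: one 0/1 column per category value
one_hot = pd.get_dummies(toy[['Customer']], prefix='', prefix_sep='').astype(int)

# Label encoding: each category value is replaced by an integer code
# (codes follow the alphabetical order of the category names)
label_codes = toy['Equipment'].astype('category').cat.codes

print(one_hot)
print(label_codes.tolist())
```

Dummy columns avoid implying an order between categories, while integer codes keep the table compact; which one suits a model better depends on the algorithm.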

In [None]:
# Load the dataset from Github link
data = pd.read_csv(url_orig, delimiter=';', decimal=',')

# drop unnecessary columns
data.drop('Sample', axis = 1 , inplace=True)

# subset only numerical data
data_num = data.select_dtypes(include=['float64','int64'])

# subset only categorical
data_cat = data.select_dtypes(exclude=['float64','int64'])

# create dummy data for regression task
dummy_data = pd.get_dummies(data_cat, prefix='', prefix_sep='')

# create label encoding for different ML tasks
cat_cast_data = data_cat.astype('category')
for col in cat_cast_data.columns:
  cat_cast_data[col] = cat_cast_data[col].cat.codes

# combine all created data together
final_data_label = data_num.merge(cat_cast_data, how='inner', right_index=True, left_index=True)
final_data = final_data_label.merge(dummy_data, how='inner', right_index=True, left_index=True)

column_list = ['Distance in [km]', 'Lead time', 'Volume yearly [m3]', 'Holiday', 'Driver stress level', 'Loading meter', 'Driver exp in years', 'Inflation rate', 'Equipment', 'Origin region', 'Carrier Hub Loc', 
               'Carrier1', 'Carrier2', 'Carrier3', 'Carrier4', 'Carrier5', 'Carrier6', 'Carrier7', 'CZ-DE', 'DE-CZ', 'DE-DE', 'DE-FR', 'DE-GB', 'DE-HU', 'DE-PL', 'FR-DE', 'GB-DE', 
               'NL-DE', 'PL-DE', 'TR-DE', 'Customer1', 'Customer2', 'Customer3', 'Customer4', 'Customer5', 'Customer6', 'Customer7', 'Customer8', 'no', 'yes', 'Dest_east', 'Dest_north', 'Dest_south',
               'Dest_west', '14pm', '15pm', '16pm']
               
final_data = final_data[column_list]
# display final data
final_data.head()


Unnamed: 0,Distance in [km],Lead time,Volume yearly [m3],Holiday,Driver stress level,Loading meter,Driver exp in years,Inflation rate,Equipment,Origin region,Carrier Hub Loc,Carrier1,Carrier2,Carrier3,Carrier4,Carrier5,Carrier6,Carrier7,CZ-DE,DE-CZ,DE-DE,DE-FR,DE-GB,DE-HU,DE-PL,FR-DE,GB-DE,NL-DE,PL-DE,TR-DE,Customer1,Customer2,Customer3,Customer4,Customer5,Customer6,Customer7,Customer8,no,yes,Dest_east,Dest_north,Dest_south,Dest_west,14pm,15pm,16pm
0,663.79,22.126333,1840,0,40,0,11,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1
1,465.093,15.5031,1920,1,28,7,39,4,2,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0
2,1505.548,50.184933,480,1,28,3,23,2,2,2,5,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0
3,1284.343,42.811433,4720,1,13,9,22,2,2,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0
4,1157.473,38.582433,1200,0,28,0,32,2,1,3,4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0


## **Step 2.2 Data analysis (important to understand the data)**


In general, we start exploring the data graphically by drawing a heatmap. This plot shows the correlation between all pairs of numerical columns.

**Correlation:** <br>
- An indicator for positive/ negative relationship between two variables 
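
As a small illustration of what a correlation value means, the sketch below computes it for a made-up table (the column names and numbers are assumptions chosen for the example). A value near 1 means two columns rise together, a value near -1 means one falls as the other rises.

```python
import pandas as pd

# Made-up numbers: price rises with distance, discount falls with distance
toy = pd.DataFrame({
    'distance_km': [100, 200, 300, 400],
    'price': [150, 250, 350, 450],
    'discount': [40, 30, 20, 10],
})

# Pairwise Pearson correlation between all numerical columns
corr = toy.corr()
print(corr.round(2))
```

Here `distance_km` and `price` are perfectly positively correlated, while `distance_km` and `discount` are perfectly negatively correlated.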




In [None]:
fig = go.Figure()
corr = final_data_label.corr()
fig.add_trace(go.Heatmap(z=corr.values,
                  x=corr.index.values,
                  y=corr.columns.values))
fig.update_layout(
    title='Heatmap of correlations',
    xaxis_title='Data Columns'
)
fig.show()

The above plot is a heatmap. 

1. **X-Axis:** <br>
  &#8195; Represents all features

2. **Y-Axis:** <br> 
&#8195; Represents all features 

The more yellow a cell, the higher the correlation between the corresponding features.

## **Step 2.3 Dashboards (further data analysis)**

You will find an **interactive legend** on the right side of the plots, and you can play around by clicking on the legend entries. Furthermore, you will see some **dropdowns**. These select the combinations of variables we want to take a closer look at. Don't be shy: try out many combinations to get a better understanding of the dataset.

<img src="https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/Dashboards.PNG" width="100%"/>


⛔️  **Beware: only one dashboard is active at a time.** You can use a dashboard only until you run the code cell of another one. To use an earlier dashboard again, rerun its code cell.

We also provide a manual on exploring the data and how you can interpret the graphical outcomes.

A short **description of** all **plots** in this dashboard: 
1. All plots are split by customer
2. **X-Axis:** <br>
  &#8195;  You can choose to plot the breakdown by "Equipment" or by "Carrier"

3. **Y-Axis:** <br> 
&#8195; You can choose to plot the "Price", the "Price/Km" or the "Distance in [km]" 

Furthermore, a dropdown lets you choose the kind of plot to display: a violin plot or a box plot. 


In [None]:
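try:
  # stop a dashboard that is still running on this port
  app._terminate_server_for_port("localhost", 7123)
except Exception:
  # on the first run no dashboard exists yet, so there is nothing to stop
  pass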
  
app = JupyterDash(__name__)

app.layout = html.Div([
    html.H1("Dynamic Categorical Plots"),
    html.P("Plot Category:"),
    dcc.Dropdown(
        id='plot_cat', 
        value='Box', 
        options=[{'value': x, 'label': x} 
                 for x in ['Box','Violin']],
        clearable=False),
    html.Div([
      html.P("Category:"),
      dcc.Dropdown(
          id='x_Axis', 
          value='Equipment', 
          options=[{'value': x, 'label': x} 
                   for x in ['Equipment','Carrier']],
          clearable=False),
    ],style=dict(width='45%', display='inline-block')
    ),
    html.Div([],style=dict(width='10%', display='inline-block')
    ),
    html.Div([
      html.P("Values:"),
      dcc.Dropdown(
          id='y_Axis', 
          value='Price', 
          options=[{'value': x, 'label': x} 
                    for x in ['Price', 'Price/Km', 'Distance in [km]']],
          clearable=False),
    ],style=dict(width='45%', display='inline-block')
    ),
    dcc.Graph(id="bar-chart")
])
@app.callback(
    Output(component_id='bar-chart', component_property='figure'),
    [Input(component_id='plot_cat', component_property='value'),
     Input(component_id='x_Axis', component_property='value'),
     Input(component_id='y_Axis', component_property='value'),]
)
def update_output_div(plot_cat, x_Axis, y_Axis):
  fig = go.Figure()
  # choose the trace type once; the dropdown only offers 'Box' and 'Violin'
  trace_type = go.Violin if plot_cat == 'Violin' else go.Box
  for customer in data['Customer'].unique():
    temp = data[data['Customer'].str.contains(customer)]
    # 'Price/Km' is derived from two columns; the other options are plain columns
    if y_Axis == 'Price/Km':
      y_values = temp['Price'] / temp['Distance in [km]']
    else:
      y_values = temp[y_Axis]
    fig.add_trace(trace_type(x=temp[x_Axis], y=y_values, name=customer))
  fig.update_layout(
    title= y_Axis + ' ' + plot_cat +'plot for all Customers and ' + x_Axis,
    xaxis_title=x_Axis,
    yaxis_title=y_Axis,
    boxmode='group',
    violinmode='group'
  )
  return fig

if __name__ == '__main__':
  app.run_server(mode='inline', port=7123)


**Main Outcomes**
+ In the plot above (Box, Carrier or Equipment, Price) we see that Customer1 **has many outliers**, so the range of prices paid is very wide. This could be relevant for our model. 

+ However, when plotting Price/Km, Customer1 no longer stands out as a strong outlier. The outliers in Price can be explained by the long distances this customer ships over, together with the high correlation between distance and price seen earlier.



⚠️ Because these outliers are largely explained by their dependency on "Distance", we decided to **leave the data as it is**.

In [None]:
app._terminate_server_for_port("localhost", 7123)
app = JupyterDash(__name__)

app.layout = html.Div([
    html.H1("Dynamic Categorical Bar Plots"),
    html.Div([
      html.P("Split Category:"),
      dcc.Dropdown(
          id='x_Axis', 
          value='Customer', 
          options=[{'value': x, 'label': x} 
                   for x in data_cat.columns],
          clearable=False
      ),
    ],style=dict(width='30%', display='inline-block')),
    html.Div([],style=dict(width='10%', display='inline-block')),
    html.Div([
      html.P("Dependent Category:"),
      dcc.Dropdown(
          id='counts', 
          value='Equipment', 
          options=[{'value': x, 'label': x} 
                   for x in data_cat.columns],
          clearable=False
      ),
    ],style=dict(width='30%', display='inline-block')),
    dcc.Graph(id="bar-chart")
])
@app.callback(
    Output(component_id='bar-chart', component_property='figure'),
    [Input(component_id='x_Axis', component_property='value'),
    Input(component_id='counts', component_property='value')]
)
def update_output_div(x_Axis, counts):
  fig = go.Figure()
  temp = data_cat.groupby(by=[x_Axis,counts]).count()
  data_temp = temp[temp.columns[0]].unstack()
  for col in data_temp.columns:
    fig.add_trace(go.Bar(name=col, x = data_temp.index, y = data_temp[col]))
  # show the bars of each dependent category side by side
  fig.update_layout(
    xaxis_title=x_Axis,
    yaxis_title=counts + ' counts',
    barmode='group'
  )
  return fig

if __name__ == '__main__':
    app.run_server(mode='inline', port=7123)


**Main Outcomes**
+ Split Category:
  + Carrier 
    + All carriers make most of their deliveries inside Germany
    + The carriers are based in different locations, so it is surprising that most of their deliveries are still inside Germany
  +  Country pair
    + Jumbo deliveries only go to and from CZ
  + Customer
    + Customer8 uses all three types of equipment and is the only customer with Jumbo deliveries

For more outcomes, we advise you to explore the plots on your own. 


# **Step 3: Test-train split (training: 60% / testing: 40%)**

In [None]:
X = final_data
y = data['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=X[['Equipment']])

print('Number of samples in \nTraining dataset: {}\nTest dataset:  {}'.format(X_train.shape[0],X_test.shape[0]))

Number of samples in 
Training dataset: 2511
Test dataset:  1674
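
To see what the `stratify` argument does, here is a minimal sketch on a made-up table with two equally frequent equipment classes. The data and the fixed `random_state` are assumptions for the example (the workshop cell does not set a seed). Stratifying keeps the class proportions the same in the training and the test part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table: 20 rows, two equipment classes with 10 rows each
toy = pd.DataFrame({'feature': range(20), 'Equipment': [0] * 10 + [1] * 10})

# 60/40 split that preserves the share of each equipment class
toy_train, toy_test = train_test_split(
    toy, test_size=0.4, stratify=toy['Equipment'], random_state=0)

print(len(toy_train), len(toy_test))
print(toy_test['Equipment'].value_counts().to_dict())
```

With 20 rows, the test part holds 8 rows, 4 from each class, so neither class is over-represented when the model is evaluated.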


# **Step 4: Extract relevant features**
<img src="https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/features_extract.PNG" height="400px" width="100%"/>

# **Task:**
The goal is to find the best model by choosing different features from the ones displayed below.  
1. Choose a maximum of **5 features** for further calculation and prediction
  * You can repeat this step as often as you like until you find the best result. **Our best accuracy was around 77%.** 
  
**WOULD YOU LIKE TO CHALLENGE US?** <br>

Winner takes our code 😉


In [None]:
checkboxes = [widgets.Checkbox(value=False, description=label) for label in final_data.columns]
output = widgets.GridBox(children=checkboxes, layout=widgets.Layout(grid_template_columns="repeat(4, 200px)"))
display(output)

GridBox(children=(Checkbox(value=False, description='Distance in [km]'), Checkbox(value=False, description='Le…

### **Loop Start**   

⚠️ In the checkbox grid above, tick the features you want your model to train with. <br>
Try to **choose wisely**.<br>

#### Come back here to change the features. **There is no need to execute the cells above again: just tick or untick the features.**

⚠️ Please do not rerun the cell above after selecting the features, because that would reset your selection.

In [None]:
# collect the descriptions of all ticked checkboxes
selected_data = [checkbox.description for checkbox in checkboxes if checkbox.value]

selected = [widgets.Label(str(label)) for label in selected_data]
output = widgets.GridBox(children=selected, layout=widgets.Layout(grid_template_columns="repeat(4, 200px)"))
print('Display selected Data:\n')
display(output)


Display selected Data:



GridBox(children=(Label(value='Distance in [km]'), Label(value='Volume yearly [m3]'), Label(value='Carrier6'),…

# **Step 5: Train model (Now you will start training your model)**
<img src="https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/train.PNG" height="400px" width="100%"/>

*The four steps above were important to prepare the data for ML. In practice, we spend about 80% of our time on these four steps. Training the ML model itself (Step 5) is not very time-consuming.*

**So let's start predicting for real now...**

Here we begin to predict the price for any given data point. To do so, we need machine learning algorithms. In the upcoming predictions we use three common ones:


1.   **[Linear Regression][lin]**
2.   **[K Nearest Neighbors][knn]**
3.   **[Random Forest][RF]**

⛔️  **Do not rerun the cell that creates the feature checkboxes. Just check and uncheck the boxes and run the cells that follow.**

[lin]: https://en.wikipedia.org/wiki/Linear_regression#:~:text=In%20statistics%2C%20linear%20regression%20is,as%20dependent%20and%20independent%20variables).
[knn]: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
[RF]: https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76


In [None]:
# create pipeline for ML algorithms
pipes = {'linear': Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)), ('linear', LinearRegression())]), 
         'knn':Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)), ('knn', KNeighborsRegressor(n_neighbors=14))]),
         'rf':Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)), ('rf', RandomForestRegressor(max_depth=14, random_state=0))])}
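
A pipeline chains preprocessing and a model so that both are applied with a single `fit` or `predict` call. The sketch below rebuilds a pipeline like the `knn` entry above on randomly generated toy data; the data, the seed and `n_neighbors=3` are assumptions for the example and differ from the workshop settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: two random features and a target that depends on both
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1000, size=(50, 2))
y_toy = 1.2 * X_toy[:, 0] + 0.1 * X_toy[:, 1]

# Step 1 scales every feature to unit variance, step 2 is the regressor
pipe = Pipeline(steps=[('standardscaler', StandardScaler(with_mean=False)),
                       ('knn', KNeighborsRegressor(n_neighbors=3))])

pipe.fit(X_toy, y_toy)            # fits the scaler, then the model
pred = pipe.predict(X_toy[:5])    # applies the same scaling before predicting
print(pred)
```

Scaling matters most for distance-based models such as k-nearest neighbors, where a feature with large values would otherwise dominate the distance.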

# **Step 6.1: Evaluate model (model performance)**
<img src="https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/evaluate.PNG" height="400px" width="100%"/>

#### **Create results**

In the following code cell, we create the final result table, which contains the predictions of all introduced algorithms together with several measures.

For every algorithm, we will have:
1. Original price
2. Predicted price
3. Calculated residuals
4. Percentage of the residuals
5. Final classification, which gives an easily understandable measure  

Formula we used to calculate residuals: <br>
$\text{resid} =|\text{predicted} - \text{original}|$  

Formula we used to calculate percentage of residuals: <br>
$\%\text{resid} =\frac{|\text{predicted} - \text{original}|}{\text{original}}$  

Formula for Class (good/bad): <br>
$\%\text{resid} \leq 2\% \Rightarrow \text{good}$ <br>
$\%\text{resid} > 2\% \Rightarrow \text{bad}$
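
The formulas can be checked by hand on two price pairs. The sketch below uses the first two rows of the linear model from the result table shown in this notebook; the variable names are chosen for the example.

```python
import numpy as np

# Original and predicted prices (first two rows of the linear model results)
original = np.array([885.139185, 1035.298871])
predicted = np.array([904.277589, 1044.926150])

# Absolute residual and residual relative to the original price
resid = np.abs(predicted - original)
pct_resid = resid / original

# A prediction is 'good' if it deviates by at most 2% from the original price
label = np.where(pct_resid <= 0.02, 'good', 'bad')

print(resid.round(6), pct_resid.round(6), label)
```

The first prediction is off by about 2.16% and is therefore classified as bad, while the second is off by about 0.93% and counts as good.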

In [None]:
result_df = pd.DataFrame(y_test.reset_index(drop=True))
for pipe in pipes:
  pipes[pipe].fit(X_train[selected_data], y_train)
  result_df['pred ' + pipe + ' Price'] = list(pipes[pipe].predict(X_test[selected_data]))
  result_df['resid ' + pipe] = np.abs(result_df['Price'] - result_df['pred ' + pipe + ' Price'])
  result_df['%resid ' + pipe] =result_df['resid ' + pipe]/result_df['Price']
  result_df['class ' + pipe] = np.where(result_df['%resid ' + pipe]<=0.02, 'good', 'bad') 

result_df.head()

Unnamed: 0,Price,pred linear Price,resid linear,%resid linear,class linear,pred knn Price,resid knn,%resid knn,class knn,pred rf Price,resid rf,%resid rf,class rf
0,885.139185,904.277589,19.138405,0.021622,bad,921.549259,36.410074,0.041135,bad,892.534236,7.395052,0.008355,good
1,1035.298871,1044.92615,9.627279,0.009299,good,1051.719733,16.420862,0.015861,good,1039.723213,4.424342,0.004273,good
2,761.234275,756.317605,4.91667,0.006459,good,790.811561,29.577286,0.038854,bad,765.811423,4.577149,0.006013,good
3,1073.622396,1091.852214,18.229818,0.01698,good,1082.86687,9.244474,0.008611,good,1096.727362,23.104966,0.021521,bad
4,1340.419044,1335.280682,5.138362,0.003833,good,1327.193429,13.225615,0.009867,good,1343.625737,3.206693,0.002392,good


#### **Visualize results on Test data**

In [None]:
app._terminate_server_for_port("localhost", 7123)
app = JupyterDash(__name__)

app.layout = html.Div([
    html.H1("Prediction and Residual Plots"),
    html.Div([
      html.P("Model name:"),
      dcc.Dropdown(
          id='model_name', 
          value='linear', 
          options=[{'value': x, 'label': x} 
                   for x in pipes.keys()],
          clearable=False
      ),
    ],style=dict(width='30%', display='inline-block')),
    dcc.Graph(id="lineplot")
])
@app.callback(
    Output(component_id='lineplot', component_property='figure'),
    [Input(component_id='model_name', component_property='value')]
)
def update_output_div(model_name):
  result_subset = result_df[result_df.columns[result_df.columns.str.contains('Price')]]
  resid_subset = result_df[['resid linear', 'resid knn', 'resid rf']]
  switcher = {'rf':'Random Forest', 'knn':'K-Nearest Neighbors', 'linear': 'Linear'}
  fig = go.Figure()
  fig.add_trace(go.Scatter(y = result_subset[result_subset.columns[0]], name='Price', mode='lines'))
  fig.add_trace(go.Scatter(y = result_subset['pred ' + model_name + ' Price'], name='prediction', mode='lines'))
  fig.add_trace(go.Scatter(y = resid_subset['resid ' + model_name], name='residuals', mode='lines'))

  fig.update_layout(
      title='Residual and prediction plot ' + switcher.get(model_name) ,
      yaxis_title = 'Price and Price difference' 
      )
  return fig

if __name__ == '__main__':
    app.run_server(mode='inline', port=7123)


# **Step 6.2: Evaluate model output**
<img src="https://raw.githubusercontent.com/knschuckmann/ML_bundesliga_challange/master/pictures/evaluate.PNG" height="400px" width="100%"/>

The table and the plot above already give a good impression of model quality. Still, a single score is a better indicator of how good an algorithm is: the number of good predictions divided by the total number of samples. We will call this measure accuracy.

Accuracy Formula: <br>
&#8195; **$\frac{\text{Good results}}{\text{All results}}$**

**To get the best accuracy result, you only need to improve one of the following three values.** <br>

In [None]:
for pipe in pipes:
  print(pipe + ' Acc: {0:.2f}%'.format( (result_df['%resid ' + pipe][result_df['%resid ' + pipe]<=0.02].count()/result_df.shape[0])*100))

linear Acc: 76.94%
knn Acc: 52.33%
rf Acc: 68.22%
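
As a quick sanity check of the accuracy formula, the sketch below applies it to the five `class linear` values shown in the result table earlier in this notebook; using only five rows is an assumption made to keep the example short.

```python
import pandas as pd

# The first five 'class linear' entries from the result table
classes = pd.Series(['bad', 'good', 'good', 'good', 'good'])

# Share of good predictions among all predictions, in percent
accuracy = (classes == 'good').mean() * 100
print('{0:.2f}%'.format(accuracy))
```

Four of the five predictions are good, which gives 80.00% on this small sample; the cell above computes the same ratio over the whole test dataset.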


⛔️ Above, we see the **accuracy** of the three models on the **test dataset**. If you are not satisfied with the performance of the models, you can **rerun** the loop starting from **feature selection**. **Choose different features and rerun all cells below the selection.**

### **Loop End**  

Go back to Loop Start to select features again.

Once you have found your best result, **make sure that it is reproducible** by running the loop one last time and checking that you get the same results as before. <br>

Now download the .ipynb file by clicking on Download .ipynb, and send it to us.<br>
![download][1] 


[1]: https://raw.githubusercontent.com/knschuckmann/4flow_bundesliga/master/pictures/Download.PNG