# ING Machine Learning workshop

The goal of this workshop is to learn *something* about Machine Learning by going through the model building process step by step. The goal is *not* to build the most accurate model but rather to understand which steps are involved in getting to your first minimal viable model. Building a good machine learning model is an iterative process, and this workshop is a good starting point for engineers who want a first try at practical machine learning.

### Business case

Every machine learning project starts with a business problem. FC ING is bankrupt and therefore the board has decided to sell all their players. Naturally they want to get the best deal, but they have not kept up with recent developments in the soccer industry. That's why they need your help!
Develop a Machine Learning model that:

- ### Predicts the value of a football player

In order to successfully build your first machine learning model, you just have to complete this Jupyter Notebook by following the instructions in the code cells. Please write your code in Python; the pandas, numpy and sklearn libraries will help you.

<img src="https://ichef.bbci.co.uk/news/660/cpsprodpb/CF91/production/_103573135_neymareasports.jpg" alt="Drawing" style="width: 100px;"/>

### The dataset

The FIFA 2019 data set is an open source data set that contains detailed attributes of well-known soccer players (https://www.kaggle.com/karangadiya/fifa19).
The target variable (y) that we want to predict is the "Value" column. You can use everything else as input/features (X) for your model.

In [0]:
import pandas as pd # Pandas is a very handy Python library that helps you to load, clean, analyse and preprocess your data before you build a model.

In [0]:
# TO DO:
# Load the dataset into a pandas DataFrame, print the shape and the first 5 rows to get a first glance at the data.
data = pd.read_csv('https://drive.google.com/uc?authuser=0&id=1XomAUds7vJ2aA2Rde0LAaKto3MldAeTH&export=download', sep=",", index_col=0)
print(data.shape)  # (number of rows, number of columns)
data.head(n=5)

### Preprocessing the data

(Big) data often comes in a messy and unstructured format. For this workshop we chose a rather clean dataset, but you still have to do some preprocessing to get the features ready for modeling.
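
To make the two preprocessing tasks concrete, here is a minimal sketch on a made-up toy DataFrame. The helper name `lbs_to_kg`, the toy values, and the assumption that weights are stored as strings such as `"159lbs"` are illustrative only; check how the column actually looks in your data before reusing this.

```python
import pandas as pd

LBS_TO_KG = 0.45359237  # one international pound expressed in kilograms


def lbs_to_kg(weight):
    """Convert a weight in pounds to kilograms, rounded to one decimal.

    Accepts numbers or strings such as "159lbs" (assumed format).
    Missing values are returned unchanged.
    """
    if pd.isna(weight):
        return weight
    if isinstance(weight, str):
        weight = float(weight.lower().replace("lbs", "").strip())
    return round(weight * LBS_TO_KG, 1)


# Toy stand-in for the workshop data (values are made up).
toy = pd.DataFrame(
    {
        "Name": ["Player A", "Player B", "Player C"],
        "Nationality": ["NL", "DE", "BR"],
        "Weight": ["159lbs", "183lbs", None],
    }
)

toy = toy.drop(columns=["Nationality"])         # remove a column we do not need
toy["Weight"] = toy["Weight"].apply(lbs_to_kg)  # convert every row
print(toy)
```

Using `apply` with a small function keeps the conversion logic in one place, so you can run exactly the same step on any new data later.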

In [0]:
# TO DO:
# - Drop the "Nationality" column, we don't need this one.

In [0]:
# TO DO:
# Write a function that transforms the weight from lbs to kg
# Apply that function to your data

### Encoding categorical features

Most Machine Learning models need numeric input variables (X) to make predictions about the target variable (y). However, in this dataset the column "Club" still contains string values such as "FC Barcelona", "Real Madrid" etc.
We cannot simply encode the strings as numbers from 1 to n, because that introduces an artificial order: the model would treat club 3 as "greater than" club 1, which has no meaning.


One Hot Encoding or dummy encoding is a very well-known technique to encode categorical features. In a nutshell: for each of the n categories in your feature you create a new column that indicates whether that category is present (1) or not (0).

Example: 

                   Club_Juventus    Club_Bayern_Munchen    Club_FC Barcelona
    C. Ronaldo     1                0                      0

    M. Neuer       0                1                      0
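
The encoding illustrated above can be sketched with sklearn's `OneHotEncoder`. This is one possible approach on a made-up toy DataFrame, not the required solution; the `Age` values and the choice of `handle_unknown="ignore"` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the workshop data (numeric values are made up).
players = pd.DataFrame(
    {
        "Name": ["C. Ronaldo", "M. Neuer", "L. Messi"],
        "Club": ["Juventus", "FC Bayern München", "FC Barcelona"],
        "Age": [33, 32, 31],
    }
)

# handle_unknown="ignore" encodes clubs not seen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(players[["Club"]]).toarray()

# Build readable column names from the categories the encoder learned.
club_columns = [f"Club_{club}" for club in encoder.categories_[0]]
club_dummies = pd.DataFrame(encoded, columns=club_columns, index=players.index)

# Replace the original string column with the new indicator columns.
players = pd.concat([players.drop(columns=["Club"]), club_dummies], axis=1)
print(players)
```

Keeping the fitted `encoder` object around is useful: it remembers the categories, so the same columns can be produced for data that arrives later.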

In [0]:
# TO DO:
# Dummy encode the "Club" feature
# Hint: check the OneHotEncoder from the sklearn package
# Drop the Club column and concatenate the new columns to the existing dataframe

### Exploratory Data Analysis

In this step you want to explore your data, ask questions, find patterns and maybe create visualisations. Usually this is an iterative process that you come back to throughout a Machine Learning project. It is a really valuable step for getting to know your dataset. For example:

- Is there any missing data?
- How is age distributed among the players?
- Do the most skilled players earn the highest wage?
- Is the overall score of a player related to his preferred foot?
- Which nationality has the best average Overall score? (Use the original data for this one, from before the "Nationality" column was dropped.)
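
A few of these questions can be approached with one-liners in pandas. The sketch below uses a made-up toy DataFrame; the column names `Age`, `Overall`, `Preferred Foot` and `Height` are assumed to resemble the workshop data and may differ in your copy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the workshop data (values are made up).
toy = pd.DataFrame(
    {
        "Age": [21, 25, 25, 30, 34],
        "Overall": [70, 82, 78, 88, 80],
        "Preferred Foot": ["Left", "Right", "Right", "Left", "Right"],
        "Height": [180.0, np.nan, 175.0, 188.0, np.nan],
    }
)

# Is there any missing data? Count NaN values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# How is age distributed? Summary statistics for one column.
print(toy["Age"].describe())

# Is the overall score related to the preferred foot? Compare group means.
overall_by_foot = toy.groupby("Preferred Foot")["Overall"].mean()
print(overall_by_foot)
```

Group means only hint at a relationship; with real data you would also want to look at group sizes and the spread within each group.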

In [0]:
# TO DO: 
# Answer one of the above questions or pick your own. 

### How to deal with missing values (NaN)

Often you have to deal with missing values in your dataset. One of many possible methods is to replace the missing values with the mean of the feature column.

For example, any player whose height is missing is assigned the average height of all players.
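
As a minimal sketch of mean imputation, assuming a numeric `Height` column (the values below are made up):

```python
import numpy as np
import pandas as pd

# Toy column with two missing heights (values are made up).
heights = pd.DataFrame({"Height": [180.0, np.nan, 170.0, 190.0, np.nan]})

mean_height = heights["Height"].mean()  # NaN values are skipped by default
heights["Height"] = heights["Height"].fillna(mean_height)
print(heights)
```

Storing the mean in a variable, rather than recomputing it, lets you fill missing values in new data with the same number later on.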


In [0]:
# TO DO:
# Calculate the average height of a player and fill all the missing values with it.

### Feature selection 

Correlation and regression analysis are related in the sense that both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. Values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. 


**We use this concept to get an idea of what features to select for our model.**


**Example:** 
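If the age of a player is positively correlated with the value of a player, it might be a good predictor for our model, because when age increases/decreases the value of a player tends to increase/decrease as well. A strong negative correlation is just as useful; what matters is how far the coefficient is from 0.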
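
One way to rank features by their correlation with the target is sketched below on a made-up toy DataFrame. Sorting by absolute value is a design choice made here so that strong negative correlations are not pushed to the bottom of the list.

```python
import pandas as pd

# Toy stand-in for the workshop data (values are made up).
toy = pd.DataFrame(
    {
        "Age": [20, 24, 28, 32, 36],
        "Overall": [60, 70, 75, 85, 90],
        "Value": [1.0, 3.0, 4.5, 8.0, 10.0],
        "Club": ["A", "B", "A", "C", "B"],
    }
)

correlations = (
    toy.select_dtypes(include="number")  # correlation needs numeric columns
    .corr()["Value"]                     # Pearson correlation with the target
    .drop("Value")                       # the target correlates 1.0 with itself
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

Keep in mind that the coefficient only captures linear relationships, so a feature with a low value can still be useful to a non-linear model.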

In [0]:
# TO DO:
# Print a sorted list with the correlation values of all features with "Value".

### Build your first model

Now we are ready to build our first model! You never train the model on all of your data. Instead, you partition your data, for example into 80% training data and 20% test data.
The training data is used to train the model. Once you have trained a model you have to evaluate how good it is. You evaluate a model by exposing it to unseen data. This is why you always set aside a test set.

**Tip**: check out `sklearn.model_selection.train_test_split()`
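
A minimal sketch of an 80/20 split with `train_test_split`, using randomly generated toy data. The column names and the fixed `random_state` are assumptions for illustration; the seed only makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated toy data (not the workshop dataset).
rng = np.random.default_rng(seed=0)
toy = pd.DataFrame(
    {
        "Age": rng.integers(18, 38, size=100),
        "Overall": rng.integers(50, 95, size=100),
        "Club": ["A", "B"] * 50,
    }
)
toy["Value"] = toy["Overall"] * 0.1 + rng.normal(0, 1, size=100)

y = toy["Value"]                                                 # target
X = toy.drop(columns=["Value"]).select_dtypes(include="number")  # numeric features only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```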


In [0]:
# Make sure you use the Value column as your target and the rest of the columns as your features.
# Only use numeric features.
# Finally split the data into training and testing set

Train your model (on the training data) with a regression algorithm of your choice. **Tip**: sklearn is a great machine learning library that has a lot of models to choose from.
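
As one example of such an algorithm, the sketch below trains a `RandomForestRegressor` on randomly generated data. The data, the number of trees and the seeds are assumptions for illustration; any other sklearn regressor follows the same `fit` / `predict` pattern.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Randomly generated regression data (not the workshop dataset).
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# n_estimators is the number of decision trees in the forest.
model = RandomForestRegressor(n_estimators=25, random_state=42)
model.fit(X_train, y_train)  # train on the training data only

predictions = model.predict(X_test)
print(predictions[:5])
```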

In [0]:
# Instantiate a model, e.g. a random forest with 25 decision trees, or whatever parameters you like :-)
# Train the model on training data

### Evaluation of the model


Now that you have trained a model, you want to investigate how well it will do on new, unseen data. That is why we always hold out a small partition of the data, called the test set.
There are many different evaluation metrics in machine learning. Which one you choose depends mainly on the type of problem (for example regression or classification) and on what matters for the business case.
For simplicity, we will use the Root Mean Square Error (RMSE). When we talk about a model we actually mean fitting a line/function to our training data by minimizing the error distance between that line and the data points. The RMSE is the square root of the average squared distance of a data point from the fitted line, measured along a vertical line, so you can read it as the typical size of a prediction error. It is probably the most easily interpreted statistic, since it has the same units as the quantity plotted on the vertical axis. In our case that is the value in €. Thus, the smaller the RMSE, the better your model.
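
To make the definition concrete, here is a small sketch that computes the RMSE by hand with numpy. The function name `rmse` and the example numbers are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

# Errors are -2, 2, -3, 3; their squares average to 6.5.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large mistakes raise the RMSE more than many small ones.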

In [0]:
# Use the model's predict method on the test data

In [0]:
# Calculate the RMSE

### Submit your results to CodaLab

The fun part of this workshop is to compete with your colleagues to see who developed the best model.

Below you can find a small dataset that we held back from you. We want you to make predictions on this data to see how your model performs on our unseen data. 


Make sure you first run this data through all your preprocessing steps. In case you used any `.fit_transform` method calls, this time please only use `.transform`, so that the new data is processed with the settings learned from the training data. You don't need to refit your model. 
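
The difference between the two calls can be sketched with a `OneHotEncoder` on made-up data. The club names and `handle_unknown="ignore"` are assumptions for illustration: `fit_transform` learns the categories from the training data, while `transform` reuses them, so the new data ends up with exactly the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training data and new, unseen data.
train = pd.DataFrame({"Club": ["Juventus", "FC Barcelona", "Juventus"]})
new = pd.DataFrame({"Club": ["FC Barcelona", "Some Other Club"]})

encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from the training data and encode it.
train_encoded = encoder.fit_transform(train[["Club"]]).toarray()

# Reuse the learned categories; the unknown club becomes a row of zeros.
new_encoded = encoder.transform(new[["Club"]]).toarray()

print(train_encoded.shape, new_encoded.shape)
print(new_encoded)
```

Calling `fit_transform` on the new data instead would create columns for a different set of clubs, and the trained model would no longer receive the features it expects.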

Afterwards you can upload your zip file here: https://competitions.codalab.org/competitions/21488?secret_key=7292568d-fee4-450b-a53c-4b290201fba5

In [0]:
data = pd.read_csv('https://drive.google.com/uc?authuser=0&id=1avU0LzKHbskkkSney4Jh3lAvfib1zA7c&export=download', sep = ",", index_col = 0)

In [0]:
# TO DO:
# Run the test data through your preprocessing steps
codalab_predictions = None  # TO DO: replace None with your model's predictions

In [0]:
import zipfile
pd.Series(codalab_predictions, index=data.index).to_csv('submission.csv', header=False)
with zipfile.ZipFile("submission.zip", "w") as archive:
    archive.write('submission.csv')