# Back-end summit Machine Learning workshop 2019

The goal of this workshop is to learn about Machine Learning by going through the model building process step by step. The goal is **_not_** to build the most accurate model, but rather to understand which steps are involved in getting to your first minimum viable model. Building a good machine learning model is an iterative process, so this workshop is a good starting point for engineers who want a first try at practical machine learning.

### Business case

Every machine learning project starts with a business problem. FC ING is bankrupt and therefore the board has decided to sell all their players. Naturally they want to get the best deal, but they have not kept up with recent developments in the soccer industry. That's why they need your help!
Develop a Machine Learning model that:

- ### Predicts the value of a football player

In order to successfully build your first machine learning model, you just have to complete this Jupyter Notebook by following the instructions in the code cells. Please write your code in Python and use the pandas, numpy or sklearn library for help. 

<img src="https://ichef.bbci.co.uk/news/660/cpsprodpb/CF91/production/_103573135_neymareasports.jpg" alt="Drawing" style="width: 100px;"/>

### The dataset

The FIFA 2019 data set is an open source data set that contains detailed attributes of well-known soccer players (https://www.kaggle.com/karangadiya/fifa19).
The target variable that we want to predict is the "Value" column. You can use everything else as input features for your model.

In [0]:
# Run this cell before you do anything else

# Pandas is a very handy Python library that helps you to load, clean, analyse and preprocess your data before you build a model.
import pandas as pd 
pd.set_option("display.max_rows", 100)

# These are all the imports that you might need to complete the notebook
# If you would like to use a library that is not imported yet, please add it here
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt
from math import sqrt

# Model imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor, Lasso, Ridge, LinearRegression

In [0]:
# Load the dataset into a pandas Dataframe and print the first 5 rows to get a first glance at the data.
raw_data = pd.read_csv("https://drive.google.com/uc?authuser=0&id=1XomAUds7vJ2aA2Rde0LAaKto3MldAeTH&export=download", index_col=0)
raw_data.head()

### Preprocessing the data

(Big) data often comes in a messy and unstructured format. For this workshop we chose a rather clean dataset, but you still have to do some pre-processing to get the features ready for modeling.
Here we will do 3 types of pre-processing:
- Dropping of redundant columns
- Transforming non-numeric to numeric values
- Dealing with NaN values (missing values)

### Dropping of redundant columns

Drop the "Nationality" column; we don't need this data.  
Hint: use the drop method of the pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html


In [0]:
no_nat_data = raw_data.drop(columns=["Nationality"])
no_nat_data.head()


### Transforming non-numeric to numeric values
Transform the "Weight" column from lbs string to a kg float.
This method takes as input a string that looks like: 123lbs
and should output a float like: 55.79

In [0]:
def transform_weight(weight: str) -> float:
  # Strip the "lbs" suffix and convert pounds to kilograms (1 lb = 0.45359237 kg)
  weight_trans = weight.replace("lbs", "")
  return float(weight_trans) * 0.45359237

weight = "123lbs"
transform_weight(weight)

Run the next cell after completing the transform_weight method in the previous cell. It applies your implemented method to the data and prints some validation information.

In [0]:
# Make a deep copy of the data
weight_trans_data = no_nat_data.copy()

print("Data type of weight column before transformation is: {}".format(weight_trans_data["Weight"].dtype))
print()

# Apply the method to a Series (an indexed list) of data
weight_trans_data["Weight"] = no_nat_data["Weight"].apply(transform_weight)
print(weight_trans_data["Weight"].head())
print("\nData type of weight column after transformation should be float64 and is: {}".format(weight_trans_data["Weight"].dtype))
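
As a side note, the same conversion can be written without a per-row Python function by using the pandas string accessor on the whole column. The following is a minimal sketch on hypothetical sample values (the `weights` Series and the `LBS_TO_KG` constant are made up for illustration and are not part of the workshop data):

```python
import pandas as pd

LBS_TO_KG = 0.45359237  # kilograms per pound

# Hypothetical sample values in the same "123lbs" format as the Weight column
weights = pd.Series(["123lbs", "159lbs", "200lbs"])

# Strip the unit suffix, convert to float, and scale the whole column at once
weights_kg = weights.str.replace("lbs", "", regex=False).astype(float) * LBS_TO_KG
print(weights_kg.round(2).tolist())
```

Both approaches should give the same numbers; the vectorized form is usually shorter and faster on large columns.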

### Deal with missing values (NaN)

Often you have to deal with missing values in your dataset. Among many different methods, you can replace missing values with the mean of the feature column.

For example, each player whose height is missing is assigned the average height of the other players.

The following cell will explore which columns have missing values.

In [0]:
print("Counting all null values in different columns: ")
# Sums up the count of null values in each column in the DataFrame
nulls = weight_trans_data.isna().sum()
print(nulls)

# Shows all columns that have at least 1 nan value
print("\nShowing columns with nan values: ")
nulls.iloc[nulls.to_numpy().nonzero()[0]]

Fill the found NaN values with the average height of the players.  

Use the Series fillna method https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

In [0]:
height_filled_data = weight_trans_data.copy()
# Inspect data before transformation
print("Before transformation: \n")
print(height_filled_data["Height"].head())

# Apply transformation
mean = height_filled_data["Height"].mean()
height_filled_data["Height"] = height_filled_data["Height"].fillna(value=mean)

print("\nAfter transformation: \n")
print(height_filled_data["Height"].head())

print("\nColumns with nan values after imputing: ")
new_nulls = height_filled_data.isna().sum()
new_nulls.iloc[new_nulls.to_numpy().nonzero()[0]]
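
The `SimpleImputer` class imported at the top of the notebook offers another way to do mean imputation, and it can handle several columns at once. The sketch below uses a hypothetical toy DataFrame (`toy`) rather than the workshop data, so treat it as an illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with a gap in each numeric column
toy = pd.DataFrame({
    "Height": [180.0, np.nan, 170.0],
    "Weight": [75.0, 80.0, np.nan],
})

# strategy="mean" replaces each NaN with the mean of its own column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

In a real project you would normally fit the imputer on the training data only and reuse it on the test data, so that no information from the test set leaks into training.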

### Exploratory Data Analysis

In this step you want to explore your data, ask questions, find patterns and maybe create visualisations. Usually this is an iterative process which you come back to throughout a Machine Learning project. It is a really valuable step to get to know your dataset. For example:

1. How is "Age" distributed among the players?
2. Do players with a high "Special" number have a greater "Value"?
3. Is the "Overall" score of a player related to his "Preferred Foot"?

Tip:
1. For a distribution plot have a look at https://seaborn.pydata.org/generated/seaborn.kdeplot.html
2. To see how two parameters relate to each other use: https://seaborn.pydata.org/generated/seaborn.scatterplot.html
3. See 2


In [0]:
#1 
# fill=True replaces the deprecated shade=True argument in recent seaborn versions
sns.kdeplot(raw_data["Age"], fill=True)
plt.figure()
#2
sns.scatterplot(x="Special", y="Value", data=raw_data)
plt.figure()
#3a
sns.scatterplot(x="Overall", y="Value", hue="Preferred Foot", data=raw_data)
plt.figure()
#3b
d = raw_data.copy()
d["Value"] = d["Value"].apply(np.log)
sns.scatterplot(x="Overall", y="Value", hue="Preferred Foot", data=d)
plt.figure()

### Feature selection 

Correlation and regression analysis are related in the sense that both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. Values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense, and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables. 

We use this concept to get an idea of what features to select for our model.


**Example:** 
If the age of a player is negatively correlated to the value of a player, it might be a good predictor for our model because when age increases/decreases the value of a player decreases/increases, respectively.
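
To make the two extremes concrete, here is a small sketch that computes the correlation coefficient with numpy. The numbers are hypothetical and chosen only so that the relationship is perfectly linear:

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the sign of the coefficient
age = np.array([20, 24, 28, 32, 36])
value_falling = np.array([50, 40, 30, 20, 10])  # falls in a straight line as age rises
value_rising = np.array([10, 20, 30, 40, 50])   # rises in a straight line as age rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_neg = np.corrcoef(age, value_falling)[0, 1]
r_pos = np.corrcoef(age, value_rising)[0, 1]
print(r_neg, r_pos)
```

The first coefficient comes out at -1 and the second at +1 (up to floating-point rounding). Real data will almost always land somewhere in between.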

### Correlation analysis

Print a sorted list of the features with the greatest absolute correlation with "Value" to get an indication of which features are best for predicting a player's value.

Tip:
- DataFrame.corr https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
- Select K best features: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

In [0]:
# Set number of features to show
n = 40

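# Correlation is only defined for numeric columns, so select those first
numeric_data = height_filled_data.select_dtypes(include="number")
# Take the correlations with "Value" and drop the target's correlation with itself
correlations = numeric_data.corr()["Value"].drop("Value")
abs_sorted_correlations = correlations.abs().sort_values(ascending=False)[:n]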
print(abs_sorted_correlations)
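
The `SelectKBest` class mentioned in the tips automates a similar ranking. The sketch below is one possible way to use it, assuming the `f_regression` scoring function (which is based on the linear correlation between each feature and the target). The data is synthetic and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: one informative feature and two pure-noise features
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=200)

# Keep the single best-scoring feature
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

On the workshop data you would pass the numeric feature columns as `X` and the "Value" column as `y`. Note that `SelectKBest` does not accept missing values, so any NaN values would have to be imputed first.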

### Build your first model

Now we are ready to build our first model! You should never train the model on all of your data. Instead, partition your data into 80% training data and 20% test data.  
The training data is used to train the model. Once you have trained a model, you evaluate how good it is by exposing it to unseen data: the test set. This is why you always set aside a test set.

**Tip**: 
1. Select numeric data: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
2. Split data: [sklearn.model_selection.train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)



In [0]:
# The model target is the Value of our players we are trying to estimate
target = height_filled_data["Value"]
# The rest of the columns are your features, but make sure we only use numeric features, as we can only use numeric values in our models.

# 1 
num_features = height_filled_data.select_dtypes(exclude='object')

# Also, make sure to drop Value as a feature, as this would obviously not be available in our "real world" data
features = num_features.drop(columns=["Value"])


In [0]:
# 2
# Finally split the data into training and testing set
train_X, test_X, train_y, test_y = train_test_split(features, target, test_size = 0.20, random_state =1)

### Model training
Train your model (on the training data) with a regression algorithm of your choice. 

Choosing which model to use for the problem at hand can be challenging.
The following guide can help you get started: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html.  
Once you have picked a regression model, instantiate it and assign it to the "clf" variable.  

**Example**:   
clf = LinearRegression()


In [0]:
clf = RandomForestRegressor()

# Train the model on training data
clf.fit(train_X, train_y)

### Evaluation of the model


Now that you have trained a model, you want to investigate how well it will do on new, unseen data. That is why we always hold out a small partition of the data, called the test set.
There are many different evaluation metrics in machine learning. Which one you choose depends on the type of problem (for example regression or classification) and on which errors matter most for your use case.
For simplicity, we will use the Root Mean Square Error (RMSE). Training a model means fitting a line/function to the training data by minimizing the distance between that function and the data points. The RMSE is the square root of the average squared difference between the predicted and the real values, so it can be read as a typical prediction error that penalizes large misses more heavily. It is easy to interpret, since it has the same units as the quantity being predicted. In our case that is the value in €. Thus, the smaller the RMSE, the better your model.
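
As a small worked example, the RMSE can be computed by hand with numpy. The values below are hypothetical and only serve to show the arithmetic:

```python
import numpy as np

# Hypothetical real and predicted player values
real = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

errors = predicted - real             # [10, -10, 30]
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 100 + 900) / 3)
print(rmse)
```

The result is roughly 19.15, which is larger than the mean absolute error of about 16.67 because the single miss of 30 is weighted more heavily after squaring.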

In [0]:
# Helper method to pretty print results about your model
def print_predictions(pred, real) -> None:
  # Prepare for concatenation
  pred_array = pred.reshape(-1,1)
  real_array = real.to_numpy().reshape(-1,1)
  
  # Concatenate real value, predictions and difference as a pandas DataFrame
  diff = abs(pred_array - real_array)
  results = np.concatenate((real_array, pred_array, diff), axis=1)
  df_results = pd.DataFrame(results, columns=["Real value", "Predicted value", "Difference"])

  print("Showing difference in predicted and real value for first 5 players: \n")
  print(df_results.head())

  print("\nListing difference in predicted and real value sorting by greatest difference: \n")
  print(df_results.sort_values(by="Difference", ascending=False).head())

  # Calculate the RMSE
  rmse = sqrt(mean_squared_error(real_array, pred_array))
  print("\n Model score: \n")
  print(rmse)

# Use the model's predict method on the test data
predictions = clf.predict(test_X)

print_predictions(predictions, test_y)