# The right price

The aim of this notebook is to use a Multilayer Perceptron (MLP) architecture to predict the price of football players from the data available about them. The right price (the labeled data used for training) comes from transfermarkt.com and dates from 2017. For more details about the dataset, check [here](https://github.com/Lakshaypahuja21/FOOTBALL-ENGLISH_PREMIER_LEAGUE).

# Exploratory Data Analysis (EDA)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import piplite
await piplite.install('seaborn')
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold

First things first! Let's load the data into memory using Pandas:

In [None]:
# your code here
data_raw.head(5)
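
As a minimal sketch of how `pd.read_csv` works, the example below reads a small in-memory CSV. The column names and values are invented for illustration; the actual file name and location depend on where you saved the dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real dataset file
csv_text = """name,age,market_value
Player A,25,10.0
Player B,30,5.5
"""

# pd.read_csv accepts a file path or any file-like object
toy_raw = pd.read_csv(io.StringIO(csv_text))
print(toy_raw.head(5))
```

With a file on disk you would pass its path instead of the `io.StringIO` object.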

Now we will do some analysis. What is the highest market value? Which players are worth this amount?

In [None]:
# add your code here
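
One possible approach, sketched on a toy dataframe with made-up values, is to compute the maximum of the column and then keep the rows that match it. Several players can share the same maximum, which is why the filter returns a dataframe rather than a single row.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "market_value": [50.0, 12.0, 50.0],
})

# Highest value in the column
highest = toy["market_value"].max()

# All rows whose value equals that maximum (ties are kept)
top_players = toy[toy["market_value"] == highest]
print(highest)
print(top_players["name"].tolist())
```

Replacing `.max()` with `.min()` gives the same pattern for the least valuable players.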

Do the same for the least valuable players:

In [None]:
# add your code here

Now, let's select some variables. Create a dataframe containing only the variables **name, age, position, market_value, page_views, fpl_value, fpl_sel, region, big_club, new_signing**.

In [None]:
data = ...

Some of these names are too long. Rename position to pos, market_value to value, page_views to views and new_signing to signing.

In [None]:
# your code here
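
Column selection and renaming can be sketched as follows. The toy dataframe and its `club` column are assumptions made for the example; only the mechanics (a list of column names, then `rename` with a mapping) carry over to the real data.

```python
import pandas as pd

# Toy data with one extra column ("club") that we want to leave out
toy = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "club": ["Club X", "Club Y"],
    "position": ["GK", "CF"],
    "market_value": [5.0, 20.0],
})

# Indexing with a list of names keeps only those columns
subset = toy[["name", "position", "market_value"]]

# rename maps old column names to new ones and returns a new dataframe
renamed = subset.rename(columns={"position": "pos", "market_value": "value"})
print(renamed.columns.tolist())
```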

The fpl_sel variable contains the "%" sign, so it is read as text and cannot be used by the algorithm. The following line removes the sign and converts the column to floats:


In [None]:
data['fpl_sel'] = data['fpl_sel'].replace('%','',regex=True).astype('float')

Now, for each position, let's print the average fpl_value of the players. Which positions are the most and the least valued?

In [None]:
# your code here
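
A per-group average is usually computed with `groupby`. The sketch below uses a toy dataframe with invented values and sorts the result so the most valued group comes first.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "pos": ["GK", "GK", "CF", "CF"],
    "fpl_value": [4.0, 5.0, 9.0, 11.0],
})

# Mean fpl_value per position, highest first
mean_by_pos = toy.groupby("pos")["fpl_value"].mean().sort_values(ascending=False)
print(mean_by_pos)
```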

Let's now generate a new column, fpl_value_dist, containing the difference between the fpl_value of a player and the mean fpl_value of the players in the same position.

In [None]:
data['fpl_value_dist'] = ...
data.head()
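
One way to subtract a group mean from every row is `groupby(...).transform("mean")`, which returns a series aligned with the original rows. This is a sketch on toy data; the column name `dist` is chosen for the example.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "pos": ["GK", "GK", "CF", "CF"],
    "fpl_value": [4.0, 5.0, 9.0, 11.0],
})

# transform("mean") broadcasts each group's mean back to its rows
group_mean = toy.groupby("pos")["fpl_value"].transform("mean")

# Distance of each player from the mean of their own position
toy["dist"] = toy["fpl_value"] - group_mean
print(toy)
```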

The new fpl_value_dist variable allows us to compare the fpl_value of players in different positions in a "fair" way. Let's retrieve the most and least "valuable" players. To do so, sort the dataframe by fpl_value_dist and display the first 5 rows and the last 5 rows of the sorted dataset.

In [1]:
# your code here
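
Sorting and slicing both ends of a dataframe can be sketched like this, using a toy `score` column with invented values and showing 2 rows instead of 5 to keep the output short.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "score": [0.3, -1.2, 2.5, 0.0, -0.4, 1.1],
})

# Sort from highest to lowest score
ordered = toy.sort_values("score", ascending=False)

print(ordered.head(2))  # rows with the highest scores
print(ordered.tail(2))  # rows with the lowest scores
```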

Let's do some statistical analysis: print the Pearson correlation matrix between all the numeric variables.

In [None]:
# data_numeric will contain the numeric variables:
data_numeric = data.select_dtypes(include = np.number)
corr = ...
print(corr)
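
As an illustration of what a Pearson correlation matrix looks like, the sketch below builds a toy dataframe in which `y` grows with `x` and `z` shrinks with `x`. The data is invented; `DataFrame.corr` uses the Pearson coefficient by default, and `method="pearson"` only makes that explicit.

```python
import numpy as np
import pandas as pd

# Toy data: y = 2x (perfect positive), z decreases as x increases (perfect negative)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
    "label": ["a", "b", "c", "d"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
toy_numeric = toy.select_dtypes(include=np.number)
toy_corr = toy_numeric.corr(method="pearson")
print(toy_corr)
```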


Now we can visualize the correlation matrix using the [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function from seaborn.

In [None]:
# your code here

Now we will draw a [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) to see how the outcome variable (price) is distributed across positions.

Positions naming:

1. AM: attacking midfielder
2. CF: center forward
3. LW: left wing
4. RW: right wing
5. SS: second striker
6. RM: right midfielder
7. LB: left back defender
8. RB: right back defender
9. CB: center back defender
10. CM: center midfielder
11. GK: goalkeeper
12. LM: left midfielder
13. DM: defensive midfielder


In [None]:
# your code here

Let's do the same visualization to compare players who are new signings with those who are not.

In [None]:
# your code here

We can also look for relations between the price and the views, or between the price and the % of fpl selection. Create a [scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) plot for each of these pairs of variables.

In [None]:
# your code here

Notice that the most viewed player is very cheap. Who is he? Print his name!

In [None]:
# your code here
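
One way to find the row holding the maximum of a column is `idxmax`, which returns the index label of that row. The sketch below uses invented names and view counts.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "views": [1200, 7500, 300],
})

# idxmax gives the index label of the row with the largest value
most_viewed = toy.loc[toy["views"].idxmax(), "name"]
print(most_viewed)
```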

Lastly, to use an ML model with this dataset, we have to encode the categorical variables by converting them into dummy variables.

In [None]:
data = pd.get_dummies(data, columns = ['pos', 'big_club', 'region', 'signing'])
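
To see what `pd.get_dummies` produces, here is a sketch on a toy dataframe with invented values. Each category of the encoded column becomes its own indicator column, named with the original column name as a prefix.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "pos": ["GK", "CF", "GK"],
    "value": [1.0, 2.0, 3.0],
})

# One indicator column per category of "pos"; "value" is left untouched
encoded = pd.get_dummies(toy, columns=["pos"])
print(encoded)
```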

# Generate train - test split

In [None]:
y = data['value']
X = data.drop('value', axis = 1)

We have to scale the data between 0 and 1. Do it without using any function from sklearn by applying the formula: 

$$ \text{scaled variable} = \frac{\text{variable} - \text{minimum}}{\text{maximum} - \text{minimum}} $$

In [None]:
# predictors are all variables except the name of the player
predictors = X.columns
predictors = predictors.drop('name')

#store the values for later use:
minimum = ...
maximum = ...

#scale
X[predictors] = ...
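
The min-max formula above can be sketched with plain pandas arithmetic. The toy dataframe and the names `toy_min` and `toy_max` are assumptions made for the example. Note that a column whose minimum equals its maximum would lead to a division by zero.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [20.0, 25.0, 30.0],
    "views": [100.0, 300.0, 500.0],
})

# Column-wise minimum and maximum (one value per column)
toy_min = toy.min()
toy_max = toy.max()

# pandas aligns the Series with the dataframe columns, so this scales each column
toy_scaled = (toy - toy_min) / (toy_max - toy_min)
print(toy_scaled)
```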

Now let's make the train/test split, keeping 70% of the data as training data.

In [None]:
X_train, X_test, y_train, y_test = ...
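
A 70/30 split with `train_test_split` can be sketched on toy arrays as follows. The `random_state` value is an arbitrary choice made here for reproducibility; the exercise does not prescribe one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# train_size=0.7 keeps 70% of the rows for training; the rest go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0  # seed is an assumption
)
print(X_tr.shape, X_te.shape)
```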

In [None]:
names_train = X_train['name']
names_test = X_test['name']
X_train = X_train.drop('name', axis = 1)
X_test = X_test.drop('name', axis = 1)

# Model selection

Create two different models and use k-fold cross-validation (with k = 5) to choose the best one. The first model should have a single hidden layer with 20 neurons, and the second one should have 2 hidden layers with 10 neurons each. Both models should use max_iter=8000.

Create the first model

In [None]:
mlp1 = ...

Create the second one

In [None]:
mlp2 = ...
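
In scikit-learn, the architecture of an `MLPRegressor` is set through the `hidden_layer_sizes` tuple, with one entry per hidden layer. The sketch below shows how the two architectures described above could be written; the variable names are chosen for the example, and every other hyperparameter is left at its default.

```python
from sklearn.neural_network import MLPRegressor

# One hidden layer with 20 neurons (note the trailing comma for a 1-tuple)
one_layer = MLPRegressor(hidden_layer_sizes=(20,), max_iter=8000)

# Two hidden layers with 10 neurons each
two_layers = MLPRegressor(hidden_layer_sizes=(10, 10), max_iter=8000)

print(one_layer.hidden_layer_sizes, two_layers.hidden_layer_sizes)
```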

Now we will use k-fold cross-validation to select the best model. Before that, we need to convert the dataframes into numpy arrays.

In [None]:
X_train_arr = np.array(X_train)
X_test_arr = np.array(X_test)
y_train_arr = np.array(y_train)
y_test_arr = np.array(y_test)


Complete the code below to perform [k-fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) cross-validation. We want to store the MSE of each iteration in the corresponding error list:

In [None]:
kf = KFold(n_splits=5, shuffle = True, random_state = 33)

kf_error1 = []
kf_error2 = []

for train_index, test_index in kf.split(X_train_arr):

    X_train_kf, X_test_kf = ...
    y_train_kf, y_test_kf = ...
    
    mlp1.fit(X_train_kf, y_train_kf)
    mlp2.fit(X_train_kf, y_train_kf)
    
    predictions1 = ...
    predictions2 = ...
    
    mse1 = ...
    mse2 = ...

    kf_error1.append(mse1)
    kf_error2.append(mse2)
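
The general shape of a k-fold loop can be sketched on synthetic data. To keep the example fast, a `LinearRegression` is swapped in for the MLP, and the data is generated from a noise-free linear rule, so the fold errors are expected to be close to zero. All names and values here are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data: the target is an exact linear function of the features
rng = np.random.default_rng(0)
X_toy = rng.random((50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5])

kf_toy = KFold(n_splits=5, shuffle=True, random_state=33)
fold_errors = []

for train_index, test_index in kf_toy.split(X_toy):
    # Index the arrays with the row positions returned by the splitter
    X_fit, X_val = X_toy[train_index], X_toy[test_index]
    y_fit, y_val = y_toy[train_index], y_toy[test_index]

    # Fit on the training folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_fit, y_fit)
    fold_errors.append(mean_squared_error(y_val, model.predict(X_val)))

# Average validation error across the 5 folds
print(np.mean(fold_errors))
```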
    

Print the average cross-validation error of each model. Which model had the best overall performance?

In [None]:
# your code here

# Model evaluation

Now, let's retrain the best model configuration using all the training data available.

In [None]:
mlp_best = ...

In [9]:
# fit the model with the whole training data
...

Predict the values for the testing dataset. Compute the Mean Squared Error and the Mean Absolute Error.

In [None]:
predictions = ...

MSE = ...
MAE = ...

print(MSE)
print(MAE)
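
Both metrics are available in `sklearn.metrics`. The sketch below computes them on three invented predictions and checks them against the formulas written out with numpy.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy targets and predictions, invented for illustration only
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# Errors are 2, -2 and 3
mse_toy = mean_squared_error(y_true, y_pred)   # mean of squared errors: (4 + 4 + 9) / 3
mae_toy = mean_absolute_error(y_true, y_pred)  # mean of absolute errors: (2 + 2 + 3) / 3

# The same quantities written out by hand
mse_manual = np.mean((y_true - y_pred) ** 2)
mae_manual = np.mean(np.abs(y_true - y_pred))
print(mse_toy, mae_toy)
```

The MSE is expressed in squared units of the target, while the MAE stays in the same units as the price, which makes it easier to interpret.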

Now we can create an imaginary player by choosing arbitrary values for the variables. Check how the predicted price changes as you modify the different variables!

In [None]:
data.loc[100]

In [None]:
# Generate the variables for the player
new_player = ...

# Scale the variables with the minimum and maximum stored earlier
new_player = (new_player-minimum)/(maximum-minimum)

# Predict the price
price = ...
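
A new sample has to be scaled with the minimum and maximum stored from the original data, not with its own values. The sketch below shows this on a toy dataframe with invented columns. A value outside the original range (here `views`) ends up outside the [0, 1] interval, in which case the model is extrapolating.

```python
import pandas as pd

# Toy reference data from which the scaling bounds are taken
reference = pd.DataFrame({
    "age": [20.0, 30.0],
    "views": [100.0, 500.0],
})
lo = reference.min()
hi = reference.max()

# A hypothetical new sample with the same columns as the reference data
new_row = pd.DataFrame([{"age": 25.0, "views": 700.0}])

# Same formula as before, using the stored bounds
new_scaled = (new_row - lo) / (hi - lo)
print(new_scaled)
```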