# Part 2: Linear Regression


In this part, we will be working with a dataset scraped by [Shubham Maurya](https://www.kaggle.com/mauryashubham/linear-regression-to-predict-market-value/data), which collects facts about players in the English Premier League as of 2017. His original goal was to establish if there was a relationship between a player's popularity and his market value, as estimated by transfermrkt.com.

**Your goal is to fit a model able to predict a player's market value.**

## The dataset

The dataset contains the following information:

| **Field**   |     **Description**      |  
|-------------|-------------|
| name   |  Name of the player |
| club   |  Club of the player |
| age    | Age of the player |
|position| The usual position on the pitch|
|position_cat| 1 for attackers, 2 for midfielders, 3 for defenders, 4 for goalkeepers|
|market_value| As on transfermrkt.com on July 20th, 2017|
|page_views| Average daily Wikipedia page views from September 1, 2016 to May 1, 2017|
|fpl_value| Value in Fantasy Premier League as on July 20th, 2017|
|fpl_sel| % of FPL players who have selected that player in their team|
|fpl_points| FPL points accumulated over the previous season|
|region| 1 for England, 2 for EU, 3 for Americas, 4 for Rest of World|
|nationality| Player's nationality|
|new_foreign| Whether a new signing from a different league, for 2017/18 (till 20th July)|
|age_cat| a categorical version of the Age feature|
|club_id| a numerical version of the Club feature|
|big_club| Whether one of the Top 6 clubs|
|new_signing| Whether a new signing for 2017/18 (till 20th July)|

## Exercise 1: Exploring the data
The first step is to explore your data.

We will start with the necessary imports. In this exercise, we will be working with the `pandas` library. If you are not familiar with it, we recommend following the introductory exercises in the course's GitHub repository.

In [1]:
import numpy as np
import pandas as pd

We will now proceed to read the dataset:

In [2]:
league_df = pd.read_csv('data/football_data.csv') #Reads a CSV file

### Task 1.1: Using pandas for data exploration
Use the method `name_dataframe.head(N)` (N is the number of entries) to look at the first instances of the dataframe. 

Then, use the method `name_dataframe.describe(include='all')` to generate descriptive statistics that summarize each field of the dataframe. 

Finally, print the result of `name_dataframe.dtypes` to see the data type associated with each field in the table.

In [8]:
#Your code for head
league_df.head(7)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3.0,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2.0,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2.0,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1.0,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2.0,France,0,4,1,1,0
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2.0,Spain,0,2,1,1,0
6,Olivier Giroud,Arsenal,30,CF,1,22.0,2230,8.5,2.50%,116,2.0,France,0,4,1,1,0


In [5]:
#Your code for describe
league_df.describe(include='all')

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
count,461,461,461.0,461,461.0,461.0,461.0,461.0,461,461.0,460.0,461,461.0,461.0,461.0,461.0,461.0
unique,461,20,,13,,,,,113,,,61,,,,,
top,Alexis Sanchez,Arsenal,,CB,,,,,0.10%,,,England,,,,,
freq,1,28,,85,,,,,64,,,156,,,,,
mean,,,26.804772,,2.180043,11.012039,763.776573,5.447939,,57.314534,1.993478,,0.034707,3.206074,10.334056,0.303688,0.145336
std,,,3.961892,,1.000061,12.257403,931.805757,1.346695,,53.113811,0.957689,,0.183236,1.279795,5.726475,0.460349,0.352822
min,,,17.0,,1.0,0.05,3.0,4.0,,0.0,1.0,,0.0,1.0,1.0,0.0,0.0
25%,,,24.0,,1.0,3.0,220.0,4.5,,5.0,1.0,,0.0,2.0,6.0,0.0,0.0
50%,,,27.0,,2.0,7.0,460.0,5.0,,51.0,2.0,,0.0,3.0,10.0,0.0,0.0
75%,,,30.0,,3.0,15.0,896.0,5.5,,94.0,2.0,,0.0,4.0,15.0,1.0,0.0


In [6]:
#Your code for d_type
league_df.dtypes

name             object
club             object
age               int64
position         object
position_cat      int64
market_value    float64
page_views        int64
fpl_value       float64
fpl_sel          object
fpl_points        int64
region          float64
nationality      object
new_foreign       int64
age_cat           int64
club_id           int64
big_club          int64
new_signing       int64
dtype: object

### Question set 1.1: About the data
1. What is the name of the player appearing in the 7th record of the dataset?
2. What is the mean age in the English Premier League (in 2017)? 
3. What fields store a continuous value?

__Your answers here:__
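1. _What is the name of the player appearing in the 7th record of the dataset?_
    * The 7th player in the dataset is "Olivier Giroud".
2. _What is the mean age in the English Premier League (in 2017)?_
    * The mean age in the English Premier League is 26.8 years.
3. _What fields store a continuous value?_
    * `market_value` and `fpl_value`. `fpl_sel` is a percentage and therefore also continuous, although it is stored as a string (`object`). `region` has dtype `float64` but holds category codes (1 to 4); it is most likely a float only because one entry is missing (count 460 instead of 461).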
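
The `describe` output above lists 460 values for `region` against 461 for the other fields, so one entry is missing. A minimal sketch of why that matters for the dtype, using two made-up Series (the values are illustrative and not taken from the dataset): pandas stores an integer column that contains `NaN` as `float64`.

```python
import numpy as np
import pandas as pd

# Hypothetical category codes, purely for illustration.
complete = pd.Series([1, 2, 3, 4])
with_gap = pd.Series([1, 2, np.nan, 4])

print(complete.dtype)   # integer dtype: no missing values
print(with_gap.dtype)   # float64: NaN forces a float representation

# Counting missing values is a quick way to spot this situation.
n_missing = with_gap.isna().sum()
print(n_missing)
```

A `float64` dtype is therefore not enough to conclude that a field is continuous; the field description has to be taken into account as well.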

## Exercise 2: Data splits, data preparation and training
Before starting the training procedure, we need to split the data into the training, validation and test sets.

In this exercise, the data will be already given split for you. 

In [34]:
#Loading the splits
df_train = pd.read_csv('data/league_train.csv')
df_val = pd.read_csv('data/league_val.csv')
df_test = pd.read_csv('data/league_test.csv')

Alternatively, for the type of data used in this exercise, `scikit-learn` provides the function `train_test_split`, which splits the data automatically.

### Question set 2.1 Train_test_split
Look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) of the `train_test_split` function:
1. What parameters does it receive as input? Provide examples illustrating them.
2. What is the role of the parameter shuffle?
3. What is the role of the parameter test_size?
4. The function does not generate a validation set. What would you do to obtain the desired data splits (train, validation and test)? Answer using pseudo-code (Bonus: Write the code for it so that it can run using some dummy generated data). 

__Your answers here:__

1. _What parameters does it receive as input? Provide examples illustrating them._
The `train_test_split` function takes as required parameters:
* `*arrays (e.g. league_df)`: one or more indexable data structures that all have the same length, i.e. the same size in their first dimension (`shape[0]`). Any of the following is accepted, as long as it adheres to this shape constraint:
    * Python lists
    * NumPy arrays
    * SciPy sparse matrices
    * pandas DataFrames

In addition, the function has the following optional parameters:
* `test_size (e.g. 0.33)`: if a `float`, a value between 0.0 and 1.0 giving the proportion of the dataset to put in the test split; if an `int`, the absolute number of test samples. If neither `test_size` nor `train_size` is given, it defaults to 0.25.
* `train_size (e.g. 0.66)`: same as `test_size`, but for the train split; defaults to the complement of the test size.
* `random_state (e.g. 42)`: controls the shuffling applied to the data before splitting. It can be an `int` that seeds the random number generator (for reproducible splits) or a `RandomState` instance; it defaults to `None`, in which case NumPy's global random state is used.
* `shuffle (e.g. True)`: controls whether the data is shuffled before splitting; defaults to `True`.
* `stratify (e.g. None)`: array-like of class labels; if given, the split is stratified so that the relative class frequencies are preserved in the train and test sets. Defaults to `None`.
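
To illustrate these parameters, here is a small sketch on dummy data. The arrays `X` and `y` and all output names are made up for this example; only `train_test_split` and its documented parameters come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 20 samples, 2 features, and a balanced binary label.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Fractional test_size, fixed seed, and stratification on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)

# Integer test_size: an absolute number of test samples.
X_tr_int, X_te_int = train_test_split(X, test_size=4, random_state=0)
print(len(X_tr_int), len(X_te_int))  # 16 4
```

Passing several arrays at once (here `X` and `y`) splits them with the same indices, so features and labels stay aligned.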

2. _What is the role of the parameter shuffle?_

The `shuffle` parameter controls whether the data is shuffled before splitting. It defaults to `True`; with `shuffle=False` the data is cut in its original order, and `stratify` must then be `None`.
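
The difference is easy to see on a small ordered array. This is only a sketch on made-up data (`data` and the output names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# shuffle=False: a plain cut, the last samples form the test set.
train_ordered, test_ordered = train_test_split(data, test_size=0.2, shuffle=False)
print(train_ordered, test_ordered)  # [0 1 2 3 4 5 6 7] [8 9]

# shuffle=True (the default): samples are permuted before the cut.
train_shuffled, test_shuffled = train_test_split(data, test_size=0.2, random_state=0)
print(train_shuffled, test_shuffled)
```

Keeping the order can matter for data sorted by time or, as in this dataset, by club, where an unshuffled cut would put whole clubs into a single split.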

3. _What is the role of the parameter test_size?_

The `test_size` parameter sets the size of the test set. As a `float` between 0.0 and 1.0 it is the proportion of the whole dataset that goes into the test split; as an `int` it is the absolute number of test samples.

4. _The function does not generate a validation set. What would you do to obtain the desired data splits (train, validation and test)? Answer using pseudo-code (Bonus: Write the code for it so that it can run using some dummy generated data)._

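We can obtain a train, validation and test split by applying the function twice.
Assume we want a (60%, 15%, 25%) split. The second call operates on the remaining 75% of the data, so its `test_size` must be rescaled to 0.15 / 0.75 = 0.2: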
```
input: data_set
temp_set, test_set = train_test_split(data_set, test_size=0.25)
train_set, validation_set = train_test_split(temp_set, test_size=0.15/(1.0-0.25))
return train_set, validation_set, test_set
```


#### In Python using a dummy dataset

In [35]:
from sklearn.model_selection import train_test_split
data_set = np.arange(400).reshape((100, 4))

temp_set, test_set = train_test_split(data_set, test_size=0.25, random_state=42)
train_set, validation_set = train_test_split(temp_set, test_size=0.15/0.75, random_state=42)

# Print the set sizes
print("#data_set", len(data_set))
print("#train_set", len(train_set))
print("#validation_set", len(validation_set))
print("#test_set", len(test_set))

#data_set 100
#train_set 60
#validation_set 15
#test_set 25


The dataset contains a lot of features that can be used to build the model. We will start by using `age, fpl_value, big_club` and `page_views`.

$$\hat{y} = w_0 + w_1 x_{age} + w_2 x_{fplvalue} + w_3 x_{bigclub} + w_4(x_{pageviews})^{1/2}$$

Before training the model, we need to prepare the data so that it can be used for training, validation and testing. The following steps need to be executed to prepare the data:

1. Apply `np.sqrt()` to the values of `page_views`.
2. Convert the variable into a NumPy array with `np.array(variable)`.
3. Add a column of ones to the matrix $\mathbf{X}$ so it can handle the parameter $w_0$.

### Task 2.1 Prepare data
Complete the function `prepare_data(DataFrame)` where indicated so that all the steps listed above are performed.

In [36]:
from sklearn.preprocessing import PolynomialFeatures

def prepare_data(df):
    '''
        INPUT :
        - df : a pandas DataFrame

         OUTPUT :
        - variable_array : The processed array
    ''' 
    # Copy the relevant fields from the DataFrame so that the original is not modified.
    # Note that page_views is not copied here; its square root is added in step 1.
    variable = df[['age', 'fpl_value', 'big_club']].copy()
    
    # Step 1. Apply np.sqrt() to the values of page_views
    variable['sqrt_page_views'] = np.sqrt(df['page_views']) #YOUR CODE HERE
    
    # Step 2. Convert the variable into a NumPy array with np.array(variable)
    variable_array = np.array(variable) #YOUR CODE HERE

    # Step 3. Add a column of ones to the matrix X so it can handle the parameter w_0.
    # For this purpose we will use the function PolynomialFeatures from scikit-learn
    variable_array = PolynomialFeatures(1).fit_transform(variable_array)

    return variable_array


### Question set 2.2 PolynomialFeatures function
Investigate the role of the [Polynomial features function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) from scikit-learn. 
1. Why was the order of the polynomial set to one in the `prepare_data` function?
2. Given two features $x_1, x_2$, write down the expression that you would obtain by using the function by setting `degree=2`

__Your answer here:__  
1. _Why was the order of the polynomial set to one in the `prepare_data` function?_  
We essentially want a transform from $(x_1, x_2, x_3, x_4)$ to $(1, x_1, x_2, x_3, x_4)$.
Setting the degree to 1 achieves this: `fit_transform` keeps every existing feature unchanged ($x^1 = x$) and generates no higher-order or interaction terms.
The only addition is the bias column of ones ($x^0$), which is exactly the intended behaviour.

2. _Given two features $x_1, x_2$, write down the expression that you would obtain by using the function by setting `degree=2`_  
Given two features $x_1, x_2$, the function with `degree=2` maps $(x_1, x_2) \to (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, i.e. the bias term, the original features, their squares and their interaction term.  

Now, we execute the function to prepare the data.

In [37]:
#We copy the output label
output_df_train=df_train['market_value'].copy()
#We remove the output label from X
input_df_train=df_train.drop(['market_value'],axis=1)

#process is repeated for test and validation
output_df_val=df_val['market_value'].copy()
input_df_val=df_val.drop(['market_value'],axis=1)

output_df_test=df_test['market_value'].copy()
input_df_test=df_test.drop(['market_value'],axis=1)

#We call prepare_data
X_train = prepare_data(input_df_train)
X_val = prepare_data(input_df_val)
X_test = prepare_data(input_df_test)
y_train = np.array(output_df_train)
y_val = np.array(output_df_val)
y_test = np.array(output_df_test)

We will now proceed to train our first model. In this case, we will use a "home made" implementation of linear regression. When dealing with more complex (and real) applications it is best to use the implementation that can be found in scikit-learn. 

We will define a class called my_linear_regression with four methods:
1. `__init__(self)` : Constructor for the object to assign the object its properties
2. `fit(self, X, y)` : Learning step of linear regression.
3. `predict(self, X)` : predicts new labels $\hat{y}$ given an input X
4. `MSE(self,y_pred, y_test)` : Estimates the mean sum of squared errors between a set of predictions and the ground truth. 
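
For reference, the quantity estimated by the `MSE` method can be written as follows, assuming $N$ samples with ground-truth values $y_i$ and predictions $\hat{y}_i$ (this notation is an assumption chosen to match the model equation above):

$$
\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
$$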


### Task 2.2 Mean sum of squared errors
Implement the MSE function in the class below: 

In [38]:
class my_linear_regression:
    def __init__(self) : # initialize constructor for the object to assign the object its properties
        self.X_train = []
        self.y_train = []
        self.weights = []
        
    def fit(self, X, y) :
        self.X_train = X
        self.y_train = y
        self.weights = np.linalg.solve(X.T@X,X.T@y)
    
    def predict(self,x_test) : # returns the predictions y_hat for the rows of x_test
        self.y_hat=np.sum(x_test*self.weights,axis=1)
        
        return self.y_hat
    
    def MSE(self,y_pred, y_test) :
        #YOUR CODE HERE
        assert len(y_pred) == len(y_test)
        MSE = 1.0/len(y_pred) * np.sum(np.square(np.subtract(y_test, y_pred)))
        #YOUR CODE ENDS HERE
        return MSE
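
The `fit` method solves the normal equations $\mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{y}$. As a sanity check, the sketch below applies the same computation to synthetic data with known weights and compares the result with NumPy's least-squares solver. All data and names here (`true_w`, `w_normal`, `w_lstsq`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with known weights (illustrative values only).
true_w = np.array([1.0, -2.0, 0.5])
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ true_w + 0.01 * rng.normal(size=50)

# Normal equations, as in fit().
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which also copes with a rank-deficient X.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_normal)
print(w_lstsq)

# Matrix-vector form of predict(); equivalent to np.sum(X * w, axis=1).
y_hat = X @ w_normal
mse = np.mean((y - y_hat) ** 2)
print(mse)
```

Both solvers should agree closely here, and the recovered weights should be near `true_w` because the noise is small.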

Now we can train our first model. 

In [39]:
model_1=my_linear_regression()
model_1.fit(X_train,y_train)

print(f'The learned model has parameters:\n{model_1.weights}\n')

The learned model has parameters:
[-15.66271385  -0.16641898   4.45892732   6.28285382   0.18420319]



### Question set 2.3: Interpreting the weights
The estimated weights $\mathbf{w}$ (excluding $w_0$) are associated with 'age', 'fpl_value', 'big_club' and 'page_views' (square root), in that order. 
1. How do you interpret the values of each of these parameters? Based on this information, what can you say about the effect in a player's market value of his: age? number of page views? fpl value?
2. Which of these features seems to have the largest effect on a player's value? 
3. How do you interpret the value obtained for $w_0$?

__Your answers here:__  
1. _How do you interpret the values of each of these parameters? Based on this information, what can you say about the effect in a player's market value of his: age? number of page views? fpl value?_  
Each weight is multiplied with the corresponding feature of a sample, so it gives the change in predicted market value per unit change of that feature while the other features are held fixed. The first value, $w_0$, is a constant offset that does not depend on the sample.
Based on this, age has a slight negative effect on a player's value (-0.17 per year), the number of page views has a slight positive effect (0.18 per unit of the square root of page views) and the FPL value has a strong positive effect (4.46 per unit of `fpl_value`).
The range of each feature matters as well, because the contribution is the weight times the feature value. Page views reach far larger values than age, so even with weights of similar magnitude they can contribute much more to the prediction. Taking the square root of the page views compresses this range.

2. _Which of these features seems to have the largest effect on a player's value?_  
The largest weights belong to `big_club` (6.28) and `fpl_value` (4.46), so these features seem to have the largest effect. Both take rather small values (`big_club` is 0 or 1, `fpl_value` has a mean of about 5.4), so after multiplication their contribution _can_ be surpassed by that of age or page views. Regardless, they have the most significant weights.

3. _How do you interpret the value obtained for $w_0$?_  
$w_0$ is a constant offset of the prediction, so it simply shifts the function along the prediction axis.
A value of about -15.7 means that every player starts out at -15.7 and his attributes then add to (or subtract from) this to give his final predicted value. On its own it is not a meaningful market value, since no player has all features equal to zero.
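
To make the role of each weight concrete, the sketch below splits one prediction into per-feature contributions, using the weights printed above. The player is hypothetical (the feature values are made up), so the resulting number only illustrates the arithmetic.

```python
import numpy as np

# Weights as printed above: [w0, age, fpl_value, big_club, sqrt(page_views)].
weights = np.array([-15.66271385, -0.16641898, 4.45892732, 6.28285382, 0.18420319])

# Hypothetical player (made-up values): 25 years old, fpl_value 6.0,
# plays for a big club, 900 average daily page views.
features = np.array([1.0, 25.0, 6.0, 1.0, np.sqrt(900.0)])

contributions = weights * features
prediction = contributions.sum()

names = ["w0", "age", "fpl_value", "big_club", "sqrt_page_views"]
for name, c in zip(names, contributions):
    print(f"{name:>16}: {c:8.3f}")
print(f"{'prediction':>16}: {prediction:8.3f}")  # about 18.74
```

For this player the `fpl_value` term dominates, and the page-view term (0.18 times 30) is of the same order as the `big_club` term despite its much smaller weight.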

## Exercise 3: Adding categorical features
It is well known that the position in which a football player plays has an impact on his market value. Midfielders and strikers tend to be more expensive. Your goal now is to include this information in the model.

As seen from the description, the player position is encoded as a numeric variable (1, 2, 3, 4). However, these numbers represent categories, not quantities. Categorical variables are commonly encoded with a scheme called 1-of-K encoding, which converts a variable representing K different categories into K different binary variables. Example:

| **attacker**   |  **midfielder**      |  **defender** | **goalkeeper** |
|-------------|-------------|-------------|-------------|
| 1 | 0 | 0 | 0|
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 | 
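
In pandas, this encoding can be produced with `pd.get_dummies`. The sketch below uses a toy DataFrame that borrows the column names `age` and `position_cat` from the dataset; the values are made up.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; values are made up.
df = pd.DataFrame({"age": [28, 35, 22, 31], "position_cat": [1, 4, 3, 2]})

# One binary column per category; prefix='pos' names them pos_1 ... pos_4.
dummies = pd.get_dummies(df["position_cat"], prefix="pos", dtype=int)

# join aligns on the index, so each row receives its own indicator columns.
encoded = df[["age"]].join(dummies)
print(encoded)
```

Each row has exactly one indicator set to 1, which mirrors the table above.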

### Question 3.1: Adding the position to the model
Write down the expression of the model if you consider the position of the player using 1-of-K encoding.

__Your answer here:__

$$
\hat{y} = w_0 + w_1 x_{age} + w_2 x_{fplvalue} + w_3 x_{bigclub} + w_4(x_{pageviews})^{1/2} + w_5 x_{attacker} + w_6 x_{midfielder} + w_7 x_{defender} + w_8 x_{goalkeeper}
$$

### Task 3.1 Preparing data with position features
We need to modify the data preparation function so that it now includes the categorical features. For this purpose, we have implemented the function `prepare_data_with_position(df)`. It has the same functionality as `prepare_data(df)` and additionally generates the 1-of-K encoding. 

Complete the missing code in the function.

In [None]:
def prepare_data_with_position(df):
    variable = df[['age', 'fpl_value', 'big_club']].copy()
    variable['sqrt_page_views'] = np.sqrt(df['page_views']) #YOUR CODE HERE

    variable=variable.join(pd.get_dummies(df.position_cat, prefix='pos')) # get_dummies to create 1-of-K encoding, join to add the new columns
    variable_array = np.array(variable, dtype=float) #YOUR CODE HERE
    variable_array = PolynomialFeatures(1).fit_transform(variable_array)
    
    return variable_array
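
One point that may be worth checking before training: with a bias column and all K indicator columns, the indicators sum to the bias column, so the design matrix is not full rank. `np.linalg.solve`, as used in `fit`, may then fail or return unstable weights, depending on round-off. The sketch below shows the effect on a toy matrix; all values are made up, and dropping one category is only one possible remedy (`pd.get_dummies` offers `drop_first=True` for this).

```python
import numpy as np

# Toy design matrix: bias, one numeric feature, and a full 1-of-4 encoding.
x = np.array([28.0, 35.0, 22.0, 31.0, 30.0, 24.0])
cat = np.array([0, 3, 2, 1, 0, 2])
one_hot = np.eye(4)[cat]
X = np.column_stack([np.ones(len(x)), x, one_hot])

# The four indicator columns sum to the bias column, so X loses one rank.
rank_full = np.linalg.matrix_rank(X)
print(X.shape[1], rank_full)

# Dropping one indicator column (a reference category) restores full rank.
X_reduced = X[:, :-1]
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_reduced.shape[1], rank_reduced)
```

With a reference category dropped, each remaining position weight is read as the difference in predicted value relative to that category.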

### Question 3.2 The get_dummies function
Explain what the following line of code is doing:

`variable=variable.join(pd.get_dummies(df.position_cat, prefix='pos'))`

### Task 3.2 Train the new model
Your task now is to train the new model. For this you will need to execute the following steps: 
1. Prepare all your data (train, validation and testing). 
2. Create a new `my_linear_regression` object and store it in a variable named `model_2`
3. Run the learning process
4. For inspection purposes, print out the obtained weights.

**Important:** While preparing the data, make sure you do not overwrite the previous data used for model_1

In [None]:
#Your code here


### Question 3.3 Value of the position
Based on the obtained weights, does it seem as if the position of the player has an important role in his market value?

Your answer here:

## Exercise 4: Choosing a model
We will now use the validation set to choose between the two models we have built so far. 

### Task 4.1 MSE estimation
Using the validation data, estimate the MSE for each of the two models that you have built so far. For this you will need to: 
1. Predict labels for the validation set using each of the trained models.
2. Call the MSE function from either of the two models (they are equivalent).
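
The general pattern can be sketched on synthetic data, independently of the football dataset. Everything below is made up for illustration: the helper functions `fit` and `mse`, the two feature sets and the 150/50 split. Two models are trained on the same training rows and compared on held-out validation rows.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit(X, y):
    # Least-squares weights (same solution as the normal equations).
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(y_pred, y_true):
    return np.mean((y_true - y_pred) ** 2)

# Synthetic data: the target depends on both features (illustrative only).
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 3.0 * x2 + 0.1 * rng.normal(size=n)

# Model A sees only x1; model B sees x1 and x2.
XA = np.column_stack([np.ones(n), x1])
XB = np.column_stack([np.ones(n), x1, x2])

# First 150 rows for training, last 50 for validation.
tr, va = slice(0, 150), slice(150, n)
wA, wB = fit(XA[tr], y[tr]), fit(XB[tr], y[tr])

mse_A = mse(XA[va] @ wA, y[va])
mse_B = mse(XB[va] @ wB, y[va])
print(mse_A, mse_B)  # the model with the lower validation MSE is preferred
```

The weights are fitted on the training rows only; the validation rows are used solely to compare the models.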

In [None]:
#------------YOUR CODE HERE ------------

#------------ YOUR CODE ENDS HERE ---------

print(f'MSE model 1 :\n{mse_1}\n')
print(f'MSE model 2 :\n{mse_2}\n')

### Question set 4.1 Analysis
1. Based on the obtained results, which model would you choose?
2. Is the position feature useful to improve the model? 

## Exercise 5: Model testing
Use the test dataset to evaluate the generalization capabilities of the **model you chose** in the previous step. For this you need to:
1. Predict the labels of the test set
2. Estimate the MSE. Please note that other metrics, such as the RSS, could be used as well.

In [None]:
#------------YOUR CODE HERE ------------

#------------ YOUR CODE ENDS HERE ---------

print(f'MSE test:\n{mse}\n')

### Question 5.1 Analysis
Based on the previous result, what can you say about your model? Do you consider it makes sufficiently accurate predictions? Feel free to implement other metrics if you consider you need further information. Examples: RSS, Root Mean Squared Error or Mean Absolute Error. 
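
The additional metrics mentioned above can be computed in a few lines. This sketch uses made-up targets and predictions, not results from the exercise:

```python
import numpy as np

# Made-up targets and predictions, purely for illustration.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

residuals = y_true - y_pred
rss = np.sum(residuals ** 2)        # residual sum of squares
mse = rss / len(y_true)             # mean squared error
rmse = np.sqrt(mse)                 # root mean squared error
mae = np.mean(np.abs(residuals))    # mean absolute error

print(rss, mse, rmse, mae)  # 3.5 0.875 0.935... 0.75
```

RMSE and MAE are expressed in the same units as the target, which makes them easier to compare with typical values of `market_value` than the MSE.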

Your answer here: 