# Data processing 2

## Contents <a id=ov>
7. [Plotting](#plt)
8. [Statsmodels](#sm)
9. [Sklearn](#sklearn)




In [None]:
import numpy as np
import pandas as pd
# Import data from an Excel file
df=pd.read_excel('top_500_football_players.xlsx',sheet_name='final')

df['Market value'] = [float(value.replace('€','').replace('m','')) for value in df['Market value']]

df.columns=[col.replace(' ','_') for col in df.columns]
print(df)
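
The same cleaning steps can also be written with pandas' vectorized string methods instead of list comprehensions. The following is a minimal sketch on a small made-up sample (the values are illustrative and not taken from the dataset), assuming every entry has the form `€<number>m`:

```python
import pandas as pd

# Hypothetical sample in the same format as the 'Market value' column
sample = pd.DataFrame({'Market value': ['€180.00m', '€90.50m', '€12.00m']})

# Strip the currency symbol and the 'm' suffix, then convert to float
sample['Market value'] = (
    sample['Market value']
    .str.replace('€', '', regex=False)
    .str.replace('m', '', regex=False)
    .astype(float)
)

# Replace spaces in the column names so they can be used in formulas later
sample.columns = sample.columns.str.replace(' ', '_', regex=False)
print(sample)
```

After the conversion the values are plain floats measured in million €.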

## Plotting <a id=plt>
[Back to Content Overview](#ov)

The most popular library for creating plots in Python is Matplotlib:

In [None]:
import sys
!{sys.executable} -m pip install matplotlib
import matplotlib.pyplot as plt

### Standard plot

In [None]:
df_most_valuable=df.drop_duplicates(subset=['Age']).sort_values(by=['Age'])
print(df_most_valuable)
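
Note that `drop_duplicates` keeps the first row it finds for each age, so the result is the most valuable player per age only if `df` is already sorted by market value in descending order. A sketch of an order-independent alternative with `groupby` and `idxmax`, using a small made-up sample (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample
players = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [21, 23, 21, 23],
    'Market_value': [50.0, 80.0, 70.0, 60.0],
})

# Row label of the highest market value within each age group
idx = players.groupby('Age')['Market_value'].idxmax()

# Select those rows and sort by age for plotting
most_valuable = players.loc[idx].sort_values(by='Age')
print(most_valuable)
```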

In [None]:
# Plot the market value of the most valuable player of each age.
plt.plot(df_most_valuable['Age'],df_most_valuable['Market_value'])
plt.title('Most valuable player')
plt.savefig('Most valuable player.pdf')

In [None]:
# Jupyter shows plots automatically
plt.show()

<span style="color:blue"><b>Task:</b></span> Plot the mean market value of the players of each age.

In [None]:




plt.title('Mean of Market Value per Age')
plt.savefig('Mean of Market Value per Age.pdf')

# Add title
plt.title('Most valuable player')

In [None]:
# Add legend
plt.plot(df_most_valuable['Age'],
         df_most_valuable['Market_value'],
         label='Market value')
plt.legend()

In [None]:
# Change design of the line
plt.plot(df_most_valuable['Age'],
         df_most_valuable['Market_value'],
         label='Most valuable Player',
         linewidth=2.0,
         color='red',
         linestyle='--')

In [None]:
# Add labels to the axes
plt.xlabel('Age')
plt.ylabel('Value in million €')
plt.title('Most valuable player')
plt.legend()
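
With Jupyter's inline plotting, each cell typically draws its own figure, so the line, labels, title and legend need to be set in the same cell to appear together. A minimal sketch using Matplotlib's object-oriented interface; the numbers are hypothetical stand-ins for `df_most_valuable`:

```python
import matplotlib
matplotlib.use('Agg')  # only needed outside Jupyter; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical ages and market values (in million €)
ages = [18, 20, 22, 24, 26]
values = [40.0, 90.0, 150.0, 120.0, 100.0]

# Create one figure with one axes object and style everything on it
fig, ax = plt.subplots()
ax.plot(ages, values, label='Most valuable player',
        linewidth=2.0, color='red', linestyle='--')
ax.set_xlabel('Age')
ax.set_ylabel('Value in million €')
ax.set_title('Most valuable player')
ax.legend()
```

The `ax.set_...` methods correspond to `plt.xlabel`, `plt.ylabel` and `plt.title`, but always act on the axes object they are called on.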

### Bar Plot

In [None]:
plt.bar(df_most_valuable['Age'],
        df_most_valuable['Market_value'],
        color='g')

### Scatter Plot

In [None]:
plt.scatter(df['Age'],df['Market_value'])
plt.plot(df_most_valuable['Age'],
         df_most_valuable['Market_value'],
         label='Market value',
         linewidth=2.0,
         color='red',
         linestyle='--')
plt.xlabel('Age')
plt.ylabel('Value in million €')

### Histogram

In [None]:
plt.hist(df['Age'])
plt.title('Age Distribution')

In [None]:
plt.hist(df['Market_value'],bins=25)
plt.title('Market value Distribution')

In [None]:
plt.hist(df['Market_value'],bins=25,log=True)
plt.title('Market value Distribution')

In [None]:
plt.hist(df[['Goals','Assists']],bins=15,label=['Goals','Assists'])
plt.legend()
plt.title('Scorer Points Distribution')

In [None]:
plt.hist(df[['Goals','Assists']],bins=15,histtype='barstacked',label=['Goals','Assists'])
plt.legend()
plt.title('Scorer Points Distribution')

### Subplots

In [None]:
plt.figure(1)
plt.subplot(211)
plt.plot(df_most_valuable['Age'],df_most_valuable['Market_value'])
plt.title('Market_value')
plt.subplot(212)
plt.hist(df['Age'],bins=15)
plt.title('Age Distribution')

plt.tight_layout()
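
The code `211` means: 2 rows, 1 column, first panel. The same grid can be created in one call with `plt.subplots`, which returns the figure and an array of axes. A sketch with randomly generated ages (not the football data):

```python
import matplotlib
matplotlib.use('Agg')  # only needed outside Jupyter; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical ages drawn at random
rng = np.random.default_rng(0)
ages = rng.integers(17, 38, size=200)

# Two rows, one column: axes[0] is the upper panel, axes[1] the lower one
fig, axes = plt.subplots(nrows=2, ncols=1)
axes[0].plot(np.sort(ages))
axes[0].set_title('Sorted ages')
axes[1].hist(ages, bins=15)
axes[1].set_title('Age Distribution')
fig.tight_layout()
```

For a 2x2 grid, `plt.subplots(2, 2)` returns a two-dimensional array that is indexed as `axes[row, col]`.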


<span style="color:blue"><b>Task:</b></span> Plot the median age for every position in a bar plot!

<span style="color:blue"><b>Task:</b></span> Plot the mean of the market value for every age in a line plot and in a scatter plot!

<span style="color:blue"><b>Task:</b></span> Plot the yellow and red cards in a stacked histogram for the 200 most valuable players!

<span style="color:blue"><b>Task:</b></span> Show the market value histogram of each age quartile in a 2x2 subplot figure!

In [None]:
def quantile_plot(plot_series: str,
                  filter_series: str,
                  n_shape: tuple,
                  file=None,
                  bins=None,
                  y_lim=100):
    # TODO (exercise): implement the function body
    pass


In [None]:
quantile_plot('Market_value','Age',n_shape=(1,1),bins=20)


## Statsmodels <a id=sm>
[Back to Content Overview](#ov)

With statsmodels you can estimate regressions using simple R-style formulas:

In [None]:
import sys
!{sys.executable} -m pip install statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
df['Age2']=df['Age']**2
res=smf.ols('Market_value ~ Age + Age2 + Goals',data=df).fit()
print(res.summary())
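
Instead of creating the column `Age2` beforehand, the square can also be written directly in the formula by wrapping it in `I(...)`, which tells the formula parser to evaluate the expression as ordinary Python arithmetic. A sketch on synthetic data with a known quadratic age profile (the data and the name `demo` are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: value = -200 + 20*Age - 0.4*Age^2 + noise
rng = np.random.default_rng(0)
demo = pd.DataFrame({'Age': rng.integers(17, 38, size=300).astype(float)})
demo['Market_value'] = (-200 + 20 * demo['Age'] - 0.4 * demo['Age'] ** 2
                        + rng.normal(0, 1, size=300))

# I(Age ** 2) adds the squared term without an extra column in the DataFrame
res_demo = smf.ols('Market_value ~ Age + I(Age ** 2)', data=demo).fit()
print(res_demo.params)
```

The estimated parameters should be close to the values used to generate the data.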

In [None]:
#Get the estimated parameters
print(res.params)

In [None]:
#Get the standard errors
print(res.bse)

In [None]:
#Get the predicted values
print(res.predict())

<span style="color:blue"><b>Task:</b></span> Plot a histogram of the residuals and compare it with normally distributed values of the same mean and variance.

<span style="color:blue"><b>Task:</b></span> Make a scatter plot of the market value and the residuals.

In [None]:
res=smf.ols('np.log(Market_value) ~ Age + Goals',data=df).fit()
print(res.summary())

In [None]:
res=smf.ols('Market_value ~ Age + Goals + C(Position)',data=df).fit()
print(res.summary())

In [None]:
# Log-Model with position dummies
res=smf.ols('np.log(Market_value) ~ Age + Goals + C(Position)',data=df).fit()
print(res.summary())
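
In a model with `np.log(Market_value)` as the dependent variable, a coefficient `b` is read as a relative change: one more unit of the regressor changes the market value by roughly `100 * b` percent, and by exactly `100 * (exp(b) - 1)` percent. A small sketch with a made-up coefficient (not an estimate from the dataset):

```python
import numpy as np

b = 0.05  # hypothetical coefficient from a log-linear model

approx_pct = 100 * b                # approximation, reasonable for small b
exact_pct = 100 * (np.exp(b) - 1)   # exact relative change in percent
print(f'approx: {approx_pct:.2f}%, exact: {exact_pct:.2f}%')
```

The two numbers drift apart as `b` grows, so the exact formula is the safer choice for large coefficients such as position dummies.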

<span style="color:blue"><b>Task:</b></span> Estimate a linear model with age and position dummies.

<span style="color:blue"><b>Task:</b></span> Use a logit model to estimate the probability that a player is an offender. Use the explanatory variables that make sense to you!

In [None]:
df['is_offender'] = 

<span style="color:blue"><b>Task:</b></span> Create a confusion matrix manually. Calculate the accuracy of the model.

<span style="color:blue"><b>Task:</b></span> Use a balanced dataset (the same number of offenders and non-offenders in the training dataset) and re-estimate the logit model.

## Estimation with matrices

In [None]:
# You can also get the corresponding matrices for y and X
import patsy
var_string='Age + Goals + Yellow_cards + Yellow_red_cards + Red_cards + Substitutions_on + Substitutions_off + Age2'

y, X = patsy.dmatrices(f'Market_value ~ {var_string}', df, return_type="matrix")

print(y,X)

One can also estimate with matrices:

In [None]:
res=sm.OLS(y, X).fit()
print(res.summary())
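
Behind the scenes, OLS solves the normal equations `(X'X) beta = X'y`. The following sketch reproduces this with plain NumPy on synthetic data; `X_demo`, `y_demo` and `beta_true` are made-up names and values, and the first column of ones plays the role of the intercept column that patsy adds:

```python
import numpy as np

# Synthetic design matrix: constant column plus two regressors
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X'X) beta = X'y
beta_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Numerically more robust alternative based on a least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Both approaches give the same coefficients here; the least-squares solver is usually preferable when regressors are strongly correlated.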

<span style="color:blue"><b>Task:</b></span> Split the dataset into random halves and evaluate the out-of-sample performance of the models. (Extra: Build an n-fold cross-validation algorithm.)

In [None]:
#Make vector with 250 zeros and 250 ones

#Randomize the order of the vector

#Split y and X into halves with conditional indexing on index_vector

#Train the model

#Predict the model with test data

#Calculate the in sample MSE

#Calculate the out of sample MSE

<span style="color:blue"><b>Task:</b></span> Utilize the function from the last task and calculate the out-of-sample performance for all possible models when excluding one explanatory variable at a time. (Extra: Add and remove regressors depending on whether they improve or worsen the performance until you find the "optimal" model.)

## Sklearn <a id=sklearn>
[Back to Content Overview](#ov)

Scikit-learn (sklearn) is a powerful library for machine learning that offers many algorithms, including regressions, through a common interface:

In [None]:
import sys
!{sys.executable} -m pip install scikit-learn

### Regressions
##### OLS

In [None]:
from sklearn.linear_model import LinearRegression, Lasso, LassoCV

In [None]:
model=LinearRegression().fit(X, y)

print(model.get_params())

In [None]:
# Get the estimated coefficients:
print(model.coef_)

In [None]:
# Predict:
print(model.predict(X))

In [None]:
# Get the R-Squared:
print(model.score(X,y))
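
The matrix `X` created by patsy already contains a constant column for the intercept, while `LinearRegression` fits its own intercept by default. The constant column is then redundant: its coefficient is zero and the intercept is stored in `intercept_`. A sketch on synthetic data (all names and values are made up) that compares the default with `fit_intercept=False`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3 + 2*x + noise, with an explicit constant column
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
X_demo = np.column_stack([np.ones(n), x])
y_demo = 3.0 + 2.0 * x + rng.normal(scale=0.1, size=n)

# Default: sklearn estimates the intercept itself, the constant column gets 0
m_default = LinearRegression().fit(X_demo, y_demo)
print(m_default.intercept_, m_default.coef_)

# Alternative: let the constant column carry the intercept
m_no_icpt = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)
print(m_no_icpt.intercept_, m_no_icpt.coef_)
```

Either variant gives the same fitted values; what differs is where the intercept is reported.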

##### Lasso

In [None]:
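# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, standardize X beforehand and drop the argument.
model=Lasso(normalize=True,alpha=0.03).fit(X, y)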

In [None]:
# Get estimated coefs:
print(model.coef_)
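
On scikit-learn versions without the `normalize` argument, a common replacement is a pipeline that standardizes the regressors before the Lasso step. The sketch below uses synthetic data and an arbitrarily chosen `alpha`; note that `StandardScaler` scales differently than `normalize=True` did, so alpha values are not directly comparable between the two approaches:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regressors on very different scales; only the first two matter
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 5)) * np.array([1, 10, 100, 1, 1])
y_demo = 2.0 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(size=n)

# Standardize each column, then fit the Lasso on the scaled regressors
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
pipe.fit(X_demo, y_demo)

# Coefficients refer to the standardized regressors
coefs = pipe.named_steps['lasso'].coef_
print(coefs)
```

Because the pipeline stores the scaling, `pipe.predict` and `pipe.score` can be called on unscaled data.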

<span style="color:blue"><b>Task:</b></span> Create a dict with the names of the variables and their estimated parameters.

Lasso selects fewer regressors as alpha increases:

In [None]:
for alpha in np.linspace(0,0.4,9):
    model=Lasso(normalize=True,alpha=alpha).fit(X, y)
    r2=model.score(X,y)
    non_0=sum([i!=0 for i in model.coef_])
    print(f'{round(alpha,2)=}, {r2=}, {non_0=}')
    # X starts with the intercept column, so skip coef_[0] to align names and coefficients
    print({name:round(coef,2) for name,coef in zip(var_string.split(' + '),model.coef_[1:]) if coef!=0})


In [None]:
model=LassoCV(normalize=True).fit(X, y)
alpha=model.alpha_
r2=model.score(X,y)
non_0=sum([i!=0 for i in model.coef_])
print(f'{round(alpha,2)=}, {r2=}, {non_0=}')
# X starts with the intercept column, so skip coef_[0] to align names and coefficients
print({name:round(coef,2) for name,coef in zip(var_string.split(' + '),model.coef_[1:]) if coef!=0})


##### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor as RF

In [None]:
# np.ravel turns the column vector y into the 1-D array sklearn expects
model=RF(random_state=0).fit(X,np.ravel(y))

In [None]:
# Get the R-Squared:
print(model.score(X,y))
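
The R-squared above is computed on the same data the forest was trained on, which tends to be very optimistic for random forests. A sketch of an out-of-sample check with scikit-learn's `train_test_split`, on synthetic data (the data-generating process is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: two informative regressors, two pure noise regressors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(size=500)

# Hold out half of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
r2_in = rf.score(X_train, y_train)   # in-sample R-squared
r2_out = rf.score(X_test, y_test)    # out-of-sample R-squared
print(f'{r2_in=:.2f}, {r2_out=:.2f}')
```

The gap between the two values gives a rough idea of how much the in-sample fit overstates predictive performance.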

In [None]:
# Get the importance of each regressor.
print(model.feature_importances_)

print({name:round(imp,2) for name, imp in zip(var_string.split(' + '),model.feature_importances_[1:])})

<span style="color:blue"><b>Task:</b></span> Make a pie plot with feature importances.