# FIFA 21 IRONHACK COMPETITION

# PART (I)

**Link to repo: https://github.com/ironhack-edu/data_project_FIFA_21**


You will use the fifa21_trainning.csv dataset provided to predict the position ('OVA') of each player. The competition will take place from monday morning to tuesday. 
<br><br>
Your model will be saved in a pickle file.
<br><br>
The ranking of the competitors will be calculated according to the highest Mean Average Error (MAE), rounded to 2 decimals.
<br><br>
Ties will be broken using, respectively: R2 Score (rounded to 2 decimals), Root Mean Squared Error (rounded to 2 decimals), time to run the code (using timeit)
<br>

## DELIVERABLES:

Your group should deliver a `group Jupyter notebook` with all the preprocessing functions alongside with the model.

Everything must be delivered until 12am on Tuesday. 
<br><br>
Be prepared to share your work on Tuesday morning, the best scores will have the opportunity to show their notebook and go through their pipeline (~10 min).
<br><br>

To deliver:
* A notebook with your work and model (group_number.ipynb);
* Pickle file with the model (group_number.pkl). 
<br><br>

The instructor will use your `group Jupyter notebook` to load a new dataset and use your functions and
your model to make a prediction in unseen data.


<br><br>

For this small project you are going to work in groups to put in practice some of the concepts of the previous week.

With your group mates, open the file in `file_for_project/fifa21_training.csv`. The objective is to create the best linear model to predict the column `OVA`.

You can find some documentation about the meaning of each column in the following links:

- [link - 0](https://sofifa.com/)
- [link - 1](https://gaming.stackexchange.com/questions/167318/what-do-fifa-14-position-acronyms-mean)
- [link - 2](https://www.fifauteam.com/fifa-ultimate-team-positions-and-tactics/)

### 1

Each member of the team should have his/her own _juypter_ notebook. In addition, each group should have a `group jupyter notebook`.

### 2

Decide which columns can be predictive and which ones can be directly dropped and take the needed actions.

### 3

Decide among the members of the group who is going to take care of inspecting the remaining columns
of the dataset. For example:
Member 1: cols 1 -> 5
Member 2: cols 6 -> 10
...
and so on

### 4

Each member must do:

- Explore their assigned columns and write python code to perform any cleanup operation that the assigned columns may need.
- Perform any scaling operation that the assigned column may need.

### 5

Put all the code of each member into the `group jupyter notebook`.

In [325]:
import pandas as pd
import numpy as np
import math
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
data = pd.read_csv("./file_for_project/fifa21_training.csv")
df = pd.DataFrame(data)
list(data.columns)
import pickle


In [305]:
data.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Nationality,Club,BP,Position,Team & Contract,Height,...,CDM,RDM,RWB,LB,LCB,CB,RCB,RB,GK,OVA
0,1954,184383,A. Pasche,26,Switzerland,FC Lausanne-Sport,CM,CM CDM,FC Lausanne-Sport 2015 ~ 2020,"5'9""",...,59+1,59+1,59+1,58+1,54+1,54+1,54+1,58+1,15+1,64
1,2225,188044,Alan Carvalho,30,China PR,Beijing Sinobo Guoan FC,ST,ST LW LM,"Beijing Sinobo Guoan FC Dec 31, 2020 On Loan","6'0""",...,53+2,53+2,57+2,53+2,48+2,48+2,48+2,53+2,18+2,77
2,1959,184431,S. Giovinco,33,Italy,Al Hilal,CAM,CAM CF,Al Hilal 2019 ~ 2022,"5'4""",...,56+2,56+2,59+2,53+2,41+2,41+2,41+2,53+2,12+2,80
3,9815,233796,J. Evans,22,Wales,Swansea City,CDM,CDM CM,Swansea City 2016 ~ 2021,"5'10""",...,58+2,58+2,56+2,57+2,58+2,58+2,58+2,57+2,14+2,59
4,10074,234799,Y. Demoncy,23,France,US Orléans Loiret Football,CDM,CDM CM,US Orléans Loiret Football 2018 ~ 2021,"5'11""",...,64+2,64+2,64+2,63+2,61+2,61+2,61+2,63+2,15+2,65


In [306]:
data['ID']

0        184383
1        188044
2        184431
3        233796
4        234799
          ...  
13695    239074
13696    241223
13697    210930
13698    162993
13699    254882
Name: ID, Length: 13700, dtype: int64

In [307]:
data.columns = df.columns.str.lower().str.replace(' ','_')
clean_columns = []
clean_columns = df[['name','value','growth','base_stats','total_stats','ir','ova','pac','sho','pas','dri','def','phy']].copy()
clean_columns

Unnamed: 0,name,value,growth,base_stats,total_stats,ir,ova,pac,sho,pas,dri,def,phy
0,A. Pasche,€525K,1,357,1682,1 ★,64,69,51,63,63,51,60
1,Alan Carvalho,€8.5M,0,412,1961,2 ★,77,83,75,68,82,33,71
2,S. Giovinco,€9M,0,404,1925,2 ★,80,80,77,78,86,27,56
3,J. Evans,€275K,13,329,1527,1 ★,59,57,44,54,57,57,60
4,Y. Demoncy,€725K,8,360,1664,1 ★,65,66,44,60,64,60,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13695,S. Aw,€325K,11,315,1443,1 ★,60,76,28,46,55,53,57
13696,S. Mogi,€190K,9,318,928,1 ★,59,60,55,57,62,30,54
13697,Carles Gil,€8M,0,388,1867,2 ★,76,65,69,78,77,39,60
13698,J. Perch,€140K,0,346,1639,1 ★,63,53,47,58,58,61,69


<font color = 'magenta'>
Una vez hemos escogido las columnas más importantes según nuestro criterio, procedemos a limpiar los caracteres especiales <br>
como la estrella en la reputación internacional(ir) o el simbolo de euro y unidad de cantidad(M(millón) o K(mil)) en el valor de mercado de jugador.
 </font>

In [308]:
clean_columns['value'] = clean_columns['value'].str.replace('€','')
clean_columns['value'] = clean_columns['value'].str.replace('K','000')
clean_columns['value'] = clean_columns['value'].str.replace('M','000000')
clean_columns['value'] = clean_columns['value'].str.replace('.','')
clean_columns['ir'] = clean_columns['ir'].str.replace('★','')
clean_columns

  clean_columns['value'] = clean_columns['value'].str.replace('.','')


Unnamed: 0,name,value,growth,base_stats,total_stats,ir,ova,pac,sho,pas,dri,def,phy
0,A. Pasche,525000,1,357,1682,1,64,69,51,63,63,51,60
1,Alan Carvalho,85000000,0,412,1961,2,77,83,75,68,82,33,71
2,S. Giovinco,9000000,0,404,1925,2,80,80,77,78,86,27,56
3,J. Evans,275000,13,329,1527,1,59,57,44,54,57,57,60
4,Y. Demoncy,725000,8,360,1664,1,65,66,44,60,64,60,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13695,S. Aw,325000,11,315,1443,1,60,76,28,46,55,53,57
13696,S. Mogi,190000,9,318,928,1,59,60,55,57,62,30,54
13697,Carles Gil,8000000,0,388,1867,2,76,65,69,78,77,39,60
13698,J. Perch,140000,0,346,1639,1,63,53,47,58,58,61,69


<font color='magenta'>
    El objetivo de eliminar los símbolos y unidades ha sido transformar las columnas categóricas en numéricas. Para finalizar este proceso, necesitamos transformar el tipo de variable de 'objetct' como puede ser value o ir a 'int'. Una vez tenemos las columnas en el formato deseado, necesitamos comprobar que no nos hemos 'cargado' ninguna muestra mirando los valores nulos.
 </font>

In [317]:
clean_columns['log_value'] = [np.log(x) for x in clean_columns['value'] ]
clean_columns['value'] = clean_columns['value'].astype(float)
clean_columns['value'] = clean_columns['log_value']
clean_columns['value'] = clean_columns['value'].dropna()
clean_columns['ir'] = clean_columns['ir'].astype(int)

TypeError: ufunc 'log' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [318]:
clean_columns.isnull().sum()

name           0
value          0
growth         0
base_stats     0
total_stats    0
ir             0
ova            0
pac            0
sho            0
pas            0
dri            0
def            0
phy            0
dtype: int64

<font color = 'magenta'>
Aqui finaliza la parte 1

In [319]:
numeric_columns = clean_columns[['value','growth','base_stats','total_stats','ir','ova','pac','sho','pas','dri','def','phy']]

In [320]:
StandardTransformer = StandardScaler().fit(numeric_columns)
x_standarized = StandardTransformer.transform(numeric_columns)
#comprobamos el tamaño
print(x_standarized.shape)
x_standarized = pd.DataFrame(x_standarized, columns = numeric_columns.columns)
y = numeric_columns['ova']
print(y.shape)
X = numeric_columns.drop(columns=['ova'])
print(X.shape)




(13700, 12)
(13700,)
(13700, 11)


In [321]:
X_train, X_test, y_train,y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
r2_score(y_test, predictions)



0.8153506292528897

In [322]:
mse = mean_squared_error(y_test,predictions)
print(mse)

9.17856150186342


In [323]:
rmse = math.sqrt(mse)
print(rmse)

3.029614084642369


In [324]:
r2 = r2_score(y_test, predictions)
r2

0.8153506292528897

In [326]:
filename = 'fifa_project'
outfile = open(filename,'wb')
pickle.dump(r2_score,outfile)
outfile.close()