# Data cleaning and logistic regression

We will use the **FIFA 22** dataset. [Link](https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?resource=download&select=FIFA22_official_data.csv)

> The dataset contains +17k unique players and more than 60 columns, general information and all KPIs the famous videogame offers. As the esport scene keeps rising, especially on FIFA, I thought it can be useful for the community (kagglers and/or gamers)
>
>Context
>
>The data was retrieved thanks to a crawler that I implemented to retrieve:
>
> - Aggregated data such as name of the players, age, country
> - Detailed data such as offensive potential, defense, acceleration
>
> I like football a lot and this dataset is for me the opportunity to bring my contribution for the realization of projects that can go from simple analysis to elaboration of strategies on optimal composition under constraints…


The goal of this **mini** assignment is to predict each player's position using four of the available features. Two linear classifiers must be trained: **Logistic Regression** and **Perceptron**.


## Importing libraries

As always, the first step is to import the libraries we will use. **You may need to import additional libraries to complete the exercise.**

In [1]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model

import pandas as pd
import numpy as np

## Reading the data

We read the data with the ``pandas`` library. The file is the one you downloaded from Kaggle.

In [2]:
org = pd.read_csv("./dades.csv")
org.head()

Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,...,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Position,Best Overall Rating,Release Clause,DefensiveAwareness
0,212198,Bruno Fernandes,26,https://cdn.sofifa.com/players/212/198/22_60.png,Portugal,https://cdn.sofifa.com/flags/pt.png,88,89,Manchester United,https://cdn.sofifa.com/teams/11/30.png,...,65.0,12.0,14.0,15.0,8.0,14.0,CAM,88.0,€206.9M,72.0
1,209658,L. Goretzka,26,https://cdn.sofifa.com/players/209/658/22_60.png,Germany,https://cdn.sofifa.com/flags/de.png,87,88,FC Bayern München,https://cdn.sofifa.com/teams/21/30.png,...,77.0,13.0,8.0,15.0,11.0,9.0,CM,87.0,€160.4M,74.0
2,176580,L. Suárez,34,https://cdn.sofifa.com/players/176/580/22_60.png,Uruguay,https://cdn.sofifa.com/flags/uy.png,88,88,Atlético de Madrid,https://cdn.sofifa.com/teams/240/30.png,...,38.0,27.0,25.0,31.0,33.0,37.0,ST,88.0,€91.2M,42.0
3,192985,K. De Bruyne,30,https://cdn.sofifa.com/players/192/985/22_60.png,Belgium,https://cdn.sofifa.com/flags/be.png,91,91,Manchester City,https://cdn.sofifa.com/teams/10/30.png,...,53.0,15.0,13.0,5.0,10.0,13.0,CM,91.0,€232.2M,68.0
4,224334,M. Acuña,29,https://cdn.sofifa.com/players/224/334/22_60.png,Argentina,https://cdn.sofifa.com/flags/ar.png,84,84,Sevilla FC,https://cdn.sofifa.com/teams/481/30.png,...,82.0,8.0,14.0,13.0,13.0,14.0,LB,84.0,€77.7M,80.0


## EDA: *Exploratory data analysis*


We drop the unnecessary columns.

In [3]:
org.drop(['ID', 'Photo', 'Flag', 'Club Logo',], axis=1, inplace=True)
pd.set_option('display.max_columns', None)
org.head()


Unnamed: 0,Name,Age,Nationality,Overall,Potential,Club,Value,Wage,Special,Preferred Foot,International Reputation,Weak Foot,Skill Moves,Work Rate,Body Type,Real Face,Position,Jersey Number,Joined,Loaned From,Contract Valid Until,Height,Weight,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Position,Best Overall Rating,Release Clause,DefensiveAwareness
0,Bruno Fernandes,26,Portugal,88,89,Manchester United,€107.5M,€250K,2341,Right,3.0,3.0,4.0,High/ High,Unique,Yes,"<span class=""pos pos18"">CAM",18.0,"Jan 30, 2020",,2025,179cm,69kg,87.0,83.0,64.0,91.0,87.0,83.0,87.0,87.0,88.0,87.0,77.0,73.0,80.0,91.0,79.0,89.0,73.0,91.0,70.0,89.0,78.0,66.0,87.0,90.0,91.0,87.0,,73.0,65.0,12.0,14.0,15.0,8.0,14.0,CAM,88.0,€206.9M,72.0
1,L. Goretzka,26,Germany,87,88,FC Bayern München,€93M,€140K,2314,Right,4.0,4.0,3.0,High/ Medium,Unique,Yes,"<span class=""pos pos11"">LDM",8.0,"Jul 1, 2018",,2026,189cm,82kg,75.0,82.0,86.0,86.0,69.0,84.0,76.0,75.0,84.0,87.0,78.0,83.0,76.0,88.0,71.0,85.0,79.0,88.0,88.0,86.0,81.0,86.0,85.0,84.0,60.0,82.0,,85.0,77.0,13.0,8.0,15.0,11.0,9.0,CM,87.0,€160.4M,74.0
2,L. Suárez,34,Uruguay,88,88,Atlético de Madrid,€44.5M,€135K,2307,Right,5.0,4.0,3.0,High/ Medium,Unique,Yes,"<span class=""pos pos24"">RS",9.0,"Sep 25, 2020",,2022,182cm,83kg,80.0,93.0,84.0,83.0,90.0,83.0,86.0,82.0,77.0,86.0,76.0,69.0,75.0,92.0,78.0,89.0,69.0,78.0,85.0,88.0,87.0,41.0,91.0,84.0,83.0,87.0,,45.0,38.0,27.0,25.0,31.0,33.0,37.0,ST,88.0,€91.2M,42.0
3,K. De Bruyne,30,Belgium,91,91,Manchester City,€125.5M,€350K,2304,Right,4.0,5.0,4.0,High/ High,Unique,Yes,"<span class=""pos pos13"">RCM",17.0,"Aug 30, 2015",,2025,181cm,70kg,94.0,82.0,55.0,94.0,82.0,88.0,85.0,83.0,93.0,91.0,76.0,76.0,79.0,91.0,78.0,91.0,63.0,89.0,74.0,91.0,76.0,66.0,88.0,94.0,83.0,89.0,,65.0,53.0,15.0,13.0,5.0,10.0,13.0,CM,91.0,€232.2M,68.0
4,M. Acuña,29,Argentina,84,84,Sevilla FC,€37M,€45K,2292,Left,2.0,3.0,4.0,High/ High,Stocky (170-185),No,"<span class=""pos pos7"">LB",19.0,"Sep 14, 2020",,2024,172cm,69kg,87.0,66.0,58.0,82.0,68.0,87.0,88.0,75.0,78.0,88.0,77.0,76.0,83.0,83.0,90.0,82.0,63.0,90.0,80.0,81.0,84.0,79.0,81.0,82.0,76.0,87.0,,84.0,82.0,8.0,14.0,13.0,13.0,14.0,LB,84.0,€77.7M,80.0


In [4]:
df = org[['Best Position', 'Overall', 'Value', 'Skill Moves', 'BallControl']]
df.head()

Unnamed: 0,Best Position,Overall,Value,Skill Moves,BallControl
0,CAM,88,€107.5M,4.0,87.0
1,CM,87,€93M,3.0,87.0
2,ST,88,€44.5M,3.0,86.0
3,CM,91,€125.5M,4.0,91.0
4,LB,84,€37M,4.0,88.0


In [5]:
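def converter(x):
    # Drop the leading € sign
    x = x[1:]

    # 'M' marks millions and 'K' thousands of euros
    if 'M' in x:
        return float(x.strip('M')) * 1_000_000
    if 'K' in x:
        return float(x.strip('K')) * 1_000
    # No suffix (e.g. "€0"): the amount is already in euros
    return float(x)


df['Value'] = df['Value'].apply(converter)
df.head()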


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Value'] = df['Value'].apply(converter)


Unnamed: 0,Best Position,Overall,Value,Skill Moves,BallControl
0,CAM,88,107500000.0,4.0,87.0
1,CM,87,93000000.0,3.0,87.0
2,ST,88,44500000.0,3.0,86.0
3,CM,91,125500000.0,4.0,91.0
4,LB,84,37000000.0,4.0,88.0
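
The `SettingWithCopyWarning` printed by this cell appears because `df` was built by selecting columns from `org`, so pandas cannot tell whether the assignment is also meant to affect `org`. One common way to avoid it, sketched here under the assumption that `df` should be independent of `org`, is to take an explicit copy when selecting the columns. The names `org_demo` and `df_demo` and the toy values are illustrative only.

```python
import pandas as pd

# Toy stand-in for `org`; the values are illustrative, not the real dataset
org_demo = pd.DataFrame({
    "Best Position": ["CAM", "CM", "ST"],
    "Value": ["€107.5M", "€350K", "€0"],
    "Overall": [88, 70, 55],
})

# .copy() makes df_demo an independent DataFrame, so assigning to one of
# its columns does not touch org_demo and does not raise the warning
df_demo = org_demo[["Best Position", "Value"]].copy()
df_demo["Value"] = df_demo["Value"].str.lstrip("€")

print(df_demo)
```

In the notebook this would amount to appending `.copy()` to the column selection that creates `df`.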


In [6]:
df['Best Position'].value_counts()

CB     3214
ST     2549
CAM    2203
GK     1546
RM     1313
CDM    1262
CM     1011
RB      856
LB      813
LM      745
RW      335
RWB     310
LWB     289
LW      182
CF       82
Name: Best Position, dtype: int64
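
The counts above show that the classes are far from balanced, which matters when the data is later split and the classifiers are scored. As a small sketch, the same counts can be turned into proportions. The Series below is typed in by hand from the output above; on the real data, `df['Best Position'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "CB": 3214, "ST": 2549, "CAM": 2203, "GK": 1546, "RM": 1313,
    "CDM": 1262, "CM": 1011, "RB": 856, "LB": 813, "LM": 745,
    "RW": 335, "RWB": 310, "LWB": 289, "LW": 182, "CF": 82,
})

# Share of each position over all labelled players
shares = counts / counts.sum()
print(shares.round(3))
```

The most frequent position (CB) covers roughly 19% of the players, while the rarest (CF) covers about 0.5%.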

In [7]:

df.dropna(subset=['Best Position'], inplace=True)  # drop rows where the position is NaN

dict_pos = {}

for value, key in enumerate(df["Best Position"].unique()):
    dict_pos[key] = value
    

df.replace({"Best Position": dict_pos}, inplace=True)

Equivalencias = pd.DataFrame(list(dict_pos.items()), columns = ['Posicion','Valor numerico'])

pd.set_option("display.max_rows", None, "display.max_columns", None)

Equivalencias.head()




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(subset = ['Best Position'], inplace=True) #eliminam els NaN
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.replace({"Best Position": dict_pos}, inplace=True)


Unnamed: 0,Posicion,Valor numerico
0,CAM,0
1,CM,1
2,ST,2
3,LB,3
4,CDM,4
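
The loop in this cell assigns each position an integer in order of first appearance. As an alternative sketch, `pd.factorize` produces the same kind of encoding in one call. The toy Series and the name `dict_pos_demo` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy labels in the same order as the first rows of the dataset
positions = pd.Series(["CAM", "CM", "ST", "CM", "LB"], name="Best Position")

# codes[i] is the integer assigned to positions[i];
# uniques[k] is the label that code k stands for
codes, uniques = pd.factorize(positions)

# Rebuild a label -> code dictionary equivalent to dict_pos
dict_pos_demo = {label: code for code, label in enumerate(uniques)}

print(codes)
print(dict_pos_demo)
```

By default `pd.factorize` encodes missing values as `-1`, so dropping the NaN rows beforehand, as the cell does, keeps every code non-negative.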


In [8]:
pd.set_option("display.max_rows", 10, "display.max_columns", None)
#print df
df.head()

Unnamed: 0,Best Position,Overall,Value,Skill Moves,BallControl
0,0,88,107500000.0,4.0,87.0
1,1,87,93000000.0,3.0,87.0
2,2,88,44500000.0,3.0,86.0
3,1,91,125500000.0,4.0,91.0
4,3,84,37000000.0,4.0,88.0


## Training
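
One possible way to train the two classifiers is sketched below. It runs on a small synthetic table with the same five columns as `df`; the random values are placeholders, not FIFA data. The names `df_demo` and `models`, the 33% test split, the `stratify` option, and the `StandardScaler` step are all assumptions of this sketch. Scaling is included because `Value` is several orders of magnitude larger than the other features.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder with the same columns as `df` (NOT real FIFA data)
rng = np.random.default_rng(0)
n = 300
df_demo = pd.DataFrame({
    "Best Position": rng.integers(0, 3, size=n),
    "Overall": rng.integers(50, 95, size=n),
    "Value": rng.uniform(1e5, 1e8, size=n),
    "Skill Moves": rng.integers(1, 6, size=n).astype(float),
    "BallControl": rng.uniform(20, 95, size=n),
})

# Features are the four chosen columns; the target is the encoded position
X = df_demo.drop(columns="Best Position")
y = df_demo["Best Position"]

# Hold out a test set, keeping class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Each model standardises the features before fitting the linear classifier
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), linear_model.LogisticRegression(max_iter=1000)
    ),
    "perceptron": make_pipeline(
        StandardScaler(), linear_model.Perceptron(random_state=42)
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "trained on", len(X_train), "samples")
```

On the real data, any rows with missing values in the four feature columns would have to be dropped or imputed first, since these estimators do not accept NaN.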

## Evaluation
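
A minimal sketch of how predictions could be scored, assuming accuracy and a confusion matrix are the metrics of interest. The two label lists are hand-written stand-ins for the held-out labels and a model's predictions, not results from the dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels standing in for the true and predicted positions
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 2, 1, 2]

# Fraction of samples whose predicted label equals the true one
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy = {acc:.3f}")
print(cm)
```

With classes as imbalanced as the positions in this dataset, the confusion matrix is often more informative than accuracy alone, because it shows which positions are confused with each other.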