<img src="BL.png">
# Preprocessing of the FIFA 18 Dataset for Training a Player Classifier
## Importing the complete dataset and roughly filtering out irrelevant dimensions
This notebook covers the preprocessing for a player classifier that takes a set of player attributes and suggests the position the player should play.
First, we sort out dimensions that obviously have no connection to a player's strength in a given position, e.g. club logo, league, shirt number, ...

Our starting point is a dataset with 185(!) different attributes per player. We will reduce it to 35 that might have an impact on a player's best position. Moreover, we suspect that the data for players in tin-pot leagues might be of poor quality, so we will keep about 4,500 of the 17,000 players by selecting those who play in well-known European leagues.
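
The league selection described above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual data: the player names, the `Hypothetical League` entry and the shortened `top_leagues` list are assumptions made up for the example.

```python
import pandas as pd

# Toy stand-in for the full dataset; names and values are illustrative only.
players = pd.DataFrame({
    "full_name": ["Player A", "Player B", "Player C", "Player D"],
    "league": ["German Bundesliga", "Hypothetical League", None, "Italian Serie A"],
    "overall": [80, 61, 58, 77],
})

# Assumed, shortened list of leagues to keep.
top_leagues = ["German Bundesliga", "Italian Serie A"]

# isin() is False for missing leagues, so rows without a league are dropped too.
mask = players["league"].isin(top_leagues)

# drop=True discards the old row labels instead of keeping them as a column.
filtered = players[mask].reset_index(drop=True)

print(filtered)
```

Calling `reset_index()` without `drop=True`, as the notebook does, keeps the old row labels in an extra column named `index`.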

We start with rough filtering:

In [26]:
# pandas is used for data wrangling
import pandas as pd
df = pd.read_csv('data/complete.csv')

#DROP COLUMNS THAT CARRY NO MEANING FOR THE CLASSIFICATION TASK (e.g. club logo, shirt number, ...)
#First cut off the last 91 columns, then drop the remaining irrelevant columns by name
dataset = df.iloc[:, :-91]
to_drop = ['international_reputation', 'potential','stamina','strength','skill_moves', 'weak_foot', 'work_rate_att', 'work_rate_def', 'preferred_foot','gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_reflexes', 'ID','birth_date','special','photo','eur_value','eur_wage','eur_release_clause', 'name','club_logo','body_type', 'real_face','flag','nationality']
dataset = dataset.drop(columns=to_drop)

#DROP PLAYERS OF TIN-POT LEAGUES (data quality might be poor here...)
#NOTE: the filter below is currently commented out, so all players are kept
#dataset = dataset[df['league'].isin(['Spanish Primera División', 'French Ligue 1', 'German Bundesliga','English Premier League', 'Italian Serie A', 'Turkish Süper Lig','Portuguese Primeira Liga','German 2. Bundesliga'])]
#dataset = dataset.dropna(subset=['league'])
dataset = dataset.reset_index()

#EXTRACT THE POSITION WITH THE BEST RATING AND STORE IT IN THE NEW COLUMN 'BestPos':
dataset['BestPos'] = dataset[['rs', 'rw', 'rf', 'ram', 'rcm', 'rm', 'rdm', 'rcb', 'rb', 'rwb', 'st', 'lw', 'cf', 'cam', 'cm', 'lm', 'cdm', 'cb', 'lb', 'lwb', 'ls', 'lf', 'lam', 'lcm', 'ldm', 'lcb', 'gk']].idxmax(axis=1)


dataset

#MISCELLANEOUS STEPS USED IN THIS SECTION

#a) look up all leagues in dataset:
#leagues = pd.unique(dataset['league'])

#b) list of column names:
#list(dataset.columns.values)




| | index | full_name | club | age | league | height_cm | weight_kg | overall | pac | sho | ... | lb | lwb | ls | lf | lam | lcm | ldm | lcb | gk | BestPos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | C. Ronaldo dos Santos Aveiro | Real Madrid CF | 32 | Spanish Primera División | 185.0 | 80.0 | 94 | 90 | 93 | ... | 61.0 | 66.0 | 92.0 | 91.0 | 89.0 | 82.0 | 62.0 | 53.0 | NaN | rs |
| 1 | 1 | Lionel Messi | FC Barcelona | 30 | Spanish Primera División | 170.0 | 72.0 | 93 | 89 | 90 | ... | 57.0 | 62.0 | 88.0 | 92.0 | 92.0 | 84.0 | 59.0 | 45.0 | NaN | rf |
| 2 | 2 | Neymar da Silva Santos Jr. | Paris Saint-Germain | 25 | French Ligue 1 | 175.0 | 68.0 | 92 | 92 | 84 | ... | 59.0 | 64.0 | 84.0 | 88.0 | 88.0 | 79.0 | 59.0 | 46.0 | NaN | rw |
| 3 | 3 | Luis Suárez | FC Barcelona | 30 | Spanish Primera División | 182.0 | 86.0 | 92 | 82 | 90 | ... | 64.0 | 68.0 | 88.0 | 88.0 | 87.0 | 80.0 | 65.0 | 58.0 | NaN | rs |
| 4 | 4 | Manuel Neuer | FC Bayern Munich | 31 | German Bundesliga | 193.0 | 92.0 | 92 | 91 | 90 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 92.0 | gk |
| 5 | 5 | Robert Lewandowski | FC Bayern Munich | 28 | German Bundesliga | 185.0 | 79.0 | 91 | 81 | 88 | ... | 58.0 | 61.0 | 88.0 | 87.0 | 84.0 | 78.0 | 62.0 | 57.0 | NaN | rs |
| 6 | 6 | David De Gea Quintana | Manchester United | 26 | English Premier League | 193.0 | 76.0 | 90 | 90 | 85 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 90.0 | gk |
| 7 | 7 | Eden Hazard | Chelsea | 26 | English Premier League | 173.0 | 76.0 | 90 | 90 | 82 | ... | 59.0 | 64.0 | 82.0 | 87.0 | 88.0 | 81.0 | 61.0 | 47.0 | NaN | rw |
| 8 | 8 | Toni Kroos | Real Madrid CF | 27 | Spanish Primera División | 182.0 | 78.0 | 90 | 56 | 81 | ... | 76.0 | 78.0 | 77.0 | 81.0 | 83.0 | 87.0 | 82.0 | 72.0 | NaN | rcm |
| 9 | 9 | Gonzalo Higuaín | Juventus | 29 | Italian Serie A | 184.0 | 87.0 | 90 | 79 | 87 | ... | 51.0 | 55.0 | 87.0 | 84.0 | 81.0 | 71.0 | 52.0 | 46.0 | NaN | rs |
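
The `BestPos` column relies on `idxmax(axis=1)`, which returns, for each row, the name of the column holding the largest value. The following is a small sketch of that behaviour on made-up ratings; the row labels and numbers are assumptions for illustration, and only a few of the position columns are used.

```python
import pandas as pd

nan = float("nan")

# Made-up ratings; outfield players have no 'gk' rating and keepers have only 'gk'.
ratings = pd.DataFrame(
    {
        "st": [88.0, nan, 60.0, 70.0],
        "cm": [75.0, nan, 82.0, 70.0],
        "cb": [50.0, nan, 70.0, 60.0],
        "gk": [nan, 90.0, nan, nan],
    },
    index=["forward", "keeper", "midfielder", "tie"],
)

# Missing values are skipped by default; on ties the first matching column wins.
best_pos = ratings.idxmax(axis=1)

print(best_pos)
```

Because ties resolve to the first column in the list, the column order passed to `idxmax` matters. This may be why right-sided labels such as `rs` appear in the table above when a player's left and right ratings are equal.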


## Mapping of best positions onto four classes
Since we are looking for a first impression of how well a classifier can perform on this dataset, we aggregate the detailed positions into fewer classes. For example, 'right striker', 'left striker', 'striker', 'centre forward', ... are all mapped to a single attack class.
We create four classes, labelled in German in the code: 'Torwart' (goalkeeper), 'Verteidigung' (defense), 'Mittelfeld' (midfield) and 'Sturm' (attack):


In [27]:
#THIS DICT CONTAINS THE MAPPING RULES
position_mapping = {'rs':'Sturm',
                    'rw':'Sturm',
                    'rf':'Sturm',
                    'ram':'Mittelfeld',
                    'rcm':'Mittelfeld',
                    'rm':'Mittelfeld',
                    'rdm':'Mittelfeld',
                    'rcb':'Verteidigung',
                    'rb':'Verteidigung',
                    'rwb':'Verteidigung',
                    'st':'Sturm',
                    'lw':'Sturm',
                    'cf':'Sturm',
                    'cam':'Mittelfeld',
                    'cm':'Mittelfeld',
                    'lm':'Mittelfeld',
                    'cdm':'Mittelfeld',
                    'cb':'Verteidigung',
                    'lb':'Verteidigung',
                    'lwb':'Verteidigung',
                    'ls':'Sturm',
                    'lf':'Sturm',
                    'lam':'Mittelfeld',
                    'lcm':'Mittelfeld',
                    'ldm':'Mittelfeld',
                    'lcb':'Verteidigung',
                    'gk':'Torwart'}


#MAPPING FUNCTION TAKES VALUES OF 'BestPos' AND RETURNS VALUES OF MAPPING, e.g. map_position('rcm') = 'Mittelfeld',
#WHERE rcm MEANS RIGHT CENTRAL MIDFIELD; UNKNOWN VALUES RETURN None
def map_position(x):
    return position_mapping.get(x)

#APPLICATION OF map_position 
dataset['BestPos'] = dataset['BestPos'].apply(map_position)


#SPLIT dataset INTO A TABLE WITH PLAYER INFORMATION (INDEX, NAME, CLUB, AGE, LEAGUE) AND THE INDEPENDENT VARIABLES FOR TRAINING THE CLASSIFIER
player_info = dataset.iloc[:, :5]
X = dataset.iloc[:, 5:-29]
#EXTRACT THE DEPENDENT VARIABLE (LAST COLUMN, 'BestPos') AND RENAME IT TO 'Position'
y = dataset.iloc[:, -1]
y = pd.DataFrame(y)
y.columns = ['Position']
y

| | Position |
|---|---|
| 0 | Sturm |
| 1 | Sturm |
| 2 | Sturm |
| 3 | Sturm |
| 4 | Torwart |
| 5 | Sturm |
| 6 | Torwart |
| 7 | Sturm |
| 8 | Mittelfeld |
| 9 | Sturm |
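
As an alternative to a custom mapping function, the same aggregation can be sketched with `Series.map`, which accepts a dict directly and makes unmapped codes easy to spot. The shortened mapping and the unknown code `xx` below are assumptions made for the example.

```python
import pandas as pd

# Shortened mapping for illustration; the class names follow the notebook.
position_mapping = {
    "st": "Sturm",
    "cm": "Mittelfeld",
    "cb": "Verteidigung",
    "gk": "Torwart",
}

# "xx" is a deliberately unknown position code.
best_pos = pd.Series(["st", "gk", "cm", "xx"])

# Codes that are missing from the dict are mapped to NaN.
classes = best_pos.map(position_mapping)

# Collect the original codes that could not be mapped.
unmapped = best_pos[classes.isna()]

print(classes.tolist())
print(unmapped.tolist())
```

Checking `unmapped` before training is a cheap safeguard: a missing class label would otherwise end up silently in `y`.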


## Export

In [28]:
#WRITE X, y AND player_info TO CSV FILES
X.to_csv('X.csv')
y.to_csv('y.csv')
player_info.to_csv('player_info.csv')
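
When the exported files are loaded again, for example in a training notebook, note that `to_csv` writes the row index as an unnamed first column by default. The sketch below shows one way to read the files back, using toy frames and a temporary directory; both are assumptions made so the example stays self-contained.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the exported tables.
X = pd.DataFrame({"pac": [90, 89], "sho": [93, 90]})
y = pd.DataFrame({"Position": ["Sturm", "Sturm"]})

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X.csv")
    y_path = os.path.join(tmp, "y.csv")
    X.to_csv(x_path)
    y.to_csv(y_path)

    # index_col=0 turns the first CSV column back into the row index.
    X_loaded = pd.read_csv(x_path, index_col=0)
    y_loaded = pd.read_csv(y_path, index_col=0)

    # Without index_col, the saved index appears as a column named "Unnamed: 0".
    X_raw = pd.read_csv(x_path)

print(X_loaded.columns.tolist())
print(X_raw.columns.tolist())
```

Alternatively, passing `index=False` to `to_csv` avoids writing the index in the first place, provided the row order alone is enough to align `X`, `y` and `player_info`.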

