# Tutorial
#### This tutorial introduces the functionality of *fifa_preprocessing*!
In general, the following functions will allow you to preprocess your data for machine learning or statistical data analysis by reformatting, casting or deleting certain values.

The data used in these examples comes from https://www.kaggle.com/karangadiya/fifa19, the data set that inspired this package. The module's functions work best with this data set, but they will work with any data structured in a similar manner.

## Prerequisites

First, import the fifa_preprocessing, pandas and math modules:

In [1]:
import fifa_preprocessing as fp
import pandas as pd
import math

Load your data:

In [2]:
data = pd.read_csv('data.csv')

In [3]:
data

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18202,18202,238813,J. Lundstram,19,https://cdn.sofifa.org/players/4/19/238813.png,England,https://cdn.sofifa.org/flags/14.png,47,65,Crewe Alexandra,...,45.0,40.0,48.0,47.0,10.0,13.0,7.0,8.0,9.0,€143K
18203,18203,243165,N. Christoffersson,19,https://cdn.sofifa.org/players/4/19/243165.png,Sweden,https://cdn.sofifa.org/flags/46.png,47,63,Trelleborgs FF,...,42.0,22.0,15.0,19.0,10.0,9.0,9.0,5.0,12.0,€113K
18204,18204,241638,B. Worman,16,https://cdn.sofifa.org/players/4/19/241638.png,England,https://cdn.sofifa.org/flags/14.png,47,67,Cambridge United,...,41.0,32.0,13.0,11.0,6.0,5.0,10.0,6.0,13.0,€165K
18205,18205,246268,D. Walker-Rice,17,https://cdn.sofifa.org/players/4/19/246268.png,England,https://cdn.sofifa.org/flags/14.png,47,66,Tranmere Rovers,...,46.0,20.0,25.0,27.0,14.0,6.0,14.0,8.0,9.0,€143K


## Exclude goalkeepers
Before any preprocessing, the data contains all the players.

In [4]:
data[['Name', 'Position']]

Unnamed: 0,Name,Position
0,L. Messi,RF
1,Cristiano Ronaldo,ST
2,Neymar Jr,LW
3,De Gea,GK
4,K. De Bruyne,RCM
...,...,...
18202,J. Lundstram,CM
18203,N. Christoffersson,ST
18204,B. Worman,ST
18205,D. Walker-Rice,RW


This command will exclude goalkeepers from your data set (i.e. delete all the rows where column 'Position' is equal to 'GK'):

In [5]:
data = fp.exclude_goalkeepers(data)

As you may notice, row number 3 (De Gea, a goalkeeper) was deleted.

In [6]:
data[['Name', 'Position']]

Unnamed: 0,Name,Position
0,L. Messi,RF
1,Cristiano Ronaldo,ST
2,Neymar Jr,LW
4,K. De Bruyne,RCM
5,E. Hazard,LF
...,...,...
18202,J. Lundstram,CM
18203,N. Christoffersson,ST
18204,B. Worman,ST
18205,D. Walker-Rice,RW
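
The same filtering can be reproduced with plain pandas. The following is a minimal sketch, assuming the function simply drops the rows whose 'Position' equals 'GK'. The helper `drop_goalkeepers` and the small `players` frame are hypothetical and not part of *fifa_preprocessing*:

```python
import pandas as pd

# Hypothetical miniature of the data set, for illustration only.
players = pd.DataFrame({
    'Name': ['L. Messi', 'Cristiano Ronaldo', 'Neymar Jr', 'De Gea', 'K. De Bruyne'],
    'Position': ['RF', 'ST', 'LW', 'GK', 'RCM'],
})


def drop_goalkeepers(frame):
    """Return a copy of the frame without rows whose 'Position' is 'GK'."""
    return frame[frame['Position'] != 'GK'].copy()


outfield = drop_goalkeepers(players)
print(list(outfield.index))  # [0, 1, 2, 4]
```

Boolean filtering keeps the original index, which is why row 3 is missing from the output rather than the rows being renumbered.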


## Format currencies
To remove unnecessary characters from a monetary value, use:

In [7]:
money = '€23.4M'
fp.money_format(money)

23400

The value will be expressed in thousands of euros:

In [8]:
money = '€7K'
fp.money_format(money)

7
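
To make the conversion concrete, here is a rough sketch of how such a parser could work. It assumes only the 'M' (millions) and 'K' (thousands) suffixes seen in the examples above. The name `money_to_thousands` is hypothetical and this is not the package's actual implementation:

```python
from decimal import Decimal

# Assumed suffixes: 'M' for millions and 'K' for thousands of euros.
_THOUSANDS_PER_UNIT = {'M': 1000, 'K': 1}


def money_to_thousands(text):
    """Convert a string such as '€23.4M' to an integer number of thousands of euros."""
    cleaned = text.strip().lstrip('€')
    if not cleaned or cleaned[-1] not in _THOUSANDS_PER_UNIT:
        raise ValueError(f'unsupported monetary value: {text!r}')
    # Decimal avoids binary floating-point rounding surprises.
    return int(Decimal(cleaned[:-1]) * _THOUSANDS_PER_UNIT[cleaned[-1]])


print(money_to_thousands('€23.4M'))  # 23400
print(money_to_thousands('€7K'))     # 7
```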

## Format players' rating
In FIFA, players get a rating for their skills on the pitch. The rating is represented as a sum of two integers.
The following function lets you take in a string containing two numbers separated by a '+' and get the actual sum:

In [9]:
rating = '81+3'
fp.rating_format(rating)

84
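
The arithmetic behind this is simple enough to sketch in a few lines. The helper `sum_rating` below is a hypothetical stand-in, shown only to illustrate the idea:

```python
def sum_rating(text):
    """Add up the integers in a string such as '81+3'."""
    return sum(int(part) for part in text.split('+'))


print(sum_rating('81+3'))  # 84
```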

## Format players' work rate
The next function takes in a qualitative parameter that can be expressed as a quantitative value.
If you have a data set where one category is expressed as 'High', 'Medium' or 'Low', this function will assign numbers to these values (2, 1 and 0 respectively):

In [10]:
fp.work_format('High')

2

In [11]:
fp.work_format('Medium')

1

In [12]:
fp.work_format('Low')

0

In fact, the function returns 0 whenever the passed-in parameter is anything other than 'High' or 'Medium':


In [13]:
fp.work_format('Mediocre')

0
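
The behaviour shown above matches a dictionary lookup with a default of 0. The following sketch illustrates that idea. The name `work_score` is hypothetical and the package may be implemented differently:

```python
# Only 'High' and 'Medium' get a non-zero score; everything else maps to 0.
_WORK_SCORES = {'High': 2, 'Medium': 1}


def work_score(level):
    """Map a work-rate label to a number, falling back to 0."""
    return _WORK_SCORES.get(level, 0)


print(work_score('High'))      # 2
print(work_score('Mediocre'))  # 0
```

One consequence of the fallback is that misspelled labels silently become 0, so it may be worth checking the distinct values in a column before converting it.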

## Cast to int
This simple function casts a float to int, but also adds extra flexibility and returns 0 when it encounters a NaN (Not a Number):

In [14]:
fp.to_int(3.24)

3

In [15]:
import numpy
nan = numpy.nan
fp.to_int(nan)

0
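
A NaN-safe cast like this can be sketched with the standard `math` module. The sketch assumes the fractional part is truncated, which is consistent with 3.24 becoming 3. The name `float_to_int` is hypothetical:

```python
import math


def float_to_int(value):
    """Truncate a float to int, mapping NaN to 0."""
    if math.isnan(value):
        return 0
    return int(value)


print(float_to_int(3.24))          # 3
print(float_to_int(float('nan')))  # 0
```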

## Apply format of choice
This generic function lets you choose what format to apply to every value in the columns of the data frame you specify.

In [16]:
data[['Name', 'Jersey Number', 'Skill Moves', 'Weak Foot']]

Unnamed: 0,Name,Jersey Number,Skill Moves,Weak Foot
0,L. Messi,10.0,4.0,4.0
1,Cristiano Ronaldo,7.0,5.0,4.0
2,Neymar Jr,10.0,5.0,5.0
4,K. De Bruyne,7.0,4.0,5.0
5,E. Hazard,10.0,4.0,4.0
...,...,...,...,...
18202,J. Lundstram,22.0,2.0,2.0
18203,N. Christoffersson,21.0,2.0,2.0
18204,B. Worman,33.0,2.0,3.0
18205,D. Walker-Rice,34.0,2.0,3.0


Here, a format means a function that operates on the values in the specified columns:

In [17]:
columns = ['Jersey Number', 'Skill Moves', 'Weak Foot']
format_fun = fp.to_int

data = fp.apply_format(data, columns, format_fun)

data[['Name', 'Jersey Number', 'Skill Moves', 'Weak Foot']]

Unnamed: 0,Name,Jersey Number,Skill Moves,Weak Foot
0,L. Messi,10,4,4
1,Cristiano Ronaldo,7,5,4
2,Neymar Jr,10,5,5
4,K. De Bruyne,7,4,5
5,E. Hazard,10,4,4
...,...,...,...,...
18202,J. Lundstram,22,2,2
18203,N. Christoffersson,21,2,2
18204,B. Worman,33,2,3
18205,D. Walker-Rice,34,2,3
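
For readers curious how a generic helper of this kind might look, here is a small pandas sketch. It assumes the formatter is applied value by value to each listed column. `apply_to_columns`, `nan_safe_int` and the sample frame are hypothetical and not taken from the package:

```python
import math

import pandas as pd


def nan_safe_int(value):
    """Hypothetical formatter: truncate to int, mapping NaN to 0."""
    return 0 if math.isnan(value) else int(value)


def apply_to_columns(frame, columns, func):
    """Return a copy of the frame with func applied to every value in the given columns."""
    result = frame.copy()
    for column in columns:
        result[column] = result[column].apply(func)
    return result


players = pd.DataFrame({
    'Name': ['L. Messi', 'Cristiano Ronaldo'],
    'Jersey Number': [10.0, 7.0],
    'Skill Moves': [4.0, float('nan')],
})
formatted = apply_to_columns(players, ['Jersey Number', 'Skill Moves'], nan_safe_int)
print(formatted['Skill Moves'].tolist())  # [4, 0]
```

Passing the formatter as an argument keeps the helper reusable: any function that takes a single value and returns a single value can be plugged in.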


## Dummy variables
If we intend to build machine learning models to explore our data, we usually cannot extract information from qualitative data directly. Here 'Club' and 'Preferred Foot' are categories that could carry interesting information. To use them in our machine learning algorithms, we can convert them to dummy variables.

In [18]:
data[['Name', 'Preferred Foot']]

Unnamed: 0,Name,Preferred Foot
0,L. Messi,Left
1,Cristiano Ronaldo,Right
2,Neymar Jr,Right
4,K. De Bruyne,Right
5,E. Hazard,Right
...,...,...
18202,J. Lundstram,Right
18203,N. Christoffersson,Right
18204,B. Worman,Right
18205,D. Walker-Rice,Right


If we choose 'Preferred Foot', new columns will be added, named after the values in the 'Preferred Foot' column: 'Left' and 'Right'. So now, instead of seeing 'Left' in the 'Preferred Foot' column, we will see 1 in the 'Left' column (and 0 in 'Right').

In [19]:
data = fp.to_dummy(data, ['Preferred Foot'])
data[['Name', 'Left', 'Right']]

Unnamed: 0,Name,Left,Right
0,L. Messi,1,0
1,Cristiano Ronaldo,0,1
2,Neymar Jr,0,1
4,K. De Bruyne,0,1
5,E. Hazard,0,1
...,...,...,...
18202,J. Lundstram,0,1
18203,N. Christoffersson,0,1
18204,B. Worman,0,1
18205,D. Walker-Rice,0,1


Learn more about [dummy variables](https://en.wikiversity.org/wiki/Dummy_variable_(statistics)).

The data frame will no longer contain the columns we transformed:

In [20]:
'Preferred Foot' in data

False
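
The transformation above, including dropping the original column, can be approximated with `pandas.get_dummies`. This is a sketch under the assumption that the new columns are named exactly after the category values. `add_dummies` and the sample frame are hypothetical:

```python
import pandas as pd


def add_dummies(frame, columns):
    """Replace each listed column with one 0/1 column per distinct value."""
    result = frame.copy()
    for column in columns:
        dummies = pd.get_dummies(result[column], dtype=int)
        result = pd.concat([result.drop(columns=column), dummies], axis=1)
    return result


players = pd.DataFrame({
    'Name': ['L. Messi', 'Cristiano Ronaldo', 'Neymar Jr'],
    'Preferred Foot': ['Left', 'Right', 'Right'],
})
encoded = add_dummies(players, ['Preferred Foot'])
print(encoded.columns.tolist())  # ['Name', 'Left', 'Right']
print(encoded['Left'].tolist())  # [1, 0, 0]
```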

We can get dummy variables for multiple columns at once.

In [21]:
data[['Name', 'Club', 'Position']]

Unnamed: 0,Name,Club,Position
0,L. Messi,FC Barcelona,RF
1,Cristiano Ronaldo,Juventus,ST
2,Neymar Jr,Paris Saint-Germain,LW
4,K. De Bruyne,Manchester City,RCM
5,E. Hazard,Chelsea,LF
...,...,...,...
18202,J. Lundstram,Crewe Alexandra,CM
18203,N. Christoffersson,Trelleborgs FF,ST
18204,B. Worman,Cambridge United,ST
18205,D. Walker-Rice,Tranmere Rovers,RW


In [22]:
data = fp.to_dummy(data, ['Club', 'Nationality'])
data[['Name', 'Paris Saint-Germain', 'Manchester City', 'Brazil', 'Portugal']]

Unnamed: 0,Name,Paris Saint-Germain,Manchester City,Brazil,Portugal
0,L. Messi,0,0,0,0
1,Cristiano Ronaldo,0,0,0,1
2,Neymar Jr,1,0,1,0
4,K. De Bruyne,0,1,0,0
5,E. Hazard,0,0,0,0
...,...,...,...,...,...
18202,J. Lundstram,0,0,0,0
18203,N. Christoffersson,0,0,0,0
18204,B. Worman,0,0,0,0
18205,D. Walker-Rice,0,0,0,0


## Split work rate column
In FIFA, the players' work rate is stored as two qualitative values separated by a slash:

In [23]:
data[['Name', 'Work Rate']]

Unnamed: 0,Name,Work Rate
0,L. Messi,Medium/ Medium
1,Cristiano Ronaldo,High/ Low
2,Neymar Jr,High/ Medium
4,K. De Bruyne,High/ High
5,E. Hazard,High/ Medium
...,...,...
18202,J. Lundstram,Medium/ Medium
18203,N. Christoffersson,Medium/ Medium
18204,B. Worman,Medium/ Medium
18205,D. Walker-Rice,Medium/ Medium


This next function allows you to split the 'Work Rate' column into 'Defensive Work Rate' and 'Offensive Work Rate'. As the output shows, the values are also converted to numbers:

In [24]:
data = fp.split_work_rate(data)
data[['Name', 'Defensive Work Rate', 'Offensive Work Rate']]

Unnamed: 0,Name,Defensive Work Rate,Offensive Work Rate
0,L. Messi,1,1
1,Cristiano Ronaldo,2,0
2,Neymar Jr,2,1
4,K. De Bruyne,2,2
5,E. Hazard,2,1
...,...,...,...
18202,J. Lundstram,1,1
18203,N. Christoffersson,1,1
18204,B. Worman,1,1
18205,D. Walker-Rice,1,1
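
For a single value, the split can be sketched as follows. The sketch assumes the same scoring as in the work rate section (2 for 'High', 1 for 'Medium', 0 otherwise) and returns the two scores in the order they appear in the string. In the output above, the first score lands in 'Defensive Work Rate' and the second in 'Offensive Work Rate'. The name `split_work_rate_value` is hypothetical:

```python
_WORK_SCORES = {'High': 2, 'Medium': 1}


def split_work_rate_value(text):
    """Split a string such as 'High/ Low' into two scores, in the order they appear."""
    first, second = (part.strip() for part in text.split('/'))
    return _WORK_SCORES.get(first, 0), _WORK_SCORES.get(second, 0)


print(split_work_rate_value('High/ Low'))       # (2, 0)
print(split_work_rate_value('Medium/ Medium'))  # (1, 1)
```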


## Default preprocessing
To perform all the basic preprocessing (optimized for the FIFA 19 data set) on your data, simply run:

In [25]:
data = pd.read_csv('data.csv')
fp.preprocess(data)

Unnamed: 0,Age,Overall,Potential,Value,Wage,Special,International Reputation,Weak Foot,Skill Moves,Jersey Number,...,United Arab Emirates,United States,Uruguay,Uzbekistan,Venezuela,Wales,Zambia,Zimbabwe,Defensive Work Rate,Offensive Work Rate
0,31,94,94,110500,565,2202,5,4,4,10,...,0,0,0,0,0,0,0,0,1,1
1,33,94,94,77000,405,2228,5,4,5,7,...,0,0,0,0,0,0,0,0,2,0
2,26,92,93,118500,290,2143,5,5,5,10,...,0,0,0,0,0,0,0,0,2,1
4,27,91,92,102000,355,2281,4,5,4,7,...,0,0,0,0,0,0,0,0,2,2
5,27,91,91,93000,340,2142,4,4,4,10,...,0,0,0,0,0,0,0,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18202,19,47,65,60,1,1307,1,2,2,22,...,0,0,0,0,0,0,0,0,1,1
18203,19,47,63,60,1,1098,1,2,2,21,...,0,0,0,0,0,0,0,0,1,1
18204,16,47,67,60,1,1189,1,3,2,33,...,0,0,0,0,0,0,0,0,1,1
18205,17,47,66,60,1,1228,1,3,2,34,...,0,0,0,0,0,0,0,0,1,1


## Let's get coding
Now it is your turn to try out our functions and preprocess your data!