Jupyter slideshow: This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:

> jupyter nbconvert EDASlides.ipynb --to slides --post serve

To toggle off the slideshow cell formatting, click the CellToolbar button, then View --> Cell Toolbar --> None.

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Exploratory Data Analysis (EDA)

_Jonathan | Jon | Ibrahim_

---

### Objectives

- Data Collection
- Data Cleaning
- Data Visualisation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

%matplotlib inline

### Data Collection

Data collection is the process of gathering information in an established, systematic way so that hypotheses can be tested and outcomes evaluated easily.

In [None]:
fifa = pd.read_csv('./DataSets/FIFA18.csv', low_memory=False)

In [None]:
fifa.head()

After loading the data, we need to check the data type of each feature.

Features generally fall into the following types:

- numeric
- categorical
- ordinal
- datetime
- coordinates
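As a rough illustration of the feature types listed above, the sketch below builds a small frame with one column per type and tells pandas about the types it cannot infer from raw text. The frame `toy` and all of its column names and values are hypothetical and are not taken from the FIFA file.

```python
import pandas as pd

# Toy data (hypothetical) with one column per feature type.
toy = pd.DataFrame({
    "age": [21, 30, 27],                                   # numeric
    "position": ["GK", "ST", "CB"],                        # categorical
    "rating_band": ["low", "high", "medium"],              # ordinal (order matters)
    "joined": ["2015-07-01", "2012-01-15", "2017-08-30"],  # datetime stored as text
    "lat": [51.5, 40.4, 48.9],                             # coordinates
    "lon": [-0.1, -3.7, 2.4],
})

# Declare the types that cannot be inferred from raw text.
toy["position"] = toy["position"].astype("category")
toy["rating_band"] = pd.Categorical(
    toy["rating_band"], categories=["low", "medium", "high"], ordered=True
)
toy["joined"] = pd.to_datetime(toy["joined"])

print(toy.dtypes)
# An ordered categorical supports order-based operations such as max().
print(toy["rating_band"].max())
```

Declaring the ordinal column as an ordered categorical is one possible choice; it lets sorting and comparisons follow the stated order rather than alphabetical order.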

To see the data type of each column, we need to run:

In [None]:
fifa.dtypes

In [None]:
fifa.info()

### Data Cleaning

At a high level, a data type determines how values are stored and which operations they support. For arithmetic and visualization, columns that hold numbers need to be stored as floats or integers so that mathematical operations can be run on them.
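A minimal sketch of that kind of conversion, using a hypothetical text column rather than the FIFA data. Passing `errors="coerce"` to `pd.to_numeric` is an assumption about how unparseable entries should be handled: they become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical column: numbers stored as text, with one unparseable entry.
raw = pd.Series(["10", "25", "n/a", "7.5"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(clean.isna().sum())  # number of entries that could not be parsed
print(clean.mean())        # arithmetic now works; NaN is skipped by default
```

Counting the NaN values afterwards is a quick way to see how many entries failed to parse before deciding what to do with them.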

In [None]:
'''
From our initial read-through of the dataset, we can see that Wage is stored as an object data type, but
it should be a float or integer, so it warrants further inspection.
'''

fifa.Wage.head()

In [None]:
def remove_currency(data_column):
    '''
    The output above shows that we need to remove € and K from the data column
    and convert it to a numeric type. This function does so; the values stay
    in thousands of euros because K is only stripped, not scaled.
    '''
    data_column = data_column.str.replace('€', '', regex=False)
    data_column = data_column.str.replace('K', '', regex=False)
    data_column = pd.to_numeric(data_column)
    return data_column

In [None]:
'''
Here we call the above function on this Wage column.
'''

fifa["Wage"] = remove_currency(fifa["Wage"])

In [None]:
'''
Here we check that the symbols have been removed.
'''

fifa.Wage.head()

In [None]:
fifa.info()

In [None]:
'''
We can now observe that the Value column needs the same treatment.
'''

fifa.Value.head()

In [None]:
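def remove_currency_mill(data_column):
    '''
    Value mixes millions (M) and thousands (K), so scale each suffix to euros
    instead of just stripping it; otherwise 1.5M and 500K end up as 1.5 and 500.
    '''
    data_column = data_column.str.replace('€', '', regex=False)
    # Multiplier per row: 1e6 for 'M', 1e3 for 'K', 1 otherwise.
    scale = np.where(data_column.str.endswith('M', na=False), 1e6,
                     np.where(data_column.str.endswith('K', na=False), 1e3, 1.0))
    data_column = pd.to_numeric(data_column.str.rstrip('MK'))
    return data_column * scale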

In [None]:
fifa["Value"] = remove_currency_mill(fifa["Value"])
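When a currency column mixes K and M suffixes, stripping both letters leaves the numbers on different scales. The sketch below shows one possible way to put them on a common scale. The helper `parse_euro_amount` and the sample strings are hypothetical; the sketch assumes every entry looks like an optional €, a number, and an optional K or M.

```python
import pandas as pd

def parse_euro_amount(column):
    """Hypothetical helper: turn strings like '€1.5M' or '€500K' into euros."""
    # Capture the number and the optional K/M suffix separately.
    parts = column.str.extract(r"€?(?P<number>[\d.]+)(?P<suffix>[KM]?)")
    # Map each suffix to its multiplier; no suffix means plain euros.
    multiplier = parts["suffix"].map({"K": 1e3, "M": 1e6, "": 1.0})
    return pd.to_numeric(parts["number"]) * multiplier

sample = pd.Series(["€95.5M", "€500K", "€0"])
print(parse_euro_amount(sample))
```

Keeping the result in euros is a design choice; dividing by 1e6 afterwards would express everything in millions instead.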

In [None]:
fifa.info()

We can use the describe() function to get more information about the data.

- describe() shows summary statistics for the data.

- By default it summarizes only numeric columns, so columns that are not stored as integers or floats are left out of the output.

- It can be very useful to look at these statistics early on to get a high-level idea of the data.
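The points above can be sketched on a small, hypothetical frame (the column names and values are invented for illustration). By default `describe()` reports only the numeric columns, while `include="all"` also summarizes text columns.

```python
import pandas as pd

# Hypothetical mixed-type frame.
toy = pd.DataFrame({
    "age": [21, 30, 27, 24],
    "wage": [10.0, 250.0, 40.0, 15.0],
    "club": ["A", "B", "A", "C"],
})

numeric_summary = toy.describe()             # numeric columns only
full_summary = toy.describe(include="all")   # adds count/unique/top/freq for text columns

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

If a column you expect to see is missing from the default output, that is usually a hint that it is still stored as text.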

In [None]:
fifa.describe().T

From the above we can see a couple of columns that need more cleaning or investigation.

1. The Unnamed: 0 column appears to be a second index column, which is not required because the DataFrame already has an automatic index.

2. The ID column has a minimum far below the interquartile range (IQR).

In [None]:
fifa.columns

In [None]:
fifa.drop('Unnamed: 0', axis = 1, inplace=True)
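A column named Unnamed: 0 typically appears when a CSV was written with its index and then read back without declaring it. As an alternative to dropping the column afterwards, the sketch below uses a small hypothetical frame and an in-memory buffer to show how `index_col=0` avoids creating it. Whether this is how the FIFA file was produced is an assumption.

```python
import io
import pandas as pd

# Write a small hypothetical frame to CSV the default way, which saves the index too.
buffer = io.StringIO()
pd.DataFrame({"Name": ["A", "B"], "Age": [21, 30]}).to_csv(buffer)

buffer.seek(0)
with_extra = pd.read_csv(buffer)               # saved index comes back as 'Unnamed: 0'
buffer.seek(0)
as_index = pd.read_csv(buffer, index_col=0)    # first column becomes the index instead

print(with_extra.columns.tolist())
print(as_index.columns.tolist())
```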

In [None]:
fifa.ID.plot.hist()

In [None]:
fifa.ID.plot.box(vert = False)

We need to decide whether we want to remove the outliers from this dataset prior to visualisation, or whether we come back to it after we have had an opportunity to look for any potential relationships between data points.
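To make that decision concrete, it helps to flag the candidate outliers first. The notebook does not fix a rule, so the sketch below assumes the common 1.5 × IQR convention (the same one box plots use for their whiskers) and applies it to hypothetical ID-like values.

```python
import pandas as pd

# Hypothetical ID-like values with two unusually small entries.
ids = pd.Series([20, 41, 200_000, 205_000, 210_000, 215_000,
                 220_000, 225_000, 230_000, 235_000])

q1, q3 = ids.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences rather than dropping them straight away.
is_outlier = (ids < lower) | (ids > upper)
print(ids[is_outlier].tolist())
```

Keeping a boolean mask instead of deleting rows leaves both options open: the flagged rows can be excluded from a plot now and revisited later.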

### Data Visualizations

- Charts or graphs that visualize large amounts of complex data are easier to understand than spreadsheets or reports.

- Data visualization is a quick, easy way to convey concepts in a universal manner.

In [None]:
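# Correlation is only defined for numeric columns; recent pandas versions
# raise an error if text columns are passed to corr().
fifa.select_dtypes(include='number').corr()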
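A full correlation matrix can be hard to read when there are many columns. One way to narrow it down, sketched here on hypothetical data, is to keep only the numeric columns and then rank them by their correlation with a single column of interest.

```python
import pandas as pd

# Hypothetical data; 'club' is text and has no place in a correlation matrix.
toy = pd.DataFrame({
    "age": [19, 23, 27, 31, 35],
    "overall": [60, 70, 78, 80, 75],
    "wage": [5, 30, 90, 110, 60],
    "club": ["A", "B", "C", "D", "E"],
})

corr = toy.select_dtypes(include="number").corr()

# Correlation of every other numeric column with 'wage', strongest first.
print(corr["wage"].drop("wage").sort_values(ascending=False))
```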

In [None]:
ax = fifa.plot.scatter('Age', 'Wage', figsize=(16,4))
ax.set_title("Age/Wage")

The kind argument of plot() accepts the following values:

- 'line' : line plot (default)
- 'bar' : vertical bar plot
- 'barh' : horizontal bar plot
- 'hist' : histogram
- 'box' : boxplot
- 'kde' : Kernel Density Estimation plot
- 'density' : same as 'kde'
- 'area' : area plot
- 'pie' : pie plot
- 'scatter' : scatter plot
- 'hexbin' : hexbin plot
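Each kind in the list above is also available as a method on the plot accessor, so `plot(kind='hist')` and `plot.hist()` are two spellings of the same call. The sketch below draws both side by side on hypothetical ages. The Agg backend is selected only so that the example runs outside a notebook; inside a notebook that line is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

ages = pd.Series([19, 23, 23, 27, 31, 35], name="Age")  # hypothetical values

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Same histogram, two spellings.
ages.plot(kind="hist", bins=5, ax=left, title="plot(kind='hist')")
ages.plot.hist(bins=5, ax=right, title="plot.hist()")
```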

In [None]:
# 'bar' : vertical bar plot

fifa.loc[0:50, 'Age'].plot(kind='bar', figsize=(16,8))

In [None]:
# 'barh' : horizontal bar plot

fifa.loc[0:50, 'Age'].plot(kind='barh', figsize=(16,8))

In [None]:
# 'hist' : histogram

fifa['Age'].plot(kind='hist', figsize=(16,8), bins=100)

In [None]:
# 'box' : boxplot

fifa['Age'].plot(kind='box', figsize=(16,8))

In [None]:
# 'kde' : Kernel Density Estimation plot

fifa['Age'].plot(kind='kde', figsize=(16,8))

In [None]:
# 'density' : same as 'kde'

fifa['Age'].plot(kind='density', figsize=(16,8))

In [None]:
# 'area' : area plot

fifa['Age'].plot(kind='area', figsize=(16,8))

In [None]:
# 'pie' : pie plot

fifa.loc[0:10, 'Age'].plot(kind='pie', figsize=(8,4))

In [None]:
# 'scatter' : scatter plot

fifa.plot(kind='scatter', x='Age', y='Wage')

In [None]:
# 'hexbin' : hexbin plot

fifa.plot(kind='hexbin', x='Age', y='Overall')