![download.png](attachment:535f544d-433f-48fc-967a-14fec00ddb50.png)

<h1 style="color: #9370DB;">Exploratory Data Analysis Test</h1>

Exploratory data analysis (EDA) is among the first steps of analyzing data.
* It usually starts with descriptive statistics, where you summarize the statistical properties of your dataset and begin to uncover insights.
* I also highly recommend using data visualization in this step.

<div class="alert alert-block alert-info">
Tip # 1
Best practice: import all your libraries and define your functions at the beginning!
</div>

In [1]:
# 📚 Basic libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h2 style="color: #9370DB;"> 01 | Data Extraction </h2>

In [3]:
data = pd.read_csv('fifa_23.csv')

<h3 style="color: #4169E1;">1.1 | Exploring the Data </h3>

In [4]:
data.sample(5)

Unnamed: 0,Known As,Full Name,Overall,Potential,Value(in Euro),Positions Played,Best Position,Nationality,Image Link,Age,Height(in cm),Weight(in kg),TotalStats,BaseStats,Club Name,Wage(in Euro),Release Clause,Club Position,Contract Until,Club Jersey Number,Joined On,On Loan,Preferred Foot,Weak Foot Rating,Skill Moves,International Reputation,National Team Name,National Team Image Link,National Team Position,National Team Jersey Number,Attacking Work Rate,Defensive Work Rate,Pace Total,Shooting Total,Passing Total,Dribbling Total,Defending Total,Physicality Total,Crossing,Finishing,Heading Accuracy,Short Passing,Volleys,Dribbling,Curve,Freekick Accuracy,LongPassing,BallControl,Acceleration,Sprint Speed,Agility,Reactions,Balance,Shot Power,Jumping,Stamina,Strength,Long Shots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,Standing Tackle,Sliding Tackle,Goalkeeper Diving,Goalkeeper Handling,GoalkeeperKicking,Goalkeeper Positioning,Goalkeeper Reflexes,ST Rating,LW Rating,LF Rating,CF Rating,RF Rating,RW Rating,CAM Rating,LM Rating,CM Rating,RM Rating,LWB Rating,CDM Rating,RWB Rating,LB Rating,CB Rating,RB Rating,GK Rating
8602,E. Jordanov,Edisson Jordanov,66,66,750000,"RB,RM,LB",RB,Bulgaria,https://cdn.sofifa.net/players/208/140/23_60.png,29,172,70,1810,374,KVC Westerlo,4000,1100000,SUB,2023,32,2021,-,Right,3,3,1,-,-,-,-,Medium,Medium,72,53,61,70,61,57,59,49,46,66,58,68,61,47,55,68,77,68,83,67,83,56,80,69,45,55,67,67,58,63,53,71,63,61,58,15,8,13,12,12,60,64,63,63,63,64,66,66,65,66,66,65,66,66,63,66,20
17161,K. Ferguson,Kyle Ferguson,56,64,300000,CB,CB,Scotland,https://cdn.sofifa.net/players/261/239/23_60.png,22,195,80,1296,290,Harrogate Town,1000,585000,RES,2023,24,2022,-,Right,2,2,1,-,https://cdn.sofifa.net/flags/gb-sct.png,-,-,Medium,High,53,29,40,46,54,68,31,28,55,52,29,48,31,26,43,47,55,51,34,47,44,37,64,59,75,24,63,49,26,32,33,45,54,58,54,6,13,9,10,9,44,41,41,41,41,41,43,45,46,45,52,54,52,53,58,53,16
603,Diogo Costa,Diogo Meireles Costa,79,86,30500000,GK,GK,Portugal,https://cdn.sofifa.net/players/234/577/23_60.png,22,188,84,1123,417,FC Porto,11000,67099999,GK,2026,99,2016,-,Right,2,1,1,Portugal,https://cdn.sofifa.net/flags/pt.png,SUB,22,Medium,Medium,76,75,73,83,32,78,13,6,11,28,8,18,10,10,29,25,30,36,34,77,35,55,53,23,63,9,23,18,18,49,14,44,19,13,11,76,75,73,78,83,30,26,29,29,29,26,32,30,32,30,27,30,27,26,28,26,80
1813,J. El Yamiq,Jawad El Yamiq,75,75,4600000,CB,CB,Morocco,https://cdn.sofifa.net/players/241/775/23_60.png,30,193,80,1808,402,Real Valladolid CF,20000,9700000,SUB,2024,15,2020,-,Right,3,2,1,Morocco,https://cdn.sofifa.net/flags/ma.png,SUB,20,Medium,Medium,86,48,52,62,74,80,38,40,76,65,30,60,36,32,71,61,84,87,69,73,56,67,90,75,84,52,74,77,60,37,41,68,73,75,70,15,12,10,10,8,65,59,61,61,61,59,61,63,64,63,70,72,70,73,75,73,20
12525,Kim Ji Hyun,Ji Hyun Kim,63,65,600000,"ST,RW,LW",ST,Korea Republic,https://cdn.sofifa.net/players/245/400/23_60.png,25,184,80,1577,343,Sangju Sangmu FC,3000,0,ST,2023,28,2021,TRUE,Right,4,3,1,-,-,-,-,Medium,Low,73,56,55,65,29,65,55,55,56,59,51,62,50,38,55,68,72,74,76,65,58,66,50,78,70,49,42,22,68,53,55,59,28,28,26,14,11,9,8,6,65,63,63,63,63,63,64,65,59,65,52,49,52,50,44,50,18


In [5]:
data.shape

(18539, 89)

In [6]:
data.dtypes

Known As          object
Full Name         object
Overall            int64
Potential          int64
Value(in Euro)     int64
                   ...  
RWB Rating         int64
LB Rating          int64
CB Rating          int64
RB Rating          int64
GK Rating          int64
Length: 89, dtype: object

**First impression:**
    
_____________

The dataset covers **one year** of information on 18,539 players, spread across 89 columns. Most of the columns are **numerical** (71 integers / 18 objects).

Our **project goal** is to identify players who have the potential to become **the next "Mbappé"**. After reading the [documentation](https://www.kaggle.com/datasets/ekrembayar/fifa-21-complete-player-dataset?select=fifa21_male2.csv), we will proceed with the following **strategy**:

1. The **target** of our dataset will be the `Overall` score, which summarizes a player's performance and potential.
2. Through **Exploratory Data Analysis** we will identify the features that contribute to this prediction.
_____________
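The type counts quoted in the first impression can be checked with `dtypes.value_counts()`. The snippet below is a minimal sketch on a hypothetical four-column toy frame (the values are made up and only stand in for `fifa_23.csv`); on the real data the same two calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Known As": ["A. Player", "B. Player"],
    "Overall": [66, 79],
    "Potential": [66, 86],
    "Preferred Foot": ["Right", "Left"],
})

# How many columns there are of each data type.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Names of the non-numerical columns, which usually need encoding or dropping later.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```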

<h3 style="color: #4169E1;">1.2 | Copies</h3>

In [7]:
df = data.copy()

<h3 style="color: #4169E1;">1.3 | Column standardization </h3>

In [None]:
# Standardize the working copy (df); the raw `data` frame stays untouched
df.columns = [column.lower().replace(" ", "_") for column in df.columns]

In [None]:
df.head(3)
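To see what the standardization does to individual names, here is a small sketch on a toy frame. The three header strings are illustrative; in particular, the leading space in `" GoalkeeperKicking"` is an assumption, made because it would explain the `_goalkeeperkicking` name that appears in the drop list later on.

```python
import pandas as pd

# Empty toy frame with three illustrative headers (the leading space is an assumption).
toy = pd.DataFrame(columns=["Known As", "Value(in Euro)", " GoalkeeperKicking"])

# Same rule as in the notebook: lower-case, spaces become underscores.
toy.columns = [column.lower().replace(" ", "_") for column in toy.columns]

standardized = toy.columns.tolist()
print(standardized)
# ['known_as', 'value(in_euro)', '_goalkeeperkicking']
```

Note that parentheses are kept as they are, which is why columns such as `value(in_euro)` are referenced with that exact spelling further down.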

<h2 style="color: #9370DB;"> 02 | Data Cleaning </h2>

<h3 style="color: #4169E1;"> 2.1 | Dealing with Data types</h3>

<h3 style="color: #4169E1;"> 2.2 | Dealing with NaN values</h3>

<h3 style="color: #4169E1;"> 2.3 | Dealing with Duplicates</h3>

<h3 style="color: #4169E1;"> 2.4 | Dealing with Empty Spaces</h3>

<h3 style="color: #4169E1;"> 2.5 | Dealing with outliers</h3>

<h3 style="color: #4169E1;"> 2.6 | Moving target to the right </h3>

In [None]:
df.head(0)

<h2 style="color: #008080;">Feature Selection (Dropping unnecessary features)</h2>

In [None]:
df.shape

_____________
From all features above, we will drop the following:
* `known_as`, `full_name`, `image_link`, `national_team_image_link` are just player identifiers.
* `club_name`, `club_position`, `contract_until`, `club_jersey_number`, `joined_on`, `on_loan` are specific to the player's current club situation and do not directly influence their potential or performance metrics.
* `national_team_name`, `national_team_position`, `national_team_jersey_number` are unnecessary details.
* `st_rating`, `lw_rating`, `cf_rating` and the other position ratings give us too much detail about the player, which is not needed if we are focusing on the **overall score** and best position.
* Also, since we are looking for **field players**, we can drop goalkeeper-specific features like `goalkeeper_diving`, `goalkeeper_handling`, `_goalkeeperkicking`, `goalkeeper_positioning` and `goalkeeper_reflexes`.
* Aggregated stats like `totalstats` and `basestats` are sums of other stats, so they are redundant. Likewise, individual attributes such as `crossing`, `finishing`, etc. are components of broader categories (for example, both belong to Attacking):
    * Attacking = crossing, finishing, heading_accuracy, short_passing, volleys
    * Skill = dribbling, curve, freekick_accuracy, longpassing, ballcontrol
    * Movement = acceleration, sprint_speed, agility, reactions, balance
    * Power = shot_power, jumping, stamina, strength, long_shots
    * Mentality = aggression, interceptions, positioning, vision, penalties, composure
    * Defending = marking, standing_tackle, sliding_tackle
_____________
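Because `drop` raises a `KeyError` when a listed column does not exist, it can help to compare the drop list against the actual column names first. The following is a minimal sketch on a toy frame; `to_drop` and the toy values are hypothetical and much shorter than the real list.

```python
import pandas as pd

# Toy frame with already standardized column names (hypothetical values).
toy = pd.DataFrame({
    "known_as": ["A. Player"],
    "full_name": ["Alex Player"],
    "overall": [70],
    "age": [22],
})

# Hypothetical drop list; "image_link" is deliberately absent from the toy frame.
to_drop = ["known_as", "full_name", "image_link"]

# Names that would make df.drop(columns=...) fail, e.g. because of a typo.
missing = [col for col in to_drop if col not in toy.columns]
print("Not found:", missing)

# Drop only the columns that really exist.
present = [col for col in to_drop if col in toy.columns]
reduced = toy.drop(columns=present)
print(reduced.columns.tolist())
```

An empty `missing` list is a quick confirmation that every name in the list is spelled the way it appears in the DataFrame.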

In [11]:
df = df.drop(columns=['known_as', 'full_name', 'image_link', 'national_team_image_link',
                      'club_name', 'club_position', 'contract_until', 'club_jersey_number',
                      'joined_on', 'on_loan', 'national_team_name', 'national_team_position',
                      'national_team_jersey_number', 'st_rating', 'lw_rating', 'cf_rating',
                      'rf_rating', 'rw_rating', 'cam_rating', 'lm_rating', 'cm_rating',
                      'rm_rating', 'lwb_rating', 'cdm_rating', 'rwb_rating', 'lb_rating',
                      'cb_rating', 'rb_rating', 'gk_rating',
                      'goalkeeper_diving', 'goalkeeper_handling', '_goalkeeperkicking',
                      'goalkeeper_positioning', 'goalkeeper_reflexes', 'totalstats',
                      'basestats', 'crossing', 'finishing', 'volleys', 'dribbling', 
                      'curve', 'freekick_accuracy', 'longpassing', 'ballcontrol', 
                      'acceleration', 'sprint_speed', 'agility', 'reactions', 
                      'balance', 'shot_power', 'jumping', 'stamina', 'strength', 
                      'long_shots', 'aggression', 'interceptions', 'positioning', 
                      'vision', 'penalties', 'composure', 'marking', 'standing_tackle', 
                      'sliding_tackle', 'positions_played'])

In [None]:
df.shape

<h2 style="color: #008080;">Checking Null Values</h2>

In [None]:
df.isnull().sum()

<h2 style="color: #008080;">Checking Duplicates</h2>

In [None]:
df.duplicated().sum()

In [15]:
df.drop_duplicates(inplace=True)

<h2 style="color: #008080;">Checking Empty Spaces</h2>

In [None]:
df.eq(" ").sum()
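The check above only matches cells that are exactly one space. The sample rows shown earlier also contain `-` in columns such as `On Loan`, which looks like a placeholder for "no value" (this reading is an assumption, not something the documentation confirms). A hedged sketch on toy data of how both cases could be counted and turned into `NaN`:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a blank cell and "-" placeholders.
toy = pd.DataFrame({
    "on_loan": ["-", "TRUE", "-"],
    "national_team_name": ["Portugal", " ", "-"],
    "age": [22, 30, 25],
})

text_cols = toy.select_dtypes(exclude="number").columns

# Cells that are empty or contain only whitespace, per text column.
blank_counts = toy[text_cols].apply(lambda s: s.str.strip().eq("").sum())

# Cells that hold the "-" placeholder, per text column.
dash_counts = toy[text_cols].eq("-").sum()

# Replace both kinds of cells with NaN so that isnull() can see them.
cleaned = toy.copy()
cleaned[text_cols] = (
    cleaned[text_cols]
    .apply(lambda s: s.str.strip())
    .replace({"": np.nan, "-": np.nan})
)
print(cleaned.isnull().sum())
```

Whether `-` should really be treated as missing depends on the column, so it is worth inspecting each one before replacing.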

<h2 style="color: #008080;">Moving our target to the right</h2>

In [None]:
df.columns

In [None]:
df.head(3)

In [19]:
target = df.pop('overall')

In [22]:
df['overall'] = target

<div class="alert alert-block alert-info">
Tip # 2
How to move your target to the right
</div>

[stackoverflow](https://stackoverflow.com/questions/35321812/move-column-in-pandas-dataframe)
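Two common ways of moving a column to the last position, sketched on a toy frame with made-up values. Option A is the pop-and-reassign approach used in the cells above, written in one line; option B builds the column order explicitly.

```python
import pandas as pd

toy = pd.DataFrame({"overall": [66, 79], "potential": [66, 86], "age": [29, 22]})

# Option A: pop removes the column, re-assigning appends it at the end.
a = toy.copy()
a["overall"] = a.pop("overall")

# Option B: list every other column first, then the target.
ordered = [col for col in toy.columns if col != "overall"] + ["overall"]
b = toy[ordered]

print(a.columns.tolist())
print(b.columns.tolist())
```

Option B leaves the original frame unchanged, which can be convenient when the cell might be run more than once.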

<h2 style="color: #9370DB;"> 03 | EDA (Exploratory Data Analysis) </h2>

<h3 style="color: #4169E1;">3.1 | Descriptive Statistics </h3>

In [None]:
df.describe().T

<div class="alert alert-block alert-info">
Tip # 3

How to interpret basic statistics</div>

+ Measures of central tendency - Mean, median, mode
+ Measures of spread / dispersion - SD, var, range, quartiles, percentiles
+ Measures of frequency - Frequency

+ **Range:** the difference between the highest and lowest values.
+ **Variance:** the average squared distance of each value from the mean.
+ **Standard deviation:** measures the dispersion of a dataset relative to its mean. It is the square root of the variance, so it is expressed in the same units as the data.
+ **Quartiles:** the cut points that divide the sorted observations into four intervals with the same number of observations each.
+ **Percentiles:** the same idea, but dividing the sorted observations into 100 groups.

![quartiles](https://www.onlinemathlearning.com/image-files/median-quartiles.png)
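The measures of spread listed above can be computed by hand on a small array to see how they relate. This is an illustrative sketch with made-up numbers; `ddof=1` gives the sample variance and standard deviation, which is also what pandas reports in `describe`.

```python
import numpy as np

values = np.array([2, 4, 4, 5, 7, 9, 11])

# Range: highest minus lowest value.
value_range = values.max() - values.min()

# Sample variance and standard deviation (ddof=1).
variance = values.var(ddof=1)
std_dev = values.std(ddof=1)

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(values, [25, 50, 75])

print(value_range, variance, std_dev)
print(q1, q2, q3)
```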

### Exercise 1: What is the [Inter Quartile Range (IQR)](https://medium.com/@vinitasilaparasetty/quartiles-for-beginners-in-data-science-2ca5a640b07b)? What conclusions can we draw from `describe`?

In [None]:
# your solution

<h3 style="color: #4169E1;"> Optional | Selecting Numerical </h3>

In [27]:
num = df.select_dtypes("number")

<h3 style="color: #4169E1;"> 3.2 | Checking Distributions</h3>

#### Using matplotlib ---> Check [documentation](https://matplotlib.org/) !

In [None]:
color = '#0072B2'

# grid size
nrows, ncols = 5, 4  # adjust for your number of features

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 16))

axes = axes.flatten()

# Plot each numerical feature
for i, ax in enumerate(axes):
    if i >= len(num.columns):
        ax.set_visible(False)  # hide unused plots
        continue
    ax.hist(num.iloc[:, i], bins=30, color=color, edgecolor='black')
    ax.set_title(num.columns[i])

plt.tight_layout()
plt.show()

### Exercise 2: How do we interpret these histograms?

In [None]:
# your solution

In [29]:
boxplot = num[['value(in_euro)', 'wage(in_euro)', 'release_clause', 'dribbling_total', 'short_passing', 'lf_rating']]

In [None]:
color = '#0072B2'

# grid size
nrows, ncols = 5, 4 

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 16))

axes = axes.flatten()

for i, ax in enumerate(axes):
    if i >= len(num.columns):
        ax.set_visible(False)
        continue
    ax.boxplot(num.iloc[:, i].dropna(), vert=False, patch_artist=True, 
               boxprops=dict(facecolor=color, color='black'), 
               medianprops=dict(color='yellow'), whiskerprops=dict(color='black'), 
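               capprops=dict(color='black'),
               # outliers: set the marker colors explicitly so the points are drawn in red
               flierprops=dict(marker='o', markerfacecolor='red', markeredgecolor='red', markersize=5))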
    ax.set_title(num.columns[i], fontsize=10)
    ax.tick_params(axis='x', labelsize=8)

plt.tight_layout()
plt.show()

In [None]:
color = '#0072B2'

# grid size
nrows, ncols = 5, 4 

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 16))

axes = axes.flatten()

for i, ax in enumerate(axes):
    if i >= len(boxplot.columns):
        ax.set_visible(False)
        continue
    ax.boxplot(boxplot.iloc[:, i].dropna(), vert=False, patch_artist=True, 
               boxprops=dict(facecolor=color, color='black'), 
               medianprops=dict(color='yellow'), whiskerprops=dict(color='black'), 
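               capprops=dict(color='black'),
               # outliers: set the marker colors explicitly so the points are drawn in red
               flierprops=dict(marker='o', markerfacecolor='red', markeredgecolor='red', markersize=5))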
    ax.set_title(boxplot.columns[i], fontsize=10)
    ax.tick_params(axis='x', labelsize=8)  # Adjust x-axis ticks

plt.tight_layout()
plt.show()

### Exercise 3: What conclusions can you draw from the box plots?

<h3 style="color: #4169E1;"> 3.3 | Checking our target distribution</h3>

<h3 style="color: #4169E1;">3.4 | Checking Outliers </h3>

<h3 style="color: #4169E1;">3.5 | Looking for Correlations </h3>

In [32]:
num_corr = round(num.corr(), 2)

<div class="alert alert-block alert-info">
Tip # 4
    
- We don't want multicollinearity --> strongly correlated features carry redundant information and make the coefficients of linear models unstable and hard to interpret.
- We want high correlations (+ or -) with the target --> valuable information for the predictions</div>
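Both points of the tip can be read off the correlation matrix programmatically. The sketch below uses synthetic data: `sprint_like` is a hypothetical feature built to nearly duplicate `pace_total`, and the 0.9 threshold is an arbitrary choice for illustration, not a rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
pace = rng.normal(70, 10, n)

# Synthetic data: two nearly identical features, one unrelated feature, one target.
toy = pd.DataFrame({
    "pace_total": pace,
    "sprint_like": pace + rng.normal(0, 1, n),  # hypothetical near-duplicate
    "age": rng.normal(25, 4, n),
})
toy["overall"] = 0.5 * toy["pace_total"] + rng.normal(0, 3, n)

corr = toy.corr()

# 1) Correlation of every feature with the target, strongest first.
target_corr = corr["overall"].drop("overall").sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(target_corr)

# 2) Feature pairs that are highly correlated with each other.
features = corr.drop(index="overall", columns="overall")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.9]
print(high_pairs)
```

For a pair flagged in step 2, a common choice is to keep the feature that correlates more strongly with the target and drop the other.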

#### Checking correlations with [Seaborn](https://seaborn.pydata.org/index.html)

In [None]:
# Correlation Matrix-Heatmap Plot
mask = np.triu(np.ones_like(num_corr, dtype=bool)) # optional, hides the redundant upper half of the matrix
f, ax = plt.subplots(figsize=(25, 15))
sns.set(font_scale=1.5) # increase font size
ax = sns.heatmap(num_corr, mask=mask, annot=True, annot_kws={"size": 12}, linewidths=.5, cmap="coolwarm", fmt=".2f", ax=ax) # round to 2 decimal places
ax.set_title("Dealing with Multicollinearity", fontsize=20) # add title
plt.show()

In [None]:
df.columns

### Exercise 4: How do we interpret this correlation matrix? Which other correlation methods do we know?

In [None]:
# your solution

### Specific Correlations with the Target

### Perform 3 Plots and Explain the Findings from the Data

### Contingency Tables, Chi-Square...

<h2 style="color: #9370DB;"> 04 | Data Processing </h2>

<h3 style="color: #4169E1;"> 4.1 | X-Y Split</h3>

<h3 style="color: #4169E1;"> 4.2 | Selecting the Model</h3>

<h4 style="color: #00BFFF;"> 4.2.1 | Selecting Model: Linear Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.2 | Selecting Model: Ridge Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.3 | Selecting Model: Lasso Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.4 | Selecting Model: Decision Tree Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.5 | Selecting Model: KNN Regression </h4>

<h4 style="color: #00BFFF;"> 4.2.6 | Selecting Model: XGBoost Regression </h4>

<h3 style="color: #4169E1;"> 4.3 | Final Comparison</h3>

<h2 style="color: #9370DB;"> 05 | Improving Model </h2>

<h2 style="color: #008080;">Train-Test Split</h2>

<h2 style="color: #008080;">Model Validation</h2>

<h1 style="color: #00BFFF;">06 | Improving the Model</h1>

<h1 style="color: #00BFFF;">07 | Reporting</h1>