# Intelligent Decision Support System (IDSS) for Player Selection for Football League

This notebook implements an Intelligent Decision Support System (IDSS) that helps football league management buy the right player for the required role at the right price.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
sns.set()
def color_print(text: str, color: str = "blue"):
  colors = {
      "black": "\033[30m",
      "red": "\033[31m",
      "green": "\033[32m",
      "yellow": "\033[33m",
      "blue": "\033[34m",
      "magenta": "\033[35m",
      "cyan": "\033[36m",
      "white": "\033[37m",
      "reset": "\033[0m"
  }
  color_code = colors.get(color.lower(), colors["blue"])
  print(color_code + text + colors["reset"])


# 1. Load Dataset

In [14]:
df = pd.read_csv("/content/fifa_players.csv")

In [15]:
color_print(f"We have '{df.shape[0]}' records and '{df.shape[1]}' features in the dataset")

We have '17954' records and '51' features in the dataset


In [16]:
# Check 5 random samples
df.sample(5)


Unnamed: 0,name,full_name,birth_date,age,height_cm,weight_kgs,positions,nationality,overall_rating,potential,...,long_shots,aggression,interceptions,positioning,vision,penalties,composure,marking,standing_tackle,sliding_tackle
524,J. Tavernier,James Tavernier,10/31/1991,27,182.88,74.8,RB,England,75,76,...,68,69,68,69,69,82,73,63,74,72
11171,C. Telo,Christopher Telo,11/4/1989,29,172.72,72.1,LB,Sweden,66,66,...,53,66,68,50,57,49,69,58,58,62
13743,Y. Salmier,Yoann Salmier,11/21/1992,26,187.96,84.8,CB,France,69,72,...,17,71,62,29,33,33,62,72,75,65
15215,P. Da Silva,Paulo César Da Silva Barrios,2/1/1980,39,154.94,76.2,CB,Paraguay,72,72,...,56,71,70,26,55,63,70,70,70,73
15958,M. Dituro,Matías Dituro,5/8/1987,31,190.5,89.8,GK,Argentina,73,73,...,16,20,21,19,36,20,68,18,12,18


# 2. Overview of the Dataset

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17954 entries, 0 to 17953
Data columns (total 51 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   name                           17954 non-null  object 
 1   full_name                      17954 non-null  object 
 2   birth_date                     17954 non-null  object 
 3   age                            17954 non-null  int64  
 4   height_cm                      17954 non-null  float64
 5   weight_kgs                     17954 non-null  float64
 6   positions                      17954 non-null  object 
 7   nationality                    17954 non-null  object 
 8   overall_rating                 17954 non-null  int64  
 9   potential                      17954 non-null  int64  
 10  value_euro                     17699 non-null  float64
 11  wage_euro                      17708 non-null  float64
 12  preferred_foot                 17954 non-null 

In [18]:
number_of_categorical_features = 0
number_of_numerical_features = 0
for column_name in df.columns:
  if df[column_name].dtype == "object":
    number_of_categorical_features+=1
  else:
    number_of_numerical_features+=1

color_print(f"There are total {number_of_categorical_features} categorical features and {number_of_numerical_features} numerical features.")

There are total 9 categorical features and 42 numerical features.


# 3. Check NULL values

In [19]:
df.isna().sum()

Unnamed: 0,0
name,0
full_name,0
birth_date,0
age,0
height_cm,0
weight_kgs,0
positions,0
nationality,0
overall_rating,0
potential,0


In [20]:
# Let's check the NULL values in terms of percentages
df.isna().sum() / len(df) * 100

Unnamed: 0,0
name,0.0
full_name,0.0
birth_date,0.0
age,0.0
height_cm,0.0
weight_kgs,0.0
positions,0.0
nationality,0.0
overall_rating,0.0
potential,0.0


From the above table, it is clear that some columns have a significant number of NULL values:
1. `national_team`	-> 95%
2. `national_rating`	-> 95%
3. `national_team_position`	-> 95%
4. `national_jersey_number`	-> 95%

These columns are not significant features for our task of ranking players in FIFA, so their NULL values won't affect our IDSS model.
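The per-column percentages can also be filtered so that only the affected columns are listed. The sketch below runs on a small made-up frame; the `toy` data and the 50% threshold are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "value_euro": [1.0e6, np.nan, 2.5e6, 4.0e6],
    "national_team": [np.nan, np.nan, np.nan, "X"],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Keep only columns that actually have missing values, largest share first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)

# Columns above an (assumed) 50% threshold are candidates for being ignored
mostly_missing = missing_pct[missing_pct > 50].index.tolist()
print(mostly_missing)
```

On the real data, the same two steps applied to `df` would list only the national-team columns and the three euro columns.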

Additionally, the following features also have some small percentage of NULL values:
1. `value_euro`	-> 1.4%
2. `wage_euro`	-> 1.4%
3. `release_clause_euro`	-> 10%

We might need to do something about these values because these are important features.

Since the share of NULL values is small (about 1.4% for `value_euro` and `wage_euro`, and about 10% for `release_clause_euro`), median imputation is a reasonable choice. The median is also less sensitive than the mean to a few very large values. Had these features contained a substantial amount of missing data, we would have evaluated alternative imputation techniques to determine the most suitable approach. In this case, median imputation is adequate.
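To see why the median is preferred over the mean for monetary columns, here is a minimal sketch on made-up, right-skewed numbers (they are not taken from the dataset). A single very large value pulls the mean far away from the typical player, while the median stays put.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values with two gaps (not taken from the dataset)
values = pd.Series([100_000, 200_000, 300_000, 400_000, 50_000_000, np.nan, np.nan])

mean_value = values.mean()      # pulled upward by the single very large value
median_value = values.median()  # stays near the bulk of the data

print(f"mean:   {mean_value:,.0f}")
print(f"median: {median_value:,.0f}")

# Median imputation fills the gaps with a value typical of most entries
imputed = values.fillna(median_value)
print(imputed.isna().sum())
```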

In [26]:
# Handle missing values for "value_euro", "wage_euro", and "release_clause_euro"
value_euro_median = df['value_euro'].median()
wage_euro_median = df['wage_euro'].median()
release_clause_euro_median = df['release_clause_euro'].median()

df.fillna({
    'value_euro': value_euro_median,
    'wage_euro': wage_euro_median,
    'release_clause_euro': release_clause_euro_median
}, inplace=True)

# Now, let's verify that NULL values have been taken care of
df.isna().sum()

Unnamed: 0,0
name,0
full_name,0
birth_date,0
age,0
height_cm,0
weight_kgs,0
positions,0
nationality,0
overall_rating,0
potential,0


# 4. Duplicate values

# 5. Statistical Summary and Feature Analysis


In [None]:
# Statistical summary of numerical features
df.describe().T.style.background_gradient(cmap='coolwarm')


# 6. Target Variable Analysis (Overall Rating - for Ranking Model)


In [None]:
# Distribution of Overall Rating (Target for Ranking Model)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram
axes[0].hist(df['overall_rating'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Distribution of Overall Rating', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Overall Rating')
axes[0].set_ylabel('Frequency')
axes[0].axvline(df['overall_rating'].mean(), color='red', linestyle='--', label=f'Mean: {df["overall_rating"].mean():.2f}')
axes[0].axvline(df['overall_rating'].median(), color='green', linestyle='--', label=f'Median: {df["overall_rating"].median():.2f}')
axes[0].legend()

# Box plot
axes[1].boxplot(df['overall_rating'], vert=True, patch_artist=True, 
                boxprops=dict(facecolor='lightblue', color='blue'),
                medianprops=dict(color='red', linewidth=2))
axes[1].set_title('Box Plot of Overall Rating', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Overall Rating')

# KDE plot
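df['overall_rating'].plot(kind='kde', ax=axes[2], color='purple', linewidth=2)
# Shade the area under the KDE curve itself (not the whole axes rectangle)
kde_line = axes[2].get_lines()[0]
axes[2].fill_between(kde_line.get_xdata(), 0, kde_line.get_ydata(), alpha=0.3, color='purple')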
axes[2].set_title('Density Plot of Overall Rating', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Overall Rating')
axes[2].set_ylabel('Density')

plt.tight_layout()
plt.show()

color_print(f"Overall Rating Statistics:", "green")
print(f"  Mean: {df['overall_rating'].mean():.2f}")
print(f"  Median: {df['overall_rating'].median():.2f}")
print(f"  Std Dev: {df['overall_rating'].std():.2f}")
print(f"  Min: {df['overall_rating'].min()}")
print(f"  Max: {df['overall_rating'].max()}")


# 7. Value and Wage Analysis (Target for Price Prediction Model)


In [None]:
# Distribution of Player Value and Wage
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Value distribution
axes[0, 0].hist(df['value_euro'], bins=50, color='gold', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribution of Player Value (Euro)', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Value (Euro)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].axvline(df['value_euro'].mean(), color='red', linestyle='--', label=f'Mean: €{df["value_euro"].mean()/1e6:.2f}M')
axes[0, 0].legend()

# Log-transformed value (better visualization for skewed data)
axes[0, 1].hist(np.log1p(df['value_euro']), bins=50, color='orange', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Log-Transformed Player Value', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Log(Value + 1)')
axes[0, 1].set_ylabel('Frequency')

# Wage distribution
axes[1, 0].hist(df['wage_euro'], bins=50, color='lightgreen', edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Distribution of Player Wage (Euro)', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Wage (Euro)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].axvline(df['wage_euro'].mean(), color='red', linestyle='--', label=f'Mean: €{df["wage_euro"].mean()/1e3:.2f}K')
axes[1, 0].legend()

# Log-transformed wage
axes[1, 1].hist(np.log1p(df['wage_euro']), bins=50, color='green', edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Log-Transformed Player Wage', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Log(Wage + 1)')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

color_print("Value and Wage Statistics:", "cyan")
print(f"Value (Euro):")
print(f"  Mean: €{df['value_euro'].mean()/1e6:.2f}M")
print(f"  Median: €{df['value_euro'].median()/1e6:.2f}M")
print(f"  Max: €{df['value_euro'].max()/1e6:.2f}M")
print(f"\nWage (Euro):")
print(f"  Mean: €{df['wage_euro'].mean()/1e3:.2f}K")
print(f"  Median: €{df['wage_euro'].median()/1e3:.2f}K")
print(f"  Max: €{df['wage_euro'].max()/1e3:.2f}K")


# 8. Player Position Analysis


In [None]:
# Position distribution and analysis
position_counts = df['positions'].value_counts().head(15)

fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Bar plot of top positions
axes[0, 0].barh(position_counts.index, position_counts.values, color='steelblue')
axes[0, 0].set_title('Top 15 Player Positions', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Number of Players')
axes[0, 0].set_ylabel('Position')
axes[0, 0].invert_yaxis()

# Pie chart of top 10 positions
top_10_positions = df['positions'].value_counts().head(10)
axes[0, 1].pie(top_10_positions.values, labels=top_10_positions.index, autopct='%1.1f%%', startangle=90)
axes[0, 1].set_title('Top 10 Positions Distribution', fontsize=14, fontweight='bold')

# Average rating by position
avg_rating_by_position = df.groupby('positions')['overall_rating'].mean().sort_values(ascending=False).head(15)
axes[1, 0].barh(avg_rating_by_position.index, avg_rating_by_position.values, color='coral')
axes[1, 0].set_title('Average Rating by Position (Top 15)', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Average Overall Rating')
axes[1, 0].set_ylabel('Position')
axes[1, 0].invert_yaxis()

# Average value by position
avg_value_by_position = df.groupby('positions')['value_euro'].mean().sort_values(ascending=False).head(15)
axes[1, 1].barh(avg_value_by_position.index, avg_value_by_position.values/1e6, color='gold')
axes[1, 1].set_title('Average Value by Position (Top 15)', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Average Value (Million Euro)')
axes[1, 1].set_ylabel('Position')
axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

color_print(f"Total unique positions: {df['positions'].nunique()}", "magenta")


# 9. Age and Physical Attributes Analysis


In [None]:
# Age, Height, and Weight Analysis
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Age distribution
axes[0, 0].hist(df['age'], bins=30, color='teal', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Age Distribution', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].axvline(df['age'].mean(), color='red', linestyle='--', label=f'Mean: {df["age"].mean():.1f}')
axes[0, 0].legend()

# Height distribution
axes[0, 1].hist(df['height_cm'], bins=30, color='purple', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Height Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Height (cm)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].axvline(df['height_cm'].mean(), color='red', linestyle='--', label=f'Mean: {df["height_cm"].mean():.1f} cm')
axes[0, 1].legend()

# Weight distribution
axes[0, 2].hist(df['weight_kgs'], bins=30, color='brown', edgecolor='black', alpha=0.7)
axes[0, 2].set_title('Weight Distribution', fontsize=12, fontweight='bold')
axes[0, 2].set_xlabel('Weight (kg)')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].axvline(df['weight_kgs'].mean(), color='red', linestyle='--', label=f'Mean: {df["weight_kgs"].mean():.1f} kg')
axes[0, 2].legend()

# Age vs Overall Rating
axes[1, 0].scatter(df['age'], df['overall_rating'], alpha=0.3, c=df['overall_rating'], cmap='viridis', s=10)
axes[1, 0].set_title('Age vs Overall Rating', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Overall Rating')

# Age vs Value
axes[1, 1].scatter(df['age'], df['value_euro']/1e6, alpha=0.3, c=df['value_euro'], cmap='plasma', s=10)
axes[1, 1].set_title('Age vs Player Value', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Age')
axes[1, 1].set_ylabel('Value (Million Euro)')

# Height vs Weight (colored by rating)
scatter = axes[1, 2].scatter(df['height_cm'], df['weight_kgs'], alpha=0.3, 
                             c=df['overall_rating'], cmap='coolwarm', s=10)
axes[1, 2].set_title('Height vs Weight (colored by rating)', fontsize=12, fontweight='bold')
axes[1, 2].set_xlabel('Height (cm)')
axes[1, 2].set_ylabel('Weight (kg)')
plt.colorbar(scatter, ax=axes[1, 2], label='Overall Rating')

plt.tight_layout()
plt.show()


# 10. Skill Attributes Analysis


In [None]:
# Analyze key skill attributes
skill_columns = ['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']

# Note: These composite skills might not be in the dataset, so let's use actual skill columns
actual_skill_columns = ['finishing', 'dribbling', 'short_passing', 'ball_control', 
                        'acceleration', 'sprint_speed', 'stamina', 'strength',
                        'positioning', 'vision', 'composure', 'marking']

# Calculate average skills
skill_averages = df[actual_skill_columns].mean().sort_values(ascending=False)

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Average skill ratings
axes[0, 0].barh(skill_averages.index, skill_averages.values, color='skyblue')
axes[0, 0].set_title('Average Skill Ratings Across All Players', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Average Rating')
axes[0, 0].invert_yaxis()

# Distribution of key attacking skills
attacking_skills = df[['finishing', 'dribbling', 'ball_control', 'positioning']]
axes[0, 1].boxplot([attacking_skills[col] for col in attacking_skills.columns], 
                    labels=attacking_skills.columns, patch_artist=True)
axes[0, 1].set_title('Distribution of Attacking Skills', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Rating')
axes[0, 1].tick_params(axis='x', rotation=45)

# Distribution of defensive skills
defensive_skills = df[['marking', 'standing_tackle', 'sliding_tackle', 'interceptions']]
axes[1, 0].boxplot([defensive_skills[col] for col in defensive_skills.columns], 
                    labels=defensive_skills.columns, patch_artist=True)
axes[1, 0].set_title('Distribution of Defensive Skills', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Rating')
axes[1, 0].tick_params(axis='x', rotation=45)

# Physical attributes
physical_skills = df[['acceleration', 'sprint_speed', 'stamina', 'strength']]
axes[1, 1].boxplot([physical_skills[col] for col in physical_skills.columns], 
                    labels=physical_skills.columns, patch_artist=True)
axes[1, 1].set_title('Distribution of Physical Attributes', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Rating')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()


# 11. Correlation Analysis (Important for Feature Selection)


In [None]:
# Correlation analysis for numerical features
# Select only numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Remove columns with too many missing values for correlation
cols_for_correlation = [col for col in numerical_cols if col not in ['national_rating', 'national_jersey_number']]

# Calculate correlation matrix
correlation_matrix = df[cols_for_correlation].corr()

# Create figure with subplots
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# Full correlation heatmap (subset for visibility)
key_features = ['overall_rating', 'potential', 'value_euro', 'wage_euro', 'age',
                'finishing', 'dribbling', 'short_passing', 'ball_control', 
                'acceleration', 'sprint_speed', 'strength', 'positioning', 'vision']
key_correlation = df[key_features].corr()

sns.heatmap(key_correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=axes[0])
axes[0].set_title('Correlation Heatmap - Key Features', fontsize=14, fontweight='bold')

# Correlation with overall_rating (for ranking model)
rating_correlation = correlation_matrix['overall_rating'].sort_values(ascending=False)[1:16]
axes[1].barh(rating_correlation.index, rating_correlation.values, color='teal')
axes[1].set_title('Top 15 Features Correlated with Overall Rating', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Correlation Coefficient')
axes[1].invert_yaxis()
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=0.8)

plt.tight_layout()
plt.show()

color_print("Top 10 features correlated with Overall Rating:", "green")
print(correlation_matrix['overall_rating'].sort_values(ascending=False)[1:11])


In [None]:
# Correlation with value_euro (for price prediction model)
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

value_correlation = correlation_matrix['value_euro'].sort_values(ascending=False)[1:16]
axes[0].barh(value_correlation.index, value_correlation.values, color='gold')
axes[0].set_title('Top 15 Features Correlated with Player Value', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Correlation Coefficient')
axes[0].invert_yaxis()
axes[0].axvline(x=0, color='black', linestyle='--', linewidth=0.8)

wage_correlation = correlation_matrix['wage_euro'].sort_values(ascending=False)[1:16]
axes[1].barh(wage_correlation.index, wage_correlation.values, color='lightgreen')
axes[1].set_title('Top 15 Features Correlated with Player Wage', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Correlation Coefficient')
axes[1].invert_yaxis()
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=0.8)

plt.tight_layout()
plt.show()

color_print("Top 10 features correlated with Player Value:", "cyan")
print(correlation_matrix['value_euro'].sort_values(ascending=False)[1:11])


# 12. Nationality Analysis


In [None]:
# Nationality analysis
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Top nationalities by player count
top_nationalities = df['nationality'].value_counts().head(20)
axes[0, 0].barh(top_nationalities.index, top_nationalities.values, color='steelblue')
axes[0, 0].set_title('Top 20 Nationalities by Player Count', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Number of Players')
axes[0, 0].invert_yaxis()

# Average rating by nationality (top 15)
avg_rating_by_nation = df.groupby('nationality')['overall_rating'].mean().sort_values(ascending=False).head(15)
axes[0, 1].barh(avg_rating_by_nation.index, avg_rating_by_nation.values, color='coral')
axes[0, 1].set_title('Top 15 Nationalities by Average Rating', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Average Overall Rating')
axes[0, 1].invert_yaxis()

# Average value by nationality (top 15)
avg_value_by_nation = df.groupby('nationality')['value_euro'].mean().sort_values(ascending=False).head(15)
axes[1, 0].barh(avg_value_by_nation.index, avg_value_by_nation.values/1e6, color='gold')
axes[1, 0].set_title('Top 15 Nationalities by Average Value', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Average Value (Million Euro)')
axes[1, 0].invert_yaxis()

# Preferred foot distribution
foot_counts = df['preferred_foot'].value_counts()
axes[1, 1].pie(foot_counts.values, labels=foot_counts.index, autopct='%1.1f%%', 
               colors=['lightblue', 'lightcoral'], startangle=90)
axes[1, 1].set_title('Preferred Foot Distribution', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

color_print(f"Total unique nationalities: {df['nationality'].nunique()}", "magenta")


# 13. Advanced Visualizations - Player Comparison by Position


In [None]:
# Compare skill profiles for different positions
# Select top 5 most common positions
top_5_positions = df['positions'].value_counts().head(5).index.tolist()

# Filter data for these positions
position_data = df[df['positions'].isin(top_5_positions)]

# Create radar chart data for each position
skills_for_radar = ['finishing', 'dribbling', 'short_passing', 'ball_control', 
                    'acceleration', 'strength', 'marking', 'standing_tackle']

# Calculate average skills for each position
position_skills = position_data.groupby('positions')[skills_for_radar].mean()

# Create subplots for comparison
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Violin plot - Overall Rating by Position
parts = axes[0, 0].violinplot([position_data[position_data['positions'] == pos]['overall_rating'].values 
                                for pos in top_5_positions], 
                               positions=range(len(top_5_positions)), 
                               showmeans=True, showmedians=True)
axes[0, 0].set_xticks(range(len(top_5_positions)))
axes[0, 0].set_xticklabels(top_5_positions)
axes[0, 0].set_title('Overall Rating Distribution by Position', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Overall Rating')
axes[0, 0].set_xlabel('Position')

# Box plot - Value by Position
position_data.boxplot(column='value_euro', by='positions', ax=axes[0, 1])
axes[0, 1].set_title('Player Value Distribution by Position', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Value (Euro)')
axes[0, 1].set_xlabel('Position')
plt.sca(axes[0, 1])
plt.xticks(rotation=45)

# Heatmap of average skills by position
sns.heatmap(position_skills.T, annot=True, fmt='.1f', cmap='YlOrRd', ax=axes[1, 0], cbar_kws={"shrink": 0.8})
axes[1, 0].set_title('Average Skill Ratings by Position', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Position')
axes[1, 0].set_ylabel('Skill')

# Age distribution by position
for pos in top_5_positions:
    pos_ages = position_data[position_data['positions'] == pos]['age']
    axes[1, 1].hist(pos_ages, alpha=0.5, label=pos, bins=20)
axes[1, 1].set_title('Age Distribution by Position', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Age')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()

plt.tight_layout()
plt.show()


# 14. Potential vs Overall Rating Analysis


In [None]:
# Potential vs Overall Rating - Important for identifying undervalued players
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Scatter plot: Potential vs Overall Rating
scatter = axes[0].scatter(df['overall_rating'], df['potential'], 
                          c=df['age'], cmap='viridis', alpha=0.5, s=20)
axes[0].plot([df['overall_rating'].min(), df['overall_rating'].max()], 
             [df['overall_rating'].min(), df['overall_rating'].max()], 
             'r--', linewidth=2, label='Equal Line')
axes[0].set_xlabel('Overall Rating')
axes[0].set_ylabel('Potential')
axes[0].set_title('Potential vs Overall Rating (colored by age)', fontsize=12, fontweight='bold')
axes[0].legend()
plt.colorbar(scatter, ax=axes[0], label='Age')

# Calculate potential growth
df['potential_growth'] = df['potential'] - df['overall_rating']

# Distribution of potential growth
axes[1].hist(df['potential_growth'], bins=30, color='purple', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Potential Growth')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Potential Growth', fontsize=12, fontweight='bold')
axes[1].axvline(df['potential_growth'].mean(), color='red', linestyle='--', 
                label=f'Mean: {df["potential_growth"].mean():.2f}')
axes[1].legend()

# Top players with highest potential growth (young talents)
young_talents = df[df['age'] <= 23].nlargest(20, 'potential_growth')[['name', 'age', 'overall_rating', 'potential', 'potential_growth', 'value_euro']]
axes[2].barh(range(len(young_talents)), young_talents['potential_growth'].values, color='green')
axes[2].set_yticks(range(len(young_talents)))
axes[2].set_yticklabels(young_talents['name'].values, fontsize=8)
axes[2].set_xlabel('Potential Growth')
axes[2].set_title('Top 20 Young Talents (Age ≤ 23)', fontsize=12, fontweight='bold')
axes[2].invert_yaxis()

plt.tight_layout()
plt.show()

color_print("Young talents with high potential growth are great investments!", "green")


# 15. Body Type and Work Rate Analysis


In [None]:
# Body type analysis
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Body type distribution
body_type_counts = df['body_type'].value_counts().head(10)
axes[0].barh(body_type_counts.index, body_type_counts.values, color='teal')
axes[0].set_title('Top 10 Body Types', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Number of Players')
axes[0].invert_yaxis()

# Average rating by body type
avg_rating_by_body = df.groupby('body_type')['overall_rating'].mean().sort_values(ascending=False).head(10)
axes[1].barh(avg_rating_by_body.index, avg_rating_by_body.values, color='coral')
axes[1].set_title('Average Rating by Body Type (Top 10)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Average Overall Rating')
axes[1].invert_yaxis()

# Skill ratings distribution
skill_ratings = df[['weak_foot(1-5)', 'skill_moves(1-5)', 'international_reputation(1-5)']].mean()
axes[2].bar(range(len(skill_ratings)), skill_ratings.values, 
            color=['skyblue', 'lightgreen', 'gold'], edgecolor='black')
axes[2].set_xticks(range(len(skill_ratings)))
axes[2].set_xticklabels(['Weak Foot', 'Skill Moves', 'Int. Reputation'], rotation=15)
axes[2].set_ylabel('Average Rating')
axes[2].set_title('Average Skill Ratings', fontsize=12, fontweight='bold')
axes[2].set_ylim([0, 5])

plt.tight_layout()
plt.show()


# 16. Key Insights for Deep Learning Models


In [None]:
## Summary of EDA Findings

### For Model 1: Player Ranking Prediction (Overall Rating)

**Target Variable:** `overall_rating`

**Key Findings:**
1. **Distribution**: Overall rating follows a roughly normal distribution with mean ~66 and range 47-94
2. **Top Correlated Features**: 
   - Reactions, composure, short_passing, ball_control, vision (all > 0.85 correlation)
   - These should be priority features for the ranking model
3. **Position Impact**: Different positions have distinct skill profiles that affect ratings
4. **Age Factor**: Peak performance typically occurs between ages 25-30

**Recommended Features for Ranking Model:**
- Technical skills: reactions, composure, ball_control, short_passing, vision
- Physical attributes: stamina, strength, acceleration
- Player metadata: age, potential, position
- Skill ratings: weak_foot, skill_moves, international_reputation
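One caveat when shortlisting features by correlation: sorting coefficients in descending order surfaces only positive relationships, so a strongly negative feature would land at the bottom of the list. A minimal sketch on synthetic data (the column names `pos_feature`, `neg_feature`, and `noise` are hypothetical) ranks by absolute value instead:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (not real player data):
# one positively related, one negatively related, one unrelated feature
target = rng.normal(size=n)
toy = pd.DataFrame({
    "target": target,
    "pos_feature": target + rng.normal(scale=0.3, size=n),
    "neg_feature": -target + rng.normal(scale=0.3, size=n),
    "noise": rng.normal(size=n),
})

corr = toy.corr()["target"].drop("target")

# Rank by absolute value so strong negative relationships are not overlooked
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The signed coefficient is kept in `ranked`, so the direction of each relationship remains visible after ranking.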

### For Model 2: Player Price Prediction (Value/Wage)

**Target Variables:** `value_euro`, `wage_euro`

**Key Findings:**
1. **Distribution**: Both value and wage are heavily right-skewed (use log transformation)
2. **Top Correlated Features with Value**:
   - Overall rating (0.91), potential (0.76), reactions (0.82)
   - Wage is also highly correlated with value (0.91)
3. **Age Impact**: Younger players with high potential have better value-to-performance ratios
4. **Position Premium**: Certain positions (ST, CAM, CF) command higher average values

**Recommended Features for Price Model:**
- Performance metrics: overall_rating, potential, potential_growth
- Technical skills: reactions, composure, ball_control
- Market factors: age, position, nationality, international_reputation
- Physical attributes: height, weight, stamina
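The log transformation suggested for the skewed targets can be sketched as follows. The values are made up for illustration; `np.log1p` and `np.expm1` are the same NumPy functions used for the histograms earlier in this notebook and their inverse.

```python
import numpy as np

# Made-up player values in euros (illustrative only)
values = np.array([0.0, 50_000.0, 1_000_000.0, 110_500_000.0])

# log1p handles zeros safely because log1p(0) == 0
log_values = np.log1p(values)

# expm1 is the inverse, used to turn predictions on the log scale back into euros
recovered = np.expm1(log_values)

print(log_values.round(2))
print(np.allclose(recovered, values))
```

A model trained on the log-transformed target predicts on the log scale, so its outputs would need the `np.expm1` step before being reported as prices.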

### Data Preprocessing Recommendations:

1. **Handling Skewness**: Apply log transformation to `value_euro` and `wage_euro`
2. **Feature Engineering**:
   - Create `potential_growth` = potential - overall_rating
   - Create age groups (young: <23, prime: 23-30, veteran: >30)
   - Encode categorical variables: position, nationality, preferred_foot, body_type
3. **Feature Scaling**: Normalize/standardize numerical features before training
4. **Handle Missing Values**: Already done with median imputation for value, wage, release_clause
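The feature engineering and scaling steps above could look roughly like this. It is a sketch on a toy frame whose column names mirror the dataset; the bin edges for the age groups and the use of plain z-score standardization are assumptions, not choices confirmed elsewhere in this notebook.

```python
import pandas as pd

# Toy rows (made up); column names mirror the dataset
toy = pd.DataFrame({
    "age": [19, 23, 27, 30, 34],
    "overall_rating": [62, 70, 81, 78, 74],
    "potential": [80, 78, 83, 78, 74],
    "preferred_foot": ["Right", "Left", "Right", "Right", "Left"],
})

toy["potential_growth"] = toy["potential"] - toy["overall_rating"]

# Age groups: young (<23), prime (23-30), veteran (>30)
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 22, 30, 120],  # right-inclusive intervals: (0, 22], (22, 30], (30, 120]
    labels=["young", "prime", "veteran"],
)

# One-hot encode the categorical columns
encoded = pd.get_dummies(toy, columns=["preferred_foot", "age_group"])

# Standardize numeric columns to zero mean and unit variance
numeric_cols = ["age", "overall_rating", "potential", "potential_growth"]
scaled = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std(ddof=0)

print(encoded.columns.tolist())
print(scaled.round(2))
```

In a real pipeline the mean and standard deviation would normally be computed on the training split only and then reused for the test split, to avoid leaking test information.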

### Next Steps:
1. Feature engineering and transformation
2. Train-test split (80-20 or 70-30)
3. Build deep learning models:
   - Model 1: Neural network for ranking prediction (regression)
   - Model 2: Neural network for price prediction (regression with log-transformed target)
4. Evaluate models using appropriate metrics (MAE, RMSE, R²)
5. Create decision support interface for player selection
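The split and evaluation steps can be sketched with NumPy alone. The data here is synthetic, and a least-squares line stands in as a placeholder for the neural networks, which are not built in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (illustrative only): y depends linearly on one feature
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + 5.0 + rng.normal(scale=1.0, size=200)

# 80-20 split on shuffled indices
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

# Placeholder model: least-squares line fitted on the training part only
slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)
y_pred = slope * X[test_idx] + intercept
y_true = y[test_idx]

# Evaluation metrics on the held-out part
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```

For the price model, these metrics are easier to interpret when computed after converting predictions back from the log scale to euros.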

color_print("\n" + "="*80, "green")
color_print("EDA COMPLETE! Ready for Model Building", "green")
color_print("="*80, "green")


# 17. Top Players Analysis (Bonus Visualization)


In [None]:
# Showcase top players
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# Top 20 players by overall rating
top_players = df.nlargest(20, 'overall_rating')[['name', 'overall_rating', 'positions', 'age', 'nationality']]
axes[0, 0].barh(range(len(top_players)), top_players['overall_rating'].values, color='gold')
axes[0, 0].set_yticks(range(len(top_players)))
axes[0, 0].set_yticklabels([f"{row['name']} ({row['positions']})" for _, row in top_players.iterrows()], fontsize=8)
axes[0, 0].set_xlabel('Overall Rating')
axes[0, 0].set_title('Top 20 Players by Overall Rating', fontsize=12, fontweight='bold')
axes[0, 0].invert_yaxis()

# Top 20 most valuable players
top_valuable = df.nlargest(20, 'value_euro')[['name', 'value_euro', 'positions', 'age', 'overall_rating']]
axes[0, 1].barh(range(len(top_valuable)), top_valuable['value_euro'].values/1e6, color='green')
axes[0, 1].set_yticks(range(len(top_valuable)))
axes[0, 1].set_yticklabels([f"{row['name']} ({row['positions']})" for _, row in top_valuable.iterrows()], fontsize=8)
axes[0, 1].set_xlabel('Value (Million Euro)')
axes[0, 1].set_title('Top 20 Most Valuable Players', fontsize=12, fontweight='bold')
axes[0, 1].invert_yaxis()

# Top 20 highest paid players
top_wage = df.nlargest(20, 'wage_euro')[['name', 'wage_euro', 'positions', 'age', 'overall_rating']]
axes[1, 0].barh(range(len(top_wage)), top_wage['wage_euro'].values/1e3, color='coral')
axes[1, 0].set_yticks(range(len(top_wage)))
axes[1, 0].set_yticklabels([f"{row['name']} ({row['positions']})" for _, row in top_wage.iterrows()], fontsize=8)
axes[1, 0].set_xlabel('Wage (Thousand Euro)')
axes[1, 0].set_title('Top 20 Highest Paid Players', fontsize=12, fontweight='bold')
axes[1, 0].invert_yaxis()

# Best value for money (high rating, low value)
df['value_rating_ratio'] = df['overall_rating'] / (df['value_euro'] / 1e6 + 1)  # +1 to avoid division by zero
best_value = df[df['overall_rating'] >= 75].nlargest(20, 'value_rating_ratio')[['name', 'overall_rating', 'value_euro', 'value_rating_ratio', 'positions', 'age']]  # keep the ratio column; it is plotted below
axes[1, 1].barh(range(len(best_value)), best_value['value_rating_ratio'].values, color='purple')
axes[1, 1].set_yticks(range(len(best_value)))
axes[1, 1].set_yticklabels([f"{row['name']} (R:{row['overall_rating']})" for _, row in best_value.iterrows()], fontsize=8)
axes[1, 1].set_xlabel('Rating-to-Value Ratio (rating per million euro + 1)')
axes[1, 1].set_title('Top 20 Best Value Players (Rating ≥ 75)', fontsize=12, fontweight='bold')
axes[1, 1].invert_yaxis()

plt.tight_layout()
plt.show()

color_print("\nThese visualizations help identify the best players for different budget scenarios!", "cyan")


In [None]:
# Save the cleaned dataset for model training
df.to_csv('fifa_players_cleaned.csv', index=False)
color_print("\nCleaned dataset saved as 'fifa_players_cleaned.csv'", "green")


In [32]:
# Check if we have any duplicate records
df.duplicated().any()

np.False_

It means we don't have any duplicate records. Nonetheless, there might be players with duplicate names.

In [33]:
df['full_name'].duplicated().sum()

np.int64(56)

It means **56** records share a full name with an earlier record. To check whether they are actually different players, we can compare their `birth_date` as well.

In [40]:
df[df.duplicated(subset=['full_name', 'birth_date'], keep=False)]

Unnamed: 0,name,full_name,birth_date,age,height_cm,weight_kgs,positions,nationality,overall_rating,potential,...,long_shots,aggression,interceptions,positioning,vision,penalties,composure,marking,standing_tackle,sliding_tackle


We got **0** records where the player's full name and date of birth are the same. This means that all of the records in our dataset are unique.
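The difference between the two duplicate checks is easier to see on a tiny made-up example (the names and dates below are invented for illustration):

```python
import pandas as pd

# Made-up records: two different players share a name, one record is repeated
toy = pd.DataFrame({
    "full_name": ["John Smith", "John Smith", "Ana Silva", "Ana Silva"],
    "birth_date": ["1/1/1990", "5/5/1995", "2/2/1992", "2/2/1992"],
})

# Name alone flags the second row of each pair
same_name = toy["full_name"].duplicated().sum()

# Name plus birth date flags only the pair that is likely the same person;
# keep=False marks every member of a duplicate group, not just the later ones
same_player = toy[toy.duplicated(subset=["full_name", "birth_date"], keep=False)]

print(same_name)
print(same_player)
```

Here the two "John Smith" rows count as a name duplicate but drop out once the birth date is included, which is the same pattern observed in the dataset.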