# Player Data Analysis - Football Statistics Processing

This Jupyter notebook processes and analyzes football player statistics from various datasets. The workflow includes:

1. Data cleaning and standardization
2. Position identification for players
3. Merging datasets with position information
4. Handling goalkeeper-specific data
5. Fixing data formatting issues
6. Updating missing player positions

The code uses web scraping techniques to retrieve player positions from Transfermarkt when not available in the original datasets.

# 1) Importing Required Libraries

This cell imports all necessary Python libraries for our football data analysis:

- **os**: File and directory manipulation
- **re**: Regular expression operations for text cleaning
- **time**: Timing operations for web scraping delays
- **pandas & numpy**: Data processing and numerical operations
- **requests & requests_html**: HTTP requests for web scraping
- **BeautifulSoup**: HTML parsing for web scraping
- **fuzzywuzzy**: Fuzzy string matching for player name comparisons

In [2]:
import os
import re
import time
import pandas as pd
import numpy as np
import requests
from requests_html import HTMLSession
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz, process



# 2) Data Cleaning and Processing Functions

This section defines functions for cleaning and processing football player data:

- **clean_and_save_dataset()**: A general-purpose data cleaning function that:
    - Standardizes country and competition names
    - Handles missing values
    - Filters players with meaningful playing time
    - Removes invalid entries and duplicates
    - Saves the cleaned dataset

The function takes input and output file paths, performs various cleaning operations specific to football data, and returns the cleaned DataFrame while saving it to the specified location.

This function is primarily used for processing goalkeeper-specific datasets, where position data handling differs from outfield players.
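
To make the prefix-stripping step concrete, here is a minimal sketch of the regular expression that the cleaning functions apply to `Nation` and `Comp`. The sample values are illustrative assumptions about what the raw export looks like (a lowercase code followed by the real value), and `strip_lowercase_prefix` is a hypothetical helper name that does not appear in the notebook.

```python
import re

def strip_lowercase_prefix(value):
    """Drop a leading run of lowercase letters followed by one whitespace character."""
    return re.sub(r'^[a-z]+\s', '', str(value))

# Illustrative raw values (assumed, not taken from the datasets)
raw_values = ["eng ENG", "es La Liga", "FRA"]
cleaned_values = [strip_lowercase_prefix(v) for v in raw_values]
print(cleaned_values)
```

Values without a lowercase prefix pass through unchanged, so the step should be safe to apply to a file that has already been cleaned.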

In [None]:
def clean_and_save_dataset(input_file, output_file=None):
    df = pd.read_csv(input_file)
    
    if 'Nation' in df.columns:
        df['Nation'] = df['Nation'].apply(lambda x: re.sub(r'^[a-z]+\s', '', str(x)))
    
    if 'Comp' in df.columns:
        df['Comp'] = df['Comp'].apply(lambda x: re.sub(r'^[a-z]+\s', '', str(x)))
    
    df = df.fillna(0)

    if '90s' in df.columns:
        df['90s'] = pd.to_numeric(df['90s'], errors='coerce').fillna(0)
 
        df = df[df['90s'] > 0]
    
    if 'Rk' in df.columns:
        df = df[df['Rk'].astype(str).str.isdigit()]
    
    if 'Real_Pos' in df.columns and 'Pos' in df.columns:
        df['Pos'] = df['Real_Pos']
    
    duplicates = df.duplicated()
    if duplicates.sum() > 0:
        print(f"⚠️ {duplicates.sum()} duplicate rows found and removed.")
        df = df.drop_duplicates()
    
    df = df.reset_index(drop=True)
    
    folder_name = "Cleaned dataset"
    os.makedirs(folder_name, exist_ok=True)
    
    if output_file is None:
        output_file = os.path.basename(input_file)
    
    output_path = f"{folder_name}/{output_file}"
    df.to_csv(output_path, index=False)
    
    print(f"File Saved Successfully in '{output_path}' ✅")
    
    return df

# 3) Player Position Extraction from Transfermarkt

This cell defines a function to scrape player position data from Transfermarkt:

- **get_player_info()**: Retrieves a player's position by:
    - Constructing a search URL with the player's name
    - Sending a browser-like `User-Agent` header so the request is less likely to be rejected as automated traffic
    - Parsing the search results HTML with BeautifulSoup
    - First attempting to find an exact name match
    - Falling back to the first search result if no exact match is found
    - Returning "Unknown" if no position data can be retrieved

The function includes detailed logging to track the scraping process, including search URLs, response codes, and match quality information.
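
The URL construction and the match-selection rule can be separated from the network call, which makes them easier to check. The sketch below is an assumption-laden illustration: `build_search_url` and `pick_position` are hypothetical helpers, `quote_plus` is swapped in for the plain space replacement so that apostrophes and accented characters are percent-encoded, and the result tuples are made-up stand-ins for parsed search rows.

```python
from urllib.parse import quote_plus

SEARCH_URL = "https://www.transfermarkt.com/schnellsuche/ergebnis?query={}"

def build_search_url(player_name):
    """Percent-encode the name; spaces become '+' as in the notebook's URL."""
    return SEARCH_URL.format(quote_plus(player_name))

def pick_position(player_name, results):
    """Choose a position from (found_name, position) tuples in search-result order.

    Returns (position, match_kind) where match_kind is 'exact', 'fallback' or 'none'.
    """
    if not results:
        return "Unknown", "none"
    for found_name, position in results:
        if found_name == player_name:
            return position, "exact"
    # No exact match: fall back to the first search result.
    return results[0][1], "fallback"

# Hypothetical parsed rows, for illustration only
print(build_search_url("Max Aarons"))
print(pick_position("Max Aarons", [("Max Aarons", "RB")]))
print(pick_position("Son Heung-min", [("Heung-min Son", "LW")]))
```

Returning the match kind alongside the position makes it possible to review fallback matches later, since the first search result is not guaranteed to be the intended player.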

In [None]:
# Function to get player position from Transfermarkt search page
def get_player_info(player_name):
    search_url = f"https://www.transfermarkt.com/schnellsuche/ergebnis?query={player_name.replace(' ', '+')}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    
    try:
        print(f"  Searching URL: {search_url}")
        # A timeout keeps one stalled request from hanging the whole lookup loop
        response = requests.get(search_url, headers=headers, timeout=10)
        print(f"  Response status: {response.status_code}")
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            
            player_rows = soup.find_all("tr", class_=["odd", "even"])
            print(f"  Found {len(player_rows)} player results")
            
            if len(player_rows) == 0:
                print(f"  No search results for {player_name}")
                return "Unknown"
            
            for player_row in player_rows:
                hauptlink_td = player_row.find("td", class_="hauptlink")
                if hauptlink_td:
                    player_link = hauptlink_td.find("a")
                    if player_link:
                        found_name = player_link.text.strip()
                        print(f"  Checking player: {found_name}")
                        
                        if found_name == player_name:
                            position_td = player_row.find("td", class_="zentriert")
                            if position_td:
                                position = position_td.text.strip()
                                print(f"  ✓ Exact match! Position: {position}")
                                return position
            
            if player_rows:
                player_row = player_rows[0]
                hauptlink_td = player_row.find("td", class_="hauptlink")
                if hauptlink_td:
                    player_link = hauptlink_td.find("a")
                    if player_link:
                        found_name = player_link.text.strip()
                        position_td = player_row.find("td", class_="zentriert")
                        if position_td:
                            position = position_td.text.strip()
                            print(f"  ⚠ No exact match, using closest: {found_name} (Position: {position})")
                            return position
            
            print(f"  ✗ No suitable match found for {player_name}")
    except Exception as e:
        print(f"  ✗ Error fetching position for {player_name}: {e}")
    
    return "Unknown"

Starting position lookup for first 20 players...

Looking up position for Max Aarons (#1)...
  Searching URL: https://www.transfermarkt.com/schnellsuche/ergebnis?query=Max+Aarons
  Response status: 200
  Found 1 player results
  Checking player: Max Aarons
  ✓ Exact match! Position: RB

Looking up position for Brenden Aaronson (#2)...
  Searching URL: https://www.transfermarkt.com/schnellsuche/ergebnis?query=Brenden+Aaronson
  Response status: 200
  Found 1 player results
  Checking player: Brenden Aaronson
  ✓ Exact match! Position: AM

Looking up position for Paxten Aaronson (#3)...
  Searching URL: https://www.transfermarkt.com/schnellsuche/ergebnis?query=Paxten+Aaronson
  Response status: 200
  Found 1 player results
  Checking player: Paxten Aaronson
  ✓ Exact match! Position: CM

Looking up position for Yunis Abdelhamid (#4)...
  Searching URL: https://www.transfermarkt.com/schnellsuche/ergebnis?query=Yunis+Abdelhamid
  Response status: 200
  Found 1 player results
  Checking pla

# 4) Data Cleaning and Position Lookup for Player Defensive Actions Dataset

This section processes the Player Defensive Actions dataset, which serves as our primary source for player positions. The workflow includes:

- Applying standard data cleaning steps:
    - Removing country/competition prefixes
    - Handling missing values
    - Filtering for players with meaningful playing time
    - Removing invalid entries and duplicates
- Adding a 'Pos' column to store player positions
- Looking up every player's position on Transfermarkt using the `get_player_info()` function
- Tracking success rate of position lookups
- Saving the cleaned dataset with position information

This processed dataset will later serve as a reference for position information when cleaning other player datasets.
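
Because the same player can appear on more than one row, it may be worth looking up each distinct name only once and pausing between requests. The following is a rough sketch under those assumptions: `lookup_positions` is a hypothetical wrapper, the `lookup` argument stands in for `get_player_info()`, and the stub data is invented so the example runs without network access.

```python
import time

def lookup_positions(player_names, lookup, delay_seconds=1.0):
    """Look up each distinct name once, pausing between calls.

    Returns (positions_by_name, number_of_names_with_a_known_position).
    """
    positions = {}
    for name in player_names:
        if name in positions:
            continue  # already looked up, e.g. a player listed for two squads
        if positions:
            time.sleep(delay_seconds)  # be polite to the remote site
        positions[name] = lookup(name)
    found = sum(1 for pos in positions.values() if pos != "Unknown")
    return positions, found

# Stub standing in for get_player_info(); the data is invented
def fake_lookup(name):
    return {"Max Aarons": "RB"}.get(name, "Unknown")

positions, found = lookup_positions(
    ["Max Aarons", "Max Aarons", "Someone Else"], fake_lookup, delay_seconds=0
)
print(f"Found {found} of {len(positions)} distinct players")
```

Caching by name alone assumes that two different players never share a name; where that is a concern, the key could be extended with the squad.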

In [None]:
df = pd.read_csv("seperate csv files/Player Defensive Actions.csv")
df['Nation'] = df['Nation'].apply(lambda x: re.sub(r'^[a-z]+\s', '', str(x)))
df['Comp'] = df['Comp'].apply(lambda x: re.sub(r'^[a-z]+\s', '', str(x)))
df = df.fillna(0)
df['90s'] = pd.to_numeric(df['90s'], errors='coerce').fillna(0)
df = df[df['90s'] > 0]
df = df[df['Rk'].astype(str).str.isdigit()]
duplicates = df.duplicated()
if duplicates.sum() > 0:
    print(f"⚠️ {duplicates.sum()} duplicate rows found and removed.")
    df = df.drop_duplicates()
df = df.reset_index(drop=True)

if 'Pos' not in df.columns:
    df['Pos'] = "Unknown"

print("Starting position lookup for all players...")
success_count = 0

for index, row in df.iterrows():
    player_name = row['Player']
    print(f"\nLooking up position for {player_name} (#{index+1})...")
    position = get_player_info(player_name)
    
    if position != "Unknown":
        success_count += 1
    
    df.at[index, 'Pos'] = position
    time.sleep(1)  # pause between requests to avoid hammering Transfermarkt

print(f"\nSuccessfully found positions for {success_count} players.")
folder_name = "Cleaned dataset"
os.makedirs(folder_name, exist_ok=True)
filename = f"{folder_name}/Player Defensive Actions.csv"
df.to_csv(filename, index=False)
print(f"Full dataset saved successfully in '{filename}' ✅")

# 5) Data Cleaning and Position Merging Function for All Other Datasets

This section defines the `clean_and_merge_positions()` function - a comprehensive tool for processing player datasets while ensuring position data consistency:

- **Key Features**:
    - Applies all standard cleaning techniques from our base function
    - Merges position data from our primary position dataset 
    - Handles position conflicts by prioritizing the position from the defensive actions dataset
    - Falls back to the dataset's original position when the reference has no entry for that player and squad
    - Works with various dataset formats while maintaining column integrity

The function takes an input file, optional output filename, and a position reference file (defaults to our processed defensive actions dataset). This approach ensures consistent position information across all player datasets, which is critical for position-based analysis.
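
The merge-then-prioritize pattern is easier to follow on a tiny example. This sketch uses invented players and positions; only the column names and the `merge` / `combine_first` sequence mirror the function.

```python
import pandas as pd

# Dataset being cleaned: coarse positions (invented sample rows)
stats = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Squad": ["X", "Y", "Z"],
    "Pos": ["DF", "MF", "FW"],
})

# Position reference: detailed positions, but no entry for player C
reference = pd.DataFrame({
    "Player": ["A", "B"],
    "Squad": ["X", "Y"],
    "Pos": ["RB", "AM"],
})

merged = stats.merge(reference, on=["Player", "Squad"], how="left", suffixes=("", "_Def"))
# Prefer the reference position; keep the original where the reference is missing
merged["Pos"] = merged["Pos_Def"].combine_first(merged["Pos"])
merged = merged.drop(columns=["Pos_Def"])
print(merged)
```

Players A and B take the reference positions, while C keeps `FW` because the left join leaves `Pos_Def` empty for that row.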

In [None]:
def clean_and_merge_positions(input_file, output_file=None, position_file="Cleaned dataset/Player Defensive Actions.csv"):
    df = pd.read_csv(input_file)
    
    if 'Nation' in df.columns:
        df['Nation'] = df['Nation'].apply(lambda x: re.sub(r'^[a-z]+\s', '', str(x)))
    
    if 'Comp' in df.columns:
        df['Comp'] = df['Comp'].apply(lambda x: re.sub(r'^[a-z]+\s', '', str(x)))
    
    df = df.fillna(0)
    
    if '90s' in df.columns:
        df['90s'] = pd.to_numeric(df['90s'], errors='coerce').fillna(0)
        df = df[df['90s'] > 0]
    
    if 'Rk' in df.columns:
        df = df[df['Rk'].astype(str).str.isdigit()]
    
    try:
        pos_df = pd.read_csv(position_file)
        if {'Player', 'Squad', 'Pos'}.issubset(df.columns) and {'Player', 'Squad', 'Pos'}.issubset(pos_df.columns):
            df = df.merge(pos_df[['Player', 'Squad', 'Pos']], on=['Player', 'Squad'], how='left', suffixes=('', '_Def'))
            df['Pos'] = df['Pos_Def'].combine_first(df['Pos'])
            if 'Pos_Def' in df.columns:
                df.drop(columns=['Pos_Def'], inplace=True)
    except Exception as e:
        print(f"Warning: Could not merge positions: {e}")
    
    duplicates = df.duplicated()
    if duplicates.sum() > 0:
        print(f"⚠️ {duplicates.sum()} duplicate rows found and removed.")
        df = df.drop_duplicates()
    
    df = df.reset_index(drop=True)
    
    folder_name = "Cleaned dataset"
    os.makedirs(folder_name, exist_ok=True)
    
    if output_file is None:
        output_file = os.path.basename(input_file)
    
    output_path = f"{folder_name}/{output_file}"
    df.to_csv(output_path, index=False)
    
    print(f"File Saved Successfully in '{output_path}' ✅")
    
    return df


File Saved Successfully in 'Cleaned dataset/Player Goal and Shot Creation 2.csv' ✅


# 6) Applying Data Cleaning Functions to Multiple Player Datasets

This section applies our cleaning functions to process all player datasets in the collection. The workflow includes:

- **Dataset Processing Logic**:
    - Using `clean_and_merge_positions()` for standard player datasets
    - Using `clean_and_save_dataset()` for goalkeeper-specific datasets
    - Skipping the defensive actions dataset as it was already processed

- **Key Operations**:
    - Reading each CSV file from the source directory
    - Determining the appropriate cleaning function based on dataset type
    - Applying position merging for field player datasets
    - Handling goalkeeper data with specialized cleaning
    - Saving each cleaned dataset to the output directory
    
This batch processing ensures consistent cleaning and position handling across all player datasets, creating a unified foundation for subsequent analysis.
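
The routing rule itself can be expressed as a small pure function, which keeps the decision testable without touching any files. This is only a sketch: `choose_cleaner` is a hypothetical helper, and it returns the name of the cleaning function rather than calling it.

```python
import os

def choose_cleaner(path, already_processed=("Player Defensive Actions.csv",)):
    """Return which cleaning step applies to a dataset, based on its file name."""
    name = os.path.basename(path)
    if name in already_processed:
        return "skip"
    if "Goalkeeping" in name:
        return "clean_and_save_dataset"
    return "clean_and_merge_positions"

for path in [
    "seperate csv files/Player Passing.csv",
    "seperate csv files/Player Advanced Goalkeeping.csv",
    "seperate csv files/Player Defensive Actions.csv",
]:
    print(path, "->", choose_cleaner(path))
```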

In [None]:
csv_files = [
    "seperate csv files/Player Goal and Shot Creation.csv",
    #"seperate csv files/Player Defensive Actions.csv",
    "seperate csv files/Player Miscellaneous Stats.csv",
    "seperate csv files/Player Pass Types.csv",
    "seperate csv files/Player Passing.csv",
    "seperate csv files/Player Playing Time.csv",
    "seperate csv files/Player Possession.csv",
    "seperate csv files/Player Shooting.csv",
    "seperate csv files/Player Standard.csv",
    "seperate csv files/Player Goalkeeping.csv",
    "seperate csv files/Player Advanced Goalkeeping.csv"
]

for file in csv_files:
    print(f"\n📊 Processing: {file}")
    output_filename = os.path.basename(file)
    if "Goalkeeping" in file:
        print(f"🧤 Using goalkeeper-specific cleaning for {file}")
        df = clean_and_save_dataset(file, output_filename)
    else:
        df = clean_and_merge_positions(file, output_filename)
    print(f"✅ Processed {len(df)} rows in {output_filename}")

# 7) Analyzing Unknown Player Positions

In this section, we'll analyze players with unknown positions in our cleaned datasets. This will help us identify data gaps that need to be addressed to ensure complete position information for all players in our analysis.

The function below will identify players with unknown or missing position data from each dataset, allowing us to track and resolve these cases systematically.
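
One detail worth checking when counting gaps: the cleaning functions call `fillna(0)`, so a missing position can end up stored as `0` and come back from CSV as the text `"0"`. The sketch below treats that value as unknown too. It is an assumption-based illustration; `unknown_position_mask` is a hypothetical helper and the rows are invented.

```python
import pandas as pd

def unknown_position_mask(pos, placeholders=("Unknown", "0")):
    """True where a position is missing or holds a placeholder value."""
    as_text = pos.astype(str).str.strip()
    return pos.isna() | as_text.isin(placeholders)

# Invented sample rows
sample = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "Pos": ["RB", "Unknown", None, "0"],
})

mask = unknown_position_mask(sample["Pos"])
print(f"{int(mask.sum())} of {len(sample)} players have no usable position")
print(sample[mask])
```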

In [None]:
def find_unknown_positions(input_file, output_file=None):
    df = pd.read_csv(input_file)
    unknown_pos = df[df['Pos'].isna() | (df['Pos'] == 'Unknown')]
    print(f"Players with unknown positions in {input_file}:")
    print(unknown_pos[['Player', 'Squad', 'Pos']])
    if output_file is None:
        output_file = "unknown_positions.csv"
    unknown_pos.to_csv(output_file, index=False)
    print(f"✅ Unknown positions saved to {output_file}")
    return unknown_pos

Players with unknown positions:
                         Player            Squad      Pos
55       Jean-Daniel Akpa-Akpro            Monza  Unknown
167                 Srđan Babić          Almería  Unknown
200             Daniel Bandeira    Hellas Verona  Unknown
286   Victor Bernth Kristiansen          Bologna  Unknown
345           Mohamed Bouchenna    Clermont Foot  Unknown
399       Rareș-Cătălin Burnete            Lecce  Unknown
553   Nikita Contini Baranovsky           Napoli  Unknown
591            Juan Cruz Armada          Osasuna  Unknown
794                Adri Embarba          Almería  Unknown
908       Pablo Galdames Millán            Genoa  Unknown
949            Lautaro Gianetti          Udinese  Unknown
1085             Hwang Hee-chan           Wolves  Unknown
1112              Son Heung-min        Tottenham  Unknown
1145            Pierre Højbjerg        Tottenham  Unknown
1175            Diego Iturralde          Sevilla  Unknown
1182               Lee Jae-sung         

# Identifying Players with Unknown Positions

This section examines the cleaned data to identify players with missing position information. We will use a dedicated function to scan our primary dataset (Player Defensive Actions) to find players whose positions remain unknown despite our earlier processing.

The analysis will help us:

1. Quantify the remaining data quality issues
2. Identify specific players requiring position information
3. Prepare for targeted position updates in subsequent processing steps

By addressing these unknown positions, we'll ensure complete position coverage across our player database, enabling more accurate position-based analysis in later stages.

In [None]:
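def update_player_positions(input_file, position_source_file="filled.csv"):
    original_df = pd.read_csv(input_file)
    real_positions_df = pd.read_csv(position_source_file)
    # Keep one position per (Player, Squad) so the left merge cannot add rows
    # and merged_df stays aligned row-for-row with original_df.
    real_positions_df = real_positions_df[['Player', 'Squad', 'Pos']].drop_duplicates(subset=['Player', 'Squad'])
    merged_df = original_df.merge(real_positions_df, 
                                 on=['Player', 'Squad'], 
                                 how='left', 
                                 suffixes=('', '_real'))
    # Prefer the position from the source file; keep the existing one otherwise.
    original_df['Pos'] = merged_df['Pos_real'].combine_first(original_df['Pos'])
    original_df.to_csv(input_file, index=False)

    print(f"✅ Unknown player positions have been replaced in {input_file} successfully!")
    
    return original_df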

✅ Unknown player positions have been replaced in place successfully!


# 8) Fixing and Updating Missing Player Positions

In this section, we define a function to systematically update player positions in our datasets:

- **update_player_positions()**: A utility function that:
    - Takes an input dataset file and a position source file
    - Merges the original dataset with known positions from the position source
    - Prioritizes positions from the source file while preserving existing valid positions
    - Updates the original file with the merged position data
    - Returns the updated DataFrame with complete position information

This function is crucial for maintaining position consistency across all player datasets, ensuring that position information discovered for a player in one dataset is properly propagated to all other datasets containing that player.

We also implement a data formatting fix for the 'Min' column across all datasets, converting values with commas to proper integers, which ensures numerical operations on playing time data work correctly in our subsequent analysis.
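
As an illustration of the comma fix, the sketch below converts strings such as `"1,234"` to integers. It is a variation rather than the notebook's exact approach: it uses `pd.to_numeric` and the nullable `Int64` dtype so that a missing value stays missing instead of raising an error, and the sample values are invented.

```python
import pandas as pd

# Invented sample values, including one missing entry
minutes = pd.Series(["1,234", "90", "2,700", None])

without_commas = minutes.astype(str).str.replace(",", "", regex=False)
# Anything that is not a number becomes NaN, then the nullable integer dtype keeps it as <NA>
cleaned = pd.to_numeric(without_commas, errors="coerce").astype("Int64")
print(cleaned)
```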

In [None]:
csv_files = [
    "Cleaned dataset/Player Goal and Shot Creation.csv",
    "Cleaned dataset/Player Defensive Actions.csv",
    "Cleaned dataset/Player Miscellaneous Stats.csv",
    "Cleaned dataset/Player Pass Types.csv",
    "Cleaned dataset/Player Passing.csv",
    "Cleaned dataset/Player Playing Time.csv",
    "Cleaned dataset/Player Possession.csv",
    "Cleaned dataset/Player Shooting.csv",
    "Cleaned dataset/Player Standard.csv",
    "Cleaned dataset/Player Goalkeeping.csv",
    "Cleaned dataset/Player Advanced Goalkeeping.csv"
]
def fix_min_column(df):
    if 'Min' in df.columns:
        df['Min'] = df['Min'].astype(str).str.replace(',', '').astype(float).astype(int)
    return df
for file in csv_files:
    print(f"🔄 Fixing Min column in {file}...")
    df = pd.read_csv(file)
    df = fix_min_column(df)
    df.to_csv(file, index=False)
    print(f"✅ {file} fixed and saved successfully!")
print("🎯 All CSV files have been fixed!")

🔄 Fixing Min column in Cleaned dataset/Player Goal and Shot Creation 2.csv...
✅ Cleaned dataset/Player Goal and Shot Creation 2.csv fixed and saved successfully!
🔄 Fixing Min column in Cleaned dataset/Player Defensive Actions 2023-2.csv...
✅ Cleaned dataset/Player Defensive Actions 2023-2.csv fixed and saved successfully!
🔄 Fixing Min column in Cleaned dataset/Player Miscellaneous Stats 2023.csv...
✅ Cleaned dataset/Player Miscellaneous Stats 2023.csv fixed and saved successfully!
🔄 Fixing Min column in Cleaned dataset/Player Pass Types 2023-2024 Big.csv...
✅ Cleaned dataset/Player Pass Types 2023-2024 Big.csv fixed and saved successfully!
🔄 Fixing Min column in Cleaned dataset/Player Passing 2023-2024 Big 5 .csv...
✅ Cleaned dataset/Player Passing 2023-2024 Big 5 .csv fixed and saved successfully!
🔄 Fixing Min column in Cleaned dataset/Player Playing Time 2023-2024.csv...
✅ Cleaned dataset/Player Playing Time 2023-2024.csv fixed and saved successfully!
🔄 Fixing Min column in Cleaned d

# 9) Exploring Hybrid Positions and Position Unknowns

This section focuses on identifying players with hybrid or unknown positions in our dataset. The code:

1. Loads the Player Standard Stats dataset for position analysis
2. Defines position categories of interest for filtering:
    - `DF,FW`: Players who operate as both defenders and forwards
    - `MF,FW`: Players who operate as both midfielders and forwards
    - `DF,MF`: Players who operate as both defenders and midfielders
    - `Unknown`: Players whose positions haven't been successfully identified

3. Filters the dataset to show only players in these position categories
4. Displays the filtered players to help identify position gaps and hybrid roles

This analysis helps identify players who need position corrections and highlights versatile players who operate across multiple positions, which is valuable for tactical analysis.

In [5]:
# Load the Player Standard Stats.csv file
df_player_standard = pd.read_csv('Cleaned dataset/Player Standard 2023-2024 Big 5.csv')

# Define the desired positions
positions_to_filter = ['DF,FW', 'MF,FW', 'Unknown', 'DF,MF']

# Filter the DataFrame based on the 'Pos' column
filtered_df = df_player_standard[df_player_standard['Pos'].isin(positions_to_filter)]

# Display the filtered DataFrame
filtered_df

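| Unnamed: 0 | Rk | Player | Nation | Pos | Squad | Comp | Age | Born | MP | Starts | ... | Gls2 | Ast3 | G+A4 | G-PK5 | G+A-PK | xG6 | xAG7 | xG+xAG | npxG8 | npxG+xAG9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1150 | 1174 | Iglesias | ESP | DF,FW | Getafe | La Liga | 25 | 1998 | 23 | 16 | ... | 0.0 | 0.13 | 0.13 | 0.0 | 0.13 | 0.04 | 0.04 | 0.08 | 0.04 | 0.08 |
| 1338 | 1367 | N'Guessan Kouadio | CIV | MF,FW | Metz | Ligue 1 | 20 | 2003 | 6 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.07 | 0.0 | 0.07 | 0.07 | 0.07 |
| 1746 | 1785 | Dion Moise Sahi | CIV | MF,FW | Strasbourg | Ligue 1 | 21 | 2001 | 18 | 7 | ... | 0.43 | 0.14 | 0.57 | 0.43 | 0.57 | 0.58 | 0.19 | 0.77 | 0.58 | 0.77 |
| 2081 | 2130 | Yeremi Pino | ESP | MF,FW | Villarreal | La Liga | 20 | 2002 | 7 | 7 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.27 | 0.05 | 0.31 | 0.27 | 0.31 |
| 2390 | 2443 | Jailson Siqueira | BRA | DF,MF | Celta Vigo | La Liga | 27 | 1995 | 14 | 9 | ... | 0.0 | 0.1 | 0.1 | 0.0 | 0.1 | 0.02 | 0.05 | 0.07 | 0.02 | 0.07 |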


# 10) Manual Position Updates for Missing Data

This section addresses specific players whose positions remain unidentified or incorrect after our automated processes. We implement two key functions:

1. A position dictionary approach to update specific players with known positions
2. A specialized function to handle goalkeeper identification issues

These manual corrections ensure that:

- Specific young talents like Bertuğ Yıldırım and Kenan Yıldız have correct position assignments
- Goalkeepers incorrectly classified in field player datasets are properly identified
- Position consistency is maintained across all datasets for these edge cases

This step is necessary to handle exceptional cases where automated scraping failed to identify the correct positions, particularly for newer players or those with unusual name representations in different data sources.
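
The dictionary approach can also be written without an explicit loop by mapping player names to their corrected positions. This is a sketch of an alternative formulation, not the notebook's code: the two dictionary entries come from this section, while "Example Player" and the starting positions are invented.

```python
import pandas as pd

players_positions = {
    "Bertuğ Yıldırım": "CF",
    "Kenan Yıldız": "LW",
}

# Invented sample rows; only the first two names appear in the dictionary
df = pd.DataFrame({
    "Player": ["Bertuğ Yıldırım", "Kenan Yıldız", "Example Player"],
    "Pos": ["Unknown", "Unknown", "CB"],
})

# Names found in the dictionary get the new position; everyone else keeps theirs
df["Pos"] = df["Player"].map(players_positions).fillna(df["Pos"])
print(df)
```

Matching on the name alone updates every row with that name, so for common single-word names it is safer to add a second condition such as `Nation`.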

In [None]:
players_positions = {
    "Bertuğ Yıldırım": "CF",
    "Kenan Yıldız": "LW"
}
# Named differently from the file-based update_player_positions() defined earlier,
# so re-running cells out of order cannot swap one function for the other.
def apply_manual_positions(df, players_positions):
    for player, new_position in players_positions.items():
        df.loc[df['Player'] == player, 'Pos'] = new_position
    return df
for file in csv_files:
    print(f"🔄 Updating player positions in {file}...")
    df = pd.read_csv(file)
    df = apply_manual_positions(df, players_positions)
    df.to_csv(file, index=False)
    print(f"✅ {file} updated and saved successfully!")
print("🎯 All player positions have been updated in all CSV files!")

🔄 Updating player positions in Cleaned dataset/Player Goal and Shot Creation 2.csv...
✅ Cleaned dataset/Player Goal and Shot Creation 2.csv updated and saved successfully!
🔄 Updating player positions in Cleaned dataset/Player Defensive Actions 2023-2.csv...
✅ Cleaned dataset/Player Defensive Actions 2023-2.csv updated and saved successfully!
🔄 Updating player positions in Cleaned dataset/Player Miscellaneous Stats 2023.csv...
✅ Cleaned dataset/Player Miscellaneous Stats 2023.csv updated and saved successfully!
🔄 Updating player positions in Cleaned dataset/Player Pass Types 2023-2024 Big.csv...
✅ Cleaned dataset/Player Pass Types 2023-2024 Big.csv updated and saved successfully!
🔄 Updating player positions in Cleaned dataset/Player Passing 2023-2024 Big 5 .csv...
✅ Cleaned dataset/Player Passing 2023-2024 Big 5 .csv updated and saved successfully!
🔄 Updating player positions in Cleaned dataset/Player Playing Time 2023-2024.csv...
✅ Cleaned dataset/Player Playing Time 2023-2024.csv upda

In [None]:
def process_player_data(directory):
  """Processes player data files, replacing positions with 'GK' for specific players.

  Args:
    directory: The directory containing the CSV files.
  """
  for filename in os.listdir(directory):
    if filename.endswith(".csv") and not (filename.startswith('Player Advanced Goalkeeping') or filename.startswith('Player Goalkeeping')):
      filepath = os.path.join(directory, filename)
      try:
        df = pd.read_csv(filepath)
        if 'Player' in df.columns:
          count = 0
          for index, row in df.iterrows():
            if row['Player'] == 'Yassine Bounou' or row['Player'] == 'Cristian' or (row['Player'] == 'Fernando' and 'Nation' in df.columns and row['Nation'] == 'ESP'):
              if 'Pos' in df.columns:
                df.loc[index, 'Pos'] = 'GK'
                count += 1
          print(f"\nChanges made in {filename}: {count}")
          df.to_csv(filepath, index=False)
      except Exception as e:
        print(f"Error reading {filename}: {e}")

# Run against the folder that holds the cleaned CSV files.
process_player_data('Cleaned dataset')

In [None]:
def replace_player_positions(directory):
  """Replaces positions for specific players in all CSV files in a directory.

  Args:
    directory: The directory containing the CSV files.
  """
  for filename in os.listdir(directory):
    if filename.endswith(".csv"):
      filepath = os.path.join(directory, filename)
      try:
        df = pd.read_csv(filepath)
        if 'Player' in df.columns and 'Pos' in df.columns:
          changes_count = 0
          for index, row in df.iterrows():
            if row['Player'] == 'Iglesias' and 'Nation' in df.columns and row['Nation'] == 'ESP':
              df.loc[index, 'Pos'] = 'RB'
              changes_count += 1
            elif row['Player'] == "N'Guessan Kouadio" and 'Nation' in df.columns and row['Nation'] == 'CIV':
              df.loc[index, 'Pos'] = 'CM'
              changes_count += 1
            elif row['Player'] == 'Dion Moise Sahi' and 'Nation' in df.columns and row['Nation'] == 'CIV':
              df.loc[index, 'Pos'] = 'CF'
              changes_count += 1
            elif row['Player'] == 'Yeremi Pino' and 'Nation' in df.columns and row['Nation'] == 'ESP':
              df.loc[index, 'Pos'] = 'RW'
              changes_count += 1
            elif row['Player'] == 'Jailson Siqueira' and 'Nation' in df.columns and row['Nation'] == 'BRA':
              df.loc[index, 'Pos'] = 'DM'
              changes_count += 1
            elif row['Player'] == 'Bertuğ Yıldırım' and 'Nation' in df.columns and row['Nation'] == 'TUR':
              df.loc[index, 'Pos'] = 'CF'
              changes_count += 1
            elif row['Player'] == 'Kenan Yıldız' and 'Nation' in df.columns and row['Nation'] == 'TUR':
              df.loc[index, 'Pos'] = 'CF'
              changes_count += 1
          if changes_count > 0:
            print(f"Changes made in {filename}: {changes_count}")
            df.to_csv(filepath, index=False)
      except Exception as e:
        print(f"Error reading {filename}: {e}")

replace_player_positions('Cleaned dataset')

# 11) Merging Goalkeeper Datasets

This section focuses on merging two goalkeeper-specific datasets to create a comprehensive view of goalkeeper performance metrics:

1. **Input Datasets**:
    - `Player Advanced Goalkeeping`: Contains advanced goalkeeping metrics such as PSxG, PSxG+/-, PSxG/SoT
    - `Player Goalkeeping`: Contains standard goalkeeping metrics like saves, clean sheets, and goals conceded

2. **Merge Strategy**:
    - Perform an outer join on common identifier columns
    - Maintain unique column names using appropriate suffixes
    - Preserve all metrics from both datasets
    - Handle duplicate columns appropriately

3. **Output**:
    - Create a single consolidated goalkeeper dataset with complete stats
    - Save the result as 'Merged_Goalkeeping_Data.csv'

This merged dataset will enable more comprehensive analysis of goalkeeper performance by combining traditional and advanced metrics in a single view.
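
With an outer join it helps to see which goalkeepers were found in both files and which in only one. The sketch below uses the `indicator` option of `merge` for that purpose; the goalkeepers, values and the shortened key list are invented for illustration.

```python
import pandas as pd

# Invented sample rows with a shortened key list
advanced = pd.DataFrame({
    "Player": ["GK One", "GK Two"],
    "Squad": ["A", "B"],
    "PSxG": [30.1, 25.4],
})
standard = pd.DataFrame({
    "Player": ["GK One", "GK Three"],
    "Squad": ["A", "C"],
    "Saves": [80, 65],
})

merged = advanced.merge(standard, on=["Player", "Squad"], how="outer", indicator=True)
# '_merge' records whether each row came from both frames or only one of them
coverage = merged["_merge"].value_counts().to_dict()
print(coverage)
```

Rows marked `left_only` or `right_only` will have empty metrics from the other file, which is worth knowing before averaging or ranking.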

In [None]:
df_advanced_gk = pd.read_csv('Cleaned dataset/Player Advanced Goalkeeping.csv')
df_gk = pd.read_csv('Cleaned dataset/Player Goalkeeping.csv')

# Merge the datasets on common columns, keeping only one copy of the common columns
df_merged = pd.merge(df_advanced_gk, df_gk, on=['Rk','Player', 'Nation', 'Pos', 'Squad', 'Comp', 'Age', 'Born', '90s', 'GA', 'PKA'], how='outer', suffixes=('_advanced', '_gk'))

# Save the merged DataFrame to a new CSV file
df_merged.to_csv('Merged_Goalkeeping_Data.csv', index=False)

# 12) Merging Goalkeeping Data with Player Standard Stats

This section combines our consolidated goalkeeper dataset with the standard player statistics to create a comprehensive player database:

1. **Input Datasets**:
    - `Merged_Goalkeeping_Data.csv`: Our previously merged goalkeeper metrics
    - `Player Standard Stats.csv`: Contains standard performance metrics for all players

2. **Merge Operation**:
    - Left join from goalkeeper data to standard stats
    - Match on key player identifiers: Player, Nation, Position, Squad, Competition, Age, etc.
    - Preserve all goalkeeper-specific metrics while adding relevant standard stats
    - Validate the merge by ensuring goalkeeper count is maintained

3. **Data Validation**:
    - Check that the merged dataset contains the same number of goalkeepers as the original
    - Only proceed with saving if the merge maintains data integrity

The resulting dataset `Merged_All_Data.csv` provides a complete view of goalkeeper performance with additional context from standard player metrics, enabling deeper analytical comparisons.
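
The row-count check can be complemented by asking pandas to verify the join itself. The following sketch, which is an assumption rather than the notebook's method, uses `validate="many_to_one"` so that duplicate keys on the right-hand side raise an error instead of silently adding rows; `checked_left_merge` and all sample data are hypothetical.

```python
import pandas as pd

def checked_left_merge(left, right, keys):
    """Left-join right onto left, refusing to proceed if right has duplicate keys."""
    try:
        merged = left.merge(right, on=keys, how="left", validate="many_to_one")
    except pd.errors.MergeError as exc:
        print(f"Merge rejected: {exc}")
        return None
    assert len(merged) == len(left)  # a many-to-one left join keeps the row count
    return merged

keys = ["Player", "Squad"]
keepers = pd.DataFrame({"Player": ["GK One", "GK Two"], "Squad": ["A", "B"], "Saves": [80, 65]})
standard_ok = pd.DataFrame({"Player": ["GK One", "GK Two"], "Squad": ["A", "B"], "MP": [30, 28]})
standard_dup = pd.DataFrame({"Player": ["GK One", "GK One"], "Squad": ["A", "A"], "MP": [30, 30]})

good = checked_left_merge(keepers, standard_ok, keys)
bad = checked_left_merge(keepers, standard_dup, keys)
```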

In [None]:
df_merged_gk = pd.read_csv('Merged_Goalkeeping_Data.csv')
df_player_standard = pd.read_csv('Cleaned dataset/Player Standard Stats.csv')

# Merge the datasets on specified columns
df_merged_all = pd.merge(df_merged_gk, df_player_standard, on=['Player', 'Nation', 'Pos', 'Squad', 'Comp', 'Age', 'Born', '90s'], how='left')

# A left join should keep exactly one row per goalkeeper; extra rows mean duplicate keys in the standard stats
if df_merged_all.shape[0] == df_merged_gk.shape[0]:
  # Save the merged dataset
  df_merged_all.to_csv('Merged_All_Data.csv', index=False)
else:
  print("Merged dataset size is different from the merged goalkeeping dataset size. Merged file not saved.")

# 13) Creating a Comprehensive Dataset for Field Players

This section focuses on merging all outfield player datasets to create a single comprehensive dataset for analysis:

1. **Input Datasets**:
    - `Player Standard Stats.csv`: Base dataset with core player information
    - Various specialized datasets covering specific aspects of player performance:
      - Shooting statistics
      - Passing metrics
      - Pass types and distribution
      - Goal and shot creation
      - Defensive actions
      - Possession metrics
      - Playing time details
      - Miscellaneous statistics

2. **Merge Strategy**:
    - Use Player Standard Stats as the foundation dataset
    - Merge each specialized dataset one by one
    - Join on consistent player identifier columns
    - Only include unique columns from each dataset to avoid duplication
    - Maintain data integrity by checking dataset sizes before merging

3. **Output**:
    - Create a single comprehensive outfield player dataset
    - Save as 'Merged Final.csv' for subsequent analysis

This merged dataset will enable multi-dimensional analysis of player performance by combining metrics across all aspects of play in a single, consolidated view.
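
The one-by-one merge strategy can be sketched with `functools.reduce`, adding only the columns that the running result does not already contain. This is a simplified illustration with invented data: it uses a two-column key instead of the full identifier list, and it left-joins a smaller dataset rather than skipping it on a size mismatch.

```python
from functools import reduce
import pandas as pd

merge_columns = ["Player", "Squad"]  # shortened key list for the example

def add_new_columns(merged, current):
    """Left-join only those columns of `current` that `merged` lacks."""
    new_columns = [c for c in current.columns if c not in merged.columns]
    return merged.merge(current[merge_columns + new_columns], on=merge_columns, how="left")

# Invented sample frames
standard = pd.DataFrame({"Player": ["A", "B"], "Squad": ["X", "Y"], "90s": [10.0, 5.5], "Gls": [3, 0]})
shooting = pd.DataFrame({"Player": ["A", "B"], "Squad": ["X", "Y"], "90s": [10.0, 5.5], "Sh": [25, 4]})
passing = pd.DataFrame({"Player": ["A"], "Squad": ["X"], "90s": [10.0], "Cmp": [300]})

combined = reduce(add_new_columns, [shooting, passing], standard)
print(combined)
```

The shared `90s` column appears once in the result, and player B has an empty `Cmp` value because the passing frame has no row for that player.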

In [None]:
# Load the Player Standard Stats.csv file
df_player_standard = pd.read_csv('Player Standard Stats.csv')

# List of files to merge (excluding goalkeeping)
files_to_merge = [
    'Cleaned dataset/Player Shooting.csv',
    'Cleaned dataset/Player Passing.csv',
    'Cleaned dataset/Player Pass Types.csv',
    'Cleaned dataset/Player Goal and Shot Creation.csv',
    'Cleaned dataset/Player Defensive Actions.csv',
    'Cleaned dataset/Player Possession.csv',
    'Cleaned dataset/Player Playing Time.csv',
    'Cleaned dataset/Player Miscellaneous Stats.csv'
]

# Columns to merge on
merge_columns = ['Player', 'Nation', 'Pos', 'Squad', 'Comp', 'Age', 'Born']

# Create a copy of df_player_standard to start the merge process
df_merged = df_player_standard.copy()

# Iterate through the files to merge
for file in files_to_merge:
    try:
        df_current = pd.read_csv(file)

        # Ensure both datasets have the same size before merging
        if df_current.shape[0] == df_player_standard.shape[0]:
            # Keep only columns that are NOT already in df_merged (except merge keys)
            new_columns = [col for col in df_current.columns if col not in df_merged.columns or col in merge_columns]

            # Merge only on specified columns, keeping unique columns from each dataset
            df_merged = pd.merge(df_merged, df_current[new_columns], on=merge_columns, how='left')
        else:
            print(f"Warning: Skipping merge with {file} as dataset size does not match.")

    except FileNotFoundError:
        print(f"File not found: {file}")

# Save the merged dataset
df_merged.to_csv('Merged Final.csv', index=False)