# Player Archetype Classification

The goal of this notebook is to classify players into specific roles or "archetypes" based on what position they play. The reason why this is important is because it helps aid in player recruitment since different teams need different types of players depending on the formation and tactical philosophy they adopt.  

For example, a team that likes to play out from the back would prefer to have a modern goalkeeper who has good distribution stats along with ball playing or libeiro center backs that can pass and carry the ball forwards. Being able to identify which players suite these archetypes will allow teams to optimize their starting XIs to have the best players possible.

Given the cleaned data for goalkeepers, defenders, midfielders, and attackers, we can classify our players into their ideal player archetypes based on their stats from the previous season. There are two methods that we will use to do this:  
*Method 1:* Weighted Classification  
*Method 2:* K-Means Clustering

## Exploratory Data Analysis (EDA)

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Load your dataset 
df = pd.read_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Positional Stats/Goalkeeping/Sorted Data/GoalkeepingSortedData.csv')

In [3]:
df

Unnamed: 0.1,Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,...,PSxG,PSxG/SoT,PSxG+/-,PKatt,PKA,PKsv,PKm,Save%.1,#OPA,AvgDist
0,0,Alisson,GK,2023-2024,30,Brazil,Liverpool,Premier League,28,2520,...,28.7,0.26,-0.3,1,1,0,0,0.0,33,17.7
1,1,Alphonse Areola,GK,2023-2024,30,France,West Ham,Premier League,31,2699,...,52.1,0.24,2.1,7,5,2,0,28.6,7,8.9
2,2,Simone Aresti,GK,2023-2024,37,Italy,Cagliari,Serie A,1,1,...,1.0,0.00,0.0,1,1,0,0,0.0,0,
3,3,Noah Atubolu,GK,2023-2024,21,Germany,Freiburg,Bundesliga,34,3060,...,50.0,0.30,-7.0,5,3,2,0,40.0,30,13.6
4,4,Oliver Baumann,GK,2023-2024,33,Germany,Hoffenheim,Bundesliga,34,3060,...,68.6,0.31,3.6,5,2,3,0,60.0,73,16.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,124,Guglielmo Vicario,GK,2023-2024,26,Italy,Tottenham,Premier League,38,3420,...,60.0,0.32,2.0,7,7,0,0,0.0,81,17.1
125,125,Iván Villar,GK,2023-2024,26,Spain,Celta Vigo,La Liga,12,1014,...,16.3,0.31,-0.7,3,2,1,0,33.3,9,13.2
126,126,Odisseas Vlachodimos,GK,2023-2024,29,Greece,Nott'ham Forest,Premier League,5,450,...,7.9,0.31,-4.1,1,1,0,0,0.0,2,10.3
127,127,Robin Zentner,GK,2023-2024,28,Germany,Mainz 05,Bundesliga,30,2700,...,49.2,0.37,2.2,4,3,0,1,0.0,48,16.0


In [4]:
# Rename the column 'Save%.1' to 'PKSave%'
df = df.rename(columns={'Save%.1': 'PKSave%'})

In [5]:
# Fill all missing values with zeros
df = df.fillna(0)

## Goalkeepers

### Archetypes
- Classic Goalkeeper: More long balls and long goal kicks (higher Launch%, AvgLen, Att)  
- Modern Goalkeeper: Better short passing range, less launch%, AvgLen  
- Sweeper Keeper: More defensive actions outside the box (highe #OPA, AvgDist, Stp%)

In [6]:
# Import libraries
import numpy as np
from scipy.stats import percentileofscore

In [7]:
df.columns

Index(['Unnamed: 0', 'Player', 'Pos', 'Season', 'Age', 'Nation', 'Team',
       'Comp', 'MP', 'Min', '90s', 'Starts', 'Subs', 'GA', 'SoTA', 'Saves',
       'Save%', 'W', 'D', 'L', 'CS', 'CS%', 'Att', 'Launch%', 'AvgLen', 'Opp',
       'Stp', 'Stp%', 'PSxG', 'PSxG/SoT', 'PSxG+/-', 'PKatt', 'PKA', 'PKsv',
       'PKm', 'PKSave%', '#OPA', 'AvgDist'],
      dtype='object')

In [8]:
df.dropna()

Unnamed: 0.1,Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,...,PSxG,PSxG/SoT,PSxG+/-,PKatt,PKA,PKsv,PKm,PKSave%,#OPA,AvgDist
0,0,Alisson,GK,2023-2024,30,Brazil,Liverpool,Premier League,28,2520,...,28.7,0.26,-0.3,1,1,0,0,0.0,33,17.7
1,1,Alphonse Areola,GK,2023-2024,30,France,West Ham,Premier League,31,2699,...,52.1,0.24,2.1,7,5,2,0,28.6,7,8.9
2,2,Simone Aresti,GK,2023-2024,37,Italy,Cagliari,Serie A,1,1,...,1.0,0.00,0.0,1,1,0,0,0.0,0,0.0
3,3,Noah Atubolu,GK,2023-2024,21,Germany,Freiburg,Bundesliga,34,3060,...,50.0,0.30,-7.0,5,3,2,0,40.0,30,13.6
4,4,Oliver Baumann,GK,2023-2024,33,Germany,Hoffenheim,Bundesliga,34,3060,...,68.6,0.31,3.6,5,2,3,0,60.0,73,16.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,124,Guglielmo Vicario,GK,2023-2024,26,Italy,Tottenham,Premier League,38,3420,...,60.0,0.32,2.0,7,7,0,0,0.0,81,17.1
125,125,Iván Villar,GK,2023-2024,26,Spain,Celta Vigo,La Liga,12,1014,...,16.3,0.31,-0.7,3,2,1,0,33.3,9,13.2
126,126,Odisseas Vlachodimos,GK,2023-2024,29,Greece,Nott'ham Forest,Premier League,5,450,...,7.9,0.31,-4.1,1,1,0,0,0.0,2,10.3
127,127,Robin Zentner,GK,2023-2024,28,Germany,Mainz 05,Bundesliga,30,2700,...,49.2,0.37,2.2,4,3,0,1,0.0,48,16.0


In [9]:
# Drop the first column by name
df = df.drop('Unnamed: 0', axis=1)

#### Method 1: Weighted Classification

In [10]:
# Define weights for each archetype
weights = {
    "Classic GK": {"Launch%": 0.3, "Att": 0.2, "AvgLen": 0.5},
    "Modern GK": {"Launch%": -0.3, "Att": -0.2, "AvgLen": -0.5},
    "Sweeper Keeper": {"AvgDist": 0.35, "#OPA": 0.35, "Stp": 0.15, "Stp%": 0.15}
}

In [11]:
# Function to calculate the score for each archetype
def calculate_score(row, archetype):
    score = 0
    for feature, weight in weights[archetype].items():
        score += weight * row[feature]
    return score

In [12]:
# Apply the scoring and classification
def classify_archetype(row):
    scores = {archetype: calculate_score(row, archetype) for archetype in weights}
    return max(scores, key=scores.get)

In [13]:
df['Ideal Archetype'] = df.apply(classify_archetype, axis=1)

In [14]:
df.head()

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,PSxG/SoT,PSxG+/-,PKatt,PKA,PKsv,PKm,PKSave%,#OPA,AvgDist,Ideal Archetype
0,Alisson,GK,2023-2024,30,Brazil,Liverpool,Premier League,28,2520,28.0,...,0.26,-0.3,1,1,0,0,0.0,33,17.7,Classic GK
1,Alphonse Areola,GK,2023-2024,30,France,West Ham,Premier League,31,2699,30.0,...,0.24,2.1,7,5,2,0,28.6,7,8.9,Classic GK
2,Simone Aresti,GK,2023-2024,37,Italy,Cagliari,Serie A,1,1,0.0,...,0.0,0.0,1,1,0,0,0.0,0,0.0,Classic GK
3,Noah Atubolu,GK,2023-2024,21,Germany,Freiburg,Bundesliga,34,3060,34.0,...,0.3,-7.0,5,3,2,0,40.0,30,13.6,Classic GK
4,Oliver Baumann,GK,2023-2024,33,Germany,Hoffenheim,Bundesliga,34,3060,34.0,...,0.31,3.6,5,2,3,0,60.0,73,16.7,Classic GK


In [15]:
df['Ideal Archetype'].value_counts()

Ideal Archetype
Classic GK    129
Name: count, dtype: int64

### Method 2: K-Means Clustering

In [16]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [17]:
# Load your dataset
gk_df = pd.read_csv(r"/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Positional Stats/Goalkeeping/Sorted Data/GoalkeepingSortedData.csv")

In [18]:
gk_df

Unnamed: 0.1,Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,...,PSxG,PSxG/SoT,PSxG+/-,PKatt,PKA,PKsv,PKm,Save%.1,#OPA,AvgDist
0,0,Alisson,GK,2023-2024,30,Brazil,Liverpool,Premier League,28,2520,...,28.7,0.26,-0.3,1,1,0,0,0.0,33,17.7
1,1,Alphonse Areola,GK,2023-2024,30,France,West Ham,Premier League,31,2699,...,52.1,0.24,2.1,7,5,2,0,28.6,7,8.9
2,2,Simone Aresti,GK,2023-2024,37,Italy,Cagliari,Serie A,1,1,...,1.0,0.00,0.0,1,1,0,0,0.0,0,
3,3,Noah Atubolu,GK,2023-2024,21,Germany,Freiburg,Bundesliga,34,3060,...,50.0,0.30,-7.0,5,3,2,0,40.0,30,13.6
4,4,Oliver Baumann,GK,2023-2024,33,Germany,Hoffenheim,Bundesliga,34,3060,...,68.6,0.31,3.6,5,2,3,0,60.0,73,16.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,124,Guglielmo Vicario,GK,2023-2024,26,Italy,Tottenham,Premier League,38,3420,...,60.0,0.32,2.0,7,7,0,0,0.0,81,17.1
125,125,Iván Villar,GK,2023-2024,26,Spain,Celta Vigo,La Liga,12,1014,...,16.3,0.31,-0.7,3,2,1,0,33.3,9,13.2
126,126,Odisseas Vlachodimos,GK,2023-2024,29,Greece,Nott'ham Forest,Premier League,5,450,...,7.9,0.31,-4.1,1,1,0,0,0.0,2,10.3
127,127,Robin Zentner,GK,2023-2024,28,Germany,Mainz 05,Bundesliga,30,2700,...,49.2,0.37,2.2,4,3,0,1,0.0,48,16.0


In [19]:
# Fill all missing values with zeros
gk_df = gk_df.fillna(0)

In [20]:
# Rename the column 'Save%.1' to 'PKSave%'
gk_df = gk_df.rename(columns={'Save%.1': 'PKSave%'})

In [21]:
# Drop the first column by name
gk_df = gk_df.drop('Unnamed: 0', axis=1)

#### Step 1: Run K-Means Clustering

In [22]:
# Select relevant features for clustering
features = ["Launch%", "Att", "AvgLen", "AvgDist", "#OPA", "Stp", "Stp%"]
X = gk_df[features]

In [23]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [24]:
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
gk_df['Cluster'] = kmeans.fit_predict(X_scaled)

#### Step 2: Analyze Cluster Centers

In [25]:
# Get the cluster centers and transform them back to the original scale
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=features)

print("Cluster Centers:")
print(cluster_centers_df)

Cluster Centers:
     Launch%         Att     AvgLen    AvgDist       #OPA        Stp      Stp%
0  68.226471  245.352941  50.711765  14.182353  36.882353  27.323529  5.855882
1  41.993220   98.711864  36.586441  14.210169  14.372881   9.322034  4.425424
2  48.294444  168.361111  40.072222  14.338889  39.250000  33.527778  8.372222


#### Step 3: Intrepret Cluster Centers

Cluster 0: This cluster most closely aligns with "Classic GK"  
Cluster 1: This cluster most closely aligns with "Modern GK"  
Cluster 2: This cluster most closely aligns with "Sweeper Keeper"  

#### Step 4: Manually Map Clusters to Archetypes

In [26]:
# Map clusters to archetypes manually based on cluster characteristics
cluster_mapping = {
    0: "Classic GK", 
    1: "Modern GK",
    2: "Sweeper Keeper"
}

In [27]:
gk_df['Ideal Archetype'] = gk_df['Cluster'].map(cluster_mapping)

In [28]:
gk_df

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,PSxG+/-,PKatt,PKA,PKsv,PKm,PKSave%,#OPA,AvgDist,Cluster,Ideal Archetype
0,Alisson,GK,2023-2024,30,Brazil,Liverpool,Premier League,28,2520,28.0,...,-0.3,1,1,0,0,0.0,33,17.7,1,Modern GK
1,Alphonse Areola,GK,2023-2024,30,France,West Ham,Premier League,31,2699,30.0,...,2.1,7,5,2,0,28.6,7,8.9,0,Classic GK
2,Simone Aresti,GK,2023-2024,37,Italy,Cagliari,Serie A,1,1,0.0,...,0.0,1,1,0,0,0.0,0,0.0,1,Modern GK
3,Noah Atubolu,GK,2023-2024,21,Germany,Freiburg,Bundesliga,34,3060,34.0,...,-7.0,5,3,2,0,40.0,30,13.6,0,Classic GK
4,Oliver Baumann,GK,2023-2024,33,Germany,Hoffenheim,Bundesliga,34,3060,34.0,...,3.6,5,2,3,0,60.0,73,16.7,0,Classic GK
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,Guglielmo Vicario,GK,2023-2024,26,Italy,Tottenham,Premier League,38,3420,38.0,...,2.0,7,7,0,0,0.0,81,17.1,2,Sweeper Keeper
125,Iván Villar,GK,2023-2024,26,Spain,Celta Vigo,La Liga,12,1014,11.3,...,-0.7,3,2,1,0,33.3,9,13.2,1,Modern GK
126,Odisseas Vlachodimos,GK,2023-2024,29,Greece,Nott'ham Forest,Premier League,5,450,5.0,...,-4.1,1,1,0,0,0.0,2,10.3,1,Modern GK
127,Robin Zentner,GK,2023-2024,28,Germany,Mainz 05,Bundesliga,30,2700,30.0,...,2.2,4,3,0,1,0.0,48,16.0,2,Sweeper Keeper


In [29]:
gk_df['Ideal Archetype'].value_counts()

Ideal Archetype
Modern GK         59
Sweeper Keeper    36
Classic GK        34
Name: count, dtype: int64

### Group Classification by Save Metrics

In [30]:
# Select relevant columns
columns_to_include = [
    'Player', 'Pos', 'Season', 'Age', 'Team', 'Comp', 'Ideal Archetype'
]

In [31]:
grouped_gk__df = gk_df[columns_to_include]

### Export Goalkeeper Dataframe

In [32]:
grouped_gk__df.to_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/GKArchetypes.csv')

## Defenders

### Center Back Archetypes
- Ball Playing Center Back: More short and medium passes completed, high pass completion accuracy, high dribbling stats  
- Libero: Highest total distance covered, higher # of carries, higher PrgDist and PrgC (progressive carries), more CPAs (carries into opponent's box)  
- Central Defender: More long passes completed, Higher # of tackles in the defensive third  
- Classic Defender (No-nonsense): Higher # of clearances, # of blocks, # of challenges attempted


### Full Back Archetypes:
- Classic Full Back: Less progressive distance (PrgDist, PrgC), more total tackles & tackles in defensive 3rd (Tkl, Def 3rd), 
- Wing Back: Highest progresive distance, more tackles in Att 3rd, higher CPA
- Inverted Wing Back: Higher progressive distance, more tackles in Mid 3rd, higher Rec (# of times player successfully received a pass)

In [33]:
# Step 1: Load and Preprocess Data
df = pd.read_csv(r"/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Positional Stats/Defending/Sorted Data/DefendingSortedData.csv")

In [34]:
df

Unnamed: 0.1,Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,...,Challenge Att,Challenge%,Challenge Lost,Blocks,Sh,Pass,Int,Tkl+Int,Clr,Err
0,0,Max Aarons,RB,2023-2024,23,England,Bournemouth,Premier League,20,1237,...,34,58.8,14,9.0,5,4,8.0,37,27,0.0
1,1,Yunis Abdelhamid,CB,2023-2024,35,Morocco,Reims,Ligue 1,31,2781,...,45,57.8,19,51.0,32,19,39.0,103,109,2.0
2,2,Nabil Aberdin,CB,2023-2024,20,France,Getafe,La Liga,2,180,...,1,0.0,1,0.0,0,0,0.0,0,4,0.0
3,3,Abner,LB,2023-2024,23,Brazil,Betis,La Liga,23,1400,...,34,50.0,17,23.0,5,18,15.0,40,62,0.0
4,4,Abdel Abqar,CB,2023-2024,24,Morocco,Alavés,La Liga,27,2312,...,35,60.0,14,31.0,26,5,23.0,58,115,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
927,927,Oleksandr Zinchenko,LB,2023-2024,26,Ukraine,Arsenal,Premier League,27,1722,...,6928,653.0,706,92.5,601,667,90.1,95,134,70.9
928,928,Luc Zogbé,RB,2023-2024,18,Ivory Coast,Brest,Ligue 1,1,11,...,12,4.0,4,100.0,1,1,100.0,0,2,0.0
929,929,Nadir Zortea,RWB,2023-2024,24,Italy,Frosinone,Serie A,19,1407,...,3172,262.0,303,86.5,166,219,75.8,45,88,51.1
930,930,Kurt Zouma,CB,2023-2024,28,France,West Ham,Premier League,33,2838,...,6791,335.0,365,91.8,509,558,91.2,111,198,56.1


In [35]:
# Fill all missing values with zeros
df = df.fillna(0)

In [36]:
df.columns

Index(['Unnamed: 0', 'Player', 'Pos', 'Season', 'Age', 'Nation', 'Team',
       'Comp', 'MP', 'Min', '90s', 'Starts', 'Subs', 'unSub', 'Carries',
       'TotDist', 'PrgDist', 'PrgC', '1/3', 'CPA', 'Mis', 'Dis', 'Rec', 'PrgR',
       'Won', 'Lost', 'Won%', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt',
       'PKm', 'Pass Cmp', 'Pass Att', 'Cmp%', 'KP', '1/3.1', 'PPA', 'CrsPA',
       'PrgP', 'TotDist.1', 'PrgDist.1', 'Short Cmp', 'Short Att',
       'Short Cmp%', 'Med. Cmp', 'Med. Att', 'Med. Cmp%', 'Long Cmp',
       'Long Att', 'Long Cmp%', 'Tkl', 'TklW', 'Def 3rd', 'Mid 3rd', 'Att 3rd',
       'Challenges', 'Challenge Att', 'Challenge%', 'Challenge Lost', 'Blocks',
       'Sh', 'Pass', 'Int', 'Tkl+Int', 'Clr', 'Err'],
      dtype='object')

In [37]:
# Step 2: Separate Centerbacks and Full Backs/Wing Backs
centerbacks = df[df['Pos'] == 'CB']
full_backs = df[df['Pos'].isin(['RB', 'RWB', 'LB', 'LWB'])]

In [38]:
cetnerbacks = centerbacks.fillna(0)
full_backs = full_backs.fillna(0)

In [39]:
# Features for center backs
cb_features = ['Short Cmp', 'Med. Cmp', 'Long Cmp', 'Short Cmp%', 'Med. Cmp%', 'Long Cmp%', 'TotDist', 'PrgDist', 'CPA', 'Long Cmp', 'Def 3rd', 'Clr', 'Blocks', 'Challenges']

# Features for full backs
fb_features = ['PrgDist', 'PrgC', 'Tkl', 'Def 3rd', 'Att 3rd', 'Mid 3rd', 'Rec', 'CPA']

In [40]:
# Step 3: Standardize Data
scaler_cb = StandardScaler()
X_cb = scaler_cb.fit_transform(centerbacks[cb_features])

scaler_fb = StandardScaler()
X_fb = scaler_fb.fit_transform(full_backs[fb_features])

In [41]:
# Step 4: Apply K-Means Clustering
# For Centerbacks - 4 clusters for 4 archetypes
kmeans_cb = KMeans(n_clusters=4, random_state=42)
centerbacks['Cluster'] = kmeans_cb.fit_predict(X_cb)

# For Full Backs/Wing Backs - 3 clusters for 3 archetypes
kmeans_fb = KMeans(n_clusters=3, random_state=42)
full_backs['Cluster'] = kmeans_fb.fit_predict(X_fb)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  centerbacks['Cluster'] = kmeans_cb.fit_predict(X_cb)


In [42]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
# Centerbacks
cb_cluster_centers = scaler_cb.inverse_transform(kmeans_cb.cluster_centers_)
cb_cluster_centers_df = pd.DataFrame(cb_cluster_centers, columns=cb_features)
print("Centerbacks Cluster Centers:")
print(cb_cluster_centers_df)

Centerbacks Cluster Centers:
    Short Cmp    Med. Cmp     Long Cmp  Short Cmp%  Med. Cmp%  Long Cmp%  \
0    4.477477    1.594595    60.563063    4.234234   3.504505  44.050901   
1    2.823529    0.788235  1082.752941    3.188235   1.941176  84.902353   
2  428.301887  611.981132   121.339623   89.094340  89.292453  58.794340   
3  361.648438   96.679688    26.257812   89.810156  67.834375  26.924219   

       TotDist      PrgDist        CPA     Long Cmp    Def 3rd         Clr  \
0  1301.013514   697.288288   0.873874    60.563063  12.445946   42.527027   
1    16.129412    17.576471  13.529412  1082.752941   5.705882  181.764706   
2  5302.188679  2875.566038   0.981132   121.339623  25.415094   95.566038   
3   306.453125   162.929687   0.101562    26.257812  12.164062   25.695312   

      Blocks    Challenges  
0  32.451351    684.689189  
1  90.261176  20550.800000  
2  31.226415     20.188679  
3  18.023437      8.312500  


#### Center Back Cluster Interpretation
Cluster 0: Central Defender  
Cluster 1: Classic CB  
Cluster 2: Libero  
Cluster 3: Ball Playing CB

In [43]:
# Full Backs/Wing Backs
fb_cluster_centers = scaler_fb.inverse_transform(kmeans_fb.cluster_centers_)
fb_cluster_centers_df = pd.DataFrame(fb_cluster_centers, columns=fb_features)
print("Full Backs/Wing Backs Cluster Centers:")
print(fb_cluster_centers_df)

Full Backs/Wing Backs Cluster Centers:
       PrgDist       PrgC        Tkl    Def 3rd    Att 3rd    Mid 3rd  \
0    30.873239   1.690141  32.126761  31.000000  42.287324  37.957746   
1   188.985714   6.175000  11.607143  10.342857  33.980714   8.582143   
2  2029.149425  48.448276  33.252874  24.597701  30.193103  25.218391   

           Rec        CPA  
0  2960.830986   0.112676  
1   379.246429   3.900000  
2   788.126437  10.666667  


#### Full Back Cluster Interpretation
Cluster 0: Full Back  
Cluster 1: Inverted Wing Back  
Cluster 2: Wing Back  

In [44]:
# Map clusters to archetypes manually based on cluster characteristics
cb_cluster_mapping = {
    0: "Central Defender",
    1: "Classic Center Back",
    2: "Libero",
    3: "Ball Playing Center Back"
}

fb_cluster_mapping = {
    0: "Full Back",
    1: "Wing Back",
    2: "Inverted Wing Back"
}

In [45]:
# Apply the mapping to create the 'Ideal Archetype' column
centerbacks['Ideal Archetype'] = centerbacks['Cluster'].map(cb_cluster_mapping)
full_backs['Ideal Archetype'] = full_backs['Cluster'].map(fb_cluster_mapping)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  centerbacks['Ideal Archetype'] = centerbacks['Cluster'].map(cb_cluster_mapping)


In [46]:
# Combine the data back into a single defenders DataFrame
def_df = pd.concat([centerbacks, full_backs])

In [47]:
def_df

Unnamed: 0.1,Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,...,Challenge Lost,Blocks,Sh,Pass,Int,Tkl+Int,Clr,Err,Cluster,Ideal Archetype
1,1,Yunis Abdelhamid,CB,2023-2024,35,Morocco,Reims,Ligue 1,31,2781,...,19,51.0,32,19,39.0,103,109,2.0,2,Libero
2,2,Nabil Aberdin,CB,2023-2024,20,France,Getafe,La Liga,2,180,...,1,0.0,0,0,0.0,0,4,0.0,3,Ball Playing Center Back
4,4,Abdel Abqar,CB,2023-2024,24,Morocco,Alavés,La Liga,27,2312,...,14,31.0,26,5,23.0,58,115,0.0,2,Libero
5,5,Francesco Acerbi,CB,2023-2024,35,Italy,Inter,Serie A,29,2388,...,3,20.0,13,7,32.0,54,77,1.0,2,Libero
8,8,Tosin Adarabioyo,CB,2023-2024,25,England,Fulham,Premier League,20,1617,...,9,16.0,11,5,25.0,46,80,0.0,2,Libero
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
922,922,Jordan Zemura,LB,2023-2024,23,Zimbabwe,Udinese,Serie A,27,1038,...,178,86.5,69,109,63.3,11,33,33.3,1,Wing Back
927,927,Oleksandr Zinchenko,LB,2023-2024,26,Ukraine,Arsenal,Premier League,27,1722,...,706,92.5,601,667,90.1,95,134,70.9,0,Full Back
928,928,Luc Zogbé,RB,2023-2024,18,Ivory Coast,Brest,Ligue 1,1,11,...,4,100.0,1,1,100.0,0,2,0.0,1,Wing Back
929,929,Nadir Zortea,RWB,2023-2024,24,Italy,Frosinone,Serie A,19,1407,...,303,86.5,166,219,75.8,45,88,51.1,1,Wing Back


In [48]:
def_df['Ideal Archetype'].value_counts()

Ideal Archetype
Wing Back                   280
Central Defender            222
Ball Playing Center Back    128
Inverted Wing Back           87
Classic Center Back          85
Full Back                    71
Libero                       53
Name: count, dtype: int64

#### Group and Export Defender Classification

In [49]:
# Select relevant columns
def_columns_to_include = [
    'Player', 'Pos', 'Season', 'Age', 'Team', 'Comp', 'Ideal Archetype'
]

In [50]:
grouped_def__df = def_df[def_columns_to_include]

In [51]:
grouped_def__df.to_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/DefenderArchetypes.csv')

## Midfielders

In [52]:
mid_df = pd.read_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Positional Stats/Midfielding/Sorted Data/MidfieldingSortedData.csv')

In [53]:
mid_df

Unnamed: 0.1,Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,...,PrgP,Short Cmp,Short Att,Short Cmp%,Med. Cmp,Med. Att,Med. Cmp%,Long Cmp,Long Att,Long Cmp%
0,0,Brenden Aaronson,CM,2023-2024,22,United States,Union Berlin,Bundesliga,30,1267,...,56,206.0,240.0,85.8,105,130,80.8,19,32,59.4
1,1,Paxten Aaronson,CM,2023-2024,19,United States,Eint Frankfurt,Bundesliga,7,101,...,5,20.0,25.0,80.0,20,22,90.9,0,2,0.0
2,2,Salis Abdul Samed,CM,2023-2024,23,Ghana,Lens,Ligue 1,27,1519,...,78,393.0,433.0,90.8,330,360,91.7,41,54,75.9
3,3,Laurent Abergel,CM,2023-2024,30,France,Lorient,Ligue 1,33,2860,...,194,629.0,707.0,89.0,711,802,88.7,150,233,64.4
4,4,Tyler Adams,CM,2023-2024,24,United States,Bournemouth,Premier League,3,121,...,5,35.0,39.0,89.7,18,24,75.0,6,8,75.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
740,740,Bryan Zaragoza,LM,2023-2024,22,Spain,Granada,La Liga,28,1821,...,45,173.0,217.0,79.7,94,167,56.3,28,85,32.9
741,741,Oier Zarraga,CM,2023-2024,24,Spain,Udinese,Serie A,15,404,...,8,54.0,58.0,93.1,31,39,79.5,7,14,50.0
742,742,Piotr Zieliński,CM,2023-2024,29,Poland,Napoli,Serie A,28,1924,...,131,585.0,626.0,93.5,326,394,82.7,78,155,50.3
743,743,Martín Zubimendi,CM,2023-2024,24,Spain,Real Sociedad,La Liga,31,2654,...,152,607.0,677.0,89.7,607,677,89.7,86,126,68.3


In [54]:
mid_df.columns

Index(['Unnamed: 0', 'Player', 'Pos', 'Season', 'Age', 'Nation', 'Team',
       'Comp', 'MP', 'Min', '90s', 'Starts', 'Subs', 'Live', 'Dead', 'Sw',
       'Crs', 'TI', 'CK', 'xG', 'npxG', 'xA', 'G-xG', 'Sh', 'SoT', 'Sh Dist',
       'FK', 'Carries', 'TotDist', 'PrgDist', 'PrgC', '1/3C', 'CPA', 'Mis',
       'Dis', 'Touches', 'Def Pen', 'Att Pen', 'Tkl', 'TklW', 'Def 3rd',
       'Mid 3rd', 'Att 3rd', 'Tkl.1', 'Att', 'Tkl%', 'Lost', 'Blocks', 'Sh.1',
       'Pass', 'Int', 'Tkl+Int', 'Clr', 'Err', 'Gls', 'Ast', 'G+A', 'G-PK',
       'PK', 'PKatt', 'PKm', 'Cmp', 'Att.1', 'Cmp%', 'KP', '1/3', 'PPA',
       'CrsPA', 'PrgP', 'Short Cmp', 'Short Att', 'Short Cmp%', 'Med. Cmp',
       'Med. Att', 'Med. Cmp%', 'Long Cmp', 'Long Att', 'Long Cmp%'],
      dtype='object')

In [55]:
# Drop the first column by name
mid_df = mid_df.drop('Unnamed: 0', axis=1)

### Separate into CM, DM, AM, RM/LM

In [56]:
cm = mid_df[mid_df['Pos'] == 'CM']
dm = mid_df[mid_df['Pos'] == 'DM']
am = mid_df[mid_df['Pos'] == 'AM']
wide = mid_df[mid_df['Pos'].isin(['RM', 'LM'])]

### Central Midfielder Archetypes:
- Classic CM: Link player and hard worker, high # of live touches, tackles in middle 3rd, medium passes completed (Med. Cmp%)  
- Regista: Highest Sw (switches), higher carries and Short Cmp.
- Box-to-Box: Highest TotDist and PrgDist (total and progressive distance), high # carries, higher ShDist (shot distance), high key passes and shot dist.
- Segundo Volante: More tackles, higher PrgC and 1/3C (progressive carries into final third), more defensive with late support going forward
- Mezzala: "Half wingers" that drift wide, less defensive responsibilities (more Short Cmp%, Att 3rd tackles)

In [57]:
# Step 2: Select relevant features for clustering
cm_features = [
    'Live', 'PrgP', 'Mid 3rd', 'Med. Cmp%', # Classic CM
    'Sw', 'Carries', 'Short Cmp',  # Regista
    'Tkl', 'Def 3rd', '1/3C', 'PrgC',      # Segundo Volante
    'Short Cmp%', 'Att 3rd', # Mezzala 
    'TotDist', 'PrgDist', 'Sh Dist', 'KP' # Box-to-Box    
]

In [58]:
cm = cm.fillna(0)

In [59]:
# Step 3: Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(cm[cm_features])

In [60]:
# Step 4: Apply K-Means clustering (6 clusters for 6 archetypes)
kmeans = KMeans(n_clusters=5, random_state=42)
cm['Cluster'] = kmeans.fit_predict(X)

In [61]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=cm_features)
print("Midfielders Cluster Centers:")
print(cluster_centers_df)

Midfielders Cluster Centers:
          Live        PrgP      Mid 3rd   Med. Cmp%         Sw     Carries  \
0   917.157895  486.315789    22.263158  873.368421  31.447368    1.921053   
1     2.333333    8.820513    45.694872    1.615385   2.658974  388.153846   
2  1196.193548  124.193548    23.354839   87.503226  10.483871  869.290323   
3   172.737500   40.302419    65.795161   40.385887   3.375000   49.669355   
4   962.120690  141.068966  1281.137931   12.137931  21.293103   70.965517   

    Short Cmp         Tkl      Def 3rd         1/3C       PrgC   Short Cmp%  \
0  155.263158   50.947368    22.000000  1232.078947   0.026316  1565.500000   
1    5.000000    5.564103    44.589744   145.102564  87.410256     2.128205   
2  521.967742   49.806452    19.709677    35.035484  32.580645    89.429032   
3   53.133065  188.475806   242.543548    37.363710  10.205645    59.196774   
4   36.487931   62.948276  4729.086207    17.689655  17.948276    14.206897   

      Att 3rd      TotDist 

#### CM Cluster Interpretation
Cluster 0: Classic CM  
Cluster 1: Segundo Volante  
Cluster 2: Box-to-Box  
Cluster 3: Mezzala  
Cluster 4: Regista  

In [62]:
# Manual Mapping (Example - update based on cluster center analysis)
cluster_mapping = {
    0: "Mezzala",
    1: "Segundo Volante",
    2: "Box-to-Box",
    3: "Classic CM",
    4: "Regista"
}

In [63]:
# Apply the mapping to create the 'Ideal Archetype' column
cm['Ideal Archetype'] = cm['Cluster'].map(cluster_mapping)

In [64]:
cm['Ideal Archetype'].value_counts()

Ideal Archetype
Classic CM         248
Regista             58
Segundo Volante     39
Mezzala             38
Box-to-Box          31
Name: count, dtype: int64

In [65]:
cm

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,Short Att,Short Cmp%,Med. Cmp,Med. Att,Med. Cmp%,Long Cmp,Long Att,Long Cmp%,Cluster,Ideal Archetype
0,Brenden Aaronson,CM,2023-2024,22,United States,Union Berlin,Bundesliga,30,1267,14.1,...,240.0,85.8,105,130,80.8,19,32,59.4,2,Box-to-Box
1,Paxten Aaronson,CM,2023-2024,19,United States,Eint Frankfurt,Bundesliga,7,101,1.1,...,25.0,80.0,20,22,90.9,0,2,0.0,3,Classic CM
2,Salis Abdul Samed,CM,2023-2024,23,Ghana,Lens,Ligue 1,27,1519,16.9,...,433.0,90.8,330,360,91.7,41,54,75.9,2,Box-to-Box
3,Laurent Abergel,CM,2023-2024,30,France,Lorient,Ligue 1,33,2860,31.8,...,707.0,89.0,711,802,88.7,150,233,64.4,2,Box-to-Box
4,Tyler Adams,CM,2023-2024,24,United States,Bournemouth,Premier League,3,121,1.3,...,39.0,89.7,18,24,75.0,6,8,75.0,3,Classic CM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
737,Denis Zakaria,CM,2023-2024,26,Switzerland,Monaco,Ligue 1,25,2137,23.7,...,502.0,92.2,515,548,94.0,74,92,80.4,3,Classic CM
738,Arsen Zakharyan,CM,2023-2024,20,Russia,Real Sociedad,La Liga,29,1228,13.6,...,256.0,86.3,113,156,72.4,34,85,40.0,3,Classic CM
741,Oier Zarraga,CM,2023-2024,24,Spain,Udinese,Serie A,15,404,4.5,...,58.0,93.1,31,39,79.5,7,14,50.0,3,Classic CM
742,Piotr Zieliński,CM,2023-2024,29,Poland,Napoli,Serie A,28,1924,21.4,...,626.0,93.5,326,394,82.7,78,155,50.3,3,Classic CM


### Defensive Midfielder Archetypes:
- Deep Lying Playmaker: Higher PrgP (progressive passes), higher LongCmp (long passes completed), more swithces (Sw) 
- Ball Winning Midfielder: Highest Tkl+Int, TklW (tackles won), high number of tackles in middle and defensive thirds (Mid 3rd, Def 3rd)
- Anchor Man: High number of pass and shot blocks (Pass, Blocks), high number of tackles and interceptions 

In [66]:
dm

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,PrgP,Short Cmp,Short Att,Short Cmp%,Med. Cmp,Med. Att,Med. Cmp%,Long Cmp,Long Att,Long Cmp%
6,Yacine Adli,DM,2023-2024,23,France,Milan,Serie A,24,1407,15.6,...,125,489.0,534.0,91.6,487,534,91.2,120,177,67.8
7,Michel Aebischer,DM,2023-2024,26,Switzerland,Bologna,Serie A,36,2230,24.8,...,125,729.0,773.0,94.3,525,562,93.4,79,103,76.7
15,Paul Akouokou,DM,2023-2024,25,Ivory Coast,Lyon,Ligue 1,10,334,3.7,...,14,68.0,81.0,84.0,83,88,94.3,15,20,75.0
16,Jean-Daniel Akpa-Akpro,DM,2023-2024,30,Ivory Coast,Monza,Serie A,19,705,7.8,...,10,113.0,128.0,88.3,95,104,91.3,12,15,80.0
27,Sergi Altimira,DM,2023-2024,21,Spain,Betis,La Liga,14,568,6.3,...,31,142.0,154.0,92.2,110,116,94.8,30,36,83.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,Julian Weigl,DM,2023-2024,27,Germany,Gladbach,Bundesliga,31,2764,30.7,...,105,522.0,568.0,91.9,629,673,93.5,130,169,76.9
727,Adam Wharton,DM,2023-2024,19,England,Crystal Palace,Premier League,16,1297,14.4,...,79,209.0,244.0,85.7,199,234,85.0,45,80,56.3
732,Granit Xhaka,DM,2023-2024,30,Switzerland,Leverkusen,Bundesliga,33,2821,31.3,...,392,1681.0,1777.0,94.6,1063,1138,93.4,208,258,80.6
734,Ryan Yates,DM,2023-2024,25,England,Nott'ham Forest,Premier League,35,1992,22.1,...,73,315.0,375.0,84.0,279,324,86.1,37,67,55.2


In [67]:
dm.columns

Index(['Player', 'Pos', 'Season', 'Age', 'Nation', 'Team', 'Comp', 'MP', 'Min',
       '90s', 'Starts', 'Subs', 'Live', 'Dead', 'Sw', 'Crs', 'TI', 'CK', 'xG',
       'npxG', 'xA', 'G-xG', 'Sh', 'SoT', 'Sh Dist', 'FK', 'Carries',
       'TotDist', 'PrgDist', 'PrgC', '1/3C', 'CPA', 'Mis', 'Dis', 'Touches',
       'Def Pen', 'Att Pen', 'Tkl', 'TklW', 'Def 3rd', 'Mid 3rd', 'Att 3rd',
       'Tkl.1', 'Att', 'Tkl%', 'Lost', 'Blocks', 'Sh.1', 'Pass', 'Int',
       'Tkl+Int', 'Clr', 'Err', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt',
       'PKm', 'Cmp', 'Att.1', 'Cmp%', 'KP', '1/3', 'PPA', 'CrsPA', 'PrgP',
       'Short Cmp', 'Short Att', 'Short Cmp%', 'Med. Cmp', 'Med. Att',
       'Med. Cmp%', 'Long Cmp', 'Long Att', 'Long Cmp%'],
      dtype='object')

In [68]:
# Step 2: Select relevant features for clustering
dm_features = [
    'PrgP', 'Long Cmp%', 'Sw', # Deep Lying Playmaker
    'Tkl+Int', 'TklW', 'Mid 3rd', 'Def 3rd', # Ball Winning midfielder
    'Pass', 'Blocks', 'Int' # Anchor man
]

In [69]:
dm = dm.fillna(0)

In [70]:
# Step 3: Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(dm[dm_features])

In [71]:
# Step 4: Apply K-Means clustering (6 clusters for 6 archetypes)
kmeans = KMeans(n_clusters=3, random_state=42)
dm['Cluster'] = kmeans.fit_predict(X)

In [72]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=dm_features)
print("Defensive Midfielders Cluster Centers:")
print(cluster_centers_df)

Defensive Midfielders Cluster Centers:
         PrgP     Long Cmp%         Sw     Tkl+Int         TklW      Mid 3rd  \
0   86.075472  1.830509e+02   7.075472  731.979245   729.622642    76.365094   
1   57.500000  1.925000e+01   2.312500  507.812500  1346.437500    22.937500   
2  349.583333  5.684342e-14  13.500000   13.750000  3654.000000  1886.000000   

       Def 3rd        Pass       Blocks         Int  
0   276.771698   20.575472    97.122642  220.775472  
1    84.768750  545.875000  5831.875000   89.275000  
2  4702.333333   13.508333   888.750000  238.241667  


### DM Cluster Interpretation
Cluster 0: Deep Lying Playmaker  
Cluster 1: Anchor Man  
Cluster 2: Ball Winning Midfielder  

In [73]:
# Manual Mapping (Example - update based on cluster center analysis)
cluster_mapping = {
    0: "Deep Lying Playmaker",
    1: "Ball Winning Midfielder",
    2: "Anchor Man"
}

In [74]:
# Apply the mapping to create the 'Ideal Archetype' column
dm['Ideal Archetype'] = dm['Cluster'].map(cluster_mapping)

In [75]:
dm['Ideal Archetype'].value_counts()

Ideal Archetype
Deep Lying Playmaker       106
Ball Winning Midfielder     16
Anchor Man                  12
Name: count, dtype: int64

### Attacking Mifielder Archetypes
Attacking Midfielder: More key passes and passes into penalty area (KP, PPA), more non-penatly expected goals and shots (npXG, Sh)  
Advanced Playmaker: More short and progressive passes completed (Short Cmp%, PrgP), carries into final 3rd (1/3C) and touches in penalty area (Att Pen)

In [76]:
am

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,PrgP,Short Cmp,Short Att,Short Cmp%,Med. Cmp,Med. Att,Med. Cmp%,Long Cmp,Long Att,Long Cmp%
5,Amine Adli,AM,2023-2024,23,Morocco,Leverkusen,Bundesliga,23,898,10.0,...,39,252.0,291.0,86.6,91,110,82.7,7,18,38.9
11,Tosin Aiyegun,AM,2023-2024,25,Benin,Lorient,Ligue 1,19,902,10.0,...,16,73.0,98.0,74.5,58,70,82.9,10,15,66.7
14,Maghnes Akliouche,AM,2023-2024,21,France,Monaco,Ligue 1,28,1613,17.9,...,93,380.0,444.0,85.6,196,246,79.7,32,69,46.4
17,Luis Alberto,AM,2023-2024,30,Spain,Lazio,Serie A,33,2311,25.7,...,212,711.0,797.0,89.2,442,538,82.2,104,239,43.5
24,Nabil Alioui,AM,2023-2024,24,France,Le Havre,Ligue 1,18,1009,11.2,...,43,118.0,151.0,78.1,64,92,69.6,35,83,42.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
723,Luca Waldschmidt,AM,2023-2024,27,Germany,Köln,Bundesliga,22,1146,12.7,...,46,150.0,180.0,83.3,125,163,76.7,31,60,51.7
730,Florian Wirtz,AM,2023-2024,20,Germany,Leverkusen,Bundesliga,32,2372,26.4,...,224,1003.0,1124.0,89.2,395,462,85.5,62,98,63.3
731,Jeong Woo-yeong,AM,2023-2024,23,South Korea,Stuttgart,Bundesliga,26,633,7.0,...,22,188.0,209.0,90.0,57,73,78.1,12,15,80.0
735,Yusuf Yazıcı,AM,2023-2024,26,Turkey,Lille,Ligue 1,27,1312,14.6,...,48,215.0,261.0,82.4,110,140,78.6,23,37,62.2


In [77]:
am.columns

Index(['Player', 'Pos', 'Season', 'Age', 'Nation', 'Team', 'Comp', 'MP', 'Min',
       '90s', 'Starts', 'Subs', 'Live', 'Dead', 'Sw', 'Crs', 'TI', 'CK', 'xG',
       'npxG', 'xA', 'G-xG', 'Sh', 'SoT', 'Sh Dist', 'FK', 'Carries',
       'TotDist', 'PrgDist', 'PrgC', '1/3C', 'CPA', 'Mis', 'Dis', 'Touches',
       'Def Pen', 'Att Pen', 'Tkl', 'TklW', 'Def 3rd', 'Mid 3rd', 'Att 3rd',
       'Tkl.1', 'Att', 'Tkl%', 'Lost', 'Blocks', 'Sh.1', 'Pass', 'Int',
       'Tkl+Int', 'Clr', 'Err', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt',
       'PKm', 'Cmp', 'Att.1', 'Cmp%', 'KP', '1/3', 'PPA', 'CrsPA', 'PrgP',
       'Short Cmp', 'Short Att', 'Short Cmp%', 'Med. Cmp', 'Med. Att',
       'Med. Cmp%', 'Long Cmp', 'Long Att', 'Long Cmp%'],
      dtype='object')

In [78]:
# Step 2: Select relevant features for clustering
am_features = [
    'KP', 'PPA', 'npxG', 'Sh', # Attacking Midfielder
    'Short Cmp%', 'PrgP', '1/3C', 'Att Pen' # Advanced Playmaker
]

In [79]:
am = am.fillna(0)

In [80]:
# Step 3: Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(am[am_features])

In [81]:
# Step 4: Apply K-Means clustering (2 clusters for 2 archetypes)
kmeans = KMeans(n_clusters=2, random_state=42)
am['Cluster'] = kmeans.fit_predict(X)

In [82]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=am_features)
print("Defensive Midfielders Cluster Centers:")
print(cluster_centers_df)

Defensive Midfielders Cluster Centers:
           KP         PPA        npxG         Sh   Short Cmp%        PrgP  \
0  398.583333   85.716667    4.066667  10.583333  1242.916667  347.916667   
1   23.209836  204.697541  154.638525  62.833607    46.322131   49.524590   

         1/3C    Att Pen  
0  925.083333   5.333333  
1   36.270492  69.844262  


### AM Cluster Interpretation
Cluster 0: Advanced Playmaker  
Cluster 1: Attacking Midfielder  

In [83]:
# Manual Mapping (Example - update based on cluster center analysis)
cluster_mapping = {
    0: "Advanced Playmaker",
    1: "Attacking Midfielder"
}

In [84]:
# Apply the mapping to create the 'Ideal Archetype' column
am['Ideal Archetype'] = am['Cluster'].map(cluster_mapping)

In [85]:
am['Ideal Archetype'].value_counts()

Ideal Archetype
Attacking Midfielder    122
Advanced Playmaker       12
Name: count, dtype: int64

### Wide Midfielder Archetypes
Wide Midfielder: Higher number of crosses (Crs) and swithces (Sw), high total distance (Tot Dist)  
Wide Playmaker: High number of progressive passes (PrgP), completed crosses into the penalty area (CrsPA), and progressive distance (PrgDist)  

In [86]:
wide

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,PrgP,Short Cmp,Short Att,Short Cmp%,Med. Cmp,Med. Att,Med. Cmp%,Long Cmp,Long Att,Long Cmp%
9,Felix Agu,LM,2023-2024,23,Germany,Werder Bremen,Bundesliga,24,1614,17.9,...,32,246.0,292.0,84.2,203,253,80.2,27,67,40.3
10,Naouirou Ahamada,LM,2023-2024,21,France,Crystal Palace,Premier League,20,349,3.9,...,25,76.0,89.0,85.4,59,69,85.5,10,14,71.4
12,Ilias Akhomach,RM,2023-2024,19,Morocco,Villarreal,La Liga,31,1511,16.8,...,45,239.0,275.0,86.9,101,133,75.9,14,28,50.0
21,Iván Alejo,RM,2023-2024,28,Spain,Cádiz,La Liga,30,1692,18.8,...,31,117.0,163.0,71.8,70,147,47.6,39,104,37.5
22,Carles Aleñá,RM,2023-2024,25,Spain,Getafe,La Liga,29,1020,11.3,...,23,144.0,169.0,85.2,114,138,82.6,38,59,64.4
23,Aboubacar Ali,LM,2023-2024,17,France,Strasbourg,Ligue 1,9,130,1.4,...,3,14.0,19.0,73.7,7,12,58.3,0,0,
25,Tadeo Allende,RM,2023-2024,24,Argentina,Celta Vigo,La Liga,10,336,3.7,...,7,39.0,52.0,75.0,12,22,54.5,5,8,62.5
29,Hugo Álvarez,LM,2023-2024,20,Spain,Celta Vigo,La Liga,12,762,8.5,...,34,241.0,263.0,91.6,117,134,87.3,8,27,29.6
51,Alex Baena,LM,2023-2024,22,Spain,Villarreal,La Liga,34,2579,28.7,...,165,394.0,460.0,85.7,351,467,75.2,129,244,52.9
57,Jonathan Bamba,LM,2023-2024,27,Ivory Coast,Celta Vigo,La Liga,27,1981,22.0,...,80,397.0,455.0,87.3,138,190,72.6,20,36,55.6


In [87]:
wide.columns

Index(['Player', 'Pos', 'Season', 'Age', 'Nation', 'Team', 'Comp', 'MP', 'Min',
       '90s', 'Starts', 'Subs', 'Live', 'Dead', 'Sw', 'Crs', 'TI', 'CK', 'xG',
       'npxG', 'xA', 'G-xG', 'Sh', 'SoT', 'Sh Dist', 'FK', 'Carries',
       'TotDist', 'PrgDist', 'PrgC', '1/3C', 'CPA', 'Mis', 'Dis', 'Touches',
       'Def Pen', 'Att Pen', 'Tkl', 'TklW', 'Def 3rd', 'Mid 3rd', 'Att 3rd',
       'Tkl.1', 'Att', 'Tkl%', 'Lost', 'Blocks', 'Sh.1', 'Pass', 'Int',
       'Tkl+Int', 'Clr', 'Err', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt',
       'PKm', 'Cmp', 'Att.1', 'Cmp%', 'KP', '1/3', 'PPA', 'CrsPA', 'PrgP',
       'Short Cmp', 'Short Att', 'Short Cmp%', 'Med. Cmp', 'Med. Att',
       'Med. Cmp%', 'Long Cmp', 'Long Att', 'Long Cmp%'],
      dtype='object')

In [88]:
# Step 2: Select relevant features for clustering
wide_features = [
    'Crs', 'Sw', 'TotDist', # Attacking Midfielder
    'CrsPA', 'PrgP', 'PrgDist' # Advanced Playmaker
]

In [89]:
wide = wide.fillna(0)

In [90]:
# Step 3: Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(wide[wide_features])

In [91]:
# Step 4: Apply K-Means clustering (2 clusters for 2 archetypes)
kmeans = KMeans(n_clusters=2, random_state=42)
wide['Cluster'] = kmeans.fit_predict(X)

In [92]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=wide_features)
print("Wide Midfielders Cluster Centers:")
print(cluster_centers_df)

Wide Midfielders Cluster Centers:
         Crs         Sw     TotDist       CrsPA        PrgP     PrgDist
0  24.417647   4.366667  744.998627   82.764706   38.529412  361.347647
1  27.111111  52.555556    3.777778  102.000000  126.111111   41.333333


### Wide Midfielder Cluster Interpretation
Cluster 0: Wide Playmaker  
Cluster 1: Wide Midfielder  

In [93]:
# Manual Mapping (Example - update based on cluster center analysis)
cluster_mapping = {
    0: "Wide Playmaker",
    1: "Wide Midfielder"
}

In [94]:
# Apply the mapping to create the 'Ideal Archetype' column
wide['Ideal Archetype'] = wide['Cluster'].map(cluster_mapping)

In [95]:
wide['Ideal Archetype'].value_counts()

Ideal Archetype
Wide Playmaker     51
Wide Midfielder     9
Name: count, dtype: int64

#### Group and Export Midfielder Classification

In [96]:
# Combine the data back into a single midfielders DataFrame
mid_df = pd.concat([cm, dm, am, wide])

In [97]:
mid_df

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,Short Att,Short Cmp%,Med. Cmp,Med. Att,Med. Cmp%,Long Cmp,Long Att,Long Cmp%,Cluster,Ideal Archetype
0,Brenden Aaronson,CM,2023-2024,22,United States,Union Berlin,Bundesliga,30,1267,14.1,...,240.0,85.8,105,130,80.8,19,32,59.4,2,Box-to-Box
1,Paxten Aaronson,CM,2023-2024,19,United States,Eint Frankfurt,Bundesliga,7,101,1.1,...,25.0,80.0,20,22,90.9,0,2,0.0,3,Classic CM
2,Salis Abdul Samed,CM,2023-2024,23,Ghana,Lens,Ligue 1,27,1519,16.9,...,433.0,90.8,330,360,91.7,41,54,75.9,2,Box-to-Box
3,Laurent Abergel,CM,2023-2024,30,France,Lorient,Ligue 1,33,2860,31.8,...,707.0,89.0,711,802,88.7,150,233,64.4,2,Box-to-Box
4,Tyler Adams,CM,2023-2024,24,United States,Bournemouth,Premier League,3,121,1.3,...,39.0,89.7,18,24,75.0,6,8,75.0,3,Classic CM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
658,Alberto Soro,LM,2023-2024,24,Spain,Granada,La Liga,1,8,0.1,...,0.0,0.0,0,0,0.0,0,0,0.0,0,Wide Playmaker
674,Suso,RM,2023-2024,29,Spain,Sevilla,La Liga,29,1517,16.9,...,29.0,1.0,4,5,1.0,0,0,0.0,0,Wide Playmaker
698,Bertrand Traoré,RM,2023-2024,27,Burkina Faso,Villarreal,La Liga,11,568,6.3,...,3.0,1.0,0,1,1.0,0,0,0.0,0,Wide Playmaker
701,Viktor Tsyhankov,RM,2023-2024,25,Ukraine,Girona,La Liga,30,2052,22.8,...,25.0,8.0,7,15,8.0,0,0,0.0,0,Wide Playmaker


In [98]:
# Select relevant columns
mid_columns_to_include = [
    'Player', 'Pos', 'Season', 'Age', 'Team', 'Comp', 'Ideal Archetype'
]

In [99]:
grouped_mid__df = mid_df[mid_columns_to_include]

In [100]:
grouped_mid__df.to_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/MidfielderArchetypes.csv')

## Attackers

In [101]:
att_df = pd.read_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Positional Stats/Attacking/Sorted Data/ForwardsSortedData.csv')

In [102]:
att_df

Unnamed: 0.1,Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,...,Live.1,Dead,FK.1,Sw,Crs,TI,CK,Head Won,Head Lost,Head Won%
0,0,Keyliane Abdallah,RW,2023-2024,17.0,France,Marseille,Ligue 1,1,4,...,1,0,0,0,0,0,0,0,1,0.0
1,1,Matthis Abline,ST,2023-2024,20.0,France,2 Teams,Ligue 1,22,1044,...,218,12,0,1,7,1,0,15,27,35.7
2,2,Zakaria Aboukhlal,RW,2023-2024,23.0,Morocco,Toulouse,Ligue 1,13,754,...,185,11,1,0,5,9,0,6,7,46.2
3,3,Tammy Abraham,ST,2023-2024,25.0,England,Roma,Serie A,8,242,...,37,2,0,0,2,0,0,6,5,54.5
4,4,Bénie Adama Traore,LW,2023-2024,20.0,Ivory Coast,2 Teams,2 Comps,22,891,...,194,23,3,0,23,2,6,3,29,9.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
659,659,Anass Zaroury,LW,2023-2024,22.0,Morocco,Burnley,Premier League,6,152,...,0,0,0,0,0,0,65,0,0,0.0
660,660,Edon Zhegrova,RW,2023-2024,24.0,Kosovo,Lille,Ligue 1,33,2288,...,6,6,12,0,0,0,905,6,6,12.0
661,661,Joshua Zirkzee,ST,2023-2024,22.0,Netherlands,Bologna,Serie A,34,2759,...,11,4,15,2,2,0,850,11,4,15.0
662,662,Simon Zoller,ST,2023-2024,32.0,Germany,Bochum,Bundesliga,1,45,...,0,0,0,0,0,0,6,0,0,0.0


In [103]:
att_df.columns

Index(['Unnamed: 0', 'Player', 'Pos', 'Season', 'Age', 'Nation', 'Team',
       'Comp', 'MP', 'Min', '90s', 'Starts', 'Subs', 'Cmp', 'Att', 'Cmp%',
       'KP', 'PPA', 'CrsPA', 'PrgP', 'Short Cmp', 'Short Att', 'Short Cmp%',
       'Med. Cmp', 'Med. Att', 'Med. Cmp%', 'Long Cmp', 'Long Att',
       'Long Cmp%', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt', 'PKm',
       'Touches', 'Def 3rd Touch', 'Mid 3rd Touch', 'Att 3rd Touch', 'Live',
       'Carries', 'TotDist', 'PrgDist', 'PrgC', '1/3C', 'CPA', 'Mis', 'Dis',
       'Sh', 'G/Sh', 'G/SoT', 'SoT', 'SoT%', 'Sh Dist', 'FK', 'xG', 'npxG',
       'xA', 'G-xG', 'Live.1', 'Dead', 'FK.1', 'Sw', 'Crs', 'TI', 'CK',
       'Head Won', 'Head Lost', 'Head Won%'],
      dtype='object')

In [104]:
# Drop the first column by name
att_df = att_df.drop('Unnamed: 0', axis=1)

### Separate by Position
Wingers (LW/RW), Strikers (ST), Second Strikers (SS)

In [105]:
wingers = att_df[att_df['Pos'].isin(['LW', 'RW'])]
strikers = att_df[att_df['Pos'] == 'ST']
ss = att_df[att_df['Pos'] == 'SS']

### Winger Archetypes
Classic Winger: High progressive carries (PrgC) more crosses into penatly area (CrsPA), higher number of corner kicks (CK), more shots (Sh & SoT)  
Inverted Winger: More touches in the middle 3rd (Mid 3rd Touch), more short passes (Short Cmp), more sacrificial runs made to create opportunities   
Inside Forward: More key passes and passes into the penalty box (KP & PPA), Take more long shots (Higher Sh Dist)  

In [106]:
# Step 2: Select relevant features for clustering
winger_features = [
    'CrsPA', 'PrgC', 'SoT', 'CK', 'Ast', # Classic Winger
    'Mid 3rd Touch', 'Short Cmp', # Inverted Winger
    'KP', 'PPA', 'Sh', 'Sh Dist', 'npxG' # Inside forward
]

In [107]:
wingers = wingers.fillna(0)

In [108]:
# Step 3: Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(wingers[winger_features])

In [109]:
# Step 4: Apply K-Means clustering (6 clusters for 6 archetypes)
kmeans = KMeans(n_clusters=3, random_state=42)
wingers['Cluster'] = kmeans.fit_predict(X)

In [110]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=winger_features)
print("Wingers Cluster Centers:")
print(cluster_centers_df)

Wingers Cluster Centers:
      CrsPA       PrgC          SoT          CK          Ast  Mid 3rd Touch  \
0  1.145226  11.314070   559.939698   97.432161   307.981407      26.056784   
1  7.113208  19.854717  3533.792453  533.452830  2234.294340      56.630189   
2  6.225806  65.419355    14.354839   21.741935     3.322581     341.193548   

    Short Cmp         KP        PPA         Sh     Sh Dist        npxG  
0   57.158794   7.172864   8.387940  10.025126  113.653769   23.431156  
1  282.907547  36.849057  33.075472  30.690566  159.094340  125.113208  
2  254.193548  28.935484  22.774194  39.967742   18.209677    3.716129  


#### Winger Cluster Interpretation
Cluster 0: Classic Winger  
Cluster 1: Inside Forward   
Cluster 2: Inverted Winger

In [111]:
# Manual Mapping (Example - update based on cluster center analysis)
cluster_mapping = {
    0: 'Classic Winger',
    1: 'Inside Forward',
    2: 'Inverted Winger',
}

In [112]:
# Apply the mapping to create the 'Ideal Archetype' column
wingers['Ideal Archetype'] = wingers['Cluster'].map(cluster_mapping)

In [113]:
wingers['Ideal Archetype'].value_counts()

Ideal Archetype
Classic Winger     199
Inside Forward      53
Inverted Winger     31
Name: count, dtype: int64

In [114]:
wingers

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,FK.1,Sw,Crs,TI,CK,Head Won,Head Lost,Head Won%,Cluster,Ideal Archetype
0,Keyliane Abdallah,RW,2023-2024,17.0,France,Marseille,Ligue 1,1,4,0.0,...,0,0,0,0,0,0,1,0.0,0,Classic Winger
2,Zakaria Aboukhlal,RW,2023-2024,23.0,Morocco,Toulouse,Ligue 1,13,754,8.4,...,1,0,5,9,0,6,7,46.2,0,Classic Winger
4,Bénie Adama Traore,LW,2023-2024,20.0,Ivory Coast,2 Teams,2 Comps,22,891,9.9,...,3,0,23,2,6,3,29,9.4,0,Classic Winger
9,Karim Adeyemi,LW,2023-2024,21.0,Germany,Dortmund,Bundesliga,21,913,10.1,...,3,1,25,11,4,14,9,60.9,0,Classic Winger
10,Simon Adingra,RW,2023-2024,21.0,Ivory Coast,Brighton,Premier League,31,2222,24.7,...,3,2,76,15,5,8,12,40.0,2,Inverted Winger
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
651,Lamine Yamal,RW,2023-2024,16.0,Spain,Barcelona,La Liga,37,2201,24.5,...,10,0,0,0,1091,5,5,10.0,1,Inside Forward
655,Kenan Yıldız,LW,2023-2024,18.0,Turkey,Juventus,Serie A,27,952,10.6,...,2,0,0,0,338,2,0,2.0,0,Classic Winger
656,Mattia Zaccagni,LW,2023-2024,28.0,Italy,Lazio,Serie A,28,1956,21.7,...,7,0,0,0,823,6,1,7.0,1,Inside Forward
659,Anass Zaroury,LW,2023-2024,22.0,Morocco,Burnley,Premier League,6,152,1.7,...,0,0,0,0,65,0,0,0.0,0,Classic Winger


### Striker Archetypes
Target Man: Headers won (Head Won, Head Won%), more key passes (KP), expected goals (xG)
Poacher: Breaks defensive line and scores goals. Higher number of non-penalty expected goals (npxG) and shots on target (SoT).  
Complete Forward: High shots (Sh) and shots on target (SoT), high # of att 3rd touches (Att 3rd Touch), solid key passing and passes into penalty area (KP, CrsPA, PPA) 

In [115]:
strikers

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,Live.1,Dead,FK.1,Sw,Crs,TI,CK,Head Won,Head Lost,Head Won%
1,Matthis Abline,ST,2023-2024,20.0,France,2 Teams,Ligue 1,22,1044,11.6,...,218,12,0,1,7,1,0,15,27,35.7
3,Tammy Abraham,ST,2023-2024,25.0,England,Roma,Serie A,8,242,2.7,...,37,2,0,0,2,0,0,6,5,54.5
5,Akor Adams,ST,2023-2024,23.0,Nigeria,Montpellier,Ligue 1,32,2252,25.0,...,375,10,0,1,12,1,0,54,69,43.9
6,Junior Adamu,ST,2023-2024,22.0,Austria,Freiburg,Bundesliga,15,105,1.2,...,29,1,0,0,1,0,0,5,6,45.5
7,Sargis Adamyan,ST,2023-2024,30.0,Armenia,Köln,Bundesliga,20,801,8.9,...,166,22,1,2,4,2,0,14,37,27.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
653,Kelvin Yeboah,ST,2023-2024,23.0,Italy,Montpellier,Ligue 1,13,276,3.1,...,0,0,0,0,0,0,77,0,0,0.0
654,Bertuğ Yıldırım,ST,2023-2024,21.0,Turkey,Rennes,Ligue 1,21,469,5.2,...,0,1,1,0,0,0,99,0,1,1.0
658,Duván Zapata,ST,2023-2024,32.0,Colombia,2 Teams,Serie A,37,2992,33.2,...,13,4,17,0,0,0,754,13,4,17.0
661,Joshua Zirkzee,ST,2023-2024,22.0,Netherlands,Bologna,Serie A,34,2759,30.7,...,11,4,15,2,2,0,850,11,4,15.0


In [116]:
strikers.columns

Index(['Player', 'Pos', 'Season', 'Age', 'Nation', 'Team', 'Comp', 'MP', 'Min',
       '90s', 'Starts', 'Subs', 'Cmp', 'Att', 'Cmp%', 'KP', 'PPA', 'CrsPA',
       'PrgP', 'Short Cmp', 'Short Att', 'Short Cmp%', 'Med. Cmp', 'Med. Att',
       'Med. Cmp%', 'Long Cmp', 'Long Att', 'Long Cmp%', 'Gls', 'Ast', 'G+A',
       'G-PK', 'PK', 'PKatt', 'PKm', 'Touches', 'Def 3rd Touch',
       'Mid 3rd Touch', 'Att 3rd Touch', 'Live', 'Carries', 'TotDist',
       'PrgDist', 'PrgC', '1/3C', 'CPA', 'Mis', 'Dis', 'Sh', 'G/Sh', 'G/SoT',
       'SoT', 'SoT%', 'Sh Dist', 'FK', 'xG', 'npxG', 'xA', 'G-xG', 'Live.1',
       'Dead', 'FK.1', 'Sw', 'Crs', 'TI', 'CK', 'Head Won', 'Head Lost',
       'Head Won%'],
      dtype='object')

In [117]:
# Step 2: Select relevant features for clustering
striker_features = [
    'Head Won', 'Head Won%', 'KP', 'xG', # Target Man
    'SoT', 'npxG', # Poacher
    'Sh', 'Att 3rd Touch', 'PPA', 'CrsPA' # Complete Forward
]

In [118]:
strikers = strikers.fillna(0)

In [119]:
# Step 3: Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(strikers[striker_features])

In [120]:
# Step 4: Apply K-Means clustering (6 clusters for 6 archetypes)
kmeans = KMeans(n_clusters=3, random_state=42)
strikers['Cluster'] = kmeans.fit_predict(X)

In [121]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=striker_features)
print("Striker Cluster Centers:")
print(cluster_centers_df)

Striker Cluster Centers:
    Head Won  Head Won%         KP          xG          SoT        npxG  \
0  11.303167  24.220362   5.141629   80.268778   176.276018   10.403620   
1  10.206897  14.000000  25.655172   78.437931  5380.241379  109.206897   
2  33.775281  29.773034  23.697753  387.284270   299.988764   56.001124   

          Sh  Att 3rd Touch        PPA     CrsPA  
0   9.536652      52.668371   4.455656  0.227602  
1  18.034483     807.965517  14.882759  0.965517  
2  33.168539      76.075169  15.237079  2.331461  


### Striker Cluster Interpretations
Cluster 0: Complete Forward  
Cluster 1: Poacher  
Cluster 2: Target Man  

In [122]:
# Manual Mapping (Example - update based on cluster center analysis)
cluster_mapping = {
    0: 'Complete Forward',
    1: 'Poacher',
    2: 'Target Man',
}

In [123]:
# Apply the mapping to create the 'Ideal Archetype' column
strikers['Ideal Archetype'] = strikers['Cluster'].map(cluster_mapping)

In [124]:
strikers['Ideal Archetype'].value_counts()

Ideal Archetype
Complete Forward    221
Target Man           89
Poacher              29
Name: count, dtype: int64

In [125]:
strikers

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,FK.1,Sw,Crs,TI,CK,Head Won,Head Lost,Head Won%,Cluster,Ideal Archetype
1,Matthis Abline,ST,2023-2024,20.0,France,2 Teams,Ligue 1,22,1044,11.6,...,0,1,7,1,0,15,27,35.7,0,Complete Forward
3,Tammy Abraham,ST,2023-2024,25.0,England,Roma,Serie A,8,242,2.7,...,0,0,2,0,0,6,5,54.5,0,Complete Forward
5,Akor Adams,ST,2023-2024,23.0,Nigeria,Montpellier,Ligue 1,32,2252,25.0,...,0,1,12,1,0,54,69,43.9,2,Target Man
6,Junior Adamu,ST,2023-2024,22.0,Austria,Freiburg,Bundesliga,15,105,1.2,...,0,0,1,0,0,5,6,45.5,0,Complete Forward
7,Sargis Adamyan,ST,2023-2024,30.0,Armenia,Köln,Bundesliga,20,801,8.9,...,1,2,4,2,0,14,37,27.5,0,Complete Forward
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
653,Kelvin Yeboah,ST,2023-2024,23.0,Italy,Montpellier,Ligue 1,13,276,3.1,...,0,0,0,0,77,0,0,0.0,0,Complete Forward
654,Bertuğ Yıldırım,ST,2023-2024,21.0,Turkey,Rennes,Ligue 1,21,469,5.2,...,1,0,0,0,99,0,1,1.0,0,Complete Forward
658,Duván Zapata,ST,2023-2024,32.0,Colombia,2 Teams,Serie A,37,2992,33.2,...,17,0,0,0,754,13,4,17.0,1,Poacher
661,Joshua Zirkzee,ST,2023-2024,22.0,Netherlands,Bologna,Serie A,34,2759,30.7,...,15,2,2,0,850,11,4,15.0,1,Poacher


### Shadow Striker Archetypes
Second Striker: High number of shots on target (SoT), high number of headers won (Head Won, Head Won%), high non-penatly expected goals (npxG)  
False Nine: High number of touches in attacking and middle  3rd (Att 3rd Touch, Mid 3rd Touch), more live ball passes (Live), higher key passes (KP)

In [126]:
# Step 2: Select relevant features for clustering
ss_features = [
    'Head Won', 'Head Won%', 'SoT', 'npxG', # Second Striker
    'Att 3rd Touch', 'Mid 3rd Touch', 'Live', 'KP', # False Nine
]

In [127]:
ss = ss.fillna(0)

In [128]:
# Step 3: Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(ss[ss_features])

In [129]:
# Step 4: Apply K-Means clustering (6 clusters for 6 archetypes)
kmeans = KMeans(n_clusters=2, random_state=42)
ss['Cluster'] = kmeans.fit_predict(X)

In [130]:
# Step 5: Analyze Cluster Centers and Map to Archetypes
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=ss_features)
print("Shadow Striker Cluster Centers:")
print(cluster_centers_df)

Shadow Striker Cluster Centers:
    Head Won  Head Won%          SoT       npxG  Att 3rd Touch  Mid 3rd Touch  \
0   3.928571   5.571429  2854.000000  87.657143     342.650000      35.535714   
1  12.521739  39.752174     4.478261  15.108696     109.556522     102.817391   

          Live         KP  
0    14.492857  13.878571  
1  1389.695652  15.400000  


### Shadow Striker Cluster Interpretations
Cluster 0: 
Cluster 1: 

In [131]:
# Manual Mapping (Example - update based on cluster center analysis)
cluster_mapping = {
    0: 'Second Striker',
    1: 'False Nine',
}

In [132]:
# Apply the mapping to create the 'Ideal Archetype' column
ss['Ideal Archetype'] = ss['Cluster'].map(cluster_mapping)

In [133]:
ss['Ideal Archetype'].value_counts()

Ideal Archetype
False Nine        23
Second Striker    14
Name: count, dtype: int64

In [134]:
ss

Unnamed: 0,Player,Pos,Season,Age,Nation,Team,Comp,MP,Min,90s,...,FK.1,Sw,Crs,TI,CK,Head Won,Head Lost,Head Won%,Cluster,Ideal Archetype
21,Selim Amallah,SS,2023-2024,26.0,Morocco,Valencia,La Liga,20,604,6.7,...,0,0,4,0,0,11,15,42.3,1,False Nine
71,Lucas Beltrán,SS,2023-2024,22.0,Argentina,Fiorentina,Serie A,32,1692,18.8,...,2,3,7,4,0,22,58,27.5,1,False Nine
92,Badredine Bouanani,SS,2023-2024,18.0,Algeria,Nice,Ligue 1,7,161,1.8,...,0,1,8,0,0,0,1,0.0,0,Second Striker
110,Maxime Busi,SS,2023-2024,23.0,Belgium,Reims,Ligue 1,5,162,1.8,...,0,1,4,22,0,2,1,66.7,1,False Nine
111,Rémy Cabella,SS,2023-2024,33.0,France,Lille,Ligue 1,30,1477,16.4,...,31,1,51,22,36,7,17,29.2,1,False Nine
114,Nicolò Cambiaghi,SS,2023-2024,22.0,Italy,Empoli,Serie A,37,2553,28.4,...,1,1,131,5,24,9,25,26.5,1,False Nine
116,Matteo Cancellieri,SS,2023-2024,21.0,Italy,Empoli,Serie A,36,1788,19.9,...,2,1,42,4,2,32,67,32.3,1,False Nine
131,Fares Chaïbi,SS,2023-2024,20.0,Algeria,2 Teams,2 Comps,30,1970,21.9,...,41,3,183,28,96,11,56,16.4,1,False Nine
142,Ángel Correa,SS,2023-2024,28.0,Argentina,Atlético Madrid,La Liga,32,1528,17.0,...,3,2,40,0,21,5,13,27.8,1,False Nine
157,Charles De Ketelaere,SS,2023-2024,22.0,Belgium,Atalanta,Serie A,35,2026,22.5,...,7,2,47,8,2,38,35,52.1,1,False Nine


### Group and Export Attacker Classification

In [135]:
# Combine the data back into a single attackres DataFrame
att_df = pd.concat([wingers, strikers, ss])

In [136]:
# Select relevant columns
att_columns_to_include = [
    'Player', 'Pos', 'Season', 'Age', 'Team', 'Comp', 'Ideal Archetype'
]

In [137]:
grouped_att__df = att_df[att_columns_to_include]

In [138]:
grouped_att__df.to_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/ForwardArchetypes.csv')

## Combine Archetype Classifications into one file

In [139]:
GKArchetypes = pd.read_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/GKArchetypes.csv')
DefArchetypes = pd.read_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/DefenderArchetypes.csv')
MidArchetypes = pd.read_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/MidfielderArchetypes.csv')
ForwardArchetypes = pd.read_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/Classified Archetypes/ForwardArchetypes.csv')

In [140]:
combined_archetypes = pd.concat([GKArchetypes, DefArchetypes, MidArchetypes, ForwardArchetypes])

In [141]:
combined_archetypes.to_csv(r'/Users/mukikrishnan/Desktop/Interactive Soccer Dashboard/Archetype Classification/CombinedArchetypes.csv')