# DS209 Final Project: Exploratory Visualization
## Football Player Scout Tool

**Team Member:** Karim Mattar

**Dataset:** Football Players Stats 2024-2025 (FBref via Kaggle)

---

## Step 1: Data Inspection (Before Visualizing)

Before creating any visualizations, we inspected the dataset and found:
- **2,854 players** from the Big 5 European leagues
- **165 columns** of statistics covering attacking, defending, passing, and possession
- Key metrics include: Goals, Assists, xG, xAG, Progressive Carries, Tackles, etc.

### Leagues represented:
- Serie A (634 players)
- La Liga (601 players)
- Premier League (574 players)
- Ligue 1 (553 players)
- Bundesliga (492 players)
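
The inspection summarized above can be reproduced with a few pandas calls. The sketch below is a minimal illustration on a hypothetical toy table (the rows are made up; only the `Comp` and `Min` column names follow the dataset), not the exact commands used for this project.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real CSV (values are invented).
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "Comp": ["Serie A", "La Liga", "Serie A", "Premier League", "La Liga"],
    "Min": [900, 450, 120, 2000, 30],
})

n_players, n_columns = toy.shape                  # rows = players, columns = statistics
players_per_league = toy["Comp"].value_counts()   # number of players in each league
missing_share = toy.isna().mean()                 # fraction of missing values per column

print(f"{n_players} players, {n_columns} columns")
print(players_per_league)
print(missing_share)
```

On the real data, the same three checks would be run on `df` right after loading it.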

---

## Three Hypotheses (Written Before Visualizing)

### Hypothesis 1: Progressive Ball Carriers Create More Goal Contributions
> "Young players (under 23) with high progressive carries per 90 minutes will also have high goal contributions (xG + xAG) per 90, indicating that ball-carrying ability translates directly to attacking output."

### Hypothesis 2: League Playing Styles Differ Significantly
> "There are measurable differences in playing styles across leagues - Premier League players will have higher defensive action rates (tackles per 90), while La Liga players will have higher pass completion percentages, reflecting the 'physicality vs. technical' stereotype."

### Hypothesis 3: Clinical Finishers Are Also Elite Playmakers
> "Players who outperform their expected goals (Goals - xG > 0) will also tend to outperform their expected assists (Assists - xAG > 0), suggesting that elite finishing ability is correlated with elite playmaking vision."

---

In [None]:
# Setup and Data Loading
import pandas as pd
import numpy as np
import altair as alt
import warnings
warnings.filterwarnings('ignore')

# Enable Altair to handle larger datasets
alt.data_transformers.disable_max_rows()

# Load data
df = pd.read_csv('../data/players_data_light-2024_2025.csv')
print(f"Loaded {len(df)} players with {len(df.columns)} columns")

# Clean up: Filter to players with meaningful playing time (at least 450 mins = 5 full games)
df_filtered = df[df['Min'] >= 450].copy()
print(f"After filtering for 450+ minutes: {len(df_filtered)} players")

# Create per-90 metrics
df_filtered['Gls_per90'] = df_filtered['Gls'] / df_filtered['90s']
df_filtered['Ast_per90'] = df_filtered['Ast'] / df_filtered['90s']
df_filtered['xG_per90'] = df_filtered['xG'] / df_filtered['90s']
df_filtered['xAG_per90'] = df_filtered['xAG'] / df_filtered['90s']
df_filtered['G+A_per90'] = df_filtered['G+A'] / df_filtered['90s']
df_filtered['xGxAG_per90'] = (df_filtered['xG'] + df_filtered['xAG']) / df_filtered['90s']
df_filtered['PrgC_per90'] = df_filtered['PrgC'] / df_filtered['90s']
df_filtered['Tkl_per90'] = df_filtered['Tkl'] / df_filtered['90s']

# Performance vs expectation
df_filtered['Goals_minus_xG'] = df_filtered['Gls'] - df_filtered['xG']
df_filtered['Ast_minus_xAG'] = df_filtered['Ast'] - df_filtered['xAG']

# Simplified position
def simplify_position(pos):
    if pd.isna(pos):
        return 'Unknown'
    if 'GK' in pos:
        return 'GK'
    elif 'FW' in pos:
        return 'FW'
    elif 'MF' in pos:
        return 'MF'
    elif 'DF' in pos:
        return 'DF'
    return 'Unknown'

df_filtered['Position'] = df_filtered['Pos'].apply(simplify_position)

# Age groups (left-closed bins so each label matches the ages it contains)
df_filtered['Age_Group'] = pd.cut(df_filtered['Age'],
                                   bins=[0, 21, 24, 28, 33, 50],
                                   right=False,
                                   labels=['U21', '21-23', '24-27', '28-32', '33+'])

df_filtered.head()
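
`pd.cut` closes each interval on the right by default, so it is worth checking where boundary ages land before trusting the age-group labels. The sketch below compares the two conventions on a handful of toy ages (invented for illustration, not taken from the dataset).

```python
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 24])  # toy ages for illustration only

# Default right=True: intervals are (0, 21], (21, 23], (23, 27]
right_closed = pd.cut(ages, bins=[0, 21, 23, 27],
                      labels=["U21", "21-23", "24-27"])

# right=False: intervals are [0, 21), [21, 24), [24, 28)
left_closed = pd.cut(ages, bins=[0, 21, 24, 28],
                     labels=["U21", "21-23", "24-27"], right=False)

comparison = pd.DataFrame({
    "Age": ages,
    "right_closed": right_closed.astype(str),
    "left_closed": left_closed.astype(str),
})
print(comparison)
```

With right-closed bins a 21-year-old is labeled `U21`; with left-closed bins the same player is labeled `21-23`.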

---

# Hypothesis 1: Progressive Ball Carriers Create More Goal Contributions

> "Young players (under 23) with high progressive carries per 90 minutes will also have high goal contributions (xG + xAG) per 90."

## Iteration 1: Simple Scatter Plot

In [None]:
# Hypothesis 1, Iteration 1: Basic scatter plot
young_players = df_filtered[df_filtered['Age'] < 23].copy()
print(f"Young players (<23): {len(young_players)}")

chart1_v1 = alt.Chart(young_players).mark_circle().encode(
    x='PrgC_per90:Q',
    y='xGxAG_per90:Q'
).properties(
    title='Iteration 1: Basic Scatter - Progressive Carries vs xG+xAG (U23)',
    width=500,
    height=400
)

chart1_v1

**Observation:** The basic scatter shows a general positive trend, but we can't distinguish positions or identify specific players. Let's improve.

## Iteration 2: Add Position Color and Tooltips

In [None]:
# Hypothesis 1, Iteration 2: Add color by position and tooltips
chart1_v2 = alt.Chart(young_players).mark_circle(size=80, opacity=0.7).encode(
    x=alt.X('PrgC_per90:Q', title='Progressive Carries per 90'),
    y=alt.Y('xGxAG_per90:Q', title='xG + xAG per 90'),
    color=alt.Color('Position:N', scale=alt.Scale(scheme='category10')),
    tooltip=['Player', 'Squad', 'Comp', 'Age', 'Position', 'PrgC_per90', 'xGxAG_per90']
).properties(
    title='Iteration 2: Colored by Position with Tooltips (U23)',
    width=500,
    height=400
)

chart1_v2

**Observation:** Now we can see that forwards (FW) cluster in the upper right (high carries, high xG+xAG), while defenders show lower values for both. But the relationship isn't clear for midfielders. Let's add a regression line.

## Iteration 3: Add Regression Line and Filter to Attackers

In [None]:
# Hypothesis 1, Iteration 3: Focus on attackers, add regression
young_attackers = young_players[young_players['Position'].isin(['FW', 'MF'])].copy()

points = alt.Chart(young_attackers).mark_circle(size=100, opacity=0.7).encode(
    x=alt.X('PrgC_per90:Q', title='Progressive Carries per 90'),
    y=alt.Y('xGxAG_per90:Q', title='xG + xAG per 90'),
    color=alt.Color('Comp:N', title='League'),
    tooltip=['Player', 'Squad', 'Comp', 'Age', 'PrgC_per90', 'xGxAG_per90', 'Gls', 'Ast']
)

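# Fit the regression on a chart without the color encoding, so a single red
# line is drawn for all players instead of inheriting the league colors
regression = alt.Chart(young_attackers).transform_regression(
    'PrgC_per90', 'xGxAG_per90'
).mark_line(color='red', strokeDash=[5,5]).encode(
    x='PrgC_per90:Q',
    y='xGxAG_per90:Q'
)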

chart1_v3 = (points + regression).properties(
    title='Iteration 3: Young Attackers with Regression Line',
    width=600,
    height=450
)

chart1_v3

## Iteration 4: Final - Interactive with Labels for Top Players

In [None]:
# Hypothesis 1, Iteration 4: Final polished version with top player labels

# Identify top performers (top 10 in xGxAG_per90)
young_attackers_sorted = young_attackers.nlargest(10, 'xGxAG_per90')

base = alt.Chart(young_attackers).mark_circle(size=100, opacity=0.6).encode(
    x=alt.X('PrgC_per90:Q', title='Progressive Carries per 90', scale=alt.Scale(zero=False)),
    y=alt.Y('xGxAG_per90:Q', title='Expected Goal Contributions (xG + xAG) per 90', scale=alt.Scale(zero=False)),
    color=alt.Color('Comp:N', title='League', scale=alt.Scale(scheme='tableau10')),
    tooltip=['Player:N', 'Squad:N', 'Comp:N', 'Age:Q', 
             alt.Tooltip('PrgC_per90:Q', format='.2f'),
             alt.Tooltip('xGxAG_per90:Q', format='.2f'),
             'Gls:Q', 'Ast:Q']
)

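# Build the regression layer separately so it does not inherit the color and
# tooltip encodings, whose fields are absent from the regression output
regression = alt.Chart(young_attackers).transform_regression(
    'PrgC_per90', 'xGxAG_per90'
).mark_line(color='#333', strokeWidth=2, strokeDash=[5,5]).encode(
    x='PrgC_per90:Q',
    y='xGxAG_per90:Q'
)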

labels = alt.Chart(young_attackers_sorted).mark_text(
    align='left', dx=7, fontSize=10
).encode(
    x='PrgC_per90:Q',
    y='xGxAG_per90:Q',
    text='Player:N'
)

chart1_final = (base + regression + labels).properties(
    title={
        'text': 'Hypothesis 1: Do Ball Carriers Create More Goals?',
        'subtitle': 'Young Players (Under 23) - Forwards & Midfielders'
    },
    width=700,
    height=500
).configure_title(
    fontSize=16,
    anchor='start'
)

chart1_final

### Hypothesis 1 Conclusion

**Evidence: SUPPORTS the hypothesis**

The regression line slopes upward, indicating a positive association between progressive carries per 90 and expected goal contributions per 90. Young players who carry the ball forward more often tend to create more goal-scoring opportunities (measured by xG + xAG).

**Key findings:**
- The relationship appears strongest among forwards
- Several top young talents (labeled) excel at both carrying and creating
- Premier League players tend to cluster at high carry numbers
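
A regression line shows the direction of a trend but not its strength. One way to back up the findings above would be to compute the least-squares slope and Pearson correlation per position. The sketch below does this on toy values (invented for illustration); `trend_summary` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Position": ["FW", "FW", "FW", "FW", "MF", "MF", "MF", "MF"],
    "PrgC_per90": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "xGxAG_per90": [0.2, 0.4, 0.6, 0.8, 0.3, 0.2, 0.4, 0.3],
})

def trend_summary(group):
    """Return the least-squares slope and Pearson r for one group of players."""
    slope, _intercept = np.polyfit(group["PrgC_per90"], group["xGxAG_per90"], deg=1)
    r = group["PrgC_per90"].corr(group["xGxAG_per90"])
    return pd.Series({"slope": slope, "pearson_r": r, "n": len(group)})

# One row per position, with slope, correlation, and sample size
summary = pd.DataFrame(
    {pos: trend_summary(g) for pos, g in toy.groupby("Position")}
).T
print(summary.round(3))
```

Applied to `young_attackers`, a table like this would show whether the relationship really is stronger for forwards than for midfielders.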

---

# Hypothesis 2: League Playing Styles Differ

> "Premier League players will have higher tackle rates while La Liga players will have higher pass completion percentages."

## Iteration 1: Simple Bar Chart Comparison

In [None]:
# Hypothesis 2, Iteration 1: Basic bar chart of average tackles by league

league_stats = df_filtered.groupby('Comp').agg({
    'Tkl_per90': 'mean',
    'Cmp%': 'mean',
    'PrgC_per90': 'mean',
    'Player': 'count'
}).reset_index()
league_stats.columns = ['League', 'Avg_Tkl_per90', 'Avg_Cmp_Pct', 'Avg_PrgC_per90', 'Player_Count']

chart2_v1 = alt.Chart(league_stats).mark_bar().encode(
    x='League:N',
    y='Avg_Tkl_per90:Q'
).properties(
    title='Iteration 1: Average Tackles per 90 by League',
    width=400,
    height=300
)

chart2_v1

**Observation:** The basic bar chart shows some differences, but a single metric cannot test the hypothesis. Let's show both tackles AND pass completion side by side.

## Iteration 2: Side-by-Side Comparison

In [None]:
# Hypothesis 2, Iteration 2: Side by side bars for tackles and pass completion

# Reshape for grouped bar chart
league_stats_long = league_stats.melt(
    id_vars=['League'],
    value_vars=['Avg_Tkl_per90', 'Avg_Cmp_Pct'],
    var_name='Metric',
    value_name='Value'
)

chart2_v2 = alt.Chart(league_stats_long).mark_bar().encode(
    x=alt.X('League:N', title=''),
    y=alt.Y('Value:Q', title='Value'),
    color='Metric:N',
    column='Metric:N'
).properties(
    title='Iteration 2: Tackles vs Pass Completion by League',
    width=200,
    height=300
)

chart2_v2

**Observation:** The scales are very different (tackles ~1.5, pass completion ~75%), so a shared axis makes comparison difficult. Let's give each metric its own axis and look at the full distributions instead of the means.
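
One way to put two metrics with very different scales on a common footing is to standardize each column into z-scores. The sketch below uses toy league averages (invented for illustration); in the notebook the input would be the `Avg_Tkl_per90` and `Avg_Cmp_Pct` columns of `league_stats`.

```python
import pandas as pd

# Toy league averages for illustration only.
toy_stats = pd.DataFrame({
    "League": ["A", "B", "C"],
    "Avg_Tkl_per90": [1.4, 1.5, 1.6],
    "Avg_Cmp_Pct": [74.0, 76.0, 78.0],
}).set_index("League")

# Standardize each column: subtract its mean, divide by its standard deviation.
z_scores = (toy_stats - toy_stats.mean()) / toy_stats.std(ddof=0)
print(z_scores.round(2))
```

With only five leagues, z-scores of league means exaggerate small absolute gaps, so they are better suited to ranking leagues than to judging how large the differences are.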

## Iteration 3: Box Plots Showing Distribution

In [None]:
# Hypothesis 2, Iteration 3: Box plots showing full distribution
# Filter to outfield players only (no GK)
outfield = df_filtered[df_filtered['Position'] != 'GK'].copy()

box_tackles = alt.Chart(outfield).mark_boxplot(extent='min-max').encode(
    x=alt.X('Comp:N', title=''),
    y=alt.Y('Tkl_per90:Q', title='Tackles per 90'),
    color=alt.Color('Comp:N', legend=None)
).properties(
    title='Tackles per 90 Distribution',
    width=400,
    height=300
)

box_passing = alt.Chart(outfield).mark_boxplot(extent='min-max').encode(
    x=alt.X('Comp:N', title=''),
    y=alt.Y('Cmp%:Q', title='Pass Completion %'),
    color=alt.Color('Comp:N', legend=None)
).properties(
    title='Pass Completion % Distribution',
    width=400,
    height=300
)

(box_tackles | box_passing).properties(
    title='Iteration 3: League Style Distributions'
)

## Iteration 4: Final - Scatter with League Means Highlighted

In [None]:
# Hypothesis 2, Iteration 4: Scatter showing all players with league centroids

# Individual players
scatter = alt.Chart(outfield).mark_circle(size=30, opacity=0.3).encode(
    x=alt.X('Tkl_per90:Q', title='Tackles per 90 (Physicality)', scale=alt.Scale(zero=False)),
    y=alt.Y('Cmp%:Q', title='Pass Completion % (Technical)', scale=alt.Scale(zero=False)),
    color=alt.Color('Comp:N', title='League'),
    tooltip=['Player', 'Squad', 'Comp', 'Position']
)

# League means
league_means = outfield.groupby('Comp').agg({
    'Tkl_per90': 'mean',
    'Cmp%': 'mean'
}).reset_index()

centroids = alt.Chart(league_means).mark_point(
    size=400, filled=True, stroke='black', strokeWidth=2
).encode(
    x='Tkl_per90:Q',
    y='Cmp%:Q',
    color=alt.Color('Comp:N', legend=None)
)

centroid_labels = alt.Chart(league_means).mark_text(
    dy=-15, fontSize=11, fontWeight='bold'
).encode(
    x='Tkl_per90:Q',
    y='Cmp%:Q',
    text='Comp:N'
)

chart2_final = (scatter + centroids + centroid_labels).properties(
    title={
        'text': 'Hypothesis 2: League Playing Styles',
        'subtitle': 'Physical (Tackles) vs Technical (Passing) - Large dots = League Averages'
    },
    width=700,
    height=500
).configure_title(
    fontSize=16,
    anchor='start'
)

chart2_final

### Hypothesis 2 Conclusion

**Evidence: PARTIALLY SUPPORTS the hypothesis**

The scatter plot with league centroids reveals:
- **La Liga** does show the highest average pass completion %, supporting the "technical" stereotype
- **Premier League** shows moderate tackle rates, not the highest of the five leagues
- **Bundesliga** actually shows a higher average tackles per 90 than the Premier League
- The differences between leagues are smaller than expected

**Key insight:** While stereotypes have some basis, the overlap between leagues is substantial. Individual player variation exceeds league-level differences.
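
The claim that individual variation exceeds league-level differences can be checked by asking what share of the total variance in a metric is explained by league means (often called eta squared). The sketch below is a minimal version on toy data (invented for illustration); `between_share` is a hypothetical helper name.

```python
import pandas as pd

# Toy tackle rates for two leagues; values are for illustration only.
toy = pd.DataFrame({
    "Comp": ["A", "A", "A", "B", "B", "B"],
    "Tkl_per90": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
})

def between_share(df, group_col, value_col):
    """Fraction of total variance explained by group means (eta squared)."""
    grand_mean = df[value_col].mean()
    group_means = df.groupby(group_col)[value_col].transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    ss_total = ((df[value_col] - grand_mean) ** 2).sum()
    return ss_between / ss_total

share = between_share(toy, "Comp", "Tkl_per90")
print(f"Share of variance between leagues: {share:.1%}")
```

A small share on the real `outfield` data, for either `Tkl_per90` or `Cmp%`, would support the key insight that most of the spread lies within leagues.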

---

# Hypothesis 3: Clinical Finishers Are Also Elite Playmakers

> "Players who outperform their xG will also outperform their xAG."

## Iteration 1: Simple Scatter of Over/Under Performance

In [None]:
# Hypothesis 3, Iteration 1: Basic scatter of (Goals - xG) vs (Assists - xAG)
attackers = df_filtered[df_filtered['Position'].isin(['FW', 'MF'])].copy()

chart3_v1 = alt.Chart(attackers).mark_circle().encode(
    x='Goals_minus_xG:Q',
    y='Ast_minus_xAG:Q'
).properties(
    title='Iteration 1: Goals-xG vs Assists-xAG',
    width=500,
    height=400
)

chart3_v1

**Observation:** It is hard to see a pattern or tell which quadrant a player falls in. Let's add reference lines at 0.

## Iteration 2: Add Quadrant Lines and Color

In [None]:
# Hypothesis 3, Iteration 2: Add reference lines at 0 and color by position

points = alt.Chart(attackers).mark_circle(size=60, opacity=0.6).encode(
    x=alt.X('Goals_minus_xG:Q', title='Goals - xG (Finishing Quality)'),
    y=alt.Y('Ast_minus_xAG:Q', title='Assists - xAG (Playmaking Quality)'),
    color='Position:N',
    tooltip=['Player', 'Squad', 'Gls', 'xG', 'Ast', 'xAG']
)

hline = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(strokeDash=[3,3], color='gray').encode(y='y:Q')
vline = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(strokeDash=[3,3], color='gray').encode(x='x:Q')

chart3_v2 = (points + hline + vline).properties(
    title='Iteration 2: Performance vs Expectation Quadrants',
    width=500,
    height=400
)

chart3_v2

**Observation:** Now we can see the four quadrants. Upper-right = overperforms both. Let's quantify and label the quadrants.

## Iteration 3: Quadrant Labels and Count

In [None]:
# Hypothesis 3, Iteration 3: Calculate quadrant counts

# Create quadrant labels
def get_quadrant(row):
    if row['Goals_minus_xG'] > 0 and row['Ast_minus_xAG'] > 0:
        return 'Elite (Both Overperform)'
    elif row['Goals_minus_xG'] > 0 and row['Ast_minus_xAG'] <= 0:
        return 'Clinical Finisher Only'
    elif row['Goals_minus_xG'] <= 0 and row['Ast_minus_xAG'] > 0:
        return 'Playmaker Only'
    else:
        return 'Underperforming'

attackers['Quadrant'] = attackers.apply(get_quadrant, axis=1)

print("Quadrant Distribution:")
print(attackers['Quadrant'].value_counts())

# Calculate correlation
corr = attackers['Goals_minus_xG'].corr(attackers['Ast_minus_xAG'])
print(f"\nCorrelation between Goals-xG and Ast-xAG: {corr:.3f}")
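
The row-wise `get_quadrant` function above is easy to read but calls Python once per player. A vectorized alternative with `np.select` is sketched below on toy values (invented for illustration). It assumes no missing values: unlike the row-wise version, a row with a missing `Goals_minus_xG` and a positive `Ast_minus_xAG` would be labeled `Playmaker Only` here.

```python
import numpy as np
import pandas as pd

# Toy over/under-performance values for illustration only.
toy = pd.DataFrame({
    "Goals_minus_xG": [2.0, 1.5, -0.5, -1.0, 0.0],
    "Ast_minus_xAG": [1.0, -0.3, 0.8, -2.0, 0.0],
})

over_goals = toy["Goals_minus_xG"] > 0
over_assists = toy["Ast_minus_xAG"] > 0

# Conditions are checked in order; rows matching none fall through to the default.
toy["Quadrant"] = np.select(
    [over_goals & over_assists, over_goals & ~over_assists, ~over_goals & over_assists],
    ["Elite (Both Overperform)", "Clinical Finisher Only", "Playmaker Only"],
    default="Underperforming",
)
print(toy)
```

As in the original function, a player exactly at zero on both axes is counted as `Underperforming`.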

In [None]:
# Visualize with quadrant colors
quadrant_colors = {
    'Elite (Both Overperform)': '#2ecc71',
    'Clinical Finisher Only': '#3498db',
    'Playmaker Only': '#9b59b6',
    'Underperforming': '#e74c3c'
}

points = alt.Chart(attackers).mark_circle(size=80, opacity=0.7).encode(
    x=alt.X('Goals_minus_xG:Q', title='Goals - xG'),
    y=alt.Y('Ast_minus_xAG:Q', title='Assists - xAG'),
    color=alt.Color('Quadrant:N', scale=alt.Scale(
        domain=list(quadrant_colors.keys()),
        range=list(quadrant_colors.values())
    )),
    tooltip=['Player', 'Squad', 'Comp', 'Gls', 'xG', 'Ast', 'xAG', 'Quadrant']
)

hline = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(strokeDash=[3,3], color='gray').encode(y='y:Q')
vline = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(strokeDash=[3,3], color='gray').encode(x='x:Q')

chart3_v3 = (points + hline + vline).properties(
    title='Iteration 3: Performance Quadrants (colored)',
    width=600,
    height=450
)

chart3_v3

## Iteration 4: Final - With Elite Players Labeled

In [None]:
# Hypothesis 3, Iteration 4: Final version with elite players labeled

# Get top performers in the elite quadrant
elite_players = attackers[attackers['Quadrant'] == 'Elite (Both Overperform)'].nlargest(
    10, 'G+A'
)

points = alt.Chart(attackers).mark_circle(size=80, opacity=0.6).encode(
    x=alt.X('Goals_minus_xG:Q', title='Goals - xG (Finishing Quality)', 
            scale=alt.Scale(domain=[-10, 15])),
    y=alt.Y('Ast_minus_xAG:Q', title='Assists - xAG (Playmaking Quality)',
            scale=alt.Scale(domain=[-8, 10])),
    color=alt.Color('Quadrant:N', title='Performance Type', scale=alt.Scale(
        domain=list(quadrant_colors.keys()),
        range=list(quadrant_colors.values())
    )),
    tooltip=['Player:N', 'Squad:N', 'Comp:N', 
             alt.Tooltip('Goals_minus_xG:Q', format='.1f', title='Goals - xG'),
             alt.Tooltip('Ast_minus_xAG:Q', format='.1f', title='Ast - xAG'),
             'Gls:Q', 'Ast:Q']
)

# Reference lines
hline = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(
    strokeDash=[5,5], color='#333', strokeWidth=1.5
).encode(y='y:Q')
vline = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(
    strokeDash=[5,5], color='#333', strokeWidth=1.5
).encode(x='x:Q')

# Labels for elite players
labels = alt.Chart(elite_players).mark_text(
    align='left', dx=7, dy=-5, fontSize=10
).encode(
    x='Goals_minus_xG:Q',
    y='Ast_minus_xAG:Q',
    text='Player:N'
)

chart3_final = (points + hline + vline + labels).properties(
    title={
        'text': 'Hypothesis 3: Are Clinical Finishers Also Elite Playmakers?',
        'subtitle': f'Correlation: {corr:.2f} | Green = Overperforms BOTH xG and xAG'
    },
    width=700,
    height=500
).configure_title(
    fontSize=16,
    anchor='start'
)

chart3_final

### Hypothesis 3 Conclusion

**Evidence: DOES NOT SUPPORT the hypothesis**

The correlation between Goals-xG and Assists-xAG is very weak (~0.07). This means:
- Players who overperform their xG are NOT more likely to overperform their xAG
- Finishing ability and playmaking ability appear to be independent skills
- The green "Elite" quadrant exists, but players are distributed across all four quadrants fairly evenly

**Surprising insight:** Clinical finishing and creative playmaking are separate talents, not a single "elite attacking" skill.
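
A complementary check on independence is to compare the observed quadrant shares with the shares expected if finishing and playmaking over-performance were unrelated, which is the product of the marginal shares. The sketch below uses a deliberately dependent toy sample (invented for illustration) to show what a departure looks like; the `finishing` and `playmaking` labels are hypothetical.

```python
import pandas as pd

# Toy labels for illustration only: 8 players, finishing and playmaking linked.
toy = pd.DataFrame({
    "finishing":  ["over", "over", "over", "over", "under", "under", "under", "under"],
    "playmaking": ["over", "over", "over", "under", "over", "under", "under", "under"],
})

# Observed share of players in each of the four quadrants
observed = pd.crosstab(toy["finishing"], toy["playmaking"], normalize=True)

# Under independence, each cell share is the product of its row and column shares.
row_share = observed.sum(axis=1)
col_share = observed.sum(axis=0)
expected = pd.DataFrame(
    row_share.values[:, None] * col_share.values[None, :],
    index=observed.index, columns=observed.columns,
)

max_gap = (observed - expected).abs().to_numpy().max()
print(observed)
print(expected)
print(f"Largest gap between observed and expected share: {max_gap:.3f}")
```

On the real `attackers` data, gaps close to zero in every cell would be consistent with the near-zero correlation reported above.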

---

# Summary of Findings

| Hypothesis | Result | Key Insight |
|------------|--------|-------------|
| 1. Ball carriers create more goals | **SUPPORTED** | Progressive carries are positively associated with xG+xAG for young players |
| 2. Leagues have distinct styles | **PARTIALLY SUPPORTED** | Small differences exist, but individual variation dominates |
| 3. Finishers are also playmakers | **NOT SUPPORTED** | These are independent skills with near-zero correlation |

## Implications for Scout Tool

1. **Progressive carries** is a strong indicator of attacking potential - should be prominent in radar charts
2. **League context** matters less than expected - don't over-weight league when comparing players
3. **Separate finishing and playmaking** in player profiles - they're distinct talent dimensions
4. **Over/under performance vs expectation** (xG, xAG) reveals skill beyond raw numbers
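
For radar charts, raw per-90 values on different scales are usually converted to a common 0-100 range first. One common choice, assumed here rather than taken from the project, is a percentile rank within each position group. The sketch below shows the idea on toy values (invented for illustration); `PrgC_pctile` is a hypothetical column name.

```python
import pandas as pd

# Toy players for illustration only.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Position": ["FW", "FW", "MF", "MF"],
    "PrgC_per90": [2.0, 4.0, 3.0, 1.0],
})

# Percentile rank of each player among players at the same position (0-100)
toy["PrgC_pctile"] = toy.groupby("Position")["PrgC_per90"].rank(pct=True) * 100
print(toy)
```

Ranking within position keeps a midfielder from being compared against forwards on a metric where the two groups differ systematically.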