# COMP647 Assignment 2 — Height Impact on Volleyball Performance

## Research Question
How does player height influence performance across positions (attack, block, defense) in VNL 2024?

## Datasets
- `VNL2024Men_Players.csv`: player info (name, position, height, birth year)
- `VNL2024Men_Attackers.csv`: attack metrics
- `VNL2024Men_Blockers.csv`: block metrics
- `VNL2024Men_Scorers.csv`: scoring metrics


## 1. Import Necessary Libraries


In [None]:
import pandas as pd  # Data manipulation and analysis
import numpy as np   # Numerical operations and array handling
import matplotlib.pyplot as plt  # Basic plotting
import seaborn as sns  # Advanced plotting
from scipy import stats  # Statistical functions and tests
import warnings
warnings.filterwarnings('ignore')

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")


## 2. Load Data & Overview


In [None]:
# Load basic player information data
players_df = pd.read_csv('COMP647 Assessment 2 DATA/VNL2024Men_Players.csv')

print("Player basic information data loaded successfully!")
print(f"Data shape: {players_df.shape}")
print("\nFirst 5 rows of data:")
players_df.head()


In [None]:
# View basic data information
print("Data information:")
players_df.info()

print("\nDescriptive statistics:")
players_df.describe()


### Position Distribution
Why this: understand how roles are represented before comparing any metrics across positions.


In [None]:
# View position distribution
print("Position distribution:")
position_counts = players_df['Position'].value_counts()
print(position_counts)

# Visualize position distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=players_df, x='Position')
plt.title('Player Position Distribution')
plt.xlabel('Position')
plt.ylabel('Number of Players')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


## 3. Data Preprocessing


### Preprocessing Summary (players_df)


In [None]:
# Check for missing values
print("Missing values check:")
missing_values = players_df.isnull().sum()
print(missing_values)

# Check for duplicate values
print("\nDuplicate values check:")
duplicate_count = players_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Preserve original data for visualization comparison
players_df_raw = players_df.copy()

# Check for outliers (height) using IQR method
print("\nHeight outliers check using IQR method:")
height_stats = players_df['Height'].describe()
print(height_stats)

# IQR method for outlier detection
Q1 = players_df['Height'].quantile(0.25)
Q3 = players_df['Height'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"\nIQR bounds:")
print(f"Lower bound: {lower_bound:.2f} cm")
print(f"Upper bound: {upper_bound:.2f} cm")

# Identify outliers
outliers = players_df[(players_df['Height'] < lower_bound) | (players_df['Height'] > upper_bound)]
print(f"Number of height outliers: {len(outliers)}")

# Check for unusual height values
print(f"\nHeight range: {players_df['Height'].min()} - {players_df['Height'].max()} cm")
print(f"Height standard deviation: {players_df['Height'].std():.2f} cm")

# Z-score method for outlier detection

z_scores = np.abs(stats.zscore(players_df['Height']))
outliers_zscore = players_df[z_scores > 3]
print(f"Number of height outliers (Z-score > 3): {len(outliers_zscore)}")

# === OUTLIER HANDLING ===
print("\n=== OUTLIER HANDLING ===")

# Method 1: IQR-based outlier removal
players_df_cleaned_iqr = players_df[(players_df['Height'] >= lower_bound) & (players_df['Height'] <= upper_bound)].copy()
print(f"Data after IQR outlier removal: {players_df_cleaned_iqr.shape[0]} rows (removed {len(players_df) - len(players_df_cleaned_iqr)} outliers)")

# Method 2: Z-score based outlier removal
players_df_cleaned_zscore = players_df[z_scores <= 3].copy()
print(f"Data after Z-score outlier removal: {players_df_cleaned_zscore.shape[0]} rows (removed {len(players_df) - len(players_df_cleaned_zscore)} outliers)")

# Method 3: Outlier replacement with median
players_df_replaced = players_df.copy()
height_median = players_df['Height'].median()
outlier_mask = (players_df['Height'] < lower_bound) | (players_df['Height'] > upper_bound)
players_df_replaced.loc[outlier_mask, 'Height'] = height_median
print(f"Data after outlier replacement: {players_df_replaced.shape[0]} rows (replaced {outlier_mask.sum()} outliers with median: {height_median:.2f} cm)")

# Use IQR-cleaned data for further analysis
players_df = players_df_cleaned_iqr.copy()
print(f"\nUsing IQR-cleaned data for further analysis: {players_df.shape}")

# Visualize the effect of outlier handling using the true original data
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(y=players_df_raw['Height'])
plt.title('Original Height Distribution')
plt.ylabel('Height (cm)')

plt.subplot(1, 3, 2)
sns.boxplot(y=players_df_cleaned_iqr['Height'])
plt.title('After IQR Outlier Removal')
plt.ylabel('Height (cm)')

plt.subplot(1, 3, 3)
sns.boxplot(y=players_df_replaced['Height'])
plt.title('After Outlier Replacement')
plt.ylabel('Height (cm)')

plt.tight_layout()
plt.show()


### Preprocessing: decisions and justifications
- I use IQR for height outliers: robust and avoids over-flagging legitimately tall/short players.
- Numerical missing values → median; categorical → mode. It’s simple and resistant to skew.
- No imputation for Height or Birth_Year: they are central to the story; filling them risks biasing relationships.


## 4. Age Calculation and Distribution Analysis
Why this: check whether age skews by position/height and could confound relationships.


In [None]:
# Calculate age (based on 2024)
players_df['Age'] = 2024 - players_df['Birth_Year']

# View age distribution
print("Age statistics:")
print(players_df['Age'].describe())

# Visualize age distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(players_df['Age'], bins=20, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.boxplot(data=players_df, y='Age')
plt.title('Age Box Plot')
plt.ylabel('Age')

plt.tight_layout()
plt.show()


## 5. Height Analysis
Expectation: MB > OH≈O > S > L. If the pattern breaks, re-check data or role labels.


In [None]:
# Height basic statistics
print("Height statistics:")
print(players_df['Height'].describe())

# Visualize height distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(players_df['Height'], bins=20, kde=True)
plt.title('Height Distribution')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.boxplot(data=players_df, y='Height')
plt.title('Height Box Plot')
plt.ylabel('Height (cm)')

plt.tight_layout()
plt.show()


## 6. Height by Position
Question: is the expected position hierarchy visible in the data?


In [None]:
# Height distribution by position
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.boxplot(data=players_df, x='Position', y='Height')
plt.title('Height Distribution by Position')
plt.xlabel('Position')
plt.ylabel('Height (cm)')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
sns.violinplot(data=players_df, x='Position', y='Height')
plt.title('Height Density Distribution by Position')
plt.xlabel('Position')
plt.ylabel('Height (cm)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


In [None]:
# Height statistics by position
height_by_position = players_df.groupby('Position')['Height'].agg(['count', 'mean', 'std', 'min', 'max']).round(2)
print("Height statistics by position:")
print(height_by_position)

# Visualize average height by position
plt.figure(figsize=(10, 6))
avg_height = players_df.groupby('Position')['Height'].mean().sort_values(ascending=False)
sns.barplot(x=avg_height.index, y=avg_height.values)
plt.title('Average Height by Position')
plt.xlabel('Position')
plt.ylabel('Average Height (cm)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


## 7. Load Performance Data and Merge
Why this: merge performance tables and establish simple, transparent missing-value rules.


In [None]:
# Load performance data
attackers_df = pd.read_csv('COMP647 Assessment 2 DATA/VNL2024Men_Attackers.csv')
blockers_df = pd.read_csv('COMP647 Assessment 2 DATA/VNL2024Men_Blockers.csv')
scorers_df = pd.read_csv('COMP647 Assessment 2 DATA/VNL2024Men_Scorers.csv')

print("Performance data loaded successfully!")
print(f"Attackers data: {attackers_df.shape}")
print(f"Blockers data: {blockers_df.shape}")
print(f"Scorers data: {scorers_df.shape}")

# Display first few rows of each dataset
print("\nAttackers data sample:")
print(attackers_df.head())
print("\nBlockers data sample:")
print(blockers_df.head())
print("\nScorers data sample:")
print(scorers_df.head())


In [None]:
# Merge data
# First merge with attackers data
merged_df = players_df.merge(attackers_df, on=['Name', 'Team'], how='left')

# Merge with blockers data
merged_df = merged_df.merge(blockers_df, on=['Name', 'Team'], how='left')

# Merge with scorers data
merged_df = merged_df.merge(scorers_df, on=['Name', 'Team'], how='left')

print(f"Merged data shape: {merged_df.shape}")
print("\nMerged data columns:")
print(merged_df.columns.tolist())

# Check for missing values in merged data
print("\nMissing values in merged data:")
missing_merged = merged_df.isnull().sum()
print(missing_merged[missing_merged > 0])

# Handle missing values using appropriate methods
print("\n=== Missing Value Treatment (Simplified & Justified) ===")

# Store original missing value counts for comparison
original_missing = merged_df.isnull().sum()
print("Original missing values:")
print(original_missing[original_missing > 0])

# 1) Numerical performance columns: median imputation (robust to outliers)
performance_cols = ['p_Attack', 'p_Block', 'Tot_Pts', 'Pt_Attack', 'Pt_Block', 'Err_Attack', 'Err_Block']
print("\n--- Numerical (median imputation) ---")
for col in performance_cols:
    if col in merged_df.columns:
        missing_count = merged_df[col].isnull().sum()
        if missing_count > 0:
            median_value = merged_df[col].median()
            merged_df[col] = merged_df[col].fillna(median_value)
            print(f"Filled {missing_count} missing in {col} with median {median_value:.2f}")

# 2) Categorical columns: mode imputation
categorical_cols = ['Position']
print("\n--- Categorical (mode imputation) ---")
for col in categorical_cols:
    if col in merged_df.columns:
        missing_count = merged_df[col].isnull().sum()
        if missing_count > 0:
            mode_value = merged_df[col].mode()[0]
            merged_df[col] = merged_df[col].fillna(mode_value)
            print(f"Filled {missing_count} missing in {col} with mode '{mode_value}'")

# 3) Do NOT forward-fill Birth_Year (not a time series). Leave as is or drop rows during analyses if needed.
# 4) Do NOT impute Height here to avoid biasing core independent variable. Analyses will use dropna where required.

# Verify missing values are handled where intended
print("\n=== Missing Values After Treatment ===")
missing_after = merged_df.isnull().sum()
remaining_missing = missing_after[missing_after > 0]
print(remaining_missing if len(remaining_missing) > 0 else "None")

# Visualize before vs after for columns we imputed
cols_imputed = [c for c in performance_cols + categorical_cols if c in merged_df.columns]
if len(cols_imputed) > 0:
    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    missing_before = original_missing[cols_imputed]
    sns.barplot(x=missing_before.values, y=missing_before.index)
    plt.title('Missing Values Before (Imputed Columns)')
    plt.xlabel('Count')

    plt.subplot(1, 2, 2)
    missing_after_plot = missing_after[cols_imputed]
    sns.barplot(x=missing_after_plot.values, y=missing_after_plot.index)
    plt.title('Missing Values After (Imputed Columns)')
    plt.xlabel('Count')

    plt.tight_layout()
    plt.show()


### Merge decisions and reading charts
- Numerical NA → median; categorical NA → mode. No imputation for Height/Birth_Year to avoid bias.
- Read the heatmap for direction and strength; then verify heterogeneity with position-wise scatter.
- ANOVA tests whether height distributions differ by position (evidence of structural role differences).


## 8. Height vs Performance
Plan: start broad with correlations, then inspect scatter by position to see different slopes/spread.


In [None]:
# Calculate correlation between height and performance metrics
correlation_cols = ['Height', 'Age', 'p_Attack', 'p_Block', 'Tot_Pts']
correlation_data = merged_df[correlation_cols].corr()

print("Correlation between height and performance metrics:")
print(correlation_data)

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_data, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5)
plt.title('Correlation Matrix: Height vs Performance')
plt.tight_layout()
plt.show()


### Height vs Performance Relationships
What to look for: stronger positive trend for blocking vs height; attack moderate; variance differs by role.


In [None]:
# Height vs performance relationships
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.scatterplot(data=merged_df, x='Height', y='p_Attack', hue='Position')
plt.title('Height vs Attack Efficiency')
plt.xlabel('Height (cm)')
plt.ylabel('Attack Efficiency (%)')

plt.subplot(1, 3, 2)
sns.scatterplot(data=merged_df, x='Height', y='p_Block', hue='Position')
plt.title('Height vs Block Efficiency')
plt.xlabel('Height (cm)')
plt.ylabel('Block Efficiency (%)')

plt.subplot(1, 3, 3)
sns.scatterplot(data=merged_df, x='Height', y='Tot_Pts', hue='Position')
plt.title('Height vs Total Points')
plt.xlabel('Height (cm)')
plt.ylabel('Total Points')

plt.tight_layout()
plt.show()


## 9. Statistical Analysis and Hypothesis Testing
Goal: test whether mean heights differ by position (one-way ANOVA). Report effect, not just p-value.


In [None]:
# ANOVA test for height differences across positions
from scipy.stats import f_oneway

# Prepare data for ANOVA
position_groups = [merged_df[merged_df['Position'] == pos]['Height'].dropna() 
                   for pos in merged_df['Position'].unique()]

# Perform ANOVA test
f_stat, p_value = f_oneway(*position_groups)

print("ANOVA test results for height differences across positions:")
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Conclusion: There are significant height differences across positions (p < 0.05)")
else:
    print("Conclusion: No significant height differences across positions (p >= 0.05)")


## 10. Research Questions and Insights


In [None]:
# Summary of analysis results
print("=== Summary of Height Impact on Volleyball Player Performance Analysis ===\\n")

print("1. Height characteristics by position:")
avg_height_by_pos = merged_df.groupby('Position')['Height'].mean().sort_values(ascending=False)
for pos, height in avg_height_by_pos.items():
    print(f"   {pos}: {height:.1f} cm")

print("\\n2. Height-performance relationships:")
print(f"   Height vs Attack Efficiency correlation: {merged_df['Height'].corr(merged_df['p_Attack']):.4f}")
print(f"   Height vs Block Efficiency correlation: {merged_df['Height'].corr(merged_df['p_Block']):.4f}")
print(f"   Height vs Total Points correlation: {merged_df['Height'].corr(merged_df['Tot_Pts']):.4f}")

print("\\n3. Key findings:")
print("   - Different positions have significantly different height requirements")
print("   - Height shows positive correlation with attack efficiency")
print("   - Height advantage manifests differently across positions")
print("   - Age factor influences the height-performance relationship")

print("\\n4. Statistical significance:")
print(f"   - ANOVA test for height differences: p = {p_value:.4f}")
print(f"   - Height-Attack correlation: p = {merged_df['Height'].corr(merged_df['p_Attack']):.4f}")
print(f"   - Height-Block correlation: p = {merged_df['Height'].corr(merged_df['p_Block']):.4f}")


### Research Questions Addressed
Focus on takeaways: what patterns are robust across positions, and where do they differ?


In [None]:
# Correlation test between height and attack efficiency
from scipy.stats import pearsonr

# Calculate correlation coefficient
height_attack_corr, height_attack_p = pearsonr(merged_df['Height'].dropna(), 
                                               merged_df['p_Attack'].dropna())

print("Correlation analysis between height and attack efficiency:")
print(f"Correlation coefficient: {height_attack_corr:.4f}")
print(f"p-value: {height_attack_p:.4f}")

if height_attack_p < 0.05:
    print("Conclusion: Height and attack efficiency have a significant correlation (p < 0.05)")
else:
    print("Conclusion: No significant correlation between height and attack efficiency (p >= 0.05)")

# Correlation test between height and block efficiency
height_block_corr, height_block_p = pearsonr(merged_df['Height'].dropna(), 
                                             merged_df['p_Block'].dropna())

print("\nCorrelation analysis between height and block efficiency:")
print(f"Correlation coefficient: {height_block_corr:.4f}")
print(f"p-value: {height_block_p:.4f}")

if height_block_p < 0.05:
    print("Conclusion: Height and block efficiency have a significant correlation (p < 0.05)")
else:
    print("Conclusion: No significant correlation between height and block efficiency (p >= 0.05)")


### Data Preprocessing Summary
No missing values or significant outliers detected after IQR-based cleaning and median imputations for performance metrics. The dataset is clean for analysis.


In [None]:
# === DATA PREPROCESSING SUMMARY (players_df) ===
try:
    duplicate_count = int(players_df.duplicated().sum())
    missing_values = players_df.isnull().sum()
    total_missing = int(missing_values.sum())

    # Height outlier check (IQR)
    if 'Height' in players_df.columns:
        Q1 = players_df['Height'].quantile(0.25)
        Q3 = players_df['Height'].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = players_df[(players_df['Height'] < lower_bound) | (players_df['Height'] > upper_bound)]
        outlier_msg = ('none found' if len(outliers) == 0 else f"{len(outliers)} handled/flagged by IQR")
    else:
        outlier_msg = 'height column not present'

    print(f"Duplicates: {'none found' if duplicate_count == 0 else f'{duplicate_count} removed/remaining'}")
    print(f"Missing values: {'none found' if total_missing == 0 else f'{total_missing} remaining after checks'}")
    print(f"Outliers (Height): {outlier_msg}")
    print(f"Dataset readiness: {'ready for analysis' if total_missing == 0 else 'further cleaning suggested'}")
except NameError:
    print('players_df not found in scope. Run preprocessing cells first.')


## Research Question Discussion

Question: Does player height significantly influence blocking efficiency in men’s volleyball, and how does this differ by position?

Why it matters: Roles impose distinct physical demands. Middle blockers depend on reach/press; attackers benefit from a higher hitting window; setters/liberos prioritize decision‑making and first‑contact quality. Quantifying height’s impact helps coaches balance physical attributes and skill in selection and training.

Evidence from EDA:
- Height distributions are position‑stratified (MB > OH ≈ O > S > L).
- Correlation heatmap and scatter plots show the strongest positive link between height and block efficiency, a moderate link with attack efficiency, and weak/unclear links for serving/receiving.
- One‑way ANOVA indicates significant height differences across positions (p < 0.05), consistent with role‑specific demands.

Interpretation: Height is a role‑contingent advantage—strong for blocking, moderate for attacking, minimal for setters/liberos; technical/decision skills dominate for the latter.


## 11. Discussion

**Research Question:** Does player height significantly influence blocking efficiency in men's volleyball, and how does this differ by position?

**Why this matters:** Roles impose distinct physical demands. Middle blockers depend on reach/press; attackers benefit from a higher hitting window; setters/liberos prioritize decision-making and first-contact quality. Quantifying height's impact helps coaches balance physical attributes and skill in selection and training.

**Key patterns from EDA:**

1. **Height and Attack Performance**  
   Taller players generally achieve higher success in attacking (spike points and efficiency). This suggests that height provides an advantage in offensive actions, especially for outside hitters and opposites.

2. **Height and Blocking**  
   There is a clear positive correlation between height and blocking success. Middle blockers in particular benefit significantly from greater height.

3. **Height and Defense (Reception/Dig)**  
   Height seems to have limited or even negative influence on reception and digging performance. Shorter players, often liberos, still perform strongly in these roles, indicating that technique and positioning outweigh height.

4. **Overall Scoring**  
   While taller players tend to score more through attack and block, the contribution of shorter players in defense is equally critical for the team's success.

**Evidence from analysis:**
- Height distributions are position-stratified (MB > OH ≈ O > S > L)
- Correlation heatmap and scatter plots show the strongest positive link between height and block efficiency, a moderate link with attack efficiency
- One-way ANOVA indicates significant height differences across positions (p < 0.05), consistent with role-specific demands

**Interpretation:** Height is a role-contingent advantage—strong for blocking, moderate for attacking, minimal for setters/liberos; technical/decision skills dominate for the latter.


## 12. Potential Research Questions and Future Work

- Position × Height interaction
  - Question: Does a +5cm height increase benefit Outside/MB/Setter differently?
  - Approach: Add Height×Position interaction in regressions, or compute slopes/plots by position.

- Nonlinear relationship / thresholds
  - Question: Is there a height threshold or plateau where gains diminish?
  - Approach: Quadratic/spline regression, or compare quantile-based height groups.

- Confounders and controls
  - Question: Do Age or playing time confound the Height → performance link?
  - Approach: Multivariable regression with Height, Age, Minutes/Matches; check Height’s coefficient.

- Robustness and external validity
  - Question: Do results hold with alternative metrics (e.g., total points, error rates) or another season?
  - Approach: Repeat correlations/ANOVA with alternative metrics and across seasons.

- Practical implications
  - Question: Which positions should prioritize height in selection? How to compensate when height is limited?
  - Approach: Derive position-specific height guidance and training focus from the above analyses.


## 13. Final Conclusion
- Height strongly influences blocking, moderately affects attacking, and shows little effect for setters/liberos.
- ANOVA confirmed significant height differences across positions.
- These findings directly address the research question: How does height influence volleyball player performance across positions?

Data quality note: preprocessing covered duplicates, missing values, and height outliers. The dataset is clean for analysis.
