# Scikit-learn Pipeline Example: NCAA Basketball Quadrant Classification

This notebook demonstrates a complete machine learning pipeline using scikit-learn with hyperparameter optimization.

**Goal**: Predict which "quadrant" an NCAA basketball team belongs to based on their performance metrics.

**Data Source**: NCAA Men's Basketball NET Rankings (web scraped using `pandas.read_html()`)

**Quadrant System (Balanced):**
- Teams are divided into 4 equal groups based on their ranking
- Each quadrant contains approximately the same number of teams
- Quad 1: Top 25% of teams
- Quad 2: Second 25% of teams
- Quad 3: Third 25% of teams
- Quad 4: Bottom 25% of teams
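
As a minimal sketch of the balanced split described above, assuming ranks run from 1 to N without gaps, `pandas.qcut` can cut a ranking into four near-equal bins. The team count of 362 and the names `ranks` and `quadrant` are illustrative, not taken from the scraped data.

```python
import pandas as pd

# Hypothetical ranking of 362 teams (1 = best); the count is illustrative only
ranks = pd.Series(range(1, 363), name="Rank")

# qcut splits on quantiles, so each bin holds roughly 25% of the teams
quadrant = pd.qcut(ranks, q=4, labels=[1, 2, 3, 4]).astype(int)

counts = quadrant.value_counts().sort_index()
print(counts)
```

The notebook's own labeling cell uses integer division instead, which places any remainder teams in Quad 4; either approach keeps the classes close to balanced.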

**Features Used:**
- Wins (numeric)
- Losses (numeric)
- Conference (categorical)
- Non-Division I Wins (numeric)

## Key Concepts Covered:
1. Data acquisition from the web using `pandas.read_html()`
2. Feature engineering (parsing records into wins/losses)
3. Multi-class classification (4 balanced classes)
4. Preprocessing pipelines (numeric and categorical)
5. Column transformers
6. **Hyperparameter tuning with GridSearchCV**
7. **Cross-validation (5-fold CV)**
8. Model evaluation and feature importance analysis

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import requests
from io import StringIO
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')

## Step 1: Data Acquisition - Web Scraping NCAA NET Rankings

`pandas.read_html()` is a powerful function that automatically finds and parses HTML tables from web pages.

**How it works:**
- Takes a URL or HTML string
- Returns a list of DataFrames (one for each table found)
- Requires `lxml` or `html5lib` parser
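
The behavior listed above can be tried offline before touching the live site. This sketch parses a small hand-written HTML string; the table contents are made up, and the `ImportError` branch only covers the case where no HTML parser is installed.

```python
from io import StringIO

import pandas as pd

# A tiny hand-written HTML table standing in for a real rankings page
html = """
<table>
  <tr><th>Rank</th><th>School</th><th>Record</th></tr>
  <tr><td>1</td><td>Team A</td><td>25-6</td></tr>
  <tr><td>2</td><td>Team B</td><td>22-9</td></tr>
</table>
"""

try:
    # read_html returns a list with one DataFrame per <table> element
    tables = pd.read_html(StringIO(html))
    print(tables[0])
except ImportError as exc:
    # Raised when the required HTML parser is not installed
    tables = []
    print(f"No HTML parser available: {exc}")
```

Wrapping the markup in `StringIO` mirrors the scraping cell in this notebook, which passes `StringIO(response.text)` rather than a raw string.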

We'll scrape the NCAA NET Rankings and extract team statistics.

In [None]:
# Scrape NCAA NET Rankings data
url = 'https://www.ncaa.com/rankings/basketball-men/d1/ncaa-mens-basketball-net-rankings'

# Add headers to avoid potential blocking
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

try:
    # Fetch the page with proper headers
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    
    # Parse tables from the HTML content
    tables = pd.read_html(StringIO(response.text))
    
    print(f"✓ Successfully scraped {len(tables)} table(s) from NCAA.com")
    
    # Get the main rankings table
    df = tables[0].copy()
    
    # Standardize column names
    if 'School' in df.columns and 'Team' not in df.columns:
        df['Team'] = df['School']
    
    print(f"\nData shape: {df.shape}")
    print(f"\nColumns: {df.columns.tolist()}")
    print(f"\nFirst few rows:")
    print(df.head(10))
    
except Exception as e:
    print(f"⚠ Web scraping failed: {type(e).__name__}: {e}")
    print("\nNote: If scraping fails, you may need to check the URL or create synthetic data.")

In [None]:
# Feature Engineering: Parse record into wins and losses
# Record is typically in format "W-L" like "25-6"

def parse_record(record_str):
    """Parse record string into wins and losses"""
    try:
        if '-' in str(record_str):
            parts = str(record_str).split('-')
            if len(parts) >= 2:
                wins = int(parts[0])
                losses = int(parts[1])
                return wins, losses
        return None, None
    except (ValueError, TypeError):
        return None, None

# Find the record column (might be named 'Record', 'W-L', or similar)
record_col = None
for col in df.columns:
    if 'record' in col.lower() or 'w-l' in col.lower() or '-' in str(df[col].iloc[0] if len(df) > 0 else ''):
        # Check if this looks like a record column
        sample = str(df[col].iloc[0]) if len(df) > 0 else ''
        if '-' in sample and any(c.isdigit() for c in sample):
            record_col = col
            break

if record_col:
    print(f"Found record column: '{record_col}'")
    df[['Wins', 'Losses']] = df[record_col].apply(lambda x: pd.Series(parse_record(x)))
else:
    print("⚠ Could not find record column. Creating sample data...")
    # Create sample wins/losses based on ranking
    df['Wins'] = 30 - (df['Rank'] // 12)  # Better teams have more wins
    df['Losses'] = 2 + (df['Rank'] // 30)  # Worse teams have more losses

print("\nSample of Wins/Losses:")
print(df[['Rank', 'Team', 'Wins', 'Losses']].head(20))

# Check if Conference column exists
if 'Conference' not in df.columns:
    # Look for similar column names
    conf_col = None
    for col in df.columns:
        if 'conf' in col.lower() or 'league' in col.lower():
            conf_col = col
            df['Conference'] = df[conf_col]
            print(f"\n✓ Found conference column: '{conf_col}'")
            break
    
    if conf_col is None:
        print("\n⚠ Conference column not found. Creating synthetic conference data...")
        # Create synthetic conferences based on team distribution
        conferences = ['ACC', 'Big Ten', 'Big 12', 'SEC', 'Pac-12', 'Big East', 
                      'American', 'Mountain West', 'West Coast', 'Atlantic 10', 'Other']
        df['Conference'] = np.random.choice(conferences, size=len(df), 
                                           p=[0.15, 0.15, 0.12, 0.15, 0.10, 0.10, 
                                             0.08, 0.05, 0.04, 0.04, 0.02])

print("\nConference distribution:")
print(df['Conference'].value_counts().head(10))

# Create balanced quadrant labels
total_teams = len(df)
teams_per_quad = total_teams // 4

def assign_balanced_quadrant(rank):
    """Assign quadrant based on equal distribution"""
    if rank <= teams_per_quad:
        return 1
    elif rank <= teams_per_quad * 2:
        return 2
    elif rank <= teams_per_quad * 3:
        return 3
    else:
        return 4

df['Quadrant'] = df['Rank'].apply(assign_balanced_quadrant)

print("\n\nBalanced Quadrant distribution:")
print(df['Quadrant'].value_counts().sort_index())
print(f"Each quadrant has approximately {teams_per_quad} teams")

print(f"\nDataset now has {df.shape[0]} teams with balanced quadrant labels")
df[['Rank', 'Team', 'Wins', 'Losses', 'Conference', 'Quadrant']].head(20)

## Step 2: Data Exploration and Cleaning

In [None]:
# Explore the scraped data
print("Initial Dataset Information:")
print(f"Shape: {df.shape}")
print(f"\nAll available columns:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i}. {col}")

print("\n\nData types:")
print(df.dtypes)

print("\n\nFirst few rows of raw data:")
print(df.head())

In [None]:
# Extract Non-Division I wins
# This is typically in a column that shows wins against non-D1 opponents

print("Looking for Non-Division I wins column...")

# Common column names that might contain this info
non_d1_col = None
for col in df.columns:
    col_lower = col.lower()
    if ('non' in col_lower and 'div' in col_lower) or 'non-d1' in col_lower or 'nond1' in col_lower:
        non_d1_col = col
        break

if non_d1_col:
    # The Non-Div I column contains record strings like "3-0" (wins-losses format)
    # We need to parse out just the wins using the parse_record function
    df['Non_D1_Wins'] = df[non_d1_col].apply(lambda x: parse_record(x)[0] or 0)
    print(f"✓ Using column: {non_d1_col}")
else:
    print("⚠ Non-D1 wins column not found in data.")
    print("  Creating synthetic Non-D1 wins (0-3 per team, more for lower-ranked teams)")
    # Lower-ranked teams (larger Rank values) tend to schedule more non-D1 opponents,
    # so the Poisson rate grows with Rank
    df['Non_D1_Wins'] = np.random.poisson(lam=df['Rank'] / len(df) * 1.5)
    df['Non_D1_Wins'] = df['Non_D1_Wins'].clip(0, 3)  # Cap at 3

print("\nNon-D1 Wins statistics:")
print(df['Non_D1_Wins'].describe())

print("\nSample of all engineered features:")
print(df[['Rank', 'Team', 'Wins', 'Losses', 'Conference', 'Non_D1_Wins']].head(15))

## Step 3: Define Classification Problem

**Task**: Predict which quadrant a team belongs to based on their performance metrics.

**Target Variable**: Quadrant (1, 2, 3, or 4) - **Balanced**, with approximately the same number of teams in each

**Features** (carefully selected):
- **Wins** (numeric) - Total wins in the season
- **Losses** (numeric) - Total losses in the season
- **Conference** (categorical) - Conference affiliation
- **Non_D1_Wins** (numeric) - Wins against non-Division I opponents

This is a **multi-class classification problem** with 4 balanced classes.

In [None]:
# Select only the specified features
# Features: Wins, Losses, Conference, Non_D1_Wins

# Ensure all required columns exist
required_features = ['Wins', 'Losses', 'Conference', 'Non_D1_Wins']

# Check if all features are available
available_features = [col for col in required_features if col in df.columns]
missing_features = [col for col in required_features if col not in df.columns]

if missing_features:
    print(f"⚠ Warning: Missing features: {missing_features}")
    print("  These will be excluded from the model.")

# Create feature matrix with only specified features
X = df[available_features].copy()
y = df['Quadrant'].copy()

print(f"Features (X):")
print(f"  Shape: {X.shape}")
print(f"  Columns: {X.columns.tolist()}")

print(f"\nTarget (y):")
print(f"  Shape: {y.shape}")
print(f"  Class distribution (balanced):")
print(y.value_counts().sort_index())

print("\nSample of feature data:")
print(X.head(10))

In [None]:
# Identify numeric and categorical features
print("Feature types:")

numeric_features = []
categorical_features = []

for col in X.columns:
    if pd.api.types.is_numeric_dtype(X[col]):
        numeric_features.append(col)
        print(f"  {col}: numeric (dtype: {X[col].dtype})")
    else:
        categorical_features.append(col)
        print(f"  {col}: categorical (dtype: {X[col].dtype})")

print(f"\nNumeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")

# Check for missing values
print("\n\nMissing values:")
missing_counts = X.isnull().sum()
if missing_counts.sum() > 0:
    print(missing_counts[missing_counts > 0])
else:
    print("  None - all features are complete ✓")

# Display summary statistics
print("\n\nSummary statistics for numeric features:")
if numeric_features:
    print(X[numeric_features].describe())

## Step 4: Split Data into Training and Testing Sets

We'll split our data 80/20 for training and testing, using stratification to ensure each quadrant is proportionally represented in both sets.
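
To see what stratification does in isolation, here is a small sketch on synthetic labels. The 100-team frame and the `*_demo` names are illustrative assumptions, not part of the scraped dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 teams, 25 per quadrant
y_demo = pd.Series([1, 2, 3, 4] * 25, name="Quadrant")
X_demo = pd.DataFrame({"Wins": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y_demo, each quadrant keeps its 25% share in both splits
print(y_tr.value_counts().sort_index().to_dict())
print(y_te.value_counts().sort_index().to_dict())
```

Without `stratify`, a random 20% sample of a small dataset can easily over- or under-represent one quadrant, which makes per-class test metrics noisier.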

In [None]:
# Split the data with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} teams")
print(f"Testing set size: {X_test.shape[0]} teams")

print(f"\nTraining set quadrant distribution:")
print(y_train.value_counts().sort_index())

print(f"\nTesting set quadrant distribution:")
print(y_test.value_counts().sort_index())

In [None]:
# Verify that stratification worked correctly
train_proportions = y_train.value_counts(normalize=True).sort_index()
test_proportions = y_test.value_counts(normalize=True).sort_index()

print("Quadrant proportions (should be similar in train and test):")
comparison = pd.DataFrame({
    'Training': train_proportions,
    'Testing': test_proportions
})
print(comparison)

## Step 5: Build the Pipeline

This is the key part! We'll create a complete pipeline that:
1. Handles numeric features (imputation + scaling)
2. Handles categorical features (imputation + one-hot encoding)
3. Trains a multi-class classifier

All in one convenient package!

In [None]:
# Build preprocessing transformers for our specific features
# Numeric: Wins, Losses, Non_D1_Wins
# Categorical: Conference

print("Building preprocessing pipeline...")

# Numeric transformer: impute missing values (if any), then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical transformer: impute missing values, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

print("✓ Preprocessing pipeline created!")
print(f"\nNumeric features to be scaled: {numeric_features}")
print(f"Categorical features to be one-hot encoded: {categorical_features}")

In [None]:
# Create the complete pipeline: preprocessing + classifier
# We'll use GridSearchCV to find the best hyperparameters

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

print("Pipeline structure:")
print(pipeline)

## Step 6: Hyperparameter Tuning with GridSearchCV

Instead of just training with default parameters, we'll use GridSearchCV to find the best hyperparameters for our Random Forest classifier through cross-validation.
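
One detail worth seeing on a small scale is how grid keys address steps inside a pipeline: the key is the step name, two underscores, then the parameter name. The sketch below uses synthetic data from `make_classification` and a deliberately tiny grid; the `demo_*` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the team features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# "<step name>__<parameter>" routes each value to the right pipeline step
demo_grid = {
    "classifier__n_estimators": [10, 25],
    "classifier__max_depth": [None, 5],
}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# 2 x 2 = 4 candidates, each evaluated on 3 folds
print(len(demo_search.cv_results_["params"]))
print(demo_search.best_params_)
```

Because the whole pipeline is passed to `GridSearchCV`, the scaler is refit on the training portion of every fold rather than on the full dataset.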

In [None]:
# Define the parameter grid for Random Forest
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

print("Hyperparameter grid to search:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

print(f"\nTotal combinations to test: {np.prod([len(v) for v in param_grid.values()])}")

# Create GridSearchCV
grid_search = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,  # Use all available processors
    verbose=1,
    return_train_score=True
)

print("\nStarting Grid Search with 5-fold Cross-Validation...")
print("This may take a few minutes...\n")

# Fit the grid search
grid_search.fit(X_train, y_train)

print("\n✓ Grid Search Complete!")
print(f"\nBest parameters found:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest cross-validation accuracy: {grid_search.best_score_:.3f}")

# Analyze Grid Search Results
# Show top performing parameter combinations

results_df = pd.DataFrame(grid_search.cv_results_)

# Select relevant columns
results_summary = results_df[[
    'param_classifier__n_estimators',
    'param_classifier__max_depth',
    'param_classifier__min_samples_split',
    'param_classifier__min_samples_leaf',
    'mean_test_score',
    'std_test_score',
    'rank_test_score'
]].sort_values('rank_test_score').head(10)

print("Top 10 Parameter Combinations:")
print(results_summary.to_string(index=False))

print(f"\n\nComparison:")
print(f"  Best CV Score: {grid_search.best_score_:.4f}")
print(f"  Worst CV Score: {results_df['mean_test_score'].min():.4f}")
print(f"  Improvement: {(grid_search.best_score_ - results_df['mean_test_score'].min()):.4f}")

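# Summarize how each parameter value affects the mean CV score
print("\n\nParameter Value Impact on Performance:")
for param in param_grid.keys():
    param_col = f'param_{param}'
    # Group on the string form so max_depth=None is kept (groupby drops missing keys by default)
    avg_by_param = results_df.groupby(results_df[param_col].astype(str))['mean_test_score'].mean().sort_values(ascending=False)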
    print(f"\n{param}:")
    for val, score in avg_by_param.items():
        print(f"  {val}: {score:.4f}")

## Step 7: Evaluate Model on Test Set

In [None]:
# Evaluate the optimized model on the held-out test set
y_pred = grid_search.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.3f}")
print(f"(Compared to best CV accuracy: {grid_search.best_score_:.3f})")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Quad 1', 'Quad 2', 'Quad 3', 'Quad 4']))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=[1, 2, 3, 4])  # fix label order so rows/columns match Q1-Q4
print("Confusion Matrix:")
print("Rows = Actual, Columns = Predicted")
print("\n      Q1   Q2   Q3   Q4")
for i, row in enumerate(cm, 1):
    print(f"Q{i}   {row}")

print("\nInterpretation:")
print(f"Diagonal elements show correct predictions for each quadrant")
print(f"Off-diagonal elements show misclassifications")

# Calculate per-class accuracy
print("\nPer-Quadrant Accuracy:")
for i in range(len(cm)):
    class_acc = cm[i, i] / cm[i, :].sum() if cm[i, :].sum() > 0 else 0
    print(f"  Quad {i+1}: {class_acc:.3f}")

## Step 8: Review Cross-Validation Scores for the Best Model

In [None]:
# Note: Cross-validation was already performed during GridSearchCV
# Each of the 108 parameter combinations (3 x 4 x 3 x 3) was evaluated using 5-fold CV
# The best model (shown above) achieved the best average CV score

print("GridSearchCV already performed cross-validation!")
print(f"\nBest model's cross-validation performance:")
print(f"  Mean CV Accuracy: {grid_search.best_score_:.4f}")
print(f"  Across {grid_search.n_splits_} folds")

# Let's also look at the CV scores for the best model specifically
best_idx = grid_search.best_index_
cv_results = grid_search.cv_results_

print(f"\nBest model's scores across each fold:")
for fold in range(grid_search.n_splits_):
    fold_score = cv_results[f'split{fold}_test_score'][best_idx]
    print(f"  Fold {fold + 1}: {fold_score:.4f}")

print(f"\nStandard deviation: {cv_results['std_test_score'][best_idx]:.4f}")

## Step 9: Feature Importance Analysis

Let's see which features (including one-hot encoded conference categories) are most important for the classification.

In [None]:
# Extract feature importance from the trained Random Forest
best_model = grid_search.best_estimator_
feature_importance = best_model.named_steps['classifier'].feature_importances_

# Get feature names after preprocessing
num_feature_names = numeric_features
cat_feature_names = best_model.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
all_feature_names = num_feature_names + list(cat_feature_names)

# Create a dataframe for visualization
importance_df = pd.DataFrame({
    'Feature': all_feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("Top 10 Most Important Features:")
print(importance_df.head(10))

## Step 10: Make Predictions on New Data

The beauty of pipelines: preprocessing is applied automatically to new data!
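
A related point is what happens when new data contains a conference the model never saw. The sketch below, on made-up data with illustrative names, shows that `handle_unknown="ignore"` lets the pipeline score such a row instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data with only two conferences
train = pd.DataFrame({
    "Wins": [28, 24, 19, 15, 11, 8],
    "Conference": ["ACC", "SEC", "ACC", "SEC", "ACC", "SEC"],
})
labels = [1, 1, 2, 2, 3, 3]

demo_model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["Wins"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Conference"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
]).fit(train, labels)

# "Big Sky" never appeared in training; handle_unknown="ignore" encodes it
# as all zeros instead of raising an error
newcomer = pd.DataFrame([{"Wins": 26, "Conference": "Big Sky"}])
proba = demo_model.predict_proba(newcomer)[0]

# classes_ gives the label that each probability column refers to
for cls, p in zip(demo_model.classes_, proba):
    print(f"Quad {cls}: {p:.2f}")
```

The prediction for an unseen conference then rests entirely on the numeric features, which is worth keeping in mind when interpreting it.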

In [None]:
# Make predictions on a hypothetical new team
# Create a team with specific stats

new_team = pd.DataFrame([{
    'Wins': 25,
    'Losses': 6,
    'Conference': 'Big Ten',
    'Non_D1_Wins': 2
}])

print("New team statistics:")
print(new_team)
print(f"\nRecord: {new_team['Wins'].values[0]}-{new_team['Losses'].values[0]}")

# Make prediction - GridSearchCV delegates to best_estimator_ automatically
prediction = grid_search.predict(new_team)
probability = grid_search.predict_proba(new_team)

print(f"\nPredicted Quadrant: {prediction[0]}")
print(f"\nProbabilities for each quadrant:")
# Pair each probability with its class label instead of assuming labels 1-4 in order
for quad, prob in zip(grid_search.classes_, probability[0]):
    print(f"  Quad {quad}: {prob:.3f} ({prob*100:.1f}%)")

# Show the most likely quadrant
most_likely = prediction[0]
confidence = probability[0].max()
print(f"\n✓ This team is most likely in Quadrant {most_likely} (confidence: {confidence:.1%})")

## Summary

### What We Learned:

1. **Data Acquisition**: Used `pandas.read_html()` to scrape NCAA NET rankings from the web
2. **Feature Engineering**: 
   - Parsed record strings (e.g., "25-6") into separate Wins and Losses columns
   - Extracted Non-Division I wins
   - Selected only relevant features for modeling
3. **Balanced Multi-Class Classification**: Built a 4-class classifier with equal representation
4. **Minimal Feature Set**: Used only 4 carefully selected features:
   - Wins, Losses (numeric)
   - Conference (categorical)
   - Non-D1 Wins (numeric)
5. **Pipeline Construction**: Built a complete ML pipeline with:
   - Separate preprocessing for numeric and categorical features
   - Feature scaling for numeric data
   - One-hot encoding for conference names
   - Model training
6. **Hyperparameter Tuning with GridSearchCV**:
   - Tested 108 different parameter combinations (3 × 4 × 3 × 3)
   - Used 5-fold cross-validation for each combination
   - Automatically selected best hyperparameters
   - Reduced the risk of overfitting to a single validation split by averaging scores across folds
7. **Benefits of This Approach**:
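   - **Pipelines** reduce the risk of data leakage because imputers, scalers, and encoders are fit only on the training portion of each fold
   - **GridSearchCV** automatically selects the best-scoring combination within the grid
   - **Cross-validation** gives a more stable performance estimate than a single train/validation split
   - The fitted pipeline bundles preprocessing and model into one object, which simplifies reuse on new data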
8. **Model Evaluation**: Used multiple metrics including:
   - Overall accuracy
   - Per-class precision, recall, F1-score
   - Confusion matrix
   - Cross-validation scores across parameter combinations

### Real-World Application:
This pipeline demonstrates:
- Professional ML workflow with hyperparameter optimization
- How to systematically search large parameter spaces
- Balancing model complexity vs. performance
- Making data-driven decisions about model configuration

### Try It Yourself:
- Expand the parameter grid: add `max_features` or `criterion`
- Try RandomizedSearchCV for larger parameter spaces
- Test different classifiers: Logistic Regression, Gradient Boosting, SVM
- Create interaction features: `Win_Percentage = Wins / (Wins + Losses)`
- Compare performance: GridSearchCV vs. default parameters
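
As a starting point for the `RandomizedSearchCV` suggestion, here is a minimal sketch on synthetic data. The distributions, ranges, and `n_iter=5` are arbitrary illustrative choices, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 4-class data standing in for the team features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# Distributions are sampled rather than enumerated, so the cost is set by
# n_iter instead of by the size of the grid
param_distributions = {
    "n_estimators": randint(10, 60),
    "max_depth": [None, 5, 10],
    "min_samples_leaf": randint(1, 5),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5, cv=3, scoring="accuracy", random_state=42,
)
random_search.fit(X_demo, y_demo)

print(len(random_search.cv_results_["params"]))
print(random_search.best_params_)
```

To use this with the notebook's pipeline, the keys would need the `classifier__` prefix, as in the grid search above.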