This project is part of DES 400: Digital Engineering Project Development, offered by the Sirindhorn International Institute of Technology (SIIT), Thammasat University.

Contributors:
1. Atiluck Chanveeratham 6422771764
2. Tanapat Suntornsirikul 6422781342
3. Sirapat Eiamassavamongkol 6422781359
4. Sorrawee Laowithayangkul 6422781565

Get the raw data from https://www.kaggle.com/datasets/paololol/league-of-legends-ranked-matches/data?select=stats1.csv

## Data Preparation

This section lists the core libraries and machine learning tools used in the notebook, along with their purposes:

Core Libraries for Data Manipulation and Visualization
- **pandas**: For data manipulation and analysis.
- **numpy**: For numerical computing.
- **seaborn**: For statistical data visualization.
- **matplotlib.pyplot**: For creating static, interactive, and animated visualizations.
- **itertools**: For efficient looping and handling combinatoric operations.

Machine Learning Tools
- **sklearn.ensemble.IsolationForest**: For anomaly detection using an ensemble-based isolation approach.
- **sklearn.cluster.KMeans**: For clustering data points into groups based on their specified features.
- **sklearn.preprocessing.StandardScaler**: For scaling data to standardize features.
- **sklearn.metrics.silhouette_score**: For evaluating the quality of clusters using the silhouette coefficient.

In [90]:
# Core libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import itertools

# Machine learning tools for clustering and anomaly detection
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

This section loads, concatenates, and filters the dataset. Because the game statistics are large, they are split across two files; these are concatenated and then merged with the participant and match data. Further filtering keeps only matches that satisfy the constraints described in the steps below.

**Steps:**
1. **Load the Data**:
- `participants.csv`: Contains participant data, including information about players and roles.
- `stats1.csv` and `stats2.csv`: Contain game statistics, divided into two files due to their size.
- `matches.csv`: Contains match data, including match identifiers and related details.

In [None]:
# Load the data
participant_data = pd.read_csv('./data/participants.csv')
stat1_data = pd.read_csv('./data/stats1.csv')
stat2_data = pd.read_csv('./data/stats2.csv')
matches_data = pd.read_csv('./data/matches.csv')

**Data Cleansing** (applied to each statistics file before concatenation):
- `dmgtoobj`: Total damage dealt to objectives, which includes damage to turrets.
- `dmgtoturrets`: Damage dealt specifically to turrets.

Assumption:

Since turrets are a subset of objectives, `dmgtoturrets` cannot logically exceed `dmgtoobj`. Rows that violate this rule are treated as invalid and filtered out, together with rows whose values in these columns are missing or non-numeric.
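As a minimal sketch of how this rule behaves, the toy table below (made-up values, not taken from the dataset) shows that a single comparison mask removes rule violations as well as missing and non-numeric entries, because any comparison involving NaN evaluates to False.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration only.
toy = pd.DataFrame({
    "dmgtoobj":     [1000, 200, "abc", 500],
    "dmgtoturrets": [400, 900, 100, None],
})

obj = pd.to_numeric(toy["dmgtoobj"], errors="coerce")
turrets = pd.to_numeric(toy["dmgtoturrets"], errors="coerce")

# Comparisons involving NaN evaluate to False, so non-numeric and missing
# values are dropped by the same mask that removes the rule violations.
toy_cleaned = toy[obj >= turrets]
print(toy_cleaned)
```

Only the first row survives: the second violates the rule, the third has a non-numeric `dmgtoobj`, and the fourth has a missing `dmgtoturrets`.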

In [92]:
stat1_data_cleaned = stat1_data[
    pd.to_numeric(stat1_data['dmgtoobj'], errors='coerce') >= pd.to_numeric(stat1_data['dmgtoturrets'], errors='coerce')
]
stat1_data_cleaned = stat1_data_cleaned.dropna(subset=['dmgtoobj', 'dmgtoturrets'])

stat2_data_cleaned = stat2_data[
    pd.to_numeric(stat2_data['dmgtoobj'], errors='coerce') >= pd.to_numeric(stat2_data['dmgtoturrets'], errors='coerce')
]
stat2_data_cleaned = stat2_data_cleaned.dropna(subset=['dmgtoobj', 'dmgtoturrets'])

In [None]:
# The shape of data before cleaning
print("The shape of stat1_data before cleaning:", stat1_data.shape)
print("The shape of stat2_data before cleaning:", stat2_data.shape)
# The shape of data after cleaning
print("The shape of stat1_data after cleaning:", stat1_data_cleaned.shape)
print("The shape of stat2_data after cleaning:", stat2_data_cleaned.shape)

2. **Concatenate Large Datasets**:
Combine `stats1.csv` and `stats2.csv` into a single dataset for unified processing.

3. **Merge Datasets**:
Merge the concatenated statistics with participant data and match data to create a comprehensive dataset.

4. **Filter Matches**:
- Retain only matches with exactly **10 participants**.
- Ensure roles are unique and not duplicated within each team to maintain valid team compositions.
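The two match filters can be illustrated on a toy table. This is only a sketch: the match IDs and rows are invented, and it assumes, as the notebook does, that the two teams of a match can be told apart by the `win` column.

```python
import pandas as pd

roles = ["TOP", "JUNGLE", "MID", "ADC", "SUPPORT"]

# Hypothetical match 1: ten players, five distinct roles on each team.
match_1 = pd.DataFrame({
    "matchid": 1,
    "win": [1] * 5 + [0] * 5,
    "role_position": roles * 2,
})
# Hypothetical match 2: ten players, but the losing team has two TOP players.
match_2 = pd.DataFrame({
    "matchid": 2,
    "win": [1] * 5 + [0] * 5,
    "role_position": roles + ["TOP", "TOP", "MID", "ADC", "SUPPORT"],
})
toy = pd.concat([match_1, match_2], ignore_index=True)

# Constraint 1: exactly 10 participants per match.
players_per_match = toy.groupby("matchid")["role_position"].size()
# Constraint 2: every team (split by 'win') covers 5 distinct roles.
roles_per_team = toy.groupby(["matchid", "win"])["role_position"].nunique()
teams_ok = (roles_per_team == 5).groupby(level="matchid").all()

mask = (players_per_match == 10) & teams_ok
valid_ids = mask[mask].index
print(list(valid_ids))
```

Match 2 is rejected even though it has ten participants, because one of its teams covers only four distinct roles.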

In [None]:
# Step 1: Concatenate stat1 and stat2 data
merged_stat_data = pd.concat([stat1_data_cleaned, stat2_data_cleaned], ignore_index=True)

# Step 2: Merge participant_data with merged_stat_data on 'id'
merged_data = pd.merge(participant_data, merged_stat_data, on='id', how='outer')

# Step 3: Merge the resulting data with matches_data on 'matchid' and 'id'
merged_data = pd.merge(merged_data, matches_data, left_on='matchid', right_on='id', how='inner')

# Step 4: Clean up columns by removing 'id_y' and renaming 'id_x' to 'participantid'
merged_data = (
    merged_data
    .drop(columns=['id_y'])
    .rename(columns={'id_x': 'participantid'})
)

# Step 5: Count occurrences of each match_id
match_id_counts = merged_data['matchid'].value_counts()

# Step 6: Identify match_ids with exactly 10 occurrences
match_ids_with_10_records = match_id_counts[match_id_counts == 10].index

# Step 7: Filter merged_data for match_ids appearing 10 times
participants_with_10_records = merged_data[
    merged_data['matchid'].isin(match_ids_with_10_records)
].copy()  # copy so the 'role_position' column added below does not trigger SettingWithCopyWarning

# Step 8: Define conditions and corresponding values for assigning roles
conditions = [
    (participants_with_10_records['role'] == 'SOLO') & (participants_with_10_records['position'] == 'TOP'),
    (participants_with_10_records['role'] == 'NONE') & (participants_with_10_records['position'] == 'JUNGLE'),
    (participants_with_10_records['role'] == 'SOLO') & (participants_with_10_records['position'] == 'MID'),
    (participants_with_10_records['role'] == 'DUO_CARRY') & (participants_with_10_records['position'] == 'BOT'),
    (participants_with_10_records['role'] == 'DUO_SUPPORT') & (participants_with_10_records['position'] == 'BOT')
]

values = ['TOP', 'JUNGLE', 'MID', 'ADC', 'SUPPORT']

# Step 9: Assign roles based on conditions and add a new column 'role_position'
participants_with_10_records['role_position'] = np.select(conditions, values, default='UNKNOWN')

In [None]:
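# Step 10: Check each match_id to ensure both teams have exactly 5 unique roles
# Rows labelled 'UNKNOWN' are excluded first so that an unmatched role cannot count as one of the five
known_roles = participants_with_10_records[participants_with_10_records['role_position'] != 'UNKNOWN']
# Group participants by matchid and win/loss, then check role uniqueness within teams
valid_matches = known_roles.groupby(['matchid', 'win'])['role_position'].nunique()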

# Identify valid matches where both teams have exactly 5 unique roles
valid_match_ids = valid_matches[valid_matches == 5].reset_index()

# Further filter to get match_ids where both teams (win == 1 and win == 0) have exactly 5 roles
valid_matches = valid_match_ids.groupby('matchid').filter(lambda x: len(x) == 2)['matchid'].unique()

# Step 11: Filter the merged_data again to keep only valid matches
valid_participants = participants_with_10_records[participants_with_10_records['matchid'].isin(valid_matches)]

print(f"Number of valid matches with exactly 5 unique roles per team based on win/loss: {len(valid_matches)}")

5. **Optimize DataFrame Size**:
- Separate the dataset by roles (e.g., **Top, Jungle, Mid, ADC, Support**) to facilitate role-specific analysis.
- Select only relevant columns to reduce the size of the DataFrame, focusing on key features needed for analysis.

In [96]:
# Specify the columns you want to keep
columns_to_keep = [
    'participantid', 'matchid', 'championid', 'win', 'kills', 'deaths', 'assists', 'largestkillingspree', 'largestmultikill', 
    'killingsprees', 'totdmgdealt', 'totdmgtochamp', 'dmgtoobj', 'dmgtoturrets', 'visionscore', 'totdmgtaken', 'goldearned', 'inhibkills', 
    'totminionskilled', 'neutralminionskilled', 'wardsbought', 'wardsplaced', 'wardskilled', 'role_position', 'duration'
]

# First filtering based on domain knowledge
filtered_data = valid_participants[columns_to_keep].copy()  # copy so later in-place edits do not touch valid_participants

In this step, we adjust the `deaths` column to better align with performance metrics. Since a higher number of deaths indicates poorer performance, the values in this column are made negative to reflect this relationship.

In [None]:
# Define a function to make the 'deaths' column negative
def make_column_negative(df, column_name='deaths'):
    if column_name in df.columns:
        df[column_name] = -df[column_name]
    return df

# Apply the function to the entire DataFrame first
filtered_data = make_column_negative(filtered_data)

# Then filter by role position
top_laners_df = filtered_data[filtered_data['role_position'] == 'TOP']
junglers_df = filtered_data[filtered_data['role_position'] == 'JUNGLE']
mid_laners_df = filtered_data[filtered_data['role_position'] == 'MID']
adc_df = filtered_data[filtered_data['role_position'] == 'ADC']
supports_df = filtered_data[filtered_data['role_position'] == 'SUPPORT']

6. **Feature Engineering and Scaling**:
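Divide time-dependent features (e.g., damage, gold, kills) by **match duration** in minutes so that values are comparable across matches of different lengths. Additional helper functions derive ratio features (such as damage per kill) and fill missing `wardsbought` values with the column mean.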
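A short worked example of the per-minute conversion, using made-up numbers and assuming that `duration` is stored in seconds (which is what the division by 60 in the notebook implies):

```python
import pandas as pd

# Hypothetical values; 'duration' is assumed to be stored in seconds.
toy = pd.DataFrame({
    "goldearned": [12000, 18000],
    "kills": [4, 9],
    "duration": [1200, 2400],  # 20 and 40 minutes
})

minutes = toy["duration"] / 60
toy["goldearned_per_minute"] = toy["goldearned"] / minutes
toy["kills_per_minute"] = toy["kills"] / minutes
print(toy[["goldearned_per_minute", "kills_per_minute"]])
```

The second player earned more gold in total (18000 vs 12000) but less per minute (450 vs 600), which is the kind of distortion the per-minute features are meant to remove.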

In [98]:
def divide_by_duration(df, metrics, suffix='_per_minute'):
    # Ensure the 'duration' column exists
    if 'duration' not in df.columns:
        raise ValueError("'duration' column is missing from the DataFrame.")

    # Work on a copy so the caller's DataFrame (often a slice) is not modified in place
    # and re-running the cell does not divide 'duration' a second time
    df = df.copy()

    # Convert duration to minutes (assumes the raw values are in seconds)
    df['duration'] = df['duration'] / 60

    # Check for zero duration once, before any division
    if (df['duration'] == 0).any():
        raise ValueError("Some rows have a duration of zero, cannot divide by zero.")

    for metric in metrics:
        # Ensure metric exists in the DataFrame
        if metric not in df.columns:
            raise ValueError(f"'{metric}' column is missing from the DataFrame.")

        # Store the per-minute value in a new column named with the suffix
        df[metric + suffix] = df[metric] / df['duration']

    return df

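# Generate dmg_per_kill; rows with zero kills yield NaN instead of an infinite value
def generate_dmg_per_kill(df, dmg_col='totdmgtochamp', kills_col='kills', new_feature_name='dmg_per_kill'):
    df = df.copy()  # avoid modifying a slice of another DataFrame in place
    df[new_feature_name] = df[dmg_col] / df[kills_col].replace(0, np.nan)
    return df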

# Generate visionscore_per_wardplaced (safe division: 0 when no wards were placed)
def generate_visionscore_per_wardplaced(df, visionscore_col='visionscore', wardsplaced_col='wardsplaced', new_feature_name='visionscore_per_wardplaced'):
    if visionscore_col not in df.columns or wardsplaced_col not in df.columns:
        raise ValueError(f"Columns '{visionscore_col}' and/or '{wardsplaced_col}' are missing from the DataFrame.")
    # Vectorized equivalent of a row-wise apply: divide where wards were placed, otherwise use 0
    has_wards = df[wardsplaced_col] != 0
    ratio = df[visionscore_col] / df[wardsplaced_col].where(has_wards)
    df[new_feature_name] = ratio.where(has_wards, 0)
    return df

# Fill 'wardsbought' with the average, handling non-numeric entries
def fill_wardsbought_with_average(df):
    if 'wardsbought' not in df.columns:
        raise ValueError("'wardsbought' column is missing from the DataFrame.")

    # Work on a copy so a slice of another DataFrame is not modified in place
    df = df.copy()

    # Convert non-numeric entries to NaN
    df['wardsbought'] = pd.to_numeric(df['wardsbought'], errors='coerce')

    # Calculate and fill NaN values with the average
    average_wardsbought = df['wardsbought'].mean()  # Automatically ignores NaN values
    # Assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas
    df['wardsbought'] = df['wardsbought'].fillna(average_wardsbought)

    return df

In [None]:
# ADC Role
adc_metrics_to_divide = ['totdmgtochamp', 'totminionskilled', 'dmgtoobj', 'goldearned', 'kills', 'deaths', 'assists', 'dmgtoturrets']
adc_divided = divide_by_duration(adc_df, adc_metrics_to_divide)

# Support Role
supports_df = fill_wardsbought_with_average(supports_df)
support_metrics_to_divide = ['visionscore', 'wardsplaced', 'wardskilled', 'wardsbought', 'totdmgtaken', 'kills', 'deaths', 'assists']
support_divided = divide_by_duration(supports_df, support_metrics_to_divide)

# Mid Role
mid_metrics_to_divide = ['totdmgtochamp', 'goldearned','visionscore', 'dmgtoobj', 'totdmgdealt','totminionskilled', 'kills', 'deaths', 'assists']
mid_divided = divide_by_duration(mid_laners_df, mid_metrics_to_divide)

# Top Role
top_metrics_to_divide = ['totdmgtaken', 'deaths', 'assists', 'goldearned', 'totdmgtochamp', 'totminionskilled', 'totdmgdealt', 'kills']
top_damage_per_kill = generate_dmg_per_kill(top_laners_df)
top_divided = divide_by_duration(top_damage_per_kill, top_metrics_to_divide)

# Jungle Role
jungle_metrics_to_divide = ['neutralminionskilled', 'kills', 'assists', 'goldearned','visionscore', 'totdmgdealt', 'dmgtoobj', 'totdmgtochamp', 'deaths', 'dmgtoturrets', 'inhibkills']
jungle_divided = divide_by_duration(junglers_df, jungle_metrics_to_divide)

## Feature Correlation Matrix

This function generates and visualizes a correlation matrix for selected features in the dataset. It helps identify relationships between variables for a specific role.

In [100]:
def visualize_correlation_matrix(df, features, role_name):
    metrics = df[features]

    # Check for columns with string values and report them
    non_numeric_columns = metrics.select_dtypes(include=['object']).columns
    if not non_numeric_columns.empty:
        print(f"Warning: The following columns in {role_name} contain non-numeric data:")
        print(non_numeric_columns)
        for col in non_numeric_columns:
            print(f"Non-numeric values in '{col}':")
            print(metrics[col].unique())

    # Convert all columns to numeric, forcing errors to NaN
    metrics = metrics.apply(pd.to_numeric, errors='coerce')
    
    # Drop rows with NaN values
    metrics = metrics.dropna()

    # Calculate correlation matrix
    correlation_matrix = metrics.corr()

    # Visualize the correlation matrix using a heatmap
    plt.figure(figsize=(12, 10))  # Increase figure size to give more space for labels
    heatmap = sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
    
    # Set the title of the heatmap
    plt.title(f'Correlation Matrix for {role_name} Features')
    
    # Adjust layout to avoid cutting off labels
    plt.subplots_adjust(left=0.25, right=0.9, top=0.9, bottom=0.25)  # Adjust both left and bottom margins
    
    # Show the plot
    plt.show()

To analyze feature relationships for different roles in the dataset, correlation matrices are generated and visualized for key metrics specific to each role.

In [None]:
metrics_before_correlation = {
    "adc": [
        'totdmgtochamp_per_minute', 'totminionskilled_per_minute', 'goldearned_per_minute', 'largestkillingspree', 
        'largestmultikill', 'dmgtoobj_per_minute', 'kills_per_minute', 'deaths_per_minute', 'assists_per_minute', 'dmgtoturrets_per_minute'
        ],
    "support": [
        'wardskilled_per_minute', 'wardsplaced_per_minute', 'visionscore_per_minute', 'wardsbought_per_minute', 
        'totdmgtaken_per_minute', 'kills_per_minute', 'deaths_per_minute', 'assists_per_minute'
        ],
    "mid": [
        'totdmgtochamp_per_minute', 'goldearned_per_minute', 'largestkillingspree', 'visionscore_per_minute', 'totdmgdealt_per_minute', 
        'dmgtoobj_per_minute', 'totminionskilled_per_minute', 'kills_per_minute', 'deaths_per_minute', 'assists_per_minute'
        ],
    "top": [
         'totdmgtaken_per_minute', 'deaths_per_minute', 'dmg_per_kill', 'goldearned_per_minute', 'totdmgtochamp_per_minute', 
        'assists_per_minute', 'totdmgdealt_per_minute', 'kills_per_minute', 'largestkillingspree', 'totminionskilled_per_minute',
        ],
    "jungle": [
        'neutralminionskilled_per_minute', 'kills_per_minute', 'assists_per_minute', 'killingsprees',
        'visionscore_per_minute', 'totdmgdealt_per_minute', 'dmgtoobj_per_minute', 'totdmgtochamp_per_minute', 'deaths_per_minute', 
        'goldearned_per_minute', 'dmgtoturrets_per_minute', 'inhibkills_per_minute', 'largestmultikill'
        ]
}

visualize_correlation_matrix(adc_divided, metrics_before_correlation['adc'], 'ADC')
visualize_correlation_matrix(support_divided, metrics_before_correlation['support'], 'Support')
visualize_correlation_matrix(mid_divided, metrics_before_correlation['mid'], 'Mid')
visualize_correlation_matrix(top_divided, metrics_before_correlation['top'], 'Top')
visualize_correlation_matrix(jungle_divided, metrics_before_correlation['jungle'], 'Jungle')

Drop one feature from each highly correlated pair, then plot the heatmaps again for the remaining features.
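The notebook chooses the features to drop by reading the heatmaps. As a complementary sketch, highly correlated pairs can also be listed programmatically. The helper `list_correlated_pairs` and the 0.8 cutoff are hypothetical choices for illustration and are not part of the original analysis; the data is synthetic.

```python
import numpy as np
import pandas as pd

def list_correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (lower triangle, diagonal) fail this comparison and are dropped.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]

# Synthetic data: 'gold' is built to track 'kills' closely, 'wards' is independent.
rng = np.random.default_rng(0)
kills = rng.normal(size=200)
toy = pd.DataFrame({
    "kills": kills,
    "gold": 2 * kills + rng.normal(scale=0.1, size=200),
    "wards": rng.normal(size=200),
})
correlated = list_correlated_pairs(toy)
print(correlated)
```

Only the `kills`/`gold` pair is reported here, so one of the two would be a candidate for the exclusion dictionary.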

In [None]:
def remove_columns(metrics_before_correlation, columns_to_exclude):
    metrics_after_correlation = {}
    
    for role, metrics in metrics_before_correlation.items():
        # Exclude columns specified in columns_to_exclude for the given role
        excluded_columns = columns_to_exclude.get(role, [])
        filtered_metrics = [metric for metric in metrics if metric not in excluded_columns]
        
        # Update the result with the filtered metrics for each role
        metrics_after_correlation[role] = filtered_metrics
    
    return metrics_after_correlation

columns_to_exclude = {
    "adc": ['goldearned_per_minute', 'kills_per_minute'],
    "support": [],
    "mid": ['goldearned_per_minute', 'largestkillingspree'],
    "top": ['largestkillingspree', 'goldearned_per_minute'],
    "jungle": ['killingsprees']
}

# Calling the function to get the filtered metrics
metrics_after_correlation = remove_columns(metrics_before_correlation, columns_to_exclude)

# Output the result
print(metrics_after_correlation)

In [None]:
visualize_correlation_matrix(adc_divided, metrics_after_correlation['adc'], 'ADC')
visualize_correlation_matrix(support_divided, metrics_after_correlation['support'], 'Support')
visualize_correlation_matrix(mid_divided, metrics_after_correlation['mid'], 'Mid')
visualize_correlation_matrix(top_divided, metrics_after_correlation['top'], 'Top')
visualize_correlation_matrix(jungle_divided, metrics_after_correlation['jungle'], 'Jungle')

## Elbow Method, Silhouette Analysis, and Anomaly Detection with Isolation Forest

In [104]:
# the color for plotting KMeans clusters

color_for_plot = {
    0: 'orange',
    1: 'red',
    2: 'blue',
    3: 'green',
    4: 'purple',
    5: 'brown',
    6: 'pink',
    7: 'gray',
}

This function performs **Elbow** and **Silhouette** analyses to determine the optimal number of clusters for a given role. It helps identify the best clustering configuration by evaluating inertia and silhouette scores across different cluster numbers.

Process:
1. **Elbow Plot**:
- Plots the inertia (within-cluster sum of squares) for cluster counts from 2 to 10.
- The "elbow" marks the point where adding more clusters yields only a small further drop in inertia, which suggests a suitable number of clusters.

2. **Silhouette Plot**:
- Measures how close each data point is to its own cluster compared with the nearest other cluster.
- Scores range from -1 to 1; higher values indicate better-separated clusters.
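To make the two criteria concrete, here is a minimal sketch on synthetic data with three well-separated groups (generated with `make_blobs`, not the match data). The range of 2 to 6 clusters and `n_init=10` are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the match data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, labels)

# The silhouette criterion picks the k with the highest average score.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Inertia keeps falling as k grows, so it is read by looking for the bend in the curve, whereas the silhouette score peaks at a specific k. On real data the two criteria can disagree, which is why the notebook plots both.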

In [105]:
def plot_elbow_silhouette(df, features, role):
    """
    Plots elbow and silhouette methods to determine optimal clusters for a role.

    Args:
        df (pandas.DataFrame): The data frame containing the features.
        features (list): List of feature names to be used for clustering.
        role (str): The role for which the analysis is being performed.
    """
    max_clusters = 10
    inertia_values, silhouette_values = [], []
    # generates a sequence of integers from 2 to max_clusters (inclusive) (cluster 2 to 10)
    cluster_range = range(2, max_clusters + 1)

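    # Standardize the features so this analysis uses the same scaling as analyze_outliers_with_kmeans
    X_scaled = StandardScaler().fit_transform(df[features])

    for k in cluster_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(X_scaled)
        inertia_values.append(kmeans.inertia_)
        silhouette_values.append(silhouette_score(X_scaled, labels))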

    # Elbow plot
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.plot(cluster_range, inertia_values, 'bo-')
    plt.title(f'{role.capitalize()} Elbow Plot')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia (within cluster sum of squares)')

    # Silhouette plot
    plt.subplot(1, 2, 2)
    plt.plot(cluster_range, silhouette_values, 'go-')
    plt.title(f'{role.capitalize()} Silhouette Score Plot')
    plt.xlabel('Number of clusters')
    plt.ylabel('Silhouette Score')

    plt.suptitle(f'Elbow and Silhouette Analysis for {role.capitalize()} Role')
    plt.show()
    # plt.savefig(f"./Isolation-forest-getInsight/Elbow-silhouette/{role}_elbow_silhouette_plot.png", dpi=100)


This function performs anomaly detection using the **Isolation Forest** algorithm, identifies normal vs. anomaly data points, and visualizes the results. The analysis is combined with **Elbow** and **Silhouette** methods for clustering insights.

Process:
1. **Data Preprocessing**:
- Drops unnecessary columns and handles missing, infinite, or outlier values.
- Scales the data using **StandardScaler** for consistent metric comparison.
   
2. **Isolation Forest**:
- Detects anomalies in the data by isolating them based on feature distributions.
- Anomalies are marked with `-1`, and normal data is labeled as `1`.

3. **Visualization**:
- **Elbow and Silhouette Plots** for anomaly data to determine optimal clusters.
- **Normal vs Anomaly Plot** visualizes how the normal and anomaly data points distribute across different feature combinations.
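The scaling and Isolation Forest steps can be sketched on synthetic data as follows. The points are randomly generated for illustration; only the `contamination=0.01` and `random_state=42` settings mirror the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic data: 990 ordinary points plus 10 extreme ones (illustrative only).
ordinary = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
extreme = rng.uniform(low=8.0, high=12.0, size=(10, 2))
X = np.vstack([ordinary, extreme])

# Standardize each feature, then fit the forest on the scaled values.
X_scaled = StandardScaler().fit_transform(X)
iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(X_scaled)  # -1 = anomaly, 1 = normal

n_anomalies = int((labels == -1).sum())
print(n_anomalies)
```

With `contamination=0.01` the model flags roughly 1% of the rows, whether or not that many rows are genuinely unusual, so the setting acts as a budget rather than a detected quantity.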

In [106]:
def isolation_forest_plot_elbow_silhouette(role, data_role):
    print(f"Processing role: {role}")
    
    df = data_role.drop(columns=['role_position'], errors='ignore')
    
    # Select features and preprocess
    features = metrics_after_correlation[role]
    X = df[features].apply(pd.to_numeric, errors='coerce')
    # Treat infinite values as missing, then drop incomplete rows
    X = X.replace([np.inf, -np.inf], np.nan).dropna()
    # Cap extreme values to limit their influence on scaling
    X = X.clip(lower=-1e5, upper=1e5)
    df = df.loc[X.index].copy()

    # Scale data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Isolation Forest
    iso_forest = IsolationForest(contamination=0.01, random_state=42)
    df['anomaly'] = iso_forest.fit_predict(X_scaled)

    # Copy the subsets so later column assignments do not raise SettingWithCopyWarning
    anomaly_data = df[df['anomaly'] == -1].copy()
    normal_data = df[df['anomaly'] != -1].copy()

    # Plot elbow and silhouette for anomaly data
    plot_elbow_silhouette(anomaly_data, features, role)

    # Plot normal vs anomaly data
    plot_normal_vs_anomaly(normal_data, anomaly_data, features, role)

    return anomaly_data, normal_data


def plot_normal_vs_anomaly(normal_data, anomaly_data, features, role, alpha_normal=0.2, alpha_anomaly=0.2):
    """
    Plots normal (blue) and anomaly (red) data points for all feature combinations.

    Args:
        normal_data (DataFrame): Data classified as normal.
        anomaly_data (DataFrame): Data classified as anomalies.
        features (list): List of feature columns.
        role (str): Role name for the plot title.
        alpha_normal (float): Transparency level for normal data points (default: 0.2).
        alpha_anomaly (float): Transparency level for anomaly data points (default: 0.2).
    """

    # Generate feature combinations for scatter plots
    feature_combinations = list(itertools.combinations(features, 2))
    n_cols = 4
    n_rows = (len(feature_combinations) + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 3))
    axes = axes.flatten()

    for i, (feature_x, feature_y) in enumerate(feature_combinations):
        ax = axes[i]

        # Plot normal data points (blue)
        ax.scatter(
            normal_data[feature_x], 
            normal_data[feature_y], 
            c='blue', 
            label='Normal', 
            alpha=alpha_normal
        )

        # Plot anomaly data points (red)
        ax.scatter(
            anomaly_data[feature_x], 
            anomaly_data[feature_y], 
            c='red', 
            label='Anomaly', 
            alpha=alpha_anomaly
        )

        # Add labels and title
        ax.set_title(f'{feature_x} vs {feature_y}', fontsize=8)
        ax.set_xlabel(feature_x, fontsize=8)
        ax.set_ylabel(feature_y, fontsize=8)

    # Remove extra subplots if any
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    # Adjust layout
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    fig.suptitle(f'Normal vs Anomaly Plot for {role.capitalize()} Role', fontsize=16)
    plt.legend(loc='upper right')
    plt.show()

In [None]:
(adc_anomaly, adc_normal) = isolation_forest_plot_elbow_silhouette('adc', adc_divided)
(support_anomaly, support_normal) = isolation_forest_plot_elbow_silhouette('support', support_divided)
(mid_anomaly, mid_normal) = isolation_forest_plot_elbow_silhouette('mid', mid_divided)    
(top_anomaly, top_normal) = isolation_forest_plot_elbow_silhouette('top', top_divided)
(jungle_anomaly, jungle_normal) = isolation_forest_plot_elbow_silhouette('jungle', jungle_divided)

## K-Means Clustering for Outlier Analysis

This section utilizes **K-Means clustering** to analyze and visualize outliers within the dataset based on selected features. The results are shown in multiple ways:
1. Cluster analysis across feature combinations.
2. Detailed visualizations of individual clusters.

Key Functions:
1. **`compute_axis_limits`**:
- Computes the axis limits for scatter plots based on feature combinations to ensure consistent visualization across all plots.

2. **`analyze_outliers_with_kmeans`**:
- Applies **K-Means clustering** on the scaled data.
- Adds a `kmeans_cluster` column to the DataFrame for cluster assignments.
- Plots the overall cluster distribution and individual clusters for better insight.

3. **`plot_clusters`**:
- Plots all the clusters in 2D scatter plots for each feature pair.
- Ensures consistent axis limits across plots for clarity.

4. **`plot_each_cluster`**:
- Visualizes each cluster individually, making it easier to analyze data points within specific clusters.
- Provides a detailed view of the data points belonging to a particular cluster.

In [108]:
# Helper function to compute axis limits
def compute_axis_limits(df, features):
    feature_combinations = list(itertools.combinations(features, 2))
    axis_limits = {}
    for feature_x, feature_y in feature_combinations:
        x_min, x_max = df[feature_x].min(), df[feature_x].max()
        y_min, y_max = df[feature_y].min(), df[feature_y].max()
        axis_limits[(feature_x, feature_y)] = (x_min, x_max, y_min, y_max)
    return axis_limits

# Function to analyze outliers with KMeans
def analyze_outliers_with_kmeans(outliers_df, features, role, n_clusters):
    scaler = StandardScaler()
    X_outliers_scaled = scaler.fit_transform(outliers_df[features])

    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    outliers_df['kmeans_cluster'] = kmeans.fit_predict(X_outliers_scaled)
    
    axis_limits = compute_axis_limits(outliers_df, features)
    
    plot_clusters(outliers_df, features, f"{role.capitalize()} KMeans Cluster Analysis with {n_clusters} Clusters", axis_limits)

    for cluster in outliers_df['kmeans_cluster'].unique():
        plot_each_cluster(outliers_df, features, cluster, f"{role.capitalize()} KMeans Cluster Analysis with {n_clusters} Clusters", axis_limits)
        
    return outliers_df

# Function to plot all clusters
def plot_clusters(df, features, title, axis_limits):
    feature_combinations = list(itertools.combinations(features, 2))
    n_cols = 4
    n_rows = (len(feature_combinations) + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 3))
    axes = axes.flatten()

    for i, (feature_x, feature_y) in enumerate(feature_combinations):
        ax = axes[i]
        for cluster in df['kmeans_cluster'].unique():
            cluster_data = df[df['kmeans_cluster'] == cluster]
            cluster_color = color_for_plot.get(cluster, 'gray')  # Default to 'gray' if cluster color is not defined
            ax.scatter(cluster_data[feature_x], cluster_data[feature_y], label=f'Cluster {cluster}', color=cluster_color, alpha=0.3)
        ax.set_title(f'{feature_x} vs {feature_y}', fontsize=8)
        ax.set_xlabel(feature_x, fontsize=8)
        ax.set_ylabel(feature_y, fontsize=8)

        # Apply axis limits
        x_min, x_max, y_min, y_max = axis_limits[(feature_x, feature_y)]
        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)

    # Remove unused subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    fig.suptitle(title, fontsize=16)
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.legend()
    plt.show()

# Function to plot each cluster
def plot_each_cluster(df, features, cluster, title, axis_limits):
    cluster_data = df[df['kmeans_cluster'] == cluster]
    print(f"Cluster {cluster}: {len(cluster_data)} data points")
    feature_combinations = list(itertools.combinations(features, 2))
    n_cols = 4
    n_rows = (len(feature_combinations) + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 3))
    axes = axes.flatten()

    for i, (feature_x, feature_y) in enumerate(feature_combinations):
        ax = axes[i]
        cluster_color = color_for_plot.get(cluster, 'gray')  # Default to 'gray' if cluster color is not defined
        ax.scatter(cluster_data[feature_x], cluster_data[feature_y], color=cluster_color, alpha=0.3)
        ax.set_title(f'{feature_x} vs {feature_y}', fontsize=8)
        ax.set_xlabel(feature_x, fontsize=8)
        ax.set_ylabel(feature_y, fontsize=8)

        # Apply axis limits
        x_min, x_max, y_min, y_max = axis_limits[(feature_x, feature_y)]
        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)

    # Remove unused subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    
    fig.suptitle(f'{title} - Cluster {cluster}', fontsize=16)
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()


In [None]:
# Specify the number of clusters for each role based on our conclusions
adc_cluster_counts = 7
support_cluster_counts = 4
mid_cluster_counts = 4
top_cluster_counts = 5
jungle_cluster_counts = 6

adc_outlier_kmean = analyze_outliers_with_kmeans(adc_anomaly, metrics_after_correlation['adc'], 'adc', adc_cluster_counts)
support_outlier_kmean = analyze_outliers_with_kmeans(support_anomaly, metrics_after_correlation['support'], 'support', support_cluster_counts)
mid_outlier_kmean = analyze_outliers_with_kmeans(mid_anomaly, metrics_after_correlation['mid'], 'mid', mid_cluster_counts)
top_outlier_kmean = analyze_outliers_with_kmeans(top_anomaly, metrics_after_correlation['top'], 'top', top_cluster_counts)
jungle_outlier_kmean = analyze_outliers_with_kmeans(jungle_anomaly, metrics_after_correlation['jungle'], 'jungle', jungle_cluster_counts)

## Z-score Visualization

The `normalize_features` function applies min-max scaling to the selected features, mapping each one to the range [0, 1] so that all features contribute on the same scale.
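As a quick worked example of min-max scaling, with made-up values: each value is shifted by the column minimum and divided by the column range.

```python
import pandas as pd

# Hypothetical values for a single feature.
toy = pd.DataFrame({"kills": [2, 4, 10]})

col = toy["kills"]
# (value - min) / (max - min): the minimum maps to 0 and the maximum to 1.
toy["kills_scaled"] = (col - col.min()) / (col.max() - col.min())
print(toy["kills_scaled"].tolist())
```

The value 4 maps to (4 - 2) / (10 - 2) = 0.25. Note that a column whose values are all identical has a range of zero, so the division is undefined for it.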

In [110]:
def normalize_features(df, feature_cols):
    """
    Normalize features to the range [0, 1].

    Parameters:
        df (pd.DataFrame): The input dataframe.
        feature_cols (list): List of columns to normalize.

    Returns:
        pd.DataFrame: A dataframe with normalized features.
    """
    df_normalized = df.copy()
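    for col in feature_cols:
        min_val = df[col].min()
        max_val = df[col].max()
        value_range = max_val - min_val
        if value_range == 0:
            # Constant column: avoid 0/0 (NaN) and map every value to 0
            df_normalized[col] = 0.0
        else:
            df_normalized[col] = (df[col] - min_val) / value_range
    return df_normalized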

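The `zscore_plot_normalized` function draws a scatter plot of each record's combined normalized score (the sum of its min-max-scaled features), colored by cluster. It also draws threshold lines at the mean plus 2, 2.5, 3, and 3.5 standard deviations of the combined score across the full dataset (`all_df`) and annotates how many records exceed each threshold. This helps visualize the score distribution and spot unusually high-scoring records.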
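The thresholding idea can be sketched on its own. The scores below are randomly generated stand-ins for `normalized_score_combined`; only the list of multipliers is taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic combined scores standing in for 'normalized_score_combined'.
scores = rng.normal(loc=3.0, scale=0.5, size=10_000)

mean = scores.mean()
std = scores.std(ddof=1)  # ddof=1 matches the default of pandas' Series.std()

counts = {}
for k in [2, 2.5, 3, 3.5]:
    cutoff = mean + k * std  # same as requiring a z-score above k
    counts[k] = int((scores > cutoff).sum())
print(counts)
```

Each higher multiplier keeps fewer records, so the four counts show how sensitive the outlier total is to the chosen cutoff.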

In [111]:
def zscore_plot_normalized(df, cluster_col, feature_cols, role, all_df):
    thresholds = [2, 2.5, 3, 3.5]

    all_df_normalized = normalize_features(all_df, feature_cols)
    all_df_normalized['normalized_score_combined'] = all_df_normalized[feature_cols].sum(axis=1)
    all_df = all_df.reset_index(drop=True)
    all_df_normalized = all_df_normalized.reset_index(drop=True)

    # Normalize the selected features
    df_normalized = normalize_features(df, feature_cols)

    # Combine normalized scores by summing across features
    df_normalized['normalized_score_combined'] = df_normalized[feature_cols].sum(axis=1)

    # Ensure the index is a range index for consistent plotting
    df = df.reset_index(drop=True)
    df_normalized = df_normalized.reset_index(drop=True)

    # Scatter plot by cluster
    plt.figure(figsize=(12, 8))
    unique_clusters = df[cluster_col].unique()

    for cluster in unique_clusters:
        cluster_data = df[df[cluster_col] == cluster]
        color = color_for_plot.get(cluster, 'gray')  # Default to gray if cluster color is not defined
        plt.scatter(
            cluster_data.index,
            df_normalized.loc[cluster_data.index, 'normalized_score_combined'],
            label=f'Cluster {cluster}',
            color=color,
            alpha=0.5
        )

    # Process thresholds if provided
    outliers_count = {}
    if thresholds:
        for idx, threshold in enumerate(thresholds):
            threshold_value = all_df_normalized['normalized_score_combined'].mean() + threshold * all_df_normalized['normalized_score_combined'].std()
            
            # Identify outliers
            outliers = all_df_normalized[all_df_normalized['normalized_score_combined'] > threshold_value]
            outliers_count[threshold] = len(outliers)

            # Plot the threshold line
            plt.axhline(threshold_value, color='red', linestyle='dashed', linewidth=1, label=f'Threshold {threshold}')
            plt.text(df.index[-1] * 0.95, threshold_value, f'Threshold {threshold}', color='red', rotation=0, ha='right')

        # Annotate the number of outliers for each threshold
        annotation_text = "\n".join([f"Threshold {threshold}: {count} outliers" for threshold, count in outliers_count.items()])
        plt.text(len(df) * 0.95, max(df_normalized['normalized_score_combined']) * 0.8, annotation_text, fontsize=10,
                 bbox=dict(facecolor='white', alpha=0.7), ha='right', va='top')

        # Print outliers for each threshold
        for threshold, count in outliers_count.items():
            print(f"Outliers at threshold {threshold}: {count} outliers")
            # Outliers are counted on all_df, so the share is taken over all_df as well
            print(f"It is about {count / len(all_df_normalized) * 100:.2f}% of the total data")
    else:
        print("No thresholds provided, skipping outlier calculation.")
    
    # Use the role label as passed in; str.capitalize() would turn 'ADC' into 'Adc'
    plt.title(f'Normalized Metrics Plot by Cluster: {role}')
    plt.xlabel('Record Index')
    plt.ylabel('Normalized Combined Score')
    plt.legend()

    # Uncomment to save the plot
    # plt.savefig(f"action/initial-plot/cross-validation/plot/{role}_normalized_zscore.png", dpi=100)

    plt.show()

In [None]:
zscore_plot_normalized(adc_outlier_kmean, 'kmeans_cluster', metrics_after_correlation['adc'], 'ADC', adc_divided)
zscore_plot_normalized(support_outlier_kmean, 'kmeans_cluster', metrics_after_correlation['support'], 'Support', support_divided)
zscore_plot_normalized(mid_outlier_kmean, 'kmeans_cluster', metrics_after_correlation['mid'], 'Mid', mid_divided)
zscore_plot_normalized(top_outlier_kmean, 'kmeans_cluster', metrics_after_correlation['top'], 'Top', top_divided)
zscore_plot_normalized(jungle_outlier_kmean, 'kmeans_cluster', metrics_after_correlation['jungle'], 'Jungle', jungle_divided)
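
The cutoff logic used above can be illustrated in isolation. This is a minimal sketch on synthetic data, assuming normally distributed scores as a stand-in for `normalized_score_combined`; the real score distributions may look quite different. Note that the comparison is one-sided, so only unusually high scores are counted.

```python
import numpy as np
import pandas as pd

# Synthetic scores standing in for 'normalized_score_combined' (assumption)
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(loc=2.0, scale=0.5, size=10_000))

mean = scores.mean()
std = scores.std()  # pandas uses the sample standard deviation (ddof=1) by default

# Cutoff for each multiplier k: mean + k * std
cutoffs = {k: mean + k * std for k in [2, 2.5, 3, 3.5]}

# Count only records above each cutoff (upper tail)
counts = {k: int((scores > cutoff).sum()) for k, cutoff in cutoffs.items()}

for k, count in counts.items():
    print(f"k={k}: cutoff={cutoffs[k]:.3f}, {count} records ({count / len(scores) * 100:.2f}%)")
```

Raising k can only keep or reduce the number of flagged records, which is why the counts printed by the plotting function shrink from threshold 2 to threshold 3.5.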

The `process_role_data` function plots a histogram of the combined normalized score for the records in `df`. When `thresholds` is given, it draws a vertical line for each multiplier k at the mean plus k standard deviations of the combined score over `all_df`, and reports how many records in `all_df` exceed that cutoff. It returns the combined scores of `df`.

In [None]:
def process_role_data(all_df, df, feature_cols, thresholds=None):
    all_df_normalized = normalize_features(all_df, feature_cols)
    all_df_normalized['normalized_score_combined'] = all_df_normalized[feature_cols].sum(axis=1)
    all_df = all_df.reset_index(drop=True)
    all_df_normalized = all_df_normalized.reset_index(drop=True)
    
    df_normalized = normalize_features(df, feature_cols)
    # Combine normalized scores by summing across features
    df_normalized['normalized_score_combined'] = df_normalized[feature_cols].sum(axis=1)

    # Drop rows with NaN values
    df_normalized = df_normalized.dropna(subset=feature_cols)

    overall_performance = df_normalized['normalized_score_combined']
    
    # Create histogram
    bins = np.linspace(overall_performance.min(), overall_performance.max(), 50)
    counts, edges = np.histogram(overall_performance, bins=bins)

    # Plot the histogram
    plt.figure(figsize=(10, 6))
    # align='edge' starts each bar at its left bin edge (the default centers bars on it)
    plt.bar(edges[:-1], counts, width=np.diff(edges), align='edge', color='skyblue', edgecolor='black')

    # Process thresholds if provided
    outliers_count = {}
    if thresholds:
        for idx, threshold in enumerate(thresholds):
            threshold_value = all_df_normalized['normalized_score_combined'].mean() + threshold * all_df_normalized['normalized_score_combined'].std()
            
            # Identify outliers
            outliers = all_df_normalized[all_df_normalized['normalized_score_combined'] > threshold_value]
            outliers_count[threshold] = len(outliers)

            # Plot the threshold line
            plt.axvline(threshold_value, color='red', linestyle='dashed', linewidth=1, label=f'Threshold {threshold}')
            vertical_offset = 0.9 - idx * 0.1  # Adjust vertical position for each threshold
            plt.text(threshold_value, max(counts) * vertical_offset, f'Threshold {threshold}', color='red', rotation=45, ha='right')

        # Annotate the number of outliers for each threshold
        annotation_text = "\n".join([f"Threshold {threshold}: {count} outliers" for threshold, count in outliers_count.items()])
        plt.text(overall_performance.max() * 0.95, max(counts) * 0.8, annotation_text, fontsize=10,
                 bbox=dict(facecolor='white', alpha=0.7), ha='right', va='top')

        # Print outliers for each threshold
        for threshold, count in outliers_count.items():
            print(f"Outliers at threshold {threshold}: {count} outliers")
            # Outliers are counted on all_df, so the share is taken over all_df as well
            print(f"It is about {count / len(all_df_normalized) * 100:.2f}% of the total data")
    else:
        print("No thresholds provided, skipping outlier calculation.")

    # Add labels and title
    plt.xlabel('Overall Performance Score')
    plt.ylabel('Frequency')
    plt.title(f'Performance Distribution {("(Thresholds: " + str(thresholds) + ")") if thresholds else ""}')
    plt.legend(loc='upper right' if thresholds else None)

    plt.show()

    return overall_performance


# Define thresholds for detecting outliers
thresholds = [2, 2.5, 3, 3.5]

process_role_data(adc_divided, adc_divided, metrics_after_correlation['adc'], thresholds)
process_role_data(support_divided, support_divided, metrics_after_correlation['support'], thresholds)
process_role_data(mid_divided, mid_divided, metrics_after_correlation['mid'], thresholds)
process_role_data(top_divided, top_divided, metrics_after_correlation['top'], thresholds)
process_role_data(jungle_divided, jungle_divided, metrics_after_correlation['jungle'], thresholds)
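
The histogram in `process_role_data` is built from explicit bin edges: `np.linspace(..., 50)` yields 50 edges and therefore 49 bins. The small sketch below, on made-up values, shows how the edges and counts returned by `np.histogram` relate; the input array is purely illustrative.

```python
import numpy as np

# Made-up scores for illustration
values = np.array([0.0, 0.1, 0.2, 0.9, 1.0])

# 5 evenly spaced edges define 4 bins
edges = np.linspace(values.min(), values.max(), 5)
counts, edges = np.histogram(values, bins=edges)

print(edges)   # one more edge than there are bins
print(counts)  # number of values falling in each bin
```

Every bin is half-open on the right except the last one, which also includes its right edge, so the maximum value is counted rather than dropped.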

Plot the combined-score distribution for the records flagged as anomalies by Isolation Forest, with the threshold lines computed from the full dataset of each role.

In [None]:
process_role_data(adc_divided, adc_outlier_kmean, metrics_after_correlation['adc'], thresholds)
process_role_data(support_divided, support_outlier_kmean, metrics_after_correlation['support'], thresholds)
process_role_data(mid_divided, mid_outlier_kmean, metrics_after_correlation['mid'], thresholds)
process_role_data(top_divided, top_outlier_kmean, metrics_after_correlation['top'], thresholds)
process_role_data(jungle_divided, jungle_outlier_kmean, metrics_after_correlation['jungle'], thresholds)

Plot the combined-score distribution for the records classified as normal by Isolation Forest, with the threshold lines computed from the full dataset of each role.

In [None]:
process_role_data(adc_divided, adc_normal, metrics_after_correlation['adc'], thresholds)
process_role_data(support_divided, support_normal, metrics_after_correlation['support'], thresholds)
process_role_data(mid_divided, mid_normal, metrics_after_correlation['mid'], thresholds)
process_role_data(top_divided, top_normal, metrics_after_correlation['top'], thresholds)
process_role_data(jungle_divided, jungle_normal, metrics_after_correlation['jungle'], thresholds)
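
When `df` is a subset of `all_df`, as in the two cells above, `process_role_data` rescales the subset with its own minimum and maximum, while the threshold lines come from the full role dataset. The histogram and the lines are therefore not strictly on the same scale. One possible way to align them is to scale the subset with the reference data's minimum and maximum. The sketch below assumes a hypothetical helper, `normalize_with_reference`, and toy data; it is not part of the notebook.

```python
import pandas as pd

def normalize_with_reference(df, reference_df, feature_cols):
    """Min-max scale df using the min and max of reference_df (hypothetical helper)."""
    out = df.copy()
    for col in feature_cols:
        min_val = reference_df[col].min()
        span = reference_df[col].max() - min_val
        # Guard against a constant reference column (span == 0)
        out[col] = (df[col] - min_val) / span if span != 0 else 0.0
    return out

# Toy data: the subset contains only the high-scoring records
full = pd.DataFrame({"kills": [0, 2, 4, 6, 8, 10]})
subset = full[full["kills"] >= 6]

# Scaling the subset by its own range stretches it over the whole [0, 1] interval
own_scale = (subset["kills"] - subset["kills"].min()) / (subset["kills"].max() - subset["kills"].min())

# Scaling by the full data's range keeps the subset where it sits in the full distribution
ref_scale = normalize_with_reference(subset, full, ["kills"])["kills"]

print(own_scale.tolist())
print(ref_scale.tolist())
```

With its own range the subset spans 0.0 to 1.0, while with the reference range it spans 0.6 to 1.0, which is the position a cutoff derived from the full data would expect.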