# Table of Contents
- [Imports](#Imports)
- [Loading fights dataset](#Loading-fights-dataset)
    - [Data Preprocessing](#Data-Preprocessing)
        - [Dropping redundant features](#Dropping-redundant-features)
        - [Renaming features](#Renaming-features)
- [Loading athlete stats dataset](#Loading-athlete-stats-dataset)
    - [Data Preprocessing](#Data-Preprocessing)
        - [Dropping irrelevant features](#Dropping-irrelevant-features)
        - [Renaming features](#Renaming-features)
        - [Imputing NaNs](#Imputing-NaNs)
        - [Formatting to match the format in the fights dataset](#Formatting-to-match-the-format-in-the-fights-dataset)
        - [Converting from inches to cm](#Converting-from-inches-to-cm)
- [Merging into the final dataset](#Merging-into-the-final-dataset)
    - [Data preprocessing](#Data-preprocessing)
        - [Data Cleaning](#Data-Cleaning)
            - [Imputing NaNs](#Imputing-NaNs)
            - [Handling Duplicates](#Handling-Duplicates)
        - [Processing categorical features](#Processing-categorical-features)
        - [Standardizing](#Standardizing)
            - [Standardizing fraction-based features](#Standardizing-fraction-based-features)
            - [Standardizing percentage-based features](#Standardizing-percentage-based-features)
            - [Standardizing time-based features](#Standardizing-time-based-features)
        - [Dtype converting](#Dtype-converting)
    - [Feature Engineering](#Feature-Engineering)
        - [Winner](#Winner)
        - [Winner_feature/loser_feature](#Winner_feature/loser_feature)
        - [Striking/wrestling dominance](#Striking/wrestling-dominance)
        - [Delta](#Delta)
- [Saving](#Saving)

# Imports

In [1]:
import re

import numpy as np
import pandas as pd

# Loading fights dataset

In [2]:
fights_stats = pd.read_csv("../stats/stats_raw.csv", sep=";")
fights_stats.head()

Unnamed: 0,red_fighter_name,blue_fighter_name,event_date,red_fighter_nickname,blue_fighter_nickname,red_fighter_result,blue_fighter_result,method,round,time,...,red_fighter_sig_str_body_pct,blue_fighter_sig_str_body_pct,red_fighter_sig_str_leg_pct,blue_fighter_sig_str_leg_pct,red_fighter_sig_str_distance_pct,blue_fighter_sig_str_distance_pct,red_fighter_sig_str_clinch_pct,blue_fighter_sig_str_clinch_pct,red_fighter_sig_str_ground_pct,blue_fighter_sig_str_ground_pct
0,ILIA TOPURIA,MAX HOLLOWAY,26/10/2024,El Matador,Blessed,W,L,KO/TKO,3,1:34,...,14,16,20,24,94,100,0,0,5,0
1,ROBERT WHITTAKER,KHAMZAT CHIMAEV,26/10/2024,The Reaper,Borz,L,W,Submission,1,3:34,...,0,33,100,0,100,0,0,0,0,100
2,MAGOMED ANKALAEV,ALEKSANDAR RAKIC,26/10/2024,-,Rocket,W,L,Decision - Unanimous,3,5:00,...,40,16,23,64,90,94,9,5,0,0
3,LERONE MURPHY,DAN IGE,26/10/2024,The Miracle,50K,W,L,Decision - Unanimous,3,5:00,...,23,10,7,13,71,69,23,13,5,17
4,SHARA MAGOMEDOV,ARMEN PETROSYAN,26/10/2024,Bullet,Superman,W,L,KO/TKO,2,4:52,...,44,12,18,58,96,97,3,2,0,0


## Data Preprocessing

### Dropping redundant features

Let's drop features that are redundant and add no value here. For example: <br>
> `x_fighter_sig_str` is redundant because we already have `x_fighter_sig_str_pct`. The former has the form `75 of 144`, while the latter is the same value expressed as a percentage, `52` (%). The percentage version is already scaled and is easier to work with.

In [3]:
fights_stats.loc[:1, ["red_fighter_sig_str", "red_fighter_sig_str_pct"]]

Unnamed: 0,red_fighter_sig_str,red_fighter_sig_str_pct
0,75 of 144,52
1,2 of 2,100
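
A quick way to confirm the redundancy is to recompute the percentage from the `x of y` string. The snippet below is a minimal sketch that uses only the two sample rows shown above; the helper name `ratio_to_pct` is hypothetical and not part of the notebook.

```python
def ratio_to_pct(text):
    """Turn a 'landed of attempted' string into a whole-number percentage."""
    landed, attempted = (int(part) for part in text.split(" of "))
    # Guard against division by zero when nothing was attempted
    return round(landed * 100 / attempted) if attempted else 0


# The two sample rows displayed above: raw string and reported percentage
samples = {"75 of 144": 52, "2 of 2": 100}
for raw, reported in samples.items():
    print(raw, "->", ratio_to_pct(raw), "| reported:", reported)
```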


In [4]:
redundant_cols = ["fighter_sig_str", "fighter_TD"]

# Fighters from both corners
fighters = ("red_", "blue_")

# For both red/blue fighters
cols_to_drop = [f"{fighter}{col}" for col in redundant_cols for fighter in fighters]
cols_to_drop

['red_fighter_sig_str',
 'blue_fighter_sig_str',
 'red_fighter_TD',
 'blue_fighter_TD']

Dropping:

In [5]:
fights_stats.drop(columns=cols_to_drop, inplace=True)

### Renaming features

Let's rename some columns to avoid name conflicts later and to better represent what they mean.
* `sig_str_acc_cols` - Significant Strikes **Accuracy** columns (how much *landed out of the total* thrown)
* `sig_str_tar_cols` - Significant Strikes by **Target** columns (*what ratio of the strikes went to a certain location*: head, body, leg)
* `sig_str_pos_cols` - Significant Strikes by **Position** columns (*what ratio of the strikes were landed from a certain position*: distance, clinch, ground)
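
The naming scheme described above can be sketched on a couple of column names. This is only an illustration: the helper `move_pct_suffix` is hypothetical, and it uses slicing instead of `str.removesuffix` so that it also runs on Python versions older than 3.9.

```python
def move_pct_suffix(col, new_suffix):
    """Replace a trailing '_pct' with new_suffix; leave other names untouched."""
    old = "_pct"
    if col.endswith(old):
        # Slicing works on every Python 3 version, unlike str.removesuffix (3.9+)
        return col[: -len(old)] + new_suffix
    return col


# Target and position columns get the group tag inserted before '_pct'
print(move_pct_suffix("red_fighter_sig_str_head_pct", "_tar_pct"))
print(move_pct_suffix("blue_fighter_sig_str_clinch_pct", "_pos_pct"))
# Accuracy columns simply get '_acc' appended
print("red_fighter_sig_str_head" + "_acc")
```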

Preparing:

In [None]:
# Define feature groups
sig_str_acc_cols = [
    "fighter_sig_str_head",
    "fighter_sig_str_body",
    "fighter_sig_str_leg",
    "fighter_sig_str_distance",
    "fighter_sig_str_clinch",
    "fighter_sig_str_ground",
]
sig_str_tar_cols = [
    "fighter_sig_str_head_pct",
    "fighter_sig_str_body_pct",
    "fighter_sig_str_leg_pct",
]
sig_str_pos_cols = [
    "fighter_sig_str_distance_pct",
    "fighter_sig_str_clinch_pct",
    "fighter_sig_str_ground_pct",
]

# Define fighters' corners
fighters = ("red_", "blue_")

# Postfixes
ACC_POST = "_acc"
TAR_POST = "_tar_pct"
POS_POST = "_pos_pct"

# Define mappings
col_names_mappings = {}

# Start mapping
for fighter in fighters:
    # For significant strikes accuracy features
    for col in sig_str_acc_cols:
        col_names_mappings[f"{fighter}{col}"] = f"{fighter}{col}{ACC_POST}"

    # For significant strikes by target features
    for col in sig_str_tar_cols:
        # Reposition '_pct' suffix to the end (str.removesuffix requires Python 3.9+)
        base = col.removesuffix("_pct")
        col_names_mappings[f"{fighter}{col}"] = f"{fighter}{base}{TAR_POST}"

    # For significant strikes by position features
    for col in sig_str_pos_cols:
        # Reposition '_pct' suffix to the end
        base = col.removesuffix("_pct")
        col_names_mappings[f"{fighter}{col}"] = f"{fighter}{base}{POS_POST}"

Renaming:

In [None]:
fights_stats.rename(columns=col_names_mappings, inplace=True)
fights_stats.columns

There are too many features, and their names are too long, to inspect comfortably. We will shorten the names and halve the number of features later, but first we need to merge in some additional features.

# Loading athlete stats dataset

Let's merge additional athlete-based features from an external dataset, such as `Height`, `Reach`, and `Stance`, along with career statistics such as:
* `SLpM` - `Significant Strikes Landed per Minute`
* `Str_Acc` - `Significant Striking Accuracy`

In [None]:
# External dataset
athlete_stats = pd.read_csv("../external_data/raw_fighter_details.csv", sep=",")
athlete_stats.head(3)

## Data Preprocessing

### Dropping irrelevant features

We leave only the features we are interested in, and drop irrelevant ones:

In [None]:
athlete_stats.drop(columns=["Weight", "DOB"], inplace=True)
athlete_stats.tail()

### Renaming features

We need to rename some columns to differentiate between **athlete-based** and **fight-based** features. <br>
We'll add the `_cs` suffix to the **athlete-based** features; `cs` stands for `Career Statistic`.

Let's prep the columns:

In [None]:
suffix = "_cs"  # _cs is an abbreviation for career statistic
cols_to_rename = [
    "SLpM",
    "Str_Acc",
    "SApM",
    "Str_Def",
    "TD_Avg",
    "TD_Acc",
    "TD_Def",
    "Sub_Avg",
]
name_mappings = {col: f"{col}{suffix}" for col in cols_to_rename}
name_mappings

Rename:

In [None]:
athlete_stats.rename(columns=name_mappings, inplace=True)
athlete_stats.head(3)

### Imputing NaNs

Dropping rows that bring us no information whatsoever (rows where all valuable columns are empty):

In [None]:
cols_of_interest = list(athlete_stats.columns)
cols_of_interest.remove("fighter_name")
cols_of_interest

Are there any?

In [None]:
# Count rows where every column of interest is NaN
all_nans = athlete_stats[cols_of_interest].isna().all(axis=1).sum()
all_nans

There are none, good.
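
Had any such rows existed, they could be removed with `DataFrame.dropna`. The snippet below is a minimal sketch on a made-up toy frame (the names and values are assumptions, not taken from the external dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: fighter "B" has no usable information at all
toy = pd.DataFrame(
    {
        "fighter_name": ["A", "B", "C"],
        "Height": ["5' 11\"", np.nan, "6' 0\""],
        "Reach": ['70"', np.nan, np.nan],
    }
)
value_cols = [col for col in toy.columns if col != "fighter_name"]

# Count rows whose value columns are all NaN, then drop exactly those rows
n_empty = toy[value_cols].isna().all(axis=1).sum()
cleaned = toy.dropna(subset=value_cols, how="all")
print(n_empty, cleaned["fighter_name"].tolist())
```

Fighter "C" is kept because only one of its value columns is missing.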

Imputing the rest of the NaNs with zeros:

In [None]:
print(f"Number of NaN entries before imputing: {athlete_stats.isna().sum().sum()}")

In [None]:
# Impute
athlete_stats.fillna(0, inplace=True)

print(f"Number of NaN entries after imputing: {athlete_stats.isna().sum().sum()}")

### Formatting to match the format in the fights dataset

Convert external dataset's `fighter_name` values to uppercase, and column names to lowercase to match our format:

In [None]:
# Fighter names values to => upper
athlete_stats["fighter_name"] = athlete_stats["fighter_name"].str.upper()

# Column names to => lower
athlete_stats.columns = athlete_stats.columns.str.lower()
athlete_stats.loc[:2, ["height", "reach"]]

### Converting from inches to cm

Defining a function that converts `height` and `reach` features from inches to cm:

In [None]:
def conv_from_inches_to_cm(inches):
    """Converts from inches to cm"""

    # Return non-string input (e.g. imputed zeros) unchanged
    if not isinstance(inches, str):
        return inches

    # Remove the inch mark (") and surrounding whitespace
    inches = inches.replace('"', "").strip()

    # If both feet and inches are given
    if "'" in inches:
        feet, inch = inches.split("'")
        feet = int(feet)
        inch = int(inch) if inch else 0
        return round(feet * 30.48 + inch * 2.54, 2)
    # If only inches are given
    else:
        return round(float(inches) * 2.54, 2)

Applying:

In [None]:
# Convert height
athlete_stats["height"] = athlete_stats["height"].apply(conv_from_inches_to_cm)
# Convert reach
athlete_stats["reach"] = athlete_stats["reach"].apply(conv_from_inches_to_cm)

In [None]:
# Take a look
athlete_stats.loc[:2, ["height", "reach"]]

Looks solid.
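
As a sanity check on the arithmetic (1 foot = 30.48 cm, 1 inch = 2.54 cm), here is a standalone regex-based sketch of the same conversion. The names `FEET_INCHES` and `length_to_cm` are hypothetical, and the sample strings assume the feet-and-inches notation that the function above parses.

```python
import re

# Matches values such as 5' 11" or 6' (inches part optional)
FEET_INCHES = re.compile(r"^\s*(\d+)'\s*(\d*)\"?\s*$")


def length_to_cm(text):
    """Convert strings such as 5' 11" or 70" to centimetres."""
    match = FEET_INCHES.match(text)
    if match:
        feet = int(match.group(1))
        inches = int(match.group(2) or 0)
        return round(feet * 30.48 + inches * 2.54, 2)
    # Otherwise treat the value as plain inches
    return round(float(text.replace('"', "").strip()) * 2.54, 2)


print(length_to_cm("5' 11\""))  # 5 * 30.48 + 11 * 2.54
print(length_to_cm('70"'))  # 70 * 2.54
```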

# Merging into the final dataset

Prepare mappings to map features to red/blue fighters:

In [None]:
red_mappings = {
    col: f"red_fighter_{col}" for col in athlete_stats.columns if "fighter" not in col
}
blue_mappings = {
    col: f"blue_fighter_{col}" for col in athlete_stats.columns if "fighter" not in col
}

Merging:

In [None]:
# Merge reds
stats = pd.merge(
    fights_stats,
    athlete_stats.rename(columns=red_mappings),
    left_on="red_fighter_name",
    right_on="fighter_name",
)
stats.drop(columns="fighter_name", inplace=True)

In [None]:
# Merge blues
stats = pd.merge(
    stats,
    athlete_stats.rename(columns=blue_mappings),
    left_on="blue_fighter_name",
    right_on="fighter_name",
)
stats.drop(columns="fighter_name", inplace=True)

In [None]:
stats.columns

Looks good, let's now work on the merged dataset.
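
One thing worth keeping in mind: `pd.merge` defaults to `how="inner"`, so fights whose fighter is missing from the athlete dataset are dropped. Whether that happens here depends on the data. The sketch below, on made-up toy frames, shows one way to check by using a left join with `indicator=True`:

```python
import pandas as pd

fights = pd.DataFrame({"red_fighter_name": ["A", "B", "Z"], "round": [3, 1, 2]})
athletes = pd.DataFrame({"fighter_name": ["A", "B"], "reach": [175.0, 180.0]})

# Default inner join: the fight with the unknown fighter "Z" disappears
inner = pd.merge(fights, athletes, left_on="red_fighter_name", right_on="fighter_name")

# Left join with an indicator column keeps it and flags the missing match
left = pd.merge(
    fights,
    athletes,
    left_on="red_fighter_name",
    right_on="fighter_name",
    how="left",
    indicator=True,
)
unmatched = left.loc[left["_merge"] == "left_only", "red_fighter_name"].tolist()
print(len(inner), len(left), unmatched)
```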

## Data preprocessing

### Data Cleaning

#### Imputing NaNs

In [None]:
stats.isnull().sum().sum()

Replacing NaN entry fillers with zeros:

In [None]:
stats.isin(["-", "--", "---"]).sum().sum()

In [None]:
stats = stats.replace(["-", "--", "---"], "0")
stats.isin(["-", "--", "---"]).sum().sum()

#### Handling Duplicates

In [None]:
stats.duplicated().sum()

We can see that there are no NaNs or duplicates. Let's move on to processing the categorical features.

### Processing categorical features

Find columns that need to be standardized from categorical dtype to numerical:

In [None]:
def find_obj_cols(df):
    """Searches for columns that are of dtype object,
    contain numbers, and names of cols start with either 'red_' or 'blue_'"""

    cols_to_standardize = []

    for col in df.columns:
        # Use the df argument rather than the global stats
        if df[col].dtype == "object" and (
            col.startswith("red_") or col.startswith("blue_")
        ):
            # Get the first non-nan value, and convert to str to be able to use .isdigit()
            sample_val = df[col].dropna().astype(str).head(1).values[0]
            if len(sample_val) > 0 and any(char.isdigit() for char in sample_val):
                cols_to_standardize.append(col)

    return cols_to_standardize

In [None]:
cols_to_standardize = find_obj_cols(stats)
print(f"Number of categorical features to preprocess: {len(cols_to_standardize)}")

In total we have to standardize 3 types of features:
1. Ratio to pct: 75 of 144 => 52 (%)
2. Dropping pct symbol: 85% => 85 (%)
3. Time: 1:31 => 91 (seconds)

Taking a look:

In [None]:
stats.loc[:2, ["red_fighter_total_str", "red_fighter_td_acc_cs", "red_fighter_ctrl"]]

But let's first group the columns into 3 different buckets for simplicity:
* `of_cols`, for example: `78 of 147`
* `pct_cols`, for example: `45%`
* `time_cols`, for example: `0:45`

In [None]:
def bucket_obj_cols(df, cols):
    """Buckets the provided columns of df into 3 buckets, based on regex patterns."""

    of_cols = []
    pct_cols = []
    time_cols = []

    for col in cols:
        # Use the first non-NaN value as a representative sample
        sample_val = df[col].dropna().astype(str).head(1).values[0]

        if re.search(
            r"\d+\s*of\s*\d+", sample_val
        ):  # If matches 'number of number' schema
            of_cols.append(col)
        elif re.search(r"\d+\s*%", sample_val):  # If matches 'x%' pct schema
            pct_cols.append(col)
        elif re.search(r"\d+\s*:\s*\d+", sample_val):  # If matches 'x:yy' time schema
            time_cols.append(col)

    return of_cols, pct_cols, time_cols

Running it:

In [None]:
of_cols, pct_cols, time_cols = bucket_obj_cols(stats, cols_to_standardize)

Let's take a look:

In [None]:
stats.loc[:2, of_cols[:3]]

In [None]:
stats.loc[:2, pct_cols[:3]]

In [None]:
stats.loc[:2, time_cols]

Let's standardize.

### Standardizing

#### Standardizing fraction-based features

Standardizing fractions into pct % (e.g. from `70 of 140` to `50` (%)):

In [None]:
def convert_ratio_to_pct(row):
    """Converts ratio 'x of y' to percentages 'z'"""

    vals = row.split("of")
    if len(vals) != 2:
        return 0

    made = int(vals[0].strip())
    attempted = int(vals[1].strip())
    if made == 0 or attempted == 0:
        return 0

    return round((made * 100) / attempted, 2)

Applying:

In [None]:
for col in of_cols:
    stats[col] = stats[col].apply(convert_ratio_to_pct)

# Rename
name_mappings = {col: f"{col}_pct" for col in of_cols}
stats.rename(columns=name_mappings, inplace=True)

In [None]:
stats[name_mappings.values()].head(3)

#### Standardizing percentage-based features

Standardizing pct features, dropping `%` symbol (e.g. from `50%` to `50`):

In [None]:
stats[pct_cols] = stats[pct_cols].apply(lambda col: col.str.strip("%"))
stats[pct_cols].head(3)

#### Standardizing time-based features

Standardizing time features from `mm:ss` into the total `seconds` (e.g. from `1:31` **minutes** to `91` **seconds**):

Features to work with:

In [None]:
time_cols

Replace `0`s with `0:00` to match the common format:

In [None]:
stats[time_cols] = stats[time_cols].replace("0", "0:00")

Format time schema to make it have the `hh:mm:ss` shape:

Before:

In [None]:
# Rows to compare before/after on
comp_rows = [i for i in range(40, 43)]
# Before samples
stats.loc[comp_rows, time_cols]

In [None]:
def format_time_schema(row):
    """Formats time schema by making sure it follows hh:mm:ss format"""

    parts = row.split(":")
    if len(parts) == 2:
        minutes, seconds = parts
        # Pad both minutes and seconds to 2 digits
        minutes = minutes.zfill(2)
        seconds = seconds.zfill(2)
        # Add hours prefix
        return f"00:{minutes}:{seconds}"
    return row  # Return the value as is if not in mm:ss format

Applying:

In [None]:
for col in time_cols:
    stats[col] = stats[col].apply(format_time_schema)

After:

In [None]:
pd.DataFrame(stats.loc[comp_rows, time_cols])

Convert to total seconds:

Before:

In [None]:
# Rows to compare before/after on
comp_rows = [i for i in range(0, 2)]
stats.loc[comp_rows, time_cols]

In [None]:
# Convert into total seconds
stats[time_cols] = stats[time_cols].apply(
    lambda col: pd.to_timedelta(col).dt.total_seconds()
)

After:

In [None]:
stats.loc[comp_rows, time_cols]

Looks good.
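
For comparison, the same `m:ss` to seconds conversion can also be done without the intermediate `hh:mm:ss` step. The snippet below is an alternative sketch on made-up sample values, not the approach used above:

```python
import pandas as pd

times = pd.Series(["1:31", "5:00", "0:00"])

# Split 'm:ss' into two integer columns: minutes (0) and seconds (1)
parts = times.str.split(":", expand=True).astype(int)
total_seconds = parts[0] * 60 + parts[1]
print(total_seconds.tolist())
```

This variant assumes every value contains exactly one colon, which is why the bare `0` entries are replaced with `0:00` beforehand.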

### Dtype converting

Convert columns that have the object dtype but hold numerical values, to reduce the pool of features that still need preprocessing:

In [None]:
stats.dtypes.value_counts()

In [None]:
for col in stats.columns:
    try:
        stats[col] = stats[col].astype(float)

    # If not possible (contains strings), we'll handle it in a minute
    except (ValueError, TypeError):
        continue

In [None]:
stats.dtypes.value_counts()

## Feature Engineering

### Winner

Adding a winner feature:

In [None]:
stats.loc[:, "winner"] = stats["red_fighter_result"].apply(
    lambda x: "red" if x == "W" else "blue"
)

In [None]:
stats["winner"].head(3)

Since we now have the `winner` feature, we can drop the `red_fighter_result/blue_fighter_result` features:

In [None]:
stats.drop(columns=['red_fighter_result', 'blue_fighter_result'], inplace=True)

### Winner_feature/loser_feature

Let's change the columns from `red/blue_fighter`+`feature name` to `winner/loser`+`feature name`.

Saving the order of the columns first because it will be distorted:

In [None]:
def rename_condition(col):
    if col.startswith("red_fighter_"):
        return col.replace("red_fighter_", "winner_")
    elif col.startswith("blue_fighter_"):
        return col.replace("blue_fighter_", "loser_")
    return col

In [None]:
# Get the renamed cols order
cols_order = [rename_condition(col) for col in stats.columns]

Function for setting winner & loser:

In [None]:
def set_winner_n_loser(df, winner_col="winner"):
    """Filters what columns to take into account,
    creates new columns, instead of red/blue makes winner/loser,
    gets data points from red/blue column based on
    the value of the feature 'winner' in that same row."""

    df = df.copy()
    cols_to_drop = []

    # Find all red columns that have blue equivalents
    for col in df.columns:
        if col.startswith("red_fighter_"):
            base = col.removeprefix("red_fighter_")
            red_col = f"red_fighter_{base}"
            blue_col = f"blue_fighter_{base}"

            # If blue counterpart exists
            if blue_col in df.columns:
                # Create winner/loser columns
                df[f"winner_{base}"] = np.where(
                    df[winner_col] == "red", df[red_col], df[blue_col]
                )
                df[f"loser_{base}"] = np.where(
                    df[winner_col] == "blue", df[red_col], df[blue_col]
                )
    
                cols_to_drop.extend([red_col, blue_col])

    # Drop the red/blue columns to keep only winner/loser
    df = df.drop(columns=cols_to_drop)
    return df

In [None]:
stats = set_winner_n_loser(stats)
stats.columns[:5]

Setting the previous, correct order up:

In [None]:
stats = stats.loc[:, cols_order]

In [None]:
stats.columns[:5]
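
The core of the swap is a pair of `np.where` calls per feature. Here is a minimal sketch on a made-up two-row frame (the `kd` column name is only an illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "winner": ["red", "blue"],
        "red_fighter_kd": [1.0, 0.0],
        "blue_fighter_kd": [0.0, 2.0],
    }
)
red_won = toy["winner"] == "red"

# Pick the red value where red won, otherwise the blue value (and vice versa)
toy["winner_kd"] = np.where(red_won, toy["red_fighter_kd"], toy["blue_fighter_kd"])
toy["loser_kd"] = np.where(red_won, toy["blue_fighter_kd"], toy["red_fighter_kd"])
print(toy[["winner_kd", "loser_kd"]].values.tolist())
```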

### Striking/wrestling dominance

**Note:** this section needs to be rewritten so that the components are normalized; the cells below are currently disabled.

Let's engineer some additional features:
1. `Striking dominance` - a fighter's overall striking performance. Calculated as: `KD` + `Significant strikes %` + `Total landed strikes %`
2. `Wrestling dominance` - a fighter's overall wrestling performance. Calculated as: `TD %` + `Submission attempts` + `reversals`
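
Because the components live on different scales (counts versus percentages), summing them directly lets the percentage terms dominate. One possible way to normalize them, shown here purely as an assumption and not as the notebook's method, is min-max scaling each component to the 0-1 range before summing. The toy values and the helper `min_max` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "kd": [0.0, 1.0, 2.0],
        "sig_str_pct": [40.0, 50.0, 60.0],
        "total_str_pct": [50.0, 75.0, 100.0],
    }
)


def min_max(col):
    """Rescale a column to [0, 1]; a constant column maps to 0."""
    span = col.max() - col.min()
    return (col - col.min()) / span if span else col * 0.0


# Each component now contributes on the same 0-1 scale
toy["striking_dominance"] = toy.apply(min_max).sum(axis=1)
print(toy["striking_dominance"].tolist())
```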

##### Striking dominance:

In [None]:
"""stats["winner_striking_dominance"] = (
    stats["winner_KD"] + stats["winner_sig_str_pct"] + stats["winner_total_str_pct"]
)
stats["loser_striking_dominance"] = (
    stats["loser_KD"] + stats["loser_sig_str_pct"] + stats["loser_total_str_pct"]
)"""

##### Wrestling dominance:

In [None]:
"""stats["winner_wrestling_dominance"] = (
    stats["winner_TD_pct"] + stats["winner_sub_att"] + stats["winner_rev"]
)
stats["loser_wrestling_dominance"] = (
    stats["loser_TD_pct"] + stats["loser_sub_att"] + stats["loser_rev"]
)"""

Result:

In [None]:
"""stats.iloc[:3, -4:]"""

### Delta

Let's now cut the number of features in half using a *delta*. For example, instead of keeping both `winner_striking_dominance` and `loser_striking_dominance`, we keep only `delta_striking_dominance`, which is simply `winner_striking_dominance` - `loser_striking_dominance`. A positive value means the winner has the higher striking dominance factor, and a negative value means the loser does.
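
On a single feature, the delta idea looks like this (a minimal sketch with made-up values; `reach` is used only as an example feature name):

```python
import pandas as pd

toy = pd.DataFrame({"winner_reach": [180.0, 175.0], "loser_reach": [175.0, 183.0]})

# Positive: the winner had the larger value; negative: the loser did
toy["delta_reach"] = (toy["winner_reach"] - toy["loser_reach"]).round(2)
print(toy["delta_reach"].tolist())
```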

In [None]:
def deltafy_data(df):
    """Filters what columns to process,
    creates new columns, instead of winner/loser makes delta.
    Merges the new delta columns to the df, and drops the previous winner/loser processed columns.
    """

    df = df.copy()
    cols_to_drop = []

    # Get the columns that are numerical and have winner/loser counterparts
    delta_cols = {}

    for col in df.columns:
        if col.startswith("winner_"):
            base = col.removeprefix("winner_")
            winner_col = f"winner_{base}"
            loser_col = f"loser_{base}"

            # If loser counterpart also exists and both are numerical, calculate delta
            if (
                loser_col in df.columns
                and pd.api.types.is_numeric_dtype(df[winner_col])
                and pd.api.types.is_numeric_dtype(df[loser_col])
            ):
                delta_cols[f"delta_{base}"] = np.round(
                    df[winner_col] - df[loser_col], 2
                )

                cols_to_drop.extend([winner_col, loser_col])

    # Delta df
    delta_df = pd.DataFrame(delta_cols)

    # Concat to the original one
    df = pd.concat([df, delta_df], axis=1)

    # Drop the winner/loser columns to keep only delta
    df.drop(columns=cols_to_drop, inplace=True)
    return df

In [None]:
stats = deltafy_data(stats)
stats.columns

In [None]:
stats.head(3)

# Saving

Saving the preprocessed dataset (cleaned, merged, and feature-engineered), ready for EDA:

In [None]:
stats.to_csv("../stats/stats_processed.csv", sep=";", index=False)
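
When reading the file back, the same separator has to be passed to `pd.read_csv`. Writing with `index=False` means no extra index column shows up on reload. A small round-trip sketch on a made-up frame, using an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"method": ["KO/TKO", "Decision - Unanimous"], "delta_reach": [5.0, -8.0]}
)

# Write with the same separator and without the index, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, sep=";", index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored.columns.tolist())
```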