# Table of Contents
- [Importing](#Imports)
- [Loading fights dataset](#Loading-fights-dataset)
    - [Data Preprocessing](#Data-Preprocessing)
        - [Dropping features](#Dropping-features)
        - [Renaming features](#Renaming-features)
- [Loading athlete stats dataset](#Loading-athlete-stats-dataset)
    - [Data Preprocessing](#Data-Preprocessing)
        - [Dropping irrelevant features](#Dropping-irrelevant-features)
        - [Renaming features](#Renaming-features)
        - [Examining NaNs](#Examining-NaNs)
        - [Formatting to match the format in the fights dataset](#Formatting-to-match-the-format-in-the-fights-dataset)
        - [Converting from inches to cm](#Converting-from-inches-to-cm)
- [Merging into the final dataset](#Merging-into-the-final-dataset)
    - [Data preprocessing](#Data-preprocessing)
        - [Data Cleaning](#Data-Cleaning)
            - [Examining NaNs](#Examining-NaNs)
            - [Handling NaNs](#Handling-NaNs)
            - [Handling Duplicates](#Handling-Duplicates)
        - [Processing categorical features](#Processing-categorical-features)
        - [Standardizing](#Standardizing)
            - [Standardizing fraction-based features](#Standardizing-fraction-based-features)
            - [Standardizing percentage-based features](#Standardizing-percentage-based-features)
            - [Standardizing time-based features](#Standardizing-time-based-features)
        - [Dtype converting](#Dtype-converting)
    - [Feature Engineering](#Feature-Engineering)
        - [Winner](#Winner)
        - [Winner_feature/loser_feature](#Winner_feature/loser_feature)
        - [Striking/wrestling dominance](#Striking/wrestling-dominance)
        - [Delta](#Delta)
- [Saving](#Saving)

In [158]:
# TEST
def check_nans(): 
    return stats.isna().sum().sort_values(ascending=False)[:60] 

# Imports

In [159]:
import re

import numpy as np
import pandas as pd

# Loading fights dataset

In [160]:
fights_stats = pd.read_csv("../stats/stats_raw.csv", sep=";")
fights_stats.head()

Unnamed: 0,red_fighter_name,blue_fighter_name,event_date,red_fighter_nickname,blue_fighter_nickname,red_fighter_result,blue_fighter_result,method,round,time,...,red_fighter_sig_str_body_pct,blue_fighter_sig_str_body_pct,red_fighter_sig_str_leg_pct,blue_fighter_sig_str_leg_pct,red_fighter_sig_str_distance_pct,blue_fighter_sig_str_distance_pct,red_fighter_sig_str_clinch_pct,blue_fighter_sig_str_clinch_pct,red_fighter_sig_str_ground_pct,blue_fighter_sig_str_ground_pct
0,ILIA TOPURIA,MAX HOLLOWAY,26/10/2024,El Matador,Blessed,W,L,KO/TKO,3,1:34,...,14,16,20,24,94,100,0,0,5,0
1,ROBERT WHITTAKER,KHAMZAT CHIMAEV,26/10/2024,The Reaper,Borz,L,W,Submission,1,3:34,...,0,33,100,0,100,0,0,0,0,100
2,MAGOMED ANKALAEV,ALEKSANDAR RAKIC,26/10/2024,-,Rocket,W,L,Decision - Unanimous,3,5:00,...,40,16,23,64,90,94,9,5,0,0
3,LERONE MURPHY,DAN IGE,26/10/2024,The Miracle,50K,W,L,Decision - Unanimous,3,5:00,...,23,10,7,13,71,69,23,13,5,17
4,SHARA MAGOMEDOV,ARMEN PETROSYAN,26/10/2024,Bullet,Superman,W,L,KO/TKO,2,4:52,...,44,12,18,58,96,97,3,2,0,0


Taking a look at the num of rows and cols:

In [161]:
fights_stats.shape

(7754, 59)

## Data Preprocessing

### Dropping features

In [162]:
fights_stats.columns

Index(['red_fighter_name', 'blue_fighter_name', 'event_date',
       'red_fighter_nickname', 'blue_fighter_nickname', 'red_fighter_result',
       'blue_fighter_result', 'method', 'round', 'time', 'time_format',
       'referee', 'details', 'bout_type', 'bonus', 'event_name',
       'event_location', 'red_fighter_KD', 'blue_fighter_KD',
       'red_fighter_sig_str', 'blue_fighter_sig_str',
       'red_fighter_sig_str_pct', 'blue_fighter_sig_str_pct',
       'red_fighter_total_str', 'blue_fighter_total_str', 'red_fighter_TD',
       'blue_fighter_TD', 'red_fighter_TD_pct', 'blue_fighter_TD_pct',
       'red_fighter_sub_att', 'blue_fighter_sub_att', 'red_fighter_rev',
       'blue_fighter_rev', 'red_fighter_ctrl', 'blue_fighter_ctrl',
       'red_fighter_sig_str_head', 'blue_fighter_sig_str_head',
       'red_fighter_sig_str_body', 'blue_fighter_sig_str_body',
       'red_fighter_sig_str_leg', 'blue_fighter_sig_str_leg',
       'red_fighter_sig_str_distance', 'blue_fighter_sig_str_distan

Let's drop some **redundant** features that add no value here, for example: <br>
> `x_fighter_sig_str`, which duplicates `x_fighter_sig_str_pct`. The former takes the `75 of 144` form, while the latter holds the same value as a percentage, `52%`. The latter is already scaled and will be easier to work with.

In [163]:
fights_stats.loc[:1, ["red_fighter_sig_str", "red_fighter_sig_str_pct"]]

Unnamed: 0,red_fighter_sig_str,red_fighter_sig_str_pct
0,75 of 144,52
1,2 of 2,100
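
As a quick sanity check that the two columns really carry the same information, the `x of y` form can be turned into a percentage by hand. This is only a minimal sketch: `ratio_to_pct` is a hypothetical helper, not part of the notebook, and it assumes the values always follow the `landed of attempted` pattern shown above.

```python
def ratio_to_pct(value):
    """Turn a 'landed of attempted' string into a whole-number percentage."""
    landed, attempted = (int(part) for part in value.split(" of "))
    # Assumption: no attempts is treated as 0% to avoid division by zero
    if attempted == 0:
        return 0
    return round(100 * landed / attempted)


first_row = ratio_to_pct("75 of 144")
second_row = ratio_to_pct("2 of 2")
print(first_row, second_row)
```

Both results match the `red_fighter_sig_str_pct` values in the two rows above.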


We will also drop some **irrelevant** features that we are not going to work with:
* Nicknames: `red_fighter_nickname`, `blue_fighter_nickname`
* `referee`
* `details`
* `bonus`

In [164]:
redundant_cols = ["fighter_sig_str", "fighter_TD"]
irrelevant_cols = [
    "red_fighter_nickname",
    "blue_fighter_nickname",
    "referee",
    "details",
    "bonus",
]

# Fighters from both corners
fighters = ("red_", "blue_")

# For both red/blue fighters
cols_to_drop = [f"{fighter}{col}" for col in redundant_cols for fighter in fighters]
cols_to_drop.extend(irrelevant_cols)
cols_to_drop

['red_fighter_sig_str',
 'blue_fighter_sig_str',
 'red_fighter_TD',
 'blue_fighter_TD',
 'red_fighter_nickname',
 'blue_fighter_nickname',
 'referee',
 'details',
 'bonus']

Dropping:

In [165]:
fights_stats.drop(columns=cols_to_drop, inplace=True)

### Renaming features

Let's rename some columns to avoid name conflicts later and to better represent what they mean.
* `sig_str_acc_cols` - Significant Strikes **Accuracy** columns (how much *landed out of the total* thrown)
* `sig_str_tar_cols` - Significant Strikes by **Target** columns (*what ratio of the strikes went to a certain location*: head, body, leg)
* `sig_str_pos_cols` - Significant Strikes by **Position** columns (*what ratio of the strikes were landed from a certain position*: distance, clinch, ground)
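
To make the naming scheme above concrete, here is a minimal sketch of what happens to one column from each group. The helper `rename_pct_col` is hypothetical and only used for illustration; it relies on `str.removesuffix`, which requires Python 3.9+.

```python
def rename_pct_col(col, postfix):
    """Strip the trailing '_pct' and append the new postfix instead."""
    return f"{col.removesuffix('_pct')}{postfix}"


# Accuracy columns simply get a postfix appended
accuracy = "red_fighter_sig_str_head" + "_acc"
# Target and position columns have '_pct' moved to the very end
by_target = rename_pct_col("red_fighter_sig_str_head_pct", "_tar_pct")
by_position = rename_pct_col("red_fighter_sig_str_ground_pct", "_pos_pct")

print(accuracy, by_target, by_position, sep="\n")
```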

Preparing:

In [166]:
# Define feature groups
sig_str_acc_cols = [
    "fighter_sig_str_head",
    "fighter_sig_str_body",
    "fighter_sig_str_leg",
    "fighter_sig_str_distance",
    "fighter_sig_str_clinch",
    "fighter_sig_str_ground",
]
sig_str_tar_cols = [
    "fighter_sig_str_head_pct",
    "fighter_sig_str_body_pct",
    "fighter_sig_str_leg_pct",
]
sig_str_pos_cols = [
    "fighter_sig_str_distance_pct",
    "fighter_sig_str_clinch_pct",
    "fighter_sig_str_ground_pct",
]

# Define fighters' corners
fighters = ("red_", "blue_")

# Postfixes
ACC_POST = "_acc"
TAR_POST = "_tar_pct"
POS_POST = "_pos_pct"

# Define mappings
col_names_mappings = {}

# Start mapping
for fighter in fighters:
    # For significant strikes accuracy features
    for col in sig_str_acc_cols:
        col_names_mappings[f"{fighter}{col}"] = f"{fighter}{col}{ACC_POST}"

    # For significant strikes by target features
    for col in sig_str_tar_cols:
        # Reposition '_pct' suffix to the end
        base = col.removesuffix("_pct")
        col_names_mappings[f"{fighter}{col}"] = f"{fighter}{base}{TAR_POST}"

    # For significant strikes by position features
    for col in sig_str_pos_cols:
        # Reposition '_pct' suffix to the end
        base = col.removesuffix("_pct")
        col_names_mappings[f"{fighter}{col}"] = f"{fighter}{base}{POS_POST}"

Renaming:

In [167]:
fights_stats.rename(columns=col_names_mappings, inplace=True)
fights_stats.columns

Index(['red_fighter_name', 'blue_fighter_name', 'event_date',
       'red_fighter_result', 'blue_fighter_result', 'method', 'round', 'time',
       'time_format', 'bout_type', 'event_name', 'event_location',
       'red_fighter_KD', 'blue_fighter_KD', 'red_fighter_sig_str_pct',
       'blue_fighter_sig_str_pct', 'red_fighter_total_str',
       'blue_fighter_total_str', 'red_fighter_TD_pct', 'blue_fighter_TD_pct',
       'red_fighter_sub_att', 'blue_fighter_sub_att', 'red_fighter_rev',
       'blue_fighter_rev', 'red_fighter_ctrl', 'blue_fighter_ctrl',
       'red_fighter_sig_str_head_acc', 'blue_fighter_sig_str_head_acc',
       'red_fighter_sig_str_body_acc', 'blue_fighter_sig_str_body_acc',
       'red_fighter_sig_str_leg_acc', 'blue_fighter_sig_str_leg_acc',
       'red_fighter_sig_str_distance_acc', 'blue_fighter_sig_str_distance_acc',
       'red_fighter_sig_str_clinch_acc', 'blue_fighter_sig_str_clinch_acc',
       'red_fighter_sig_str_ground_acc', 'blue_fighter_sig_str_ground_ac

That is too many long feature names to look at. We will shorten the names and halve the number of features later, but first we need to merge in some additional features :))

# Loading athlete stats dataset

Let's merge in additional athlete-based features from an external dataset, such as `Height`, `Reach` and `Stance`, plus career statistics like: 
* `SLpM`-`Significant Strikes Landed per Minute`
* `Str_Acc`-`Significant Striking Accuracy`

In [168]:
# External dataset
athlete_stats = pd.read_csv("../external_data/raw_fighter_details.csv", sep=",")
athlete_stats.head(3)

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB,SLpM,Str_Acc,SApM,Str_Def,TD_Avg,TD_Acc,TD_Def,Sub_Avg
0,Tom Aaron,,155 lbs.,,,"Jul 13, 1978",0.0,0%,0.0,0%,0.0,0%,0%,0.0
1,Papy Abedi,"5' 11""",185 lbs.,,Southpaw,"Jun 30, 1978",2.8,55%,3.15,48%,3.47,57%,50%,1.3
2,Shamil Abdurakhimov,"6' 3""",235 lbs.,"76""",Orthodox,"Sep 02, 1981",2.45,44%,2.45,58%,1.23,24%,47%,0.2


Taking a look at the num of rows and cols:

In [169]:
athlete_stats.shape

(3596, 14)

## Data Preprocessing

### Dropping irrelevant features

We leave only the features we are interested in, and drop irrelevant ones:

In [170]:
athlete_stats.drop(columns=["Weight", "DOB"], inplace=True)
athlete_stats.tail()

Unnamed: 0,fighter_name,Height,Reach,Stance,SLpM,Str_Acc,SApM,Str_Def,TD_Avg,TD_Acc,TD_Def,Sub_Avg
3591,Zhang Tiequan,"5' 8""","69""",Orthodox,1.23,36%,2.14,51%,1.95,58%,75%,3.4
3592,Alex Zuniga,,,,0.0,0%,0.0,0%,0.0,0%,0%,0.0
3593,George Zuniga,"5' 9""",,,7.64,38%,5.45,37%,0.0,0%,100%,0.0
3594,Allan Zuniga,"5' 7""","70""",Orthodox,3.93,52%,1.8,61%,0.0,0%,57%,1.0
3595,Virgil Zwicker,"6' 2""","74""",,3.34,48%,4.87,39%,1.31,30%,50%,0.0


### Renaming features

We need to rename some columns to differentiate between **athlete-based** and **fight-based** features. <br>
We'll add `_cs` suffix to the **athlete-based** features. `cs` stands for `Career Statistic`.

Let's prep the columns:

In [171]:
suffix = "_cs"  # _cs is an abbreviation for career statistic
cols_to_rename = [
    "SLpM",
    "Str_Acc",
    "SApM",
    "Str_Def",
    "TD_Avg",
    "TD_Acc",
    "TD_Def",
    "Sub_Avg",
]
name_mappings = {col: f"{col}{suffix}" for col in cols_to_rename}
name_mappings

{'SLpM': 'SLpM_cs',
 'Str_Acc': 'Str_Acc_cs',
 'SApM': 'SApM_cs',
 'Str_Def': 'Str_Def_cs',
 'TD_Avg': 'TD_Avg_cs',
 'TD_Acc': 'TD_Acc_cs',
 'TD_Def': 'TD_Def_cs',
 'Sub_Avg': 'Sub_Avg_cs'}

Rename:

In [172]:
athlete_stats.rename(columns=name_mappings, inplace=True)
athlete_stats.head(3)

Unnamed: 0,fighter_name,Height,Reach,Stance,SLpM_cs,Str_Acc_cs,SApM_cs,Str_Def_cs,TD_Avg_cs,TD_Acc_cs,TD_Def_cs,Sub_Avg_cs
0,Tom Aaron,,,,0.0,0%,0.0,0%,0.0,0%,0%,0.0
1,Papy Abedi,"5' 11""",,Southpaw,2.8,55%,3.15,48%,3.47,57%,50%,1.3
2,Shamil Abdurakhimov,"6' 3""","76""",Orthodox,2.45,44%,2.45,58%,1.23,24%,47%,0.2


### Examining NaNs

Checking for rows that bring us no information whatsoever (rows where all valuable columns are empty), so that we can drop them:

In [173]:
cols_of_interest = list(athlete_stats.columns)
cols_of_interest.remove("fighter_name")
cols_of_interest

['Height',
 'Reach',
 'Stance',
 'SLpM_cs',
 'Str_Acc_cs',
 'SApM_cs',
 'Str_Def_cs',
 'TD_Avg_cs',
 'TD_Acc_cs',
 'TD_Def_cs',
 'Sub_Avg_cs']

Are there any?

In [174]:
# Count rows where every column of interest is NaN
all_nans = athlete_stats[cols_of_interest].isna().all(axis=1).sum()
all_nans

0

There are none, good.

How about NaNs?

In [175]:
print(f"Number of NaNs: {athlete_stats.isna().sum().sum()}")

Number of NaNs: 2979


Ok, we got some, let's take a look:

In [176]:
athlete_stats.isna().sum().sort_values(ascending=False)

Reach           1912
Stance           804
Height           263
fighter_name       0
SLpM_cs            0
Str_Acc_cs         0
SApM_cs            0
Str_Def_cs         0
TD_Avg_cs          0
TD_Acc_cs          0
TD_Def_cs          0
Sub_Avg_cs         0
dtype: int64

It wouldn't make sense for these columns to actually be zero (nobody's height is 0), so **we're not going to impute them with 0's**. I also **do not want to drop these rows for now**, as **they contain valuable information in other columns**, and **dropping them would lose it**. Nor do I want to fill the gaps with the majority class for categorical features or the mean for numerical ones, because I want to keep the information as accurate as possible.

So for now, we're **going to keep these as NaNs**. **When we do EDA** on one of these features, **we will drop the NaNs for that feature only**; that way we **will not lose the information** held in the other columns of those rows.

Instead, we will impute with 0's only the features where a 0 actually makes sense. More on this in the Examining NaNs part of the merged dataset.
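
The difference between dropping NaNs globally and dropping them per feature can be sketched on a toy frame. The data below is made up purely for illustration; only the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data: each fighter is missing a different measurement
toy = pd.DataFrame(
    {
        "fighter_name": ["A", "B", "C"],
        "height": [180.3, np.nan, 175.0],
        "reach": [np.nan, 188.0, 178.0],
    }
)

# Dropping every row with any NaN keeps only fully populated rows
complete_rows = toy.dropna()

# Dropping NaNs per feature keeps everything that feature can offer
heights = toy["height"].dropna()
reaches = toy["reach"].dropna()

print(len(complete_rows), len(heights), len(reaches))
```

Dropping globally leaves a single row, while each per-feature view still has two usable values.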

### Formatting to match the format in the fights dataset

Convert external dataset's `fighter_name` values to uppercase, and column names to lowercase to match our format:

In [177]:
# Fighter names values to => upper
athlete_stats["fighter_name"] = athlete_stats["fighter_name"].str.upper()

# Column names to => lower
athlete_stats.columns = athlete_stats.columns.str.lower()
athlete_stats.loc[:9, ["height", "reach"]].dropna()

Unnamed: 0,height,reach
2,"6' 3""","76"""
6,"5' 11""","71"""
8,"6' 0""","74"""


### Converting from inches to cm

Defining a function that converts the `height` (feet and inches) and `reach` (inches) features to cm:

In [178]:
def conv_from_inches_to_cm(inches):
    """Converts a feet-and-inches (5' 11") or inches-only (76") string to cm"""

    # Pass non-string input (e.g. NaN) through untouched
    if not isinstance(inches, str):
        return inches

    # Strip the inch mark and surrounding whitespace
    inches = inches.replace('"', "").strip()

    # If both feet and inches are given
    if "'" in inches:
        feet, inch = inches.split("'")
        feet = int(feet)
        inch = int(inch) if inch else 0
        return round(feet * 30.48 + inch * 2.54, 2)
    # If only inches are given
    else:
        return round(float(inches) * 2.54, 2)

Applying:

In [179]:
# Convert height
athlete_stats["height"] = athlete_stats["height"].apply(conv_from_inches_to_cm)
# Convert reach
athlete_stats["reach"] = athlete_stats["reach"].apply(conv_from_inches_to_cm)

In [180]:
# Take a look
athlete_stats.loc[:10, ["height", "reach"]].dropna()

Unnamed: 0,height,reach
2,190.5,193.04
6,180.34,180.34
8,182.88,187.96


Looks solid.
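
The numbers above can be double-checked by hand: 6' 3" is 6 × 12 + 3 = 75 inches, and 75 × 2.54 = 190.5 cm. Below is a small standalone sketch of the same arithmetic. `length_to_cm` is a hypothetical re-implementation used only for verification, not the function applied to the dataset.

```python
def length_to_cm(text):
    """Convert a feet-and-inches or inches-only string to centimetres."""
    cleaned = text.replace('"', "").strip()
    if "'" in cleaned:
        # Split at the foot mark: everything before is feet, after is inches
        feet, _, inches = cleaned.partition("'")
        total_inches = int(feet) * 12 + int(inches.strip() or 0)
    else:
        total_inches = float(cleaned)
    return round(total_inches * 2.54, 2)


for raw in ["6' 3\"", "76\"", "5' 11\"", "6' 0\""]:
    print(raw, "=>", length_to_cm(raw))
```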

# Merging into the final dataset

Prepare mappings to map features to red/blue fighters:

In [181]:
red_mappings = {
    col: f"red_fighter_{col}" for col in athlete_stats.columns if "fighter" not in col
}
blue_mappings = {
    col: f"blue_fighter_{col}" for col in athlete_stats.columns if "fighter" not in col
}

Merging:

In [182]:
# Merge reds
stats = pd.merge(
    fights_stats,
    athlete_stats.rename(columns=red_mappings),
    left_on="red_fighter_name",
    how="left",
    right_on="fighter_name",
)
stats.drop(columns="fighter_name", inplace=True)

In [183]:
# Merge blues
stats = pd.merge(
    stats,
    athlete_stats.rename(columns=blue_mappings),
    how="left",
    left_on="blue_fighter_name",
    right_on="fighter_name",
)
stats.drop(columns="fighter_name", inplace=True)

Taking a look at the num of rows and cols:

In [185]:
stats.shape

(7754, 72)

Merged columns:

In [186]:
stats.columns

Index(['red_fighter_name', 'blue_fighter_name', 'event_date',
       'red_fighter_result', 'blue_fighter_result', 'method', 'round', 'time',
       'time_format', 'bout_type', 'event_name', 'event_location',
       'red_fighter_KD', 'blue_fighter_KD', 'red_fighter_sig_str_pct',
       'blue_fighter_sig_str_pct', 'red_fighter_total_str',
       'blue_fighter_total_str', 'red_fighter_TD_pct', 'blue_fighter_TD_pct',
       'red_fighter_sub_att', 'blue_fighter_sub_att', 'red_fighter_rev',
       'blue_fighter_rev', 'red_fighter_ctrl', 'blue_fighter_ctrl',
       'red_fighter_sig_str_head_acc', 'blue_fighter_sig_str_head_acc',
       'red_fighter_sig_str_body_acc', 'blue_fighter_sig_str_body_acc',
       'red_fighter_sig_str_leg_acc', 'blue_fighter_sig_str_leg_acc',
       'red_fighter_sig_str_distance_acc', 'blue_fighter_sig_str_distance_acc',
       'red_fighter_sig_str_clinch_acc', 'blue_fighter_sig_str_clinch_acc',
       'red_fighter_sig_str_ground_acc', 'blue_fighter_sig_str_ground_ac

Looks good, let's now work on the merged dataset.

## Data preprocessing

### Data Cleaning

#### Examining NaNs

In [187]:
stats.isna().sum().sum()

14612

Since we did a **left** merge onto `fights_stats`, any fighter missing from `athlete_stats` gets **NaNs** in all of the `athlete_stats` columns.
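
One way to see exactly which fighters had no match is the `indicator` option of `merge`, and `validate` guards against duplicate names silently multiplying rows. The sketch below uses made-up toy frames; it is not how the merge above was run, just an illustration of the mechanism.

```python
import pandas as pd

# Toy stand-ins for the two datasets (names are invented)
fights = pd.DataFrame({"red_fighter_name": ["A ONE", "B TWO", "C THREE"]})
athletes = pd.DataFrame(
    {"fighter_name": ["A ONE", "B TWO"], "height": [180.0, 175.0]}
)

merged = fights.merge(
    athletes,
    how="left",
    left_on="red_fighter_name",
    right_on="fighter_name",
    validate="many_to_one",  # raises if a name appears twice in athletes
    indicator=True,  # adds a '_merge' column describing each row's origin
)

# Rows flagged 'left_only' had no counterpart in the athlete table
unmatched = merged.loc[merged["_merge"] == "left_only", "red_fighter_name"].tolist()
print(unmatched)
```

The unmatched fighter keeps its row, but its `height` is NaN, which is the same effect seen in the merged dataset.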

In [188]:
stats.isna().sum().sort_values(ascending=False)[:25]

blue_fighter_reach                   1628
red_fighter_reach                     997
blue_fighter_stance                   780
blue_fighter_height                   675
blue_fighter_sub_avg_cs               648
blue_fighter_td_def_cs                648
blue_fighter_td_acc_cs                648
blue_fighter_td_avg_cs                648
blue_fighter_str_def_cs               648
blue_fighter_sapm_cs                  648
blue_fighter_str_acc_cs               648
blue_fighter_slpm_cs                  648
red_fighter_stance                    599
red_fighter_height                    541
red_fighter_sapm_cs                   526
red_fighter_slpm_cs                   526
red_fighter_str_acc_cs                526
red_fighter_str_def_cs                526
red_fighter_td_avg_cs                 526
red_fighter_td_acc_cs                 526
red_fighter_td_def_cs                 526
red_fighter_sub_avg_cs                526
blue_fighter_sig_str_leg_tar_pct        0
red_fighter_sig_str_head_tar_pct  

Our dataset also has NaN placeholders like `'-'`, `'--'`, `'---'`. These are essentially **fillers** on ufcstats.com, assigned as a value to an inactive stat, **effectively 0**. So let's examine these:

In [189]:
nan_placeholders = ["-", "--", "---"]
# Mask that returns True/False based on a value matching the set of placeholders
mask = stats.isin(nan_placeholders)

# Get cols where there's at least one True
nan_cols = stats.loc[:, mask.any(axis=0)].columns
nan_cols

Index(['red_fighter_KD', 'blue_fighter_KD', 'red_fighter_sig_str_pct',
       'blue_fighter_sig_str_pct', 'red_fighter_total_str',
       'blue_fighter_total_str', 'red_fighter_TD_pct', 'blue_fighter_TD_pct',
       'red_fighter_sub_att', 'blue_fighter_sub_att', 'red_fighter_rev',
       'blue_fighter_rev', 'red_fighter_ctrl', 'blue_fighter_ctrl',
       'red_fighter_sig_str_head_acc', 'blue_fighter_sig_str_head_acc',
       'red_fighter_sig_str_body_acc', 'blue_fighter_sig_str_body_acc',
       'red_fighter_sig_str_leg_acc', 'blue_fighter_sig_str_leg_acc',
       'red_fighter_sig_str_distance_acc', 'blue_fighter_sig_str_distance_acc',
       'red_fighter_sig_str_clinch_acc', 'blue_fighter_sig_str_clinch_acc',
       'red_fighter_sig_str_ground_acc', 'blue_fighter_sig_str_ground_acc',
       'red_fighter_sig_str_head_tar_pct', 'blue_fighter_sig_str_head_tar_pct',
       'red_fighter_sig_str_body_tar_pct', 'blue_fighter_sig_str_body_tar_pct',
       'red_fighter_sig_str_leg_tar_pct', 'b

Let's take a look at the counts of NaN placeholders in these columns:

In [190]:
# Get the counts of NaN placeholders per NaN columns and sort
nan_counts = (stats[nan_cols].isin(nan_placeholders)).sum().sort_values(ascending=True)
nan_counts

red_fighter_KD                             21
blue_fighter_sig_str_distance_acc          21
red_fighter_sig_str_clinch_acc             21
blue_fighter_sig_str_clinch_acc            21
red_fighter_sig_str_ground_acc             21
blue_fighter_sig_str_ground_acc            21
red_fighter_sig_str_head_tar_pct           21
blue_fighter_sig_str_head_tar_pct          21
red_fighter_sig_str_body_tar_pct           21
blue_fighter_sig_str_body_tar_pct          21
red_fighter_sig_str_leg_tar_pct            21
blue_fighter_sig_str_leg_tar_pct           21
red_fighter_sig_str_distance_pos_pct       21
blue_fighter_sig_str_distance_pos_pct      21
red_fighter_sig_str_clinch_pos_pct         21
blue_fighter_sig_str_clinch_pos_pct        21
red_fighter_sig_str_distance_acc           21
blue_fighter_sig_str_leg_acc               21
red_fighter_sig_str_leg_acc                21
blue_fighter_sig_str_body_acc              21
blue_fighter_KD                            21
red_fighter_total_str             

Ok, let's inspect these rows to have an idea on how we are going to treat these NaNs. <br>
Let's create a function for inspection:

In [191]:
def inspect_nans(df, cols_to_inspect):
    """Builds a boolean mask marking every value in the given columns
    that matches one of the NaN placeholders, then returns all rows of the
    dataframe (with all columns) where at least one of those values matches.
    """

    nan_placeholders = ["-", "--", "---"]

    # Mask that returns True/False based on a value matching the set of placeholders
    mask = df.loc[:, cols_to_inspect].isin(nan_placeholders)
    # Get rows where there's at least one True
    nan_rows = df.loc[mask.any(axis=1), :]

    return nan_rows
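
The only difference between finding affected columns and affected rows is the axis passed to `any`. A minimal sketch on a made-up toy frame (the values are invented, only the placeholder strings match the real data):

```python
import pandas as pd

placeholders = ["-", "--", "---"]
toy = pd.DataFrame(
    {
        "kd": ["1", "---", "0"],
        "ctrl": ["0:45", "--", "1:31"],
        "round": [3, 1, 2],
    }
)

mask = toy.isin(placeholders)

# axis=0 collapses the rows: one True/False per column
cols_with_placeholders = toy.columns[mask.any(axis=0)].tolist()
# axis=1 collapses the columns: one True/False per row
rows_with_placeholders = toy.index[mask.any(axis=1)].tolist()

print(cols_with_placeholders, rows_with_placeholders)
```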

Let's start examining. We'll **examine the first 32 NaN columns**, where **the number of NaN placeholders is the same = 21**. I assume these entries are from the early UFC events (roughly UFC 1-10), where the data wasn't even recorded. But assuming doesn't cut it. Let's find out:

Let's grab these columns first:

In [192]:
# Get a bool mask where the number of NaN placeholders is 21
c = nan_counts == 21
# Keep only the columns where the mask is True
fsttwentyone_nancols = c[c].index.tolist()
# Check the cols, and their count
fsttwentyone_nancols, len(fsttwentyone_nancols)

(['red_fighter_KD',
  'blue_fighter_sig_str_distance_acc',
  'red_fighter_sig_str_clinch_acc',
  'blue_fighter_sig_str_clinch_acc',
  'red_fighter_sig_str_ground_acc',
  'blue_fighter_sig_str_ground_acc',
  'red_fighter_sig_str_head_tar_pct',
  'blue_fighter_sig_str_head_tar_pct',
  'red_fighter_sig_str_body_tar_pct',
  'blue_fighter_sig_str_body_tar_pct',
  'red_fighter_sig_str_leg_tar_pct',
  'blue_fighter_sig_str_leg_tar_pct',
  'red_fighter_sig_str_distance_pos_pct',
  'blue_fighter_sig_str_distance_pos_pct',
  'red_fighter_sig_str_clinch_pos_pct',
  'blue_fighter_sig_str_clinch_pos_pct',
  'red_fighter_sig_str_distance_acc',
  'blue_fighter_sig_str_leg_acc',
  'red_fighter_sig_str_leg_acc',
  'blue_fighter_sig_str_body_acc',
  'blue_fighter_KD',
  'red_fighter_total_str',
  'blue_fighter_total_str',
  'red_fighter_sig_str_ground_pos_pct',
  'blue_fighter_sub_att',
  'red_fighter_rev',
  'red_fighter_sub_att',
  'red_fighter_sig_str_head_acc',
  'blue_fighter_sig_str_head_acc',
  '

Good. Let's inspect now:

In [193]:
fsttwentyone_nanrows = inspect_nans(stats, fsttwentyone_nancols)
fsttwentyone_nanrows.event_date.value_counts()

event_date
07/12/1996    3
13/03/1998    2
07/02/1997    2
20/09/1996    2
16/12/1995    2
08/09/1995    2
14/07/1995    2
16/12/1994    2
16/10/1998    1
15/05/1998    1
12/07/1996    1
16/02/1996    1
Name: count, dtype: int64

Indeed, they're all from the 1990s, which means the entries are just **not filled** rather than actually 0. We will **convert these to NaNs for now** so that we **do not lose information** in other columns, and will **drop these NaNs once we do EDA on these features.**

Now, let's inspect the `red_fighter_sig_str_pct` and `blue_fighter_sig_str_pct` features. My guess is that the placeholders have two possible causes, mostly the first one. A fighter did not land any significant strikes because:
* **They lost almost instantly** in the first round
* **They were completely dominated** and thus unable to land a single significant strike on their opponent.

Let's take a look:

In [194]:
# Columns of interest
sigstr_nancols = ['red_fighter_sig_str_pct', 'blue_fighter_sig_str_pct']
sigstr_nanrows = inspect_nans(stats, sigstr_nancols)
sigstr_nanrows['round'].value_counts()

round
1    104
2      3
Name: count, dtype: int64

Indeed, in 97% of the cases these fights **ended in the first round**, which means **the NaNs** technically **are just 0's** and thus **can be safely replaced with 0's.**
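
The 97% figure follows directly from the counts above, as this short check shows:

```python
# Counts taken from the value_counts() output above
round_counts = {1: 104, 2: 3}

total = sum(round_counts.values())
first_round_share = round_counts[1] / total

print(f"{first_round_share:.1%} of {total} fights ended in round 1")
```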

Let's take a look at `red_fighter_ctrl` and `blue_fighter_ctrl`, which both have the same number of NaN placeholders = 199. First, it is important to know what `ctrl` (`control`) means in the context of the UFC. `Control` is **the amount of time a fighter maintains top (dominant) position in grappling**, effectively "*controlling*" their opponent.

So, my assumption is that these placeholders come from the same rows, where the fighters didn't grapple at all and only used striking. <br>Let's "Sherlock Holmes" this:

In [195]:
ctrl_nancols = ["red_fighter_ctrl", "blue_fighter_ctrl"]
ctrl_nanrows = inspect_nans(stats, ctrl_nancols)
ctrl_nanrows[["red_fighter_sig_str_ground_pos_pct", "blue_fighter_sig_str_ground_pos_pct"]]

Unnamed: 0,red_fighter_sig_str_ground_pos_pct,blue_fighter_sig_str_ground_pos_pct
7546,80,0
7556,72,100
7557,0,0
7558,40,66
7559,4,0
...,...,...
7749,87,0
7750,50,0
7751,81,25
7752,0,0


Ha! I was wrong on this one. It turns out these are also just old fights, where this data wasn't recorded:

In [196]:
ctrl_nanrows['event_date'].value_counts()

event_date
11/03/1994    15
08/09/1995    10
16/12/1994    10
14/07/1995    10
07/12/1996    10
15/05/1998     9
07/02/1997     9
07/04/1995     9
16/12/1995     9
16/02/1996     9
12/07/1996     9
30/05/1997     9
27/07/1997     9
20/09/1996     8
13/03/1998     8
16/10/1998     8
07/05/1999     7
17/05/1996     7
17/10/1997     7
08/01/1999     7
05/03/1999     7
21/12/1997     6
09/09/1994     6
24/09/1999     1
Name: count, dtype: int64

Again, this **data wasn't recorded**, so technically **these are not 0's**: there definitely was some control from either fighter in at least some of these fights. Since the data isn't really 0, we will **keep these as NaNs for now** as well.

Let's take a look at the final group of features: `red_fighter_TD_pct`, `blue_fighter_TD_pct`. These features are the percentage of successful takedowns. My assumption is that they are empty because:
* A fighter just **didn't attempt any takedowns** and used striking exclusively in the fight
* A fighter tried taking their opponent down, **but didn't succeed**, ending up with 0 out of x takedowns, which is represented as "-" on the official ufcstats.com, where the data was scraped from.

Let's take a look:

In [197]:
tdpct_nancols = ["red_fighter_TD_pct", "blue_fighter_TD_pct"]
tdpct_nanrows = inspect_nans(stats, tdpct_nancols)
tdpct_nanrows[["red_fighter_TD_pct", "red_fighter_sig_str_ground_pos_pct", "blue_fighter_TD_pct", "blue_fighter_sig_str_ground_pos_pct"]]

Unnamed: 0,red_fighter_TD_pct,red_fighter_sig_str_ground_pos_pct,blue_fighter_TD_pct,blue_fighter_sig_str_ground_pos_pct
0,100,5,---,0
1,---,0,50,100
2,---,0,0,0
4,---,0,---,0
5,---,0,---,0
...,...,...,...,...
7748,100,100,---,0
7749,---,87,100,0
7751,100,81,---,25
7752,0,0,---,0


It's not that one-sided, but there seem to be a lot of cases that confirm my assumption #1. Let's take a deeper look:

In [198]:
# Rows where red's TD_pct is a placeholder or red recorded no control time
red_mask = tdpct_nanrows["red_fighter_TD_pct"].isin(nan_placeholders) | (
    tdpct_nanrows["red_fighter_ctrl"] == "0:00"
)
tdpct_nanrows.loc[red_mask, ["red_fighter_TD_pct", "red_fighter_ctrl"]].value_counts().head(5)

red_fighter_TD_pct  red_fighter_ctrl
---                 0:00                722
                    0:02                142
0                   0:00                114
---                 0:03                109
                    0:01                 88
Name: count, dtype: int64

In [199]:
# Rows where blue's TD_pct is a placeholder or blue recorded no control time
blue_mask = tdpct_nanrows["blue_fighter_TD_pct"].isin(nan_placeholders) | (
    tdpct_nanrows["blue_fighter_ctrl"] == "0:00"
)
tdpct_nanrows.loc[blue_mask, ["blue_fighter_TD_pct", "blue_fighter_ctrl"]].value_counts().head(5)

blue_fighter_TD_pct  blue_fighter_ctrl
---                  0:00                 1124
0                    0:00                  144
---                  0:02                  124
                     --                    102
                     0:03                   84
Name: count, dtype: int64

Observations:
> The most frequent combinations are those where `red/blue_fighter_TD_pct` = `---`. Cross-checking them against the `red/blue_fighter_ctrl` column, which shows the control time, we can see that in the largest group, with `0:00` control time, there was indeed no grappling action in those fights.
> 
> Some rows show 1-4 seconds of control time, but there the fighter could have been taken down by their opponent and then reversed the position, scoring a few seconds of control.

So it doesn't look like a case where the data exists but wasn't filled in, since the other features confirm the low grappling activity. These features **are technically 0's**, so we will **replace them with 0's**.

**Conclusion**: 
- *Convert* to NaN: `ctrl_nancols`, `fsttwentyone_nancols`
- *Replace* with 0's: `tdpct_nancols`, `sigstr_nancols`
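
The two policies in the conclusion can be sketched on a toy frame before applying them to the real data. The columns `ctrl` and `td_pct` below are simplified, made-up stand-ins for the real column groups.

```python
import numpy as np
import pandas as pd

placeholders = ["-", "--", "---"]
toy = pd.DataFrame(
    {
        "ctrl": ["0:45", "--", "1:31"],  # data was never recorded
        "td_pct": ["50", "---", "0"],  # stat was inactive, i.e. zero
    }
)

# Policy 1: unrecorded data becomes a real NaN
toy["ctrl"] = toy["ctrl"].replace(placeholders, np.nan)
# Policy 2: inactive stats become "0" (still a string at this stage)
toy["td_pct"] = toy["td_pct"].replace(placeholders, "0")

remaining = int(toy.isin(placeholders).sum().sum())
print(toy)
print("placeholders left:", remaining)
```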

#### Handling NaNs

Let's split the columns into two distinct groups, one to convert to NaN, another to replace with zeros:

In [200]:
cols_tonan = fsttwentyone_nancols + ctrl_nancols
cols_tozero = tdpct_nancols + sigstr_nancols

In [201]:
# Num of columns that is going to be changed to NaN and num of cols that will be changed to 0
len(cols_tonan), len(cols_tozero)

(34, 4)

**Replacing NaN placeholders with NaNs:**

Num of NaN placeholders `["-", "--", "---"]` before replacing:

In [202]:
nnan_pls_before = stats.isin(nan_placeholders).sum().sum()
nnan_pls_before

6607

Replace:

In [203]:
stats[cols_tonan] = stats[cols_tonan].replace(nan_placeholders, np.nan)

Num of replaced NaN placeholders:

In [204]:
nnan_pls_after = stats.isin(nan_placeholders).sum().sum()
nnan_pls_before - nnan_pls_after

1070

Num of NaN placeholders now: 

In [205]:
nnan_pls_after

5537

**Replacing NaN placeholders with 0:**

Num of NaN placeholders `["-", "--", "---"]` before replacing:

In [206]:
nnan_pls_before = stats.isin(nan_placeholders).sum().sum()
nnan_pls_before

5537

In [207]:
stats[cols_tozero] = stats[cols_tozero].replace(nan_placeholders, "0")

Num of replaced NaN placeholders:

In [208]:
nnan_pls_after = stats.isin(nan_placeholders).sum().sum()
nnan_pls_before - nnan_pls_after

5537

Num of NaN placeholders now: 

In [209]:
nnan_pls_after

0
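The two replacement strategies above can be illustrated on a toy frame. This is only a minimal sketch: the column names `ctrl` and `td_pct` and the helper `count_placeholders` are hypothetical stand-ins, and only the placeholder list is taken from the cells above.

```python
import numpy as np
import pandas as pd

nan_placeholders = ["-", "--", "---"]  # same placeholder list as above

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame(
    {
        "ctrl": ["0:45", "--", "1:31"],   # value genuinely missing -> NaN
        "td_pct": ["55%", "---", "33%"],  # no attempts were made -> "0"
    }
)


def count_placeholders(df):
    """Count cells that still hold one of the NaN placeholders."""
    return int(df.isin(nan_placeholders).sum().sum())


before = count_placeholders(toy)
toy["ctrl"] = toy["ctrl"].replace(nan_placeholders, np.nan)
toy["td_pct"] = toy["td_pct"].replace(nan_placeholders, "0")
after = count_placeholders(toy)
```

Counting before and after each step, as the notebook does, is a cheap way to confirm that every placeholder ended up in exactly one of the two groups.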

#### Handling Duplicates

In [210]:
stats.duplicated().sum()

0

We can see that there are no duplicates, and the NaN placeholders have been handled. Let's move on to processing the categorical features.

### Processing categorical features

Find columns that need to be standardized from categorical dtype to numerical:

In [211]:
def find_obj_cols(df):
    """Searches for columns that are of dtype object,
    contain numbers, and whose names start with either 'red_' or 'blue_'"""

    cols_to_standardize = []

    for col in df.columns:
        if df[col].dtype == "object" and col.startswith(("red_", "blue_")):
            # Get the first non-NaN value as a str to be able to use .isdigit()
            non_null = df[col].dropna().astype(str)
            if non_null.empty:  # Skip columns that are entirely NaN
                continue
            sample_val = non_null.iloc[0]
            if any(char.isdigit() for char in sample_val):
                cols_to_standardize.append(col)

    return cols_to_standardize

In [212]:
cols_to_standardize = find_obj_cols(stats)
print(f"Number of categorical features to preprocess: {len(cols_to_standardize)}")

Number of categorical features to preprocess: 46


In total we have to standardize 3 types of features:
1. Ratio to pct: 75 of 144 => 52 (%)
2. Dropping pct symbol: 85% => 85 (%)
3. Time: 1:31 => 91 (seconds)

Taking a look:

In [213]:
stats.loc[:2, ["red_fighter_total_str", "red_fighter_td_acc_cs", "red_fighter_ctrl"]]

Unnamed: 0,red_fighter_total_str,red_fighter_td_acc_cs,red_fighter_ctrl
0,78 of 147,55%,0:45
1,2 of 2,27%,0:00
2,75 of 142,33%,1:31


But let's first group the columns into 3 different buckets for simplicity:
* `of_cols`, for example: `78 of 147`
* `pct_cols`, for example: `45%`
* `time_cols`, for example: `0:45`

In [214]:
def bucket_obj_cols(cols):
    """Buckets the provided columns into 3 buckets, based on regex patterns."""

    of_cols = []
    pct_cols = []
    time_cols = []

    for col in cols:
        sample_val = stats[col].dropna().astype(str).head(1).values[0]

        if re.search(
            r"\d+\s*of\s*\d+", sample_val
        ):  # If matches 'number of number' schema
            of_cols.append(col)
        elif re.search(r"\d+\s*%", sample_val):  # If matches 'x%' pct schema
            pct_cols.append(col)
        elif re.search(r"\d+\s*:\s*\d+", sample_val):  # If matches 'x:yy' time schema
            time_cols.append(col)

    return of_cols, pct_cols, time_cols
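The function above decides the bucket from a single sample value per column. As a hedged alternative, the sketch below checks every non-null value with `Series.str.fullmatch`, so a column that mixes formats is reported as unclassified instead of being bucketed by its first value. The names `PATTERNS` and `classify_column` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical pattern table mirroring the three schemas above
PATTERNS = {
    "of": r"\s*\d+\s*of\s*\d+\s*",
    "pct": r"\s*\d+\s*%\s*",
    "time": r"\s*\d+\s*:\s*\d+\s*",
}


def classify_column(s):
    """Return 'of', 'pct' or 'time' if ALL non-null values match, else None."""
    values = s.dropna().astype(str)
    if values.empty:
        return None
    for name, pattern in PATTERNS.items():
        if values.str.fullmatch(pattern).all():
            return name
    return None


of_kind = classify_column(pd.Series(["78 of 147", None, "2 of 2"]))
pct_kind = classify_column(pd.Series(["45%", "40%"]))
time_kind = classify_column(pd.Series(["0:45", "3:20"]))
mixed_kind = classify_column(pd.Series(["0:45", "Orthodox"]))
```

The stricter check costs a full pass over each column, but it surfaces stray values before the conversion functions hit them.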

Running it:

In [215]:
of_cols, pct_cols, time_cols = bucket_obj_cols(cols_to_standardize)

Let's take a look:

In [216]:
stats.loc[:2, of_cols[:3]]

Unnamed: 0,red_fighter_total_str,blue_fighter_total_str,red_fighter_sig_str_head_acc
0,78 of 147,84 of 209,49 of 114
1,2 of 2,25 of 31,0 of 0
2,75 of 142,59 of 123,20 of 72


In [217]:
stats.loc[:2, pct_cols[:3]]

Unnamed: 0,red_fighter_str_acc_cs,red_fighter_str_def_cs,red_fighter_td_acc_cs
0,45%,68%,55%
1,40%,60%,27%
2,53%,65%,33%


In [218]:
stats.loc[:2, time_cols]

Unnamed: 0,red_fighter_ctrl,blue_fighter_ctrl
0,0:45,0:00
1,0:00,3:20
2,1:31,1:00


Let's standardize.

### Standardizing

#### Standardizing fraction-based features

Standardizing fractions into pct % (e.g. from `70 of 140` to `50` (%)):

In [219]:
def convert_ratio_to_pct(row):
    """Converts ratio 'x of y' to percentages 'z'"""

    # If NaN, just return NaN
    if pd.isna(row):
        return row

    vals = row.split("of")
    if len(vals) != 2:
        return 0

    made = int(vals[0].strip())
    attempted = int(vals[1].strip())
    if made == 0 or attempted == 0:
        return 0

    return round((made * 100) / attempted, 2)

Applying:

In [220]:
for col in of_cols:
    stats[col] = stats[col].apply(convert_ratio_to_pct)

# Rename
name_mappings = {col: f"{col}_pct" for col in of_cols}
stats.rename(columns=name_mappings, inplace=True)

In [221]:
stats[name_mappings.values()].head(3)

Unnamed: 0,red_fighter_total_str_pct,blue_fighter_total_str_pct,red_fighter_sig_str_head_acc_pct,blue_fighter_sig_str_head_acc_pct,red_fighter_sig_str_body_acc_pct,blue_fighter_sig_str_body_acc_pct,red_fighter_sig_str_leg_acc_pct,blue_fighter_sig_str_leg_acc_pct,red_fighter_sig_str_distance_acc_pct,blue_fighter_sig_str_distance_acc_pct,red_fighter_sig_str_clinch_acc_pct,blue_fighter_sig_str_clinch_acc_pct,red_fighter_sig_str_ground_acc_pct,blue_fighter_sig_str_ground_acc_pct
0,53.06,40.19,42.98,31.13,78.57,43.33,93.75,82.61,50.71,38.73,0.0,0.0,100.0,0.0
1,100.0,80.65,0.0,66.67,0.0,100.0,100.0,0.0,100.0,0.0,0.0,0.0,0.0,100.0
2,52.82,47.97,27.78,15.38,75.86,75.0,100.0,85.0,47.62,45.05,55.56,50.0,0.0,0.0
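For larger frames, the same conversion can be done column-wise instead of value by value. The sketch below is one possible vectorized variant, assuming the values follow the `x of y` schema shown above; `ratio_series_to_pct` is a hypothetical name. Unlike `convert_ratio_to_pct`, it returns NaN rather than 0 for strings that do not match the schema.

```python
import numpy as np
import pandas as pd


def ratio_series_to_pct(s):
    """Convert a Series of 'x of y' strings to percentages (0-100)."""
    parts = s.str.extract(r"(\d+)\s*of\s*(\d+)").astype(float)
    made, attempted = parts[0], parts[1]
    # Hide zero denominators so the division never divides by zero
    pct = (made * 100 / attempted.where(attempted != 0)).round(2)
    # 'x of 0' means nothing was attempted, so report 0 as the function above does
    return pct.mask(attempted == 0, 0.0)


sample = pd.Series(["70 of 140", "0 of 0", np.nan, "2 of 2"])
result = ratio_series_to_pct(sample)
```

NaN inputs stay NaN, which matches the early return in the row-wise function.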


#### Standardizing percentage-based features

Standardizing pct features, dropping `%` symbol (e.g. from `50%` to `50`):

In [222]:
stats[pct_cols] = stats[pct_cols].apply(lambda col: col.str.strip("%"))
stats[pct_cols].head(3)

Unnamed: 0,red_fighter_str_acc_cs,red_fighter_str_def_cs,red_fighter_td_acc_cs,red_fighter_td_def_cs,blue_fighter_str_acc_cs,blue_fighter_str_def_cs,blue_fighter_td_acc_cs,blue_fighter_td_def_cs
0,45,68,55,100,46,60,71,84
1,40,60,27,84,72,66,75,0
2,53,65,33,85,52,52,25,90


#### Standardizing time-based features

Standardizing time features from `mm:ss` into the total `seconds` (e.g. from `1:31` **minutes** to `91` **seconds**):

Features to work with:

In [223]:
time_cols

['red_fighter_ctrl', 'blue_fighter_ctrl']

Replace `0`s to match the common format `0:00`:

In [224]:
stats[time_cols] = stats[time_cols].replace("0", "0:00")

Format time schema to make it have the `hh:mm:ss` shape:

Before:

In [225]:
# Rows to compare before/after on
comp_rows = list(range(40, 43))
# Before samples
stats.loc[comp_rows, time_cols]

Unnamed: 0,red_fighter_ctrl,blue_fighter_ctrl
40,0:00,7:54
41,0:03,4:13
42,0:11,4:40


In [226]:
def format_time_schema(row):
    """Formats time schema by making sure it follows hh:mm:ss format"""

    # If it is NaN, just return NaN
    if pd.isna(row):
        return row

    parts = row.split(":")
    if len(parts) == 2:
        minutes, seconds = parts
        # Zero-pad both minutes and seconds to 2 digits
        minutes = minutes.zfill(2)
        seconds = seconds.zfill(2)
        # Add hours prefix
        return f"00:{minutes}:{seconds}"
    return row  # Return as is if not in mm:ss format

Applying:

In [227]:
for col in time_cols:
    stats[col] = stats[col].apply(format_time_schema)

After:

In [228]:
pd.DataFrame(stats.loc[comp_rows, time_cols])

Unnamed: 0,red_fighter_ctrl,blue_fighter_ctrl
40,00:00:00,00:07:54
41,00:00:03,00:04:13
42,00:00:11,00:04:40


Convert to total seconds:

Before:

In [229]:
# Rows to compare before/after on
comp_rows = [i for i in range(0, 2)]
stats.loc[comp_rows, time_cols]

Unnamed: 0,red_fighter_ctrl,blue_fighter_ctrl
0,00:00:45,00:00:00
1,00:00:00,00:03:20


In [230]:
# Convert into total seconds
stats[time_cols] = stats[time_cols].apply(
    lambda col: pd.to_timedelta(col).dt.total_seconds()
)

After:

In [231]:
stats.loc[comp_rows, time_cols]

Unnamed: 0,red_fighter_ctrl,blue_fighter_ctrl
0,45.0,0.0
1,0.0,200.0


Looks good.
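The two-step route above (pad to `hh:mm:ss`, then `pd.to_timedelta`) can also be collapsed into one arithmetic step. The sketch below assumes the values only ever take the `m:ss` shape seen in the samples; `mmss_to_seconds` is a hypothetical helper, and values that do not match the pattern become NaN.

```python
import numpy as np
import pandas as pd


def mmss_to_seconds(s):
    """Convert 'm:ss' strings to total seconds; non-matching values become NaN."""
    parts = s.str.extract(r"^\s*(\d+)\s*:\s*(\d+)\s*$").astype(float)
    return parts[0] * 60 + parts[1]


ctrl = pd.Series(["0:45", "3:20", np.nan, "1:31"])
seconds = mmss_to_seconds(ctrl)
```

For example, `1:31` becomes 1 * 60 + 31 = 91 seconds, in line with the conversion shown above.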

In [232]:
# TEST
check_nans()

blue_fighter_reach                       1628
red_fighter_reach                         997
blue_fighter_stance                       780
blue_fighter_height                       675
blue_fighter_sub_avg_cs                   648
blue_fighter_td_def_cs                    648
blue_fighter_td_acc_cs                    648
blue_fighter_td_avg_cs                    648
blue_fighter_str_def_cs                   648
blue_fighter_sapm_cs                      648
blue_fighter_str_acc_cs                   648
blue_fighter_slpm_cs                      648
red_fighter_stance                        599
red_fighter_height                        541
red_fighter_sapm_cs                       526
red_fighter_slpm_cs                       526
red_fighter_str_acc_cs                    526
red_fighter_str_def_cs                    526
red_fighter_td_avg_cs                     526
red_fighter_td_acc_cs                     526
red_fighter_td_def_cs                     526
red_fighter_sub_avg_cs            

### Dtype converting

Convert columns that have object dtype but contain numerical values, to reduce the pool of features that still need to be preprocessed:

In [233]:
stats.dtypes.value_counts()

object     43
float64    28
int64       1
Name: count, dtype: int64

In [234]:
for col in stats.columns:
    try:
        stats[col] = pd.to_numeric(stats[col], downcast="float")

    # If not possible (contains strings), we'll handle it in a minute
    except (ValueError, TypeError):
        continue

stats.dtypes.value_counts()

float32    59
object     13
Name: count, dtype: int64
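The same "convert what can be converted" idea can be written without exception handling. This is a sketch under assumptions, not the notebook's method: `to_numeric_where_possible` is a hypothetical helper that coerces each column with `errors="coerce"` and keeps the result only if coercion introduced no new NaNs.

```python
import pandas as pd


def to_numeric_where_possible(df):
    """Cast columns to float32 when every non-null value parses as a number."""
    out = df.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        # New NaNs would mean some value was not numeric: leave the column as is
        if converted.isna().sum() == out[col].isna().sum():
            out[col] = converted.astype("float32")
    return out


# Hypothetical toy data
toy = pd.DataFrame(
    {
        "a": ["1", "2", None],  # numeric strings with a missing value
        "b": ["x", "1", "2"],   # contains a real string
        "c": [1, 2, 3],         # already numeric
    }
)
converted_toy = to_numeric_where_possible(toy)
```

Comparing NaN counts makes the skip condition explicit, whereas a `try`/`except` can hide unrelated errors.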

In [235]:
# TEST
check_nans()

blue_fighter_reach                       1628
red_fighter_reach                         997
blue_fighter_stance                       780
blue_fighter_height                       675
blue_fighter_sub_avg_cs                   648
blue_fighter_td_def_cs                    648
blue_fighter_td_acc_cs                    648
blue_fighter_td_avg_cs                    648
blue_fighter_str_def_cs                   648
blue_fighter_sapm_cs                      648
blue_fighter_str_acc_cs                   648
blue_fighter_slpm_cs                      648
red_fighter_stance                        599
red_fighter_height                        541
red_fighter_sapm_cs                       526
red_fighter_slpm_cs                       526
red_fighter_str_acc_cs                    526
red_fighter_str_def_cs                    526
red_fighter_td_avg_cs                     526
red_fighter_td_acc_cs                     526
red_fighter_td_def_cs                     526
red_fighter_sub_avg_cs            

## Feature Engineering

### Winner

Adding a winner feature:

In [236]:
stats.loc[:, "winner"] = stats["red_fighter_result"].apply(
    lambda x: "red" if x == "W" else "blue"
)

In [237]:
stats["winner"].head(3)

0     red
1    blue
2     red
Name: winner, dtype: object

Since we now have the `winner` feature, we can drop the `red_fighter_result/blue_fighter_result` features:

In [238]:
stats.drop(columns=["red_fighter_result", "blue_fighter_result"], inplace=True)

### Winner_feature/loser_feature

Let's change the columns from `red/blue_fighter`+`feature name` to `winner/loser`+`feature name`.

First, save the (renamed) column order, because the conversion below will change it:

In [239]:
def rename_condition(col):
    if col.startswith("red_fighter_"):
        return col.replace("red_fighter_", "winner_")
    elif col.startswith("blue_fighter_"):
        return col.replace("blue_fighter_", "loser_")
    return col

In [240]:
# Get the renamed cols order
cols_order = [rename_condition(col) for col in stats.columns]

Function for setting winner & loser:

In [241]:
def set_winner_n_loser(df, winner_col="winner"):
    """Filters what columns to take into account,
    creates new columns, instead of red/blue makes winner/loser,
    gets data points from red/blue column based on
    the value of the feature 'winner' in that same row."""

    df = df.copy()
    cols_to_drop = []

    # Find all red columns that have blue equivalents
    for col in df.columns:
        if col.startswith("red_fighter_"):
            base = col.removeprefix("red_fighter_")
            red_col = f"red_fighter_{base}"
            blue_col = f"blue_fighter_{base}"

            # If blue counterpart exists
            if blue_col in df.columns:
                # Create winner/loser columns
                df[f"winner_{base}"] = np.where(
                    df[winner_col] == "red", df[red_col], df[blue_col]
                )
                df[f"loser_{base}"] = np.where(
                    df[winner_col] == "blue", df[red_col], df[blue_col]
                )

                cols_to_drop.extend([red_col, blue_col])

    # Drop the red/blue columns to keep only winner/loser
    df = df.drop(columns=cols_to_drop)
    return df

In [242]:
stats = set_winner_n_loser(stats)
stats.columns[:5]

Index(['event_date', 'method', 'round', 'time', 'time_format'], dtype='object')

Restoring the previous, correct order:

In [243]:
stats = stats.loc[:, cols_order]

In [244]:
stats.columns[:5]

Index(['winner_name', 'loser_name', 'event_date', 'method', 'round'], dtype='object')
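To make the red/blue to winner/loser swap concrete, here is a minimal sketch on a single pair of columns. The toy column `kd` and its values are made up for illustration, and the sketch assumes `winner` only ever holds `"red"` or `"blue"`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: red wins the first fight, blue the second
toy = pd.DataFrame(
    {
        "red_fighter_kd": [1, 0],
        "blue_fighter_kd": [0, 2],
        "winner": ["red", "blue"],
    }
)

is_red = toy["winner"] == "red"
# The winner takes the red value when red won, otherwise the blue value
toy["winner_kd"] = np.where(is_red, toy["red_fighter_kd"], toy["blue_fighter_kd"])
# The loser takes whichever value the winner did not take
toy["loser_kd"] = np.where(is_red, toy["blue_fighter_kd"], toy["red_fighter_kd"])
```

In the second row blue won, so `winner_kd` picks up blue's 2 and `loser_kd` picks up red's 0.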

In [245]:
# TEST
check_nans()

loser_reach                        1600
winner_reach                       1025
winner_stance                       694
loser_stance                        685
winner_height                       630
winner_sapm_cs                      614
winner_td_acc_cs                    614
winner_sub_avg_cs                   614
winner_slpm_cs                      614
winner_str_acc_cs                   614
winner_td_avg_cs                    614
winner_str_def_cs                   614
winner_td_def_cs                    614
loser_height                        586
loser_str_acc_cs                    560
loser_slpm_cs                       560
loser_str_def_cs                    560
loser_td_avg_cs                     560
loser_sub_avg_cs                    560
loser_td_acc_cs                     560
loser_td_def_cs                     560
loser_sapm_cs                       560
loser_ctrl                          199
winner_ctrl                         199
loser_sig_str_ground_pos_pct         21


### Delta

Let's now cut the number of paired features in half by using *deltas*. For example, instead of having both `winner_striking_dominance` and `loser_striking_dominance`, we'll have a single `delta_striking_dominance`, computed as `winner_striking_dominance` - `loser_striking_dominance`. A positive value means that the winner has the higher striking dominance factor, and a negative value means that the loser does.
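A small worked example of the delta idea, using made-up reach values (the numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last winner has no recorded reach
toy = pd.DataFrame(
    {
        "winner_reach": [180.0, 170.0, np.nan],
        "loser_reach": [175.0, 178.0, 183.0],
    }
)

# Positive delta: the winner has the longer reach; negative: the loser does
toy["delta_reach"] = (toy["winner_reach"] - toy["loser_reach"]).round(2)
```

Note that a delta is NaN whenever either side is NaN, so a delta column can contain more NaNs than either of the two columns it was built from.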

In [347]:
def deltafy_data(df):
    """Filters what columns to process,
    creates new columns, instead of winner/loser makes delta.
    Merges the new delta columns to the df, and drops the previous winner/loser columns.
    """

    df = df.copy()
    cols_to_drop = []

    # Get the columns that are numerical and have winner/loser counterparts
    delta_cols = {}

    for col in df.columns:
        if col.startswith("winner_"):
            base = col.removeprefix("winner_")
            winner_col = f"winner_{base}"
            loser_col = f"loser_{base}"

            # If loser counterpart also exists, and col dtype is numerical => calculate delta
            if loser_col in df.columns and df[col].dtype in (np.float32, int):
                delta_cols[f"delta_{base}"] = np.round(
                    df[winner_col] - df[loser_col], 2
                )

                cols_to_drop.extend([winner_col, loser_col])

    # Delta df
    delta_df = pd.DataFrame(delta_cols)

    # Concat to the original one
    df = pd.concat([df, delta_df], axis=1)

    # Drop the winner/loser columns to keep only delta
    df.drop(columns=cols_to_drop, inplace=True)
    return df

In [348]:
stats = deltafy_data(stats)
stats.columns

Index(['winner_name', 'loser_name', 'event_date', 'method', 'round', 'time',
       'time_format', 'bout_type', 'event_name', 'event_location',
       'winner_stance', 'loser_stance', 'winner', 'delta_KD',
       'delta_sig_str_pct', 'delta_total_str_pct', 'delta_TD_pct',
       'delta_sub_att', 'delta_rev', 'delta_ctrl',
       'delta_sig_str_head_acc_pct', 'delta_sig_str_body_acc_pct',
       'delta_sig_str_leg_acc_pct', 'delta_sig_str_distance_acc_pct',
       'delta_sig_str_clinch_acc_pct', 'delta_sig_str_ground_acc_pct',
       'delta_sig_str_head_tar_pct', 'delta_sig_str_body_tar_pct',
       'delta_sig_str_leg_tar_pct', 'delta_sig_str_distance_pos_pct',
       'delta_sig_str_clinch_pos_pct', 'delta_sig_str_ground_pos_pct',
       'delta_height', 'delta_reach', 'delta_slpm_cs', 'delta_str_acc_cs',
       'delta_sapm_cs', 'delta_str_def_cs', 'delta_td_avg_cs',
       'delta_td_acc_cs', 'delta_td_def_cs', 'delta_sub_avg_cs'],
      dtype='object')

In [349]:
check_nans()

delta_reach                       2049
delta_height                       941
delta_sub_avg_cs                   905
delta_td_def_cs                    905
delta_td_acc_cs                    905
delta_td_avg_cs                    905
delta_str_def_cs                   905
delta_sapm_cs                      905
delta_str_acc_cs                   905
delta_slpm_cs                      905
winner_stance                      694
loser_stance                       685
delta_ctrl                         199
delta_sig_str_distance_acc_pct      21
delta_sig_str_ground_pos_pct        21
delta_sig_str_clinch_pos_pct        21
delta_sig_str_distance_pos_pct      21
delta_sig_str_leg_tar_pct           21
delta_sig_str_body_tar_pct          21
delta_sig_str_head_tar_pct          21
delta_sig_str_ground_acc_pct        21
delta_sig_str_clinch_acc_pct        21
delta_sig_str_body_acc_pct          21
delta_sig_str_leg_acc_pct           21
delta_sig_str_head_acc_pct          21
delta_rev                

In [350]:
stats.head(3)

Unnamed: 0,winner_name,loser_name,event_date,method,round,time,time_format,bout_type,event_name,event_location,...,delta_height,delta_reach,delta_slpm_cs,delta_str_acc_cs,delta_sapm_cs,delta_str_def_cs,delta_td_avg_cs,delta_td_acc_cs,delta_td_def_cs,delta_sub_avg_cs
0,ILIA TOPURIA,MAX HOLLOWAY,26/10/2024,KO/TKO,3.0,1:34,5 Rnd (5-5-5-5-5),UFC Featherweight Title Bout,UFC 308: Topuria vs. Holloway,"Abu Dhabi, Abu Dhabi, United Arab Emirates",...,-10.16,0.0,-4.76,-1.0,-2.71,8.0,4.04,-16.0,16.0,3.9
1,KHAMZAT CHIMAEV,ROBERT WHITTAKER,26/10/2024,Submission,1.0,3:34,5 Rnd (5-5-5-5-5),Middleweight Bout,UFC 308: Topuria vs. Holloway,"Abu Dhabi, Abu Dhabi, United Arab Emirates",...,5.08,5.08,4.51,32.0,-3.34,6.0,4.24,48.0,-84.0,3.1
2,MAGOMED ANKALAEV,ALEKSANDAR RAKIC,26/10/2024,Decision - Unanimous,3.0,5:00,3 Rnd (5-5-5),Light Heavyweight Bout,UFC 308: Topuria vs. Holloway,"Abu Dhabi, Abu Dhabi, United Arab Emirates",...,-2.54,-7.62,-0.82,1.0,-0.45,13.0,0.43,8.0,-5.0,-0.2


# Saving

Saving the preprocessed, cleaned, merged, and feature-engineered dataset, ready for EDA:

In [351]:
stats.to_csv("../stats/stats_processed.csv", sep=";", index=False)