# 🔹UFC Fight Predictor ETL

<div style="text-align: center;">
  🔹 <img src="../img/ufc_logo.png" width="50" /> 🔹
</div>

## 1. Import Libraries and Setup Environment

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Get the current working directory
current_dir = os.getcwd()

# Navigate to the project root
project_root = os.path.abspath(os.path.join(current_dir, '..'))

# Import from /src
sys.path.append(os.path.join(project_root, 'src'))
from helpers import print_header  # only helper used in this notebook

<div style="text-align: center;">
  🔹 <img src="../img/ufc_logo.png" width="50" /> 🔹
</div>

## 2. Load Data

In [2]:
# Define the path to the CSV file
file_path = os.path.join(project_root, 'data', 'raw', 'ufc_raw.csv')

# Load the CSV into a DataFrame
try:
    ufc_raw = pd.read_csv(file_path)
    print_header(f"Data successfully loaded: {ufc_raw.shape[0]} rows, {ufc_raw.shape[1]} columns.", color='bright_green')
except Exception as e:
    print_header(f"Error loading raw data: {e}", color='bright_red')

╔═════════════════════════════════════════════════════╗
║  Data successfully loaded: 6541 rows, 118 columns.  ║
╚═════════════════════════════════════════════════════╝


<div style="text-align: center;">
  🔹 <img src="../img/ufc_logo.png" width="50" /> 🔹
</div>

## 3. Preview

In [3]:
# Preview the first few records
display(ufc_raw.head())

# General dataset information
ufc_raw.info()

Unnamed: 0,RedFighter,BlueFighter,RedOdds,BlueOdds,RedExpectedValue,BlueExpectedValue,Date,Location,Country,Winner,...,FinishDetails,FinishRound,FinishRoundTime,TotalFightTimeSecs,RedDecOdds,BlueDecOdds,RSubOdds,BSubOdds,RKOOdds,BKOOdds
0,Colby Covington,Joaquin Buckley,205.0,-250.0,205.0,40.0,2024-12-14,"Tampa, Florida, USA",USA,Blue,...,,3.0,4:42,882.0,300.0,175.0,1800.0,2000.0,1100.0,150.0
1,Cub Swanson,Billy Quarantillo,124.0,-148.0,124.0,67.5676,2024-12-14,"Tampa, Florida, USA",USA,Red,...,Punch,3.0,1:36,696.0,250.0,,1800.0,,450.0,
2,Manel Kape,Bruno Silva,-395.0,310.0,25.3165,310.0,2024-12-14,"Tampa, Florida, USA",USA,Red,...,Punches,3.0,1:57,717.0,-105.0,550.0,900.0,1800.0,225.0,1100.0
3,Vitor Petrino,Dustin Jacoby,-340.0,270.0,29.4118,270.0,2024-12-14,"Tampa, Florida, USA",USA,Blue,...,Punch,3.0,3:44,824.0,240.0,500.0,550.0,3000.0,110.0,800.0
4,Adrian Yanez,Daniel Marcos,185.0,-225.0,185.0,44.4444,2024-12-14,"Tampa, Florida, USA",USA,Blue,...,,3.0,5:00,900.0,450.0,150.0,2200.0,2200.0,450.0,200.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6541 entries, 0 to 6540
Columns: 118 entries, RedFighter to BKOOdds
dtypes: bool(1), float64(60), int64(43), object(14)
memory usage: 5.8+ MB


In [4]:
ufc_raw = ufc_raw.drop(['RedFighter','BlueFighter','RedExpectedValue', 'Finish',
                        'BlueExpectedValue','Date','Location','Country','EmptyArena',
                        'FinishDetails','FinishRound','FinishRoundTime','RedDecOdds',
                        'TotalFightTimeSecs', 'BlueDecOdds', 'RSubOdds',
                        'BSubOdds','RKOOdds','BKOOdds','WeightClass', 'BetterRank'],axis=1)

<div style="text-align: center;">
  🔹 <img src="../img/ufc_logo.png" width="50" /> 🔹
</div>

## 4. Check Nulls and Duplicates

In [5]:
# Null values check
nulls = ufc_raw.isnull().sum()
print("\nNull values per column:\n", nulls[nulls > 0])

# Duplicate analysis
duplicates = ufc_raw.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")


Null values per column:
 RedOdds                   227
BlueOdds                  226
BlueAvgSigStrLanded       930
BlueAvgSigStrPct          765
BlueAvgSubAtt             832
BlueAvgTDLanded           833
BlueAvgTDPct              842
BlueStance                  3
RedAvgSigStrLanded        455
RedAvgSigStrPct           357
RedAvgSubAtt              357
RedAvgTDLanded            357
RedAvgTDPct               367
BMatchWCRank             5339
RMatchWCRank             4760
RWFlyweightRank          6445
RWFeatherweightRank      6532
RWStrawweightRank        6395
RWBantamweightRank       6387
RHeavyweightRank         6355
RLightHeavyweightRank    6357
RMiddleweightRank        6359
RWelterweightRank        6349
RLightweightRank         6357
RFeatherweightRank       6364
RBantamweightRank        6360
RFlyweightRank           6352
RPFPRank                 6288
BWFlyweightRank          6468
BWFeatherweightRank      6540
BWStrawweightRank        6441
BWBantamweightRank       6434
BHeavyweightRa

<div style="text-align: center;">
  🔹 <img src="../img/ufc_logo.png" width="50" /> 🔹
</div>

## 5. Data Cleaning

### Null Values

In [6]:
# Drop columns with too many null values (threshold: 300)
threshold = 300
cols_to_drop = [col for col in ufc_raw.columns if ufc_raw[col].isnull().sum() > threshold]
for col in cols_to_drop:
    print('Dropping:', col)
ufc_raw.drop(columns=cols_to_drop, inplace=True)

# Drop rows with any remaining missing values
print(f"➡️ Before dropna: {ufc_raw.shape}")
ufc_raw.dropna(inplace=True)
print(f"✅ After dropna: {ufc_raw.shape}")

Dropping: BlueAvgSigStrLanded
Dropping: BlueAvgSigStrPct
Dropping: BlueAvgSubAtt
Dropping: BlueAvgTDLanded
Dropping: BlueAvgTDPct
Dropping: RedAvgSigStrLanded
Dropping: RedAvgSigStrPct
Dropping: RedAvgSubAtt
Dropping: RedAvgTDLanded
Dropping: RedAvgTDPct
Dropping: BMatchWCRank
Dropping: RMatchWCRank
Dropping: RWFlyweightRank
Dropping: RWFeatherweightRank
Dropping: RWStrawweightRank
Dropping: RWBantamweightRank
Dropping: RHeavyweightRank
Dropping: RLightHeavyweightRank
Dropping: RMiddleweightRank
Dropping: RWelterweightRank
Dropping: RLightweightRank
Dropping: RFeatherweightRank
Dropping: RBantamweightRank
Dropping: RFlyweightRank
Dropping: RPFPRank
Dropping: BWFlyweightRank
Dropping: BWFeatherweightRank
Dropping: BWStrawweightRank
Dropping: BWBantamweightRank
Dropping: BHeavyweightRank
Dropping: BLightHeavyweightRank
Dropping: BMiddleweightRank
Dropping: BWelterweightRank
Dropping: BLightweightRank
Dropping: BFeatherweightRank
Dropping: BBantamweightRank
Dropping: BFlyweightRank
Droppi
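
The threshold-based drop above can be tried on a small frame first. The following is a minimal sketch, not the notebook's own code: the `toy` data, the `drop_sparse_columns` helper, and the `max_nulls=2` value are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for ufc_raw
toy = pd.DataFrame({
    "RedOdds": [-150.0, 120.0, np.nan, 200.0],   # 1 missing value
    "RankCol": [np.nan, np.nan, np.nan, 3.0],    # 3 missing values
    "Winner": ["Red", "Blue", "Red", "Blue"],
})

def drop_sparse_columns(df, max_nulls):
    """Return a copy without columns holding more than max_nulls missing values,
    together with the list of dropped column names."""
    null_counts = df.isnull().sum()
    sparse = null_counts[null_counts > max_nulls].index.tolist()
    return df.drop(columns=sparse), sparse

# Step 1: drop sparse columns. Step 2: drop rows that still contain nulls.
cleaned, dropped = drop_sparse_columns(toy, max_nulls=2)
cleaned = cleaned.dropna()

print("Dropped columns:", dropped)
print("Shape after cleaning:", cleaned.shape)
```

Returning a new frame instead of using `inplace=True` keeps the original data available, which makes it easier to compare several thresholds.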

In [7]:
# Null values check
nulls = ufc_raw.isnull().sum()
print("\nNull values per column:\n", nulls[nulls > 0])


Null values per column:
 Series([], dtype: int64)


### Incongruent Data

In [8]:
print(ufc_raw[['RedReachCms', 'BlueReachCms']].describe())

       RedReachCms  BlueReachCms
count  6300.000000   6300.000000
mean    182.517754    182.289786
std      11.078540     11.140183
min     147.320000      0.000000
25%     175.260000    175.260000
50%     182.880000    182.880000
75%     190.500000    190.500000
max     214.630000    213.360000


In [9]:
# Replace zero reach values with the column mean before dividing (nulls were already dropped above)
ufc_raw['RedReachCms'] = ufc_raw['RedReachCms'].replace(0, ufc_raw['RedReachCms'].mean())
ufc_raw['BlueReachCms'] = ufc_raw['BlueReachCms'].replace(0, ufc_raw['BlueReachCms'].mean())
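
One caveat with the replacement above: the column mean is computed while the zeros are still present, so the fill value is pulled slightly downward. A possible alternative, sketched here on a hypothetical `reach` series rather than the real data, is to treat zeros as missing before taking the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical reach values in cm; 0.0 marks an invalid measurement
reach = pd.Series([180.0, 0.0, 190.0, 170.0], name="BlueReachCms")

# Mean with the zero included (what a plain .mean() would use)
naive_mean = reach.mean()

# Mean over valid measurements only: zeros become NaN, which .mean() skips
valid_mean = reach.replace(0, np.nan).mean()

# Fill the invalid entries with the mean of the valid ones
reach_fixed = reach.replace(0, valid_mean)

print(naive_mean, valid_mean)
print(reach_fixed.tolist())
```

With a single zero among 6300 rows the difference is small, so this is a refinement rather than a correction of the result.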

In [10]:
# 'Open Stance' is not a valid individual stance: inspect the affected RedStance rows
ufc_raw[ufc_raw['RedStance'] == 'Open Stance']

Unnamed: 0,RedOdds,BlueOdds,Winner,TitleBout,Gender,NumberOfRounds,BlueCurrentLoseStreak,BlueCurrentWinStreak,BlueDraws,BlueLongestWinStreak,...,TotalRoundDif,TotalTitleBoutDif,KODif,SubDif,HeightDif,ReachDif,AgeDif,SigStrDif,AvgSubAttDif,AvgTDDif
6051,-255.0,235.0,Blue,False,MALE,3,0,1,0,1,...,-7,0,-1,-1,-2.54,-2.54,2,-15.5,-0.55,0.625
6366,-108.0,-102.0,Red,False,MALE,3,2,0,0,1,...,-4,0,-1,-2,5.08,10.16,7,-4.5,0.8333,0.8333
6448,-190.0,175.0,Blue,False,MALE,3,3,0,0,3,...,18,1,-1,0,7.62,2.54,-1,3.2,0.4,0.1273
6511,-230.0,190.0,Blue,False,MALE,3,0,2,0,2,...,6,-1,-2,0,2.54,2.54,0,-2.8182,0.0909,0.4343


In [11]:
ufc_raw = ufc_raw[ufc_raw['RedStance'] != 'Open Stance']

In [12]:
# Confirm that no 'Open Stance' rows remain in RedStance
ufc_raw[ufc_raw['RedStance'] == 'Open Stance']

Unnamed: 0,RedOdds,BlueOdds,Winner,TitleBout,Gender,NumberOfRounds,BlueCurrentLoseStreak,BlueCurrentWinStreak,BlueDraws,BlueLongestWinStreak,...,TotalRoundDif,TotalTitleBoutDif,KODif,SubDif,HeightDif,ReachDif,AgeDif,SigStrDif,AvgSubAttDif,AvgTDDif


In [13]:
# 'Open Stance' is not a valid individual stance: inspect the affected BlueStance rows
ufc_raw[ufc_raw['BlueStance'] == 'Open Stance']

Unnamed: 0,RedOdds,BlueOdds,Winner,TitleBout,Gender,NumberOfRounds,BlueCurrentLoseStreak,BlueCurrentWinStreak,BlueDraws,BlueLongestWinStreak,...,TotalRoundDif,TotalTitleBoutDif,KODif,SubDif,HeightDif,ReachDif,AgeDif,SigStrDif,AvgSubAttDif,AvgTDDif
6216,265.0,-325.0,Blue,False,MALE,3,0,1,0,3,...,10,0,2,1,-2.54,10.16,-5,28.4762,-0.8095,-1.5714


In [14]:
ufc_raw = ufc_raw[ufc_raw['BlueStance'] != 'Open Stance']

In [15]:
# Confirm that no 'Open Stance' rows remain in BlueStance
ufc_raw[ufc_raw['BlueStance'] == 'Open Stance']

Unnamed: 0,RedOdds,BlueOdds,Winner,TitleBout,Gender,NumberOfRounds,BlueCurrentLoseStreak,BlueCurrentWinStreak,BlueDraws,BlueLongestWinStreak,...,TotalRoundDif,TotalTitleBoutDif,KODif,SubDif,HeightDif,ReachDif,AgeDif,SigStrDif,AvgSubAttDif,AvgTDDif


In [16]:
# Final check: RedStance still has no 'Open Stance' rows
ufc_raw[ufc_raw['RedStance'] == 'Open Stance']

Unnamed: 0,RedOdds,BlueOdds,Winner,TitleBout,Gender,NumberOfRounds,BlueCurrentLoseStreak,BlueCurrentWinStreak,BlueDraws,BlueLongestWinStreak,...,TotalRoundDif,TotalTitleBoutDif,KODif,SubDif,HeightDif,ReachDif,AgeDif,SigStrDif,AvgSubAttDif,AvgTDDif


## 6. Create Fight Stance Columns
- If both fighters have the same fighting stance, the bout is considered a Closed Stance matchup. If their stances differ, it is classified as an Open Stance matchup.

In [17]:
# Create the FightStance column based on whether the two stances match
ufc_raw['FightStance'] = np.where(
    ufc_raw['BlueStance'] == ufc_raw['RedStance'],
    'Closed Stance',
    'Open Stance'
)

In [18]:
ufc_preview = ufc_raw[ufc_raw['FightStance'] == 'Open Stance']

In [19]:
ufc_preview2 = ufc_raw[ufc_raw['FightStance'] == 'Closed Stance']

In [20]:
ufc_preview[['FightStance', 'BlueStance', 'RedStance']]

Unnamed: 0,FightStance,BlueStance,RedStance
0,Open Stance,Southpaw,Orthodox
2,Open Stance,Orthodox,Southpaw
6,Open Stance,Switch,Southpaw
10,Open Stance,Orthodox,Southpaw
11,Open Stance,Southpaw,Orthodox
...,...,...,...
6524,Open Stance,Orthodox,Southpaw
6525,Open Stance,Orthodox,Southpaw
6528,Open Stance,Orthodox,Switch
6529,Open Stance,Orthodox,Southpaw


In [21]:
ufc_preview2[['FightStance', 'BlueStance', 'RedStance']]

Unnamed: 0,FightStance,BlueStance,RedStance
1,Closed Stance,Orthodox,Orthodox
3,Closed Stance,Orthodox,Orthodox
4,Closed Stance,Orthodox,Orthodox
5,Closed Stance,Orthodox,Orthodox
7,Closed Stance,Orthodox,Orthodox
...,...,...,...
6535,Closed Stance,Orthodox,Orthodox
6536,Closed Stance,Orthodox,Orthodox
6538,Closed Stance,Orthodox,Orthodox
6539,Closed Stance,Orthodox,Orthodox
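
Instead of scanning the two previews row by row, the classification can also be summarized with counts. This is a small sketch on a hypothetical `toy` frame that reuses the same `np.where` rule; it is not run against the real dataset here.

```python
import numpy as np
import pandas as pd

# Hypothetical stance pairs standing in for ufc_raw
toy = pd.DataFrame({
    "RedStance":  ["Orthodox", "Southpaw", "Orthodox", "Switch"],
    "BlueStance": ["Orthodox", "Orthodox", "Southpaw", "Switch"],
})

# Same rule as the notebook: identical stances -> Closed, otherwise Open
toy["FightStance"] = np.where(
    toy["BlueStance"] == toy["RedStance"],
    "Closed Stance",
    "Open Stance",
)

# How many bouts fall into each class
counts = toy["FightStance"].value_counts()

# Which stance combinations occur, as a Red x Blue table
pairs = pd.crosstab(toy["RedStance"], toy["BlueStance"])

print(counts)
print(pairs)
```

In the cross-tabulation, every count on the diagonal corresponds to a Closed Stance bout and every off-diagonal count to an Open Stance bout.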


<div style="text-align: center;">
  🔹 <img src="../img/ufc_logo.png" width="50" /> 🔹
</div>

## 7. Check Clean Data

In [22]:
# Null values check
nulls = ufc_raw.isnull().sum()
print("\nNull values per column:\n", nulls[nulls > 0])

# Duplicate analysis
duplicates = ufc_raw.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")


Null values per column:
 Series([], dtype: int64)

Duplicate rows: 0


In [23]:
# Preview the first few records
display(ufc_raw.head())
display(ufc_raw.columns)
# Show the data type of each column:
display(ufc_raw.dtypes)


Unnamed: 0,RedOdds,BlueOdds,Winner,TitleBout,Gender,NumberOfRounds,BlueCurrentLoseStreak,BlueCurrentWinStreak,BlueDraws,BlueLongestWinStreak,...,TotalTitleBoutDif,KODif,SubDif,HeightDif,ReachDif,AgeDif,SigStrDif,AvgSubAttDif,AvgTDDif,FightStance
0,205.0,-250.0,Blue,False,MALE,5,0,5,0,5,...,-4,4,-2,-2.54,10.16,-6,0.25,-0.2,-1.83,Open Stance
1,124.0,-148.0,Red,False,MALE,3,1,0,0,4,...,0,-2,-1,5.08,0.0,-5,2.69,0.7,0.2,Closed Stance
2,-395.0,310.0,Red,False,MALE,3,0,4,0,4,...,0,1,1,-2.54,-7.62,3,-1.12,-0.2,1.72,Open Stance
3,-340.0,270.0,Blue,False,MALE,3,2,0,1,4,...,0,2,-1,2.54,-2.54,9,2.68,-0.8,-3.62,Closed Stance
4,185.0,-225.0,Blue,False,MALE,3,0,4,0,4,...,0,-5,0,0.0,-2.54,0,-0.57,0.0,0.25,Closed Stance


Index(['RedOdds', 'BlueOdds', 'Winner', 'TitleBout', 'Gender',
       'NumberOfRounds', 'BlueCurrentLoseStreak', 'BlueCurrentWinStreak',
       'BlueDraws', 'BlueLongestWinStreak', 'BlueLosses',
       'BlueTotalRoundsFought', 'BlueTotalTitleBouts',
       'BlueWinsByDecisionMajority', 'BlueWinsByDecisionSplit',
       'BlueWinsByDecisionUnanimous', 'BlueWinsByKO', 'BlueWinsBySubmission',
       'BlueWinsByTKODoctorStoppage', 'BlueWins', 'BlueStance',
       'BlueHeightCms', 'BlueReachCms', 'BlueWeightLbs',
       'RedCurrentLoseStreak', 'RedCurrentWinStreak', 'RedDraws',
       'RedLongestWinStreak', 'RedLosses', 'RedTotalRoundsFought',
       'RedTotalTitleBouts', 'RedWinsByDecisionMajority',
       'RedWinsByDecisionSplit', 'RedWinsByDecisionUnanimous', 'RedWinsByKO',
       'RedWinsBySubmission', 'RedWinsByTKODoctorStoppage', 'RedWins',
       'RedStance', 'RedHeightCms', 'RedReachCms', 'RedWeightLbs', 'RedAge',
       'BlueAge', 'LoseStreakDif', 'WinStreakDif', 'LongestWinStreakDi

RedOdds                        float64
BlueOdds                       float64
Winner                          object
TitleBout                         bool
Gender                          object
NumberOfRounds                   int64
BlueCurrentLoseStreak            int64
BlueCurrentWinStreak             int64
BlueDraws                        int64
BlueLongestWinStreak             int64
BlueLosses                       int64
BlueTotalRoundsFought            int64
BlueTotalTitleBouts              int64
BlueWinsByDecisionMajority       int64
BlueWinsByDecisionSplit          int64
BlueWinsByDecisionUnanimous      int64
BlueWinsByKO                     int64
BlueWinsBySubmission             int64
BlueWinsByTKODoctorStoppage      int64
BlueWins                         int64
BlueStance                      object
BlueHeightCms                  float64
BlueReachCms                   float64
BlueWeightLbs                    int64
RedCurrentLoseStreak             int64
RedCurrentWinStreak      

## 8. Create the target value: **0** (Fighter Red wins) or **1** (Fighter Blue wins)

In [24]:
# Encode the target: 1 if Blue won, 0 otherwise
ufc_raw['label'] = (ufc_raw['Winner'] == 'Blue').astype(int)
ufc_raw = ufc_raw.drop('Winner', axis=1)
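
Note that the encoding above sends every value other than `'Blue'` to 0. If `Winner` could ever hold something unexpected, an explicit mapping makes that visible. The sketch below uses a hypothetical `winner` series and also reports the class balance; whether the real column contains values beyond `'Red'` and `'Blue'` is not shown in this notebook.

```python
import pandas as pd

# Hypothetical outcomes standing in for ufc_raw['Winner']
winner = pd.Series(["Red", "Blue", "Blue", "Red", "Red"], name="Winner")

# Explicit mapping: anything other than 'Red' or 'Blue' becomes NaN
label = winner.map({"Red": 0, "Blue": 1})

# Fail loudly if an unmapped value slipped through
if label.isna().any():
    raise ValueError("Unexpected value in Winner column")
label = label.astype(int)

# Share of each class, useful to spot imbalance before modelling
balance = label.value_counts(normalize=True)

print(label.tolist())
print(balance)
```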

## 9. Save

In [25]:
# Save the cleaned file
output_path = os.path.join(project_root, 'data', 'processed', 'ufc_etl.csv')
ufc_raw.to_csv(output_path, index=False)
print_header("ETL file saved as 'ufc_etl.csv'.", color='bright_green')

╔════════════════════════════════════╗
║  ETL file saved as 'ufc_etl.csv'.  ║
╚════════════════════════════════════╝
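
A quick round-trip read is one way to confirm that the saved file matches what was written. The sketch below is an illustration only: it writes a hypothetical two-row frame to a temporary directory rather than to the project's `data/processed` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the cleaned dataset
df = pd.DataFrame({
    "RedOdds": [-150.0, 120.0],
    "TitleBout": [False, True],
    "label": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data", "processed")
    # to_csv does not create missing directories, so make sure the folder exists
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "ufc_etl.csv")

    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same shape and column order means nothing was lost or added (e.g. a stray index column)
print(reloaded.shape == df.shape, list(reloaded.columns) == list(df.columns))
```

Writing with `index=False`, as the notebook does, is what prevents an extra `Unnamed: 0` column from appearing when the file is read back.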


<div style="text-align: center;">
     <img src="../img/ufc_logo.png" width="800" /> 
</div>