In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
 

### **Background info**

**Brazilian Jiu-Jitsu (BJJ)** is a martial art that focuses on wrestling or throwing an opponent to the ground. While on the ground, the athlete aims to achieve a submission victory. A submission victory occurs when an athlete puts his opponent into a joint lock or a choke, and the submitted opponent signals defeat by *tapping on his opponent or the mat.*

[![BJJ match](https://upload.wikimedia.org/wikipedia/commons/2/22/GABRIEL_VELLA_vs_ROMINHO_51.jpg "The athlete on the bottom has caught his opponent in a choke, click for more info")](https://en.wikipedia.org/wiki/Brazilian_jiu-jitsu)

***Click image for more info***

This analysis aims to find the most common types of submissions that occur in one of the most popular BJJ competitions, the *Abu-Dhabi Combat Club Submission Fighting World Championship (ADCC).*

It will be useful to anyone practicing BJJ, as it shows which submissions are most often successful in the sport. Practitioners can then focus their efforts on learning or defending against these submissions.

A separate app was built to present the data in an easy-to-understand and engaging way.

### Data source
The raw CSV file was obtained from Kaggle.com

https://www.kaggle.com/datasets/bjagrelli/adcc-historical-dataset

Now we have to clean the data and prepare it.

In [2]:
url = 'https://raw.githubusercontent.com/lizh1988/portfolio/main/ADCC%20data%20viz/adcc_historical_data.csv'

df = pd.read_csv(url, sep = ';')

The dataframe should first be checked to ensure that it read the CSV file correctly.

In [3]:
df

Unnamed: 0,match_id,winner_id,winner_name,loser_id,loser_name,win_type,submission,winner_points,loser_points,adv_pen,weight_class,sex,stage,year
0,3314,484,Murilo Santana,733,Vinicius Magalhaes,DECISION,,-1,-1,,ABS,M,4F,2011
1,35049,7507,Nicholas Meregali,9554,Henrique Cardoso,SUBMISSION,Kimura,-1,-1,,99KG,M,R1,2022
2,35053,7507,Nicholas Meregali,1740,Yuri Simoes,DECISION,,-1,-1,,99KG,M,4F,2022
3,35057,7507,Nicholas Meregali,576,Rafael Lovato Jr,POINTS,,0,0,PEN,99KG,M,3RD,2022
4,35096,7507,Nicholas Meregali,11797,Giancarlo Bodoni,POINTS,,6,2,,ABS,M,4F,2022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,368,6438,Hayato Sakurai,6167,Ricco Rodriguez,SUBMISSION,Heel hook,-1,-1,,ABS,M,SF,1999
1024,818,-1,A. Cacareco,559,Rodrigo Medeiros,SUBMISSION,Guillotine,-1,-1,,ABS,M,R2,2003
1025,341,-1,Jeff Monson,481,Saulo Ribeiro,POINTS,,-1,-1,,99KG,M,F,1999
1026,35076,1119,Beatriz Mesquita,9271,Mayssa Bastos,POINTS,,3,0,,60KG,F,4F,2022
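Beyond inspecting the table by eye, the parse can also be checked programmatically. The sketch below is only an illustration: it uses a made-up two-row sample in the same semicolon-delimited layout instead of the real file, and `check_columns` is a hypothetical helper, not part of pandas.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout as the ADCC file
sample = io.StringIO(
    "match_id;winner_name;loser_name;win_type;submission;year\n"
    "1;Athlete A;Athlete B;SUBMISSION;Kimura;2022\n"
    "2;Athlete C;Athlete D;POINTS;;2019\n"
)

def check_columns(frame, expected):
    """Return the expected column names that are missing from the frame."""
    return [col for col in expected if col not in frame.columns]

parsed = pd.read_csv(sample, sep=';')
missing = check_columns(parsed, ['match_id', 'win_type', 'submission', 'year'])

# With the wrong separator everything would collapse into a single column,
# so a sensible column count and an empty `missing` list indicate a clean parse.
print(parsed.shape, missing)
```

If the separator were wrong, `missing` would list every expected column, which is quicker to spot than scanning a wide table.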


The match_id, winner_id, loser_id, winner_name and loser_name columns will be dropped from the dataframe, as all matches are given equal weight (i.e. the superstars' matches are treated the same as those of lesser-known athletes).

In [4]:
df.drop(['match_id','winner_id', 'loser_id', 'winner_name','loser_name'], axis=1, inplace=True)
df.head(15)

Unnamed: 0,win_type,submission,winner_points,loser_points,adv_pen,weight_class,sex,stage,year
0,DECISION,,-1,-1,,ABS,M,4F,2011
1,SUBMISSION,Kimura,-1,-1,,99KG,M,R1,2022
2,DECISION,,-1,-1,,99KG,M,4F,2022
3,POINTS,,0,0,PEN,99KG,M,3RD,2022
4,POINTS,,6,2,,ABS,M,4F,2022
5,DECISION,,-1,-1,,ABS,M,SF,2022
6,DECISION,,-1,-1,,+99KG,M,R1,2019
7,POINTS,,0,0,PEN,+99KG,M,4F,2019
8,DECISION,,-1,-1,,+99KG,M,SF,2019
9,POINTS,,3,0,,+99KG,M,SF,2022


Athletes can normally win in 3 different ways: by submission, by points obtained from getting into advantageous positions, or by advantages/penalties given at the referee's discretion.

We will attempt to see which type of victory is most common.

First, we need to know how many entries we have.

In [5]:
df.shape

(1028, 9)

We have a total of **1028 entries**. The next step is to check for missing or invalid data.

In [6]:
df.isnull().sum()

win_type           0
submission       628
winner_points      0
loser_points       0
adv_pen          999
weight_class       0
sex                0
stage              0
year               0
dtype: int64

There are many missing values in the 'submission' column, which records the submission used to win the match, and in the 'adv_pen' column, which records whether the match was decided by an advantage or penalty given by the referee. This is not necessarily an error in the data: 'submission' stays blank when a match was won by points or advantage/penalty, and 'adv_pen' is filled only when the athlete won via advantage/penalty.

In [7]:
df.value_counts('submission').sum()

400

Adding the 628 null values in the 'submission' column to the 400 recorded submissions gives **1028**, so every row is accounted for.

In [8]:
df.value_counts('win_type')

win_type
POINTS              520
SUBMISSION          402
DECISION             97
INJURY                8
DESQUALIFICATION      1
dtype: int64

However, value_counts() for the win types reveals 402 wins by submission rather than 400. The type of submission was not recorded for 2 matches that were won by submission.
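One way to surface this kind of mismatch in a single step is to cross-tabulate the win type against whether a technique was recorded. The following is a minimal sketch on made-up rows (the `toy` frame and its values are assumptions for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up rows: one submission win has no technique recorded
toy = pd.DataFrame({
    'win_type': ['SUBMISSION', 'SUBMISSION', 'POINTS', 'DECISION'],
    'submission': ['Kimura', np.nan, np.nan, np.nan],
})

# Label each row by whether a technique name is present
technique = toy['submission'].isna().map({True: 'missing', False: 'recorded'})

# Rows are win types, columns count missing vs recorded techniques
table = pd.crosstab(toy['win_type'], technique)
print(table)

# Submission wins without a named technique are the inconsistent records
suspicious = toy[(toy['win_type'] == 'SUBMISSION') & toy['submission'].isna()]
print(suspicious.index.tolist())
```

A non-zero count in the SUBMISSION row of the "missing" column flags records that need a closer look.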

In [9]:
mask1=df.loc[(df['win_type']=='SUBMISSION') & df['submission'].isna()]
mask2=df.loc[(df['win_type']== 'INJURY') | (df['win_type']== 'DESQUALIFICATION')]
print(mask1)
print(mask2)

       win_type submission  winner_points  loser_points adv_pen weight_class  \
515  SUBMISSION        NaN             -1            -1     NaN        +60KG   
883  SUBMISSION        NaN             -1            -1     NaN         99KG   

    sex stage  year  
515   F     F  2013  
883   M    R1  2015  
              win_type submission  winner_points  loser_points adv_pen  \
302             INJURY        NaN             -1            -1     NaN   
308             INJURY        NaN             -1            -1     NaN   
321             INJURY        NaN             -1            -1     NaN   
361             INJURY        NaN             -1            -1     NaN   
607   DESQUALIFICATION        NaN             -1            -1     NaN   
932             INJURY        NaN             -1            -1     NaN   
934             INJURY        NaN             -1            -1     NaN   
1001            INJURY        NaN             -1            -1     NaN   
1011            INJURY     

We have found the 2 anomalous records. Since we cannot be sure of the actual data for these 2 matches, we will drop them from the dataframe. We will also drop the "INJURY" and "DESQUALIFICATION" (sic) records: these results are one-off events in most cases and do not contribute to a better understanding of the matches.

In [10]:
df.drop(index=mask1.index, inplace=True)
df.drop(index=mask2.index, inplace=True)

df.value_counts('win_type')

win_type
POINTS        520
SUBMISSION    400
DECISION       97
dtype: int64

Next, we group the submissions into families of techniques based on which part of the body they attack: the neck, the upper body or the lower body.

In [11]:
# Start with an empty object-dtype column so that string labels can be assigned to it cleanly
df['target'] = pd.Series(index=df.index, dtype='object')
upper_body=['Armbar', 'Kimura' ,'Omoplata', 'Americana' ,'Shoulder lock', 'Wristlock']
lower_body=['Inside heel hook','Outside heel hook','Heel hook' ,'Footlock' , 'Toe hold' , 'Leg lock' , 'Calf slicer', 'Dogbar' , 'Estima lock' , 'Z Lock', 'Kneebar']
neck=['RNC' , 'Katagatame' , 'Guillotine' , 'Triangle' , 'Choke' , "D'arce choke" , 'Short choke' , 'North south choke' , 'Headlock' , 'Anaconda' , 'Ezekiel' , 'Cross face' , 'Twister']
# Label each match by the body part its submission attacks; unlisted submissions stay NaN
df.loc[df['submission'].isin(upper_body), 'target'] = 'Upper body'
df.loc[df['submission'].isin(lower_body), 'target'] = 'Lower body'
df.loc[df['submission'].isin(neck), 'target'] = 'Neck'
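The same grouping can also be expressed as a single lookup table, which makes it easy to list the labels that fall outside all three families. This is a sketch with an abbreviated, hypothetical mapping and made-up rows; the full technique lists are the ones in the cell above.

```python
import numpy as np
import pandas as pd

# Abbreviated, hypothetical mapping from submission name to body part attacked
target_by_submission = {
    'Armbar': 'Upper body', 'Kimura': 'Upper body',
    'Heel hook': 'Lower body', 'Inside heel hook': 'Lower body', 'Kneebar': 'Lower body',
    'RNC': 'Neck', 'Guillotine': 'Neck',
}

# Made-up rows, including labels that name no specific technique
toy = pd.DataFrame({
    'submission': ['RNC', 'Inside heel hook', 'Verbal tap', np.nan, 'Armbar', 'Submission'],
})

# Series.map leaves NaN wherever the label is missing or not in the mapping
toy['target'] = toy['submission'].map(target_by_submission)

# Recorded submissions that did not receive a target
unmapped = toy.loc[toy['submission'].notna() & toy['target'].isna(), 'submission'].value_counts()
print(unmapped)
```

Listing the unmapped labels this way is one route to spotting entries that need special handling.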



A look at the varieties of submissions performed at ADCC revealed two new issues:

1. Heel hooks are separated into 3 categories: heel hooks, inside heel hooks and outside heel hooks. Every heel hook is supposed to be either an inside or an outside heel hook, so the generic "Heel hook" label is ambiguous.

![Inside heel hook](https://bjj.tv/wp-content/uploads/2020/08/Inside-Heel-Hook.jpg "Inside heel hook")

*Inside heel hooks turn the heel towards the centre of the body*

![Outside heel hook](https://grapplinginsider.com/wp-content/uploads/2021/03/Dean-Lister.jpg "Outside heel hook")

*Outside heel hooks turn the heel towards the outside of the body*

2. Entries such as "Submission" or "Verbal tap" give no indication of which submission was performed and do not add clarity about which types of submission are favoured in ADCC.

We will do the following to solve these two problems:

1. Inside heel hooks and outside heel hooks will be classified as heel hooks and merged with the "Heel hook" entry.

2. Entries such as "Submission" or "Verbal tap" will be dropped.


In [12]:
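# Issue 1: merge the inside/outside variants into the generic "Heel hook" label
# (assign the result back instead of calling replace(..., inplace=True) on a column selection)
df['submission'] = df['submission'].replace(['Inside heel hook', 'Outside heel hook'], 'Heel hook')
# Issue 2: drop entries that do not name a specific technique
mask=df[df['submission'].isin(['Submission', 'Verbal tap'])]
df.drop(index=mask.index, inplace=True)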

Lastly, we split the dataframe by gender, as the two genders have different weight categories that should not be compared with each other.

In [13]:
# Take explicit copies so that columns added later do not write to a view of df
dfm=df[df['sex']=='M'].copy()
dff=df[df['sex']=='F'].copy()

### Exploratory Data Analysis (EDA)

We are finally ready to explore the dataset and discover any relationships.

In [14]:
#Adding a constant helper column for px.histogram to aggregate (barnorm='percent' rescales each bar to 100%) and sorting the rows so that the layout of the visualizations is easier to code
dfm['percentagecount']=0.01
dff['percentagecount']=0.01
dff=dff.sort_values(by=["win_type"], axis=0)
dfm=dfm.sort_values(by=["win_type"], axis=0)


#Building the horizontal stacked histograms
fig1=px.histogram(
    dfm, y='weight_class', x='percentagecount', title = 'Distribution of win types across weight categories - Males',
    barmode='stack', barnorm='percent', color ='win_type',
    labels=dict(weight_class='Weight Class', percentagecount="occurrence", win_type="Type of victory"),
    # category_orders is keyed by column name and must list the values exactly as they appear in the data
    category_orders={'win_type': ['POINTS', 'DECISION', 'SUBMISSION']}
)

fig2=px.histogram(
    dff, y='weight_class', x='percentagecount', title = 'Distribution of win types across weight categories - Females', 
    color='win_type', barmode='stack', barnorm='percent',
    labels=dict(weight_class='Weight Class', percentagecount="occurrence", win_type="Type of victory"),
    category_orders={'win_type': ['POINTS', 'DECISION', 'SUBMISSION']}
)

#Amending the layout of the 2 figures
fig1.update_layout(xaxis_ticksuffix = '%', xaxis_title= 'Percentage within each weight class')
fig2.update_layout(xaxis_ticksuffix = '%', xaxis_title= 'Percentage within each weight class')

fig1.update_yaxes(categoryorder='array', categoryarray= ['66KG', '77KG', '88KG', '99KG', '+99KG', 'ABS'])
fig2.update_yaxes(categoryorder='array', categoryarray=['60KG', '+60KG'], matches=None, showticklabels=True)

fig1.show()
fig2.show()


From the distribution of win types across weight classes and genders, we can see that submission victories are about as frequent as victories by points in every weight class and for both genders.

This is surprising, as it is easier to get an advantageous position during the match than to complete a submission. The rational approach would therefore be to obtain a lead in points and then stall out the match instead of risking a submission attempt.

Perhaps this can be attributed to ADCC being a tournament-style event, where athletes may have to fight multiple matches each day. Stalling may be the more rational approach for an individual match, but athletes who employ this strategy would find themselves low on stamina in the later matches and lose the overall tournament.
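The shares shown in the stacked bars can also be read off as numbers, which makes the comparison between points and submission victories easier to verify. A minimal sketch using `pd.crosstab` on made-up rows (the `toy` frame is an assumption for illustration, not the real data):

```python
import pandas as pd

# Made-up matches for two weight classes
toy = pd.DataFrame({
    'weight_class': ['66KG', '66KG', '66KG', '66KG', 'ABS', 'ABS'],
    'win_type': ['POINTS', 'SUBMISSION', 'SUBMISSION', 'DECISION', 'POINTS', 'SUBMISSION'],
})

# normalize='index' turns each row into shares that sum to 1; scale to percent
shares = (pd.crosstab(toy['weight_class'], toy['win_type'], normalize='index') * 100).round(1)
print(shares)
```

Applied to `dfm` or `dff`, the same call would give one row per weight class with the percentage of each win type.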

In [15]:
# Take explicit copies so that the points_diff column is not written to a view
dfmpoints=dfm[dfm['win_type']=='POINTS'].copy()
dffpoints=dff[dff['win_type']=='POINTS'].copy()

dfmpoints['points_diff']=(dfmpoints['winner_points']-dfmpoints['loser_points'])
dffpoints['points_diff']=(dffpoints['winner_points']-dffpoints['loser_points'])

fig1=px.histogram(
    dfmpoints, x='points_diff', title = 'Distribution of points difference in matches where athletes won by points - Males',
    labels=dict(points_diff='Difference in points')
)

fig2=px.histogram(
    dffpoints, x='points_diff', title = 'Distribution of points difference in matches where athletes won by points - Females',
    labels=dict(points_diff='Difference in points')
)

fig2.update_traces(
    xbins=dict(size=1),
)
fig1.update_layout(xaxis=dict(tickmode = 'linear', tick0 = 0, dtick=1))
fig2.update_layout(xaxis=dict(tickmode = 'linear', tick0 = 0, dtick=1))
fig2.update_xaxes(tickson='boundaries')

fig1.show()
fig2.show()

*A points difference of 0 in the above histograms indicates that the winner of the match won by an advantage point or penalty point, which is given by the referee to reward aggressive athletes who do not reach a point-scoring position and to penalise passive athletes.*

The distribution of the points difference between winner and loser suggests that the majority of these matches are won either by advantages/penalties or by a small lead of 2 to 3 points. This lends credence to the theory that the rational way to eke out a points victory is simply to get a minor lead and stall out the match.
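One caveat is worth checking before relying on the zero bin. In the table shown earlier, some POINTS rows carry -1 for both scores. If -1 is a placeholder for unrecorded points (an assumption; the source does not say so), those rows also produce a difference of 0 and would be counted as advantage/penalty wins. The sketch below, on made-up rows, filters them out first and then measures the share of close wins:

```python
import pandas as pd

# Made-up points wins; the fourth row uses -1/-1, assumed here to mean "not recorded"
toy = pd.DataFrame({
    'win_type': ['POINTS'] * 5,
    'winner_points': [0, 3, 2, -1, 6],
    'loser_points': [0, 0, 0, -1, 2],
})

# Naive difference: the placeholder row looks like a 0-point win
naive_zero_wins = ((toy['winner_points'] - toy['loser_points']) == 0).sum()

# Keep only rows where both scores are real values
recorded = toy[(toy['winner_points'] >= 0) & (toy['loser_points'] >= 0)].copy()
recorded['points_diff'] = recorded['winner_points'] - recorded['loser_points']

# Share of recorded points wins decided by 3 points or fewer
share_close = (recorded['points_diff'] <= 3).mean()
print(naive_zero_wins, len(recorded), share_close)
```

If the placeholder assumption holds for the real data, comparing the histograms with and without these rows would show how much of the zero bin is genuine.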

In [16]:
fig = sp.make_subplots(rows=1, cols=2, specs=[
    [{"type": "sunburst"}, {"type": "sunburst"}]
], subplot_titles=['Male','Female'])

# Take explicit copies so that the helper column is not written to a view
dfmsub=dfm[dfm['submission'].notnull()].copy()
dfmsub['subcounts']=1

dffsub=dff[dff['submission'].notnull()].copy()
dffsub['subcounts']=1

dffsub=dffsub.sort_values(by=["weight_class","target"], axis=0)
dfmsub=dfmsub.sort_values(by=["weight_class","target"], axis=0)

sb1=px.sunburst(
    dfmsub, path=['weight_class','target','submission'], values='subcounts', color='target',
    labels={'subcounts': 'Number of occurrences'}, title='Male athletes'
)

sb2=px.sunburst(
    dffsub, path=['weight_class','target','submission'], values='subcounts', color='target',
    labels={'subcounts': 'Number of occurrences'}, title='Female athletes'
)

sb1.update_traces(textinfo="label+percent parent")
sb2.update_traces(textinfo="label+percent parent")
fig.add_trace(sb1.data[0], row=1, col=1)
fig.add_trace(sb2.data[0], row=1, col=2)

fig.update_layout(title='Breakdown of targets attacked and submission type by weight class',height=600, width=1000)


The above sunburst charts provide more information on the favoured techniques for each weight class and gender.

Several observations can be made:

1. One of the most popular lower body attacks now is the heel hook, which dominated lower body attacks for male athletes. It is strangely missing in the female matches. A possible explanation comes from a personal anecdote: multiple training partners have commented that it is hard to perform heel hooks on female training partners due to the increased flexibility in their ankles.

2. Despite heel hooks being very popular, with multiple world champions producing DVDs to teach their variations of the heel hook, we find that for male athletes, neck attacks occur much more frequently than lower body attacks. The top 3 most common neck attacks are the Rear Naked Choke (RNC), the guillotine, and the triangle choke. Male athletes would be well served to learn how to perform these attacks and how to defend against them.

3. The only weight category where lower body attacks, and heel hooks in particular, are more popular than chokes is the absolute (ABS) category, where there is no weight limit for athletes. This may be due to the heel hook's reputation as a 'giant slayer', where it is touted as the most viable technique for smaller athletes to overcome larger athletes. More research is required to validate this statement, as it is possible that the larger athletes in the dataset are winning against smaller athletes via the heel hook.
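The counts behind these observations can be tabulated directly rather than read off the charts. A rough sketch on made-up rows follows; the `toy` frame is illustrative only, and the column names mirror those used in this notebook.

```python
import pandas as pd

# Made-up submission wins with the target labels used in this notebook
toy = pd.DataFrame({
    'sex': ['M'] * 6 + ['F'] * 2,
    'target': ['Neck', 'Neck', 'Neck', 'Lower body', 'Lower body', 'Upper body', 'Neck', 'Upper body'],
    'submission': ['RNC', 'RNC', 'Guillotine', 'Heel hook', 'Heel hook', 'Armbar', 'RNC', 'Armbar'],
})

# Number of submissions per target, split by gender (0 where a combination never occurs)
target_counts = toy.groupby(['sex', 'target']).size().unstack(fill_value=0)
print(target_counts)

# Most frequent neck attacks among male athletes
top_neck_male = (
    toy[(toy['sex'] == 'M') & (toy['target'] == 'Neck')]['submission']
    .value_counts()
    .head(3)
)
print(top_neck_male)
```

Adding `weight_class` to the groupby keys would give the per-category breakdown needed to compare the ABS division with the others.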