# Summary
Some of my data questions about MMA couldn't be answered with the Kaggle dataset alone. To expand on it, I scraped the primary fighting style of each fighter from ESPN fighter profile pages. These are the steps I took:
1. Collect URLs for ESPN fighter profiles through a Google search
    - Prepare Google search URLs through pandas string manipulation
    - Scrape the Google search results using requests and Scrapy's Selector
2. Collect data from the ESPN fighter profile pages
    - Scrape each ESPN page using requests and Scrapy's Selector
    - Store as a dataframe and merge with the original dataset to prepare for analysis

![screenshot](assets/annotated_collage.png)

In [1]:
import pandas as pd
from datetime import datetime
import time
import random
import requests
from scrapy import Selector

# 1. Scrape for ESPN URLs

## Prepare search URLs

In [2]:
df = pd.read_csv('data/ufc_fighter_data.csv')
print(df.info())
print(df.shape)
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4107 entries, 0 to 4106
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   fighter_id          4107 non-null   int64  
 1   fighter_f_name      4107 non-null   object 
 2   fighter_l_name      4092 non-null   object 
 3   fighter_nickname    2250 non-null   object 
 4   fighter_height_cm   3797 non-null   float64
 5   fighter_weight_lbs  4020 non-null   float64
 6   fighter_reach_cm    2166 non-null   float64
 7   fighter_stance      3273 non-null   object 
 8   fighter_dob         3349 non-null   object 
 9   fighter_w           4107 non-null   int64  
 10  fighter_l           4107 non-null   int64  
 11  fighter_d           4107 non-null   int64  
 12  fighter_nc_dq       482 non-null    float64
 13  fighter_url         4107 non-null   object 
dtypes: float64(4), int64(4), object(6)
memory usage: 449.3+ KB
None
(4107, 14)


| | fighter_id | fighter_f_name | fighter_l_name | fighter_nickname | fighter_height_cm | fighter_weight_lbs | fighter_reach_cm | fighter_stance | fighter_dob | fighter_w | fighter_l | fighter_d | fighter_nc_dq | fighter_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4107 | Tom | Aaron | | | 155.0 | | | 1978-07-13 | 5 | 3 | 0 | | http://ufcstats.com/fighter-details/93fe7332d1... |
| 1 | 4106 | Danny | Abbadi | The Assassin | 180.34 | 155.0 | | Orthodox | 1983-07-03 | 4 | 6 | 0 | | http://ufcstats.com/fighter-details/15df64c02b... |
| 2 | 4105 | Nariman | Abbasov | Bayraktar | 172.72 | 155.0 | 167.64 | Orthodox | 1994-02-01 | 28 | 4 | 0 | | http://ufcstats.com/fighter-details/59a9d6dac6... |
| 3 | 4104 | David | Abbott | Tank | 182.88 | 265.0 | | Switch | | 10 | 15 | 0 | | http://ufcstats.com/fighter-details/b361180739... |
| 4 | 4103 | Hamdy | Abdelwahab | The Hammer | 187.96 | 264.0 | 182.88 | Southpaw | 1993-01-22 | 5 | 0 | 0 | 1.0 | http://ufcstats.com/fighter-details/3329d692ae... |


In [3]:
df = df[['fighter_id', 'fighter_f_name', 'fighter_l_name', 'fighter_nickname', 'fighter_dob']].copy()
print(df.isna().sum())

fighter_id             0
fighter_f_name         0
fighter_l_name        15
fighter_nickname    1857
fighter_dob          758
dtype: int64


In [4]:
# Replace missing values with empty strings so the columns can be joined
# (assigning back avoids chained in-place fillna on a column)
str_cols = ['fighter_l_name', 'fighter_nickname', 'fighter_dob']
df[str_cols] = df[str_cols].fillna('')

# Join the name columns with '+' to build the search query string
df['fighter_name_url'] = df[['fighter_f_name', 'fighter_l_name', 'fighter_nickname']].apply('+'.join, axis=1)
# Replace any remaining spaces with '+' to format as url
df['fighter_name_url'] = df['fighter_name_url'].str.replace(' ', '+')

# Create full name column to merge with later
df['fighter_name'] = df['fighter_f_name'] + ' ' + df['fighter_l_name']
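
Joining with `'+'` handles spaces, but characters such as apostrophes or accented letters in a name are passed through unescaped. Below is a minimal sketch of an alternative that percent-encodes the query with the standard library's `urllib.parse.quote_plus`. The helper name `build_search_url` and the constant `URL_START` are hypothetical and not part of the notebook.

```python
from urllib.parse import quote_plus

URL_START = 'https://www.google.com/search?q='  # assumed base URL for the search


def build_search_url(first, last='', nickname=''):
    """Build a Google search URL for one fighter (hypothetical helper)."""
    # Keep only the non-empty name parts so no trailing '+' is produced
    parts = [p for p in (first, last, nickname) if p]
    query = ' '.join(['espn', 'fighter'] + parts)
    # quote_plus turns spaces into '+' and percent-encodes reserved characters
    return URL_START + quote_plus(query)


print(build_search_url('Danny', 'Abbadi', 'The Assassin'))
# https://www.google.com/search?q=espn+fighter+Danny+Abbadi+The+Assassin
```

Unlike the column join above, this variant skips empty parts, so a fighter without a nickname does not end with a dangling `+`.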

One name is duplicated, and neither fighter has a nickname to tell them apart:

In [5]:
df[df['fighter_name_url'].duplicated(keep=False)]

| | fighter_id | fighter_f_name | fighter_l_name | fighter_nickname | fighter_dob | fighter_name_url | fighter_name |
|---|---|---|---|---|---|---|---|
| 1740 | 2367 | Tony | Johnson | | 1983-05-02 | Tony+Johnson+ | Tony Johnson |
| 1748 | 2359 | Tony | Johnson | | | Tony+Johnson+ | Tony Johnson |


Append the date of birth (where available) to the search string of these rows so they no longer collide:

In [6]:
# Compute the duplicate mask once, then append the date of birth to those rows
dup_mask = df["fighter_name_url"].duplicated(keep=False)
df.loc[dup_mask, "fighter_name_url"] = (
    df.loc[dup_mask, "fighter_name_url"] + df.loc[dup_mask, "fighter_dob"]
)

Confirm that there are no more duplicated search strings:

In [7]:
df['fighter_name_url'].duplicated(keep=False).sum()

0

In [8]:
# Store column as list
fighter_name_lst = df['fighter_name_url'].to_list()

## Scrape ESPN URLs from Google search results

In [9]:
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
}
url_start = 'https://www.google.com/search?q=espn+fighter+'

# Create empty data frame
out_df = pd.DataFrame(columns=['fighter_name_url', 'search_url', 'espn_url'])

# Scrape the ESPN URL from the first search result for each fighter name + nickname in the list
for n in fighter_name_lst:
    url = url_start + n
    html = requests.get(url, headers=headers).content
    sel = Selector(text=html)
    
    # Select the link of the first result (position-based XPath, so it depends on Google's current markup)
    espn_url = sel.xpath(
        '//*[@id="rso"]/div[1]/div/div/div[1]/div/div/span/a/@href'
    ).extract_first()
    
    # Append the result as a new row of the dataframe
    new_row = pd.DataFrame(
        [[n, url, espn_url]], 
        columns=['fighter_name_url', 'search_url', 'espn_url']
    )
    
    out_df = pd.concat([out_df, new_row])
    
    # Sleep for a random interval between 1 and 4 seconds before the next request
    time.sleep(random.uniform(1.0, 4.0))

KeyboardInterrupt: 
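
The `KeyboardInterrupt` above shows that the loop was stopped by hand before it finished. A minimal sketch of a loop that keeps whatever was scraped up to that point is given below. It collects rows in a plain list, which also avoids calling `pd.concat` on every iteration. The function `scrape_all` and its `fetch` argument are hypothetical: `fetch` stands in for the request and XPath step and is assumed to return a `(search_url, espn_url)` pair.

```python
import random
import time


def scrape_all(names, fetch, delay_range=(1.0, 4.0)):
    """Call fetch(name) for each name and return the rows collected so far.

    `fetch` is a hypothetical callable returning (search_url, espn_url).
    """
    rows = []
    try:
        for name in names:
            search_url, espn_url = fetch(name)
            rows.append({
                'fighter_name_url': name,
                'search_url': search_url,
                'espn_url': espn_url,
            })
            # Random pause between requests
            time.sleep(random.uniform(*delay_range))
    except KeyboardInterrupt:
        # Stop early but keep everything scraped before the interruption
        print(f'Interrupted after {len(rows)} of {len(names)} names')
    return rows
```

The result can then be turned into a dataframe in one step, for example `out_df = pd.DataFrame(scrape_all(fighter_name_lst, fetch))`.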

In [10]:
out_df = out_df.reset_index(drop=True)

Merge dataframe with original to get fighter_id

In [11]:
out_df = df[['fighter_id', 'fighter_name', 'fighter_name_url']].merge(out_df, how='left', on='fighter_name_url')
print(out_df['espn_url'].notna().sum())
out_df.head()

| | fighter_id | fighter_name | fighter_name_url | search_url | espn_url |
|---|---|---|---|---|---|
| 0 | 4107 | Tom Aaron | Tom+Aaron+ | https://www.google.com/search?q=espn+fighter+T... | http://www.espn.com.au/mma/fighter/_/id/250499... |
| 1 | 4106 | Danny Abbadi | Danny+Abbadi+The+Assassin | https://www.google.com/search?q=espn+fighter+D... | https://www.espn.com/mma/fighter/_/id/2556806/... |
| 2 | 4105 | Nariman Abbasov | Nariman+Abbasov+Bayraktar | https://www.google.com/search?q=espn+fighter+N... | https://www.espn.com.au/mma/fighter/_/id/42948... |
| 3 | 4104 | David Abbott | David+Abbott+Tank | https://www.google.com/search?q=espn+fighter+D... | https://www.espn.com/mma/fighter/_/id/2354050/... |
| 4 | 4103 | Hamdy Abdelwahab | Hamdy+Abdelwahab+The+Hammer | https://www.google.com/search?q=espn+fighter+H... | |
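
The merge on `fighter_name_url` relies on each search string appearing only once on both sides, which is why the duplicated name was disambiguated earlier. As a sketch, pandas can enforce that assumption with the `validate` argument of `merge`. The two small frames below are toy stand-ins with made-up values, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for df and out_df (values are illustrative only)
left = pd.DataFrame({
    'fighter_id': [1, 2],
    'fighter_name_url': ['A+B+', 'C+D+X'],
})
right = pd.DataFrame({
    'fighter_name_url': ['A+B+'],
    'espn_url': ['https://example.com/a'],
})

# validate='one_to_one' raises pandas.errors.MergeError if a key repeats on either side
merged = left.merge(right, how='left', on='fighter_name_url', validate='one_to_one')

# Count how many fighters received a URL
print(merged['espn_url'].notna().sum())
```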


### Write dataframe to CSV file

In [None]:
dt = datetime.now().strftime('%Y%m%d-%H%M%S')
filepath = f'out/espn_urls_{dt}.csv'
out_df.to_csv(filepath)
print(filepath)

out/espn_urls_20231204-205849.csv
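
Writing to `out/` fails if that directory does not exist yet. A small sketch that creates the directory first and builds the same timestamped file name is shown below. The helper `timestamped_path` is hypothetical and not part of the notebook.

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(out_dir, prefix, now=None):
    """Return out_dir/prefix_YYYYmmdd-HHMMSS.csv, creating out_dir if needed."""
    out = Path(out_dir)
    # Create the output directory (and any parents) if it is missing
    out.mkdir(parents=True, exist_ok=True)
    stamp = (now or datetime.now()).strftime('%Y%m%d-%H%M%S')
    return out / f'{prefix}_{stamp}.csv'
```

It could be used as `out_df.to_csv(timestamped_path('out', 'espn_urls'), index=False)`. Passing `index=False` is optional, but it keeps the row index out of the file, so reading it back with `pd.read_csv` does not produce an `Unnamed: 0` column.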


# 2. Scrape data from ESPN URLs

In [12]:
espn_url_df = out_df[['fighter_name', 'espn_url']].dropna()
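
The first search result is not guaranteed to be an ESPN fighter profile. As a rough sketch, the URLs could be checked against the `/mma/fighter/_/id/<number>` pattern seen in the output above before they are requested. The regular expression and the helper `espn_fighter_id` are assumptions based only on those visible URLs, and the example URLs in the usage lines use made-up ids and slugs.

```python
import re

# Assumed pattern: espn.com or a two-letter regional variant such as espn.com.au
FIGHTER_URL_RE = re.compile(
    r'^https?://www\.espn\.com(?:\.[a-z]{2})?/mma/fighter/_/id/(\d+)'
)


def espn_fighter_id(url):
    """Return the numeric ESPN fighter id as a string, or None if the URL does not match."""
    if not isinstance(url, str):
        # Missing values (None or NaN) are not profile URLs
        return None
    match = FIGHTER_URL_RE.match(url)
    return match.group(1) if match else None


print(espn_fighter_id('https://www.espn.com/mma/fighter/_/id/12345/example-fighter'))  # 12345
print(espn_fighter_id('https://www.google.com/search?q=espn'))  # None
```

Rows where the helper returns `None` could be dropped or reviewed by hand before scraping.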

In [13]:
# Create empty data frame
p_style_df = pd.DataFrame(columns=["fighter_name_espn", "primary_style", "fighter_name", "espn_url"])

# define user-agent
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}

# scraper for each url in espn_url_df
for index, row in espn_url_df.iterrows():
    espn_url = row["espn_url"]
    fighter_name = row["fighter_name"]

    html = requests.get(espn_url, headers=headers).content
    sel = Selector(text=html)

    # Extract scraped names as a list (expecting 2 per scrape)
    names_espn = sel.xpath(
        '//*[@id="fittPageContainer"]/div[2]/div[5]/div/div/section[1]//td[1]/text()'
    ).extract()[0:2]

    # Extract scraped primary styles as a list (expecting 2 per scrape)
    styles = sel.xpath(
        '//*[@id="fittPageContainer"]/div[2]/div[5]/div/div/section[1]//td[2]/text()'
    ).extract()[0:2]

    # Append one row per (name, style) pair to the dataframe
    for name_espn, style in zip(names_espn, styles):
        new_row = pd.DataFrame(
            [[name_espn, style, fighter_name, espn_url]],
            columns=["fighter_name_espn", "primary_style", "fighter_name", "espn_url"],
        )
        p_style_df = pd.concat([p_style_df, new_row])

    # Use random.uniform to generate a random float within interval
    sleep_time = random.uniform(0.5, 2.0)
    time.sleep(sleep_time)  # Sleep for the randomly generated interval

p_style_df = p_style_df.reset_index(drop=True)
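
The next cell keeps a row only when the ESPN name equals the dataset name exactly, so differences in case, accents, or stray whitespace would discard otherwise valid matches. A minimal sketch of a more forgiving comparison is shown below. The helper `normalize_name` is hypothetical, and the names in the usage line are illustrative only.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents, and collapse whitespace (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', name)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold gives a case-insensitive form; split/join collapses whitespace
    return ' '.join(no_accents.casefold().split())


print(normalize_name('José  Aldo ') == normalize_name('jose aldo'))  # True
```

Applied to both name columns with `.map(normalize_name)`, this would let the equality filter compare normalized names instead of raw strings.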

In [15]:
# Select rows where the fighter name scraped from ESPN matches the fighter name used to scrape
p_style_df = p_style_df.loc[
    p_style_df['fighter_name_espn'] == p_style_df['fighter_name']
]

# Merge to get fighter_id
p_style_df = out_df[['fighter_id', 'espn_url']].merge(p_style_df, how='left', on='espn_url')
p_style_df = p_style_df.dropna(subset=['fighter_name_espn'])

display(p_style_df)

| | fighter_id | espn_url | fighter_name_espn | primary_style | fighter_name |
|---|---|---|---|---|---|
| 2 | 4105 | https://www.espn.com.au/mma/fighter/_/id/42948... | Nariman Abbasov | - | Nariman Abbasov |
| 5 | 4102 | https://www.espn.com/mma/fighter/_/id/2558062/... | Shamil Abdurakhimov | Wrestling | Shamil Abdurakhimov |
| 7 | 4100 | https://www.espn.com/mma/fighter/_/id/4046608/... | Daichi Abe | Striker | Daichi Abe |
| 13 | 4094 | https://www.espn.com.au/mma/fighter/_/id/50684... | John Adajar | - | John Adajar |
| 15 | 4092 | https://www.espn.com/mma/fighter/_/id/4292650/... | Juan Adams | Striker | Juan Adams |
| 16 | 4091 | https://www.espn.com/mma/fighter/_/id/4359111/... | Anthony Adams | - | Anthony Adams |
| 17 | 4090 | https://www.espn.com/mma/fighter/_/id/4683807/... | Zarrukh Adashev | Striker, Kick Boxing | Zarrukh Adashev |
| 27 | 4080 | https://www.espn.com/mma/fighter/_/id/5074274/... | Jesus Aguilar | Freestyle | Jesus Aguilar |
| 34 | 4073 | https://www.espn.com/mma/fighter/_/id/2354127/... | Yoshihiro Akiyama | Judo | Yoshihiro Akiyama |
| 42 | 4065 | https://www.espn.com.au/mma/fighter/_/id/46848... | Amir Albazi | Jiu-Jitsu | Amir Albazi |
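
In the output above, `primary_style` holds `-` for some fighters and a comma-separated list for others (for example `Striker, Kick Boxing`). A minimal sketch for cleaning these values before analysis is shown below. It assumes that `-` means no style is listed, and the helper `parse_styles` is hypothetical.

```python
def parse_styles(raw):
    """Split a raw primary_style string into a list of styles (hypothetical helper)."""
    # Assumption: '-' (or an empty value) means no style is listed
    if raw is None or raw.strip() in ('', '-'):
        return []
    # Split on commas and trim the whitespace around each style
    return [s.strip() for s in raw.split(',') if s.strip()]


print(parse_styles('Striker, Kick Boxing'))  # ['Striker', 'Kick Boxing']
print(parse_styles('-'))                     # []
```

Keeping the result as a list makes it possible to count fighters per style even when a fighter lists more than one.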


### Write dataframe to CSV file

In [None]:
dt = datetime.now().strftime('%Y%m%d-%H%M%S')
p_style_df.to_csv(f'out/p_style_df{dt}.csv')