## Scraping Tapology.com for UFC

I want to scrape tapology.com for the bout information of UFC events. I'm looking to create a few data frames that I will save as CSV files for further exploration. I'll start by importing the following modules:

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

from bs4 import BeautifulSoup
import requests
import pandas as pd
import src

## Table: Events

I have the URL of a search results page that lists all of the events I want to look at. I'll parse the page with Beautiful Soup later on, but first I'm going to use pandas read_html so that I can pull in the whole list at once.

In [2]:
df_results = pd.read_html('https://www.tapology.com/search?term=ufc&mainSearchFilter=events')
len(df_results)

3

read_html returns a list of all the tables found in a web page, so I need to figure out which one I want. To do this, I'll check the info for each of the data frames. Looking at the web page, I can see it lists 864 results, so I'll choose the data frame with that many rows.

In [3]:
for df in df_results:
    print(df.info(),'\n\n\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 864 entries, 0 to 863
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Events (864)  864 non-null    object 
 1   Unnamed: 1    0 non-null      float64
 2   Name          727 non-null    object 
 3   Unnamed: 3    0 non-null      float64
 4   Date          864 non-null    object 
 5   Unnamed: 5    0 non-null      float64
 6   Bouts         864 non-null    int64  
dtypes: float64(3), int64(1), object(3)
memory usage: 47.4+ KB
None 



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Event       8 non-null      object 
 1   Unnamed: 1  0 non-null      float64
 2   Start Time  8 non-null      object 
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes
None 



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 
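
The same choice can also be made in code rather than by reading the info output. The following is a minimal sketch, not part of the original notebook: pick_table is a hypothetical helper, and the small frames are made-up stand-ins for the list that read_html returns.

```python
import pandas as pd


def pick_table(tables, expected_rows):
    """Return the first DataFrame whose row count equals expected_rows."""
    for table in tables:
        if len(table) == expected_rows:
            return table
    raise ValueError(f"no table with {expected_rows} rows")


# Made-up stand-ins for the list of tables returned by pd.read_html
tables = [
    pd.DataFrame({"Events": ["a", "b", "c"]}),
    pd.DataFrame({"Event": ["x"]}),
]

events = pick_table(tables, expected_rows=3)
print(events.shape)
```

With the real page, the call would be pick_table(df_results, 864), assuming the result count on the page has not changed since it was checked.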

It looks like the first one is the one I want, so let's make sure it has all the info I want.

In [4]:
df_results[0].head()

Unnamed: 0,Events (864),Unnamed: 1,Name,Unnamed: 3,Date,Unnamed: 5,Bouts
0,Contender Series 2020,,Week 10,,2020.08.25,,0
1,Contender Series 2020,,Week 9,,2020.08.18,,0
2,UFC Fight Night,,,,2020.08.15,,1
3,Contender Series 2020,,Week 8,,2020.08.11,,0
4,Contender Series 2020,,Week 7,,2020.08.04,,0


The only thing it's missing is a link to the event page. It also has a few null columns, so I'll drop those as well.

In [5]:
df_results = df_results[0]
df_results = df_results.dropna(axis=1,how='all')
df_results.head()

Unnamed: 0,Events (864),Name,Date,Bouts
0,Contender Series 2020,Week 10,2020.08.25,0
1,Contender Series 2020,Week 9,2020.08.18,0
2,UFC Fight Night,,2020.08.15,1
3,Contender Series 2020,Week 8,2020.08.11,0
4,Contender Series 2020,Week 7,2020.08.04,0


That's cleaner, but it is still missing the links, so let's add those in. First I'll take the first table on the page, then I'll find all of its tr elements and pull the href attribute from the link inside each one. Then I'll add that list to df_results under a new 'link' column.

In [6]:
page = requests.get('https://www.tapology.com/search?term=ufc&mainSearchFilter=events').text
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table')
rows = table.find_all('tr')
rows[0]

<tr>
<th class="lrgB" scope="col">Events (864)</th>
<th class="gutter" scope="col"> </th>
<th class="lrgA" scope="col">Name</th>
<th class="gutter" scope="col"> </th>
<th class="rightC" scope="col">Date</th>
<th class="gutter" scope="col"> </th>
<th class="smlD" scope="col">Bouts</th>
</tr>

The first row is the header, so I'll drop that and then use a list comprehension. First I want to see how I can access the 'href' attribute.

In [7]:
rows = rows[1:]

I'll find out how to get one link.

In [8]:
rows[0].find('a').get('href')

'/fightcenter/events/68353-contender-series-2020-week-10'
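
When this lookup is applied to every row, row.find('a') returns None for any row that has no link, and calling .get on None raises an AttributeError. Every row on this page has a link, so the notebook does not hit this. The sketch below shows one way to guard against it anyway; row_href is a hypothetical helper and the HTML fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second row deliberately has no anchor tag
html = """
<table>
  <tr><td><a href="/fightcenter/events/1-example">Example</a></td></tr>
  <tr><td>No link here</td></tr>
</table>
"""

rows = BeautifulSoup(html, "html.parser").find_all("tr")


def row_href(row):
    """Return the href of the first anchor in the row, or None if there is none."""
    anchor = row.find("a")
    return anchor.get("href") if anchor is not None else None


links = [row_href(row) for row in rows]
print(links)
```

Keeping a None placeholder for a linkless row means the list stays the same length as the table, which matters when it is assigned as a new column.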

Then I'll get the rest.

In [9]:
links = [row.find('a').get('href') for row in rows]
links[:5]

['/fightcenter/events/68353-contender-series-2020-week-10',
 '/fightcenter/events/68352-contender-series-2020-week-9',
 '/fightcenter/events/67159-ufc-on-espn',
 '/fightcenter/events/68351-contender-series-2020-week-8',
 '/fightcenter/events/68350-contender-series-2020-week-7']
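
These links are relative to the site root, so they need a base URL before they can be requested. A minimal sketch of that step, assuming the base is the same host as the search URL (BASE_URL is a name introduced here, not something from the notebook):

```python
from urllib.parse import urljoin

# Assumed site root, taken from the host of the search URL used above
BASE_URL = "https://www.tapology.com"

relative_links = [
    "/fightcenter/events/68353-contender-series-2020-week-10",
    "/fightcenter/events/67159-ufc-on-espn",
]

# urljoin resolves each site-relative path against the base URL
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links[0])
```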

In [10]:
df_results['link'] = links
print(df_results.info())
df_results.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 864 entries, 0 to 863
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Events (864)  864 non-null    object
 1   Name          727 non-null    object
 2   Date          864 non-null    object
 3   Bouts         864 non-null    int64 
 4   link          864 non-null    object
dtypes: int64(1), object(4)
memory usage: 33.9+ KB
None


Unnamed: 0,Events (864),Name,Date,Bouts,link
0,Contender Series 2020,Week 10,2020.08.25,0,/fightcenter/events/68353-contender-series-202...
1,Contender Series 2020,Week 9,2020.08.18,0,/fightcenter/events/68352-contender-series-202...
2,UFC Fight Night,,2020.08.15,1,/fightcenter/events/67159-ufc-on-espn
3,Contender Series 2020,Week 8,2020.08.11,0,/fightcenter/events/68351-contender-series-202...
4,Contender Series 2020,Week 7,2020.08.04,0,/fightcenter/events/68350-contender-series-202...


The next thing I want to do is clear out all of the non-UFC events. I'll cut out The Ultimate Fighter fights as well, because most of these, if not all, are exhibition matches. I'll probably need to use regular expressions for this.

In [11]:
import re

ufc = re.compile(r'^UFC')  # matches strings that start with UFC
contender = re.compile(r'^Contender')  # matches strings that start with Contender

for event in df_results['Events (864)'][:50]:
    if ufc.match(event) or contender.match(event):
        print('UFC')
    else:
        print('-----------', event, '-----------')

UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
----------- AUFC 25 -----------
UFC
UFC
UFC
UFC
UFC
UFC
----------- Davao Urban FC Fight Night 17 -----------
----------- AUFC 24 -----------
UFC
UFC
UFC
UFC
UFC
UFC
UFC
UFC
----------- Davao Urban FC Fight Night 16 -----------


Maybe I can use these regular expressions to create a mask that will filter out all non-UFC events.

In [12]:
def is_ufc(event_name):
    # True when the event name starts with 'UFC' or 'Contender'
    return bool(ufc.match(event_name) or contender.match(event_name))
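
The two patterns could also be folded into one. The sketch below is an alternative and not the notebook's own code: UFC_EVENT and is_ufc_event are names introduced here, and the sample names are taken from the output above.

```python
import re

# One pattern for both prefixes; re.match only matches at the start of the
# string, so no explicit ^ anchor is needed
UFC_EVENT = re.compile(r"(?:UFC|Contender)")


def is_ufc_event(event_name):
    """Return True when the event name starts with 'UFC' or 'Contender'."""
    return UFC_EVENT.match(event_name) is not None


names = [
    "UFC Fight Night",
    "Contender Series 2020",
    "AUFC 25",
    "Davao Urban FC Fight Night 17",
]
kept = [name for name in names if is_ufc_event(name)]
print(kept)
```

Because the match is anchored at the start, 'AUFC 25' is rejected even though it contains 'UFC'.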

In [13]:
mask = df_results['Events (864)'].map(is_ufc)

In [14]:
ufc_only = df_results[mask]
ufc_only.reset_index()  # no effect as written: the result is not assigned, so the original index is kept

print(ufc_only.info())
ufc_only.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 534 entries, 0 to 863
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Events (864)  534 non-null    object
 1   Name          517 non-null    object
 2   Date          534 non-null    object
 3   Bouts         534 non-null    int64 
 4   link          534 non-null    object
dtypes: int64(1), object(4)
memory usage: 25.0+ KB
None


Unnamed: 0,Events (864),Name,Date,Bouts,link
0,Contender Series 2020,Week 10,2020.08.25,0,/fightcenter/events/68353-contender-series-202...
1,Contender Series 2020,Week 9,2020.08.18,0,/fightcenter/events/68352-contender-series-202...
2,UFC Fight Night,,2020.08.15,1,/fightcenter/events/67159-ufc-on-espn
3,Contender Series 2020,Week 8,2020.08.11,0,/fightcenter/events/68351-contender-series-202...
4,Contender Series 2020,Week 7,2020.08.04,0,/fightcenter/events/68350-contender-series-202...


Seems like it worked. I'm going to export this as a CSV and save it for later. Before I do that, I'm just going to rename the columns.

In [15]:
ufc_only.columns = ['event', 'name', 'date', 'bouts', 'link']

In [12]:
#ufc_only.to_csv('events_ufc_tapology.csv')

## Function testing

Disclaimer: the following function combines modifications used in 02_mef_tapology_scrape, so its output will differ slightly from the table built above.

In [25]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
search_link = 'https://www.tapology.com/search?term=ufc&mainSearchFilter=events'

html_content = requests.get(search_link, headers=headers).content

src.get_ufc_events(html_content)

Unnamed: 0,event,name,date,bouts,link
20,UFC Fight Night,Woodley vs. Burns,2020-05-30,11,/fightcenter/events/69127-ufc-fight-night
21,UFC Fight Night,Overeem vs. Harris,2020-05-16,11,/fightcenter/events/67412-ufc-on-espn-33
22,UFC Fight Night,Smith vs. Teixeira,2020-05-13,10,/fightcenter/events/69126-ufc-fight-night
23,UFC 249,Ferguson vs. Gaethje,2020-05-09,11,/fightcenter/events/66312-ufc-250
30,UFC on ESPN+ 28,Lee vs. Oliveira,2020-03-14,12,/fightcenter/events/64600-ufc-on-espn-26
...,...,...,...,...,...
859,UFC 5,Return of the Beast,1995-04-07,10,/fightcenter/events/ufc-5-return-of-the-beast
860,UFC 4,Revenge of the Warriors,1994-12-16,10,/fightcenter/events/ufc-4-revenge-of-the-warriors
861,UFC 3,The American Dream,1994-09-09,6,/fightcenter/events/ufc-3-the-american-dream
862,UFC 2,No Way Out,1994-03-11,15,/fightcenter/events/ufc-2-no-way-out
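
One visible difference is the date format: the search table uses dotted dates such as 2020.08.25, while this output uses ISO dates such as 2020-05-30. The body of src.get_ufc_events is not shown here, so the following is only a sketch of how such a conversion could be done; to_iso_date is a hypothetical helper.

```python
from datetime import datetime


def to_iso_date(dotted):
    """Convert a 'YYYY.MM.DD' string to an ISO 'YYYY-MM-DD' string."""
    return datetime.strptime(dotted, "%Y.%m.%d").date().isoformat()


print(to_iso_date("2020.08.25"))
```

Parsing with strptime rather than replacing the dots also rejects malformed dates with a ValueError.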
