# Scope

This notebook focuses on scraping individual match statistics from an event page on tapology.com. It takes a Tapology event page as input and outputs a dataframe containing the stats for each individual match.

## Which events?

I don't want to look at every UFC event right now, so I'll just check how many events happened in the last 10 years or so. First I'm going to convert the date column to datetime.

In [1]:
%load_ext autoreload
%autoreload 2

from bs4 import BeautifulSoup
import requests
import pandas as pd
import src

previous_ufc = pd.read_csv('previous_ufc.csv', index_col = 0)

In [2]:
previous_ufc.head()

Unnamed: 0,event,name,date,bouts,link
0,UFC Fight Night,Woodley vs. Burns,2020.05.30,11,/fightcenter/events/69127-ufc-fight-night
1,UFC Fight Night,Overeem vs. Harris,2020.05.16,11,/fightcenter/events/67412-ufc-on-espn-33
2,UFC Fight Night,Smith vs. Teixeira,2020.05.13,10,/fightcenter/events/69126-ufc-fight-night
3,UFC 249,Ferguson vs. Gaethje,2020.05.09,11,/fightcenter/events/66312-ufc-250
4,UFC on ESPN+ 32 (cancelled),,2020.05.02,0,/fightcenter/events/67068-ufc-on-espn-32


In [3]:
previous_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 514 entries, 0 to 513
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   event   514 non-null    object
 1   name    501 non-null    object
 2   date    514 non-null    object
 3   bouts   514 non-null    int64 
 4   link    514 non-null    object
dtypes: int64(1), object(4)
memory usage: 24.1+ KB


In [4]:
previous_ufc['date'] = pd.to_datetime(previous_ufc['date'])

In [5]:
previous_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 514 entries, 0 to 513
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   event   514 non-null    object        
 1   name    501 non-null    object        
 2   date    514 non-null    datetime64[ns]
 3   bouts   514 non-null    int64         
 4   link    514 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 24.1+ KB
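The dates in the CSV use a `YYYY.MM.DD` layout, so passing an explicit format is one way to avoid relying on pandas to infer it. This is a minimal sketch on a hand-made sample, not the real file; the `dates` series is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the 'date' column in previous_ufc.csv
dates = pd.Series(['2020.05.30', '2020.05.16', '1993.11.12'])

# An explicit format removes any ambiguity about which field is the month
parsed = pd.to_datetime(dates, format='%Y.%m.%d')

print(parsed.dt.year.tolist())
```

If a value does not match the format, `pd.to_datetime` raises by default, which makes malformed rows easy to spot.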


I also want to remove any event that got cancelled, so let's drop every event with zero bouts.

In [6]:
previous_ufc = previous_ufc[previous_ufc['bouts']>0]
previous_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 513
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   event   500 non-null    object        
 1   name    494 non-null    object        
 2   date    500 non-null    datetime64[ns]
 3   bouts   500 non-null    int64         
 4   link    500 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 23.4+ KB


How many bouts are in here total? 

In [7]:
previous_ufc.bouts.sum()

5375

Now let's see how these datetime objects work.

In [8]:
#previous_ufc.date >2010

In [9]:
previous_ufc.date

0     2020-05-30
1     2020-05-16
2     2020-05-13
3     2020-05-09
10    2020-03-14
         ...    
509   1995-04-07
510   1994-12-16
511   1994-09-09
512   1994-03-11
513   1993-11-12
Name: date, Length: 500, dtype: datetime64[ns]

In [10]:
minimum = pd.to_datetime('2012-1-1')

In [11]:
recent_ufc = previous_ufc[previous_ufc['date'] > minimum]
recent_ufc.head()

Unnamed: 0,event,name,date,bouts,link
0,UFC Fight Night,Woodley vs. Burns,2020-05-30,11,/fightcenter/events/69127-ufc-fight-night
1,UFC Fight Night,Overeem vs. Harris,2020-05-16,11,/fightcenter/events/67412-ufc-on-espn-33
2,UFC Fight Night,Smith vs. Teixeira,2020-05-13,10,/fightcenter/events/69126-ufc-fight-night
3,UFC 249,Ferguson vs. Gaethje,2020-05-09,11,/fightcenter/events/66312-ufc-250
10,UFC on ESPN+ 28,Lee vs. Oliveira,2020-03-14,12,/fightcenter/events/64600-ufc-on-espn-26
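A short sketch of the two comparisons tried above may help. A datetime column can be compared directly against a `Timestamp`, but comparing it against a bare year such as `2010` raises a `TypeError`; the `.dt.year` accessor is one way around that. The `events` frame here is a small made-up sample, not the scraped data.

```python
import pandas as pd

# Hypothetical sample with the same 'date' dtype as previous_ufc
events = pd.DataFrame({
    'event': ['UFC Fight Night', 'UFC 141', 'UFC 5'],
    'date': pd.to_datetime(['2020-05-30', '2011-12-30', '1995-04-07']),
})

# Comparing against a Timestamp works element-wise
cutoff = pd.Timestamp('2012-01-01')
recent = events[events['date'] > cutoff]

# Comparing against a year number needs the .dt accessor;
# `events['date'] > 2010` would raise a TypeError
after_2010 = events[events['date'].dt.year > 2010]

print(recent['event'].tolist())
print(after_2010['event'].tolist())
```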


All fights from 2012 on should work.

In [61]:
recent_ufc.bouts.sum()
recent_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 321 entries, 0 to 333
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   event   321 non-null    object        
 1   name    320 non-null    object        
 2   date    321 non-null    datetime64[ns]
 3   bouts   321 non-null    int64         
 4   link    321 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 25.0+ KB


## Table: fighter_instances
### Scraping matches

Now that I have my time frame, I want to take the first event and see how I can scrape it. I already have a list of events and their respective links. I think I can build a pretty accurate dataframe by using hierarchical indexing, where the first index level references the bout link and the second level references the winner or loser.
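To make the hierarchical indexing idea concrete, here is a rough sketch of what such a frame could look like. The level names `bout_link` and `result`, the placeholder link, and the column names are all assumptions for illustration; the final table may be shaped differently.

```python
import pandas as pd

# Hypothetical rows: two fighters for one bout; the link is a placeholder
bout = '/fightcenter/bouts/example-bout'
index = pd.MultiIndex.from_tuples(
    [(bout, 'winner'), (bout, 'loser')],
    names=['bout_link', 'result'],
)
fighters = pd.DataFrame(
    {'fighter': ['Gilbert Burns', 'Tyron Woodley'],
     'record_after': ['19-3', '19-5']},
    index=index,
)

# Select one side of one bout with a tuple key
winner = fighters.loc[(bout, 'winner')]

# Or pull every winner across all bouts with a cross-section
all_winners = fighters.xs('winner', level='result')

print(winner['fighter'])
```

With this layout, each bout contributes exactly two rows, and either side can be selected without filtering on a separate column.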

In [13]:
link = recent_ufc.loc[0].link

In [14]:
event_url = 'https://www.tapology.com'+link

# set up the request headers and parse the page with Beautiful Soup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
response = requests.get(event_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# create the fightCard dataframe
fightCard_df = src.create_bouts_table(soup, link)
fightCard_df

Unnamed: 0,method,length,order,fighter_1,record_1,bout_type,weight,scheduled_rounds,fighter_2,record_2,link,event_link
0,"Decision, Unanimous","5 Rounds, 25:00 Total",11,Gilbert Burns,Climbed to 19-3,Main Event,170,5 x 5,Tyron Woodley,Fell to 19-5,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/events/69127-ufc-fight-night
1,"Decision, Split","3 Rounds, 15:00 Total",10,Augusto Sakai,Climbed to 15-1,Co-Main Event,265,3 x 5,Blagoy Ivanov,Fell to 18-4,/fightcenter/bouts/501441-ufc-fight-night-blag...,/fightcenter/events/69127-ufc-fight-night
2,"Decision, Unanimous","3 Rounds, 15:00 Total",9,Billy Quarantillo,Climbed to 14-2,Main Card,150,3 x 5,Spike Carlyle,Fell to 9-2,/fightcenter/bouts/502990-ufc-fight-night-bill...,/fightcenter/events/69127-ufc-fight-night
3,"Submission, Rear Naked Choke","3:26 Round 2 of 3, 8:26 Total",8,Roosevelt Roberts,Climbed to 10-1,Main Card,155,3 x 5,Brok Weaver,Fell to 15-5,/fightcenter/bouts/502972-ufc-fight-night-roos...,/fightcenter/events/69127-ufc-fight-night
4,"Submission, Kneebar",2:36 Round 1 of 3,7,Mackenzie Dern,Climbed to 8-1,Main Card,115,3 x 5,Hannah Cifers,Fell to 10-5,/fightcenter/bouts/500998-ufc-fight-night-mack...,/fightcenter/events/69127-ufc-fight-night
5,"Decision, Unanimous","3 Rounds, 15:00 Total",6,Katlyn Chookagian,Climbed to 14-3,Prelim,125,3 x 5,Antonina Shevchenko,Fell to 8-2,/fightcenter/bouts/502988-ufc-fight-night-katl...,/fightcenter/events/69127-ufc-fight-night
6,"Decision, Unanimous","3 Rounds, 15:00 Total",5,Daniel Rodriguez,Climbed to 12-1,Prelim,170,3 x 5,Gabe Green,Fell to 9-3,/fightcenter/bouts/503985-ufc-fight-night-dani...,/fightcenter/events/69127-ufc-fight-night
7,"KO/TKO, Knee to the Body to Ground and Pound",1:51 Round 1 of 3,4,Jamahal Hill,Climbed to 8-0,Prelim,205,3 x 5,Klidson Abreu,Fell to 15-5,/fightcenter/bouts/500971-ufc-fight-night-jama...,/fightcenter/events/69127-ufc-fight-night
8,"Submission, Arm Triangle Choke","3:18 Round 2 of 3, 8:18 Total",3,Brandon Royval,Climbed to 11-4,Prelim,125,3 x 5,Tim Elliott,Fell to 15-11,/fightcenter/bouts/502983-ufc-fight-night-tim-...,/fightcenter/events/69127-ufc-fight-night
9,"Submission, One-Arm Guillotine Choke",3:03 Round 1 of 3,2,Casey Kenney,Climbed to 14-2,Prelim,135,3 x 5,Louis Smolka,Fell to 16-7,/fightcenter/bouts/502982-ufc-fight-night-case...,/fightcenter/events/69127-ufc-fight-night


### First bout scrape

In [15]:
first_bout = fightCard_df.loc[0]

In [16]:
first_link = fightCard_df.loc[0].link
bout_url = 'https://www.tapology.com'+first_link
response = requests.get(bout_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

Let's see if we can find a class for the list we want on this page, using Chrome's developer tools.

Looking at the page, I found that the info I want is under the boutComparisonTable class. So let's pull that first.

In [17]:
bout_table = soup.find_all(class_='boutComparisonTable')

Actually, I can just pull the table in with pandas, so let's try that first.

In [18]:
tables = pd.read_html(response.content)

In [19]:
len(tables)

3

In [20]:
tables[0]

Unnamed: 0,0,1,2,3,4
0,18-3-0,,Pro Record At Fight,,19-4-1
1,Climbed to 19-3,,Record After Fight,,Fell to 19-5
2,+150 (Slight Underdog),,Betting Odds,,-185 (Slight Favorite)
3,Brazil,,Nationality,,United States
4,"Boca Raton, Florida",,Fighting out of,,"St. Louis, Missouri"
5,"33 years, 10 months, 3 days",,Age at Fight,,"38 years, 1 month, 6 days"
6,170.5 lbs (77.3 kgs),,Weigh-In Result,,170.5 lbs (77.3 kgs)
7,"5'10"" (178cm)",,Height,,"5'9"" (176cm)"
8,"71.0"" (180cm)",,Reach,,"74.0"" (188cm)"
9,Blackzilians,,Gym,,American Top Team


Now we should be able to pivot it so that each fighter becomes a row and each stat becomes a column.

In [21]:
bout_stats = tables[0]
bout_stats.columns

Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [22]:
bout_stats_pivot = pd.DataFrame([list(bout_stats[0]), list(bout_stats[4])])

In [23]:
bout_stats_pivot.columns = list(bout_stats[2])

In [24]:
bout_stats_pivot.reset_index(inplace=True, drop=True)

In [25]:
bout_stats_pivot

Unnamed: 0,Pro Record At Fight,Record After Fight,Betting Odds,Nationality,Fighting out of,Age at Fight,Weigh-In Result,Height,Reach,Gym
0,18-3-0,Climbed to 19-3,+150 (Slight Underdog),Brazil,"Boca Raton, Florida","33 years, 10 months, 3 days",170.5 lbs (77.3 kgs),"5'10"" (178cm)","71.0"" (180cm)",Blackzilians
1,19-4-1,Fell to 19-5,-185 (Slight Favorite),United States,"St. Louis, Missouri","38 years, 1 month, 6 days",170.5 lbs (77.3 kgs),"5'9"" (176cm)","74.0"" (188cm)",American Top Team
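The same reshaping can also be written as a single chained expression: set the label column as the index, keep the two fighter columns, and transpose. This is only an alternative sketch on a cut-down copy of the table, not a replacement for the cells above.

```python
import pandas as pd

# A cut-down copy of the comparison table: fighter 1 in column 0,
# the stat label in column 2, fighter 2 in column 4
bout_stats = pd.DataFrame({
    0: ['18-3-0', 'Brazil'],
    1: [None, None],
    2: ['Pro Record At Fight', 'Nationality'],
    3: [None, None],
    4: ['19-4-1', 'United States'],
})

# Labels become the index, the fighter columns are kept, then rows and columns swap
wide = bout_stats.set_index(2)[[0, 4]].T.reset_index(drop=True)
wide.columns.name = None  # drop the leftover name inherited from column 2

print(wide)
```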


## Table: Events (addition)
I'm going to pause the fighter instances scrape and work on making the events scrape more thorough. First I want the events dataframe available. Then I'm going to open the first event.

In [26]:
df_events = pd.read_csv('previous_ufc.csv')
first_event = df_events.loc[0]

I'm going to open the page and check the first list with the event info:

In [27]:
soup = BeautifulSoup(src.open_tapology_link(first_event.link))

In [28]:
div = soup.find(class_="details details_with_poster clearfix") #grab top header
event_info_elem = div.find('ul') #grab first list in top header

info_list = event_info_elem.get_text().split('\n\n\n')#turn it into text and split into list


new_list = [] # fields with no info lack a colon followed by a newline, so I'll drop those
for item in info_list:
    if ':\n' in item:
        new_list.append(item)

info_list = '\n'.join(new_list).split('\n')
info_list

['',
 'Saturday 05.30.2020 at 06:00 PM ET',
 '',
 'U.S. Broadcast:',
 'ESPN',
 '',
 'Name:',
 'UFC Fight Night: Woodley vs. Burns',
 'Also Known As:',
 'UFC Fight Night APEX',
 'Promotion:',
 '',
 'Ultimate Fighting Championship',
 '',
 'Ownership:',
 'Endeavor',
 'Venue:',
 'UFC APEX',
 'Location:',
 'Las Vegas, Nevada, United States',
 'Enclosure:',
 'Octagon',
 'TV Announcers:',
 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier',
 'Ring Announcer:',
 'Joe Martinez',
 'Post-Fight Interviews:',
 'Daniel Cormier',
 'TV Ratings:',
 '1.02M avg. viewers (615k ESPN prelims)',
 'MMA Bouts:',
 '11']
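Splitting the flattened text on newlines works, but it depends on the exact whitespace in the page. One alternative is to walk the list items and read the label and value from their own tags. The sketch below assumes each `<li>` holds a `<strong>` label and a `<span>` value; that markup is hypothetical and the real Tapology page may nest things differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page structure may differ
html = """
<ul>
  <li><strong>Venue:</strong> <span>UFC APEX</span></li>
  <li><strong>Enclosure:</strong> <span>Octagon</span></li>
  <li><strong>Attendance:</strong></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

event_info = {}
for li in soup.find('ul').find_all('li'):
    label = li.find('strong')
    value = li.find('span')
    # Skip fields that have a label but no value
    if label is None or value is None:
        continue
    key = label.get_text(strip=True).rstrip(':')
    event_info[key] = value.get_text(strip=True)

print(event_info)
```

Reading label and value per item also removes the need to filter out empty strings afterwards.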

Now I'm going to remove the empty strings and pop the start time off the front of the list.

In [29]:
info_list = list(filter(lambda item: item != '', info_list))
info_list

start_time = info_list.pop(0)
info_list

['U.S. Broadcast:',
 'ESPN',
 'Name:',
 'UFC Fight Night: Woodley vs. Burns',
 'Also Known As:',
 'UFC Fight Night APEX',
 'Promotion:',
 'Ultimate Fighting Championship',
 'Ownership:',
 'Endeavor',
 'Venue:',
 'UFC APEX',
 'Location:',
 'Las Vegas, Nevada, United States',
 'Enclosure:',
 'Octagon',
 'TV Announcers:',
 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier',
 'Ring Announcer:',
 'Joe Martinez',
 'Post-Fight Interviews:',
 'Daniel Cormier',
 'TV Ratings:',
 '1.02M avg. viewers (615k ESPN prelims)',
 'MMA Bouts:',
 '11']
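The popped `start_time` is still a plain string. If a real datetime is wanted later, one possible approach is to drop the weekday and the time zone label and parse the rest with `strptime`. This is a sketch that assumes every event uses the same layout as this one; the result is a naive datetime, with the `ET` label kept separately.

```python
from datetime import datetime

start_time = 'Saturday 05.30.2020 at 06:00 PM ET'

# Drop the weekday (first token) and keep the time zone label (last token) aside
_, *middle, tz_label = start_time.split()

# What remains looks like '05.30.2020 at 06:00 PM'
stamp = datetime.strptime(' '.join(middle), '%m.%d.%Y at %I:%M %p')

print(stamp, tz_label)
```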

Now I'm going to group the list into (label, value) pairs.

In [30]:
info_list = src.group(info_list, 2)

Turn the grouped pairs into a dictionary, then into a dataframe.

In [31]:
info_df = pd.DataFrame(dict(info_list), index=[0]) # build a dict from the pairs; all-scalar values need an explicit index
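The source of `src.group` is not shown in this notebook, so the helper below is only a hypothetical stand-in that behaves the way the cells above use it: it splits a flat list into consecutive tuples of a given size. The short `info_list` is a trimmed sample.

```python
import pandas as pd

def group(items, size):
    """Split a flat list into consecutive tuples of `size` items.

    Hypothetical stand-in for src.group; trailing items that do not
    fill a whole tuple are dropped.
    """
    return list(zip(*[iter(items)] * size))

info_list = ['Venue:', 'UFC APEX', 'Enclosure:', 'Octagon']
pairs = group(info_list, 2)

# Each (label, value) pair becomes one column in a single-row dataframe
info_df = pd.DataFrame(dict(pairs), index=[0])

print(pairs)
```

Note that this pairing only lines up if every label is followed by exactly one value, which is why the empty strings and the start time were removed first.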

Which columns do I want?

In [32]:
info_df.columns

Index(['U.S. Broadcast:', 'Name:', 'Also Known As:', 'Promotion:',
       'Ownership:', 'Venue:', 'Location:', 'Enclosure:', 'TV Announcers:',
       'Ring Announcer:', 'Post-Fight Interviews:', 'TV Ratings:',
       'MMA Bouts:'],
      dtype='object')

In [33]:
relevent_df = info_df.loc[:, ['Location:', 'Venue:', 'Enclosure:']] # keep only the relevant columns

Add the event link so this row can be matched back to the events table.

In [34]:
relevent_df['link'] = first_event.link

In [57]:
relevent_df

Unnamed: 0,Location:,Venue:,Enclosure:,link,start_time
0,"Las Vegas, Nevada, United States",UFC APEX,Octagon,/fightcenter/events/69127-ufc-fight-night,Saturday 05.30.2020 at 06:00 PM ET


Now I'll turn this into a function and see how I can join the result back onto the events table.

In [58]:
relevent_df = src.get_missing_event_info(soup, first_event.link)

In [59]:
previous_ufc.head()

Unnamed: 0,event,name,date,bouts,link
0,UFC Fight Night,Woodley vs. Burns,2020-05-30,11,/fightcenter/events/69127-ufc-fight-night
1,UFC Fight Night,Overeem vs. Harris,2020-05-16,11,/fightcenter/events/67412-ufc-on-espn-33
2,UFC Fight Night,Smith vs. Teixeira,2020-05-13,10,/fightcenter/events/69126-ufc-fight-night
3,UFC 249,Ferguson vs. Gaethje,2020-05-09,11,/fightcenter/events/66312-ufc-250
10,UFC on ESPN+ 28,Lee vs. Oliveira,2020-03-14,12,/fightcenter/events/64600-ufc-on-espn-26


In [60]:
previous_ufc.join(relevent_df.set_index('link'), on='link')

Unnamed: 0,event,name,date,bouts,link,location,venue,enclosure,start_time
0,UFC Fight Night,Woodley vs. Burns,2020-05-30,11,/fightcenter/events/69127-ufc-fight-night,"Las Vegas, Nevada, United States",UFC APEX,Octagon,Saturday 05.30.2020 at 06:00 PM ET
1,UFC Fight Night,Overeem vs. Harris,2020-05-16,11,/fightcenter/events/67412-ufc-on-espn-33,,,,
2,UFC Fight Night,Smith vs. Teixeira,2020-05-13,10,/fightcenter/events/69126-ufc-fight-night,,,,
3,UFC 249,Ferguson vs. Gaethje,2020-05-09,11,/fightcenter/events/66312-ufc-250,,,,
10,UFC on ESPN+ 28,Lee vs. Oliveira,2020-03-14,12,/fightcenter/events/64600-ufc-on-espn-26,,,,
...,...,...,...,...,...,...,...,...,...
509,UFC 5,Return of the Beast,1995-04-07,10,/fightcenter/events/ufc-5-return-of-the-beast,,,,
510,UFC 4,Revenge of the Warriors,1994-12-16,10,/fightcenter/events/ufc-4-revenge-of-the-warriors,,,,
511,UFC 3,The American Dream,1994-09-09,6,/fightcenter/events/ufc-3-the-american-dream,,,,
512,UFC 2,No Way Out,1994-03-11,15,/fightcenter/events/ufc-2-no-way-out,,,,
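The join above returns a new dataframe rather than modifying `previous_ufc`, so the result has to be assigned to keep it. An equivalent way to express the same left join is `merge` on the shared `link` column. The sketch below uses two small hand-made frames; the `events` and `extra` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the events table and the scraped details
events = pd.DataFrame({
    'event': ['UFC Fight Night', 'UFC 249'],
    'link': ['/fightcenter/events/69127-ufc-fight-night',
             '/fightcenter/events/66312-ufc-250'],
})
extra = pd.DataFrame({
    'link': ['/fightcenter/events/69127-ufc-fight-night'],
    'venue': ['UFC APEX'],
})

# A left merge keeps every event; events not yet scraped get NaN in the new columns
merged = events.merge(extra, on='link', how='left')

print(merged)
```

Once every event page has been scraped, the same merge should fill in the columns that are empty here.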
