# Scope

This notebook focuses on scraping individual match statistics from an event page on tapology.com. I'll take in a Tapology event page and output a dataframe containing the stats for each individual match.

## Which events?

I don't want to look at all UFC events right now, so I'll just check how many events happened in the last 10 years. First I'm going to convert my date column to a datetime type.

In [2]:
%load_ext autoreload
%autoreload 2

import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

from bs4 import BeautifulSoup
import requests
import pandas as pd
import src

previous_ufc = pd.read_csv('previous_ufc.csv', index_col = 0)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
previous_ufc.head()

Unnamed: 0,event,name,date,bouts,link
0,UFC Fight Night,Woodley vs. Burns,2020.05.30,11,/fightcenter/events/69127-ufc-fight-night
1,UFC Fight Night,Overeem vs. Harris,2020.05.16,11,/fightcenter/events/67412-ufc-on-espn-33
2,UFC Fight Night,Smith vs. Teixeira,2020.05.13,10,/fightcenter/events/69126-ufc-fight-night
3,UFC 249,Ferguson vs. Gaethje,2020.05.09,11,/fightcenter/events/66312-ufc-250
4,UFC on ESPN+ 32 (cancelled),,2020.05.02,0,/fightcenter/events/67068-ufc-on-espn-32


In [4]:
previous_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 514 entries, 0 to 513
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   event   514 non-null    object
 1   name    501 non-null    object
 2   date    514 non-null    object
 3   bouts   514 non-null    int64 
 4   link    514 non-null    object
dtypes: int64(1), object(4)
memory usage: 24.1+ KB


In [5]:
previous_ufc['date'] = pd.to_datetime(previous_ufc['date'])

In [6]:
previous_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 514 entries, 0 to 513
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   event   514 non-null    object        
 1   name    501 non-null    object        
 2   date    514 non-null    datetime64[ns]
 3   bouts   514 non-null    int64         
 4   link    514 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 24.1+ KB
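The conversion above relies on pandas inferring the `2020.05.30` layout. As a minimal sketch on toy data (the series below is made up, not read from `previous_ufc.csv`), the format can also be spelled out explicitly, which may be safer if the scraped dates are ever inconsistent:

```python
import pandas as pd

# Toy dates in the same "YYYY.MM.DD" layout as the date column above.
raw_dates = pd.Series(["2020.05.30", "1993.11.12"])

# An explicit format avoids relying on pandas' format inference.
parsed_dates = pd.to_datetime(raw_dates, format="%Y.%m.%d")

# errors="coerce" turns unparseable entries into NaT instead of raising.
with_bad_value = pd.to_datetime(
    pd.Series(["2020.05.30", "not a date"]),
    format="%Y.%m.%d",
    errors="coerce",
)
```

Counting the `NaT` values afterwards (`with_bad_value.isna().sum()`) is a quick way to see how many rows failed to parse.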


I also want to remove any event that got cancelled, so let's get rid of any events with number of bouts = 0.

In [7]:
previous_ufc = previous_ufc[previous_ufc['bouts']>0]
previous_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 513
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   event   500 non-null    object        
 1   name    494 non-null    object        
 2   date    500 non-null    datetime64[ns]
 3   bouts   500 non-null    int64         
 4   link    500 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 23.4+ KB


How many bouts are in here total? 

In [8]:
previous_ufc.bouts.sum()

5375

Now let's see how these datetime values work.

In [9]:
#previous_ufc.date >2010

In [10]:
previous_ufc.date

0     2020-05-30
1     2020-05-16
2     2020-05-13
3     2020-05-09
10    2020-03-14
         ...    
509   1995-04-07
510   1994-12-16
511   1994-09-09
512   1994-03-11
513   1993-11-12
Name: date, Length: 500, dtype: datetime64[ns]

In [11]:
minimum = pd.to_datetime('2012-1-1')

In [12]:
recent_ufc = previous_ufc[previous_ufc['date'] > minimum]
recent_ufc.head()

Unnamed: 0,event,name,date,bouts,link
0,UFC Fight Night,Woodley vs. Burns,2020-05-30,11,/fightcenter/events/69127-ufc-fight-night
1,UFC Fight Night,Overeem vs. Harris,2020-05-16,11,/fightcenter/events/67412-ufc-on-espn-33
2,UFC Fight Night,Smith vs. Teixeira,2020-05-13,10,/fightcenter/events/69126-ufc-fight-night
3,UFC 249,Ferguson vs. Gaethje,2020-05-09,11,/fightcenter/events/66312-ufc-250
10,UFC on ESPN+ 28,Lee vs. Oliveira,2020-03-14,12,/fightcenter/events/64600-ufc-on-espn-26


All fights from 2012 on should work.

In [13]:
recent_ufc.bouts.sum()
recent_ufc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 321 entries, 0 to 333
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   event   321 non-null    object        
 1   name    320 non-null    object        
 2   date    321 non-null    datetime64[ns]
 3   bouts   321 non-null    int64         
 4   link    321 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 15.0+ KB
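The date filter used here can be sketched on toy data. The frame and the cutoff variable names below are illustrative only. Note that a strict `>` comparison leaves out an event dated exactly on the cutoff, while `>=` keeps it:

```python
import pandas as pd

# Toy events straddling the cutoff date.
events = pd.DataFrame({
    "event": ["A", "B", "C"],
    "date": pd.to_datetime(["2012-01-01", "2012-01-02", "2011-12-31"]),
    "bouts": [10, 12, 9],
})

cutoff = pd.Timestamp("2012-01-01")

# Strict comparison drops the event dated exactly on the cutoff.
strictly_after = events[events["date"] > cutoff]

# Inclusive comparison keeps it.
on_or_after = events[events["date"] >= cutoff]

# Total bouts in the inclusive window.
recent_bouts = on_or_after["bouts"].sum()
```

In a notebook cell only the last expression is echoed, so a sum placed before another call needs `print(...)` or its own cell to be displayed.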


## Table: bouts
### Scraping matches

Now that I have my time frame, I want to take the first event and see how I can scrape it. I already have a list of events and their respective links. I think I can build a pretty accurate dataframe by using hierarchical indexing, where the first index level references the bout link and the second index level references the winner or loser.
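As a rough sketch of the hierarchical-indexing idea, here is a two-level index on placeholder data. The links, fighter names and level names below are hypothetical, not scraped values:

```python
import pandas as pd

# Hypothetical bout links; real ones would come from the event page.
bout_links = ["/bouts/example-bout-1", "/bouts/example-bout-2"]

# First level: the bout link. Second level: winner or loser.
index = pd.MultiIndex.from_product(
    [bout_links, ["winner", "loser"]],
    names=["bout_link", "result"],
)

fighters = pd.DataFrame(
    {"fighter": ["Fighter A", "Fighter B", "Fighter C", "Fighter D"]},
    index=index,
)

# Both rows for a single bout.
one_bout = fighters.loc["/bouts/example-bout-1"]

# Every winner across all bouts.
winners = fighters.xs("winner", level="result")
```

This layout keeps one row per fighter per bout while still letting a whole bout be selected with a single key.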

In [14]:
link = recent_ufc.loc[0].link

In [15]:
event_url = 'https://www.tapology.com'+link

#here I set up my request module and parse the page with beautiful soup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
response = requests.get(event_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

#here I create the fightCard data frame
fightCard_df = src.create_bouts_table(soup, link)
fightCard_df

Unnamed: 0,method,length,order,fighter_0,bout_type,weight_class,scheduled_rounds,fighter_1,link,event_link
0,"Decision, Unanimous","5 Rounds, 25:00 Total",11,Gilbert Burns,Main Event,170,5 x 5,Tyron Woodley,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/events/69127-ufc-fight-night
1,"Decision, Split","3 Rounds, 15:00 Total",10,Augusto Sakai,Co-Main Event,265,3 x 5,Blagoy Ivanov,/fightcenter/bouts/501441-ufc-fight-night-blag...,/fightcenter/events/69127-ufc-fight-night
2,"Decision, Unanimous","3 Rounds, 15:00 Total",9,Billy Quarantillo,Main Card,150,3 x 5,Spike Carlyle,/fightcenter/bouts/502990-ufc-fight-night-bill...,/fightcenter/events/69127-ufc-fight-night
3,"Submission, Rear Naked Choke","3:26 Round 2 of 3, 8:26 Total",8,Roosevelt Roberts,Main Card,155,3 x 5,Brok Weaver,/fightcenter/bouts/502972-ufc-fight-night-roos...,/fightcenter/events/69127-ufc-fight-night
4,"Submission, Kneebar",2:36 Round 1 of 3,7,Mackenzie Dern,Main Card,115,3 x 5,Hannah Cifers,/fightcenter/bouts/500998-ufc-fight-night-mack...,/fightcenter/events/69127-ufc-fight-night
5,"Decision, Unanimous","3 Rounds, 15:00 Total",6,Katlyn Chookagian,Prelim,125,3 x 5,Antonina Shevchenko,/fightcenter/bouts/502988-ufc-fight-night-katl...,/fightcenter/events/69127-ufc-fight-night
6,"Decision, Unanimous","3 Rounds, 15:00 Total",5,Daniel Rodriguez,Prelim,170,3 x 5,Gabe Green,/fightcenter/bouts/503985-ufc-fight-night-dani...,/fightcenter/events/69127-ufc-fight-night
7,"KO/TKO, Knee to the Body to Ground and Pound",1:51 Round 1 of 3,4,Jamahal Hill,Prelim,205,3 x 5,Klidson Abreu,/fightcenter/bouts/500971-ufc-fight-night-jama...,/fightcenter/events/69127-ufc-fight-night
8,"Submission, Arm Triangle Choke","3:18 Round 2 of 3, 8:18 Total",3,Brandon Royval,Prelim,125,3 x 5,Tim Elliott,/fightcenter/bouts/502983-ufc-fight-night-tim-...,/fightcenter/events/69127-ufc-fight-night
9,"Submission, One-Arm Guillotine Choke",3:03 Round 1 of 3,2,Casey Kenney,Prelim,135,3 x 5,Louis Smolka,/fightcenter/bouts/502982-ufc-fight-night-case...,/fightcenter/events/69127-ufc-fight-night


In [16]:
fightCard_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   method            11 non-null     object
 1   length            11 non-null     object
 2   order             11 non-null     object
 3   fighter_0         11 non-null     object
 4   bout_type         11 non-null     object
 5   weight_class      11 non-null     object
 6   scheduled_rounds  11 non-null     object
 7   fighter_1         11 non-null     object
 8   link              11 non-null     object
 9   event_link        11 non-null     object
dtypes: object(10)
memory usage: 1008.0+ bytes


I want to remove the records from this table because they will already be in the fighter_instances table.

## Table: fighter_instances

### First bout scrape

In [17]:
first_bout = fightCard_df.loc[0]

In [18]:
bout_link = fightCard_df.loc[0].link
bout_url = 'https://www.tapology.com'+bout_link
response = requests.get(bout_url, headers=headers)
bout_soup = BeautifulSoup(response.text, 'html.parser')

Let's see if we can find a class for the list we want on this page using Chrome's developer tools.

Looking at the page, I found that the info I want is under the boutComparisonTable class. So let's pull that first.

In [19]:
bout_table = bout_soup.find_all(class_='boutComparisonTable')

Actually, I can just pull the table in with pandas, so let's try that first.

In [20]:
tables = pd.read_html(response.content)

In [21]:
len(tables)

3

In [22]:
tables[0]

Unnamed: 0,0,1,2,3,4
0,18-3-0,,Pro Record At Fight,,19-4-1
1,Climbed to 19-3,,Record After Fight,,Fell to 19-5
2,+150 (Slight Underdog),,Betting Odds,,-185 (Slight Favorite)
3,Brazil,,Nationality,,United States
4,"Boca Raton, Florida",,Fighting out of,,"St. Louis, Missouri"
5,"33 years, 10 months, 3 days",,Age at Fight,,"38 years, 1 month, 6 days"
6,170.5 lbs (77.3 kgs),,Weigh-In Result,,170.5 lbs (77.3 kgs)
7,"5'10"" (178cm)",,Height,,"5'9"" (176cm)"
8,"71.0"" (180cm)",,Reach,,"74.0"" (188cm)"
9,Blackzilians,,Gym,,American Top Team


Now we should be able to pivot it.

In [23]:
bout_stats = tables[0]
bout_stats.columns

Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [24]:
bout_stats_pivot = pd.DataFrame([list(bout_stats[0]), list(bout_stats[4])])

In [25]:
bout_stats_pivot.columns = list(bout_stats[2])

In [26]:
bout_stats_pivot.reset_index(inplace=True, drop=True)

In [27]:
bout_stats_pivot

Unnamed: 0,Pro Record At Fight,Record After Fight,Betting Odds,Nationality,Fighting out of,Age at Fight,Weigh-In Result,Height,Reach,Gym
0,18-3-0,Climbed to 19-3,+150 (Slight Underdog),Brazil,"Boca Raton, Florida","33 years, 10 months, 3 days",170.5 lbs (77.3 kgs),"5'10"" (178cm)","71.0"" (180cm)",Blackzilians
1,19-4-1,Fell to 19-5,-185 (Slight Favorite),United States,"St. Louis, Missouri","38 years, 1 month, 6 days",170.5 lbs (77.3 kgs),"5'9"" (176cm)","74.0"" (188cm)",American Top Team
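The same reshaping can be done with a transpose. This is a sketch under the assumption that the comparison table always has the fighter values in columns 0 and 4 and the labels in column 2, as in the output above. The helper name `pivot_comparison` is made up for illustration:

```python
import pandas as pd

# Shortened stand-in for tables[0]: values in columns 0 and 4, labels in column 2.
raw = pd.DataFrame({
    0: ["18-3-0", "Brazil"],
    1: [None, None],
    2: ["Pro Record At Fight", "Nationality"],
    3: [None, None],
    4: ["19-4-1", "United States"],
})


def pivot_comparison(table):
    """Turn the label column into headers, with one row per fighter."""
    wide = table.set_index(2)[[0, 4]].T
    wide.columns.name = None  # drop the leftover "2" label on the column axis
    return wide.reset_index(drop=True)


pivoted = pivot_comparison(raw)
```

Row 0 of the result holds the left-hand fighter and row 1 the right-hand fighter, matching the layout built step by step above.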


Now let's test it in function form.

In [28]:
fighter_instances = src.create_fighter_instances_table(response.content, bout_link)
fighter_instances

Unnamed: 0,Pro Record At Fight,Record After Fight,Betting Odds,Nationality,Fighting out of,Age at Fight,Weigh-In Result,Height,Reach,Gym,fighter_link,bout_link,instance_id
0,18-3-0,Climbed to 19-3,+150 (Slight Underdog),Brazil,"Boca Raton, Florida","33 years, 10 months, 3 days",170.5 lbs (77.3 kgs),"5'10"" (178cm)","71.0"" (180cm)",Blackzilians,/fightcenter/fighters/31168-gilbert-burns-durinho,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...
1,19-4-1,Fell to 19-5,-185 (Slight Favorite),United States,"St. Louis, Missouri","38 years, 1 month, 6 days",170.5 lbs (77.3 kgs),"5'9"" (176cm)","74.0"" (188cm)",American Top Team,/fightcenter/fighters/tyron-woodley-t-wood,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...


Now I need to add in the fighter links, so maybe I can reuse the code from the other link functions I wrote. First I'll find where the links are located on the page.

In [29]:
name_elem = bout_soup.find(class_="fighterNames botPad clearfix")
name_elem

<div class="fighterNames botPad clearfix">
<p>
<span class="fName left">
<a href="/fightcenter/fighters/31168-gilbert-burns-durinho">Gilbert Burns</a>
</span>
<span class="fName right">
<a href="/fightcenter/fighters/tyron-woodley-t-wood">Tyron Woodley</a>
</span>
</p>
<p>
<span class="fName left">"Durinho"</span>
<span class="fName right">"The Chosen One"</span>
</p>
</div>

In [30]:
name_elem = name_elem.find_all('a')
name_elem

[<a href="/fightcenter/fighters/31168-gilbert-burns-durinho">Gilbert Burns</a>,
 <a href="/fightcenter/fighters/tyron-woodley-t-wood">Tyron Woodley</a>]

In [31]:
links = [name.get('href') for name in name_elem]
links

['/fightcenter/fighters/31168-gilbert-burns-durinho',
 '/fightcenter/fighters/tyron-woodley-t-wood']

Now I'll add this onto my dataframe.

In [32]:
fighter_instances['fighter_link'] = links
fighter_instances


Unnamed: 0,Pro Record At Fight,Record After Fight,Betting Odds,Nationality,Fighting out of,Age at Fight,Weigh-In Result,Height,Reach,Gym,fighter_link,bout_link,instance_id
0,18-3-0,Climbed to 19-3,+150 (Slight Underdog),Brazil,"Boca Raton, Florida","33 years, 10 months, 3 days",170.5 lbs (77.3 kgs),"5'10"" (178cm)","71.0"" (180cm)",Blackzilians,/fightcenter/fighters/31168-gilbert-burns-durinho,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...
1,19-4-1,Fell to 19-5,-185 (Slight Favorite),United States,"St. Louis, Missouri","38 years, 1 month, 6 days",170.5 lbs (77.3 kgs),"5'9"" (176cm)","74.0"" (188cm)",American Top Team,/fightcenter/fighters/tyron-woodley-t-wood,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...


I also need a way to refer to these specific elements so I'm going to create an instance ID that is made up of the bout_link and the fighter_link combined.

In [33]:
fighter_instances['instance_id'] = fighter_instances['bout_link'] + fighter_instances['fighter_link']
fighter_instances

Unnamed: 0,Pro Record At Fight,Record After Fight,Betting Odds,Nationality,Fighting out of,Age at Fight,Weigh-In Result,Height,Reach,Gym,fighter_link,bout_link,instance_id
0,18-3-0,Climbed to 19-3,+150 (Slight Underdog),Brazil,"Boca Raton, Florida","33 years, 10 months, 3 days",170.5 lbs (77.3 kgs),"5'10"" (178cm)","71.0"" (180cm)",Blackzilians,/fightcenter/fighters/31168-gilbert-burns-durinho,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...
1,19-4-1,Fell to 19-5,-185 (Slight Favorite),United States,"St. Louis, Missouri","38 years, 1 month, 6 days",170.5 lbs (77.3 kgs),"5'9"" (176cm)","74.0"" (188cm)",American Top Team,/fightcenter/fighters/tyron-woodley-t-wood,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...
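In the table display the `instance_id` column looks identical to `bout_link` because long strings are truncated with `...`. A small sketch with placeholder links (not the real Tapology paths) shows what the concatenation produces and checks that the IDs are distinct:

```python
import pandas as pd

# Placeholder links standing in for the scraped bout and fighter links.
instances = pd.DataFrame({
    "bout_link": ["/bouts/example-bout", "/bouts/example-bout"],
    "fighter_link": ["/fighters/fighter-a", "/fighters/fighter-b"],
})

# The instance ID is the bout link followed directly by the fighter link.
instances["instance_id"] = instances["bout_link"] + instances["fighter_link"]

# Two fighters in the same bout should still get different IDs.
ids_are_unique = instances["instance_id"].is_unique
```

To see the full strings in a notebook, `pd.set_option("display.max_colwidth", None)` turns off the truncation.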


Function testing:

In [34]:
src.create_fighter_instances_table(response.content, bout_link)

Unnamed: 0,Pro Record At Fight,Record After Fight,Betting Odds,Nationality,Fighting out of,Age at Fight,Weigh-In Result,Height,Reach,Gym,fighter_link,bout_link,instance_id
0,18-3-0,Climbed to 19-3,+150 (Slight Underdog),Brazil,"Boca Raton, Florida","33 years, 10 months, 3 days",170.5 lbs (77.3 kgs),"5'10"" (178cm)","71.0"" (180cm)",Blackzilians,/fightcenter/fighters/31168-gilbert-burns-durinho,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...
1,19-4-1,Fell to 19-5,-185 (Slight Favorite),United States,"St. Louis, Missouri","38 years, 1 month, 6 days",170.5 lbs (77.3 kgs),"5'9"" (176cm)","74.0"" (188cm)",American Top Team,/fightcenter/fighters/tyron-woodley-t-wood,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/bouts/501343-ufc-fight-night-tyro...


## Table: Events (addition)
I'm going to pause the fighter instances scrape and work on making the events scrape more thorough. First I want the events dataframe available. Then I'm going to open the first event.

In [35]:
df_events = pd.read_csv('previous_ufc.csv')
first_event = df_events.loc[0]

I'm going to open the page and check the first list with the event info:

In [36]:
event_soup = BeautifulSoup(src.open_tapology_link(first_event.link), 'html.parser')

## List parser
I realize I need a more reliable way to parse HTML lists for this project. Here I will test a function that turns an HTML list into a pandas dataframe, ready to be merged with my other dataframes.

### Find the list element

In [37]:
div = soup.find(class_="details details_with_poster clearfix") #grab top header
event_info_elem = div.find('ul') #grab first list in top header

### Find the list items in the element

In [44]:
list_items = event_info_elem.find_all('li')
list_items = [item.get_text() for item in list_items]
list_items

['Saturday 05.30.2020 at 06:00 PM ET',
 '\nU.S. Broadcast:\nESPN\n\n',
 '\nName:\nUFC Fight Night: Woodley vs. Burns\n',
 '\nAlso Known As:\nUFC Fight Night APEX\n',
 '\nPromotion:\n\nUltimate Fighting Championship\n\n',
 '\nOwnership:\nEndeavor\n',
 '\nVenue:\nUFC APEX\n',
 '\nLocation:\nLas Vegas, Nevada, United States\n',
 '\nEnclosure:\nOctagon\n',
 '\nTV Announcers:\nBrendan Fitzgerald, Michael Bisping, Daniel Cormier\n',
 '\nRing Announcer:\nJoe Martinez\n',
 '\nPost-Fight Interviews:\nDaniel Cormier\n',
 '\nTV Ratings:\n1.02M avg. viewers (615k ESPN prelims)\n',
 '\nMMA Bouts:\n11\n',
 '\nPromotion Links:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
 '\nEvent Links:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n']

### Insert a 'header' label.
#### The info in the header may be useful later on

In [65]:
item_list = [item.split(':\n') for item in list_items]
item_list[0].insert(0, 'header')
item_list

[['header', 'Saturday 05.30.2020 at 06:00 PM ET'],
 ['\nU.S. Broadcast', 'ESPN\n\n'],
 ['\nName', 'UFC Fight Night: Woodley vs. Burns\n'],
 ['\nAlso Known As', 'UFC Fight Night APEX\n'],
 ['\nPromotion', '\nUltimate Fighting Championship\n\n'],
 ['\nOwnership', 'Endeavor\n'],
 ['\nVenue', 'UFC APEX\n'],
 ['\nLocation', 'Las Vegas, Nevada, United States\n'],
 ['\nEnclosure', 'Octagon\n'],
 ['\nTV Announcers', 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier\n'],
 ['\nRing Announcer', 'Joe Martinez\n'],
 ['\nPost-Fight Interviews', 'Daniel Cormier\n'],
 ['\nTV Ratings', '1.02M avg. viewers (615k ESPN prelims)\n'],
 ['\nMMA Bouts', '11\n'],
 ['\nPromotion Links',
  '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'],
 ['\nEvent Links',
  '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n']]

### Encapsulate the second element of each row in a list
#### This allows them to be made into a dataframe easily

In [69]:
item_list = [[item[0], [item[1].strip()]] for item in item_list]
item_list

[['header', ['Saturday 05.30.2020 at 06:00 PM ET']],
 ['\nU.S. Broadcast', ['ESPN']],
 ['\nName', ['UFC Fight Night: Woodley vs. Burns']],
 ['\nAlso Known As', ['UFC Fight Night APEX']],
 ['\nPromotion', ['Ultimate Fighting Championship']],
 ['\nOwnership', ['Endeavor']],
 ['\nVenue', ['UFC APEX']],
 ['\nLocation', ['Las Vegas, Nevada, United States']],
 ['\nEnclosure', ['Octagon']],
 ['\nTV Announcers', ['Brendan Fitzgerald, Michael Bisping, Daniel Cormier']],
 ['\nRing Announcer', ['Joe Martinez']],
 ['\nPost-Fight Interviews', ['Daniel Cormier']],
 ['\nTV Ratings', ['1.02M avg. viewers (615k ESPN prelims)']],
 ['\nMMA Bouts', ['11']],
 ['\nPromotion Links', ['']],
 ['\nEvent Links', ['']]]

### Convert to dataframe

In [70]:
list_df = pd.DataFrame(dict(item_list))
list_df

Unnamed: 0,header,\nU.S. Broadcast,\nName,\nAlso Known As,\nPromotion,\nOwnership,\nVenue,\nLocation,\nEnclosure,\nTV Announcers,\nRing Announcer,\nPost-Fight Interviews,\nTV Ratings,\nMMA Bouts,\nPromotion Links,\nEvent Links
0,Saturday 05.30.2020 at 06:00 PM ET,ESPN,UFC Fight Night: Woodley vs. Burns,UFC Fight Night APEX,Ultimate Fighting Championship,Endeavor,UFC APEX,"Las Vegas, Nevada, United States",Octagon,"Brendan Fitzgerald, Michael Bisping, Daniel Co...",Joe Martinez,Daniel Cormier,1.02M avg. viewers (615k ESPN prelims),11,,
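The column names above still carry a leading `\n`. A possible variant of these steps, sketched as a plain function (the name `list_items_to_record` is hypothetical and not part of `src`), splits each item on the first colon-plus-newline and strips both the label and the value:

```python
def list_items_to_record(items, header_label="header"):
    """Map each 'Label:\\nvalue' string to a cleaned label/value entry.

    Items without a ':\\n' separator are stored under header_label.
    If several such items exist, the last one wins (a simplification).
    """
    record = {}
    for text in items:
        label, separator, value = text.partition(":\n")
        if not separator:
            record[header_label] = text.strip()
        else:
            record[label.strip()] = value.strip()
    return record


# A few items copied from the list output above.
sample_items = [
    "Saturday 05.30.2020 at 06:00 PM ET",
    "\nU.S. Broadcast:\nESPN\n\n",
    "\nName:\nUFC Fight Night: Woodley vs. Burns\n",
    "\nPromotion:\n\nUltimate Fighting Championship\n\n",
]

record = list_items_to_record(sample_items)
```

Because `partition` only splits on the first `:\n`, a colon inside a value such as `06:00` or `UFC Fight Night: Woodley vs. Burns` is left alone. A one-row dataframe can then be built with `pd.DataFrame([record])`.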



In [37]:
info_list = event_info_elem.get_text().split('\n\n\n')#turn it into text and split into list


new_list = []  # fields with no info lack a colon followed by a newline, so I will take those out
for item in info_list:
    if ':\n' in item:
        new_list.append(item)

info_list = '\n'.join(new_list).split('\n')
info_list

['',
 'Saturday 05.30.2020 at 06:00 PM ET',
 '',
 'U.S. Broadcast:',
 'ESPN',
 '',
 'Name:',
 'UFC Fight Night: Woodley vs. Burns',
 'Also Known As:',
 'UFC Fight Night APEX',
 'Promotion:',
 '',
 'Ultimate Fighting Championship',
 '',
 'Ownership:',
 'Endeavor',
 'Venue:',
 'UFC APEX',
 'Location:',
 'Las Vegas, Nevada, United States',
 'Enclosure:',
 'Octagon',
 'TV Announcers:',
 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier',
 'Ring Announcer:',
 'Joe Martinez',
 'Post-Fight Interviews:',
 'Daniel Cormier',
 'TV Ratings:',
 '1.02M avg. viewers (615k ESPN prelims)',
 'MMA Bouts:',
 '11']

Now I'm going to remove the empty strings and pop the start time off the front of the list.

In [38]:
info_list = list(filter(lambda item: item != '', info_list))
info_list

start_time = info_list.pop(0)
info_list

['U.S. Broadcast:',
 'ESPN',
 'Name:',
 'UFC Fight Night: Woodley vs. Burns',
 'Also Known As:',
 'UFC Fight Night APEX',
 'Promotion:',
 'Ultimate Fighting Championship',
 'Ownership:',
 'Endeavor',
 'Venue:',
 'UFC APEX',
 'Location:',
 'Las Vegas, Nevada, United States',
 'Enclosure:',
 'Octagon',
 'TV Announcers:',
 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier',
 'Ring Announcer:',
 'Joe Martinez',
 'Post-Fight Interviews:',
 'Daniel Cormier',
 'TV Ratings:',
 '1.02M avg. viewers (615k ESPN prelims)',
 'MMA Bouts:',
 '11']

Now I'm going to group the items into label/value pairs and then turn them into a dataframe.

In [31]:
info_list = src.group(info_list, 2)

Zip the grouped info list into a dictionary and turn it into a dataframe.

In [32]:
info_df = pd.DataFrame(dict(info_list), index=[0]) #zipped the group into a dictionary, needs index param to work
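The source of `src.group` is not shown here, so the following is only a guess at the idea: a hypothetical helper, `pair_up`, that pairs each label with the value that follows it in a flat list.

```python
def pair_up(flat_items):
    """Pair consecutive items: [a, b, c, d] -> [(a, b), (c, d)].

    Hypothetical stand-in for src.group(info_list, 2).
    """
    if len(flat_items) % 2:
        raise ValueError("expected an even number of items")
    return list(zip(flat_items[0::2], flat_items[1::2]))


# Shortened copy of the cleaned list above.
flat = ["Venue:", "UFC APEX", "Enclosure:", "Octagon"]

# Drop the trailing colon from each label while building the dictionary.
record = {label.rstrip(":"): value for label, value in pair_up(flat)}
```

The `index=[0]` argument in the cell above is needed because every value in the dictionary is a scalar, and pandas cannot infer a row index from scalars alone.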

Which columns do I want?

In [33]:
info_df.columns

Index(['U.S. Broadcast:', 'Name:', 'Also Known As:', 'Promotion:',
       'Ownership:', 'Venue:', 'Location:', 'Enclosure:', 'TV Announcers:',
       'Ring Announcer:', 'Post-Fight Interviews:', 'TV Ratings:',
       'MMA Bouts:'],
      dtype='object')

In [34]:
relevent_df = info_df.loc[:, ['Location:', 'Venue:', 'Enclosure:']] #remove all irrelevant data

Add the link

In [35]:
relevent_df['link'] = first_event.link

In [93]:
relevent_df

Unnamed: 0,location,venue,enclosure,start_time,link
0,"Las Vegas, Nevada, United States",UFC APEX,Octagon,\nSaturday 05.30.2020 at 06:00 PM ET\n\nU.S. B...,/fightcenter/events/69127-ufc-fight-night


Now I'll turn this into a function and see how I can join it to the events dataframe.

In [92]:
relevent_df = src.get_missing_event_info(soup, first_event.link)

[['\nName', 'UFC Fight Night: Woodley vs. Burns'], ['Also Known As', 'UFC Fight Night APEX'], ['Promotion', '\nUltimate Fighting Championship'], ['\nOwnership', 'Endeavor'], ['Venue', 'UFC APEX'], ['Location', 'Las Vegas, Nevada, United States'], ['Enclosure', 'Octagon'], ['TV Announcers', 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier'], ['Ring Announcer', 'Joe Martinez'], ['Post-Fight Interviews', 'Daniel Cormier'], ['TV Ratings', '1.02M avg. viewers (615k ESPN prelims)'], ['MMA Bouts', '11']]


In [38]:
previous_ufc.head()

Unnamed: 0,event,name,date,bouts,link
0,UFC Fight Night,Woodley vs. Burns,2020-05-30,11,/fightcenter/events/69127-ufc-fight-night
1,UFC Fight Night,Overeem vs. Harris,2020-05-16,11,/fightcenter/events/67412-ufc-on-espn-33
2,UFC Fight Night,Smith vs. Teixeira,2020-05-13,10,/fightcenter/events/69126-ufc-fight-night
3,UFC 249,Ferguson vs. Gaethje,2020-05-09,11,/fightcenter/events/66312-ufc-250
10,UFC on ESPN+ 28,Lee vs. Oliveira,2020-03-14,12,/fightcenter/events/64600-ufc-on-espn-26


In [39]:
previous_ufc.join(relevent_df.set_index('link'), on='link')

Unnamed: 0,event,name,date,bouts,link,location,venue,enclosure,start_time
0,UFC Fight Night,Woodley vs. Burns,2020-05-30,11,/fightcenter/events/69127-ufc-fight-night,"Las Vegas, Nevada, United States",UFC APEX,Octagon,Saturday 05.30.2020 at 06:00 PM ET
1,UFC Fight Night,Overeem vs. Harris,2020-05-16,11,/fightcenter/events/67412-ufc-on-espn-33,,,,
2,UFC Fight Night,Smith vs. Teixeira,2020-05-13,10,/fightcenter/events/69126-ufc-fight-night,,,,
3,UFC 249,Ferguson vs. Gaethje,2020-05-09,11,/fightcenter/events/66312-ufc-250,,,,
10,UFC on ESPN+ 28,Lee vs. Oliveira,2020-03-14,12,/fightcenter/events/64600-ufc-on-espn-26,,,,
...,...,...,...,...,...,...,...,...,...
509,UFC 5,Return of the Beast,1995-04-07,10,/fightcenter/events/ufc-5-return-of-the-beast,,,,
510,UFC 4,Revenge of the Warriors,1994-12-16,10,/fightcenter/events/ufc-4-revenge-of-the-warriors,,,,
511,UFC 3,The American Dream,1994-09-09,6,/fightcenter/events/ufc-3-the-american-dream,,,,
512,UFC 2,No Way Out,1994-03-11,15,/fightcenter/events/ufc-2-no-way-out,,,,
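The join above fills only the first row because only one event page has been scraped so far. A toy sketch (all names and links below are placeholders) shows the same behaviour and an equivalent left merge:

```python
import pandas as pd

# Two placeholder events, only one of which has extra scraped info.
events = pd.DataFrame({
    "event": ["E1", "E2"],
    "link": ["/events/e1", "/events/e2"],
})
extra = pd.DataFrame({
    "link": ["/events/e1"],
    "venue": ["Example Arena"],
})

# join() matches the "link" column of events against the index of extra.
joined = events.join(extra.set_index("link"), on="link")

# merge() with how="left" expresses the same lookup without set_index.
merged = events.merge(extra, on="link", how="left")
```

Rows with no match keep their original columns and get `NaN` in the new ones, which makes it easy to spot which events still need scraping.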


## Table: bouts (addition)

Once I got to the bout page I realized that there was some info that I would like to get but was not available on the initial page. I'm going to grab it from the bout page itself and add it to the dataframe. First let's open up our first bout page. I also want to make sure I have my bouts dataframe available so that I can add to it.

In [40]:
df_events = previous_ufc
df_bouts = src.get_initial_bout_df()
df_bouts.head(5)

Unnamed: 0,method,length,order,fighter_0,bout_type,weight_class,scheduled_rounds,fighter_1,link,event_link
0,"Decision, Unanimous","5 Rounds, 25:00 Total",11,Gilbert Burns,Main Event,170,5 x 5,Tyron Woodley,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/events/69127-ufc-fight-night
1,"Decision, Split","3 Rounds, 15:00 Total",10,Augusto Sakai,Co-Main Event,265,3 x 5,Blagoy Ivanov,/fightcenter/bouts/501441-ufc-fight-night-blag...,/fightcenter/events/69127-ufc-fight-night
2,"Decision, Unanimous","3 Rounds, 15:00 Total",9,Billy Quarantillo,Main Card,150,3 x 5,Spike Carlyle,/fightcenter/bouts/502990-ufc-fight-night-bill...,/fightcenter/events/69127-ufc-fight-night
3,"Submission, Rear Naked Choke","3:26 Round 2 of 3, 8:26 Total",8,Roosevelt Roberts,Main Card,155,3 x 5,Brok Weaver,/fightcenter/bouts/502972-ufc-fight-night-roos...,/fightcenter/events/69127-ufc-fight-night
4,"Submission, Kneebar",2:36 Round 1 of 3,7,Mackenzie Dern,Main Card,115,3 x 5,Hannah Cifers,/fightcenter/bouts/500998-ufc-fight-night-mack...,/fightcenter/events/69127-ufc-fight-night


According to the database schema I created, I want the following fields. The ones with check marks are the ones that my initial bout dataframe already contains:

- [x] link
- [x] method
- [x] length 
- [x] order
- [x] fighter_0
- [x] fighter_1
- [x] bout_type
- [x] scheduled_rounds
- [x] weight_class
- [ ] pro_am
- [ ] referee
- [x] event_link

So in order to get the pro_am and referee info, I need to follow the bout link and pull that info from a list on that page. Let's pull up the bout page.

In [94]:
response = requests.get(bout_url, headers=headers)
bout_soup = BeautifulSoup(response.content, 'html.parser')

Looking at the web page I see that there is a div element with a certain class that contains the table I'm looking for.

In [52]:
info_div = bout_soup.find('div', class_='details details_with_poster clearfix')
missing_info_elem = info_div.find('ul')

This looks similar to the list I parsed above so maybe I can reuse the same list parsing functions.

In [56]:
info_list = missing_info_elem.get_text().split('\n\n\n')#turn it into text and split into list
    
info_list = src.clean_info_list(info_list)
    
info_list

['Bout Information',
 'Event:',
 'UFC Fight Night: Woodley vs. Burns',
 'Date:',
 'Saturday 05.30.2020 at 06:00 PM ET',
 'Referee:',
 'Herb Dean',
 'Venue:',
 'UFC APEX',
 'Enclosure:',
 'Octagon',
 'Location:',
 'Las Vegas, Nevada, United States',
 'Bout Billing:',
 'Main Event',
 '(fight 11 of 11)',
 'Pro/Am:',
 'Professional',
 'Weight:',
 '170 lbs (77.1 kg)',
 'TV Commentary:',
 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier',
 'Broadcast:',
 'Aired Live on Main Card',
 'Post-Fight Interviewer:',
 'Daniel Cormier',
 'Woodley Total Disclosed Pay:',
 '$200,000',
 'Burns Total Disclosed Pay:',
 '$218,000']

In [66]:
src.html_list_to_df(missing_info_elem)

Unnamed: 0,Event:,Date:,Referee:,Venue:,Enclosure:,Location:,Bout Billing:,(fight 11 of 11),Professional,170 lbs (77.1 kg),"Brendan Fitzgerald, Michael Bisping, Daniel Cormier",Aired Live on Main Card,Daniel Cormier,"$200,000",first_elem
0,UFC Fight Night: Woodley vs. Burns,Saturday 05.30.2020 at 06:00 PM ET,Herb Dean,UFC APEX,Octagon,"Las Vegas, Nevada, United States",Main Event,Pro/Am:,Weight:,TV Commentary:,Broadcast:,Post-Fight Interviewer:,Woodley Total Disclosed Pay:,Burns Total Disclosed Pay:,Bout Information


The function I built does not work on this list for some reason.

In [68]:
missing_info_elem.get_text()

'\nBout Information\n\nEvent:\nUFC Fight Night: Woodley vs. Burns\n\n\nDate:\nSaturday 05.30.2020 at 06:00 PM ET\n\n\nReferee:\nHerb Dean\n\n\nVenue:\nUFC APEX\n\n\nEnclosure:\nOctagon\n\n\nLocation:\nLas Vegas, Nevada, United States\n\n\nBout Billing:\nMain Event\n(fight 11 of 11)\n\n\nPro/Am:\nProfessional\n\n\n\nWeight:\n170 lbs (77.1 kg)\n\n\n\nTV Commentary:\nBrendan Fitzgerald, Michael Bisping, Daniel Cormier\n\n\nBroadcast:\nAired Live on Main Card\n\n\nPost-Fight Interviewer:\nDaniel Cormier\n\n\nWoodley Total Disclosed Pay:\n$200,000\n\n\nBurns Total Disclosed Pay:\n$218,000\n\n\nEvent Links:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBout Links:\n\n\n\n\n\n\n\n\n\n'

The function html_list_to_df is confused by the fact that the Bout Billing value spans two lines: there is a new line between "Main Event" and "(fight 11 of 11)".
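To see the misalignment concretely, here is a small sketch on a shortened copy of the cleaned list. It assumes the function pairs consecutive items as label and value, which matches the shifted columns in the output above but is not confirmed from the source of `html_list_to_df`:

```python
# Shortened copy of the flat list; "Bout Billing" has a two-line value.
flat = [
    "Bout Billing:", "Main Event", "(fight 11 of 11)",
    "Pro/Am:", "Professional",
    "Weight:", "170 lbs (77.1 kg)",
]

# Pairing consecutive items assumes every label has exactly one value.
pairs = list(zip(flat[0::2], flat[1::2]))

# After the extra line, values end up in the label position.
shifted_labels = [label for label, _ in pairs if not label.endswith(":")]
```

Everything after the two-line value is shifted by one position, so values become column names and labels become cell contents.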

In [79]:
info_list = missing_info_elem.get_text().split('\n\n\n')#turn it into text and split into list
new_list = [] 
for item in info_list:
    #fields with no info do not have a colon followed by a new line, so I will take those out
    if ':\n' in item: 
        new_list.append(item)   
#this creates a list with the info separated by empty string elements
info_list = '\n\n'.join(new_list).split('\n\n') 
#this removes those empty string elements
info_list = list(filter(lambda item: item != '', info_list)) 

In [80]:
info_list

['\nBout Information',
 'Event:\nUFC Fight Night: Woodley vs. Burns',
 'Date:\nSaturday 05.30.2020 at 06:00 PM ET',
 'Referee:\nHerb Dean',
 'Venue:\nUFC APEX',
 'Enclosure:\nOctagon',
 'Location:\nLas Vegas, Nevada, United States',
 'Bout Billing:\nMain Event\n(fight 11 of 11)',
 'Pro/Am:\nProfessional',
 '\nWeight:\n170 lbs (77.1 kg)',
 '\nTV Commentary:\nBrendan Fitzgerald, Michael Bisping, Daniel Cormier',
 'Broadcast:\nAired Live on Main Card',
 'Post-Fight Interviewer:\nDaniel Cormier',
 'Woodley Total Disclosed Pay:\n$200,000',
 'Burns Total Disclosed Pay:\n$218,000']

In [81]:
info_list.pop(0)
info_list = [elem.split(':\n') for elem in info_list]

In [82]:
info_list

[['Event', 'UFC Fight Night: Woodley vs. Burns'],
 ['Date', 'Saturday 05.30.2020 at 06:00 PM ET'],
 ['Referee', 'Herb Dean'],
 ['Venue', 'UFC APEX'],
 ['Enclosure', 'Octagon'],
 ['Location', 'Las Vegas, Nevada, United States'],
 ['Bout Billing', 'Main Event\n(fight 11 of 11)'],
 ['Pro/Am', 'Professional'],
 ['\nWeight', '170 lbs (77.1 kg)'],
 ['\nTV Commentary', 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier'],
 ['Broadcast', 'Aired Live on Main Card'],
 ['Post-Fight Interviewer', 'Daniel Cormier'],
 ['Woodley Total Disclosed Pay', '$200,000'],
 ['Burns Total Disclosed Pay', '$218,000']]

In [85]:
src.html_list_to_df(missing_info_elem)

[['Event', 'UFC Fight Night: Woodley vs. Burns'], ['Date', 'Saturday 05.30.2020 at 06:00 PM ET'], ['Referee', 'Herb Dean'], ['Venue', 'UFC APEX'], ['Enclosure', 'Octagon'], ['Location', 'Las Vegas, Nevada, United States'], ['Bout Billing', 'Main Event\n(fight 11 of 11)'], ['Pro/Am', 'Professional'], ['\nWeight', '170 lbs (77.1 kg)'], ['\nTV Commentary', 'Brendan Fitzgerald, Michael Bisping, Daniel Cormier'], ['Broadcast', 'Aired Live on Main Card'], ['Post-Fight Interviewer', 'Daniel Cormier'], ['Woodley Total Disclosed Pay', '$200,000'], ['Burns Total Disclosed Pay', '$218,000']]


Unnamed: 0,Event,Date,Referee,Venue,Enclosure,Location,Bout Billing,Pro/Am,\nWeight,\nTV Commentary,Broadcast,Post-Fight Interviewer,Woodley Total Disclosed Pay,Burns Total Disclosed Pay,first_elem
0,UFC Fight Night: Woodley vs. Burns,Saturday 05.30.2020 at 06:00 PM ET,Herb Dean,UFC APEX,Octagon,"Las Vegas, Nevada, United States",Main Event\n(fight 11 of 11),Professional,170 lbs (77.1 kg),"Brendan Fitzgerald, Michael Bisping, Daniel Co...",Aired Live on Main Card,Daniel Cormier,"$200,000","$218,000",\nBout Information


Basically I modified my get_missing_event_info function to work on bout pages as well.

In [99]:
missing_bout_info = src.get_missing_bout_info(bout_soup, first_link)

In [100]:
fightCard_df.join(missing_bout_info.set_index('link'), on='link')

Unnamed: 0,method,length,order,fighter_0,bout_type,weight_class,scheduled_rounds,fighter_1,link,event_link,referee,pro_am
0,"Decision, Unanimous","5 Rounds, 25:00 Total",11,Gilbert Burns,Main Event,170,5 x 5,Tyron Woodley,/fightcenter/bouts/501343-ufc-fight-night-tyro...,/fightcenter/events/69127-ufc-fight-night,Herb Dean,Professional
1,"Decision, Split","3 Rounds, 15:00 Total",10,Augusto Sakai,Co-Main Event,265,3 x 5,Blagoy Ivanov,/fightcenter/bouts/501441-ufc-fight-night-blag...,/fightcenter/events/69127-ufc-fight-night,,
2,"Decision, Unanimous","3 Rounds, 15:00 Total",9,Billy Quarantillo,Main Card,150,3 x 5,Spike Carlyle,/fightcenter/bouts/502990-ufc-fight-night-bill...,/fightcenter/events/69127-ufc-fight-night,,
3,"Submission, Rear Naked Choke","3:26 Round 2 of 3, 8:26 Total",8,Roosevelt Roberts,Main Card,155,3 x 5,Brok Weaver,/fightcenter/bouts/502972-ufc-fight-night-roos...,/fightcenter/events/69127-ufc-fight-night,,
4,"Submission, Kneebar",2:36 Round 1 of 3,7,Mackenzie Dern,Main Card,115,3 x 5,Hannah Cifers,/fightcenter/bouts/500998-ufc-fight-night-mack...,/fightcenter/events/69127-ufc-fight-night,,
5,"Decision, Unanimous","3 Rounds, 15:00 Total",6,Katlyn Chookagian,Prelim,125,3 x 5,Antonina Shevchenko,/fightcenter/bouts/502988-ufc-fight-night-katl...,/fightcenter/events/69127-ufc-fight-night,,
6,"Decision, Unanimous","3 Rounds, 15:00 Total",5,Daniel Rodriguez,Prelim,170,3 x 5,Gabe Green,/fightcenter/bouts/503985-ufc-fight-night-dani...,/fightcenter/events/69127-ufc-fight-night,,
7,"KO/TKO, Knee to the Body to Ground and Pound",1:51 Round 1 of 3,4,Jamahal Hill,Prelim,205,3 x 5,Klidson Abreu,/fightcenter/bouts/500971-ufc-fight-night-jama...,/fightcenter/events/69127-ufc-fight-night,,
8,"Submission, Arm Triangle Choke","3:18 Round 2 of 3, 8:18 Total",3,Brandon Royval,Prelim,125,3 x 5,Tim Elliott,/fightcenter/bouts/502983-ufc-fight-night-tim-...,/fightcenter/events/69127-ufc-fight-night,,
9,"Submission, One-Arm Guillotine Choke",3:03 Round 1 of 3,2,Casey Kenney,Prelim,135,3 x 5,Louis Smolka,/fightcenter/bouts/502982-ufc-fight-night-case...,/fightcenter/events/69127-ufc-fight-night,,
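The source of `src.get_missing_bout_info` is not shown, so the following is only a sketch of how the referee and pro/am fields might be picked out of the parsed label/value pairs. The function name `select_bout_fields`, its parameters and the placeholder link are all hypothetical:

```python
def select_bout_fields(pairs, link, wanted=None):
    """Pick selected labels from parsed pairs and rename them.

    pairs  -- list of [label, value] lists, as produced by splitting on ':\\n'
    link   -- bout link used later as the join key
    wanted -- mapping from page label to output column name
    """
    wanted = wanted or {"Referee": "referee", "Pro/Am": "pro_am"}
    # Strip stray newlines so labels such as '\nWeight' match cleanly.
    cleaned = {label.strip(): value.strip() for label, value in pairs}
    # Missing labels become None rather than raising a KeyError.
    row = {new: cleaned.get(old) for old, new in wanted.items()}
    row["link"] = link
    return row


# A few pairs copied from the parsed output above.
pairs = [
    ["Referee", "Herb Dean"],
    ["Pro/Am", "Professional"],
    ["\nWeight", "170 lbs (77.1 kg)"],
]

row = select_bout_fields(pairs, "/bouts/example-bout")
```

Collecting one such row per bout and passing the list to `pd.DataFrame` would give a frame that can be joined on `link`, filling in the empty `referee` and `pro_am` cells shown above.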
