# <center>**Webscraping Notebook**</center>

The main challenge in working with UFC data is the web scraping needed to build and maintain an up-to-date database.

The official UFC site is built around presentation features for the end user, which makes it harder to scrape. For that reason, a different statistics site is used here.

Official link:
http://statleaders.ufc.com/

Link used:
http://www.ufcstats.com/statistics/events/completed


The main goal of this notebook is to study in more depth the web scraping techniques that will support the script webcrapping.py, which is responsible for keeping the database up to date.

The script to be generated should return three tables:

- Fighters: the characteristics of each fighter

- Events: all the events

#### **Libraries**

In [2]:
# Importing libraries

import pandas as pd

import requests
from bs4 import BeautifulSoup


## **Site structure**

The site is organized into one page called Events & Fights and another page for fighters.

The first page lists each event's name, the date it took place, and its location. Each event can also be opened to obtain more information.

## **Events & Fights Code**

#### ***Fetching the HTML table and converting it to a DataFrame***

A good practice in web scraping is to send a `User-Agent` header with each request. Some sites block requests that carry no User-Agent, or one associated with scraping tools.

In [3]:
# Access link
URL = 'http://www.ufcstats.com/statistics/events/completed?page=all'

# Setting the User-Agent
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
}

response = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(response.content, 'html.parser')
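When many pages are requested, the header can be set once on a session instead of being passed on every call. The sketch below reuses the `User-Agent` string from the cell above; the helper names `make_session` and `fetch_html` are hypothetical (they are not part of this notebook), and the 10-second timeout is an arbitrary choice.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
)


def make_session(user_agent: str = USER_AGENT) -> requests.Session:
    """Return a session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session: requests.Session, url: str, timeout: float = 10.0) -> bytes:
    """Download a page, raising an exception on HTTP error statuses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```

The returned bytes can be handed to `BeautifulSoup(..., 'html.parser')` exactly as `response.content` is in the cell above.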



After connecting, let's find the table's class name.

In [5]:
# Looking up the table's class name
table = soup.find('table')
print(table.attrs['class'][0])

b-statistics__table-events


In [4]:
# Locating the table by the correct class
table = soup.find('table', class_='b-statistics__table-events')

table

<table class="b-statistics__table-events">
<thead class="b-statistics__table-caption">
<tr class="b-statistics__table-row">
<th class="b-statistics__table-col">
                  Name/date
                </th>
<th class="b-statistics__table-col">
                  Location
                </th>
</tr>
</thead>
<tbody>
<tr class="b-statistics__table-row">
<td class="b-statistics__table-col_type_clear"></td>
</tr>
<tr class="b-statistics__table-row_type_first">
<td class="b-statistics__table-col">
<img class="b-statistics__icon" src="http://1e49bc5171d173577ecd-1323f4090557a33db01577564f60846c.r80.cf1.rackcdn.com/next.png">
<i class="b-statistics__table-content">
<a class="b-link b-link_style_white" href="http://www.ufcstats.com/event-details/89a407032911e27e">
                          UFC Fight Night: Holloway vs. The Korean Zombie
                        </a>
<span class="b-statistics__date">
                          August 26, 2023
                        </span>
</i>
</img></td>
<t

In [7]:
# Reuses the `soup` object parsed with BeautifulSoup above

# Find all <a> tags with the specific class
links = soup.find_all('a', class_='b-link b-link_style_black')

# Extract the URLs from these tags
urls = [link['href'] for link in links]

urls

['http://www.ufcstats.com/event-details/2719f300b0439039',
 'http://www.ufcstats.com/event-details/d2fa318f34d0aadc',
 'http://www.ufcstats.com/event-details/6f81b6de2557739a',
 'http://www.ufcstats.com/event-details/ccd58ff71e260ed5',
 'http://www.ufcstats.com/event-details/1174782eacde9b0c',
 'http://www.ufcstats.com/event-details/c9885b1b7c7055a0',
 'http://www.ufcstats.com/event-details/6085ceb59087514b',
 'http://www.ufcstats.com/event-details/e9e1acc96536bb4f',
 'http://www.ufcstats.com/event-details/a780d16cf7eed44d',
 'http://www.ufcstats.com/event-details/b9415726dc3ec526',
 'http://www.ufcstats.com/event-details/b6c6d1731ff00eeb',
 'http://www.ufcstats.com/event-details/7abe471b61725980',
 'http://www.ufcstats.com/event-details/6f812143641ceff8',
 'http://www.ufcstats.com/event-details/901cddcbfa079097',
 'http://www.ufcstats.com/event-details/3c6976f8182d9527',
 'http://www.ufcstats.com/event-details/51b1e2fd9872005b',
 'http://www.ufcstats.com/event-details/6fb1ba67bef41b37
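Note that `class_='b-link b-link_style_black'` is compared against the `class` attribute as an exact string, so it depends on the order in which the class names are written. A CSS selector does not. The snippet below is a minimal sketch on a hand-written fragment (the `/event-details/aaa`, `bbb` and `ccc` paths are placeholders, not real pages). It also shows that a link styled with `b-link_style_white`, as in the first row of the table output above, is not matched by either query.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; the hrefs are placeholders.
SAMPLE = """
<a class="b-link b-link_style_white" href="/event-details/aaa">Upcoming</a>
<a class="b-link b-link_style_black" href="/event-details/bbb">Past 1</a>
<a class="b-link_style_black b-link" href="/event-details/ccc">Past 2</a>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Exact-string match: only the link whose class attribute is written in this order.
exact = [a["href"] for a in soup.find_all("a", class_="b-link b-link_style_black")]

# CSS selector: matches any <a> carrying both classes, in any order.
by_css = [a["href"] for a in soup.select("a.b-link.b-link_style_black")]
```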

In [6]:
# Creating a DataFrame from the table variable
# Converting the HTML table into a DataFrame
df_events = pd.read_html(str(table))[0]

# Displaying the DataFrame
df_events


Unnamed: 0,Name/date,Location
0,,
1,UFC Fight Night: Holloway vs. The Korean Zombi...,"Kallang, Singapore"
2,"UFC 292: Sterling vs. O'Malley August 19, 2023","Boston, Massachusetts, USA"
3,UFC Fight Night: Luque vs. Dos Anjos August 1...,"Las Vegas, Nevada, USA"
4,UFC Fight Night: Sandhagen vs. Font August 05...,"Nashville, Tennessee, USA"
...,...,...
656,"UFC 6: Clash of the Titans July 14, 1995","Casper, Wyoming, USA"
657,"UFC 5: The Return of the Beast April 07, 1995","Charlotte, North Carolina, USA"
658,"UFC 4: Revenge of the Warriors December 16, 1994","Tulsa, Oklahoma, USA"
659,"UFC 3: The American Dream September 09, 1994","Charlotte, North Carolina, USA"


In [8]:
# Creating the date column

df_events['data'] = df_events['Name/date'].str.rsplit(" ", n=3).str[-3:].str.join(' ')

df_events

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_events['data'] = df_events['Name/date'].str.rsplit(" ", n=3).str[-3:].str.join(' ')


Unnamed: 0,Name/date,Location,Evento,data
1,"UFC 292: Sterling vs. O'Malley August 19, 2023","Boston, Massachusetts, USA",UFC 292,"August 19, 2023"
2,UFC Fight Night: Luque vs. Dos Anjos August 1...,"Las Vegas, Nevada, USA",UFC Fight Night,"August 12, 2023"
3,UFC Fight Night: Sandhagen vs. Font August 05...,"Nashville, Tennessee, USA",UFC Fight Night,"August 05, 2023"
4,"UFC 291: Poirier vs. Gaethje 2 July 29, 2023","Salt Lake City, Utah, USA",UFC 291,"July 29, 2023"
5,"UFC Fight Night: Aspinall vs. Tybura July 22,...","London, England, United Kingdom",UFC Fight Night,"July 22, 2023"
...,...,...,...,...
655,"UFC 6: Clash of the Titans July 14, 1995","Casper, Wyoming, USA",UFC 6,"July 14, 1995"
656,"UFC 5: The Return of the Beast April 07, 1995","Charlotte, North Carolina, USA",UFC 5,"April 07, 1995"
657,"UFC 4: Revenge of the Warriors December 16, 1994","Tulsa, Oklahoma, USA",UFC 4,"December 16, 1994"
658,"UFC 3: The American Dream September 09, 1994","Charlotte, North Carolina, USA",UFC 3,"September 09, 1994"
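The `data` column produced above is still text. A minimal sketch of converting it to real dates follows, run on two rows copied from the output above. The column name `date_parsed` is an assumption made for illustration, and the format string assumes every date follows the `Month DD, YYYY` pattern seen in the table.

```python
import pandas as pd

# Two rows copied from the output above, used as a stand-in for df_events.
sample = pd.DataFrame({
    "Name/date": [
        "UFC 292: Sterling vs. O'Malley August 19, 2023",
        "UFC 6: Clash of the Titans July 14, 1995",
    ]
})

# Same extraction as in the notebook: keep the last three space-separated tokens.
sample["data"] = sample["Name/date"].str.rsplit(" ", n=3).str[-3:].str.join(" ")

# Parse the text into datetime64 values; the format is an assumption.
sample["date_parsed"] = pd.to_datetime(sample["data"], format="%B %d, %Y")
```

With a datetime column, events can be sorted chronologically or filtered by year through the `.dt` accessor.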


In [9]:

# Keep the text after the colon (the date is still attached at this point)
df_events['fighters'] = df_events['Name/date'].str.split(":").str[1]

df_events

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_events['fighters'] = df_events['Name/date'].str.split(":").str[1]


Unnamed: 0,Name/date,Location,Evento,data,fighters
1,"UFC 292: Sterling vs. O'Malley August 19, 2023","Boston, Massachusetts, USA",UFC 292,"August 19, 2023","Sterling vs. O'Malley August 19, 2023"
2,UFC Fight Night: Luque vs. Dos Anjos August 1...,"Las Vegas, Nevada, USA",UFC Fight Night,"August 12, 2023","Luque vs. Dos Anjos August 12, 2023"
3,UFC Fight Night: Sandhagen vs. Font August 05...,"Nashville, Tennessee, USA",UFC Fight Night,"August 05, 2023","Sandhagen vs. Font August 05, 2023"
4,"UFC 291: Poirier vs. Gaethje 2 July 29, 2023","Salt Lake City, Utah, USA",UFC 291,"July 29, 2023","Poirier vs. Gaethje 2 July 29, 2023"
5,"UFC Fight Night: Aspinall vs. Tybura July 22,...","London, England, United Kingdom",UFC Fight Night,"July 22, 2023","Aspinall vs. Tybura July 22, 2023"
...,...,...,...,...,...
655,"UFC 6: Clash of the Titans July 14, 1995","Casper, Wyoming, USA",UFC 6,"July 14, 1995","Clash of the Titans July 14, 1995"
656,"UFC 5: The Return of the Beast April 07, 1995","Charlotte, North Carolina, USA",UFC 5,"April 07, 1995","The Return of the Beast April 07, 1995"
657,"UFC 4: Revenge of the Warriors December 16, 1994","Tulsa, Oklahoma, USA",UFC 4,"December 16, 1994","Revenge of the Warriors December 16, 1994"
658,"UFC 3: The American Dream September 09, 1994","Charlotte, North Carolina, USA",UFC 3,"September 09, 1994","The American Dream September 09, 1994"
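The `SettingWithCopyWarning` shown above typically appears when a column is assigned on a DataFrame that was itself obtained by filtering another one. How `df_events` was filtered is not shown in this notebook, so the sketch below assumes it was done with a boolean mask that drops the empty first row. Taking an explicit `.copy()` after filtering makes the later assignment unambiguous.

```python
import pandas as pd

raw = pd.DataFrame({
    "Name/date": [None, "UFC 292: Sterling vs. O'Malley August 19, 2023"],
    "Location": [None, "Boston, Massachusetts, USA"],
})

# Assumed filtering step: drop rows without a name, then copy so that
# new columns are added to an independent DataFrame, not to a slice of `raw`.
events = raw[raw["Name/date"].notna()].copy()

# Assigning a new column on the copy does not trigger the warning.
events["fighters"] = events["Name/date"].str.split(":").str[1]
```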


In [10]:
print(df_events)

                                             Name/date  \
1      UFC 292: Sterling vs. O'Malley  August 19, 2023   
2    UFC Fight Night: Luque vs. Dos Anjos  August 1...   
3    UFC Fight Night: Sandhagen vs. Font  August 05...   
4        UFC 291: Poirier vs. Gaethje 2  July 29, 2023   
5    UFC Fight Night: Aspinall vs. Tybura  July 22,...   
..                                                 ...   
655          UFC 6: Clash of the Titans  July 14, 1995   
656     UFC 5: The Return of the Beast  April 07, 1995   
657  UFC 4: Revenge of the Warriors  December 16, 1994   
658      UFC 3: The American Dream  September 09, 1994   
659                  UFC 2: No Way Out  March 11, 1994   

                            Location           Evento                data  \
1         Boston, Massachusetts, USA          UFC 292     August 19, 2023   
2             Las Vegas, Nevada, USA  UFC Fight Night     August 12, 2023   
3          Nashville, Tennessee, USA  UFC Fight Night     August 05, 202

In [11]:
# Fighters
stringue = "UFC 292: Sterling vs. O'Malley August 19, 2023"

# First treatment: keep the text after the colon
trat1 = stringue.split(":")[1]

# Second treatment: drop the last three tokens (the date)
trat2 = " ".join(trat1.split(" ")[:-3])

trat2

" Sterling vs. O'Malley"

In [12]:
df_events.to_csv("test1.csv")
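The two treatments tested on a single string above can be wrapped in one function that returns the event, the matchup and the date together. This is a sketch under stated assumptions: the function name `split_name_date` is hypothetical, every entry is assumed to end with a three-token date, and an entry without a colon yields an empty matchup.

```python
def split_name_date(text: str) -> tuple[str, str, str]:
    """Split 'Event: Matchup Month DD, YYYY' into (event, matchup, date)."""
    tokens = text.split()              # also collapses repeated spaces
    date = " ".join(tokens[-3:])       # last three tokens: Month, DD, and YYYY
    title = " ".join(tokens[:-3])
    event, _, matchup = title.partition(":")
    return event.strip(), matchup.strip(), date


event, matchup, date = split_name_date("UFC 292: Sterling vs. O'Malley  August 19, 2023")
```

To apply it to the whole column, one option is `df_events['Name/date'].apply(split_name_date)`, which gives a Series of tuples that can then be expanded into separate columns.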

In [13]:
table

<table class="b-statistics__table-events">
<thead class="b-statistics__table-caption">
<tr class="b-statistics__table-row">
<th class="b-statistics__table-col">
                  Name/date
                </th>
<th class="b-statistics__table-col">
                  Location
                </th>
</tr>
</thead>
<tbody>
<tr class="b-statistics__table-row">
<td class="b-statistics__table-col_type_clear"></td>
</tr>
<tr class="b-statistics__table-row_type_first">
<td class="b-statistics__table-col">
<img class="b-statistics__icon" src="http://1e49bc5171d173577ecd-1323f4090557a33db01577564f60846c.r80.cf1.rackcdn.com/next.png">
<i class="b-statistics__table-content">
<a class="b-link b-link_style_white" href="http://www.ufcstats.com/event-details/2719f300b0439039">
                          UFC 292: Sterling vs. O'Malley
                        </a>
<span class="b-statistics__date">
                          August 19, 2023
                        </span>
</i>
</img></td>
<td class="b-statis

### Getting the event links

In [15]:
html_content

<table class="b-statistics__table-events">
<thead class="b-statistics__table-caption">
<tr class="b-statistics__table-row">
<th class="b-statistics__table-col">
                  Name/date
                </th>
<th class="b-statistics__table-col">
                  Location
                </th>
</tr>
</thead>
<tbody>
<tr class="b-statistics__table-row">
<td class="b-statistics__table-col_type_clear"></td>
</tr>
<tr class="b-statistics__table-row_type_first">
<td class="b-statistics__table-col">
<img class="b-statistics__icon" src="http://1e49bc5171d173577ecd-1323f4090557a33db01577564f60846c.r80.cf1.rackcdn.com/next.png">
<i class="b-statistics__table-content">
<a class="b-link b-link_style_white" href="http://www.ufcstats.com/event-details/2719f300b0439039">
                          UFC 292: Sterling vs. O'Malley
                        </a>
<span class="b-statistics__date">
                          August 19, 2023
                        </span>
</i>
</img></td>
<td class="b-statis
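By default `pd.read_html` keeps only the text of each cell, so the event URLs are lost in `df_events`. One way to keep them is to walk the rows with BeautifulSoup and collect the link next to the name, date and location. The sketch below runs on a hand-written fragment modeled on the output above. The `EXAMPLE` href is a placeholder, and because the markup of the location cell is cut off in the output, the code simply assumes the location is the last `<td>` of each row. The function name `parse_event_rows` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modeled on the table output above; the href and the
# markup of the location cell are assumptions made for illustration.
SAMPLE = """
<table class="b-statistics__table-events">
<tbody>
<tr class="b-statistics__table-row">
<td class="b-statistics__table-col_type_clear"></td>
</tr>
<tr class="b-statistics__table-row">
<td class="b-statistics__table-col">
<i class="b-statistics__table-content">
<a class="b-link b-link_style_black" href="http://www.ufcstats.com/event-details/EXAMPLE">
  UFC 292: Sterling vs. O'Malley
</a>
<span class="b-statistics__date">
  August 19, 2023
</span>
</i>
</td>
<td>
  Boston, Massachusetts, USA
</td>
</tr>
</tbody>
</table>
"""


def parse_event_rows(html: str) -> list[dict]:
    """Return one dict per table row that contains an event link and a date."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.b-statistics__table-events tbody tr"):
        link = tr.find("a", href=True)
        date = tr.find("span", class_="b-statistics__date")
        if link is None or date is None:
            continue  # spacer row without event data
        cells = tr.find_all("td")
        rows.append({
            "name": link.get_text(strip=True),
            "date": date.get_text(strip=True),
            "location": cells[-1].get_text(strip=True),  # assumed: last cell
            "url": link["href"],
        })
    return rows


events = parse_event_rows(SAMPLE)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(events)`.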

## **FIGHTERS Code**

The fighter listings are split alphabetically, from "A" to "Z", so a loop that changes the URL for each letter is needed.

In [14]:
# Access link
URL = 'http://www.ufcstats.com/statistics/fighters?char=a&page=all'

# Setting the User-Agent
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
}

response = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(response.content, 'html.parser')


In [15]:
lista_letras = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
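The hand-written list above can also be generated from the standard library, which avoids typing mistakes. A minimal sketch, where `BASE_URL` and `fighter_urls` are names chosen for illustration and the URL pattern is the one used in this notebook:

```python
import string

BASE_URL = "http://www.ufcstats.com/statistics/fighters?char={letter}&page=all"

# One URL per lowercase ASCII letter, "a" through "z".
fighter_urls = [BASE_URL.format(letter=letter) for letter in string.ascii_lowercase]
```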

In [16]:
# Looking up the table's class name
table = soup.find('table')
# print(table.attrs['class'])

type(table.attrs['class'])

list
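`class` is a multi-valued attribute in HTML, which is why BeautifulSoup returns it as a list even when a tag has a single class. A minimal sketch on a hand-written fragment (the second class name, `extra-class`, is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<table class="b-statistics__table extra-class"><tr><td>x</td></tr></table>'
table = BeautifulSoup(html, "html.parser").find("table")

classes = table.attrs["class"]   # list of class names, in document order
first_class = classes[0]         # the value the notebook uses as the table name
```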

In [17]:
# Locating the table by the correct class
table = soup.find('table', class_='b-statistics__table')

table

<table class="b-statistics__table">
<thead class="b-statistics__table-caption">
<tr class="b-statistics__table-row">
<th class="b-statistics__table-col">
          First
        </th>
<th class="b-statistics__table-col">
          Last
        </th>
<th class="b-statistics__table-col">
          Nickname
        </th>
<th class="b-statistics__table-col">
          Ht.
        </th>
<th class="b-statistics__table-col">
          Wt.
        </th>
<th class="b-statistics__table-col">
          Reach
        </th>
<th class="b-statistics__table-col">
          Stance
        </th>
<th class="b-statistics__table-col b-statistics__table-col_type_small">
          W
        </th>
<th class="b-statistics__table-col b-statistics__table-col_type_small">
          L
        </th>
<th class="b-statistics__table-col b-statistics__table-col_type_small">
          D
        </th>
<th class="b-statistics__table-col">
          Belt
        </th>
</tr>
</thead>
<tbody>
<tr class="b-statistics__table-r

In [18]:
# Creating a DataFrame from the table variable
# Converting the HTML table into a DataFrame
df_events = pd.read_html(str(table))[0]

# Displaying the DataFrame
df_events


Unnamed: 0,First,Last,Nickname,Ht.,Wt.,Reach,Stance,W,L,D,Belt
0,,,,,,,,,,,
1,Tom,Aaron,,--,155 lbs.,--,,5.0,3.0,0.0,
2,Danny,Abbadi,The Assassin,"5' 11""",155 lbs.,--,Orthodox,4.0,6.0,0.0,
3,Nariman,Abbasov,Bayraktar,"5' 8""",155 lbs.,"66.0""",Orthodox,28.0,4.0,0.0,
4,David,Abbott,Tank,"6' 0""",265 lbs.,--,Switch,10.0,15.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
209,Abu,Azaitar,Captain Morocco,"5' 9""",185 lbs.,"76.0""",Orthodox,14.0,3.0,1.0,
210,Ottman,Azaitar,Bulldozer,"5' 8""",155 lbs.,"71.0""",Switch,13.0,2.0,0.0,
211,Luiz,Azeredo,,"5' 9""",154 lbs.,--,Orthodox,15.0,10.0,0.0,
212,Luciano,Azevedo,,"6' 3""",161 lbs.,--,Orthodox,17.0,9.0,1.0,


In [8]:
def table_get(url):
    # Setting the User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Looking up the class name of the first table on the page
    table_att = soup.find("table")
    table_name = table_att.attrs["class"][0]

    # Getting the table by the correct class
    table = soup.find("table", class_=table_name)

    # Converting the HTML table into a DataFrame
    df = pd.read_html(str(table))[0]

    return df
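`table_get` assumes that the page contains a table and that the table has a `class` attribute. If a request is blocked or the layout changes, `soup.find("table")` returns `None` and the attribute lookup fails with an `AttributeError` that says little about the cause. The sketch below isolates that step in a hypothetical helper, `first_table_class`, which raises a more descriptive error instead; it is one possible guard, not part of the original function.

```python
from bs4 import BeautifulSoup


def first_table_class(html: str) -> str:
    """Return the first class name of the first <table> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found in the page")
    classes = table.get("class")
    if not classes:
        raise ValueError("the first <table> has no class attribute")
    return classes[0]
```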
    

In [9]:
df_de_teste = table_get("http://www.ufcstats.com/statistics/fighters?char=c&page=all")


df_de_teste

Unnamed: 0,First,Last,Nickname,Ht.,Wt.,Reach,Stance,W,L,D,Belt
0,,,,,,,,,,,
1,Yan,Cabral,,"5' 11""",155 lbs.,"73.0""",Orthodox,13.0,3.0,0.0,
2,Alvin,Cacdac,,"5' 6""",125 lbs.,--,Orthodox,18.0,13.0,0.0,
3,Alex,Caceres,Bruce Leeroy,"5' 10""",145 lbs.,"73.0""",Southpaw,21.0,13.0,0.0,
4,Vince,Cachero,The Anomaly,"5' 6""",145 lbs.,"68.0""",Orthodox,7.0,4.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
267,Kailin,Curran,,"5' 4""",115 lbs.,"65.0""",Orthodox,4.0,5.0,0.0,
268,Pat,Curran,Paddy Mike,"5' 9""",145 lbs.,--,,22.0,7.0,0.0,
269,Chris,Curtis,Action-Man,"5' 10""",185 lbs.,"75.0""",Orthodox,30.0,10.0,0.0,
270,Ion,Cutelaba,The Hulk,"6' 1""",205 lbs.,"75.0""",Southpaw,17.0,9.0,1.0,


In [21]:
lista_dfs = []

for letra in lista_letras:
    df = table_get(f"http://www.ufcstats.com/statistics/fighters?char={letra}&page=all")

    lista_dfs.append(df)


lista_dfs


[       First     Last         Nickname     Ht.       Wt.  Reach    Stance  \
 0        NaN      NaN              NaN     NaN       NaN    NaN       NaN   
 1        Tom    Aaron              NaN      --  155 lbs.     --       NaN   
 2      Danny   Abbadi     The Assassin  5' 11"  155 lbs.     --  Orthodox   
 3    Nariman  Abbasov        Bayraktar   5' 8"  155 lbs.  66.0"  Orthodox   
 4      David   Abbott             Tank   6' 0"  265 lbs.     --    Switch   
 ..       ...      ...              ...     ...       ...    ...       ...   
 209      Abu  Azaitar  Captain Morocco   5' 9"  185 lbs.  76.0"  Orthodox   
 210   Ottman  Azaitar        Bulldozer   5' 8"  155 lbs.  71.0"    Switch   
 211     Luiz  Azeredo              NaN   5' 9"  154 lbs.     --  Orthodox   
 212  Luciano  Azevedo              NaN   6' 3"  161 lbs.     --  Orthodox   
 213   Hunter    Azure              NaN   5' 8"  145 lbs.  69.0"  Orthodox   
 
         W     L    D  Belt  
 0     NaN   NaN  NaN   NaN  
 1
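The loop above sends 26 requests back to back and stops at the first failure. A hedged variant is sketched below: it pauses between requests and records which letters failed instead of aborting. The function name `collect_pages`, the one-second default delay and the choice to catch every `Exception` are all assumptions made for illustration; the fetching function is passed in as a parameter so that `table_get` can be used unchanged.

```python
import time
from typing import Callable, Iterable


def collect_pages(
    letters: Iterable[str],
    fetch: Callable[[str], object],
    delay_seconds: float = 1.0,
) -> tuple[list, list[str]]:
    """Call fetch(url) for each letter; return (results, letters that failed)."""
    url_template = "http://www.ufcstats.com/statistics/fighters?char={}&page=all"
    results, failed = [], []
    for letter in letters:
        try:
            results.append(fetch(url_template.format(letter)))
        except Exception:
            failed.append(letter)     # remember the letter and carry on
        time.sleep(delay_seconds)     # pause between consecutive requests
    return results, failed
```

In this notebook it could be called as `lista_dfs, failed = collect_pages(lista_letras, table_get)`, with `failed` listing the letters that need a retry.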

In [22]:
df_final = pd.concat(lista_dfs, ignore_index=True)

df_final = df_final.dropna(how="all")

df_final

Unnamed: 0,First,Last,Nickname,Ht.,Wt.,Reach,Stance,W,L,D,Belt
1,Tom,Aaron,,--,155 lbs.,--,,5.0,3.0,0.0,
2,Danny,Abbadi,The Assassin,"5' 11""",155 lbs.,--,Orthodox,4.0,6.0,0.0,
3,Nariman,Abbasov,Bayraktar,"5' 8""",155 lbs.,"66.0""",Orthodox,28.0,4.0,0.0,
4,David,Abbott,Tank,"6' 0""",265 lbs.,--,Switch,10.0,15.0,0.0,
5,Hamdy,Abdelwahab,The Hammer,"6' 2""",264 lbs.,"72.0""",Southpaw,5.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...
4116,Dave,Zitanick,,--,170 lbs.,--,,5.0,7.0,0.0,
4117,Alex,Zuniga,,--,145 lbs.,--,,6.0,3.0,0.0,
4118,George,Zuniga,,"5' 9""",185 lbs.,--,,3.0,1.0,0.0,
4119,Allan,Zuniga,Tigre,"5' 7""",155 lbs.,"70.0""",Orthodox,13.0,1.0,0.0,
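`dropna(how="all")` removes only the rows in which every column is missing, which is what eliminates the empty first row of each per-letter table. The surviving rows keep their index labels, which is why `df_final` above starts at 1. The sketch below reproduces this on two tiny stand-in tables; the final `reset_index` call is an optional extra step, not something the notebook does.

```python
import pandas as pd

# Two tiny stand-ins for the per-letter tables; each starts with an empty row,
# as the scraped tables above do.
page_a = pd.DataFrame({"First": [None, "Tom"], "Last": [None, "Aaron"]})
page_c = pd.DataFrame({"First": [None, "Yan"], "Last": [None, "Cabral"]})

combined = pd.concat([page_a, page_c], ignore_index=True)   # index 0..3
cleaned = combined.dropna(how="all")                        # drops rows 0 and 2
renumbered = cleaned.reset_index(drop=True)                 # index 0..1
```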


## Table Fights