<a href="https://colab.research.google.com/github/jplavorr/Math-Behind-Moneyball-with-Python/blob/main/BeatFullSoup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img alt="Colaboratory logo" width="15%" src="https://i.postimg.cc/zXN3DHM3/Captura-de-tela-2021-04-22-145652.png">

#### **Data Science & Machine Learning**
*by [João Pedro Lavor](https://www.linkedin.com/in/jo%C3%A3o-pedro-lavor-65162312b/)*  

---

#Web Scraping for Sports

When I started looking for datasets on sports leagues and their players' statistics, I ran into a big problem. Although I found such datasets on Kaggle, they either lacked complete information or did not include the columns I wanted to analyze. That is when I realized that not all the data we need is readily available. Sometimes, to move a data analysis project forward, we have to do a little extra work to obtain the correct, up-to-date data we need.

This brings us to the topic of this article, **Web Scraping**, which we will use to build the dataset for the series of articles on Data Science applied to sports that I will start posting here on Medium. This article lays the groundwork for how we will extract the game and season statistics used in the analyses over the course of the series.

To gather statistics across a variety of sports, we will use [Sports Reference](https://www.sports-reference.com/), a site that is essentially an encyclopedia of sports statistics. That raised my next question: why not pull the data directly from Basketball Reference? After more research, I found a great Python library that solved this part of my project: BeautifulSoup. This library parses the HTML of a web page so that we can search it and extract the information we need. From there, we will store the data we collect in a DataFrame using pandas.
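
As a minimal sketch of that idea, the snippet below parses a small, made-up HTML table (the table `id`, team names, and numbers are illustrative and not taken from Sports Reference), extracts the header and cell text with BeautifulSoup, and loads the result into a pandas DataFrame.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML fragment standing in for a real statistics page.
sample_html = """
<table id="demo_stats">
  <thead><tr><th>Team</th><th>W</th><th>L</th></tr></thead>
  <tbody>
    <tr><td>Team A</td><td>10</td><td>5</td></tr>
    <tr><td>Team B</td><td>7</td><td>8</td></tr>
  </tbody>
</table>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")
table_demo = soup_demo.find("table", id="demo_stats")

# Column names come from the <th> cells, values from the <td> cells.
demo_headers = [th.get_text() for th in table_demo.thead.find_all("th")]
demo_rows = [[td.get_text() for td in tr.find_all("td")]
             for tr in table_demo.tbody.find_all("tr")]

demo_df = pd.DataFrame(demo_rows, columns=demo_headers)
print(demo_df)
```

The rest of this article follows the same three steps (find the table, collect headers, collect rows) against real pages.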


#Importing Libraries

In [None]:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
# Season we will analyze (used in both the NBA and MLB URLs below)
year = 2018

In [None]:
# MLB league we will analyze
league = 'NL'

In [None]:
# URL of the page we will scrape
url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)

In [None]:
# Baseball Reference URL (this overwrites any `url` value set earlier)
url = "https://www.baseball-reference.com/leagues/{}/{}.shtml".format(league, year)

In [None]:
# this is the HTML from the given URL
html = urlopen(url)
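
Some sites reject requests that do not identify a client, and a network call without a timeout can hang indefinitely. The sketch below shows one way to prepare such a request with the standard library. The helper names `build_request` and `fetch_html`, the User-Agent string, and the 30-second timeout are illustrative assumptions, not requirements of Sports Reference.

```python
from urllib.request import Request, urlopen

def build_request(page_url, user_agent="Mozilla/5.0 (sports-scraping-tutorial)"):
    """Return a Request for page_url that carries an explicit User-Agent header."""
    return Request(page_url, headers={"User-Agent": user_agent})

def fetch_html(page_url, timeout=30):
    """Download the page and return its raw bytes (performs a network call)."""
    with urlopen(build_request(page_url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network; only fetch_html() does.
example_request = build_request("https://www.baseball-reference.com/leagues/NL/2018.shtml")
print(example_request.full_url)
```

The bytes returned by `fetch_html(url)` can be passed to `BeautifulSoup` in the same way as the object returned by `urlopen(url)`.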

In [None]:
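# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")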

In [None]:
table = soup.find('table', id='teams_standard_batting')

In [None]:
table

<table class="sortable stats_table" data-cols-to-freeze=",1" id="teams_standard_batting">
<caption>Team Standard Batting Table</caption>
<colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup>
<thead>
<tr>
<th aria-label="Tm" class=" poptip sort_default_asc show_partial_when_sorting center" data-stat="team_ID" scope="col">Tm</th>
<th aria-label=" Number of Players used in Games " class=" poptip center" data-stat="batters_used" data-tip="&lt;strong&gt;Number of Players used in Games&lt;/strong&gt;" scope="col">#Bat</th>
<th aria-label=" Batters&amp;#x2019; average age Weighted by AB + Games Played" class=" poptip sort_default_asc center" data-stat="age_bat" data-tip="&lt;strong&gt;Batters&amp;#x2019; average age&lt;/strong&gt;&lt;br&gt;Weighted by AB + Games Played" scope="col">BatAge</th>
<th aria-label="Runs Scored/Game" class=" poptip center" dat

In [None]:
columns = table.findAll('tr', limit=2)

In [None]:
headers = [th.getText() for th in columns[0].findAll('th')]

In [None]:
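# use findAll() to inspect the first two rows; searching the whole `soup`
# returns rows from the first table on the page, which is not necessarily `table`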
soup.findAll('tr', limit=2)

[<tr>
 <th aria-label="Tm" class=" poptip sort_default_asc show_partial_when_sorting center" data-stat="team_ID" scope="col">Tm</th>
 <th aria-label="Wins" class=" poptip center" data-stat="W" data-tip="Wins" scope="col">W</th>
 <th aria-label="Losses" class=" poptip center" data-stat="L" data-tip="Losses" scope="col">L</th>
 <th aria-label="Win-Loss %" class=" poptip hide_non_quals center" data-filter="1" data-name="Win-Loss %" data-stat="win_loss_perc" data-tip="&lt;strong&gt;Win-Loss Percentage&lt;/strong&gt;&lt;br&gt;W / (W + L)&lt;br&gt;For players, leaders need one decision for every ten team games.&lt;br&gt;For managers, minimum to qualify for leading is 320 games." scope="col">W-L%</th>
 <th aria-label="Games Back" class=" poptip sort_default_asc center" data-stat="games_back" data-tip="&lt;strong&gt;Games Back of Division/League Leader&lt;/strong&gt;&lt;br&gt;Computed as games over .500 of leader (W-L) minus games over .500 of team divided by two.&lt;br&gt;Typically computed a

In [None]:
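# use getText() to extract the header text into a list
# (search inside `table`, so the headers come from the batting table and not from the first table on the page)
headers = [th.getText() for th in table.findAll('tr', limit=2)[0].findAll('th')]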

In [None]:
# drop the first header ('Tm'): the rows below are built from <td> cells only,
# and the team names are added back later as separate columns
headers = headers[1:]
headers

['#Bat',
 'BatAge',
 'R/G',
 'G',
 'PA',
 'AB',
 'R',
 'H',
 '2B',
 '3B',
 'HR',
 'RBI',
 'SB',
 'CS',
 'BB',
 'SO',
 'BA',
 'OBP',
 'SLG',
 'OPS',
 'OPS+',
 'TB',
 'GDP',
 'HBP',
 'SH',
 'SF',
 'IBB',
 'LOB']

In [None]:
# Build a list with all the statistics in the table
# (the last row of the table body is left out)
rows = table.tbody.findAll('tr')[0:-1]
player_stats_baseball = [[td.getText() for td in row.findAll('td')]
                         for row in rows]

In [None]:
# Collect the names of the teams in the table
teams_list = []
for tag in table.tbody.findAll('a'):
  teams_list.append(tag.get('title'))


In [None]:
# Get the team abbreviations (one list of link texts per row)
team_short = [[a.getText() for a in row.findAll('a')]
              for row in rows]

In [None]:
# Create the list with the league of each team (one entry per row)
leagues = [league] * len(rows)


In [None]:
baseball_stats_2018 = pd.DataFrame(player_stats_baseball, columns = headers)

In [None]:
baseball_stats_2018['Team_Names'] = teams_list

In [None]:
baseball_stats_2018['Team_short'] = team_short

In [None]:
baseball_stats_2018["League"] = leagues

In [None]:
baseball_stats_2018.head()

Unnamed: 0,#Bat,BatAge,R/G,G,PA,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,BA,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB,Team_Names,Team_short,League
0,49,29.2,4.28,162,6157,5460,693,1283,259,50,176,658,79,25,560,1460,0.235,0.31,0.397,0.707,86,2170,110,52,38,45,36,1086,Arizona Diamondbacks,[ARI],NL
1,58,27.3,4.69,162,6251,5582,759,1433,314,29,175,717,90,36,511,1290,0.257,0.324,0.417,0.742,98,2330,99,66,49,43,53,1143,Atlanta Braves,[ATL],NL
2,50,27.2,4.67,163,6369,5624,761,1453,286,34,167,722,66,38,576,1388,0.258,0.333,0.41,0.744,97,2308,107,78,40,46,67,1224,Chicago Cubs,[CHC],NL
3,53,27.2,4.3,162,6240,5532,696,1404,251,25,172,665,77,33,559,1376,0.254,0.328,0.401,0.729,95,2221,128,65,49,35,35,1179,Cincinnati Reds,[CIN],NL
4,41,28.7,4.79,163,6178,5541,780,1418,280,42,210,748,95,33,507,1397,0.256,0.322,0.435,0.757,90,2412,114,51,42,37,38,1067,Colorado Rockies,[COL],NL


In [None]:
# move the team name, abbreviation, and league columns to the front
first_column = baseball_stats_2018.pop('Team_Names')
second_column = baseball_stats_2018.pop('Team_short')
third_column = baseball_stats_2018.pop('League')

# insert(position, column_name, values) places each column at the given position
baseball_stats_2018.insert(0, 'Team_Names', first_column)
baseball_stats_2018.insert(1, 'Team_short', second_column)
baseball_stats_2018.insert(2, 'League', third_column)

In [None]:
baseball_stats_2018.head()

Unnamed: 0,Team_Names,Team_short,League,#Bat,BatAge,R/G,G,PA,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,BA,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB
0,Arizona Diamondbacks,[ARI],NL,49,29.2,4.28,162,6157,5460,693,1283,259,50,176,658,79,25,560,1460,0.235,0.31,0.397,0.707,86,2170,110,52,38,45,36,1086
1,Atlanta Braves,[ATL],NL,58,27.3,4.69,162,6251,5582,759,1433,314,29,175,717,90,36,511,1290,0.257,0.324,0.417,0.742,98,2330,99,66,49,43,53,1143
2,Chicago Cubs,[CHC],NL,50,27.2,4.67,163,6369,5624,761,1453,286,34,167,722,66,38,576,1388,0.258,0.333,0.41,0.744,97,2308,107,78,40,46,67,1224
3,Cincinnati Reds,[CIN],NL,53,27.2,4.3,162,6240,5532,696,1404,251,25,172,665,77,33,559,1376,0.254,0.328,0.401,0.729,95,2221,128,65,49,35,35,1179
4,Colorado Rockies,[COL],NL,41,28.7,4.79,163,6178,5541,780,1418,280,42,210,748,95,33,507,1397,0.256,0.322,0.435,0.757,90,2412,114,51,42,37,38,1067
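
Because `getText()` returns strings, every scraped column is stored as text. Here is a minimal sketch of converting the statistic columns to numbers before analysis. It uses a tiny hand-built frame (`demo_stats`, an illustrative name, with values based on the first two rows above) in place of the scraped one.

```python
import pandas as pd

# Small stand-in for the scraped DataFrame: every value is a string.
demo_stats = pd.DataFrame({
    "Team_Names": ["Arizona Diamondbacks", "Atlanta Braves"],
    "League": ["NL", "NL"],
    "HR": ["176", "175"],
    "BA": [".235", ".257"],
})

# Columns that label the team rather than measure something stay as text.
label_columns = ["Team_Names", "League"]
stat_columns = [c for c in demo_stats.columns if c not in label_columns]

# pd.to_numeric parses "176" as 176 and ".235" as 0.235;
# errors="coerce" replaces anything unparseable with NaN instead of raising.
demo_stats[stat_columns] = demo_stats[stat_columns].apply(pd.to_numeric, errors="coerce")

print(demo_stats.dtypes)
```

The same pattern could be applied to `baseball_stats_2018`, with `Team_Names`, `Team_short`, and `League` as the label columns.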


In [None]:
# avoid the first header row
# (here `soup` and `headers` must come from the Basketball Reference page, not the baseball one)
rows = soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in row.findAll('td')]
                for row in rows]

In [None]:
stats_2019 = pd.DataFrame(player_stats, columns = headers)
stats_2019

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Álex Abrines,SG,25,OKC,31,2,19.0,1.8,5.1,.357,...,.923,0.2,1.4,1.5,0.6,0.5,0.2,0.5,1.7,5.3
1,Quincy Acy,PF,28,PHO,10,0,12.3,0.4,1.8,.222,...,.700,0.3,2.2,2.5,0.8,0.1,0.4,0.4,2.4,1.7
2,Jaylen Adams,PG,22,ATL,34,1,12.6,1.1,3.2,.345,...,.778,0.3,1.4,1.8,1.9,0.4,0.1,0.8,1.3,3.2
3,Steven Adams,C,25,OKC,80,80,33.4,6.0,10.1,.595,...,.500,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9
4,Bam Adebayo,C,21,MIA,82,28,23.3,3.4,5.9,.576,...,.735,2.0,5.3,7.3,2.2,0.9,0.8,1.5,2.5,8.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
729,Tyler Zeller,C,29,MEM,4,1,20.5,4.0,7.0,.571,...,.778,2.3,2.3,4.5,0.8,0.3,0.8,1.0,4.0,11.5
730,Ante Žižić,C,22,CLE,59,25,18.3,3.1,5.6,.553,...,.705,1.8,3.6,5.4,0.9,0.2,0.4,1.0,1.9,7.8
731,Ivica Zubac,C,21,TOT,59,37,17.6,3.6,6.4,.559,...,.802,1.9,4.2,6.1,1.1,0.2,0.9,1.2,2.3,8.9
732,Ivica Zubac,C,21,LAL,33,12,15.6,3.4,5.8,.580,...,.864,1.6,3.3,4.9,0.8,0.1,0.8,1.0,2.2,8.5


In [None]:
range(len(rows))

range(0, 734)
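
A row that contains only `<th>` cells (for example, a header row repeated inside the table body) yields an empty list of `<td>` values, which would end up as a blank row in the DataFrame. Assuming the page has such rows, here is a minimal sketch of filtering them out before building the frame. `scraped_rows` is hand-written illustrative data, not real output.

```python
# Illustrative rows shaped like the output of the list comprehension above:
# rows without any <td> cell come back as empty lists.
scraped_rows = [
    ["Álex Abrines", "SG", "25"],
    [],                      # e.g. a repeated header row made only of <th> cells
    ["Quincy Acy", "PF", "28"],
]

# An empty list is falsy, so this keeps only the rows that actually hold data.
clean_rows = [row for row in scraped_rows if row]

print(len(scraped_rows), "->", len(clean_rows))
```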

In [None]:
year_new = [2018, 2019, 2020]
names = ['Season_2018', 'Season_2019','Season_2020']

In [None]:
for season_year, name in zip(year_new, names):
    url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(season_year)
    # this is the HTML from the given URL
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    headers = headers[1:]
    # avoid the first header row
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in row.findAll('td')]
                    for row in rows]
    # create one notebook-level variable per season (Season_2018, Season_2019, Season_2020)
    globals()[name] = pd.DataFrame(player_stats, columns=headers)


In [None]:
Season_2020

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Steven Adams,C,26,OKC,63,63,26.7,4.5,7.6,.592,...,.582,3.3,6.0,9.3,2.3,0.8,1.1,1.5,1.9,10.9
1,Bam Adebayo,PF,22,MIA,72,72,33.6,6.1,11.0,.557,...,.691,2.4,7.8,10.2,5.1,1.1,1.3,2.8,2.5,15.9
2,LaMarcus Aldridge,C,34,SAS,53,53,33.1,7.4,15.0,.493,...,.827,1.9,5.5,7.4,2.4,0.7,1.6,1.4,2.4,18.9
3,Kyle Alexander,C,23,MIA,2,0,6.5,0.5,1.0,.500,...,,1.0,0.5,1.5,0.0,0.0,0.0,0.5,0.5,1.0
4,Nickeil Alexander-Walker,SG,21,NOP,47,1,12.6,2.1,5.7,.368,...,.676,0.2,1.6,1.8,1.9,0.4,0.2,1.1,1.2,5.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
672,Trae Young,PG,21,ATL,60,60,35.3,9.1,20.8,.437,...,.860,0.5,3.7,4.3,9.3,1.1,0.1,4.8,1.7,29.6
673,Cody Zeller,C,27,CHO,58,39,23.1,4.3,8.3,.524,...,.682,2.8,4.3,7.1,1.5,0.7,0.4,1.3,2.4,11.1
674,Tyler Zeller,C,30,SAS,2,0,2.0,0.5,2.0,.250,...,,1.5,0.5,2.0,0.0,0.0,0.0,0.0,0.0,1.0
675,Ante Žižić,C,23,CLE,22,0,10.0,1.9,3.3,.569,...,.737,0.8,2.2,3.0,0.3,0.3,0.2,0.5,1.2,4.4
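
A possible alternative to creating one variable per season is to keep the frames in a dictionary keyed by year and stack them into a single table with a `Season` column. The sketch below uses two tiny hand-made frames in place of the scraped ones. `season_frames` and `all_seasons` are illustrative names, and the values are taken from the Steven Adams rows shown above.

```python
import pandas as pd

# Stand-ins for the scraped per-season DataFrames.
season_frames = {
    2019: pd.DataFrame({"Player": ["Steven Adams"], "PTS": ["13.9"]}),
    2020: pd.DataFrame({"Player": ["Steven Adams"], "PTS": ["10.9"]}),
}

# Tag each frame with its season, then stack them into a single table.
all_seasons = pd.concat(
    [frame.assign(Season=season) for season, frame in season_frames.items()],
    ignore_index=True,
)
print(all_seasons)
```

With all seasons in one frame, comparisons across years become a matter of filtering or grouping on `Season`.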
