<a href="https://colab.research.google.com/github/jplavorr/Math-Behind-Moneyball-with-Python/blob/main/Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img alt="Colaboratory logo" width="15%" src="https://i.postimg.cc/zXN3DHM3/Captura-de-tela-2021-04-22-145652.png">

#### **Data Science & Machine Learning**
*by [João Pedro Lavor](https://www.linkedin.com/in/jo%C3%A3o-pedro-lavor-65162312b/)*  

---

#Web Scraping for Sports

When I started looking for datasets on sports leagues and player statistics, I ran into a problem. Although I found such datasets on Kaggle, they were either incomplete or lacked the columns I wanted to analyze. That was when I realized that the data we need is not always readily available. To move a data analysis project forward, we sometimes have to do extra work to obtain appropriate, up-to-date data.

That brings us to the topic of this article, **Web Scraping**, which we will use to build the datasets for the upcoming series of articles on data science applied to sports. This article lays the groundwork for how we will extract game and season statistics for the analyses to come.

To gather statistics across a variety of sports, we will use [Sports Reference](https://www.sports-reference.com/), which is essentially an encyclopedia of sports statistics. That led to my next question: why not pull the data directly from Basketball Reference? After more research, I found a great Python library that solved this part of my project: BeautifulSoup. It is a web scraping library that lets us search a page's HTML and extract the information we need. From there, we store the collected data in a DataFrame using pandas.
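As a minimal sketch of that workflow, the snippet below parses a tiny hand-written HTML table and loads it into a DataFrame. The markup, the `id` value `demo`, and the numbers are invented for illustration and are not taken from Sports Reference.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML, written by hand for this example
html_doc = """
<table id="demo">
  <thead><tr><th>Team</th><th>W</th><th>L</th></tr></thead>
  <tbody>
    <tr><td>Team A</td><td>10</td><td>5</td></tr>
    <tr><td>Team B</td><td>7</td><td>8</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table", id="demo")

# Column names come from the <th> cells, data from the <td> cells
headers = [th.get_text() for th in table.thead.find_all("th")]
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in table.tbody.find_all("tr")]

df = pd.DataFrame(rows, columns=headers)
print(df)
```

Every value extracted this way is a string, so numeric columns still need to be converted before any analysis.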


#Importing Libraries

In [28]:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

#Web Scraping for MLB

In [44]:
# MLB league to analyze
league = 'NL'
# Season to analyze
year = 2018

In [50]:
url = "https://www.baseball-reference.com/leagues/{}/{}.shtml".format(league, year)
url2 = 'https://www.baseball-reference.com/leagues/{}/{}-standard-pitching.shtml'.format(league, year)
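The two URLs follow the same pattern, so they can be produced by one small helper. This is only a sketch: `build_league_urls` and `BASE_URL` are hypothetical names introduced here, and the pattern is the one used in the cell above.

```python
BASE_URL = "https://www.baseball-reference.com/leagues"

def build_league_urls(league, year):
    """Return the (batting, pitching) page URLs for one league and season."""
    batting = "{}/{}/{}.shtml".format(BASE_URL, league, year)
    pitching = "{}/{}/{}-standard-pitching.shtml".format(BASE_URL, league, year)
    return batting, pitching

batting_url, pitching_url = build_league_urls("NL", 2018)
print(batting_url)
print(pitching_url)
```

Keeping the pattern in one place makes it easier to loop over several seasons later without repeating the format strings.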

In [51]:
html = urlopen(url)
html2 = urlopen(url2)
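If the plain `urlopen(url)` call is rejected or hangs, one option is to send an explicit `User-Agent` header and set a timeout. The sketch below assumes that might help; whether the server actually requires it is not confirmed here, and `build_request`, `fetch_html`, and the user-agent string are hypothetical.

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent="Mozilla/5.0 (compatible; demo-scraper)"):
    """Create a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=30):
    """Download a page and return its raw bytes, closing the connection afterwards."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network; only fetch_html does
req = build_request("https://www.baseball-reference.com/leagues/NL/2018.shtml")
print(req.full_url)
```

The bytes returned by `fetch_html` can be passed to `BeautifulSoup` in the same way as the object returned by `urlopen`.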

In [52]:
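# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
soup2 = BeautifulSoup(html2, "html.parser")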

Let's select the table we want from each page.

In [53]:
table_Batting = soup.find('table', id="teams_standard_batting")

In [55]:
table_Pitching = soup2.find('table', id="teams_standard_pitching")

In [56]:
columns_Batting = table_Batting.findAll(lambda tag: tag.name=='tr',limit=2)
columns_Pitching = table_Pitching.findAll(lambda tag: tag.name=='tr', limit=2)

In [97]:
headers_Batting = [th.getText() for th in columns_Batting[0].findAll('th')]
headers_Batting_final = headers_Batting[1:]
headers_Pitching = [th.getText() for th in columns_Pitching[0].findAll('th')]
headers_Pitching_final = headers_Pitching[1:]

In [86]:
# Build a list with every stat in the table, skipping the last row of the table body
rows_Batting = table_Batting.tbody.findAll('tr')[0:-1]
player_stats_baseball_Batting = [[td.getText() for td in row.findAll('td')]
            for row in rows_Batting]

In [87]:
# Build a list with every stat in the table, skipping the last row of the table body
rows_Pitching = table_Pitching.tbody.findAll('tr')[0:-1]
player_stats_baseball_Pitching = [[td.getText() for td in row.findAll('td')]
            for row in rows_Pitching]
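The reason the first header is dropped and the team names are collected separately can be seen on a small hand-written table. The layout below is an assumption meant to mimic the page: the first cell of each body row is a `<th>`, so a `td`-only extraction yields one column fewer than the header row. All names and numbers are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: first cell of each body row is a <th>, not a <td>
html_doc = """
<table id="demo_teams">
  <thead><tr><th>Tm</th><th>R</th><th>H</th></tr></thead>
  <tbody>
    <tr><th><a href="#">Team A</a></th><td>1</td><td>2</td></tr>
    <tr><th><a href="#">Team B</a></th><td>3</td><td>4</td></tr>
    <tr><th>League Average</th><td>2</td><td>3</td></tr>
  </tbody>
</table>
"""
table = BeautifulSoup(html_doc, "html.parser").find("table", id="demo_teams")

headers = [th.get_text() for th in table.find("tr").find_all("th")]
body_rows = table.tbody.find_all("tr")[:-1]  # drop the trailing summary row
stats = [[td.get_text() for td in row.find_all("td")] for row in body_rows]
teams = [a.get_text() for a in table.tbody.find_all("a")]

print(headers[1:])  # header names that line up with the <td> cells
print(stats)
print(teams)
```

Because the team name sits in a `<th>` with a link, it is missing from `stats` and has to be added back as its own column.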

In [75]:
# Collect the team names from the links in the table body
teams_list = []
for tag in table_Pitching.tbody.findAll('a'):
    teams_list.append(tag.getText())


In [101]:
# Build a list with the league label, one entry per team
leagues = [league] * len(teams_list)


In [98]:
baseball_stats_Batting_2018 = pd.DataFrame(player_stats_baseball_Batting, columns = headers_Batting_final)
baseball_stats_Pitching_2018 = pd.DataFrame(player_stats_baseball_Pitching, columns = headers_Pitching_final)

In [99]:
baseball_stats_Batting_2018['Team_Names'] = teams_list
baseball_stats_Pitching_2018['Team_Names'] = teams_list

In [102]:
baseball_stats_Batting_2018["League"] = leagues
baseball_stats_Pitching_2018["League"] = leagues

In [105]:
# Move 'Team_Names' and 'League' to the first two positions
first_column = baseball_stats_Batting_2018.pop('Team_Names')
second_column = baseball_stats_Batting_2018.pop('League')

# insert(position, column_name, values) puts each column back at the given index
baseball_stats_Batting_2018.insert(0, 'Team_Names', first_column)
baseball_stats_Batting_2018.insert(1, 'League', second_column)

In [107]:
# Move 'Team_Names' and 'League' to the first two positions
first_column = baseball_stats_Pitching_2018.pop('Team_Names')
second_column = baseball_stats_Pitching_2018.pop('League')

# insert(position, column_name, values) puts each column back at the given index
baseball_stats_Pitching_2018.insert(0, 'Team_Names', first_column)
baseball_stats_Pitching_2018.insert(1, 'League', second_column)
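Since the same pop-and-insert steps are repeated for both tables, they could be folded into one helper. The sketch below is an alternative that reorders by column selection instead; `move_to_front` is a hypothetical name and the sample frame is invented.

```python
import pandas as pd

def move_to_front(df, columns):
    """Return a copy of df with the given columns first, the rest in original order."""
    remaining = [c for c in df.columns if c not in columns]
    return df[list(columns) + remaining]

# Small invented frame with the label columns at the end
demo = pd.DataFrame({"R": [1, 2],
                     "Team_Names": ["Team A", "Team B"],
                     "League": ["NL", "NL"]})
demo = move_to_front(demo, ["Team_Names", "League"])
print(list(demo.columns))
```

Unlike `pop` and `insert`, this version does not modify the original DataFrame in place, so the result has to be assigned back.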

In [106]:
baseball_stats_Batting_2018.head()

Unnamed: 0,Team_Names,League,#Bat,BatAge,R/G,G,PA,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,BA,OBP,SLG,OPS,OPS+,TB,GDP,HBP,SH,SF,IBB,LOB
0,Arizona Diamondbacks,NL,49,29.2,4.28,162,6157,5460,693,1283,259,50,176,658,79,25,560,1460,0.235,0.31,0.397,0.707,86,2170,110,52,38,45,36,1086
1,Atlanta Braves,NL,58,27.3,4.69,162,6251,5582,759,1433,314,29,175,717,90,36,511,1290,0.257,0.324,0.417,0.742,98,2330,99,66,49,43,53,1143
2,Chicago Cubs,NL,50,27.2,4.67,163,6369,5624,761,1453,286,34,167,722,66,38,576,1388,0.258,0.333,0.41,0.744,97,2308,107,78,40,46,67,1224
3,Cincinnati Reds,NL,53,27.2,4.3,162,6240,5532,696,1404,251,25,172,665,77,33,559,1376,0.254,0.328,0.401,0.729,95,2221,128,65,49,35,35,1179
4,Colorado Rockies,NL,41,28.7,4.79,163,6178,5541,780,1418,280,42,210,748,95,33,507,1397,0.256,0.322,0.435,0.757,90,2412,114,51,42,37,38,1067


In [108]:
baseball_stats_Pitching_2018.head()

Unnamed: 0,Team_Names,League,#P,PAge,RA/G,W,L,W-L%,ERA,G,GS,GF,CG,tSho,cSho,SV,IP,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W,LOB
0,Arizona Diamondbacks,NL,30,29.6,3.98,82,80,0.506,3.72,162,162,160,2,9,1,39,1463.0,1313,644,605,174,522,43,1448,57,5,69,6139,113,3.91,1.254,8.1,1.1,3.2,8.9,2.77,1106
1,Atlanta Braves,NL,35,27.7,4.06,90,72,0.556,3.75,162,162,160,2,11,1,40,1456.2,1236,657,607,153,635,43,1423,52,8,61,6155,110,3.98,1.284,7.6,0.9,3.9,8.8,2.24,1128
2,Chicago Cubs,NL,35,30.2,3.96,95,68,0.583,3.65,163,163,162,1,18,0,46,1476.1,1319,645,598,157,622,33,1333,66,3,46,6264,115,4.13,1.315,8.0,1.0,3.8,8.1,2.14,1190
3,Cincinnati Reds,NL,32,27.1,5.06,67,95,0.414,4.63,162,162,161,1,6,0,38,1441.0,1491,819,741,228,532,60,1258,50,8,48,6279,90,4.66,1.404,9.3,1.4,3.3,7.9,2.36,1137
4,Colorado Rockies,NL,21,27.9,4.57,91,72,0.558,4.33,163,163,163,0,10,0,51,1452.1,1377,745,699,184,525,24,1409,52,8,70,6154,109,4.06,1.31,8.5,1.1,3.3,8.7,2.68,1052


#Web Scraping for NBA

In [109]:
# Season to analyze (defined before it is used in the URL)
year = 2018
# URL of the page we are going to scrape
url_nba = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(year)

In [110]:
html_nba = urlopen(url_nba)
soup_nba = BeautifulSoup(html_nba)

In [121]:
# Skip the first row, which holds the column headers
rows = soup_nba.findAll('tr')[1:]
players_stats_2018 = [[td.getText() for td in row.findAll('td')]
            for row in rows]

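In [115]:
# Read the column names from the first row of the page
headers_2018 = [th.getText() for th in soup_nba.findAll('tr', limit=2)[0].findAll('th')]

In [124]:
# Drop the first header so the names line up with the <td> cells collected above
headers_2018_final = headers_2018[1:]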

#Players

In [125]:
stats_2018 = pd.DataFrame(players_stats_2018, columns = headers_2018_final)
stats_2018

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Álex Abrines,SG,24,OKC,75,8,15.1,1.5,3.9,.395,1.1,2.9,.380,0.4,0.9,.443,.540,0.5,0.6,.848,0.3,1.2,1.5,0.4,0.5,0.1,0.3,1.7,4.7
1,Quincy Acy,PF,27,BRK,70,8,19.4,1.9,5.2,.356,1.5,4.2,.349,0.4,1.0,.384,.496,0.7,0.9,.817,0.6,3.1,3.7,0.8,0.5,0.4,0.9,2.1,5.9
2,Steven Adams,C,24,OKC,76,76,32.7,5.9,9.4,.629,0.0,0.0,.000,5.9,9.3,.631,.629,2.1,3.8,.559,5.1,4.0,9.0,1.2,1.2,1.0,1.7,2.8,13.9
3,Bam Adebayo,C,20,MIA,69,19,19.8,2.5,4.9,.512,0.0,0.1,.000,2.5,4.8,.523,.512,1.9,2.6,.721,1.7,3.8,5.5,1.5,0.5,0.6,1.0,2.0,6.9
4,Arron Afflalo,SG,32,ORL,53,3,12.9,1.2,3.1,.401,0.5,1.3,.386,0.7,1.7,.413,.485,0.4,0.5,.846,0.1,1.2,1.2,0.6,0.1,0.2,0.4,1.1,3.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,Tyler Zeller,C,28,BRK,42,33,16.7,3.0,5.5,.546,0.2,0.6,.385,2.7,4.8,.567,.568,1.0,1.4,.667,1.5,3.1,4.6,0.7,0.2,0.5,0.8,1.9,7.1
686,Tyler Zeller,C,28,MIL,24,1,16.9,2.6,4.4,.590,0.0,0.1,.000,2.6,4.3,.602,.590,0.7,0.8,.895,2.0,2.7,4.6,0.8,0.3,0.6,0.5,2.0,5.9
687,Paul Zipser,SF,23,CHI,54,12,15.3,1.5,4.3,.346,0.7,2.0,.336,0.8,2.3,.355,.425,0.4,0.5,.760,0.2,2.2,2.4,0.9,0.4,0.3,0.8,1.6,4.0
688,Ante Žižić,C,21,CLE,32,2,6.7,1.5,2.1,.731,0.0,0.0,,1.5,2.1,.731,.731,0.7,0.9,.724,0.8,1.1,1.9,0.2,0.1,0.4,0.3,0.9,3.7
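Every cell scraped this way is a string, and any row without `<td>` cells becomes an empty list. The sketch below shows one way to clean that up, assuming the table repeats its header row somewhere in the body. The rows and player names are invented for illustration.

```python
import pandas as pd

# Hypothetical scraped rows: every value is a string, and a repeated
# header row has produced an empty list
raw_rows = [
    ["Player A", "SG", "24", "4.7"],
    [],
    ["Player B", "C", "20", "13.9"],
]
columns = ["Player", "Pos", "Age", "PTS"]

# Keep only rows that actually contain cells
clean_rows = [row for row in raw_rows if row]
stats = pd.DataFrame(clean_rows, columns=columns)

# Convert the numeric columns; text that cannot be parsed becomes NaN
numeric_cols = ["Age", "PTS"]
stats[numeric_cols] = stats[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(stats.dtypes)
```

With `errors="coerce"`, blank percentages such as an empty `3P%` would end up as `NaN` rather than raising an error.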


In [None]:
range(len(rows))

range(0, 734)

In [None]:
year_new = [2018, 2019, 2020]
names = ['Season_2018', 'Season_2019','Season_2020']

In [None]:
# Scrape each season and keep the DataFrames in a dict keyed by name
seasons = {}
for season_year, name in zip(year_new, names):
    url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html".format(season_year)
    # this is the HTML from the given URL
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    headers = headers[1:]
    # skip the first row, which holds the column headers
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in row.findAll('td')]
                for row in rows]
    seasons[name] = pd.DataFrame(player_stats, columns = headers)


In [None]:
seasons['Season_2020']

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Steven Adams,C,26,OKC,63,63,26.7,4.5,7.6,.592,...,.582,3.3,6.0,9.3,2.3,0.8,1.1,1.5,1.9,10.9
1,Bam Adebayo,PF,22,MIA,72,72,33.6,6.1,11.0,.557,...,.691,2.4,7.8,10.2,5.1,1.1,1.3,2.8,2.5,15.9
2,LaMarcus Aldridge,C,34,SAS,53,53,33.1,7.4,15.0,.493,...,.827,1.9,5.5,7.4,2.4,0.7,1.6,1.4,2.4,18.9
3,Kyle Alexander,C,23,MIA,2,0,6.5,0.5,1.0,.500,...,,1.0,0.5,1.5,0.0,0.0,0.0,0.5,0.5,1.0
4,Nickeil Alexander-Walker,SG,21,NOP,47,1,12.6,2.1,5.7,.368,...,.676,0.2,1.6,1.8,1.9,0.4,0.2,1.1,1.2,5.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
672,Trae Young,PG,21,ATL,60,60,35.3,9.1,20.8,.437,...,.860,0.5,3.7,4.3,9.3,1.1,0.1,4.8,1.7,29.6
673,Cody Zeller,C,27,CHO,58,39,23.1,4.3,8.3,.524,...,.682,2.8,4.3,7.1,1.5,0.7,0.4,1.3,2.4,11.1
674,Tyler Zeller,C,30,SAS,2,0,2.0,0.5,2.0,.250,...,,1.5,0.5,2.0,0.0,0.0,0.0,0.0,0.0,1.0
675,Ante Žižić,C,23,CLE,22,0,10.0,1.9,3.3,.569,...,.737,0.8,2.2,3.0,0.3,0.3,0.2,0.5,1.2,4.4
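For analyses that span several seasons, the per-season tables can be stacked into one DataFrame with a column that records the season. This is a sketch with invented stand-in frames; `season_frames` and `all_seasons` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-ins for the per-season DataFrames
season_frames = {
    2018: pd.DataFrame({"Player": ["Player A"], "PTS": ["4.7"]}),
    2019: pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": ["5.3", "8.9"]}),
}

# Tag each frame with its season, then stack them into one table
all_seasons = pd.concat(
    [df.assign(Season=year) for year, df in season_frames.items()],
    ignore_index=True,
)
print(all_seasons)
```

`ignore_index=True` renumbers the rows so that the index stays unique after the frames are combined.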
