# Web Scraping FIFA22

The goal of this project was to practice web scraping, consume the scraped data in BI tools such as Power BI, and publish the dataset on Kaggle.

Dataset published on Kaggle: https://www.kaggle.com/datasets/juniorverli/fifa-22-all-normal-players-from-ultimate-team

## Libraries used

Three libraries were used: BeautifulSoup for the web scraping itself, requests to connect to the Futhead site, and pandas to load the final dataset and export the data.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Progress bar

The progress bar was created to track how long each script takes to run and to confirm that the code is still executing correctly.

In [2]:
def progress_bar(progress, total):
    percent = 100 * (progress / float(total))
    bar = '-' * int(percent) + ' ' * (100 - int(percent))
    print(f"\r|{bar}| {percent:.2f}%", end="\r")
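
The function above prints directly and always draws 100 characters. As a minimal sketch, the same calculation can return the text instead, which makes it easy to inspect. The name `render_bar` and its `width` parameter are hypothetical and not part of the notebook; the guard for a zero total is an added assumption.

```python
def render_bar(progress, total, width=50):
    """Return the bar as a string instead of printing it."""
    if total <= 0:
        # Nothing to measure yet: show an empty bar instead of dividing by zero.
        return "|" + " " * width + "| 0.00%"
    # Clamp to the 0..1 range so the bar never overflows its width.
    fraction = min(max(progress / total, 0.0), 1.0)
    filled = int(fraction * width)
    return f"|{'-' * filled}{' ' * (width - filled)}| {fraction * 100:.2f}%"


print(render_bar(39, 39, width=20))
```

Inside a loop, the returned string could be printed with `end="\r"` to get the same in-place update as the original function.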

### Header used to connect to the Futhead page

In [3]:
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36'}

### Creating the list that will hold the player links

In [4]:
playerLink = []

## Fetching each player's link

This part collects the link to each player's page on Futhead. The search is split in two, goalkeepers and all other players, because of a site limitation that does not allow querying every player at once.

### Fetching the links of all normal goalkeepers from the Futhead database

In [5]:
for i in range(1,40):
    page = requests.get("https://www.futhead.com/22/players/"+"?level=all_nif&group=gk&page="+str(i)+"&bin_platform=ps", headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    for link in soup.find_all("a", {"class": "display-block padding-0"}):
        playerLink.append(link['href'])
    progress_bar(i,39)

|----------------------------------------------------------------------------------------------------| 100.00%
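
The listing URLs in these loops are assembled by string concatenation. A minimal sketch of building the same URLs with `urllib.parse.urlencode` is shown below; the helper name `listing_url` is hypothetical, and the query parameters are the ones that appear in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.futhead.com/22/players/"


def listing_url(page, group=None):
    """Build a listing URL; `group` is only added when given (e.g. "gk")."""
    params = {"level": "all_nif"}
    if group is not None:
        params["group"] = group
    params["page"] = page
    params["bin_platform"] = "ps"
    # urlencode keeps the insertion order of the dict and escapes the values.
    return f"{BASE_URL}?{urlencode(params)}"


print(listing_url(1, group="gk"))
print(listing_url(1))
```

Keeping the parameters in a dict makes it harder to misplace a `&` or `?` when a filter is added or removed.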

### Fetching the links of normal players, excluding goalkeepers, from the Futhead database

In [6]:
for i in range(1,210):
    page = requests.get("https://www.futhead.com/22/players/"+"?level=all_nif&page="+str(i)+"&bin_platform=ps", headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    for link in soup.find_all("a", {"class": "display-block padding-0"}):
        playerLink.append(link['href'])
    progress_bar(i,209)

|----------------------------------------------------------------------------------------------------| 100.00%

In [7]:
print(len(playerLink))

11864
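
The notebook does not say whether the two listing queries can return the same player twice. If they can, which is only an assumption here, the links could be deduplicated while keeping their original order. The helper name `unique_in_order` and the second sample link are made up for illustration.

```python
def unique_in_order(links):
    """Drop repeated links, keeping the first occurrence of each."""
    # dict keys preserve insertion order, so the original ordering survives.
    return list(dict.fromkeys(links))


sample_links = [
    "/22/players/20833/ladislav-krejci/",
    "/22/players/1/example-player/",       # made-up link for illustration
    "/22/players/20833/ladislav-krejci/",  # repeated on purpose
]
unique_links = unique_in_order(sample_links)
print(len(sample_links), len(unique_links))
```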


### Lists used in the script

These are the list of player attributes (playerAttributes), which is later appended to the dataset list (dataset), and the list of players whose data has already been fetched from Futhead (playerOK).

The playerOK list was created as a safeguard in case something in the script goes wrong, so that a rerun can skip the players already fetched.

In [8]:
dataset = []
playerAttributes = []
playerOK = []
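
The checkpoint idea behind playerOK can be sketched without touching the network. The example below is an illustration under stated assumptions, not the notebook's code: `process_links` and `flaky_fetch` are hypothetical names, a set replaces the list for faster lookups, and the stand-in fetch function fails on purpose to show a second pass picking up what the first one missed.

```python
def process_links(links, fetch, done, rows):
    """Fetch every link not yet in `done`; failed links are left for a later pass."""
    for link in links:
        if link in done:
            continue
        try:
            row = fetch(link)
        except Exception as exc:
            print(f"skipped {link}: {exc}")
            continue
        rows.append(row)
        done.add(link)


calls = {"n": 0}


def flaky_fetch(link):
    """Stand-in for a real request: fails for "/b/" during the first two calls."""
    calls["n"] += 1
    if link == "/b/" and calls["n"] <= 2:
        raise RuntimeError("temporary failure")
    return [link.strip("/")]


links = ["/a/", "/b/", "/c/"]
done, rows = set(), []
process_links(links, flaky_fetch, done, rows)  # first pass: "/b/" fails
process_links(links, flaky_fetch, done, rows)  # second pass: only "/b/" is retried
print(rows)
```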

## Fetching the final data

This part contains the function that fetches the data for each player: name, player image, position, team, team image, league, league image, nationality, country image, strong foot, skill moves (star rating, up to 5), weak foot (star rating, up to 5), birth date, height in centimeters, attacking and defensive workrates (low, medium, or high), card type (bronze, silver, or gold), overall rating, and the card attributes such as pace, shooting, and passing.

### Testing the script against a single player's link
The script was first tested on a single player page to confirm that it collects all the required information.

In [None]:
page = requests.get("https://www.futhead.com/22/players/20833/ladislav-krejci/", headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all("div", {"class": "col-flex-300"})[1].find_all('div')[0].get('data-player-full-name'))
print(soup.find_all("div", {"class": "playercard-picture"})[0].find_all('img')[0].get('data-src'))
print(soup.find_all("div", {"class": "playercard-position"})[0].get_text().strip())
for i in range(len(soup.find_all("div", {"class": "col-xs-7"}))-1): 
    if(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) in ("Skill Moves", "Weak Foot"):
        print(len(soup.find_all("div", {"class": "col-xs-5"})[i].find_all()))
    elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Age":
        print(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip()[-10:])
    elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Height":
        print(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip()[:5])
    elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Nation":
        print(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
        print(soup.find_all("div", {"class": "col-xs-7"})[i].find_all('img')[0].get('src'))
    elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "League":
        print(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
        print(soup.find_all("div", {"class": "col-xs-7"})[i].find_all('img')[0].get('src'))
    elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Club":
        print(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
        print(soup.find_all("div", {"class": "col-xs-7"})[i].find_all('img')[0].get('src'))
    elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Strong Foot":
        print(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
    elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Workrates":
        print(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip().replace(' ', ''))
print(soup.find_all("div", {"class": "player-cards"})[0].find_all('div')[0].get('class')[2].title())
print(soup.find_all("div", {"class": "playercard-rating"})[0].get_text().strip())
print(soup.find_all("div", {"class": "playercard-attr playercard-attr1"})[0].get_text()[:3].strip())
print(soup.find_all("div", {"class": "playercard-attr playercard-attr2"})[0].get_text()[:3].strip())
print(soup.find_all("div", {"class": "playercard-attr playercard-attr3"})[0].get_text()[:3].strip())
print(soup.find_all("div", {"class": "playercard-attr playercard-attr4"})[0].get_text()[:3].strip())
print(soup.find_all("div", {"class": "playercard-attr playercard-attr5"})[0].get_text()[:3].strip())
print(soup.find_all("div", {"class": "playercard-attr playercard-attr6"})[0].get_text()[:3].strip())
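
The test cell calls `soup.find_all` again for every comparison. As a sketch of the same label/value pairing, each column can be queried once and the two results zipped into a dict. The HTML below is a made-up fragment that only imitates the `col-xs-7` / `col-xs-5` layout the script relies on; it is not copied from Futhead, and `label_value_pairs` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the label (col-xs-7) / value (col-xs-5) layout.
SAMPLE_HTML = """
<div class="row"><div class="col-xs-7">Club</div><div class="col-xs-5">Example FC</div></div>
<div class="row"><div class="col-xs-7">Strong Foot</div><div class="col-xs-5">Left</div></div>
<div class="row"><div class="col-xs-7">Workrates</div><div class="col-xs-5">High / Low</div></div>
"""


def label_value_pairs(html):
    """Map each label text to the text of the value cell at the same position."""
    soup = BeautifulSoup(html, "html.parser")
    labels = soup.find_all("div", {"class": "col-xs-7"})
    values = soup.find_all("div", {"class": "col-xs-5"})
    return {
        label.get_text().strip(): value.get_text().strip()
        for label, value in zip(labels, values)
    }


info = label_value_pairs(SAMPLE_HTML)
print(info["Workrates"].replace(" ", ""))
```

This only covers the text values; fields that the script reads from images or by counting child elements would still need their own handling.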

### Final script: fetching player data through a function

In [9]:
def searchAttributes(search):

    page = requests.get("https://www.futhead.com"+search, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    playerAttributes.append(soup.find_all("div", {"class": "col-flex-300"})[1].find_all('div')[0].get('data-player-full-name'))
    playerAttributes.append(soup.find_all("div", {"class": "playercard-picture"})[0].find_all('img')[0].get('data-src'))
    playerAttributes.append(soup.find_all("div", {"class": "playercard-position"})[0].get_text().strip())
    for i in range(len(soup.find_all("div", {"class": "col-xs-7"}))-1): 
        if(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) in ("Skill Moves", "Weak Foot"):
            playerAttributes.append(len(soup.find_all("div", {"class": "col-xs-5"})[i].find_all()))
        elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Age":
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip()[-10:])
        elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Height":
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip()[:5])
        elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Nation":
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-7"})[i].find_all('img')[0].get('src'))
        elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "League":
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-7"})[i].find_all('img')[0].get('src'))
        elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Club":
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-7"})[i].find_all('img')[0].get('src'))
        elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Strong Foot":
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip())
        elif(soup.find_all("div", {"class": "col-xs-7"})[i].get_text().strip()) == "Workrates":
            playerAttributes.append(soup.find_all("div", {"class": "col-xs-5"})[i].get_text().strip().replace(' ', ''))
    playerAttributes.append(soup.find_all("div", {"class": "player-cards"})[0].find_all('div')[0].get('class')[2].title())
    playerAttributes.append(soup.find_all("div", {"class": "playercard-rating"})[0].get_text().strip())
    playerAttributes.append(soup.find_all("div", {"class": "playercard-attr playercard-attr1"})[0].get_text()[:3].strip())
    playerAttributes.append(soup.find_all("div", {"class": "playercard-attr playercard-attr2"})[0].get_text()[:3].strip())
    playerAttributes.append(soup.find_all("div", {"class": "playercard-attr playercard-attr3"})[0].get_text()[:3].strip())
    playerAttributes.append(soup.find_all("div", {"class": "playercard-attr playercard-attr4"})[0].get_text()[:3].strip())
    playerAttributes.append(soup.find_all("div", {"class": "playercard-attr playercard-attr5"})[0].get_text()[:3].strip())
    playerAttributes.append(soup.find_all("div", {"class": "playercard-attr playercard-attr6"})[0].get_text()[:3].strip())
    if len(playerAttributes) == 23:
        dataset.append(playerAttributes)
        playerOK.append(search)  # use the function argument rather than the caller's loop variable

### Running the function

Here the function is called for each link in the playerLink list whose player is not yet in the playerOK list.

In [11]:
for players in playerLink:
    if players not in playerOK:
        playerAttributes = []
        searchAttributes(players) 
    progress_bar(len(dataset),len(playerLink))

|----------------------------------------------------------------------------------------------------| 100.00%

### Checking that everything was processed

In [10]:
print(len(playerOK))
print(len(playerLink))

0
11864


In [16]:
print(len(dataset))
print(len(playerLink))


11864
11864


In [12]:
for players in playerLink:
    if players not in playerOK:
        print(players)

### Creating the dataframes

Here the main dataframe is built from the data in the dataset list, and an id is assigned to each player.

In [13]:
df = pd.DataFrame(dataset,columns=['name', 'img-player', 'position', 'team', 'img-team', 'league', 'img-league', 'nation', 'img-nation', 'strong-foot', 'skill-moves', 'weak-foot', 'birth-date', 'height', 'workrates', 'type-card', 'over', 'attr1', 'attr2', 'attr3', 'attr4', 'attr5', 'attr6'])
df = df.sort_values(by=['name'])
df["id"] = range(1, 1+ len(df))
df = df[ ['id'] + [ col for col in df.columns if col != 'id' ] ]

In [15]:
df.head()

Unnamed: 0,id,name,img-player,position,team,img-team,league,img-league,nation,img-nation,...,height,workrates,type-card,over,attr1,attr2,attr3,attr4,attr5,attr6
2788,1,A Lan,https://futhead.cursecdn.com/static/img/22/pla...,ST,Guangzhou FC,https://futhead.cursecdn.com/static/img/22/clu...,Chinese Football Association Super League,https://futhead.cursecdn.com/static/img/22/lea...,China PR,https://futhead.cursecdn.com/static/img/22/nat...,...,178cm,High/Low,Gold,77,84,75,68,81,33,72
11192,2,A.J. DeLaGarza,https://futhead.cursecdn.com/static/img/22/pla...,RB,New England,https://futhead.cursecdn.com/static/img/22/clu...,Major League Soccer,https://futhead.cursecdn.com/static/img/22/lea...,Guam,https://futhead.cursecdn.com/static/img/22/nat...,...,175cm,Medium/High,Silver,65,64,48,61,62,63,68
10803,3,Aapo Halme,https://futhead.cursecdn.com/static/img/22/pla...,CB,Barnsley,https://futhead.cursecdn.com/static/img/22/clu...,EFL Championship,https://futhead.cursecdn.com/static/img/22/lea...,Finland,https://futhead.cursecdn.com/static/img/22/nat...,...,196cm,Medium/Medium,Silver,65,47,21,45,55,63,71
7098,4,Aaron Appindangoye,https://futhead.cursecdn.com/static/img/22/pla...,CB,Sivasspor,https://futhead.cursecdn.com/static/img/22/clu...,Süper Lig,https://futhead.cursecdn.com/static/img/22/lea...,Gabon,https://futhead.cursecdn.com/static/img/22/nat...,...,184cm,Medium/Medium,Silver,69,60,45,55,53,71,76
1299,5,Aaron Chapman,https://futhead.cursecdn.com/static/img/22/pla...,GK,Gillingham,https://futhead.cursecdn.com/static/img/22/clu...,EFL League One,https://futhead.cursecdn.com/static/img/22/lea...,England,https://futhead.cursecdn.com/static/img/22/nat...,...,203cm,Medium/Medium,Bronze,61,63,58,56,62,31,60
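
The preview shows height stored as text such as `178cm` and workrates as a combined value such as `High/Low`. For use in a BI tool, these could be converted into a number and two separate fields. This is a sketch with hypothetical helper names; reading the first workrate as attack and the second as defense is an assumption based on the description of the fields above.

```python
def parse_height_cm(text):
    """Extract the digits from a value like "178cm"; return None if there are none."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None


def split_workrates(text):
    """Split "High/Low" into (attack, defense); the order is assumed, not confirmed."""
    attack, _, defense = text.partition("/")
    return attack.strip(), defense.strip()


print(parse_height_cm("178cm"))
print(split_workrates("High/Low"))
```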


#### Creating a separate dataset for goalkeepers only

In [17]:
df2 = df[df["position"]=="GK"]
df2.head()

Unnamed: 0,id,name,img-player,position,team,img-team,league,img-league,nation,img-nation,...,height,workrates,type-card,over,attr1,attr2,attr3,attr4,attr5,attr6
1299,5,Aaron Chapman,https://futhead.cursecdn.com/static/img/22/pla...,GK,Gillingham,https://futhead.cursecdn.com/static/img/22/clu...,EFL League One,https://futhead.cursecdn.com/static/img/22/lea...,England,https://futhead.cursecdn.com/static/img/22/nat...,...,203cm,Medium/Medium,Bronze,61,63,58,56,62,31,60
187,19,Aaron Ramsdale,https://futhead.cursecdn.com/static/img/22/pla...,GK,Arsenal,https://futhead.cursecdn.com/static/img/22/clu...,Premier League,https://futhead.cursecdn.com/static/img/22/lea...,England,https://futhead.cursecdn.com/static/img/22/nat...,...,188cm,Medium/Medium,Silver,74,75,72,69,76,48,71
438,24,Aarón,https://futhead.cursecdn.com/static/img/22/pla...,GK,Granada CF,https://futhead.cursecdn.com/static/img/22/clu...,LaLiga Santander,https://futhead.cursecdn.com/static/img/22/lea...,Spain,https://futhead.cursecdn.com/static/img/22/nat...,...,185cm,Medium/Medium,Silver,70,67,70,66,72,37,67
1058,48,Abdul Manaf Nurudeen,https://futhead.cursecdn.com/static/img/22/pla...,GK,KAS Eupen,https://futhead.cursecdn.com/static/img/22/clu...,1A Pro League,https://futhead.cursecdn.com/static/img/22/lea...,Ghana,https://futhead.cursecdn.com/static/img/22/nat...,...,190cm,Medium/Medium,Bronze,63,72,57,58,68,22,58
1205,59,Abdullah Al Jadani,https://futhead.cursecdn.com/static/img/22/pla...,GK,Al Hilal,https://futhead.cursecdn.com/static/img/22/clu...,MBS Pro League,https://futhead.cursecdn.com/static/img/22/lea...,Saudi Arabia,https://futhead.cursecdn.com/static/img/22/nat...,...,180cm,Medium/Medium,Bronze,62,65,58,57,64,49,60


#### Creating a dataset for all players except goalkeepers

In [18]:
df3 = df[df["position"]!="GK"]
df3.head()

Unnamed: 0,id,name,img-player,position,team,img-team,league,img-league,nation,img-nation,...,height,workrates,type-card,over,attr1,attr2,attr3,attr4,attr5,attr6
2788,1,A Lan,https://futhead.cursecdn.com/static/img/22/pla...,ST,Guangzhou FC,https://futhead.cursecdn.com/static/img/22/clu...,Chinese Football Association Super League,https://futhead.cursecdn.com/static/img/22/lea...,China PR,https://futhead.cursecdn.com/static/img/22/nat...,...,178cm,High/Low,Gold,77,84,75,68,81,33,72
11192,2,A.J. DeLaGarza,https://futhead.cursecdn.com/static/img/22/pla...,RB,New England,https://futhead.cursecdn.com/static/img/22/clu...,Major League Soccer,https://futhead.cursecdn.com/static/img/22/lea...,Guam,https://futhead.cursecdn.com/static/img/22/nat...,...,175cm,Medium/High,Silver,65,64,48,61,62,63,68
10803,3,Aapo Halme,https://futhead.cursecdn.com/static/img/22/pla...,CB,Barnsley,https://futhead.cursecdn.com/static/img/22/clu...,EFL Championship,https://futhead.cursecdn.com/static/img/22/lea...,Finland,https://futhead.cursecdn.com/static/img/22/nat...,...,196cm,Medium/Medium,Silver,65,47,21,45,55,63,71
7098,4,Aaron Appindangoye,https://futhead.cursecdn.com/static/img/22/pla...,CB,Sivasspor,https://futhead.cursecdn.com/static/img/22/clu...,Süper Lig,https://futhead.cursecdn.com/static/img/22/lea...,Gabon,https://futhead.cursecdn.com/static/img/22/nat...,...,184cm,Medium/Medium,Silver,69,60,45,55,53,71,76
6141,6,Aaron Connolly,https://futhead.cursecdn.com/static/img/22/pla...,ST,Brighton,https://futhead.cursecdn.com/static/img/22/clu...,Premier League,https://futhead.cursecdn.com/static/img/22/lea...,Republic of Ireland,https://futhead.cursecdn.com/static/img/22/nat...,...,175cm,Medium/Low,Silver,70,75,70,50,71,20,64


#### Creating a dataset with only the teams and their respective leagues

In [20]:
df4 = df.drop_duplicates(["team","img-team","league","img-league"])
df4 = df4[["team","img-team","league","img-league"]]
df4 = df4.sort_values(by=['team'])
df4["id"] = range(1, 1+ len(df4))
df4 = df4[ ['id'] + [ col for col in df4.columns if col != 'id' ] ]
df4.head()

Unnamed: 0,id,team,img-team,league,img-league
5151,1,1. FC Köln,https://futhead.cursecdn.com/static/img/22/clu...,Bundesliga,https://futhead.cursecdn.com/static/img/22/lea...
11468,2,1. FC Magdeburg,https://futhead.cursecdn.com/static/img/22/clu...,3. Liga,https://futhead.cursecdn.com/static/img/22/lea...
7995,3,1. FC Nürnberg,https://futhead.cursecdn.com/static/img/22/clu...,Bundesliga 2,https://futhead.cursecdn.com/static/img/22/lea...
3740,4,1. FSV Mainz 05,https://futhead.cursecdn.com/static/img/22/clu...,Bundesliga,https://futhead.cursecdn.com/static/img/22/lea...
9989,5,1860 München,https://futhead.cursecdn.com/static/img/22/clu...,3. Liga,https://futhead.cursecdn.com/static/img/22/lea...
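
The same steps (select the team columns, drop duplicates, sort, assign an id) can be sketched on a tiny made-up dataframe. The team and league names below are invented, and `DataFrame.insert` is used here as an alternative to reordering the columns afterwards; it is not the approach taken in the notebook.

```python
import pandas as pd

# Invented sample data, only to illustrate the steps.
players = pd.DataFrame(
    {
        "name": ["B", "A", "C"],
        "team": ["Y FC", "X FC", "Y FC"],
        "league": ["L2", "L1", "L2"],
    }
)

teams = (
    players[["team", "league"]]
    .drop_duplicates()
    .sort_values(by="team")
    .reset_index(drop=True)
)
# Put the id in the first position directly, numbering the rows from 1.
teams.insert(0, "id", range(1, len(teams) + 1))
print(teams)
```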


### Exporting the data

Here all the dataframes created from the dataset are exported.

In [22]:
df.to_csv('all_players.csv', index=False, encoding="utf-8-sig") 
df2.to_csv('all_gk.csv', index=False, encoding="utf-8-sig")
df3.to_csv('all_players_not_gk.csv', index=False, encoding="utf-8-sig") 
df4.to_csv('all_teams.csv', index=False, encoding="utf-8-sig") 
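
The `utf-8-sig` encoding writes a byte order mark (BOM) at the start of the file, which helps some spreadsheet tools recognize the file as UTF-8 and display accented names correctly. The sketch below uses only the standard library and a temporary file to show the BOM and to read the file back; the file name is made up.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "teams_sample.csv")

# Write a small CSV with a non-ASCII team name, using the same encoding as the export.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "team"])
    writer.writerow([1, "1. FC Köln"])

# The raw bytes start with the UTF-8 BOM.
with open(path, "rb") as f:
    has_bom = f.read().startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the BOM again, so the header stays clean.
with open(path, newline="", encoding="utf-8-sig") as f:
    rows = list(csv.reader(f))

print(has_bom, rows)
```

When reading such a file with plain `utf-8`, the BOM would end up attached to the first column name, so the same encoding should be used on both sides.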

## Acknowledgements

The data was extracted thanks to the publicly available site https://www.futhead.com/.