## Soccerway squad lists scraper

The website soccerway.com sits on top of an almost endless database of football matches from all corners of the globe. As someone who reports on football, and as a fan, I find it one of my go-to sources for information.

This is one of many scripts I have written to collect information from the site, which makes the data far more accessible for analysis.

This script focusses on getting the basic player line-ups from a match: the starters and substitutes for both the home and away teams. It also records other basic data, such as the minutes played, the opponent and whether the player was on the home or away team.

An exhaustive assembler of match data it is not, but as a tool for seeing who played when, it proved incredibly helpful as I investigated the recruitment of Brazilians for the national side of Timor-Leste in the mid-2010s, a story I broke for the New York Times.

In [None]:
A script to scrape player line-ups and additional basic information (such as minutes played and opponent) for matches listed on the Soccerway website. 

In [2]:
from requests import get
from bs4 import BeautifulSoup
import json
import pandas as pd
import re

In [32]:
def getPlayersInGame(url):
    
    dfGAMELINEUP = pd.DataFrame(columns=['Date', 'Team', 'Opponent', 'HA', 'Player', 'Minutes', 'Player Link'])
    
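    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = get(url, headers=headers, timeout=30)   # Avoid hanging forever on a stalled connection
    response.raise_for_status()   # Stop early on an HTTP error rather than parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')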
        
    match = soup.title.text
    teams = match.split(' - ')[0]
    homeTeam, awayTeam = teams.split(' vs. ')
    date = match.split(' - ')[1]
 
    def getDataFromCombinedLineupsContainers(n, N):
                
        ha = 'H' if N == 0 else 'A'
        
        cLC = combinedLineupsContainers[n].find_all('tbody')[N]
            
        for tr in cLC.find_all('tr'):

            for player in tr.find_all('td', {'class':'player'}):
                
                playerMinutes = None
                
                playerLink = player.find('a')['href']        
                playerName = player.text.replace('\n','')
                
                playerTeam = homeTeam if N == 0 else awayTeam
                opponentTeam = homeTeam if N == 1 else awayTeam
                
                ###### Starting lineup #######
                if n == 0:   
                
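                    # Any icon in the player's cell is treated as a sign that the starter was substituted
                    subbed = player.find('img')
                    if subbed is None:
                        playerMinutes = 90   # Played the whole match (assumes 90 minutes)
                    else:
                        playerMinutes = 'TBC'   # Placeholder, overwritten once the matching substitution is parsed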
                                                
                ###### Subs ##################
                
                if n == 1:   
                                        
                    subbed = player.find('img', {'alt': 'Substituted'})
                    
                    if subbed is not None:   # This section calculates the minutes played for the players subbed on and off
                        
                        subIN = player.find('p', {'class': 'substitute-in'}).text.strip()
                        
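                        # Text is expected to look like "for <player> <minute>'"; split it into name and minute
                        subOUTALL = player.find('p', {'class': 'substitute-out'}).text.replace('for ','').strip()
                        subOUTALL = re.split(r"([0-9]{1,3}')", subOUTALL)
                        subOUT = subOUTALL[0].strip()

                        subTIME = int(subOUTALL[1].replace("'",""))
                        subINMINS = 90 - subTIME   # Assumes a 90-minute match; stoppage and extra time are ignored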
                        
                        playerName = subIN
                        playerMinutes = subINMINS
                        
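                        # Update the replaced starter's minutes, matching on team as well as name
                        # so that a namesake on the other side is not overwritten
                        dfGAMELINEUP.loc[(dfGAMELINEUP['Player'] == subOUT) & (dfGAMELINEUP['Team'] == playerTeam), 'Minutes'] = subTIME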
                        
                    else:   # Players left on the bench get zero minutes
                        
                        playerMinutes = 0
                        
                dfGAMELINEUP.loc[dfGAMELINEUP.shape[0]] = [date, playerTeam, opponentTeam, ha, playerName, playerMinutes, playerLink]

    # This section runs the above function four times: for the home side's starting line-up, for their subs, and for the away team's starters and subs
    combinedLineupsContainers = soup.find_all('div', {'class':'combined-lineups-container'})
    for n in range(0, len(combinedLineupsContainers)):
        haCombinedLineupsContainers = combinedLineupsContainers[n].find_all('tbody')
        for N in range(0, len(haCombinedLineupsContainers)):
            getDataFromCombinedLineupsContainers(n, N)
            
    dfGAMELINEUP = dfGAMELINEUP.sort_values(by=['HA', 'Minutes', 'Player'], ascending=[False, False, True])

    return dfGAMELINEUP
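
The two string-parsing steps inside `getPlayersInGame` (splitting the page title, and pulling the minute out of the "substituted for" text) can be tried out without hitting the website. The sketch below is only an illustration: `parse_match_title` and `parse_sub_out` are hypothetical helper names that are not part of the script, and the input formats (`'Home vs. Away - 8 October 2015 - ...'` and `"for Player Name 63'"`) are assumptions inferred from how the function above slices those strings.

```python
import re
from datetime import datetime


def parse_match_title(title):
    """Split an assumed title 'Home vs. Away - 8 October 2015 - ...' into its parts."""
    parts = title.split(' - ')
    home_team, away_team = parts[0].split(' vs. ')
    # '%B' expects an English month name, so this relies on an English/C locale
    match_date = datetime.strptime(parts[1].strip(), '%d %B %Y').date()
    return home_team.strip(), away_team.strip(), match_date


def parse_sub_out(text, match_length=90):
    """Parse assumed text "for Player Name 63'" into (player_off, minute, minutes_for_sub).

    Returns (name, None, None) when no minute can be found.
    """
    cleaned = re.sub(r'^\s*for\s+', '', text).strip()
    found = re.search(r"^(.*?)\s*([0-9]{1,3})'", cleaned)
    if found is None:
        return cleaned, None, None
    player_off = found.group(1).strip()
    minute = int(found.group(2))
    # Clamp at zero so a change made after the nominal match length never goes negative
    return player_off, minute, max(match_length - minute, 0)


# Made-up inputs in the assumed formats
print(parse_match_title('Timor-Leste vs. Palestine - 8 October 2015 - Soccerway'))
print(parse_sub_out("for Player A 63'"))
```

Keeping the parsing in small pure functions like these makes it easier to check edge cases, such as a missing minute, before running the scraper over many matches.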

In [33]:
dfALL = pd.DataFrame()

urls = ['https://int.soccerway.com/matches/2015/10/08/asia/wc-qualifying-asia/timor-leste/palestine/2028605/',
       ]

for url in urls:
    
    dfX = getPlayersInGame(url)
    
    dfALL = pd.concat([dfALL, dfX], ignore_index=True)

dfALL

| | Date | Team | Opponent | HA | Player | Minutes | Player Link |
|---|---|---|---|---|---|---|---|
| 0 | 8 October 2015 | Timor-Leste | Palestine | H | Ade | 90 | /players/adelino-trindade-coelho-manek-de-oliv... |
| 1 | 8 October 2015 | Timor-Leste | Palestine | H | Diogo Rangel | 90 | /players/diogo-santos-rangel/271101/ |
| 2 | 8 October 2015 | Timor-Leste | Palestine | H | Filipe | 90 | /players/filipe-oliveira/373267/ |
| 3 | 8 October 2015 | Timor-Leste | Palestine | H | Helber | 90 | /players/paulo-helber-rosa-ribeiro/271110/ |
| 4 | 8 October 2015 | Timor-Leste | Palestine | H | Jose Fonseca | 90 | /players/jose-carlos-da-fonseca/399495/ |
| 5 | 8 October 2015 | Timor-Leste | Palestine | H | Juninho | 90 | /players/junior-aparecido-guimaro-de-souza/178... |
| 6 | 8 October 2015 | Timor-Leste | Palestine | H | Patrick Alves | 90 | /players/patrick-fabiano-alves-nobrega/71367/ |
| 7 | 8 October 2015 | Timor-Leste | Palestine | H | Ramon | 90 | /players/ramon-de-lima-saro/271108/ |
| 8 | 8 October 2015 | Timor-Leste | Palestine | H | Ramos | 90 | /players/ramos-saozinho-ribeiro-maxanches/271097/ |
| 9 | 8 October 2015 | Timor-Leste | Palestine | H | Rodriguinho | 90 | /players/rodrigo-souza-silva/110171/ |
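
Once several matches have been collected in `dfALL`, seeing who played when mostly comes down to grouping the rows by player. The following is a minimal sketch rather than part of the original script: `total_minutes` is a hypothetical helper, and the rows in `dfALL_example` are made up for illustration, not scraped data. It also assumes that any `'TBC'` placeholder left in the `Minutes` column should count as unknown rather than zero.

```python
import pandas as pd

# Illustrative rows only: these players and minutes are invented, not scraped
dfALL_example = pd.DataFrame({
    'Team': ['Team X', 'Team X', 'Team X', 'Team X'],
    'Player': ['Player A', 'Player B', 'Player A', 'Player C'],
    'Minutes': [90, 63, 27, 'TBC'],
})


def total_minutes(df):
    """Sum minutes per player, leaving unresolved values such as 'TBC' as NaN."""
    out = df.copy()
    # Non-numeric placeholders become NaN instead of raising an error
    out['Minutes'] = pd.to_numeric(out['Minutes'], errors='coerce')
    # min_count=1 keeps a player whose minutes are all unknown as NaN rather than 0
    totals = out.groupby(['Team', 'Player'])['Minutes'].sum(min_count=1)
    return totals.sort_values(ascending=False).reset_index()


print(total_minutes(dfALL_example))
```

Converting `Minutes` to numbers before sorting or summing also sidesteps the error pandas raises when a column mixes integers with strings.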
