**1. Premier League Defender Formation Data**

This notebook parses defence formation data for every match of the 2017-18 Premier League season: match date, match name, formation for the home and away teams, and defender names (ordered from RB to LB) for both teams. This information is not available in the Wyscout data. The scraped data is preprocessed, the relevant features are extracted, and the result is stored in a CSV file (PL1718Defense.csv).

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from html import unescape
import re
from unidecode import unidecode
from datetime import datetime

**Opening Premier League 2017-18 Results HTML File**

In [4]:
# Use a context manager so the file handle is closed after reading
with open("../Data/matches/PL1718Results.html", 'r', encoding='utf-8') as pl_html:
    pl = pl_html.read()

**Converting escaped HTML characters to unescaped characters**

In [5]:
unescaped = unescape(pl)
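
As a small illustration of what `unescape` does, the sketch below applies it to a made-up fragment. The string is invented for this example and is not taken from the results file.

```python
from html import unescape

# Invented sample: an escaped list item of the kind a saved page might contain
escaped = "&lt;li data-home=&quot;Arsenal&quot;&gt;Brighton &amp; Hove Albion&lt;/li&gt;"

# Entities such as &lt;, &quot; and &amp; are turned back into literal characters
print(unescape(escaped))  # <li data-home="Arsenal">Brighton & Hove Albion</li>
```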

**Finding all the fixture containers in the HTML File**

In [6]:
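# Name the parser explicitly so the result does not depend on which parsers are installed
pl_page = BeautifulSoup(unescaped, 'html.parser')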
matches = pl_page.find_all('li',class_='matchFixtureContainer')


**Obtaining match information as a list of dictionaries, one per match, with the keys:**

**match_id** - Unique ID of the match

**home_team** - Home Team

**away_team** - Away Team

**match_link** - Match Center link for the particular match

In [7]:
pl_1718_match_info = []
base_link = 'https://www.premierleague.com/match/'
for match in matches:
    match_info = dict()
    match_id = match['data-comp-match-item']
    home_team = match['data-home']
    away_team = match['data-away']
    match_info['match_id'] = match_id
    match_info['home_team'] = home_team
    match_info['away_team'] = away_team
    match_info['match_link'] = base_link + str(match_id)
    pl_1718_match_info.append(match_info)
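
The same attribute extraction can be tried without any third-party dependency. The sketch below swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`. It is not what this notebook runs. The `FixtureParser` class name and the sample markup (IDs, team names) are hypothetical and only mirror the attribute names used above.

```python
from html.parser import HTMLParser

BASE_LINK = "https://www.premierleague.com/match/"


class FixtureParser(HTMLParser):
    """Collects match info from <li class="matchFixtureContainer"> start tags."""

    def __init__(self):
        super().__init__()
        self.fixtures = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "li" and "matchFixtureContainer" in classes:
            match_id = attr.get("data-comp-match-item")
            self.fixtures.append({
                "match_id": match_id,
                "home_team": attr.get("data-home"),
                "away_team": attr.get("data-away"),
                "match_link": BASE_LINK + str(match_id),
            })


# Invented sample markup; the ID and team names are placeholders
sample = (
    '<ul>'
    '<li class="matchFixtureContainer" data-comp-match-item="1001" '
    'data-home="Team A" data-away="Team B"></li>'
    '<li class="other"></li>'
    '</ul>'
)

parser = FixtureParser()
parser.feed(sample)
print(parser.fixtures)
```

Only the first `<li>` is collected, because the second one does not carry the `matchFixtureContainer` class.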

In [8]:
base_link = 'https://www.premierleague.com/match/'
match_ids = [x['match_id'] for x in pl_1718_match_info]
home_teams = [x['home_team'] for x in pl_1718_match_info]
away_teams = [x['away_team'] for x in pl_1718_match_info]

**Retrieving defence lineups for Home and Away Team**

In [9]:
pl1718_defence = list()
pl1718_lineups = list()
for match_info in pl_1718_match_info:
    matches1718 = dict()
    squads1718 = dict()
    matchcenter = requests.get(match_info['match_link'])
    match_data = BeautifulSoup(matchcenter.content, 'html.parser')
    
#     Collecting jersey numbers for the defence lineup for the home team

    home_team_data = match_data.find('div',class_='team home pitchPositonsContainer')
    rows = home_team_data.find_all('div',class_='row')
    defence_row = rows[1]
    home_players = defence_row.find_all('div')
    home_jersey_nos = list()
    for home_player in home_players:
        home_jersey_nos.append(home_player.text)
        
#     Collecting jersey numbers for the defence lineup for the away team

    away_team_data = match_data.find('div',class_='team away pitchPositonsContainer')
    rows = away_team_data.find_all('div',class_='row')
    defence_row = rows[1]
    away_players = defence_row.find_all('div')
    away_jersey_nos = list()
    for away_player in away_players:
        away_jersey_nos.append(away_player.text)

#     Obtaining the respective player name from the jersey number for the home team

    matches1718['match_id'] = match_info['match_id']
    matches1718['home_team'] = match_info['home_team']
    matches1718['away_team'] = match_info['away_team']
    home_lineups_container = match_data.find('div',class_='teamList mcLineUpContainter homeLineup').find('div',class_= 'matchLineupTeamContainer')
    home_lineups_players = home_lineups_container.find_all('li',class_='player')
    home_player_jersey = list()
    for home_lineups_player in home_lineups_players:
        jersey_number = home_lineups_player.find('div',class_='number').text
        player_name = home_lineups_player.find('div',class_='name').text
        player_name = player_name.strip()
        player_name = unidecode(player_name)
        player_name = re.sub('[^a-zA-Z]+','',player_name)
        home_player_jersey.append((jersey_number,player_name))
    home_player_names = list()
    for x in home_jersey_nos:
        for i,j in home_player_jersey:
            if x==str(i):
                home_player_names.append(j)
                
#     Obtaining the respective player name from the jersey number for the away team

    away_lineups_container = match_data.find('div',class_='teamList mcLineUpContainter awayLineup').find('div',class_= 'matchLineupTeamContainer')
    away_lineups_players = away_lineups_container.find_all('li',class_='player')
    away_player_jersey = list()
    for away_lineups_player in away_lineups_players:
        jersey_number = away_lineups_player.find('div',class_='number').text
        player_name = away_lineups_player.find('div',class_='name').text
        player_name = player_name.strip()
        player_name = unidecode(player_name)
        player_name = re.sub('[^a-zA-Z]+','',player_name)
        away_player_jersey.append((jersey_number,player_name))
    away_player_names = list()
    for x in away_jersey_nos:
        for i,j in away_player_jersey:
            if x==str(i):
                away_player_names.append(j)

#      Fetching the match dates

    match_date = match_data.find('div',class_='current').find('time').text
    match_date = match_date.split('-')[0].strip()
    if match_date.split(' ')[1] in ['Jan','Feb','Mar','Apr','May']:
        match_date = match_date + ' 2018'
    else:
        match_date = match_date + ' 2017'
    match_date = datetime.strptime(match_date,'%d %b %Y')
    match_date = match_date.date()
    matches1718['match_date'] = match_date.isoformat()
    matches1718['home_team_defense'] = home_player_names
    matches1718['away_team_defense'] = away_player_names
    pl1718_defence.append(matches1718)
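
The name clean-up and the jersey-number lookup in the loop above can be isolated into two small helpers. This is a sketch under stated assumptions, not the notebook's own code: `normalise_name` and `names_for_jerseys` are hypothetical names, and the standard library's `unicodedata` stands in for `unidecode`. The two differ in that letters without an accent decomposition (for example `ø` or `ß`) are dropped here rather than transliterated. A dictionary also replaces the nested loops, so a duplicated jersey number keeps only its last name.

```python
import re
import unicodedata


def normalise_name(raw_name):
    """Strip surrounding whitespace, accents, and every non-letter character."""
    decomposed = unicodedata.normalize("NFKD", raw_name.strip())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub("[^a-zA-Z]+", "", ascii_only)


def names_for_jerseys(jersey_nos, lineup):
    """Return player names in the order of jersey_nos, skipping unknown numbers."""
    name_by_number = {str(number): normalise_name(name) for number, name in lineup}
    return [name_by_number[n] for n in jersey_nos if n in name_by_number]


# Invented lineup: (jersey number, displayed name) pairs with placeholder names
lineup = [("1", "Placeholder Keeper"), ("2", " Héctor Example "), ("5", "O'Neil-Smith")]

# Jersey numbers as read from the defence row, in pitch order
print(names_for_jerseys(["5", "2"], lineup))  # ['ONeilSmith', 'HectorExample']
```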

In [10]:
pl1718defense_df = pd.DataFrame(pl1718_defence)
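
The `match_date` values in this frame rely on the year inference done inside the scraping loop: months from January to May belong to 2018, all others to 2017. A minimal sketch of that rule as a standalone function follows. The function name `season_date` and its `start_year` parameter are hypothetical, and the input is assumed to be a `DD Mon` string with English month abbreviations.

```python
from datetime import datetime

# Months that fall in the second calendar year of the season
SPRING_MONTHS = {"Jan", "Feb", "Mar", "Apr", "May"}


def season_date(day_month, start_year=2017):
    """Turn a 'DD Mon' string into a date within the season starting in start_year."""
    month = day_month.split(" ")[1]
    year = start_year + 1 if month in SPRING_MONTHS else start_year
    return datetime.strptime(f"{day_month} {year}", "%d %b %Y").date()


print(season_date("12 Aug").isoformat())  # 2017-08-12
print(season_date("13 May").isoformat())  # 2018-05-13
```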

**Saving the data in a csv file**

In [13]:
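# Same directory casing as the input file, so the path also resolves on case-sensitive file systems
pl1718defense_df.to_csv('../Data/matches/PL1718Defense.csv')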
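
One thing to keep in mind when loading this file later: the `home_team_defense` and `away_team_defense` columns hold Python lists, and a CSV cell can only hold text. The sketch below assumes each list is written as its string representation and shows how `ast.literal_eval` can turn it back into a list. It uses the standard library's `csv` module rather than pandas to stay dependency-free, and the row contents are invented placeholders.

```python
import ast
import csv
import io

# Invented row; the ID and player names are placeholders
row = {"match_id": "1001",
       "home_team_defense": ["PlayerA", "PlayerB", "PlayerC", "PlayerD"]}

# Write the row to an in-memory CSV; the list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the cell is now a plain string and must be parsed into a list again
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
defenders = ast.literal_eval(loaded["home_team_defense"])
print(type(loaded["home_team_defense"]).__name__, defenders)
```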