### Web Scraping the Liverpool vs Manchester City Match on 3rd October 2021

The league’s two best teams of recent years, led by two of the world’s best managers, met on 3rd October 2021 in a pulsating encounter. In a breathtaking Premier League classic, we were reminded of how far Jürgen Klopp and Pep Guardiola have raised not only their own teams but also the quality of the league in football’s current, highly tactical environment.

This has been the blockbuster fixture to circle on the calendar in recent years, and it never disappointed for a second: watching another chapter between two heavyweights in their prime was fascinating. Last year, perhaps because football was missing that precious big-game atmosphere, many of the biggest games were damp squibs, but this clash was enthralling.

I chose to scrape this match because it was one of the best games of 2021. The data comes from <a href="https://understat.com/match/16441">understat.com</a>. Let's get started.

In [1]:
# Import necessary libraries

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

In [2]:
# URL of the single-match shot data on understat.com

url = "https://understat.com/match/16441"

In [7]:
# Use requests to get the webpage and BeautifulSoup to parse the page

res = requests.get(url, timeout=30)
res.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(res.content, 'lxml')
scripts = soup.find_all('script')

In [17]:
# Get the shot data, which sits in the second <script> tag of the page

strings = scripts[1].string
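
Picking the script by position (`scripts[1]`) works for this page but would break if the site reordered its `<script>` tags. Below is a minimal sketch of a lookup by content instead. It assumes the shots are assigned to a JavaScript variable called `shotsData` through `JSON.parse('...')`, which is an assumption about the page layout rather than something shown above. `find_shots_payload` is a hypothetical helper name, and it swaps in a regular expression for the BeautifulSoup lookup.

```python
import re

# Assumption: the page embeds the shots as  var shotsData = JSON.parse('...');
SHOTS_PATTERN = re.compile(r"shotsData\s*=\s*JSON\.parse\('(.*?)'\)", re.DOTALL)


def find_shots_payload(html):
    """Return the still-escaped string passed to JSON.parse, or raise ValueError."""
    match = SHOTS_PATTERN.search(html)
    if match is None:
        raise ValueError("no shotsData assignment found in the page")
    return match.group(1)


# Tiny synthetic page, made up for illustration (not real understat markup).
sample_html = (
    "<script>var match_info = JSON.parse('\\x7B\\x7D');</script>"
    "<script>var shotsData = JSON.parse('\\x7B\\x22h\\x22\\x3A\\x5B\\x5D\\x7D');</script>"
)
payload = find_shots_payload(sample_html)
print(payload)
```

On the real page, `find_shots_payload(res.text)` would stand in for both the positional lookup and the manual `index` slicing, provided the variable-name assumption holds.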

In [20]:
# Strip the surrounding JavaScript and keep only the JSON data
index_start = strings.index("('") + 2
index_end = strings.index("')")
json_data = strings[index_start:index_end]
json_data = json_data.encode('utf8').decode('unicode_escape')

# Parse the JSON string into a Python dictionary
data = json.loads(json_data)
data

{'h': [{'id': '435650',
   'minute': '7',
   'result': 'BlockedShot',
   'X': '0.8590000152587891',
   'Y': '0.7319999694824219',
   'xG': '0.028604544699192047',
   'player': 'Mohamed Salah',
   'h_a': 'h',
   'player_id': '1250',
   'situation': 'FromCorner',
   'season': '2021',
   'shotType': 'LeftFoot',
   'match_id': '16441',
   'h_team': 'Liverpool',
   'a_team': 'Manchester City',
   'h_goals': '2',
   'a_goals': '2',
   'date': '2021-10-03 15:30:00',
   'player_assisted': 'James Milner',
   'lastAction': 'Cross'},
  {'id': '435658',
   'minute': '49',
   'result': 'SavedShot',
   'X': '0.8380000305175781',
   'Y': '0.5620000076293945',
   'xG': '0.07208539545536041',
   'player': 'Diogo Jota',
   'h_a': 'h',
   'player_id': '6854',
   'situation': 'OpenPlay',
   'season': '2021',
   'shotType': 'LeftFoot',
   'match_id': '16441',
   'h_team': 'Liverpool',
   'a_team': 'Manchester City',
   'h_goals': '2',
   'a_goals': '2',
   'date': '2021-10-03 15:30:00',
   'player_assisted
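
To see what the decoding step does, here is a small worked example on a made-up payload in the same escaped style. The structural characters arrive as `\xNN` hex escapes, so `unicode_escape` turns them back into ordinary JSON text before `json.loads` parses it. The sample string and the `decode_payload` name are illustrative assumptions, not data copied from the match page.

```python
import json


def decode_payload(escaped):
    """Turn a hex-escaped JSON string into Python objects."""
    # Same two steps as the cell above: undo the escapes, then parse the JSON.
    text = escaped.encode('utf8').decode('unicode_escape')
    return json.loads(text)


# Hypothetical payload that decodes to:
#   {"h":[{"player":"Man\u00e9","minute":"58"}],"a":[]}
escaped = (
    r'\x7B\x22h\x22\x3A\x5B\x7B\x22player\x22\x3A\x22Man\x5Cu00e9\x22\x2C'
    r'\x22minute\x22\x3A\x2258\x22\x7D\x5D\x2C\x22a\x22\x3A\x5B\x5D\x7D'
)
decoded = decode_payload(escaped)
print(decoded['h'][0]['player'])
```

Note that every value, including the minute, comes back as a string, which is why the numeric columns need converting later on.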

In [72]:
Player = []
Minutes = []
Team = []
X = []
Y = []
ShotType = []
xG = []
Result = []
PlayerAssisted = []

data_home = data['h']
data_away = data['a']

In [73]:
# Collect the fields of interest from every shot: home side first, then away
for shots, team_key in ((data_home, 'h_team'), (data_away, 'a_team')):
    for shot in shots:
        Player.append(shot['player'])
        Minutes.append(shot['minute'])
        Team.append(shot[team_key])
        X.append(shot['X'])
        Y.append(shot['Y'])
        ShotType.append(shot['shotType'])
        xG.append(shot['xG'])
        Result.append(shot['result'])
        PlayerAssisted.append(shot['player_assisted'])

In [74]:
# Create the DataFrame

column = ['Player', 'Minutes', 'Team', 'X', 'Y', 'Shot Type', 'xG', 'Result', 'Player Assisted']
df = pd.DataFrame([Player, Minutes, Team, X, Y, ShotType, xG, Result, PlayerAssisted], index = column).transpose()
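
The nine parallel lists can also be replaced by a single list of row dictionaries, which keeps each shot's fields together. This is a minimal sketch, assuming `data` has the `'h'`/`'a'` layout shown in the output earlier. `shots_to_rows` is a hypothetical helper name, and the two sample shots carry only a subset of the real fields.

```python
def shots_to_rows(match_data):
    """Flatten {'h': [...], 'a': [...]} into one list of row dictionaries."""
    rows = []
    # Home shots take their team name from 'h_team', away shots from 'a_team'.
    for side, team_key in (('h', 'h_team'), ('a', 'a_team')):
        for shot in match_data.get(side, []):
            rows.append({
                'Player': shot.get('player'),
                'Minutes': shot.get('minute'),
                'Team': shot.get(team_key),
                'X': shot.get('X'),
                'Y': shot.get('Y'),
                'Shot Type': shot.get('shotType'),
                'xG': shot.get('xG'),
                'Result': shot.get('result'),
                'Player Assisted': shot.get('player_assisted'),
            })
    return rows


# Two shots with a subset of the fields shown earlier (abbreviated for illustration).
sample = {
    'h': [{'player': 'Mohamed Salah', 'minute': '7', 'h_team': 'Liverpool',
           'a_team': 'Manchester City', 'result': 'BlockedShot'}],
    'a': [{'player': 'Jack Grealish', 'minute': '14', 'h_team': 'Liverpool',
           'a_team': 'Manchester City', 'result': 'BlockedShot'}],
}
rows = shots_to_rows(sample)
print(rows[1]['Team'])
```

Passing the result to `pd.DataFrame(shots_to_rows(data))` should give the same columns without the `transpose()` step. One difference in this sketch is that a missing key becomes `None` instead of being skipped.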

In [75]:
df

Unnamed: 0,Player,Minutes,Team,X,Y,Shot Type,xG,Result,Player Assisted
0,Mohamed Salah,7,Liverpool,0.8590000152587891,0.7319999694824219,LeftFoot,0.028604544699192,BlockedShot,James Milner
1,Diogo Jota,49,Liverpool,0.8380000305175781,0.5620000076293945,LeftFoot,0.0720853954553604,SavedShot,Joel Matip
2,Sadio Mané,58,Liverpool,0.8830000305175781,0.375,RightFoot,0.3344335854053497,Goal,Mohamed Salah
3,Mohamed Salah,62,Liverpool,0.7519999694824219,0.6379999923706055,LeftFoot,0.0537067279219627,BlockedShot,
4,Mohamed Salah,75,Liverpool,0.9669999694824218,0.36,RightFoot,0.1609952896833419,Goal,Curtis Jones
5,Fabinho,86,Liverpool,0.958000030517578,0.5979999923706054,RightFoot,0.2995425164699554,BlockedShot,Mohamed Salah
6,Jack Grealish,14,Manchester City,0.86,0.7230000305175781,RightFoot,0.0332346484065055,BlockedShot,João Cancelo
7,Phil Foden,20,Manchester City,0.950999984741211,0.6120000076293945,LeftFoot,0.3645058870315552,SavedShot,Bernardo Silva
8,Kevin De Bruyne,23,Manchester City,0.921999969482422,0.6769999694824219,LeftFoot,0.1202637702226638,MissedShots,João Cancelo
9,Kevin De Bruyne,33,Manchester City,0.916999969482422,0.4270000076293945,Head,0.0825801640748977,MissedShots,Phil Foden


In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Player           18 non-null     object
 1   Minutes          18 non-null     object
 2   Team             18 non-null     object
 3   X                18 non-null     object
 4   Y                18 non-null     object
 5   Shot Type        18 non-null     object
 6   xG               18 non-null     object
 7   Result           18 non-null     object
 8   Player Assisted  14 non-null     object
dtypes: object(9)
memory usage: 1.4+ KB


In [77]:
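# Every column was scraped as text, so convert the numeric ones
# (coordinates and xG rounded to 3 decimals, minutes as integers)
df['X'] = df['X'].astype(float).round(decimals = 3)
df['Y'] = df['Y'].astype(float).round(decimals = 3)
df['xG'] = df['xG'].astype(float).round(decimals = 3)
df['Minutes'] = df['Minutes'].astype(int)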
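
`astype(float)` raises on the first value that is not a valid number. If a scraped page could contain blanks or malformed entries, one hedged alternative is `pd.to_numeric` with `errors='coerce'`, which turns unparseable values into `NaN`. The small frame below is made up for illustration, and its second `X` value is deliberately broken.

```python
import pandas as pd

sample = pd.DataFrame({
    'X': ['0.8590000152587891', 'n/a'],
    'xG': ['0.028604544699192047', '0.0720853954553604'],
    'Minutes': ['7', '49'],
})

# Coerce text to numbers; anything unparseable becomes NaN instead of raising.
for col in ['X', 'xG']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce').round(3)

# A nullable integer dtype keeps whole minutes even if some were missing.
sample['Minutes'] = pd.to_numeric(sample['Minutes'], errors='coerce').astype('Int64')

print(sample.dtypes)
```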

In [78]:
df

Unnamed: 0,Player,Minutes,Team,X,Y,Shot Type,xG,Result,Player Assisted
0,Mohamed Salah,7,Liverpool,0.859,0.732,LeftFoot,0.029,BlockedShot,James Milner
1,Diogo Jota,49,Liverpool,0.838,0.562,LeftFoot,0.072,SavedShot,Joel Matip
2,Sadio Mané,58,Liverpool,0.883,0.375,RightFoot,0.334,Goal,Mohamed Salah
3,Mohamed Salah,62,Liverpool,0.752,0.638,LeftFoot,0.054,BlockedShot,
4,Mohamed Salah,75,Liverpool,0.967,0.36,RightFoot,0.161,Goal,Curtis Jones
5,Fabinho,86,Liverpool,0.958,0.598,RightFoot,0.3,BlockedShot,Mohamed Salah
6,Jack Grealish,14,Manchester City,0.86,0.723,RightFoot,0.033,BlockedShot,João Cancelo
7,Phil Foden,20,Manchester City,0.951,0.612,LeftFoot,0.365,SavedShot,Bernardo Silva
8,Kevin De Bruyne,23,Manchester City,0.922,0.677,LeftFoot,0.12,MissedShots,João Cancelo
9,Kevin De Bruyne,33,Manchester City,0.917,0.427,Head,0.083,MissedShots,Phil Foden


In [80]:
# Sort the shots chronologically and renumber the rows
df = df.sort_values(by = 'Minutes').reset_index(drop = True)

In [82]:
df

Unnamed: 0,Player,Minutes,Team,X,Y,Shot Type,xG,Result,Player Assisted
0,Mohamed Salah,7,Liverpool,0.859,0.732,LeftFoot,0.029,BlockedShot,James Milner
1,Jack Grealish,14,Manchester City,0.86,0.723,RightFoot,0.033,BlockedShot,João Cancelo
2,Phil Foden,20,Manchester City,0.951,0.612,LeftFoot,0.365,SavedShot,Bernardo Silva
3,Kevin De Bruyne,23,Manchester City,0.922,0.677,LeftFoot,0.12,MissedShots,João Cancelo
4,Kevin De Bruyne,33,Manchester City,0.917,0.427,Head,0.083,MissedShots,Phil Foden
5,Kevin De Bruyne,37,Manchester City,0.859,0.375,RightFoot,0.063,BlockedShot,Gabriel Jesus
6,Kevin De Bruyne,39,Manchester City,0.86,0.338,LeftFoot,0.06,BlockedShot,Gabriel Jesus
7,Rodri,39,Manchester City,0.749,0.382,RightFoot,0.025,MissedShots,
8,Diogo Jota,49,Liverpool,0.838,0.562,LeftFoot,0.072,SavedShot,Joel Matip
9,Sadio Mané,58,Liverpool,0.883,0.375,RightFoot,0.334,Goal,Mohamed Salah


In [89]:
# Save to a CSV file

df.to_csv("Liverpool Shot vs Man City Shot.csv")
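
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. Here is a short sketch of the difference, using in-memory buffers and a two-row frame taken from the table above instead of the real file:

```python
import io

import pandas as pd

shots = pd.DataFrame({
    'Player': ['Mohamed Salah', 'Jack Grealish'],
    'Minutes': [7, 14],
})

# Default: the index is written as a nameless first column.
with_index = io.StringIO()
shots.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
shots.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

Whether to keep the index is a matter of taste here; after the sort and reset it is just a running row number.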