# Scraping Premier League Table (Training Data)

## Overview

This notebook scrapes the Premier League table from Sofascore to gather training data for our betting prediction model. The extracted data is used to train the model. The primary objectives of this notebook are:

1. **Data Extraction**: Retrieve relevant information from the Premier League table, including team standings, recent match results, and performance statistics.
2. **Data Processing**: Cleanse and format the scraped data to ensure consistency and accuracy for model training.
3. **Database Integration**: Store the processed data in the `previous_week_league_standings` table, which is later merged with the `completed_matches` table and appended to the `training_data` table in our database for future model training sessions.

## Requirements

- Python 3.x
- Libraries: `selenium`, `pandas`, `sqlalchemy`, `pymysql` (the driver named in the MySQL connection string)
- Access to the database containing the `training_data` table

## Notes

- Prior to execution, ensure that database credentials are correctly configured.
- Execute this notebook periodically to keep `completed_matches` up to date with the previous Premier League standings. To ensure this, run it before you run `Forebet Completed Matches`.

## Updates July 2024
### Project Migration and Update Process
- Migration: Migrated project from Jupyter Notebook 6 to Notebook 7.
- Python Version: Updated from Python 3.9 to Python 3.12.
- Selenium Version and Usage: With Selenium 4.22 (the version installed below), `webdriver.Chrome()` can be used without specifying a service, options, or the driver PATH, which removes unnecessary setup code.

In [3]:
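# Install the necessary libraries.
# pymysql is also required later by the MySQL connection string; install it too if it is missing.
!pip install selenium pandas sqlalchemy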

Collecting selenium
  Downloading selenium-4.22.0-py3-none-any.whl.metadata (7.0 kB)
Collecting pandas
  Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting sqlalchemy
  Downloading SQLAlchemy-2.0.31-cp312-cp312-win_amd64.whl.metadata (9.9 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.25.1-py3-none-any.whl.metadata (8.7 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.0.0-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)

In [6]:
# Make sure you're using the latest version of Selenium to avoid running into errors.
# Make sure your ChromeDriver version is compatible with your Chrome version.
# If necessary, use https://googlechromelabs.github.io/chrome-for-testing/ to update to the latest Chrome and ChromeDriver.
!pip install --upgrade selenium

Note: you may need to restart the kernel to use updated packages.


In [1]:
from selenium import webdriver
import pandas as pd
from sqlalchemy import create_engine
# from modules import team_labels_sofascore  # not used here: team_labels is defined inline in the next cell

In [2]:

team_labels = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brighton': 4,
        'Burnley': 5,
        'Chelsea': 6,
        'Crystal Palace': 7,
        'Everton': 8,
        'Fulham': 9,
        'Ipswich': 10,
        'Leeds Utd': 11,
        'Leicester City': 12,
        'Leicester': 12,  # Sofascore's short name in the scraped table; same label
        'Liverpool': 13,
        'Man City': 14,
        'Man Utd': 15,
        'Newcastle': 16,
        'Norwich': 17,
        'Sheffield': 18,
        'Southampton': 19,
        'Tottenham': 20,
        'West Ham': 21,
        'Luton': 22,
        'Wolves': 23,
        'Brentford': 24,
        'Sheffield Utd': 25,
        'Forest': 26
    }

In [3]:
# Initialize the WebDriver with a fixed, desktop-sized window
driver = webdriver.Chrome()
driver.set_window_size(1920, 1080)
# Keep the Chrome window at this size (or maximized) while the next cells run.
# The site is responsive, so a smaller window can change the layout and the class names used,
# which would lead to different scraping results.

In [4]:
# Navigate to the specified URL
driver.get('https://www.sofascore.com/tournament/football/england/premier-league/17#id:61627')

In [10]:
# Find the league table container.
# "eHXJll" looks like a generated class name, so it may change when the site is updated.
league_table_results = driver.find_elements("class name", "eHXJll")
print(league_table_results)


[<selenium.webdriver.remote.webelement.WebElement (session="f0c76b5353014473eba72fb6be828d5b", element="f.A3A24DB1C248231B63EB93AA2074A388.d.F0F1F953CD5D2CEBBBD42DAC9592F5EA.e.37")>]


In [11]:
league_table_results_container = []

for results in league_table_results:
    # Extract the visible text from the league table container
    results_text = results.text
    league_table_results_container.append(results_text)
    print(results_text)

ALLHOMEAWAY
#
Team
P
W
D
L
Goals
Last 5
PTS
1
Brighton
1
1
0
0
3:0
W
3
2
Arsenal
1
1
0
0
2:0
W
3
3
Liverpool
1
1
0
0
2:0
W
3
4
Man City
1
1
0
0
2:0
W
3
5
Aston Villa
1
1
0
0
2:1
W
3
6
Brentford
1
1
0
0
2:1
W
3
7
Man Utd
1
1
0
0
1:0
W
3
8
Newcastle
1
1
0
0
1:0
W
3
9
Bournemouth
1
0
1
0
1:1
D
1
10
Forest
1
0
1
0
1:1
D
1
11
Leicester
1
0
1
0
1:1
D
1
12
Tottenham
1
0
1
0
1:1
D
1
13
Crystal Palace
1
0
0
1
1:2
L
0
14
West Ham
1
0
0
1
1:2
L
0
15
Fulham
1
0
0
1
0:1
L
0
16
Southampton
1
0
0
1
0:1
L
0
17
Chelsea
1
0
0
1
0:2
L
0
18
Wolves
1
0
0
1
0:2
L
0
19
Ipswich
1
0
0
1
0:2
L
0
20
Everton
1
0
0
1
0:3
L
0


In [12]:
driver.quit()

In [13]:
league_table_results_container

['ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n1\nBrighton\n1\n1\n0\n0\n3:0\nW\n3\n2\nArsenal\n1\n1\n0\n0\n2:0\nW\n3\n3\nLiverpool\n1\n1\n0\n0\n2:0\nW\n3\n4\nMan City\n1\n1\n0\n0\n2:0\nW\n3\n5\nAston Villa\n1\n1\n0\n0\n2:1\nW\n3\n6\nBrentford\n1\n1\n0\n0\n2:1\nW\n3\n7\nMan Utd\n1\n1\n0\n0\n1:0\nW\n3\n8\nNewcastle\n1\n1\n0\n0\n1:0\nW\n3\n9\nBournemouth\n1\n0\n1\n0\n1:1\nD\n1\n10\nForest\n1\n0\n1\n0\n1:1\nD\n1\n11\nLeicester\n1\n0\n1\n0\n1:1\nD\n1\n12\nTottenham\n1\n0\n1\n0\n1:1\nD\n1\n13\nCrystal Palace\n1\n0\n0\n1\n1:2\nL\n0\n14\nWest Ham\n1\n0\n0\n1\n1:2\nL\n0\n15\nFulham\n1\n0\n0\n1\n0:1\nL\n0\n16\nSouthampton\n1\n0\n0\n1\n0:1\nL\n0\n17\nChelsea\n1\n0\n0\n1\n0:2\nL\n0\n18\nWolves\n1\n0\n0\n1\n0:2\nL\n0\n19\nIpswich\n1\n0\n0\n1\n0:2\nL\n0\n20\nEverton\n1\n0\n0\n1\n0:3\nL\n0']

In [14]:
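data_copy = league_table_results_container

# The first 11 characters are the 'ALLHOMEAWAY' tab labels; characters 11 to 42 hold the column headers
df_columns = data_copy[0][11:43]

# Everything after the headers is the table body
data = data_copy[0][43:]

data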

'\n1\nBrighton\n1\n1\n0\n0\n3:0\nW\n3\n2\nArsenal\n1\n1\n0\n0\n2:0\nW\n3\n3\nLiverpool\n1\n1\n0\n0\n2:0\nW\n3\n4\nMan City\n1\n1\n0\n0\n2:0\nW\n3\n5\nAston Villa\n1\n1\n0\n0\n2:1\nW\n3\n6\nBrentford\n1\n1\n0\n0\n2:1\nW\n3\n7\nMan Utd\n1\n1\n0\n0\n1:0\nW\n3\n8\nNewcastle\n1\n1\n0\n0\n1:0\nW\n3\n9\nBournemouth\n1\n0\n1\n0\n1:1\nD\n1\n10\nForest\n1\n0\n1\n0\n1:1\nD\n1\n11\nLeicester\n1\n0\n1\n0\n1:1\nD\n1\n12\nTottenham\n1\n0\n1\n0\n1:1\nD\n1\n13\nCrystal Palace\n1\n0\n0\n1\n1:2\nL\n0\n14\nWest Ham\n1\n0\n0\n1\n1:2\nL\n0\n15\nFulham\n1\n0\n0\n1\n0:1\nL\n0\n16\nSouthampton\n1\n0\n0\n1\n0:1\nL\n0\n17\nChelsea\n1\n0\n0\n1\n0:2\nL\n0\n18\nWolves\n1\n0\n0\n1\n0:2\nL\n0\n19\nIpswich\n1\n0\n0\n1\n0:2\nL\n0\n20\nEverton\n1\n0\n0\n1\n0:3\nL\n0'
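
The offsets 11 and 43 used above are tied to the exact header text, so they would shift if Sofascore renamed or added a column. As a minimal sketch of one alternative, the header can be located by its first and last labels instead. The helper `split_header_and_body` and its parameters are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
def split_header_and_body(table_text, first_header="#", last_header="PTS"):
    """Split scraped table text into (column labels, body tokens).

    Assumption: the header starts at the line `first_header` and ends at the
    line `last_header`, as in the scraped output shown above.
    """
    lines = table_text.split("\n")
    start = lines.index(first_header)      # raises ValueError if the label is missing
    end = lines.index(last_header, start)  # search only from the header start onwards
    return lines[start:end + 1], lines[end + 1:]


# Shortened sample in the same shape as the scraped text
sample_text = "ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n1\nBrighton\n1\n1\n0\n0\n3:0\nW\n3"
columns, tokens = split_header_and_body(sample_text)
print(columns)
print(tokens)
```

Because a missing label raises `ValueError`, a layout change on the site would surface immediately instead of producing a silently misaligned table.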

In [15]:
# Split on newlines and drop the empty string produced by the leading '\n'
data_copy_split = data.split('\n')[1:]
print(data_copy_split)

['1', 'Brighton', '1', '1', '0', '0', '3:0', 'W', '3', '2', 'Arsenal', '1', '1', '0', '0', '2:0', 'W', '3', '3', 'Liverpool', '1', '1', '0', '0', '2:0', 'W', '3', '4', 'Man City', '1', '1', '0', '0', '2:0', 'W', '3', '5', 'Aston Villa', '1', '1', '0', '0', '2:1', 'W', '3', '6', 'Brentford', '1', '1', '0', '0', '2:1', 'W', '3', '7', 'Man Utd', '1', '1', '0', '0', '1:0', 'W', '3', '8', 'Newcastle', '1', '1', '0', '0', '1:0', 'W', '3', '9', 'Bournemouth', '1', '0', '1', '0', '1:1', 'D', '1', '10', 'Forest', '1', '0', '1', '0', '1:1', 'D', '1', '11', 'Leicester', '1', '0', '1', '0', '1:1', 'D', '1', '12', 'Tottenham', '1', '0', '1', '0', '1:1', 'D', '1', '13', 'Crystal Palace', '1', '0', '0', '1', '1:2', 'L', '0', '14', 'West Ham', '1', '0', '0', '1', '1:2', 'L', '0', '15', 'Fulham', '1', '0', '0', '1', '0:1', 'L', '0', '16', 'Southampton', '1', '0', '0', '1', '0:1', 'L', '0', '17', 'Chelsea', '1', '0', '0', '1', '0:2', 'L', '0', '18', 'Wolves', '1', '0', '0', '1', '0:2', 'L', '0', '19', '

In [16]:
# Set the current weekly round (number of matches each team has played)
weekly_round = 1  # Update this value as needed

# Each team occupies 7 fixed fields (position, team, P, W, D, L, goals),
# up to 5 recent results, and 1 points field
slice_length = 7 + min(weekly_round, 5) + 1  # At most 5 match results are shown

# Split the flat list into one chunk per team
data_array = [data_copy_split[i:i+slice_length] for i in range(0, len(data_copy_split), slice_length)]

# Print one row per team
for team_data in data_array:
    print(team_data)

['1', 'Brighton', '1', '1', '0', '0', '3:0', 'W', '3']
['2', 'Arsenal', '1', '1', '0', '0', '2:0', 'W', '3']
['3', 'Liverpool', '1', '1', '0', '0', '2:0', 'W', '3']
['4', 'Man City', '1', '1', '0', '0', '2:0', 'W', '3']
['5', 'Aston Villa', '1', '1', '0', '0', '2:1', 'W', '3']
['6', 'Brentford', '1', '1', '0', '0', '2:1', 'W', '3']
['7', 'Man Utd', '1', '1', '0', '0', '1:0', 'W', '3']
['8', 'Newcastle', '1', '1', '0', '0', '1:0', 'W', '3']
['9', 'Bournemouth', '1', '0', '1', '0', '1:1', 'D', '1']
['10', 'Forest', '1', '0', '1', '0', '1:1', 'D', '1']
['11', 'Leicester', '1', '0', '1', '0', '1:1', 'D', '1']
['12', 'Tottenham', '1', '0', '1', '0', '1:1', 'D', '1']
['13', 'Crystal Palace', '1', '0', '0', '1', '1:2', 'L', '0']
['14', 'West Ham', '1', '0', '0', '1', '1:2', 'L', '0']
['15', 'Fulham', '1', '0', '0', '1', '0:1', 'L', '0']
['16', 'Southampton', '1', '0', '0', '1', '0:1', 'L', '0']
['17', 'Chelsea', '1', '0', '0', '1', '0:2', 'L', '0']
['18', 'Wolves', '1', '0', '0', '1', '0:2', 
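
The fixed `slice_length` above assumes every team shows the same number of recent results, which may not hold when a fixture is postponed early in the season. The following is only a sketch of an alternative that walks the tokens and lets each row carry its own number of results. `parse_rows` and `RESULT_CODES` are hypothetical names, and the layout it expects (seven fixed fields, then results, then points) is taken from the output above.

```python
RESULT_CODES = {"W", "D", "L"}


def parse_rows(tokens):
    """Group a flat token list into one list per team.

    Expected layout per team: position, team, P, W, D, L, 'GF:GA',
    then 0 to 5 result codes, then the points value.
    """
    rows = []
    i = 0
    while i < len(tokens):
        fixed = tokens[i:i + 7]
        if len(fixed) < 7 or ":" not in fixed[6]:
            raise ValueError(f"unexpected row layout near token {i}: {fixed}")
        i += 7

        # Collect result codes until a non-result token (the points) appears
        results = []
        while i < len(tokens) and tokens[i] in RESULT_CODES and len(results) < 5:
            results.append(tokens[i])
            i += 1

        if i >= len(tokens):
            raise ValueError("row ended before the points value")
        points = tokens[i]
        i += 1
        rows.append(fixed + results + [points])
    return rows


# Two teams with a different number of recent results
example_tokens = ['1', 'Brighton', '2', '2', '0', '0', '5:0', 'W', 'W', '6',
                  '2', 'Arsenal', '1', '1', '0', '0', '2:0', 'W', '3']
rows = parse_rows(example_tokens)
for row in rows:
    print(row)
```

Rows parsed this way can differ in length, so any later step that indexes with a single `slice_length` would need to be adjusted accordingly.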

In [17]:
testing = ['1', 'Arsenal', '36', '26', '5', '5', '88:28', 'L', 'W', 'W', 'W', 'W', '83']

# Splitting the element at index 6 and extending the list with the resulting parts
testing[6:7] = testing[6].split(':')

print(testing)

['1', 'Arsenal', '36', '26', '5', '5', '88', '28', 'L', 'W', 'W', 'W', 'W', '83']


In [18]:
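split_data = []

# Split 'GF:GA' into separate GF and GA entries.
# The slice assignment edits each row in place, so data_array is updated as well.
for team_row in data_array:
    team_row[6:7] = team_row[6].split(':')
    split_data.append(team_row)

print(split_data)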

[['1', 'Brighton', '1', '1', '0', '0', '3', '0', 'W', '3'], ['2', 'Arsenal', '1', '1', '0', '0', '2', '0', 'W', '3'], ['3', 'Liverpool', '1', '1', '0', '0', '2', '0', 'W', '3'], ['4', 'Man City', '1', '1', '0', '0', '2', '0', 'W', '3'], ['5', 'Aston Villa', '1', '1', '0', '0', '2', '1', 'W', '3'], ['6', 'Brentford', '1', '1', '0', '0', '2', '1', 'W', '3'], ['7', 'Man Utd', '1', '1', '0', '0', '1', '0', 'W', '3'], ['8', 'Newcastle', '1', '1', '0', '0', '1', '0', 'W', '3'], ['9', 'Bournemouth', '1', '0', '1', '0', '1', '1', 'D', '1'], ['10', 'Forest', '1', '0', '1', '0', '1', '1', 'D', '1'], ['11', 'Leicester', '1', '0', '1', '0', '1', '1', 'D', '1'], ['12', 'Tottenham', '1', '0', '1', '0', '1', '1', 'D', '1'], ['13', 'Crystal Palace', '1', '0', '0', '1', '1', '2', 'L', '0'], ['14', 'West Ham', '1', '0', '0', '1', '1', '2', 'L', '0'], ['15', 'Fulham', '1', '0', '0', '1', '0', '1', 'L', '0'], ['16', 'Southampton', '1', '0', '0', '1', '0', '1', 'L', '0'], ['17', 'Chelsea', '1', '0', '0', '

In [19]:
print(data_array)

[['1', 'Brighton', '1', '1', '0', '0', '3', '0', 'W', '3'], ['2', 'Arsenal', '1', '1', '0', '0', '2', '0', 'W', '3'], ['3', 'Liverpool', '1', '1', '0', '0', '2', '0', 'W', '3'], ['4', 'Man City', '1', '1', '0', '0', '2', '0', 'W', '3'], ['5', 'Aston Villa', '1', '1', '0', '0', '2', '1', 'W', '3'], ['6', 'Brentford', '1', '1', '0', '0', '2', '1', 'W', '3'], ['7', 'Man Utd', '1', '1', '0', '0', '1', '0', 'W', '3'], ['8', 'Newcastle', '1', '1', '0', '0', '1', '0', 'W', '3'], ['9', 'Bournemouth', '1', '0', '1', '0', '1', '1', 'D', '1'], ['10', 'Forest', '1', '0', '1', '0', '1', '1', 'D', '1'], ['11', 'Leicester', '1', '0', '1', '0', '1', '1', 'D', '1'], ['12', 'Tottenham', '1', '0', '1', '0', '1', '1', 'D', '1'], ['13', 'Crystal Palace', '1', '0', '0', '1', '1', '2', 'L', '0'], ['14', 'West Ham', '1', '0', '0', '1', '1', '2', 'L', '0'], ['15', 'Fulham', '1', '0', '0', '1', '0', '1', 'L', '0'], ['16', 'Southampton', '1', '0', '0', '1', '0', '1', 'L', '0'], ['17', 'Chelsea', '1', '0', '0', '

In [26]:
print(slice_length)

9


#### Code Playground: creating ppg, last_5_matches, and removing redundant indexes

In [43]:
epl_table = [['1','Arsenal','37','27','5','5','89','28','W','W','W', 'W','W','86'],
 ['2','Man City', '36','26','7','3','91','33','W','W','W','W','W','85'],
 ['3','Liverpool','36','23','9','4','81','38','L','W','L','D','W','78']]

epl_table_case_2 = [['1', 'Brighton', '1', '1', '0', '0', '3', '0', 'W', '3'], 
                    ['2', 'Arsenal', '1', '1', '0', '0', '2', '0', 'W', '3'], 
                    ['3', 'Liverpool', '1', '1', '0', '0', '2', '0', 'W', '3']]


# Case 3: goals are still in 'GF:GA' form (not yet split)
epl_table_case_3 = [['1', 'Brighton', '1', '1', '0', '0', '3:0', 'W', '3'], 
                    ['2', 'Arsenal', '1', '1', '0', '0', '2:0', 'W', '3'], 
                    ['3', 'Liverpool', '1', '1', '0', '0', '2:0', 'W', '3']]


epl_table_case_4 = [['1', 'Brighton', '1', '1', '0', '0', '3','0', 'W','W', '6'], 
                    ['2', 'Arsenal', '1', '1', '0', '0', '2','0', 'W','W', '6'], 
                    ['3', 'Liverpool', '1', '1', '0', '0', '2','0', 'W','D', '4']]

# Best case: we do not care at which index W, D or L sits. We can look for those strings and then calculate ppg.
# However, we also need to join the strings back together so the form reads like 'WWWDL'.
# At the beginning of the season this is tricky, since the number of previous matches varies from 1 to 5.

# Notes (0-based indexes)
# For week 1, slice_length is 9.
# In case 2 (goals already split into GF and GA) the results start at index 8, hence row[8:slice_length].
# In case 3 (goals still 'GF:GA') the results start at index 7, hence row[7:slice_length - 1].

# Case 4 simulates week 2: two results per row, so its slice length is 7 + 2 + 1 = 10.
case_4_slice_length = 10

for row in epl_table_case_4:
    ppg = 0
    for result in row[8:case_4_slice_length]:
        if result == 'W':
            ppg += 3
        elif result == 'D':
            ppg += 1
        elif result == 'L':
            ppg += 0  # a loss earns no points

    last_5_matches = ''.join(row[8:case_4_slice_length])  # Concatenate the results, e.g. 'WW'
    # Replace the individual results with the form string and PPG (always averaged over 5 matches)
    row[8:case_4_slice_length] = [last_5_matches, ppg / 5]

print(epl_table_case_4)


[['1', 'Brighton', '1', '1', '0', '0', '3', '0', 'WW', 1.2, '6'], ['2', 'Arsenal', '1', '1', '0', '0', '2', '0', 'WW', 1.2, '6'], ['3', 'Liverpool', '1', '1', '0', '0', '2', '0', 'WD', 0.8, '4']]


In [44]:
def calculate_ppg_and_create_last_5_matches(two_d_array):
    """
    Calculate the points per game (PPG) and create a string representing the last 5 matches for each team.

    Relies on the global `slice_length` computed earlier and expects rows whose goals
    have already been split into GF and GA, so the results start at index 8.
    Rows are modified in place.

    Args:
        two_d_array (list of lists): Two-dimensional array containing match data for each team.

    Returns:
        list of lists: The same array, with the individual results replaced by a
        last-5-matches string and the PPG.
    """
    for row in two_d_array:
        ppg = 0
        for result in row[8:slice_length]:
            if result == 'W':
                ppg += 3
            elif result == 'D':
                ppg += 1
            # A loss ('L') adds no points
        last_5_matches = ''.join(row[8:slice_length])  # Concatenate the results from index 8 up to slice_length
        # Points are always averaged over 5 matches, even when fewer have been played
        row[8:slice_length] = [last_5_matches, ppg / 5]

    return two_d_array
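
The function above divides the points by 5 regardless of how many results are available, which is why a single opening-day win appears as a PPG of 0.6. If a true per-game average is wanted instead, one option is to divide by the number of results actually present. This is a sketch under that assumption; `summarise_form` and `POINTS` are hypothetical names, and switching to it would change the scale of the `Ppg` feature relative to any data already stored.

```python
POINTS = {"W": 3, "D": 1, "L": 0}


def summarise_form(results):
    """Return (form string, points per game) for a list of 'W'/'D'/'L' codes."""
    form = "".join(results)
    if not results:
        # No matches played yet: avoid dividing by zero
        return form, 0.0
    total = sum(POINTS[code] for code in results)
    return form, total / len(results)


print(summarise_form(["W"]))
print(summarise_form(["W", "D"]))
```

An unknown code raises `KeyError` here, which makes unexpected tokens in the scraped results visible rather than silently scoring them as zero.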

In [45]:
premier_league_columns  = ['Pos', 'Team', 'Pld', 'Wins', 'Draws', 'Losses', 'GF', 'GA', 'Last 5 Matches', 'Ppg', 'Points']


In [46]:
df_data = calculate_ppg_and_create_last_5_matches(split_data)


In [47]:
# Create DataFrame
premier_league_table = pd.DataFrame(df_data, columns=premier_league_columns)
print(premier_league_table)


   Pos            Team Pld Wins Draws Losses GF GA Last 5 Matches  Ppg Points
0    1        Brighton   1    1     0      0  3  0              W  0.6      3
1    2         Arsenal   1    1     0      0  2  0              W  0.6      3
2    3       Liverpool   1    1     0      0  2  0              W  0.6      3
3    4        Man City   1    1     0      0  2  0              W  0.6      3
4    5     Aston Villa   1    1     0      0  2  1              W  0.6      3
5    6       Brentford   1    1     0      0  2  1              W  0.6      3
6    7         Man Utd   1    1     0      0  1  0              W  0.6      3
7    8       Newcastle   1    1     0      0  1  0              W  0.6      3
8    9     Bournemouth   1    0     1      0  1  1              D  0.2      1
9   10          Forest   1    0     1      0  1  1              D  0.2      1
10  11       Leicester   1    0     1      0  1  1              D  0.2      1
11  12       Tottenham   1    0     1      0  1  1              
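
Because every row is built from scraped text, the count columns in this DataFrame hold strings even though they print like numbers. If the downstream tables expect integers (an assumption, since the database schema is not shown here), the columns could be converted before saving. A minimal sketch on two sample rows:

```python
import pandas as pd

columns = ['Pos', 'Team', 'Pld', 'Wins', 'Draws', 'Losses', 'GF', 'GA',
           'Last 5 Matches', 'Ppg', 'Points']
sample_rows = [
    ['1', 'Brighton', '1', '1', '0', '0', '3', '0', 'W', 0.6, '3'],
    ['2', 'Arsenal', '1', '1', '0', '0', '2', '0', 'W', 0.6, '3'],
]
sample_table = pd.DataFrame(sample_rows, columns=columns)

# Convert the columns that hold whole numbers; Team and Last 5 Matches stay as text
integer_columns = ['Pos', 'Pld', 'Wins', 'Draws', 'Losses', 'GF', 'GA', 'Points']
sample_table = sample_table.astype({name: 'int64' for name in integer_columns})

print(sample_table.dtypes)
```

`astype` raises an error if a value cannot be parsed as an integer, so a malformed scrape is caught at this step.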

In [48]:
def team_to_label(team_name):
    return team_labels.get(team_name)

In [49]:
premier_league_table['Team'] = premier_league_table['Team'].map(team_to_label)

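
`Series.map` yields a missing value for any name that is not a key of the dictionary, and Sofascore's short names (for example `Leicester` rather than `Leicester City`) make such mismatches easy to overlook. A small check run before the mapping step can list the names that would be lost. This is a sketch; `find_unmapped_teams` is a hypothetical helper, and the labels below are a shortened sample.

```python
def find_unmapped_teams(team_names, labels):
    """Return the sorted team names that have no entry in the label dictionary."""
    return sorted({name for name in team_names if name not in labels})


sample_labels = {'Arsenal': 1, 'Leicester City': 12, 'Liverpool': 13}
scraped_names = ['Arsenal', 'Liverpool', 'Leicester']

missing = find_unmapped_teams(scraped_names, sample_labels)
if missing:
    print(f"No label for: {missing}")
```

In the notebook this could be called with the `Team` column as a list and `team_labels`, before the column is overwritten by the mapping.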

#### Confirm it's the right weekly round in order to join with the right dataset.
Note: this data will be treated as the 'previous league standings'. The data we are scraping must be collected before the weekly fixtures (`completed_matches` in our case), because we need the stats from before the matches are played.

In [24]:
# The most common 'Pld' value is taken as the current round
Round = 'Round ' + str(premier_league_table['Pld'].mode()[0])

#### Establish your DB
Make sure you have already created a database for this data. A future update will automate creating the database for you.

In [None]:
# Database connection
user = '<user>'
password = '<password>'
host = 'localhost'
port = 3306
database = '<db>'

In [None]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
premier_league_table.to_sql('previous_week_league_standings', con=engine, if_exists='replace', index=False)
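
The connection values above are placeholders typed into the notebook. One way to keep real credentials out of the file is to read them from environment variables and URL-escape them, since characters such as `@` or `/` in a password would otherwise break the connection string. This is a sketch only; the variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT` and `DB_NAME` and the helper `build_mysql_url` are assumptions, not settings this project defines.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Build a mysql+pymysql connection URL from environment-style settings."""
    user = quote_plus(env.get("DB_USER", "<user>"))
    password = quote_plus(env.get("DB_PASSWORD", "<password>"))  # escapes '@', '/', ':' and similar
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "3306")
    database = env.get("DB_NAME", "<db>")
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit dictionary instead of the real environment
example_url = build_mysql_url({
    "DB_USER": "me",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "3306",
    "DB_NAME": "epl",
})
print(example_url)
```

The resulting string can be passed to `create_engine` in place of the f-string used above.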