# Scraping Premier League Table (Training Data)

## Overview

This notebook scrapes the Premier League table from Sofascore to gather training data for our betting prediction model. The primary objectives of this notebook are:

1. **Data Extraction**: Retrieve relevant information from the Premier League table, including team standings, recent match results, and performance statistics.
2. **Data Processing**: Cleanse and format the scraped data to ensure consistency and accuracy for model training.
3. **Database Integration**: Store the processed data in the `previous_week_league_standings` table, which is later merged with the `completed_matches` table and appended to the `training_data` table in our database for future model training sessions.

## Requirements

- Python 3.x
- Libraries: `selenium`, `pandas`, `sqlalchemy`, `pymysql` (the driver named in the `mysql+pymysql` connection string)
- Access to the database containing the `training_data` table

## Notes

- Prior to execution, ensure that database credentials are correctly configured.
- Run this notebook periodically so that `completed_matches` can be joined with up-to-date previous Premier League standings. To ensure this, run this notebook before you run `Forebet Completed Matches`.

## Updates July 2024
### Project Migration and Update Process
- Migration: Migrated project from Jupyter Notebook 6 to Notebook 7.
- Python Version: Updated from Python 3.9 to Python 3.12.
- Selenium Version and Usage: Noted that with Selenium 4.2, it was possible to create the WebDriver without specifying a service, options, or driver path, which removes unnecessary code.

In [None]:
# Install the necessary libraries (pymysql is the driver used by the database connection string)
!pip install pandas sqlalchemy pymysql

In [None]:
# Make sure you're using the latest version of Selenium to avoid running into errors.
# Make sure your ChromeDriver version is compatible with your Chrome version.
# If necessary, use this link to update to the latest Chrome and ChromeDriver: https://googlechromelabs.github.io/chrome-for-testing/
!pip install --upgrade selenium

In [1]:
from selenium import webdriver
import pandas as pd
from sqlalchemy import create_engine
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# from modules import team_labels_sofascore


In [2]:

team_labels = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brentford': 4,
        'Brighton': 5,
        'Burnley': 6,
        'Chelsea': 7,
        'Crystal Palace': 8,
        'Everton': 9,
        'Fulham': 10,
        'Ipswich': 11,
        'Leeds Utd': 12,
        'Leicester': 13,
        'Liverpool': 14,
        'Man City': 15,
        'Man Utd': 16,
        'Newcastle': 17,
        'Norwich': 18,
        'Sheffield': 19,
        'Southampton': 20,
        'Tottenham': 21,
        'West Ham': 22,
        'Luton': 23,
        'Wolves': 24,
        'Sheffield Utd': 25,
        'Forest': 26
    }
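
The labels are applied later through `team_labels.get`, which returns `None` for any name that is missing from the dictionary, so a newly promoted club or a changed display name would silently become a missing value. A minimal sketch of a pre-check is shown here; `find_unlabelled_teams`, the shortened mapping, and the sample names are hypothetical and only illustrate the idea.

```python
def find_unlabelled_teams(scraped_names, labels):
    """Return scraped team names that have no entry in the label mapping."""
    return sorted(name for name in set(scraped_names) if name not in labels)


# Shortened, illustrative copy of the mapping used in this notebook.
sample_labels = {'Arsenal': 1, 'Man City': 15, 'Forest': 26}

# 'Sunderland' is a made-up scraped value used only to trigger the check.
missing = find_unlabelled_teams(['Arsenal', 'Forest', 'Sunderland'], sample_labels)
print(missing)
```

Running a check like this before mapping makes it easier to spot when the dictionary needs a new entry.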


In [3]:
# Initialize the WebDriver: set up the Chrome driver
chrome_options = Options()
chrome_options.add_argument("--start-maximized")  # Open the browser window maximized
driver = webdriver.Chrome(options=chrome_options)
# When you run the next block, make sure the Chrome window is maximized so the website renders as expected.
# Different window sizes can change the responsive layout and the class names used,
# which could lead to different results if the window is too small.

In [4]:
# Navigate to the specified URL
driver.get('https://www.sofascore.com/tournament/football/england/premier-league/17#id:61627')

# Wait briefly for the consent pop-up and click the consent button if it appears
try:
    consent_button = WebDriverWait(driver, 3).until(
        EC.element_to_be_clickable((By.XPATH, "//p[@class='fc-button-label' and text()='Consent']"))
    )
    consent_button.click()
    print("Clicked the consent button!")
except Exception as e:
    # The pop-up may not appear at all (for example, if consent was already given)
    print(f"Consent pop-up not handled: {e}")

# Now find the league table container; "eHXJll" is the class name seen at the time of writing and may change
league_table_results = driver.find_elements(By.CLASS_NAME, "eHXJll")
print(league_table_results)



Clicked the consent button!
[<selenium.webdriver.remote.webelement.WebElement (session="7cdb5c9fadf371325f6c541b2db34202", element="f.6362AD3D99BC6FBDE97FAF71FA6BA38A.d.B8B444FFCCBF9CA4D984EEB059E94DB0.e.540")>]


In [6]:
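# Run this only after the next cell has read the table text; the elements become unusable once the browser is closed
driver.quit()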

In [5]:
league_table_results_container = []

for results in league_table_results:
    # Extract the text from the league table container
    results_text = results.text
    league_table_results_container.append(results_text)
    print(results_text)

ALLHOMEAWAY
#
Team
P
W
D
L
DIFF
Goals
Last 5
PTS
1
Man City
9
7
2
0
+11
20:9
D
D
W
W
W
23
2
Liverpool
9
7
1
1
+12
17:5
W
W
W
W
D
22
3
Arsenal
9
5
3
1
+7
17:10
D
W
W
L
D
18
4
Aston Villa
9
5
3
1
+5
16:11
W
D
D
W
D
18
5
Chelsea
9
5
2
2
+8
19:11
W
W
D
L
W
17
6
Brighton
9
4
4
1
+4
16:12
D
L
W
W
D
16
7
Forest
9
4
4
1
+4
11:7
D
L
D
W
W
16
8
Tottenham
9
4
1
4
+8
18:10
W
W
L
W
L
13
9
Brentford
9
4
1
4
0
18:18
L
D
W
L
W
13
10
Fulham
9
3
3
3
0
12:12
W
W
L
L
D
12
11
Bournemouth
9
3
3
3
0
11:11
L
W
L
W
D
12
12
Newcastle
9
3
3
3
-1
9:10
L
D
D
L
L
12
13
West Ham
9
3
2
4
-3
13:16
L
D
W
L
W
11
14
Man Utd
9
3
2
4
-3
8:11
D
L
D
W
L
11
15
Leicester
9
2
3
4
-4
13:17
D
L
W
W
L
9
16
Everton
9
2
3
4
-6
10:16
D
W
D
W
D
9
17
Crystal Palace
9
1
3
5
-5
6:11
D
L
L
L
W
6
18
Ipswich
9
0
4
5
-11
9:20
D
D
L
L
L
4
19
Wolves
9
0
2
7
-13
12:25
L
L
L
L
D
2
20
Southampton
9
0
1
8
-13
6:19
D
L
L
L
L
1


In [24]:
league_table_results_container

['ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nDIFF\nGoals\nLast 5\nPTS\n1\nMan City\n9\n7\n2\n0\n+11\n20:9\nD\nD\nW\nW\nW\n23\n2\nLiverpool\n9\n7\n1\n1\n+12\n17:5\nW\nW\nW\nW\nD\n22\n3\nArsenal\n9\n5\n3\n1\n+7\n17:10\nD\nW\nW\nL\nD\n18\n4\nAston Villa\n9\n5\n3\n1\n+5\n16:11\nW\nD\nD\nW\nD\n18\n5\nChelsea\n9\n5\n2\n2\n+8\n19:11\nW\nW\nD\nL\nW\n17\n6\nBrighton\n9\n4\n4\n1\n+4\n16:12\nD\nL\nW\nW\nD\n16\n7\nForest\n9\n4\n4\n1\n+4\n11:7\nD\nL\nD\nW\nW\n16\n8\nTottenham\n9\n4\n1\n4\n+8\n18:10\nW\nW\nL\nW\nL\n13\n9\nBrentford\n9\n4\n1\n4\n0\n18:18\nL\nD\nW\nL\nW\n13\n10\nFulham\n9\n3\n3\n3\n0\n12:12\nW\nW\nL\nL\nD\n12\n11\nBournemouth\n9\n3\n3\n3\n0\n11:11\nL\nW\nL\nW\nD\n12\n12\nNewcastle\n9\n3\n3\n3\n-1\n9:10\nL\nD\nD\nL\nL\n12\n13\nWest Ham\n9\n3\n2\n4\n-3\n13:16\nL\nD\nW\nL\nW\n11\n14\nMan Utd\n9\n3\n2\n4\n-3\n8:11\nD\nL\nD\nW\nL\n11\n15\nLeicester\n9\n2\n3\n4\n-4\n13:17\nD\nL\nW\nW\nL\n9\n16\nEverton\n9\n2\n3\n4\n-6\n10:16\nD\nW\nD\nW\nD\n9\n17\nCrystal Palace\n9\n1\n3\n5\n-5\n6:11\nD\nL\nL\nL\nW\

In [27]:
# Extract the raw table text (the container holds a single string)
data_string = league_table_results_container[0]

# Find the index where the data rows start (the first "\n1", i.e. position 1 in the table)
start_data_index = data_string.find("\n1")

# Split into columns and data using the index
df_columns = data_string[:start_data_index].split("\n")  # Everything before "\n1"
data = data_string[start_data_index+1:].split("\n")      # Everything from "1" onwards

# Clean up any empty entries caused by splitting
df_columns = [col for col in df_columns if col]
data = [item for item in data if item]

# Print results
print("Columns:", df_columns)
print("\nData:", data)

Columns: ['ALLHOMEAWAY', '#', 'Team', 'P', 'W', 'D', 'L', 'DIFF', 'Goals', 'Last 5', 'PTS']

Data: ['1', 'Man City', '9', '7', '2', '0', '+11', '20:9', 'D', 'D', 'W', 'W', 'W', '23', '2', 'Liverpool', '9', '7', '1', '1', '+12', '17:5', 'W', 'W', 'W', 'W', 'D', '22', '3', 'Arsenal', '9', '5', '3', '1', '+7', '17:10', 'D', 'W', 'W', 'L', 'D', '18', '4', 'Aston Villa', '9', '5', '3', '1', '+5', '16:11', 'W', 'D', 'D', 'W', 'D', '18', '5', 'Chelsea', '9', '5', '2', '2', '+8', '19:11', 'W', 'W', 'D', 'L', 'W', '17', '6', 'Brighton', '9', '4', '4', '1', '+4', '16:12', 'D', 'L', 'W', 'W', 'D', '16', '7', 'Forest', '9', '4', '4', '1', '+4', '11:7', 'D', 'L', 'D', 'W', 'W', '16', '8', 'Tottenham', '9', '4', '1', '4', '+8', '18:10', 'W', 'W', 'L', 'W', 'L', '13', '9', 'Brentford', '9', '4', '1', '4', '0', '18:18', 'L', 'D', 'W', 'L', 'W', '13', '10', 'Fulham', '9', '3', '3', '3', '0', '12:12', 'W', 'W', 'L', 'L', 'D', '12', '11', 'Bournemouth', '9', '3', '3', '3', '0', '11:11', 'L', 'W', 'L', 'W
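
The split above depends on finding the first `"\n1"` in the text. As an alternative sketch, the header can be separated by looking for its last label instead. This assumes the header always ends with `PTS`, as in the output shown; `split_header_and_rows` and `sample_text` are hypothetical names used for illustration.

```python
def split_header_and_rows(table_text, last_header='PTS'):
    """Split scraped table text into header labels and the flat list of row values."""
    tokens = [line.strip() for line in table_text.split('\n') if line.strip()]
    # list.index raises ValueError if the expected header label is missing
    header_end = tokens.index(last_header) + 1
    return tokens[:header_end], tokens[header_end:]


# Shortened sample in the same shape as the scraped text (header plus one team).
sample_text = (
    'ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nDIFF\nGoals\nLast 5\nPTS\n'
    '1\nMan City\n9\n7\n2\n0\n+11\n20:9\nD\nD\nW\nW\nW\n23'
)
columns, values = split_header_and_rows(sample_text)
print(columns)
print(values)
```

If the site renames or reorders its columns, the `ValueError` surfaces the problem immediately instead of producing misaligned rows.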

### Handling Goal Difference
After an update to the website, I added some steps for including goal difference.

In [28]:
# Dynamic weekly round: specify the current match week
weekly_round = int(input("Enter the current weekly round (matches played so far): "))

# Constants based on data structure
BASE_TEAM_STATS = 8  # Base fields for each team: rank, team name, P, W, D, L, DIFF, Goals
MAX_LAST_MATCHES = 5  # Maximum match results to include in 'Last 5'
POINTS_FIELD = 1  # Points field length

# Calculate slice length: base stats + current match results (up to MAX_LAST_MATCHES) + points field
slice_length = BASE_TEAM_STATS + min(weekly_round, MAX_LAST_MATCHES) + POINTS_FIELD

# Split the data into slices for each team
data_array = [data[i:i+slice_length] for i in range(0, len(data), slice_length)]

# Print each team's data for verification
for team_data in data_array:
    
    print(team_data)


Enter the current weekly round (matches played so far):  10


['1', 'Man City', '9', '7', '2', '0', '+11', '20:9', 'D', 'D', 'W', 'W', 'W', '23']
['2', 'Liverpool', '9', '7', '1', '1', '+12', '17:5', 'W', 'W', 'W', 'W', 'D', '22']
['3', 'Arsenal', '9', '5', '3', '1', '+7', '17:10', 'D', 'W', 'W', 'L', 'D', '18']
['4', 'Aston Villa', '9', '5', '3', '1', '+5', '16:11', 'W', 'D', 'D', 'W', 'D', '18']
['5', 'Chelsea', '9', '5', '2', '2', '+8', '19:11', 'W', 'W', 'D', 'L', 'W', '17']
['6', 'Brighton', '9', '4', '4', '1', '+4', '16:12', 'D', 'L', 'W', 'W', 'D', '16']
['7', 'Forest', '9', '4', '4', '1', '+4', '11:7', 'D', 'L', 'D', 'W', 'W', '16']
['8', 'Tottenham', '9', '4', '1', '4', '+8', '18:10', 'W', 'W', 'L', 'W', 'L', '13']
['9', 'Brentford', '9', '4', '1', '4', '0', '18:18', 'L', 'D', 'W', 'L', 'W', '13']
['10', 'Fulham', '9', '3', '3', '3', '0', '12:12', 'W', 'W', 'L', 'L', 'D', '12']
['11', 'Bournemouth', '9', '3', '3', '3', '0', '11:11', 'L', 'W', 'L', 'W', 'D', '12']
['12', 'Newcastle', '9', '3', '3', '3', '-1', '9:10', 'L', 'D', 'D', 'L', '
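
The slicing step can be sketched as two small helpers that also fail loudly when the flat list does not divide evenly into rows, which usually means the round number or the page layout is not what was expected. `chunk_rows` and `row_length_for_round` are hypothetical names, and the defaults mirror the constants used above.

```python
def row_length_for_round(matches_played, base_fields=8, max_form=5, points_fields=1):
    """Row length: fixed fields + shown form results (capped at max_form) + points."""
    return base_fields + min(matches_played, max_form) + points_fields


def chunk_rows(values, row_length):
    """Split a flat list of cell values into equal-sized per-team rows."""
    if row_length <= 0:
        raise ValueError('row_length must be positive')
    if len(values) % row_length != 0:
        raise ValueError(
            f'{len(values)} values cannot be split into rows of {row_length}'
        )
    return [values[i:i + row_length] for i in range(0, len(values), row_length)]


# Two teams taken from the scraped data above.
flat = ['1', 'Man City', '9', '7', '2', '0', '+11', '20:9', 'D', 'D', 'W', 'W', 'W', '23',
        '2', 'Liverpool', '9', '7', '1', '1', '+12', '17:5', 'W', 'W', 'W', 'W', 'D', '22']
rows = chunk_rows(flat, row_length_for_round(9))
print(len(rows), rows[1][1])
```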

### Detecting ':' in the array and splitting it
I updated the process here. I like to keep old code around to show how goofy I was.

In [29]:
for row in data_array:
    for i, value in enumerate(row):
        if ':' in value:
            # Replace the 'GF:GA' field with two separate fields, then move on to the next team
            row[i:i+1] = value.split(':')
            break

print(data_array)

[['1', 'Man City', '9', '7', '2', '0', '+11', '20', '9', 'D', 'D', 'W', 'W', 'W', '23'], ['2', 'Liverpool', '9', '7', '1', '1', '+12', '17', '5', 'W', 'W', 'W', 'W', 'D', '22'], ['3', 'Arsenal', '9', '5', '3', '1', '+7', '17', '10', 'D', 'W', 'W', 'L', 'D', '18'], ['4', 'Aston Villa', '9', '5', '3', '1', '+5', '16', '11', 'W', 'D', 'D', 'W', 'D', '18'], ['5', 'Chelsea', '9', '5', '2', '2', '+8', '19', '11', 'W', 'W', 'D', 'L', 'W', '17'], ['6', 'Brighton', '9', '4', '4', '1', '+4', '16', '12', 'D', 'L', 'W', 'W', 'D', '16'], ['7', 'Forest', '9', '4', '4', '1', '+4', '11', '7', 'D', 'L', 'D', 'W', 'W', '16'], ['8', 'Tottenham', '9', '4', '1', '4', '+8', '18', '10', 'W', 'W', 'L', 'W', 'L', '13'], ['9', 'Brentford', '9', '4', '1', '4', '0', '18', '18', 'L', 'D', 'W', 'L', 'W', '13'], ['10', 'Fulham', '9', '3', '3', '3', '0', '12', '12', 'W', 'W', 'L', 'L', 'D', '12'], ['11', 'Bournemouth', '9', '3', '3', '3', '0', '11', '11', 'L', 'W', 'L', 'W', 'D', '12'], ['12', 'Newcastle', '9', '3', 
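
Since each row now carries both the goal difference and the two goal counts, the split can double as a consistency check. The sketch below assumes the goals field sits at index 7 and the difference at index 6, as in the rows above, and that the sign is a plain ASCII `+` or `-`; `split_goals` is a hypothetical helper.

```python
def split_goals(row, goals_index=7, diff_index=6):
    """Return a new row with 'GF:GA' replaced by two fields, after checking DIFF."""
    goals_for, goals_against = row[goals_index].split(':')
    if int(goals_for) - int(goals_against) != int(row[diff_index]):
        raise ValueError(f'Goal difference mismatch in row: {row}')
    return row[:goals_index] + [goals_for, goals_against] + row[goals_index + 1:]


sample_row = ['1', 'Man City', '9', '7', '2', '0', '+11', '20:9',
              'D', 'D', 'W', 'W', 'W', '23']
print(split_goals(sample_row))
```

Returning a new list leaves the original row untouched, which makes the cell safe to re-run.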

In [31]:
data_array[0][9:slice_length]

['D', 'D', 'W', 'W', 'W']

In [32]:
def calculate_ppg_and_create_last_5_matches(two_d_array):
    """
    Calculate the points per game (PPG) and create a string representing the last 5 matches for each team.

    Args:
        two_d_array (list of lists): Two-dimensional array containing match data for each team.

    Returns:
        list of lists: The same array, with the match results replaced by a last-5 string and the PPG.
    """
    points_by_result = {'W': 3, 'D': 1, 'L': 0}
    for row in two_d_array:
        results = row[9:slice_length]  # Match results sit at indices 9 to slice_length - 1
        total_points = sum(points_by_result.get(result, 0) for result in results)
        last_5_matches = ''.join(results)  # Concatenate the match results into one string
        ppg = total_points / len(results) if results else 0.0
        row[9:slice_length] = [last_5_matches, ppg]  # Replace old match results with last 5 matches string and PPG
    
    return two_d_array

In [33]:
premier_league_columns  = ['pos', 'team', 'pld', 'wins', 'draws', 'losses', 'diff' , 'gf', 'ga', 'last_5_matches', 'ppg_last_5_matches', 'points']


In [34]:
split_data = [row.copy() for row in data_array]  # Copy each row so data_array itself is left unchanged

In [35]:
df_data = calculate_ppg_and_create_last_5_matches(split_data)
df_data

In [37]:
# Create DataFrame
premier_league_table = pd.DataFrame(df_data, columns=premier_league_columns)
print(premier_league_table)


   pos            team pld wins draws losses diff  gf  ga last_5_matches  \
0    1        Man City   9    7     2      0  +11  20   9          DDWWW   
1    2       Liverpool   9    7     1      1  +12  17   5          WWWWD   
2    3         Arsenal   9    5     3      1   +7  17  10          DWWLD   
3    4     Aston Villa   9    5     3      1   +5  16  11          WDDWD   
4    5         Chelsea   9    5     2      2   +8  19  11          WWDLW   
5    6        Brighton   9    4     4      1   +4  16  12          DLWWD   
6    7          Forest   9    4     4      1   +4  11   7          DLDWW   
7    8       Tottenham   9    4     1      4   +8  18  10          WWLWL   
8    9       Brentford   9    4     1      4    0  18  18          LDWLW   
9   10          Fulham   9    3     3      3    0  12  12          WWLLD   
10  11     Bournemouth   9    3     3      3    0  11  11          LWLWD   
11  12       Newcastle   9    3     3      3   -1   9  10          LDDLL   
12  13      

In [38]:
premier_league_table.drop(columns=['diff'], inplace=True)

In [39]:
def team_to_label(team_name):
    return team_labels.get(team_name)

In [40]:
premier_league_table['team'] = premier_league_table['team'].map(team_to_label)

In [41]:
premier_league_table

Unnamed: 0,pos,team,pld,wins,draws,losses,gf,ga,last_5_matches,ppg_last_5_matches,points
0,1,15,9,7,2,0,20,9,DDWWW,2.2,23
1,2,14,9,7,1,1,17,5,WWWWD,2.6,22
2,3,1,9,5,3,1,17,10,DWWLD,1.6,18
3,4,2,9,5,3,1,16,11,WDDWD,1.8,18
4,5,7,9,5,2,2,19,11,WWDLW,2.0,17
5,6,5,9,4,4,1,16,12,DLWWD,1.6,16
6,7,26,9,4,4,1,11,7,DLDWW,1.6,16
7,8,21,9,4,1,4,18,10,WWLWL,1.8,13
8,9,4,9,4,1,4,18,18,LDWLW,1.4,13
9,10,10,9,3,3,3,12,12,WWLLD,1.4,12
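
The points-per-game idea can also be expressed on its own, working directly from a form string such as `DDWWW`. This is a sketch rather than the notebook's function: `points_per_game` is a hypothetical helper, and it divides by the number of results actually present so that it still behaves sensibly early in the season when fewer than five matches are shown.

```python
POINTS = {'W': 3, 'D': 1, 'L': 0}


def points_per_game(form):
    """Average points per match over a form string such as 'DDWWW'."""
    if not form:
        return 0.0
    return sum(POINTS[result] for result in form) / len(form)


print(points_per_game('DDWWW'))
```

An unexpected character in the form string raises a `KeyError`, which helps catch layout changes on the page.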


#### Confirm it's the right weekly round in order to join with the right dataset
Note: this data will be treated as the 'previous league standings'. The data we are scraping must be collected before the weekly fixtures (`completed_matches` in our case) are played, because we need the stats as they stood before those matches.

In [48]:
# Most teams share the same games-played count, so use its mode as the round number
current_round = 'Round ' + premier_league_table['pld'].mode().iloc[0]
current_round

'Round 9'
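
The same check can be sketched without pandas, which makes the logic easier to see: take the most frequent games-played value and use it as the round. `most_common_matches_played` and the sample values are hypothetical; the sample includes one team with a game in hand to show why the mode is used rather than the first row.

```python
from collections import Counter


def most_common_matches_played(played_values):
    """Return the most frequent 'matches played' value as an int."""
    counts = Counter(int(value) for value in played_values)
    return counts.most_common(1)[0][0]


# Illustrative values: one team has played a match fewer than the rest.
sample_played = ['9', '9', '9', '8', '9']
round_label = f'Round {most_common_matches_played(sample_played)}'
print(round_label)
```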

#### Establish your DB
Make sure you have already created a database for it. In my next update I will add automation that creates the database for you.

In [45]:
# Database connection
user = 'test_user'
password = 'password'
host = 'localhost'
port = 3306
database = 'bet_tips_validation_web_app'

In [46]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
premier_league_table.to_sql('previous_week_league_standings', con=engine, if_exists='replace', index=False)

20
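
Rather than keeping credentials in the notebook, the connection string could be assembled from environment variables. The sketch below is one possible approach, not part of the original workflow: the variable names (`DB_USER`, `DB_PASSWORD`, and so on) and the helper `build_mysql_url` are assumptions, and the defaults simply repeat the values used above. The password is URL-encoded so that characters such as `@` or `/` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Build a SQLAlchemy-style MySQL URL from environment variables."""
    user = env.get('DB_USER', 'test_user')
    password = quote_plus(env.get('DB_PASSWORD', ''))  # Escape special characters
    host = env.get('DB_HOST', 'localhost')
    port = int(env.get('DB_PORT', '3306'))
    database = env.get('DB_NAME', 'bet_tips_validation_web_app')
    return f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}'


# A plain dict stands in for os.environ in this illustration.
print(build_mysql_url({'DB_USER': 'scraper', 'DB_PASSWORD': 'p@ss/word'}))
```

The resulting string can be passed to `create_engine` in place of the f-string used in the cell above.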