# Betting Prediction Model - Scraping Premier League Table (Training Data)

## Overview

This notebook scrapes the Premier League table from Sofascore to gather training data for our betting prediction model. The extracted data is later used to train the model. The primary objectives of this notebook are:

1. **Data Extraction**: Retrieve relevant information from the Premier League table, including team standings, recent match results, and performance statistics.
2. **Data Processing**: Cleanse and format the scraped data to ensure consistency and accuracy for model training.
3. **Database Integration**: Store the processed data in the `previous_week_league_standings` table, which is later merged with the `completed_matches` table and appended to the `training_data` table in our database for future model training sessions.

## Steps Involved

1. **Environment Setup and Imports**: Import necessary libraries and configure settings.
2. **Web Scraping**: Utilize web scraping techniques to extract data from the Premier League table.
3. **Data Cleansing and Transformation**: Process the extracted data to ensure uniformity and reliability.
4. **Database Interaction**: Insert the processed data into the `previous_week_league_standings` table, which later feeds the `training_data` table to build a comprehensive dataset.


## Requirements

- Python 3.x
- Libraries: `selenium`, `pandas`, `sqlalchemy`, `pymysql` (the MySQL driver used in the connection string)
- Access to the database containing the `training_data` table

## Notes

- Prior to execution, ensure that database credentials are correctly configured.
- This notebook should be executed periodically to keep `completed_matches` up to date with the previous Premier League standings. To ensure this, run this notebook before you run `Betting Prediction Model - Scraping Forebet Prediction Table Completed Matches`.

## Updates July 2024
### Project Migration and Update Process
- Migration: Migrated project from Jupyter Notebook 6 to Notebook 7.
- Python Version: Updated from Python 3.9 to Python 3.12.
- Selenium Version and Usage: With Selenium 4.22, `webdriver.Chrome()` can be called without specifying a service, options, or driver path, which removes unnecessary setup code.

In [3]:
# Install the necessary libraries
!pip install selenium pandas sqlalchemy

Collecting selenium
  Downloading selenium-4.22.0-py3-none-any.whl.metadata (7.0 kB)
Collecting pandas
  Downloading pandas-2.2.2-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting sqlalchemy
  Downloading SQLAlchemy-2.0.31-cp312-cp312-win_amd64.whl.metadata (9.9 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.25.1-py3-none-any.whl.metadata (8.7 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.0.0-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)

In [6]:
# Make sure you're using the latest version of Selenium to avoid running into errors.
# Make sure your ChromeDriver version is compatible with your Chrome version.
# If necessary, use https://googlechromelabs.github.io/chrome-for-testing/ to update to the latest Chrome and ChromeDriver.
!pip install --upgrade selenium

Note: you may need to restart the kernel to use updated packages.


In [1]:
from selenium import webdriver
import pandas as pd
from sqlalchemy import create_engine

In [3]:
# Initialize the WebDriver
driver = webdriver.Chrome()

# Before running the next cell, make sure the Chrome window is maximized so the website renders its full layout.
# The site is responsive, so a small window can change the layout and the class names used,
# which could lead to different scraping results.

In [4]:
# Navigate to the specified URL
driver.get('https://www.sofascore.com/tournament/football/england/premier-league/17#id:52186')

In [5]:
# Find the league table container by its class name (the class name is site-specific and may change)
league_table_results = driver.find_elements("class name", "eHXJll")
print(league_table_results)


[<selenium.webdriver.remote.webelement.WebElement (session="59a8d929a60185e1f9a5f61e52a7c6cf", element="f.942C4C415A270AA7DA9FC0AE1BB348EC.d.8CD9CD88538908BE01B7A1931F437410.e.33")>]


In [6]:
league_table_results_container = []

for results in league_table_results:
    # Extract the text from the league table container
    results_text = results.text
    league_table_results_container.append(results_text)
    print(results_text)

ALLHOMEAWAY
#
Team
P
W
D
L
Goals
Last 5
PTS
1
Man City
38
28
7
3
96:34
W
W
W
W
W
91
2
Arsenal
38
28
5
5
91:29
W
W
W
W
W
89
3
Liverpool
38
24
10
4
86:41
L
D
W
D
W
82
4
Aston Villa
38
20
8
10
76:61
W
D
L
D
L
68
5
Tottenham
38
20
6
12
74:61
L
L
W
L
W
66
6
Chelsea
38
18
9
11
77:63
W
W
W
W
W
63
7
Newcastle
38
18
6
14
85:62
W
W
D
L
W
60
8
Man Utd
38
18
6
14
57:58
D
L
L
W
W
60
9
West Ham
38
14
10
14
60:74
L
D
L
W
L
52
10
Crystal Palace
38
13
10
15
57:58
W
D
W
W
W
49
11
Brighton
38
12
12
14
55:62
L
W
D
L
L
48
12
Bournemouth
38
13
9
16
54:67
W
W
L
L
L
48
13
Fulham
38
13
8
17
55:61
L
D
D
L
W
47
14
Wolves
38
13
7
18
50:65
L
W
L
L
L
46
15
Everton
38
13
9
16
40:51
W
W
D
W
L
40
16
Brentford
38
10
9
19
56:65
W
L
D
W
L
39
17
Forest
38
9
9
20
49:67
L
L
W
L
W
32
18
Luton
38
6
8
24
52:85
L
L
D
L
L
26
19
Burnley
38
5
9
24
41:78
W
D
L
L
L
24
20
Sheffield Utd
38
3
7
28
35:104
L
L
L
L
L
16


In [7]:
driver.quit()

In [8]:
league_table_results_container

['ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n1\nMan City\n38\n28\n7\n3\n96:34\nW\nW\nW\nW\nW\n91\n2\nArsenal\n38\n28\n5\n5\n91:29\nW\nW\nW\nW\nW\n89\n3\nLiverpool\n38\n24\n10\n4\n86:41\nL\nD\nW\nD\nW\n82\n4\nAston Villa\n38\n20\n8\n10\n76:61\nW\nD\nL\nD\nL\n68\n5\nTottenham\n38\n20\n6\n12\n74:61\nL\nL\nW\nL\nW\n66\n6\nChelsea\n38\n18\n9\n11\n77:63\nW\nW\nW\nW\nW\n63\n7\nNewcastle\n38\n18\n6\n14\n85:62\nW\nW\nD\nL\nW\n60\n8\nMan Utd\n38\n18\n6\n14\n57:58\nD\nL\nL\nW\nW\n60\n9\nWest Ham\n38\n14\n10\n14\n60:74\nL\nD\nL\nW\nL\n52\n10\nCrystal Palace\n38\n13\n10\n15\n57:58\nW\nD\nW\nW\nW\n49\n11\nBrighton\n38\n12\n12\n14\n55:62\nL\nW\nD\nL\nL\n48\n12\nBournemouth\n38\n13\n9\n16\n54:67\nW\nW\nL\nL\nL\n48\n13\nFulham\n38\n13\n8\n17\n55:61\nL\nD\nD\nL\nW\n47\n14\nWolves\n38\n13\n7\n18\n50:65\nL\nW\nL\nL\nL\n46\n15\nEverton\n38\n13\n9\n16\n40:51\nW\nW\nD\nW\nL\n40\n16\nBrentford\n38\n10\n9\n19\n56:65\nW\nL\nD\nW\nL\n39\n17\nForest\n38\n9\n9\n20\n49:67\nL\nL\nW\nL\nW\n32\n18\nLuton\n3

In [9]:
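# Note: this is a second reference to the same list, not a copy
data_copy = league_table_results_container

# Characters 11 to 42 hold the column headers that follow the 'ALLHOMEAWAY' tab labels
df_columns = data_copy[0][11:43]

# Everything from character 43 onward is the table body
data = data_copy[0][43:]

data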

'\n1\nMan City\n38\n28\n7\n3\n96:34\nW\nW\nW\nW\nW\n91\n2\nArsenal\n38\n28\n5\n5\n91:29\nW\nW\nW\nW\nW\n89\n3\nLiverpool\n38\n24\n10\n4\n86:41\nL\nD\nW\nD\nW\n82\n4\nAston Villa\n38\n20\n8\n10\n76:61\nW\nD\nL\nD\nL\n68\n5\nTottenham\n38\n20\n6\n12\n74:61\nL\nL\nW\nL\nW\n66\n6\nChelsea\n38\n18\n9\n11\n77:63\nW\nW\nW\nW\nW\n63\n7\nNewcastle\n38\n18\n6\n14\n85:62\nW\nW\nD\nL\nW\n60\n8\nMan Utd\n38\n18\n6\n14\n57:58\nD\nL\nL\nW\nW\n60\n9\nWest Ham\n38\n14\n10\n14\n60:74\nL\nD\nL\nW\nL\n52\n10\nCrystal Palace\n38\n13\n10\n15\n57:58\nW\nD\nW\nW\nW\n49\n11\nBrighton\n38\n12\n12\n14\n55:62\nL\nW\nD\nL\nL\n48\n12\nBournemouth\n38\n13\n9\n16\n54:67\nW\nW\nL\nL\nL\n48\n13\nFulham\n38\n13\n8\n17\n55:61\nL\nD\nD\nL\nW\n47\n14\nWolves\n38\n13\n7\n18\n50:65\nL\nW\nL\nL\nL\n46\n15\nEverton\n38\n13\n9\n16\n40:51\nW\nW\nD\nW\nL\n40\n16\nBrentford\n38\n10\n9\n19\n56:65\nW\nL\nD\nW\nL\n39\n17\nForest\n38\n9\n9\n20\n49:67\nL\nL\nW\nL\nW\n32\n18\nLuton\n38\n6\n8\n24\n52:85\nL\nL\nD\nL\nL\n26\n19\nBurnley\n3
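The character offsets 11 and 43 above only hold while the header text stays exactly the same length. As a minimal sketch of a less position-dependent alternative, the text could be split into tokens first and the header located by its last cell. The helper name `split_header_and_body`, its `last_header` parameter, and the shortened sample string are assumptions for illustration, not part of the notebook.

```python
# Hypothetical, shortened sample of the scraped text; in the notebook the real
# string would come from league_table_results_container[0].
raw_text = (
    "ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n"
    "1\nMan City\n38\n28\n7\n3\n96:34\nW\nW\nW\nW\nW\n91\n"
    "2\nArsenal\n38\n28\n5\n5\n91:29\nW\nW\nW\nW\nW\n89"
)


def split_header_and_body(text, last_header="PTS"):
    """Split scraped table text into header tokens and data tokens.

    Assumes the final header cell is `last_header`; list.index raises
    ValueError if that cell is not present.
    """
    tokens = text.split("\n")
    header_end = tokens.index(last_header) + 1
    return tokens[:header_end], tokens[header_end:]


header_tokens, body_tokens = split_header_and_body(raw_text)
print(header_tokens)
print(body_tokens[:13])  # first team's row
```

Because the body tokens start directly at the first position number, no empty leading element needs to be dropped afterwards.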

In [10]:
# Split on newlines and drop the empty first element produced by the leading '\n'
data_copy_split = data.split('\n')[1:]
print(data_copy_split)

['1', 'Man City', '38', '28', '7', '3', '96:34', 'W', 'W', 'W', 'W', 'W', '91', '2', 'Arsenal', '38', '28', '5', '5', '91:29', 'W', 'W', 'W', 'W', 'W', '89', '3', 'Liverpool', '38', '24', '10', '4', '86:41', 'L', 'D', 'W', 'D', 'W', '82', '4', 'Aston Villa', '38', '20', '8', '10', '76:61', 'W', 'D', 'L', 'D', 'L', '68', '5', 'Tottenham', '38', '20', '6', '12', '74:61', 'L', 'L', 'W', 'L', 'W', '66', '6', 'Chelsea', '38', '18', '9', '11', '77:63', 'W', 'W', 'W', 'W', 'W', '63', '7', 'Newcastle', '38', '18', '6', '14', '85:62', 'W', 'W', 'D', 'L', 'W', '60', '8', 'Man Utd', '38', '18', '6', '14', '57:58', 'D', 'L', 'L', 'W', 'W', '60', '9', 'West Ham', '38', '14', '10', '14', '60:74', 'L', 'D', 'L', 'W', 'L', '52', '10', 'Crystal Palace', '38', '13', '10', '15', '57:58', 'W', 'D', 'W', 'W', 'W', '49', '11', 'Brighton', '38', '12', '12', '14', '55:62', 'L', 'W', 'D', 'L', 'L', '48', '12', 'Bournemouth', '38', '13', '9', '16', '54:67', 'W', 'W', 'L', 'L', 'L', '48', '13', 'Fulham', '38', '

In [11]:
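# Each team occupies 13 consecutive elements: Pos, Team, P, W, D, L, Goals, five recent results, PTS
data_array = [data_copy_split[i:i+13] for i in range(0, len(data_copy_split), 13)]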
print(data_array)

[['1', 'Man City', '38', '28', '7', '3', '96:34', 'W', 'W', 'W', 'W', 'W', '91'], ['2', 'Arsenal', '38', '28', '5', '5', '91:29', 'W', 'W', 'W', 'W', 'W', '89'], ['3', 'Liverpool', '38', '24', '10', '4', '86:41', 'L', 'D', 'W', 'D', 'W', '82'], ['4', 'Aston Villa', '38', '20', '8', '10', '76:61', 'W', 'D', 'L', 'D', 'L', '68'], ['5', 'Tottenham', '38', '20', '6', '12', '74:61', 'L', 'L', 'W', 'L', 'W', '66'], ['6', 'Chelsea', '38', '18', '9', '11', '77:63', 'W', 'W', 'W', 'W', 'W', '63'], ['7', 'Newcastle', '38', '18', '6', '14', '85:62', 'W', 'W', 'D', 'L', 'W', '60'], ['8', 'Man Utd', '38', '18', '6', '14', '57:58', 'D', 'L', 'L', 'W', 'W', '60'], ['9', 'West Ham', '38', '14', '10', '14', '60:74', 'L', 'D', 'L', 'W', 'L', '52'], ['10', 'Crystal Palace', '38', '13', '10', '15', '57:58', 'W', 'D', 'W', 'W', 'W', '49'], ['11', 'Brighton', '38', '12', '12', '14', '55:62', 'L', 'W', 'D', 'L', 'L', '48'], ['12', 'Bournemouth', '38', '13', '9', '16', '54:67', 'W', 'W', 'L', 'L', 'L', '48'],
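The fixed chunk size of 13 silently produces misaligned rows if the page ever yields a different number of cells per team, for example if fewer than five recent results were listed (an assumption about how the page might render early in a season). A small sketch of a chunking helper that fails loudly instead is shown here; `chunk_rows` and the sample tokens are hypothetical.

```python
def chunk_rows(tokens, row_width=13):
    """Group a flat token list into rows of `row_width`, raising on a mismatch."""
    if len(tokens) % row_width != 0:
        raise ValueError(
            f"Expected a multiple of {row_width} tokens, got {len(tokens)}"
        )
    rows = [tokens[i:i + row_width] for i in range(0, len(tokens), row_width)]
    for row in rows:
        # Every row should begin with the team's league position
        if not row[0].isdigit():
            raise ValueError(f"Row does not start with a position: {row}")
    return rows


# Hypothetical sample: two complete rows taken from the output above
sample_tokens = [
    '1', 'Man City', '38', '28', '7', '3', '96:34', 'W', 'W', 'W', 'W', 'W', '91',
    '2', 'Arsenal', '38', '28', '5', '5', '91:29', 'W', 'W', 'W', 'W', 'W', '89',
]
rows = chunk_rows(sample_tokens)
print(len(rows), rows[1][1])
```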

In [12]:
testing = ['1', 'Arsenal', '36', '26', '5', '5', '88:28', 'L', 'W', 'W', 'W', 'W', '83']

# Replace the 'GF:GA' string at index 6 with two separate elements (goals for, goals against)
testing[6:7] = testing[6].split(':')

print(testing)

['1', 'Arsenal', '36', '26', '5', '5', '88', '28', 'L', 'W', 'W', 'W', 'W', '83']


In [13]:
split_data = []

for row in data_array:
    # Replace 'GF:GA' at index 6 with separate goals-for and goals-against values
    row[6:7] = row[6].split(':')
    split_data.append(row)

print(split_data)

[['1', 'Man City', '38', '28', '7', '3', '96', '34', 'W', 'W', 'W', 'W', 'W', '91'], ['2', 'Arsenal', '38', '28', '5', '5', '91', '29', 'W', 'W', 'W', 'W', 'W', '89'], ['3', 'Liverpool', '38', '24', '10', '4', '86', '41', 'L', 'D', 'W', 'D', 'W', '82'], ['4', 'Aston Villa', '38', '20', '8', '10', '76', '61', 'W', 'D', 'L', 'D', 'L', '68'], ['5', 'Tottenham', '38', '20', '6', '12', '74', '61', 'L', 'L', 'W', 'L', 'W', '66'], ['6', 'Chelsea', '38', '18', '9', '11', '77', '63', 'W', 'W', 'W', 'W', 'W', '63'], ['7', 'Newcastle', '38', '18', '6', '14', '85', '62', 'W', 'W', 'D', 'L', 'W', '60'], ['8', 'Man Utd', '38', '18', '6', '14', '57', '58', 'D', 'L', 'L', 'W', 'W', '60'], ['9', 'West Ham', '38', '14', '10', '14', '60', '74', 'L', 'D', 'L', 'W', 'L', '52'], ['10', 'Crystal Palace', '38', '13', '10', '15', '57', '58', 'W', 'D', 'W', 'W', 'W', '49'], ['11', 'Brighton', '38', '12', '12', '14', '55', '62', 'L', 'W', 'D', 'L', 'L', '48'], ['12', 'Bournemouth', '38', '13', '9', '16', '54', '

In [14]:
# These team labels match the ones used in the other notebooks
team_labels = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brighton': 4,
        'Burnley': 5,
        'Chelsea': 6,
        'Crystal Palace': 7,
        'Everton': 8,
        'Fulham': 9,
        'Leeds Utd': 10,
        'Leicester City': 11,
        'Liverpool': 12,
        'Man City': 13,
        'Man Utd': 14,
        'Newcastle': 15,
        'Norwich': 16,
        'Sheffield': 17,
        'Southampton': 18,
        'Tottenham': 19,
        'West Ham': 20,
        'Luton': 21,
        'Wolves': 22,
        'Brentford': 23,
        'Sheffield Utd': 24,
        'Forest': 25
    }

#### Code Playground: creating ppg, last_5_matches, and removing redundant indexes

In [15]:
epl_table = [['1','Arsenal','37','27','5','5','89','28','W','W','W', 'W','W','86'],
 ['2','Man City', '36','26','7','3','91','33','W','W','W','W','W','85'],
 ['3','Liverpool','36','23','9','4','81','38','L','W','L','D','W','78']]

for row in epl_table:
    points = 0
    for result in row[8:13]:  # The five most recent results sit at indexes 8 to 12
        if result == 'W':
            points += 3
        elif result == 'D':
            points += 1
    last_5_matches = ''.join(row[8:13])  # Concatenate the five results, e.g. 'LWLDW'
    row[8:13] = [last_5_matches, points / 5]  # Replace the five results with the form string and PPG

print(epl_table)


[['1', 'Arsenal', '37', '27', '5', '5', '89', '28', 'WWWWW', 3.0, '86'], ['2', 'Man City', '36', '26', '7', '3', '91', '33', 'WWWWW', 3.0, '85'], ['3', 'Liverpool', '36', '23', '9', '4', '81', '38', 'LWLDW', 1.4, '78']]


In [16]:
def calculate_ppg_and_create_last_5_matches(two_d_array):
    """
    Calculate the points per game (PPG) over the last 5 matches and create a string of those results for each team.

    Args:
        two_d_array (list of lists): Two-dimensional array containing one row of table data per team.

    Returns:
        list of lists: The same array, modified in place, with the five result columns replaced by a form string and PPG.
    """
    for row in two_d_array:
        points = 0
        for result in row[8:13]:  # The five most recent results sit at indexes 8 to 12
            if result == 'W':
                points += 3
            elif result == 'D':
                points += 1
        last_5_matches = ''.join(row[8:13])  # Concatenate the five results, e.g. 'LWLDW'
        row[8:13] = [last_5_matches, points / 5]  # Replace the five results with the form string and PPG

    return two_d_array
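As a worked example of the PPG arithmetic, a win counts 3 points, a draw 1 and a loss 0, and the total is divided by the number of matches. The sketch below uses a hypothetical helper, `last_5_ppg`, that works on the results alone and does not modify its input; it is an illustration, not the function used in the notebook.

```python
# Points awarded per result
POINTS = {'W': 3, 'D': 1, 'L': 0}


def last_5_ppg(results):
    """Return average points per game for a sequence of 'W'/'D'/'L' results."""
    return sum(POINTS[r] for r in results) / len(results)


form = ['L', 'W', 'L', 'D', 'W']  # Liverpool's form in the playground example
print(''.join(form), last_5_ppg(form))  # (0 + 3 + 0 + 1 + 3) / 5 = 1.4
```

This agrees with the `'LWLDW', 1.4` entry in the playground output.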

In [17]:
premier_league_columns = ['Pos', 'Team', 'Pld', 'Wins', 'Draws', 'Losses', 'GF', 'GA', 'Last 5 Matches', 'Ppg', 'Points']


In [18]:
df_data = calculate_ppg_and_create_last_5_matches(split_data)


In [21]:
import pandas as pd
# Create DataFrame
premier_league_table = pd.DataFrame(df_data, columns=premier_league_columns)
print(premier_league_table)


   Pos            Team Pld Wins Draws Losses  GF   GA Last 5 Matches  Ppg  \
0    1        Man City  38   28     7      3  96   34          WWWWW  3.0   
1    2         Arsenal  38   28     5      5  91   29          WWWWW  3.0   
2    3       Liverpool  38   24    10      4  86   41          LDWDW  1.6   
3    4     Aston Villa  38   20     8     10  76   61          WDLDL  1.0   
4    5       Tottenham  38   20     6     12  74   61          LLWLW  1.2   
5    6         Chelsea  38   18     9     11  77   63          WWWWW  3.0   
6    7       Newcastle  38   18     6     14  85   62          WWDLW  2.0   
7    8         Man Utd  38   18     6     14  57   58          DLLWW  1.4   
8    9        West Ham  38   14    10     14  60   74          LDLWL  0.8   
9   10  Crystal Palace  38   13    10     15  57   58          WDWWW  2.6   
10  11        Brighton  38   12    12     14  55   62          LWDLL  0.8   
11  12     Bournemouth  38   13     9     16  54   67          WWLLL  1.2   

In [22]:
def team_to_label(team_name):
    return team_labels.get(team_name)

In [23]:
premier_league_table['Team'] = premier_league_table['Team'].map(team_to_label)
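Because `team_labels.get` returns `None` for any name missing from the dictionary, a newly promoted or differently spelled team would end up without a label in the `Team` column. A small check along these lines, run before the mapping, could surface such names. The reduced dictionary and the team list here are hypothetical samples.

```python
# Subset of the notebook's mapping, for illustration only
team_labels_sample = {'Arsenal': 1, 'Man City': 13, 'Forest': 25}

# Hypothetical scraped names; 'Ipswich' stands in for a team with no label yet
scraped_teams = ['Man City', 'Arsenal', 'Ipswich']

# Collect every scraped name that has no entry in the mapping
missing = [team for team in scraped_teams if team not in team_labels_sample]

if missing:
    print(f"No label defined for: {missing}")
```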

#### Confirm it's the right weekly round in order to join with the right dataset.
Note: this data will be treated as the 'previous league standings'. The standings we are scraping must be captured before the weekly fixtures (`completed_matches` in our case) are played, because we need each team's stats from before those matches.

In [24]:
# The most common 'Pld' value gives the round that most teams have completed
Round = 'Round ' + premier_league_table['Pld'].mode()[0]
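The round label is derived from the most common `Pld` value, so teams with postponed matches can sit on a different count without changing the label. The following sketch, using made-up `Pld` values, shows one way to list those exceptions when confirming the round. The names `round_label` and `off_round` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical 'Pld' values, kept as strings as in the scraped table
pld_values = ['38', '38', '37', '38']

counts = Counter(pld_values)
most_common_pld, n_teams = counts.most_common(1)[0]
round_label = f"Round {most_common_pld}"

# Matches-played counts that differ from the majority, with how many teams have each
off_round = {pld: n for pld, n in counts.items() if pld != most_common_pld}

print(round_label, off_round)
```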

#### Establish your DB
Make sure you have already created a database for this step. In my next update I will add automation that creates the database for you.

In [None]:
# Database connection
user = '<user>'
password = '<password>'
host = 'localhost'
port = 3306
database = '<db>'

In [None]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
premier_league_table.to_sql('previous_week_league_standings', con=engine, if_exists='replace', index=False)
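One caveat with building the connection string through an f-string: characters such as `@`, `:` or `/` in the user name or password would be read as URL delimiters. A minimal sketch of escaping them with the standard library's `quote_plus` follows; the credentials and database name are hypothetical placeholders, not values from this project.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only
user = 'betting_user'
password = 'p@ss:w/rd'
host = 'localhost'
port = 3306
database = 'betting_db'

# Percent-encode the parts that may contain reserved URL characters
db_url = (
    f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
    f"@{host}:{port}/{database}"
)
print(db_url)
```

The resulting string can then be passed to `create_engine` in place of the unescaped one.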