# Betting Prediction Model - Scraping Premier League Table (Testing Data)

## Overview

This notebook scrapes the Premier League table from Sofascore to gather testing data for our betting prediction model. The extracted data is used to test the model's predictions. The primary objectives of this notebook are:

1. **Data Extraction**: Retrieve relevant information from the Premier League table, including team standings, recent match results, and performance statistics.
2. **Data Processing**: Cleanse and format the scraped data to ensure consistency and accuracy for model testing.
3. **Database Integration**: Store the processed data in the `current_week_league_standings` table, which is later merged with the `upcoming_matches` table for model testing.

## Steps Involved

1. **Environment Setup and Imports**: Import necessary libraries and configure settings.
2. **Web Scraping**: Utilize web scraping techniques to extract data from the Premier League table.
3. **Data Cleansing and Transformation**: Process the extracted data to ensure uniformity and reliability.
4. **Database Interaction**: Write the processed data to the `current_week_league_standings` table in the database.


## Requirements

- Python 3.x
- Libraries: `selenium`, `pandas`, `sqlalchemy`, `pymysql` (the driver named in the connection string)
- Access to the database containing the `testing_data` table
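
Before running the cells below, it can help to confirm that the listed libraries are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and `pymysql` is included on the assumption that the `mysql+pymysql` connection string used at the end of this notebook is kept.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if find_spec(name) is None]


# Module names taken from the imports and connection string in this notebook
required = ["selenium", "pandas", "sqlalchemy", "pymysql"]
missing = missing_packages(required)
if missing:
    print("Install before running:", ", ".join(missing))
else:
    print("All required packages are available.")
```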

## Notes

- Prior to execution, ensure that database credentials are correctly configured.
- Run this notebook periodically so that `upcoming_matches` is joined with current Premier League standings. To ensure this, run it before you run `Betting Prediction Model - Scraping Forebet Upcoming Matches`.

Let's commence by setting up our environment and importing essential libraries.


In [1]:
from selenium import webdriver
import pandas as pd
from sqlalchemy import create_engine



In [2]:
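from selenium.webdriver.chrome.service import Service

# Path to the local ChromeDriver binary; adjust for your machine
PATH = 'C:/Users/kevin/Desktop/tools/chromedriver-win64/chromedriver'
# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(PATH))
driver.get('https://www.sofascore.com/tournament/football/england/premier-league/17#id:52186')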


In [3]:
# Find the league table container by its class name (site-generated, so it may change)
league_table_results = driver.find_elements("class name", "eHXJll")
print(league_table_results)

[<selenium.webdriver.remote.webelement.WebElement (session="a4aa153049cfd7bdf356eeb1eea911cf", element="f.B9FD9D4813958A87474B9417C938633A.d.DC83393AFC48E8C2DCCFFCAB7AC916C3.e.91")>]


In [4]:
league_table_results_container = []

for results in league_table_results:
    # Extract the text from the league table container
    results_text = results.text
    league_table_results_container.append(results_text)
    print(results_text)

ALLHOMEAWAY
#
Team
P
W
D
L
Goals
Last 5
PTS
1
Man City
37
27
7
3
93:33
W
W
W
W
W
88
2
Arsenal
37
27
5
5
89:28
W
W
W
W
W
86
3
Liverpool
37
23
10
4
84:41
W
L
D
W
D
79
4
Aston Villa
37
20
8
9
76:56
W
W
D
L
D
68
5
Tottenham
37
19
6
12
71:61
L
L
L
W
L
63
6
Chelsea
37
17
9
11
75:62
D
W
W
W
W
60
7
Newcastle
37
17
6
14
81:60
L
W
W
D
L
57
8
Man Utd
37
17
6
14
55:58
W
D
L
L
W
57
9
West Ham
37
14
10
13
59:71
L
L
D
L
W
52
10
Brighton
37
12
12
13
55:60
L
L
W
D
L
48
11
Bournemouth
37
13
9
15
53:65
L
W
W
L
L
48
12
Crystal Palace
37
12
10
15
52:58
W
W
D
W
W
46
13
Wolves
37
13
7
17
50:63
L
L
W
L
L
46
14
Fulham
37
12
8
17
51:59
W
L
D
D
L
44
15
Everton
37
13
9
15
39:49
W
W
W
D
W
40
16
Brentford
37
10
9
18
54:61
W
W
L
D
W
39
17
Forest
37
8
9
20
47:66
D
L
L
W
L
29
18
Luton
37
6
8
23
50:81
L
L
L
D
L
26
19
Burnley
37
5
9
23
40:76
D
W
D
L
L
24
20
Sheffield Utd
37
3
7
27
35:101
L
L
L
L
L
16


In [5]:
driver.quit()

In [6]:
league_table_results_container

['ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n1\nMan City\n37\n27\n7\n3\n93:33\nW\nW\nW\nW\nW\n88\n2\nArsenal\n37\n27\n5\n5\n89:28\nW\nW\nW\nW\nW\n86\n3\nLiverpool\n37\n23\n10\n4\n84:41\nW\nL\nD\nW\nD\n79\n4\nAston Villa\n37\n20\n8\n9\n76:56\nW\nW\nD\nL\nD\n68\n5\nTottenham\n37\n19\n6\n12\n71:61\nL\nL\nL\nW\nL\n63\n6\nChelsea\n37\n17\n9\n11\n75:62\nD\nW\nW\nW\nW\n60\n7\nNewcastle\n37\n17\n6\n14\n81:60\nL\nW\nW\nD\nL\n57\n8\nMan Utd\n37\n17\n6\n14\n55:58\nW\nD\nL\nL\nW\n57\n9\nWest Ham\n37\n14\n10\n13\n59:71\nL\nL\nD\nL\nW\n52\n10\nBrighton\n37\n12\n12\n13\n55:60\nL\nL\nW\nD\nL\n48\n11\nBournemouth\n37\n13\n9\n15\n53:65\nL\nW\nW\nL\nL\n48\n12\nCrystal Palace\n37\n12\n10\n15\n52:58\nW\nW\nD\nW\nW\n46\n13\nWolves\n37\n13\n7\n17\n50:63\nL\nL\nW\nL\nL\n46\n14\nFulham\n37\n12\n8\n17\n51:59\nW\nL\nD\nD\nL\n44\n15\nEverton\n37\n13\n9\n15\n39:49\nW\nW\nW\nD\nW\n40\n16\nBrentford\n37\n10\n9\n18\n54:61\nW\nW\nL\nD\nW\n39\n17\nForest\n37\n8\n9\n20\n47:66\nD\nL\nL\nW\nL\n29\n18\nLuton\n37

In [7]:
data_copy = league_table_results_container

In [8]:
# Characters 11-42 hold the column labels that follow the 'ALLHOMEAWAY' tab text
df_columns = data_copy[0][11:43]

In [9]:
# Everything from character 43 onward is the table body
data = data_copy[0][43:]
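
The fixed character offsets (11 and 43) only hold while the header text stays exactly as scraped here. As an alternative, the header and body can be separated by counting lines instead of characters. This is a sketch, not the notebook's method: the function name `split_header_and_body` and the `header_size` parameter are hypothetical, and it assumes the first line is the `ALLHOMEAWAY` tab strip followed by nine column labels.

```python
def split_header_and_body(raw_text, header_size=9):
    """Split scraped table text into column labels and data tokens.

    Assumes line 0 is the 'ALLHOMEAWAY' tab strip and the next
    `header_size` lines are the column labels ('#', 'Team', ..., 'PTS').
    """
    lines = raw_text.split("\n")
    header = lines[1:1 + header_size]
    body = lines[1 + header_size:]
    return header, body


# Shortened sample in the same shape as the scraped text (one team only)
sample = (
    "ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n"
    "1\nMan City\n37\n27\n7\n3\n93:33\nW\nW\nW\nW\nW\n88"
)
header, body = split_header_and_body(sample)
print(header)
print(body)
```

Because the body comes back as a list of tokens, no separate `split('\n')` step or leading empty string needs to be handled afterwards.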

In [10]:
data

'\n1\nMan City\n37\n27\n7\n3\n93:33\nW\nW\nW\nW\nW\n88\n2\nArsenal\n37\n27\n5\n5\n89:28\nW\nW\nW\nW\nW\n86\n3\nLiverpool\n37\n23\n10\n4\n84:41\nW\nL\nD\nW\nD\n79\n4\nAston Villa\n37\n20\n8\n9\n76:56\nW\nW\nD\nL\nD\n68\n5\nTottenham\n37\n19\n6\n12\n71:61\nL\nL\nL\nW\nL\n63\n6\nChelsea\n37\n17\n9\n11\n75:62\nD\nW\nW\nW\nW\n60\n7\nNewcastle\n37\n17\n6\n14\n81:60\nL\nW\nW\nD\nL\n57\n8\nMan Utd\n37\n17\n6\n14\n55:58\nW\nD\nL\nL\nW\n57\n9\nWest Ham\n37\n14\n10\n13\n59:71\nL\nL\nD\nL\nW\n52\n10\nBrighton\n37\n12\n12\n13\n55:60\nL\nL\nW\nD\nL\n48\n11\nBournemouth\n37\n13\n9\n15\n53:65\nL\nW\nW\nL\nL\n48\n12\nCrystal Palace\n37\n12\n10\n15\n52:58\nW\nW\nD\nW\nW\n46\n13\nWolves\n37\n13\n7\n17\n50:63\nL\nL\nW\nL\nL\n46\n14\nFulham\n37\n12\n8\n17\n51:59\nW\nL\nD\nD\nL\n44\n15\nEverton\n37\n13\n9\n15\n39:49\nW\nW\nW\nD\nW\n40\n16\nBrentford\n37\n10\n9\n18\n54:61\nW\nW\nL\nD\nW\n39\n17\nForest\n37\n8\n9\n20\n47:66\nD\nL\nL\nW\nL\n29\n18\nLuton\n37\n6\n8\n23\n50:81\nL\nL\nL\nD\nL\n26\n19\nBurnley\n37

In [11]:
data_copy_split = [data.split('\n')]


In [12]:
data_copy_split = data_copy_split[0][1:]
print(data_copy_split)

['1', 'Man City', '37', '27', '7', '3', '93:33', 'W', 'W', 'W', 'W', 'W', '88', '2', 'Arsenal', '37', '27', '5', '5', '89:28', 'W', 'W', 'W', 'W', 'W', '86', '3', 'Liverpool', '37', '23', '10', '4', '84:41', 'W', 'L', 'D', 'W', 'D', '79', '4', 'Aston Villa', '37', '20', '8', '9', '76:56', 'W', 'W', 'D', 'L', 'D', '68', '5', 'Tottenham', '37', '19', '6', '12', '71:61', 'L', 'L', 'L', 'W', 'L', '63', '6', 'Chelsea', '37', '17', '9', '11', '75:62', 'D', 'W', 'W', 'W', 'W', '60', '7', 'Newcastle', '37', '17', '6', '14', '81:60', 'L', 'W', 'W', 'D', 'L', '57', '8', 'Man Utd', '37', '17', '6', '14', '55:58', 'W', 'D', 'L', 'L', 'W', '57', '9', 'West Ham', '37', '14', '10', '13', '59:71', 'L', 'L', 'D', 'L', 'W', '52', '10', 'Brighton', '37', '12', '12', '13', '55:60', 'L', 'L', 'W', 'D', 'L', '48', '11', 'Bournemouth', '37', '13', '9', '15', '53:65', 'L', 'W', 'W', 'L', 'L', '48', '12', 'Crystal Palace', '37', '12', '10', '15', '52:58', 'W', 'W', 'D', 'W', 'W', '46', '13', 'Wolves', '37', '1

In [13]:
data_array = [data_copy_split[i:i+13] for i in range(0, len(data_copy_split), 13)]
print(data_array)

[['1', 'Man City', '37', '27', '7', '3', '93:33', 'W', 'W', 'W', 'W', 'W', '88'], ['2', 'Arsenal', '37', '27', '5', '5', '89:28', 'W', 'W', 'W', 'W', 'W', '86'], ['3', 'Liverpool', '37', '23', '10', '4', '84:41', 'W', 'L', 'D', 'W', 'D', '79'], ['4', 'Aston Villa', '37', '20', '8', '9', '76:56', 'W', 'W', 'D', 'L', 'D', '68'], ['5', 'Tottenham', '37', '19', '6', '12', '71:61', 'L', 'L', 'L', 'W', 'L', '63'], ['6', 'Chelsea', '37', '17', '9', '11', '75:62', 'D', 'W', 'W', 'W', 'W', '60'], ['7', 'Newcastle', '37', '17', '6', '14', '81:60', 'L', 'W', 'W', 'D', 'L', '57'], ['8', 'Man Utd', '37', '17', '6', '14', '55:58', 'W', 'D', 'L', 'L', 'W', '57'], ['9', 'West Ham', '37', '14', '10', '13', '59:71', 'L', 'L', 'D', 'L', 'W', '52'], ['10', 'Brighton', '37', '12', '12', '13', '55:60', 'L', 'L', 'W', 'D', 'L', '48'], ['11', 'Bournemouth', '37', '13', '9', '15', '53:65', 'L', 'W', 'W', 'L', 'L', '48'], ['12', 'Crystal Palace', '37', '12', '10', '15', '52:58', 'W', 'W', 'D', 'W', 'W', '46'], 
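
Chunking by a fixed width of 13 silently produces a short final row if the page layout changes or a token goes missing. A minimal sketch of the same chunking with a length check is shown below. The function name `chunk_rows` is hypothetical.

```python
def chunk_rows(tokens, row_size=13):
    """Group a flat token list into rows of `row_size`, rejecting leftovers."""
    if len(tokens) % row_size != 0:
        raise ValueError(
            f"Expected a multiple of {row_size} tokens, got {len(tokens)}"
        )
    return [tokens[i:i + row_size] for i in range(0, len(tokens), row_size)]


# Two teams in the scraped token order
tokens = [
    "1", "Man City", "37", "27", "7", "3", "93:33", "W", "W", "W", "W", "W", "88",
    "2", "Arsenal", "37", "27", "5", "5", "89:28", "W", "W", "W", "W", "W", "86",
]
rows = chunk_rows(tokens)
print(len(rows), rows[1][1])
```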

In [14]:
testing = ['1', 'Arsenal', '36', '26', '5', '5', '88:28', 'L', 'W', 'W', 'W', 'W', '83']

# Replace the 'GF:GA' element at index 6 with its two parts (goals for, goals against)
testing[6:7] = testing[6].split(':')

print(testing)

['1', 'Arsenal', '36', '26', '5', '5', '88', '28', 'L', 'W', 'W', 'W', 'W', '83']


In [15]:
split_data = []

for row in data_array:
    # Replace 'GF:GA' at index 6 with two separate fields (modifies data_array in place)
    row[6:7] = row[6].split(':')
    split_data.append(row)
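
The loop above rewrites each row of `data_array` in place, so re-running the cell would fail on the already split rows. A sketch of a variant that returns a new row and leaves the input untouched follows. The function name `split_goals` and its `goals_index` parameter are hypothetical.

```python
def split_goals(row, goals_index=6):
    """Return a new row with the 'GF:GA' field replaced by two fields."""
    goals_for, goals_against = row[goals_index].split(":")
    return row[:goals_index] + [goals_for, goals_against] + row[goals_index + 1:]


original = ['1', 'Arsenal', '36', '26', '5', '5', '88:28', 'L', 'W', 'W', 'W', 'W', '83']
converted = split_goals(original)
print(converted)
print(len(original), len(converted))
```

A field that does not contain exactly one colon raises `ValueError` on unpacking, which surfaces malformed rows early.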

In [16]:
print(split_data)  

[['1', 'Man City', '37', '27', '7', '3', '93', '33', 'W', 'W', 'W', 'W', 'W', '88'], ['2', 'Arsenal', '37', '27', '5', '5', '89', '28', 'W', 'W', 'W', 'W', 'W', '86'], ['3', 'Liverpool', '37', '23', '10', '4', '84', '41', 'W', 'L', 'D', 'W', 'D', '79'], ['4', 'Aston Villa', '37', '20', '8', '9', '76', '56', 'W', 'W', 'D', 'L', 'D', '68'], ['5', 'Tottenham', '37', '19', '6', '12', '71', '61', 'L', 'L', 'L', 'W', 'L', '63'], ['6', 'Chelsea', '37', '17', '9', '11', '75', '62', 'D', 'W', 'W', 'W', 'W', '60'], ['7', 'Newcastle', '37', '17', '6', '14', '81', '60', 'L', 'W', 'W', 'D', 'L', '57'], ['8', 'Man Utd', '37', '17', '6', '14', '55', '58', 'W', 'D', 'L', 'L', 'W', '57'], ['9', 'West Ham', '37', '14', '10', '13', '59', '71', 'L', 'L', 'D', 'L', 'W', '52'], ['10', 'Brighton', '37', '12', '12', '13', '55', '60', 'L', 'L', 'W', 'D', 'L', '48'], ['11', 'Bournemouth', '37', '13', '9', '15', '53', '65', 'L', 'W', 'W', 'L', 'L', '48'], ['12', 'Crystal Palace', '37', '12', '10', '15', '52', '5

In [17]:
# These labels match the team encoding used in the other notebooks
team_labels = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brighton': 4,
        'Burnley': 5,
        'Chelsea': 6,
        'Crystal Palace': 7,
        'Everton': 8,
        'Fulham': 9,
        'Leeds Utd': 10,
        'Leicester City': 11,
        'Liverpool': 12,
        'Man City': 13,
        'Man Utd': 14,
        'Newcastle': 15,
        'Norwich': 16,
        'Sheffield': 17,
        'Southampton': 18,
        'Tottenham': 19,
        'West Ham': 20,
        'Luton': 21,
        'Wolves': 22,
        'Brentford': 23,
        'Sheffield Utd': 24,
        'Forest': 25
    }

#### Code Playground: creating ppg, last_5_matches, and removing redundant indexes

In [18]:
epl_table = [['1','Arsenal','37','27','5','5','89','28','W','W','W', 'W','W','86'],
 ['2','Man City', '36','26','7','3','91','33','W','W','W','W','W','85'],
 ['3','Liverpool','36','23','9','4','81','38','L','W','L','D','W','78']]

for row in epl_table:
    ppg = 0  # points total over the last five matches
    for result in row[8:13]:  # indices 8-12 hold the five most recent results
        if result == 'W':
            ppg += 3
        elif result == 'D':
            ppg += 1
    last_5_matches = ''.join(row[8:13])  # Concatenate the results at indices 8 to 12
    row[8:13] = [last_5_matches, ppg / 5]  # Replace the five results with the form string and PPG

print(epl_table)



[['1', 'Arsenal', '37', '27', '5', '5', '89', '28', 'WWWWW', 3.0, '86'], ['2', 'Man City', '36', '26', '7', '3', '91', '33', 'WWWWW', 3.0, '85'], ['3', 'Liverpool', '36', '23', '9', '4', '81', '38', 'LWLDW', 1.4, '78']]


In [19]:
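def calculate_ppg_and_create_last_5_matches(two_d_array):
    """Replace the five result fields in each row with a form string and points per game.

    Rows are modified in place and the same list is returned.
    """
    for row in two_d_array:
        ppg = 0  # points total over the last five matches
        for result in row[8:13]:  # indices 8-12 hold the five most recent results
            if result == 'W':
                ppg += 3
            elif result == 'D':
                ppg += 1
        last_5_matches = ''.join(row[8:13])  # Concatenate the results at indices 8 to 12
        row[8:13] = [last_5_matches, ppg / 5]  # Replace the five results with the form string and PPG

    return two_d_array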
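
The form summary can also be written without modifying the input rows, with the points calculation separated out so it can be checked on its own. This is a sketch under the assumption that results are always one of `W`, `D`, or `L`. The names `POINTS`, `points_per_game`, and `summarise_form` are hypothetical.

```python
# Points awarded per result under standard league scoring
POINTS = {"W": 3, "D": 1, "L": 0}


def points_per_game(results):
    """Average points per match for a sequence of 'W'/'D'/'L' results."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(POINTS[result] for result in results) / len(results)


def summarise_form(row, first=8, last=13):
    """Return a new row with the results in row[first:last] replaced by
    a form string and the points per game over those matches."""
    form = row[first:last]
    return row[:first] + ["".join(form), points_per_game(form)] + row[last:]


liverpool = ['3', 'Liverpool', '36', '23', '9', '4', '81', '38', 'L', 'W', 'L', 'D', 'W', '78']
print(summarise_form(liverpool))
```

An unexpected result code raises `KeyError` rather than being counted as zero points, which is a deliberate difference from the `if`/`elif` version.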


In [20]:
premier_league_columns  = ['Pos', 'Team', 'Pld', 'Wins', 'Draws', 'losses', 'GF', 'GA', 'Last 5 Matches', 'Ppg_Last_5_Matches', 'Points']

In [21]:
df_data = calculate_ppg_and_create_last_5_matches(split_data)

In [22]:

# Create DataFrame
premier_league_table = pd.DataFrame(df_data, columns=premier_league_columns)
print(premier_league_table)

   Pos            Team Pld Wins Draws losses  GF   GA Last 5 Matches  \
0    1        Man City  37   27     7      3  93   33          WWWWW   
1    2         Arsenal  37   27     5      5  89   28          WWWWW   
2    3       Liverpool  37   23    10      4  84   41          WLDWD   
3    4     Aston Villa  37   20     8      9  76   56          WWDLD   
4    5       Tottenham  37   19     6     12  71   61          LLLWL   
5    6         Chelsea  37   17     9     11  75   62          DWWWW   
6    7       Newcastle  37   17     6     14  81   60          LWWDL   
7    8         Man Utd  37   17     6     14  55   58          WDLLW   
8    9        West Ham  37   14    10     13  59   71          LLDLW   
9   10        Brighton  37   12    12     13  55   60          LLWDL   
10  11     Bournemouth  37   13     9     15  53   65          LWWLL   
11  12  Crystal Palace  37   12    10     15  52   58          WWDWW   
12  13          Wolves  37   13     7     17  50   63          L
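
Every column except `Ppg_Last_5_Matches` still holds strings at this point, so they would reach the database as text. A sketch of casting the numeric fields before building the DataFrame is shown below. The names `COLUMNS`, `INT_COLUMNS`, and `row_to_record` are hypothetical, and the choice of which columns are integers is an assumption based on the printed table. Note that the later `Round` cell concatenates `Pld` as a string, so it would need `str(...)` if this casting were adopted.

```python
COLUMNS = ['Pos', 'Team', 'Pld', 'Wins', 'Draws', 'losses', 'GF', 'GA',
           'Last 5 Matches', 'Ppg_Last_5_Matches', 'Points']
# Columns assumed to hold whole numbers
INT_COLUMNS = {'Pos', 'Pld', 'Wins', 'Draws', 'losses', 'GF', 'GA', 'Points'}


def row_to_record(row):
    """Convert one processed row into a dict with integer-typed numeric fields."""
    if len(row) != len(COLUMNS):
        raise ValueError(f"Expected {len(COLUMNS)} fields, got {len(row)}")
    return {
        name: int(value) if name in INT_COLUMNS else value
        for name, value in zip(COLUMNS, row)
    }


arsenal = ['1', 'Arsenal', '37', '27', '5', '5', '89', '28', 'WWWWW', 3.0, '86']
record = row_to_record(arsenal)
print(record)
```

A list of such records can be passed to `pd.DataFrame` in place of the list of lists.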

In [23]:
def team_to_label(team_name):
    return team_labels.get(team_name)

In [24]:
premier_league_table['Team'] = premier_league_table['Team'].map(team_to_label)
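
Because `team_labels.get` returns `None` for a name that is not in the dictionary, a renamed or newly promoted team would end up as a missing value in the `Team` column without any error. A sketch of a check that could run before the mapping follows. The function name `unmapped_teams` is hypothetical, and the small `labels` dictionary is a shortened stand-in for `team_labels`.

```python
def unmapped_teams(team_names, labels):
    """Return the sorted set of team names that have no entry in `labels`."""
    return sorted({name for name in team_names if name not in labels})


# Shortened stand-in for the team_labels dictionary defined earlier
labels = {'Arsenal': 1, 'Man City': 13, 'Forest': 25}
scraped = ['Man City', 'Arsenal', 'Nottingham Forest']

unknown = unmapped_teams(scraped, labels)
if unknown:
    print("No label for:", unknown)
```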

In [25]:
premier_league_table = premier_league_table.drop(columns=['Last 5 Matches'])

#### Confirm it's the right weekly round in order to join with the right dataset. Note: the current league standings go with upcoming_matches

In [26]:
# Most common 'Pld' value across teams; note the label contains two spaces ('Round  37')
Round = 'Round ' + ' ' + premier_league_table['Pld'].mode()[0]

In [27]:
Round

'Round  37'
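
Taking the most common `Pld` value hides the case where some teams have played fewer matches than others, which matters when choosing the dataset to join with. A sketch that reports both the label and whether all teams agree is shown below, using only the standard library. The function name `round_label` is hypothetical, the label here uses a single space (unlike the output above, so adjust it if the join key depends on the exact string), and tie handling may differ from pandas' `mode()`.

```python
from collections import Counter


def round_label(played_counts):
    """Return ('Round N', all_equal) for a list of matches-played values."""
    counts = Counter(played_counts)
    most_common_value, _ = counts.most_common(1)[0]
    return f"Round {most_common_value}", len(counts) == 1


label, all_equal = round_label(['37'] * 20)
print(label, all_equal)

label_mixed, all_equal_mixed = round_label(['37', '37', '36'])
print(label_mixed, all_equal_mixed)
```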

In [None]:
# Database connection
user = '<user>'
password = '<password>'
host = 'localhost'
port = 3306
database = '<db>'

In [28]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
premier_league_table.to_sql('current_week_league_standings', con=engine, if_exists='replace', index=False)

20
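
If the database password contains characters such as `@`, `:` or `/`, placing it directly into the connection string can make the URL ambiguous. A sketch of building the same URL with the credentials percent-encoded is shown below, using only the standard library. The function name `build_mysql_url` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Build a mysql+pymysql connection URL with percent-encoded credentials."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder credentials for illustration only
url = build_mysql_url('user', 'p@ss:word', 'localhost', 3306, 'db')
print(url)
```

The resulting string can be passed to `create_engine` in place of the f-string used above.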