# Betting Prediction Model - Scraping Premier League Table (Testing Data)

## Overview

This notebook scrapes the Premier League table from Sofascore to gather testing data for our betting prediction model. The extracted data is used to test how accurately the model predicts. The primary objectives of this notebook are:

1. **Data Extraction**: Retrieve relevant information from the Premier League table, including team standings, recent match results, and performance statistics.
2. **Data Processing**: Clean and format the scraped data so that it is consistent and accurate for model testing.
3. **Database Integration**: Store the processed data in the `previous_league_standings` table, which is later merged with the `completed_matches` table and appended to the `training_data` table in our database for future model training sessions.

## Steps Involved

1. **Environment Setup and Imports**: Import necessary libraries and configure settings.
2. **Web Scraping**: Utilize web scraping techniques to extract data from the Premier League table.
3. **Data Cleansing and Transformation**: Process the extracted data to ensure uniformity and reliability.
4. **Database Interaction**: Write the processed data to the `current_week_league_standings` table in the database.


## Requirements

- Python 3.x
- Libraries: `selenium`, `pandas`, `sqlalchemy`, `pymysql` (the driver named in the connection string)
- Access to the database containing the `testing_data` table

## Notes

- Prior to execution, ensure that database credentials are correctly configured.
- Run this notebook periodically to keep `upcoming_matches` up to date with the current Premier League data. To ensure this, run it before `Betting Prediction Model - Scraping Forebet Upcoming Matches`.


## Updates July 2024
### Project Migration and Update Process
- Migration: Migrated project from Jupyter Notebook 6 to Notebook 7.
- Python Version: Updated from Python 3.9 to Python 3.12.
- Selenium Version and Usage: With Selenium 4.2, `webdriver.Chrome()` could be used without specifying a service, options, or a driver PATH, which removes unnecessary code.

  
Let's commence by setting up our environment and importing essential libraries.


In [3]:
# Make sure you're using the latest version of Selenium to avoid errors.
# Make sure your ChromeDriver version is compatible with your Chrome version.
# If necessary, update Chrome and ChromeDriver from https://googlechromelabs.github.io/chrome-for-testing/
!pip install pandas sqlalchemy pymysql
!pip install --upgrade selenium



In [1]:
from selenium import webdriver
import pandas as pd
from sqlalchemy import create_engine

In [6]:
# Initialize WebDriver
driver = webdriver.Chrome()
# Maximize the browser window so the website renders its full layout.
# Different screen sizes may affect website responsiveness and the class names used,
# which could lead to varied results if the window size is too small.
driver.maximize_window()

In [7]:
driver.get('https://www.sofascore.com/tournament/football/england/premier-league/17#id:61627')

In [8]:
# Find the league table container(s) by class name.
# The class name looks auto-generated and may change when the site is updated.
league_table_results = driver.find_elements("class name", "eHXJll")
print(league_table_results)

[<selenium.webdriver.remote.webelement.WebElement (session="2035fc9ba65b28da667c1a4716dda02b", element="f.5D4EFDA9B7C820FAD040406E291E7EFF.d.2D0C11AB7E91C564C2C4EECBF43A62D0.e.67")>]


In [9]:
league_table_results_container = []

for results in league_table_results:
    # Extract the text from the league table container
    results_text = results.text
    league_table_results_container.append(results_text)
    print(results_text)

ALLHOMEAWAY
#
Team
P
W
D
L
Goals
Last 5
PTS
1
Fulham
1
0
1
0
0:0
1
2
Man Utd
1
0
1
0
0:0
1
3
Bournemouth
0
0
0
0
0:0
0
4
Arsenal
0
0
0
0
0:0
0
5
Aston Villa
0
0
0
0
0:0
0
6
Brentford
0
0
0
0
0:0
0
7
Brighton
0
0
0
0
0:0
0
8
Chelsea
0
0
0
0
0:0
0
9
Crystal Palace
0
0
0
0
0:0
0
10
Everton
0
0
0
0
0:0
0
11
Ipswich
0
0
0
0
0:0
0
12
Leicester
0
0
0
0
0:0
0
13
Liverpool
0
0
0
0
0:0
0
14
Man City
0
0
0
0
0:0
0
15
Newcastle
0
0
0
0
0:0
0
16
Forest
0
0
0
0
0:0
0
17
Southampton
0
0
0
0
0:0
0
18
Tottenham
0
0
0
0
0:0
0
19
West Ham
0
0
0
0
0:0
0
20
Wolves
0
0
0
0
0:0
0


In [10]:
driver.quit()

In [11]:
league_table_results_container

['ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n1\nFulham\n1\n0\n1\n0\n0:0\n1\n2\nMan Utd\n1\n0\n1\n0\n0:0\n1\n3\nBournemouth\n0\n0\n0\n0\n0:0\n0\n4\nArsenal\n0\n0\n0\n0\n0:0\n0\n5\nAston Villa\n0\n0\n0\n0\n0:0\n0\n6\nBrentford\n0\n0\n0\n0\n0:0\n0\n7\nBrighton\n0\n0\n0\n0\n0:0\n0\n8\nChelsea\n0\n0\n0\n0\n0:0\n0\n9\nCrystal Palace\n0\n0\n0\n0\n0:0\n0\n10\nEverton\n0\n0\n0\n0\n0:0\n0\n11\nIpswich\n0\n0\n0\n0\n0:0\n0\n12\nLeicester\n0\n0\n0\n0\n0:0\n0\n13\nLiverpool\n0\n0\n0\n0\n0:0\n0\n14\nMan City\n0\n0\n0\n0\n0:0\n0\n15\nNewcastle\n0\n0\n0\n0\n0:0\n0\n16\nForest\n0\n0\n0\n0\n0:0\n0\n17\nSouthampton\n0\n0\n0\n0\n0:0\n0\n18\nTottenham\n0\n0\n0\n0\n0:0\n0\n19\nWest Ham\n0\n0\n0\n0\n0:0\n0\n20\nWolves\n0\n0\n0\n0\n0:0\n0']

In [12]:
data_copy = league_table_results_container

In [13]:
df_columns = data_copy[0][11:43]

In [14]:
data = data_copy[0][43:]

In [15]:
data

'\n1\nFulham\n1\n0\n1\n0\n0:0\n1\n2\nMan Utd\n1\n0\n1\n0\n0:0\n1\n3\nBournemouth\n0\n0\n0\n0\n0:0\n0\n4\nArsenal\n0\n0\n0\n0\n0:0\n0\n5\nAston Villa\n0\n0\n0\n0\n0:0\n0\n6\nBrentford\n0\n0\n0\n0\n0:0\n0\n7\nBrighton\n0\n0\n0\n0\n0:0\n0\n8\nChelsea\n0\n0\n0\n0\n0:0\n0\n9\nCrystal Palace\n0\n0\n0\n0\n0:0\n0\n10\nEverton\n0\n0\n0\n0\n0:0\n0\n11\nIpswich\n0\n0\n0\n0\n0:0\n0\n12\nLeicester\n0\n0\n0\n0\n0:0\n0\n13\nLiverpool\n0\n0\n0\n0\n0:0\n0\n14\nMan City\n0\n0\n0\n0\n0:0\n0\n15\nNewcastle\n0\n0\n0\n0\n0:0\n0\n16\nForest\n0\n0\n0\n0\n0:0\n0\n17\nSouthampton\n0\n0\n0\n0\n0:0\n0\n18\nTottenham\n0\n0\n0\n0\n0:0\n0\n19\nWest Ham\n0\n0\n0\n0\n0:0\n0\n20\nWolves\n0\n0\n0\n0\n0:0\n0'

In [16]:
data_copy_split = [data.split('\n')]

In [17]:
data_copy_split = data_copy_split[0][1:]
print(data_copy_split)

['1', 'Fulham', '1', '0', '1', '0', '0:0', '1', '2', 'Man Utd', '1', '0', '1', '0', '0:0', '1', '3', 'Bournemouth', '0', '0', '0', '0', '0:0', '0', '4', 'Arsenal', '0', '0', '0', '0', '0:0', '0', '5', 'Aston Villa', '0', '0', '0', '0', '0:0', '0', '6', 'Brentford', '0', '0', '0', '0', '0:0', '0', '7', 'Brighton', '0', '0', '0', '0', '0:0', '0', '8', 'Chelsea', '0', '0', '0', '0', '0:0', '0', '9', 'Crystal Palace', '0', '0', '0', '0', '0:0', '0', '10', 'Everton', '0', '0', '0', '0', '0:0', '0', '11', 'Ipswich', '0', '0', '0', '0', '0:0', '0', '12', 'Leicester', '0', '0', '0', '0', '0:0', '0', '13', 'Liverpool', '0', '0', '0', '0', '0:0', '0', '14', 'Man City', '0', '0', '0', '0', '0:0', '0', '15', 'Newcastle', '0', '0', '0', '0', '0:0', '0', '16', 'Forest', '0', '0', '0', '0', '0:0', '0', '17', 'Southampton', '0', '0', '0', '0', '0:0', '0', '18', 'Tottenham', '0', '0', '0', '0', '0:0', '0', '19', 'West Ham', '0', '0', '0', '0', '0:0', '0', '20', 'Wolves', '0', '0', '0', '0', '0:0', '0']
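The character offsets `[11:43]` and `[43:]` used above depend on the exact length of the tab labels and header text. A minimal sketch of a line-based alternative follows, assuming the header always runs from the `#` line to the `PTS` line as in the output shown. `split_table_text` is a hypothetical helper name, not part of the notebook.

```python
def split_table_text(raw_text):
    """Split scraped table text into (header_fields, data_tokens).

    Assumes the header runs from the '#' line to the 'PTS' line,
    as in the scraped output shown in this notebook.
    """
    lines = raw_text.split('\n')
    start = lines.index('#')    # first header field
    end = lines.index('PTS')    # last header field
    header = lines[start:end + 1]
    tokens = lines[end + 1:]    # everything after the header is row data
    return header, tokens


# Shortened sample in the same shape as the scraped text
sample = 'ALLHOMEAWAY\n#\nTeam\nP\nW\nD\nL\nGoals\nLast 5\nPTS\n1\nFulham\n1\n0\n1\n0\n0:0\n1'
header, tokens = split_table_text(sample)
print(header)
print(tokens)
```

Because the split is driven by the header labels rather than character counts, it should keep working if the tab labels before `#` change length.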

In [18]:
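# Group tokens into rows of 13 fields: Pos, Team, P, W, D, L, Goals, five "Last 5" results, PTS.
# This assumes every team has five recent results listed; with fewer, rows misalign.
data_array = [data_copy_split[i:i+13] for i in range(0, len(data_copy_split), 13)]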
print(data_array)

[['1', 'Fulham', '1', '0', '1', '0', '0:0', '1', '2', 'Man Utd', '1', '0', '1'], ['0', '0:0', '1', '3', 'Bournemouth', '0', '0', '0', '0', '0:0', '0', '4', 'Arsenal'], ['0', '0', '0', '0', '0:0', '0', '5', 'Aston Villa', '0', '0', '0', '0', '0:0'], ['0', '6', 'Brentford', '0', '0', '0', '0', '0:0', '0', '7', 'Brighton', '0', '0'], ['0', '0', '0:0', '0', '8', 'Chelsea', '0', '0', '0', '0', '0:0', '0', '9'], ['Crystal Palace', '0', '0', '0', '0', '0:0', '0', '10', 'Everton', '0', '0', '0', '0'], ['0:0', '0', '11', 'Ipswich', '0', '0', '0', '0', '0:0', '0', '12', 'Leicester', '0'], ['0', '0', '0', '0:0', '0', '13', 'Liverpool', '0', '0', '0', '0', '0:0', '0'], ['14', 'Man City', '0', '0', '0', '0', '0:0', '0', '15', 'Newcastle', '0', '0', '0'], ['0', '0:0', '0', '16', 'Forest', '0', '0', '0', '0', '0:0', '0', '17', 'Southampton'], ['0', '0', '0', '0', '0:0', '0', '18', 'Tottenham', '0', '0', '0', '0', '0:0'], ['0', '19', 'West Ham', '0', '0', '0', '0', '0:0', '0', '20', 'Wolves', '0', '0'

In [19]:
testing = ['1', 'Arsenal', '36', '26', '5', '5', '88:28', 'L', 'W', 'W', 'W', 'W', '83']

# Split the 'GF:GA' element at index 6 and replace it in place with its two parts
testing[6:7] = testing[6].split(':')

print(testing)

['1', 'Arsenal', '36', '26', '5', '5', '88', '28', 'L', 'W', 'W', 'W', 'W', '83']


In [23]:
split_data = []

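# Use `row` as the loop variable so the earlier `data` string is not overwritten
for row in data_array:
    print(row[6])
    row[6:7] = row[6].split(':')  # mutates the row inside data_array in place
    split_data.append(row)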

0
0
5
0
0
0
0
Liverpool
0
0
18
0


IndexError: list index out of range
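The `IndexError` above appears to come from the fixed chunk size. The scraped text holds 20 teams with 8 tokens each (no "Last 5" results yet), so the 160 tokens split into twelve misaligned chunks of 13 plus a final chunk of 4, which has no index 6. A minimal sketch of a grouping that does not assume a fixed row length is given below. It assumes the `Goals` token is the only one containing `:` and that it always sits six positions after the start of a row. `group_rows` is a hypothetical helper.

```python
def group_rows(tokens):
    """Group a flat token list into one list per team.

    Each row is assumed to look like:
    Pos, Team, P, W, D, L, 'GF:GA', <0 to 5 recent results>, PTS
    """
    # The goals token ('GF:GA') is the anchor: a row starts 6 tokens before it.
    starts = [i - 6 for i, tok in enumerate(tokens) if ':' in tok]
    rows = []
    for n, start in enumerate(starts):
        # A row ends where the next one starts (or at the end of the list)
        end = starts[n + 1] if n + 1 < len(starts) else len(tokens)
        row = list(tokens[start:end])
        row[6:7] = row[6].split(':')  # 'GF:GA' -> 'GF', 'GA'
        rows.append(row)
    return rows


# Start-of-season shape (no recent results) and end-of-season shape (five results)
early = ['1', 'Fulham', '1', '0', '1', '0', '0:0', '1',
         '2', 'Man Utd', '1', '0', '1', '0', '0:0', '1']
late = ['1', 'Arsenal', '36', '26', '5', '5', '88:28', 'L', 'W', 'W', 'W', 'W', '83']
print(group_rows(early))
print(group_rows(late))
```

Rows produced this way vary in length when fewer than five results are listed, so they would still need padding or separate handling before being passed to code that expects 14 fields.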

In [21]:
print(split_data)  

[['1', 'Fulham', '1', '0', '1', '0', '0', '0', '1', '2', 'Man Utd', '1', '0', '1'], ['0', '0:0', '1', '3', 'Bournemouth', '0', '0', '0', '0', '0:0', '0', '4', 'Arsenal'], ['0', '0', '0', '0', '0:0', '0', '5', 'Aston Villa', '0', '0', '0', '0', '0:0'], ['0', '6', 'Brentford', '0', '0', '0', '0', '0:0', '0', '7', 'Brighton', '0', '0'], ['0', '0', '0:0', '0', '8', 'Chelsea', '0', '0', '0', '0', '0:0', '0', '9'], ['Crystal Palace', '0', '0', '0', '0', '0:0', '0', '10', 'Everton', '0', '0', '0', '0'], ['0:0', '0', '11', 'Ipswich', '0', '0', '0', '0', '0:0', '0', '12', 'Leicester', '0'], ['0', '0', '0', '0:0', '0', '13', 'Liverpool', '0', '0', '0', '0', '0:0', '0'], ['14', 'Man City', '0', '0', '0', '0', '0', '0', '0', '15', 'Newcastle', '0', '0', '0'], ['0', '0:0', '0', '16', 'Forest', '0', '0', '0', '0', '0:0', '0', '17', 'Southampton'], ['0', '0', '0', '0', '0:0', '0', '18', 'Tottenham', '0', '0', '0', '0', '0:0'], ['0', '19', 'West Ham', '0', '0', '0', '0', '0:0', '0', '20', 'Wolves', '0

In [19]:
# These team_labels match all the data from the other notebooks
team_labels = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brighton': 4,
        'Burnley': 5,
        'Chelsea': 6,
        'Crystal Palace': 7,
        'Everton': 8,
        'Fulham': 9,
        'Leeds Utd': 10,
        'Leicester City': 11,
        'Liverpool': 12,
        'Man City': 13,
        'Man Utd': 14,
        'Newcastle': 15,
        'Norwich': 16,
        'Sheffield': 17,
        'Southampton': 18,
        'Tottenham': 19,
        'West Ham': 20,
        'Luton': 21,
        'Wolves': 22,
        'Brentford': 23,
        'Sheffield Utd': 24,
        'Forest': 25
    }

#### Code Playground: creating ppg, last_5_matches, and removing redundant indexes

In [20]:
epl_table = [['1','Arsenal','37','27','5','5','89','28','W','W','W', 'W','W','86'],
 ['2','Man City', '36','26','7','3','91','33','W','W','W','W','W','85'],
 ['3','Liverpool','36','23','9','4','81','38','L','W','L','D','W','78']]

for row in epl_table:
    ppg = 0
    for result in row[8:13]:  # the five most recent results sit at indexes 8 to 12
        if result == 'W':
            ppg += 3
        elif result == 'D':
            ppg += 1
    last_5_matches = ''.join(row[8:13])  # Concatenate match results from index 8 to 12
    row[8:13] = [last_5_matches, ppg / 5]  # Replace the five results with the form string and PPG

print(epl_table)



[['1', 'Arsenal', '37', '27', '5', '5', '89', '28', 'WWWWW', 3.0, '86'], ['2', 'Man City', '36', '26', '7', '3', '91', '33', 'WWWWW', 3.0, '85'], ['3', 'Liverpool', '36', '23', '9', '4', '81', '38', 'LWLDW', 1.4, '78']]


In [21]:
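def calculate_ppg_and_create_last_5_matches(two_d_array):
    """Replace the five result columns of each row with a form string and points per game.

    Rows are modified in place and are expected to hold 14 fields, with results at indexes 8 to 12.
    """
    for row in two_d_array:
        ppg = 0
        for result in row[8:13]:  # the five most recent results
            if result == 'W':
                ppg += 3
            elif result == 'D':
                ppg += 1
        last_5_matches = ''.join(row[8:13])  # Concatenate match results from index 8 to 12
        row[8:13] = [last_5_matches, ppg / 5]  # Replace the five results with the form string and PPG

    return two_d_array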
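The function above reads fixed positions in each row and always divides by 5. As a sketch of the same calculation on just the list of results, the hypothetical helper `form_summary` below awards 3 points for `W` and 1 for `D`, then divides by the number of results actually listed. That divisor is an assumption and differs from the notebook's fixed 5 when fewer than five matches have been played.

```python
# Points awarded per result; anything unrecognised scores 0
POINTS = {'W': 3, 'D': 1, 'L': 0}


def form_summary(results):
    """Return (form_string, points_per_game) for a list of 'W'/'D'/'L' results."""
    if not results:
        # No matches played yet: empty form string and zero points per game
        return '', 0.0
    total = sum(POINTS.get(result, 0) for result in results)
    return ''.join(results), total / len(results)


print(form_summary(['L', 'W', 'L', 'D', 'W']))
print(form_summary(['W', 'D']))
print(form_summary([]))
```

Keeping the calculation separate from the row layout makes it easier to test on its own and to reuse at the start of a season, when the table lists fewer than five results.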


In [22]:
premier_league_columns  = ['Pos', 'Team', 'Pld', 'Wins', 'Draws', 'losses', 'GF', 'GA', 'Last 5 Matches', 'Ppg_Last_5_Matches', 'Points']

In [23]:
df_data = calculate_ppg_and_create_last_5_matches(split_data)

In [24]:
# Create DataFrame
premier_league_table = pd.DataFrame(df_data, columns=premier_league_columns)
print(premier_league_table)

   Pos            Team Pld Wins Draws losses  GF   GA Last 5 Matches  \
0    1        Man City  38   28     7      3  96   34          WWWWW   
1    2         Arsenal  38   28     5      5  91   29          WWWWW   
2    3       Liverpool  38   24    10      4  86   41          LDWDW   
3    4     Aston Villa  38   20     8     10  76   61          WDLDL   
4    5       Tottenham  38   20     6     12  74   61          LLWLW   
5    6         Chelsea  38   18     9     11  77   63          WWWWW   
6    7       Newcastle  38   18     6     14  85   62          WWDLW   
7    8         Man Utd  38   18     6     14  57   58          DLLWW   
8    9        West Ham  38   14    10     14  60   74          LDLWL   
9   10  Crystal Palace  38   13    10     15  57   58          WDWWW   
10  11        Brighton  38   12    12     14  55   62          LWDLL   
11  12     Bournemouth  38   13     9     16  54   67          WWLLL   
12  13          Fulham  38   13     8     17  55   61          L

In [25]:
def team_to_label(team_name):
    return team_labels.get(team_name)

In [26]:
premier_league_table['Team'] = premier_league_table['Team'].map(team_to_label)
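`team_labels.get` returns `None` for any name that is not a key, so unmapped teams end up as missing values after `map`. For instance, the scraped table lists `Leicester` and `Ipswich`, while the dictionary has `Leicester City` and no Ipswich entry. A minimal sketch of a check that could run before the mapping is shown below, using a hypothetical helper `find_unmapped_teams` and a shortened copy of the labels for illustration.

```python
def find_unmapped_teams(team_names, labels):
    """Return the team names that have no entry in the labels dictionary."""
    return [name for name in team_names if name not in labels]


# Shortened, illustrative subset of team_labels and of the scraped names
labels_subset = {'Arsenal': 1, 'Leicester City': 11, 'Liverpool': 12}
scraped = ['Arsenal', 'Ipswich', 'Leicester', 'Liverpool']

missing = find_unmapped_teams(scraped, labels_subset)
print(missing)
```

A non-empty result signals that the dictionary and the scraped names have drifted apart and that some rows would otherwise be stored without a team label.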

In [27]:
premier_league_table = premier_league_table.drop(columns=['Last 5 Matches'])

#### Confirm it's the right weekly round in order to join with the right dataset. Note: the current league standings go with `upcoming_matches`.

In [28]:
Round = 'Round ' + premier_league_table['Pld'].mode()[0]

In [29]:
Round

'Round 38'

In [None]:
# Database connection
user = '<user>'
password = '<password>'
host = 'localhost'
port = 3306
database = '<db>'

In [28]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
premier_league_table.to_sql('current_week_league_standings', con=engine, if_exists='replace', index=False)

20
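Since the notes ask for database credentials to be configured before execution, one option is to read them from environment variables and URL-encode the user and password so that special characters do not break the connection string. The sketch below uses only the standard library. The environment variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`), the helper names, and the sample values are all assumptions.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Build a mysql+pymysql connection URL with URL-encoded credentials."""
    return (
        f'mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}'
        f'@{host}:{port}/{database}'
    )


def url_from_env(env=os.environ):
    """Read connection settings from (hypothetical) environment variables."""
    return build_mysql_url(
        env.get('DB_USER', ''),
        env.get('DB_PASSWORD', ''),
        env.get('DB_HOST', 'localhost'),
        int(env.get('DB_PORT', '3306')),
        env.get('DB_NAME', ''),
    )


# Illustrative values only; '@' and '/' in the password are percent-encoded
print(build_mysql_url('scraper', 'p@ss/word', 'localhost', 3306, 'betting'))
```

The resulting string can be passed to `create_engine` in place of the f-string above, which keeps credentials out of the notebook itself.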