# Scraping Forebet Completed Matches - Predictions and Results of Matches

## Overview

This notebook is designed to scrape data from Forebet's prediction table for completed matches. The scraped data will be stored in a database and used to train our betting prediction model. The primary goals of this notebook are:

1. **Scrape Completed Matches Data**: Retrieve data for completed matches, including team names, win probabilities, score predictions, and actual match outcomes.
2. **Data Processing**: Clean and format the scraped data to ensure it is suitable for analysis and model training.
3. **Database Insertion**: Store the cleaned data in the `completed_matches` table within our database.
4. **Chrome WebDriver**: Selenium drives a Chrome browser, so make sure Chrome and a matching ChromeDriver are installed before running the notebook.


## Requirements

- Python 3.x
- Libraries: `selenium`, `scikit-learn` (imported as `sklearn`), `pandas`, `sqlalchemy`, and the standard-library `re` module
- `pymysql` installed
- MySQL Workbench
- Access to the database where the `completed_matches` and `training_data` tables are located

## Notes

- Ensure you have the correct database credentials set up before running this notebook.
- This notebook should be run weekly to keep the `completed_matches` table updated with the latest match results.
- `training_data` is a joined table built from `completed_matches` and `previous_league_table_standings` in your database.
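
For the credentials note above, one option is to keep secrets out of the notebook by reading them from environment variables and building the SQLAlchemy connection URL from them. This is only a minimal sketch: the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and the helper `build_db_url` are assumptions for illustration, not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Build a mysql+pymysql URL from environment variables (hypothetical names)."""
    user = env["DB_USER"]
    # quote_plus escapes characters such as '@' or '/' that would break the URL
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "3306")
    name = env["DB_NAME"]
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{name}"
```

The resulting string can be passed to `create_engine`, which the notebook imports from `sqlalchemy`.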


### Setup and Imports:

In [3]:
!pip install selenium
!pip install pandas
!pip install sqlalchemy
!pip install scikit-learn
!pip install pymysql



In [1]:
import re

import pandas as pd
from sqlalchemy import create_engine
from sklearn.preprocessing import LabelEncoder
# from modules import team_labels_forebet

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [2]:
# TEAM LABELS
team_labels_forebet = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brentford': 4,
        'Brighton': 5,
        'Burnley': 6,
        'Chelsea': 7,
        'Crystal Palace': 8,
        'Everton': 9,
        'Fulham': 10,
        'Ipswich Town':11,
        'Leeds United': 12,
        'Leicester City': 13,
        'Liverpool': 14,
        'Manchester City': 15,
        'Manchester United': 16,
        'Newcastle United': 17,
        'Norwich City': 18,
        # Label 19 is unused: 'Sheffield United' was listed twice and Python keeps the later entry (25)
        'Southampton': 20,
        'Tottenham': 21,
        'West Ham': 22,
        'Luton Town': 23,
        'Wolverhampton': 24,
        'Sheffield United': 25,
        'Nottingham Forest': 26,
    }

### Scraping Data:

In [3]:
# Set up the Chrome driver
chrome_options = Options()
chrome_options.add_argument("--start-maximized")  # Start with a maximized window
driver = webdriver.Chrome(options=chrome_options)

# Set the window size (optional, depends on your need)
driver.set_window_size(1920, 1080)

# Open the Forebet predictions page for the English Premier League
driver.get('https://www.forebet.com/en/football-tips-and-predictions-for-england/premier-league')

# Add a wait to ensure the element is present
try:
    # Wait for the consent pop-up and click the consent button
    consent_button = WebDriverWait(driver, 3).until(
        EC.element_to_be_clickable((By.XPATH, "//p[@class='fc-button-label' and text()='Consent']"))
    )
    consent_button.click()
    print("Clicked the consent button!")
except Exception as e:
    print(f"Error: {e}")



Clicked the consent button!


In [4]:
# Find the fixture containers on the page (elements with the CSS class "schema")
match_fixture_containers = driver.find_elements(By.CLASS_NAME, "schema")
print(match_fixture_containers)

[<selenium.webdriver.remote.webelement.WebElement (session="8224b754e4c1b098948411866df9cdcc", element="f.35C03F2918D37A093C25509943053552.d.9DBBA03A632C71FCBB550009FC3816AE.e.566")>, <selenium.webdriver.remote.webelement.WebElement (session="8224b754e4c1b098948411866df9cdcc", element="f.35C03F2918D37A093C25509943053552.d.9DBBA03A632C71FCBB550009FC3816AE.e.567")>, <selenium.webdriver.remote.webelement.WebElement (session="8224b754e4c1b098948411866df9cdcc", element="f.35C03F2918D37A093C25509943053552.d.9DBBA03A632C71FCBB550009FC3816AE.e.568")>]


In [5]:
# Initialize an empty list to store fixture details
fixtures_container = []


# Iterate through each match fixture container
for fixture in match_fixture_containers:
    # Extract the text from the fixture container
    fixture_text = fixture.text
    # Append the extracted text to the fixtures_container list
    fixtures_container.append(fixture_text)
    # Print the extracted fixture text
    print(fixture_text)


Round 7
EPL
Crystal Palace
Liverpool
5/10/2024 12:30
14167121 - 22.9017°1/2
PRE
VIEW
0.13
EPL
Arsenal
Southampton
5/10/2024 15:00
7318913 - 03.4916°1/10
PRE
VIEW
EPL
Brentford
Wolverhampton
5/10/2024 15:00
55281713 - 02.8316°19/20
PRE
VIEW
0.08
EPL
Leicester City
Bournemouth
5/10/2024 15:00
264034X1 - 12.3016°13/5
PRE
VIEW
0.17
EPL
Manchester City
Fulham
5/10/2024 15:00
384023X2 - 22.9416°5/1
PRE
VIEW
0.28
EPL
West Ham
Ipswich Town
5/10/2024 15:00
255026X1 - 12.3016°11/4
PRE
VIEW
0.32
EPL
Everton
Newcastle United
5/10/2024 17:30
28294321 - 23.3613°23/20
PRE
VIEW
EPL
Aston Villa
Manchester United
6/10/2024 14:00
41253412 - 12.8814°23/20
PRE
VIEW
EPL
Chelsea
Nottingham Forest
6/10/2024 14:00
333730X2 - 23.1917°33/10
PRE
VIEW
0.18
EPL
Brighton
Tottenham
6/10/2024 16:30
27343821 - 22.5016°6/5
PRE
VIEW
Round 6
EPL
Newcastle United
Manchester City
28/9/2024 12:30
22285021 - 33.2412°3/5 FT 1 - 1
(0 - 1)
EPL
Arsenal
Leicester City
28/9/2024 15:00
5538712 - 13.0214°1/5 FT 4 - 2
(2 - 0)
EPL
Bren

In [6]:
# Closes the webdriver and forebet page
driver.quit()

In [7]:
fixtures_container

['Round 7\nEPL\nCrystal Palace\nLiverpool\n5/10/2024 12:30\n14167121 - 22.9017°1/2\nPRE\nVIEW\n0.13\nEPL\nArsenal\nSouthampton\n5/10/2024 15:00\n7318913 - 03.4916°1/10\nPRE\nVIEW\nEPL\nBrentford\nWolverhampton\n5/10/2024 15:00\n55281713 - 02.8316°19/20\nPRE\nVIEW\n0.08\nEPL\nLeicester City\nBournemouth\n5/10/2024 15:00\n264034X1 - 12.3016°13/5\nPRE\nVIEW\n0.17\nEPL\nManchester City\nFulham\n5/10/2024 15:00\n384023X2 - 22.9416°5/1\nPRE\nVIEW\n0.28\nEPL\nWest Ham\nIpswich Town\n5/10/2024 15:00\n255026X1 - 12.3016°11/4\nPRE\nVIEW\n0.32\nEPL\nEverton\nNewcastle United\n5/10/2024 17:30\n28294321 - 23.3613°23/20\nPRE\nVIEW\nEPL\nAston Villa\nManchester United\n6/10/2024 14:00\n41253412 - 12.8814°23/20\nPRE\nVIEW\nEPL\nChelsea\nNottingham Forest\n6/10/2024 14:00\n333730X2 - 23.1917°33/10\nPRE\nVIEW\n0.18\nEPL\nBrighton\nTottenham\n6/10/2024 16:30\n27343821 - 22.5016°6/5\nPRE\nVIEW\nRound 6\nEPL\nNewcastle United\nManchester City\n28/9/2024 12:30\n22285021 - 33.2412°3/5 FT 1 - 1\n(0 - 1)\nEPL\

### Data Cleaning and Processing

In [8]:
# Luckily, every match block is prefixed with "\nEPL", which we can split on to separate the individual EPL matches.
matches_data_cleaned_step_1 = [match.split("\nEPL") for match in fixtures_container]
epl_matches = matches_data_cleaned_step_1[0]
epl_matches

['Round 7',
 '\nCrystal Palace\nLiverpool\n5/10/2024 12:30\n14167121 - 22.9017°1/2\nPRE\nVIEW\n0.13',
 '\nArsenal\nSouthampton\n5/10/2024 15:00\n7318913 - 03.4916°1/10\nPRE\nVIEW',
 '\nBrentford\nWolverhampton\n5/10/2024 15:00\n55281713 - 02.8316°19/20\nPRE\nVIEW\n0.08',
 '\nLeicester City\nBournemouth\n5/10/2024 15:00\n264034X1 - 12.3016°13/5\nPRE\nVIEW\n0.17',
 '\nManchester City\nFulham\n5/10/2024 15:00\n384023X2 - 22.9416°5/1\nPRE\nVIEW\n0.28',
 '\nWest Ham\nIpswich Town\n5/10/2024 15:00\n255026X1 - 12.3016°11/4\nPRE\nVIEW\n0.32',
 '\nEverton\nNewcastle United\n5/10/2024 17:30\n28294321 - 23.3613°23/20\nPRE\nVIEW',
 '\nAston Villa\nManchester United\n6/10/2024 14:00\n41253412 - 12.8814°23/20\nPRE\nVIEW',
 '\nChelsea\nNottingham Forest\n6/10/2024 14:00\n333730X2 - 23.1917°33/10\nPRE\nVIEW\n0.18',
 '\nBrighton\nTottenham\n6/10/2024 16:30\n27343821 - 22.5016°6/5\nPRE\nVIEW\nRound 6',
 '\nNewcastle United\nManchester City\n28/9/2024 12:30\n22285021 - 33.2412°3/5 FT 1 - 1\n(0 - 1)',
 '\

In [9]:
# Extract the current weekly round from the first index in epl_matches
weekly_round = epl_matches[0]
# Split the string to get the round number and subtract 1 to get the last completed round.
# Note: do not subtract 1 in the first week of the season.
weekly_round = int(weekly_round.split(' ')[1])  - 1
weekly_round

6

#### Brief Notes:
The page separates upcoming matches from completed matches using a 'Round' header. We use this round number to confirm that we are collecting the right data. I also append it to the DataFrame at the end of the notebook to help double-check when working with aggregated data in the database.
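
The round handling described above can be wrapped in a small helper that makes the first-week exception explicit. This is a sketch under assumptions: the function name `completed_round` and the `first_week` flag are hypothetical, and the header is assumed to look like `'Round 7'` as in the scraped output.

```python
import re


def completed_round(header, first_week=False):
    """Return the number of the most recently completed round.

    `header` is the first chunk of the scraped text, e.g. 'Round 7'.
    """
    found = re.search(r"Round\s+(\d+)", header)
    if found is None:
        raise ValueError(f"No round number in {header!r}")
    current = int(found.group(1))
    # In the first week there is no earlier round to step back to.
    return current if first_week else current - 1
```

Raising an error when no round number is found surfaces a page-layout change early, instead of letting it show up as an `IndexError` further down.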

In [10]:
completed_matches = []

for match in epl_matches:
    if 'FT' in match:  # Check if 'FT' (full-time) is present in the match string
        completed_matches.append(match)

print("Completed Matches:")
print(completed_matches)

Completed Matches:
['\nNewcastle United\nManchester City\n28/9/2024 12:30\n22285021 - 33.2412°3/5 FT 1 - 1\n(0 - 1)', '\nArsenal\nLeicester City\n28/9/2024 15:00\n5538712 - 13.0214°1/5 FT 4 - 2\n(2 - 0)', '\nBrentford\nWest Ham\n28/9/2024 15:00\n42302712 - 12.7214°5/4 FT 1 - 1\n(1 - 0)', '\nChelsea\nBrighton\n28/9/2024 15:00\n204832X1 - 12.3414°3/1 FT 4 - 2\n(4 - 2)\n0.31', '\nEverton\nCrystal Palace\n28/9/2024 15:00\n304525X1 - 12.0013°12/5 FT 2 - 1\n(0 - 1)\n0.22', '\nNottingham Forest\nFulham\n28/9/2024 15:00\n274725X1 - 12.3013°12/5 FT 0 - 1\n(0 - 0)\n0.25', '\nWolverhampton\nLiverpool\n28/9/2024 17:30\n1388021 - 33.4812°2/5 FT 1 - 2\n(0 - 1)\n0.24', '\nIpswich Town\nAston Villa\n29/9/2024 14:00\n11206920 - 22.3413°9/10 FT 2 - 2\n(1 - 2)\n0.35', '\nManchester United\nTottenham\n29/9/2024 16:30\n39352612 - 12.8713°5/4 FT 0 - 3\n(0 - 1)', '\nBournemouth\nSouthampton\n30/9/2024 20:00\n43332311 - 02.3513°3/5 FT 3 - 1\n(3 - 0)']


#### Now we can remove the `PREVIEW` label (scraped as `PRE` and `VIEW` on separate lines), if it exists

In [11]:
# Remove '\nPRE\nVIEW' from completed matches
completed_matches = [match.replace('\nPRE\nVIEW', '') for match in completed_matches]

In [12]:
print("\nCompleted Matches (after removing noise):\n")
print(completed_matches)


Completed Matches (after removing noise):

['\nNewcastle United\nManchester City\n28/9/2024 12:30\n22285021 - 33.2412°3/5 FT 1 - 1\n(0 - 1)', '\nArsenal\nLeicester City\n28/9/2024 15:00\n5538712 - 13.0214°1/5 FT 4 - 2\n(2 - 0)', '\nBrentford\nWest Ham\n28/9/2024 15:00\n42302712 - 12.7214°5/4 FT 1 - 1\n(1 - 0)', '\nChelsea\nBrighton\n28/9/2024 15:00\n204832X1 - 12.3414°3/1 FT 4 - 2\n(4 - 2)\n0.31', '\nEverton\nCrystal Palace\n28/9/2024 15:00\n304525X1 - 12.0013°12/5 FT 2 - 1\n(0 - 1)\n0.22', '\nNottingham Forest\nFulham\n28/9/2024 15:00\n274725X1 - 12.3013°12/5 FT 0 - 1\n(0 - 0)\n0.25', '\nWolverhampton\nLiverpool\n28/9/2024 17:30\n1388021 - 33.4812°2/5 FT 1 - 2\n(0 - 1)\n0.24', '\nIpswich Town\nAston Villa\n29/9/2024 14:00\n11206920 - 22.3413°9/10 FT 2 - 2\n(1 - 2)\n0.35', '\nManchester United\nTottenham\n29/9/2024 16:30\n39352612 - 12.8713°5/4 FT 0 - 3\n(0 - 1)', '\nBournemouth\nSouthampton\n30/9/2024 20:00\n43332311 - 02.3513°3/5 FT 3 - 1\n(3 - 0)']


Note.

On 10/3/2024 I ran into a case where the probability digits differ in length from what the regex expects.
Usually the home, draw, and away probabilities are each two digits long.
In the case below, Arsenal had a win probability of 55 and a draw probability of 38, leaving a single digit (7) for the away win probability. This shifts every value that follows, so the previous regex operations fail on this case.

In [None]:
'\nNewcastle United\nManchester City\n28/9/2024 12:30\n(22)(28)(50)(2)(1 - 3)3.2412°3/5 FT 1 - 1\n(0 - 1)', 
'\nArsenal\nLeicester City\n28/9/2024 15:00\n(55)(38)(7)(1)(2 - 1)3.0214°1/5 FT 4 - 2\n(2 - 0)', 




In [None]:
Solution
Hard-coded for now: put a zero before the single-digit probability so that every probability has two digits.
['\nArsenal\nSouthampton\n5/10/2024 15:00\n7318913 - 03.4916°1/10\nPRE\nVIEW']
['\nWolverhampton\nLiverpool\n28/9/2024 17:30\n1388021 - 33.4812°2/5 FT 1 - 2\n(0 - 1)\n0.24']
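
To see why the single-digit case is delicate, it can help to list every way a run of probability digits could be split into three values that sum to roughly 100. The sketch below is not the notebook's method; `candidate_splits` is a hypothetical helper, and it assumes the input contains only the probability digits (the trailing prediction and score digits already removed) and that each probability has one or two digits.

```python
from itertools import product


def candidate_splits(digits, tolerance=1):
    """List every (home, draw, away) split of `digits` that sums to about 100."""
    results = []
    # Each probability is assumed to be written with one or two digits.
    for len_home, len_draw, len_away in product((1, 2), repeat=3):
        if len_home + len_draw + len_away != len(digits):
            continue
        home = int(digits[:len_home])
        draw = int(digits[len_home:len_home + len_draw])
        away = int(digits[len_home + len_draw:])
        if abs(100 - (home + draw + away)) <= tolerance:
            results.append((home, draw, away))
    return results
```

For the Arsenal digits `'55387'` there is exactly one candidate, `(55, 38, 7)`. For the Wolverhampton digits `'13880'` there are two, `(13, 8, 80)` and `(13, 88, 0)`, so a sum check alone cannot always decide; any automatic correction is worth a manual spot check.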

In [13]:
import re

def insert_zero_and_validate(number_string):
    """
    Try inserting a '0' at each position of a 5-digit probability string and return
    the first 6-digit result whose three 2-digit groups sum to about 100.
    Returns None if no insertion produces a valid grouping.
    """
    # Convert the string into a list of integers
    digits = [int(digit) for digit in number_string]

    # Check different groupings by inserting '0' in each possible spot,
    # including the front (a single-digit home probability)
    for i in range(len(digits)):
        # Create a new list with a '0' inserted at position i
        new_digits = digits[:i] + [0] + digits[i:]

        # Now we use static indices for the grouping
        # Group digits as: (index 0 and 1), (index 2 and 3), (index 4 and 5)
        first_group = int("".join(map(str, new_digits[:2])))  # 1st and 2nd digits
        second_group = int("".join(map(str, new_digits[2:4])))  # 3rd and 4th digits
        third_group = int("".join(map(str, new_digits[4:6])))  # 5th and 6th digits

        # Sum of the three groups
        total = first_group + second_group + third_group

        # Check if the total is close to 100 (with margin of error of 1)
        if abs(100 - total) <= 1:
            return "".join(map(str, new_digits))

    # Return None (not a message string) so callers can test the result safely
    return None

def extract_and_correct_probabilities(text):
    """
    This function extracts the probability portion of each match, corrects for any single digits, 
    and returns the updated probability string.
    """
    # Capture the run of digits that precedes the ' - ' of the score prediction.
    # Note: rows whose predicted outcome is 'X' do not match, because of the letter.
    prob_pattern = r'\n(\d+)\s-\s'

    match = re.search(prob_pattern, text)
    if match:
        prob_str = match.group(1)  # Probabilities + predicted winner + predicted home score

        if len(prob_str) < 8:
            # Fewer than 8 digits means one probability has a single digit.
            # Drop the last two digits (predicted winner and predicted home score).
            prob_str = prob_str[:-2]
            valid_probabilities = insert_zero_and_validate(prob_str)
            if valid_probabilities is not None:
                # Replace the old probability string with the corrected one
                text = text.replace(prob_str, valid_probabilities, 1)

    return text

def fractional_to_decimal(fractional):
    """
    Converts fractional odds to decimal odds.
    Example: '11/10' -> 2.10
    """
    try:
        numerator, denominator = map(int, fractional.split('/'))
        return round((numerator / denominator) + 1, 2)
    except Exception as e:
        print(f"Error converting fractional odds: {e}")
        return fractional

def replace_completed_matches(text):
    """
    This function applies the main regex to update the completed matches data.
    """
    def replacement_function(match):
        # Extract all groups from the regex
        group1 = match.group(1)
        group2 = match.group(2)
        group3 = match.group(3)
        group4 = match.group(4)
        group5 = match.group(5)
        group6 = match.group(6)
        group7 = match.group(7)
        decimal_odds = match.group(8)
        group9 = match.group(9)

        # If odds are in fractional format, convert them to decimal
        if '/' in decimal_odds:
            decimal_odds = fractional_to_decimal(decimal_odds)

        # Rebuild the string with updated decimal odds for the 8th group
        return f"\n{group1}\n{group2}\n{group3}\n{group4}\n{group5}\n{group6}\n{group7}\n{decimal_odds}\n{group9}"

    # Regular expression pattern for matching completed matches data, including both decimal and fractional odds.
    pattern_completed_matches = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}\.\d{2}|\d{1,2}\/\d{1,2})(\d{2}°|\d{1}°)(\d?\.\d{2}|\d{1,2}\/\d{1,2})\s(FT\s\d?\s-\s\d?)'

    # Perform the replacement on the text
    return re.sub(pattern_completed_matches, replacement_function, text)


# Step-by-step processing
for i in range(len(completed_matches)):
    # Step 1: Correct probabilities
    completed_matches[i] = extract_and_correct_probabilities(completed_matches[i])
    
    # Step 2: Apply the main regex replacement
    completed_matches[i] = replace_completed_matches(completed_matches[i])

# Output the processed matches
for match in completed_matches:
    print(match)



Newcastle United
Manchester City
28/9/2024 12:30
22
28
50
2
1 - 3
3.24
12°
1.6
FT 1 - 1
(0 - 1)

Arsenal
Leicester City
28/9/2024 15:00
55
38
07
1
2 - 1
3.02
14°
1.2
FT 4 - 2
(2 - 0)

Brentford
West Ham
28/9/2024 15:00
42
30
27
1
2 - 1
2.72
14°
2.25
FT 1 - 1
(1 - 0)

Chelsea
Brighton
28/9/2024 15:00
20
48
32
X
1 - 1
2.34
14°
4.0
FT 4 - 2
(4 - 2)
0.31

Everton
Crystal Palace
28/9/2024 15:00
30
45
25
X
1 - 1
2.00
13°
3.4
FT 2 - 1
(0 - 1)
0.22

Nottingham Forest
Fulham
28/9/2024 15:00
27
47
25
X
1 - 1
2.30
13°
3.4
FT 0 - 1
(0 - 0)
0.25

Wolverhampton
Liverpool
28/9/2024 17:30
13
08
80
2
1 - 3
3.48
12°
1.4
FT 1 - 2
(0 - 1)
0.24

Ipswich Town
Aston Villa
29/9/2024 14:00
11
20
69
2
0 - 2
2.34
13°
1.9
FT 2 - 2
(1 - 2)
0.35

Manchester United
Tottenham
29/9/2024 16:30
39
35
26
1
2 - 1
2.87
13°
2.25
FT 0 - 3
(0 - 1)

Bournemouth
Southampton
30/9/2024 20:00
43
33
23
1
1 - 0
2.35
13°
1.6
FT 3 - 1
(3 - 0)


### REGEX: Preprocessing Completed matches array 

#### v1.2
After moving to the UK, I realized that Forebet can display odds in different formats: the UK (fractional) system, the American system, and the European (decimal) system, which is also used in Africa.
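
A single converter can normalise the three odds formats to decimal before the values are stored. The sketch below is an illustration, not the notebook's code: `to_decimal_odds` is a hypothetical name, and it assumes American odds always carry an explicit `+` or `-` sign, which has not been checked against what Forebet actually renders.

```python
def to_decimal_odds(odds):
    """Convert fractional ('3/5'), American ('+150', '-200') or decimal ('1.60') odds to decimal."""
    text = odds.strip()
    if "/" in text:
        # Fractional: profit per unit staked, so add the stake back
        numerator, denominator = text.split("/")
        value = int(numerator) / int(denominator) + 1
    elif text[0] in "+-":
        american = int(text)
        # Positive: profit on a 100 stake; negative: stake needed to win 100
        value = american / 100 + 1 if american > 0 else 100 / abs(american) + 1
    else:
        value = float(text)
    return round(value, 2)
```

With the fractional values from the scraped data, `'3/5'` gives `1.6` and `'5/4'` gives `2.25`, matching the converted odds shown in the notebook output.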

In [16]:
import re

# Regular expression pattern for matching completed matches data, including both decimal and fractional odds.
# Fractional odds are in the form a/b and need conversion
pattern_completed_matches = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}\.\d{2}|\d{1,2}\/\d{1,2})(\d{2}°|\d{1}°)(\d?\.\d{2}|\d{1,2}\/\d{1,2})\s(FT\s\d?\s-\s\d?)'

def fractional_to_decimal(fractional):
    """
    Converts fractional odds to decimal odds.
    Example: '11/10' -> 2.10
    """
    try:
        numerator, denominator = map(int, fractional.split('/'))
        return round((numerator / denominator) + 1, 2)
    except Exception as e:
        print(f"Error converting fractional odds: {e}")
        return fractional

def replace_completed_matches(text):
    def replacement_function(match):
        # Extract all groups from the regex
        group1 = match.group(1)
        group2 = match.group(2)
        group3 = match.group(3)
        group4 = match.group(4)
        group5 = match.group(5)
        group6 = match.group(6)
        group7 = match.group(7)
        decimal_odds = match.group(8)
        group9 = match.group(9)

        # If odds are in fractional format, convert them to decimal
        if '/' in decimal_odds:
            decimal_odds = fractional_to_decimal(decimal_odds)

        # Rebuild the string with updated decimal odds for the 8th group
        return f"\n{group1}\n{group2}\n{group3}\n{group4}\n{group5}\n{group6}\n{group7}\n{decimal_odds}\n{group9}"

    # Perform the replacement on the text
    return re.sub(pattern_completed_matches, replacement_function, text)




# Process completed matches
for i in range(len(completed_matches)):
    completed_matches[i] = replace_completed_matches(completed_matches[i])


In [14]:
print("\nCompleted Matches (after removing noise):\n")
print(completed_matches)


Completed Matches (after removing noise):

['\nNewcastle United\nManchester City\n28/9/2024 12:30\n22\n28\n50\n2\n1 - 3\n3.24\n12°\n1.6\nFT 1 - 1\n(0 - 1)', '\nArsenal\nLeicester City\n28/9/2024 15:00\n55\n38\n07\n1\n2 - 1\n3.02\n14°\n1.2\nFT 4 - 2\n(2 - 0)', '\nBrentford\nWest Ham\n28/9/2024 15:00\n42\n30\n27\n1\n2 - 1\n2.72\n14°\n2.25\nFT 1 - 1\n(1 - 0)', '\nChelsea\nBrighton\n28/9/2024 15:00\n20\n48\n32\nX\n1 - 1\n2.34\n14°\n4.0\nFT 4 - 2\n(4 - 2)\n0.31', '\nEverton\nCrystal Palace\n28/9/2024 15:00\n30\n45\n25\nX\n1 - 1\n2.00\n13°\n3.4\nFT 2 - 1\n(0 - 1)\n0.22', '\nNottingham Forest\nFulham\n28/9/2024 15:00\n27\n47\n25\nX\n1 - 1\n2.30\n13°\n3.4\nFT 0 - 1\n(0 - 0)\n0.25', '\nWolverhampton\nLiverpool\n28/9/2024 17:30\n13\n08\n80\n2\n1 - 3\n3.48\n12°\n1.4\nFT 1 - 2\n(0 - 1)\n0.24', '\nIpswich Town\nAston Villa\n29/9/2024 14:00\n11\n20\n69\n2\n0 - 2\n2.34\n13°\n1.9\nFT 2 - 2\n(1 - 2)\n0.35', '\nManchester United\nTottenham\n29/9/2024 16:30\n39\n35\n26\n1\n2 - 1\n2.87\n13°\n2.25\nFT 0 -

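#### v.1
Old version for the European (decimal) odds system. It is kept here for documentation only; running it would redefine `pattern_completed_matches` and `replace_completed_matches` from v1.2.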

In [16]:
## Regular expression pattern for matching completed matches data
## Example matched string: '\n15196521 - 33.1921°1.25 FT 0 - 4'
## This represents:
## 15 - Probability of home team win
## 19 - Probability of draw
## 65 - Probability of away team win
## 2 - Team to win (1 for home, X for draw, 2 for away)
## 1 - Home team score prediction
## 3 - Away team score prediction
## 3.19 - Average goals prediction
## 21° - Weather in degrees
## 1.25 - Odds
## FT 0 - 4 - Actual full-time score

pattern_completed_matches = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}.\d{2})(\d{2}°|\d{1}°)(\d?.\d{2})\s(FT\s\d?\s-\s\d?)'

replacement_completed_matches = r'\n\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8\n\9'

def replace_completed_matches(text):
    """
    Replace matches of the completed matches pattern in the given text
    with the specified replacement pattern.

    Args:
        text (str): The input text containing completed matches data.

    Returns:
        str: The text with replaced patterns.
    """
    return re.sub(pattern_completed_matches, replacement_completed_matches, text)

# Process completed matches
for i in range(len(completed_matches)):
    completed_matches[i] = replace_completed_matches(completed_matches[i])


In [17]:
# print("\nCompleted Matches (after removing noise):\n")
# print(completed_matches)


Completed Matches (after removing noise):

['\nBrighton\nManchester United\n24/8/2024 11:30\n30\n24\n46\n2\n1 - 2\n2.85\n15°\n2.63\nFT 2 - 1\n(1 - 0)\n0.13', '\nCrystal Palace\nWest Ham\n24/8/2024 14:00\n48\n30\n21\n1\n2 - 1\n3.34\n17°\n2.15\nFT 0 - 2\n(0 - 0)\n0.03', '\nFulham\nLeicester City\n24/8/2024 14:00\n31\n45\n23\nX\n2 - 2\n3.13\n17°\n3.90\nFT 2 - 1\n(1 - 1)\n0.26', '\nManchester City\nIpswich Town\n24/8/2024 14:00\n62\n14\n23\n1\n3 - 1\n3.76\n17°\n1.09\nFT 4 - 1\n(3 - 1)', '\nSouthampton\nNottingham Forest\n24/8/2024 14:00\n22\n47\n31\nX\n2 - 2\n3.39\n18°\n3.40\nFT 0 - 1\n(0 - 0)\n0.25', '\nTottenham\nEverton\n24/8/2024 14:00\n59\n25\n16\n1\n3 - 0\n2.90\n17°\n1.44\nFT 4 - 0\n(2 - 0)', '\nAston Villa\nArsenal\n24/8/2024 16:30\n21\n27\n52\n2\n1 - 3\n3.18\n17°\n1.83\nFT 0 - 2\n(0 - 0)', '\nManchester United\nFulham\n16/8/2024 19:00\n59\n21\n19\n1\n3 - 0\n3.23\n17°\n1.53\nFT 1 - 0\n(0 - 0)', '\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21\n28\n51\n2\n1 - 3\n3.36\n23°\n1.36\nFT 0

#### Brief Notes:
- `pattern_completed_matches`: A regular expression that matches the completed-match data format. It captures the win probabilities, the predicted outcome and scoreline, average goals, weather, odds, and the actual full-time scoreline.
- `replacement_completed_matches`: A replacement pattern that reformats the matched data by inserting newline characters (`\n`) between the captured groups. This makes it easier to build a DataFrame later, since each group becomes its own column.

For example, `\n15196521 - 33.1921°1.25 FT 0 - 4` becomes `\n15\n19\n65\n2\n1 - 3\n3.19\n21°\n1.25\nFT 0 - 4`.
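
As an alternative way to read the same format, the groups can be given names so that each captured value is self-describing. This is a sketch rather than the notebook's pattern: the group names and `parse_row` are hypothetical, and it assumes two-digit probabilities, a single-digit predicted score, and odds in decimal or fractional form.

```python
import re

ROW_PATTERN = re.compile(
    r"""
    \n(?P<home_win>\d{2})(?P<draw>\d{2})(?P<away_win>\d{2})   # three 2-digit probabilities
    (?P<pick>[12X])                                            # predicted outcome
    (?P<pred_home>\d)\s-\s(?P<pred_away>\d)                    # predicted scoreline
    (?P<avg_goals>\d\.\d{2})                                   # average goals
    (?P<temperature>\d{1,2})°                                  # weather in degrees
    (?P<odds>\d{1,2}\.\d{2}|\d{1,2}/\d{1,2})                   # decimal or fractional odds
    \sFT\s(?P<ft_home>\d+)\s-\s(?P<ft_away>\d+)                # full-time score
    """,
    re.VERBOSE,
)


def parse_row(text):
    """Return the named groups as a dict, or None if the text does not match."""
    found = ROW_PATTERN.search(text)
    return found.groupdict() if found else None
```

`parse_row` returns `None` for rows it cannot read, which makes malformed rows (such as the single-digit probability case) easy to count and inspect instead of passing through unchanged.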

#### Split by \n

In [15]:
# New array to hold split Data 
completed_matches_split = []

for i in range(len(completed_matches)):
    completed_matches_split.append(completed_matches[i].split('\n'))
    

In [16]:
print("\nCompleted Matches Split:\n")
print(completed_matches_split)



Completed Matches Split:

[['', 'Newcastle United', 'Manchester City', '28/9/2024 12:30', '22', '28', '50', '2', '1 - 3', '3.24', '12°', '1.6', 'FT 1 - 1', '(0 - 1)'], ['', 'Arsenal', 'Leicester City', '28/9/2024 15:00', '55', '38', '07', '1', '2 - 1', '3.02', '14°', '1.2', 'FT 4 - 2', '(2 - 0)'], ['', 'Brentford', 'West Ham', '28/9/2024 15:00', '42', '30', '27', '1', '2 - 1', '2.72', '14°', '2.25', 'FT 1 - 1', '(1 - 0)'], ['', 'Chelsea', 'Brighton', '28/9/2024 15:00', '20', '48', '32', 'X', '1 - 1', '2.34', '14°', '4.0', 'FT 4 - 2', '(4 - 2)', '0.31'], ['', 'Everton', 'Crystal Palace', '28/9/2024 15:00', '30', '45', '25', 'X', '1 - 1', '2.00', '13°', '3.4', 'FT 2 - 1', '(0 - 1)', '0.22'], ['', 'Nottingham Forest', 'Fulham', '28/9/2024 15:00', '27', '47', '25', 'X', '1 - 1', '2.30', '13°', '3.4', 'FT 0 - 1', '(0 - 0)', '0.25'], ['', 'Wolverhampton', 'Liverpool', '28/9/2024 17:30', '13', '08', '80', '2', '1 - 3', '3.48', '12°', '1.4', 'FT 1 - 2', '(0 - 1)', '0.24'], ['', 'Ipswich Town'

In [17]:
df_columns_completed_matches  = ['', 'home', 'away', 'date_and_time', 'home_win_probability', 'draw_probability', 'away_win_probability', 'team_to_win_prediction', 'scoreline_prediction', 'average_goals_prediction', 'weather_in_degrees', 'odds', 'full_time_score', 'score_at_halftime', "kelly_criterion"]
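
The split rows have 14 or 15 fields, depending on whether the trailing value mapped to `kelly_criterion` is present. If no row in a given week carries that value, every row has 14 fields and building a DataFrame with 15 column names would likely fail. A minimal sketch of a guard, assuming a hypothetical helper `pad_rows`:

```python
def pad_rows(rows, width, fill=None):
    """Pad every row to `width` items so all rows have the same number of columns."""
    padded = []
    for row in rows:
        if len(row) > width:
            raise ValueError(f"Row has {len(row)} fields, expected at most {width}: {row}")
        padded.append(list(row) + [fill] * (width - len(row)))
    return padded
```

It could be applied as `pad_rows(completed_matches_split, len(df_columns_completed_matches))` before the DataFrame is created; rows with too many fields raise an error rather than being silently truncated.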

In [18]:
# Create DataFrame
df_completed_matches = pd.DataFrame(completed_matches_split, columns=df_columns_completed_matches)

# Drop first column
df_completed_matches = df_completed_matches.drop(columns=[''])

# Display DataFrame
print(df_completed_matches)

                home             away    date_and_time home_win_probability  \
0   Newcastle United  Manchester City  28/9/2024 12:30                   22   
1            Arsenal   Leicester City  28/9/2024 15:00                   55   
2          Brentford         West Ham  28/9/2024 15:00                   42   
3            Chelsea         Brighton  28/9/2024 15:00                   20   
4            Everton   Crystal Palace  28/9/2024 15:00                   30   
5  Nottingham Forest           Fulham  28/9/2024 15:00                   27   
6      Wolverhampton        Liverpool  28/9/2024 17:30                   13   
7       Ipswich Town      Aston Villa  29/9/2024 14:00                   11   
8  Manchester United        Tottenham  29/9/2024 16:30                   39   
9        Bournemouth      Southampton  30/9/2024 20:00                   43   

  draw_probability away_win_probability team_to_win_prediction  \
0               28                   50                      2  

We are going to skip EDA since the data varies from week to week. Later, once the data is stored in a database and we can track trends over time, patterns may become visible.

### Database and Model Readiness:
We are still preparing the data, but from now on everything we do has the model in mind: label encoding, converting strings to floats, dropping columns, and so on. This goes beyond plain data cleaning; it is closer to preprocessing and feature engineering than to parameter tuning.

We are going to create numerical labels for the Premier League teams so that the `home` and `away` values are numeric, which matters for logistic regression. It might not be the best model, but it is a starting point.
Reading team labels as numbers can be annoying, so you can choose to skip this part and only copy the label-encoding code when you are ready to train the model.

In [19]:
def team_to_label(team_name):
    """
    Convert a team name to its corresponding label.

    Args:
        team_name (str): The name of the team.

    Returns:
        int: The label for the team name from the team_labels_forebet dictionary, or None if the team is not listed.
    """
    return team_labels_forebet.get(team_name)
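
Because `team_to_label` relies on `dict.get`, a team name that is missing from the dictionary (for example a newly promoted club, or a different spelling on the page) maps to `None` without any error. A small check before mapping can surface this; `find_unknown_teams` and the sample dictionary below are hypothetical and only illustrate the idea.

```python
def find_unknown_teams(team_names, labels):
    """Return the sorted, de-duplicated names that have no entry in `labels`."""
    return sorted({name for name in team_names if name not in labels})


# Small illustrative subset, not the full dictionary from the notebook
sample_labels = {"Arsenal": 1, "Aston Villa": 2, "Liverpool": 14}
missing = find_unknown_teams(["Arsenal", "Sunderland", "Liverpool", "Sunderland"], sample_labels)
```

In the notebook, the same call could take the `home` and `away` columns together with `team_labels_forebet`, and the mapping step could be skipped or stopped when the returned list is not empty.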


In [20]:
# Map home and away team names to corresponding labels.
df_completed_matches['home'] = df_completed_matches['home'].map(team_to_label)
df_completed_matches['away'] = df_completed_matches['away'].map(team_to_label)


print(df_completed_matches.head())

   home  away    date_and_time home_win_probability draw_probability  \
0    17    15  28/9/2024 12:30                   22               28   
1     1    13  28/9/2024 15:00                   55               38   
2     4    22  28/9/2024 15:00                   42               30   
3     7     5  28/9/2024 15:00                   20               48   
4     9     8  28/9/2024 15:00                   30               45   

  away_win_probability team_to_win_prediction scoreline_prediction  \
0                   50                      2                1 - 3   
1                   07                      1                2 - 1   
2                   27                      1                2 - 1   
3                   32                      X                1 - 1   
4                   25                      X                1 - 1   

  average_goals_prediction weather_in_degrees  odds full_time_score  \
0                     3.24                12°   1.6        FT 1 - 1   
1   

### Splitting Date and Time 

In [21]:
df_completed_matches[['date', 'time']] = df_completed_matches['date_and_time'].str.split(' ', expand=True)
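
The scraped dates are day-first strings such as `28/9/2024 12:30`, which do not sort chronologically as text. If a real date type is wanted before inserting into the database, the standard library can parse them. This is a sketch; `parse_kickoff` is a hypothetical helper and the day-first format is inferred from the sample output.

```python
from datetime import datetime


def parse_kickoff(date_and_time):
    """Parse a day-first 'd/m/yyyy HH:MM' string into a datetime."""
    return datetime.strptime(date_and_time, "%d/%m/%Y %H:%M")


kickoff = parse_kickoff("28/9/2024 12:30")
iso_date = kickoff.date().isoformat()
```

Here `iso_date` is `'2024-09-28'`; ISO-formatted dates sort correctly as plain strings.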

### Splitting the "Scoreline prediction" column into separate columns (Home goals, Away goals)

In [22]:
df_completed_matches[['home_team_score_prediction', 'away_team_score_prediction']] = \
    df_completed_matches['scoreline_prediction'].str.split('-', expand=True)


# Converting the split columns to integers
df_completed_matches['home_team_score_prediction'] = df_completed_matches['home_team_score_prediction'].astype(int)
df_completed_matches['away_team_score_prediction'] = df_completed_matches['away_team_score_prediction'].astype(int)

# Quick check:
print(df_completed_matches.head())  # Display the first few rows to verify the changes


   home  away    date_and_time home_win_probability draw_probability  \
0    17    15  28/9/2024 12:30                   22               28   
1     1    13  28/9/2024 15:00                   55               38   
2     4    22  28/9/2024 15:00                   42               30   
3     7     5  28/9/2024 15:00                   20               48   
4     9     8  28/9/2024 15:00                   30               45   

  away_win_probability team_to_win_prediction scoreline_prediction  \
0                   50                      2                1 - 3   
1                   07                      1                2 - 1   
2                   27                      1                2 - 1   
3                   32                      X                1 - 1   
4                   25                      X                1 - 1   

  average_goals_prediction weather_in_degrees  odds full_time_score  \
0                     3.24                12°   1.6        FT 1 - 1   
1   

In [23]:
df_completed_matches.drop(columns=['scoreline_prediction'], inplace=True)

### Splitting the full-time and halftime scorelines into separate columns (Home goals, Away goals)

In [24]:
df_completed_matches[['home_team_full_time_score', 'away_team_full_time_score']] = df_completed_matches['full_time_score'].str.strip('FT ').str.split(' - ', expand=True)

df_completed_matches['away_team_full_time_score'] = df_completed_matches['away_team_full_time_score'].astype(int)
df_completed_matches['home_team_full_time_score'] = df_completed_matches['home_team_full_time_score'].astype(int)


In [25]:

df_completed_matches[['home_team_halftime_score', 'away_team_halftime_score']] = df_completed_matches['score_at_halftime'].str.strip('()').str.split(' - ', expand=True)
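
Note that `str.strip('FT ')` removes any of the characters `F`, `T`, and space from both ends rather than the literal prefix `'FT '`; it works here because the scores begin and end with digits. The halftime columns are also left as strings in the cell above. A regex-based parser handles both formats and returns integers. This is a sketch with a hypothetical helper name, `parse_score`.

```python
import re

SCORE_PATTERN = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_score(text):
    """Extract (home_goals, away_goals) from strings like 'FT 4 - 2' or '(2 - 0)'."""
    found = SCORE_PATTERN.search(text)
    if found is None:
        raise ValueError(f"No score found in {text!r}")
    return int(found.group(1)), int(found.group(2))
```

The same function works for the predicted scoreline (`'1 - 3'`), so one helper could cover all three score columns.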


### Creating Prediction win/loss Column 

In [26]:
def create_y(df):
    """
    Create the target variable (y) based on the prediction results.

    Args:
        df (DataFrame): The DataFrame containing match predictions and actual outcomes.

    Returns:
        list: The target variable (y) indicating whether the prediction was correct (1) or not (0)
    """
    y = []
    for i in range(len(df)):
        if df['team_to_win_prediction'][i] == '1' and df['home_team_full_time_score'][i] > df['away_team_full_time_score'][i]:
            y.append(1)
        elif df['team_to_win_prediction'][i] == '2' and df['home_team_full_time_score'][i] < df['away_team_full_time_score'][i]:
            y.append(1)
        elif df['team_to_win_prediction'][i] == 'X' and df['home_team_full_time_score'][i] == df['away_team_full_time_score'][i]:
            y.append(1)
        else:
            y.append(0)
    return y

# Append the y column to the main DataFrame
df_completed_matches['prediction_result'] = create_y(df_completed_matches)


#### Brief Notes
This function iterates through each row in the DataFrame, compares the prediction with the actual outcome, and assigns 1 if the prediction was correct and 0 otherwise. This is the target our model will be trained on: can the model spot the patterns in predictions that are usually correct, and use them to confirm future predictions for upcoming matches?
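
The same rule can be written as a small pure function that first derives the actual outcome and then compares it with the prediction. This is a sketch; `prediction_correct` is a hypothetical name, and it assumes the prediction is one of `'1'`, `'X'`, or `'2'` and the scores are integers.

```python
def prediction_correct(pick, home_goals, away_goals):
    """Return 1 if the predicted outcome ('1', 'X' or '2') matches the final score, else 0."""
    if home_goals > away_goals:
        actual = "1"
    elif home_goals < away_goals:
        actual = "2"
    else:
        actual = "X"
    return int(pick == actual)
```

For the Newcastle United vs Manchester City row (prediction `'2'`, full time 1 - 1) this returns `0`; for Arsenal vs Leicester City (prediction `'1'`, full time 4 - 2) it returns `1`. Keeping the rule separate from the DataFrame makes it easy to test on its own.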

In [27]:
print(df_completed_matches)

   home  away    date_and_time home_win_probability draw_probability  \
0    17    15  28/9/2024 12:30                   22               28   
1     1    13  28/9/2024 15:00                   55               38   
2     4    22  28/9/2024 15:00                   42               30   
3     7     5  28/9/2024 15:00                   20               48   
4     9     8  28/9/2024 15:00                   30               45   
5    26    10  28/9/2024 15:00                   27               47   
6    24    14  28/9/2024 17:30                   13               08   
7    11     2  29/9/2024 14:00                   11               20   
8    16    21  29/9/2024 16:30                   39               35   
9     3    20  30/9/2024 20:00                   43               33   

  away_win_probability team_to_win_prediction average_goals_prediction  \
0                   50                      2                     3.24   
1                   07                      1              

## Feature engineering

We now make sure each column has the right data type, again with the model in mind.

In [30]:
# df_completed_matches[['home', 'away', 'team_to_win_prediction','prediction_result']] = df_completed_matches[['home', 'away', 'team_to_win_prediction','prediction_result']].astype('category')

In [28]:
# Convert probabilities to float
df_completed_matches['home_win_probability'] = df_completed_matches['home_win_probability'].astype(float)
df_completed_matches['draw_probability'] = df_completed_matches['draw_probability'].astype(float)
df_completed_matches['away_win_probability'] = df_completed_matches['away_win_probability'].astype(float)
df_completed_matches['average_goals_prediction'] = df_completed_matches['average_goals_prediction'].astype(float)
df_completed_matches['odds'] = df_completed_matches['odds'].astype(float)

In [29]:
# Convert relevant score columns to integers
df_completed_matches['home_team_full_time_score'] = df_completed_matches['home_team_full_time_score'].astype(int)
df_completed_matches['away_team_full_time_score'] = df_completed_matches['away_team_full_time_score'].astype(int)
df_completed_matches['home_team_halftime_score'] = df_completed_matches['home_team_halftime_score'].astype(int)
df_completed_matches['away_team_halftime_score'] = df_completed_matches['away_team_halftime_score'].astype(int)


In [30]:
df_completed_matches.drop(columns=['kelly_criterion'], inplace=True)

In [31]:
# Extract calendar features from the date and time columns.
# Parse the 'date' column (day/month/year) into datetime
df_completed_matches['date'] = pd.to_datetime(df_completed_matches['date'], format='%d/%m/%Y')
# Day of the week as an integer: Monday=0 ... Sunday=6
df_completed_matches['day_of_week'] = df_completed_matches['date'].dt.dayofweek
df_completed_matches['month'] = df_completed_matches['date'].dt.month
# weekly_round is expected to be set earlier in the notebook
df_completed_matches['weekly_round'] = weekly_round
# Parse 'date_and_time' (day/month/year hour:minute) into datetime
df_completed_matches['date_and_time'] = pd.to_datetime(df_completed_matches['date_and_time'], format='%d/%m/%Y %H:%M')
# The 'time' column is assumed to be in '%H:%M' (24-hour) format; keep only the time component
df_completed_matches['time'] = pd.to_datetime(df_completed_matches['time'], format='%H:%M').dt.time
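
Because `date_and_time` already carries both the date and the kick-off time, the same features can be derived from that single column. Below is a minimal sketch on two sample strings taken from the printed output above; the `features` frame is illustrative only and is not part of the notebook's pipeline.

```python
import pandas as pd

# Two kick-off strings in the day/month/year hour:minute layout used above
kickoffs = pd.Series(['28/9/2024 12:30', '30/9/2024 20:00'])
parsed = pd.to_datetime(kickoffs, format='%d/%m/%Y %H:%M')

features = pd.DataFrame({
    'date_and_time': parsed,
    'day_of_week': parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
    'month': parsed.dt.month,
    'time': parsed.dt.time,              # datetime.time objects
})
print(features)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.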


In [32]:
# Custom label mapping
label_mapping = {'X': 0, '1': 1, '2': 2}
df_completed_matches['team_to_win_prediction'] = df_completed_matches['team_to_win_prediction'].map(label_mapping)
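
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`, so running the mapping cell a second time (when the column already holds integers) silently empties the column. A minimal sketch of a stricter variant follows; `map_labels_strict` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

label_mapping = {'X': 0, '1': 1, '2': 2}

def map_labels_strict(labels: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers, raising instead of producing NaN for unknown values."""
    mapped = labels.map(mapping)
    # Anything that came back as NaN had no entry in the mapping
    unknown = labels[mapped.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f'Unmapped prediction labels: {list(unknown)}')
    return mapped.astype(int)

print(map_labels_strict(pd.Series(['1', 'X', '2']), label_mapping).tolist())
```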

### Validation and Logging: 

In [33]:
df_completed_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   home                        10 non-null     int64         
 1   away                        10 non-null     int64         
 2   date_and_time               10 non-null     datetime64[ns]
 3   home_win_probability        10 non-null     float64       
 4   draw_probability            10 non-null     float64       
 5   away_win_probability        10 non-null     float64       
 6   team_to_win_prediction      10 non-null     int64         
 7   average_goals_prediction    10 non-null     float64       
 8   weather_in_degrees          10 non-null     object        
 9   odds                        10 non-null     float64       
 10  full_time_score             10 non-null     object        
 11  score_at_halftime           10 non-null     object        
 1

In [34]:
df_completed_matches["time"]

0    12:30:00
1    15:00:00
2    15:00:00
3    15:00:00
4    15:00:00
5    15:00:00
6    17:30:00
7    14:00:00
8    16:30:00
9    20:00:00
Name: time, dtype: object

### PERSISTENT STORAGE

In [42]:
!pip install pymysql



In [35]:
# Database connection
user = 'test_user'
password = 'password'
host = 'localhost'
port = 3306
database = 'bet_prediction_model'
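
To keep credentials out of the notebook, the connection settings can be read from environment variables instead. This is a minimal sketch under stated assumptions: the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and the helper `build_mysql_url` are hypothetical, and the defaults simply mirror the placeholder values above.

```python
import os
from urllib.parse import quote_plus

def build_mysql_url(env=os.environ) -> str:
    """Build a 'mysql+pymysql://' connection URL from environment-style settings."""
    user = env.get('DB_USER', 'test_user')
    password = env.get('DB_PASSWORD', 'password')
    host = env.get('DB_HOST', 'localhost')
    port = int(env.get('DB_PORT', 3306))
    database = env.get('DB_NAME', 'bet_prediction_model')
    # Percent-encode user and password so characters such as '@' or '/'
    # are not mistaken for URL delimiters
    return (
        f'mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}'
        f'@{host}:{port}/{database}'
    )

print(build_mysql_url({}))
```

The returned string can be passed to `create_engine` in place of the hand-built f-string.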

In [36]:
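engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
# Note: if_exists='replace' drops and recreates the table on every run, so only the latest scrape is kept.
# Use if_exists='append' instead to accumulate results across weekly runs.
df_completed_matches.to_sql('completed_matches', con=engine, if_exists='replace', index=False)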

10