# Betting Prediction Model - Scrapping Forebet Prediction Table Completed Matches

## Overview

This notebook is designed to scrape data from Forebet's prediction table for completed matches. The scraped data will be stored in a database and used to train our betting prediction model. The primary goals of this notebook are:

1. **Scrape Completed Matches Data**: Retrieve data for completed matches, including team names, win probabilities, score predictions, and actual match outcomes.
2. **Data Processing**: Clean and format the scraped data to ensure it is suitable for analysis and model training.
3. **Database Insertion**: Store the cleaned data in the `completed_matches` table within our database.

## Steps Involved

1. **Setup and Imports**: Import necessary libraries and set up configurations.
2. **Scraping Data**: Extract data from the Forebet website.
3. **Data Cleaning and Processing**: Process the extracted data to ensure consistency and correctness.
4. **Database Operations**: Insert the cleaned data into the `completed_matches` table.
5. **Validation and Logging**: Validate the inserted data and log the operations for future reference.

## Requirements

- Python 3.x
- Libraries: `selenium`, `sklearn`, `pandas`, `sqlalchemy`, `re`
- pymysql installed
- Access to the database where the `completed_matches` table and `training_data` is located

## Notes

- Ensure you have the correct database credentials set up before running this notebook.
- This notebook should be run weekly to keep the `completed_matches` table updated with the latest match results.
- Training_data is a joint table of Completed matches and previous_league_table_standings

Let's get started by setting up our environment and importing the necessary libraries.


### Setup and Imports:

In [31]:
from selenium import webdriver
import pandas as pd
from sqlalchemy import create_engine
import re
from sklearn.preprocessing import LabelEncoder

### Scraping Data:

In [32]:
# Path to the ChromeDriver executable
PATH = 'C:/Users/kevin/Desktop/tools/chromedriver-win64/chromedriver'

# Initialize the Chrome WebDriver
driver = webdriver.Chrome(PATH)

# Open the Forebet predictions page for the English Premier League
driver.get('https://www.forebet.com/en/football-tips-and-predictions-for-england/premier-league')


  driver = webdriver.Chrome(PATH)


In [33]:
# Now find the fixture containers within the body element
match_fixture_containers = driver.find_elements("class name", "schema")
print(match_fixture_containers)

[<selenium.webdriver.remote.webelement.WebElement (session="8f6ec6b31aa50c8e56cf74702d381660", element="f.3AB4E407FD71B4EEF7EDAAE62423BD52.d.A1FDADC4032EF60FB217B26D469C42E8.e.46")>, <selenium.webdriver.remote.webelement.WebElement (session="8f6ec6b31aa50c8e56cf74702d381660", element="f.3AB4E407FD71B4EEF7EDAAE62423BD52.d.A1FDADC4032EF60FB217B26D469C42E8.e.236")>, <selenium.webdriver.remote.webelement.WebElement (session="8f6ec6b31aa50c8e56cf74702d381660", element="f.3AB4E407FD71B4EEF7EDAAE62423BD52.d.A1FDADC4032EF60FB217B26D469C42E8.e.237")>]


In [34]:
# Initialize an empty list to store fixture details
fixtures_container = []

# Iterate through each match fixture container
for fixture in match_fixture_containers:
    # Extract the text from the fixture container
    fixture_text = fixture.text
    # Append the extracted text to the fixtures_container list
    fixtures_container.append(fixture_text)
    # Print the extracted fixture text
    print(fixture_text)


Round 38
EPL
Arsenal
Everton
19/5/2024 15:00
53202713 - 13.1622°1.18 FT 2 - 1
(1 - 1)
EPL
Brentford
Newcastle United
19/5/2024 15:00
27254921 - 33.1322°2.20 FT 2 - 4
(0 - 3)
0.07
EPL
Brighton
Manchester United
19/5/2024 15:00
34343212 - 12.8817°4.00 FT 0 - 2
(0 - 0)
EPL
Burnley
Nottingham Forest
19/5/2024 15:00
38332812 - 12.9819°2.90 FT 1 - 2
(0 - 2)
0.05
EPL
Chelsea
Bournemouth
19/5/2024 15:00
49272412 - 12.7822°1.45 FT 2 - 1
(1 - 0)
EPL
Crystal Palace
Aston Villa
19/5/2024 15:00
28314121 - 22.9122°3.80 FT 5 - 0
(2 - 0)
0.2
EPL
Liverpool
Wolverhampton
19/5/2024 15:00
63231413 - 04.1521°1.18 FT 2 - 0
(2 - 0)
EPL
Luton Town
Fulham
19/5/2024 15:00
253936X2 - 23.4320°3.90 FT 2 - 4
(1 - 2)
0.18
EPL
Manchester City
West Ham
19/5/2024 15:00
59261613 - 14.3422°1.10 FT 3 - 1
(2 - 1)
EPL
Sheffield United
Tottenham
19/5/2024 15:00
33175021 - 33.5321°1.36 FT 0 - 3
(0 - 1)
Round 37
EPL
Fulham
Manchester City
11/5/2024 11:30
15196521 - 33.1921°1.25 FT 0 - 4
(0 - 1)
EPL
Bournemouth
Brentford
11/5/2

In [35]:
# Closes the webdriver and forebet page
driver.quit()

In [36]:
fixtures_container

['Round 38\nEPL\nArsenal\nEverton\n19/5/2024 15:00\n53202713 - 13.1622°1.18 FT 2 - 1\n(1 - 1)\nEPL\nBrentford\nNewcastle United\n19/5/2024 15:00\n27254921 - 33.1322°2.20 FT 2 - 4\n(0 - 3)\n0.07\nEPL\nBrighton\nManchester United\n19/5/2024 15:00\n34343212 - 12.8817°4.00 FT 0 - 2\n(0 - 0)\nEPL\nBurnley\nNottingham Forest\n19/5/2024 15:00\n38332812 - 12.9819°2.90 FT 1 - 2\n(0 - 2)\n0.05\nEPL\nChelsea\nBournemouth\n19/5/2024 15:00\n49272412 - 12.7822°1.45 FT 2 - 1\n(1 - 0)\nEPL\nCrystal Palace\nAston Villa\n19/5/2024 15:00\n28314121 - 22.9122°3.80 FT 5 - 0\n(2 - 0)\n0.2\nEPL\nLiverpool\nWolverhampton\n19/5/2024 15:00\n63231413 - 04.1521°1.18 FT 2 - 0\n(2 - 0)\nEPL\nLuton Town\nFulham\n19/5/2024 15:00\n253936X2 - 23.4320°3.90 FT 2 - 4\n(1 - 2)\n0.18\nEPL\nManchester City\nWest Ham\n19/5/2024 15:00\n59261613 - 14.3422°1.10 FT 3 - 1\n(2 - 1)\nEPL\nSheffield United\nTottenham\n19/5/2024 15:00\n33175021 - 33.5321°1.36 FT 0 - 3\n(0 - 1)\nRound 37\nEPL\nFulham\nManchester City\n11/5/2024 11:30\n1

### Data Cleaning and Processing

In [37]:
# Luckily the data has \nEpl which we can use to separate epl matches from other leagues.
matches_data_cleaned_step_1 = [match.split("\nEPL") for match in fixtures_container]
epl_matches = matches_data_cleaned_step_1[0]
epl_matches

['Round 38',
 '\nArsenal\nEverton\n19/5/2024 15:00\n53202713 - 13.1622°1.18 FT 2 - 1\n(1 - 1)',
 '\nBrentford\nNewcastle United\n19/5/2024 15:00\n27254921 - 33.1322°2.20 FT 2 - 4\n(0 - 3)\n0.07',
 '\nBrighton\nManchester United\n19/5/2024 15:00\n34343212 - 12.8817°4.00 FT 0 - 2\n(0 - 0)',
 '\nBurnley\nNottingham Forest\n19/5/2024 15:00\n38332812 - 12.9819°2.90 FT 1 - 2\n(0 - 2)\n0.05',
 '\nChelsea\nBournemouth\n19/5/2024 15:00\n49272412 - 12.7822°1.45 FT 2 - 1\n(1 - 0)',
 '\nCrystal Palace\nAston Villa\n19/5/2024 15:00\n28314121 - 22.9122°3.80 FT 5 - 0\n(2 - 0)\n0.2',
 '\nLiverpool\nWolverhampton\n19/5/2024 15:00\n63231413 - 04.1521°1.18 FT 2 - 0\n(2 - 0)',
 '\nLuton Town\nFulham\n19/5/2024 15:00\n253936X2 - 23.4320°3.90 FT 2 - 4\n(1 - 2)\n0.18',
 '\nManchester City\nWest Ham\n19/5/2024 15:00\n59261613 - 14.3422°1.10 FT 3 - 1\n(2 - 1)',
 '\nSheffield United\nTottenham\n19/5/2024 15:00\n33175021 - 33.5321°1.36 FT 0 - 3\n(0 - 1)\nRound 37',
 '\nFulham\nManchester City\n11/5/2024 11:30\n1

In [38]:
# Extract the current weekly round from the first match in epl_matches
weekly_round = epl_matches[0]
# Split the string to get the round number, convert it to an integer, and adjust it to a zero-based index
weekly_round = int(weekly_round.split(' ')[1]) 
weekly_round

38

#### Brief Notes:
The page separates Upcoming matches from Completed matches using 'Round'. We use this Round Number to confirm if we are getting the right data. Also i append it in the dataframe at the end of the notebook to help double check when working with aggregated data in the DataBase

In [39]:
completed_matches = []

for match in epl_matches:
    if 'FT' in match:  # Check if 'FT' (full-time) is present in the match string
        completed_matches.append(match)

print("Completed Matches:")
print(completed_matches)

Completed Matches:
['\nArsenal\nEverton\n19/5/2024 15:00\n53202713 - 13.1622°1.18 FT 2 - 1\n(1 - 1)', '\nBrentford\nNewcastle United\n19/5/2024 15:00\n27254921 - 33.1322°2.20 FT 2 - 4\n(0 - 3)\n0.07', '\nBrighton\nManchester United\n19/5/2024 15:00\n34343212 - 12.8817°4.00 FT 0 - 2\n(0 - 0)', '\nBurnley\nNottingham Forest\n19/5/2024 15:00\n38332812 - 12.9819°2.90 FT 1 - 2\n(0 - 2)\n0.05', '\nChelsea\nBournemouth\n19/5/2024 15:00\n49272412 - 12.7822°1.45 FT 2 - 1\n(1 - 0)', '\nCrystal Palace\nAston Villa\n19/5/2024 15:00\n28314121 - 22.9122°3.80 FT 5 - 0\n(2 - 0)\n0.2', '\nLiverpool\nWolverhampton\n19/5/2024 15:00\n63231413 - 04.1521°1.18 FT 2 - 0\n(2 - 0)', '\nLuton Town\nFulham\n19/5/2024 15:00\n253936X2 - 23.4320°3.90 FT 2 - 4\n(1 - 2)\n0.18', '\nManchester City\nWest Ham\n19/5/2024 15:00\n59261613 - 14.3422°1.10 FT 3 - 1\n(2 - 1)', '\nSheffield United\nTottenham\n19/5/2024 15:00\n33175021 - 33.5321°1.36 FT 0 - 3\n(0 - 1)\nRound 37', '\nFulham\nManchester City\n11/5/2024 11:30\n15196

#### Now we can remove `PREVIEW` if only it exists

In [40]:
# Remove '\nPRE\nVIEW' from completed matches
completed_matches = [match.replace('\nPRE\nVIEW', '') for match in completed_matches]

In [41]:
print("\nCompleted Matches (after removing noise):\n")
print(completed_matches)


Completed Matches (after removing noise):

['\nArsenal\nEverton\n19/5/2024 15:00\n53202713 - 13.1622°1.18 FT 2 - 1\n(1 - 1)', '\nBrentford\nNewcastle United\n19/5/2024 15:00\n27254921 - 33.1322°2.20 FT 2 - 4\n(0 - 3)\n0.07', '\nBrighton\nManchester United\n19/5/2024 15:00\n34343212 - 12.8817°4.00 FT 0 - 2\n(0 - 0)', '\nBurnley\nNottingham Forest\n19/5/2024 15:00\n38332812 - 12.9819°2.90 FT 1 - 2\n(0 - 2)\n0.05', '\nChelsea\nBournemouth\n19/5/2024 15:00\n49272412 - 12.7822°1.45 FT 2 - 1\n(1 - 0)', '\nCrystal Palace\nAston Villa\n19/5/2024 15:00\n28314121 - 22.9122°3.80 FT 5 - 0\n(2 - 0)\n0.2', '\nLiverpool\nWolverhampton\n19/5/2024 15:00\n63231413 - 04.1521°1.18 FT 2 - 0\n(2 - 0)', '\nLuton Town\nFulham\n19/5/2024 15:00\n253936X2 - 23.4320°3.90 FT 2 - 4\n(1 - 2)\n0.18', '\nManchester City\nWest Ham\n19/5/2024 15:00\n59261613 - 14.3422°1.10 FT 3 - 1\n(2 - 1)', '\nSheffield United\nTottenham\n19/5/2024 15:00\n33175021 - 33.5321°1.36 FT 0 - 3\n(0 - 1)\nRound 37', '\nFulham\nManchester Cit

### REGEX: Preprocessing Completed matches array 

In [42]:
# Regular expression pattern for matching completed matches data
# Example matched string: '\n15196521 - 33.1921°1.25 FT 0 - 4'
# This represents:
# 15 - Probability of home team win
# 19 - Probability of draw
# 52 - Probability of away team win
# 2 - Team to win (1 for home, X for draw, 2 for away)
# 1 - Home team score prediction
# 3 - Away team score prediction
# 3.19 - Average goals prediction
# 21° - Weather in degrees
# 1.25 - Odds
# FT 0 - 4 - Actual full-time score

pattern_completed_matches = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}.\d{2})(\d{2}°|\d{1}°)(\d?.\d{2})\s(FT\s\d?\s-\s\d?)'

replacement_completed_matches = r'\n\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8\n\9'

def replace_completed_matches(text):
    """
    Replace matches of the completed matches pattern in the given text
    with the specified replacement pattern.

    Args:
        text (str): The input text containing completed matches data.

    Returns:
        str: The text with replaced patterns.
    """
    return re.sub(pattern_completed_matches, replacement_completed_matches, text)

# Process upcoming matches
for i in range(len(completed_matches)):
    completed_matches[i] = replace_completed_matches(completed_matches[i])


In [43]:
print("\nCompleted Matches (after removing noise):\n")
print(completed_matches)


Completed Matches (after removing noise):

['\nArsenal\nEverton\n19/5/2024 15:00\n53\n20\n27\n1\n3 - 1\n3.16\n22°\n1.18\nFT 2 - 1\n(1 - 1)', '\nBrentford\nNewcastle United\n19/5/2024 15:00\n27\n25\n49\n2\n1 - 3\n3.13\n22°\n2.20\nFT 2 - 4\n(0 - 3)\n0.07', '\nBrighton\nManchester United\n19/5/2024 15:00\n34\n34\n32\n1\n2 - 1\n2.88\n17°\n4.00\nFT 0 - 2\n(0 - 0)', '\nBurnley\nNottingham Forest\n19/5/2024 15:00\n38\n33\n28\n1\n2 - 1\n2.98\n19°\n2.90\nFT 1 - 2\n(0 - 2)\n0.05', '\nChelsea\nBournemouth\n19/5/2024 15:00\n49\n27\n24\n1\n2 - 1\n2.78\n22°\n1.45\nFT 2 - 1\n(1 - 0)', '\nCrystal Palace\nAston Villa\n19/5/2024 15:00\n28\n31\n41\n2\n1 - 2\n2.91\n22°\n3.80\nFT 5 - 0\n(2 - 0)\n0.2', '\nLiverpool\nWolverhampton\n19/5/2024 15:00\n63\n23\n14\n1\n3 - 0\n4.15\n21°\n1.18\nFT 2 - 0\n(2 - 0)', '\nLuton Town\nFulham\n19/5/2024 15:00\n25\n39\n36\nX\n2 - 2\n3.43\n20°\n3.90\nFT 2 - 4\n(1 - 2)\n0.18', '\nManchester City\nWest Ham\n19/5/2024 15:00\n59\n26\n16\n1\n3 - 1\n4.34\n22°\n1.10\nFT 3 - 1\n(2 

#### Brief Notes:
- pattern_completed_matches: A regular expression pattern to match the completed matches data format. This pattern captures various parts of the match data, including win probabilities, team scores, average goals, weather, odds, and the actual scoreline.
- replacement_completed_matches: A replacement pattern to reformat the matched completed matches data by adding newline characters (\n) between the captured groups. This facilitates easier creation of a DataFrame by separating the future columns. 
eg  \n15196521 - 33.1921°1.25 FT 0 - 4  becomes  => \n15\n19\n65\n2\n1 - 3\n3.19\n21°\n1.25\nFT 0 - 4

#### Split by \n

In [44]:
# New array to hold split Data 
completed_matches_split = []

for i in range(len(completed_matches)):
    completed_matches_split.append(completed_matches[i].split('\n'))
    

In [45]:
print("\nCompleted Matches Split:\n")
print(completed_matches_split)



Completed Matches Split:

[['', 'Arsenal', 'Everton', '19/5/2024 15:00', '53', '20', '27', '1', '3 - 1', '3.16', '22°', '1.18', 'FT 2 - 1', '(1 - 1)'], ['', 'Brentford', 'Newcastle United', '19/5/2024 15:00', '27', '25', '49', '2', '1 - 3', '3.13', '22°', '2.20', 'FT 2 - 4', '(0 - 3)', '0.07'], ['', 'Brighton', 'Manchester United', '19/5/2024 15:00', '34', '34', '32', '1', '2 - 1', '2.88', '17°', '4.00', 'FT 0 - 2', '(0 - 0)'], ['', 'Burnley', 'Nottingham Forest', '19/5/2024 15:00', '38', '33', '28', '1', '2 - 1', '2.98', '19°', '2.90', 'FT 1 - 2', '(0 - 2)', '0.05'], ['', 'Chelsea', 'Bournemouth', '19/5/2024 15:00', '49', '27', '24', '1', '2 - 1', '2.78', '22°', '1.45', 'FT 2 - 1', '(1 - 0)'], ['', 'Crystal Palace', 'Aston Villa', '19/5/2024 15:00', '28', '31', '41', '2', '1 - 2', '2.91', '22°', '3.80', 'FT 5 - 0', '(2 - 0)', '0.2'], ['', 'Liverpool', 'Wolverhampton', '19/5/2024 15:00', '63', '23', '14', '1', '3 - 0', '4.15', '21°', '1.18', 'FT 2 - 0', '(2 - 0)'], ['', 'Luton Town', 

In [46]:
df_columns_completed_matches  = ['', 'home', 'away', 'date and time', 'home_win_probability', 'draw_probability', 'away_win_probability', 'team_to_win_prediction', 'scoreline_prediction', 'average_goals_prediction', 'weather_in_degrees', 'odds', 'full_time_score', 'score_at_halftime', "kelly_criterion"]

In [47]:
# Create DataFrame
df_completed_matches = pd.DataFrame(completed_matches_split, columns=df_columns_completed_matches)

# Drop first column
df_completed_matches = df_completed_matches.drop(columns=[''])

# Display DataFrame
print(df_completed_matches)

                 home               away    date and time  \
0             Arsenal            Everton  19/5/2024 15:00   
1           Brentford   Newcastle United  19/5/2024 15:00   
2            Brighton  Manchester United  19/5/2024 15:00   
3             Burnley  Nottingham Forest  19/5/2024 15:00   
4             Chelsea        Bournemouth  19/5/2024 15:00   
5      Crystal Palace        Aston Villa  19/5/2024 15:00   
6           Liverpool      Wolverhampton  19/5/2024 15:00   
7          Luton Town             Fulham  19/5/2024 15:00   
8     Manchester City           West Ham  19/5/2024 15:00   
9    Sheffield United          Tottenham  19/5/2024 15:00   
10             Fulham    Manchester City  11/5/2024 11:30   
11        Bournemouth          Brentford  11/5/2024 14:00   
12            Everton   Sheffield United  11/5/2024 14:00   
13   Newcastle United           Brighton  11/5/2024 14:00   
14          Tottenham            Burnley  11/5/2024 14:00   
15           West Ham   

We are going to skip EDA since the data varies week in week out. Later when i connect the data to a database, and keep track of certain trends maybe we can notice certain patterns

### DataBase and Model Readiness: 
We are still preparing the data, but from now on, everything we do has the model in mind, that means label encoding, converting string to floats dropping columns etc. so its a bit different from just cleaning the data. its more like tuning ? parameter tuning ?

We are going to create labels for the teams in the premier league, in order to have Home and Away values are numerical labels. important for logisit regression. Might not be the best model, but yh 
Sometimes reading the teams labels in figures is annoying, so you can chose to skip this part, and only copy the code for label encoding when you are ready to train the model

In [48]:
team_labels = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brighton': 4,
        'Burnley': 5,
        'Chelsea': 6,
        'Crystal Palace': 7,
        'Everton': 8,
        'Fulham': 9,
        'Leeds United': 10,
        'Leicester City': 11,
        'Liverpool': 12,
        'Manchester City': 13,
        'Manchester United': 14,
        'Newcastle United': 15,
        'Norwich City': 16,
        'Sheffield United': 17,
        'Southampton': 18,
        'Tottenham': 19,
        'West Ham': 20,
        'Luton Town': 21,
        'Wolverhampton': 22,
        'Brentford': 23,
        'Sheffield United': 24,
        'Nottingham Forest': 25
    }

In [49]:
def team_to_label(team_name):
    """
    Convert a team name to its corresponding label.

    Args:
        team_name (str): The name of the team.

    Returns:
        int: The corresponding label for the team name as per the team_labels dictionary.
    """
    return team_labels.get(team_name)


In [50]:
# Map home and away team names to corresponding labels.
df_completed_matches['home'] = df_completed_matches['home'].map(team_to_label)
df_completed_matches['away'] = df_completed_matches['away'].map(team_to_label)

print(df_completed_matches.head())

   home  away    date and time home_win_probability draw_probability  \
0     1     8  19/5/2024 15:00                   53               20   
1    23    15  19/5/2024 15:00                   27               25   
2     4    14  19/5/2024 15:00                   34               34   
3     5    25  19/5/2024 15:00                   38               33   
4     6     3  19/5/2024 15:00                   49               27   

  away_win_probability team_to_win_prediction scoreline_prediction  \
0                   27                      1                3 - 1   
1                   49                      2                1 - 3   
2                   32                      1                2 - 1   
3                   28                      1                2 - 1   
4                   24                      1                2 - 1   

  average_goals_prediction weather_in_degrees  odds full_time_score  \
0                     3.16                22°  1.18        FT 2 - 1   
1   

### Splitting Date and Time 

In [51]:
df_completed_matches[['date', 'time']] = df_completed_matches['date and time'].str.split(' ', expand=True)

df_completed_matches.drop(columns=['date and time'], inplace=True)

print(df_completed_matches.head()) 

   home  away home_win_probability draw_probability away_win_probability  \
0     1     8                   53               20                   27   
1    23    15                   27               25                   49   
2     4    14                   34               34                   32   
3     5    25                   38               33                   28   
4     6     3                   49               27                   24   

  team_to_win_prediction scoreline_prediction average_goals_prediction  \
0                      1                3 - 1                     3.16   
1                      2                1 - 3                     3.13   
2                      1                2 - 1                     2.88   
3                      1                2 - 1                     2.98   
4                      1                2 - 1                     2.78   

  weather_in_degrees  odds full_time_score score_at_halftime kelly_criterion  \
0                2

### Splitting the "Scoreline prediction" column into separate columns (Home goals, Away goals)

In [52]:
df_completed_matches[['home_team_score_prediction', 'away_team_score_prediction']] = df_completed_matches['scoreline_prediction'].str.split('-', expand=True)

# Converting the split columns to integers
df_completed_matches['home_team_score_prediction'] = df_completed_matches['home_team_score_prediction'].astype(int)
df_completed_matches['away_team_score_prediction'] = df_completed_matches['away_team_score_prediction'].astype(int)

df_completed_matches.drop(columns=['scoreline_prediction'], inplace=True)
# Example usage:
print(df_completed_matches.head())  # Display the first few rows to verify the changes


   home  away home_win_probability draw_probability away_win_probability  \
0     1     8                   53               20                   27   
1    23    15                   27               25                   49   
2     4    14                   34               34                   32   
3     5    25                   38               33                   28   
4     6     3                   49               27                   24   

  team_to_win_prediction average_goals_prediction weather_in_degrees  odds  \
0                      1                     3.16                22°  1.18   
1                      2                     3.13                22°  2.20   
2                      1                     2.88                17°  4.00   
3                      1                     2.98                19°  2.90   
4                      1                     2.78                22°  1.45   

  full_time_score score_at_halftime kelly_criterion       date   time  \
0

### Splitting the "Halftime scoreline" column into separate columns (Home goals, Away goals)

In [53]:
df_completed_matches[['home_team_full_time_score', 'away_team_full_time_score']] = df_completed_matches['full_time_score'].str.strip('FT ').str.split(' - ', expand=True)

df_completed_matches['away_team_full_time_score'] = df_completed_matches['away_team_full_time_score'].astype(int)
df_completed_matches['home_team_full_time_score'] = df_completed_matches['home_team_full_time_score'].astype(int)


In [54]:

df_completed_matches[['home_team_halftime_score', 'away_team_halftime_score']] = df_completed_matches['score_at_halftime'].str.strip('()').str.split(' - ', expand=True)


### Creating Prediction win/loss Column 

In [55]:
def create_y(df):
    """
    Create the target variable (y) based on the prediction results.

    Args:
        df (DataFrame): The DataFrame containing match predictions and actual outcomes.

    Returns:
        list: The target variable (y) indicating whether the prediction was correct (1) or not (0).
    """
    y = []
    for i in range(len(df)):
        if df['team_to_win_prediction'][i] == '1' and df['home_team_score_prediction'][i] > df['away_team_score_prediction'][i]:
            y.append(1)
        elif df['team_to_win_prediction'][i] == '2' and df['home_team_score_prediction'][i] < df['away_team_score_prediction'][i]:
            y.append(1)
        elif df['team_to_win_prediction'][i] == 'X' and df['home_team_score_prediction'][i] == df['away_team_score_prediction'][i]:
            y.append(1)
        else:
            y.append(0)
    return y

# Append the y column to the main DataFrame
df_completed_matches['prediction_result'] = create_y(df_completed_matches)


#### Brief Notes
This function iterates through each row in the DataFrame, checks the prediction result against the actual outcome, and assigns a value of 1 if the prediction was correct and 0 otherwise. This is what our model will be trained on. Can the model spot the patterns in predictions that are usually correct and affirm future predictions for upcoming matches ?

In [56]:
print(df_completed_matches)

    home  away home_win_probability draw_probability away_win_probability  \
0      1     8                   53               20                   27   
1     23    15                   27               25                   49   
2      4    14                   34               34                   32   
3      5    25                   38               33                   28   
4      6     3                   49               27                   24   
5      7     2                   28               31                   41   
6     12    22                   63               23                   14   
7     21     9                   25               39                   36   
8     13    20                   59               26                   16   
9     24    19                   33               17                   50   
10     9    13                   15               19                   65   
11     3    23                   34               37                   29   

## Feature engineering

We will make sure our columns are in the right format and again all with the model in mind. 

In [57]:
df_completed_matches[['home', 'away', 'team_to_win_prediction','prediction_result']] = df_completed_matches[['home', 'away', 'team_to_win_prediction','prediction_result']].astype('category')

In [58]:
# Convert probabilities to float
df_completed_matches['home_win_probability'] = df_completed_matches['home_win_probability'].astype(float)
df_completed_matches['draw_probability'] = df_completed_matches['draw_probability'].astype(float)
df_completed_matches['away_win_probability'] = df_completed_matches['away_win_probability'].astype(float)


In [59]:
# Convert average goals prediction to float
df_completed_matches['average_goals_prediction'] = df_completed_matches['average_goals_prediction'].astype(float)


In [61]:
# Convert relevant score columns to integers
df_completed_matches['home_team_score_prediction'] = df_completed_matches['home_team_score_prediction'].astype(int)
df_completed_matches['away_team_full_time_score'] = df_completed_matches['away_team_full_time_score'].astype(int)
df_completed_matches['home_team_halftime_score'] = df_completed_matches['home_team_halftime_score'].astype(int)
df_completed_matches['away_team_halftime_score'] = df_completed_matches['away_team_halftime_score'].astype(int)


In [60]:
df_completed_matches.drop(columns=['kelly_criterion'], inplace=True)


In [62]:
# Depending on your needs, you can extract features from date and time columns
# For example, to extract day of the week and month from 'Date':
# Convert 'Date' column to datetime with correct format
df_completed_matches['date'] = pd.to_datetime(df_completed_matches['date'], format='%d/%m/%Y')
df_completed_matches['day_of_week'] = df_completed_matches['date'].dt.dayofweek
df_completed_matches['month'] = df_completed_matches['date'].dt.month


In [64]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode the 'Team to win(prediction)' column
df_completed_matches['team_to_win_prediction'] = label_encoder.fit_transform(df_completed_matches['team_to_win_prediction'])

In [65]:
# Get the mapping of labels
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

print("Label Mapping:", label_mapping)

Label Mapping: {'1': 0, '2': 1, 'X': 2}


In [66]:
df_completed_matches['odds'] = df_completed_matches['odds'].astype(float)

In [67]:
df_completed_matches['weekly_round'] = weekly_round

### Validation and Logging: 

In [68]:
df_completed_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   home                        23 non-null     category      
 1   away                        23 non-null     category      
 2   home_win_probability        23 non-null     float64       
 3   draw_probability            23 non-null     float64       
 4   away_win_probability        23 non-null     float64       
 5   team_to_win_prediction      23 non-null     int32         
 6   average_goals_prediction    23 non-null     float64       
 7   weather_in_degrees          23 non-null     object        
 8   odds                        23 non-null     float64       
 9   full_time_score             23 non-null     object        
 10  score_at_halftime           23 non-null     object        
 11  date                        23 non-null     datetime64[ns]
 

In [69]:
df_completed_matches.head()

Unnamed: 0,home,away,home_win_probability,draw_probability,away_win_probability,team_to_win_prediction,average_goals_prediction,weather_in_degrees,odds,full_time_score,...,home_team_score_prediction,away_team_score_prediction,home_team_full_time_score,away_team_full_time_score,home_team_halftime_score,away_team_halftime_score,prediction_result,day_of_week,month,weekly_round
0,1,8,53.0,20.0,27.0,0,3.16,22°,1.18,FT 2 - 1,...,3,1,2,1,1,1,1,6,5,38
1,23,15,27.0,25.0,49.0,1,3.13,22°,2.2,FT 2 - 4,...,1,3,2,4,0,3,1,6,5,38
2,4,14,34.0,34.0,32.0,0,2.88,17°,4.0,FT 0 - 2,...,2,1,0,2,0,0,1,6,5,38
3,5,25,38.0,33.0,28.0,0,2.98,19°,2.9,FT 1 - 2,...,2,1,1,2,0,2,1,6,5,38
4,6,3,49.0,27.0,24.0,0,2.78,22°,1.45,FT 2 - 1,...,2,1,2,1,1,0,1,6,5,38


### PERSISTENT STORAGE

In [None]:
!pip install pymysql

In [70]:
engine = create_engine('mysql+pymysql://root:Kofi1999$@localhost:3306/bet_prediction_model')
df_completed_matches.to_sql('completed_matches', con=engine, if_exists='replace', index=False)

23