# Betting Prediction Model - Scraping Forebet Upcoming Matches

## Overview

This notebook scrapes Forebet's prediction table for upcoming matches. The scraped data is stored in a database and used to test our betting prediction model. The primary goals of this notebook are:

1. **Scrape Upcoming Matches Data**: Retrieve data for upcoming matches, including team names, win probabilities, and score predictions.
2. **Data Processing**: Clean and format the scraped data to ensure it is suitable for analysis and model testing.
3. **Database Insertion**: Store the cleaned data in the `upcoming_matches` table within our database.

## Requirements

- Python 3.x
- Libraries: `selenium`, `scikit-learn` (imported as `sklearn`), `pandas`, `sqlalchemy`, `re` (standard library)
- `pymysql` installed
- MySQL Workbench
- Access to the database where the `current_week_league_standings` and `testing_data` tables are located

## Notes

- Ensure you have the correct database credentials set up before running this notebook.
- This notebook should be run weekly to keep the `upcoming_matches` table updated with the fixtures that are yet to be played. This is necessary to test our model on new data.
- `testing_data` is a join of the `upcoming_matches` and `current_league_table_standings` tables.


In [None]:
!pip install selenium
!pip install pymysql
!pip install pandas
!pip install sqlalchemy
!pip install scikit-learn

In [2]:
from selenium import webdriver
from sklearn.preprocessing import LabelEncoder
from sqlalchemy import create_engine
import pandas as pd
import re
from modules import team_labels_forebet

In [None]:
# TEAM LABELS
team_labels_forebet = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brighton': 4,
        'Burnley': 5,
        'Chelsea': 6,
        'Crystal Palace': 7,
        'Everton': 8,
        'Fulham': 9,
        'Ipswich Town': 10,
        'Leeds United': 11,
        'Leicester City': 12,
        'Liverpool': 13,
        'Manchester City': 14,
        'Manchester United': 15,
        'Newcastle United': 16,
        'Norwich City': 17,
        # 18 is unused: 'Sheffield United' is assigned 25 below (a repeated key keeps only the last value)
        'Southampton': 19,
        'Tottenham': 20,
        'West Ham': 21,
        'Luton Town': 22,
        'Wolverhampton': 23,
        'Brentford': 24,
        'Sheffield United': 25,
        'Nottingham Forest': 26
    }

In [3]:
# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Open the Forebet predictions page for the English Premier League
driver.get('https://www.forebet.com/en/football-tips-and-predictions-for-england/premier-league')

In [4]:
# Find the fixture containers on the page (elements with the class name "schema")
match_fixture_containers = driver.find_elements("class name", "schema")
print(match_fixture_containers)

[<selenium.webdriver.remote.webelement.WebElement (session="559cd5499ff46239ab6af6b1ceb89fe7", element="f.FCAA7311E34D7181077BB4BD427BC39D.d.ECE0CF26B10A8918D10A4C6CEA532407.e.162")>, <selenium.webdriver.remote.webelement.WebElement (session="559cd5499ff46239ab6af6b1ceb89fe7", element="f.FCAA7311E34D7181077BB4BD427BC39D.d.ECE0CF26B10A8918D10A4C6CEA532407.e.163")>, <selenium.webdriver.remote.webelement.WebElement (session="559cd5499ff46239ab6af6b1ceb89fe7", element="f.FCAA7311E34D7181077BB4BD427BC39D.d.ECE0CF26B10A8918D10A4C6CEA532407.e.164")>]


In [5]:
fixtures_container = []

for fixture in match_fixture_containers:
    # Extract the text from the fixture container
    fixture_text = fixture.text
    fixtures_container.append(fixture_text)
    print(fixture_text)


Round 1
EPL
Manchester United
Fulham
16/8/2024 19:00
59211913 - 03.2317°1.53
PRE
VIEW
EPL
Ipswich Town
Liverpool
17/8/2024 11:30
21285121 - 33.3623°1.36
PRE
VIEW
EPL
Arsenal
Wolverhampton
17/8/2024 14:00
68151713 - 02.9724°1.22
PRE
VIEW
EPL
Everton
Brighton
17/8/2024 14:00
45262912 - 12.8618°2.50
PRE
VIEW
0.08
EPL
Newcastle United
Southampton
17/8/2024 14:00
59221913 - 13.4118°1.33
PRE
VIEW
EPL
Nottingham Forest
Bournemouth
17/8/2024 14:00
32284021 - 22.8222°2.80
PRE
VIEW
0.07
EPL
West Ham
Aston Villa
17/8/2024 16:30
37333012 - 12.7823°2.45
PRE
VIEW
EPL
Brentford
Crystal Palace
18/8/2024 13:00
255026X1 - 12.3624°3.40
PRE
VIEW
0.29
EPL
Chelsea
Manchester City
18/8/2024 15:30
39154621 - 22.9324°1.80
PRE
VIEW
EPL
Leicester City
Tottenham
19/8/2024 19:00
30224821 - 33.9117°1.60
PRE
VIEW
Fr2
Clermont
Pau
16/8/2024 18:00
1
Es2
Cadiz
Real Zaragoza
16/8/2024 19:30
1


In [6]:
driver.quit()

In [7]:
fixtures_container

['Round 1\nEPL\nManchester United\nFulham\n16/8/2024 19:00\n59211913 - 03.2317°1.53\nPRE\nVIEW\nEPL\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21285121 - 33.3623°1.36\nPRE\nVIEW\nEPL\nArsenal\nWolverhampton\n17/8/2024 14:00\n68151713 - 02.9724°1.22\nPRE\nVIEW\nEPL\nEverton\nBrighton\n17/8/2024 14:00\n45262912 - 12.8618°2.50\nPRE\nVIEW\n0.08\nEPL\nNewcastle United\nSouthampton\n17/8/2024 14:00\n59221913 - 13.4118°1.33\nPRE\nVIEW\nEPL\nNottingham Forest\nBournemouth\n17/8/2024 14:00\n32284021 - 22.8222°2.80\nPRE\nVIEW\n0.07\nEPL\nWest Ham\nAston Villa\n17/8/2024 16:30\n37333012 - 12.7823°2.45\nPRE\nVIEW\nEPL\nBrentford\nCrystal Palace\n18/8/2024 13:00\n255026X1 - 12.3624°3.40\nPRE\nVIEW\n0.29\nEPL\nChelsea\nManchester City\n18/8/2024 15:30\n39154621 - 22.9324°1.80\nPRE\nVIEW\nEPL\nLeicester City\nTottenham\n19/8/2024 19:00\n30224821 - 33.9117°1.60\nPRE\nVIEW',
 'Fr2\nClermont\nPau\n16/8/2024 18:00\n1',
 'Es2\nCadiz\nReal Zaragoza\n16/8/2024 19:30\n1']

### Cleaning the data 

In [8]:
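# Each EPL fixture is preceded by '\nEPL', so splitting on that marker separates the EPL fixtures.
# The first container holds the EPL matches; the remaining containers belong to other leagues.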
matches_data_cleaned_step_1 = [match.split("\nEPL") for match in fixtures_container]
epl_matches = matches_data_cleaned_step_1[0]
epl_matches

['Round 1',
 '\nManchester United\nFulham\n16/8/2024 19:00\n59211913 - 03.2317°1.53\nPRE\nVIEW',
 '\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21285121 - 33.3623°1.36\nPRE\nVIEW',
 '\nArsenal\nWolverhampton\n17/8/2024 14:00\n68151713 - 02.9724°1.22\nPRE\nVIEW',
 '\nEverton\nBrighton\n17/8/2024 14:00\n45262912 - 12.8618°2.50\nPRE\nVIEW\n0.08',
 '\nNewcastle United\nSouthampton\n17/8/2024 14:00\n59221913 - 13.4118°1.33\nPRE\nVIEW',
 '\nNottingham Forest\nBournemouth\n17/8/2024 14:00\n32284021 - 22.8222°2.80\nPRE\nVIEW\n0.07',
 '\nWest Ham\nAston Villa\n17/8/2024 16:30\n37333012 - 12.7823°2.45\nPRE\nVIEW',
 '\nBrentford\nCrystal Palace\n18/8/2024 13:00\n255026X1 - 12.3624°3.40\nPRE\nVIEW\n0.29',
 '\nChelsea\nManchester City\n18/8/2024 15:30\n39154621 - 22.9324°1.80\nPRE\nVIEW',
 '\nLeicester City\nTottenham\n19/8/2024 19:00\n30224821 - 33.9117°1.60\nPRE\nVIEW']
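
The split on `'\nEPL'` can be sketched as a small helper. This is a minimal illustration, assuming the container text looks like the output above; `split_epl_fixtures` is a hypothetical name, not part of the notebook. A container from another league has no marker, so it yields no fixtures.

```python
# Shortened sample of the scraped container text (taken from the output above).
epl_container = (
    'Round 1\nEPL\nManchester United\nFulham\n16/8/2024 19:00\n'
    '59211913 - 03.2317°1.53\nPRE\nVIEW\nEPL\nIpswich Town\nLiverpool\n'
    '17/8/2024 11:30\n21285121 - 33.3623°1.36\nPRE\nVIEW'
)
other_container = 'Fr2\nClermont\nPau\n16/8/2024 18:00\n1'


def split_epl_fixtures(text, marker='\nEPL'):
    """Hypothetical helper: return (header, fixtures) for one container."""
    # Everything before the first marker is the header (e.g. 'Round 1');
    # every piece after a marker is one fixture.
    header, *fixtures = text.split(marker)
    return header, fixtures


header, fixtures = split_epl_fixtures(epl_container)
print(header)         # header text before the first EPL marker
print(len(fixtures))  # number of EPL fixtures found

# A container without the marker produces an empty fixture list.
print(split_epl_fixtures(other_container)[1])
```

Checking for an empty fixture list is one way to skip non-EPL containers without relying on the EPL container always being first.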

In [9]:
upcoming_matches = []
for match in epl_matches:
    if 'FT' not in match:  # Keep only matches without 'FT' (full-time), i.e. not yet played
        upcoming_matches.append(match)
        

In [11]:
print("\nUpcoming Matches:")
print(upcoming_matches)


Upcoming Matches:
['Round 1', '\nManchester United\nFulham\n16/8/2024 19:00\n59211913 - 03.2317°1.53\nPRE\nVIEW', '\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21285121 - 33.3623°1.36\nPRE\nVIEW', '\nArsenal\nWolverhampton\n17/8/2024 14:00\n68151713 - 02.9724°1.22\nPRE\nVIEW', '\nEverton\nBrighton\n17/8/2024 14:00\n45262912 - 12.8618°2.50\nPRE\nVIEW\n0.08', '\nNewcastle United\nSouthampton\n17/8/2024 14:00\n59221913 - 13.4118°1.33\nPRE\nVIEW', '\nNottingham Forest\nBournemouth\n17/8/2024 14:00\n32284021 - 22.8222°2.80\nPRE\nVIEW\n0.07', '\nWest Ham\nAston Villa\n17/8/2024 16:30\n37333012 - 12.7823°2.45\nPRE\nVIEW', '\nBrentford\nCrystal Palace\n18/8/2024 13:00\n255026X1 - 12.3624°3.40\nPRE\nVIEW\n0.29', '\nChelsea\nManchester City\n18/8/2024 15:30\n39154621 - 22.9324°1.80\nPRE\nVIEW', '\nLeicester City\nTottenham\n19/8/2024 19:00\n30224821 - 33.9117°1.60\nPRE\nVIEW']


In [12]:
weekly_round  = upcoming_matches[0].split(' ')[-1]
print(weekly_round)

1
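
Splitting on a space works for a header such as `Round 1`. A slightly more defensive sketch, assuming the header always contains the word `Round` followed by a number, uses a regular expression and returns `None` when the header is missing. `extract_round` is a hypothetical helper.

```python
import re


def extract_round(header):
    """Hypothetical helper: return the round number as an int, or None."""
    match = re.search(r'Round\s+(\d+)', header)
    return int(match.group(1)) if match else None


print(extract_round('Round 1'))
print(extract_round('Round 38'))
print(extract_round('EPL'))  # no round header present
```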


In [13]:
# The first element is the 'Round N' header, so drop it to keep only the fixtures
upcoming_matches = upcoming_matches[1:]
print("\nUpcoming Matches:")
print(upcoming_matches)


Upcoming Matches:
['\nManchester United\nFulham\n16/8/2024 19:00\n59211913 - 03.2317°1.53\nPRE\nVIEW', '\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21285121 - 33.3623°1.36\nPRE\nVIEW', '\nArsenal\nWolverhampton\n17/8/2024 14:00\n68151713 - 02.9724°1.22\nPRE\nVIEW', '\nEverton\nBrighton\n17/8/2024 14:00\n45262912 - 12.8618°2.50\nPRE\nVIEW\n0.08', '\nNewcastle United\nSouthampton\n17/8/2024 14:00\n59221913 - 13.4118°1.33\nPRE\nVIEW', '\nNottingham Forest\nBournemouth\n17/8/2024 14:00\n32284021 - 22.8222°2.80\nPRE\nVIEW\n0.07', '\nWest Ham\nAston Villa\n17/8/2024 16:30\n37333012 - 12.7823°2.45\nPRE\nVIEW', '\nBrentford\nCrystal Palace\n18/8/2024 13:00\n255026X1 - 12.3624°3.40\nPRE\nVIEW\n0.29', '\nChelsea\nManchester City\n18/8/2024 15:30\n39154621 - 22.9324°1.80\nPRE\nVIEW', '\nLeicester City\nTottenham\n19/8/2024 19:00\n30224821 - 33.9117°1.60\nPRE\nVIEW']



##### Uncomment the code below for cases where there is only one match left to play

- upcoming_matches = upcoming_matches[1]
- print("\nUpcoming Matches:")
- print(upcoming_matches)


#### Now we can remove `PREVIEW`, if it exists

In [14]:
# Remove '\nPRE\nVIEW' from upcoming matches
upcoming_matches = [match.replace('\nPRE\nVIEW', '') for match in upcoming_matches]

In [15]:
print("Upcoming Matches (after removing noise):\n")
print(upcoming_matches)

Upcoming Matches (after removing noise):

['\nManchester United\nFulham\n16/8/2024 19:00\n59211913 - 03.2317°1.53', '\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21285121 - 33.3623°1.36', '\nArsenal\nWolverhampton\n17/8/2024 14:00\n68151713 - 02.9724°1.22', '\nEverton\nBrighton\n17/8/2024 14:00\n45262912 - 12.8618°2.50\n0.08', '\nNewcastle United\nSouthampton\n17/8/2024 14:00\n59221913 - 13.4118°1.33', '\nNottingham Forest\nBournemouth\n17/8/2024 14:00\n32284021 - 22.8222°2.80\n0.07', '\nWest Ham\nAston Villa\n17/8/2024 16:30\n37333012 - 12.7823°2.45', '\nBrentford\nCrystal Palace\n18/8/2024 13:00\n255026X1 - 12.3624°3.40\n0.29', '\nChelsea\nManchester City\n18/8/2024 15:30\n39154621 - 22.9324°1.80', '\nLeicester City\nTottenham\n19/8/2024 19:00\n30224821 - 33.9117°1.60']


In [16]:
# Remove '\nRound N' from upcoming matches
upcoming_matches = [re.sub(r'\nRound\s\d{1,2}', '', match) for match in upcoming_matches]

In [17]:
print("Upcoming Matches (after removing noise):\n")
print(upcoming_matches)

Upcoming Matches (after removing noise):

['\nManchester United\nFulham\n16/8/2024 19:00\n59211913 - 03.2317°1.53', '\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21285121 - 33.3623°1.36', '\nArsenal\nWolverhampton\n17/8/2024 14:00\n68151713 - 02.9724°1.22', '\nEverton\nBrighton\n17/8/2024 14:00\n45262912 - 12.8618°2.50\n0.08', '\nNewcastle United\nSouthampton\n17/8/2024 14:00\n59221913 - 13.4118°1.33', '\nNottingham Forest\nBournemouth\n17/8/2024 14:00\n32284021 - 22.8222°2.80\n0.07', '\nWest Ham\nAston Villa\n17/8/2024 16:30\n37333012 - 12.7823°2.45', '\nBrentford\nCrystal Palace\n18/8/2024 13:00\n255026X1 - 12.3624°3.40\n0.29', '\nChelsea\nManchester City\n18/8/2024 15:30\n39154621 - 22.9324°1.80', '\nLeicester City\nTottenham\n19/8/2024 19:00\n30224821 - 33.9117°1.60']


#### Preparing the array for DataFrame 

In [18]:
# Regular expression pattern for matching upcoming matches data
# Example matched string: '\n59211913 - 03.2317°1.53'
# This represents:
# 59 - Probability of home team win
# 21 - Probability of draw
# 19 - Probability of away team win
# 1 - Team to win (1 for home, X for draw, 2 for away)
# 3 - Home team score prediction
# 0 - Away team score prediction
# 3.23 - Average goals prediction
# 17° - Weather in degrees
# 1.53 - Odds

# The decimal points are escaped (\.) so they match a literal '.' only
pattern = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}\.\d{2})(\d{2}°|\d{1}°)(\d?\.\d{2})'

replacement = r'\n\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8'

def replace_upcoming_matches(text):
    """
    Replace matches of the upcoming matches pattern in the given text
    with the specified replacement pattern.

    Args:
        text (str): The input text containing upcoming matches data.

    Returns:
        str: The text with replaced patterns.
    """
    return re.sub(pattern, replacement, text)


for i in range(len(upcoming_matches)):
    upcoming_matches[i] = replace_upcoming_matches(upcoming_matches[i])


print("Upcoming Matches (after removing noise):\n")
print(upcoming_matches)

Upcoming Matches (after removing noise):

['\nManchester United\nFulham\n16/8/2024 19:00\n59\n21\n19\n1\n3 - 0\n3.23\n17°\n1.53', '\nIpswich Town\nLiverpool\n17/8/2024 11:30\n21\n28\n51\n2\n1 - 3\n3.36\n23°\n1.36', '\nArsenal\nWolverhampton\n17/8/2024 14:00\n68\n15\n17\n1\n3 - 0\n2.97\n24°\n1.22', '\nEverton\nBrighton\n17/8/2024 14:00\n45\n26\n29\n1\n2 - 1\n2.86\n18°\n2.50\n0.08', '\nNewcastle United\nSouthampton\n17/8/2024 14:00\n59\n22\n19\n1\n3 - 1\n3.41\n18°\n1.33', '\nNottingham Forest\nBournemouth\n17/8/2024 14:00\n32\n28\n40\n2\n1 - 2\n2.82\n22°\n2.80\n0.07', '\nWest Ham\nAston Villa\n17/8/2024 16:30\n37\n33\n30\n1\n2 - 1\n2.78\n23°\n2.45', '\nBrentford\nCrystal Palace\n18/8/2024 13:00\n25\n50\n26\nX\n1 - 1\n2.36\n24°\n3.40\n0.29', '\nChelsea\nManchester City\n18/8/2024 15:30\n39\n15\n46\n2\n1 - 2\n2.93\n24°\n1.80', '\nLeicester City\nTottenham\n19/8/2024 19:00\n30\n22\n48\n2\n1 - 3\n3.91\n17°\n1.60']


#### Brief Notes:
- `pattern`: A regular expression that matches the Forebet prediction string. It captures the win probabilities, predicted winner, predicted scoreline, average goals, weather, and odds.
- `replacement`: A replacement pattern that reformats the matched data by adding newline characters (`\n`) between the captured groups. This makes it easier to create a DataFrame, because each captured group becomes a future column.

e.g. `\n59211913 - 03.2317°1.53` becomes `\n59\n21\n19\n1\n3 - 0\n3.23\n17°\n1.53`
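
As an alternative to the newline substitution, the same fields can be read with named groups. This is a sketch under the same assumptions the notebook makes: each probability is exactly two digits and each predicted score is a single digit. `FOREBET_PATTERN` and `parse_prediction` are hypothetical names introduced for illustration.

```python
import re

# Named groups mirror the fields described in the notes above.
FOREBET_PATTERN = re.compile(
    r'(?P<home_win>\d{2})(?P<draw>\d{2})(?P<away_win>\d{2})'
    r'(?P<pick>[12X])'
    r'(?P<home_goals>\d) - (?P<away_goals>\d)'
    r'(?P<avg_goals>\d\.\d{2})'
    r'(?P<temperature>\d{1,2})°'
    r'(?P<odds>\d+\.\d{2})'
)


def parse_prediction(line):
    """Hypothetical helper: return a dict of fields, or None if no match."""
    match = FOREBET_PATTERN.fullmatch(line)
    return match.groupdict() if match else None


parsed = parse_prediction('59211913 - 03.2317°1.53')
print(parsed)

# A predicted draw uses 'X' as the pick.
print(parse_prediction('255026X1 - 12.3624°3.40')['pick'])
```

Returning `None` on a failed match makes it easier to spot rows whose format has changed, instead of passing them through unmodified.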

In [20]:
# New list to hold the split data
upcoming_matches_split = []

for i in range(len(upcoming_matches)):
    upcoming_matches_split.append(upcoming_matches[i].split('\n'))

print("Full Matches split by slash n:\n")
print(upcoming_matches_split)

Full Matches split by slash n:

[['', 'Manchester United', 'Fulham', '16/8/2024 19:00', '59', '21', '19', '1', '3 - 0', '3.23', '17°', '1.53'], ['', 'Ipswich Town', 'Liverpool', '17/8/2024 11:30', '21', '28', '51', '2', '1 - 3', '3.36', '23°', '1.36'], ['', 'Arsenal', 'Wolverhampton', '17/8/2024 14:00', '68', '15', '17', '1', '3 - 0', '2.97', '24°', '1.22'], ['', 'Everton', 'Brighton', '17/8/2024 14:00', '45', '26', '29', '1', '2 - 1', '2.86', '18°', '2.50', '0.08'], ['', 'Newcastle United', 'Southampton', '17/8/2024 14:00', '59', '22', '19', '1', '3 - 1', '3.41', '18°', '1.33'], ['', 'Nottingham Forest', 'Bournemouth', '17/8/2024 14:00', '32', '28', '40', '2', '1 - 2', '2.82', '22°', '2.80', '0.07'], ['', 'West Ham', 'Aston Villa', '17/8/2024 16:30', '37', '33', '30', '1', '2 - 1', '2.78', '23°', '2.45'], ['', 'Brentford', 'Crystal Palace', '18/8/2024 13:00', '25', '50', '26', 'X', '1 - 1', '2.36', '24°', '3.40', '0.29'], ['', 'Chelsea', 'Manchester City', '18/8/2024 15:30', '39', '15

In [21]:
df_columns_upcoming_matches  = ['', 'home', 'away', 'date_and_time', 'home_win_probability', 'draw_probability', 'away_team_win_probability', 'team_to_win_prediction', 'scoreline_prediction', 'average_goals_prediction', 'weather_in_degrees', 'odds', "kelly_criterion"]
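
Only some rows carry a trailing Kelly criterion value, so the split rows have 12 or 13 items while the column list has 13 names. A minimal sketch, assuming shorter rows should be filled with `None`, pads every row to the column count explicitly before building the DataFrame. `pad_rows` is a hypothetical helper.

```python
def pad_rows(rows, width, fill=None):
    """Hypothetical helper: pad each row with `fill` up to `width` items."""
    padded = []
    for row in rows:
        if len(row) > width:
            # More values than columns means the row format has changed.
            raise ValueError(f'row has {len(row)} items, expected at most {width}')
        padded.append(row + [fill] * (width - len(row)))
    return padded


columns = ['', 'home', 'away', 'date_and_time', 'home_win_probability',
           'draw_probability', 'away_team_win_probability',
           'team_to_win_prediction', 'scoreline_prediction',
           'average_goals_prediction', 'weather_in_degrees', 'odds',
           'kelly_criterion']

rows = [
    ['', 'Manchester United', 'Fulham', '16/8/2024 19:00', '59', '21', '19',
     '1', '3 - 0', '3.23', '17°', '1.53'],
    ['', 'Everton', 'Brighton', '17/8/2024 14:00', '45', '26', '29',
     '1', '2 - 1', '2.86', '18°', '2.50', '0.08'],
]

padded = pad_rows(rows, len(columns))
print([len(row) for row in padded])
```

Raising on rows that are too long surfaces a changed page layout early, rather than failing later inside the DataFrame constructor.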

In [22]:
df_upcoming_matches = pd.DataFrame(upcoming_matches_split, columns= df_columns_upcoming_matches)

# Drop first column
df_upcoming_matches = df_upcoming_matches.drop(columns=[''])

    

# Display DataFrame
print(df_upcoming_matches)

                home             away    date and time home_win_probability  \
0  Manchester United           Fulham  16/8/2024 19:00                   59   
1       Ipswich Town        Liverpool  17/8/2024 11:30                   21   
2            Arsenal    Wolverhampton  17/8/2024 14:00                   68   
3            Everton         Brighton  17/8/2024 14:00                   45   
4   Newcastle United      Southampton  17/8/2024 14:00                   59   
5  Nottingham Forest      Bournemouth  17/8/2024 14:00                   32   
6           West Ham      Aston Villa  17/8/2024 16:30                   37   
7          Brentford   Crystal Palace  18/8/2024 13:00                   25   
8            Chelsea  Manchester City  18/8/2024 15:30                   39   
9     Leicester City        Tottenham  19/8/2024 19:00                   30   

  draw_probability away_team_win_probability team_to_win_prediction  \
0               21                        19               

We are going to skip EDA since the data varies from week to week. Later, once the data is stored in a database and we can track trends over time, we may notice certain patterns.


### Database and Model Readiness:
We are still preparing the data, but from now on everything we do has the model in mind: label encoding, converting strings to floats, dropping columns, and so on. This goes a bit beyond cleaning the data; it is closer to feature preprocessing than to parameter tuning.

We are going to create labels for the teams in the Premier League so that the `home` and `away` values are numerical labels. This is important for logistic regression, which might not be the best model but is a starting point.
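
Mapping with a dictionary leaves a missing value for any team that is not in it, for example a newly promoted side. The sketch below, which uses a small hypothetical subset of the labels and a made-up two-row frame, lists unmapped names before encoding. `unmapped_teams` is a hypothetical helper.

```python
import pandas as pd

# Small hypothetical subset of the team labels, for illustration only.
team_labels = {'Arsenal': 1, 'Fulham': 9}

df = pd.DataFrame({
    'home': ['Arsenal', 'Clermont'],
    'away': ['Fulham', 'Arsenal'],
})


def unmapped_teams(frame, labels, columns=('home', 'away')):
    """Hypothetical helper: team names in `columns` that have no label."""
    names = pd.concat([frame[column] for column in columns])
    return sorted(set(names) - set(labels))


missing = unmapped_teams(df, team_labels)
print(missing)

# Series.map with a dict yields NaN for names that are not keys.
encoded_home = df['home'].map(team_labels)
print(encoded_home.isna().sum())
```

Running a check like this before the mapping step avoids writing missing team labels to the database unnoticed.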


In [23]:
def team_to_label(team_name):
    """
    Convert a team name to its corresponding label.

    Args:
        team_name (str): The name of the team.

    Returns:
        int or None: The label for the team name from the team_labels_forebet dictionary, or None if the team is not in it.
    """
    return team_labels_forebet.get(team_name)


In [65]:
df_upcoming_matches['home'] = df_upcoming_matches['home'].map(team_to_label)
df_upcoming_matches['away'] = df_upcoming_matches['away'].map(team_to_label)

print(df_upcoming_matches.head())



   Home  Away    Date and time Home Win probability Draw probability  \
0     1     8  19/5/2024 15:00                   53               20   
1    23    15  19/5/2024 15:00                   27               25   
2     4    14  19/5/2024 15:00                   34               34   
3     5    25  19/5/2024 15:00                   38               33   
4     6     3  19/5/2024 15:00                   49               27   

  Away team win probability Team to win(prediction) Scoreline prediction  \
0                        27                       1                3 - 1   
1                        49                       2                1 - 3   
2                        32                       1                2 - 1   
3                        28                       1                2 - 1   
4                        24                       1                2 - 1   

  Average goals Weather in degrees  Odds Kelly Criterion  
0          3.16                19°  1.18           

In [26]:
# Split the 'scoreline_prediction' column into separate columns (home goals, away goals)
df_upcoming_matches[['home_team_score_prediction', 'away_team_score_prediction']] = df_upcoming_matches['scoreline_prediction'].str.split('-', expand=True)

# Converting the split columns to integers
df_upcoming_matches['home_team_score_prediction'] = df_upcoming_matches['home_team_score_prediction'].astype(int)
df_upcoming_matches['away_team_score_prediction'] = df_upcoming_matches['away_team_score_prediction'].astype(int)
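
Splitting on `'-'` leaves surrounding spaces in each part, which the integer conversion tolerates. An alternative sketch, assuming every scoreline has the form `'<digits> - <digits>'`, uses `str.extract` so that only the digits are captured. The sample values come from the output earlier in this notebook.

```python
import pandas as pd

scorelines = pd.Series(['3 - 0', '1 - 3', '1 - 1'], name='scoreline_prediction')

# Each capture group becomes one column; whitespace around '-' is ignored.
goals = scorelines.str.extract(r'^\s*(\d+)\s*-\s*(\d+)\s*$')
goals.columns = ['home_team_score_prediction', 'away_team_score_prediction']
goals = goals.astype(int)

print(goals)
```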

## Feature engineering
We will make sure our columns are in the right format for the model.

In [28]:

# Convert probabilities to float
df_upcoming_matches['home_win_probability'] = df_upcoming_matches['home_win_probability'].astype(float)
df_upcoming_matches['draw_probability'] = df_upcoming_matches['draw_probability'].astype(float)
df_upcoming_matches['away_team_win_probability'] = df_upcoming_matches['away_team_win_probability'].astype(float)


# Convert average goals prediction and odds to float
df_upcoming_matches['average_goals_prediction'] = df_upcoming_matches['average_goals_prediction'].astype(float)
df_upcoming_matches['odds'] = df_upcoming_matches['odds'].astype(float)

In [29]:
df_upcoming_matches.drop(columns=['kelly_criterion'], inplace=True)

In [34]:
df_upcoming_matches[['date', 'time']] = df_upcoming_matches['date_and_time'].str.split(' ', expand=True)

In [35]:
# Convert 'date' column to datetime with the correct format (day/month/year)
df_upcoming_matches['date'] = pd.to_datetime(df_upcoming_matches['date'], format='%d/%m/%Y')
df_upcoming_matches['day_of_week'] = df_upcoming_matches['date'].dt.dayofweek
df_upcoming_matches['month'] = df_upcoming_matches['date'].dt.month
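
The date features can also be derived from a single parsed timestamp. This sketch assumes the `date_and_time` strings always follow the day/month/year hour:minute layout seen in the scraped output; the `kickoff` and `hour` columns are hypothetical additions, not part of the notebook's schema.

```python
import pandas as pd

df = pd.DataFrame({'date_and_time': ['16/8/2024 19:00', '17/8/2024 11:30']})

# Parse date and time together; an explicit format avoids day/month ambiguity.
df['kickoff'] = pd.to_datetime(df['date_and_time'], format='%d/%m/%Y %H:%M')

df['day_of_week'] = df['kickoff'].dt.dayofweek  # Monday is 0, Sunday is 6
df['month'] = df['kickoff'].dt.month
df['hour'] = df['kickoff'].dt.hour              # hypothetical extra feature

print(df[['day_of_week', 'month', 'hour']])
```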

In [36]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode the 'team_to_win_prediction' column
df_upcoming_matches['team_to_win_prediction'] = label_encoder.fit_transform(df_upcoming_matches['team_to_win_prediction'])

print(df_upcoming_matches.info())

print(df_upcoming_matches.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   home                        10 non-null     object        
 1   away                        10 non-null     object        
 2   date and time               10 non-null     object        
 3   home_win_probability        10 non-null     float64       
 4   draw_probability            10 non-null     float64       
 5   away_team_win_probability   10 non-null     float64       
 6   team_to_win_prediction      10 non-null     int64         
 7   Scoreline prediction        10 non-null     object        
 8   average_goals_prediction    10 non-null     float64       
 9   weather_in_degrees          10 non-null     object        
 10  odds                        10 non-null     float64       
 11  home_team_score_prediction  10 non-null     int64         
 1

In [37]:
# Get the mapping of labels
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

print("Label Mapping:", label_mapping)

Label Mapping: {'1': np.int64(0), '2': np.int64(1), 'X': np.int64(2)}
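
`LabelEncoder` assigns codes from the sorted set of classes it sees during `fit`, so the mapping above holds only when all three outcomes appear in the week's fixtures. In a week with no predicted home win, for instance, `'2'` would be encoded as 0. One possible way to keep the codes stable from week to week, sketched here as an assumption rather than the notebook's method, is a fixed dictionary. `OUTCOME_CODES` is a hypothetical name.

```python
import pandas as pd

# Fixed codes matching the label mapping printed above.
OUTCOME_CODES = {'1': 0, '2': 1, 'X': 2}

# Hypothetical week in which no home win is predicted.
predictions = pd.Series(['2', 'X', '2'])

encoded = predictions.map(OUTCOME_CODES)
print(encoded.tolist())
```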


In [38]:
weekly_round = int(weekly_round)

In [39]:
df_upcoming_matches['weekly_round'] = weekly_round

### Validation and Logging: 

In [40]:
df_upcoming_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   home                        10 non-null     object        
 1   away                        10 non-null     object        
 2   date and time               10 non-null     object        
 3   home_win_probability        10 non-null     float64       
 4   draw_probability            10 non-null     float64       
 5   away_team_win_probability   10 non-null     float64       
 6   team_to_win_prediction      10 non-null     int64         
 7   Scoreline prediction        10 non-null     object        
 8   average_goals_prediction    10 non-null     float64       
 9   weather_in_degrees          10 non-null     object        
 10  odds                        10 non-null     float64       
 11  home_team_score_prediction  10 non-null     int64         
 1

#### Save in DB

In [41]:
# Database connection
user = '<test_user>'
password = '<password>'
host = 'localhost'
port = 3306
database = 'bet_prediction_model'

In [42]:
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}:{port}/{database}')
df_upcoming_matches.to_sql('upcoming_matches', con=engine, if_exists='replace', index=False)

10
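
If the database password contains characters such as `@` or `/`, placing it directly in the connection string breaks the URL. A minimal sketch, assuming the same `mysql+pymysql` URL layout used above, escapes the credentials with the standard library first. `build_mysql_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Hypothetical helper: build a connection URL with escaped credentials."""
    return (
        f'mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}'
        f'@{host}:{port}/{database}'
    )


# Placeholder credentials for illustration only.
url = build_mysql_url('test_user', 'p@ss/word', 'localhost', 3306,
                      'bet_prediction_model')
print(url)
```

The resulting string can be passed to `create_engine` in place of the f-string used in the cell above.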