# Betting Analysis Project

## Goal:
The goal of this project is to analyze betting data (Previous history as well as a couple Dummy data) from Betway in order to identify patterns and trends that can inform a more strategic approach to betting on football (soccer) matches, with a focus on the English Premier League (EPL). By leveraging machine learning techniques, we aim to develop a model that can predict the outcomes of EPL matches based on historical data, ultimately improving the accuracy of betting decisions and maximizing potential winnings.

## Milestones:
1. **Data Collection and Preparation:**
   - Gather historical betting data from Betway, focusing on EPL matches.
   - Preprocess the data to ensure consistency and suitability for analysis.
   
2. **Exploratory Data Analysis (EDA) (Optional):**
   - Conduct exploratory analysis to identify patterns and trends in the data.
   - Visualize key insights such as betting trends over time, popular bet types, and correlations between variables.
   
3. **Model Development:**
   - Select appropriate machine learning algorithms for predicting match outcomes.
   - Train the model using historical data, tuning hyperparameters for optimal performance.
   
4. **Evaluation and Validation:**
   - Evaluate the model's performance using cross-validation techniques.
   - Validate the model's predictions against real-world outcomes to assess accuracy and reliability.
   
5. **Deployment and Monitoring:**
   - Deploy the trained model to a production environment for ongoing use.
   - Implement monitoring mechanisms to track model performance and update as necessary.

6. **Continuous Improvement:**
   - Continuously refine the model based on new data and feedback.
   - Explore additional features and strategies to enhance predictive capabilities.


## Data Collection and Preparation:

Data collection can be achieved through web scraping, API integration, or utilizing existing datasets:

1. **Web Scraping:**
   - Develop a scraper to extract betting data from Betway.
   - Define scraping logic for fixture details, bet types, and odds.
   - Implement error handling for data validation.

2. **API Integration:**
   - Explore Betway APIs for historical betting data.
   - Authenticate and make requests for EPL matches.
   - Parse and transform API responses.

3. **Existing Datasets:**
   - Investigate publicly available datasets for EPL betting data.
   - Assess quality and completeness for integration.

By combining web scraping, API integration, and existing datasets, we can gather comprehensive data for analysis and model development.


In [2]:
!pip install selenium==4.4.3



In [None]:
from selenium import webdriver

In [11]:
PATH = 'C:/Users/kevin/Desktop/tools/chromedriver-win64/chromedriver'
driver = webdriver.Chrome(PATH) 
driver.get('https://www.forebet.com/en/football-tips-and-predictions-for-england/premier-league');

  driver = webdriver.Chrome(PATH)


In [12]:
# Now find the fixture containers within the body element
match_fixture_containers = driver.find_elements("class name", "schema")
print(match_fixture_containers)

[<selenium.webdriver.remote.webelement.WebElement (session="8d2150a0bda4ce24750990b3d99f1a8d", element="f.38929082481E45371F22C6B6F7A2D360.d.F75B228941A63299F9372B9867696BF0.e.40")>, <selenium.webdriver.remote.webelement.WebElement (session="8d2150a0bda4ce24750990b3d99f1a8d", element="f.38929082481E45371F22C6B6F7A2D360.d.F75B228941A63299F9372B9867696BF0.e.158")>, <selenium.webdriver.remote.webelement.WebElement (session="8d2150a0bda4ce24750990b3d99f1a8d", element="f.38929082481E45371F22C6B6F7A2D360.d.F75B228941A63299F9372B9867696BF0.e.159")>]


In [13]:
fixtures_container = []

for fixture in match_fixture_containers:
    # Extract the text from the fixture container
    fixture_text = fixture.text
    fixtures_container.append(fixture_text)
    print(fixture_text)

Round 36
EPL
Chelsea
Tottenham
2/5/2024 18:30
37342913 - 13.9515°2.15 FT 2 - 0
(1 - 0)
EPL
Luton Town
Everton
3/5/2024 19:00
30274321 - 23.889°2.40 FT 1 - 1
(1 - 1)
0.02
EPL
Arsenal
Bournemouth
4/5/2024 11:30
56232113 - 13.1817°1.22 FT 3 - 0
(1 - 0)
EPL
Brentford
Fulham
4/5/2024 14:00
42352312 - 02.4114°2.10 FT 0 - 0
(0 - 0)
EPL
Burnley
Newcastle United
4/5/2024 14:00
21265320 - 11.9011°1.95 FT 1 - 4
(0 - 3)
0.04
EPL
Sheffield United
Nottingham Forest
4/5/2024 14:00
333630X2 - 23.0014°4.20 FT 1 - 3
(1 - 1)
0.16
EPL
Manchester City
Wolverhampton
4/5/2024 16:30
56281613 - 04.1913°1.11 FT 5 - 1
(3 - 0)
EPL
Brighton
Aston Villa
5/5/2024 13:00
22364221 - 22.9612°2.55 FT 1 - 0
(0 - 0)
0.05
EPL
Chelsea
West Ham
5/5/2024 13:00
39382312 - 12.5315°1.65 FT 5 - 0
(3 - 0)
EPL
Liverpool
Tottenham
5/5/2024 15:30
43352112 - 13.1717°1.50 FT 4 - 2
(2 - 0)
EPL
Crystal Palace
Manchester United
6/5/2024 19:00
25354021 - 23.5312°2.70
PRE
VIEW
0.05
Round 35
EPL
West Ham
Liverpool
27/4/2024 11:30
22314721 - 2

In [5]:
driver.quit()

## Understanding Match Information

When we visit the Forebet website, the match information appears like the image provided. For instance, let's consider the first game info:

![Screenshot](attachment:Screenshot%20%28617%29.png)

In our scraped data, the corresponding information for the same match was:
Chelsea
Tottenham
2/5/2024 18:30
37342913 - 13.9513°2.15

which represents 

- **Chelsea vs. Tottenham**
- **Date and Time:** 2/5/2024, 18:30
- **Probabilities:** 37-34-29
- **Outcome Prediction:** 1
- **Predicted Scoreline:** 3-1
- **Average Goals:** 3.95
- **Weather:** 13°C
- **Odds:** 2.15

**Key points to note:**
- **Probabilities:** 37 indicates the home team's win probability (1), 34 represents a draw (X), and 29 indicates the away team's win probability (2).
- **Outcome Prediction:** 1 implies a prediction for the home team to win, which could also be X (draw) or 2 (away team win).
- **Predicted Scoreline:** 3-1 means the home team is predicted to score 3 goals and concede 1.
- **Average Goals:** 3.95 denotes the average number of goals expected in the match.
- **Weather:** 13°C represents the temperature.
- **Odds:** 2.15 signifies the betting odds.

Additionally, previous fixtures include extra information about whether the team won or lost and the actual scoreline, which can be used for model training. For example:

Nottingham Forest
Manchester City
28/4/2024 15:30
16226220 - 42.9611°1.30 FT 0 - 2
(0 - 1)

represents
- **Nottingham Forest vs. Manchester City**
- **Date and Time:** 28/4/2024, 15:30
- **Prediction:** 1
- **Predicted Scoreline:** 0-2 (0-1 at halftime)

This guide will help us create dummy data for our analysis.


In [14]:
fixtures_container

['Round 36\nEPL\nChelsea\nTottenham\n2/5/2024 18:30\n37342913 - 13.9515°2.15 FT 2 - 0\n(1 - 0)\nEPL\nLuton Town\nEverton\n3/5/2024 19:00\n30274321 - 23.889°2.40 FT 1 - 1\n(1 - 1)\n0.02\nEPL\nArsenal\nBournemouth\n4/5/2024 11:30\n56232113 - 13.1817°1.22 FT 3 - 0\n(1 - 0)\nEPL\nBrentford\nFulham\n4/5/2024 14:00\n42352312 - 02.4114°2.10 FT 0 - 0\n(0 - 0)\nEPL\nBurnley\nNewcastle United\n4/5/2024 14:00\n21265320 - 11.9011°1.95 FT 1 - 4\n(0 - 3)\n0.04\nEPL\nSheffield United\nNottingham Forest\n4/5/2024 14:00\n333630X2 - 23.0014°4.20 FT 1 - 3\n(1 - 1)\n0.16\nEPL\nManchester City\nWolverhampton\n4/5/2024 16:30\n56281613 - 04.1913°1.11 FT 5 - 1\n(3 - 0)\nEPL\nBrighton\nAston Villa\n5/5/2024 13:00\n22364221 - 22.9612°2.55 FT 1 - 0\n(0 - 0)\n0.05\nEPL\nChelsea\nWest Ham\n5/5/2024 13:00\n39382312 - 12.5315°1.65 FT 5 - 0\n(3 - 0)\nEPL\nLiverpool\nTottenham\n5/5/2024 15:30\n43352112 - 13.1717°1.50 FT 4 - 2\n(2 - 0)\nEPL\nCrystal Palace\nManchester United\n6/5/2024 19:00\n25354021 - 23.5312°2.70\nPR

### Cleaning the data 

#### Step 1

In [15]:
# Luckily the data has \nEpl which we can use to get epl matches from other leagues.
matches_data_cleaned_step_1 = [match.split("\nEPL") for match in fixtures_container]
epl_matches = matches_data_cleaned_step_1[0]
epl_matches

['Round 36',
 '\nChelsea\nTottenham\n2/5/2024 18:30\n37342913 - 13.9515°2.15 FT 2 - 0\n(1 - 0)',
 '\nLuton Town\nEverton\n3/5/2024 19:00\n30274321 - 23.889°2.40 FT 1 - 1\n(1 - 1)\n0.02',
 '\nArsenal\nBournemouth\n4/5/2024 11:30\n56232113 - 13.1817°1.22 FT 3 - 0\n(1 - 0)',
 '\nBrentford\nFulham\n4/5/2024 14:00\n42352312 - 02.4114°2.10 FT 0 - 0\n(0 - 0)',
 '\nBurnley\nNewcastle United\n4/5/2024 14:00\n21265320 - 11.9011°1.95 FT 1 - 4\n(0 - 3)\n0.04',
 '\nSheffield United\nNottingham Forest\n4/5/2024 14:00\n333630X2 - 23.0014°4.20 FT 1 - 3\n(1 - 1)\n0.16',
 '\nManchester City\nWolverhampton\n4/5/2024 16:30\n56281613 - 04.1913°1.11 FT 5 - 1\n(3 - 0)',
 '\nBrighton\nAston Villa\n5/5/2024 13:00\n22364221 - 22.9612°2.55 FT 1 - 0\n(0 - 0)\n0.05',
 '\nChelsea\nWest Ham\n5/5/2024 13:00\n39382312 - 12.5315°1.65 FT 5 - 0\n(3 - 0)',
 '\nLiverpool\nTottenham\n5/5/2024 15:30\n43352112 - 13.1717°1.50 FT 4 - 2\n(2 - 0)',
 '\nCrystal Palace\nManchester United\n6/5/2024 19:00\n25354021 - 23.5312°2.70\nPR

#### Step 2 
Now we need to remove unnecessary information in our data like Round 36 and PreView

Plans
- Split by \n. the data will be easier to handle 
- We can either use regex to remove certain words from the array (Im wondering how intermittent noise in the data might affect the arragnment and consistency of end result.
- Also we need to filter out Matches that have full information as completed matches vs Upcoming matches 
- Finally we can focus on splitting strings into win probabilities weather info odds, etc.

In [24]:
completed_matches = []
upcoming_matches = []

for match in epl_matches:
    if 'FT' in match:  # Check if 'FT' (full-time) is present in the match string
        completed_matches.append(match)
    else:
        upcoming_matches.append(match)

print("Completed Matches:")
print(completed_matches)
print("\nUpcoming Matches:")
print(upcoming_matches)

Completed Matches:
['\nChelsea\nTottenham\n2/5/2024 18:30\n37342913 - 13.9515°2.15 FT 2 - 0\n(1 - 0)', '\nLuton Town\nEverton\n3/5/2024 19:00\n30274321 - 23.889°2.40 FT 1 - 1\n(1 - 1)\n0.02', '\nArsenal\nBournemouth\n4/5/2024 11:30\n56232113 - 13.1817°1.22 FT 3 - 0\n(1 - 0)', '\nBrentford\nFulham\n4/5/2024 14:00\n42352312 - 02.4114°2.10 FT 0 - 0\n(0 - 0)', '\nBurnley\nNewcastle United\n4/5/2024 14:00\n21265320 - 11.9011°1.95 FT 1 - 4\n(0 - 3)\n0.04', '\nSheffield United\nNottingham Forest\n4/5/2024 14:00\n333630X2 - 23.0014°4.20 FT 1 - 3\n(1 - 1)\n0.16', '\nManchester City\nWolverhampton\n4/5/2024 16:30\n56281613 - 04.1913°1.11 FT 5 - 1\n(3 - 0)', '\nBrighton\nAston Villa\n5/5/2024 13:00\n22364221 - 22.9612°2.55 FT 1 - 0\n(0 - 0)\n0.05', '\nChelsea\nWest Ham\n5/5/2024 13:00\n39382312 - 12.5315°1.65 FT 5 - 0\n(3 - 0)', '\nLiverpool\nTottenham\n5/5/2024 15:30\n43352112 - 13.1717°1.50 FT 4 - 2\n(2 - 0)', '\nWest Ham\nLiverpool\n27/4/2024 11:30\n22314721 - 22.9410°1.53 FT 2 - 2\n(1 - 0)', 

In certain cases, some fixtures may be listed as upcoming fixtures even after being played. For example, Friday fixtures may remain in the upcoming fixtures section on the website until all weekly fixtures have been completed
If thats the case use the last code after 

In [18]:
# Now that we have split the epl_matches into two we can clearly remove the Round 36 and Round 35 from the beginning and end of the upcoming matches array
upcoming_matches = upcoming_matches[1:-1]
print("\nUpcoming Matches:")
print(upcoming_matches)


Upcoming Matches:
[]


In [25]:
# USE FOR CASES WHERE there is only one match left
# Now that we have split the epl_matches into two we can clearly remove the Round 36 and Round 35 from the beginning and end of the upcoming matches array
upcoming_matches = upcoming_matches[1]
print("\nUpcoming Matches:")
print(upcoming_matches)


Upcoming Matches:

Crystal Palace
Manchester United
6/5/2024 19:00
25354021 - 23.5312°2.70
PRE
VIEW
0.05
Round 35


Now we can remove PREVIEW

In [26]:
upcoming_matches = [match.replace('\nPRE\nVIEW', '')]

In [27]:
# Remove '\nPRE\nVIEW' from upcoming matches
# upcoming_matches = [match.replace('\nPRE\nVIEW', '') for match in upcoming_matches]

# Remove '\nPRE\nVIEW' from completed matches
completed_matches = [match.replace('\nPRE\nVIEW', '') for match in completed_matches]

print("Upcoming Matches (after removing noise):\n")
print(upcoming_matches)

print("\nCompleted Matches (after removing noise):\n")
print(completed_matches)

Upcoming Matches (after removing noise):

['\nNottingham Forest\nManchester City\n28/4/2024 15:30\n16226220 - 42.9611°1.30 FT 0 - 2\n(0 - 1)']

Completed Matches (after removing noise):

['\nChelsea\nTottenham\n2/5/2024 18:30\n37342913 - 13.9515°2.15 FT 2 - 0\n(1 - 0)', '\nLuton Town\nEverton\n3/5/2024 19:00\n30274321 - 23.889°2.40 FT 1 - 1\n(1 - 1)\n0.02', '\nArsenal\nBournemouth\n4/5/2024 11:30\n56232113 - 13.1817°1.22 FT 3 - 0\n(1 - 0)', '\nBrentford\nFulham\n4/5/2024 14:00\n42352312 - 02.4114°2.10 FT 0 - 0\n(0 - 0)', '\nBurnley\nNewcastle United\n4/5/2024 14:00\n21265320 - 11.9011°1.95 FT 1 - 4\n(0 - 3)\n0.04', '\nSheffield United\nNottingham Forest\n4/5/2024 14:00\n333630X2 - 23.0014°4.20 FT 1 - 3\n(1 - 1)\n0.16', '\nManchester City\nWolverhampton\n4/5/2024 16:30\n56281613 - 04.1913°1.11 FT 5 - 1\n(3 - 0)', '\nBrighton\nAston Villa\n5/5/2024 13:00\n22364221 - 22.9612°2.55 FT 1 - 0\n(0 - 0)\n0.05', '\nChelsea\nWest Ham\n5/5/2024 13:00\n39382312 - 12.5315°1.65 FT 5 - 0\n(3 - 0)', '\

To split a certain part of the text clustered together into newlines, we'll use regular expressions (regex). Given a string like this:

\nWest Ham\nLiverpool\n27/4/2024 11:30\n22314721 - 22.9410°1.53

We need to break this part into separate lines:
n22314721 - 22.9410°1.53

Our final result should be:

\n22\n31\n47\n2\n1 - 2\n2.94\n10°\n1.53
This will prepare the data by splitting various sections ( win probability, predicted score, weather, etc)  within the text 

our regex pattern to look for will be : r'n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}.\d{2})(\d{2}°)(\d?.\d{2})'

and replacement will be : r'/n\1/n\2/n\3/n\4/n\5/n\6/n\7/n\8'



### Regex Playground 

Tips:
- Go through the data to spot different use cases and anomalies.
- Do not try to cram all use cases into one regex pattern; have multiple regex patterns that you can use to match all the strings with.
Take weather data within the, for example; it could be negative, indicated by a preceding dash (e.g., -2, -22). This means we need patterns that account for both single and double-digit negative numbers

In [28]:
import re

pattern = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}.\d{2})(\d{2}°)(\d?.\d{2})'

replacement = r'\n\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8'

def replace_matches(text):
    replaced = re.sub(pattern, replacement, text)
    return replaced

text1 = '\nChelsea\nTottenham\n2/5/2024 18:30\n37342913 - 13.9515°2.15 FT 2 - 0\n(1 - 0)'
text2 = "ials\n22314721 - 22.9410°1.53"

print(replace_matches(text1))
print(replace_matches(text2))


Chelsea
Tottenham
2/5/2024 18:30
37
34
29
1
3 - 1
3.95
15°
2.15 FT 2 - 0
(1 - 0)
ials
22
31
47
2
1 - 2
2.94
10°
1.53


### Working with Upcoming Matches

In [29]:
import re

pattern = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}.\d{2})(\d{2}°|\d{1}°)(\d?.\d{2})'


replacement = r'\n\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8'

def replace_upcoming_matches(text):
    return re.sub(pattern, replacement, text)


for i in range(len(upcoming_matches)):
    upcoming_matches[i] = replace_upcoming_matches(upcoming_matches[i])


In [30]:
print("Upcoming Matches (after removing noise):\n")
print(upcoming_matches)

Upcoming Matches (after removing noise):

['\nNottingham Forest\nManchester City\n28/4/2024 15:30\n16\n22\n62\n2\n0 - 4\n2.96\n11°\n1.30 FT 0 - 2\n(0 - 1)']


### Working with Completed matches

In [31]:

pattern_completed_matches = r'\n(\d{2})(\d{2})(\d{2})([A-Z]|\d?)(\d?\s-\s\d{1})(\d{1}.\d{2})(\d{2}°|\d{1}°)(\d?.\d{2})\s(FT\s\d?\s-\s\d?)'


replacement_completed_matches = r'\n\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8\n\9'

def replace_completed_matches(text):
    return re.sub(pattern_completed_matches, replacement_completed_matches, text)

# Process upcoming matches
for i in range(len(completed_matches)):
    completed_matches[i] = replace_completed_matches(completed_matches[i])


In [32]:
print("\nCompleted Matches (after removing noise):\n")
print(completed_matches)

print("\Full Matches (after removing noise):\n")
print(epl_matches)


Completed Matches (after removing noise):

['\nChelsea\nTottenham\n2/5/2024 18:30\n37\n34\n29\n1\n3 - 1\n3.95\n15°\n2.15\nFT 2 - 0\n(1 - 0)', '\nLuton Town\nEverton\n3/5/2024 19:00\n30\n27\n43\n2\n1 - 2\n3.88\n9°\n2.40\nFT 1 - 1\n(1 - 1)\n0.02', '\nArsenal\nBournemouth\n4/5/2024 11:30\n56\n23\n21\n1\n3 - 1\n3.18\n17°\n1.22\nFT 3 - 0\n(1 - 0)', '\nBrentford\nFulham\n4/5/2024 14:00\n42\n35\n23\n1\n2 - 0\n2.41\n14°\n2.10\nFT 0 - 0\n(0 - 0)', '\nBurnley\nNewcastle United\n4/5/2024 14:00\n21\n26\n53\n2\n0 - 1\n1.90\n11°\n1.95\nFT 1 - 4\n(0 - 3)\n0.04', '\nSheffield United\nNottingham Forest\n4/5/2024 14:00\n33\n36\n30\nX\n2 - 2\n3.00\n14°\n4.20\nFT 1 - 3\n(1 - 1)\n0.16', '\nManchester City\nWolverhampton\n4/5/2024 16:30\n56\n28\n16\n1\n3 - 0\n4.19\n13°\n1.11\nFT 5 - 1\n(3 - 0)', '\nBrighton\nAston Villa\n5/5/2024 13:00\n22\n36\n42\n2\n1 - 2\n2.96\n12°\n2.55\nFT 1 - 0\n(0 - 0)\n0.05', '\nChelsea\nWest Ham\n5/5/2024 13:00\n39\n38\n23\n1\n2 - 1\n2.53\n15°\n1.65\nFT 5 - 0\n(3 - 0)', '\nLiverpo

#### Split by \n

In [43]:
# New array to hold split Data 
upcoming_matches_split = []
completed_matches_split = []

for i in range(len(completed_matches)):
    completed_matches_split.append(completed_matches[i].split('\n'))
    

In [44]:
for i in range(len(upcoming_matches)):
    upcoming_matches_split.append(upcoming_matches[i].split('\n'))

In [45]:
print("\nCompleted Matches Split:\n")
print(completed_matches_split)

print("\Full Matches split by slash n:\n")
print(upcoming_matches_split)


Completed Matches Split:

[['', 'Chelsea', 'Tottenham', '2/5/2024 18:30', '37', '34', '29', '1', '3 - 1', '3.95', '15°', '2.15', 'FT 2 - 0', '(1 - 0)'], ['', 'Luton Town', 'Everton', '3/5/2024 19:00', '30', '27', '43', '2', '1 - 2', '3.88', '9°', '2.40', 'FT 1 - 1', '(1 - 1)', '0.02'], ['', 'Arsenal', 'Bournemouth', '4/5/2024 11:30', '56', '23', '21', '1', '3 - 1', '3.18', '17°', '1.22', 'FT 3 - 0', '(1 - 0)'], ['', 'Brentford', 'Fulham', '4/5/2024 14:00', '42', '35', '23', '1', '2 - 0', '2.41', '14°', '2.10', 'FT 0 - 0', '(0 - 0)'], ['', 'Burnley', 'Newcastle United', '4/5/2024 14:00', '21', '26', '53', '2', '0 - 1', '1.90', '11°', '1.95', 'FT 1 - 4', '(0 - 3)', '0.04'], ['', 'Sheffield United', 'Nottingham Forest', '4/5/2024 14:00', '33', '36', '30', 'X', '2 - 2', '3.00', '14°', '4.20', 'FT 1 - 3', '(1 - 1)', '0.16'], ['', 'Manchester City', 'Wolverhampton', '4/5/2024 16:30', '56', '28', '16', '1', '3 - 0', '4.19', '13°', '1.11', 'FT 5 - 1', '(3 - 0)'], ['', 'Brighton', 'Aston Villa

In [46]:
df_columns  = ['', 'Home', 'Away', 'Date and time', 'Home Win probability', 'Draw probability', 'Away team win probability', 'Team to win(prediction)', 'Scoreline prediction', 'Average goals prediction', 'Weather in degrees', 'Odds', 'Full time score', 'Score at halftime', "Kelly Criterion"]

In [47]:
# Create DataFrame

df_completed_matches = pd.DataFrame(completed_matches_split, columns=df_columns)

# Drop first column
df_completed_matches = df_completed_matches.drop(columns=[''])

# Display DataFrame
print(df_completed_matches)

                 Home               Away    Date and time  \
0             Chelsea          Tottenham   2/5/2024 18:30   
1          Luton Town            Everton   3/5/2024 19:00   
2             Arsenal        Bournemouth   4/5/2024 11:30   
3           Brentford             Fulham   4/5/2024 14:00   
4             Burnley   Newcastle United   4/5/2024 14:00   
5    Sheffield United  Nottingham Forest   4/5/2024 14:00   
6     Manchester City      Wolverhampton   4/5/2024 16:30   
7            Brighton        Aston Villa   5/5/2024 13:00   
8             Chelsea           West Ham   5/5/2024 13:00   
9           Liverpool          Tottenham   5/5/2024 15:30   
10           West Ham          Liverpool  27/4/2024 11:30   
11             Fulham     Crystal Palace  27/4/2024 14:00   
12  Manchester United            Burnley  27/4/2024 14:00   
13   Newcastle United   Sheffield United  27/4/2024 14:00   
14      Wolverhampton         Luton Town  27/4/2024 14:00   
15            Everton   

In [48]:
df_columns_upcoming  = ['', 'Home', 'Away', 'Date and time', 'Home Win probability', 'Draw probability', 'Away team win probability', 'Home team to win(prediction)', 'Scoreline prediction', 'Average goals', 'Weather in degrees', 'Odds', "Kelly Criterion"]

In [49]:
# Create DataFrame
df_upcoming_matches = pd.DataFrame(upcoming_matches_split, columns= df_columns_upcoming)

# Drop first column
df_upcoming_matches = df_upcoming_matches.drop(columns=[''])

# Display DataFrame
print(df_upcoming_matches)

                Home             Away    Date and time Home Win probability  \
0  Nottingham Forest  Manchester City  28/4/2024 15:30                   16   

  Draw probability Away team win probability Home team to win(prediction)  \
0               22                        62                            2   

  Scoreline prediction Average goals prediction Weather in degrees  \
0                0 - 4                     2.96                11°   

            Odds Kelly Criterion  
0  1.30 FT 0 - 2         (0 - 1)  


We are going to skip EDA since the data varies week in week out. Later when i connect the data to a database, and keep track of certain trends maybe we can notice certain patterns

### Model Development
We are still preparing the data, but from now on, everything we do has the model in mind, that means label encoding, converting string to floats dropping columns etc. so its a bit different from just cleaning the data. its more like tuning

We are going to create labels for the teams in the premier league, in order to have Home and Away values are numerical labels. important for logisit regression. Might not be the best model, but yh 

In [50]:
team_labels = {
        'Arsenal': 1,
        'Aston Villa': 2,
        'Bournemouth': 3,
        'Brighton': 4,
        'Burnley': 5,
        'Chelsea': 6,
        'Crystal Palace': 7,
        'Everton': 8,
        'Fulham': 9,
        'Leeds United': 10,
        'Leicester City': 11,
        'Liverpool': 12,
        'Manchester City': 13,
        'Manchester United': 14,
        'Newcastle United': 15,
        'Norwich City': 16,
        'Sheffield United': 17,
        'Southampton': 18,
        'Tottenham': 19,
        'West Ham': 20,
        'Luton Town': 21,
        'Wolverhampton': 22,
        'Brentford': 23,
        'Sheffield United': 24,
        'Nottingham Forest': 25
    }

In [51]:

def team_to_label(team_name):
    return team_labels.get(team_name)


In [52]:
df_completed_matches['Home'] = df_completed_matches['Home'].map(team_to_label)
df_completed_matches['Away'] = df_completed_matches['Away'].map(team_to_label)

print(df_completed_matches.head())

   Home  Away   Date and time Home Win probability Draw probability  \
0     6    19  2/5/2024 18:30                   37               34   
1    21     8  3/5/2024 19:00                   30               27   
2     1     3  4/5/2024 11:30                   56               23   
3    23     9  4/5/2024 14:00                   42               35   
4     5    15  4/5/2024 14:00                   21               26   

  Away team win probability Team to win(prediction) Scoreline prediction  \
0                        29                       1                3 - 1   
1                        43                       2                1 - 2   
2                        21                       1                3 - 1   
3                        23                       1                2 - 0   
4                        53                       2                0 - 1   

  Average goals prediction Weather in degrees  Odds Full time score  \
0                     3.95                15°

### Splitting Date and Time 

In [53]:
df_completed_matches[['Date', 'Time']] = df_completed_matches['Date and time'].str.split(' ', expand=True)

df_completed_matches.drop(columns=['Date and time'], inplace=True)

print(df_completed_matches.head()) 

   Home  Away Home Win probability Draw probability Away team win probability  \
0     6    19                   37               34                        29   
1    21     8                   30               27                        43   
2     1     3                   56               23                        21   
3    23     9                   42               35                        23   
4     5    15                   21               26                        53   

  Team to win(prediction) Scoreline prediction Average goals prediction  \
0                       1                3 - 1                     3.95   
1                       2                1 - 2                     3.88   
2                       1                3 - 1                     3.18   
3                       1                2 - 0                     2.41   
4                       2                0 - 1                     1.90   

  Weather in degrees  Odds Full time score Score at halftime K

### Splitting the "Scoreline prediction" column into separate columns (Home goals, Away goals)

In [54]:
df_completed_matches[['Home Team Score Prediction', 'Away Team Score Prediction']] = df_completed_matches['Scoreline prediction'].str.split('-', expand=True)

# Converting the split columns to integers
df_completed_matches['Home Team Score Prediction'] = df_completed_matches['Home Team Score Prediction'].astype(int)
df_completed_matches['Away Team Score Prediction'] = df_completed_matches['Away Team Score Prediction'].astype(int)

df_completed_matches.drop(columns=['Scoreline prediction'], inplace=True)
# Example usage:
print(df_completed_matches.head())  # Display the first few rows to verify the changes


   Home  Away Home Win probability Draw probability Away team win probability  \
0     6    19                   37               34                        29   
1    21     8                   30               27                        43   
2     1     3                   56               23                        21   
3    23     9                   42               35                        23   
4     5    15                   21               26                        53   

  Team to win(prediction) Scoreline prediction Average goals prediction  \
0                       1                3 - 1                     3.95   
1                       2                1 - 2                     3.88   
2                       1                3 - 1                     3.18   
3                       1                2 - 0                     2.41   
4                       2                0 - 1                     1.90   

  Weather in degrees  Odds Full time score Score at halftime K

### Splitting the "Halftime scoreline" column into separate columns (Home goals, Away goals)

In [69]:
df_completed_matches[['Home Team Full Time Score', 'Away Team Full Time Score']] = df_completed_matches['Full time score'].str.strip('FT ').str.split(' - ', expand=True)

df_completed_matches['Away Team Full Time Score'] = df_completed_matches['Away Team Full Time Score'].astype(int)
df_completed_matches['Home Team Full Time Score'] = df_completed_matches['Home Team Full Time Score'].astype(int)


In [70]:

df_completed_matches[['Home Team Halftime Score', 'Away Team Halftime Score']] = df_completed_matches['Score at halftime'].str.strip('()').str.split(' - ', expand=True)


### Creating Prediction win/loss Column 

In [88]:
def create_y(df):
    y = []
    for i in range(len(df)):
        if df['Team to win(prediction)'][i] == '1' and df['Home Team Full Time Score'][i] > df['Away Team Full Time Score'][i]:
            y.append(1)
        elif df['Team to win(prediction)'][i] == '2' and df['Home Team Full Time Score'][i] < df['Away Team Full Time Score'][i]:
            y.append(1)
        elif df['Team to win(prediction)'][i] == 'X' and df['Home Team Full Time Score'][i] == df['Away Team Full Time Score'][i]:
            y.append(1)
        else:
            y.append(0)
    return y

# Append the y column to the main DataFrame
df_completed_matches['Prediction Result(won/loss)'] = create_y(df_completed_matches)


In [None]:
print(df_completed_matches)

## Feature engineering

We will make sure our columns are in the right format and think about the model 

In [90]:
df_completed_matches[['Home', 'Away', 'Team to win(prediction)','Prediction Result(won/loss)']] = df_completed_matches[['Home', 'Away', 'Team to win(prediction)','Prediction Result(won/loss)']].astype('category')

In [93]:
# Convert probabilities to float
df_completed_matches['Home Win probability'] = df_completed_matches['Home Win probability'].astype(float)
df_completed_matches['Draw probability'] = df_completed_matches['Draw probability'].astype(float)
df_completed_matches['Away team win probability'] = df_completed_matches['Away team win probability'].astype(float)


In [94]:
# Convert average goals prediction to float
df_completed_matches['Average goals prediction'] = df_completed_matches['Average goals prediction'].astype(float)


In [95]:
# Convert relevant score columns to integers
df_completed_matches['Home Team Full Time Score'] = df_completed_matches['Home Team Full Time Score'].astype(int)
df_completed_matches['Away Team Full Time Score'] = df_completed_matches['Away Team Full Time Score'].astype(int)
df_completed_matches['Home Team Halftime Score'] = df_completed_matches['Home Team Halftime Score'].astype(int)
df_completed_matches['Away Team Halftime Score'] = df_completed_matches['Away Team Halftime Score'].astype(int)


In [96]:
df_completed_matches.drop(columns=['Kelly Criterion'], inplace=True)


In [99]:
# Depending on your needs, you can extract features from date and time columns
# For example, to extract day of the week and month from 'Date':
# Convert 'Date' column to datetime with correct format
df_completed_matches['Date'] = pd.to_datetime(df_completed_matches['Date'], format='%d/%m/%Y')
df_completed_matches['Day of Week'] = df_completed_matches['Date'].dt.dayofweek
df_completed_matches['Month'] = df_completed_matches['Date'].dt.month


In [116]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode the 'Team to win(prediction)' column
df_completed_matches['Team to win(prediction)'] = label_encoder.fit_transform(df_completed_matches['Team to win(prediction)'])

In [118]:
df_completed_matches['Odds'] = df_completed_matches['Odds'].astype(float)

In [100]:
df_completed_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 22 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Home                         20 non-null     category      
 1   Away                         20 non-null     category      
 2   Home Win probability         20 non-null     float64       
 3   Draw probability             20 non-null     float64       
 4   Away team win probability    20 non-null     float64       
 5   Team to win(prediction)      20 non-null     category      
 6   Average goals prediction     20 non-null     float64       
 7   Weather in degrees           20 non-null     object        
 8   Odds                         20 non-null     object        
 9   Full time score              20 non-null     object        
 10  Score at halftime            20 non-null     object        
 11  Date                         20 non-null     da

In [101]:
df_completed_matches.head()

Unnamed: 0,Home,Away,Home Win probability,Draw probability,Away team win probability,Team to win(prediction),Average goals prediction,Weather in degrees,Odds,Full time score,...,Time,Home Team Score Prediction,Away Team Score Prediction,Home Team Full Time Score,Away Team Full Time Score,Home Team Halftime Score,Away Team Halftime Score,Prediction Result(won/loss),Day of Week,Month
0,6,19,37.0,34.0,29.0,1,3.95,15°,2.15,FT 2 - 0,...,18:30,3,1,2,0,1,0,1,3,5
1,21,8,30.0,27.0,43.0,2,3.88,9°,2.4,FT 1 - 1,...,19:00,1,2,1,1,1,1,0,4,5
2,1,3,56.0,23.0,21.0,1,3.18,17°,1.22,FT 3 - 0,...,11:30,3,1,3,0,1,0,1,5,5
3,23,9,42.0,35.0,23.0,1,2.41,14°,2.1,FT 0 - 0,...,14:00,2,0,0,0,0,0,0,5,5
4,5,15,21.0,26.0,53.0,2,1.9,11°,1.95,FT 1 - 4,...,14:00,0,1,1,4,0,3,1,5,5


## Logistic Regression

In [119]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


X = df_completed_matches.drop(columns=['Prediction Result(won/loss)', 'Weather in degrees','Full time score','Score at halftime', 'Date','Time'])
# Define your target variable y
y = df_completed_matches['Prediction Result(won/loss)']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the logistic regression model
log_reg = LogisticRegression(max_iter=10000, solver='sag')

# Train the model on the training data
log_reg.fit(X_train, y_train)

# Predict the target variable on the testing set
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Get more detailed classification report
print(classification_report(y_test, y_pred))


Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.67      0.67      0.67         3
           1       0.00      0.00      0.00         1

    accuracy                           0.50         4
   macro avg       0.33      0.33      0.33         4
weighted avg       0.50      0.50      0.50         4



### Check for Data Imbalance with Scaled Data

In [126]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()

# Scale data
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=10)

In [127]:
lm_full = LogisticRegression()

In [128]:
lm_full.fit(X_train, y_train)
pred_lm_full = lm_full.predict(X_test)
pred_lm_full

array([1, 1, 1, 1], dtype=int64)

In [130]:
print(classification_report(y_test, pred_lm_full))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.25      1.00      0.40         1

    accuracy                           0.25         4
   macro avg       0.12      0.50      0.20         4
weighted avg       0.06      0.25      0.10         4



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Well this is what happens when your training data is too small and there are imbalances. Your prediction results in one class exposing the bias in the data.
This week alot of predictions made were correct. hence the bias. If we pick a week where there was a balance outcome for bet predictions we would  have a different result

In [None]:
# Separate minority and majority classes
win = df_completed_matches[df_completed_matches['Prediction Result(won/loss)']==0]
loss = df_completed_matches[df_completed_matches['Prediction Result(won/loss)']==1]

In [None]:
labels = df_completed_matches['Prediction Result(won/loss)'].unique()
heights = [len(win),len(loss)]
plt.bar(labels,heights,color='grey')
plt.xticks(labels,['win','loss'])
plt.ylabel("# of observations")
plt.show()

In [None]:
from sklearn.utils import resample

In [None]:
loss_upsampled = resample(loss,
                          replace=True, # sample with replacement (we need to duplicate observations)
                          n_samples=len(win), # match number in minority class
                          random_state=27) # reproducible results

# Combine upsampled minority class with majority class
upsampled = pd.concat([loss_upsampled, win])

# Check new class counts
upsampled['loss'].value_counts()