## Downloading New Data for Future Matches

This notebook handles the **download and web scraping** of data for upcoming matches that have not yet taken place. To make predictions for these future games, we have to adapt our approach, because some of the information we normally use only exists once a match has been completed.

### Changing Our Approach

Given that the match data is not fully available for fixtures that have yet to occur, we will rely on **bookmaker websites** to obtain the necessary odds and related statistics. This shift in strategy is crucial because:

- **Limited Information:** A fixture that has not been played has no result or in-game statistics yet, and details such as team form and player injuries may not be fully represented in our usual sources. Traditional data sources therefore cannot provide all the inputs we need.
  
- **Odds as Indicators:** The odds provided by bookmakers serve as valuable indicators of expected match outcomes based on the current perceptions and analyses of experts. They reflect the bookmakers' expectations regarding the probability of various results (home win, draw, away win), which can be extremely useful for our predictive model.
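
To make the link between odds and probabilities concrete, here is a minimal sketch that converts decimal odds for the three match results into implied probabilities. It assumes decimal (European) odds and removes the bookmaker margin by simple proportional normalization, which is only one of several possible methods. The function name `implied_probabilities` and the example odds are illustrative and are not part of `Dataset_functions`.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal 1X2 odds into probabilities that sum to 1."""
    # The inverse of a decimal odd is the raw implied probability.
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    # The raw values sum to more than 1; the excess is the bookmaker margin.
    overround = sum(raw)
    # Proportional normalization (an assumption, not the only option).
    return [p / overround for p in raw]


# Illustrative odds, not taken from any real fixture.
probs = implied_probabilities(2.10, 3.40, 3.60)
print([round(p, 3) for p in probs])
```

The shortest odds always map to the highest probability, so the ordering of outcomes is preserved whichever normalization is used.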

### Web Scraping Process

To implement this, we will perform the following steps:

1. **Identify Reliable Sources:** We will target reputable bookmaker websites that consistently provide up-to-date odds for football matches.

2. **Extract Odds Data:** Utilizing web scraping techniques, we will collect data on odds for upcoming matches. This will include odds for different outcomes such as over/under goals, match result, and other relevant betting markets.

3. **Data Cleaning and Formatting:** Once we have the odds data, we will clean and format it to ensure consistency with our existing dataset. This includes aligning the data with the matches in our database, making sure that the teams, dates, and other relevant information match correctly.

4. **Integrating New Data:** After cleaning the odds data, we will integrate it into our existing framework, allowing us to enhance our model with the most current information available for upcoming matches.

### Importance of this Data

By incorporating this newly acquired data into our predictive model, we can ensure that our forecasts for future matches are informed by the latest insights and market expectations. This enables us to refine our betting strategies, improve prediction accuracy, and ultimately place bets on the matches that fall under the previously designed categories and conditions.


In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import requests
from bs4 import BeautifulSoup
from Dataset_functions import *
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.webdriver.support.ui import WebDriverWait

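First, we create dictionaries that map the bookmaker's team names to the names used in our dataset, since the two often differ (for example, the bookmaker lists 'Real Madryt' where our data has 'Real Madrid').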

In [2]:
teams = ['Arsenal', 'Brighton', 'Chelsea', 'Crystal Palace', 'Everton',
       'Southampton', 'Watford', 'West Brom', 'Man United', 'Newcastle',
       'Swansea', 'Stoke', 'Burnley', 'Leicester', 'Bournemouth',
       'Liverpool', 'Huddersfield', 'Tottenham', 'Man City', 'West Ham',
       'Hamburg', 'Reading', 'Bochum', 'Ipswich', 'Millwall', 'Preston',
       'Sheffield United', 'Wigan', 'Bristol City', 'Brentford',
       'Greuther Furth', 'Regensburg', 'Birmingham', 'Magdeburg', 'Leeds',
       'Darmstadt', 'Heidenheim', 'Union Berlin', 'Hull', 'Dresden',
       "Nott'm Forest", 'Middlesbrough', 'St Pauli', 'Paderborn',
       'Ingolstadt', 'Aston Villa', 'Derby', 'Bolton', 'Norwich', 'QPR',
       'Sheffield Weds', 'Duisburg', 'Bielefeld', 'Blackburn',
       'Rotherham', 'Fulham', 'Wolves', 'Erzgebirge Aue', 'Holstein Kiel',
       'Sandhausen', 'FC Koln', 'Girona', 'Cadiz', 'Albacete', 'Betis',
       'Cardiff', 'Alcorcon', 'Cordoba', 'Elche', 'Lugo', 'Lazio',
       'Chievo', 'Barcelona', 'Celta', 'Villarreal', 'Mallorca',
       'Las Palmas', 'Oviedo', 'Zaragoza', 'Parma', 'Sassuolo', 'Torino',
       'Bologna', 'Empoli', 'Vallecano', 'Real Madrid', 'Eibar',
       'Atalanta', 'Gimnastic', 'Valencia', 'Ath Bilbao', 'Leganes',
       'Getafe', 'Bayern Munich', 'Extremadura UD', 'Malaga', 'Wolfsburg',
       'Werder Bremen', "M'gladbach", 'Hertha', 'Freiburg', 'Alaves',
       'Valladolid', 'Ath Madrid', 'Reus Deportiu', 'Fortuna Dusseldorf',
       'Juventus', 'Napoli', 'Numancia', 'Sevilla', 'Mainz', 'Espanol',
       'Dortmund', 'Almeria', 'Osasuna', 'Granada', 'Cagliari',
       'Frosinone', 'Genoa', 'Fiorentina', 'Spal', 'Udinese', 'Sp Gijon',
       'Inter', 'Rayo Majadahonda', 'Levante', 'Roma', 'Milan',
       'Hannover', 'Tenerife', 'Leverkusen', 'Hoffenheim',
       'Ein Frankfurt', 'Augsburg', 'Nurnberg', 'Stuttgart', 'Sampdoria',
       'RB Leipzig', 'Schalke 04', 'La Coruna', 'Huesca', 'Sociedad',
       'Osnabruck', 'Wehen', 'Luton', 'Karlsruhe', 'Barnsley', 'Charlton',
       'Santander', 'Pisa', 'Mirandes', 'Crotone', 'Salernitana',
       'Cittadella', 'Venezia', 'Virtus Entella', 'Ascoli',
       'Ponferradina', 'Perugia', 'Verona', 'Pordenone', 'Livorno',
       'Cremonese', 'Spezia', 'Benevento', 'Cosenza', 'Fuenlabrada',
       'Trapani', 'Lecce', 'Juve Stabia', 'Pescara', 'Brescia', 'Wycombe',
       'Coventry', 'Wurzburger Kickers', 'Castellon', 'Cartagena',
       'Monza', 'Logrones', 'Braunschweig', 'Reggiana', 'Reggina',
       'Vicenza', 'Sabadell', 'Hansa Rostock', 'Peterboro', 'Sociedad B',
       'Blackpool', 'Ternana', 'Ibiza', 'Amorebieta', 'Burgos', 'Como',
       'Alessandria', 'Kaiserslautern', 'Sunderland', 'Palermo', 'Modena',
       'Villarreal B', 'Bari', 'Sudtirol', 'Andorra', 'Plymouth',
       'Elversberg', 'FeralpiSalo', 'Catanzaro', 'Ferrol', 'Eldense',
       'Lecco', 'Ulm', 'Oxford', 'Preußen Münster', 'Portsmouth']

# Bookmaker name -> dataset name, one dictionary per league.
team_spanish_dict = {
    'Athletic Bilbao': 'Ath Bilbao',
    'CD Leganes': 'Leganes',
    'Celta Vigo': 'Celta',
    'Espanyol': 'Espanol',
    'Girona FC': 'Girona',
    'Las Palmas UD': 'Las Palmas',
    'Real Betis': 'Betis',
    'Real Madryt': 'Real Madrid',
    'Real Sociedad': 'Sociedad',
    'Real Valladolid': 'Valladolid',
    'FC Barcelona': 'Barcelona',
    'Rayo Vallecano': 'Vallecano',
    'Atletico Madryt': 'Ath Madrid'
}

team_english_dict = {
    'Fulham FC': 'Fulham',
    'Manchester City': 'Man City',
    'Wolverhampton': 'Wolves',
    'Manchester United': 'Man United',
    'Nottingham': "Nott'm Forest",
    'Arsenal FC': 'Arsenal'
}

team_italian_dict = {
    'Hellas Verona': 'Verona',
    'AC Monza Brianz': 'Monza',
    'AS Roma': 'Roma',
    'FC Parma': 'Parma',
    'AC Milan': 'Milan',
    'Como Calcio': 'Como'
}

team_german_dict = {
    'Bayern': 'Bayern Munich',
    "Borussia M'gladbach": "M'gladbach",
    'Borussia Dortmund': 'Dortmund',
    'RB Lipsk': 'RB Leipzig',
    'VfL Bochum': 'Bochum',
    '1. FC Heidenheim': 'Heidenheim',
    'Werder Brema': 'Werder Bremen',
    'Frankfurt': 'Ein Frankfurt',
    'St. Pauli': 'St Pauli'
}
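
As a rough illustration of how such a mapping can be applied, the sketch below translates a bookmaker name and checks the result against the known dataset names. The helper `normalize_team` is hypothetical. How `append_to_df_all` uses the dictionaries internally is not shown in this notebook, so this is an assumption about one reasonable approach, using small excerpts of the `teams` list and `team_spanish_dict`.

```python
def normalize_team(name, mapping, known_teams):
    """Map a bookmaker team name to the spelling used in the dataset."""
    cleaned = name.strip()
    # Names missing from the mapping are assumed to match the dataset already.
    canonical = mapping.get(cleaned, cleaned)
    if canonical not in known_teams:
        # Fail loudly so that a new or renamed team is noticed immediately.
        raise KeyError(f"No dataset name for bookmaker team {name!r}")
    return canonical


# Small excerpts of the `teams` list and `team_spanish_dict`.
known = {'Ath Bilbao', 'Real Madrid', 'Sevilla'}
spanish = {'Athletic Bilbao': 'Ath Bilbao', 'Real Madryt': 'Real Madrid'}

print(normalize_team('Real Madryt', spanish, known))  # Real Madrid
print(normalize_team('Sevilla', spanish, known))      # Sevilla
```

Raising an error for unknown names, rather than passing them through silently, keeps a misspelled team from creating a new, empty history in the dataset.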

### Updating URLs and File Names

To gather the data for upcoming matches, we first specify the URLs of the source files and the file names under which the downloaded data will be stored.

#### URLs for Web Scraping

We will use URLs from the same website that we used to create the main data table. These URLs direct the download to the correct source file for each league.

#### File Names for Data Storage

To keep the data organized, we assign a specific file name to each dataset we download. The file names reflect the content and season of the data, and reusing them lets each run update the data we downloaded earlier.

#### Summary

With the URLs and file names defined, we are prepared to proceed with our web scraping activities. This structured approach will ensure that we efficiently gather and organize the necessary data for our analysis and modeling efforts.


In [3]:
url_bundesliga = "https://www.football-data.co.uk/mmz4281/2425/D1.csv"
file_name_bundesliga = 'D1_24_25_1.csv'

url_premier = "https://www.football-data.co.uk/mmz4281/2425/E0.csv"
file_name_premier = '24_25_eng.csv'

url_laliga = "https://www.football-data.co.uk/mmz4281/2425/SP1.csv"
file_name_laliga = 'SP_24_25.csv'

url_seriea = "https://www.football-data.co.uk/mmz4281/2425/I1.csv"
file_name_seriea = 'I1_24_25_1.csv'

In [4]:
update_matches(url_bundesliga, file_name_bundesliga)
update_matches(url_premier, file_name_premier)
update_matches(url_laliga, file_name_laliga)
update_matches(url_seriea, file_name_seriea)
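
The four URLs above share one pattern: a base address, a season code, and a division code. The sketch below builds them from that pattern and includes a hypothetical `download_season` helper based on the standard library. It is not the `update_matches` function from `Dataset_functions`, whose implementation is not shown here. The names `BASE_URL`, `DIVISIONS`, `season_url`, and `download_season` are assumptions made for illustration.

```python
from urllib.request import urlretrieve

BASE_URL = "https://www.football-data.co.uk/mmz4281"

# Division codes taken from the URLs used in this notebook.
DIVISIONS = {
    "bundesliga": "D1",
    "premier": "E0",
    "laliga": "SP1",
    "seriea": "I1",
}


def season_url(league, season="2425"):
    """Build the CSV URL for one league and season code."""
    return f"{BASE_URL}/{season}/{DIVISIONS[league]}.csv"


def download_season(league, file_name, season="2425"):
    """Hypothetical helper: save the league CSV locally (needs network access)."""
    urlretrieve(season_url(league, season), file_name)
    return file_name


print(season_url("bundesliga"))
```

Keeping the season code as a parameter means only one value has to change when a new season starts.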

### Specifying Bookmaker Odds and Matches for Analysis

For each league included in our analysis, we provide the URL of the bookmaker page listing odds for its upcoming matches. We also set the number of matches to analyze per league: typically **10 matches** per matchday (9 for the 18-team Bundesliga), although this may vary with fixture scheduling.

#### URLs for Bookmaker Odds

For each league, we specify the URL of the bookmaker's odds page for that competition. This ensures that we scrape the data relevant to our analysis.

#### Formatting Matches for Integration

Once we have the URLs established and the number of matches to analyze specified, the next step involves formatting the data appropriately. We will:

1. **Scrape Odds Data**: Using the provided URLs, we will extract the odds information for the specified matches.

2. **Match Formatting**: Each match will be formatted to include key details such as:
   - Teams involved
   - Match date and time
   - Odds for various outcomes (e.g., home win, draw, away win, over/under 2.5 goals)

3. **Integrate with General League Data**: After formatting the matches, we will add this information to our general league data files. This ensures that our existing datasets are updated with the latest odds and fixtures, allowing us to conduct comprehensive analyses and make informed predictions.
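
To make the formatting step more concrete, the sketch below shows one possible shape for a single formatted fixture. The `Fixture` class, its field names, the `to_row` helper, and the day/month/year date format are all assumptions made for illustration. The actual columns produced by `create_df_next_fixtures` are not shown in this notebook, and the odds in the example are made up.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class Fixture:
    """One upcoming match with the odds described in the list above."""
    home_team: str
    away_team: str
    kickoff: datetime
    odds_home: float
    odds_draw: float
    odds_away: float
    odds_over25: float
    odds_under25: float


def to_row(fixture):
    """Flatten a fixture into a dict that could become one table row."""
    row = asdict(fixture)
    kickoff = row.pop("kickoff")
    # Split the kickoff into separate date and time fields (assumed format).
    row["date"] = kickoff.strftime("%d/%m/%Y")
    row["time"] = kickoff.strftime("%H:%M")
    return row


# Illustrative values only, not real odds.
example = Fixture("Real Madrid", "Barcelona", datetime(2024, 10, 26, 21, 0),
                  2.10, 3.80, 3.20, 1.55, 2.40)
row = to_row(example)
print(row)
```

A list of such rows can be passed directly to `pd.DataFrame` to obtain a table with one fixture per row.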

In [20]:
spanish1_url = 'https://www.iforbet.pl/zaklady-bukmacherskie/156/159'
spanish1_next_fixtures = create_df_next_fixtures(scrape_next_fixtures(spanish1_url), number_of_matches=10)

10 10 10 10 10 10 10 10


In [22]:
append_to_df_all(spanish1_next_fixtures, team_spanish_dict, 'SP_24_25.csv', 'SP_24_25_v1.csv')

In [24]:
english1_url = 'https://www.iforbet.pl/zaklady-bukmacherskie/155/199'
english1_next_fixtures = create_df_next_fixtures(scrape_next_fixtures(english1_url), number_of_matches=10)

10 10 10 10 10 10 10 10


In [25]:
append_to_df_all(english1_next_fixtures, team_english_dict, '24_25_eng.csv', '24_25_eng_v1.csv')

In [26]:
italian1_url = 'https://www.iforbet.pl/zaklady-bukmacherskie/118/122'
italian1_next_fixtures = create_df_next_fixtures(scrape_next_fixtures(italian1_url), number_of_matches=10)

10 10 10 10 10 10 10 10


In [27]:
append_to_df_all(italian1_next_fixtures, team_italian_dict, 'I1_24_25_1.csv', 'I1_24_25_v1.csv')

In [28]:
german1_url = 'https://www.iforbet.pl/zaklady-bukmacherskie/141/29975'
german1_next_fixtures = create_df_next_fixtures(scrape_next_fixtures(german1_url), number_of_matches=9)

9 9 9 9 9 9 9 9


In [29]:
append_to_df_all(german1_next_fixtures, team_german_dict, 'D1_24_25_1.csv', 'D1_24_25_v1.csv')

#### Summary

By specifying the bookmaker URL and the number of matches to analyze for each league, we keep the data collection process consistent across leagues. Formatting the scraped data so that it integrates with our existing datasets ensures we are equipped to predict outcomes for future matches, which in turn supports refining our betting strategies.