#Examensarbete: **Kevin G√∂rg√º**

# **Del 1: Datainsamling av matchmetadata**



## Syfte
Samla in grundl√§ggande matchinformation fr√•n Premier League s√§songerna 2015-2025 via Football-API.

## Vad g√∂rs h√§r?
1. **API-anslutning**: Kopplar upp mot Football-API med din API-nyckel
2. **Liga och s√§songsval**: Definierar vilka ligor och √•r som ska samlas in (Premier League ID=39)
3. **S√§songbest√§mning**: Automatisk ber√§kning av aktuell s√§song baserat p√• datum (augusti-juli)
4. **Datah√§mtning**: F√∂r varje s√§song h√§mtas:
   - Fixture ID (matchens unika identifierare)
   - Datum och tid
   - Lag (hemma/borta) med ID och namn
   - Halvtidsresultat och slutresultat
   - Vinnare (True/False f√∂r varje lag)

## Utdata
- **DataFrame**: `df_meta` med 3800 matcher
- **Kolumner skapade**:
  - `total_goals_ht`: Totalt antal m√•l vid halvtid
  - `total_goals_ft`: Totalt antal m√•l vid slutsignal
  - `fulltime_over_2_5`: Bin√§r klassificering (1 = √∂ver 2.5 m√•l, 0 = under)

## Viktig anteckning om datal√§ckage
Vid detta steg samlas resultatet in, men det kommer INTE anv√§ndas direkt i modellen.
Endast `fulltime_over_2_5` anv√§nds som target variable (y). All annan information
omvandlas till rullande medelv√§rden i senare steg f√∂r att undvika datal√§ckage.

## Fil som skapas
- `fixture_meta_data.csv` - Grunddata f√∂r alla 3800 matcher

In [None]:
import http.client
import json
import pandas as pd
from datetime import datetime




"""
League IDs, add league IDs to collect more data
Premier League = 39
"""
league_ids = [39]

# Year list 2015-2025
start_year = 2015
end_year = 2025
years = list(range(start_year, end_year))

"""
Determine the correct season year (August - July season). To determine weather or not its a new season.
 - 2025 april = season 2024
 - 2025 august = season 2026
"""
today = datetime.today()
current_season = today.year if today.month >= 8 else today.year - 1


#  API connection settings
API_KEY = "YOUR API KEY" #OBS API KEY EXPIRED

conn = http.client.HTTPSConnection("v3.football.api-sports.io")
headers = {
    'x-rapidapi-host': "v3.football.api-sports.io",
    'x-rapidapi-key': API_KEY
}

print(current_season)

2025


In [3]:
all_rows = []

for league in league_ids:
    for season in years:
        endpoint = f"/fixtures?league={league}&season={season}"
        conn.request("GET", endpoint, headers=headers)
        res = conn.getresponse()
        data = res.read()
        api_response = json.loads(data.decode("utf-8"))

        # (optional) be nice to the API / avoid rate limits
        # time.sleep(0.25)

        for match in api_response.get("response", []):
            fixture = match["fixture"]
            league_info = match["league"]
            teams = match["teams"]
            score = match["score"]

            all_rows.append({
                "fixture_id": fixture["id"],
                "date": fixture["date"],
                "season": league_info["season"],
                "home_team_id": teams["home"]["id"],
                "home_team_name": teams["home"]["name"],
                "away_team_id": teams["away"]["id"],
                "away_team_name": teams["away"]["name"],
                "ht_home_goals": (score.get("halftime") or {}).get("home"),
                "ht_away_goals": (score.get("halftime") or {}).get("away"),
                "ft_home_goals": (score.get("fulltime") or {}).get("home"),
                "ft_away_goals": (score.get("fulltime") or {}).get("away"),
                "home_winner": teams["home"].get("winner"),
                "away_winner": teams["away"].get("winner"),
            })

df = pd.DataFrame(all_rows)

# Possible fixture features
"""
"fixture_id": fixture["id"],
"date": fixture["date"],
"referee": fixture.get("referee"),
"venue_name": fixture["venue"].get("name"),
"venue_city": fixture["venue"].get("city"),
"status": fixture["status"].get("long"),

"league_id": league_info["id"],
"league_name": league_info["name"],
"league_country": league_info["country"],
"season": league_info["season"],
"round": league_info["round"],

"home_team_id": teams["home"]["id"],
"home_team_name": teams["home"]["name"],
"away_team_id": teams["away"]["id"],
"away_team_name": teams["away"]["name"],

"home_goals": goals.get("home"),
"away_goals": goals.get("away"),

"ht_home_goals": score["halftime"].get("home"),
"ht_away_goals": score["halftime"].get("away"),
"ft_home_goals": score["fulltime"].get("home"),
"ft_away_goals": score["fulltime"].get("away"),

"home_winner": teams["home"].get("winner"),
"away_winner": teams["away"].get("winner")
"""



'\n"fixture_id": fixture["id"],\n"date": fixture["date"],\n"referee": fixture.get("referee"),\n"venue_name": fixture["venue"].get("name"),\n"venue_city": fixture["venue"].get("city"),\n"status": fixture["status"].get("long"),\n\n"league_id": league_info["id"],\n"league_name": league_info["name"],\n"league_country": league_info["country"],\n"season": league_info["season"],\n"round": league_info["round"],\n\n"home_team_id": teams["home"]["id"],\n"home_team_name": teams["home"]["name"],\n"away_team_id": teams["away"]["id"],\n"away_team_name": teams["away"]["name"],\n\n"home_goals": goals.get("home"),\n"away_goals": goals.get("away"),\n\n"ht_home_goals": score["halftime"].get("home"),\n"ht_away_goals": score["halftime"].get("away"),\n"ft_home_goals": score["fulltime"].get("home"),\n"ft_away_goals": score["fulltime"].get("away"),\n\n"home_winner": teams["home"].get("winner"),\n"away_winner": teams["away"].get("winner")\n'

In [4]:
df_meta = pd.DataFrame(all_rows)

In [5]:
import numpy as np

# Ensure numeric (API sometimes returns None)
goal_cols = [
    "ht_home_goals", "ht_away_goals",
    "ft_home_goals", "ft_away_goals"
]

df_meta[goal_cols] = df_meta[goal_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

# Half-time total goals
df_meta["total_goals_ht"] = df_meta["ht_home_goals"] + df_meta["ht_away_goals"]

# Full-time total goals
df_meta["total_goals_ft"] = df_meta["ft_home_goals"] + df_meta["ft_away_goals"]

# Full-time Over 2.5 goals (1 = yes, 0 = no)
df_meta["fulltime_over_2_5"] = (df_meta["total_goals_ft"] > 2.5).astype(int)


In [6]:
df = df_meta

In [7]:

# ---------- Derived metrics ----------
df["btts"] = ((df["ft_home_goals"] > 0) & (df["ft_away_goals"] > 0)).astype(int)
df["second_half_goals"] = df["total_goals_ft"] - df["total_goals_ht"]

# ---------- Class balance ----------
class_balance = (
    df["fulltime_over_2_5"]
    .value_counts(normalize=True)
    .rename("percent") * 100
).round(2)

# ---------- EDA summary ----------
eda_summary = {
    "Total matches": len(df),

    "Over 2.5 (%)": round(df["fulltime_over_2_5"].mean() * 100, 2),
    "Under 2.5 (%)": round((1 - df["fulltime_over_2_5"].mean()) * 100, 2),

    "Avg FT goals": round(df["total_goals_ft"].mean(), 2),
    "Avg HT goals": round(df["total_goals_ht"].mean(), 2),

    "Games w/ HT goal (%)": round((df["total_goals_ht"] > 0).mean() * 100, 2),
    "Games w/ 2H goal (%)": round((df["second_half_goals"] > 0).mean() * 100, 2),
    "Goals in both halves (%)": round(
        ((df["total_goals_ht"] > 0) & (df["second_half_goals"] > 0)).mean() * 100, 2
    ),

    "BTTS (%)": round(df["btts"].mean() * 100, 2),
    "0‚Äì0 games (%)": round(
        ((df["ft_home_goals"] == 0) & (df["ft_away_goals"] == 0)).mean() * 100, 2
    ),
}

# ---------- Print results ----------
print("\n=== CLASS BALANCE (Over 2.5) ===")
print(class_balance)

print("\n=== EDA SUMMARY ===")
print(pd.Series(eda_summary))

print("\n=== FULL-TIME GOAL DISTRIBUTION (%) ===")
print(
    (df["total_goals_ft"]
     .value_counts(normalize=True)
     .sort_index() * 100)
    .round(2)
)


=== CLASS BALANCE (Over 2.5) ===
fulltime_over_2_5
1    54.18
0    45.82
Name: percent, dtype: float64

=== EDA SUMMARY ===
Total matches               3800.00
Over 2.5 (%)                  54.18
Under 2.5 (%)                 45.82
Avg FT goals                   2.83
Avg HT goals                   1.26
Games w/ HT goal (%)          71.82
Games w/ 2H goal (%)          80.24
Goals in both halves (%)      58.26
BTTS (%)                      52.34
0‚Äì0 games (%)                  6.21
dtype: float64

=== FULL-TIME GOAL DISTRIBUTION (%) ===
total_goals_ft
0     6.21
1    16.03
2    23.58
3    22.08
4    16.74
5     9.11
6     3.79
7     1.74
8     0.42
9     0.32
Name: proportion, dtype: float64


In [8]:
df.to_csv("fixture_meta_data.csv", index=False)

In [9]:
df.tail()

Unnamed: 0,fixture_id,date,season,home_team_id,home_team_name,away_team_id,away_team_name,ht_home_goals,ht_away_goals,ft_home_goals,ft_away_goals,home_winner,away_winner,total_goals_ht,total_goals_ft,fulltime_over_2_5,btts,second_half_goals
3795,1208396,2025-05-25T15:00:00+00:00,2024,40,Liverpool,52,Crystal Palace,0,1,1,1,,,1,2,0,1,1
3796,1208400,2025-05-25T15:00:00+00:00,2024,41,Southampton,42,Arsenal,0,1,1,2,False,True,1,3,1,1,2
3797,1208401,2025-05-25T15:00:00+00:00,2024,47,Tottenham,51,Brighton,1,0,1,4,False,True,1,5,1,1,4
3798,1208395,2025-05-25T15:00:00+00:00,2024,57,Ipswich,48,West Ham,0,1,1,3,False,True,1,4,1,1,3
3799,1208399,2025-05-25T15:00:00+00:00,2024,65,Nottingham Forest,49,Chelsea,0,0,0,1,False,True,0,1,0,0,1


# **Del 2: H√§mta detaljerad m√•lstatistik**



## Syfte
F√∂r varje match, h√§mta exakt tidpunkt (minut) n√§r varje m√•l gjordes.

## Varf√∂r beh√∂vs detta?
Enligt GoalStatistics analys √§r sannolikheten f√∂r √∂ver 2.5 m√•l:
- **~40%** om m√•l g√∂rs i minut 0-15
- **~16%** om inget m√•l g√∂rs i f√∂rsta halvlek

Detta g√∂r tidiga m√•l till en kritisk prediktor som b√∂r ing√• i modellen.

## Vad g√∂rs h√§r?
1. **Funktion: `fetch_goal_events_once()`**
   - Tar emot lista med fixture_id fr√•n Del 1
   - F√∂r varje match: h√§mtar alla "Goal"-events fr√•n API
   - Extraherar minuttal f√∂r varje m√•l

2. **Hantering av till√§ggstid**
   - Minut kan vara "90+5" ‚Üí parsas till minut 90
   
3. **Smart cachning**
   - Om filen redan finns ‚Üí hoppar √∂ver API-anrop
   - Sparar tid och API-calls vid omk√∂rning

## Tekniska detaljer
- **Rate limiting**: 0.3 sekunder v√§ntetid mellan anrop (undvik API-begr√§nsningar)
- **Felhantering**: Skippar matcher med JSON-fel
- **Progress tracking**: Visar [X/3800] f√∂r varje match

## Fil som skapas
- `goals_prem_2015_2024.csv` med kolumner:
  - `fixture_id`: Matchens ID
  - `team_id`: Vilket lag som gjorde m√•let
  - `team_name`: Lagets namn
  - `minute`: Vilken minut m√•let gjordes (ex: "22" eller "90+3")

## OBS: API-begr√§nsning
Football-API har gr√§ns p√• 100 anrop/dag (gratis tier). F√∂r 3800 matcher
beh√∂vs premium-prenumeration ELLER flera dagar av datainsamling.

## **Vad g√∂r funktionen?** (api_get_with_retries)

`api_get_with_retries` √§r en s√§ker wrapper runt ett HTTP GET-anrop som g√∂r API-h√§mtning robust.

Den:
- kontrollerar alltid HTTP-status (`res.status`)
- retry:ar automatiskt vid n√§tverksfel
- retry:ar vid **429 (rate limit)** och **500‚Äì599 (serverfel)**
- anv√§nder **exponentiell backoff + jitter**
- f√•ngar trasiga eller tomma JSON-svar
- hanterar API-fel som returneras i payload (`"errors"`)
- loggar tydligt alla problem s√• de m√§rks direkt
- avbryter efter ett max antal f√∂rs√∂k och returnerar `None`

Resultat:  
Stabil och fels√§ker API-h√§mtning som inte kraschar vid tillf√§lliga fel eller instabila API:er.


In [10]:
import os
import time
import json
import random
import http.client
import pandas as pd


def api_get_with_retries(
    host,
    endpoint,
    headers,
    fixture_id,
    max_retries=5,
    base_sleep=0.7,
    timeout_s=30,
):
    """
    Robust GET wrapper:
    - checks HTTP status
    - retries on 429 and 5xx with exponential backoff + jitter
    - retries on network errors + JSON decode errors
    - prints enough info to notice problems
    """
    for attempt in range(1, max_retries + 1):
        conn = http.client.HTTPSConnection(host, timeout=timeout_s)
        status = None
        data = None

        try:
            conn.request("GET", endpoint, headers=headers)
            res = conn.getresponse()
            status = res.status
            data = res.read()
        except Exception as e:
            print(f"‚ùå Network error for fixture {fixture_id} (attempt {attempt}/{max_retries}): {e}")
        finally:
            try:
                conn.close()
            except Exception as e_close:
                print(f"‚ö†Ô∏è Failed to close connection (ignored): {e_close}")

        # If request failed before we got anything
        if status is None or data is None:
            sleep = base_sleep * (2 ** (attempt - 1)) + random.uniform(0, 0.25)
            print(f"‚è≥ Retrying fixture {fixture_id} after {sleep:.2f}s (no response)")
            time.sleep(sleep)
            continue

        # Retry on rate limit or server errors
        if status == 429 or (500 <= status <= 599):
            snippet = data[:200]
            sleep = base_sleep * (2 ** (attempt - 1)) + random.uniform(0, 0.25)
            print(
                f"‚ö†Ô∏è HTTP {status} for fixture {fixture_id} (attempt {attempt}/{max_retries}). "
                f"Retrying in {sleep:.2f}s. Body start: {snippet!r}"
            )
            time.sleep(sleep)
            continue

        # Non-retriable HTTP errors
        if status != 200:
            snippet = data[:200]
            print(f"‚ùå HTTP {status} for fixture {fixture_id}. Body start: {snippet!r}")
            return None

        # Parse JSON
        try:
            payload = json.loads(data.decode("utf-8"))
        except json.JSONDecodeError as e:
            snippet = data[:200]
            sleep = base_sleep * (2 ** (attempt - 1)) + random.uniform(0, 0.25)
            print(
                f"‚ùå JSON decode failed for fixture {fixture_id} (attempt {attempt}/{max_retries}): {e}. "
                f"Retrying in {sleep:.2f}s. Body start: {snippet!r}"
            )
            time.sleep(sleep)
            continue

        # API-level error field (if present)
        if isinstance(payload, dict) and payload.get("errors"):
            sleep = base_sleep * (2 ** (attempt - 1)) + random.uniform(0, 0.25)
            print(
                f"‚ö†Ô∏è API errors for fixture {fixture_id} (attempt {attempt}/{max_retries}): "
                f"{payload.get('errors')}. Retrying in {sleep:.2f}s."
            )
            time.sleep(sleep)
            continue

        return payload

    print(f"‚ùå Giving up fixture {fixture_id} after {max_retries} attempts.")
    return None


In [11]:
def fetch_goal_events_once(
    api_key,
    df_fixtures=None,
    fixtures_path="../raw_data_files/fixtures_metadata.csv",
    save_path="goals_prem_2015_2024.csv",
    sleep_s=0.3,
    exclude_penalties=False,
    exclude_own_goals=False,
):
    # ----------------------------
    # Load fixture IDs
    # ----------------------------
    if df_fixtures is not None:
        fixture_ids = pd.unique(df_fixtures["fixture_id"])
        print(f"üìä Using in-memory DataFrame: {len(fixture_ids)} fixtures")
    else:
        df_fixtures = pd.read_csv(fixtures_path)
        fixture_ids = pd.unique(df_fixtures["fixture_id"])
        print(f"üìä Loaded fixtures from CSV: {len(fixture_ids)} fixtures")

    # ----------------------------
    # Ensure folder + CSV exist
    # ----------------------------
    os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)

    if not os.path.exists(save_path):
        pd.DataFrame(columns=["fixture_id", "team_id", "team_name", "minute"]).to_csv(save_path, index=False)
        print(f"üÜï Created CSV ‚Üí {save_path}")

    # ----------------------------
    # Checkpoint to resume ALL fixtures (including 0-goal matches)
    # ----------------------------
    checkpoint_path = save_path + ".done"
    if not os.path.exists(checkpoint_path):
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            f.write("")
        print(f"üÜï Created checkpoint ‚Üí {checkpoint_path}")

    completed = set()
    try:
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    try:
                        completed.add(int(line))
                    except ValueError:
                        completed.add(line)
    except Exception as e:
        print(f"‚ùå Failed to read checkpoint for resume: {e}")
        completed = set()

    print(f"üîÅ Resuming: {len(completed)} fixtures already processed")

    headers = {
        "x-rapidapi-host": "v3.football.api-sports.io",
        "x-rapidapi-key": api_key,
    }

    total = len(fixture_ids)

    for i, fixture_id in enumerate(fixture_ids, 1):
        # Skip if already completed
        if fixture_id in completed:
            continue

        print(f"[{i}/{total}] üì• Fetching goals for fixture {fixture_id}")

        endpoint = f"/fixtures?id={fixture_id}"
        payload = api_get_with_retries(
            host="v3.football.api-sports.io",
            endpoint=endpoint,
            headers=headers,
            fixture_id=fixture_id,
            max_retries=5,
            base_sleep=0.7,
            timeout_s=30,
        )

        # If request ultimately failed, do NOT mark completed (so we can retry next run)
        if payload is None:
            print(f"‚ùå Skipping fixture {fixture_id} (request failed). Will retry on next run.")
            time.sleep(sleep_s)
            continue

        resp = payload.get("response", [])
        if not resp:
            # This is a successful call but no match data came back.
            # We mark completed to avoid infinite loops, but we PRINT so you notice.
            print(f"‚ö†Ô∏è Empty response for fixture {fixture_id} (marked completed).")
            try:
                with open(checkpoint_path, "a", encoding="utf-8") as f:
                    f.write(str(fixture_id) + "\n")
                completed.add(fixture_id)
            except Exception as e:
                print(f"‚ùå Failed to write checkpoint for fixture {fixture_id}: {e}")
            time.sleep(sleep_s)
            continue

        match = resp[0]
        fx_id = match.get("fixture", {}).get("id", fixture_id)
        events = match.get("events", []) or []

        rows = []

        for ev in events:
            if ev.get("type") != "Goal":
                continue

            detail = (ev.get("detail") or "").lower()
            if exclude_penalties and "penalty" in detail:
                continue
            if exclude_own_goals and "own goal" in detail:
                continue

            team = ev.get("team", {}) or {}
            t = ev.get("time", {}) or {}
            elapsed = t.get("elapsed")
            extra = t.get("extra")

            minute = elapsed if extra is None else f"{elapsed}+{extra}"

            rows.append(
                {
                    "fixture_id": fx_id,
                    "team_id": team.get("id"),
                    "team_name": team.get("name"),
                    "minute": minute,
                }
            )

        # ----------------------------
        # Append immediately (goals CSV)
        # ----------------------------
        if rows:
            pd.DataFrame(rows).to_csv(save_path, mode="a", header=False, index=False)

        # ----------------------------
        # Mark fixture completed in checkpoint (ALWAYS, even if 0 goals)
        # ----------------------------
        try:
            with open(checkpoint_path, "a", encoding="utf-8") as f:
                f.write(str(fixture_id) + "\n")
            completed.add(fixture_id)
        except Exception as e:
            print(f"‚ùå Failed to write checkpoint for fixture {fixture_id}: {e}")

        time.sleep(sleep_s)

    print(f"‚úÖ Goal fetch complete ‚Üí {save_path}")
    return pd.read_csv(save_path)

In [12]:
    fetch_goal_events_once(
        api_key=API_KEY,
        fixtures_path="/content/fixture_meta_data.csv",
        save_path="goals_prem_2015_2024.csv",
        sleep_s=0.3,
        exclude_penalties=False,
        exclude_own_goals=False,
    )

üìä Loaded fixtures from CSV: 3800 fixtures
üÜï Created checkpoint ‚Üí goals_prem_2015_2024.csv.done
üîÅ Resuming: 0 fixtures already processed
[1/3800] üì• Fetching goals for fixture 192297
[2/3800] üì• Fetching goals for fixture 192298
[3/3800] üì• Fetching goals for fixture 192301
[4/3800] üì• Fetching goals for fixture 192300
[5/3800] üì• Fetching goals for fixture 192299
[6/3800] üì• Fetching goals for fixture 192302
[7/3800] üì• Fetching goals for fixture 192303
[8/3800] üì• Fetching goals for fixture 192304
[9/3800] üì• Fetching goals for fixture 192305
[10/3800] üì• Fetching goals for fixture 192306
[11/3800] üì• Fetching goals for fixture 192307
[12/3800] üì• Fetching goals for fixture 192308
[13/3800] üì• Fetching goals for fixture 192313
[14/3800] üì• Fetching goals for fixture 192312
[15/3800] üì• Fetching goals for fixture 192309
[16/3800] üì• Fetching goals for fixture 192310
[17/3800] üì• Fetching goals for fixture 192311
[18/3800] üì• Fetching goals

Unnamed: 0,fixture_id,team_id,team_name,minute
0,192297,33,Manchester United,22
1,192298,66,Aston Villa,72
2,192301,38,Watford,14
3,192301,45,Everton,76
4,192301,38,Watford,83
...,...,...,...,...
21535,1208395,48,West Ham,43
21536,1208395,57,Ipswich,52
21537,1208395,48,West Ham,55
21538,1208395,48,West Ham,87


In [13]:
df_meta = pd.read_csv("/content/fixture_meta_data.csv")
df_goals = pd.read_csv("/content/goals_prem_2015_2024.csv")

In [14]:
df_goals.tail()

Unnamed: 0,fixture_id,team_id,team_name,minute
21535,1208395,48,West Ham,43
21536,1208395,57,Ipswich,52
21537,1208395,48,West Ham,55
21538,1208395,48,West Ham,87
21539,1208399,49,Chelsea,50


# **Del 3: Omvandla och sl√• samman m√•lstatistik**

## Syfte
Kombinera matchmetadata (Del 1) med detaljerad m√•lstatistik (Del 2)
till en enhetlig dataset d√§r varje match har en lista √∂ver m√•lminuter.

## Vad g√∂rs h√§r?

### Steg 1: `build_df_goals_compact()`
Denna funktion omvandlar r√• m√•ldata till kompakt format:

**Input**:
```
fixture_id | team_id | minute
1234       | 33      | 22
1234       | 47      | 76
```

**Output**:
```
fixture_id | home_goal_minutes | away_goal_minutes
1234       | [22]              | [76]
```

**Processteg**:
1. Parsear minut-str√§ngar ("90+5" ‚Üí 90)
2. Matchar team_id mot home_team_id/away_team_id
3. Grupperar m√•l per match och lag
4. Skapar sorterade listor av m√•lminuter
5. **Kritiskt**: Bevarar 0-0 matcher som tomma listor []

### Steg 2: Merge med metadata
Sl√•r ihop kompakt m√•ldata med ursprungliga matchdata:
- Alla kolumner fr√•n `fixture_meta_data.csv` beh√•lls
- Tv√• nya kolumner l√§ggs till:
  - `home_goal_minutes`: Lista med hemmalags m√•lminuter
  - `away_goal_minutes`: Lista med bortalags m√•lminuter

## Exempel p√• resultat
```
fixture_id: 192297
home_team: Manchester United
away_team: Tottenham
ft_home_goals: 1
ft_away_goals: 0
home_goal_minutes: [22]        ‚Üê M√•l i minut 22
away_goal_minutes: []          ‚Üê Inga m√•l
```

## Varf√∂r listor?
Listor till√•ter senare analys av:
- Antal tidiga m√•l (0-15 min)
- Antal f√∂rsta halvlek m√•l (0-45 min)
- M√•lf√∂rdelning √∂ver tid

## Fil som skapas
- `complete_fixture_data.csv` - Komplett dataset med alla 3800 matcher
  och deras respektive m√•lminuter

In [15]:
import pandas as pd
import numpy as np

def build_df_goals_compact(df_goals: pd.DataFrame, df_meta: pd.DataFrame) -> pd.DataFrame:
    goals = df_goals.copy()
    meta = df_meta[["fixture_id","home_team_id","away_team_id"]].drop_duplicates().copy()

    # Base: ALL fixtures (ensures 0-0 fixtures still appear)
    base = meta[["fixture_id"]].drop_duplicates().copy()

    # --- Ensure IDs are numeric (prevents string-vs-int comparisons) ---
    for c in ["fixture_id", "team_id"]:
        goals[c] = pd.to_numeric(goals[c], errors="coerce")
    for c in ["fixture_id", "home_team_id", "away_team_id"]:
        meta[c] = pd.to_numeric(meta[c], errors="coerce")

    goals = goals.dropna(subset=["fixture_id","team_id"]).copy()
    meta = meta.dropna(subset=["fixture_id","home_team_id","away_team_id"]).copy()

    goals["fixture_id"] = goals["fixture_id"].astype(int)
    goals["team_id"] = goals["team_id"].astype(int)
    meta["fixture_id"] = meta["fixture_id"].astype(int)
    meta["home_team_id"] = meta["home_team_id"].astype(int)
    meta["away_team_id"] = meta["away_team_id"].astype(int)
    base["fixture_id"] = pd.to_numeric(base["fixture_id"], errors="coerce").astype(int)

    # --- Parse minute (handles "90+5") ---
    m = goals["minute"].astype(str)
    goals["minute_main"] = pd.to_numeric(m.str.split("+").str[0], errors="coerce")
    goals = goals.dropna(subset=["minute_main"]).copy()
    goals["minute_main"] = goals["minute_main"].astype(int)

    # --- Attach home/away ---
    goals = goals.merge(meta, on="fixture_id", how="left")

    goals["side"] = None
    goals.loc[goals["team_id"] == goals["home_team_id"], "side"] = "home"
    goals.loc[goals["team_id"] == goals["away_team_id"], "side"] = "away"
    goals = goals.dropna(subset=["side"]).copy()

    # --- Aggregate minutes into sorted lists ---
    agg = (
        goals.sort_values("minute_main")
             .groupby(["fixture_id","side"])["minute_main"]
             .apply(list)
             .unstack()                      # keep NaN for missing sides
             .reset_index()
             .rename(columns={"home": "home_goal_minutes", "away": "away_goal_minutes"})
    )

    # --- Left join onto ALL fixtures so 0-0 games are included ---
    out = base.merge(agg, on="fixture_id", how="left")

    # Fill missing lists with empty lists
    out["home_goal_minutes"] = out["home_goal_minutes"].apply(lambda x: x if isinstance(x, list) else [])
    out["away_goal_minutes"] = out["away_goal_minutes"].apply(lambda x: x if isinstance(x, list) else [])

    return out





In [16]:
df_goals_compact = build_df_goals_compact(df_goals, df_meta)
df_goals_compact.tail()

Unnamed: 0,fixture_id,away_goal_minutes,home_goal_minutes
3795,1208396,"[9, 9]","[84, 84]"
3796,1208400,"[43, 43, 90, 90]","[56, 56]"
3797,1208401,"[51, 51, 64, 64, 88, 88, 90, 90]","[17, 17]"
3798,1208395,"[43, 43, 55, 55, 87, 87]","[52, 52]"
3799,1208399,"[50, 50]",[]


In [17]:
df_meta_merged = df_meta.merge(df_goals_compact, on="fixture_id", how="left")


In [18]:
df_meta_merged.head()

Unnamed: 0,fixture_id,date,season,home_team_id,home_team_name,away_team_id,away_team_name,ht_home_goals,ht_away_goals,ft_home_goals,ft_away_goals,home_winner,away_winner,total_goals_ht,total_goals_ft,fulltime_over_2_5,btts,second_half_goals,away_goal_minutes,home_goal_minutes
0,192297,2015-08-08T11:45:00+00:00,2015,33,Manchester United,47,Tottenham,1,0,1,0,True,False,1,1,0,0,0,[],"[22, 22]"
1,192298,2015-08-08T14:00:00+00:00,2015,35,Bournemouth,66,Aston Villa,0,0,0,1,False,True,0,1,0,0,1,"[72, 72]",[]
2,192301,2015-08-08T14:00:00+00:00,2015,45,Everton,38,Watford,0,1,2,2,,,1,4,1,1,3,"[14, 14, 83, 83]","[76, 76, 86, 86]"
3,192300,2015-08-08T14:00:00+00:00,2015,46,Leicester,746,Sunderland,3,0,4,2,True,False,3,6,1,1,3,"[60, 60, 71, 71]","[11, 11, 18, 18, 25, 25, 66, 66]"
4,192299,2015-08-08T14:00:00+00:00,2015,71,Norwich,52,Crystal Palace,0,1,1,3,False,True,1,4,1,1,3,"[39, 39, 49, 49, 90, 90]","[69, 69]"


In [19]:
df_meta_merged.to_csv("complete_fixture_data.csv", index=False)



# **Del 4: Feature engineering med rullande medelv√§rden**



## Syfte
Omvandla historiska matchresultat till attribut som √§r **k√§nda innan avspark**.

## KRITISKT: Undvika datal√§ckage
**Datal√§ckage** = n√§r modellen f√•r information som inte skulle vara tillg√§nglig
vid prediktionstillf√§llet.

**Exempel p√• l√§ckage** ‚ùå:
- Anv√§nda `ft_home_goals` direkt ‚Üí detta √§r matchens slutresultat!

**Korrekt approach** ‚úÖ:
- Anv√§nda `home_avg_ft_gf_last3` ‚Üí genomsnitt av de TRE SENASTE matcherna

## Vad g√∂rs h√§r?

### Steg 1: Sortera matcher temporalt
```python
df = df.sort_values(["date", "fixture_id"])
```
S√§kerst√§ller att matcher processas i kronologisk ordning.

### Steg 2: Skapa team-match view
Varje match blir tv√• rader (en f√∂r hemma, en f√∂r borta):
```
Match: Man Utd 1-0 Tottenham
‚Üí Rad 1: team_id=33 (Man Utd), side=home, pts=3, ft_gf=1, ft_ga=0
‚Üí Rad 2: team_id=47 (Tottenham), side=away, pts=0, ft_gf=0, ft_ga=1
```

### Steg 3: Ber√§kna rullande medelv√§rden
**Nyckelrad**:
```python
.groupby("team_id")[col].transform(
    lambda s: s.shift(1).rolling(window, min_periods=1).mean()
)
```

**Vad g√∂r `.shift(1)`?**
- Flyttar serien ett steg BAK√ÖT i tiden
- Betyder att match N anv√§nder data fr√•n matcher N-1, N-2, N-3
- **Detta f√∂rhindrar datal√§ckage!**

**Exempel f√∂r Manchester United**:
```
Match 1: pts=3 ‚Üí avg_pts_last3 = NaN (ingen historik)
Match 2: pts=0 ‚Üí avg_pts_last3 = 3.0 (endast match 1)
Match 3: pts=1 ‚Üí avg_pts_last3 = 1.5 (match 1+2)
Match 4: pts=3 ‚Üí avg_pts_last3 = 1.33 (match 1+2+3)
Match 5: pts=0 ‚Üí avg_pts_last3 = 1.33 (match 2+3+4) ‚Üê window=3
```

### Steg 4: Attribut som ber√§knas (window=3)

**Allm√§n form**:
- `avg_pts_last3`: Genomsnittliga po√§ng
- `pts_sum_last3`: Total po√§ngsumma
- `avg_ft_gf_last3`: Genomsnittliga gjorda m√•l (fulltime)
- `avg_ft_ga_last3`: Genomsnittliga insl√§ppta m√•l (fulltime)
- `avg_ht_gf_last3`: Genomsnittliga gjorda m√•l (halvtid)
- `avg_ht_ga_last3`: Genomsnittliga insl√§ppta m√•l (halvtid)

**Tidiga m√•l (fr√•n m√•lminuter)**:
- `avg_scored_0_15_last3`: Genomsnittliga m√•l i minut 0-15
- `avg_scored_1H_last3`: Genomsnittliga m√•l i f√∂rsta halvlek

### Steg 5: Split till home/away
Attributen delas upp s√• att varje match har:
- `home_avg_ft_gf_last3` - hemmalags genomsnitt
- `away_avg_ft_gf_last3` - bortalags genomsnitt

### Steg 6: Merge tillbaka till fixture-niv√•
Resultatet blir EN rad per match med 18 attribut:
- 5 identifierare (fixture_id, date, season, home_team_id, away_team_id)
- 6 hemma-attribut (avg_pts, avg_gf, avg_ga, osv.)
- 6 borta-attribut (samma som hemma)
- 4 tidiga m√•l-attribut (0-15 min, 1H m√•l)
- 1 target (fulltime_over_2_5)

## Rensning av data
- Droppar rader med NaN (f√∂rsta matcherna per s√§song saknar historik)
- Droppar rader med inf-v√§rden (division med noll, etc.)
- **Slutresultat**: 3776 matcher (f√∂rlorade 24 pga saknad historik)

## Fil som skapas
- Ingen explicit fil, men `df_pre_match` inneh√•ller f√§rdiga features
- Detta √§r den FINALA dataset som anv√§nds f√∂r tr√§ning/test

In [20]:
import pandas as pd
import numpy as np
df = pd.read_csv("complete_fixture_data.csv")


In [21]:
df["date"] = pd.to_datetime(df["date"], errors="coerce").dt.normalize()
df.head()

Unnamed: 0,fixture_id,date,season,home_team_id,home_team_name,away_team_id,away_team_name,ht_home_goals,ht_away_goals,ft_home_goals,ft_away_goals,home_winner,away_winner,total_goals_ht,total_goals_ft,fulltime_over_2_5,btts,second_half_goals,away_goal_minutes,home_goal_minutes
0,192297,2015-08-08 00:00:00+00:00,2015,33,Manchester United,47,Tottenham,1,0,1,0,True,False,1,1,0,0,0,[],"[22, 22]"
1,192298,2015-08-08 00:00:00+00:00,2015,35,Bournemouth,66,Aston Villa,0,0,0,1,False,True,0,1,0,0,1,"[72, 72]",[]
2,192301,2015-08-08 00:00:00+00:00,2015,45,Everton,38,Watford,0,1,2,2,,,1,4,1,1,3,"[14, 14, 83, 83]","[76, 76, 86, 86]"
3,192300,2015-08-08 00:00:00+00:00,2015,46,Leicester,746,Sunderland,3,0,4,2,True,False,3,6,1,1,3,"[60, 60, 71, 71]","[11, 11, 18, 18, 25, 25, 66, 66]"
4,192299,2015-08-08 00:00:00+00:00,2015,71,Norwich,52,Crystal Palace,0,1,1,3,False,True,1,4,1,1,3,"[39, 39, 49, 49, 90, 90]","[69, 69]"


In [22]:
print("Rows:", len(df))
print("Date range:", df["date"].min(), "->", df["date"].max())

Rows: 3800
Date range: 2015-08-08 00:00:00+00:00 -> 2025-05-25 00:00:00+00:00


In [23]:
goal_cols = ["ht_home_goals","ht_away_goals","ft_home_goals","ft_away_goals"]
df[goal_cols] = df[goal_cols].apply(pd.to_numeric, errors="coerce")

# sort in match order (fixture_id tie-breaker helps stability)
df = df.sort_values(["date", "fixture_id"]).reset_index(drop=True)

print(df[["fixture_id","date"]].head(10))
print("\nAny missing goals?", df[goal_cols].isna().any())

   fixture_id                      date
0      192297 2015-08-08 00:00:00+00:00
1      192298 2015-08-08 00:00:00+00:00
2      192299 2015-08-08 00:00:00+00:00
3      192300 2015-08-08 00:00:00+00:00
4      192301 2015-08-08 00:00:00+00:00
5      192302 2015-08-08 00:00:00+00:00
6      192303 2015-08-09 00:00:00+00:00
7      192304 2015-08-09 00:00:00+00:00
8      192305 2015-08-09 00:00:00+00:00
9      192306 2015-08-10 00:00:00+00:00

Any missing goals? ht_home_goals    False
ht_away_goals    False
ft_home_goals    False
ft_away_goals    False
dtype: bool


In [24]:
# FT points
home_pts = np.select(
    [df["ft_home_goals"] > df["ft_away_goals"], df["ft_home_goals"] == df["ft_away_goals"]],
    [3, 1],
    default=0
)
away_pts = np.select(
    [df["ft_away_goals"] > df["ft_home_goals"], df["ft_away_goals"] == df["ft_home_goals"]],
    [3, 1],
    default=0
)

home = pd.DataFrame({
    "fixture_id": df["fixture_id"],
    "date": df["date"],
    "team_id": df["home_team_id"],
    "side": "home",
    "pts": home_pts,
    "ht_gf": df["ht_home_goals"],
    "ht_ga": df["ht_away_goals"],
    "ft_gf": df["ft_home_goals"],
    "ft_ga": df["ft_away_goals"],
})

away = pd.DataFrame({
    "fixture_id": df["fixture_id"],
    "date": df["date"],
    "team_id": df["away_team_id"],
    "side": "away",
    "pts": away_pts,
    "ht_gf": df["ht_away_goals"],
    "ht_ga": df["ht_home_goals"],
    "ft_gf": df["ft_away_goals"],
    "ft_ga": df["ft_home_goals"],
})

team_matches = pd.concat([home, away], ignore_index=True)
team_matches = team_matches.sort_values(["team_id","date","fixture_id"]).reset_index(drop=True)

print("Team-match rows (should be 2x fixtures):", len(team_matches), "vs", 2*len(df))
team_matches.head(10)


Team-match rows (should be 2x fixtures): 7600 vs 7600


Unnamed: 0,fixture_id,date,team_id,side,pts,ht_gf,ht_ga,ft_gf,ft_ga
0,192297,2015-08-08 00:00:00+00:00,33,home,3,1,0,1,0
1,192307,2015-08-14 00:00:00+00:00,33,away,3,1,0,1,0
2,192317,2015-08-22 00:00:00+00:00,33,home,1,0,0,0,0
3,192336,2015-08-30 00:00:00+00:00,33,away,0,0,0,1,2
4,192343,2015-09-12 00:00:00+00:00,33,home,3,0,0,3,1
5,192355,2015-09-20 00:00:00+00:00,33,away,3,1,1,3,2
6,192362,2015-09-26 00:00:00+00:00,33,home,3,1,0,3,0
7,192375,2015-10-04 00:00:00+00:00,33,away,0,0,3,0,3
8,192381,2015-10-17 00:00:00+00:00,33,away,3,2,0,3,0
9,192394,2015-10-25 00:00:00+00:00,33,home,1,0,0,0,0


In [25]:
window = 3

# --- rolling averages (PRE-match) in team_matches ---
for col, out in [
    ("pts",   f"avg_pts_last{window}"),
    ("ht_gf", f"avg_ht_gf_last{window}"),
    ("ht_ga", f"avg_ht_ga_last{window}"),
    ("ft_gf", f"avg_ft_gf_last{window}"),
    ("ft_ga", f"avg_ft_ga_last{window}"),
]:
    team_matches[out] = (
        team_matches
        .groupby("team_id")[col]
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )

team_matches[f"pts_sum_last{window}"] = (
    team_matches
    .groupby("team_id")["pts"]
    .transform(lambda s: s.shift(1).rolling(window, min_periods=1).sum())
)

feat_cols = [
    "fixture_id", "side",
    f"avg_pts_last{window}", f"pts_sum_last{window}",
    f"avg_ht_gf_last{window}", f"avg_ht_ga_last{window}",
    f"avg_ft_gf_last{window}", f"avg_ft_ga_last{window}",
]

# --- split to home / away and rename columns ---
home_feats = team_matches[team_matches["side"]=="home"][feat_cols].drop(columns=["side"]).rename(columns={
    f"avg_pts_last{window}":     f"home_avg_pts_last{window}",
    f"pts_sum_last{window}":     f"home_pts_sum_last{window}",
    f"avg_ht_gf_last{window}":   f"home_avg_ht_gf_last{window}",
    f"avg_ht_ga_last{window}":   f"home_avg_ht_ga_last{window}",
    f"avg_ft_gf_last{window}":   f"home_avg_ft_gf_last{window}",
    f"avg_ft_ga_last{window}":   f"home_avg_ft_ga_last{window}",
})

away_feats = team_matches[team_matches["side"]=="away"][feat_cols].drop(columns=["side"]).rename(columns={
    f"avg_pts_last{window}":     f"away_avg_pts_last{window}",
    f"pts_sum_last{window}":     f"away_pts_sum_last{window}",
    f"avg_ht_gf_last{window}":   f"away_avg_ht_gf_last{window}",
    f"avg_ht_ga_last{window}":   f"away_avg_ht_ga_last{window}",
    f"avg_ft_gf_last{window}":   f"away_avg_ft_gf_last{window}",
    f"avg_ft_ga_last{window}":   f"away_avg_ft_ga_last{window}",
})

# --- merge back into fixture-level df ---
df_roll = df.merge(home_feats, on="fixture_id", how="left").merge(away_feats, on="fixture_id", how="left")

print("Rows preserved?", len(df_roll), "==", len(df))
df_roll[[ "fixture_id","date","home_team_id","away_team_id",
          f"home_avg_ft_gf_last{window}", f"away_avg_ft_gf_last{window}",
          f"home_pts_sum_last{window}", f"away_pts_sum_last{window}" ]].tail()


Rows preserved? 3800 == 3800


Unnamed: 0,fixture_id,date,home_team_id,away_team_id,home_avg_ft_gf_last3,away_avg_ft_gf_last3,home_pts_sum_last3,away_pts_sum_last3
3795,1208398,2025-05-25 00:00:00+00:00,34,45,1.0,2.333333,4.0,7.0
3796,1208399,2025-05-25 00:00:00+00:00,65,49,1.666667,1.333333,5.0,6.0
3797,1208400,2025-05-25 00:00:00+00:00,41,42,0.0,1.333333,1.0,4.0
3798,1208401,2025-05-25 00:00:00+00:00,47,51,0.333333,2.0,1.0,7.0
3799,1208402,2025-05-25 00:00:00+00:00,39,55,0.666667,2.333333,0.0,6.0


In [26]:
import ast

def parse_and_clean_goal_minutes(x):
    if pd.isna(x):
        return []
    if isinstance(x, list):
        raw = x
    else:
        try:
            raw = ast.literal_eval(x)   # turns "[9]" ‚Üí [9]
        except:
            return []

    out = []
    for m in raw:
        try:
            out.append(int(str(m).split("+")[0]))
        except:
             print(f"‚ö†Ô∏è Failed to parse goal minute {m!r}: {e}")
    return out

df["home_goal_minutes"] = df["home_goal_minutes"].apply(parse_and_clean_goal_minutes)
df["away_goal_minutes"] = df["away_goal_minutes"].apply(parse_and_clean_goal_minutes)


In [27]:
def count_in_window(minutes, start, end):
    return sum(start <= m < end for m in minutes)

df["home_goals_0_15"] = df["home_goal_minutes"].apply(lambda x: count_in_window(x, 0, 15))
df["home_goals_1H"]   = df["home_goal_minutes"].apply(lambda x: count_in_window(x, 0, 45))

df["away_goals_0_15"] = df["away_goal_minutes"].apply(lambda x: count_in_window(x, 0, 15))
df["away_goals_1H"]   = df["away_goal_minutes"].apply(lambda x: count_in_window(x, 0, 45))

In [28]:
home_tm = pd.DataFrame({
    "fixture_id": df["fixture_id"],
    "date": df["date"],
    "team_id": df["home_team_id"],
    "side": "home",
    "scored_0_15": df["home_goals_0_15"],
    "scored_1H": df["home_goals_1H"],
})

away_tm = pd.DataFrame({
    "fixture_id": df["fixture_id"],
    "date": df["date"],
    "team_id": df["away_team_id"],
    "side": "away",
    "scored_0_15": df["away_goals_0_15"],
    "scored_1H": df["away_goals_1H"],
})

team_time = (
    pd.concat([home_tm, away_tm], ignore_index=True)
      .sort_values(["team_id","date","fixture_id"])
      .reset_index(drop=True)
)


In [29]:
window = 3

for col in ["scored_0_15", "scored_1H"]:
    team_time[f"avg_{col}_last{window}"] = (
        team_time
        .groupby("team_id")[col]
        .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )


In [30]:
home_roll = team_time[team_time["side"]=="home"][
    ["fixture_id",
     f"avg_scored_0_15_last{window}",
     f"avg_scored_1H_last{window}"]
].rename(columns={
    f"avg_scored_0_15_last{window}": f"home_avg_scored_0_15_last{window}",
    f"avg_scored_1H_last{window}":   f"home_avg_scored_1H_last{window}",
})

away_roll = team_time[team_time["side"]=="away"][
    ["fixture_id",
     f"avg_scored_0_15_last{window}",
     f"avg_scored_1H_last{window}"]
].rename(columns={
    f"avg_scored_0_15_last{window}": f"away_avg_scored_0_15_last{window}",
    f"avg_scored_1H_last{window}":   f"away_avg_scored_1H_last{window}",
})

df_final = (
    df_roll
    .merge(home_roll, on="fixture_id", how="left")
    .merge(away_roll, on="fixture_id", how="left")
)


In [31]:
df_final.columns.to_list()

['fixture_id',
 'date',
 'season',
 'home_team_id',
 'home_team_name',
 'away_team_id',
 'away_team_name',
 'ht_home_goals',
 'ht_away_goals',
 'ft_home_goals',
 'ft_away_goals',
 'home_winner',
 'away_winner',
 'total_goals_ht',
 'total_goals_ft',
 'fulltime_over_2_5',
 'btts',
 'second_half_goals',
 'away_goal_minutes',
 'home_goal_minutes',
 'home_avg_pts_last3',
 'home_pts_sum_last3',
 'home_avg_ht_gf_last3',
 'home_avg_ht_ga_last3',
 'home_avg_ft_gf_last3',
 'home_avg_ft_ga_last3',
 'away_avg_pts_last3',
 'away_pts_sum_last3',
 'away_avg_ht_gf_last3',
 'away_avg_ht_ga_last3',
 'away_avg_ft_gf_last3',
 'away_avg_ft_ga_last3',
 'home_avg_scored_0_15_last3',
 'home_avg_scored_1H_last3',
 'away_avg_scored_0_15_last3',
 'away_avg_scored_1H_last3']

In [32]:
prematch_features = [
    # identifiers / ordering
    "fixture_id",
    "date",
    "season",
    "home_team_id",
    "away_team_id",

    # rolling form & goals
    "home_avg_pts_last3",
    "home_pts_sum_last3",
    "home_avg_ht_gf_last3",
    "home_avg_ht_ga_last3",
    "home_avg_ft_gf_last3",
    "home_avg_ft_ga_last3",

    "away_avg_pts_last3",
    "away_pts_sum_last3",
    "away_avg_ht_gf_last3",
    "away_avg_ht_ga_last3",
    "away_avg_ft_gf_last3",
    "away_avg_ft_ga_last3",

    # rolling early-goal tendencies
    "home_avg_scored_0_15_last3",
    "home_avg_scored_1H_last3",
    "away_avg_scored_0_15_last3",
    "away_avg_scored_1H_last3",
]

df_pre_match = df_final[prematch_features + ["fulltime_over_2_5"]].copy()

In [33]:
# columns with at least one NaN + count
df_pre_match.isna().sum().loc[lambda s: s > 0]


Unnamed: 0,0
home_avg_pts_last3,18
home_pts_sum_last3,18
home_avg_ht_gf_last3,18
home_avg_ht_ga_last3,18
home_avg_ft_gf_last3,18
home_avg_ft_ga_last3,18
away_avg_pts_last3,16
away_pts_sum_last3,16
away_avg_ht_gf_last3,16
away_avg_ht_ga_last3,16


In [34]:
# drop rows with NaNs in any column
df_pre_match = df_pre_match.dropna().reset_index(drop=True)
len(df_pre_match)


3776

In [35]:
df_pre_match = df_pre_match.replace([np.inf, -np.inf], np.nan).dropna().reset_index(drop=True)


In [36]:
df_pre_match.tail()

Unnamed: 0,fixture_id,date,season,home_team_id,away_team_id,home_avg_pts_last3,home_pts_sum_last3,home_avg_ht_gf_last3,home_avg_ht_ga_last3,home_avg_ft_gf_last3,...,away_pts_sum_last3,away_avg_ht_gf_last3,away_avg_ht_ga_last3,away_avg_ft_gf_last3,away_avg_ft_ga_last3,home_avg_scored_0_15_last3,home_avg_scored_1H_last3,away_avg_scored_0_15_last3,away_avg_scored_1H_last3,fulltime_over_2_5
3771,1208398,2025-05-25 00:00:00+00:00,2024,34,45,1.333333,4.0,0.333333,0.333333,1.0,...,7.0,1.666667,0.666667,2.333333,1.0,0.666667,0.666667,0.666667,2.0,0
3772,1208399,2025-05-25 00:00:00+00:00,2024,65,49,1.666667,5.0,0.666667,0.333333,1.666667,...,6.0,0.333333,0.333333,1.333333,1.0,0.666667,1.333333,0.666667,0.666667,0
3773,1208400,2025-05-25 00:00:00+00:00,2024,41,42,0.333333,1.0,0.0,1.333333,0.0,...,4.0,0.333333,0.666667,1.333333,1.333333,0.0,0.0,0.0,0.666667,1
3774,1208401,2025-05-25 00:00:00+00:00,2024,47,51,0.333333,1.0,0.333333,0.666667,0.333333,...,7.0,1.0,0.666667,2.0,1.0,0.0,0.666667,0.0,2.0,1
3775,1208402,2025-05-25 00:00:00+00:00,2024,39,55,0.0,0.0,0.333333,1.333333,0.666667,...,6.0,1.666667,0.666667,2.333333,2.0,0.0,0.666667,0.0,4.0,0


# **Del 5: Tr√§na och utv√§rdera modeller**


## Syfte
Tr√§na tv√• maskininl√§rningsmodeller och j√§mf√∂ra deras f√∂rm√•ga att f√∂ruts√§ga √∂ver 2.5 m√•l baserat p√• pre-match-attribut.

## Val av modeller

### 1. Logistic Regression (Baseline)
**Varf√∂r?**
- Enkel, tolkbar modell
- Linj√§ra beslutsregler
- Naturliga sannolikhetsutskrifter
- Fungerar som "minimum viable model"

**Hyperparametrar som optimeras**:
- `C`: Regulariseringsstyrka (s√∂ks i intervallet 0.001-10.0)
- `penalty='l2'`: L2-regularisering (Ridge)
- `solver='lbfgs'`: Effektiv optimizer f√∂r logistisk regression
- `max_iter=5000`: Tillr√§ckligt f√∂r konvergens
- `class_weight='balanced'`: Kompenserar f√∂r 54/46 klassbalans

### 2. Random Forest (Icke-linj√§r)
**Varf√∂r?**
- Kan f√•nga icke-linj√§ra samband
- Hanterar interaktioner mellan attribut
- Robust mot outliers
- Ger feature importance

**Hyperparametrar som optimeras**:
- `n_estimators`: Antal beslutstr√§d (s√∂ks i intervallet 150-500)
- `max_depth`: Maximalt tr√§djup (s√∂ks i intervallet 3-10)
- `min_samples_leaf`: Minsta antal exempel per l√∂v (s√∂ks i intervallet 10-60)
- `min_samples_split`: Minsta antal exempel f√∂r split (s√∂ks i intervallet 10-80)
- `max_features`: Antal features per split ('sqrt' eller 'log2')
- `class_weight='balanced'`: Kompenserar f√∂r klassbalans

## Hyperparameteroptimering

**Metod: Bayesian Optimization**

I st√§llet f√∂r att manuellt v√§lja hyperparametrar anv√§nds Bayesian optimization via `BayesSearchCV` fr√•n scikit-optimize. Denna metod:
- S√∂ker intelligentare √§n grid search eller random search
- Utforskar parameterrymden strategiskt
- Kr√§ver f√§rre iterationer (30 ist√§llet f√∂r 100+ vid grid search)
- Optimerar f√∂r `neg_brier_score` (negativ Brier score)

**Korsvalidering: Time Series Split**

Eftersom fotbollsdata har temporal struktur anv√§nds `TimeSeriesSplit` med 3 splits:
```python
tscv = TimeSeriesSplit(n_splits=3)
```

Detta s√§kerst√§ller att:
- Tr√§ningsdata alltid kommer f√∂re valideringsdata
- Ingen framtida information l√§cker bak√•t i tiden
- Validering simulerar verklig prediktering

**Optimeringsprocess**:

1. **Random Forest**: BayesSearchCV s√∂ker genom 30 iterationer f√∂r att hitta b√§sta kombination av antal tr√§d, djup och split-kriterier
2. **Logistic Regression**: BayesSearchCV s√∂ker genom 30 iterationer f√∂r att hitta optimal regulariseringsstyrka

Varje iteration utv√§rderas med 3-fold time series cross-validation p√• tr√§ningsdatan.

## Train/Test Split

**Strategi: Temporal split**
```python
test_seasons = [2023, 2024]  # Senaste tv√• s√§songerna
```

**Varf√∂r inte random split?**
- Fotbollsdata har temporal struktur
- Vi vill simulera "prediktera framtida matcher"
- Random split skulle kunna l√§cka framtida info till tr√§ning

**Resultat**:
- Tr√§ningsdata: 3018 matcher (s√§song 2015-2022)
- Testdata: 758 matcher (s√§song 2023-2024)
- Features: 18 stycken

## Tr√§ning av slutgiltiga modeller

Efter Bayesian optimization tr√§nas de slutgiltiga modellerna med de b√§sta hyperparametrarna:

**Random Forest**:
```python
rf = RandomForestClassifier(
    **best_rf_params,
    random_state=42,
    n_jobs=-1,
    class_weight="balanced",
)
rf.fit(X_train, y_train)
```

**Logistic Regression**:
```python
lr = LogisticRegression(
    **best_lr_params,
    solver="lbfgs",
    max_iter=5000,
    class_weight="balanced",
)
lr.fit(X_train, y_train)
```

B√•da modellerna tr√§nas p√• hela tr√§ningsdataset (3018 matcher) och utv√§rderas p√• testdata (758 matcher).

## Utv√§rderingsm√•tt

### 1. ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
**Vad m√§ts**: Modellens f√∂rm√•ga att rangordna matcher efter sannolikhet
- **1.0** = Perfekt separation mellan klasser
- **0.5** = Lika bra som slumpen
- **< 0.5** = S√§mre √§n slumpen

**Tolkning**: Om ROC-AUC = 0.51 ‚Üí modellen √§r knappt b√§ttre √§n att gissa.

### 2. Accuracy (Noggrannhet)
**Vad m√§ts**: Andel korrekta prediktioner
```
Accuracy = (R√§tt f√∂rutsagda) / (Totalt antal)
```

**Tolkning**: Om Accuracy = 52% vid 54/46 klassbalans ‚Üí inte mycket b√§ttre √§n majoritetsklass.

### 3. Brier Score
**Vad m√§ts**: Genomsnittligt kvadrerat fel mellan sannolikhet och utfall
```
Brier = mean((p_pred - y_true)¬≤)
```
- **0.0** = Perfekta sannolikheter
- **0.25** = Neutralt (alltid gissa 0.5)

**Tolkning**: Om Brier = 0.25 ‚Üí modellens sannolikheter √§r inte informativa.

### 4. Precision@5
**Vad m√§ts**: Andel korrekta bland de 5 h√∂gst rankade matcherna
```
Precision@5 = (Antal √∂ver 2.5 bland top-5) / 5
```

**Tolkning**: Om Precision@5 = 0.6 ‚Üí 3 av 5 h√∂gst rankade matcher var faktiskt √∂ver 2.5 m√•l.

**Varf√∂r viktigt?** I ett beslutsst√∂dsystem vill vi kunna s√§ga: "H√§r √§r veckans 5 b√§sta matcher" och ha r√§tt i majoriteten av fallen.

## Resultat
```
                      ROC_AUC  Accuracy  Brier  Precision@5
Model                                                
RF (Bayes-opt)          0.506    0.504   0.25          1.0
LogReg (Bayes-opt)      0.507    0.517   0.25          1.0
```


## Tolkning av resultat

### ROC-AUC ~0.5
‚ùå **Modellerna kan inte rangordna matcher b√§ttre √§n slumpen**
- √Ñven med optimerade hyperparametrar saknar attributen tillr√§cklig prediktiv kraft
- Pre-match-statistik √§r inte tillr√§cklig

### Accuracy ~52%
‚ö†Ô∏è **Bara marginellt b√§ttre √§n att alltid gissa "√∂ver 2.5"**
- Baseline (alltid gissa 1): 54% accuracy
- V√•ra modeller: ~52% accuracy
- Bayesian optimization f√∂rb√§ttrade inte resultaten n√§mnv√§rt

### Brier Score ~0.25
‚ùå **Modellen √§r inte mer informativ √§n ett neutralt antagande p√• 50 %**
- 0.25 = vad man f√•r om man alltid s√§ger 50/50
- Modellen har ingen confidence i sina prediktioner
- Hyperparameteroptimering gav minimal f√∂rb√§ttring

### Precision@5
‚ö†Ô∏è **Lite b√§ttre √§n slumpen, men instabilt**
- Vid k=5 √§r detta v√§ldigt k√§nsligt f√∂r slump
- Inte tillf√∂rlitligt nog f√∂r beslutsst√∂d

## Slutsats

Trots anv√§ndning av Bayesian optimization f√∂r att hitta optimala hyperparametrar presterar b√•da modellerna **mycket svagt**. Detta indikerar att problemet inte ligger i hyperparameterval utan i **attributens begr√§nsade informationsinneh√•ll**.

**Huvudsakliga slutsatser**:
1. De valda attributen (rullande medelv√§rden av m√•l, po√§ng, tidiga m√•l) r√§cker inte
2. Hyperparameteroptimering kan inte kompensera f√∂r svaga features
3. Fotbollsmatcher har h√∂g inneboende slumpm√§ssighet

**M√∂jliga f√∂rklaringar**:
1. Saknar kritiska attribut (xG, taktik, skador, motivation)
2. Fotboll har h√∂g grad av slumpm√§ssighet som inte kan f√•ngas av historisk statistik
3. Matchspecifika faktorer (v√§der, domare, taktisk matchning) saknas
4. Pre-match-statistik f√•ngar inte "dagsform" eller psykologiska faktorer

## N√§sta steg (vidare forskning)

**F√∂rb√§ttring av features**:
- Inf√∂ra xG (Expected Goals) data
- L√§gga till spelartrupp och skadelistor
- Inkludera taktiska indikatorer (pressning, bollinnehav)
- L√§gg till head-to-head statistik

**Metodutveckling**:
- Testa ensemble-metoder (stacking, blending)
- Kalibrera sannolikheterna (Platt scaling, isotonic regression)
- Ut√∂ka dataset med fler ligor f√∂r mer tr√§ningsdata
- Testa deep learning-metoder (LSTM f√∂r temporal modellering)

**Alternativa approaches**:
- Poisson-baserade modeller f√∂r m√•lf√∂rdelning
- Bayesianska metoder f√∂r os√§kerhetskvantifiering
- Reinforcement learning f√∂r adaptiva prediktioner

In [37]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score
from sklearn.linear_model import LogisticRegression


def precision_at_k(y_true, y_prob, k=5):
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    k = min(k, len(y_true))
    topk_idx = np.argsort(-y_prob)[:k]
    return y_true[topk_idx].mean()

def evaluate_model(name, y_true, proba):
    pred = (proba >= 0.5).astype(int)
    return {
        "Model": name,
        "ROC_AUC": roc_auc_score(y_true, proba),
        "Accuracy": accuracy_score(y_true, pred),
        "Brier": brier_score_loss(y_true, proba),
        "Precision@5": precision_at_k(y_true, proba, k=5),
    }

# --- columns / target ---
y_col = "fulltime_over_2_5"
drop_cols = ["fixture_id", "date", "season", y_col]
X_cols = [c for c in df_pre_match.columns if c not in drop_cols]

# --- train/test split: last 2 seasons as test ---
test_seasons = sorted(df_pre_match["season"].unique())[-2:]

train_df = df_pre_match[~df_pre_match["season"].isin(test_seasons)].copy()
test_df  = df_pre_match[df_pre_match["season"].isin(test_seasons)].copy()

X_train, y_train = train_df[X_cols], train_df[y_col].astype(int)
X_test,  y_test  = test_df[X_cols],  test_df[y_col].astype(int)

print("Test seasons:", test_seasons)
print("Train rows:", len(train_df), " Test rows:", len(test_df))
print("Num features:", len(X_cols))


Test seasons: [np.int64(2023), np.int64(2024)]
Train rows: 3018  Test rows: 758
Num features: 18


### Hyperparameteroptimering med BayesSearchCV

Denna kod anv√§nds f√∂r att optimera hyperparametrar f√∂r tv√• klassificeringsmodeller: Random Forest och logistisk regression. Optimeringen genomf√∂rs med hj√§lp av BayesSearchCV, som anv√§nder bayesiansk optimering f√∂r att effektivt s√∂ka efter parametrar som ger b√§st modellprestanda.



F√∂r att hantera tidsberoende data anv√§nds TimeSeriesSplit med tre delningar, vilket s√§kerst√§ller att tr√§ningsdata alltid f√∂reg√•r valideringsdata i tiden och d√§rmed f√∂rhindrar informationsl√§ckage. F√∂r Random Forest optimeras bland annat antal tr√§d, tr√§dens djup och minsta antal observationer per nod, medan regulariseringsparametern optimeras f√∂r den logistiska regressionsmodellen.



Som utv√§rderingsm√•tt anv√§nds negativt Brier score, vilket inneb√§r att optimeringen prioriterar modeller som ger v√§lkalibrerade sannolikhetsprediktioner. Efter avslutad optimering sparas den b√§sta modellen f√∂r respektive algoritm f√∂r vidare utv√§rdering p√• testdata.



**K√§lla som inspirerade dennna del**:
https://www.kaggle.com/discussions/general/523342

In [38]:
!pip -q install scikit-optimize

[?25l   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/107.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m[91m‚ï∏[0m[90m‚îÅ‚îÅ[0m [32m102.4/107.8 kB[0m [31m5.9 MB/s[0m eta [36m0:00:01[0m[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m107.8/107.8 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [40]:
from skopt import BayesSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

tscv = TimeSeriesSplit(n_splits=3)

# ---------------- Random Forest (Bayesian search) ----------------
rf_model = RandomForestClassifier(
    random_state=42,
    n_jobs=-1,
    class_weight="balanced",
)

rf_search_space = {
    "n_estimators": (150, 500),
    "max_depth": (3, 10),
    "min_samples_leaf": (10, 60),
    "min_samples_split": (10, 80),
    "max_features": ["sqrt", "log2"],
}

rf_bayes = BayesSearchCV(
    estimator=rf_model,
    search_spaces=rf_search_space,
    n_iter=30,
    cv=tscv,
    scoring="neg_brier_score",
    n_jobs=1,
    random_state=42,
    verbose=2,
)
rf_bayes.fit(X_train, y_train)

print("Best RF params:", rf_bayes.best_params_)
print("Best RF CV (neg brier):", rf_bayes.best_score_)

best_rf = rf_bayes.best_estimator_


# ---------------- Logistic Regression (Bayesian search) ----------------
lr_model = LogisticRegression(
    solver="lbfgs",
    max_iter=5000,
    class_weight="balanced",
)

lr_search_space = {
    "C": (1e-3, 10.0, "log-uniform"),
    "penalty": ["l2"],
}

lr_bayes = BayesSearchCV(
    estimator=lr_model,
    search_spaces=lr_search_space,
    n_iter=30,
    cv=tscv,
    scoring="neg_brier_score",
    n_jobs=1,
    random_state=42,
    verbose=2,
)
lr_bayes.fit(X_train, y_train)

print("Best LR params:", lr_bayes.best_params_)
print("Best LR CV (neg brier):", lr_bayes.best_score_)

best_lr = lr_bayes.best_estimator_


Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] END max_depth=6, max_features=log2, min_samples_leaf=57, min_samples_split=32, n_estimators=385; total time=   1.5s
[CV] END max_depth=6, max_features=log2, min_samples_leaf=57, min_samples_split=32, n_estimators=385; total time=   1.4s
[CV] END max_depth=6, max_features=log2, min_samples_leaf=57, min_samples_split=32, n_estimators=385; total time=   2.4s
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] END max_depth=9, max_features=log2, min_samples_leaf=25, min_samples_split=77, n_estimators=452; total time=   2.0s
[CV] END max_depth=9, max_features=log2, min_samples_leaf=25, min_samples_split=77, n_estimators=452; total time=   2.0s
[CV] END max_depth=9, max_features=log2, min_samples_leaf=25, min_samples_split=77, n_estimators=452; total time=   2.2s
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] END max_depth=6, max_features=log2, min_samples_leaf=15, min_samples_split=40, n_estimators=

In [41]:
# ----- Train final models with BEST params -----

best_rf_params = rf_bayes.best_params_
best_lr_params = lr_bayes.best_params_

# Random Forest
rf = RandomForestClassifier(
    **best_rf_params,
    random_state=42,
    n_jobs=-1,
    class_weight="balanced",
)
rf.fit(X_train, y_train)

# Logistic Regression
lr = LogisticRegression(
    **best_lr_params,
    solver="lbfgs",
    max_iter=5000,
    class_weight="balanced",
)
lr.fit(X_train, y_train)

# ----- Predict on test -----
p_rf = rf.predict_proba(X_test)[:, 1]
p_lr = lr.predict_proba(X_test)[:, 1]

# ----- Results table -----
results_df = (
    pd.DataFrame([
        evaluate_model("RF (Bayes-opt)", y_test, p_rf),
        evaluate_model("LogReg (Bayes-opt)", y_test, p_lr),
    ])
    .set_index("Model")
    .round(3)
)

print("Models trained with Bayesian-optimized hyperparameters")
results_df


Models trained with Bayesian-optimized hyperparameters


Unnamed: 0_level_0,ROC_AUC,Accuracy,Brier,Precision@5
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
RF (Bayes-opt),0.506,0.504,0.25,1.0
LogReg (Bayes-opt),0.507,0.517,0.25,1.0
