## Problem Statement and Background

The main domain that I will be exploring is League of Legends Esports, a competitive online game played by teams worldwide for money and prizes. Over the past decade, League of Legends has exploded in popularity, with its international tournament, Worlds, [drawing nearly 7 million concurrent viewers in 2024](https://escharts.com/news/2024-league-legends-worlds-record). League of Legends is distinct from other cognitive sports like poker or chess in that it features a “drafting phase” prior to the start of each game, in which players pick the champions they will play. Deciding which champions to pick is an incredibly complex process, as there are over 170 to choose from, and their interactions can make each one more or less favorable as the draft progresses. A better understanding of the draft is relevant not only to the teams themselves, but also to sports bettors and analysts: [in 2023, esports betting was valued at $2.5 billion](https://www.skyquestt.com/report/esports-betting-market), and [the prize pool for Worlds 2025 alone will be $5 million](https://esportsinsider.com/2025/03/league-of-legends-lol-world-championship-2025-prize-pool).

For this reason, the main question that I am interested in is: “Can machine learning models accurately predict the outcomes of professional League of Legends games using historical game data and pre-game statistics?”

Game outcome in this instance could be measured in a number of different ways. My starting point will be whether the game is won or lost, which seems to be the most straightforward outcome to predict. Other potential outputs to explore include a composite index of team or individual player performance, built from objectives taken, damage dealt, gold accumulated, and kills/deaths/assists.

The main pre-game factors explored will be the teams, the draft, and the date, although these can be expanded into a number of derived statistics, such as the teams' head-to-head win rate or the win rates of the individual champions in the draft.
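As an illustration of one such derived statistic, the sketch below computes a head-to-head win rate using only games played before the one being predicted, so the feature never sees the outcome it is meant to help predict. The record layout (`date`, `blue`, `red`, `winner`) and the function name are hypothetical, chosen for the example rather than taken from the dataset.

```python
from collections import defaultdict


def head_to_head_rates(games):
    """Return, in chronological order, the blue team's win rate against the
    red team over earlier games only (None if the teams have not met yet)."""
    wins = defaultdict(int)  # (winner, loser) -> number of earlier wins
    rates = []
    for game in sorted(games, key=lambda g: g["date"]):
        blue, red = game["blue"], game["red"]
        blue_wins = wins[(blue, red)]
        red_wins = wins[(red, blue)]
        total = blue_wins + red_wins
        rates.append(blue_wins / total if total else None)

        # Record this game's result only after the feature has been computed.
        winner, loser = (blue, red) if game["winner"] == blue else (red, blue)
        wins[(winner, loser)] += 1
    return rates


# Hypothetical toy data: teams "A" and "B" meet three times.
toy_games = [
    {"date": "2025-01-01", "blue": "A", "red": "B", "winner": "A"},
    {"date": "2025-01-08", "blue": "B", "red": "A", "winner": "A"},
    {"date": "2025-01-15", "blue": "A", "red": "B", "winner": "B"},
]
print(head_to_head_rates(toy_games))  # [None, 0.0, 1.0]
```

The same pattern (compute the feature, then update the running tally) would apply to per-champion win rates.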

## Dataset

The dataset comes from the Leaguepedia API, a third-party API maintained to support Leaguepedia, a wiki for League of Legends Esports. Although it is not an official source, the API is well maintained and has very few missing attributes, especially for the professional tiers of competition from which I will be pulling most of my data. The main challenge in working with this dataset is not a lack of data but accessing it: the data is stored in a highly normalized format across nearly 100 different tables, so several joins are needed to get the desired information. The full list is available [here](https://lol.fandom.com/wiki/Special:CargoTables), but the most relevant tables are [ScoreboardGames](https://lol.fandom.com/wiki/Special:CargoTables/ScoreboardGames), which contains stats for each game, [MatchScheduleGame](https://lol.fandom.com/wiki/Special:CargoTables/MatchScheduleGame), which has less granular per-game data, [MatchSchedule](https://lol.fandom.com/wiki/Special:CargoTables/MatchSchedule), which represents a match (a set of 1-5 games), and [Tournaments](https://lol.fandom.com/wiki/Special:CargoTables/Tournaments), which represents the context in which matches are played. In addition, the SQL client available for accessing it doesn't allow subqueries, returns at most 500 rows per request, and is rate limited. Because of this, querying data can be a slow process, so results are cached when possible.
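To make these constraints concrete, here is a minimal sketch of how a 500-row page limit, a rate limit, and on-disk caching could be handled together. It is not the caching client used in this notebook: `run_query` is a hypothetical callable standing in for a single API request, and the cache layout (one JSON file per query, keyed by a hash of the parameters) is an assumption made for the example.

```python
import hashlib
import json
import time
from pathlib import Path


def cached_paged_query(run_query, params, cache_dir, page_size=500, pause_s=1.0):
    """Fetch every row for `params`, paging by offset and caching the result.

    `run_query(params, limit=..., offset=...)` is a hypothetical callable that
    performs one request and returns a list of JSON-serializable row dicts.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Identical parameters always map to the same cache file.
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    rows, offset = [], 0
    while True:
        page = run_query(params, limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:  # a short page means there is nothing left
            break
        offset += page_size
        time.sleep(pause_s)  # pause between requests to respect the rate limit

    cache_file.write_text(json.dumps(rows))
    return rows
```

With this shape, a repeated query is served from disk and never touches the API, which matters when the same joins are re-run many times during exploration.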

## Data Joining/Cleaning

The first step in data processing was fetching historical match data. Since League of Legends is such a dynamic game, only data from the last few years is relevant for predicting future match outcomes. In addition, to limit the influence of quality of play on prediction quality, I chose to limit the data to professional games in the major regions and international tournaments. The match ids that meet these criteria can be fetched as follows:


In [3]:
from mwrogue.esports_client import EsportsClient
import datetime as dt

from cached_cargo_client import CachedCargoClient

site = EsportsClient("lol")
cached_client = CachedCargoClient(site.cargo_client)


def fetch_match_ids(leagues, from_date):
    tournament_filter = ", ".join(f"'{t}'" for t in leagues)

    matches = cached_client.query(
        tables="Tournaments=T, MatchSchedule=MS",
        join_on="T.OverviewPage = MS.OverviewPage",
        fields="MS.MatchId",
        where=(
            f"((T.League IN ({tournament_filter})) OR T.Region = 'International')"
            f" AND T.DateStart >= '{from_date}'"
        ),
        order_by="T.DateStart DESC, MS.DateTime_UTC ASC",
    )

    return [match["MatchId"] for match in matches]


major_leagues = [
    "LoL Champions Korea",
    "League of Legends Championship of The Americas North",
    "Tencent LoL Pro League",
    "LoL EMEA Championship",
    "League of Legends Championship Series",
]
today = dt.date.today()
three_years_ago = today.replace(year=today.year - 3)
match_ids = fetch_match_ids(major_leagues, three_years_ago)
print(len(match_ids))
match_ids[:5]

3086


[None,
 '2025 Season World Championship_Play-In_1',
 '2025 Season World Championship_Round 1_1',
 '2025 Season World Championship_Round 1_2',
 '2025 Season World Championship_Round 1_3']

Since each match id represents a match, which can contain up to 5 games, the next step is to fetch the games for each match id. This is done in batches to reduce the load on the API and avoid rate limiting. Some games are placeholders with "None" for all values, so those rows are discarded. Games where technical issues or extraordinary circumstances forced a forfeit (FF) or a replay (Chronobreak/Remake) are also excluded. This leaves around 6200 games in total. For each, the relevant pre-game statistics are pulled:

- Blue: The name of the team on the Blue side of the map. This team gets first pick in draft, but yields later picks, meaning they have less counter-pick power.
- Red: The name of the team on the Red side of the map. This team forgoes first pick in favor of more picks later.
- Selection: The team that has side selection going into the game. This is usually given to the team that lost the last game, and may be indicative of momentum.
- WinTeam: Which team won the game.
- Team1Score/Team2Score: How many games Team1/2 has won in the series. Another indicator of momentum.
- Team1Bans/Team2Bans: Which champions Team1/2 have banned going into the game before draft.
- Team1Picks/Team2Picks: Which champions Team1/2 have picked in the draft.
- Patch: Which version of the game is being played. "Balance" patches are released that tune champion power, so patch is potentially relevant for accurate predictions.
- DateTime_UTC: The time the game is played. Can account for factors like tournament and momentum.

In [4]:
def fetch_games(match_ids, batch_size=500):
    all_games = []
    for i in range(0, len(match_ids), batch_size):
        batch = match_ids[i : i + batch_size]
        batch_str = ",".join(f'"{mid}"' for mid in batch)

        response = cached_client.query(
            tables="MatchScheduleGame=MSG, ScoreboardGames=SG",
            fields="MSG.Blue, MSG.Red, MSG.Selection, SG.WinTeam, SG.Team1Score, SG.Team2Score, SG.Team1Bans, SG.Team2Bans, SG.Team1Picks, SG.Team2Picks, SG.Patch, SG.DateTime_UTC",
            join_on="MSG.GameId = SG.GameId",
            where=f"MSG.MatchId IN ({batch_str}) AND SG.WinTeam IS NOT NULL AND MSG.IsRemake IS NULL AND MSG.FF IS NULL AND MSG.IsChronobreak IS NULL",
        )
        all_games.extend(response)

    return all_games


games = fetch_games(match_ids)
print(len(games))
games[:5]

6164


[{'Blue': '100 Thieves',
  'Red': 'Cloud9',
  'Selection': '100 Thieves',
  'WinTeam': '100 Thieves',
  'Team1Score': '2',
  'Team2Score': '1',
  'Team1Bans': 'Trundle,Gwen,Annie,Yorick,Camille',
  'Team2Bans': 'Pantheon,Azir,Yone,Rakan,Senna',
  'Team1Picks': 'Ambessa,Jarvan IV,Viktor,Aphelios,Braum',
  'Team2Picks': 'Gnar,Skarner,Taliyah,Lucian,Alistar',
  'Patch': '25.17',
  'DateTime UTC': '2025-09-05 21:53:00',
  'DateTime UTC__precision': '0'},
 {'Blue': '100 Thieves',
  'Red': 'Cloud9',
  'Selection': '100 Thieves',
  'WinTeam': '100 Thieves',
  'Team1Score': '3',
  'Team2Score': '2',
  'Team1Bans': 'Zeri,Maokai,Zyra,Ahri,Jhin',
  'Team2Bans': 'Azir,Annie,Yorick,Ziggs,Shen',
  'Team1Picks': 'Gwen,Pantheon,Yone,Smolder,Taric',
  'Team2Picks': 'Vladimir,Trundle,Twisted Fate,Senna,Leona',
  'Patch': '25.17',
  'DateTime UTC': '2025-09-05 23:36:00',
  'DateTime UTC__precision': '0'},
 {'Blue': '100 Thieves',
  'Red': 'FlyQuest',
  'Selection': '100 Thieves',
  'WinTeam': '100 Thieve
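One caveat when using `Team1Score`/`Team2Score` as pre-game features: in the sample rows above, the scores appear to be recorded after the game finishes (a `Team1Score` of 3 can only occur once a best-of-five has been decided). A minimal sketch of recovering the score as it stood before the game is shown below. The helper name is hypothetical, and it assumes, as the rest of this notebook does, that Team1 is the Blue side.

```python
def pregame_scores(game):
    """Return (team1_score, team2_score) as they stood before the game began.

    Assumes the stored scores already include this game's result and that
    Team1 is the Blue side.
    """
    team1_score = int(game["Team1Score"])
    team2_score = int(game["Team2Score"])
    if game["WinTeam"] == game["Blue"]:
        team1_score -= 1  # remove the win earned in this game
    else:
        team2_score -= 1
    return team1_score, team2_score


# The first two sample rows, reduced to the relevant fields.
sample_games = [
    {"Blue": "100 Thieves", "Red": "Cloud9", "WinTeam": "100 Thieves",
     "Team1Score": "2", "Team2Score": "1"},
    {"Blue": "100 Thieves", "Red": "Cloud9", "WinTeam": "100 Thieves",
     "Team1Score": "3", "Team2Score": "2"},
]
for game in sample_games:
    print(pregame_scores(game))
# (1, 1)
# (2, 2)
```

Leaving the post-game scores in the feature set would leak part of the label into the inputs, so a correction along these lines is worth checking against the real data.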

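Once the raw data has been fetched, the non-numeric data is encoded into a more ML-friendly format using pandas. Teams and patches become integer ids, and picks and bans become one 0/1 column per champion and side: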

In [8]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_and_normalize(games):
    df = pd.DataFrame(games)
    df["blue_win"] = (df["WinTeam"] == df["Blue"]).astype(int)

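    # Fit one encoder on both columns so a team gets the same id on either side
    team_encoder = LabelEncoder()
    team_encoder.fit(pd.concat([df["Blue"], df["Red"]]))
    df["blue_team_id"] = team_encoder.transform(df["Blue"])
    df["red_team_id"] = team_encoder.transform(df["Red"])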

    df["blue_picks"] = df["Team1Picks"].str.split(",")
    df["red_picks"] = df["Team2Picks"].str.split(",")
    df["blue_bans"] = df["Team1Bans"].str.split(",")
    df["red_bans"] = df["Team2Bans"].str.split(",")

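    all_champs = sorted(
        set().union(
            *df["blue_picks"], *df["red_picks"], *df["blue_bans"], *df["red_bans"]
        )
    )

    # Build every indicator column first and attach them in a single concat;
    # inserting hundreds of columns one by one fragments the DataFrame and
    # triggers a pandas PerformanceWarning.
    indicators = {}
    for champ in all_champs:
        indicators[f"pick_blue_{champ}"] = df["blue_picks"].apply(
            lambda picks: int(champ in picks)
        )
        indicators[f"pick_red_{champ}"] = df["red_picks"].apply(
            lambda picks: int(champ in picks)
        )
        indicators[f"ban_blue_{champ}"] = df["blue_bans"].apply(
            lambda bans: int(champ in bans)
        )
        indicators[f"ban_red_{champ}"] = df["red_bans"].apply(
            lambda bans: int(champ in bans)
        )
    df = pd.concat([df, pd.DataFrame(indicators)], axis=1)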

    patch_encoder = LabelEncoder()
    df["patch_id"] = patch_encoder.fit_transform(df["Patch"])

    df["selection_blue"] = (df["Selection"] == df["Blue"]).astype(int)

    df["date"] = pd.to_datetime(df["DateTime UTC"])
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month

    # Drop original text columns that are now encoded
    drop_cols = [
        "Blue",
        "Red",
        "WinTeam",
        "Team1Picks",
        "Team2Picks",
        "Team1Bans",
        "Team2Bans",
        "Patch",
        "Selection",
        "DateTime UTC",
        "DateTime UTC__precision",
        "blue_picks",
        "red_picks",
        "blue_bans",
        "red_bans",
        "date",
    ]
    df = df.drop(columns=drop_cols)

    return df


encode_and_normalize(games)


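| | Team1Score | Team2Score | blue_win | blue_team_id | red_team_id | pick_blue_Aatrox | pick_red_Aatrox | ban_blue_Aatrox | ban_red_Aatrox | pick_blue_Ahri | ... | ban_blue_Zoe | ban_red_Zoe | pick_blue_Zyra | pick_red_Zyra | ban_blue_Zyra | ban_red_Zyra | patch_id | selection_blue | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 44 | 1 | 2025 | 9 |
| 1 | 3 | 2 | 1 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 44 | 1 | 2025 | 9 |
| 2 | 1 | 1 | 1 | 0 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 44 | 1 | 2025 | 9 |
| 3 | 0 | 1 | 0 | 0 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 2025 | 8 |
| 4 | 0 | 2 | 0 | 0 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 2025 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6159 | 1 | 0 | 1 | 114 | 19 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2022 | 10 |
| 6160 | 2 | 0 | 1 | 114 | 42 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 1 | 2023 | 3 |
| 6161 | 1 | 0 | 1 | 114 | 46 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2022 | 10 |
| 6162 | 1 | 0 | 1 | 114 | 95 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2022 | 10 |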
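To make the pick and ban columns easier to read, the following dependency-free sketch applies the same kind of 0/1 indicator encoding to two of the sample drafts. The `multi_hot` helper is hypothetical and only mirrors the idea behind the encoding step, not its exact implementation.

```python
def multi_hot(games, column, prefix):
    """Return one dict of 0/1 indicator features per game for a
    comma-separated champion column."""
    split = [game[column].split(",") for game in games]
    champs = sorted({champ for picks in split for champ in picks})
    return [
        {f"{prefix}_{champ}": int(champ in picks) for champ in champs}
        for picks in split
    ]


# Blue-side picks from the first two sample games.
sample = [
    {"Team1Picks": "Ambessa,Jarvan IV,Viktor,Aphelios,Braum"},
    {"Team1Picks": "Gwen,Pantheon,Yone,Smolder,Taric"},
]
rows = multi_hot(sample, "Team1Picks", "pick_blue")
print(rows[0]["pick_blue_Ambessa"], rows[1]["pick_blue_Ambessa"])  # 1 0
print(len(rows[0]))  # 10
```

Each game ends up with one column per champion seen anywhere in the data, which is why the encoded frame is so wide: four columns (blue pick, red pick, blue ban, red ban) for every champion.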
