# Super Mario 64 Speedruns - Data Collection

This notebook documents the method used to collect Super Mario 64 speedrun data from [speedrun.com](https://www.speedrun.com/sm64) and to automatically update the corresponding [Kaggle dataset](https://www.kaggle.com/datasets/mcpenguin/super-mario-64-speedruns).

Note that this notebook is easily configurable to obtain speedrun data for other games as well.

## References

I used [this notebook](https://www.kaggle.com/code/nnjjpp/updating-a-dataset-with-a-notebook?scriptVersionId=134546596) as a reference on how to update a dataset with a notebook.

## Some of my other work

### Notebooks

- [Butterfly Image Classification](https://www.kaggle.com/code/mcpenguin/butterfly-classification-efficientnet-87)
- [Palmer Penguin EDA](https://www.kaggle.com/code/mcpenguin/palmer-archipelago-antarctica-penguin-eda)
- [Smoking and Drinking EDA + Classification](https://www.kaggle.com/code/mcpenguin/smoking-drinking-prediction-tfdf-71)
- [World Happiness Data Cleaning + EDA](https://www.kaggle.com/code/mcpenguin/world-happiness-data-cleaning-eda)
- [Precious Metals Stocks: EDA + Forecasting](https://www.kaggle.com/code/mcpenguin/precious-metals-stocks-eda-and-prediction)
- [Red Wine Quality EDA + Prediction](https://www.kaggle.com/code/mcpenguin/red-wine-quality-prediction)
- [Gaia Stellar Classification](https://www.kaggle.com/code/mcpenguin/gaia-stellar-classification-lightgbm-91-acc)

### Datasets

- [The Complete Rollercoasters Dataset](https://www.kaggle.com/datasets/mcpenguin/rollercoasters)
- [Malaysian COVID-19 Data](https://www.kaggle.com/datasets/mcpenguin/malaysia-covid19)

# Import Libraries

We will use `srcomapi`, a Python wrapper for the Speedrun.com API.

In [1]:
!pip install srcomapi

Collecting srcomapi


  Downloading srcomapi-0.3.3-py3-none-any.whl (6.7 kB)


Installing collected packages: srcomapi


Successfully installed srcomapi-0.3.3


In [2]:
from kaggle_secrets import UserSecretsClient

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import requests
import json

import srcomapi
import srcomapi.datatypes as srdatatypes

from tqdm.autonotebook import tqdm


# Process Kaggle API Key

To change/add user secrets, in the notebook editor go to `Add-ons -> Secrets`.

In [3]:
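# create the Kaggle config directory if it does not exist yet
os.makedirs('/root/.kaggle/', exist_ok=True)
kaggle_API_key = UserSecretsClient().get_secret("KAGGLE_API_KEY")

# write the credentials file that the Kaggle CLI reads;
# json.dump takes care of quoting and escaping the key
with open('/root/.kaggle/kaggle.json', 'w') as fid:
    json.dump({"username": "mcpenguin", "key": kaggle_API_key}, fid)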

!chmod 600 /root/.kaggle/kaggle.json

# Obtain Data

## Initialize Speedrun.com API Library

We first initialize the Speedrun.com API:

In [4]:
SPEEDRUN_API_LINK = "https://www.speedrun.com/api/v1"

api = srcomapi.SpeedrunCom()
# api.debug = 1

## Get Game Data

We can then get the data for `Super Mario 64`. The game name is the only game-specific input in this step, so the notebook should be adaptable to other games on Speedrun.com as well.

In [5]:
# search for game
game_name = "Super Mario 64"

game_search = api.search(srcomapi.datatypes.Game, {"name": game_name})
game = game_search[0]

For reference, let's see what the game ID is:

In [6]:
game.id

'o1y9wo6q'
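
Taking `game_search[0]` relies on the first search result being the intended game. If the name search returns several games with similar names, an exact-name filter is one way to make the choice explicit. The sketch below is a hypothetical helper, not part of `srcomapi`; it only assumes each result exposes a `.name` attribute, and it uses stand-in objects (with a made-up ID for the second game) instead of calling the API.

```python
from types import SimpleNamespace


def pick_exact_match(results, name):
    """Return the first result whose name equals `name` (case-insensitive), else None."""
    wanted = name.casefold()
    for result in results:
        if result.name.casefold() == wanted:
            return result
    return None


# stand-in objects mimicking search results; "hypothetical1" is a made-up ID
fake_results = [
    SimpleNamespace(id="hypothetical1", name="Super Mario 64 DS"),
    SimpleNamespace(id="o1y9wo6q", name="Super Mario 64"),
]

match = pick_exact_match(fake_results, "Super Mario 64")
print(match.id)
```

With real results, `pick_exact_match(game_search, game_name)` would replace `game_search[0]`, with a check for `None` before continuing.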

## Get Game-Specific "Variables"

To process the game data, we need the variables defined for Super Mario 64 (for example, platform and verification status). To do this, we will use the following `get_variables` function.

In [7]:
def get_variables(game):
    
    variables_response = api.get(f"games/{game.id}/variables")
    
    # variable_id_to_name maps variable ids to the names of the variables
    # and is of type {[variable_id]: [name]}
    variable_id_to_name = {}
    # variable_choice_id_to_name maps variable choice ids to the names of the variable choices
    # and is of type { [variable_id]: {[variable_choice_id]: [name]} }
    variable_choice_id_to_name = {}
    
    for variable_obj in variables_response:
        var_id = variable_obj["id"]
        variable_id_to_name[var_id] = variable_obj["name"]
        variable_choice_id_to_name[var_id] = {}
        
        values_response = variable_obj["values"]["values"]
        for value_id, value_obj in values_response.items():
            variable_choice_id_to_name[var_id][value_id] = value_obj["label"]
    
    return variable_id_to_name, variable_choice_id_to_name

variable_id_to_name_dict, variable_choice_id_to_name_dict = get_variables(game)

For reference, let's see what each of these dictionaries looks like:

In [8]:
variable_id_to_name_dict

{'e8m7em86': 'Platform',
 '2lgy07lp': 'Codes',
 'jlz62r78': 'Route',
 'kn04ewol': 'Verified'}

In [9]:
variable_choice_id_to_name_dict

{'e8m7em86': {'9qj7z0oq': 'N64', 'jq6540ol': 'VC', '5lmoxk01': 'EMU'},
 '2lgy07lp': {'5lmoovm1': 'No', 'jq655njl': 'Yes'},
 'jlz62r78': {'q75kovp1': 'No LBLJ', '1gny6w6l': 'LBLJ'},
 'kn04ewol': {'5q8e86rq': 'Yes', '4qyxop3l': 'No'}}
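
To illustrate how the two dictionaries work together, here is a minimal sketch that translates the `values` mapping of a run (variable ID to choice ID) into readable names. The dictionaries are copied from the outputs above; `decode_run_values` and `example_values` are illustrative names introduced here, not part of the notebook's pipeline.

```python
# copied from the outputs above
variable_id_to_name = {
    "e8m7em86": "Platform",
    "2lgy07lp": "Codes",
    "jlz62r78": "Route",
    "kn04ewol": "Verified",
}
variable_choice_id_to_name = {
    "e8m7em86": {"9qj7z0oq": "N64", "jq6540ol": "VC", "5lmoxk01": "EMU"},
    "2lgy07lp": {"5lmoovm1": "No", "jq655njl": "Yes"},
    "jlz62r78": {"q75kovp1": "No LBLJ", "1gny6w6l": "LBLJ"},
    "kn04ewol": {"5q8e86rq": "Yes", "4qyxop3l": "No"},
}


def decode_run_values(values):
    """Map {variable_id: choice_id} to {variable name: choice name}.

    Unknown IDs are passed through unchanged instead of raising.
    """
    decoded = {}
    for var_id, choice_id in values.items():
        var_name = variable_id_to_name.get(var_id, var_id)
        choice_name = variable_choice_id_to_name.get(var_id, {}).get(choice_id, choice_id)
        decoded[var_name] = choice_name
    return decoded


# a made-up run that only sets Platform and Verified
example_values = {"e8m7em86": "9qj7z0oq", "kn04ewol": "5q8e86rq"}
print(decode_run_values(example_values))  # {'Platform': 'N64', 'Verified': 'Yes'}
```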

## Get Raw Data

We now get the raw data from the API and store them in a dictionary:

In [10]:
# only consider star categories
categories = ["0 Star", "1 Star", "16 Star", "70 Star", "120 Star"]

# we need a limit, as the API might return a 504 error
# if there is too much data (> 1k rows)
# the srcomapi library also fails on this request for reasons unknown,
# so we just use the API directly instead :)
def get_category_data(game, category, limit=500, level=None):
    if level is None:
        link = f"{SPEEDRUN_API_LINK}/leaderboards/{game.id}/category/{category.id}?embed=variables,players&top={limit}"
    else:
        link = f"{SPEEDRUN_API_LINK}/leaderboards/{game.id}/level/{level.id}/{category.id}?embed=variables,players&top={limit}"
    response = requests.get(link)
    # fail with a clear HTTP error (e.g. 504) rather than a KeyError on "data"
    response.raise_for_status()
    return response.json()["data"]
    
# {[category name]: [list of runs]}
raw_runs = {}
pbar = tqdm([category for category in game.categories if category.name in categories])

for category in pbar:
    category_name = category.name
    pbar.set_postfix(category=category_name)
    
    if category.name not in raw_runs:
        raw_runs[category.name] = {}

    if category.type == 'per-level':
        for level in game.levels:
            # pass level by keyword: the third positional parameter is limit
            raw_runs[category.name][level.name] = get_category_data(game, category, level=level)
    else:
        raw_runs[category.name] = get_category_data(game, category)

  0%|          | 0/5 [00:00<?, ?it/s]
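
For clarity, the two leaderboard URL shapes used by `get_category_data` can be sketched as a small pure function that builds the link without sending a request. `build_leaderboard_url` and the category ID below are hypothetical names for illustration. Some categories later report slightly more than 500 rows despite `limit=500`; one plausible explanation, assuming the `top` parameter counts leaderboard places rather than rows, is that tied runs share a place.

```python
from urllib.parse import urlencode

SPEEDRUN_API_LINK = "https://www.speedrun.com/api/v1"


def build_leaderboard_url(game_id, category_id, limit=500, level_id=None):
    """Build the leaderboard URL for a full-game or per-level category."""
    if level_id is None:
        path = f"leaderboards/{game_id}/category/{category_id}"
    else:
        path = f"leaderboards/{game_id}/level/{level_id}/{category_id}"
    # keep the comma in "variables,players" unescaped, as in the notebook's links
    query = urlencode({"embed": "variables,players", "top": limit}, safe=",")
    return f"{SPEEDRUN_API_LINK}/{path}?{query}"


# "hypothetical_category" and "hypothetical_level" are placeholders, not real IDs
full_game_url = build_leaderboard_url("o1y9wo6q", "hypothetical_category")
per_level_url = build_leaderboard_url(
    "o1y9wo6q", "hypothetical_category", limit=100, level_id="hypothetical_level"
)
print(full_game_url)
print(per_level_url)
```

Keeping URL construction separate from the HTTP call makes it possible to check the link format without network access.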

## Helper Functions to Process Data

We now define some helper functions we will use to process the raw runs data.

In [11]:
# process run
def process_run(run, player_data):
    verified_key = "kn04ewol"
    platform_key = "e8m7em86"
    
    # get player data
    player_id = run["run"]["players"][0].get("id", None)
    if player_id is not None:
        raw_player_data = player_data[player_id]
        player_name = raw_player_data["names"]["international"]
        location = raw_player_data["location"]
        if location is not None:
            location = location["country"]["names"]["international"]
        player_country = location
    else:
        player_name = player_country = None
    
    result = {
        # run id
        "id": run["run"]["id"],
        # leaderboard place
        "place": run["place"],
        # link to speedrun
        "speedrun_link": run["run"]["weblink"],
        # submitted date
        "submitted_date": run["run"]["submitted"],
        # primary time (seconds)
        "primary_time_seconds": run["run"]["times"]["primary_t"],
        # real time (seconds)
        "real_time_seconds": run["run"]["times"]["realtime_t"],
        # player id
        "player_id": player_id,
        # player name
        "player_name": player_name,
        # player country
        "player_country": player_country,
        # platform
        "platform": variable_choice_id_to_name_dict[platform_key][run["run"]["values"][platform_key]],
        # verified
        "verified": variable_choice_id_to_name_dict[verified_key][run["run"]["values"][verified_key]],
    }
    return result

# raw_runs_data = {[category]: [leaderboard data with "runs" and "players" keys]}
def process_raw_runs_data(raw_runs_data):
    # cleaned_runs = {[category]: [runs]}
    cleaned_runs_data = {}
    
    for category, category_obj in raw_runs_data.items():
        cleaned_runs = []
        raw_runs = category_obj["runs"]
        
        player_data_list = category_obj["players"]["data"]
        player_data = {data.get("id"): data for data in player_data_list}
        
        print(f"Processing raw runs data for category {category}")
        pbar = tqdm(raw_runs)
        for raw_run in pbar:
            try:
                cleaned_run = process_run(raw_run, player_data)
                cleaned_runs.append(cleaned_run)
            except Exception as e:
                print("could not process run:")
                print(raw_run)
                print('error:')
                print(e)
        cleaned_runs_data[category] = cleaned_runs
    return cleaned_runs_data
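
`process_run` leaves `player_name` empty whenever the first player entry has no `id`. A possible refinement, sketched below, keeps the name in that case. It assumes that such entries are guest players shaped like `{"rel": "guest", "name": ...}`; that shape, the helper name `resolve_player`, and the guest name are assumptions for illustration, while the registered-player sample reuses values from this dataset.

```python
def resolve_player(player_ref, player_data):
    """Return (player_id, player_name, player_country) for one run["players"] entry."""
    player_id = player_ref.get("id")
    if player_id is None:
        # assumed guest shape: no "id", but a plain "name" field
        return None, player_ref.get("name"), None

    profile = player_data.get(player_id, {})
    name = profile.get("names", {}).get("international")
    location = profile.get("location")
    country = None
    if location is not None:
        country = location["country"]["names"]["international"]
    return player_id, name, country


# minimal stand-in for the embedded player data, keyed by player id
sample_player_data = {
    "jn32931x": {
        "id": "jn32931x",
        "names": {"international": "Suigi"},
        "location": {"country": {"names": {"international": "Canada"}}},
    }
}

registered = resolve_player({"rel": "user", "id": "jn32931x"}, sample_player_data)
guest = resolve_player({"rel": "guest", "name": "Hypothetical Guest"}, sample_player_data)
print(registered)  # ('jn32931x', 'Suigi', 'Canada')
print(guest)       # (None, 'Hypothetical Guest', None)
```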

## Process Raw Runs Data

We can now clean the raw runs data, so that we will be able to process it into a CSV.

In [12]:
cleaned_runs_data = process_raw_runs_data(raw_runs)

Processing raw runs data for category 120 Star


  0%|          | 0/501 [00:00<?, ?it/s]

Processing raw runs data for category 70 Star


  0%|          | 0/503 [00:00<?, ?it/s]

Processing raw runs data for category 16 Star


  0%|          | 0/503 [00:00<?, ?it/s]

Processing raw runs data for category 1 Star


  0%|          | 0/500 [00:00<?, ?it/s]

Processing raw runs data for category 0 Star


  0%|          | 0/351 [00:00<?, ?it/s]

# Convert Data into DataFrames

We can now convert our cleaned runs data into DataFrames for export and upload.

In [13]:
# {category: df}
dfs = {}

for category, cleaned_runs in cleaned_runs_data.items():
    dfs[category] = pd.DataFrame.from_records(cleaned_runs)

Let's see what these look like:

In [14]:
dfs["70 Star"].head()

| | id | place | speedrun_link | submitted_date | primary_time_seconds | real_time_seconds | player_id | player_name | player_country | platform | verified |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | yoxwn51y | 1 | http://www.speedrun.com/sm64/run/yoxwn51y | 2024-11-17T10:44:15Z | 2786 | 2786 | jn32931x | Suigi | Canada | N64 | Yes |
| 1 | zpo7ornm | 2 | http://www.speedrun.com/sm64/run/zpo7ornm | 2025-09-16T16:05:30Z | 2788 | 2788 | | | | N64 | No |
| 2 | zqqn9jrz | 3 | http://www.speedrun.com/sm64/run/zqqn9jrz | 2024-11-02T19:21:23Z | 2790 | 2790 | 1xy9p1vj | taihou | Japan | EMU | Yes |
| 3 | zq2kgl5z | 4 | http://www.speedrun.com/sm64/run/zq2kgl5z | 2024-11-28T16:24:47Z | 2795 | 2795 | kjprmwk8 | Weegee | United States | N64 | Yes |
| 4 | mew4438m | 5 | http://www.speedrun.com/sm64/run/mew4438m | 2024-04-02T02:08:17Z | 2806 | 2806 | x353dr7j | Finnii602 | Germany | VC | Yes |
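
The `primary_time_seconds` and `submitted_date` columns are stored as raw seconds and ISO 8601 strings. As a minimal sketch of how a consumer of the dataset might make them readable, the standard-library helpers below convert single values; `format_run_time` and `parse_submitted_date` are illustrative names, and the timestamp format is assumed from the rows shown above.

```python
from datetime import datetime, timedelta, timezone


def format_run_time(seconds):
    """Format a duration in seconds as h:mm:ss."""
    return str(timedelta(seconds=seconds))


def parse_submitted_date(value):
    """Parse a timestamp like '2024-11-17T10:44:15Z' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# values taken from the first 70 Star row
run_time = format_run_time(2786)
submitted = parse_submitted_date("2024-11-17T10:44:15Z")
print(run_time)          # 0:46:26
print(submitted.date())  # 2024-11-17
```

On a whole DataFrame, `pd.to_timedelta(df["primary_time_seconds"], unit="s")` and `pd.to_datetime(df["submitted_date"])` would be the column-wise equivalents.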


# Save DataFrames to Files

We can then save our dataframes to files.

In [15]:
for category, df in dfs.items():
    filename = f"/kaggle/working/data_{category}.csv"
    df.to_csv(filename)

# Re-Upload Files to Kaggle

## Define Metadata

In [16]:
metadata = {
    "id": "mcpenguin/super-mario-64-speedruns",
    "title": "New Update"
}

In [17]:
with open('/kaggle/working/dataset-metadata.json', 'w') as json_fid:
    json_fid.write(json.dumps(metadata))

## Push New Dataset Version

In [None]:
!kaggle datasets version -p /kaggle/working -m "Updated data"