<a href="https://colab.research.google.com/github/nick-kann/Xatu-AI/blob/main/BuildDataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import requests
import sqlite3
import json
import pandas as pd

# **Creating the Dataset**:

The dataset focuses on games in the Gen 5 OU format because it has a comparatively small pool of Pokémon, which helps reduce model dimensionality. Only the 5000 highest-rated games (Elo >= 1250) are included, since higher-rated players tend to choose their leading Pokémon deliberately. Lower-rated players often lead with the same Pokémon every game or pick at random, which adds noise and makes the pattern harder for the model to learn. The data is obtained by making HTTP GET requests to the Pokémon Showdown replay server.

In [4]:
# Base URL for the Pokemon Showdown replay search; each request returns one page of up to 51 Gen 5 OU games, sorted by rating
base_url = "https://replay.pokemonshowdown.com/search.json?format=gen5ou&sort=rating"

all_data = []

# Loop from page 1 to 100 since Pokemon Showdown does not show replay records after page 100
for page in range(1, 101):
    url = f"{base_url}&page={page}"
    response = requests.get(url)
    data = response.json()
    data = data[:-1]  # Remove the last entry because it matches the first entry of the next page
    all_data.extend(data)
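
The paging logic above can also be written so that it stops early and never drops a real record from a short final page. The following is a minimal sketch, assuming (as the comment in the loop describes) that a full page carries one extra entry that reappears on the next page. `collect_pages`, `fake_fetch`, and `fake_replays` are hypothetical names used only for illustration, and the fake fetcher stands in for the real server so the example runs offline.

```python
from typing import Callable, Dict, List


def collect_pages(fetch_page: Callable[[int], List[Dict]],
                  max_pages: int = 100,
                  page_size: int = 50) -> List[Dict]:
    """Gather replay records page by page, skipping overlap and duplicates."""
    seen_ids = set()
    records: List[Dict] = []
    for page in range(1, max_pages + 1):
        entries = fetch_page(page)
        # Keep at most page_size entries; a full page carries one extra entry
        # that reappears at the top of the next page.
        for entry in entries[:page_size]:
            if entry["id"] not in seen_ids:
                seen_ids.add(entry["id"])
                records.append(entry)
        # A page without the extra entry is the last one available.
        if len(entries) <= page_size:
            break
    return records


# Hypothetical stand-in for the server: 120 fake replays, 51 per full page.
fake_replays = [{"id": f"gen5ou-{n}"} for n in range(120)]


def fake_fetch(page: int) -> List[Dict]:
    start = (page - 1) * 50
    return fake_replays[start:start + 51]


collected = collect_pages(fake_fetch)
print(len(collected))
```

With the real server, the fetcher could be something like `lambda page: requests.get(f"{base_url}&page={page}").json()`, reusing `base_url` from the cell above.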

In [5]:
df = pd.DataFrame(all_data)
df

index,uploadtime,id,format,players,rating,private,password
0,1711751465,gen5ou-2091984244,[Gen 5] OU,"[dscrdgg/tYY9zcUuQk, kinetic koko]",1682,0,
1,1704819970,gen5ou-2030908389,[Gen 5] OU,"[sniping bwvcs, succsacturne]",1647,0,
2,1577404211,gen5ou-1036238036,[Gen 5] OU,"[Shoka, FiestaFord]",1634,0,
3,1588579407,gen5ou-1108380857,[Gen 5] OU,"[kobepayne, NOSLEEPTILLBKLYN]",1621,0,
4,1587316866,gen5ou-1099546083,[Gen 5] OU,"[Gh0st of Perdition, porrompompom]",1617,0,
...,...,...,...,...,...,...,...
4995,1704760952,gen5ou-2030403799,[Gen 5] OU,"[Rinamarokirurai!, TvAppler]",1252,0,
4996,1703465628,gen5ou-2019313776,[Gen 5] OU,"[eurosvgc, repinho2711]",1252,0,
4997,1702101587,gen5ou-2007904785,[Gen 5] OU,"[IamDoren, ezaylol]",1252,0,
4998,1404819852,gen5ou-139116208,[Gen 5] OU,"[Actombal, testsubjectN1994]",1252,0,


**With 5000 high-Elo replays collected, the next step is to fetch the full game data, including the battle log, for each replay.**

In [6]:
game_logs = []
# Fetch the full replay JSON (including the battle log) for every replay ID
for i, replay_id in enumerate(df['id'], start=1):
    url = f"https://replay.pokemonshowdown.com/{replay_id}.json"
    response = requests.get(url)
    data = response.json()
    game_logs.append(data)
    if i % 100 == 0:
        print(f"{i}/5000 games processed")


100/5000 games processed
200/5000 games processed
300/5000 games processed
400/5000 games processed
500/5000 games processed
600/5000 games processed
700/5000 games processed
800/5000 games processed
900/5000 games processed
1000/5000 games processed
1100/5000 games processed
1200/5000 games processed
1300/5000 games processed
1400/5000 games processed
1500/5000 games processed
1600/5000 games processed
1700/5000 games processed
1800/5000 games processed
1900/5000 games processed
2000/5000 games processed
2100/5000 games processed
2200/5000 games processed
2300/5000 games processed
2400/5000 games processed
2500/5000 games processed
2600/5000 games processed
2700/5000 games processed
2800/5000 games processed
2900/5000 games processed
3000/5000 games processed
3100/5000 games processed
3200/5000 games processed
3300/5000 games processed
3400/5000 games processed
3500/5000 games processed
3600/5000 games processed
3700/5000 games processed
3800/5000 games processed
3900/5000 games processed
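
Downloading thousands of replays takes a while, so it can be worth saving the results to disk before parsing them. The following is a minimal sketch, assuming the replay records are plain JSON-serializable dictionaries; `save_logs`, `load_logs`, and the sample record are hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_logs(logs: List[Dict], path: str) -> None:
    """Write the downloaded replay records to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(logs, fh)


def load_logs(path: str) -> List[Dict]:
    """Read replay records back from a JSON file written by save_logs."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Illustrative record only; real entries come from the replay endpoint.
sample_logs = [{"id": "example-1", "log": "|start\n|turn|1"}]
cache_path = os.path.join(tempfile.mkdtemp(), "game_logs.json")
save_logs(sample_logs, cache_path)
restored = load_logs(cache_path)
print(restored == sample_logs)
```

In the notebook, `save_logs(game_logs, ...)` could be called right after the download loop so that later cells can be rerun without contacting the server again.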

**Now that the games have been downloaded, a function is needed to extract each player's team and leading Pokémon from the raw battle log.**

In [15]:
import re

def extract_teams(battle_log: str):
    teams = {
        "p1": set(),
        "p2": set()
    }

    leading_pokemon = {
        "p1": None,
        "p2": None
    }

    # Pattern to find the full teams for both players
    poke_pattern = r'poke\|(p1|p2)\|([^|,]+)'
    poke_matches = re.findall(poke_pattern, battle_log)

    for player, pokemon in poke_matches:
        # Strip stray whitespace or newline characters captured with the name
        teams[player].add(pokemon.strip())

    # Pattern to find the leading Pokemon for each player
    switch_pattern = r'switch\|(p1a|p2a): [^|]+\|([^|,]+)'
    switch_matches = re.findall(switch_pattern, battle_log)

    # Keep track of the count to get only the first two leading Pokémon
    count = 0
    for player, pokemon in switch_matches:
        pokemon = pokemon.strip()
        if count >= 2:
            break
        if player == 'p1a' and leading_pokemon["p1"] is None:
            leading_pokemon["p1"] = pokemon
            count += 1
        elif player == 'p2a' and leading_pokemon["p2"] is None:
            leading_pokemon["p2"] = pokemon
            count += 1

    return teams, leading_pokemon
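
To see what the two patterns pick up, they can be run on a short log. The snippet below is a hand-written illustration in the style of a Showdown battle log, not data from a real replay; it assumes `|poke|` lines list the team at preview and `|switch|` lines record each Pokémon entering the field.

```python
import re

# Hand-written sample log for illustration only.
sample_log = (
    "|player|p1|Alice|1\n"
    "|poke|p1|Tyranitar, M|item\n"
    "|poke|p1|Landorus-Therian, M|item\n"
    "|poke|p2|Politoed, F|item\n"
    "|poke|p2|Ferrothorn, M|item\n"
    "|teampreview\n"
    "|start\n"
    "|switch|p1a: Tyranitar|Tyranitar, M|100/100\n"
    "|switch|p2a: Toed|Politoed, F|100/100\n"
    "|turn|1\n"
    "|switch|p1a: Landorus|Landorus-Therian, M|100/100\n"
)

# Same patterns as in extract_teams above.
poke_pattern = r'poke\|(p1|p2)\|([^|,]+)'
switch_pattern = r'switch\|(p1a|p2a): [^|]+\|([^|,]+)'

poke_matches = re.findall(poke_pattern, sample_log)
switch_matches = re.findall(switch_pattern, sample_log)

print(poke_matches)
print(switch_matches[:2])  # the first two switches are the leads
```

The species name is taken from the details field after the nickname, so a nicknamed Pokémon such as "Toed" still resolves to "Politoed". Later switches, like the third match here, are ignored by `extract_teams` because each player's lead is only recorded once.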

In [16]:
game_teams = [extract_teams(game['log']) for game in game_logs]

In [18]:
import csv

with open('/content/dataset.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["id", "p1_poke1", "p1_poke2", "p1_poke3", "p1_poke4",
                     "p1_poke5", "p1_poke6", "p2_poke1", "p2_poke2", "p2_poke3",
                     "p2_poke4", "p2_poke5", "p2_poke6", "p1_choice", "p2_choice"])
    # Number the games from 1; team members are written in set iteration order
    for game_id, (teams, choices) in enumerate(game_teams, start=1):
        row = [game_id]
        for player in ("p1", "p2"):
            row.extend(teams[player])
        for player in ("p1", "p2"):
            row.append(choices[player])
        writer.writerow(row)
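
Because each team is stored as a set, the order of the six team columns is arbitrary, and a team with fewer than six recorded Pokémon would shift the later columns to the left. One possible way to keep every row aligned with the header is to sort each team and pad it to six cells. This is a minimal sketch under that assumption; `team_columns` is a hypothetical helper, and the sample teams are made up for illustration.

```python
from typing import Iterable, List


def team_columns(team: Iterable[str], size: int = 6) -> List[str]:
    """Return exactly `size` cells: sorted names, padded with empty strings."""
    names = sorted(team)[:size]
    return names + [""] * (size - len(names))


# Made-up teams and lead choices, shaped like the output of extract_teams.
teams = {"p1": {"Tyranitar", "Landorus-Therian"}, "p2": {"Politoed"}}
choices = {"p1": "Tyranitar", "p2": "Politoed"}

row = [1]
for player in ("p1", "p2"):
    row.extend(team_columns(teams[player]))
row.extend(choices[player] for player in ("p1", "p2"))
print(len(row))
```

Every row built this way has 15 cells, matching the 15 header columns, regardless of how many Pokémon were parsed for each team.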

In [19]:
from google.colab import files

files.download('/content/dataset.csv')
