# Simulations Episode Scraper Match Downloader


**It is recommended to run this notebook directly on the Kaggle platform, because you can attach the Kaggle submission dataset there without re-downloading it every time it is updated.**

This notebook downloads episodes using Kaggle's GetEpisodeReplay API and the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.

**To run this notebook you WILL need to re-add the Meta Kaggle dataset with "+ Add data" at the top right of the notebook editor.**

Meta Kaggle is normally refreshed daily, but the refresh is occasionally skipped for a few days.

Why download replays?
- Train your ML/RL model
- Inspect the performance of your own and others' agents
- Add to your ever-growing JSON collection

Only one scraping strategy is implemented: for each top-scoring submission, download all missing matches, then move on to the next submission.

Other scraping strategies are possible but are not implemented here. Examples: download at most X matches per submission or per team per day, ignore certain teams, ignore matches where some scores are below X, or download only selected teams.
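
As a rough illustration of one such alternative (it is not part of this notebook's pipeline), the sketch below caps the number of episodes kept per submission. The function name `cap_episodes_per_submission` and its parameters are hypothetical. It assumes a mapping from submission id to episode ids like the `sub_to_episodes` dictionary built later in this notebook, and that larger episode ids are newer. Because team ids are not available, it skips by submission id instead of by team.

```python
from typing import Dict, Iterable, List


def cap_episodes_per_submission(
    sub_to_episodes: Dict[int, List[int]],
    max_per_submission: int,
    skip_submissions: Iterable[int] = (),
) -> Dict[int, List[int]]:
    """Keep at most `max_per_submission` of the newest episodes per submission."""
    skipped = set(skip_submissions)
    capped = {}
    for sub_id, episode_ids in sub_to_episodes.items():
        if sub_id in skipped:
            continue
        # Assumption: larger episode ids are newer, so descending order is newest first.
        newest_first = sorted(set(episode_ids), reverse=True)
        capped[sub_id] = newest_first[:max_per_submission]
    return capped


# Toy data: submission 303 is skipped, the others are cut to two episodes each.
example = {101: [5, 9, 7, 9], 202: [3, 4], 303: [8]}
limited = cap_episodes_per_submission(example, max_per_submission=2, skip_submissions=[303])
print(limited)  # {101: [9, 7], 202: [4, 3]}
```

The capped dictionary could then be used in place of `sub_to_episodes` in the download loop.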


Todo:
- Add team IDs once Meta Kaggle adds them. Edit: it has been a long time, and it doesn't look like they are being added.

In [1]:
import pandas as pd
import numpy as np
import os
import requests
import json
import datetime
import time
import glob
import collections


In [2]:
## Configure these to your needs. Choose COMP from one of:
# 'lux-ai-2021', 'hungry-geese', 'rock-paper-scissors', 'santa-2020', 'halite', 'google-football'
COMP = 'lux-ai-2021'
MAX_CALLS_PER_DAY = 3000 # Kaggle says don't do more than 3600 per day and 1 per second
LOWEST_SCORE_THRESH = 1700
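
The two limits mentioned in the comment above (a daily cap and one request per second) can be expressed as two small pure functions. This is only a sketch: `seconds_to_wait` and `within_daily_budget` are hypothetical helper names and are not used by the download loop in this notebook.

```python
def seconds_to_wait(calls_made: int, elapsed_seconds: float, min_interval: float = 1.0) -> float:
    """Return how long to pause so the average pace stays at or below
    one call per `min_interval` seconds."""
    earliest_allowed = calls_made * min_interval
    return max(0.0, earliest_allowed - elapsed_seconds)


def within_daily_budget(calls_made: int, max_calls_per_day: int) -> bool:
    """True while another call still fits into the daily budget."""
    return calls_made < max_calls_per_day


# Example: 5 calls in 3.5 seconds is ahead of schedule, so pause for 1.5 seconds.
pause = seconds_to_wait(calls_made=5, elapsed_seconds=3.5)
print(pause)  # 1.5
```

In a real loop the returned value would be passed to `time.sleep`, and the loop would stop as soon as `within_daily_budget` returns `False`.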

In [19]:
ROOT = "."
META = "meta-kaggle/"  # not referenced below; the CSVs are read from "meta_kaggle/"
MATCH_DIR = '.'
base_url = "https://www.kaggle.com/requests/EpisodeService/"
get_url = base_url + "GetEpisodeReplay"
BUFFER = 1
COMPETITIONS = {
    'lux-ai-2021': 30067,
    'hungry-geese': 25401,
    'rock-paper-scissors': 22838,
    'santa-2020': 24539,
    'halite': 18011,
    'google-football': 21723
}

In [18]:
# Load Episodes
print(os.getcwd())
episodes_df = pd.read_csv("meta_kaggle/Episodes.csv")

# Load EpisodeAgents
epagents_df = pd.read_csv("meta_kaggle/EpisodeAgents.csv")

print(f'Episodes.csv: {len(episodes_df)} rows before filtering.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows before filtering.')

episodes_df = episodes_df[episodes_df.CompetitionId == COMPETITIONS[COMP]] 
epagents_df = epagents_df[epagents_df.EpisodeId.isin(episodes_df.Id)]

print(f'Episodes.csv: {len(episodes_df)} rows after filtering for {COMP}.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows after filtering for {COMP}.')

/Users/Felix/Downloads/episode_download
Episodes.csv: 30176685 rows before filtering.
EpisodeAgents.csv: 75409944 rows before filtering.
Episodes.csv: 3520590 rows after filtering for lux-ai-2021.
EpisodeAgents.csv: 7041180 rows after filtering for lux-ai-2021.


In [20]:
# Prepare dataframes

episodes_df = episodes_df.set_index(['Id'])
episodes_df['CreateTime'] = pd.to_datetime(episodes_df['CreateTime'])
episodes_df['EndTime'] = pd.to_datetime(episodes_df['EndTime'])

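# Reassign instead of using inplace=True: epagents_df is a filtered slice,
# and in-place modification can trigger pandas' SettingWithCopyWarning.
epagents_df = epagents_df.fillna(0)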
epagents_df = epagents_df.sort_values(by=['Id'], ascending=False)

In [39]:
# Get top scoring submissions
max_df = (epagents_df.sort_values(by=['EpisodeId'], ascending=False).groupby('SubmissionId').head(1).drop_duplicates().reset_index(drop=True))
max_df = max_df[max_df.UpdatedScore>=LOWEST_SCORE_THRESH]
max_df = pd.merge(left=episodes_df, right=max_df, left_on='Id', right_on='EpisodeId')
sub_to_score_top = pd.Series(max_df.UpdatedScore.values,index=max_df.SubmissionId).to_dict()
print(f'{len(sub_to_score_top)} submissions with score over {LOWEST_SCORE_THRESH}')

11 submissions with score over 1700
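
The cell above keeps, for each submission, the row from its most recent episode and then filters on that row's `UpdatedScore`. The toy example below illustrates the same sort, group, and `head(1)` pattern on made-up data. Only the column names are taken from the notebook.

```python
import pandas as pd

# Made-up rows; only the column names match EpisodeAgents.csv.
toy = pd.DataFrame({
    "EpisodeId":    [1, 2, 3, 4, 5],
    "SubmissionId": [10, 10, 20, 20, 30],
    "UpdatedScore": [1500.0, 1750.0, 1800.0, 1650.0, 1900.0],
})

# Newest episode first, then the first row of each submission group.
latest = toy.sort_values("EpisodeId", ascending=False).groupby("SubmissionId").head(1)
top = latest[latest.UpdatedScore >= 1700]
sub_to_score = {int(s): float(v) for s, v in zip(top.SubmissionId, top.UpdatedScore)}
print(sub_to_score)  # {30: 1900.0, 10: 1750.0}
```

Submission 20 once reached 1800 but its latest score is 1650, so it is excluded. The threshold applies to the most recent score, not to the historical maximum.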


In [40]:
# Get episodes for these submissions
sub_to_episodes = collections.defaultdict(list)
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    eps = sorted(epagents_df[epagents_df['SubmissionId'].isin([key])]['EpisodeId'].values,reverse=True)
    sub_to_episodes[key] = eps
candidates = len(set([item for sublist in sub_to_episodes.values() for item in sublist]))
print(f'{candidates} episodes for these {len(sub_to_score_top)} submissions')

2033 episodes for these 11 submissions


In [41]:
# Module-level counter; a `global` statement is not needed outside a function.
num_api_calls_today = 0
all_files = []
for root, dirs, files in os.walk(os.getcwd(), topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                      if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([item for sublist in sub_to_episodes.values() for item in sublist],seen_episodes)
print(f'{len(remaining)} of these {candidates} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

2033 of these 2033 episodes not yet saved
Total of 0 games in existing library
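
The check for already-downloaded episodes relies on the file naming convention: `<episode id>.json` is a replay, while `<episode id>_info.json` is skipped because its stem is not purely numeric. A minimal variant of that scan is sketched below, using `os.path.splitext` and a set rather than the notebook's exact code. `find_saved_episode_ids` is a hypothetical helper name.

```python
import os
import tempfile


def find_saved_episode_ids(directory):
    """Collect episode ids from files named '<digits>.json' below `directory`."""
    ids = set()
    for _root, _dirs, files in os.walk(directory):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext == ".json" and stem.isdigit():
                ids.add(int(stem))
    return ids


# Demonstration in a throwaway directory with one replay, one info file, one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("30288751.json", "30288751_info.json", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    saved = find_saved_episode_ids(tmp)

wanted = [30288751, 30278439]
remaining_ids = sorted(set(wanted) - saved, reverse=True)
print(saved, remaining_ids)  # {30288751} [30278439]
```

A set makes the later membership test `epid in seen_episodes` constant time, which may matter once the local library grows large.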


In [42]:
def create_info_json(epid):
    
    # .item() on a nanosecond datetime64 returns integer nanoseconds since the epoch
    create_seconds = int((episodes_df[episodes_df.index == epid]['CreateTime'].values[0]).item()/1e9)
    end_seconds = int((episodes_df[episodes_df.index == epid]['EndTime'].values[0]).item()/1e9)

    agents = []
    for index, row in epagents_df[epagents_df['EpisodeId'] == epid].sort_values(by=['Index']).iterrows():
        agent = {
            "id": int(row["Id"]),
            "state": int(row["State"]),
            "submissionId": int(row['SubmissionId']),
            "reward": int(row['Reward']),
            "index": int(row['Index']),
            "initialScore": float(row['InitialScore']),
            "initialConfidence": float(row['InitialConfidence']),
            "updatedScore": float(row['UpdatedScore']),
            "updatedConfidence": float(row['UpdatedConfidence']),
            "teamId": int(99999)
        }
        agents.append(agent)

    info = {
        "id": int(epid),
        "competitionId": int(COMPETITIONS[COMP]),
        "createTime": {
            "seconds": int(create_seconds)
        },
        "endTime": {
            "seconds": int(end_seconds)
        },
        "agents": agents
    }

    return info
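
The timestamp conversion in `create_info_json` depends on the datetime columns having nanosecond resolution, because `.item()` on a nanosecond `numpy.datetime64` yields an integer count of nanoseconds. The sketch below shows that conversion on a single made-up timestamp next to an alternative that does not depend on the resolution. The explicit cast to `datetime64[ns]` is there only to make the assumption visible.

```python
import pandas as pd

# Made-up timestamp: one day and ten seconds after the Unix epoch.
series = pd.to_datetime(pd.Series(["1970-01-02 00:00:10"]))

# Route used in the notebook: integer nanoseconds divided by 1e9.
raw = series.astype("datetime64[ns]").values[0]
via_item = int(raw.item() / 1e9)

# Alternative: whole seconds between the timestamp and the epoch.
via_timedelta = int((series.iloc[0] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1))

print(via_item, via_timedelta)  # 86410 86410
```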

In [43]:
def saveEpisode(epid):
    # request
    response = requests.post(get_url, json = {"EpisodeId": int(epid)})

    # save replay
    with open(os.path.join(os.getcwd(), '{}.json'.format(epid)), 'w') as f:
        f.write(response.json()['result']['replay'])

    # save match info
    info = create_info_json(epid)
    with open(os.path.join(os.getcwd(), '{}_info.json'.format(epid)), 'w') as f:
        json.dump(info, f)
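
`saveEpisode` reads the replay from the `result.replay` field of the decoded response and writes it to disk unchanged. A defensive variant could validate that field before writing, as sketched below. `extract_replay` is a hypothetical helper. The assumption that the replay is itself a JSON-encoded string follows only from the fact that it is written to a `.json` file, and the sample payload is a placeholder rather than real API output.

```python
import json


def extract_replay(payload):
    """Return the replay string from a decoded response, or raise ValueError."""
    try:
        replay = payload["result"]["replay"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response has no result.replay field") from exc
    if not isinstance(replay, str):
        raise ValueError("result.replay is not a string")
    # Assumption: the replay is JSON text; json.loads raises a ValueError subclass if not.
    json.loads(replay)
    return replay


# Placeholder payloads, not real API output.
good = {"result": {"replay": '{"placeholder": true}'}}
bad = {"error": "something went wrong"}

print(extract_replay(good))
try:
    extract_replay(bad)
except ValueError as err:
    print("rejected:", err)
```

Calling such a check before opening the output file would avoid leaving empty or partial `.json` files behind when a request fails.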


In [44]:
r = BUFFER  # request slots used so far; pacing allows at most one per second

call_budget = min(3600, MAX_CALLS_PER_DAY)
start_time = datetime.datetime.now()
se = 0  # episodes saved in this run
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    if num_api_calls_today >= call_budget:
        break  # daily budget used up
    print('')
    remaining = sorted(np.setdiff1d(sub_to_episodes[key], seen_episodes), reverse=True)
    print(f'submission={key}, LB={value:.0f}, matches={len(set(sub_to_episodes[key]))}, still to save={len(remaining)}')

    for epid in remaining:
        if num_api_calls_today >= call_budget:
            break
        if epid in seen_episodes:
            continue
        saveEpisode(epid)
        r += 1
        try:
            # getsize raises OSError if the replay file was not written
            os.path.getsize(os.path.join(MATCH_DIR, '{}.json'.format(epid)))
            print(str(num_api_calls_today) + f': saved episode #{epid}')
            seen_episodes.append(epid)
            num_api_calls_today += 1
            se += 1
        except OSError:
            print('  file {}.json did not seem to save'.format(epid))
        # sleep until at least r seconds have passed since the start
        elapsed = (datetime.datetime.now() - start_time).total_seconds()
        if r > elapsed:
            time.sleep(r - elapsed)
print('')
print(f'Episodes saved: {se}')


submission=23297953, LB=1990, matches=264, still to save=264
0: saved episode #30288751
1: saved episode #30278439
2: saved episode #30267977
3: saved episode #30257885
4: saved episode #30247570
5: saved episode #30236818
6: saved episode #30226315
7: saved episode #30216213
8: saved episode #30206052
9: saved episode #30195759
10: saved episode #30185624
11: saved episode #30175455
12: saved episode #30165377
13: saved episode #30154988
14: saved episode #30144723
15: saved episode #30134457
16: saved episode #30124451
17: saved episode #30114473
18: saved episode #30104214
19: saved episode #30093943
20: saved episode #30083694
21: saved episode #30073322
22: saved episode #30062991
23: saved episode #30052755
24: saved episode #30042342
25: saved episode #30032434
26: saved episode #30022284
27: saved episode #30011715
28: saved episode #30001302
29: saved episode #29991187
30: saved episode #29981069
31: saved episode #29970731
32: saved episode #29960371
33: saved episode #29950

283: saved episode #30094508
284: saved episode #30084251
285: saved episode #30073873
286: saved episode #30063511
287: saved episode #30053307
288: saved episode #30042905
289: saved episode #30032983
290: saved episode #30022870
291: saved episode #30012394
292: saved episode #30001939
293: saved episode #29991742
294: saved episode #29981634
295: saved episode #29971313
296: saved episode #29960946
297: saved episode #29951066
298: saved episode #29941133
299: saved episode #29931131
300: saved episode #29921060
301: saved episode #29910994
302: saved episode #29901315
303: saved episode #29891627
304: saved episode #29881687
305: saved episode #29871801
306: saved episode #29861759
307: saved episode #29851658
308: saved episode #29841739
309: saved episode #29831931
310: saved episode #29822074
311: saved episode #29812512
312: saved episode #29802760
313: saved episode #29792682
314: saved episode #29782770
315: saved episode #29772538
316: saved episode #29762294
317: saved epi

564: saved episode #30073872
565: saved episode #30063509
566: saved episode #30053305
567: saved episode #30042904
568: saved episode #30042887
569: saved episode #30032968
570: saved episode #30022857
571: saved episode #30012382
572: saved episode #30001929
573: saved episode #29991735
574: saved episode #29981629
575: saved episode #29971309
576: saved episode #29960943
577: saved episode #29951064
578: saved episode #29941131
579: saved episode #29931128
580: saved episode #29921058
581: saved episode #29910993
582: saved episode #29910945
583: saved episode #29901266
584: saved episode #29891573
585: saved episode #29881614
586: saved episode #29871760
587: saved episode #29861716
588: saved episode #29851612
589: saved episode #29841686
590: saved episode #29831870
591: saved episode #29822001
592: saved episode #29812438
593: saved episode #29802674
594: saved episode #29792569
595: saved episode #29782707
596: saved episode #29772472
597: saved episode #29762228
598: saved epi

847: saved episode #27716253
848: saved episode #27709768
849: saved episode #27703479
850: saved episode #27697115
851: saved episode #27690626
852: saved episode #27677827
853: saved episode #27671106
854: saved episode #27664266
855: saved episode #27651429
856: saved episode #27645384
857: saved episode #27639137
858: saved episode #27632921
859: saved episode #27626813
860: saved episode #27620605
861: saved episode #27614112
862: saved episode #27614083
863: saved episode #27614082
864: saved episode #27607412
865: saved episode #27601422
866: saved episode #27595442
867: saved episode #27589658
868: saved episode #27589657
869: saved episode #27583631
870: saved episode #27583628
871: saved episode #27577327
872: saved episode #27571179
873: saved episode #27564985
874: saved episode #27559053
875: saved episode #27553457
876: saved episode #27553414
877: saved episode #27548553
878: saved episode #27547740
879: saved episode #27544321
880: saved episode #27541932
881: saved epi

1122: saved episode #29700585
1123: saved episode #29700339
1124: saved episode #29700091
1125: saved episode #29699844
1126: saved episode #29699597
1127: saved episode #29699349
1128: saved episode #29699105
1129: saved episode #29698856
1130: saved episode #29698613
1131: saved episode #29698364
1132: saved episode #29698120
1133: saved episode #29697871
1134: saved episode #29697620
1135: saved episode #29697375
1136: saved episode #29697122
1137: saved episode #29696878
1138: saved episode #29696632
1139: saved episode #29696385
1140: saved episode #29696137
1141: saved episode #29695888
1142: saved episode #29695641
1143: saved episode #29695393
1144: saved episode #29695146
1145: saved episode #29694897
1146: saved episode #29694643
1147: saved episode #29694397
1148: saved episode #29694151
1149: saved episode #29693903
1150: saved episode #29693657
1151: saved episode #29693410
1152: saved episode #29693162
1153: saved episode #29692913
1154: saved episode #29692665
1155: save

1393: saved episode #30289385
1394: saved episode #30279063
1395: saved episode #30268615
1396: saved episode #30258515
1397: saved episode #30248203
1398: saved episode #30237446
1399: saved episode #30216716
1400: saved episode #30206556
1401: saved episode #30196260
1402: saved episode #30186122
1403: saved episode #30175943
1404: saved episode #30165056
1405: saved episode #30155468
1406: saved episode #30134932
1407: saved episode #30124921
1408: saved episode #30119119
1409: saved episode #30108925
1410: saved episode #30099577
1411: saved episode #30089295
1412: saved episode #30078994
1413: saved episode #30068453
1414: saved episode #30063528
1415: saved episode #30053321
1416: saved episode #30042917
1417: saved episode #30032994
1418: saved episode #30022879
1419: saved episode #30022865
1420: saved episode #30012391
1421: saved episode #30001938
1422: saved episode #29991736
1423: saved episode #29981630
1424: saved episode #29971311
1425: saved episode #29960944
1426: save

1662: saved episode #30001941
1663: saved episode #29991744
1664: saved episode #29981637
1665: saved episode #29971315
1666: saved episode #29951067
1667: saved episode #29940573
1668: saved episode #29930583
1669: saved episode #29922970
1670: saved episode #29912793
1671: saved episode #29903060
1672: saved episode #29893412
1673: saved episode #29873539
1674: saved episode #29871868
1675: saved episode #29871309
1676: saved episode #29865699
1677: saved episode #29855463
1678: saved episode #29845539
1679: saved episode #29835734
1680: saved episode #29835733
1681: saved episode #29825851
1682: saved episode #29823925
1683: saved episode #29814517
1684: saved episode #29812522
1685: saved episode #29802772
1686: saved episode #29792696
1687: saved episode #29782783
1688: saved episode #29781805
1689: saved episode #29778273
1690: saved episode #29768045
1691: saved episode #29762379
1692: saved episode #29752533
1693: saved episode #29750686
1694: saved episode #29740818
1695: save

1931: saved episode #30124922
1932: saved episode #30115012
1933: saved episode #30104211
1934: saved episode #30093941
1935: saved episode #30073320
1936: saved episode #30053291
1937: saved episode #30042889
1938: saved episode #30032970
1939: saved episode #30022859
1940: saved episode #30016340
1941: saved episode #30005833
1942: saved episode #29995527
1943: saved episode #29985424
1944: saved episode #29981758
1945: saved episode #29961011
1946: saved episode #29960942
1947: saved episode #29960149
1948: saved episode #29950290
1949: saved episode #29940339
1950: saved episode #29930327
1951: saved episode #29920223
1952: saved episode #29910208
1953: saved episode #29909806
1954: saved episode #29900219
1955: saved episode #29890463
1956: saved episode #29881701
1957: saved episode #29872781
1958: saved episode #29871807
1959: saved episode #29861765
1960: saved episode #29851664
1961: saved episode #29841748
1962: saved episode #29831940
1963: saved episode #29825412
1964: save