# UTTTAI Dataset Processing and Upload
The following notebook processes the UTTTAI dataset, refactors and combines it, and uploads it to Hugging Face.

## Changes
Initially, the data was stored in individual .txt files, one per depth, in a proprietary format with an unconventional board indexing method (documented in the [utttai.md](utttai_conversion/utttai.md) file in this repository). We therefore decided to refactor the data: the indexing is converted to the standard bitboard format, and the per-depth files are combined into a single .parquet file with an extra column for the depth of each record. The `data_refactoring.py` script transforms the data from the proprietary format to the .jsonl format; this notebook converts those .jsonl files into a single .parquet file and uploads it.

In [10]:
# imports
import json
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

In [4]:
# Load the records from a .jsonl file, tagging each one with the depth
# encoded in the file name (the two characters before the '.jsonl' extension).
def load_jsonl(file_path):
    data = []
    # the depth is the same for every line, so parse it once per file
    depth = int(file_path[-8:-6])
    with open(file_path, 'r') as file:
        for line in file:
            try:
                json_object = json.loads(line)
                json_object['depth'] = depth
                data.append(json_object)
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON line: {line.strip()}")
    return data
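
The slice above relies on the depth being exactly the two characters before the `.jsonl` extension. As a minimal sketch of a more defensive alternative, the depth could be parsed with a regular expression instead. The helper name `depth_from_filename` and the example file names are hypothetical; the actual naming scheme of the refactored files is not shown here.

```python
import os
import re


def depth_from_filename(file_path: str) -> int:
    """Return the trailing integer in a file's base name (hypothetical helper)."""
    # drop the directory and the extension, e.g. 'data/depth07.jsonl' -> 'depth07'
    stem = os.path.splitext(os.path.basename(file_path))[0]
    match = re.search(r'(\d+)$', stem)
    if match is None:
        raise ValueError(f"No depth found in file name: {file_path}")
    return int(match.group(1))


# hypothetical file names, for illustration only
print(depth_from_filename('depth07.jsonl'))                      # 7
print(depth_from_filename(os.path.join('data', 'depth12.jsonl')))  # 12
```

Unlike a fixed slice, this raises a clear error for files that carry no depth in their name rather than failing inside `int()`.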

In [2]:
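# loads all .jsonl files from a directory
def load_directory(directory_path) -> list[dict]:
    data = []
    for file in sorted(os.listdir(directory_path)):
        # skip anything that is not a .jsonl file, such as the .parquet
        # output and dataset_infos.json that live in the same directory
        if not file.endswith('.jsonl'):
            continue
        data.extend(load_jsonl(os.path.join(directory_path, file)))
    return data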

In [5]:
# load the data from the directory, initialize the paths
current_directory = os.getcwd()
# pass the path components separately so the path also resolves outside Windows
path = os.path.join(current_directory, 'data', 'stage1-mcts-refactored')
data = load_directory(path)
file_path = os.path.join(path, 'stage1-mcts.parquet')

print('done')

done


In [None]:
# loads the data into a table and writes it to a parquet file
table = pa.Table.from_pylist(data)
pq.write_table(table, file_path)

In [None]:
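# uploads the parquet file and its dataset_infos.json metadata to the Hugging Face Hub
# (requires being logged in with a token that has write access to the repo)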
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj='data/stage1-mcts-refactored/stage1-mcts.parquet',
    path_in_repo='stage1-mcts.parquet',
    repo_id='markstanl/u3t',
    repo_type='dataset'
)

api.upload_file(
    path_or_fileobj='data/stage1-mcts-refactored/dataset_infos.json',
    path_in_repo='dataset_infos.json',
    repo_id='markstanl/u3t',
    repo_type='dataset'
)