<a href="https://colab.research.google.com/github/nelslindahlx/KnowledgeReduce/blob/main/updated_json_sharding_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### JSON Sharding Notebook Code Summary

**1. Importing Libraries**
   - The notebook imports two standard-library modules: `json` for reading and writing JSON, and `os` for file and directory operations.

**2. Function Definitions**
   - `read_json(file_path)`: Reads a JSON file from the given path and returns the data, using `json.load()` to deserialize the file content into a Python object. If the file is missing or is not valid JSON, it prints an error message and returns `None`.
   - `shard_json(data, shard_size=100)`: Splits the JSON data into smaller shards by slicing, so `data` is expected to be a list (a top-level JSON array). It returns a list of shards, each holding up to `shard_size` consecutive elements; only the last shard can be shorter. If `data` is `None`, it returns an empty list.
   - `save_shards(shards, output_dir)`: Saves each shard as a separate JSON file. It creates the output directory if it does not exist and writes each shard to a file named `shard_<index>.json`, with the index starting at 1. If the list of shards is empty, it prints a message and writes nothing.
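
To make the slicing behavior concrete, here is a minimal sketch of how `shard_json` divides a list. The `records` sample data is made up for illustration; the function body mirrors the one defined in Step 2.

```python
def shard_json(data, shard_size=100):
    """Split a list into consecutive slices of at most shard_size items."""
    if data is None:
        return []
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]

# Hypothetical sample data: 250 small records
records = [{"id": n} for n in range(250)]

shards = shard_json(records, shard_size=100)
shard_lengths = [len(s) for s in shards]
print(shard_lengths)  # [100, 100, 50]
```

The last shard holds the remainder, so 250 elements with a shard size of 100 produce three shards rather than two.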

**3. Main Execution**
   - The main cell prompts for its inputs and then runs the steps in sequence:
       - `input_file`: Path to the large JSON file to shard, entered by the user at the prompt.
       - `output_directory`: Directory where the shard files are saved (`sharded_json_files` by default).
       - `shard_size`: Number of elements per shard, entered at the prompt. An invalid entry falls back to the default of 100.
       - The script reads the JSON data from the input file, splits it into shards, and saves the shards to the output directory.
       - On success, a message reports the number of shards created. If the input file could not be read, the process is aborted with an error message.

This notebook is designed to run in Google Colab, where each step lives in its own cell and is easy to modify and re-run. The user can upload a JSON file directly to the Colab environment and enter the uploaded file's path when the main cell prompts for it.
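
One way to get a file into the Colab environment is the upload dialog in the `google.colab` package. The sketch below is an assumption about how a user might do this, not part of the notebook itself. The helper name `upload_json_to_colab` is hypothetical, and the `google.colab` import only succeeds inside Colab.

```python
def upload_json_to_colab():
    """Open Colab's upload dialog and return the uploaded file names.

    Returns an empty list when not running inside Colab.
    """
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        print("google.colab is not available; skipping upload.")
        return []
    uploaded = files.upload()  # dict mapping file name -> file contents
    return list(uploaded.keys())
```

In Colab, calling `upload_json_to_colab()` in a cell should return the uploaded file names, and one of those names can then be entered at the file path prompt in Step 3.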

# Step 1: Importing Libraries

In [None]:
import json
import os

# Step 2: Function Definitions

In [None]:
def read_json(file_path):
    try:
        with open(file_path, 'r') as file:
            data = json.load(file)
        return data
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None
    except json.JSONDecodeError:
        print(f"Error: The file '{file_path}' is not a valid JSON.")
        return None

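def shard_json(data, shard_size=100):
    # Slicing only works on a top-level JSON array (a Python list)
    if not isinstance(data, list):
        print("Error: expected a top-level JSON array; nothing to shard.")
        return []
    # A zero or negative step would break range(), so fall back to the default
    if shard_size < 1:
        print("Invalid shard size. Using default size of 100.")
        shard_size = 100
    # Consecutive slices; the last shard may hold fewer than shard_size items
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]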

def save_shards(shards, output_dir):
    if not shards:
        print("No shards to save.")
        return
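    # Create the output directory if needed (no error if it already exists)
    os.makedirs(output_dir, exist_ok=True)

    for i, shard in enumerate(shards, start=1):
        # Shard files are numbered from 1: shard_1.json, shard_2.json, ...
        shard_file = os.path.join(output_dir, f"shard_{i}.json")
        with open(shard_file, 'w') as file:
            json.dump(shard, file, indent=4)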
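
A quick check that sharding lost nothing is to read the shard files back and compare the result with the original list. The sketch below assumes the `shard_<index>.json` naming used by `save_shards`. The `load_shards` helper is hypothetical, and the sample data and temporary directory are for illustration only.

```python
import json
import os
import tempfile

def load_shards(output_dir):
    """Read shard_<n>.json files in numeric order and concatenate them."""
    names = [n for n in os.listdir(output_dir)
             if n.startswith("shard_") and n.endswith(".json")]
    # Sort by the numeric index so shard_10 comes after shard_2
    names.sort(key=lambda n: int(n[len("shard_"):-len(".json")]))
    merged = []
    for name in names:
        with open(os.path.join(output_dir, name), "r") as f:
            merged.extend(json.load(f))
    return merged

original = list(range(25))  # hypothetical sample data
with tempfile.TemporaryDirectory() as tmp:
    # Write shards of two elements each, named like save_shards names them
    for i in range(0, len(original), 2):
        with open(os.path.join(tmp, f"shard_{i // 2 + 1}.json"), "w") as f:
            json.dump(original[i:i + 2], f, indent=4)
    restored = load_shards(tmp)

round_trip_ok = restored == original
print(round_trip_ok)  # True
```

Sorting by the numeric index matters once there are ten or more shards, because a plain alphabetical sort would place `shard_10.json` before `shard_2.json`.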

# Step 3: Main Execution

In [None]:
# User inputs for file path and shard size
input_file = input("Enter the path to your JSON file: ")  # User can input the file path
output_directory = 'sharded_json_files'  # Directory where shards will be saved
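try:
    shard_size = int(input("Enter shard size (number of elements per shard): "))
    # Zero or negative sizes cannot produce shards, so treat them as invalid
    if shard_size < 1:
        raise ValueError("shard size must be a positive integer")
except ValueError:
    print("Invalid shard size. Using default size of 100.")
    shard_size = 100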

# Read, shard, and save the JSON data
data = read_json(input_file)
if data is not None:
    shards = shard_json(data, shard_size)
    save_shards(shards, output_directory)
    if shards:
        print(f"JSON sharding complete. {len(shards)} shards created in '{output_directory}'.")
else:
    print("Sharding process aborted due to earlier errors.")
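
The `shard_json` call above slices `data`, which works when the file holds a top-level JSON array. If the file holds a top-level object instead, one possible variation is to shard its key/value pairs. This is a sketch under that assumption; `shard_dict` is a hypothetical helper and is not part of the notebook.

```python
from itertools import islice

def shard_dict(data, shard_size=100):
    """Split a dict into smaller dicts of at most shard_size items each."""
    if shard_size < 1:
        raise ValueError("shard_size must be a positive integer")
    items = iter(data.items())
    shards = []
    while True:
        # Take the next shard_size key/value pairs, preserving insertion order
        chunk = dict(islice(items, shard_size))
        if not chunk:
            break
        shards.append(chunk)
    return shards

sample = {f"key_{n}": n for n in range(5)}  # hypothetical sample data
dict_shards = shard_dict(sample, shard_size=2)
print([list(s) for s in dict_shards])
# [['key_0', 'key_1'], ['key_2', 'key_3'], ['key_4']]
```

Each resulting shard is itself a valid JSON object, so it can be written out with `json.dump` in the same way as a list shard.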