# Database Initialization Process

This notebook processes each data point by extracting its metadata, including:

- **ID**
- **Turn**
- **Input image name**
- **Output image name**
- **Instruction**

It then builds one structured entry per data point and saves the resulting list as a JSON file (`mongo_init.json`), which is later uploaded to the database.

Import the required libraries.

In [22]:
import json
import re
import os
import pymongo
from pymongo import MongoClient
from dotenv import load_dotenv

### Image Storage and Licensing

The images from the **MagicBrush** dataset are stored in a GitHub repository so that each image can be retrieved by URL from its sample ID and file name. This removes the need for local storage and makes it possible to host the demo.

The **MagicBrush** dataset is published under the **CC-BY-4.0** license, which permits this retrieval mechanism.


In [23]:
base_url = "https://raw.githubusercontent.com/piadonabauer/magicbrush-dev/main/images"

In [24]:
output_data = []

# given a file name, extract the turn of the image edit
def extract_turn(output_filename):
    match = re.search(r"output(\d+)", output_filename)
    return int(match.group(1)) if match else None

# read all samples within the validation split
with open("edit_sessions.json", "r") as file:
    edit_sessions = json.load(file)

# create a database entry for every sample
for id, sessions in edit_sessions.items():
    for session in sessions:
        # build the retrieval links from the id and the image file names
        input_link = f"{base_url}/{id}/{session['input']}"
        output_link = f"{base_url}/{id}/{session['output']}"

        turn = extract_turn(session["output"])
        if turn is None:
            print(f"No turn value found in {session['output']} - skip.")
            continue

        document = {
            "meta_information": {
                "id": int(id),
                "turn": int(turn),
                "input_img_link": input_link,
                "output_img_link": output_link,
                "instruction": session["instruction"],
            },
            "ratings": [],
        }
        output_data.append(document)

# save database structure as .json before uploading
output_json_path = "mongo_init.json"
with open(output_json_path, "w") as outfile:
    json.dump(output_data, outfile, indent=4)

print(f"Data saved at {output_json_path}")

Data saved at mongo_init.json
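
Before uploading, it can be useful to check the generated entries for accidental repeats. The following is a minimal sketch, assuming that each combination of `id` and `turn` should appear only once (the source data does not state this explicitly). The helper name `find_duplicate_keys` and the sample values are made up for illustration.

```python
from collections import Counter


def find_duplicate_keys(documents):
    """Return the (id, turn) pairs that occur more than once."""
    counts = Counter(
        (doc["meta_information"]["id"], doc["meta_information"]["turn"])
        for doc in documents
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hand-made sample in the same shape as the entries built above
# (only the fields needed for the check are included).
sample = [
    {"meta_information": {"id": 1, "turn": 1}, "ratings": []},
    {"meta_information": {"id": 1, "turn": 2}, "ratings": []},
    {"meta_information": {"id": 1, "turn": 2}, "ratings": []},
]

print(find_duplicate_keys(sample))  # [(1, 2)]
```

Running the same function on `output_data` returns an empty list when every pair is unique.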


### Uploading to MongoDB

After connecting to the database cluster, the saved structure is uploaded to **MongoDB**. This step initializes the database: every entry starts with an empty `ratings` list, which is where ratings are stored later.
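
To illustrate how the empty `ratings` list might be filled later, the sketch below builds the query and update documents for appending one rating. The shape of a rating (`{"score": 4}`) and the helper name `build_rating_update` are assumptions; the notebook itself does not define them. Only the standard MongoDB `$push` operator and dot notation for nested fields are used.

```python
def build_rating_update(sample_id, turn, rating):
    """Build the (query, update) pair that appends one rating to an entry."""
    # Dot notation addresses the fields nested inside "meta_information".
    query = {"meta_information.id": sample_id, "meta_information.turn": turn}
    # $push appends the rating to the existing "ratings" array.
    update = {"$push": {"ratings": rating}}
    return query, update


# Hypothetical rating payload, for illustration only.
query, update = build_rating_update(42, 1, {"score": 4})

# With a live connection, the pair would be applied like this:
# collection.update_one(query, update)
```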


In [29]:
# os.environ.pop('MONGO_PASSWORD', None)
load_dotenv()  # load environment variables from the local .env file (kept out of version control)

# uploading requires providing the database credentials
mongo_user = os.getenv("MONGO_USER")
mongo_password = os.getenv("MONGO_PASSWORD")
cluster_url = os.getenv("MONGO_CLUSTER_URL")

In [30]:
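from urllib.parse import quote_plus

# percent-encode the credentials so that special characters do not break the URI
connection_url = f"mongodb+srv://{quote_plus(mongo_user)}:{quote_plus(mongo_password)}@{cluster_url}"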
client = MongoClient(connection_url)

db = client["thesis"]
collection = db["labeling"]

# insert all data points into the database
with open(output_json_path, "r") as infile:
    documents = json.load(infile)
    collection.insert_many(documents)

print("Data added.")

Data added.
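
Because `insert_many` adds new documents on every run, executing the upload cell twice would duplicate all entries. One possible guard, shown here as a sketch rather than as part of the original workflow, is to upload only when the collection is still empty. The helper name `upload_if_empty` is hypothetical. It relies only on `count_documents` and `insert_many`, both of which PyMongo collections provide.

```python
def upload_if_empty(collection, documents):
    """Insert the documents only if the collection holds none yet.

    Returns the number of inserted documents (0 if the upload was skipped).
    """
    if collection.count_documents({}) > 0:
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

In the notebook this could be called as `upload_if_empty(collection, documents)` in place of the direct `insert_many` call.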
