# Preparing Datasets for Training

In this notebook, we prepare the dataset used to train our model. The dataset has three kinds of columns:
* one for the image links
* one for the image name (once saved)
* one for the captions (this will actually be several columns as we have several prompts we want to train our model with)

The different captions or prompts we will use are:
* mass_prompt: uses numerical values for the planet and star mass, relative to Earth and the Sun.
* ratio_prompt: uses a comparison between the size of the star and the planet to represent size.
* size_text_prompt: uses text comparisons for planet and star size.
* shorter_prompt: a version of the prompt reduced to simpler phrases.
* 75_tokens: an even shorter version of the prompts, cut to 75 tokens. Stable Diffusion's CLIP text encoder accepts at most 77 tokens including the start and end markers, so 75 is the longest caption that can be used for fine-tuning without altering the text encoder. (This might replace the shorter_prompt option.)
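
To check whether a caption fits the 75-token budget, the tokens can be counted before training. The sketch below is only an approximation: `count_tokens` and `fits_budget` are hypothetical helpers, the captions are made up, and the default tokenizer splits on whitespace, which undercounts compared with the CLIP tokenizer that Stable Diffusion uses. Passing the real tokenizer's tokenize function as `tokenize` would give the true count.

```python
from typing import Callable, List

# 77-token CLIP context minus the start and end markers
MAX_CAPTION_TOKENS = 75


def count_tokens(text: str, tokenize: Callable[[str], List[str]] = str.split) -> int:
    """Count tokens in text. The default whitespace split is a rough stand-in."""
    return len(tokenize(text))


def fits_budget(text: str,
                tokenize: Callable[[str], List[str]] = str.split,
                limit: int = MAX_CAPTION_TOKENS) -> bool:
    """Return True when the caption has at most `limit` tokens."""
    return count_tokens(text, tokenize) <= limit


# Hypothetical captions, for illustration only
short_caption = "a gas giant orbiting a red dwarf star"
long_caption = " ".join(["planet"] * 100)

print(fits_budget(short_caption))
print(fits_budget(long_caption))
```

Keeping the tokenizer as a parameter means the same check works with the approximate split during exploration and with the real tokenizer before training.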

The first thing we need to do is resize all of our images to 512 x 512 and save them to a folder called "data_huggingface". The images are currently stored as web links, so we need to download each one, resize it, and save it.
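
Resizing straight to 512 x 512 stretches any image that is not already square. If that distortion matters, one option (not what this notebook does) is to center-crop to a square before resizing. The sketch below only computes the crop box; `center_crop_box` is a hypothetical helper, and its result is in the (left, upper, right, lower) order that PIL's `Image.crop` expects.

```python
def center_crop_box(width: int, height: int) -> tuple:
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


# A landscape image loses equal strips from its left and right edges
print(center_crop_box(800, 600))
```

With a PIL image `img`, this could be used as `img.crop(center_crop_box(*img.size)).resize((512, 512))`.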

In [None]:
import pandas as pd
import os
from PIL import Image
import requests
from io import BytesIO

In [None]:
#read in the dataset we'll be working with
dataset = pd.read_csv("training_data_prompts.csv")

In [None]:
dataset.head()

We only need the image link and the prompts, so we are going to isolate these from the whole dataset. 

In [None]:
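# Take a copy so that adding the image_path column later does not write to a view of `dataset`
training_data = dataset[['image_link', 'mass_prompt', 'ratio_prompt', 'size_text_prompt', 'shorter_prompt', '75_tokens']].copy()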

In [None]:
training_data.head()

In [None]:
# Create the folder the resized images will be saved to
data_folder = 'data_huggingface'
os.makedirs(data_folder, exist_ok=True)

In [None]:
for index, data in training_data.iterrows():
    image_url = data['image_link']

    # Download the image and stop on HTTP errors rather than handing an error page to PIL
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()

    # Open the image with PIL and convert to RGB so images with transparency can be saved as JPEG
    img = Image.open(BytesIO(response.content)).convert('RGB')

    # Resize the image to 512x512
    img_resized = img.resize((512, 512))

    # Save the resized image to the data folder
    image_path = os.path.join(data_folder, f'image_{index + 1}.jpg')
    img_resized.save(image_path)

    # Update the dataset with the image path
    training_data.at[index, 'image_path'] = image_path

# Save the updated DataFrame with image paths
training_data.to_csv('updated_training_data_prompts.csv', index=False)

Another training resource uses a JSON metadata file, so we also write our captions in the format it expects.

In [None]:
import json

In [None]:
updated_training_data = pd.read_csv('updated_training_data_prompts.csv')

In [None]:
metadata_dict = {}

for index, data in updated_training_data.iterrows():
    # Use the file name without folder or extension (e.g. image_1) as the key
    image_name = os.path.splitext(os.path.basename(data['image_path']))[0]
    metadata = {"tags": "solo, no humans, space, starry night",
                "caption": data["75_tokens"]}

    metadata_dict[image_name] = metadata

with open("metadata.json", "w") as json_file:
    json.dump(metadata_dict, json_file)

In [None]:
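# The imagefolder loader looks for metadata.jsonl inside the image folder,
# so write it next to the images
with open(os.path.join("data_huggingface", "metadata.jsonl"), "w") as json_file:
    for index, data in updated_training_data.iterrows():
        # file_name is the image file name relative to the image folder
        file_name = os.path.basename(data['image_path'])
        metadata = {"file_name": file_name, "text": data["75_tokens"]}
        # Write one JSON object per line
        json.dump(metadata, json_file)
        json_file.write("\n")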
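
Each line of a JSON Lines file is one standalone JSON object, here with a `file_name` key and a `text` key. As a minimal sketch of that layout, the block below writes two made-up records to a temporary folder and reads them back; the captions and the temporary path are illustrative assumptions, not the notebook's data.

```python
import json
import os
import tempfile

# Hypothetical sample records in the file_name/text layout
records = [
    {"file_name": "image_1.jpg", "text": "a rocky planet orbiting a yellow star"},
    {"file_name": "image_2.jpg", "text": "a gas giant orbiting a red dwarf"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metadata.jsonl")

    # Write one JSON object per line
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back, parsing each non-empty line on its own
    with open(path) as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```

Reading the real file back the same way is a quick check that every line parses and carries both keys.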

In [None]:
# Build the same file_name/text table as a CSV
metadata = pd.DataFrame({
    "file_name": updated_training_data['image_path'].map(os.path.basename),
    "text": updated_training_data["75_tokens"],
})

metadata.to_csv('metadata.csv', index=False)

In [None]:
# Write one caption file per image, named after the image (image_1.txt, image_2.txt, ...)
for index, data in updated_training_data.iterrows():
    image_name = os.path.splitext(os.path.basename(data['image_path']))[0]
    text = data['75_tokens']

    txt_file_path = f'{image_name}.txt'

    with open(txt_file_path, 'w') as txt_file:
        txt_file.write(text)
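
Before loading the image folder, it can help to confirm that every file name listed in the metadata has a matching image on disk. `missing_images` is a hypothetical helper, and the demo uses a temporary folder with an empty placeholder file rather than the real data.

```python
import os
import tempfile


def missing_images(data_dir, file_names):
    """Return the file names that have no matching file inside data_dir."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(data_dir, name))]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create an empty placeholder for image_1.jpg; image_2.jpg is left out on purpose
    open(os.path.join(tmp_dir, "image_1.jpg"), "wb").close()
    result = missing_images(tmp_dir, ["image_1.jpg", "image_2.jpg"])

print(result)
```

An empty result means every metadata row points at a file that exists.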

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('imagefolder', data_dir='data_huggingface', drop_labels=False, split="train")

In [None]:
dataset

In [None]:
dataset[0]['text']

In [None]:
dataset.push_to_hub("mbeaty2/exoplanet-data")