JobManagerCrashedError when trying to generate train split viewer #2799

Open
bghira opened this issue May 13, 2024 · 2 comments

Comments

bghira commented May 13, 2024

When loading this dataset onto the Hub, which contains an image field holding raw image bytes, I'm receiving a JobManagerCrashedError.

It's not clear exactly why this happens, or what the best way to encode the images in the dataset is. I looked exhaustively for examples of how to do this, but there wasn't much other than the dataset card spec.

I added the features section to the dataset card on the theory that the viewer simply didn't know how to decode that column, and the large size threw it off. That didn't change anything, though the JobManagerCrashedError seemed to take longer to occur; maybe that's just an artifact of job scheduling on the backend.
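For reference, a minimal sketch of the features block I mean in the dataset card YAML (the dtypes here are my assumption about how to tell the viewer to decode the columns):

```yaml
dataset_info:
  features:
    - name: filename
      dtype: string
    - name: image
      dtype: image
```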


bghira commented May 13, 2024

the code I've used to assemble the dataset:

import os

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm

data = []
for root, _, files in os.walk(args.input_folder):
    for file in tqdm(files, desc="Processing images"):
        path = os.path.join(root, file)
        try:
            image = Image.open(path)
        except Exception:
            # skip files PIL can't open
            continue

        width, height = get_size(image)
        luminance = get_image_luminance(image)
        image_hash = get_image_hash(image)
        # Keep the smallest original compressed representation of the image
        with open(path, "rb") as f:
            image_data = np.frombuffer(f.read(), dtype=np.uint8)

        data.append((file, image_hash, width, height, luminance, image_data))

df = pd.DataFrame(data, columns=["filename", "image_hash", "width", "height", "luminance", "image"])
df.to_parquet(os.path.join(args.output_folder, "images.parquet"), index=False)

print("Done!")


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
