how to deduplicate huggingface datasets #21

StephennFernandes · 2022-09-10T14:18:16Z

Hey there, excellent work on this repo and the paper.

I wanted to know how I could use this to deduplicate a custom Hugging Face dataset that I have developed and cleaned.

It has been saved with custom_dataset.save_to_disk("dataset_path")

and can be loaded with custom_dataset = datasets.load_from_disk("dataset_path")

carlini · 2022-09-12T17:19:30Z

A while ago someone sent a PR to do this

#6

But I haven't tested it on the latest code -- maybe you could try that and see if it would work for you?

TristanThrush · 2022-09-19T23:52:03Z

I just had the same question as you @StephennFernandes, and I found this repo which can be installed with pip: https://github.com/ChenghaoMou/text-dedup. It looks like text-dedup has a more plug-and-play version of this repo? I followed the documentation here and used it on a huggingface dataset with embedder.embed_bash(my_hf_dataset["text"])

cakiki · 2022-09-22T09:23:52Z

@TristanThrush This only works on datasets that fit in memory though, right? I think my_hf_dataset["text"] loads the entire column into a list. I'm also trying to solve this for a larger-than-memory dataset, have you had any luck with this?
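
A minimal sketch of the row-by-row alternative, assuming the dataset has a string column named "text" and that iterating over it yields one dict per row (as a `datasets.Dataset` does). It only does exact-match deduplication by hashing, which is a much simpler technique than what either repo discussed here implements, and `iter_unique_rows` is a hypothetical helper, not part of either library. The set of digests still grows with the number of distinct texts, so this reduces memory use rather than bounding it.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def iter_unique_rows(
    rows: Iterable[Dict[str, str]], text_key: str = "text"
) -> Iterator[Dict[str, str]]:
    """Yield each row whose text has not been seen before (exact matches only)."""
    seen = set()  # holds one 20-byte SHA-1 digest per distinct text, not the text itself
    for row in rows:
        digest = hashlib.sha1(row[text_key].encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier row, skip it
        seen.add(digest)
        yield row


# Hypothetical usage with a dataset written by save_to_disk (assumes a "text" column):
#   import datasets
#   ds = datasets.load_from_disk("dataset_path")
#   for row in iter_unique_rows(ds):
#       ...  # write the row out, one at a time

# Small in-memory stand-in for a dataset, to show the behaviour.
sample = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
unique_texts = [row["text"] for row in iter_unique_rows(sample)]
```

Because the helper is a generator, rows are pulled one at a time and the full column is never turned into a Python list.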

TristanThrush · 2022-09-27T23:45:45Z

@cakiki Possibly. I have a machine with hundreds of GB of memory, so it hasn't been a problem for me. I think that both this repo and the alternative repo require a lot of memory?

cakiki · 2022-09-28T07:42:57Z

@TristanThrush I don't think this repo requires you to load everything into memory though. I'll try running Teven's PR and report back.

TristanThrush · 2022-09-28T22:50:20Z

@cakiki we might be able to resolve this issue on the text-dedup side: ChenghaoMou/text-dedup#5

carlini · 2024-02-24T03:57:21Z

This repository does require that the dataset itself fit in memory. Closing because I've now merged #6.

carlini closed this as completed Feb 24, 2024
