
dedup-text-dataset

Dedup and postprocessing for text datasets gathered from https://github.com/users/huseinzol05/projects/1

All deduplicated and postprocessed datasets are uploaded to https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset

Server spec

  1. 24 cores.
  2. 220 GB RAM.

Deduplication can blow up memory usage: it can easily eat 30 GB of RAM when a dataset is larger than 10 GB, so beware.

Download dataset

  1. Most dataset files are straightforward to download, e.g.:
```bash
wget https://huggingface.co/datasets/mesolitica/crawl-amanz-my/resolve/main/parsed.jsonl -O hf-datasets/raw-datasets/amanz.jsonl
```
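The same file can also be fetched in Python. A minimal sketch using huggingface_hub; the repo_id and filename mirror the wget example above:

```python
# Sketch: download a raw dataset file from the Hugging Face Hub,
# equivalent to the wget command above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="mesolitica/crawl-amanz-my",
    filename="parsed.jsonl",
    repo_type="dataset",
    local_dir="hf-datasets/raw-datasets",
)
print(path)  # local path of the downloaded parsed.jsonl
```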

But sometimes we have to do some preprocessing first, for example:
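A minimal sketch of the kind of preprocessing that may be needed; the source format here (a `content` key per record) is hypothetical:

```python
# Hypothetical preprocessing: normalise records into {"text": ...}
# JSON lines before saving into hf-datasets/raw-datasets.
import json

with open("source.jsonl") as fin, \
        open("hf-datasets/raw-datasets/example.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        text = record.get("content", "").strip()  # "content" key is hypothetical
        if text:
            fout.write(json.dumps({"text": text}) + "\n")
```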

We save raw datasets at hf-datasets/raw-datasets.

Text dedup

  1. Copy remove-duplicate-text-dataset.ipynb to a new notebook, e.g. remove-duplicate-text-dataset-lowyat.ipynb.

This notebook uses text_dedup for deduplication, borrowed from https://github.com/ChenghaoMou/text-dedup

All deduplicated datasets are saved in hf-datasets/dedupe-datasets.
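For illustration only, here is a minimal exact-duplicate filter in plain Python; the notebook's actual method is text_dedup's near-duplicate detection, which is more involved. It assumes each JSONL record has a "text" field:

```python
# Illustrative exact-duplicate filter; the real notebook uses
# text_dedup (near-duplicate detection), not this sketch.
import hashlib
import json

seen = set()
with open("hf-datasets/raw-datasets/amanz.jsonl") as fin, \
        open("hf-datasets/dedupe-datasets/amanz.jsonl", "w") as fout:
    for line in fin:
        text = json.loads(line)["text"]  # assumes a "text" field
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            fout.write(line)
```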

Postprocessing

  1. Run postprocessing.ipynb to start postprocessing; it applies the rules below (a sketch follows the list),
  • remove texts that contain HTTP errors.
  • remove texts shorter than 3 characters.
  • collapse runs of 6 or more spaces to exactly 6 spaces.
  • collapse runs of 6 or more dots to exactly 6 dots.
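A minimal sketch of these rules. The space and dot regexes follow the list directly; the HTTP-error check is an assumed heuristic, so the notebook may detect errors differently:

```python
# Sketch of the postprocessing rules; the HTTP-error patterns are
# an assumption, the space/dot rules mirror the list above.
import re

HTTP_ERROR = re.compile(r"\b(?:403 Forbidden|404 Not Found|500 Internal Server Error)\b")

def postprocess(text):
    # Remove texts that contain HTTP errors (assumed patterns).
    if HTTP_ERROR.search(text):
        return None
    # Remove texts shorter than 3 characters.
    if len(text) < 3:
        return None
    # Collapse runs of 6+ spaces / 6+ dots to exactly 6.
    text = re.sub(r" {6,}", " " * 6, text)
    text = re.sub(r"\.{6,}", "." * 6, text)
    return text

print(postprocess("hello" + " " * 20 + "world......... done"))
# -> "hello      world...... done"
```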

Rerunning this notebook will not overwrite existing postprocessed datasets.

Prepare for training session

There is no consideration of AI alignment and safety in the current datasets; we only apply basic postfiltering.

  1. FPF llama2
  2. FPF Mistral
  3. Pretrain nanoT5
  4. Pretrain smaller Causal LM
  5. Pretrain LLM
  6. FPF TinyLlama
  7. FPF Yi

End-to-end processing using a Python script

Released as a Python library: https://github.com/malaysia-ai/clean_text_my
