Updated get_dataset_config_names returning default in offline mode #7977

Closed
abigailllr wants to merge 1 commit into huggingface:main from abigailllr:fix-get-dataset-config-names-offline-mode

@abigailllr

When a dataset is cached and accessed in offline mode, get_dataset_config_names returned ['default'] instead of the actual cached config names. The cause: CachedDatasetModuleFactory.get_module returned a DatasetModule without builder_configs_parameters, so get_dataset_config_names fell back to ['default'].

The fix reads config_name from each dataset_info.json file in the cache directory and passes the names as builder_configs_parameters in the returned DatasetModule. Invalid or missing dataset_info.json files are handled.
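
As a rough illustration of the idea only (not the code in this PR), collecting config names from cached dataset_info.json files could look like the sketch below. The function name read_cached_config_names is hypothetical, the recursive search is an assumption about how the cache directory is laid out, and skipping unreadable files is just one possible way to handle invalid entries.

```python
import json
from pathlib import Path


def read_cached_config_names(cache_dir):
    """Collect config_name values from dataset_info.json files under cache_dir.

    Hypothetical helper for illustration; the real change lives in
    CachedDatasetModuleFactory.get_module and may differ.
    """
    root = Path(cache_dir)
    if not root.is_dir():
        # Missing cache directory: nothing to report.
        return []

    names = set()
    # Assumption: each cached config has a dataset_info.json somewhere below root.
    for info_path in root.rglob("dataset_info.json"):
        try:
            with open(info_path, encoding="utf-8") as f:
                info = json.load(f)
        except (OSError, ValueError):
            # Unreadable or invalid JSON: skip this file instead of failing.
            continue
        if not isinstance(info, dict):
            continue
        name = info.get("config_name")
        if isinstance(name, str) and name:
            names.add(name)
    return sorted(names)
```

A caller would then turn the returned names into builder_configs_parameters; that step depends on library internals and is left out here.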

Testing:

  1. Download a dataset in online mode so it gets cached
  2. Switch to offline mode and call get_dataset_config_names
  3. Verify it returns the cached config names instead of ['default']

Example:

  • HF_DATASETS_OFFLINE=0 HF_HOME="/tmp/hftemp" python -c "import datasets; datasets.load_dataset('cais/mmlu', 'all')"
  • HF_DATASETS_OFFLINE=1 HF_HOME="/tmp/hftemp" python -c "import datasets; print(datasets.get_dataset_config_names('cais/mmlu'))"
  • -> Expected output: ['all']

Fixes #7947

… reading config names from dataset_info.json in cache

Development

Successfully merging this pull request may close these issues.

MMLU get_dataset_config_names provides different lists of subsets in online and offline modes