fix: show correct package name to install biopython #6662

BioGeek · 2024-02-13T14:15:04Z

When you try to load a dataset that depends on biopython, for example load_dataset("InstaDeepAI/multi_species_genomes"), you get the following error:

>>> from datasets import load_dataset
>>> dataset = load_dataset("InstaDeepAI/multi_species_genomes")
/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py:1454: FutureWarning: The repository for InstaDeepAI/multi_species_genomes contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/InstaDeepAI/multi_species_genomes
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.51k/7.51k [00:00<00:00, 7.67MB/s]
Downloading readme: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17.2k/17.2k [00:00<00:00, 11.0MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 2548, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 2220, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1871, in dataset_module_factory
    raise e1 from None
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1844, in dataset_module_factory
    ).get_module()
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1466, in get_module
    local_imports = _download_additional_modules(
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 346, in _download_additional_modules
    raise ImportError(
ImportError: To be able to use InstaDeepAI/multi_species_genomes, you need to install the following dependency: Bio.
Please install it using 'pip install Bio' for instance.
>>>

The Bio module is provided by the biopython package, which is installed with pip install biopython, not with pip install Bio as the error message suggests.
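To make the mismatch concrete, here is a minimal sketch of checking whether a top-level module can be imported, using only the standard library. The helper name `is_importable` is hypothetical and is not part of `datasets`; the point is that the check works on the import name (`Bio`), while pip needs the distribution name (`biopython`).

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found.

    Hypothetical helper for illustration; intended for top-level names
    such as "Bio" or "sklearn", not dotted submodule paths.
    """
    return importlib.util.find_spec(module_name) is not None


# "Bio" is the name used in `import Bio`; the pip package is "biopython".
if not is_importable("Bio"):
    print("Bio is not importable; try: pip install biopython")
```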

This PR adds special-case logic to _download_additional_modules so that the error message shows the correct package name, similar to what is already done for sklearn / scikit-learn.

There are more packages whose importable module name differs from the PyPI package name, so this could be made more generic, for example:

# Mapping of importable module names to their PyPI package names
package_map = {
    "sklearn": "scikit-learn",
    "Bio": "biopython",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
}

# Replace each missing module name with the name pip expects
for module_name, pypi_name in package_map.items():
    if module_name in needs_to_be_installed:
        needs_to_be_installed[module_name] = pypi_name
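As a self-contained illustration of the generic idea, the sketch below builds the install hint from a list of missing import names. The names `IMPORT_TO_PYPI` and `missing_dependency_message` are hypothetical and the message wording only approximates the error shown in the traceback; this is not the actual `datasets` implementation.

```python
# Hypothetical mapping of import names to PyPI package names.
IMPORT_TO_PYPI = {
    "sklearn": "scikit-learn",
    "Bio": "biopython",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
}


def missing_dependency_message(dataset_name: str, missing_modules: list[str]) -> str:
    """Build an install hint that uses PyPI names instead of import names."""
    # Fall back to the import name when no mapping entry exists.
    packages = sorted({IMPORT_TO_PYPI.get(name, name) for name in missing_modules})
    noun = "dependencies" if len(packages) > 1 else "dependency"
    return (
        f"To be able to use {dataset_name}, you need to install the following "
        f"{noun}: {', '.join(packages)}.\n"
        f"Please install using 'pip install {' '.join(packages)}' for instance."
    )


print(missing_dependency_message("InstaDeepAI/multi_species_genomes", ["Bio"]))
```

Keeping the mapping in one dictionary means that supporting another library only requires a new entry rather than another special case.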

HuggingFaceDocBuilderDev · 2024-03-01T16:08:25Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq

LGTM ! Thanks for the fix :)

Having something more generic as you are suggesting could be nice indeed

Merging this one for now, and we can see in subsequent PRs if we want to fix more libraries

github-actions · 2024-03-01T17:49:47Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005135 / 0.011353 (-0.006218)	0.003666 / 0.011008 (-0.007342)	0.062660 / 0.038508 (0.024152)	0.028656 / 0.023109 (0.005546)	0.249601 / 0.275898 (-0.026297)	0.265745 / 0.323480 (-0.057735)	0.002935 / 0.007986 (-0.005051)	0.002606 / 0.004328 (-0.001723)	0.048774 / 0.004250 (0.044523)	0.043643 / 0.037052 (0.006591)	0.263114 / 0.258489 (0.004625)	0.284596 / 0.293841 (-0.009245)	0.027818 / 0.128546 (-0.100728)	0.010726 / 0.075646 (-0.064921)	0.205900 / 0.419271 (-0.213371)	0.035646 / 0.043533 (-0.007887)	0.245599 / 0.255139 (-0.009540)	0.267706 / 0.283200 (-0.015493)	0.018441 / 0.141683 (-0.123242)	1.143365 / 1.452155 (-0.308790)	1.191823 / 1.492716 (-0.300893)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.089703 / 0.018006 (0.071696)	0.298073 / 0.000490 (0.297583)	0.000209 / 0.000200 (0.000009)	0.000042 / 0.000054 (-0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018068 / 0.037411 (-0.019343)	0.061416 / 0.014526 (0.046890)	0.075989 / 0.176557 (-0.100567)	0.120765 / 0.737135 (-0.616370)	0.075476 / 0.296338 (-0.220863)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.284043 / 0.215209 (0.068834)	2.770282 / 2.077655 (0.692627)	1.473040 / 1.504120 (-0.031080)	1.349064 / 1.541195 (-0.192131)	1.362783 / 1.468490 (-0.105708)	0.560765 / 4.584777 (-4.024012)	2.357731 / 3.745712 (-1.387981)	2.745771 / 5.269862 (-2.524090)	1.726764 / 4.565676 (-2.838913)	0.061212 / 0.424275 (-0.363063)	0.004902 / 0.007607 (-0.002705)	0.336963 / 0.226044 (0.110919)	3.324519 / 2.268929 (1.055591)	1.825826 / 55.444624 (-53.618798)	1.548811 / 6.876477 (-5.327666)	1.570618 / 2.142072 (-0.571454)	0.642411 / 4.805227 (-4.162816)	0.116068 / 6.500664 (-6.384596)	0.042433 / 0.075469 (-0.033036)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.988402 / 1.841788 (-0.853386)	11.509601 / 8.074308 (3.435293)	9.555338 / 10.191392 (-0.636054)	0.138728 / 0.680424 (-0.541696)	0.014107 / 0.534201 (-0.520094)	0.285465 / 0.579283 (-0.293818)	0.263086 / 0.434364 (-0.171278)	0.327469 / 0.540337 (-0.212869)	0.444799 / 1.386936 (-0.942137)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005359 / 0.011353 (-0.005993)	0.003605 / 0.011008 (-0.007403)	0.049734 / 0.038508 (0.011226)	0.029792 / 0.023109 (0.006683)	0.276384 / 0.275898 (0.000486)	0.297915 / 0.323480 (-0.025564)	0.004949 / 0.007986 (-0.003036)	0.002713 / 0.004328 (-0.001616)	0.049499 / 0.004250 (0.045249)	0.044969 / 0.037052 (0.007917)	0.284558 / 0.258489 (0.026069)	0.315170 / 0.293841 (0.021329)	0.029457 / 0.128546 (-0.099089)	0.010573 / 0.075646 (-0.065073)	0.058191 / 0.419271 (-0.361080)	0.051461 / 0.043533 (0.007928)	0.270744 / 0.255139 (0.015605)	0.291664 / 0.283200 (0.008465)	0.018607 / 0.141683 (-0.123076)	1.158799 / 1.452155 (-0.293355)	1.210509 / 1.492716 (-0.282208)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.090277 / 0.018006 (0.072270)	0.298748 / 0.000490 (0.298258)	0.000228 / 0.000200 (0.000028)	0.000060 / 0.000054 (0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021850 / 0.037411 (-0.015561)	0.075433 / 0.014526 (0.060907)	0.087171 / 0.176557 (-0.089386)	0.125828 / 0.737135 (-0.611308)	0.090343 / 0.296338 (-0.205996)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.297267 / 0.215209 (0.082058)	2.865234 / 2.077655 (0.787579)	1.595024 / 1.504120 (0.090904)	1.476100 / 1.541195 (-0.065094)	1.494896 / 1.468490 (0.026406)	0.569086 / 4.584777 (-4.015691)	2.401976 / 3.745712 (-1.343736)	2.676091 / 5.269862 (-2.593771)	1.742087 / 4.565676 (-2.823590)	0.065161 / 0.424275 (-0.359114)	0.005006 / 0.007607 (-0.002602)	0.342302 / 0.226044 (0.116257)	3.450571 / 2.268929 (1.181643)	1.928754 / 55.444624 (-53.515871)	1.672823 / 6.876477 (-5.203653)	1.798830 / 2.142072 (-0.343243)	0.648730 / 4.805227 (-4.156498)	0.116433 / 6.500664 (-6.384231)	0.040683 / 0.075469 (-0.034786)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.006158 / 1.841788 (-0.835630)	12.200093 / 8.074308 (4.125785)	10.180691 / 10.191392 (-0.010701)	0.146620 / 0.680424 (-0.533804)	0.015621 / 0.534201 (-0.518580)	0.287956 / 0.579283 (-0.291327)	0.277231 / 0.434364 (-0.157133)	0.323815 / 0.540337 (-0.216522)	0.429655 / 1.386936 (-0.957281)

fix: show correct package name to install biopython

7589e06

Merge branch 'main' into biopython_additional_module

0cd7c65

lhoestq approved these changes Mar 1, 2024

lhoestq merged commit 273e16f into huggingface:main Mar 1, 2024
12 checks passed
