Remove all query parameters when extracting protocol #2996

Merged: 5 commits, Oct 4, 2021

albertvillanova
Member

Fix _get_extraction_protocol to remove all query parameters, such as ?raw=true and ?dl=1.
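
For illustration only, here is a minimal sketch of what "removing all query parameters" from a URL could look like using the standard library. The helper name `strip_query` is hypothetical and this is not the code from the PR; it only shows the general idea on two URLs quoted in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild the URL from scheme, host and path only.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_query("https://zenodo.org/record/2787612/files/SICK.zip?download=1"))
# https://zenodo.org/record/2787612/files/SICK.zip
print(strip_query("https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz?dl=1"))
# https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz
```

Once the query string is gone, the file extension sits at the end of the path, where a simple suffix check can find it.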

@severo
Contributor

severo commented Oct 1, 2021

Beware of cases like http://ufal.ms.mff.cuni.cz/umc/005-en-ur/download.php?f=umc005-corpus.zip or gzip://bg-cs.xml::https://opus.nlpl.eu/download.php?f=Europarl/v8/xml/bg-cs.xml.gz, where the file extension appears inside the query string. I see these URLs in the errors (https://observablehq.com/@huggingface/quality-assessment-of-datasets-loading?collection=@huggingface/datasets), but not under the "Extraction protocol for file at xxx is not implemented yet" error, so I'm not sure whether they would break now.

Maybe: first try to find an extension and, if there is none, remove the ?... part and try again to find the extension.
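
The two-step fallback suggested above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function names `find_extension` and `guess_extension` and the `KNOWN_EXTENSIONS` set are hypothetical and do not come from the library.

```python
from typing import Optional

# Hypothetical set of recognised extensions, for illustration only.
KNOWN_EXTENSIONS = {"zip", "gz", "tgz", "xz", "csv", "tsv", "json", "txt", "xml"}


def find_extension(text: str) -> Optional[str]:
    """Return the text after the last dot if it is a known extension, else None."""
    if "." not in text:
        return None
    candidate = text.rsplit(".", 1)[1].lower()
    return candidate if candidate in KNOWN_EXTENSIONS else None


def guess_extension(url: str) -> Optional[str]:
    # Step 1: look for an extension on the URL exactly as given.
    extension = find_extension(url)
    if extension is None:
        # Step 2: drop the "?..." part and retry.
        extension = find_extension(url.split("?", 1)[0])
    return extension


# The extension is inside the query string: found in step 1.
print(guess_extension("http://ufal.ms.mff.cuni.cz/umc/005-en-ur/download.php?f=umc005-corpus.zip"))  # zip
# The extension is followed by a query string: found in step 2.
print(guess_extension("https://zenodo.org/record/2787612/files/SICK.zip?download=1"))  # zip
# No extension anywhere in the URL.
print(guess_extension("https://drive.google.com/uc?id=12Uz59TYg_NtxOy7SXraYeXPMRT7oaO7X"))  # None
```

With this ordering, the query string is only discarded when the untouched URL yields nothing, so download.php?f=... style links keep their extension.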

By the way, here is the list of URLs for errors of this type, with a '?' in the URL:

https://dl.orangedox.com/WyaCpL?dl=1
https://drive.google.com/u/0/uc?id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM&export=download
https://drive.google.com/u/0/uc?id=1-CaP3xHgZxOGjQ3pXC5tr9YnIajmel-t&export=download
https://drive.google.com/u/0/uc?id=11EBGHMAswT5JDO60xh7gnZfYjpMQs7h7&export=download
https://drive.google.com/u/0/uc?id=13JCCr-IjZK7uhbLXeufptr_AxvsKinVl&export=download
https://drive.google.com/u/0/uc?id=13ZyFc2qepAYSg9WIFaeJ9y402gblsl2e&export=download
https://drive.google.com/u/0/uc?id=15auwrFAlq52JJ61u7eSfnhT9rZtI5sjk&export=download
https://drive.google.com/u/0/uc?id=16OgJ_OrfzUF_i3ftLjFn9kpcyoi7UJeO&export=download
https://drive.google.com/u/0/uc?id=1BFYF05rx-DK9Eb5hgoIgd6EcB8zOI-zu&export=download
https://drive.google.com/u/0/uc?id=1Cz1Un9p8Xn9IpEMMrg2kXSDt0dnjxc4z&export=download
https://drive.google.com/u/0/uc?id=1H7FphKVVCYoH49sUXl79CuztEfJLaKoF&export=download
https://drive.google.com/u/0/uc?id=1NAeuWLgYBzLwU5jCdkrtj4_PRUocuvlb&export=download
https://drive.google.com/u/0/uc?id=1OletxmPYNkz2ltOr9pyT0b0iBtUWxslh&export=download
https://drive.google.com/u/0/uc?id=1OletxmPYNkz2ltOr9pyT0b0iBtUWxslh&export=download/
https://drive.google.com/u/0/uc?id=1R1jR4DcH2UEaM1ZwDSRHdfTGvkCNu6NW&export=download
https://drive.google.com/u/0/uc?id=1hDHeoFIfQzJec1NgZNXh3CTNbchiIvuG&export=download
https://drive.google.com/u/0/uc?id=1wxwqnWGRzwvc_-ugRoFX8BPgpO3Q7sch&export=download
https://drive.google.com/u/0/uc?id=1ydsOTvBZXKqcRvXawOuePrJ99slOEbkk&export=download
https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ
https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbQ2Vic1kxMmZZQ1k
https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbZlU4dXhHTFhZQU0
https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9Qhbd2JNdDBsQUdocVU
https://drive.google.com/uc?export=download&id=1-w-0uqaC6hnRn1F_3XqJEvi09zlcTIhX
https://drive.google.com/uc?export=download&id=11wMGqNVSwwk6zUnDaJEgm3qT71kAHeff
https://drive.google.com/uc?export=download&id=17FGi8KI9N9SuGe7elM8qU8_3fx4sfgTr
https://drive.google.com/uc?export=download&id=1AHUm1-_V9GCtGuDcc8XrMUCJE8B-HHoL
https://drive.google.com/uc?export=download&id=1CBrh-9OrSpKmPQBxTK_ji6mq6WTN_U9U
https://drive.google.com/uc?export=download&id=1Ev4RqWcPsLI9rgOGAKh-_dFKqcEZ1u-G
https://drive.google.com/uc?export=download&id=1GTHUJxxmjLmG2lnF9dwRgIDRFZaOY3-F
https://drive.google.com/uc?export=download&id=1GcUN6mytEcOMBBOvjJOQzBmEkc-LdgQg
https://drive.google.com/uc?export=download&id=1J3mucMFTWrgAYa3LuBZoLRR3CzzYD3fa
https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P
https://drive.google.com/uc?export=download&id=1L7aoUXzHPzyzQ0ns4ApBbYepsjFOtXil
https://drive.google.com/uc?export=download&id=1M1M5yIOyjKWGprc3LUeVVwxgKXxgpqxm
https://drive.google.com/uc?export=download&id=1Nug7-Sri50mkJL4GrWw6C2ZIbfeU-6Am
https://drive.google.com/uc?export=download&id=1PGa8j1_IqxiGTc3SU6NMB38sAzxCPS34
https://drive.google.com/uc?export=download&id=1QsV8C5EPJrQl37mwva_5-IJOrCaOi2tH
https://drive.google.com/uc?export=download&id=1RsGLINVce-0GsDkCLDuLZmoLuzfmoCuQ
https://drive.google.com/uc?export=download&id=1TuWH7uwu6V90QWmZn25qhou1rm97Egmn
https://drive.google.com/uc?export=download&id=1U7WdBpd9kJ85S7BbBhWUSiy9NnXrKdO6
https://drive.google.com/uc?export=download&id=1USoQ8lJgN8kAWnUnRrupMGrPMLlDVqlV
https://drive.google.com/uc?export=download&id=1Uit4Og1pk-br_0UJIO5sdhApyhTuHzqo
https://drive.google.com/uc?export=download&id=1Z2ty5hU0tIGRZRDlFQZLO7b5vijRfvo0
https://drive.google.com/uc?export=download&id=1ZyFGufe4puX3vjGPbp4xg9Hca3Gwq22g
https://drive.google.com/uc?export=download&id=1ZzlIQvw1KNBG97QQCfdatvVrrbeLaM1u
https://drive.google.com/uc?export=download&id=1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
https://drive.google.com/uc?export=download&id=1__EjA6oZsgXQpggPm-h54jZu3kP6Y6zu
https://drive.google.com/uc?export=download&id=1aHPVfC5TrlnUjehtagVZoDfq4VccgaNT
https://drive.google.com/uc?export=download&id=1cqu_YAgvlyVSzzjcUyP1Cz7q0k8Pw7vN
https://drive.google.com/uc?export=download&id=1dUIqVwvoZAtbX_-z5axCoe97XNcFo1No
https://drive.google.com/uc?export=download&id=1eTtRs5cUlBP5dXsx-FTAlmXuB6JQi2qj
https://drive.google.com/uc?export=download&id=1fUR3MqJ8jTMka6owA0S-Fe6aHmiophc_
https://drive.google.com/uc?export=download&id=1ffWfITKFMJeqjT8loC8aiCLRNJpc_XnF
https://drive.google.com/uc?export=download&id=1g89WgFHMRbr4QrvA0ngh26PY081Nv3lx
https://drive.google.com/uc?export=download&id=1meSNZHxd_0TZLKCRCYGN-Ke3IA5c1qOE
https://drive.google.com/uc?export=download&id=1okwGJiOZmTpNRNgJLCnjFF4Q0H1z4l6_
https://drive.google.com/uc?export=download&id=1phryJg4FjCFkn0mSCqIOP2-FscAeKGV0
https://drive.google.com/uc?export=download&id=1s8NSFT4Kz0caKZ4VybPNzt88F8ZanprY
https://drive.google.com/uc?export=download&id=1vRY2wM6rlOZrf9exGTm5pXj5ExlVwJ0C
https://drive.google.com/uc?export=download&id=1ytVZ4AhubFDOEL7o7XrIRIyhU8g9wvKA
https://drive.google.com/uc?id=12Uz59TYg_NtxOy7SXraYeXPMRT7oaO7X
https://drive.google.com/uc?id=1PGH5H_oW7wUvMw_5xaXvbEN7DFll-wDX
https://github.com/MaazAmjad/Datasets-for-Urdu-news/blob/master/Urdu%20Fake%20News%20Dataset.zip?raw=true
https://github.com/TevenLeScao/glucose/blob/master/GLUCOSE_training_data.zip?raw=true
https://github.com/TevenLeScao/what-time-is-it/blob/master/gutenberg_time_phrases.zip?raw=true
https://github.com/aviaefrat/cryptonite/blob/main/data/cryptonite-official-split.zip?raw=true
https://github.com/facebookresearch/Imppres/blob/master/dataset/IMPPRES.zip?raw=true
https://github.com/ljos/navnkjenner/blob/master/data/bokmaal/no_bokmaal-ud-train.bioes?raw=true
https://github.com/ljos/navnkjenner/blob/master/data/nynorsk/no_nynorsk-ud-train.bioes?raw=true
https://github.com/ljos/navnkjenner/blob/master/data/samnorsk/no_samnorsk-ud-train.bioes?raw=true
https://github.com/mirfan899/Urdu/blob/master/sentiment/imdb_urdu_reviews.csv.tar.gz?raw=true
https://github.com/omilab/Neural-Sentiment-Analyzer-for-Modern-Hebrew/blob/master/data/morph_train.tsv?raw=true
https://github.com/omilab/Neural-Sentiment-Analyzer-for-Modern-Hebrew/blob/master/data/token_train.tsv?raw=true
https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0023-625F-0/hindencorp05.plaintext.gz?sequence=3&isAllowed=y
https://repo.sadilar.org/bitstream/handle/20.500.12185/299/nchlt_afrikaans_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y
https://repo.sadilar.org/bitstream/handle/20.500.12185/312/nchlt_isixhosa_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y
https://repo.sadilar.org/bitstream/handle/20.500.12185/319/nchlt_isizulu_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y
https://repo.sadilar.org/bitstream/handle/20.500.12185/328/nchlt_sepedi_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y
https://repo.sadilar.org/bitstream/handle/20.500.12185/334/nchlt_sesotho_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y
https://repo.sadilar.org/bitstream/handle/20.500.12185/341/nchlt_setswana_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y
https://repo.sadilar.org/bitstream/handle/20.500.12185/346/nchlt_siswati_named_entity_annotated_corpus.zip?sequence=3&isAllowed=y
https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz?dl=1
https://zenodo.org/record/1043504/files/corpus-webis-tldr-17.zip?download=1
https://zenodo.org/record/1489920/files/articles-training-byarticle-20181122.zip?download=1
https://zenodo.org/record/1489920/files/articles-training-bypublisher-20181122.zip?download=1
https://zenodo.org/record/2787612/files/SICK.zip?download=1
https://zenodo.org/record/3553423/files/Swahili%20data.zip?download=1
https://zenodo.org/record/3707949/files/tapaco_v1.0.zip?download=1
https://zenodo.org/record/4300294/files/train.csv?download=1

@albertvillanova
Member Author

albertvillanova commented Oct 1, 2021

Hi @severo, I just saw your comment. Thank you.

In the end, I just swapped the two parsing steps: first I extract the extension, and then I remove the query parameters. 😉
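
To make the effect of the swap concrete, here is a small sketch comparing the two orderings. It is not the PR's actual code: both function names are hypothetical, and "extension" is simplified to "whatever follows the last dot".

```python
def query_then_extension(url: str) -> str:
    """Remove the query string first, then take the text after the last dot."""
    return url.split("?", 1)[0].rsplit(".", 1)[-1]


def extension_then_query(url: str) -> str:
    """Take the text after the last dot first, then remove any query string from it."""
    return url.rsplit(".", 1)[-1].split("?", 1)[0]


plain = "https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz?dl=1"
in_query = "http://ufal.ms.mff.cuni.cz/umc/005-en-ur/download.php?f=umc005-corpus.zip"

# Both orderings agree when the extension precedes the query string.
print(query_then_extension(plain), extension_then_query(plain))  # tgz tgz
# They differ when the file name is passed as a query parameter.
print(query_then_extension(in_query), extension_then_query(in_query))  # php zip
```

Under these simplifying assumptions, extracting the extension first handles both kinds of URL, whereas stripping the query first loses the file name in the second case.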

@severo
Contributor

severo commented Oct 1, 2021

OK :) Maybe we should add some unit tests to ensure we improve the detection without regressions (it's Friday afternoon, and I trust the unit tests more than my own analysis of the code).

@severo
Contributor

severo commented Oct 1, 2021

Great! For the tests, I think we should also add some URLs of the form http://ufal.ms.mff.cuni.cz/umc/005-en-ur/download.php?f=umc005-corpus.zip, to be sure they are still detected correctly.
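
As a rough idea of what such a regression test could look like, here is a table-driven sketch. The function `detect_extension` is a hypothetical stand-in for the real function under test (whose return values are not shown in this thread), and the cases are URLs quoted above.

```python
def detect_extension(url: str) -> str:
    """Hypothetical stand-in for the function under test."""
    return url.rsplit(".", 1)[-1].split("?", 1)[0]


# (URL, expected extension) pairs taken from URLs quoted in this thread.
CASES = [
    ("https://zenodo.org/record/2787612/files/SICK.zip?download=1", "zip"),
    ("https://github.com/TevenLeScao/glucose/blob/master/GLUCOSE_training_data.zip?raw=true", "zip"),
    ("https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz?dl=1", "tgz"),
    # The extension lives inside the query string and must still be detected.
    ("http://ufal.ms.mff.cuni.cz/umc/005-en-ur/download.php?f=umc005-corpus.zip", "zip"),
]


def test_extension_detection():
    # Written as a plain function so that a runner such as pytest can collect it.
    for url, expected in CASES:
        actual = detect_extension(url)
        assert actual == expected, f"{url}: expected {expected!r}, got {actual!r}"
```

Keeping the cases in one table makes it cheap to append each newly reported URL, so a later change to the parsing order cannot silently break a form that used to work.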

albertvillanova merged commit 492fb16 into master on Oct 4, 2021
albertvillanova deleted the fix-get_extraction_protocol branch on October 4, 2021 08:48