[MRG] Add more multiclass datasets #6

mathurinm · 2020-10-29T08:14:48Z

sklearn.datasets.load_svmlight_file fails on the protein dataset, because the file stores some values as .50 (after a space) instead of 0.50.

To reproduce, create a file dummy_svmlight containing the single line:
0 21:1.00 42:1.00 63: .50
and run load_svmlight_file("dummy_svmlight").

Replacing .50 with 0.50 in the file makes the load succeed.

I will have to dig a bit into load_svmlight_file to see how to fix this properly.
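As a stopgap on the data side, the file can be rewritten before it is handed to the loader. The following is a minimal sketch, not part of this PR: the helper names `normalize_svmlight_text` and `write_normalized_copy` are hypothetical, and the regular expression assumes the only malformed pattern is an index, a colon, whitespace, and a value missing its leading zero.

```python
import re
from pathlib import Path

# Matches "index:" followed by whitespace and a value without its leading
# zero, e.g. "63: .50". Well-formed pairs such as "21:1.00" are not matched.
_MISSING_ZERO = re.compile(r"(\d+:)\s+\.(\d)")


def normalize_svmlight_text(text):
    """Rewrite 'idx: .50' as 'idx:0.50' (hypothetical helper)."""
    return _MISSING_ZERO.sub(r"\g<1>0.\2", text)


def write_normalized_copy(src, dst):
    """Write a normalized copy of src to dst and return dst.

    Reads the whole file into memory, which is an assumption that may not
    hold for very large datasets.
    """
    Path(dst).write_text(normalize_svmlight_text(Path(src).read_text()))
    return dst
```

The normalized copy could then be passed to load_svmlight_file in place of the original file.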

codecov-io · 2020-10-29T08:17:39Z

Codecov Report

Merging #6 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master       #6   +/-   ##
=======================================
  Coverage   89.58%   89.58%           
=======================================
  Files           3        3           
  Lines          96       96           
  Branches       14       14           
=======================================
  Hits           86       86           
  Misses          5        5           
  Partials        5        5

Impacted Files	Coverage Δ
libsvmdata/datasets.py	`86.84% <ø> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bf51fab...4243523.

mathurinm · 2020-10-29T08:26:43Z

In the Cython code at https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/datasets/_svmlight_format_fast.pyx#L68
there is

target, features = line_parts[0], line_parts[1:]

which in this case yields

['21:1.00', '42:1.00', '63:', '.50']

A potential fix is to perform line = line.replace(': .', ':0.').
I will have to run benchmarks to see whether it harms performance. Given that it only affects a single dataset, it may not be worth it for now.
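The behaviour described above can be illustrated in plain Python. This is a sketch only: it assumes the parser splits each line on whitespace, and it works on str, whereas the Cython code may operate on bytes (in which case the replacement would need bytes arguments).

```python
line = "0 21:1.00 42:1.00 63: .50"

# Whitespace splitting separates "63:" from ".50", so the last feature
# arrives as two tokens instead of one "index:value" pair.
line_parts = line.split()
target, features = line_parts[0], line_parts[1:]
print(features)  # ['21:1.00', '42:1.00', '63:', '.50']

# The proposed replacement glues the pair back together before splitting.
fixed = line.replace(': .', ':0.')
fixed_features = fixed.split()[1:]
print(fixed_features)  # ['21:1.00', '42:1.00', '63:0.50']

# Every token now parses as an (index, value) pair.
pairs = [(int(idx), float(val))
         for idx, val in (feat.split(':') for feat in fixed_features)]
print(pairs)  # [(21, 1.0), (42, 1.0), (63, 0.5)]
```

Note that the replacement only covers a single space between the colon and the dot; other spacing variants would pass through unchanged.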

@QB3 here's the explanation for the issue you encountered.

agramfort · 2020-10-29T08:40:55Z

sklearn code should be fixed :(

…

[WIP] Add fetcher for protein

4243523

mathurinm added 2 commits October 29, 2020 10:26

more datasets

bcf9add

rm protein, add news20 multiclass

123f301

mathurinm changed the title ~~[WIP] Add fetcher for protein~~ [MRG] Add more multiclass datasets Nov 4, 2020

mathurinm mentioned this pull request Nov 4, 2020

Protein fails #7

Open

mathurinm merged commit 293a905 into master Nov 4, 2020

mathurinm deleted the protein branch November 4, 2020 16:28
