@mathurinm
Owner

sklearn.datasets.load_svmlight_file fails on this protein dataset because some values are stored as .50, separated from their feature index by a space, instead of 0.50.

Reproduce with a dummy_svmlight file containing a single line:
0 21:1.00 42:1.00 63: .50
and run load_svmlight_file("dummy_svmlight")

Replacing .50 (and the space before it) with 0.50 in the file fixes the error.

I will have to dig into load_svmlight_file to see how to fix this properly.
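
Until the loader itself handles this, one possible workaround is to rewrite the file before loading it. The sketch below is not part of libsvmdata or scikit-learn: the helper name fix_bare_decimals and the regular expression are hypothetical, and it assumes the only malformed pattern is a colon followed by optional spaces and a value that starts with a bare decimal point.

```python
import re

# Hypothetical pattern: a colon, optional spaces or tabs, then a bare decimal
# point that is followed by a digit, e.g. "63: .50" or "63:.50".
_BARE_DECIMAL = re.compile(r":[ \t]*\.(?=\d)")


def fix_bare_decimals(src_path, dst_path):
    """Copy src_path to dst_path, rewriting 'idx: .50' as 'idx:0.50'."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(_BARE_DECIMAL.sub(":0.", line))
```

The rewritten file can then be passed to load_svmlight_file in place of the original. Well-formed entries such as 21:1.00 are left untouched because the pattern only matches when the first character of the value is a decimal point.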

@codecov-io

codecov-io commented Oct 29, 2020

Codecov Report

Merging #6 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master       #6   +/-   ##
=======================================
  Coverage   89.58%   89.58%           
=======================================
  Files           3        3           
  Lines          96       96           
  Branches       14       14           
=======================================
  Hits           86       86           
  Misses          5        5           
  Partials        5        5           
Impacted Files Coverage Δ
libsvmdata/datasets.py 86.84% <ø> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mathurinm
Owner Author

The Cython code at https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/datasets/_svmlight_format_fast.pyx#L68
contains

target, features = line_parts[0], line_parts[1:]

which in this case yields

['21:1.00', '42:1.00', '63:', '.50']
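
The split can be reproduced in plain Python. This is a minimal sketch that assumes each line is tokenized on whitespace, which is consistent with the output above; it does not call the Cython parser.

```python
line = "0 21:1.00 42:1.00 63: .50"

# Whitespace splitting separates the index "63:" from its value ".50".
line_parts = line.split()
target, features = line_parts[0], line_parts[1:]

print(target)    # 0
print(features)  # ['21:1.00', '42:1.00', '63:', '.50']

# Each feature is expected to look like "index:value". "63:" has an empty
# value and ".50" has no index, so neither token is a valid pair.
malformed = [f for f in features if f.endswith(":") or ":" not in f]
print(malformed)  # ['63:', '.50']
```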

A potential fix is to call line = line.replace(': .', ':0.') before splitting.
I will have to benchmark it to see whether it harms performance. Since it only affects a single dataset, it may not be worth it for now.
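
A rough way to gauge the cost of the extra replace call is sketched below. It is pure Python, whereas the real parser is Cython and works on the file contents directly, so the timings only hint at the relative overhead; split_raw and split_fixed are hypothetical helper names.

```python
import timeit

line = "0 21:1.00 42:1.00 63: .50"


def split_raw(line):
    # Current behaviour: split on whitespace only.
    parts = line.split()
    return parts[0], parts[1:]


def split_fixed(line):
    # Proposed fix: glue a space-separated bare decimal back onto its index.
    parts = line.replace(": .", ":0.").split()
    return parts[0], parts[1:]


print(split_fixed(line))  # ('0', ['21:1.00', '42:1.00', '63:0.50'])

# Per-line overhead of the extra replace call; numbers vary by machine.
n = 100_000
t_raw = timeit.timeit(lambda: split_raw(line), number=n)
t_fixed = timeit.timeit(lambda: split_fixed(line), number=n)
print(f"raw: {t_raw:.3f}s  fixed: {t_fixed:.3f}s for {n} lines")
```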

@QB3 here's the explanation for the issue you encountered.

@agramfort
Collaborator

agramfort commented Oct 29, 2020 via email

@mathurinm mathurinm changed the title [WIP] Add fetcher for protein [MRG] Add more multiclass datasets Nov 4, 2020
@mathurinm mathurinm mentioned this pull request Nov 4, 2020
@mathurinm mathurinm merged commit 293a905 into master Nov 4, 2020
@mathurinm mathurinm deleted the protein branch November 4, 2020 16:28