#FIXME Benni and Flo #2

Closed
pfistfl opened this issue Aug 24, 2015 · 18 comments

Comments

@pfistfl
Member

pfistfl commented Aug 24, 2015

# FIXME: data sets with text features and special characters; they are not stored as UTF-8 on OML
dids = setdiff(dids, c(374, 376, 379, 380))

# FIXME: strings are split at "," so "[1,2]" becomes "'[1" and "2]'"
dids = setdiff(dids, c(1047, 1057))

# FIXME: foreign cannot read these data sets; line breaks are \r\r\n instead of \r\n
# Might be caused by downloading with R's download.file()?
dids = setdiff(dids, c(579, 585, 581))

# FIXME: data sets with spaces in column names
dids = setdiff(dids, c(1058))

# FIXME: error in the data, numeric values are sometimes quoted, e.g. '1047' instead of 1047
# Weka simply removes the quotes
dids = setdiff(dids, c(1092, 1095))

# FIXME: data set where @data lines sometimes begin with ",".
# farff reads NA for the first entry and drops the last entry in the line
# RWeka removes the ","
dids = setdiff(dids, c(676))

# FIXME: data set of the form {0 entry1, 1 entry2, 2 entry3, 4 entry5},
# where 0, 1, 2, 4, ... is the column number written in front of each entry.
# If a column number does not appear inside the '{}',
# that column is filled with 0 (that is what RWeka does)
dids = setdiff(dids, c(292))

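A side note on the quoting problem listed above (strings split at ","): below is a minimal base-R sketch of quote-aware field splitting. The helper name `split_arff_fields` is hypothetical and this is not how farff actually parses; it only illustrates that `scan()` with a `quote` argument keeps `'[1,2]'` together, whereas a plain `strsplit()` on "," produces the broken pieces described in the FIXME.

```r
# Hypothetical helper for illustration only, not part of farff or RWeka.
# Splits one ARFF data line on commas while respecting single/double quotes.
split_arff_fields = function(line) {
  scan(text = line, what = character(), sep = ",", quote = "'\"",
    strip.white = TRUE, quiet = TRUE)
}

line = "'[1,2]',3,abc"
naive = strsplit(line, ",", fixed = TRUE)[[1]]  # also splits inside the quotes
fields = split_arff_fields(line)                # keeps the quoted field whole
```

Escaped quotes inside a quoted field are not covered by this sketch.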
@berndbischl
Member

  1. Thanks

  2. Please only check against RWeka, not foreign
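A rough sketch of what such a check against RWeka could look like. The helper `same_data` is hypothetical, and comparing after converting factors to characters is an assumption about what counts as "equal" here. The commented usage relies on `farff::readARFF()` and `RWeka::read.arff()` with a placeholder path.

```r
# Hypothetical comparison helper, an assumption about how one might compare
# the output of two ARFF readers; not taken from the farff test suite.
same_data = function(a, b) {
  # Compare factors as plain strings so differing level orders do not matter.
  unfactor = function(col) if (is.factor(col)) as.character(col) else col
  a[] = lapply(a, unfactor)
  b[] = lapply(b, unfactor)
  isTRUE(all.equal(a, b, check.attributes = FALSE))
}

# Usage sketch (requires the farff and RWeka packages; the path is a placeholder):
# d1 = farff::readARFF("path/to/file.arff", data.reader = "readr")
# d2 = RWeka::read.arff("path/to/file.arff")
# same_data(d1, d2)

a = data.frame(x = c(1, 2), y = factor(c("a", "b")))
b = data.frame(x = c(1, 2), y = c("a", "b"), stringsAsFactors = FALSE)
ok = same_data(a, b)
b$x[2] = 3
not_ok = same_data(a, b)
```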

@berndbischl
Member

Please update the list here; I am still working on new versions.

@berndbischl
Member

Data sets that I cannot do anything about, usually because they are "invalid" on OML, I will exclude, with comments, in the OML unit test file in farff.

@berndbischl
Member

c(579, 585, 581)

These work now
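For anyone who still hits the \r\r\n line breaks with another reader, one possible workaround is to normalize the line endings before parsing. This is a sketch under that assumption, with a hypothetical helper name; it is not necessarily how the issue was resolved in farff.

```r
# Hypothetical helper: collapse any run of carriage returns followed by a
# newline ("\r\n", "\r\r\n", ...) into a single "\n".
normalize_line_endings = function(txt) {
  gsub("\r+\n", "\n", txt)
}

broken = "a,b\r\r\nc,d\r\n"
fixed = normalize_line_endings(broken)
```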

@berndbischl
Member

dids = setdiff(dids, c(1058))

This should work now

@berndbischl
Member

292 is in sparse format, I cannot handle this yet.
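To make the sparse format concrete, here is a minimal sketch of how one sparse line could be expanded into a dense row. The helper `expand_sparse_row` is hypothetical and does not exist in farff. It assumes zero-based column indices, fills omitted columns with "0" as described for RWeka above, and ignores quoted values that contain commas.

```r
# Hypothetical helper for illustration; not part of farff.
# Expands a sparse ARFF line like "{0 a, 2 c}" into a dense character vector.
expand_sparse_row = function(line, n_cols) {
  inner = gsub("^\\s*\\{|\\}\\s*$", "", line)  # strip the surrounding braces
  values = rep("0", n_cols)                    # omitted columns default to 0
  if (nzchar(trimws(inner))) {
    entries = trimws(strsplit(inner, ",", fixed = TRUE)[[1]])
    for (e in entries) {
      idx = as.integer(sub("\\s+.*$", "", e)) + 1L  # zero-based to one-based
      values[idx] = sub("^\\S+\\s+", "", e)         # text after the index
    }
  }
  values
}

row = expand_sparse_row("{0 entry1, 1 entry2, 2 entry3, 4 entry5}", n_cols = 5)
```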

@berndbischl
Member

dids = setdiff(dids, c(1092, 1095))

I reported this on the server, faulty ARFF IMHO

openml/OpenML#204

EDIT: Fixed on OML server and can be parsed correctly now.
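For files that still contain quoted numbers, the Weka behaviour mentioned earlier (simply removing the quotes) could be imitated roughly like this. The helper name is hypothetical and this is only a sketch, not farff code.

```r
# Hypothetical helper: strip one leading and one trailing quote character
# (single or double) and convert the result to numeric.
unquote_numeric = function(x) {
  as.numeric(gsub("^['\"]|['\"]$", "", x))
}

vals = unquote_numeric(c("'1047'", "1047", "3.5"))
```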

@berndbischl
Member

dids = setdiff(dids, c(374, 376, 379, 380))

I reported this on the server, faulty ARFF IMHO

openml/OpenML#201

EDIT: Fixed on OML server and can be parsed correctly now, but only with data.reader = 'readr'.
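If a file with non-UTF-8 text shows up again, converting it before parsing might help. The sketch below assumes the source encoding is Latin-1, which is only a guess; the thread does not say how the affected files were actually encoded. The helper name is hypothetical.

```r
# Hypothetical helper: re-encode text to UTF-8.
# The default source encoding "latin1" is an assumption, not a known fact
# about the affected OML files.
to_utf8 = function(x, from = "latin1") {
  iconv(x, from = from, to = "UTF-8")
}

raw_bytes = as.raw(c(0x63, 0x61, 0x66, 0xe9))  # "caf" plus e-acute in Latin-1
x = rawToChar(raw_bytes)
y = to_utf8(x)
```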

@berndbischl
Member

dids = setdiff(dids, c(676))

I reported this on the server, faulty ARFF IMHO

openml/OpenML#203

EDIT: Fixed on OML server and can be parsed correctly now.
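For reference, the RWeka behaviour described earlier (dropping the leading ",") could be mimicked on raw data lines roughly as follows. The helper is hypothetical and only a sketch of that one cleanup step.

```r
# Hypothetical helper: remove a single leading comma (and any whitespace
# before it) from each data line; other lines are returned unchanged.
strip_leading_comma = function(lines) {
  sub("^\\s*,", "", lines)
}

cleaned = strip_leading_comma(c(",1,2,3", "4,5,6"))
```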

@berndbischl
Member

Can you please give feedback on whether this is all done now?

@berndbischl
Member

Please retest on your side with the latest version.

@pfistfl
Member Author

pfistfl commented Aug 28, 2015

Ok, checked back:

Not yet working for me:

# FIXME: data sets with text features and special characters; they are not stored as UTF-8 on OML
dids = setdiff(dids, c(374, 376, 379, 380))

# Shouldn't this work by now? I even explicitly included
# d1 = readARFF(path, data.reader = "readr")

# and (not possible yet):

# FIXME: data set of the form {0 entry1, 1 entry2, 2 entry3, 4 entry5},
# where 0, 1, 2, 4, ... is the column number written in front of each entry.
# If a column number does not appear inside the '{}',
# that column is filled with 0 (that is what RWeka does)
dids = setdiff(dids, c(292))

@pfistfl
Member Author

pfistfl commented Aug 28, 2015

Additionally found new errors (extended search range):

# If data lines do not end in \r\n, an extra line of NAs is added
# Happens at the end of 1028 and 1030, and at every second line of 1059 and 1064
did2s = setdiff(did2s, c(1028, 1030, 1059, 1064))
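Until this is fixed in the parser, a possible stopgap is to drop rows that consist only of NAs after reading. This is just a sketch with a hypothetical helper name. It masks the symptom rather than fixing the line-ending handling, and it would also drop rows that are legitimately all missing.

```r
# Hypothetical workaround: remove rows in which every column is NA.
drop_all_na_rows = function(df) {
  keep = rowSums(!is.na(df)) > 0
  df[keep, , drop = FALSE]
}

df = data.frame(a = c(1, NA, 3), b = c("x", NA, "z"), stringsAsFactors = FALSE)
res = drop_all_na_rows(df)
```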

@berndbischl
Member

dids = setdiff(dids, c(374, 376, 379, 380))

This is already in my unit tests? Please paste what happens here, with readr.

@berndbischl
Member

292: Please don't refer to that again. Like I said, it is sparse, I cannot handle that now, and we have a separate issue for it.

@berndbischl
Member

Can you please also simply run the whole unit tests on your machine?
They all pass here, and all of your data sets are already included here.

@pfistfl
Member Author

pfistfl commented Aug 28, 2015

dids = setdiff(dids, c(374, 376, 379, 380)) 
# Works now, no idea what the error was before. I'll reply when I can reproduce.

292: noted.

@jakobbossek
Contributor

All but 292 (sparse format) run fine. Closing 🎉
