Encoding bug? #28

Open
giuseppec opened this issue Mar 18, 2016 · 3 comments

@giuseppec
Collaborator

There seems to be an encoding issue, at least on Windows (not sure whether the cause is Java on Windows, Windows itself, or farff):

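    library(OpenML)    # getOMLConfig(), getOMLDataSet()
    library(farff)     # readARFF()
    library(testthat)  # expect_equal()

    oml.conf = getOMLConfig()
    cachedir = oml.conf$cachedir
    data.id = 376
    data.reader = "readr"
    # download the data set so that the ARFF file exists in the cache
    getOMLDataSet(data.id)
    path = file.path(cachedir, "datasets", data.id, "dataset.arff")
    # read the same file with farff and with RWeka
    d1 = readARFF(path, data.reader = data.reader)
    d2 = RWeka::read.arff(path)
    # compare the text column row by row; the printed index shows where it fails
    for (i in seq_len(nrow(d1))) {
      cat(i, fill = TRUE)
      expect_equal(d1$text[i], d2$text[i])
    }
    expect_equal(d1$text[7], d2$text[7])
    d1$text[7]
    d2$text[7]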

The first string mismatch happens in row 7 of this data set and concerns the character ¤, which RWeka represents as ¤. I experimented with the iconv function to convert the character to UTF-8, but it did not work. Does this work on other operating systems?

> d1$text[7]
[1] "Black Sheep Wall A&M, October 1989 cover 1. Black Sheep Wall (4:20) 2. [1]Broken Circle (Acoustic) (3:21) 3. [2]Notebook (Acoustic) (4:39) Known Formats UK (AM563) 7\" (1,2) UK (AMX563) 10\" (1,2,3) UK (AMCD563) CD (1,2,3) US (CD17875) CD (1) US (SP17801) 12\" (1,2,3) AU (?) 7\" (1,2) _________________________________________________________________ This is how I love you: I wish for a shade I can pull I feel so afraid of watching you grow up This love hurts to much And I try and build a wall So I don't have to see you fall And I pray Go away from my thoughts! Why do you keep coming back Over Black Sheep Wall? Oh, I'd love to hold you close But I play it cool And keep my thoughts in a jar Marked \"dangerous\" And everyone says, \"Never fear - All boys his age experiment with their lives\" But my eyes want to close you out I'll close you out Why do you keep coming back Over Black Sheep Wall? Brother Black Sheep, love is strong There's a shepherd out in every storm And he's not afraid of a little rain Why am I? Why do I keep building up This Black Sheep Wall? Oh, I love you so! Do you really know how much How deep? Black Sheep This is how I love you: With closed eyes With turned back With distance _________________________________________________________________ [3]\"Innocence Mission\" ¤ [4]Discography ¤ [5]Innocence Mission ¤ [6]Tony ¤ [7]NIWEB ¤ ¤ [8]comment References 1. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/circle.html 2. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/notebook.html 3. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/innmiss.html 4. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/discog.html 5. file://localhost/tony/IM 6. file://localhost/tony/ 7. file://localhost/ 8. file://localhost/tony/comment.html"

> d2$text[7]
[1] "Black Sheep Wall A&M, October 1989 cover 1. Black Sheep Wall (4:20) 2. [1]Broken Circle (Acoustic) (3:21) 3. [2]Notebook (Acoustic) (4:39) Known Formats UK (AM563) 7\" (1,2) UK (AMX563) 10\" (1,2,3) UK (AMCD563) CD (1,2,3) US (CD17875) CD (1) US (SP17801) 12\" (1,2,3) AU (?) 7\" (1,2) _________________________________________________________________ This is how I love you: I wish for a shade I can pull I feel so afraid of watching you grow up This love hurts to much And I try and build a wall So I don't have to see you fall And I pray Go away from my thoughts! Why do you keep coming back Over Black Sheep Wall? Oh, I'd love to hold you close But I play it cool And keep my thoughts in a jar Marked \"dangerous\" And everyone says, \"Never fear - All boys his age experiment with their lives\" But my eyes want to close you out I'll close you out Why do you keep coming back Over Black Sheep Wall? Brother Black Sheep, love is strong There's a shepherd out in every storm And he's not afraid of a little rain Why am I? Why do I keep building up This Black Sheep Wall? Oh, I love you so! Do you really know how much How deep? Black Sheep This is how I love you: With closed eyes With turned back With distance _________________________________________________________________ [3]\"Innocence Mission\" ¤ [4]Discography ¤ [5]Innocence Mission ¤ [6]Tony ¤ [7]NIWEB ¤ ¤ [8]comment References 1. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/circle.html 2. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/notebook.html 3. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/innmiss.html 4. file://localhost/research/ml/datasets/uci/raw/output/ucikdd/discog.html 5. file://localhost/tony/IM 6. file://localhost/tony/ 7. file://localhost/ 8. file://localhost/tony/comment.html"

does not work
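
To see whether two strings that print alike really differ in encoding, it can help to compare their declared encodings and raw bytes. The following is only a minimal sketch in base R: `a` and `b` are hypothetical stand-ins for `d1$text[7]` and `d2$text[7]`, and the assumption that one side is UTF-8 and the other Latin-1 is not confirmed by this thread.

```r
# Hypothetical stand-ins for d1$text[7] and d2$text[7] (not the real data).
a <- "\u00a4"                                  # "¤" (U+00A4) stored as UTF-8
b <- iconv(a, from = "UTF-8", to = "latin1")   # the same character stored as Latin-1

# Declared encoding of each string
enc_a <- Encoding(a)
enc_b <- Encoding(b)

# Raw bytes: UTF-8 needs two bytes for U+00A4, Latin-1 needs one
bytes_a <- charToRaw(a)
bytes_b <- charToRaw(b)

# After normalising both sides to UTF-8, the bytes agree again
same_after_utf8 <- identical(charToRaw(enc2utf8(a)), charToRaw(enc2utf8(b)))

print(list(enc_a = enc_a, enc_b = enc_b,
           bytes_a = bytes_a, bytes_b = bytes_b,
           same_after_utf8 = same_after_utf8))
```

Running the same checks on the two real strings would show whether the mismatch is a difference in bytes or only in the declared encoding.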

@jakobbossek
Contributor

No problems on OS X with this code.

@berndbischl
Member

Isn't this basically a readr issue?
What happens if you parse a file with a similar example directly with readr?

@berndbischl
Member

@giuseppec
Can you try setting the locale parameter in read_delim to specify an encoding on your Windows system?
Does this help?
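
A minimal sketch of that suggestion, outside of farff: write a small delimited file whose text column holds ¤ as a single Latin-1 byte, then read it with readr while declaring the encoding through `locale()`. The file contents, the column names, and the choice of "latin1" are assumptions made for illustration; the actual encoding of the cached ARFF file is not established in this thread.

```r
library(readr)

# Build a tiny test file: header "id,text", one row whose text is the
# single byte 0xA4, which is "¤" in Latin-1 (hypothetical example data).
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,text\n1,"), as.raw(0xa4), charToRaw("\n")), tmp)

# readr assumes UTF-8 input by default; locale(encoding = ...) declares
# the encoding of the file so the text is converted to UTF-8 on read.
res <- read_delim(tmp, delim = ",",
                  locale = locale(encoding = "latin1"),
                  col_types = "ic")

print(res$text)
print(charToRaw(res$text))
```

If the real file turns out to use a different encoding, only the value passed to `locale(encoding = ...)` would need to change.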
