Control encoding when parsing SPSS file using spss.system.file() #1

Closed
00tau opened this issue Jun 10, 2015 · 2 comments

00tau commented Jun 10, 2015

Dear Martin,

I have been given an SPSS system file that I would like to analyse using R. I am using the following magic to parse the file into R.

library(memisc)
foo <- spss.system.file("foobar.sav")
bar <- subset(foo, select=c(var1,var2,var3))

Looking at the parsed data, I get the following:

> bar
Data set with 379 observations and 3 variables

var1       var2        var3
1      gut    weiblich      Herbst
2      gut mnlich      Sommer
3      gut mnlich      Sommer
4      gut mnlich      Winter
5      gut mnlich Frhling
6      gut mnlich Frhling
7      gut    weiblich Frhling
.
.
.
25      gut    weiblich Frhling
.. ........ ........... ...........
(27 of 379 observations shown)

I guess you get the idea. The collaborator has saved the sav-file in UTF-8 by adding the line SET UNICODE = ON. to his/her syntax file. My locale is set to UTF-8, too.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.04

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] graphics  grDevices datasets  utils     stats     methods   base     

other attached packages:
[1] foreign_0.8-63  memisc_0.97     MASS_7.3-40     lattice_0.20-29
[5] ggplot2_1.0.1   reshape2_1.4.1  plyr_1.8.2     

I am using the uxterm terminal emulator for running R, so everything on my side is in UTF-8. I strongly suspect that memisc uses a latin1 encoding by default when parsing the SPSS sav-file. Is this correct? Is it possible to change this encoding when parsing?

Thank you very much!

PS. Why does it say 27 of 379 observations shown, when in fact only 25 of them are shown?

@melff melff self-assigned this Jun 10, 2015
Owner

melff commented Jun 10, 2015

Dear Thomas,

spss.system.file() reads strings contained in SPSS files as-is, without any translation. The resulting strings therefore do not carry any encoding information. My guess is that this is why you see that strange output. So far, I have not encountered a problem like the one you describe, since most SPSS files I have come across were in latin1. However, memisc now has a function Iconv() for explicit translation of SPSS data files. Does

library(memisc)
foo <- spss.system.file("foobar.sav")
foo <- Iconv(foo,from="UTF-8",to="UTF-8")

or

foo <- Iconv(foo,from="ASCII",to="UTF-8")

work for you?

Re your PS. That is a bug, fixed in the current GitHub release (0.98).
Best,
Martin
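
A minimal base R sketch of the mechanism described in this reply, assuming a UTF-8 session like the one in the report. It does not use memisc: the byte vector below is a hand-built stand-in for a latin1 string that was read as-is from a file, and it relies only on base R's `rawToChar()`, `Encoding()`, `iconv()` and `nchar()`.

```r
# "männlich" as latin1 bytes: 0xE4 is "ä" in latin1, but a lone 0xE4
# is not a valid UTF-8 sequence, so a UTF-8 terminal cannot display it.
latin1_bytes <- as.raw(c(0x6d, 0xe4, 0x6e, 0x6e, 0x6c, 0x69, 0x63, 0x68))
x <- rawToChar(latin1_bytes)

Encoding(x)               # "unknown": the string carries no encoding mark
nchar(x, type = "bytes")  # 8

# Explicit translation: state the source encoding and convert to UTF-8.
y <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(y)               # "UTF-8"
nchar(y, type = "bytes")  # 9, because "ä" takes two bytes in UTF-8
```

The point of the sketch is that the conversion only works when `from` names the encoding the bytes were actually written in, whatever the file was believed to use.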

Author

00tau commented Jun 11, 2015

Dear Martin,

Thank you for your help and the quick response. The sav-file seems to have been saved in a Latin1 encoding, since the following did indeed work!

> library(memisc)
> foo <- spss.system.file("foobar.sav")
> foo <- Iconv(foo,from="Latin1",to="UTF-8")
> foo <- as.data.frame(as.data.set(foo))
> head(foo$Geschlecht)
[1] weiblich männlich männlich männlich männlich männlich
Levels: männlich weiblich 

All the best,
Thomas
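
Since the file in this thread was believed to be UTF-8 but turned out to be Latin1, it can help to check the bytes before choosing a `from` encoding. The following is a rough sketch only: `to_utf8` is a hypothetical helper, not part of memisc, the latin1 fallback is an assumption taken from this thread, and `validUTF8()` needs R 3.3.0 or later (newer than the R 3.2.0 shown in the report).

```r
# Hypothetical helper: return a character vector as UTF-8.
# Elements that are already valid UTF-8 are kept byte-for-byte;
# all others are assumed to be in the `fallback` encoding.
to_utf8 <- function(x, fallback = "latin1") {
  bad <- !validUTF8(x)
  x[bad] <- iconv(x[bad], from = fallback, to = "UTF-8")
  Encoding(x) <- "UTF-8"  # declare the mark; pure ASCII strings stay unmarked
  x
}

# Stand-ins for strings read as-is from two differently encoded files.
latin1_str <- rawToChar(as.raw(c(0x6d, 0xe4, 0x6e, 0x6e, 0x6c, 0x69, 0x63, 0x68)))
utf8_str   <- rawToChar(as.raw(c(0x6d, 0xc3, 0xa4, 0x6e, 0x6e, 0x6c, 0x69, 0x63, 0x68)))

res <- to_utf8(c(latin1_str, utf8_str, "weiblich"))
all(validUTF8(res))  # TRUE
```

This is a heuristic: any byte sequence can be read as latin1, so the fallback never fails outright, and text in another single-byte encoding would be converted silently but incorrectly.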
