## Test MICE


In [1]:
library(mice)
library(lattice)

### First, a small test with a full sub-matrix from which 5 random values were taken out
5 points are missing out of 703 x 3. This is supposed to be easy.

#### The Matrix
The values in this table are just enumerations of the WALS values. 
If 81A has seven possible values, for example (1 SOV, 2 SVO, 3 VSO, etc.), then its column in the table will contain values between 1 and 7. The enumeration follows WALS's numbering, so 1 stands for SOV, 2 for SVO, 3 for VSO, etc.
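
Because the codes are only labels, it can help to look at them as an R factor rather than as numbers. A minimal sketch: `codes` is a made-up vector, and only the first three labels come from the text above; labels 4 to 7 are my reading of the WALS numbering and should be checked against WALS.

```r
# Hypothetical 81A codes for a handful of languages (not real data).
codes <- c(1, 2, 7, 2, 1, 3)

# Labels in WALS order; entries 4-7 are an assumption to be checked against WALS.
labels_81A <- c("SOV", "SVO", "VSO", "VOS", "OVS", "OSV", "No dominant order")

# A factor keeps the values as unordered categories instead of numbers.
f81A <- factor(codes, levels = 1:7, labels = labels_81A)

print(table(f81A))       # counts per category, including the empty ones
print(is.ordered(f81A))  # whether R treats the categories as ranked
```

Nothing about the code 7 is "larger" than the code 1 here, which is worth keeping in mind when the codes are later compared numerically.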

In [25]:
# read this table
minitest <- read.csv('../miscsv/703-5-emptied.csv')

# mice chooses its imputation method from the column type, so turn the features into factors
for(c in c('X81A','X90A','X143A')) {
    minitest[[c]] <- as.factor(minitest[[c]])
}
test1 = mice(minitest[,c('X81A','X90A','X143A')])


 iter imp variable
  1   1  X81A  X90A  X143A
  1   2  X81A  X90A  X143A
  1   3  X81A  X90A  X143A
  1   4  X81A  X90A  X143A
  1   5  X81A  X90A  X143A
  2   1  X81A  X90A  X143A
  2   2  X81A  X90A  X143A
  2   3  X81A  X90A  X143A
  2   4  X81A  X90A  X143A
  2   5  X81A  X90A  X143A
  3   1  X81A  X90A  X143A
  3   2  X81A  X90A  X143A
  3   3  X81A  X90A  X143A
  3   4  X81A  X90A  X143A
  3   5  X81A  X90A  X143A
  4   1  X81A  X90A  X143A
  4   2  X81A  X90A  X143A
  4   3  X81A  X90A  X143A
  4   4  X81A  X90A  X143A
  4   5  X81A  X90A  X143A
  5   1  X81A  X90A  X143A
  5   2  X81A  X90A  X143A
  5   3  X81A  X90A  X143A
  5   4  X81A  X90A  X143A
  5   5  X81A  X90A  X143A


### Summary:

In [26]:
test1

Multiply imputed data set
Call:
mice(data = minitest[, c("X81A", "X90A", "X143A")])
Number of multiple imputations:  5
Missing cells per column:
 X81A  X90A X143A 
    2     2     1 
Imputation methods:
     X81A      X90A     X143A 
"polyreg" "polyreg" "polyreg" 
VisitSequence:
 X81A  X90A X143A 
    1     2     3 
PredictorMatrix:
      X81A X90A X143A
X81A     0    1     1
X90A     1    0     1
X143A    1    1     0
Random generator seed value:  NA 

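We can see that the five missing values were two from 81A, two from 90A and one from 143A. The method was polyreg, which in mice is polytomous (multinomial) logistic regression, the default for unordered factors with more than two levels. As far as I understand it, the visit sequence is the order in which the columns are imputed within each iteration, and the predictor matrix says which columns are used to predict which: a 1 in row X81A, column X90A means X90A is used as a predictor for X81A. Dan can probably explain much better than I.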
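
To make the PredictorMatrix easier to read, here is a base-R sketch that rebuilds the matrix printed in the summary and spells out each row. It assumes the usual mice convention (rows are the variables being imputed, columns are the predictors); `vars` and `pred` are names chosen for this illustration.

```r
vars <- c("X81A", "X90A", "X143A")

# Same layout as the printed PredictorMatrix: all ones, zeros on the diagonal.
pred <- matrix(1, nrow = 3, ncol = 3, dimnames = list(vars, vars))
diag(pred) <- 0

# Read each row as: "this variable is imputed from these columns".
for (target in vars) {
  predictors <- colnames(pred)[pred[target, ] == 1]
  cat(target, "is imputed from:", paste(predictors, collapse = ", "), "\n")
}
```

With only three columns, every feature is simply predicted from the other two.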

But the important part is the actual values filled in for the missing ones. The default is five imputations (m = 5). I treat the fifth one as the final answer below, although as far as I can tell from the mice documentation the five are separate draws rather than successive refinements, so the last one is not necessarily better than the others. So we can see that for 81A the rows (languages) with missing values were 68 and 149, and the final answer was 1 in both.

For 90A it was 327 and 392, with 4 and 2 filled in. And for 143A it was row 48, with 4 filled in.

In [27]:
test1$imp

$X81A
    1 2 3 4 5
68  7 1 7 1 1
149 1 7 1 1 1

$X90A
    1 2 3 4 5
327 2 1 2 6 4
392 7 2 7 4 2

$X143A
   1 2 3 4 5
48 1 2 3 1 4


In [5]:
# oops, I need the original values... no worries, the X column stores the original row indices
minitest[c(68,149,327,392,48),'X']

Looking up rows 265, 591, 1250, 1449 and 194 in WALS under the relevant features, we find that the original values were:
 - 81A:  
     - row 265 (corresponds to minitest 68): 7 -- matches the first and third guesses
     - row 591 (corresponds to minitest 149): 2 -- matches none
  
 - 90A:
     - row 1250: 2 -- matches first and third
     - row 1449: 2 -- matches second and last
 - 143A :  
     - row 194: 4  -- matches the last guess
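
The comparison in the list above can also be done mechanically. A small base-R sketch, with the imputed values copied from the `test1$imp` output and the originals from the list; `imp`, `original` and `hits` are names chosen for this illustration.

```r
# Imputed values from test1$imp: rows are minitest rows, columns are imputations 1..5.
imp <- rbind(
  "68"  = c(7, 1, 7, 1, 1),
  "149" = c(1, 7, 1, 1, 1),
  "327" = c(2, 1, 2, 6, 4),
  "392" = c(7, 2, 7, 4, 2),
  "48"  = c(1, 2, 3, 1, 4)
)
colnames(imp) <- 1:5

# Original WALS values, in the same row order.
original <- c("68" = 7, "149" = 2, "327" = 2, "392" = 2, "48" = 4)

# TRUE where an imputation reproduces the original (the vector is recycled down each column).
hits <- imp == original

print(hits)
print(colSums(hits))  # correct cells per imputation
print(rowSums(hits))  # how many of the five imputations got each cell right
```

Counted this way, no single imputation gets more than two of the five cells right.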


So even in this mini, easy test R MICE doesn't seem to perform so great: the last guess is right in only two of the five cells. It could be that I used it wrong, but it seems to me that, again, this is something made for numeric rather than categorical values.


### Bear in mind that there's a probabilistic element here. Every time you run it you get different results.  
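
If repeatable runs are wanted, `mice()` takes a `seed` argument (the summary above shows it as NA, i.e. unset). A minimal sketch, assuming the `nhanes2` example data that ships with the mice package rather than the WALS table; the seed value itself is arbitrary.

```r
library(mice)

# Two runs with the same seed on the bundled nhanes2 example data.
imp_a <- mice(nhanes2, m = 5, seed = 2024, printFlag = FALSE)
imp_b <- mice(nhanes2, m = 5, seed = 2024, printFlag = FALSE)

# With a fixed seed the imputed values should come out the same.
print(identical(imp_a$imp, imp_b$imp))
```

Fixing the seed only makes a run repeatable; it does not make the guesses any better.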

Anyway, having seen the minitest in detail, we can put it to a more serious test:


## Same original table, now 50 values missing

In [6]:
dat50 <- read.csv('../miscsv/703-removed-50.csv')
dat50$X <- NULL
for(c in c('X81A','X90A','X143A')) {
    dat50[[c]] <- as.factor(dat50[[c]])
}
mdat50 = mice(dat50)


 iter imp variable
  1   1  X81A  X90A  X143A
  1   2  X81A  X90A  X143A
  1   3  X81A  X90A  X143A
  1   4  X81A  X90A  X143A
  1   5  X81A  X90A  X143A
  2   1  X81A  X90A  X143A
  2   2  X81A  X90A  X143A
  2   3  X81A  X90A  X143A
  2   4  X81A  X90A  X143A
  2   5  X81A  X90A  X143A
  3   1  X81A  X90A  X143A
  3   2  X81A  X90A  X143A
  3   3  X81A  X90A  X143A
  3   4  X81A  X90A  X143A
  3   5  X81A  X90A  X143A
  4   1  X81A  X90A  X143A
  4   2  X81A  X90A  X143A
  4   3  X81A  X90A  X143A
  4   4  X81A  X90A  X143A
  4   5  X81A  X90A  X143A
  5   1  X81A  X90A  X143A
  5   2  X81A  X90A  X143A
  5   3  X81A  X90A  X143A
  5   4  X81A  X90A  X143A
  5   5  X81A  X90A  X143A


In [7]:
# we'll get the originals more methodically this time
orig = read.csv('../miscsv/removed50-original.csv')
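
For reference, a test set like this can be built by blanking random cells and writing down what was there. The sketch below is a guess at the procedure, not the script that produced the CSV files: `full` is a made-up table, and the `row` column is an assumption (only `feature` and `original_value` are used above).

```r
set.seed(1)

# Made-up complete table standing in for the 703 x 3 WALS sub-matrix.
full <- data.frame(
  X81A  = sample(1:7, 20, replace = TRUE),
  X90A  = sample(1:7, 20, replace = TRUE),
  X143A = sample(1:5, 20, replace = TRUE)
)

n_remove <- 6
m <- as.matrix(full)

# Pick distinct cells by linear index, then convert to (row, column) pairs.
cells <- sample(length(m), n_remove)
rows  <- (cells - 1) %% nrow(m) + 1
cols  <- (cells - 1) %/% nrow(m) + 1

# Record the originals before blanking them.
removed <- data.frame(
  row            = rows,
  feature        = sub("^X", "", colnames(m)[cols]),
  original_value = m[cbind(rows, cols)]
)

m[cbind(rows, cols)] <- NA
masked <- as.data.frame(m)

print(removed)
```

Keeping the row index alongside each original value makes it possible to line the originals up with the imputed cells explicitly instead of relying on file order.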

In [8]:
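# to test the output, create two vectors -- one with the original values,
# the other with MICE's fifth imputation (I assumed it's the best one, but the
# five imputations may just be independent draws)
origs <- c()
imputed <- c()
for(f in c('81A','90A','143A')) {
    origs <- c(origs, orig[(orig$feature==f),'original_value'])
    # go through as.character: if the imputed column is a factor,
    # as.numeric() alone returns level indices, not the WALS codes
    imputed <- c(imputed,as.numeric(as.character(mdat50$imp[[paste('X',f,sep="")]][,5])))
}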

List of the 50 original values

In [9]:
origs

In [10]:
imputed

### Now, how good are the guesses? We can try Hamming distance and correlation:

In [11]:
# Hamming distance: number of positions where the guess differs from the original
sum(origs != imputed)

In [12]:
# correlation (note: this treats the category codes as numbers)
cor(origs,imputed)

Doesn't seem too brilliant.
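
One caveat about the two measures: Hamming distance only asks whether the categories match, while correlation depends on which number each category happens to be given. A base-R sketch with made-up vectors (`orig_toy`, `guess_toy` and the relabelling are all hypothetical):

```r
# Made-up original and guessed category codes.
orig_toy  <- c(1, 2, 7, 2, 1)
guess_toy <- c(1, 7, 7, 2, 2)

# Rename the categories: 1 -> 3, 2 -> 1, 7 -> 2. The data are the same, only the codes change.
relabel <- c("1" = 3, "2" = 1, "7" = 2)
orig_re  <- unname(relabel[as.character(orig_toy)])
guess_re <- unname(relabel[as.character(guess_toy)])

hamming    <- sum(orig_toy != guess_toy)
hamming_re <- sum(orig_re != guess_re)

cor_before <- cor(orig_toy, guess_toy)
cor_after  <- cor(orig_re, guess_re)

print(c(hamming = hamming, hamming_relabelled = hamming_re))
print(c(cor = cor_before, cor_relabelled = cor_after))
```

The Hamming distance is the same before and after relabelling, while the correlation is not, so for these nominal features the mismatch count (or `mean(origs == imputed)` as an accuracy) is probably the more meaningful number.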

### We can also try other imputation methods

In [13]:
mdat502 = mice(dat50,method=c("lda","lda","lda"))


 iter imp variable
  1   1  X81A  X90A  X143A
  1   2  X81A  X90A  X143A
  1   3  X81A  X90A  X143A
  1   4  X81A  X90A  X143A
  1   5  X81A  X90A  X143A
  2   1  X81A  X90A  X143A
  2   2  X81A  X90A  X143A
  2   3  X81A  X90A  X143A
  2   4  X81A  X90A  X143A
  2   5  X81A  X90A  X143A
  3   1  X81A  X90A  X143A
  3   2  X81A  X90A  X143A
  3   3  X81A  X90A  X143A
  3   4  X81A  X90A  X143A
  3   5  X81A  X90A  X143A
  4   1  X81A  X90A  X143A
  4   2  X81A  X90A  X143A
  4   3  X81A  X90A  X143A
  4   4  X81A  X90A  X143A
  4   5  X81A  X90A  X143A
  5   1  X81A  X90A  X143A
  5   2  X81A  X90A  X143A
  5   3  X81A  X90A  X143A
  5   4  X81A  X90A  X143A
  5   5  X81A  X90A  X143A


In [14]:
imputed2 <- c()
for(f in c('X81A','X90A','X143A')) {
    # as.character first, so factor labels (not level indices) are converted
    imputed2 <- c(imputed2,as.numeric(as.character(mdat502$imp[[f]][,5])))
}
imputed2

In [15]:
cor(origs,imputed2)

In [16]:
sum(origs != imputed2)

#### How different are guesses between different methods?

In [17]:
cor(imputed2,imputed)

In [18]:
sum(imputed2 != imputed)

#### A third method

In [19]:
mdat503 = mice(dat50,method=c("polr","polr","polr"))


 iter imp variable
  1   1  X81A  X90A  X143A
  1   2  X81A  X90A  X143A
  1   3  X81A  X90A  X143A
  1   4  X81A  X90A  X143A
  1   5  X81A  X90A  X143A
  2   1  X81A  X90A  X143A
  2   2  X81A  X90A  X143A
  2   3  X81A  X90A  X143A
  2   4  X81A  X90A  X143A
  2   5  X81A  X90A  X143A
  3   1  X81A  X90A  X143A
  3   2  X81A  X90A  X143A
  3   3  X81A  X90A  X143A
  3   4  X81A  X90A  X143A
  3   5  X81A  X90A  X143A
  4   1  X81A  X90A  X143A
  4   2  X81A  X90A  X143A
  4   3  X81A  X90A  X143A
  4   4  X81A  X90A  X143A
  4   5  X81A  X90A  X143A
  5   1  X81A  X90A  X143A
  5   2  X81A  X90A  X143A
  5   3  X81A  X90A  X143A
  5   4  X81A  X90A  X143A
  5   5  X81A  X90A  X143A


In [20]:
imputed3 <- c()
for(f in c('X81A','X90A','X143A')) {
    # as.character first, so factor labels (not level indices) are converted
    imputed3 <- c(imputed3,as.numeric(as.character(mdat503$imp[[f]][,5])))
}
imputed3

In [21]:
cor(origs,imputed3)

In [22]:
cor(imputed3,imputed)

In [23]:
cor(imputed2,imputed3)

In [24]:
sum(origs != imputed3)
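
The pairwise comparisons above can be collected into one table. A base-R sketch with short made-up vectors standing in for `origs`, `imputed`, `imputed2` and `imputed3` (the values are hypothetical, not the notebook's results):

```r
# Hypothetical stand-ins for the original values and the three methods' guesses.
guesses <- list(
  orig    = c(1, 2, 7, 2, 4),
  polyreg = c(1, 1, 4, 2, 4),
  lda     = c(7, 1, 7, 2, 4),
  polr    = c(1, 2, 2, 2, 1)
)

n <- length(guesses)
dist <- matrix(0L, n, n, dimnames = list(names(guesses), names(guesses)))

# Hamming distance between every pair of vectors.
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist[i, j] <- sum(guesses[[i]] != guesses[[j]])
  }
}

print(dist)
```

The first row shows how far each method is from the originals; the rest shows how much the methods disagree with each other.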

And that's all I did with the completion idea. I really don't think it is of much use here: these algorithms seem to rely on scalar (numeric) values, and the WALS features are categorical.