In [1]:
library(randomForest)
library(caret)
dat=read.table("features.gz", sep=";")
names(dat)=c("state","fruit","path", paste("f", 1:2048, sep=""))


randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.

Loading required package: ggplot2


Attaching package: 'ggplot2'


The following object is masked from 'package:randomForest':

    margin


Loading required package: lattice



In [None]:
table(as.character(dat$state)) 
      
-sort(-table(as.character(dat$fruit)))[1:120]


In [None]:
# Even though we have over 1K instances of the least common fruit,
# let's create a data frame restricted to the 20 most common fruits,
# because some of the ML methods take much longer on the full data:
top20F = names(sort(-table(as.character(dat$fruit)))[1:20])
datf = dat[dat$fruit %in% top20F,]
fruit=as.factor(as.character(datf$fruit))

# To do the training we need to separate the data into
# training/validation sets. createDataPartition makes this
# easy by ensuring that all the classes are
# in both parts; we select 70% for training here:
grpf = createDataPartition(fruit,p=.7,list=FALSE)
trnf = grpf
valf = -grpf
# We could use all 2048 features for the model,
rng=1:2048 # but some of the models take forever with all of them, so try fewer, e.g., 20:
rng=1:20
trainf <- datf[trnf,]
validatef  <- datf[valf,]
trainfX <-trainf[,-c(1:3)][,rng] 
trainfY <- fruit[trnf];
valfX <-validatef[,-c(1:3)][,rng] 
valfY <- fruit[valf];
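
What the stratified split above does can be sketched in base R. This is a rough approximation, not caret's implementation: `stratified_idx` is a hypothetical helper and the toy labels are made up. It samples about 70% of the row indices within each class, so even a rare class ends up in both parts.

```r
# Hypothetical helper (not part of caret): sample a fraction p of the
# row indices separately within each class of y.
stratified_idx <- function(y, p = 0.7) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(i) {
    i[sample.int(length(i), ceiling(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)  # the notebook sets no seed, so its own splits differ between runs
# Toy labels (an assumption): three classes of very different sizes
y <- factor(rep(c("apple", "banana", "cherry"), times = c(100, 50, 10)))
trn <- stratified_idx(y, p = 0.7)

table(y[trn])   # class counts in the training part
table(y[-trn])  # class counts in the validation part
```

A plain `sample()` over all rows could, by chance, leave a rare class out of one part; sampling per class avoids that.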

In [None]:
library(e1071)
svmMod = svm(trainfY~., data = trainfX)
prfS.v = predict(svmMod,valfX)
#and check how frequently the predicted class matched the actual class:

table(prfS.v==valfY)
#not great! Only about 52% of the predictions (4088 of 7833) are correct:
#FALSE  TRUE 
# 3745  4088 


In [None]:
#Now we can fit the model (Random Forest) and predict on
# the validation set:
rff=randomForest(trainfX,y=trainfY)
prff.v = predict(rff,valfX)
#and check how frequently the predicted class matched the actual class:
table(prff.v==valfY) 

#Not so great (about 64% correct), but we are using only 20 of the 2048 features!
#FALSE  TRUE 
# 2787  5046 
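
The two match tables above can be turned into accuracies directly. The counts below are copied from the printed output; `accuracy` is a helper name introduced here for illustration only.

```r
# Counts of wrong/right predictions copied from the two tables above
svm_counts <- c(wrong = 3745, right = 4088)
rf_counts  <- c(wrong = 2787, right = 5046)

# Share of validation rows whose predicted fruit matches the actual fruit
accuracy <- function(counts) unname(counts["right"] / sum(counts))

acc <- c(svm = accuracy(svm_counts), rf = accuracy(rf_counts))
round(acc, 3)
```

Both models are scored on the same 7833 validation rows, so the two accuracies are directly comparable.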


In [3]:
#Let's try to tell whether fruit is fresh or not, e.g., quality control for groceries:

state= as.factor(as.character(dat$state)=='fresh')
grps = createDataPartition(state,p=.7,list=FALSE)
trns = grps
vals = -grps
#we could use all 2048 features for the model, but to keep fitting fast we again use the first 20
rng=1:2048
rng=1:20
trains <- dat[trns,]
validates  <- dat[vals,]
trainsX <-trains[,-c(1:3)][,rng] 
trainsY <- state[trns];
valsX <-validates[,-c(1:3)][,rng] 
valsY <- state[vals];

#Now we can fit the model (Random Forest) and predict on the validation set. As the number of predictors increases, RF becomes much slower, so we use just the first 20 of the 2048 predictors here:

rfs=randomForest(trainsX[,],y=trainsY)
prfs.v = predict(rfs,valsX)
table(prfs.v,valsY)

       valsY
prfs.v  FALSE TRUE
  FALSE  9981 4149
  TRUE   4033 9939

Roughly 29% of not-fresh fruit is classified as fresh, and roughly 29% of fresh fruit as not fresh!

That means the classifier would allow plenty of bad fruit on the shelves even after throwing out
a lot of fresh fruit.
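
The per-class error rates quoted above follow from the printed table. A minimal sketch, with the counts retyped by hand from the output (rows are predictions, columns are actual labels):

```r
# Confusion matrix copied from the output above; matrix() fills column by column
cm <- matrix(c(9981, 4033,   # actual FALSE: predicted FALSE, predicted TRUE
               4149, 9939),  # actual TRUE:  predicted FALSE, predicted TRUE
             nrow = 2,
             dimnames = list(predicted = c("FALSE", "TRUE"),
                             actual    = c("FALSE", "TRUE")))

# Normalise each column so that it sums to 1: share of each actual class
# that received each prediction
rates <- prop.table(cm, margin = 2)
round(rates, 3)

rates["TRUE", "FALSE"]  # not-fresh fruit passed as fresh
rates["FALSE", "TRUE"]  # fresh fruit rejected as not fresh
```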

Let's try to reduce the dimensionality of the 2048 predictors so that we can get results a bit faster and with better accuracy:

In [4]:
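# Check how much of the variance the first 200 principal components
# (about 10% of the 2048 predictors) explain. Variance is sdev^2,
# so square the standard deviations before summing.
pca <- prcomp(dat[,-c(1:3)], center = TRUE, scale. = TRUE)
sum(pca$sdev[1:200]^2)/sum(pca$sdev^2)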
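
The share of variance explained by the leading components can be checked on a small example. The toy data below is an assumption made up for illustration (10 correlated columns built from 3 latent factors), not the fruit features. Because `prcomp` returns standard deviations, the shares are computed from `sdev^2`; a ratio of unsquared standard deviations understates the share of the leading components.

```r
set.seed(42)
# Toy data (an assumption): 200 rows, 10 columns driven by 3 latent factors plus noise
latent   <- matrix(rnorm(200 * 3), ncol = 3)
loadings <- matrix(rnorm(3 * 10), nrow = 3)
toy <- latent %*% loadings + matrix(rnorm(200 * 10, sd = 0.3), ncol = 10)

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variance of each component is the square of its standard deviation
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_share <- cumsum(var_share)
round(cum_share, 3)

# Smallest number of components whose cumulative share reaches 90%
n_keep <- which(cum_share >= 0.9)[1]
n_keep

# For comparison: the same ratio on unsquared sdev, which is never larger
sd_share <- cumsum(toy_pca$sdev) / sum(toy_pca$sdev)
```

The same cumulative shares are reported in the "Cumulative Proportion" row of `summary(toy_pca)$importance`.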


In [5]:
#First 200 directions represent over 41% of the variance.

trainsRX = pca$x[trns,1:200]
valsRX <-pca$x[vals,1:200] 
rfsR=randomForest(trainsRX,y=trainsY)
prfsR.v = predict(rfsR,valsRX)
table(prfsR.v,valsY)



       valsY
prfsR.v FALSE  TRUE
  FALSE 12254  2504
  TRUE   1760 11584

Much better, but still not great: almost 13% of non-fresh fruit is classified as fresh, and almost 18% of fresh fruit is rejected. Possible next steps:
 
    - Use all predictors, at the cost of a longer fitting time (days?)
    - Collect more data
    - Clean the existing data (for example, focus on fruit with large samples, since this model uses all fruit types)

For comparison, let's try SVM (another commonly used classification technique)


In [6]:
# takes an extremely long time; similar results
library(e1071)
svmMod = svm(trainsY~., data = trainsRX)
prfsS.v = predict(svmMod,valsRX)
table(prfsS.v,valsY)


       valsY
prfsS.v FALSE  TRUE
  FALSE 11894  1707
  TRUE   2120 12381
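
To compare the three fresh/not-fresh models side by side, the printed confusion matrices can be summarised with one helper. The counts are copied from the outputs above; `mk`, `summarise_cm` and the model labels are names introduced here for illustration.

```r
# Build a 2x2 confusion matrix (rows = predicted, columns = actual),
# given the counts column by column
mk <- function(v) {
  matrix(v, nrow = 2,
         dimnames = list(predicted = c("FALSE", "TRUE"),
                         actual    = c("FALSE", "TRUE")))
}

models <- list(
  rf_20_features = mk(c(9981, 4033, 4149, 9939)),
  rf_200_pcs     = mk(c(12254, 1760, 2504, 11584)),
  svm_200_pcs    = mk(c(11894, 2120, 1707, 12381))
)

summarise_cm <- function(cm) {
  c(accuracy              = sum(diag(cm)) / sum(cm),
    stale_passed_as_fresh = cm["TRUE", "FALSE"] / sum(cm[, "FALSE"]),
    fresh_rejected        = cm["FALSE", "TRUE"] / sum(cm[, "TRUE"]))
}

res <- t(sapply(models, summarise_cm))
round(res, 3)
```

On these counts the SVM has the highest overall accuracy, but it passes somewhat more not-fresh fruit as fresh than the random forest on the same 200 components, so which model is preferable depends on which error is more costly.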


Finally, let's try the oldest/simplest technique: k-Nearest Neighbor (kNN)


In [None]:
#Looks like it takes a long time
library(class)
trainsKX = dat[trns,-c(1:3)]
valsKX = dat[vals,-c(1:3)] 
trainsY=dat$state[trns]
valsY=dat$state[vals]
kMod = knn(trainsKX, valsKX, trainsY)
table(kMod,valsY)


Wow!
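
The `knn` call above uses the default `k = 1` on all 2048 unscaled features, and it compares every validation row against every training row, which is why it is slow on data of this size. A small sketch of the same call on the built-in `iris` data (an assumption chosen only because it ships with R) shows the usual pattern of scaling with the training statistics and choosing `k` explicitly:

```r
library(class)  # provides knn(); a recommended package in standard R installs

set.seed(7)
idx <- sample(nrow(iris), 100)  # 100 rows for training, the remaining 50 for validation

# Scale the features using the training means and standard deviations only
train_x <- scale(iris[idx, 1:4])
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# k = 5 neighbours vote on the class; the value 5 is an arbitrary choice here
pred <- knn(train_x, test_x, cl = iris$Species[idx], k = 5)

table(pred, actual = iris$Species[-idx])
knn_acc <- mean(pred == iris$Species[-idx])
knn_acc
```

Running kNN on a reduced set of columns, such as the 200 principal components used earlier, is one way to cut the distance computations; whether accuracy holds up would need to be checked on the fruit data.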