https://courses.edx.org/courses/course-v1:MITx+15.071x_3+1T2016/courseware/3372864201764d6d9f63931920e5152e/ab08d73980f046479d3bcd105a55b0c2/

One of the earliest applications of the predictive analytics methods we have studied so far in this class was to automatically recognize letters, which post office machines use to sort mail. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet -- A, B, P, and R -- to predict which letter a particular image corresponds to.

Note that this is a multiclass classification problem. We have mostly focused on binary classification problems (e.g., predicting whether an individual voted or not, whether the Supreme Court will affirm or reverse a case, whether or not a person is at risk for a certain disease, etc.). In this problem, each observation can belong to one of more than two classes, as in the D2Hawkeye lecture.

The file letters_ABPR.csv contains 3116 observations, each of which corresponds to an image of one of the four letters A, B, P and R. The images came from 20 different fonts, which were then randomly distorted to produce the final images; each distorted image is represented as a collection of pixels, each of which is "on" or "off". For each distorted image, we have certain statistics of the image in terms of these pixels, as well as the letter that the image shows. This data comes from the UCI Machine Learning Repository.

This dataset contains the following 17 variables:

    letter = the letter that the image corresponds to (A, B, P or R)
    xbox = the horizontal position of where the smallest box covering the letter shape begins.
    ybox = the vertical position of where the smallest box covering the letter shape begins.
    width = the width of this smallest box.
    height = the height of this smallest box.
    onpix = the total number of "on" pixels in the character image
    xbar = the mean horizontal position of all of the "on" pixels
    ybar = the mean vertical position of all of the "on" pixels
    x2bar = the mean squared horizontal position of all of the "on" pixels in the image
    y2bar = the mean squared vertical position of all of the "on" pixels in the image
    xybar = the mean of the product of the horizontal and vertical position of all of the "on" pixels in the image
    x2ybar = the mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels
    xy2bar = the mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels
    xedge = the mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image
    xedgeycor = the mean of the product of the number of horizontal edges at each vertical position and the vertical position
    yedge = the mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image
    yedgexcor = the mean of the product of the number of vertical edges at each horizontal position and the horizontal position
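
To make the pixel statistics above more concrete, here is a rough sketch that computes a few of them for a small made-up image. The 5x5 image is hypothetical and not taken from the data set, and the sketch uses raw pixel coordinates; the values in letters_ABPR.csv are scaled to small integers, and that scaling is not reproduced here.

```r
# Hypothetical 5x5 binary image (1 = "on", 0 = "off"); not from the data set
img <- matrix(c(0, 1, 1, 1, 0,
                1, 0, 0, 0, 1,
                1, 1, 1, 1, 1,
                1, 0, 0, 0, 1,
                1, 0, 0, 0, 1),
              nrow = 5, byrow = TRUE)

on <- which(img == 1, arr.ind = TRUE)  # one row per "on" pixel
y <- on[, 1]  # vertical position (matrix row index)
x <- on[, 2]  # horizontal position (matrix column index)

# Unscaled analogues of some of the variables described above
stats <- c(
  width  = diff(range(x)) + 1,  # width of the smallest covering box
  height = diff(range(y)) + 1,  # height of the smallest covering box
  onpix  = length(x),           # number of "on" pixels
  xbar   = mean(x),
  ybar   = mean(y),
  x2bar  = mean(x^2),
  y2bar  = mean(y^2),
  xybar  = mean(x * y),
  x2ybar = mean(x^2 * y),
  xy2bar = mean(x * y^2)
)
stats
```

The edge-based variables (xedge, yedge and their correlations) are left out because their exact counting convention is not spelled out in enough detail here.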


In [1]:
letters = read.csv("letters_ABPR.csv")

In [2]:
str(letters)

'data.frame':	3116 obs. of  17 variables:
 $ letter   : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
 $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
 $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
 $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
 $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
 $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
 $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
 $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
 $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
 $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
 $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
 $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
 $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
 $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
 $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
 $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
 $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...


In [4]:
summary(letters$letter)

In [5]:
levels(letters$letter)

In [6]:
letters$isB = as.factor(letters$letter == "B")

In [8]:
library(caTools)

In [9]:
set.seed(1000)

In [10]:
split = sample.split(letters$isB, SplitRatio = 0.5)

In [12]:
Train = subset(letters, split == TRUE)
nrow(Train)
Test = subset(letters, split == FALSE)
nrow(Test)
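
sample.split from caTools splits the data while preserving the relative frequency of each outcome class in both halves. The base R sketch below illustrates that idea on a small made-up outcome vector; it is not the caTools implementation, and the vector y and the 30/10 class sizes are assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical outcome with an imbalanced 30/10 class split
y <- factor(rep(c("FALSE", "TRUE"), times = c(30, 10)))

# Within each class, mark half of the observations for the training set
in_train <- logical(length(y))
for (lev in levels(y)) {
  idx <- which(y == lev)
  chosen <- idx[sample.int(length(idx), size = round(0.5 * length(idx)))]
  in_train[chosen] <- TRUE
}

# Both halves keep the same 3:1 class ratio
table(y, in_train)
```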

In [14]:
table(Train$isB)


FALSE  TRUE 
 1175   383 

In [15]:
1175/(1175+383)
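
The ratio above is the accuracy of a baseline that always predicts the most frequent class ("not B"). A minimal sketch of the same calculation, with the counts copied by hand from the table output rather than read from the data:

```r
# Class counts copied from the table(Train$isB) output
counts <- c("FALSE" = 1175, "TRUE" = 383)

# The baseline always predicts the most frequent class
baseline_class <- names(counts)[which.max(counts)]
baseline_accuracy <- max(counts) / sum(counts)

baseline_class
round(baseline_accuracy, 4)
```

The test-set confusion matrices in this notebook have the same row totals (1175 and 383), so the baseline accuracy on the test set is the same.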

In [16]:
library("rpart")
library("rpart.plot")

In [19]:
CARTb = rpart(isB ~ . - letter, data = Train, method = "class")

In [21]:
bPredTest = predict(CARTb, newdata = Test, type = "class")

In [22]:
t = table(Test$isB,bPredTest)

In [23]:
t

       bPredTest
        FALSE TRUE
  FALSE  1118   57
  TRUE     43  340

In [24]:
sum(diag(t))/sum(t)
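
Overall accuracy hides how the errors are distributed between the two classes. As a sketch, the confusion matrix printed above can be re-entered by hand to also obtain sensitivity and specificity; the name cm is chosen here for illustration.

```r
# CART confusion matrix for isB, copied from the output above
cm <- matrix(c(1118,  57,
                 43, 340),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)                    # correct / all
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])     # true B's found
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # non-B's kept out

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```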

In [25]:
library("randomForest")

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.


In [26]:
set.seed(1000)
bForest = randomForest(isB ~ . - letter, data = Train)

In [27]:
bForestPredTest = predict(bForest, newdata = Test)

In [28]:
t = table(Test$isB,bForestPredTest)

In [29]:
t

       bForestPredTest
        FALSE TRUE
  FALSE  1165   10
  TRUE      9  374

In [30]:
sum(diag(t))/sum(t)

Let us now move on to the problem that we were originally interested in, which is to predict which of the four letters A, B, P or R an image corresponds to.

As we saw in the D2Hawkeye lecture, building a multiclass classification CART model in R is no harder than building the models for binary classification problems. Fortunately, building a random forest model is just as easy.
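
The point about multiclass CART can be checked on a small built-in data set. The sketch below uses iris (three species) instead of the letters data so that it runs without any files; note that it reports accuracy on the training data, which is an optimistic measure and is used here only to keep the example short.

```r
library(rpart)  # distributed with R as a recommended package

# Species is a factor with three levels, so this is a multiclass problem;
# the call has the same shape as for a binary outcome
tree <- rpart(Species ~ ., data = iris, method = "class")

# type = "class" returns the predicted label for each observation
pred <- predict(tree, newdata = iris, type = "class")

cm_iris <- table(actual = iris$Species, predicted = pred)
cm_iris

train_acc <- sum(diag(cm_iris)) / sum(cm_iris)
train_acc
```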

The variable we will try to predict is "letter". Start by converting letter in the original data set (letters) to a factor by running the following command in R:

In [31]:
letters$letter = as.factor(letters$letter)

In [32]:
set.seed(2000)

In [33]:
split = sample.split(letters$letter, SplitRatio =  0.5)

In [34]:
Train1 = subset(letters, split == TRUE)
Test1 = subset(letters, split == FALSE)

In [36]:
table(Train1$letter)


  A   B   P   R 
394 383 402 379 

In [37]:
402/nrow(Train1)

In [38]:
CARTall = rpart(letter~.-isB, data = Train1, method = "class")

In [39]:
allPredTest = predict(CARTall, newdata = Test1, type = "class")

In [40]:
t = table(Test1$letter, allPredTest)
t

   allPredTest
      A   B   P   R
  A 348   4   0  43
  B   8 318  12  45
  P   2  21 363  15
  R  10  24   5 340

In [41]:
sum(diag(t))/sum(t)
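
With four classes, it helps to look at each letter separately. The sketch below re-enters the CART confusion matrix shown above by hand and computes per-letter recall (share of each true letter that was predicted correctly) and precision (share of each predicted letter that was correct); the variable names are chosen here for illustration.

```r
# Multiclass CART confusion matrix, copied from the output above
labs <- c("A", "B", "P", "R")
cm <- matrix(c(348,   4,   0,  43,
                 8, 318,  12,  45,
                 2,  21, 363,  15,
                10,  24,   5, 340),
             nrow = 4, byrow = TRUE,
             dimnames = list(actual = labs, predicted = labs))

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- diag(cm) / rowSums(cm)  # per true letter
precision <- diag(cm) / colSums(cm)  # per predicted letter

round(accuracy, 4)
round(rbind(recall, precision), 3)

# Test-set accuracy of a baseline that always predicts P,
# the most frequent letter in the training set
baseline_test <- unname(rowSums(cm)["P"] / sum(cm))
round(baseline_test, 4)
```

In this matrix, B has the lowest recall, and most of the errors involve the letter R, either as the true or the predicted class.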

In [42]:
set.seed(1000)
allForest = randomForest(letter~.-isB, data = Train1)

In [43]:
allPredForestTest = predict(allForest, newdata = Test1)

In [44]:
t = table(Test1$letter,allPredForestTest)

In [45]:
t

   allPredForestTest
      A   B   P   R
  A 390   0   3   2
  B   0 380   1   2
  P   0   5 393   3
  R   3  12   0 364

In [46]:
sum(diag(t))/sum(t)
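
To see where the random forest still makes mistakes, one option is to look for the largest off-diagonal entry of its confusion matrix. A minimal sketch, with the matrix above re-entered by hand and illustrative variable names:

```r
# Random forest confusion matrix, copied from the output above
labs <- c("A", "B", "P", "R")
cm_rf <- matrix(c(390,   0,   3,   2,
                    0, 380,   1,   2,
                    0,   5, 393,   3,
                    3,  12,   0, 364),
                nrow = 4, byrow = TRUE,
                dimnames = list(actual = labs, predicted = labs))

accuracy_rf <- sum(diag(cm_rf)) / sum(cm_rf)
errors_rf   <- sum(cm_rf) - sum(diag(cm_rf))

# Zero out the diagonal and locate the most common confusion
off <- cm_rf
diag(off) <- 0
worst <- which(off == max(off), arr.ind = TRUE)
worst_actual    <- labs[worst[1, 1]]
worst_predicted <- labs[worst[1, 2]]

round(accuracy_rf, 4)
errors_rf
c(actual = worst_actual, predicted = worst_predicted, count = max(off))
```

Compared with the CART matrix for the same test set, the number of misclassified images drops from 189 to 31, and the most frequent remaining confusion is a true R predicted as B.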