## Letter Recognition

One of the earliest applications of the predictive analytics methods we have studied so far in this class was automatic letter recognition, which post office machines use to sort mail. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet -- A, B, P, and R -- to predict which letter a particular image corresponds to.

Note that this is a multiclass classification problem. We have mostly focused on binary classification problems (e.g., predicting whether an individual voted or not, whether the Supreme Court will affirm or reverse a case, whether or not a person is at risk for a certain disease, etc.). In this problem, each observation can belong to one of more than two classes, as in the D2Hawkeye lecture.

The file letters_ABPR.csv contains 3116 observations, each of which corresponds to a certain image of one of the four letters A, B, P and R. The images came from 20 different fonts, which were then randomly distorted to produce the final images; each such distorted image is represented as a collection of pixels, each of which is "on" or "off". For each such distorted image, we have available certain statistics of the image in terms of these pixels, as well as which of the four letters the image is. This data comes from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Letter+Recognition).

This dataset contains the following 17 variables:

- letter = the letter that the image corresponds to (A, B, P or R)
- xbox = the horizontal position of where the smallest box covering the letter shape begins
- ybox = the vertical position of where the smallest box covering the letter shape begins
- width = the width of this smallest box
- height = the height of this smallest box
- onpix = the total number of "on" pixels in the character image
- xbar = the mean horizontal position of all of the "on" pixels
- ybar = the mean vertical position of all of the "on" pixels
- x2bar = the mean squared horizontal position of all of the "on" pixels in the image
- y2bar = the mean squared vertical position of all of the "on" pixels in the image
- xybar = the mean of the product of the horizontal and vertical position of all of the "on" pixels in the image
- x2ybar = the mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels
- xy2bar = the mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels
- xedge = the mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image
- xedgeycor = the mean of the product of the number of horizontal edges at each vertical position and the vertical position
- yedge = the mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image
- yedgexcor = the mean of the product of the number of vertical edges at each horizontal position and the horizontal position
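
To make a few of these definitions concrete, here is a minimal sketch that computes simplified, unscaled versions of some of the statistics from a tiny hand-made pixel grid. The grid `img` and all variable names are made up for illustration. The values in the file are integers produced by a scaling step that this sketch does not attempt to reproduce.

```R
# Hypothetical 4x4 binary image (1 = "on", 0 = "off"); not taken from the data set.
img <- matrix(c(0, 1, 1, 0,
                1, 0, 0, 1,
                1, 1, 1, 1,
                1, 0, 0, 1), nrow = 4, byrow = TRUE)

# Row/column coordinates of every "on" pixel.
on <- which(img == 1, arr.ind = TRUE)
x <- on[, "col"]  # horizontal positions
y <- on[, "row"]  # vertical positions

onpix  <- nrow(on)              # total number of "on" pixels
width  <- max(x) - min(x) + 1   # width of the smallest covering box
height <- max(y) - min(y) + 1   # height of the smallest covering box
xbar   <- mean(x)               # mean horizontal position
ybar   <- mean(y)               # mean vertical position
x2bar  <- mean(x^2)             # mean squared horizontal position
xybar  <- mean(x * y)           # mean product of the two positions

c(onpix = onpix, width = width, height = height, xbar = xbar, ybar = ybar)
```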

### Predicting B or not B

In [2]:
letters = read.csv('./dataset/letters_ABPR.csv')

In [3]:
str(letters)

'data.frame':	3116 obs. of  17 variables:
 $ letter   : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
 $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
 $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
 $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
 $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
 $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
 $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
 $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
 $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
 $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
 $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
 $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
 $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
 $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
 $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
 $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
 $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...


Let's warm up by attempting to predict just whether a letter is B or not. To begin, load the file letters_ABPR.csv into R, and call it letters. Then, create a new variable isB in the dataframe, which takes the value "TRUE" if the observation corresponds to the letter B, and "FALSE" if it does not. You can do this by typing the following command into your R console:
```R
letters$isB = as.factor(letters$letter == "B")
```
Now split the data set into a training and testing set, putting 50% of the data in the training set. Set the seed to 1000 before making the split. The first argument to sample.split should be the dependent variable "letters$isB". Remember that TRUE values from sample.split should go in the training set.

In [4]:
letters$isB = as.factor(letters$letter == 'B')

In [11]:
library('caTools')
set.seed(1000)

In [12]:
split = sample.split(letters$isB, SplitRatio=0.5)
train = subset(letters, split == TRUE)
test = subset(letters, split == FALSE)
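
The idea behind a split that balances the outcome variable can be sketched in base R. This is only an illustration of stratified sampling under the assumption that each class is halved separately; it is not the caTools implementation, and the vector `y` is synthetic, built to have the same class counts as `letters$isB`.

```R
set.seed(1000)

# Synthetic outcome with the same class counts as letters$isB (2350 FALSE, 766 TRUE).
y <- factor(rep(c("FALSE", "TRUE"), times = c(2350, 766)))

in_train <- logical(length(y))  # all FALSE to start
for (lev in levels(y)) {
  idx <- which(y == lev)
  # Send half of each class to the training set.
  in_train[sample(idx, size = round(0.5 * length(idx)))] <- TRUE
}

table(y[in_train])   # class counts in the training half
table(y[!in_train])  # class counts in the testing half
```

Because each class is sampled separately, both halves keep the same proportion of B and non-B observations as the full data set.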

In [13]:
table(letters$isB)


FALSE  TRUE 
 2350   766 

In [14]:
2350 / nrow(letters)

Now build a classification tree to predict whether a letter is a B or not, using the training set to build your model. Remember to exclude the variable "letter" from the model, as it directly determines what we are trying to predict! To exclude just one variable, you can either write out all the other variables, or recall what we did in the Billboards problem in Week 3 and use the following notation:
```R
CARTb = rpart(isB ~ . - letter, data=train, method="class")
```
We are just using the default parameters in our CART model, so we don't need to add the minbucket or cp arguments at all. We also added the argument method="class" since this is a classification problem.
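
To check which predictors the `. - letter` notation leaves in a formula without fitting anything, one option is to expand the formula with base R's `terms()`. The sketch below uses a made-up two-row data frame `toy` with only a few of the columns, purely for illustration.

```R
# Hypothetical miniature data frame with a subset of the real column names.
toy <- data.frame(letter = c("A", "B"),
                  xbox   = c(1, 4),
                  ybox   = c(1, 2),
                  isB    = c(FALSE, TRUE))

# Expand "." against the columns of toy, then drop letter.
predictors <- attr(terms(isB ~ . - letter, data = toy), "term.labels")
predictors
```

The response `isB` is never included in the expansion of `.`, and `- letter` removes that column from the right-hand side, so only the remaining columns are used as independent variables.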

In [15]:
library("rpart")
CARTb = rpart(isB ~ . - letter, data=train, method="class")

In [16]:
CARTbpred = predict(CARTb, newdata=test, type='class')

In [17]:
table(test$isB, CARTbpred)

       CARTbpred
        FALSE TRUE
  FALSE  1118   57
  TRUE     43  340

In [19]:
accuracy = (1118 + 340) / nrow(test)
accuracy
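
Rather than typing the cell counts by hand, the same accuracy can be read off the confusion matrix directly. The sketch below re-enters the table printed above as a matrix so that it runs on its own; in the notebook you would use `table(test$isB, CARTbpred)` instead. The names `cm`, `sensitivity` and `specificity` are chosen here for illustration.

```R
# Confusion matrix from the CART model (rows = actual, columns = predicted).
cm <- matrix(c(1118,  57,
                 43, 340),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("FALSE", "TRUE"), c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)                   # correct / all
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # share of actual B's found
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ]) # share of non-B's kept out

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```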

Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set. You should use all of the other variables as independent variables, except letter (since it helped us define what we are trying to predict!). Use the default settings for ntree and nodesize (don't include these arguments at all). Right before building the model, set the seed to 1000. (NOTE: You might get a slightly different answer on this problem, even if you set the random seed. This has to do with your operating system and the implementation of the random forest algorithm.)

In [21]:
library("randomForest")

randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.



In [22]:
set.seed(1000)
isBForest = randomForest(isB ~ . - letter, data=train)

In [23]:
PredictForest = predict(isBForest, newdata=test)

In [24]:
table(test$isB, PredictForest)

       PredictForest
        FALSE TRUE
  FALSE  1163   12
  TRUE      9  374

In [25]:
accuracy = (1163 + 374) / nrow(test)
accuracy

### Predicting the letters A, B, P, R

Let us now move on to the problem that we were originally interested in, which is to predict which of the four letters A, B, P or R an image corresponds to.

As we saw in the D2Hawkeye lecture, building a multiclass classification CART model in R is no harder than building the models for binary classification problems. Fortunately, building a random forest model is just as easy.

The variable in our data frame which we will be trying to predict is "letter". Start by converting letter in the original data set (letters) to a factor by running the following command in R:
```R
letters$letter = as.factor( letters$letter )
```
Now, generate new training and testing sets of the letters data frame using letters$letter as the first input to the sample.split function. Before splitting, set your seed to 2000. Again put 50% of the data in the training set. (Why do we need to split the data again? Remember that sample.split balances the outcome variable in the training and testing sets. With a new outcome variable, we want to re-generate our split.)

In a multiclass classification problem, a simple baseline model is to predict the most frequent class of all of the options.

In [26]:
letters$letter = as.factor(letters$letter)

In [27]:
set.seed(2000)
split = sample.split(letters$letter, SplitRatio = 0.5)
train = subset(letters, split == TRUE)
test = subset(letters, split == FALSE)

In [29]:
table(test$letter)


  A   B   P   R 
395 383 401 379 

In [30]:
401 / nrow(test)
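
The most-frequent-class baseline can also be computed without reading the largest count off the table by eye. A minimal sketch, with the test-set counts from the table above re-entered by hand (in the notebook, `table(test$letter)` would take their place):

```R
# Test-set class counts copied from the output above.
test_counts <- c(A = 395, B = 383, P = 401, R = 379)

baseline_class    <- names(which.max(test_counts))       # most frequent letter
baseline_accuracy <- max(test_counts) / sum(test_counts) # accuracy of always predicting it

baseline_class
baseline_accuracy
```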

Now build a classification tree to predict "letter", using the training set to build your model. You should use all of the other variables as independent variables, except "isB", since it is related to what we are trying to predict! Just use the default parameters in your CART model. Add the argument method="class" since this is a classification problem. Even though we have multiple classes here, nothing changes in how we build the model from the binary case.

In [31]:
CARTletter = rpart(letter ~ . - isB, data=train, method='class')
summary(CARTletter)

Call:
rpart(formula = letter ~ . - isB, data = train, method = "class")
  n= 1558 

          CP nsplit rel error    xerror       xstd
1 0.31920415      0 1.0000000 1.0207612 0.01463677
2 0.25865052      1 0.6807958 0.7240484 0.01702507
3 0.18685121      2 0.4221453 0.4221453 0.01583654
4 0.02595156      3 0.2352941 0.2352941 0.01296174
5 0.02076125      4 0.2093426 0.2145329 0.01249165
6 0.01730104      5 0.1885813 0.1955017 0.01202444
7 0.01384083      6 0.1712803 0.1816609 0.01166039
8 0.01211073      7 0.1574394 0.1730104 0.01142150
9 0.01000000      8 0.1453287 0.1660900 0.01122366

Variable importance
     ybar xedgeycor    x2ybar    xy2bar     yedge     y2bar     xedge     xybar 
       17        16        14        12        11         8         7         5 
    x2bar      xbar 
        5         3 

Node number 1: 1558 observations,    complexity param=0.3192042
  predicted class=P  expected loss=0.7419769  P(node) =1
    class counts:   394   383   402   379
   probabilities:

In [32]:
CARTletterpred = predict(CARTletter, newdata=test, type='class')

In [33]:
table(test$letter, CARTletterpred)

   CARTletterpred
      A   B   P   R
  A 348   4   0  43
  B   8 318  12  45
  P   2  21 363  15
  R  10  24   5 340

In [35]:
accuracy = (348 + 318 + 363 + 340) / nrow(test)
accuracy
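
Overall accuracy hides which letters the tree confuses. As a rough sketch, the per-letter hit rate and the most frequent mistake can be pulled out of the confusion matrix; the matrix is re-entered by hand from the output above so the example is self-contained, and the names `recall`, `off` and `worst` are illustrative.

```R
# CART confusion matrix for the four letters (rows = actual, columns = predicted).
lv <- c("A", "B", "P", "R")
cm <- matrix(c(348,   4,   0,  43,
                 8, 318,  12,  45,
                 2,  21, 363,  15,
                10,  24,   5, 340),
             nrow = 4, byrow = TRUE, dimnames = list(lv, lv))

accuracy <- sum(diag(cm)) / sum(cm)   # share of test images classified correctly
recall   <- diag(cm) / rowSums(cm)    # per-letter share classified correctly

# Largest off-diagonal cell = most common kind of mistake.
off <- cm
diag(off) <- 0
worst <- which(off == max(off), arr.ind = TRUE)[1, ]
worst_actual    <- rownames(cm)[worst[1]]
worst_predicted <- colnames(cm)[worst[2]]

round(recall, 3)
c(actual = worst_actual, predicted = worst_predicted, count = max(off))
```

In this table, most of the tree's mistakes involve images wrongly predicted as R.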

In [36]:
set.seed(1000)
RFletter = randomForest(letter ~ . - isB, data=train)
summary(RFletter)

                Length Class  Mode     
call               4   -none- call     
type               1   -none- character
predicted       1558   factor numeric  
err.rate        2500   -none- numeric  
confusion         20   -none- numeric  
votes           6232   matrix numeric  
oob.times       1558   -none- numeric  
classes            4   -none- character
importance        16   -none- numeric  
importanceSD       0   -none- NULL     
localImportance    0   -none- NULL     
proximity          0   -none- NULL     
ntree              1   -none- numeric  
mtry               1   -none- numeric  
forest            14   -none- list     
y               1558   factor numeric  
test               0   -none- NULL     
inbag              0   -none- NULL     
terms              3   terms  call     

In [37]:
RFletterpred = predict(RFletter, newdata=test)
table(test$letter, RFletterpred)

   RFletterpred
      A   B   P   R
  A 391   0   3   1
  B   0 380   1   2
  P   0   6 394   1
  R   3  14   0 362

In [39]:
accuracy = (391 + 380 + 394 + 362) / nrow(test)
accuracy
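
To put the four models side by side, the counts of correct predictions from the confusion matrices above can be collected in one vector. This is just a summary sketch using numbers copied from the printed output; the names in `correct` are labels chosen here, not objects from the notebook.

```R
# Size of the test set (sum of any of the confusion matrices above).
n_test <- 395 + 383 + 401 + 379

# Correct predictions on the test set, taken from the diagonals of the tables.
correct <- c(CART_isB    = 1118 + 340,
             RF_isB      = 1163 + 374,
             CART_letter = 348 + 318 + 363 + 340,
             RF_letter   = 391 + 380 + 394 + 362)

accuracy <- correct / n_test
errors   <- n_test - correct

round(accuracy, 4)
errors
```

On both tasks the random forest makes far fewer test-set errors than the single tree, and the gap is larger for the four-letter problem.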