# Machine Learning Exercise with `R`

There is still a lot to learn about machine learning, and it is important to recognize that we have barely started to scratch the surface of it. There are many things we could do to refine our model that we didn't touch on in this module (don't worry, these will be covered throughout your curriculum), such as data transformation, automated feature selection methods, and unsupervised learning.

For these exercises, complete only **ONE** of the exercise notebooks, either `Python` or `R`. You will predict wine quality using both a Decision Tree and Naïve Bayes. The exercises serve as extended practice: you are free to refine the models however you see fit, as long as you use both Decision Tree and Naïve Bayes.

The questions will guide you a bit, but if you want to experiment or you find, through data exploration, a model that is better, feel free to do so. If you go this route, leave comments in the code justifying why you did what you did.

### Read in Packages

In [1]:
library(tree)
library(ggplot2)
library(e1071)

### Read in the Data

Today we will be using the Red Wine Quality data. The target variable is numeric, so we are going to discretize it a bit before we get to the activities.

In [2]:
wine <- read.csv('/dsa/data/all_datasets/wine-quality/winequality-red.csv', sep = ";")
head(wine)

fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.66,0.0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5


In [3]:
# if wine quality is less than 6, assign the value "bad".
# if 6 or greater, assign "good". create a new target called
# taste
wine$taste <- ifelse(wine$quality < 6, 'bad', 'good') 

# 6 is the most popular value by a lot in this set, so 
# we are going to assign it a unique value. We will call 
# this "normal" as it is in the middle of the distribution.
wine$taste[wine$quality == 6] <- 'normal'

# make this target variable categorical
wine$taste <- as.factor(wine$taste)

# remove the old target, since it is no longer needed
wine <- wine[,-12]
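
As an aside, the same three-way discretization can be sketched in a single step with base R's `cut()`. The `quality` vector below is a hypothetical stand-in for `wine$quality`, and this is only an alternative illustration, not a replacement for the cell above. One difference to keep in mind: `cut()` orders the factor levels as given in `labels`, whereas `as.factor()` orders them alphabetically.

```r
# Hypothetical quality scores standing in for wine$quality
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Intervals are right-closed by default: (-Inf, 5], (5, 6], (6, Inf]
taste_cut <- cut(quality,
                 breaks = c(-Inf, 5, 6, Inf),
                 labels = c("bad", "normal", "good"))

# The two-step ifelse approach used in the cell above, for comparison
taste_ifelse <- ifelse(quality < 6, "bad", "good")
taste_ifelse[quality == 6] <- "normal"

# Same labels either way; only the factor level order differs
print(as.character(taste_cut))
print(levels(taste_cut))
```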

**Exercise 1**: Create a training data set and a testing data set from the `wine` data frame. Make sure that the rows are randomly selected. The training set should be constructed from 60% of the data; call it `train`. The testing set should be called `test` and should be constructed from the **other** 40% of the data. Be sure to call `set.seed(123)` first so that the split is reproducible.

In [4]:
nrow(wine) *.6

nrow(wine) - (nrow(wine)* .6)

In [5]:
# Code for exercise 1 goes here
# *****************************

## 960 is equal to approx. 60% of the data set
set.seed(123)
train_ind <- sample(seq_len(nrow(wine)), size = 960)

train <- wine[train_ind, ]

head(train)

,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,taste
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
460,11.6,0.58,0.66,2.2,0.074,10,47,1.0008,3.25,0.57,9.0,bad
1260,6.8,0.64,0.0,2.7,0.123,15,33,0.99538,3.44,0.63,11.3,normal
654,9.4,0.33,0.59,2.8,0.079,9,30,0.9976,3.12,0.54,12.0,normal
1410,6.0,0.51,0.0,2.1,0.064,40,54,0.995,3.54,0.93,10.7,normal
1501,7.5,0.725,0.04,1.5,0.076,8,15,0.99508,3.26,0.53,9.6,bad
73,7.7,0.69,0.22,1.9,0.084,18,94,0.9961,3.31,0.48,9.5,bad


In [6]:
## Now set up the testing set

test <- wine[-train_ind, ]
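
A quick sanity check on a split like this can be sketched as follows. The data frame `df` and the names `train_demo` and `test_demo` are hypothetical stand-ins, and the training size is computed with `floor()` instead of being typed in by hand. Because `floor()` rounds down, the count may differ by a row from a hand-rounded size such as 960.

```r
# Hypothetical stand-in for the wine data frame (20 rows)
df <- data.frame(id = 1:20, x = seq(0.05, 1, by = 0.05))

set.seed(123)
n_train <- floor(0.6 * nrow(df))
train_ind_demo <- sample(seq_len(nrow(df)), size = n_train)

train_demo <- df[train_ind_demo, ]
test_demo  <- df[-train_ind_demo, ]

# The two sets should share no rows and together cover the whole data frame
shared_rows <- intersect(train_demo$id, test_demo$id)
print(length(shared_rows))
print(nrow(train_demo) + nrow(test_demo) == nrow(df))
```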

**Exercise 2**: Create a formula for the prediction task. First predict using all of the variables other than the target. In order to avoid typing out all of the variables, you can use the following notation:

```splus
target ~ .
```

The "." tells `R` to use all other variables in the dataset (that are not the target) as inputs.
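
To see which variables the dot stands for, one option is to expand the formula against a data frame with `terms()`. The sketch below uses the built-in `iris` data as a stand-in for `wine`, so the variable names are not the ones from this exercise.

```r
# iris ships with base R and stands in for the wine data here
frmla_demo <- Species ~ .

# terms() expands the dot against the supplied data frame
expanded <- attr(terms(frmla_demo, data = iris), "term.labels")
print(expanded)
```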

In [7]:
# Code for exercise 2 goes here
# *****************************

frmla_full <- taste ~ .




**Exercise 3**: Create a Decision Tree model using the `tree` function. Make sure that you pass the newly created formula as a parameter and specify the training data set. Be sure to name this object something (in the examples, we called it `tr`). Then run a summary on the object. 

In [8]:
# Code for exercise 3 goes here
# *****************************

tr <- tree(frmla_full, data=train)

summary(tr)





Classification tree:
tree(formula = frmla_full, data = train)
Variables actually used in tree construction:
[1] "alcohol"              "volatile.acidity"     "sulphates"           
[4] "chlorides"            "total.sulfur.dioxide"
Number of terminal nodes:  10 
Residual mean deviance:  1.467 = 1394 / 950 
Misclassification error rate: 0.326 = 313 / 960 

Pay attention to the output of the summary.

**Exercise 4**: What is the misclassification error rate of the tree using the **testing** set?

In [9]:
# Code for exercise 4 goes here
# *****************************

test_tr<-test
test_tr$pred <- predict(tr, test_tr, type='class')
miss_tr <- test_tr[test_tr$taste != test_tr$pred,]

nrow(miss_tr)/nrow(test_tr)
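
The same rate can also be computed without building a data frame of misclassified rows. The helper `misclass_rate` below is a hypothetical name introduced for illustration, and the labels are toy values rather than wine predictions.

```r
# Hypothetical helper: share of predictions that differ from the true labels
misclass_rate <- function(actual, predicted) {
  mean(as.character(actual) != as.character(predicted))
}

# Toy labels, not taken from the wine data
actual    <- factor(c("bad", "good", "normal", "bad"))
predicted <- factor(c("bad", "normal", "normal", "good"))

rate <- misclass_rate(actual, predicted)
print(rate)
```

Under these assumptions, a call such as `misclass_rate(test$taste, predict(tr, test, type = 'class'))` would be expected to match the value computed in the cell above.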


**Exercise 5**: Now create a Naïve Bayes classifier using the formula and training data. Be sure to name this model something (in the other notebooks, we called it `m`).

In [10]:
# Code for exercise 5 goes here
# *****************************

m <- naiveBayes(frmla_full, data=train)

m



Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
      bad      good    normal 
0.4739583 0.1302083 0.3958333 

Conditional probabilities:
        fixed.acidity
Y            [,1]     [,2]
  bad    8.187912 1.530198
  good   8.928000 2.106771
  normal 8.458947 1.834325

        volatile.acidity
Y             [,1]      [,2]
  bad    0.5982857 0.1870850
  good   0.4045200 0.1430999
  normal 0.4914079 0.1687192

        citric.acid
Y             [,1]      [,2]
  bad    0.2388352 0.1851983
  good   0.3744800 0.1905444
  normal 0.2882105 0.1950711

        residual.sugar
Y            [,1]     [,2]
  bad    2.511319 1.296993
  good   2.676400 1.299699
  normal 2.471053 1.327730

        chlorides
Y              [,1]       [,2]
  bad    0.09213846 0.05132500
  good   0.07500800 0.02245443
  normal 0.08388947 0.03630374

        free.sulfur.dioxide
Y            [,1]     [,2]
  bad    16.43297 10.68138
  good   

**Exercise 6**: What is the misclassification error rate of the Naïve Bayes classifier using the **testing** set?

In [11]:
# Code for exercise 6 goes here
# *****************************

table(predict(m, test[, -12]), test[, 12])

test_nb<-test
test_nb$pred <- predict(m, test_nb[, -12])
miss_nb <- test_nb[test_nb$taste != test_nb$pred,]

nrow(miss_nb)/nrow(test_nb)



        
         bad good normal
  bad    221    8     98
  good     7   47     30
  normal  61   37    130
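
The misclassification error rate can be read directly off this table: correct predictions sit on the diagonal, and everything else is an error. A minimal sketch, with the counts copied by hand from the output above (rows are predictions, columns are the true labels, following the argument order of the `table()` call):

```r
# Counts copied from the table above (rows: predicted, columns: actual)
cm <- matrix(c(221,  8,  98,
                 7, 47,  30,
                61, 37, 130),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("bad", "good", "normal"),
                             actual    = c("bad", "good", "normal")))

# Correct predictions sit on the diagonal; the rest are misclassified
error_rate <- 1 - sum(diag(cm)) / sum(cm)
print(round(error_rate, 3))
```

This works out to 241 misclassified rows out of 639, or roughly 0.377.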

Take a look at the summary of the tree created in Exercise 3. It shows us the features that it used for the classification task. 

**Exercise 7**: Create a new formula that predicts `taste` using only the features that the decision tree defined. Be sure to name this formula something different from the old formula.

In [12]:
# Code for exercise 7 goes here
# *****************************

#[1] "alcohol"              "volatile.acidity"     "sulphates"           
#[4] "chlorides"            "total.sulfur.dioxide"

frmla_trimmed <- taste ~ alcohol + volatile.acidity + sulphates + chlorides + total.sulfur.dioxide

tr2 <- tree(frmla_trimmed, data=train)

summary(tr2)

test_tr_trim<-test
test_tr_trim$pred <- predict(tr2, test_tr_trim, type='class')
miss_tr_trim <- test_tr_trim[test_tr_trim$taste != test_tr_trim$pred,]

nrow(miss_tr_trim)/nrow(test_tr_trim)



Classification tree:
tree(formula = frmla_trimmed, data = train)
Number of terminal nodes:  10 
Residual mean deviance:  1.467 = 1394 / 950 
Misclassification error rate: 0.326 = 313 / 960 

**Exercise 8**: Now create a Naïve Bayes classifier using this reduced formula and the training data. Be sure to name this model something other than your original Naïve Bayes model.

In [13]:
# Code for exercise 8 goes here
# *****************************

m2 <- naiveBayes(frmla_trimmed, data=train)

test_nb_trim<-test
test_nb_trim$pred <- predict(m2, test_nb_trim, type='class')
miss_nb_trim <- test_nb_trim[test_nb_trim$taste != test_nb_trim$pred,]

nrow(miss_nb_trim)/nrow(test_nb_trim)




**Exercise 9**: Does using only these selected features create a better model according to the misclassification error rate on the testing data?

In [14]:
# Code for exercise 9 goes here
# *****************************

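# Using only these features in the Naive Bayes model decreased the test
# misclassification error rate from about 38% to about 36%, a relative
# decrease of about 6%, as computed below.
# The decision tree is unchanged: the summary of tr2 reports the same 10
# terminal nodes and the same 0.326 training error rate as tr.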

mer1 <- nrow(miss_nb)/nrow(test_nb)
mer2 <- nrow(miss_nb_trim)/nrow(test_nb_trim)

improve_percent <- abs((mer2 - mer1)/mer1)

improve_percent


# Save your notebook, then `File > Close and Halt`