# Machine Learning Exercise with `R`

There is still a lot to learn about machine learning, and it is important to recognize that we have barely scratched the surface. There are many things we could do to refine our model that we didn't touch on in this module (don't worry, these will be covered throughout your curriculum), such as data transformation, automated feature selection, and unsupervised learning.

For these exercises, complete only **ONE** of the exercise notebooks, either `Python` or `R`. You will predict wine quality using both a Decision Tree and Naïve Bayes. Treat the exercises as extended practice: you are free to refine the models however you see fit, as long as you use both Decision Tree and Naïve Bayes.

The questions will guide you, but if you want to experiment, or if data exploration leads you to a better model, feel free to pursue it. If you go this route, leave comments in the code justifying your choices.

### Read in Packages

In [None]:
library(tree)
library(ggplot2)
library(e1071)

### Read in the Data

Today we will be using the Red Wine Quality data. The target variable is numeric, so we will discretize it into three categories before starting the exercises.

In [None]:
wine <- read.csv('/dsa/data/all_datasets/wine-quality/winequality-red.csv', sep = ";")
head(wine)

In [None]:
# Create a new target called taste:
# quality below 6 is "bad", quality above 6 is "good".
wine$taste <- ifelse(wine$quality < 6, 'bad', 'good')

# 6 is one of the most common values in this set and sits in the
# middle of the distribution, so we give it its own label: "normal".
wine$taste[wine$quality == 6] <- 'normal'

# make this target variable categorical
wine$taste <- as.factor(wine$taste)

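# remove the old target (quality, the 12th column), since it is no longer needed;
# dropping it by name keeps this cell safe if the column order changes
wine$quality <- NULL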

**Exercise 1**: Create a training data set and a testing data set from the `wine` data frame. Make sure that the rows are randomly selected. The training set should be constructed from 60% of the data; call it `train`. The testing set should be called `test` and should be constructed from the **other** 40% of the data. Be sure to call `set.seed(123)` first.
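
A random 60/40 split of this kind can be sketched as follows. This is only an illustration on R's built-in `iris` data, not on `wine`; the names `idx`, `iris_train`, and `iris_test` are placeholders chosen for this example.

```r
# Illustration on the built-in iris data (not the wine data).
set.seed(123)                            # make the random split reproducible
n   <- nrow(iris)
idx <- sample(n, size = round(0.6 * n))  # row indices drawn for the 60% portion

iris_train <- iris[idx, ]                # randomly selected training rows
iris_test  <- iris[-idx, ]               # every remaining row, so no overlap

nrow(iris_train)
nrow(iris_test)
```

Indexing with `-idx` guarantees that the two sets are disjoint and together cover every row.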

In [None]:
# Code for exercise 1 goes here
# *****************************







**Exercise 2**: Create a formula for the prediction task. First predict using all of the variables other than the target. In order to avoid typing out all of the variables, you can use the following notation:

```splus
target ~ .
```

The "." tells `R` to use all other variables in the dataset (that are not the target) as inputs.
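
To see what the "." stands for, one option is to expand the formula against a data frame with `terms()`. The sketch below uses the built-in `iris` data, with `Species` standing in for the target; the names `f`, `expanded`, and `predictors` are placeholders for this example.

```r
# Illustration on the built-in iris data; Species plays the role of the target.
f <- Species ~ .

# terms() resolves the "." against the columns of a concrete data frame
expanded   <- terms(f, data = iris)
predictors <- attr(expanded, "term.labels")

predictors  # every column of iris except Species
```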

In [None]:
# Code for exercise 2 goes here
# *****************************






**Exercise 3**: Create a Decision Tree model using the `tree` function. Pass the formula you just created and specify the training data set. Assign the result to an object (in the examples, we called it `tr`), then run `summary` on it.
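
As a reminder of the general pattern, here is a minimal sketch on the built-in `iris` data rather than `wine`. It assumes the `tree` package is installed; the object name `iris_tr` is a placeholder for this example.

```r
library(tree)

# Illustration on the built-in iris data (not the wine data).
iris_tr <- tree(Species ~ ., data = iris)

# For a classification tree, summary() reports the number of terminal nodes
# and the misclassification error rate on the data used for fitting
summary(iris_tr)
```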

In [None]:
# Code for exercise 3 goes here
# *****************************







Pay attention to the output of the summary.

**Exercise 4**: What is the misclassification error rate of the tree using the **testing** set?
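
The misclassification error rate is the share of predictions that differ from the true labels. The sketch below computes it from two hand-written label vectors; `actual` and `predicted` are hypothetical values made up for illustration, not model output.

```r
# Hypothetical true labels and predictions, for illustration only.
actual    <- factor(c("bad", "good", "normal", "good", "bad", "normal"))
predicted <- factor(c("bad", "normal", "normal", "good", "good", "normal"),
                    levels = levels(actual))

# Confusion matrix: rows are predictions, columns are true labels
conf_mat <- table(predicted, actual)
conf_mat

# Misclassification error rate: fraction of predictions that are wrong
error_rate <- mean(predicted != actual)
error_rate

# Equivalent: 1 minus the fraction of counts on the diagonal
1 - sum(diag(conf_mat)) / sum(conf_mat)
```

With a fitted `tree` model, class predictions for held-out rows can be obtained with `predict(..., type = "class")` and compared to the true labels in the same way.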

In [None]:
# Code for exercise 4 goes here
# *****************************







**Exercise 5**: Now create a Naïve Bayes classifier using the formula and training data. Assign the model to an object (in the other notebooks, we called it `m`).
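
The general pattern with `naiveBayes` from `e1071` looks roughly like this. The sketch uses the built-in `iris` data rather than `wine` and assumes the `e1071` package is installed; `iris_nb` and `iris_pred` are placeholder names for this example.

```r
library(e1071)

# Illustration on the built-in iris data (not the wine data).
iris_nb <- naiveBayes(Species ~ ., data = iris)

# predict() returns class labels by default for a naiveBayes model
iris_pred <- predict(iris_nb, newdata = iris)

# Cross-tabulate predictions against the true labels
table(iris_pred, iris$Species)
```

This example predicts on the same rows it was fitted on, so the table shows training performance only; the exercises ask for the error on the testing set.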

In [None]:
# Code for exercise 5 goes here
# *****************************






**Exercise 6**: What is the misclassification error rate of the Naïve Bayes classifier using the **testing** set?

In [None]:
# Code for exercise 6 goes here
# *****************************







Take a look at the summary of the tree created in Exercise 3. Its output lists the variables actually used in tree construction, that is, the features the tree relied on for the classification task.

**Exercise 7**: Create a new formula that predicts `taste` using only the features that the decision tree used. Be sure to give this formula a different name from the old formula.
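
A formula restricted to a few predictors can be written out by hand or built from a character vector with `reformulate()`. The sketch below uses the built-in `iris` data; the predictor subset in `keep` is a hypothetical choice for illustration, not the set your tree will report.

```r
# Hypothetical subset of predictors, for illustration on the iris data.
keep <- c("Petal.Length", "Petal.Width")

# Option 1: write the formula out by hand
f_small <- Species ~ Petal.Length + Petal.Width

# Option 2: build the same formula from a character vector
f_built <- reformulate(keep, response = "Species")

all.vars(f_small)
all.vars(f_built)
```

Both options name the same variables; the second is convenient when the list of features is long.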

In [None]:
# Code for exercise 7 goes here
# *****************************






**Exercise 8**: Now create a Naïve Bayes classifier using this reduced formula and the training data. Be sure to give this model a different name from your original Naïve Bayes model.

In [None]:
# Code for exercise 8 goes here
# *****************************







**Exercise 9**: Does using only these selected features produce a better model, judged by the misclassification error rate on the testing data?

In [None]:
# Code for exercise 9 goes here
# *****************************






# Save your notebook, then `File > Close and Halt`