# Group 006-26 Project Report
#### Linda Han, Shaqed Orr, Eric Zhang, Prabhjot Singh

## Introduction:
Iris species differ in several measurable features. Our multivariate dataset records the sepal length, sepal width, petal length, and petal width of each flower, and we use these measurements to predict the flower's species.


We want to predict the species of an iris flower given its:
1. sepal length in cm 
2. sepal width in cm
3. petal length in cm
4. petal width in cm


We will be using the Iris data set found at https://archive.ics.uci.edu/ml/datasets/iris. This set contains 3 classes of 50 instances each, where each class refers to a species of iris plant. The species are Iris Setosa, Iris Versicolour, and Iris Virginica.


## Preliminary data analysis:
First, we load in all the necessary libraries.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)

options(repr.matrix.max.rows = 6) # this lists only 6 rows when we try to display the dataset

#### Reading and cleaning the data

1) We read the iris dataset using the `read_csv` function.

2) We added column names to reflect each of the attributes, and changed the data type of the **class** column from character to factor. Otherwise, our data is already tidy: there is one row per observation, each column is a single variable (either a measurement of the iris flower or its species), and each value is in a single cell.


In [None]:
iris_col <- c("sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "class")
iris <- read_csv("data/iris.data", col_names = iris_col) %>%
        mutate(class = as_factor(class))
iris

<br>

#### Summary of the training data

We first split the data into training and testing sets, stratifying by class.

In [None]:
set.seed(777)

iris_split <- initial_split(iris, prop = 0.80, strata = class)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
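
As a quick sanity check on the stratified split, the class counts in each set can be tabulated. The sketch below is illustrative only: it assumes R's built-in `datasets::iris` data frame (whose species column is named `Species`) as a stand-in for our `data/iris.data` file, and the names `demo_split`, `demo_train`, and `demo_test` are hypothetical.

```r
library(rsample)

set.seed(777)

# Assumption: built-in iris data as a stand-in for data/iris.data
demo_split <- initial_split(datasets::iris, prop = 0.80, strata = Species)
demo_train <- training(demo_split)
demo_test <- testing(demo_split)

# With strata set, each species should appear in roughly the same
# proportion in the training set and in the testing set
table(demo_train$Species)
table(demo_test$Species)
```

If one species were much rarer in the training set than in the testing set, the classifier would see too few examples of it, which is what stratifying is meant to avoid.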

<br>

Using only the training data, we summarize it in 2 tables (the mean of each measurement and the number of observations in each class) and count the number of missing values.

In [None]:
# Summarizes the average value of each column
iris_avg_size <- iris_train %>%
        summarize(across(sepal_length_cm:petal_width_cm, mean))
iris_avg_size

# Summarizes the number of observations in each class 
# (Iris-setosa, Iris-versicolor, or Iris-virginica)
iris_class_count <- iris_train %>%
        count(class)
iris_class_count

# Counts the number of missing values (NA cells) across all columns
sum(is.na(iris_train))
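
Note that `sum(is.na(...))` counts individual missing cells, which can differ from the number of rows that contain a missing value. A minimal sketch of the difference, using a hypothetical toy data frame (`toy`, not the iris data):

```r
# Hypothetical toy data: row 2 has two missing cells, row 4 has one
toy <- data.frame(
  sepal_length_cm = c(5.1, NA, 6.3, 5.8),
  petal_length_cm = c(1.4, NA, 6.0, NA)
)

# Number of individual NA cells
n_missing_cells <- sum(is.na(toy))

# Number of rows with at least one NA
n_missing_rows <- sum(!complete.cases(toy))

n_missing_cells
n_missing_rows
```

When the cell count is 0, as we would hope for a clean dataset, the row count is necessarily 0 as well, so either check is sufficient in that case.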

<br>

#### Visualization of the training data

Using only the training data, we draw two scatter plots, each coloured by class: sepal width against sepal length, and petal width against petal length. These plots let us see how well each pair of predictors separates the three species before we build the classifier.

In [None]:
# Graph 1
iris_plot_sepal <- ggplot(data = iris_train, 
                          aes(x = sepal_length_cm, y = sepal_width_cm, colour = class)) +
                geom_point() +
                labs(x = "Sepal length (cm)", y = "Sepal width (cm)", colour = "Class") +
                ggtitle("Sepal Width vs Sepal Length") +
                theme(text = element_text(size = 20))

iris_plot_sepal

In [None]:
# Graph 2
iris_plot_petal <- ggplot(data = iris_train,
                          aes(x = petal_length_cm, y = petal_width_cm, colour = class)) +
                    geom_point() +
                    labs(x = "Petal length (cm)", y = "Petal width (cm)", colour = "Class") +
                    ggtitle("Petal Width vs Petal Length") +
                    theme(text = element_text(size = 20))

iris_plot_petal

## Data analysis (Building a classifier):

In [None]:
# 1. Recipe
iris_recipe <- recipe(class ~ ., data = iris_train) %>%
                step_center(all_predictors()) %>%
                step_scale(all_predictors())

# 2. Specification
iris_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>% 
            set_engine("kknn") %>% 
            set_mode("classification")

# 3. 5-fold cross-validation (the number of folds is set with `v`)
iris_vfold <- vfold_cv(data = iris_train, v = 5, strata = class)

# 4. Workflow, tuned over K with cross-validation
iris_tune_results <- workflow() %>%
                add_recipe(iris_recipe) %>%
                add_model(iris_spec) %>%
                tune_grid(resamples = iris_vfold, grid = 10)

# 5. Metrics and accuracies
iris_accuracies <- iris_tune_results %>%
                collect_metrics() %>%
                filter(.metric == "accuracy") %>%
                arrange(desc(mean))

iris_accuracies
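
The recipe above centers and scales every predictor because K-nearest neighbours compares observations by distance, so a variable measured on a larger scale would otherwise dominate. A minimal base R sketch of the same idea follows. It assumes a hypothetical three-row training set and a hypothetical new observation, and uses `scale()` in place of the recipe steps, so it illustrates the transformation rather than reproducing our pipeline.

```r
# Hypothetical training data (values chosen for illustration only)
demo_train <- data.frame(
  petal_length_cm = c(1.4, 4.5, 6.0),
  petal_width_cm = c(0.2, 1.5, 2.5)
)

# Learn the mean and standard deviation from the training data only
demo_centers <- colMeans(demo_train)
demo_scales <- apply(demo_train, 2, sd)

# Standardize the training data
demo_train_std <- scale(demo_train, center = demo_centers, scale = demo_scales)

# A new observation is transformed with the training means and
# standard deviations, never with its own
demo_new <- data.frame(petal_length_cm = 5.0, petal_width_cm = 1.8)
demo_new_std <- scale(demo_new, center = demo_centers, scale = demo_scales)

demo_train_std
demo_new_std
```

After standardizing, each training column has mean 0 and standard deviation 1, so both predictors contribute on a comparable scale to the distance.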

### Now we retrain our model with the optimal $K = 11$
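
The value of $K$ is read from the top row of the accuracy table. A minimal sketch of extracting it in code instead of typing it in by hand is given here. It assumes a table shaped like the output of `collect_metrics()`; the table `demo_accuracies` and its accuracy values are invented for illustration and are not our results.

```r
library(dplyr)

# Hypothetical accuracy table; the numbers are made up for illustration
demo_accuracies <- tibble(
  neighbors = c(2, 5, 8),
  .metric = "accuracy",
  mean = c(0.90, 0.95, 0.93)
)

# Keep the row with the highest mean accuracy and extract its K
demo_best_k <- demo_accuracies %>%
  slice_max(mean, n = 1, with_ties = FALSE) %>%
  pull(neighbors)

demo_best_k
```

Reading $K$ from the table this way keeps the retraining step in sync with the tuning results if the seed or the folds change.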

In [None]:
# 1. New specification
iris_new_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 11) %>% 
                set_engine("kknn") %>% 
                set_mode("classification")
                
# 2. New workflow
iris_workflow <- workflow() %>% 
                add_recipe(iris_recipe) %>% 
                add_model(iris_new_spec) %>% 
                fit(iris_train)

iris_workflow
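
A natural next step is to evaluate the fitted workflow on the held-out testing set. The sketch below shows one way this could look. It is self-contained and therefore rebuilds the pipeline on R's built-in `datasets::iris` data (species column `Species`) as a stand-in for `data/iris.data`; all `demo_` names are hypothetical, and the resulting numbers should not be read as our results.

```r
library(tidymodels)

set.seed(777)

# Assumption: built-in iris data as a stand-in for data/iris.data
demo_split <- initial_split(datasets::iris, prop = 0.80, strata = Species)
demo_train <- training(demo_split)
demo_test <- testing(demo_split)

# Same preprocessing and model specification as in the report, with K = 11
demo_recipe <- recipe(Species ~ ., data = demo_train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

demo_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 11) %>%
  set_engine("kknn") %>%
  set_mode("classification")

demo_fit <- workflow() %>%
  add_recipe(demo_recipe) %>%
  add_model(demo_spec) %>%
  fit(data = demo_train)

# Predict the class of each test observation and attach the true labels
demo_predictions <- predict(demo_fit, demo_test) %>%
  bind_cols(demo_test)

# Proportion of test observations classified correctly
demo_accuracy <- demo_predictions %>%
  accuracy(truth = Species, estimate = .pred_class)
demo_accuracy

# Confusion matrix: which species are mistaken for which
demo_predictions %>%
  conf_mat(truth = Species, estimate = .pred_class)
```

Because the testing set plays no part in tuning or fitting, the accuracy computed this way is an estimate of how the classifier would perform on new flowers.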