# Proposal: Classification on the Raisin Dataset

## Introduction

Patterns and relationships run through the natural world. Some are well understood, but many natural phenomena remain unexplained. Classification is one tool for making sense of these patterns: it assigns an observation to a known type or species based on its measured characteristics. In this project, we aim to demonstrate the value of classification by applying it to a dataset of raisins. The dataset contains 900 raisin grains drawn from two Turkish varieties, Kecimen and Besni. Each raisin was photographed, and seven physical features were extracted from the images, including the raisin's area, perimeter, and axis lengths. We will use this dataset to identify the features that best distinguish the two varieties and to classify each raisin as Kecimen or Besni. In doing so, we hope to show how effective classification can be and how the same approach can extend to more complex phenomena.


## Preliminary Exploratory Data Analysis

In [30]:
# Load the packages
library(repr)
library(tidyverse)
library(tidymodels)

In [31]:
# Load the data (skipping the ARFF header lines) and name the columns.
# The names follow the column order in the file: the three values after Area are
# MajorAxisLength, MinorAxisLength, and Eccentricity, and Perimeter comes last before Class.
raisin_data <- read.table("data/Raisin_Dataset.arff", sep = ",", skip = 18)
colnames(raisin_data) <- c("Area", "MajorAxisLength", "MinorAxisLength", "Eccentricity",
                           "ConvexArea", "Extent", "Perimeter", "Class")
head(raisin_data)

| | Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter | Class |
|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 87524 | 442.246 | 253.2912 | 0.8197384 | 90546 | 0.7586506 | 1184.04 | Kecimen |
| 2 | 75166 | 406.6907 | 243.0324 | 0.8018052 | 78789 | 0.6841296 | 1121.786 | Kecimen |
| 3 | 90856 | 442.267 | 266.3283 | 0.7983536 | 93717 | 0.6376128 | 1208.575 | Kecimen |
| 4 | 45928 | 286.5406 | 208.76 | 0.6849892 | 47336 | 0.6995994 | 844.162 | Kecimen |
| 5 | 79408 | 352.1908 | 290.8275 | 0.5640113 | 81463 | 0.7927719 | 1073.251 | Kecimen |
| 6 | 49242 | 318.1254 | 200.1221 | 0.7773513 | 51368 | 0.6584564 | 881.836 | Kecimen |


In [32]:
# Select the columns (predictors) used to build the model

In [33]:
# Split the data into training (75%) and testing (25%) sets, stratified by Class
set.seed(1)  # arbitrary seed so the split is reproducible
raisin_split <- initial_split(raisin_data, prop = 0.75, strata = Class)
raisin_train <- training(raisin_split)
raisin_test <- testing(raisin_split)

In [34]:
# Number of observations per class in the training set
counts <- raisin_train |>
    group_by(Class) |>
    summarize(n = n())
counts

# Mean of each predictor in the training set
mean_pred <- raisin_train |>
    summarize(across(-Class, mean))
mean_pred

# Number of missing values in the training set
sum(is.na(raisin_train))

| Class | n |
|---|---|
| `<chr>` | `<int>` |
| Besni | 337 |
| Kecimen | 337 |


| Area | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent | Perimeter |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 88255.42 | 431.6429 | 255.0465 | 0.7809666 | 91552.38 | 0.6984483 | 1167.08 |
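
Overall means say little about which features separate the two varieties. As a possible next step, the same summary can be computed per class. This is a minimal sketch that assumes the earlier cells have been run, so that `raisin_train` exists with a `Class` column; `class_means` is a name introduced here for illustration.

```r
library(tidyverse)

# Assumes `raisin_train` was created by the split above.
# Mean of every numeric predictor, computed separately for each raisin variety.
class_means <- raisin_train |>
    group_by(Class) |>
    summarize(across(where(is.numeric), mean))
class_means
```

Predictors whose class means differ by a large amount relative to their spread are natural candidates for the model, although means alone do not show how much the two classes overlap.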


In [35]:
# Visualize the training data
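
One way to fill in this step is a scatter plot of two predictors colored by class. The sketch below assumes `raisin_train` from the split above; the choice of `MajorAxisLength` and `Eccentricity` is only illustrative and not a decision made in this proposal, and `raisin_plot` is a name introduced here.

```r
library(tidyverse)

# Assumes `raisin_train` was created by the split above.
# The two predictors plotted here are an illustrative choice, not the final selection.
raisin_plot <- raisin_train |>
    ggplot(aes(x = MajorAxisLength, y = Eccentricity, color = Class)) +
    geom_point(alpha = 0.5) +
    labs(x = "Major axis length", y = "Eccentricity", color = "Raisin variety") +
    theme(text = element_text(size = 12))
raisin_plot
```

If the two colors form visibly separate clusters, the plotted pair is likely to be useful for classification; heavy overlap suggests trying other predictors.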

## Methods

After exploring the data, selecting the predictors for the model, and splitting the data into training (75%) and testing (25%) sets, we will use cross-validation on the training set to choose the best value of k for our classification model.
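
The tuning step could look roughly like the sketch below. It rests on several assumptions that the proposal does not state: that the model is a K-nearest neighbors classifier (suggested by "k"), that all seven features are used as predictors, that 5 folds are used, and that the candidate values of k run from 1 to 51. It also assumes `raisin_train` from the split above and that the `kknn` package is installed. The seed and the names `raisin_train_fct`, `raisin_recipe`, `knn_spec`, `raisin_vfold`, `k_grid`, `knn_results`, and `best_k` are introduced here for illustration.

```r
library(tidymodels)

set.seed(2023)  # arbitrary seed so the folds are reproducible

# Classification models in tidymodels expect a factor outcome
raisin_train_fct <- raisin_train |>
    mutate(Class = as.factor(Class))

# Standardize the predictors: KNN is distance-based, and the features are on
# very different scales (Area is in the tens of thousands, Extent is below 1)
raisin_recipe <- recipe(Class ~ ., data = raisin_train_fct) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# KNN specification with the number of neighbors left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training set only (fold count is an assumption)
raisin_vfold <- vfold_cv(raisin_train_fct, v = 5, strata = Class)

# Candidate values of k (range is an assumption)
k_grid <- tibble(neighbors = seq(from = 1, to = 51, by = 5))

knn_results <- workflow() |>
    add_recipe(raisin_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = raisin_vfold, grid = k_grid) |>
    collect_metrics()

# Pick the k with the highest mean cross-validated accuracy
best_k <- knn_results |>
    filter(.metric == "accuracy") |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
best_k
```

Because the testing set is never touched during tuning, it stays available for an unbiased estimate of the final model's accuracy.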

## Expected outcomes and significance