In [None]:
library(tidyverse) 
library(dplyr)
library(tidymodels)

In [None]:
download.file("https://raw.githubusercontent.com/matthewsans/DCSI-100-group-project/main/penguins_lter.csv", "penguins")
penguin_data <- read_csv("penguins")
head(penguin_data)


In [None]:
# Rename the columns used below to short snake_case names (selected by position)
names(penguin_data)[2] <- "sample_number"
names(penguin_data)[3] <- "species"
names(penguin_data)[4] <- "region"
names(penguin_data)[5] <- "island"
names(penguin_data)[10] <- "culmen_length"
names(penguin_data)[11] <- "culmen_depth"
names(penguin_data)[12] <- "flipper_length"
names(penguin_data)[13] <- "body_mass"

set.seed(1234)
clean_penguin <- select(penguin_data, c(species, region, island, culmen_length, culmen_depth, flipper_length, body_mass))
clean_penguin <- clean_penguin |> mutate(species = as.factor(species))
head(clean_penguin)
split_penguin <- initial_split(clean_penguin, prop = 0.75, strata = species)
penguin_training <- training(split_penguin)  # USE FOR DATA VISUALIZATION/TRAINING
penguin_testing <- testing(split_penguin)    # DO NOT USE UNTIL FINAL TEST

Introduction
Background Information: The Antarctic region, characterized by its extreme climate and pristine wilderness, is a critical habitat for penguins. The dataset used in this project was collected in the Palmer Archipelago, off the western coast of the Antarctic Peninsula, which hosts three species of penguin.


Research Question: Can we predict a penguin's species from its culmen length (mm), culmen depth (mm), flipper length (mm), and body mass (g)? To answer this question, we use a dataset with measurements for three penguin species: Chinstrap, Adélie, and Gentoo.


Description of the Dataset: The dataset records, for each penguin, its species, culmen length (mm), culmen depth (mm), flipper length (mm), body mass (g), island (Dream, Torgersen, or Biscoe), and sex. It was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network, and describes the ecological and morphological characteristics of these penguin populations.


Methods:
To answer our question, we plan to predict penguin species (Adelie, Chinstrap, or Gentoo) from the variables "culmen_length", "culmen_depth", "flipper_length", and "body_mass". All four are numerical, so we can compute distances between penguins and classify each one by its nearest neighbors. This should let us predict a penguin's species given only its culmen length, culmen depth, flipper length, and body mass. To visualize the predictors, we plan to use histograms: we will plot each variable on its own histogram, color the bars by species to show the count from each, and arrange the charts in a grid so they are easier to compare.
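As a rough illustration of the nearest-neighbors approach described above, the sketch below fits a K-nearest neighbors classifier with tidymodels. Several things here are assumptions rather than part of the project: `toy_penguins` is a small simulated stand-in for `penguin_training` (the numbers are made up, not real measurements), `neighbors = 5` is an arbitrary placeholder rather than a tuned value, and `new_penguin` is a hypothetical observation. The predictors are standardized first because body mass (in grams) is on a much larger scale than the length measurements (in mm) and would otherwise dominate the distance calculation. The "kknn" engine requires the kknn package to be installed.

```r
library(tidymodels)

set.seed(1234)

# ASSUMPTION: simulated stand-in for penguin_training; values are made up
toy_penguins <- tibble(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 30)),
  culmen_length = c(rnorm(30, 39, 3), rnorm(30, 49, 3), rnorm(30, 47, 3)),
  culmen_depth = c(rnorm(30, 18, 1), rnorm(30, 18, 1), rnorm(30, 15, 1)),
  flipper_length = c(rnorm(30, 190, 7), rnorm(30, 196, 7), rnorm(30, 217, 7)),
  body_mass = c(rnorm(30, 3700, 450), rnorm(30, 3700, 400), rnorm(30, 5000, 500))
)

# Standardize all predictors so no single variable dominates the distances
penguin_recipe <- recipe(
  species ~ culmen_length + culmen_depth + flipper_length + body_mass,
  data = toy_penguins
) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# ASSUMPTION: K = 5 is a placeholder; in practice K would be chosen by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(penguin_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_penguins)

# Hypothetical new observation to classify
new_penguin <- tibble(
  culmen_length = 46, culmen_depth = 14.5,
  flipper_length = 215, body_mass = 5000
)
toy_prediction <- predict(knn_fit, new_penguin)
toy_prediction
```

With the real data, `toy_penguins` would be replaced by `penguin_training`, and accuracy would be assessed on `penguin_testing` only once, at the end.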

Expected outcomes and significance:
We expect to find that each penguin species has characteristic traits. For example, one species may typically have the shortest culmen. These findings could offer insight into how penguins evolved: traits that developed over millions of years may hint at the environmental conditions each species adapted to. This could lead to future questions about why some penguin species developed differently from others and, in a similar vein, how they might continue to change.

In [None]:

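# Mean of each measurement per species in the training set, ignoring missing values
penguin_mean <- penguin_training |>
      group_by(species) |>
      summarize(across(culmen_length:body_mass, \(x) mean(x, na.rm = TRUE)))
# Columns 2 to 5 hold the four means (column 1 is species)
colnames(penguin_mean)[2:5] <- c("mean_culmen_length", "mean_culmen_depth", "mean_flipper_length", "mean_body_mass")
head(penguin_mean)
# Total number of missing values across the whole training set
penguin_na <- sum(is.na(penguin_training))
penguin_explore <- mutate(penguin_mean, total_na = penguin_na)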

#Visualization
options(repr.plot.width = 8, repr.plot.height = 3)
penguin_visual <- ggplot(penguin_training, aes(x = flipper_length, y = culmen_length, color = species, shape = species)) +
    geom_point() + 
    labs(x = "Flipper Length (mm)", y = "Culmen Length (mm)", color = "Species", shape = "Species") +
    ggtitle("Culmen and Flipper Lengths of Different Species") +
    theme(text = element_text(size = 12)) 
penguin_visual
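
The Methods section also describes a grid of histograms, one per variable, with bars colored by species. One possible way to build it is sketched below, reshaping the data to long format and using `facet_wrap()`. This is only a sketch under stated assumptions: `toy_penguins` is a simulated stand-in for `penguin_training` (the values are made up), and the bin count and transparency are arbitrary choices.

```r
library(tidyverse)

set.seed(1234)

# ASSUMPTION: simulated stand-in for penguin_training; values are made up
toy_penguins <- tibble(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 30)),
  culmen_length = c(rnorm(30, 39, 3), rnorm(30, 49, 3), rnorm(30, 47, 3)),
  culmen_depth = c(rnorm(30, 18, 1), rnorm(30, 18, 1), rnorm(30, 15, 1)),
  flipper_length = c(rnorm(30, 190, 7), rnorm(30, 196, 7), rnorm(30, 217, 7)),
  body_mass = c(rnorm(30, 3700, 450), rnorm(30, 3700, 400), rnorm(30, 5000, 500))
)

# Long format: one row per (penguin, measurement) pair
toy_long <- toy_penguins |>
  pivot_longer(cols = -species, names_to = "measurement", values_to = "value")

# One histogram panel per measurement; free scales because the units differ
penguin_histograms <- ggplot(toy_long, aes(x = value, fill = species)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  facet_wrap(vars(measurement), scales = "free") +
  labs(x = "Measured value", y = "Count", fill = "Species") +
  theme(text = element_text(size = 12))
penguin_histograms
```

Overlapping, semi-transparent bars (`position = "identity"`) make it easier to see where the species' distributions separate than stacked bars would.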