**Title:** What the Shell?!

**Team members:** Laurelyne Barbier, Carter Gunning, Sebastian Martinez Sotomayor, and Tayte Stefaniuk

**Introduction**

Abalones are marine mollusks that are considered a delicacy in many cuisines, costing up to $100 per shell 😨 (The Pricer, 2021). This demand has placed many abalone species on endangered species lists and has prompted scientific research into their population trends (Kerlin, 2022). Part of this research involves determining the age of individuals by cutting the shell, staining it, and counting its growth rings under a microscope. This is a very tedious process, so scientists have turned to other measurements of the abalone to predict its age (Nash & Sellers, 1994).

The question we aim to answer through our project is therefore the following: Can the age of an abalone be predicted from several measurements of its dimensions and weight using a model built in R?

The dataset we will use to answer this question gives various measurements of the physical characteristics of 4178 different mollusks. These measurements include the length, diameter, height, whole weight, shucked weight, viscera weight (weight of the soft insides), shell weight, and number of rings of each mollusk. The one categorical variable is the sex of the mollusk, which can be male, female, or infant (infant is a category because mollusks can switch sexes) (Kaggle, n.d.).

**Preliminary exploratory data analysis**

In [33]:
###
### Tidy Package
###

library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

# set the seed so that the random train/test split is reproducible
set.seed(1969)


**Demonstrate that the dataset can be read from the web into R**


In [None]:
# read the abalone data from the copy stored under data/
raw_shell <- read_csv("data/abalone.csv")
raw_shell
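
The cell above reads a local copy of the file. To read the data straight from the web, `read_csv()` can also take a URL. The sketch below is one possible way to do that, assuming the copy hosted by the UCI Machine Learning Repository is an acceptable source. The URL, the `abalone_cols` vector, and the `read_abalone()` helper are all assumptions made for illustration and are not part of the project. The UCI file has no header row, so the column names are supplied by hand and chosen to match what the wrangling step expects.

```r
library(readr)

# Assumption: the UCI Machine Learning Repository copy of the abalone data.
# This URL does not appear in the project; replace it if the real source differs.
abalone_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

# The UCI file has no header row, so column names are supplied here.
# They mirror the names the wrangling step expects before lowercasing.
abalone_cols <- c("Sex", "Length", "Diameter", "Height", "Whole weight",
                  "Shucked weight", "Viscera weight", "Shell weight", "Rings")

# Hypothetical helper: read a headerless abalone file from a URL or a local path.
read_abalone <- function(source) {
  read_csv(source, col_names = abalone_cols, show_col_types = FALSE)
}

# Usage (needs an internet connection):
# raw_shell <- read_abalone(abalone_url)
```

Because `read_abalone()` accepts either a URL or a file path, the same call works with a downloaded copy if the remote file is unavailable.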

**Clean and wrangle your data into a tidy format:**

The data is already tidy, so we only standardize the column names and add an age column.


In [None]:
# lowercase the column names and replace spaces with underscores
names(raw_shell) <- tolower(names(raw_shell))
shell <- rename(raw_shell,
                whole_weight = "whole weight",
                shucked_weight = "shucked weight",
                viscera_weight = "viscera weight",
                shell_weight = "shell weight")

# add an age column, using age in years = number of rings + 1.5
shell <- mutate(shell, age = rings + 1.5)
shell


**Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.**


In [None]:
# split the data into 75% training and 25% testing sets
shell_split <- initial_split(shell, prop = 0.75)
shell_train <- training(shell_split)
shell_test <- testing(shell_split)

shell_train
shell_test


In [None]:
# number of observations in each class
shell_obs <- shell_train |>
    group_by(sex) |>
    summarize(count = n())
shell_obs

# means of predictor variables
shell_means <- shell_train |>
    select(length:shell_weight) |>
    map_df(mean)
shell_means
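
The summary prompt also asks how many rows have missing data, which the tables above do not report. A minimal sketch of how that count could be produced is shown here. The helper names `count_missing_rows()` and `count_missing_by_column()` are hypothetical, and the `toy` data frame is made up purely for illustration, so it says nothing about whether the abalone data actually contains missing values.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: number of rows with at least one missing value.
count_missing_rows <- function(df) {
  tibble(rows_with_missing = sum(!complete.cases(df)))
}

# Hypothetical helper: number of missing values in each column.
count_missing_by_column <- function(df) {
  df |>
    summarize(across(everything(), ~ sum(is.na(.x))))
}

# Made-up data for illustration only (not project data).
toy <- tibble(sex = c("M", "F", NA),
              length = c(0.45, NA, 0.53),
              rings = c(15, 9, 7))

count_missing_rows(toy)
count_missing_by_column(toy)
```

In the notebook, the same helpers would be called on `shell_train` so that the summary uses only the training data.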

**Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.**


In [None]:
# scatter plot of age against length, coloured by sex (training data only)
comparative_plot <- shell_train |>
    ggplot(aes(x = length, y = age)) +
    geom_point(aes(color = sex)) +
    labs(x = "Length (mm)", y = "Age (years)", color = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot
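
The prompt also suggests comparing the distributions of the predictor variables. One possible way to do that is to reshape the predictors into long format and draw one histogram per predictor, as sketched below. The helper `plot_predictor_distributions()` is hypothetical, the choice of 30 bins is arbitrary, and the `toy` data frame is randomly generated for illustration rather than taken from the abalone data.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: one histogram per predictor, each with its own axis scale.
plot_predictor_distributions <- function(df, predictors) {
  df |>
    select(all_of(predictors)) |>
    pivot_longer(everything(), names_to = "predictor", values_to = "value") |>
    ggplot(aes(x = value)) +
    geom_histogram(bins = 30) +
    facet_wrap(vars(predictor), scales = "free") +
    labs(x = "Measured value", y = "Number of abalones") +
    theme(text = element_text(size = 18))
}

# Randomly generated data for illustration only (not project data).
toy <- data.frame(length = runif(50),
                  height = runif(50),
                  shell_weight = runif(50))

predictor_plot <- plot_predictor_distributions(
  toy, c("length", "height", "shell_weight")
)
predictor_plot
```

In the notebook, the function would be called on `shell_train` with the predictor columns of interest. Free scales are used because the predictors are measured in different units.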