**Title:** What the Shell?!

**Team members:** Laurelyne Barbier, Carter Gunning, Sebastian Martinez Sotomayor, and Tayte Stefaniuk

**Introduction**

Abalones are marine mollusks that are considered a delicacy in many cuisines. Their popularity is reflected in their price, which can reach $100 per shell (The Pricer, 2021). Consequently, many species of this marine mollusk have been classified as endangered, which has prompted scientific research into their populations (Kerlin, 2022). Such research requires knowing the age structure of a population. Determining the age of an abalone involves cutting the shell, staining it, and counting the rings under a microscope.
The tediousness of this procedure inspired our group to explore the following question: Can the age of an abalone be predicted from several measurements of its dimensions and weight using a regression model in R?

The dataset we will use to answer this question gives various statistics on the physical characteristics of 4178 different mollusks. These statistics include the length, diameter, height, whole weight, shucked weight, viscera weight (the gut weight after bleeding), shell weight (after drying), and number of rings of each mollusk. The dataset also includes one categorical variable, the sex of the mollusk, which can be male, female, or infant (infant is a category because mollusks can switch sexes) (Kaggle, n.d.). We also added the variable age, which is the number of rings plus 1.5 (Nash & Sellers, 1994).


**Preliminary exploratory data analysis**

In [1]:
# Load packages
library(repr)
library(tidyverse)
library(tidymodels)

# Show at most 6 rows when printing data frames
options(repr.matrix.max.rows = 6)

# Set the seed so the train/test split is reproducible
set.seed(1969)


**Demonstrate that the dataset can be read from the web into R**


In [None]:
raw_shell <- read_csv("data/abalone.csv")
raw_shell

**Clean and wrangle your data into a tidy format:**

The data are already tidy, so we only standardize the column names and add an age column.


In [None]:
# Lowercase the column names and replace spaces with underscores
names(raw_shell) <- tolower(names(raw_shell))
shell <- rename(raw_shell,
                whole_weight = `whole weight`,
                shucked_weight = `shucked weight`,
                viscera_weight = `viscera weight`,
                shell_weight = `shell weight`)

# Add an age column: age in years = number of rings + 1.5
shell <- mutate(shell, age = rings + 1.5)
shell
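
The column clean-up above can also be done in one step instead of naming each column by hand. The sketch below is only an illustration: `toy_raw` is a made-up tibble with a few of the column names we assume the raw file uses, and `rename_with()` applies the same lowercase-and-underscore rule to every column.

```r
library(tidyverse)

# Toy stand-in for raw_shell (hypothetical values, a subset of the columns)
toy_raw <- tibble(
  Sex            = c("M", "F"),
  `Whole weight` = c(0.514, 0.2255),
  `Shell weight` = c(0.15, 0.07),
  Rings          = c(15, 7)
)

# Lowercase every column name and replace spaces with underscores
toy_clean <- toy_raw |>
  rename_with(~ str_replace_all(str_to_lower(.x), " ", "_"))

names(toy_clean)
```

This version does not need to be updated if more columns with spaces in their names are added.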


**Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.**


In [None]:
shell_split <- initial_split(shell, prop = 0.75)  
shell_train <- training(shell_split)
shell_test <- testing(shell_split)

shell_train
shell_test


In [None]:
# Number of observations of each sex
shell_obs <- shell_train |>
    group_by(sex) |>
    summarize(count = n())
shell_obs

# Means of the numerical predictor variables
shell_means <- shell_train |>
    select(length:shell_weight) |>
    map_df(mean)
shell_means
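
The prompt above also mentions counting rows with missing data. A minimal sketch of that check follows, using a small made-up tibble, `toy_shell`, so that it runs on its own. In the notebook the same calls would be applied to `shell_train`.

```r
library(tidyverse)

# Toy stand-in for shell_train (values are made up for illustration)
toy_shell <- tibble(
  sex    = c("M", "F", "I", "M"),
  length = c(0.455, NA, 0.330, 0.530),
  rings  = c(15, 7, NA, 9)
)

# Number of rows with at least one missing value
n_missing_rows <- sum(!complete.cases(toy_shell))
n_missing_rows

# Number of missing values in each column
missing_per_column <- toy_shell |>
  summarize(across(everything(), ~ sum(is.na(.x))))
missing_per_column
```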

**Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.**


In [None]:
comparative_plot_length <- shell_train|>
    ggplot(aes(x = length, y = age)) +
           geom_point(aes(color = sex, shape = sex)) +
    labs(x = "Length (mm)", y = "Age (years)", color = "Sex", shape = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot_length

In [None]:
comparative_plot_diameter <- shell_train|>
    ggplot(aes(x = diameter, y = age)) +
           geom_point(aes(color = sex, shape = sex)) +
    labs(x = "Diameter (mm)", y = "Age (years)", color = "Sex", shape = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot_diameter

In [None]:
comparative_plot_height <- shell_train|>
    ggplot(aes(x = height, y = age)) +
           geom_point(aes(color = sex, shape = sex)) +
    labs(x = "Height (mm)", y = "Age (years)", color = "Sex", shape = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot_height

In [None]:
comparative_plot_whole_weight <- shell_train|>
    ggplot(aes(x = whole_weight, y = age)) +
           geom_point(aes(color = sex, shape = sex)) +
    labs(x = "Whole weight (g)", y = "Age (years)", color = "Sex", shape = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot_whole_weight

In [None]:
comparative_plot_shucked_weight <- shell_train|>
    ggplot(aes(x = shucked_weight, y = age)) +
           geom_point(aes(color = sex, shape = sex)) +
    labs(x = "Shucked weight (g)", y = "Age (years)", color = "Sex", shape = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot_shucked_weight

In [None]:
comparative_plot_viscera_weight <- shell_train|>
    ggplot(aes(x = viscera_weight, y = age)) +
           geom_point(aes(color = sex, shape = sex)) +
    labs(x = "Viscera weight (g)", y = "Age (years)", color = "Sex", shape = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot_viscera_weight

In [None]:
comparative_plot_shell_weight <- shell_train|>
    ggplot(aes(x = shell_weight, y = age)) +
           geom_point(aes(color = sex, shape = sex)) +
    labs(x = "Shell weight (g)", y = "Age (years)", color = "Sex", shape = "Sex") +
    theme(text = element_text(size = 18))
comparative_plot_shell_weight
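
The seven scatterplots above share the same structure, so they could also be drawn as panels of a single figure. The sketch below is one possible approach, not the code used for the plots above: `toy_shell` is a made-up tibble with only three of the predictors, and the names `toy_long` and `comparative_plot_all` are hypothetical.

```r
library(tidyverse)

# Toy stand-in for shell_train (values are made up for illustration)
toy_shell <- tibble(
  sex      = c("M", "F", "I", "M", "F", "I"),
  length   = c(0.455, 0.530, 0.330, 0.440, 0.545, 0.425),
  diameter = c(0.365, 0.420, 0.255, 0.365, 0.425, 0.300),
  height   = c(0.095, 0.135, 0.080, 0.125, 0.125, 0.095),
  age      = c(16.5, 10.5, 8.5, 11.5, 17.5, 9.5)
)

# Reshape so that each row holds one measurement of one abalone
toy_long <- toy_shell |>
  pivot_longer(cols = length:height,
               names_to = "measurement",
               values_to = "value")

# One scatterplot panel per measurement, each with its own x-axis range
comparative_plot_all <- toy_long |>
  ggplot(aes(x = value, y = age)) +
  geom_point(aes(color = sex, shape = sex)) +
  facet_wrap(vars(measurement), scales = "free_x") +
  labs(x = "Measurement value", y = "Age (years)", color = "Sex", shape = "Sex") +
  theme(text = element_text(size = 18))
comparative_plot_all
```

Putting the panels side by side makes it easier to compare how strongly each measurement relates to age.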

**Methods**

The dataset we are using to answer this question gives various statistics on the physical
characteristics of 4178 different mollusks (Kaggle, n.d.).

The characteristics associated with this data set include:
• Length
• Diameter
• Height
• Whole weight
• Shucked weight
• Viscera weight (this is the gut weight after bleeding)
• Shell weight (after being dried)
• Number of rings
• Sex (categorical variable): male, female, or infant (infant is a category because mollusks can switch sexes)

Note: The variable age, which is the number of rings plus 1.5 (Nash & Sellers, 1994), was added.

1. The data were already tidy, so the only wrangling was to standardize the column names and add the age column.
2. The initial_split() function was used to split the data frame into 75% training and 25% testing data.
3. To count the observations in each category of the variable sex, the training data were grouped and counted with the group_by() and summarize() functions.
4. The map_df() function was used to calculate the mean of each numerical predictor variable in the training data.
5. To identify the best predictors to use in the investigation, scatterplots of age against each physical characteristic, colored by sex, were generated from the training data.
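
The research question calls for a regression model, but the steps above do not yet say which one. As a rough sketch only, the code below assumes a linear regression fitted with tidymodels. The data are simulated rather than real abalone measurements, the choice of `length` and `shell_weight` as predictors is arbitrary, and all object names are hypothetical. Another model, such as K-nearest neighbors regression, could be swapped in by changing the model specification.

```r
library(tidyverse)
library(tidymodels)

set.seed(1969)

# Simulated stand-in for the abalone data (not real measurements)
toy_shell <- tibble(
  length       = runif(200, 0.1, 0.8),
  shell_weight = runif(200, 0.01, 1)
) |>
  mutate(age = 5 + 8 * length + 6 * shell_weight + rnorm(200, sd = 1))

# Same 75% / 25% split as in step 2
toy_split <- initial_split(toy_shell, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Model specification: ordinary linear regression (an assumption)
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Recipe: predict age from two of the measurements
lm_recipe <- recipe(age ~ length + shell_weight, data = toy_train)

# Fit the workflow on the training data only
lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = toy_train)

# Evaluate on the held-out test data (RMSE, R-squared, MAE)
lm_results <- lm_fit |>
  predict(toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = age, estimate = .pred)
lm_results
```

The test-set RMSE is in years of age, so it can be read directly as a typical prediction error.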

**Expected outcomes and significance**

We expect to find that the measurements 