# Group 10 Project Proposal
## Introduction

Abalones are sea snails prized as a seafood delicacy. The price of an abalone is positively correlated with its age, but determining that age is tedious and requires a lot of manual labor. We aim to streamline this process using predictive modeling.

***Question:***
How old is a given abalone?

In this project, we will develop a model that estimates the age of a given abalone from easily measurable traits such as its physical dimensions, weight, and possibly sex. We discuss why this matters in the Expected Outcomes and Significance section of this proposal. The data set is taken from the UC Irvine Machine Learning Repository and consists of comma-separated values with no header row. It includes the ***sex, length (mm), diameter (mm), height (mm), whole weight (grams), shucked weight (grams, weight of meat), viscera weight (grams, gut weight after bleeding), shell weight (grams, weight of shell after drying), and number of rings*** of each abalone. Note that the values in the file appear to be scaled rather than raw measurements (for example, the first length is recorded as 0.455), so the units should be confirmed against the data set documentation. There are 4177 instances in this data set. The age of an abalone in years is the number of rings it has plus 1.5. The target variable for our model will be the number of rings; we will then compute the age from the predicted number of rings.
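
The ring-to-age conversion described above is simple enough to sketch directly. The function name `rings_to_age` and the constant name are hypothetical, chosen only for illustration; the ring counts 15, 7, and 9 are the first three values in the data file.

```python
# Offset in years added to the ring count (from the data set description).
RING_TO_AGE_OFFSET = 1.5


def rings_to_age(rings: float) -> float:
    """Convert a ring count (possibly fractional, if predicted) to an age in years."""
    return rings + RING_TO_AGE_OFFSET


# Ring counts from the first three rows of the data file.
for rings in (15, 7, 9):
    print(rings, "rings ->", rings_to_age(rings), "years")
```

Because the offset is a constant, predicting rings and then converting gives the same age as predicting age directly.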

## Preliminary Exploratory Data Analysis

In [28]:
import pandas as pd
import altair as alt
from sklearn.model_selection import GridSearchCV, train_test_split
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

Reading data from web:

In [12]:
url='https://drive.google.com/file/d/1nPiV8p49ZExhs_C8TnExmvSFRQOzKi90/view?usp=sharing'
file_id = url.split('/')[-2]
read_url='https://drive.google.com/uc?id=' + file_id

abalone_data = pd.read_csv(
    read_url,
    header=None,
    names=[
        'sex', 'length', 'diameter', 'height', 'weight_whole',
        'weight_shucked', 'weight_viscera', 'weight_shell', 'rings',
    ],
)
abalone_data.head(10)

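|   | sex | length | diameter | height | weight_whole | weight_shucked | weight_viscera | weight_shell | rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |
| 5 | I | 0.425 | 0.3 | 0.095 | 0.3515 | 0.141 | 0.0775 | 0.12 | 8 |
| 6 | F | 0.53 | 0.415 | 0.15 | 0.7775 | 0.237 | 0.1415 | 0.33 | 20 |
| 7 | F | 0.545 | 0.425 | 0.125 | 0.768 | 0.294 | 0.1495 | 0.26 | 16 |
| 8 | M | 0.475 | 0.37 | 0.125 | 0.5095 | 0.2165 | 0.1125 | 0.165 | 9 |
| 9 | F | 0.55 | 0.44 | 0.15 | 0.8945 | 0.3145 | 0.151 | 0.32 | 19 |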


The data is already fairly tidy. Each column contains a single variable, each row is a single observation, and each cell is a single value. We have named the columns to match their respective measurements.

In [13]:
# TODO: Do we need to really add anything? We could add an age column, but that could be confusing, as we are using rings as the target variable. 
# data looks clean to me, I think we should add an age column after predicting the number of rings. 

Splitting data:

In [14]:
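# Split the data into training (75%) and testing (25%) sets.
# random_state is an arbitrary seed, added so the split is reproducible.
# SIDE NOTE: should we actually be using all 4177 rows??
abalone_train, abalone_test = train_test_split(
    abalone_data, train_size=0.75, random_state=2023
)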

Summary table:

In [15]:
# TODO: create a summary table with the number of rows, maybe the units for each of the variables, which ones are predictors, maybe grouping them...?
# Definitely should take another look at the criteria for the proposal
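
One possible shape for the summary table is sketched below. It is only a sketch under assumptions: the choice of statistics (row count, missing values, mean, min, max) and the `role` column are illustrative, and the five rows are copied from the data preview so that the code runs on its own. In the notebook, `abalone_train` would take the place of `sample`.

```python
import pandas as pd

# First five rows of the data preview, used here only so the sketch is self-contained.
sample = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I"],
    "length": [0.455, 0.35, 0.53, 0.44, 0.33],
    "diameter": [0.365, 0.265, 0.42, 0.365, 0.255],
    "height": [0.095, 0.09, 0.135, 0.125, 0.08],
    "weight_whole": [0.514, 0.2255, 0.677, 0.516, 0.205],
    "weight_shucked": [0.2245, 0.0995, 0.2565, 0.2155, 0.0895],
    "weight_viscera": [0.101, 0.0485, 0.1415, 0.114, 0.0395],
    "weight_shell": [0.15, 0.07, 0.21, 0.155, 0.055],
    "rings": [15, 7, 9, 10, 7],
})

# Summarize the numeric columns only; sex is categorical and counted separately.
numeric = sample.drop(columns=["sex"])

summary = pd.DataFrame({
    "n_rows": numeric.count(),
    "n_missing": numeric.isna().sum(),
    "mean": numeric.mean(),
    "min": numeric.min(),
    "max": numeric.max(),
})

# Mark which variable is the target and which are predictors.
summary["role"] = ["target" if col == "rings" else "predictor" for col in summary.index]

print(summary)
print(sample["sex"].value_counts())
```

Each row of `summary` describes one variable, which keeps the table readable as more statistics are added.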

Plots:

In [31]:
# TODO: create some plots (maybe with captions)
# I'm not sure if they need to compare the target variable to the predictors, or if it's just within the predictors...I think the former probably
# makes more sense, but it'd be wise to read over the proposal a couple more times...

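# Histogram of each numeric measurement in the training data.
# Binning and axis titles are passed as arguments to alt.X / alt.Y; the chained
# .bin() / .title() form is only available in newer Altair releases.
# The facets use independent x and y scales because the measurements have different ranges.
# TODO: decide whether to plot only a few of the measurements.

abalone_train_melt = abalone_train.melt(
    id_vars=["sex", "rings"],
    var_name="measurement",
    value_name="value"
)

predictor_distribution = alt.Chart(abalone_train_melt).mark_bar().encode(
    x=alt.X("value", bin=alt.Bin(maxbins=30), title="Measured value"),
    y=alt.Y("count()", title="Number of abalones")
).facet(
    "measurement", columns=3
).resolve_scale(x="independent", y="independent")

predictor_distribution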

## Methods

We will use regression to predict the age of an abalone. Although the number of rings is numerical, it is a discrete count rather than a continuous value. We still prefer regression because our ultimate goal is to estimate age, which is continuous. The model will predict the number of rings, possibly as a decimal value, and we will add 1.5 to that prediction to obtain the approximate age in years. We may also try classification later to see how the results differ.

The predictors will be the variables describing the dimensions and weight of the abalone, that is, every variable except the target (number of rings) and sex. Sex is categorical, whereas the other variables are numerical and continuous. In addition, sex is harder to measure or determine than the other variables without prior knowledge.
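
The approach described above could look roughly like the following sketch. The proposal does not name a regression algorithm, so K-nearest neighbors regression with standardized predictors is assumed here purely for illustration, as are the candidate values of `n_neighbors` and the use of 5-fold cross-validation. The ten rows are copied from the data preview so the code runs on its own; the real analysis would fit on the training split instead.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

columns = ["sex", "length", "diameter", "height", "weight_whole",
           "weight_shucked", "weight_viscera", "weight_shell", "rings"]
rows = [
    ("M", 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15, 15),
    ("M", 0.35, 0.265, 0.09, 0.2255, 0.0995, 0.0485, 0.07, 7),
    ("F", 0.53, 0.42, 0.135, 0.677, 0.2565, 0.1415, 0.21, 9),
    ("M", 0.44, 0.365, 0.125, 0.516, 0.2155, 0.114, 0.155, 10),
    ("I", 0.33, 0.255, 0.08, 0.205, 0.0895, 0.0395, 0.055, 7),
    ("I", 0.425, 0.3, 0.095, 0.3515, 0.141, 0.0775, 0.12, 8),
    ("F", 0.53, 0.415, 0.15, 0.7775, 0.237, 0.1415, 0.33, 20),
    ("F", 0.545, 0.425, 0.125, 0.768, 0.294, 0.1495, 0.26, 16),
    ("M", 0.475, 0.37, 0.125, 0.5095, 0.2165, 0.1125, 0.165, 9),
    ("F", 0.55, 0.44, 0.15, 0.8945, 0.3145, 0.151, 0.32, 19),
]
sample = pd.DataFrame(rows, columns=columns)

# Predictors: every column except the target (rings) and sex.
X = sample.drop(columns=["sex", "rings"])
y = sample["rings"]

# Standardize the predictors, then fit a K-nearest neighbors regressor (assumed model).
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Choose the number of neighbors by cross-validated RMSE.
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": [1, 2, 3, 4]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# Predict rings (possibly fractional), then convert to age in years.
predicted_rings = search.predict(X)
predicted_age = predicted_rings + 1.5

print(search.best_params_)
print(predicted_age)
```

Ten rows are far too few to say anything about model quality; the sketch only shows how the predictors, target, and age conversion fit together.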

### Visualization
One way to visualize the regression is a scatter plot with age on the y-axis and a predictor on the x-axis. To show a basic trend between the target variable and a predictor, we could also use a bar graph.

TODO: add more for Visualization

## Expected Outcomes and Significance

### Significance
95% of the world's abalone is farmed. Creating a model to easily and quickly determine the age of an abalone would greatly help with sorting and pricing abalones. It could also let cooks anticipate how an abalone might taste before cooking it, based on its age.

Future questions that may lead from this include ______