# Group 10 Project Proposal
## Introduction

Abalones are sea snails prized as a seafood delicacy. The price of an abalone is positively correlated with its age. The standard method of determining an abalone's age, which involves cutting the shell, staining it, and counting the rings under a microscope, is tedious and laborious. We aim to streamline this process using predictive modeling.

***Question:***
How does the number of rings an abalone has, and consequently its age, depend on its physical dimensions and weight?

In this project, we will develop a model that predicts the age of an abalone from easily measurable physical traits. Our data set comes from the UCI Machine Learning Repository. It is a comma-separated file with no header row and contains 4177 instances. Each instance records the ***sex, length (mm), diameter (mm), height (mm), whole weight (grams), shucked weight (grams, meat weight), viscera weight (grams, gut weight after bleeding), shell weight (grams, after drying), and number of rings*** of a given abalone. The continuous variables have been pre-scaled by dividing by 200. Our target variable is the number of rings.
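
Because the continuous variables were divided by 200, multiplying a stored value by 200 recovers the original measurement. A minimal sketch, assuming the constant 200 from the description above; the helper name `unscale` is ours and is not part of the data set. The two inputs are the scaled length and whole weight of the first record.

```python
SCALE_FACTOR = 200  # stated in the data set description for the continuous columns

def unscale(scaled_value, factor=SCALE_FACTOR):
    """Undo the pre-scaling by multiplying by the same factor (hypothetical helper)."""
    return scaled_value * factor

# Scaled length and whole weight of the first record in the data set
length_mm = unscale(0.455)
weight_whole_g = unscale(0.514)
print(round(length_mm, 1), round(weight_whole_g, 1))
```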

## Preliminary Exploratory Data Analysis

In [2]:
pip install -U altair

Collecting altair
  Downloading altair-5.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting narwhals>=1.5.2 (from altair)
  Downloading narwhals-1.8.3-py3-none-any.whl.metadata (6.8 kB)
Downloading altair-5.4.1-py3-none-any.whl (658 kB)
Downloading narwhals-1.8.3-py3-none-any.whl (169 kB)
Installing collected packages: narwhals, altair
Successfully installed altair-5.4.1 narwhals-1.8.3
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import altair as alt
from sklearn.model_selection import GridSearchCV, train_test_split
alt.data_transformers.disable_max_rows()
alt.renderers.enable('jupyterlab')

RendererRegistry.enable('jupyterlab')

### Reading data from web:

In [3]:
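# Google Drive share link for the data file
url = 'https://drive.google.com/file/d/1nPiV8p49ZExhs_C8TnExmvSFRQOzKi90/view?usp=sharing'
# The file id is the second-to-last path segment of the share link
file_id = url.split('/')[-2]
# Direct-download URL that pandas can read
read_url = 'https://drive.google.com/uc?id=' + file_id

# The file has no header row, so column names are supplied explicitly
abalone_data = pd.read_csv(
    read_url,
    header=None,
    names=[
        'sex', 'length', 'diameter', 'height', 'weight_whole',
        'weight_shucked', 'weight_viscera', 'weight_shell', 'rings',
    ],
)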
abalone_data.head(10)

Unnamed: 0,sex,length,diameter,height,weight_whole,weight_shucked,weight_viscera,weight_shell,rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
5,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
6,F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
7,F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
8,M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
9,F,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19


The data is already fairly tidy. We have named the columns to match their respective measurements.

### Splitting data:

In [4]:
abalone_train, abalone_test = train_test_split(
    abalone_data, train_size=0.75
)
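
The split above draws a different random sample on every run, so the tables and plots that follow can change between executions. One way to make it repeatable, sketched here under the assumption that a fixed seed is acceptable, is to pass `random_state`. The seed value 123 and the stand-in list of row indices are illustrative only.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 4177 rows of the data set (row indices only)
rows = list(range(4177))

# random_state fixes the shuffle, so repeated runs give the same split
train_a, test_a = train_test_split(rows, train_size=0.75, random_state=123)
train_b, test_b = train_test_split(rows, train_size=0.75, random_state=123)

print(len(train_a), len(test_a))
print(train_a == train_b)
```

With 4177 rows and `train_size=0.75`, the training portion has 3132 rows, which matches the entry count reported by `abalone_train.info()`.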

### Summary tables:

#### Basic Shape:

In [5]:
abalone_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3132 entries, 817 to 452
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             3132 non-null   object 
 1   length          3132 non-null   float64
 2   diameter        3132 non-null   float64
 3   height          3132 non-null   float64
 4   weight_whole    3132 non-null   float64
 5   weight_shucked  3132 non-null   float64
 6   weight_viscera  3132 non-null   float64
 7   weight_shell    3132 non-null   float64
 8   rings           3132 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 244.7+ KB


#### Describing Variables:

In [6]:
variable_info = {
    'column':['sex','length','diameter','height','weight_whole','weight_shucked','weight_viscera','weight_shell','rings'],
    'dtype':[
        'nominal (object)','continuous (float)','continuous (float)','continuous (float)','continuous (float)','continuous (float)','continuous (float)','continuous (float)','integer'
    ],
    'measurement':['Male(M), Female(F) or Infant(I)','mm','mm','mm','grams','grams','grams','grams','']
}
variable_info=pd.DataFrame(variable_info)
variable_info

Unnamed: 0,column,dtype,measurement
0,sex,nominal (object),"Male(M), Female(F) or Infant(I)"
1,length,continuous (float),mm
2,diameter,continuous (float),mm
3,height,continuous (float),mm
4,weight_whole,continuous (float),grams
5,weight_shucked,continuous (float),grams
6,weight_viscera,continuous (float),grams
7,weight_shell,continuous (float),grams
8,rings,integer,


#### Summary Statistics for Numerical Columns:

In [7]:
# A list (not a set) keeps the row order of the summary deterministic
summary_stats = abalone_train.drop(columns='sex').agg(['min', 'std', 'max', 'mean'])
summary_stats

Unnamed: 0,length,diameter,height,weight_whole,weight_shucked,weight_viscera,weight_shell,rings
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
std,0.120254,0.099144,0.03896,0.490123,0.222666,0.109427,0.138199,3.184778
max,0.815,0.65,0.515,2.8255,1.488,0.6415,1.005,29.0
mean,0.524349,0.407875,0.13948,0.828298,0.359475,0.180795,0.238235,9.907407


#### Distribution of Target Variable:

In [8]:
ring_count=abalone_train['rings'].value_counts().reset_index()
ring_count.columns=['rings','count']
ring_count=ring_count.sort_values(by=['rings']).reset_index(drop=True)
ring_count

Unnamed: 0,rings,count
0,1,1
1,2,1
2,3,11
3,4,45
4,5,90
5,6,187
6,7,295
7,8,412
8,9,527
9,10,479


### Visualization

#### Distribution of Possible Predictors:

In [9]:
length_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("length").title("Length (scaled and binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

diameter_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("diameter").title("Diameter (scaled and binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

height_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("height").title("Height (scaled and binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

weight_whole_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("weight_whole").title("Whole Weight (scaled and binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

weight_shucked_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("weight_shucked").title("Shucked Weight (scaled and binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

weight_viscera_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("weight_viscera").title("Viscera Weight (scaled and binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

weight_shell_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("weight_shell").title("Shell Weight (scaled and binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

rings_distribution = alt.Chart(abalone_train).mark_bar().encode(
    x = alt.X("rings").title("Number of rings (binned)").bin(maxbins=30),
    y = alt.Y("count()").title("Count")
)

alt.vconcat(
    alt.hconcat(length_distribution, diameter_distribution, height_distribution),
    alt.hconcat(weight_whole_distribution, weight_shucked_distribution, weight_viscera_distribution),
    alt.hconcat(weight_shell_distribution, rings_distribution)
)


#### Relationship Between Possible Predictors and Target:

In [10]:
length_v_rings = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("length").title("Length (scaled)"),
    y = alt.Y("rings").title("Number of rings")
)

diameter_v_rings = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("diameter").title("Diameter (scaled)"),
    y = alt.Y("rings").title("Number of rings")
)

height_v_rings = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("height").title("Height (scaled)"),
    y = alt.Y("rings").title("Number of rings")
)

weight_whole_v_rings = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("weight_whole").title("Whole Weight (scaled)"),
    y = alt.Y("rings").title("Number of rings")
)

weight_shucked_v_rings = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("weight_shucked").title("Shucked Weight (scaled)"),
    y = alt.Y("rings").title("Number of rings")
)

weight_viscera_v_rings = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("weight_viscera").title("Viscera Weight (scaled)"),
    y = alt.Y("rings").title("Number of rings")
)

weight_shell_v_rings = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("weight_shell").title("Shell Weight (scaled)"),
    y = alt.Y("rings").title("Number of rings")
)

alt.vconcat(
    alt.hconcat(length_v_rings, diameter_v_rings, height_v_rings),
    alt.hconcat(weight_whole_v_rings, weight_shucked_v_rings, weight_viscera_v_rings),
    weight_shell_v_rings
)

#### Relationship Between Similar Predictors:

In [11]:
length_v_diameter = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("length").title("Length (scaled)"),
    y = alt.Y("diameter").title("Diameter (scaled)")
)

whole_weight_v_shucked_weight = alt.Chart(abalone_train).mark_point(opacity=0.3).encode(
    x = alt.X("weight_whole").title("Whole weight (scaled)"),
    # Plot the shucked weight column so the data matches the axis title
    y = alt.Y("weight_shucked").title("Shucked weight (scaled)")
)

length_v_diameter | whole_weight_v_shucked_weight
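
The strength of a linear relationship between two similar predictors can also be summarized with the Pearson correlation coefficient. The sketch below computes it by hand for length and diameter using only the ten rows shown in the `head(10)` preview, so the value is illustrative and is not a statistic of the full training set. `pearson_r` is a helper defined here for the example.

```python
from math import sqrt

# Length and diameter (scaled) from the ten preview rows
length = [0.455, 0.35, 0.53, 0.44, 0.33, 0.425, 0.53, 0.545, 0.475, 0.55]
diameter = [0.365, 0.265, 0.42, 0.365, 0.255, 0.3, 0.415, 0.425, 0.37, 0.44]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

r = pearson_r(length, diameter)
print(round(r, 2))
```

On the full training set, `abalone_train[['length', 'diameter']].corr()` reports the same statistic.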

## Methods

Although the number of rings is a discrete count, it takes many ordered values and serves as a proxy for age, which is continuous, so regression should be an appropriate method of prediction. Approximate age in years can be found by adding 1.5 to the number of rings our model predicts. We may also explore classification later to compare results.
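
The ring-to-age conversion is a single addition. A minimal sketch, using the 1.5-year offset stated above; `rings_to_age` is a hypothetical helper name.

```python
RING_TO_AGE_OFFSET = 1.5  # years added to the ring count, as stated above

def rings_to_age(rings):
    """Convert a (possibly fractional) predicted ring count to approximate age in years."""
    return rings + RING_TO_AGE_OFFSET

print(rings_to_age(15))   # the first abalone in the preview has 15 rings
print(rings_to_age(9.9))  # regression output need not be a whole number
```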

Diameter, height, and whole weight will be our main predictors. This avoids the redundancy, and possible bias, of using four weight predictors and two length predictors: as seen in our graphs, similar variables have overlapping distributions and strong linear relationships with one another. Sex will not be used because it is categorical rather than numerical.
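
A minimal sketch of how the three chosen predictors could feed a regression model. The proposal does not fix a regression algorithm. Ordinary least squares via scikit-learn's `LinearRegression` is assumed here purely for illustration, and the model is fit on only the ten preview rows shown earlier, so the fitted coefficients carry no meaning.

```python
from sklearn.linear_model import LinearRegression

# Columns: diameter, height, weight_whole (scaled), from the ten preview rows
X = [
    [0.365, 0.095, 0.514],
    [0.265, 0.09, 0.2255],
    [0.42, 0.135, 0.677],
    [0.365, 0.125, 0.516],
    [0.255, 0.08, 0.205],
    [0.3, 0.095, 0.3515],
    [0.415, 0.15, 0.7775],
    [0.425, 0.125, 0.768],
    [0.37, 0.125, 0.5095],
    [0.44, 0.15, 0.8945],
]
y = [15, 7, 9, 10, 7, 8, 20, 16, 9, 19]  # rings for the same rows

model = LinearRegression().fit(X, y)   # assumed algorithm, not a final choice
predicted_rings = model.predict(X)     # one prediction per row
predicted_age = predicted_rings + 1.5  # approximate age in years

print(model.coef_.shape, predicted_age[:3])
```

In the notebook, the same call would take `abalone_train[['diameter', 'height', 'weight_whole']]` and `abalone_train['rings']` as inputs.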

### Visualization
Our main visualization will be a 3-dimensional scatter plot with the predictors on the x, y, and color axes and the target on the z-axis. This will give us an idea of how our model is working. Using bar graphs to show comparisons can also be effective.

We will also use a confusion matrix, which requires treating ring counts as discrete classes, to see how accuracy varies across the range of ring counts.
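
A confusion matrix compares discrete labels, so regression output has to be converted to classes first. One possible approach, assumed here, is to round each predicted ring count to the nearest integer and tabulate it against the true count. The true and predicted values below are made up for illustration, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair of classes occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Made-up values for illustration only, not model results
true_rings = [7, 8, 9, 9, 10]
predicted_rings = [7.2, 8.9, 9.1, 10.4, 9.8]

# Round continuous predictions to the nearest whole ring count
predicted_classes = [round(p) for p in predicted_rings]

matrix = confusion_counts(true_rings, predicted_classes)
# Accuracy is the share of pairs on the diagonal (true == predicted)
accuracy = sum(n for (t, p), n in matrix.items() if t == p) / len(true_rings)
print(matrix[(9, 9)], matrix[(8, 9)], accuracy)
```

scikit-learn's `confusion_matrix` in `sklearn.metrics` produces the same counts as an array.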

## Expected Outcomes 

As the preliminary graphs show, the number of rings is positively related to all our predictors. We expect our model to find a positive, linear relationship between the predictors and the target. One concern is that the accuracy of the model may decrease when predicting above 7 rings, because the relationship between weight and rings weakens in that range.

### Significance
A model that quickly and easily determines the age of an abalone would greatly help with sorting and pricing abalones. This would benefit not only farmers but also consumers, who could then purchase based on age (abalone tastes better with age).

### Future topics of research include:
- How do abalone prices depend on age?
- What is the optimal age for an abalone to be sold on the market to maximize profits?