## Capstone 2 - Abalone Age Prediction
### Data Wrangling
**Context**:

The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

_Credit: https://www.kaggle.com/rodolfomendes/abalone-dataset_


**Goal**: The goal of this capstone project is to build a regression model that can predict the age of an abalone shell by accurately predicting its ring count.


**Data Dictionary**:

Sex: Male (M), Female (F), Infant (I)

Length: Longest shell measurement

Diameter: Perpendicular to length

Height: With meat in shell

Whole weight: Whole abalone weight

Shucked weight: Weight of meat

Viscera weight: Gut weight (after bleeding)

Shell weight: Weight after being dried

Rings: Number of rings on shell

Age (added): Number of rings + 1.5 gives the abalone's age in years

In [61]:
#Import pandas, matplotlib.pyplot, seaborn, and os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [62]:
#Save abalone data CSV file into a pandas dataframe
abalone_data = pd.read_csv('')

In [63]:
#Call the info to see summary of data
abalone_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [64]:
#Call head to see first several rows of data
abalone_data.head()

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


In [65]:
#Confirm there are no missing values in dataset
abalone_data.isnull().sum()

Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
dtype: int64

In [66]:
#Call describe on data to see summary of statistics
abalone_data.describe(include = 'all')

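| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| count | 4177 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 |
| unique | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | M | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 1528 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 0.523992 | 0.407881 | 0.139516 | 0.828742 | 0.359367 | 0.180594 | 0.238831 | 9.933684 |
| std | NaN | 0.120093 | 0.09924 | 0.041827 | 0.490389 | 0.221963 | 0.109614 | 0.139203 | 3.224169 |
| min | NaN | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 | 1.0 |
| 25% | NaN | 0.45 | 0.35 | 0.115 | 0.4415 | 0.186 | 0.0935 | 0.13 | 8.0 |
| 50% | NaN | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.234 | 9.0 |
| 75% | NaN | 0.615 | 0.48 | 0.165 | 1.153 | 0.502 | 0.253 | 0.329 | 11.0 |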
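
The `min` row above reports a Height of 0.0, which a null check alone would not flag. A minimal sketch of how non-positive measurements could be counted follows. It uses a small hand-built frame rather than the full dataset: the first three rows are copied from the `head()` output, and the last row is a hypothetical zero-height record added only for illustration.

```python
import pandas as pd

# Small illustrative frame: the first three rows come from the head() output;
# the last row is hypothetical and exists only to show a zero Height being caught.
sample = pd.DataFrame({
    'Sex': ['M', 'M', 'F', 'I'],
    'Length': [0.455, 0.35, 0.53, 0.43],
    'Diameter': [0.365, 0.265, 0.42, 0.34],
    'Height': [0.095, 0.09, 0.135, 0.0],
})

# Count non-positive values in each numeric measurement column
numeric_cols = sample.select_dtypes(include='number').columns
non_positive_counts = (sample[numeric_cols] <= 0).sum()
print(non_positive_counts)

# Pull out the offending rows for inspection
suspect_rows = sample[(sample[numeric_cols] <= 0).any(axis=1)]
print(suspect_rows)
```

Running the same two expressions on `abalone_data` would show how many records are affected. Whether to drop or impute them is a decision for the analysis phase.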


In [67]:
#Call describe on categorical data to see summary of statistics
abalone_data.describe(include =['object'])

| | Sex |
|---|---|
| count | 4177 |
| unique | 3 |
| top | M |
| freq | 1528 |


In [68]:
#Call value_counts to view categorical 'Sex' column
abalone_data['Sex'].value_counts()

M    1528
I    1342
F    1307
Name: Sex, dtype: int64
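
The three classes are close in size, so relative frequencies may be easier to compare than raw counts. The sketch below rebuilds a `Sex` column from the counts reported above instead of loading the file, then uses the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Rebuild a Sex column from the counts reported by value_counts() above
sex = pd.Series(['M'] * 1528 + ['I'] * 1342 + ['F'] * 1307, name='Sex')

# normalize=True returns proportions instead of raw counts
sex_share = sex.value_counts(normalize=True).round(3)
print(sex_share)
```

On the real data the equivalent call would be `abalone_data['Sex'].value_counts(normalize=True)`.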

In [69]:
#Add column for age which will be the response variable and drop the Rings column
#An abalone shell's age in years is the ring count plus 1.5
abalone_data['Age'] = abalone_data['Rings']+1.5
abalone_data.drop('Rings', axis = 1, inplace = True)

In [70]:
#Call head to confirm the Age column is added and Rings column removed
abalone_data.head()

| | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 16.5 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 8.5 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 10.5 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 11.5 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 8.5 |
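
The same transformation can also be written without `inplace=True`, which leaves the source frame untouched and makes the cell safe to re-run. This is an alternative sketch, not the notebook's own code. It uses only the `Sex` and `Rings` values of the first five rows shown by the earlier `head()` call.

```python
import pandas as pd

# Sex labels and Rings values of the first five rows, copied from the earlier head() output
sample = pd.DataFrame({
    'Sex': ['M', 'M', 'F', 'M', 'I'],
    'Rings': [15, 7, 9, 10, 7],
})

# assign() adds Age and drop() removes Rings; both return a new frame,
# so `sample` itself is left unchanged
with_age = sample.assign(Age=sample['Rings'] + 1.5).drop(columns='Rings')
print(with_age)
```

Because the original frame keeps its `Rings` column, running the cell a second time does not raise a `KeyError`, as the in-place version would.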


In [71]:
#Save the wrangled data to a new CSV file, leaving out the dataframe index
abalone_data.to_csv('abaloneDW_cleaned.csv', index=False)
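
A quick round-trip check can confirm the saved file reloads with the expected columns. The sketch below is illustrative only: it writes a two-row stand-in frame (values taken from the final `head()` output) to a temporary directory, so the real output file is not touched.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the final head() output, standing in for the full frame
cleaned = pd.DataFrame({
    'Sex': ['M', 'F'],
    'Length': [0.455, 0.53],
    'Age': [16.5, 10.5],
})

# Write to a temporary directory so the real output file is not overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'abaloneDW_cleaned.csv')
    cleaned.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# With index=False, no extra index column appears on reload
print(reloaded.columns.tolist())

# Raises an AssertionError if the reloaded data differs from what was written
pd.testing.assert_frame_equal(reloaded, cleaned)
```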

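This overview shows 4,177 complete records with no missing values and consistent data types, and the Age response variable has been added in place of Rings. One item to carry into the exploratory data analysis phase: the describe() summary reports a minimum Height of 0.0, which is not physically plausible and should be investigated before modeling.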