
# Abalone Preprocessing Exercise (Core)

Name: Jude Maico Jr.

Abalone are small marine snails. This dataset records the sex of each specimen (F for female, M for male, and I for infant), along with various measurements, with lengths in mm and weights in grams. Finally, the dataset includes the number of rings on the shell. As with trees, the number of rings on an abalone shell can be used to determine its age. Our machine-learning task will be to predict the number of rings. For this assignment, you will be preparing the dataset for modeling.

# Tasks:

1. Separate your data into the feature matrix (X) and the target vector (y)
- rings will be your y
- The rest of the features will be your X

2. Train/test split the data. Please use `random_state=42` for consistency.

3. Create a ColumnTransformer to preprocess the data. <br>
Remember to:
- Create column selectors for the numeric and categorical columns.
- Create a StandardScaler for scaling numeric columns.
- Create a OneHotEncoder for one-hot encoding the categorical columns.
- Match each transformer with the appropriate selector in a tuple.
- Use the tuples to create a ColumnTransformer to preprocess the data.

4. Transform your data and display the result.
- Individual transformers do NOT need to be fit separately.  Just fit the resulting preprocessing object once on the training data, and use it to transform both the training and testing data.

## Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import train_test_split

## Loading Data

In [None]:
# Load data directly from url
df = pd.read_csv('https://docs.google.com/spreadsheets/d/1jfU2oFSfhX1ywUbqETExDJuztO95r3h6pbWAm7xpwNY/gviz/tq?tqx=out:csv&sheet=users')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             4177 non-null   object 
 1   length          4177 non-null   float64
 2   diameter        4177 non-null   float64
 3   height          4177 non-null   float64
 4   whole_weight    4177 non-null   float64
 5   shucked_weight  4177 non-null   float64
 6   viscera_weight  4177 non-null   float64
 7   shell_weight    4177 non-null   float64
 8   rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


| | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


# Assignment

## 1. Separate your data into the feature matrix (X) and the target vector (y)
- rings will be your y
- The rest of the features will be your X

In [None]:
# The target we are trying to predict
y = df['rings']
# The features we will use to make the prediction
# (columns= already names the axis, so axis=1 is not needed)
X = df.drop(columns='rings')

In [None]:
X.head()

| | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight |
|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 |


In [None]:
y.head()

0    15
1     7
2     9
3    10
4     7
Name: rings, dtype: int64
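A quick sanity check can confirm that the target no longer appears among the features and that the rows of X and y stay aligned. The sketch below is only illustrative: it uses a small hypothetical frame (`toy`, with made-up rows) instead of the full dataset, and only the column names are taken from the data above.

```python
import pandas as pd

# Hypothetical three-row stand-in for the abalone data (illustrative values only)
toy = pd.DataFrame({
    "sex": ["M", "F", "I"],
    "length": [0.455, 0.530, 0.330],
    "rings": [15, 9, 7],
})

y_toy = toy["rings"]
X_toy = toy.drop(columns="rings")

# The target must not remain among the features, and rows must stay aligned
assert "rings" not in X_toy.columns
assert X_toy.index.equals(y_toy.index)
print(X_toy.shape, y_toy.shape)
```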

## 2. Train/test split the data. Please use `random_state=42` for consistency.

In [None]:
# Train test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [None]:
X_train.head()

| | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight |
|---|---|---|---|---|---|---|---|---|
| 3823 | F | 0.615 | 0.455 | 0.135 | 1.059 | 0.4735 | 0.263 | 0.274 |
| 3956 | F | 0.515 | 0.395 | 0.14 | 0.686 | 0.281 | 0.1255 | 0.22 |
| 3623 | M | 0.66 | 0.53 | 0.175 | 1.583 | 0.7395 | 0.3505 | 0.405 |
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 |
| 2183 | M | 0.495 | 0.4 | 0.155 | 0.8085 | 0.2345 | 0.1155 | 0.35 |


In [None]:
y_train.head()

3823     9
3956    12
3623    10
0       15
2183     6
Name: rings, dtype: int64
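Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The following sketch, which uses synthetic arrays (`X_demo`, `y_demo`) with the same 4177 rows as the dataset rather than the real data, shows the resulting split sizes and that a fixed `random_state` reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the abalone data (4177)
n_rows = 4177
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

# With no test_size given, scikit-learn holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# The same seed reproduces the same partition
X_tr2, _, _, _ = train_test_split(X_demo, y_demo, random_state=42)
assert np.array_equal(X_tr, X_tr2)
```

The training size here should agree with the 3132 rows reported for the transformed training data in step 4.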

## 3. Create a ColumnTransformer to preprocess the data. 
- Create column selectors for the numeric and categorical columns.
- Create a StandardScaler for scaling numeric columns.
- Create a OneHotEncoder for one-hot encoding the categorical columns.
- Match each transformer with the appropriate selector in a tuple.
- Use the tuples to create a ColumnTransformer to preprocess the data.


In [None]:
# Column selectors: categorical (object dtype) and numeric columns
col_selector = make_column_selector(dtype_include = 'object')
num_selector = make_column_selector(dtype_include = 'number')
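A column selector is a callable: given a DataFrame, it returns the names of the columns whose dtype matches. The sketch below demonstrates this on a hypothetical two-row frame (`toy`); the `sex` column is created with an explicit object dtype so that the example behaves the same regardless of how the installed pandas version stores strings by default.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical two-row frame; the column names mirror the abalone data
toy = pd.DataFrame({
    "sex": pd.Series(["M", "F"], dtype=object),
    "length": [0.455, 0.530],
    "height": [0.095, 0.135],
})

# Calling a selector on a DataFrame returns the matching column names
cat_cols = make_column_selector(dtype_include="object")(toy)
num_cols = make_column_selector(dtype_include="number")(toy)
print(cat_cols)
print(num_cols)
```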

In [None]:
# StandardScaler for the numeric columns
scaler = StandardScaler()
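`StandardScaler` replaces each value with its z-score: it subtracts the column mean and divides by the column standard deviation. As a minimal sketch on made-up numbers (the array `x` is illustrative, not taken from the dataset), the scaler's output can be compared against the formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only
x = np.array([[0.33], [0.44], [0.53], [0.455]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (x - x.mean(axis=0)) / x.std(axis=0)

assert np.allclose(scaled, manual)
```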

In [None]:
#OneHotEncoder
ohe = OneHotEncoder(handle_unknown = 'ignore')
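With `handle_unknown = 'ignore'`, a category that was not seen during fitting is encoded as a row of all zeros instead of raising an error. The sketch below illustrates this with the three sex codes from the dataset plus a made-up category `"X"` that is purely hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["F"], ["I"], ["M"]], dtype=object))

# "X" is a made-up category that was not present during fit
encoded = enc.transform(np.array([["M"], ["X"]], dtype=object)).toarray()
print(enc.categories_[0])
print(encoded)
```

The first row has a single 1 in the `M` position, while the unknown `"X"` row contains only zeros.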

In [None]:
# Pair each transformer with its column selector
col_tuple = (ohe, col_selector)
num_tuple = (scaler, num_selector)

In [None]:
# Combine the tuples into a ColumnTransformer
preprocessor = make_column_transformer(col_tuple, num_tuple, 
                                       remainder='passthrough',
                                       verbose_feature_names_out = False)

## 4. Transform your data and display the result.

In [None]:
# Fit the column transformer on the training data only
preprocessor.fit(X_train)

In [None]:
# Transform both splits with the preprocessor fitted on the training data
X_train_scaled = preprocessor.transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
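Fitting on the training split only means the scaler's mean and standard deviation come from training rows, and the test split is scaled with those same statistics. A minimal sketch of that behavior, assuming two small hypothetical frames (`train` and `test`) in place of `X_train` and `X_test`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames standing in for X_train and X_test
train = pd.DataFrame({"sex": pd.Series(["M", "F", "I", "M"], dtype=object),
                      "length": [0.2, 0.4, 0.6, 0.8]})
test = pd.DataFrame({"sex": pd.Series(["F", "I"], dtype=object),
                     "length": [1.0, 1.2]})

pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include="object")),
    (StandardScaler(), make_column_selector(dtype_include="number")),
)
pre.fit(train)  # statistics are learned from the training rows only

train_out = pre.transform(train)
test_out = pre.transform(test)

# The fitted scaler stores the training mean, so scaled test values
# are not necessarily centered at 0
train_mean = pre.named_transformers_["standardscaler"].mean_[0]
print(train_mean, train_out[:, -1].mean(), test_out[:, -1].mean())
```

The scaled training column averages to zero, while the scaled test column does not, because the test rows never influenced the fitted statistics.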

In [None]:
# Wrap the transformed array in a DataFrame with the output feature names
X_train_df = pd.DataFrame(X_train_scaled, 
                          columns = preprocessor.get_feature_names_out())
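Building a DataFrame from the transformed array assigns a fresh 0-based index, so the original row labels from the shuffled training split are lost. If label alignment with `y_train` matters, the original index can be passed explicitly. The sketch below uses hypothetical stand-ins (`transformed`, `original_index`, `feature_names`) rather than the real preprocessor output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a transformed array and the original (shuffled) row labels
transformed = np.array([[1.0, 0.75], [0.0, -0.09], [0.0, 1.13]])
original_index = pd.Index([3823, 3956, 3623])
feature_names = ["sex_F", "length"]

# Default: a new RangeIndex starting at 0
default_df = pd.DataFrame(transformed, columns=feature_names)

# Passing index= keeps the original row labels
aligned_df = pd.DataFrame(transformed, columns=feature_names, index=original_index)

print(default_df.index.tolist())
print(aligned_df.index.tolist())
```

In the notebook itself, the equivalent would be adding `index = X_train.index` to the `pd.DataFrame` call, which is an optional variation and not something the exercise requires.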

In [None]:
#DISPLAY
display(X_train_df.head())
print(f'\nShape of the data is: {X_train_df.shape}')
print(f'\nThere are {X_train_df.isna().sum().sum()} missing values')
print(f'\nThe data types are \n{X_train_df.dtypes}')

| | sex_F | sex_I | sex_M | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.749291 | 0.464226 | -0.118869 | 0.457447 | 0.499098 | 0.743973 | 0.241135 |
| 1 | 1.0 | 0.0 | 0.0 | -0.090254 | -0.144654 | -0.001647 | -0.301655 | -0.364269 | -0.51404 | -0.145838 |
| 2 | 0.0 | 0.0 | 1.0 | 1.127086 | 1.225326 | 0.81891 | 1.523852 | 1.692114 | 1.544526 | 1.179902 |
| 3 | 0.0 | 0.0 | 1.0 | -0.59398 | -0.449095 | -1.056649 | -0.651696 | -0.617673 | -0.738195 | -0.647469 |
| 4 | 0.0 | 0.0 | 1.0 | -0.258163 | -0.093914 | 0.35002 | -0.052352 | -0.572823 | -0.605532 | 0.785763 |



Shape of the data is: (3132, 10)

There are 0 missing values

The data types are 
sex_F             float64
sex_I             float64
sex_M             float64
length            float64
diameter          float64
height            float64
whole_weight      float64
shucked_weight    float64
viscera_weight    float64
shell_weight      float64
dtype: object
