# Practice session: Abalone dataset

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

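# the 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')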

## Load and prepare the dataset

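Here, we will use the [Pandas](https://pandas.pydata.org/) package to load the dataset. The file is in ARFF format, but its data section is plain comma-separated values, so `read_csv` can parse it once the header lines are skipped.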
More information about this dataset is available at the [OpenML](https://www.openml.org/search?type=data&sort=runs&id=183&status=active) website.

In [None]:
cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
        'Viscera weight', 'Shell weight', 'Rings']

abalone = pd.read_csv("abalone/dataset_187_abalone.arff", header=14, names=cols)
abalone

As a quick check, we look at the type of each column and count the NaN entries in each of them.

In [None]:
abalone.info()

In [None]:
abalone.isna().sum()

Next, we transform the sex information, which is categorical, into a set of binary columns.
This is called one-hot encoding. It could be done with Scikit-learn, but for now we will use Pandas.

In [None]:
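# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
abalone = pd.get_dummies(abalone, columns=["Sex"], dtype=float)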
abalone
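For comparison, here is a minimal sketch of the same encoding done with Scikit-learn's `OneHotEncoder`. It runs on a small made-up table rather than the abalone data: the `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the abalone table (illustrative values, not real data).
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"],
                    "Length": [0.45, 0.53, 0.33, 0.44]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, hence the .toarray() call.
sex_encoded = encoder.fit_transform(toy[["Sex"]]).toarray()

print(encoder.categories_[0])  # the categories learned from the column
print(sex_encoded)             # one binary column per category
```

Unlike `pd.get_dummies`, the fitted encoder remembers the categories, so the same transformation can be applied later to new data with `encoder.transform`.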

Finally, we convert the dataset into NumPy arrays: `X` holds the features and `y` the target (the number of rings).

In [None]:
X = abalone.drop(columns="Rings").values
y = abalone["Rings"].values

## Visualize the data

It's always nice to have a look at the input data.
As the data has been loaded using Pandas, we can use its [plotting API](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) to visualize relationships between feature and target values.

In [None]:
abalone.plot(x="Diameter", y="Rings", kind="scatter", alpha=0.3);

## And now, your turn!

Try some models to predict the number of rings (`y` variable) given the abalone features (`X` variable). Here is a reminder of the main steps that you would follow:

- split the data into train and test sets,
- import a model,
- instantiate the model,
- fit the model,
- make predictions,
- score the model on test data,
- plot the predictions vs. the true values (or the residuals),
- rinse and repeat!

Even better, use cross-validation for scoring.
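As a reminder of the syntax, the steps above could look like the following sketch. It uses synthetic data, not the abalone dataset: the names `X_demo` and `y_demo`, the choice of `LinearRegression`, the 25% test size and the 5 folds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (NOT the abalone data): 200 samples, 3 features.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split into train and test sets, with a fixed random seed.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Instantiate, fit and predict.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score on the test data: root mean squared error and residuals.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
residuals = y_test - y_pred

# Cross-validated RMSE on the training data (the scorer returns negative MSE).
cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-cv_scores)

print("test RMSE:", rmse)
print("cross-validated RMSE per fold:", cv_rmse)
```

The cells below are left empty so that you can write the equivalent code for the abalone `X` and `y` arrays.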

In [None]:
# split your data into train and test sets, don't forget to fix the random seed

In [None]:
# import a model class you want to try, e.g. a linear regression

In [None]:
# create the model

In [None]:
# fit the model to the training data

In [None]:
# generate predictions on the test data

In [None]:
# import a scoring function from sklearn.metrics and compute a score on the test
# data (e.g. the root mean squared error)

In [None]:
# plot the predicted values vs. the true values, look for any systematic bias

Once you have obtained some results from your first model, you can explore the following tracks:

- test another model and compare its score on the test data,
- tune the hyperparameters of your model, using `cross_val_score` on the training data,
- or even better, learn more about [automatic tuning with a grid search or a random search](https://scikit-learn.org/stable/modules/grid_search.html),
- investigate the performance of your model: is it over-fitting or under-fitting?
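To give an idea of what automatic tuning looks like, here is a minimal grid search sketch. It again uses synthetic data instead of the abalone dataset, and the `Ridge` model, the `alpha` grid and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the abalone data).
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Candidate values for the regularization strength (illustrative grid).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with 5-fold cross-validation on the training data.
search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# search.score uses the same scorer, so both values are negative MSEs.
train_score = search.score(X_train, y_train)
test_score = search.score(X_test, y_test)

print("best parameters:", search.best_params_)
print("train score:", train_score, "test score:", test_score)
```

Comparing `train_score` with `test_score` is one simple way to look for over-fitting: a training score much better than the test score suggests the model does not generalize well.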

## Additional challenge

Try to predict the shucked weight from the remaining features, with all the other weight columns removed ;-).

In [None]:
weight_cols = ["Shucked weight", "Viscera weight", "Shell weight", "Whole weight"]

feature_names = abalone.columns.drop(weight_cols)
X = abalone.drop(columns=weight_cols).values
y = abalone["Shucked weight"].values