# <span style = "color:rebeccapurple">Introduction to scikit-learn</span>

## <span style = "color:darkorchid">What is scikit-learn?</span>

`scikit-learn` is a user-friendly Python library that helps you implement a variety of machine learning algorithms with just a few lines of code. You do not need to be an expert in machine learning to use `scikit-learn`; however, you do need to learn some concepts to use it correctly.

Today, we will learn a bit of both. We will alternate between coding and concepts so you can be successful in your future machine learning endeavors!

## <span style = "color:darkorchid">What do I assume you know?</span>

This workshop is an introduction to `scikit-learn`, not to programming. Hence, we will assume you have some familiarity with the following:

#### Python preliminaries
<ul>
    <li>Basic Python structures (lists, dictionaries, tuples) and control flow (loops, conditionals, etc.)</li>
    <li>Numpy arrays</li>
    <li>Pandas dataframes</li>
    <li>Basics of matplotlib</li>
    <li>Basics of classes and objects (we will do a quick review)</li>
</ul>

We will have a quick review of python structures, arrays, and dataframes, but we cannot linger much on them. We will use `matplotlib` and `seaborn` for plotting, but you do not really need to know, at this stage, all the specifics for how those work.

#### Machine learning preliminaries
You came to a `scikit-learn` workshop because you want to implement machine learning in your projects. We cannot be comprehensive in such a short course, so I will assume you have <b>some familiarity with machine learning</b>, at least what it is and what it tries to do. If you don't, that is OK. We will still review some of those concepts as we proceed, but we may go a bit fast, so make sure you ask plenty of questions. You are also encouraged to review the concepts after the workshop.

# <span style = "color:rebeccapurple">Python Review</span>

## <span style = "color:darkorchid">Data Structures in Python</span>

We will make some use of lists, dictionaries and tuples, and of pandas dataframes. You don't need to be an expert on these, and we can't dwell much on them, but I want to quickly remind you what they are and how they look. If you don't fully understand their behavior don't worry, it is not crucial for today's workshop.

#### Base python structures
Base python has three basic structures you should strive to be acquainted with:
<ul>
    <li>lists</li>
    <li>dictionaries</li>
    <li>tuples</li>
</ul>

<b>Lists</b>

Lists are ordered collections of objects. You can create empty lists, increase the size of lists, and obtain specific elements through indexing.

In [None]:
# Empty lists
my_list = []
print(my_list)

In [None]:
# Non-empty lists
my_list = [1, 2, "hello"]
print(my_list)

In [None]:
# Adding an element to a list
my_list.append("good bye")
print(my_list)

In [None]:
# Iterating over a list
for i in my_list:
    print(i)

In [None]:
# Indexing a list
print(my_list[0])
print(my_list[1:3])

In [None]:
# Overwriting an element in a list:
my_list[3] = "hello again"
print(my_list)

<b>Dictionaries</b>
Dictionaries are also collections of objects, but instead of being indexed by order, as in the list case, they have a *key*, which uniquely identifies the *value* of your object. These are called key-value pairs.

In [None]:
# An empty dictionary
my_dict = {}
print(my_dict)

In [None]:
# A non-empty dictionary with key-value pairs formatted as key:value
my_dict = {"data": [1,2,3],
          "salutation": "hello",
          "inception": {"some key":"some value"}}
print(my_dict)

In [None]:
my_dict["data"]

In [None]:
my_dict["salutation"]

In [None]:
my_dict["inception"] # <-- This is a dictionary inside a dictionary!

<i>(Optional) Note:</i> We have used only strings as keys. Most of the time this will be the case. It is possible to use other objects as keys, but we won't go into that (if you know about mutability and hashability: only hashable objects can be keys, which in practice usually means immutable ones such as strings, numbers, and tuples).
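
As a small illustration of the optional note on keys, the sketch below uses a tuple (immutable, hence hashable) as a key and shows that a list (mutable, not hashable) is rejected. The variable names are made up for this example.

```python
# A tuple is hashable, so it can be a dictionary key
coords = {(0, 0): "origin", (1, 0): "east"}
print(coords[(0, 0)])

# A list is mutable and therefore not hashable: using it as a key raises a TypeError
try:
    bad = {[0, 0]: "origin"}
    list_key_worked = True
except TypeError as err:
    list_key_worked = False
    print("TypeError:", err)
```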

<b> Tuples </b><br>
Tuples are ordered collections of objects, almost like lists, BUT they are immutable, meaning that you can't change them as you did with lists. You cannot change their elements, add a new element, delete one, etc.

In [None]:
# :: TUPLES ::
my_tuple = (1,2,"hello")

In [None]:
print(my_tuple)

In [None]:
my_tuple[0]

In [None]:
my_tuple[0] = 10

You should have gotten an error in the last cell. That's because you can't change elements of tuples.

#### Numpy arrays

Numpy arrays will take the role of vectors and matrices in python. They are ordered collections of objects, like lists, but there is an important constraint: all elements must be of the same type. Furthermore, if your elements are numeric, you can do numeric operations on the arrays, including matrix multiplication.

Numpy is a python package that is not imported by default, so we must import it. Chances are you already have numpy installed, so there is no need to install it. We must only import it.

In [None]:
import numpy as np        # <-- Numpy is usually imported as 'np' (this is convention)

# create an array
my_array = np.array([1,2,3])
print(my_array)

In [None]:
# Index an array
print(my_array[0])

In [None]:
# Perform numeric computations with an array:
print(2 * my_array)

In [None]:
# Compare that with a list:
print(2 * [1,2,3])

In [None]:
# Let's build a matrix:
my_matrix = np.array([[0,1,0],
                     [1,0,0],
                     [0,0,1]])
my_matrix

In [None]:
# We can multiply arrays as if they were matrices/vectors with the np.matmul() function
np.matmul(my_matrix, my_array)
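
Two details from this section can be checked quickly: NumPy converts mixed numeric input to a single common type, and the `@` operator is shorthand for `np.matmul`. The array names below are illustrative.

```python
import numpy as np

# Mixing integers and floats: NumPy stores everything as one common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)          # a floating point type

# The @ operator performs the same matrix product as np.matmul
permutation = np.array([[0, 1, 0],
                        [1, 0, 0],
                        [0, 0, 1]])
vector = np.array([1, 2, 3])
product = permutation @ vector
print(product)
```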

#### Pandas dataframes

Pandas is another package that is commonly used in python but must be imported. Pandas is mostly used to manipulate datasets, for example by subsetting. The main object in pandas is the dataframe, which is basically a table. Pandas allows you to manipulate these tables easily. When it comes to datasets, the convention is for rows to be the different observations (for example patients) and for columns to be the observed features (for example age, sex, etc.). 

In [None]:
import pandas as pd    # <-- Pandas is usually imported as 'pd' (this is convention)

In [None]:
# Creating a dataframe from a list of lists
pd.DataFrame(data = [[1,2], [3,4], [5,6]])

Note it automatically created column headings (0,1 in this case) and row indices.

In [None]:
#Creating dataframe with column names:
pd.DataFrame(data = [[1,2],[3,4], [5,6]], columns = ["Column 1", "Column 2"])

In [None]:
# Creating dataframe from a dictionary:
pd.DataFrame(data = {"Column 1": [1, 3, 5], "Column 2": [2, 4, 6]})    # <-- same table as in the previous cell

Note that with the dictionary each key-value pair is a column. In the case of nested lists, each sublist is a row.

In [None]:
# You can also use numpy arrays:
my_matrix = np.array([[0,1,0],
                     [1,0,0],
                     [0,0,1]])

df = pd.DataFrame(data = my_matrix, columns = ["col1", "col2", "col3"])
df

In [None]:
# To get the columns of a dataframe, you can call the column names as you'd do with dictionaries:
df["col1"]

This actually returns a pandas *series*, which is basically a single column (the numbers on the left are the indices, not actual values). If you want to return a *dataframe*, which is often necessary, you can use double brackets:

In [None]:
df[["col1"]]

In [None]:
# You can create new columns also as with dictionaries:
df["col4"] = [1,1,1]
df

In [None]:
# If you only want to see the first few elements of your dataframe, use the .head() method:
df.head()

(In this case there is no difference because we only have three rows, but you will see it used below.)

## <span style = "color:darkorchid">Classes and Objects</span>

Python is very versatile. You can write function after function if you wish (as you would in a language like R); however, much of its strength stems from object-oriented programming. `scikit-learn` makes plenty of use of objects and classes, so let's take a quick look at what these are.

Imagine you are at a high-end restaurant. Let's say you are seeing Gordon Ramsay at work. There is an executive chef, a head chef, several sous-chefs, specialized chefs (for example for roasting, for pastries, etc.), each with a team of specialists (the butcher, the grill chef, the baker, the confectioner, etc.). Here is an image I got from Google Images:

<img src = "images/brigade-de-cuisine-high-speed-learning.png" width = 400 style="display: block; margin-left: auto; margin-right: auto;">

Think of each of these as a type or a *class* of chefs. There are important things to note:
<ul>
    <li>They are all of the generic type <i>chef</i>, but some have extra skills or responsibilities.</li>
    <li>You can have several chefs of the same class (like several sous-chefs). However, these are not the same people!</li>
</ul>

Well, classes in Python are something similar. A class is a specified type of entity that has specific skills and attributes. Objects are the realizations of these classes. Each realization is called an *instance*. In Python, the skills are called *methods*: these are actions that all instances of a class can perform. Instances also have *attributes*, which are variables, possibly unique, that each instance has (like the names of individual chefs).

We won't go further over classes but the important principle is this:

When you are writing a large python code, do not think of yourself as a homecook that does everything by themselves from scratch, following each step one after another. Instead, **think of yourself as an executive chef**. First you appoint all the chefs working for you, each with predetermined skills and attributes. You think of the overall plan, and then you delegate tasks to the respective chefs (which may, in turn, delegate tasks to their respective chefs and specialists).

Indeed, `scikit-learn` is very similar. It has classes of classifiers, of pre-processors, of regression models, etc. Your job will not be to cook absolutely everything by yourself, but to organize things conceptually, hire your appropriate chefs, tell them what to do and trust them.

#### How to deal with classes

The following are fake examples because `ChefClass` does not exist. Don't run the cells or you will get an error.

In [None]:
# If you have already a class, you can create an instance of that class like this:
chef = ChefClass()

In [None]:
# You can obtain attributes using a period:
chef.specialty

In [None]:
# You call a method (ask it to perform a skill) using a period and parenthesis:
chef.cook()
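
The cells above fail only because the class is not defined. As a sketch, a minimal and entirely hypothetical `ChefClass` could look like the following; the attribute names (`name`, `specialty`), their default values, and the `cook` method are assumptions made for illustration.

```python
class ChefClass:
    """A hypothetical chef class, defined only to illustrate the syntax above."""

    def __init__(self, name="Alex", specialty="pastries"):
        # Attributes: data stored on each individual instance
        self.name = name
        self.specialty = specialty

    def cook(self):
        # Method: an action every instance of the class can perform
        return f"{self.name} is preparing {self.specialty}"


chef = ChefClass()                                      # create an instance
other_chef = ChefClass(name="Sam", specialty="grill")   # a second, independent instance

print(chef.specialty)     # attribute access: no parentheses
print(chef.cook())        # method call: parentheses
print(other_chef.cook())
```

Both objects are of the same class, yet each keeps its own attribute values, just like two sous-chefs who are not the same person.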

And on to `scikit-learn`!

# <span style = "color:rebeccapurple">Hands-on scikit-learn</span>

## <span style = "color:darkorchid">Imports</span>

First, as with all scripts, let's import all the modules we will use for this workshop. `scikit-learn` is referred to as `sklearn`, pronounced "S - K - Learn".

In [None]:
# :: IMPORTS ::

# Scikit-learn specifics:
from sklearn import datasets
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Helper modules
import pandas as pd
import matplotlib.pyplot as plt    # <-- the plotting interface lives in the pyplot submodule
import numpy as np

## <span style = "color:darkorchid">The adventures of Alice and Bo</span>

Throughout the workshop we will follow the adventures of two researchers: Alice and Bo. I will teach with examples from Alice's adventures. You will work on exercises from Bo's breakthroughs.

I asked Midjourney to generate images of Alice and Bo in the style of Alice in Wonderland. This is what I got:

<img src = "images/Alice-and-Bo.png" width = 900>

# <span style = "color:rebeccapurple">Part 1 - Load and Preprocess</span>

## <span style = "color:darkorchid">Loading Data</span>

We will obtain data in two ways: through `scikit-learn`, which provides some ready-made datasets, and using `pandas` to load our own datasets.

### Toy data with scikit-learn

`scikit-learn` contains some toy datasets that you can load directly. We will be using a couple of those in this tutorial. Loading them requires the `datasets` module, which we already imported above.

In [None]:
# Loading diabetes dataset
diabetes = datasets.load_diabetes()

Turns out datasets are dictionary-like objects. We can see what they contain by looking at the keys:

In [None]:
# What keys does diabetes contain?
diabetes.keys()

The `DESCR` key contains a description of the dataset. It is a long string, so use it with the `print()` function:

In [None]:
print(diabetes.DESCR)

We can obtain the values either by key name, as in dictionaries, or as if they were class attributes:

In [None]:
# Dictionary syntax:
diabetes['feature_names']

In [None]:
# Attribute syntax:
diabetes.feature_names

The *data* object will be by default a numpy array:

In [None]:
# Default datasets: arrays
diabetes.data

But you can also ask for it to be a pandas dataframe:

In [None]:
diabetes_df = datasets.load_diabetes(as_frame = True)
diabetes_df.data

Let's review the `.head()` method for pandas dataframes:

In [None]:
diabetes_df.data.head()

In [None]:
# The prediction target is separate:
diabetes_df.target.head()
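
If you only need the features and the target, the loaders also accept a `return_X_y` argument that skips the dictionary-like wrapper. A minimal sketch (the names `X` and `y` are just a common convention, not something the workshop relies on):

```python
from sklearn import datasets

# Ask for the features and target directly, as pandas objects
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

print(type(X).__name__, X.shape)   # features: a dataframe
print(type(y).__name__, y.shape)   # target: a series
```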

`scikit-learn` has other "not toy" datasets you can obtain, but we won't go into them. You can find more about them <a href = "https://scikit-learn.org/stable/datasets.html" target = "_blank">here</a>.

#### <span style = "color:red">EXERCISE</span>

`scikit-learn` has an "iris" dataset. Based on how we loaded "diabetes", can you guess how you can load "iris"?
1. Load iris, using the `as_frame` argument.
2. Show the keys so you know how it is structured.
3. Print the description of the dataset
4. Show the target names.
5. Obtain the feature data and the target.
6. Show the first few rows of the feature data.
7. Show the full target vector.

In [None]:
# Load iris


In [None]:
# Show keys


In [None]:
# Print description


In [None]:
# Show labels (target names)


In [None]:
# Get feature data


In [None]:
# Get target


In [None]:
# Show first few rows of features (predictive data)


In [None]:
# Show target


### Loading our own data

<b>NOTE</b> If you are using Google Colab instead of your own JupyterLab, you will need to upload the datasets by hand:
1. First option: click on the folder icon on the left, then upload the files. The datasets are found in the GitHub repository. Note: if you accidentally click out of the folder you were in and see an unfamiliar list of folders, you are looking for the one called "content".
2. Second option: get them directly from GitHub. For that you will need the `!wget` command followed by the "raw content" link from GitHub:

In [1]:
# For COLAB USERS only: change the variable below to 1
get_file_yn = 0
if get_file_yn:
    !wget https://raw.githubusercontent.com/efren-cc/scikit-learn-workshop/main/data/penguins.csv
    !wget https://raw.githubusercontent.com/efren-cc/scikit-learn-workshop/main/data/fish.csv

Let's load the "penguins" dataset from our "data" folder using pandas. Remember we imported pandas as pd above.

In [None]:
penguins_df = pd.read_csv("data/penguins.csv")

In [None]:
penguins_df.head()

<b>Summary:</b> If we have our own .csv data, we can load it using the pandas `read_csv()` function. scikit-learn also has toy datasets we can play with; in our case we used `load_diabetes()` to obtain the diabetes dataset, but others are available. See the full list <a href = "https://scikit-learn.org/stable/datasets/toy_dataset.html" target = "_blank">here</a>.
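
If you do not have a file at hand, `pd.read_csv()` can be tried on text held in memory. The sketch below is an illustration only: the column names and values are invented and are not taken from the workshop datasets.

```python
import io
import pandas as pd

# A tiny CSV written as a string (made-up values, for illustration only)
csv_text = """name,length_cm,weight_g
a,10.5,200
b,12.0,260
c,9.8,180
"""

# read_csv accepts a path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df.head())

# Double brackets keep the result as a dataframe
print(toy_df[["weight_g"]].shape)
```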

#### <span style = "color:red">EXERCISE</span>

1. Load the fish dataset from our data folder as a pandas dataframe.
2. Visualize the first few rows.
3. Make a new dataframe with only the "weight" column.
4. Drop the weight column from the original dataframe, make sure the change is permanent.

In [None]:
# Read fish dataset


In [None]:
# Show first few rows


In [None]:
# Make new dataframe with only weight column


In [None]:
# Make new dataframe without weight column


## <span style = "color:darkorchid">Preprocessing: Transforming data with scikit-learn</span>

Many machine learning algorithms are based on taking distances among data points. Distances depend on the scale of each dimension (each measured feature). Hence, if the scales are too different from each other (say one feature is measured in the 1000s, and another in decimals), we will get suboptimal, if not completely disastrous, results. There are many different ways to transform your data; which one you choose will depend on the type of data and the algorithm you are using. Here are three commonly used transformations:
<ul>
    <li>Standardization</li>
    <li>Normalization</li>
    <li>Encoding categorical features</li>
</ul>

`scikit-learn` uses the module `preprocessing` to perform the operations above. We already imported it in our imports section. Now let's review these transformations one by one.

### Standardization

Standardization is a statistics based approach to bring all features to a similar scale. It computes the mean and standard deviation over observations of a feature, and then subtracts the mean from each observation and divides by the standard deviation. This results in mean 0, standard deviation 1 statistics.

The name *standardization* comes from computing the standard score, or z-score, of your observations. This is done in the following manner:
$$
z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}
$$

Let's try it with one of the penguin columns.

In [None]:
penguins_df = pd.read_csv("data/penguins.csv")

In [None]:
# This is how flipper length looks
penguins_df[["flipper_length_mm"]].head(10)

Did you notice I used double brackets? `scikit-learn` expects two-dimensional input (rows are observations, columns are features), so even though we'll use only one column, we'll keep it as a dataframe rather than a series. Similarly, if your input is a numpy array, it must be 2-dimensional.

In [None]:
# Create a standard scaler object:
z_scaler = preprocessing.StandardScaler()

# Fit it to the data
z_scaler.fit(penguins_df[["flipper_length_mm"]])

The fitted scaler now has $\hat{\mu}$ and $\hat{\sigma}$. We can use the `transform` method to transform our data:

In [None]:
# Transform the data into z-scores
z_flipper_length = z_scaler.transform(penguins_df[["flipper_length_mm"]])

In [None]:
z_flipper_length[0:10]

Notice that the output is a numpy array. If you are a Pandas fan you could create a new column in your dataframe with these values, or if you are more adept with numpy you can just keep it as is.
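
As a sanity check, the z-score formula can be reproduced by hand and compared with what `StandardScaler` returns. This is a sketch on made-up numbers rather than the penguin data; `mean_` and `scale_` are the attributes where the fitted scaler stores its estimates.

```python
import numpy as np
from sklearn import preprocessing

# Made-up measurements, one feature in a 2-dimensional array (rows = observations)
x = np.array([[181.0], [186.0], [195.0], [193.0], [190.0]])

z_scaler = preprocessing.StandardScaler()
z_scaler.fit(x)

# The fitted estimates are stored as attributes ending in an underscore
print(z_scaler.mean_, z_scaler.scale_)

# Reproduce the z-scores by hand; note that np.std divides by n by default,
# which is the same convention StandardScaler uses
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = z_scaler.transform(x)
print(np.allclose(z_manual, z_sklearn))
```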

<b>Review</b>

In [None]:
# mock data, let's use two columns this time
x = penguins_df[["flipper_length_mm", "bill_depth_mm"]]

In [None]:
# Create a standard scaler object
z_scaler = preprocessing.StandardScaler()

# Fit it to the data
z_scaler.fit(x)

# Transform the data
z = z_scaler.transform(x)

In [None]:
z[0:10]
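
When you fit and transform the same data, the two steps can be combined with the `fit_transform` method. A brief sketch on made-up numbers, showing that both routes give the same result:

```python
import numpy as np
from sklearn import preprocessing

# Made-up data with two features
x = np.array([[181.0, 18.7], [186.0, 17.4], [195.0, 18.0], [193.0, 19.3]])

# Two steps: fit, then transform
z_two_steps = preprocessing.StandardScaler().fit(x).transform(x)

# One step: fit_transform does both on the same data
z_one_step = preprocessing.StandardScaler().fit_transform(x)

print(np.allclose(z_two_steps, z_one_step))
print(z_one_step.mean(axis=0).round(6), z_one_step.std(axis=0).round(6))
```

Keeping `fit` and `transform` separate remains useful whenever the data you transform is not the data you fitted on.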

#### <span style = "color:red">EXERCISE</span>

Pick any numerical column(s) from your fish dataframe (except weight) and standardize it/them!

In [None]:
# I'll reload the data for you
fish_df = pd.read_csv("data/fish.csv")

In [None]:
# Create a standard scaler object


# Fit it to the data


# Transform the data


In [None]:
# View the first 10 elements


### Normalization

While standardization is statistics based, normalization is geometry based. That means there is an important assumption: the data point to be normalized is assumed to be a vector in a vector space. If you don't know what this is, let's not worry about it at this point. For now, this means we cannot use it on categorical data.

Another important point is that normalization happens over the whole ambient space of a data point. That is, it normalizes per row. Notice that, in contrast, standardization was done with column statistics of the whole sample.

If you are interested, normalization is the process of transforming a vector so it has unit norm:
$$
x_{\text{norm}} = \frac{x}{\lVert x \rVert}
$$

Let's try it with the first few rows from diabetes:

In [None]:
# Rows before normalization
diabetes.data[0:4]

In [None]:
# Create normalizer
norm_scaler = preprocessing.Normalizer(norm = "l2")    # <-- "l2" indicates Euclidean norm

# Fit to data, we can give a whole matrix or only one row:
norm_scaler.fit(diabetes.data[0:4])

In [None]:
norm_diab = norm_scaler.transform(diabetes.data[0:4])
norm_diab

In [None]:
np.linalg.norm(norm_diab[0])

Note: normalization actually does not require fitting (you are not computing a sample mean and standard deviation as in standardization). Hence, the fit method above is kind of redundant. However, it is kept for consistency among other transformation methods. There is actually a shortcut function that can be used directly:

In [None]:
# Normalization using one function:
norm_diab = preprocessing.normalize(diabetes.data[0:4], norm = "l2")

norm_diab
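
To see that normalization really works row by row, the same result can be obtained by dividing each row by its own Euclidean norm. The sketch below uses made-up rows whose norms are easy to compute by hand.

```python
import numpy as np
from sklearn import preprocessing

# Made-up rows; each row is treated as one vector
x = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]])

# Divide each row by its own Euclidean norm
row_norms = np.linalg.norm(x, axis=1, keepdims=True)
x_manual = x / row_norms

x_sklearn = preprocessing.normalize(x, norm="l2")

print(x_manual[0])                          # first row [3, 4] has norm 5
print(np.allclose(x_manual, x_sklearn))
```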

<b>Summary</b>

In [None]:
# mock data
x = diabetes.data[0:4]

In [None]:
# Create normalizer
norm_scaler = preprocessing.Normalizer(norm = "l2")

# "Fit" to data
norm_scaler.fit(x)    # <-- doesn't really do much, but keeps syntax/logic consistent

# transform data
x_norm = norm_scaler.transform(x)

In [None]:
# Alternatively, you can use the function shortcut:
x_norm = preprocessing.normalize(x, norm = "l2")

The advantage of creating a Normalizer object is that it can be added to a Pipeline object (we will talk about these soon).
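
As a small preview, and only as a sketch, this is roughly what adding a `Normalizer` to a pipeline looks like. The step names and the made-up data are assumptions for illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

# Made-up data: rows are observations, columns are features
x = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 5.0]])

# A pipeline chains steps; each step is a (name, object) pair
pipe = Pipeline(steps=[
    ("standardize", preprocessing.StandardScaler()),
    ("normalize", preprocessing.Normalizer(norm="l2")),
])

x_out = pipe.fit_transform(x)
print(np.linalg.norm(x_out, axis=1))   # every row has unit norm after the last step
```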

#### <span style = "color:red">EXERCISE</span>

Normalize the predictive, numerical rows of the fish dataset.

In [None]:
# I'll load the data for you, and drop the columns we don't want
fish_df = pd.read_csv("data/fish.csv")
fish_df = fish_df.drop(columns = ["species", "weight"])

In [None]:
# Create normalizer


# "Fit" to data


# transform data


In [None]:
# Visualize the first 10 elements


In [None]:
# Check the norm of the first row is 1
np.linalg.norm(___) # <-- Your first row goes inside these parentheses.

### Encoding Categorical variables

Categorical variables don't always have the nice properties of numbers (order, distance, etc.). Therefore, we have to encode them in a way that lets us deal with them mathematically. There are two main ways of doing this, depending on whether the categories have an ordered structure or not.

#### Categorical variables with order

When your categorical variables have an order structure, you will do *ordinal* encoding, which means you will map them to the integers. For example, if you have a list of grades: $A$, $B$, $C$, etc. You know the following facts:
<ul>
    <li>$A > B$</li>
    <li>$B > C$</li>
    <li>$A > C$</li>
</ul>

Note that these facts are encoded in the integer numbers $\{0, 1, 2, 3, \dots\}$. So we can map each letter to a number.

To exemplify this, let's make a mock dataframe:

In [None]:
grades_df = pd.DataFrame({"Grades":["A", "A", "D", "B", "C", "A", "C"]})
grades_df

In [None]:
# Step 1: Create the OrdinalEncoder
ord_encoder = preprocessing.OrdinalEncoder()

In [None]:
# Step 2: Fit it to our data
ord_encoder.fit(grades_df)

In [None]:
# Step 3: Transform your data
ord_encoder.transform(grades_df)

You can also transform new data:

In [None]:
new_grades_df = pd.DataFrame({"Grades":["D", "D", "B"]})

In [None]:
# Transforming new data
ord_encoder.transform(new_grades_df)

Notice two things:
<ol>
    <li>The order was alphabetical ($A$ got mapped to $0$), but maybe you wanted the opposite ($D$ to $0$)</li>
    <li>What would happen if we try to transform a category it hasn't seen before?</li>
</ol>
Let's explore these.

To my knowledge, there is no nice option to encode in reverse alphabetical order. However, we can provide the categories to encode as an explicit list (a list of lists to be precise, where the $i$th list corresponds to the $i$th column in your dataframe). In this case, the order in which we provide these categories will indicate the order of encoding:

In [None]:
# Indicating explicitly the categories:
ord_encoder = preprocessing.OrdinalEncoder(categories = [["D", "C", "B", "A"]])

ord_encoder.fit(grades_df)

ord_encoder.transform(grades_df)

Now $A$ maps to the higher number.

In [None]:
# Switch A and B
ord_encoder = preprocessing.OrdinalEncoder(categories = [["B", "A", "C", "D"]])

ord_encoder.fit(grades_df)

ord_encoder.transform(grades_df)
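
That leaves the second question: what happens with a category the encoder has not seen? By default the encoder raises an error. One alternative, sketched below, is to map unseen categories to a placeholder through the `handle_unknown` and `unknown_value` arguments; the placeholder `-1` and the extra grade "F" are choices made for this example.

```python
import pandas as pd
from sklearn import preprocessing

grades_df = pd.DataFrame({"Grades": ["A", "A", "D", "B", "C", "A", "C"]})
unseen_df = pd.DataFrame({"Grades": ["B", "F"]})   # "F" never appeared during fitting

# Default behavior: transforming an unseen category raises an error
strict_encoder = preprocessing.OrdinalEncoder(categories=[["D", "C", "B", "A"]])
strict_encoder.fit(grades_df)
try:
    strict_encoder.transform(unseen_df)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Alternative: map unseen categories to a placeholder value of our choosing
lenient_encoder = preprocessing.OrdinalEncoder(
    categories=[["D", "C", "B", "A"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
lenient_encoder.fit(grades_df)
encoded = lenient_encoder.transform(unseen_df)
print(encoded)
```

Whether a placeholder is appropriate depends on your model; raising an error is often the safer default because it forces you to notice the new category.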

#### Categorical variables with no order:

If your variables don't have any order whatsoever, the recommended approach is one-hot encoding. This basically maps each value to a vector whose $i$th element is $1$ if it belongs to category $i$, and $0$ otherwise.

For example, if we have "Aardvark", "Babirusa", and "Capybara", for which there is no natural order, the encoding could go like this:
<ul>
    <li>"Aardvark" $\rightarrow [1,0,0]$</li>
    <li>"Babirusa" $\rightarrow [0,1,0]$</li>
    <li>"Capybara" $\rightarrow [0,0,1]$</li>
</ul>
Let's see this in action with the penguins island feature:

In [None]:
# This is how the original data looks
penguins_df[["island"]]

In [None]:
# Step 1: Create a OneHotEncoder
oh_encoder = preprocessing.OneHotEncoder()

In [None]:
# Step 2: Fit it to the data
oh_encoder.fit(penguins_df[["island"]])

In [None]:
# Step 3: Transform the data
oh_islands = oh_encoder.transform(penguins_df[["island"]])

Note: since the result will be a matrix with a lot of zeros, scikit-learn actually returns it in "compressed sparse row" format. Like this:

In [None]:
oh_islands

We don't have time to go over this in detail, but you have two options: 1) you can specify that you don't want a sparse output with the `sparse_output` argument, or 2) you can "decompress" the result by calling its `toarray()` method:

In [None]:
# convert to dense array
oh_islands_arr = oh_islands.toarray()
oh_islands_arr[0:10]

In [None]:
# create OneHotEncoder with sparse_output as False:
preprocessing.OneHotEncoder(sparse_output = False).fit(penguins_df[["island"]]).transform(penguins_df[["island"]])[0:10]

How do we know which vector element belongs to which category? Use the `categories_` attribute, the order they appear on will be the order of the data:

In [None]:
oh_encoder.categories_

How do we put the data back into our dataframe?

In [None]:
penguins_df.head()

First, we could just create a new dataframe from the array:

In [None]:
oh_islands_df = pd.DataFrame(oh_islands_arr, columns = oh_encoder.categories_[0])    # <-- categories_ is a list with one array per encoded column

In [None]:
oh_islands_df.head()

In [None]:
penguins_df.join(oh_islands_df).head()

Alternatively, and presumably easier, we can change the encoder's output format to a pandas dataframe, BUT, if that is the case, we must set `sparse_output` as `False`.

In [None]:
# Make new encoder with updated sparse_output argument:
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False)

# Changing the output format of encoder:
oh_encoder.set_output(transform = "pandas")

oh_encoder.fit(penguins_df[["island"]])

oh_islands_df = oh_encoder.transform(penguins_df[["island"]])

In [None]:
oh_islands_df.head()

In [None]:
penguins_df.join(oh_islands_df).head()

#### <span style = "color:red">EXERCISE</span>

The first predictive feature of the fish dataset (the species) is categorical.
1. Create a one hot encoder and fit it to this data.
2. Transform the data.
3. Visualize the transformed data.

In [None]:
# I'll load the data for you, and select the column we want
fish_df = pd.read_csv("data/fish.csv")
fish_df = fish_df[["species"]]

In [None]:
# Make new encoder with updated sparse_output argument:


# Changing the output format of encoder:


# Fit it to the data


# Transform the data


In [None]:
# Check the transformed data


### Summary

We learned about a few classes from the `preprocessing` module: `StandardScaler`, `Normalizer`, `OrdinalEncoder`, and `OneHotEncoder`. There are many more you will learn in your scikit-learn adventures, but we can't deal with those here. Remember the main recipe for preprocessors:
1. Create an instance of the object you need. Usually like this: `preprocessor = SomeClass()`
2. Fit the preprocessor, usually like this: `preprocessor.fit(X)`
3. Transform your data, which may or may not be the same as the one used for fitting. Usually like this: `preprocessor.transform(X_2)`

Keep this recipe in mind, because the models we are about to use follow a similar logic!

# <span style = "color:rebeccapurple">Part 2 - Regression</span>

OK, it's time to get into machine learning models! Let's start with regression, since it's likely most of you have some familiarity with it.

## <span style = "color:darkorchid">Alice goes to Antarctica!</span>

Alice is studying penguins in Antarctica, and there is a reported shortage of fish in the area. Alice wants to know if this will be consequential for the penguin populations. To find out, she concludes she can calculate a penguin's consumption rate through its body mass. Hence, if she has their body mass, she can estimate if the penguin population is affected by the fish shortage. Seems straightforward enough...

However, weighing the penguins is a difficult, slippery task! On the other hand, Alice reasons that with visual characteristics like flipper length, bill dimension, and sex, which are easier to obtain, she can estimate the body mass. She uses fancy camera equipment and computer vision software to make these measurements.

Her researchers already obtained a small sample of visual features and body mass measurements, which she will use to create a model she can use in the future.

These are the penguin species:<br>
<img src = "images/penguins.png" width = 900>

### <span style = "color:darkorange">Intermezzo - What is regression?</span>

See slides.

### <span style = "color:teal">Version 1: The classical approach.</span>

#### Load and inspect the data

In [None]:
# Load data
penguins_df = pd.read_csv("data/penguins.csv")

In [None]:
# Let's see what's in there
penguins_df.head()

#### Preprocess the data

Our data actually has both the predictive features (visual features) and the target (body mass). So let's put the target in a separate dataframe:

In [None]:
# Extract target from dataframe:
penguins_y = penguins_df[["body_mass_g"]]
penguins_y.head()

Now we have the option of using all features for prediction or just a few. For simplicity let's constrain ourselves to just the numerical values and sex.

In [None]:
# Use just a few features
pred_features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex"]
penguins_df[pred_features].head()

Note we need to identify which features are categorical, and which numerical. From the four we are using, all are numerical except for sex.

In [None]:
num_features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
cat_features = ["sex"]

We'll need to deal with these differently. Let's use one-hot encoding for sex and a standard scaler for the numerical features:

In [None]:
# Deal with numerical features:
sd_scaler = preprocessing.StandardScaler()                     # <-- Create scaler
sd_scaler.set_output(transform = "pandas")                     # <-- Set output to be in pandas dataframe format
sd_scaler.fit(penguins_df[num_features])                       # <-- Fit scaler
penguins_X = sd_scaler.transform(penguins_df[num_features])    # <-- Transform data

penguins_X.head()

In [None]:
# Deal with categorical features
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False)
oh_encoder.set_output(transform = "pandas")
oh_encoder.fit(penguins_df[cat_features])
pxx = oh_encoder.transform(penguins_df[cat_features])

pxx.head()

Now, there is a little trick I want you to know here. As you can see, everything that is not one category is the other, so there is some redundancy in having the two columns above. Indeed, this may cause problems for some models, like linear regression. We can easily solve this by "dropping" one of them like this:

In [None]:
# Deal with categorical features - v2
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary")
oh_encoder.set_output(transform = "pandas")
oh_encoder.fit(penguins_df[cat_features])
pxx = oh_encoder.transform(penguins_df[cat_features])

pxx.head()

Let's merge those:

In [None]:
penguins_X = penguins_X.join(pxx)

In [None]:
penguins_X.head()

#### Introducing the Regression object

The logic is as with the preprocessors: create the object, then fit it.

In [None]:
# lm will stand for "linear model"
lm_penguins = linear_model.LinearRegression()

In [None]:
# Fit it to the data
lm_penguins.fit(X = penguins_X, y = penguins_y)

Great! Now what? Well, we can look at the coefficients like this:

In [None]:
lm_penguins.coef_

Now, `scikit-learn` will not give you p-values and the like, since these are not commonly used in machine learning. But, you can predict the $y$ value of an $X$ observation:

In [None]:
penguins_X.iloc[[0]]

In [None]:
lm_penguins.predict(X = penguins_X.iloc[[0]])

Let's compare this to the true value:

In [None]:
penguins_y.iloc[[0]]

You can also get the $R^2$ score. You input a whole $X$ matrix on which to do the predictions, together with the true $y$ values, and you will get the score back (1 is a perfect fit, and higher is better):

In [None]:
lm_penguins.score(X = penguins_X, y = penguins_y)
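
If you want to know where that number comes from, the sketch below recomputes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand on a tiny made-up dataset (`toy_X` and `toy_y` are hypothetical) and compares it with what `.score()` returns:

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data, not the penguins dataset
toy_X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

toy_lm = linear_model.LinearRegression().fit(toy_X, toy_y)
toy_pred = toy_lm.predict(toy_X)

ss_res = np.sum((toy_y - toy_pred) ** 2)        # residual sum of squares
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Both numbers should agree
print(r2_manual, toy_lm.score(toy_X, toy_y))
```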

#### <span style = "color:red">EXERCISE</span>

1. Create a regression object.
2. Fit it to the fish dataset, the target in this case is the weight column.
3. Obtain the R2 score.

In [None]:
# I'll load the data for you
fish_df = pd.read_csv("data/fish.csv")

In [None]:
# Create two distinct dataframes, one for the predictive features and one for the target
fish_X = 
fish_y = 

In [None]:
fish_X.head()

In [None]:
fish_y.head()

In [None]:
# I will do the preprocessing for you. Feel free to skip this cell and do it yourself.
fishX_num = fish_X.drop(columns = ["species"])
fishX_num = preprocessing.StandardScaler().set_output(transform = "pandas").fit_transform(fishX_num)
fishX_cat = fish_X[["species"]]
fishX_cat = preprocessing.OneHotEncoder(sparse_output = False).set_output(transform = "pandas").fit_transform(fishX_cat)

fish_X = fishX_num.join(fishX_cat)
fish_X.head()

In [None]:
# Create linear regression object


# Fit to data


# Compute R2


### <span style = "color:teal">Version 2: The machine learning validation approach</span>

Mmm... for those of you with more machine learning experience, did something feel off?

That's right, we fitted our model to the <b>complete</b> dataset, and also checked its performance based on it. This is actually taboo in machine learning. The reason dates back centuries, and is an important aspect of the philosophy of science. Basically, when we use all our data to create a model and then evaluate it on that same data, we risk "overfitting" the model: it adapts as much as possible to fit these observations, at the expense of its capacity to generalize to new observations, and our score cannot reveal the problem!
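
Here is a rough sketch of the idea, using made-up noisy data from a straight line (nothing here comes from the penguins dataset, and using `PolynomialFeatures` to make the model more flexible is just my choice for the illustration). We fit a simple and a very flexible model on half of the observations, then score each one on the data it saw and on the data it did not see:

```python
import numpy as np
from sklearn import linear_model, preprocessing

# Hypothetical noisy data from a straight line
rng = np.random.default_rng(0)
toy_x = rng.uniform(-1, 1, size = (30, 1))
toy_y = 2 * toy_x[:, 0] + rng.normal(scale = 0.3, size = 30)

# Hold out the second half of the observations as "new" data
x_seen, x_new = toy_x[:15], toy_x[15:]
y_seen, y_new = toy_y[:15], toy_y[15:]

scores = {}
for degree in (1, 10):
    # A higher polynomial degree gives the model more freedom to chase noise
    poly = preprocessing.PolynomialFeatures(degree = degree, include_bias = False)
    lm = linear_model.LinearRegression().fit(poly.fit_transform(x_seen), y_seen)
    scores[degree] = (lm.score(poly.transform(x_seen), y_seen),
                      lm.score(poly.transform(x_new), y_new))
    print(f"degree {degree}: R2 on seen data = {scores[degree][0]:.3f}, "
          f"R2 on new data = {scores[degree][1]:.3f}")
```

The flexible model cannot score worse than the simple one on the data it was fitted to, so that score alone tells us little. How it fares on the new data is what matters, and the exact numbers will depend on the random draw.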

### <span style = "color:darkorange">Conceptual intermezzo - The bias-variance trade-off, generalization, and validation</span>

See slides

<b>So what is the solution?</b>

Well, the consensus is to separate the data into <b>training</b> and <b>testing</b> data. Then, everything that goes into creating the model must only stem from the training data, while the testing data is kept <span style = "color:red"><b>secret</b></span> from the model until the very end. At the end we can test the model on the secret testing data.

<b>How about preprocessing?</b>

Some preprocessing steps also use information from the data to estimate parameters (for example the `StandardScaler`, which uses the mean and standard deviation). But, we need to keep the testing data secret from everything used to build the model. Hence, preprocessing should only be fitted using training data. We will see this in a moment.

#### Loading the data

In [None]:
# Load the data again, to make sure we are working with the correct dataset
penguins_df = pd.read_csv("data/penguins.csv")
penguins_X = penguins_df[pred_features]
penguins_y = penguins_df[["body_mass_g"]]

#### Splitting the data

The function `train_test_split()` from the `model_selection` module does this automatically for us. We imported it already.

In [None]:
# split data:
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X, penguins_y, test_size = .3)

The `test_size` parameter is the fraction of the data that will be kept as testing data. Let's check the sizes we got:

In [None]:
print(f"Training data: Matrix X of size {pX_train.shape}, target vector y of size {py_train.shape}\n" +
     f"Testing data: Matrix X of size {pX_test.shape}, target vector y of size {py_test.shape}.")

This is how they look:

In [None]:
pX_train.head()

In [None]:
py_train.head()

Did you notice something strange? The indices are all shuffled! Don't panic, as you can see they remain consistent among the feature matrix and the target vector.
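
You can verify this yourself. The sketch below uses a made-up dataframe (`toy_X` and `toy_y` are hypothetical) and also passes `random_state`, an optional argument of `train_test_split()` that fixes the shuffle so you get the same split on every run:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, not the penguins dataset
toy_X = pd.DataFrame({"feature": range(10)})
toy_y = pd.DataFrame({"target": [v * 2 for v in range(10)]})

# random_state fixes the shuffle, so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size = .3, random_state = 42)

print(X_tr.index.tolist())
print(y_tr.index.tolist())
print(X_tr.index.equals(y_tr.index))   # True: rows of X and y stay paired
```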

#### Preprocess

We are experienced with this, so we can easily do it now:

In [None]:
# Preprocess the numerical features
sd_scaler = preprocessing.StandardScaler().set_output(transform = "pandas")
pX_train_num = sd_scaler.fit_transform(pX_train[num_features])

In [None]:
pX_train_num.head()

In [None]:
# Preprocess the categorical features
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary").set_output(transform = "pandas")
pX_train_cat = oh_encoder.fit_transform(pX_train[cat_features])

In [None]:
pX_train_cat.head()

Did you notice my little trick? That's right, since the fit data and the transform data are the same, and transforming after fitting is such a common task, `scikit-learn` provides a method that does both at the same time: `fit_transform`. That saved us a bit of space.
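
If you want to convince yourself that the two routes agree, here is a quick check on a made-up dataframe (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data, not the penguins dataset
toy_df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [3.0, 5.0, 10.0]})

# Route 1: fit and transform in one call
one_step = preprocessing.StandardScaler().fit_transform(toy_df)

# Route 2: fit first, then transform
two_step_scaler = preprocessing.StandardScaler()
two_step_scaler.fit(toy_df)
two_step = two_step_scaler.transform(toy_df)

print(np.allclose(one_step, two_step))   # True
```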

In [None]:
# Let's merge them
pX_train_all = pX_train_num.join(pX_train_cat)

In [None]:
# You can take a quick look if you want:
pX_train_all.head()

#### Build the regression model ONLY on the training data

In [None]:
lm_penguins = linear_model.LinearRegression()
lm_penguins.fit(X = pX_train_all, y = py_train)

#### The testing stage

We also need to preprocess our testing set, BUT, we should do it with the preprocessors that were trained using the training set:

In [None]:
# Scale testing data:
pX_test_num = sd_scaler.transform(pX_test[num_features])
pX_test_cat = oh_encoder.transform(pX_test[cat_features])

pX_test_all = pX_test_num.join(pX_test_cat)

Did you notice the difference? When using the training data, we used `fit_transform(X_train)`, which both fits the preprocessor and then transforms our training data. It is equivalent to first using `.fit()` and then `.transform()`.

In contrast, when we are processing the testing data, we should only call `.transform()`, since we don't want to fit the preprocessors with testing data. This would be data leakage.
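
A tiny sketch with made-up numbers may help here (the arrays `toy_train` and `toy_test` are hypothetical). Calling `.transform()` on the testing data leaves the fitted parameters of the scaler untouched, so the testing data is measured against the training mean and standard deviation:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: the test values are deliberately far from the training ones
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[10.0], [20.0]])

toy_scaler = preprocessing.StandardScaler()
toy_scaler.fit_transform(toy_train)
print(toy_scaler.mean_)            # [2.] learned from the training data

toy_test_scaled = toy_scaler.transform(toy_test)   # no fitting happens here
print(toy_scaler.mean_)            # still [2.]
print(toy_test_scaled.mean())      # not 0: scaled with the training parameters
```

Had we called `fit_transform()` on the testing data instead, the scaler would have quietly relearned its mean from data that is supposed to stay secret.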

In [None]:
# Predict on testing data:
peng_predictions = lm_penguins.predict(pX_test_all)

In [None]:
peng_predictions[0:10]

In [None]:
# Compute R2 for testing data:
p_r2 = lm_penguins.score(X = pX_test_all, y = py_test)

print(f"The coefficient of determination R2 is {p_r2}")

#### Putting it all together

Did you see how easy everything became once we understood the different `scikit-learn` classes? We just needed a few lines!! Indeed, here is the code again, without all those mid-code checks:

In [None]:
# Load and split data
penguins_df = pd.read_csv("data/penguins.csv")
pX_train, pX_test, py_train, py_test = train_test_split(penguins_df[pred_features], penguins_df[["body_mass_g"]], test_size = .3)

# -- Training stage --

# Preprocess training data
sd_scaler = preprocessing.StandardScaler().set_output(transform = "pandas")
pX_train_num = sd_scaler.fit_transform(pX_train[num_features])

oh_encoder = preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary").set_output(transform = "pandas")
pX_train_cat = oh_encoder.fit_transform(pX_train[cat_features])

pX_train_all = pX_train_num.join(pX_train_cat)

# Make and fit model
lm_penguins = linear_model.LinearRegression().fit(X = pX_train_all, y = py_train)

# -- Testing stage --

# Process testing data:
pX_test_num = sd_scaler.transform(pX_test[num_features])
pX_test_cat = oh_encoder.transform(pX_test[cat_features])
pX_test_all = pX_test_num.join(pX_test_cat)

# Evaluate
print(f"The coefficients are: {lm_penguins.coef_}")
print(f"The R2 score is: {lm_penguins.score(X = pX_test_all, y = py_test)}")



## <span style = "color:red">Long Exercise - Bo's Fishy Quest</span>

Now it's Bo's time to shine!

Alice asks Bo for help with the penguin project. This time, they want to be able to take images of fish at a large scale, and estimate their weight based on visual features. This will allow them to keep track of food availability for the penguins, and of the ecosystem health in general. As with the penguins, the visual features will be extracted using some fancy computer vision software.

Bo already has this dataset, which is the fish dataset you've been working on. Your task is to create a linear regression model on this dataset. Remember that the $y$ values, or target, are the weights of the fish.

In [None]:
# Step 1: Load the fish dataset


In [None]:
# Step 2: Specify predictive features and target


In [None]:
# Step 3: Split training and testing data. You can choose your test size


In [None]:
# Step 4: Preprocess the training data. Keep preprocessors for later use


In [None]:
# Step 5: Make and fit the linear model


In [None]:
# Step 6: Preprocess the testing data


In [None]:
# Step 7: Evaluate


## <span style = "color:darkorchid">Preprocessing revisited: the ColumnTransformer</span>

Before we continue to pipelines, let's learn a new trick. Wasn't it super annoying that, when it came to preprocessing, we had to handle the numeric variables and the categorical ones separately and then join them? Some of you may want to avoid this by preprocessing your data outside of `scikit-learn`, but there are important advantages to doing it within `scikit-learn` (avoiding data leakage and enabling parameter search, for example).

To ease our existential burdens, we have the `ColumnTransformer` from the `compose` module, which we have imported already.

The way it works is straightforward: we create the object and pass it a list of tuples. Each tuple contains three things: the name we want to give the *transformer* (like our preprocessor objects), the transformer object itself, and the columns of our data on which we want the transformer to act.

This makes more sense in action:

In [None]:
penguins_df.head()

In [None]:
# Create a ColumnTransformer object
col_trans = ColumnTransformer(
    [("cat", preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary"), cat_features),
    ("num", preprocessing.StandardScaler(), num_features)]
)

Guess what, we can set the output format for column transformers also, and we just need to do it once:

In [None]:
# Set output to pandas format for all transformers at the same time
col_trans.set_output(transform = "pandas")

Woah, what just happened here? Well, it's showing us a nice diagram of our column transformer. Click on the arrows to see what's inside. We will see more of these diagrams soon.

Do you see the line exiting the bottom? It's indicating to us what the output is.

We can also create the column transformer and set the output data in one go:

In [None]:
col_trans = ColumnTransformer(
    [("cat", preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary"), cat_features),
    ("num", preprocessing.StandardScaler(), num_features)]
).set_output(transform = "pandas")

Now, we haven't given it any data yet. So let's do that:

In [None]:
penguins_transformed = col_trans.fit_transform(penguins_df)

In [None]:
penguins_transformed.head()

Is this cool or what?

# <span style = "color:rebeccapurple">Part 3 - The Pipeline class</span>

OK, it is time to talk about pipelines. A pipeline is an **abstraction** representing the whole process through which we implement a machine learning solution. A simplified pipeline may look like this:

1. State the problem.
2. Gather data.
3. Split training and testing data.
4. Preprocess the data.
5. Train the model.
6. Evaluate and optimize the model.
7. Draw awesome figures.

Having such a guide is very useful, and a starting point. However, 1) not all machine learning tasks follow the recipe above, so we must remain flexible, and 2) no realistic process is linear.

A realistic process would be more of a dynamic web of relationships. However, this model will do for now.

As it turns out, `scikit-learn` makes our life even easier by letting us specify our own pipeline. This is done through the `Pipeline` class, which we imported already.

## <span style = "color:darkorchid">Our first pipeline object</span>

Each pipeline object requires a list of *steps* in the pipeline, which we specify as a list of tuples. The tuples consist of two elements: the name we want to give a given step (for example, "scaler", "regressor", etc.), and then the actual object that will realize that step of the pipeline.

In [None]:
# Create a list of steps:
pipe_list = [("step1_sd_scaler", preprocessing.StandardScaler()),
            ("step2_lm", linear_model.LinearRegression())]

In [None]:
# Create the pipeline
my_pipe = Pipeline(pipe_list)

In [None]:
my_pipe

Nice! That right there is a pipeline object. Looks a bit like our `ColumnTransformer` right? Notice the output is the output of the LinearRegression object.

What's great about this is we can now just fit the whole pipeline to our dataset and it will magically work:

In [None]:
# Let's get some dummy data
penguins_X_num = pd.read_csv("data/penguins.csv")[num_features]
penguins_y = pd.read_csv("data/penguins.csv")["body_mass_g"]

In [None]:
# Fit the pipeline:
my_pipe.fit(penguins_X_num, penguins_y)

So what did it do? It first scaled our data, and then it took the output of that step and used it to fit a linear regression model.
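
To make that concrete, here is a sketch on made-up random data (the `toy_` names are hypothetical) that fits a pipeline and then repeats the same two steps by hand. Both routes should give the same predictions:

```python
import numpy as np
from sklearn import linear_model, preprocessing
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the penguins dataset
rng = np.random.default_rng(0)
toy_X = rng.normal(size = (20, 2))
toy_y = toy_X @ np.array([1.5, -2.0]) + rng.normal(scale = 0.1, size = 20)

# Route 1: the pipeline does both steps for us
toy_pipe = Pipeline([("scaler", preprocessing.StandardScaler()),
                     ("lm", linear_model.LinearRegression())])
toy_pipe.fit(toy_X, toy_y)

# Route 2: scale by hand, then fit the regression on the scaled data
manual_scaler = preprocessing.StandardScaler()
manual_lm = linear_model.LinearRegression()
manual_lm.fit(manual_scaler.fit_transform(toy_X), toy_y)

pipe_pred = toy_pipe.predict(toy_X)
manual_pred = manual_lm.predict(manual_scaler.transform(toy_X))
print(np.allclose(pipe_pred, manual_pred))   # True
```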

The pipeline can calculate your $R^2$:

In [None]:
my_pipe.score(penguins_X_num, penguins_y)

But it can't get you the regression coefficients directly. For that, you'll need to access the linear regression object. You can access each step with python indexing:

In [None]:
# Access first step
my_pipe[0]

In [None]:
# Access second step
my_pipe[1]

Now we can get the regression coefficients:

In [None]:
my_pipe[1].coef_

We can also access the steps by name, as in a dictionary:

In [None]:
# Access linear regression object, and get coefficients from there.
my_pipe["step2_lm"].coef_

### <span style = "color:teal">Don't forget about the train test split!</span>

OK, the above helped us get familiar with pipelines, but we didn't split our data. That's a sin! Let's atone:

In [None]:
# Get data
penguins_X_num = pd.read_csv("data/penguins.csv")[num_features]
penguins_y = pd.read_csv("data/penguins.csv")["body_mass_g"]

# Split data right away:
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X_num, penguins_y, test_size = .3)

# Create pipeline:
penguins_pipeline = Pipeline([("step1_sd_scaler", preprocessing.StandardScaler()),
            ("step2_lm", linear_model.LinearRegression())])

# Fit pipeline:
penguins_pipeline.fit(pX_train, py_train)

# Get score on test dataset:
penguins_pipeline.score(pX_test, py_test)

Now hold on a sec, didn't we have to preprocess the testing data also? Don't worry, the pipeline does it for us!

When we use the `fit()` method, the entire pipeline is fitted with the given data. However, when we use methods like `score()` and `predict()`, it uses the already fitted values to preprocess the given data, and then it performs the scoring or prediction on the preprocessed data. How cool!
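
If you want to peek under the hood, the sketch below (on made-up random data, with hypothetical `toy_` and `tX_` names) reproduces what `score()` does on the testing data. It relies on the fact that a pipeline can be sliced like a list, so `toy_pipe[:-1]` is a pipeline with every step except the last one:

```python
import numpy as np
from sklearn import linear_model, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the penguins dataset
rng = np.random.default_rng(1)
toy_X = rng.normal(loc = 5.0, size = (40, 2))
toy_y = toy_X @ np.array([3.0, -1.0]) + rng.normal(scale = 0.2, size = 40)

tX_train, tX_test, ty_train, ty_test = train_test_split(toy_X, toy_y, test_size = .25, random_state = 0)

toy_pipe = Pipeline([("scaler", preprocessing.StandardScaler()),
                     ("lm", linear_model.LinearRegression())])
toy_pipe.fit(tX_train, ty_train)

# The scaler inside the pipeline only ever saw the training data
print(np.allclose(toy_pipe["scaler"].mean_, tX_train.mean(axis = 0)))   # True

# score() on the test set: transform with the fitted scaler, then score the model
tX_test_scaled = toy_pipe[:-1].transform(tX_test)
manual_score = toy_pipe[-1].score(tX_test_scaled, ty_test)
print(np.isclose(manual_score, toy_pipe.score(tX_test, ty_test)))       # True
```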

## <span style = "color:darkorchid">Implementing ColumnTransformer</span>

Above we only used the numerical data in our pipeline. But what if we also have categorical data, and we want a one-hot encoder together with a standard scaler? Well, no sweat, we can take our `ColumnTransformer()` approach above, and use it with our pipeline!

<b>Load data</b>

In [None]:
# Let's get some dummy data
penguins_X = pd.read_csv("data/penguins.csv")[pred_features]    # <-- Using all predictive features
penguins_y = pd.read_csv("data/penguins.csv")[["body_mass_g"]]

In [None]:
penguins_X.head()

<b>Split the data!</b>

In [None]:
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X, penguins_y, test_size = .3)

In [None]:
pX_train.head()

<b>Create a ColumnTransformer:</b>

In [None]:
# Column Transformer
col_trans = ColumnTransformer(
    [("cat", preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary"), cat_features),
    ("num", preprocessing.StandardScaler(), num_features)]
)

In [None]:
# Make pipeline
penguins_pipeline = Pipeline([
    ("col_trans", col_trans),
    ("linear_model", linear_model.LinearRegression())
])

<b>Now we fit and score:</b>

In [None]:
# Fit pipeline
penguins_pipeline.fit(pX_train, py_train)

In [None]:
# Calculate Score:
penguins_pipeline.score(pX_test, py_test)

### Note

Just as we can make a column transformer a part of a pipeline, you can make a pipeline a part of a column transformer. For example, if each column requires several steps of preprocessing before they merge, you would create a preprocessing pipeline for each of them, then join them with a column transformer, then embed that column transformer into a larger pipeline.

We don't have time to do this, but it's useful info for the future.

## <span style = "color:darkorchid">Alice in Antarctica - Redux</span>

Time to put everything we've done together:

In [None]:
# Step 1 - Load the data
penguins_X = pd.read_csv("data/penguins.csv")[pred_features]
penguins_y = pd.read_csv("data/penguins.csv")[["body_mass_g"]]

# Step 2 - Split the data
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X, penguins_y, test_size = .3)

# Step 3 - Create transformers and pipelines
col_trans = ColumnTransformer(
    [("cat", preprocessing.OneHotEncoder(drop = "if_binary"), cat_features),
    ("num", preprocessing.StandardScaler(), num_features)])

penguins_pipeline = Pipeline([
    ("col_trans", col_trans),
    ("linear_model", linear_model.LinearRegression())
])

# Step 4 - Fit full pipeline
penguins_pipeline.fit(pX_train, py_train)

# Step 5 - Evaluate
penguins_pipeline.score(pX_test, py_test)

Notice that we keep "abstracting away", taking a perspective at higher and higher levels. This is the art of good object-oriented programming, and also of a complicated machine learning / data science project.

## <span style = "color:red">Bo's fishy quest - Redux</span>

You know what to do here ;-)

# <span style = "color:darkorange">Conceptual Intermezzo - Machine Learning tasks</span>

See slides