# <span style = "color:rebeccapurple"> Regression

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

## <span style = "color:darkorchid"> Imports

First, as with all scripts, let's import all the modules we will use for this workshop.

In [1]:
# :: IMPORTS ::

# Scikit-learn specifics:
from sklearn import preprocessing
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Helper modules
import pandas as pd

OK, it's time to get into machine learning models! Let's start with regression, since it's likely most of you have some familiarity with it.

## <span style = "color:darkorchid"> Alice goes to Antarctica!

Alice is studying penguins in Antarctica, and there is a reported shortage of fish in the area. Alice wants to know if this will be consequential for the penguin populations. She reasons that a penguin's consumption rate can be estimated from its body mass. Hence, if she knows the penguins' body mass, she can estimate whether the population is affected by the fish shortage. Seems straightforward enough...

However, weighing the penguins is a difficult, slippery task! On the other hand, Alice reasons that she can estimate body mass from visual characteristics like flipper length, bill dimensions, and sex, which are easier to obtain. She uses fancy camera equipment and computer vision software to make these measurements.

Her researchers have already collected a small sample of visual features and body mass measurements, which she will use to build a model for future use.

These are the penguin species:<br>
<img src = "images/penguins.png" width = 900>

### <span style = "color:darkorange">Intermezzo - What is regression?

See slides.

## <span style = "color:darkorchid">Version 1: The classical approach

### <span style="color:teal"> Load and inspect the data

In [2]:
# Load data
penguins_df = pd.read_csv("data/penguins.csv")

In [3]:
# Let's see what's in there
penguins_df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007


### <span style="color:teal"> Preprocess the data

Our data actually has both the predictive features (visual features) and the target (body mass). So let's put the target in a separate dataframe:

In [4]:
# Extract target from dataframe:
penguins_y = penguins_df[["body_mass_g"]]
penguins_y.head()

Unnamed: 0,body_mass_g
0,3750.0
1,3800.0
2,3250.0
3,3450.0
4,3650.0


Now we have the option of using all features for prediction or just a few. For simplicity, let's restrict ourselves to the three numerical body measurements and sex.

In [5]:
# Use just a few features
pred_features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex"]
penguins_df[pred_features].head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,sex
0,39.1,18.7,181.0,male
1,39.5,17.4,186.0,female
2,40.3,18.0,195.0,female
3,36.7,19.3,193.0,female
4,39.3,20.6,190.0,male


Note that we need to identify which features are categorical and which are numerical. Of the four we are using, all are numerical except for sex.

In [6]:
num_features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
cat_features = ["sex"]
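
With only four features it is easy to list them by hand. For wider tables, here is a minimal sketch of how pandas' `select_dtypes` could do the sorting for you. The dataframe `toy_df` and its values are made up for illustration, and the approach assumes the column dtypes reflect the feature type (a category coded as integers would be misfiled as numerical).

```python
import pandas as pd

# Toy dataframe (hypothetical values, not the workshop data)
toy_df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3],
    "flipper_length_mm": [181.0, 186.0, 195.0],
    "sex": ["male", "female", "female"],
})

# Columns with a numeric dtype
toy_num_features = toy_df.select_dtypes(include = "number").columns.tolist()
# Everything else is treated as categorical in this sketch
toy_cat_features = toy_df.select_dtypes(exclude = "number").columns.tolist()

print(toy_num_features)
print(toy_cat_features)
```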

We'll need to deal with these differently. Let's use one-hot encoding for sex and a standard scaler for the numerical features:

In [7]:
# Deal with numerical features:
sd_scaler = preprocessing.StandardScaler()                     # <-- Create scaler
sd_scaler.set_output(transform = "pandas")                     # <-- Set output to be in pandas dataframe format
sd_scaler.fit(penguins_df[num_features])                       # <-- Fit scaler
penguins_X = sd_scaler.transform(penguins_df[num_features])    # <-- Transform data

penguins_X.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm
0,-0.896042,0.780732,-1.426752
1,-0.822788,0.119584,-1.069474
2,-0.67628,0.424729,-0.426373
3,-1.335566,1.085877,-0.569284
4,-0.859415,1.747026,-0.783651
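
If you are curious about what the scaler did, it replaced each value $x$ by $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of that column. Here is a small sketch on made-up numbers (the array `toy_x` is hypothetical) that checks this by hand:

```python
import numpy as np
from sklearn import preprocessing

# Toy feature column (hypothetical values)
toy_x = np.array([[1.0], [2.0], [3.0], [4.0]])

# By hand: subtract the column mean, divide by the column standard deviation
# (np.std uses the population formula, ddof=0, by default)
manual = (toy_x - toy_x.mean(axis = 0)) / toy_x.std(axis = 0)

# The same thing with scikit-learn
scaled = preprocessing.StandardScaler().fit_transform(toy_x)

print(np.allclose(manual, scaled))
```

After scaling, each column has mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.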


In [8]:
# Deal with categorical features
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False)
oh_encoder.set_output(transform = "pandas")
oh_encoder.fit(penguins_df[cat_features])
pxx = oh_encoder.transform(penguins_df[cat_features])

pxx.head()

Unnamed: 0,sex_female,sex_male
0,0.0,1.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,0.0,1.0


Now, there is a little trick I want you to know here. As you can see, every penguin that is not in one category is in the other, so having both columns is redundant: `sex_female` is always 1 minus `sex_male`. This redundancy can cause problems for some models, like linear regression, because it makes the columns perfectly collinear once an intercept is included. We can easily solve this by "dropping" one of them like this:

In [9]:
# Deal with categorical features - v2
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary") # "drop" argument
oh_encoder.set_output(transform = "pandas")
oh_encoder.fit(penguins_df[cat_features])
pxx = oh_encoder.transform(penguins_df[cat_features])

pxx.head()

Unnamed: 0,sex_male
0,1.0
1,0.0
2,0.0
3,0.0
4,1.0
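
To see the redundancy in numbers, here is a minimal sketch with a made-up design matrix (`X_both` is hypothetical). It assumes the model includes an intercept, which `LinearRegression` does by default, and uses the matrix rank to count how many columns carry independent information.

```python
import numpy as np

# Toy design matrix: intercept column, sex_female, sex_male (hypothetical rows)
X_both = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

# sex_female + sex_male always reproduces the intercept column
print(np.allclose(X_both[:, 1] + X_both[:, 2], X_both[:, 0]))

# Three columns, but only two of them are independent
rank_both = np.linalg.matrix_rank(X_both)

# Dropping sex_female loses no information: the rank stays the same
rank_dropped = np.linalg.matrix_rank(X_both[:, [0, 2]])

print(rank_both, rank_dropped)
```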


Let's merge our preprocessed data:

In [10]:
penguins_X = penguins_X.join(pxx)
penguins_X.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,sex_male
0,-0.896042,0.780732,-1.426752,1.0
1,-0.822788,0.119584,-1.069474,0.0
2,-0.67628,0.424729,-0.426373,0.0
3,-1.335566,1.085877,-0.569284,0.0
4,-0.859415,1.747026,-0.783651,1.0


### <span style="color:teal">Introducing the Regression object

The logic is the same as with the preprocessors: create the object, then fit it.

In [11]:
# lm will stand for "linear model"
lm_penguins = linear_model.LinearRegression()
lm_penguins

In [12]:
# Fit it to the data
lm_penguins.fit(X = penguins_X, y = penguins_y)

Great! Now what? Well, we can look at the coefficients like this:

In [13]:
lm_penguins.coef_

array([[ -12.71565787, -169.27325249,  543.35599115,  541.02853344]])
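
A bare array of numbers is hard to read. One possible way to label the coefficients is sketched below on made-up data (`toy_X`, `toy_y`, and the column names `a` and `b` are hypothetical). Note that `coef_` has one row per target column, which is why it is two-dimensional when $y$ is a one-column dataframe.

```python
import pandas as pd
from sklearn import linear_model

# Toy data (hypothetical values): target = 2 * a - 1 * b + 5, with no noise
toy_X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 2.0, 1.0]})
toy_y = pd.DataFrame({"target": 2 * toy_X["a"] - toy_X["b"] + 5})

toy_lm = linear_model.LinearRegression().fit(X = toy_X, y = toy_y)

# One row per target column, one column per feature
print(toy_lm.coef_.shape)

# Flatten the single row and attach the feature names
toy_coefs = pd.Series(toy_lm.coef_.ravel(), index = toy_X.columns)
print(toy_coefs)

# The intercept is stored separately
print(toy_lm.intercept_)
```

Since the toy target was built without noise, the model recovers the coefficients 2 and -1 and the intercept 5, up to rounding error.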

Now, `scikit-learn` will not give you $p$-values and the like, since these are not commonly used in machine learning. But you can predict the $y$ value for an observation in $X$:

In [14]:
penguins_X.iloc[[0]]

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,sex_male
0,-0.896042,0.780732,-1.426752,1.0


In [15]:
lm_penguins.predict(X = penguins_X.iloc[[0]])

array([[3579.13694585]])

Let's compare this to the true value:

In [16]:
penguins_y.iloc[[0]]

Unnamed: 0,body_mass_g
0,3750.0
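
There is no magic in `.predict()` for a linear model: the prediction is the intercept plus each feature value multiplied by its coefficient. Here is a small sketch on made-up data (all values are hypothetical) that reproduces a prediction by hand:

```python
import numpy as np
from sklearn import linear_model

# Toy data (hypothetical values)
toy_X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
toy_y = np.array([[4.0], [7.5], [6.0], [10.0]])

toy_lm = linear_model.LinearRegression().fit(toy_X, toy_y)

# Prediction by hand for the first observation:
# intercept + sum of (coefficient * feature value); @ is matrix multiplication
first_row = toy_X[[0]]
by_hand = toy_lm.intercept_ + first_row @ toy_lm.coef_.T

print(np.allclose(by_hand, toy_lm.predict(first_row)))
```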


You can also get the $R^2$ score. You input a whole $X$ matrix on which to do the predictions, together with the true $y$ values, and you will get the score:

In [17]:
lm_penguins.score(X = penguins_X, y = penguins_y)

0.8230077584726003
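
As a reminder, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared differences between true and predicted values, and $SS_{tot}$ is the sum of squared differences between the true values and their mean. The sketch below, on made-up data, computes it by hand and compares it with `.score()`:

```python
import numpy as np
from sklearn import linear_model

# Toy data (hypothetical values)
toy_X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([[1.2], [1.9], [3.2], [3.8], [5.1]])

toy_lm = linear_model.LinearRegression().fit(toy_X, toy_y)
toy_pred = toy_lm.predict(toy_X)

# Residual sum of squares: how far predictions are from the truth
ss_res = ((toy_y - toy_pred) ** 2).sum()
# Total sum of squares: how far the truth is from its own mean
ss_tot = ((toy_y - toy_y.mean()) ** 2).sum()

r2_by_hand = 1 - ss_res / ss_tot

print(np.isclose(r2_by_hand, toy_lm.score(toy_X, toy_y)))
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean.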

#### <span style = "color:red"> EXERCISE

1. Create a regression object.
2. Fit it to the fish dataset. The target in this case is the weight column.
3. Obtain the $R^2$ score.

In [None]:
# I'll load the data for you
fish_df = pd.read_csv("data/fish.csv")

In [None]:
# Create two distinct dataframes, one for the predictive features and one for the target
fish_X = 
fish_y = 

In [None]:
fish_X.head()

In [None]:
fish_y.head()

In [None]:
# I will do the preprocessing for you. Feel free to skip this cell and do it yourself.
fishX_num = fish_X.drop(columns = ["species"])
fishX_num = preprocessing.StandardScaler().set_output(transform = "pandas").fit_transform(fishX_num)
fishX_cat = fish_X[["species"]]
fishX_cat = preprocessing.OneHotEncoder(sparse_output = False).set_output(transform = "pandas").fit_transform(fishX_cat)

fish_X = fishX_num.join(fishX_cat)
fish_X.head()

In [None]:
# Create linear regression object


# Fit to data


# Compute R2


## <span style = "color:darkorchid"> Version 2: The machine learning approach

Mmm... for those of you with more machine learning experience, did something feel off?

That's right, we fitted our model to the <b>complete</b> dataset, and then checked its performance on that same dataset. This is actually taboo in machine learning. The reason dates back centuries, and is an important aspect of the philosophy of science: a hypothesis should be tested on evidence that was not used to build it. Basically, a model adapts as much as possible to the observations it is fitted on, and if it adapts too closely it "overfits", capturing their quirks at the expense of its capacity to generalize to new observations. If we score the model on the very data it was fitted on, we cannot tell whether that has happened!!
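
To make the danger concrete, here is a minimal sketch on purely random data, so every name and value in it is made up. The features have nothing to do with the target, yet with 29 features plus an intercept and only 30 observations the linear model has enough freedom to reproduce the data it was fitted on.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)

# Pure noise: 29 random features that are unrelated to the target
X_fit = rng.normal(size = (30, 29))
y_fit = rng.normal(size = 30)

# Fresh noise the model has never seen
X_new = rng.normal(size = (30, 29))
y_new = rng.normal(size = 30)

noise_lm = linear_model.LinearRegression().fit(X_fit, y_fit)

fit_r2 = noise_lm.score(X_fit, y_fit)   # scored on the data it was fitted to
new_r2 = noise_lm.score(X_new, y_new)   # scored on unseen data

print(f"R2 on the fitting data: {fit_r2:.2f}")
print(f"R2 on new data:         {new_r2:.2f}")
```

The first score is essentially perfect even though there is nothing to learn, and the second one shows how little of that carries over to new observations.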

### <span style = "color:darkorange"> Conceptual intermezzo - The bias-variance trade-off, generalization, and validation

See slides.

### <span style="color:teal"> The train-test split

<b>So what is the solution?</b>

Well, the consensus is to separate the data into <b>training</b> and <b>testing</b> data. Then, everything that goes into creating the model must stem only from the training data, while the testing data is kept <span style = "color:red"><b>secret</b></span> from the model until the very end. At the end we can test the model on the secret testing data.

<b>How about preprocessing?</b>

Some preprocessing steps also use information from the data to estimate parameters (for example the `StandardScaler`, which uses the mean and standard deviation). But, we need to keep the testing data secret from everything used to build the model. Hence, preprocessing should only be fitted using training data. We will see this in a moment.

### <span style="color:teal"> Loading the data

In [18]:
# Load the data again, to make sure we are working with the correct dataset
penguins_df = pd.read_csv("data/penguins.csv")
penguins_X = penguins_df[pred_features]
penguins_y = penguins_df[["body_mass_g"]]

### <span style="color:teal"> Splitting the data

The function `train_test_split()` from the `model_selection` module does this automatically for us. We imported it already.

In [19]:
# split data:
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X, penguins_y, test_size = .3)

The `test_size` parameter is the fraction of the data that will be kept as testing data. Let's check the sizes we got:

In [20]:
print(f"Training data: Matrix X of size {pX_train.shape}, target vector y of size {py_train.shape}\n" +
     f"Testing data: Matrix X of size {pX_test.shape}, target vector y of size {py_test.shape}.")

Training data: Matrix X of size (233, 4), target vector y of size (233, 1)
Testing data: Matrix X of size (100, 4), target vector y of size (100, 1).


This is how they look:

In [21]:
pX_train.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,sex
213,50.7,15.0,223.0,male
294,52.8,20.0,205.0,male
150,47.6,14.5,215.0,male
307,50.9,19.1,196.0,male
298,51.0,18.8,203.0,male


In [22]:
py_train.head()

Unnamed: 0,body_mass_g
213,5550.0
294,4550.0
150,5400.0
307,3550.0
298,4100.0


Did you notice something strange? The indices are all shuffled! Don't panic: as you can see, they remain consistent between the feature matrix and the target vector.
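
If you want to convince yourself, here is a small sketch on a made-up dataset of 20 rows (the names `toy_X`, `toy_y`, and the `t`-prefixed variables are hypothetical). It checks both the split sizes and that features and target were shuffled together.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 20 rows (hypothetical values)
toy_X = pd.DataFrame({"feature": range(20)})
toy_y = pd.DataFrame({"target": [10 * v for v in range(20)]})

tX_train, tX_test, ty_train, ty_test = train_test_split(toy_X, toy_y, test_size = .25)

# 25% of the 20 rows go to the testing set, the rest to the training set
print(tX_train.shape, tX_test.shape)

# The rows are shuffled, but X and y keep matching indices
print(tX_train.index.equals(ty_train.index))
print(tX_test.index.equals(ty_test.index))
```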

### <span style="color:teal"> Preprocess

We are experienced with this, so we can easily do it now:

In [23]:
# Preprocess the numerical features
sd_scaler = preprocessing.StandardScaler().set_output(transform = "pandas")
pX_train_num = sd_scaler.fit_transform(pX_train[num_features])
pX_train_num.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm
213,1.257484,-1.122242,1.51244
294,1.642761,1.41052,0.259344
150,0.688742,-1.375518,0.955508
307,1.294177,0.954623,-0.367204
298,1.312523,0.802657,0.120111


In [24]:
# Preprocess the categorical features
oh_encoder = preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary").set_output(transform = "pandas")
pX_train_cat = oh_encoder.fit_transform(pX_train[cat_features])
pX_train_cat.head()

Unnamed: 0,sex_male
213,1.0
294,1.0
150,1.0
307,1.0
298,1.0


Did you notice my little trick? That's right, since the fit data and the transform data are the same, and transforming after fitting is such a common task, `scikit-learn` provides a method that does both at the same time: `fit_transform`. That saved us a bit of space.
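
Here is a quick sketch on made-up numbers (`toy_x` is hypothetical) checking that the one-step and the two-step versions give the same result:

```python
import numpy as np
from sklearn import preprocessing

# Toy feature column (hypothetical values)
toy_x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Two steps: fit, then transform
two_step_scaler = preprocessing.StandardScaler()
two_step_scaler.fit(toy_x)
two_step = two_step_scaler.transform(toy_x)

# One step: fit and transform at once
one_step = preprocessing.StandardScaler().fit_transform(toy_x)

print(np.allclose(two_step, one_step))
```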

In [25]:
# Let's merge them
pX_train_all = pX_train_num.join(pX_train_cat)

In [26]:
# You can take a quick look if you want:
pX_train_all.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,sex_male
213,1.257484,-1.122242,1.51244,1.0
294,1.642761,1.41052,0.259344,1.0
150,0.688742,-1.375518,0.955508,1.0
307,1.294177,0.954623,-0.367204,1.0
298,1.312523,0.802657,0.120111,1.0


### <span style="color:teal"> Build the regression model ONLY on the training data

In [27]:
lm_penguins = linear_model.LinearRegression()
lm_penguins.fit(X = pX_train_all, y = py_train)

### <span style="color:teal"> The testing stage

We also need to preprocess our testing set, BUT, we should do it with the preprocessors that were trained using the training set:

In [28]:
# Scale testing data:
pX_test_num = sd_scaler.transform(pX_test[num_features])
pX_test_cat = oh_encoder.transform(pX_test[cat_features])

pX_test_all = pX_test_num.join(pX_test_cat)

Did you notice the difference? With the training data we used `fit_transform(X_train)`, which first fits the preprocessor and then transforms the training data. It is equivalent to calling `.fit()` and then `.transform()`.

In contrast, when processing the testing data we should only call `.transform()`, since we don't want to fit the preprocessors with testing data. That would be data leakage.
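
The sketch below uses a tiny made-up split (`toy_train` and `toy_test` are hypothetical) to show what "only transform" means in practice: the testing data is shifted and scaled with the mean and standard deviation learned from the training data, not with its own.

```python
import numpy as np
from sklearn import preprocessing

# Toy split (hypothetical values): the testing values are larger than the training ones
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0], [5.0]])

# The scaler learns its mean and standard deviation from the training data only
toy_scaler = preprocessing.StandardScaler().fit(toy_train)
print(toy_scaler.mean_)   # the mean of 1, 2, 3

# The testing data is transformed with the training statistics
toy_test_scaled = toy_scaler.transform(toy_test)
by_hand = (toy_test - toy_train.mean()) / toy_train.std()

print(np.allclose(toy_test_scaled, by_hand))
```

Notice that the scaled testing data does not need to have mean 0. That is expected, and it is the honest picture of how new observations compare with the training ones.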

In [29]:
# Predict on testing data:
peng_predictions = lm_penguins.predict(pX_test_all)
peng_predictions[0:10]

array([[5125.25531532],
       [3919.23522584],
       [3765.45255221],
       [4524.95973105],
       [4062.30121508],
       [3691.66584447],
       [3521.03056887],
       [4194.97819903],
       [4488.16036295],
       [5642.74636083]])

In [31]:
# Compute R2 for testing data:
p_r2 = lm_penguins.score(X = pX_test_all, y = py_test)

print(f"The coefficient of determination R2 is {p_r2:.2f}")

The coefficient of determination R2 is 0.81


### <span style="color:teal"> Putting it all together

Did you see how easy everything became once we understood the different `scikit-learn` classes? We just needed a few lines!! Indeed, here is the code again, without all those mid-code checks:

In [32]:
# Load and split data
penguins_df = pd.read_csv("data/penguins.csv")
pX_train, pX_test, py_train, py_test = train_test_split(penguins_df[pred_features], penguins_y, test_size = .2)

# -- Training stage --

# Preprocess training data
sd_scaler = preprocessing.StandardScaler().set_output(transform = "pandas")
pX_train_num = sd_scaler.fit_transform(pX_train[num_features])

oh_encoder = preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary").set_output(transform = "pandas")
pX_train_cat = oh_encoder.fit_transform(pX_train[cat_features])

pX_train_all = pX_train_num.join(pX_train_cat)

# Make and fit model
lm_penguins = linear_model.LinearRegression().fit(X = pX_train_all, y = py_train)

# -- Testing stage --

# Preprocess testing data:
pX_test_num = sd_scaler.transform(pX_test[num_features])
pX_test_cat = oh_encoder.transform(pX_test[cat_features])
pX_test_all = pX_test_num.join(pX_test_cat)

# Evaluate
print(f"The coefficients are: {lm_penguins.coef_}")
print(f"The R2 score is: {lm_penguins.score(X = pX_test_all, y = py_test):.2f}")



The coefficients are: [[   6.39248782 -172.08860136  520.38758995  525.40523352]]
The R2 score is: 0.83


### <span style="color:teal"> The random state (seed)

Finally, you may have noticed your train-test split was different from mine. That is because the split is random. To ensure reproducibility, any object or function in `sklearn` that takes random actions has an argument `random_state`, which seeds the random number generator. By setting it to a fixed integer we can be sure we'll always get the same results.

In [40]:
pX_train, _, _, _ = train_test_split(
    penguins_df[pred_features], penguins_y, test_size = .2,
    random_state=42       # <-- I'm fixing the seed of the random number generator
)
pX_train.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,sex
224,49.1,14.5,212.0,female
78,37.3,17.8,191.0,female
295,40.9,16.6,187.0,female
17,35.9,19.2,189.0,female
24,40.5,18.9,180.0,male
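
A quick way to check this yourself is sketched below on a made-up dataframe (`toy_df` is hypothetical): two splits with the same `random_state` pick the same rows in the same order.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 10 rows (hypothetical values)
toy_df = pd.DataFrame({"feature": range(10)})

# Two independent calls with the same seed
split_a, _ = train_test_split(toy_df, test_size = .2, random_state = 42)
split_b, _ = train_test_split(toy_df, test_size = .2, random_state = 42)

# Same seed, same rows, same order
print(split_a.index.equals(split_b.index))
```

The value 42 itself is arbitrary. Any fixed integer works, as long as you use the same one every time.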


## <span style = "color:red"> Long Exercise - Bo's Fishy Quest

Now it's Bo's time to shine!

Alice asks Bo for help with the penguin project. This time, they want to be able to take images of fish at a large scale, and estimate their weight based on visual features. This will allow them to keep track of food availability for the penguins, and of the ecosystem health in general. As with the penguins, the visual features will be extracted using some fancy computer vision software.

Bo already has this dataset, which is the fish dataset you've been working on in the exercises. Your task is to create a linear regression model on this dataset. Remember that the $y$ values, or target, are the weights of the fish.

In [None]:
# Step 1: Load the fish dataset


In [None]:
# Step 2: Specify predictive features and target


In [None]:
# Step 3: Split training and testing data. You can choose your test size


In [None]:
# Step 4: Preprocess the training data. Keep preprocessors for later use


In [None]:
# Step 5: Make and fit the linear model


In [None]:
# Step 6: Preprocess the testing data


In [None]:
# Step 7: Evaluate
