## Overview

In the previous objectives, we used `seaborn` to fit a simple linear regression to a dataset containing penguin weights and flipper lengths. In that example, we compared our baseline (the ratio of weight to flipper length) to the actual line of best fit in the plot.

Throughout this unit we're going to be using the tools available in the scikit-learn library. Most likely, you've already come across this library and even used some of the tools, either in Unit 1 or during your own learning.

Right now, we're going to work through an example using scikit-learn to fit a linear regression model, using the same dataset from the previous objective. While some of this material may be review, it's still important to go through each of the steps, both for practice and to address concepts that we might have missed.

### Linear Regression

Before we get into how to use scikit-learn to fit a model, we'll do a quick review of linear regression and the associated coefficients. A linear regression fits a line to the data, where the equation of the line is given by:

<script src="https://i.upmath.me/latex.js"></script>
<p>$$y = \beta_0 + \beta_1x$$</p>
<p>When we fit a line, we’re trying to find the coefficients $\beta_0$ and $\beta_1$. The parameter $\beta_0$ is the intercept (the value of $y$ when $x = 0$) and $\beta_1$ is the slope. Fitting the model returns the slope and intercept.</p>
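
To make the two coefficients concrete, here is a minimal sketch that computes them by hand with the closed-form ordinary least squares formulas for a single feature. The data points are hypothetical and were chosen to lie exactly on the line $y = 2 + 3x$, so the recovered intercept and slope are easy to check.

```python
import numpy as np

# Hypothetical points that lie exactly on y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Closed-form least-squares estimates for a single feature:
# slope = covariance(x, y) / variance(x); intercept = mean(y) - slope * mean(x)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(f"intercept (beta_0): {beta_0:.2f}")
print(f"slope (beta_1): {beta_1:.2f}")
```

With real, noisy data the points will not fall exactly on a line, and the same formulas give the line that minimizes the sum of squared vertical distances.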

In the next objective, we'll focus more on the meaning of the coefficients. Right now, the goal is to learn how to use the scikit-learn tool to fit a simple model.

## Follow Along

The following steps show the process you will follow with the scikit-learn API (application programming interface: the way we interact with the many tools in the scikit-learn library) to fit many different types of models. The model type, model complexity, data type, and size of the data set don't affect the following steps:

### Scikit-learn API

* Load the data set and "clean" it if needed (not specifically part of scikit-learn, but important to do first)
* Create features and target(s) from the data
* Import the model and instantiate the class
* Fit the model
* Apply your model; use the model to predict new values

In the above process, the data loading, cleaning, and preparing for modeling can be done all at once before any of the other steps. Or creating features and target(s) can be completed right before you fit the model; the important point is to have the data in the correct form *before* fitting.
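
The five steps above can be sketched end to end on a tiny dataset. This is only an illustration: the DataFrame, its column names (`feature`, `target`), and the values are hypothetical, chosen so that the target follows `1 + 2 * feature` exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Load (here: build) the data and clean it by dropping rows with NaNs
df = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, np.nan, 5.0],
    "target": [3.0, 5.0, 7.0, 9.0, 11.0],
}).dropna()

# 2. Create the features (two-dimensional) and the target (one-dimensional)
X = df[["feature"]]
y = df["target"]

# 3. Import the model (done above) and instantiate the class
model = LinearRegression()

# 4. Fit the model
model.fit(X, y)

# 5. Apply the model: predict the target for a new feature value
X_new = pd.DataFrame({"feature": [4.0]})
prediction = model.predict(X_new)
print(prediction)
```

The rest of this section walks through the same steps one at a time with the penguin data.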

### Load Data

As in the previous objective, we'll use the penguin data set available from the `seaborn` library. When we import `seaborn`, all of the associated datasets are included; we don't need to download any other data or load files from our local system.

We also need to make sure we remove any NaN values now; the model-fitting algorithm requires clean input, that is, data free of missing values.

In [1]:
# Import pandas, NumPy, and seaborn
import pandas as pd
import numpy as np
import seaborn as sns

# Load the data into a DataFrame
penguins = sns.load_dataset("penguins")

# Print the shape of the DataFrame
print('Shape of the dataset (before removing NaNs): ', penguins.shape)

# Drop NaNs
penguins.dropna(inplace=True)

# Print the shape of the DataFrame
print('Shape of the dataset (after removing NaNs): ', penguins.shape)

# Display the first five rows
display(penguins.head())

Shape of the dataset (before removing NaNs):  (344, 7)
Shape of the dataset (after removing NaNs):  (333, 7)


| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
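
Before dropping rows, it can help to see where the missing values are. The sketch below uses a hypothetical four-row miniature of the penguins table (the values and gaps are made up for illustration) to count NaNs per column and to drop rows only when the columns we plan to model are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the penguins table with a few gaps
df = pd.DataFrame({
    "flipper_length_mm": [181.0, np.nan, 195.0, 193.0],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
    "sex": ["Male", "Female", None, "Female"],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop a row only if one of the modeling columns is missing;
# this returns a new DataFrame and leaves df unchanged
cleaned = df.dropna(subset=["flipper_length_mm", "body_mass_g"])
print(cleaned.shape)
```

Restricting `dropna` with `subset` is a design choice: it keeps rows that are only missing values in columns the model never sees.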


### Representing Data

In the previous unit, we discussed the tidy data format and how organizing our data this way makes it easier to clean and format it for machine learning. Now we get to see the benefit of having data in this format as we prepare to use it with scikit-learn.

In the above table, we have 333 rows of data (after filtering), where each row is an observation of a single penguin. The rows are sometimes called samples; think of each row as a sample of observations about a penguin. We also have seven columns that correspond to the information that describes each sample. Here, we are describing the species, home island, and physical characteristics of our samples (penguins). Features are often numeric (`body_mass_g`, `flipper_length_mm`) but not always; the `species`, `island`, and `sex` columns all contain strings.
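
One way to check which columns are numeric and which are not is pandas' `select_dtypes`. The sketch below runs it on a small hypothetical table with the same kinds of columns as the penguins data; the two rows are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row table mixing string and numeric columns
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "flipper_length_mm": [181.0, 217.0],
    "body_mass_g": [3750.0, 5700.0],
})

# Split the column names by data type
numeric_columns = df.select_dtypes(include="number").columns.tolist()
other_columns = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_columns)
print("Non-numeric:", other_columns)
```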

### Feature Matrix and Target Array

Before we can input our data into a scikit-learn model, we have to separate it into a *feature matrix* and *target array*. First, we need to decide what we're trying to predict from this dataset.

We've already fit a simple linear regression model to the `flipper_length_mm` and `body_mass_g` variables, so we'll continue with those two variables. We want to use the flipper length to predict the weight of the penguin. The terminology we use is as follows: our feature (flipper length) will be used to predict the target (weight).

For this simple linear regression example, we are only predicting one target variable; the target is an array with a length equal to the number of rows in the feature matrix.

![features](https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_2/sprint_1/mod1_obj2_features.gif)

In the following code, we'll create our feature matrix and target array. It's customary to use a capital `X` for the features and a lowercase `y` for the target array. We'll add the name `penguins` to our variable names to make it easier to remember the data we are fitting.

In [2]:
# Create the feature matrix
X_penguins = penguins['flipper_length_mm']
print("The shape of the feature matrix: ", X_penguins.shape)

# Create the target array
y_penguins = penguins['body_mass_g']
print("The shape of the target array: ", y_penguins.shape)

The shape of the feature matrix:  (333,)
The shape of the target array:  (333,)


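We can see that these are both one-dimensional arrays of 333 elements, which is what we expected. The target array is ready to use; the feature matrix needs one more adjustment (a second dimension) before fitting, which we handle in the "Arrange data" step.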
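
For comparison, here is a sketch of what the shapes look like when there is more than one feature. It uses the first three rows of the table displayed earlier and adds `bill_length_mm` as a second feature purely for illustration; this lesson itself continues with a single feature.

```python
import pandas as pd

# First three rows of the penguins table shown earlier
df = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "body_mass_g": [3750.0, 3800.0, 3250.0],
})

# Two features: selecting a list of columns gives a two-dimensional matrix
X = df[["flipper_length_mm", "bill_length_mm"]]

# One target: selecting a single column gives a one-dimensional array
y = df["body_mass_g"]

print("Feature matrix shape:", X.shape)
print("Target array shape:", y.shape)
```

The number of rows in the feature matrix always matches the length of the target array, regardless of how many feature columns there are.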

### Scikit-learn Predictor

The scikit-learn predictor is the object that learns from the data. There is a standard process to follow to use the predictor object. Our example will be for a linear regression, but we can apply these steps to any of the scikit-learn predictors (classification, regression, and clustering).

#### 1. Import the model class

We already know we're trying to fit a linear model to our data, so we'll use a regression algorithm.

`from sklearn.linear_model import LinearRegression`

#### 2. Instantiate the class

The term *instantiate* is a fancy way to say you are creating an instance of a class. We imported the predictor class but that's it; we need to create an instance of that class to actually do anything. With this step, we also choose the *hyperparameters*: the settings we pass to the model before fitting, as opposed to the coefficients the model learns from the data.

To create an instance of the `LinearRegression` predictor, we use the following code:

In [3]:
# Import the predictor class
from sklearn.linear_model import LinearRegression

# Instantiate the class (with default parameters)
model = LinearRegression()

# Display the model parameters
model

LinearRegression()

The `LinearRegression()` predictor has a small number of parameters that we can set (for example, `fit_intercept`); the exact list depends on the scikit-learn version. For now, let's use the default settings, but you can read more about the parameters [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression).
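
As a small illustration of setting a hyperparameter at instantiation, the sketch below creates one model with the defaults and one with `fit_intercept=False`, then reads the setting back with `get_params()`. The variable names are only for this example.

```python
from sklearn.linear_model import LinearRegression

# Default settings: the model estimates an intercept
default_model = LinearRegression()

# Override one hyperparameter: force the line through the origin
no_intercept_model = LinearRegression(fit_intercept=False)

# get_params() returns the hyperparameters as a dictionary
print(default_model.get_params()["fit_intercept"])
print(no_intercept_model.get_params()["fit_intercept"])
```

Hyperparameters are fixed when the instance is created; nothing is learned from data until `fit()` is called.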

#### Arrange data

Part of this step was already completed above, but all predictors require the feature matrix to be two-dimensional. We can reshape the one-dimensional array by adding a new axis with `np.newaxis` (an indexing object, not a function).

In [4]:
# Display the shape of X_penguins
print('Original features matrix: ', X_penguins.shape)

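# Convert to a NumPy array, then add a new axis to create a column vector
# (recent pandas versions no longer allow this indexing directly on a Series)
X_penguins_2D = X_penguins.to_numpy()[:, np.newaxis]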
print(X_penguins_2D.shape)

Original features matrix:  (333,)
(333, 1)


Our feature matrix is now a two-dimensional array and we can move to the next step.
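
There is more than one way to get a single feature into two-dimensional form. The sketch below compares three options on a hypothetical three-row column: `np.newaxis`, `reshape(-1, 1)`, and selecting the column with double brackets so that pandas returns a one-column DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical single-feature data
df = pd.DataFrame({"flipper_length_mm": [181.0, 186.0, 195.0]})

# Option 1: convert to NumPy and add a new axis
via_newaxis = df["flipper_length_mm"].to_numpy()[:, np.newaxis]

# Option 2: convert to NumPy and reshape; -1 lets NumPy infer the row count
via_reshape = df["flipper_length_mm"].to_numpy().reshape(-1, 1)

# Option 3: select with a list of column names to keep a 2D DataFrame
via_double_brackets = df[["flipper_length_mm"]]

print(via_newaxis.shape, via_reshape.shape, via_double_brackets.shape)
```

All three produce the same values in the same shape; the third keeps the column name, which scikit-learn can then record as a feature name.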

#### Fit the model

We have a model predictor imported, the class instantiated, and our data in the correct format. The next step is to fit our model! When we call the `fit()` method, the results are stored in model-specific attributes.

In [5]:
# Fit the model
model.fit(X_penguins_2D, y_penguins)

LinearRegression()
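
The output above is the model itself: `fit()` returns the fitted estimator. The sketch below, on hypothetical data, checks this and shows that the learned attribute `coef_` only exists after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying on the line y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

model = LinearRegression()

# Before fitting, the learned attributes do not exist yet
fitted_before = hasattr(model, "coef_")

# fit() stores the results on the model and returns the model itself
returned = model.fit(X, y)
fitted_after = hasattr(model, "coef_")

print(fitted_before, fitted_after, returned is model)
```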

#### Look at the coefficients

As reviewed above, the coefficients describe the slope and intercept. We access these coefficients with the following attributes (scikit-learn marks attributes learned during fitting with a trailing underscore):

In [6]:
# Slope (also called the model coefficient)
print(model.coef_)

# Intercept
print(model.intercept_)

# In equation form
print(f'\nbody_mass_g = {model.coef_[0]} x flipper_length_mm + ({model.intercept_})')

[50.15326594]
-5872.092682842825

body_mass_g = 50.15326594224113 x flipper_length_mm + (-5872.092682842825)
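
The last step of the process, applying the model, uses the `predict()` method. The sketch below is a self-contained illustration that fits on only the five rows displayed earlier, so its coefficients will differ from the full-dataset values printed above. The new flipper lengths (185 mm and 200 mm) are hypothetical inputs. It also confirms that `predict()` gives the same result as plugging the values into the line equation by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Flipper lengths (mm) and body masses (g) from the five rows shown earlier
X = np.array([[181.0], [186.0], [195.0], [193.0], [190.0]])
y = np.array([3750.0, 3800.0, 3250.0, 3450.0, 3650.0])

model = LinearRegression().fit(X, y)

# Hypothetical new flipper lengths to predict body mass for
X_new = np.array([[185.0], [200.0]])
predicted = model.predict(X_new)

# The same prediction computed by hand: slope * x + intercept
by_hand = model.coef_[0] * X_new[:, 0] + model.intercept_

print(predicted)
print(np.allclose(predicted, by_hand))
```

Note that `predict()` expects the same two-dimensional layout as `fit()`: one row per sample and one column per feature.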


## Challenge

In the original data set, there are other physical measurements of the penguins that we can use in a linear regression. The bill length and depth measure the characteristics of a penguin's beak. Using two of these other measurements, fit a linear regression model to see how closely the two variables follow a linear relationship.

Follow these suggested steps:

* Load the data set and remove the NaN values.
* Choose two variables to explore and plot them to check the relationship visually.
* Create the feature matrix and target array.
* Import the LinearRegression class and instantiate the model.
* Fit the model and then print out the coefficients.

## Additional Resources

* [Glossary of Common Terms and API Elements](https://scikit-learn.org/stable/glossary.html#general-concepts)
* [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression)