# Linear regression

Hello again, ready to learn about linear regression?

Linear regression is perhaps one of the most basic algorithms in machine learning. As the name indicates, it helps us solve regression tasks, that is, predicting a continuous numerical value.
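
As a quick reminder of what the model fits, here is the usual formulation. The symbols are my own notation for illustration, not something taken from the notebook: $w_0$ stands for the intercept and $w_1, \dots, w_n$ for the coefficients of the $n$ features.

$$
  \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n
$$

Training chooses the values of $w_0, \dots, w_n$ that minimize the sum of squared differences between the predictions $\hat{y}$ and the true values $y$ (ordinary least squares).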

The `LinearRegression` class is also one of the simplest to use, since it takes very few arguments.

Let's create a synthetic dataset to practice our regression:

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

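# 100 samples with 10 numeric features; noise adds Gaussian noise to the target
X, y = make_regression(n_samples=100, n_features=10, bias=2.0, noise=5.0, random_state=42)
# Default split: 75% train, 25% test (pass random_state here for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(X, y)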


This is a generic dataset with 10 features: an array of numerical data ready to be used for regression. Remember that you usually have to do feature engineering first, but since we've already covered that in the book, I'm going to skip it here.

In [None]:
X[:10]

The first step is to import the class from the `linear_model` module:

In [None]:
from sklearn.linear_model import LinearRegression


Create an instance:

In [None]:
linear_regression = LinearRegression()


And call the `fit` method to train the model on the training data:

In [None]:
linear_regression.fit(X_train, y_train)


Finally, simply call the `predict` method, passing it the test data:

In [None]:
y_pred = linear_regression.predict(X_test)


`y_pred` now holds the predicted values, which we can use to evaluate the model's performance:

In [None]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, y_pred)
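
The mean absolute error is only one way to score a regression. As a rough sketch, the same predictions can also be scored with RMSE and R², both available through `sklearn.metrics`. The block rebuilds the data so it runs on its own; the `random_state` on the split is my addition, used only to make the numbers repeatable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same synthetic data as above; random_state on the split is an assumption
X, y = make_regression(n_samples=100, n_features=10, bias=2.0, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

mae = mean_absolute_error(y_test, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # penalizes large errors more
r2 = r2_score(y_test, y_pred)                       # 1.0 would be a perfect fit

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2:   {r2:.4f}")
```

MAE and RMSE are in the same units as the target, while R² is unitless, which makes it easier to compare across datasets.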


Or if you prefer, you can see this more graphically with this function that I created for the book:

In [None]:
from utils import plot_regression_results

plot_regression_results(y_test, y_pred)


## Arguments of `LinearRegression`

As I mentioned, linear regression in scikit-learn is quite simple, with few arguments that have little influence on the model's behavior. We have:

 - `fit_intercept` (boolean, optional): indicates whether to calculate the intercept (also called "y-intercept"). By default, the value is `True`.
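 - `normalize` (boolean, optional): indicated whether the predictor variables (also called "features") should be normalized before fitting. This argument was deprecated in scikit-learn 1.0 and removed in 1.2; in current versions, scale the features yourself beforehand (for example, with `StandardScaler`).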
 - `copy_X` (boolean, optional): indicates whether to make a copy of the predictor variable matrix (`X`) before fitting. By default, the value is `True`.
 - `n_jobs` (integer, optional): indicates the number of parallel jobs to use for fitting the model. By default, the value is `None`, which means a single job is used.
 - `positive` (boolean, optional): indicates whether to force the coefficients to be non-negative. By default, the value is `False`.

The most important argument is `fit_intercept`, which determines whether to fit a model that includes an "intercept". The intercept is the value of the function when all predictor variables are equal to zero. If `fit_intercept=True`, the model will include an independent term. If `fit_intercept=False`, the model will not include an independent term and will pass through the origin.
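
To make the effect of `fit_intercept` concrete, here is a minimal sketch on a tiny made-up dataset (the values are hypothetical, chosen so that the true line is y = 3x + 10).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 3x + 10
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * x.ravel() + 10.0

with_intercept = LinearRegression(fit_intercept=True).fit(x, y)
without_intercept = LinearRegression(fit_intercept=False).fit(x, y)

# With an intercept, the model recovers the slope and the independent term
print(with_intercept.coef_, with_intercept.intercept_)

# Without an intercept, the line is forced through the origin,
# so the slope has to absorb the missing independent term
print(without_intercept.coef_, without_intercept.intercept_)
print(without_intercept.predict([[0.0]]))
```

Unless you know that the target really is zero when all features are zero, leaving `fit_intercept=True` is usually the safer choice.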

The other arguments have minor effects on the fitted model. `normalize` was used to normalize the predictor variables, which could be useful if the variables had different scales and you hadn't normalized them previously, but it was deprecated in scikit-learn 1.0 and removed in 1.2, so in current versions you scale the features yourself before fitting. `n_jobs` controls the number of parallel jobs to use for fitting the model, which can be useful if multiple CPU cores are available, although it only helps in some special cases, such as sufficiently large problems with several targets. Finally, `positive` forces the coefficients to be non-negative, which can be useful when negative coefficients would make no sense for the problem.
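
If your features sit on very different scales, one common option is to scale them inside a pipeline with `StandardScaler`. The sketch below is illustrative only: the column rescaling is a made-up way to create mismatched scales. It also shows `positive=True` in action.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
# Hypothetical step: put the three columns on very different scales
X = X * np.array([1.0, 100.0, 0.01])

plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# For plain least squares, scaling changes the coefficients but not the predictions
same_predictions = np.allclose(plain.predict(X), scaled.predict(X))
print("Same predictions with and without scaling:", same_predictions)
print("Coefficients on raw features:   ", plain.coef_)
print("Coefficients on scaled features:", scaled[-1].coef_)

# positive=True constrains every coefficient to be >= 0
non_negative = LinearRegression(positive=True).fit(X, y)
print("Non-negative coefficients:", non_negative.coef_)
```

Scaling mostly matters here when you want to compare coefficients with each other, since the predictions of an unregularized linear regression stay the same.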

## Attributes of `LinearRegression`

The class also has some interesting attributes that might be useful in some situations. The most useful ones are `coef_` and `intercept_`:

The `coef_` attribute is an array containing the regression coefficients, one for each input variable in the dataset (one-dimensional when there is a single target, as in our examples). The order of the coefficients corresponds to the order of the predictor variables in the input data matrix used to train the model. This can be used to determine, to a limited extent, the importance or weight that each variable has within the model. Keep in mind that the comparison only makes sense when the variables are on comparable scales.
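
One way to get a feel for `coef_` is to fit a model on data whose true coefficients are known. The sketch below relies on the `coef=True` option of `make_regression`, which also returns the coefficients used to generate the data; setting `noise=0.0` is my choice here so that the fitted values match the true ones.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# coef=True also returns the ground-truth coefficients used to build y
X, y, true_coef = make_regression(
    n_samples=100, n_features=10, bias=2.0, noise=0.0, coef=True, random_state=42
)

model = LinearRegression().fit(X, y)

# One coefficient per feature, in the same order as the columns of X
print(model.coef_.shape)
for i, (fitted, true) in enumerate(zip(model.coef_, true_coef)):
    print(f"feature {i}: fitted={fitted:8.3f}  true={true:8.3f}")

# The intercept matches the bias passed to make_regression
print(model.intercept_)
```

With noisy data the fitted coefficients will only approximate the true ones, which is the situation you face with real datasets.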

The `intercept_` attribute is a scalar that represents the value of the dependent variable when all predictor variables are zero. For example, in a linear regression model that predicts house prices based on their size in square meters, the independent term could represent the base price of a house. In this case, the value of the independent term could be used to determine if our algorithm is correctly estimating the price of a "base" house.

Let's look at a simple example – a small dataset representing house prices in relation to their dimensions in square meters.

In [None]:
from utils import load_custom_houses
square_meters, price = load_custom_houses()


We can train a linear regression:

In [None]:
lr = LinearRegression()
lr.fit(square_meters, price)


And we can also review its coefficients:

In [None]:
lr.coef_


And the y-intercept:

In [None]:
lr.intercept_


With a single feature, the fitted model is a straight line:

$$
  y = x_1 \times {lr.coef\_}_{0} + lr.intercept\_
$$

Thanks to this, we know that for each additional square meter, the price of the house increases by 373.95.

You can even calculate the total price by hand, multiplying the area by the coefficient and adding the intercept:

In [None]:
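# Use a new name so we don't overwrite the `y` from the synthetic dataset
estimated_price = square_meters[0] * lr.coef_[0] + lr.intercept_
estimated_price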


$$
  \text{price} = \text{square meters} \times \text{price per square meter} + \text{base price}
$$

$$
  1281537.73 \approx 1810.12 \times 373.95 + 604641.30
$$

In [None]:
lr.predict(square_meters[:1])
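
The equivalence between the manual calculation and `predict` can be checked on any dataset. Since `load_custom_houses` belongs to the book's resources, this sketch uses made-up house data instead; the price of 400 per square meter, the base price of 50,000 and the noise level are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the houses dataset
rng = np.random.default_rng(0)
square_meters = rng.uniform(50, 300, size=(50, 1))
price = 400.0 * square_meters.ravel() + 50_000.0 + rng.normal(0, 5_000, size=50)

lr = LinearRegression().fit(square_meters, price)

# Manual prediction for the first house: area times coefficient, plus intercept
manual = square_meters[:1] @ lr.coef_ + lr.intercept_
predicted = lr.predict(square_meters[:1])

print(manual, predicted)
print("Same result:", np.allclose(manual, predicted))
```

Using the matrix product `@` instead of a plain multiplication means the same line keeps working when the model has more than one feature.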


And there you have it: this is linear regression and, believe it or not, it's perhaps one of the most widely used models in the industry. In practice, of course, it's used with many more variables and a lot of feature engineering.

I hope the things we covered here are clear to you. Remember that the notebook and slides are in the book's resources.

I'll see you in the next chapter to learn more about supervised machine learning.