# **INTRODUCTION**

# A- Creating models for data

A data scientist's main job is to analyze data and build models that extract results from it. Data scientists often use simple statistical models for their data, rather than machine learning models like neural networks. This is because they tend to work with smaller datasets than machine learning engineers, so statistical models can quickly yield good results.

The scikit-learn library provides many statistical models for linear regression. It also provides a few good models for classifying data, which will be introduced in later chapters.

When creating these models, data scientists need to figure out the optimal hyperparameters to use. Hyperparameters are values that we set when creating a model, e.g. certain constant coefficients used in the model's calculations. We'll talk more about hyperparameter tuning, the process of finding the optimal hyperparameter settings, in later chapters.
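
As a rough illustration of what setting a hyperparameter looks like in scikit-learn, the sketch below creates a Ridge model (covered in a later section) with its alpha hyperparameter set when the model is created. The value 0.5 is an arbitrary choice for illustration, not a recommended setting.

```python
from sklearn import linear_model

# alpha is a hyperparameter: we choose it when creating the model,
# before the model has seen any data. 0.5 is an arbitrary illustrative value.
ridge = linear_model.Ridge(alpha=0.5)

# get_params() returns the hyperparameter values the model was created with
params = ridge.get_params()
print('alpha: {}'.format(params['alpha']))
```

Changing alpha changes how the model behaves once it is fit, which is why finding good hyperparameter values matters.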

# **LINEAR REGRESSION**

# A- What is Linear Regression?

One of the main objectives in both machine learning and data science is finding an equation or distribution that best fits a given dataset. This is known as data modeling, where we create a model that uses the dataset's features as independent variables to predict output values for some dependent variable (with minimal error). However, it is incredibly difficult to find an optimal model for most datasets, given the amount of noise (i.e., random errors or fluctuations) in real-world data.

Since finding an optimal model for a dataset is difficult, we instead try to find a good approximating distribution. In many cases, a linear model (a linear combination of the dataset's features) can approximate the data well. The term linear regression refers to using a linear model to represent the relationship between a set of independent variables and a dependent variable.

    y = ax_1 + bx_2 + cx_3 + d

The above formula is an example of a linear model, which produces output y (the dependent variable) from a linear combination of the independent variables x_1, x_2, x_3. The coefficients a, b, c and the intercept d determine the model's fit.
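
To make the formula concrete, here is a minimal sketch that evaluates a linear model with NumPy for a single observation. The coefficient, intercept, and feature values are made up for illustration and do not come from any real dataset.

```python
import numpy as np

# Hypothetical coefficients a, b, c and intercept d (made-up values)
coefs = np.array([2.0, -1.0, 0.5])
intercept = 3.0

# One observation with features x_1, x_2, x_3 (made-up values)
x = np.array([1.0, 4.0, 6.0])

# y = a*x_1 + b*x_2 + c*x_3 + d
y = np.dot(coefs, x) + intercept
print(y)
```

The dot product computes the linear combination of the features, and adding the intercept gives the model's output.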

# B- Basic Linear Regression

The simplest form of linear regression is called least squares regression. This strategy produces a regression model, which is a linear combination of the independent variables, that minimizes the sum of squared residuals between the model's predictions and actual values for the dependent variable.
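
To show what minimizing the sum of squared residuals means in practice, the sketch below solves a small least squares problem with NumPy's np.linalg.lstsq and checks that nudging the solution increases the error. The data values are made up for illustration, and sum_squared_residuals is a helper name introduced here, not part of any library.

```python
import numpy as np

# Made-up data: 5 observations, 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 11.1])

# Append a column of ones so the last weight acts as the intercept
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

def sum_squared_residuals(weights):
    # Residuals are the differences between actual and predicted values
    residuals = y - X_aug.dot(weights)
    return np.sum(residuals ** 2)

# Least squares solution: the weights that minimize the sum of squared residuals
weights, _, _, _ = np.linalg.lstsq(X_aug, y, rcond=None)
best_error = sum_squared_residuals(weights)

# Moving away from the solution should not reduce the error
nudged_error = sum_squared_residuals(weights + np.array([0.01, 0.0, 0.0]))
print('best: {}, nudged: {}'.format(best_error, nudged_error))
```

The first two entries of weights are the feature coefficients and the last is the intercept.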

In scikit-learn, the least squares regression model is implemented with the LinearRegression object, which is part of the linear_model module in sklearn. The object has a fit method, which takes an input dataset of features (independent variables) and an array of labels (the dependent variable), with one label per data observation (row of the dataset).

The code below demonstrates how to fit a LinearRegression model to a dataset of 5 different pizzas (pizza_data) and corresponding pizza prices. The first column of pizza_data represents the number of calories and the second column represents net weight (in grams).

In [1]:
# Imports to work with
import numpy as np
import pandas as pd

In [2]:
# define pizza_data and pizza_prices

pizza_data = np.array([[2100,  800],
                       [2500,  850],
                       [1800,  760],
                       [2000,  800],
                       [2300,  810]])
pizza_prices = np.array([10.99, 12.5, 9.99, 10.99, 11.99])

print('{}\n'.format(repr(pizza_data)))
print('{}\n'.format(repr(pizza_prices)))

from sklearn import linear_model

# Create a least squares regression model and fit it to the pizza data
reg = linear_model.LinearRegression()
reg.fit(pizza_data, pizza_prices)

array([[2100,  800],
       [2500,  850],
       [1800,  760],
       [2000,  800],
       [2300,  810]])

array([10.99, 12.5 ,  9.99, 10.99, 11.99])



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [4]:
# Predict prices for two new pizzas (calories, net weight in grams)
new_pizzas = np.array([[2000, 820],
                       [2200, 830]])
price_predicts = reg.predict(new_pizzas)
print('{}\n'.format(repr(price_predicts)))

print('Coefficients: {}\n'.format(repr(reg.coef_)))
print('Intercept: {}\n'.format(reg.intercept_))

# Using previously defined pizza_data, pizza_prices
# score returns the R2 (coefficient of determination) on the given data
r2 = reg.score(pizza_data, pizza_prices)
print('R2: {}\n'.format(r2))

array([10.86599206, 11.55111111])

Coefficients: array([0.00330913, 0.00232937])

Intercept: 2.3376587301587346

R2: 0.9758349388652625
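
The score method used above returns the coefficient of determination (R2): one minus the ratio of the residual sum of squares to the total sum of squares. As a sketch, the block below recomputes the predictions and the R2 value by hand from coef_ and intercept_, using the same pizza data. The names manual_preds and manual_r2 are introduced here for illustration.

```python
import numpy as np
from sklearn import linear_model

pizza_data = np.array([[2100, 800],
                       [2500, 850],
                       [1800, 760],
                       [2000, 800],
                       [2300, 810]])
pizza_prices = np.array([10.99, 12.5, 9.99, 10.99, 11.99])

reg = linear_model.LinearRegression()
reg.fit(pizza_data, pizza_prices)

# A prediction is the linear combination of the features plus the intercept
manual_preds = pizza_data.dot(reg.coef_) + reg.intercept_

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((pizza_prices - manual_preds) ** 2)
ss_tot = np.sum((pizza_prices - pizza_prices.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print('Manual R2: {}'.format(manual_r2))
print('reg.score: {}'.format(reg.score(pizza_data, pizza_prices)))
```

An R2 close to 1 means the model explains most of the variation in the data it was scored on. Here that is the training data, so the value says little about how well the model predicts prices for new pizzas.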



# **RIDGE REGRESSION**