<img src="./pictures/logo-insa.jpg" style="float:right; max-width: 120px; display: inline" alt="INSA" />

# Estimation models - Linear regression

*Written by Marc Budinger, INSA Toulouse*

This section presents estimation models that calculate the component characteristics required for component selection without requiring a detailed design. Linear regressions of catalog data are particularly suitable for this purpose.

In [2]:
import QCM_widget as QCM

#### Teaching video

This video is going to introduce you to linear regression of technical statistical data.

- Video V2.1 - Estimation models with linear regression [English](https://youtu.be/3sB0omXZCmY)

#### Keywords on linear regression

Main steps:

- Statistical data: catalog data
- Primary parameters choice: [correlation analysis](http://benalexkeen.com/correlation-in-python/)

![Correlation coefficient](./pictures/LinearRegression_EquationCorrelation.png)

- Model choice: polynomial, transformation

![Polynomial response surface](./pictures/LinearRegression_EquationRSM.png)

- Model fitting: least squares error, [linear regression](https://realpython.com/linear-regression-in-python/#simple-linear-regression)

$Y = X.\theta + \varepsilon$  
$\varepsilon^t\varepsilon$ minimum for $\theta = (X^tX)^{-1}X^tY$  
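
The least squares solution above can be sketched with NumPy. This is a minimal illustration, not the course's own code: the data are synthetic (an assumed line $y = 1 + 2x$ with small noise), and the names `X`, `theta` and `theta_lstsq` are chosen here only to mirror the symbols of the formula.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not catalog values):
# y = 1 + 2*x plus a small amount of Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

# Design matrix X: a column of ones (intercept) and a column of x values.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: theta = (X^t X)^-1 X^t Y,
# solved as a linear system instead of forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's dedicated least squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("theta (normal equations):", theta)
print("theta (lstsq):           ", theta_lstsq)
```

Both approaches should return nearly the same intercept and slope. For ill-conditioned problems, `np.linalg.lstsq` is generally the more robust choice.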


### Quizzes on Linear Regression


#### Correlation

q1) Which correlation best describes the scatterplot?
![Linear regression](./pictures/LinearRegression_ScatterPlot.png)

Reminder 1:
A scatter plot, scatterplot, or scattergraph is a type of mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

Reminder 2:
The correlation coefficient is a measure of the linear dependence between two variables X and Y, giving a value between +1 and −1. 
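
As a small illustration of Reminder 2, the sketch below computes the correlation coefficient by hand and compares it with `np.corrcoef`. It assumes the usual Pearson definition; the helper `pearson_r` and the sample arrays are hypothetical and are not taken from the quiz data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Made-up samples for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0      # perfectly increasing linear relation
y_down = -3.0 * x + 4.0   # perfectly decreasing linear relation
y_noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])  # increasing trend with scatter

for name, y in [("up", y_up), ("down", y_down), ("noisy", y_noisy)]:
    print(name, pearson_r(x, y), np.corrcoef(x, y)[0, 1])
```

The two exact lines give the extreme values of the coefficient, while the scattered sample falls strictly between them.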


In [9]:
QCM.quiz(1,"./quiz/RegressionQuiz.xlsx")

Answer options: -0.7, -0.3, 0, 0.3, 0.7

q2) You are conducting a correlation analysis between a response variable and an explanatory variable. Your analysis produces a significant positive correlation between the two variables. Which of the following conclusions is the most reasonable?

a. Change in the explanatory variable causes change in the response variable.  
b. Change in the explanatory variable is associated with change in the response variable.  
c. Change in the response variable causes change in the explanatory variable.  
d. All of (a)-(c) are equally reasonable conclusions.  


In [6]:
QCM.quiz(2,"./quiz/RegressionQuiz.xlsx")


We have a database on SKF bearings. We want to estimate the mass of these bearings. A correlation analysis is carried out from these data. 

![Bearing correlation matrix](./pictures/LinearRegression_CorrelationBearing.png)

q3) Select the variable that seems the most related to the Mass.

In [8]:
QCM.quiz(3,"./quiz/RegressionQuiz.xlsx")

Answer options: Diameter D, Static load C0, Dynamic load C, …

#### Linear regression

q1) The line y = 4 + 2x has been proposed as a line of best fit for the following five sets of bivariate data.
For which data set is this line the best fit?

![Best fit](./pictures/LinearRegression_bestFit.png)



In [11]:
QCM.quiz(4,"./quiz/RegressionQuiz.xlsx")

Answer options: a, b, c, d

q2) What is the greatest concern about the regression below?  
![Regression analysis](./pictures/LinearRegression_ValidationRegression.png)

a. It has a small slope.  
b. It has a high R².  
c. The investigator should not be using a first order linear regression on these data.  
d. The residuals are too large.  
e. The regression line does not pass through the origin.  
  
Reminder: The coefficient of determination R² expresses the percentage of total variation explained by the regression.
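
The reminder above can be made concrete with a short sketch that computes R² as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the five data points are hypothetical, chosen only for illustration.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)      # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Made-up data for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# First-order fit; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, 1)
y_pred = intercept + slope * x

r2 = r_squared(y, y_pred)
print("R² =", r2)
```

For a first-order fit with an intercept, this value equals the square of the correlation coefficient between x and y. A value close to 1 only means that most of the variation is explained; it does not by itself validate the choice of model, so the residuals should still be inspected.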


In [12]:
QCM.quiz(5,"./quiz/RegressionQuiz.xlsx")


q3) Here is the scatter plot of mass versus static load. 

![Regression analysis](./pictures/LinearRegression_BearingRegression.png)

Some points have been eliminated. Why?  
a. There are too many points to make the calculations.  
b. It removes items that do not work well.  
c. Some components are dominated by others (it is possible to find lighter components for a larger static load).  



In [14]:
QCM.quiz(6,"./quiz/RegressionQuiz.xlsx")


q3) The regression equation is:

a. $M=0.0012$  
b. $M=0.0012+0.011 \cdot C_0$  
c. $M=0.0012+0.011 \cdot C_0+0.00021 \cdot C_0^2$  


In [16]:
QCM.quiz(7,"./quiz/RegressionQuiz.xlsx")


The data have been transformed with the log10 function. 

![Regression analysis](./pictures/LinearRegression_BearingTransformation.png)

q4) The regression equation is:

a. $M=1.299 \cdot C_0-2.1398$

b. $M=-2.1398 \cdot C_0^{1.299}$

c. $M=10^{-2.1398} \cdot C_0^{1.299}$


In [13]:
QCM.quiz(8,"./quiz/RegressionQuiz.xlsx")
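
To go further on the log10 transformation used for the last regression, the sketch below fits a straight line to log-transformed data and converts the result back to the original variables. The data are synthetic (an assumed power law with factor 0.5 and exponent 1.5, not the SKF catalog values), and `mass_estimate` is a hypothetical helper name.

```python
import numpy as np

# Synthetic power-law data (assumed for illustration): M = 0.5 * C0**1.5
C0 = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0])
M = 0.5 * C0**1.5

# Linear regression on the transformed data:
# log10(M) = a * log10(C0) + b
a, b = np.polyfit(np.log10(C0), np.log10(M), 1)

# Back-transformation: M = 10**b * C0**a
k = 10.0**b

def mass_estimate(c0):
    """Estimate M from C0 with the fitted power law."""
    return k * c0**a

print("exponent a =", a)
print("factor k = 10**b =", k)
print("estimate for C0 = 4:", mass_estimate(4.0))
```

The slope of the line in log-log coordinates becomes the exponent of the power law, and the intercept becomes the power of ten that multiplies it.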
