**Linear regression = just a fancy way of saying "(straight) line of best fit"**

In Statistical Learning Theory, the main idea is to construct a model to draw certain conclusions from data, and next, to use this model to make predictions.

In the context of Statistical learning, there are two main types of data:
    
* Dependent variables: data that cannot be controlled directly (other names: outcome variables, target variables, response variables)
  * Examples: weight

* Independent variables: data that can be controlled directly (other names: predictor variables, input variables, explanatory variables, features)
  * Examples: age, time

A straight line can be written as:

𝑦=𝑚𝑥+𝑐
 
or, alternatively

𝑦=𝛽0+𝛽1𝑥
 
𝛽0  has the same role as  𝑐  in the first expression and denotes the intercept with the y-axis.  
𝛽1  has the same role as  𝑚  in the first expression and denotes the slope of the line. 

A simple linear regression model therefore has four key components:

1) A dependent variable that needs to be estimated and predicted (here:  𝑦 )

2) An independent variable, the input variable (here:  𝑥 )

3) The slope which determines the angle of the line. Here, the slope is denoted as  𝑚 , or  𝛽1 .

4) The intercept which is the constant determining the value of  𝑦  when  𝑥  is 0. We denoted the intercept here as  𝑐  or  𝛽0 .
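The four components above can be sketched in a few lines of NumPy. The function name `predict_line` and the numeric values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def predict_line(x, beta_0, beta_1):
    """Evaluate y = beta_0 + beta_1 * x for a scalar or an array of x values."""
    return beta_0 + beta_1 * np.asarray(x, dtype=float)

# Illustrative values only: intercept 2.0 and slope 0.5
x = np.array([0.0, 1.0, 2.0, 4.0])
y = predict_line(x, beta_0=2.0, beta_1=0.5)

# At x = 0 the line returns the intercept; each unit step in x adds the slope
print(y)
```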

**Model validation**

Model validation is the process of checking how well a model performs on data it was not trained on. It helps detect overfitting and gives a better sense of how well the model generalizes.

Here is how we perform validation, in its simplest form:

* Split the data into two parts with a 70/30, 80/20 or similar split

* Use the larger part for training, so the model learns from it. This set of data is normally called the Training Data

* Use the smaller part for testing the model. This data is not used during the model learning process and is used only for testing the performance of a learned model. This dataset is called the Testing Data
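The split described above can be sketched with plain NumPy. The helper name `train_test_split_simple`, the 80/20 ratio, the random seed and the toy data are all assumptions made for illustration. Shuffling before splitting is one common choice, not something the steps above prescribe.

```python
import numpy as np

def train_test_split_simple(x, y, test_fraction=0.2, seed=0):
    """Shuffle the observations, then hold out `test_fraction` of them for testing."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # random order of row indices
    n_test = int(round(len(x) * test_fraction))
    test_idx = indices[:n_test]                # smaller part: Testing Data
    train_idx = indices[n_test:]               # larger part: Training Data
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): 10 observations lying on a known line
x = np.arange(10, dtype=float)
y = 3.0 + 2.0 * x
x_train, x_test, y_train, y_test = train_test_split_simple(x, y, test_fraction=0.2)
print(len(x_train), len(x_test))
```

The model would then be fitted on `x_train`, `y_train` only and scored on `x_test`, `y_test`.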

**Model loss**

A loss function evaluates how well your model represents the relationship between data variables.

If the model is unable to identify the underlying relationship between the independent and dependent variable(s), the loss function will output a very high number. 

The loss for each data point is essentially the vertical distance between that point and the line. These individual losses are combined to calculate the overall model loss.
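As a rough sketch of this idea, the snippet below computes the vertical offsets for two candidate lines and combines them by summing their squares. Summing squares is an assumption here (it is one common way to combine the offsets); the function name `squared_loss` and the data are made up for illustration.

```python
import numpy as np

def squared_loss(x, y, intercept, slope):
    """Return the per-point vertical offsets and their sum of squares for a given line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (intercept + slope * x)    # vertical distance of each point from the line
    return residuals, np.sum(residuals ** 2)

# Hypothetical data that roughly follows y = 2x
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]

_, good_loss = squared_loss(x, y, intercept=0.0, slope=2.0)   # line close to the data
_, poor_loss = squared_loss(x, y, intercept=5.0, slope=-1.0)  # line that misses the trend
print(good_loss, poor_loss)
```

The line that misses the underlying relationship produces a much larger loss than the line that follows it.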

**Estimation**

When we draw our regression line based on these few green dots, we use the following notation:

𝑦̂ =𝑚̂ 𝑥+𝑐̂ 
 
or
𝑦̂ =𝛽̂ 0+𝛽̂ 1𝑥
 
As you can see, you're using a "hat" notation, which indicates that we are working with estimates.

When trying to draw a "best-fit line", you're estimating the most appropriate values for the intercept and the slope, hence  𝑐̂   / 𝛽̂ 0  and  𝑚̂   / 𝛽̂ 1 .

Next, when you use the line to predict new values of  𝑦  given  𝑥 , the prediction is an approximation based on the estimated parameter values, hence  𝑦̂   instead of  𝑦 .  𝑦̂   lies ON the regression line, while  𝑦  is the observed y-value for each of the green dots in the plot above.

The error, or vertical offset, between the line and the actual observed values is denoted by the red vertical lines in the plot above. Mathematically, the vertical offset can be written as  ∣𝑦̂ −𝑦∣ .

**Calculating best fit**

Before we calculate the best-fit line, we have to make sure that we have calculated the following measures for variables X and Y:

1) The mean of the X values  (𝑋̄) 

2) The mean of the Y values  (𝑌̄) 

3) The standard deviation of the X values  (𝑆𝑋) 

4) The standard deviation of the Y values  (𝑆𝑌) 

5) The correlation between X and Y (often denoted by the Greek letter "Rho" or  𝜌 ; this is the Pearson correlation)

**Calculating Slope**

With the above ingredients in hand, we can calculate the slope (shown as  𝑏  below) of the best-fit line, using the formula:

![image.png](attachment:image.png)
 
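This formula follows from the least-squares method, which picks the line that minimizes the sum of the squared vertical offsets between the data points and the line.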
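Assuming the formula in the image is the usual one built from the ingredients listed earlier, $\hat{m} = \rho \, S_Y / S_X$, the slope could be computed as in the sketch below. The function name `calc_slope` and the sample data are hypothetical.

```python
import numpy as np

def calc_slope(xs, ys):
    """Slope of the best-fit line as correlation * (std of Y / std of X)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    rho = np.corrcoef(xs, ys)[0, 1]   # Pearson correlation between X and Y
    s_x = np.std(xs, ddof=1)          # sample standard deviation of X
    s_y = np.std(ys, ddof=1)          # sample standard deviation of Y
    return rho * s_y / s_x

# Hypothetical data for illustration
xs = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
ys = np.array([7, 7, 8, 9, 9, 10, 10, 11, 11, 12], dtype=float)
slope = calc_slope(xs, ys)
print(slope)
```

The choice of `ddof` does not affect the result as long as it is the same for both standard deviations, because the ratio cancels it out.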

**Calculating Intercept**

Now that we have the slope value  (𝑚̂ ) , we can put it back into our formula  (𝑦̂ =𝑚̂ 𝑥+𝑐̂ )  to calculate the intercept. The idea is that the best-fit line passes through the point of means  (𝑋̄, 𝑌̄) , so the intercept is:

![image.png](attachment:image.png)
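A minimal sketch of this step, assuming the image shows the standard relation $\hat{c} = \bar{Y} - \hat{m}\bar{X}$ (the least-squares line passes through the point of means). The function name `best_fit` and the data are illustrative.

```python
import numpy as np

def best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # slope from correlation and standard deviations
    m_hat = np.corrcoef(xs, ys)[0, 1] * np.std(ys, ddof=1) / np.std(xs, ddof=1)
    # intercept: plug the means and the slope back into y = m*x + c
    c_hat = ys.mean() - m_hat * xs.mean()
    return m_hat, c_hat

# Hypothetical data for illustration
xs = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
ys = np.array([7, 7, 8, 9, 9, 10, 10, 11, 11, 12], dtype=float)
m_hat, c_hat = best_fit(xs, ys)
print(m_hat, c_hat)
```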

**R-Squared**

The  𝑅²  or coefficient of determination is a statistical measure used to assess the goodness of fit of a regression model.

![image.png](attachment:image.png)

𝑆𝑆𝑅𝐸𝑆  (also called RSS) is the residual sum of squares of our regression model, also known as  𝑆𝑆𝐸  (Sum of Squared Errors). It is the sum, over all observations, of the squared difference between  𝑦  and  𝑦̂ . For the one highlighted observation in our graph above, the residual is denoted by the red arrow. This part of the variation is not explained by our model.

𝑆𝑆𝑇𝑂𝑇  (also called TSS) is the total sum of squares. It is the sum, over all observations, of the squared difference between  𝑦  and the mean  𝑦̄ . For the one highlighted observation in our graph above, this difference is denoted by the orange arrow.
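Written out with these two quantities, and assuming the image shows the standard definition, the coefficient of determination is:

$$R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Here the sums run over all observations $i$, $\hat{y}_i$ is the value predicted by the line for observation $i$, and $\bar{y}$ is the mean of the observed $y$ values.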

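**For a least-squares line evaluated on the data it was fitted to, R-Squared takes a value between 0 and 1, where values closer to 0 represent a poor fit and values closer to 1 represent an (almost) perfect fit. On other data, or for a line that was not fitted by least squares, it can be negative.**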

## To calculate R-Squared:
   
   1) Calculate the **Squared Error.** Remember that the Squared Error is the Residual Sum of Squares: the sum of the squared differences between the actual data points and a given line.
   ______________________________ 
   EX:

        import numpy as np

        def sq_err(y_real, y_predicted):
            # y_real and y_predicted are expected to be NumPy arrays of equal length
            squared_error = np.sum((y_real - y_predicted)**2)
            return squared_error
   ______________________________ 

   2) Build a function that uses the sq_err() function above to calculate the value of R-Squared. First calculate SSE, then calculate SST in the same way (using the mean of $y$ instead of the regression line), and finally plug both values into the R-Squared formula. Perform the following tasks:

       1) Calculate the mean of y_real  
       2) Calculate SSE using sq_err()  
           *(RSS = sum of squared residuals (SSR) = sum of squared estimate of errors (SSE))*  
       3) Calculate SST  
       4) Calculate R-Squared from the above values using the given formula  
   ______________________________ 
 
        # Calculate y_mean, the squared error for the regression line and for the mean line, then R-Squared
        def r_squared(y_real, y_predicted):
            # numerator: squared error of the regression line (SSE)
            num = sq_err(y_real, y_predicted)
            # denominator: squared error of the mean line (SST)
            denom = np.sum((y_real - y_real.mean())**2)
            return 1 - num/denom

        # OR, using SSexp / SStot
        def r_squared2(y_real, y_predicted):
            # explained sum of squares: predictions compared with the mean of y
            ssexp = np.sum((y_predicted - y_real.mean())**2)
            denom = np.sum((y_real - y_real.mean())**2)
            # matches r_squared() only for a least-squares line with an intercept
            return ssexp / denom
   ______________________________ 
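To see the pieces working together, here is a small self-contained sketch. It repeats the functions above, fits a line with `np.polyfit` (used here only as a convenient stand-in for the hand calculation of slope and intercept), and compares the two R-Squared variants. The data are made up for illustration.

```python
import numpy as np

def sq_err(y_real, y_predicted):
    # Sum of squared differences between actual and predicted values
    return np.sum((y_real - y_predicted) ** 2)

def r_squared(y_real, y_predicted):
    # 1 - SSE / SST
    return 1 - sq_err(y_real, y_predicted) / sq_err(y_real, y_real.mean())

def r_squared2(y_real, y_predicted):
    # SSexp / SST
    return np.sum((y_predicted - y_real.mean()) ** 2) / sq_err(y_real, y_real.mean())

# Hypothetical data for illustration
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
Y = np.array([7, 7, 8, 9, 9, 10, 10, 11, 11, 12], dtype=float)

# Fit a least-squares line and predict on the same data
m_hat, c_hat = np.polyfit(X, Y, 1)
Y_pred = m_hat * X + c_hat

r2_a = r_squared(Y, Y_pred)
r2_b = r_squared2(Y, Y_pred)
print(r2_a, r2_b)
```

Both variants agree here because the line was fitted by least squares (with an intercept) on the same data it is scored on.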
