# Linear Regression GP 2

***

## Related Lessons

 - [Multiple Linear Regression](https://github.com/learn-co-curriculum/dsc-multiple-linear-regression) 

 - [Dealing with Categorical Variables Lab](https://github.com/learn-co-curriculum/dsc-dealing-with-categorical-variables-lab/tree/solution)

 - [Multicollinearity of Features Lab](https://github.com/learn-co-curriculum/dsc-multicollinearity-of-features-lab/tree/solution)

 - [OLS Statsmodels Lab](https://github.com/learn-co-curriculum/dsc-ols-statsmodels-lab/tree/solution)

 - [Complete Regression Lab](https://github.com/learn-co-curriculum/dsc-complete-regression-lab/tree/solution)
 
 - [Log Transformations](https://github.com/learn-co-curriculum/dsc-log-transformation)
 
 - [Feature Scaling and Normalization Lab](https://github.com/learn-co-curriculum/dsc-feature-scaling-and-normalization-lab)
 
 - Topics
    - How to interpret the results from a simple linear regression model and discuss their real-world implications.
    - What multiple linear regression is and why it is useful for solving real-world problems.
    - How to run a multiple linear regression model in Python using statsmodels.

## Guided Practice 2 Goals

Performing simple linear regression and understanding evaluation metrics.

- How to interpret the results from a simple linear regression model and discuss their real-world implications
- What multiple linear regression is and why it is useful for solving real-world problems
- How to run a multiple linear regression model in Python using statsmodels and scikit-learn

### First Goal

- To understand how well our model is able to **predict** future conditions, trends, or **values**.

![linear_model_equation_1.png](attachment:linear_model_equation_1.png)

Question 1: Does our regression line fit the data well?

We make predictions on train and test data and measure the error to understand how well our model will generalize when making new predictions.

##### Scope

* We analyze the performance between the train and test predictions using various metrics:
    - MAE
    - MSE
    - RMSE
    - $R^2$
    - Adjusted $R^2$

### Second Goal

* To determine and **measure** the **relationships** between the dependent and independent variables.

Question 2: Are the coefficients statistically significant?

### Third Goal

* To **understand** how one variable **changes** when another changes

Question 3: What is the economic impact of the estimated coefficients?

##### Scope

* The coefficients and p-values are able to inform us of the influence that an independent variable has on the dependent variable.

    - Coefficients
    - $p$-value

In [1]:
# import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# https://www.kaggle.com/datasets/mirichoi0218/insurance
# create DataFrame
insurance_df = pd.read_csv('insurance.csv')

### Data Description

| Column     | Description |
|------------|:------------|
| `age`      | **Age of the policy holder** |
| `sex`      | **Policy holder's gender: female or male** |
| `bmi`      | **Body mass index (kg/m²): an objective index of body weight relative to height, indicating weights that are relatively high or low for a given height; ideally 18.5 to 24.9** |
| `children` | **Number of children covered by health insurance (number of dependents)** |
| `smoker`   | **True if the policy holder currently smokes, False otherwise** |
| `region`   | **The beneficiary's residential area in the US: northeast, southeast, southwest, or northwest** |
| `charges`  | **Individual medical costs billed by health insurance, in dollars** |

In [3]:
## Comments & Questions

In [4]:
# create colormap


# create a dataframe with the first 10 rows
# of original DataFrame


# view DataFrame


In [5]:
# check info on columns/data size


In [6]:
# check descriptive statistics


## Categorical Variables

To begin our analysis and modeling, we will need to manipulate our categorical variables so that their information can be interpreted by the model.

First we need to identify the categorical variables in our dataset. We can take a look at the variables that have `object` or `str` datatype in our `info()` call, or plot our data to begin:

- `sex`
- `smoker`
- `region`

While there are a few ways of working with categorical variables, in this notebook we will Label Encode our variables. This means we will change their values from the original `str` to `int` values that represent the original values. This will allow our model to perform the necessary tasks for linear regression.
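As a minimal sketch of what label encoding does, the example below applies scikit-learn's `LabelEncoder` to a small made-up DataFrame. The rows and the names `toy_df`, `encoded_df`, and `encoders` are illustrative assumptions, not values read from `insurance.csv`. `LabelEncoder` assigns integers to the sorted unique values, so the integers carry no real ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data standing in for the categorical columns (illustrative values only)
toy_df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
})

encoded_df = toy_df.copy()
encoders = {}
for col in ["sex", "smoker"]:
    le = LabelEncoder()
    # fit learns the unique labels, transform maps each label to an integer
    encoded_df[col] = le.fit_transform(toy_df[col])
    encoders[col] = le

# classes_ lists the original labels in the order of their integer codes
sex_classes = list(encoders["sex"].classes_)
print(encoded_df)
print(sex_classes)
```

Keeping each fitted encoder makes it possible to map the integers back to the original labels later with `inverse_transform`.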

In [7]:
# Label Encode object variables
from sklearn.preprocessing import LabelEncoder

# instantiate a label encoder


# fit the label encoder to the sex variable
# while dropping any duplicates
 

# create new sex variable with label encoded values


# fit the label encoder to the smoker variable
# while dropping any duplicates
 

# create new smoker variable with label encoded values


# fit the label encoder to the region variable
# while dropping any duplicates



Another way to transform categorical variables is the `pd.Categorical()` function, which changes the datatype of a variable to the dedicated `categorical` datatype. From the documentation:

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Categorical.html

"Categoricals can only take on only a limited, and usually fixed, number of possible values (categories). In contrast to statistical categorical variables, a Categorical might have an order, but numerical operations (additions, divisions, …) are not possible."

### pd.Categorical()
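A short sketch of the categorical datatype, using a few made-up region values rather than the real column. It shows the two pieces a categorical stores: the unique categories and an integer code for each row.

```python
import pandas as pd

# illustrative region values (not read from insurance.csv)
regions = pd.Series(["southwest", "southeast", "northwest", "southeast"])

# convert to the pandas categorical datatype
region_cat = pd.Categorical(regions)

categories = list(region_cat.categories)  # unique labels, sorted
codes = list(region_cat.codes)            # position of each value in categories
print(categories)
print(codes)
```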

### StatsModels vs. Sci-Kit Learn

StatsModels is useful for linear regression oriented toward statistical analysis.

SKLearn is useful for linear regression oriented toward machine learning and optimization.

## StatsModels Multiple Linear Regression

In [8]:
# import statsmodels library


# create predictors


# create model intercept


# fit model to data


Simple Linear Regression

![linear_equation.png](attachment:linear_equation.png)

Multivariate Linear Regression Equation

$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n$$

### Coefficients

In [9]:
# view model coefficients


### Tornado Diagram

The only trick to getting a tornado diagram is that the coefficients have to be sorted in descending order by the absolute value of the coefficient.

In [10]:
# create a tornado diagram
coeff = model.params
coeff = coeff.iloc[(coeff.abs()*-1.0).argsort()]
sns.barplot(x=coeff.values, y=coeff.index, orient='h');


With the coefficients we can explore questions 2 & 3:

- To determine and measure the relationships between the dependent and independent variables
- To understand how one variable changes when another changes

***Coefficients***: The coefficients of the features also describe the mathematical relationship between each independent variable and the dependent variable, which in this case is the medical costs billed by health insurance. 

The coefficient value helps us understand the marginal effect of the features on the dependent variable. **Given a one-unit change in the feature variable when the other features are unchanged, how much does the dependent variable change?**

***
What is the economic impact of the estimated coefficient of `age`?

The coefficients also **inform us whether there is a positive or negative relationship between each feature and the target**. For our notebook, we will assess the coefficients of our features to ensure we have features that are relevant to the charge for insurance.

For this we can call the `.summary()` method which will display several results from our model:

### Model Summary

In [None]:
# view results of model


In [None]:
#dir(model)

### No. Observations

The number of observations within the data.

In [None]:
# view dimensions of the DataFrame


### DF residuals

Degrees of Freedom is the number of values in the final calculation of a statistic that are free to vary

* Degrees of Freedom, calculated by n-k-1 where:

    - $n$ = number of observations = 1338
    - $k$ = number of predicting variables = 6
    
1338-6-1 = 1331

### DF Model

How many independent variables we have in our model: 6

### Covariance Type

Covariance : Nonrobust

Recall that covariance is a measure of how two variables are linked in a positive or negative manner. A robust covariance estimate is calculated in a way that reduces the influence of outliers or non-constant error variance on the standard errors, which is not the case here: `nonrobust` means the standard errors assume homoscedastic errors.

Robust covariance methods are motivated by the fact that outliers inflate the estimates and make the spread of the data appear larger than it is.

### R-squared 

The **percentage of variation in the dependent variable that is explained by the independent variables**. It lies between 0 and 1.

$$SS_{residual} = \sum (y - \hat{y})^2 $$

is the sum of the squared differences between $y$ and $\hat y$ (predicted y)

$$SS_{total} = \sum (y - \bar{y})^2 $$

is the sum of the squared differences between $y$ and $\bar y$ (mean of y)

So that

$$R^2 = 1- \dfrac{SS_{residual}}{SS_{total}}$$
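The formula can be checked by hand with a few numbers. The sketch below implements it directly in NumPy; the helper name `r_squared` and the four data points are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residual = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_total = np.sum((y_true - y_true.mean()) ** 2)     # error of predicting the mean
    return 1.0 - ss_residual / ss_total

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# SS_residual = 0.10 and SS_total = 5.0, so R^2 is about 0.98
r2 = r_squared(y_true, y_pred)
print(r2)
```

Predicting the mean for every observation gives $R^2 = 0$, and perfect predictions give $R^2 = 1$.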


In [None]:
# percentage of variance in the y variable explained by the x variable


This means there is **75.1% less variation around the regression line than around the mean**, or the relationship between the dependent variable and the independent variables explains 75.1% of the variation in the data.

We can also say that 24.9% of the variation of the `charges` variable within the data is not explained by our model.

### Adjusted R-squared

***Adjusted $R^2$***: The Adjusted $R^2$ is **a key metric for evaluation of a multivariate linear regression model**, as **it accounts for the number of predictors in a model** when calculating the model's goodness-of-fit. It is a more accurate measure for assessing if our model explains changes in the dependent variable. 

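$$R^2_{adj}= 1-(1-R^2)\dfrac{n-1}{n-p-1}$$

where $n$ is the number of observations and $p$ is the number of predictors (called $k$ in the degrees of freedom calculation above).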
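As a worked example of this formula, the sketch below plugs in the figures quoted elsewhere in this notebook: an $R^2$ of 0.751, 1338 observations, and 6 predictors. The function name `adjusted_r_squared` is a hypothetical helper, not a statsmodels call.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# values quoted in this notebook: R^2 = 0.751, n = 1338, p = 6
adj_r2 = adjusted_r_squared(r2=0.751, n=1338, p=6)
print(round(adj_r2, 3))  # 0.75
```

With many observations and few predictors the penalty is small, which is why the adjusted value is only slightly below $R^2$.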

In [None]:
# display adjusted r-squared


An Adjusted R-squared value of 0.750 can be described conceptually as: 

> ***75.0% of the variations in dependent variable $y$ are explained by the independent variables $x$ in our model.***

High Adjusted R-squared doesn’t mean that your model is good. We need to check the residual plot when fitting a regression model.

A residual plot with no clear pattern indicates that normality of the residuals is a reasonable approximation.

### F-statistic

R-squared measures the strength of the relationship between our model and the dependent variable. 

However, it is not a formal test for the relationship. The F-test of overall significance is the hypothesis test for this relationship. 

The F-test of overall significance informs us **whether our linear regression model provides a better fit to the data than a model that contains no independent variables, or the intercept-only model.** 

For the intercept-only model, all of the model’s predictions equal the mean of the dependent variable. 

If the overall F-test is statistically significant, our model’s predictions are an improvement over using the mean.
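The overall F-statistic can be written in terms of $R^2$ as $F = \dfrac{R^2 / k}{(1 - R^2)/(n - k - 1)}$. The sketch below evaluates it with the rounded figures quoted in this notebook ($R^2 = 0.751$, $n = 1338$, $k = 6$). Because $R^2$ is rounded, the result will differ slightly from the value in the model summary; the function name is a hypothetical helper.

```python
def overall_f_statistic(r2, n, k):
    """F-statistic comparing the fitted model to the intercept-only model."""
    explained_per_df = r2 / k                        # explained variation per predictor
    unexplained_per_df = (1.0 - r2) / (n - k - 1)    # unexplained variation per residual df
    return explained_per_df / unexplained_per_df

f_stat = overall_f_statistic(r2=0.751, n=1338, k=6)
print(round(f_stat, 2))
```

A large ratio means the predictors explain far more variation per degree of freedom than is left unexplained.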

In [None]:
# f-statistic to compare p-value


The F-test for overall significance has two hypotheses:

**_Null Hypothesis_** $H_{0}$ : The intercept-only model fits the data as well as our model.

**_Alternative Hypothesis_** $H_{1}$ : The model fits the data better than the intercept-only model.

The null hypothesis should contain an equality (=, ≤, ≥):

 - Average NBA Player's Height = 2.0m (6ft7in)
 - $H_{0}$ : $\mu = 2.0$

The alternate hypothesis should not have an equality (≠, <, >):

 - Average NBA Player's Height ≠ 2.0m (6ft7in)
 - $H_{1}$ : $\mu \neq 2.0$

### Prob (F-statistic)

The Prob (F-statistic) or p-value for the f-statistic informs us of the **likelihood that we would observe the values of our data** or values at least as extreme as the results actually observed **by random chance if there were no relationship between the features of our model and the** dependent variable, in this case the **medical costs billed by health insurance**.

In [None]:
# p-value for the likelihood our model 
# fits the data better than the mean


Here we have a p-value of effectively 0 and a quite large F-statistic, which suggests that **we can reject the null hypothesis** and conclude that the model fits the data better than the intercept-only model.

Compare the p-value for the F-test to our **significance level of 0.05**. If the p-value is less than the significance level, our sample data provides sufficient evidence to conclude that our regression model fits the data better than the model with no independent variables.

We can say that **there is a linear relationship between the features of our model and the medical costs billed by health insurance** with **95% confidence**.

It is also important to note that **we consider all of the features together for the f-statistic**.

### Std Error

The standard error can be thought of as a measure of the precision with which the regression coefficient is measured. The standard error of the coefficient is always positive. The smaller the standard error, the more precise the estimate.

Here we can see that the standard error of the `age` coefficient is smaller than that of `bmi`. Therefore, our model was able to estimate the coefficient for `age` with greater precision.

If we divide the coefficient by the std error, we calculate the t-value.
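The sketch below shows where the standard errors and t-values come from, using plain NumPy on synthetic data. The data and variable names are assumptions for illustration; `x2` is generated with no true effect so that its t-value can be compared with that of `x1`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# synthetic target: only x1 has a true effect
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)

# design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# residual variance uses n - k - 1 degrees of freedom (k = 2 predictors)
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof

# standard errors are the square roots of the diagonal of sigma^2 (X'X)^-1
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))

# t-value = coefficient / standard error
t_values = beta / std_err
print(std_err)
print(t_values)
```

The coefficient with a real effect should have a t-value far from 0, while the t-value for `x2` should stay comparatively close to 0.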

In [None]:
# return the standard error for 
# all coefficients in the equation


### T - test

To infer if a given feature is significant or relevant to the target variable, we **perform a t-test**. Here, instead of **considering all of the features together**, we perform a t-test on each feature's coefficient one by one.

**_Null Hypothesis_** $H_{0}$ : The feature's coefficient is equal to 0, so the feature has **no relationship** with the dependent variable once the other features are accounted for.

**_Alternative Hypothesis_** $H_{1}$ : The feature's coefficient is not equal to 0, so the feature **has a relationship** with the dependent variable.

In [None]:
# display t-values for the features


The farther the t-value is away from 0, the greater the chances that we reject the null hypothesis and accept the alternate hypothesis for that feature. 

With a t-value of `21.65`, we are likely to reject the null hypothesis and conclude that the feature's coefficient is different from 0.

Here we see that the feature with a t-value closest to 0 is `sex`. This feature may not have a statistically significant relationship to the medical costs billed by health insurance.

### P>|t|  or p-value 

p-values for the t-test

Again we can compare the p-values, or the **likelihood that we would observe our data, or data at least as extreme, by random chance if a feature had no relationship to the `charges` variable**. When we compare the p-value of each feature individually with a significance threshold of 0.05, if:

 - $p < 0.05$ : Reject the null hypothesis that there is no relationship between the feature and the medical costs billed by health insurance.
 

 - $p \geq 0.05$ : Fail to reject the null hypothesis. We do not have enough evidence of a relationship between the feature and the medical costs billed by health insurance.

In [None]:
# check t-statistic probability score for constant


Here we could consider **all of our features sharing a statistically significant relationship with the medical costs billed by health insurance** with **95% confidence** except for `sex`.

***
Based on the p-value of the t-test, which feature has the least statistical influence on the dependent variable?

### Mean Absolute Error

Mean Absolute Error MAE: Represents average error

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_{i} - \hat y_{i}|$$
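A minimal sketch of the MAE formula on four made-up values (the numbers are illustrative, not model output):

```python
import numpy as np

# illustrative true and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

# residuals are the differences between true and predicted values
residuals = y_true - y_pred

# absolute errors are 10, 10, 30, 0, so the mean is 12.5
mae = np.mean(np.abs(residuals))
print(mae)
```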

In [None]:
# return residuals from model


# view residuals


In [None]:
# return predicted values from model


# view predicted values


In [None]:
# true y value


In [None]:
# return mean absolute error of model


# view mean absolute error


The MAE informs us that **on average, our model has an error of** `4172.49` USD when **predicting the charge of insurance for a patient**.

### Mean Squared Error

The Mean Squared Error, or MSE tells us how close a regression line is to a set of true points. This is achieved by squaring the errors. It contrasts to MAE because it gives more weight to larger distances between the points to the regression line.

**MSE is more useful if we are concerned about large errors whose consequences are much larger than equivalent smaller ones**.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2$$

In [None]:
# return Mean Squared Error of model


# view Mean Squared Error


Our MSE is useful if we are comparing the predictive performance of two or more models. Its values are less interpretable than MAE and RMSE because they are not in the same units as our dependent variable.

Here we are effectively saying our model has an error of `36719766.29` USD squared...

### Root Mean Squared Error

Root Mean Square Error: Interpretable MSE in units of y. 

RMSE is more sensitive to outliers, and penalizes large errors more than MAE because errors are squared.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2}$$
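To see how squaring changes the picture, the sketch below computes MSE, RMSE, and MAE for the same four made-up values. The single larger error of 30 pulls RMSE above MAE.

```python
import numpy as np

# illustrative true and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])
errors = y_true - y_pred

mse = np.mean(errors ** 2)       # squared errors 100, 100, 900, 0 -> mean 275
rmse = np.sqrt(mse)              # back in the units of y, about 16.58
mae = np.mean(np.abs(errors))    # 12.5, for comparison

print(mse, rmse, mae)
```

RMSE can never be smaller than MAE; the size of the gap reflects how unevenly the errors are spread.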

In [None]:
# return Root Mean Squared Error


# view Root Mean Squared Error


Here our RMSE informs us that the typical error between the true value and the value our model predicts is `6059.68` USD when predicting the insurance charges for a policy holder in our training data.

**This is larger than our MAE of `4172.49`. RMSE is never smaller than MAE, and a large gap between the two can indicate that outliers are present in our data**.

## Assumptions of Linear Regression

At this point, it is important to keep the assumptions of linear regression in mind as we model our data. If these assumptions are not met, we cannot be confident in the accuracy of our model. The assumptions we will explore in this notebook are:

- Linearity: there is a linear relationship between the independent and dependent variables
- Homoscedasticity: the variance for the residual is the same for any value of x
- Independence: observations are independent of one another
- Normality: residuals are normally distributed

In [None]:
# https://www.kaggle.com/code/shrutimechlearn/step-by-step-assumptions-linear-regression/notebook

### Linearity

Pairplots are useful tools to check for linearity between the dependent and independent variables. 

- Ask yourself "What pattern do I see?". If a straight line appears to fit the data well, this is a great sign that there is a linear relationship between the independent and dependent variables.

In [None]:
# view scatter plots and distribution plots for variables


### Homoscedasticity

Simply put, we want to find homoscedasticity. This means that the residuals have equal or almost equal variance over the entire regression model.

If there are any patterns when we plot residuals against predicted values, such as residuals being larger when the predicted value is higher, it is a sign of heteroscedasticity.

By plotting the error terms against the predicted values, we can check that there is no pattern in the error terms.

In [None]:
# plot the residuals against predicted values to 
# check for homoscedasticity


There appears to be an interesting pattern when we plot our residuals against the fitted values. Whereas we were hoping to see randomness, there are clear groupings, seemingly in the four quadrants of the plot. Let's further explore this assumption.

### Goldfeld Quandt Test

Another method to check if the variance of the residuals is the same for any value of x is the Goldfeld-Quandt test. The Goldfeld-Quandt test is a hypothesis test with the following hypotheses:

**_Null Hypothesis_** $H_{0}$ : The variance of the residuals of the model is the same for any value of x (Homoscedastic)

**_Alternative Hypothesis_** $H_{1}$ : The variance of the residuals of the model is not the same for any value of x (Heteroscedastic)

In [None]:
# run Goldfeld Quandt Test


Here our p-value is `0.716`, much greater than `0.05`. This means we are not able to reject the null hypothesis, which states that our error terms are homoscedastic.

This result is consistent with the assumption of homoscedasticity of our residuals.

### Normality

We want to check that the residuals are normally distributed.

The central limit theorem says that as the sample size increases, the sampling distribution of the mean tends toward a normal distribution. A skew is also visible in the plot of the residuals.

In [None]:
# https://www.sfu.ca/~mjbrydon/tutorials/BAinPy/10_multiple_regression.html

In [None]:
# check for normality among the residuals


In [None]:
# import stats module from scipy

# create mean and standard deviation


Let's re-plot the residuals as a kernel density plot and overlay the normal curve with the same mean and standard deviation:

In [None]:

# plot the residuals


# plot corresponding normal curve


In [None]:
# view boxplot of residuals


#### Q-Q plot

The Q-Q plot, or quantile-quantile plot, is a visual check we can use to confirm whether a given sample follows a particular distribution, in this case the Normal Distribution.

Here we compare our distribution of the residuals to the Normal distribution.

The x-axis of a Q-Q plot represents the quantiles of standard normal distribution.
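The quantities behind a Q-Q plot can be computed without drawing anything. The sketch below uses `scipy.stats.probplot` on a synthetic, normally distributed sample standing in for the residuals (an assumption for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# synthetic stand-in for model residuals, drawn from a normal distribution
sample_residuals = rng.normal(loc=0.0, scale=1.0, size=500)

# theoretical_q: quantiles of the standard normal (x-axis)
# ordered_values: the sorted sample values (y-axis)
# r: correlation of the points with the fitted straight line
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample_residuals, dist="norm"
)
print(slope, intercept, r)
```

When the sample is close to normal, the points fall near a straight line and `r` is close to 1. Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) draws the plot itself.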



In [None]:
#https://data.library.virginia.edu/understanding-q-q-plots/

In [None]:
# create a Q-Q plot of the residuals


Whenever we encounter issues with our Q-Q plots, we can attempt to remove the outliers from our data to help our residuals better approximate a normal distribution.
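One common way to flag outliers is the interquartile range (IQR) rule: values more than 1.5 times the IQR below the first quartile or above the third quartile are treated as outliers. The sketch below applies the rule to six made-up values; the 1.5 multiplier is the conventional choice and is an assumption here.

```python
import numpy as np

# illustrative values with one obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# first and third quartiles
q1, q3 = np.percentile(values, [25, 75])

# IQR
iqr = q3 - q1

# lower and upper bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# keep only the values inside the bounds
filtered = values[(values >= lower_bound) & (values <= upper_bound)]
print(lower_bound, upper_bound)
print(filtered)
```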

In [None]:

# IQR

# Upper bound

# Lower bound


In [None]:
# check charges variable for outliers


This boxplot confirms our work: there are no more outliers present in the `charges` variable.

In [None]:
# create predictors


# create model intercept


# fit model to data


In [None]:
# return residuals from model


In [None]:
# view boxplot of residuals


There are still many outliers in our residuals, and this box plot suggests that the residuals are not symmetrically distributed around 0.

In [None]:
# view model summary


In [None]:
# create a Q-Q plot of the residuals


## Feature Scaling

Feature scaling is a method used to normalize our data. By fitting our data to a common range of values, we can better align our model with the assumptions of linear regression.

## Log Transform

Transforming these initial features to have certain properties such as normality can improve the regression algorithm's predictive performance.

In fact, you'll often find that having the data more normally distributed will benefit your model and model performance in general. So while normality of the predictors is not a mandatory assumption, having (approximately) normal features may be helpful for your model!

When we take a look at the distributions of our data in the pairplot above, we can see several variables that do not have a normal distribution, some being `age`, `children`, and `charges`.

Let's perform a log transform on the `charges` variable to normalize it and hopefully improve our performance.

In [None]:
# create non normal variables list


# use for loop to apply log transform on variable


In [None]:
# view new distribution of data


The `charges` distribution appears to be more normal. Let's fit a linear regression model to this data to determine if our performance improves with this change.

In [None]:
# create predictors


# create model intercept


# fit model to data


\begin{equation}
\log(y_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + e_i
\end{equation}

In [None]:
# view model summary


Our $R^2$ has increased to `0.767`, great!

However look at the change in our coefficients! 

## Log Transformation Interpretation

In [None]:
# https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/

Now that we have transformed our dependent variable, we need to interpret our model coefficients differently.

When we perform a log transform on our dependent variable, we transform the relationship between it and the features into one that is multiplicative rather than additive, so we need to express the marginal effect in terms of percent.

When we only log-transform the dependent variable, we exponentiate the coefficient, subtract one from this number, and multiply by 100. 

This will yield the percent increase (or decrease) in the response for every one-unit increase in the independent variable. 

So here we can use the `age` variable as an example, with a coefficient of `0.0347`.

Recall exponentiation is the inverse of the logarithm function, so we exponentiate the variable's coefficient, subtract `1`, and multiply by `100`:
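The arithmetic for the `age` coefficient quoted above can be sketched as follows (the variable names are illustrative):

```python
import numpy as np

# coefficient for age quoted in the text above
age_coefficient = 0.0347

# exponentiate, subtract 1, and multiply by 100 to get a percent change
percent_change = (np.exp(age_coefficient) - 1) * 100
print(round(percent_change, 2))  # 3.53
```

For small coefficients the result is close to the coefficient times 100, because $e^{b} - 1 \approx b$ when $b$ is near 0.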

In [None]:
# calculate percent change in dependent variable
# based on independent variable


Here we can see that for every one-unit increase in `age`, our dependent variable `charges` increases by about `3.53`%.

## Normalization

##### Min-Max Scaling

This method of feature scaling typically rescales the range of the features so that the largest value would be converted to `1`, and the lowest value would be converted to `0`.

It is possible to rescale the values in the dataset between any two values.

Outliers have a very high impact on min-max normalization, because the minimum and maximum values define the scale.

The formula we follow when performing min-max scaling to transform x is:

$$x' = \dfrac{x - \min(x)}{\max(x)-\min(x)}$$
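A minimal NumPy sketch of this formula, applied to four made-up ages. The helper name `min_max_scale` is hypothetical; scikit-learn's `MinMaxScaler` applies the same calculation column by column.

```python
import numpy as np

def min_max_scale(x):
    """Rescale values so the minimum maps to 0 and the maximum maps to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# illustrative ages (not read from insurance.csv)
ages = np.array([18.0, 30.0, 45.0, 64.0])
scaled = min_max_scale(ages)
print(scaled)
```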

In [None]:
# view DataFrame columns


In [None]:
# import preprocessing module from sklearn


# Use min max scaling on dataset


In [None]:
# view scaled values


In [None]:
# check shape of array


In [None]:
# create a copy of insurance dataframe


# drop old columns


In [None]:
# create new DataFrame with minmax variables
# add NumPy matrix as new columns in DataFrame


In [None]:
# view DataFrame


In [None]:
# view DataFrame


In [None]:
# rename new columns
# insurance_df2 = insurance_df2.rename(columns={"0": "age", "1": "bmi", "2": "children"})
# insurance_df2.columns = ["sex", "smoker", "region", "charges", "age", "bmi", "children"]

# view DataFrame


In [None]:
# check for missing values


In [None]:
# have to find out what is causing missing values

In [None]:
# check missing count


In [None]:
# drop missing values


In [None]:
# create predictors


# create model intercept


# fit model to data


In [None]:
# view model summary


## Standardization

When we perform standardization on our data, we change the mean and variance so that the mean is equal to 0 and the standard deviation is equal to 1.

This is done by taking each value of a variable, subtracting the mean of the variable, and dividing by the standard deviation of the variable:

$$x' = \dfrac{x - \bar x}{\sigma}$$

$x'$ will have mean $\mu = 0$ and standard deviation $\sigma = 1$.
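The sketch below applies this formula with NumPy to five made-up BMI values and confirms the resulting mean and standard deviation. The helper name `standardize` is hypothetical; it uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

def standardize(x):
    """Convert values to z-scores: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# illustrative BMI values (not read from insurance.csv)
bmi = np.array([22.0, 27.5, 31.0, 35.5, 40.0])
z = standardize(bmi)

print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```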

Note that standardization does not make data *more* normal; it will just change the mean and the standard deviation!


Standardization may be used when the data follow a Gaussian distribution, while normalization works well with non-Gaussian distributions.

Standardized regression coefficients provide an easy way to estimate effect size that is independent of units.

In Python:

Transform the Y and each column of the X matrices into standardized values (z-scores) with mean = 0 and standard deviation = 1.0.

Run the regression with the standardized inputs. This provides standardized regression coefficients.

Extract and display the standardized coefficients.

## Standard Scaling

Now let's scale the columns within the dataframe that are not categorical, and use them to feed into a linear model to test its performance.

In [None]:
# check variable types


In [None]:
# Scale features in dataframe


In [None]:
# create a copy of insurance dataframe


# drop old columns


In [None]:
# view array of standardized values


In [None]:
# view shape of array


In [None]:
# create new DataFrame with standardized variables
# add NumPy matrix as new columns in DataFrame


# view new DataFrame info


In [None]:
# drop missing values


In [None]:
# create predictors

# create model intercept

# fit model to data


In [None]:
# view model results


##### Data Leakage

#### Mean of Residuals = 0: 

When our residuals have a mean of zero, the model is not systematically over- or under-predicting, which is another indicator that the residuals do not form patterns or contain much information hidden from the model.

In [None]:
# calculate and print the mean of residuals


### Independence of Residuals

Ensure that residuals are independent of one another.

#### Autocorrelation

When checking for autocorrelation among the residuals, we want to make sure our observations are independent of one another.

There should not be any kind of pattern found among our residuals, they should be independent of one another.

When using the Durbin-Watson statistic to analyze autocorrelation, we consider that the range of the DW statistic is:

$$0 \le DW \le 4$$

where

- DW should be close to `2` if the Null Hypothesis $H_{0}$ is true
- DW $<$ `2` may indicate positive autocorrelation
- DW $>$ `2` may indicate negative autocorrelation
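The statistic itself is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that definition in NumPy and compares two synthetic series: independent noise and a random walk. Both series and the helper name `durbin_watson_stat` are assumptions for illustration; statsmodels also provides `durbin_watson` in `statsmodels.stats.stattools`.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)

# independent noise: no autocorrelation, so DW should be near 2
independent = rng.normal(size=1000)

# random walk: each value depends on the previous one, so DW should be near 0
trending = np.cumsum(rng.normal(size=1000))

dw_independent = durbin_watson_stat(independent)
dw_trending = durbin_watson_stat(trending)
print(dw_independent, dw_trending)
```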

#### Durbin-Watson Test

The Durbin-Watson Test has two hypotheses:

**_Null Hypothesis_** $H_{0}$ : There is no autocorrelation of residuals.

   - $DW \approx 2$ : Fail to reject the null hypothesis. There is no evidence of a relationship between the residuals of the observations.

**_Alternative Hypothesis_** $H_{1}$ : There is autocorrelation between residuals.
 
   - $DW$ well below or well above $2$ : Reject that there is no relationship between the residuals of the observations.
     

In [None]:
# import durbin-watson test from statsmodels


It looks like there is some degree of correlation between the residuals. We can plot this as well, using a line plot where we connect the points of the residuals against the predicted y values.

In [None]:
# check for autocorrelation using lineplot



Autocorrelation means that a time series is linearly related to a lagged version of itself.

The coefficient of correlation between two values in a time series is called the autocorrelation function (ACF).

By plotting the autocorrelation function, we can visualize if there is any high autocorrelation between the residuals. Here the ACF would inform us that there is little correlation between residuals.

In [None]:
# check for autocorrelation with ACF


#### No perfect multicollinearity

While we want our `hours` variable to have a statistically significant relationship to `score`, a perfect `1.0` correlation would be somewhat suspicious. Here the Pearson Correlation Coefficient for `hours` is 0.96. 

With such a small sample size it can be expected to have a high correlation coefficient.

In [None]:
# heatmap of correlation between hours and score


## Sklearn

![linear_model_equation_model.png](attachment:linear_model_equation_model.png)

In [None]:
# import train_test_split from scikit-learn


In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [None]:
# Create our X and y


In [None]:
# Split data


# print length


In [None]:
# Import LinearRegression from sklearn


# Create linear regression object


# Fit lr object to training data


In [None]:
# make predictions on training and testing data


In [None]:
# view predictions array on test data


$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$
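The workflow in the cells above (create `X` and `y`, split, fit, predict) might look like the following sketch. It uses synthetic data with known coefficients instead of the insurance data, so the feature values, the 80/20 split, and `random_state=42` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3 + 2*x1 - 1.5*x2 + noise (assumed, for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))

# Create the linear regression object and fit it to the training data
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions on training and testing data
y_hat_train = lr.predict(X_train)
y_hat_test = lr.predict(X_test)

# Estimated intercept (beta_0) and slopes (beta_1, beta_2)
print(lr.intercept_, lr.coef_)
```

Because the data were generated from known coefficients, the fitted `intercept_` and `coef_` should land close to 3, 2, and -1.5.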

### Model Intercept

In [None]:
# view model intercept


Expect a baseline of insurance charges to be $\hat\beta_0$ where:
    


***Coefficients***: The coefficients of the features also describe the mathematical relationship between each independent variable and the dependent variable, which in this case is the price of insurance. 

The coefficient value demonstrates how much the mean of the target variable changes given a one-unit change in the feature variable when the other features are unchanged.  

They also tell us whether the relationship between each feature and the target is positive or negative. For our notebook, we will assess the coefficients of our features to ensure we have features that are relevant to the charge for insurance. For a multiple linear regression model, there is one such coefficient per feature.
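The "one-unit change" interpretation can be checked with a few lines of arithmetic. In this sketch the intercept, coefficients, and feature values are hypothetical numbers chosen for readability, not the fitted insurance model.

```python
import numpy as np

# Hypothetical fitted model: y_hat = 10 + 2*x1 - 3*x2
intercept = 10.0
coefs = np.array([2.0, -3.0])

def predict(x):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + np.dot(coefs, x)

x = np.array([5.0, 1.0])
x_plus_one = x + np.array([1.0, 0.0])   # raise x1 by one unit, hold x2 fixed

change = predict(x_plus_one) - predict(x)
print(change)
```

The change in the prediction equals the coefficient on `x1`, and its sign tells us the direction of the relationship.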

### Model Slope

In [None]:
# view model coefficients


$$\hat\beta_1 = 253.99185244$$

$$\hat\beta_2 = -24.32455098$$

$$\hat\beta_3 = 328.40261701$$

$$\hat\beta_4 = 443.72929547$$

$$\hat\beta_5 = -23568.87948381$$

$$\hat\beta_6 = 288.50857254$$

In [None]:
# check DataFrame columns


For the first policy holder we would say:

$$x_1 = age_0$$

$$x_2 = sex_0$$

$$x_3 = bmi_0$$

$$x_4 = children_0$$

$$x_5 = smoker_0$$

$$x_6 = region_0$$

What exactly are those values?

### Prediction Example

In [None]:
# view info for first instance


In [None]:
# create list of values for first instance


# view values


In [None]:
# create list of coefficients


# view list of coefficients


In [None]:
# view p-values for each variable


$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$
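One way to confirm that the equation above is what the fitted model computes is to rebuild a single prediction by hand and compare it with `predict`. The sketch below does this on synthetic data; the three features and their true coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features (assumed, not the insurance data)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# Values for the first instance
first_row = X[0]

# beta_0 + beta_1*x_1 + ... + beta_n*x_n, written as a dot product
manual_prediction = model.intercept_ + np.dot(model.coef_, first_row)

# The same prediction from the fitted model (predict expects a 2D array)
model_prediction = model.predict(first_row.reshape(1, -1))[0]

print(manual_prediction, model_prediction)
```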

In [None]:
# make prediction


In [None]:
# check both predictions are the same


We now have our predictions and can compare them to the actual charges. The differences between `y_train` and `y_hat_train` are the residuals:

$$r_{i,train} = y_{i,train} - \hat y_{i,train}$$



In [None]:
# view predictions array on test data


In [None]:
# view true data


### Training Results

![linear_model_equation_error.png](attachment:linear_model_equation_error.png)

In [None]:
# return training error for each instance


#### Training MAE

Mean Absolute Error (MAE): the average absolute difference between the true values and the predicted values.

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_{i} - \hat y_{i}|$$
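A small worked example of the formula, using four made-up true and predicted values (assumptions for illustration), shows the residuals and the resulting MAE:

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

# Residuals: true value minus predicted value
residuals = y_true - y_pred

# MAE: mean of the absolute residuals
mae = np.mean(np.abs(residuals))

print(residuals)
print(mae)
```

The absolute residuals are 10, 10, 10, and 20, so the MAE is 50 / 4 = 12.5, in the same units as `y`.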

In [None]:
# return training mean absolute error


The MAE informs us that **on average, our model has an error of** $4,235.38 when **predicting the charges of policy holders in our training data**.

#### Training MSE

The Mean Squared Error, or MSE, tells us how close a regression line is to a set of true points. It does this by averaging the squared errors. In contrast to MAE, it gives more weight to larger distances between the points and the regression line.

**MSE is more useful if we are concerned about large errors whose consequences are much larger than equivalent smaller ones**.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2$$
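The following sketch applies the formula to four made-up points (assumed values) and compares the result with MAE, to show how the single large error dominates the MSE:

```python
import numpy as np

# Hypothetical true and predicted values; the last prediction is far off
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])

errors = y_true - y_pred

# MSE: mean of the squared errors
mse = np.mean(errors ** 2)

# MAE on the same errors, for comparison
mae = np.mean(np.abs(errors))

print(mse, mae)
```

The squared errors are 1, 0, 1, and 16, so the one error of size 4 contributes 16 of the 18 total.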

In [None]:
# return standard error

  
# return MSE


Our MSE is useful if we are comparing the predictive performance of two or more models. Its values are less interpretable than MAE and RMSE because they are in squared units rather than the units of our dependent variable.

#### Training RMSE

Root Mean Square Error: Interpretable MSE in units of y. 

RMSE is more sensitive to outliers, and penalizes large errors more than MAE because errors are squared.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat y_{i})^2}$$
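To see how RMSE reacts to large errors, this sketch compares two made-up sets of errors (assumptions for illustration) that share the same MAE: one with evenly sized errors and one where all of the error sits in a single point.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of an array of errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error of an array of errors."""
    return np.sqrt(np.mean(np.square(errors)))

# Hypothetical error sets with the same total absolute error
even_errors = np.array([2.0, 2.0, 2.0, 2.0])
outlier_errors = np.array([0.0, 0.0, 0.0, 8.0])

print(mae(even_errors), rmse(even_errors))
print(mae(outlier_errors), rmse(outlier_errors))
```

Both sets have an MAE of 2, but the RMSE rises from 2 to 4 when the error is concentrated in one observation.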

In [None]:
# return RMSE Training


Here our RMSE informs us that the typical error between the true value and the value our model predicts is $6,142 when predicting the charges of policy holders in our training data. **This is larger than our MAE of 4,235.38. RMSE is never smaller than MAE, so the gap between them is what matters: a large gap, as here, can indicate that a few large errors (possibly outliers) are present in our data**.

In [None]:
#from scipy import stats

In [None]:
#outlier_low = insurance_df['charges'].quantile(0.01)
#outlier_low

In [None]:
#outlier_high = insurance_df['charges'].quantile(0.99)
#outlier_high

#### Training R Squared

$R^2$ is the proportion of variation in the dependent variable that is explained by the independent variables. For a linear regression with an intercept, evaluated on its training data, it lies between 0 and 1.

$$SS_{residual} = \sum (y - \hat{y})^2 $$

is the sum of squared differences between $y$ and $\hat y$ (predicted y)

$$SS_{total} = \sum (y - \bar{y})^2 $$

is the sum of squared differences between $y$ and $\overline y$ (mean of y)

So that

$$R^2 = 1- \dfrac{SS_{residual}}{SS_{total}}$$
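The two sums of squares and the resulting $R^2$ can be computed directly. The five true and predicted values below are made up for illustration.

```python
import numpy as np

# Hypothetical true and predicted values
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

# Variation left over around the regression line
ss_residual = np.sum((y - y_hat) ** 2)

# Total variation around the mean of y
ss_total = np.sum((y - y.mean()) ** 2)

r_squared = 1 - ss_residual / ss_total

print(ss_residual, ss_total, r_squared)
```

Here the residual sum of squares is about 0.10 against a total of 10, giving an $R^2$ of about 0.99.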


In [None]:
# return training R squared


This means there is 73.6% less variation around the regression line than the mean, or **the relationship between the dependent variable and the independent variables explains 73.6% of the variation in the training data**.

#### Training Adjusted R Squared

***Adjusted $R^2$***: The Adjusted $R^2$ is a key metric for evaluation of a multivariate linear regression model, as it accounts for the number of predictors in a model when calculating the model's goodness-of-fit. It is a more accurate measure for assessing if our model explains changes in the dependent variable. 

$$R^2_{adj}= 1-(1-R^2)\dfrac{n-1}{n-p-1}$$

where $n$ is the number of observations and $p$ is the number of predictors.
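As a quick check of the formula, the sketch below wraps it in a small function. The name `adjusted_r_squared` and the example inputs ($R^2 = 0.75$, 101 observations, 5 predictors) are hypothetical values chosen for illustration, not this notebook's results.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty

# Hypothetical inputs
r2 = 0.75
adj_r2_few = adjusted_r_squared(r2, n_observations=101, n_predictors=5)
adj_r2_many = adjusted_r_squared(r2, n_observations=101, n_predictors=50)

print(adj_r2_few, adj_r2_many)
```

For the same $R^2$, the adjusted value drops as more predictors are used, which is the penalty the metric is designed to apply.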

In [None]:
#display adjusted R-squared


An Adjusted R-squared value of 0.735 can be described conceptually as: 

> ***73.5% of the variation in the dependent variable $y$ is explained by the independent variables $x$ in our model, after adjusting for the number of predictors.***

### Testing Results

Recall that...

In [None]:
# create and view test residuals


#### Compare Test and Train Performance

In [None]:
# import metrics modules from sklearn.metrics library
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# print training and testing MAE


# create and print standard error variable


# create and print mean squared error variable


# create and print root mean squared error

  
# create and view training and testing r squared


#display adjusted R-squared


**If our model has a much higher error on the test set than on the training set, then our model is not generalizing well; it is likely too complex and is overfitting**. In this case we would need to optimize our model and continue iterating through training and testing.
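A side-by-side comparison of training and testing metrics might be organized as in the sketch below. It uses synthetic data and a hypothetical `regression_report` helper, both assumptions for illustration; the metric functions are the ones imported from `sklearn.metrics` above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumed, not the insurance data)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

def regression_report(y_true, y_pred):
    """Collect MAE, MSE, RMSE, and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": np.sqrt(mse),
        "r2": r2_score(y_true, y_pred),
    }

train_metrics = regression_report(y_train, model.predict(X_train))
test_metrics = regression_report(y_test, model.predict(X_test))

for name in train_metrics:
    print(f"{name}: train={train_metrics[name]:.3f} test={test_metrics[name]:.3f}")
```

When the two columns are close, as they should be for this simple synthetic example, the model is generalizing about as well as it fits the training data.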

# Practice

### Data Description

In [None]:
# import dataset


| Column     | Description                                                                                                                                                                                                                |
|------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `name`      | **Name of the cars**                                                                                                                                                                                             |
| `year`      | **Year of the car when it was bought**                                                                                                                                                                              |
| `selling_price`      | **Price at which the car is being sold**                                                                                                                                                   |
| `km_driven`   | **Number of Kilometres the car is driven**                                                                                                                                                                      |
| `fuel`   | **Fuel type of car (petrol / diesel / CNG / LPG / electric)**                                                                                                                               |
| `seller_type`  | **Tells if a Seller is Individual or a Dealer**                                                                                                                                                                    |
| `transmission`  | **Gear transmission of the car (Automatic/Manual)**                                                                                                                                                                    |
| `owner`  | **Number of previous owners of the car.**                                                                                                                                                                    |

In [None]:
# check DataFrame


In [None]:
# clean dataset
# Replace dollar sign and commas for production_budget column


In [None]:
# perform feature engineering


In [None]:
# work with categorical variables


In [None]:
# drop unnecessary columns


In [None]:
# view descriptive statistics


In [None]:
# drop any missing values


In [None]:
# create predictors

# create intercept

# create model

# check model coefficients


## Practice Summary

In [None]:
# view model summary


## Questions

1. How much variation does our model explain within the data?

2. Based on the p-value of the t-test, which feature has the least statistical influence on the dependent variable?

3. What is the economic impact of the `Vehicle_Age` estimated coefficient?

# Q & A

![prediction_1.jpg](attachment:prediction_1.jpg)

# Thank You

In [None]:
###############################################################################################################################
###############################################################################################################################
                                                            #THANKS#
###############################################################################################################################
###############################################################################################################################