# 19.05 Evaluating Performance

So far in module 19, you've learned how to build linear regression models and interpret estimated coefficients.  In this notebook, the discussion turns to how you evaluate model performance in the training phase.  Recall that there are two contexts where you care about performance: in relation to the training set and in relation to a test set.  The former enables you to talk about how well the model explains the information in the target variable, while the latter tells you how well the model will perform when it's given previously unseen observations.

In this notebook, concepts like **F-tests** and **R-squared** are covered.  F-tests allow you to compare your model to a reduced model with no features.  R-squared and **adjusted R-squared** (which is a variant of R-squared) values tell you how well the model accounts for variance in the target.

Last, you'll see how you can compare different models in terms of their explanatory power.  The notebook will show you how to read **Akaike** and **Bayesian** information criteria for this purpose.  

#### **Key Topics**
- training and test data
- evaluating training performance
- F-tests
- degrees of freedom
- R-squared
- Akaike information criterion
- Bayesian information criterion

## Is Your Model Better Than an "Empty" Model?

When evaluating your model, you first need to ask whether your model contributes anything to the explanation of the outcome variable.  In other words, you need to determine whether your features explain variance in the outcome.  If not, you could drop your features altogether and the resulting "empty" model would perform equally well (which is to say, not very well).  For this purpose, you use an **F-test**.

#### F-tests

F-tests can be calculated in different ways depending on the situation, but, in general, they represent the ratio between the variance a model explains beyond a reduced model and the variance the model leaves unexplained.  Here the "reduced model" is a model with no features, meaning all variance in the outcome is unexplained.  For a linear regression model with two parameters, $y = \alpha + \beta x$, the F-test is built from these pieces:
- unexplained model variance: $SSE_F=\sum(y_i-\hat{y}_i)^2$
- unexplained variance in reduced model: $SSE_R = \sum(y_i-\bar{y})^2$ (the total variation of $y$ around its mean)
- number of parameters in the model: $p_F = 2 (\alpha \text{ and } \beta)$
- number of parameters in the reduced model: $p_R = 1 (\alpha)$
- number of observations: $n$
- degrees of freedom of $SSE_F$: $df_F = n - p_F$
- degrees of freedom of $SSE_R$: $df_R = n - p_R$

These pieces come together to give you the full equation for the F-test: 
$$F=\dfrac{(SSE_R-SSE_F)/(df_R-df_F)}{SSE_F/df_F}$$

**Degrees of freedom** quantify the amount of information "left over" to estimate variability after all parameters are estimated.

In regression, degrees of freedom work like this: With two data points, a regression $y = \alpha + \beta x$ has $0$ degrees of freedom ($2$ minus the number of parameters).  Those two parameters encompass all the information in the data.  Knowing $\alpha$ and $\beta$ alone, you can perfectly reproduce the original data.  No additional information is available from the data itself.  If you have 10 data points, then the model's degrees of freedom would be $8$ ($10$ minus the number of parameters).

The F-test's null hypothesis states that the model is indistinguishable from the reduced model, which means that the features contribute nothing to the explanation of the target variable.  Instead of reading the F statistic, it's easier to read its associated p-value.  The lower the p-value, the stronger the evidence against the null hypothesis.  Namely, if the p-value of the F-test for your model is $\leq 0.1$ (or even $\leq 0.05$), you can say that your model is useful and contributes something that is statistically significant in the explanation of the target.
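To make the pieces of the F-test concrete, here is a minimal sketch that computes the statistic by hand for a two-parameter regression. The data are synthetic and the variable names (`sse_full`, `sse_reduced`, and so on) are illustrative assumptions, not part of the medical costs example that follows. The p-value is taken from the F distribution in `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

# Full model: y = alpha + beta * x, fit by least squares
X_full = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X_full, y, rcond=None)
sse_full = np.sum((y - X_full @ coef) ** 2)

# Reduced model: constant only, so every prediction is the mean of y
sse_reduced = np.sum((y - y.mean()) ** 2)

# Parameter counts and degrees of freedom
p_full, p_reduced = 2, 1
df_full, df_reduced = n - p_full, n - p_reduced

# Improvement per extra parameter, divided by unexplained variance per degree of freedom
f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)

# Upper-tail probability of the F distribution gives the p-value
p_value = stats.f.sf(f_stat, df_reduced - df_full, df_full)

print(f"F = {f_stat:.2f}, p-value = {p_value:.3g}")
```

A large F statistic paired with a small p-value indicates that the feature reduces the unexplained variance by more than chance alone would suggest.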

Now, look and see the F statistic in action:


In [4]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

pd.options.display.float_format = "{:.3f}".format

warnings.filterwarnings(action="ignore")

kagle = dict(
    drivername = "postgresql",
    username = "dsbc_student",
    password = "7*.8G9QH21",
    host = "142.93.121.174",
    port = "5432",
    database = "medicalcosts"
)

In [5]:
# Load the data from the medicalcosts database
# URL.create() is the supported way to build a URL in SQLAlchemy 1.4 and later
engine=create_engine(URL.create(**kagle), echo=True)

insurance_df = pd.read_sql("SELECT * FROM medicalcosts", con=engine)

# No need for an open connection, please close
engine.dispose()

In [7]:
# Take a look at a sample of the data
insurance_df.sample(n=10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 989 | 24 | female | 20.52 | 0 | yes | northeast | 14571.9 |
| 522 | 51 | female | 33.915 | 0 | no | northeast | 9866.3 |
| 899 | 19 | female | 22.515 | 0 | no | northwest | 2117.34 |
| 800 | 42 | female | 26.18 | 1 | no | southeast | 7046.72 |
| 572 | 30 | female | 43.12 | 2 | no | southeast | 4753.64 |
| 126 | 19 | female | 28.3 | 0 | yes | southwest | 17081.1 |
| 1164 | 41 | female | 28.31 | 1 | no | northwest | 7153.55 |
| 1171 | 43 | female | 26.7 | 2 | yes | southwest | 22478.6 |
| 371 | 57 | female | 22.23 | 0 | no | northeast | 12029.3 |
| 686 | 42 | male | 26.125 | 2 | no | northeast | 7729.65 |


In [8]:
# Create dummy columns for the categorical "sex" and "smoker" variables
# dtype=int keeps the dummies numeric (recent pandas versions return bool by default)
insurance_df["is_male"] = pd.get_dummies(insurance_df["sex"], drop_first=True, dtype=int)
insurance_df["is_smoker"] = pd.get_dummies(insurance_df["smoker"], drop_first=True, dtype=int)

In [9]:
# Fit an Ordinary Least Squares (OLS) model to the data
# Y is the target variable
Y = insurance_df["charges"]

# X is the feature set
X = insurance_df[["is_male", "is_smoker", "age", "bmi"]]

# Add a constant to the model (best practice)
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y,X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.747
Model:                            OLS   Adj. R-squared:                  0.747
Method:                 Least Squares   F-statistic:                     986.5
Date:                Mon, 06 Jan 2020   Prob (F-statistic):               0.00
Time:                        16:34:13   Log-Likelihood:                -13557.
No. Observations:                1338   AIC:                         2.712e+04
Df Residuals:                    1333   BIC:                         2.715e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.163e+04    947.267    -12.281      0.0

This model's F statistic is $986.5$, and the associated p-value is very close to zero.  This means that the model's features add some information to the reduced model and the model is useful in explaining the target variable (charges).  However, F-tests don't quantify how much information your model contributes.  This requires R-squared, which is discussed next.

### Quantifying the Performance of a Model on the Training Set

R-squared is probably the most common measure of goodness of fit in a linear regression model.  It is a proportion (between $0$ and $1$) that expresses how much variance in the outcome variable is explained by the explanatory variables in the model.  Generally speaking, higher $R^2$ values are better up to a point.  A low $R^2$ indicates that the model isn't explaining much of the variation in the outcome, which means it will not give very good predictions.  However, a very high $R^2$ can be a warning sign of overfitting.  No dataset is a perfect representation of reality, so a model that perfectly fits your data ($R^2$ of $1$ or close to $1$) is likely to be biased by quirks in the data and will perform less well on the test set.
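As a quick illustration of what "proportion of variance explained" means, the following sketch computes R-squared by hand as one minus the ratio of unexplained variation to total variation. The data are synthetic and the names are hypothetical; this is only meant to show the arithmetic, not how any particular library implements it.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

# Fit y = alpha + beta * x by least squares
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

sse = np.sum((y - y_hat) ** 2)      # variation the model leaves unexplained
sst = np.sum((y - y.mean()) ** 2)   # total variation of y around its mean

r_squared = 1 - sse / sst
print(f"R-squared = {r_squared:.3f}")
```

With a single feature and a constant, this value equals the squared correlation between `x` and `y`.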

In the regression summary table above, you see that the R-squared value of the medical costs model is $0.747$.  This means that the model explains 74.7% of the variance in the charges, leaving 25.3% unexplained.  You can conclude that there's still room for improvement.  Now, fit the model from the previous notebook, 19.04, where you included the interaction of body mass index (BMI) and the is_smoker dummy.


In [10]:
# Y is the target variable
Y = insurance_df["charges"]

# This is the interaction between bmi and smoking
insurance_df["bmi_is_smoker"] = insurance_df["bmi"] * insurance_df["is_smoker"]

# X is the feature set
X = insurance_df[["is_male", "is_smoker", "age", "bmi", "bmi_is_smoker"]]

# Add a constant to the model as it's a best practice
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y,X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     1365.
Date:                Mon, 06 Jan 2020   Prob (F-statistic):               0.00
Time:                        16:52:45   Log-Likelihood:                -13265.
No. Observations:                1338   AIC:                         2.654e+04
Df Residuals:                    1332   BIC:                         2.657e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const         -2071.0750    840.644     -2.464

The R-squared of this model is $0.837$, which is higher than the previous model's.  This improvement indicates that the interaction of BMI and is_smoker explains some previously unexplained variance in the charges.

As stated previously, high R-squared values are generally desirable.  However, in some cases, very high R-squared values indicate some potential problems with the model.  Specifically: 
- A very high R-squared value may be a sign of overfitting.  If your model is too complex for the data, then it may overfit the training set and do a poor job on the test set.  That said, there's no agreed-upon threshold for R-squared to detect overfitting.  Instead, it requires a comparison between performance on test and training data.  **If your model performs significantly worse on the test set compared to the training set, then you should suspect overfitting**.  Evaluating linear regression models on the test set will be discussed in the next notebook, 19.06.
- R-squared is an inherently biased estimate of performance in the sense that the more explanatory variables you add to the model, the higher the R-squared value becomes.  This is so even if you include irrelevant variables like noise or random data.  To mitigate the problem, you usually use a metric called **adjusted R-squared** instead of R-squared.  Adjusted R-squared does the same job as R-squared, but it is adjusted according to the number of features included in the model.  Hence, **it's always safer to look at the adjusted R-squared value instead of the R-squared value**.
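The bias described in the second point can be seen in a small experiment. The sketch below assumes the common definition of adjusted R-squared, $1 - (1 - R^2)\frac{n-1}{n-p}$, where $p$ counts all estimated parameters including the constant. The helper name `r2_and_adjusted` and the synthetic data are hypothetical.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R-squared, adjusted R-squared) for a least squares fit of y on X.

    X must already contain a column of ones for the constant.
    """
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    # Penalize the fit for the number of estimated parameters
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    return r2, adj_r2

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x])
# Append five columns of pure noise that have no relationship with y
X_noisy = np.column_stack([X_base, rng.normal(size=(n, 5))])

r2_base, adj_base = r2_and_adjusted(X_base, y)
r2_noisy, adj_noisy = r2_and_adjusted(X_noisy, y)

print(f"base model:  R2 = {r2_base:.3f}, adjusted R2 = {adj_base:.3f}")
print(f"with noise:  R2 = {r2_noisy:.3f}, adjusted R2 = {adj_noisy:.3f}")
```

R-squared cannot go down when columns are added, even useless ones, while adjusted R-squared typically rises much less and may fall.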

A note on negative R-squared values: it is possible to get negative R-squared values for some models.  In general terms, if a model fits worse than a straight horizontal line at the mean of the target, then the R-squared value becomes negative.  This usually happens when a constant is not included in the model.  Getting a negative value for R-squared means that your model does very poorly in explaining the target.
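A small sketch can show how a negative value arises. It fits a regression with no constant to synthetic data whose target sits far from zero, then computes R-squared against the horizontal line at the mean. The setup is an assumption chosen to make the effect obvious. Note that `statsmodels` switches to an uncentered definition of R-squared when the constant is omitted, so its summary would not show a negative number for this case.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)            # feature centered near zero
y = 100.0 + rng.normal(size=n)    # target far from zero and unrelated to x

# Fit y = beta * x with no constant (closed-form least squares slope)
beta = np.sum(x * y) / np.sum(x * x)
y_hat = beta * x

sse = np.sum((y - y_hat) ** 2)      # error of the no-constant model
sst = np.sum((y - y.mean()) ** 2)   # error of the horizontal line at the mean

r_squared = 1 - sse / sst
print(f"R-squared = {r_squared:.1f}")
```

Because the line is forced through the origin, it misses the level of the target entirely, and its error is far larger than that of the horizontal line.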

### Comparing Different Models 

Comparing different models and choosing the best one is one of the essential practices in data science.  Often, several models are tried and their performance is evaluated on a test set in order to determine the top-performing one.  However, _inference_ is also a critical task when it comes to linear regression models.  Unlike testing the predictive power, in inference, you care about the explanatory power of your models.

Throughout this notebook, you saw that you can measure the performance of your models on the training set using an F-test or R-squared.  Hence, both F-tests and R-squared can be used to compare different models.  Unfortunately, the two metrics suffer from some drawbacks that make them inappropriate to use in certain situations.

### Using F-tests for Model Comparison 

You can use an F-test to compare two models if one of them is nested within the other.  That is, if the feature set of one model is a subset of the feature set of the other, then you can use an F-test.  In this case, you say that the model with the higher F statistic is superior to the other one.

However, if models are not nested, then using an F-test may be misleading.  F-tests are quite sensitive to the normality of the error terms.  If errors are not normally distributed, you should try other methods.

### Using R-squared for Model Comparison

R-squared can also be used.  You already saw that R-squared is biased as it tends to increase with the number of explanatory variables.  So, instead of R-squared, you can use adjusted R-squared.  The higher the adjusted R-squared, the better the model explains the target variable.

### Using Information Criteria

Using information criteria is also a common way of comparing different models and selecting the best one.  Two information criteria known as the **Akaike Information Criterion (AIC)** and **Bayesian Information Criterion (BIC)** take into consideration the sum of the squared errors (SSE), the sample size, and the number of parameters.

The formula for AIC is:
$$AIC = n\ln(SSE) - n\ln(n) + 2p$$
The formula for BIC is:
$$BIC = n\ln(SSE) - n\ln(n) + p\ln(n)$$

In both of these formulas, $n$ represents the sample size, and $p$ represents the number of regression coefficients in the model (including the constant).  $\ln$ stands for the natural logarithm.

For both AIC and BIC, the lower the value, the better.  Hence, you choose the model with the lowest AIC or BIC value.  Although you can use either of the two criteria, AIC is often criticized for its tendency to overfit.  In contrast, BIC penalizes the number of parameters more severely than AIC and hence favors more parsimonious models (that is, models with fewer parameters).
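The two formulas above can be evaluated directly. The following sketch does so for two candidate models on synthetic data. The helper `aic_bic` is a hypothetical name. The values that `statsmodels` reports are computed from the log-likelihood, which for OLS differs from these formulas by the constant $n(1 + \ln 2\pi)$; because that constant is the same for every model fit to the same data, the ranking of models is unaffected.

```python
import numpy as np

def aic_bic(X, y):
    """Return (AIC, BIC) from the SSE-based formulas for a least squares fit.

    X must already contain a column of ones for the constant.
    """
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coef) ** 2)
    fit_term = n * np.log(sse) - n * np.log(n)
    return fit_term + 2 * p, fit_term + p * np.log(n)

# Synthetic data (an assumption for illustration): y depends on both x1 and x2
rng = np.random.default_rng(5)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_large = np.column_stack([np.ones(n), x1, x2])

aic_small, bic_small = aic_bic(X_small, y)
aic_large, bic_large = aic_bic(X_large, y)

print(f"small model: AIC = {aic_small:.1f}, BIC = {bic_small:.1f}")
print(f"large model: AIC = {aic_large:.1f}, BIC = {bic_large:.1f}")
```

The gap between BIC and AIC for a given model is $p(\ln n - 2)$, which shows why BIC's penalty is heavier whenever $n$ exceeds about $7$.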

### Which Medical Costs Model is Better?

The `summary()` method in `statsmodels` gives you all of the above metrics.  In the tables above, you see that for your first model, R-squared is $0.747$, adjusted R-squared is $0.747$, the F statistic is $986.5$, AIC is approximately $27{,}120$, and BIC is approximately $27{,}150$.  For the second model, R-squared is $0.837$, adjusted R-squared is $0.836$, the F statistic is $1365$, AIC is approximately $26{,}540$, and BIC is approximately $26{,}570$.  According to all of the metrics, the second model seems better than the first one.


## Assignments

As in previous checkpoints, please submit links to two Jupyter notebooks (one for each assignment below).

Please submit links to all your work below. This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to [these example solutions](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/5.solution_evaluating_goodness_of_fit.ipynb).



### 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* As in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by *visibility*. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.


###  2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?