# Evaluation of Model

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:
data = pd.read_csv('http://pythontrade.com/public/Data/pro2/adv.csv',index_col=0)

In [3]:
print(data.head())
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf
lm1 = smf.ols(formula='Sales ~ Newspaper', data=data)
lm=lm1.fit()

      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9


## Evaluate the variance - Confidence in our Model

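**Variance of model:** how much the position of the fitted line varies when the model is refit on repeated samples from the same population.

**Bias of model:** how far the model is, on average, from capturing the true relationship.

Compared with more flexible models, **linear regression is a low-variance, high-bias model**.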



A closely related concept is **confidence intervals**. Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: if we drew **100 samples** from the same population and computed a 95% confidence interval from each one, approximately **95 of those confidence intervals** would contain the "true" coefficient.

In [4]:
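# print the 95% confidence intervals for the model coefficients (default alpha=0.05)
print(lm.conf_int())
# a much smaller alpha (1e-8) asks for far higher confidence, so the intervals widen
print(lm.conf_int(alpha=0.00000001))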

                   0          1
Intercept  11.125956  13.576859
Newspaper   0.022005   0.087381
                  0          1
Intercept  8.632293  16.070521
Newspaper -0.044510   0.153897


We only have a **single sample of data**, not the **entire population of data**. The "true" coefficient is either within this interval or it isn't. We estimate the coefficient with the data we do have, and we express the uncertainty about that estimate by giving a range that the coefficient is **probably** within.
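
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the "true" intercept, slope and noise level are made-up values, the data are synthetic rather than the advertising data, and the interval uses the normal quantile (about 1.96) as a large-sample stand-in for the t quantile that statsmodels uses.

```python
import math
import random
from statistics import NormalDist


def fit_slope(x, y):
    """Least-squares slope of y on x and its standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / s_xx)
    return b1, se_b1


def coverage(true_intercept=12.0, true_slope=0.05, noise_sd=5.0,
             n=200, n_samples=1000, seed=0):
    """Fraction of simulated samples whose 95% interval contains true_slope.

    All default parameter values are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)  # normal approximation to the t quantile
    hits = 0
    for _ in range(n_samples):
        x = [rng.uniform(0, 100) for _ in range(n)]
        y = [true_intercept + true_slope * xi + rng.gauss(0, noise_sd) for xi in x]
        b1, se = fit_slope(x, y)
        if b1 - z * se <= true_slope <= b1 + z * se:
            hits += 1
    return hits / n_samples


print("share of intervals containing the true slope:", coverage())
```

The printed share should land close to 0.95, and it moves slightly when the seed or the number of samples changes.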



## Evaluate the variance - p-values

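The **p-value** is the probability of observing a coefficient estimate at least as far from zero as ours if the true coefficient were actually zero. It is not the probability that the coefficient is zero.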


In [5]:
# print the p-values for the model coefficients
lm.pvalues

Intercept    4.713507e-49
Newspaper    1.148196e-03
dtype: float64

- If the 95% confidence interval **includes zero**, the p-value for that coefficient will be **greater than 0.05**. 
- If the 95% confidence interval **does not include zero**, the p-value will be **less than 0.05**. 
- A p-value less than 0.05 is one common criterion for deciding whether there is likely a relationship between the feature and the response.
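
The link between the confidence interval and the p-value in the bullets above can be written out explicitly. This is a sketch assuming the usual two-sided t-test of $H_0:\beta_1=0$ with $n-2$ degrees of freedom; $SE(b_1)$ (the standard error of the slope) and $t^*=t_{0.975,\,n-2}$ are notation introduced here for illustration.

$$0 \notin \left[\,b_1 - t^*\,SE(b_1),\; b_1 + t^*\,SE(b_1)\,\right] \iff |b_1| > t^*\,SE(b_1) \iff \left|\frac{b_1}{SE(b_1)}\right| > t^* \iff p < 0.05$$

For the Newspaper coefficient above, the 95% interval (0.022005, 0.087381) excludes zero and the p-value 1.148196e-03 is below 0.05, as the equivalence requires.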



## Evaluate the bias - R-squared

The most common way to evaluate the overall fit of a linear model is by the **R-squared** value. R-squared is the **proportion of variance explained**, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the **null model**. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)



<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/ssr.png">

- SSR (regression sum of squares): $SSR=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2=b_1^2S_{XX}$
- SSE (error sum of squares): $SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2$
- SST (total sum of squares): $SST=\sum_{i=1}^n(y_i-\bar{y})^2$

**SST=SSR+SSE**

**R-squared=$\frac{SSR}{SST}$**
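
The three sums of squares can be computed by hand to see the decomposition at work. The sketch below uses only the five rows printed by `data.head()` (Newspaper and Sales) as a toy sample, so its R-squared will not match the full-data value. The variable names are illustrative.

```python
# Toy sample: the five rows shown by data.head(), not the full dataset
newspaper = [69.2, 45.1, 69.3, 58.5, 58.4]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

n = len(sales)
x_bar = sum(newspaper) / n
y_bar = sum(sales) / n

# Least-squares slope b1 and intercept b0
s_xx = sum((x - x_bar) ** 2 for x in newspaper)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(newspaper, sales))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in newspaper]

ssr = sum((f - y_bar) ** 2 for f in fitted)             # regression sum of squares
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))  # error sum of squares
sst = sum((y - y_bar) ** 2 for y in sales)              # total sum of squares

r_squared = ssr / sst
print("SSR + SSE =", ssr + sse, " SST =", sst)
print("R-squared =", r_squared, " 1 - SSE/SST =", 1 - sse / sst)
```

The two printed pairs should agree up to floating-point rounding. The null model predicts `y_bar` for every row, so its SSE equals SST and its R-squared is zero.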

Let's calculate the R-squared value for our simple linear model:

In [6]:
# print the R-squared value for the model
lm.rsquared

0.052120445444305163

Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends widely on the domain. Therefore, it's most useful as a tool for **comparing different models**.

<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/r2_1.png">
<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/r2_2.png">
<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/r2_3.png">
<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/r2_4.png">

## Evaluate the bias - RMSE

<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/rmse.PNG">

- $RMSE=\sqrt{\frac{SSE}{n-2}}$
- It is sometimes also denoted $s$, since it is a point estimate of $\sigma$, the standard deviation of the errors.




<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/rmse1.png">

<img src="http://pythontrade.com/public/PIC/ISOM2500/regession/rmse2.png">

**Practice:** Using the data below on study hours and exam scores, explore the variance and bias of a linear regression model.

In [7]:
X=[1,1,1,2,2,3,4,5,6,6,7,8,9,1,1,1,2,4,5,5,20,16,15,15,6]
Y=[50,60,60,55,55,60,70,65,70,85,85,85,50,55,40,50,70,60,70,75,100,90,85,90,60]
print(len(X), len(Y))

25 25


In [8]:
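# build a DataFrame with one column per variable
data1 = pd.DataFrame({'hours': X, 'score': Y})
print(data1.head())
# fit score = b0 + b1 * hours by ordinary least squares
lm = smf.ols(formula='score ~ hours', data=data1).fit()
print(lm.params)
print("R-squared=", lm.rsquared)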


   hours  score
0      1     50
1      1     60
2      1     60
3      2     55
4      2     55
Intercept    54.192875
hours         2.329987
dtype: float64
R-squared= 0.644801084788


In [9]:
real = data1['score']
# index the coefficients by name rather than by position
predict = lm.params['Intercept'] + lm.params['hours'] * data1['hours']

In [10]:
import numpy as np
SSE = ((real - predict)**2).sum()
# divide by n-2 because two parameters (intercept and slope) were estimated
RMSE = np.sqrt(SSE / (len(X) - 2))
print("RMSE=", RMSE)

RMSE= 9.46750622298
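
The R-squared and RMSE above speak to bias. To explore the variance, one option is to mimic repeated sampling by resampling the 25 observations with replacement and refitting the line each time. This is a bootstrap, a stand-in technique chosen here for illustration and not part of the original practice. The helper names `fit_line` and `bootstrap_slopes` are hypothetical.

```python
import random
import statistics

hours = [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 9, 1, 1, 1, 2, 4, 5, 5, 20, 16, 15, 15, 6]
score = [50, 60, 60, 55, 55, 60, 70, 65, 70, 85, 85, 85, 50, 55, 40, 50, 70, 60,
         70, 75, 100, 90, 85, 90, 60]


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def bootstrap_slopes(x, y, n_resamples=1000, seed=0):
    """Refit the line on resampled (x, y) pairs and collect the slopes."""
    rng = random.Random(seed)
    pairs = list(zip(x, y))
    slopes = []
    for _ in range(n_resamples):
        resample = rng.choices(pairs, k=len(pairs))  # draw with replacement
        xs, ys = zip(*resample)
        if len(set(xs)) < 2:
            continue  # slope is undefined when every x is identical
        slopes.append(fit_line(xs, ys)[1])
    return slopes


b0, b1 = fit_line(hours, score)
slopes = bootstrap_slopes(hours, score)
print("fitted line: score = %.4f + %.4f * hours" % (b0, b1))
print("bootstrap slopes: mean = %.3f, std = %.3f"
      % (statistics.mean(slopes), statistics.stdev(slopes)))
```

The spread of the resampled slopes gives a rough sense of how much the line would move from sample to sample. A small spread relative to the slope itself is what "low variance" means here.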
