In [19]:
import pandas as pd
import numpy as np
np.set_printoptions(suppress=True)

# Introduction to Simple Linear Regression


### Goal of Regression Analysis
- A common goal in statistics is to answer the question: "Is the variable X (or more likely, X1, ..., Xp) associated with a variable Y, and if so, what is the relationship and can we use it to predict Y?"
- This process of training a model on data where the outcome is known, for subsequent application to data where the outcome is not known, is termed supervised learning.


### Simple Linear Regression
- Simple linear regression provides a model of the relationship between the magnitude of one variable and that of a second.
  - For example, as X increases, Y also increases, or as X increases, Y decreases.
- Correlation is another way to measure how two variables are related.
- The difference is that while correlation measures the strength of an association between two variables, regression quantifies the nature of the relationship.


### Key Terms

- Response: The variable we are trying to predict.
  - Synonyms: dependent variable, Y variable, target, outcome
- Independent variable: The variable used to predict the response.
  - Synonyms: X variable, feature, attribute, predictor
- Intercept: The intercept of the regression line—that is, the predicted value when X = 0.
  - Synonyms: b0, β0
- Regression coefficient: The slope of the regression line.
  - Synonyms: slope, b1, β1
  - b0 and b1 are often called parameter estimates, weights
- Fitted values: The estimates $\hat{Y}_i$ obtained from the regression line.
  - Synonym: predicted values
- Residuals: The difference between the observed values and the fitted values.
  - Synonym: errors
- Least squares: The method of fitting a regression by minimizing the sum of squared residuals.
  - Synonyms: ordinary least squares, OLS

### The Regression Equation

- Simple linear regression estimates how much Y will change when X changes by a certain amount.
- With the correlation coefficient, the variables X and Y are interchangeable.
- With regression, we are trying to predict the Y variable from X using a linear relationship (i.e., a line):
  - $Y = b_0 + b_1X$
- We read this as "Y equals b1 times X, plus a constant b0."
  - The symbol b0 is known as the intercept (or constant), and the symbol b1 as the slope for X.
- The Y variable is known as the response or dependent variable since it depends on X.
- The X variable is known as the predictor or independent variable.
- The machine learning community tends to use other terms, calling Y the target and X a feature vector.


### Fitted Values and Residuals

- Important concepts in regression analysis are the fitted values (the predictions) and residuals (prediction errors).
- In general, the data doesn't fall exactly on a line, so the regression equation should include an explicit error term $e_i$:
  - $Y_i = b_0 + b_1X_i + e_i$
- The fitted values, also referred to as the predicted values, are typically denoted by $\hat{Y}_i$ (Y-hat). These are given by:
  - $\hat{Y}_i = \hat{b}_0 + \hat{b}_1X_i$
- The notation $\hat{b}_0$ and $\hat{b}_1$ indicates that the coefficients are estimated versus known.
  - The "hat" notation is used to differentiate between estimates and known values. So the symbol $\hat{b}$ ("b-hat") is an estimate of the unknown parameter b. Why do statisticians differentiate between the estimate and the true value? The estimate has uncertainty, whereas the true value is fixed.
- We compute the residuals $\hat{e}_i$ by subtracting the predicted values from the original data:
  - $\hat{e}_i = Y_i - \hat{Y}_i$
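
The fitted values and residuals defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the original analysis: the `x` and `y` values are made-up toy data, and the slope and intercept use the standard closed-form least squares estimates for a single predictor.

```python
# Toy data, invented for illustration (roughly y = 2x + 1 plus noise)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least squares estimates of the slope (b1) and intercept (b0)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean

# Fitted values: points on the estimated line
fitted = [b0 + b1 * xi for xi in x]

# Residuals: observed value minus fitted value
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print("residuals:", [round(r, 3) for r in residuals])
```

When the model includes an intercept, the least squares residuals sum to zero (up to floating-point error), which is a quick sanity check on the fit.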


### Historical Use of Regression
- Historically, a primary use of regression was to illuminate a supposed linear relationship between predictor variables and an outcome variable.
- The primary focus in science and certain disciplines is on the estimated slope of the regression equation, $\hat{b}$.
  - The model is explainable (as opposed to a black box).
  - Economists want to know the relationship between consumer spending and GDP growth.
  - Management of ABC stores want to know whether contributing to promoting tourism translates to increased sales.
- In such cases, the focus is not on predicting individual cases but rather on understanding the overall relationship among variables.



### Machine Learning and Other Disciplines
- In Machine Learning and other disciplines, regression is used to form a model to predict individual outcomes for new data.
- In this instance, the main items of interest are the fitted values $\hat{Y}$.
  - In marketing, regression can be used to predict the change in revenue in response to the size of an ad campaign.
  - Universities use regression to predict students' GPA based on their SAT scores.
- See the paper [To Explain or to Predict?](https://projecteuclid.org/journals/statistical-science/volume-25/issue-3/To-Explain-or-to-Predict/10.1214/10-STS330.full)



### Simple Linear Regression: Key Ideas

- The regression equation models the relationship between a response variable $Y$ and a predictor variable $X$ as a line.
- A regression model yields fitted values and residuals—predictions of the response and the errors of the predictions.
- Regression models are typically fit by the method of least squares.
- Regression is used both for prediction and explanation.


## Multiple Linear Regression

- When there are multiple predictors, the equation is simply extended to accommodate them:

$Y = b_0 + b_1X_1 + b_2X_2 + ... + b_pX_p + e$

- Instead of a line, we now have a linear model—the relationship between each coefficient and its variable (feature) is linear.

### Performance Metrics

- Root mean squared error (RMSE): The square root of the average squared error of the regression (the most widely used metric to compare regression models).
- Residual standard error (RSE): The same as the root mean squared error, but adjusted for degrees of freedom.
- R-squared (R²): The proportion of variance explained by the model, from 0 to 1. Also known as the coefficient of determination.
- t-statistic: The coefficient for a predictor, divided by the standard error of the coefficient, giving a metric to compare the importance of variables in the model.


## Extending Simple Linear Regression Concepts

- All of the other concepts in simple linear regression, such as fitting by least squares and the definition of fitted values and residuals, extend to the multiple linear regression setting. 

- Here, the model is a hyperplane instead of a line.

- For example, the fitted values are given by:

$\hat{Y}_i = \hat{b}_0 + \hat{b}_1X_{1, i} + \hat{b}_2X_{2, i} + ... + \hat{b}_pX_{p, i}$
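
As a rough sketch of how these fitted values arise, the snippet below fits a two-predictor model by least squares with NumPy. The data are synthetic (generated from assumed "true" coefficients chosen only for illustration), and `np.linalg.lstsq` stands in for whatever fitting routine is used in practice.

```python
import numpy as np

# Synthetic data: y = 4 + 1.5*x1 - 2*x2 + small noise (values are assumptions)
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_b = np.array([1.5, -2.0])
y = 4.0 + X @ true_b + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first coefficient plays the role of b0
X_design = np.column_stack([np.ones(n), X])

# Least squares solution: minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b = coef[0], coef[1:]

# Fitted values and residuals, exactly as in the simple regression case
fitted = b0 + X @ b
residuals = y - fitted

print("intercept:", round(b0, 3))
print("slopes:", np.round(b, 3))
```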


## Example: King County Housing Data

- An example of using multiple linear regression is in estimating the value of houses. 
  - County assessors must estimate the value of a house for the purposes of assessing taxes. 
  - Real estate professionals and home buyers consult popular websites such as Zillow to ascertain a fair price.


In [5]:
# Load the King County housing data and drop rows with missing values
file_name = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)
df = df.dropna()
df.head()

| Unnamed: 0.1 | Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.0 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.0 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.0 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.0 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


### King County Housing Data

| Column        | Description                                                                                                                     |
|---------------|---------------------------------------------------------------------------------------------------------------------------------|
| id            | A notation for a house                                                                                                          |
| date          | Date house was sold                                                                                                             |
| price         | Price is prediction target                                                                                                      |
| bedrooms      | Number of bedrooms                                                                                                              |
| bathrooms     | Number of bathrooms                                                                                                             |
| sqft_living   | Square footage of the home                                                                                                      |
| sqft_lot      | Square footage of the lot                                                                                                       |
| floors        | Total floors (levels) in house                                                                                                  |
| waterfront    | House which has a view to a waterfront                                                                                          |
| view          | Has been viewed                                                                                                                 |
| condition     | How good the condition is overall                                                                                               |
| grade         | overall grade given to the housing unit, based on King County grading system                                                    |
| sqft_above    | Square footage of house apart from basement                                                                                     |
| sqft_basement | Square footage of the basement                                                                                                  |
| yr_built      | Built Year                                                                                                                      |
| yr_renovated  | Year when house was renovated                                                                                                   |
| zipcode       | Zip code                                                                                                                        |
| lat           | Latitude coordinate                                                                                                             |
| long          | Longitude coordinate                                                                                                            |
| sqft_living15 | Living room area in 2015(implies-- some renovations) This might or might not have affected the lotsize area                     |
| sqft_lot15    | LotSize area in 2015(implies-- some renovations)                                                                                |

In [37]:
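from sklearn.linear_model import LinearRegression
# Predictors (features) and the outcome (target) for the housing model
predictors = ['sqft_living15', 'sqft_lot15', 'bathrooms', 'bedrooms', 'grade']
outcome = 'price'
# fit_intercept=False forces the fitted model through the origin (b0 = 0)
house_lm = LinearRegression(fit_intercept=False)
house_lm.fit(df[predictors], df[outcome])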

In [38]:
print(house_lm)

LinearRegression(fit_intercept=False)


In [39]:
house_lm.intercept_

0.0

In [40]:
house_lm.coef_

array([   187.67630277,     -0.31098733, 122782.88040016, -45991.44743304,
        10594.42932037])

The interpretation of the coefficients is as with simple linear regression: the predicted value $\hat{Y}$ changes by the coefficient $b_j$ for each unit change in $X_j$, assuming all the other variables, $X_k$ for $k \neq j$, remain the same. For example, with the other predictors held fixed, each additional square foot of `sqft_living15` increases the predicted price by roughly \$187.68; an additional 1,000 square feet increases it by about \$187,676.
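
The "all else equal" reading can be checked directly from the coefficients printed above. In this sketch the coefficient values are copied from the `house_lm.coef_` output, while the `house` feature values are hypothetical and chosen only to illustrate the arithmetic. Because the model was fit without an intercept, a prediction is just the dot product of the features and the coefficients.

```python
import numpy as np

predictors = ['sqft_living15', 'sqft_lot15', 'bathrooms', 'bedrooms', 'grade']

# Coefficients copied from the house_lm.coef_ output above
coef = np.array([187.67630277, -0.31098733, 122782.88040016,
                 -45991.44743304, 10594.42932037])

# Hypothetical house (values invented for illustration)
house = np.array([1800.0, 7500.0, 2.0, 3.0, 7.0])

# Same house with 1,000 more square feet of sqft_living15, all else unchanged
bigger = house.copy()
bigger[predictors.index('sqft_living15')] += 1000

# No intercept in this model, so the prediction is a plain dot product
base_pred = house @ coef
delta = bigger @ coef - base_pred

print(f"change in predicted price: {delta:,.2f}")
```

The change in the prediction equals 1,000 times the `sqft_living15` coefficient, regardless of the values chosen for the other predictors.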


### Root Mean Squared Error (RMSE)

The most important performance metric from a data science perspective is root mean squared error, or RMSE. RMSE is the square root of the average squared error in the predicted $\hat{y}_i$ values:

$RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}}$

This measures the overall accuracy of the model and is a basis for comparing it to other models (including models fit using machine learning techniques).
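
A minimal sketch of the RMSE formula in plain Python follows. The observed values `y` and predictions `y_hat` are made-up numbers, not output from the housing model.

```python
import math

# Made-up observed values and predictions, for illustration only
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 10.0]

# Square each prediction error, average over n records, take the square root
squared_errors = [(yi - fi) ** 2 for yi, fi in zip(y, y_hat)]
rmse = math.sqrt(sum(squared_errors) / len(y))

print(f"RMSE = {rmse:.4f}")
```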


### Residual Standard Error (RSE)

Similar to RMSE is the residual standard error, or RSE. In this case we have $p$ predictors, and the RSE is given by:

$RSE = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p - 1}}$

The only difference is that the denominator is the degrees of freedom, as opposed to the number of records.
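
The RSE formula can be sketched the same way. The helper name `rse` and the data are assumptions made for illustration; the function follows the formula above, so it assumes a model with $p$ predictors plus an intercept.

```python
import math

def rse(y, y_hat, p):
    """Residual standard error for a model with p predictors and an intercept."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    # Divide by the degrees of freedom (n - p - 1) rather than by n
    return math.sqrt(rss / (n - p - 1))

# Made-up observed values and predictions, for illustration only
y = [3.0, 5.0, 7.0, 9.0, 11.0]
y_hat = [2.5, 5.5, 7.0, 10.0, 11.0]

value = rse(y, y_hat, p=1)
print(f"RSE = {value:.4f}")
```

For large datasets such as the housing data, $n$ is much larger than $p$, so RMSE and RSE are nearly identical.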


### Coefficient of Determination (R²)

- Another useful metric is the coefficient of determination, also called the R-squared statistic or $R^2$.
  - It ranges from 0 to 1 and measures the proportion of variation in the data that is accounted for in the model. The formula for $R^2$ is:

$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$

- The denominator is proportional to the variance of $Y$. 
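
The formula translates directly into code. In this sketch the helper name `r_squared` and the data are illustrative assumptions, not part of the housing analysis.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    # Residual sum of squares: variation left unexplained by the model
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    # Total sum of squares: variation of y around its mean
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss

# Made-up observed values and predictions, for illustration only
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 10.0]

r2 = r_squared(y, y_hat)
print(f"R^2 = {r2:.3f}")
```

Perfect predictions give $R^2 = 1$, while always predicting the mean $\bar{y}$ gives $R^2 = 0$.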

### t-Statistic and p-Value

- The t-statistic (and its mirror image, the p-value) measures the extent to which a coefficient is "statistically significant."

- Note: Data scientists do not generally get too involved with the interpretation of these statistics, nor with the issue of statistical significance.
  - They primarily focus on the t-statistic as a useful guide for whether to include a predictor in a model or not.
    - High t-statistics (which go with p-values near 0) indicate a predictor should be retained in a model.
    - Very low t-statistics indicate a predictor could be dropped.
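
To make the definition concrete, the sketch below computes t-statistics by hand with NumPy as each coefficient divided by its standard error. The data are synthetic and the variable names are assumptions: `x_signal` truly drives `y`, while `x_noise` is unrelated to it. The standard errors use the usual OLS formula, $\hat{\sigma}^2 (X^TX)^{-1}$, under the classical (nonrobust) assumptions.

```python
import numpy as np

# Synthetic data: y depends on x_signal but not on x_noise (illustrative only)
rng = np.random.default_rng(1)
n = 500
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 2.0 + 3.0 * x_signal + rng.normal(size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x_signal, x_noise])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimate the residual variance using the degrees of freedom
residuals = y - X @ coef
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof

# Standard errors from the diagonal of sigma^2 * (X'X)^-1
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic: coefficient divided by its standard error
t_stats = coef / std_err

for name, t in zip(["intercept", "x_signal", "x_noise"], t_stats):
    print(f"{name:10s} t = {t:8.2f}")
```

The t-statistic for `x_signal` should be far larger in magnitude than the one for `x_noise`, matching the guidance above about which predictors to retain.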

In [36]:
import statsmodels.api as sm
model = sm.OLS(df[outcome], df[predictors])
results = model.fit()
results.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | price | R-squared (uncentered): | 0.805 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.805 |
| Method: | Least Squares | F-statistic: | 17840.0 |
| Date: | Wed, 27 Mar 2024 | Prob (F-statistic): | 0.0 |
| Time: | 11:23:18 | Log-Likelihood: | -302160.0 |
| No. Observations: | 21597 | AIC: | 604300.0 |
| Df Residuals: | 21592 | BIC: | 604400.0 |
| Df Model: | 5 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| sqft_living15 | 187.6763 | 4.018 | 46.712 | 0.000 | 179.801 | 195.551 |
| sqft_lot15 | -0.3110 | 0.073 | -4.245 | 0.000 | -0.455 | -0.167 |
| bathrooms | 1.228e+05 | 3527.657 | 34.806 | 0.000 | 1.16e+05 | 1.3e+05 |
| bedrooms | -4.599e+04 | 2333.164 | -19.712 | 0.000 | -5.06e+04 | -4.14e+04 |
| grade | 1.059e+04 | 1413.114 | 7.497 | 0.000 | 7824.622 | 1.34e+04 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 20449.378 | Durbin-Watson: | 1.972 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1907303.443 |
| Skew: | 4.326 | Prob(JB): | 0.0 |
| Kurtosis: | 48.218 | Cond. No. | 55900.0 |
