### 168 - Multiple linear regression

A multiple linear regression tries to find a linear relationship between several input variables (predictors) and an output variable (outcome) by fitting the line (with one predictor) or hyperplane (with several predictors) that best matches the data in the least-squares sense.
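
Written out, a model with p predictors can be sketched as follows. The symbols (b_0 for the intercept, b_1 to b_p for the coefficients, e for the error term) are notation assumed here for illustration; they do not appear in the notebook. Ordinary least squares chooses the estimates that minimize the residual sum of squares (RSS):

```latex
Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + e

\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i1} + \dots + \hat{b}_p x_{ip},
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```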

In [3]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

from pygam import LinearGAM, s, l
from pygam.datasets import wage


import seaborn as sns
import matplotlib.pyplot as plt

In [13]:
# We define a subset of columns that we will consider from the csv file
subset = ['AdjSalePrice', 'SqFtTotLiving', 'SqFtLot', 'Bathrooms', 
          'Bedrooms', 'BldgGrade']



house = pd.read_csv('house_sales.csv', sep='\t')
print(house[subset].head())


   AdjSalePrice  SqFtTotLiving  SqFtLot  Bathrooms  Bedrooms  BldgGrade
1      300805.0           2400     9373       3.00         6          7
2     1076162.0           3764    20156       3.75         4         10
3      761805.0           2060    26036       1.75         4          8
4      442065.0           3200     8618       3.75         5          7
5      297065.0           1720     8620       1.75         4          7


In [23]:
# Defining the variables:

    # We define which features of the houses will be used to predict the price:
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 
              'Bedrooms', 'BldgGrade']

    # We set the target variable (the adjusted sale price of houses, which we're trying to predict):
outcome = 'AdjSalePrice'


# Regression:

    # A LinearRegression object is instantiated from scikit-learn:
    # We're creating an instance of the LinearRegression class. At this point
    # it's like an empty container waiting to learn from data. 
    # It's like a brand new calculator that hasn't performed any calculations yet.

house_lm = LinearRegression()


    # The fit() method finds the optimal coefficients using the training data. 
    # After we call the fit() method, the coefficients get created and stored inside the object as "attributes"
    #     - "coef_" stores the coefficients (the slopes)
    #     - "intercept_" stores the intercept (the y-intercept)

    # The fit method takes your input data (X) and target values (y)
    # It performs the mathematical calculations to find the optimal coefficients using ordinary least squares
    # Once it finds these values, it stores them as attributes inside the object
    # These stored values can then be used later for predictions or analysis

    #  This pattern - creating an empty object and then filling it with learned parameters through a fit method - 
    # is common in scikit-learn. It's called the "estimator API" pattern.

house_lm.fit(house[predictors], house[outcome])


    # We print the results:
    
print(f'Intercept: {house_lm.intercept_:.3f}')
print('Coefficients:')

    # Zip: The core purpose here is to pair up two related pieces of information: 
    # the predictor names and their corresponding coefficients.
    # Like a zipper on a jacket, it takes two separate rows of teeth and joins them together, pair by pair. 
    # It takes multiple iterables (like lists) and creates tuples of items, matching them by position.
    # Without zip, we'd have to access these values using indices, which is more cumbersome
    # and more prone to errors (we risk mixing up indices).
    # Using zip keeps the relationship between each predictor and its coefficient explicit, 
    # and the code stays maintainable: if we add or remove predictors, 
    # it still works as long as the lists match in length.

for name, coef in zip(predictors, house_lm.coef_):
    print(f' {name}: {coef}')

Intercept: -521871.368
Coefficients:
 SqFtTotLiving: 228.83060360240796
 SqFtLot: -0.06046682065307607
 Bathrooms: -19442.840398321056
 Bedrooms: -47769.95518521438
 BldgGrade: 106106.96307898083


In [21]:
house_lm.coef_

array([ 2.28830604e+02, -6.04668207e-02, -1.94428404e+04, -4.77699552e+04,
        1.06106963e+05])
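
To make the coefficients concrete, the sketch below applies them by hand to the first house shown in the head() output. The names first_house and predicted_price are made up for this illustration, and the intercept is the rounded value printed above, so the result may differ from house_lm.predict() in the last decimals.

```python
# Intercept and coefficients copied from the printed output above.
intercept = -521871.368
coefficients = {
    'SqFtTotLiving': 228.83060360240796,
    'SqFtLot': -0.06046682065307607,
    'Bathrooms': -19442.840398321056,
    'Bedrooms': -47769.95518521438,
    'BldgGrade': 106106.96307898083,
}

# Feature values of the first house listed by house[subset].head().
# (first_house is a hypothetical name used only in this sketch.)
first_house = {
    'SqFtTotLiving': 2400,
    'SqFtLot': 9373,
    'Bathrooms': 3.00,
    'Bedrooms': 6,
    'BldgGrade': 7,
}

# A prediction is the intercept plus the sum of coefficient * feature value.
predicted_price = intercept + sum(
    coefficients[name] * first_house[name] for name in coefficients
)
print(f'Predicted AdjSalePrice: {predicted_price:.0f}')
```

The result is roughly 424,556, while the recorded AdjSalePrice for that house is 300,805, so the model overshoots this particular house by about 124,000.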

## Assessing the Model

Scikit-learn provides a number of metrics to determine the quality of a model. Here we use "mean_squared_error" (to compute the RMSE) and "r2_score".

In [36]:
# SCIKIT-LEARN:
# This evaluates how well our linear regression model performs

    # After we trained our model with fit(), we can use it to make predictions. 
    # The predict() method takes our predictor variables and applies the coefficients we learned to generate predicted house prices. 
    # These predictions are called "fitted values" because they represent the prices that "fit" our model. 
    # Think of it like drawing a line through a scatter plot - the fitted values are the points that lie exactly on that line.

fitted = house_lm.predict(house[predictors])


    # RMSE (Root Mean Squared Error) summarizes the typical size of our prediction errors, in the same units as the house prices. 
    # Here's how it works:
    #    - Take each actual price and subtract the prediction (house[outcome] - fitted)
    #    - Square these differences (to make negatives positive and emphasize big errors)
    #    - Calculate the mean of these squared differences (mean_squared_error)
    #    - Take the square root (np.sqrt) to get back to the original units (dollars)
    # "mean_squared_error" arguments: Ground truth (correct) target values, Estimated target values.

RMSE = np.sqrt(mean_squared_error(house[outcome], fitted))


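    # R² (R-squared) tells us what proportion of the variation in house prices our model explains.
    # For a least-squares fit with an intercept, scored on the data it was fitted to, it lies between 0 and 1;
    # r2_score can return negative values for a model that does worse than always predicting the mean.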
    # "r2_score" arguments: (ground truth (correct) target values, Estimated target values)

r2 = r2_score(house[outcome], fitted)


    # We print the values:
print(f'RMSE: {RMSE:.0f}')
print(f'r2: {r2:.4f}')

RMSE: 261220
r2: 0.5406
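
The two metrics can also be computed by hand, which shows what mean_squared_error and r2_score measure. This is a minimal sketch on toy numbers invented for the illustration (they are not taken from house_sales.csv), using only NumPy.

```python
import numpy as np

# Toy values, invented for illustration (not from house_sales.csv).
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 6.5, 9.5])

# Residuals: actual minus predicted.
residuals = actual - predicted

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus the ratio of the residual sum of squares
# to the total sum of squares around the mean of the actual values.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'RMSE: {rmse:.3f}')
print(f'r2: {r2:.3f}')
```

Here every prediction is off by 0.5, so the RMSE is 0.5, and the residual sum of squares (1) is small compared with the total sum of squares (20), giving an R² of 0.95.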




While scikit-learn provides a variety of different metrics, statsmodels provides a more in-depth analysis of the linear regression model. This package has two different ways of specifying the model, one that is similar to scikit-learn and one that allows specifying R-style formulas. 

Here we use the first approach. As statsmodels doesn't add an intercept automatically, we need to add a constant column with value 1 to the predictors. We can use the pandas method assign for this.
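
Why the constant column matters can be sketched with plain NumPy least squares instead of statsmodels. The toy data, the known intercept of 10, and the names X_const, coef_with and coef_without are all assumptions made for this illustration; np.column_stack plays the role of .assign(const=1).

```python
import numpy as np

# Noise-free toy data built from a known intercept (10) and slopes (2, -3).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 8.0]])
y = 10 + 2 * X[:, 0] - 3 * X[:, 1]

# Append a column of ones, the NumPy counterpart of .assign(const=1).
X_const = np.column_stack([X, np.ones(len(X))])

# Least squares with the constant: the last coefficient is the intercept.
coef_with, *_ = np.linalg.lstsq(X_const, y, rcond=None)

# Least squares without it: the fitted plane is forced through the origin,
# so the slopes are distorted to compensate for the missing intercept.
coef_without, *_ = np.linalg.lstsq(X, y, rcond=None)

print('with const:   ', np.round(coef_with, 3))
print('without const:', np.round(coef_without, 3))
```

With the constant column the fit recovers the slopes 2 and -3 and the intercept 10; without it the slopes no longer match. statsmodels also provides sm.add_constant as an alternative to assign, which by default puts the constant column first rather than last.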



In [47]:
# STATSMODELS:
# This evaluates how well our linear regression model performs (with statsmodels this time)

    # The sm.OLS() function creates an Ordinary Least Squares (OLS) regression model. 
    # Unlike scikit-learn, statsmodels doesn't automatically include an intercept term, 
    # which is why we use .assign(const=1) to add a column of 1's to our predictors. 
    # This constant term allows the model to fit an intercept.

    # With the following line, we are creating an object, specifically an OLS (Ordinary Least Squares) object. 
    # However, this line is doing more than just creating an empty container,
    # it's actually setting up all the data and mathematical structures needed for the regression.

    # With scikit-learn's LinearRegression(), we just create an empty object.
    # With statsmodels' OLS(), the constructor immediately prepares several key mathematical components:
    #    - The design matrix (X): the constructor takes our predictor variables and transforms them into 
    #      a matrix format suitable for linear algebra operations. 
    #      Each column represents a predictor variable, and each row represents an observation.
    #    - The response vector (y): our outcome variable (house prices), formatted as a vector that will be used 
    #      in the matrix calculations.
    #    - Matrix properties: the constructor calculates and stores important properties of 
    #      these matrices that will be needed later.
    # The constructor also sets up the mathematical framework for what will happen during fitting.
    # While the OLS object doesn't perform these calculations yet (that happens in fit()), it has everything organized and ready to go. 
    # It's like having all your ingredients measured and your pans ready before you start cooking.

    # This preparation is different from scikit-learn's approach, which is why we need to explicitly add the constant term. 
    # Statsmodels is designed to give us more control and visibility into the statistical processes, 
    # while scikit-learn prioritizes ease of use.

model = sm.OLS(house[outcome], house[predictors].assign(const=1))


# We fit the model:
    # This line performs the actual regression calculations, similar to what we did with scikit-learn. 
    # However, statsmodels stores much more detailed statistical information about the regression.

results = model.fit()


    # Prints a comprehensive statistical report
print(results.summary())

    # This summary provides several important pieces of information:
    # Overall model statistics:
    #    - R-squared and Adjusted R-squared tell us how well our model fits
    #      (The adjusted R-squared, which adjusts for the degrees of freedom, 
    #      effectively penalizing the addition of more predictors to a model)
    #    - F-statistic tests if our model is better than just using the mean
    #    - AIC and BIC help us compare different possible models

    # For each predictor:
    #    - Coefficient (coef): The estimated effect on house price of a one-unit increase in the predictor, holding the other predictors fixed
    #    - Standard error (std err): The uncertainty in our coefficient estimate
    #    - t-statistic (t): The coefficient divided by its standard error; tests if the coefficient is significantly different from zero
    #    - P-value (P>|t|): The probability of seeing a t-statistic at least this extreme if the true coefficient were zero
    #    - Confidence intervals [0.025 0.975]: The range where we believe the true coefficient lies

    # The t-statistic, and its mirror image the p-value, measures the extent to which a coefficient is 
    # "statistically significant", that is, outside the range of what a random chance arrangement of predictor 
    # and target variable might produce. 
    # The higher the absolute value of the t-statistic (and the lower the p-value), the more significant the predictor. 
    
    # [0.025, 0.975]: These numbers represent the lower bound (2.5th percentile) and upper bound (97.5th percentile) 
    # of the confidence interval. This corresponds to the 95% confidence level (100% - 2.5% - 2.5%).
    # Interpretation:
    # If the confidence interval for a coefficient does not include 0, it indicates that the coefficient 
    # is statistically significant at the 95% confidence level.
    
    # (For example, if the confidence interval for a variable is [1.5, 3.2], it means the true coefficient 
    # is likely between 1.5 and 3.2, with 95% confidence)

    # DATA SCIENCE NOTE:
    # p-value (P>|t|) and F-statistic: Data scientists do not generally get too involved with the interpretation of these statistics, 
    # nor with the issue of statistical significance. 
    # Data scientists primarily focus on the t-statistic as a useful guide for whether to include a predictor in a model or not. 
    # High t-statistics (which go with p-values near 0) indicate a predictor should be retained in a model, while very low t-statistics indicate a predictor could be dropped.

    

                            OLS Regression Results                            
Dep. Variable:           AdjSalePrice   R-squared:                       0.541
Model:                            OLS   Adj. R-squared:                  0.540
Method:                 Least Squares   F-statistic:                     5338.
Date:                Tue, 14 Jan 2025   Prob (F-statistic):               0.00
Time:                        13:06:24   Log-Likelihood:            -3.1517e+05
No. Observations:               22687   AIC:                         6.304e+05
Df Residuals:                   22681   BIC:                         6.304e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
SqFtTotLiving   228.8306      3.899     58.694