# Ames Housing - GAM (Generalized Additive Model)
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load required packages. Set up workspace, e.g., set theme for plotting and initialize the random number generator.

In [None]:
# Install packages that are not already installed on Colab
#!pip install pygam

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import pygam
from pygam import LinearGAM, s

In [None]:
# check numpy version
print(np.__version__)

In [None]:
pygam.__version__

In [None]:
plt.style.use('fivethirtyeight')

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## Load data

Load training data from CSV file.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/ames_housing/data/train.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.columns

## Prepare data

Let us first focus on a few easy-to-understand variables.

In [None]:
data = data[["SalePrice", "LotArea", "GrLivArea", "FullBath", "BedroomAbvGr", "KitchenAbvGr", "OverallQual", "OverallCond"]]

In [None]:
data.head()

Finally, we will split the data into features (*X*) and labels (*y*) and into training (*X_train, y_train*) and test (*X_test, y_test*) sets.

In [None]:
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Fit a Generalized Additive Model

In [None]:
X_train.head()

Unfortunately, the API of pyGAM is not very user-friendly. The following code fits a GAM with two smooth spline terms (cubic splines with a penalty on the 2nd derivative to enforce smoothness). Features have to be specified by their column index rather than by name: s(0, spline_order=3) refers to the first column of the feature matrix (LotArea), and s(1, spline_order=3) to the second column (GrLivArea).

In [None]:
gam_mod = LinearGAM(s(0, spline_order=3) + s(1, spline_order=3))
gam_mod.fit(X_train, y_train)
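
Because terms are addressed by position, it can help to look up the indices from the column names instead of hard-coding them. The following is a minimal sketch using plain pandas; `X_demo` is a hypothetical one-row stand-in for *X_train* with the same column order, and its values are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for X_train: same column order, placeholder values
feature_names = ["LotArea", "GrLivArea", "FullBath", "BedroomAbvGr",
                 "KitchenAbvGr", "OverallQual", "OverallCond"]
X_demo = pd.DataFrame([[1, 1, 1, 1, 1, 1, 1]], columns=feature_names)

# Look up positional indices by name instead of hard-coding 0 and 1
lot_idx = X_demo.columns.get_loc("LotArea")
liv_idx = X_demo.columns.get_loc("GrLivArea")
print(lot_idx, liv_idx)
```

The resulting integers could then be passed to the spline terms, e.g. s(lot_idx, spline_order=3) + s(liv_idx, spline_order=3), so that the model definition does not silently break if the column order changes.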

Next, we create partial dependence plots, including 95% confidence intervals, for all terms of the model.

In [None]:
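for i, term in enumerate(gam_mod.terms):
    # Skip the intercept, which has no feature to plot against
    if term.isintercept:
        continue

    # Grid of feature values for term i, and its partial dependence with 95% intervals
    XX = gam_mod.generate_X_grid(term=i)
    pdep, confi = gam_mod.partial_dependence(term=i, X=XX, width=0.95)

    plt.figure()
    plt.plot(XX[:, term.feature], pdep)                   # estimated effect
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')  # interval bounds
    plt.title(repr(term))
    plt.show()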

If you seriously want to use GAMs, I recommend using the R package *mgcv*, which is much more powerful and user-friendly.