## Machine Learning Exercise 3: Bias and Variance

**Bias** refers to the error introduced by approximating a complex real-world problem with a simplified model, while **variance** refers to the model's sensitivity to fluctuations in the training data. A linear regression model has high bias and low variance; it makes strong assumptions about the data (linearity) but is stable across different datasets. If these strong assumptions are not correct, there will be places where it systematically overestimates or underestimates. In contrast, a decision tree model has low bias and high variance; it can capture complex patterns but is prone to overfitting, especially if deep and unpruned. This means that it can start to memorize the training data rather than capturing patterns that generalize.
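The two error sources described above can be estimated empirically. The following is a minimal sketch, not part of the original exercise: it assumes a synthetic one-dimensional problem (the function `true_fn`, the noise level, and the sample sizes are all illustrative choices), refits each model on many freshly drawn training sets, and measures how far the average prediction is from the truth (squared bias) and how much the predictions move between training sets (variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)


def true_fn(x):
    # Hypothetical nonlinear ground truth, chosen only for illustration.
    return np.sin(2 * x)


# Fixed evaluation points shared by every refit.
x_grid = np.linspace(0, 3, 50).reshape(-1, 1)


def bias_variance(make_model, n_rounds=200, n_samples=40, noise=0.3):
    """Estimate squared bias and variance by refitting on fresh training sets."""
    preds = np.empty((n_rounds, len(x_grid)))
    for i in range(n_rounds):
        x = rng.uniform(0, 3, size=(n_samples, 1))
        y = true_fn(x).ravel() + rng.normal(0, noise, size=n_samples)
        preds[i] = make_model().fit(x, y).predict(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_grid).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


lin_bias, lin_var = bias_variance(LinearRegression)
tree_bias, tree_var = bias_variance(DecisionTreeRegressor)

print(f"linear regression: bias^2 = {lin_bias:.3f}, variance = {lin_var:.3f}")
print(f"decision tree:     bias^2 = {tree_bias:.3f}, variance = {tree_var:.3f}")
```

On this synthetic setup the linear model should show the larger squared bias and the unpruned tree the larger variance, which is the pattern the paragraph above describes.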

### 1. Fit a linear regression model to the housing data, using sqft_living to predict price. Check the mean squared error on the training data and the test data.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

In [4]:
house_sales_df = pd.read_csv("data/kc_house_data.csv")
house_sales_df

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,...,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,...,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,...,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,...,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


In [43]:
X = house_sales_df[["sqft_living"]]
y = house_sales_df["price"]

# Split once and fit a single model; both errors are measured on this model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
reg = LinearRegression()
reg.fit(X_train, y_train)

# mean_squared_error returns the MSE (in squared dollars), not the RMSE.
y_pred_test = reg.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
print("mse testing data {}:".format(mse_test))

y_pred_train = reg.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
print("mse training data {}:".format(mse_train))

mse testing data 74509993356.49603:
mse training data 65713942542.233665:
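The two values printed above come from `mean_squared_error`, so they are in squared dollars and hard to read directly. As a small sketch, taking the square root converts them to a root mean squared error in dollars. The two constants below are copied from the output above; everything else is illustrative.

```python
import numpy as np

# Values copied from the linear regression output above (mean squared errors).
mse_test = 74509993356.49603
mse_train = 65713942542.233665

# The square root puts the error back on the same scale as price (dollars).
rmse_test = np.sqrt(mse_test)
rmse_train = np.sqrt(mse_train)

print(f"RMSE (test):  ${rmse_test:,.0f}")
print(f"RMSE (train): ${rmse_train:,.0f}")
```

Both figures are on the order of a few hundred thousand dollars, and the gap between them is small relative to their size, which is consistent with a stable, low-variance model.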


### 2. Repeat this but with a [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). Again check the mean squared error on the training data and the test data. How does what you see differ from the linear regression model?

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
tree_test = DecisionTreeRegressor(random_state = 42)
tree_test.fit(X_train, y_train)
# Predict with the fitted tree_test model and score it on the held-out data.
y_pred_test = tree_test.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
print("mse testing data {}:".format(mse_test))

mse testing data 79044952597.92511:


In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
tree_train = DecisionTreeRegressor(random_state = 42)
tree_train.fit(X_train, y_train)
# Score the same tree on the data it was trained on.
y_pred_train = tree_train.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
print("mse training data {}:".format(mse_train))

mse training data 48410745078.45624:


### One way of avoiding overfitting is by restricting the flexibility of the model. We can do this with a decision tree by restricting the number of splits that it can perform. 

### 3. Fit a DecisionTreeRegressor where you restrict the max_depth to 5. Again check the mean squared error on the training data and the test data. What do you notice now?

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
tree_test = DecisionTreeRegressor(random_state = 42, max_depth = 5)
tree_test.fit(X_train, y_train)
# Score the depth-limited tree on the held-out data.
y_pred_test = tree_test.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
print("mse testing data {}:".format(mse_test))

mse testing data 73264250436.88351:


In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)
tree_train = DecisionTreeRegressor(random_state = 42, max_depth = 5)
tree_train.fit(X_train, y_train)
# Score the depth-limited tree on its own training data.
y_pred_train = tree_train.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
print("mse training data {}:".format(mse_train))

mse training data 56030892751.02288:
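Comparing a single restricted depth against the unrestricted tree shows only two points on a curve. The sketch below sweeps several `max_depth` values to show the general shape. It uses synthetic data rather than the housing data so that it runs on its own; the data-generating function, the noise level, and the list of depths are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a single-feature regression problem (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(2000, 1))
y = 3 * X.ravel() + 5 * np.sin(X.ravel()) + rng.normal(0, 4, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Map each depth to (training MSE, test MSE); None means no depth limit.
results = {}
for depth in [1, 2, 3, 5, 8, 12, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (
        mean_squared_error(y_train, tree.predict(X_train)),
        mean_squared_error(y_test, tree.predict(X_test)),
    )
    train_mse, test_mse = results[depth]
    print(f"max_depth={str(depth):>4}  train MSE={train_mse:8.2f}  test MSE={test_mse:8.2f}")
```

On data like this, the training error keeps falling as the tree gets deeper, while the test error typically falls at first and then rises again once the tree starts fitting noise.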


### When working with machine learning models, we often have to balance bias and variance. This is called the [bias-variance tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). One way to manage this tradeoff is [regularization](https://www.ibm.com/think/topics/regularization), where we restrict the complexity of the model, adding some bias but reducing the variance, which can lead to a lower mean squared error on the test set.

### Lasso and ridge regression do this by adding a penalty term based on the size of the coefficients. Smaller coefficients mean that the model has less flexibility. The neat thing about these types of models is that they determine how to allocate the coefficients automatically as part of the model fitting process, so we can start with a large set of potential predictors and allow the model fitting to determine which ones to focus on.
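The effect of the penalty on the coefficients can be seen in a small experiment. This is a sketch on synthetic data, not on the housing data: it assumes 20 predictors of which only the first three matter, and the `alpha` values are arbitrary illustrative choices rather than tuned settings. The predictors are standardized first because the penalty depends on the scale of each coefficient.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 predictors, only the first three affect the target.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [5.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(0, 1.0, size=n_samples)

models = {
    "ols": LinearRegression(),   # no penalty
    "ridge": Ridge(alpha=10.0),  # penalty on squared coefficient size
    "lasso": Lasso(alpha=0.5),   # penalty on absolute coefficient size
}

coefs = {}
for name, model in models.items():
    # Standardize so the penalty treats every predictor on the same scale.
    pipe = make_pipeline(StandardScaler(), model).fit(X, y)
    coefs[name] = pipe[-1].coef_
    n_zero = int(np.sum(coefs[name] == 0))
    norm = np.linalg.norm(coefs[name])
    print(f"{name:>5}: coefficient norm = {norm:.2f}, exact zeros = {n_zero}")
```

Ridge shrinks all coefficients toward zero without usually eliminating any, while lasso can set coefficients exactly to zero, which is what lets it act as a form of automatic predictor selection.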

### For the next part of the exercise, we'll see how we can add complexity to our model but control the complexity through regularization.

### 4. So far, we've only been predicting based on the square footage of living space. Fit a new linear regression model using all variables besides id, date, price, and zipcode. How well does this model perform on the test set compared to the model with just square footage of living space?