## Supervised Machine Learning

So far, we have used KNN as our model to predict California housing prices. However, other models are worth exploring. Today, we will experiment with simple Linear Regression and Decision Trees to see how well they explain our target variable. In machine learning, we typically choose a model based on the relationship between the features and the target variable, or empirically, by selecting the model that scores best on held-out data.

Yesterday, we applied some feature engineering techniques, and our model's performance did improve. Now, let's see how Linear Regression and a Decision Tree perform when we apply the same techniques.

#### Loading and preparing the data

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
california = fetch_california_housing()
print(california["DESCR"])

In [None]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

#### Normalization & Feature Selection

As we did in the Feature Engineering lesson, we are going to normalize our data and select a subset of columns as our features.

#### Train Test Split

In [4]:
features = df_cali.drop(columns = ["median_house_value","AveOccup", "Population", "AveBedrms"])
target = df_cali["median_house_value"]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)
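To make the split concrete, here is a hand-rolled sketch of what a shuffled hold-out split does conceptually. The helper name `holdout_split` is hypothetical and this is not scikit-learn's implementation; `train_test_split` remains the tool to use in practice.

```python
import math
import random


def holdout_split(rows, test_size=0.20, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # fixed seed -> reproducible split
    n_test = math.ceil(len(rows) * test_size)   # number of held-out rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))                          # toy stand-in for the dataset rows
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))          # 8 2
```

Every row ends up in exactly one of the two sets, which is what lets the test set act as unseen data later on.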

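Create an instance of the normalizer and fit it on the training data only, so that no information from the test set leaks into the scaling.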

In [None]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [7]:
X_train_norm = normalizer.transform(X_train)

X_test_norm = normalizer.transform(X_test)

In [None]:
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_train_norm.head()

In [None]:
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)
X_test_norm.head()
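As a rough illustration of the scaling step, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to a single toy column. The function names and the values are made up for illustration and are not taken from the dataset or from scikit-learn's internals.

```python
def fit_min_max(column):
    """Learn the minimum and maximum from the training values only."""
    return min(column), max(column)


def transform_min_max(column, col_min, col_max):
    """Apply (x - min) / (max - min) using previously learned statistics."""
    return [(x - col_min) / (col_max - col_min) for x in column]


train_income = [2.0, 4.0, 6.0]   # toy values, not from the dataset
test_income = [3.0, 8.0]

col_min, col_max = fit_min_max(train_income)
train_scaled = transform_min_max(train_income, col_min, col_max)
test_scaled = transform_min_max(test_income, col_min, col_max)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.5]
```

Note that a test value can land outside the 0 to 1 range, because the minimum and maximum come from the training data alone.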

## Linear Regression

Let's create an instance of the Linear Regression model.

In [10]:
lin_reg = LinearRegression()

Training Linear Regression with our normalized data

In [None]:
lin_reg.fit(X_train_norm, y_train)

Evaluate the model's performance

In [None]:
pred = lin_reg.predict(X_test_norm)

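# scikit-learn metrics expect (y_true, y_pred) in that order
print("MAE", mean_absolute_error(y_test, pred))
# Take the square root of the MSE directly; the `squared` argument was deprecated in scikit-learn 1.4
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", lin_reg.score(X_test_norm, y_test))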

Linear Regression yields a worse score than our previous model, KNN.
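To see what the three metrics printed above measure, here is a minimal sketch that computes them by hand on toy numbers. The helper functions and the values are illustrative assumptions; in the notebook itself the scikit-learn functions do this work.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """R2: share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]   # toy targets
y_pred = [1.5, 2.0, 2.5, 4.0]   # toy predictions

print("MAE", mae(y_true, y_pred))
print("RMSE", rmse(y_true, y_pred))
print("R2 score", r2(y_true, y_pred))
```

MAE and RMSE are in the same units as the target, so lower is better, while R2 is unitless and higher is better.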

In Linear Regression, we often assess feature importance by examining the coefficients in the model. These coefficients indicate the impact of each feature on the model's predictions.

- Determine the coefficients (β) in the linear regression equation corresponding to each feature.
- The magnitude of these coefficients reflects the relative importance of the features. **Greater absolute values suggest more substantial impacts.**

In [None]:
lin_reg_coef = {feature : coef for feature, coef in zip(X_train_norm.columns, lin_reg.coef_)}
lin_reg_coef

We can conclude that **Median Income** (the MedInc column) has the highest impact on our model.
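One caveat when reading coefficients as importance: their magnitude depends on the scale of each feature, which is why comparing them makes more sense on normalized data. The sketch below fits a one-feature least-squares slope by hand on toy values (the `ols_slope` helper and the numbers are assumptions for illustration) and shows how rescaling the feature changes the coefficient without changing the relationship.

```python
def ols_slope(x, y):
    """Slope of a one-feature least-squares fit: cov(x, y) / var(x)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


rooms = [2.0, 4.0, 6.0, 8.0]   # toy feature in raw units
price = [1.0, 2.0, 3.0, 4.0]   # toy target

raw_slope = ols_slope(rooms, price)

# Min-max scale the same feature to the 0-1 range
low, high = min(rooms), max(rooms)
rooms_scaled = [(r - low) / (high - low) for r in rooms]
scaled_slope = ols_slope(rooms_scaled, price)

print(raw_slope)      # 0.5
print(scaled_slope)   # approximately 3.0, i.e. raw_slope * (high - low)
```

The fit is identical in both cases; only the unit of the feature changed, and the coefficient changed with it.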

## Decision Tree

So far, between KNN and Linear Regression, KNN yields the better score. Let's see how a Decision Tree performs.

- Initialize a Decision Tree instance

- Set max_depth to 10, which limits the tree to 10 levels of splits: every path from the root to a leaf passes through at most 10 decisions

In [14]:
tree = DecisionTreeRegressor(max_depth=10)
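To get a feel for what a depth limit allows, the short sketch below computes the upper bounds on the size of a binary tree for a given depth. The helper names are hypothetical, and these are only ceilings: a fitted tree can be smaller if splitting stops earlier for other reasons.

```python
def max_leaves(max_depth):
    """Upper bound on leaves: each level can double the number of nodes."""
    return 2 ** max_depth


def max_nodes(max_depth):
    """Upper bound on total nodes in a full binary tree of this depth."""
    return 2 ** (max_depth + 1) - 1


for depth in (2, 10):
    print(depth, max_leaves(depth), max_nodes(depth))
# 2 4 7
# 10 1024 2047
```

A depth of 10 therefore permits up to 1024 distinct predicted values, one per leaf, which is far more than 10 splits in total.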

- Training the model

In [None]:
tree.fit(X_train_norm, y_train)

- Evaluate the model

In [None]:
X_train_norm

In [None]:
y_test

In [None]:
pred = tree.predict(X_test_norm)

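# scikit-learn metrics expect (y_true, y_pred) in that order
print("MAE", mean_absolute_error(y_test, pred))
# Take the square root of the MSE directly; the `squared` argument was deprecated in scikit-learn 1.4
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", tree.score(X_test_norm, y_test))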

As we did with Linear Regression, we can check which features are the most relevant.

In [None]:
tree_importance = {feature : importance for feature, importance in zip(X_train_norm.columns, tree.feature_importances_)}
tree_importance
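For intuition about where these importance values come from: in scikit-learn they are based on how much each feature's splits reduce the impurity of the target (its variance, for a regression tree with the default criterion), normalized to sum to 1. The sketch below is a simplified, single-split illustration on toy data with hypothetical helper names; a real tree searches for the best threshold and accumulates the reductions over all of its nodes.

```python
def variance(values):
    """Population variance; an empty group contributes no impurity."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(feature, target, threshold):
    """Variance of the parent minus the size-weighted variance of the children."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    n = len(target)
    weighted_children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(target) - weighted_children


target = [1.0, 1.0, 5.0, 5.0]        # toy target
informative = [0.1, 0.2, 0.8, 0.9]   # separates the two target groups
noise = [0.5, 0.1, 0.4, 0.2]         # unrelated to the target

decreases = {
    "informative": impurity_decrease(informative, target, threshold=0.5),
    "noise": impurity_decrease(noise, target, threshold=0.3),
}

# Normalize so the importances sum to 1
total = sum(decreases.values())
importances = {name: value / total for name, value in decreases.items()}
print(importances)  # {'informative': 1.0, 'noise': 0.0}
```

A feature whose splits leave the children as mixed as the parent earns no importance, however often it could be split on.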

In [None]:
from sklearn.tree import export_text

tree_viz = export_text(tree, feature_names=list(X_train_norm.columns))
print(tree_viz)


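That is a bit overwhelming to read, so let's use the graphviz library instead.

**Note**: you will need to install graphviz: pip install graphviz. The Python package is only a wrapper, so the Graphviz system binaries must also be installed for the diagram to render.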

- We will train a decision tree, in this case with max_depth=2 to better see the diagram

In [None]:
from sklearn.tree import DecisionTreeRegressor, export_graphviz
import graphviz

# A shallow tree keeps the diagram readable
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X_train_norm, y_train)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_data = export_graphviz(tree, out_file=None, filled=True, rounded=True, feature_names=list(X_train_norm.columns))

# The Source object renders inline as the last expression of a notebook cell
graphviz.Source(dot_data)