# Machine Learning - Decision Trees and Ensemble Learning

## Decision Trees
Decision trees are versatile machine learning models for classification and regression tasks. They make predictions by following a flowchart-like sequence of tests on the input features, much like a chain of human yes/no decisions.
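To make the flowchart idea concrete, here is a minimal sketch of a tree written by hand as nested conditions. The function name, thresholds, and leaf values are all invented for illustration; a real tree learns them from the training data.

```python
# Hypothetical hand-written "tree": thresholds and leaf values are invented
# for illustration, not learned from the housing data.
def predict_price(record):
    if record["ocean_proximity"] == "INLAND":
        # Inland branch: split once more on income
        if record["median_income"] <= 3.0:
            return 90_000.0
        return 150_000.0
    # Non-inland branch: split on a different income threshold
    if record["median_income"] <= 5.0:
        return 220_000.0
    return 350_000.0


example = {"ocean_proximity": "INLAND", "median_income": 2.5}
print(predict_price(example))
```

Each `if` corresponds to an internal node of the tree and each `return` to a leaf.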

## Goal

The goal of this homework is to create a regression model for predicting housing prices (column 'median_house_value').

### Dataset

In [None]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import numpy as np

### Prepare the dataset

For this homework, we use only a subset of the data: the same subset as in homework #2. Unlike homework #2, however, we use all columns of the dataset.

First, keep only the records where ocean_proximity is either '<1H OCEAN' or 'INLAND'.

**Preparation:**

- Fill missing values with zeros.
- Apply the log transform to median_house_value.
- Do train/validation/test split with 60%/20%/20% distribution.
- Use the train_test_split function and set the random_state parameter to 1.
- Use DictVectorizer(sparse=True) to turn the dataframes into matrices.
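The log transform step in the list above can be illustrated with the standard library alone. This is a small sketch, not part of the homework pipeline: `log1p` computes log(1 + x), and `expm1` undoes it, which is how predictions are mapped back to prices. The example price is just a sample value.

```python
import math

price = 431000.0                   # an example house value
log_price = math.log1p(price)      # log(1 + price), the transformed target
restored = math.expm1(log_price)   # exp(log_price) - 1, the inverse transform

print(log_price)
print(restored)
```

Because the transform is applied in place, applying it twice to the same column would compress the values a second time, so it should be applied exactly once.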

In [9]:
# load the data from housing.csv into a Pandas dataframe
df = pd.read_csv('housing.csv')

# keep only records where ocean_proximity is either '<1H OCEAN' or 'INLAND'
df = df[df['ocean_proximity'].isin(['<1H OCEAN', 'INLAND'])]

# fill in the missing values with zeros
# (reassigning avoids modifying a filtered slice in place)
df = df.fillna(0)

df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 701 | -121.97 | 37.64 | 32.0 | 1283.0 | 194.0 | 485.0 | 171.0 | 6.0574 | 431000.0 | <1H OCEAN |
| 830 | -121.99 | 37.61 | 9.0 | 3666.0 | 711.0 | 2341.0 | 703.0 | 4.6458 | 217000.0 | <1H OCEAN |
| 859 | -121.97 | 37.57 | 21.0 | 4342.0 | 783.0 | 2172.0 | 789.0 | 4.6146 | 247600.0 | <1H OCEAN |
| 860 | -121.96 | 37.58 | 15.0 | 3575.0 | 597.0 | 1777.0 | 559.0 | 5.7192 | 283500.0 | <1H OCEAN |
| 861 | -121.98 | 37.58 | 20.0 | 4126.0 | 1031.0 | 2079.0 | 975.0 | 3.6832 | 216900.0 | <1H OCEAN |


In [34]:
# apply the log transform to the target, median_house_value
# note: run this cell only once per data load; re-running it applies log1p again to already-transformed values
target = 'median_house_value'
df[target] = np.log1p(df[target])

# split the data into train/val/test sets with a 60%/20%/20% distribution, using seed 1
# test_size=0.2 splits the data into 80% full train and 20% test
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
# test_size=0.25 splits the 80% full train into 60% train and 20% val
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# reset the indexes of the dataframes
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# extract the target variable from the train/val/test sets
y_train = df_train[target].values
y_val = df_val[target].values
y_test = df_test[target].values

# drop the target column so that only the features remain
train_features = df_train.drop(target, axis=1)
val_features = df_val.drop(target, axis=1)
test_features = df_test.drop(target, axis=1)
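
The two-step split can be checked with plain arithmetic. The sketch below assumes that a float test_size is turned into a row count by rounding up, which is my understanding of how train_test_split behaves, and it assumes 15,687 filtered rows, the total implied by the shapes this notebook prints. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    # Assumption: the held-out part is rounded up and the rest goes to train,
    # mirroring how a float test_size is commonly resolved.
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 15687  # assumed row count after filtering (9411 + 3138 + 3138)

# First split: 80% full train, 20% test
n_full_train, n_test = split_sizes(n_total, 0.2)
# Second split: 25% of the remaining 80% is 20% of the whole
n_train, n_val = split_sizes(n_full_train, 0.25)

print(n_train, n_val, n_test)
```

Taking 25% of the 80% that remains is what yields the final 60%/20%/20% distribution.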

In [35]:
# Convert dataframes to dictionaries
train_dict = train_features.to_dict(orient='records')
val_dict = val_features.to_dict(orient='records')
test_dict = test_features.to_dict(orient='records')

# Initialize DictVectorizer
dv = DictVectorizer(sparse=True)

# Transform features using DictVectorizer
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)
X_test = dv.transform(test_dict)

# Print the shapes of transformed data
print('X_train shape:', X_train.shape)
print('X_val shape:', X_val.shape)
print('X_test shape:', X_test.shape)

X_train shape: (9411, 10)
X_val shape: (3138, 10)
X_test shape: (3138, 10)
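
The ten columns come from eight numeric features plus two indicator columns for ocean_proximity. The sketch below is a rough pure-Python imitation of that encoding, not scikit-learn's implementation: numeric values keep their key, string values become `key=value` indicator columns, and column names are sorted. The function `vectorize` and the second record are invented for illustration.

```python
def vectorize(records):
    # Collect column names: numeric values keep their key,
    # string values become "key=value" indicator columns.
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    # Fill one dense row per record, leaving absent indicators at 0.0
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows


records = [
    {"longitude": -121.97, "households": 171.0, "ocean_proximity": "<1H OCEAN"},
    {"longitude": -119.50, "households": 300.0, "ocean_proximity": "INLAND"},  # invented row
]
names, rows = vectorize(records)
print(names)
print(rows[0])
```

Note that the column order follows the sorted names rather than the order of the dataframe's columns, so a column index from the matrix should be mapped back through the vectorizer's own feature names (`dv.get_feature_names_out()` in recent scikit-learn versions).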


## Question 1 - Let's train a decision tree regressor to predict the median_house_value variable.

Train a model with max_depth=1.

Which feature is used for splitting the data?

- ocean_proximity
- total_rooms
- latitude
- population

In [42]:
# Train a decision tree regressor with max_depth=1
model = DecisionTreeRegressor(max_depth=1, random_state=1)
model.fit(X_train, y_train)

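# Get the feature used for splitting.
# DictVectorizer sorts and one-hot encodes the columns, so the index must be
# looked up in the vectorizer's feature names, not in the dataframe's columns
splitting_feature = dv.get_feature_names_out()[model.tree_.feature[0]]
print('The feature used for splitting:', splitting_feature)

The feature used for splitting: ocean_proximity=INLAND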
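
For intuition about how a depth-1 tree chooses its split, here is a minimal sketch of an exhaustive search that picks the feature and threshold with the lowest total squared error. It is a simplified illustration under my own assumptions, not scikit-learn's implementation, and the toy data and helper names are invented.

```python
def sse(values):
    # Sum of squared deviations from the mean (0.0 for an empty group)
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    # Try every midpoint between consecutive distinct values of every feature
    best = None
    for j in range(len(X[0])):
        distinct = sorted({row[j] for row in X})
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


# Invented toy data: column 0 is an "inland" indicator, column 1 a room count
X = [[0, 5], [0, 3], [1, 4], [1, 6]]
y = [12.5, 12.4, 11.6, 11.7]  # log-scale prices, made up

score, feature, threshold = best_split(X, y)
print(feature, threshold)
```

In this toy data the indicator column separates the targets far better than the room count, so the search selects column 0.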


In [50]:
# Train a Random Forest model
rf_model = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Predict on the validation set
y_val_pred = rf_model.predict(X_val)

print(y_val_pred[:5], y_val[:5])

y_val_pred_inverse = np.expm1(y_val_pred)  # Inverse log transform
y_val_inverse = np.expm1(y_val)  # Inverse log transform

print(y_val_pred_inverse[:5], y_val_inverse[:5])

# RMSE on the log scale
# (np.sqrt is used because the squared=False argument is deprecated in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f'RMSE on validation set (log): {rmse}')

# RMSE after inverting the log transform
rmse = np.sqrt(mean_squared_error(y_val_inverse, y_val_pred_inverse))
print(f'RMSE on validation set (inverse): {rmse}')


[0.46995539 0.46957111 0.46776644 0.46968943 0.47035552] [0.46967075 0.46983277 0.46761814 0.46917852 0.47054857]
[0.59992283 0.59930812 0.5964245  0.59949737 0.60056312] [0.59946748 0.59972665 0.59618777 0.59868037 0.60087215]
RMSE on validation set: 0.0012937498938391813
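
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with made-up numbers; the function name `rmse_manual` is hypothetical.

```python
import math


def rmse_manual(y_true, y_pred):
    # Square each error, average them, then take the square root
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented log-scale targets and predictions
y_true_toy = [12.0, 12.5, 11.0]
y_pred_toy = [12.1, 12.3, 11.2]

print(rmse_manual(y_true_toy, y_pred_toy))
```

An RMSE computed on log-transformed targets measures relative error, so it is not directly comparable to an RMSE computed on the original price scale.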
