# Model Selection in Machine Learning: Predicting Housing Prices
We'll explore how different models perform on a simple task: predicting housing prices using the California Housing dataset. Scikit-Learn provides a loader for it, `fetch_california_housing`, which downloads the data on first use and caches it locally, so it is easy to access for our exercises. Let's get started!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

## Load and Explore the Dataset
The California Housing dataset describes block groups in California with eight features: median income, median house age, average rooms, average bedrooms, population, average occupancy, latitude, and longitude. The target is the median house value of each block group, expressed in units of $100,000.

In [None]:
california = fetch_california_housing()
X = pd.DataFrame(california.data, columns=california.feature_names)
y = california.target
X.head()

## Visualizing the DataA quick visualization to understand our data better.



In [None]:
plt.scatter(X['MedInc'], y)
plt.xlabel('Median Income (tens of thousands)')
plt.ylabel('Median House Value ($100K)')
plt.title('Income vs. House Value')
plt.show()

## Splitting the DatasetSplit the dataset into a training set and a test set to evaluate our models effectively.



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Selection
We will explore three different models to predict housing prices: Linear Regression, Decision Tree Regressor, and Random Forest Regressor.
#### How do we evaluate the model?
We will use MAE, which stands for Mean Absolute Error. It measures how close your machine learning model's predictions are to the actual outcomes, on average. Here's a simple way to understand it:

Imagine you're trying to guess the ages of several people. After making your guesses, you find out their real ages and calculate how far off you were for each person. Some guesses might be too high, and some might be too low, but you're only interested in how wrong you were, regardless of the direction. So, you take the absolute value of each mistake (which turns any negative numbers into positives) and then average these to get a single number that tells you how well you did overall.

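The lower the MAE, the closer your guesses were to the real ages, which means your predictions were more accurate. MAE is expressed in the same units as the target, so for this dataset an MAE of 0.5 corresponds to an average error of about $50,000.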
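
To make the age-guessing analogy concrete, here is a minimal sketch that computes MAE by hand and checks it against `mean_absolute_error`. The ages are hypothetical values chosen only for illustration, and the variable names are not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical ages, chosen only for illustration
actual_ages = np.array([25, 40, 33, 58])
guessed_ages = np.array([30, 35, 33, 50])

# Absolute error per person: the direction of each miss is discarded
absolute_errors = np.abs(actual_ages - guessed_ages)

# MAE is the mean of those absolute errors
manual_mae = absolute_errors.mean()

# scikit-learn's implementation should agree with the manual calculation
sklearn_mae = mean_absolute_error(actual_ages, guessed_ages)

print("Absolute errors:", absolute_errors)
print("Manual MAE:", manual_mae)
print("scikit-learn MAE:", sklearn_mae)
```

The individual misses are 5, 5, 0, and 8 years, so the average miss is 4.5 years.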


### 1. Linear RegressionA good baseline model due to its simplicity.



In [None]:
# Train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Evaluate the model
linear_predictions = linear_model.predict(X_test)
linear_mae = mean_absolute_error(y_test, linear_predictions)
print("Linear Regression MAE:", linear_mae)

### 2. Decision Tree Regressor
Useful for capturing non-linear relationships.

In [None]:
# Train the Decision Tree model
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)

# Evaluate the model
tree_predictions = tree_model.predict(X_test)
tree_mae = mean_absolute_error(y_test, tree_predictions)
print("Decision Tree Regressor MAE:", tree_mae)

### 3. Random Forest Regressor
An ensemble method that averages the predictions of many decision trees. This usually makes it more accurate and more stable than a single tree.

In [None]:
# Train the Random Forest model
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

# Evaluate the model
forest_predictions = forest_model.predict(X_test)
forest_mae = mean_absolute_error(y_test, forest_predictions)
print("Random Forest Regressor MAE:", forest_mae)


## ComparisonAfter training and evaluating our models, let's compare their performance.



In [None]:
# Comparing the MAE of all models
mae_values = [linear_mae, tree_mae, forest_mae]
model_names = ['Linear Regression', 'Decision Tree', 'Random Forest']

plt.bar(model_names, mae_values)
plt.ylabel('Mean Absolute Error')
plt.title('Model Comparison')
plt.show()

## Improve
We trained some base models with default settings and compared them. We can often improve a model's performance by tuning its parameters.

### Decision Tree Regressor Parameters
Let's adjust the max_depth parameter of the Decision Tree Regressor and see how it influences the model.

max_depth: Controls the maximum depth of the tree. A deeper tree can capture more complex patterns but also risks overfitting. Setting it too low might not capture enough complexity, leading to underfitting.

In [None]:
# Varying the max_depth of the Decision Tree
max_depth_values = [2, 4, 6, 8, None]  # None means the tree can grow as much as it needs
dt_mae_scores = []

for depth in max_depth_values:
    dt_model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dt_model.fit(X_train, y_train)
    dt_predictions = dt_model.predict(X_test)
    dt_mae = mean_absolute_error(y_test, dt_predictions)
    dt_mae_scores.append(dt_mae)

# Plotting the MAE scores for different max_depth values
plt.figure(figsize=(10, 5))
# Use string labels so that None can sit on the x-axis next to the numeric depths
plt.plot([str(depth) for depth in max_depth_values], dt_mae_scores, marker='o')
plt.xlabel('Max Depth')
plt.ylabel('Mean Absolute Error')
plt.title('Decision Tree Performance vs. Max Depth')
plt.show()
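
The loop above reports only test MAE. One way to see overfitting more directly is to compare training MAE with test MAE at each depth: when the training error keeps shrinking while the test error stops improving, the tree is memorizing the training data. The sketch below assumes synthetic data from `make_friedman1` so that it runs without downloading anything; the names `Xd_train`, `Xd_test`, and `mae_by_depth` are hypothetical. The same loop can be run with the notebook's `X_train` and `X_test`.

```python
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, so the sketch runs offline)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

mae_by_depth = {}
for depth in [2, 4, 6, 8, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(Xd_train, yd_train)

    # Error on data the tree has seen vs. data it has not seen
    train_mae = mean_absolute_error(yd_train, model.predict(Xd_train))
    test_mae = mean_absolute_error(yd_test, model.predict(Xd_test))
    mae_by_depth[depth] = (train_mae, test_mae)

    print(f"max_depth={depth}: train MAE={train_mae:.3f}, test MAE={test_mae:.3f}")
```

With max_depth=None the tree grows until every training sample is fit exactly, so its training MAE drops to zero while its test MAE does not.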

### Random Forest Regressor Parameters
For the Random Forest Regressor, let's tweak the n_estimators parameter, which controls the number of trees in the forest.

n_estimators: Determines the number of trees in the forest. More trees can lead to better performance but also require more training time and memory, so the choice is a balance between accuracy and efficiency.

In [None]:
# Adjusting the n_estimators of the Random Forest
n_estimators_values = [10, 50, 100, 200]
rf_mae_scores = []

for n_estimators in n_estimators_values:
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    rf_model.fit(X_train, y_train)
    rf_predictions = rf_model.predict(X_test)
    rf_mae = mean_absolute_error(y_test, rf_predictions)
    rf_mae_scores.append(rf_mae)

# Plotting the MAE scores for different n_estimators values
plt.figure(figsize=(10, 5))
plt.plot(n_estimators_values, rf_mae_scores, marker='o')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Mean Absolute Error')
plt.title('Random Forest Performance vs. Number of Trees')
plt.show()
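
Refitting a new forest for every value of n_estimators repeats a lot of work. As a possible alternative, the sketch below uses the `warm_start=True` option of `RandomForestRegressor`, which keeps the trees already trained and only adds new ones when n_estimators is raised. It assumes synthetic data from `make_friedman1` so that it runs offline; the names `forest` and `mae_by_size` are hypothetical, and the scores may differ slightly from those of forests fitted from scratch.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, so the sketch runs offline)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# warm_start=True reuses the existing trees on each subsequent call to fit
forest = RandomForestRegressor(warm_start=True, random_state=42)

mae_by_size = {}
for n_trees in [10, 50, 100, 200]:
    forest.set_params(n_estimators=n_trees)
    forest.fit(Xd_train, yd_train)  # trains only the additional trees
    mae_by_size[n_trees] = mean_absolute_error(yd_test, forest.predict(Xd_test))
    print(f"{n_trees} trees: test MAE={mae_by_size[n_trees]:.3f}")
```

If the printed MAE barely changes between two sizes, the extra trees are mostly costing training time.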

### Playing with Parameters
Adjusting these parameters allows us to control the model's complexity and its ability to generalize from training data to unseen data. Here's how you can play with them:

For the Decision Tree, start with a low max_depth and gradually increase it to see how the model's performance changes. Notice the point where the test MAE stops falling and starts to rise, which indicates overfitting.

For the Random Forest, increasing n_estimators generally improves model performance up to a point. Identify the sweet spot beyond which adding more trees brings only diminishing returns.
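
Choosing parameters by looking at test-set MAE can make the final test score look better than it really is, because the test set has then influenced the model. A common alternative is to pick parameters with cross-validation on the training data and use the test set only once at the end. The sketch below does this for max_depth with `GridSearchCV`. It assumes synthetic data from `make_friedman1` so that it runs offline; the names `search`, `best_depth`, and `cv_mae` are hypothetical.

```python
from sklearn.datasets import make_friedman1
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, so the sketch runs offline)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 5-fold cross-validation over the same depths used earlier in the notebook.
# scikit-learn maximizes scores, so MAE is exposed as its negative.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 4, 6, 8, None]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(Xd_train, yd_train)

best_depth = search.best_params_["max_depth"]
cv_mae = -search.best_score_  # flip the sign back to a regular MAE

# The test set is touched only once, after the depth has been chosen
test_mae = mean_absolute_error(yd_test, search.predict(Xd_test))

print("Best max_depth:", best_depth)
print(f"Cross-validated MAE: {cv_mae:.3f}")
print(f"Test MAE: {test_mae:.3f}")
```

The same pattern applies to the Random Forest by swapping in `RandomForestRegressor` and a grid over n_estimators.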