### Regression Tree

#### Definition:
A **regression tree** is a type of decision tree used for predicting continuous values. Unlike classification trees (which predict categories), regression trees predict numerical outcomes based on input features. 

#### Why Use a Regression Tree?
1. **Simple to Understand**: Regression trees are easy to interpret and visualize. You can see the decision process step by step, which makes it transparent.
2. **Handle Non-linear Relationships**: Unlike linear models, regression trees can capture non-linear relationships between input variables and the target.
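3. **Handle Different Types of Data**: In principle, trees can split on both numerical and categorical features. Note that scikit-learn's `DecisionTreeRegressor` expects numeric input, so categorical features must be encoded first.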

#### How It Works:
The tree is built by recursively splitting the dataset into smaller subsets. At each step, the algorithm evaluates candidate features and thresholds and chooses the split that gives the lowest error in the resulting subsets (often measured by mean squared error, MSE). Splitting continues until a stopping criterion is met, such as a maximum tree depth or a minimum number of data points in a leaf.

The final tree has "leaves" with predictions, which are typically the average of the target values in that subset.
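The split search described above can be sketched in a few lines of plain Python. This is only a minimal illustration for a single numeric feature, not scikit-learn's actual implementation; the helper names `sse` and `best_split` and the toy data are assumptions made for this example. Minimizing the summed squared error of the two subsets is equivalent to minimizing their size-weighted MSE.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) of the best binary split on one feature."""
    pairs = sorted(zip(x, y))
    best_threshold, best_error = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        # Candidate threshold: midpoint between two neighbouring feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for f, t in pairs if f <= threshold]
        right = [t for f, t in pairs if f > threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Toy data (hypothetical): the target jumps between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]

threshold, error = best_split(x, y)
left_leaf = [t for f, t in zip(x, y) if f <= threshold]
right_leaf = [t for f, t in zip(x, y) if f > threshold]

print("threshold:", threshold)
# Each leaf predicts the mean target of the samples that fall into it
print("left leaf prediction:", sum(left_leaf) / len(left_leaf))
print("right leaf prediction:", sum(right_leaf) / len(right_leaf))
```

A full tree repeats this search over every feature and then recurses into each subset until a stopping criterion is reached.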

#### Where to Use It:
- **Real Estate**: Predicting house prices based on features like size, location, and age.
- **Finance**: Forecasting stock prices or predicting loan amounts.
- **Health**: Estimating medical expenses based on patient data.
  
#### Advantages:
- **Easy to Interpret**: You can easily visualize how decisions are made.
- **Non-linear Relationships**: It can model complex patterns without the need for transformation of features.
- **Less Data Preprocessing**: It doesn’t require feature scaling or normalization.
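- **Can Handle Missing Data**: Some tree implementations handle missing values directly, for example through surrogate splits or by sending missing values to one side of a split. Support depends on the library and its version, so check before relying on it.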

#### Disadvantages:
- **Overfitting**: Trees can become overly complex and fit the training data too well, performing poorly on unseen data.
- **Less Accurate Than Ensembles**: On their own, regression trees are often less accurate than ensemble models such as Random Forest or Gradient Boosting.
- **Sensitive to Small Changes**: A small change in the data can result in a completely different tree, making them less stable.
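The overfitting problem listed above can be made visible by comparing training and test scores. The sketch below uses synthetic data (a noisy sine curve, an assumption made for this example, not the housing data) and compares an unrestricted tree with a depth-limited one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until it has memorized the training set
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# A depth-limited tree has to average over larger groups of samples
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(
        name,
        "| train R^2:", round(model.score(X_train, y_train), 3),
        "| test R^2:", round(model.score(X_test, y_test), 3),
    )
```

A large gap between the training and test scores of the unrestricted tree is the usual sign of overfitting; limiting `max_depth` trades some training accuracy for a smaller gap.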

#### Improving Performance:
To reduce overfitting and improve accuracy, ensemble methods such as **Random Forest** and **Gradient Boosting** are often used. They combine many trees, which usually gives better predictions than a single tree.
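As a rough sketch of how such a comparison could look, the code below fits a single tree, a random forest, and a gradient boosting model on the same synthetic data. The data and the hyperparameters are assumptions chosen for illustration; results on a real dataset will differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the same split and record its R^2 on the held-out data
test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)
    print(f"{name}: test R^2 = {test_scores[name]:.3f}")
```

The price of the ensembles is interpretability: a forest of 100 trees can no longer be read step by step like a single tree.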

In summary, regression trees are a simple yet powerful tool for predicting continuous outcomes, especially when you need interpretability and the relationship between the features and the target is non-linear. However, they are often not the most accurate option, and they can overfit if their complexity is not controlled.

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('BostonHousing.csv')

In [3]:
df.head()

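|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |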


In [4]:
# Features: the first 13 columns (crim ... lstat); target: the last column (medv)
X = df.iloc[:, 0:13]
y = df.iloc[:, 13]
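Positional slicing works here, but it silently picks the wrong columns if the file layout changes. Selecting by column name is a possible alternative. The small frame below is only a stand-in built from the first two rows and three of the columns shown by `df.head()`, so that the sketch runs on its own.

```python
import pandas as pd

# Stand-in frame: first two rows of df.head(), restricted to three columns
df = pd.DataFrame({
    "rm": [6.575, 6.421],
    "lstat": [4.98, 9.14],
    "medv": [24.0, 21.6],
})

X = df.drop(columns="medv")  # every column except the target
y = df["medv"]               # the target column

print(list(X.columns))
print(y.name)
```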

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
rt = DecisionTreeRegressor(criterion='squared_error', max_depth=5)

In [7]:
rt.fit(X_train, y_train)

In [8]:
y_pred = rt.predict(X_test)

In [10]:
r2_score(y_test, y_pred)

0.882408456330235
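`r2_score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, so the value above means the tree explains roughly 88% of the variance of the test targets. The definition can be written out in plain Python; the function name `r2` and the toy numbers are assumptions for this sketch and are not taken from the dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # error of the mean
    return 1 - ss_res / ss_tot


# Toy values (hypothetical)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

print(r2(y_true, y_pred))                # close to 1: predictions track the targets
print(r2(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0: no better than predicting the mean
```

A model that does worse than always predicting the mean gets a negative R².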

## Hyperparameter Tuning

In [11]:
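param_grid = {
    'max_depth': [2, 4, 8, 10, None],
    'criterion': ['squared_error', 'absolute_error'],
    # a float is read as a fraction of the features considered at each split
    'max_features': [0.25, 0.5, 1.0],
    # a float is read as a fraction of the training samples needed to split a node
    'min_samples_split': [0.25, 0.5, 1.0],
}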
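A grid search tries every combination of the listed values, so the cost grows quickly with each added parameter. The count for this grid can be checked with the standard library. The fold count of 5 is `GridSearchCV`'s default when `cv` is not given.

```python
from itertools import product

param_grid = {
    'max_depth': [2, 4, 8, 10, None],
    'criterion': ['squared_error', 'absolute_error'],
    'max_features': [0.25, 0.5, 1.0],
    'min_samples_split': [0.25, 0.5, 1.0],
}

# One entry per combination of parameter values: 5 * 2 * 3 * 3
combinations = list(product(*param_grid.values()))
n_folds = 5  # GridSearchCV's default when cv is not specified

print(len(combinations))            # 90
print(len(combinations) * n_folds)  # 450
```

Each combination is fitted once per fold, which gives the total number of fits that scikit-learn reports when the search runs.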

In [12]:
reg = GridSearchCV(DecisionTreeRegressor(), param_grid=param_grid)

In [13]:
reg.fit(X_train, y_train)

225 fits failed out of a total of 450.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
225 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\write\Desktop\ai\AI-Development\ML\virtual_env\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\write\Desktop\ai\AI-Development\ML\virtual_env\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\write\Desktop\ai\AI-Development\ML\virtual_env\Lib\site-packages\sklearn\tree\_classes.py", line 1377, in fit
    super()._fit(
  File "C:\User

In [14]:
reg.best_score_

0.6307891226629944
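The warning above says that the scores of failed fits are set to `nan`, so `best_score_` only reflects the combinations that could be fitted. One way to find out which combinations failed is to look for `nan` entries in `cv_results_`. The sketch below uses synthetic data and a deliberately invalid `max_features` value as a stand-in for a failing setting; it is not meant to reproduce the cause of the failures in this notebook, whose traceback is cut off.

```python
import warnings

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=100)

# max_features=1.5 is deliberately invalid (a float must lie in (0, 1]),
# so every fit that uses it fails
grid = {'max_depth': [2, 4], 'max_features': [0.5, 1.5]}

search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid=grid)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the failed-fit warnings for this demo
    search.fit(X, y)

# Combinations whose mean test score is nan had at least one failed fit
scores = search.cv_results_['mean_test_score']
failed = [p for p, s in zip(search.cv_results_['params'], scores) if np.isnan(s)]
print(failed)
```

Passing `error_score='raise'` to `GridSearchCV`, as the warning suggests, stops at the first failure and shows the full error instead.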

In [15]:
reg.best_params_

{'criterion': 'squared_error',
 'max_depth': 4,
 'max_features': 0.5,
 'min_samples_split': 0.25}
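Note that `best_score_` is a mean cross-validated R² computed on the training data, while the earlier `r2_score` was measured on the held-out test set, so the two numbers are not directly comparable. To compare like with like, the tuned model can be scored on the test set as well. The sketch below shows the pattern on synthetic data with a smaller, hypothetical grid.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [2, 4, 8], 'min_samples_split': [2, 10, 20]},
)
search.fit(X_train, y_train)

# best_score_: mean cross-validated R^2 over the training folds
# score():     R^2 of the refitted best_estimator_ on data it has not seen
print("best params:", search.best_params_)
print("CV R^2:", round(search.best_score_, 3))
print("test R^2:", round(search.score(X_test, y_test), 3))
```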

## Feature Importance

In [16]:
for importance, name in sorted(zip(rt.feature_importances_, X_train.columns), reverse=True):
    print(name, importance)

rm 0.6353566207929299
lstat 0.1942600281622136
dis 0.06659435817958925
crim 0.03550311247609862
nox 0.025315420672282752
ptratio 0.021087723238671835
b 0.011905400128461608
age 0.006175991292050488
indus 0.002627411344586613
tax 0.001173933713115263
zn 0.0
rad 0.0
chas 0.0
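In scikit-learn, `feature_importances_` are impurity-based and normalized to sum to 1, so each value can be read as a share of the total error reduction achieved by the tree. A quick check with the numbers printed above:

```python
# Importances copied from the output above
importances = {
    'rm': 0.6353566207929299,
    'lstat': 0.1942600281622136,
    'dis': 0.06659435817958925,
    'crim': 0.03550311247609862,
    'nox': 0.025315420672282752,
    'ptratio': 0.021087723238671835,
    'b': 0.011905400128461608,
    'age': 0.006175991292050488,
    'indus': 0.002627411344586613,
    'tax': 0.001173933713115263,
    'zn': 0.0,
    'rad': 0.0,
    'chas': 0.0,
}

total = sum(importances.values())
top_two = importances['rm'] + importances['lstat']

print(round(total, 6))    # 1.0
print(round(top_two, 3))  # 0.83
```

Here `rm` and `lstat` together account for about 83% of the total. An importance of 0.0 means the feature contributed no error reduction in this particular tree (`rt`, fitted with `max_depth=5`); it does not prove the feature is useless in general.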
