# F20DX Coursework Project 4&5

Machine Learning Analysis on the Boston House Prices Dataset using Linear Regression and Regression Trees

## Data Setup and Inspection

The description of all the features is given below:

  **CRIM**: Per capita crime rate by town

  **ZN**: Proportion of residential land zoned for lots over 25,000 sq. ft

  **INDUS**: Proportion of non-retail business acres per town

  **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  **NOX**: Nitric oxide concentration (parts per 10 million)

  **RM**: Average number of rooms per dwelling

  **AGE**: Proportion of owner-occupied units built prior to 1940

  **DIS**: Weighted distances to five Boston employment centers

  **RAD**: Index of accessibility to radial highways

  **TAX**: Full-value property tax rate per $10,000

  **PTRATIO**: Pupil-teacher ratio by town

  **B**: '1000(Bk - 0.63)²', where Bk is the proportion of people of African American descent by town

  **LSTAT**: Percentage of lower status of the population

  **MEDV**: Median value of owner-occupied homes in $1000s

In [135]:
# import packages
from sklearn.model_selection import train_test_split , cross_val_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [136]:
# Load the dataset from the CSV file
data = pd.read_csv('boston.csv')

# Print column names, non-null counts and dtypes
data.info()
# Show summary statistics for each column
data.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


The `info()` function prints the number of non-null cells and the data type of each feature, while `describe()` summarises each column (count, mean, standard deviation, minimum, quartiles and maximum).

It's also good practice to check for empty or null cells within the data.

In [137]:
data.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

As shown, there are none in this dataset.

In [138]:
print(data.std())

CRIM         8.601545
ZN          23.322453
INDUS        6.860353
CHAS         0.253994
NOX          0.115878
RM           0.702617
AGE         28.148861
DIS          2.105710
RAD          8.707259
TAX        168.537116
PTRATIO      2.164946
B           91.294864
LSTAT        7.141062
MEDV         9.197104
dtype: float64


## Correlation
We can check the correlation between each pair of features, as displayed in the heatmap below.


In [139]:
# use the heatmap function from seaborn to plot the correlation matrix, passing in the correlated values
# annot = True to print the values inside the square
# cmap to change the colour for easier visuals
plt.subplots(figsize=(12,8))
sns.heatmap(data.corr(),cmap='RdBu' ,annot=True)


*[Figure: heatmap of the correlation matrix for all columns, produced by the cell above.]*

### Observations
When analysing this heatmap, we look for features with a strong colour in the **MEDV** row. Values close to 1 indicate a strong positive correlation, while values close to -1 indicate a strong negative correlation.

**RM** has the strongest positive correlation with **MEDV**. This means that there is a positive linear relationship between the average number of rooms per dwelling and the house price.

In contrast, **LSTAT** has the strongest negative correlation with **MEDV**. This means that there is a negative linear relationship between the percentage of lower-status population and the house price.

We can also pick out features that have little or no correlation with **MEDV**, such as **CHAS**.
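The same reading can be taken from numbers rather than colours by sorting each feature's correlation with the target. The sketch below is illustrative only: it uses small synthetic arrays rather than `boston.csv`, and the names `pos`, `neg` and `noise` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: one feature that rises with the target,
# one that falls with it, and one that is unrelated.
target = rng.normal(size=n)
features = {
    "pos": target + 0.5 * rng.normal(size=n),
    "neg": -target + 0.5 * rng.normal(size=n),
    "noise": rng.normal(size=n),
}

# Pearson correlation of each feature with the target.
corr_with_target = {
    name: float(np.corrcoef(values, target)[0, 1])
    for name, values in features.items()
}

# Sort from most negative to most positive.
ranked = sorted(corr_with_target.items(), key=lambda item: item[1])
for name, r in ranked:
    print(f"{name:>5}: {r:+.2f}")
```

With the real data, the equivalent one-liner would be `data.corr()['MEDV'].sort_values()`.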

It is also rele

### Validation of Observation


In [140]:
plt.scatter(data['RM'], data['MEDV'])
plt.xlabel('RM')
plt.ylabel('MEDV')
plt.title('Scatter Plot of RM vs MEDV')

# Calculate the best fit line
fit = np.polyfit(data['RM'], data['MEDV'], 1)
line = np.poly1d(fit)

# Plot the best fit line
plt.plot(data['RM'], line(data['RM']), color='red')

plt.show()

In this graph we can clearly see a general trend: **MEDV** rises as **RM** increases, bar a few outlying points.

The gradient of the line of best fit in this graph is positive.


In [141]:
plt.scatter(data['LSTAT'], data['MEDV'])
plt.xlabel('LSTAT')
plt.ylabel('MEDV')
plt.title('Scatter Plot of LSTAT vs MEDV')

# Calculate the best fit line
fit = np.polyfit(data['LSTAT'], data['MEDV'], 1)
line = np.poly1d(fit)

# Plot the best fit line
plt.plot(data['LSTAT'], line(data['LSTAT']), color='red')

plt.show()

In this graph we can clearly see a general trend: **MEDV** falls as **LSTAT** increases.

The line of best fit has a negative gradient in this case.


# Linear Regression


**Pre Processing**

In [142]:
# inspect Data
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


In [143]:
data.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

There are no null values, and all columns are numeric, so the data types are compatible with the algorithm.

## Training the Model

We will train two models to check whether using only the features with the strongest correlation to the target gives a more accurate model. Both models are trained on 80% of the data, with the remaining 20% held out for testing.
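As a quick check on the split sizes implied by `test_size=0.2`, the arithmetic can be reproduced by hand. This is a hand-rolled sketch of a shuffled hold-out split, not `train_test_split` itself, and the rounding rule used here (test portion rounded up) is an assumption made for the sketch.

```python
import math
import numpy as np

n_samples = 506          # rows in the dataset
test_fraction = 0.2      # same value as test_size in the notebook

# Shuffle the row indices reproducibly, then cut off the test portion.
rng = np.random.default_rng(42)
indices = rng.permutation(n_samples)

n_test = math.ceil(test_fraction * n_samples)   # rounding up is an assumption
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))
```

Under these assumptions the split leaves 404 rows for training and 102 rows for testing.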

**Complete Model**

In [144]:

#Complete model

# Separate the features (X) and the target variable (y)
x = data.drop('MEDV', axis=1)
y = data['MEDV']

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Train a linear regression model on the training set
linear_model = LinearRegression()
linear_model.fit(x_train, y_train)


**Correlation Model**
Variables belonging to the correlation model are marked with a 'c'.

In [145]:

# Correlation model
x_c = data[['LSTAT', 'RM']]
y_c = data['MEDV']

# Split the dataset into training and test sets
x_ctrain, x_ctest, y_ctrain, y_ctest = train_test_split(x_c, y_c, test_size=0.2, random_state=42)

# Train a linear regression model on the training set
c_linear_model = LinearRegression()
c_linear_model.fit(x_ctrain, y_ctrain)

## 10-Fold Validation

### Complete Model

In [146]:
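# cross_val_score clones the estimator and refits it on each fold, so these
# scores come from 10 folds of the test set, not from the model fitted above.
# For a regressor the default score is R2, not classification accuracy.
ten_f = cross_val_score(linear_model, x_test, y_test, cv=10)
print("Accuracy scores with 10-fold cross-validation:", ten_f)
print("Average accuracy score:", np.mean(ten_f))
# Score the model fitted on the training set against the held-out test set (R2)
test_accuracy = linear_model.score(x_test, y_test)
print("Accuracy on the test set:", test_accuracy)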

Accuracy scores with 10-fold cross-validation: [ 0.49530423  0.80239097  0.49809325  0.14267101  0.8560368   0.67783362
  0.70716076  0.91237586  0.64034695 -0.53792062]
Average accuracy score: 0.5194292838853557
Accuracy on the test set: 0.6687594935356307


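The scores reported here are R2 values (the default score for a regressor), not classification accuracy. R2 is at most 1 and can be negative, as in the last fold, when the model does worse than always predicting the mean. An average of about 0.52 is acceptable in this scenario but could be better: the model captures some of the relationship between the features and the target, but there may be further underlying relationships that it misses.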
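The value returned by `score` and by `cross_val_score` for a regressor is the coefficient of determination, R2. A minimal NumPy sketch of the usual definition (the helper name `r2` is made up here) shows why it can drop below zero:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y = np.array([10.0, 20.0, 30.0, 40.0])

perfect = r2(y, y)                             # every prediction is exact
mean_only = r2(y, np.full_like(y, y.mean()))   # always predict the mean
worse = r2(y, y[::-1])                         # predictions in reverse order

print(perfect, mean_only, worse)
```

A perfect model scores 1, always predicting the mean scores 0, and anything worse than the mean scores below 0.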

**Correlation Model**

In [147]:
# Evaluate the model with 10-fold cross-validation
c_scores = cross_val_score(c_linear_model, x_ctest, y_ctest, cv=10)
print("Accuracy scores with 10-fold cross-validation:", c_scores)
print("Average accuracy score:", np.mean(c_scores))

# Evaluate the model on the test set
c_test_accuracy = c_linear_model.score(x_ctest, y_ctest)
print("Accuracy on the test set:", c_test_accuracy)

Accuracy scores with 10-fold cross-validation: [ 0.6574719   0.6962719   0.24157323  0.23795368  0.65676564  0.74316698
  0.50296751  0.68580848  0.78347583 -0.45926933]
Average accuracy score: 0.4746185812641385
Accuracy on the test set: 0.5739577415025858


In this case, using only the two most strongly correlated features gives a slightly lower score (an average of 0.47 against 0.52, and 0.57 against 0.67 on the test set). The difference in performance compared with the complete model indicates how much predictive information the remaining features add beyond **LSTAT** and **RM**.
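For reference, the mechanics behind the 10-fold numbers can be sketched by hand. The version below is a simplified stand-in for `cross_val_score` on synthetic data: it splits the rows into k contiguous folds, refits an ordinary least-squares model on k-1 of them and scores R2 on the held-out fold. The helper names `fit_ols`, `predict` and `kfold_r2` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Predictions of the fitted linear model for the rows of X."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def kfold_r2(X, y, k=10):
    """Refit on k-1 folds and score R2 on the held-out fold, k times."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        coef = fit_ols(X[train], y[train])
        residual = y[held_out] - predict(coef, X[held_out])
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y[held_out] - y[held_out].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

scores = kfold_r2(X, y, k=10)
print(scores.round(2), scores.mean().round(2))
```

When the folds are small, as they are when cross-validating on the test set alone, a single unusual fold can produce a very low or negative score, which is consistent with the last fold in the outputs above.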

## RMSE & R2 Evaluation

### Complete Model


In [148]:
linear_rmse_scores = np.sqrt(-cross_val_score(linear_model, x_test, y_test, cv=10, scoring='neg_mean_squared_error'))
linear_r2_scores = cross_val_score(linear_model, x_test, y_test, cv=10, scoring='r2')
print("Complete Model:")
print("RMSE scores with 10-fold cross-validation:", linear_rmse_scores)
print("Average RMSE score:", np.mean(linear_rmse_scores))
print("R2 scores with 10-fold cross-validation:", linear_r2_scores)
print("Average R2 score:", np.mean(linear_r2_scores))

Complete Model:
RMSE scores with 10-fold cross-validation: [ 3.60928066  4.95159403  2.87911289  4.92426139  3.96553216  4.60659798
  3.2454746   3.43966961  3.60080768 12.61516287]
Average RMSE score: 4.783749387660611
R2 scores with 10-fold cross-validation: [ 0.49530423  0.80239097  0.49809325  0.14267101  0.8560368   0.67783362
  0.70716076  0.91237586  0.64034695 -0.53792062]
Average R2 score: 0.5194292838853557


### Correlation Model



In [149]:
c_linear_rmse_scores = np.sqrt(-cross_val_score(c_linear_model, x_ctest, y_ctest, cv=10, scoring='neg_mean_squared_error'))
c_linear_r2_scores = cross_val_score(c_linear_model, x_ctest, y_ctest, cv=10, scoring='r2')
print("Correlation Model:")
print("RMSE scores with 10-fold cross-validation:", c_linear_rmse_scores)
print("Average RMSE score:", np.mean(c_linear_rmse_scores))
print("R2 scores with 10-fold cross-validation:", c_linear_r2_scores)
print("Average R2 score:", np.mean(c_linear_r2_scores))

Correlation Model:
RMSE scores with 10-fold cross-validation: [ 2.97340403  6.13880902  3.53919103  4.64256537  6.12309669  4.11306562
  4.22820436  6.51331485  2.79390249 12.2883516 ]
Average RMSE score: 5.335390505774066
R2 scores with 10-fold cross-validation: [ 0.6574719   0.6962719   0.24157323  0.23795368  0.65676564  0.74316698
  0.50296751  0.68580848  0.78347583 -0.45926933]
Average R2 score: 0.4746185812641385


The general trend continues here: the complete model's predictions are more accurate than the correlation model's (an average RMSE of 4.78 against 5.34). However, both are reasonably accurate in the context of this dataset.

RMSE is the square root of the mean squared difference between the predicted and actual values, expressed in the units of the target. R2 quantifies the proportion of the variance in the target that the model explains.

Because **MEDV** is measured in $1000s, the predictions are off by roughly 4,800 to 5,300 dollars on average. In the context of house prices we treat this as an acceptable error. For scale, the mean **MEDV** is about 22.5.

In both cases the R2 score is also acceptable: roughly half of the variance in **MEDV** is explained (0.52 for the complete model and 0.47 for the correlation model).
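As a small worked example of RMSE, the sketch below uses the first four **MEDV** values from the dataset together with made-up predictions. Because **MEDV** is in thousands of dollars, the result is scaled by 1000 for display.

```python
import numpy as np

# Actual values in $1000s (as MEDV is); the predictions are made up.
actual = np.array([24.0, 21.6, 34.7, 33.4])
predicted = np.array([26.0, 20.6, 30.7, 35.4])

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))   # root of the mean squared error

print(f"RMSE: {rmse:.2f} (about ${rmse * 1000:,.0f})")
```

Squaring the errors before averaging means the single error of 4 contributes far more to the result than the errors of 1 and 2.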

## Other Improvement Methods ##

Feature selection by taking correlation and VIF score into account

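**Multicollinearity**

Multicollinearity occurs when features are strongly correlated with each other, either positively or negatively. Removing some of these features reduces the redundancy among the inputs.



**VIF**

The variance inflation factor (VIF) is a score that measures how strongly a feature is linearly related to the other features; a high VIF indicates high multicollinearity.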
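The idea behind VIF can be sketched directly: regress each feature on all the others and compute 1 / (1 - R2). The NumPy version below uses synthetic data and includes an intercept in each auxiliary regression, which is the textbook definition. Note that `variance_inflation_factor` from statsmodels works on whatever matrix it is given, so its values may differ when no constant column is included. The helper name `vif` is hypothetical.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # intercept + other features
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    r_squared = 1.0 - residual.var() / target.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)   # almost a copy of a: strongly collinear
c = rng.normal(size=500)             # independent of both

X = np.column_stack([a, b, c])
vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns receive a very large VIF, while the independent column stays close to 1.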

In [150]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF
X_filtered = data.drop('MEDV', axis=1)
vif = pd.DataFrame()
vif['Feature'] = X_filtered.columns
vif['VIF'] = [variance_inflation_factor(X_filtered.values, i) for i in range(X_filtered.shape[1])]

# Keep only the features whose VIF is at most 70
filtered_features = vif[vif['VIF'] <= 70]['Feature'].tolist()

# Build the filtered dataset (target plus retained features)
filtered_data = data[['MEDV'] + filtered_features]

filtered_data.head()

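| | MEDV | CRIM | ZN | INDUS | CHAS | AGE | DIS | RAD | TAX | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0 | 65.2 | 4.09 | 1 | 296.0 | 396.9 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0 | 78.9 | 4.9671 | 2 | 242.0 | 396.9 | 9.14 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0 | 61.1 | 4.9671 | 2 | 242.0 | 392.83 | 4.03 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0 | 45.8 | 6.0622 | 3 | 222.0 | 394.63 | 2.94 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0 | 54.2 | 6.0622 | 3 | 222.0 | 396.9 | 5.33 |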


Note the features removed: **NOX**, **RM** and **PTRATIO** no longer appear, as their VIF exceeded 70.

In [151]:




x = filtered_data.drop('MEDV', axis=1)
y = filtered_data['MEDV']

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Train a linear regression model on the training set
f_linear_model = LinearRegression()
f_linear_model.fit(x_train, y_train)

## Evaluation
### 10-Fold

In [152]:
scores = cross_val_score(f_linear_model, x_test, y_test, cv=10)
print("Accuracy scores with 10-fold cross-validation:", scores)
print("Average accuracy score:", np.mean(scores))



Accuracy scores with 10-fold cross-validation: [0.63952747 0.75161869 0.55464643 0.16879504 0.79777474 0.60359462
 0.60484455 0.82526081 0.71224744 0.0815688 ]
Average accuracy score: 0.5739878608919465


The average score (0.57) is higher than the complete model's (0.52), at least on these folds. This suggests that two or more of the original features were highly correlated with each other, and that removing them helped the model.



## Feature Engineering

Creating new features to capture relationships.

In [153]:

filtered_data = filtered_data.copy()  # explicit copy avoids pandas' SettingWithCopyWarning
filtered_data['LSTAT_RM'] = filtered_data['TAX'] * filtered_data['RAD']  # note: TAX * RAD, despite the name
filtered_data['LSTAT_SQRD'] = filtered_data['LSTAT'] ** 2


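* Creating a new feature from the product of **TAX** and **RAD**, based on the strong correlation between them (stored in the column named `LSTAT_RM`)
* Squaring **LSTAT** to try to capture a non-linear relationship with **MEDV**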
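To illustrate why a squared term can help a linear model, the sketch below fits synthetic data with a curved relationship twice: once with `x` alone and once with `x` and `x**2`. The data and the helper `r2_of_fit` are invented for the example and are not the Boston features.

```python
import numpy as np

def r2_of_fit(A, y):
    """R2 of an ordinary least-squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
# Curved relationship plus noise (coefficients chosen arbitrarily).
y = 50.0 - 8.0 * x + 0.5 * x ** 2 + rng.normal(scale=1.0, size=200)

ones = np.ones_like(x)
r2_linear = r2_of_fit(np.column_stack([ones, x]), y)            # x only
r2_squared = r2_of_fit(np.column_stack([ones, x, x ** 2]), y)   # x and x^2

print(round(r2_linear, 3), round(r2_squared, 3))
```

The model is still linear in its coefficients; the extra column simply lets it follow the curve.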

In [154]:
filtered_data.head()

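| | MEDV | CRIM | ZN | INDUS | CHAS | AGE | DIS | RAD | TAX | B | LSTAT | LSTAT_RM | LSTAT_SQRD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0 | 65.2 | 4.09 | 1 | 296.0 | 396.9 | 4.98 | 296.0 | 24.8004 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0 | 78.9 | 4.9671 | 2 | 242.0 | 396.9 | 9.14 | 484.0 | 83.5396 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0 | 61.1 | 4.9671 | 2 | 242.0 | 392.83 | 4.03 | 484.0 | 16.2409 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0 | 45.8 | 6.0622 | 3 | 222.0 | 394.63 | 2.94 | 666.0 | 8.6436 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0 | 54.2 | 6.0622 | 3 | 222.0 | 396.9 | 5.33 | 666.0 | 28.4089 |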


Note the changes: the two new columns, `LSTAT_RM` and `LSTAT_SQRD`, appear at the end.

In [155]:
X = filtered_data.drop('MEDV', axis=1)
y = filtered_data['MEDV']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

fe_model = LinearRegression()
fe_model.fit(x_train, y_train)

# 10-fold
scores = cross_val_score(fe_model, x_test, y_test, cv=10)
print("Accuracy scores with 10-fold cross-validation:", scores)
print("Average accuracy score:", np.mean(scores))

# Evaluate the model on the test set
test_accuracy = fe_model.score(x_test, y_test)
print("Accuracy on the test set:", test_accuracy)

Accuracy scores with 10-fold cross-validation: [0.56433529 0.78681229 0.71330418 0.26449017 0.79782288 0.67833911
 0.66626446 0.86671066 0.76110125 0.20611659]
Average accuracy score: 0.6305296863070827
Accuracy on the test set: 0.7225506462714459


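We can see that this noticeably increases the model's scores: the average cross-validation score rises from 0.57 to 0.63, and the score on the test set is 0.72.

The combination of the two new features improved the score beyond what VIF filtering alone achieved.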

## Further Validation
Run RMSE & R2 to further validate the improved accuracy

In [156]:

fe_rmse_scores = np.sqrt(-cross_val_score(fe_model, x_train, y_train, cv=10, scoring='neg_mean_squared_error'))
fe_r2_scores = cross_val_score(fe_model, x_train, y_train, cv=10, scoring='r2')
print("Feature-Engineered Model:")
print("RMSE scores with 10-fold cross-validation:", fe_rmse_scores)
print("Average RMSE score:", np.mean(fe_rmse_scores))
print("R2 scores with 10-fold cross-validation:", fe_r2_scores)
print("Average R2 score:", np.mean(fe_r2_scores))

Feature-Engineered Model:
RMSE scores with 10-fold cross-validation: [4.97749758 5.09493223 5.32946189 4.78660338 4.22788212 4.67735132
 5.0246697  4.8386554  4.39258922 6.31775188]
Average RMSE score: 4.9667394723630816
R2 scores with 10-fold cross-validation: [0.61516106 0.70168105 0.75615199 0.64379599 0.81951612 0.77382263
 0.80209829 0.62947133 0.71349038 0.44410628]
Average R2 score: 0.6899295109736483


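As expected, the R2 score has improved noticeably, from about 0.52 for the complete model to 0.69. The average RMSE (4.97), however, is slightly higher than the complete model's 4.78, so it has not improved. These scores were cross-validated on the training set, whereas the earlier ones used the test set, so the comparison is only indicative. With this example dataset the RMSE is challenging to improve given the nature of the target attribute. As stated earlier, being off by 4,000 to 5,000 dollars is treated as acceptable when talking about housing prices.

The improved R2 score suggests that there are both linear and non-linear relationships between the features and the target, which the two extra features help to capture. Adding them has also allowed the model to make more accurate predictions.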


# Decision Tree

The target attribute in this dataset is a numerical value rather than a class label, so a regression tree is used: each terminal node (leaf) outputs a numeric prediction.
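A regression tree predicts the mean target value of the training rows that fall in each leaf. The single-split sketch below (a depth-1 tree on one synthetic feature) is only meant to show that idea. It is not how `DecisionTreeRegressor` is implemented, and `best_split` is a hypothetical helper.

```python
import numpy as np

def best_split(x, y):
    """Threshold on x that minimises the summed squared error of the two leaf means."""
    best_threshold, best_sse = None, np.inf
    for threshold in np.unique(x)[:-1]:          # keep both sides non-empty
        left, right = y[x <= threshold], y[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold

# Synthetic feature with a step in the target at x = 5.
x = np.arange(10, dtype=float)
y = np.where(x < 5, 20.0, 35.0)

threshold = best_split(x, y)
left_value = y[x <= threshold].mean()    # prediction for the left leaf
right_value = y[x > threshold].mean()    # prediction for the right leaf

print(threshold, left_value, right_value)
```

A full tree repeats this search recursively inside each leaf, over every feature, until a stopping rule such as a maximum depth or a minimum leaf size is reached.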

In [195]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB

In [196]:
data.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

As with the linear regression, no column contains null cells, and all data types are compatible with the regression tree.

## Training the model ##

**Complete Model**

In [203]:
# Separate the features (X) and target variable (y)
X = data.drop('MEDV', axis=1)
y = data['MEDV']

# Split the data for testing and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the decision tree model on the training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)


## Training Set Evaluation ##

**10-fold**

In [209]:
scores = cross_val_score(regressor, X_train, y_train, cv=10)
print("Accuracy scores with 10-fold cross-validation:", scores)
print("Average accuracy score:", scores.mean())

Accuracy scores with 10-fold cross-validation: [0.79145168 0.80798178 0.81994446 0.23290816 0.37265525 0.88007829
 0.8173623  0.48910753 0.86746129 0.58093827]
Average accuracy score: 0.665988901568151


A score of about 0.67 is generally good in this case, especially given the number of features in this dataset. As this score is R2, it shows that the model captures roughly 67% of the variance in the target variable.

**RMSE and R2**


In [558]:
#Calculate the RMSE and R2 scores on 10-fold
scores_rmse = -cross_val_score(regressor, X_train, y_train, cv=10, scoring='neg_root_mean_squared_error')
scores_r2 = cross_val_score(regressor, X_train, y_train, cv=10, scoring='r2')

#Calculate the Averages
avg_rmse = scores_rmse.mean()
avg_r2 = scores_r2.mean()

#print results
print("Average RMSE:", avg_rmse)
print("Average R2 :", avg_r2)

Average RMSE: 5.285266886388991
Average R2 : 0.6992161702210012


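These scores are about average. However, in the context of this dataset, and similarly to the linear regression, being out by about 5,000 dollars on a house price is treated as acceptable, as these prices are usually rounded in practice anyway.

The R2 score is good, showing that the model can capture a large proportion of the variance in the target variable. It differs from the 0.67 reported above even though the metric and folds are the same, most likely because the `DecisionTreeRegressor` was created without a fixed `random_state`, so repeated fits can differ slightly.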

## Testing Set Evaluation ##

**10-fold**


In [200]:
scores = cross_val_score(regressor, X_test, y_test, cv=10)
print("Accuracy scores with 10-fold cross-validation:", scores)
print("Average accuracy score:", scores.mean())

Accuracy scores with 10-fold cross-validation: [0.44105047 0.89104072 0.36121001 0.65467621 0.91289215 0.70723665
 0.88342707 0.81043314 0.77276573 0.03650889]
Average accuracy score: 0.6471241029670877


In [227]:
#Calculate the RMSE and R2 scores on 10-fold
scores_rmse = -cross_val_score(regressor, X_test, y_test, cv=10, scoring='neg_root_mean_squared_error')
scores_r2 = cross_val_score(regressor, X_test, y_test, cv=10, scoring='r2')

#Calculate the Averages
avg_rmse = scores_rmse.mean()
avg_r2 = scores_r2.mean()

#print results
print("Average RMSE:", avg_rmse)
print("Average R2 :", avg_r2)

Average RMSE: 3.9341124497875333
Average R2 : 0.6932380985182736


In [559]:


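# Note: in each example the "average accuracy score" is cross-validated on the
# test set, while RMSE and R2 are cross-validated on the training set.
# Example 1: Changing the maximum depth
regressor = DecisionTreeRegressor(max_depth=40)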
scores = cross_val_score(regressor, X_test, y_test, cv=10)
scores_rmse = cross_val_score(regressor, X_train, y_train, cv=10, scoring='neg_root_mean_squared_error')
scores_r2 = cross_val_score(regressor, X_train, y_train, cv=10, scoring='r2')
rmse = (-scores_rmse.mean())
print("RMSE (Max Depth=40):", rmse)
print('R2 :',scores_r2.mean())
print("Average accuracy score:", scores.mean())


# Example 2: Changing the minimum samples leaf
regressor = DecisionTreeRegressor(min_samples_leaf=5)
scores = cross_val_score(regressor, X_test, y_test, cv=10)
scores_rmse = cross_val_score(regressor, X_train, y_train, cv=10, scoring='neg_root_mean_squared_error')
scores_r2 = cross_val_score(regressor, X_train, y_train, cv=10, scoring='r2')
rmse = (-scores_rmse.mean())
print("RMSE (Min Samples Leaf=5):", rmse)
print('R2 :',scores_r2.mean())
print("Average accuracy score:", scores.mean())

# Example 3: Changing the maximum number of features
regressor = DecisionTreeRegressor(max_features=11)
scores = cross_val_score(regressor, X_test, y_test, cv=10)
scores_rmse = cross_val_score(regressor, X_train, y_train, cv=10, scoring='neg_root_mean_squared_error')
scores_r2 = cross_val_score(regressor, X_train, y_train, cv=10, scoring='r2')
rmse = (-scores_rmse.mean())
print("RMSE (Max Features=11):", rmse)
print('R2 :',scores_r2.mean())
print("Average accuracy score:", scores.mean())


RMSE (Max Depth=40): 5.08593414066089
R2 : 0.6493275828494498
Average accuracy score: 0.701209508398742
RMSE (Min Samples Leaf=5): 4.8532701879680005
R2 : 0.686357889001019
Average accuracy score: 0.5972526970579635
RMSE (Max Features=11): 4.9198001958546795
R2 : 0.7030988819419728
Average accuracy score: 0.6558709943348624
