# Models 04

In [26]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor

The objective of this task is to discuss the following hypotheses:

1. Performing standard normalization of the features improves the performance of models induced by the k-nearest neighbors algorithm.

2. Reducing the dimensionality of the problem using PCA improves the performance of models induced by the k-nearest neighbors algorithm.

3. Performing standard normalization of the features improves the performance of models induced by the decision tree algorithm.

4. Reducing the dimensionality of the problem using PCA improves the performance of models induced by the decision tree algorithm.

Since our goal is not to find the best model in each case but rather to compare the effect of PCA and standard normalization on the models, we will not perform hyperparameter optimization in this task and will only use the default values from `sklearn`. Next, we will test these hypotheses on the k-NN and decision tree algorithms.

For hypothesis testing, we will run 100-fold cross-validation and treat the 100 per-fold RMSE values of each model as our sample. We will use Welch's $t$-test to compare the mean RMSE of each model before and after the transformation. Given the one-sided form of the test, our hypotheses will be:

- $H_0$: The average RMSE of the baseline (untransformed) model is equal to the average RMSE of the model after the transformation, meaning the transformation does not alter (or improves) its performance;

- $H_1$: The average RMSE of the baseline model is lower than the average RMSE of the model after the transformation, meaning the transformation worsens its performance.
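
To make the test concrete, the sketch below computes Welch's statistic and the Welch-Satterthwaite degrees of freedom by hand. The lists `rmse_base` and `rmse_transformed` are hypothetical per-fold RMSE values invented for illustration, not results from this notebook, and `welch_t` is a helper name introduced here. `scipy.stats.ttest_ind(..., equal_var=False)` computes the same two quantities before turning them into a p-value.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each sample mean (sample variance uses n - 1)
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    t = (mean(a) - mean(b)) / (se2_a + se2_b) ** 0.5
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof

# Hypothetical per-fold RMSE values, for illustration only
rmse_base = [10.0, 12.0, 11.0, 13.0, 9.0]
rmse_transformed = [14.0, 15.0, 13.0, 16.0, 17.0]

t_stat, dof = welch_t(rmse_base, rmse_transformed)
print(f't = {t_stat:.3f}, degrees of freedom = {dof:.2f}')
```

With `alternative='less'`, a strongly negative statistic like this one (the first sample has the lower mean) yields a small p-value and therefore favors $H_1$.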

Defining our confidence level and our dataset:

In [2]:
FEATURES = ['carat', 'depth', 'table']
TARGET = ['price']

df = sns.load_dataset("diamonds")
df = df.reindex(FEATURES + TARGET, axis=1)
df = df.dropna()  

x = df.reindex(FEATURES, axis=1)
y = df.reindex(TARGET, axis=1)

x = x.values
y = y.values.ravel()

SEED = 8163295

TRUST = 0.9
SIGNIFICANCE = 1 - TRUST

## k-NN

First, let's train a k-NN model without applying standard normalization or PCA. We will use 5 neighbors for all three k-NN models, as this is the default value in `sklearn`'s `KNeighborsRegressor`. Instantiating the model:

In [3]:
knn = KNeighborsRegressor()

Performing cross-validation to get the RMSE:

In [4]:
NUM_FOLDS = 100

rmse_knn = cross_val_score(
    knn,
    x,
    y,
    cv=NUM_FOLDS,
    scoring='neg_root_mean_squared_error',
)

print(f'The k-NN model achieved an average RMSE of {-rmse_knn.mean():.1f}')

The k-NN model achieved an average RMSE of 1194.1


### k-NN with Standard Normalization 

Now, let's use standard normalization before training the model. We will use the `make_pipeline` function to simplify the process:

In [5]:
knn_norm = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

Calculating the RMSE:

In [6]:
rmse_knn_norm = cross_val_score(
    knn_norm,
    x,
    y,
    cv=NUM_FOLDS,
    scoring='neg_root_mean_squared_error',
)

print(f'The k-NN model with standard normalization achieved an average RMSE of {-rmse_knn_norm.mean():.1f}')

The k-NN model with standard normalization achieved an average RMSE of 1171.7


### k-NN with PCA

Now, let's apply PCA before training the k-NN model. We will keep only the first two principal components. Note that the features are not standardized before PCA in this pipeline.

In [7]:
PCS = 2

knn_pca = make_pipeline(
    PCA(n_components=PCS),
    KNeighborsRegressor()
)

Calculating the RMSE:

In [8]:
rmse_knn_pca = cross_val_score(
    knn_pca,
    x,
    y,
    cv=NUM_FOLDS,
    scoring='neg_root_mean_squared_error',
)

print(f'The k-NN model with PCA achieved an average RMSE of {-rmse_knn_pca.mean():.1f}')

The k-NN model with PCA achieved an average RMSE of 1661.4


## Decision Tree

Let's perform the same process we did with the k-NN algorithm for the decision tree, using the default hyperparameters of the `sklearn` function.

In [9]:
tree = DecisionTreeRegressor(random_state=SEED)

In [10]:
rmse_tree = cross_val_score(
    tree,
    x,
    y,
    cv=NUM_FOLDS,
    scoring='neg_root_mean_squared_error',
)

print(f'The decision tree model achieved an average RMSE of {-rmse_tree.mean():.1f}')

The decision tree model achieved an average RMSE of 1352.5


### Decision Tree with Standard Normalization

In [11]:
tree_norm = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor(random_state=SEED)
)

In [12]:
rmse_tree_norm = cross_val_score(
    tree_norm,
    x,
    y,
    cv=NUM_FOLDS,
    scoring='neg_root_mean_squared_error',
)

print(f'The decision tree model with standard normalization achieved an average RMSE of {-rmse_tree_norm.mean():.1f}')

The decision tree model with standard normalization achieved an average RMSE of 1353.1


### Decision Tree with PCA

In [13]:
tree_pca = make_pipeline(
    PCA(n_components=PCS),
    DecisionTreeRegressor(random_state=SEED)
)

In [14]:
rmse_tree_pca = cross_val_score(
    tree_pca,
    x,
    y,
    cv=NUM_FOLDS,
    scoring='neg_root_mean_squared_error',
)

print(f'The decision tree model with PCA achieved an average RMSE of {-rmse_tree_pca.mean():.1f}')

The decision tree model with PCA achieved an average RMSE of 1980.4


## Results

With the RMSE data, let's apply Welch's $t$-test to the obtained results and determine if the hypotheses are rejected or not.

In [15]:
def reject(pvalue, significance):
    if pvalue < significance:
        return 'With this test, we should reject H0'
    else:
        return 'With this test, we should not reject H0'

In [16]:
def rmse_comparison(algorithm, rmse, transformation, rmse_transformation):
    ''' Returns if the model improved its average performance after a transformation, with respect 
    to the RMSE metric '''
    
    rmse_mean = -rmse.mean()
    rmse_transformation_mean = -rmse_transformation.mean()
    
    if rmse_mean > rmse_transformation_mean:
        return f'The {algorithm} algorithm improved, on average, its performance after the {transformation} transformation by {rmse_mean - rmse_transformation_mean}'
    else:
        return f'The {algorithm} algorithm worsened, on average, its performance after the {transformation} transformation by {-rmse_mean + rmse_transformation_mean}'

### Hypothesis 1

In [17]:
print(rmse_comparison('k-NN', rmse_knn, 'standard normalization', rmse_knn_norm))

The k-NN algorithm improved, on average, its performance after the standard normalization transformation by 22.36081972982356


Standard normalization of the features improved the average RMSE of the k-NN model. Let's see what Welch's test says about this comparison.

In [18]:
test_knn_norm = stats.ttest_ind(-rmse_knn, -rmse_knn_norm, equal_var=False, alternative='less')

pvalue_knn_norm = test_knn_norm.pvalue

print(f'{reject(pvalue_knn_norm, SIGNIFICANCE)}, given the p-value of {pvalue_knn_norm:.3f}.')

With this test, we should not reject H0, given the p-value of 0.562.


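We fail to reject $H_0$: the test gives no evidence that standard normalization worsens the k-NN model in this case. Note that, as formulated, the test only checks for a deterioration; it does not by itself establish that the observed improvement of about 22 in average RMSE is statistically significant.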

The k-NN model is a distance-based algorithm: the intuition behind it is that similar samples occupy nearby regions of the feature space, according to some predefined distance. If the features differ widely in magnitude, the distance calculation will be dominated by the features with the largest absolute values, which can hurt the model's predictions (since these are not necessarily the most important attributes for predicting the target).

When all features are normalized, each one contributes to the distance on a comparable scale, which tends to make the algorithm more effective at prediction.
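
The effect of scale on nearest-neighbor search can be seen in a small sketch. The four `points` below are hypothetical samples (not taken from the diamonds data), and `standardize` is a hand-written stand-in for `StandardScaler` that uses the population standard deviation, as `StandardScaler` does.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical samples: (small-scale feature, large-scale feature)
points = [(0.2, 50.0), (0.25, 58.0), (2.0, 51.0), (1.0, 62.0)]

def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

def nearest_to_first(rows):
    """Index of the point closest to rows[0] in Euclidean distance, excluding itself."""
    return min(range(1, len(rows)), key=lambda i: dist(rows[0], rows[i]))

raw_neighbor = nearest_to_first(points)
scaled_neighbor = nearest_to_first(standardize(points))
print('Nearest neighbor of point 0, raw features:', raw_neighbor)
print('Nearest neighbor of point 0, standardized:', scaled_neighbor)
```

On the raw features, point 2 is the nearest neighbor because the large-scale feature dominates the distance. After standardization, point 1, which is almost identical in the small-scale feature, becomes the nearest.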

### Hypothesis 2

In [19]:
print(rmse_comparison('k-NN', rmse_knn, 'PCA', rmse_knn_pca))

The k-NN algorithm worsened, on average, its performance after the PCA transformation by 467.28301504075307


As we can see, the algorithm worsened, on average, after PCA. Let's check whether $H_0$ is rejected:

In [20]:
test_knn_pca = stats.ttest_ind(-rmse_knn, -rmse_knn_pca, equal_var=False, alternative='less')

pvalue_knn_pca = test_knn_pca.pvalue

print(f'{reject(pvalue_knn_pca, SIGNIFICANCE)}, given the p-value of {pvalue_knn_pca:.3f}')

With this test, we should reject H0, given the p-value of 0.001


We reject $H_0$ in this case: the k-NN model with PCA has a significantly higher average RMSE, so the claim that PCA improves the k-NN algorithm is not supported here.

Projecting the data onto two principal components changes the distances between the points. Since k-NN relies on the intuition that points close together in feature space have similar target values, distorting those distances can weaken the relationship between proximity and similarity. This could be why PCA was not effective in this particular case. Additionally, since we had only 3 dimensions and reduced them to 2, we may have lost a lot of useful information in the process, something that might not have happened with a significantly larger number of dimensions.
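
One possible mechanism for this information loss is that PCA ranks directions by variance, not by relevance to the target, so on unscaled data a feature measured on a large scale can dominate the retained components. The sketch below illustrates this on hypothetical two-feature data. It uses the closed-form angle of the leading eigenvector of a 2x2 covariance matrix instead of `sklearn`'s `PCA`, purely to stay dependency-free; `first_principal_axis` is a name introduced here.

```python
from math import atan2, cos, sin
from statistics import mean

# Hypothetical data: column 0 has a small spread, column 1 a large one
rows = [(0.2, 55.0), (0.5, 61.0), (0.9, 58.0), (1.4, 64.0), (2.0, 57.0)]

def first_principal_axis(data):
    """Unit vector along the first principal component of 2-D data."""
    xs, ys = zip(*data)
    mx, my = mean(xs), mean(ys)
    n = len(data)
    # Entries of the sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix
    theta = 0.5 * atan2(2 * b, a - c)
    return cos(theta), sin(theta)

w0, w1 = first_principal_axis(rows)
print(f'First principal axis: ({w0:.3f}, {w1:.3f})')
```

The first axis points almost entirely along the large-variance column, so keeping only that component would discard most of what the small-scale column carries, regardless of how predictive it is.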

### Hypothesis 3

In [21]:
print(rmse_comparison('decision tree', rmse_tree, 'standard normalization', rmse_tree_norm))

The decision tree algorithm worsened, on average, its performance after the standard normalization transformation by 0.598185748386868


As we can see, the decision tree algorithm worsened slightly after applying standard normalization. Let's check if this hypothesis is rejected or not:

In [22]:
test_tree_norm = stats.ttest_ind(-rmse_tree, -rmse_tree_norm, equal_var=False, alternative='less')

pvalue_tree_norm = test_tree_norm.pvalue

print(f'{reject(pvalue_tree_norm, SIGNIFICANCE)}, given the p-value of {pvalue_tree_norm:.3f}')

With this test, we should not reject H0, given the p-value of 0.499


We fail to reject $H_0$ in this case: there is no evidence that standard normalization changes the performance of the decision tree model.

The decision tree is a search-based algorithm built on the intuition of breaking a larger problem into smaller ones. During training, the tree recursively divides the feature space into subspaces (according to the impurity function evaluated at each split node), thus simplifying the prediction. Consequently, the tree does not depend on the magnitude or spread of the features, only on the split point that yields the lowest impurity in the child nodes. Therefore, any transformation that preserves the ordering of each feature's values will not meaningfully affect how the tree is built. This is why we observe such a small difference in RMSE between the trees with and without standard normalization (about 0.6, in favor of the untransformed model); a gap this small is more plausibly explained by numerical effects, such as floating-point rounding in the scaled features, than by a real change in the model.
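
The invariance argument can be checked with a simplified, single-feature split search that minimizes the summed squared error of the two child nodes, in the spirit of a squared-error impurity criterion. This is only a sketch: `best_split_partition` and the data are hypothetical, and it is not how `sklearn` implements its trees.

```python
from statistics import mean, pstdev

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split_partition(feature, target):
    """Return the set of sample indices sent left by the lowest-SSE threshold split."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best_cost, best_left = float('inf'), None
    for k in range(1, len(order)):
        # Only split between distinct feature values
        if feature[order[k - 1]] == feature[order[k]]:
            continue
        left, right = order[:k], order[k:]
        cost = sse([target[i] for i in left]) + sse([target[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

# Hypothetical feature and target values
feature = [0.3, 0.5, 0.7, 1.0, 1.5, 2.0]
target = [400.0, 900.0, 2500.0, 5000.0, 9000.0, 15000.0]

# Standardize the feature by hand (zero mean, unit population standard deviation)
mu, sigma = mean(feature), pstdev(feature)
scaled = [(v - mu) / sigma for v in feature]

same_partition = best_split_partition(feature, target) == best_split_partition(scaled, target)
print('Same partition before and after scaling:', same_partition)
```

Because standardization is strictly increasing, the sorted order of the samples is unchanged, so every candidate split groups the same samples and has the same cost.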

We can also see that normalization makes little difference when training a single decision tree on a 90/10 train/test split of the data:

In [27]:
x_train, x_test, y_train, y_test = train_test_split(
                    x, y, test_size=0.1, random_state=SEED)

tree_total = DecisionTreeRegressor(random_state=SEED)

tree_total.fit(x_train, y_train)

y_tree_total = tree_total.predict(x_test)

tree_norm_total = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor(random_state=SEED)
)

tree_norm_total.fit(x_train, y_train)

y_tree_norm_total = tree_norm_total.predict(x_test)

print(f'The RMSE of the decision tree without normalization was {root_mean_squared_error(y_test, y_tree_total):.1f}, and with normalization was {root_mean_squared_error(y_test, y_tree_norm_total):.1f}')

The RMSE of the decision tree without normalization was 1599.5, and with normalization was 1603.0


### Hypothesis 4

In [28]:
print(rmse_comparison('decision tree', rmse_tree, 'PCA', rmse_tree_pca))

The decision tree algorithm worsened, on average, its performance after the PCA transformation by 627.908433802101


We can see that, on average, decision trees worsened significantly after PCA. Performing the hypothesis test:

In [30]:
test_tree_pca = stats.ttest_ind(-rmse_tree, -rmse_tree_pca, equal_var=False, alternative='less')

pvalue_tree_pca = test_tree_pca.pvalue

print(f'{reject(pvalue_tree_pca, SIGNIFICANCE)}, given the p-value of {pvalue_tree_pca:.3f}')

With this test, we should reject H0, given the p-value of 0.000


This was the most unexpected result among the 4 hypotheses. As we mentioned, the decision tree relies on the impurity reduction at each decision node, and we expected that after PCA the tree would have an easier time making splits, since it would be working with the components that capture the most variance in the observed data. However, the model performed significantly worse, possibly because too much information was lost by excluding a principal component.

Let's quickly analyze this new hypothesis by training two trees with all the training data and applying PCA, but without excluding any principal components:

In [31]:
tree_pca_total = make_pipeline(
    PCA(),
    DecisionTreeRegressor(random_state=SEED)
)

tree_pca_total.fit(x_train, y_train)

y_tree_pca_total = tree_pca_total.predict(x_test)

print(f'The RMSE of the decision tree without normalization was {root_mean_squared_error(y_tree_total, y_test):.1f}, and with PCA was {root_mean_squared_error(y_tree_pca_total, y_test):.1f}')

The RMSE of the decision tree without normalization was 1599.5, and with PCA was 1597.1


As we can see, the performance improved slightly, although a difference of about 2 in RMSE on a single split is small enough that it may be noise. Now, excluding one principal component:

In [32]:
tree_pca2_total = make_pipeline(
    PCA(n_components=2),
    DecisionTreeRegressor(random_state=SEED)
)

tree_pca2_total.fit(x_train, y_train)

y_tree_pca2_total = tree_pca2_total.predict(x_test)

print(f'The RMSE of the decision tree without normalization was {root_mean_squared_error(y_tree_total, y_test):.1f}, and with PCA excluding 1 PC was {root_mean_squared_error(y_tree_pca2_total, y_test):.1f}')

The RMSE of the decision tree without normalization was 1599.5, and with PCA excluding 1 PC was 2137.4


In this case, we see a significant deterioration in performance. This suggests that excluding a principal component removes too much information from our data. However, in cases with many more features, we might observe an improvement in the decision tree’s performance after dimensionality reduction with PCA.

## Conclusion

After testing the 4 initial hypotheses on the k-NN and decision tree algorithms, we have gathered indications about the influence of standard normalization and PCA on these models for this dataset. The k-NN model showed a lower average RMSE with standard normalization (with no evidence that normalization hurts it), but was significantly worsened by PCA with two components. In contrast, the decision tree is essentially unaffected by standard normalization, while the effect of PCA ranged from slightly positive, when all components were kept, to strongly negative, when one component was dropped; this last case deserves further investigation.