<a href="https://colab.research.google.com/github/prateekchandrajha/mastering-ml-algorithms/blob/main/Ch_15_Ensemble_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random forest with scikit-learn on Wines Dataset

In [3]:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [4]:
wine = load_wine()
X, Y = wine["data"], wine["target"]
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [5]:
lr = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='auto', random_state=1000)
scores_lr = cross_val_score(lr, Xs, Y, cv=10, n_jobs=-1)

In [6]:
dt = DecisionTreeClassifier(criterion='entropy',
                            max_depth=5,
                            random_state=1000)
scores_dt = cross_val_score(dt, Xs, Y, cv=10, n_jobs=-1)


In [7]:
svm = SVC(kernel='rbf',
          gamma='scale',
          random_state=1000)
scores_svm = cross_val_score(svm, Xs, Y, cv=10, n_jobs=-1)

In [8]:
print("Avg. Logistic Regression CV Score: {:.3f}".format(np.mean(scores_lr)))
print("Avg. Decision Tree CV Score: {:.3f}".format(np.mean(scores_dt)))
print("Avg. SVM CV Score: {:.3f}".format(np.mean(scores_svm)))

Avg. Logistic Regression CV Score: 0.978
Avg. Decision Tree CV Score: 0.893
Avg. SVM CV Score: 0.978


In [9]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=255,
                            n_jobs=-1,
                            criterion='entropy',
                            random_state=1000)
scores = cross_val_score(rf, Xs, Y, cv=10, n_jobs=-1)
print("Avg. Random Forest CV score: {:.3f}".format(np.mean(scores)))

Avg. Random Forest CV score: 0.983


# Feature Selection using SelectFromModel from scikit-learn

Given a dataset, it's important to remember that the predictive power of the features
depends on the target(s) being predicted. In other words, feature importance is not an
intrinsic property of the dataset (like the principal components), but rather a function
of the specific task. A large dataset containing thousands of features may be reducible
to a small fraction of them for one prediction task, while those same features could be
discarded entirely if the goal changes. If there are several targets to predict and each
of them is associated with a specific set of predictors, it can be a good idea to create
a pipeline that outputs the training/validation sets for each task. This approach has a
clear advantage over using the whole dataset: in terms of explainable AI (XAI), it's much
easier to show the responsibilities of the important features once all the factors that
don't play a primary role have been discarded. Moreover, computation is still more
expensive than storage, so keeping multiple specialized copies of the same dataset is not
an issue if this improves model performance and helps domain experts understand the
outcomes.
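
As a minimal sketch of the per-task idea, the snippet below builds one reduced copy of the Wine dataset for each of three illustrative tasks. The tasks themselves are an assumption made only for this example: each one is a binary "is this sample of class k?" problem derived from the original labels. The names `task_datasets`, `rf_task`, and `sfm_task`, and the choice of 100 trees, are likewise hypothetical. The 0.02 threshold mirrors the one used in the next cell.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

wine = load_wine()
X, Y = wine["data"], wine["target"]
feature_names = np.array(wine["feature_names"])

# One reduced dataset per task (hypothetical one-vs-rest tasks)
task_datasets = {}
for cls in np.unique(Y):
    # Binary target: 1 if the sample belongs to class `cls`, 0 otherwise
    y_task = (Y == cls).astype(int)

    rf_task = RandomForestClassifier(n_estimators=100,
                                     criterion='entropy',
                                     random_state=1000)
    rf_task.fit(X, y_task)

    # Keep only the features whose importance for this task is >= 0.02
    sfm_task = SelectFromModel(estimator=rf_task, prefit=True, threshold=0.02)
    task_datasets[int(cls)] = (sfm_task.transform(X), y_task)

    kept = feature_names[sfm_task.get_support()]
    print("Class {} vs. rest keeps {} features: {}".format(
        cls, len(kept), ", ".join(kept)))
```

If the retained feature sets differ from one task to another, that is a small-scale illustration of feature importance depending on the target rather than on the dataset alone.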


In [10]:
from sklearn.feature_selection import SelectFromModel
rf.fit(X, Y)
sfm = SelectFromModel(estimator=rf, prefit=True, threshold=0.02)
X_sfm = sfm.transform(X)
print('Feature selection shape: {}'.format(X_sfm.shape))

Feature selection shape: (178, 10)
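
To see which features passed the threshold, the importances of the fitted forest can be listed next to the selection mask. The following is a sketch that repeats the setup above so that it runs on its own. It relies on `feature_importances_` and `get_support()`, both part of the scikit-learn API. The exact ranking may vary with the library version, so no output is shown.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

wine = load_wine()
X, Y = wine["data"], wine["target"]

rf = RandomForestClassifier(n_estimators=255,
                            criterion='entropy',
                            random_state=1000)
rf.fit(X, Y)

sfm = SelectFromModel(estimator=rf, prefit=True, threshold=0.02)
X_sfm = sfm.transform(X)

importances = rf.feature_importances_   # normalized to sum to 1
mask = sfm.get_support()                # True for the retained features

# Print the features from most to least important
for idx in np.argsort(importances)[::-1]:
    print("{:<30s} {:.3f} {}".format(wine["feature_names"][idx],
                                     importances[idx],
                                     "kept" if mask[idx] else "dropped"))
```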


# AdaBoost with scikit-learn

In [11]:
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Average 10-fold CV accuracy for an increasing number of estimators
scores_ne = []
for ne in range(10, 201, 10):
    adc = AdaBoostClassifier(n_estimators=ne,
                             learning_rate=0.8,
                             random_state=1000)
    scores_ne.append(np.mean(
        cross_val_score(adc, X, Y, cv=10, n_jobs=-1)))
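
The loop above refits the ensemble from scratch for every value of `n_estimators`. A cheaper way to trace a comparable curve is sketched below: fit once per fold with the largest ensemble and read the accuracy after each boosting round with `staged_score`, which is part of the `AdaBoostClassifier` API. The manual `StratifiedKFold` loop and the names `fold_curves` and `mean_curve` are assumptions of this sketch, and the numbers are not guaranteed to match `scores_ne` exactly.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold

X, Y = load_wine(return_X_y=True)

max_estimators = 200
skf = StratifiedKFold(n_splits=10)

fold_curves = []
for train_idx, test_idx in skf.split(X, Y):
    adc = AdaBoostClassifier(n_estimators=max_estimators,
                             learning_rate=0.8,
                             random_state=1000)
    adc.fit(X[train_idx], Y[train_idx])

    # Test accuracy after 1, 2, ..., k boosting rounds
    curve = list(adc.staged_score(X[test_idx], Y[test_idx]))

    # Boosting can stop early; repeat the last value to keep a fixed length
    curve += [curve[-1]] * (max_estimators - len(curve))
    fold_curves.append(curve)

# Average over the folds: one value per ensemble size
mean_curve = np.mean(fold_curves, axis=0)

for ne in range(10, max_estimators + 1, 10):
    print("n_estimators = {:3d}: {:.3f}".format(ne, mean_curve[ne - 1]))
```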

In [12]:
from sklearn.decomposition import PCA, FactorAnalysis

# AdaBoost CV accuracy after PCA, going from 13 down to 2 components
scores_pca = []
for i in range(13, 1, -1):
    if i < 12:
        pca = PCA(n_components=i, random_state=1000)
        X_pca = pca.fit_transform(X)
    else:
        # For i >= 12 the original features are used unchanged
        X_pca = X

    # Evaluate AdaBoost on the (possibly reduced) dataset
    adc = AdaBoostClassifier(n_estimators=125,
                             learning_rate=0.8,
                             random_state=1000)
    scores_pca.append(np.mean(
        cross_val_score(adc, X_pca, Y, n_jobs=-1, cv=10)))

# Same experiment with Factor Analysis
scores_fa = []
for i in range(13, 1, -1):
    if i < 12:
        fa = FactorAnalysis(n_components=i, random_state=1000)
        X_fa = fa.fit_transform(X)
    else:
        # For i >= 12 the original features are used unchanged
        X_fa = X

    # Evaluate AdaBoost on the (possibly reduced) dataset
    adc = AdaBoostClassifier(n_estimators=125,
                             learning_rate=0.8,
                             random_state=1000)
    scores_fa.append(np.mean(
        cross_val_score(adc, X_fa, Y, n_jobs=-1, cv=10)))

This exercise confirms some important properties discussed in Chapter 13, Component
Analysis and Dimensionality Reduction. First of all, performance is not dramatically
affected even by a 50% dimensionality reduction. This observation is consistent with
the feature importance analysis performed in the previous example: decision trees can
achieve quite a good classification using only 6 or 7 features, because the remaining
ones contribute only marginally to the characterization of a sample. Moreover, in this
experiment FA is almost always superior to PCA. With 7 components, the accuracy achieved
using FA is higher than 0.95 (very close to the value achieved with no reduction), while
PCA reaches this value only with 12 components. The reader should remember that PCA is a
particular case of FA, obtained under the assumption of homoscedastic noise (the same
noise variance for every feature). The diagram confirms that this assumption is not
appropriate for the Wine dataset. Allowing different noise variances makes it possible to
model the reduced dataset more accurately, minimizing the cross-effect of the missing
features. Even if PCA is normally the first choice, with large datasets I suggest you
always compare its performance with that of Factor Analysis (FA) and choose the technique
that gives the best result, keeping in mind that FA is computationally more expensive.
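
The difference in the noise assumption can be inspected directly. The sketch below fits both models with 7 components on the unscaled Wine data, as in the experiment above, and prints the estimated noise variances. `PCA.noise_variance_` is a single scalar, while `FactorAnalysis.noise_variance_` holds one value per feature. Because the features are not standardized here, part of the spread in the FA values simply reflects the different feature scales.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA, FactorAnalysis

wine = load_wine()
X = wine["data"]

pca = PCA(n_components=7, random_state=1000).fit(X)
fa = FactorAnalysis(n_components=7, random_state=1000).fit(X)

# PCA (in its probabilistic reading) estimates one shared noise variance
print("PCA noise variance (shared): {:.4f}".format(pca.noise_variance_))

# FA estimates a separate noise variance for every feature
print("FA noise variances (per feature):")
for name, var in zip(wine["feature_names"], fa.noise_variance_):
    print("  {:<30s} {:.4f}".format(name, var))
```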