## Basic use of decision trees and ensembles (forests)
There are many ways to classify data based on its features.  Whereas many methods learn weights applied to those features, *decision trees* instead treat features as variables that control a branching path through the data: each branch selects a subset of the data according to the value of a feature, and the tree assigns a class based on the particular combination of feature values along the path.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
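
To make the branching idea concrete, here is a minimal sketch on a hypothetical toy data set (the names `x_toy`, `y_toy`, `f0`, and `f1` are illustrative assumptions, not part of the data used later). The label is 1 only when both features exceed 0.5, so a shallow tree should be able to recover that rule as two nested threshold tests.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: label is 1 only when BOTH features exceed 0.5.
rng = np.random.default_rng(0)
x_toy = rng.random((200, 2))
y_toy = ((x_toy[:, 0] > 0.5) & (x_toy[:, 1] > 0.5)).astype(int)

# A shallow tree: at most two levels of feature-threshold tests.
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(x_toy, y_toy)

# Print the learned branching rules as indented text.
rules = export_text(toy_tree, feature_names=["f0", "f1"])
print(rules)
```

Each line of the printed rules is a threshold test on one feature, and each leaf names the class assigned to data that reaches it.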

A *decision forest* is a combination of multiple such trees.  Such classifiers combine the outputs of every tree in the forest, in a basic "voting" scheme, in order to classify new data.  Each tree in the forest is designed to be somewhat different, usually by training it on a different part of the training data, a different subset of the overall set of features, and the like.  As a result, decision forests are often less prone to overfitting than a single tree, since combining diverse trees tends to average out the errors of individual trees.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
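
The following sketch illustrates how the trees' outputs are combined, using a small hypothetical data set (`x_demo`, `y_demo`, and the parameter values are assumptions chosen for illustration). Note that scikit-learn's `RandomForestClassifier` does not count hard votes; it averages the class probabilities predicted by each tree and picks the most probable class, which the code reproduces by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small hypothetical data set, for illustration only.
x_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=0)

# Each tree sees a bootstrap sample of the rows and considers a random
# subset of the features at every split, which makes the trees differ.
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(x_demo, y_demo)

# Class probabilities from every individual tree for the first five samples.
per_tree = np.stack([t.predict_proba(x_demo[:5]) for t in forest.estimators_])

# Average over trees, then take the most probable class.
avg_proba = per_tree.mean(axis=0)
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

print(manual_pred)
print(forest.predict(x_demo[:5]))
```

The two printed arrays should agree, since the forest's own prediction is built from the same per-tree probabilities.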

In [1]:
import numpy as np

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import set_config
set_config(print_changed_only=False)

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# A synthetic data set of 10 thousand entries with 25 features.  Only 5 of the
# features are informative; the other 20 are redundant (random linear
# combinations of the informative ones), so they add little practical value.
num_features = 25
num_inform = 5
x, y = make_classification(n_samples=10_000, n_features=num_features, n_informative=num_inform, 
                           n_redundant=(num_features - num_inform), random_state=13)
# Hold out 30% of the data for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)