The main purpose of this document is to introduce how to apply two classifiers, **decision tree** and **random forest**, implemented by [scikit-learn](https://scikit-learn.org/stable/). We will use the modified Iris dataset introduced in Week 2, and assume we have already completed the imputation, so there are no null values.

## 1. Data preprocessing

We first import the packages that will be used in this document.

1. [Pandas](https://pandas.pydata.org/): Pandas is an open-source Python library widely used for data manipulation, analysis, and cleaning tasks. The central data structure in Pandas is the [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) which provides methods to facilitate the preliminary examination of essential properties, statistical summaries, and a select number of rows for a cursory exploration of the data.

2. [NumPy](https://numpy.org/): NumPy is a powerful Python library for numerical and array-based computing. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions to operate on these arrays efficiently.

3. [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html): OneHotEncoder is a class in the scikit-learn library (sklearn) used for one-hot encoding categorical (nominal) features. One-hot encoding converts categorical data into binary vectors, where each category is represented by a unique vector with a 1 in the position corresponding to that category and 0s everywhere else.

4. [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html): StandardScaler is a class from scikit-learn (sklearn) used for standardization (z-score normalization). It scales each feature to have a mean of 0 and a standard deviation of 1.

5. [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html): train_test_split() is used to split a dataset into training and testing subsets, allowing users to evaluate the performance of machine learning models on unseen data.

6. [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics): sklearn.metrics includes performance metrics functions used to evaluate a classifier's performance.

7. [sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html): A decision tree classifier, limited to handling only numerical values.

8. [sklearn.ensemble.RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html): A random forest classifier, limited to handling only numerical values.

These packages will be used in the following tasks for data preprocessing, classification, and evaluation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

First, we load the data as introduced last week.

In [2]:
df = pd.read_csv('iris_modified.csv')

### One-hot Encoding

scikit-learn's implementations of the decision tree and random forest models do not natively handle nominal features. Instead, they require all features to be numerical or ordinal. If your dataset contains nominal features, you need to convert them into numerical representations, for example with one-hot encoding, before fitting the models.
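To make the idea concrete, here is a minimal sketch of one-hot encoding on a hypothetical toy column. The `colour` column and its values are invented for illustration and are not part of the Iris data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single nominal feature with three categories.
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

toy_ohe = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() converts it to a dense array.
toy_array = toy_ohe.fit_transform(toy).toarray()

print(toy_ohe.categories_[0])  # categories are sorted: ['blue' 'green' 'red']
print(toy_array)               # one row per instance, one column per category
```

Each row contains exactly one 1, in the column of the category that the instance belongs to.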

We preprocess the nominal values using the [OneHotEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) from scikit-learn. First, we apply the [fit_transform()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.fit_transform) method of the OneHotEncoder to the data, which both fits the encoder to the dataset and transforms it into a one-hot encoded array. We then store the resulting array as `feature_array` for further use in our analysis.

In [3]:
ohe = OneHotEncoder()
feature_array = ohe.fit_transform(df.iloc[:,3].to_frame()).toarray()

The attribute `categories_` in the OneHotEncoder object holds the categories of each feature determined during the fitting process. To observe the newly created features, we print the `categories_` attribute, and then store its first (and only) entry as `feature_labels` for future reference in our analysis.

In [4]:
print(ohe.categories_)

[array(['petal_width_0', 'petal_width_1', 'petal_width_2', 'petal_width_3',
       'petal_width_4'], dtype=object)]


In [5]:
feature_labels = ohe.categories_[0]

Then we utilize the pandas library to construct a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) called `features`. It takes two essential components:

1. `feature_array`: This is an array containing the transformed data after performing one-hot encoding on the nominal features of the original dataset.

2. `columns = feature_labels`: This specifies the column names for the DataFrame. feature_labels is a list (or an array-like object) containing the names of the new features created as a result of the one-hot encoding process.



In [6]:
features = pd.DataFrame(feature_array, columns = feature_labels)

When executed, this code creates a DataFrame in which each column corresponds to one category of the original nominal feature (here `petal_width_0` to `petal_width_4`), and each row represents an individual data point. Storing the one-hot encoded array together with its feature labels in tabular form makes the data easier to handle, analyze, and merge with the other features.

We then merge the original numerical features, the one-hot features, and the class column:

In [7]:
df_new = pd.concat([df.iloc[:,:3],features,df.iloc[:,4]],axis = 1)
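The column-wise merge can be sketched on a hypothetical toy table (the column names `length`, `size`, and `label` are invented for illustration). Note that `pd.concat` with `axis=1` aligns rows by index label, so the pieces should share the same index, as they do here with the default `RangeIndex`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one numerical feature, one nominal feature, one label.
toy_df = pd.DataFrame({
    "length": [1.0, 2.0, 3.0],
    "size": ["s", "l", "s"],
    "label": ["a", "b", "a"],
})

# One-hot representation of the nominal column, written out by hand here.
toy_encoded = pd.DataFrame(
    np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]),
    columns=["l", "s"],
)

# Numerical column(s) first, then the one-hot columns, then the label.
toy_merged = pd.concat([toy_df.iloc[:, :1], toy_encoded, toy_df.iloc[:, 2]], axis=1)
print(toy_merged.columns.tolist())  # ['length', 'l', 's', 'label']
```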

We can then get some statistical information on each feature of the new dataset `df_new` by [describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) and [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html).

In [8]:
print(df_new.describe())

       sepal length in cm  sepal width in cm  petal length in cm  \
count          150.000000         150.000000          150.000000   
mean             5.843333           3.054000            3.758667   
std              0.828066           0.433594            1.764420   
min              4.300000           2.000000            1.000000   
25%              5.100000           2.800000            1.600000   
50%              5.800000           3.000000            4.350000   
75%              6.400000           3.300000            5.100000   
max              7.900000           4.400000            6.900000   

       petal_width_0  petal_width_1  petal_width_2  petal_width_3  \
count     150.000000     150.000000     150.000000     150.000000   
mean        0.333333       0.100000       0.340000       0.206667   
std         0.472984       0.301005       0.475296       0.406271   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       

In [9]:
df_new.info()  # info() prints its summary directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sepal length in cm  150 non-null    float64
 1   sepal width in cm   150 non-null    float64
 2   petal length in cm  150 non-null    float64
 3   petal_width_0       150 non-null    float64
 4   petal_width_1       150 non-null    float64
 5   petal_width_2       150 non-null    float64
 6   petal_width_3       150 non-null    float64
 7   petal_width_4       150 non-null    float64
 8   class               150 non-null    object 
dtypes: float64(8), object(1)
memory usage: 10.7+ KB


As we can observe, all the features are numerical now.

### Train/test split and normalization

Last week, we briefly introduced the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. Now that we are about to train classifiers, the order of the preprocessing steps matters.

When working with machine learning models, it's crucial to avoid data leakage from the test set into the training process. Therefore, normalization across instances should be performed after splitting the data into training and test sets. This ensures that the normalization process relies solely on the training data, avoiding any data leakage from the test set.

Furthermore, there is no need to normalize the features obtained through one-hot encoding, as they already represent binary sets with values 0 or 1.

As a result, we follow this sequence:
1. First, we use [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the dataset into training and test sets. With the default settings, the test set consists of 25% of the original data, rounded up. For the current dataset with 150 instances, this gives 38 test instances and 112 training instances.

2. After splitting the data, we normalize only the **numerical features**. The scaler is fitted on the training set alone, and the same transformation is then applied to both the training and test sets, so no information from the test set leaks into the training process.

3. Since the one-hot encoded features already represent binary sets, there is no need to perform any further normalization on them.

Following this data processing sequence ensures that we have a proper setup for training and evaluating our machine learning models.
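The split sizes quoted above can be checked with a small sketch. It splits a plain array of 150 indices (a stand-in for the dataset, not the Iris data itself) with the default settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for a dataset with 150 instances.
indices = np.arange(150)

# Default test_size is 0.25; the test set size is rounded up.
train_idx, test_idx = train_test_split(indices, random_state=0)

print(len(train_idx), len(test_idx))  # 112 38
```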

In [10]:
y = df_new.iloc[:,-1].values
X_num = df_new.iloc[:,:3].values #numerical features
X_nom = df_new.iloc[:,3:-1].values #nominal features
X_num_train, X_num_test, X_nom_train, X_nom_test, y_train, y_test = train_test_split(X_num, X_nom, y, random_state = 0)

We then apply the [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to achieve the standardization (or z-score normalization). We normalize both the training and testing sets, based on the statistics calculated from the training set.

In [11]:
scaler = StandardScaler()

In [12]:
scaler.fit(X_num_train)
X_num_train = scaler.transform(X_num_train)
X_num_test = scaler.transform(X_num_test)
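As a sanity check on what `fit()` and `transform()` do, the following sketch uses synthetic data (the `toy_train` and `toy_test` arrays are invented for illustration). It shows that the test set is scaled with the training mean and standard deviation, not with its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3))

# Fit on the training data only, then transform both sets.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_std = toy_scaler.transform(toy_train)
toy_test_std = toy_scaler.transform(toy_test)

# The same z-score computed by hand with the training statistics.
manual_test_std = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(toy_train_std.mean(axis=0), 0.0))  # True
print(np.allclose(toy_train_std.std(axis=0), 1.0))   # True
print(np.allclose(toy_test_std, manual_test_std))    # True
```

The standardized test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected.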

Afterward, we use [numpy.concatenate()](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) to merge the standardized features with the one-hot encoded ones to create the complete feature set for further analysis or model training.

In [13]:
X_train = np.concatenate((X_num_train, X_nom_train), axis = 1)
X_test = np.concatenate((X_num_test, X_nom_test), axis = 1)

You may also try other normalization methods yourself.
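One possible alternative is min-max scaling with scikit-learn's `MinMaxScaler`, sketched below on hypothetical toy arrays. The same fit-on-train, transform-both pattern applies. This is only one option, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with two numerical features.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.5, 40.0]])

# Fit on the training data only: each feature is mapped to [0, 1] there.
minmax = MinMaxScaler().fit(toy_train)
toy_train_mm = minmax.transform(toy_train)
toy_test_mm = minmax.transform(toy_test)

print(toy_train_mm)
print(toy_test_mm)  # test values outside the training range fall outside [0, 1]
```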

## 2. Classifiers

### 2.1 Decision Tree

For conducting decision tree classification, we use the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) provided by scikit-learn.

By default, the parameter `criterion` is set to `'gini'`, which employs the Gini impurity as the criterion for splitting nodes in the decision tree. However, it can also be set to `'entropy'`, using information gain based on entropy.
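For reference, the two impurity measures can be computed by hand. The sketch below uses the textbook definitions (Gini impurity as one minus the sum of squared class proportions, entropy in bits); the helper names `gini_impurity` and `entropy_bits` are our own and are not part of scikit-learn.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy_bits(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

balanced = ["a", "a", "b", "b"]
print(gini_impurity(balanced))  # 0.5
print(entropy_bits(balanced))   # 1.0
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed.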

As we have already introduced information gain based on entropy in class, it may be worthwhile to experiment by setting the `criterion` parameter to `'entropy'` and observing how the results differ from those of the default `'gini'` criterion. We leave that as a self-practice exercise.

Switching the criterion allows us to assess how the decision tree's performance varies with different impurity measures and helps us gain insights into the impact of criterion selection on the model's decision boundaries and accuracy.

Feel free to explore both `'gini'` and `'entropy'` criteria to better understand their effects on decision tree classification in our analysis.

Note that scikit-learn's DecisionTreeClassifier does not have built-in support for splitting nodes using the gain ratio criterion. The gain ratio is an alternative to information gain that takes into account the number of categories (branches) of each attribute when making splits. To experiment with it, you can write your own decision tree implementation that uses the gain ratio, or use another library that offers this feature. We leave that as an optional extension for those interested in further experimentation.

Concretely, we first initialize a DecisionTreeClassifier model object:
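As a starting point for that extension, here is a minimal sketch of the gain ratio for a single nominal attribute, using the usual definition (information gain divided by split information). The helpers `entropy_bits` and `gain_ratio` are hypothetical names of our own, not scikit-learn functions, and this is not a full decision tree.

```python
import numpy as np

def entropy_bits(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(attribute, labels):
    """Information gain of splitting on `attribute`, divided by split information."""
    attribute = np.asarray(attribute)
    labels = np.asarray(labels)
    # Weighted entropy of the labels within each branch of the split.
    conditional = 0.0
    for value in np.unique(attribute):
        mask = attribute == value
        conditional += mask.mean() * entropy_bits(labels[mask])
    info_gain = entropy_bits(labels) - conditional
    # Split information is the entropy of the attribute itself.
    split_info = entropy_bits(attribute)
    return info_gain / split_info if split_info > 0 else 0.0

toy_labels = ["x", "x", "y", "y"]
print(gain_ratio(["a", "a", "b", "b"], toy_labels))  # 1.0 (perfect split)
print(gain_ratio(["a", "b", "a", "b"], toy_labels))  # 0.0 (uninformative split)
```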

In [14]:
dt = DecisionTreeClassifier(random_state = 0)

The [fit()](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) method is a fundamental function in scikit-learn's machine learning models used for training the model on the provided training data.

We then instruct the decision tree model `dt` to learn from the provided training data. During this process, the model will analyze the features and target labels to create a decision tree that can make predictions based on the patterns and relationships it identifies in the data.

After the `fit()` method completes its execution, the `dt` model will be trained and ready to make predictions on new, unseen data based on the knowledge it has acquired from the training dataset.

In [15]:
dt = dt.fit(X_train, y_train)
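Once a tree has been fitted, it can be inspected. The sketch below is self-contained, so it fits a tree on scikit-learn's built-in (unmodified) Iris data rather than on our `X_train`; the printed rules will therefore differ from those of the model trained in this document.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the standard Iris dataset, not the modified one used above.
iris = load_iris()
demo_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

print("depth:", demo_tree.get_depth())
print("leaves:", demo_tree.get_n_leaves())

# Text rendering of the learned split rules.
demo_rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(demo_rules)
```

The same calls can be applied to `dt` to see which features the tree in this document actually splits on.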

Then, we classify the test data by [predict()](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict).

In [16]:
y_pred_dt = dt.predict(X_test) 

Using the methods introduced last week, we can evaluate the test performance.

In [17]:
acc_dt = metrics.accuracy_score(y_test, y_pred_dt)
print("The test accuracy of decision tree on the dataset is: ", acc_dt)

The test accuracy of decision tree on the dataset is:  0.9473684210526315


In [18]:
f1_dt = metrics.f1_score(y_test, y_pred_dt, average='macro')
print("The test macro f1-score of decision tree on the dataset is: ", f1_dt)

The test macro f1-score of decision tree on the dataset is:  0.9444444444444445
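To clarify what `average='macro'` means, the following sketch on hypothetical toy labels shows that the macro F1-score is the unweighted mean of the per-class F1-scores. The label lists are invented for illustration.

```python
import numpy as np
from sklearn import metrics

# Hypothetical toy labels for a three-class problem.
toy_true = ["a", "a", "b", "b", "c", "c"]
toy_pred = ["a", "b", "b", "b", "c", "a"]

# Rows are true classes, columns are predicted classes.
print(metrics.confusion_matrix(toy_true, toy_pred))

per_class_f1 = metrics.f1_score(toy_true, toy_pred, average=None)
macro_f1 = metrics.f1_score(toy_true, toy_pred, average="macro")

print(per_class_f1)
print(np.isclose(per_class_f1.mean(), macro_f1))  # True
```

Because every class contributes equally, the macro F1-score is sensitive to poor performance on small classes.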


### 2.2 Random Forest

Similar to the process described above, we utilize the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) provided in scikit-learn for our classification tasks.

The `RandomForestClassifier` is an ensemble learning method that constructs multiple decision trees and combines their predictions to improve overall accuracy and reduce overfitting. By default, the parameter `criterion` in the `RandomForestClassifier` is set to `'gini'`, which uses the Gini impurity as the criterion for splitting nodes in the individual decision trees.

Feel free to explore both `'gini'` and `'entropy'` criteria in your random forest classification to further enhance your comprehension of their effects on the model's performance.
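To illustrate how the ensemble combines its trees, the sketch below fits a small forest on scikit-learn's built-in (unmodified) Iris data and checks that the forest's class probabilities are the average of the individual trees' probabilities. The choice of `n_estimators=25` is arbitrary and only keeps the example fast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the standard Iris dataset, not the modified one used above.
X_demo, y_demo = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# The fitted trees are stored in the estimators_ attribute.
print(len(forest.estimators_))  # 25

# Average the per-tree class probabilities and compare with the forest.
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X_demo)))  # True
```

The predicted class is then the one with the highest averaged probability.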

In [19]:
rf = RandomForestClassifier(random_state=0)

In [20]:
rf = rf.fit(X_train,y_train)

Then, we classify the test data with [predict()](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict).

In [21]:
y_pred_rf = rf.predict(X_test)

We then evaluate the predictions in the same way as for the decision tree:

In [22]:
acc_rf = metrics.accuracy_score(y_test, y_pred_rf)
print("The test accuracy of random forest on the dataset is: ", acc_rf)

The test accuracy of random forest on the dataset is:  0.9736842105263158


In [23]:
f1_rf = metrics.f1_score(y_test, y_pred_rf, average='macro')
print("The test macro f1-score of random forest on the dataset is: ", f1_rf)

The test macro f1-score of random forest on the dataset is:  0.9717034521788342


We can see that the random forest performs better than the decision tree on this dataset.

Author: *Kaki Zhou* 3/8/2023 