<center><img src="images/trojan-horse.png" alt="drawing" width="7500" style="background-color:white; padding:1em;" /></center> <br/>

<div class="alert alert-block alert-success"><span style="color:blue"><h1>Trojan Detection through ML</h1></span></div>
<div class="alert alert-block alert-warning">
<span style="color:blue"><h2>Using Ensemble Learning</h2></span></div>

Ensemble methods create a strong model by combining the predictions of multiple weak models (also known as weak learners or base estimators) that are built with a given dataset and a given learning algorithm.

There are three major kinds of meta-algorithms that aim to combine weak learners:

- <span style="color:red"><b>Bagging,</b></span> that often considers homogeneous weak learners, learns them independently from each other in parallel and combines them following some kind of deterministic averaging process
- <span style="color:red"><b>Boosting,</b></span> that often considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy
- <span style="color:red"><b>Stacking,</b></span> that often considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on the different weak models' predictions
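
As a rough illustration of the three strategies, the following sketch fits one scikit-learn ensemble of each kind. The synthetic data from `make_classification`, the choice of base models, and all hyperparameter values are assumptions made for this example only; they are not taken from the Trojan Detection project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

ensembles = {
    # Bagging: independent trees on bootstrap samples, combined by voting
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=10, random_state=0
    ),
    # Boosting: shallow trees fitted sequentially, each correcting the previous ones
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Stacking: heterogeneous base models whose predictions feed a meta-model
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(random_state=0)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(),
    ),
}

scores = {}
for name, model in ensembles.items():
    scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

The scores on this toy data say nothing about how the methods rank on the real dataset; the point is only that all three share the same `fit`/`score` interface.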

----

__Trojan Detection Dataset__

In this project, the CulinaryML team will work with historical trojan detection data in the [Trojan Detection Dataset](https://www.kaggle.com/datasets/subhajournal/trojan-detection/code). The target field of the dataset (**Class**) is the outcome of detection: <span style="color:red"><b>1 for Trojan and 0 for Benign.</b></span> Multiple features are used in the dataset.

__Dataset schema:__
<span style="color:blue">
- __ID:__ Unique ID of the Packet
- __Flow ID:__ Unique ID of the Packet Flow
- __Source IP:__ Source IP address
- __Source Port:__ Source TCP/User Datagram Protocol (UDP) ports
- __Destination IP:__ Destination IP address
- __Destination Port:__ Destination TCP/User Datagram Protocol (UDP) ports
- __Protocol:__ TCP flags and encapsulated protocol (TCP/UDP)
- __Flow Duration:__ Duration of Packet Flow
- __Total Fwd Packets:__ Number of Forward Packets
- __Total Backward Packets:__ Number of Backward Packets
- __Total Length of Fwd Packets:__ Length of Forward Packet
- __Total Length of Bwd Packets:__ Length of Backward Packet
- __Fwd Packet Length Max:__ Length of Forward Packet (Max)
- __Fwd Packet Length Min:__ Length of Forward Packet (Min)
- __Fwd Packet Length Mean:__ Length of Forward Packet (Mean)
- __Fwd Packet Length Std:__ Length of Forward Packet (STD)
- __Bwd Packet Length Max:__ Length of Backward Packet (Max)
- __Bwd Packet Length Min:__ Length of Backward Packet (Min)
- __Bwd Packet Length Mean:__ Length of Backward Packet (Mean)
- __Bwd Packet Length Std:__ Length of Backward Packet (STD)
- __Fwd IAT Total:__ IAT Total
- __Fwd Header Length:__ Length of Forward Header
- __Bwd Header Length:__ Length of Backward Header
- __Min Packet Length:__ Packet Length (Min)
- __Max Packet Length:__ Packet Length (Max)
- __Packet Length Mean:__ Packet Length (Mean)
- __Packet Length Std:__ Packet Length (STD)
- __Packet Length Variance:__ Packet Length (Variance)
- __Average Packet Size:__ Packet Size
- __Avg Fwd Segment Size:__ Forward Segment Size
- __Avg Bwd Segment Size:__ Backward Segment Size
- __Fwd Header Length.1:__ Forward Header Length (duplicate column)
- __Class:__ Trojan or Benign
</span>
----

<center><img src="images/culinaryML.png" alt="drawing" width="500" style="background-color:white; padding:1em;" /></center>
<div class="alert alert-block alert-success">
<h1><span style="color:blue">CulinaryML Process</span></h1></div>

- [Data Collection](#Data-Collection)
- [Feature Engineering](#Feature-Engineering)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Preparation](#Data-Preparation)
- [Model Building](#Model-Building)
- [Bagging](#Bagging)
- [Random Forest](#Random-Forest)
- [Boosting](#Boosting)
- [Model Evaluation](#Model-Evaluation)

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue"> Data Collection</span></h1></div>

Before CulinaryML builds a model, we need to collect the data. 

In [None]:
%%capture
# Install libraries
!pip install -U -q -r requirements.txt

In [None]:
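import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
# Imports used by later cells (pipelines, encoders, and evaluation metrics)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix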

In [None]:
df = pd.read_csv("data/Trojan_Detection_modified.csv",sep=",")
pd.set_option("display.max_columns", None)

print("The shape of the dataset is:", df.shape)

In [None]:
df.head(10)

Convert the proposed target feature **(Class)** to binary and store it in a new column named **"Malware Type"**.

In [None]:
d = {"Benign": 0, "Trojan": 1}
df["Malware Type"] = df["Class"].map(d)
df.head(20)

In [None]:
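# Remove the dots from the IP addresses; regex=False treats '.' as a literal character
df['Source IP'] = df['Source IP'].str.replace('.', '', regex=False)
df['Destination IP'] = df['Destination IP'].str.replace('.', '', regex=False)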

In [None]:
df.head(10)

In [None]:
df.info()

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Feature Engineering</span></h1></div>

Remove features (columns) that are not used for modeling. Fewer, more relevant inputs reduce noise and the risk of over-fitting.

The previous cells used commands that report the number of rows, the number of columns, and some simple statistics.


In [None]:
# Feature Engineering

df.drop(
    [
        "Flow ID",
        "Total Fwd Packets",
        "Total Backward Packets",
        "Total Length of Fwd Packets",
        "Total Length of Bwd Packets",
        "Fwd Packet Length Max",
        "Fwd Packet Length Min",
        "Fwd Packet Length Mean",
        "Fwd Packet Length Std",
        "Bwd Packet Length Max",
        "Bwd Packet Length Min",
        "Bwd Packet Length Mean",
        "Bwd Packet Length Std",
        "Fwd IAT Total",
        "Fwd Header Length",
        "Bwd Header Length",
        "Min Packet Length",
        "Max Packet Length",
        "Packet Length Mean",
        "Packet Length Std",
        "Packet Length Variance",
        "Average Packet Size",
        "Avg Fwd Segment Size",
        "Avg Bwd Segment Size",
        "Fwd Header Length.1",
        "Class",
    ],
    axis=1,
    inplace=True,
)

In [None]:
df.head(20)

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Exploratory Data Analysis (EDA)</span></h1></div>

CulinaryML takes an analysis approach that identifies general patterns in the data. These patterns include outliers and features of the data that might be unexpected.

In [None]:
df.isnull().sum()

In [None]:
sns.countplot(x=df['Malware Type'])

In [None]:
print("Malware distribution from the Trojan Detection set:")
print(df['Malware Type'].value_counts())

In [None]:
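from sklearn.preprocessing import LabelEncoder

def le(df):
    """Label-encode every object (string) column of df in place."""
    for col in df.columns:
        if df[col].dtype == 'object':
            label_encoder = LabelEncoder()
            df[col] = label_encoder.fit_transform(df[col])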

In [None]:
df.shape

In [None]:
# Create lists of the features and name the target

# Numerical features 
numerical_features = [
    'Flow Duration',
    'Source Port',
    'Destination Port',
    'Protocol' 
]

# Based on exploratory data analysis (EDA), select the categorical features
categorical_features = [
                 'Source IP',
                 'Destination IP'
                              
]

model_features = numerical_features + categorical_features
model_target = ['Malware Type']

To review the numerical features, use the `value_counts()` function to get a view of the feature values in respective bins.

In [None]:
# Print and plot statistics for the numerical features
for c in numerical_features:
    # Print the name of the feature
    print(c)
    # Print the value counts in 10 bins for each feature
    print(df[c].value_counts(bins=10, sort=False))

    # Plot bar charts based on value_counts (alternative plot method)
    df[c].value_counts(bins=10, sort=False).plot(kind="bar", alpha=0.75, rot=45)
    plt.show()

In [None]:
df[model_features].head(10)

In [None]:
df[model_target].head(20)

In [None]:
df.describe(include='object')

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Data Processing</span></h1></div>

Next, select the model features and the target to prepare the data.

In [None]:
X = df[model_features]
y = df[model_target]

In [None]:
print(X)

In [None]:
print(y)

The data is now prepared, and you are ready to create a classifier.

---
<div class="alert alert-block alert-success">
    <h1><span style="color:blue">Model Selection</span></h1></div>

To train a logistic regression model, you will use sklearn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). 


<div class="alert alert-block alert-warning">
<h3><span style="color:blue">Trojan Classification</span></h3></div>

   (**Malware Type**) is the outcome of detection:<span style="color:red"><b> 1 for Trojan and 0 for Benign.</b></span>

<center><img src="images/logistic_function.png" alt="drawing" width="800" style="background-color:white; padding:1em;" /></center>

In [None]:
# Convert the data to NumPy arrays for easy processing.
# ravel() flattens the single-column target into the 1-D shape that sklearn expects.
X = np.asarray(X)
y = np.asarray(y).ravel()

In [None]:
print(X)

In [None]:
print(y)

---
<div class="alert alert-block alert-warning">
<h3><span style="color:blue">Performance</span></h3></div>

   (**Malware Type**) is the outcome of detection:<span style="color:red"><b> 1 for Trojan and 0 for Benign.</b></span>

<center><img src="images/performance.png" alt="drawing" width="800" style="background-color:white; padding:1em;" /></center>


---
<div class="alert alert-block alert-warning">
<h3><span style="color:blue">Process the data using Cross Validation with GridSearchCV and RandomizedSearchCV</span></h3></div>

In <span style="color:red"><b>GridSearchCV,</b></span> cross-validation is performed along with the grid search. Cross-validation is used while training the model. Before training, the data is divided into two parts: train data and test data. Cross-validation then divides the train data further into two parts: the train data and the validation data.

The most popular type of cross-validation is K-fold cross-validation. It is an iterative process that divides the train data into k partitions. Each iteration holds out one partition for validation and uses the remaining k-1 partitions to train the model. The next iteration holds out the next partition and trains on the remaining k-1, and so on. Each iteration records the performance of the model, and the final result is the average over all iterations.
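
The partitioning can be made concrete with scikit-learn's `KFold`. This is a minimal sketch on ten toy samples (an assumption for illustration, not the project data); it only prints which indices are used for training and which are held out in each iteration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (illustrative only)
X_toy = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=False)
held_out = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_toy), start=1):
    # One partition is held out; the remaining k-1 partitions train the model
    print(f"fold {fold}: train={train_idx.tolist()} held out={val_idx.tolist()}")
    held_out.extend(val_idx.tolist())
```

Across the five iterations, every sample is held out exactly once.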

GridSearchCV primarily takes four arguments: estimator, param_grid, scoring, and cv. They are described as follows:

**1. estimator** – A scikit-learn model

**2. param_grid** – A dictionary with parameter names as keys and lists of parameter values.

**3. scoring** – The performance measure. For example, ‘r2’ for regression models, ‘precision’ for classification models.

**4. cv** – An integer that is the number of folds for K-fold cross-validation.

<span style="color:red"><b>GridSearchCV</b></span> can be used on several hyperparameters to get the best values for the specified hyperparameters.
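
The four arguments can be seen together in a small sketch. The synthetic data, the reduced grid, and the choice of `scoring="accuracy"` are assumptions for illustration; the project cells use their own data and grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

grid_demo = GridSearchCV(
    estimator=SVC(),                                         # 1. the model
    param_grid={"kernel": ["linear", "rbf"], "C": [1, 10]},  # 2. values to try
    scoring="accuracy",                                      # 3. performance measure
    cv=5,                                                    # 4. number of folds
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print(round(grid_demo.best_score_, 3))
# Every combination in the grid is evaluated: 2 kernels x 2 values of C
print(len(grid_demo.cv_results_["params"]))
```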
    
Random search cross-validation <span style="color:red"><b>(RandomizedSearchCV)</b></span> is another powerful technique for optimizing the hyperparameters of a machine learning model. It works in a similar way to grid search cross-validation, but instead of searching over a predefined grid of hyperparameters, it samples them randomly from a distribution.
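
To show what sampling from a distribution looks like, the sketch below draws `C` from a log-uniform distribution instead of a fixed list. The synthetic data, the range of `C`, and `n_iter=8` are assumptions chosen for this example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

random_demo = RandomizedSearchCV(
    estimator=SVC(),
    param_distributions={
        "kernel": ["linear", "rbf"],     # a list is sampled uniformly
        "C": loguniform(1e-1, 1e2),      # a distribution is sampled with .rvs()
    },
    n_iter=8,         # number of sampled candidates, independent of any grid size
    cv=5,
    random_state=0,   # makes the sampled candidates reproducible
)
random_demo.fit(X_demo, y_demo)

print(random_demo.best_params_)
print(len(random_demo.cv_results_["params"]))
```

When every entry in `param_distributions` is a list, as in the notebook cells, the search samples candidates from that list without replacement, so it behaves like a grid search that stops after `n_iter` combinations.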

In [None]:
# importing the dependencies
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore")

---
<div class="alert alert-block alert-warning">
<h3><span style="color:blue">GridSearchCV</span></h3></div>

In [None]:
# loading the SVC model
model = SVC()

In [None]:
# hyperparameters

parameters = {
              'kernel':['linear','poly','rbf','sigmoid'],
              'C':[1, 5, 10, 20]
}

In [None]:
# grid search
clf = GridSearchCV(model, parameters, cv=5)

In [None]:
# fitting the data to our model
clf.fit(X, y)

In [None]:
clf.cv_results_

In [None]:
# best parameters

best_parameters = clf.best_params_
print(best_parameters)

In [None]:
# highest accuracy

highest_accuracy = clf.best_score_
print(highest_accuracy)

In [None]:
# loading the results to pandas dataframe
result = pd.DataFrame(clf.cv_results_)

In [None]:
result.head()

In [None]:
grid_search_result = result[['param_C','param_kernel','mean_test_score']]

In [None]:
grid_search_result

Highest Accuracy = 84%

Best Parameters = {'C':1, 'kernel':'sigmoid'}

---
<div class="alert alert-block alert-warning">
<h3><span style="color:blue">RandomizedSearchCV</span></h3></div>

In [None]:
# loading the SVC model
model = SVC()

# hyperparameters

parameters = {
              'kernel':['linear','poly','rbf','sigmoid'],
              'C':[1, 5, 10, 20]
}

# randomized search
clf = RandomizedSearchCV(model, parameters, cv=5)

In [None]:
# fitting the data to our model
clf.fit(X, y)

In [None]:
clf.cv_results_

In [None]:
# best parameters

best_parameters = clf.best_params_
print(best_parameters)

In [None]:
# highest accuracy

highest_accuracy = clf.best_score_
print(highest_accuracy)

In [None]:
# loading the results to pandas dataframe
result = pd.DataFrame(clf.cv_results_)

In [None]:
result.head()

In [None]:
randomized_search_result = result[['param_C','param_kernel','mean_test_score']]

In [None]:
randomized_search_result

Highest Accuracy = 84%

Best Parameters = {'C':1, 'kernel':'sigmoid'}

In [None]:
# importing the models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# list of baseline models
models = [LogisticRegression(max_iter=1000), SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier(random_state=0)]

In [None]:
def compare_models_cross_validation():

  for model in models:

    cv_score = cross_val_score(model, X, y, cv=5)
    mean_accuracy = sum(cv_score)/len(cv_score)
    mean_accuracy = mean_accuracy*100
    mean_accuracy = round(mean_accuracy, 2)

    print('Cross Validation accuracies for the',model,'=', cv_score)
    print('Accuracy score of the ',model,'=',mean_accuracy,'%')
    print('---------------------------------------------------------------')

In [None]:
compare_models_cross_validation()

Inference: For the Trojan Detection dataset, the Random Forest Classifier has the highest accuracy with default hyperparameter values.

Comparing the models with different hyperparameter values using GridSearchCV

In [None]:
# list of models
models_list = [LogisticRegression(max_iter=10000), SVC(), KNeighborsClassifier(), RandomForestClassifier(random_state=0)]

In [None]:
# creating a dictionary that contains hyperparameter values for the above mentioned models


model_hyperparameters = {
    

    'log_reg_hyperparameters': {
        
        'C' : [1,5,10,20]
    },

    'svc_hyperparameters': {
        
        'kernel' : ['linear','poly','rbf','sigmoid'],
        'C' : [1,5,10,20]
    },


    'KNN_hyperparameters' : {
        
        'n_neighbors' : [3,5,10]
    },


    'random_forest_hyperparameters' : {
        
        'n_estimators' : [10, 20, 50, 100]
    }
}

In [None]:
print(model_hyperparameters.keys())

In [None]:
model_hyperparameters['log_reg_hyperparameters']

In [None]:
model_keys = list(model_hyperparameters.keys())
print(model_keys)

In [None]:
model_keys[0]

In [None]:
model_hyperparameters[model_keys[0]]

Applying GridSearchCV

In [None]:
def ModelSelection(list_of_models, hyperparameters_dictionary):

  result = []

  i = 0

  for model in list_of_models:

    key = model_keys[i]

    params = hyperparameters_dictionary[key]

    i += 1

    print(model)
    print(params)
    print('---------------------------------')


    clf = GridSearchCV(model, params, cv=5)

    # fitting the data to classifier
    clf.fit(X,y)

    result.append({
        'model used' : model,
        'highest score' : clf.best_score_,
        'best hyperparameters' : clf.best_params_
    })

  result_dataframe = pd.DataFrame(result, columns = ['model used','highest score','best hyperparameters'])

  return result_dataframe

In [None]:
ModelSelection(models_list, model_hyperparameters)

Random Forest Classifier with n_estimators = 100 has the highest accuracy

Finally, train the classifier with <span style="color:red"><b>.fit()</b></span> on the training dataset. 

---
<div class="alert alert-block alert-success">
    <h1><span style="color:blue">Model Training</span></h1></div>

---
<div class="alert alert-block alert-warning">
<h3><span style="color:blue">Create training and test datasets</span></h3></div>


As part of data preparation, the dataset is split into training and test subsets by using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

For this notebook, you will use 80 percent of the data for the training set and 20 percent for the test set. Determine the best split based on the size of your dataset.
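
The 80/20 split can be checked on toy data before applying it to the real dataset. In this sketch the arrays and the use of `stratify` are assumptions for illustration; stratifying is one common way to keep the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with imbalanced labels: 70 of class 0 and 30 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=23
)

print(X_tr.shape, X_te.shape)    # sizes of the training and test subsets
print(y_tr.mean(), y_te.mean())  # share of class 1 in each subset
```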

In [None]:
# Select the model features
X = df[model_features]

In [None]:
y = df[model_target]

In [None]:
# Use the `replace()` function to remove any remaining dots
df['Destination IP'] = df['Destination IP'].replace(r'\.', '', regex=True)

# Convert the columns to floats
#df[['Source IP', 'Destination IP']] = df[['Source IP', 'Destination IP']].astype(float)

# Print the DataFrame
df.head()


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=23)

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
X_train.shape

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:
y_test.shape

In [None]:
X_train.info()

---
<div class="alert alert-block alert-warning">
<h3><span style="color:blue">Process the data with a pipeline and ColumnTransformer</span></h3></div>

In a typical ML workflow, you need to apply data transformations, such as imputation and scaling, at least twice: first on the training dataset by using <span style="color:red"><b>.fit()</b></span> and <span style="color:red"><b>.transform()</b></span> when preparing the data to train the model, and then by using <span style="color:red"><b>.transform()</b></span> on any new data that you want to predict on (validation or test). Sklearn's [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is a tool that simplifies this process by enforcing the implementation and order of data processing steps, which is important for reproducibility. In other words, all the data is transformed the same way each time that you process any part of it.
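
The fit-then-transform pattern can be sketched with a single scaler. The toy values are assumptions for illustration; the point is that the minimum and maximum are learned from the training data only and then reused on new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_values = np.array([[0.0], [5.0], [10.0]])  # toy training column
new_values = np.array([[5.0], [20.0]])           # toy validation/test column

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_values)  # learn min=0 and max=10, then scale
new_scaled = scaler.transform(new_values)          # reuse the training min and max

print(train_scaled.ravel())
print(new_scaled.ravel())
```

Because the scaler is never refitted on the new data, a value outside the training range (20 here) maps outside the interval [0, 1] instead of silently changing the scale.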

In this section, you will build separate pipelines to handle the numerical and categorical features. Then, you will combine them into a composite pipeline along with an estimator. To do this, you will use a [LogisticRegression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

You will need multiple pipelines to ensure that all the data is handled correctly:

* __Numerical features pipeline:__ Impute missing values with the mean by using sklearn's SimpleImputer, followed by a MinMaxScaler. If different processing is desired for different numerical features, build a separate pipeline for each group of features. See the <span style="color:red"><b>numerical_processor</b></span> in the following code cell.

* __Categoricals pipeline:__ Impute with a placeholder value (this won't have an effect because you already encoded the 'nan' values), and encode with sklearn's OneHotEncoder. If computing memory is an issue, it is a good idea to check the number of unique values for the categoricals to get an estimate of how many dummy features one-hot encoding will create. Note the <span style="color:red"><b>handle_unknown</b></span> parameter, which tells the encoder to ignore (rather than throw an error for) any unique value that might show in the validation or test set that was not present in the initial training set. See the <span style="color:red"><b>categorical_processor</b></span> in the following code cell.

Finally, the selective preparations of the dataset features are then put together into a collective ColumnTransformer, which is used in a pipeline along with an estimator. This ensures that the transforms are performed automatically in all situations. This includes on the raw data when fitting the model, when making predictions, when evaluating the model on a validation dataset through cross-validation, or when making predictions on a test dataset in the future.
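
The composite pipeline described above might be sketched as follows. The toy DataFrame, its column names (`duration`, `protocol_name`), and the `_demo` variable names are hypothetical and chosen only for this example; the real feature lists are defined earlier in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data with one numerical and one categorical column
toy_train = pd.DataFrame({
    "duration": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
    "protocol_name": ["tcp", "udp", "tcp", "udp", "tcp", "udp"],
    "label": [0, 1, 0, 1, 0, 1],
})
# New row with a category that was never seen during training
toy_new = pd.DataFrame({"duration": [25.0], "protocol_name": ["icmp"]})

# Numerical features: mean imputation followed by min-max scaling
numerical_demo = Pipeline([
    ("num_imputer", SimpleImputer(strategy="mean")),
    ("num_scaler", MinMaxScaler()),
])
# Categorical features: placeholder imputation followed by one-hot encoding
categorical_demo = Pipeline([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
])
# Route each group of columns to its own processor
preprocessor_demo = ColumnTransformer([
    ("numerical_pre", numerical_demo, ["duration"]),
    ("categorical_pre", categorical_demo, ["protocol_name"]),
])
# Preprocessing and estimator in a single object
pipeline_demo = Pipeline([
    ("data_preprocessing", preprocessor_demo),
    ("lr", LogisticRegression()),
])

pipeline_demo.fit(toy_train[["duration", "protocol_name"]], toy_train["label"])
prediction = pipeline_demo.predict(toy_new)
print(prediction)
```

Thanks to `handle_unknown="ignore"`, the unseen category `icmp` is encoded as all zeros rather than raising an error at prediction time.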

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipeline.fit(X_train, y_train).score(X_test, y_test)

# An estimator's parameter can be set using '__' syntax
pipeline.set_params(svc__C=10).fit(X_train, y_train).score(X_test, y_test)

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Data Modeling</span></h1></div>

---
<div class="alert alert-block alert-success">
<h2><span style="color:blue">Bagging</span></h2></div>


   (**Malware Type**) is the outcome of detection:<span style="color:red"><b> 1 for trojan and 0 for benign.</b></span>

<center><img src="Hyperparameter_Optimization_using_Grid_Search.svg.png" alt="drawing" width="800" style="background-color:white; padding:1em;" /></center>

In this section, you will build your first ensemble model by using the bootstrap aggregating, or bagging, approach. With this approach, you randomly draw multiple data subsets from the training set (with replacement) and train one model for each subset.

The first approach will use multiple trees in the bagging model.
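
Drawing with replacement can be sketched with NumPy alone. The sample size and the number of models are assumptions for illustration; the sketch only draws row indices and does not train any model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # hypothetical training-set size
n_models = 10     # one bootstrap sample per model in the ensemble

unique_fractions = []
for _ in range(n_models):
    # Draw n_samples row indices with replacement:
    # some rows appear several times, others are left out
    bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
    unique_fractions.append(np.unique(bootstrap_idx).size / n_samples)

print(np.round(unique_fractions, 3))
```

Each bootstrap sample is expected to contain about 63% of the distinct training rows (1 - 1/e), which is why the models in a bagging ensemble differ from one another even though they share one training set.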

---
<div class="alert alert-block alert-warning">
<h3><span style="color:blue">Data processing with a pipeline and a bagging ColumnTransformer</span></h3></div>


You need to use different pipelines to handle the numerical and categorical features. Then, you will combine them into a composite pipeline along with an estimator. To do this, you will use a [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html).

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    GradientBoostingClassifier,
)

### COLUMN_TRANSFORMER ###
##########################

# Preprocess the numerical features
numerical_processor = Pipeline(
    [
        (
            "num_scaler",
            MinMaxScaler(),
        )  # Shown in case it is needed. Not a must with decision trees.
    ]
)

# Preprocess the categorical features
# handle_unknown tells it to ignore (rather than throw an error for) any value
# that was not present in the initial training set.
# This processor must be defined because the ColumnTransformer below uses it.
categorical_processor = Pipeline(
    [("cat_encoder", OneHotEncoder(handle_unknown="ignore"))]
)


# Combine all data preprocessors from above (add more if you choose to define more)
# For each processor/step, specify a name, the actual process, and the features to be processed
data_preprocessor = ColumnTransformer(
    [
        ("numerical_pre", numerical_processor, numerical_features),
        ("categorical_pre", categorical_processor, categorical_features)
    ]
)

### PIPELINE ###
################

# Pipeline with all desired data transformers, along with an estimator
# Later, you can set/reach the parameters by using the names issued - for hyperparameter tuning, for example

#####################################################
### Notice the pipeline using a BaggingClassifier ###
#####################################################
pipeline = Pipeline(
    [
        ("data_preprocessing", data_preprocessor),
        (
            "bg",
            BaggingClassifier(
                DecisionTreeClassifier(max_depth=25),  # Each tree has max_depth=25
                n_estimators=10,
            ),
        ),
    ]
)  # Use 10 trees

# Visualize the pipeline
# This will be helpful especially when building more complex pipelines,
# stringing together multiple preprocessing steps
from sklearn import set_config

set_config(display="diagram")
pipeline

Now you can fit the bagging model, and see the training and test scores.

In [None]:
# X_train, X_test, y_train, and y_test come from train_test_split() above
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Fit the pipeline on the training data only,
# which avoids leaking the test set into the train set
pipeline.fit(X_train, np.ravel(y_train))

# Use the fitted pipeline to make predictions on the training dataset
train_predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Accuracy (training):", accuracy_score(y_train, train_predictions))

# Use the fitted pipeline to make predictions on the testing dataset
test_predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Accuracy (test):", accuracy_score(y_test, test_predictions))

<center><img src="Confusion Matrix.png" alt="drawing" width="500" style="background-color:white; padding:1em;" /></center>


Using a bagging classifier isn't difficult because it only requires updating one line of code.

Next, you will create a random forest model.

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Random forest</span></h1></div>


Now, you will try the second ensemble model: random forest. Random forest involves a similar ensemble process:
- Draw random subsets (with replacement) from the original dataset.
- Train individual trees with each subset.

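However, a difference is that random forest also restricts each tree to randomly selected subsets of the features. As a rule of thumb, use `sqrt(# features)` as the number of random features. In scikit-learn's RandomForestClassifier, this corresponds to `max_features="sqrt"`, and the random subset is redrawn at every split rather than fixed once per tree.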
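
The rule of thumb can be checked with a short sketch. The synthetic data and the `_demo` names are assumptions for illustration; the feature count of six mirrors the four numerical and two categorical model features selected earlier. In scikit-learn, `max_features="sqrt"` applies this limit each time a tree searches for a split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

n_features = 6  # same count as the model features; the data below is synthetic
X_demo, y_demo = make_classification(
    n_samples=300, n_features=n_features, random_state=0
)

# Rule of thumb: consider about sqrt(# features) random features
k = max(1, int(np.sqrt(n_features)))

rf_demo = RandomForestClassifier(
    n_estimators=20, max_features="sqrt", random_state=0
)
rf_demo.fit(X_demo, y_demo)

# Each fitted tree records how many features it considers per split
print(k, rf_demo.estimators_[0].max_features_)
```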


The model is called in a similar way to the bagging method. You will replace the BaggingClassifier with a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) in the pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

### COLUMN_TRANSFORMER ###
##########################

# Preprocess the numerical features
numerical_processor = Pipeline(
    [
        (
            "num_scaler",
            MinMaxScaler(),
        )  # Shown in case it is needed. Not a must with decision trees.
    ]
)

# Preprocess the categorical features
# handle_unknown tells it to ignore (rather than throw an error for) any value
# that was not present in the initial training set.
categorical_processor = Pipeline(
    [("cat_encoder", OneHotEncoder(handle_unknown="ignore"))]
)

# The model features contain no text columns, so no text processor is defined

# Combine all data preprocessors (add more if you choose to define more)
# For each processor/step, specify a name, the actual process, and the features to be processed
data_preprocessor = ColumnTransformer(
    [
        ("numerical_pre", numerical_processor, numerical_features),
        ("categorical_pre", categorical_processor, categorical_features),
    ]
)

### PIPELINE ###
################

# Pipeline with all desired data transformers, along with an estimator
# Later, you can set/reach the parameters by using the names issued - for hyperparameter tuning, for example

##########################################################
### Notice the pipeline using a RandomForestClassifier ###
##########################################################
pipeline = Pipeline(
    [
        ("data_preprocessing", data_preprocessor),
        (
            "rf",
            RandomForestClassifier(
                max_depth=25, n_estimators=100  # Each tree has max_depth=25
            ),
        ),
    ]
)  # Use 100 trees

# Visualize the pipeline
# This will be helpful especially when building more complex pipelines,
# stringing together multiple preprocessing steps
from sklearn import set_config

set_config(display="diagram")
pipeline

Now you can fit the random forest model, and see the training and test scores.

In [None]:
# X_train, X_test, y_train, and y_test come from train_test_split() above
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Fit the pipeline to the training data
pipeline.fit(X_train, np.ravel(y_train))

# Use the fitted pipeline to make predictions on the training dataset
train_predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Accuracy (training):", accuracy_score(y_train, train_predictions))

# Use the fitted pipeline to make predictions on the testing dataset
test_predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Accuracy (test):", accuracy_score(y_test, test_predictions))

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h3><i>Try it yourself!</i></h3>
    <br>
    <p style="text-align:center;margin:auto;"><img src="images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">You can perform hyperparameter tuning on a random forest model.</p><br>
    <p style=" text-align: center; margin: auto;">In the following code cell, run a grid search with the random forest classifier using <code>param_grid={'rf__max_depth': [25, 30, 45]}</code>.</p><br>
    <p style=" text-align: center; margin: auto;">What is the best hyperparameter value after this run?</p>
    <br>
</div>

In [None]:
# Write your code for grid search with param_grid={'rf__max_depth': [25, 30, 45]}

############### CODE HERE ###############

# Parameter grid for GridSearch
param_grid = {
    'rf__max_depth': [25, 30, 45]
}

grid_search = GridSearchCV(
    pipeline,  # Base model
    param_grid,  # Parameters to try
    cv=5,  # Apply 5-fold cross validation
    verbose=1,  # Print summary
    n_jobs=-1,  # Use all available processors
)

# Fit the GridSearch to the training data
grid_search.fit(X_train, y_train)

############## END OF CODE ##############

print(grid_search.best_params_)
print(grid_search.best_score_)

# Get the best model out of GridSearchCV
classifier = grid_search.best_estimator_

# Fit the best model to the training data
# (redundant with the default refit=True, but harmless)
classifier.fit(X_train, y_train)

In [None]:
# Get testing data to test the classifier
X_test = test_data[model_features]
y_test = test_data[model_target]

# Use the fitted model to make predictions on the test dataset
# Testing data going through the pipeline is first imputed
# (with means from the training set), scaled (with the min/max from the training data),
# and finally used to make predictions.
test_predictions = classifier.predict(X_test)

print("Model performance on the test set:")
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Test accuracy:", accuracy_score(y_test, test_predictions))

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Boosting</span></h1></div>

The last ensemble method that you will try is boosting. This method builds multiple weak models sequentially. Each subsequent model attempts to improve overall performance by reducing the errors of the models before it.

You will use sklearn's [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) in the pipeline.
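
To make the sequential idea concrete, here is a minimal sketch of boosting for a regression problem with squared error, where each new shallow tree is fit to the residuals left by the trees before it. This is a simplified illustration, not how `GradientBoostingClassifier` is implemented internally (the classifier works with gradients of a classification loss). The toy data and the names `X_toy`, `y_toy`, `learning_rate`, and `errors` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy one-dimensional regression data (assumption: unrelated to the lab dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0])

learning_rate = 0.5

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_toy, y_toy.mean())
errors = [np.mean((y_toy - prediction) ** 2)]

for _ in range(20):
    # The residuals are the errors of the ensemble built so far
    residuals = y_toy - prediction
    # Fit a weak learner (a depth-1 tree) to those residuals
    stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, residuals)
    # Add a damped version of its output to the running prediction
    prediction += learning_rate * stump.predict(X_toy)
    errors.append(np.mean((y_toy - prediction) ** 2))

# Training error before and after adding the weak learners
print(errors[0], errors[-1])
```

Each stump on its own is a poor model, but the training error of the combined prediction does not increase from one round to the next, which is the behavior the description above refers to.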

In [None]:
### COLUMN_TRANSFORMER ###
##########################

# Preprocess the numerical features
numerical_processor = Pipeline(
    [
        (
            "num_scaler",
            MinMaxScaler(),
        )  # Shown in case it is needed. Not a must with decision trees.
    ]
)

# Preprocess the categorical features
# handle_unknown tells it to ignore (rather than throw an error for) any value
# that was not present in the initial training set.

#categorical_processor = Pipeline(
#    [("cat_encoder", OneHotEncoder(handle_unknown="ignore"))]
#)

# Preprocess the text feature
text_processor_0 = Pipeline(
    [("text_vect_0", CountVectorizer(binary=True, max_features=150))]
)

# Combine all data preprocessors (add more if you choose to define more)
# For each processor/step, specify a name, the actual process, and the features to be processed
data_preprocessor = ColumnTransformer(
    [
        ("numerical_pre", numerical_processor, numerical_features),
#        ("categorical_pre", categorical_processor, categorical_features),
        ("text_pre_0", text_processor_0, text_features[0]),
    ]
)

### PIPELINE ###
################

# Pipeline with all desired data transformers, along with an estimator
# Later, you can set/reach the parameters by using the names issued - for hyperparameter tuning, for example

##############################################################
### Notice the pipeline using a GradientBoostingClassifier ###
##############################################################
pipeline = Pipeline(
    [
        ("data_preprocessing", data_preprocessor),
        (
            "gbc",
            GradientBoostingClassifier(
                max_depth=10,  # Each tree has max_depth=10
                n_estimators=100,  # Use 100 trees
            ),
        ),
    ]
)

# Visualize the pipeline
# This will be helpful especially when building more complex pipelines,
# stringing together multiple preprocessing steps
from sklearn import set_config

set_config(display="diagram")
pipeline

Now fit the model, and see the training and testing scores.

In [None]:
# Get training data to train the pipeline
X_train = train_data[model_features]
y_train = train_data[model_target]

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Use the fitted pipeline to make predictions on the training dataset
train_predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Accuracy (training):", accuracy_score(y_train, train_predictions))

# Get testing data to test the pipeline
X_test = test_data[model_features]
y_test = test_data[model_target]

# Use the fitted pipeline to make predictions on the testing dataset
test_predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Accuracy (test):", accuracy_score(y_test, test_predictions))

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Conclusion</span></h1></div>

This notebook provided an introduction to using Bagging, RandomForest, and GradientBoosting classifiers on the same dataset.
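
As a compact recap, the sketch below cross-validates the three kinds of classifiers side by side. It uses synthetic data and small, arbitrary settings, so the scores say nothing about the Trojan Detection Dataset; the names `X_demo`, `y_demo`, `demo_models`, and `demo_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: this is NOT the Trojan dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# One small model per ensemble family covered in this notebook
demo_models = {
    "bagging": BaggingClassifier(n_estimators=20, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# Mean 3-fold cross-validation accuracy for each model
demo_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=3).mean()
    for name, model in demo_models.items()
}

for name, score in demo_scores.items():
    print(f"{name}: {score:.3f}")
```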

---
<div class="alert alert-block alert-success">
<h1><span style="color:blue">Next Lab</span></h1></div>

In the next lab, you will be introduced to fairness and bias mitigation in ML by exploring different types of bias that are present in data and practicing how to build various documentation sheets.