# 1. Setting up environment requirements
If you want to run this Jupyter Notebook on Google Colab, click the following link: [Load on Google Colab.](https://githubtocolab.com/mjacker/MJCapstone/blob/master/0_merged_ipynb_files_for_google_colab.ipynb)

If you want to run the Jupyter Notebook locally on your computer, clone the [GitHub repository](https://github.com/mjacker/MJCapstone/tree/develop) and install the requirements from the `requirements.yml` file with `# python -m pip install -r requirements.yml`.

Uncomment the next block to install dependencies.


## Downloading the Dataset

### Downloading on Google Colab (default)

Since Google Colab runs on Linux, most dependencies are already installed, but downloading the dataset from Amazon Web Services first requires installing the AWS CLI (`awscli`).

In [None]:
# Tested on linux (Google-Colab)
!apt-get install awscli
!python -m pip install requests==2.28.2
!mkdir datasets
!aws s3 ls --no-sign-request --region ap-northeast-3 "s3://cse-cic-ids2018/" --recursive --human-readable
!aws s3 cp --no-sign-request --region ap-northeast-3 "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/Friday-02-03-2018_TrafficForML_CICFlowMeter.csv" "./datasets/"
!aws s3 cp --no-sign-request --region ap-northeast-3 "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/Friday-16-02-2018_TrafficForML_CICFlowMeter.csv" "./datasets/"


### Downloading on Windows

While trying to download on Windows, I realized that this can be achieved with a different approach: using the boto3 library for Python.


In [None]:
# Tested on windows 10
# On powershell 7.4

# # !python -m pip install boto3
# !python .\scripts\download-cic-ids-dataset.py 
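
The contents of `download-cic-ids-dataset.py` are not shown in this notebook, so the following is only a minimal sketch of how an anonymous boto3 download could look. It reuses the bucket, region, and object keys from the AWS CLI commands above; the function names `build_keys` and `download_dataset` are hypothetical and are not taken from the repository script.

```python
from pathlib import Path

# Values copied from the AWS CLI commands used for Google Colab
BUCKET = "cse-cic-ids2018"
REGION = "ap-northeast-3"
PREFIX = "Processed Traffic Data for ML Algorithms/"
FILES = [
    "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv",
]


def build_keys(filenames, prefix=PREFIX):
    """Return the full S3 object key for each file name (hypothetical helper)."""
    return [prefix + name for name in filenames]


def download_dataset(dest_dir="./datasets", filenames=FILES):
    """Download the files without credentials, like `--no-sign-request`."""
    # Imported inside the function so this cell loads even if boto3 is missing
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    s3 = boto3.client(
        "s3",
        region_name=REGION,
        config=Config(signature_version=UNSIGNED),  # anonymous access
    )
    for key in build_keys(filenames):
        target = dest / Path(key).name
        s3.download_file(BUCKET, key, str(target))
        print("downloaded", target)


# download_dataset()  # uncomment to start the download
```

The download call is left commented out because the files are large; `build_keys` can be checked on its own without network access.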


--- 

# 2. Dataset Preparation
---
For this Capstone, two files from [CSE-CIC-IDS2018](https://www.unb.ca/cic/datasets/ids-2018.html) were selected for processing:
- `Friday-16-02-2018_TrafficForML_CICFlowMeter.csv`
This file contains most of the DoS attacks.

- `Friday-02-03-2018_TrafficForML_CICFlowMeter.csv`
This file contains most of the botnet activity.

Since these two files contain a large amount of malicious traffic, they help to balance the dataset that will be used to train the model.


In [None]:
%%time
import os
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Collecting the paths to the dataset files.

In [None]:
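DATASET_FILES_PATH = []
for path, _, files in os.walk("./datasets/"):
    for each_file in files:
        # os.path.join adds the separator, so files in subfolders also resolve correctly
        DATASET_FILES_PATH.append(os.path.join(path, each_file))
# os.walk gives no ordering guarantee; sort so the indices used below stay stable
DATASET_FILES_PATH.sort()
DATASET_FILES_PATH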

## Data Ingestion / versioning
### Loading the datasets into a pandas DataFrame and concatenating them

In [None]:
%%time

# df_dataset = pd.read_csv(DATASET_FILES_PATH[0])
# print(df_dataset.shape)

# On Google Colab, memory capacity may only allow handling one day of the dataset.
df_friday1 = pd.read_csv(DATASET_FILES_PATH[0])
df_friday2 = pd.read_csv(DATASET_FILES_PATH[1])
df_dataset = pd.concat([df_friday1, df_friday2], axis=0, ignore_index=True)
# Because two datasets were concatenated, delete any row that repeats the header
# (rows whose "Label" value is the literal string "Label")
df_dataset.drop(df_dataset.loc[df_dataset["Label"] == "Label"].index, inplace=True)
print(df_dataset.shape)

## Data Cleaning
- ### Drop unrelated columns
Since the port, protocol, and timestamp are not related to the label for the selected machine learning models, they will be dropped.

In [None]:
df_dataset.drop(columns=['Dst Port', 'Protocol', 'Timestamp'], inplace=True)



- ### Dropping rows with infinite or null values

In [None]:
print("Shape before deleting rows: ", df_dataset.shape)
# replace() returns a copy by default, so inplace=True is needed for dropna to see the NaNs
df_dataset.replace([np.inf, -np.inf], np.nan, inplace=True)
df_dataset.dropna(inplace=True)
print("Shape after deleting rows:", df_dataset.shape)
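
As a small illustration of this cleaning step, the sketch below runs the same replace-then-drop sequence on a toy DataFrame. The column names `feature_a` and `feature_b` are made up for the example and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: only the first row is free of infinite and missing values
toy = pd.DataFrame({
    "feature_a": [10.0, np.inf, 5.0, np.nan],
    "feature_b": [1.0, 2.0, -np.inf, 4.0],
})

# replace() returns a new DataFrame unless inplace=True is passed, so the
# result is chained directly into dropna() instead of being discarded.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()

print("before:", toy.shape)
print("after: ", cleaned.shape)
```

Chaining the two calls avoids relying on `inplace=True` and leaves the original frame untouched, which can be convenient while experimenting.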

## Encoding
### Check label values

In [None]:
print(df_dataset['Label'].unique())
print(df_dataset.shape)

##### Changing label names
To unify the labels, malicious traffic is relabeled as 1 and normal traffic as 0.
- 0 - normal traffic
- 1 - malicious traffic

In [None]:
%%time
df_dataset.replace(to_replace=['Benign'], value=0, inplace=True)
df_dataset.replace(to_replace=["Bot", "DoS attacks-SlowHTTPTest", "DoS attacks-Hulk"], value=1, inplace=True)
df_dataset[df_dataset.columns[-1]].unique()
# Some values are stored as strings although they are numeric; force the conversion here.
# astype() returns a new DataFrame, so the result has to be assigned back.
df_dataset = df_dataset.astype('float')
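
An alternative way to express the same encoding, shown here only as a sketch on toy data, is to restrict the change to the `Label` column and treat everything that is not `Benign` as malicious. This is a swapped-in variant, not the approach used in the cell above, and the `feature_a` column is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "feature_a": ["1", "2", "3", "4"],  # numbers stored as strings
    "Label": ["Benign", "Bot", "DoS attacks-Hulk", "Benign"],
})

# Any label other than "Benign" becomes 1, "Benign" becomes 0
toy["Label"] = (toy["Label"] != "Benign").astype(int)

# astype() returns a new object, so the converted frame is assigned back
toy = toy.astype("float")

print(toy.dtypes)
print(toy["Label"].tolist())
```

The comparison-based encoding does not need an explicit list of attack names, but it would also silently mark unexpected label strings as malicious, so checking `unique()` first remains useful.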

### Dropping duplicated rows

In [None]:
print(df_dataset.shape)
df_dataset.drop_duplicates(inplace=True)
print(df_dataset.shape)


### Check columns datatypes

In [None]:
df_dataset.info()

### Label distribution after dropping rows

In [None]:
label_benign = df_dataset["Label"].value_counts()[[0]].sum()
label_malicious = df_dataset["Label"].value_counts()[[1]].sum()

print(df_dataset.shape)

abs_values = [label_benign, label_malicious]
sns.set(rc={'figure.figsize':(8, 6)})
ax = sns.countplot(x=df_dataset[df_dataset.columns[-1]], 
              data = df_dataset,
              palette = 'dark:#5A9_r')
ax.bar_label(container=ax.containers[0], labels=[label_benign])
ax.bar_label(container=ax.containers[1], labels=[label_malicious])
plt.xlabel("0 = Benign; 1 = Malicious")


##### Imbalance problem
There are two ways to solve this problem:
1. Dropping benign rows.
2. Sampling malicious rows.

For this attempt, I am dropping benign rows until the dataset is balanced.


In [None]:
df_dataset.drop(df_dataset[df_dataset.Label == 0].index[-(label_benign-label_malicious):], inplace=True)
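
The cell above removes the last benign rows in index order. A possible variant, sketched here on toy data and not part of the original pipeline, is to undersample the benign rows at random so that the rows kept are not tied to their position in the file. The seed value and the `feature_a` column are assumptions made for the example.

```python
import pandas as pd

SEED = 732  # assumed seed, any integer works

# Toy data: 7 benign rows (0) and 3 malicious rows (1)
toy = pd.DataFrame({
    "feature_a": range(10),
    "Label": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

benign = toy[toy["Label"] == 0]
malicious = toy[toy["Label"] == 1]

# Keep as many benign rows as there are malicious rows, chosen at random.
# min() guards against the case where benign is already the smaller class.
n_keep = min(len(benign), len(malicious))
benign_sampled = benign.sample(n=n_keep, random_state=SEED)

balanced = pd.concat([benign_sampled, malicious]).sort_index()
print(balanced["Label"].value_counts())
```

Random undersampling still discards benign data, so it trades dataset size for balance in the same way as the positional drop.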

### Label distribution after fixing the imbalance problem

In [None]:
label_benign = df_dataset["Label"].value_counts()[[0]].sum()
label_malicious = df_dataset["Label"].value_counts()[[1]].sum()

print(df_dataset.shape)

abs_values = [label_benign, label_malicious]
sns.set(rc={'figure.figsize':(8, 6)})
ax = sns.countplot(x=df_dataset[df_dataset.columns[-1]], 
              data = df_dataset,
              palette = 'dark:#5A9_r')
ax.bar_label(container=ax.containers[0], labels=[label_benign])
ax.bar_label(container=ax.containers[1], labels=[label_malicious])
plt.xlabel("0 = Benign; 1 = Malicious")

### Saving the Dataset as a csv file

In [None]:
df_dataset.to_csv("processed_dataset_in_2.csv", index=False)

## Exploratory Data Analysis - Columns

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Loading processed dataset to a dataframe

In [None]:
df_dataset = pd.read_csv("processed_dataset_in_2.csv")
df_dataset

In [None]:
df_dataset.describe()

In [None]:
df_dataset.info()

In [None]:
df_dataset.columns

## Pearson correlation between the input features.

In [None]:
### TEMP
df_dataset.head()

In [None]:
import seaborn as sns
corr_matrix = df_dataset.corr()
plt.rcParams['figure.figsize'] = (16, 9)
g = sns.heatmap(corr_matrix, 
                cmap="coolwarm",  # color map
                annot=False,  # set to True to write the value in each cell
                fmt='.1g',
                vmin = -1, 
                vmax = 1)
g.set_xticklabels(g.get_xticklabels(), verticalalignment='top', horizontalalignment='right', rotation=30);
plt.savefig('corr_heatmap1.png', dpi=800, bbox_inches='tight')

### Dropping non-informative columns

In [None]:
columns_to_drop = []

# A feature column with a single unique value carries no information for the model.
# The last column (Label) is excluded from the check.
for x, e in enumerate(df_dataset.columns[:-1]):
    unique_values = df_dataset[e].unique()
    if len(unique_values) == 1:
        print(x, e, unique_values)
        columns_to_drop.append(e)

print("Columns to drop: ", columns_to_drop)
print("Dataframe shape: ", df_dataset.shape)

# Dropping by name avoids the index shifting that dropping by position can cause
df_dataset.drop(columns=columns_to_drop, inplace=True)

df_dataset.shape
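
The same check can be written more compactly with `nunique()`. The sketch below uses a toy DataFrame with hypothetical column names. Note that `nunique()` ignores NaN by default, unlike `unique()`, so the two only agree once missing values have been removed.

```python
import pandas as pd

toy = pd.DataFrame({
    "feature_a": [1, 2, 3],
    "feature_const": [0, 0, 0],  # single unique value, carries no information
    "Label": [0, 1, 1],
})

# Look only at the feature columns, never at the label
features = toy.columns.drop("Label")
constant_cols = [col for col in features if toy[col].nunique() == 1]

toy = toy.drop(columns=constant_cols)
print("Dropped:", constant_cols)
print("Remaining:", list(toy.columns))
```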
    

##### Checking Pearson correlation after drops

In [None]:
corr_matrix = df_dataset.corr()
plt.rcParams['figure.figsize'] = (16, 9)
g = sns.heatmap(corr_matrix, 
                cmap="coolwarm",  # color map
                annot=False,  # set to True to write the value in each cell
                fmt='.2f',
                vmin = -1, 
                vmax = 1)
g.set_xticklabels(g.get_xticklabels(), verticalalignment='top', horizontalalignment='right', rotation=30);
plt.savefig('corr_heatmap2.png', dpi=800, bbox_inches='tight')

In [None]:
# df_dataset.columns

In [None]:
# df_dataset.replace([np.inf, -np.inf], np.nan, inplace=True)
# df_dataset.dropna(inplace=True)

# print(np.any(np.isnan(df_dataset)))
# print(np.any(np.isinf(df_dataset)))

# # If I try to use a where-infinite check, it usually gives bad results, such as a memory overflow
# df_dataset.isin([np.inf, -np.inf]).values.sum()

In [None]:


# y = np.array(df_dataset.pop('Label'))
# X = np.array(df_dataset)

In [None]:
# from sklearn.preprocessing import MinMaxScaler
# X_scaler = MinMaxScaler().fit(X)
# pd.DataFrame(X_scaler.transform(X))
# X = np.array(X_scaler.transform(X))
# X
# df_dataset = pd.DataFrame(X)

In [None]:
df_dataset.to_csv("processed_dataset_in_3.csv", index=False)

In [None]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
import joblib

from sklearn.model_selection import train_test_split, GridSearchCV


# For reproducible results
RANDOM_STATE_SEED = 732

In [None]:
df_dataset = pd.read_csv("processed_dataset_in_3.csv")
df_dataset


In [None]:
# Is it really necessary to filter the data again if the processed dataset supposedly should not contain infinite values?

print(np.any(np.isnan(df_dataset)))
print(np.any(np.isinf(df_dataset)))

# If I try to use a where-infinite check, it usually gives bad results, such as a memory overflow
df_dataset.isin([np.inf, -np.inf]).values.sum()

In [None]:
# df_dataset.isinf()
df_dataset.replace([np.inf, -np.inf], np.nan, inplace=True)
df_dataset.dropna(inplace=True)


In [None]:
# Is it really necessary to filter the data again if the processed dataset supposedly should not contain infinite values?

print(np.any(np.isnan(df_dataset)))
print(np.any(np.isinf(df_dataset)))

# If I try to use a where-infinite check, it usually gives bad results, such as a memory overflow
df_dataset.isin([np.inf, -np.inf]).values.sum()

In [None]:
df_dataset.describe()


In [None]:
df_dataset.info()

In [None]:
y = np.array(df_dataset.pop('Label'))
X = np.array(df_dataset)

In [None]:
print(X.shape)
print(y.shape)

In [None]:
pd.DataFrame(X)

In [None]:
pd.DataFrame(y)

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Scale every feature to the [0, 1] range
X_scaler = MinMaxScaler().fit(X)
X = X_scaler.transform(X)
X

In [None]:
# X, y = train_test_split(df_dataset, test_size=0.3, random_state=RANDOM_STATE_SEED)
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=RANDOM_STATE_SEED)
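
In this notebook the scaler is fitted on all of `X` before the split, which lets the test rows influence the scaling range. A common alternative, sketched below on synthetic data and not part of the original pipeline, is to split first and fit the scaler on the training part only. All variable names in the sketch are made up so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real feature matrix and labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=732
)

# Fit on the training rows only, then apply the same transform to both parts
demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this ordering the scaled test values can fall slightly outside the [0, 1] range, which is expected because the test rows were not seen when the range was learned.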

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from sklearn.utils import class_weight  # For balanced class weighted classification training

# Calculating class weights for balanced class weighted classifier training
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)

print(class_weights)

# Must be in dict format for scikit-learn
class_weights = {
    0: class_weights[0],
    1: class_weights[1]
}

print(class_weights)
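
For reference, the `'balanced'` mode follows the formula `n_samples / (n_classes * count_per_class)`. The sketch below reproduces it with plain NumPy on a small made-up label vector; the variable names are hypothetical.

```python
import numpy as np

# Made-up labels: six samples of class 0 and two of class 1
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])

classes, counts = np.unique(y_demo, return_counts=True)

# balanced weight = n_samples / (n_classes * samples_in_class)
weights = len(y_demo) / (len(classes) * counts)

class_weights_demo = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights_demo)
```

The rarer class receives the larger weight. A dictionary in this form can be passed as `class_weight` to scikit-learn estimators that accept that parameter, such as `DecisionTreeClassifier`.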

In [None]:

# predictions
# joblib.dump(model, r".\trained_models\remote-random-forest-classifier.pkl")

In [None]:
# model = joblib.load(f".\trained_models\remote-random-forest-classifier")
# model = joblib.load(r".\trained_models\remote-random-forest-classifier.pkl")
# model

In [None]:
# Step 7: Comparing Decision Tree, Random Forest, Bagging, XGBoost, CatBoost, and LightGBM
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Initialize classifiers
classifiers = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Bagging' : BaggingClassifier(),
    'XGBoost': XGBClassifier(),
    'CatBoost': CatBoostClassifier(),
    'LightGBM': LGBMClassifier()
}

In [None]:
### TEMP
classifiers.items()
# for name, clf in classifiers.items()

In [None]:
# Train and evaluate classifiers
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
    roc_auc = auc(fpr, tpr)

    results[name] = {
        'Accuracy': accuracy,
        'Confusion Matrix': confusion_mat,
        'Classification Report': class_report,
        'ROC Curve': (fpr, tpr, roc_auc)
    }

In [None]:
# Bar plot for accuracy comparison
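accuracy_values = [result['Accuracy'] for result in results.values()]
# Take the names from results so they stay aligned with accuracy_values
classifiers_names = list(results.keys())

plt.figure(figsize=(7, 3))
# One color per classifier (six in total)
plt.bar(classifiers_names, accuracy_values, color=['blue', 'green', 'red', 'purple', 'orange', 'brown'])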
plt.xlabel('Classifiers')
plt.ylabel('Accuracy')
plt.title('Classifier Accuracy Comparison')
plt.ylim([0, 1])
plt.show()


In [None]:
# Confusion matrices and classification reports
for name, result in results.items():
    print(f'\n{name}:\n')
    # print(f'Confusion Matrix:\n{result["Confusion Matrix"]}\n')
    print(f'Classification Report:\n{result["Classification Report"]}\n')

    # Plot confusion matrix with the Greens colormap
    plt.figure(figsize=(4, 2))
    sns.heatmap(result["Confusion Matrix"], annot=True, fmt='g', cmap=plt.cm.Greens, cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'Confusion Matrix - {name}')
    plt.show()

In [None]:
 # Plot ROC curves
plt.figure(figsize=(20, 6))
for name, result in results.items():
    fpr, tpr, roc_auc = result['ROC Curve']
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.9f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.show()

In [None]:
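# Heatmap of the classification report for the last classifier evaluated in the loop above
# (y_pred still holds its predictions); 0 = Benign, 1 = Malicious
clf_report = classification_report(y_test,
                                   y_pred,
                                   labels=[0, 1],
                                   target_names=['Benign', 'Malicious'],
                                   output_dict=True)
sns.heatmap(pd.DataFrame(clf_report).iloc[:-1, :].T, annot=True)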

## Decision Tree

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV

# Model
from sklearn.tree import DecisionTreeClassifier, export_graphviz


# For reproducible results
RANDOM_STATE_SEED = 420

In [None]:
df_dataset = pd.read_csv("processed_dataset_in_3.csv")
df_dataset





In [None]:
df_dataset

In [None]:
# Is it really necessary to filter the data again if the processed dataset supposedly should not contain infinite values?

print(np.any(np.isnan(df_dataset)))
print(np.any(np.isinf(df_dataset)))

In [None]:
df_dataset.replace([np.inf, -np.inf], np.nan, inplace=True)
df_dataset.dropna(inplace=True)

In [None]:
print(np.any(np.isnan(df_dataset)))
print(np.any(np.isinf(df_dataset)))

In [None]:
df_dataset.info()

In [None]:
y = np.array(df_dataset.pop('Label'))
y

In [None]:
X = np.array(df_dataset)
X

In [None]:
print(df_dataset.shape)
print(X.shape)
print(y.shape)

In [None]:
pato = pd.DataFrame(X)
pato

In [None]:
df_dataset.info()

In [None]:
# I think this is where I need to add the scaling down of the values
# No, I have to do that after splitting into X and y

In [None]:
# TEMP
len(df_dataset.columns)

In [None]:
# print(df_X.shape)
# print(df_y.shape)

In [None]:
# train, test = train_test_split(df_dataset, test_size=0.3, random_state=RANDOM_STATE_SEED)
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=RANDOM_STATE_SEED)

In [None]:
print(df_dataset.shape)

print("TRAIN:")
print(X_train.shape)
print(y_train.shape)

print("TEST")
print(X_test.shape)
print(y_test.shape)

In [None]:
model = DecisionTreeClassifier(
    criterion='gini',
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=RANDOM_STATE_SEED,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    class_weight=None,
    ccp_alpha=0.0
)

In [None]:
hyperparameters = {
    'max_depth': list(range(1, 20))
}

In [None]:
clf = GridSearchCV(
    estimator=model,
    param_grid=hyperparameters,
    cv=5,
    verbose=1,
    n_jobs=-1  # Use all available CPU cores
)

In [None]:
%%time
clf.fit(X=X_train, y=y_train)

In [None]:
print("Accuracy score on Validation set: \n")
print(clf.best_score_ )
print("---------------")
print("Best performing hyperparameters on Validation set: ")
print(clf.best_params_)
print("---------------")
print(clf.best_estimator_)
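
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` keeps the cross-validated score of every candidate in `cv_results_`. The sketch below shows how that table could be inspected; it uses a small synthetic dataset from `make_classification` and a reduced `max_depth` grid, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=420)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=420),
    param_grid={"max_depth": [1, 2, 3, 4]},
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
cv_table = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(cv_table)
```

Looking at the whole table, rather than only the best entry, helps to see whether deeper trees actually improve the mean score or only add variance between folds.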

In [None]:
model = clf.best_estimator_
model

In [None]:
predictions = model.predict(X_test)
predictions

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:
print(accuracy_score(y_test, predictions))

In [None]:
cm = confusion_matrix(y_test, predictions)
print(cm)

In [None]:
# from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, predictions, cmap=plt.cm.Greens)