<a href="https://colab.research.google.com/github/jvitorc/TCC/blob/main/OneVsAll.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### *João Vitor Cardoso <2021>*

# **Exploring the One-vs-All Strategy with an MLP for Intrusion Detection and Classification**

Using the [CSE-CIC-IDS2018](https://www.unb.ca/cic/datasets/ids-2018.html) dataset to explore the One-vs-All strategy with an MLP (multilayer perceptron) for intrusion detection and classification.
  

## Downloading the Dataset

#### Installing the AWS CLI

In [None]:
!curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
!unzip awscliv2.zip
!sudo ./aws/install

#### Downloading the processed traffic CSVs

In [None]:
!aws s3 sync --no-sign-request --region sa-east-1 "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms" "./CSE-CIC-IDS2018"

In [None]:
!ls -l --block-size=M "CSE-CIC-IDS2018"


Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv -> Remove, since there is already another one.

Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv -> Remove.

Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv -> Remove.

## Connecting to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
TRAIN_PATH =  '/content/drive/MyDrive/UFSC/TCC/Arquivos/ids2018/train/'
TEST_PATH =  '/content/drive/MyDrive/UFSC/TCC/Arquivos/ids2018/test/'

## Importing Libraries

In [None]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import  keras
import matplotlib.pyplot as plt
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier

## Exploring the Data

### Loading the data

In [None]:
FILEPATH = 'CSE-CIC-IDS2018/'

In [None]:
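def carregar_arquivos(filename):
  # Load one daily CSV and return its numeric features plus the original 'Label' column
  data = pd.read_csv(FILEPATH + filename)
  # Drop header rows repeated inside the file (their 'Protocol' cell holds the column name)
  data = data[data['Protocol'] != 'Protocol']
  target = data.pop('Label')
  # 'Timestamp' is not used as a feature
  data.pop('Timestamp')
  # Columns that contained repeated headers are read as strings, so convert them back to numbers
  data = data.apply(pd.to_numeric)
  data['Label'] = target
  return data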
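
The `Protocol` filter above presumably exists because a header line can reappear in the middle of a CSV file. The following dependency-free sketch shows the same idea with the standard `csv` module; the three-column excerpt is hypothetical and much smaller than the real files.

```python
import csv
import io

# Hypothetical excerpt: the third line mimics a header repeated mid-file.
raw = """Protocol,Flow Duration,Label
6,543,Benign
Protocol,Flow Duration,Label
17,1200,Bot
"""

reader = csv.DictReader(io.StringIO(raw))
# A repeated header row carries the column name as its own value, so it is filtered out.
rows = [row for row in reader if row["Protocol"] != "Protocol"]

for row in rows:
    print(row["Protocol"], row["Flow Duration"], row["Label"])
```

Only the two real records survive the filter; in the notebook the same comparison is applied to the pandas column before the numeric conversion.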

In [None]:
# DataFrame.append was removed in pandas 2.0, so the daily files are combined with pd.concat.
# ignore_index=True gives every row a unique index label, so index-based operations such as drop() act on single rows.
ARQUIVOS = [
  'Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv',
  'Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv',
  'Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv',
  'Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv',
  'Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv',
  'Friday-02-03-2018_TrafficForML_CICFlowMeter.csv',
]
dataset = pd.concat([carregar_arquivos(nome) for nome in ARQUIVOS], ignore_index=True)

In [None]:
dataset['Label'].value_counts()

Benign                   4073170
DDOS attack-HOIC          686012
Bot                       286191
FTP-BruteForce            193360
SSH-Bruteforce            187589
Infilteration              93063
DoS attacks-GoldenEye      41508
DoS attacks-Slowloris      10990
DDOS attack-LOIC-UDP        1730
Brute Force -Web             249
Brute Force -XSS              79
SQL Injection                 34
Name: Label, dtype: int64

### Dataset size

In [None]:
dataset.shape

(5573975, 79)

### Previewing the data

In [None]:
dataset.head()

| | Dst Port | Protocol | Flow Duration | Tot Fwd Pkts | Tot Bwd Pkts | TotLen Fwd Pkts | TotLen Bwd Pkts | Fwd Pkt Len Max | Fwd Pkt Len Min | Fwd Pkt Len Mean | ... | Fwd Seg Size Min | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 112641158 | 3 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 56320579.0 | 704.2784 | 56321077.0 | 56320081.0 | Benign |
| 1 | 22 | 6 | 37366762 | 14 | 12 | 2168 | 2993.0 | 712 | 0 | 154.857143 | ... | 32 | 1024353.0 | 649038.754495 | 1601183.0 | 321569.0 | 11431221.0 | 3644991.0 | 15617415.0 | 8960247.0 | Benign |
| 2 | 47514 | 6 | 543 | 2 | 0 | 64 | 0.0 | 64 | 0 | 32.0 | ... | 32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Benign |
| 3 | 0 | 0 | 112640703 | 3 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 56320351.5 | 366.9884 | 56320611.0 | 56320092.0 | Benign |
| 4 | 0 | 0 | 112640874 | 3 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 56320437.0 | 719.8347 | 56320946.0 | 56319928.0 | Benign |


## Data Preprocessing





In [None]:
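# Treat +/-inf like missing values and drop every row that contains one.
# (The pandas option 'mode.use_inf_as_na' is deprecated in recent pandas releases.)
dataset = dataset.replace([np.inf, -np.inf], np.nan).dropna()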
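
As a small illustration of what this cleaning step removes, the sketch below applies the same rule (keep a row only if every numeric value is finite) to a handful of hypothetical rows using just the standard library. The column meanings are assumptions made for the example.

```python
import math

# Hypothetical rows: (bytes_per_second, packets_per_second, label)
rows = [
    (1200.0, 3.5, "Benign"),
    (float("inf"), 2.0, "Benign"),  # e.g. a rate computed over a zero-length flow
    (float("nan"), 1.0, "Bot"),
    (80.0, 0.5, "Bot"),
]

def is_clean(row):
    # Keep a row only if every numeric field is finite (neither NaN nor +/-inf).
    return all(math.isfinite(v) for v in row if isinstance(v, float))

clean = [row for row in rows if is_clean(row)]
print(len(rows) - len(clean), "rows dropped")
```

Two of the four rows are discarded here; on the real data the label counts before and after this step show how many flows each class loses.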

In [None]:
# Merge the fine-grained attack labels into broader attack families
AGRUPAMENTO_LABEL = {
  'Brute Force -Web': 'Brute Force',
  'Brute Force -XSS': 'Brute Force',
  'SSH-Bruteforce': 'Brute Force',
  'FTP-BruteForce': 'Brute Force',
  'DDOS attack-HOIC': 'DDOS',
  'DDOS attack-LOIC-UDP': 'DDOS',
  'DoS attacks-GoldenEye': 'DoS',
  'DoS attacks-Slowloris': 'DoS',
}
dataset['Label'] = dataset['Label'].replace(AGRUPAMENTO_LABEL)

In [None]:
dataset['Label'].value_counts()

Benign           4049406
DDOS              687742
Brute Force       381271
Bot               286191
Infilteration      92403
DoS                52498
SQL Injection         34
Name: Label, dtype: int64

In [None]:
CODIGOS_LABEL = {'Benign': 0, 'Bot': 1, 'Brute Force': 2, 'DDOS': 3, 'DoS': 4,'Infilteration': 5, 'SQL Injection': 6 }

In [None]:
# Encode each class name as its integer code in a single vectorized pass
dataset['Label'] = dataset['Label'].map(CODIGOS_LABEL)
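
The two label steps (grouping attack variants into families, then encoding the families as integers) can be traced on a short list of labels. This is a plain-Python sketch; `GROUPS`, `CODES`, and the sample labels are names chosen for the illustration, with the mappings copied from the cells above.

```python
from collections import Counter

# Attack variants mapped to their family (same grouping as in the notebook).
GROUPS = {
    "Brute Force -Web": "Brute Force", "Brute Force -XSS": "Brute Force",
    "SSH-Bruteforce": "Brute Force", "FTP-BruteForce": "Brute Force",
    "DDOS attack-HOIC": "DDOS", "DDOS attack-LOIC-UDP": "DDOS",
    "DoS attacks-GoldenEye": "DoS", "DoS attacks-Slowloris": "DoS",
}
# Family name mapped to its integer code (same codes as CODIGOS_LABEL).
CODES = {"Benign": 0, "Bot": 1, "Brute Force": 2, "DDOS": 3,
         "DoS": 4, "Infilteration": 5, "SQL Injection": 6}

labels = ["Benign", "SSH-Bruteforce", "DDOS attack-HOIC",
          "Benign", "Brute Force -XSS", "Bot"]

# Labels without a grouping entry (e.g. "Benign") are kept unchanged.
grouped = [GROUPS.get(label, label) for label in labels]
encoded = [CODES[name] for name in grouped]

print(Counter(grouped))
print(encoded)
```

Both brute-force variants end up with code 2, which is why the seven-class counts printed earlier are sums of the original per-variant counts.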

## Splitting into Training and Test Sets and Saving to Drive

In [None]:
dataset = dataset.groupby('Label')

In [None]:
CONJUNTO_40 = '/content/drive/MyDrive/Colab Notebooks/ids2018/conjunto40/'
CONJUNTO_60 = '/content/drive/MyDrive/Colab Notebooks/ids2018/conjunto60/'

In [None]:
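def separar_salvar(dataset, name):
  # Reset the index so every row has a unique label; otherwise drop() below would
  # also discard unsampled rows that share an index value with a sampled one
  dataset = dataset.reset_index(drop=True)
  # Random 40% of this class (fixed seed for reproducibility); the other 60% is the complement
  conjunto_40 = dataset.sample(frac=0.4,random_state=1)
  conjunto_60 = dataset.drop(conjunto_40.index)
  conjunto_40.to_csv(CONJUNTO_40 + name + '.csv', encoding='utf-8', index=False)
  conjunto_60.to_csv(CONJUNTO_60 + name + '.csv', encoding='utf-8', index=False)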
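
Because the split is done separately for each class, both partitions keep the class proportions of the full dataset. The sketch below reproduces that idea with the standard library only; `split_40_60` is a hypothetical helper, and its random draws will not match the rows pandas selects.

```python
import random
from collections import defaultdict

def split_40_60(records, seed=1):
    # Group record positions by class label.
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(records):
        by_class[label].append(idx)

    rng = random.Random(seed)
    part_40, part_60 = [], []
    for label, indices in by_class.items():
        # Sample 40% of this class; everything else goes to the 60% partition.
        k = round(0.4 * len(indices))
        chosen = set(rng.sample(indices, k))
        part_40 += [i for i in indices if i in chosen]
        part_60 += [i for i in indices if i not in chosen]
    return part_40, part_60

# Hypothetical toy data: 10 benign flows and 5 bot flows.
records = [(i, "Benign") for i in range(10)] + [(i, "Bot") for i in range(5)]
p40, p60 = split_40_60(records)
print(len(p40), len(p60))
```

The two partitions are disjoint and together cover every record, which is the property the per-class CSV files rely on.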

In [None]:
separar_salvar(dataset.get_group(CODIGOS_LABEL['Benign']), 'benign')

In [None]:
separar_salvar(dataset.get_group(CODIGOS_LABEL['Bot']), 'bot')

In [None]:
separar_salvar(dataset.get_group(CODIGOS_LABEL['Brute Force']), 'brute_force')

In [None]:
separar_salvar(dataset.get_group(CODIGOS_LABEL['DDOS']), 'ddos')

In [None]:
separar_salvar(dataset.get_group(CODIGOS_LABEL['DoS']), 'dos')

In [None]:
separar_salvar(dataset.get_group(CODIGOS_LABEL['Infilteration']), 'infilteration')

In [None]:
separar_salvar(dataset.get_group(CODIGOS_LABEL['SQL Injection']), 'sql_injection')

In [None]:
dataset = dataset.get_group(CODIGOS_LABEL['Benign'])

In [None]:
separar_salvar(dataset, 'benign')

## Loading the Training Set

In [None]:
def carregar_arquivo(name, path):
  return pd.read_csv(path + name + '.csv')

In [None]:
def carregar_arquivo_conjunto40(name):
  return carregar_arquivo(name, CONJUNTO_40)

In [None]:
train = carregar_arquivo_conjunto40('benign')

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the remaining classes onto the benign rows
outras_classes = ['bot', 'brute_force', 'ddos', 'dos', 'infilteration', 'sql_injection']
train = pd.concat([train] + [carregar_arquivo_conjunto40(nome) for nome in outras_classes], ignore_index=True)

In [None]:
target = pd.Categorical(train.pop('Label'))

In [None]:
target.describe()

## Normalization

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
transformer = StandardScaler()

In [None]:
transformer = transformer.fit(train.values)

In [None]:
normalized_dataset = transformer.transform(train.values)
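
`StandardScaler` rescales each feature to zero mean and unit variance, using the population standard deviation. The sketch below performs the same computation for a single hypothetical column with the standard library, as a way to check the numbers by hand.

```python
import statistics

def standardize(column):
    # z = (x - mean) / std, with the population standard deviation (ddof = 0).
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # constant column: avoid dividing by zero
    return [(x - mean) / std for x in column]

# Hypothetical feature values: mean 5, population standard deviation 2.
flow_duration = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(flow_duration)
print(z)
```

The smallest value (2.0) lies 1.5 standard deviations below the mean and the largest (9.0) lies 2 above it, so features with very different ranges become comparable inputs for the MLP.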

In [None]:
normalized_dataset  = pd.DataFrame(normalized_dataset)
normalized_dataset['Label'] = target

In [None]:
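# index=False keeps the row index out of the file; without it the index is read back as an
# extra 'Unnamed: 0' column and ends up among the features (the recorded shapes below,
# with 79 columns, come from a run that still included it)
normalized_dataset.to_csv(CONJUNTO_40 + "standard_scaler.csv", index=False)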

In [None]:
normalized_dataset = carregar_arquivo_conjunto40("standard_scaler")

In [None]:
normalized_dataset.head()

## Training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

In [None]:
target = normalized_dataset.pop('Label')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(normalized_dataset, target, test_size=0.3, random_state=42, stratify=target) 

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape 

((1553871, 79), (1553871,), (665946, 79), (665946,))

In [None]:
pd.Categorical(y_test).describe()

| categories | counts | freqs |
|---|---|---|
| 0 | 485929 | 0.729682 |
| 1 | 34343 | 0.05157 |
| 2 | 45753 | 0.068704 |
| 3 | 82529 | 0.123927 |
| 4 | 6300 | 0.00946 |
| 5 | 11088 | 0.01665 |
| 6 | 4 | 6e-06 |
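
These counts show how imbalanced the test split is. As a quick sanity check, the sketch below computes what a trivial classifier that always answers class 0 (Benign) would score; the counts are copied from the output above, and the "always Benign" classifier is only a hypothetical baseline.

```python
# Test-set class counts, copied from the output above.
counts = {0: 485929, 1: 34343, 2: 45753, 3: 82529, 4: 6300, 5: 11088, 6: 4}

total = sum(counts.values())

# Plain accuracy of always predicting class 0 equals the share of class 0.
majority_accuracy = counts[0] / total

# Balanced accuracy is the mean of per-class recalls:
# recall is 1 for class 0 and 0 for every other class.
per_class_recall = [1.0 if label == 0 else 0.0 for label in counts]
balanced_accuracy = sum(per_class_recall) / len(per_class_recall)

print(total)
print(round(majority_accuracy, 6))
print(round(balanced_accuracy, 6))
```

The baseline reaches roughly 73% plain accuracy but only 1/7 balanced accuracy, which is a reason to read both figures when comparing the models.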


### Multilayer Perceptron

In [None]:
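def criarModeloMLP():
  # Two hidden layers (78 and 39 units), ReLU activation, Adam solver, fixed seed; progress is printed during training
  return MLPClassifier(hidden_layer_sizes=(78, 39), max_iter=300, activation='relu',
                       solver='adam', random_state=1, verbose=True)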

In [None]:
mlpClassifier = criarModeloMLP()

In [None]:
mlpClassifier = mlpClassifier.fit(X_train, y_train)

### One-vs-All with MLP

In [None]:
ova = OneVsRestClassifier(criarModeloMLP()).fit(X_train, y_train)
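
One-vs-All (one-vs-rest) turns the seven-class problem into seven binary ones: one model per class, each trained to separate that class from all others, with the final prediction going to the class whose model is most confident. The sketch below shows only these two mechanics in plain Python; the helper names and the scores are hypothetical, and no model is trained.

```python
def binarize(labels, positive_class):
    # Binary target for one sub-problem: 1 for the chosen class, 0 for the rest.
    return [1 if y == positive_class else 0 for y in labels]

def predict_one_vs_rest(scores_per_class):
    # Final decision: the class whose binary model gives the highest score.
    return max(scores_per_class, key=scores_per_class.get)

# Hypothetical encoded labels (0 = Benign, 1 = Bot, 2 = Brute Force, 3 = DDOS).
y = [0, 3, 2, 0, 1]
targets = {c: binarize(y, c) for c in sorted(set(y))}
print(targets[0])
print(targets[3])

# Hypothetical confidence scores from the four binary models for one flow.
scores = {0: 0.10, 1: 0.05, 2: 0.20, 3: 0.85}
print(predict_one_vs_rest(scores))
```

With the notebook's setup, each of the seven binary sub-problems gets its own MLP built by `criarModeloMLP()`, so training cost grows with the number of classes.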

## Post-processing

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score

### Metrics

In [None]:
nome_classes = ['Benign','Bot','Brute Force','DDOS','DoS','Infilteration','SQL Injection']

def salvar_informacoes(clf):
  # Builds a text report with overall and per-class metrics on the test split
  y_pred = clf.predict(X_test)
  # One 2x2 matrix [[tn, fp], [fn, tp]] per class, treating that class as the positive one
  matriz = multilabel_confusion_matrix(y_test, y_pred)
  acc = accuracy_score(y_test, y_pred)
  acc_balanced = balanced_accuracy_score(y_test, y_pred)
  precision = precision_score(y_test, y_pred, average='weighted')
  texto = '=====================================================================\n\n'
  texto += f'Accuracy: {acc}\n'
  texto += f'Balanced accuracy: {acc_balanced}\n'
  texto += f'Precision (weighted): {precision}\n'
  texto += '\n\n'

  acc_classes = []
  precision_classes = []
  recall_classes = []
  TNR_classes = []
  f1_score_classes = []
  texto += 'RESULTS PER CLASS\n\n'
  for j in range(0,len(nome_classes)):
    texto += '\n\n'
    texto += f'Class {j}: {nome_classes[j]}\n'

    # split the matrix of class j into tn, fp, fn, tp
    tn = matriz[j][0][0]
    fp = matriz[j][0][1]
    fn = matriz[j][1][0]
    tp = matriz[j][1][1]
    # append the matrix to the report (rows: actual, columns: predicted)
    texto += '\n\n-- N --|-- P --\n'
    texto += f'N| {tn} | {fp} |\n'
    texto += '-----------------------\n'
    texto += f'P| {fn} | {tp} |\n'
    texto += '\n\n'

    # compute the metrics from tp, tn, fn, fp; a metric with a zero denominator is reported as 0.0
    acc_classes.append(((tn+tp)/(tn+tp+fn+fp)))
    precision = tp/(tp+fp) if (tp+fp) > 0 else 0.0
    precision_classes.append(precision)
    recall = tp/(tp+fn) if (tp+fn) > 0 else 0.0
    recall_classes.append(recall)
    TNR_classes.append(tn/(tn+fp) if (tn+fp) > 0 else 0.0)
    f1_score_classes.append(2*precision*recall/(precision+recall) if (precision+recall) > 0 else 0.0)

    # append the line with the metrics computed for class j
    texto += '             acc,                     loss,                   precision,            recall,                TNR,              f1-score\n'
    texto += f'Class {nome_classes[j]}:  {acc_classes[j]},   {1-acc_classes[j]},   {precision_classes[j]},   {recall_classes[j]},     {TNR_classes[j]},  {f1_score_classes[j]}\n'
  return texto
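
Every per-class figure in the report is derived from the four cells of that class's one-vs-rest confusion matrix. The sketch below isolates those formulas in a small helper and evaluates them on hypothetical counts; `class_metrics` is a name introduced here, not part of the notebook.

```python
def class_metrics(tn, fp, fn, tp):
    # Metrics for one class treated as "positive" against all other classes.
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0.0  # true negative rate (specificity)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "tnr": tnr, "f1": f1}

# Hypothetical counts for one class.
m = class_metrics(tn=90, fp=10, fn=5, tp=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")

# A class that is never predicted: precision and F1 fall back to 0.0.
never_predicted = class_metrics(tn=100, fp=0, fn=4, tp=0)
print(never_predicted["precision"], never_predicted["f1"])
```

The second call mirrors what can happen with a very rare class such as SQL Injection: accuracy stays high because of the true negatives, while recall and F1 expose that the class is missed.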

In [None]:
ova_info = salvar_informacoes(ova)

In [None]:
mlp_info = salvar_informacoes(mlpClassifier)

In [None]:
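def salvar(texto, nome, caminho):
  # The context manager closes the file, so the report is fully written to disk
  with open(caminho + nome, 'w', encoding='utf-8') as arquivo:
    arquivo.write(texto)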

In [None]:
salvar(ova_info,'ovaInfo3.txt', CONJUNTO_40)

In [None]:
salvar(mlp_info,'mlpClassifierInfo3.txt', CONJUNTO_40)