# Homework of Ch3. Malware Behavior Log Classification
----
This is a homework snippet for TU-ETP-AD1062 Machine Learning Fundamentals.

For more information, please refer to:
https://sites.google.com/view/tu-ad1062-mlfundamentals/

## Import Packages
----
- Data pre-processing:
    - `pandas`: Reads the CSV files
    - `os`: Joins path components
    - `sklearn.preprocessing.LabelEncoder`: Converts string labels into numeric labels
- Classifier training and predicting:
    - `lightgbm`: Gradient boosting (Ch.3)
    - `sklearn.svm.SVC`: Support Vector Machine (Ch.2, Ch.3)
    - `sklearn.neural_network.MLPClassifier`: Multi-Layer Perceptron (Ch.3)
- Performance evaluation:
    - `sklearn.model_selection.cross_val_score`: **Automatically** splits your data into training and validation sets k times, trains a classifier on each split, and computes its score (i.e., k-fold cross-validation)
    - `sklearn.model_selection.train_test_split`: Splits your data into a training set and a validation set once; you then fit the classifier yourself and inspect the score and confusion matrix
    - `mlfund.plot.PlotMetric`: Plots the confusion matrix (provided by this repository)

In [None]:
!pip install pandas

import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder

import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

from mlfund.plot import PlotMetric
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
%matplotlib inline

## 1. Data pre-processing
----
The code in this section:
1. Reads the CSV files,
2. Converts the required parts into `numpy.ndarray` for training and prediction with scikit-learn, and
3. Converts the string labels into numeric labels with `sklearn.preprocessing.LabelEncoder`, i.e.:
    - `PWS:Win32/Fareit`: 0
    - `Trojan:HTML/Brocoiner`: 1
    - `Trojan:O97M/Obfuse`: 2
    - ...
    - `VirTool:Win32/VBInject`: 19

### 1.1. Read CSV Files by Pandas
----
Here we simply use `pandas.read_csv` to read the CSV files. Note that:
- The result is a pandas `DataFrame`, but most machine learning frameworks expect a `numpy.ndarray`, so we convert it by accessing `.values`
- The first column, `id`, should be ignored, so we take the values starting from column 1 rather than column 0 (i.e., using `.values[:, 1:]`)
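
As a minimal sketch of the slicing described above, the snippet below parses a tiny in-memory CSV. The column names and values are made up for illustration and are not taken from the homework dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content: an `id` column followed by two feature columns.
csv_text = "id,feat_a,feat_b\nsample_0,1,0\nsample_1,0,3\n"

df = pd.read_csv(io.StringIO(csv_text))

# `.values` yields a numpy.ndarray; `[:, 1:]` keeps every row and drops column 0 (`id`).
X = df.values[:, 1:]

# Alternative: slice first, then convert, so the string `id` column never enters the array.
X_numeric = df.iloc[:, 1:].to_numpy()

print(type(X).__name__, X.shape, X.dtype, X_numeric.dtype)
```

If the `id` column holds strings, as assumed in this sketch, `df.values` has dtype `object` because the columns are of mixed types. Slicing with `iloc` before converting keeps a numeric dtype, which some libraries handle more gracefully.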

In [None]:
# Training set
df_train_feature = pd.read_csv(os.path.join('data', 'hw03_dataset.train.feature.csv'))
df_train_label = pd.read_csv(os.path.join('data', 'hw03_dataset.train.label.csv'))

# Drop column 0 (`id`): keep all feature columns, and take the label column as a 1-D array
X_train = df_train_feature.values[:, 1:]
y_train_str = df_train_label.values[:, 1]


# Testing set
df_test_feature = pd.read_csv(os.path.join('data', 'hw03_dataset.test.feature.csv'))
df_test_label = pd.read_csv(os.path.join('data', 'hw03_dataset.test.label.csv'))

X_test = df_test_feature.values[:, 1:]
y_test_str = df_test_label.values[:, 1]

In [None]:
display(df_train_feature)

### 1.2. Convert String Labels to Numeric Labels
----
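
`LabelEncoder` collects the distinct labels seen in `fit`, sorts them, and maps each label to its index in that sorted list. A minimal sketch, using four of the family names listed in Section 1 (the sample list itself is made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label column; the family names come from the list in Section 1.
labels = [
    "Trojan:O97M/Obfuse",
    "PWS:Win32/Fareit",
    "VirTool:Win32/VBInject",
    "Trojan:HTML/Brocoiner",
    "PWS:Win32/Fareit",
]

encoder = LabelEncoder()
encoder.fit(labels)

# classes_ holds the distinct labels in sorted order; a label's index is its numeric code.
encoded = encoder.transform(labels)

# inverse_transform maps numeric codes back to the original strings.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
```

With only four classes the codes here run from 0 to 3, so they differ from the codes in the full dataset. Note that `transform` raises a `ValueError` for a label that was not seen during `fit`, so every family in the test labels must also appear in the training labels.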


In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(y_train_str)

y_train = label_encoder.transform(y_train_str)
y_test = label_encoder.transform(y_test_str)

In [None]:
display(label_encoder.classes_)

## 2. Construct your Classifier
----
Build your classifier and evaluate it in two ways:
- Split your training data into an 80% training set (`X1` and `y1`) and a 20% testing set (`X2` and `y2`) **only once**
- Automatically conduct k-fold cross-validation
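
The two evaluation styles can be sketched as follows. This is an illustration under stated assumptions, not the homework solution: the data is a synthetic stand-in generated with `make_classification`, `SVC` is used only because it ships with scikit-learn, and `sklearn.metrics.confusion_matrix` stands in for the repository's plotting helper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption: 100 samples, 8 features, 3 classes).
X, y = make_classification(
    n_samples=100, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# (a) One-off hold-out split: 80% for fitting, 20% for checking.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(C=1.0)
clf.fit(X1, y1)
holdout_accuracy = clf.score(X2, y2)
cm = confusion_matrix(y2, clf.predict(X2), labels=[0, 1, 2])

# (b) 5-fold cross-validation: a fresh clone of the classifier is fitted for each fold.
scores = cross_val_score(SVC(C=1.0), X, y, cv=5)

print(holdout_accuracy, scores.mean())
```

The single split is cheap and yields a confusion matrix to inspect, but its score depends on which samples land in the 20%. Cross-validation averages over k different splits at roughly k times the training cost.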

In [None]:
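# Factory functions: each call returns a fresh, unfitted classifier

def create_gradient_boost():
    # Gradient boosting with LightGBM's default hyperparameters
    return lgb.LGBMClassifier()

def create_svc():
    # Support Vector Machine with regularization parameter C=1.0
    return SVC(C=1.0)

def create_MLP():
    # Multi-Layer Perceptron with scikit-learn's default hyperparameters
    return MLPClassifier()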

In [None]:
# 5-fold cross-validation on the training set; returns one accuracy score per fold
scores = cross_val_score(create_gradient_boost(), X_train, y_train, cv=5)
display(scores)

In [None]:
clfLgb = lgb.LGBMClassifier()
clfLgb.fit(X_train, y_train, feature_name=df_train_feature.columns.to_list()[1:])

In [None]:
y_test_predict = clfLgb.predict(X_test)

In [None]:
plot = PlotMetric()
plot.set_labels(list(label_encoder.classes_))
plot.confusion_matrix(y_test, y_test_predict, normalize=True)