# Homework of Ch3. Malware Behavior Log Classification
----
This is the homework notebook for TU-ETP-AD1062 Machine Learning Fundamentals.

For more information, please refer to:
https://sites.google.com/view/tu-ad1062-mlfundamentals/

> You do NOT have to build everything from scratch. Please try your best on the following parts:
> - **Your task: HW3.3.1.**
> - **Your task: HW3.3.2.**
> - **Your task: HW3.4.**

## HW3.1. Import Packages
----
- Data pre-processing:
    - `pandas`: Used for CSV reading
    - `os`: Used for path join
    - `sklearn.preprocessing.LabelEncoder`: Convert string-based labels into numeric labels
- Classifier training and predicting:
    - `lightgbm`: Gradient boosting (Ch.3)
    - `sklearn.svm.SVC`: Support Vector Machine (Ch.2, Ch.3)
    - `sklearn.neural_network.MLPClassifier`: Multi-Layer Perceptron (Ch.3)
- Performance evaluation:
    - `sklearn.model_selection.cross_val_score`: **Automatically** splits your data into training and validation sets k times, trains a classifier on each split, and computes the scores, i.e., k-fold cross-validation
    - `sklearn.metrics.zero_one_loss`: Used for error-rate evaluation (the fraction of misclassified samples)
    - `sklearn.model_selection.train_test_split`: Splits your data into a training and a validation set once; you then feed them into the classifier yourself and observe the score and confusion matrix
    - `mlfund.plot.PlotMetric`: plot confusion matrix (provided by this repository)
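
As a minimal sketch of how `zero_one_loss` relates to accuracy, the toy labels below are made up for illustration and are not part of the homework data. The error rate it returns should equal one minus the accuracy.

```python
from sklearn.metrics import accuracy_score, zero_one_loss

# Hypothetical toy labels, only for illustration
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]

# Fraction of misclassified samples (1 wrong out of 5)
error_rate = zero_one_loss(y_true, y_pred)
# Fraction of correctly classified samples (4 right out of 5)
accuracy = accuracy_score(y_true, y_pred)

print('Error rate = %2.3f, accuracy = %2.3f' % (error_rate, accuracy))
```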

In [None]:
!pip install pandas

import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder

import lightgbm as lgb
from sklearn.svm import SVC, LinearSVC
from sklearn.neural_network import MLPClassifier

from mlfund.plot import PlotMetric
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
%matplotlib inline

## HW3.2. Data pre-processing
----
The code snippet is used to:
1. Read CSV files,
2. Convert the required part into `numpy.ndarray` for scikit-learn training and predicting, and
3. Convert the string labels into numeric labels by `sklearn.preprocessing.LabelEncoder`, i.e.,:
    - `PWS:Win32/Fareit`: 0
    - `Trojan:HTML/Brocoiner`: 1
    - `Trojan:O97M/Obfuse`: 2
    - ...
    - `VirTool:Win32/VBInject`: 19
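
The mapping above can be sketched on a tiny, hand-written label list (an assumption for illustration; the real labels come from the CSV files). `LabelEncoder` sorts the distinct labels and assigns indices in that order, and `inverse_transform` maps the numbers back to strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, only two of the malware families
toy_labels = ['Trojan:HTML/Brocoiner', 'PWS:Win32/Fareit', 'Trojan:HTML/Brocoiner']

encoder = LabelEncoder()
# Learn the sorted set of classes and convert strings to integers
encoded = encoder.fit_transform(toy_labels)
# Convert the integers back to the original strings
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(encoded)
print(list(decoded))
```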

### HW3.2.1. Read CSV Files by Pandas
----
Here we simply use `pandas.read_csv` for the csv reading. Notice that:
- The first column `id` should be ignored, so we take the values from column index 1 onward instead of column index 0 (i.e., using `.values[:, 1:]`)
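
A small sketch of the column slicing, using an in-memory CSV whose column names (`feat_a`, `feat_b`) and values are hypothetical and not taken from the homework data set:

```python
import io
import pandas as pd

# Hypothetical CSV content: an `id` column followed by two feature columns
csv_text = "id,feat_a,feat_b\n100,1,0\n101,0,3\n"
df = pd.read_csv(io.StringIO(csv_text))

# Drop column 0 (`id`) and keep the remaining columns as a numpy.ndarray
X = df.values[:, 1:]

print(df.values.shape)  # includes the id column
print(X.shape)          # features only
```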

In [None]:
# Training set
df_train_feature = pd.read_csv(os.path.join('data', 'hw03_dataset.train.feature.csv'))
df_train_label = pd.read_csv(os.path.join('data', 'hw03_dataset.train.label.csv'))

X_train = df_train_feature.values[:, 1:]
y_train_str = df_train_label.values[:, 1:].ravel()


# Testing set
df_test_feature = pd.read_csv(os.path.join('data', 'hw03_dataset.test.feature.csv'))
X_test = df_test_feature.values[:, 1:]

display(df_train_feature)

### HW3.2.2. Convert String Label to Numeric Labels
----
Use `LabelEncoder` to convert your string labels into `0`, `1`, `2`, ..., and `19`.

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(y_train_str)

y_train = label_encoder.transform(y_train_str)

display( [ (idx, label) for idx, label in enumerate(label_encoder.classes_) ] )

## HW3.3. Construct your Classifier
----
Build your classifier, with parameters fine-tuned.

> **Your task: HW3.3.1.**  
> Train and predict just once, keep adjusting your `create_clf`, and make sure the parameters are not too bad.  
> Here we use `train_test_split` to divide your training data `X_train` and `y_train` into:
> - 80% `X1`, `y1`, as the training set in this round
> - 20% `X2`, `y2`, as the testing set in this round
>
> In this round, you can inspect the confusion matrix and check whether the data from each class is well classified.
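
The single-split round can be sketched on synthetic data as follows. The data from `make_classification`, the default `SVC`, and the use of `sklearn.metrics.confusion_matrix` (instead of the repository's `PlotMetric`) are assumptions for illustration only. Row `i`, column `j` of the matrix counts samples of true class `i` predicted as class `j`, so a well-classified class has most of its row on the diagonal.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for X_train / y_train
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# 80% for training (X1, y1), 20% for testing (X2, y2)
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

model = SVC()
model.fit(X1, y1)
y2_predict = model.predict(X2)

# Rows are true classes, columns are predicted classes
conf_mat = confusion_matrix(y2, y2_predict, labels=[0, 1, 2])
print(conf_mat)
```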


In [None]:
def create_clf():
    # You can use:
    #     sklearn.svm.SVC,
    #     sklearn.svm.LinearSVC,
    #     sklearn.neural_network.MLPClassifier,
    #     sklearn.ensemble.GradientBoostingClassifier
    #     lightgbm.LGBMClassifier
    #     ...
    # Or any classifier you found!
    # Remember to fine-tune the model parameters
    
    return LinearSVC()

In [None]:
X1, X2, y1, y2 = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

In [None]:
model = create_clf()
model.fit(X1, y1)

In [None]:
y2_predict = model.predict(X2)

# Error rate
err_01loss = zero_one_loss(y2, y2_predict)
print('Error rate = %2.3f' % err_01loss)

# Confusion matrix of prediction
plot_conf_mat = PlotMetric()
plot_conf_mat.set_labels(label_encoder.classes_.tolist())
plot_conf_mat.confusion_matrix(y2, y2_predict, True)

> **Your task: HW3.3.2.**  
> Now you have a classifier with the parameters fine-tuned in **HW3.3.1**.
> Your model should accept more challenges! Let's conduct **cross-validation**, which is mentioned in Ch.1:
> ![Cross Validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
>
> If the performance is poor, or differs greatly between rounds, your model might be over-fitting to a particular training/testing split. Keep fine-tuning your `create_clf` to make your model stable!
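
One way to judge stability across rounds is to look at the mean and the standard deviation of the fold scores. The sketch below assumes synthetic data and a default `SVC`, neither of which is part of the homework; a large standard deviation relative to the mean would hint at an unstable model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic 3-class data standing in for X_train / y_train
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(SVC(), X, y, cv=5)

print(scores)
print('mean = %2.3f, std = %2.3f' % (scores.mean(), scores.std()))
```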

In [None]:
scores = cross_val_score(create_clf(), X_train, y_train, cv=10, n_jobs=8)
display(scores)

## HW3.4. Submit to Kaggle InClass
----
Finally, you should have your classifier model fine-tuned. Now:

> **Your task: HW3.4.**
> 1. Train the model created by `create_clf` on the full data set `X_train`,
> 2. Predict the **unknown** testing data `X_test` with the trained model, then
> 3. Submit your result to Kaggle

**Notice: You have only 2 chances to submit your result each day, which means you should fine-tune your model by cross-validation.**

In [None]:
# Create model and train
model = create_clf()
model.fit(X_train, y_train)

# Predict the testing data
y_test_predict = model.predict(X_test)
y_test_predict_str = label_encoder.inverse_transform(y_test_predict)

## Before you submit
----
Please join the homework 3 competition by **using your email address ending with \@trendmicro.com as your Kaggle InClass team name**.

Type your email in the variable `my_trendmicro_email_which_is_also_my_team_name` to confirm that you have read this paragraph; the following code snippet will then generate the CSV file for submission.

In [None]:
my_trendmicro_email_which_is_also_my_team_name = ''


import re
# Escape the dot and anchor the end so only addresses ending with @trendmicro.com pass
assert re.match(r"[^@]+@trendmicro\.com$", my_trendmicro_email_which_is_also_my_team_name), "Please read the paragraph above carefully"

target_path = 'data/hw03.result.csv'
df_test_label = pd.DataFrame({'id': df_test_feature['id'], 'label': y_test_predict_str})
df_test_label.to_csv(target_path, index=False)

print('Congratulations! Please submit your result \'%s\' to https://www.kaggle.com/t/0415e3e953e54638b617fcd0bc5c04bd' % target_path)