# Malware Detection with ML

**DISCLAIMER**: This code represents only one of many possible ways to perform the same action. 

The documentation of scikit-learn (one of the most popular Python packages for Machine Learning) is a very useful reference:
- http://scikit-learn.org/stable/documentation.html

## Importing libraries

In [None]:
# the Dataset to be used
from lab_dataset import malware_dataset

In [None]:
# this line renders the plots inline in the notebook, directly below the cell that creates them
%matplotlib inline 

In [None]:
# libraries for plotting
import matplotlib
import matplotlib.pyplot as plt

In [None]:
# numerical, data-handling, and machine learning libraries
import numpy as np
import pandas as pd
import sklearn
from collections import Counter

First, we print the dataset to see how it is structured.

In [None]:
malware_dataset

## Pre-processing the dataset for ML

We now need to convert this dataset into a numerical format that can be used to train the SVM.

For example, we want to extract the labels (in `labels_list`) and the permissions of each sample as a space-separated string of words (in `words_list`). Each string will then be converted into a binary bit vector, where:
- 1 means the permission is requested
- 0 means the permission is not requested

In [None]:
words_list = []
labels_list = []
for sample in malware_dataset:
    # Through the Python join function, I am concatenating the sample's permissions into one space-separated string
    words_list.append(' '.join(sample[1:]))
    # The first element of the array is the label
    labels_list.append(sample[0])
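
To make the target representation concrete, here is a minimal pure-Python sketch of the binary encoding. It assumes each sample is a list whose first element is the label and whose remaining elements are permissions, as in the loop above; `toy_dataset` and the permission names are hypothetical and not taken from the lab dataset.

```python
# Hypothetical toy samples: each one is [label, permission, permission, ...]
toy_dataset = [
    ["malware", "SEND_SMS", "INTERNET"],
    ["goodware", "INTERNET"],
    ["malware", "READ_CONTACTS", "SEND_SMS"],
]

# Vocabulary: every permission seen in any sample, in a fixed (sorted) order
vocabulary = sorted({perm for sample in toy_dataset for perm in sample[1:]})

# One row per sample: 1 if the permission is requested, 0 otherwise
bit_vectors = [
    [1 if perm in sample[1:] else 0 for perm in vocabulary]
    for sample in toy_dataset
]
toy_labels = [sample[0] for sample in toy_dataset]

print(vocabulary)
for label, bits in zip(toy_labels, bit_vectors):
    print(label, bits)
```

Each column corresponds to one permission of the vocabulary, so every sample gets a vector of the same length regardless of how many permissions it requests.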

We use the `CountVectorizer` to get a binary vector representation of all the permissions. This class takes as input a list of strings, where each string is a concatenation of words separated by spaces. It is also often used in NLP (Natural Language Processing). Through the `fit_transform` function, it returns a feature matrix where each column corresponds to a different feature. With the default settings the entries are word counts, so the matrix is binary only as long as each permission appears at most once per sample; passing `binary=True` to `CountVectorizer` enforces 0/1 values.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(words_list)

The matrix `X` is stored in a "sparse" format: only the non-zero entries are kept in memory, so displaying `X` shows a summary rather than the values themselves. This matters because in real-world applications the number of zeros may be high. In other words, the "sparse" format is a compact data representation that occupies less RAM.

In [None]:
X
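
The memory saving can be illustrated with a small sketch. It assumes a randomly generated 0/1 matrix with about 1% ones (a hypothetical stand-in, not the lab dataset) and uses the CSR format from `scipy.sparse`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical dense 0/1 matrix: 1000 samples x 500 features, about 1% ones
rng = np.random.default_rng(0)
dense = (rng.random((1000, 500)) < 0.01).astype(np.int64)

# Same content, stored in Compressed Sparse Row format
sparse = csr_matrix(dense)

# Memory used by the dense array versus the three arrays backing the CSR matrix
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("non-zero entries:", sparse.nnz)
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)

# Converting back to a dense array recovers exactly the original matrix
roundtrip_ok = np.array_equal(sparse.toarray(), dense)
print("round trip identical:", roundtrip_ok)
```

The sparser the matrix, the larger the saving; for a matrix that is mostly non-zero, the sparse format would instead cost more than the dense one.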

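Instead, to see the values, we need to use the `X.toarray()` function to instantiate the content, and `vectorizer.get_feature_names_out()` to recover the mapping between columns and features. (In scikit-learn versions older than 1.0 this method is called `get_feature_names()`; it was removed in version 1.2.)

In [None]:
vectorizer.get_feature_names_out()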

In [None]:
X.toarray()

In [None]:
from sklearn.preprocessing import LabelEncoder
# The LabelEncoder will map the labels to integer values
le = LabelEncoder()
y = le.fit_transform(labels_list)

In [None]:
# To recover the mapping between integers and strings, we need to check the classes of the label encoder.
# (In other words, le.classes_[i] is the string label that was encoded as the integer i.)
le.classes_
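
A small sketch of the label mapping, using hypothetical string labels (the actual label names in the lab dataset may differ):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels
toy_labels = ["malware", "goodware", "malware", "goodware", "goodware"]

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_labels)

# classes_ is sorted, and each integer code is an index into it
toy_classes = list(toy_le.classes_)
print(toy_classes)
print(toy_y.tolist())

# inverse_transform maps integer codes back to the original strings
toy_decoded = list(toy_le.inverse_transform([0, 1]))
print(toy_decoded)
```

Because the classes are sorted alphabetically, the integer assigned to each label depends on the label names, so it is worth checking `classes_` before deciding which integer is the positive class.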

In [None]:
# Putting the feature matrix into a pandas.DataFrame structure
samples = pd.DataFrame(X.toarray())

In [None]:
# Creating a new column named `label`
samples['label'] = y

In [None]:
# samples = samples[samples['label']!= 2]
# X = samples.drop('label',axis=1)
# y = samples['label']

## Drawing ROC curves

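Now we perform **classification** with an SVM and plot the corresponding **ROC** curves. We consider two kernels: `linear` and `rbf`. Note that `roc_curve` handles binary problems only, so `y` must contain exactly two classes, with label 1 treated as the positive one.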

In [None]:
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# I am splitting the dataset into training and testing sets (40% is held out for testing)
X_train, X_test, y_train, y_true = train_test_split(
    X, y, test_size=0.4, random_state=22)

# The matplotlib.pyplot library (here referred to with the alias "plt") is used for plotting
plt.style.use('ggplot')
# x axis: false positive rate
plt.xlabel('fpr')
# y axis: true positive rate
plt.ylabel('tpr')

# I am fitting one SVM per kernel and plotting its ROC curve
for k in ['linear', 'rbf']:

    # I am instantiating and training the classifier with kernel k
    clf = SVC(kernel=k, C=1).fit(X_train, y_train)
    # Predicted class labels (not needed for the ROC curve itself)
    y_pred = clf.predict(X_test)
    # Continuous scores: for a binary problem, the signed distance from the decision boundary
    y_score = clf.decision_function(X_test)
    fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
    plt.plot(fpr, tpr, marker='o', label=k)

plt.legend()
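
To see what `roc_curve` computes from the scores, here is a minimal sketch on four hand-made samples. The labels and scores are hypothetical; `roc_auc_score` is added as one common way to summarize a ROC curve in a single number.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground truth (1 = positive class) and classifier scores
toy_true = np.array([0, 0, 1, 1])
toy_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold on the score yields one (fpr, tpr) point of the curve
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_true, toy_score, pos_label=1)
for f, t in zip(toy_fpr, toy_tpr):
    print(f"fpr={f:.2f}  tpr={t:.2f}")

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is chance level
toy_auc = roc_auc_score(toy_true, toy_score)
print("AUC:", toy_auc)
```

Only the ranking of the scores matters here, which is why the output of `decision_function` can be used directly without converting it to probabilities.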

# Exercise - March 19th, 2019

- **Dataset:** Analyze another dataset in a CSV file in this same folder: `big_dataset.csv` 
    - This dataset has 1500 goodware Android applications, and 500 malware Android applications. 
    - The label column is called `mw_family`; it is 0 for goodware and 1 for malware.
    - All the other columns are features: together they form a binary vector of Android permissions, where 0 means the permission is not requested and 1 means it is requested.

- **Objectives:** 
    - _You need to compare the performance of Linear SVM, RBF SVM, and Random Forest by plotting the ROC curves of the three classifiers_
    - _You also need to report: Precision, Recall, F-Score_

- **Suggestions**: 
    - Use the `pandas.read_csv` function to read the CSV file. It returns a structure called `pandas.DataFrame`, a table whose rows and columns you can then access.
    - Use `pandas.DataFrame.iloc` to access rows and columns by integer position, using Python slicing. For example, `df.iloc[:10,:]` selects the first ten rows and all the columns.
    - To access individual columns by name, index the DataFrame directly: e.g., `df['mw_family']`
    - To drop a certain column, use the `pandas.DataFrame.drop` method, e.g., `df.drop('columnname', axis=1)` (useful to get the X of samples)
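
The pandas calls suggested above can be tried on a tiny in-memory table. This is a sketch only: the CSV text and the `perm_*` column names are hypothetical stand-ins, not the contents of `big_dataset.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file: two permission columns plus the label
csv_text = """perm_internet,perm_send_sms,mw_family
1,0,0
1,1,1
0,1,1
"""

# pd.read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

first_two_rows = df.iloc[:2, :]        # rows 0 and 1, all columns
y_toy = df["mw_family"]                # one column, selected by name
X_toy = df.drop("mw_family", axis=1)   # every column except the label

print(first_two_rows.shape)
print(y_toy.tolist())
print(list(X_toy.columns))
```

Note that `drop` returns a new DataFrame and leaves `df` unchanged, so the label column is still available after building `X_toy`.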