# Assignment 1

## Dataset Description:
The Drebin dataset comprises Android applications, both benign and malicious. Features are extracted from each app along several dimensions:

1. AndroidManifest.xml: Extracted details include requested permissions, app components like activities, services, etc.
2. API calls: This includes specific Android API calls that the app makes.
3. Network addresses: Any URLs or IP addresses that might be hardcoded in the app.
4. Code patterns: Such as the use of reflection, native code, etc.

The details of each feature are included in drebin_features.txt.
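As a rough illustration of how such extracted features can become one fixed-width input row, the sketch below maps a set of feature strings to a 0/1 indicator vector. The feature names and the four-entry vocabulary are hypothetical placeholders, not entries from drebin_features.txt, and the sketch assumes the features are binary indicators, which this notebook does not state.

```python
# Hypothetical vocabulary for illustration; the matrix loaded below has 1340 columns.
VOCAB = [
    "permission::SEND_SMS",
    "permission::INTERNET",
    "api_call::getDeviceId",
    "url::example.com",
]
INDEX = {name: i for i, name in enumerate(VOCAB)}


def to_indicator_vector(app_features):
    """Return a 0/1 list with a 1 at every vocabulary feature the app has."""
    vec = [0] * len(VOCAB)
    for feat in app_features:
        if feat in INDEX:  # features outside the vocabulary are ignored
            vec[INDEX[feat]] = 1
    return vec


row = to_indicator_vector({"permission::INTERNET", "url::example.com", "unknown::x"})
print(row)  # [0, 1, 0, 1]
```

Stacking one such row per app would give a matrix with the same layout as `X`: one row per sample, one column per feature.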

The Drebin dataset primarily provides a binary label for each app, indicating whether it's benign or malicious. However, within the malicious apps, there can be different families of malware, each with specific characteristics and behaviors. While the main focus of the Drebin paper was on the binary classification task (malicious vs. benign), the authors did categorize the malicious samples into various malware families. These family labels can be used for multi-class classification tasks or for understanding the distribution of different types of malware in the dataset.

Some malware families that might be present in such datasets (not limited to Drebin) include:

**FakeInstaller:** Malware posing as a legitimate app installer.
**DroidKungFu:** Known for exploiting several vulnerabilities and using encryption to hide its payloads.
**Plankton:** Known for its stealthy nature and the ability to download and execute arbitrary code.
**GingerMaster:** Exploits vulnerabilities specific to the Gingerbread version of Android.
**BaseBridge:** Utilizes a privilege escalation exploit.
... and others.

The mapping between labels and malware families in our dataset is provided below:

0: FakeInstaller

1: DroidKungFu

2: Plankton

3: GingerMaster

4: BaseBridge

5: Iconosys

6: Kmin
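For readable reports it can help to keep this mapping in code. The sketch below stores it in a dictionary and counts samples per family; `FAMILY_NAMES`, `family_counts`, and the toy label list are names made up for this example, not part of the assignment template.

```python
from collections import Counter

# Label-to-family mapping copied from the list above.
FAMILY_NAMES = {
    0: "FakeInstaller",
    1: "DroidKungFu",
    2: "Plankton",
    3: "GingerMaster",
    4: "BaseBridge",
    5: "Iconosys",
    6: "Kmin",
}


def family_counts(labels):
    """Count samples per family name, ordered by label id."""
    counts = Counter(int(label) for label in labels)
    return {FAMILY_NAMES[k]: counts.get(k, 0) for k in sorted(FAMILY_NAMES)}


# Toy labels for illustration only; in the notebook you would pass y instead.
toy_labels = [0, 0, 2, 6, 1, 0]
print(family_counts(toy_labels))
```

Checking the counts on the real labels before training shows whether some families are rare, which matters when reading per-class precision and recall later.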

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
# import other modules you may need

In [None]:
# load dataset
#filepath = os.path.join('/data/', 'drebin_data.npz')
data = np.load('drebin_data.npz')  # assumes the file is in the working directory
#data = np.load(filepath)
X, y = data['X'], data['y']
print(X.shape)
print(y.shape)

(3183, 1340)
(3183,)


In [None]:
# split into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (2546, 1340)
X_test shape: (637, 1340)
y_train shape: (2546,)
y_test shape: (637,)


In [None]:
# Design your MLP model
class MLP(nn.Module):
    def __init__(self, input_size):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 128),
            # define some middle layers
            nn.ReLU(),  # Activation function
            nn.Linear(128, 64),  # Hidden layer 1
            nn.ReLU(),  # Activation function
            nn.Linear(64, 32),  # Hidden layer 2
            nn.ReLU(),  # Activation function
            nn.Linear(32, 7),  # Output layer with 7 classes
            nn.Softmax(dim=1)  # Softmax activation for classification

        )

    def forward(self, x):
        return self.layers(x)
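One thing worth checking in this design: `nn.CrossEntropyLoss` applies log-softmax to its input itself, so feeding it the output of `nn.Softmax` applies softmax twice. The values reaching the loss are then confined to [0, 1], which puts a floor under the loss. The sketch below computes that floor for 7 classes in plain Python (the helper names are ad hoc, not PyTorch API). The result, about 1.1654, is close to the value at which the training log further down levels off.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_sum = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum for v in values]


def cross_entropy(inputs, target):
    """Cross-entropy that treats `inputs` as logits, as nn.CrossEntropyLoss does."""
    return -log_softmax(inputs)[target]


num_classes = 7
# Best case after nn.Softmax: a one-hot probability vector for the true class.
perfect_probs = [1.0] + [0.0] * (num_classes - 1)
floor = cross_entropy(perfect_probs, target=0)
print(round(floor, 4))  # 1.1654
```

Returning raw logits from the last `nn.Linear` and applying softmax only when probabilities are needed would remove this floor. Because softmax is monotonic, the argmax prediction of a given model is the same with or without it.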

In [None]:
# Data Preparation(may convert them into tensors)
X_train = torch.Tensor(X_train)  # Converting X_train to a PyTorch tensor
y_train = torch.Tensor(y_train).long().squeeze()  # Converting y_train to a PyTorch tensor, and squeezing to remove extra dimensions
X_test = torch.Tensor(X_test)  # Convert X_test to a PyTorch tensor
y_test = torch.Tensor(y_test).long().squeeze()  # Converting y_test to a PyTorch tensor and squeezing to remove extra dimensions

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

torch.Size([2546, 1340])
torch.Size([2546])
torch.Size([637, 1340])
torch.Size([637])


In [None]:
# Define your loss, optimizer, and other hyper-parameters
batch_size = 64
epochs = 20
learning_rate = 0.001

input_size = X.shape[1]
model = MLP(input_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
from torch.utils.data import DataLoader, TensorDataset

# Create DataLoader
#train_data = TensorDataset(torch.Tensor(X_train), torch.Tensor(y_train).long())
#train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

# Training
for epoch in range(epochs):
    model.train()
    for i in range(0, X_train.shape[0], batch_size):
        # Get a batch of training data (X_train and y_train are already tensors)
        X_batch = X_train[i:i+batch_size]  # Inputs for this batch
        y_batch = y_train[i:i+batch_size]  # Corresponding labels for this batch

        # Skip the final batch if it is smaller than batch_size
        if len(X_batch) < batch_size:
            continue

        # Forward pass: compute predicted outputs by passing inputs to the model
        outputs = model(X_batch)

        # Compute the loss
        loss = criterion(outputs, y_batch)

        # Backpropagate the loss and update the model's parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


    # Testing loss
    model.eval()
    with torch.no_grad():
        test_outputs = model(X_test)
        test_loss = criterion(test_outputs, y_test)

        predictions = torch.argmax(test_outputs, dim=1)
        accuracy = (predictions == y_test).float().mean()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}, Test Loss: {test_loss.item()}, Test Acc: {accuracy}")


Epoch 1/20, Loss: 1.4445796012878418, Test Loss: 1.4221535921096802, Test Acc: 0.7598116397857666
Epoch 2/20, Loss: 1.2314707040786743, Test Loss: 1.2244514226913452, Test Acc: 0.9780219793319702
Epoch 3/20, Loss: 1.1837626695632935, Test Loss: 1.1803070306777954, Test Acc: 0.9890109896659851
Epoch 4/20, Loss: 1.1787784099578857, Test Loss: 1.175390362739563, Test Acc: 0.9921507239341736
Epoch 5/20, Loss: 1.1682156324386597, Test Loss: 1.174363374710083, Test Acc: 0.9937205910682678
Epoch 6/20, Loss: 1.1670653820037842, Test Loss: 1.170760154724121, Test Acc: 0.9984301328659058
Epoch 7/20, Loss: 1.1664751768112183, Test Loss: 1.1696022748947144, Test Acc: 0.9968602657318115
Epoch 8/20, Loss: 1.1657682657241821, Test Loss: 1.1694778203964233, Test Acc: 0.9968602657318115
Epoch 9/20, Loss: 1.1656917333602905, Test Loss: 1.1693884134292603, Test Acc: 0.9968602657318115
Epoch 10/20, Loss: 1.165639877319336, Test Loss: 1.1692655086517334, Test Acc: 0.9968602657318115
Epoch 11/20, Loss: 1.16

Data preparation was already done above, so this step is skipped.

In [None]:
# Data Preparation(may convert them into tensors)
#X_train = torch.Tensor(X_train)  # Convert X_train to a PyTorch tensor
#y_train = torch.Tensor(y_train).long().squeeze()  # Convert y_train to a PyTorch tensor, ensure it's long for classification, and squeeze to remove extra dimensions
#X_test = torch.Tensor(X_test)  # Convert X_test to a PyTorch tensor
#y_test = torch.Tensor(y_test).long().squeeze()  # Convert y_test to a PyTorch tensor and squeeze to remove extra dimensions

In [None]:
# Calculate precision, recall, and F1-score for each class.

import numpy as np
import torch
from sklearn.metrics import precision_score, recall_score, f1_score

# Set the model to evaluation mode
model.eval()

# Make predictions on the testing dataset
with torch.no_grad():
    # Converting X_test to tensor
    X_test_tensor = torch.Tensor(X_test)
    # Getting the model outputs
    outputs = model(X_test_tensor)
    # Getting the predicted class indices
    _, predicted = torch.max(outputs, 1)

# Convert predictions and labels from tensors to NumPy arrays
predicted = predicted.numpy()
y_true = y_test.numpy()  # keep y_test as a tensor so the cell can be re-run

# Calculate precision, recall, and F1 score for each class
precision = precision_score(y_true, predicted, average=None)
recall = recall_score(y_true, predicted, average=None)
f1 = f1_score(y_true, predicted, average=None)

# Create a 3x7 table (3 metrics for 7 classes)
metrics_table = np.vstack((precision, recall, f1))

# Display the metrics table
class_labels = np.arange(7)  # Assuming classes are labeled from 0 to 6
print("Metrics Table (Rows: Precision, Recall, F1 Score; Columns: Classes 0 to 6):")
print(metrics_table)

# Formatting and print as a DataFrame for better readability
import pandas as pd

metrics_df = pd.DataFrame(metrics_table, index=["Precision", "Recall", "F1 Score"], columns=class_labels)
print(metrics_df)


Metrics Table (Rows: Precision, Recall, F1 Score; Columns: Classes 0 to 6):
[[0.99438202 1.         1.         0.98245614 1.         1.
  1.        ]
 [1.         0.98529412 1.         1.         1.         1.
  1.        ]
 [0.9971831  0.99259259 1.         0.99115044 1.         1.
  1.        ]]
                  0         1    2         3    4    5    6
Precision  0.994382  1.000000  1.0  0.982456  1.0  1.0  1.0
Recall     1.000000  0.985294  1.0  1.000000  1.0  1.0  1.0
F1 Score   0.997183  0.992593  1.0  0.991150  1.0  1.0  1.0
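To make the table easier to interpret, here is a minimal sketch of what the per-class scores mean, computed one-vs-rest from true-positive, false-positive, and false-negative counts. `per_class_metrics` is a helper written for this example (it is not the scikit-learn implementation), and the labels are toy values, not the notebook's test set.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Return {class: (precision, recall, f1)} computed one-vs-rest."""
    results = {}
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[c] = (precision, recall, f1)
    return results


# Toy labels for illustration only.
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2]
for c, (p, r, f) in per_class_metrics(toy_true, toy_pred, 3).items():
    print(c, round(p, 3), round(r, 3), round(f, 3))
```

In the toy data one class-0 sample is predicted as class 1, so class 0 loses recall and class 1 loses precision. The table above shows the same kind of pattern: a class with recall below 1.0 alongside other classes with precision below 1.0.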


# Assignment 2

## Background:
The paper "Byteweight: Learning to recognize functions in binary code" focuses on function boundary detection in binary code. One of the key insights of the paper is that specific byte sequences or n-grams are highly indicative of function starts. Detecting function boundaries is a foundational step for various binary analysis tasks such as disassembly, decompilation, and vulnerability discovery.

## Dataset Description:
The dataset derived from the Byteweight paper contains sequences of bytes extracted from binary files. These sequences represent potential function starts and other non-starting positions. Each byte in the sequence is treated as a token, and the goal is to recognize patterns that indicate the start of functions.

Features: Sequences of bytes from binary files.
Labels: Binary labels where '1' indicates the start of a function, and '0' indicates a non-starting position.
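The sketch below shows what one such sample could look like: a byte string turned into integer tokens with a parallel 0/1 label per byte. The byte string and the start offsets are made up for this illustration and are not taken from the dataset; they use the common 32-bit x86 prologue `push ebp; mov ebp, esp` (bytes `55 89 e5`) as an example of a pattern that tends to mark a function start.

```python
# Hypothetical code bytes: two tiny functions, each opening with 55 89 e5.
#   55        push ebp
#   89 e5     mov  ebp, esp
#   31 c0     xor  eax, eax
#   5d        pop  ebp
#   c3        ret
code = bytes.fromhex("5589e55dc35589e531c05dc3")
function_starts = {0, 5}  # offsets of the two 0x55 bytes (chosen for this toy)

# Each byte becomes one token in the range 0..255.
tokens = list(code)
# One label per byte: 1 at a function start, 0 elsewhere.
labels = [1 if i in function_starts else 0 for i in range(len(tokens))]

print(tokens)
print(labels)
```

Because every token is a byte value, a vocabulary of 256 entries is enough for an embedding layer over such sequences.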



In [111]:

from torch.nn.utils.rnn import pad_sequence
# import other modules you may need
import pickle
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

In [112]:
# load dataset
train_file = 'elf_x86_32_gcc_O1_train.pkl'
test_file = 'elf_x86_32_gcc_O1_test.pkl'

with open(train_file, 'rb') as f:
    x_train, y_train = pickle.load(f)

with open(test_file, 'rb') as f:
    x_test, y_test = pickle.load(f)

In [115]:
import pickle
import torch
from torch.nn.utils.rnn import pad_sequence

# Load dataset
with open(train_file, 'rb') as f:
    x_train, y_train = pickle.load(f)

with open(test_file, 'rb') as f:
    x_test, y_test = pickle.load(f)

# Convert the sequences to PyTorch tensors
x_train = [torch.tensor(seq) for seq in x_train]
x_test = [torch.tensor(seq) for seq in x_test]

# Set a fixed length for padding/truncating
fixed_length = 200

# Pad sequences to the fixed length
x_train_padded = pad_sequence(
    [seq[:fixed_length] for seq in x_train],  # Truncate if longer
    batch_first=True,
    padding_value=0  # Use 0 for padding
)

x_test_padded = pad_sequence(
    [seq[:fixed_length] for seq in x_test],
    batch_first=True,
    padding_value=0
)

y_train_padded = pad_sequence(
    [torch.tensor(seq[:fixed_length]) for seq in y_train],
    batch_first=True,
    padding_value=0
)

y_test_padded = pad_sequence(
    [torch.tensor(seq[:fixed_length]) for seq in y_test],
    batch_first=True,
    padding_value=0
)

# Convert y labels to tensors directly (assuming they are not sequences)
#y_train_tensor = torch.tensor(y_train)  # Ensure it is a 1D tensor
#y_test_tensor = torch.tensor(y_test)    # Ensure it is a 1D tensor

# Checking the shapes of the prepared datasets
print(x_train_padded.shape)
print(y_train_padded.shape)
print(x_test_padded.shape)
print(y_test_padded.shape)

torch.Size([14006, 200])
torch.Size([14006, 200])
torch.Size([6003, 200])
torch.Size([6003, 200])
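Note that `pad_sequence` pads to the longest sequence in the list it receives, not to a requested length, so the `[*, 200]` shapes above rely on at least one sequence in each split reaching 200 items after truncation. A minimal sketch of a helper that guarantees the length regardless of the data is shown below; `pad_or_truncate` is a name invented for this example.

```python
def pad_or_truncate(seq, length, pad_value=0):
    """Return a list of exactly `length` items: cut long inputs, right-pad short ones."""
    clipped = list(seq[:length])
    return clipped + [pad_value] * (length - len(clipped))


print(pad_or_truncate([7, 8, 9], 5))           # [7, 8, 9, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```

If the tokens are raw byte values, the padding value 0 is also a legitimate byte (0x00), so the model cannot tell padding from real data by value alone. Reserving an extra token id for padding is one possible way around that, at the cost of a vocabulary of 257.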


In [116]:
import torch
import torch.nn as nn
# Design your RNN model
class RNNModel(nn.Module):
    def __init__(self, seq_len, vocab_size, embed_dim, hidden_dim, num_layers, output_dim):
        super(RNNModel, self).__init__()
        # seq_len is kept in the signature but is not used by any layer
        # Embedding layer to learn a dense representation of the input bytes
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # LSTM layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)

        # Sigmoid activation for binary classification
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Embedding the input sequence
        x = self.embedding(x)

        # Passing through the LSTM layer
        lstm_out, (hn, cn) = self.lstm(x)

        # Taking the output from the last time step (many-to-one)
        lstm_out_last = lstm_out[:, -1, :]  # shape: [batch_size, hidden_dim]

        # Passing through the fully connected layer
        fc_out = self.fc(lstm_out_last)

        # Applying sigmoid activation
        out = self.sigmoid(fc_out)

        return out
    #def forward(self, x):
        # forward process

        #return x


In [119]:
# Define your loss, optimizer, and other hyper-parameters
batch_size = 64
epochs = 10
learning_rate = 0.001

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seq_len = 200

# Define the RNN model (it must exist before the optimizer is built from its parameters)
model = RNNModel(seq_len, vocab_size=256, embed_dim=128, hidden_dim=256, num_layers=2, output_dim=1).to(device)

# Define loss function (binary cross-entropy on the sigmoid outputs)
criterion = nn.BCELoss()

# Define optimizer (Adam optimizer)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)




In [121]:
# Training
for epoch in range(epochs):
    model.train()  # Set the model to training mode
    total_loss = 0

    for i in range(0, x_train_padded.shape[0], batch_size):
        # Getting batch data
        batch_x = x_train_padded[i:i + batch_size].to(device)
        batch_y = y_train_padded[i:i + batch_size, -1].to(device).float()  # Ensuring labels are float for BCELoss

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_x).squeeze(-1)  # Drop only the trailing dim: [batch, 1] -> [batch]

        # Compute the loss
        loss = criterion(outputs, batch_y)

        # Backpropagate the loss and update model's parameters
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Testing/Evaluation Phase
    model.eval()  # Set the model to evaluation mode
    test_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for i in range(0, x_test_padded.shape[0], batch_size):
            batch_test_x = x_test_padded[i:i + batch_size].to(device)
            batch_test_y = y_test_padded[i:i + batch_size, -1].to(device).float()

            # Forward pass
            test_outputs = model(batch_test_x).squeeze(-1)

            # Compute test loss
            loss_test = criterion(test_outputs, batch_test_y)
            test_loss += loss_test.item()

            # Calculate accuracy
            predictions = (test_outputs >= 0.5).float()  # Classifying based on threshold 0.5
            correct += (predictions == batch_test_y).sum().item()
            total += batch_test_y.size(0)
            accuracy = correct / total  # Calculate accuracy


    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss / len(x_train_padded)}, Test Loss: {test_loss / len(x_test_padded)}, Test Acc: {accuracy}")


# Also, don't forget to handle device assignments (to GPU or CPU) using .to(device) if you use GPU

# Save model
# torch.save(model.state_dict(), 'model_file.pth')

Epoch 1/10, Loss: 0.0002207953245181542, Test Loss: 0.0002381456573327, Test Acc: 0.9975012493753124
Epoch 2/10, Loss: 9.080215724067397e-05, Test Loss: 4.5113004164109494e-05, Test Acc: 0.9991670831251042
Epoch 3/10, Loss: 2.3109445364795758e-05, Test Loss: 3.7202013615927146e-05, Test Acc: 0.9995002498750625
Epoch 4/10, Loss: 1.600105870208751e-05, Test Loss: 3.768547315245874e-05, Test Acc: 0.9995002498750625
Epoch 5/10, Loss: 1.042347121166559e-05, Test Loss: 3.459051451716018e-05, Test Acc: 0.9998334166250208
Epoch 6/10, Loss: 1.2843809313828326e-05, Test Loss: 4.0087081728498824e-05, Test Acc: 0.9993336665000833
Epoch 7/10, Loss: 1.3152772601546962e-05, Test Loss: 3.9512627486435435e-05, Test Acc: 0.9991670831251042
Epoch 8/10, Loss: 9.097333365915057e-06, Test Loss: 3.632979087683282e-05, Test Acc: 0.9996668332500417
Epoch 9/10, Loss: 7.230619895906689e-06, Test Loss: 3.6065181850077044e-05, Test Acc: 0.9996668332500417
Epoch 10/10, Loss: 6.03172582349992e-06, Test Loss: 3.53581
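A note on reading these loss values: `total_loss` sums one mean loss per batch, but the print statement divides it by the number of samples rather than the number of batches, so the reported figures are smaller than the mean per-batch loss by roughly a factor of `batch_size`. The sketch below rescales the epoch 1 value under that reading; `mean_batch_loss` is a helper invented for this example.

```python
import math


def mean_batch_loss(total_loss, num_samples, batch_size):
    """Average of the per-batch mean losses, given their sum over one epoch."""
    num_batches = math.ceil(num_samples / batch_size)
    return total_loss / num_batches


num_samples, batch_size = 14006, 64
reported = 0.0002207953245181542     # epoch 1 "Loss" from the log above
total_loss = reported * num_samples  # undo the division by len(x_train_padded)

print(math.ceil(num_samples / batch_size))                             # 219
print(round(mean_batch_loss(total_loss, num_samples, batch_size), 4))  # 0.0141
```

The trend across epochs is unaffected, since every epoch is scaled by the same constant. Only the absolute magnitude changes.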

In [123]:
# Evaluate the performance of your final model on test set using accuracy, precision and recall.

from sklearn.metrics import precision_score, recall_score
import numpy as np

model.eval()

# Variables for accumulating performance metrics
correct = 0
total = 0
all_preds = []
all_labels = []

# Disable gradient calculations for evaluation
with torch.no_grad():
    for i in range(0, x_test_padded.shape[0], batch_size):
        # Get the batch of test data
        batch_test_x = x_test_padded[i:i + batch_size].to(device)
        batch_test_y = y_test_padded[i:i + batch_size, -1].to(device).float()

        # Forward pass
        test_outputs = model(batch_test_x).squeeze(-1)

        # Convert model output to binary predictions (threshold 0.5 for binary classification)
        predictions = (test_outputs >= 0.5).float()

        # Append predictions and labels for evaluation
        all_preds.extend(predictions.cpu().numpy())
        all_labels.extend(batch_test_y.cpu().numpy())

        # Calculate accuracy
        correct += (predictions == batch_test_y).sum().item()
        total += batch_test_y.size(0)

# Convert predictions and labels to numpy arrays for metric calculations
all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

# Calculate accuracy
accuracy = correct / total

# Calculate precision and recall
precision = precision_score(all_labels, all_preds)
recall = recall_score(all_labels, all_preds)

# Print results
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(f"Test Precision: {precision:.4f}")
print(f"Test Recall: {recall:.4f}")


Test Accuracy: 99.98%
Test Precision: 1.0000
Test Recall: 0.9333
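These three numbers can be cross-checked against each other. The sketch below recomputes them from one set of confusion counts that is consistent with the printed values and the 6003 test rows. The counts are inferred from those figures, not logged anywhere in the notebook, and `binary_metrics` is a helper written for this example.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Inferred counts: 15 positive labels, 14 found, none falsely flagged.
tp, fp, fn = 14, 0, 1
tn = 6003 - tp - fp - fn
acc, prec, rec = binary_metrics(tp, fp, fn, tn)
print(f"{acc * 100:.2f}% {prec:.4f} {rec:.4f}")  # 99.98% 1.0000 0.9333

# Baseline under the same inferred counts: always predict "not a function start".
baseline_acc, _, baseline_rec = binary_metrics(0, 0, tp + fn, tn)
print(f"{baseline_acc * 100:.2f}% {baseline_rec:.4f}")  # 99.75% 0.0000
```

Under these counts only about 15 of the 6003 evaluated labels are positive, so an all-negative predictor already reaches 99.75% accuracy. Precision and recall are therefore the more informative figures for this task.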
