Name : Nidhi Hegde

# Assignment 1

## Dataset Description:
The Drebin dataset comprises Android applications, both benign and malicious. Features are extracted from several aspects of each app:

1. AndroidManifest.xml: Extracted details include requested permissions, app components like activities, services, etc.
2. API calls: This includes specific Android API calls that the app makes.
3. Network addresses: Any URLs or IP addresses that might be hardcoded in the app.
4. Code patterns: Such as the use of reflection, native code, etc.

The details of each feature are included in drebin_features.txt.

The Drebin dataset primarily provides a binary label for each app, indicating whether it's benign or malicious. However, within the malicious apps, there can be different families of malware, each with specific characteristics and behaviors. While the main focus of the Drebin paper was on the binary classification task (malicious vs. benign), the authors did categorize the malicious samples into various malware families. These family labels can be used for multi-class classification tasks or for understanding the distribution of different types of malware in the dataset.

Some malware families that might be present in such datasets (not limited to Drebin) include:

**FakeInstaller:** Malware posing as a legitimate app installer.
**DroidKungFu:** Known for exploiting several vulnerabilities and using encryption to hide its payloads.
**Plankton:** Known for its stealthy nature and the ability to download and execute arbitrary code.
**GingerMaster:** Exploits vulnerabilities specific to the Gingerbread version of Android.
**BaseBridge:** Utilizes a privilege escalation exploit.
... and others.

The mapping between labels and malware families in our dataset is provided below:

0: FakeInstaller

1: DroidKungFu

2: Plankton

3: GingerMaster

4: BaseBridge

5: Iconosys

6: Kmin

In [2]:
import os
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
# import other modules you may need

In [3]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [4]:
# load dataset
import numpy as np
filepath = os.path.join('/content/drive/MyDrive/data/','drebin_data.npz')
data = np.load(filepath)
X, y = data['X'], data['y']

In [5]:
# split into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Design your MLP model
class MLP(nn.Module):
    def __init__(self, input_size):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 128),
            # define some middle layers
            nn.ReLU(),  # Activation function for the first hidden layer

            nn.Linear(128, 64),
            nn.ReLU(),  # Activation function for the second hidden layer

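            nn.Linear(64, 7),
            # NOTE: nn.CrossEntropyLoss already applies log-softmax to its input, so this
            # Softmax is redundant when training with that loss and keeps the loss from
            # approaching zero (the minimum is about 1.165 for 7 classes).
            nn.Softmax(dim=1)  # Output layer with softmax activation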
        )

    def forward(self, x):
        return self.layers(x)

In [7]:
# Data Preparation(may convert them into tensors)
import torch
from torch.utils.data import DataLoader, TensorDataset

# Convert NumPy arrays to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.LongTensor(y_train)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.LongTensor(y_test)


In [8]:
# Define your loss, optimizer, and other hyper-parameters
batch_size = 67  # samples per mini-batch (67 is a factor of 1340)
epochs = 10
learning_rate = 0.001

input_size = 1340  # number of features listed in drebin_features.txt
model = MLP(input_size)
criterion = nn.CrossEntropyLoss()  # loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


In [9]:
# Training
for epoch in range(epochs):
    model.train()  # Set the model to training mode
    for i in range(0, X_train_tensor.shape[0], batch_size):
        X_batch = X_train_tensor[i:i+batch_size]
        y_batch = y_train_tensor[i:i+batch_size]
        # Forward pass
        outputs = model(X_batch)

        # Compute the loss
        loss = criterion(outputs, y_batch)

        # Backpropagation
        optimizer.zero_grad()  # Clear the gradients
        loss.backward()
        optimizer.step()  # Update the model's parameters

    # Testing loss
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        test_outputs = model(X_test_tensor)
        test_loss = criterion(test_outputs, y_test_tensor)

        _, predictions = torch.max(test_outputs, 1)  # Get the class with the highest probability
        correct = (predictions == y_test_tensor).sum().item()
        accuracy = correct / y_test_tensor.size(0)

    # Note: `loss` is the loss of the last training mini-batch in this epoch, not an epoch average
    print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}, Test Loss: {test_loss.item()}, Test Acc: {accuracy}")


Epoch 1/10, Loss: 1.3373757600784302, Test Loss: 1.3306015729904175, Test Acc: 0.8602825745682888
Epoch 2/10, Loss: 1.2120193243026733, Test Loss: 1.1956560611724854, Test Acc: 0.9827315541601256
Epoch 3/10, Loss: 1.1784517765045166, Test Loss: 1.1799172163009644, Test Acc: 0.9905808477237049
Epoch 4/10, Loss: 1.1767877340316772, Test Loss: 1.1760783195495605, Test Acc: 0.9968602825745683
Epoch 5/10, Loss: 1.1718500852584839, Test Loss: 1.1735628843307495, Test Acc: 0.9968602825745683
Epoch 6/10, Loss: 1.1690107583999634, Test Loss: 1.1714403629302979, Test Acc: 0.9984301412872841
Epoch 7/10, Loss: 1.1678078174591064, Test Loss: 1.1700644493103027, Test Acc: 0.9984301412872841
Epoch 8/10, Loss: 1.1670539379119873, Test Loss: 1.1693918704986572, Test Acc: 0.9984301412872841
Epoch 9/10, Loss: 1.16667902469635, Test Loss: 1.1692970991134644, Test Acc: 0.9984301412872841
Epoch 10/10, Loss: 1.1663779020309448, Test Loss: 1.1689729690551758, Test Acc: 0.9984301412872841
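
The loss above levels off near 1.166 even though test accuracy is close to 1. One likely explanation, offered here as a sketch rather than a confirmed diagnosis: `nn.CrossEntropyLoss` applies log-softmax to whatever it receives, so when the model already ends in `nn.Softmax`, probabilities in [0, 1] are treated as logits. The smallest possible loss is then reached for a perfect one-hot probability vector. The helper name `cross_entropy_from_logits` is hypothetical and only mirrors the formula in plain Python.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy of one sample: log-sum-exp(logits) - logits[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

num_classes = 7

# Best case when the "logits" are really softmax outputs: a perfect one-hot vector.
one_hot = [1.0] + [0.0] * (num_classes - 1)
floor = cross_entropy_from_logits(one_hot, 0)

# Worst-informed case: a uniform probability vector.
uniform = cross_entropy_from_logits([1.0 / num_classes] * num_classes, 0)

print(f"{floor:.4f}")    # 1.1654
print(f"{uniform:.4f}")  # 1.9459
```

Under this reading, the logged loss can only move between roughly 1.946 and 1.165, which matches the plateau in the output. Returning raw scores from the last `nn.Linear` layer would let the loss approach zero; `argmax` predictions are unaffected by that choice.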


In [10]:
# Calculate precision, recall, and F1-score for each class.

from sklearn.metrics import classification_report

# Convert the model's predictions to a NumPy array
y_pred = predictions.numpy()
y_true = y_test_tensor.numpy()

# Calculate the classification report
class_names = ['FakeInstaller', 'DroidKungFu', 'Plankton', 'GingerMaster', 'BaseBridge', 'Iconosys', 'Kmin']
report = classification_report(y_true, y_pred, target_names=class_names)

print(report)



               precision    recall  f1-score   support

FakeInstaller       0.99      1.00      1.00       177
  DroidKungFu       1.00      0.99      1.00       136
     Plankton       1.00      1.00      1.00       120
 GingerMaster       1.00      1.00      1.00        56
   BaseBridge       1.00      1.00      1.00        76
     Iconosys       1.00      1.00      1.00        40
         Kmin       1.00      1.00      1.00        32

     accuracy                           1.00       637
    macro avg       1.00      1.00      1.00       637
 weighted avg       1.00      1.00      1.00       637
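
The report and the final test accuracy of 0.99843 fit a scenario with exactly one misclassified sample out of 637. The sketch below checks that arithmetic. The assumption that the single error is a DroidKungFu sample predicted as FakeInstaller is inferred from the rounded table; the notebook does not state it.

```python
# Per-class support copied from the classification report.
support = {
    "FakeInstaller": 177,
    "DroidKungFu": 136,
    "Plankton": 120,
    "GingerMaster": 56,
    "BaseBridge": 76,
    "Iconosys": 40,
    "Kmin": 32,
}
total = sum(support.values())

# Assumption: exactly one DroidKungFu sample was predicted as FakeInstaller.
errors = 1
accuracy = (total - errors) / total
fake_precision = support["FakeInstaller"] / (support["FakeInstaller"] + errors)
kungfu_recall = (support["DroidKungFu"] - errors) / support["DroidKungFu"]

print(total)                     # 637
print(round(accuracy, 6))        # 0.99843
print(round(fake_precision, 2))  # 0.99
print(round(kungfu_recall, 2))   # 0.99
```

With so few errors, the rounded per-class scores say little about the rarer families (Iconosys and Kmin have 40 and 32 test samples), so a confusion matrix would be a more informative companion to this table.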



# Assignment 2

## Background:
The paper "ByteWeight: Learning to Recognize Functions in Binary Code" focuses on function boundary detection in binary code. One of the key insights of the paper is that specific byte sequences or n-grams are highly indicative of function starts. Detecting function boundaries is a foundational step for various binary analysis tasks such as disassembly, decompilation, and vulnerability discovery.

## Dataset Description:
The dataset derived from the ByteWeight paper contains sequences of bytes extracted from binary files. Every position in a sequence is either a function start or a non-starting position. Each byte is treated as a token, and the goal is to recognize patterns that indicate the start of functions.

**Features:** Sequences of bytes from binary files.

**Labels:** Binary labels where '1' indicates the start of a function, and '0' indicates a non-starting position.
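
To make the feature and label layout concrete, here is a toy example with one label per byte. The byte string and the start offsets are invented for illustration and are not taken from the dataset files. Both toy functions begin with the common 32-bit x86 prologue `push ebp; mov ebp, esp` (bytes `55 89 e5`).

```python
# Two tiny hand-written functions, concatenated:
#   55 89 e5 5d c3        push ebp; mov ebp, esp; pop ebp; ret
#   55 89 e5 31 c0 5d c3  push ebp; mov ebp, esp; xor eax, eax; pop ebp; ret
code = bytes.fromhex("5589e55dc3" "5589e531c05dc3")

# Hypothetical ground truth: offsets at which a function begins.
function_starts = {0, 5}

x = list(code)  # one token per byte, each in the range 0..255
y = [1 if i in function_starts else 0 for i in range(len(code))]

print(x)  # [85, 137, 229, 93, 195, 85, 137, 229, 49, 192, 93, 195]
print(y)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Even in this toy stream only 2 of 12 positions are labeled 1, which hints at how imbalanced the labels are likely to be in real binaries.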



In [56]:
from torch.nn.utils.rnn import pad_sequence
import pickle
# import other modules you may need


In [57]:
# load dataset
train_file = os.path.join('/content/drive/MyDrive/data/','elf_x86_32_gcc_O1_train.pkl')
test_file = os.path.join('/content/drive/MyDrive/data/','elf_x86_32_gcc_O1_test.pkl')

# Load training data
with open(train_file, 'rb') as file:
    train_data = pickle.load(file)

# Load test data
with open(test_file, 'rb') as file:
    test_data = pickle.load(file)

# Split the loaded data into input sequences (X) and labels (y)
x_train, y_train = train_data
x_test, y_test = test_data

# Pre-process for RNN training: truncate every sequence to at most
# max_sequence_length tokens, then zero-pad shorter ones to a common length.
max_sequence_length = 200

truncated_x_train = [torch.LongTensor(sequence[:max_sequence_length]) for sequence in x_train]
truncated_y_train = [torch.LongTensor(sequence[:max_sequence_length]) for sequence in y_train]
truncated_x_test = [torch.LongTensor(sequence[:max_sequence_length]) for sequence in x_test]
truncated_y_test = [torch.LongTensor(sequence[:max_sequence_length]) for sequence in y_test]

# Pad to a rectangular tensor. Padding value 0 is also a valid byte value and
# the "non-start" label, so padded positions look like ordinary negatives.
x_train_tensor = pad_sequence(truncated_x_train, batch_first=True, padding_value=0)
y_train_tensor = pad_sequence(truncated_y_train, batch_first=True, padding_value=0)
x_test_tensor = pad_sequence(truncated_x_test, batch_first=True, padding_value=0)
y_test_tensor = pad_sequence(truncated_y_test, batch_first=True, padding_value=0)

print("x_train shape:", x_train_tensor.shape)
print("y_train shape:", y_train_tensor.shape)
print("x_test shape:", x_test_tensor.shape)
print("y_test shape:", y_test_tensor.shape)


x_train shape: torch.Size([14006, 200])
y_train shape: torch.Size([14006, 200])
x_test shape: torch.Size([6003, 200])
y_test shape: torch.Size([6003, 200])
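
The truncate-then-pad step can be pictured with plain Python lists. This is a minimal sketch with a hypothetical helper name, `truncate_and_pad`. It always pads to `max_len`, whereas `pad_sequence` pads to the longest sequence in the list. The two agree here, since the printed width of 200 implies at least one sequence reaches that length.

```python
def truncate_and_pad(sequences, max_len, pad_value=0):
    """Clip each sequence to max_len items, then right-pad with pad_value."""
    batch = []
    for seq in sequences:
        clipped = list(seq[:max_len])
        batch.append(clipped + [pad_value] * (max_len - len(clipped)))
    return batch

# Toy input: one short sequence and one that is longer than max_len.
batch = truncate_and_pad([[85, 137, 229], [1, 2, 3, 4, 5, 6]], max_len=4)
print(batch)  # [[85, 137, 229, 0], [1, 2, 3, 4]]
```

Because the padding value is 0, padded positions are indistinguishable from a real `0x00` byte labeled as a non-start. Any metric computed over all positions therefore includes padding as correctly predicted negatives unless a mask is applied.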


In [58]:
# Design your RNN model
class RNNModel(nn.Module):
    def __init__(self, seq_len, embedding_dim, hidden_dim, num_layers, num_classes, vocab_size):
        super(RNNModel, self).__init__()
        # seq_len is accepted for interface compatibility but is not needed by the layers below
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # Embedding layer: (batch, seq_len) -> (batch, seq_len, embedding_dim)
        embedded = self.embedding(x)

        # Vanilla RNN layer: one hidden state per time step
        out, _ = self.rnn(embedded)
        # Per-position class scores: (batch, seq_len, num_classes)
        out = self.fc(out)

        return out

In [59]:
# Define your loss, optimizer, and other hyperparameters
batch_size = 124
epochs = 10
learning_rate = 0.001
input_size = 256  # vocabulary size: one embedding per possible byte value (0-255)
hidden_dim = 47
embedding_dim = 128
num_layers = 1
num_classes = 2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize your model
seq_len = max_sequence_length
model = RNNModel(seq_len, embedding_dim, hidden_dim, num_layers,num_classes, input_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


In [60]:
# Training
for epoch in range(epochs):
    model.train()  # Set the model in training mode
    total_loss = 0.0
    correct_total = 0
    total = 0
    for i in range(0, x_train_tensor.shape[0], batch_size):
        batch_x = x_train_tensor[i:i+batch_size].to(device)
        batch_y = y_train_tensor[i:i+batch_size].to(device).view(-1)
        optimizer.zero_grad()  # Clear gradients
        outputs = model(batch_x).view(-1,2)
        # Compute the loss
        loss = criterion(outputs, batch_y)
        # Backpropagate the loss and update the model's parameters
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        correct = (outputs.argmax(dim=1) == batch_y).sum().item()
        correct_total += correct
        total += batch_y.size(0)

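    # Evaluate on the test set.
    # NOTE: correct_total and total are not reset here, so the "Test Acc" printed
    # below covers training and test positions combined (padding included).
    model.eval()
    test_loss = 0.0
    with torch.no_grad():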
        for i in range(0, x_test_tensor.shape[0], batch_size):
            test_batch_x = x_test_tensor[i:i+batch_size].to(device)
            test_batch_y = y_test_tensor[i:i+batch_size].to(device).view(-1)
            test_outputs = model(test_batch_x).view(-1,2)
            # Compute the test loss
            test_loss += criterion(test_outputs, test_batch_y).item()
            predicted_labels = torch.argmax(test_outputs, dim=1)
            correct = (predicted_labels == test_batch_y).sum().item()
            correct_total += correct
            total += test_batch_y.size(0)

    accuracy = 100 * correct_total / total
    # Loss and Test Loss are sums of per-batch mean losses, so they scale with the number of batches
    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss}, Test Loss: {test_loss}, Test Acc: {accuracy}%")
# Save the trained model (if needed)
#torch.save(model.state_dict(), 'model_file.pth')

Epoch 1/10, Loss: 9.854401923716068, Test Loss: 0.42064105393365026, Test Acc: 98.04827827477635%
Epoch 2/10, Loss: 0.6157238709274679, Test Loss: 0.18306075234431773, Test Acc: 99.89242340946574%
Epoch 3/10, Loss: 0.3438895173603669, Test Loss: 0.12440238625276834, Test Acc: 99.93635364086161%
Epoch 4/10, Loss: 0.24791826965520158, Test Loss: 0.0948088314034976, Test Acc: 99.95579489229847%
Epoch 5/10, Loss: 0.1945550343953073, Test Loss: 0.07700216543162242, Test Acc: 99.9657654055675%
Epoch 6/10, Loss: 0.16050516767427325, Test Loss: 0.06464651474379934, Test Acc: 99.9723624369034%
Epoch 7/10, Loss: 0.13744428643258289, Test Loss: 0.056210395821835846, Test Acc: 99.97528612124545%
Epoch 8/10, Loss: 0.12068702638498507, Test Loss: 0.050151386531069875, Test Acc: 99.97683542405917%
Epoch 9/10, Loss: 0.10878608193888795, Test Loss: 0.04574515858257655, Test Acc: 99.97923434454495%
Epoch 10/10, Loss: 0.1001432819175534, Test Loss: 0.042297812542528845, Test Acc: 99.98088360237892%


In [61]:
# Evaluate the performance of your final model on test set using accuracy, precision and recall.
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Evaluate the performance on the test set
model.eval()  # Set the model in evaluation mode

# Lists to store true and predicted labels
true_labels = []
predicted_labels = []

with torch.no_grad():
    for i in range(0, x_test_tensor.shape[0], batch_size):
        test_batch_x = x_test_tensor[i:i+batch_size].to(device)
        test_batch_y = y_test_tensor[i:i+batch_size].to(device).view(-1)
        test_outputs = model(test_batch_x).view(-1, 2)

        # Calculate accuracy
        predicted = torch.argmax(test_outputs, dim=1)
        true_labels.extend(test_batch_y.cpu().numpy())
        predicted_labels.extend(predicted.cpu().numpy())

# Calculate accuracy, precision, and recall
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)

print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")

Accuracy: 99.98%
Precision: 95.91%
Recall: 94.19%
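
The gap between 99.98% accuracy and roughly 95% precision and recall is what one would expect when positives are rare: nearly every position is a non-start, so accuracy is dominated by easy negatives. The sketch below illustrates the effect with invented counts (30 starts among 10,000 positions). These numbers are for illustration only and are not taken from the test set, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels, positive class = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Invented example: 30 true starts, of which 28 are found, plus 1 false alarm.
y_true = [1] * 30 + [0] * 9970
y_pred = [1] * 28 + [0] * 2 + [1] * 1 + [0] * 9969

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"{accuracy:.4f} {precision:.4f} {recall:.4f}")  # 0.9997 0.9655 0.9333
```

For this task, precision and recall on the positive class are the more informative figures. Excluding padded positions before scoring would also keep padding from inflating the accuracy.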
