# NATICUSdroid (Android Permissions) Lab

### Radhika Agrawal & Jack Coyle

Project Details: NATICUSdroid is a malware detection framework for Android devices. It analyzes how well different permissions distinguish benign apps from malware. As in Lab 3, the study applies several machine learning techniques to test how well these permissions predict which apps are malware. The study found Random Forest to be the best-performing method.

Workflow: 

1. Data Collection and Preprocessing 

- Importing data from UCI Repository and Exploring

- Converting all binary permission variables to categorical

2. Data Exploration

- Correlation Matrix

- Correlation Heatmap

- Identification of Highly Correlated Variables

3. Model Development

- 6 ML models with no normalization & their performance

- 6 ML Models with zscore normalization & their performance 

4. Reflection

- Discrepancies between project and paper

- Potential Project Extensions

# Setup

### Importing necessary libraries

In [1]:
import pandas as pd
!pip install pycaret
from pycaret.classification import *  # imports all the functions/methods from the pycaret classification module


### Bringing in data from UCI Repository

In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
naticusdroid_android_permissions = fetch_ucirepo(id=722) 
  
# data (as pandas dataframes) 
X = naticusdroid_android_permissions.data.features 
y = naticusdroid_android_permissions.data.targets 
  
# metadata 
print(naticusdroid_android_permissions.metadata) 
  
# variable information 
print(naticusdroid_android_permissions.variables) 


{'uci_id': 722, 'name': 'NATICUSdroid (Android Permissions)', 'repository_url': 'https://archive.ics.uci.edu/dataset/722/naticusdroid+android+permissions+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/722/data.csv', 'abstract': 'Contains permissions extracted from more than 29000 benign & malware Android apps released between 2010-2019.', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 29333, 'num_features': 86, 'feature_types': [], 'demographics': [], 'target_col': ['Result'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2021, 'last_updated': 'Tue Apr 09 2024', 'dataset_doi': '10.24432/C5FS64', 'creators': ['Akshay Mathur'], 'intro_paper': {'title': 'NATICUSdroid: A malware detection framework for Android using native and custom permissions', 'authors': 'A. Mathur, Laxmi M. Podila, Keyur Kulkarni, Quamar Niyaz, A. Javaid', 'published_in': 'J. Inf. Se

### List of variables (permissions and response)

In [3]:
naticusdroid_android_permissions.variables

|  | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | android.permission.GET_ACCOUNTS | Feature | Integer |  |  |  | no |
| 1 | com.sonyericsson.home.permission.BROADCAST_BADGE | Feature | Integer |  |  |  | no |
| 2 | android.permission.READ_PROFILE | Feature | Integer |  |  |  | no |
| 3 | android.permission.MANAGE_ACCOUNTS | Feature | Integer |  |  |  | no |
| 4 | android.permission.WRITE_SYNC_SETTINGS | Feature | Integer |  |  |  | no |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 82 | com.google.android.finsky.permission.BIND_GET_... | Feature | Integer |  |  |  | no |
| 83 | com.huawei.android.launcher.permission.READ_SE... | Feature | Integer |  |  |  | no |
| 84 | android.permission.READ_SMS | Feature | Integer |  |  |  | no |
| 85 | android.permission.PROCESS_INCOMING_CALLS | Feature | Integer |  |  |  | no |


### Convert permissions variables to Categorical

In [4]:
features = naticusdroid_android_permissions.variables[0:86]  # first 86 rows: the permission variables, without the response
features = features['name']  # keep only the name column of the variables df
for var in features:  # cast each 0/1 permission column to categorical
    X[var] = X[var].astype('category')
X.dtypes

android.permission.GET_ACCOUNTS                                           category
com.sonyericsson.home.permission.BROADCAST_BADGE                          category
android.permission.READ_PROFILE                                           category
android.permission.MANAGE_ACCOUNTS                                        category
android.permission.WRITE_SYNC_SETTINGS                                    category
                                                                            ...   
android.permission.ACCESS_NETWORK_STATE                                   category
com.google.android.finsky.permission.BIND_GET_INSTALL_REFERRER_SERVICE    category
com.huawei.android.launcher.permission.READ_SETTINGS                      category
android.permission.READ_SMS                                               category
android.permission.PROCESS_INCOMING_CALLS                                 category
Length: 86, dtype: object
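
As a side note, the same conversion can be done without an explicit loop. The sketch below is a minimal illustration on a toy frame; the column names `perm_a` and `perm_b` are made up and stand in for the real permission columns.

```python
import pandas as pd

# Toy stand-in for the permission matrix; column names are hypothetical.
toy = pd.DataFrame({
    "perm_a": [0, 1, 0, 1],
    "perm_b": [1, 1, 0, 0],
})

# Cast every column at once instead of looping over column names.
toy_cat = toy.astype("category")

print(toy_cat.dtypes)
```

Casting the whole frame returns a new DataFrame and leaves the original untouched, which keeps the integer version available for numeric work such as correlations.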

In [5]:
df = pd.concat([X,y], axis=1)
df.head()

Unnamed: 0,android.permission.GET_ACCOUNTS,com.sonyericsson.home.permission.BROADCAST_BADGE,android.permission.READ_PROFILE,android.permission.MANAGE_ACCOUNTS,android.permission.WRITE_SYNC_SETTINGS,android.permission.READ_EXTERNAL_STORAGE,android.permission.RECEIVE_SMS,com.android.launcher.permission.READ_SETTINGS,android.permission.WRITE_SETTINGS,com.google.android.providers.gsf.permission.READ_GSERVICES,...,com.android.launcher.permission.UNINSTALL_SHORTCUT,com.sec.android.iap.permission.BILLING,com.htc.launcher.permission.UPDATE_SHORTCUT,com.sec.android.provider.badge.permission.WRITE,android.permission.ACCESS_NETWORK_STATE,com.google.android.finsky.permission.BIND_GET_INSTALL_REFERRER_SERVICE,com.huawei.android.launcher.permission.READ_SETTINGS,android.permission.READ_SMS,android.permission.PROCESS_INCOMING_CALLS,Result
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0


# Exploration of Features

## Correlation Between Features

### Correlation Matrix for our 86 permissions

In [6]:
correlation_matrix = X.corr()  # each column of X is a 0/1 indicator of whether an app (row) declares that permission
correlation_matrix

Unnamed: 0,android.permission.GET_ACCOUNTS,com.sonyericsson.home.permission.BROADCAST_BADGE,android.permission.READ_PROFILE,android.permission.MANAGE_ACCOUNTS,android.permission.WRITE_SYNC_SETTINGS,android.permission.READ_EXTERNAL_STORAGE,android.permission.RECEIVE_SMS,com.android.launcher.permission.READ_SETTINGS,android.permission.WRITE_SETTINGS,com.google.android.providers.gsf.permission.READ_GSERVICES,...,android.permission.CLEAR_APP_CACHE,com.android.launcher.permission.UNINSTALL_SHORTCUT,com.sec.android.iap.permission.BILLING,com.htc.launcher.permission.UPDATE_SHORTCUT,com.sec.android.provider.badge.permission.WRITE,android.permission.ACCESS_NETWORK_STATE,com.google.android.finsky.permission.BIND_GET_INSTALL_REFERRER_SERVICE,com.huawei.android.launcher.permission.READ_SETTINGS,android.permission.READ_SMS,android.permission.PROCESS_INCOMING_CALLS
android.permission.GET_ACCOUNTS,1.000000,-0.013964,-0.079951,0.196896,0.096710,0.034518,-0.012785,0.076991,-0.047679,0.105217,...,0.046158,0.031399,-0.014106,-0.016655,-0.013611,0.115601,-0.016237,-0.031852,-0.011002,0.110615
com.sonyericsson.home.permission.BROADCAST_BADGE,-0.013964,1.000000,-0.025185,0.024802,0.047568,0.136400,-0.022378,0.045782,0.004565,0.145745,...,0.012826,0.011259,0.021368,0.985484,0.969620,0.038538,0.078998,0.793052,-0.027908,-0.011397
android.permission.READ_PROFILE,-0.079951,-0.025185,1.000000,-0.000435,0.024207,-0.073922,-0.034185,-0.019334,0.574034,-0.017427,...,0.007443,-0.026817,-0.013699,-0.024483,-0.025214,0.036007,-0.023485,-0.028547,-0.034294,-0.013270
android.permission.MANAGE_ACCOUNTS,0.196896,0.024802,-0.000435,1.000000,0.309930,0.081554,0.154866,0.115690,0.150767,0.051156,...,0.102934,0.071417,-0.007335,0.020717,0.026386,0.025876,0.062909,0.032461,0.149333,0.490849
android.permission.WRITE_SYNC_SETTINGS,0.096710,0.047568,0.024207,0.309930,1.000000,0.088372,0.045208,0.113085,0.085063,0.040983,...,0.184793,0.087587,-0.004755,0.048473,0.045056,0.013545,0.085796,0.035908,0.029004,-0.004606
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
android.permission.ACCESS_NETWORK_STATE,0.115601,0.038538,0.036007,0.025876,0.013545,0.048371,0.001831,0.027728,0.068965,0.048806,...,0.013129,0.039603,0.009493,0.037935,0.038563,1.000000,0.035111,0.031760,0.016322,0.014022
com.google.android.finsky.permission.BIND_GET_INSTALL_REFERRER_SERVICE,-0.016237,0.078998,-0.023485,0.062909,0.085796,0.105434,-0.026215,0.062998,-0.000797,0.054276,...,0.073901,0.025557,-0.006138,0.081880,0.086315,0.035111,1.000000,0.103338,-0.027305,-0.009433
com.huawei.android.launcher.permission.READ_SETTINGS,-0.031852,0.793052,-0.028547,0.032461,0.035908,0.144295,-0.024881,0.067539,0.033771,0.133456,...,0.027031,0.019484,0.012087,0.801804,0.795122,0.031760,0.103338,1.000000,-0.025953,-0.009364
android.permission.READ_SMS,-0.011002,-0.027908,-0.034294,0.149333,0.029004,-0.013656,0.829839,0.038614,0.169655,-0.016827,...,0.037722,0.023598,-0.014587,-0.030598,-0.030476,0.016322,-0.027305,-0.025953,1.000000,0.251436


The correlation matrix alone reveals little at a glance because of the sheer number of permissions. A heatmap, together with a list of the most highly correlated permission pairs, makes the structure easier to see.
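
Because every permission is a 0/1 indicator, the Pearson correlation between two of them is the same quantity as the phi coefficient of their 2×2 contingency table. The sketch below checks that equivalence on two made-up indicator lists; `pearson` and `phi` are helper names introduced only for this illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def phi(xs, ys):
    """Phi coefficient computed from the 2x2 contingency counts."""
    n11 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    num = n11 * n00 - n10 * n01
    den = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return num / den

# Hypothetical declarations of two permissions across eight apps.
a = [1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 0, 0, 0, 1, 0, 1, 1]

print(pearson(a, b), phi(a, b))
```

Reading the matrix entries as phi coefficients gives them a concrete meaning: a value near 1 says two permissions are almost always declared together or omitted together.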

### Heatmap

In [7]:
import plotly.figure_factory as ff

# Create a heatmap with Plotly
fig = ff.create_annotated_heatmap(
    z=correlation_matrix.to_numpy(),
    x=correlation_matrix.columns.tolist(),  # Use column names here for correct alignment
    y=correlation_matrix.index.tolist(),  # Use index names here for correct alignment
    colorscale='Viridis',
    annotation_text=correlation_matrix.round(2).to_numpy(),
    showscale=True
)

# Removing axis labels for easier viewing
fig.update_xaxes(tickvals=[])
fig.update_yaxes(tickvals=[])

fig.update_layout(width=800, height=800, title='Correlation Heatmap')
fig.show()

In the heatmap above, the x and y axes are the binary permission variables, and z is the correlation between each pair.

### Highly Correlated Variables from the Matrix

Here, we want to identify specific pairs of permissions that are highly correlated in our dataset. To do this, we set a correlation threshold and create a dataframe of the pairs whose correlation exceeds that threshold. A threshold of 0.9 gives us 26 pairs, which is far more manageable than exploring all 3,655 unique pairs of permissions (86 × 85 / 2).

In [8]:
# Threshold for considering correlation
threshold = 0.9  # You can adjust this threshold as per your requirement

# Find variable pairs with correlation above the threshold
correlated_pairs = pd.DataFrame(columns=['x', 'y','z'])

for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            new_data_df = pd.DataFrame({'x': [correlation_matrix.columns[i]], 'y': [correlation_matrix.columns[j]], 'z': [correlation_matrix.iloc[i, j]] })
            correlated_pairs = pd.concat([correlated_pairs, new_data_df], ignore_index=True)

correlated_pairs


|  | x | y | z |
|---|---|---|---|
| 0 | com.sonyericsson.home.permission.BROADCAST_BADGE | com.majeur.launcher.permission.UPDATE_BADGE | 0.970781 |
| 1 | com.sonyericsson.home.permission.BROADCAST_BADGE | com.anddoes.launcher.permission.UPDATE_COUNT | 0.977106 |
| 2 | com.sonyericsson.home.permission.BROADCAST_BADGE | com.sec.android.provider.badge.permission.READ | 0.970622 |
| 3 | com.sonyericsson.home.permission.BROADCAST_BADGE | com.htc.launcher.permission.UPDATE_SHORTCUT | 0.985484 |
| 4 | com.sonyericsson.home.permission.BROADCAST_BADGE | com.sec.android.provider.badge.permission.WRITE | 0.96962 |
| 5 | com.huawei.android.launcher.permission.CHANGE_... | com.sonymobile.home.permission.PROVIDER_INSERT... | 0.980177 |
| 6 | com.huawei.android.launcher.permission.CHANGE_... | com.huawei.android.launcher.permission.WRITE_S... | 0.98026 |
| 7 | com.huawei.android.launcher.permission.CHANGE_... | com.huawei.android.launcher.permission.READ_SE... | 0.978202 |
| 8 | com.oppo.launcher.permission.READ_SETTINGS | android.permission.READ_APP_BADGE | 0.97617 |
| 9 | com.oppo.launcher.permission.READ_SETTINGS | com.oppo.launcher.permission.WRITE_SETTINGS | 0.993086 |


We identified 26 permission pairs with a correlation above 0.9. Only 2 of those pairs involve permissions that we take to be native to Android, since their names start with "android". This could suggest that custom permissions are declared together more consistently than native permissions are, across both benign and malware apps.
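
The pair search can also be written without nested loops. The sketch below is one possible vectorized version, shown on a toy frame rather than the real data; `high_corr_pairs` and the columns `a`, `b`, `c` are names invented for this example. It also confirms the pair count: 86 permissions give 3,655 unique unordered pairs, while the full 86 × 86 matrix has 7,396 cells.

```python
from math import comb

import numpy as np
import pandas as pd

def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return each column pair (i < j) whose absolute correlation exceeds threshold."""
    # Strict upper triangle: every unordered pair once, diagonal excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["x", "y", "z"]
    # The comparison is False for NaN, so any masked cells that survive are dropped here.
    return pairs[pairs["z"].abs() > threshold].reset_index(drop=True)

# Toy data: 'b' duplicates 'a', while 'c' is only weakly related to both.
toy = pd.DataFrame({
    "a": [0, 1, 0, 1, 1, 0],
    "b": [0, 1, 0, 1, 1, 0],
    "c": [1, 1, 0, 0, 1, 0],
})

result = high_corr_pairs(toy.corr(), threshold=0.9)
print(result)
print(comb(86, 2))  # number of unique permission pairs among 86 permissions
```

Masking the upper triangle plays the same role as the `range(i+1, ...)` inner loop: each pair is reported once and self-correlations are skipped.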

# Machine Learning Model Analysis

### Set up dictionary

In [9]:
# Define the setup configuration as a dictionary
s = {
    'data': df,
    'target': 'Result',
    'session_id': 2024,
    'train_size': 0.7,
    'normalize': False,
    'use_gpu': True
}


# Call the setup function with the dictionary unpacked and additional arguments
clf = setup(
    **s
)

[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1
[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1


### Creating optimal models for 6 ML techniques done in the study

In [10]:
best = compare_models(include = ('knn', 'svm', 'lr', 'rf', 'et', 'ada'), n_select = 6, sort = "F1")



In [11]:
results = pull()
results

|  | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Classifier | 0.9699 | 0.9933 | 0.9669 | 0.9729 | 0.9698 | 0.9397 | 0.9398 | 2.856 |
| et | Extra Trees Classifier | 0.9699 | 0.9925 | 0.9643 | 0.9753 | 0.9697 | 0.9397 | 0.9398 | 3.088 |
| knn | K Neighbors Classifier | 0.9619 | 0.9837 | 0.9694 | 0.9553 | 0.9623 | 0.9238 | 0.924 | 2.292 |
| lr | Logistic Regression | 0.9589 | 0.9889 | 0.9641 | 0.9544 | 0.9592 | 0.9179 | 0.918 | 2.074 |
| ada | Ada Boost Classifier | 0.9573 | 0.9884 | 0.9651 | 0.9505 | 0.9577 | 0.9146 | 0.9147 | 2.367 |
| svm | SVM - Linear Kernel | 0.9568 | 0.9879 | 0.9652 | 0.9495 | 0.9573 | 0.9136 | 0.9138 | 1.828 |
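
Since the models are sorted by F1, it helps to recall how F1 relates to the precision and recall columns: it is their harmonic mean. The sketch below recomputes it from the Random Forest row; `f1_from` is a helper name introduced here. The scores in the table are averages over cross-validation folds, so the recomputed value is expected to be close to the reported 0.9698 rather than identical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the Random Forest row above.
rf_f1 = f1_from(0.9729, 0.9669)
print(round(rf_f1, 4))
```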


## Does Normalization impact our results?

### New Setup Dictionary with zscore Normalization

In [12]:
# Define the setup configuration as a dictionary
s_norm = {
    'data': df,
    'target': 'Result',
    'session_id': 2024,
    'train_size': 0.7,
    'normalize': True,  # setup() takes a boolean here; the method is chosen separately
    'normalize_method': 'zscore',
    'use_gpu': True
}


# Call the setup function with the dictionary unpacked and additional arguments
clf_norm = setup(
    **s_norm
)

[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Number of positive: 1, number of negative: 1
[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1
[LightGBM] [Fatal] CUDA Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_CUDA=1


In [13]:
best_norm = compare_models(include = ('knn', 'svm', 'lr', 'rf', 'et', 'ada'), n_select = 6, sort = "F1")



In [14]:
results_norm = pull()
results_norm

|  | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Classifier | 0.9699 | 0.9933 | 0.9669 | 0.9729 | 0.9698 | 0.9397 | 0.9398 | 2.778 |
| et | Extra Trees Classifier | 0.9699 | 0.9925 | 0.9643 | 0.9753 | 0.9697 | 0.9397 | 0.9398 | 2.996 |
| lr | Logistic Regression | 0.9593 | 0.989 | 0.9643 | 0.955 | 0.9596 | 0.9187 | 0.9187 | 2.089 |
| ada | Ada Boost Classifier | 0.9573 | 0.9884 | 0.9651 | 0.9505 | 0.9577 | 0.9146 | 0.9147 | 2.335 |
| knn | K Neighbors Classifier | 0.9573 | 0.9816 | 0.9565 | 0.9584 | 0.9574 | 0.9147 | 0.9147 | 2.267 |
| svm | SVM - Linear Kernel | 0.957 | 0.9868 | 0.9592 | 0.9552 | 0.9572 | 0.914 | 0.914 | 1.967 |
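
In the two result tables, the Random Forest, Extra Trees, and AdaBoost scores are unchanged by normalization, while KNN, Logistic Regression, and SVM shift slightly. One plausible reading, which is our assumption rather than something the notebook verifies, is that tree-based models split on thresholds, and z-scoring a 0/1 column only relabels its two values without changing which apps fall on each side of a split. The sketch below illustrates that on a made-up permission column; `zscore` is a helper written for this example.

```python
from statistics import mean, pstdev

def zscore(column):
    """Standardize a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

# Hypothetical 0/1 declarations of one permission across eight apps.
perm = [0, 1, 1, 0, 1, 0, 0, 0]
scaled = zscore(perm)

# A threshold split on the raw column ...
raw_left = [v <= 0.5 for v in perm]
# ... and a threshold split at the midpoint of the two scaled values.
midpoint = (min(scaled) + max(scaled)) / 2
scaled_left = [v <= midpoint for v in scaled]

print(sorted(set(scaled)))      # still only two distinct values
print(raw_left == scaled_left)  # same partition of the apps
```

Distance-based and linear models, by contrast, depend on the actual magnitudes of the features, which would be consistent with the small changes seen for KNN, Logistic Regression, and SVM.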


# Reflection

### Why are our results somewhat different from those reported in the paper?

Our results ended up very similar to those in the paper. Like the paper, we found Random Forest to be the most effective technique, with an accuracy and F1 score of about 0.97 each. For the other ML techniques, however, our statistics differed slightly. One possible cause is data preparation: we converted each permission variable to categorical form, even though the dataset provides them as integers. Another is feature selection: the paper is somewhat ambiguous about which permissions were included in each model, so the features in the dataset we were given may not match those used in the original experiment.

### Potential Expansion on Project

On that note, the paper includes a similar experiment that uses only Android-native permissions. While we don't have access to the specific permissions used in that experiment, reproducing it would be a natural next step for this project. It would be interesting to explore the feature selection behind the native-only experiment and to measure how well native permissions alone predict malware.
