# Predicting Malware

## Reference:

Mathur, Akshay. (2022). [NATICUSdroid (Android Permissions) Dataset](https://archive-beta.ics.uci.edu/dataset/722/naticusdroid+android+permissions+dataset). UCI Machine Learning Repository.

In [2]:
# Import the required modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

## Prepare the Data

In [3]:
# Read the app-data.csv file into a pandas DataFrame.
file_path = "https://static.bc-edx.com/mbc/ai/m4/datasets/app-data.csv"
app_data = pd.read_csv(file_path)

# Review the DataFrame
app_data.head()

Unnamed: 0,android.permission.GET_ACCOUNTS,com.sonyericsson.home.permission.BROADCAST_BADGE,android.permission.READ_PROFILE,android.permission.MANAGE_ACCOUNTS,android.permission.WRITE_SYNC_SETTINGS,android.permission.READ_EXTERNAL_STORAGE,android.permission.RECEIVE_SMS,com.android.launcher.permission.READ_SETTINGS,android.permission.WRITE_SETTINGS,com.google.android.providers.gsf.permission.READ_GSERVICES,...,com.android.launcher.permission.UNINSTALL_SHORTCUT,com.sec.android.iap.permission.BILLING,com.htc.launcher.permission.UPDATE_SHORTCUT,com.sec.android.provider.badge.permission.WRITE,android.permission.ACCESS_NETWORK_STATE,com.google.android.finsky.permission.BIND_GET_INSTALL_REFERRER_SERVICE,com.huawei.android.launcher.permission.READ_SETTINGS,android.permission.READ_SMS,android.permission.PROCESS_INCOMING_CALLS,Result
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0


In [5]:
app_data.columns

Index(['android.permission.GET_ACCOUNTS',
       'com.sonyericsson.home.permission.BROADCAST_BADGE',
       'android.permission.READ_PROFILE', 'android.permission.MANAGE_ACCOUNTS',
       'android.permission.WRITE_SYNC_SETTINGS',
       'android.permission.READ_EXTERNAL_STORAGE',
       'android.permission.RECEIVE_SMS',
       'com.android.launcher.permission.READ_SETTINGS',
       'android.permission.WRITE_SETTINGS',
       'com.google.android.providers.gsf.permission.READ_GSERVICES',
       'android.permission.DOWNLOAD_WITHOUT_NOTIFICATION',
       'android.permission.GET_TASKS',
       'android.permission.WRITE_EXTERNAL_STORAGE',
       'android.permission.RECORD_AUDIO',
       'com.huawei.android.launcher.permission.CHANGE_BADGE',
       'com.oppo.launcher.permission.READ_SETTINGS',
       'android.permission.CHANGE_NETWORK_STATE',
       'com.android.launcher.permission.INSTALL_SHORTCUT',
       'android.permission.android.permission.READ_PHONE_STATE',
       'android.permission.C

In [6]:
# The 'Result' column is the target to predict.
# Class 0 indicates a benign app; class 1 indicates a malware app.
# Use value_counts to see how many malware apps are in this dataset.
app_data['Result'].value_counts()

1    14700
0    14632
Name: Result, dtype: int64
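The counts above show two classes of almost equal size. As a minimal sketch of how to read that balance as proportions, the snippet below uses a small made-up `Result` series (the values are illustrative, not taken from the dataset) and calls `value_counts` with `normalize=True`.

```python
import pandas as pd

# Hypothetical stand-in for app_data['Result']; values are illustrative only.
result = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="Result")

counts = result.value_counts()                # absolute number of rows per class
shares = result.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

On the real data the same two calls on `app_data['Result']` would give the class shares directly, which is useful context when interpreting accuracy later.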

## Split the data into training and testing sets

In [7]:
# The target column `y` should be the binary `Result` column.
y = app_data['Result']

# The `X` should be all of the features. 
X = app_data.copy()
X = X.drop(columns='Result')

In [8]:
# Split the dataset using the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y)
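Called without extra arguments, `train_test_split` shuffles the rows with a fresh random seed and holds out 25% of them for testing, so the scores change slightly from run to run. The sketch below shows one way to make the split repeatable and class-balanced. It runs on synthetic data, and the names `X_demo`, `y_demo`, and the seed value are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the permission matrix: 200 apps, 5 binary features.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.integers(0, 2, size=(200, 5)),
    columns=[f"perm_{i}" for i in range(5)],
)
y_demo = pd.Series([0, 1] * 100, name="Result")  # perfectly balanced labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,    # same fraction as the default
    stratify=y_demo,   # keep the class ratio identical in both subsets
    random_state=1,    # arbitrary seed so the split is repeatable
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```

With nearly balanced classes, as in this dataset, stratification changes little, but fixing `random_state` makes the reported scores reproducible.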

## Model and Fit the Data to a Logistic Regression

In [9]:
# Declare a logistic regression model.
# Apply a random_state of 7 and max_iter of 120 to the model
logistic_regression_model = LogisticRegression(max_iter=120, random_state=7)

In [10]:
# Fit the logistic regression model to the training data
logistic_regression_model.fit(X_train, y_train)
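The `max_iter=120` setting caps the number of solver iterations, so it can be worth checking whether the fit stopped because it converged or because it hit the cap. A minimal sketch, assuming synthetic data (`X_demo` and `y_demo` are hypothetical): it reads the fitted model's `n_iter_` attribute and looks for scikit-learn's `ConvergenceWarning`.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic binary features; the label depends on the first two columns.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(300, 6))
y_demo = X_demo[:, 0] | X_demo[:, 1]

model = LogisticRegression(max_iter=120, random_state=7)

# Record warnings raised during fitting instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X_demo, y_demo)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)

print("iterations used:", model.n_iter_)
print("stopped at max_iter without converging:", hit_limit)
```

If the warning does appear on the real data, raising `max_iter` or scaling the features are the usual remedies.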

In [11]:
# Validate the model
print(f"Training Data Score: {logistic_regression_model.score(X_train, y_train)}")
print(f"Testing Data Score: {logistic_regression_model.score(X_test, y_test)}")

Training Data Score: 0.9594981590072276
Testing Data Score: 0.9609982271921451


## Predict the Testing Labels

In [12]:
# Generate predictions for the test data with the fitted model and store them
predictions = logistic_regression_model.predict(X_test)

# Review the predictions
predictions

array([1, 1, 1, ..., 1, 1, 1])

## Calculate the Performance Metrics

In [14]:
# accuracy_score was already imported in the first cell, so no second import is needed.

# Calculate the model's accuracy on the test dataset
accuracy_score(y_test, predictions)


0.9609982271921451
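This value is identical to the testing score printed earlier because, for a classifier such as `LogisticRegression`, `score` returns the mean accuracy of its predictions. The sketch below checks that equivalence on synthetic data. `X_demo`, `y_demo`, and `clf` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary features; the label simply copies the first column.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 4))
y_demo = X_demo[:, 0]

clf = LogisticRegression().fit(X_demo, y_demo)
preds = clf.predict(X_demo)

manual = float(np.mean(preds == y_demo))    # fraction of correct predictions
via_metric = accuracy_score(y_demo, preds)  # same quantity from sklearn.metrics
via_score = clf.score(X_demo, y_demo)       # same quantity from the estimator

print(manual, via_metric, via_score)
```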

**Question:** For this dataset, how well did the model predict actual malware?

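**Answer:** The model classified about 96% of the test apps correctly (accuracy of roughly 0.961). Because the two classes are nearly balanced (14,700 malware apps versus 14,632 benign apps), accuracy is a reasonable summary here, although it does not by itself report the detection rate for the malware class alone.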
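The question asks specifically about actual malware, which overall accuracy does not isolate. One common way to look at it is a confusion matrix, together with recall for class 1 (the share of real malware that was flagged). The sketch below uses short made-up label lists, not the notebook's `y_test` and `predictions`, so the numbers say nothing about this model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = malware, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# With labels=[0, 1], ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

malware_recall = recall_score(y_true, y_pred, pos_label=1)        # TP / (TP + FN)
malware_precision = precision_score(y_true, y_pred, pos_label=1)  # TP / (TP + FP)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall={malware_recall:.2f} precision={malware_precision:.2f}")
```

Applying the same calls to `y_test` and `predictions` would show how many malware apps the model missed (false negatives) and how many benign apps it flagged by mistake (false positives).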