# Android Malware Detection

**Introduction**


With the growing popularity of Android devices, the number of applications built for the Android operating system is
increasing day by day. The biggest challenge in this scenario is determining whether an application is authentic
or malware. This project classifies an application as malware or benign based on the permissions
it requests.

**Dataset**


The dataset is taken from Kaggle and consists of about 331 features, each representing an Android
permission requested by an application (0 denotes not required and 1 denotes required). The dataset has 398 rows,
one per application. The ‘type’ label indicates whether the application in a given row
is malware or not.

**Tasks in this assignment**

1. Write a Data Science Proposal for achieving the objective mentioned.
2. Perform exploratory analysis on the data and describe your understanding of the data.
3. Perform data wrangling / pre-processing on the data if required
    a. E.g., missing data, normalization, discretization, etc.
4. Apply any two feature engineering techniques.
5. Plot top 10 features.
6. Implement any two Machine Learning models (SVM or Decision Tree or Random Forest or kNN or Naïve Bayes etc)
7. Compare the performance of the two models. Provide a table for comparison. (Here you may use the combination of FE1+ML1, FE1+ML2, FE2+ML1 and FE2+ML2 etc)
8. Present the conclusions/results in the format shared.


**Expected Submissions**

Two files are expected as the assignment submission.
1. The summary of the work in the template provided. (you may fill only the boxes relevant to this problem
statement)
2. The executed ipynb file with clear subdivision of the codes and brief description of the purpose of
respective code. All the executed tables or graphs and results should be present in the ipynb file. The
ipynb file may be submitted as a single .pdf file.

#### Import packages

In [None]:
import pickle

import category_encoders as ce
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from sklearn import preprocessing, svm, tree
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    cohen_kappa_score,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

### Load Data


Read dataset into pandas dataframe

In [None]:
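df = pd.read_csv("../input/androidmalwaredetection/Dataset.csv", sep=";")
# Lowercase the column names so they can be matched consistently later
df.columns = df.columns.str.lower()
# Last expression in the cell, so the notebook displays (rows, columns)
df.shape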

#### Exploratory Analysis

List the features

In [None]:
for column in df.columns.tolist():
    print(column)

### Data wrangling and Pre-processing

Count the missing values in each column

In [None]:
# Count missing values per column
null_series = df.isnull().sum()
count = 0
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for name, val in null_series.items():
    if val > 0:
        print(name + "      " + str(val))
    else:
        count = count + 1
print("number of columns with no null values: " + str(count))

# isna() is an alias of isnull(), so this repeats the check above
na_series = df.isna().sum()
count = 0
for name, val in na_series.items():
    if val > 0:
        print(name + "      " + str(val))
    else:
        count = count + 1
print("number of columns with no na values: " + str(count))

So there are no missing values in the dataset.

Find values other than 0 and 1 in each column

In [None]:
onezero = 0
for column in df.columns:
    cnt = len(df[(df[column]!= 0) & (df[column]!= 1)])
    if(cnt > 0):
        print(column + " has "+ str(cnt) +" rows with value other than 0,1")
    else:
        onezero = onezero + 1
print("Total number of features with values as only 0,1: " + str(onezero))

Every column contains only 0 or 1, so there are no out-of-range values in any column.

Since all values are either 0 or 1, there is no need to perform normalization.
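
To make the point about normalization concrete, here is a minimal sketch on a made-up toy matrix (not the Kaggle data). It assumes min-max scaling as the normalization in question and shows that scaling leaves 0/1 columns unchanged.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy permission matrix: rows are apps, columns are permissions
X_demo = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
])

# Min-max scaling maps each column's minimum to 0 and its maximum to 1
scaled = MinMaxScaler().fit_transform(X_demo)

# 0/1 columns (and the constant all-zero column) come back unchanged
print(np.array_equal(scaled, X_demo))
```

Other scalers, such as standardization, would change the values, but tree-based models are insensitive to that kind of rescaling.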

Let's cast the dataframe columns to integer type to simplify the analysis.

In [None]:
df = df.astype("int64")

View sample data

In [None]:
df.head()

Analyze features

Count of malware (1) vs benign apps (0) based on type column

In [None]:
df.type.value_counts()

There are 199 malware apps and 199 benign apps in the dataset, so the classes are evenly balanced.

#### Plot 10 features

Let us find the 10 permissions requested most often by each class of app, as a first indication of which features separate malware from benign apps

Top 10 permissions required by Malware apps

In [None]:
# Drop the label column explicitly, then take the 10 most requested permissions
df[df.type == 1].drop(columns="type").sum(axis=0).sort_values(ascending=False).head(10)

Top 10 permissions required by benign apps

In [None]:
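# For benign apps the 'type' sum is 0 and sorts last, so slicing [1:11] would skip the top permission
df[df.type == 0].drop(columns="type").sum(axis=0).sort_values(ascending=False).head(10)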

Let us plot a bar chart for the above top 10 features

In [None]:
# Each class has its own top 10 permissions, so the two charts do not share an x-axis
fig, axs = plt.subplots(nrows=2, figsize=(10, 10))
df[df.type == 0].drop(columns="type").sum(axis=0).sort_values(ascending=False).head(10).plot.bar(ax=axs[0], color="green", title="Benign Apps", ylabel="Count of apps")
df[df.type == 1].drop(columns="type").sum(axis=0).sort_values(ascending=False).head(10).plot.bar(ax=axs[1], color="red", title="Malware Apps", ylabel="Count of apps", xlabel="Permissions")
fig.tight_layout()

Describing the features in the dataset

In [None]:
df_desc = df.describe()
df_desc

From the above, permissions whose names contain certain keywords seem more suspicious, so we select those columns

In [None]:
# Keep the label plus every permission whose name contains a suspicious keyword
keywords = ['type', 'write', 'delete', 'clear', 'boot', 'change', 'credential', 'admin', 'list',
            'secure_storage', 'notifications', 'account', 'destroy', 'mount', 'authenticate',
            'privileged', 'brick', 'transmit', 'capture', 'disable', 'install', 'certificate',
            'send', 'shutdown', 'start_any_activity', 'lock', 'sms', 'call', 'danger', 'voicemail']
# Joining with '|' builds one regex that matches any of the keywords
df1 = df.loc[:, df.columns.str.contains('|'.join(keywords))].copy()

In [None]:
df1.head()

Remove columns that contain only 0

In [None]:
df1 = df1.loc[:, (df1 != 0).any(axis=0)]
df1.describe()

Plot grouped bar chart to better understand the feature relationship

In [None]:
bdf1 = pd.Series.sort_values(df1[df1.type==0].sum(axis=0), ascending=False)
mdf1 = pd.Series.sort_values(df1[df1.type==1].sum(axis=0), ascending=False)
del bdf1['type']
del mdf1['type']
pd.concat({'Benign Apps': bdf1, 'Malware Apps': mdf1}, axis=1).plot.bar(figsize=(18,5))

## Feature Engineering

#### Feature selection

Top 10 of the selected permissions for malware and benign apps

In [None]:
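# Each class has its own top 10 permissions, so the two charts do not share an x-axis
fig, axs = plt.subplots(nrows=2, figsize=(10, 10))
# 'type' was already deleted from both series, so head(10) gives the true top 10
mdf1.head(10).plot.bar(ax=axs[0], color="red", title="Malware Apps", ylabel="Count of apps")
bdf1.head(10).plot.bar(ax=axs[1], color="green", title="Benign Apps", ylabel="Count of apps", xlabel="Permissions")
fig.tight_layout()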

#### Observation
From the above bar chart, it is evident that it is predominantly malware apps that request permissions controlling SMS, Wi-Fi, lock, call, APN and contacts.
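
The visual comparison can be complemented by a statistical ranking. The sketch below is an assumption, not part of the original analysis: it scores binary permission columns against the label with scikit-learn's chi-squared test (`SelectKBest` with `chi2`). The data is synthetic and the column names `send_sms`, `internet` and `vibrate` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
y_demo = np.repeat([0, 1], n // 2)  # 0 = benign, 1 = malware

# Synthetic stand-in for the permission matrix (column names are hypothetical)
X_demo = pd.DataFrame({
    # requested by about 80% of malware but only about 10% of benign apps
    "send_sms": (rng.random(n) < np.where(y_demo == 1, 0.8, 0.1)).astype(int),
    # requested by about 90% of all apps regardless of class
    "internet": (rng.random(n) < 0.9).astype(int),
    # requested by about 30% of all apps regardless of class
    "vibrate": (rng.random(n) < 0.3).astype(int),
})

# Score every permission against the label and keep the two best
selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)
```

A permission that is common in both classes gets a low score even if it is requested very often, which the raw counts in the bar charts cannot show.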

## Modeling

In [None]:
# Select features by name so the label column 'type' cannot leak into X
X_train, X_test, y_train, y_test = train_test_split(df1.drop(columns='type'), df1['type'], test_size=0.20, random_state=42)

In [None]:
X_train.shape, X_test.shape

In [None]:
y_train.shape, y_test.shape
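
Since the classes are balanced (199 of each), it can help to keep that balance in both partitions. The following is a hedged sketch on synthetic data of the same size as the dataset. The `stratify` argument is an addition to the split used in this notebook, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in: 398 apps, 10 binary permissions, 199 apps per class
X_demo = rng.integers(0, 2, size=(398, 10))
y_demo = np.repeat([0, 1], 199)

# stratify=y_demo keeps the class ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Without `stratify`, a random split of a small dataset can leave the test set noticeably unbalanced, which makes accuracy harder to interpret.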

Naive Bayes algorithm

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
pred = gnb.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred) in that order
accuracy = accuracy_score(y_test, pred)
print("Naive Bayes")
print("Accuracy: " + str(accuracy))
print(classification_report(y_test, pred))

k-neighbors algorithm

In [None]:
for i in range(3, 15, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    # scikit-learn metrics expect (y_true, y_pred) in that order
    accuracy = accuracy_score(y_test, pred)
    print("k-neighbors {}".format(i))
    print("Accuracy: " + str(accuracy))
    print(classification_report(y_test, pred))
    print("")

Decision Tree

In [None]:
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
clf_gini.fit(X_train, y_train)

Predict the Test set results with criterion gini index

In [None]:
y_pred_gini = clf_gini.predict(X_test)

Check accuracy score with criterion gini index

In [None]:
print('Model accuracy score with criterion gini index: {0:0.4f}'. format(accuracy_score(y_test, y_pred_gini)))

Here, y_test are the true class labels and y_pred_gini are the predicted class labels in the test-set.

Compare the train-set and test-set accuracy to check for overfitting

In [None]:
y_pred_train_gini = clf_gini.predict(X_train)

y_pred_train_gini

In [None]:
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_gini)))

Check for overfitting and underfitting

In [None]:
print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))

Here, the training-set accuracy score is 1.0000 and the test-set accuracy is also 1.0000. The two values match, so there is no sign of overfitting.
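
A single train/test comparison rests on one split of 80 test rows. As a complementary check, the sketch below runs 5-fold stratified cross-validation with the same tree settings. It uses synthetic data with some label noise, so the numbers it prints say nothing about the real dataset, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: 398 apps, 20 binary permissions
X_demo = rng.integers(0, 2, size=(398, 20))
# The label follows the first permission, with about 10% of labels flipped as noise
flip = rng.random(398) < 0.10
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf_demo = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)

# One accuracy value per fold; the spread shows how stable the estimate is
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(scores, 3))
print("mean +/- std: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```

If the fold accuracies vary widely, a single split is not a reliable basis for ruling out overfitting.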

Visualize decision-trees

In [None]:
plt.figure(figsize=(12,8))
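# Plot the already fitted tree; calling fit() again here is unnecessary
tree.plot_tree(clf_gini, feature_names=list(X_train.columns), filled=True)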

Decision Tree Classifier with criterion entropy

In [None]:
clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
clf_en.fit(X_train, y_train)

Predict the Test set results with criterion entropy

In [None]:
y_pred_en = clf_en.predict(X_test)

Check accuracy score with criterion entropy

In [None]:
print('Model accuracy score with criterion entropy: {0:0.4f}'. format(accuracy_score(y_test, y_pred_en)))

Compare the train-set and test-set accuracy

In [None]:
y_pred_train_en = clf_en.predict(X_train)

y_pred_train_en

In [None]:
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_en)))

Check for overfitting and underfitting

In [None]:
print('Training set score: {:.4f}'.format(clf_en.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(clf_en.score(X_test, y_test)))

The training-set and test-set scores are the same as above: the training-set accuracy is 1.0000 and the test-set accuracy is also 1.0000. The two values match, so there is no sign of overfitting.

Visualize decision-tree

In [None]:
plt.figure(figsize=(12,8))

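# Plot the already fitted tree; calling fit() again here is unnecessary
tree.plot_tree(clf_en, feature_names=list(X_train.columns), filled=True)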

Based on the above analysis, we can conclude that the classification model's accuracy is excellent. The model does a very good job of predicting the class labels.

However, accuracy alone does not show the underlying distribution of the predictions, and it says nothing about the types of errors the classifier makes.

We have another tool called Confusion matrix that comes to our rescue.

Confusion matrix

In [None]:
cm = confusion_matrix(y_test, y_pred_en)

print('Confusion matrix\n\n', cm)
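
To show how the error types can be read from the matrix, here is a small sketch with hand-made labels (hypothetical, not model output). For binary labels 0 and 1, scikit-learn puts true classes in rows and predicted classes in columns.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = benign, 1 = malware
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Layout for binary labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the apps flagged as malware, how many really are
recall = tp / (tp + fn)     # of the real malware, how many were caught

print("TN={}, FP={}, FN={}, TP={}".format(tn, fp, fn, tp))
print("precision: {:.3f}, recall: {:.3f}".format(precision, recall))
```

For malware detection, false negatives (malware classified as benign) are usually the costlier error, so recall on class 1 deserves particular attention.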

Classification Report

In [None]:
print(classification_report(y_test, y_pred_en))

Random Forest

In [None]:
rdF=RandomForestClassifier(n_estimators=250, max_depth=50,random_state=45)
rdF.fit(X_train,y_train)
pred=rdF.predict(X_test)
cm=confusion_matrix(y_test, pred)

accuracy = accuracy_score(y_test,pred)
print("Random Forest Classifier")
print("Accuracy Score: "+ str(accuracy))
print(classification_report(y_test,pred, labels=None))
print("cohen kappa score: ", cohen_kappa_score(y_test, pred))
print("")
print('Confusion matrix\n\n',cm)
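
Cohen's kappa compares the observed agreement between predictions and true labels with the agreement expected by chance. The sketch below recomputes it by hand from a confusion matrix on hypothetical labels and checks the result against `cohen_kappa_score`.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical labels: 0 = benign, 1 = malware
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm_demo = confusion_matrix(y_true, y_hat)
n = cm_demo.sum()

# Observed agreement: fraction of samples on the diagonal
p_observed = np.trace(cm_demo) / n
# Chance agreement: sum over classes of (true share) * (predicted share)
p_expected = (cm_demo.sum(axis=1) * cm_demo.sum(axis=0)).sum() / n**2

kappa_manual = (p_observed - p_expected) / (1 - p_expected)
print("manual kappa: ", round(float(kappa_manual), 4))
print("sklearn kappa:", round(float(cohen_kappa_score(y_true, y_hat)), 4))
```

A kappa of 1.0 means perfect agreement, while 0 means the classifier does no better than chance given the class frequencies.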

#### Conclusion

1. Naive Bayes resulted in an accuracy score of 1.0.
2. k-nearest neighbors resulted in the following accuracy scores:
    * k = 3: accuracy 1.0
    * k = 6: accuracy 0.9125
    * k = 9: accuracy 0.9125
    * k = 12: accuracy 0.9
3. We built two Decision Tree classifiers for Android malware detection, one with the gini index criterion and one with the entropy criterion. Both models perform very well, with a model accuracy of 1.0000 in each case.
4. In the model with the gini index criterion, the training-set accuracy and the test-set accuracy are both 1.0000. The two values are the same, so there is no sign of overfitting.
5. Similarly, in the model with the entropy criterion, the training-set accuracy and the test-set accuracy are both 1.0000. The two values are the same, so there is no sign of overfitting.
6. The Random Forest classifier resulted in an accuracy score of 1.0 and a Cohen's kappa score of 1.0.
7. The confusion matrix and classification report indicate excellent model performance.
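
The results listed above can also be collected into a single comparison table. The sketch below shows one possible way to do that. It runs on synthetic data with label noise, so its numbers are illustrative only and are not the results reported here; the `_demo` names and the table layout are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: 398 apps, 20 binary permissions, about 10% label noise
X_demo = rng.integers(0, 2, size=(398, 20))
flip = rng.random(398) < 0.10
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# Same model families and hyperparameters as used in the notebook
models = {
    "Naive Bayes": GaussianNB(),
    "kNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree (gini)": DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=250, max_depth=50, random_state=45),
}

rows = []
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "kappa": cohen_kappa_score(y_te, y_hat),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

Swapping `X_tr`, `X_te`, `y_tr` and `y_te` for the notebook's own split would produce the table for the real data.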