<a href="https://colab.research.google.com/github/inspire-lab/SecurePrivateAI/blob/master/5_android_malware.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We use [androguard](https://github.com/androguard/androguard) to analyze APK files, together with a small data set that is available on [Kaggle](https://www.kaggle.com/xwolf12/datasetandroidpermissions).

We will also use some live Android malware as well as a (hopefully ;) ) clean Android app. The cell below downloads everything we need for this notebook.


In [0]:
!pip install androguard
!pip install tensorflow-gpu==1.15.2  keras==2.2.3 cleverhans==2.1.0
!wget https://github.com/duckduckgo/Android/releases/download/5.36.3/duckduckgo-5.36.3-release.apk
!wget 'https://docs.google.com/uc?export=download&id=1_eK_o1Jdp0K8lIVptcgfrn3x546bbc3d' -O android_permissions.csv
!wget https://github.com/ashishb/android-malware/raw/master/fake_bankers/eba335956afad3b50a93effc61cd7467552ff0f7c8ac14032f784c5fec3a5720.apk -O fake_banker.apk
!wget https://raw.githubusercontent.com/ashishb/android-malware/master/feabme/com.tinker.jumperchess\(Jump%20Chess\).apk -O feabme.apk
!wget https://github.com/ashishb/android-malware/raw/master/TrojanDownloader.Agent.JI/Google-play.apk -O TrojanDownloader.apk


Below is some helper code that is mostly about data handling. It provides functions to split the data into the two classes and to perform the train/test split.

In [0]:
def split_by_class(  x, y, MALWARE_LABEL=1, BEGING_LABEL=0 ):
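  """
  Return two datasets, one for benign and one for malicious samples.
  :param x: feature matrix, one row per sample
  :param y: labels, either a 1-D array of class ids or one-hot encoded rows
  :return: (x_mal, y_mal), (x_beg, y_beg)
  """

  # safety checks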
  size = x.shape[ 0 ]
  assert (size == y.shape[ 0 ])

  # labels were converted to categorical (one-hot) form;
  # to_categorical is imported from keras.utils in a later cell
  if hasattr( y[ 0 ], 'shape' ) and  len( y[0].shape) != 0 and y[ 0 ].shape[ 0 ] > 1:
    mal_label = to_categorical( MALWARE_LABEL, 2 )
    beg_label = to_categorical( BEGING_LABEL, 2 )
    # cause numpy is fun
    i_m = (y == mal_label).all( axis=1 ).nonzero( )[ 0 ]
    i_x = (y == beg_label).all( axis=1 ).nonzero( )[ 0 ]
  else:
    i_m = np.argwhere( y == MALWARE_LABEL )[ :, 0 ]
    i_x = np.argwhere( y == BEGING_LABEL )[ :, 0 ]

  x_mal = x[ i_m ][ : ]
  y_mal = y[ i_m ][ : ]

  x_beg = x[ i_x ][ : ]
  y_beg = y[ i_x ][ : ]

  print( 'Malware: ', x_mal.shape )
  print( 'Goodware: ', x_beg.shape )

  # safety checks
  assert (size == x_mal.shape[ 0 ] + x_beg.shape[ 0 ])
  assert (size == y_mal.shape[ 0 ] + y_beg.shape[ 0 ])

  return (x_mal, y_mal), (x_beg, y_beg)


def training_and_test( x, y, split=0.75, balance_classes=False,
                      MALWARE_LABEL=1, BEGING_LABEL=0, **kwargs ):
  """
  Splits the data set into a training set and a test set.
  :param x: feature matrix, one row per sample
  :param y: labels belonging to the rows of x
  :param split: fraction of the data that goes into the training set
  :param balance_classes: keep the ratio of classes the same in the training and test set
  :return: (x_train, y_train), (x_test, y_test)
  """
  # safety checks
  size = x.shape[ 0 ]
  assert (size == y.shape[ 0 ])

  print( 'X: ', x.shape )
  print( 'Y: ', y.shape )

  rand = np.random.RandomState( )
  rand.seed( 7 )
  if balance_classes:
    # forward the label values so non-default labels are respected
    (x_mal, y_mal), (x_beg, y_beg) = split_by_class( x, y, MALWARE_LABEL,
                                                     BEGING_LABEL )
    p_mal = rand.permutation( x_mal.shape[ 0 ] )
    p_beg = rand.permutation( x_beg.shape[ 0 ] )
    # training set
    print( y_beg[ p_beg ][ : int( x.shape[ 0 ] * split ) ].shape )
    x_train = np.vstack( (x_mal[ p_mal ][ : int( x_mal.shape[ 0 ] * split ) ],
                        x_beg[ p_beg ][ : int( x_beg.shape[ 0 ] * split ) ]) )
    if len( y.shape ) ==1 :
      y_train = np.concatenate( (y_mal[ p_mal ][ : int( x_mal.shape[ 0 ] * split ) ],
                            y_beg[ p_beg ][ : int( x_beg.shape[ 0 ] * split ) ]),
                            axis=None )
    else:
      y_train = np.vstack( (y_mal[ p_mal ][ : int( x_mal.shape[ 0 ] * split ) ],
                            y_beg[ p_beg ][ : int( x_beg.shape[ 0 ] * split ) ]) )
    # test set
    x_test = np.vstack( (x_mal[ p_mal ][ int( x_mal.shape[ 0 ] * split ): ],
                        x_beg[ p_beg ][ int( x_beg.shape[ 0 ] * split ): ]) )
    if len( y.shape ) ==1 :
      y_test = np.concatenate( (y_mal[ p_mal ][ int( x_mal.shape[ 0 ] * split ): ],
                          y_beg[ p_beg ][ int( x_beg.shape[ 0 ] * split ): ]),
                            axis=None )
    else:
      y_test = np.vstack( (y_mal[ p_mal ][ int( x_mal.shape[ 0 ] * split ): ],
                          y_beg[ p_beg ][ int( x_beg.shape[ 0 ] * split ): ]) )
  else:
    p = rand.permutation( x.shape[ 0 ] )
    x_train = x[ p ][ : int( x.shape[ 0 ] * split ) ]
    y_train = y[ p ][ : int( x.shape[ 0 ] * split ) ]
    x_test = x[ p ][ int( x.shape[ 0 ] * split ): ]
    y_test = y[ p ][ int( x.shape[ 0 ] * split ): ]

  print( 'X_train: ', x_train.shape, type( x_train ) )
  print( 'Y_train: ', y_train.shape, type( y_train ) )
  print( 'X_test: ', x_test.shape, type( x_test ) )
  print( 'Y_test: ', y_test.shape, type( y_test ) )
  # safety checks
  assert (size == x_train.shape[ 0 ] + x_test.shape[ 0 ])
  assert (size == y_train.shape[ 0 ] + y_test.shape[ 0 ])

  return (x_train, y_train), (x_test, y_test)


Next we'll read in the CSV file and parse it. The first line gives us the names of the permissions. After that, each line represents one instance (one app). A 1 indicates that a certain feature/permission is present, while a 0 indicates that it is not. The last value of each line gives us the class.
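
As a small illustration of this format, the sketch below parses a made-up three-line CSV string. The permission names, the column name `type`, and the values are invented for the example and are not taken from the Kaggle data set. The label convention (1 = malware, 0 = benign) follows the defaults of `split_by_class`.

```python
import numpy as np

# Hypothetical miniature of the file: a header row, then one row per app.
# The last column is the class label (1 = malware, 0 = benign).
toy_csv = (
    "android.permission.INTERNET;android.permission.SEND_SMS;type\n"
    "1;1;1\n"
    "1;0;0\n"
)

rows = toy_csv.strip().split( '\n' )
# every header entry except the last one names a permission
toy_permissions = rows[ 0 ].split( ';' )[ :-1 ]
values = np.array( [ [ int( v ) for v in r.split( ';' ) ] for r in rows[ 1: ] ] )
# all columns but the last are features, the last column is the label
toy_x, toy_y = values[ :, :-1 ], values[ :, -1 ]

print( toy_permissions )
print( toy_x.shape, toy_y )
```

Each row of `toy_x` is the binary permission vector of one app, and `toy_y` holds the matching class labels.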

In [0]:
import numpy as np

with open( 'android_permissions.csv', 'r' ) as f:
  lines = f.readlines()

permissions = lines[ 0 ].split( ';' )[ :-1 ]
print( 'all known permissions' )
print( permissions )

x = [ ]
y = [ ]
for line in lines[ 1: ]:
    features = line.rstrip( '\n' ).split( ';' )
    arr = [ int( i ) for i in features[ : -1 ] ]
    x.append( arr )
    y.append(  int( features[ -1 ] ) )
x = np.array( x )
y = np.array( y )

(x_train, y_train), (x_test, y_test) = training_and_test( x, y, 
                                                         balance_classes=True)



Before we dive into learning from the data, let's take a look at it.

In [0]:
import matplotlib.pyplot as plt

print( x_train.shape )
print( y_train.shape )

(x_mal, y_mal), (x_beg, y_beg) = split_by_class( x_train, y_train )

print( 'MALWARE' )
mal = x_mal.sum( axis=0 )
plt.bar( np.arange( mal.shape[ 0 ] ), mal )
plt.show()
argsorted = np.flip( np.argsort( mal ) )
for i in range(10):
  print( permissions[ argsorted[ i ] ] + ': ' + str( mal[ argsorted[ i ] ] )  )

print( 'BENIGN' )
beg = x_beg.sum( axis=0 )
plt.bar( np.arange( beg.shape[ 0 ] ), beg )
plt.show()
argsorted = np.flip( np.argsort( beg ) )
for i in range(10):
  print( permissions[ argsorted[ i ] ] + ': ' + str( beg[ argsorted[ i ] ] )  )


Using what you have learned so far, create an SVM classifier, train it on the training data, and evaluate it on the test data. Later cells expect the trained classifier to be stored in a variable named `clf`.

In [0]:
# your code here

In this task accuracy is not the most important measure. A more informative way of measuring the effectiveness of our classifier is the false negative rate and the false positive rate. The function below returns the confusion-matrix counts from which both rates can be computed.

In [0]:
from sklearn.metrics import confusion_matrix

def metrics( y_true, y_pred ):
  # convert from categorical labels if required
  if len( y_pred[ 0 ].shape ) != 0:
    y_pred = np.argmax( y_pred, axis=1 )
  if len( y_true[ 0 ].shape ) != 0:
    y_true = np.argmax( y_true, axis=1 )
  tn, fp, fn, tp = confusion_matrix( y_true, y_pred, labels=[ 0, 1 ] ).ravel( )
  return { 'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp }
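
The counts returned by `metrics` can be turned into rates with a few lines of arithmetic. The helper below is a minimal sketch and not part of the original notebook. The name `rates` and the example counts are made up for illustration.

```python
def rates( counts ):
  """
  Turn confusion-matrix counts into false positive and false negative rates.
  :param counts: dict with the keys 'tn', 'fp', 'fn', 'tp'
  :return: dict with 'fpr' and 'fnr'; a rate is None if its class is empty
  """
  negatives = counts[ 'tn' ] + counts[ 'fp' ]   # all benign samples
  positives = counts[ 'tp' ] + counts[ 'fn' ]   # all malware samples
  fpr = counts[ 'fp' ] / negatives if negatives else None
  fnr = counts[ 'fn' ] / positives if positives else None
  return { 'fpr': fpr, 'fnr': fnr }

# made-up counts, only to show the call
example = rates( { 'tn': 90, 'fp': 10, 'fn': 5, 'tp': 45 } )
print( example )
```

A false negative here is a malware sample that was classified as benign, which is usually the more costly mistake for a malware detector.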


In [0]:
print( clf.predict( x_test ) )
print( metrics( y_test, clf.predict( x_test ) ) )


Let's see how our classifier works on new APKs. We downloaded a few at the beginning of the notebook. First we need to extract the permissions from each APK.


In [0]:
from androguard.misc import AnalyzeAPK
import re

def extract_permissions( apk, dv_formant, analysis ):
  print( apk.get_app_name() )
  print( 'permissions' )
  apk_permissions = apk.get_permissions()
  print( apk_permissions )

  # create empty feature vector
  apk_features = np.zeros( [ len( permissions ) ] )

  for perm in apk_permissions:
    if not isinstance( perm, str ):
      continue
    try:
      idx = permissions.index( perm )
    except ValueError:
      # permission is not a column of the data set, so it cannot be encoded
      print( 'encountered unknown permission:' + perm )
      continue
    apk_features[ idx ] = 1

  return apk_features

def predict_svm( x ):
  print( clf.predict( x ) )
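
The mapping from a permission list to a binary feature vector can be tried out without an APK. The sketch below is a stand-alone illustration. The function name `permissions_to_vector` and the permission lists are made up and do not come from the data set.

```python
import numpy as np

def permissions_to_vector( known, requested ):
  """Binary vector with a 1 for every requested permission found in known."""
  index = { name: i for i, name in enumerate( known ) }
  vec = np.zeros( len( known ) )
  unknown = [ ]
  for perm in requested:
    if perm in index:
      vec[ index[ perm ] ] = 1
    else:
      unknown.append( perm )   # not a known column, so it is skipped
  return vec, unknown

# hypothetical column names and a hypothetical app
known = [ 'android.permission.INTERNET', 'android.permission.SEND_SMS',
          'android.permission.CAMERA' ]
vec, unknown = permissions_to_vector(
    known, [ 'android.permission.SEND_SMS', 'com.example.CUSTOM' ] )
print( vec, unknown )
```

scikit-learn estimators expect a 2-D array, so a single feature vector would typically be passed to `predict` as `vec.reshape( 1, -1 )`.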
  

If we wanted to get more elaborate with our feature extraction, we could look at the code itself. The function below extracts suspicious API calls and URLs that are used in the app.


In [0]:
def suspicous_api_and_urls ( apk, dv_formant, analysis ):
  """
  taken from
  https://github.com/MLDroid/drebin
  """

  print( apk.get_app_name() )
  SuspiciousApiSet = set()
  URLDomainSet = set()
  for dv in dv_formant:
    for m in dv.get_methods():
      for block in analysis.get_method( m ).get_basic_blocks().get():
        DalvikCodeList = []
        for Instruction in block.get_instructions():
            CodeLine = str(Instruction.get_name() + " " + Instruction.get_output())
            DalvikCodeList.append(CodeLine)
        DalvikCodeList = set(DalvikCodeList)
        ApiList = []
        AndroidSuspiciousApiNameList = ["getExternalStorageDirectory", "getSimCountryIso", "execHttpRequest", 
                    "sendTextMessage", "getSubscriberId", "getDeviceId", "getPackageInfo", "getSystemService", "getWifiState", 
                    "setWifiEnabled", "setWifiDisabled", "Cipher"]
        # the Java class is spelled HttpURLConnection (capital C)
        OtherSuspiciousApiNameList = ["Ljava/net/HttpURLConnection;->setRequestMethod(Ljava/lang/String;)", "Ljava/net/HttpURLConnection", 
                                      "Lorg/apache/http/client/methods/HttpPost", "Landroid/telephony/SmsMessage;->getMessageBody", 
                                      "Ljava/io/IOException;->printStackTrace", "Ljava/lang/Runtime;->exec"]
        NotLikeApiNameList = ["system/bin/su", "android/os/Exec"]
        for DalvikCode in DalvikCodeList:
          if "invoke-" in DalvikCode:
              Parts = DalvikCode.split(",")
              for Part in Parts:
                  if ";->" in Part:
                      Part = Part.strip()
                      if Part.startswith('Landroid'):
                          FullApi = Part
                          ApiParts = FullApi.split(";->")
                          ApiClass = ApiParts[0].strip()
                          ApiName = ApiParts[1].split("(")[0].strip()
                          ApiDetails = {}
                          ApiDetails['FullApi'] = FullApi
                          ApiDetails['ApiClass'] = ApiClass
                          ApiDetails['ApiName'] = ApiName
                          ApiList.append(ApiDetails)
                          if(ApiName in AndroidSuspiciousApiNameList):
                              #ApiClass = Api['ApiClass'].replace("/", ".").replace("Landroid", "android").strip()
                              SuspiciousApiSet.add(ApiClass+"."+ApiName)
                  for Element in OtherSuspiciousApiNameList:
                      if(Element in Part):
                          SuspiciousApiSet.add(Element)
          for Element in NotLikeApiNameList:
              if Element in DalvikCode:
                  SuspiciousApiSet.add(Element)
        for Instruction in DalvikCodeList:
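          # raw strings keep the regex escapes intact; the quantifier must be
          # written {2,6} because a space inside the braces disables it
          URLSearch = re.search(r"https?://([\da-z\.-]+\.[a-z\.]{2,6}|[\d.]+)[^'\"]*", Instruction, re.IGNORECASE)
          if (URLSearch):
              URL = URLSearch.group()
              Domain = re.sub(r"https?://(.*)", r"\g<1>",
                              re.search(r"https?://([^/:\\]*)", URL, re.IGNORECASE).group(),
                              count=0, flags=re.IGNORECASE)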
              URLDomainSet.add(Domain)

  print( SuspiciousApiSet )
  print( URLDomainSet )
  return SuspiciousApiSet, URLDomainSet
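
The URL handling can be studied in isolation. The sketch below applies two regular expressions of the same kind to a hypothetical disassembled instruction. The helper name `extract_domain` and the example instruction are assumptions for illustration, and the sketch is a simplified stand-in rather than the exact Drebin code.

```python
import re

# first pattern finds a URL, second pattern isolates the host part
URL_PATTERN = re.compile( r"https?://([\da-z.-]+\.[a-z.]{2,6}|[\d.]+)[^'\"]*",
                          re.IGNORECASE )
DOMAIN_PATTERN = re.compile( r"https?://([^/:\\]*)", re.IGNORECASE )

def extract_domain( code_line ):
  """Return the domain of the first URL in code_line, or None."""
  url = URL_PATTERN.search( code_line )
  if url is None:
    return None
  return DOMAIN_PATTERN.search( url.group() ).group( 1 )

# hypothetical disassembled instructions
line = 'const-string v0, "http://example.com/path?id=1"'
print( extract_domain( line ) )
print( extract_domain( 'return-void' ) )
```

Note that the quantifier is written `{2,6}` without a space. Python's `re` module treats `{2, 6}` as literal text rather than as a repetition count.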


Now that we have a trained classifier and a way to extract features from APK files, we can see how our classifier performs on apps that were not part of the dataset.

Complete the code stub below to classify the apps.

In [0]:
# list of apk filenames
apks =  [ 'duckduckgo-5.36.3-release.apk', 'fake_banker.apk', 'feabme.apk',
         'TrojanDownloader.apk' ]
# labels
labels = [ 0, 1, 1, 1 ]

# analyze the apks

# extract features

# perform inference


How does the classifier perform? Check against www.virustotal.com 
(You need to download and upload the files)


Download links:
https://github.com/duckduckgo/Android/releases/download/5.36.3/duckduckgo-5.36.3-release.apk

https://github.com/ashishb/android-malware/raw/master/fake_bankers/eba335956afad3b50a93effc61cd7467552ff0f7c8ac14032f784c5fec3a5720.apk 
https://raw.githubusercontent.com/ashishb/android-malware/master/feabme/com.tinker.jumperchess\(Jump%20Chess\).apk 
https://github.com/ashishb/android-malware/raw/master/TrojanDownloader.Agent.JI/Google-play.apk 

Of course we are not limited to SVMs for malware detection. We can use neural nets too. The code below builds and trains a simple neural network.

In [0]:
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
# from keras.utils.np_utils import to_categorical

y_train_cat = to_categorical( y_train )  
y_test_cat = to_categorical( y_test )  

model = Sequential()
model.add( Dense( 64, activation='relu', input_shape=x_train.shape[ 1: ]  ) )
model.add( Dense( 32, activation='relu' ) )
model.add( Dense( 2, activation='softmax' ) )

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(x_train.shape)

model.fit( x_train, y_train_cat, epochs=64 )


How does the neural network perform compared with the SVM classifier we trained earlier?
Compare the relevant metrics on the test set and on the 4 APKs we downloaded.

In [0]:
# evaluate model performance


Of course the neural model is vulnerable to the same attacks we discussed earlier. But in the malware setting we are not free to make any change we want.
We need to constrain ourselves so that we can "easily" make the modifications to the APK without changing its functionality. A common constraint is to only add features.

Another constraint is that we can only make a change of exactly 1, since our features are binary. There is no 0.1 change. A feature is either on or off.
We deal with this by rounding up or down in the code below.
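
Both constraints can be expressed as a small projection step on a perturbed sample. The sketch below is one possible way to write it in NumPy and is not the notebook's solution. The function name `project_binary` and the choice of `>=` at the threshold are assumptions.

```python
import numpy as np

def project_binary( x_perturbed, x_original, alpha=0.5, add_only=True ):
  """
  Map a real-valued perturbed sample back to a valid binary feature vector.
  :param alpha: rounding threshold, values >= alpha become 1
  :param add_only: if True, features present in x_original are never removed
  """
  x_clipped = np.clip( x_perturbed, 0.0, 1.0 )                   # stay in [0, 1]
  x_binary = ( x_clipped >= alpha ).astype( x_original.dtype )   # round at alpha
  if add_only:
    # never switch off a feature the original sample had
    x_binary = np.maximum( x_binary, x_original )
  return x_binary

# made-up sample and perturbation
x_orig = np.array( [ 1, 0, 0, 1 ] )
x_pert = np.array( [ 0.3, 0.7, 0.2, 1.4 ] )
added_only = project_binary( x_pert, x_orig )
unconstrained = project_binary( x_pert, x_orig, add_only=False )
print( added_only, unconstrained )
```

With `add_only=True` the first feature stays switched on even though the perturbation pushed it below the threshold. With `add_only=False` it is removed.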

For the moment, let's ignore the first constraint.

Let's build a simple FGSM attack with rounding and attack the model we trained above.


In [0]:
import tensorflow as tf
from cleverhans.utils_tf import model_loss
import keras

epsillon = .1 # rate of change
alpha = 0.5 # threshold for rounding

# we don't want training phase behaviour
keras.layers.core.K.set_learning_phase( 0 )

# Set TF random seed to improve reproducibility
tf.set_random_seed( 1234 )

# we need the tensorflow session to run the attack
sess = keras.backend.get_session()

# compute natural loss


# forward pass


# compute gradient


# find next sample


# make sure we are in the correct range


# rounding


# compute adversarial loss
loss_adv = model_loss( y_test_cat, model( x_tensor ), mean=False )

# run the attack
x_adv = sess.run( x_tensor )


print( 'changes that were made' )
print( np.sum( x_adv, axis=1 ) - np.sum( x_test, axis=1 ) )


How does the attack perform? Evaluate the created examples.

A more powerful version of the attack is an iterative version. Modify the code above to make the attack iterative. Additionally, the attack currently still disregards our first constraint (only adding features). How would the code need to change to enforce it?
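
To make the idea concrete without giving away the TensorFlow solution, the sketch below runs an iterative sign-gradient attack against a tiny logistic regression model in plain NumPy. The model, its weights, and the function name `iterative_fgsm` are made up for illustration. It swaps the Keras network for a model whose gradient can be written by hand, so it is a sketch of the technique under those assumptions, not the notebook's attack.

```python
import numpy as np

def sigmoid( z ):
  return 1.0 / ( 1.0 + np.exp( -z ) )

def iterative_fgsm( x, y, w, b, epsilon=0.1, steps=10, alpha=0.5,
                    add_only=True ):
  """Repeated sign-gradient steps, then rounding back to binary features."""
  x_adv = x.astype( float )
  for _ in range( steps ):
    p = sigmoid( x_adv @ w + b )                  # P(malware) per sample
    # gradient of the cross-entropy loss with respect to the input
    grad = ( p - y )[ :, None ] * w[ None, : ]
    # step in the direction that increases the loss, stay inside [0, 1]
    x_adv = np.clip( x_adv + epsilon * np.sign( grad ), 0.0, 1.0 )
  x_adv = ( x_adv >= alpha ).astype( x.dtype )    # round at the very end
  if add_only:
    x_adv = np.maximum( x_adv, x )                # only allow added features
  return x_adv

# made-up model: positive weights point towards malware
w = np.array( [ 2.0, -1.5, 1.0, -2.0 ] )
b = 0.0
x = np.array( [ [ 1, 0, 1, 0 ] ] )                # one malware sample
y = np.array( [ 1 ] )

x_adv = iterative_fgsm( x, y, w, b )
p_before = sigmoid( x @ w + b )[ 0 ]
p_after = sigmoid( x_adv @ w + b )[ 0 ]
print( x_adv, p_before, p_after )
```

In this sketch the rounding happens only once, after the last step, because a single step of size `epsilon = 0.1` would otherwise be rounded away immediately. The `np.maximum` call is one way to enforce the add-only constraint.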


It is time to analyze the adversarial examples that we have created. Using code from earlier, check which features were added most often. Also, do the samples work against the SVM classifier?

Bonus question for homework: with the androguard tool you can repackage APKs. Make the changes to the manifest suggested by the adversarial examples, repackage the APK, and see how it fares against VirusTotal.
