<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/veronica-thesis/blob/master/malware.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolution Network

Convolutional Neural Networks (CNNs) have been very successful, especially at modeling images. In this notebook we introduce CNNs and use Keras to classify malware `.bytes` files, each rendered as a 256x16 grayscale image, into 9 classes.

In [None]:
%%bash
fileid="1--HR99qhGfN-5G5vutGbWsQ6NvB1a8HF&export=download" 
filename="byte_names.1.7z"
curl -L -c cookies.txt 'https://docs.google.com/uc?export=download&id='$fileid | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' > confirm.txt

curl -L -b cookies.txt -o $filename 'https://docs.google.com/uc?export=download&id='$fileid'&confirm='$(<confirm.txt)

rm -f confirm.txt cookies.txt

In [None]:
!ls -l

In [None]:
!7z x byte_names.1.7z

### Packages

In [1]:
import tensorflow as tf 
import numpy as np 
import matplotlib.pyplot as plt
from tensorflow.keras import models,layers
from tensorflow.keras.utils import Sequence
#from tensorflow.python.keras.utils import data_utils
import math
import os
import pandas
from PIL import Image
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Conv2D,Input,MaxPool2D,Dropout,Flatten,MaxPooling2D,BatchNormalization


INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2


In [2]:

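def conv(x):
    # '??' marks a byte with no known value; map it to 255 so it fits in uint8
    if x == '??':
        return 255
    return int(x, 16)


def getImage(filename):
    # Each useful line is an address followed by 16 hex bytes (17 tokens)
    with open(filename) as f:
        lines = f.read().splitlines()
    img = []
    for line in lines:
        el = line.split()
        if len(el) == 17:
            img.append([conv(x) for x in el[1:]])  # drop the address column
    img = np.asarray(img, dtype=np.uint8)
    img = Image.fromarray(img)  # 2-D uint8 array -> grayscale image
    # PIL sizes are (width, height): the result has 256 rows and 16 columns
    img = img.resize((16, 256), resample=Image.BOX)
    img = np.array(img)
    return img.reshape(img.shape[0], img.shape[1], 1)  # add a channel axis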

class ImageSequence(Sequence):
    def __init__(self, x_set, y_set, dir, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.dir = dir

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        batch_x = self.x[start:end]
        batch_y = self.y[start:end]
        # Use self.dir (not the global dir) so the sequence is self-contained
        images = [getImage(self.dir + filename + ".bytes") for filename in batch_x]
        return np.array(images), batch_y
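
To make the parsing step concrete, here is a minimal sketch of how one line of a `.bytes` file could become a row of 16 pixel values. The sample lines are made up for illustration, `parse_line` is a hypothetical helper that is not part of the notebook, and mapping `??` to 255 is an assumption of this sketch.

```python
def parse_line(line):
    """Return the 16 byte values of one line, or None if the line is malformed."""
    tokens = line.split()
    if len(tokens) != 17:          # address + 16 bytes expected
        return None
    # Skip the address (first token); '??' means the byte value is unknown
    return [255 if t == "??" else int(t, 16) for t in tokens[1:]]


# Hypothetical sample lines in the "address followed by 16 hex bytes" layout
sample = "00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08"
unknown = "00401010 ?? ?? 00 FF " + "00 " * 12

row = parse_line(sample)
print(len(row), row[:4])
print(parse_line(unknown)[:4])
print(parse_line("too short"))
```

Stacking such rows gives a 2-D array with 16 columns, which is the array that the image functions then resize to 256 rows.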


In [3]:
batch_size=64
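# header=None assumes data1.csv has no header row
data=pandas.read_csv("data/data1.csv",header=None)
names=data.iloc[:,0].to_numpy()
classes=data.iloc[:,1].to_numpy()-1  # shift 1-based class labels to 0-based
#dir="c:\\Users\\hikma\\Downloads\\"
dir=os.path.join("data","train","")  # ends with the OS path separator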

train_dataset=ImageSequence(names,classes,dir,batch_size)
#train_dataset=ImageSequence(train_figs,train_labels,batch_size)
#test_dataset=CIFAR10Sequence(test_figs,test_labels,batch_size)
(f,l)=train_dataset[0]
f.shape
l

array([8, 7, 6, 7, 7, 1, 5, 2, 0, 0, 2, 1, 2, 7, 2, 2, 2, 2, 0, 0, 6, 2,
       3, 0, 2, 7, 2, 5, 8, 1, 2, 5, 0, 2, 8, 8, 1, 1, 2, 2, 7, 5, 5, 8,
       1, 0, 1, 2, 2, 2, 4, 1, 7, 0, 1, 8, 1, 7, 1, 2, 0, 2, 1, 1],
      dtype=int64)
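
The first batch above holds 64 labels because `batch_size` is 64. As a rough illustration of how a `Sequence` divides a dataset into batches, the sketch below lists the slice bounds of every batch. The helper `batch_bounds` and the dataset size of 150 are hypothetical, chosen only so that the last batch comes out short.

```python
import math


def batch_bounds(n_samples, batch_size):
    """List the (start, end) slice bounds of every batch, in order."""
    n_batches = math.ceil(n_samples / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_samples))
            for i in range(n_batches)]


# Hypothetical dataset size, chosen so the last batch is short
bounds = batch_bounds(n_samples=150, batch_size=64)
print(bounds)
```

Rounding up with `math.ceil` is what keeps the final, partial batch from being dropped.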

In [4]:
def createModel():
    model = Sequential()
    model.add(Input(shape=(256,16,1)))
    model.add(tf.keras.layers.experimental.preprocessing.Rescaling(1./255))
    model.add(Conv2D(32, kernel_size = (3,3), activation = 'relu'))
    #model.add(Conv2D(64, kernel_size = (3,3), activation = 'relu'))
    #model.add(MaxPooling2D(pool_size = (2,2)))
    #model.add(Dropout(.5))
        
    #model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
    #model.add(Conv2D(256, kernel_size=(3, 3), activation='relu'))
    #model.add(MaxPooling2D(pool_size = (2,2)))
    #model.add(Dropout(.5))
    
    model.add(Flatten())
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(9, activation = 'softmax', name = 'Output'))
    model.summary()
    return model

In [5]:
model=createModel()
#tf.keras.utils.plot_model(model,show_shapes=True)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
rescaling (Rescaling)        (None, 256, 16, 1)        0         
_________________________________________________________________
conv2d (Conv2D)              (None, 254, 14, 32)       320       
_________________________________________________________________
flatten (Flatten)            (None, 113792)            0         
_________________________________________________________________
dense (Dense)                (None, 1024)              116524032 
_________________________________________________________________
dropout (Dropout)            (None, 1024)              0         
_________________________________________________________________
Output (Dense)               (None, 9)                 9225      
Total params: 116,533,577
Trainable params: 116,533,577
Non-trainable params: 0
__________________________________________
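
The numbers in the summary can be checked by hand. The sketch below recomputes the output shape of the convolution and the parameter count of each layer; the helper names are hypothetical and only the layer sizes come from the model above.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One weight per kernel cell and input channel, plus one bias per filter
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(n_inputs, n_units):
    # One weight per input-unit pair, plus one bias per unit
    return (n_inputs + 1) * n_units


# A 3x3 'valid' convolution trims one pixel from each border
conv_h, conv_w, filters = 256 - 2, 16 - 2, 32
flat = conv_h * conv_w * filters

counts = {
    "conv2d": conv2d_params(3, 3, 1, filters),
    "dense": dense_params(flat, 1024),
    "Output": dense_params(1024, 9),
}
print(flat, counts, sum(counts.values()))
```

Almost all of the parameters sit in the first `Dense` layer, because it connects every one of the 113,792 flattened values to each of its 1024 units.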

## Optimization

Keras supports many optimization methods. In this notebook we use the __Adam__ method, which can be described loosely as __adaptive__ gradient descent.

Also, since the labels are __NOT__ one-hot encoded, we use the "Sparse" version of the cross-entropy loss: __SparseCategoricalCrossentropy__. Finally, the from_logits argument tells the loss what the model outputs. With from_logits=True the loss applies softmax itself before computing the cross-entropy; with from_logits=False (the default) it expects probabilities. Since our model already computes softmax in its last layer, we specify from_logits=False.
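
The role of from_logits can be illustrated in plain Python. This is a sketch of the idea for a single sample, not the Keras implementation; the functions `softmax` and `sparse_ce` and the example scores are hypothetical.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_ce(outputs, label, from_logits=False):
    """Cross-entropy for one sample with an integer class label."""
    probs = softmax(outputs) if from_logits else outputs
    return -math.log(probs[label])


logits = [2.0, 0.5, -1.0]   # hypothetical raw scores for 3 classes
label = 0                   # integer label, not a one-hot vector

loss_from_probs = sparse_ce(softmax(logits), label, from_logits=False)
loss_from_logits = sparse_ce(logits, label, from_logits=True)
print(loss_from_probs, loss_from_logits)
```

Both calls give the same loss, because softmax is applied exactly once in each case. Passing probabilities together with from_logits=True would apply softmax twice and give a different value.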

In [7]:
# if we don't use softmax in the last layer, i.e. if the output of the
# model is NOT probabilities then use from_logits=True
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

#history = model.fit(img_train,label_train, batch_size=128,epochs=10, 
#                    validation_data=(img_test, label_test))
# The batch size is set by the ImageSequence, so it is not passed to fit
history=model.fit(train_dataset,epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
 6/32 [====>.........................] - ETA: 16:34 - loss: 0.2377 - accuracy: 0.9417

### Testing the Accuracy

In [None]:
# test_dataset is not defined above; it must be an ImageSequence built from held-out files
_,test_accuracy=model.evaluate(test_dataset)
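
The cell above evaluates on `test_dataset`, which the notebook never builds (its only definition is commented out). One possible way to obtain held-out data is to split the sample indices before creating the sequences. The sketch below is an assumption about how that could be done; `split_indices`, the 20% test fraction, and the seed are all hypothetical.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_samples=10, test_fraction=0.2)
print(train_idx, test_idx)
```

With the arrays from this notebook, something like `ImageSequence(names[test_idx], classes[test_idx], dir, batch_size)` could then serve as `test_dataset`, provided the training sequence is built from `train_idx` only so that the two sets do not overlap.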