In this exercise, we use the pefile module to analyze PE malware samples in the "Mediyes" folder. The goal is to extract the names of the PE sections, as well as the names of imported DLLs from each sample. 

In [5]:
# Install pefile (skip this cell if it is already installed):
!pip install pefile

Processing /home/zoochigucci/.cache/pip/wheels/2c/19/61/c79689ef799ed5062f1376a5628fd8427cad5df1e7e15fd095/pefile-2019.4.18-py3-none-any.whl
Processing /home/zoochigucci/.cache/pip/wheels/6e/9c/ed/4499c9865ac1002697793e0ae05ba6be33553d098f3347fb94/future-0.18.2-py3-none-any.whl
Installing collected packages: future, pefile
Successfully installed future-0.18.2 pefile-2019.4.18


In [6]:
from os import listdir
from os.path import isfile, join

import pefile

directories = ["Mediyes"]

In [7]:
# Takes input such as [b'ADVAPI32.dll', b'KERNEL32.dll', b'msvcrt.dll'],
# converts each name to lowercase, and removes the file extension (e.g. ".dll").
def preprocessImports(listOfDLLs):
    return [x.decode().split(".")[0].lower() for x in listOfDLLs]

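def getImports(pe):
    # pefile only sets DIRECTORY_ENTRY_IMPORT when the file has an import table,
    # so fall back to an empty list for samples without one.
    listOfImports = [entry.dll for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])]
    return preprocessImports(listOfImports)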

def getSectionNames(pe):
    listOfSectionNames = []
    for eachSection in pe.sections:
        # Section names are raw bytes padded with NUL characters; strip the padding.
        refined_name = eachSection.Name.decode().replace('\x00','').lower()
        listOfSectionNames.append(refined_name)
    return listOfSectionNames

The following block may take a couple of minutes to run. It is OK to see 'utf-8' decoding error messages for some of the files: they occur when a name stored in the file is not valid UTF-8, and the loop skips those samples.

In [8]:
importsCorpus = []
numSections = []
sectionNames = []
print(directories)
for datasetPath in directories:
    samples = [f for f in listdir(datasetPath) if isfile(join(datasetPath, f))]
    for file in samples:
        filePath = join(datasetPath, file)
        try:
            pe = pefile.PE(filePath)
            imports = getImports(pe)
            nSections = len(pe.sections)
            secNames = getSectionNames(pe)
            # Append only after every step succeeded so the three lists stay aligned.
            importsCorpus.append(imports)
            numSections.append(nSections)
            sectionNames.append(secNames)
        except Exception as e:
            print(e)
            print("Unable to obtain imports from " + filePath)

['Mediyes']
'utf-8' codec can't decode byte 0x8d in position 0: invalid start byte
Unable to obtain imports from Mediyes/VirusShare_14f3035781bb698c37ad287483af569e.exe
'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte
Unable to obtain imports from Mediyes/VirusShare_1a89b7d4fb8ded72e1f8e81ee9352262.exe
'utf-8' codec can't decode byte 0xb8 in position 0: invalid start byte
Unable to obtain imports from Mediyes/VirusShare_7a30183b105b4200fc201925aba4886c.exe
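
The 'utf-8' errors above are raised when `.decode()` meets a name containing bytes that are not valid UTF-8, and the affected sample is then skipped entirely. If those samples should be kept, one option is to decode leniently. The helper below is a minimal sketch under that assumption: `decodeName` is a hypothetical name that is not part of the original notebook, the two inputs are made up, and the only library feature it relies on is the standard `errors="replace"` argument of `bytes.decode`.

```python
def decodeName(rawName):
    """Decode a raw PE name (bytes) without raising on invalid UTF-8."""
    # Invalid bytes become U+FFFD (the replacement character) instead of raising.
    text = rawName.decode("utf-8", errors="replace")
    # Names are padded with NUL characters; drop the padding and lowercase.
    return text.replace("\x00", "").lower()


# Hypothetical inputs: a normal section name and one starting with an invalid byte.
for raw in (b".text\x00\x00\x00", b"\x8dUPX0\x00\x00\x00"):
    print(ascii(decodeName(raw)))
```

Using such a helper in place of the bare `.decode()` calls would keep the skipped samples in the corpus, at the cost of replacement characters appearing in some names.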


In [9]:
print(importsCorpus[0:5])
print(numSections[0:5])
print(sectionNames[0:5])

[['ws2_32', 'rpcrt4', 'kernel32', 'user32', 'advapi32', 'ole32', 'oleaut32'], ['ws2_32', 'rpcrt4', 'kernel32', 'user32', 'advapi32', 'ole32', 'oleaut32'], ['ws2_32', 'rpcrt4', 'kernel32', 'user32', 'advapi32', 'ole32', 'oleaut32'], ['ntoskrnl', 'hal'], ['ntoskrnl', 'hal']]
[5, 5, 5, 6, 6]
[['.text', '.rdata', '.data', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', 'init', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', 'init', '.rsrc', '.reloc']]


**Exercise:** Find and print the names of the top 5 most frequently imported DLLs in the "Mediyes" malware dataset. Use a process lookup website (e.g., processlibrary.com) to find the function of each of these top 5 DLLs. Do you see anything suspicious in these calls?

In [44]:
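names = []

# Recursively flatten the nested import lists into the flat list `names`.
def rmvNest(data):
    for i in data:
        if isinstance(i, list):
            rmvNest(i)
        else:
            names.append(i)

rmvNest(importsCorpus)

# Map each DLL name to the number of times it appears in `names`.
countofNames = [names.count(c) for c in names]
pairs = dict(zip(names, countofNames))

# Sort the (count, name) pairs from most to least frequent and print them.
aux = sorted(((pairs[key], key) for key in pairs), reverse=True)

for s in aux:
    print(s)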


(302, 'kernel32')
(287, 'advapi32')
(283, 'ws2_32')
(282, 'user32')
(282, 'ole32')
(265, 'oleaut32')
(252, 'rpcrt4')
(144, 'ntoskrnl')
(144, 'hal')
(17, 'msvcrt')
(17, 'loadperf')
(1, 'wininet')
(1, 'imagehlp')
(1, 'gdi32')
(1, 'comctl32')
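
The exercise asks only for the top five, while the listing above prints every DLL. As a possible shortcut, the standard library's `collections.Counter` can flatten, count, and rank in a few lines. The sketch below uses a small made-up corpus (`sampleCorpus`, hypothetical) in place of `importsCorpus` so that it runs on its own; `topImports` is likewise an illustrative helper name.

```python
from collections import Counter

# Hypothetical stand-in for importsCorpus: one list of DLL names per sample.
sampleCorpus = [
    ["kernel32", "advapi32", "ws2_32"],
    ["kernel32", "advapi32"],
    ["kernel32", "user32"],
    ["ntoskrnl", "hal"],
]

def topImports(corpus, n=5):
    """Return the n most frequent DLL names as (name, count) pairs."""
    counts = Counter(dll for imports in corpus for dll in imports)
    return counts.most_common(n)

for name, count in topImports(sampleCorpus, n=2):
    print(count, name)
```

With the real data, `topImports(importsCorpus)` should return the same five names as the listing. DLLs with equal counts (such as user32 and ole32) come back in first-seen order rather than reverse alphabetical order.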


The top five most frequently imported DLLs in the Mediyes malware dataset are:
- kernel32
- advapi32
- ws2_32
- user32
- ole32

*kernel32*: The core Windows API library; most Windows functions are linked to it in some way.  It handles memory management, file and device input/output, and process and thread creation.  It is a system file on the hard drive containing machine code, it is required for Windows to work properly, and it should not be stopped or removed.  Windows loads it into the address space of nearly every running program.

*advapi32*: Part of the Advanced Windows API, a service library supporting numerous APIs, including security, registry, and service-control calls.  Windows and many applications depend on it, so it should not be removed.  It is a system file on the hard drive containing machine code, and it is loaded into main memory whenever a program that imports it starts.

*ws2_32*: Implements the Windows Sockets (Winsock 2) API, which most internet and network applications use to handle network connections.  It is a system library that Windows needs to work properly and should not be removed.  It is a file on the hard drive containing machine code, and it is loaded into main memory when a program that uses Windows Sockets starts (file description: WinSock 2.0 32bit).

*user32*: A module that contains Windows API functions related to the Windows user interface, including window handling and basic UI functions.  It is a system library that Windows needs to work properly and should not be removed.  It is a file on the hard drive containing machine code, and it is loaded into main memory when a program that uses the Windows User API starts (file description: Windows User API Client DLL).

*ole32*: Contains the core OLE (Object Linking and Embedding) functions.  It is a system library that Windows needs to work properly and should not be removed.  It is a file on the hard drive containing machine code, and it is loaded into main memory when a program that uses OLE starts (file description: Microsoft OLE for Windows).

**Suspicions**

None of these five imports is suspicious on its own: they are core Windows libraries that most legitimate programs also load, and importing them does not indicate a bug in the system.  What they do show is the capability the samples request.  Together they give a program control over processes, memory, and files (kernel32), the registry, services, and security settings (advapi32), network connections (ws2_32), windows and user input (user32), and COM/OLE objects (ole32).  Because ws2_32 and advapi32 appear almost as often as kernel32, the samples are consistent with malware that communicates over the network and changes registry or service settings, for example to start automatically when Windows boots.  The more unusual entries sit just below the top five: ntoskrnl and hal each appear 144 times, and these are normally imported only by kernel-mode drivers, which suggests that a large share of the dataset is driver code running with kernel privileges.  If an attacker controls code at that level, the system is effectively "owned".
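
One concrete check that follows from the import counts is to separate samples that import ntoskrnl or hal, which are normally imported by kernel-mode drivers, from samples that import only user-mode libraries. The sketch below does this for the first five import lists printed earlier. The `KERNEL_MODE_DLLS` set and the `looksLikeDriver` helper are illustrative assumptions, not an established detection rule.

```python
# Assumption: importing either of these suggests kernel-mode (driver) code.
KERNEL_MODE_DLLS = {"ntoskrnl", "hal"}

def looksLikeDriver(imports):
    """Return True if any imported name is in KERNEL_MODE_DLLS."""
    return any(dll in KERNEL_MODE_DLLS for dll in imports)

# The first five import lists shown in the notebook output.
userMode = ["ws2_32", "rpcrt4", "kernel32", "user32", "advapi32", "ole32", "oleaut32"]
firstFive = [userMode, userMode, userMode, ["ntoskrnl", "hal"], ["ntoskrnl", "hal"]]

driverCount = sum(looksLikeDriver(imports) for imports in firstFive)
print(driverCount, "of", len(firstFive), "samples import kernel-mode libraries")
```

Running the same check over `importsCorpus` would show how many samples in the full dataset fall into each group.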