In this exercise, we use the pefile module to analyze PE malware samples in the "Mediyes" folder. The goal is to extract the names of the PE sections, as well as the names of imported DLLs from each sample. 

In [1]:
# To install pefile, uncomment and execute the following line:
#!pip install pefile

In [1]:
from os import listdir
from os.path import isfile, join

import pefile

directories = ["Mediyes"]

In [2]:
# Takes a list of raw DLL names such as [b'ADVAPI32.dll', b'KERNEL32.dll', b'msvcrt.dll'],
# lowercases each name, and keeps only the part before the first '.' (e.g. 'advapi32').
def preprocessImports(listOfDLLs):
    return [x.decode().split(".")[0].lower() for x in listOfDLLs]

def getImports(pe):
    listOfImports = []
    # pefile only sets DIRECTORY_ENTRY_IMPORT when the file has an import table.
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        listOfImports.append(entry.dll)
    return preprocessImports(listOfImports)

def getSectionNames(pe):
    listOfSectionNames = []
    for eachSection in pe.sections:
        refined_name = eachSection.Name.decode().replace('\x00','').lower()
        listOfSectionNames.append(refined_name)
    return listOfSectionNames

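The following block may take a couple of minutes to run. Expect 'utf-8' decoding error messages for some files: a section name or DLL name in those samples contains bytes that are not valid UTF-8, so `.decode()` raises an exception and the sample is skipped.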

In [3]:
importsCorpus = []
numSections = []
sectionNames = []
print(directories)
for datasetPath in directories:
    samples = [f for f in listdir(datasetPath) if isfile(join(datasetPath,f))]
    for file in samples:
        filePath = datasetPath+"/"+file
        try:
            pe = pefile.PE(filePath)
            imports = getImports(pe)
            nSections = len(pe.sections)
            secNames = getSectionNames(pe)
            importsCorpus.append(imports)
            numSections.append(nSections)
            sectionNames.append(secNames)
                  
        except Exception as e: 
            print(e)
            print("Unable to obtain imports from "+filePath)

['Mediyes']
'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte
Unable to obtain imports from Mediyes/VirusShare_1a89b7d4fb8ded72e1f8e81ee9352262.exe
'utf-8' codec can't decode byte 0xb8 in position 0: invalid start byte
Unable to obtain imports from Mediyes/VirusShare_7a30183b105b4200fc201925aba4886c.exe
'utf-8' codec can't decode byte 0x8d in position 0: invalid start byte
Unable to obtain imports from Mediyes/VirusShare_14f3035781bb698c37ad287483af569e.exe
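The decoding errors above are raised by `.decode()` when a name field contains bytes that are not valid UTF-8, and the affected sample is then dropped from all three lists. If keeping those samples matters, one option is to decode leniently. The following is a minimal sketch, not part of the original exercise: `decode_name` is a hypothetical helper (it is not a pefile function), and the choice of `errors="replace"` is an assumption; `"ignore"` or `"backslashreplace"` would also avoid the exception.

```python
def decode_name(raw: bytes) -> str:
    """Decode a PE name field without raising on invalid UTF-8.

    Hypothetical helper, not part of pefile. Each undecodable byte
    becomes U+FFFD instead of raising UnicodeDecodeError.
    """
    # Section names are padded to 8 bytes with NULs; strip the padding first.
    return raw.rstrip(b"\x00").decode("utf-8", errors="replace").lower()


# A well-formed, NUL-padded section name decodes as before.
print(decode_name(b".text\x00\x00\x00"))

# A name starting with byte 0xb1 (as in the first error above) no longer raises.
print(repr(decode_name(b"\xb1abc\x00\x00\x00\x00")))
```

Using such a helper inside `getSectionNames` and `preprocessImports` should keep the skipped samples in the corpus, at the cost of storing placeholder characters in their names.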


In [4]:
print(importsCorpus[0:5])
print(numSections[0:5])
print(sectionNames[0:5])

[['ws2_32', 'rpcrt4', 'kernel32', 'user32', 'advapi32', 'ole32', 'oleaut32'], ['ntoskrnl', 'hal'], ['ws2_32', 'rpcrt4', 'kernel32', 'user32', 'advapi32', 'ole32', 'oleaut32'], ['ntoskrnl', 'hal'], ['ntoskrnl', 'hal']]
[5, 6, 5, 7, 6]
[['.text', '.rdata', '.data', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', 'init', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', 'page', 'init', '.rsrc', '.reloc'], ['.text', '.rdata', '.data', 'init', '.rsrc', '.reloc']]


**Exercise:** Find and print the names of the 5 most frequently imported DLLs in the "Mediyes" malware dataset. Use a process lookup website (e.g., processlibrary.com) to find the purpose of each of these 5 DLLs. Do you see anything suspicious in these imports?
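As a hint for the counting step, `collections.Counter` from the standard library can tally items across a list of lists. The sketch below deliberately uses a small made-up corpus (`toy_corpus` is hypothetical data, not taken from the Mediyes samples), so it does not give away the answer; the same pattern should carry over to `importsCorpus`.

```python
from collections import Counter
from itertools import chain

# Hypothetical data in the same shape as importsCorpus:
# one list of DLL names per sample.
toy_corpus = [
    ["kernel32", "user32"],
    ["kernel32", "advapi32"],
    ["kernel32", "user32", "ws2_32"],
]

# Flatten the list of lists and count how often each name occurs.
counts = Counter(chain.from_iterable(toy_corpus))

# most_common(n) returns the n most frequent (name, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [('kernel32', 3), ('user32', 2)]
```

Note that this counts every occurrence. If a DLL could appear more than once in a single sample's list and you want the number of samples importing it, wrap each inner list in `set(...)` before flattening.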

In [None]:
# Your code

 -- your answer.