# Data Structure
An *entry* is a block of text (including whitespace such as line breaks) that makes up the description of one piece of malware.
<br>Each entry has the following format:
<br>Source:Name - SomeText (IoCType_1: IoCValue_1) MoreText (IoCType_2: IoCValue_2) EvenMoreText ... (IoCType_n: IoCValue_n) ...
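
As a minimal sketch of how this format splits into its three parts, the snippet below applies one regular expression to a single entry. The entry text is shortened from the sample output further down for illustration, and the group names `source`, `name` and `text` are assumptions made for this example only.

```python
import re

# Illustrative entry, shortened from the sample output in this notebook.
entry = (
    "Fireeye:UNC1945 - PUPYRAT (aka Pupy) is an open source RAT.\n"
    "(MD5: 6983f7001de10f4d19fc2d794c3eb534) (IP: 46.30.189.0/24)"
)

# source: everything before the first colon
# name:   run of non-whitespace characters up to " - "
# text:   the remainder, line breaks included (re.DOTALL lets "." match "\n")
header = re.match(
    r"(?P<source>[^:]+):(?P<name>\S+) - (?P<text>.+)", entry, re.DOTALL
)

print(header.group("source"))
print(header.group("name"))
print(header.group("text").startswith("PUPYRAT"))
```

The IoC parentheses stay inside `text` at this stage; they are extracted in a separate pass.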


# Order of operations
1. A single manual input is required from the operator: a list of known sources that distribute the IoC messages.
<br>The list of sources is used to check whether a line of text is the beginning of a new entry, since observation shows that an entry always starts with its source.

2. A text file of unstructured data is read line by line.
<br>The text file is a direct copy-paste of the entire text block presented under *\[DATENG\] #2 Ustrukturert data* @https://memes.agency/o21/

3. `parse_data()` runs through the list of lines and separates the entries within the text file.
<br>Each entry is placed in a list, which the function returns.

4. After the entries have been separated and placed in a list, `entry_to_dict()` structures each entry's text into a dict.
<br>Using a regular expression, it groups substrings of an entry into `Source`, `Name` and `Text`.
<br>`Text` is passed to the function `collect_ioc()`, which uses another regular expression to find all the parentheses containing an `IoCType` and an `IoCValue`.
<br>`entry_to_dict()` then turns the matches returned by `collect_ioc()` into a list of dicts structured as follows:
```
[
    {'ioc_type':"name_of_type_1", 'ioc_value':"value_of_ioc_1"},
    {'ioc_type':"name_of_type_2", 'ioc_value':"value_of_ioc_2"},
    {'ioc_type':"name_of_type_m", 'ioc_value':"value_of_ioc_m"},
]
```

5. The final dictionary for an entry looks like this:
```
{
    'name':"name_of_malware",
    'source':"source_of_info",
    'ioc':[
        {'ioc_type':"name_of_type_1", 'ioc_value':"value_of_ioc_1"},
        {'ioc_type':"name_of_type_2", 'ioc_value':"value_of_ioc_2"},
        {'ioc_type':"name_of_type_m", 'ioc_value':"value_of_ioc_m"},
    ],
    'text':"info_text_about_malware"
}
```
<br>The dictionary of each entry is saved as a JSON file named `Source, Name.json` in the `outputs` directory.
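
Steps 1 and 3 above can be sketched on a small in-memory sample. This is a simplified illustration, not the notebook's `parse_data()`: the helper name `split_entries` and the sample lines (including the name `SampleName`) are made up for the example, and unlike the notebook code it strips whitespace around the source prefix before comparing.

```python
def split_entries(known_sources, lines):
    """Group raw lines into entries; a known source prefix starts a new entry."""
    known = {source.lower() for source in known_sources}
    entries = []
    for line in lines:
        # The text before the first colon is the candidate source.
        prefix = line.split(":", 1)[0].strip().lower()
        if prefix in known or not entries:
            entries.append(line)    # first line, or start of a new entry
        else:
            entries[-1] += line     # continuation of the current entry
    return entries


# Hypothetical input: two entries, the first spanning two lines.
sample_lines = [
    "Fireeye:UNC1945 - first line of text\n",
    "second line (IP: 46.30.189.0/24)\n",
    "Unit42:SampleName - another entry\n",
]

entries = split_entries(["Unit42", "Fireeye"], sample_lines)
print(len(entries))
print(repr(entries[0]))
```

Note that the second sample line also contains a colon, inside the IoC parentheses. It is still treated as a continuation because the text before that colon is not a known source.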

In [7]:
import re
import json
import os


In [2]:
# Step 1 in the order of operation. The list is hardcoded, but can/should be given in another form.
source_list = [
    "Unit42",
    "Fireeye"
]

source_list = [source.lower() for source in source_list]

# Step 2 in the order of operation
with open("unstructured.txt", 'r') as file:
    lines = file.readlines()


In [3]:
def check_new_entry(list_of_sources, line):
    first_substr = line.split(':')[0]

    is_new_entry = first_substr.lower() in list_of_sources

    return is_new_entry

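def collect_ioc(entry_text):
    # Match "(Type: value)" groups; the value may hold one level of nested
    # parentheses, e.g. "(MD5: <hash> (SLAPSTICK))". A ';' typed in place of
    # ':' is also accepted so that entry_to_dict() can normalise it.
    substr_format = r'\([^():;]+[:;] [^():]*(?:\([^():]*\)[^():]*)*\)'
    matches = re.findall(substr_format, entry_text)
    return matches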


def parse_data(list_of_sources, lines):
    entries = []
    entry_text = lines[0]
    for line in lines[1:]:
        if check_new_entry(list_of_sources, line):
            entries.append(entry_text)
            entry_text = line
        else:
            entry_text = entry_text + line
    entries.append(entry_text)

    return entries

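def entry_to_dict(entry):
    # Source: text before the first ':'; Name: non-whitespace run before ' - ';
    # Text: everything after ' - ', line breaks included.
    substr_format = r'([^:]+):(\S+) - ([\S\s]+)'
    match = re.search(substr_format, entry)
    if not match:
        # An entry without the expected header cannot be structured.
        raise ValueError('Entry does not match "Source:Name - Text": ' + repr(entry[:60]))

    source = match.group(1).strip()
    name = match.group(2).strip()
    text = match.group(3).strip()

    list_ioc_dicts = []
    for ioc in collect_ioc(text):
        # Normalise a mistyped ';' separator, drop the outer parentheses,
        # then split once into type and value.
        ioc_type, ioc_value = ioc[1:-1].replace(';', ':').split(':', 1)
        list_ioc_dicts.append(
            {
                'ioc_type': ioc_type.strip(),
                'ioc_value': ioc_value.strip()
            }
        )

    entry_dict = {
        'name': name,
        'source': source,
        'ioc': list_ioc_dicts,
        'text': text
    }

    return entry_dict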
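
As an alternative sketch, the type and value can be captured directly with two regex groups, which avoids slicing off the parentheses and splitting afterwards. The names `IOC_PATTERN` and `collect_ioc_pairs` are hypothetical. The pattern follows the shape of the one in `collect_ioc()` and also accepts `;` as a separator, which is an assumption based on the `(MD5; ...)` item in the sample output. The sample text is shortened from that output.

```python
import re

# Group 1: IoC type. Group 2: IoC value, which may contain one level of
# nested parentheses such as "(SLAPSTICK)". Accepting ';' is an assumption.
IOC_PATTERN = re.compile(
    r"\(([^():;]+)[:;] ([^():]*(?:\([^():]*\)[^():]*)*)\)"
)


def collect_ioc_pairs(entry_text):
    """Return a list of {'ioc_type', 'ioc_value'} dicts found in entry_text."""
    return [
        {"ioc_type": ioc_type.strip(), "ioc_value": ioc_value.strip()}
        for ioc_type, ioc_value in IOC_PATTERN.findall(entry_text)
    ]


text = (
    "loads code from memory.(MD5: d5b9a1845152d8ad2b91af044ff16d0b (SLAPSTICK)) "
    "(MD5; 0845835e18a3ed4057498250d30a11b1 (STEELCORGI)) (IP: 46.30.189.0/24)"
)
for pair in collect_ioc_pairs(text):
    print(pair)
```

Plain parentheses in the prose, such as `(aka Pupy)`, are not matched because they contain no separator followed by a space.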




In [8]:
# Step 3 in order of operation
parsed_data = parse_data(source_list, lines)

# Step 4 in the order of operations
data_dicts = [entry_to_dict(data) for data in parsed_data]

# Step 5 in the order of operations
os.makedirs('./outputs', exist_ok=True)
for data_dict in data_dicts:
    source = data_dict['source']
    name = data_dict['name']
    output_fname = source + ', ' + name + '.json'
    with open('./outputs/'+output_fname, 'w') as outfile:
        json.dump(data_dict, outfile)
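
The naming convention and the JSON round trip can be checked in isolation. The sketch below writes one hand-made dictionary to a temporary directory and reads it back; the dictionary contents are shortened for illustration, and the temporary directory stands in for `./outputs` so the example leaves no files behind.

```python
import json
import os
import tempfile

# Hand-made entry in the same shape as the dictionaries built in step 4.
entry_dict = {
    "name": "UNC1945",
    "source": "Fireeye",
    "ioc": [{"ioc_type": "IP", "ioc_value": "46.30.189.0/24"}],
    "text": "shortened for illustration",
}

with tempfile.TemporaryDirectory() as out_dir:
    # File name convention from step 5: "<Source>, <Name>.json"
    fname = f"{entry_dict['source']}, {entry_dict['name']}.json"
    path = os.path.join(out_dir, fname)

    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(entry_dict, outfile, indent=2)

    with open(path, encoding="utf-8") as infile:
        loaded = json.load(infile)

print(fname)
print(loaded == entry_dict)
```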

In [5]:
with open('./outputs/Fireeye, UNC1945.json') as f:
    data = json.load(f)
data

{'name': 'UNC1945',
 'source': 'Fireeye',
 'ioc': [{'ioc_type': 'MD5',
   'ioc_value': 'd5b9a1845152d8ad2b91af044ff16d0b (SLAPSTICK)'},
  {'ioc_type': 'MD5',
   'ioc_value': '0845835e18a3ed4057498250d30a11b1 (STEELCORGI)'},
  {'ioc_type': 'MD5', 'ioc_value': '6983f7001de10f4d19fc2d794c3eb534'},
  {'ioc_type': 'IP', 'ioc_value': '46.30.189.0/24'},
  {'ioc_type': 'IP', 'ioc_value': '66.172.12.0/24'}],
 'text': 'PUPYRAT (aka Pupy) is an open source, multi-platform (Windows, Linux, OSX, Android), multi-function RAT (Remote Administration Tool) and post-exploitation tool mainly written in Python. It features an all-in-memory execution guideline and leaves very low footprint. It can communicate using various transports, migrate into processes (reflective injection), and load remote Python code, Python packages and Python C-extensions from memory.(MD5: d5b9a1845152d8ad2b91af044ff16d0b (SLAPSTICK)) (MD5; 0845835e18a3ed4057498250d30a11b1 (STEELCORGI)) (MD5: 6983f7001de10f4d19fc2d794c3eb534) (IP