Output IP addresses, domain names, file manipulations, and (potentially) registry details #1914


@aaronatp aaronatp commented Dec 22, 2023

This PR addresses issue #1907. It extracts IP addresses, domain names, and file manipulations from CAPE sandbox traces and outputs them. It may eventually extract and present registry details as well. It uses a regex to identify IPv4 and IPv6 addresses. It uses another regex to identify web domains and potential subdomains.
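As an illustration of the IP address extraction described above, here is a minimal sketch. It is not the code in this PR: the function name `extract_ip_addresses` and the candidate regex are assumptions made for the example. It uses a loose regex only to find candidate tokens and leaves the actual IPv4/IPv6 validation to Python's standard `ipaddress` module.

```python
import ipaddress
import re
from typing import Iterable, Iterator

# Loose candidate pattern: a run of hex digits, dots, and colons that is not
# glued to other letters or digits. Validation is left to the ipaddress module.
IP_CANDIDATE = re.compile(r"(?<![0-9A-Za-z])[0-9A-Fa-f:.]+(?![0-9A-Za-z])")


def extract_ip_addresses(strings: Iterable[str]) -> Iterator[str]:
    """Yield every substring that parses as an IPv4 or IPv6 address."""
    for string in strings:
        for match in IP_CANDIDATE.finditer(string):
            candidate = match.group().strip(".")  # drop sentence punctuation
            try:
                ipaddress.ip_address(candidate)
            except ValueError:
                continue  # not a valid address, e.g. 999.1.1.1
            yield candidate


if __name__ == "__main__":
    sample_strings = ["connect to 10.0.0.1.", "gateway fe80::1 up", "version 999.1.1.1"]
    print(list(extract_ip_addresses(sample_strings)))
```

This sketch still reports dotted version strings such as 1.2.3.4 as addresses and does not split `host:port` pairs, so a real implementation would likely need extra filtering.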

The extraction and rendering functions require tests (still a work in progress). This PR may require changelog and documentation updates.

Most of the significant features are in common.py and default.py - changes in the other files are to make the main additions compatible. I have gotten it to pass most of the tests locally. I'm still working on a few issues and need to improve the tests (especially the rendering tests). What do you think of the general design? Do you have any questions or suggestions?

The output would look like this:
        +-----------------+
        |  Domain Names   |
        |-----------------|
        | google.com      |
        | web.domain.net  |
        | mywebsite.edu   |
        +-----------------+
        +-----------------+
        |  IP Addresses   |
        |-----------------|
        | 10.0.0.1        |
        | 192.1.23.45     |
        +-----------------+
        +------------+-------------------------+
        | APIs       | File Names              |
        |------------+-------------------------|
        | CreateFile | /path/to/file.txt       |
        | WriteFile  | /path/to/other_file.txt |
        +------------+-------------------------+
The ResultDocument class seems to be the staging ground for outputting results. I have added a couple of attributes that I use for presenting IPs and domains, which might also be useful if capa outputs further dynamic analysis results in the future. The default, verbose, and vverbose modes all output the same results right now, but the verbose and vverbose modes can probably be expanded.
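For reference, a single-column table like the ones shown above can be produced with a few lines of standard-library Python. This is only a sketch of the layout, not the rendering code in this PR; `render_table` is a hypothetical helper name.

```python
from typing import List


def render_table(header: str, rows: List[str]) -> str:
    """Render a one-column ASCII table sized to its widest cell."""
    width = max(len(cell) for cell in [header, *rows])
    border = "+" + "-" * (width + 2) + "+"
    lines = [border, f"| {header.center(width)} |", "|" + "-" * (width + 2) + "|"]
    lines += [f"| {row.ljust(width)} |" for row in rows]
    lines.append(border)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table("IP Addresses", ["10.0.0.1", "192.1.23.45"]))
```

Sizing the column from the data keeps the borders aligned regardless of how long the extracted values are.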

Also, @mr-tz can you please tell me a bit more about what type of registry key/value analysis you think would be useful? I was thinking about listing: what keys are created, modified, and deleted; whether any of these registry keys are run when the computer starts up; whether functions like RegSetKeySecurity are called, which can grant malware increased privileges; and the registry keys that all of these actions relate to. But I don't want to add too many details for a simple default output - it might be good to add additional details in verbose and vverbose mode. What do you think of these things?
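To make the registry idea more concrete, here is a rough sketch of how traced registry calls could be grouped by action and checked for autorun keys. The `RegistryCall` record, the `REGISTRY_ACTIONS` mapping, and the `summarize_registry_calls` function are all hypothetical and only illustrate the proposal; they do not reflect how CAPE reports or capa extractors actually represent calls.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical mapping from Windows registry API names to the action they perform.
REGISTRY_ACTIONS = {
    "RegCreateKeyExW": "created",
    "RegSetValueExW": "modified",
    "RegDeleteKeyW": "deleted",
    "RegDeleteValueW": "deleted",
    "RegSetKeySecurity": "security changed",
}

# Keys under ...\CurrentVersion\Run (and RunOnce) are executed at startup or logon.
AUTORUN_MARKER = r"\currentversion\run"


@dataclass(frozen=True)
class RegistryCall:
    """Assumed shape of one traced API call: the API name and the key it touched."""

    api: str
    key: str


def summarize_registry_calls(calls: Iterable[RegistryCall]) -> Dict[str, List[str]]:
    """Group registry keys by action and collect autorun keys separately."""
    summary: Dict[str, List[str]] = {}
    for call in calls:
        action = REGISTRY_ACTIONS.get(call.api)
        if action is None:
            continue  # not a registry API we track
        summary.setdefault(action, []).append(call.key)
        if AUTORUN_MARKER in call.key.lower():
            summary.setdefault("autorun", []).append(call.key)
    return summary
```

The default output could then show only the "autorun" and "security changed" groups, with the full per-action lists reserved for the verbose modes.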

EDIT: I actually think it would be better to refactor the new features out of "common.py" into their own file(s), and also their rendering functions into files like "render_ip_addresses.py", "render_domains.py", "render_file_names.py", etc. Default, verbose, and vverbose rendering functions could be included in each of these files. I am going to work on this, and also work on making sure the suggested changes pass the CI tests.


aaronatp commented Dec 25, 2023

Hi @mr-tz, I hope you're having a nice week. Sorry for all the messages. I have changed the program design since the last commits - I realized that the default extractor/rendering functions should be easily extensible to verbose modes so I have restructured the program to reflect this (haven't pushed any additional code to GitHub though). Since I have changed these features' design fairly significantly, I just wanted to update this PR.

In the initial PR, the extraction functions (like "extract_domain_names", etc.) don't fully handle both dynamic and static inputs. These extraction functions: 1) extract strings from the input; and 2) parse the strings with regexes for, e.g., domain names and IP addresses. Furthermore, some static extractors in capa implement "extract_file_strings" in slightly different ways. This is roughly what I was thinking:

import re
from pathlib import Path
from typing import Dict, Generator, Iterator


def get_strings(sample: Path) -> Iterator[str]:  # we say 'buf = Path(path).read_bytes()' below - is path str?
    """different extractors implement 'extract_file_strings' in slightly different ways
    the format matching below is meant to: 1) find the extractor being used;
    and, 2) call 'extract_file_strings' as a given extractor requires"""
    # format_ is assumed to hold the input format, determined elsewhere (e.g. from the command-line arguments)
    # Static instances of extract_file_strings methods
    if format_ == FORMAT_IDA:
        strings = ida.helpers.extract_file_strings()
    elif format_ == FORMAT_GHIDRA:
        strings = ghidra.file.extract_file_strings()
    else:
        buf = sample.read_bytes()
        if format_ == FORMAT_PE:
            strings = pefile.extract_file_strings(buf)
        elif format_ == FORMAT_ELF:
            strings = elffile.extract_file_strings(buf)
        elif format_ == FORMAT_VIV:
            strings = viv.file.extract_file_strings(buf)
        # Dynamic
        elif format_ == FORMAT_CAPE:
            strings = cape.file.extract_file_strings(buf)
        else:
            # avoid returning an unbound variable for formats not listed above
            raise ValueError(f"unsupported format: {format_}")

    return strings


def default_extract_domain_names(sample: Path) -> Generator[str, None, None]:
    """yield the strings that match the web domain regex"""
    # See this Stack Overflow post that discusses the parts of this regex (http://stackoverflow.com/a/7933253/433790)
    domain_pattern = r"^(?!.{256})(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+(?:[a-z]{1,63}|xn--[a-z0-9]{1,59})$"
    for string in get_strings(sample):
        if re.search(domain_pattern, string):
            yield string


def verbose_extract_domain_names(sample: Path) -> Generator[str, None, None]:
    """yield a formatted summary for each web domain found in the list of strings"""
    # See this Stack Overflow post that discusses the parts of this regex (http://stackoverflow.com/a/7933253/433790)
    domain_pattern = r"^(?!.{256})(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+(?:[a-z]{1,63}|xn--[a-z0-9]{1,59})$"
    domain_counter_dict = {}
    for string in get_strings(sample):
        if re.search(domain_pattern, string):
            # count how many times each domain appears in the sample
            domain_counter_dict[string] = domain_counter_dict.get(string, 0) + 1

    # iterate over (domain, count) pairs; iterating the dict itself yields only keys
    for string, total_occurrences in domain_counter_dict.items():
        yield formatted_verbose(string, total_occurrences)


def formatted_verbose(string: str, total_occurrences: int) -> str:
    """
    example output:

    google.com
        |----IP addresses:
        |    |----192.0.0.1
        |    |----192.0.0.2
        |----Protocols used to communicate with google.com: HTTPS (2), HTTP (1)
        |----5 occurrences of google.com in sample

    """
    # adjacent f-strings inside parentheses are concatenated into one string
    return (
        f"{string}\n"
        f"    |----{ip_address_statement(string)}\n"
        f"    |----{network_protocol_statement(string)}\n"
        f"    |----{total_occurrences} occurrences of {string} in sample\n"
    )


import dns.resolver  # provided by the dnspython package


def ip_address_statement(string: str) -> str:
    resolver = dns.resolver.Resolver()
    # resolve the domain name to IPv4 addresses (resolve() replaces the deprecated query() in dnspython 2.x)
    answer = resolver.resolve(string, "A")
    ip_addresses = [rdata.to_text() for rdata in answer]
    if len(ip_addresses) == 1:
        return f"IP address: {ip_addresses[0]}"

    lines = ["IP addresses:"]
    # show at most five addresses and summarize the rest
    lines += [f"    |    |----{ip_address}" for ip_address in ip_addresses[:5]]
    if len(ip_addresses) > 5:
        lines.append(f"    |    |----{len(ip_addresses) - 5} IP addresses not shown")

    return "\n".join(lines)


def network_protocol_statement(string: str) -> str:
    # args is assumed to be the parsed command-line namespace, available at module level
    protocols = get_protocols(string, args.sample)
    if len(protocols) == 1:
        protocol = next(iter(protocols))
        return f"Protocol used to communicate with {string}: {protocol}"
    else:
        counts = ", ".join(f"{protocol} ({count})" for protocol, count in protocols.items())
        return f"Protocols used to communicate with {string}: {counts}"


def get_protocols(string: str, sample: Path) -> Dict[str, int]:
    domain_protocols = {}
    # use a separate loop variable so the domain passed in as 'string' is not overwritten
    for candidate in get_strings(sample):
        if candidate == string:
            # if we find the domain, look at its assembly context to see if the domain is used by e.g., an HTTP function, etc.
            context = get_context(candidate, sample)
            # tidy up the following
            if 'http_context' in context:
                increment_protocols(domain_protocols, "HTTP")
            elif 'https_context' in context:
                increment_protocols(domain_protocols, "HTTPS")

    return domain_protocols  # dict of all the protocols used to interact with a domain and number of times each interacts


def increment_protocols(protocols: dict, protocol: str) -> None:
    # updates the dict in place, so nothing is returned
    try:
        protocols[protocol] += 1
    except KeyError:
        protocols[protocol] = 1


def get_context(string: str, sample: Path):
    # if domain is passed as an argument to a function, get caller's name

    # if caller's name contains e.g., 'HTTP' return 'http_context'
    # etc
    raise NotImplementedError  # not implemented yet, see the question below

However, I have had trouble implementing the get_context function. Could you suggest some things that I could look at? I have been looking at Python's inspect module, which is useful for investigating a function's context, but I think it would be difficult to use inspect to determine a caller function's name if we are only given one parameter (the domain name). I am not especially familiar with vivisect, but is this something it is designed for?
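For dynamic traces, one possible direction is to skip the assembly context entirely and look at the recorded API calls whose arguments contain the domain. The sketch below is only an assumption about how that could look: the `ApiCall` record, the `PROTOCOL_HINTS` table, and `count_protocols` are hypothetical names, and classifying a protocol by a substring of the API name is a crude heuristic (it cannot, for example, tell HTTP from HTTPS).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ApiCall:
    """Assumed shape of one traced call: the API name and its string arguments."""

    name: str
    arguments: Tuple[str, ...]


# Hypothetical hints: a lowercase substring of the API name and the protocol it suggests.
PROTOCOL_HINTS = (("http", "HTTP"), ("dns", "DNS"), ("ftp", "FTP"))


def count_protocols(domain: str, calls: Iterable[ApiCall]) -> Counter:
    """Count, per protocol, the calls that receive the domain as an argument."""
    protocols: Counter = Counter()
    for call in calls:
        if not any(domain in argument for argument in call.arguments):
            continue  # this call never mentions the domain
        lowered = call.name.lower()
        for hint, protocol in PROTOCOL_HINTS:
            if hint in lowered:
                protocols[protocol] += 1
                break  # count each call at most once
    return protocols


if __name__ == "__main__":
    trace = [
        ApiCall("WinHttpConnect", ("example.com", "443")),
        ApiCall("DnsQuery_W", ("example.com",)),
    ]
    print(count_protocols("example.com", trace))
```

For static inputs the equivalent would presumably require cross-references from the disassembler (which functions reference the string, and which imports those functions call), which this sketch does not attempt.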

Also, can you suggest some features for a vverbose mode for outputting domain names? I am unsure what would be most useful (although I am still mulling over ideas!).

Apologies again for all the messages - just wanted to make sure my approach sounds reasonable before proceeding too far! But no worries if you're on holiday - Merry Christmas and Happy New Year!

@aaronatp aaronatp closed this Jan 31, 2024