# GraphGuard

***Locate Classes in APKs after the Obfuscation Mapping has changed***


Processing Steps:
* Usage of Strings
* Method Signatures (Modifiers, Parameter Types, Number of Parameters...)
* Other methods in same class
* Analyze Method Calls (from and to) via Call Graph (Distance, Offsets, Graph Analysis)

In [None]:
%matplotlib notebook


from IPython.display import display, HTML
display(HTML("<style>div.output_area pre {white-space: pre;}</style>"))

In [None]:
import unittest
from collections import defaultdict, Counter
from os import path

from androguard.core.analysis.analysis import MethodAnalysis, ClassAnalysis, FieldAnalysis
from androguard.core.bytecode import FormatClassToJava
from androguard.misc import AnalyzeAPK
from androguard.session import Save, Session, Load

from formats import *
from decs import *

from matching import matcher, strings, structures

from start import process_files

# Loading Androguard

The following code loads the APK files and runs the Androguard analysis on them.

Loading should support multiprocessing, but the Pipe communication seems to break when the processed Androguard objects are transmitted. I suspect the objects are simply too big for Pickle to serialize, or too big for another component in the transmission chain.

In [None]:
AG_SESSION_FILE = "./Androguard.ag"
MULTIPROCESS_FILES = False  # Currently not working due to serialization issues


# Matching Rules
strings.MAX_USAGE_COUNT_STR = 20
strings.UNIQUE_STRINGS_MAJORITY = 2 / 3



# APK Files to load
file_paths = (
    "../../../Downloads/com.snapchat.android_10.85.5.74-2067_minAPI19(arm64-v8a)(nodpi)_apkmirror.com.apk",
    "../../../Downloads/com.snapchat.android_10.86.5.61-2069_minAPI19(arm64-v8a)(nodpi)_apkmirror.com.apk"
)

In [None]:
(a, d, dx), (a2, d2, dx2) = process_files(file_paths, MULTIPROCESS_FILES)

### Utility Functions to work with Androguard and Java Representations

* Converting Parameter types to TypeDescriptor Format
* Strip return type (not used for hooking)
* Method Representation Format

These helpers come with unit tests.
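
As a rough illustration of the first two bullets, the sketch below converts Java type names into DEX type descriptors and strips the return type from a method descriptor. The helper names `to_type_descriptor` and `strip_return_type` are hypothetical and are not taken from the `formats` module.

```python
# Hypothetical helpers for illustration; the notebook's own versions live in `formats`.
PRIMITIVES = {
    "void": "V", "boolean": "Z", "byte": "B", "short": "S", "char": "C",
    "int": "I", "long": "J", "float": "F", "double": "D",
}


def to_type_descriptor(java_type: str) -> str:
    """Convert a Java type name such as 'java.lang.String[]' to a type descriptor."""
    java_type = java_type.strip()
    dims = 0
    # Every trailing "[]" becomes one leading "[" in the descriptor.
    while java_type.endswith("[]"):
        java_type = java_type[:-2].strip()
        dims += 1
    if java_type in PRIMITIVES:
        base = PRIMITIVES[java_type]
    else:
        base = "L" + java_type.replace(".", "/") + ";"
    return "[" * dims + base


def strip_return_type(descriptor: str) -> str:
    """Keep only the parameter part of a method descriptor: '(I)V' -> '(I)'."""
    return descriptor[: descriptor.rindex(")") + 1]


print(to_type_descriptor("java.lang.String[]"))
print(strip_return_type("(I Ljava/lang/String;)V"))
```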

# Method Declarations

`MethodDec` is a lightweight declaration used as the internal representation of a method or hook.

It does not hold references to Androguard objects, which avoids high memory usage.
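
A minimal sketch of what such a declaration could look like. `MethodDecSketch` is a hypothetical stand-in and not the `MethodDec` class from `decs`: it takes the parameter types as a tuple, whereas `MethodDec` is called with them as positional arguments.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen makes instances hashable, so they can be dict keys
class MethodDecSketch:
    """Hypothetical stand-in: plain strings only, no Androguard objects."""
    class_name: str
    name: str
    param_types: Tuple[str, ...] = ()
    skip_params: bool = False  # True means the parameters are ignored when matching

    def pretty_format(self) -> str:
        params = "?" if self.skip_params else ", ".join(self.param_types)
        return f"{self.class_name}#{self.name}({params})"


dec = MethodDecSketch("rD5", "a", ("rD5", "qD5"))
lookup = {dec: "resolved later"}  # usable as a key because it is hashable
print(dec.pretty_format())
```

Storing only strings means the declaration stays valid after the Androguard analysis objects have been released.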

### List of Methods

Define the list of methods to find. Each entry requires the full class name.

In [None]:
decs_to_find = (
    MethodDec("rD5", "a", "rD5", "qD5"),
    MethodDec("MSg", "j0", "SGd"),
    MethodDec("x45", "h"),
    MethodDec("GIb", "<init>", skip_params=True)
)

# Processing

## Strings as Characteristics

Extract the strings used either directly in the given methods or in the classes that define them.

In [None]:
resolved_classes = resolve_classes(dx, decs_to_find)
resolved_methods = resolve_methods(decs_to_find, resolved_classes)
decs_ma = dict(zip(decs_to_find, resolved_methods))

if False:
    print("Resolved all Classes and Methods", *map(pretty_format_ma,resolved_methods), sep="\n* ")

### Utility functions for working with dx.get_strings()

Filter strings and their xrefs. Only strings with fewer than `MAX_USAGE_COUNT_STR` xrefs are used as characteristics to locate classes.

Build maps that associate each MethodDec and class name with the list of strings used in it.
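
The filtering and map building can be sketched on plain data as follows. The input layout (a string value mapped to a list of `(class_name, method_name)` pairs) and the function name `build_string_map` are assumptions made for this example, and the string values are made up.

```python
from collections import defaultdict

MAX_USAGE_COUNT_STR = 20  # same threshold as strings.MAX_USAGE_COUNT_STR above


def build_string_map(string_xrefs, max_usage=MAX_USAGE_COUNT_STR):
    """Map each class name to the rarely used strings referenced from it.

    `string_xrefs` is assumed to map a string value to a list of
    (class_name, method_name) pairs, one per cross-reference.
    """
    strings_by_class = defaultdict(list)
    for value, xrefs in string_xrefs.items():
        if len(xrefs) >= max_usage:
            continue  # too common to identify a class
        for class_name, _method_name in xrefs:
            strings_by_class[class_name].append(value)
    return dict(strings_by_class)


example = {
    "snap_upload_failed": [("LrD5;", "a")],
    "": [("LrD5;", "a")] * 25,  # referenced 25 times, so it is filtered out
    "media_cache": [("LMSg;", "j0"), ("LrD5;", "a")],
}
print(build_string_map(example))
```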

### Count occurrences of strings

Converting list of strings to a Counter object for faster comparisons
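
A short illustration of why a `Counter` is convenient here: it compares as a multiset, so the order in which strings were collected does not matter, while the number of occurrences does. The string values are made up.

```python
from collections import Counter

# Strings as they might be collected from one class, duplicates included.
old_strings = ["media_cache", "retry", "retry"]
new_strings = ["retry", "media_cache", "retry"]

# Counter compares as a multiset: order is ignored, multiplicity is not.
same_profile = Counter(old_strings) == Counter(new_strings)
different_count = Counter(old_strings) == Counter(["retry", "media_cache"])
print(same_profile, different_count)
```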

### Searching for Found Strings

Try to resolve classes and methods with the strings found previously:

* Load the second APK file
* Look up all previously found strings and build a map of potential matches (class name / method to Counter)
* Filter the potential matches by comparing both Counter objects
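
The steps above can be sketched on plain dictionaries. This is not the `StringMatcher` implementation. The function name, the obfuscated names `sE6` and `tF7`, and the choice to skip empty profiles are assumptions of this example.

```python
from collections import Counter


def exact_counter_matches(old_counters, new_counters):
    """Return {old_name: [new_name, ...]} for every identical string profile."""
    candidates = {}
    for old_name, old_counter in old_counters.items():
        if not old_counter:
            continue  # an empty profile would match every string-free class
        matches = [n for n, c in new_counters.items() if c == old_counter]
        if matches:
            candidates[old_name] = matches
    return candidates


old = {"LrD5;": Counter(["media_cache", "retry"]), "Lx45;": Counter()}
new = {
    "LsE6;": Counter(["retry", "media_cache"]),
    "LtF7;": Counter(["retry"]),
}
print(exact_counter_matches(old, new))
```

A single candidate can be treated as a match. Several candidates need to be narrowed down by the later steps.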

#### Try to resolve Method
Try to resolve classes and methods using only information about strings (exact Counter match)

In [None]:
accumulator = matcher.Accumulator()

args = (dx, dx2, resolved_classes, decs_ma)

In [None]:
string_matcher = strings.StringMatcher(*args, accumulator.get_unmatched_ms(decs_to_find))
candidates_cs, candidates_ms = string_matcher.compare_counters()

accumulator.add_candidates(candidates_cs, candidates_ms)

### Fallback by unique strings

To resolve the classes that are still unmatched, collect the unique strings of each class (strings used only by this class) and try to find the matching class by searching only for these unique strings.
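
A possible sketch of this fallback on plain data. How `UNIQUE_STRINGS_MAJORITY` is applied inside `compare_unique_strings` is not shown in this notebook. The example assumes that a candidate must contain at least that fraction of the unique strings. All names and strings below are made up.

```python
from collections import Counter

UNIQUE_STRINGS_MAJORITY = 2 / 3  # value from the configuration cell; its use here is assumed


def unique_strings(strings_by_class, class_name):
    """Strings that occur in `class_name` and in no other class of the same APK."""
    owners = Counter(s for strs in strings_by_class.values() for s in set(strs))
    return {s for s in strings_by_class[class_name] if owners[s] == 1}


def match_by_unique_strings(old_map, new_map, class_name):
    """Classes of the new APK that contain a majority of the unique strings."""
    wanted = unique_strings(old_map, class_name)
    if not wanted:
        return []  # nothing unique to search for
    return [
        new_name
        for new_name, strs in new_map.items()
        if len(wanted & set(strs)) / len(wanted) >= UNIQUE_STRINGS_MAJORITY
    ]


old_map = {"LrD5;": ["a", "b", "c", "shared"], "LMSg;": ["shared"]}
new_map = {"LsE6;": ["a", "b", "new"], "LtF7;": ["shared", "c"]}
print(match_by_unique_strings(old_map, new_map, "LrD5;"))
```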

In [None]:
candidates_cs = string_matcher.compare_unique_strings(accumulator.get_unmatched_cs(decs_to_find))

print()
print("| Applying Filtered Candidates to Accumulator")
print()
accumulator.add_candidates(candidates_cs)

### Using Class Information

Gather the following information for each class and try to find the correct class by looking for a matching "profile":
* Modifiers for Methods and Fields
* "static" Field and return types of Methods
* #Fields and #Methods
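
One way to picture such a profile is a small, hashable fingerprint built from information that survives renaming. The sketch below covers only modifiers and counts. `ClassProfile` and both helper functions are hypothetical and not part of `structures`.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class ClassProfile(NamedTuple):
    """Hypothetical, hashable summary of a class that survives renaming."""
    method_flags: Tuple[Tuple[str, int], ...]
    field_flags: Tuple[Tuple[str, int], ...]
    n_methods: int
    n_fields: int


def make_profile(method_flags, field_flags):
    # Sorted Counter items give an order-independent, hashable fingerprint.
    return ClassProfile(
        tuple(sorted(Counter(method_flags).items())),
        tuple(sorted(Counter(field_flags).items())),
        len(method_flags),
        len(field_flags),
    )


def exact_profile_matches(target, profiles_by_class):
    """Names of all classes whose profile equals `target`."""
    return [name for name, p in profiles_by_class.items() if p == target]


old = make_profile(["public", "public static", "private"], ["private final"])
new_profiles = {
    "LsE6;": make_profile(["private", "public", "public static"], ["private final"]),
    "LtF7;": make_profile(["public"], []),
}
print(exact_profile_matches(old, new_profiles))
```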

In [None]:
# Reload the module so that edits to `structures` take effect without restarting the kernel
import importlib
importlib.reload(structures)
structure_matcher = structures.StructureMatcher(*args, accumulator.get_unmatched_ms(decs_to_find))
candidates_cs = structure_matcher.get_exact_structure_matches(accumulator.get_unmatched_cs(decs_to_find))

print("| Applying Candidates to Accumulator")
print()
accumulator.add_candidates(candidates_cs)

### Fallback if Class was found

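If the class was found but the method could not be resolved, check each method of the class against the following criteria:

* Matching access flags
* Matching description (`get_usable_description`)
* Matching #xrefs_to
* Matching #xrefs_from
* Matching code length

The first pass requires every check to match exactly. A second, non-exact pass scores each candidate by the summed weights of the checks that match.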
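
A self-contained sketch of how such checks can be combined into a score. It reuses the weights of the `cfs` list from the next cell, but replaces the Androguard accessors with plain dictionary lookups. The field names and values are made up.

```python
# Each check is (feature extractor, weight); the features are plain dict lookups here.
CHECKS = [
    (lambda m: m["flags"], 4),
    (lambda m: m["descriptor"], 10),
    (lambda m: m["length"], 1),
    (lambda m: m["xrefs_to"], 1),
    (lambda m: m["xrefs_from"], 1),
]
TOTAL = sum(weight for _, weight in CHECKS)


def score(old, new):
    """Sum the weights of all checks on which both methods agree."""
    return sum(weight for check, weight in CHECKS if check(old) == check(new))


old = {"flags": "public", "descriptor": "(I)V", "length": 40, "xrefs_to": 3, "xrefs_from": 2}
new = {"flags": "public", "descriptor": "(I)V", "length": 44, "xrefs_to": 3, "xrefs_from": 2}

points = score(old, new)
print(f"{points}/{TOTAL} points, exact match: {points == TOTAL}")
```

An exact match corresponds to reaching the full score, while the non-exact pass accepts the best-scoring candidate above a minimum.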

In [None]:
# Compare functions as (function, weight) pairs; a matching result adds its weight to the score
cfs = [
    (MethodAnalysis.get_access_flags_string, 4),
    (get_usable_description, 10),
    (MethodAnalysis.get_length, 1),
    (lambda x: len(x.get_xref_to()), 1),
    (lambda x: len(x.get_xref_from()), 1)
]
total_score = sum((score for _, score in cfs))

In [None]:
MIN_MATCH_POINTS = 2

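def try_resolve_ms(exact):
    """Match unresolved methods inside classes that have already been matched.

    Reads and updates notebook-level state: m_not_found, c_not_found,
    matching_cs, matching_ms and candidates_ms.
    """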
    candidates_ms2 = defaultdict(set)

    for m in m_not_found:
        if m.class_name in c_not_found:
            print("> Could not find class of method", m.pretty_format())
            continue

        class_name1 = FormatClassToJava(m.class_name)
        class_name2 = matching_cs[class_name1]

        for ma1 in dx.get_class_analysis(class_name1).get_methods():
            if not m.equals_ma(ma1):
                continue

            m_match_points = {}
            for ma2 in dx2.get_class_analysis(class_name2).get_methods():
                
                if exact:
                    if all(c_fun(ma1) == c_fun(ma2) for c_fun, _ in cfs):
                        candidates_ms2[m].add(ma2)
                else:
                    x = sum(((c_fun(ma1) == c_fun(ma2)) * score for c_fun, score in cfs))
                    if x >= MIN_MATCH_POINTS:
                        m_match_points[ma2] = x

            if not exact:
                # default=0 keeps max() from raising when no candidate reached MIN_MATCH_POINTS
                max_matches = max(m_match_points.values(), default=0)
                c = [ma for ma, points in m_match_points.items() if points == max_matches]
                
                if len(c) == 0:
                    print("- Could not find any matching candidate for", pretty_format_ma(ma1))
                elif len(c) == 1:
                    print(f"+ Found single non-exact candidate for matching method. (Certainty of {(max_matches / total_score):.2f})", pretty_format_ma(ma1), pretty_format_ma(c[0]), sep="\n\t* ")
                    matching_ms[m] = c[0]
                else:
                    candidates_ms2[m] |= set(c)
                    print("* Found multiple non-exact candidates for matching method " + str(m) + ":\n\t*", "\n\t* ".join(map(pretty_format_ma, c)))
            break

    for m, ms_li in candidates_ms2.items():
        ms_li = list(ms_li)
        if len(ms_li) == 1:
            print("+ Found single candidate for matching method. Considering it a match!",
                  f"\n\t{m.pretty_format()} -> {ms_li[0]}")

            matching_ms[m] = ms_li[0]
            continue

        print(f"* Multiple Matches for method {m.pretty_format()}", *map(pretty_format_ma, ms_li), sep="\n\t* ")
        
        if m in candidates_ms:
            combined = set(ms_li) & set(candidates_ms[m])
            if len(combined) == 0:
                print("- Inner join on possible candidates resulted in no method match! for", m)
                continue
            if len(combined) == 1:
                el = list(combined)[0]
                print("+ Inner join concluded single matching candidate. Considering match!"
                      + f"\n\t{m.pretty_format()} -> {pretty_format_ma(el)}")
                matching_ms[m] = el
                continue
            if len(combined) < len(candidates_ms[m]):
                print(f".. Could narrow down search by combining candidates ({len(candidates_ms[m])} -> {len(combined)})")
                candidates_ms[m] = combined
            
    for m, ma2 in matching_ms.items():
        if m in candidates_ms:
            del candidates_ms[m]
        
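# m_not_found must exist before the first call, because try_resolve_ms reads it as a global
m_not_found = decs_to_find - matching_ms.keys()
try_resolve_ms(exact=True)
m_not_found = decs_to_find - matching_ms.keys()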

print("Using non-exact Checks")

try_resolve_ms(exact=False)
m_not_found = decs_to_find - matching_ms.keys()

print(len(m_not_found), "/", len(decs_to_find))

In [None]:
m_not_found = decs_to_find - matching_ms.keys()
print(len(m_not_found), "/", len(decs_to_find))

print()
print("Classes that have not been found:", *c_not_found, sep="\n* ")
print()
print("Resolved MethodDecs:")
for m, ma2 in matching_ms.items():
    print("*", m.pretty_format(), "->", pretty_format_ma(ma2))
for m in m_not_found:
    print("-", m.pretty_format())