# Authors

- Ikram Kohil, 2019115
- Johnatan Gao, 2013298

# Taint Analysis
In this lab, we will be implementing taint analysis to ensure that all data coming from the outside (user inputs) do not reach critical parts of the system.

# 1. Implementation and tests

In this first part, we will implement the taint analysis algorithm called Possibly Tainted Definitions. We will then test the implementation on four examples in the folder **part_1**.

## 1.1 Possibly Tainted Definitions

According to class notes, there are a couple of variables/concepts that we need to define in order to implement the conceptual algorithm.

$$
in\_taintedDefs: V \to \mathcal{P}(DEFS)
$$
$$
out\_taintedDefs: V \to \mathcal{P}(DEFS)
$$

In these signatures:

- V: the domain of both functions. Because every equation below indexes its sets by CFG node, we read V as the set of nodes (vertices) of the control-flow graph.
- P(DEFS): P denotes the power set, i.e. the set of all subsets of a given set, and DEFS is the set of all definitions in the program. in_taintedDefs and out_taintedDefs therefore map each node to a subset of DEFS: the definitions that may be tainted on entry to, and on exit from, that node.

$$
in\_taintedDefs(node) = \bigcup_{p \in preds(node)} out\_taintedDefs(p)
$$

This equation defines in_taintedDefs. The tainted definitions entering a node are the union (⋃) of the out_taintedDefs of all its predecessors preds(node), that is, of every node with an edge flowing into it.
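
The union over predecessors can be sketched in a few lines of Python. This is only an illustration: the `preds` and `out_tainted` dictionaries and the node and definition ids are hypothetical, and are not the data structures of the `code_analysis` library.

```python
from typing import Dict, List, Set


def compute_in(node: int,
               preds: Dict[int, List[int]],
               out_tainted: Dict[int, Set[int]]) -> Set[int]:
    """Return the union of the OUT sets of every predecessor of `node`."""
    result: Set[int] = set()
    for p in preds.get(node, []):
        # A predecessor without an OUT set yet contributes nothing
        result |= out_tainted.get(p, set())
    return result


# Hypothetical CFG fragment: nodes 1 and 2 both flow into node 3.
preds = {3: [1, 2]}
out_tainted = {1: {10}, 2: {10, 20}}
in_3 = compute_in(3, preds, out_tainted)
print(in_3)
```

A node without predecessors, such as the entry node, gets an empty set.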

$$
tainted\_GEN[node] =
\left\{
  \begin{array}{ll}
    \{ d_i \in defs[node] \mid \exists (d_j, r_k) \in defRefChains : r_k \in refs[node] \land d_j \in in\_taintedDefs[node] \} & \text{if } node \in EXPR \\
    \{ d_i \in defs[node] \} & \text{if } node \in SOURCES \\
    \emptyset & \text{if } node \in FILTERS \\
    \emptyset & \text{if } node \in SAFESET
  \end{array}
\right.
$$

This equation populates the GEN set. For each node of the CFG, one of four cases applies:
- If the node is in the FILTERS set, its tainted_GEN is empty.
- If the node is in the SAFESET set, its tainted_GEN is empty as well.
- If the node is in the SOURCES set, its tainted_GEN contains the definitions made at this node.
- If the node is in the EXPR set (in other words, it is an expression), then:
  - Take the definitions made at this node, if any.
  - Look for a definition/reference pair (d_j, r_k) in defRefChains whose reference r_k is used at this node.
  - If the definition d_j of such a pair is in the incoming tainted set (in_taintedDefs) of this node, the definitions made at this node are tainted.
  
In other words:
1. Check whether the current node is a definition.

2. If it is a definition and its right side is a source, mark the definition as tainted: its value is untrusted.

3. If it is a definition and its right side is a filter or a safe, do nothing: the value is considered filtered or safe.

4. If it is a definition and its right side is an expression, examine every reference in that expression. If at least one of them is tainted (i.e., marked as untrusted), the definition is tainted as well.
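
The four cases above can be sketched as a standalone function. This is a simplified illustration, not the lab's implementation: it classifies whole nodes as sources, filters, or safes exactly as the equation does, and every parameter name and id below is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def compute_gen(node: int,
                defs: Dict[int, Set[int]],
                refs: Dict[int, Set[int]],
                def_ref_chains: List[Tuple[int, int]],
                in_tainted: Set[int],
                sources: Set[int],
                filters: Set[int],
                safes: Set[int]) -> Set[int]:
    """Return the definitions of `node` that become tainted at this node."""
    node_defs = defs.get(node, set())
    # Filters and safes never generate tainted definitions
    if node in filters or node in safes:
        return set()
    # A source taints every definition made at the node
    if node in sources:
        return set(node_defs)
    # Expression: tainted if a reference used here is reached by a
    # definition that is tainted on entry to the node
    node_refs = refs.get(node, set())
    for d_j, r_k in def_ref_chains:
        if r_k in node_refs and d_j in in_tainted:
            return set(node_defs)
    return set()


# Hypothetical program: node 1 is `x = input()`, node 2 is `y = x + 1`,
# node 3 is `z = filter(y)`. Ids 100, 200, 300 are the definitions of
# x, y, z; ids 201 and 301 are the references to x and y.
defs = {1: {100}, 2: {200}, 3: {300}}
refs = {2: {201}, 3: {301}}
def_ref_chains = [(100, 201), (200, 301)]
sources, filters, safes = {1}, {3}, set()

gen_1 = compute_gen(1, defs, refs, def_ref_chains, set(), sources, filters, safes)
gen_2 = compute_gen(2, defs, refs, def_ref_chains, {100}, sources, filters, safes)
gen_3 = compute_gen(3, defs, refs, def_ref_chains, {100, 200}, sources, filters, safes)
print(gen_1, gen_2, gen_3)
```

In this example the source taints `x`, the taint propagates to `y` through the reference to `x`, and the filter stops it before `z`.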

$$
tainted\_KILL[node] = \{ d_k \mid var(d_k) = var(d_m) \land d_m \in defs[node] \}
$$

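This equation populates the KILL set. For every definition d_m made at the node, tainted_KILL contains every definition d_k of the same variable. In other words, each time a variable is defined or redefined, the definitions of that variable (including, as the formula is written, d_m itself) are added to the tainted_KILL set of this node.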
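
A minimal sketch of this set, assuming a hypothetical `var_of` dictionary that maps each definition id to the name of the variable it defines (this mapping is not part of the lab's input format):

```python
from typing import Dict, Set


def compute_kill(node: int,
                 defs: Dict[int, Set[int]],
                 var_of: Dict[int, str]) -> Set[int]:
    """Return every definition of a variable that `node` (re)defines."""
    # Variables defined at this node
    node_vars = {var_of[d_m] for d_m in defs.get(node, set())}
    # All definitions, anywhere in the program, of those variables
    return {d_k for d_k, var in var_of.items() if var in node_vars}


# Hypothetical program: definitions 100 and 300 both define x, 200 defines y.
var_of = {100: "x", 200: "y", 300: "x"}
defs = {1: {100}, 2: {200}, 3: {300}}

kill_3 = compute_kill(3, defs, var_of)
kill_2 = compute_kill(2, defs, var_of)
print(kill_3, kill_2)
```

Node 3 redefines `x`, so it kills both definitions of `x`; a node that defines nothing kills nothing.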

In [None]:
import os
import json
from pathlib import Path
from code_analysis import CFGReader
from code_analysis import CFG

# Global variable - directory where cfg.json and .dot files generated by our code will be stored 
part1_output_directory = "output/part_1/"
part2_output_directory = "output/part_2/"

# Utility functions taken from TP1
def get_json_files(extension, directory):
    # Recursively collect every file under `directory` that matches the glob pattern `extension`
    directory = Path(directory)
    return [str(file) for file in directory.rglob(extension)]

def create_output_file(filename, directory):
    # Create the output directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)

    # If the output file already exists, delete it so that we start from an empty file
    file_path = os.path.join(directory, filename)
    if os.path.exists(file_path):
        os.remove(file_path)

    # Open in "append" mode so that successive writes do not overwrite earlier ones
    return open(file_path, "a")

def close_output_file(file):
    file.close()

In [None]:
from typing import Dict, List, Set


class TaintAnalysisAlgorithm:
    def __init__(self, filename):
        self.cfg = None
        self.filename = filename

        # Dictionary containing all necessary parameters (safe, filter, etc)
        self.tainted_params: Dict[str, List[int]] = dict() # Format: key = param_type(safe/filter/etc) value = node_ids array

        # GEN and KILL dictionaries
        ## Dictionary containing all tainted_gen nodes for a specific node
        self.tainted_gen: Dict[int, Set[int]] = dict() # Format (for all following dictionaries): key = node_id, value = set of node_ids
        ## Dictionary containing all tainted_kill nodes for a specific node
        self.tainted_kill: Dict[int, Set[int]] = dict()

        # IN and OUT dictionaries
        ## Dictionary containing all tainted_in nodes for a specific node
        self.tainted_in: Dict[int, Set[int]] = dict()
        ## Dictionary containing all tainted_out nodes for a specific node
        self.tainted_out: Dict[int, Set[int]] = dict()

    def __init_tainted_params(self, taint_json_filename):
        # Read the JSON file and initialize the appropriate parameters in a dictionary
        with open(taint_json_filename) as taint_file:
            params = json.load(taint_file)
        self.tainted_params = {
            'defs': params['defs'],
            'refs': params['refs'],
            'pairs': params['pairs'],
            'sinks': params['sinks'],
            'filters': params['filters'],
            'safes': params['safes'],
            'sources': params['sources']
        }

    def get_nodes_tainted_gen(self, var_node_id, expr_node_id):
        ## To determine if the definition is tainted, we need to check the right side of the definition, and we need to check EACH node involved
        ## Ex: For definition x = y + z + w + 1, we need to check y, z and w. If AT LEAST one of them is tainted, then the definition is tainted
        ## To do so, we need to check for BinOP nodes

        # Populate tainted_gen according to the algorithm
        ## Start from an empty set: if the right side is a filter or a safe, the definition is not tainted and the set stays empty
        self.tainted_gen[var_node_id] = set()
        if expr_node_id in self.tainted_params['filters'] or expr_node_id in self.tainted_params['safes']:
            return
        ## If the right side is a source, the definition itself is tainted
        ## (assumption: a definition is identified by its variable node id, as in tainted_in and tainted_out)
        if expr_node_id in self.tainted_params['sources']:
            self.tainted_gen[var_node_id].add(var_node_id)
        ## TODO: expression case (check whether any reference on the right side is tainted)
    
    def get_nodes_tainted_kill(self, node_id):
        pass

    def get_taint_analysis(self, taint_json_filename):
        is_binOP_equal = lambda node_id: self.cfg.get_type(node_id) == "BinOP" and self.cfg.get_image(node_id) == "="

        # Start by initializing the relevant parameters for the analysis in order to populate the gen and kill dictionaries for each node
        self.__init_tainted_params(taint_json_filename)

        # Retrieve the node set, i.e. the list of all nodes in the cfg
        ## We found it simpler to iterate over the nodes than to recurse like we usually do, so as to follow the given algorithm as closely as possible
        node_set = self.cfg.get_node_ids()

        for node_id in node_set:
            # Only check the taint for definitions (since we are implementing the possibly tainted definitions algorithm):
            ## A definition is a BinOP node whose image is '='
            if is_binOP_equal(node_id):
                # Left child is the variable, right child is the value/expression
                variable_node_id, expression_node_id = self.cfg.get_op_hands(node_id)

                # Populate the gen set for the current definition
                self.get_nodes_tainted_gen(variable_node_id, expression_node_id)

                # Initialize the in and out dictionaries for the current node
                self.tainted_in[variable_node_id] = set()
                self.tainted_out[variable_node_id] = set()

                # Populate the kill set for the current node
                self.get_nodes_tainted_kill(node_id)