# Implementation Task: Feature Extraction

<div class="alert alert-success alertsuccess">
[Task] Implement the functions <i>extract_existence_features</i>, <i>extract_numeric_features</i>, and <i>collect_features</i> to extract all possible features from a grammar and to parse each input file into its individual features.
</div>

## Overview

This task addresses semantic feature extraction from inputs. It follows the method of the tool Alhazen, which derives features such as *existence* and *numeric interpretation* from the input grammar. The feature values are then computed from the parse trees of the inputs. For further details, please refer to [Section 3 of the Alhazen paper](https://publications.cispa.saarland/3107/7/fse2020-alhazen.pdf).

The feature extraction task is broken down into the following key tasks **[60 points]**:

1. **Implementation of Individual Feature Classes:** Implement feature classes. Instances of these classes will be used to obtain specific feature values from inputs.

2. **Feature Extraction from Grammar:** Extract features from the grammar through the instantiation of the feature classes created in the previous step. **[20 points + 20 points]**

3. **Computation of Feature Vectors:** Compute feature vectors from a set of inputs. These vectors will subsequently be utilized as inputs for the decision tree to learn the circumstances of a program's failure. **[20 points]**

This hands-on task will give you an opportunity to apply the concepts we've discussed so far and put your programming and problem-solving skills into practice. Good luck!

<div class="alert alert-info">
[Info]: For more information about parsing inputs with a grammar, we recommend having a look at the chapters <a href="https://www.fuzzingbook.org/html/Grammars.html">Fuzzing with Grammars</a> and <a href="https://www.fuzzingbook.org/html/Parser.html">Parsing Inputs</a> of the fuzzingbook.
</div>

In [None]:
from typing import Tuple, List, Optional, Any, Union, Set, Callable, Dict
DerivationTree = Tuple[str, Optional[List[Any]]]
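
To make the `DerivationTree` type concrete, the following minimal sketch builds a tree by hand for the string `-7` and turns it back into text. The symbol names anticipate the calculator grammar defined in the next section. The helper `to_string` is an illustrative stand-in written for this example (fuzzingbook offers `tree_to_string` for the same purpose), and the convention that terminal leaves carry an empty child list is assumed here.

```python
from typing import Any, List, Optional, Tuple

DerivationTree = Tuple[str, Optional[List[Any]]]

# Hand-written tree for the string "-7": every node is (symbol, children).
# Terminal leaves carry an empty child list (assumed convention).
example_tree: DerivationTree = (
    "<term>", [
        ("-", []),
        ("<value>", [
            ("<integer>", [
                ("<digit>", [("7", [])]),
            ]),
        ]),
    ],
)


def to_string(tree: DerivationTree) -> str:
    """Concatenate all terminal leaves from left to right (illustrative helper)."""
    symbol, children = tree
    if not children:
        # Leaf: an unexpanded nonterminal contributes nothing, a terminal contributes itself.
        is_nonterminal = symbol.startswith("<") and symbol.endswith(">")
        return "" if is_nonterminal else symbol
    return "".join(to_string(child) for child in children)


print(to_string(example_tree))
```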

## Calculator Grammar Definition:

Let's start by defining the grammar for our calculator. We'll import the necessary components from the Fuzzingbook's `Grammars` module and then create our grammar.

In [None]:
from fuzzingbook.Grammars import Grammar, is_valid_grammar

calculator_grammar = {
    "<start>":
        ["<function>(<term>)"],

    "<function>":
        ["sqrt", "tan", "cos", "sin"],
    
    "<term>": ["-<value>", "<value>"], 
    
    "<value>":
        ["<integer>.<integer>",
         "<integer>"],

    "<integer>":
        ["<digit><integer>", "<digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

# Check if the grammar is valid
assert is_valid_grammar(calculator_grammar)

Our calculator grammar covers function names, an optional leading minus sign, and numeric values (integers and decimals). Together, these rules define the valid structure of the calculator's input. Checking the grammar's validity up front matters because every later step (parsing, feature extraction) relies on it.
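
Because such a grammar is a plain dictionary that maps each nonterminal to its list of alternatives, its structure can be inspected with ordinary Python. The sketch below uses a reduced copy of the grammar so that it is self-contained. The helper `nonterminals_in` and its regular expression are assumptions made for illustration, not part of fuzzingbook.

```python
import re
from typing import Dict, List

# Reduced copy of the calculator grammar, kept small for illustration.
small_grammar: Dict[str, List[str]] = {
    "<start>": ["<function>(<term>)"],
    "<function>": ["sqrt", "sin"],
    "<term>": ["-<digit>", "<digit>"],
    "<digit>": ["0", "1"],
}

# A nonterminal is written as <name>, without spaces or nested brackets.
NONTERMINAL = re.compile(r"<[^<> ]+>")


def nonterminals_in(expansion: str) -> List[str]:
    """Return the nonterminal symbols that occur in one alternative."""
    return NONTERMINAL.findall(expansion)


# Walk over every rule and every alternative, numbering alternatives from 0.
for rule, alternatives in small_grammar.items():
    for index, alternative in enumerate(alternatives):
        print(f"{rule}@{index}: {alternative!r} uses {nonterminals_in(alternative)}")
```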


## Implementing the feature classes

We'll begin with the implementation of the abstract base class Feature, which will serve as a template for different grammar features. This class will define the core structure and methods that every feature class should implement.

In [None]:
from abc import ABC, abstractmethod

class Feature(ABC):
    '''
    This is the abstract base class for grammar features.
    Any specific feature class should inherit from this base class and implement its abstract methods.
    
    Args:
        name : A unique identifier for this feature. It should not contain white spaces. 
               For example, 'type(<feature>@1)'.
        rule : The production rule associated with this feature (e.g., '<function>' or '<value>').
        key  : The specific attribute of the feature (e.g., the chosen alternative or the rule itself).
    '''
    
    def __init__(self, name: str, rule: str, key: str) -> None:
        self.name = name
        self.rule = rule
        self.key = key
        super().__init__()
    
    @abstractmethod
    def __repr__(self) -> str:
        '''Returns a string representation of the feature that
        can be printed for debugging or logging purposes.'''
        pass
    
    @abstractmethod
    def get_feature_value(self, derivation_tree) -> float:
        '''Computes and returns the feature value based on a given derivation tree of an input.
        The exact computation depends on the specific feature class.'''
        pass


The `ExistenceFeature` class, derived from the base `Feature` class, represents existence features of a grammar. Existence features denote the usage of a specific production rule in the derivation sequence of an input.

In [None]:
from fuzzingbook.GrammarFuzzer import expansion_to_children

class ExistenceFeature(Feature):
    '''
    This class captures the existence features of a grammar. The existence features indicate 
    whether a specific production rule was used in the derivation sequence of an input. 
    For a given production rule P -> A | B, a production existence feature for P and 
    alternative existence features for each alternative (i.e., A and B) are defined.
    
    Args:
        name : A unique identifier for this feature. It should not contain white spaces. 
               For example, 'exists(<digit>@1)'.
        rule : The production rule associated with this feature.
        key  : The feature key, which is equal to the rule attribute for production features, 
               or corresponds to the respective alternative for alternative features.
    '''
    def __init__(self, name: str, rule: str, key: str) -> None:
        super().__init__(name, rule, key)
    
    def __repr__(self) -> str:
        '''Returns a string representation of the feature.'''
        if self.rule == self.key:
            return f"exists({self.rule})"
        else:
            return f"exists({self.rule} == {self.key})"
    
    def get_feature_value(self, derivation_tree) -> float:
        '''Computes and returns the feature value based on a given derivation tree of an input.'''
        raise NotImplementedError  # You need to implement this method for the feature collection task
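
As a rough illustration of what an existence check has to decide, the sketch below walks a hand-written derivation tree and reports whether a rule, or a specific alternative of a rule, was used. It is one possible approach rather than the reference solution. The function names are hypothetical, and it assumes that joining the symbols of a node's children reproduces the alternative chosen for that node.

```python
from typing import Any, List, Optional, Tuple

DerivationTree = Tuple[str, Optional[List[Any]]]


def rule_used(tree: DerivationTree, rule: str) -> bool:
    """True if some node in the tree is labelled with `rule`."""
    symbol, children = tree
    if symbol == rule:
        return True
    return any(rule_used(child, rule) for child in children or [])


def alternative_used(tree: DerivationTree, rule: str, alternative: str) -> bool:
    """True if some `rule` node was expanded with exactly `alternative`.

    Assumption: joining the symbols of a node's children reproduces the
    alternative that was chosen for that node.
    """
    symbol, children = tree
    children = children or []
    if symbol == rule and "".join(child[0] for child in children) == alternative:
        return True
    return any(alternative_used(child, rule, alternative) for child in children)


# Hand-written tree for the string "-7".
tree: DerivationTree = (
    "<term>", [("-", []), ("<value>", [("<integer>", [("<digit>", [("7", [])])])])]
)

print(rule_used(tree, "<digit>"))                    # True
print(alternative_used(tree, "<term>", "-<value>"))  # True
print(alternative_used(tree, "<term>", "<value>"))   # False
```

Returning `1` or `0` instead of a boolean would match the float-valued signature of `get_feature_value`.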


#### NumericInterpretation Feature

The `NumericInterpretation` class, a subclass of the abstract `Feature` class, encapsulates numeric interpretation features of a grammar. These features are assigned to productions that exclusively derive strings composed of the characters [0-9], '.', and '-'. The feature value returned corresponds to the maximum floating-point numeric interpretation of the derived strings of a production.

In [None]:
from fuzzingbook.GrammarFuzzer import tree_to_string
from numpy import nanmax, isnan

class NumericInterpretation(Feature):
    '''
    This class represents numeric interpretation features of a grammar. These features
    are defined for productions that exclusively derive words composed of the characters
    [0-9], '.', and '-'. The feature value returned corresponds to the maximum
    floating-point numeric interpretation of the derived words of a production.

    Args:
        name : A unique identifier name for this feature. It should not contain white spaces. 
               For example, 'num(<integer>)'.
        rule : The production rule associated with this feature.
    '''
    def __init__(self, name: str, rule: str) -> None:
        super().__init__(name, rule, rule)
    
    def __repr__(self) -> str:
        '''Returns a string representation of the feature.'''
        return f"num({self.key})"
    
    def get_feature_value(self, derivation_tree) -> float:
        '''Computes and returns the feature value based on a given derivation tree of an input.'''
        raise NotImplementedError  # This method needs to be implemented for the feature collection task
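
The "maximum numeric interpretation" can be illustrated on a hand-written tree for `3.14`. The sketch below collects the string derived by every node labelled with a given rule, converts each to a float, and returns the largest value. It uses only the standard library (`math.nan` instead of numpy's `nanmax`), and all helper names are assumptions made for this example.

```python
import math
from typing import Any, List, Optional, Tuple

DerivationTree = Tuple[str, Optional[List[Any]]]


def derived_string(tree: DerivationTree) -> str:
    """Concatenate the terminal leaves below a node."""
    symbol, children = tree
    if not children:
        is_nonterminal = symbol.startswith("<") and symbol.endswith(">")
        return "" if is_nonterminal else symbol
    return "".join(derived_string(child) for child in children)


def derived_strings(tree: DerivationTree, rule: str) -> List[str]:
    """Strings derived by every node labelled `rule`, including nested ones."""
    symbol, children = tree
    found = [derived_string(tree)] if symbol == rule else []
    for child in children or []:
        found.extend(derived_strings(child, rule))
    return found


def max_numeric_value(tree: DerivationTree, rule: str) -> float:
    """Largest float among the strings derived by `rule`; NaN if there is none."""
    values = []
    for text in derived_strings(tree, rule):
        try:
            values.append(float(text))
        except ValueError:
            continue  # not interpretable as a number
    return max(values) if values else math.nan


# Hand-written tree for the string "3.14".
tree: DerivationTree = (
    "<value>", [
        ("<integer>", [("<digit>", [("3", [])])]),
        (".", []),
        ("<integer>", [
            ("<digit>", [("1", [])]),
            ("<integer>", [("<digit>", [("4", [])])]),
        ]),
    ],
)

print(max_numeric_value(tree, "<value>"))    # 3.14
print(max_numeric_value(tree, "<integer>"))  # 14.0 (candidates: "3", "14", "4")
print(max_numeric_value(tree, "<digit>"))    # 4.0
```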


## Task 1: Extracting the feature sets from the grammar

#### Task 1.1: Existence Feature

<div class="alert alert-success alertsuccess">
[Task] Implement the function <i>extract_existence_features()</i>. This function should extract all existence features from a given grammar and return them in a list.
</div>

In [None]:
def extract_existence_features(grammar: Grammar) -> List[ExistenceFeature]:
    '''
    Extracts all existence features from a given grammar and returns them in a list.

    Args:
        grammar: The input grammar from which to extract features.

    Returns:
        A list of existence features extracted from the grammar.
    '''
    
    # Your code goes here
    raise NotImplementedError("extract_existence_features: Function not yet implemented.")

The goal of this function is to traverse the input grammar and create an `ExistenceFeature` object for each rule and for each alternative of that rule. These objects should be stored in a list, which is then returned. For the calculator grammar, this yields 27 existence features: 6 for the rules and 21 for their alternatives.
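
The enumeration of rules and alternatives can be sketched with plain tuples, without the feature classes. The function `existence_triples` below is a hypothetical helper that produces `(name, rule, key)` triples in the shape expected by a constructor taking those three arguments. The `@i` naming scheme is one possible choice for unique, whitespace-free names.

```python
from typing import Dict, List, Tuple

calculator_grammar: Dict[str, List[str]] = {
    "<start>": ["<function>(<term>)"],
    "<function>": ["sqrt", "tan", "cos", "sin"],
    "<term>": ["-<value>", "<value>"],
    "<value>": ["<integer>.<integer>", "<integer>"],
    "<integer>": ["<digit><integer>", "<digit>"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
}


def existence_triples(grammar: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Enumerate (name, rule, key) for every rule and every alternative."""
    triples = []
    for rule, alternatives in grammar.items():
        # Production feature: the key equals the rule itself.
        triples.append((f"exists({rule})", rule, rule))
        for index, alternative in enumerate(alternatives):
            # Alternative feature: the key is the alternative, the name uses its index.
            triples.append((f"exists({rule}@{index})", rule, alternative))
    return triples


triples = existence_triples(calculator_grammar)
print(len(triples))  # 27 = 6 rules + 21 alternatives
print(triples[3])    # ('exists(<function>@0)', '<function>', 'sqrt')
```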

#### Task 1.2: NumericInterpretation Feature

<div class="alert alert-success alertsuccess">
[Task] Implement the function <i>extract_numeric_features()</i>. This function should extract all numeric interpretation features from the provided grammar and return them as a list.
</div>

In [None]:
def extract_numeric_features(grammar: Grammar) -> List[NumericInterpretation]:
    '''
    Extracts all numeric interpretation features from a given grammar and returns them in a list.

    Args:
        grammar: The input grammar from which to extract features.

    Returns:
        A list of numeric interpretation features extracted from the grammar.
    '''
    
    # Your code goes here
    raise NotImplementedError("extract_numeric_features: Function not yet implemented.")


The goal of this function is to inspect the input grammar and create a `NumericInterpretation` object for each rule that derives only strings composed of the characters [0-9], '.', and '-', i.e., strings that can be interpreted as numeric values. These objects should be stored in a list, which is then returned. For the calculator grammar, these are the rules `<term>`, `<value>`, `<integer>`, and `<digit>`.
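
One way to find such rules is a fixpoint computation: start by assuming every rule is numeric, then repeatedly discard rules that contain a non-numeric terminal character or refer to an already discarded rule. The sketch below follows this idea. The function `numeric_rules` and the regular expression are assumptions for illustration, and the sketch does not handle corner cases such as rules that never terminate.

```python
import re
from typing import Dict, List, Set

calculator_grammar: Dict[str, List[str]] = {
    "<start>": ["<function>(<term>)"],
    "<function>": ["sqrt", "tan", "cos", "sin"],
    "<term>": ["-<value>", "<value>"],
    "<value>": ["<integer>.<integer>", "<integer>"],
    "<integer>": ["<digit><integer>", "<digit>"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
}

NONTERMINAL = re.compile(r"<[^<> ]+>")
NUMERIC_CHARS = set("0123456789.-")


def numeric_rules(grammar: Dict[str, List[str]]) -> Set[str]:
    """Rules whose alternatives use only numeric characters and numeric rules."""
    candidates = set(grammar)
    changed = True
    while changed:
        changed = False
        for rule in list(candidates):
            for alternative in grammar[rule]:
                referenced = NONTERMINAL.findall(alternative)
                terminals = NONTERMINAL.sub("", alternative)
                has_bad_char = not set(terminals) <= NUMERIC_CHARS
                has_bad_rule = any(symbol not in candidates for symbol in referenced)
                if has_bad_char or has_bad_rule:
                    candidates.discard(rule)
                    changed = True
                    break
    return candidates


print(sorted(numeric_rules(calculator_grammar)))
# ['<digit>', '<integer>', '<term>', '<value>']
```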

<div class="alert alert-danger" role="alert">
[Note] For the 'Feature.name' attribute, use a unique identifier that does not contain any whitespace. For instance, you might choose an identifier similar to 'exists(&lt;function&gt;@1)' or 'exists(&lt;digit&gt;@0)'. Here, '@i' denotes the i-th derivation alternative of a rule, counted from 0.

For example, exists(&lt;digit&gt;@0) corresponds to exists(&lt;digit&gt; == 0), and exists(&lt;function&gt;@1) corresponds to exists(&lt;function&gt; == tan). These identifiers are used to simplify parsing. The friendly representation 'exists({self.rule} == {self.key})' is primarily intended for human readability.
</div>

The function `get_all_features()` combines `extract_existence_features()` and `extract_numeric_features()` to generate a comprehensive list of features from a given grammar.

In [None]:
def get_all_features(grammar: Grammar) -> List[Feature]:
    return extract_existence_features(grammar) + extract_numeric_features(grammar)

Subsequently, we can output all the extracted features to gain insights into the structure and numerical implications of the grammar rules. In the output, you'll be able to see each feature related to the calculator grammar, providing a deeper understanding of its structure and properties.

In [None]:
for feature in get_all_features(calculator_grammar):
    # transform your feature objects to strings (this will call the __repr__() function)
    print(str(feature))

## Test 1: Confirm that you have extracted the right number of features

In [None]:
def test_features(features: List[Feature]) -> None:
    existence_features = 0
    numeric_features = 0
    
    for feature in features:
        if isinstance(feature, ExistenceFeature):
            existence_features += 1
        elif isinstance(feature, NumericInterpretation):
            numeric_features += 1
            
    assert(existence_features == 27)
    assert(numeric_features == 4)
    
    expected_feature_names = {"exists(<start>)",
        "exists(<start> == <function>(<term>))",
        "exists(<function>)",
        "exists(<function> == sqrt)",
        "exists(<function> == tan)",
        "exists(<function> == cos)",
        "exists(<function> == sin)",
        "exists(<term>)",
        "exists(<term> == -<value>)",
        "exists(<term> == <value>)",
        "exists(<value>)",
        "exists(<value> == <integer>.<integer>)",
        "exists(<value> == <integer>)",
        "exists(<integer>)",
        "exists(<integer> == <digit><integer>)",
        "exists(<integer> == <digit>)",
        "exists(<digit>)",
        "exists(<digit> == 0)", 
        "exists(<digit> == 1)",
        "exists(<digit> == 2)",
        "exists(<digit> == 3)",
        "exists(<digit> == 4)",
        "exists(<digit> == 5)",
        "exists(<digit> == 6)",
        "exists(<digit> == 7)",
        "exists(<digit> == 8)",
        "exists(<digit> == 9)",
        "num(<term>)",
        "num(<value>)",
        "num(<digit>)",
        "num(<integer>)"
    }
    
    actual_feature_names = set([str(f) for f in features])
    
    for feature_name in expected_feature_names:
        assert (feature_name in actual_feature_names), f"Missing feature with name: {feature_name}"
        
    for feature_name in actual_feature_names:
        assert (feature_name in expected_feature_names), f"Unexpected feature with name: {feature_name}"
        
    print("All checks passed!")

In [None]:
if __name__ == "__main__":
    test_features(get_all_features(calculator_grammar))

## Task 2:  Extracting Feature Vectors from Inputs

<div class="alert alert-success alertsuccess">
[Task] Implement the function <i>collect_features(sample_list, grammar)</i>. The function should parse all inputs from a list of samples into their features.
</div>

**INPUT**:
the function requires the following parameters:
- sample_list: a list of samples that should be parsed
- grammar: the corresponding grammar of the syntactical features

**OUTPUT**: the function should return a pandas DataFrame of the parsed features for all inputs in the sample list:

|sample| feature_1     | feature_2     | ...    |feature_n|
|-------------|------------- |-------------|-------------|-----|
|sqrt(-900)| 1     | 0 | ...| -900 |
|cos(20)| 0     | 1 | ...| 20 |

<div class="alert alert-info">
[Hint]: It might be useful to implement the abstract method get_feature_value(self, derivation_tree) of the Feature class for each of the feature types (Existence, Numeric). Given a derivation tree, this method returns the value of the feature.
</div>
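
The table layout above can be produced from one dictionary per sample. The sketch below illustrates only this row-building step. The entries of `toy_features` are hypothetical stand-ins that work directly on the input string, whereas real features would be evaluated on a derivation tree via `get_feature_value`.

```python
from typing import Callable, Dict, List, Union

Row = Dict[str, Union[str, float]]

# Hypothetical stand-ins: real features would be evaluated on a derivation tree.
toy_features: Dict[str, Callable[[str], float]] = {
    "exists(<function>@0)": lambda s: float(s.startswith("sqrt(")),
    "exists(<term>@0)": lambda s: float("(-" in s),
    "num(<term>)": lambda s: float(s[s.index("(") + 1:-1]),
}


def build_rows(samples: List[str],
               features: Dict[str, Callable[[str], float]]) -> List[Row]:
    """Build one row per sample: the sample itself plus one value per feature."""
    rows: List[Row] = []
    for sample in samples:
        row: Row = {"sample": sample}
        for name, compute in features.items():
            row[name] = compute(sample)
        rows.append(row)
    return rows


rows = build_rows(["sqrt(-900)", "cos(20)"], toy_features)
for row in rows:
    print(row)
```

Passing such a list of dictionaries to the pandas `DataFrame` constructor yields one row per sample and one column per key.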

In [None]:
class ExistenceFeature(ExistenceFeature):
    
    def get_feature_value(self, derivation_tree) -> float:
        '''Computes and returns the feature value based on a given derivation tree of an input.'''
        raise NotImplementedError  # You need to implement this method for the feature collection task

In [None]:
class NumericInterpretation(NumericInterpretation):
    
    def get_feature_value(self, derivation_tree) -> float:
        '''Computes and returns the feature value based on a given derivation tree of an input.'''
        raise NotImplementedError  # This method needs to be implemented for the feature collection task

This task requires you to define the collect_features function. This function takes in a list of input samples and a grammar, and extracts a DataFrame representing the feature vectors for each input.

In [None]:
from fuzzingbook.Parser import EarleyParser
from fuzzingbook.Grammars import Grammar
from pandas import DataFrame

def collect_features(sample_list: List[str],
                     grammar: Grammar) -> DataFrame:
    
    # write your code here
    raise NotImplementedError("collect_features: Function not yet implemented.")

To illustrate the utility of this function, we'll take a sample list of inputs for a calculator application, ["sqrt(-900)", "sin(24)", "cos(-3.14)"], and extract their respective feature vectors using the calculator_grammar.

The `feature_data` object will then contain the feature vectors for each sample, providing a structured and digestible representation of their properties as they relate to the specified grammar.

In [None]:
sample_list = ["sqrt(-900)", "sin(24)", "cos(-3.14)"]
feature_data = collect_features(sample_list, calculator_grammar)

display(feature_data)

## Test 2: Check whether we produce the correct feature values

In [None]:
from collections import defaultdict  # required for feature_dict below

# TODO: Implement max values for multiple parse trees
def get_feature_vector(sample, grammar, features):
    '''Returns the feature vector of the sample as a dictionary of feature values'''
    
    feature_dict = defaultdict(int)
    
    earley = EarleyParser(grammar)
    for tree in earley.parse(sample):
        for feature in features:
            feature_dict[feature.name] = feature.get_feature_value(tree)
    
    return feature_dict

from sklearn.feature_extraction import DictVectorizer
import pandas as pd

# Features for each input, one dict per input
feature_vectors = [get_feature_vector(sample, calculator_grammar, get_all_features(calculator_grammar)) for sample in sample_list]

# Transform to numpy array
vec = DictVectorizer()
X = vec.fit_transform(feature_vectors).toarray()

df2 = pd.DataFrame(X, columns = vec.get_feature_names_out())

# Check if both dataframes are equal by element-wise comparing each column
if __name__ == "__main__":
    assert all(map(lambda col: all(feature_data[col] == df2[col]), df2.head()))
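
The TODO above concerns inputs with more than one parse tree, where the loop currently keeps the values of the last tree only. A possible way to combine the values, sketched below under the assumption that the maximum per feature is the desired result, is to merge the per-tree dictionaries while ignoring NaN entries. The function `merge_max` and the feature names in the example are hypothetical.

```python
import math
from typing import Dict, Iterable


def merge_max(per_tree_values: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine feature values of several parse trees by taking the maximum.

    NaN entries are replaced as soon as any tree provides a real value.
    """
    merged: Dict[str, float] = {}
    for values in per_tree_values:
        for name, value in values.items():
            current = merged.get(name, math.nan)
            if math.isnan(current) or (not math.isnan(value) and value > current):
                merged[name] = value
    return merged


# Two hypothetical parse trees of the same input, already evaluated.
tree_a = {"exists(<a>)": 1.0, "num(<n>)": math.nan}
tree_b = {"exists(<a>)": 0.0, "num(<n>)": 4.0}

print(merge_max([tree_a, tree_b]))  # {'exists(<a>)': 1.0, 'num(<n>)': 4.0}
```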

In [None]:
# TODO: handle multiple trees
from fuzzingbook.Parser import EarleyParser

def compute_feature_values(sample: str, grammar: Grammar, features: List[Feature]) -> Dict[str, float]:
    '''
    Extracts all feature values from an input.

    Args:
        sample   : The input.
        grammar  : The input grammar.
        features : The list of input features extracted from the grammar.

    Returns:
        A dictionary mapping each feature's string representation to its value.
    '''
    earley = EarleyParser(grammar)

    feature_values: Dict[str, float] = {}
    for tree in earley.parse(sample):
        for feature in features:
            # str(feature) calls __repr__(), e.g. "exists(<digit> == 0)"
            feature_values[str(feature)] = feature.get_feature_value(tree)
    return feature_values

In [None]:
def test_feature_values() -> None:

    sample_list = ["sqrt(-900)", "sin(24)", "cos(-3.14)"]

    expected_feature_values = {
        "sqrt(-900)": {
            "exists(<start>)" : 1,
            "exists(<start> == <function>(<term>))" : 1,
            "exists(<function>)" : 1,
            "exists(<function> == sqrt)" : 1,
            "exists(<function> == tan)" : 0,
            "exists(<function> == cos)" : 0,
            "exists(<function> == sin)" : 0,
            "exists(<term>)" : 1,
            "exists(<term> == -<value>)" : 1,
            "exists(<term> == <value>)" : 0,
            "exists(<value>)" : 1,
            "exists(<value> == <integer>.<integer>)" : 0,
            "exists(<value> == <integer>)" : 1,
            "exists(<integer>)" : 1,
            "exists(<integer> == <digit><integer>)" : 1,
            "exists(<integer> == <digit>)" : 1,
            "exists(<digit>)" : 1,
            "exists(<digit> == 0)" : 1,
            "exists(<digit> == 1)" : 0,
            "exists(<digit> == 2)" : 0,
            "exists(<digit> == 3)" : 0,
            "exists(<digit> == 4)" : 0,
            "exists(<digit> == 5)" : 0,
            "exists(<digit> == 6)" : 0,
            "exists(<digit> == 7)" : 0,
            "exists(<digit> == 8)" : 0,
            "exists(<digit> == 9)" : 1,
            "num(<term>)" : -900.0,
            "num(<value>)" : 900.0,
            "num(<digit>)" : 9.0,
            "num(<integer>)" : 900.0
        }, 
        "sin(24)": {
            "exists(<start>)" : 1,
            "exists(<start> == <function>(<term>))" : 1,
            "exists(<function>)" : 1,
            "exists(<function> == sqrt)" : 0,
            "exists(<function> == tan)" : 0,
            "exists(<function> == cos)" : 0,
            "exists(<function> == sin)" : 1,
            "exists(<term>)" : 1,
            "exists(<term> == -<value>)" : 0,
            "exists(<term> == <value>)" : 1,
            "exists(<value>)" : 1,
            "exists(<value> == <integer>.<integer>)" : 0,
            "exists(<value> == <integer>)" : 1,
            "exists(<integer>)" : 1,
            "exists(<integer> == <digit><integer>)" : 1,
            "exists(<integer> == <digit>)" : 1,
            "exists(<digit>)" : 1,
            "exists(<digit> == 0)" : 0,
            "exists(<digit> == 1)" : 0,
            "exists(<digit> == 2)" : 1,
            "exists(<digit> == 3)" : 0,
            "exists(<digit> == 4)" : 1,
            "exists(<digit> == 5)" : 0,
            "exists(<digit> == 6)" : 0,
            "exists(<digit> == 7)" : 0,
            "exists(<digit> == 8)" : 0,
            "exists(<digit> == 9)" : 0,
            "num(<term>)" : 24.0,
            "num(<value>)" : 24.0,
            "num(<digit>)" : 4.0,
            "num(<integer>)" : 24.0
        },
        "cos(-3.14)": {
            "exists(<start>)" : 1,
            "exists(<start> == <function>(<term>))" : 1,
            "exists(<function>)" : 1,
            "exists(<function> == sqrt)" : 0,
            "exists(<function> == tan)" : 0,
            "exists(<function> == cos)" : 1,
            "exists(<function> == sin)" : 0,
            "exists(<term>)" : 1,
            "exists(<term> == -<value>)" : 1,
            "exists(<term> == <value>)" : 0,
            "exists(<value>)" : 1,
            "exists(<value> == <integer>.<integer>)" : 1,
            "exists(<value> == <integer>)" : 0,
            "exists(<integer>)" : 1,
            "exists(<integer> == <digit><integer>)" : 1,
            "exists(<integer> == <digit>)" : 1,
            "exists(<digit>)" : 1,
            "exists(<digit> == 0)" : 0,
            "exists(<digit> == 1)" : 1,
            "exists(<digit> == 2)" : 0,
            "exists(<digit> == 3)" : 1,
            "exists(<digit> == 4)" : 1,
            "exists(<digit> == 5)" : 0,
            "exists(<digit> == 6)" : 0,
            "exists(<digit> == 7)" : 0,
            "exists(<digit> == 8)" : 0,
            "exists(<digit> == 9)" : 0,
            "num(<term>)" : -3.14,
            "num(<value>)" : 3.14,
            "num(<digit>)" : 4.0,
            "num(<integer>)" : 14.0
        }
    }

    all_features = get_all_features(calculator_grammar)
    for sample in sample_list:
        input_features = compute_feature_values(sample, calculator_grammar, all_features)

        for feature in all_features:
            key = str(feature)  # friendly representation, e.g. "exists(<digit> == 0)"
            #print(f"\t{key.ljust(50)}: {input_features[key]}")
            #print('"' + key + '"' + ' : ' + str(input_features[key]) + ',')
            expected = expected_feature_values[sample][key]
            actual = input_features[key]
            assert (expected == actual), f"Wrong feature value for sample {sample} and feature {key}: expected {expected} but is {actual}."
            
    print("All checks passed!")

In [None]:
if __name__ == "__main__":
    test_feature_values()