# 1 Loading a Training Data Sample into Memory

In this example, we'll walk through a heuristics-based automated feature selection process for an autoencoder-based digital fingerprinting (DFP) workflow. Feature selection can be compute- and time-intensive on large datasets, so we recommend choosing a dataset that meets the following criteria for an initial exploratory analysis.

1. Tractable Size: Choose a sample of your dataset that can easily fit into GPU and host memory. As an example, limiting your sample size to ~10,000 rows will result in a runtime of approximately 2 minutes. We recommend experimenting with a few different dataset sizes to find one that works best for you.
2. Capturing Data Variance: Choose a sample of data that sufficiently captures the general behavior you're trying to model. For example, if you have a lot of categorical columns in your dataset, try to ensure your sample contains most, if not all, of the values your categorical variables can take. Also aim to capture as much variance (diversity) in your data as possible to get the best results out of this process.
3. Don't attempt to optimize on too many features: While the automated process is capable of handling an arbitrary number of features (within system limits), try to optimize your process on only those features that may have cyber-specific relevance to the behavior you're trying to capture. Including a very large number of features in this analysis will noticeably increase runtime and could also lead to suboptimal feature selection.
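
One way to sanity-check the second criterion is to measure how many of a categorical column's distinct values survive in a random sample. The sketch below is only an illustration using the standard library: `categorical_coverage` is a hypothetical helper (not part of Morpheus), and the toy rows stand in for your own records.

```python
import random


def categorical_coverage(full_rows, sample_rows, column):
    """Fraction of distinct values of `column` in full_rows that also appear in sample_rows."""
    full_values = {row.get(column) for row in full_rows}
    sample_values = {row.get(column) for row in sample_rows}
    if not full_values:
        return 1.0
    return len(full_values & sample_values) / len(full_values)


# Hypothetical toy records standing in for your own logs.
rows = [{"category": "A" if i % 3 else "B", "location": f"L{i % 5}"} for i in range(1000)]

random.seed(0)  # fixed seed so the sample is reproducible
sample = random.sample(rows, k=100)

for column in ("category", "location"):
    print(column, categorical_coverage(rows, sample, column))
```

A coverage well below 1.0 for a column you care about suggests drawing a larger or stratified sample before running feature selection.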

With that being said, let's explore what a sample dataset could look like. Please note that your data should be **JSON-formatted** and stored in JSON files for this to work. If you're reading from something like Parquet or CSV instead, we suggest reading it into a DataFrame with `pandas` and then converting it into a list of dictionaries using `df.to_dict(orient='records')`.
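
As a minimal sketch of that conversion, the snippet below reads a small in-memory CSV (a hypothetical stand-in for a real file) and turns it into a list of row dictionaries. The column values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = io.StringIO(
    "identity,category,durationMs\n"
    "user-a,NonInteractiveUserSignInLogs,0\n"
    "user-b,SignInLogs,12\n"
)

df = pd.read_csv(csv_text)  # for Parquet files, pd.read_parquet(path) plays the same role
records = df.to_dict(orient="records")  # one dictionary per row

print(f"Converted {len(records)} rows; first row keys: {list(records[0])}")
```

The resulting `records` list has the same list-of-dictionaries shape that the loader in the next cell produces from JSON files.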

In [1]:
import os
import json
from pprint import pprint

def concatenate_json_lists(directory_path):
    """
    Reads all JSON files in the specified directory, assuming each JSON file contains a list,
    and concatenates these lists into a single list.

    Parameters:
    - directory_path (str): The path to the directory containing the JSON files.

    Returns:
    - list: A concatenated list containing all elements from the lists in the JSON files.
    """
    concatenated_list = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".json"):
            file_path = os.path.join(directory_path, filename)
            try:
                with open(file_path, 'r') as json_file:
                    data = json.load(json_file)
                    if isinstance(data, list):
                        concatenated_list.extend(data)
                    else:
                        print(f"File {filename} does not contain a list.")
            except Exception as e:
                print(f"An error occurred while processing file {filename}: {e}")
    return concatenated_list

concatenated_data = concatenate_json_lists("/workspace/examples/data/dfp/azure-training-data/")  # Change the path to your data directory here.

print(f"The loaded data contains {len(concatenated_data)} rows. An example data entry looks like: \n")
pprint(concatenated_data[0])

The loaded data contains 3239 rows. An example data entry looks like: 

{'Level': 4,
 'callerIpAddress': '13.113.40.157',
 'category': 'NonInteractiveUserSignInLogs',
 'correlationId': '36764e36-d379-45f4-914a-96a69bd59ae5',
 'durationMs': 0,
 'identity': 'Attack Target',
 'location': 'XR',
 'operationName': 'Sign-in activity',
 'operationVersion': '1.0',
 'properties': {'appDisplayName': 'Articulate 360',
                'appId': '9c5b7fe3-0ad2-4ea6-94e5-9e0001f367e3',
                'appServicePrincipalId': None,
                'appliedConditionalAccessPolicies': [],
                'authenticationContextClassReferences': [],
                'authenticationDetails': [],
                'authenticationProcessingDetails': [],
                'authenticationProtocol': 'none',
                'authenticationRequirement': 'singleFactorAuthentication',
                'authenticationRequirementPolicies': [],
                'autonomousSystemNumber': 34974,
                'clientAppUsed

# 2 Performing Automated Feature Selection

## 2.1 Known Limitations and Considerations

Once we have the data oriented as a list of dictionary objects (JSON), we're ready to run the automated feature selection process. 

__The tool is domain agnostic__, which means that it doesn't select features that work well for your specific cyber workflow. Instead, it selects features that work well for an autoencoder model from a statistical perspective. This necessarily introduces a few considerations we recommend taking into account.

1. **Consult with cyber experts**: We recommend analyzing the output of the feature selection process with cybersecurity experts in your domain area to verify whether they believe the features contain enough 'signal' from a use-case perspective. Domain experts can also give you an idea of how they would solve the problem, which can inform your feature selection.
2. **Limit the possibility of overfitting**: Avoid selecting features from the tool's output that could lead to undesirable properties such as model overfitting. For example, avoid using features such as individually identifiable IP addresses, usernames, or MAC addresses outside of the actual user attribute in the DFP pipeline.
3. **Derived features**: This tool does not suggest or create any derived features. Such features are often helpful, if not critical, to the successful functioning of advanced ML workflows. Consider adding derived features to the data that could help capture the behavior of interest. This can often be done in collaboration with cyber domain experts, whose approach to the problem can be translated into derived features.
4. **Alternate encodings**: By default, the tool one-hot encodes any non-numeric features as categorical variables. Often, there may be better ways of representing the data that lend themselves well to your use case, such as encoding only the variables of interest or embedding text instead of one-hot encoding it. We recommend exploring alternatives once you've established a baseline.
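
To make the idea of a derived feature concrete, here is a small sketch that adds an hour-of-day field to each record before feature selection. It assumes the timestamp lives under a `time` key as an ISO 8601 string. The `add_hour_of_day` helper and the `hourOfDay` field name are hypothetical, and your logs may need a different parser.

```python
from datetime import datetime


def add_hour_of_day(records, timestamp_key="time", derived_key="hourOfDay"):
    """Return copies of records with an hour-of-day field (None if the timestamp is missing or unparsable)."""
    enriched = []
    for record in records:
        new_record = dict(record)  # shallow copy so the input list is left untouched
        try:
            new_record[derived_key] = datetime.fromisoformat(record[timestamp_key]).hour
        except (KeyError, TypeError, ValueError):
            new_record[derived_key] = None
        enriched.append(new_record)
    return enriched


# Hypothetical records; real logs carry many more fields.
sample_records = [
    {"identity": "user-a", "time": "2022-08-01T03:15:00+00:00"},
    {"identity": "user-b", "time": "2022-08-01T17:45:30+00:00"},
    {"identity": "user-c"},
]
enriched_records = add_hour_of_day(sample_records)
```

Whether hour of day carries any signal for your use case is exactly the kind of question to review with domain experts.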

Refer to the notebook demo [here](https://github.com/nv-morpheus/Morpheus/blob/branch-24.06/models/training-tuning-scripts/dfp-models/dfp-feature-selection-demo.ipynb) for a more detailed analysis of datasets and ideas on derived features. The automated method here is intended to be used purely as a starting point in your development. 

## 2.2 Running Automated Feature Selection

We can run automated feature selection with Morpheus' `AutoencoderFeatureSelector` class, which provides the following configurable parameters.

```python
"""
    A class to select features using an autoencoder, handling categorical and numerical data.
    Supports handling ambiguities in data types and ensures selected feature count constraints.

    Attributes:
        input_json (dict): List of dictionary objects to normalize into a dataframe
        id_column (str) : Column name that contains ID for morpheus AE pipeline. Default None.
        timestamp_column (str) : Column name that contains the
            timestamp for morpheus AE pipeline. Default None.
        encoding_dim (int): Dimension of the encoding layer, defaults to half of input
            dimensions if not set.
        batch_size (int): Batch size for training the autoencoder.
        variance_threshold (float) : Minimum variance a column must contain
            to remain in consideration. Default 0.
        null_threshold (float): Maximum proportion of null values a column can contain. Default 0.3.
        cardinality_threshold_high (float): Maximum proportion
            of cardinality to length of data allowable. Default 0.99.
        cardinality_threshold_low_n (int): Minimum cardinality for a feature to be considered numerical
            during type inference. Default 10.
        categorical_features (list[str]): List of features in the data to be considered categorical. Default [].
        numeric_features (list[str]): List of features in the data to be considered numeric. Default [].
        ablation_epochs (int): Number of epochs to train the autoencoder.
        device (str): Device to run the model on, defaults to 'cuda' if available.
        
"""
```
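
To build intuition for the null and cardinality thresholds, the sketch below computes comparable statistics for individual columns of a toy dataset. This is only an illustration: `column_stats` is a hypothetical helper, and the class's internal calculations may differ in detail.

```python
def column_stats(records, column):
    """Compute null proportion and cardinality figures for one top-level column."""
    total = len(records)
    values = [record.get(column) for record in records]
    non_null = [value for value in values if value is not None]
    # str() lets unhashable values such as lists be counted as well.
    cardinality = len({str(value) for value in non_null})
    return {
        "null_proportion": 1 - len(non_null) / total if total else 0.0,
        "cardinality": cardinality,
        "cardinality_ratio": cardinality / total if total else 0.0,
    }


# Hypothetical toy records standing in for real logs.
toy_records = [
    {"category": "SignInLogs", "correlationId": "a1", "location": None},
    {"category": "SignInLogs", "correlationId": "b2", "location": "US"},
    {"category": "AuditLogs", "correlationId": "c3", "location": None},
    {"category": "SignInLogs", "correlationId": "d4", "location": None},
]

for name in ("category", "correlationId", "location"):
    print(name, column_stats(toy_records, name))
```

In this toy data, `correlationId` is unique per row, so its cardinality ratio would exceed a `cardinality_threshold_high` of 0.99, and `location` is mostly null, so its null proportion would exceed a `null_threshold` of 0.3.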

In [2]:
import sys 

sys.path.insert(0, '/workspace/morpheus')

In [3]:
from models.dfencoder.ae_feature_selector import AutoencoderFeatureSelector 

selector = AutoencoderFeatureSelector(
    input_json = concatenated_data, 
    id_column = 'identity', #This is the entity you want to 'fingerprint'
    timestamp_column = 'time', #This is your log timestamp
    variance_threshold=0.1, #Removes cols. with variance lower than this
    null_threshold=0.3, #Removes cols. with null proportion greater than this
    cardinality_threshold_high=0.9, #Removes cols. with a cardinality proportion higher than this
    cardinality_threshold_low_n=10, #Cols. with cardinality below this are considered categorical
)

Once we've instantiated the class, we can run the feature selection as a one-line command. 

In [None]:
schemas = selector.select_features(
    k_min = 5, #Select at minimum 5 features
    k_max = 20, #Select at most 20 features
)

Categorical or numeric features not provided. Performing type inference which could be inaccurate.
Found sparse arrays when one-hot encoding. Consider using fewer categorical variables.
Not going to perform early-stopping. self.patience(=-1) is provided for early-stopping but validation is not enabled. Please set `run_validation` to True and provide a `validation_dataset` to enable early-stopping.


AE Ablation Study:   0%|          | 0/28 [00:00<?, ?it/s]


The output of the selection process is a report of the operations performed on the data, along with the features the class considers most important when building an autoencoder. The results are ranked using a generic __ablation study__, in which a very basic autoencoder is trained on various combinations of features to identify which features help it learn the 'most'.

Please note here, too, that all of these combinations are purely heuristic. We recommend augmenting or verifying them with more detailed analyses and with domain experts.

## 2.3 Extracting Schema Objects for Morpheus

The feature selector also dynamically builds JSON schema representations of the selected and raw features, which you can use in a Morpheus pipeline via the `JSONSchemaBuilder` class for `DataFrameInputSchemas`. Let's examine what the recommended initial schema for the provided data looks like:

In [5]:
pprint(schemas[1])

{'JSON_COLUMNS': ['properties'],
 'SCHEMA_COLUMNS': [{'data_column': 'category',
                     'dtype': 'string',
                     'type': 'ColumnInfo'},
                    {'data_column': 'operationName',
                     'dtype': 'string',
                     'type': 'ColumnInfo'},
                    {'data_column': 'tenantId',
                     'dtype': 'string',
                     'type': 'ColumnInfo'},
                    {'data_column': 'operationVersion',
                     'dtype': 'string',
                     'type': 'ColumnInfo'},
                    {'data_column': 'callerIpAddress',
                     'dtype': 'string',
                     'type': 'ColumnInfo'},
                    {'data_column': 'resultSignature',
                     'dtype': 'string',
                     'type': 'ColumnInfo'},
                    {'data_column': 'durationMs',
                     'dtype': 'int',
                     'type': 'ColumnInfo'},
                 

These schemas can also be saved to JSON files by providing the `raw_schema_path` and `preprocess_schema_path` path arguments to the `select_features` method.
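
If you already hold a schema dictionary in memory, Python's standard `json` module can write it out as well. The sketch below assumes the schema is a plain JSON-serializable dictionary in the shape printed above. The schema fragment and the file name are hypothetical, and the layout of the files written by `select_features` itself is not confirmed here.

```python
import json
import os
import tempfile

# Hypothetical schema fragment in the shape printed above.
schema = {
    "JSON_COLUMNS": ["properties"],
    "SCHEMA_COLUMNS": [
        {"data_column": "category", "dtype": "string", "type": "ColumnInfo"},
        {"data_column": "durationMs", "dtype": "int", "type": "ColumnInfo"},
    ],
}

# Write to a temporary directory; substitute your own path in practice.
schema_path = os.path.join(tempfile.mkdtemp(), "preprocess_schema.json")
with open(schema_path, "w", encoding="utf-8") as schema_file:
    json.dump(schema, schema_file, indent=2)

# Read the file back to confirm it round-trips unchanged.
with open(schema_path, encoding="utf-8") as schema_file:
    reloaded = json.load(schema_file)
```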

Please refer to the __Morpheus/examples/digital_fingerprinting/production/morpheus/notebooks/json_schema_builder.ipynb__ notebook for a demo on using JSON schema files.