
better management of unconfigured analyzers #306

Closed
mlodic opened this issue Dec 30, 2020 · 15 comments

Comments
@mlodic
Member

mlodic commented Dec 30, 2020

Regarding the closed issue #298: we have a problem with analyzers that require an API key when the user either does not have the key or has not configured it.

Problems faced by users:

  • they execute the "Run all available analyzers" option and get a lot of errors. This could be avoided by executing only the correctly configured analyzers.

  • they see those analyzers as "runnable" but, actually, they are not: when executed, they instantly crash. I think we should not allow the user to execute them without the configuration set but, at the same time, let the user understand from the GUI that those analyzers exist and can be appropriately configured (the requires_configuration flag could be enough for this).

This is easier said than done. ATM the API keys are checked in the code of each analyzer, while this should be done before the analysis actually runs. I am open to suggestions on how to solve this problem.
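
For context, this is roughly the current pattern, where each analyzer validates its own key only at run time (a minimal sketch; the get_secret helper and the exception name are illustrative, not the exact IntelOwl code):

# Rough sketch of the current situation: every analyzer checks its own API key
# only when it runs. Names (get_secret, AnalyzerRunException) are illustrative.
import os

class AnalyzerRunException(Exception):
    """Raised when an analyzer cannot run, e.g. because of a missing API key."""

def get_secret(name: str) -> str:
    return os.environ.get(name, "")

class SomeApiAnalyzer:
    def run(self):
        api_key = get_secret("SOME_ANALYZER_API_KEY")
        if not api_key:
            # The failure only surfaces here, after the job has already been started.
            raise AnalyzerRunException("no API key retrieved")
        # ... perform the actual request using api_key ...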

@eshaan7
Member

eshaan7 commented Dec 31, 2020

The way I see it, there are 2 cases here:

  • If the user explicitly selects an analyzer to run (configured or unconfigured), then showing the error in the logs and even in the final report makes sense.
  • In the case of run_all_available_analyzers, the analyzers that failed because they were not configured should not be in the final report.

My suggestion:

  • in the set_report_and_cleanup function, we can pass the boolean run_all_available_analyzers and implement a check so that the errored-out analyzers are not added to the final report when run_all_available_analyzers=True. (this can be done very easily and does not require us to change the implementation of analyzers; see the sketch after this list)

  • we can remove these 2 log statements from here.

    since it's already logged here:

    logger.warning(
    f"({analyzer_name}, job_id #{job_id}) -> set as FAILED. "
    f" Error message: {err_msg}"
    )

    this way the user is not bombarded with error messages.
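
A minimal sketch of the first suggestion (the report structure and the FAILED status string are assumptions based on this thread, not the exact IntelOwl code):

# Sketch: drop errored-out analyzers from the final report when the user chose
# "run all available analyzers". Field names and report shape are assumptions.
def set_report_and_cleanup(job_id, analyzer_reports, run_all_available_analyzers=False):
    if run_all_available_analyzers:
        # keep only the analyzers that actually produced a result
        analyzer_reports = [
            report for report in analyzer_reports
            if report.get("status") != "FAILED"
        ]
    # ... persist analyzer_reports on the Job and do the usual cleanup ...
    return analyzer_reports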

@eshaan7
Member

eshaan7 commented Dec 31, 2020

On another note, we should use a JSON logging format (for example, python-json-logger) instead of plain text. This would allow users to load the logs into Elastic's APM or other similar services. Thoughts?
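
For reference, a minimal python-json-logger setup could look like this (a sketch, not the project's current logging configuration):

# Sketch: emit every log record as a single JSON line using python-json-logger.
import logging
from pythonjsonlogger import jsonlogger

handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("intel_owl")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning("analyzer failed", extra={"analyzer_name": "Shodan", "job_id": 42})
# -> {"asctime": "...", "levelname": "WARNING", "name": "intel_owl",
#     "message": "analyzer failed", "analyzer_name": "Shodan", "job_id": 42}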

@mlodic
Member Author

mlodic commented Jan 2, 2021

About the JSON logger, I think it could be helpful for some users. However, it is already easy to ingest logs into Elasticsearch with the current log format, because it is pretty standard and would just require a simple grok filter + Logstash.
Maybe we could talk a little more about how to extend the Elasticsearch support in general (and for logs too). This could be another issue.

in the set_report_and_cleanup function, we can pass the boolean run_all_available_analyzers and implement a check that the error-ed out analyzers are not added to the final report when run_all_available_analyzers=True. (this can be done very easily and does not require us to change the implementation of analyzers)

This is a good compromise to avoid making bigger changes. However, it feels a bit like a hack, because the application would lie to the user, who would think that those analyzers were never executed.

@CITIZENDOT
Contributor

I'll suggest one method, but you might be disappointed by the amount of work that has to be done for it. Still, I feel it's a safe way.
First of all, we can agree that every analyzer either has an API key or doesn't. No need to worry about the second case: those analyzers will work even if unconfigured. (Correct me if I am wrong here.)

In the first case, I believe every API key follows a fixed regex pattern. We should include that pattern in analyzer_config.json and, before sending a request/starting the analysis, check whether the API key matches that regex. If not, set the status to Ignored/Unconfigured, and it shouldn't be counted as an error in the report.

Work that needs to be done: we should identify the regex patterns of all existing API keys, and do the same for future analyzers.
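
A rough sketch of that check (the api_key_regex and api_key_name fields shown here are hypothetical and simplified; they do not match the current analyzer_config.json layout):

# Sketch: validate an API key against a regex declared per analyzer.
# "api_key_regex" is a hypothetical field that would have to be added.
import json
import os
import re

def is_analyzer_configured(analyzer_name: str, config_path: str = "analyzer_config.json") -> bool:
    with open(config_path) as f:
        config = json.load(f)[analyzer_name]
    pattern = config.get("api_key_regex")
    if not pattern:
        return True  # this analyzer does not need an API key
    api_key = os.environ.get(config.get("api_key_name", ""), "")
    return re.fullmatch(pattern, api_key) is not None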

@mlodic
Member Author

mlodic commented Jan 12, 2021

Thank you for sharing your idea.

One of the problems is that there can be more than one "secret" (API keys or similar) for each analyzer, and this should remain as it is (for example, see Censys or HoneyDB). However, your idea of including the key names in analyzer_config.json could work.

At the moment, we are just using the requires_configuration parameter, shown to users in the interface, to tell them whether an analyzer needs to be configured. It is a boolean value. One idea could be to replace this value with a required_secrets parameter containing the list of secret names that the analyzer needs. Before executing the analyzers, in the general.py module, we could check if those secrets are available and, if not, simply not execute the related analyzer.
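
A minimal sketch of that pre-run check (required_secrets is the proposed field; the surrounding function is simplified and not the real general.py code):

# Sketch: before starting the analysis, skip analyzers whose required secrets
# are missing or empty. "required_secrets" is the proposed config field.
import logging
import os

logger = logging.getLogger(__name__)

def filter_configured_analyzers(analyzers_config: dict, requested: list) -> list:
    runnable = []
    for name in requested:
        required_secrets = analyzers_config[name].get("required_secrets", [])
        missing = [s for s in required_secrets if not os.environ.get(s, "").strip()]
        if missing:
            logger.info(f"skipping analyzer {name}: missing secrets {missing}")
            continue
        runnable.append(name)
    return runnable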

Also, required_secrets could be used in the GUI in place of requires_configuration (no need to keep that boolean value anymore) to show the names of the required secrets in the table.

In this way, if the user runs run_all_available_analyzers or several analyzers manually, the unconfigured analyzers would transparently not be executed, just like it works now for analyzers that do not support a specific type of file, for instance. Obviously, we should log this behavior somewhere.

On the other hand, if the user tries to execute an unconfigured analyzer alone, we should handle this special case in the API and show the user that the analysis did not start because that analyzer is unconfigured, returning a proper status code.

This should solve all the aforementioned problems. Thoughts?

@CITIZENDOT
Contributor

CITIZENDOT commented Jan 12, 2021

I agree with the rest, but we shouldn't rely only on the presence of the API key/secrets.

Because the recommended usage is to copy .env_app_template to .env_app, and in that process all the API keys are given a default value, that presence check would always return true, which shouldn't be the case.

I suggest performing a basic check, like whether that secret differs from the default value assigned in the template, or whether it matches that analyzer's regex pattern (more robust, but it requires a lot of effort).

Edit: I am mistaken, I was thinking of the test env file, which has test as the API key. I got your idea. Yes, a secret presence check will do the job.

@mlodic
Member Author

mlodic commented Jan 12, 2021

Yes, we could just check whether the key is empty or not; that is the idea.

@CITIZENDOT
Contributor

I'll work on this, if that's okay.

@mlodic
Member Author

mlodic commented Jan 12, 2021

Sure, feel free and thank you!

@eshaan7
Member

eshaan7 commented Jan 20, 2021

One idea could be to replace this value with a required_secrets parameter containing the list of secret names that the analyzer needs.

In this way, we are duplicating the API-key variable names under api_key_name and required_secrets. IMHO, this seems like a hack. I feel this solution wouldn't last long and we would soon have to implement it in some other way to allow for newer features in the future.

Since making big changes to analyzer_config.json can't be avoided, I'd prefer to do it another way. To be precise, something similar to what was done here. Basically, instead of just having a key-value pair for api_key_name, it would be an object where we can add further information about the particular config parameter.

PS: Now that I think about it, this would involve tonnes of changes everywhere, including the clients, and may be overkill for the problem we are trying to solve, which is just silencing unconfigured analyzers.

@eshaan7
Member

eshaan7 commented Jan 20, 2021

Before executing the analyzers, in the general.py module, we could check if those secrets are available and, if not, we could just not execute the related analyzer.

The start_analyzers() function in general.py is called synchronously when we make a call to the /send_analysis_request endpoint, so if we add these checks there it would significantly slow down the response time. Not to mention the overhead it would add, since we would then be checking/reading the environment variables twice for each analyzer.

in the set_report_and_cleanup function, we can pass the boolean run_all_available_analyzers and implement a check that the error-ed out analyzers are not added to the final report when run_all_available_analyzers=True. (this can be done very easily and does not require us to change the implementation of analyzers)

Whether we like it or not, all we can do is let the analyzer run (like it does now) and handle the exception by deciding whether or not to show it to the user.

@mlodic
Member Author

mlodic commented Jan 20, 2021

the start_analyzers() function in general.py is called synchronously when we make a call to the /send_analysis_request endpoint, so if we add these checks their it would significantly slow down the response time. Not to mention, the overhead that it would add since now we are checking/reading the environment variables twice for each analyzer.

Yes, there would be an overhead, but is it really relevant? We are trying to solve a user-experience problem that is worth a few milliseconds of computation. I know this is difficult to judge without a real benchmark, but I guess IntelOwl's bottleneck is not there; it is inside all the tools the analyzers use and all the APIs they interact with.

Also, we could avoid reading the environment variables twice if we pass those values to the Celery worker when we execute the analyzers. However, this would require small changes to all the analyzer modules. It would be painful work to do.
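
A rough sketch of that idea, assuming a hypothetical task signature (the real analyzer modules would need matching changes):

# Sketch: read the secrets once in the API process and pass them to the Celery
# task, so the worker does not read the environment again. The task signature
# and config fields are hypothetical.
import os
from celery import shared_task

@shared_task
def run_analyzer(analyzer_name: str, job_id: int, secrets: dict):
    # the analyzer receives its secrets instead of reading os.environ itself
    ...

def start_analyzers(analyzers_config: dict, analyzers_to_execute: list, job_id: int):
    for name in analyzers_to_execute:
        secret_names = analyzers_config[name].get("required_secrets", [])
        secrets = {s: os.environ.get(s, "") for s in secret_names}
        run_analyzer.delay(name, job_id, secrets)

One thing to weigh with this approach is that the secret values would travel through the Celery broker along with the task arguments.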

@m0mosenpai
Member

m0mosenpai commented Feb 28, 2021

One idea could be to replace this value with a required_secrets parameter containing the list of secret names that the analyzer needs.

In this way, we are duplicating the API-key variable names under api_key_name and required_secrets. IMHO, this seems like a hack. I feel this solution wouldn't last long and we would soon have to implement it in some other way to allow for newer features in the future.

Since making big changes to analyzer_config.json can't be avoided, I'd prefer to do it another way. To be precise, something similar to what was done here. Basically, instead of just having a key-value pair for api_key_name, it would be an object where we can add further information about the particular config parameter.

PS: Now that I think about it, this would involve tonnes of changes everywhere, including the clients, and may be overkill for the problem we are trying to solve, which is just silencing unconfigured analyzers.

I feel this would be nice too. Right now, analyzers that require API keys show up in the list when starting a job, even if they aren't configured properly. If the user forgets to set them up, they don't find out until they run the job and see the error in the result JSON.

These unconfigured analyzers could be grayed out in the list, and hovering over them could show a tooltip telling the user what exactly they forgot to configure. For accurate tooltips, we might need objects like @eshaan7 mentioned, where at the time of adding an analyzer we can specify the required configuration in terms of objects that we can check to know whether the analyzer has been fully configured. If any of those objects of that analyzer class, for example, is unconfigured/empty, we show that in the tooltip; once all have been configured, the analyzer is un-grayed and can be selected.

This will prevent any unconfigured analyzer from running accidentally or via the run_all_available_analyzers option, and will also give the user real-time feedback on what they are missing. It also feels like a more scalable solution IMO, although it would certainly require more work than the currently preferred solution.

@CITIZENDOT
Contributor

Yes, one way or another, that's our final goal: the user simply shouldn't be able to select an unconfigured analyzer. I forgot to post the idea here.

For now, I'm considering only run_all_available_analyzers and ignoring the error reports of unconfigured analyzers. In the future, we'll implement the above idea, i.e., knowing beforehand which analyzers are unconfigured and disabling them.

@eshaan7
Member

eshaan7 commented Apr 5, 2021

This is an example analyzer object in the new format of analyzer_config.json:

"Cuckoo_Scan": {
  "type": "file",
  "disabled": false,
  "description": "scan a file on a Cuckoo instance",
  "python_module": "cuckoo_scan.CuckooAnalysis",
  "config": {
    "soft_time_limit": 500,
    "queue": "long",
    "max_post_tries": 5,
    "max_poll_tries": 20,
  },
  "secrets": {
    "api_key_name": {
      "secret_name": "CUCKOO_API_KEY",
      "type": "string",
      "required": true,
      "default": null,
      "description": "API key for your cuckoo instance"
    }
  }
}

Here, config is basically the additional_config_params as before, except that currently we also store which secrets are to be read as part of additional_config_params. In this new implementation, there is a separate secrets field, which is its own JSON object and stores, for each secret, the related information: secret_name, required, default, type (choices: number, string, bool) and description.

And this is how it would be stored in the cache after verify_analyzer_config() is called:

"Cuckoo_Scan": {
  "type": "file",
  "disabled": false,
  "description": "scan a file on a Cuckoo instance",
  "python_module": "cuckoo_scan.CuckooAnalysis",
  "config": {
    "soft_time_limit": 500,
    "queue": "long",
    "max_post_tries": 5,
    "max_poll_tries": 20,
  },
  "secrets": {
    "api_key_name": {
      "secret_name": "CUCKOO_API_KEY",
      "type": "string",
      "required": true,
      "default": null,
      "description": "API key for your cuckoo instance"
    }
  },
  "verification": {
    "configured": false,
    "error_message": "'api_key_name' unknown: expected string, got int (3 of 4 satisfied)",
    "missing_secrets": ["api_key_name"]
  }
}

Since the secrets field now stores all the information about whether a secret is required, its default value, etc., we can loop over each analyzer object inside analyzer_config.json, traverse its secrets field, and verify each secret, thereby marking the analyzer as configured or unconfigured and constructing the error_message that will be shown on the frontend as a tooltip.
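
A condensed sketch of that loop (simplified; not the exact verify_analyzer_config() implementation):

# Sketch: mark each analyzer as configured/unconfigured by checking its
# "secrets" object and build the error message shown as the frontend tooltip.
import os

def verify_analyzer_config(analyzer_config: dict) -> dict:
    for name, obj in analyzer_config.items():
        missing, errors = [], []
        for param, secret in obj.get("secrets", {}).items():
            value = os.environ.get(secret["secret_name"], "")
            if not value and secret.get("required", False):
                missing.append(param)
                errors.append(f"'{param}' not set")
            # type checks (string/number/bool) could be added here as well
        obj["verification"] = {
            "configured": not missing,
            "error_message": "; ".join(errors) or None,
            "missing_secrets": missing,
        }
    return analyzer_config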

@eshaan7 eshaan7 moved this from To do to In progress in Analyzers Manager/ Analyzers Jun 25, 2021
@eshaan7 eshaan7 assigned eshaan7 and m0mosenpai and unassigned eshaan7 Jun 25, 2021
@eshaan7 eshaan7 added this to the v3.0 milestone Jul 8, 2021
GSoC '21 - IntelOwl Improvements automation moved this from In progress to Done Jul 22, 2021
@sp35 sp35 moved this from In progress to Done in Analyzers Manager/ Analyzers Jul 23, 2021