Commit

for idaholab#453, work in progress for pruning zeek extracted files.sh
mmguero committed Apr 9, 2024
1 parent 9517b9a commit 8250577
Showing 11 changed files with 166 additions and 21 deletions.
9 changes: 8 additions & 1 deletion Dockerfiles/file-monitor.Dockerfile
@@ -34,6 +34,9 @@ ARG EXTRACTED_FILE_SCANNER_START_SLEEP=10
ARG EXTRACTED_FILE_LOGGER_START_SLEEP=5
ARG EXTRACTED_FILE_MIN_BYTES=64
ARG EXTRACTED_FILE_MAX_BYTES=134217728
ARG EXTRACTED_FILE_PRUNE_THRESHOLD_MAX_SIZE=1TB
ARG EXTRACTED_FILE_PRUNE_THRESHOLD_TOTAL_DISK_USAGE_PERCENT=0
ARG EXTRACTED_FILE_PRUNE_INTERVAL_SECONDS=300
ARG VTOT_API2_KEY=0
ARG VTOT_REQUESTS_PER_MINUTE=4
ARG EXTRACTED_FILE_ENABLE_CLAMAV=false
@@ -65,6 +68,9 @@ ENV EXTRACTED_FILE_SCANNER_START_SLEEP $EXTRACTED_FILE_SCANNER_START_SLEEP
ENV EXTRACTED_FILE_LOGGER_START_SLEEP $EXTRACTED_FILE_LOGGER_START_SLEEP
ENV EXTRACTED_FILE_MIN_BYTES $EXTRACTED_FILE_MIN_BYTES
ENV EXTRACTED_FILE_MAX_BYTES $EXTRACTED_FILE_MAX_BYTES
ENV EXTRACTED_FILE_PRUNE_THRESHOLD_MAX_SIZE $EXTRACTED_FILE_PRUNE_THRESHOLD_MAX_SIZE
ENV EXTRACTED_FILE_PRUNE_THRESHOLD_TOTAL_DISK_USAGE_PERCENT $EXTRACTED_FILE_PRUNE_THRESHOLD_TOTAL_DISK_USAGE_PERCENT
ENV EXTRACTED_FILE_PRUNE_INTERVAL_SECONDS $EXTRACTED_FILE_PRUNE_INTERVAL_SECONDS
ENV VTOT_API2_KEY $VTOT_API2_KEY
ENV VTOT_REQUESTS_PER_MINUTE $VTOT_REQUESTS_PER_MINUTE
ENV EXTRACTED_FILE_ENABLE_CLAMAV $EXTRACTED_FILE_ENABLE_CLAMAV
@@ -134,7 +140,7 @@ RUN sed -i "s/main$/main contrib non-free/g" /etc/apt/sources.list.d/debian.sour
pkg-config \
tini \
unzip && \
apt-get -y -q install \
inotify-tools \
libzmq5 \
psmisc \
@@ -148,6 +154,7 @@ RUN sed -i "s/main$/main contrib non-free/g" /etc/apt/sources.list.d/debian.sour
python3 -m pip install --break-system-packages --no-compile --no-cache-dir \
clamd \
dominate \
humanfriendly \
psutil \
pycryptodome \
python-magic \
6 changes: 6 additions & 0 deletions config/zeek.env.example
@@ -32,6 +32,12 @@ EXTRACTED_FILE_PRESERVATION=quarantined
EXTRACTED_FILE_MIN_BYTES=64
# The maximum size (in bytes) for files to be extracted by Zeek
EXTRACTED_FILE_MAX_BYTES=134217728
# Prune ./zeek-logs/extract_files/ when it exceeds this size...
EXTRACTED_FILE_PRUNE_THRESHOLD_MAX_SIZE=1TB
# ... or when the *total* disk usage exceeds this percentage
EXTRACTED_FILE_PRUNE_THRESHOLD_TOTAL_DISK_USAGE_PERCENT=0
# Interval in seconds for checking whether to prune ./zeek-logs/extract_files/
EXTRACTED_FILE_PRUNE_INTERVAL_SECONDS=300
# Rate limiting for VirusTotal, ClamAV, YARA and capa with Zeek-extracted files
VTOT_REQUESTS_PER_MINUTE=4
CLAMD_MAX_REQUESTS=8
2 changes: 2 additions & 0 deletions docs/README.md
@@ -30,6 +30,7 @@ Malcolm can also easily be deployed locally on an ordinary consumer workstation
- [Malcolm Configuration](malcolm-config.md#ConfigAndTuning)
+ [Environment variable files](malcolm-config.md#MalcolmConfigEnvVars)
+ [Command-line arguments](malcolm-config.md#CommandLineConfig)
+ [Managing disk usage](malcolm-config.md#DiskUsage)
- [Configure authentication](authsetup.md#AuthSetup)
+ [Local account management](authsetup.md#AuthBasicAccountManagement)
+ [Lightweight Directory Access Protocol (LDAP) authentication](authsetup.md#AuthLDAP)
@@ -41,6 +42,7 @@ Malcolm can also easily be deployed locally on an ordinary consumer workstation
+ [Linux host system configuration](host-config-linux.md#HostSystemConfigLinux)
+ [macOS host system configuration](host-config-macos.md#HostSystemConfigMac)
+ [Windows host system configuration](host-config-windows.md#HostSystemConfigWindows)
- [Managing disk usage](malcolm-config.md#DiskUsage)
* [Running Malcolm](running.md#Running)
- [OpenSearch and Elasticsearch instances](opensearch-instances.md#OpenSearchInstance)
+ [Authentication and authorization for remote data store clusters](opensearch-instances.md#OpenSearchAuth)
1 change: 1 addition & 0 deletions docs/hedgehog.md
@@ -31,6 +31,7 @@ Hedgehog Linux is a Debian-based operating system built to
* [miscbeat](malcolm-hedgehog-e2e-iso-install.md#Hedgehogmiscbeat): System metrics forwarding
* [acl-configure](malcolm-hedgehog-e2e-iso-install.md#HedgehogACL): Configure ACL for artifact reachback from Malcolm
- [Autostart services](malcolm-hedgehog-e2e-iso-install.md#HedgehogConfigAutostart)
- [Managing disk usage](malcolm-hedgehog-e2e-iso-install.md#HedgehogDiskUsage)
+ [Zeek Intelligence Framework](hedgehog-config-zeek-intel.md#HedgehogZeekIntel)
* [Appendix A - Generating the ISO](hedgehog-iso-build.md#HedgehogISOBuild)
* [Appendix B - Generating a Raspberry Pi Image](hedgehog-raspi-build.md#HedgehogRaspiBuild)
3 changes: 1 addition & 2 deletions docs/index-management.md
@@ -2,9 +2,8 @@

Malcolm releases prior to v6.2.0 used environment variables to configure OpenSearch [Index State Management](https://opensearch.org/docs/latest/im-plugin/ism/index/) [policies](https://opensearch.org/docs/latest/im-plugin/ism/policies/).

Since then, OpenSearch Dashboards has developed and released plugins with UIs for [Index State Management](https://opensearch.org/docs/latest/im-plugin/ism/index/) and [Snapshot Management](https://opensearch.org/docs/latest/opensearch/snapshots/sm-dashboards/). Because these plugins provide a more comprehensive and user-friendly interface for these features, the old environment variable-based configuration code has been removed from Malcolm; with the exception of the code that uses the `OPENSEARCH_INDEX_SIZE_PRUNE_LIMIT` and `OPENSEARCH_INDEX_SIZE_PRUNE_NAME_SORT` [variables in `dashboards-helper.env`](malcolm-config.md#MalcolmConfigEnvVars), which deals with deleting the oldest network session metadata indices when the database exceeds a certain size.
Since then, OpenSearch Dashboards has developed and released plugins with UIs for [Index State Management](https://opensearch.org/docs/latest/im-plugin/ism/index/) and [Snapshot Management](https://opensearch.org/docs/latest/opensearch/snapshots/sm-dashboards/). Because these plugins provide a more comprehensive and user-friendly interface for these features, the old environment variable-based configuration code has been removed from Malcolm, with a few exceptions. See [**Managing disk usage**](malcolm-config.md#DiskUsage) for more information.

Note that OpenSearch index state management and snapshot management only deals with disk space consumed by OpenSearch indices: it does not have anything to do with PCAP file storage. The `MANAGE_PCAP_FILES` environment variable in the [`arkime.env` file](malcolm-config.md#MalcolmConfigEnvVars) can be used to allow Arkime to prune old PCAP files based on available disk space.

# <a name="ArkimeIndexPolicies"></a> Using ILM/ISM with Arkime

27 changes: 24 additions & 3 deletions docs/malcolm-config.md
@@ -13,7 +13,7 @@ Although the configuration script automates many of the following configuration
- `ARKIME_PASSWORD_SECRET` - the password hash secret for the Arkime viewer cluster (see `passwordSecret` in [Arkime INI Settings](https://arkime.com/settings)) used to secure the connection used when Arkime viewer retrieves a PCAP payload for display in its user interface
- `ARKIME_ROTATE_INDEX` - how often (based on network traffic timestamp) to [create a new index](https://arkime.com/settings#rotateIndex) in OpenSearch
- `ARKIME_QUERY_ALL_INDICES` - whether or not Arkime should [query all indices](https://arkime.com/settings#queryAllIndices) instead of trying to calculate which ones pertain to the search time frame (default `false`)
- `MANAGE_PCAP_FILES` – if set to `true`, all PCAP files imported into Malcolm will be marked as available for deletion by Arkime if available storage space becomes too low (default `false`)
- `MANAGE_PCAP_FILES` and `ARKIME_FREESPACEG` - these variables deal with PCAP [deletion by Arkime](https://arkime.com/faq#pcap-deletion), see [**Managing disk usage**](#DiskUsage) below
- `MAXMIND_GEOIP_DB_LICENSE_KEY` - Malcolm uses MaxMind's free GeoLite2 databases for GeoIP lookups. As of December 30, 2019, these databases are [no longer available](https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/) for download via a public URL. Instead, they must be downloaded using a MaxMind license key (available without charge [from MaxMind](https://www.maxmind.com/en/geolite2/signup)). The license key can be specified here for GeoIP database downloads during build- and run-time.
- The following variables configure [Arkime's use](index-management.md#ArkimeIndexPolicies) of OpenSearch [Index State Management (ISM)](https://opensearch.org/docs/latest/im-plugin/ism/index/) or Elasticsearch [Index Lifecycle Management (ILM)](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html):
+ `INDEX_MANAGEMENT_ENABLED` - if set to `true`, Malcolm's instance of Arkime will [use these features](https://arkime.com/faq#ilm) when indexing data
@@ -33,7 +33,9 @@ Although the configuration script automates many of the following configuration
- `DASHBOARDS_URL` - used primarily when `OPENSEARCH_PRIMARY` is set to `elasticsearch-remote` (see [OpenSearch and Elasticsearch instances](opensearch-instances.md#OpenSearchInstance)), this variable stores the URL for the [Kibana](https://www.elastic.co/kibana) instance into which Malcolm's dashboards and index templates will be imported
- `DASHBOARDS_PREFIX` – a string to prepend to the titles of Malcolm's prebuilt [dashboards](dashboards.md#PrebuiltVisualizations) upon import during Malcolm's initialization (default is an empty string)
- `DASHBOARDS_DARKMODE` – if set to `true`, [OpenSearch Dashboards](dashboards.md#DashboardsVisualizations) will be set to dark mode upon initialization (default `true`)
- `OPENSEARCH_INDEX_SIZE_PRUNE_LIMIT` - the maximum cumulative size OpenSearch indices are allowed to consume before the oldest indices are deleted; see [**Managing disk usage**](#DiskUsage) below
* **`filebeat.env`** - settings specific to [Filebeat](https://www.elastic.co/products/beats/filebeat), particularly for how Filebeat watches for new log files to parse and how it receives and stores [third-Party logs](third-party-logs.md#ThirdPartyLogs)
- `LOG_CLEANUP_MINUTES` and `ZIP_CLEANUP_MINUTES` - these variables deal with cleaning up already-processed log files; see [**Managing disk usage**](#DiskUsage) below
* **`logstash.env`** - settings specific to [Logstash](https://www.elastic.co/products/logstash)
- `LOGSTASH_OUI_LOOKUP` – if set to `true`, Logstash will map MAC addresses to vendors for all source and destination MAC addresses when analyzing Zeek logs (default `true`)
- `LOGSTASH_REVERSE_DNS` – if set to `true`, Logstash will perform a reverse DNS lookup for all external source and destination IP address values when analyzing Zeek logs (default `false`)
@@ -108,7 +110,7 @@ Although the configuration script automates many of the following configuration
- `EXTRACTED_FILE_HTTP_SERVER_KEY` – specifies the password for the ZIP archive if `EXTRACTED_FILE_HTTP_SERVER_ZIP` is `true`; otherwise, this specifies the decryption password for encrypted Zeek-extracted files in an `openssl enc`-compatible format (e.g., `openssl enc -aes-256-cbc -d -in example.exe.encrypted -out example.exe`)
- `EXTRACTED_FILE_IGNORE_EXISTING` – if set to `true`, files extant in `./zeek-logs/extract_files/` directory will be ignored on startup rather than scanned
- `EXTRACTED_FILE_PRESERVATION` – determines behavior for preservation of [Zeek-extracted files](file-scanning.md#ZeekFileExtraction)
- `EXTRACTED_FILE_UPDATE_RULES` – if set to `true`, file scanner engines (e.g., ClamAV, Capa, Yara) will periodically update their rule definitions (default `false`)
- `EXTRACTED_FILE_YARA_CUSTOM_ONLY` – if set to `true`, Malcolm will bypass the default Yara rulesets ([Neo23x0/signature-base](https://github.com/Neo23x0/signature-base), [reversinglabs/reversinglabs-yara-rules](https://github.com/reversinglabs/reversinglabs-yara-rules), and [bartblaze/Yara-rules](https://github.com/bartblaze/Yara-rules)) and use only [user-defined rules](custom-rules.md#YARA) in `./yara/rules`
- `VTOT_API2_KEY` – used to specify a [VirusTotal Public API v.20](https://www.virustotal.com/en/documentation/public-api/) key, which, if specified, will be used to submit hashes of [Zeek-extracted files](file-scanning.md#ZeekFileExtraction) to VirusTotal
- `ZEEK_AUTO_ANALYZE_PCAP_FILES` – if set to `true`, all PCAP files imported into Malcolm will automatically be analyzed by Zeek, and the resulting logs will also be imported (default `false`)
@@ -125,6 +127,7 @@ Although the configuration script automates many of the following configuration
- `ZEEK_LIVE_CAPTURE` - if set to `true`, Zeek will monitor live traffic on the local interface(s) defined by `PCAP_FILTER`
- `ZEEK_LOCAL_NETS` - specifies the value for Zeek's [`Site::local_nets`](https://docs.zeek.org/en/master/scripts/base/utils/site.zeek.html#id-Site::local_nets) variable (and `networks.cfg` for live capture) (e.g., `1.2.3.0/24,5.6.7.0/24`); note that by default, Zeek considers IANA-registered private address space such as `10.0.0.0/8` and `192.168.0.0/16` site-local
- `ZEEK_ROTATED_PCAP` - if set to `true`, Zeek can analyze captured PCAP files captured by `netsniff-ng` or `tcpdump` (see `PCAP_ENABLE_NETSNIFF` and `PCAP_ENABLE_TCPDUMP`, as well as `ZEEK_AUTO_ANALYZE_PCAP_FILES`); if `ZEEK_LIVE_CAPTURE` is `true`, this should be `false`; otherwise Zeek will see duplicate traffic
- See [**Managing disk usage**](#DiskUsage) below for a discussion of the variables that control automatic threshold-based deletion of the oldest [Zeek-extracted files](file-scanning.md#ZeekFileExtraction).

## <a name="CommandLineConfig"></a>Command-line arguments

@@ -148,4 +151,22 @@ options:

Note that the value for **any** argument not specified on the command line will be reset to its default (as if for a new Malcolm installation) regardless of the setting's current value in the corresponding `.env` file. In other words, users who want to use the `--defaults` option should carefully review all available command-line options and choose all that apply.

Similarly, [authentication](authsetup.md#AuthSetup)-related settings can also be set noninteractively by using the [command-line arguments](authsetup.md#CommandLineConfig) for `./scripts/auth_setup`.

## <a name="DiskUsage"></a>Managing disk usage

In instances where Malcolm is deployed with the intention of running indefinitely, the question eventually arises of what to do when the file systems used for storing Malcolm's artifacts (e.g., PCAP files, raw logs, [OpenSearch indices](index-management.md), [extracted files](file-scanning.md#ZeekFileExtraction), etc.) begin to fill up. Malcolm provides [options](#MalcolmConfigEnvVars) for tuning the "aging out" (deletion) of old artifacts to make room for newer data.

* PCAP deletion is configured by environment variables in **`arkime.env`**:
- `MANAGE_PCAP_FILES` – if set to `true`, all PCAP files imported into Malcolm will be marked as available for [deletion by Arkime](https://arkime.com/faq#pcap-deletion) if available storage space becomes too low (default `false`)
- `ARKIME_FREESPACEG` - when `MANAGE_PCAP_FILES` is `true`, this value is [used by Arkime](https://arkime.com/settings#freespaceg) to determine when to delete the oldest PCAP files. Note that this variable represents the amount of free (unused) space desired on the file system: e.g., a value of `5%` means "delete PCAP files if the amount of unused storage on the file system falls below 5%" (default `10%`).
* Zeek logs and Suricata logs are temporarily stored on disk as they are parsed, enriched, and indexed, and afterwards are periodically [pruned]({{ site.github.repository_url }}/blob/{{ site.github.build_revision }}/filebeat/scripts/clean-processed-folder.py) from the file system as they age, based on these variables in **`filebeat.env`**:
- `LOG_CLEANUP_MINUTES` - specifies the age, in minutes, at which already-processed log files should be deleted
- `ZIP_CLEANUP_MINUTES` - specifies the age, in minutes, at which the compressed archives containing already-processed log files should be deleted
* Files [extracted by Zeek](file-scanning.md#ZeekFileExtraction) and stored in the `./zeek-logs/extract_files/` directory can be periodically [pruned]({{ site.github.repository_url }}/blob/{{ site.github.build_revision }}/shared/bin/prune_files.sh) based on the following variables in **`zeek.env`**. If either of the two threshold limits defined here is met, the oldest extracted files will be deleted until the limit is no longer met. Setting either of the threshold limits to `0` disables that check.
- `EXTRACTED_FILE_PRUNE_THRESHOLD_MAX_SIZE` - specifies the maximum size, specified either in gigabytes or as a human-readable data size (e.g., `250G`), that the `./zeek-logs/extract_files/` directory is allowed to contain before the prune condition triggers
- `EXTRACTED_FILE_PRUNE_THRESHOLD_TOTAL_DISK_USAGE_PERCENT` - specifies a maximum fill percentage for the file system containing the `./zeek-logs/extract_files/` directory; in other words, if the disk is more than this percentage utilized, the prune condition triggers
- `EXTRACTED_FILE_PRUNE_INTERVAL_SECONDS` - the interval between checking the prune conditions, in seconds (default `300`)
* [Index management policies](index-management.md) can be handled via plugins provided as part of the OpenSearch and Elasticsearch platforms, respectively. In addition to those tools, the `OPENSEARCH_INDEX_SIZE_PRUNE_LIMIT` variable in **`dashboards-helper.env`** defines a maximum cumulative size that OpenSearch indices are allowed to consume before the oldest indices [are deleted]({{ site.github.repository_url }}/blob/{{ site.github.build_revision }}/shared/bin/opensearch_index_size_prune.py), specified either as a human-readable data size (e.g., `250G`) or as a percentage of the total disk size (e.g., `70%`): e.g., a value of `500G` means "delete the oldest OpenSearch indices if the total space consumed by Malcolm's indices exceeds five hundred gigabytes."

Similar settings exist for managing disk usage on [Hedgehog Linux](malcolm-hedgehog-e2e-iso-install.md#HedgehogDiskUsage).
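
The threshold-based pruning of Zeek-extracted files described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `prune_files.sh`: the function names `parse_size` and `prune_oldest`, the injectable `disk_usage` parameter, and the use of binary (1024-based) units are all assumptions made for the example. It does follow the documented behavior that a bare number is read as gigabytes, that `0` disables a check, and that the oldest files are deleted first until neither limit is exceeded.

```python
import os
import re
import shutil

# Hypothetical unit table using binary multiples; the real script may differ.
_MULTIPLIERS = {"B": 1}
for _power, _letter in enumerate("KMGT", start=1):
    for _suffix in (_letter, _letter + "B", _letter + "IB"):
        _MULTIPLIERS[_suffix] = 1024 ** _power


def parse_size(text):
    """Convert '250G', '1TB' or a bare number of gigabytes to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Z]*)", str(text).strip().upper())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    number, suffix = float(match.group(1)), match.group(2)
    if suffix == "":
        # Per the documentation, a value without units is in gigabytes.
        return int(number * 1024 ** 3)
    if suffix not in _MULTIPLIERS:
        raise ValueError(f"unrecognized unit: {suffix!r}")
    return int(number * _MULTIPLIERS[suffix])


def prune_oldest(directory, max_size_bytes=0, max_disk_percent=0,
                 disk_usage=shutil.disk_usage):
    """Delete the oldest files until no enabled threshold is exceeded.

    A threshold of 0 disables that check. Returns the deleted paths.
    """
    files = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            info = os.stat(path)
            files.append((info.st_mtime, info.st_size, path))
    files.sort()  # oldest modification time first
    total = sum(size for _mtime, size, _path in files)

    def over_limit():
        # Check 1: cumulative size of the directory.
        if max_size_bytes > 0 and total > max_size_bytes:
            return True
        # Check 2: fill percentage of the containing file system.
        if max_disk_percent > 0:
            usage = disk_usage(directory)
            return 100.0 * usage.used / usage.total > max_disk_percent
        return False

    deleted = []
    for _mtime, size, path in files:
        if not over_limit():
            break
        os.remove(path)
        total -= size
        deleted.append(path)
    return deleted
```

A caller would typically read the three `EXTRACTED_FILE_PRUNE_*` variables from the environment, convert the size threshold with `parse_size`, and invoke `prune_oldest` in a loop that sleeps for the configured interval between checks.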