fuzzy matching for manufacturers based on OUI to NetBox list is not very good #393

Closed
mmguero opened this issue Jan 24, 2024 · 1 comment

mmguero commented Jan 24, 2024

Malcolm gets a big list of manufacturers from here. We preload some more from our own list.

During enrichment, particularly when we're doing auto-population, we use [fuzzy string matching](https://github.com/idaholab/Malcolm/blob/c78a043033eb24dbf05437c84abb875868325bc9/logstash/ruby/netbox_enrich.rb#L179-L189) to match up the list of NetBox manufacturers with the [list we use to map MAC address to OUI](https://www.wireshark.org/download/automated/data/manuf). The problem is, the two lists don't spell manufacturer names the same way. So, e.g., if NetBox has "Dell" and the OUI is "Dell Inc.", then "Dell Inc." can end up matching more closely to "Delta" than to "Dell", which is not ideal.

So what can we do? Adjust the thresholds? Try to standardize the manuf list on import (maybe filter out suffixes like LTD, INC, LLC, etc.)? Not sure, but it could certainly be improved.

EDIT: I've made the fuzzy string matching threshold configurable, and also added some cleaning code that uses simple rules and patterns to scrub the OUI/manufacturer names before comparison. It's much better now.
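
To illustrate the kind of cleanup described in the EDIT, here is a minimal Python sketch. It is not Malcolm's actual code (that lives in the Ruby file `netbox_enrich.rb`). The suffix list, the `clean_name` and `similarity` helpers, and the use of Python's `difflib` ratio as the similarity score are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list for illustration; Malcolm's real rules may differ.
CORPORATE_SUFFIXES = {"inc", "ltd", "llc", "corp", "co", "gmbh"}


def clean_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, strip trailing corporate suffixes."""
    words = re.sub(r"[^a-z0-9\s]", " ", name.lower()).split()
    while words and words[-1] in CORPORATE_SUFFIXES:
        words.pop()
    return " ".join(words)


def similarity(a: str, b: str) -> float:
    """Similarity score between 0 (dissimilar) and 1 (identical), via difflib."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    raw = similarity("Dell Inc.", "Dell")
    cleaned = similarity(clean_name("Dell Inc."), clean_name("Dell"))
    print(f"raw={raw:.3f} cleaned={cleaned:.3f}")
```

Once both names are scrubbed, "Dell Inc." and "Dell" compare as identical, so even a strict threshold accepts the match. Note that `difflib` scores strings differently from whatever matcher Malcolm uses, so the raw scores here will not reproduce the exact "Dell Inc." versus "Delta" ranking described above.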

mmguero added the bug (Something isn't working), logstash (Relating to Malcolm's use of Logstash), and netbox (Related to Malcolm's use of NetBox) labels Jan 24, 2024
mmguero added this to the v24.02.0 milestone Jan 24, 2024
mmguero self-assigned this Feb 8, 2024
mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Feb 12, 2024
…ased on OUI to NetBox list is not very good (broken)
mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Feb 13, 2024
…ased on OUI to NetBox list is not very good

mmguero commented Feb 13, 2024

From the updated documentation:

Matching device manufacturers to OUIs

Malcolm's NetBox inventory is prepopulated with a collection of community-sourced device type definitions, which can then be augmented by users manually or through preloading. During passive autopopulation, the device manufacturer is inferred from organizationally unique identifiers (OUIs), which make up the first three octets of a MAC address. The IEEE Standards Association maintains the registry of OUIs, but organizations are not necessarily consistent in how they specify the name associated with their OUI entry. In other words, there's no foolproof programmatic way for Malcolm to map MAC address OUI organization names to NetBox manufacturer names, short of creating and maintaining a manual mapping (which would be very large and difficult to keep up to date).
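
As a small illustration of the "first three octets" point, the sketch below pulls the OUI out of a MAC address in Python. The `oui_prefix` helper and its colon-separated output format are hypothetical and are not taken from Malcolm's code.

```python
import re


def oui_prefix(mac: str) -> str:
    """Return the first three octets of a MAC address as 'XX:XX:XX'."""
    # Drop separators (':', '-', '.') so common notations are all accepted.
    hex_digits = re.sub(r"[^0-9A-Fa-f]", "", mac)
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    first_six = hex_digits[:6].upper()
    return ":".join(first_six[i:i + 2] for i in (0, 2, 4))


if __name__ == "__main__":
    print(oui_prefix("00:1a:2b:3c:4d:5e"))
```

The resulting prefix is what would be looked up in an OUI registry file to obtain the organization name that is then compared against NetBox's manufacturers.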

Malcolm's NetBox lookup code used in the log enrichment pipeline attempts to match OUI organization names against the list of NetBox's manufacturers using "fuzzy string matching", a technique in which two strings of characters are compared and assigned a similarity score between 0 (completely dissimilar) and 1 (identical). The NETBOX_DEFAULT_FUZZY_THRESHOLD environment variable in netbox-common.env can be used to tune the threshold for determining a match. A fairly high value is recommended (above 0.85; 0.95 is the default) to avoid autopopulating the NetBox inventory with devices with manufacturers that don't actually exist in the network being monitored.
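
The thresholding described here can be sketched in Python as follows. This is only an approximation under stated assumptions: `best_match` and `fuzzy_threshold` are hypothetical helpers, and `difflib`'s ratio stands in for the similarity score, which may not be the algorithm Malcolm's Ruby code uses. Only the `NETBOX_DEFAULT_FUZZY_THRESHOLD` variable name and its 0.95 default come from the documentation.

```python
import os
from difflib import SequenceMatcher
from typing import Iterable, Optional


def fuzzy_threshold(default: float = 0.95) -> float:
    """Read NETBOX_DEFAULT_FUZZY_THRESHOLD, falling back to the default."""
    try:
        return float(os.environ.get("NETBOX_DEFAULT_FUZZY_THRESHOLD", default))
    except ValueError:
        return default


def best_match(oui_name: str, manufacturers: Iterable[str],
               threshold: float) -> Optional[str]:
    """Return the closest manufacturer, or None if nothing meets the threshold."""
    best_name, best_score = None, 0.0
    for candidate in manufacturers:
        score = SequenceMatcher(
            None, oui_name.casefold(), candidate.casefold()
        ).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    known = ["Dell", "Delta"]
    print(best_match("Dell", known, fuzzy_threshold()))
    print(best_match("Dell Inc.", known, fuzzy_threshold()))
```

With a strict threshold, an uncleaned name such as "Dell Inc." is rejected rather than being matched to the wrong manufacturer, which is the trade-off the recommended high value is meant to make.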

Users may select between two behaviors for when the match threshold is not met (i.e., no manufacturer is found in the NetBox database which closely matches the OUI organization name). This behavior is specified by the NETBOX_DEFAULT_AUTOCREATE_MANUFACTURER environment variable in netbox-common.env:

  • NETBOX_DEFAULT_AUTOCREATE_MANUFACTURER=false - the autopopulated device will be created with the manufacturer set to Unspecified
  • NETBOX_DEFAULT_AUTOCREATE_MANUFACTURER=true - the autopopulated device will be created along with a new manufacturer entry in the NetBox database set to the OUI organization name
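
The two behaviors listed above can be sketched as a single decision function. This Python version is an assumption-laden illustration, not Malcolm's implementation: `resolve_manufacturer` is a hypothetical name, the `autocreate` argument stands in for `NETBOX_DEFAULT_AUTOCREATE_MANUFACTURER`, and `difflib` is used for scoring.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple

UNSPECIFIED = "Unspecified"


def resolve_manufacturer(oui_name: str, known: Iterable[str],
                         threshold: float = 0.95,
                         autocreate: bool = True) -> Tuple[str, bool]:
    """Return (manufacturer name, whether a new entry would be created)."""
    best_name, best_score = None, 0.0
    for candidate in known:
        score = SequenceMatcher(
            None, oui_name.casefold(), candidate.casefold()
        ).ratio()
        if score > best_score:
            best_name, best_score = candidate, score

    if best_name is not None and best_score >= threshold:
        return best_name, False   # close enough: reuse the existing entry
    if autocreate:
        return oui_name, True     # no match: create a manufacturer from the OUI name
    return UNSPECIFIED, False     # no match: fall back to "Unspecified"


if __name__ == "__main__":
    print(resolve_manufacturer("Acme Widgets", ["Dell"], autocreate=False))
    print(resolve_manufacturer("Acme Widgets", ["Dell"], autocreate=True))
```

Returning the "created" flag alongside the name keeps the lookup itself free of side effects, so the caller decides when to actually write to the NetBox database.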

mmguero closed this as completed Feb 13, 2024
mmguero added a commit to mmguero-dev/Malcolm that referenced this issue Feb 13, 2024
…ased on OUI to NetBox list is not very good
This was referenced Feb 14, 2024
mmguero added a commit that referenced this issue Feb 15, 2024
Malcolm v24.02.0 contains new features, improvements, bug fixes and component version updates.

v24.01.0...v24.02.0

* Features and enhancements
    - [Hedgehog Linux SD card image for Raspberry Pi](https://idaholab.github.io/Malcolm/docs/hedgehog-raspi-build.html#HedgehogRaspiBuild) (#250; special thanks to @aut0exec for his work on this)
    - allow configuration of Arkime's ILM/ISM settings (#300)
    - add option for customizing which log types get NetBox enrichment (#316)
    - improve the extracted_files download page (#329)
    - include missing aggregations in API bucket queries (#386)
    - more intelligent .env file checking on startup (#387)
    - Malcolm report to itself on capture statistics (#395)
    - link to Dashboards/Arkime from NetBox devices view (#410)
    - changed default PCAP storage format to zstd(3) for new installations
    - various documentation updates and improvements
    - changed back to using official Zeek .deb files rather than building from source to reduce build times
* Component version updates
    - Arkime to [v5.0.0](https://github.com/arkime/arkime/blob/6914792d86ecba0009f9b49dabb1aa987e46ad26/CHANGELOG#L33-L130)
    - Capa to [v7.0.1](https://github.com/mandiant/capa/releases)
    - YARA to [v4.5.0](https://github.com/VirusTotal/yara/releases)
    - Beats to [v8.12.1](https://www.elastic.co/guide/en/beats/libbeat/current/release-notes-8.12.1.html)
    - Logstash to [v8.12.1](https://www.elastic.co/guide/en/logstash/current/logstash-8-12-1.html)
    - Zeek to [v6.1.1](https://github.com/zeek/zeek/releases/tag/v6.1.1)
* Bug fixes
    - pivot links from Arkime to Kibana in external elasticsearch are not working (#335)
    - redirect /dashboards/ link to Kibana in NGINX proxy in elasticsearch/kibana-based deployment (#403)
    - allow netbox-restore and netbox-backup to specify container name (#337)
    - fuzzy matching for manufacturers based on OUI to NetBox list is not very good (#393) (and [updated documentation](https://idaholab.github.io/Malcolm/docs/asset-interaction-analysis.html#NetBoxPopPassiveOUIMatch))
    - source.ip and destination.ip not set for parsed files.log entries for uploaded PCAP (#401)
    - event.severity_tags is not being assigned correctly based on rule.category (#402)
    - basic authentication breaks with special characters (#404)
    - changed some Logstash Ruby variables from global (`$`) to instance (`@`) (see ["avoiding concurrency issues"](https://www.elastic.co/guide/en/logstash/current/plugins-filters-ruby.html#plugins-filters-ruby-concurrency))
* Configuration changes (in [environment variables](https://idaholab.github.io/Malcolm/docs/malcolm-config.html#MalcolmConfigEnvVars) in [`./config/`](https://github.com/idaholab/Malcolm/blob/v24.02.0/config))
    * these variables in [`arkime.env`](https://github.com/idaholab/Malcolm/blob/main/config/arkime.env.example) to allow configuration of Arkime's ILM/ISM settings (#300)
    ```
    # These variables manage settings for Arkime's ILM/ISM features (https://arkime.com/faq#ilm)
    # Whether or not Arkime should perform index management
    INDEX_MANAGEMENT_ENABLED=false
    # Time in hours/days before moving to warm and force merge (number followed by h or d)
    INDEX_MANAGEMENT_OPTIMIZATION_PERIOD=30d
    # Time in hours/days before deleting index (number followed by h or d)
    INDEX_MANAGEMENT_RETENTION_TIME=90d
    # Number of replicas for older sessions indices
    INDEX_MANAGEMENT_OLDER_SESSION_REPLICAS=0
    # Number of weeks of history to retain
    INDEX_MANAGEMENT_HISTORY_RETENTION_WEEKS=13
    # Number of segments to optimize sessions for
    INDEX_MANAGEMENT_SEGMENTS=1
    # Whether or not Arkime should use a hot/warm design (storing non-session data in a warm index)
    INDEX_MANAGEMENT_HOT_WARM_ENABLED=false
    ```
    * these variables in [`dashboards.env`](https://github.com/idaholab/Malcolm/blob/main/config/dashboards.env.example) to override the values automatically configured for pivot links (#335) and `/dashboards/` redirect (#403) for Elasticsearch backend
    ```
    # These values are used to handle the Arkime value actions to pivot from Arkime
    #   to Dashboards. The nginx-proxy container's entrypoint will try to formulate
    #   them automatically, but they may be specified explicitly here.
    NGINX_DASHBOARDS_PREFIX=
    NGINX_DASHBOARDS_PROXY_PASS=
    ```
    * these variables in [`logstash.env`](https://github.com/idaholab/Malcolm/blob/main/config/logstash.env.example) for customizing which log types get NetBox enrichment (#316) and customizing which types of Zeek logs will be ignored (dropped) by LogStash
    ```
    # Which types of logs will be enriched via NetBox (comma-separated list of provider.dataset, or the string all to enrich all logs)
    LOGSTASH_NETBOX_ENRICHMENT_DATASETS=suricata.alert,zeek.conn,zeek.known_hosts,zeek.known_services,zeek.notice,zeek.signatures,zeek.software,zeek.weird
    ```
    ```
    # Zeek log types that will be ignored (dropped) by LogStash
    LOGSTASH_ZEEK_IGNORED_LOGS=analyzer,broker,bsap_ip_unknown,bsap_serial_unknown,capture_loss,cluster,config,ecat_arp_info,loaded_scripts,packet_filter,png,print,prof,reporter,stats,stderr,stdout
    ```
    * these variables in [`netbox-common.env`](https://github.com/idaholab/Malcolm/blob/main/config/netbox-common.env.example) for adjusting [matching device manufacturers to OUIs](https://idaholab.github.io/Malcolm/docs/asset-interaction-analysis.html#NetBoxPopPassiveOUIMatch) in NetBox autopopulation
    ```
    # Customize manufacturer matching/creation with LOGSTASH_NETBOX_AUTO_POPULATE (see logstash.env)
    NETBOX_DEFAULT_AUTOCREATE_MANUFACTURER=true
    NETBOX_DEFAULT_FUZZY_THRESHOLD=0.95
    ```
    * these variables in [suricata-live.env](https://github.com/idaholab/Malcolm/blob/main/config/suricata-live.env.example) and [zeek-live.env](https://github.com/idaholab/Malcolm/blob/main/config/zeek-live.env.example) that can be used to configure Malcolm reporting to itself on its Zeek and Suricata live capture statistics (#395)
    ```
    # Whether or not to enable capture statistics and include them in eve.json
    SURICATA_STATS_ENABLED=false
    SURICATA_STATS_EVE_ENABLED=false
    SURICATA_STATS_INTERVAL=30
    SURICATA_STATS_DECODER_EVENTS=false
    ```
    ```
    # Set ZEEK_DISABLE_STATS to blank to generate stats.log and capture_loss.log
    ZEEK_DISABLE_STATS=true
    ```
    * this variable in [zeek.env](https://github.com/idaholab/Malcolm/blob/main/config/zeek.env.example) related to the improvements to the extracted_files download page (#329)
    ```
    # Whether or not to use libmagic to show MIME types for Zeek-extracted files served
    EXTRACTED_FILE_HTTP_SERVER_MAGIC=false
    ```
mmguero added a commit to cisagov/Malcolm that referenced this issue Feb 15, 2024
Malcolm v24.02.0 contains new features, improvements, bug fixes and component version updates.

v24.01.0...v24.02.0

* Features and enhancements
    - [Hedgehog Linux SD card image for Raspberry Pi](https://cisagov.github.io/Malcolm/docs/hedgehog-raspi-build.html#HedgehogRaspiBuild) (idaholab#250; special thanks to @aut0exec for his work on this)
    - allow configuration of Arkime's ILM/ISM settings (idaholab#300)
    - add option for customizing which log types get NetBox enrichment (idaholab#316)
    - improve the extracted_files download page (idaholab#329)
    - include missing aggregations in API bucket queries (idaholab#386)
    - more intelligent .env file checking on startup (idaholab#387)
    - Malcolm report to itself on capture statistics (idaholab#395)
    - link to Dashboards/Arkime from NetBox devices view (idaholab#410)
    - changed default PCAP storage format to zstd(3) for new installations
    - various documentation updates and improvements
    - changed back to using official Zeek .deb files rather than building from source to reduce build times
* Component version updates
    - Arkime to [v5.0.0](https://github.com/arkime/arkime/blob/6914792d86ecba0009f9b49dabb1aa987e46ad26/CHANGELOG#L33-L130)
    - Capa to [v7.0.1](https://github.com/mandiant/capa/releases)
    - YARA to [v4.5.0](https://github.com/VirusTotal/yara/releases)
    - Beats to [v8.12.1](https://www.elastic.co/guide/en/beats/libbeat/current/release-notes-8.12.1.html)
    - Logstash to [v8.12.1](https://www.elastic.co/guide/en/logstash/current/logstash-8-12-1.html)
    - Zeek to [v6.1.1](https://github.com/zeek/zeek/releases/tag/v6.1.1)
* Bug fixes
    - pivot links from Arkime to Kibana in external elasticsearch are not working (idaholab#335)
    - redirect /dashboards/ link to Kibana in NGINX proxy in elasticsearch/kibana-based deployment (idaholab#403)
    - allow netbox-restore and netbox-backup to specify container name (idaholab#337)
    - fuzzy matching for manufacturers based on OUI to NetBox list is not very good (idaholab#393) (and [updated documentation](https://cisagov.github.io/Malcolm/docs/asset-interaction-analysis.html#NetBoxPopPassiveOUIMatch))
    - source.ip and destination.ip not set for parsed files.log entries for uploaded PCAP (idaholab#401)
    - event.severity_tags is not being assigned correctly based on rule.category (idaholab#402)
    - basic authentication breaks with special characters (idaholab#404)
    - changed some Logstash Ruby variables from global (`$`) to instance (`@`) (see ["avoiding concurrency issues"](https://www.elastic.co/guide/en/logstash/current/plugins-filters-ruby.html#plugins-filters-ruby-concurrency))
* Configuration changes (in [environment variables](https://cisagov.github.io/Malcolm/docs/malcolm-config.html#MalcolmConfigEnvVars) in [`./config/`](https://github.com/cisagov/Malcolm/blob/v24.02.0/config))
    * these variables in [`arkime.env`](https://github.com/cisagov/Malcolm/blob/main/config/arkime.env.example) to allow configuration of Arkime's ILM/ISM settings (idaholab#300)
    ```
    # These variables manage settings for Arkime's ILM/ISM features (https://arkime.com/faq#ilm)
    # Whether or not Arkime should perform index management
    INDEX_MANAGEMENT_ENABLED=false
    # Time in hours/days before moving to warm and force merge (number followed by h or d)
    INDEX_MANAGEMENT_OPTIMIZATION_PERIOD=30d
    # Time in hours/days before deleting index (number followed by h or d)
    INDEX_MANAGEMENT_RETENTION_TIME=90d
    # Number of replicas for older sessions indices
    INDEX_MANAGEMENT_OLDER_SESSION_REPLICAS=0
    # Number of weeks of history to retain
    INDEX_MANAGEMENT_HISTORY_RETENTION_WEEKS=13
    # Number of segments to optimize sessions for
    INDEX_MANAGEMENT_SEGMENTS=1
    # Whether or not Arkime should use a hot/warm design (storing non-session data in a warm index)
    INDEX_MANAGEMENT_HOT_WARM_ENABLED=false
    ```
    * these variables in [`dashboards.env`](https://github.com/cisagov/Malcolm/blob/main/config/dashboards.env.example) to override the values automatically configured for pivot links (idaholab#335) and `/dashboards/` redirect (idaholab#403) for Elasticsearch backend
    ```
    # These values are used to handle the Arkime value actions to pivot from Arkime
    #   to Dashboards. The nginx-proxy container's entrypoint will try to formulate
    #   them automatically, but they may be specified explicitly here.
    NGINX_DASHBOARDS_PREFIX=
    NGINX_DASHBOARDS_PROXY_PASS=
    ```
    * these variables in [`logstash.env`](https://github.com/cisagov/Malcolm/blob/main/config/logstash.env.example) for customizing which log types get NetBox enrichment (idaholab#316) and customizing which types of Zeek logs will be ignored (dropped) by LogStash
    ```
    # Which types of logs will be enriched via NetBox (comma-separated list of provider.dataset, or the string all to enrich all logs)
    LOGSTASH_NETBOX_ENRICHMENT_DATASETS=suricata.alert,zeek.conn,zeek.known_hosts,zeek.known_services,zeek.notice,zeek.signatures,zeek.software,zeek.weird
    ```
    ```
    # Zeek log types that will be ignored (dropped) by LogStash
    LOGSTASH_ZEEK_IGNORED_LOGS=analyzer,broker,bsap_ip_unknown,bsap_serial_unknown,capture_loss,cluster,config,ecat_arp_info,loaded_scripts,packet_filter,png,print,prof,reporter,stats,stderr,stdout
    ```
    * these variables in [`netbox-common.env`](https://github.com/cisagov/Malcolm/blob/main/config/netbox-common.env.example) for adjusting [matching device manufacturers to OUIs](https://cisagov.github.io/Malcolm/docs/asset-interaction-analysis.html#NetBoxPopPassiveOUIMatch) in NetBox autopopulation
    ```
    # Customize manufacturer matching/creation with LOGSTASH_NETBOX_AUTO_POPULATE (see logstash.env)
    NETBOX_DEFAULT_AUTOCREATE_MANUFACTURER=true
    NETBOX_DEFAULT_FUZZY_THRESHOLD=0.95
    ```
    * these variables in [suricata-live.env](https://github.com/cisagov/Malcolm/blob/main/config/suricata-live.env.example) and [zeek-live.env](https://github.com/cisagov/Malcolm/blob/main/config/zeek-live.env.example) that can be used to configure Malcolm reporting to itself on its Zeek and Suricata live capture statistics (idaholab#395)
    ```
    # Whether or not to enable capture statistics and include them in eve.json
    SURICATA_STATS_ENABLED=false
    SURICATA_STATS_EVE_ENABLED=false
    SURICATA_STATS_INTERVAL=30
    SURICATA_STATS_DECODER_EVENTS=false
    ```
    ```
    # Set ZEEK_DISABLE_STATS to blank to generate stats.log and capture_loss.log
    ZEEK_DISABLE_STATS=true
    ```
    * this variable in [zeek.env](https://github.com/cisagov/Malcolm/blob/main/config/zeek.env.example) related to the improvements to the extracted_files download page (idaholab#329)
    ```
    # Whether or not to use libmagic to show MIME types for Zeek-extracted files served
    EXTRACTED_FILE_HTTP_SERVER_MAGIC=false
    ```
Status: Released