CDN Whitelist Parser

jfuruness edited this page Nov 29, 2020 · 2 revisions

Short Description

The purpose of this submodule is to get all ASNs that are owned by CDNs from hackertarget.com, convert this data into CSVs, and insert it into the database.

Long Description

The purpose of this parser is to download ASNs owned by CDNs from hackertarget.com and insert them into a database. This is done through a series of steps.

  1. Get list of CDNs from cdns.txt in the submodule
    • Handled in the _get_cdns function
  2. Make an API call to https://api.hackertarget.com/aslookup/?q= with the CDN name as the query
    • Handled in the _run function
    • This returns the ASNs as JSON
  3. Format the data for database insertion
    • Handled in the _run function
  4. Insert the data into the database
    • Handled in utils.rows_to_db
    • First converts the data to a CSV, then inserts it into the database
    • CSVs are used for fast bulk database insertion
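
The first three steps above can be sketched roughly as follows. This is a minimal illustration, not the submodule's actual code: the helper names (read_cdns, build_url, format_rows) are hypothetical, the CDN names are placeholders, and the JSON shape (a list of objects with an "asn" field) is an assumption, since the response layout is not described on this page. No network call is made; a made-up response string is passed in instead.

```python
import json
from urllib.parse import quote

API_URL = "https://api.hackertarget.com/aslookup/?q="


def read_cdns(lines):
    """Step 1: keep the non-empty, stripped names from the lines of cdns.txt."""
    return [line.strip() for line in lines if line.strip()]


def build_url(cdn):
    """Step 2: build the lookup URL for one CDN, URL-encoding the name."""
    return API_URL + quote(cdn)


def format_rows(cdn, response_text):
    """Step 3: turn one response into [cdn, asn] rows for insertion.

    ASSUMPTION: the response is a JSON list of objects with an "asn" field.
    """
    rows = []
    for entry in json.loads(response_text):
        asn = str(entry["asn"])
        # Drop an optional "AS" prefix so the value fits a bigint column
        if asn.upper().startswith("AS"):
            asn = asn[2:]
        rows.append([cdn, int(asn)])
    return rows


# Placeholder CDN names and documentation-range ASNs, not real data
cdns = read_cdns(["Example CDN\n", "\n", "Another CDN\n"])
urls = [build_url(cdn) for cdn in cdns]
sample_response = '[{"asn": "AS64496"}, {"asn": "64497"}]'
rows = format_rows(cdns[0], sample_response)
```

Each row keeps the CDN name next to the ASN, matching the two columns of the table described under Table Schema.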

Usage

From the Command Line

lib_bgp_data --cdn_whitelist

For debugging:

lib_bgp_data --cdn_whitelist --debug

Other variations of these commands also work; the flags are meant to be forgiving about capitalization and similar details.

It can also be run as a module: python3 -m lib_bgp_data --cdn_whitelist

From a Script:

Initializing the CDN_Whitelist class:

Parameter    | Default                        | Description
name         | self.__class__.__name__        | Parser name; ensures that path cleanup at the end does not delete files from other parsers
path         | "/tmp/bgp_{}".format(name)     | Not used
csv_dir      | "/dev/shm/bgp_{}".format(name) | Path for CSV files, located in RAM
stream_level | logging.INFO                   | Logging level for printing
section      | "bgp"                          | Database section to use

Note that any combination of the above attributes can be changed.
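
As a rough illustration of how the documented defaults depend on the parser name, the sketch below rebuilds them in a standalone helper. The function default_settings is hypothetical and is not part of lib_bgp_data; only the default values themselves come from the table above.

```python
import logging


def default_settings(name):
    """Return the documented default values for a parser called `name`."""
    return {
        "name": name,
        # Both directories embed the parser name, so each parser gets its own
        "path": "/tmp/bgp_{}".format(name),
        "csv_dir": "/dev/shm/bgp_{}".format(name),
        "stream_level": logging.INFO,
        "section": "bgp",
    }


settings = default_settings("CDN_Whitelist")
```

Because the name is part of every path, cleaning up one parser's directories leaves the directories of other parsers alone.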

To initialize CDN_Whitelist with default values:

from lib_bgp_data import CDN_Whitelist
parser = CDN_Whitelist()

To initialize CDN_Whitelist with a custom path, CSV directory, logging level, and section:

from logging import DEBUG
from lib_bgp_data import CDN_Whitelist
parser = CDN_Whitelist(path="/my_custom_path",
                       csv_dir="/my_custom_csv_dir",
                       stream_level=DEBUG,
                       section="mydatabasesection")

To run CDN_Whitelist with defaults (there are no optional parameters):

from lib_bgp_data import CDN_Whitelist
CDN_Whitelist().run()

Design Choices

  • hackertarget.com allows 100 free lookups per day. Each company counts as one lookup.
  • Several other tools exist for this, but most of them don't return all the ASNs for a company, don't show some companies in search at all, or can't search for a company by name:
    • ultratools.com
    • mxtoolbox.com
    • dnschecker.org
    • spyse.com
    • ipinfo.io
  • Using the different IRRs' APIs is convoluted. Each IRR maintains its own. RIPE's database lookup tool says it can look up across all the IRRs, but when I try, I just get errors. Also, to get the ASNs, you first need to search by organisation, then get the organisation ID, then perform an inverse search for ASNs using that organisation ID.
  • The list of CDNs is in cdns.txt. It's a handpicked list. Companies aren't always consistent in their branding and sometimes register ASNs under a different name.
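
Since each CDN costs one lookup, the size of cdns.txt has to stay within the daily allowance. A minimal sketch of such a check is shown below; the helper lookups_remaining is hypothetical and is not part of the parser, and the CDN names are placeholders.

```python
DAILY_FREE_LOOKUPS = 100  # free allowance stated above


def lookups_remaining(cdns, already_used=0):
    """Return the lookups left after querying every CDN once.

    A negative result means the list does not fit in today's allowance.
    """
    return DAILY_FREE_LOOKUPS - already_used - len(cdns)


# Placeholder list of 12 CDN names
cdn_names = ["cdn_{}".format(i) for i in range(12)]
left_fresh = lookups_remaining(cdn_names)
left_after_heavy_use = lookups_remaining(cdn_names, already_used=95)
```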

Table Schema

  • This table contains information on the ASNs retrieved from hackertarget.com
  • Unlogged tables are used for speed
  • asn: The ASN of an AS (bigint)
  • cdn: Name of CDN (varchar)
  • Create Table SQL:
    CREATE UNLOGGED TABLE IF NOT EXISTS {self.name} (
        cdn varchar (200),
        asn bigint
    );
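
To make the link between the schema and the CSV step concrete, here is a small sketch that fills in the table name and serialises rows in the same column order (cdn, asn). It is not the code in utils.rows_to_db: the helper names, the table name "cdn_whitelist", and the tab delimiter are all assumptions made for illustration, and the rows are placeholder data.

```python
import csv
import io


def create_table_sql(table_name):
    """Fill the table name into the CREATE TABLE statement shown above."""
    return (
        "CREATE UNLOGGED TABLE IF NOT EXISTS {} (\n"
        "    cdn varchar (200),\n"
        "    asn bigint\n"
        ");"
    ).format(table_name)


def rows_to_csv(rows):
    """Serialise [cdn, asn] rows as tab-separated text, one row per line."""
    buffer = io.StringIO()
    # ASSUMPTION: tab-separated output; the real delimiter is not stated here
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


sql = create_table_sql("cdn_whitelist")  # hypothetical table name
csv_text = rows_to_csv([["Example CDN", 64496], ["Example CDN", 64497]])
```

Writing all rows to one file and loading it in a single bulk operation avoids issuing one INSERT per row, which is the reason CSVs are used for insertion.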