AS Rank Parser

jfuruness edited this page Dec 27, 2020 · 4 revisions

Short Description

The purpose of this submodule is to parse AS Rank data received from https://asrank.caida.org/ and insert it into the database.

Long Description

The purpose of this submodule is to parse AS Rank data received from https://asrank.caida.org/ and insert it into the database. This is done through a series of steps.

  1. Clear AS Rank Table:
    • This is done to clear the table before any multiprocessing can happen
    • Handled in the _run function
  2. Then a random delay is performed
    • This is for when we run the Simulator
    • The simulator is run over multiple VMs at once; without the delay, the simultaneous requests would DoS the AS Rank website
    • Delay is between 0 and 20 seconds
  3. Then we get the total number of pages
    • To do this we initialize the Selenium Driver
    • We format the AS Rank URL in the _format_page_url function to get the first page of ASes
      • We use the maximum page size of 1000 ASes
    • We use the Selenium Driver to get the beautiful soup for the first page of ASes
    • We use beautiful soup and check the pagination on the website to see how many pages are possible
    • This is handled in the _total_pages function
  4. We parse each page
    • Each page parse initializes its own Selenium Driver. It is initialized here so that pages can later be parsed in parallel (a single shared instance would not be multiprocess safe)
    • We format the URL for that specific page number, and get the beautiful soup using the Selenium Driver
    • The AS Rank data is formatted as an HTML table, so we search for the tr tag to get all the rows of the table. We skip the first row, which holds the headers
      • The header of this table is not useful information
    • This is in the _parse_page function
  5. Each row in the table is parsed for database insertion
    • We first get all columns by searching for the td tag
    • The second column is the org, which can sometimes be unknown, so we default this to None
    • Third column is country, which can sometimes be empty, which we default to None
    • Handled in the _parse_row function
  6. Rows are then added to a CSV which is then inserted into the database
    • This is handled in the utils.rows_to_db function
    • This is done because the file comes in CSV format
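
Steps 4 through 6 can be sketched roughly as follows. This is a minimal illustration, not the submodule's actual code: it uses the standard library's html.parser in place of Selenium and Beautiful Soup so that it runs without extra dependencies. The names TableRowCollector, parse_row, and rows_to_csv, the sample HTML, the "unknown" marker text, and the column order (taken from the as_rank table schema) are all assumptions made for the example.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_row(cells):
    """Turns one row of cell strings into a list ready for insertion."""
    # Assumed column order: rank, ASN, organization, country, cone size
    as_rank, asn, org, country, cone_size = cells
    # Unknown organizations and empty countries become None
    org = None if org.lower() in ("", "unknown") else org
    country = country or None
    return [int(as_rank), int(asn), org, country, int(cone_size)]


def rows_to_csv(rows):
    """Writes rows as CSV text; None is written as an empty field."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up sample data using ASNs reserved for documentation
SAMPLE_HTML = """
<table>
  <tr><th>rank</th><th>asn</th><th>org</th><th>country</th><th>cone</th></tr>
  <tr><td>1</td><td>64496</td><td>Example Org</td><td>US</td><td>100</td></tr>
  <tr><td>2</td><td>64497</td><td>unknown</td><td></td><td>5</td></tr>
</table>
"""

collector = TableRowCollector()
collector.feed(SAMPLE_HTML)
# Skip the first row, which only holds the headers
parsed = [parse_row(cells) for cells in collector.rows[1:]]
csv_text = rows_to_csv(parsed)
```

In the real parser the resulting rows are handed to utils.rows_to_db rather than kept in memory as CSV text.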

Usage

From the Command Line

Best way:

lib_bgp_data --as_rank_website_parser

For debugging:

lib_bgp_data --as_rank_website_parser --debug

The following example must be run from the directory that contains the package, so cd into the directory one level above the package first:

python3 -m lib_bgp_data --as_rank_website_parser

From a Script:

Initializing the as_rank_website_parser: The defaults for the as_rank_website_parser are:

Parameter | Default | Description
name | self.__class__.__name__ | The purpose of this is to make sure when we clean up paths at the end it doesn't delete files from other parsers.
path | "/tmp/bgp_{}".format(name) | Not used
csv_dir | "/dev/shm/bgp_{}".format(name) | Path for CSV files, located in RAM
stream_level | logging.INFO | Logging level for printing
section | "bgp" | Database section to use

Note that any of the above parameters can be changed, individually or in any combination.
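
As an illustration of how the defaults above resolve, here is a small sketch. The helper default_settings is hypothetical and exists only for this example; it is not part of lib_bgp_data. It simply applies the format strings from the table to a given name.

```python
import logging


def default_settings(name):
    """Builds the documented default values for a parser called `name`."""
    return {
        "name": name,
        "path": "/tmp/bgp_{}".format(name),
        "csv_dir": "/dev/shm/bgp_{}".format(name),
        "stream_level": logging.INFO,
        "section": "bgp",
    }


settings = default_settings("AS_Rank_Website_Parser")
print(settings["csv_dir"])  # prints /dev/shm/bgp_AS_Rank_Website_Parser
```

Because the class name is part of both paths, each parser gets its own directories, which is what keeps cleanup from touching files that belong to other parsers.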

To initialize as_rank_website_parser with default values:

from lib_bgp_data import AS_Rank_Website_Parser
parser = AS_Rank_Website_Parser()

To initialize AS_Rank_Website_Parser with a custom path, CSV directory, logging level, and database section:

from logging import DEBUG
from lib_bgp_data import AS_Rank_Website_Parser
parser = AS_Rank_Website_Parser(path="/my_custom_path",
                                csv_dir="/my_custom_csv_dir",
                                stream_level=DEBUG,
                                section="mydatabasesection")

Running the AS_Rank_Website_Parser:

To run the AS_Rank_Website_Parser with defaults:

from lib_bgp_data import AS_Rank_Website_Parser
AS_Rank_Website_Parser().run()

To run the AS_Rank_Website_Parser with a random delay:

from lib_bgp_data import AS_Rank_Website_Parser
# Random delay between 0 and 20 seconds
AS_Rank_Website_Parser().run(random_delay=True)
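
For intuition, a random delay of this kind might look like the sketch below. The function random_delay and its sleep parameter are assumptions made for illustration, not the parser's internals, and the source does not say whether the real delay is a whole or fractional number of seconds.

```python
import random
import time


def random_delay(max_seconds=20, sleep=time.sleep):
    """Sleeps for a random duration in [0, max_seconds] and returns it."""
    # uniform() is an assumption; the real parser may pick whole seconds
    delay = random.uniform(0, max_seconds)
    sleep(delay)
    return delay
```

Passing a no-op such as sleep=lambda seconds: None makes it possible to exercise the function without actually waiting.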

Design Choices

  • A random delay is used to avoid overloading the AS Rank website when a lot of parsing is done in parallel (for example, across 25 VMs for simulations)
  • Unknown organizations and flags are defaulted to None for proper database queries
  • No multiprocessing is used because it is broken and needs to be fixed

Table Schema

as_rank Table Schema:

  • Contains data for AS Rank

  • as_rank: Rank of AS (bigint)

  • asn: ASN (bigint)

  • organization: Organization of AS (varchar (250))

  • country: Two letter country code (varchar (2))

  • cone_size: Size of customer cone (integer)

  • Create Table SQL:

        CREATE UNLOGGED TABLE IF NOT EXISTS as_rank (
              as_rank bigint,
              asn bigint,
              organization varchar (250),
              country varchar (2),
              cone_size integer
              );
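
The sketch below shows why storing None for unknown values helps with queries. It is only an illustration: it uses an in-memory SQLite database as a stand-in for the project's real database, drops the UNLOGGED keyword (which SQLite does not support), and inserts made-up rows with ASNs reserved for documentation.

```python
import sqlite3

# Same columns as the schema above, without the UNLOGGED keyword
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS as_rank (
    as_rank bigint,
    asn bigint,
    organization varchar (250),
    country varchar (2),
    cone_size integer
)
"""

rows = [
    (1, 64496, "Example Org", "US", 100),
    (2, 64497, None, None, 5),  # unknown org and country stored as NULL
]

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)
conn.executemany("INSERT INTO as_rank VALUES (?, ?, ?, ?, ?)", rows)

# NULL values can be matched with IS NULL instead of string comparisons
unknown_asns = [
    asn for (asn,) in conn.execute(
        "SELECT asn FROM as_rank WHERE organization IS NULL ORDER BY asn")
]
conn.close()
```

Had the unknown organization been stored as a placeholder string, every query would need to know and exclude that exact string.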