BGPStream Website Parser

jfuruness edited this page Nov 28, 2020 · 2 revisions

Short Description

The purpose of this submodule is to parse information from https://bgpstream.com to obtain data about real BGP hijacks, leaks, and outages.

Long Description

This submodule parses the HTML of bgpstream.com and formats the data for actual hijacks, leaks, and outages into CSVs for later insertion into a database. This is done through a series of steps.

  1. Initialize the three different kinds of data classes.
    • Handled in the __init__ function in the BGPStream_Website_Parser class
    • This class mainly deals with accessing the website; the data classes handle parsing the information. These data classes inherit from the parent class Data and are located in the data_classes file
  2. All rows are received from the main page of the website
    • This is handled in the utils.get_tags function
    • These rows contain some initial data for all BGP events
  3. The last ten rows on the website are removed
    • This is handled in the parse function in the BGPStream_Website_Parser
    • Those rows contain HTML errors that cause failures when parsing
  4. The row limit is set so that it is not too high
    • This is handled in the parse function in the BGPStream_Website_Parser
    • This prevents going over the maximum number of rows on the website
  5. Rows are iterated over until row_limit is reached
    • This is handled in the parse function in the BGPStream_Website_Parser
  6. For each row, if the row is of a data type passed in the parameters
    and the row is new (by default), add it to the self.data dictionary
    • This causes that row to be parsed as well
    • Rows are parsed into CSVs and inserted into the database
  7. Call the db_insert function on each of the data classes in self.data
    • This will parse all rows and insert them into the database
    • This formats the tables as well
      • Unwanted IPV4 or IPV6 prefixes are removed
      • Indexes are created if they don't exist
      • Duplicates are deleted
      • Temporary tables that are subsets of the data are created
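
The steps above can be sketched in simplified Python. Note that the function and variable names below (parse_rows, buckets, seen) are illustrative only and are not the library's actual API; real rows are HTML tags rather than dicts:

```python
def parse_rows(rows, data_types=("hijack", "leak", "outage"),
               row_limit=None, seen=frozenset()):
    # Step 3: drop the last ten rows, which contain malformed HTML
    rows = rows[:-10] if len(rows) > 10 else []
    # Step 4: cap the row limit so it never exceeds the rows available
    if row_limit is None or row_limit > len(rows):
        row_limit = len(rows)
    # Steps 5-6: iterate until the limit, keeping new rows of wanted types
    buckets = {dtype: [] for dtype in data_types}
    for row in rows[:row_limit]:
        if row["type"] in buckets and row["id"] not in seen:
            buckets[row["type"]].append(row)
    # Step 7 (db_insert) would then write each bucket to CSV and insert it
    return buckets
```
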

Usage

From the Command Line

Best way:

lib_bgp_data --bgpstream_website_parser

For debugging:

lib_bgp_data --bgpstream_website_parser --debug

Must be called on the library:

python3 -m lib_bgp_data --bgpstream_website_parser

From a Script:

Initializing the BGPStream_Website_Parser:

The defaults for the BGPStream_Website_Parser are the same as those of the base parser it inherits from:

| Parameter | Default | Description |
| --- | --- | --- |
| name | `self.__class__.__name__` | Ensures that path cleanup at the end does not delete files belonging to other parsers |
| path | `"/tmp/bgp_{}".format(name)` | Not used |
| csv_dir | `"/dev/shm/bgp_{}".format(name)` | Path for CSV files, located in RAM |
| stream_level | `logging.INFO` | Logging level for printing |
| section | `"bgp"` | Database section to use |

Note that any of these parameters can be changed individually or in any combination.

To initialize BGPStream_Website_Parser with default values:

from lib_bgp_data import BGPStream_Website_Parser
bgpstream_website_parser = BGPStream_Website_Parser()

To initialize BGPStream_Website_Parser with custom path, CSV directory, database section, and logging level:

from logging import DEBUG
from lib_bgp_data import BGPStream_Website_Parser
bgpstream_website_parser = BGPStream_Website_Parser(path="/my_custom_path",
                                                    csv_dir="/my_custom_csv_dir",
                                                    stream_level=DEBUG,
                                                    section="mydatabasesection")

Running the BGPStream_Website_Parser:

| Parameter | Default | Description |
| --- | --- | --- |
| row_limit | `None` | Defaults to all rows minus 10 (to discard corrupt rows); really just for quick tests |
| IPV4 | `True` | Include IPV4 prefixes |
| IPV6 | `False` | Include IPV6 prefixes |
| data_types | `BGPStream_Website_Types.list_values()` | Event types to download: hijack, leak, or outage |
| refresh | `False` | Re-download events that have already been seen; really just for quick testing |

Note that any of these parameters can be changed individually or in any combination.

To run the BGPStream_Website_Parser with defaults:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run()

To run the BGPStream_Website_Parser with just hijacks:

from lib_bgp_data import BGPStream_Website_Parser, BGPStream_Website_Types
BGPStream_Website_Parser().run(data_types=[BGPStream_Website_Types.HIJACK.value])

To run the BGPStream_Website_Parser with all IPV4 and IPV6 prefixes:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run(IPV4=True, IPV6=True)

Useful examples for test usage:

To run the BGPStream_Website_Parser with just the first 50 rows for a quick test:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run(row_limit=50)

To run the BGPStream_Website_Parser and reparse all events you've seen already:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run(refresh=True)

Design Choices

  • The last ten rows of the website are not parsed due to HTML errors
  • Only the data types that are passed in as a parameter are parsed
    • This is because querying each individual events page for info takes a long time
    • Only new rows by default are parsed for the same reason
  • Multithreading isn't used because the website rate-limits and blocks parallel requests
  • Parsing is done from the end of the page to the top
    • The start of the page is not always the same
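
As a rough illustration of the sequential-fetching choice above, the sketch below fetches event pages one at a time with a pause between requests. The `fetch_sequentially` helper, the `fetch` callable, and the delay value are hypothetical, not the library's actual code:

```python
import time

def fetch_sequentially(urls, fetch, delay=1.0):
    """Fetch each URL in order, pausing between requests to respect
    the site's rate limiting (no parallel requests)."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results
```
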

Table Schema

hijacks Table Schema:

  • Contains data for hijack events
  • id: (serial PRIMARY KEY)
  • country: Two letter country abbreviation (varchar (50))
  • detected_as_path: detected_as_path of the hijack (bigint ARRAY)
  • detected_by_bgpmon_peers: (integer)
  • detected_origin_name: (varchar (200))
  • detected_origin_number: (bigint)
  • start_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • end_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • event_number: (integer)
  • event_type: (varchar (50))
  • expected_origin_name: (varchar (200))
  • expected_origin_number: (bigint)
  • expected_prefix: (cidr)
  • more_specific_prefix: (cidr)
  • url: (varchar (250))
  • Create Table SQL:
    CREATE UNLOGGED TABLE IF NOT EXISTS hijack (
              id serial PRIMARY KEY,
              country varchar (50),
              detected_as_path bigint ARRAY,
              detected_by_bgpmon_peers integer,
              detected_origin_name varchar (200),
              detected_origin_number bigint,
              start_time timestamp with time zone,
              end_time timestamp with time zone,
              event_number integer,
              event_type varchar (50),
              expected_origin_name varchar (200),
              expected_origin_number bigint,
              expected_prefix cidr,
              more_specific_prefix cidr,
              url varchar (250)
              );
    

leaks Table Schema:

  • Contains data for leak events
  • id: (serial PRIMARY KEY)
  • country: Two letter country abbreviation (varchar (50))
  • detected_by_bgpmon_peers: (integer)
  • start_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • end_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • event_number: (integer)
  • event_type: (varchar (50))
  • example_as_path: (bigint ARRAY)
  • leaked_prefix: (cidr)
  • leaked_to_name: (varchar (200) ARRAY)
  • leaked_to_number: (bigint ARRAY)
  • leaker_as_name: (varchar (200))
  • leaker_as_number: (bigint)
  • origin_as_name: (varchar (200))
  • origin_as_number: (bigint)
  • url: (varchar (250))
  • Create Table SQL:
    CREATE UNLOGGED TABLE IF NOT EXISTS Leak (
        id serial PRIMARY KEY,
        country varchar (50),
        detected_by_bgpmon_peers integer,
        start_time timestamp with time zone,
        end_time timestamp with time zone,
        event_number integer,
        event_type varchar (50),
        example_as_path bigint ARRAY,
        leaked_prefix cidr,
        leaked_to_name varchar (200) ARRAY,
        leaked_to_number bigint ARRAY,
        leaker_as_name varchar (200),
        leaker_as_number bigint,
        origin_as_name varchar (200),
        origin_as_number bigint,
        url varchar (250)
    );
    

outages Table Schema:

  • Contains data for outage events
  • id: (serial PRIMARY KEY)
  • as_name: (varchar (200))
  • as_number: (bigint)
  • country: Two letter country abbreviation (varchar (25))
  • start_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • end_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • event_number: (integer)
  • event_type: (varchar (25))
  • number_prefixes_affected: (integer)
  • percent_prefixes_affected: (smallint)
  • url: (varchar (150))
  • Create Table SQL:
    CREATE UNLOGGED TABLE IF NOT EXISTS outage (
        id serial PRIMARY KEY,
        as_name varchar (200),
        as_number bigint,
        country varchar (25),
        start_time timestamp with time zone,
        end_time timestamp with time zone,
        event_number integer,
        event_type varchar (25),
        number_prefixes_affected integer,
        percent_prefixes_affected smallint,
        url varchar(150)
    );
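
As an illustration of how an event row might be flattened into a CSV line whose column order matches the outage schema above, for bulk insertion into the database. The column list and `row_to_csv` helper are hypothetical sketches, not the library's actual data classes:

```python
import csv
import io

# Column order mirrors the outage table schema (id is serial, so omitted)
OUTAGE_COLUMNS = ["as_name", "as_number", "country", "start_time",
                  "end_time", "event_number", "event_type",
                  "number_prefixes_affected", "percent_prefixes_affected",
                  "url"]

def row_to_csv(row, columns=OUTAGE_COLUMNS):
    """Flatten one parsed event row (a dict) into a CSV line, with empty
    strings for any missing fields."""
    buf = io.StringIO()
    csv.writer(buf).writerow([row.get(col, "") for col in columns])
    return buf.getvalue().strip()
```
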