BGPStream Website Parser

jfuruness edited this page Nov 28, 2020 · 2 revisions

Short Description

The purpose of this submodule is to parse information from https://bgpstream.com to obtain data about real BGP hijacks, leaks, and outages.

Long Description

This submodule parses the HTML of bgpstream.com and formats the data for actual hijacks, leaks, and outages into CSVs for later insertion into a database. This is done through a series of steps.

  1. Initialize the three different kinds of data classes.
    • Handled in the __init__ function in the BGPStream_Website_Parser class
    • This class mainly deals with accessing the website; the data classes handle parsing the information. These data classes inherit from the parent class Data and are located in the data_classes file
  2. All rows are received from the main page of the website
    • This is handled in the utils.get_tags function
    • These rows contain some initial data for all BGP events
  3. The last ten rows on the website are removed
    • This is handled in the parse function in the BGPStream_Website_Parser
    • Those rows contain HTML errors that cause failures when parsing
  4. The row limit is set so that it is not too high
    • This is handled in the parse function in the BGPStream_Website_Parser
    • This prevents going over the maximum number of rows on the website
  5. Rows are iterated over until row_limit is reached
    • This is handled in the parse function in the BGPStream_Website_Parser
  6. For each row, if the row is of a data type passed in the parameters
    and the row is new (by default), add it to the self.data dictionary
    • This causes that row to be parsed as well
    • Rows are parsed into CSVs and inserted into the database
  7. Call the db_insert function on each of the data classes in self.data
    • This will parse all rows and insert them into the database
    • This formats the tables as well
      • Unwanted IPV4 or IPV6 prefixes are removed
      • Indexes are created if they don't exist
      • Duplicates are deleted
      • Temporary tables that are subsets of the data are created
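
The steps above can be sketched in simplified Python. Note that the function and variable names below (parse_rows, buckets, seen) are illustrative only and are not the library's actual API; real rows are HTML tags rather than dicts:

```python
def parse_rows(rows, data_types=("hijack", "leak", "outage"),
               row_limit=None, seen=frozenset()):
    # Step 3: drop the last ten rows, which contain malformed HTML
    rows = rows[:-10] if len(rows) > 10 else []
    # Step 4: cap the row limit so it never exceeds the rows available
    if row_limit is None or row_limit > len(rows):
        row_limit = len(rows)
    # Steps 5-6: iterate until the limit, keeping new rows of wanted types
    buckets = {dtype: [] for dtype in data_types}
    for row in rows[:row_limit]:
        if row["type"] in buckets and row["id"] not in seen:
            buckets[row["type"]].append(row)
    # Step 7 (db_insert) would then write each bucket to CSV and insert it
    return buckets
```
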

Usage

From the Command Line

Best way:

lib_bgp_data --bgpstream_website_parser

For debugging:

lib_bgp_data --bgpstream_website_parser --debug

Must be called on the library:

python3 -m lib_bgp_data --bgpstream_website_parser

From a Script:

Initializing the BGPStream_Website_Parser:

The defaults for the BGPStream_Website_Parser are the same as those of the base parser it inherits from:

| Parameter | Default | Description |
| --- | --- | --- |
| name | `self.__class__.__name__` | Ensures that path cleanup at the end does not delete files belonging to other parsers |
| path | `"/tmp/bgp_{}".format(name)` | Not used |
| csv_dir | `"/dev/shm/bgp_{}".format(name)` | Path for CSV files, located in RAM |
| stream_level | `logging.INFO` | Logging level for printing |
| section | `"bgp"` | Database section to use |

Note that any of these parameters can be changed individually or in any combination.

To initialize BGPStream_Website_Parser with default values:

from lib_bgp_data import BGPStream_Website_Parser
bgpstream_website_parser = BGPStream_Website_Parser()

To initialize BGPStream_Website_Parser with custom path, CSV directory, database section, and logging level:

from logging import DEBUG
from lib_bgp_data import BGPStream_Website_Parser
bgpstream_website_parser = BGPStream_Website_Parser(path="/my_custom_path",
                                                    csv_dir="/my_custom_csv_dir",
                                                    stream_level=DEBUG,
                                                    section="mydatabasesection")

Running the BGPStream_Website_Parser:

| Parameter | Default | Description |
| --- | --- | --- |
| row_limit | `None` | Defaults to all rows minus 10 (to discard corrupt rows); really just for quick tests |
| IPV4 | `True` | Include IPV4 prefixes |
| IPV6 | `False` | Include IPV6 prefixes |
| data_types | `BGPStream_Website_Types.list_values()` | Event types to download: hijack, leak, or outage |
| refresh | `False` | Re-download events that have already been seen; really just for quick testing |

Note that any of these parameters can be changed individually or in any combination.

To run the BGPStream_Website_Parser with defaults:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run()

To run the BGPStream_Website_Parser with just hijacks:

from lib_bgp_data import BGPStream_Website_Parser, BGPStream_Website_Types
BGPStream_Website_Parser().run(data_types=[BGPStream_Website_Types.HIJACK.value])

To run the BGPStream_Website_Parser with all IPV4 and IPV6 prefixes:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run(IPV4=True, IPV6=True)

Useful examples for test usage:

To run the BGPStream_Website_Parser with just the first 50 rows for a quick test:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run(row_limit=50)

To run the BGPStream_Website_Parser and reparse all events you've seen already:

from lib_bgp_data import BGPStream_Website_Parser
BGPStream_Website_Parser().run(refresh=True)

Design Choices

  • The last ten rows of the website are not parsed due to HTML errors
  • Only the data types that are passed in as a parameter are parsed
    • This is because querying each individual events page for info takes a long time
    • Only new rows by default are parsed for the same reason
  • Multithreading isn't used because the website rate-limits and blocks parallel requests
  • Parsing is done from the end of the page to the top
    • The start of the page is not always the same
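
As a rough illustration of the sequential-fetching choice above, the sketch below fetches event pages one at a time with a pause between requests. The `fetch_sequentially` helper, the `fetch` callable, and the delay value are hypothetical, not the library's actual code:

```python
import time

def fetch_sequentially(urls, fetch, delay=1.0):
    """Fetch each URL in order, pausing between requests to respect
    the site's rate limiting (no parallel requests)."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results
```
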

Table Schema

hijacks Table Schema:

  • Contains data for hijack events
  • id: (serial PRIMARY KEY)
  • country: Two letter country abbreviation (varchar (50))
  • detected_as_path: detected_as_path of the hijack (bigint ARRAY)
  • detected_by_bgpmon_peers: (integer)
  • detected_origin_name: (varchar (200))
  • detected_origin_number: (bigint)
  • start_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • end_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • event_number: (integer)
  • event_type: (varchar (50))
  • expected_origin_name: (varchar (200))
  • expected_origin_number: (bigint)
  • expected_prefix: (cidr)
  • more_specific_prefix: (cidr)
  • url: (varchar (250))
  • Create Table SQL:
    CREATE UNLOGGED TABLE IF NOT EXISTS hijack (
              id serial PRIMARY KEY,
              country varchar (50),
              detected_as_path bigint ARRAY,
              detected_by_bgpmon_peers integer,
              detected_origin_name varchar (200),
              detected_origin_number bigint,
              start_time timestamp with time zone,
              end_time timestamp with time zone,
              event_number integer,
              event_type varchar (50),
              expected_origin_name varchar (200),
              expected_origin_number bigint,
              expected_prefix cidr,
              more_specific_prefix cidr,
              url varchar (250)
              );
    

leaks Table Schema:

  • Contains data for leak events
  • id: (serial PRIMARY KEY)
  • country: Two letter country abbreviation (varchar (50))
  • detected_by_bgpmon_peers: (integer)
  • start_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • end_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • event_number: (integer)
  • event_type: (varchar (50))
  • example_as_path: (bigint ARRAY)
  • leaked_prefix: (cidr)
  • leaked_to_name: (varchar (200) ARRAY)
  • leaked_to_number: (bigint ARRAY)
  • leaker_as_name: (varchar (200))
  • leaker_as_number: (bigint)
  • origin_as_name: (varchar (200))
  • origin_as_number: (bigint)
  • url: (varchar (250))
  • Create Table SQL:
    CREATE UNLOGGED TABLE IF NOT EXISTS Leak (
        id serial PRIMARY KEY,
        country varchar (50),
        detected_by_bgpmon_peers integer,
        start_time timestamp with time zone,
        end_time timestamp with time zone,
        event_number integer,
        event_type varchar (50),
        example_as_path bigint ARRAY,
        leaked_prefix cidr,
        leaked_to_name varchar (200) ARRAY,
        leaked_to_number bigint ARRAY,
        leaker_as_name varchar (200),
        leaker_as_number bigint,
        origin_as_name varchar (200),
        origin_as_number bigint,
        url varchar (250)
    );
    

outages Table Schema:

  • Contains data for outage events
  • id: (serial PRIMARY KEY)
  • as_name: (varchar (200))
  • as_number: (bigint)
  • country: Two letter country abbreviation (varchar (25))
  • start_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • end_time: (timestamp with time zone) - Note that the server and website are set to UTC
  • event_number: (integer)
  • event_type: (varchar (25))
  • number_prefixes_affected: (integer)
  • percent_prefixes_affected: (smallint)
  • url: (varchar (150))
  • Create Table SQL:
    CREATE UNLOGGED TABLE IF NOT EXISTS outage (
        id serial PRIMARY KEY,
        as_name varchar (200),
        as_number bigint,
        country varchar (25),
        start_time timestamp with time zone,
        end_time timestamp with time zone,
        event_number integer,
        event_type varchar (25),
        number_prefixes_affected integer,
        percent_prefixes_affected smallint,
        url varchar(150)
    );
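
As an illustration of how an event row might be flattened into a CSV line whose column order matches the outage schema above, for bulk insertion into the database. The column list and `row_to_csv` helper are hypothetical sketches, not the library's actual data classes:

```python
import csv
import io

# Column order mirrors the outage table schema (id is serial, so omitted)
OUTAGE_COLUMNS = ["as_name", "as_number", "country", "start_time",
                  "end_time", "event_number", "event_type",
                  "number_prefixes_affected", "percent_prefixes_affected",
                  "url"]

def row_to_csv(row, columns=OUTAGE_COLUMNS):
    """Flatten one parsed event row (a dict) into a CSV line, with empty
    strings for any missing fields."""
    buf = io.StringIO()
    csv.writer(buf).writerow([row.get(col, "") for col in columns])
    return buf.getvalue().strip()
```
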