# Crawler usage demonstration

In [1]:
from pyspark.sql import SparkSession

# Note: set the environment variable "TLHOP_DATASETS_PATH" to define
# where TLHOP's crawlers store the data they collect.
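
A minimal sketch of setting this variable from Python before any crawler is imported. The helper `configure_datasets_path` and the directory name used below are illustrative assumptions, not part of TLHOP; exporting the variable in the shell before starting the notebook should work equally well.

```python
import os
from pathlib import Path


def configure_datasets_path(directory):
    """Create `directory` if needed and expose it via TLHOP_DATASETS_PATH.

    Hypothetical helper for illustration only: the note above just asks
    for the environment variable to be set.
    """
    path = Path(directory).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TLHOP_DATASETS_PATH"] = str(path)
    return path
```

For example, `configure_datasets_path("~/tlhop-datasets")` would create that (assumed) directory under the home folder and point the variable at it for the current process.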

## AS-Classification (AS-Type) Crawler

In [5]:
from tlhop.crawlers import AS2Type

crawler = AS2Type()

Last crawling timestamp: 20221104_120444
The current dataset version is the most recent.


In [29]:
crawler.download()

Downloading file: 'https://publicdata.caida.org/datasets/as-classification_restricted/20210401.as2types.txt.gz'
Advanced info:
Date: Fri, 04 Nov 2022 15:04:42 GMT
Server: Apache/2.4.43 (FreeBSD) OpenSSL/1.0.2u-freebsd
Last-Modified: Fri, 09 Apr 2021 00:26:58 GMT
ETag: "392a6-5bf7f38c9150d"
Accept-Ranges: bytes
Content-Length: 234150
Connection: close
Content-Type: application/x-gzip


New dataset version is download with success!
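
The `Last-Modified` header shown in the output is one signal a client could use to decide whether a remote file changed since the last crawl. The sketch below compares it with a crawl timestamp in the format printed above. It is not necessarily how `AS2Type` performs its check: `remote_is_newer` is a hypothetical helper, and treating the crawl timestamp as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified_header, last_crawl_timestamp):
    """Return True when the server copy changed after the last crawl.

    Assumption: the crawl timestamp (e.g. "20221104_120444") is in UTC;
    the output above does not state its timezone.
    """
    remote = parsedate_to_datetime(last_modified_header)
    local = datetime.strptime(last_crawl_timestamp, "%Y%m%d_%H%M%S")
    return remote > local.replace(tzinfo=timezone.utc)


# Values copied from the output above: the file predates the last crawl.
print(remote_is_newer("Fri, 09 Apr 2021 00:26:58 GMT", "20221104_120444"))  # False
```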


## AS Rank Crawler

In [6]:
from tlhop.crawlers import ASRank

crawler = ASRank()

[INFO] Last crawling timestamp: 20221104_182326
Because this dataset is collected from an external API, we can not verify if there is a newer version.


In [3]:
crawler.describe()


        # AS Rank Dataset
        
        - Description:  A ranking of Autonomous Systems (AS). ASes are ranked by their 
          customer cone size, which is the number of their direct and indirect customers. 
        - Reference: https://asrank.caida.org/
        - Download link: https://api.asrank.caida.org/v2/graphql
        - Fields: 
            * asn - code
            * asnName - name of the ASN;
            * cliqueMember - is true if the ASN is inferred to be a member of the clique of ASN at the top of the ASN hierarchy;
            * longitude - longitude of the ASN;
            * latitude - latitude of the ASN;
            * rank - is ASN's rank, which is based on it's customer cone size, which in turn;
            * seen - is true when ASN is seen in BGP;
            * announcing_numberAddresses - number of addresses announced by the ASN;
            * announcing_numberPrefixes - set of prefixes announced by the ASN;
            * asnDegree_customer - The number of ASN

In [4]:
crawler.download()

[INFO] Running - retrivied 10000 of 112490 records in 36.43 seconds.
[INFO] Running - retrivied 20000 of 112490 records in 42.72 seconds.
[INFO] Running - retrivied 30000 of 112490 records in 46.76 seconds.
[INFO] Running - retrivied 40000 of 112490 records in 44.50 seconds.
[INFO] Running - retrivied 50000 of 112490 records in 44.15 seconds.
[INFO] Running - retrivied 60000 of 112490 records in 45.59 seconds.
[INFO] Running - retrivied 70000 of 112490 records in 45.31 seconds.
[INFO] Running - retrivied 80000 of 112490 records in 40.52 seconds.
[INFO] Running - retrivied 90000 of 112490 records in 44.22 seconds.
[INFO] Running - retrivied 100000 of 112490 records in 44.38 seconds.
[INFO] Running - retrivied 110000 of 112490 records in 44.93 seconds.
[INFO] Running - retrivied 120000 of 112490 records in 20.69 seconds.
[INFO] New dataset version is download with success!
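
The log suggests that records are fetched in batches of 10,000. The final progress line reports 120000 of 112490, presumably because the counter advances by the batch size rather than by the number of records actually received. A minimal sketch of batch arithmetic that caps the last batch follows; `page_windows` is a hypothetical helper, not TLHOP's or the AS Rank API's.

```python
def page_windows(total_records, page_size):
    """Yield (offset, count) pairs that cover total_records exactly."""
    for offset in range(0, total_records, page_size):
        # The last window is shortened so the counts sum to total_records.
        yield offset, min(page_size, total_records - offset)


windows = list(page_windows(112490, 10000))
print(len(windows))  # 12
print(windows[-1])   # (110000, 2490)
```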


## Mikrotik Releases Crawler

In [7]:
from tlhop.crawlers import MikrotikReleases

crawler = MikrotikReleases()

Last crawling timestamp: 20221104_163448


In [3]:
crawler.describe()


        # Mikrotik Releases Dataset
        
        - Description: A dataset about Mikrotik's releases information crawled from official changelog.
        - Reference: https://mikrotik.com/download/changelogs/
        - Fields: environment deployment, release/version, date
        


In [4]:
crawler.download()

Crawling for new records.
New dataset version is download with success!


## NIST NVD Crawler

In [2]:
spark = SparkSession.builder.master("local[10]").getOrCreate()

24/01/09 16:00:31 WARN Utils: Your hostname, timbersaw resolves to a loopback address: 127.0.1.1; using 150.164.10.29 instead (on interface eno2)
24/01/09 16:00:31 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
24/01/09 16:00:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/09 16:00:32 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
from tlhop.crawlers import NISTNVD

crawler = NISTNVD()

[INFO] Last crawling timestamp: 20230330_093212
Checking CVES of year 2002
Checking CVES of year 2003
Checking CVES of year 2004
Checking CVES of year 2005
Checking CVES of year 2006
Checking CVES of year 2007
Checking CVES of year 2008
Checking CVES of year 2009
Checking CVES of year 2010
Checking CVES of year 2011
Checking CVES of year 2012
Checking CVES of year 2013
Checking CVES of year 2014
Checking CVES of year 2015
Checking CVES of year 2016
Checking CVES of year 2017
Checking CVES of year 2018
Checking CVES of year 2019
Checking CVES of year 2020
Checking CVES of year 2021
Checking CVES of year 2022
Checking CVES of year 2023
Checking CVES of year 2024
[INFO] A most recent version of the current dataset was found.


In [4]:
crawler.describe()


        # NIST NVD
        
        - Description: The National Vulnerability Database (NVD) is the U.S. government 
          repository of standards based vulnerability management data.
        - Reference: https://nvd.nist.gov/
        - Download link: https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-{year}.json.zip
        - Fields: cve_id, description, cvssv2, cvssv3, publishedDate, lastModifiedDate, 
                  baseMetricV2, baseMetricV3, cpe, references, published_year, rank_cvss_v2,
                  rank_cvss_v3
        


In [5]:
crawler.download()

[INFO] Downloading new file: 'nvdcve-1.1-2002.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2003.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2004.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2005.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2006.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2007.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2008.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2009.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2010.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2011.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2012.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2013.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2014.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2015.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2016.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2017.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2018.json.zip'
[INFO] Downloading new file: 'nvdcve-1.1-2019.js

                                                                                

[INFO] New dataset version is download with success!


True
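
The download link in the dataset description is a template with a `{year}` placeholder, and the log above shows one file per year starting at 2002. A small sketch of expanding that template is given below; `nvd_feed_urls` is a hypothetical helper, and the year range simply mirrors the years listed when the crawler was created.

```python
# Template copied from the describe() output above.
NVD_FEED_TEMPLATE = "https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-{year}.json.zip"


def nvd_feed_urls(first_year, last_year):
    """Return one feed URL per year, both bounds inclusive."""
    return [
        NVD_FEED_TEMPLATE.format(year=year)
        for year in range(first_year, last_year + 1)
    ]


urls = nvd_feed_urls(2002, 2024)
print(len(urls))  # 23
print(urls[0])    # https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2002.json.zip
```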

In [6]:
spark.stop()

## CISA's Known Exploited Vulnerabilities Catalog

In [12]:
from tlhop.crawlers import CISAKnownExploits

crawler = CISAKnownExploits()

Last crawling timestamp: 20230414_142202
A most recent version of the current dataset was found.


In [13]:
crawler.describe()


        # CISA's Known Exploited Vulnerabilities Catalog
        
        - Description: CISA maintains a catalog of vulnerabilities that have been exploited in the wild.
        - Reference: https://www.cisa.gov/known-exploited-vulnerabilities
        - Download link: https://www.cisa.gov/sites/default/files/csv/known_exploited_vulnerabilities.csv
        


In [14]:
crawler.download()

Downloading new file ...
New dataset version is download with success!


True

## Brazilian Cities

In [2]:
from tlhop.crawlers import BrazilianCities

crawler = BrazilianCities()

The current dataset version is the most recent.


In [3]:
crawler.describe()


        # Brazilian Cities
        
        - Description: This dataset is a compilation of several publicly available information about Brazilian Municipalities.
        - Reference: https://www.kaggle.com/datasets/crisparada/brazilian-cities
        - Fields: Each city contains 79 fields, please check reference page to further details.
        - Authors: All credits to Cristiana Parada (https://www.kaggle.com/crisparada).
        - License: This dataset is under CC BY-SA 4.0 License, which means that you are allowed ti copy and redistribute the material in any medium or format.
        


In [4]:
crawler.download()

By default, the download of this dataset is expected to be manual.In this mode, the user must download this dataset using the reference page (https://www.kaggle.com/datasets/crisparada/brazilian-cities), and place it as file '/home/lucasmsp/shodan-analysis/data/brazilian-cities/brazilian_cities.csv'. Please press [y/n] when the new version is already in the directory or to abort:  y


New dataset version added with success!


True

## Federal Revenue of Brazil (Receita Federal Brasileira, RFB)

In [None]:
from tlhop.crawlers import BrazilianFR

# Parentheses allow the builder chain to span lines without backslashes.
spark = (
    SparkSession.builder
    .master("local[10]")
    .getOrCreate()
)

crawler = BrazilianFR()

In [None]:
crawler.download(manual=False)

In [None]:
spark.stop()

## RDAP Dataset

In [2]:
spark = SparkSession.builder.master("local[10]").getOrCreate()

23/07/25 13:39:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
import os
from tlhop.crawlers import RDAP

crawler = RDAP()

target_ip_list_filepath = os.path.expanduser("~/demo-ip-rdap-list.csv")

[INFO] Last crawling timestamp: 20220927_093053


In [4]:
crawler.download(target_ip_list_filepath, append=True, resume=False, use_selenium=False, historical_data=True)

[INFO] Starting a new execution.
[INFO] Starting incremental running. Last block_id 47001
[INFO] Cleaning Tree - Initial size: 32249. Analysis possible collisions...
[INFO] Cleaning Tree - 1 intervals with colisions.
[INFO] Cleaning Tree - Final size: 32248 .
[INFO] Generating new consolided file.
This execution had none requisition. Operation Completed.


True

In [5]:
spark.stop()

## EndOfLife

In [1]:
from tlhop.crawlers import EndOfLife

crawler = EndOfLife()

Last crawling timestamp: 20230727_171333


In [2]:
crawler.describe()


        # EndOfLife Dataset
        
        - Description: Keep track of various End of Life dates and support lifecycles for various products.
        - References: https://endoflife.date/docs/api/ and https://github.com/endoflife-date/release-data
        


In [3]:
crawler.download()

Crawling for new records.
251 products found to be crawled.
New dataset version is download with success!


True

## FIRST EPSS

In [None]:
spark = SparkSession.builder.master("local[1]").getOrCreate()

In [3]:
from tlhop.crawlers import FirstEPSS

In [None]:
crawler = FirstEPSS()

In [5]:
crawler.describe()


        # FIRST's Exploit Prediction Scoring system (EPSS) 
        
        - Description: EPSS is a daily estimate of the probability of exploitation activity being observed over the next 30 days. 
        - Reference: https://www.first.org/epss/
        - Download link: https://epss.cyentia.com/epss_scores-YYYY-mm-dd.csv.gz
        


In [None]:
crawler.download()

In [7]:
spark.stop()
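
The download link in the EPSS description contains a `YYYY-mm-dd` date pattern. As a sketch, the URL for a given day can be built with `strftime`; `epss_url` is a hypothetical helper, and whether a file exists for any particular date is not verified here.

```python
from datetime import date

# Pattern copied from the describe() output, with the date part parameterized.
EPSS_TEMPLATE = "https://epss.cyentia.com/epss_scores-{day}.csv.gz"


def epss_url(day):
    """Return the daily EPSS file URL for a datetime.date."""
    return EPSS_TEMPLATE.format(day=day.strftime("%Y-%m-%d"))


print(epss_url(date(2023, 7, 27)))
# https://epss.cyentia.com/epss_scores-2023-07-27.csv.gz
```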

## LACNIC RIR Statistics

In [None]:
from tlhop.crawlers import LACNICStatistics
crawler = LACNICStatistics()

In [2]:
crawler.describe()


        # LACNIC RIR Statistics
        
        - Description: This dataset contains daily summary reports of the allocations and assignments of numeric Internet address resources within
                        ranges originally delegated to LACNIC and historical ranges
                        transferred to LACNIC by other registries.
        - Reference: https://ftp.lacnic.net/pub/stats/lacnic/RIR-Statistics-Exchange-Format.txt
        - Download link: https://ftp.lacnic.net/pub/stats/lacnic/delegated-lacnic-{date}
        


In [3]:
crawler.download()

Downloading new file ...
New dataset version is download with success!


True
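
The reference above describes the RIR statistics exchange format, in which each record is a pipe-separated line of the form `registry|cc|type|start|value|date|status`. The sketch below parses one such line with the standard library only. `Delegation` and `parse_record` are hypothetical names, the sample line is illustrative rather than copied from a LACNIC file, and this is not how TLHOP itself reads the dataset.

```python
from typing import NamedTuple, Optional

RESOURCE_TYPES = {"asn", "ipv4", "ipv6"}


class Delegation(NamedTuple):
    registry: str
    country: str
    resource_type: str  # "asn", "ipv4" or "ipv6"
    start: str          # first ASN or first address of the range
    value: int          # ASN/host count, or prefix length for ipv6
    date: str           # YYYYMMDD
    status: str


def parse_record(line) -> Optional[Delegation]:
    """Parse one record line; return None for comments, header and summary lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split("|")
    # Record lines have at least 7 fields with the resource type in position 3;
    # the version header and the per-type summary lines fail this check.
    if len(fields) < 7 or fields[2] not in RESOURCE_TYPES:
        return None
    registry, country, rtype, start, value, day, status = fields[:7]
    return Delegation(registry, country, rtype, start, int(value), day, status)


# Illustrative sample line (an assumption, not real data).
print(parse_record("lacnic|BR|ipv4|200.160.0.0|8192|20000509|allocated"))
```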