
This repository is created to keep files updated for IDP in The Dr. Theo Schöller Chair of Technology and Innovation Management


mrtrkmn/orbi


Crawl Data

This is a simple crawler that currently collects company- and patent-related data from two websites, ipo.gov.uk and sec.gov.

  • main branch: [possibly] synced with the run-on-github branch.
  • run-on-self-hosted: runs on a self-hosted computer and is updated more frequently than the main branch. Create a PR to main to receive updates.

How to run

./orbi contains the main script used to run the crawler. It defines two classes for crawling data from the websites.

  • Class Crawler: crawls data from the ipo.gov.uk and sec.gov websites and generates a CSV file with the name, city, country, and CIK number of the companies. Name, city, and country are scraped from the sec.gov website (the CIK numbers of the companies are provided by an XLSX file).

  • Class Orbis: responsible for crawling data from the Orbis database. It uses batch search to find the companies from the CSV file generated by the Crawler class. It also adds/removes columns to enhance the search results and exports the results to an XLSX file.

The whole process is automated using Selenium and ChromeDriver.

On Local Dev Machine

On GitHub Actions the script reads its settings from environment variables, so the same variables must be set on your local machine. Providing all of them on the command line would be tedious, so a config file is used by the script to load the environment variables. Check the sample config file from here.

  • Setup the requirements:
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
  • There are three main components of the program: orbi/orbi.py, orbi/crawl.py, and utils/visualize.py.

    • orbi/orbi.py is the main script; it starts the batch search on the Orbis database using the CSV file generated from the given XLSX file.

    • orbi/crawl.py crawls data from the sec.gov website.

    • utils/visualize.py visualizes the data.

Orbi (batch search on orbis database)

This part explains running it on a local machine. For running it remotely, check out the On Remote section.

  • After setting up the virtual environment and installing the requirements, you can start a batch search on your local machine with the following command. Before starting the process, make sure there is a config.yaml file in the config folder that includes all required credentials.
$ LOCAL_DEV=True CONFIG_PATH=./config/config.yaml CHECK_ON_SEC=False python orbi/orbi.py
  • The command above launches the Chrome browser in non-headless mode and does not check the companies on the sec.gov website. It takes the cleaned company names, merges them into one column, and feeds them to the Orbis database to get the results.

  • To add data from the sec.gov website, set CHECK_ON_SEC=True in the command above. (In our experiments this decreased the hit rate for companies, so it is preferably left off; it was added in the early stages of the project and later found to be unnecessary.)

Make sure that you are defining the path to the config file correctly.
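The three variables in the command above can be read with a small helper. load_settings below is a hypothetical illustration of the precedence (environment variable first, then a default), not code from the repository:

```python
import os


# Hypothetical helper: mirrors the LOCAL_DEV / CONFIG_PATH / CHECK_ON_SEC
# variables used in the command above; not part of the repository.
def load_settings() -> dict:
    return {
        "local_dev": os.environ.get("LOCAL_DEV", "False") == "True",
        "config_path": os.environ.get("CONFIG_PATH", "./config/config.yaml"),
        "check_on_sec": os.environ.get("CHECK_ON_SEC", "False") == "True",
    }
```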

Crawl (scraping data from sec.gov website)

  • The Crawl class is used to scrape data from the sec.gov website. It makes requests to the API endpoint (https://data.sec.gov/api/xbrl/companyfacts/CIK##########.json), which returns a company's financial figures as JSON. From this response, financial figures are fetched for all companies based on the license agreement date and saved to a CSV file.

  • Companies that are not found and missing KPI values are stored in a separate file.
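The ########## in the endpoint above stands for the CIK number zero-padded to 10 digits. A minimal sketch (companyfacts_url is a hypothetical helper, not repository code):

```python
# Builds the companyfacts endpoint quoted above; the CIK is
# zero-padded to 10 digits. Hypothetical helper, not repository code.
def companyfacts_url(cik: int) -> str:
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"


print(companyfacts_url(320193))
# https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json
```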

  • To run the crawler, you can run the following command:

$ python orbi/crawl.py 

  Example usage:

    python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --licensee  # searching over licensee information 
    python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --no-licensee # searching over licensor information 
            
  • --source_file: (required) the path to the input file; the same XLSX file that is used as input for the Orbis batch search.

  • --output_file: (optional) the path to the output file. If not provided, the output is saved to the ./data/ folder as company_facts_{timestamp}_licensee.csv or company_facts_{timestamp}_licensor.csv.

  • --licensee: boolean flag indicating that the source file is for licensees.

  • --no-licensee: boolean flag indicating that the source file is for licensors. Exactly one of the two flags must be provided.

Example call for licensee field:

python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --licensee 
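The --licensee/--no-licensee pair maps naturally onto argparse's BooleanOptionalAction. The sketch below uses the flag names from this README, but the script's actual parser may be implemented differently:

```python
import argparse

# Sketch of the CLI described above (flag names taken from the README);
# BooleanOptionalAction provides both --licensee and --no-licensee.
parser = argparse.ArgumentParser(description="Fetch company facts from sec.gov")
parser.add_argument("--source_file", required=True,
                    help="input XLSX file (same file used for the Orbis batch search)")
parser.add_argument("--output_file",
                    help="output CSV path (defaults to the ./data/ folder)")
parser.add_argument("--licensee", action=argparse.BooleanOptionalAction,
                    required=True,
                    help="--licensee searches licensee data, --no-licensee licensor data")

args = parser.parse_args(["--source_file", "sample_data.xlsx", "--licensee"])
print(args.licensee)  # True
```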

On Remote

  • The action can be triggered from the Actions tab on GitHub. On the right side of the page, you can use the 'Run workflow' button to trigger the action.

To run the crawler class separately, check out the commented code in the ./orbi/crawl.py file.

Specifically, this line: ./orbi/orbi.py#494

Automation of Orbis database access and batch search on Orbis database

  • orbi.py can access the Orbis database, execute a batch search with the CSV file generated by the Crawler class, add/remove columns to enhance the search results, and export the results to a CSV file.

  • Currently the orbi.py file produces the following files. To download them, use the link sent to Slack and append one of the file names below.

Produced files by Orbi class
orbis_aggregated_data_{timestamp}.csv : example --> orbis_aggregated_data_13_01_2023.csv
orbis_aggregated_data_{timestamp}.xlsx : example --> orbis_aggregated_data_13_01_2023.xlsx
orbis_aggregated_data_licensee_{timestamp}.xlsx : example --> orbis_aggregated_data_licensee_14_01_2023.xlsx
orbis_aggregated_data_licensor_{timestamp}.xlsx : example --> orbis_aggregated_data_licensor_14_01_2023.xlsx
orbis_data_licensee_{timestamp}.csv : example --> orbis_data_licensee_14_01_2023.csv
orbis_data_licensee_{timestamp}.xlsx : example --> orbis_data_licensee_14_01_2023.xlsx
orbis_data_licensee_guo_{timestamp}.csv : example --> orbis_data_licensee_guo_14_01_2023.csv
orbis_data_licensee_guo_{timestamp}.xlsx : example --> orbis_data_licensee_guo_14_01_2023.xlsx
orbis_data_licensee_ish_{timestamp}.csv : example --> orbis_data_licensee_ish_14_01_2023.csv
orbis_data_licensee_ish_{timestamp}.xlsx : example --> orbis_data_licensee_ish_14_01_2023.xlsx
orbis_data_licensor_{timestamp}.csv : example --> orbis_data_licensor_14_01_2023.csv
orbis_data_licensor_{timestamp}.xlsx : example --> orbis_data_licensor_14_01_2023.xlsx
orbis_data_licensor_guo_{timestamp}.csv : example --> orbis_data_licensor_guo_14_01_2023.csv
orbis_data_licensor_guo_{timestamp}.xlsx : example --> orbis_data_licensor_guo_14_01_2023.xlsx
orbis_data_licensor_ish_{timestamp}.csv : example --> orbis_data_licensor_ish_14_01_2023.csv
orbis_data_licensor_ish_{timestamp}.xlsx : example --> orbis_data_licensor_ish_14_01_2023.xlsx
sample_data.xlsx
  • Data is accessible through: link + file name
Produced files by Crawler class
orbis_aggregated_data_{timestamp}.csv 
orbis_data_licensee_{timestamp}.csv
orbis_data_licensee_guo_{timestamp}.csv
orbis_data_licensee_ish_{timestamp}.csv
orbis_data_licensor_{timestamp}.csv
orbis_data_licensor_guo_{timestamp}.csv
orbis_data_licensor_ish_{timestamp}.csv
  • The XLSX files are generated by ./orbi/orbi.py by conducting a batch search on the Orbis database.
Produced XLSX files by Orbi class - END RESULT -
orbis_aggregated_data_{timestamp}.xlsx
orbis_aggregated_data_licensee_{timestamp}.xlsx
orbis_aggregated_data_licensor_{timestamp}.xlsx
orbis_data_licensee_{timestamp}.xlsx
orbis_data_licensee_guo_{timestamp}.xlsx
orbis_data_licensee_ish_{timestamp}.xlsx
orbis_data_licensor_{timestamp}.xlsx
orbis_data_licensor_guo_{timestamp}.xlsx
orbis_data_licensor_ish_{timestamp}.xlsx
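Judging by the example names above, {timestamp} follows a DD_MM_YYYY pattern. The format string below is inferred from those examples, not taken from the code, and timestamped_name is a hypothetical helper:

```python
from datetime import datetime


# Reconstructs the file-name pattern listed above; the DD_MM_YYYY
# format string is inferred from the examples, not from the scripts.
def timestamped_name(prefix: str, ext: str, when: datetime) -> str:
    return f"{prefix}_{when.strftime('%d_%m_%Y')}.{ext}"


print(timestamped_name("orbis_aggregated_data", "csv", datetime(2023, 1, 13)))
# orbis_aggregated_data_13_01_2023.csv
```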

Slack Integration

Currently, action results are uploaded to the AWS S3 service and are accessible via the link sent to a private Slack channel. The files can be downloaded as described in the Slack channel.

Run orbi from Slack

Orbi can be triggered on GitHub from Slack when you are in the tum-tim.slack.com workspace.

When any user types the following command in the Slack message field and presses 'Enter', Orbi will start the process on GitHub:

/run-orbis-crawler 

You will receive a result from Slack as shown below.


how-to-run-orbi-from-slack


After it is initialized, you will receive a message in the #idp-data-c channel on Slack similar to the following:


Initial Notification


When it finishes successfully, you will receive a new notification with a link that provides access to the data, similar to the following:

Success notification


In case of an error during the process, a notification similar to the one below will be received:

Error notification

Main Workflow

Besides the main workflow given below, there are other options that can be used with this repository.

The workflow is subject to change in time.

Batch Search Flow Chart

The following flow chart shows the process of batch search done by Orbi.

Flowchart of the batch search functionality of Orbi.
