DOMinator

What if you wanted to analyse the wins and losses of your favourite football teams, but you couldn't find an API to get the data from?

You would need a web scraper that periodically scrapes a page. But setting one up, writing the scripts and configuring cron takes valuable time that you could spend analysing the data.

My solution

That's where the DOMinator comes in. It lets you configure a web scraper using a simple declarative syntax and set a cron interval. The DOMinator then automatically fetches the site, scrapes the data and makes it available as JSON, as CSV and via a MongoDB database.

Interesting Things about this project

I had to create a custom declarative language based on JSON, which the users could use to specify what to scrape.

I wrote an interpreter/parser for this "language" (src/backend/dominator-parser). It recursively walks the queries and retrieves the matching elements. See the in-depth explanation of the interpreter.

On the frontend side, I had to create a dynamic editing experience that would allow users to preview what they fetched from the page (Demo).

Images of the site

You can find a "walkthrough" of the site in this markdown file.


The interpreter/parser is explained in depth in a separate document, which features a visual explanation of the Query Language.

The tech-stack

The project is a monorepo created with NX. This project uses a NestJS backend and an Angular frontend. The database is a MongoDB database.

How The DOMinator works

The first step in setting up a web scraper with DOMinator is to create a "ScrapeConfig". You set the name of the ScrapeConfig and the URL of the page you want to scrape, and you specify what information you want to extract.
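As a rough illustration of the three parts described above, here is a minimal TypeScript sketch of what such a config could contain. The field names (`name`, `url`, `cron`, `queries`), the `QuerySpec` shape and the `checkScrapeConfig` helper are assumptions made for this example; they are not DOMinator's actual schema or API.

```typescript
// Hypothetical shape of a ScrapeConfig. Field names are illustrative only.
interface QuerySpec {
  type: "TextQuery" | "TableQuery";
  selector: string; // CSS-style selector, e.g. ".score"
}

interface ScrapeConfig {
  name: string; // label for this scraper
  url: string; // page to fetch
  cron: string; // five-field cron expression for the fetch schedule
  queries: QuerySpec[]; // what to extract from the page
}

// Returns human-readable problems; an empty array means the config looks usable.
function checkScrapeConfig(config: ScrapeConfig): string[] {
  const issues: string[] = [];
  if (config.name.trim() === "") issues.push("name must not be empty");
  if (!/^https?:\/\//.test(config.url)) issues.push("url must start with http:// or https://");
  if (config.cron.trim().split(/\s+/).length !== 5) issues.push("cron must have five fields");
  if (config.queries.length === 0) issues.push("at least one query is required");
  return issues;
}

const footballResults: ScrapeConfig = {
  name: "football-results",
  url: "https://example.com/results",
  cron: "0 6 * * *", // every day at 06:00
  queries: [{ type: "TextQuery", selector: ".score" }],
};

console.log(checkScrapeConfig(footballResults)); // []
```

Validating the config before scheduling it is one possible design choice: a malformed URL or cron expression is caught when the config is saved rather than when the first scrape fails.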

The image below is a screenshot of the docs page. Please read it to understand how the parser works.

The Queries

Scrape Text from a page

<div>
  <div class="girl">
    Anna
  </div>
  <div class="boy">
    Marcus
  </div>
  <div class="boy">
    Alexander
  </div>
</div>

Query: TextQuery > BaseQuery(".boy")

Result: "Marcus"

To scrape text from a website, use a TextQuery. It can be as simple or as complex as the website you are scraping. Nested queries? No problem, use nested BaseQueries. Divs without a class or id? No problem, use nested ChildQueries.
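The example above can be sketched in TypeScript as follows. This is not the real dominator-parser: the `El` type is a simplified stand-in for a DOM element, the `BaseQuery` shape with its `next` field is a hypothetical way to express nesting, and only class selectors are supported.

```typescript
// Simplified stand-in for a DOM element (not the real DOM API).
interface El {
  classes: string[];
  text: string;
  children: El[];
}

// Hypothetical query shape: pick the first match, then optionally continue inside it.
interface BaseQuery {
  selector: string; // class selector such as ".boy"
  next?: BaseQuery; // nested query, evaluated inside the match
}

// Depth-first search in document order for the first descendant with the class.
function firstByClass(root: El, cls: string): El | undefined {
  for (const child of root.children) {
    if (child.classes.indexOf(cls) !== -1) return child;
    const hit = firstByClass(child, cls);
    if (hit) return hit;
  }
  return undefined;
}

// Resolves a (possibly nested) BaseQuery recursively.
function runBaseQuery(root: El, query: BaseQuery): El | undefined {
  const hit = firstByClass(root, query.selector.replace(/^\./, ""));
  if (!hit || !query.next) return hit;
  return runBaseQuery(hit, query.next);
}

// A TextQuery returns the trimmed text of whatever its BaseQuery selects.
function runTextQuery(root: El, query: BaseQuery): string | undefined {
  return runBaseQuery(root, query)?.text.trim();
}

// The HTML snippet from the example, written out as a tree.
const page: El = {
  classes: [],
  text: "",
  children: [
    { classes: ["girl"], text: " Anna ", children: [] },
    { classes: ["boy"], text: " Marcus ", children: [] },
    { classes: ["boy"], text: " Alexander ", children: [] },
  ],
};

const firstBoy = runTextQuery(page, { selector: ".boy" });
console.log(firstBoy); // Marcus
```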

Scrape any Text from a page

<div>
  <div class="girl">
    Anna
  </div>
  <div class="boy">
    Marcus
  </div>
  <div class="boy">
    Alexander
  </div>
</div>

Query: TextQuery > BaseQueryAll(".boy", 1)

Result: "Alexander"

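To scrape any matching text, whether it's in the first, second or hundredth element on a website, use a TextQuery with a BaseQueryAll. It supplements the BaseQuery by taking an index that picks one of all matching elements. As the example shows, index 1 selects the second match, so counting starts at 0.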
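A minimal TypeScript sketch of the index-based selection, under the same caveats as any illustration here: `PageNode` is a simplified stand-in for a DOM element, and `allByClass` and `textOfNth` are hypothetical helpers rather than functions from the DOMinator codebase.

```typescript
// Simplified stand-in for a DOM element (not the real DOM API).
interface PageNode {
  classes: string[];
  text: string;
  children: PageNode[];
}

// Collects every descendant with the given class, in document order.
function allByClass(root: PageNode, cls: string): PageNode[] {
  const hits: PageNode[] = [];
  for (const child of root.children) {
    if (child.classes.indexOf(cls) !== -1) hits.push(child);
    hits.push(...allByClass(child, cls));
  }
  return hits;
}

// TextQuery > BaseQueryAll(selector, index): text of the match at a zero-based index.
function textOfNth(root: PageNode, selector: string, index: number): string | undefined {
  const matches = allByClass(root, selector.replace(/^\./, ""));
  return matches[index]?.text.trim();
}

// The HTML snippet from the example, written out as a tree.
const people: PageNode = {
  classes: [],
  text: "",
  children: [
    { classes: ["girl"], text: " Anna ", children: [] },
    { classes: ["boy"], text: " Marcus ", children: [] },
    { classes: ["boy"], text: " Alexander ", children: [] },
  ],
};

console.log(textOfNth(people, ".boy", 1)); // Alexander
```

An index past the last match yields `undefined` in this sketch. How the real parser reports that case is not covered by the text above.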

Scrape a Table from a page

<div class="table">
    <div class="element">
        <p class="name1">
            Anna
        </p>
        <p>
            and
        </p>
        <p class="name2">
            Marcus
        </p>
    </div>
    <div class="element">
        <p class="name1">
            Alexander
        </p>
        <p>
            and
        </p>
        <p class="name2">
            Ferdinand
        </p>
    </div>
</div>

Query: TableQuery > Row: BaseQuery(".element") > Column: BaseQuery(".name1") > Column: BaseQuery(".name2")

Result: "Anna, Marcus", "Alexander, Ferdinand"

To scrape a table, specify one query for the element that acts as a row, plus one query per column for the elements to select inside each row.
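The row-and-column idea can be sketched in TypeScript like this. The `Cell` type, the `matchesByClass` helper and the `runTableQuery` signature are assumptions for illustration; the sketch treats the row query as matching every row element and each column query as taking the first match inside a row, which is what the result above suggests.

```typescript
// Simplified stand-in for a DOM element (not the real DOM API).
interface Cell {
  classes: string[];
  text: string;
  children: Cell[];
}

// Collects every descendant matching a class selector, in document order.
function matchesByClass(root: Cell, selector: string): Cell[] {
  const cls = selector.replace(/^\./, "");
  const hits: Cell[] = [];
  for (const child of root.children) {
    if (child.classes.indexOf(cls) !== -1) hits.push(child);
    hits.push(...matchesByClass(child, selector));
  }
  return hits;
}

// One output row per row element; one value per column selector inside that row.
function runTableQuery(root: Cell, rowSelector: string, columnSelectors: string[]): string[][] {
  return matchesByClass(root, rowSelector).map((rowEl) =>
    columnSelectors.map((sel) => matchesByClass(rowEl, sel)[0]?.text.trim() ?? "")
  );
}

// Helpers to write the example HTML as a tree.
const leaf = (text: string, cls?: string): Cell => ({
  classes: cls ? [cls] : [],
  text,
  children: [],
});
const pair = (first: string, second: string): Cell => ({
  classes: ["element"],
  text: "",
  children: [leaf(first, "name1"), leaf("and"), leaf(second, "name2")],
});
const table: Cell = {
  classes: ["table"],
  text: "",
  children: [pair("Anna", "Marcus"), pair("Alexander", "Ferdinand")],
};

const tableRows = runTableQuery(table, ".element", [".name1", ".name2"]);
console.log(JSON.stringify(tableRows)); // [["Anna","Marcus"],["Alexander","Ferdinand"]]
```

A missing column is filled with an empty string here so that every row keeps the same length, which keeps a later CSV export rectangular. That is a choice made for this sketch only.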

This is what a finished ScrapeConfig looks like

Once you have entered the URL of the page you want to scrape, you can press Get Preview. The server then creates a screenshot of the page, saves the entire DOM and sends that data back to the frontend.

Specifying what information you want from the page

As a tree, this query would look like this:

A more complex query might look like this:

Getting started

To run the backend project, either use the NX commands:

docker-compose -f docker-compose.dev.yml up
nx run-many --target=serve --all=true

Or run it with Docker:

docker-compose -f docker-compose.yml build
docker-compose -f docker-compose.yml up

obrhubr/DOMinator-webscraping
