[Bounty] Headless browser for fetching Proof Post #40 #44

synycboom · 2022-11-09T17:50:10Z

Why do we need these changes?

I'm adding a headless browser service to proof_server so that proofs can be found through a headless browser. This PR uses go-rod [https://github.com/go-rod/rod] as the driver for the headless browser. I saw that this repo deploys its services on AWS Lambda, so I tried to support that as well. Because the service needs the headless browser executable, it must be deployed as a Docker image, and it is not fully tested on Lambda (the setup is based on https://github.com/YoungiiJC/go-rod-aws-lambda).

Design Specs

The new headless browser service has two APIs.

/healthz (for health check)
/v1/find

The find API accepts three match types, each with its own parameters, described below.

Match by RegExp

{
    "location": "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
    "timeout": "10s",
    "match": {
        "type": "regexp",
        "regexp": {
            "selector": "*",
            "value": "/^Sig: .*/"
        }
    }
}

Match by XPath

{
    "location": "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
    "timeout": "10s",
    "match": {
        "type": "xpath",
        "xpath": {
            "selector": "//text()[contains(.,'Sig:')]"
        }
    }
}

Match by JS

{
    "location": "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
    "timeout": "10s",
    "match": {
        "type": "js",
        "js": {
            "value": "() => [].filter.call(document.querySelectorAll('*'), (el) => el.textContent.startsWith('Sig:'))[0]"
        }
    }
}

Note that all three match types above look for the same thing: the text starting with "Sig:" in the gist (shown in a screenshot in the original post).
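As a rough illustration of how the three request bodies above could map onto Go types, here is a minimal sketch. Only FindRequest, Match, MatchRegExp and their fields appear in this PR's example; MatchXPath, MatchJS, the JSON tags and the validate helper are assumptions inferred from the JSON shapes, and the real definitions in the headless package may differ. It also assumes the timeout is a Go duration string such as "10s".

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Illustrative types mirroring the JSON bodies above. MatchXPath and MatchJS
// are guessed names; the real headless package may define them differently.
type MatchRegExp struct {
	Selector string `json:"selector"`
	Value    string `json:"value"`
}

type MatchXPath struct {
	Selector string `json:"selector"`
}

type MatchJS struct {
	Value string `json:"value"`
}

type Match struct {
	Type        string       `json:"type"`
	MatchRegExp *MatchRegExp `json:"regexp,omitempty"`
	MatchXPath  *MatchXPath  `json:"xpath,omitempty"`
	MatchJS     *MatchJS     `json:"js,omitempty"`
}

type FindRequest struct {
	Location string `json:"location"`
	Timeout  string `json:"timeout"`
	Match    Match  `json:"match"`
}

// validate is a hypothetical helper: it checks that the timeout parses as a
// Go duration and that the payload for the declared match type is present.
func validate(r FindRequest) error {
	if _, err := time.ParseDuration(r.Timeout); err != nil {
		return fmt.Errorf("invalid timeout %q: %w", r.Timeout, err)
	}
	switch r.Match.Type {
	case "regexp":
		if r.Match.MatchRegExp == nil {
			return fmt.Errorf("match type regexp requires a regexp payload")
		}
	case "xpath":
		if r.Match.MatchXPath == nil {
			return fmt.Errorf("match type xpath requires an xpath payload")
		}
	case "js":
		if r.Match.MatchJS == nil {
			return fmt.Errorf("match type js requires a js payload")
		}
	default:
		return fmt.Errorf("unknown match type %q", r.Match.Type)
	}
	return nil
}

func main() {
	req := FindRequest{
		Location: "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
		Timeout:  "10s",
		Match: Match{
			Type:       "xpath",
			MatchXPath: &MatchXPath{Selector: "//text()[contains(.,'Sig:')]"},
		},
	}
	if err := validate(req); err != nil {
		panic(err)
	}
	body, err := json.Marshal(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Using one optional pointer per match type keeps the unused payloads out of the encoded JSON, which matches the shape of the three examples.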

Example Usage

package main

import (
	"context"
	"fmt"

	"github.com/nextdotid/proof_server/headless"
)

func main() {
	client := headless.NewHeadlessClient("http://localhost:9801")
	content, err := client.Find(context.Background(), &headless.FindRequest{
		Location: "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
		Timeout:  "10s",
		Match: headless.Match{
			Type: "regexp",
			MatchRegExp: &headless.MatchRegExp{
				Selector: "*",
				Value:    "/^Sig: .*/",
			},
		},
	})

	if err != nil {
		panic(err)
	}

	fmt.Println(content)
}

nykma

Almost flawless. Huge thanks for your contribution!
I only have a few trivial questions.

nykma · 2022-11-18T08:07:15Z

headless/find.go

+	go router.Run()
+
+	page = page.Timeout(timeoutDuration)
+	if err := page.WaitLoad(); err != nil {


Will this WaitLoad() handle in-page XHR correctly?
I mean, will it wait long enough to let the page finish loading completely?

After reading the go-rod code, I found that WaitLoad() waits only for window.onload. This causes a bug when the target page loads content via XHR. I'll fix this by waiting for XHR to finish after window.onload.
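To make the idea of "waiting for XHR after window.onload" concrete, here is a minimal, library-independent sketch. It is not go-rod's API and not the code in this PR: idleTracker, its methods and simulateLateXHR are hypothetical names. The tracker counts in-flight requests and treats the page as settled only once nothing has been in flight for a quiet period, or gives up when the context times out.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// idleTracker is a hypothetical helper that counts in-flight requests.
type idleTracker struct {
	mu         sync.Mutex
	inFlight   int
	lastChange time.Time
}

func newIdleTracker() *idleTracker {
	// Starting the clock now forces at least one quiet period to elapse,
	// so a request that starts just after onload is not missed.
	return &idleTracker{lastChange: time.Now()}
}

func (t *idleTracker) RequestStarted() {
	t.mu.Lock()
	t.inFlight++
	t.lastChange = time.Now()
	t.mu.Unlock()
}

func (t *idleTracker) RequestFinished() {
	t.mu.Lock()
	t.inFlight--
	t.lastChange = time.Now()
	t.mu.Unlock()
}

// WaitIdle blocks until no request has been in flight for the quiet period,
// or returns the context error if the deadline passes first.
func (t *idleTracker) WaitIdle(ctx context.Context, quiet time.Duration) error {
	ticker := time.NewTicker(5 * time.Millisecond)
	defer ticker.Stop()
	for {
		t.mu.Lock()
		idle := t.inFlight == 0 && time.Since(t.lastChange) >= quiet
		t.mu.Unlock()
		if idle {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// simulateLateXHR pretends the page fires one XHR lasting xhrMs milliseconds
// and reports how many milliseconds WaitIdle blocked.
func simulateLateXHR(xhrMs, quietMs, timeoutMs int) (int64, error) {
	t := newIdleTracker()
	t.RequestStarted()
	go func() {
		time.Sleep(time.Duration(xhrMs) * time.Millisecond)
		t.RequestFinished()
	}()

	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	start := time.Now()
	err := t.WaitIdle(ctx, time.Duration(quietMs)*time.Millisecond)
	return time.Since(start).Milliseconds(), err
}

func main() {
	waited, err := simulateLateXHR(50, 20, 1000)
	fmt.Println("waited at least as long as the XHR:", waited >= 50, "err:", err)
}
```

In a real browser driver the start and finish calls would be fed by network events; the request timeout from the API then bounds the total wait.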

synycboom · 2022-11-19T08:47:20Z

@nykma I made the API wait for XHR and also changed the test to simulate XHR.

https://github.com/nextdotid/proof_server/blob/c722b74f32803d8722a442799360b59c0bce6568/headless/find_test.go#L73

BinaryHB0916 · 2022-11-22T05:58:48Z

@synycboom Thank you for the hard work! Would you be interested in attending our EN Community Call tmr via Telegram group video call?

Date: GMT+8, 11:00am-12:00pm, Wednesday

Sharing details regarding your PR on the headless browser. Add me on Telegram: BinaryHB0916

nykma

I have tested this in my local env, and it works very well.

nykma · 2022-11-22T07:21:44Z

headless/find.go

+		}
+	}
+
+	c.JSON(http.StatusOK, FindRespond{Found: true})


We need to return the full content of the query result in the response.

For example, here's a request:

{ "location": "https://www.minds.com/newsfeed/1421043369127186449", "timeout": "30s", "match": { "type": "regexp", "regexp": { "selector": "*", "value": "Sig" } } }

This service will respond with found: true, which is pretty good.
But what upstream really cares about is the whole post content (the content after Sig: in this post).
So we need to return it in the API response.

Or, at least, if I define type: regexp with the search condition ^Sig: .*$, the whole matched substring should be returned in the API response.
The same goes for the xpath and js selectors: the node's content should be returned. Then the proof_service upstream will have a chance to verify the signature in it.

If we really want to capture only part of the node's content (in the regexp case), we would normally define capturing group(s) in the regex, since that makes it easier to cut out the content. However, in my opinion, the value returned from this API should be consistent across all match types. For example, it could return the node's content regardless of the match type. I'm not sure whether you are okay with that. What do you think?

I'm going to change the code to return the node's content instead of a boolean.

the returned value from this API should be consistent for all match types.

Agree.

it might return the node's content regardless of what match type.

Yeah, this is what I think. Even a "kinda dirty" match result [1] can be tolerated, because all the proof_service upstream wants is basically one line: Sig: BASE64_SIGNATURE .

[1]: like some <div> tag is mixed into the returned result.
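As a sketch of why a "dirty" result is tolerable, the upstream side could pull the signature out of whatever node content comes back with a capturing group. The function name extractSig and the exact pattern are assumptions for illustration, not code from this PR; the pattern assumes the signature uses the standard Base64 alphabet.

```go
package main

import (
	"fmt"
	"regexp"
)

// sigPattern captures the Base64 text that follows "Sig:". Because '<' is not
// in the Base64 alphabet, the capture stops before any trailing markup.
// Assumption: standard Base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=').
var sigPattern = regexp.MustCompile(`Sig:\s*([A-Za-z0-9+/=]+)`)

// extractSig is a hypothetical helper that returns the signature found in the
// node content, or false if the content has no "Sig:" marker.
func extractSig(content string) (string, bool) {
	m := sigPattern.FindStringSubmatch(content)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	sig, ok := extractSig("<div>Sig: YWJjZGVm</div>")
	fmt.Println(sig, ok)
}
```

Returning the whole node content from the API and doing this extraction upstream keeps the response shape identical for regexp, xpath and js matches.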

synycboom · 2022-11-23T11:39:51Z

I updated the code according to what we discussed in the comments, along with the test cases and the PR details.

nykma

I'll merge this. Thank you again for your contrib!

gitpoap-bot · 2022-11-25T11:14:17Z

Congrats, your important contribution to this open-source project has earned you a GitPOAP!

GitPOAP: 2022 Next.ID Contributor:

Head to gitpoap.io & connect your GitHub account to mint!

Learn more about GitPOAPs here.

synycboom added 12 commits November 9, 2022 14:53

chore: add go-rod to dependencies

76ab7c5

feat: implement headless modules for handling headless browser commun…

6446a4d

…ication

feat: use a custom launcher

500d473

test: add test cases

7925e27

feat: implement headless cmd

c44b028

feat: block unwanted resources

dfc6539

feat: add lambda cmd for headless

24b88f0

feat: implement headless client

a96d685

refactor: rename function name from validate to find

d00e478

fix: missing url path

62a74cc

fix: wrong api calls

6d153d5

refactor: clean up unused code

d159d01

nykma reviewed Nov 18, 2022

synycboom added 3 commits November 19, 2022 15:35

fix: wait for XHR before searching

fc54970

fix: wrong indent

6e3c6ae

refactor: extract html response to a variable

c722b74

nykma requested changes Nov 22, 2022

feat: return node's content instead of boolean

282628d

test: change target matching text

3bdb903

nykma approved these changes Nov 25, 2022

nykma merged commit 7b05759 into NextDotID:develop Nov 25, 2022
