[Bounty] Headless browser for fetching Proof Post #40 #44

Merged
merged 17 commits into from
Nov 25, 2022

Conversation

synycboom
Contributor

@synycboom synycboom commented Nov 9, 2022

Why do we need these changes?

I'm adding a headless browser service to proof_server so that proofs can be found by loading pages in a headless browser. This PR uses go-rod [https://github.com/go-rod/rod] as the driver for the headless browser. I saw that this repo uses AWS Lambda to deploy its services, so I tried to support it. Because the headless browser service needs the headless browser executable, it must be deployed as a Docker image instead, and it has not been fully tested on Lambda (it is based on https://github.com/YoungiiJC/go-rod-aws-lambda).

Design Specs

The new headless browser service has two APIs.

  • /healthz (for health check)
  • /v1/find

The find API accepts three match types, described below.

  1. Match by RegExp
{
    "location": "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
    "timeout": "10s",
    "match": {
        "type": "regexp",
        "regexp": {
            "selector": "*",
            "value": "/^Sig: .*/"
        }
    }
}
  2. Match by XPath
{
    "location": "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
    "timeout": "10s",
    "match": {
        "type": "xpath",
        "xpath": {
            "selector": "//text()[contains(.,'Sig:')]"
        }
    }
}
  3. Match by JS
{
    "location": "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
    "timeout": "10s",
    "match": {
        "type": "js",
        "js": {
            "value": "() => [].filter.call(document.querySelectorAll('*'), (el) => el.textContent.startsWith('Sig:'))[0]"
        }
    }
}

Note that all three match types above look for the same thing: the text shown in this picture.
image
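
To make the shape of these payloads concrete, here is a minimal sketch of how the request body could be decoded and checked in Go. The type and function names below (FindRequest, Match, Validate, parseFindRequest) are assumptions modelled on the JSON above and on the example usage; the actual definitions in the headless package may differ. Parsing the timeout with time.ParseDuration is also an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Hypothetical request types mirroring the JSON payloads above.
type MatchRegExp struct {
	Selector string `json:"selector"`
	Value    string `json:"value"`
}

type MatchXPath struct {
	Selector string `json:"selector"`
}

type MatchJS struct {
	Value string `json:"value"`
}

type Match struct {
	Type   string       `json:"type"`
	RegExp *MatchRegExp `json:"regexp,omitempty"`
	XPath  *MatchXPath  `json:"xpath,omitempty"`
	JS     *MatchJS     `json:"js,omitempty"`
}

type FindRequest struct {
	Location string `json:"location"`
	Timeout  string `json:"timeout"`
	Match    Match  `json:"match"`
}

// Validate checks that the timeout parses as a Go duration and that the
// payload for the declared match type is present.
func (r *FindRequest) Validate() error {
	if r.Location == "" {
		return errors.New("location is required")
	}
	if _, err := time.ParseDuration(r.Timeout); err != nil {
		return fmt.Errorf("invalid timeout: %w", err)
	}
	var present bool
	switch r.Match.Type {
	case "regexp":
		present = r.Match.RegExp != nil
	case "xpath":
		present = r.Match.XPath != nil
	case "js":
		present = r.Match.JS != nil
	default:
		return fmt.Errorf("unknown match type %q", r.Match.Type)
	}
	if !present {
		return fmt.Errorf("match.%s is required", r.Match.Type)
	}
	return nil
}

// parseFindRequest decodes a JSON body and validates it.
func parseFindRequest(data []byte) (*FindRequest, error) {
	var req FindRequest
	if err := json.Unmarshal(data, &req); err != nil {
		return nil, err
	}
	if err := req.Validate(); err != nil {
		return nil, err
	}
	return &req, nil
}

func main() {
	body := []byte(`{
		"location": "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
		"timeout": "10s",
		"match": {"type": "regexp", "regexp": {"selector": "*", "value": "/^Sig: .*/"}}
	}`)
	req, err := parseFindRequest(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Match.Type, req.Match.RegExp.Value)
}
```

Using one pointer field per match type keeps the three payloads separate while letting a single switch reject a request whose type and payload disagree.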

Example Usage

package main

import (
	"context"
	"fmt"

	"github.com/nextdotid/proof_server/headless"
)

func main() {
	client := headless.NewHeadlessClient("http://localhost:9801")
	content, err := client.Find(context.Background(), &headless.FindRequest{
		Location: "https://gist.github.com/synycboom/2290ee73c760c554535860cd3ed4b636",
		Timeout:  "10s",
		Match: headless.Match{
			Type: "regexp",
			MatchRegExp: &headless.MatchRegExp{
				Selector: "*",
				Value:    "/^Sig: .*/",
			},
		},
	})

	if err != nil {
		panic(err)
	}

	fmt.Println(content)
}

Member

@nykma nykma left a comment

Almost flawless. Huge thanks for your contribution!
Only a few trivial questions.

go router.Run()

page = page.Timeout(timeoutDuration)
if err := page.WaitLoad(); err != nil {
Member

Will this WaitLoad() handle in-page XHR correctly?
I mean, will it wait long enough for the page to finish loading completely?

Contributor Author

@synycboom synycboom Nov 19, 2022

After reading the go-rod code, I found that WaitLoad() waits only for window.onload. This will cause a bug when the target page loads content through XHR. I'll fix this by waiting for XHR to finish after window.onload.

@synycboom
Contributor Author

synycboom commented Nov 19, 2022

@nykma I made the API wait for XHR, and also changed the test to simulate XHR.

https://github.com/nextdotid/proof_server/blob/c722b74f32803d8722a442799360b59c0bce6568/headless/find_test.go#L73

@BinaryHB0916
Member

@synycboom Thank you for the hard work! Would you be interested in attending our EN Community Call tomorrow via Telegram group video call?

Date: GMT+8, 11:00am-12:00pm, Wednesday

Sharing details regarding your PR on the headless browser. Add me on Telegram: BinaryHB0916

Member

@nykma nykma left a comment

I have tested this in my local environment, and it works very well.

headless/find.go Outdated
}
}

c.JSON(http.StatusOK, FindRespond{Found: true})
Member

We need the full content of the query result in the response.

For example, here's a request:

{
  "location": "https://www.minds.com/newsfeed/1421043369127186449",
  "timeout": "30s",
  "match": {
     "type": "regexp",
     "regexp": {
       "selector": "*",
       "value": "Sig"
     }
  }
}

This service will respond with found: true, which is good.
But what upstream actually cares about is the whole post content (the content after Sig: in this post).
So we need to return it in the API response.

Member

Or, at least, if I define type: regexp with the search condition ^Sig: .*$, the whole matched substring should be returned in the API response.
The same goes for the xpath and js selectors: the node's content should be returned. Then the proof_service upstream will have a chance to verify the signature in it.

Contributor Author

If we really want to capture part of the node's content (in the regexp case), we should normally define capturing group(s) in the regex, since that makes it easier to cut out the content. However, in my opinion, the returned value from this API should be consistent for all match types. For example, it might return the node's content regardless of what match type. I'm not sure whether you are okay with that. What do you think?

I'm going to change the code to return the node's content instead of a boolean.

Member

the returned value from this API should be consistent for all match types.

Agree.

it might return the node's content regardless of what match type.

Yeah, this is what I think. Even a "kinda dirty" match result [1] can be tolerated, because all the proof_service upstream wants is basically one line: Sig: BASE64_SIGNATURE.

[1]: For example, a <div> tag mixed into the returned result.
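
As a rough illustration of why a "dirty" result is tolerable, the upstream side could pull the signature out of the returned node content with a capturing group. This is only a sketch: extractSig and sigPattern are hypothetical names, and the character class assumes the signature is standard or URL-safe Base64, which this thread does not confirm.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// sigPattern captures the token that follows "Sig:". The character class
// covers the standard and URL-safe Base64 alphabets plus padding; this is an
// assumption about how the signature is encoded.
var sigPattern = regexp.MustCompile(`Sig:\s*([A-Za-z0-9+/_=-]+)`)

// extractSig returns the first signature found in content, even when the
// content still contains surrounding markup.
func extractSig(content string) (string, error) {
	m := sigPattern.FindStringSubmatch(content)
	if m == nil {
		return "", errors.New("no Sig: line found")
	}
	return m[1], nil
}

func main() {
	// A made-up "dirty" result with markup mixed into the node content.
	dirty := `<div class="post">Hello<br>Sig: dGVzdC1zaWduYXR1cmU=</div>`
	sig, err := extractSig(dirty)
	if err != nil {
		panic(err)
	}
	fmt.Println(sig)
}
```

Because the capture stops at the first character outside the Base64 alphabet, a trailing tag such as </div> does not end up in the extracted value.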

@synycboom
Contributor Author

synycboom commented Nov 23, 2022

I've applied what we discussed in the comments and updated the test cases and PR details.

Member

@nykma nykma left a comment

I'll merge this. Thank you again for your contrib!

@nykma nykma merged commit 7b05759 into NextDotID:develop Nov 25, 2022
@gitpoap-bot

gitpoap-bot bot commented Nov 25, 2022

Congrats, your important contribution to this open-source project has earned you a GitPOAP!

GitPOAP: 2022 Next.ID Contributor:

Head to gitpoap.io & connect your GitHub account to mint!

Learn more about GitPOAPs here.
