
Design and implement the architecture of our analyzer #12

Closed
SebastianZimmeck opened this issue Nov 18, 2020 · 21 comments
Assignees
Labels
enhancement New feature

Comments

@SebastianZimmeck
Member

As we have now settled on Django/Python, Selenium + BrowserMob, and Heroku, the question becomes how to design and implement our analyzer. I have updated the architectural overview.

Essentially, we could design our analyzer as a web app that the developer logs in to, running their web app in the browser of our web app, which we then analyze. So, everything could work remotely. The developer does not need to install anything locally on their end. As @davebaraka mentioned, maybe it is clearer to actually run two servers: one where the developer is running their web app and one with our analysis logic.

An alternative setup could be that only our analysis module runs on the server and the developer runs their app locally. Their app would then communicate with our server about the currently running analysis, results to display, etc. Maybe, in this architecture, we ask the developer to integrate into their web app an SDK that we provide that communicates with our server and drives the analysis.

The bottom line is to come up with an architecture that does not require the developer to do a whole lot of setup steps on their end and that also makes our analysis work across platforms (e.g., independently of whether the developer is using Windows or macOS locally).

@davebaraka
Collaborator

Essentially, we could design our analyzer as a web app that the developer logs in to, running their web app in the browser of our web app, which we then analyze. So, everything could work remotely. The developer does not need to install anything locally on their end.

One way to approach this is to display a website within a website using the HTML inline frame element (<iframe>). I ran into some errors (X-Frame-Options: deny/sameorigin) when trying this, and while there may be some ways to get around these errors, such as this X-Frame-Bypass library, this would not work with our proposed architecture as we would need to capture the HTTP requests from the ‘front-end’. In other words, the developer’s web app would run on their browser and not the one on our server.
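For context, the framing rules behind those errors can be sketched as a small check. This is an illustrative helper (not part of our codebase) approximating how browsers evaluate X-Frame-Options and the CSP frame-ancestors directive:

```javascript
// Hypothetical helper: given a site's response headers, decide whether a
// cross-origin <iframe> would be allowed to render it.
function canEmbedCrossOrigin(headers) {
  // X-Frame-Options: DENY or SAMEORIGIN both block a cross-origin <iframe>.
  const xfo = (headers['x-frame-options'] || '').toLowerCase();
  if (xfo === 'deny' || xfo === 'sameorigin') return false;

  // CSP frame-ancestors supersedes X-Frame-Options in modern browsers.
  const csp = headers['content-security-policy'] || '';
  const m = csp.match(/frame-ancestors\s+([^;]+)/i);
  if (m) {
    const sources = m[1].trim();
    if (sources === "'none'" || sources === "'self'") return false;
  }
  return true;
}
```

Any site shipping either header (which includes most sites worth analyzing) would refuse to load in our iframe, which is why the bypass-library route is a dead end for us.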

Another approach would be to stream a developer’s website from our server to their browser, and then they could remotely interact with it, for instance, using the chrome.tabCapture API, though I would imagine this would be laggy, and I’m not sure it would be practical.

An alternative setup could be that only our analysis module runs on the server and the developer runs their app locally. Their app would then communicate with our server about the currently running analysis, results to display, etc. Maybe, in this architecture, we ask the developer to integrate into their web app an SDK that we provide that communicates with our server and drives the analysis.

I think this is the most promising approach, but there are a lot more moving pieces, such as Selenium requiring a Chrome installation. Do we package Chrome with the SDK, or have the developer link to a local copy on their computer? How simple can we make the experience for the developer? I’ll need more time to think about this.

Also, @SebastianZimmeck mentioned in the architecture, ‘The developer logs into the developer dashboard of the web app privacy analysis app and starts the analysis.’ Are we expecting to handle user authentication and remember a developer's website analysis results?

@SebastianZimmeck
Member Author

Are we expecting to handle user authentication and remember a developer's website analysis results?

Yes, I am imagining something like the dashboard on Google Analytics (to name one example).

I think this is the most promising approach,

I agree. Using iframes and streaming the website sound like approaches where a lot of things could go wrong. Maybe we can package Selenium and our other tools into a relatively simple-to-install app that developers can run locally (possibly with some SDK in the developer's app if it helps us) and then send the information to our server with the dashboard. So there would be a clear separation between analysis (local) and results (online).

The only other solution I can currently think of is to host a virtual machine on our server where the developer is running their web app. This should be, in principle, similar to the local solution, except that the machine is run not locally but on our server.

@SebastianZimmeck
Member Author

On the VM idea, I started an f1-micro instance (1 vCPU, 0.6 GB memory, Ubuntu/Linux) on the Google Cloud Platform using Compute Engine. It should be up and running, and both of you, @davebaraka and @rgoldstein01, should have received an email with details on how to access it.

@davebaraka, it would be good if you can look into whether that works for our purposes. Feel free to make any changes to the setup of the VM (or otherwise).

You can connect to Linux instances through either the GCP Console or the gcloud command-line tool. Compute Engine generates an SSH key for you and stores it in one of the following locations:
By default, Compute Engine adds the generated key to project or instance metadata.
If your account is configured to use OS Login, Compute Engine stores the generated key with your user account.

For me as a reminder: This instance should be free for the time being:

Your Free Tier f1-micro instance limit is by time, not by instance. Each month, eligible use of all of your f1-micro instances is free until you have used a number of hours equal to the total hours in the current month.

Here is an explanation on Stackoverflow on the pricing.

@davebaraka
Collaborator

davebaraka commented Dec 2, 2020

Considering the VM instance, where we have a Linux distribution available to us, there still does not seem to be a seamless way to run a developer’s web app on the browser of our web app. I looked into SSH X11 Forwarding, which allows you to start up remote applications and forward the application display to your local machine, but there is no way to get this to work in the context of our web app and the developer’s browser.

To get something where a developer could access and navigate their website within our web app, Electron would be our best bet, as we would basically be building our own browser. Though this would probably require a re-exploration of intercepting and decrypting HTTP requests and a change in tools.

With our current set of tools (Django/Python, Selenium + BrowserMob, and Heroku or Google Cloud Platform), a ‘modular’ architecture may be best when considering a developer’s interaction with their website. I imagine this setup to include our current set of tools, where we analyze a website through crawling, and then provide an optional SDK or program that a developer could use with their web app; we could then perform a deeper analysis from the additional collected/intercepted data and present that in the dashboard.

@SebastianZimmeck
Member Author

As discussed on Tuesday, @davebaraka will explore both the modular as well as the Electron-based architecture. We can see what makes the most sense.

@davebaraka
Collaborator

I'm exploring different architectures, and in order to use Selenium, it seems that we would have to package a version of Chrome, Firefox, or another supported browser with Selenium, or have the developer link to a local copy on their machine when distributing our tool. There could be compatibility issues between the Selenium webdriver and the browser version if a developer linked to a local copy. In other words, I feel like we may run into complications packaging and creating an easy experience for a developer.

As an alternative, and since we are not crawling a site and are currently only looking at the 'POST/PUT' data, I created an architecture that revolves around a browser extension. The extension is able to intercept and read the network requests, and it can communicate with a web app. Additionally, I can inject javascript into webpages, where I imagine we guide a user and give status updates as they navigate their site. We also have a popup dialog available from the extension to provide updates. I've provided a demo of the prototype here.
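As a sketch of the interception idea, a background script can read POST/PUT payloads via the chrome.webRequest API roughly like this (the helper name is illustrative, not the actual code in src/analysis/analyze.js):

```javascript
// Hypothetical helper: decode the body of an intercepted request.
// chrome.webRequest delivers parsed form data or raw byte chunks.
function decodeRequestBody(details) {
  if (!details.requestBody) return null;
  // Parsed form submissions arrive as a key -> [values] map.
  if (details.requestBody.formData) return details.requestBody.formData;
  // Other payloads (e.g., JSON) arrive as raw UTF-8 byte chunks.
  if (details.requestBody.raw) {
    return details.requestBody.raw
      .map(part => new TextDecoder('utf-8').decode(part.bytes))
      .join('');
  }
  return null;
}

// In the background script this would be registered roughly as:
// chrome.webRequest.onBeforeRequest.addListener(
//   details => console.log(details.url, decodeRequestBody(details)),
//   { urls: ['<all_urls>'] },
//   ['requestBody']
// );
```

Note that this only covers request bodies; response bodies are a separate problem, discussed below in the thread.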

I think I can do pretty much the same thing with selenium (without a browser extension), but I'm a bit unsure about how we would distribute this to developers. With a chrome extension, a developer would install the extension and we would move on from there.

@SebastianZimmeck
Member Author

Nice. It looks really good. One drawback may be that with a browser extension we would be limited to intercepting the HTTP requests between the web app and its user. We could not intercept any requests between a backend API and the app (issue #16). Generally, we would be limited by the browser sandbox. So, we would need to combine the extension with other tools.

Any thoughts on Electron? I found some tutorial on building a web browser with Electron and JS. It seems that Electron uses Chromium; that route, though, would be more complicated. Also, whatever browser solution we pick, there is probably not much we can do in terms of intercepting HTTP requests at the backend.

@davebaraka
Collaborator

Using a browser extension does not meet the requirements for our analysis. I will explore an alternative setup using Electron (maybe using Selenium with Electron) and another setup using a proxy on a server. mitmproxy seems like a promising HTTPS proxy.

At the moment, somehow packaging Selenium and injecting javascript with it that connects a developer's website to our web app seems like the most promising route.
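A rough sketch of the injection idea, assuming the Node selenium-webdriver bindings (the overlay banner and message are placeholders, not our actual guidance UI):

```javascript
// Hypothetical helper: builds the script string we would inject into the
// developer's page to display a guidance overlay during the analysis.
function buildOverlayScript(message) {
  return `
    const banner = document.createElement('div');
    banner.textContent = ${JSON.stringify(message)};
    banner.style.cssText = 'position:fixed;top:0;left:0;right:0;' +
      'padding:8px;background:#222;color:#fff;z-index:99999;text-align:center';
    document.body.appendChild(banner);
  `;
}

// With the Node selenium-webdriver bindings this would be used roughly as:
//   const { Builder } = require('selenium-webdriver');
//   const driver = await new Builder().forBrowser('firefox').build();
//   await driver.get(developerSiteUrl);
//   await driver.executeScript(buildOverlayScript('Analysis in progress…'));
```

Selenium's executeScript runs the string in the page context, so the injected code can also talk to our web app (e.g., via fetch), subject to the page's CORS policy.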

@SebastianZimmeck
Member Author

@davebaraka had the idea of looking into a reverse proxy to intercept the traffic between, say, an app's database and a third-party data broker. Also, it seems that we are leaning towards Selenium as opposed to Electron.

@davebaraka
Collaborator

Although Electron does provide a viable solution, as we can intercept requests, read post and body data, and we are less restricted in providing a local user experience, we decided to use Selenium and inject javascript to provide an experience that guides a developer. I will look into javascript injection, since I have yet to test out this functionality in Selenium.

Thinking way ahead to when we package/distribute our tool:

  • Does selenium require a certain version of firefox/chrome to operate? In other words, is the selenium webdriver dependent on a specific version of chrome? Do we have to continually update the tool when a new version of chrome/firefox is released?
  • Should we include our own version of chrome/firefox with the tool?

davebaraka added a commit that referenced this issue Dec 21, 2020
add browser extension to inject javascript #12
@davebaraka
Collaborator

Here's a potential flow including a portal/sign-in.

[attached diagram: Presentation1]

@SebastianZimmeck
Member Author

As we switched from a developer tool to a user tool, there is probably not a whole lot to say beyond that we will be creating a browser extension. However, I am leaving this issue open for now in case something comes up.

davebaraka added a commit that referenced this issue Jan 23, 2021
davebaraka added a commit that referenced this issue Jan 23, 2021
@davebaraka
Collaborator

I've scaffolded the browser extension.

@rgoldstein01 @notowen333 To get started:

  1. Open the Extension Management page by navigating to chrome://extensions.

  2. The Extension Management page can also be opened by clicking on the Chrome menu, hovering over More Tools then selecting Extensions.

  3. Enable Developer Mode by clicking the toggle switch next to Developer mode.

  4. Click the LOAD UNPACKED button and select the extension directory. This is the src folder in the repo. All code related to the extension lives there.

  5. You can debug the extension by clicking 'Inspect views background.html' on the extension page and then navigate to the console tab. You should see entries of network requests made since the extension was enabled. These network requests are being printed from src/analysis/analyze.js.

  6. When making changes to source files, use the reload button from the extension page in order for changes to take effect. For more info about developing extensions see the docs or ask me.

Unfortunately, Chrome doesn't yet have a webRequest API to get response body data. There have been talks about adding this feature to Chrome since 2015 here. As a workaround, I've tried a method described here to get response data, but I was not getting the expected results (if anyone else wants to give it a try).

Though, it seems like Firefox has an API with this functionality. In other words, Firefox potentially has everything we need. If we decide to use Firefox, we wouldn't be able to fully port the extension to Chrome or Safari until they support an API for handling network response data.

Post data is available for both chrome and Firefox.
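The Firefox API in question is browser.webRequest.filterResponseData, which returns a StreamFilter for a given request. A rough sketch of how a background script could use it (the helper name and pass-through behavior are illustrative):

```javascript
// Hypothetical helper: accumulate a response body while forwarding the
// bytes so the page still loads normally. Firefox-only; Chrome has no
// equivalent API.
function captureResponseBody(details, onBody) {
  const filter = browser.webRequest.filterResponseData(details.requestId);
  const decoder = new TextDecoder('utf-8');
  let body = '';
  filter.ondata = event => {
    body += decoder.decode(event.data, { stream: true });
    filter.write(event.data); // pass the chunk through unmodified
  };
  filter.onstop = () => {
    filter.disconnect();
    onBody(details.url, body);
  };
}

// Registered in the background script roughly as:
// browser.webRequest.onBeforeRequest.addListener(
//   d => captureResponseBody(d, (url, body) => console.log(url, body)),
//   { urls: ['<all_urls>'] },
//   ['blocking']
// );
```

This assumes text responses; binary bodies would need to skip the decoding step.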

@SebastianZimmeck
Member Author

That is excellent, @davebaraka! On the Chrome/Firefox question, I am OK with using Firefox as long as we can design this extension in a modular way with the Firefox logic clearly separated from the analysis logic, label creation logic, etc. so that we potentially can use our work in a non-Firefox context.

@SebastianZimmeck
Member Author

(if anyone else wants to give it a try)

Yes, let's make sure we know that it does not work before going the Firefox route.

@rgoldstein01
Collaborator

I have followed both the tutorial that David linked and another one, and it appears that Chrome has deprecated most of the functionality around web requests, as the documentation for reading the bodies is no longer on their website. I suggest we try out the Firefox version instead. @davebaraka, I don't mind creating that and giving it a try, if that helps. I can create the base app later today.

@rgoldstein01
Collaborator

So I have created the Firefox extension, and it is able to do a few basic tasks, even print the headers, etc. However, I am still getting undefined when I try to look into the request bodies. There are a few discussions about it:
https://bugzilla.mozilla.org/show_bug.cgi?id=1376155
https://bugzilla.mozilla.org/show_bug.cgi?id=1201979

But I don't think any of it was ever resolved. Not sure where this leaves us, as we need to be able to see the bodies for more of the relevant information.

@rgoldstein01
Collaborator

I have found another issue on it, https://bugzilla.mozilla.org/show_bug.cgi?id=1416486, which even explicitly says the following: "We don’t support upload streams from plugins, and we don’t intend to given that plugins are deprecated."

@davebaraka
Collaborator

davebaraka commented Feb 10, 2021

I've added the Firefox extension to the repo under the src directory.

To install an extension temporarily:

  • open Firefox
  • enter "about:debugging" in the URL bar
  • click "This Firefox"
  • click "Load Temporary Add-on"
  • open the extension's directory and select any file inside the extension. (Select manifest.json from the src directory in the repo)

To debug an extension:

  • click Inspect next to your extension.
  • When you load a website, you should see a feed of http requests in the console. You can learn about my custom request object here

If you make any changes to the extension, make sure to reload it by clicking Reload on the "about:debugging" page.

Here are detailed docs about installation and debugging

Let me know if you experience any problems.

@SebastianZimmeck
Member Author

For the time being, we decided to keep it simple, with the extension working locally in the user's browser. So, maybe use the Web Storage API to store analysis results when a user visits a site and then show them to the user in a popup or local HTML page of the extension.
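A minimal sketch of that local-storage idea, assuming a Web-Storage-style object (e.g., localStorage in the extension's pages); the key scheme and result shape are illustrative, not a settled design:

```javascript
// Hypothetical helpers: persist per-site analysis results as JSON under a
// key derived from the site's origin, appending one entry per visit.
function saveAnalysis(storage, origin, results) {
  const key = `analysis:${origin}`;
  const prev = JSON.parse(storage.getItem(key) || '[]');
  prev.push({ timestamp: Date.now(), results });
  storage.setItem(key, JSON.stringify(prev));
}

function loadAnalysis(storage, origin) {
  return JSON.parse(storage.getItem(`analysis:${origin}`) || '[]');
}

// In the extension's popup or local HTML page this would be called as, e.g.:
//   saveAnalysis(localStorage, location.origin, { requests: capturedCount });
//   const history = loadAnalysis(localStorage, location.origin);
```

Passing the storage object in keeps the logic testable outside the browser; the extension-specific chrome.storage/browser.storage APIs could be swapped in later if we need storage shared across extension contexts.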

@davebaraka
Collaborator

I will start building the UI in issue 61
