This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Can we build a heuristic for browser attribute fingerprinting? #34

Open
birdsarah opened this issue Mar 11, 2019 · 16 comments
Labels
good first issue Good for newcomers research question Outstanding questions that have not been investigated yet.

Comments

@birdsarah
Contributor

There are some scripts that we can pick out by name that are doing browser attribute fingerprinting:

  • The fingerprintjs scripts
  • Scripts with hs-analytics in the script_url
  • Scripts with /akam/ in the script_url

Can we build a heuristic for browser attribute fingerprinting that pulls out these scripts?
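As a baseline for comparison, the name-based selection described above can be sketched in a few lines. This is only an illustration: the `script_url` field name comes from the issue, but the sample rows, the `NAME_PATTERNS` tuple, and the `matches_known_name` helper are hypothetical, and using the substring "fingerprint" to stand in for "the fingerprintjs scripts" is an assumption.

```python
# Hypothetical URL fragments; "fingerprint" is an assumed stand-in for the
# fingerprintjs scripts, the other two come straight from the issue text.
NAME_PATTERNS = ("fingerprint", "hs-analytics", "/akam/")


def matches_known_name(script_url: str) -> bool:
    """Return True if the URL contains any of the known name fragments."""
    url = script_url.lower()
    return any(pattern in url for pattern in NAME_PATTERNS)


# Made-up rows for illustration only; real rows would come from the dataset.
rows = [
    {"script_url": "https://cdn.example.com/js/fingerprint2.min.js"},
    {"script_url": "https://js.example.net/hs-analytics/123.js"},
    {"script_url": "https://www.example.org/akam/10/abc"},
    {"script_url": "https://www.example.org/static/app.js"},
]

# Keep only the scripts we can pick out by name.
known = [row["script_url"] for row in rows if matches_known_name(row["script_url"])]
```

A heuristic that answers this issue would need to recover these same scripts from their behaviour rather than from their URLs.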

@birdsarah birdsarah added good first issue Good for newcomers research question Outstanding questions that have not been investigated yet. labels Mar 11, 2019
@birdsarah
Contributor Author

I uploaded a notebook with basic examples for finding each of the scripts here: https://github.com/mozilla/overscripted/blob/master/analyses/issue_34_setup_and_dask_tips.ipynb

I also gave additional information in the chat. Pasting here too:

hs-analytics:

akam:

fingerprintjs2:

@Victory17

Hi
I am interested in this. I want to work on it.

@muskankhedia

muskankhedia commented Mar 15, 2019

Hi @birdsarah, I was looking at this issue, and the notebook you uploaded already performs all three tasks mentioned in it. Can you please explain in more detail what further changes to the notebook are needed to solve this issue?

@srujana121

Hi @birdsarah, I am applying for Outreachy.

Do you think it's a good idea to detect canvas fingerprinting? I am thinking along the lines of detecting unnecessary canvas elements, but I am not entirely sure how to detect which elements are not needed.

Generally, canvas fingerprinting is done by calling the toDataURL() method. I am assuming there is no real reason for genuine scripts to get the canvas image in data URL format. Do you have any suggestions for me?

@birdsarah
Contributor Author

@srujana121 @muskankhedia I will try to answer both your questions together. @srujana121 there is no need to develop a technique for detecting fingerprinting. This has already been developed and examples are in the literature. See "Online Tracking: A 1-million-site Measurement and Analysis" and "The Web's Sixth Sense" on the reading list: https://github.com/mozilla/overscripted/wiki/Reading-List-(WIP)

In particular, the code for detecting four types of fingerprinting we're interested in (canvas, font, audio, and webrtc) is available here: https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py

@willougr has done the work of applying these heuristics to our dataset and will be submitting his code shortly. Some of the results of his work are here: https://github.com/mozilla/overscripted/blob/master/analyses/2018_12_willoughr__fingerprinting_prevalence.txt

This issue is about developing code like that shown at https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py but finding a set of rules that detects browser attribute fingerprinting, that is, the type of fingerprinting that compiles together a series of browser attributes. Again, the reading list articles elaborate on this type of fingerprinting in more detail.

@muskankhedia the notebook supplied does not solve this issue; it provides the code to filter some relevant scripts out of the whole dataset. The hard work is then developing a "heuristic" that picks out these scripts and others like them. By "heuristic" I mean a rule-set encoded in code that selects for specific scripts and not others.

For example, in the case of canvas fingerprinting, the heuristic in extract_features.py looks for scripts that call toDataURL but do not call save, restore, or addEventListener (along with some other things).
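To make "a rule-set encoded in code" concrete, here is a minimal sketch of the canvas rule as described in the comment above. It is not the code from extract_features.py: the fully qualified symbol names, the `(script_url, symbol)` call format, and the sample data are assumptions, and the "other things" that the real heuristic checks are left out.

```python
from collections import defaultdict

# Assumed symbol names; the real dataset and extract_features.py may differ.
REQUIRED = {"HTMLCanvasElement.toDataURL"}
EXCLUDED = {
    "CanvasRenderingContext2D.save",
    "CanvasRenderingContext2D.restore",
    "HTMLCanvasElement.addEventListener",
}


def symbols_by_script(calls):
    """Group the set of called symbols under each script URL."""
    grouped = defaultdict(set)
    for script_url, symbol in calls:
        grouped[script_url].add(symbol)
    return grouped


def canvas_candidates(calls):
    """Scripts that call every REQUIRED symbol and none of the EXCLUDED ones."""
    return sorted(
        url
        for url, symbols in symbols_by_script(calls).items()
        if REQUIRED <= symbols and not (EXCLUDED & symbols)
    )


# Made-up call log: (script_url, symbol) pairs.
calls = [
    ("https://a.example/fp.js", "HTMLCanvasElement.toDataURL"),
    ("https://a.example/fp.js", "CanvasRenderingContext2D.fillText"),
    ("https://b.example/draw.js", "HTMLCanvasElement.toDataURL"),
    ("https://b.example/draw.js", "CanvasRenderingContext2D.save"),
    ("https://c.example/ui.js", "CanvasRenderingContext2D.fillText"),
]

candidates = canvas_candidates(calls)
```

With this sample data only `https://a.example/fp.js` is selected: `draw.js` is excluded because it calls `save`, and `ui.js` never calls `toDataURL`. A browser attribute heuristic would have the same shape but a different set of required and excluded symbols.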

@muskankhedia

Hi @birdsarah,

I have some doubts regarding this: do we have to make a list of the scripts used for browser attribute fingerprinting and search for each of them individually in a loop, or do we have to create a function that automatically searches for such scripts based on some parameters?

@birdsarah
Contributor Author

@muskankhedia have you reviewed "https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py"? Which of the papers in the reading list have you reviewed? What did you learn from them?

Please reformat your question "In ____ the authors do _____. When I tried to do _____, I was stuck by _____. As a result I have the following question ___________."

@srujana121

@birdsarah

In "https://sensor-js.xyz/webs-sixth-sense-ccs18.pdf" the authors find the trackers by clustering the scripts that use sensor information. This is what I have understood; can you tell me if this is what I have to do?

  1. I have to cluster the scripts in the dataset like they did in the paper.
  2. My heuristic would be how close a script is to clusters which are preponderantly trackers.
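One way to read step 2 is as a distance score, sketched below. This assumes the clustering from step 1 has already been done and that each cluster carries a share of known trackers; the feature vectors, the `tracker_share` field, the 0.5 threshold, and the use of Euclidean distance to a centroid are all hypothetical choices, not something taken from the paper.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def tracker_proximity(vector, clusters, min_tracker_share=0.5):
    """Distance from `vector` to the nearest tracker-dominated cluster centroid.

    Returns math.inf when no cluster exceeds `min_tracker_share`.
    """
    distances = [
        math.dist(vector, centroid(cluster["vectors"]))
        for cluster in clusters
        if cluster["tracker_share"] > min_tracker_share
    ]
    return min(distances) if distances else math.inf


# Made-up clusters: feature vectors per script plus an assumed tracker share.
clusters = [
    {"vectors": [[1.0, 0.0], [3.0, 0.0]], "tracker_share": 0.9},
    {"vectors": [[0.0, 10.0], [0.0, 12.0]], "tracker_share": 0.1},
]

near = tracker_proximity([2.0, 0.0], clusters)
far = tracker_proximity([0.0, 11.0], clusters)
```

A smaller score means the script sits closer to a cluster that is mostly trackers. How to choose the features and the cut-off is the open research part.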

@birdsarah
Contributor Author

Hi @srujana121 .... I'm having a little trouble answering this. So I'm going to say up front that there are no right answers here. That's the hard part about data exploration and research. There's no hidden hint in the rest of what I write here about what I think is a "best" direction. The following is just notes and observations, not direction.


What you posted is not what you have to do, but it is an approach. There are multiple ways of approaching the problem of building a heuristic.

You could keep investigating other approaches, and document their differences, strengths, and weaknesses. Or you could pursue this approach.

If you pursue the approach you outlined, I would be surprised if you were able to finish an undertaking like that in a couple of weeks. But that doesn't mean you shouldn't start. Given that it's a big job, think about the interim outputs. Think about documenting your background research, your methodology, and how you will measure success. This preparation and thinking work alone can be a solid contribution. In addition, thinking through questions like how you will measure success will likely help you hone your methodology. If you're moving quickly, then post that preparation document early as a PR, get feedback, and start a conversation about moving your analysis along.

@birdsarah
Contributor Author

The work from @willougr has been posted: https://github.com/mozilla/overscripted/tree/master/analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense - it has a small bug in it so if you're trying to run it yourself you may need to fix up the variable names for the data file path - but other than that it's good. This applies the heuristics used for detecting audio, canvas, font, and webrtc by the Sixth Sense paper to the OverScripted dataset.

@birdsarah
Contributor Author

@Victory17 I missed your message before. Permission is not required to work on issues. Just dive right in.

@14Richa

14Richa commented Mar 19, 2019

Adding for clarity: browser attribute fingerprinting is a kind of browser fingerprinting in which a set of browser-specific attributes is collected and used to uniquely identify a browser. E.g., it could be a hash generated with a known algorithm that concatenates attributes like screen size, resolution, and fonts into a string and hashes that string. This hash will most likely be unique to the browser from which it was generated. Relevant paper.
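A minimal sketch of the concatenate-and-hash idea described above, to show what such a script ends up computing. The attribute names and values are made up, and SHA-256 is only an assumed stand-in for "a known algorithm"; real fingerprinting scripts may use different attributes and hash functions.

```python
import hashlib


def attribute_fingerprint(attributes: dict) -> str:
    """Join attributes in a stable key order and hash the resulting string."""
    joined = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical attribute sets for two browsers that differ only in screen size.
browser_a = {"screen": "1920x1080", "color_depth": 24, "fonts": "Arial,Helvetica"}
browser_b = {"screen": "1366x768", "color_depth": 24, "fonts": "Arial,Helvetica"}

fingerprint_a = attribute_fingerprint(browser_a)
fingerprint_b = attribute_fingerprint(browser_b)
```

The same attributes always produce the same digest, while a change in any single attribute produces a different one. The more attributes a script collects, the more likely the digest is to single out one browser.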

@Tikwiza

Tikwiza commented Mar 27, 2019

I think this is a fantastic topic, as I work in the realm of GDPR in the UK and Europe and privacy laws here in the US. Just brainstorming, but might it be a good idea to look into fingerprinting in browsers or countries where internet privacy is quite strictly regulated? Although the GDPR stays quite clear of some technology, I think this might give us a good way to establish similarities in scripts that pull the necessary data, to track the major changes to fingerprinting scripts, and to look at what is really considered true fingerprinting that identifies an individual. I will continue to look for other angles on creating what is needed for this.

@14Richa

14Richa commented Apr 2, 2019

I added some analysis in this PR. You can use this file as well.

@birdsarah
Contributor Author

@14Richa bit confused by your last comment - is it aimed at me or tikwiza? always good to use an @ someone.

@14Richa

14Richa commented Apr 4, 2019

@birdsarah oops, Apologies. I added it for general discussion use case. File summarizes some threads I started chasing.


6 participants