Browser attribute fingerprinting analysis [WIP] #78
14Richa
commented
Mar 22, 2019
- Jupyter notebook doing the analysis
- Notes files to keep a list of threads and questions to follow
Update from mozilla overscripted
Analysis jupyter notebook and notes [WIP]
Analysis in Pandas is the main file.
Notes as I go
- Excellent introduction and write-ups
- It is not sufficient to run this on just the sample file; please run it on the 10% sample. Now that you have honed your analysis, this should be straightforward: only read in the columns you need and it should go pretty quickly.
- You use `df_plugins['script_url'].value_counts()` to determine "which script is being used the most". Think about what you're actually counting here and how / why it might be biased.
- You end up stumped on the question of why metrika js appears to be only looking at the flash plugin, based on `res_df['symbol'].value_counts()`, but you've restricted your data to only be about plugins and mimeTypes. Is that what you wanted?
- You chose not to use 'Cwm fjordbank glyphs....' as a heuristic for finding instances of fingerprintjs because it is a pangram. What are other uses for pangrams in javascript, and how would they show up in a dataset like this? Can you show me a script that is using 'Cwm fjordbank ....' but is not fingerprintjs or a slight modification of it?
- You say "almost all calls are made around same time and it is querying a bunch of attributes to produce a hash". What is the distribution of timestamps, and how "close" are the timestamps you're referring to relative to the general distribution of timestamps?
- You say "I want to test if I just filter on rare symbols can I catch fingerprint.js calls? Hypothesis is that these rare calls to symbols is only done by fingerprinting scripts. As expected sessionStorage is pretty common followed by ShockWaveLength. The count reduces a lot for FingerPrint, doNotTrack and FuturesplashSuffixes." I don't feel you've made your case well, if at all. A statistical justification would certainly be possible. But more simply than that show me a bar chart (or something) with the average population prevalence compared to the fingerprint prevalence. (Then my follow-up question will be how does that compare to hs-analytics or akam.)
- "Some of these like cloudfront.net are CDNs and can be overlooked." Why can they be overlooked?
- Good work on the metrika detection.
- I don't see why you need both the "domains" and the "base_url". As you work with dask you'll want to keep this processing to a minimum. Pick one; it probably doesn't matter too much which for now.
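The prevalence comparison asked for in the notes above could be sketched roughly as follows. This is only an illustration: it assumes a table with `script_url` and `symbol` columns like the ones discussed in this thread, and the toy rows, the `is_fp` mask and the `symbol_prevalence` helper are invented here, not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the crawl data: one row per JavaScript call (values are invented).
calls = pd.DataFrame({
    "script_url": ["a.js", "a.js", "b.js", "c.js", "fp.js", "fp.js", "fp.js"],
    "symbol": [
        "window.sessionStorage", "window.navigator.userAgent",
        "window.sessionStorage", "window.sessionStorage",
        "window.navigator.plugins", "window.navigator.doNotTrack",
        "window.sessionStorage",
    ],
})


def symbol_prevalence(df):
    """Fraction of distinct scripts in df that touch each symbol at least once."""
    n_scripts = df["script_url"].nunique()
    return df.groupby("symbol")["script_url"].nunique() / n_scripts


# Hypothetical label: scripts already believed to be fingerprinters.
is_fp = calls["script_url"].str.contains("fp", regex=False)

# Side-by-side prevalence: whole population vs. the labelled subset.
comparison = pd.DataFrame({
    "all_scripts": symbol_prevalence(calls),
    "fingerprint_scripts": symbol_prevalence(calls[is_fp]),
}).fillna(0.0)

print(comparison)
# comparison.plot.bar() would draw the bar chart (requires matplotlib).
```

Counting distinct scripts per symbol, rather than rows, keeps a single chatty script from dominating either column.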
Small coding things
- General clean-up, python formatting and coding style:
  - `parse_base_url` needs an extra space before `return`
  - `write_csv` function is unnecessary
  - commas have a space after them
  - `get_end_of_path` function isn't used
- `Rdf = df[IMP_COLUMNS]` don't do this. Only read in what you need to start with: `df = pd.read_parquet(PARQUET_FILE, engine='pyarrow', columns=IMP_COLUMNS)`
- You don't need all the columns in `IMP_COLUMNS`. Now you've got a better feel for the data, only read in what you need each time.
Big picture round-up
- You need to run this on the 10% dataset.
- Think about all the comments in "Notes as I go" and keep pulling at the threads you're developing.
- I'd love it if you actually took a position, declared your heuristic, ran it and had a resulting list of scripts that your heuristic labels as "browser attribute fingerprinting". How would you then go about deciding whether it had done a good job?
- Great work so far.
Hey Sarah,
Thanks!
Done
I am not actually counting which script is being used the most, I am interested to know which script is calling
Yes, so my reasoning is something like this: shortlist the scripts which query information on plugins, find the scripts which make this query a lot, and then see which plugins these scripts query. metrika.js is the top user of
Yes, so what I am trying to do here is to find all instances of
Interesting point, let me think more about how I can see the general distribution of timestamps. Do you know any visualization tools for this? I want something like clustering, but in time-space.
I am a little confused here. My idea was simply that less common symbols would be called by fingerprinting scripts (assuming fingerprinting scripts are very less in number compared to clean scripts). I think plotting a graph of calls to
These are content delivery networks, which host many files. They can't directly be blamed for serving fingerprinting files.
Thanks!
Sure, will do.
(I'm replying one at a time as I'm on my phone)
Your words said "the most," so that's what I read. Definitely focus on being specific.
What I want you to think about is this: you are using number of rows to make inferences. What does that mean in the dataset? What do the rows represent? And what can you infer if that's what you choose to count?
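To make the point about rows concrete, here is a small sketch. It assumes each row is one recorded call by a script on a visited page, with `location` and `script_url` columns as used elsewhere in this thread; the rows themselves are invented.

```python
import pandas as pd

# Invented rows: each row is one recorded call made by a script on a visited page.
calls = pd.DataFrame({
    "location": ["site1.com", "site1.com", "site1.com", "site2.com", "site3.com"],
    "script_url": ["x.js", "x.js", "x.js", "y.js", "y.js"],
    "symbol": ["window.navigator.plugins"] * 5,
})

# Counting rows: how many calls each script made in total.
calls_per_script = calls["script_url"].value_counts()

# Counting distinct pages: on how many locations each script was seen.
sites_per_script = calls.groupby("script_url")["location"].nunique()

print(calls_per_script)
print(sites_per_script)
```

In this toy data `x.js` tops the row count while `y.js` is seen on more pages, so "used the most" depends entirely on which of the two is counted.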
It's not a huge deal for your analysis, but to be clear about the point I'm trying to make: the goal was to find instances of this library. Which is more likely: that a developer has kept the name fingerprint.js, or that they have kept using the methodology of "Cwm fjordbank ...."? In fairness, I never provided evidence that the "Cwm fjordbank...." lookup is superior, but similarly you haven't demonstrated that all instances of scripts named "fingerprint.js" are the correct library. I don't particularly want you to change anything, but to think critically about the choices you are making.
Nothing springs to mind; histogram-type things should be good enough for thinking about distributions.
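A histogram-style look at timing could be sketched as below. This assumes a `time_stamp` column holding the time of each call; the column name, the timestamps and the bin edges are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Invented call log: one script fires in a tight burst, the other is spread out.
calls = pd.DataFrame({
    "script_url": ["fp.js"] * 4 + ["other.js"] * 3,
    "time_stamp": pd.to_datetime([
        "2019-01-01 10:00:00.000", "2019-01-01 10:00:00.010",
        "2019-01-01 10:00:00.020", "2019-01-01 10:00:00.030",
        "2019-01-01 10:00:00.000", "2019-01-01 10:00:02.000",
        "2019-01-01 10:00:05.000",
    ]),
})

# Gap in seconds between consecutive calls made by the same script.
calls = calls.sort_values(["script_url", "time_stamp"])
gaps = calls.groupby("script_url")["time_stamp"].diff().dt.total_seconds().dropna()

# Histogram over all scripts: the general distribution to compare against.
counts, edges = np.histogram(gaps, bins=[0, 0.1, 1, 10])

# Median gap per script, to see how "close" one script's calls are
# relative to that general distribution.
median_gap = gaps.groupby(calls.loc[gaps.index, "script_url"]).median()

print(counts)
print(median_gap)
```

Comparing a script's median gap with the overall histogram is one simple way to say whether "around the same time" is unusual or just typical for the dataset.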
There are quite a few assumptions in the ideas here and you haven't made a case for any of them. Let's unpack them: (1) "less common symbols would be called by fingerprinting scripts" - I don't believe that to be true, but you certainly could present evidence to make that point and that would be interesting to see (2) "assuming fingerprinting scripts are very less in number compared to clean scripts"
You need to make this justification in your writing, not to me. Just as a thought experiment: if you are going to take the position that CDNs cannot be "blamed" for fingerprinting scripts, then all fingerprinters would just move their content to a CDN. What should we do in that case if we want to stop fingerprinting?
I think my comments are written more negatively than I intend, because I don't intend them negatively at all. There's a LOT to dig into here and you're well on your way. In particular, I have deleted "You are missing my point." - that is not helpful language to use on my part and I apologize.
Added a new file --- Analysis in dask. It contains analysis on the 10% dataset using dask. Please ignore Analysis in pandas. That is an old file.
Please remove obsolete files. If doing this with git isn't familiar to you, don't hesitate to ask.
I was wondering if it should be removed totally? Isn't it a good idea to have the analysis in pandas as well, for someone to use in case they have memory/system constraints? Though I agree that the two notebooks will go out of sync very soon and it will be a hassle to keep updating both of them.
I would say no. The |
Notes as I go:
- Don't leave the print out of hundreds or thousands of rows in your notebook, it hinders comprehension. You will definitely look at this content while exploring, but clean up before review.
- `len(df.script_url.unique())` -> `df.script_url.nunique()`
- `df['location_domain'] = df.location.apply(extract_domain)` -> `df['location_domain'] = df.location.apply(extract_domain, meta='O')` ('O' is object, which is all we have available for strings)
- `df[df.symbol.str.contains('navigator.mimeTypes|navigator.plugins')]` nice
- "These days some browsers don't return an array of plugins directly, except the most common plugins such as Shockwave flash, Java, etc." citation please
- "That is all queries to window.navigator.plugins[Shockwave Flash].description resulted in Shockwave Flash 28.0 r0. This is strange." Why is it strange? This data was collected in a crawl. That is, identical machines were set up to crawl the web and their profiles were reset between every visit to a website. "There seems to be a bias in the dataset." Agreed. "Strange but on a brighter side difficult to fingerprint :)" Unfortunately you can't make this inference, because this is not a population sample of the variation of plugins.
- "Memory usage for df_plugins is less. I can take all of this in pd dataframe and use pivots to analyze." Good thinking. Dask does have a pivot option. But converting to pandas when you can definitely makes things nicer.
- I was about to write: "I don't think you needed a pivot table. I think a groupby would have got you there: `df_plugins_pd.groupby(['location', 'script_url', 'symbol']).count()`" but that is wrong. I see what you've done and I see that you were getting the length of unique symbols. Perhaps at some point we can brainstorm how to make this a bit cleaner and more obvious.
- In your analysis 2 you find 0 hs-analytics. Earlier you noted that hs-analytics is a fingerprinting script. What do you think is going on?
- Avoid hardcoding numbers:
  There are 166862 unique script_urls in the dataset. From this we have identified 790 (725+53+12) unique URLs which definitely host fingerprinting scripts and another 888 potential urls worth checking out.
  You could rewrite this as a code cell: `f'There are {len(unique_scripts):,} unique script_urls in the dataset. From this we have identified {sum(n_scripts)} unique URLs which definitely host fingerprinting scripts and another {n_new} potential urls worth checking out.'`
  While it might seem counter to other things I'm arguing for, the oddness of duplicated text is outweighed by the robustness of not transcribing numbers, and the re-usability for running this against a future dataset.
- "So this script always asks for same 10 symbols which we can see below." Only because you've restricted your starting point to df_plugins, which is the subset of scripts that calls plugins. Maybe that's what you're interested in, but this statement is misleading.
- "metrika/watch.s can be used for browser plugin fingerprinting." This is true, but I don't think you've really shown it. There is much more evidence in the dataset for you to make this claim much more convincing. Why not just look at all the symbols metrika is getting?
- "I have found that the above string is a panagram and can be used in other fingprinting scripts" - I clearly still haven't explained this well enough. Let me try again. fingerprintjs2 is not just a browser attribute fingerprinting script. It also does canvas fingerprinting. Its characteristic canvas fingerprinting feature is the call to "Cwm fjordbank....". This enables you to find as many of the fingerprintjs2 / fingerprintjs2-like scripts as possible and then examine them for the browser attribute fingerprinting within them.
- "Therefore I have looked for "fingerprint" in the URL column of the dataset." As I already mentioned, you haven't provided evidence that this is actually capturing the fingerprintjs2 library and not just other scripts with the word fingerprint in the url, which may or may not be what you want.
- The reason I care about this is because I believe you're cutting yourself off from data. `df[df.script_url.str.contains('fingerprint', case=False)].script_url.nunique().compute()` returns 78 scripts, `df[df.argument_0.str.contains('Cwm fjordbank glyphs vext quiz')].script_url.nunique().compute()` returns 505 scripts. The intersection between the lists is 36. If you look at the remaining 42 I can see a number that fall into this category - although I am happy to concede there are many that do look on quick inspection like what we were looking for but were not picked up by "Cwm fjordbank...." That said, of the 505 scripts detected by "Cwm fjordbank" 499 have plugin calls. If you only wanted to take the 499 that have plugin calls as an indicator of browser attribute fingerprinting, that seems fairly reasonable. You'll have missed a few that the "fingerprint" approach got, but you're still up 400 examples with, I believe, fewer false positives.
- What you've got in the current analysis seems to me like an example of how you can have your data tell you what you're expecting it to. You haven't asked questions against your initial belief - e.g. how many "Cwm fjordbank" scripts are reading plugins - which you've already articulated as a hallmark of browser attribute fingerprinting.
- Well done for exploring and then ruling out, in the interests of time, the timestamp work - you were on the right track with your thinking here and I definitely think it could be explored in the future.
- "Is there an automatic way to download the linked javascript file from script_url and parse it to look for keywords like "murmurhash", "hashset", "fingerprint"?" Not without much pain :D. But it is partially possible. That said, future crawls will collect that data at the time of crawl and include it in the dataset.
- I would be interested to see not just the list of symbols, but the value_counts, or perhaps the more comparable normalized value counts, per script. You could plot these to see if they all look similar - a somewhat tricky problem but potentially illuminating.
- Your questions at the end are starting to get into this.
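The comparison of the two heuristics discussed in the notes above ("fingerprint" in the URL versus the "Cwm fjordbank" canvas string) can be sketched with plain sets. This is a pandas illustration on invented rows; the URLs and symbol values are made up, and on the dask dataframe the same filters would need a `.compute()`.

```python
import pandas as pd

# Invented rows; column names follow the ones used in this thread.
calls = pd.DataFrame({
    "script_url": [
        "https://a.example/fingerprint2.js",
        "https://a.example/fingerprint2.js",
        "https://b.example/bundle.js",
        "https://b.example/bundle.js",
        "https://c.example/fingerprint-widget.js",
        "https://d.example/app.js",
    ],
    "symbol": [
        "CanvasRenderingContext2D.fillText", "window.navigator.plugins",
        "CanvasRenderingContext2D.fillText", "window.navigator.plugins",
        "window.document.cookie", "window.navigator.userAgent",
    ],
    "argument_0": [
        "Cwm fjordbank glyphs vext quiz", None,
        "Cwm fjordbank glyphs vext quiz", None,
        None, None,
    ],
})

# Heuristic 1: the word "fingerprint" appears in the script URL.
url_mask = calls.script_url.str.contains("fingerprint", case=False)
by_url = set(calls.loc[url_mask, "script_url"])

# Heuristic 2: the script passes the pangram as an argument.
pangram_mask = calls.argument_0.str.contains(
    "Cwm fjordbank glyphs vext quiz", na=False, regex=False
)
by_pangram = set(calls.loc[pangram_mask, "script_url"])

# Scripts that read navigator.plugins at least once.
plugin_mask = calls.symbol.str.contains("navigator.plugins", regex=False)
reads_plugins = set(calls.loc[plugin_mask, "script_url"])

print("both heuristics:", by_url & by_pangram)
print("URL only:", by_url - by_pangram)
print("pangram only:", by_pangram - by_url)
print("pangram and plugins:", by_pangram & reads_plugins)
```

Listing the "URL only" and "pangram only" sets, rather than just their sizes, makes it possible to inspect by hand which heuristic is producing false positives.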
Overall: A great improvement. Well done. Above are a lot of points. I think the most important of them is not the specifics but the principles in our ongoing back and forth about the validity of "fingerprint" vs "Cwm fjordbank". Moving forward, this could go on forever, but it shouldn't! I would like to think through a concrete set of refinements that will get this to a mergeable analysis contribution. I'm afraid I don't have this for you today as I have a lot of PRs to review, but let's touch base maybe next week. If you haven't heard from me, please ping me back on this PR.
Addressed the above points.
Agreed. Thanks for pointing out the flaw in the reasoning.
Addressed the issue,
Addressed.
Included more analysis around the symbols
Aah, now I get what you meant here. Working on it.
I agree with the point about catching false scripts here, though I feel that is less likely. I should check, though.
I see your point clearly now, thanks for giving examples.
Right, working on this.
Thanks for your review. I have addressed a few points and I am working on the remaining ones (panagram and onwards). Analyses 1, 2 and 3 have been updated. I am working on Analysis 4 and 5. I have updated the PR with the latest changes, feel free to take a look.
Hi @14Richa, are you still planning to submit the remaining requested changes?
Closing this PR due to lack of activity; please feel free to reopen.