From e69849591ceaddea7b2bb12215a65353efebcf4a Mon Sep 17 00:00:00 2001 From: Joel Purra Date: Sun, 19 Oct 2014 09:28:43 +0200 Subject: [PATCH] Create a new methodology chapter structure, based on examiner feedback, move details to appendix --- report/report.lyx | 6973 +++++++++++++++++++++++---------------------- 1 file changed, 3544 insertions(+), 3429 deletions(-) diff --git a/report/report.lyx b/report/report.lyx index 130951a..85dd066 100644 --- a/report/report.lyx +++ b/report/report.lyx @@ -655,6 +655,204 @@ Based on a list of domains, the front page of each domain is downloaded These aggregates are then compared between datasets. \end_layout +\begin_layout Section +High level overview +\end_layout + +\begin_layout Standard +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Previous expected results, methodology and software chapters trimmed down + into a more compact format; refer to appendix for software/script details. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section +Domain categories +\end_layout + +\begin_layout Standard +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Use of domain names and suffix lists. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section +Capturing domain blocking +\end_layout + +\begin_layout Standard +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Use of blocking lists. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section +Data collection +\end_layout + +\begin_layout Standard +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Variation of the old retrieving websites and resources chapter, including + parallelization. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section +Data analysis and validation +\end_layout + +\begin_layout Standard +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Variation of the old analyzing resources chapter. 
+\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section +High level summary of datasets +\end_layout + +\begin_layout Standard +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Write high level summary of datasets. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section +Limitations +\end_layout + +\begin_layout Standard +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Write about limitations. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Section +Direction and scope +\end_layout + +\begin_layout Standard +Emphasis for the thesis will be on technical analysis, producing aggregate + numbers regarding domains and external resources. + Social aspects and privacy concerns are considered out of scope. +\end_layout + +\begin_layout Standard +The thesis will primarily be written from a Swedish perspective. + This is in part because .SE has access to the full list of Swedish .se domains, + and part because of their previous work with the +\emph on +.SE Health Status +\emph default + reports. + Focus is to analyze .se domains in the reports, as they have already been + deemed important and results can be incorporated in future reports. + The main non-technical grouping is also based on the same reports; government, + media, financial institutions and other nation-wide publicly relevant organizat +ion groups. +\end_layout + +\begin_layout Standard +One assumption is that all external resources can act as trackers, even + for static (non-script) resources with no capabilities to dynamically survey + the user's browser, collecting data and tracking users across domains using + for example the +\begin_inset Flex Code +status collapsed + +\begin_layout Plain Layout +Referer +\end_layout + +\end_inset + + HTTP header +\begin_inset CommandInset citation +LatexCommand cite +key "Krishnamurthy:2006:CMC:1135777.1135829" + +\end_inset + +. 
+ While there are lists of known trackers, used by browser privacy tools, + they are not 100% effective +\begin_inset CommandInset citation +LatexCommand cite +key "Malandrino:2013:PAI:2517840.2517868,Krishnamurthy:2006:CMC:1135777.1135829" + +\end_inset + +. + The lists will instead optionally be used to emphasize those external resources + as +\emph on +confirmed +\emph default + trackers. + While cookies used for tracking have been a concern for many, they are + not necessary in order to identify most users upon return, even uniquely + on a global level +\begin_inset CommandInset citation +LatexCommand cite +key "Eckersley2009unique" + +\end_inset + +. + Cookies will not be considered to be an indicator of tracking, as it can + be assumed that a combination of other server and client side techniques + can achieve the same goal as a tracking cookie. +\end_layout + \begin_layout Section Document style \end_layout @@ -801,22 +999,16 @@ end{futurework} \end_layout -\begin_layout Section -Data sources -\end_layout - -\begin_layout Subsection -Domains +\begin_layout Chapter +Discussion and conclusions \end_layout \begin_layout Standard -Statistics regarding each list of domains is presented in an appendix. - \begin_inset Note Greyedout status open \begin_layout Plain Layout -Generate domain list tables or tables in this chapter? +Compare with expected results. \end_layout \end_inset @@ -824,26 +1016,21 @@ Generate domain list tables or tables in this chapter? \end_layout -\begin_layout Subsubsection +\begin_layout Standard +\begin_inset Note Greyedout +status open +\begin_layout Plain Layout +Write about the \emph on -.SE Health Status +Follow the Money: Understanding Economics of Online Aggregation and Advertising \emph default - domains + report findings. \end_layout -\begin_layout Standard -When .SE performs their annual -\emph on -.SE Health Status -\emph default - report measurements, they use an in-house curated list of domains of national - interest. 
- These domains are mostly from the .se zone and cover government, county, - municipality, higher education, government-owned corporations, financial - service, internet service provider (ISP), domain registrar, and media domains. - Some domains overlap both within and between categories; domains have been - deduplicated. +\end_inset + + \end_layout \begin_layout Standard @@ -851,7 +1038,11 @@ When .SE performs their annual status open \begin_layout Plain Layout -Write a summary with examples from each dataset. +Write about +\emph on +Third-party Identity Management Usage on the Web +\emph default + report findings. \end_layout \end_inset @@ -859,79 +1050,104 @@ Write a summary with examples from each dataset. \end_layout -\begin_layout Subsubsection -Random .se domains +\begin_layout Section + +\emph on +.SE Health Status +\emph default + comparison +\begin_inset CommandInset label +LatexCommand label +name "sec:.SE-Health-Status-comparison" + +\end_inset + + \end_layout \begin_layout Standard -The thesis was written in collaboration with .SE, which runs the .se TLD, - and the work focusing on the state of Swedish domains. - Early script development was done using a sample of -\begin_inset ERT +\begin_inset Note Greyedout status open \begin_layout Plain Layout - - -\backslash -numprint{10000} +Write about the +\emph on +.SE Health Status +\emph default + report findings. \end_layout \end_inset - random domains, most often tested in groups of -\begin_inset ERT -status open -\begin_layout Plain Layout +\end_layout - -\backslash -numprint{100} +\begin_layout Subsection +Google Analytics \end_layout +\begin_layout Standard +One of the reasons this thesis subject was chosen was the inclusion of a + Google Analytics coverage analysis in previous reports. 
+ The reports show overall Google Analytics usage in the curated dataset + of 44% in 2010, 58% in 2011 and 62% in 2012 +\begin_inset CommandInset citation +LatexCommand cite +key "Lowinder:2010:healthstatus,Lowinder:2011:healthstatus,Lowinder:2012:healthstatus" + \end_inset . - A final sample of -\begin_inset ERT -status open - -\begin_layout Plain Layout + Today, in 2014, usage in the category with the least coverage (financial + services) is 58% while most are above 70% +\begin_inset CommandInset ref +LatexCommand eqref +reference "sub:Top-domains" +\end_inset -\backslash -numprint{100000} -\end_layout +. + The highest coverage category (government owned corporations) is even above + 90%. + Since Google Analytics can now be used from the DoubleClick domain as well, + looking only at the Google Analytics domain makes little sense - instead + it might make more sense to look at the organization Google as a whole. + The coverage jumps quite a bit, with most categories landing above 85% + +\begin_inset CommandInset ref +LatexCommand eqref +reference "sub:Top-organizations" \end_inset - domains was also provided. - The .se TLD is to be considered Sweden-centric. -\end_layout - -\begin_layout Subsubsection -Random .dk domains +. \end_layout \begin_layout Standard -The Danish .dk TLD organization, DK Hostmaster A/S -\begin_inset Foot +\begin_inset ERT status open \begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "https://www.dk-hostmaster.dk/" + + +\backslash +begin{futurework} +\end_layout \end_inset \end_layout -\end_inset +\begin_layout Standard +It is possible to extract the exact coverage for both Google Analytics and + DoubleClick from the current dataset. + Google Analytics already uses a domain of its own, and by writing a custom + question separating DoubleClick's ad and analytics resource URLs, analytics + on that domain can be found as well. 
+\end_layout -, helped out with a sample of +\begin_layout Standard \begin_inset ERT status open @@ -939,107 +1155,128 @@ status open \backslash -numprint{10000} +end{futurework} \end_layout \end_inset - domains, chosen at random from the database of active domains in the zone. - The .dk TLD is to be considered Denmark-centric. + \end_layout -\begin_layout Subsubsection -Random .com, .net domains +\begin_layout Subsection +Reachability \end_layout \begin_layout Standard -The maintainers of the .com, .net and .name TLDs, Verisign, allow downloading - of the complete zone file under an agreement. - The .com zone is the largest one by far, and the .net zone is in the top - 4. -\begin_inset Foot -status open +The random zone domain lists (.se, .dk, .com, .net) have download failures for + 22-28% of all domains when it comes to +\begin_inset Flex Code +status collapsed \begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "http://www.keepalert.com/top-extension-ranking-july-2014-newgtlds" +http:// +\end_layout \end_inset + and +\begin_inset Flex Code +status collapsed +\begin_layout Plain Layout +http://www. \end_layout \end_inset - This allows for a random selection of sites from around the world, even - though usage is not geographically uniform - both in terms of registrations - and actual usage. -\end_layout + variations, where www has fewer failures. +\begin_inset CommandInset ref +LatexCommand eqref +reference "sec:Failed-versus-non-failed" -\begin_layout Subsubsection -Alexa Top -\begin_inset ERT +\end_inset + + The HTTP result for .se is consistent with results from the +\emph on +.SE Health Status +\emph default + reports, according to Patrik Wallström, where they only download www variations. + +\begin_inset Note Greyedout status open \begin_layout Plain Layout - - -\backslash -numprint{1000000} +Ask .SE about hard numbers? 
\end_layout \end_inset - -\begin_inset Foot -status open + Curated +\emph on +.SE Health Status +\emph default + lists have fewer failures for both HTTP, generally below 10% for the +\begin_inset Flex Code +status collapsed \begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "http://alexa.com/" +http://www. +\end_layout \end_inset - -\end_layout + variation - perhaps explained by the thesis software and network setup + +\begin_inset CommandInset ref +LatexCommand eqref +reference "sec:Failed-domains" \end_inset +. + Several prominent media sites with the same parent company respond as expected + when accessed with a normal desktop browser - but not automated requests, + suggesting that they detect and block some types of traffic. +\end_layout +\begin_layout Subsection +HTTPS usage \end_layout \begin_layout Standard -Alexa, owned by Amazon, is a well-known source of top sites -\begin_inset Foot -status open - -\begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "http://www.alexa.com/topsites" +.SE have measured HTTPS coverage among curated health status domains since + at least 2007 +\begin_inset CommandInset citation +LatexCommand cite +key "Lowinder:2008:healthstatus,Lowinder:2009:healthstatus,Lowinder:2010:healthstatus,Lowinder:2011:healthstatus,Lowinder:2012:healthstatus,Lowinder:2013:healthstatus" \end_inset - -\end_layout +. + The reports are a bit unclear about some numbers as measurement methodology + and focus has shifted over the years, but the general results seem to line + up with the results in this thesis +\begin_inset CommandInset ref +LatexCommand eqref +reference "sec:Failed-versus-non-failed" \end_inset - in the world. - It is used in many research papers, and can be seen as the standard dataset. +. \begin_inset Note Greyedout status open \begin_layout Plain Layout -List referenced research papers using it. +Verify results when all datasets have been downloaded. 
\end_layout \end_inset - Their daily 1-month average traffic rank top + +\end_layout + +\begin_layout Standard \begin_inset ERT status open @@ -1047,125 +1284,117 @@ status open \backslash -numprint{1000000} +tsvtable{se.health-status.https.tsv}{.SE Health Status HTTPS coverage 2008-2013}{}{f +ixed, display columns/0/.style={string type, column type=l}, display columns/1/.st +yle={string type, column type=i}, display columns/2/.style={string type, + column type=i}, display columns/3/.style={string type, column type=i}} \end_layout \end_inset - list is freely available for download. -\begin_inset Foot -status open - -\begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "https://alexa.zendesk.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-" -\end_inset +\end_layout +\begin_layout Standard +Measurements changed in 2009, so they might not be fully comparable. + No HTTPS measurements were performed in 2012. + +\begin_inset Note Greyedout +status open +\begin_layout Plain Layout +Ask .SE about exact numbers? \end_layout \end_inset - As Alexa distinguishes between a site and a domain, some domains with several - popular sites are listed more than once. - URL paths have been stripped and domains have been deduplicated before - downloading. + \end_layout -\begin_layout Subsubsection -Reach50 -\begin_inset Foot +\begin_layout Standard +\begin_inset Note Greyedout status open \begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "http://reach50.com/" +Write a comparison with own results per category? +\end_layout \end_inset \end_layout +\begin_layout Standard +24% of HTTPS sites redirect from HTTPS back to HTTP in 2013 - see also +\begin_inset CommandInset ref +LatexCommand vref +reference "sec:HTTP,-HTTPS-and-redirects" + \end_inset +. 
+\end_layout +\begin_layout Section +Cat and Mouse: Content Delivery Tradeoffs in Web Access \end_layout \begin_layout Standard -The top 50 sites in Sweden are presented by Webmie -\begin_inset Foot +\begin_inset Note Greyedout status open \begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "http://webmie.com/" +Is this comparison relevant? +\end_layout \end_inset \end_layout -\end_inset +\begin_layout Standard +\begin_inset Note Greyedout +status open -, who base their list on data from a user panel. - The panelists have installed an extension into their browser, tracking - their browsing habits by automated means. - They also have results grouped by panelists categories: women, men, age - 16-34, 35-54, 55+ but only the unfiltered top list is publicly available. +\begin_layout Plain Layout +Describe the report. \end_layout -\begin_layout Subsubsection -Datasets in use -\begin_inset CommandInset label -LatexCommand label -name "sub:Datasets-in-use" - \end_inset \end_layout \begin_layout Standard -These are the final domain lists in use, including full dataset size -\begin_inset Foot +The report looks at the top +\begin_inset ERT status open \begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "https://www.iis.se/domaner/statistik/tillvaxt/?chart=active" - -\end_inset +\backslash +numprint{100} \end_layout \end_inset - -\begin_inset Foot + English language sites from 12 categories in Alexa's top list, plus another + +\begin_inset ERT status open \begin_layout Plain Layout -\begin_inset CommandInset href -LatexCommand href -target "https://stats.dk-hostmaster.dk/domains/total_domains/" - -\end_inset +\backslash +numprint{100} \end_layout \end_inset - and selection method. -\end_layout - -\begin_layout Standard + sites from a political list. 
+ These sites come from \begin_inset ERT status open @@ -1173,32 +1402,15 @@ status open \backslash -tsvtable{domains.datasets.tsv}{Datasets in use}{}{display columns/0/.style={string - type, column type=l}, display columns/1/.style={string type, column type=l}, - display columns/3/.style={string type, column type=l}} -\end_layout - -\end_inset - - +numprint{1116} \end_layout -\begin_layout Subsubsection -TLD distribution -\begin_inset CommandInset label -LatexCommand label -name "sub:TLD-distribution" - \end_inset - -\end_layout - -\begin_layout Standard -These are the top TLDs in the list of unique domains. -\end_layout - -\begin_layout Standard + domains. + They used a local proxy server to gather URLs, which means that HTTPS requests + were lost. + A total of \begin_inset ERT status open @@ -1206,42 +1418,25 @@ status open \backslash -tsvtable{domains.tlds.tsv}{TLDs in dataset in use}{}{display columns/2/.style={stri -ng type, column type=l},} +numprint{1113} \end_layout \end_inset + pages were downloaded from that set, plus +\begin_inset ERT +status open -\end_layout - -\begin_layout Subsection -External datasets -\end_layout - -\begin_layout Subsubsection -Disconnect's blocking list -\begin_inset CommandInset label -LatexCommand label -name "sub:Disconnect's-blocking-list" - -\end_inset +\begin_layout Plain Layout +\backslash +numprint{457} \end_layout -\begin_layout Standard -One of the most popular privacy tools is Disconnect, which blocks tracking - sites by running as a browser plugin. - Disconnect was started by ex-Google engineers, and still seems to have - close ties to Google as the own domain disconnect.me is listed as a Google - content domain in the blocking list. -\end_layout +\end_inset -\begin_layout Standard -The Disconnect software lets users block/unblock loading resources from - specific third-party domains. 
- A list of + pages from Alexa's top \begin_inset ERT status open @@ -1249,13 +1444,13 @@ status open \backslash -numprint{2149} +numprint{500} \end_layout \end_inset - domains is used as the basis from the blocking. - Each entry belongs to one of + in a secondary list. + The overlap was \begin_inset ERT status open @@ -1263,25 +1458,20 @@ status open \backslash -numprint{980} +numprint{180} \end_layout \end_inset - organizations, which come with a link to their webpage. - There is also a grouping into categories, here shown with some examples. - Worth noting is that the content category is not blocked by default. + pages. \end_layout \begin_layout Standard -\begin_inset ERT +\begin_inset Note Greyedout status open \begin_layout Plain Layout - - -\backslash -begin{futurework} +Fill in numbers. \end_layout \end_inset @@ -1290,18 +1480,16 @@ begin{futurework} \end_layout \begin_layout Standard -There are other open source alternatives to Disconnect's blocking list, - but they use data formats that are not as easy to parse. - The most popular ones also do not contain information about which organization - each blocking rule belongs to. - See -\begin_inset CommandInset ref -LatexCommand vref -reference "sec:Ad-and-privacy-blocking-lists" +\begin_inset Note Greyedout +status open + +\begin_layout Plain Layout +Does a new dataset need to be downloaded? +\end_layout \end_inset -. 
+ \end_layout \begin_layout Standard @@ -1312,7 +1500,12 @@ status open \backslash -end{futurework} +tsvtable{krishnamurthy-2006-cmc-1135777.1135829_ad_coverage.tsv}{Ad coverage}{}{fi +xed, display columns/0/.style={string type, column type=l}, display columns/1/.sty +le={string type, column type=i}, display columns/2/.style={string type, column + type=i}, display columns/3/.style={string type, column type=i}, display + columns/4/.style={string type, column type={|i}}, display columns/5/.style={strin +g type, column type=i}, display columns/6/.style={string type, column type=i}} \end_layout \end_inset @@ -1320,11 +1513,11 @@ end{futurework} \end_layout -\begin_layout Paragraph -Categories +\begin_layout Section +Trackers which deliver content \begin_inset CommandInset label LatexCommand label -name "sub:Disconnect-Categories" +name "sec:Trackers-which-deliver-content" \end_inset @@ -1332,128 +1525,152 @@ name "sub:Disconnect-Categories" \end_layout \begin_layout Standard -Summary of Disconnect's categories. - Most domains and organizations by far are in the advertisement category. - The reason the Disconnect category has so few organizations, is that it - is treated as a special category +In Disconnect's blocking list, there's a category called content \begin_inset CommandInset ref LatexCommand eqref -reference "sub:Disconnect-category" +reference "sub:Disconnect-Content" \end_inset - with only Google, Facebook and Twitter. +. + While all other categories are blocked by default, this one is not as it + represents external resources deemed +\emph on +desirable +\emph default + to Disconnect's users. + So while they are known tracker domains, they are allowed to pass +\begin_inset Quotes eld +\end_inset + +by popular demand. +\begin_inset Quotes erd +\end_inset + + This brings an advantage to companies that can deliver content, as they + can just as well use content usage data as pure web bug/tracker usage data + when analyzing patterns. 
+\end_layout \begin_layout Standard -\begin_inset ERT +Google has several popular embeddable services in the content category, + including Google Maps +\begin_inset Foot status open \begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "https://developers.google.com/maps/documentation/embed/" + +\end_inset -\backslash -tsvtable{disconnect.categories.tsv}{Disconnect's categories (2014-09-08)}{}{displa -y columns/0/.style={string type, column type=l},} \end_layout \end_inset +, Google Translate +\begin_inset Foot +status open -\end_layout - -\begin_layout Paragraph -Domains per organization -\begin_inset CommandInset label -LatexCommand label -name "sub:Disconnect-Domains-per-organization" +\begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "https://translate.google.com/manager/website/" \end_inset \end_layout -\begin_layout Standard -Many organizations have more than one domain. - One organization stands out, with -\begin_inset ERT +\end_inset + + and last but not least YouTube +\begin_inset Foot status open \begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "https://developers.google.com/youtube/player_parameters" + +\end_inset -\backslash -numprint{271} \end_layout \end_inset - domains - Google. - The biggest reason is that they own top level domains such as google.se - and google.ch from over 200 TLDs. - Yahoo comes in second with -\begin_inset ERT +. + Lesser known examples include Recaptcha +\begin_inset Foot status open \begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "https://developers.google.com/recaptcha/" + +\end_inset -\backslash -numprint{71} \end_layout \end_inset - domains, many of which are service-specific subdomains to yahoo.com, such - as finance.yahoo.com and travel.yahoo.com. 
-\end_layout - -\begin_layout Standard -\begin_inset ERT + which is an embeddable service to block/disallow web crawlers/bots access + to web page features. + Those are visible examples, which users interact with. + Google Fonts +\begin_inset Foot status open \begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "https://www.google.com/fonts" + +\end_inset -\backslash -tsvtable{disconnect.domains-per-organization.tsv}{Domains per organization - (2014-09-08)}{}{} \end_layout \end_inset + which serves modern web fonts for easy embedding, is still visible but + not branded. + Google Hosted Libraries +\begin_inset Foot +status open -\end_layout - -\begin_layout Paragraph -Organizations in more than one category -\begin_inset CommandInset label -LatexCommand label -name "sub:Disconnect-Organizations-in-more-than-one-category" +\begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "https://developers.google.com/speed/libraries/" \end_inset \end_layout -\begin_layout Standard -Some organizations are represented in more than one of the five Disconnect - categories. - Yahoo has several ad services, several social services, several content - services and a single analytics service, putting them in four categories. +\end_inset + + which serves popular javascript libraries from Google's extensive CDN network + instead of the local server for site speed/performance gains are not visible + as components, but they cannot be removed without affecting functionality. + Especially the two latter, served from the googleapis.com domain, are prevalent + in several of the datasets - and they are usually loaded on every single + page on a web site, and thus gain full (but passive, as opposed to for + example Google Analytics) insight on users' click paths and web history. 
+\end_layout \begin_layout Standard -\begin_inset ERT +\begin_inset Note Greyedout status open \begin_layout Plain Layout - - -\backslash -tsvtable{disconnect.organizations-in-more-than-one-category.tsv}{Organizations - in more than one category (2014-09-08)}{}{display columns/0/.style={string - type, column type=l},} +Extract a data table with popular content domains. \end_layout \end_inset @@ -1461,8 +1678,12 @@ tsvtable{disconnect.organizations-in-more-than-one-category.tsv}{Organizations \end_layout -\begin_layout Paragraph -Advertising +\begin_layout Section +Other notable results +\end_layout + +\begin_layout Standard +Some results differ from my own assumptions. \end_layout \begin_layout Standard @@ -1470,7 +1691,7 @@ Advertising status open \begin_layout Plain Layout -Add technorati.com, wpp.com? +Write more about notable results. \end_layout \end_inset @@ -1478,61 +1699,61 @@ Add technorati.com, wpp.com? \end_layout -\begin_layout Description -overture.com Yahoo's ad network. -\end_layout - -\begin_layout Description -omniture.com Adobe's ad network. -\end_layout - -\begin_layout Description -amazon-adsystem.com Amazon's ad delivery network. -\end_layout - -\begin_layout Paragraph -Analytics -\end_layout - -\begin_layout Description -alexa.com Amazon's web statistics service, considered an authority in web - measurement. -\end_layout - -\begin_layout Description -comscore.com Analytics service that also publishes statistics. +\begin_layout Section +Automated, scalable data collection and repeatable analysis \end_layout -\begin_layout Description -gaug.es GitHub's analytics service. +\begin_layout Standard +One of the prerequisites for the type of analysis performed in this thesis + was that all collection should be automated, repeatable and able to + handle tens of thousands of domains at a time. + This goal has been achieved, and a specialized framework for analyzing + web pages' HTTP requests has been built. 
+ While most of the code has been tailored to answer questions posed in this + thesis, it is also built to be extendable, both in and between all data + processing steps. + More data can be included, additional datasets can be mixed in, separate + questions can be written to query data from any stage in the data preparation + or analysis. + Tools have been written to easily download and compare separate lists of + domains, and by default data is kept in its original downloaded form so + that historical analysis can be performed. \end_layout -\begin_layout Description -coremetrics.com Part of IBM's enterprise marketing services. -\end_layout +\begin_layout Standard +It might be hard to convince other researchers to use code, as it might + not fulfill all of their wishes at once on top of any +\begin_inset Quotes eld +\end_inset -\begin_layout Description -newrelic.com A suite of systems monitoring and analytics software, up to - and including browsers. -\end_layout +not invented here +\begin_inset Quotes erd +\end_inset -\begin_layout Description -nielsen.com Consumer studies. + mentality. + Fortunately, the code is easy to run, and with proper documentation other + groups should be able to at least test simple theories regarding web sites. + Some of the lists of domains used as input are publicly available, and + thus results can also be shared. + This should encourage other groups, as looking at example data might spark + interest. \end_layout -\begin_layout Description -statcounter.com Web statistics tool. +\begin_layout Section +Contributions to other open source projects \end_layout -\begin_layout Description -webtrends.com Digital marketing analytics and optimization across channels. +\begin_layout Standard +During the development of code for this thesis, other projects have been + utilized. + In good open source manners, those projects should be improved when possible. 
\end_layout -\begin_layout Paragraph -Content +\begin_layout Subsection +The HAR specification \begin_inset CommandInset label LatexCommand label -name "sub:Disconnect-Content" +name "sub:The-HAR-specification" \end_inset @@ -1540,163 +1761,207 @@ name "sub:Disconnect-Content" \end_layout \begin_layout Standard -Sites that deliver content. - There is a wide variety of content, from images and videos to A/B testing, - comment and help desk services. - This category is not blocked by default. +After looking at further processing of the data, some improvements might + be suggested. \end_layout -\begin_layout Description -apis.google.com One of Google's API domains. - -\begin_inset Note Greyedout -status open +\begin_layout Standard +One such suggestion might be to add an absolute/resolved version of +\begin_inset Flex Code +status collapsed \begin_layout Plain Layout -Look at which services are hosted. +response.redirectURL \end_layout \end_inset +, as specification 1.2 seems to be unclear on whether it should be kept + as-is from the HTTP +\begin_inset Flex Code +status collapsed +\begin_layout Plain Layout +Location \end_layout -\begin_layout Description -brightcove.com Video hosting/monetization service. -\end_layout +\end_inset -\begin_layout Description -disqus.com A third party comment service. -\end_layout + header or browser's +\begin_inset Flex Code +status collapsed -\begin_layout Description -flickr.com Flickr is a photo/video hosting site, owned by Yahoo. +\begin_layout Plain Layout +redirectURL \end_layout -\begin_layout Description -googleapis.com One of Google's API domains, hosting third-party files/services - such as Google Fonts and Google Hosted Libraries. -\end_layout +\end_inset -\begin_layout Description -instagram.com Facebook's photo/video sharing site. -\end_layout + values - both of which may be relative. 
+ Subsequent HTTP requests are hard to refer to without relying either + on exact request ordering (the executed redirect always coming exactly + as the next entry) or at least having the URL resolved (preferably by the + browser) before writing it to the HAR data. + Current efforts in +\begin_inset Flex Code +status collapsed -\begin_layout Description -office.com Microsoft's Office suite online. +\begin_layout Plain Layout +netsniff.js \end_layout -\begin_layout Description -optimizely.com An A/B testing service. -\end_layout +\end_inset -\begin_layout Description -truste.com Provides certification and tools for privacy policies in order - to gain users' trust; “enabling businesses to safely collect and use customer - data across web, mobile, cloud and advertising channels.” This includes - ways to selectively opt-out from cookies features; required, functional - or advertising. -\end_layout + +\begin_inset CommandInset ref +LatexCommand eqref +reference "sub:get/netsniff.js" -\begin_layout Description -tumblr.com A popular blogging platform. -\end_layout +\end_inset -\begin_layout Description -uservoice.com A customer support service. + to resolve relative URLs using a separate javascript library have proven + inexact when it comes to matching against the browser's executed URL, differing + for example in whether trailing slashes are kept for domain root requests + or not. + Even better would be a way to refer to the reason for the HTTP + request, be it an HTML tag, a script call or an HTTP redirect - but that + could be highly implementation dependent per browser. \end_layout -\begin_layout Description -vimeo.com A video site. +\begin_layout Subsection +phantomjs \end_layout -\begin_layout Description -www.google.com Google's main domain, which also hosts services such as search. 
- -\begin_inset Note Greyedout -status open +\begin_layout Standard +While +\begin_inset Flex Code +status collapsed \begin_layout Plain Layout -Look at which services are hosted. +netsniff.js \end_layout \end_inset - -\end_layout - -\begin_layout Description -youtube.com One of Google's video sites. -\end_layout - -\begin_layout Paragraph -Disconnect -\begin_inset CommandInset label -LatexCommand label -name "sub:Disconnect-category" + +\begin_inset CommandInset ref +LatexCommand eqref +reference "sub:get/netsniff.js" \end_inset - + from the phantomjs example library has been improved in several ways, patches + have not yet been submitted. + Since it is only an example from their side, a more developed version might + no longer serve the same purpose - educating new users on the possibilities + of phantomjs. + An attempt to break the code down and separate pure bug fixes from other + improvements might help. + The version written for this thesis is released under the same license + as the original, so reuse should not be a problem for those interested. \end_layout -\begin_layout Standard -A special category for non-content resources from Facebook, Google and Twitter. - It seems to initially have been designed to block their respective like/+1/twee -t buttons which seem to belong in the social category. - As the category now contains many other known tracking domains from the - same organizations, unblocking the social buttons also lets many other - types of resources trough. +\begin_layout Subsection +jq \end_layout \begin_layout Standard -It is worth noting that this category includes google-analytics.com plus - Google ad networks such as adwords.google.com, doubleclick.net and admob.com. - It might have been more appropriate to have them in the analytics and advertise -ment categories respectively. 
+Using jq as the main program for data transformation and aggregation has + given me a fair amount of knowledge of real-world usage of the jq domain-specif +ic language (DSL). + Bugs and inconsistencies have been reported, and input regarding for example + code sharing through a package management system and (semantic) versioning + has been given. + Some of the reusable jq code and helper scripts written for the thesis + have been packaged for easy reuse, and more is on the way. \end_layout -\begin_layout Paragraph -Social +\begin_layout Subsection +Disconnect \end_layout \begin_layout Standard -Site with an emphasis on social aspects. - They often have buttons to vote for, recommend or share with others. +Disconnect relies heavily on its blocking list +\begin_inset CommandInset ref +LatexCommand eqref +reference "sub:Disconnect's-blocking-list" + +\end_inset + +, as it is the base for both the service of blocking external resources + and presenting statistics to the user. + While preparing +\begin_inset CommandInset ref +LatexCommand eqref +reference "sub:classification/disconnect/prepare-service-list.sh" + +\end_inset + + and analyzing +\begin_inset CommandInset ref +LatexCommand eqref +reference "sub:classification/disconnect/analysis.sh" + +\end_inset + + the blocking list, a number of errors and inconsistencies were found. + Unfortunately, the maintainers do not seem very active in the project, + and even trivial data encoding errors were still not patched more than a month after + submission. + According to Disconnect's Eason Goodale in an email conversation 2014-08-13, + the team has been concentrating on a second version of Disconnect as well + as other projects. + While patches can be submitted through Disconnect's GitHub project pages, + Goodale's reply seems to indicate they will not be accepted in a timely + fashion and may be irrelevant by the time the next generation is released + to the public.
\end_layout -\begin_layout Description -addthis.com A link sharing service aggregator. +\begin_layout Subsection +Public Suffix \end_layout -\begin_layout Description -digg.com News aggregator. +\begin_layout Standard +A tool that parses the public suffix list from its original format to a + JSON lookup object format has been written. + Using that tool an inconsistency in the data was detected - the TLD .engineering + being included twice instead of .engineer and .engineering separately. + This had already been detected and reported by others, but it can be used + to detect future inconsistencies in an automated manner. \end_layout -\begin_layout Description -linkedin.com Professional social network. +\begin_layout Chapter +Related work \end_layout -\begin_layout Description -reddit.com Social new and link sharing, and discussion. +\begin_layout Section +At .SE \end_layout -\begin_layout Subsubsection -Public suffix list -\begin_inset CommandInset label -LatexCommand label -name "sub:Public-suffix-list" +\begin_layout Standard +Part of .SE's work includes researching internet and technology, with a focus + on Sweden - for example, +\emph on +Swedes and the internet +\emph default + and broadband and cell phone bandwidth speed tests +\begin_inset Note Greyedout +status open -\end_inset +\begin_layout Plain Layout +Insert names of reports. 
+\end_layout +\end_inset + to the public, both in Swedish \begin_inset Foot status open \begin_layout Plain Layout \begin_inset CommandInset href LatexCommand href -target "https://publicsuffix.org/" +target "https://www.iis.se/lar-dig-mer/rapporter/" \end_inset @@ -1705,70 +1970,48 @@ target "https://publicsuffix.org/" \end_inset - + and English \begin_inset Foot status open \begin_layout Plain Layout \begin_inset CommandInset href LatexCommand href -target "https://en.wikipedia.org/wiki/Public_Suffix_List" - -\end_inset - - -\end_layout +target "https://www.iis.se/english/reports/" \end_inset \end_layout -\begin_layout Standard -In the domain name system, it is not always obvious what parts of a domain - name are a public suffix and which are open for registration by Internet - users. - The main example is -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -example.co.uk -\end_layout - \end_inset -, where the public suffix -\begin_inset Flex Code -status collapsed +. + Information and statistics are also published on a separate portal, in + collaboration with other organizations. +\begin_inset Foot +status open \begin_layout Plain Layout -co.uk -\end_layout +\begin_inset CommandInset href +LatexCommand href +target "https://www.iis.se/vad-vi-gor/internetstatistik/" \end_inset - is to different from the TLD -\begin_inset Flex Code -status collapsed -\begin_layout Plain Layout -uk \end_layout \end_inset -. 
- Because HTTP cookies are based on domains names, it is important to browser - vendors to be able to recognize which parts are public suffixes to be able - to protect users against supercookies + .SE's Internet Fund \begin_inset Foot status open \begin_layout Plain Layout \begin_inset CommandInset href LatexCommand href -target "https://en.wikipedia.org/wiki/HTTP_cookie#Supercookie" +target "http://www.internetfonden.se/" \end_inset @@ -1777,85 +2020,60 @@ target "https://en.wikipedia.org/wiki/HTTP_cookie#Supercookie" \end_inset -; cookies which are scoped to a public suffix, and therefore readable across - all web sites under that public suffix. - The same dataset is also useful for grouping domains without improperly - counting -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -example.co.uk -\end_layout + has also funded work on discussing and defining online privacy, aimed at + those working with or developing systems that handle personal data, often + with some kind of internet connection +\begin_inset CommandInset citation +LatexCommand cite +key "Bylund:2013:978-91-87379-12-3:integritet" \end_inset - as a -\emph on -user-owned subdomain -\emph default - of -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -co.uk +. \end_layout -\end_inset +\begin_layout Subsection -, which would then render -\begin_inset Flex Code -status collapsed +\emph on +.SE Health Status +\emph default -\begin_layout Plain Layout -co.uk -\end_layout +\begin_inset CommandInset label +LatexCommand label +name "sub:.SE-Health-Status" \end_inset - as the most popular domain under the -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -uk -\end_layout - -\end_inset - TLD. 
\end_layout \begin_layout Standard -Swedish examples include second level domains -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -pp.se -\end_layout - -\end_inset - - for privately owned domains and -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -tm.se -\end_layout +While .SE themselves have written reports analyzing the technical state of + services connected to .se domains, +\emph on +.SE Health Status +\emph default + +\begin_inset CommandInset citation +LatexCommand cite +key "Lowinder:2008:healthstatus,Lowinder:2009:healthstatus,Lowinder:2010:healthstatus,Lowinder:2011:healthstatus,Lowinder:2012:healthstatus,Lowinder:2013:healthstatus" \end_inset - for trademarks +, the focus has not been on exploring the web services connected to these + domains. + The research is focused on statistics about usage and security in DNS, + IP, web and e-mail; the target audience is IT strategists, executives and + directors. + Data for the reports is analyzed and summarized by Anne-Marie Eklund Löwinder, + a world-renowned DNS and security expert \begin_inset Foot status open \begin_layout Plain Layout \begin_inset CommandInset href LatexCommand href -target "https://www.iis.se/data/barred_domains_list.txt" +target "https://www.iis.se/bloggare/anne-marie/" \end_inset @@ -1864,15 +2082,16 @@ target "https://www.iis.se/data/barred_domains_list.txt" \end_inset -.
- These second level domains were more important before April 2003 +, while the technical aspects and tools are under the supervision of Patrik + Wallström, a well known DNSSEC expert and free and open source software + advocate \begin_inset Foot status open \begin_layout Plain Layout \begin_inset CommandInset href LatexCommand href -target "https://en.wikipedia.org/wiki/.se#Pre_2003_system" +target "https://www.iis.se/bloggare/pawal/" \end_inset @@ -1881,25 +2100,38 @@ target "https://en.wikipedia.org/wiki/.se#Pre_2003_system" \end_inset -, when first level domain registration rules restricted registration to - nation-wide companies, associations and authorities. +. \end_layout \begin_layout Standard -The public suffix list (2014-07-24) contains 6278 rules, against which domains - are checked in one of the classification steps +The thesis subject has been selected to be in line with the .SE reports, + but focusing on web issues; code may be reused and results may be included + in future reports. + The +\emph on +.SE Health Status +\emph default + reports do offer some groundwork in terms of selecting and grouping Swedish + domains, HTTPS usage and Google Analytics coverage +\begin_inset CommandInset citation +LatexCommand cite +key "Lowinder:2010:healthstatus,Lowinder:2011:healthstatus,Lowinder:2012:healthstatus" + +\end_inset + + \begin_inset CommandInset ref -LatexCommand eqref -reference "sub:classification/effective-tld/add.sh" +LatexCommand ref +reference "sec:.SE-Health-Status-comparison" \end_inset . - It becomes the basis for the domain's division into public suffix and primary - domain (first non-public suffix match), and subsequent grouping. 
-\end_layout - -\begin_layout Standard + The report +\emph on + +\emph default +is based on data collected from around \begin_inset ERT status open @@ -1907,51 +2139,49 @@ status open \backslash -begin{futurework} +numprint{900} \end_layout \end_inset - + .se domain names deemed of importance to Swedish society as a whole, + as well as a random selection of 1% of the registered .se domain names. \end_layout -\begin_layout Standard -There is also an algorithm for wildcard rules which can have exceptions; - this thesis has not implemented wildcards and exceptions in the classification - step. - There are 24 TLDs with wildcard public suffixes, and 8 non-TLD wildcards. - Out of these 8 non-TLD wildcards, 1 is *.sch.uk and 7 are Japanese geographic - areas. - The 24 wildcards have 10 exception rules; 7 of them are Japanese cities - grouped by the previously mentioned geographic areas and the remaining - 3 seem to belong to ccTLD owner organizations. +\begin_layout Subsection +.SE Domain Check \end_layout \begin_layout Standard -\begin_inset ERT +In order to facilitate repeatable and improvable analysis, tools will be + developed to perform the collection and aggregation steps automatically. + .SE already has a set of tools that run monthly; integration and interoperabilit +y will smooth the process and ease continuous usage. + There is also a public .SE tool to allow web site owners to test their own + sites, +\emph on +Domain Check +\emph default + +\begin_inset Note Greyedout status open \begin_layout Plain Layout - - -\backslash -end{futurework} +Add link to Domain Check. \end_layout \end_inset - +, which might benefit from some of the code developed within the scope of + this thesis. \end_layout \begin_layout Standard -\begin_inset ERT +\begin_inset Note Greyedout status open \begin_layout Plain Layout - - -\backslash -begin{futurework} +Write about .SE's abuse work?
\end_layout \end_inset @@ -1959,303 +2189,270 @@ begin{futurework} \end_layout -\begin_layout Standard -Apart from ICANN domains, which have been implemented, there are also private - domains considered public suffixes listed as rules. - They are domains which have subdomains controlled by users/customers, for - example joelpurra.github.io which is controlled by me but hosted by the code - hosting service github.com. - Other examples include cloud hosting/CDN services such as cloudfront.net, - amazonaws.com, azurewebsites.net, fastly.net, herokuapp.com, blogs from several - blogspot.TLD domains and dyndns.com's wide choice of dynamic domains. - One example that looks like a technical choice in order to hinder accidental - or malicious setting of cookies is googleapis.com, which is listed despite - being (presumably) completely under Google's control. +\begin_layout Section +Other research \end_layout \begin_layout Standard -\begin_inset ERT +\begin_inset Note Greyedout status open \begin_layout Plain Layout +Write about the results of the research, not that there is research. +\end_layout + +\end_inset -\backslash -end{futurework} \end_layout +\begin_layout Standard +Some research has been done surrounding ad networks, trackers and their + spread on globally popular sites, as well as what kind of private data + users can expect to more or less inadvertently share in the course of normal + internet usage. + Those papers show both some of the problems and solutions in trying to + analyze external resources. 
+ The Association for Computing Machinery (ACM) +\begin_inset Foot +status open + +\begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "http://acm.org/" + \end_inset \end_layout -\begin_layout Section -Retrieving websites and resources -\end_layout +\end_inset -\begin_layout Standard -Web sites based on lists of domains were downloaded using har-heedless, - see -\begin_inset CommandInset ref -LatexCommand vref -reference "sub:har-heedless" + group SIGCOMM +\begin_inset Foot +status open + +\begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "http://sigcomm.org/" \end_inset -. + \end_layout -\begin_layout Subsection -Computer machines -\begin_inset CommandInset label -LatexCommand label -name "sec:Computer-machines" +\end_inset + + has a yearly Internet Measurement Conference (IMC) +\begin_inset Foot +status open + +\begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "http://sigcomm.org/events/imc-conference" \end_inset \end_layout -\begin_layout Standard -Two computers were used to download web pages - one laptop machine and one - server machine. - The server is significantly more powerful than the laptop, and they downloaded - a different number of web pages at a time. -\end_layout +\end_inset -\begin_layout Standard -\begin_inset ERT +, where some papers of interest have been presented. + The Passive and Active Measurements (PAM) Conference +\begin_inset Foot status open \begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "http://pam2014.cs.unm.edu/" + +\end_inset -\backslash -tsvtable{computer-machines.tsv}{Machine specifications}{}{display columns/0/.style -={string type, column type=l}, display columns/1/.style={string type, column - type=l}, display columns/2/.style={string type, column type=l}, } \end_layout \end_inset + might also have interesting papers, as well as for example ACM's archives.
+ As for individuals, one of the most connected researchers in this field + is Balachander Krishnamurthy +\begin_inset Foot +status open -\end_layout +\begin_layout Plain Layout +\begin_inset CommandInset href +LatexCommand href +target "http://www2.research.att.com/~bala/papers/" -\begin_layout Subsection -Network connection -\end_layout +\end_inset -\begin_layout Standard -The laptop machine was connected by ethernet to the .SE office network, which - is shared with employees' computers. - The server machine was connected to server co-location network, which is - shared with other servers. - The .SE network technicians said load was kept very low, and only a few - percent of the dedicated 100 Mbps per location was used. - Both locations are in Stockholm city, and should therefore be well placed - in regard to web sites hosted in Sweden. -\end_layout -\begin_layout Subsection -Software considerations \end_layout -\begin_layout Standard -To expedite an automated and repeatable process, a custom set of scripts - were written as the project har-heedless. - The scripts are written using standard tools, available as open source - and on multiple platforms. -\end_layout +\end_inset -\begin_layout Subsubsection -Dynamic web pages +, who has worked with several groups looking at privacy in both online social + networks (OSNs) and general websites. 
\end_layout \begin_layout Standard -Previous efforts to download and analyze web pages by .SE used a static approach, - analyzing the HTML by means of simple searches for -\begin_inset Flex Code -status collapsed +Media have made reports regarding mass surveillance, especially by the United + States intelligence agency National Security Agency (NSA) +\begin_inset Foot +status open \begin_layout Plain Layout -http:// -\end_layout +\begin_inset CommandInset href +LatexCommand href +target "http://www.nsa.gov/" \end_inset - and -\begin_inset Flex Code -status collapsed -\begin_layout Plain Layout -https:// \end_layout \end_inset - strings in HTML and CSS. - It had proven hard to maintain, and the software project was abandoned - before the thesis was started, but had not yet been replaced. - In order to better handle the dynamic nature of modern web pages, the headless - browser phantomjs -\begin_inset CommandInset ref -LatexCommand eqref -reference "sub:phantomjs" - -\end_inset - - was chosen, as it would also download and execute javascript - a major - component in both user interfaces as well as active trackers and ads. +, but so far few papers seem to have been written. + There are also reports on what data private companies are collecting, in + part by their online efforts, and how they are packaging it for resale. + While media reports are not academic papers, they provide an up to date + source of information needed in explaining parts of the thesis subject. \end_layout -\begin_layout Subsubsection -Cached content +\begin_layout Subsection +Cookie syncing \end_layout \begin_layout Standard -Many of the external resources will be overlapping between web sites and - domains, and downloading them multiple times can be avoided by caching - the file the first time in a run. - Keeping cached content would, depending on per-response cache settings - and timeout, result in a different HTTP request and potentially different - response. 
- A file that has not changed on the server would generate a HTTP response - status of 304 with no data, saving bandwidth and lowering transfer delays, - where a changed file would generate a status 200 response with the latest - version. -\end_layout +A recent large-scale study +\begin_inset CommandInset citation +LatexCommand cite +key "G.-Acar:persistent:2014aa" -\begin_layout Standard -One of the technologies in determining if a locally cached file is the correct/l -atest version includes the HTTP -\begin_inset Flex Code -status collapsed +\end_inset + + included a cookie syncing privacy analysis. + It was shown that unique user identifiers were shared between different + third parties. + IDs can be shared in different ways. + If both third parties exist on the same page, they can be shared through + scripts or by looking for any IDs in the location URL. + They can also be shared by one third-party sending requests to a second + third-party (a fourth-party? +\begin_inset Note Greyedout +status open \begin_layout Plain Layout -Etag +Look up term, might be defined already. \end_layout \end_inset - -\begin_inset Note Greyedout +), either by leaking the location URL as an HTTP referrer or by embedding + it in the request URL. + In crawls of Alexa's top +\begin_inset ERT status open \begin_layout Plain Layout -Insert link to Etag standard. -\end_layout -\end_inset - header, which is a string representation of a URL/file at a certain version. - When content is transferred it may have an -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -Etag +\backslash +numprint{3000} \end_layout \end_inset - attached; if the file is cacheable, the -\begin_inset Flex Code -status collapsed - -\begin_layout Plain Layout -Etag + domains, one third-party script in particular sent requests with synced + IDs to 25 domains; the IDs were eventually shared with 43 domains.
+ They also showed that a user's browsing history reconstruction rate rose + from 1.4% to 11% when backend/server-to-server overlaps were modeled. \end_layout -\end_inset - - is saved. - Subsequent requests for the same, cached URL contain the +\begin_layout Standard +The study used a modified Firefox browser to look at values stored primarily in + cookies. + As all HTTP requests are recorded in this thesis, including HTTP cookie + headers, a limited version of the same study could be performed. + In addition, they looked at in-browser scripting utilizing for example \begin_inset Flex Code status collapsed \begin_layout Plain Layout -Etag +localStorage \end_layout \end_inset - - and the server uses it to determine if a compact 304 response is enough - or a full 200 response is necessary. - It has been found that the +, \begin_inset Flex Code status collapsed \begin_layout Plain Layout -Etag +canvas \end_layout \end_inset - header can be used for cookieless cross-site tracking by using an arbitrarily - chosen per-browser value instead of a file-dependent value -\begin_inset CommandInset citation -LatexCommand cite -key "M.-Ayenson:2011aa" + fingerprinting and ID storage in external plugins like Flash. + While that might be possible, the modifications that would need to be made + to phantomjs are non-trivial, and my current scope does not allow for that. + With their research as a base, cookie respawning and sharing could possibly + be confirmed using this thesis' code as an external tool using a different + browser platform. +\end_layout -\end_inset +\begin_layout Section +HTTP Archive +\begin_inset CommandInset label +LatexCommand label +name "sub:HTTP-Archive" -. - This means that keeping a local file cache might affect how trackers respond; - a file cache has not been implemented in har-heedless, making the browser - amnesiac.
-\end_layout +\end_inset -\begin_layout Subsubsection -Flash files -\end_layout -\begin_layout Standard -Flash is a scriptable proprietary cross-platform vector based web technology - owned by Adobe. - Several kinds of content, including video players, games and ads, use Flash - because it has historically been better suited than javascript for in-browser - moving graphics and video. - Flash usage has not been considered for this thesis as the technology isn - not available on all popular web browsing platforms, notably Apple's iPad, - and is being phased out by HTML 5 features such as -\begin_inset Flex Code -status collapsed +\begin_inset Foot +status open \begin_layout Plain Layout - -\end_layout +\begin_inset CommandInset href +LatexCommand href +target "http://httparchive.org/" \end_inset - and -\begin_inset Flex Code -status collapsed -\begin_layout Plain Layout -