From a69f983e2e743b6c611b692f154d228fc43e5a4c Mon Sep 17 00:00:00 2001
From: Joel Purra
Date: Wed, 18 Jun 2014 18:37:08 +0200
Subject: [PATCH] Wrote about some of the tools used

---
 report/report.lyx | 171 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 151 insertions(+), 20 deletions(-)

diff --git a/report/report.lyx b/report/report.lyx
index cec8f3c..faef68a 100644
--- a/report/report.lyx
+++ b/report/report.lyx
@@ -970,8 +970,10 @@ The thesis will primarily be written from a Swedish perspective.
 \end_layout
 
 \begin_layout Standard
-One assumption is that all external resources can act as trackers, collecting
- data and tracking users across domains using for example the
+One assumption is that all external resources can act as trackers, even
+ static (non-script) resources with no capability to dynamically survey
+ the user's browser, collecting data and tracking users across domains using
+ for example the
 \begin_inset Flex Code
 status collapsed
 
@@ -1003,6 +1005,18 @@ key "Malandrino:2013:PAI:2517840.2517868,Krishnamurthy:2006:CMC:1135777.1135829"
 confirmed
 \emph default
  trackers.
+ While cookies used for tracking have been a widespread concern, they are
+ not necessary in order to identify most users upon return, even uniquely
+ on a global level.
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Eckersley2009unique"
+
+\end_inset
+
+ Cookies will therefore not be considered an indicator of tracking, as it
+ can be assumed that a combination of other server-side and client-side
+ techniques can achieve the same goal as a tracking cookie.
 \end_layout
 
 \begin_layout Chapter
@@ -1010,28 +1024,137 @@ Methodology
 \end_layout
 
 \begin_layout Standard
-Based on a list of domains, external resources are listed by downloading
- of the front page of each domain, and analyzing its HTML content.
- The URLs of external resources will be extracted, and associated with the
- domain they were loaded from.
+Based on a list of domains, the front page of each domain is downloaded
+ and parsed the way a browser would parse it.
+ The URL of each requested resource will be extracted and associated with
+ the domain it was loaded from.
+ This data will then be classified in a number of ways, before being aggregated
+ into statistics describing the entire dataset.
+\end_layout
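+
+\begin_layout Standard
+As a purely illustrative sketch, the data collected for a single domain
+ can be thought of as a record along the following lines, here classified
+ only by whether each requested URL is internal or external to the domain;
+ the field names are hypothetical, and the actual formats are determined
+ by the tools described in the next section.
+\end_layout
+
+\begin_layout LyX-Code
+{
+\end_layout
+
+\begin_layout LyX-Code
+  "domain": "example.se",
+\end_layout
+
+\begin_layout LyX-Code
+  "requests": [
+\end_layout
+
+\begin_layout LyX-Code
+    { "url": "http://example.se/logo.png", "internal": true },
+\end_layout
+
+\begin_layout LyX-Code
+    { "url": "http://www.google-analytics.com/ga.js", "internal": false }
+\end_layout
+
+\begin_layout LyX-Code
+  ]
+\end_layout
+
+\begin_layout LyX-Code
+}
+\end_layout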
+
+\begin_layout Section
+Tools
 \end_layout
 
 \begin_layout Standard
-All external resources get some of the relevant data upon each request,
- even for static resources with no capabilities to dynamically survey the
- user's browser.
- While cookies used for tracking have been a concern for many, they are
- not necessary in order to identify most users upon return, even uniquely
- on a global level.
-\begin_inset CommandInset citation
-LatexCommand cite
-key "Eckersley2009unique"
-
-\end_inset
-
- Cookies will not be considered to be an indicator of tracking, as it can
- be assumed that a combination of other server and client side techniques
- can achieve the same goal as a tracking cookie.
+In order to download and analyze thousands of webpages in an automated fashion,
+ a set of suitable tools was sought.
+ Tools that are released as free and open source software have been preferred,
+ and the tools written for the thesis have also been released as such.
+ Development was performed on Mac OS X, but the tools are expected to run
+ on other Unix-like platforms with relative ease.
+\end_layout
+
+\begin_layout Subsection
+HTTP Archive (HAR) format
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+\begin_inset CommandInset href
+LatexCommand href
+target "http://www.softwareishard.com/blog/har-12-spec/"
+
+\end_inset
+
+
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+In an effort to record and analyze network traffic as seen by individual
+ browsers, the HTTP Archive (HAR) file format was developed.
+ Browsers such as Google Chrome implement it as a complement to the network
+ graph shown in the browser's developer tools, from which a HAR file can
+ be exported.
+ While designed to analyze, among other things, web performance, it also
+ contains data suitable for this thesis: requested URLs and HTTP request
+ and response headers such as referrer and content type.
+ HAR files are based upon the JSON standard
+\begin_inset Note Greyedout
+status open
+
+\begin_layout Plain Layout
+Insert a link to the JSON standard.
+\end_layout
+
+\end_inset
+
+, a data format compatible with JavaScript object notation, commonly used
+ to exchange dynamic data between client-side scripts in browsers and web
+ servers.
+ The most recent specification at the time of writing was HAR 1.2.
+\end_layout
+
+\begin_layout Subsection
+PhantomJS
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+\begin_inset CommandInset href
+LatexCommand href
+target "http://phantomjs.org/"
+
+\end_inset
+
+
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+Webpages are normally accessed by users in a graphical browser; the browser
+ downloads and displays images, executes scripts and plays videos.
+ A browser is user-friendly but not optimal for batch usage, due to the
+ overhead of constantly drawing results on screen and the lack of automation
+ without external tools such as Selenium WebDriver
+\begin_inset Note Greyedout
+status open
+
+\begin_layout Plain Layout
+Insert link to Selenium Webdriver.
+\end_layout
+
+\end_inset
+
+.
+ A good alternative is PhantomJS, which is built as a command line tool
+ without any graphical user interface.
+ Internally it acts like a browser, including rendering the webpage to an
+ image buffer that is never displayed on screen, and it is controllable
+ through scripts.
+ One such script, included in the default installation, generates HAR files
+ from webpage visits.
+\end_layout
+
+\begin_layout Subsection
+jq
+\begin_inset Note Greyedout
+status open
+
+\begin_layout Plain Layout
+Insert link to jq.
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+While there are many command line tools to transform data in, for example,
+ plain text, CSV and XML files, tools to work with JSON files are not as
+ prevalent.
+ One such tool gaining momentum is jq, which provides a domain-specific
+ language (DSL) suitable for extracting or transforming data.
+ The DSL is based around a set of filters, similar to pipes in the Unix
+ world, each transforming its input and passing the result on to the next
+ stage.
+ Since jq treats data as a stream, it also performs well with large datasets.
 \end_layout
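+
+\begin_layout Standard
+As a purely illustrative example, a heavily trimmed HAR entry of the kind
+ produced by PhantomJS could look as follows; the values shown are hypothetical
+ and most mandatory HAR fields have been omitted for brevity.
+ A jq filter such as
+\begin_inset Flex Code
+status collapsed
+
+\begin_layout Plain Layout
+.log.entries[].request.url
+\end_layout
+
+\end_inset
+
+ would then extract the URL of each requested resource from such a file.
+\end_layout
+
+\begin_layout LyX-Code
+{
+\end_layout
+
+\begin_layout LyX-Code
+  "log": {
+\end_layout
+
+\begin_layout LyX-Code
+    "version": "1.2",
+\end_layout
+
+\begin_layout LyX-Code
+    "entries": [ {
+\end_layout
+
+\begin_layout LyX-Code
+      "request": {
+\end_layout
+
+\begin_layout LyX-Code
+        "url": "http://www.google-analytics.com/ga.js",
+\end_layout
+
+\begin_layout LyX-Code
+        "headers": [ { "name": "Referer", "value": "http://example.se/" } ]
+\end_layout
+
+\begin_layout LyX-Code
+      },
+\end_layout
+
+\begin_layout LyX-Code
+      "response": { "content": { "mimeType": "text/javascript" } }
+\end_layout
+
+\begin_layout LyX-Code
+    } ]
+\end_layout
+
+\begin_layout LyX-Code
+  }
+\end_layout
+
+\begin_layout LyX-Code
+}
+\end_layout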
 
 \begin_layout Section
@@ -1056,8 +1179,8 @@ In order to facilitate repeatable and improvable analysis, tools will be
 developed to perform the collection and aggregation steps automatically.
 .SE already has a set of tools that run monthly; integration and interoperabilit
y will smooth the process and continuous usage.
- There is also a publich .SE tool to allow web site owners to test their
- own sites,
+ There is also a public .SE tool to allow web site owners to test their own
+ sites,
 \emph on
 Domain Check
 \emph default
@@ -1273,6 +1396,14 @@ Can collected data served by different services differ depending on which
 tool is used to fetch the data?
 \end_layout
 
+\begin_layout Itemize
+Do content and external resources vary between requests? Are they time-dependent,
+ or regenerated for each request? One example would be frequently updated
+ news sites or blogs, where new content is added and old content replaced.
+ Another would be ads, which might be loaded from different sources on
+ each request.
+\end_layout
+
 \begin_layout Itemize
 Many of the external resources will be overlapping, and downloading them
 multiple times can be avoided by caching the file the first time in a run.