
Datasets

okay edited this page May 1, 2016 · 1 revision

In snorkel, data is divided into datasets, and each dataset is made up of samples. Every field in a sample has one of three types: integer, string, or set (of strings). Snorkel uses these types to decide how each field is treated in the UI and which fields populate each view control.

In general, string fields are meant for GROUP BY queries, integer fields are used for aggregations, and set fields are used for filtering.

NOTE: every sample must have a time field holding seconds since the epoch (or an equivalent field that can be used as a timestamp).
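The rules above can be sketched as a small pre-flight check. This is only an illustration: `validateSample` is a hypothetical helper, not part of snorkel, and it assumes the timestamp lives under `integer.time`, as it does in the example schemas on this page.

```javascript
// Hypothetical helper (not a snorkel API): checks a sample against the
// three field types and the required time field.
function validateSample(sample) {
  const errors = [];
  const integers = sample.integer || {};
  const strings = sample.string || {};
  const sets = sample.set || {};

  // Integer fields must hold whole numbers.
  for (const [name, value] of Object.entries(integers)) {
    if (!Number.isInteger(value)) errors.push(`integer.${name} is not an integer`);
  }
  // String fields must hold strings.
  for (const [name, value] of Object.entries(strings)) {
    if (typeof value !== "string") errors.push(`string.${name} is not a string`);
  }
  // Set fields must hold arrays of strings.
  for (const [name, value] of Object.entries(sets)) {
    const ok = Array.isArray(value) && value.every((v) => typeof v === "string");
    if (!ok) errors.push(`set.${name} is not an array of strings`);
  }
  // Assumption: the timestamp is stored as integer.time.
  if (!("time" in integers)) errors.push("integer.time (seconds since epoch) is missing");

  return errors;
}

const sample = {
  integer: { dom_load: 300, time: Math.floor(Date.now() / 1000) }, // seconds, not milliseconds
  string: { page: "/home" },
  set: { perf_experiments: ["socket_delivery"] },
};

console.log(validateSample(sample));
```

Note that `Date.now()` returns milliseconds, so it is divided by 1000 and floored to get seconds since the epoch.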

Example Schemas

Page Load Times

{
   integer: { 
     dom_load: 300,
     dns_lookup: 20,
     dom_complete: 900,
     resources_loaded: 30,
     time: <TIMESTAMP> // put your timestamp of seconds since epoch here
   },
   string: {
     page: "/home",
     user_id: "12912",
     network: "DSL",
     country: "USA",
     browser_family: "firefox",
     browser_major: "23",
     os_family: "Windows"
   },
   set: {
     perf_experiments: [ "socket_delivery", "XHR_chunks", "pipelined_delivery" ]
   }
}

Machine Monitoring

Let's say you are monitoring the performance or load of your machines. An example schema might look like:

{
   integer: {
     free_ram: 288888,
     load_avg: 10, // out of 100
     time: <TIMESTAMP>,
     requests_per_second: 50,
     avg_request_delay: 100 // 100ms delay
   },
   string: {
     cluster: "data-center-03",
     region: "NW",
     machine_id: "dc03-027"
   },
   set: {
     services: ["nagios", "cacti", "ganglion"]
   }
}

This schema would let you do GROUP BY on cluster, region or machine_id, as well as calculate the AVG, SUM and COUNT of the various integer fields.
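To make the roles of the three field types concrete, here is a minimal in-memory sketch of a GROUP BY on a string field, with COUNT, SUM and AVG over an integer field and an optional filter on a set field. It is not how snorkel executes queries; `aggregate` and the sample values (including the timestamps and the second and third machines) are made up for illustration.

```javascript
// Hypothetical illustration (not snorkel's query engine): group samples by a
// string field and compute COUNT, SUM and AVG of an integer field.
function aggregate(samples, groupField, valueField) {
  const groups = new Map();
  for (const s of samples) {
    const key = s.string[groupField];
    const g = groups.get(key) || { count: 0, sum: 0 };
    g.count += 1;
    g.sum += s.integer[valueField];
    groups.set(key, g);
  }
  const result = {};
  for (const [key, g] of groups) {
    result[key] = { count: g.count, sum: g.sum, avg: g.sum / g.count };
  }
  return result;
}

// Illustrative samples following the machine monitoring schema.
const samples = [
  { integer: { load_avg: 10, time: 1462060800 },
    string: { region: "NW", machine_id: "dc03-027" },
    set: { services: ["nagios", "cacti"] } },
  { integer: { load_avg: 30, time: 1462060860 },
    string: { region: "NW", machine_id: "dc03-028" },
    set: { services: ["cacti"] } },
  { integer: { load_avg: 50, time: 1462060920 },
    string: { region: "SE", machine_id: "dc07-001" },
    set: { services: ["nagios"] } },
];

// GROUP BY region, aggregating load_avg.
const byRegion = aggregate(samples, "region", "load_avg");

// Filter on a set field first, then group the remaining samples.
const nagiosOnly = samples.filter((s) => s.set.services.includes("nagios"));
const nagiosByRegion = aggregate(nagiosOnly, "region", "load_avg");

console.log(byRegion, nagiosByRegion);
```

With these samples, the NW group contains two samples with a SUM of 40 and an AVG of 20. After filtering to samples whose services set contains "nagios", NW drops to a single sample.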
