Cleaned up database documentation headers

commit a06de2cdffb01a548aebb6b3d0821149e92e6ed8 1 parent b43ec62
@selenamarie selenamarie authored
BIN  docs/admin-socorro.png
43 docs/creatingmatviews.rst
@@ -14,30 +14,30 @@ A materialized view, or "matview" is the results of a query stored as a table in
The rest of this guide assumes that all three conditions above are true. For matviews for which one or more conditions are not true, consult the PostgreSQL DBAs for your matview.
Do I Want a Matview?
-====================
+--------------------
Before proceeding to construct a new matview, test the responsiveness of simply running a query over reports_clean and/or reports_user_info. You may find that the query returns fast enough ( < 100ms ) without its own matview. Remember to test the extreme cases: Firefox release version on Windows, or Fennec aurora version.
Also, matviews are really only effective if they are smaller than 1/4 the size of the base data from which they are constructed. Otherwise, it's generally better to simply look at adding new indexes to the base data. Try populating a couple of days of the matview, ad-hoc, and checking its size (pg_total_relation_size()) compared to the base table from which it's drawn. The new signature summaries were a good example of this; the matviews to meet the spec would have been 1/3 the size of reports_clean, so we added a couple of new indexes to reports_clean instead.
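The size rule of thumb above can be sketched as a small check. This is only an illustration in Python: the function name, the parameter names, and the idea of passing in byte counts obtained from pg_total_relation_size() are assumptions, not part of Socorro.

```python
def matview_is_worthwhile(matview_bytes: int, base_bytes: int,
                          max_ratio: float = 0.25) -> bool:
    """Return True if the matview is smaller than max_ratio of the base data.

    Both sizes are assumed to come from pg_total_relation_size() and to
    cover the same date range (hypothetical helper, not Socorro code).
    """
    if base_bytes <= 0:
        raise ValueError("base_bytes must be positive")
    return matview_bytes / base_bytes < max_ratio


# A matview at 1/3 the size of its base table fails the 1/4 rule,
# which matches the signature summaries case described above.
print(matview_is_worthwhile(1, 3))
```

If the check returns False, the guidance above suggests looking at new indexes on the base table rather than building the matview.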
Components of a Matview
-=======================
+-----------------------
In order to create a new matview, you will create or modify five or six things:
-1. a table to hold the matview data
-2. an update function to insert new matview data once per day
-3. a backfill function to backfill one day of the matview
-4. add a line in the general backfill_matviews function
-5. if the matview is to be backfilled from deployment, a script to do this
-6. a test that the matview is being populated correctly.
+# a table to hold the matview data
+# an update function to insert new matview data once per day
+# a backfill function to backfill one day of the matview
+# add a line in the general backfill_matviews function
+# if the matview is to be backfilled from deployment, a script to do this
+# a test that the matview is being populated correctly.
-Point (6) is not yet addressed by a test framework for Socorro, so we're skipping it currently.
+The final point is not yet addressed by a test framework for Socorro, so we're skipping it currently.
For the rest of this doc, please refer to the template matview code sql/templates/general_matview_template.sql in the Socorro source code.
Creating the Matview Table
-==========================
+--------------------------
The matview table should be the basis for the report or screen you want. It's important that it be able to cope with all of the different filter and grouping criteria which users are allowed to supply. On the other hand, most of the time it's not helpful to try to have one matview support several different reports; the matview gets bloated and slow.
@@ -59,7 +59,7 @@ So, as an example, we're going to create a simple matview for summarizing crashe
report_date
report_count
key product_version, domain, report_date
-
+
We actually use the custom procedure create_table_if_not_exists() to create this. This function handles idempotence, permissions, and secondary indexes for us, like so:
::
@@ -224,24 +224,5 @@ file and look like this:
END LOOP;
END;$f$;
-
-This script would then be checked into the set of upgrade scripts
-for that version of the database.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+This script would then be checked into the set of upgrade scripts for that version of the database.
17 docs/databaseadminfunctions.rst
@@ -14,7 +14,7 @@ All functions below return BOOLEAN, with TRUE meaning completion, and
throw an ERROR if they fail, unless otherwise noted.
MatView Functions
-=================
+-----------------
These functions manage the population of the many Materialized Views
in Socorro. In general, for each matview there are two functions
@@ -175,7 +175,7 @@ Functions marked "last day only" do not accumulate data, but display it only for
day they were run. As such, there is no need to fill them in for each day.
Other Matview Functions
-=======================
+-----------------------
Matview functions which don't fit the parameters above include:
@@ -288,7 +288,7 @@ Called By: other update functions
Schema Management Functions
-===========================
+----------------------------
These functions support partitioning, upgrades, and other management
of tables and views.
@@ -466,7 +466,7 @@ Notes: drop_old_partitions assumes a table_YYYYMMDD naming format.
Other Administrative Functions
-==============================
+------------------------------
add_old_release
---------------
@@ -549,12 +549,3 @@ release_throttle
If throttling back the number of release crashes processed, set here
Notes: add_new_product will return FALSE rather than erroring if the product already exists.
-
-
-
-
-
-
-
-
-
8 docs/databasemiscfunctions.rst
@@ -10,7 +10,7 @@ PostgreSQL database which are useful for application development, but
do not fit in the "Admin" or "Datetime" categories.
Formatting Functions
-====================
+--------------------
build_numeric
-------------
@@ -45,7 +45,7 @@ Takes a numeric build_id and returns the date of the build.
API Functions
-=============
+-------------
These functions support the middleware, making it easier to look up
certain things in the database.
@@ -71,7 +71,7 @@ Takes a product name and a list of version_strings, and returns an array (list)
WHERE product_version_id = ANY ( $list );
Mathematical Functions
-======================
+----------------------
These functions do math operations which we need to do repeatedly, saving some typing.
@@ -91,7 +91,7 @@ Returns the "crashes per hundred ADU", by this formula:
( crashes / throttle ) * 100 / adu
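As a worked illustration of the formula above, here is a minimal Python sketch. The function name is hypothetical, throttle is assumed to be the fraction of crashes that were processed (so 1.0 means no throttling), and returning 0.0 when adu is zero is an assumption made here to avoid a division error, not documented behavior of the database function.

```python
def crashes_per_hundred_adu(crashes: int, adu: int, throttle: float = 1.0) -> float:
    """Compute ( crashes / throttle ) * 100 / adu.

    Hypothetical helper mirroring the documented formula. The zero-ADU
    fallback below is an assumption, not confirmed behavior.
    """
    if throttle <= 0:
        raise ValueError("throttle must be positive")
    if adu == 0:
        return 0.0
    return (crashes / throttle) * 100 / adu


# 50 processed crashes at a 0.5 throttle scale up to 100 estimated crashes;
# against 10,000 ADU that is 1.0 crash per hundred ADU.
print(crashes_per_hundred_adu(50, 10000, throttle=0.5))
```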
Internal Functions
-==================
+------------------
These functions are designed to be called by other functions, so are sparsely documented.
19 docs/databaseschema.rst
@@ -2,13 +2,6 @@
.. _databaseschema-chapter:
-Out-of-Date Data Warning
-========================
-
-While portions of this doc are still relevant and interesting for
-current socorro usage, be aware that it is extremely out of date
-when compared to current schema.
-
Database Schema
===============
@@ -24,10 +17,11 @@ The tables can be divided into three major categories: crash data,
aggregate reporting and process control.
-crash data
-----------
+Core crash data diagram
+=======================
.. image:: core-socorro.png
+ :width: 600px
reports
-------
@@ -126,20 +120,23 @@ Partitioned Child Table
Inherits: extensions
- Materialized View Reporting
- ===========================
+Materialized View Reporting
+===========================
.. image:: matviews-socorro.png
+ :width: 600px
Monitor, Processors and crontabber tables
=========================================
.. image:: helper-socorro.png
+ :width: 600px
Admin tables
============
.. image:: admin-socorro.png
+ :width: 600px
6 docs/databasescripts.rst
@@ -10,7 +10,7 @@ which are used to manage socorro in a staging and development environment, as we
deploy upgrades. These scripts are detailed below.
Upgrade Scripts
-===============
+---------------
These scripts are used on a weekly basis to upgrade the various socorro PostgreSQL database servers.
@@ -73,7 +73,7 @@ be run by the database superuser and won't run otherwise.
MiniDB Scripts
-==============
+--------------
This directory contains scripts for extracting and loading a smaller copy of the socorro PostgreSQL database ... called a "MiniDB" ... from production data. This MiniDB is used for testing and staging.
@@ -145,4 +145,4 @@ Creates a copy of /pgdata/9.0/data for backup so that it can be restored later f
postsql directory
-----------------
-Contains several SQL scripts which create database objects which error out during load due to broken dependencies, particularly views based on matviews. postsql.sh shell script calls these. Intended to be called by loadMiniDBonDev.py.
+Contains several SQL scripts which create database objects which error out during load due to broken dependencies, particularly views based on matviews. postsql.sh shell script calls these. Intended to be called by loadMiniDBonDev.py.
171 docs/databasetabledesc.rst
@@ -10,12 +10,11 @@ This document describes the various tables in PostgreSQL by their purpose and es
Tables which are in the database but not listed below are probably legacy tables which are slated for removal in future Socorro releases. Certainly if the tables are not described, they should not be used for new features or reports.
Raw Data Tables
-===============
+---------------
These tables hold "raw" data as it comes in from external sources. As such, these tables are quite large and contain a lot of garbage and data which needs to be conditionally evaluated. This means that you should avoid using these tables for reports and interfaces unless the data you need isn't available anywhere else -- and even then, you should see about getting the data added to a matview or normalized fact table.
-reports
--------
+*reports*
The primary "raw data" table, reports contains the most used information about crashes, one row per crash report. Primary key is the UUID field.
@@ -29,51 +28,43 @@ The reports table is partitioned by date_processed into weekly partitions, so an
Data in this table comes from the processors.
-extensions
-----------
+*extensions*
Contains information on add-ons installed in the user's application. Currently linked to reports via a synthetic report_id (this will be fixed to be UUID in some future release). Data is partitioned by date_processed into weekly partitions, so include a filter on date_processed in every query hitting this table. Has zero to several rows for each crash. This is used by correlations.
Data in this table comes from the processors.
-plugins_reports
----------------
+*plugins_reports*
Contains information on some, but not all, installed modules implicated in the crash: the "most interesting" modules. Relates to dimension table plugins. Currently linked to reports via a synthetic report_id (this will be fixed to be UUID in some future release). Data is partitioned by date_processed into weekly partitions, so include a filter on date_processed in every query hitting this table. Has zero to several rows for each crash.
Data in this table comes from the processors.
-bugs
-----
+*bugs*
Contains lists of bugs thought to be related to crash reports, for linking to crashes. Populated by a daily cronjob.
-bug_associations
-----------------
+*bug_associations*
Links bugs from the bugs table to crash signatures. Populated by daily cronjob.
-raw_adu
--------
+*raw_adu*
Contains counts of estimated Average Daily Users as calculated by the Metrics department, grouped by product, version, build, os, and UTC date. Populated by a daily cronjob.
-releases_raw
-------------
+*releases_raw*
Contains raw data about Mozilla releases, including product, version, platform and build information. Populated hourly via FTP-scraping.
-reports_duplicates
-------------------
+*reports_duplicates*
Contains UUIDs of groups of crash reports thought to be duplicates according to the current automated duplicate-finding algorithm. Populated by hourly cronjob.
Normalized Fact Tables
-======================
+----------------------
-reports_clean
--------------
+*reports_clean*
Contains cleaned and normalized data from the reports table, including product-version, os, os version, signature, reason, and more. Partitioned by date into weekly partitions, so each query against this table should contain a predicate on date_processed:
@@ -155,8 +146,7 @@ architecture
cores
number of CPU cores on the client, as reported.
-reports_user_info
------------------
+*reports_user_info*
Contains a handful of "optional" information from the reports table which is either security-sensitive or is not included in all reports and is large. This includes the full URL, user email address, comments, and app_notes. As such, access to this table in production may be restricted.
@@ -164,15 +154,14 @@ Partitioned by date into weekly partitions, so each query against this table sho
Updated by update_reports_clean().
-product_adu
-------------
+*product_adu*
The normalized version of raw_adu, contains summarized estimated counts of users for each product-version since Rapid Release began. Populated by daily cronjob.
Updated by update_adu().
Dimensions
-==========
+----------
These tables contain lookup lists and taxonomy for the fact tables in Socorro. Generally they are auto-populated based on encountering new values in the raw data, on an hourly basis. A few tables below are manually populated and change extremely seldom, if at all.
@@ -180,71 +169,60 @@ Dimensions which are lookup lists of short values join to the fact tables by nat
Some dimensions which come from raw crash data have a "first_seen" column which displays when that value was first encountered in a crash and added to the dimension table. Since the first_seen columns were added in September 2011, most of these will have the value '2011-01-01' which is not meaningful. Only dates after 2011-09-15 actually indicate a first appearance.
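The first_seen caveat above can be expressed as a simple filter. This is a sketch in Python under the stated cutoff only; the constant and function names are hypothetical and do not exist in Socorro.

```python
from datetime import date

# Cutoff taken from the text above: only later dates reflect a real
# first appearance (the constant name is hypothetical).
FIRST_SEEN_CUTOFF = date(2011, 9, 15)


def first_seen_is_meaningful(first_seen: date) -> bool:
    """Return True only for dates strictly after the cutoff."""
    return first_seen > FIRST_SEEN_CUTOFF


# The placeholder value '2011-01-01' is rejected; a later date is accepted.
print(first_seen_is_meaningful(date(2011, 1, 1)))
print(first_seen_is_meaningful(date(2012, 3, 4)))
```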
-addresses
----------
+*addresses*
Contains a list of crash location "addresses", extracted hourly from the raw data. Surrogate key: address_id.
Updated by update_reports_clean().
-crash_types
------------
+*crash_types*
Intersects process_types and whether or not a crash is a hang to supply 5 distinct crash types.
Used for the "Crashes By User" screen.
Updated manually.
-domains
--------
+*domains*
List of HTTP domains extracted from raw reports by applying a truncation regex to the crashing URL. These should contain no personal information. Contains a "first seen" column. Surrogate key: domain_id
Updated from update_reports_clean(), with function update_lookup_new_reports().
-flash_versions
---------------
+*flash_versions*
List of Adobe Flash version numbers harvested from crashes. Has a "first_seen" column. Surrogate key: flash_version_id.
Updated from update_reports_clean(), with function update_lookup_new_reports().
-os_names
---------
+*os_names*
Canonical list of OS names used in Socorro. Natural key. Fixed list.
Updated manually.
-os_versions
------------
+*os_versions*
List of versions for each OS based on data harvested from crashes. Contains some garbage versions because we cannot validate. Surrogate key: os_version_id.
Updated from update_reports_clean(), with function update_os_versions_new_reports().
-plugins
--------
+*plugins*
List of "interesting modules" harvested from raw crashes, populated by the processors. Surrogate key: ID. Links to plugins_reports.
-
-process_types
--------------
+*process_types*
Standing list of crashing process types (browser, plugin and hang). Natural key.
Updated manually.
-products
---------
+*products*
List of supported products, along with the first version on rapid release. Natural key: product_name.
Updated manually.
-product_versions
-----------------
+*product_versions*
Contains a list of versions for each product, since the beginning of rapid release (i.e. since Firefox 5.0). Version numbers are available expressed several different ways, and there is a sort column for sorting versions. Also contains build_date/sunset_date visibility information and the featured_version flag. "build_type" means the same thing as "release_channel". Surrogate key: product_version_id.
@@ -271,166 +249,140 @@ beta_number
For "final betas", this number will be 99.
-product_version_builds
-----------------------
+*product_version_builds*
Contains a list of builds for each product-version. Note that platform information is not at all normalized. Natural key: product_version_id, build_id.
Updated from update_os_versions_new_reports().
-product_release_channels
-------------------------
+*product_release_channels*
Contains an intersection of products and release channels, mainly in order to store throttle values. Manually populated. Natural key: product_name, release_channel.
-reasons
--------
+*reasons*
Contains a list of "crash reason" values harvested from raw crashes. Has a "first seen" column. Surrogate key: reason_id.
-release_channels
-----------------
+*release_channels*
Contains a list of available Release Channels. Manually populated. Natural key. See "note on release channel columns" below.
-signatures
-----------
+*signatures*
List of crash signatures harvested from incoming raw data. Populated by hourly cronjob. Has a first_seen column. Surrogate key: signature_id.
-uptime_levels
--------------
+*uptime_levels*
Reference list of uptime "levels" for use in reports, primarily the Signature Summary. Manually populated.
-windows_versions
-----------------
+*windows_versions*
Reference list of Windows major/minor versions with their accompanying common names for reports. Manually populated.
Matviews
-========
+--------
These data summaries are derived data from the fact tables and/or the raw data tables. They are populated by hourly or daily cronjobs, and are frequently regenerated if historical data needs to be corrected. If these matviews contain the data you need, you should use them first because they are smaller and more efficient than the fact tables or the raw tables.
-build_adu
----------
+*build_adu*
Totals ADU per product-version, OS, crash report date, and build date. Used primarily
to feed data to crashes_by_user_build and home_page_build.
-correlations
-------------
+*correlations*
Summarizes crashes by product-version, os, reason and signature. Populated
by daily cron job. Is the root for the other correlations reports. Correlation reports in the database will not be active/populated until 2.5.2 or later.
-correlation_addons
-------------------
+*correlation_addons*
Contains crash-count summaries of addons per correlation. Populated by daily cronjob.
-correlation_cores
------------------
+*correlation_cores*
Contains crash-count summaries of crashes per architecture and number of cores. Populated by daily cronjob.
-correlation_modules
--------------------
+*correlation_modules*
Will contain crash-counts for modules per correlation. Will be populated daily by pull from Hbase.
-crashes_by_user, crashes_by_user_view
--------------------------------------
+*crashes_by_user, crashes_by_user_view*
Totals crashes, adu, and crash/adu ratio for each product-version, crash type and OS for each
crash report date. Used to populate the "Crashes By User" interactive graph.
crashes_by_user_view joins crashes_by_user to its various lookup list tables.
-crashes_by_user_build, crashes_by_user_build_view
--------------------------------------------------
+*crashes_by_user_build, crashes_by_user_build_view*
The same as crashes_by_user, but also summarizes by build_date, allowing you to do a
sum() and see crashes by build date instead of by crash report date.
-daily_hangs and hang_report
----------------------------
+*daily_hangs and hang_report*
daily_hangs contains a correlation of hang crash reports with their related hang pair crashes, plus additional summary data. Duplicates contains an array of UUIDs of possible duplicates.
hang_report is a dynamic view which flattens daily_hangs and its related dimension tables.
-explosiveness
--------------
+*explosiveness*
Matview which contains mathematical calculations of the "most explosive" signatures for
each product-version for the last 10 days. Only contains the last 10 days. Uses
two different calculations, one based on the one-day total, the other based on a
3-day average.
-home_page_graph, home_page_graph_view
--------------------------------------
+*home_page_graph, home_page_graph_view*
Summary of non-browser-hang crashes by report date and product-version, including ADU
and crashes-per-hundred-adu. As the name suggests, used to populate the home page graph.
The _view joins the matview to its various lookup list tables.
-home_page_graph_build, home_page_graph_build_view
--------------------------------------------------
+*home_page_graph_build, home_page_graph_build_view*
Same as home_page_graph, but also includes build_date. Note that since it includes
report_date as well as build_date, you need to do a SUM() of the counts in order to see
data just by build date.
-nightly_builds
---------------
+*nightly_builds*
Contains summaries of crashes-by-age for Nightly and Aurora releases. Will be populated in Socorro 2.5.1.
-product_crash_ratio
--------------------
+*product_crash_ratio*
Dynamic VIEW which shows crashes, ADU, adjusted crashes, and the crash/100ADU ratio, for each product and versions. Recommended for backing graphs and similar.
-product_os_crash_ratio
-----------------------
+*product_os_crash_ratio*
Dynamic VIEW which shows crashes, ADU, adjusted crashes, and the crash/100ADU ratio for each product, OS and version. Recommended for backing graphs and similar.
-product_info
-------------
+*product_info*
Dynamic VIEW which supplies the most essential information about each product version for both old and new products.
-signature_products and signature_products_rollup
-------------------------------------------------
+*signature_products and signature_products_rollup*
Summary of which signatures appear in which product_version_ids, with first appearance dates.
The rollup contains an array-style summary of the signatures with lists of product-versions.
-tcbs
-----
+*tcbs*
Short for "Top Crashes By Signature", tcbs contains counts of crashes per day, signature, product-version, and columns counting each OS.
-tcbs_build
-----------
+*tcbs_build*
Same as TCBS, only with build_date as well. Note that you need to SUM() values, since report_date
is included as well, in order to get values just by build date.
Note On Release Channel Columns
-===============================
-
+-------------------------------
Due to a historical error, the column name for the Release Channel in various tables may be named "release_channel", "build_type", or "build_channel". All three of these column names refer to exactly the same thing. While we regret the confusion, it has not been thought to be worth the refactoring effort to clean it up.
Application Support Tables
-==========================
+--------------------------
+
These tables are used by various parts of the application to do other things than reporting. They are populated/managed by those applications. Most are not accessible to the various reporting users, as they do not contain reportable data.
-data processing control tables
-------------------------------
+*data processing control tables*
These tables contain data which supports data processing by the
processors and cronjobs.
@@ -460,8 +412,7 @@ transform_rules
contains rule data for rewriting crashes by the processors. May be
used in the future for other rule-based rewriting by other components.
-email campaign tables
----------------------
+*email campaign tables*
These tables support the application which emails crash reporters with
follow-ups. As such, access to these tables will be restricted.
@@ -470,8 +421,7 @@ follow-ups. As such, access to these tables will be restricted.
* email_campaigns_contacts
* email_contacts
-processor management tables
----------------------------
+*processor management tables*
These tables are used to coordinate activities of the up-to-120 processors
and the monitor.
@@ -489,21 +439,18 @@ server_status
Contains summary statistics on the various processor servers.
-UI management tables
---------------------
+*UI management tables*
sessions
contains session information for people logged into the administration
interface for Socorro.
-monitoring tables
------------------
+*monitoring tables*
replication_test
Contains a timestamp for ganglia to measure the speed of replication.
-cronjob and database management
--------------------------------
+*cronjob and database management*
These tables support scheduled tasks which are run in Socorro.
12 docs/databasetablesbysource.rst
@@ -10,7 +10,7 @@ Last updated: 2012-10-22
This document breaks down the tables in the Socorro PostgreSQL database by where their data comes from, rather than by what the table contains. This is a prerequisite to populating a brand-new socorro database or creating synthetic testing workloads.
Manually Populated Tables
-=========================
+-------------------------
The following tables have no code to populate them automatically. Initial population and any updating need to be done by hand. Generally there's no UI, either; use queries.
@@ -28,7 +28,7 @@ The following tables have no code to populate them automatically. Initial popul
* windows_versions
Tables Receiving External Data
-==============================
+------------------------------
These tables actually get inserted into by various external utilities. This is most of our "incoming" data.
@@ -51,7 +51,7 @@ reports
Automatically Populated Reference Tables
-========================================
+----------------------------------------
Lookup lists and dimension tables, populated by cron jobs and/or processors based on the above tables. Most are annotated with the job or process which populates them. Where the populating process is marked with an @, that indicates a job which is due to be phased out.
@@ -78,7 +78,7 @@ signatures
cron job, update_reports_clean, based on reports
Matviews
-========
+--------
Reporting tables, designed to be called directly by the mware/UI/reports. Populated by cron job batch. Where populating functions are marked with a @, they are due to be replaced with new jobs.
@@ -104,7 +104,7 @@ tcbs
update_tcbs based on reports
Application Management Tables
-=============================
+------------------------------
These tables are used by various parts of the application to do other things than reporting. They are populated/managed by those applications.
@@ -137,7 +137,7 @@ These tables are used by various parts of the application to do other things tha
* report_partition_info
Deprecated Tables
-=================
+-----------------
These tables are supporting functionality which is scheduled to be removed over the next few versions of Socorro. As such, we are ignoring them.
BIN  docs/helper-socorro.png
4 docs/populatepostgres.rst
@@ -2,8 +2,8 @@
.. _populatepostgres-chapter:
-Populate PostgreSQL
-===================
+Populate PostgreSQL for the first time
+======================================
Socorro supports multiple products, each of which may contain multiple versions.