labels: Databases Blog created: 2016-02-07T19:01 modified: 2016-02-28T18:04 place: New York, USA comments: true
SQL (Structured Query Language) is a declarative programming language.
/ˈɛs kjuː ˈɛl/ (an SQL query).
Database I use in examples
psql template1 template1=# CREATE USER test WITH PASSWORD 'test'; template1=# CREATE DATABASE test; template1=# GRANT ALL PRIVILEGES ON DATABASE test TO test; template1=# \q psql -d test
Storing images in the database has some advantages:
- The image data is stored in the database. There is no extra step to load it. There is no risk that the file's pathname is incorrect
- Deleting a row deletes the image automatically
- Changes to an image are not visible to other clients until you commit the change
- Rolling back a transaction restores a previous state of the image
- Updating a row creates a lock, so no other client can update the same image concurrently
- Database backups include all the images
- SQL privileges control access to the image as well as the row
- Database network usage
- Backups size
CREATE TABLE DaysOfWeek ( name CHAR(20) PRIMARY KEY, CHECK (name IN ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')) ); INSERT INTO DaysOfWeek(name) VALUES ('Sunday'); INSERT INTO DaysOfWeek(name) VALUES ('Another');
INSERT 0 1 ERROR: new row for relation "daysofweek" violates check constraint "daysofweek_name_check" DETAIL: Failing row contains (Another ).
Use it only if You are 100% sure that the list of allowed options will be constant, otherwise - use
CREATE TABLE GenderChoices ( gender VARCHAR(20) PRIMARY KEY ); INSERT INTO GenderChoices(gender) VALUES ('male'); INSERT INTO GenderChoices(gender) VALUES ('female'); CREATE TABLE users ( name VARCHAR(100) PRIMARY KEY, gender VARCHAR(20) REFERENCES GenderChoices ON UPDATE CASCADE ); INSERT INTO users(name, gender) VALUES ('user1', 'male'); INSERT INTO users(name, gender) VALUES ('user2', 'other');
INSERT 0 1 ERROR: insert or update on table "users" violates foreign key constraint "users_gender_fkey" DETAIL: Key (gender)=(other) is not present in table "genderchoices".
There are other ways to solve the problem:
ENUM field and restriction on the client side.
Be careful when you alter existing table with a great number of rows, it locks the database.
Use it carefully, may cause performance problems on large data sets.
Mistakes in defining indexes:
- Defining no indexes or not enough indexes
- Defining too many indexes or indexes that don't help
- Running queries that no index can help
MENTOR - technique to maintain indexes:
- Measure (analyze logs, search for slowest/most time consuming/most popular queries)
- Explain (analyze query execution plan using
!!! note "A fat index may be fast"
If your query references only the columns included in the index data structure, the database generates your query results by reading only the index.
!!! alert "Create index concurrently"
Creating new indexes blocks the table. Use [CONCURRENTLY](http://www.postgresql.org/docs/9.5/static/sql-createindex.html) to run it in background.
FLOAT VS NUMERIC (DECIMAL)
The advantage of
DECIMAL) are that they store rational numbers without rounding, as the
FLOAT data types do.
CREATE TABLE numbers ( number_id SERIAL PRIMARY KEY, value_float FLOAT, value_decimal NUMERIC(10,2) ); INSERT INTO numbers (value_float, value_decimal) VALUES (1./3, 1./3); INSERT INTO numbers (value_float, value_decimal) VALUES (1./3, 1./3); INSERT INTO numbers (value_float, value_decimal) VALUES (1./3, 1./3); SELECT SUM(value_float) * 1000 as sum_float, SUM(value_decimal) * 1000 as sum_decimal from numbers;
sum_float | sum_decimal -----------+------------- 1000 | 990.00
One advantage of IEEE-754 is that by using the exponent, it can represent fractional values that are both very small and very large.
DECIMAL is appropriate type for an amount of money because it can handle money values just as easy as
FLOAT and more accurately.
FOREIGN KEY if a "key" to consistency.
How to declare:
CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER REFERENCES author ) /* or */ CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER, FOREIGN KEY (author_id) REFERENCES author ) /* or */ CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER ) ALTER TABLE artwork ADD CONSTRAINT artwork_author_id_fk FOREIGN KEY (author_id) REFERENCES author (author_id);
Use a specialized search engine, for instance: Elasticsearch.
Allows to update or delete the parent row and lets the database takes care of any child rows that reference it.
CREATE TABLE artwork ( artwork_id SERIAL PRIMARY KEY, author_id INTEGER, FOREIGN KEY (author_id) REFERENCES author ON UPDATE CASCADE ON DELETE SET DEFAULT )
CSV/JSON/etc. or FOREIGN KEY?
Storing lists in a text field is a bad practice:
- We are limited in number of items can be placed into text field (mostly because of performance degradation)
- Foreign Key helps us to keep data consistency on database layer
- Additional validation, encoding/decoding logic may be required on the client side
- Unable to use joins and
INstatement (must use
- No chance to use the field as part of compound index
- We can't use count (and other aggregation functions) to get number of value usages
- Updates are much easier if you use Foreign Key
Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems.
If you are concerned about performance:
- Joins must work fast enough if number of possible values is small (hundreds and even thousands)
- Use caching to speed up your app
- Remember about YAGNI, caching or denormalization may not worth time spent on their implementation (do it only if you have evidence that it will improve performance drastically)
Example of Foreign Key constraint usage:
CREATE TABLE anime ( anime_key CHAR(255) PRIMARY KEY, title CHAR(255) NOT NULL ); CREATE TABLE genre ( genre_key CHAR(255) PRIMARY KEY, title CHAR(255) NOT NULL ); CREATE TABLE anime_genre ( anime_key CHAR(255) REFERENCES anime, genre_key CHAR(255) REFERENCES genre, PRIMARY KEY (anime_key, genre_key) ); INSERT INTO anime (anime_key, title) VALUES ('kill-la-kill', 'Kill La Kill'); INSERT INTO genre (genre_key, title) VALUES ('action', 'Action'); INSERT INTO genre (genre_key, title) VALUES ('comedy', 'Comedy'); INSERT INTO anime_genre (anime_key, genre_key) VALUES ('kill-la-kill', 'action'); INSERT INTO anime_genre (anime_key, genre_key) VALUES ('kill-la-kill', 'comedy'); SELECT genre.title FROM anime_genre LEFT JOIN genre USING (genre_key) WHERE anime_genre.anime_key = 'kill-la-kill';
anime_genre is an intersection table.
Invalid data example:
INSERT INTO anime_genre (anime_key, genre_key) VALUES ('unknown-anime', 'comedy'); ERROR: insert or update on table "anime_genre" violates foreign key constraint "anime_genre_anime_key_fkey" DETAIL: Key (anime_key)=(unknown-anime) is not present in table "anime".
GROUP BY and aggregate
SELECT anime.title, COUNT(anime_genre.*) as genre_count FROM anime LEFT JOIN anime_genre USING (anime_key) GROUP BY anime.anime_key;
anime | genre_count Kill La Kill | 2
NULL and the third state
Using null is not the antipattern; the antipattern is using null like an ordinary value or using an ordinary value like null.
SQL Antipatterns: Avoiding the Pitfalls of Database Programming by Bill Karwin
Sometimes I hear that the third state is bad, only True and False must be allowed. I don't agree with it, an unknown is a natural third state. Sometimes we not sure about something, and there is no right answer except "I don't know".
NULL in SQL is the best implementation of unknown I ever had seen:
DO language plpgsql $$ BEGIN RAISE NOTICE 'NULL OR TRUE = %', NULL OR TRUE; RAISE NOTICE 'NULL AND TRUE = %', NULL AND TRUE; RAISE NOTICE 'NULL OR NULL = %', NULL OR NULL; RAISE NOTICE 'NULL AND NULL = %', NULL AND NULL; RAISE NOTICE 'NULL + 1 = %', NULL + 1; RAISE NOTICE 'NULL || ''abc'' = %', NULL || 'abc'; RAISE NOTICE 'NOT NULL = %', NOT NULL; RAISE NOTICE 'NULL == NULL = %', NULL = NULL; RAISE NOTICE 'NULL != NULL = %', NULL != NULL; END $$;
NOTICE: NULL OR TRUE = t NOTICE: NULL AND TRUE = <NULL> NOTICE: NULL OR NULL = <NULL> NOTICE: NULL AND NULL = <NULL> NOTICE: NULL + 1 = <NULL> NOTICE: NULL || 'abc' = <NULL> NOTICE: NOT NULL = <NULL> NOTICE: NULL == NULL = <NULL> NOTICE: NULL != NULL = <NULL>
NONE plays important role:
- Shows that there is no value was assigned (not available yet)
- No reference exists
- An outer join uses
NULLvalues as placeholders for the columns of an unmatched table in an outer join
SELECT title FROM artwork WHERE autor_id IS NULL; SELECT title FROM artwork WHERE autor_id IS NOT NULL;
IS DISTINCT FROM:
SELECT title FROM artwork WHERE author_id IS DISTINCT FROM 123; /* is equal to */ SELECT title FROM artwork WHERE author_id IS NULL OR author_id != 123;
The function returns its first not-null argument.
ORDER BY RANDOM()
SELECT artwork_id FROM artwork ORDER BY RANDOM() LIMIT 1;
It is expensive and unreliable.
SELECT artwork_id FROM artwork LIMIT 1 OFFSET FLOOR(RANDOM() * (SELECT COUNT(*) FROM artwork));
PRIMARY KEY is a constraint.
PRIMARY KEY needs to:
- Prevent a table from containing duplicate rows (as it creates
- Support foreign key reference
How to declare:
CREATE TABLE artwork ( artwork_id INTEGER PRIMARY KEY ); /* or */ CREATE TABLE artwork ( artwork_id INTEGER, PRIMARY KEY (artwork_id) ); /* or */ CREATE TABLE artwork ( artwork_id INTEGER ); ALTER TABLE artwork ADD PRIMARY KEY (artwork_id);
Result is the same:
test=# \d+ artwork Table "public.artwork" Column | Type | Modifiers | Storage | Stats target | Description ------------+---------+-----------+---------+--------------+------------- artwork_id | integer | not null | plain | | Indexes: "artwork_pkey" PRIMARY KEY, btree (artwork_id)
PRIMARY KEY name
A good practice is to use next pattern:
Fields references the table must have the same name, it allows to use
JOIN USING syntax:
SELECT artwork.title, author.name FROM artwork JOIN author USING (author_id); /* instead */ SELECT artwork.title, author.name FROM artwork JOIN author ON author.author_id = artwork.author_id;
artwork_id is more clear.
PRIMARY KEY value
There are primary key value types:
- autoincrement field
- natural id
- guid (Globally Unique IDentifier)
- compound (aka composite) primary key
Autoincrement and guid fields are known as pseudo keys or surrogate keys.
Most databases provide a mechanism to generate unique integer values serially, outside the scope of transaction isolation.
Autoincrement field example (PG):
CREATE TABLE items ( item_id SERIAL PRIMARY KEY, title VARCHAR(255) ); INSERT INTO items (title) VALUES ('item1'); INSERT INTO items (title) VALUES ('item2'); SELECT * FROM items;
item_id | title ---------+------- 1 | item1 2 | item2
Natural id, if it is short and unique, usually the best solution for a primary key. For example, phone numbers, usernames, etc.
GUID (uuid4 usually) is solution for distributed (horizontally) databases.
A compound key is a good solution for intersection table (there are a lot of cases where it may be used).
Trees (hierarchical data)
The most obvious solution is to use a parent_key field (adjacency list). The solution has its benefits:
- easy to modify (copy/move tree leaves)
- leaf remove is complex (needs to find and remove all leaves in the subtree,
ON DELETE CASCADEsolves the problem)
If You need to know how many children has the leave, how many descendants has the tree, etc., in most cases better just to use counters instead of query all the tree each time.
But the simple parent_key solution fails if you need to select all descendants (for instance: get full replies tree).
Exception is two level tree:
SELECT FROM nodes as n1 LEFT OUTER JOIN nodes as n2 ON n2.parent_id = n1.node_id;
There are alternative solutions:
PostgreSQL>=8.4supports recursive queries
- Path enumeration technique (drawbacks: no referral integrity)
- Nested sets (drawbacks: hard to insert/delete, complex, no referral integrity)
- Closure table (benefits: allow node to belong to multiple trees, drawback: requires additional table)
Query results can be combined with
UNION only if their columns are the same in number and type (provide
NULL placeholders for columns that are unique to each table).
Wildcard in SELECT
* symbol means every column (the list of columns is implicit).
There are few thoughts why better to specify columns explicitly:
- when we ask for only fields used in the index: request must be executed faster (no need to query the main table)
- blob and text fields stored separately from the main table (it stores only references to them), if don't ask for these fields, we decrease response time and network usage
Antipattern is a technique that is intended to solve a problem but that often leads to other problems.
Stands for Entity-Attribute-Value. Aka open schema, schemaless or name-value pairs.
Aka a join table, many-to-many table or mapping table.
Intersection table has foreign keys referencing two tables (implements many-to-many relationship).
Objectives of normalization:
- To represent facts about the real world in a way that we can understand
- To reduce storing facts redundantly and to prevent anomalous or inconsistent data
- To support integrity constraints
One-way cryptographic hash function
The function transforms its input string into a new string, called the hash, that is unrecognizable. Another characteristic of a hash is that it's not reversible. You can't recover the input string from its hash because the hashing algorithm is designed to "lose" some information about the input.
Example using bcrypt (adaptive hashing function):
import bcrypt hashed_password = bcrypt.hashpw(password.encode('utf-8'), bcrypt.gensalt(12))
A relational database is a digital database whose organization is based on the relational model of data, as proposed by E.F. Codd in 1970.
Relational databases are specifically designed to manage relationships:
- each entity (row) can be uniquely identified by a primary key
- entities are normalized. The data for a unique entity is stored only once, and related entities store just its primary key. Changing the data of an entity has to happen in only one place
- entities can be joined at query time, allowing for cross-entity search
- changes to a single entity are atomic, consistent, isolated, and durable (ACID)
- most relational databases support ACID transactions across multiple entities