Skip to content

DynamoDB

Rico Hermans edited this page Dec 15, 2023 · 21 revisions

The database being used by the Hedy website is Amazon's DynamoDB. DynamoDB is a NoSQL database, which work a bit differently than SQL databases you might be familiar with.

In large installations, NoSQL databases like DynamoDB scale better and are more reliable than SQL databases typically are. In our case, we use it because at small installations it is nearly free 😌, and if you want to store additional fields on a record, you can do so without performing any database maintenance operations.

Modeling data

When you define a table, you don't have to specify what columns exist for a table and what type each column is. A DynamoDB table is schemaless, which means you can store as many or as few fields on every row as you want, and they can have any type, with only one exception.

The exception is the primary key of the table, which consists of 1 or 2 fields in the table. The types of the primary key fields must be predeclared, and values for the key fields must exist in every row (more information on keys, see the section "Table keys" below).

Fields in DynamoDB can have the following types:

  • string
  • number (int or float)
  • binary data (bytes in Python)
  • boolean
  • list (a list of values of any supported type)
  • dict (a mapping of string to any value of supported type)
  • set of all numbers, set of all strings, set of all binary values (no duplicates)

Table keys

When you create a table, you declare at least one of your fields to be the partition key, and you can optionally declare one of the fields to be the sort key. The combination of these fields must be unique for every row. The types of the fields that are designated as keys must be one of string, number or binary data.

The roles are as follows:

  • Partition key: every table must have a partition key, and every query to the table must also include a value for the partition key (see the Query operation below). Partition keys are typically values like unique identifiers.
  • Sort key: a sort key is optional: tables may not have a sort key, and even if they have a sort key, you don't have to provide a sort key when querying the table. When you retrieve multiple elements from a table, they will always be ordered by the sort key, in either ascending or descending order. When you query a table, you can restrict the query by use a condition like == (equals) on a sort key, but you can also use a condition like >= or <. Sort keys are typically used for data with an ordering, like timestamps.

The combination of partition key and sort key (if you are using both), must be unique. If you are not using a sort key, then the partition key must be unique by itself.

Here are some examples of tables in Hedy:

The users table

username (PK) email created password ...more data....
hedy hedy@hedy.org 1681239902406 $2b$09$riZem.ck ...
felienne felienne@felienne.com 1637851898173 $2b$09$xOqiciz ...
student123 st123@school.com 1700126054884 $2b$09$9z9d0a0 ...

Because this table only has a partition key (username) and no sort key, every query to this table will retrieve at most 1 row: you can query it by giving a username, and it will return the data for that username or tell you there is no such user. The table behaves like a Python dictionary.

Now let's look at a table that also has a sort key.

The quiz-stats table

This table also uses a sort key in addition to a partition key:

id#level (PK) week (SK) id level finished scores
hedy#1 2023-31 hedy 1 1 [ 60 ]
felienne#3 2022-08 felienne 3 0
felienne#3 2022-09 felienne 3 1 [ 95 ]
felienne#3 2022-10 felienne 3 1 [ 100 ]
student123#9 2023-23 student123 9 1 [ 40, 80 ]

This table has a partition key on the column id#level and a sort key on the column week. You can see a number of interesting things on this table:

  • The primary key column is a combination of two other fields: the column is called id#level, and it consists of the values of the individual columns id and level, combined with a separator. We can only have a single partition key: to query on the values of 2 columns at the same time (using what would have been an AND in an SQL query), we combine the values of those 2 columns into 1 column, so that we can query on that one partition key. There is nothing magical about this column: the Python code that inserts rows into this table builds the value for the id#level column as it's inserting. In fact, DynamoDB doesn't care whether the actual id and level fields are even in each row; our code just puts them there because it's convenient to have the original values as well.
  • Values with the same partition key are sorted by the value of the sort key (week). Since the sort key values are strings, but represent numbers, we have to take care to pad them with 0 on the left (if we don't, 2022-9 would incorrectly sort after 2022-10).
  • Field values don't have to be primitives. The scores column contain a set of numbers, representing all the scores that user has scored on all their quizzes during that week.

Because we have a sort key, we now have the option to query this data in multiple ways:

  • Given a user, level and week, we can query for the single row that has the user's scores for that combination.
  • Given just a user and a level, we can query for all of their test scores on that level over all time: this query will return all rows for a given id#level value, in ascending order of week.
  • We can also do partial queries over the week data of each user. For example, we can add conditions like starts_with(week, '2022-') to get all rows in a particular year, or week >= '2022-40' to get all quiz statistics since a particular week.

Designing tables

When you are designing a new table, the key schema is the most important decision you have to make. You can't change the keys after you have created the table, so take a moment to think about this.

In order to pick effective keys, it's useful to think about the queries you're going to do against the table:

  • What pages are going to retrieve rows from this table?
  • What information will those pages have access to in order to query the tables? Think about things like "current username", "current level", "ID argument passed in the URL bar", etc.
  • We like our page loads to be fast, but the more data the page needs to read from the database, the slower it will be. If you can keep your expected rows to read under 10 (or at least under a 100), your page will probably load just fine.
  • See if you can find a sort key that will make it possible to retrieve multiple rows of useful data at the same time, given a partition key. If you can do this, maybe you don't even need to add an index! (See "Indexes" below).
  • Bonus points if the sort key has structure that makes it possible to filter and sort on it usefully as well!
  • (Advanced) Strict normalization of data is not necessarily a goal like it is in SQL databases: if you can save effort at read time by doing duplicate writes under slightly different keys at write time, that might be worth the trade-off (but if you do this, make sure that reads are much more common than writes and the risks of potential inconsistency between different copies of the data are not a huge problem.

Operations

DynamoDB doesn't work by executing SQL statements. Instead, you can perform a set of network API calls against the database to write or read data. In Hedy, we have an abstraction layer that we program against, which will perform those API calls for you.

Below is a list of the most important API calls, and the equivalents in the Hedy code base:

DynamoDB API Hedy API Description
PutItem TABLE.put(data) / TABLE.create(data) Write a row to the table. Will completely overwrite all fields in the row if a row with the same key already exists in the table.
UpdateItem TABLE.update(key, updates) Updates some fields of a single row in the table, leaving the rest of the row unaffected if it already exists. Updates can overwrite fields with new values, but can also do in-place updates like increment a field, or add an element to a list or set. Returns the new values. (DynamoDB can also do conditional updates, those are not supported in our abstraction layer just yet).
DeleteItem TABLE.delete(key) Delete a row from the table.
GetItem TABLE.get(key) Return a single row given a full primary key (partition key + sort key)
BatchGetItem TABLE.batch_get(keys) Retrieve multiple rows in parallel. keys must be a list of key dicts, or a dictionary mapping a string identifier to a key dict; the return value will be in the same format.
Query TABLE.get_many(key) key must contain the partition key, and may contain an equality or comparison condition on the sort key. Returns one or more rows, depending on the query and the table schema.
Scan TABLE.scan() Return all items from the table. There is no filtering.
TABLE.del_many(key) First query all items for a given key, then delete them all one by one.

Indexes

(TODO)

Differences with SQL databases

(TODO)

How do I...?

(TODO)

Clone this wiki locally