
Data Engineering

Bonnie Wolfe edited this page Apr 14, 2023 · 15 revisions

Data Engineering Tutorial

Useful References

Star schema: https://en.wikipedia.org/wiki/Star_schema

Normalization: https://en.wikipedia.org/wiki/Database_normalization

ERD: See https://www.visual-paradigm.com/guide/data-modeling/what-is-entity-relationship-diagram/

SQL Databases

SQL databases consist of tables of data (with rows [or records] and columns) which can be queried with syntax like

 select …(a list of columns) from …(one or more tables) where …(one or more conditions)

Where to practice SQL Queries

SQL Select Statements

I will describe these “select” statements here. (There are also “update”, “delete”, and other statements.) Note that although there have been efforts to standardize SQL ever since IBM first developed it, different database engines support only parts of the standard, or variations not described here.

The list of columns following “select” is given by column names separated by commas, or by expressions involving column names, optionally followed by “as …” to specify an alternate name they are to have in the resulting data rows. The columns come from any of the “from” tables. If a column name appears in more than one table, it must be preceded by “xxx.” (xxx = table name) to specify which table.
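As a concrete illustration (using Python's built-in sqlite3 module and a made-up [User] table; SQLite's dialect differs slightly from other engines), here is a select list with an expression, an “as” alias, and a table-qualified column name:

```python
import sqlite3

# In-memory database with a hypothetical User table (names are illustrative).
con = sqlite3.connect(":memory:")
con.execute("create table User (_ID integer primary key, Name text, Age integer)")
con.executemany("insert into User (Name, Age) values (?, ?)",
                [("Alice", 30), ("Bob", 25)])

# "User.Name" is the table-qualified form; "Age + 1 as NextAge" is an
# expression column renamed with "as".
rows = con.execute(
    "select User.Name, Age + 1 as NextAge from User"
).fetchall()
print(rows)  # → [('Alice', 31), ('Bob', 26)]
```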

The From Clause

The “from” tables are listed as

 table, table, …

and/or as

 [left/right/full] join table on ….(join conditions)

The Where Clause

If the tables are simply comma separated, then the conditions following “where” must specify how the rows of one table are to be matched with the rows of another table. Conditions can look like

 table1.column1 = table2.column2,

where “=” can be replaced by other operators such as “!=”, “>”, “>=”, etc., or by

 like …(search expression, such as ‘%Adm%’ for any string containing “Adm”)

or one can have range of values syntax

 table1.column1 between … and …

Also, more powerful SQL database engines allow subquery conditions, such as

 table1.column1 in (select table2.column2 from table2 where ….)

or

 exists (select … from … where …)

or

 not exists (select … from … where …)

These conditions can be combined with “and” or “or” or “not”.
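A runnable sketch of these condition forms (again using sqlite3 and hypothetical [User]/[Role] tables), showing “like”, an “in (subquery)”, and “exists”:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table Role (_ID integer primary key, Name text);
create table User (_ID integer primary key, Name text, RoleID integer);
insert into Role values (1, 'Admin'), (2, 'Viewer');
insert into User values (1, 'Alice', 1), (2, 'Bob', 2);
""")

# "like": roles whose name contains "Adm"
admins = con.execute("select Name from Role where Name like '%Adm%'").fetchall()

# "in (subquery)": users whose RoleID matches an Admin role
users = con.execute(
    "select Name from User "
    "where RoleID in (select _ID from Role where Name = 'Admin')"
).fetchall()

# "exists": 1 (true) if any Viewer role exists
has_viewer = con.execute(
    "select exists (select 1 from Role where Name = 'Viewer')"
).fetchone()[0]
print(admins, users, has_viewer)  # → [('Admin',)] [('Alice',)] 1
```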

Joined Tables

When tables are combined with “join”, the clause following “on” looks like a “where” clause and selects the rows of the joined table that match rows of the prior table. A plain “join” (an inner join) behaves like listing the tables comma separated with “where” conditions relating them: only rows with matching data from both tables are included. The outer joins, by contrast, return rows even when there is no match for the “on” condition. “left join” means that if no row of the table named after “left join” matches the condition, the output row is still present, with null for the “select” columns from that table. “right join” means rows of the second table that cannot be matched to any of the first table’s rows are also included, with the “select” columns from the first table set to null. “full join” includes the unmatched rows of both tables, with nulls filled in on whichever side has no match.
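The inner-versus-left distinction can be seen directly (sqlite3 sketch with made-up tables; full join is omitted because only recent SQLite versions support it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table User (_ID integer primary key, Name text, RoleID integer);
create table Role (_ID integer primary key, Name text);
insert into Role values (1, 'Admin');
insert into User values (1, 'Alice', 1), (2, 'Bob', null);
""")

# Plain (inner) join: Bob is dropped, since his RoleID matches no Role row.
inner = con.execute(
    "select User.Name, Role.Name from User "
    "join Role on User.RoleID = Role._ID"
).fetchall()

# Left join: Bob is kept, with null (None) for the Role columns.
left = con.execute(
    "select User.Name, Role.Name from User "
    "left join Role on User.RoleID = Role._ID"
).fetchall()
print(inner)  # [('Alice', 'Admin')]
print(left)   # [('Alice', 'Admin'), ('Bob', None)]
```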

Group by, Having, and Order by Clauses

The “where” clause can also be followed by “group by”, “having”, and “order by” clauses. “group by” tells the query to group rows with the same values of the list of columns following “group by” so that only one row of each set of such values is returned. The only columns that are allowed to appear after “select” in this case are the group by columns, or expressions that aggregate what is in other columns, such as

 sum(…), avg(…), max(…), min(…), count(…) [or just count(*) to count the number of rows],
 count(distinct …) [count how many distinct values of …]

“having” specifies values for one or more aggregate expressions (above) that are to be applied following the “group by” processing. “order by” is followed by one or more column names or expressions, optionally followed by “asc” (ascending order) or “desc” (descending order), to specify how the resultant rows are sorted.
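All three clauses together, in a small sqlite3 sketch (the [Calls] table and its values are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table Calls (_ID integer primary key, Agency text, Minutes integer)")
con.executemany("insert into Calls (Agency, Minutes) values (?, ?)",
                [("Fire", 10), ("Fire", 20), ("Police", 5), ("EMS", 7), ("EMS", 9)])

# One output row per Agency; only the group-by column and aggregates
# of other columns may appear after "select".
rows = con.execute("""
    select Agency, count(*) as Calls, sum(Minutes) as Total
    from Calls
    group by Agency
    having count(*) > 1     -- filter on an aggregate, applied after grouping
    order by Total desc     -- sort the grouped results
""").fetchall()
print(rows)  # [('Fire', 2, 30), ('EMS', 2, 16)]
```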

Uniqueness of Rows and Primary Keys

SQL databases follow the rules of mathematical logic: the “where” conditions are the only criteria for determining which rows appear in the output. This means that if two rows in a table are identical (all columns have the same values), in principle there is no way to distinguish them, and a select will return both of them whenever the where clause is satisfied. One normally adds a column with a unique (often sequential) numeric value, called a primary key, to each table to prevent this from ever happening.
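A quick demonstration of the point (sqlite3, invented tables): without a key the duplicate rows are indistinguishable; with one, each row carries a unique value.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Without a key, identical rows cannot be told apart...
con.execute("create table NoKey (Name text)")
con.executemany("insert into NoKey values (?)", [("Alice",), ("Alice",)])
dupes = con.execute("select Name from NoKey where Name = 'Alice'").fetchall()

# ...with a unique numeric primary key, every row is distinct.
con.execute("create table WithKey (_ID integer primary key, Name text)")
con.executemany("insert into WithKey (Name) values (?)", [("Alice",), ("Alice",)])
keyed = con.execute("select _ID, Name from WithKey where Name = 'Alice'").fetchall()
print(dupes)  # [('Alice',), ('Alice',)]
print(keyed)  # [(1, 'Alice'), (2, 'Alice')]
```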

Keywords and Table and Field Name Delimiters

Table and column names can be the same as keywords for the database expressions (like “select”, “from”, “join”, “where”, and quite a few others). To ensure that the system does not confuse the two, typically the table and column names are delimited by [….] or “…” (brackets or double-quotes).

Relating Tables, Foreign Keys, and Referential Integrity

SQL databases have relationships between tables which are implicitly specified in the “where” clauses. Thus, this query

 select [User].[Name], [Role].[Name] as [Role] from [User], [Role]
 where [Role].[_ID] = [User].[Role ID]

implies that the (typically numeric) column [Role ID] in [User] specifies the value of [_ID] in [Role] for the role that corresponds to the user. There is syntax to specify, at the database design level, that this is so: that [Role ID] is a “foreign key” relating [User] to [Role] and [_ID] is the primary key in [Role] referenced by that foreign key. By specifying it at the database design level, a good SQL database engine can enforce this relationship, i.e., throw an error if ever [Role ID] does not point to a valid [_ID]. Databases which support these declarations of relationship are said to have “referential integrity”.
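Referential integrity enforcement can be sketched with sqlite3 (which supports declared foreign keys but, as an engine quirk, only enforces them after a pragma is set):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("pragma foreign_keys = on")  # SQLite requires this to enforce FKs
con.executescript("""
create table Role (_ID integer primary key, Name text);
create table User (
    _ID integer primary key,
    Name text,
    RoleID integer references Role(_ID)   -- declared foreign key
);
insert into Role values (1, 'Admin');
""")

# A valid key is accepted...
con.execute("insert into User (Name, RoleID) values ('Alice', 1)")

# ...but a RoleID pointing at no Role row throws an error.
try:
    con.execute("insert into User (Name, RoleID) values ('Bob', 99)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True
```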

Schema and Entity-Relation Diagrams

A database schema is the list of all tables, columns, and relationships, and is often shown visually in an entity relationship diagram (ERD). In an ERD, the boxes are tables, and the list in each box gives the name and data type of each column. The lines between the boxes represent the referential-integrity relationships between the tables. The end of a line with forking is the side where multiple rows of data typically match a single row of the table at the other end of the line (a “many-to-one” or “one-to-many” relationship).

Star Databases and Normalized Databases

Here are two common database structures (related to each other):

STAR DATABASE – here one large table of “facts” has foreign keys into a number of small tables of discrete attribute values called “dimensions”. Each dimension table has a primary key whose values match a fact foreign key, and one or more data columns describing the dimension. Example: 911 call data has a large table of facts, the calls, and dimensions representing aspects of these calls that have a discrete set of values, such as call type, call disposition, dispatcher, and the agency or person dispatched to.

NORMALIZED DATABASE – A database is normalized if it has a number of related tables, but no duplicated or separated information. Duplicated information would be facts listed repeatedly in a table instead of being referenced in a related table. Thus, one would not enter the name of an agency in a 911 call table repeatedly, but rather have a foreign key to a single row storing the agency name in another table. Separated information is the existence of information columns in separate tables that always go together, and therefore should be stored in one table. Thus, 911 call data might store an agency name and agency address repeatedly, and if those are moved out into separate agency name and agency address tables, that would not properly represent the fact that each name and its address go together. To normalize the database, one would create a table containing both agency name and address columns, and the 911 call fact table would have a single foreign key column that references this new name + address table.

Possibility of Automatically Normalizing Databases

It is interesting to note that if “fact” data, like a large 911 open data set, were uploaded into a system, the process of breaking out the “dimensions” into related tables could be automated, as well as the process of normalization. To find a dimension, one would run queries like

 select count(distinct [Agency]) from calls

and if the number of distinct values is small compared to the number of data rows, the system could create a foreign key and store that column’s ([Agency]) values in an [Agency] “dimension” table. To normalize, the system could run queries like

 select [Agency] from [Calls] group by [Agency] having count(distinct [Agency Address]) > 1

If this does not output any rows (no agencies have more than one address), then the system could move this field ([Agency Address]) to the dimension ([Agency]) table.
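Both probes can be run as-is in sqlite3 (with a fabricated [Calls] table standing in for the 911 data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table Calls (_ID integer primary key, Agency text, [Agency Address] text)")
con.executemany(
    "insert into Calls (Agency, [Agency Address]) values (?, ?)",
    [("Fire", "1 Main St")] * 500 + [("Police", "2 Oak Ave")] * 500)

# Few distinct values relative to 1000 rows suggests [Agency] is a dimension.
distinct = con.execute("select count(distinct Agency) from Calls").fetchone()[0]

# No agency with more than one address means [Agency Address] could move
# into the [Agency] dimension table.
conflicts = con.execute("""
    select Agency from Calls
    group by Agency
    having count(distinct [Agency Address]) > 1
""").fetchall()
print(distinct, conflicts)  # 2 []
```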

Pragmatic Lore

Some lore based on years of experience creating, maintaining, and supporting databases:

  1. One often needs to track history. To do this, one can add [Date Edited] and [Edited by] columns (names can vary, of course) to every table, and copy the row to a history table before updating it with new data. Alternatively, one can add [Date Created], [Created by], [Date Edited], [Edited by], and [Active Record] columns and keep the history in the primary table; [Active Record] is a true/false value marking the current row.
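The copy-to-history-table variant can be automated with a trigger; here is a sqlite3 sketch (table and trigger names invented, dates hard-coded for clarity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table User (
    _ID integer primary key, Name text, [Date Edited] text, [Edited by] text
);
create table UserHistory (
    _ID integer, Name text, [Date Edited] text, [Edited by] text
);
-- copy the old row into the history table before every update
create trigger UserAudit before update on User
begin
    insert into UserHistory
    values (old._ID, old.Name, old.[Date Edited], old.[Edited by]);
end;
insert into User values (1, 'Alice', '2023-01-01', 'admin');
update User set Name = 'Alicia', [Date Edited] = '2023-02-01' where _ID = 1;
""")
history = con.execute("select Name, [Date Edited] from UserHistory").fetchall()
current = con.execute("select Name from User").fetchall()
print(history)  # [('Alice', '2023-01-01')]
print(current)  # [('Alicia',)]
```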

  2. The standard of practice is to use numeric and intrinsically meaningless keys/foreign keys. This allows the meaningful values to be updated in one place and have the change apply everywhere instantly. And selecting rows using numeric keys is fast. The downside is that one has to join many tables together to extract any meaningful data.

E.g.

 select [Name] from [User]
 where [Role ID] in (select [_ID] from [Role] where [Name] = 'Admin')

or

 select [User].[Name] from [User], [Role]
 where [User].[Role ID] = [Role].[_ID] and [Role].[Name] = 'Admin'

is required instead of the much more intuitive

 select [Name] from [User] where [Role] = 'Admin'

In practice I have used meaningful “keys” (e.g., [Role] in the [User] table with text values that describe the role) for simple dimensions (which I also call “lookups”), and have used some meaningful (text) values as other foreign keys at times.

  3. Another (probably very unpopular) comment: I have created and maintained large databases with many primary and foreign keys but no declared referential integrity. The downside is that the database does not instantly catch bad keys; you have to run scripts to find them. The upside is that if the database is not allowed to go down (e.g., the largest database I support has run 24x7 for the last 11 years except for rare emergencies), it keeps running when there is an error, and the admin can attend to and fix the problem without locking out the users.

  4. When creating databases, one often creates ERDs and then ports them into a database using frameworks or tools. I find it easier to create symbolic scripts (e.g., Python or JavaScript sequences of calls) which declare tables, fields, and groups of fields symbolically, and which can be run to directly create the database. This allows quick changes of data types, and a consistent database structure for repeated groups of fields. An example of a group of fields is a contact: name, address1, address2, city, state, postal code, country code, email, area code, phone number, preferred time zone. Here is an example of such code:

    addTable("User");        // starts table definition; adds _ID [int] primary key
                             // and ID [varchar]
    addContact();            // adds the above-listed contact fields
    addPicture();            // adds fields to store a picture (binary field
                             // and mime-type)
    addField("Group", "ID"); // adds a field [Group] of type ID, which is symbolic
                             // for an actual database type, e.g., varchar(64)
    endAddTable();           // adds tracking fields [Date Edited], [Edited by]
                             // and completes the table declaration

Running this program would connect to the database and create a table with columns, e.g.:

 Column Name    Data Type      Description
 _ID            int            Primary Key
 ID             varchar(32)    Text ID
 Name           varchar(128)   Name
 Address1       varchar(128)   Address line 1
 Address2       varchar(128)   Address line 2
 City           varchar(128)   City
 State          varchar(128)   State (US) or Province
 Postal Code    varchar(20)    Zip Code (US) or Postal Code
 Country Code   varchar(20)    Country
 Email          varchar(256)   Email
 Area Code      varchar(10)    Area Code
 Phone Number   varchar(20)    Phone #
 Time Zone ID   int            Foreign key into time zone table
 Date Edited    datetime       Date last edited
 Edited by      varchar(32)    ID of user who last edited it

If one decided to include new fields everywhere a contact is added to a table, for example [Latitude] and [Longitude] of the address location, that could be done by editing the addContact method and re-running the script. If the script is coded to add/update tables when the table/columns already exist, instead of re-creating them from scratch, it can be used to update a running database.

I maintain scripts like this to update the database structure, with one block of code for each successive historical update. If the end-user application simply iterates through table fields and drops controls on a screen (data driven user interface rendering), then script updates can be done on a running system without needing to update the user interface code whenever a new data field is added.
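A minimal Python sketch of such a symbolic script, abbreviated to a few contact fields; the class, method names, and type mapping are hypothetical stand-ins for the addTable/addContact/addField/endAddTable calls above, and sqlite3 stands in for whatever engine is actually used:

```python
import sqlite3

# Symbolic type names mapped to concrete database types (illustrative).
TYPES = {"ID": "varchar(64)"}

class SchemaBuilder:
    """Hypothetical symbolic schema script, mirroring the pseudocode above."""
    def add_table(self, name):
        # starts a table definition with _ID [int] primary key and ID [varchar]
        self.name = name
        self.columns = [("_ID", "integer primary key"), ("ID", "varchar(32)")]
    def add_contact(self):
        # one shared definition for a repeated group of fields (abbreviated here)
        for col in ("Name", "Address1", "City", "Email"):
            self.columns.append((col, "varchar(128)"))
    def add_field(self, name, symbolic_type):
        # resolve a symbolic type to the actual database type
        self.columns.append((name, TYPES[symbolic_type]))
    def end_add_table(self, con):
        # appends the tracking fields, then creates the table
        self.columns += [("Date Edited", "datetime"), ("Edited by", "varchar(32)")]
        cols = ", ".join(f"[{n}] {t}" for n, t in self.columns)
        con.execute(f"create table [{self.name}] ({cols})")

con = sqlite3.connect(":memory:")
b = SchemaBuilder()
b.add_table("User")
b.add_contact()
b.add_field("Group", "ID")
b.end_add_table(con)
names = [row[1] for row in con.execute("pragma table_info(User)")]
print(names)
```

Changing add_contact (say, to append [Latitude] and [Longitude]) and re-running would change every table that uses the contact group, which is the consistency benefit described above.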


Issues used in the creation of this page

(Some of these issues may be closed or open/in progress.)

Contributors

Jonathan Melvin (jonmelvin on github)
