<div style="text-align:center;">
  <img src="images/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Deciding Database Structure

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-info"> 
<h2>Overview</h2>

<strong>Questions:</strong>

* What is a database schema?

* What is the purpose of a database schema?

* How does normalization help in database design?

<strong>Objectives:</strong>

* Understand the concept of database schema and its importance.

* Learn about normalization and how it reduces database redundancy.

* Identify primary keys and foreign keys in database tables.

</div>

## Database structure

In the previous presentation, we mentioned that relational databases are structured and have a specific schema. 
For this tutorial, we will create a database with a normalized schema.
The application we will be considering is a database that will hold information about scientific papers.

In order to create our relational database, our first step will be to decide the structure. 
The goal of relational databases with a normalized schema is to reduce repetition of information and dependencies between rows. The process of designing tables this way is called **normalization**.

### Tables
The **table** is the basic unit of storage in a relational database. 
A table is made up of columns which define the data being stored in the table, and rows (also called records) which are entries in the table.

First, consider what information about some of our papers would look like if we just added it to a spreadsheet.

<table style="width:100%">
    <tr>
        <th>DOI</th>
        <th>Title</th>
        <th>Journal</th>
        <th>Publication Year</th>
        <th>Authors</th>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>'Perspective: Computational chemistry software and its advancement as illustrated through three grand challenge cases for molecular science'</td>
        <td>J. Chem. Phys.</td>
        <td>2018</td>
        <td>Anna Krylov, Theresa L. Windus, Taylor Barnes, Eliseo Marin-Rimoldi, Jessica A. Nash, Benjamin Pritchard, Daniel G.A. Smith, Doaa Altarawy, Paul Saxe, Cecilia Clementi, T. Daniel Crawford, Robert J. Harrison, Shantenu Jha, Vijay S. Pande, Teresa Head-Gordon</td>
    </tr>
    <tr>
        <td>'10.1063/1.2137323'</td>
        <td>'Sources of error in electronic structure calculations on small chemical systems'</td>
        <td>J. Chem. Phys.</td>
        <td>2006</td>
        <td>David Feller, Kirk A. Peterson, T. Daniel Crawford</td>
    </tr>
        <tr>
        <td>'10.1021/acs.jcim.9b00725'</td>
        <td>'New Basis Set Exchange: An Open, Up-to-Date Resource for the Molecular Sciences Community'</td>
        <td>J. Chem. Inf. Model</td>
        <td>2019</td>
        <td>Benjamin P. Pritchard, Doaa Altarawy, Brett Didier, Tara D. Gibson, Theresa L. Windus</td>
    </tr>
</table>

To put this into a relational database, we will want to make some changes to this structure. 
Imagine that we have hundreds of articles in our spreadsheet. 
What if we wanted to find all papers with one particular author? 
This would be very hard to find with our current structure.
It would involve reading each cell, splitting the authors, and then checking if the author is in the list.
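To make the cost of this search concrete, here is a minimal sketch of what it looks like against spreadsheet-style rows. The `rows` list and the `papers_by_author` helper are illustrative names, not part of any library, and the author lists are shortened to a few names per paper to keep the example small.

```python
# Spreadsheet-style rows: all authors are packed into one text cell.
# Author lists are abbreviated here for illustration.
rows = [
    {"doi": "10.1063/1.5052551",
     "authors": "Anna Krylov, Theresa L. Windus, T. Daniel Crawford"},
    {"doi": "10.1063/1.2137323",
     "authors": "David Feller, Kirk A. Peterson, T. Daniel Crawford"},
]


def papers_by_author(rows, name):
    """Return the DOIs of rows whose author cell contains `name`."""
    matches = []
    for row in rows:
        # Every cell has to be split and cleaned before it can be compared.
        authors = [a.strip() for a in row["authors"].split(",")]
        if name in authors:
            matches.append(row["doi"])
    return matches


found = papers_by_author(rows, "T. Daniel Crawford")
print(found)
```

Every search has to parse every cell, and the result depends on the names being typed and separated consistently.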

In relational databases, values which can be searched should be **atomic** (meaning indivisible). 
The `Authors` column lists multiple values in a single cell, meaning that it violates [first normal form](https://en.wikipedia.org/wiki/First_normal_form).

One solution would be to add several more columns to our spreadsheet (Author1, Author2, Author3). 
However, the number of author columns needed would vary from paper to paper.

<!-- Example table showing the problem of adding multiple author columns -->
<table style="width:100%">
    <tr>
        <th style="border: 1px solid #ddd; padding: 8px;">DOI</th>
        <th style="border: 1px solid #ddd; padding: 8px;">Title</th>
        <th style="border: 1px solid #ddd; padding: 8px;">Journal</th>
        <th style="border: 1px solid #ddd; padding: 8px;">Publication Year</th>
        <th style="border: 1px solid #ddd; padding: 8px;">Author1</th>
        <th style="border: 1px solid #ddd; padding: 8px;">Author2</th>
        <th style="border: 1px solid #ddd; padding: 8px;">Author3</th>
    </tr>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">'10.1063/1.5052551'</td>
        <td style="border: 1px solid #ddd; padding: 8px;">'Perspective: Computational chemistry software and its advancement...'</td>
        <td style="border: 1px solid #ddd; padding: 8px;">J. Chem. Phys.</td>
        <td style="border: 1px solid #ddd; padding: 8px;">2018</td>
        <td style="border: 1px solid #ddd; padding: 8px;">Anna Krylov</td>
        <td style="border: 1px solid #ddd; padding: 8px;">Theresa L. Windus</td>
        <td style="border: 1px solid #ddd; padding: 8px;">Taylor Barnes</td>
    </tr>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">'10.1063/1.2137323'</td>
        <td style="border: 1px solid #ddd; padding: 8px;">'Sources of error in electronic structure calculations...'</td>
        <td style="border: 1px solid #ddd; padding: 8px;">J. Chem. Phys.</td>
        <td style="border: 1px solid #ddd; padding: 8px;">2006</td>
        <td style="border: 1px solid #ddd; padding: 8px;">David Feller</td>
        <td style="border: 1px solid #ddd; padding: 8px;">Kirk A. Peterson</td>
        <td style="border: 1px solid #ddd; padding: 8px;">T. Daniel Crawford</td>
    </tr>
    <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">'10.1021/acs.jcim.9b00725'</td>
        <td style="border: 1px solid #ddd; padding: 8px;">'New Basis Set Exchange: An Open, Up-to-Date Resource...'</td>
        <td style="border: 1px solid #ddd; padding: 8px;">J. Chem. Inf. Model</td>
        <td style="border: 1px solid #ddd; padding: 8px;">2019</td>
        <td style="border: 1px solid #ddd; padding: 8px;">Benjamin P. Pritchard</td>
        <td style="border: 1px solid #ddd; padding: 8px;">Doaa Altarawy</td>
        <td style="border: 1px solid #ddd; padding: 8px;"></td>
    </tr>
</table>


This approach can lead to inefficient storage, with empty cells when papers have fewer authors, and it complicates queries when searching for specific authors.


Another solution would be to have a separate row for each author:

<table style="width:100%">
    <tr>
        <th>DOI</th>
        <th>Title</th>
        <th>Journal</th>
        <th>Publication Year</th>
        <th>Author</th>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>'Perspective: Computational chemistry software and its advancement as illustrated through three grand challenge cases for molecular science'</td>
        <td>J. Chem. Phys.</td>
        <td>2018</td>
        <td>Anna Krylov</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>'Perspective: Computational chemistry software and its advancement as illustrated through three grand challenge cases for molecular science'</td>
        <td>J. Chem. Phys.</td>
        <td>2018</td>
        <td>Theresa L. Windus</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>'Perspective: Computational chemistry software and its advancement as illustrated through three grand challenge cases for molecular science'</td>
        <td>J. Chem. Phys.</td>
        <td>2018</td>
        <td>Taylor Barnes</td>
    </tr>
</table>

We have now made each author value atomic, but we repeat all of the other information once per author. If the title of a paper changed, for example, we would have to update it in every associated row. Our goal is to reduce this repetition.
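The repetition problem can be seen in a small sketch using Python's built-in `sqlite3` module. The table name `flat`, its column names, and the replacement title are assumptions made for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (paper, author) pair, so the title is stored once per author.
conn.execute("CREATE TABLE flat (doi TEXT, title TEXT, author TEXT)")
title = ("Perspective: Computational chemistry software and its advancement "
         "as illustrated through three grand challenge cases for molecular science")
conn.executemany(
    "INSERT INTO flat VALUES (?, ?, ?)",
    [
        ("10.1063/1.5052551", title, "Anna Krylov"),
        ("10.1063/1.5052551", title, "Theresa L. Windus"),
        ("10.1063/1.5052551", title, "Taylor Barnes"),
    ],
)

# Changing the title of one paper (to a hypothetical new value) has to
# touch every row that repeats it.
cur = conn.execute(
    "UPDATE flat SET title = ? WHERE doi = ?",
    ("A hypothetical corrected title", "10.1063/1.5052551"),
)
rows_touched = cur.rowcount
print(rows_touched)
```

A single logical change modifies one row per author. If any of those rows were missed, the same paper would end up with two different titles.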

Ultimately, the design which minimizes repetition and enforces atomicity is adding multiple sheets or tables.
Breaking down the data into separate tables is called **normalization** and ensures that each piece of data is stored only once.

In this structure, we will define a table for articles (`Article` table) and a table for authors (`Author` table). 
We saw the need for these when we ran into repeated data while trying to list the authors of a paper.
To prevent this problem, the `Article` table lists each article only once, the `Author` table lists each author only once,
and a third table links authors to articles (`Article-Author` table).

## Article Table

The `Article` table will store information about each article only once, identified by a unique ID (the DOI).

<table style="width:100%">
    <tr>
        <th>DOI</th>
        <th>Title</th>
        <th>Journal</th>
        <th>Publication Year</th>
    </tr>
        <tr>
        <td>'10.1063/1.5052551'</td>
        <td>'Perspective: Computational chemistry software and its advancement as illustrated through three grand challenge cases for molecular science'</td>
        <td>J. Chem. Phys.</td>
        <td>2018</td>
    </tr>
    <tr>
        <td>'10.1063/1.2137323'</td>
        <td>'Sources of error in electronic structure calculations on small chemical systems'</td>
        <td>J. Chem. Phys.</td>
        <td>2006</td>
    </tr>
    <tr>
        <td>'10.1021/acs.jcim.9b00725'</td>
        <td>'New Basis Set Exchange: An Open, Up-to-Date Resource for the Molecular Sciences Community'</td>
        <td>J. Chem. Inf. Model</td>
        <td>2019</td>
    </tr>
</table>

## Author Table

The `Author` table will store each author's information only once, identified by a unique ID.

<table style="width:50%">
    <tr>
        <th>Author ID</th>
        <th>Author Name</th>
    </tr>
    <tr>
        <td>1</td>
        <td>Anna Krylov</td>
    </tr>
    <tr>
        <td>2</td>
        <td>T. Daniel Crawford</td>
    </tr>
    <tr>
        <td>3</td>
        <td>Theresa L. Windus</td>
    </tr>
    <tr>
        <td>4</td>
        <td>Taylor Barnes</td>
    </tr>
    <tr>
        <td>5</td>
        <td>David Feller</td>
    </tr>
</table>

## Article - Author Table

In order to associate authors with articles, we need to create an **associative table**. 
The associative table will have two columns, one for the DOI of the article and one for the author.
This will link the information in the two tables together so that we will be able to search for authors by article and articles by author.

<table style="width:50%">
    <tr>
        <th>Article DOI</th>
        <th>Author</th>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>Anna Krylov</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>T. Daniel Crawford</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>Theresa L. Windus</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>Taylor Barnes</td>
    </tr>
    <tr>
        <td>'10.1063/1.2137323'</td>
        <td>David Feller</td>
    </tr>
    <tr>
        <td>'10.1063/1.2137323'</td>
        <td>T. Daniel Crawford</td>
    </tr>
</table>

Now, there is only one entry for each article in the `Article` table, and a separate entry matching each author with an article.  
When we record authors by name in this way, we assume that names are unique (as in, no two authors can have the same name). 
This may not be true, so you may want to use some other identifier for authors, such as an ID number or ORCID.
Whether we use names or IDs, we still need a separate table that lists each unique author, which is the role of the `Author` table. 
In a real-world scenario, identifying a unique author is a complex problem, but we will use a simple numerical ID.
If we do this, the `Article-Author` table will look something like this:

<table style="width:50%">
    <tr>
        <th>Article DOI</th>
        <th>Author ID</th>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>1</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>2</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>3</td>
    </tr>
    <tr>
        <td>'10.1063/1.5052551'</td>
        <td>4</td>
    </tr>
    <tr>
        <td>'10.1063/1.2137323'</td>
        <td>5</td>
    </tr>
    <tr>
        <td>'10.1063/1.2137323'</td>
        <td>2</td>
    </tr>
</table>


## Identifying unique entries - Primary Keys


In a relational database, having unique identifiers for records in each table is crucial. 
These unique identifiers are called primary keys. A primary key ensures that each record in a table can be uniquely identified.
Let’s look at how primary keys are used in our current table structure.

We have an `Article` table that uses the DOI as its primary key. This ensures that each article can be uniquely identified because no two articles will have the same DOI. Similarly, the `Author` table uses `Author ID` as its primary key, which uniquely identifies each author in the database.

Primary keys can be single columns, or they may be composed of multiple columns, known as composite keys. 
For example, in our `Article-Author` table, neither the DOI nor the Author ID alone can uniquely identify a record, as the same article can have multiple authors, and the same author can contribute to multiple articles. 
In this case, the combination of both the DOI and the Author ID forms a composite primary key. 
This composite key ensures that each author-article relationship is unique, preventing duplication of the same author-article pair.

We'll see more on this later.
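As a rough preview, the primary keys described above could be declared like this with Python's built-in `sqlite3` module. The table and column names (`article`, `author`, `article_author`, `article_doi`, `author_id`) are assumptions chosen for this sketch, not names fixed by the tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    doi TEXT PRIMARY KEY,              -- single-column primary key
    title TEXT NOT NULL,
    journal TEXT,
    publication_year INTEGER
);
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,     -- single-column primary key
    name TEXT NOT NULL
);
CREATE TABLE article_author (
    article_doi TEXT,
    author_id INTEGER,
    PRIMARY KEY (article_doi, author_id)  -- composite primary key
);
""")

# The same DOI may appear with different authors.
conn.execute("INSERT INTO article_author VALUES (?, ?)", ("10.1063/1.5052551", 1))
conn.execute("INSERT INTO article_author VALUES (?, ?)", ("10.1063/1.5052551", 2))

# Repeating an existing (DOI, author ID) pair violates the composite key.
try:
    conn.execute("INSERT INTO article_author VALUES (?, ?)", ("10.1063/1.5052551", 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

The database itself rejects the repeated pair, so the uniqueness rule does not depend on the code that inserts the data.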

## Foreign Keys

The columns in the `Article-Author` table are also examples of **foreign keys**.  
Foreign keys reference values in other tables.  
For example, the `Article DOI` column in the `Article-Author` table references the `DOI` column in the `Article` table.  
Foreign keys are constraints that ensure data integrity by enforcing that every reference in the `Article-Author` table corresponds to an actual entry in the `Article` table.  

The `Author ID` column in the `Article-Author` table is a foreign key that references the `Author ID` column in the `Author` table.  
These relationships help maintain the consistency and integrity of the data across the database.  
Don't worry if this is all a lot to remember right now!  
We will work on applying these ideas later.
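The sketch below shows one way a foreign key constraint can behave in practice, using Python's built-in `sqlite3` module. The table and column names are assumptions made for illustration. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE article (doi TEXT PRIMARY KEY);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE article_author (
    article_doi TEXT REFERENCES article(doi),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (article_doi, author_id)
);
""")

conn.execute("INSERT INTO article VALUES (?)", ("10.1063/1.5052551",))
conn.execute("INSERT INTO author VALUES (?, ?)", (1, "Anna Krylov"))

# This link is valid: both the article and the author exist.
conn.execute("INSERT INTO article_author VALUES (?, ?)", ("10.1063/1.5052551", 1))

# This link points at an author ID that is not in the author table.
try:
    conn.execute("INSERT INTO article_author VALUES (?, ?)", ("10.1063/1.5052551", 99))
    dangling_link_rejected = False
except sqlite3.IntegrityError:
    dangling_link_rejected = True

print(dangling_link_rejected)
```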

## Second Normal Form (2NF) and Third Normal Form (3NF)

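Although we didn't explicitly discuss it, our database structure satisfies the second normal form (2NF) and third normal form (3NF).
The second normal form (2NF) requires that the table is in the first normal form and that all non-key attributes are fully functionally dependent on the primary key, meaning they depend on the whole key and not just part of it.
In our database, this means that article details such as the title, which depend only on the DOI, are stored in the `Article` table rather than repeated in the `Article-Author` table, whose key combines the DOI and the Author ID.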

The third normal form (3NF) requires that the table is in the second normal form and that all non-key
attributes depend on the primary key directly, rather than through another non-key attribute (no transitive dependencies). In our database structure, we might violate 3NF if we wanted to store journal publisher information in the `Article` table.
The publisher depends on the journal, not directly on the article, so it should be stored in a separate table in order to maintain 3NF.

## Relationships
Central to the idea of relational databases are...relationships! You can already see from inspection how these tables are related. 
You would be able to figure out authors for a paper by tracing relationships through our three tables. Relational databases have specific relationship patterns, so let's discuss these.

### Many to Many relationships
The `Article-Author` table we've created is what is known as an associative table. 
Consider the type of relationship between papers and authors. 
One paper can have many authors, and one author can have many papers (consider that T. Daniel Crawford is an author on two of our papers). 
This is called a *many to many* relationship. 
Whenever we have a many to many relationship in a database, we will need an associative table. 
The associative table sits between the two tables it connects: each of its rows references one article in the `Article` table and one author (by ID) in the `Author` table. 
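To show how a many to many relationship is traced through the associative table, here is a minimal sketch using Python's built-in `sqlite3` module. It loads the author IDs and article-author pairs from the tables above. The table and column names are assumptions for this example, and the `article` table holds only a subset of its columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (doi TEXT PRIMARY KEY, publication_year INTEGER);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE article_author (
    article_doi TEXT REFERENCES article(doi),
    author_id INTEGER REFERENCES author(author_id),
    PRIMARY KEY (article_doi, author_id)
);
""")
conn.executemany("INSERT INTO article VALUES (?, ?)", [
    ("10.1063/1.5052551", 2018),
    ("10.1063/1.2137323", 2006),
])
conn.executemany("INSERT INTO author VALUES (?, ?)", [
    (1, "Anna Krylov"), (2, "T. Daniel Crawford"), (3, "Theresa L. Windus"),
    (4, "Taylor Barnes"), (5, "David Feller"),
])
conn.executemany("INSERT INTO article_author VALUES (?, ?)", [
    ("10.1063/1.5052551", 1), ("10.1063/1.5052551", 2),
    ("10.1063/1.5052551", 3), ("10.1063/1.5052551", 4),
    ("10.1063/1.2137323", 5), ("10.1063/1.2137323", 2),
])

# Articles by author: go from author -> article_author -> article.
articles_by_crawford = conn.execute("""
    SELECT a.doi, a.publication_year
    FROM article AS a
    JOIN article_author AS aa ON aa.article_doi = a.doi
    JOIN author AS au ON au.author_id = aa.author_id
    WHERE au.name = ?
    ORDER BY a.publication_year
""", ("T. Daniel Crawford",)).fetchall()

# Authors by article: the same path, followed in the other direction.
authors_of_2006_paper = conn.execute("""
    SELECT au.name
    FROM author AS au
    JOIN article_author AS aa ON aa.author_id = au.author_id
    WHERE aa.article_doi = ?
    ORDER BY au.author_id
""", ("10.1063/1.2137323",)).fetchall()

print(articles_by_crawford)
print(authors_of_2006_paper)
```

The same three tables answer both questions, and no cell ever has to be split or parsed.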

### One to Many and Many to One relationships
Another type of relationship is the *one to many* relationship. 
An example of this is the relationship between journals and papers. Papers belong to only one journal, but a journal can have many papers associated with it. 
We do not need an associative table for this type of relationship because the paper table can reference the journal ID directly in a column on the table (since there is only one value per paper).
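A one to many relationship might be sketched as follows with Python's built-in `sqlite3` module. The `journal` table, its `journal_id` values, and the column names are hypothetical and chosen only for this illustration. Each article row stores a single journal ID, so no associative table is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE journal (
    journal_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE article (
    doi TEXT PRIMARY KEY,
    publication_year INTEGER,
    journal_id INTEGER REFERENCES journal(journal_id)  -- one journal per article
);
""")
conn.executemany("INSERT INTO journal VALUES (?, ?)", [
    (1, "J. Chem. Phys."),
    (2, "J. Chem. Inf. Model"),
])
conn.executemany("INSERT INTO article VALUES (?, ?, ?)", [
    ("10.1063/1.5052551", 2018, 1),
    ("10.1063/1.2137323", 2006, 1),
    ("10.1021/acs.jcim.9b00725", 2019, 2),
])

# One journal row can be referenced by many article rows.
papers_per_journal = conn.execute("""
    SELECT j.name, COUNT(*)
    FROM journal AS j
    JOIN article AS a ON a.journal_id = j.journal_id
    GROUP BY j.journal_id
    ORDER BY j.journal_id
""").fetchall()

print(papers_per_journal)
```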

## Visualizing Database Structure - Entity Relationship Diagrams

An Entity Relationship Diagram (ERD) is a graphical representation of the entities within a database and the relationships between those entities. 
ERDs are used to visually map out the structure of a database, making it easier to understand how data is organized and how different pieces of data relate to one another. 

The ERD for our current database structure looks like this:

![ERD](images/db_schema_1.svg)

In this image, the boxes represent tables, and the lines connecting them represent relationships between the tables.

<div class="alert alert-block alert-warning">

### Exercise

How could we add keywords to our database schema?

How could we add information about the journal such as the publisher to our database schema?

What kind of relationships would exist between the tables for each?

</div>

### Answer:
