# Databases and Database Management Systems


## Database

A database is a collection of data, usually related.  
This data is often structured in a contextually relevant manner.  
Databases provide access using various methods such as:

 * Structured Query Language (SQL)
 * Application Programming Interface (API)
 * Custom programs (written using an API)

Database files exist in a variety of formats, such as:

 * well-formed text files, such as CSV;
 * proprietary file formats, e.g., MS Access or ESRI Shape/Geodatabase; and
 * open-source formats, e.g., SQLite or HDF5.

Some of these database files are special purpose (e.g., Access, ESRI, or FASTQ), and others are generic (e.g., CSV and SQLite).  
The list above is illustrative, not exhaustive.

In this course and subsequent data science courses, you will develop a flexible skill set such that any database becomes an accessible data repository to be exploited as you see fit.


## Relational Data Model, Relational Databases

Databases come in a variety of styles; however, the predominant structured storage paradigm is the relational model, developed by Edgar Frank "Ted" Codd while working at IBM in 1969.

Codd proposed the relational model in his paper "A Relational Model of Data for Large Shared Data Banks", published in 1970.

Codd later went on to define _online analytical processing_ (OLAP), which could be considered a seminal exploration of the use of databases to perform online analytics.



### Relational Model Basics

#### Entity, Tables, and Tuples 

**Entity** : The design element of the database
 * A person, place or thing about which we want to collect and store multiple instances of data.
 * Similar to an Object in Object Oriented Design
 * Think of Entities as nouns
 * These will be the tables in your database
 * A table / relation is the instantiation of the Entity for data storage.

In the relational model, data is stored as a set of tuples.

**Tuple** : a fixed-size, ordered collection of data values, aka Record, Row.  
Each tuple is, conceptually, a set of _key-value_ pairs, where each key is a field name and each value is the datum stored in that field.

**Relation** : a collection of identically structured tuples, aka Table, Entity.

Table example: four records, using *record\_id* as the **key** column.

<table border="1">
<tbody>
<tr>
<td></td>
<td><strong>record_id</strong></td>
<td><strong>datum_1</strong></td>
<td><strong>datum_2</strong></td>
<td style="text-align: center;"><strong>...</strong></td>
</tr>
<tr>
<td><strong>Row of Data</strong></td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">abc</td>
<td style="text-align: center;">def</td>
<td style="text-align: center;">...</td>
</tr>
<tr>
<td><strong>Row of Data</strong></td>
<td style="text-align: center;">2</td>
<td style="text-align: center;">ghi</td>
<td style="text-align: center;">jkl</td>
<td style="text-align: center;">...</td>
</tr>
<tr>
<td><strong>Row of Data</strong></td>
<td style="text-align: center;">3</td>
<td style="text-align: center;">mno</td>
<td style="text-align: center;">pqr</td>
<td style="text-align: center;">...</td>
</tr>
<tr>
<td><strong>Row of Data</strong></td>
<td style="text-align: center;">4</td>
<td style="text-align: center;">stu</td>
<td style="text-align: center;">vwx</td>
<td style="text-align: center;">...</td>
</tr>
</tbody>
</table>

**Example**: A single table is similar to a well-structured tab of a spreadsheet.

![One Table](../images/SpreadSheet_single_tab.png)
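
The tuple and key-value views of a record can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a prescribed workflow: the table name `example` is hypothetical, and only two rows of the example table are loaded.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
# sqlite3.Row lets each record be read by position or by column name.
conn.row_factory = sqlite3.Row

# Hypothetical table name; the columns mirror the example table above.
conn.execute("CREATE TABLE example (record_id INTEGER, datum_1 TEXT, datum_2 TEXT)")
conn.executemany(
    "INSERT INTO example VALUES (?, ?, ?)",
    [(2, "ghi", "jkl"), (3, "mno", "pqr")],
)

row = conn.execute("SELECT * FROM example WHERE record_id = 2").fetchone()

as_tuple = tuple(row)  # positional view of the record
as_pairs = dict(row)   # key-value view of the same record

print(as_tuple)
print(as_pairs)
```

Both views describe the same stored record; the column names supply the keys.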


#### Attributes, aka Fields
 * Data that describes the Entities.
 * A positional datum of a tuple
 * These will be the columns of each table in your database.

##### Data Constraints 
 * Specific rules for the Attributes.
 * Make sure that the data is consistent.
 * In SQL:
   * NOT NULL
   * UNIQUE
   * CHECK
   * DEFAULT
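
As a rough sketch of how these four constraints behave, the following uses Python's built-in `sqlite3` module with an in-memory database. The table `reading`, its columns, and the helper `rejected` are hypothetical names chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with one column per constraint type.
conn.execute("""
    CREATE TABLE reading (
        reading_id INTEGER NOT NULL,
        label      TEXT UNIQUE,
        amount     REAL CHECK (amount >= 0),
        units      TEXT DEFAULT 'unknown'
    )
""")

# A valid row; units is omitted, so the DEFAULT value is stored.
conn.execute("INSERT INTO reading (reading_id, label, amount) VALUES (1, 'a', 9.5)")
units = conn.execute("SELECT units FROM reading WHERE reading_id = 1").fetchone()[0]


def rejected(sql):
    """Return True if the database refuses the statement on constraint grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Each statement below breaks exactly one constraint.
null_rejected = rejected("INSERT INTO reading (reading_id, label, amount) VALUES (NULL, 'b', 1.0)")
dup_rejected = rejected("INSERT INTO reading (reading_id, label, amount) VALUES (2, 'a', 1.0)")
check_rejected = rejected("INSERT INTO reading (reading_id, label, amount) VALUES (3, 'c', -1.0)")

print(units, null_rejected, dup_rejected, check_rejected)
```

The database itself refuses the inconsistent rows, so every program that writes to the table is held to the same rules.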

#### Keys

 * **Key** : a sub-tuple of one or more fields used to identify rows in the table.

 * **Candidate Key** : a minimal key whose values uniquely identify each row in the table.

 * **Primary Key** : a candidate key that is chosen as the primary index for the table.
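
The distinction between these kinds of keys can be sketched in SQLite via Python's `sqlite3` module. The table `person` and its columns are assumptions made for this example: `person_id` is the chosen primary key, `email` is a second candidate key, and `family` is a field that can look up rows without identifying them uniquely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: person_id is the primary key, email is another
# candidate key (declared UNIQUE), family is an ordinary lookup field.
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        email     TEXT NOT NULL UNIQUE,
        family    TEXT
    )
""")
conn.execute("INSERT INTO person VALUES (1, 'a@example.org', 'Dyer')")
conn.execute("INSERT INTO person VALUES (2, 'b@example.org', 'Dyer')")

# Looking up by family may return several rows.
by_family = conn.execute(
    "SELECT person_id FROM person WHERE family = 'Dyer' ORDER BY person_id"
).fetchall()

# Looking up by the primary key returns at most one row.
by_pk = conn.execute("SELECT email FROM person WHERE person_id = 2").fetchall()

# Reusing an existing primary key value is refused.
try:
    conn.execute("INSERT INTO person VALUES (1, 'c@example.org', 'Lake')")
    duplicate_pk_rejected = False
except sqlite3.IntegrityError:
    duplicate_pk_rejected = True

print(by_family, by_pk, duplicate_pk_rejected)
```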


#### SQL

Relational databases use the _Structured Query Language_ (SQL) as a standard for defining, querying, manipulating, and managing the data and metadata of the database.


Tables are defined in the database using a SQL statement of the form:
```SQL
CREATE TABLE <table_name>( <tuple definition>, <constraints and modifiers> );
```

**Examples**
```SQL
CREATE TABLE Person(ident TEXT, personal TEXT, family TEXT);

CREATE TABLE Site(name TEXT, lati REAL, longi REAL);

CREATE TABLE Visited(ident INTEGER, site TEXT, dated TEXT);

CREATE TABLE Survey(taken INTEGER, person TEXT, quant REAL, reading REAL);
```
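
One way to try these statements is to run them against an in-memory SQLite database from Python and then ask the database what it now contains. The sketch below assumes only the standard-library `sqlite3` module; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# executescript runs several semicolon-separated statements in one call.
conn.executescript("""
    CREATE TABLE Person(ident TEXT, personal TEXT, family TEXT);
    CREATE TABLE Site(name TEXT, lati REAL, longi REAL);
    CREATE TABLE Visited(ident INTEGER, site TEXT, dated TEXT);
    CREATE TABLE Survey(taken INTEGER, person TEXT, quant REAL, reading REAL);
""")

# sqlite_master is SQLite's built-in catalog of schema objects.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
site_columns = [(col[1], col[2]) for col in conn.execute("PRAGMA table_info(Site)")]

print(tables)
print(site_columns)
```

The table definitions are themselves stored as data (metadata) that can be queried like any other table.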

### Database

Databases in the relational model are conceptually similar to a many-tabbed Excel file (spreadsheet), in which each sheet holds one and only one record type.
Additionally, rows of data are likely related to rows on other spreadsheet tabs.
In the relational model, the tabs are tables.
Beyond this analogy, the relational model carries significant semantic structure and constraints.


The example below has three tabs (Artist, Albums, Songs). In the relational model, the linkage of a Song to an Album and of an Album to an Artist is encoded within the database.
<img src="../images/sheets.gif">
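
A minimal sketch of how such linkage might be encoded follows, using Python's `sqlite3` module. The table names follow the three tabs, but the column names and sample rows are invented for illustration and are not taken from the spreadsheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each child table stores the key of its parent row (a foreign key).
# Column names and data are hypothetical.
conn.executescript("""
    CREATE TABLE Artist (artist_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Albums (
        album_id  INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        artist_id INTEGER NOT NULL REFERENCES Artist(artist_id)
    );
    CREATE TABLE Songs (
        song_id  INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        album_id INTEGER NOT NULL REFERENCES Albums(album_id)
    );
    INSERT INTO Artist VALUES (1, 'Example Artist');
    INSERT INTO Albums VALUES (10, 'Example Album', 1);
    INSERT INTO Songs VALUES (100, 'First Song', 10), (101, 'Second Song', 10);
""")

# Follow the links Song -> Album -> Artist with joins.
rows = conn.execute("""
    SELECT Songs.title, Albums.title, Artist.name
    FROM Songs
    JOIN Albums ON Songs.album_id = Albums.album_id
    JOIN Artist ON Albums.artist_id = Artist.artist_id
    ORDER BY Songs.song_id
""").fetchall()

for row in rows:
    print(row)
```

Unlike a spreadsheet, where the connection between tabs lives in the reader's head, the stored key values let the database reassemble related rows on request.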

#### Relationships

 * Entities have some relationship to other entities in the system.
 * Relationships illustrate an association between two entities.
 * Cardinality Constraints:
   * Zero or More
   * One or More
   * One and only One
   * Zero or One
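
Some of these cardinalities map fairly directly onto column constraints. The sketch below, using Python's `sqlite3` module, is one possible encoding under assumed table names (`artist`, `album`, `song`): a required foreign key models "one and only one", a nullable foreign key models "zero or one", and the absence of any restriction on the parent side models "zero or more". "One or more" generally needs extra logic and is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- One and only one: every album must reference an existing artist.
    CREATE TABLE album (
        album_id  INTEGER PRIMARY KEY,
        artist_id INTEGER NOT NULL REFERENCES artist(artist_id)
    );
    -- Zero or one: a song may exist without an album.
    CREATE TABLE song (
        song_id  INTEGER PRIMARY KEY,
        album_id INTEGER REFERENCES album(album_id)
    );
    INSERT INTO artist VALUES (1, 'Example Artist');
""")


def rejected(sql):
    """Return True if the database refuses the statement on constraint grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


song_without_album_rejected = rejected("INSERT INTO song VALUES (100, NULL)")
album_without_artist_rejected = rejected("INSERT INTO album VALUES (10, NULL)")
album_unknown_artist_rejected = rejected("INSERT INTO album VALUES (11, 99)")

# Zero or more: an artist with no albums is perfectly valid.
album_count = conn.execute(
    "SELECT COUNT(*) FROM album WHERE artist_id = 1"
).fetchone()[0]

print(song_without_album_rejected, album_without_artist_rejected,
      album_unknown_artist_rejected, album_count)
```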

**Subsequent discussions on Database Design will revisit these basics of the Relational Model and designing databases with Entity Relationship Diagrams (ERD).**

## Database Management System (DBMS)

A DBMS is a software system that manages the storage, access, manipulation and security of one or more databases. 
Typically, a DBMS supports multiple concurrent users or processes accessing data. 
Examples of common DBMS include:

 * PostgreSQL (free / open source)
 * MySQL (free / open source)
 * Oracle
 * IBM DB2
 * MS SQL Server

Another characteristic trait of many DBMS is that they can be accessed remotely, as they are often located on a dedicated server or deployed in a cloud environment.

## Coming Up!

In the remainder of this course, we will use [SQLite](https://www.sqlite.org/), a SQL-compatible database engine, and [PostgreSQL](https://www.postgresql.org/), an open source DBMS.

You will access data within these databases using the command line, direct access, the Python programming language, and Jupyter Notebook extensions (which use Python SQLAlchemy).

The tools and techniques we learn in this module are applicable to any SQL-supporting data repository, be it a traditional RDBMS, a data warehouse, or a SQL-wrapped Big Data ecosystem such as:
 * Phoenix on top of HBase on top of Hadoop, or
 * SparkSQL on top of Spark on top of Hadoop, or
 * Google BigQuery
