# Introduction to Databases

# Topics

**Introduction to DB**  

**Different types of databases:**
* Centralized
* Distributed
* Relational
* NoSQL
 * Key-value store
 * Document-Oriented
 * Graph
 * Wide-Column
* Cloud
* Operational
* Object-Oriented
* Commercial
* Personal

**Terminology:**
* ACID (Atomicity, Consistency, Isolation and Durability)
* Client/Server
* Connection
* Cursor
* Data Definition Language (DDL)
* Data Manipulation Language (DML)
* Data Control Language (DCL)
* Primary Key or _id
* Transaction Control Language (TCL)

**Different data storage formats:**
* Comma Separated Values (CSV)
* Extensible Markup Language (XML)
* JavaScript Object Notation (JSON)
* Databases

**Data Acquisition:**
* How can data be obtained?
 * We have seen one way: using a web scraper (Python Case Study)

**Data Preprocessing:**
* Data Validity vs Data Integrity
 * **Data Validity** - the data has been checked against a strict set of rules to ensure that it is correct and useful.
 * **Data Integrity** - the overall completeness, accuracy and consistency of the data.
* Things that can go wrong with data
 * Missing Data
 * Duplicated Data
 * Error Correction

---

## Introduction to DB

A database is a persistent collection of data that is organized in a way that makes the data easy to manage.

### What is Data?

In simple words, data can be facts related to any object in consideration. For example, your name, age, height, weight, etc. are some data related to you. A picture, image, file, pdf, etc. can also be considered data.

Data is information collected by qualitative or quantitative means from one or more people, objects or events. It can be textual, numerical, graphical, in the form of reports, etc. Data can be either messy or tidy (structured in some way), and it is usually plentiful. We therefore need containers to store this data, and those containers are called databases.

### What is Database?
A database is a systematic collection of data. Databases support electronic storage and manipulation of data, and they make data management easy.

Let us discuss a database example: An online telephone directory uses a database to store data of people, phone numbers, and other contact details. Your electricity service provider uses a database to manage billing, client-related issues, handle fault data, etc.

Let us also consider Facebook. It needs to store, manipulate, and present data related to members, their friends, member activities, messages, advertisements, and a lot more.

Before computers were invented, a physical means of storing large amounts of information would have been a library. Today, databases are used not only to store information but also to organize, protect and deliver data. To store and manage these databases, a system called the **database management system** (DBMS) was created.

<figure style="text-align: center">
<img src="../images/db_roadmap.png" class="center">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 1: Databases module roadmap</figcaption>
</figure>

## Types of Databases

There are several types of databases used for various applications.

### Centralized Databases

Centralized Databases are used to store data in a centralized location where users from different locations can access the data (figure 2 below). Users can use various applications with different kinds of authentication procedures to access the data securely. An example of a centralized database is a public library.

<figure style="text-align: center">
<img src="../images/central_db.png" class="center">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 2: Pictorial representation of a centralized database.</figcaption>
</figure>

A centralized database is stored at a single location such as a mainframe computer. It is maintained and modified from that location only and is usually accessed over a network connection such as a LAN or WAN. Centralized databases are used by organisations such as colleges, companies, banks, etc.

![image.png](attachment:image.png)

**Advantages**  
- The data integrity is maximised as the whole database is stored at a single physical location. This means that it is easier to coordinate the data and it is as accurate and consistent as possible.
- The data redundancy is minimal in the centralised database. All the data is stored together and not scattered across different locations. So, it is easier to make sure there is no redundant data available.
- Since all the data is in one place, there can be stronger security measures around it. So, the centralised database is much more secure.
- Data is easily portable because it is stored at the same place.
- The centralized database is cheaper than other types of databases as it requires less power and maintenance.
- All the information in the centralized database can be easily accessed from the same location and at the same time.

**Disadvantages**  

- Since all the data is at one location, it takes more time to search and access it. If the network is slow, this process takes even more time.
- There is a lot of data access traffic for the centralized database. This may create a bottleneck situation.
- Since all the data is at the same location, if multiple users try to access it simultaneously it creates a problem. This may reduce the efficiency of the system.
- If there are no database recovery measures in place and a system failure occurs, then all the data in the database will be destroyed.

### Distributed Databases

Distributed Databases are the opposite of centralized databases. The data in a distributed database is stored across various locations and sites of an organization. These sites are connected to each other through Local Area Networks and/or Wide Area Networks. Refer to figure 3 below.

<figure style="text-align: center">
<img src="../images/distributed_db.png" class="center" style="zoom:70%">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 3: Pictorial representation of a distributed database.</figcaption>
</figure>

Distributed databases are divided into 2 classifications: **homogeneous** and **heterogeneous**. These classifications refer to the type of hardware, operating systems and application procedures that the databases operate on. In a homogeneous distributed database, all sites use **the same** type of hardware, operating systems and application procedures, whereas in a heterogeneous one the sites use **different** types of hardware, operating systems and application procedures.

### Relational Databases

Relational Databases are databases that use the relational data model to store data. This model organizes data into a set of tables. Each table consists of rows and columns, where each column defines a specific category of data and each row contains one record with a value for each of those categories (figure 4 below). Structured Query Language (SQL) statements are used for accessing, manipulating and maintaining the data in a relational database. Examples of relational databases are MySQL, Oracle, PostgreSQL, MariaDB, etc.

<figure style="text-align: center">
<img src="../images/rdb.png" class="center">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 4: Pictorial representation of some tables in a relational database.</figcaption>
</figure>

**Example2:**    
![image.png](attachment:image.png)
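
The table, row and column idea described above can be tried out with Python's built-in `sqlite3` module. This is only a minimal sketch: the `student` table, its columns and its rows are made up for illustration and do not come from the figures.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table: each column is a category of data, each row is one record.
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (id, name, age) VALUES (?, ?, ?)",
    [(1, "Alice", 20), (2, "Bob", 22), (3, "Carol", 21)],
)

# An SQL statement that retrieves only the rows matching a condition.
older_students = conn.execute(
    "SELECT name, age FROM student WHERE age > 20 ORDER BY name"
).fetchall()
print(older_students)
conn.close()
```

Each tuple returned by the query corresponds to one row, with one value per selected column.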

### NoSQL Databases 

NoSQL Databases are databases that store a wide range of data sets. The stored data is not limited to tabular form; it can be organized in several other ways. These ways can be broken down as follows:

**1. Key-value storage** - the data is stored in key-value pairs where each key identifies the value it is holding (figure 5 below). Examples of some key-value store databases are Clusterpoint Database Server, Apache Ignite and Redis.

<figure style="text-align: center">
<img src="../images/kv_db.png" class="center">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 5: Pictorial representation of a key-value database.</figcaption>
</figure>

**Examples**

![image.png](attachment:image.png)

![image.png](attachment:image.png)
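
To get a feel for the key-value model, a plain Python dictionary can stand in for the store. This is a sketch of the idea only, not of how any particular product works; the `put`/`get` helpers and the key naming scheme are assumptions made for illustration.

```python
# A plain dict stands in for the key-value store (hypothetical keys and values).
store = {}

def put(key, value):
    """Save a value under a key, overwriting any previous value."""
    store[key] = value

def get(key, default=None):
    """Look a value up by its key; return `default` if the key is absent."""
    return store.get(key, default)

put("user:1001:name", "Alice")
put("user:1001:email", "alice@example.com")

name = get("user:1001:name")
missing = get("user:9999:name")
print(name, missing)
```

The store knows nothing about the structure of the values; every lookup goes through the key.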

**2. Document-oriented Database** - stores data using a JSON-like document model that mirrors the objects used in application code (figure 6 below). Examples of some document-oriented databases are Clusterpoint Database Server and MongoDB.

<figure style="text-align: center">
<img src="../images/doc_db.svg" class="center" style="zoom:70%">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 6: Pictorial representation of a document-oriented database.</figcaption>
</figure>

![image.png](attachment:image.png)

![image.png](attachment:image.png)
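
The document model can be imitated with a list of Python dictionaries. The sketch below is not MongoDB's API; the `customers` collection and the `find` helper are hypothetical and only show that documents in the same collection may have different fields.

```python
import json

# Hypothetical "collection": each document is a JSON-like dict with its own fields.
customers = [
    {"_id": 1, "name": "Alice", "phones": ["555-0100", "555-0101"]},
    {"_id": 2, "name": "Bob", "address": {"city": "Athens", "country": "Greece"}},
]

def find(collection, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

matches = find(customers, name="Bob")
print(json.dumps(matches[0], indent=2))
```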

**3. Graph Database** - data is stored in a graph-like structure consisting of nodes, edges and properties. Nodes contain entities or instances of data such as a person's information. A node is similar to a row in a relational database. Edges represent the relationships between the nodes. Properties are pieces of information associated with the nodes. Refer to figure 7 below. Examples of some graph databases are Neo4j and Amazon Neptune.

<figure style="text-align: center">
<img src="../images/GraphDatabase_PropertyGraph.png" class="center" style="zoom:80%">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 7: Example of a graph database's nodes, edges and properties. From Wikimedia Commons.</figcaption>
</figure>

**Examples**  

![image.png](attachment:image.png)
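
Nodes, edges and properties can be sketched with ordinary Python containers. The people, the group and the relationship names below are invented for illustration; a real graph database would provide its own query language for this.

```python
# Nodes: node id -> properties of that node (hypothetical data).
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Group", "name": "Chess Club"},
}

# Edges: (source node id, relationship, target node id).
edges = [
    (1, "KNOWS", 2),
    (1, "MEMBER_OF", 3),
    (2, "MEMBER_OF", 3),
]

def neighbours(node_id, relationship):
    """Names of the nodes reached from `node_id` over the given relationship."""
    return [
        nodes[target]["name"]
        for source, rel, target in edges
        if source == node_id and rel == relationship
    ]

# Follow the MEMBER_OF edges backwards to list the members of node 3.
members = sorted(
    nodes[source]["name"]
    for source, rel, target in edges
    if rel == "MEMBER_OF" and target == 3
)
print(neighbours(1, "KNOWS"), members)
```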

**4. Wide-column stores** - are databases that store data in tables, rows and columns, but unlike in relational databases the names and format of the columns can vary from row to row in the same table (figure 8 below). An example is Google's Bigtable.

<figure style="text-align: center">
<img src="../images/wide_db.png" class="center">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 8: Pictorial representation of wide-column database.</figcaption>
</figure>
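
The "columns vary from row to row" property can be pictured as a dictionary of rows, where each row holds only the columns it actually has. This is a rough mental model rather than how Bigtable is implemented; the row keys and column names are made up.

```python
# Row key -> {column name: value}; each row carries its own set of columns.
table = {
    "user#1": {"name": "Alice", "email": "alice@example.com"},
    "user#2": {"name": "Bob", "phone": "555-0100", "city": "Athens"},
}

def read(row_key, column):
    """Return one cell, or None if that row has no such column."""
    return table.get(row_key, {}).get(column)

# The set of columns differs between rows of the same table.
columns_per_row = {row_key: sorted(columns) for row_key, columns in table.items()}
print(columns_per_row)
print(read("user#1", "phone"))
```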

### Cloud Databases

Cloud Databases are databases where the data is stored in a virtual environment on a cloud computing platform. These databases are optimized to run in such virtual environments and are accessed and manipulated through the various cloud computing service models (like SaaS, PaaS, IaaS, etc). Some well-known cloud platforms and services are Amazon Web Services (AWS), Microsoft Azure and Google Cloud SQL.

### Operational Databases

Operational Databases are databases that handle day-to-day transactions. They allow data to be added, changed or deleted in real time. A typical use for this type of database is to record daily bank transactions like transfers, interest payments and withdrawals.

### Object-Oriented Databases
Object-Oriented Databases are databases that store data using the object-based data model. This is the same model that object-oriented programming languages use to represent objects. An example is ZODB for Python, which stores Python objects using an extended version of Python's built-in object persistence (pickle).

### Commercial Databases

Commercial Databases are huge databases that can encompass many of the different types of databases mentioned above. The usage rights to these databases are sold to organizations for a fee, and access to them is provided through commercial links. Some examples are Oracle, SQL Server and DB2.

### Personal Databases

Personal Databases are databases that typically reside on personal computers or on computers within a small organization. They are small, easy to manage and generally used by a small group of people or a single user.

## Terminology

As mentioned in the introduction, a database is an organized collection of related data. Users use a DBMS to interact with databases and, within limits, to access and manipulate the data they contain. The following list describes some of the database and DBMS terms that are shared by relational (RDBMS) and NoSQL databases.

* **ACID** - an acronym that stands for Atomicity, Consistency, Isolation and Durability. These are the properties that a transactional DBMS is expected to guarantee. We will expand on these properties in the section "Transactional Processing".

![image.png](attachment:image.png)

* **Client/Server** - an architecture that has 2 parts, namely a client and a server. A server is a program that generally runs on a computer that has direct access to the database, and a client is a separate program that communicates with the database server through a specific protocol such as Remote Procedure Call (RPC) or a Representational State Transfer (REST) API.


* **Connection** - a means of communication between a client and a server in a DBMS. Processes can have multiple connections to one or more databases at any time.


* **Cursor** - a cursor is like an iterator in Python: it allows the traversal of records or documents in a database. Depending on the DBMS used, cursors can be returned after a connection is made or after a query has been executed.


* **Data Definition Language (DDL)** - this language is used to define the database structure. The statements associated with it deal with the creation, modification and removal of databases and the objects (such as tables) within them.


* **Data Manipulation Language (DML)** - this language is used for accessing and manipulating data in a database. The statements associated with it generally deal with user requests like retrieving data (SELECT/find), inserting data (INSERT), updating data (UPDATE), deleting records (DELETE), etc.


* **Data Control Language (DCL)** - this language is used to control access to the data stored in a database. Typical commands are GRANT, which allows specified users to perform specified tasks, and REVOKE, which removes a user's access rights to the database.


* **Primary Key** or **_id** - a column (or field) that stores unique values that help identify each record in a table or each document in a collection.


* **Transaction Control Language (TCL)** - this language is used to make sure that the changes made by DML statements are made permanent in the database and visible to other users. Typical commands are COMMIT, which saves the transaction to the database, and ROLLBACK, which restores the database to the state of the last commit.
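
Several of the terms above can be seen together in a short sketch using Python's built-in `sqlite3` module. The `account` table and its contents are hypothetical. SQLite has no user accounts, so DCL statements such as GRANT and REVOKE are not shown here.

```python
import sqlite3

# Connection: the channel between this program (the client) and the DBMS.
conn = sqlite3.connect(":memory:")

# DDL: define the structure; "id" is the primary key of the table.
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")

# DML followed by TCL: insert a row, then COMMIT to make it permanent.
conn.execute("INSERT INTO account (id, owner, balance) VALUES (1, 'Alice', 100.0)")
conn.commit()

# DML followed by TCL: change the row, then ROLLBACK to discard the change.
conn.execute("UPDATE account SET balance = 0 WHERE id = 1")
conn.rollback()

# Cursor: iterate over the rows returned by a query, like a Python iterator.
cursor = conn.execute("SELECT id, owner, balance FROM account")
rows = [row for row in cursor]
print(rows)
conn.close()
```

Because the update was rolled back, the query still returns the balance that was saved by the last commit.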

## Data Storage Format

There are several ways to store large amounts of data. Four common ways are:

**Comma Separated Values (CSV)** - CSV is one of the simplest and most popular formats for exchanging data. Each line stores a record in text format and fields are typically separated with a comma (`,`). The format of the data is generally tabular. However, the separator does not always have to be a comma: if the data itself contains commas (like `Michael Connelly, Sr`), the field has to be enclosed in double quotes or another character such as an asterisk or a tab has to be used as the separator. Do note that each record in a CSV file must have the same number of fields. A good way to test whether your CSV file is properly formatted is to open it with a spreadsheet program such as MS Excel or LibreOffice Calc; a properly formatted file should open without any issues. An example of a CSV file opened with Notepad can be seen in figure 9 below.

<figure style="text-align: center">
<img src="../images/img_csv.png" class="center">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 9: CSV file opened in Notepad in Windows.</figcaption>
</figure>
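
When a field itself contains a comma, a common convention is to wrap that field in double quotes, which Python's built-in `csv` module does automatically. The sketch below writes and re-reads two made-up rows in memory; the column names and values are assumptions for illustration.

```python
import csv
import io

# Hypothetical records; the name in the second row contains a comma.
rows = [
    ["name", "city"],
    ["Michael Connelly, Sr", "Los Angeles"],
]

# Write the rows as CSV text; fields containing commas are quoted by the writer.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Read the text back; the quoted field comes back as a single value.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```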

**Extensible Markup Language (XML)** - is a markup language that stores data in a structured, non-tabular format. Data is wrapped in custom tags, which gives XML some similarities to HTML. However, XML tags generally have descriptive names. An XML parser is needed to read XML files programmatically. An example of XML is seen below:

```xml
<?xml version="1.0"?>
<contactinfo>
    <address category="office">
        <name>Olympus Inc.</name>
        <location>587 Drive, Mount Olympus, Greece</location>
        <contact>+30 281 8154 2445</contact>
    </address>
</contactinfo>
```
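
As a sketch of what reading this XML with a parser could look like, the example below uses `xml.etree.ElementTree` from Python's standard library. The variable names are illustrative; the XML text is the sample shown above.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<contactinfo>
    <address category="office">
        <name>Olympus Inc.</name>
        <location>587 Drive, Mount Olympus, Greece</location>
        <contact>+30 281 8154 2445</contact>
    </address>
</contactinfo>"""

# Parse the text into a tree of elements; `root` is the <contactinfo> element.
root = ET.fromstring(xml_text)

# Navigate by tag name, then read an attribute and the text of a child element.
address = root.find("address")
category = address.get("category")
name = address.findtext("name")
print(root.tag, category, name)
```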

**JavaScript Object Notation (JSON)** - is an open standard file and data interchange format that is used to transfer data between programs. It uses attribute-value pairs to store and transmit data as text. Because JSON supports arrays (collections of elements) and nested objects, it is generally used for more complex structured data that does not fit in tabular formats. An example of JSON is shown below:

```json
{
    "menu":{
        "id": "file",
        "popup": {
            "menuitem": [
                {"value": "New", "onclick": "CreateNewDoc()"},
                {"value": "Open", "onclick": "OpenDoc()"},
                {"value": "Close", "onclick": "CloseDoc()"}
            ]
        }
    }
}
```
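
In Python, JSON text like the sample above can be loaded with the built-in `json` module, which turns objects into dictionaries and arrays into lists. The sketch below is one possible way to walk the nested structure; the variable names are illustrative.

```python
import json

json_text = """
{
    "menu": {
        "id": "file",
        "popup": {
            "menuitem": [
                {"value": "New", "onclick": "CreateNewDoc()"},
                {"value": "Open", "onclick": "OpenDoc()"},
                {"value": "Close", "onclick": "CloseDoc()"}
            ]
        }
    }
}
"""

# Parse the text: JSON objects become dicts and JSON arrays become lists.
data = json.loads(json_text)

# Walk the nested structure to collect the "value" of every menu item.
values = [item["value"] for item in data["menu"]["popup"]["menuitem"]]
print(values)

# Serialising and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(data))
```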

**Databases** - are collections of data organized in some format. The data stored in databases is persistent (i.e. not temporary). A database management system is software used to manage the interactions between users, other software applications and the database. The 2 mainstream types of databases are relational databases, which store data in tabular formats and use SQL, and NoSQL databases, which store data in non-tabular formats and use their own query languages or APIs.

## Data Acquisition

Data acquisition is the process of acquiring, filtering and cleaning data obtained from various sources before the data is placed into a data warehouse or some other form of storage medium. The crux of gathering data can be summed up by these questions:

* Where can I get my data from?
* Does the data require authorization to acquire? If yes, do I have the authorization, or do I know someone who has the authorization to get the data?
* How can I get the data (after the first 2 questions have been answered)?

Sources of data can be varied, ranging from proprietary data from companies and academic data like experiments carried out for medical research to data scraped from websites or even logging data like the browsing habits of customers of an online shop. Data gathered during this process is generally termed "messy" data, as it could contain errors and/or missing values. However, the goal of data acquisition is to obtain raw data for processing into reports that you or organizations can base decisions upon.

## Data Preprocessing

After gathering the data, we need to clean the data before the data can be processed. This step is required to ensure that the data used for processing (in a later step) is free from errors and is valid for the processing tasks to be carried out on it. There are several things to look out for when cleaning raw data. In this section we will look at some of the ways to identify then clean the data.

Generally, the first few things that we would check for after we receive a dataset that was gathered either by ourselves or externally, would be the data's validity and integrity. The terms *Data validity* and *Data Integrity* are very broad and highly dependent on context and it can also be said that data validity is a prerequisite for data integrity. 

With respect to databases, data integrity means the overall completeness, accuracy and consistency of the data, and data validity means that the data has been checked against a strict set of rules to ensure that it is correct and useful. The rules that govern data integrity and validity overlap to some extent, as both are needed to ensure quality data. An example of a rule for checking the validity of data would be the use of *Regular Expressions* to check data fields such as phone numbers and email addresses. Once the data is deemed valid, its integrity can be checked using various business rules.
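
A validity check with regular expressions might look like the sketch below. The two patterns are deliberately simple assumptions made for illustration; real-world phone and email formats are more varied, so production rules are usually stricter or rely on dedicated libraries.

```python
import re

# Deliberately simple, hypothetical rules (not a complete definition of either format).
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"\+?\d[\d \-]{6,14}\d")

def is_valid_email(value):
    """True if the whole string looks like name@domain.tld."""
    return EMAIL_RE.fullmatch(value) is not None

def is_valid_phone(value):
    """True if the whole string is digits, spaces or dashes with an optional leading +."""
    return PHONE_RE.fullmatch(value) is not None

for candidate in ["alice@example.com", "alice@example"]:
    print(candidate, is_valid_email(candidate))
for candidate in ["+30 281 8154 2445", "12ab34"]:
    print(candidate, is_valid_phone(candidate))
```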

Let's take a look at how we can identify some of the things that can go wrong during and after the data has been gathered.

* **Missing Data** - Within a database, this is normally denoted by the keyword **`NULL`** in the field/s of a record. Missing data generally happens during the data gathering phase when there is no data for certain fields during certain situations like when information is deliberately left out during a survey or the data was not originally available during gathering.


* **Duplicated Data** - happens when records inadvertently share data with another record in the same database or in another database. This can happen during any type of data movement between systems (e.g. data migration). The easiest type of duplication to detect is an exact carbon copy of an entire record; the most harmful and common type is **partial** duplication, where a record could be missing data in some fields. Such errors are most often caused by human error, especially when data is input by hand.


* **Error Correction** - errors can happen even in the absence of missing or duplicated data. For instance, names of people take many forms: Western names often comprise first, middle and last names, while many Asian names have only first and last (family) names. An error would occur if a table in the database did not account for this variation. Such errors are generally corrected by manually identifying them first and then developing scripts to rectify them.
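
The first two problems in the list above can be detected with a few lines of Python. The records below are invented, and matching duplicates on name and email is an assumption made for this sketch; in practice the matching rule depends on the data.

```python
# Hypothetical records; None plays the role of NULL.
records = [
    {"id": 1, "name": "Alice Tan", "email": "alice@example.com"},
    {"id": 2, "name": "Bob Lee", "email": None},
    {"id": 3, "name": "Alice Tan", "email": "alice@example.com"},
    {"id": 4, "name": "alice tan", "email": None},
]

# Missing data: records whose email field is empty.
missing_email = [r["id"] for r in records if r["email"] is None]

# Exact duplicates: same name and email as a record seen earlier.
seen = {}
exact_duplicates = []
for r in records:
    key = (r["name"], r["email"])
    if key in seen:
        exact_duplicates.append((seen[key], r["id"]))
    else:
        seen[key] = r["id"]

# Possible partial duplicates: same name once case and spacing are normalised.
by_name = {}
for r in records:
    by_name.setdefault(r["name"].strip().lower(), []).append(r["id"])
possible_duplicates = {name: ids for name, ids in by_name.items() if len(ids) > 1}

print(missing_email, exact_duplicates, possible_duplicates)
```

Record 4 is only flagged as a possible duplicate: deciding whether it really is the same person as records 1 and 3 still needs a rule or a manual check.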

Once all the errors and the missing or duplicated data have been identified, data cleansing can begin. Data cleansing is the process of detecting and correcting corrupted or inaccurate records in a table or database by replacing, modifying or deleting the coarse or dirty data. This process can be done manually through interactive data wrangling tools or through batch processing using scripts. An example of correct but inconsistent data can be seen in figure 10 below. A table can have a column that stores *Gender* information but the values are not consistent. A script would be developed to replace the inconsistent values based on some rules, so that all the values in the column follow a consistent format defined for the database.

<figure style="text-align: center">
<img src="../images/data_cleanse.png" class="center">
<figcaption style="font-size:90%; font-weight: 550;">
Figure 10: Column <i>Gender</i> before and after data cleansing.</figcaption>
</figure>
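
A cleansing script for such a column could be as small as the sketch below. The raw values, the target codes `M`/`F` and the mapping are assumptions made for illustration and are not taken from figure 10.

```python
# Hypothetical rules: every known spelling is mapped to one agreed code.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def clean_gender(value):
    """Return the agreed code, or None when the value is missing or unrecognised."""
    if value is None:
        return None
    return GENDER_MAP.get(value.strip().lower())

raw = ["Male", "m", "F", " female ", "MALE", "unknown", None]
cleaned = [clean_gender(value) for value in raw]
print(cleaned)
```

Values that the rules do not recognise are turned into `None` here so that they can be reviewed later instead of being guessed.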

Data cleansing differs from data validation in that validation is done before data is entered into the database, whereas data cleansing is done on data that is already in a database.

To ensure that the data quality is high, it needs to have certain qualities:

* **Validity** - as described earlier, data validity is the property whereby the data conforms to the business rules or constraints defined for the data gathering process.
* **Accurate** - accuracy of data is generally hard to achieve as it involves comparing the data that you have with an external source that contains the "true values". An example would be a table of customers' addresses with postal codes, where you may need to compare the table with an external source to make sure that the postal codes are correct.
* **Complete** - refers to how well the data meets the expectations for the task. Data can still contain missing values so long as those fields are optional. However, if required data is missing, it is almost impossible to fix without going back to the source to get it.
* **Consistent** - refers to how consistent the data is across a system or database. As mentioned earlier, data can be inconsistent, as in the *Gender* column of the above example. Having 2 records of the same customer with different addresses on 2 different systems is also deemed inconsistent, as the records contradict each other if both claim to be the current address. We therefore have to use various strategies to decide which record is the most up to date.
* **Uniform** - refers to how well the data adheres to the defined units of measurement, such as a particular currency for monetary data.