# Module 3 Part 1: NoSQL

# Introduction

This module introduces the key concepts and principles of non-relational datastores for big data and MongoDB as an example of a document-oriented NoSQL database management system.

This module consists of 2 parts:

- **Part 1** - NoSQL
- **Part 2** - Introduction to MongoDB

Each part is provided in a separate notebook file. It is recommended that you follow the order of the notebooks.

# Learning Outcomes

In this module, you will:
* Explore different paradigms of modelling real world data  
* Understand the benefits of storing data using different types of datastores 
* Determine which stores are best suited for certain applications
* Learn how to structure and query MongoDB databases.

# Readings and Resources

We invite you to further supplement this notebook with the following recommended texts.


- Chodorow, K. and Bradshaw, S. (2018). MongoDB: The Definitive Guide, 3rd Edition. O’Reilly: Boston. http://shop.oreilly.com/product/0636920049531.do 


- Kleppmann, M. (2017). Chapter 2: Data Models and Query Languages; Chapter 3: Storage and Retrieval in Designing Data-Intensive Applications. O’Reilly: Boston. http://shop.oreilly.com/product/0636920032175.do
 

- Segaran, T., Evans, C., and Taylor, J. (2009).  Chapter 4: Just Enough RDF, Chapter 5: Sources of Semantic Data, and Chapter 6: What Do You Mean, “Ontology”? in Programming the Semantic Web. Build Flexible Applications with Graph Data. O’Reilly: Boston.  http://shop.oreilly.com/product/9780596153823.do


- Webber, J., Robinson, I., and Eifrem, E. (2013). Graph Databases.  O’Reilly: Boston. http://shop.oreilly.com/product/0636920028246.do


<h1>Table of Contents<span class="tocSkip"></span></h1>
<br>
<div class="toc">
<ul class="toc-item">
<li><span><a href="#Module-3-Part-1:-NoSQL" data-toc-modified-id="Module-3-Part-1:-NoSQL">Module 3 Part 1: NoSQL</a></span>
</li>
<li><span><a href="#Introduction" data-toc-modified-id="Introduction">Introduction</a></span>
</li>
<li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes">Learning Outcomes</a></span>
</li>
<li><span><a href="#Readings-and-Resources" data-toc-modified-id="Readings-and-Resources">Readings and Resources</a></span>
</li>
<li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents">Table of Contents</a></span>
</li>
<li><span><a href="#NoSQL" data-toc-modified-id="NoSQL">NoSQL</a></span>
</li>
<li><span><a href="#Column-Oriented-DataStores" data-toc-modified-id="Column-Oriented-DataStores">Column-Oriented DataStores</a></span>
<ul class="toc-item">
<li><span><a href="#Data-Organization-in-a-Column-Oriented-Datastore" data-toc-modified-id="Data-Organization-in-a-Column-Oriented-Datastore">Data Organization in a Column-Oriented Datastore</a></span>
</li>
<li><span><a href="#Compression" data-toc-modified-id="Compression">Compression</a></span>
</li>
</ul>
</li>
<li><span><a href="#Document-Oriented-Datastores" data-toc-modified-id="Document-Oriented-Datastores">Document-Oriented Datastores</a></span>
<ul class="toc-item">
<li><span><a href="#Consistency-and-Availability-in-Document-Oriented-Datastores" data-toc-modified-id="Consistency-and-Availability-in-Document-Oriented-Datastores">Consistency and Availability in Document-Oriented Datastores</a></span>
</li>
</ul>
</li>
<li><span><a href="#Object-Stores" data-toc-modified-id="Object-Stores">Object Stores</a></span>
</li>
<li><span><a href="#Graph-Datastores" data-toc-modified-id="Graph-Datastores">Graph Datastores</a></span>
</li>
<li><span><a href="#Triple-Stores" data-toc-modified-id="Triple-Stores">Triple Stores</a></span>
</li>
<li><span><a href="#Selecting-a-Database-Type" data-toc-modified-id="Selecting-a-Database-Type">Selecting a Database Type</a></span>
<ul class="toc-item">
<li><span><a href="#Is-the-Data-Structured,-Unstructured,-or-Semi-Structured?" data-toc-modified-id="Is-the-Data-Structured,-Unstructured,-or-Semi-Structured?">Is the Data Structured, Unstructured, or Semi-Structured?</a></span>
</li>
<li><span><a href="#Which-model-best-describes-your-Data?" data-toc-modified-id="Which-model-best-describes-your-Data?">Which model best describes your Data?</a></span>
</li>
<li><span><a href="#What-is-your-Availability-and-Consistency-Model?" data-toc-modified-id="What-is-your-Availability-and-Consistency-Model?">What is your Availability and Consistency Model?</a></span>
</li>
</ul>
</li>
<li><span><a href="#References" data-toc-modified-id="References">References</a></span>
</li>
</ul>
</div>

# NoSQL

Information storage technologies are continuously improving.  Storage has steadily become cheaper, smaller and faster.  
This has enabled the rise of new technologies such as IoT (Internet of Things), AI, computer vision, social networking and blockchain, just to name a few.  

Every aspect of our daily lives is being recorded somewhere as we interact with personal assistants, update our social media, buy products on the Internet, or watch streaming video.   Furthermore, with the proliferation of the Internet, data is accessible to millions of users around the world.

These new applications often require processing large volumes of data rapidly and making the data available at Internet scale in a reliable manner.  Unfortunately, traditional models such as the relational model are often poorly suited to these new applications because they can be too slow at this scale, particularly at loading data.   This has compelled the industry to come up with alternative database models tuned for high volume processing.  These new non-relational database management systems are often collectively referred to as **NoSQL**.  The name reflects the fact that what these new designs had in common, at least originally, was that they did not support SQL.  Many of them do now support at least subsets of SQL.  However, SQL was designed around the relational model, so a datastore built on a different model is unlikely to support SQL queries ideally.  These new databases trade off full support for SQL for other desirable properties that relational databases struggle to provide.


So NoSQL is a broad term to describe a database system that does not conform to the relational model.  Many different databases can be considered NoSQL databases.  Generally speaking, a NoSQL database system has one or more of the following characteristics.
* Non-Relational
* Schemaless (does not require a fixed structure to be defined before any data can be loaded)
* Horizontally Scalable (designed for parallel processing of large loads using potentially many servers)
* Does not support full SQL
* Loosely adheres to ACID properties or not at all

We will explore some of the main categories of NoSQL databases that have emerged since the invention of relational databases:
* Column-Oriented Stores
* Document-Oriented Stores
* Object Stores
* Graph Stores
* Triple Stores

# Column-Oriented DataStores

Column-oriented stores rose to prominence at Internet scale through Google's Bigtable, which was developed to address the need to store and index the vast amount of data that exists on the Internet.  It's hard to even imagine the scale of storage and processing required to index that much data and serve billions of queries each day.  

## Data Organization in a Column-Oriented Datastore

Column-oriented stores arrange data by column instead of rows.  Whereas relational databases physically store the contents of a row close together on a disk or SSD so they can be retrieved quickly as a unit, column-oriented stores keep the content of a column close together on the storage medium.  Consider the following table that represents a list of people.  Each row represents a person:

| ID | First Name | Last Name | Gender | Job Title | Department |
|----|----|----|----|----|----|
| 1 | Peggy | Wotring | F | Manager | Marketing |
| 2 | Cleo | Holliman | M | Manager | Finance |
| 3 | Malisa | Ide | F | Engineer | IT |
| 4 | Rosemarie | Vigo | F | Manager | Marketing |
| 5 | Dewey | Fenton | M | Manager | Finance |

We can change the table by rearranging the data by column.  Notice each column represents a person while each row represents an attribute.

| 1 | 2 | 3 | 4 | 5 |
|-----|-----|-----|-----|-----|
| Peggy | Cleo | Malisa | Rosemarie | Dewey |
| Wotring | Holliman | Ide | Vigo | Fenton |
| F | M | F | F | M |
| Manager | Manager | Engineer | Manager | Manager |
| Marketing | Finance | IT | Marketing | Finance |


Column-oriented stores take advantage of this arrangement and separate each column into its own partition, as in the following figure.   A partition is a physically separate area of disk storage.  Splitting a large dataset this way lets it be distributed across many disks, which in turn allows the query load to be distributed across many processors.  It also fits well with analytics.  When we do business transactions, we usually operate on rows (e.g. adding a new customer to a table).  When we do analytics, we usually want to retrieve all or part of the contents of one or more columns (e.g. what is the average price of the popcorn sold last month?), and we don't care about the other details of each individual business transaction (who bought it, what time, etc.).

Note that the IDs reference the position of each attribute.  Thus, all attributes associated with ID 1 relate to “Peggy Wotring”.


<table width="100%" bgcolor="#FFFFFF"><tr><td></td><td></td><td></td><td></td><td></td></tr><tr>
    <td>       
        <table border="1" bgcolor="#FFFFFF">
            <td colspan=2><b>First Name</b></td>
            <tr><td>1</td><td>Peggy</td></tr>
            <tr><td>2</td><td>Cleo</td></tr>
            <tr><td>3</td><td>Malisa</td></tr>
            <tr><td>4</td><td>Rosemarie</td></tr>
            <tr><td>5</td><td>Dewey</td></tr>
        </table>
    </td>
    <td>
        <table>
            <td colspan=2><b>Last Name</b></td>
            <tr><td>1</td><td>Wotring</td></tr>
            <tr><td>2</td><td>Holliman</td></tr>
            <tr><td>3</td><td>Ide</td></tr>
            <tr><td>4</td><td>Vigo</td></tr>
            <tr><td>5</td><td>Fenton</td></tr>
        </table>
    </td>
    <td>
        <table>
            <td colspan=2><b>Gender</b></td>
            <tr><td>1</td><td>F</td></tr>
            <tr><td>2</td><td>M</td></tr>
            <tr><td>3</td><td>F</td></tr>
            <tr><td>4</td><td>F</td></tr>
            <tr><td>5</td><td>M</td></tr>
        </table>
    </td>
    <td>
        <table>
            <td colspan=2><b>Job Title</b></td>
            <tr><td>1</td><td>Manager</td></tr>
            <tr><td>2</td><td>Manager</td></tr>
            <tr><td>3</td><td>Engineer</td></tr>
            <tr><td>4</td><td>Manager</td></tr>
            <tr><td>5</td><td>Manager</td></tr>
        </table>
    </td>
    <td>
        <table>
            <td colspan=2><b>Department</b></td>
            <tr><td>1</td><td>Marketing</td></tr>
            <tr><td>2</td><td>Finance</td></tr>
            <tr><td>3</td><td>IT</td></tr>
            <tr><td>4</td><td>Marketing</td></tr>
            <tr><td>5</td><td>Finance</td></tr>
        </table>
    </td>
    </tr>
</table>
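
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration, not how a real column store lays out bytes on disk: plain lists stand in for partitions, and the names `rows`, `columns` and `record_at` are made up for this example.

```python
# Row-oriented layout: each record keeps all of its attributes together.
rows = [
    {"id": 1, "first_name": "Peggy", "last_name": "Wotring", "gender": "F",
     "job_title": "Manager", "department": "Marketing"},
    {"id": 2, "first_name": "Cleo", "last_name": "Holliman", "gender": "M",
     "job_title": "Manager", "department": "Finance"},
    {"id": 3, "first_name": "Malisa", "last_name": "Ide", "gender": "F",
     "job_title": "Engineer", "department": "IT"},
    {"id": 4, "first_name": "Rosemarie", "last_name": "Vigo", "gender": "F",
     "job_title": "Manager", "department": "Marketing"},
    {"id": 5, "first_name": "Dewey", "last_name": "Fenton", "gender": "M",
     "job_title": "Manager", "department": "Finance"},
]

# Column-oriented layout: one list (standing in for a partition) per attribute.
# Position i in every list belongs to the same person.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query on one attribute touches only that attribute's partition.
marketing_count = columns["department"].count("Marketing")


def record_at(position):
    """Rebuild a full record by reading the same position from every partition."""
    return {key: values[position] for key, values in columns.items()}


print(marketing_count)
print(record_at(0))
```

Counting the Marketing employees reads a single list, while rebuilding one person's full record has to visit every list. That trade-off is why this layout suits analytics better than row-at-a-time transactions.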

## Compression

Notice how much duplication there is within each column.  In our example, we can observe that there seems to be a small set of distinct values for Gender, Job Title and Department.  Imagine storing millions of people in our datastore.  The department partition will contain millions of entries with either ‘Marketing’, ‘IT’ or ‘Finance’.  We can use a compression strategy to reduce the storage required by a partition that has a lot of duplicate entries.  One such compression algorithm is **bitmap encoding**.  Consider the following column contents for Department with 10 entries.

| Position | Department |
|---|-----------|
| 1 | Marketing |
| 2 | Marketing |
| 3 | Marketing |
| 4 | IT |
| 5 | Marketing |
| 6 | Finance |
| 7 | Finance |
| 8 | Marketing |
| 9 | IT |
| 10 | IT |

We can encode the position of each distinct value of the department attribute as follows.

![ColumnOreientedColumnEncoding.png](attachment:ColumnOreientedColumnEncoding.png)

We encode a 1 when the value appears in a position; otherwise we encode a 0.  Notice the “Marketing” value is associated with positions 1, 2, 3, 5, and 8.  We place a 1 in those positions and 0 in the others.  We can further compress this information by recording the lengths of the consecutive runs of zeros and ones as follows:

| Attribute Value | Encoding | Sequence |
|-------|-------|--------|
| Marketing | 0,3,1,1,2,1,2 | 0 zeros, 3 ones, 1 zero, 1 one, 2 zeros, 1 one, 2 zeros |
| IT | 3,1,4,2 | 3 zeros, 1 one, 4 zeros, 2 ones |
| Finance | 5,2,3 | 5 zeros, 2 ones, 3 zeros |
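
The two steps above (one bitmap per distinct value, then run lengths) can be sketched in Python as follows. The function names are illustrative and the convention of always starting with the count of zeros is taken from the table; production column stores use more elaborate encodings.

```python
department = ["Marketing", "Marketing", "Marketing", "IT", "Marketing",
              "Finance", "Finance", "Marketing", "IT", "IT"]


def bitmap_encode(column):
    """Return one bitmap (a list of 0/1 flags) per distinct value in the column."""
    return {value: [1 if entry == value else 0 for entry in column]
            for value in dict.fromkeys(column)}


def run_lengths(bits):
    """Encode a bitmap as alternating run lengths, starting with the zeros."""
    runs = []
    expected = 0  # the first count always refers to zeros, even if it is 0
    count = 0
    for bit in bits:
        if bit == expected:
            count += 1
        else:
            runs.append(count)
            expected = bit
            count = 1
    runs.append(count)
    return runs


def decode(runs):
    """Rebuild the bitmap: even-indexed runs are zeros, odd-indexed runs are ones."""
    bits = []
    for index, count in enumerate(runs):
        bits.extend([index % 2] * count)
    return bits


bitmaps = bitmap_encode(department)
encoded = {value: run_lengths(bits) for value, bits in bitmaps.items()}
for value, runs in encoded.items():
    print(value, runs)
```

Decoding the run lengths gives back the original bitmap, so the encoding is lossless. How much space it saves depends on how long the runs are.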

Notice that the 10 records are compressed to 3 short run-length sequences, one per distinct value.  We can do the same when dealing with millions of records: the column still reduces to one sequence per distinct value, and each sequence stays short when equal values tend to appear in long runs (for example, when the column is sorted).  Imagine how much disk access a query would save by reading this encoding rather than scanning millions of uncompressed records.  Compression allows column-oriented databases to achieve high performance for column-oriented analytics queries when storing millions or even billions of records.  A row-oriented relational database would typically need to read every full row in the table to answer the same query, unless a suitable index exists.

Consider the following queries that work off our example:
* How many females are there?
* How many managers are there?
* How many female managers are working in the Marketing department?

Notice that finding the number of females only requires reading the Gender partition.  The query doesn’t need to load the other partitions.  
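
As a rough sketch, the three queries can be answered from the five-person example using bitmaps. Each list below stands in for one partition, and the helper name `bitmap` is an assumption made for this illustration.

```python
# Each list stands in for one partition of the five-person example.
gender = ["F", "M", "F", "F", "M"]
job_title = ["Manager", "Manager", "Engineer", "Manager", "Manager"]
department = ["Marketing", "Finance", "IT", "Marketing", "Finance"]


def bitmap(column, value):
    """Flag every position in the column that holds the given value."""
    return [1 if entry == value else 0 for entry in column]


# Single-attribute queries read exactly one partition.
female_count = sum(bitmap(gender, "F"))
manager_count = sum(bitmap(job_title, "Manager"))

# A multi-attribute query combines the bitmaps position by position with AND.
female_marketing_managers = sum(
    f & m & d
    for f, m, d in zip(bitmap(gender, "F"),
                       bitmap(job_title, "Manager"),
                       bitmap(department, "Marketing"))
)

print(female_count, manager_count, female_marketing_managers)
```

The last query reads three partitions but never touches the first-name or last-name data.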

The following are some of the most popular implementations of column-oriented databases:
* Bigtable
* HBase
* Cassandra

# Document-Oriented Datastores

Document-oriented datastores represent **semi-structured** data such as PDF, XML, JSON or Microsoft Word.  The unit of storage, rather than being a tuple (row) as in a relational database, is a blob of bytes (usually text) called a **document**. Nowadays the most popular way to give each document its semi-structure is JSON.  Most document-oriented databases can use the key/value pairs in JSON to add features such as indexing, so that all documents containing a particular key can be found efficiently.  Consider the following JSON representation of a Person with four key/value pairs:

```JSON
{
    "firstname": "Blake",
    "lastname": "Jackson",
    "age": "23",
    "sports": ["hockey", "baseball", "soccer"]
}
```
 
Notice that the sports attribute is a multi-valued attribute that is represented by an array.  In a relational database this would require a separate table to store the sports.  Documents can also have an embedded document to model even richer data.  This is shown in the following JSON example that highlights an embedded document.  

<font face="courier">&nbsp;&nbsp;&nbsp;&nbsp;{<br></font>
<font face="courier">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"firstname": "Blake",<br></font>
<font face="courier">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"lastname": "Jackson",<br></font>
<font face="courier">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"age": "23",<br></font>
<font face="courier">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"sports": ["hockey", "baseball", "soccer"],<br></font>
<font face="courier" color='red'><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"address": {<br></b></font>
<font face="courier" color='red'><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"street_no": 232,<br></b></font>
<font face="courier" color='red'><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"street": "Cardinal St.",<br></b></font>
<font face="courier" color='red'><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"city": "Toronto",<br></b></font>
<font face="courier" color='red'><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"province": "ON",<br></b></font>
<font face="courier" color='red'><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"postal_code": "M2F8E2"<br></b></font>
<font face="courier" color='red'><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br></b></font>
<font face="courier">&nbsp;&nbsp;&nbsp;&nbsp;}<br></font>


Document-oriented databases organize documents in collections.  Collections are analogous to tables in a relational database in the sense that they provide a way of grouping documents together.  However, collections can hold documents that have different structures, which is not possible within a single relational table.    This is very beneficial when your data doesn’t have a uniform structure.  Consider that addresses can have different structures depending on whether people live in an apartment or not, or live in different countries.  Apartments will have a suite number, while addresses in Canada have postal codes and addresses in the U.S. have ZIP codes.  A single collection can contain JSON documents with these different structures.
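
To make the idea of a collection with mixed structures concrete, here is a minimal sketch that uses a Python list of dictionaries as a stand-in for a collection. The second and third documents, the `_id` key and the helper `ids_with_address_key` are invented for illustration and do not come from any particular database.

```python
# A list of dictionaries standing in for a collection.
# Documents 2 and 3 are invented; "_id" is an assumed name for the document ID.
collection = [
    {"_id": 1, "firstname": "Blake", "lastname": "Jackson",
     "address": {"street_no": 232, "street": "Cardinal St.",
                 "city": "Toronto", "province": "ON", "postal_code": "M2F8E2"}},
    {"_id": 2, "firstname": "Dana", "lastname": "Reyes",
     "address": {"suite": "4B", "street_no": 17, "street": "Elm Ave.",
                 "city": "Buffalo", "state": "NY", "zip_code": "14201"}},
    {"_id": 3, "firstname": "Sam", "lastname": "Ito"},  # no address at all
]


def ids_with_address_key(documents, key):
    """Return the IDs of the documents whose address contains the given key."""
    return [doc["_id"] for doc in documents if key in doc.get("address", {})]


print(ids_with_address_key(collection, "postal_code"))
print(ids_with_address_key(collection, "zip_code"))
print(ids_with_address_key(collection, "street"))
```

No schema change was needed to store a Canadian address, a U.S. apartment address and a document with no address side by side. The query code simply has to tolerate missing keys.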

![jsonimage.png](attachment:jsonimage.png)

Every document has a document ID that uniquely identifies the document within the collection.  Document-oriented stores will create an index on the document ID to speed up document retrieval.  The following table compares the terminology of the relational model with that of document-oriented datastores.

| Relational Model | Document-oriented Datastore |
|------------------|-----------------------------|
| Table | Collection |
| Row | Document |
| Column | Key |
| Primary Key | Document ID |

Document-oriented datastores are also flexible when defining relationships between documents.  Documents can either embed another document or reference an external document by their document ID. 

![DocumentOrientedJSON4.png](attachment:DocumentOrientedJSON4.png)
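
Embedding and referencing can be contrasted with a small sketch. The dictionaries keyed by document ID play the role of the ID index described above; the IDs, the second person and the `address_id` key are hypothetical names chosen for this example.

```python
# Two hypothetical collections, each keyed by document ID.
addresses = {
    "a1": {"street_no": 232, "street": "Cardinal St.", "city": "Toronto"},
}
people = {
    # Embedded: the address lives inside the person document.
    "p1": {"firstname": "Blake",
           "address": {"street_no": 232, "street": "Cardinal St.", "city": "Toronto"}},
    # Referenced: the person document stores only the ID of an address document.
    "p2": {"firstname": "Robin", "address_id": "a1"},
}


def resolve_address(person):
    """Return the address, following a reference when it is not embedded."""
    if "address" in person:
        return person["address"]
    return addresses[person["address_id"]]  # a second lookup by document ID


print(resolve_address(people["p1"])["city"])
print(resolve_address(people["p2"])["city"])
```

Embedding returns everything in one read but duplicates the address if several people share it. Referencing avoids the duplication at the cost of a second lookup.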

## Consistency and Availability in Document-Oriented Datastores

Document-oriented datastores are designed to scale.  This is achieved by replicating data across multiple server nodes.  If one server goes down, requests to access the data can be routed to another that has a copy of the data, making the overall system fault tolerant.  This has a downside, however.  If new data is being stored, it would be impractical to try to lock all of the replicas for update simultaneously.  This would kill the performance gains we're looking to achieve in order to serve Internet-scale applications.  Many document-oriented datastores therefore adopt the **eventual consistency model**, where a change made in one of the replicas is propagated to the other copies across the system after a short but non-zero delay.  As a result, we can say that, compared with relational databases, document-oriented datastores tend to be more focused on making the data available and less focused on making the data consistent.  We will come back to these concepts and principles in more detail later in this course.
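
The following toy simulation illustrates what eventual consistency means for a reader. It is a deliberately simplified sketch: the class `EventuallyConsistentStore` and its methods are invented for this example, replication is triggered by hand rather than by a background process, and real datastores also handle conflicts and failures.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on one replica and reach the others later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # queued (replica_index, key, value) updates

    def write(self, key, value):
        # The write is acknowledged as soon as the first replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def replicate(self):
        # Stands in for the background process that propagates changes.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore(replica_count=3)
store.write("status", "hello")
stale = store.read("status", replica_index=2)   # replica 2 has not caught up yet
store.replicate()
fresh = store.read("status", replica_index=2)   # now every replica agrees
print(stale, fresh)
```

Between the write and the replication step, a reader routed to replica 2 sees nothing. That window is the price paid for not locking every replica on each write.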

There are many different implementations of the document-oriented datastore concept.  Here are some examples:
* MongoDB
* Cloud Firestore
* Apache CouchDB

# Object Stores

Many modern applications are developed using object-oriented programming languages.  In order to store data, an application must transform its in-memory object-oriented model into something that the datastore can store.  For example, to store the data in a relational database, the application must translate the object model to relational tables consisting of rows and columns.  Similarly, to store the data in a document store, the application must convert the object-oriented model to documents (usually JSON).  Converting the data from the object-oriented model to the form the datastore requires (relational, document-oriented, column-oriented, etc.) adds processing work to the application.  Furthermore, concepts that are represented in one model may not have an equivalent representation at all in another model.  As a result, the translation process can be difficult, and in some cases, information could be lost when transforming data from one model to another.  For example, there is no direct equivalent of object-oriented concepts such as inheritance, encapsulation and polymorphism in a relational model. This disconnect between models is sometimes called the **impedance mismatch** (an electrical engineering reference to electronic circuits that aren't very compatible with each other).
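
A small sketch shows where information can get lost. The `Person` and `Manager` classes are hypothetical, and flattening with `dataclasses.asdict` stands in for whatever translation layer an application might use.

```python
from dataclasses import dataclass, asdict


@dataclass
class Person:
    first_name: str
    last_name: str


@dataclass
class Manager(Person):  # inheritance: every Manager is also a Person
    department: str

    def title(self):  # behaviour that lives with the data
        return f"Manager of {self.department}"


manager = Manager("Peggy", "Wotring", "Marketing")

# Flattening the object into a row keeps the field values only.
row = asdict(manager)
print(row)

# Nothing in the row records that it came from a Manager, that Manager
# extends Person, or that a title() method exists. Rebuilding the right
# object requires extra bookkeeping, such as a hand-written type column.
restored = Person(row["first_name"], row["last_name"])
print(isinstance(manager, Manager), isinstance(restored, Manager))
```

Object-relational mapping tools automate this bookkeeping, but the translation work itself does not go away.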

Object databases address the impedance mismatch conundrum by supporting concepts such as object identity, classes, inheritance, methods, encapsulation and extensibility (For more information on object-oriented concepts see https://en.wikipedia.org/wiki/Object-oriented_programming).  This allows object databases to have the following benefits over relational databases when being used with object-oriented programs:

- **Complex Modelling** – Object databases are able to model more complicated data compared to other types of stores


- **Higher Performance** – Since the models are consistent there is no need for translations between the application and datastore, which is more efficient


- **Maintainability** - Applications using object databases are easier to write and maintain because the application model and storage model are consistent

In spite of these benefits, object databases have not replaced relational databases as the datastore of choice.  Part of the reason for the slow adoption is the high cost of migrating data from a relational database to an object database.  Furthermore, there have been advances in object-relational mapping tools and techniques for all major object-oriented languages.  Object databases tend to be used to complement relational databases or for niche applications such as in embedded devices and real-time systems.

Some of the implementations of object databases are as follows:

* Db4o
* ObjectDatabase++
* ObjectDB
* Objectivity/DB
* ObjectStore
* Caché

For further reading, see: https://en.wikipedia.org/wiki/Object_database

# Graph Datastores
Graph datastores are another type of NoSQL datastore, developed to address the limitations of the relational model when modelling richly inter-related data.  Relational theory is based on a tacit assumption that the contents of the tuples are important, but the kinds of relationships that you want to model will be few and well-structured. Graph datastores invert this emphasis and are based on concepts from graph theory in mathematics, where entities are modelled as vertices and relationships as edges.  Graph datastores typically also allow vertices and/or edges to have flexible properties.

* **Vertices** represent entities that contain a set of properties.  Vertices are comparable to rows in relational databases.  Vertices are sometimes also called **nodes** (this is a different use of the word *node* than when it is used to mean a server in a cluster of servers).


* **Edges** represent relationships between vertices that can be either directed or undirected (bidirectional).  Edges can often contain a set of properties that describe the relationship.  Edges are also sometimes called **arcs**.


* **Properties** describe the attributes of a vertex or edge.

Graph datastores are a natural fit for representing social graphs.  Let’s look at representing a social graph for a person using vertices, edges and properties.

![SocialGraph.png](attachment:SocialGraph.png)

The above diagram shows that Bob Smith has a brother and was a student at the University of Waterloo from Sept 2011 to 2015.   He also worked for company ABC Inc. from Jun 2017 to Jan 2018.  Graph databases can easily model data that has a large number of relationships, even if the nature of each relationship is different (as described by its properties).  Graph datastores are well-suited for:
* Object-oriented models 
* Complexly-related data such as social networks
* Modeling computer system networks
* Modeling e-commerce data to model relationships between consumers and products 
* Graph-based machine learning systems such as Google’s graph-based machine learning platform

These applications make use of the following benefits of graph databases:

* **Vertices-Edge Philosophy** - It is easy to model data where the structure of properties and relationships is complex rather than following a rigid pre-defined schema as is the case for relational stores


* **Flexibility** – Nodes and relationships can be deleted or added to the graph model without compromising the existing data.  In a relational database the data would have to undergo a migration to support a new relationship between entities.  This would at least involve the tables being locked for the time required to do the update, but could also require a full database dump, schema update and data reload cycle.


* **Fast Relationship Queries** – Queries that depend on relationships (e.g. find all people that work at company ABC) are fast compared to relational databases. In a graph database there is no need to join tables, which is an expensive relational operation.  Furthermore, relationships may be indexed, which further contributes to the speed of some types of queries.
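
The property-graph idea can be sketched with plain Python data structures, using the Bob Smith social graph from earlier in this section. The vertex IDs, the relationship labels and the placeholder name for Bob's brother are assumptions for this illustration, since the diagram's exact labels are not reproduced here. A real graph database would supply its own storage and query language.

```python
from collections import defaultdict

# Vertices keyed by a made-up ID, each carrying its own properties.
vertices = {
    "bob": {"label": "Person", "name": "Bob Smith"},
    "brother": {"label": "Person", "name": "Bob's brother"},  # placeholder name
    "uw": {"label": "University", "name": "University of Waterloo"},
    "abc": {"label": "Company", "name": "ABC Inc."},
}

# Directed edges: (source, relationship, target, properties of the relationship).
edges = [
    ("bob", "BROTHER_OF", "brother", {}),
    ("bob", "STUDENT_AT", "uw", {"from": "Sept 2011", "to": "2015"}),
    ("bob", "WORKED_AT", "abc", {"from": "Jun 2017", "to": "Jan 2018"}),
]

# Index the edges by (target, relationship) so a relationship query is a
# direct lookup instead of a join over tables.
incoming = defaultdict(list)
for source, relationship, target, properties in edges:
    incoming[(target, relationship)].append((source, properties))


def people_who_worked_at(company_id):
    """Return the names of everyone with a WORKED_AT edge into the company."""
    return [vertices[source]["name"]
            for source, _ in incoming[(company_id, "WORKED_AT")]]


print(people_who_worked_at("abc"))
```

Adding a new kind of relationship only means appending another tuple to `edges`. None of the existing vertices or edges need to change.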

Some of the leading graph databases are:

* Neo4j
* Titan
* HyperGraphDB
* InfiniteGraph
* OrientDB
* DEX

For further reading, see: https://en.wikipedia.org/wiki/Graph_database

# Triple Stores
A triple datastore (usually just called a Triple Store) is a specialized type of graph datastore that uses the **Resource Description Framework (RDF)** to formalize the way it models data.  RDF was developed by [W3C](https://www.w3.org/Consortium/) to define a language for representing information for the Internet/Semantic Web.

At its core, RDF defines a construct called a **triple**.  A ***triple*** *is a 3-tuple made up of a subject, predicate and object that is used to describe statements on a resource*.  Consider modeling “Alex works at Facebook” in a subject-predicate-object form.  We can break this statement as follows.

![Subject-Predicate-Object.png](attachment:Subject-Predicate-Object.png)

In this form, the subject, “Alex”, is linked to an object, “Facebook”, by its predicate, “Works at”.  You can think of the subject as the entity, predicate as the name of an attribute and the object as the value of the attribute.  Therefore, if we want to model Alex’s age, we can define a subject, “Alex”, a predicate, “Age” and an object “25”.  The collection of triples forms the graph model where the subjects and objects are vertices and the predicates are the edges.
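
As a minimal sketch, triples can be held as Python tuples and queried by pattern matching, where `None` acts as a wildcard. The function `match` is a made-up helper for illustration; real triple stores identify resources with IRIs and are queried with SPARQL.

```python
# Each triple is a (subject, predicate, object) tuple.
triples = {
    ("Alex", "works at", "Facebook"),
    ("Alex", "age", "25"),
}


def match(subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None matches anything."""
    return sorted(
        triple for triple in triples
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    )


print(match(subject="Alex"))             # everything known about Alex
print(match(predicate="works at"))       # who works where
print(match(subject="Alex", predicate="age"))
```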

RDF also defines a schema language that provides a means to describe the structure of the RDF model.  For example, a schema can be defined that states only people can work at companies.  Similarly, we can define a schema that states only teachers can teach classes.  The schema defines constructs such as classes, literals, resources and datatypes.  The languages also define properties that are used to describe relationships between subject resources and objects resources such as “rdfs:subClassOf” and “rdf:type”.  By defining a set of schema rules, we can infer additional triples that provide additional insight into the data.

Let’s look at modeling the relationship between a person and a dog.  In our schema we can define three classes: Mammal, Person and Dog.  We can also define two subclass relationships:  every Person is a Mammal, and every Dog is a Mammal.  Finally, we can define a property that states that only Persons can own Dogs.

In our RDF model we can define a triple that states Bob owns Lassie.  Based on our RDF schema we can infer the following additional triples:
* Bob is a Person
* Bob is a Mammal
* Lassie is a Dog
* Lassie is a Mammal
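
The inference above can be reproduced with a few lines of Python. This is a simplified sketch in the spirit of RDF Schema rules (domain, range and subclass), using the Person, Dog and Mammal classes from the schema. The constant names, the plain-string predicates `owns` and `is a`, and the `infer` function are assumptions for this example, not RDF syntax.

```python
# Schema, expressed as simple lookup tables (illustrative names).
SUBCLASS_OF = {"Person": "Mammal", "Dog": "Mammal"}
DOMAIN = {"owns": "Person"}   # the subject of "owns" is a Person
RANGE = {"owns": "Dog"}       # the object of "owns" is a Dog


def infer(triples):
    """Apply the schema rules repeatedly and return only the new triples."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, predicate, obj in list(known):
            derived = set()
            if predicate in DOMAIN:
                derived.add((subject, "is a", DOMAIN[predicate]))
            if predicate in RANGE:
                derived.add((obj, "is a", RANGE[predicate]))
            if predicate == "is a" and obj in SUBCLASS_OF:
                derived.add((subject, "is a", SUBCLASS_OF[obj]))
            if not derived <= known:
                known |= derived
                changed = True
    return known - set(triples)


for triple in sorted(infer({("Bob", "owns", "Lassie")})):
    print(triple)
```

Starting from the single stated triple, the loop first derives the class of each resource from the `owns` property and then climbs the subclass hierarchy, stopping once no new triples appear.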

W3C also defines a query language called SPARQL (see https://www.w3.org/TR/rdf-sparql-query/ for further reading) that allows users to query information from the triple datastore.  Similar to SQL, SPARQL defines a set of directives to perform the full set of CRUD operations.

As you can see from the example above, triple datastores support the discovery of new facts through inference, in addition to inheriting the benefits of graph datastores. There is a close relationship here to predicate calculus and formal logical inference systems, which makes triple stores interesting for AI applications. 

Furthermore, since triple datastores follow an open standard, they have added benefits:
* Transferring data from one triple store to another is easy
* Triple stores can be developed on top of existing data stores such as relational databases
* Queries can be federated across multiple data stores

# Selecting a Database Type

There are many problem domains, each with its own set of unique challenges.   Unfortunately, there is no one kind of database technology that fits all situations.  Practitioners must understand the problem they are trying to solve and the kinds of user experiences they want to provide.  Only then can they determine the technology and architecture to use.  This applies both to choosing the type of datastore and to designing the datastore schema.

Fortunately there are some basic questions that practitioners can ask to help select a type of database:

## Is the Data Structured, Unstructured, or Semi-Structured?

Can the data be described in a well-defined structure with no ambiguity?  Or is the data unstructured and not well organized, ambiguous, and irregular?    Examples of unstructured data are emails, audio and video files, text documents, webpages and most business data.  For these cases it may be suitable to select a datastore that has a flexible schema that can handle the ambiguity.  For data that is well structured a relational database may be a suitable datastore.  

Consider log data that is semi-structured.  A log event is composed of a timestamp, log level, and a message that is not well-structured.  Log systems can further parse the message to extract more useful system information such as application name, process name, etc.  Organizations use log data to determine the health of their systems.  A natural choice is to store log data in a column-oriented datastore such as HBase.  Analytic queries can be quickly executed since these queries only require a subset of attributes.  Furthermore, the datastore can be easily scaled to handle large volumes of log data.
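
As an illustration of semi-structured log data, the sketch below splits each log line into the structured parts (timestamp and level) and the free-text message, then stores them column by column. The log format, the sample lines and the service names are all invented for this example; real logs vary widely.

```python
import re

# Assumed format: "<date> <time> <LEVEL> <free-text message>".
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$")

lines = [
    "2019-03-01 12:00:01 INFO payment-service started worker 4",
    "2019-03-01 12:00:05 ERROR payment-service timeout contacting ledger",
    "2019-03-01 12:00:09 INFO auth-service login ok",
]

# One list per attribute, as a column-oriented store would keep them.
columns = {"timestamp": [], "level": [], "message": []}
for line in lines:
    found = LOG_PATTERN.match(line)
    if found:  # lines that do not fit the pattern are skipped
        for key in columns:
            columns[key].append(found.group(key))

# A health check such as "how many errors?" reads only the level column.
error_count = columns["level"].count("ERROR")
print(error_count)
```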

## Which model best describes your Data?

We have gone through many types of datastores, each with its own unique way of modelling data.  When going through the process of modelling your data by identifying entities, relationships and properties, a clear method of modelling the data will often emerge.  For example, a graph database is a natural fit for modelling a social graph.  In contrast, one might consider an object database to store the data of applications that are written in an object-oriented programming language.

## What is your Availability and Consistency Model?

Does your application require that the data is always highly available or is it essential that the data always be consistent?  In the case of storing bank funds we want to make sure that our data is consistent.  The amount of funds in your banking account must reflect the exact amount at all times, even after system failures.  In this case, we would want to use a datastore that uses ACID transactions.
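
The banking case can be sketched with Python's built-in `sqlite3` module, which supports ACID transactions. The table layout, account names and `transfer` function are hypothetical; the point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"  # no overdrafts allowed
)
connection.executemany("INSERT INTO accounts VALUES (?, ?)",
                       [("alice", 100), ("bob", 50)])
connection.commit()


def transfer(source, target, amount):
    """Move funds atomically: both updates are applied, or neither is."""
    try:
        with connection:  # commits on success, rolls back on an exception
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target))
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source))
        return True
    except sqlite3.IntegrityError:  # raised when the CHECK constraint fails
        return False


def balances():
    return dict(connection.execute(
        "SELECT name, balance FROM accounts ORDER BY name"))


print(transfer("alice", "bob", 30), balances())
print(transfer("alice", "bob", 500), balances())  # rejected, nothing changes
```

In the second call the credit to the target account succeeds first, but the debit violates the balance constraint, so the whole transaction is rolled back and the credit disappears with it.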

In contrast, consider a social networking application like Twitter.  Millions of users post short tweets of up to 280 characters to the site, while millions more read these tweets.  Thus, making the data available is more important than keeping the data consistent.  If the system goes down and users cannot read tweets, users will get frustrated with the bad user experience.  In this case we can use a datastore that provides eventual consistency, where data is replicated across many different replicas.  If one replica goes down, users can be redirected to another replica.   Furthermore, users are okay with the data eventually reaching all the replicas.  For instance, a person living in Hong Kong can wait a few minutes to see updates from a person living in New York.  

We can keep asking more and more of these types of questions to understand how our data is used.  For example, we can further ask whether our application is write-intensive or read-intensive.   This understanding will help choose the type of datastore and help design how we structure its contents.


**End of Part 1**

This notebook makes up one part of this module. Now that you have completed this part, please proceed to the next notebook in this module.

If you have any questions, please reach out to your peers using the discussion boards. If you and your peers are unable to come to a suitable conclusion, do not hesitate to reach out to your instructor on the designated discussion board.

# References

- Icons of Progress, IBM. (Public Domain). Retrieved from https://www.ibm.com/ibm/history/ibm100/us/en/icons/reldb/


- Graph database. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Graph_database


- Column-oriented DBMS. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Column-oriented_DBMS


- Object-relational databases. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Object-relational_database


- Operational data store. SearchOracle. (Public Domain). Retrieved from https://searchoracle.techtarget.com/definition/operational-data-store


- Relational database. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Relational_database


- SQL. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/SQL


- Comparison of object database management systems. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Comparison_of_object_database_management_systems


- Object database. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Object_database


- Document-oriented database. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Document-oriented_database


- Associative entity. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Associative_entity


- SQL syntax. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/SQL_syntax


- ACID. Wikipedia. (Public Domain). Consistency, Isolation and Durability definition. Retrieved from https://en.wikipedia.org/wiki/ACID_(computer_science)


- Big table. Wikipedia. (Public Domain). Retrieved from https://en.wikipedia.org/wiki/Bigtable


- Structured vs unstructured data. datamation. (Public Domain). Retrieved from https://www.datamation.com/big-data/structured-vs-unstructured-data.html


- What Are the Major Advantages of Using a Graph Database? DZone. (Public Domain). Retrieved from https://dzone.com/articles/what-are-the-pros-and-cons-of-using-a-graph-databa


- Resource Description Framework. Wikipedia. (Public Domain). Retrieved from   https://en.wikipedia.org/wiki/Resource_Description_Framework


- RDF Schema. Wikipedia. (Public Domain). Retrieved from  https://en.wikipedia.org/wiki/RDF_Schema


- A chart showing several of the SQL language elements that compose a single statement.  Wikipedia. (Public Domain). Retrieved from: https://wikimedia.org/api/rest_v1/media/math/render/svg/b0bfef3c941c1a88d3990bd1472653e60cf02d0a


- Multidimensional Warehouse (MDW).  Oracle. (Public Domain). Retrieved from: https://docs.oracle.com/cd/E41507_01/epm91pbr3/eng/epm/penw/concept_MultidimensionalWarehouseMDW-9912e0.html