<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **BUAN 6510**
# **Lesson 1: Principles & Overview** 
_The whole course from 40,000 feet_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- Importance of data for decision making
- Features and components of database systems 
- Data models and data integrity
- Functions of a DB Management System
- Terminology like apps, layers, DBMS, SQL, metadata, etc.


### **Skills / Know how to ...**
- Identify the parts of a database table
- Use keys to match records from separate tables
- Run SQLite queries in a Jupyter notebook


--------
## **LESSON 1 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear TODO REPLACE WITH NEW VIDEO
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/YkDLv6CtEnc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

## **BIG PICTURE: Where does data come from? Why do we care?** 

Data lives in a somewhat unique place in between technology and people. **Without people there is no data.** Data is a strictly artificial (i.e., human generated) artifact, our record of facts observed and imagined. It is also artificial in another sense, in that **it can't exist without technology.** Facts that remain locked in our heads are not data. They are just thoughts and memories that have to be **encoded into data** so they can be communicated, stored, updated, and (eventually) purged altogether when we don't need them anymore. Technology is how we do all that. Without it we are lost. 

In a real sense, understanding how data works and where it comes from gives us a peek inside how people think and, more specifically, how people make good decisions. Every ***rational*** decision involves at least four stages:
- A **stimulus** that prompts the need for action
- Collection of ***potentially* relevant facts** about alternative courses of action
- Development and application of **decision rationale** (models) informed by the facts and objectives
- Selection and **execution of a course of action** in keeping with the rationale  

Or, to put it another way, when trying to hit a target when it really counts, the best plan of action is more like Ready $\rightarrow$ Aim $\rightarrow$ Fire with information and intention instead of flailing about randomly in the dark with hunches based on nothing, explainable to no one.

Good leaders *who have to be accountable to others* can explain *why* they make decisions, *what* supporting evidence was used, and *how* they can be persuaded to act differently. Otherwise, why would anyone in their right mind choose to follow them? A common thread here is access to relevant data, which allows them to formulate, validate, communicate, and act on their decisions. 

So, if you really want to succeed in business, it is best to treat data as a critical resource, worthy of continual investment of time, money, and attention. Can you access data when you need it? Can you trust what it is telling you? Is it relevant to what you need to know at the time? Can you integrate data from multiple sources? Can you then communicate decisions in a way that any rational person can agree (or disagree) with? If not, then expect lots of unfortunate surprises and in some cases outright failures. 

And where does that data reside? Hopefully, in a **database system** that has been designed to meet the specific needs of the people using it. In this lesson we will sample ideas from the lessons that follow, providing just enough information to explain where we are going and why we need to go there.   

> ### TLDR for the impatient
> * Access to data and information is fundamental to modern business
> * Management is about decision making
> * Good decisions require information and rationale
> * Good information requires relevant, accurate, and timely data  
> * Ideally, that data is managed in a database system that has been designed for the needs of those who use it
>
> $\Rightarrow$ Important to understand how databases work and interact with business applications, getting as close to original source data (i.e., the unadulterated truth) as you can manage




## **Database Systems: Three Different Perspectives**
Any system that relies on the use of a data store is considered a database system. With that very broad definition, just about any "smart" device or app you use today is a database system. A smart watch that collects and stores data about the wearer is a database system, as is an email client that retrieves and archives email for reading later or the point of sale system used at the bodega around the corner. 

In this lesson we will look at database systems three different ways:
- **Technical Architecture** describes hardware and software resources that need to be bought, installed, integrated, maintained, and secured.   
- **Software Architecture** describes the logical structure of the system and how data is processed. 
- **Data Architecture** describes how the data itself is organized, used, and maintained. 

## **Technical Architecture: Networks, Devices, Apps, and Servers**

![Enterprise Architecture](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_enterprise_architecture.png)

When viewed with an IT director's eye, technical resources include anything that has to be installed. Consider, for example, the technology needs of a small regional retail chain, as diagrammed above. 

At each **retail location** (on the left), one would find a number of devices needed to complete sales, track inventory, and report to headquarters. While some of the technology might be proprietary, much of it would be licensed from vendors who specialize in retail systems. 

At the right is the **corporate headquarters**, where functional managers and executive staff make decisions about marketing, human resources, supply chains, technology, etc. The needs of these executive offices are somewhat less industrial than the retail locations, with more of a need for historical data that can be analyzed offline. While they may, in fact, need to monitor individual transactions (e.g., if fraud is detected) they usually work with aggregated data (data marts) constructed to support specific kinds of decisions.

Somewhere in between the stores and headquarters is a **centralized data center** with the heavy-duty "big iron" technology needed to process and record transactions (sales, incoming orders, outgoing shipments, etc.) required for everyday operation of the firm. It is hard to overstate how critical these central servers are: if they go down or are hacked then everything reverts to pencil and paper. 

Connecting the various locations together is a **virtual private network (VPN)** that is secured using state of the art technology. Like the central servers, these networks are potentially always under attack. If a remote hacker is going to gain access to the systems then it is going to be over the VPN. (However, despite what you see in the movies, virtually all black hat hacking actually happens *within* the VPN rather than through some cryptographic hack of the network itself. Usually, a user exposes a password, installs a bit of malware, or is the criminal in question.)

### **Scale: Embedded / User / Workgroup / Enterprise**

Database systems come in different sizes, each with different needs and operating characteristics.

Some are so small that you barely know they are there, **embedded** in other hardware and software. So, for example, the scanning wand used by a point of sale system may keep a cache of recent scans. Or your sports watch keeps a record of your heart rate that is synced (uploaded and reset) through an app on your phone. Often, if the device is turned off (all the way off, not in hibernation) for a long enough period of time, the data is lost. However, with static memory and solid state disks becoming cheaper and smaller every day, even the smallest devices often come equipped with persistent storage that survives a reboot. 

The next level up is data stored in files by an **end user app**. Such data will usually survive a system restart, though perhaps with some corruption if the system was writing data to storage at the time. In our example, the point of sale system may have a local storage mechanism so it can recover from power outages, void incorrect transactions, etc. Similarly, behind the scenes most desktop software stores data in caches, documents, or other kinds of files in order to improve the overall user experience.  

At a broader level are so-called **workgroup applications**, where data access is shared with a limited number of other users and devices on the same local area network (LAN). In our retail example, the back office systems and inventory systems might share a workgroup server that keeps track of recent activity. At the other end of the diagram, a similar setup connects the executive information systems and the analyst workstations to the data marts and file archives needed to do their work. 

At the largest scale are the **enterprise** systems in the center of the diagram. They are not necessarily designed for raw speed but instead for throughput. The data on these servers may come from hundreds or even thousands of devices or users, and it is more important that each transaction complete correctly than that any particular transaction complete quickly. 

### **Usage: Transaction Processing vs Analytical Processing**

In our retail example, there was a contrast between the operational systems used in the retail locations on the left and the decision support systems used by the headquarters on the right. 

We call the kinds of work performed by the retail locations and the central data center **transaction processing**. The emphasis is on capturing *what is happening right now* in as much detail as needed and then storing it for posterity. The transaction server and database are thus designed for writing data quickly and accurately, without dropping any transactions due to bandwidth constraints or technical failures. 

The work performed at headquarters, meanwhile, tends to be what we call **analytical processing**. Here the emphasis is on aggregating and understanding the transaction data and perhaps integrating it with other data collected elsewhere. These sorts of activities are more about data integration and communication, with read-only access to (scrubbed and aggregated) historical data in a data warehouse or data mart. Such systems may be nearly as large as the transaction systems, in that they contain the same basic volume of facts, but they do not have to support as many users and are less subject to data corruption. Read-only data is not corruptible. If it is corrupt then it was so when it was created. 
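
To make the contrast concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the store names are made up for illustration. The first half writes individual sales as they happen (transaction processing); the second half only reads and aggregates the history (analytical processing).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (store TEXT, item TEXT, amount REAL)")

# Transaction processing: record each sale as it happens, one row per sale.
new_sales = [
    ("Store A", "socks", 5.0),
    ("Store A", "shirt", 20.0),
    ("Store B", "socks", 5.0),
]
with conn:  # commits if every insert succeeds
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_sales)

# Analytical processing: read-only aggregation over the recorded history.
totals = conn.execute(
    "SELECT store, COUNT(*), SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)  # [('Store A', 2, 25.0), ('Store B', 1, 5.0)]
```

In a real firm the two halves would typically run on different systems, with the analytical side working from a copy of the data rather than the live transaction database.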

### **Security: Files / DBMS / Services**

We conclude our discussion of information technology with a note about the effect of architecture on privacy and security. As an illustration, please consider the three "houses" below, each of which is designed with security and privacy in mind. The first is Philip Johnson's world famous Glass House, which looks stunning but would not provide much in the way of privacy. In the center is Johnson's almost as famous brick Guest House, a windowless structure with a single door that would provide lots of privacy and security, except for the easily accessed skylights on the roof. Lastly, we have the bomb shelter on the right, which provides maximum security and privacy but only if one is willing to trade off natural light and air.   

![Glass Brick Concrete](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_glass_brick_concrete.jpg)

**The Glass House is analogous to files on a local hard drive or in an email attachment.** Just as the thick panes of glass give the appearance of security but not any privacy, so does relying on file storage to keep your data safe. The contents of your files are visible to anyone with physical access to the storage device or network. While we can, of course, provide security ourselves $-$ with curtains for the house or encryption for the files $-$ doing so would potentially spoil the elegance and convenience of the original design. *So, unless you want to spend your time worrying about private data leaking out of your organization on thumb drives, email, etc. then do not rely on file storage to secure your files. It's about as insecure as it gets.*

**The brick Guest House is like direct access to a Database Management System (DBMS) over a secured local network.** By providing a single point of access (i.e., a thick wooden door), such systems make it fairly simple to secure the data behind a user authentication and authorization system. *However, just as the brick house's skylights can potentially be compromised, so can a database located on a local network drive. Its files may take longer to break into but with time can be compromised by those with the patience and skill to do so.*

**Finally, the bomb shelter is equivalent to an encrypted database that can only be accessed through a secure application server.** In this case, *all access* is limited to the specific commands (service requests) implemented by the app server. While in principle we could argue that a DBMS is itself an application server, in this case we have the ability to layer on more security beyond what is provided by the database vendor. It is certainly inconvenient, but is about as secure as it can possibly be. 
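
As a rough sketch of the "specific commands only" idea, the hypothetical `CustomerService` class below plays the role of the app server. Callers never send SQL; they can only make the two named requests the class implements. The class name, table, and sample data are assumptions for illustration, and the sketch leaves out the encryption and user authentication a real bomb shelter would need.

```python
import sqlite3

class CustomerService:
    """Hypothetical app server: callers make named requests, never raw SQL."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)"
        )

    def add_customer(self, name, phone):
        # The ? placeholders keep user input from being executed as SQL.
        with self._conn:
            cur = self._conn.execute(
                "INSERT INTO customers (name, phone) VALUES (?, ?)", (name, phone)
            )
        return cur.lastrowid

    def find_by_phone(self, phone):
        # Returns (id, name), or None when no customer matches.
        return self._conn.execute(
            "SELECT id, name FROM customers WHERE phone = ?", (phone,)
        ).fetchone()

service = CustomerService()
new_id = service.add_customer("Pat", "555-0100")
print(service.find_by_phone("555-0100"))  # (1, 'Pat')
# A malicious "phone number" is treated as plain data and simply matches nothing.
print(service.find_by_phone("555-0100'; DROP TABLE customers; --"))  # None
```

Because there is no request for, say, listing every customer, that action is simply unavailable to the caller, which is the inconvenience and the security benefit in one.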

**So, whether you are storing data in the cloud or on your local hard disk, be sure that you use a secured network, with a single point of access to the data. Even then, use encrypted file storage to provide one more layer of protection from those with physical access to the server. What you may lose in raw speed will be more than made up for with peace of mind.** 



## **Software Architecture: Presentation / Logic / Data**

![Three Tiered Architecture](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_three_tiered_architecture.png)

**If we ignore networks and devices, then all remaining technology is software.** Conceptually, if technology is an organization's brain and central nervous system then software is its mind. The wiring, neurons, synapses and other parts exist to implement the thinking and executive processes needed to survive. Similarly, **information technology exists to implement the software** that makes it useful. 

Virtually all modern software implements some variation of the three part structure shown above:
- The **presentation** tier (or layer) that the end user sees and interacts with. To many users this is the totality of the software. Perhaps the most familiar example is the web browser, which many Microsoft users of a certain age still call "the Internet" even though Internet Explorer has been officially dead for many years now. 
- The **logic** tier that connects the user to other users, retains information that may be useful later, controls access to critical resources, etc. In its simplest form, it is defined by a set of actions or **requests** to be carried out by an application service. Continuing with our web example, each web page is assembled by the web browser through one or more requests made of the web server. Each request is received, authorized, and (potentially) executed, with a **response** (HTML, CSS, JavaScript, file, etc.) delivered back to the browser for presentation to the user.  
- The **data** tier that is responsible for reading and writing persistent data. If the entire system is rebooted from scratch, then all essential data should be restored from storage by the data tier. If any data cannot be restored then the data tier should initiate a **rollback** of any data that has become invalid because of the loss. 

In our look at software we will take an operational view of database systems:
- How the three tiers cooperate (via requests and responses) to carry out a business transaction 
- The operations, functions, and features of database management systems (DBMS)
- SQL Standards for relational database systems
- How we will use SQL in this class 

### **The Transaction Lifecycle**

Consider this everyday sales transaction at a mom-and-pop retailer near you:
1. The customer selects a few items off the shelves and then approaches the cashier to check out. 
2. The cashier asks for the customer's phone number or other identifying account information to "make future checkouts easier."   
    2a. If the customer refuses ("the number is unlisted") then the cashier enters a dummy customer number (like "000000") and continues on with the transaction.    
    2b. If the customer supplies a phone number, then the cashier looks up the customer in the system. If the customer does not exist in the system then the cashier asks for a name and creates a new customer account.  
3. The cashier rings up the items and calculates a total. 
4. The customer pays with cash or credit card. 
5. The system confirms the transaction as valid and complete. 
6. The cashier offers a receipt and tells the customer to have a good day. 

It seems pretty simple, right? Now let's look at the same transaction as a set of requests and responses between the Point of Sale terminal, the Transaction Server, and the Database Server. To keep things simple, let's assume that the customer does not have an account but is willing to set one up. Each arrow on this UML sequence diagram is a request (solid line) or a response (dashed line). 

![Sales Transaction](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_sales_transaction.png)

Note that most server responses come only after issuing requests for help from the database. The requests are in SQL, of course, with responses that may include data or just a response code (e.g., "OK"). 

So ... it's not so simple after all! Here all of the requests succeed (with data or an "OK" response) but the system needs to also handle failed requests. We also didn't consider the possible transactions with the credit card company. Depending on the system design, those may be processed by the POS terminal or the Transaction Server. 
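
The happy path of that conversation can be sketched in a few lines of Python. Everything here is a simplified stand-in: the function names, response dictionaries, and table layouts are assumptions for illustration, not the actual messages in the diagram, and payment is left out entirely.

```python
import sqlite3

# Data tier: an in-memory SQLite database standing in for the Database Server.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, phone TEXT UNIQUE, name TEXT);
    CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")

# Logic tier: hypothetical Transaction Server requests, each returning a response.
def lookup_customer(phone):
    row = db.execute("SELECT id FROM customers WHERE phone = ?", (phone,)).fetchone()
    if row is None:
        return {"status": "NOT FOUND"}
    return {"status": "OK", "customer_id": row[0]}

def create_customer(phone, name):
    with db:
        cur = db.execute(
            "INSERT INTO customers (phone, name) VALUES (?, ?)", (phone, name)
        )
    return {"status": "OK", "customer_id": cur.lastrowid}

def record_sale(customer_id, prices):
    total = sum(prices)
    with db:
        cur = db.execute(
            "INSERT INTO sales (customer_id, total) VALUES (?, ?)", (customer_id, total)
        )
    return {"status": "OK", "sale_id": cur.lastrowid, "total": total}

# Presentation tier: the Point of Sale terminal drives the conversation.
response = lookup_customer("555-0100")
if response["status"] == "NOT FOUND":
    response = create_customer("555-0100", "Pat")
receipt = record_sale(response["customer_id"], [5.0, 20.0])
print(receipt)  # {'status': 'OK', 'sale_id': 1, 'total': 25.0}
```

Notice that the terminal never touches SQL. Each server function turns one request into one or more SQL statements and then answers with data or a status code.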

And if the system suffers a catastrophic failure, where does it look to start a reboot? With whatever data is in the database. In most cases, that's just fine. However, if the database itself shuts down in the middle of recording a transaction then it needs to *roll back* its data to just before the failure and notify the server of the error, which then gets reported to the sales terminal. We'll consider such cases in Lesson 8. 
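
As a small preview, here is a sketch of a rollback using Python's built-in `sqlite3` module. The failure is simulated with an ordinary Python exception rather than a real crash, and the `sales` table is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, total REAL NOT NULL)")

# One sale is recorded and committed successfully.
with conn:
    conn.execute("INSERT INTO sales (total) VALUES (25.0)")

# A second sale fails partway through, so its changes are rolled back.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO sales (total) VALUES (12.5)")
        raise RuntimeError("simulated failure partway through the transaction")
except RuntimeError as err:
    error_report = f"ERROR: {err}"  # what the server might pass back to the terminal

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(error_report, "| rows stored:", count)  # only the first sale survives
```

The database ends up exactly as it was before the failed transaction began, which is the whole point of a rollback.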

### **DBMS Operations, Functions, and Features** 

As we saw with the sales transaction, even everyday business gets pretty complicated when we have to implement it in software. From the perspective of the database, however, there are only four kinds of **operations**: 
- **Create** (add) new data. Upon storage, the DBMS should respond with an identifier for retrieving the data later. 
- **Retrieve** specified data. The request is often called a *query* and the response includes a data *payload* and perhaps some descriptive *metadata*. It is possible, depending on the query language, to return collections of data if needed. 
- **Update** specified data. The request indicates what data is to be updated and how it is to be modified. The response indicates whether the update was successful. 
- **Delete** specified data. The DBMS either deletes the data or returns an error code if the data cannot be deleted.   

These fundamental operations are commonly referred to by the acronym CRUD and are found in every database system regardless of the technology or use case. 
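
Here is one way the four operations might look against a SQLite database, using Python's built-in `sqlite3` module. The `customers` table and its contents are hypothetical; the point is what each request sends and what comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")

# Create: the DBMS hands back an identifier for the new row.
cur = conn.execute(
    "INSERT INTO customers (name, phone) VALUES (?, ?)", ("Pat", "555-0100")
)
customer_id = cur.lastrowid

# Retrieve: the response carries a payload (rows) plus metadata (column names).
cur = conn.execute(
    "SELECT id, name, phone FROM customers WHERE id = ?", (customer_id,)
)
columns = [d[0] for d in cur.description]
payload = cur.fetchall()

# Update: rowcount reports how many rows were changed.
updated = conn.execute(
    "UPDATE customers SET phone = ? WHERE id = ?", ("555-0199", customer_id)
).rowcount

# Delete: again, rowcount tells us whether anything was removed.
deleted = conn.execute(
    "DELETE FROM customers WHERE id = ?", (customer_id,)
).rowcount
conn.commit()

print(customer_id, columns, payload, updated, deleted)
```

In SQL terms, Create, Retrieve, Update, and Delete correspond to the `INSERT`, `SELECT`, `UPDATE`, and `DELETE` statements.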

As DBMS technology has matured, the industry has agreed on a few standard functional definitions (shown below). We will consider many of these in more detail in Lessons 7 and 8. 

Within and beyond these standards, there is plenty of room for DBMS vendors to innovate. We will consider vendor-specific features (also shown below) in our discussion of NoSQL and Distributed DBMS technology in Lessons 11 and 12. 
  
![DBMS Functions and Features](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_dbms_features_functions.png)

### **SQL Database Standards**

**The *lingua franca* of DBMS technology is Structured Query Language.** It is the standard against which *every other* data standard is defined. Further, while there are dialectal differences between the various SQL implementations, they are relatively rare, allowing most SQL queries to run unmodified from one DBMS to another. There may be quicker or easier vendor-specific ways to code a particular query, but there will also be a *standard* way. 

**SQL is more than just a language.** Each DBMS vendor provides a *platform* with tools, apps, and other utility software. To keep the chaos at a minimum, SQL Standards include specifications for DBMS functionality like ...
- How to connect to a database and initiate a request
- How data is stored and organized
- How transactions are handled to prevent data corruption
- How user permissions are granted and revoked

We will run a few simple SQL queries in the next section and then again for pretty much every lesson in this course. 

### **Jupyter, Colab, and %sql Magic**

In this course we will interact with a variety of different database servers, but we are only going to use one database client: `ipython-sql` running right here inside our Jupyter notebooks. 

We first learned about Jupyter in Lesson 0. It is a programming and reporting environment that combines formatted text and runnable code organized as "notebook" documents. There are different flavors of Jupyter notebooks from various vendors. What follows assumes that you are using Google Colab, though most actions translate pretty well to the other Jupyter variants. 

Text is entered in Markdown format into text cells. If you double click on this text you can see Markdown formatting for yourself in a fairly large text cell (screenshot below). When open this way, the cell is editable. If you modify anything then the formatted text (displayed to the right) also changes. Double-click the formatted text to return to read-only mode. 

![Markdown Screenshot](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_markdown_screenshot.png)

Runnable code is entered into code cells, identified by icons to the left of the cell. Pristine code that has not been run yet appears with an empty box icon.

![New code cell](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_code_cell_screenshot1.png)

In this case, the code is in Python but as we will see Jupyter can run code in many different languages. Python just happens to be the default. We will be using *mostly* SQL in this class. 

To run the code, hover over the cell and press the circular run icon.

![Run code cell](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_code_cell_screenshot2.png)

After the code has been run (and you are no longer hovering over the cell), the box icon returns, this time with a number inside. 

![Output code cell](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_code_cell_screenshot3.png)

The number indicates the order in which the cells in a notebook were run. It is possible to run the cells in a different order than they appear. That allows you to go back and debug things as you go along. However, you should also do a "clean run" (top-down after resetting the code cells) from time to time to be sure that the notebook works as written. 

Here, try it yourself.  The cell below is live. 



In [None]:
print('Hi There!')

Congrats. For many of you this is your first Python code. We'll see a little more Python before we switch to SQL pretty much full time. 

In order to run SQL in Jupyter *without* Python, we will use [`ipython-sql`](https://github.com/catherinedevlin/ipython-sql), a Jupyter add-on that also goes by the name "%sql magic." It does exactly what the name implies: all the hard work (i.e., magic) needed to interact with databases using just SQL. Recalling our earlier discussion of the three houses, %sql magic connections are like the brick Guest House, with direct interactions with a remote database. 

We'll start with a tiny bit of Python to let Jupyter know that we are going to be using `ipython-sql`. The cell below, typically located near the top of the notebook, is used to enable (load) the %sql magic. Colab will let us know if it has already been loaded but there is no harm in loading it twice. 

![load sql magic](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_load_sql.png)

With %sql magic loaded we can now create and run SQL queries in any code cell. First, however, we will need to tell %sql magic where to find the database we want to connect to. 

![SQLite in memory connection](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_sqlite_in_memory_connection.png)

You'll notice that the part after `%sql` looks a bit like a web URL, with a protocol (`sqlite`) followed by `://`. That's no accident. We call this a **connection string**, which includes all the information (protocol, user, password, server, database) needed to find and then connect to the database, which can reside just about anywhere on the Internet. In this case, we are actually working directly with a database *in memory* (i.e., no network, no files, ... right inside the notebook's own runtime), a trick that SQLite is especially well known for, having been originally designed for embedded use in tiny devices without file systems. 
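
For comparison, the same in-memory idea can be tried from plain Python with the built-in `sqlite3` module, where the special name `":memory:"` plays the role of the connection string. This is only a side-by-side sketch (the `notes` table is made up); in this course we will stick with %sql magic.

```python
import sqlite3

# ":memory:" asks SQLite for a database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES ('hello')")
rows = conn.execute("SELECT body FROM notes").fetchall()
print(rows)  # [('hello',)]

# Each in-memory connection is its own private database, so a second
# connection starts out empty and cannot see the notes table.
other = sqlite3.connect(":memory:")
tables = other.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # []
```

The catch is that an in-memory database disappears when its connection closes, so it is best suited to practice and scratch work.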

Once we have a database connection, we use the `%%sql` magic invocation (sort of like *abracadabra*) at the top of a code cell to indicate that all code after the first line is SQL. For example, the screenshot below includes a bit of SQL to create (or recreate) a table of customer profile data. 

![Create Table Example](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_create_table_example.png)

Again, the text below the `%%sql` invocation is written in SQLite-compatible SQL. We will come back to this soon enough in the ***SQL AND BEYOND*** tutorial at the end of the lesson. 


## **Data Architecture: Entities, Attributes, Values, and Relationships**

The figure below shows three different views of the same data:
* A receipt from a dry cleaner order from January 2, 2019
* An entity relationship diagram showing how the data is organized
* A table of invoices from January 2, 2019

In this section we explore data from the ground up, starting with basic definitions and issues, then moving on to data modeling and database design, and concluding with actual SQL code to implement the design.   

![DeluxCare](https://github.com/christopherhuntley/BUAN6510/raw/master/img/L1_DeluxCare.png)

### **Data $\rightarrow$ Information $\rightarrow$ Knowledge**
Data = facts

Information = data + metadata

Knowledge = Information in use

### **Data Integrity: Do you really know what's in that data your company is consuming?**

Big data is messy data

Domain / Entity / Relational Integrity 

Maintaining integrity as an active process
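
As a rough illustration of the three kinds of integrity, the sketch below asks SQLite to enforce one rule of each kind and then tries to break each rule. It reads "relational integrity" as what is often called referential integrity (that reading is an assumption), and the `customers` and `invoices` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,      -- entity integrity: one unique id per row
        name TEXT NOT NULL
    );
    CREATE TABLE invoices (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),  -- must match a customer
        total       REAL CHECK (total >= 0)                     -- domain integrity
    );
    INSERT INTO customers (id, name) VALUES (1, 'Pat');
""")

# Each statement below violates exactly one rule.
bad_rows = {
    "entity": "INSERT INTO customers (id, name) VALUES (1, 'Sam')",
    "domain": "INSERT INTO invoices (customer_id, total) VALUES (1, -5.0)",
    "referential": "INSERT INTO invoices (customer_id, total) VALUES (99, 10.0)",
}
rejected = []
for kind, statement in bad_rows.items():
    try:
        with conn:
            conn.execute(statement)
    except sqlite3.IntegrityError:
        rejected.append(kind)  # the DBMS refused to store the bad row

print(rejected)  # ['entity', 'domain', 'referential']
```

Declaring the rules in the database means the DBMS does the checking on every write, instead of relying on each app to remember.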


### **Data Models ... Once and Forever**

Conceptual / Logical / Physical Models

Entity Relationship Diagrams

Tables, Attributes, and Keys

SQL DDL and DML
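
To preview how tables, attributes, and keys fit together, here is a minimal sketch with two made-up tables. The table names, column names, and rows are assumptions for illustration and are not taken from the DeluxCare figure. The DDL statements define the structure, the DML statements add data, and the final query uses the shared `customer_id` key to match each invoice with its customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- DDL: define the tables, their attributes, and their keys.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        total       REAL
    );
    -- DML: add the data itself.
    INSERT INTO customers VALUES (1, 'Pat'), (2, 'Sam');
    INSERT INTO invoices VALUES (101, 1, 12.5), (102, 2, 8.0), (103, 1, 20.0);
""")

# Match records from the two tables wherever the key values agree.
matched = conn.execute("""
    SELECT invoices.invoice_id, customers.name, invoices.total
    FROM invoices
    JOIN customers ON invoices.customer_id = customers.customer_id
    ORDER BY invoices.invoice_id
""").fetchall()
print(matched)  # [(101, 'Pat', 12.5), (102, 'Sam', 8.0), (103, 'Pat', 20.0)]
```

Storing each customer once and pointing to them by key keeps the name in a single place, no matter how many invoices that customer has.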

## **SQL AND BEYOND: SQLite ... Files optional, no server required** 

## **HOMEWORK: Deconstruct ESPN's Gamecast**

