# SQL vs NoSQL


## SQL
- schema is a collection of tables
- table is a collection of rows with the same attributes (i.e. columns)

#### When to use a relational database
- ease of use
- ability to do joins
- ability to do aggregations and analytics
- smaller data volumes
- easier to adapt to changing business requirements
- flexibility for queries
- modelling the data, not modelling queries
- secondary indexes are available
- ACID transactions - data integrity

#### ACID
##### Atomicity
- the whole transaction is processed or nothing is processed
- e.g. a transfer between two bank accounts: the debit and the credit both happen, or neither does
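
The bank transfer above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production pattern: the `accounts` table, the account names and the `transfer` helper are all assumptions made up for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # both updates are rolled back
```

The second call debits and credits inside the transaction, then raises, so the rollback undoes both updates rather than leaving a half-finished transfer.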

##### Consistency
- Only transactions that abide by constraints and rules are written into the database, otherwise the database keeps the previous state
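
A small sketch of this rule using `sqlite3`, assuming a hypothetical `accounts` table whose `CHECK` constraint forbids negative balances. The table and values are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # the rule to abide by
)
db.execute("INSERT INTO accounts VALUES ('alice', 100)")
db.commit()

try:
    with db:  # rolls back if the statement violates a constraint
        db.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'alice'")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The database keeps its previous state.
balance = db.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
```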

##### Isolation
- concurrent transactions are processed independently of each other; the result is the same as if they had run one after another

##### Durability
- completed (committed) transactions are saved to the database even in the event of a system failure

#### When NOT to use a relational database
- have large amounts of data
- need to be able to store different data type formats
- need high throughput - fast reads
- need flexible schema (e.g. adding columns that don't have to be used by every row)
- need high availability - little or no downtime
- need horizontal scalability - ability to add more servers to the system



## NoSQL

#### Basics of Apache Cassandra
- keyspace is a collection of tables
- table is a group of partitions
- rows are single items
- partition
    - fundamental unit of access
    - collection of row(s)
    - how data is distributed
- primary key
    - made up of partition key and clustering columns
- columns
    - clustering and data
    - labeled elements
![apache_basics.PNG](attachment:apache_basics.PNG)

#### When to use NoSQL
- large amounts of data
- need horizontal scalability
- need high throughput - fast reads
- need a flexible schema
- need high availability
- need to be able to store different data type formats
- users are distributed - low latency

#### When NOT to use a NoSQL database
- need ACID transactions
- need the ability to do JOINs
- need the ability to do aggregations and analytics
- have changing business requirements
- queries are not known in advance, so you need flexibility in how you query
- have a small dataset


# Structuring the database (normalisation vs denormalisation)

## Normalisation
- is about trying to increase data integrity by reducing the number of copies of the data. Data that needs to be added or updated will be done in as few places as possible.

#### Objectives of normalisation
1. To free the database from unwanted insertion, update, and deletion dependencies
2. To reduce the need for refactoring the database as new types of data are introduced
3. To make the relational model more informative to users
4. To make the database neutral to the query statistics

#### How to reach First Normal Form (1NF)
- Atomic values: each cell contains a single, indivisible value (no lists or sets in a cell)
- Be able to add data without altering tables
- Separate different relations into different tables
- Keep relationships between tables together with foreign keys
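
As a rough illustration of the atomic-values rule, the sketch below splits a cell that holds several values into one row per value. The album and song names and the `to_1nf` helper are hypothetical.

```python
# Hypothetical un-normalised rows: the third field packs several songs into one cell.
raw_albums = [
    (1, "Album A", "Song 1, Song 2, Song 3"),
    (2, "Album B", "Song 4, Song 5"),
]

def to_1nf(rows):
    """Return one (album_id, album_name, song) row per song so every cell is atomic."""
    return [
        (album_id, album_name, song.strip())
        for album_id, album_name, songs in rows
        for song in songs.split(",")
    ]

albums_1nf = to_1nf(raw_albums)
```

Each resulting row holds exactly one song, so a song can be added or removed without rewriting a packed cell.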

#### Second Normal Form (2NF)
- Have reached 1NF
- All non-key columns in the table must rely on the whole Primary Key (no column depends on only part of a composite key)
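
A minimal sketch of moving a table towards 2NF, assuming a hypothetical order-items table whose primary key is `(order_id, product_id)`. Here `product_name` depends only on `product_id`, which is just part of the key, so it is moved to its own table.

```python
# Hypothetical rows: (order_id, product_id, product_name, quantity).
# Primary key is (order_id, product_id); product_name depends on product_id alone.
order_items = [
    (1, 100, "Widget", 2),
    (1, 101, "Gadget", 1),
    (2, 100, "Widget", 5),
]

def to_2nf(rows):
    """Split out the column that depends on only part of the composite key."""
    items = [(order_id, product_id, qty) for order_id, product_id, _, qty in rows]
    products = sorted({(product_id, name) for _, product_id, name, _ in rows})
    return items, products

items_2nf, products_2nf = to_2nf(order_items)
```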

#### Third Normal Form (3NF)
- Must be in 2nd Normal Form
- No transitive dependencies
    - a transitive dependency is when a non-key column depends on another non-key column, so changing one non-key column might force another non-key column to change
- Remember: if A -> B and B -> C, then C depends on the key A only through B. Move B -> C into its own table so that you never have to go through B to get from A to C.
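
A small sketch of removing a transitive dependency, using hypothetical album data. `artist_name` depends on `artist_id`, which in turn depends on the key `album_id`, so the artist columns move to their own table.

```python
# Hypothetical rows: (album_id, album_name, artist_id, artist_name).
# album_id -> artist_id -> artist_name is a transitive dependency.
albums_2nf = [
    (1, "Album A", 10, "Artist X"),
    (2, "Album B", 10, "Artist X"),
    (3, "Album C", 20, "Artist Y"),
]

def to_3nf(rows):
    """Keep artist_id as a foreign key on albums; store artist_name once per artist."""
    albums = [(album_id, name, artist_id) for album_id, name, artist_id, _ in rows]
    artists = sorted({(artist_id, artist_name) for _, _, artist_id, artist_name in rows})
    return albums, artists

albums_3nf, artists_3nf = to_3nf(albums_2nf)
```

After the split, renaming an artist means updating a single row in `artists_3nf` instead of every album row.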


## Denormalisation
- is about trying to increase read performance by reducing the number of JOINs between tables (as JOINs can be slow). Data integrity can take a hit, as there will be more copies of the data that must be kept in sync.
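
The trade-off can be sketched with `sqlite3`: the normalised tables need a JOIN to answer the query, while a denormalised copy answers it from one table at the cost of duplicating `artist_name`. All table names and rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, artist_name TEXT NOT NULL);
CREATE TABLE albums (
    album_id   INTEGER PRIMARY KEY,
    album_name TEXT NOT NULL,
    artist_id  INTEGER NOT NULL REFERENCES artists(artist_id)
);
INSERT INTO artists VALUES (10, 'Artist X'), (20, 'Artist Y');
INSERT INTO albums VALUES (1, 'Album A', 10), (2, 'Album B', 10), (3, 'Album C', 20);

-- Denormalised copy: artist_name is repeated on every album row.
CREATE TABLE albums_denorm AS
    SELECT al.album_id, al.album_name, ar.artist_name
    FROM albums al JOIN artists ar ON ar.artist_id = al.artist_id;
""")

# Normalised read: needs a JOIN.
joined = db.execute("""
    SELECT al.album_name, ar.artist_name
    FROM albums al JOIN artists ar ON ar.artist_id = al.artist_id
    ORDER BY al.album_id
""").fetchall()

# Denormalised read: single table, no JOIN.
flat = db.execute(
    "SELECT album_name, artist_name FROM albums_denorm ORDER BY album_id"
).fetchall()
```

Both reads return the same rows, but renaming an artist now requires updating every matching row in `albums_denorm` as well as the row in `artists`.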

# NoSQL

## CAP Theorem
- Consistency: every read from the database gets the latest (and correct) piece of data or an error
- Availability: every request is received and a response is given -- without a guarantee that the data is the latest update
- Partition tolerance: the system continues to work regardless of losing network connectivity between nodes

The CAP Theorem implies that in the presence of a network partition, one has to choose between consistency and availability - i.e. you cannot guarantee both consistency and availability while the network is partitioned.

During a partition, a system can be consistent and partition tolerant (CP) or available and partition tolerant (AP), but not both.

## Apache Cassandra
- denormalisation is a MUST
- denormalisation must be done for fast reads
- Apache Cassandra is optimised for fast writes
- think queries first
- one table per query
- Apache Cassandra does not allow for JOINs between tables

## Primary keys, composite keys and clustering columns

### PRIMARY KEY and COMPOSITE KEY
- PRIMARY KEY is made up of either just the PARTITION KEY or may also include additional CLUSTERING COLUMNS
- simple PRIMARY KEY is just one column that is also the PARTITION KEY
- a composite PRIMARY KEY is made up of more than one column
- the PARTITION KEY determines the distribution of data across the system

### CLUSTERING COLUMN
- sorts the data within each partition, in ascending order by default
- more than one clustering column can be added (or none)
- when filtering in the WHERE clause of a SELECT statement, use the clustering columns in the order they are defined in the PRIMARY KEY
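
To make the roles of the two key parts concrete, here is a toy in-memory model in Python. It is only a sketch under stated assumptions, not how Apache Cassandra is implemented: `NUM_NODES`, `node_for`, `ToyTable` and the MD5-based hash are all made up for illustration (Cassandra uses its own partitioner). The partition key decides which node holds a row, and the clustering column decides the row order inside the partition.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size

def node_for(partition_key):
    """Map a partition key to a node with a stable hash (stand-in for a real partitioner)."""
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

class ToyTable:
    """Models PRIMARY KEY ((partition_key), clustering_key)."""

    def __init__(self):
        # One {partition_key: {clustering_key: value}} mapping per node.
        self.nodes = [defaultdict(dict) for _ in range(NUM_NODES)]

    def insert(self, partition_key, clustering_key, value):
        # Writing the same primary key again overwrites the earlier value.
        self.nodes[node_for(partition_key)][partition_key][clustering_key] = value

    def select(self, partition_key):
        # Only the node that owns the partition is consulted; rows come back
        # sorted by the clustering key in ascending order.
        partition = self.nodes[node_for(partition_key)].get(partition_key, {})
        return sorted(partition.items())

# Hypothetical data: partition key = year, clustering column = album name.
albums = ToyTable()
albums.insert(1970, "Album B", "Artist X")
albums.insert(1970, "Album A", "Artist Y")
albums.insert(1965, "Album C", "Artist X")

rows_1970 = albums.select(1970)
```

Rows for 1970 come back ordered by album name regardless of insertion order, and a lookup touches a single partition on a single node.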

# NOT NULL

# UNIQUE

# PRIMARY KEY

# Upsert