## Non-Relational Databases
### When to Use NoSQL
* **Need high availability**: The system is always up, with no downtime
* **Have large amounts of data**
* **Need linear scalability**: Adding more nodes to the system increases performance linearly
* **Need low latency**: A shorter delay between a request being received and the data being transferred
* **Need fast reads and writes**

### Apache Cassandra
* Open Source NoSQL DB -- go download the code!
* Masterless Architecture
* High Availability
* Linear Scalability
* Used by Uber, Netflix, Hulu, Twitter, Facebook, etc
* Major contributors to the project: DataStax, Facebook, Twitter, Apple
---------

## Distributed Databases
In a **distributed database**, you need copies of your data in order to have high availability.

### Eventual Consistency
Over time (if no new changes are made) each copy of the data will be the same, but if there are new changes, the data may be different in different locations. The data may be inconsistent for only milliseconds. There are workarounds in place to prevent getting stale data.
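
The convergence described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how Apache Cassandra is implemented: the `Replica` class, the `write` and `sync` helpers, and the node names are all hypothetical, and conflicts are resolved by keeping the write with the newest timestamp.

```python
class Replica:
    """One copy of the data: a value plus the timestamp of its last write."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.timestamp = 0

    def apply(self, value, timestamp):
        # Keep whichever write carries the newer timestamp.
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp


def write(replicas, value, timestamp, reachable):
    """Apply a write only to the replicas that are currently reachable."""
    for replica in replicas:
        if replica.name in reachable:
            replica.apply(value, timestamp)


def sync(replicas):
    """Background repair pass: copy the newest write to every replica."""
    newest = max(replicas, key=lambda r: r.timestamp)
    for replica in replicas:
        replica.apply(newest.value, newest.timestamp)


replicas = [Replica("node1"), Replica("node2"), Replica("node3")]
write(replicas, "friends=10", timestamp=1, reachable={"node1", "node2", "node3"})
write(replicas, "friends=11", timestamp=2, reachable={"node1"})  # node2 and node3 miss it

before = [r.value for r in replicas]  # copies disagree for a moment
sync(replicas)
after = [r.value for r in replicas]   # copies agree once changes stop

print(before)
print(after)
```

Between the second write and the `sync` call, a read from `node2` or `node3` would return the older value; after the repair pass every copy is the same.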

### Commonly Asked Questions:
**What does the network look like? Can you share any examples?**

In Apache Cassandra every node is connected to every other node -- it is a peer-to-peer database architecture.

**Is data deployment strategy an important element of data modeling in Apache Cassandra?**

Deployment strategies are a great topic, but have very little to do with data modeling. Developing deployment strategies focuses on determining how many clusters to create or determining how many nodes are needed. These are topics generally covered under database architecture, database deployment and operations, which we will not cover in this lesson. Here is a useful link to learn more about it for [Apache Cassandra](https://docs.datastax.com/en/dse-planning/doc/).

In general, the size of your data and your data model can affect your deployment strategy. You need to think about how to create a cluster, how many nodes should be in that cluster, and how to do the actual installation. More information about deployment strategies can be found on this [DataStax documentation page](https://docs.datastax.com/en/dse-planning/doc/).

### Cassandra Architecture
We are not going into a lot of details about the Apache Cassandra Architecture. However, if you would like to learn more about it for your job, here are some links that you may find useful.

**Apache Cassandra Data Architecture:**

* [Understanding the architecture](https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archTOC.html)
* [Cassandra Architecture](https://www.tutorialspoint.com/cassandra/cassandra_architecture.htm)

The following link will go more in-depth about the Apache Cassandra Data Model, how Cassandra reads, writes, updates, and deletes data.

* [Cassandra Documentation](https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlIntro.html)

--------

## CAP Theorem
A theorem in computer science that states that it is **impossible** for a distributed data store to **simultaneously provide** more than two out of the following three guarantees of **consistency**, **availability** and **partition tolerance**.

* **Consistency**: Every read from the database gets the latest (and correct) piece of data or an error

* **Availability**: Every request is received and a response is given -- without a guarantee that the data is the latest update

* **Partition Tolerance**: The system continues to work regardless of losing network connectivity between nodes

### Commonly Asked Questions:
**Is Eventual Consistency the opposite of what is promised by SQL database per the ACID principle?**

Much has been written about how Consistency is interpreted in the ACID principle and the CAP theorem. Consistency in the ACID principle refers to the requirement that only transactions that abide by constraints and database rules are written into the database; otherwise the database keeps its previous state. In other words, the data should be correct across all rows and tables. Consistency in the CAP theorem, however, refers to every read from the database getting the latest piece of data or an error.

To learn more, you may find this discussion useful:

* [Discussion about ACID vs. CAP](https://www.voltdb.com/blog/2015/10/22/disambiguating-acid-cap/)

**Which of these combinations is desirable for a production system - Consistency and Availability, Consistency and Partition Tolerance, or Availability and Partition Tolerance?**

As the CAP Theorem Wikipedia entry says, "The CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability." So there is no such thing as Consistency and Availability in a distributed database since it must always tolerate network issues. You can only have Consistency and Partition Tolerance (CP) or Availability and Partition Tolerance (AP). Remember, relational and non-relational databases do different things, and that's why most companies have both types of database systems.
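
The choice between CP and AP during a partition can be sketched with a toy replica. Everything here is hypothetical and for illustration only (the `Replica` class, its `connected` flag, and the `PartitionedError` exception); it models the trade-off, not any real database's behavior.

```python
class PartitionedError(Exception):
    """Raised by the CP-style read when the replica cannot confirm it is current."""


class Replica:
    def __init__(self, value):
        self.value = value
        self.connected = True  # False simulates a network partition

    def read_cp(self):
        # Consistency over availability: refuse to answer rather than risk stale data.
        if not self.connected:
            raise PartitionedError("cannot reach peers to confirm the latest value")
        return self.value

    def read_ap(self):
        # Availability over consistency: always answer, even if the value may be stale.
        return self.value


replica = Replica("friends=10")
replica.connected = False  # a newer write elsewhere can no longer reach this replica

ap_answer = replica.read_ap()
try:
    cp_answer = replica.read_cp()
except PartitionedError as err:
    cp_answer = "error: " + str(err)

print(ap_answer)
print(cp_answer)
```

The AP read returns a response that may be out of date, while the CP read returns an error, which matches the definitions of Availability and Consistency given earlier.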

**Does Cassandra meet just Availability and Partition Tolerance in the CAP theorem?**

According to the CAP theorem, a database can actually only guarantee two out of the three in CAP. So supporting Availability and Partition Tolerance makes sense, since Availability and Partition Tolerance are the biggest requirements.

**If Apache Cassandra is not built for consistency, won't the analytics pipeline break?**

Suppose you are doing analysis, such as determining a trend over time, e.g., how many friends John has on Twitter. If one person is missing from the count because of "eventual consistency" (the data may not be up-to-date in all locations), that's OK. In theory it can be an issue, but only if you are not constantly updating. If the pipeline pulls data from a node that has not yet received an update, you won't get that update. Remember, in Apache Cassandra it is about **Eventual Consistency**.

------

## Quiz 1
### Question 1 of 2
Match the CAP theorem term to its definition.

|CAP THEOREM TERM| DEFINITION|
|----------------|-----------|
|Consistency|Database will get the latest piece of data requested|
|Availability|Every request is received and a response is given.|
|Partition Tolerance|Network connectivity does not affect the system|

### Question 2 of 2
Apache Cassandra is...
(Check all that apply)
- [x] A Highly Available Database
- [x] Linearly scalable
- [ ] Used when ACID transactions are needed
- [x] An Open Source project supported by The Apache Foundation
- [ ] A Consistency and Partition Tolerant Database during network failures.

----

## De-normalization in Apache Cassandra
De-normalization of tables in Apache Cassandra is absolutely critical. The biggest takeaway when doing data modeling in Apache Cassandra is to think about your **queries** first. There are no `JOINs` in Apache Cassandra.

### Data Modeling in Apache Cassandra:
* Denormalization is not just okay -- it's a must
* Denormalization must be done for fast reads
* Apache Cassandra has been optimized for fast writes
* ALWAYS think Queries first
* One table per query is a great strategy
* Apache Cassandra does not allow for JOINs between tables
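
The "one table per query" idea can be sketched with plain Python dictionaries standing in for tables. This is an illustration only: the names `albums_by_year` and `albums_by_artist` are hypothetical, and no database is involved.

```python
from collections import defaultdict

albums = [
    (1970, "The Beatles", "Let it Be"),
    (1965, "The Beatles", "Rubber Soul"),
    (1965, "The Who", "My Generation"),
]

# "Table" for query 1: every album released in a given year.
albums_by_year = defaultdict(list)
# "Table" for query 2: every album by a given artist.
albums_by_artist = defaultdict(list)

for year, artist, album in albums:
    # The same record is written twice, once per query it has to serve.
    albums_by_year[year].append((artist, album))
    albums_by_artist[artist].append((year, album))

# Each query is a single lookup by key, with no join required.
print(albums_by_year[1965])
print(albums_by_artist["The Beatles"])
```

The data is duplicated on write so that each read touches exactly one structure, which is the trade the list above describes.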

### Commonly Asked Questions:
* **I see certain downsides of this approach, since in a production application, requirements change quickly and I may need to improve my queries later. Isn't that a downside of Apache Cassandra?**

  In Apache Cassandra, you model your data to your queries. If your business need calls for quickly changing requirements, you need to create a new table to process the data. That is a requirement of Apache Cassandra. If your business need calls for ad-hoc queries, these are not a strength of Apache Cassandra. However, keep in mind that it is easy to create a new table that fits your new query.

### Additional Resource:
Here is a reference to the DataStax documents on [Apache Cassandra](https://docs.datastax.com/en/dse/6.7/cql/cql/ddl/dataModelingApproach.html)

### Question 1 of 3
True or False: Apache Cassandra denormalization of tables in data modeling is required.
- [x] True
- [ ] False

### Question 2 of 3
True or False: When doing data modeling in Apache Cassandra 1 table per 1 query is a very acceptable practice.
- [x] True
- [ ] False

### Question 3 of 3
True or False: When doing data modeling in Apache Cassandra knowing your queries first and modeling to those queries is essential.
- [x] True
- [ ] False

---

## Cassandra Query Language: CQL
Cassandra Query Language (CQL) is the way to interact with the database and is similar to SQL. JOINs, GROUP BY, and subqueries are not supported by CQL.

---

## Lesson 3 Demo 1: 2 Queries 2 Tables

### In this demo we are going to walk through the basics of creating a table in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information. We will talk about the importance of Denormalization, and that 1 table per 1 query is an encouraged practice with Apache Cassandra. 

#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: 
! pip install cassandra-driver
#### More documentation can be found here:  https://datastax.github.io/python-driver/

#### Import Apache Cassandra python package

In [1]:
import cassandra

### First let's create a connection to the database

In [2]:
from cassandra.cluster import Cluster
try:
    cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance
    session = cluster.connect()
except Exception as e:
    print(e)

### Let's create a keyspace to do our work in 

In [3]:
try:
    session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity
    WITH REPLICATION = 
    {'class':'SimpleStrategy', 'replication_factor':1}
    """
    )
except Exception as e:
    print(e)

#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [4]:
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

### Let's imagine we would like to start creating a Music Library of albums. 

### We want to ask 2 questions of our data
#### 1. Give me every album in my music library that was released in a given year
`select * from music_library WHERE YEAR=1970`
#### 2. Give me every album in my music library that was created by a given artist  
`select * from album_library WHERE artist_name='The Beatles'`


### Because I want to do two different queries, I am going to need two different tables that partition the data differently. 
* My music library table will be by year, which will be my partition key, and artist name will be my clustering column to make each Primary Key unique. 
* My album library table will be by artist name, which will be my partition key, and year will be my clustering column to make each Primary Key unique. More on Primary keys in the next lesson and demo. 

`Table Name: music_library
column 1: Year
column 2: Artist Name
column 3: Album Name
PRIMARY KEY(year, artist name)`


` Table Name: album_library 
column 1: Artist Name
column 2: Year
column 3: Album Name
PRIMARY KEY (artist name, year)`


In [5]:
try:
    session.execute("""
                        CREATE TABLE IF NOT EXISTS music_library(
                            year INT,
                            artist_name TEXT,
                            album_name TEXT,
                            PRIMARY KEY(year,artist_name))        
    """)
except Exception as e:
    print(e)

try:
    session.execute("""
                        CREATE TABLE IF NOT EXISTS album_library(
                            artist_name TEXT,
                            year INT,
                            album_name TEXT,
                            PRIMARY KEY(artist_name,year))        
    """)
except Exception as e:
    print(e)

### Let's insert some data into both tables

In [6]:
query = "INSERT INTO music_library (year, artist_name, album_name)"
query = query + " VALUES (%s, %s, %s)"

query1 = "INSERT INTO album_library (artist_name, year, album_name)"
query1 = query1 + " VALUES (%s, %s, %s)"

try:
    session.execute(query, (1970, "The Beatles", "Let it Be"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Beatles", "Rubber Soul"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Who", "My Generation"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1966, "The Monkees", "The Monkees"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1970, "The Carpenters", "Close To You"))
except Exception as e:
    print(e)
    
try:
    session.execute(query1, ("The Beatles", 1970, "Let it Be"))
except Exception as e:
    print(e)
    
try:
    session.execute(query1, ("The Beatles", 1965, "Rubber Soul"))
except Exception as e:
    print(e)
    
try:
    session.execute(query1, ("The Who", 1965, "My Generation"))
except Exception as e:
    print(e)

try:
    session.execute(query1, ("The Monkees", 1966, "The Monkees"))
except Exception as e:
    print(e)

try:
    session.execute(query1, ("The Carpenters", 1970, "Close To You"))
except Exception as e:
    print(e)

### It might have felt unnatural to insert duplicate data into two tables. If I just normalized these tables, I wouldn't need extra copies! While this is true, remember there are no `JOINs` in Apache Cassandra. For the benefit of high availability and scalability, denormalization is how this must be done. 
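
One way to keep the duplicated writes manageable is to wrap them in a single helper. The sketch below is an assumption, not part of the demo: `insert_album` is a hypothetical function, and `RecordingSession` is a hypothetical stand-in that only records calls so the example runs without a cluster. With a real connection you would pass the driver's `session` instead. Note that the two writes are independent; this sketch does not make them atomic.

```python
def insert_album(session, year, artist_name, album_name):
    """Write one album to both query tables so they stay in sync."""
    session.execute(
        "INSERT INTO music_library (year, artist_name, album_name) VALUES (%s, %s, %s)",
        (year, artist_name, album_name),
    )
    session.execute(
        "INSERT INTO album_library (artist_name, year, album_name) VALUES (%s, %s, %s)",
        (artist_name, year, album_name),
    )


class RecordingSession:
    """Hypothetical stand-in for a driver session; it only records what was executed."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, parameters))


fake_session = RecordingSession()
insert_album(fake_session, 1970, "The Beatles", "Let it Be")

for query, parameters in fake_session.calls:
    print(query, parameters)
```

Each call to `insert_album` issues one insert per table, with the values reordered to match each table's column list.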


### Let's Validate our Data Model

`select * from music_library WHERE YEAR=1970`

In [7]:
query = "select * from music_library WHERE YEAR=1970"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.artist_name, row.album_name,)

1970 The Beatles Let it Be
1970 The Carpenters Close To You


### For the sake of the demo, I will drop the tables. 

In [8]:
query = "drop table music_library"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

query = "drop table album_library"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

### And Finally close the session and cluster connection

In [15]:
session.shutdown()
cluster.shutdown()

---

## Primary Key
* Must be unique
* The PRIMARY KEY is made up of either just the PARTITION KEY or may also include additional CLUSTERING COLUMNS
* A Simple PRIMARY KEY is just one column that is also the PARTITION KEY. A Composite PRIMARY KEY is made up of more than one column and will assist in creating a unique value and in your retrieval queries
* The PARTITION KEY will determine the distribution of data across the system
* May have one or more clustering columns.

### Partition Key
* The partition key's value will be hashed (turned into a number), and the row will be stored on the node in the system that holds that range of values.
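
The hash-then-route step can be sketched in a few lines. This is a simplified illustration under stated assumptions: MD5 from the standard library stands in for the cluster's real partitioner, and the `Ring` class, `token_for` function, node names, and evenly split token ranges are all hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key):
    """Hash a partition key to an integer token (MD5 used only as a stand-in)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # value in [0, 2**64)


class Ring:
    """Each node owns a contiguous range of tokens ending at its upper bound."""

    def __init__(self, node_names):
        self.nodes = list(node_names)
        span = 2 ** 64
        step = span // len(self.nodes)
        self.bounds = [step * (i + 1) for i in range(len(self.nodes))]
        self.bounds[-1] = span  # the last node covers the remainder of the range

    def node_for(self, partition_key):
        token = token_for(partition_key)
        # Find the first node whose upper bound is at or above the token.
        index = bisect.bisect_left(self.bounds, token)
        return self.nodes[index]


ring = Ring(["node1", "node2", "node3"])
for year in (1965, 1966, 1970):
    print(year, "->", ring.node_for(year))
```

The same key always hashes to the same token, so every row with that partition key lands on, and is later read from, the same node.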

---

## Quiz: Primary Key
### Question 1 of 2
True or False: Apache Cassandra supports duplicate rows.

- [ ] True
- [x] False

### Question 2 of 2
Which is better: Simple or Composite Primary Keys?
- [x] It depends on the data you have and the queries you will do
- [ ] Simple -- Simple is always better
- [ ] Composite
- [ ] Neither, it is better to use a Relational Database

---

## Lesson 3 Demo 2: Focus on Primary Key

### In this demo we are going to walk through the basics of creating a table with a good Primary Key in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information.

#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but to install this library  in the future you can run this command in a notebook to install locally: 
! pip install cassandra-driver
#### More documentation can be found here:  https://datastax.github.io/python-driver/

#### Import Apache Cassandra python package

In [2]:
import cassandra

### First let's create a connection to the database

In [3]:
from cassandra.cluster import Cluster
try: 
    cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance
    session = cluster.connect()
except Exception as e:
    print(e)

### Let's create a keyspace to do our work in 

In [4]:
try:
    session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

except Exception as e:
    print(e)

#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [5]:
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

### Let's imagine we would like to start creating a new Music Library of albums. We are going to work with one of the queries from Exercise 1.

### We want to ask 1 question of our data
#### 1. Give me every album in my music library that was released in a given year
`select * from music_library WHERE YEAR=1970`

### Here is our Collection of Data
### How should we model this data? What should be our Primary Key and Partition Key? Since our query filters on the YEAR, let's start with that. Is partitioning our data by year a good idea? In this case our data is very small, but if we had a larger dataset of albums, partitioning by YEAR might be a fine choice. We would need to validate that against our dataset. We want an equal spread of the data. 

`Table Name: music_library
column 1: Year
column 2: Artist Name
column 3: Album Name
Column 4: City
PRIMARY KEY(year)`

In [7]:
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + "(year int, artist_name text, album_name text, city text, PRIMARY KEY (year))"
try:
    session.execute(query)
except Exception as e:
    print(e)

### Let's insert our data into the table

In [8]:
query = "INSERT INTO music_library (year, artist_name, album_name, city)"
query = query + " VALUES (%s, %s, %s, %s)"

try:
    session.execute(query, (1970, "The Beatles", "Let it Be", "Liverpool"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Beatles", "Rubber Soul", "Oxford"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Who", "My Generation", "London"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1966, "The Monkees", "The Monkees", "Los Angeles"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1970, "The Carpenters", "Close To You", "San Diego"))
except Exception as e:
    print(e)

### Let's Validate our Data Model -- Did it work?? If we look for Albums from 1965 we should expect to see 2 rows.

`select * from music_library WHERE YEAR=1965`

In [9]:
query = "select * from music_library WHERE YEAR=1965"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.artist_name, row.album_name, row.city)

1965 The Who My Generation London


### That didn't work out as planned! Why is that? Because we did not create a unique primary key. Both 1965 rows share the same primary key value, so the second insert overwrote the first. 
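
Why only one row came back can be sketched with a plain dictionary, where writing to an existing key replaces the earlier value. This is an analogy under stated assumptions rather than the database's actual storage: the `load` helper and the tuple layout are hypothetical.

```python
rows = [
    (1970, "The Beatles", "Let it Be", "Liverpool"),
    (1965, "The Beatles", "Rubber Soul", "Oxford"),
    (1965, "The Who", "My Generation", "London"),
]


def load(rows, key_of):
    """Store rows keyed by primary key; a repeated key overwrites the earlier row."""
    table = {}
    for row in rows:
        table[key_of(row)] = row
    return table


# PRIMARY KEY (year): the two 1965 rows collide.
by_year = load(rows, key_of=lambda r: r[0])
# PRIMARY KEY (year, album_name): every row gets its own key.
by_year_album = load(rows, key_of=lambda r: (r[0], r[2]))

rows_1965_simple = [r for r in by_year.values() if r[0] == 1965]
rows_1965_composite = [r for r in by_year_album.values() if r[0] == 1965]

print(len(rows_1965_simple), len(rows_1965_composite))
```

With `year` alone as the key, the last 1965 row written is the only one that survives; adding `album_name` to the key keeps both.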

### Let's Try Again. Let's focus on making the PRIMARY KEY unique. Looking at our dataset, do we have anything that is unique for each row? We have a couple of options (City and Album Name), but on their own they will not support the query we need, which looks for albums in a particular year. Let's make a composite key of the `YEAR` AND `ALBUM NAME`. This assumes that an album name is unique within the year it was released (not a bad bet). But remember this is just a demo; you will need to understand your dataset fully (no betting!)

In [10]:
query = "CREATE TABLE IF NOT EXISTS music_library1 "
query = query + "(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, album_name))"
try:
    session.execute(query)
except Exception as e:
    print(e)

In [11]:
query = "INSERT INTO music_library1 (year, artist_name, album_name, city)"
query = query + " VALUES (%s, %s, %s, %s)"

try:
    session.execute(query, (1970, "The Beatles", "Let it Be", "Liverpool"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Beatles", "Rubber Soul", "Oxford"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Who", "My Generation", "London"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1966, "The Monkees", "The Monkees", "Los Angeles"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1970, "The Carpenters", "Close To You", "San Diego"))
except Exception as e:
    print(e)

### Let's Validate our Data Model -- Did it work?? If we look for Albums from 1965 we should expect to see 2 rows.

`select * from music_library1 WHERE YEAR=1965`

In [12]:
query = "select * from music_library1 WHERE YEAR=1965"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.artist_name, row.album_name, row.city)

1965 The Who My Generation London
1965 The Beatles Rubber Soul Oxford


### Success it worked! We created a unique Primary key that evenly distributed our data. 

### For the sake of the demo, I will drop the tables. 


In [13]:
query = "drop table music_library"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

query = "drop table music_library1"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)

### And Finally close the session and cluster connection

In [14]:
session.shutdown()
cluster.shutdown()

-----------------

## Clustering Columns
The **PRIMARY KEY** is made up of either just the **PARTITION KEY** or with the addition of **CLUSTERING COLUMNS**. The **CLUSTERING COLUMN** will determine the sort order within a Partition.

* The clustering columns sort the data in **ascending order** by default, e.g., alphabetical order for text.
* More than one clustering column can be added (or none!)
* From there the clustering columns will sort in order of how they were added to the primary key

        CREATE TABLE music_library (
            year INT,
            artist_name TEXT,
            album_name TEXT,
            PRIMARY KEY ((year), artist_name, album_name)
        );

|year|artist_name|album_name|
|----|-----------|----------|
|1965|Elvis      |Blue Hawaii|
|1965|The Beatles|Rubber Soul|
|1965|The Beatles|Showing order|
|1965|The Monkees|Meet the Monkees|

Here the primary key is the combination of the partition key, `year`, and two clustering columns, `artist_name` and `album_name`.
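
The ordering in the table above can be reproduced with an ordinary sort on the clustering columns, in the order they appear in the primary key. This is only a sketch in plain Python; the tuple layout is an assumption, and no database is involved.

```python
rows = [
    (1965, "The Monkees", "Meet the Monkees"),
    (1965, "The Beatles", "Showing order"),
    (1965, "Elvis", "Blue Hawaii"),
    (1965, "The Beatles", "Rubber Soul"),
]

# Within the 1965 partition, sort by artist_name first and album_name second,
# mirroring PRIMARY KEY ((year), artist_name, album_name).
partition_1965 = sorted(rows, key=lambda r: (r[1], r[2]))

for row in partition_1965:
    print(row)
```

The second clustering column only matters when the first one ties, which is why the two Beatles albums are ordered by album name.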

### Commonly Asked Questions:
**How many clustering columns can we add?**

You can use as many clustering columns as you would like. In the SELECT statement, clustering columns must be used in the order they were defined in the primary key. You may omit trailing clustering columns, and that's OK, but you cannot skip one and then filter on a later one.
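
The ordering rule can be expressed as a small check. This is a simplified sketch covering equality filters only; it ignores range filters, `IN`, secondary indexes, and `ALLOW FILTERING`, and the function name `where_is_valid` is hypothetical.

```python
def where_is_valid(partition_key, clustering_columns, where_columns):
    """Return True if the filtered columns are the full partition key plus a
    leading run (prefix) of the clustering columns, and nothing else."""
    where = list(where_columns)
    # Every partition key column must be restricted.
    if not all(column in where for column in partition_key):
        return False
    rest = [column for column in where if column not in partition_key]
    if len(rest) != len(set(rest)):
        return False  # the same column is filtered twice
    # What remains must be exactly the first len(rest) clustering columns.
    return set(rest) == set(clustering_columns[:len(rest)])


partition_key = ["year"]
clustering = ["artist_name", "album_name"]

print(where_is_valid(partition_key, clustering, ["year"]))
print(where_is_valid(partition_key, clustering, ["year", "artist_name"]))
print(where_is_valid(partition_key, clustering, ["year", "album_name"]))
```

The last call skips `artist_name` and filters on `album_name` directly, which is the out-of-order usage the answer above warns about.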

### QUESTION 1 OF 2
The PRIMARY KEY is made up of...
- [ ] The composite key, the primary key and the clustering key
- [ ] The composite key, the partition key and the clustering columns
- [x] The partition key and the clustering columns

### QUESTION 2 OF 2
A Clustering Column is required in the Primary Key
- [ ] True
- [x] False

---

## Lesson 3 Demo 3: Focus on Clustering Columns

### In this demo we are going to walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information.

#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: 
! pip install cassandra-driver
#### More documentation can be found here:  https://datastax.github.io/python-driver/

#### Import Apache Cassandra python package

In [17]:
import cassandra

### First let's create a connection to the database

In [18]:
from cassandra.cluster import Cluster
try: 
    cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance
    session = cluster.connect()
except Exception as e:
    print(e)

### Let's create a keyspace to do our work in 

In [19]:
try:
    session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

except Exception as e:
    print(e)

#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [20]:
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

### Let's imagine we would like to start creating a new Music Library of albums. 

### We want to ask 1 question of our data
#### 1. Give me every album in my music library that was released by a given artist, sorted by album name and then by city (ascending, the default clustering order)
`select * from music_library WHERE ARTIST_NAME='The Beatles'`


### How should we model this data? What should be our Primary Key and Partition Key? Since our query filters on the `ARTIST_NAME`, let's start with that. From there we will need to add other elements to make sure the key is unique. We also need to add the `ALBUM_NAME` and `CITY` as Clustering Columns to sort the data. That should be enough to make the row key unique.

`Table Name: music_library
column 1: Year
column 2: Artist Name
column 3: Album Name
Column 4: City
PRIMARY KEY(artist name, album name, city)`

In [21]:
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + "(year int, artist_name text, album_name text, city text, PRIMARY KEY (artist_name, album_name, city))"
try:
    session.execute(query)
except Exception as e:
    print(e)

### Let's insert our data into the table

In [22]:
query = "INSERT INTO music_library (year, artist_name, album_name, city)"
query = query + " VALUES (%s, %s, %s, %s)"

try:
    session.execute(query, (1970, "The Beatles", "Let it Be", "Liverpool"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Beatles", "Rubber Soul", "Oxford"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1964, "The Beatles", "Beatles For Sale", "London"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1966, "The Monkees", "The Monkees", "Los Angeles"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1970, "The Carpenters", "Close To You", "San Diego"))
except Exception as e:
    print(e)

### Let's Validate our Data Model -- Did it work?? If we look for Albums from The Beatles we should expect to see 3 rows.

`select * from music_library WHERE ARTIST_NAME='The Beatles'`

In [23]:
query = "select * from music_library WHERE ARTIST_NAME='The Beatles'"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.artist_name, row.album_name, row.city, row.year)

The Beatles Beatles For Sale London 1964
The Beatles Let it Be Liverpool 1970
The Beatles Rubber Soul Oxford 1965


### Success it worked! We created a unique Primary key that evenly distributed our data, with clustering columns that sorted our data. 

### For the sake of the demo, I will drop the table. 

In [24]:
query = "drop table music_library"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)


### And Finally close the session and cluster connection

In [25]:
session.shutdown()
cluster.shutdown()

----

## WHERE Clause
* Data Modeling in Apache Cassandra is query focused, and that focus needs to be on the **WHERE** clause.
* The **PARTITION KEY** must be included in your query and any **CLUSTERING COLUMNS** can be used in the order they appear in your **PRIMARY KEY**.

<img src="images/where_sample_queries.png">

### Select * from table
The `WHERE` clause must be included to execute queries. For performance reasons, it is recommended that only one partition be queried at a time. It is possible to do a `select * from table` if you add the `ALLOW FILTERING` option to your query. This is risky, but available if absolutely necessary.

AVOID using "ALLOW FILTERING": Here is a reference in [DataStax](https://www.datastax.com/dev/blog/allow-filtering-explained-2) that explains ALLOW FILTERING and why you should not use it.

### Commonly Asked Questions:
**Why do we need to use a WHERE statement since we are not concerned about analytics? Is it only for debugging purposes?**

The `WHERE` statement allows us to do fast reads. With Apache Cassandra, we are talking about big data -- think terabytes of data -- so we are making it fast for read purposes. Data is spread across all the nodes. By using the `WHERE` statement, we know which node to go to in order to get that data and serve it back. For example, imagine we have 10 years of data on 10 nodes or servers, so each year's data is on a separate node. By using the `WHERE year = 1` statement, we know which node to visit to pull the data from quickly.
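
The 10-years-on-10-nodes example can be sketched as a comparison of how many nodes each kind of read has to contact. This is a toy model: the `nodes` dictionary, the row strings, and both helper functions are hypothetical, and a dictionary lookup stands in for hashing the partition key.

```python
# Hypothetical cluster: 10 nodes, each holding the partition for one year.
nodes = {year: [f"album released in year {year}"] for year in range(1, 11)}


def read_with_partition_key(nodes, year):
    """Go straight to the node that owns the partition: one node contacted."""
    return nodes[year], 1


def read_without_partition_key(nodes, predicate):
    """No partition key: every node must be asked and its rows filtered."""
    matches = []
    for rows in nodes.values():
        matches.extend(row for row in rows if predicate(row))
    return matches, len(nodes)


rows_fast, contacted_fast = read_with_partition_key(nodes, 1)
rows_slow, contacted_slow = read_without_partition_key(
    nodes, lambda row: row.endswith("year 1")
)

print(rows_fast, contacted_fast)
print(rows_slow, contacted_slow)
```

Both reads return the same row, but the second one touches every node to find it, which is the cost the `WHERE` clause on the partition key avoids.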

### QUIZ QUESTION
Can you do SELECT * FROM myTable in Apache Cassandra?
- [ ] Yes
- [ ] No
- [x] It is highly discouraged as performance will be slow (or may just fail) but is possible with a configuration setting.
- [ ] Yes, and no one should worry about it.
---

## Lesson 3 Demo 4: Using the WHERE Clause


### In this demo we are going to walk through the basics of using the WHERE clause in Apache Cassandra.

#### We will use a python wrapper/ python driver called cassandra to run the Apache Cassandra queries. This library should be preinstalled but in the future to install this library you can run this command in a notebook to install locally: 
! pip install cassandra-driver
#### More documentation can be found here:  https://datastax.github.io/python-driver/

#### Import Apache Cassandra python package

In [40]:
import cassandra

### First let's create a connection to the database

In [41]:
from cassandra.cluster import Cluster
try: 
    cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance
    session = cluster.connect()
except Exception as e:
    print(e)

### Let's create a keyspace to do our work in 

In [42]:
try:
    session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

except Exception as e:
    print(e)

#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [43]:
try:
    session.set_keyspace('udacity')
except Exception as e:
    print(e)

### Let's imagine we would like to start creating a new Music Library of albums. 
### We want to ask 4 questions of our data
#### 1. Give me every album in my music library that was released in a given year
`select * from music_library WHERE YEAR=1970`
#### 2. Give me the album that is in my music library that was released in a given year by "The Beatles"
`select * from music_library WHERE YEAR = 1970 AND ARTIST_NAME = 'The Beatles'`
#### 3. Give me all the albums released in a given year in a given location 
`select * from music_library WHERE YEAR = 1970 AND CITY = 'Liverpool'`
#### 4. Give me the city where the album "Let it Be" was recorded
`select city from music_library WHERE YEAR = 1970 AND ARTIST_NAME = 'The Beatles' AND ALBUM_NAME='Let it Be'`


### Here is our Collection of Data

### How should we model this data? What should be our Primary Key and Partition Key? Since our queries filter on the YEAR, let's start with that. From there we will add clustering columns on Artist Name and Album Name.

`Table Name: music_library
column 1: Year
column 2: Artist Name
column 3: Album Name
Column 4: City
PRIMARY KEY(year, artist_name, album_name)`

In [44]:
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + "(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, artist_name, album_name))"
try:
    session.execute(query)
except Exception as e:
    print(e)

### Let's insert our data into the table

In [45]:
query = "INSERT INTO music_library (year, artist_name, album_name, city)"
query = query + " VALUES (%s, %s, %s, %s)"

try:
    session.execute(query, (1970, "The Beatles", "Let it Be", "Liverpool"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Beatles", "Rubber Soul", "Oxford"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Who", "My Generation", "London"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1966, "The Monkees", "The Monkees", "Los Angeles"))
except Exception as e:
    print(e)

try:
    session.execute(query, (1970, "The Carpenters", "Close To You", "San Diego"))
except Exception as e:
    print(e)

### Let's Validate our Data Model with our 4 queries.

`select * from music_library WHERE YEAR=1970`

In [46]:
query = "select * from music_library WHERE YEAR=1970"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.artist_name, row.album_name, row.city)

1970 The Beatles Let it Be Liverpool
1970 The Carpenters Close To You San Diego


### Success it worked! Let's try the 2nd query.

`select * from music_library WHERE YEAR = 1970 AND ARTIST_NAME = 'The Beatles'`

In [47]:
query = "select * from music_library WHERE YEAR=1970 AND ARTIST_NAME = 'The Beatles'"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.artist_name, row.album_name, row.city)

1970 The Beatles Let it Be Liverpool


### Success it worked! Let's try the 3rd query.
`select * from music_library WHERE YEAR = 1970 AND CITY = 'Liverpool'`

In [48]:
query = "select * from music_library WHERE YEAR = 1970 AND city = 'Liverpool'"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.year, row.artist_name, row.album_name, row.city)

Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"


### Error! You cannot filter on a regular column, or on a clustering column, unless the clustering columns that come before it in the primary key are also used. Here `city` is not part of the primary key at all. Let's try the 4th query. 
`select city from music_library WHERE YEAR = 1970 AND ARTIST_NAME = 'The Beatles' AND ALBUM_NAME='Let it Be'`


In [49]:
query = "select city from music_library WHERE YEAR = 1970 AND ARTIST_NAME = 'The Beatles' AND ALBUM_NAME='Let it Be'"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    
for row in rows:
    print (row.city)

Liverpool


### For the sake of the demo, I will drop the table. 

In [50]:
query = "drop table music_library"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)


### And Finally close the session and cluster connection

In [51]:
session.shutdown()
cluster.shutdown()

---