# Lesson 3 Demo 4: Using the WHERE Clause
<img src="https://upload.wikimedia.org/wikipedia/commons/5/5e/Cassandra_logo.svg" width="250" height="250">

### In this exercise we are going to walk through the basics of using the WHERE clause in Apache Cassandra.

`#####` denotes where the code needs to be completed.

Note: __Do not__ click the blue Preview button in the lower task bar

#### We will use the Python driver for Apache Cassandra, imported as `cassandra`, to run the Apache Cassandra queries. This library should be preinstalled; to install it yourself in the future, run this command in a notebook cell:
! pip install cassandra-driver
#### More documentation can be found here:  https://datastax.github.io/python-driver/

<h3><span style='color:blue'>Using K8S Cassandra</span></h3>

You need a Kubernetes (k8s) cluster available, for example Minikube, Minishift, or Docker (with K8s).

Helm is needed too; see [helm.sh](http://helm.sh).

In [1]:
#Checks if Helm V3 is available
helm_version = !helm version --short
assert helm_version[0][:2] == 'v3', "Expected HELM version not available, visit https://helm.sh"

In [2]:
!helm repo add bitnami https://charts.bitnami.com/bitnami

"bitnami" has been added to your repositories


In [3]:
CHART_INSTANCE_NAME = 'dend-l3e4'
CASSANDRA_KEYSPACE = CHART_INSTANCE_NAME.replace('-','_')
CASSANDRA_PASSWORD = 'password'

In [4]:
%%writefile dend-cassandra-customize.yaml
service:
    type: NodePort
    nodePorts:
        cql: 30942
        rcp: 30160
dbUser:
    user: cassandra
    password: password

Overwriting dend-cassandra-customize.yaml


In [5]:
helm_chart_out = !helm install {CHART_INSTANCE_NAME} bitnami/cassandra --values dend-cassandra-customize.yaml
#for c_out in helm_chart_out: print(c_out)

In [6]:
#Thanks to @reuvenharrison https://medium.com/@reuvenharrison/how-to-wait-for-a-kubernetes-pod-to-be-ready-one-liner-144bbbb5a76f
!while [[ $(kubectl get pods -l app=cassandra -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}') != "True" ]]; do echo "Waiting for Cassandra pod to be ready" && sleep 5; done

Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready
Waiting for Cassandra pod to be ready


In [7]:
!kubectl get pods,svc

NAME                        READY   STATUS    RESTARTS   AGE
pod/dend-l3e4-cassandra-0   1/1     Running   0          68s

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                        AGE
service/dend-l3e4-cassandra            NodePort    10.103.180.114   <none>        9042:30942/TCP,9160:31656/TCP                  68s
service/dend-l3e4-cassandra-headless   ClusterIP   None             <none>        7000/TCP,7001/TCP,7199/TCP,9042/TCP,9160/TCP   68s
service/kubernetes                     ClusterIP   10.96.0.1        <none>        443/TCP                                        52m


In [8]:
installed_tabulate = !pip list|fgrep tabulate
if len(installed_tabulate) == 0:
    !pip install tabulate

from dend_cassandra_commons import run_query, print_query
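
The `dend_cassandra_commons` module is not shown in this notebook. The following is a minimal sketch of what `run_query` and `print_query` might look like, inferred only from how they are called here. The signatures, the `params` argument name, and the plain-text table formatting are assumptions (the output in this notebook suggests the real helper uses `tabulate`). Unlike the helper used below, this sketch returns early when a query fails instead of raising an `AttributeError`.

```python
def run_query(session, query, params=None):
    """Execute a CQL statement; print the error and return None on failure."""
    try:
        return session.execute(query, params)
    except Exception as e:
        print(f"Error: {e}")
        return None


def print_query(session, query, params=None):
    """Run a query, print the result as an aligned table, and return the rows."""
    result = run_query(session, query, params)
    if result is None:
        # The query failed and run_query already printed the error.
        return None
    headers = list(result.column_names)
    rows = [[str(value) for value in row] for row in result]
    # Column width = widest cell in that column, header included.
    widths = [
        max([len(header)] + [len(row[i]) for row in rows])
        for i, header in enumerate(headers)
    ]
    print("  ".join(h.ljust(w) for h, w in zip(headers, widths)))
    print("  ".join("-" * w for w in widths))
    for row in rows:
        print("  ".join(cell.ljust(w) for cell, w in zip(row, widths)))
    return rows
```

Both functions only need an object with an `execute(query, params)` method, so they can be tried against a stub before a cluster is available.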

#### Import Apache Cassandra python package

In [9]:
import cassandra

### First let's create a connection to the database

In [10]:
# This should make a connection to the Cassandra instance running in your Kubernetes cluster
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

try: 
    # Added connection auth for bitnami / helm / cassandra bundle
    auth_provider = PlainTextAuthProvider(username='cassandra', password='password')
    cluster = Cluster(['127.0.0.1'], port=30942, auth_provider=auth_provider) # 30942 is the CQL NodePort set in dend-cassandra-customize.yaml
    session = cluster.connect()
    print(session.hosts)
except Exception as e:
    print(f"Error: {e}")

[<Host: 127.0.0.1:30942 datacenter1>]


### Let's create a keyspace to do our work in 

In [11]:
try:
    session.execute(f"""
    CREATE KEYSPACE IF NOT EXISTS {CASSANDRA_KEYSPACE} 
    WITH REPLICATION = 
    {{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 }}"""
)

except Exception as e:
    print(e)

#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [12]:
try:
    session.set_keyspace(CASSANDRA_KEYSPACE)
except Exception as e:
    print(e)

### Let's imagine we would like to start creating a new Music Library of albums. 
### We want to ask 4 questions of our data
#### 1. Give me every album in my music library that was released in 1965
#### 2. Give me the album that is in my music library that was released in 1965 by "The Beatles"
#### 3. Give me all the albums released in a given year that were made in London 
#### 4. Give me the city where the album "Rubber Soul" was recorded

### Here is our Collection of Data
* (1970, "The Beatles", "Let it Be", "Liverpool")
* (1965, "The Beatles", "Rubber Soul", "Oxford")
* (1965, "The Who", "My Generation", "London")
* (1966, "The Monkees", "The Monkees", "Los Angeles")
* (1970, "The Carpenters", "Close To You", "San Diego")

### How should we model this data? What should be our Primary Key and Partition Key? Since most of our queries filter on the YEAR, let's start with that as the Partition Key. From there we will add clustering columns on Artist Name and Album Name.
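
To make the roles of the partition key and the clustering columns concrete, here is a small pure-Python sketch of how the five rows would be grouped under this model. It is a simplification for illustration only: the `layout` function is hypothetical, and a real cluster hashes the partition key to decide where each partition lives, so partitions are not stored in year order.

```python
from collections import defaultdict

# The five rows from the collection above: (year, artist_name, album_name, city)
rows = [
    (1970, "The Beatles", "Let it Be", "Liverpool"),
    (1965, "The Beatles", "Rubber Soul", "Oxford"),
    (1965, "The Who", "My Generation", "London"),
    (1966, "The Monkees", "The Monkees", "Los Angeles"),
    (1970, "The Carpenters", "Close To You", "San Diego"),
]


def layout(rows):
    """Group rows by partition key (year), ordered by the clustering columns."""
    partitions = defaultdict(list)
    for year, artist_name, album_name, city in rows:
        partitions[year].append((artist_name, album_name, city))
    # Within one partition, rows are ordered by (artist_name, album_name).
    return {
        year: sorted(albums, key=lambda album: album[:2])
        for year, albums in partitions.items()
    }


partitions = layout(rows)
for year, albums in partitions.items():
    print(year, albums)
```

A query that names a year only has to read one of these groups, which is why `year` is a natural first restriction in a WHERE clause for this table.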

In [13]:
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + "(year int, artist_name text, album_name text, city text, PRIMARY KEY (year, artist_name, album_name))"
try:
    session.execute(query)
except Exception as e:
    print(e)

### Let's insert our data into the table

In [14]:
music_lib_ins_query = "INSERT INTO music_library (year, artist_name, album_name, city)"
music_lib_ins_query += " VALUES (%s, %s, %s, %s)"
    
music_library_data = [
    (1970, "The Beatles", "Let it Be", "Liverpool"),
    (1965, "The Beatles", "Rubber Soul", "Oxford"),
    (1965, "The Who", "My Generation", "London"),
    (1966, "The Monkees", "The Monkees", "Los Angeles"),
    (1970, "The Carpenters", "Close To You", "San Diego"),
]

for ele in music_library_data:
    run_query(session, music_lib_ins_query, ele)

In [15]:
music_lib_sel_query_all = "select * from music_library"
_ = print_query(session, music_lib_sel_query_all)

  year  artist_name     album_name     city
------  --------------  -------------  -----------
  1965  The Beatles     Rubber Soul    Oxford
  1965  The Who         My Generation  London
  1970  The Beatles     Let it Be      Liverpool
  1970  The Carpenters  Close To You   San Diego
  1966  The Monkees     The Monkees    Los Angeles


### Let's Validate our Data Model with our 4 queries.

Query 1: _Give me every album in my music library that was released in 1965_

In [16]:
query_1 = "SELECT * from music_library where year = 1965"

_ = print_query(session, query_1)

  year  artist_name    album_name     city
------  -------------  -------------  ------
  1965  The Beatles    Rubber Soul    Oxford
  1965  The Who        My Generation  London


### Let's try the 2nd query.
Query 2: _Give me the album that is in my music library that was released in 1965 by "The Beatles"_

In [17]:
query_2 = "SELECT * from music_library where year = 1965 and artist_name = 'The Beatles'"

_ = print_query(session, query_2)

  year  artist_name    album_name    city
------  -------------  ------------  ------
  1965  The Beatles    Rubber Soul   Oxford


### Let's try the 3rd query.
Query 3: _Give me all the albums released in a given year that were made in London_

In [18]:
query_3 = "SELECT * from music_library where city = 'London'"

print_query(session, query_3)

Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"


AttributeError: 'NoneType' object has no attribute 'column_names'

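### Did you get an error? `city` is not part of the primary key, so Cassandra will not filter on it unless you add ALLOW FILTERING. More generally, without ALLOW FILTERING a clustering column can only be restricted if the partition key and every preceding clustering column are restricted too. Let's see if we can try it a different way. 
Try Query 4: _Give me the city where the album "Rubber Soul" was recorded_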



In [19]:
query_4 = "SELECT city from music_library where year = 1965 and artist_name = 'The Who'" 

_ = print_query(session, query_4)

city
------
London


In [20]:
query_4_1 = "select city from music_library where album_name = 'Rubber Soul'"

_ = print_query(session, query_4_1)

Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "album_name" cannot be restricted as preceding column "artist_name" is not restricted"


AttributeError: 'NoneType' object has no attribute 'column_names'
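
The two server errors above follow from the same ordering rule on the primary key `(year, artist_name, album_name)`. The sketch below is a deliberately simplified model of that rule for equality restrictions only. The `check_where` function and its messages are hypothetical paraphrases, not Cassandra's own validation logic, and it ignores ALLOW FILTERING, secondary indexes, and range or IN restrictions.

```python
PARTITION_KEY = ["year"]
CLUSTERING_COLUMNS = ["artist_name", "album_name"]


def check_where(restricted):
    """Return (ok, reason) for the columns restricted with '=' in a WHERE clause."""
    restricted = set(restricted)
    if not restricted:
        return True, "no WHERE clause (reads the whole table)"

    # Columns outside the primary key cannot be filtered on directly.
    non_key = restricted - set(PARTITION_KEY + CLUSTERING_COLUMNS)
    if non_key:
        return False, f"{sorted(non_key)} not in primary key; would need filtering"

    # Clustering columns must be restricted in order, with no gaps.
    first_gap = None
    for column in CLUSTERING_COLUMNS:
        if column in restricted:
            if first_gap is not None:
                return False, f"{column} restricted but preceding {first_gap} is not"
        elif first_gap is None:
            first_gap = column

    # The partition key must be fully restricted.
    missing = [c for c in PARTITION_KEY if c not in restricted]
    if missing:
        return False, f"partition key {missing} not restricted"
    return True, "ok"


for columns in (
    ["year"],
    ["year", "artist_name"],
    ["city"],
    ["album_name"],
    ["year", "artist_name", "album_name"],
):
    print(columns, check_where(columns))
```

In this model, Queries 1 and 2 pass, while the `city` and `album_name` attempts fail for the same reasons the server reported.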

In [21]:
query_4 = "SELECT city from music_library where year = 1965 and artist_name = 'The Beatles'" 

_ = print_query(session, query_4)

city
------
Oxford
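
Neither version of `query_4` above restricts `album_name`, so they return the city for an artist and year rather than for the album "Rubber Soul" specifically. One way to name the album without breaking the ordering rule is to restrict all three primary key columns. The sketch below assumes the `music_library` table created earlier and the driver's default row factory (rows readable by attribute); `city_for_album` is a hypothetical helper name.

```python
def city_for_album(session, year, artist_name, album_name):
    """Return the recording city for one album, or None if no row matches."""
    # Every primary key column is restricted, in order, so no filtering is needed.
    query = (
        "SELECT city FROM music_library "
        "WHERE year = %s AND artist_name = %s AND album_name = %s"
    )
    rows = list(session.execute(query, (year, artist_name, album_name)))
    return rows[0].city if rows else None
```

With the five rows inserted earlier, `city_for_album(session, 1965, "The Beatles", "Rubber Soul")` should return `"Oxford"`. The `%s` placeholders are the same parameter style used in the INSERT statement above.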


### And Finally close the session and cluster connection

In [22]:
session.shutdown()
cluster.shutdown()

<h2><span style='color:blue'>Remove Environment</span></h2>

In [23]:
# Removes chart instances
!helm uninstall {CHART_INSTANCE_NAME}

release "dend-l3e4" uninstalled


In [24]:
# Removes persistent Volume
!kubectl get pvc|fgrep {CHART_INSTANCE_NAME}|cut -d ' '  -f1| xargs -t kubectl delete pvc

kubectl delete pvc data-dend-l3e4-cassandra-0
persistentvolumeclaim "data-dend-l3e4-cassandra-0" deleted


In [25]:
!kubectl get pvc

No resources found in default namespace.
