# L3 Exercise 3: Focus on Clustering Columns
<img src="https://upload.wikimedia.org/wikipedia/commons/5/5e/Cassandra_logo.svg" width="250" height="250">

### Walk through the basics of creating a table with a good Primary Key and Clustering Columns in Apache Cassandra, inserting rows of data, and doing a simple CQL query to validate the information.

#### We will use the Python driver for Apache Cassandra (package `cassandra-driver`, imported as `cassandra`) to run the queries. It should be preinstalled; if it is not, run this command in a notebook cell to install it locally:
! pip install cassandra-driver
#### More documentation can be found here: https://datastax.github.io/python-driver/

<h3><span style='color:blue'>Using K8S Cassandra</span></h3>
​
You need a Kubernetes (K8s) cluster available, for example Minikube, Minishift, or Docker (with Kubernetes enabled).

Helm is also needed; see [helm.sh](http://helm.sh).

In [1]:
#Checks if Helm V3 is available
helm_version = !helm version --short
assert helm_version[0][:2] == 'v3', "Expected HELM version not available, visit https://helm.sh"

In [2]:
!helm repo add bitnami https://charts.bitnami.com/bitnami

"bitnami" has been added to your repositories


In [3]:
CHART_INSTANCE_NAME = 'dend-l3e3'
CASSANDRA_KEYSPACE = CHART_INSTANCE_NAME.replace('-','_')
CASSANDRA_PASSWORD = 'password'
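
The keyspace name is derived from the Helm release name by replacing hyphens, because unquoted CQL identifiers may only contain letters, digits, and underscores and must start with a letter. A minimal sketch of that rule follows; the helper name `to_keyspace_name` is hypothetical and not part of the notebook.

```python
import re

# Pattern for unquoted CQL identifiers: a letter followed by letters,
# digits, or underscores. A hyphenated Helm release name does not match.
CQL_IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def to_keyspace_name(release_name: str) -> str:
    """Turn a Helm release name into a valid unquoted CQL keyspace name.

    Hypothetical helper for illustration only.
    """
    candidate = release_name.replace("-", "_")
    if not CQL_IDENTIFIER.fullmatch(candidate):
        raise ValueError(f"{candidate!r} is not a valid unquoted CQL identifier")
    return candidate


print(to_keyspace_name("dend-l3e3"))
```

With the release name used in this notebook, the result matches the value assigned to `CASSANDRA_KEYSPACE`.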

In [4]:
%%writefile dend-cassandra-customize.yaml
service:
    type: NodePort
    nodePorts:
        cql: 30942
        rcp: 30160
dbUser:
    user: cassandra
    password: password

Overwriting dend-cassandra-customize.yaml


In [5]:
helm_chart_out = !helm install {CHART_INSTANCE_NAME} bitnami/cassandra --values dend-cassandra-customize.yaml
#for c_out in helm_chart_out: print(c_out)

In [6]:
#Thanks to @reuvenharrison https://medium.com/@reuvenharrison/how-to-wait-for-a-kubernetes-pod-to-be-ready-one-liner-144bbbb5a76f
!while [[ $(kubectl get pods -l app=cassandra -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}') != "True" ]]; do echo "Waiting for Cassandra pod to be ready" && sleep 5; done
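
The shell loop above waits forever if the pod never becomes ready. As a rough alternative, the same polling idea can be written in Python with a timeout. This is only a sketch: `wait_until` and `cassandra_pod_ready` are hypothetical names, and the `kubectl` arguments are copied from the one-liner above.

```python
import subprocess
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool],
               timeout_s: float = 300.0,
               interval_s: float = 5.0) -> bool:
    """Poll is_ready() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def cassandra_pod_ready() -> bool:
    """Ask kubectl whether the pod's Ready condition is 'True'."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-l", "app=cassandra", "-o",
         'jsonpath={..status.conditions[?(@.type=="Ready")].status}'],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "True"


# Usage (requires kubectl and a running cluster, so it is not called here):
# if not wait_until(cassandra_pod_ready):
#     raise TimeoutError("Cassandra pod did not become ready in time")
```

Separating the polling loop from the readiness check keeps the loop testable without a cluster.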

In [7]:
!kubectl get pods,svc

NAME                        READY   STATUS    RESTARTS   AGE
pod/dend-l3e3-cassandra-0   1/1     Running   0          6m35s

NAME                                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                        AGE
service/dend-l3e3-cassandra            NodePort    10.99.154.169   <none>        9042:30942/TCP,9160:31756/TCP                  6m35s
service/dend-l3e3-cassandra-headless   ClusterIP   None            <none>        7000/TCP,7001/TCP,7199/TCP,9042/TCP,9160/TCP   6m35s
service/kubernetes                     ClusterIP   10.96.0.1       <none>        443/TCP                                        11m


In [12]:
installed_tabulate = !pip list|fgrep tabulate
if len(installed_tabulate) == 0:
    !pip install tabulate

from dend_cassandra_commons import run_query, print_query
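
The module `dend_cassandra_commons` is not shown in this notebook. The sketch below is a guess at what helpers like `run_query` and `print_query` might look like, not the actual implementation. It assumes rows behave like named tuples (the driver's default) and formats the table by hand instead of calling `tabulate`, so that it has no third-party dependency. `format_rows` and `FakeSession` are hypothetical names used only for this illustration.

```python
from collections import namedtuple


def run_query(session, query, params=None):
    """Execute a CQL statement and return its result; print any error."""
    try:
        return session.execute(query, params)
    except Exception as e:
        print(f"Error: {e}")
        return None


def format_rows(rows):
    """Render namedtuple-like rows as a plain-text table."""
    rows = list(rows)
    if not rows:
        return "(no rows)"
    headers = rows[0]._fields
    cells = [[str(value) for value in row] for row in rows]
    widths = [max(len(h), *(len(r[i]) for r in cells))
              for i, h in enumerate(headers)]

    def line(values):
        return "  ".join(v.ljust(w) for v, w in zip(values, widths)).rstrip()

    return "\n".join([line(headers), line(["-" * w for w in widths])]
                     + [line(r) for r in cells])


def print_query(session, query, params=None):
    """Run a query and print the rows; returns None, like print()."""
    rows = run_query(session, query, params)
    if rows is not None:
        print(format_rows(rows))


class FakeSession:
    """Stand-in for a driver session so the sketch runs without a cluster."""

    def execute(self, query, params=None):
        Row = namedtuple("Row", ["album_name", "year"])
        return [Row("Close To You", 1970)]


print_query(FakeSession(), "select album_name, year from music_library")
```

Because a helper of this shape prints the table itself and returns `None`, wrapping the call in another `print()` would emit an extra `None` line.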

#### Import Apache Cassandra python package

In [8]:
import cassandra

### Create a connection to the database

In [9]:
# Connect to the Cassandra instance running in your Kubernetes cluster
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

try: 
    # Added connection auth for bitnami / helm / cassandra bundle
    auth_provider = PlainTextAuthProvider(username='cassandra', password='password')
    cluster = Cluster(['127.0.0.1'], port=30942, auth_provider=auth_provider)  # 30942 is the CQL NodePort set in dend-cassandra-customize.yaml
    session = cluster.connect()
    print(session.hosts)
except Exception as e:
    print(f"Error: {e}")

[<Host: 127.0.0.1:30942 datacenter1>]


### Create a keyspace to work in 

In [10]:
try:
    session.execute(f"""
    CREATE KEYSPACE IF NOT EXISTS {CASSANDRA_KEYSPACE}
    WITH REPLICATION =
    {{ 'class' : 'SimpleStrategy', 'replication_factor' : 1 }}
    """)
except Exception as e:
    print(e)
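
The doubled braces in the statement above are f-string escapes: `{{` and `}}` produce the literal `{` and `}` that CQL expects around the replication map, while `{CASSANDRA_KEYSPACE}` is substituted. A small sketch that only builds and prints the string makes this visible; the function name `create_keyspace_cql` is hypothetical.

```python
def create_keyspace_cql(keyspace: str, replication_factor: int = 1) -> str:
    """Build the CREATE KEYSPACE statement as a single-line string."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH REPLICATION = {{ 'class' : 'SimpleStrategy', "
        f"'replication_factor' : {replication_factor} }}"
    )


print(create_keyspace_cql("dend_l3e3"))
```

Printing the statement before executing it is a cheap way to check that the braces and the keyspace name came out as intended.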

#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [11]:
try:
    session.set_keyspace(CASSANDRA_KEYSPACE)
except Exception as e:
    print(e)

### Imagine we would like to start creating a new Music Library of albums. 

### We want to ask 1 question of our data:
#### 1. Give me all the information from the music library about a given album
`select * from music_library WHERE album_name='Close To You'`


### Here is the Data:
* "Let it Be", "The Beatles", 1970, "Liverpool"
* "Rubber Soul", "The Beatles", 1965, "Oxford"
* "Beatles For Sale", "The Beatles", 1964, "London"
* "The Monkees", "The Monkees", 1966, "Los Angeles"
* "Close To You", "The Carpenters", 1970, "San Diego"

### How should we model this data? What should be our Primary Key and Partition Key? 

### Since the query filters on `ALBUM_NAME`, let's start with that as the Partition Key. An album name alone may not be unique, so we also add `ARTIST_NAME` as a Clustering Column. Together they should be enough to make the Primary Key unique.

Table Name: `music_library`
* column 1: Year
* column 2: Artist Name
* column 3: Album Name
* column 4: City

`PRIMARY KEY (album_name, artist_name)`
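
To make the roles of the two key parts concrete, here is a toy in-memory model: the partition key (`album_name`) selects a group of rows, and the clustering column (`artist_name`) identifies and orders rows inside that group. This is only an illustration of the lookup behavior, not how Cassandra stores data internally, and `select_by_album` is a hypothetical helper.

```python
from collections import defaultdict

rows = [
    ("Let it Be", "The Beatles", 1970, "Liverpool"),
    ("Rubber Soul", "The Beatles", 1965, "Oxford"),
    ("Beatles For Sale", "The Beatles", 1964, "London"),
    ("The Monkees", "The Monkees", 1966, "Los Angeles"),
    ("Close To You", "The Carpenters", 1970, "San Diego"),
]

# partition key -> {clustering column -> remaining columns}
partitions = defaultdict(dict)
for album_name, artist_name, year, city in rows:
    partitions[album_name][artist_name] = {"year": year, "city": city}


def select_by_album(album_name):
    """Mimic a query that filters on the partition key only.

    Returns the rows of one partition, ordered by the clustering column.
    """
    partition = partitions.get(album_name, {})
    return [
        (album_name, artist, cols["year"], cols["city"])
        for artist, cols in sorted(partition.items())
    ]


print(select_by_album("Close To You"))
```

In this model, a lookup by album name touches exactly one partition, which is why the query we want to answer fits this key design.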

In [14]:
query = "CREATE TABLE IF NOT EXISTS music_library "
query = query + "(album_name text, artist_name text, year int, city text, PRIMARY KEY (album_name, artist_name))"
try:
    session.execute(query)
except Exception as e:
    print(e)

### Insert the data into the table

In [15]:
music_lib_ins_query =  "INSERT INTO music_library (album_name, artist_name, year, city)"
music_lib_ins_query += " VALUES (%s, %s, %s, %s)"

music_library_data = [
    ("Let it Be", "The Beatles", 1970, "Liverpool"),
    ("Rubber Soul", "The Beatles", 1965, "Oxford"),
    ("Beatles For Sale", "The Beatles", 1964, "London"),
    ("The Monkees", "The Monkees", 1966, "Los Angeles"),
    ("Close To You", "The Carpenters", 1970, "San Diego"),
]

for ele in music_library_data:
    run_query(session, music_lib_ins_query, ele)
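
When the same INSERT runs many times, the driver also supports prepared statements: `session.prepare(...)` parses the statement once, and each row is then bound to it. Prepared statements use `?` placeholders instead of `%s`. The sketch below assumes a session object with the driver's `prepare` and `execute` methods; `insert_albums` and `RecordingSession` are hypothetical names, and the stub exists only so the example runs without a cluster.

```python
def insert_albums(session, rows):
    """Prepare the INSERT once, then execute it for every row."""
    prepared = session.prepare(
        "INSERT INTO music_library (album_name, artist_name, year, city) "
        "VALUES (?, ?, ?, ?)"
    )
    for row in rows:
        session.execute(prepared, row)
    return len(rows)


class RecordingSession:
    """Stub that records calls instead of talking to Cassandra."""

    def __init__(self):
        self.executed = []

    def prepare(self, query):
        # A real driver returns a PreparedStatement; the stub returns the text.
        return query

    def execute(self, statement, parameters=None):
        self.executed.append((statement, parameters))


stub = RecordingSession()
count = insert_albums(stub, [
    ("Let it Be", "The Beatles", 1970, "Liverpool"),
    ("Close To You", "The Carpenters", 1970, "San Diego"),
])
print(count, len(stub.executed))
```

With a real session, the same `insert_albums(session, music_library_data)` call would replace the loop above.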

In [16]:
music_lib_sel_query_all = "select * from music_library"
print_query(session, music_lib_sel_query_all)

album_name        artist_name     city           year
----------------  --------------  -----------  ------
Let it Be         The Beatles     Liverpool      1970
Rubber Soul       The Beatles     Oxford         1965
The Monkees       The Monkees     Los Angeles    1966
Close To You      The Carpenters  San Diego      1970
Beatles For Sale  The Beatles     London         1964


### Validate the Data Model -- Did it work?
`select * from music_library WHERE album_name='Close To You'`

In [17]:
music_lib_sel_query = "select * from music_library WHERE album_name='Close To You'"

print_query(session, music_lib_sel_query)

album_name    artist_name     city         year
------------  --------------  ---------  ------
Close To You  The Carpenters  San Diego    1970


### Success, it worked! We created a unique Primary Key: `album_name` is the Partition Key that determines how the data is distributed, and `artist_name` is the Clustering Column.
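
Uniqueness of the Primary Key matters because an INSERT in Cassandra behaves as an upsert: writing a row whose Primary Key already exists overwrites the existing row rather than adding a second one. The toy model below uses a dictionary keyed by `(album_name, artist_name)` to show the effect. The second and third rows are made-up values for illustration and are not part of the exercise data.

```python
# Primary key (album_name, artist_name) -> remaining columns (year, city)
table = {}


def insert(album_name, artist_name, year, city):
    """Upsert: the last write for a given primary key wins."""
    table[(album_name, artist_name)] = (year, city)


insert("Close To You", "The Carpenters", 1970, "San Diego")
# Same primary key, made-up values: replaces the row above.
insert("Close To You", "The Carpenters", 1971, "Los Angeles")
# Same album name, different (hypothetical) artist: a separate row.
insert("Close To You", "Another Artist", 1999, "Paris")

print(len(table))
print(table[("Close To You", "The Carpenters")])
```

This is why `album_name` alone would be a risky key: two albums with the same title by different artists would silently overwrite each other.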

### For the sake of the demo, drop the table

In [18]:
query = "drop table music_library"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)


### Close the session and cluster connection

In [19]:
session.shutdown()
cluster.shutdown()

<h2><span style='color:blue'>Remove Environment</span></h2>

In [20]:
# Removes chart instances
!helm uninstall {CHART_INSTANCE_NAME}

release "dend-l3e3" uninstalled


In [21]:
# Removes persistent Volume
!kubectl get pvc|fgrep {CHART_INSTANCE_NAME}|cut -d ' '  -f1| xargs -t kubectl delete pvc

kubectl delete pvc data-dend-l3e3-cassandra-0
persistentvolumeclaim "data-dend-l3e3-cassandra-0" deleted


In [22]:
!kubectl get pvc

No resources found in default namespace.
