<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/data-engineering/l1_demo_2_creating_a_table_with_apache_cassandra.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 1 Demo 2: Creating a Table with Apache Cassandra

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Cassandra_logo.svg/1200px-Cassandra_logo.svg.png" width="100" height="100">

## JDK requirement
Cassandra 3.11, the version installed in this notebook, requires either Oracle Java Standard Edition 8 or OpenJDK 8. To verify that the correct version of Java is installed, run the cell below and check for `java-1.8.*`.

In [1]:
!update-java-alternatives --list

java-1.11.0-openjdk-amd64      1111       /usr/lib/jvm/java-1.11.0-openjdk-amd64
java-1.8.0-openjdk-amd64       1081       /usr/lib/jvm/java-1.8.0-openjdk-amd64


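Select JDK 8, the version Cassandra requires. The `no alternatives` errors in the output appear to come from optional JDK tools that are not installed; the `java -version` cell afterwards confirms that Java 8 is active.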

In [2]:
!update-java-alternatives --set java-1.8.0-openjdk-amd64

update-alternatives: error: no alternatives for appletviewer
update-alternatives: error: no alternatives for jaotc
update-alternatives: error: no alternatives for jconsole
update-alternatives: error: no alternatives for jdeprscan
update-alternatives: error: no alternatives for jhsdb
update-alternatives: error: no alternatives for jimage
update-alternatives: error: no alternatives for jlink
update-alternatives: error: no alternatives for jmod
update-alternatives: error: no alternatives for jshell
update-alternatives: error: no alternatives for mozilla-javaplugin.so
update-alternatives: error: no alternatives for policytool
update-java-alternatives: jdk alternative does not exist: /usr/lib/jvm/java-8-openjdk-amd64/bin/appletviewer
update-java-alternatives: jdk alternative does not exist: /usr/lib/jvm/java-8-openjdk-amd64/bin/jconsole
update-alternatives: error: no alternatives for policytool
update-java-alternatives: plugin alternative does not exist: /usr/lib/jvm/java-8-openjdk-amd64/jr

In [3]:
!java -version

openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
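
As a rough sketch, the same check can be scripted in Python instead of read by eye. The helper names `is_java8` and `installed_java_version_output` are hypothetical and not part of the notebook; the first only inspects the text that `java -version` prints, and the second assumes a `java` executable is on the `PATH`.

```python
import re
import subprocess


def is_java8(version_output: str) -> bool:
    """Return True if the text printed by `java -version` reports a 1.8.x runtime."""
    match = re.search(r'version "(\d+)\.(\d+)', version_output)
    return bool(match) and (match.group(1), match.group(2)) == ("1", "8")


def installed_java_version_output() -> str:
    """Run `java -version`; the JVM prints its version banner on stderr."""
    result = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stderr
```

In a notebook cell, `is_java8(installed_java_version_output())` should be `True` after the switch to JDK 8.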


## Set up a Cassandra instance

In [4]:
!echo "deb https://downloads.apache.org/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

deb https://downloads.apache.org/cassandra/debian 311x main


In [5]:
!curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  252k  100  252k    0     0   287k      0 --:--:-- --:--:-- --:--:--  286k
OK


In [8]:
!sudo apt-get update

Hit:1 http://security.ubuntu.com/ubuntu bionic-security InRelease
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease
Hit:4 http://ppa

***If the command below fails, retry it; the error usually goes away.***

In [9]:
!sudo apt-key adv --keyserver pool.sks-keyservers.net --recv-key A278B781FE4B2BDA

Executing: /tmp/apt-key-gpghome.34oIXpBxoi/gpg.1.sh --keyserver pool.sks-keyservers.net --recv-key A278B781FE4B2BDA
gpg: key A278B781FE4B2BDA: 28 signatures not checked due to missing keys
gpg: key A278B781FE4B2BDA: "Michael Shuler <michael@pbandjelly.org>" 1 new signature
gpg: Total number processed: 1
gpg:         new signatures: 1


In [11]:
!sudo apt-get install cassandra

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libopts25 netbase ntp sntp
Suggested packages:
  cassandra-tools ntp-doc
The following NEW packages will be installed:
  cassandra libopts25 netbase ntp sntp
0 upgraded, 5 newly installed, 0 to remove and 58 not upgraded.
Need to get 30.7 MB of archives.
After this operation, 42.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 netbase all 5.4 [12.7 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopts25 amd64 1:5.18.12-4 [58.2 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 ntp amd64 1:4.2.8p10+dfsg-5ubuntu7.1 [640 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 sntp amd64 1:4.2.8p10+dfsg-5ubuntu7.1 [86.9 kB]
Get:3 https://dl.bintray.com/apache/cassandra 311x/main amd64 cassandra all 3.11.6 [29.9 MB]
Fetched 30.7 MB in

In [0]:
# !sudo service cassandra start

Cassandra should now be up and running. Confirm this by executing the command below.

***The service takes a few seconds to start.***

In [18]:
!nodetool status

Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70.03 KiB  256          100.0%            f9466b92-374a-44ba-9c5f-4f284ade19fc  rack1
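
Because the service needs a few seconds to start, a cell can poll until the node accepts connections rather than being rerun by hand. This is a minimal sketch, assuming the default CQL native transport port 9042 on 127.0.0.1; the helper name `wait_for_port` is hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A call such as `wait_for_port('127.0.0.1', 9042)` returns `True` once the port is open. An open port only shows that the process is listening; `nodetool status` remains the check for the `UN` (Up/Normal) state.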



## Walk through the basics of Apache Cassandra

* Creating a table
* Inserting rows of data
* Running a simple CQL query to validate the information.

### Install python driver

Use the DataStax Python driver for Apache Cassandra (imported as `cassandra`) to run queries. If the library is not preinstalled, install it from a notebook cell with:
`! pip install cassandra-driver`<br>
More documentation can be found here:  https://datastax.github.io/python-driver/

In [20]:
!pip install cassandra-driver

Collecting cassandra-driver
  Downloading https://files.pythonhosted.org/packages/d8/76/63188a5dd8b62f72a387bab648c39555c5c2ba64b230fc71eb43b33f5915/cassandra_driver-3.22.0-cp36-cp36m-manylinux1_x86_64.whl (4.3MB)
     |████████████████████████████████| 4.3MB 2.4MB/s 
Collecting geomet<0.2,>=0.1
  Downloading https://files.pythonhosted.org/packages/d3/ad/9efd4457a27048128d1e8a83d48874dabd78cdcb9b36ce2b4eac5d74b5c0/geomet-0.1.2.tar.gz
Building wheels for collected packages: geomet
  Building wheel for geomet (setup.py) ... done
  Created wheel for geomet: filename=geomet-0.1.2-cp36-none-any.whl size=14896 sha256=106a987fb5f9497d14a489902cd74d6300d19fdd1aa2bf567c119948f99b1f3b
  Stored in directory: /root/.cache/pip/wheels/08/43/84/50bd44f043b3c04c06b798cc5fc31d93586d38dfa3a48ec051
Successfully built geomet
Installing collected packages: geomet, cassandra-driver
Successfully installed cassandra-driver-3.22.0 geomet-0.1.2


### Import Apache Cassandra python package

In [22]:
import cassandra

print('cassandra:', cassandra.__version__)

cassandra: 3.22.0


### Create a connection to the database
1. Connect to the local instance of Apache Cassandra *['127.0.0.1']*.
2. No database name or credentials are passed here: the default local install accepts unauthenticated connections, and the keyspace is selected in a later step.
3. Once we get back the cluster object, we call `connect()`, which creates the session that we will use to execute queries.
    
*Note 1:* This block of code will be standard in all notebooks

In [24]:
from cassandra.cluster import Cluster
try: 
    cluster = Cluster(['127.0.0.1']) #If you have a locally installed Apache Cassandra instance
    session = cluster.connect()
    print('session id:', session.session_id)
except Exception as e:
    print(e)
 

session id: 9d2d0979-a342-4da8-856a-b3eb925059c7


### Test the Connection and Error Handling Code
*Note:* The try-except block should catch the error: we are running a `select *` before any keyspace has been selected, and the table has not been created yet.

In [25]:
try:
    session.execute("""select * from music_library""")
except Exception as e:
    print(e)
 

Error from server: code=2200 [Invalid query] message="No keyspace has been specified. USE a keyspace, or explicitly specify keyspace.tablename"


### Create a keyspace to work in
*Note:* We will ignore the replication strategy and replication factor for now, as those concepts are covered in depth in Lesson 3. Remember, these are the strategy and replication factor for a single-node local instance.

In [0]:
try:
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS test
        WITH REPLICATION =
        { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
    """)
except Exception as e:
    print(e)

### Connect to our keyspace
*Compare this to how a new session in PostgreSQL is created.*

In [31]:
try:
    session.set_keyspace('test')
    print('session keyspace:', session.keyspace)
except Exception as e:
    print(e)

session keyspace: test


### Begin by creating a music library of albums. Each album has a lot of information we could add to the music library table. We will start with album name, artist name, and year.

### But ... stop

### We are working with Apache Cassandra, a NoSQL database. We can't model our data and create our table without more information.

### Think about what queries you will be performing on this data.

#### We want to be able to get every album that was released in a particular year. 
`select * from music_library WHERE YEAR=1970`

*To do that:* <ol><li>We need to be able to do a WHERE on YEAR.<li>YEAR will become the partition key.<li>Artist name will be the clustering column, which makes each primary key unique.<li>**Remember there are no duplicates in Apache Cassandra.**</ol>

**Table Name:** music_library<br>
**column 1:** Album Name<br>
**column 2:** Artist Name<br>
**column 3:** Year <br>
PRIMARY KEY(year, artist name)
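
To see what "no duplicates" means for this key, here is a toy model in plain Python, not Cassandra itself: a dictionary keyed by `(year, artist_name)` stands in for the table. Writing a second row with the same primary key replaces the first, which mirrors how a Cassandra INSERT on an existing primary key overwrites the row. The `upsert` helper and the album title "Another Album" are hypothetical.

```python
# Toy stand-in for the table: primary key (year, artist_name) -> album_name.
music_library = {}


def upsert(table, year, artist_name, album_name):
    """Store a row; a row with the same primary key is overwritten, not duplicated."""
    table[(year, artist_name)] = album_name


upsert(music_library, 1970, "The Beatles", "Let it Be")
upsert(music_library, 1965, "The Beatles", "Rubber Soul")
# Same (year, artist_name) as the first row, so this replaces "Let it Be".
upsert(music_library, 1970, "The Beatles", "Another Album")

# Emulate WHERE year = 1970 by filtering on the first key component.
albums_1970 = [album for (year, _), album in music_library.items() if year == 1970]
print(len(music_library), albums_1970)
```

With this key, one artist can have only one album per year. If that is too restrictive for the queries you need, the album name would likely have to become part of the primary key as well.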


### Now to translate this information into a Create Table Statement. 
More information on Data Types can be found here: https://datastax.github.io/python-driver/<br>
*Note:* Again, we will go in depth with these concepts in Lesson 3.

In [0]:
query = "CREATE TABLE IF NOT EXISTS music_library"
query = f"{query} (year int, artist_name text, album_name text, PRIMARY KEY (year, artist_name))"
try:
    session.execute(query)
except Exception as e:
    print(e)


The query should run smoothly.

### Insert two rows 

In [0]:
query = "INSERT INTO music_library (year, artist_name, album_name)"
query = query + " VALUES (%s, %s, %s)"

try:
    session.execute(query, (1970, "The Beatles", "Let it Be"))
except Exception as e:
    print(e)
    
try:
    session.execute(query, (1965, "The Beatles", "Rubber Soul"))
except Exception as e:
    print(e)
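
When the same INSERT runs many times, the driver's prepared statements are a common alternative to formatting the query on every call. The sketch below assumes an open `session` with the keyspace already set; the helper name `insert_albums` is hypothetical. Note that prepared statements use `?` placeholders rather than `%s`.

```python
def insert_albums(session, albums):
    """Insert (year, artist_name, album_name) tuples using one prepared statement."""
    # The statement is parsed once by the server and reused for every row.
    prepared = session.prepare(
        "INSERT INTO music_library (year, artist_name, album_name) VALUES (?, ?, ?)"
    )
    count = 0
    for album in albums:
        session.execute(prepared, album)
        count += 1
    return count
```

For the two rows above, the call would be `insert_albums(session, [(1970, "The Beatles", "Let it Be"), (1965, "The Beatles", "Rubber Soul")])`.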

### Validate your data was inserted into the table.
*Note:* The for loop is used for printing the results. If executing queries in cqlsh, this would not be required.

*Note:* Depending on the version of Apache Cassandra you have installed, this might throw an "ALLOW FILTERING" error instead of printing the 2 rows that we just inserted. This is to be expected, as this type of query should not be performed on large datasets; we are only doing this for the sake of the demo.

In [34]:
query = 'SELECT * FROM music_library'
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
else:
    # Iterate only when the query succeeded; otherwise `rows` is undefined.
    for row in rows:
        print(row.year, row.album_name, row.artist_name)

1965 Rubber Soul The Beatles
1970 Let it Be The Beatles


### Validate the Data Model with the original query.

`select * from music_library WHERE YEAR=1970`

In [35]:
query = "select * from music_library WHERE YEAR=1970"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
else:
    # Iterate only when the query succeeded; otherwise `rows` is undefined.
    for row in rows:
        print(row.year, row.album_name, row.artist_name)

1970 Let it Be The Beatles
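
The year does not have to be hard-coded in the query string. As a sketch, the same lookup can take the year as a bound parameter, using the `%s` placeholder style already used for the INSERT; the helper name `albums_for_year` is hypothetical and assumes an open `session` on the `test` keyspace.

```python
def albums_for_year(session, year):
    """Return (year, album_name, artist_name) tuples for a single year."""
    # The driver binds `year` to the %s placeholder; no string formatting is needed.
    rows = session.execute("SELECT * FROM music_library WHERE year = %s", (year,))
    return [(row.year, row.album_name, row.artist_name) for row in rows]
```

Because `year` is the partition key, each call reads a single partition, which is the access pattern the table was modeled for.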


### Drop the table to avoid duplicates and clean up. 

In [0]:
query = "drop table music_library"
try:
    rows = session.execute(query)
except Exception as e:
    print(e)
    

### Close the session and cluster connection

In [0]:
session.shutdown()
cluster.shutdown()
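
If a cell fails partway through, the two shutdown calls above may never run. One way to guard against that, sketched here under the assumption that `cluster` is a driver `Cluster` object, is a small context manager; the name `cassandra_session` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def cassandra_session(cluster, keyspace=None):
    """Yield a session from `cluster`, then shut both down, even after an error."""
    session = cluster.connect(keyspace) if keyspace else cluster.connect()
    try:
        yield session
    finally:
        session.shutdown()
        cluster.shutdown()


# Example use in a notebook cell (requires a running Cassandra node):
# from cassandra.cluster import Cluster
# with cassandra_session(Cluster(['127.0.0.1']), 'test') as session:
#     session.execute("SELECT * FROM music_library")
```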