# Lesson 1 Exercise 2: Creating a Table with Apache Cassandra

<img src="images/cassandralogo.png" width="250" height="250">

### In this exercise we are going to walk through the basics of creating a table in Apache Cassandra, inserting rows of data, and running a simple CQL query to validate the information.

### Fill in the code where you see #####

#### We will use the Python driver for Apache Cassandra, imported as `cassandra`, to run our queries. This library should be preinstalled, but to install it in the future you can run this command in a notebook:
! pip install cassandra-driver
#### More documentation can be found here:  https://datastax.github.io/python-driver/

#### Import Apache Cassandra python package

In [1]:
import cassandra

### First let's create a connection to the database
This connects to our local instance of Apache Cassandra. The connection reaches out to the database and ensures we have the correct privileges to connect. Once we get back the cluster object, we call `connect()` on it to create the session that we will use to execute queries.
#### Note 1: This block of code will be standard in all notebooks

In [2]:
from cassandra.cluster import Cluster

# connect to the locally installed Apache Cassandra instance
cluster = Cluster(['127.0.0.1'])

# Create a session to execute inside it the queries
session = cluster.connect()

### Let's Test our Connection 
We are trying to do a `select *` on a table we have not created yet, and we have not chosen a keyspace either. We should expect the server to return an error rather than rows.

In [3]:
session.execute("SELECT * FROM music_library")

InvalidRequest: Error from server: code=2200 [Invalid query] message="No keyspace has been specified. USE a keyspace, or explicitly specify keyspace.tablename"
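The cell above lets the exception propagate. If you would rather print the message and keep going, one option is a small try/except wrapper. This is only a sketch: `safe_execute` and `FakeSession` are hypothetical names, and the fake session stands in for the real one so the block runs without a cluster.

```python
def safe_execute(session, query):
    """Run a query and return its result, or None if it fails."""
    try:
        return session.execute(query)
    except Exception as e:  # with the real driver this would be cassandra.InvalidRequest
        print(f"Query failed: {e}")
        return None


class FakeSession:
    """Hypothetical stand-in for a driver session that has no keyspace set."""

    def execute(self, query):
        raise RuntimeError("No keyspace has been specified.")


result = safe_execute(FakeSession(), "SELECT * FROM music_library")
print(result)
```

With a live connection you would pass the real `session` object instead of `FakeSession()`.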

### Let's create a keyspace to do our work in 
Note: Ignore the replication strategy and replication factor for now. Those will be discussed later. Just know that on a one-node local instance this will be the strategy and replication factor.

In [4]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity
    WITH REPLICATION = 
        {'class' : 'SimpleStrategy', 'replication_factor' : 1}
""")

<cassandra.cluster.ResultSet at 0x1d73900f490>

#### Connect to our Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [5]:
# we don't need to close the connection and re-connect
# we simply set the keyspace as follows:
session.set_keyspace("udacity")

## Let's imagine we would like to start creating a Song Library of all the songs we own. Each song has a lot of information we could add to the song library table, but we will just start with name of the song, artist name, year, album it was from, and if it was a single.

`song title
artist
year
album
single`


## But ...STOP
<img src="images/stop.jpeg" width="250" height="250">

### We are working with Apache Cassandra, a NoSQL database. We can't model our data and create our table without more information.

# What queries will I be performing on this data?

### In this case I would like to be able to get every song by a certain artist that was released in a particular year. 


`select * from music_library WHERE year=1970 AND artist_name='The Beatles'`

### Because of this I need to be able to do a WHERE on year. `year` will become my partition key, and `artist_name` will be my clustering column so that each primary key is unique. Remember there are no duplicates in Apache Cassandra: a write with an existing primary key overwrites the earlier row.
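To make the uniqueness point concrete, here is a plain-Python sketch. It is a model of the idea, not Cassandra itself: a dictionary keyed by `(year, artist_name)` plays the role of the primary key, so a second write with the same key replaces the first. The third write is an illustrative value chosen only to show the overwrite.

```python
# Model the table as a dict keyed by the primary key (year, artist_name).
table = {}


def insert(year, artist_name, album_name):
    """Write a row; an existing row with the same primary key is overwritten."""
    table[(year, artist_name)] = {"album_name": album_name}


insert(1970, "The Beatles", "Let It Be")
insert(1965, "The Beatles", "Rubber Soul")
# Same primary key as the first write: this replaces that row instead of adding one.
insert(1970, "The Beatles", "Hey Jude")

print(len(table))                    # 2
print(table[(1970, "The Beatles")])  # {'album_name': 'Hey Jude'}
```

One consequence of this key choice is that the table can hold only one album per artist per year. If that were too restrictive, adding another clustering column would be one way to relax it.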

### Now to translate this information into a Create Table Statement. 

#### More information on Data Types can be found here: https://datastax.github.io/python-driver/

In [6]:
# Let's build our query.
# If you split the query string across several lines, keep the trailing space
# at the end of each piece so the words do not run together.
query = "CREATE TABLE IF NOT EXISTS music_library "

# List all three columns first, then use PRIMARY KEY to declare which one is the
# partition key (year) and which is the clustering column (artist_name).
query = query + "(year int, artist_name text, album_name text, PRIMARY KEY (year, artist_name))"

# Execute the query to create our table
session.execute(query)

<cassandra.cluster.ResultSet at 0x1d73181be20>
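The trailing-space pitfall mentioned in the cell's comments can be avoided by writing the statement as a single triple-quoted string, the same style used for CREATE KEYSPACE earlier. A sketch of that alternative follows; the `create_table` and `normalized` names are only for illustration, and the `session.execute` call is left as a comment so the block runs without a cluster.

```python
create_table = """
    CREATE TABLE IF NOT EXISTS music_library (
        year int,
        artist_name text,
        album_name text,
        PRIMARY KEY (year, artist_name)
    )
"""

# Collapse the whitespace to view the statement on one line.
normalized = " ".join(create_table.split())
print(normalized)

# With a live session: session.execute(create_table)
```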

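### No error was raised, but let's check that our table was created. Run `select count(*)`, which should return 0 on a fresh table because we have not inserted any rows yet. If the notebook has been run before, the rows inserted earlier are still there and the count will be higher; the `count=2` shown below likely comes from such a rerun.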

Note: Depending on the version of Apache Cassandra you have installed, this might throw an "ALLOW FILTERING" error instead of returning a count. This is expected, because this type of query should not be run on large datasets; we are only doing it for the sake of the demo.

In [7]:
# No error was found, but let's check to ensure our table was created

# write the query
query = "SELECT COUNT(*) FROM music_library"

#execute the query
count = session.execute(query)

# print the output of the execution
print(count.one())

Row(count=2)


### Let's insert two rows
`First Row:  "Across The Universe", "The Beatles", "1970", "False", "Let It Be"`

`Second Row: "Think For Yourself", "The Beatles", "1965", "False", "Rubber Soul"`

In [8]:
# Set the foundation of the query
query = "INSERT INTO music_library (year, artist_name, album_name) "
query = query + "VALUES (%s, %s, %s)"

# Insert first row
session.execute(query, (1970, "The Beatles", "Let It Be"))

# Insert second row
session.execute(query, (1965, "The Beatles", "Rubber Soul"))

<cassandra.cluster.ResultSet at 0x1d73901cee0>
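With more than a couple of rows, repeating `session.execute` by hand gets tedious. A minimal sketch of looping over a list of tuples is shown here. `insert_rows` and `RecordingSession` are hypothetical helpers; the recording session only stores the calls it receives, so the block runs without a cluster.

```python
query = "INSERT INTO music_library (year, artist_name, album_name) VALUES (%s, %s, %s)"

rows_to_insert = [
    (1970, "The Beatles", "Let It Be"),
    (1965, "The Beatles", "Rubber Soul"),
]


class RecordingSession:
    """Hypothetical stand-in that records calls instead of contacting a cluster."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, parameters))


def insert_rows(session, query, rows):
    """Execute the same parameterized statement once per row."""
    for row in rows:
        session.execute(query, row)


demo_session = RecordingSession()
insert_rows(demo_session, query, rows_to_insert)
print(len(demo_session.calls))  # 2
```

Passing the values as a separate tuple, as the cell above already does, leaves the quoting of strings to the driver.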

### Validate your data was inserted into the table.
Note: The for loop is used for printing the results. If executing queries in cqlsh, this would not be required.

Note: Depending on the version of Apache Cassandra you have installed, this might throw an "ALLOW FILTERING" error instead of printing the 2 rows we just inserted. This is expected, because this type of query should not be run on large datasets; we are only doing it for the sake of the demo.

In [9]:
# set the query
query = "SELECT * FROM music_library"

# execute the query and return the resulted rows into a variable
rows = session.execute(query)

# loop over the results and print them line by line
for row in rows:
    print(row.year, row.album_name, row.artist_name)

1965 Rubber Soul The Beatles
1970 Let It Be The Beatles
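Attribute access such as `row.year` works because, by default, the Python driver hands back each row as a named tuple. The sketch below uses `collections.namedtuple` as a stand-in for the driver's row type, with the two rows from this exercise typed in by hand, so it runs without a cluster.

```python
from collections import namedtuple

# Stand-in for the row type the driver builds from the table's columns.
Row = namedtuple("Row", ["year", "artist_name", "album_name"])

rows = [
    Row(1965, "The Beatles", "Rubber Soul"),
    Row(1970, "The Beatles", "Let It Be"),
]

for row in rows:
    # Fields can be read by name, in any order you like.
    print(row.year, row.album_name, row.artist_name)

first = rows[0]
# Named tuples also support positional access and conversion to a dict.
print(first[0], first._asdict()["album_name"])
```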


### Let's Validate our Data Model with our original query.

`select * from music_library WHERE year=1970 AND artist_name='The Beatles'`

In [10]:
# Set the query
query = "SELECT * FROM music_library WHERE year=1970 AND artist_name='The Beatles'"

# Execute the query
rows = session.execute(query)

# Loop over the results and print them
for row in rows:
    print(row.year, row.album_name, row.artist_name)

1970 Let It Be The Beatles


### And finally, close the session and cluster connection

In [11]:
session.shutdown()
cluster.shutdown()
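In a script, as opposed to a notebook, an error in an earlier step would skip these two shutdown calls. One way to guard against that, sketched here with hypothetical `StubCluster` and `StubSession` classes in place of a real connection, is a `try`/`finally` block.

```python
def run_with_cleanup(cluster, work):
    """Connect, run work(session), and always shut down session and cluster."""
    session = cluster.connect()
    try:
        return work(session)
    finally:
        session.shutdown()
        cluster.shutdown()


class StubSession:
    """Hypothetical stand-in that only tracks whether shutdown was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


class StubCluster:
    """Hypothetical stand-in for a cluster that hands out one stub session."""

    def __init__(self):
        self.session = StubSession()
        self.closed = False

    def connect(self):
        return self.session

    def shutdown(self):
        self.closed = True


def failing_work(session):
    raise RuntimeError("query failed")


demo_cluster = StubCluster()
try:
    run_with_cleanup(demo_cluster, failing_work)
except RuntimeError as e:
    print(f"Caught: {e}")

# Both are closed even though the work raised.
print(demo_cluster.closed, demo_cluster.session.closed)
```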