# Lesson 3 Exercise 2: Focus on Primary Key
<img src="images/cassandralogo.png" width="250" height="250">

### Walk through the basics of creating a table with a good Primary Key in Apache Cassandra, inserting rows of data, and running a simple CQL query to validate the information.

### Replace ##### with your own answers. 

Note: __Do not__ click the blue Preview button in the lower task bar

#### We will use the Python driver for Apache Cassandra (the `cassandra` package) to run our queries. The library should be preinstalled; to install it locally in the future, run this command in a notebook:
! pip install cassandra-driver
#### More documentation can be found here:  https://datastax.github.io/python-driver/

#### Import Apache Cassandra python package

In [1]:
# We are going to use the Python driver to communicate with the Cassandra NoSQL database
import cassandra

### Create a connection to the database

In [2]:
from cassandra.cluster import Cluster

# Create a connection to the database
# We use the local IP address because Apache Cassandra is installed locally
cluster = Cluster(['127.0.0.1'])

In [3]:
# Create a session in which to execute our queries
session = cluster.connect()

### Create a keyspace to work in 

In [4]:
# A keyspace is the top-level database object
# that controls the replication of the objects
# it contains at each datacenter in the cluster.

# Keyspaces contain tables, materialized views and user-defined types, 
# functions and aggregates. 
# Typically, a cluster has one keyspace per application.

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS udacity
    WITH REPLICATION = 
        {'class' : 'SimpleStrategy', 'replication_factor' : 1}"""
)

<cassandra.cluster.ResultSet at 0x146fe3a3520>

#### Connect to the Keyspace. Compare this to how we had to create a new session in PostgreSQL.  

In [5]:
session.set_keyspace('udacity')

### Imagine you need to create a new Music Library of albums 

### Here is the information asked of the data:
#### 1. Give every album in the music library that was released by a given artist
`select * from music_library WHERE artist_name='The Beatles'`


### Here is the collection of data
<img src="images/table3.png" width="650" height="350">

#### Practice by making the PRIMARY KEY only 1 Column (not 2 or more)

In [6]:
# Set the query
query = "CREATE TABLE IF NOT EXISTS music_library "
query += "(year INT, city TEXT, artist_name TEXT, album_name TEXT, PRIMARY KEY (artist_name))"

# Execute the query and create the table
session.execute(query)

<cassandra.cluster.ResultSet at 0x146fe3aee50>

### Let's insert the data into the table

In [7]:
# Set the query
query = "INSERT INTO music_library (year, city, artist_name, album_name) "
query += "VALUES (%s, %s, %s, %s)"

# Insert the data in the table
session.execute(query, (1970, 'Liverpool', 'The Beatles', 'Let it Be'))
session.execute(query, (1965, 'Oxford', 'The Beatles', 'Rubber Soul'))
session.execute(query, (1966, 'Los Angeles', 'The Monkees', 'The Monkees'))
session.execute(query, (1970, 'San Diego', 'The Carpenters', 'Close To You'))
session.execute(query, (1965, 'London', 'The Who', 'My Generation'))

<cassandra.cluster.ResultSet at 0x146fe4074c0>
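The five inserts above repeat one statement with different values. As an optional sketch, the same inserts could go through a prepared statement, which is parsed once and then reused; note that prepared statements use `?` placeholders rather than `%s`. The helper `insert_albums` and the names `ALBUMS` and `INSERT_CQL` are hypothetical and not part of the original exercise; `session` is assumed to be the connected session from the cells above.

```python
# Hypothetical helper, not part of the original exercise.
# Rows are (year, city, artist_name, album_name), as in the cell above.
ALBUMS = [
    (1970, 'Liverpool', 'The Beatles', 'Let it Be'),
    (1965, 'Oxford', 'The Beatles', 'Rubber Soul'),
    (1966, 'Los Angeles', 'The Monkees', 'The Monkees'),
    (1970, 'San Diego', 'The Carpenters', 'Close To You'),
    (1965, 'London', 'The Who', 'My Generation'),
]

# Prepared statements bind values to "?" markers
INSERT_CQL = (
    "INSERT INTO music_library (year, city, artist_name, album_name) "
    "VALUES (?, ?, ?, ?)"
)


def insert_albums(session, albums):
    """Insert every album tuple through one prepared statement; return the count."""
    prepared = session.prepare(INSERT_CQL)  # prepared once, reused for every row
    for album in albums:
        session.execute(prepared, album)
    return len(albums)
```

With a live session this would be called as `insert_albums(session, ALBUMS)`.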

### Validate the Data Model -- Does it give you two rows?

In [8]:
# Set the query
query = "select * from music_library WHERE artist_name='The Beatles'"

# Execute the query
rows = session.execute(query)

# Print the results
for row in rows:
    print(row.year, row.artist_name, row.album_name, row.city)

1965 The Beatles Rubber Soul Oxford


### If you used just one column as your PRIMARY KEY, your output should be:
1965 The Beatles Rubber Soul Oxford


### That didn't work out as planned! Why is that?  Did you create a unique primary key?

* No, the primary key we chose does not uniquely identify each row.
* In Apache Cassandra, the primary key identifies a row. With a single-column primary key, that column is also the partition key, which determines where the row is stored.
* The primary key must therefore be unique per row: two rows cannot share the same primary key value.
* In this exercise, we chose `artist_name` as the primary key. However, two rows in our data have the same `artist_name`.
* Cassandra does not reject the second row or store a duplicate. An `INSERT` with an existing primary key behaves like an update (an "upsert").
* So, when we inserted the first row with artist_name = 'The Beatles', Cassandra saved the row in the partition for that artist_name.
* When we inserted the second row with the same artist_name = 'The Beatles', Cassandra overwrote the values of the first row, so only the most recently inserted row remains.
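The overwrite described above can be imitated with a plain Python dictionary keyed by the primary key. This is only a rough model that never talks to Cassandra: the names `ROWS` and `upsert_all` are hypothetical, and replacing the whole row matches Cassandra here only because every insert supplies all four columns.

```python
# Pure-Python model of the behaviour; it does not connect to Cassandra.
ROWS = [
    {"year": 1970, "city": "Liverpool", "artist_name": "The Beatles", "album_name": "Let it Be"},
    {"year": 1965, "city": "Oxford", "artist_name": "The Beatles", "album_name": "Rubber Soul"},
    {"year": 1966, "city": "Los Angeles", "artist_name": "The Monkees", "album_name": "The Monkees"},
    {"year": 1970, "city": "San Diego", "artist_name": "The Carpenters", "album_name": "Close To You"},
    {"year": 1965, "city": "London", "artist_name": "The Who", "album_name": "My Generation"},
]


def upsert_all(rows, key_columns):
    """Store rows in a dict keyed by the primary key; a repeated key replaces the old row."""
    table = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        table[key] = row  # same key -> the earlier row is overwritten
    return table


single = upsert_all(ROWS, ["artist_name"])
print(len(single))                             # 4 rows survive out of 5
print(single[("The Beatles",)]["album_name"])  # Rubber Soul
```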

### Try again - Create a new table with a composite key this time

**Let's try again and focus on making the PRIMARY KEY unique. Look at our dataset: is there anything unique to each row?**

**`City` and `Album Name` are each unique in this dataset (`Year` is not), but using one of them alone as the primary key would not support the query we need, which looks up the albums of a particular artist.**

**Let's make a composite key of the `Artist Name` AND `Year`. The artist name remains the partition key, so we can still query by artist, and the year distinguishes the rows within that partition.**

**This assumes that each artist releases only one album per year (the current dataset supports this assumption). For a real business case, we need to fully understand our dataset before choosing a unique key. No guessing!**
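One way to test such an assumption before creating the table is to count how often each candidate key value occurs in the data. The sketch below does this for the five rows of this exercise; `ALBUM_KEYS` and `duplicate_keys` are hypothetical names introduced for illustration.

```python
from collections import Counter

# (artist_name, year, album_name) for the five rows in this exercise
ALBUM_KEYS = [
    ("The Beatles", 1970, "Let it Be"),
    ("The Beatles", 1965, "Rubber Soul"),
    ("The Monkees", 1966, "The Monkees"),
    ("The Carpenters", 1970, "Close To You"),
    ("The Who", 1965, "My Generation"),
]


def duplicate_keys(rows, positions):
    """Return the key values (built from the given tuple positions) that occur more than once."""
    counts = Counter(tuple(row[i] for i in positions) for row in rows)
    return [key for key, count in counts.items() if count > 1]


print(duplicate_keys(ALBUM_KEYS, [0]))     # [('The Beatles',)] -> artist_name alone is not unique
print(duplicate_keys(ALBUM_KEYS, [0, 1]))  # [] -> (artist_name, year) is unique in this dataset
```

An empty list only shows that the key is unique in the data checked; it says nothing about rows that may arrive later.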

### Drop the old table

In [9]:
query = "DROP TABLE IF EXISTS music_library"
session.execute(query)

<cassandra.cluster.ResultSet at 0x146fe3aedc0>

### Recreate the table with a composite PRIMARY KEY of (artist_name, year)

In [10]:
# Set the query
query = "CREATE TABLE IF NOT EXISTS music_library "
query += "(year INT, city TEXT, artist_name TEXT, album_name TEXT, PRIMARY KEY (artist_name, year))"

# Execute the query and create the table
session.execute(query)

<cassandra.cluster.ResultSet at 0x146fe4005e0>

### Let's insert the data into the table

In [11]:
# Set the query
query = "INSERT INTO music_library (year, city, artist_name, album_name) "
query += "VALUES (%s, %s, %s, %s)"

# Insert the data in the table
session.execute(query, (1970, 'Liverpool', 'The Beatles', 'Let it Be'))
session.execute(query, (1965, 'Oxford', 'The Beatles', 'Rubber Soul'))
session.execute(query, (1966, 'Los Angeles', 'The Monkees', 'The Monkees'))
session.execute(query, (1970, 'San Diego', 'The Carpenters', 'Close To You'))
session.execute(query, (1965, 'London', 'The Who', 'My Generation'))

<cassandra.cluster.ResultSet at 0x146fe400d00>

### Validate the Data Model -- Did it work?

In [12]:
# Set the query
query = "select * from music_library WHERE artist_name='The Beatles'"

# Execute the query
rows = session.execute(query)

# Print the results
for row in rows:
    print(row.year, row.artist_name, row.album_name, row.city)

1965 The Beatles Rubber Soul Oxford
1970 The Beatles Let it Be Liverpool


### Your output should be:
1965 The Beatles Rubber Soul Oxford<br>
1970 The Beatles Let it Be Liverpool
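Rows within a partition come back sorted by the clustering column (here `year`, ascending by default), not in insertion order, which is why the cell output lists 1965 before 1970 even though 'Let it Be' was inserted first. The following pure-Python sketch imitates that read path; `ALBUM_ROWS` and `read_partition` are hypothetical names, and the model is a simplification that does not talk to Cassandra.

```python
# Pure-Python model; it does not connect to Cassandra.
ALBUM_ROWS = [  # (year, city, artist_name, album_name), in insertion order
    (1970, "Liverpool", "The Beatles", "Let it Be"),
    (1965, "Oxford", "The Beatles", "Rubber Soul"),
    (1966, "Los Angeles", "The Monkees", "The Monkees"),
    (1970, "San Diego", "The Carpenters", "Close To You"),
    (1965, "London", "The Who", "My Generation"),
]


def read_partition(rows, artist_name):
    """Return one artist's rows sorted by year, imitating ascending clustering order."""
    partition = [row for row in rows if row[2] == artist_name]
    return sorted(partition, key=lambda row: row[0])


for year, city, artist, album in read_partition(ALBUM_ROWS, "The Beatles"):
    print(year, artist, album, city)
# 1965 The Beatles Rubber Soul Oxford
# 1970 The Beatles Let it Be Liverpool
```

If the newest album should come first, the table could instead be created with `WITH CLUSTERING ORDER BY (year DESC)`.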

### Drop the table

In [13]:
query = "DROP TABLE IF EXISTS music_library"
session.execute(query)

<cassandra.cluster.ResultSet at 0x146f6bf3460>

### Drop the keyspace

In [14]:
query = "DROP KEYSPACE IF EXISTS udacity"
session.execute(query)

<cassandra.cluster.ResultSet at 0x146fe41cb50>

### Close the session and cluster connection

In [15]:
session.shutdown()
cluster.shutdown()