In [2]:
from IPython.core.display import HTML
HTML("""
<style>

div.text_cell_render h1, h2, h3, h4, h5 { 
font-family: 'Georgia';
}


div.text_cell_render { /* Customize text cells */
font-family: 'Avenir';
font-size:15px;
line-height:18px;
color: #292929;
font-weight:400;
}
</style>
""")

In [3]:
### Run for formatting display width
display(HTML("<style>.container { width:100% !important;}</style>"))

# Project 2: Data Modeling with Apache Cassandra 

## Introduction 

In this project, we have a startup called Sparkify that wants to analyze the data they've been collecting on their new music streaming app regarding: 

1. songs 
2. user activity 

<img src="images/sparkify.png">

The analytics team is trying to understand what songs users are listening to. They want to know which songs are most popular, what the listening time of each song is, etc.  

So, as they are a startup, they don't have an easy way of querying their data.  

Their data currently resides in a directory of `CSV` files about user activity on the app. 


Now, they would like a data engineer to create an Apache Cassandra database designed to optimize their queries concerned with understanding what songs users are listening to. 

In other words, you have to create a database that is optimized for _song play analysis_.   


Our task is to create a database schema and ETL pipeline for this analysis.

So, in this project you will apply what you've learned on data modeling with Apache Cassandra and build an ETL pipeline using Python.   

Make sure you have completed [Lesson 4: NoSQL Data Models](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/864ae5c9-e9fb-47ca-bcde-b152d00543a1). The learnings from this lesson are the only way you will be able to complete this project. 

To complete this project, you will need to model your data by creating tables in Apache Cassandra to run queries.  


Now you may be wondering, 

## Project Datasets 

Now, for those of you who want a refresher on `CSV` or flat files, in general, I'll let my friend, [David explain it to you](https://www.youtube.com/watch?v=bLKVRIhrZUY). 

Basically, they are just comma-separated values that represent tabular data in plain text format, with one data record per line. In each line, we have multiple fields separated by commas. 

<img src="images/flat_file_structure.png">



We can see the header line, `ranking, critic_score, title, number_of_critic_ratings`, in the `.csv` text file on the left, and it is also the first row in the Excel file on the right.  

Then, we see the line `1, 99, The Wizard of Oz (1939), 110` on the left and the same values entered in the Excel file on the right. So you get a sense of how `csv`s work. 
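
To make the structure concrete, here is a minimal sketch that parses the two lines quoted above with Python's built-in `csv` module. The in-memory string `sample` is an assumption that stands in for a real `.csv` file on disk.

```python
import csv
import io

# Stand-in for a .csv file on disk: the header line and the first
# record quoted in the text above.
sample = (
    "ranking, critic_score, title, number_of_critic_ratings\n"
    "1, 99, The Wizard of Oz (1939), 110\n"
)

# skipinitialspace=True drops the blank that follows each comma
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
records = list(reader)

# Each record maps a column name from the header to a string value
for record in records:
    print(record["ranking"], record["title"], record["critic_score"])
```

Note that every value comes back as a string, even the numeric-looking ones. This matters later when we insert data into typed table columns.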

This is what our `event_data` folder looks like in our workspace. We have 30 files, one for each day in November 2018. Each file contains the events or logs from one day. 

<img src='images/event_data_files.png'>

Here is what one of the `.csv` files in `event_data` looks like: 

> This is data from 1st November, 2018. 
<img src="images/csv_file_1.png">

We can see that we have the header row: 

```
artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userId
``` 

Then, we have rows of data with a value for each column. 

Now, since this is event data from the Sparkify app, we have a different number of records for each day. Some days have just 16 entries, as shown in the `2018-11-01-events.csv` file, while other days have more than 100 events:  

> This data is from 5th November 2018. 
<img src="images/csv_file_2.png">

Let's read in the first 5 rows of the events from 1st November, 2018. 

In [4]:
import pandas as pd 
pd.read_csv('event_data/2018-11-01-events.csv').head()

FileNotFoundError: [Errno 2] No such file or directory: 'event_data/2018-11-01-events.csv'

Now, we can see that we have a column for the `artist` name.

Looking at the information on the user, we have the columns for the `firstName` and `lastName` of the user.    
Then, we also have columns telling us the `gender` of the user, the `level` of the app they are using (free or paid), their `location`, their `registration` and `userId`.   

Looking at the information regarding the song they are listening to, we have the duration, or `length` of the song and the name of the `song`. 

Finally, looking at some generic information about the log, we see that we have the `itemInSession` column, the `method`, `page`, `sessionId`, `status` and the timestamp, i.e. `ts`.  

__NOTE:__ The `page` column is an important one. It helps us filter which logs are related to song plays. That is, a log with `page == 'NextSong'` is a log in which the user was listening to a song. 
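
As a minimal sketch of that filter, the snippet below keeps only the `NextSong` logs from a small in-memory sample. The sample is hypothetical: it has only four of the log columns, and the `Home` row is invented for illustration.

```python
import csv
import io

# Hypothetical excerpt with only a few of the log columns.
# The "Home" row is made up to show a log that is not a song play.
sample_log = (
    "artist,page,song,userId\n"
    "Des'ree,NextSong,You Gotta Be,8\n"
    ",Home,,8\n"
    "Mr Oizo,NextSong,Flat 55,8\n"
)

rows = list(csv.DictReader(io.StringIO(sample_log)))

# Keep only the logs where the user was actually listening to a song
song_plays = [row for row in rows if row["page"] == "NextSong"]

print(len(rows), "logs,", len(song_plays), "song plays")
```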

So, now, let's get into the Project Template and see which files we are going to be working with. 

## Project Template 

For this project, it is advisable to work in the workspace itself, as setting up Cassandra is a little tricky.  
 

The project template includes only one Jupyter Notebook, in which you'll carry out the project steps. 

### 📝 My Notes: 
- 
- 

## Project Steps 

Below are the steps you can follow to complete each component of this project. 

### Modeling your NoSQL  or Apache Cassandra database 

__STEP 1__ Run the 4 code cells in Part I. This will import the python packages and create an `event_datafile_new.csv` which will have data from all `.csv`s in `event_data` folder. It will have $6821$ rows.   


__STEP 2__ After connecting to the local instance of Apache Cassandra and creating a session, create and set a keyspace for this project.   


__STEP 3__ Create queries to ask the given 3 questions from the data. These queries will include:
- `CREATE TABLE` queries to create the tables to answer the given question. 
- `INSERT` queries to populate the created tables with data from `event_datafile_new.csv`. 
- `SELECT` queries to select the relevant data from the created table to answer the given question. 

__STEP 4__ `DROP` the Tables you created. 

## Can we start the project now? 


## Wait! Before starting the project, make sure to go through the [Project Rubric](https://review.udacity.com/#!/rubrics/2475/view)


Now, before we start with the project, I want us to first go through the project rubric. It's always a good idea to go through the project rubric once before we do the project, and once after we do the project.

This rubric serves as a project checklist for us. You can see the tasks you need to do and make sure you meet the specifications of the project.


### Basically, here are the requirements: 

Just read them through. You won't really understand much of it but it's like reading the questions of a reading comprehension before actually reading the essay. 


#### ETL Pipeline Processing 

- Student creates __`event_data_new.csv`__ file.    

- Student uses the __appropriate datatype within the CREATE__ statement. For e.g., `artist_name` and `song_title` use `TEXT`, and `length` uses the `FLOAT` datatype.  

#### Data Modeling 

- Student creates the __correct Apache Cassandra tables__ for each of the three queries. The `CREATE TABLE` statement should include the appropriate table. Student should adhere to the __one table per query rule of Apache Cassandra__. The student is allowed to use the same table for two of the questions, where it makes sense.    


- Student demonstrates good understanding of data modeling by generating __correct `SELECT` statements to generate the result being asked for in the question__. The `SELECT` statement should NOT use `ALLOW FILTERING` to generate the results. For e.g., for Query 3, the `SELECT` statement should not require anything more than the user's first and last name IF the table has been created with the correct COMPOSITE PRIMARY KEY, including partitions and clustering columns.   


- Student should use __table names that reflect the query__ and the result it will generate. Table names should include alphanumeric characters and underscores, and table names must start with a letter. We are looking for table names that provide a good general sense of what this query will generate. For e.g., for Query 2, an appropriate table name should reflect song playlist in session (e.g., name could be `song_playlist_session`). Students should not be using table names like `query_1` or `project_1`, etc. as table names need to be descriptive. 


- The __sequence in which columns appear should reflect how the data is partitioned and the order of the data within the partitions__. The sequence of the columns in the `CREATE` and `INSERT` statements should follow the order of the COMPOSITE PRIMARY KEY and CLUSTERING columns. The data should be inserted and retrieved in the same order as the COMPOSITE PRIMARY KEY is set up. This is important because Apache Cassandra is a partition row store, which means the partition key determines which node any particular row is stored on. In the case of a composite partition key, it determines how partitions are distributed across the nodes of the cluster and how they are chunked for write purposes. Any clustering column(s) determine the order in which the data is sorted within the partition.   

#### PRIMARY KEYS

- The combination of the PARTITION KEY alone or with the addition of CLUSTERING COLUMNS should be used appropriately to __uniquely identify each row__. For e.g., in Query 3, the student should not use only `song` as the PARTITION KEY. Similarly, the student does not need to use both `firstName` and `lastName` along with `userId` for the Query 3 clustering columns, as `song` and `userId` together will uniquely identify each row. The student should include clustering columns as part of the COMPOSITE PRIMARY KEY and understand that a COMPOSITE PRIMARY KEY uniquely identifies each row.   

#### Presentation 

- The notebooks should include a description of the query the data is modeled after. The student can include headers right above the SELECT statement cell to highlight the responses to the questions.  


- Code should be organized well into the different queries. Any in-line comments that were clearly part of the project instructions should be removed so the notebook provides a professional look.




### 📝 My Notes: 
- 
- 

## Part I: ETL Pipeline for Pre-Processing the Files 

### Import Python Packages 

In [5]:
# import python packages
import pandas as pd
import cassandra 
import re
import os
import glob 
import numpy as np
import json 
import csv

### Create list of filepaths to process original event csv data files 

In [6]:
# checking your current working directory 
print(os.getcwd())

c:\Users\n0170199\Projects\DE_Projects\project_2_walkthrough


In [7]:
# Get absolute path of subfolder called `event_data` 
filepath = os.getcwd() + '/event_data'

The following block of code creates a python list called `file_path_list` which has the list of absolute file paths of the $30$ files in `event_data`. 

In [8]:
# Create a for loop to create a list of file paths
# in `event_data` 

for root, dirs, files in os.walk(filepath):
    # joining the file path of each file with the root directory
    # to get the absolute path of the csv files within the 
    # `/home/workspace/event_data`
    file_path_list = glob.glob(os.path.join(root,'*'))
    print (file_path_list)

['c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-01-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-02-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-03-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-04-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-05-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-06-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-07-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-08-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event_data\\2018-11-09-events.csv', 'c:\\Users\\n0170199\\Projects\\DE_Projects\\project_2_walkthrough/event

The above output shows us what `file_path_list` looks like. Let's see if we got all the file paths. 

In [9]:
print(len(file_path_list))

30


Now, we have the absolute file path of 30 files, one for each day in November 2018. 

### Let's process the files to create a new data file `.csv` that will contain records from all of the 30 files.  

Don't worry too much about how we have combined these `.csv` files into one file. Just know that we are going to use the combined file called `event_datafile_new.csv` to create tables in Apache Cassandra like we used `supermarket_sales.csv` in the last class. 

I've thoroughly commented the code here so that we don't have a problem understanding what is being done.   

The following code block: 
- Creates an empty list called `full_data_rows_list` 
- For each csv file in `file_path_list`: 
    - Skips the header row (which has the column names), 
    - Reads each row and 
    - Appends the row to the `full_data_rows_list` 

Then, once we have the list of rows from all `csv` files in `full_data_rows_list`, we create a new csv file called `event_datafile_new.csv`, write the header row (the column names) for the first row, and write each row from `full_data_rows_list` into the `event_datafile_new.csv`. 

In [10]:
# initiating an empty list of rows which will be filled iteratively
# by looping over each rows of each csv file
full_data_rows_list = [] 
    
# for every csv in the file_path_list 
for f in file_path_list:

    # open the csv file as csvfile
    with open(f, 'r', encoding = 'utf8', newline='') as csvfile: 
        # creating a csv reader object 
        csvreader = csv.reader(csvfile) 
        # skip the header 
        next(csvreader)
        
        # extract each data row one by one and append it to the
        # full_data_rows_list 
        for line in csvreader:
            full_data_rows_list.append(line) 
            
# uncomment the code below if you would like to get total number of rows 
#print(len(full_data_rows_list))
# uncomment the code below if you would like to check to see what the list of event data rows will look like
#print(full_data_rows_list)

# Creating a dialect (customising a csv parser) to 
# make sure all values are quoted and the initial space
# after the delimiter is skipped. 
csv.register_dialect('myDialect', quoting=csv.QUOTE_ALL, 
                     skipinitialspace=True)

# creating a event data csv file called event_datafile_new csv
# that will be used to insert data into the Apache Cassandra
# tables
with open('event_datafile_new.csv', 'w', encoding = 'utf8', newline='') as f:
    # create a writer with dialect created
    writer = csv.writer(f, dialect='myDialect')
    
    # write the first row as the column names
    writer.writerow(['artist','firstName','gender','itemInSession','lastName','length',\
                'level','location','sessionId','song','userId'])
    
    # for each row in the full_data_rows_list
    for row in full_data_rows_list:
        
        if (row[0] == ''):
            continue
        # add data to the respective columns set
        writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[12], row[13], row[16]))

Let's check the number of rows in this new csv file. 

In [11]:
# check the number of rows in your csv file
with open('event_datafile_new.csv', 'r', encoding = 'utf8') as f:
    print(sum(1 for line in f)) 

6821


Make sure that the number of rows you have is $6821$. If you get a different number of rows, your code may need to be `reset`.  

<img src="images/reset.png"> 

These are the first five rows of the `event_datafile_new.csv`. 

In [12]:
pd.read_csv('event_datafile_new.csv').head()

Unnamed: 0,artist,firstName,gender,itemInSession,lastName,length,level,location,sessionId,song,userId
0,Des'ree,Kaylee,F,1,Summers,246.30812,free,"Phoenix-Mesa-Scottsdale, AZ",139,You Gotta Be,8
1,Mr Oizo,Kaylee,F,3,Summers,144.03873,free,"Phoenix-Mesa-Scottsdale, AZ",139,Flat 55,8
2,Tamba Trio,Kaylee,F,4,Summers,177.18812,free,"Phoenix-Mesa-Scottsdale, AZ",139,Quem Quiser Encontrar O Amor,8
3,The Mars Volta,Kaylee,F,5,Summers,380.42077,free,"Phoenix-Mesa-Scottsdale, AZ",139,Eriatarka,8
4,Infected Mushroom,Kaylee,F,6,Summers,440.2673,free,"Phoenix-Mesa-Scottsdale, AZ",139,Becoming Insane,8


We have the name of the `artist`; the `firstName`, `lastName`, `gender`, `level` (free or paid), `location` and `userId` of the user; the name of the `song` and its `length`; and finally, the `sessionId`.  

And here is the length of the dataframe: 

In [13]:
pd.read_csv('event_datafile_new.csv').shape

(6820, 11)

It's $6820$ as the first row is the header row. 

Once you have the correct number of rows, you can proceed with Part II.

## Part II. Complete the Apache Cassandra coding portion of your project. 

__Now, you are ready to work with the CSV file titled `event_datafile_new.csv`, located within your Workspace directory. As a refresher, the `event_datafile_new.csv` contains the following columns__: 

- `artist`'s name 
- `firstName` of user
- `gender` of user
- `itemInSession`, which refers to the item number in session
- `lastName` of user
- `length` of the song 
- `level` of the user (paid or free)
- `location` of the user
- `sessionId` 
- `song` title 
- `userId` 

### Begin writing your Apache Cassandra Code in the cells below  

### Creating a Cluster  

The following code connects to our local instance of Apache Cassandra (if we have one). This connection will reach out to the database and ensure we have the correct privileges to connect to it.  

In [14]:
# TODO: Make a connection to a Cassandra instance on
# your local machine (127.0.0.1) 
from cassandra.cluster import Cluster

try:
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect()
except Exception as e:
    print(e)

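Once we create our cluster object, we need to connect to it. The `cluster.connect()` call above does this and gives us the session that we will use to execute queries. The next cell uses that session to create the `events` keyspace. 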

In [15]:
# The session was created above; use it to begin executing
# queries, starting with creating the `events` keyspace
session.execute("""
     CREATE KEYSPACE IF NOT EXISTS events
    WITH REPLICATION =
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)


<cassandra.cluster.ResultSet at 0x211b86a7550>

This is analogous to what we used to do when we connected to the PostgreSQL database and got a cursor to it using code like this: 

```python 
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
cur = conn.cursor()
```  

The `session` variable will help us execute queries. This is similar to how we used the `cur` variable or cursor with `psycopg2`. 

### Create Keyspace  

A keyspace in Cassandra is analogous to a database you would create in PostgreSQL. It defines the _replication strategy_ and the _replication factor_.  

The _replication factor_ tells us how many copies of the data will be distributed across the nodes. The _replication strategy_ specifies _how_ the replication should take place. 

You can learn more about it [here](https://www.tutorialspoint.com/cassandra/cassandra_create_keyspace.htm). 

In [None]:
# TODO: Create a Keyspace  


<cassandra.cluster.ResultSet at 0x7f240b0bac18>

If you want a refresher on how to create a keyspace, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/512d3115-ba9c-41f7-a1fd-0a676292e7d5) in the classroom. 

Here, we are creating a database which will consist of tables relating to the events/logs of users on a music app. So, you can name your keyspace accordingly. 

### Set Keyspace 

In [17]:
# TODO: Set Keyspace to the keyspace specified above 
session.set_keyspace('events') 

### ✍️ Notes
- 
- 

### Now, we need to create tables to run the following queries. Remember, with Apache Cassandra, you model the database tables on queries you want to run.  

This is done whenever we are dealing with big data. If we think about the queries first, we can appropriately partition and sort our data while creating tables. This will make our reads super fast, as the data is already partitioned and sorted! 

## Create queries to ask the following three questions of the data  

## 1. Give me the artist, song title and song's length in the music app history that was heard during `sessionId = 338` and `itemInSession = 4` 

Before creating the table for this, let's first understand what our query is going to be. 

- It's going to be selecting the columns artist, song's title and song's length. 
- It's going to look for the records based on the value of `sessionId` and `itemInSession`. 

Think of these questions while creating the table: 
1. What columns should be in this table?  

2. What datatype should each column be? 
> - [Here](https://www.guru99.com/cassandra-data-types-expiration-tutorial.html) is a list of the datatypes you have in CQL. Make sure to have a look at this. Data types like `NUMERIC` are not available here. 
> - Some important ones which we use a lot are `INT`, `TEXT`, `VARINT`, `FLOAT`, `BIGINT`,  `TIMESTAMP`, etc. 

3. What should the [Primary Key](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/0600fb6e-935a-4b6b-abd2-16bff1016924) of the table be? Which should be our partition keys, if any? Which should be our [Clustering Columns](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/347092ad-2042-4385-90e5-b258f41941f4), if any?  

Remember: 
- The Primary Key (simple or composite) of a row should be the unique identifier of the row. There should be no 2 rows with the same Primary Key. 
- Look at the `WHERE` of your query to understand your partition keys. Using which columns will you be filtering the records? 
- Look at how you want to order your results. These will help in deciding your clustering columns. 
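
To see how these decisions turn into CQL, here is a minimal sketch of a helper that assembles a `CREATE TABLE` statement from a partition key and optional clustering columns. The function `build_create_table` and the table `plays_by_user` with its columns are hypothetical names made up for this illustration. They are not part of the project template or the Cassandra driver.

```python
def build_create_table(table, columns, partition_key, clustering_columns=()):
    """Return a CQL CREATE TABLE string (hypothetical helper, illustration only)."""
    column_defs = ", ".join(f"{name} {cql_type}" for name, cql_type in columns)

    # A partition key made of several columns is wrapped in its own parentheses
    partition = ", ".join(partition_key)
    if len(partition_key) > 1:
        partition = f"({partition})"

    # Clustering columns follow the partition key inside PRIMARY KEY (...)
    primary_key = ", ".join([partition, *clustering_columns])
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"({column_defs}, PRIMARY KEY ({primary_key}))"
    )


# Hypothetical table: filter by user_id, order rows by played_at
statement = build_create_table(
    "plays_by_user",
    [("user_id", "INT"), ("played_at", "TIMESTAMP"), ("song_title", "TEXT")],
    partition_key=["user_id"],
    clustering_columns=["played_at"],
)
print(statement)
```

The columns you filter on in the `WHERE` clause go into `partition_key`, and the columns you sort by go into `clustering_columns`. Together they must identify each row uniquely.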

In [47]:
# TO-DO: Query 1:  Give me the artist, song title and song's 
# length in the music app history that was heard during 
# sessionId = 338 and itemInSession = 4 

query = "CREATE TABLE IF NOT EXISTS music_history"
query = query + """(artist TEXT,
song_title TEXT,
song_length FLOAT,
session_id INT,
iteminsession INT,
PRIMARY KEY (session_id, iteminsession))"""

print(query)
session.execute(query)


CREATE TABLE IF NOT EXISTS music_history(artist TEXT,
song_title TEXT,
song_length FLOAT,
session_id INT,
iteminsession INT,
PRIMARY KEY (session_id, iteminsession))


<cassandra.cluster.ResultSet at 0x211b6626ac0>

Again, for a refresher on how to create tables in Apache Cassandra, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/512d3115-ba9c-41f7-a1fd-0a676292e7d5) in the classroom.   


This is a screenshot from the solution notebook: 
<img src="images/create_table_example.png">

Once we have created our table, we can start inserting data into it. 

The following code will loop through each row in the `event_datafile_new.csv` and insert the relevant data from each row into the table you just created. 


There are 2 `TO-DO`s here: 
- Write the `INSERT` statement that will be used to insert the relevant data from each record into the table you just created. For a refresher on how to insert data into tables, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/14222a9b-5d25-4b97-a09b-18d5f5a1cdc4) in the classroom.  

This is how it looks in the classroom: 
<img src="images/insert_cassandra.png">

Now, as you notice, there are two parts to the `INSERT` query. The first is the `query` in the screenshot, and then there are the actual values that we need to put into the respective columns. Extracting these values from the current row is what the second TO-DO is all about. 


- Assign which values from the current row or `line` should be assigned for each column in the `INSERT` statement that you create. For example, the artist's name and the user's first name will be the first 2 values in each row or `line` in the `csv`. In order to insert the artist's name, you would use `line[0]` and in order to insert the user's first name, you would use `line[1]`.  

__Note:__ All values in the current `csv` are of type `str`, so you need to convert them when inserting these values into the table. You cannot insert a string where an integer or float is expected and expect it to be converted automatically. 

In order for you to understand which column's value comes at what index, I have written this code which you can use: 

In [48]:
file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        for i, value in enumerate(line):
            print( i, value,type(value))
        break

0 artist <class 'str'>
1 firstName <class 'str'>
2 gender <class 'str'>
3 itemInSession <class 'str'>
4 lastName <class 'str'>
5 length <class 'str'>
6 level <class 'str'>
7 location <class 'str'>
8 sessionId <class 'str'>
9 song <class 'str'>
10 userId <class 'str'>
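
Since every value arrives as a `str`, the numeric ones have to be converted before they are passed to an `INSERT`. Below is a minimal sketch of that conversion, using the first data row shown earlier and the index positions from the listing above. The variable names are chosen for illustration only.

```python
# First data row of event_datafile_new.csv, as csv.reader would return it
line = ["Des'ree", "Kaylee", "F", "1", "Summers", "246.30812",
        "free", "Phoenix-Mesa-Scottsdale, AZ", "139", "You Gotta Be", "8"]

artist = line[0]                 # text column, no conversion needed
item_in_session = int(line[3])   # "1" -> 1
song_length = float(line[5])     # "246.30812" -> 246.30812
session_id = int(line[8])        # "139" -> 139
song_title = line[9]             # text column, no conversion needed
user_id = int(line[10])          # "8" -> 8

print(type(item_in_session), type(song_length), type(session_id))
```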


In [49]:
# We have provided part of the code to set up the CSV file. Please complete the 
# Apache Cassandra code below 

file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    # skip header line
    next(csvreader) 
    for line in csvreader:
        ## TODO: Assign the INSERT Statement
        query = """INSERT INTO music_history 
                    (artist, song_title, song_length, session_id, iteminsession)"""
        # TODO: Assign the Placeholder Values
        query = query + """VALUES (%s, %s, %s, %s,%s)"""
        ## TO-DO: Assign which column element should be assigned for each column in the INSERT statement.
        ## For e.g., to INSERT artist_name and user first_name, you would change the code below to `line[0], line[1]`
        session.execute(query, (line[0], 
                                line[9],
                                float(line[5]),
                                int(line[8]),
                                int(line[3])
                                ))

### Do a `SELECT` to verify that the data have been inserted into each table 

In [50]:
# TO-DO: Add in the SELECT statement to verify the data was 
# entered into the table correctly  
query = """select artist, song_title, song_length
from music_history
WHERE session_id = 338 and iteminsession = 4 """

try:
    rows = session.execute(query)
except Exception as e: 
    print(e)
    
for row in rows:
    print(row.artist, row.song_title, row.song_length)



Faithless Music Matters (Mark Knight Dub) 495.30731201171875


## 2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
    

Again, you will be following the same process here. 

Let's understand what our query is going to be: 
1. We want the artist name, the song title, and the user's first and last name
2. It's going to look for records based on value of `userId` and `sessionId`.  
3. The table is going to be sorted by `itemInSession` 

Think of these questions when creating the table:  

1. What columns should be in this table?  

2. What datatype should each column be? 
> - [Here](https://www.guru99.com/cassandra-data-types-expiration-tutorial.html) is a list of the datatypes you have in CQL. Make sure to have a look at this. Data types like `NUMERIC` (which were available in PostgreSQL) are not available here. 
> - Some important ones which we use a lot are `INT`, `TEXT`, `VARINT`, `FLOAT`, `BIGINT`,  `TIMESTAMP`, etc.
3. What should the [Primary Key](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/0600fb6e-935a-4b6b-abd2-16bff1016924) of the table be? Which should be our partition keys, if any? Which should be our [Clustering Columns](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/347092ad-2042-4385-90e5-b258f41941f4), if any? 

Remember: 
- The Primary Key (simple or composite) of a row should be the unique identifier of the row. There should be no 2 rows with the same Primary Key. 
- Look at the `WHERE` of your query to understand your partition keys. Using which columns will you be filtering the records? 
- Look at how you want to order your results. These will help in figuring out your clustering columns. 
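
The roles of the partition key and the clustering column can be pictured with a plain-Python analogy. This is only a mental model under simplifying assumptions, not how Cassandra stores data internally, and the events below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical play events: (userid, sessionid, iteminsession, song)
events = [
    (1, 100, 2, "song C"),
    (2, 200, 0, "song X"),
    (1, 100, 0, "song A"),
    (1, 100, 1, "song B"),
]

# Group rows by the partition key (userid, sessionid)
partitions = defaultdict(list)
for userid, sessionid, iteminsession, song in events:
    partitions[(userid, sessionid)].append((iteminsession, song))

# Keep each partition ordered by the clustering column iteminsession
for rows in partitions.values():
    rows.sort()

# A query with WHERE userid = 1 AND sessionid = 100 reads one partition,
# and its rows already come back sorted by iteminsession
print(partitions[(1, 100)])
```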

In [62]:
# TO-DO: Query 2:  Give me only the following: name of artist, 
# song (sorted by itemInSession) and user (first and last name)
# for userid = 10, sessionid = 182 
query = "CREATE TABLE IF NOT EXISTS table2"
query = query + """(
                    artist TEXT,
                    song TEXT,
                    user TEXT,
                    sessionid INT,
                    userid INT,
                    iteminsession INT,
                    PRIMARY KEY ((userid, sessionid),iteminsession))"""

print(query)
session.execute(query)


CREATE TABLE IF NOT EXISTS table2(
                    artist TEXT,
                    song TEXT,
                    user TEXT,
                    sessionid INT,
                    userid INT,
                    iteminsession INT,
                    PRIMARY KEY ((userid, sessionid),iteminsession))


<cassandra.cluster.ResultSet at 0x211b877c9a0>

Again, for a refresher on how to create tables in Apache Cassandra, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/512d3115-ba9c-41f7-a1fd-0a676292e7d5) in the classroom.   


This is what one of them looks like: 
<img src="images/create_table_example.png">


Once we have created our table, we can start inserting data into it. 

The following code will loop through each row in the `event_datafile_new.csv` and insert the relevant data from each row into the table you just created. 


There are 2 `TO-DO`s here: 
- Write the `INSERT` statement that will be used to insert the relevant data from each record into the table you just created. For a refresher on how to insert data into tables, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/14222a9b-5d25-4b97-a09b-18d5f5a1cdc4) in the classroom.  

This is how it looks in the classroom: 
<img src="images/insert_cassandra.png">

As you can see, the `INSERT` has two parts: the `query` string shown in the screenshot, and the actual values that go into the respective columns. Extracting these values from the current row is what the second TO-DO is about.


- Decide which values from the current row (`line`) go into each column of the `INSERT` statement you create. For example, the artist's name and the user's first name are the first 2 values in each row or `line` of the `csv`. To insert the artist's name, you would use `line[0]`, and to insert the user's first name, you would use `line[1]`.

__Note:__ All values read from the `csv` are of type `str`, so you need to convert them before inserting them into the table. You cannot insert a string where an integer or float is expected and expect it to be converted automatically.

To see which column's value sits at which index, you can use this code:


In [63]:
file = 'event_datafile_new.csv'

with open(file, encoding='utf8') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        for i, value in enumerate(line):
            print(i, value, type(value))
        break

0 artist <class 'str'>
1 firstName <class 'str'>
2 gender <class 'str'>
3 itemInSession <class 'str'>
4 lastName <class 'str'>
5 length <class 'str'>
6 level <class 'str'>
7 location <class 'str'>
8 sessionId <class 'str'>
9 song <class 'str'>
10 userId <class 'str'>
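
Since every value is read as `str`, one way to handle the conversion is to cast the numeric columns in a single place before building the `INSERT` parameters. The following is a minimal sketch, not the required approach: the helper name `cast_row` and the one-row sample are made up for illustration, and only the column names come from the header printed above.

```python
import csv
import io

# Columns in this dataset that hold whole numbers or decimals.
INT_COLUMNS = ("itemInSession", "sessionId", "userId")
FLOAT_COLUMNS = ("length",)


def cast_row(row):
    """Return a copy of a csv.DictReader row with numeric columns converted."""
    typed = dict(row)
    for name in INT_COLUMNS:
        typed[name] = int(row[name])
    for name in FLOAT_COLUMNS:
        typed[name] = float(row[name])
    return typed


# Made-up sample in the same column order as the header printed above.
sample = io.StringIO(
    "artist,firstName,gender,itemInSession,lastName,length,level,location,sessionId,song,userId\n"
    "Some Artist,Ada,F,3,Lovelace,245.5,free,Somewhere,182,Some Song,10\n"
)

for row in csv.DictReader(sample):
    typed = cast_row(row)
    print(typed["userId"], type(typed["userId"]))
    print(typed["length"], type(typed["length"]))
```

Using `csv.DictReader` here is a design choice: looking values up by column name avoids having to remember that, for example, `sessionId` sits at index 8.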


In [64]:
# We have provided part of the code to set up the CSV file. Please complete the Apache Cassandra code below
file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    # skip header line
    next(csvreader) 
    for line in csvreader:
        ## TO-DO: Assign the INSERT statements into the `query` variable
        query = "INSERT INTO table2 (artist, song, user, sessionid, userid, iteminsession) "
        query = query + "VALUES (%s, %s, %s, %s, %s, %s)"
        ## TO-DO: Assign which column element should be assigned for each column in the INSERT statement.
        ## For e.g., to INSERT artist_name and user first_name, you would change the code below to `line[0], line[1]`
        session.execute(query, (line[0], 
                                line[9],
                                line[1] + ' ' + line[4],
                                int(line[8]),
                                int(line[10]),
                                int(line[3]),
                                ))
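
The loop above rebuilds the same query string for every row. As a possible variation, the statement can be defined once and the row-to-parameter mapping kept in one small function. This is only a sketch under assumptions: `row_to_params`, `load_rows`, and `RecordingSession` are hypothetical names, and `RecordingSession` is a stand-in that records calls so the example runs without a cluster. With a live connection you would pass the notebook's `session` instead.

```python
INSERT_TABLE2 = (
    "INSERT INTO table2 (artist, song, user, sessionid, userid, iteminsession) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def row_to_params(line):
    """Map one CSV row (a list of strings) to the INSERT parameters for table2."""
    return (
        line[0],                    # artist
        line[9],                    # song
        line[1] + " " + line[4],    # first name + last name
        int(line[8]),               # sessionId
        int(line[10]),              # userId
        int(line[3]),               # itemInSession
    )


def load_rows(session, lines):
    """Run one INSERT per row and return the number of rows sent."""
    count = 0
    for line in lines:
        session.execute(INSERT_TABLE2, row_to_params(line))
        count += 1
    return count


class RecordingSession:
    """Hypothetical stand-in that records calls instead of talking to Cassandra."""

    def __init__(self):
        self.calls = []

    def execute(self, query, params=None):
        self.calls.append((query, params))


fake = RecordingSession()
# Made-up row in the same column order as event_datafile_new.csv.
sample_line = ["Some Artist", "Ada", "F", "3", "Lovelace", "245.5",
               "free", "Somewhere", "182", "Some Song", "10"]
print(load_rows(fake, [sample_line]))
print(fake.calls[0][1])
```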

### Do a `SELECT` to verify that the data have been inserted into each table 

In [67]:
# TO-DO: Add in the SELECT statement to verify the data was 
# entered into the table correctly   
query = """SELECT artist, song, user
FROM table2
WHERE userid = 10 AND sessionid = 182"""

try:
    rows = session.execute(query)
except Exception as e:
    print(e)
else:
    # Only iterate when the query succeeded, so `rows` is always defined here
    for row in rows:
        print(row.artist, row.song, row.user)


Down To The Bone Keep On Keepin' On Sylvie Cruz
Three Drives Greece 2000 Sylvie Cruz
Sebastien Tellier Kilometer Sylvie Cruz
Lonnie Gordon Catch You Baby (Steve Pitron & Max Sanna Radio Edit) Sylvie Cruz
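
To build some intuition for why this `SELECT` names both `userid` and `sessionid` in its `WHERE` clause, and why the rows come back ordered by `iteminsession`, here is a small in-memory model of `table2`. It is only an analogy: a Python dict stands in for partition storage, and `ToyTable` and the sample rows are invented for illustration.

```python
from collections import defaultdict


class ToyTable:
    """In-memory analogy for PRIMARY KEY ((userid, sessionid), iteminsession)."""

    def __init__(self):
        # partition key -> {clustering value -> row}
        self.partitions = defaultdict(dict)

    def insert(self, userid, sessionid, iteminsession, artist, song):
        self.partitions[(userid, sessionid)][iteminsession] = (artist, song)

    def select(self, userid, sessionid):
        """Read one partition; rows come back sorted by the clustering column."""
        partition = self.partitions.get((userid, sessionid), {})
        return [partition[item] for item in sorted(partition)]


table = ToyTable()
# Made-up rows, inserted out of order on purpose.
table.insert(10, 182, 2, "Artist C", "Song C")
table.insert(10, 182, 0, "Artist A", "Song A")
table.insert(10, 182, 1, "Artist B", "Song B")
table.insert(10, 999, 0, "Artist Z", "Song Z")

for artist, song in table.select(10, 182):
    print(artist, song)
```

The lookup needs the whole partition key `(userid, sessionid)` to find the partition, and the clustering column decides the order inside it, regardless of insertion order.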


## 3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'

Again, you will be following the same process here. 

Let's understand what our query is going to be: 
1. We want the first and last name of the user 
2. It's going to look up records by the name of the song

Think of these questions when creating the table:  

1. What columns should be in this table?  

2. What datatype should each column be? 
> - [Here](https://www.guru99.com/cassandra-data-types-expiration-tutorial.html) is a list of the data types available in CQL. Make sure to have a look at it. Data types like `NUMERIC` (which were available in PostgreSQL) are not available here.
> - Some important ones that we use a lot are `INT`, `TEXT`, `VARINT`, `FLOAT`, `BIGINT`, `TIMESTAMP`, etc.
3. What should the [Primary Key](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/0600fb6e-935a-4b6b-abd2-16bff1016924) of the table be? Which should be our partition keys, if any? Which should be our [Clustering Columns](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/347092ad-2042-4385-90e5-b258f41941f4), if any? 

Remember: 
- The Primary Key (simple or composite) of a row should be the unique identifier of the row. There should be no 2 rows with the same Primary Key. 
- Look at the `WHERE` of your query to understand your partition keys. Using which columns will you be filtering the records? 
- Look at how you want to order your results. These will help in figuring out your clustering columns. 
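
The first reminder matters in practice because an `INSERT` in Cassandra overwrites any existing row with the same primary key instead of raising an error. The sketch below imitates that behaviour with a plain dict to compare two candidate keys for this query, `(song, user)` and `(song, userid)`. The sample plays are invented and `upsert_all` is a hypothetical helper, so treat this as an illustration rather than the expected solution.

```python
def upsert_all(rows, key_columns):
    """Imitate an upsert: a later row replaces an earlier one with the same key."""
    table = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        table[key] = row
    return table


# Made-up plays: two different users who happen to share a name.
plays = [
    {"song": "Some Song", "userid": 1, "user": "Sam Lee"},
    {"song": "Some Song", "userid": 2, "user": "Sam Lee"},
    {"song": "Some Song", "userid": 1, "user": "Sam Lee"},  # user 1 plays it again
]

by_name = upsert_all(plays, ("song", "user"))
by_id = upsert_all(plays, ("song", "userid"))

print(len(by_name))  # 1: the two users collapse into one row
print(len(by_id))    # 2: one row per distinct listener
```

Keying on the user's name works only as long as no two listeners of the same song share a name, which is why an identifier such as `userid` is usually the safer clustering column.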

In [84]:
# TO-DO: Query 3:  Give me every user name (first and last) in my music app history who 
# listened to the song 'All Hands Against His Own' 
query = "CREATE TABLE IF NOT EXISTS table3"
query = query + """(
                    song TEXT,
                    user TEXT,
                    PRIMARY KEY ( song, user ))"""

print(query)
session.execute(query)


CREATE TABLE IF NOT EXISTS table3(
                    song TEXT,
                    user TEXT,
                    PRIMARY KEY ( song, user ))


<cassandra.cluster.ResultSet at 0x211b8790a90>

Again, for a refresher on how to create tables in Apache Cassandra, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/512d3115-ba9c-41f7-a1fd-0a676292e7d5) in the classroom.


This is what one of them looks like:
<img src="images/create_table_example.png">


Once we have created our table, we can start inserting data into it. 

The following code loops through each row in `event_datafile_new.csv` and inserts the relevant data from each row into the table you just created.


There are 2 `TO-DO`s here:
- Write the `INSERT` statement that will be used to insert the relevant data from each record into the table you just created. For a refresher on how to insert data into tables, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/14222a9b-5d25-4b97-a09b-18d5f5a1cdc4) in the classroom.

This is how it looks in the classroom: 
<img src="images/insert_cassandra.png">

As you can see, the `INSERT` has two parts: the `query` string shown in the screenshot, and the actual values that go into the respective columns. Extracting these values from the current row is what the second TO-DO is about.


- Decide which values from the current row (`line`) go into each column of the `INSERT` statement you create. For example, the artist's name and the user's first name are the first 2 values in each row or `line` of the `csv`. To insert the artist's name, you would use `line[0]`, and to insert the user's first name, you would use `line[1]`.

__Note:__ All values read from the `csv` are of type `str`, so you need to convert them before inserting them into the table. You cannot insert a string where an integer or float is expected and expect it to be converted automatically.

To see which column's value sits at which index, you can use this code:


In [69]:
file = 'event_datafile_new.csv'

with open(file, encoding='utf8') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        for i, value in enumerate(line):
            print(i, value, type(value))
        break

0 artist <class 'str'>
1 firstName <class 'str'>
2 gender <class 'str'>
3 itemInSession <class 'str'>
4 lastName <class 'str'>
5 length <class 'str'>
6 level <class 'str'>
7 location <class 'str'>
8 sessionId <class 'str'>
9 song <class 'str'>
10 userId <class 'str'>


In [85]:
# We have provided part of the code to set up the CSV file. Please complete the Apache Cassandra code below
file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    # skip header line
    next(csvreader) 
    for line in csvreader:
        ## TO-DO: Assign the INSERT statements into the `query` variable
        query = "INSERT INTO table3 (song, user) "
        query = query + "VALUES (%s, %s)"
        ## TO-DO: Assign which column element should be assigned for each column in the INSERT statement.
        ## For e.g., to INSERT artist_name and user first_name, you would change the code below to `line[0], line[1]`
        session.execute(query, (line[9], line[1] + ' ' + line[4]))

### Do a `SELECT` to verify that the data have been inserted into each table 

In [86]:
# TO-DO: Add in the SELECT statement to verify the data was 
# entered into the table correctly   
query = """SELECT user
FROM table3
WHERE song = 'All Hands Against His Own'"""

try:
    rows = session.execute(query)
except Exception as e:
    print(e)
else:
    # Only iterate when the query succeeded, so `rows` is always defined here
    for row in rows:
        print(row.user)

Jacqueline Lynch
Sara Johnson
Tegan Levine


If you did everything right, you should see something like this in the output: 

<img src="images/answer_3.png"> 

## Drop the tables before closing out the sessions

In [87]:
## TO-DO: Drop the tables before closing out the sessions 
session.execute("DROP TABLE music_history")

<cassandra.cluster.ResultSet at 0x211b8d1d340>

In [88]:
session.execute("DROP TABLE table2")

<cassandra.cluster.ResultSet at 0x211b8bc6250>

In [89]:
session.execute("DROP TABLE table3")

<cassandra.cluster.ResultSet at 0x211b881fe80>

Again, to review how to drop the tables you created, refer to the [Lesson 4: Exercise 3: Solution](https://classroom.udacity.com/nanodegrees/nd027/parts/f7dbb125-87a2-4369-bb64-dc5c21bb668a/modules/c0e48224-f2d0-4bf5-ac02-3e1493e530fc/lessons/73fd6e35-3319-4520-94b5-9651437235d7/concepts/14222a9b-5d25-4b97-a09b-18d5f5a1cdc4) in the classroom.
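
If one of these cells is re-run after its table is already gone, a plain `DROP TABLE` raises an error. CQL also accepts `DROP TABLE IF EXISTS`, so one possible cleanup is to generate the statements in a loop. This is a small sketch: `drop_statements` is a hypothetical helper, and the table names are the ones used in this notebook.

```python
TABLES = ("music_history", "table2", "table3")


def drop_statements(table_names):
    """Build one re-runnable DROP statement per table name."""
    return [f"DROP TABLE IF EXISTS {name}" for name in table_names]


for statement in drop_statements(TABLES):
    print(statement)
    # With a live connection: session.execute(statement)
```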

### Close the session and cluster connection

In [90]:
session.shutdown()
cluster.shutdown()

Notice how this is similar to the way we closed the connection to the PostgreSQL database.

<a id="reflection"></a>
## Reflection 
> #### [Tweet] Your Learnings! 
> ###  I used to think ______, now I think ___. 

## Before submitting the project, make sure your project abides by the [Project Rubric](https://review.udacity.com/#!/rubrics/2475/view)


#### ETL Pipeline Processing

✅  Student creates the `event_datafile_new.csv` file.

✅  Student uses the appropriate datatype within the `CREATE` statement. For example, `artist_name` and `song_title` use the `TEXT` datatype, and `length` uses `FLOAT`.

#### Data Modeling 

✅  Student creates the correct Apache Cassandra tables for each of the three queries. The `CREATE TABLE` statement should include the appropriate table. Student should adhere to the one table per query rule of Apache Cassandra. The student is allowed to use the same table for two of the questions, where it makes sense.    


✅  Student demonstrates good understanding of data modeling by writing correct `SELECT` statements that generate the result being asked for in the question. The `SELECT` statement should NOT use `ALLOW FILTERING` to generate the results. For example, for Query 3, the `SELECT` statement should not need to return anything more than the user's first and last name IF the table has been created with the correct COMPOSITE PRIMARY KEY, including partition and clustering columns.


✅  Student should use table names that reflect the query and the result it will generate. Table names should include only alphanumeric characters and underscores, and must start with a letter. We are looking for table names that give a good general sense of what the query will generate. For example, for Query 2, an appropriate table name should reflect the song playlist in a session (e.g., `song_playlist_session`). Students should not use table names like `query_1` or `project_1`, as table names need to be descriptive.


✅  The sequence in which columns appear should reflect how the data is partitioned and the order of the data within the partitions. The sequence of the columns in the `CREATE` and `INSERT` statements should follow the order of the COMPOSITE PRIMARY KEY and CLUSTERING columns. The data should be inserted and retrieved in the same order as the COMPOSITE PRIMARY KEY is set up. This is important because Apache Cassandra is a partition row store, which means the partition key determines which node stores any particular row. With a composite partition key, the combination of its columns determines how rows are grouped into partitions and distributed across the nodes of the cluster for writes. Any clustering column(s) determine the order in which the data is sorted within the partition.

#### PRIMARY KEYS

✅  The combination of the PARTITION KEY alone or with the addition of CLUSTERING COLUMNS should be used appropriately to uniquely identify each row. For example, in Query 3, the student should not use `song` as the PARTITION KEY on its own. Similarly, the student does not need to use both `firstName` and `lastName` along with `userId` as clustering columns for Query 3, as `song` and `userId` together will uniquely identify each row. The student should include clustering columns as part of the COMPOSITE PRIMARY KEY and understand that a COMPOSITE PRIMARY KEY uniquely identifies each row.

#### Presentation 

✅  The notebooks should include a description of the query the data is modeled after. The student can include headers right above the SELECT statement cell to highlight the responses to the questions.  


✅  Code should be organized well into the different queries. Any in-line comments that were clearly part of the project instructions should be removed so the notebook provides a professional look.
