#### Task Description
There are multiple event data files for a music app's history in the event_data folder. These files need to be imported into an Apache Cassandra database to support three queries:
1. Find artist's name, song's title and song's length in the music app history that was heard during sessionId = 338 and itemInSession = 4
2. Find artist's name, song's title (sorted by itemInSession) and user's name (first and last name) for userid = 10, sessionid = 182
3. Find every user's name (first and last name) in my music app history who listened to the song 'All Hands Against His Own'

#### Source File
There are eleven columns in the event data files:
- artist 
- firstName of user
- gender of user
- item number in session
- last name of user
- length of the song
- level (paid or free song)
- location of the user
- sessionId
- song title
- userId

#### Workflow
There are four main steps to accomplish the requirements:
1. Preprocessing all the event data files and merging the data into a new csv file
2. Creating the keyspace and connecting to the keyspace
3. Creating tables according to the queries and populating data into the tables
4. Running the queries

#### Import Python packages 

In [1]:
import pandas as pd
import cassandra
import re
import os
import glob
import numpy as np
import json
import csv

In [9]:
filepath = os.getcwd() + '/event_data/'
for root, dirs, files in os.walk(filepath):   
    print(root)
    #print(dirs)
    #print(files)



/home/workspace/event_data/
/home/workspace/event_data/.ipynb_checkpoints


#### Preprocessing data files
The event data files are processed to create a new csv file that will be used for populating the Apache Cassandra tables. The number of lines in the csv file (including the header row) is printed at the end.

In [11]:
filepath = os.getcwd() + '/event_data'

# Collect every file in the event_data folder (glob skips hidden entries
# such as .ipynb_checkpoints).
file_path_list = glob.glob(os.path.join(filepath, '*'))

# Accumulate the data rows of all files, skipping each file's header row.
full_data_rows_list = []

for f in file_path_list:
    with open(f, 'r', encoding='utf8', newline='') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader)
        for line in csvreader:
            full_data_rows_list.append(line)

csv.register_dialect('myDialect', quoting=csv.QUOTE_ALL, skipinitialspace=True)

with open('event_datafile_new.csv', 'w', encoding = 'utf8', newline='') as f:
    writer = csv.writer(f, dialect='myDialect')
    writer.writerow(['artist','firstName','gender','itemInSession','lastName','length',\
                'level','location','sessionId','song','userId'])
    for row in full_data_rows_list:
        if (row[0] == ''):
            continue
        writer.writerow((row[0], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[12], row[13], row[16]))
        
with open('event_datafile_new.csv', 'r', encoding = 'utf8') as f:
    print(sum(1 for line in f))

6821
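
The filtering and column selection in the cell above can be illustrated on in-memory data. This is a minimal sketch rather than the notebook's code: the helper `project_rows` and the sample rows are hypothetical, and only the column positions and the empty-artist check are taken from the cell above.

```python
import csv
import io

# Column positions kept by the notebook cell, in output order.
KEPT_COLUMNS = (0, 2, 3, 4, 5, 6, 7, 8, 12, 13, 16)

def project_rows(rows):
    """Drop rows with an empty artist and keep only the selected columns."""
    for row in rows:
        if row[0] == '':
            continue
        yield tuple(row[i] for i in KEPT_COLUMNS)

# Hypothetical 17-column rows; 'c0'..'c16' are placeholder values.
sample = [
    ['c%d' % i for i in range(17)],
    [''] + ['x%d' % i for i in range(1, 17)],  # empty artist, skipped
]

# Write to an in-memory buffer instead of a file, quoting every field.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(project_rows(sample))
output_lines = buffer.getvalue().splitlines()
```

Only the first sample row survives, reduced to eleven quoted fields.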


The image below is a screenshot of the csv file
<img src="images/image_event_datafile_new.jpg">

#### Creating a new keyspace and connecting to it

In [12]:
from cassandra.cluster import Cluster
cluster = Cluster()

session = cluster.connect()

session.execute("CREATE KEYSPACE IF NOT EXISTS EventDataDB WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};")

session.set_keyspace("eventdatadb")

#### Creating tables according to the queries and populating data into the tables
In a NoSQL database the data model should be designed around the queries, so for each query I will create a table and populate it with the data that query needs.

##### Query 1: Find artist's name, song's title and song's length in the music app history that was heard during sessionId = 338 and itemInSession = 4
The query asks for the song's information for a specific item in one session, which means sessionId and itemInSession together identify the song's information (artist's name, song's title and length). When a user uses the music app, she/he may listen to one or more songs in a session, so I used sessionId as the partition key and itemInSession as the clustering column to store all the songs' information in the partition.

I created an eventSession table below, imported data into the table from the csv file, and ran the query to examine my design.
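
To make the key design described above concrete, the sketch below models the table as a plain Python dictionary: the outer key plays the role of the partition key and the inner key the role of the clustering column. This is only an analogy for how a row is located, not how Cassandra stores data. The helper names `insert` and `select` are hypothetical, and the single sample row uses the values this notebook reports for the query, with the length rounded.

```python
from collections import defaultdict

# partition key (session_id) -> clustering column (item_in_session) -> row
event_session = defaultdict(dict)

def insert(session_id, item_in_session, artist, song, length):
    """Store a row under its partition key and clustering column."""
    event_session[session_id][item_in_session] = (artist, song, length)

def select(session_id, item_in_session):
    """Look up one row by the full primary key, or None if absent."""
    return event_session.get(session_id, {}).get(item_in_session)

insert(338, 4, 'Faithless', 'Music Matters (Mark Knight Dub)', 495.3073)
result = select(338, 4)
```

Both parts of the primary key are needed to reach exactly one row, which mirrors the `WHERE session_id = ... AND item_in_session = ...` clause of the query.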

In [14]:
session.execute("DROP TABLE IF EXISTS eventSession")
query = """CREATE TABLE IF NOT EXISTS eventSession 
                                        (                                        
                                         session_id int,
                                         item_in_session int,                      
                                         artist text, 
                                         song text,
                                         length float,
                                         PRIMARY KEY (session_id, item_in_session)
                                         )"""

session.execute(query)  

file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    next(csvreader)
    for line in csvreader:    
        query = "INSERT INTO eventSession (session_id, item_in_session, artist, song, length)"
        query = query + "VALUES(%s,%s,%s,%s,%s)"
        session.execute(query, (int(line[8]), int(line[3]), line[0], line[9], float(line[5])))
        
rows = session.execute("SELECT artist, song, length FROM eventSession WHERE session_id = 338 AND item_in_session = 4")
for row in rows:
    print(row)

Row(artist='Faithless', song='Music Matters (Mark Knight Dub)', length=495.30731201171875)


##### Query 2: Find artist's name, song's title (sorted by itemInSession) and user's name (first and last name) for userid = 10, sessionid = 182
The query asks for the songs' information and the user's name for one user within one session, so userid and sessionid together identify the rows to return. A user may use the music app one or more times and may listen to one or more songs each time, so I used userid as the partition key and sessionid as the first clustering column to store all the songs' information and the user's name in the partition. Because the result should be sorted by itemInSession, itemInSession is used as the second clustering column; within a session, the rows are then sorted by itemInSession.

I created an eventUser table below, imported data into the table from the csv file, and ran the query to examine my design.
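
The effect of the two clustering columns can be sketched in plain Python. Within a partition, Cassandra keeps rows ordered by the clustering columns in the order they are declared, here session_id and then item_in_session. The function `select_session` and all sample rows below are hypothetical and only imitate that ordering.

```python
# Hypothetical rows: (user_id, session_id, item_in_session, song)
rows = [
    (10, 182, 2, 'song C'),
    (10, 182, 0, 'song A'),
    (10, 90, 1, 'song X'),
    (10, 182, 1, 'song B'),
    (11, 182, 0, 'song Y'),
]

def select_session(rows, user_id, session_id):
    """Return songs for one user and session, ordered by item_in_session."""
    # Partition key: keep only the rows of the requested user.
    partition = [r for r in rows if r[0] == user_id]
    # Clustering order: session_id first, then item_in_session.
    partition.sort(key=lambda r: (r[1], r[2]))
    return [r[3] for r in partition if r[1] == session_id]

songs = select_session(rows, 10, 182)
```

The songs come back in item order without an explicit ORDER BY, because the ordering is part of the key design.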

In [15]:
session.execute("DROP TABLE IF EXISTS eventUser")
query = """CREATE TABLE IF NOT EXISTS eventUser
                                        ( 
                                         user_id int,
                                         session_id int,
                                         item_in_session int,
                                         artist text, 
                                         song text,
                                         user text,
                                         PRIMARY KEY (user_id, session_id, item_in_session)
                                         )"""
session.execute(query)

file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    next(csvreader)
    for line in csvreader:      
        query = "INSERT INTO eventUser (user_id, session_id, item_in_session, artist, song, user)"
        query = query + "VALUES(%s,%s,%s,%s,%s,%s)"        
        session.execute(query, (int(line[10]), int(line[8]), int(line[3]), line[0], line[9], line[1] + " " + line[4]))
        

rows = session.execute("SELECT artist, song, user FROM eventUser WHERE user_id = 10 AND session_id = 182")
for row in rows:
    print(row)

Row(artist='Down To The Bone', song="Keep On Keepin' On", user='Sylvie Cruz')
Row(artist='Three Drives', song='Greece 2000', user='Sylvie Cruz')
Row(artist='Sebastien Tellier', song='Kilometer', user='Sylvie Cruz')
Row(artist='Lonnie Gordon', song='Catch You Baby (Steve Pitron & Max Sanna Radio Edit)', user='Sylvie Cruz')


##### Query 3: Find every user's name (first and last name) in my music app history who listened to the song 'All Hands Against His Own'
The query asks for the names of the users who listened to a specific song, so the song's title identifies the rows to return. Many users may listen to the same song, so I used the song's title as the partition key and userid as the clustering column (user names alone are not unique, because different userids may share the same name) to store the users' names in the partition.

I created an eventSong table below, imported data into the table from the csv file, and ran the query to examine my design.
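
One consequence of the primary key (song, user_id) is worth illustrating: a Cassandra INSERT overwrites an existing row with the same primary key, so a user who played the song several times appears only once in the table. The sketch below mimics this with a dictionary keyed by the primary key. The helper names and sample events are hypothetical.

```python
# Primary key (song, user_id) -> user name; a dict mimics upsert behaviour.
event_song = {}

def insert(song, user_id, user):
    """Insert or overwrite the row identified by (song, user_id)."""
    event_song[(song, user_id)] = user

def users_for(song):
    """Return user names for one song, ordered by user_id."""
    matches = sorted(
        (uid, name) for (s, uid), name in event_song.items() if s == song
    )
    return [name for _, name in matches]

# Hypothetical events: user 1 plays the same song twice.
insert('Song A', 2, 'User Two')
insert('Song A', 1, 'User One')
insert('Song A', 1, 'User One')
insert('Song B', 3, 'User Three')
listeners = users_for('Song A')
```

Four inserts leave three rows, and each listener of 'Song A' is listed once.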

In [16]:
session.execute("DROP TABLE IF EXISTS eventSong")
query = """CREATE TABLE IF NOT EXISTS eventSong 
                                        (                                        
                                         song text, 
                                         user_id int,                                          
                                         user text,
                                         PRIMARY KEY (song, user_id)
                                         )"""
session.execute(query)

file = 'event_datafile_new.csv'

with open(file, encoding = 'utf8') as f:
    csvreader = csv.reader(f)
    next(csvreader)
    for line in csvreader:      
        query = "INSERT INTO eventSong (song, user_id, user)"
        query = query + "VALUES(%s,%s,%s)"        
        session.execute(query, (line[9], int(line[10]), line[1] + " " + line[4]))
        

rows = session.execute("SELECT user FROM eventSong WHERE song = 'All Hands Against His Own'")
for row in rows:
    print(row)                    

Row(user='Jacqueline Lynch')
Row(user='Tegan Levine')
Row(user='Sara Johnson')


#### Dropping the tables before closing out the sessions

In [17]:
session.execute("DROP TABLE IF EXISTS eventSession")
session.execute("DROP TABLE IF EXISTS eventSong")
session.execute("DROP TABLE IF EXISTS eventUser")

<cassandra.cluster.ResultSet at 0x7ff0ae0f2be0>

#### Closing the session and cluster connection

In [18]:
session.shutdown()
cluster.shutdown()