In [None]:
from IPython.display import HTML, display
HTML("""
<style>

div.text_cell_render h1, h2, h3, h4, h5 { 
font-family: 'Georgia';
}


div.text_cell_render { /* Customize text cells */
font-family: 'Avenir';
font-size:18px;
line-height:21px;
color: #292929;
font-weight:400;
}
</style>
""")

In [None]:
### Run for formatting display width
display(HTML("<style>.container { width:100% !important;}</style>"))

# Table of Contents 
1. <a href="#what-nosql">What are NoSQL Databases?</a>
2. <a href="#why-nosql">Why do we need NoSQL Databases?</a>
3. <a href="#types-nosql">[Optional] Types of NoSQL Databases</a>
4. <a href="#dd">Distributed Databases</a>
5. <a href="#ec">Eventual Consistency</a>
6. <a href="#when-nosql">When to use NoSQL Databases?</a>
7. <a href="#what-ac">What is Apache Cassandra?</a>
8. <a href="#denorm-ac">Denormalization in Apache Cassandra</a>
9. <a href="#cql">CQL</a>
10. <a href="#demo">Demo: Supermarket Sales</a>
    - <a href="#step1">Step 1: Import Python Packages and Supermarket Sales Data</a>
    - <a href="#step2">Step 2: Create Cluster, Session and Keyspace</a>
    - <a href="#pk-pk-cc">Primary Key, Partition Key and Clustering Columns</a>
    - <a href="#query1">Step 3: Query 1</a>
    - <a href="#query2">Step 3: Query 2</a>
    - <a href="#query3">Step 3: Query 3</a>
    - <a href="#step4">Step 4: Drop Tables and Close Connection</a>
11. <a href="#reflection">Reflection</a>

### Pre-Requisites
- [PostgreSQL](https://www.udacity.com/course/sql-for-data-analysis--ud198)
- [Python](https://www.udacity.com/course/introduction-to-python--ud1110)

## Session Structure

- __[30 mins]__ Deep Dive into Theoretical Concepts 
    - Listen to the explanation. 
    - There are placeholders to add your notes in. 
    - Take notes only after listening/understanding rather than taking notes concurrently. This will increase recall.   
    

- __[60 mins]__ [Example Project] Supermarket Sales   
    - Primary Key 
    - Partition Key 
    - Clustering Columns   
    

- __[30 mins]__ Q and A | Reflection

<a id="what-nosql"></a>
## What are NoSQL Databases? 

- Simpler Design 
- Simpler Horizontal Scaling 
- Finer Control on Availability  


- NoSQL != Not SQL 
- NoSQL = Not ONLY SQL (where SQL refers to RDBMS)
- NoSQL = Not Only RDBMS

<img src="images/no-sql-db-types.png">

### ✍️ Notes
- 
- 

<a id="why-nosql"></a> 
## Why do we need NoSQL Databases?
   

> Database Management Systems provide __Efficient__,  __Reliable__, __Convenient__ and __Safe, Multi-User storage__ of and access to, __Massive amounts__ of __Persistent__ data.   

 
- We don't need all those features!   



- Convenience: 
    - What if our data model is not that relational? 
    - What if it does not fit into a neat package? 
    - Should we force it to fit? 
    - What if the query language is not needed? 
    - What if we don't want _transaction guarantees_? 
    - What if we're fine with _eventually_ having consistency?
    
    
    
    
- Multi-User: 
    - Transaction Guarantees - Not really needed?   
    
    
    
    
- Safety | Persistence | Reliability 
    - We need safety, either way.
    - NoSQL databases need to ensure the data outlives the programs. 
    - We want the NoSQL system up and running 99.999% of the time!
    
    
- __Massive__ 
    - The data being stored is going up! 
    - Hardware cost going down. 
    - Facebook/Twitter store hundreds of data points on each user.   
    
    
- __Efficiency__
    - Facebook has billions of records and we expect a response time of under a second. 
    
> ### We want much higher efficiency on very very massive datasets. 
    

### ✍️ Notes
- 
- 

<a id="types-nosql"></a>
## Types of NoSQL Databases

- Apache Cassandra (Partition Row Store) 
    - A Row Store means that like relational databases, Cassandra organizes data by rows and columns. 
    - Each row has a required Primary Key 
    - Partitioning means that Cassandra can distribute the data across multiple machines. 
    - It will automatically re-partition as machines are added or removed from the cluster. 
    - Each row can have a variable number of columns.
    
    
- MongoDB (Document Store)
    - It's designed to store and query data as JSON-like documents. 
    - Read more [here](https://aws.amazon.com/nosql/document/).  
    
Understanding the types of NoSQL databases is not required for the project. You can refer to these videos to learn about NoSQL Databases and their types. Notes on NoSQL database types will be uploaded soon. 

1. [NoSQL Motivation](https://lagunita.stanford.edu/courses/Engineering/db/2014_1/courseware/ch-nosql_systems/seq-vid-nosql_motivation/?child=last) covers the basics of NoSQL Databases. 
2. [NoSQL Overview](https://lagunita.stanford.edu/courses/Engineering/db/2014_1/courseware/ch-nosql_systems/seq-vid-nosql_overview/?child=first) covers the different types of NoSQL Databases. 
3. [An Overview of NoSQL Databases](https://www.xenonstack.com/blog/nosql-databases/) is another nice explanation. 

### ✍️ Notes
- 
- 

<img src="images/nosql-mind-map.png" width = "1000px" height="100px">

<a id="dd"></a>
## Distributed Databases  
 
- Spread out over different locations
- Accessed by various users globally 
- Distribution is transparent to users
    
<img src="images/scaling_up_horizontal.png">

- Copies of data for high availability, globally. 

### ✍️ Notes
- 
- 

<a id="ec"></a>
## Eventual Consistency 
- Make sure changes made are consistent across copies. 
- Some copies might take time to get updated. 
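
A toy simulation can make this concrete. The sketch below is not how Cassandra replicates data. It only assumes three in-memory replicas (plain dictionaries) and a made-up `propagate` step, to show that a read may return stale data until every copy has been updated.

```python
# Toy model of eventual consistency: three replicas held as plain dicts.
# This is an illustration only, not Cassandra's replication protocol.

replicas = [{"stock": 10}, {"stock": 10}, {"stock": 10}]
pending = []  # updates that have not yet reached every replica


def write(key, value):
    """Apply the write to the first replica and queue it for the others."""
    replicas[0][key] = value
    pending.append((key, value))


def propagate():
    """Deliver all queued updates to the remaining replicas."""
    while pending:
        key, value = pending.pop(0)
        for replica in replicas[1:]:
            replica[key] = value


write("stock", 7)
stale_reads = [r["stock"] for r in replicas]  # replicas disagree for a while
propagate()
final_reads = [r["stock"] for r in replicas]  # all replicas agree again

print(stale_reads)
print(final_reads)
```

Between `write` and `propagate`, two of the three copies still hold the old value; afterwards all copies match.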

### ✍️ Notes
- 
- 

<a id="when-nosql"></a> 
## When to use NoSQL Databases? 
- Different data type formats 
- Large amounts of data 
- Horizontal Scalability 
- High Throughput 
- Flexible Schema
- High Availability - No Downtime 
- When users are distributed 

### ✍️ Notes
- 
- 

<a id="what-ac"></a> 
## What is Apache Cassandra? 
- __Scalability__ and __High Availability__ 
- Used by Netflix and Uber 
- One Table per Query 
- No Ad-Hoc Queries
- Basic Architecture 
    - Keyspace 
    - Table 
    - Rows

<img src="images/apache-cassandra-example.png">


### ✍️ Notes
- 
- 

<a id="denorm-ac"></a> 
## Denormalization in Apache Cassandra 
> Always think queries first!  
- No `JOIN`s: Can only query one table at a time 
- Our Priority is Fast Reads. 

<img src="images/denorm-cassandra.png">

### Two Queries , Two Tables  


- All Albums in a Given Year   

<img src="images/denorm-cassandra-1.png">

In this case, we have partitioned our data by `Year` (the first column). This means that the rows belonging to a single `Year` will be on one machine (or node) and rows belonging to another `Year` will be on another machine (or node). You will learn about this soon.  

- All Albums by a Given Artist    


<img src="images/denorm-cassandra-2.png">

In this case, we have partitioned our data by `Artist_Name` (the first column). This means that the rows belonging to a single `Artist_Name` will be on one machine (or node) and rows belonging to another `Artist_Name` will be on another machine (or node). Again, you will learn about this soon.  
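
The "two queries, two tables" idea can be sketched in plain Python. The album rows are illustrative sample data, and the two dictionaries merely stand in for the two Cassandra tables: the same rows are stored twice, each time keyed by the column the query filters on.

```python
from collections import defaultdict

# Illustrative sample rows: (year, artist_name, album_name).
albums = [
    (1965, "The Beatles", "Rubber Soul"),
    (1965, "The Who", "My Generation"),
    (1970, "The Beatles", "Let It Be"),
]

# One "table" per query: the same rows are stored twice, keyed differently.
albums_by_year = defaultdict(list)
albums_by_artist = defaultdict(list)

for year, artist_name, album_name in albums:
    albums_by_year[year].append((artist_name, album_name))
    albums_by_artist[artist_name].append((year, album_name))

# Each query is now a single key lookup, with no join and no full scan.
print(albums_by_year[1965])
print(albums_by_artist["The Beatles"])
```

The cost of this design is duplicated storage and an extra write per table; the benefit is that each read touches exactly one key.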

### ✍️ Notes
- 
- 

<a id="cql"></a> 
## CQL 
- CQL (Cassandra Query Language) is the way to interact with an Apache Cassandra database. 
- `JOIN`s, `GROUP BY`, subqueries are not supported. 
- You can get a deeper introduction to `CQL` using the following links: 
    - https://www.tutorialspoint.com/cassandra/cassandra_cqlsh.htm
    - https://www.guru99.com/cassandra-query-language-cql-insert-update-delete-read-data.html  


### ✍️ Notes
- 
- 

<a id="demo"></a>
## Demo: Supermarket Sales
<img src="images/supermarket.png">

## Demo Steps 

This demo is closely based on our second project. If you're able to complete this demo, you should have no problem doing project 2. 

Below are the steps you can follow to complete each component of this project. 

### Modeling your NoSQL or Apache Cassandra database 

__STEP 1__ Run the code cells to import the Python packages and our `supermarket_sales.csv`, which contains data on sales at a supermarket. 

__STEP 2__ After connecting to the local instance of Apache Cassandra and creating a session, create and set a keyspace for this demo.   


__STEP 3__ Create queries to answer the 3 given questions about the data. These queries will include:
- `CREATE TABLE` queries to create the tables to answer the given question. 
- `INSERT` queries to populate the created tables with data from `supermarket_sales.csv`. 
- `SELECT` queries to select the relevant data from the created table to answer the given question. 

__STEP 4__ `DROP` the Tables you created. 

<a id="step1"></a>
## Step 1: Import Python Packages and Supermarket Sales Data

### Import Python Packages

In [None]:
# for manipulating data
import pandas as pd

# Python driver for Apache Cassandra 
# !pip install cassandra-driver 
import cassandra 

# for manipulating numbers
import numpy as np 

# for working with the csv file
import csv 

### Import Supermarket Sales Data

In [None]:
# TODO: Import sm_sales - The supermarket_sales.csv 


Let's view the head of the data. 

In [None]:
# TODO: View its head 


#### What are the columns we have? 


- Details of the branch: 
    - `branch`: Which branch? A? B? or C? 
    - `city`: Which city is the branch in? 'Yangon', 'Naypyitaw', etc.   
    
    
- Details of the Users 
    - `first_name`
    - `last_name`
    - `customer_type`: 'Member' or 'Normal'? 
    - `gender`: 'Male' or 'Female'? 
    - `user_id`: unique identifier of the user, like $101$. 
    - `profession`: Are they a 'developer', 'doctor', etc.?    
    
    
- Details of Product bought: 
    - `product_line`: What category of product was bought? Was it from 'Health and beauty' or 'Electronic accessories'? 
    - `rating`: What is the rating of the product out of $10$? $7.6$, $9.1$, etc.    
    
    
- Economics of the goods bought: 
    - `invoice_id`: unique identifier of the bill sent to the buyer, e.g.: '750-67-8428' 
    - `quantity`: How many of the products were bought? $1$? $2$? $7$? 
    - `unit_price`: What is the cost of one unit? $74.69$, $62.35$, etc. 
    - `tax_5_percent`: What was the tax amount on the products bought? $26.1415$, $5.12$, etc. 
    - `total`: What was the total bill amount, including tax? 
    - `time`: Time of invoice 
    - `payment`: What was the method of payment? 'Ewallet', 'Cash' or 'Credit Card'? 
    - `cogs`: What was the total _cost_ of the goods sold? 
    - `gross_margin_percentage`: What is the percentage of margin? 
    - `gross_income`: How much did we earn? 

### ✍️ Notes
- 
- 

<a id="step2"></a>
## Step 2: Create Cluster, Session and Keyspace 

### Creating a Cluster 

The following code will connect to a local instance of Apache Cassandra (if we have one, and we do in this workspace).   

This connection will reach out to the database and ensure we have the correct privileges to connect to this database.  

In [None]:
# TODO: Make a connection to a Cassandra instance on 
# our local machine (127.0.0.1) 


Once we create our cluster object, we need to connect to it. This will create our session that we will use to execute the queries. 

### Create a Session

In [None]:
# TODO: Create a session to establish connection and
# begin executing queries 


This is analogous to what we did when we connected to the PostgreSQL database and obtained a cursor using code like this: 

```python
# Like connecting to cluster 
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
# Like creating a session
cur = conn.cursor()
```

The `cur` variable helped us execute queries when we used `psycopg2`, the Python driver for PostgreSQL. Now, we will use `session`. 
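
For comparison, a minimal sketch of the equivalent steps with the `cassandra-driver` package might look like the following. It assumes a Cassandra node is listening on `127.0.0.1` at the default port. The helper name `connect_local` is made up for this example, and the import sits inside the function so that defining it does not require the driver to be installed.

```python
def connect_local(host="127.0.0.1"):
    """Return (cluster, session) for a Cassandra node at `host`.

    Hypothetical helper: it needs `pip install cassandra-driver` and a
    running Cassandra instance, and raises if the node cannot be reached.
    """
    from cassandra.cluster import Cluster  # imported lazily on purpose

    cluster = Cluster([host])    # plays the role of psycopg2.connect(...)
    session = cluster.connect()  # plays the role of conn.cursor()
    return cluster, session


# Usage (only when Cassandra is running locally):
# cluster, session = connect_local()
```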


### Create Keyspace

A keyspace in Cassandra is analogous to a database you would create in PostgreSQL. It defines the:
- replication strategy: How should the replication take place? 
- replication factor: How many copies of the data will be distributed across the nodes?  

You can learn more about it [here](https://www.tutorialspoint.com/cassandra/cassandra_create_keyspace.htm). 

In [None]:
# TODO: Create a Keyspace 
session.execute("""

""")

We are creating a keyspace called `supermarket_database`, which will store tables relating to the sales of a supermarket chain. 
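
One possible form of the keyspace statement is sketched below. The `SimpleStrategy` class with a replication factor of 1 is an assumption that suits a single-node local instance. The helper `create_and_use_keyspace` is a made-up name, and it accepts any object that offers `execute` and `set_keyspace` methods, as the driver's session does.

```python
# Assumed settings for a single-node, local Cassandra instance.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS supermarket_database
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""


def create_and_use_keyspace(session):
    """Create the keyspace if needed, then make it the session default."""
    session.execute(CREATE_KEYSPACE)
    session.set_keyspace("supermarket_database")


# Usage with a live session:
# create_and_use_keyspace(session)
```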

### Set Keyspace 

In [None]:
# TODO: Set Keyspace to the keyspace specified above 


### ✍️ Notes
- 
- 

<a id="pk-pk-cc"></a>
### Overview of Primary Key, Partition Key and Clustering Columns

- A __primary key__ uniquely identifies a row.  

- A __composite key__ is a primary key made up of multiple columns.  

- A __partition key__ is the primary lookup to find a set of rows. It is used to partition the data and distribute it across the nodes. 

- A __clustering column__ is the part of the primary key that isn't the partition key, and defines the ordering within a partition.  

- __`PRIMARY KEY(a)`__: The partition key is `a`. There is no clustering column.   

- __`PRIMARY KEY(a,b)`__ : The partition key is `a` and the clustering column is `b`.   

- __`PRIMARY KEY((a,b))`__: The composite partition key is `(a,b)`. There is no clustering column.   

- __`PRIMARY KEY((a,b), c,d)`__: The composite partition key is `(a,b)` and the composite cluster key is `(c,d)`.   

### Let's Dive IN. 

- Big Data! 
- Example: Get all rows from the data which belong to year 1965. 
- Scanning entire data is not an option. 
- Ordering the data is also, not an option. 
- __Partition Key:__ Partition data across the machines.
- __Clustering Columns__: Ordering rows within a partition.  

### Let's dive in further. 

## PRIMARY KEY 

- Unique identifier for each row. 
- Determines distribution and order of our data. 
- Data is __overwritten__ if duplicate value is inserted.
- Evenly distribute the data. 
- First element of __PRIMARY KEY__: __PARTITION KEY__ 
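
The overwrite behaviour can be imitated with a dictionary keyed by the primary key. This is only an analogy, assuming a table whose primary key is `year` alone; the rows are illustrative.

```python
# A table modelled as a dict: primary key -> remaining columns.
music_library_simple = {}


def insert(year, artist_name, album_name):
    """Insert a row; a second row with the same key replaces the first."""
    music_library_simple[year] = (artist_name, album_name)


insert(1965, "The Beatles", "Rubber Soul")
insert(1965, "The Who", "My Generation")  # same primary key: overwrites
insert(1970, "The Beatles", "Let It Be")

print(len(music_library_simple))
print(music_library_simple[1965])
```

Three inserts leave only two rows, because two of them share the key `1965`. To keep both 1965 albums, the primary key needs more columns.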

## PARTITION KEY 
- Partition Key's value is hashed -> Number -> Stored on Machine 
- Example: All rows with value $1960$ in the `year` column form one partition and are stored together on one node; all rows with value $1965$ form another partition, which may be stored on a different node. 

## CLUSTERING COLUMN
- Sort order within a partition. 
- Default setting = ascending order 
- More than one can be used. 
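
The interplay of the two ideas can be sketched in plain Python. Everything here is an assumption made for illustration: a cluster of three nodes, MD5 as a stand-in hash (Cassandra's default partitioner uses Murmur3), and `artist_name` as the clustering column.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 3  # assumed cluster size for this illustration


def node_for(partition_key):
    """Map a partition key to a node index via a stable hash.

    MD5 is a stand-in; it is used here only because it is deterministic.
    """
    digest = hashlib.md5(str(partition_key).encode("utf8")).hexdigest()
    return int(digest, 16) % NUM_NODES


rows = [
    (1965, "The Who", "My Generation"),
    (1970, "The Beatles", "Let It Be"),
    (1965, "The Beatles", "Rubber Soul"),
]

# node index -> partition key -> rows in that partition
nodes = defaultdict(lambda: defaultdict(list))
for year, artist_name, album_name in rows:
    partition = nodes[node_for(year)][year]
    partition.append((artist_name, album_name))
    partition.sort()  # clustering order: artist_name ascending

for node, partitions in sorted(nodes.items()):
    print(node, dict(partitions))
```

A lookup for `year = 1965` hashes the key, goes to one node, and finds the rows already sorted, so no scan of the whole dataset is needed.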

### ✍️ Notes
- 
- 

<a id="sim_prim_key_ex"></a>
### Example of Simple PRIMARY KEY  

```SQL 
CREATE TABLE music_library_simple (year INT,
                                   artist_name TEXT,
                                   album_name TEXT,
                                   PRIMARY KEY (year));
```

<img src="images/primary_key_simple.png">

<a id="mul_part_keys"></a>
### Can we use multiple partition keys and clustering columns?  

```SQL 
CREATE TABLE multiple_table (
    k_part_one TEXT, 
    k_part_two INT, 
    k_clust_one TEXT, 
    k_clust_two INT, 
    k_clust_three TEXT,
    data TEXT, 
    PRIMARY KEY ((k_part_one, k_part_two), 
                 k_clust_one, k_clust_two,
                 k_clust_three)
); 
```

### Example of Composite PRIMARY KEY with multiple partition keys

```SQL 

CREATE TABLE music_library (
    year INT, 
    artist_name TEXT, 
    album_name TEXT, 
    PRIMARY KEY ((year, artist_name))
);
```    

<img src="images/primary_key_composite.png"> 

### Example of Composite PRIMARY KEY with multiple clustering columns 

```SQL 
CREATE TABLE music_library (year INT,
                            artist_name TEXT,
                            album_name TEXT,
                            PRIMARY KEY ((year), 
                                         artist_name, album_name));

```

<img src="images/primary_key_composite_clustering.png">


### Read and Understand: 

- A primary key uniquely identifies a row.
- A composite key is a primary key made up of multiple columns. 
- A partition key is the primary lookup to find a set of rows. It is used to partition the data and distribute it across the nodes. 
- A clustering column is the part of the primary key that isn't the partition key, and defines the ordering within a partition. 
- `PRIMARY KEY(a)`: The partition key is `a`. There is no clustering column. 
- `PRIMARY KEY(a,b)` : The partition key is `a` and the clustering column is `b`. 
- `PRIMARY KEY((a,b))`: The composite partition key is `(a,b)`. There is no clustering column.  
- `PRIMARY KEY(a, b, c)`: The partition key is `a`, the composite clustering key is `(b,c)`. 
- `PRIMARY KEY((a,b), c)`: The composite partition key is `(a,b)` and the clustering key is `c`. 
- `PRIMARY KEY((a,b), c,d)`: The composite partition key is `(a,b)` and the composite cluster key is `(c,d)`.  
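
The rules in this list can be checked mechanically. The helper below, `split_primary_key`, is a hypothetical name written for this notebook; it handles only the simple forms listed above, not arbitrary CQL.

```python
def split_primary_key(spec):
    """Split the text inside PRIMARY KEY(...) into its two parts.

    Returns (partition_key_columns, clustering_columns).
    """
    spec = spec.strip()
    if spec.startswith("("):
        # Composite partition key: everything inside the inner parentheses.
        inner, _, rest = spec[1:].partition(")")
        partition_key = [c.strip() for c in inner.split(",")]
        clustering = [c.strip() for c in rest.split(",") if c.strip()]
    else:
        # Simple partition key: the first column only.
        first, *others = [c.strip() for c in spec.split(",")]
        partition_key = [first]
        clustering = others
    return partition_key, clustering


for spec in ["a", "a,b", "(a,b)", "a, b, c", "(a,b), c", "(a,b), c,d"]:
    print(spec, "->", split_primary_key(spec))
```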

### ✍️ Notes
- 
- 

### Now, we need to create tables to run the following queries. Remember: With Apache Cassandra, you model the database tables on queries you want to run. 

This is done whenever we are dealing with big data. If we think about the queries first, we can appropriately partition and sort our data while creating tables.   

This will make our reads super fast, as the data is already partitioned and sorted! 

## STEP 3: Create queries to answer the 3 given questions about the data. 

<a id="query1"></a>
### 1. Give me the branch, product line, the product's unit price in the sales history that was bought with Invoice ID = 750-67-8428. 

### Query 1: Part I: `CREATE` the Table

Before creating the table for this, let's first try to understand what our query is going to be.  

- It's going to be selecting the columns for branch, product line and the product's unit price.   
- It's going to look for records based on the value of `invoice_id`.  

Think of these questions while creating the table: 
1. What columns should be in this table? 
2. What datatype should each column be? 
> - [Here](https://www.guru99.com/cassandra-data-types-expiration-tutorial.html) is a list of the datatypes you have in CQL. Make sure to have a look at this. Data types like `NUMERIC` are not available here. 
> - Some important ones which we use a lot are `INT`, `TEXT`, `VARINT`, `FLOAT`, `BIGINT`,  `TIMESTAMP`, etc. 

3. What should the Primary Key of the table be? Which should be our Partition Key, if any? Which should be our Clustering Columns, if any?   


Remember: 
- The Primary Key (simple or composite) of a row should be the unique identifier of the row. There should be no 2 rows with the same Primary Key.  
- Look at the `WHERE` clause of your query to identify your partition key. Which column will you be filtering the records by? 
- Look at how you want to order your results. These will help in deciding your clustering columns. 
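
As one possible answer to these questions for Query 1, the sketch below defines a table statement. The table and column names follow the verification query later in this section; the column types are assumptions, and `invoice_id` is taken as the whole primary key on the assumption that each invoice appears only once in the data.

```python
# One possible table definition for Query 1 (types are assumptions).
CREATE_INVOICE_DETAILS = """
CREATE TABLE IF NOT EXISTS invoice_details (
    invoice_id   TEXT,
    branch       TEXT,
    product_line TEXT,
    unit_price   FLOAT,
    PRIMARY KEY (invoice_id)
)
"""

print(CREATE_INVOICE_DETAILS)
# With a live session: session.execute(CREATE_INVOICE_DETAILS)
```

Only the columns the query returns or filters on are included; the other columns of the csv are left out of this table.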

In [None]:
# TODO: Query 1:  Give me the branch, product line, the product's 
# unit price in the sales history that was bought with 
# Invoice ID = 750-67-8428.   

query = """

"""

print(query)
session.execute(query)

### Query 1: Part II: `INSERT` the Data

Once we have created our table, we can start inserting data into it. 

The following code will loop through each row in the `supermarket_sales.csv` and insert the relevant data from each row into the table you just created.   


There are 2 TODOs here: 
1. Write the `INSERT` statement that will be used to insert the relevant data from each record into the table you just created. Note that there are 2 parts to the `INSERT` query: 
    - The `INSERT` query
    - Actual values that we need to put into the columns 
    
2. Decide which values from the current row (`line`) go into each column of the `INSERT` statement that you create. For example, suppose the `invoice_id`, the `branch` and the customer's `first_name` are the first 3 values in each `line` of the `csv`. To insert the invoice ID you would use `line[0]`, and to insert the branch you would use `line[1]`.   

__Note:__ All values read from the csv are of type `str`. You will want to convert them when inserting into the table: a string is not automatically converted where the table expects an integer or float.   
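
The conversion step might look like the sketch below. The two-line csv is a hypothetical stand-in for `supermarket_sales.csv` (the real file has more columns and a different column order), and the statement targets the `invoice_details` table used by the verification query later in this section. The `%s` markers are the driver's placeholders for values passed separately to `session.execute`.

```python
import csv
import io

# Hypothetical two-line csv standing in for supermarket_sales.csv;
# the real file has more columns and a different column order.
sample = io.StringIO(
    "invoice_id,branch,product_line,unit_price\n"
    "750-67-8428,A,Health and beauty,74.69\n"
)

INSERT_INVOICE = (
    "INSERT INTO invoice_details "
    "(invoice_id, branch, product_line, unit_price) "
    "VALUES (%s, %s, %s, %s)"
)

params = []
csvreader = csv.reader(sample)
next(csvreader)  # skip the header line
for line in csvreader:
    # Every csv field is a str; convert numeric fields explicitly.
    params.append((line[0], line[1], line[2], float(line[3])))
    # With a live session: session.execute(INSERT_INVOICE, params[-1])

print(params)
```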

In order to understand which column's value is at what index, we can write the following code. 

In [None]:
file = 'supermarket_sales.csv'

with open(file, encoding='utf8') as f:
    # create a csv reader
    csvreader = csv.reader(f)
    # for each line in the reader object 
    for line in csvreader:
        # for index, value in enumerated list(line)
        for i, value in enumerate(line):
            # print the index, value and data type of the value 
            print(i, value, type(value))
        # break as we need to do this only for first line 
        break 

In [None]:
# Part of the code to set up the csv file has been provided. 
# Please complete the Apache Cassandra code below. 

file = 'supermarket_sales.csv'

with open(file, encoding='utf8') as f:
    csvreader = csv.reader(f)
    # skip header line as it's the column names
    next(csvreader)
    # for each line in the csvreader
    for line in csvreader:
        # Assign the INSERT statements into the `query` variable
        
        # TODO: Enter the INSERT statement here 
        query = """ """
        
        # TODO: Assign the placeholder here 
        query += """ """
        
        # TODO: Assign which column element should be assigned
        # for each column in the `INSERT` statement. 
        # E.g., to INSERT invoice_id and branch, you would
        # change the code to `line[0]`, `line[1]`
        
        session.execute(query, 
                        (line[#]))

### Query 1: Part III: `SELECT` the Data

### Do `SELECT` to verify that the data has been inserted into the table. 

In [None]:
# CHECK: Run the SELECT statement to verify that the data was 
# entered into the table correctly. 

query = """
SELECT * FROM invoice_details 
WHERE invoice_id = '750-67-8428'
"""

try:
    rows = session.execute(query)
    # iterate inside the try block so a failed query
    # does not leave `rows` undefined
    for row in rows:
        print(row.invoice_id, row.branch, row.product_line, row.unit_price)
except Exception as e: 
    print(e)

### Success for Query 1! 

<a id="query2"></a>
### 2. Give me only the following: the name of the branch, the product_line (sorted by quantity) and the customer's first and last name for user_id = 110. 

### Query 2: Part I: `CREATE` the TABLE

Before creating the table for this, let's first try to understand what our query is going to be.  

- It's going to be selecting the columns for branch, product line and the customer's first and last name.   
- It's going to look for records based on the value of `user_id`.
- It wants to sort the data by `quantity`. 

Think of these questions while creating the table: 
1. What columns should be in this table? 
2. What datatype should each column be? 
> - [Here](https://www.guru99.com/cassandra-data-types-expiration-tutorial.html) is a list of the datatypes you have in CQL. Make sure to have a look at this. Data types like `NUMERIC` are not available here. 
> - Some important ones which we use a lot are `INT`, `TEXT`, `VARINT`, `FLOAT`, `BIGINT`,  `TIMESTAMP`, etc. 

3. What should the Primary Key of the table be? Which should be our Partition Key, if any? Which should be our Clustering Columns, if any?   


Remember: 
- The Primary Key (simple or composite) of a row should be the unique identifier of the row. There should be no 2 rows with the same Primary Key.  
- Look at the `WHERE` clause of your query to identify your partition key. Which column will you be filtering the records by? 
- Look at how you want to order your results. These will help in deciding your clustering columns. 
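
One possible table for Query 2 is sketched below, with several labeled assumptions. `user_id` is the partition key because the query filters on it, and `quantity` is a clustering column because the results are sorted by it. `invoice_id` is added as a second clustering column on the assumption that one user can have several purchases with the same quantity, which would otherwise overwrite each other. `user_id` is declared `TEXT` only to match the quoted `'110'` in the verification query; the other types are guesses.

```python
# One possible table definition for Query 2 (types are assumptions).
CREATE_USER_PURCHASES = """
CREATE TABLE IF NOT EXISTS user_purchases (
    user_id      TEXT,
    quantity     INT,
    invoice_id   TEXT,
    branch       TEXT,
    product_line TEXT,
    first_name   TEXT,
    last_name    TEXT,
    PRIMARY KEY ((user_id), quantity, invoice_id)
)
"""

print(CREATE_USER_PURCHASES)
# With a live session: session.execute(CREATE_USER_PURCHASES)
```

Rows within each `user_id` partition are then stored in ascending `quantity` order, so the `SELECT` returns them sorted without an explicit sort step.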

In [None]:
# TODO: Query 2: Give me only the following: name of the branch, 
# product_line (sorted by quantity) and customer's 
# (first and last name) for userid = 110

 
query = """

"""

print(query)
session.execute(query)

### Query 2: Part II: `INSERT` the data

Once we have created our table, we can start inserting data into it. 

The following code will loop through each row in the `supermarket_sales.csv` and insert the relevant data from each row into the table you just created.   


There are 2 TODOs here: 
1. Write the `INSERT` statement that will be used to insert the relevant data from each record into the table you just created. Note that there are 2 parts to the `INSERT` query: 
    - The `INSERT` query
    - Actual values that we need to put into the columns 
    
2. Decide which values from the current row (`line`) go into each column of the `INSERT` statement that you create. For example, suppose the `invoice_id`, the `branch` and the customer's `first_name` are the first 3 values in each `line` of the `csv`. To insert the invoice ID you would use `line[0]`, and to insert the branch you would use `line[1]`.   

__Note:__ All values read from the csv are of type `str`. You will want to convert them when inserting into the table: a string is not automatically converted where the table expects an integer or float.   

In order to understand which column's value is at what index, we can write the following code. 

In [None]:
file = 'supermarket_sales.csv'

with open(file, encoding='utf8') as f:
    # create a csv reader
    csvreader = csv.reader(f)
    # for each line in the reader object 
    for line in csvreader:
        # for index, value in enumerated list(line)
        for i, value in enumerate(line):
            # print the index, value and data type of the value 
            print(i, value, type(value))
        # break as we need to do this only for first line 
        break 

In [None]:
# Part of the code to set up the csv file has been provided. 
# Please complete the Apache Cassandra code below. 

file = 'supermarket_sales.csv'

with open(file, encoding='utf8') as f:
    csvreader = csv.reader(f)
    # skip header line as it's the column names
    next(csvreader)
    # for each line in the csvreader
    for line in csvreader:
        # Assign the INSERT statements into the `query` variable
        # TODO: Enter the INSERT statement here 
        query = """ """
        # TODO: Assign the placeholders here 
        query += """ """
        
        # TODO: Assign which column element should be assigned
        # for each column in the `INSERT` statement. 
        # E.g., to INSERT invoice_id and branch, you would
        # change the code to `line[0]`, `line[1]`
        session.execute(query, 
                        (line[#]))

### Query 2: Part III: `SELECT` the Data

### Do `SELECT` to verify that the data has been inserted into the table. 

In [None]:
# CHECK: Run the SELECT statement to verify that the data was 
# entered into the table correctly. 

query = """
SELECT branch, product_line, first_name, last_name
FROM user_purchases
WHERE user_id = '110'
"""

try:
    rows = session.execute(query)
    # iterate inside the try block so a failed query
    # does not leave `rows` undefined
    for row in rows:
        print(row.branch, row.product_line, row.first_name, row.last_name)
except Exception as e: 
    print(e)

### Success for Query 2! 

<a id="query3"></a>
### 3. Give me only the first and last names of customers in my sales history who bought the product_line 'Electronic accessories'. 

### Query 3: Part I: `CREATE` the table


Before creating the table for this, let's first try to understand what our query is going to be.  

- It's going to be selecting the customer's first and last names.   
- It's going to look for records based on the value of `product_line`.  

Think of these questions while creating the table: 
1. What columns should be in this table? 
2. What datatype should each column be? 
> - [Here](https://www.guru99.com/cassandra-data-types-expiration-tutorial.html) is a list of the datatypes you have in CQL. Make sure to have a look at this. Data types like `NUMERIC` are not available here. 
> - Some important ones which we use a lot are `INT`, `TEXT`, `VARINT`, `FLOAT`, `BIGINT`,  `TIMESTAMP`, etc. 

3. What should the Primary Key of the table be? Which should be our Partition Key, if any? Which should be our Clustering Columns, if any?   


Remember: 
- The Primary Key (simple or composite) of a row should be the unique identifier of the row. There should be no 2 rows with the same Primary Key.  
- Look at the `WHERE` clause of your query to identify your partition key. Which column will you be filtering the records by? 
- Look at how you want to order your results. These will help in deciding your clustering columns. 

In [None]:
# TODO: Query 3: Give me only the first and last names of 
# customers in my sales history who bought the product_line 
# 'Electronic accessories'. 

query = """

"""

print(query)
session.execute(query)

### Query 3: Part II: `INSERT` the data

Once we have created our table, we can start inserting data into it. 

The following code will loop through each row in the `supermarket_sales.csv` and insert the relevant data from each row into the table you just created.   


There are 2 TODOs here: 
1. Write the `INSERT` statement that will be used to insert the relevant data from each record into the table you just created. Note that there are 2 parts to the `INSERT` query: 
    - The `INSERT` query
    - Actual values that we need to put into the columns 
    
2. Decide which values from the current row (`line`) go into each column of the `INSERT` statement that you create. For example, suppose the `invoice_id`, the `branch` and the customer's `first_name` are the first 3 values in each `line` of the `csv`. To insert the invoice ID you would use `line[0]`, and to insert the branch you would use `line[1]`.   

__Note:__ All values read from the csv are of type `str`. You will want to convert them when inserting into the table: a string is not automatically converted where the table expects an integer or float.   

In order to understand which column's value is at what index, we can write the following code. 

In [None]:
file = 'supermarket_sales.csv'

with open(file, encoding='utf8') as f:
    # create a csv reader
    csvreader = csv.reader(f)
    # for each line in the reader object 
    for line in csvreader:
        # for index, value in enumerated list(line)
        for i, value in enumerate(line):
            # print the index, value and data type of the value 
            print(i, value, type(value))
        # break as we need to do this only for first line 
        break 

In [None]:
# Part of the code to set up the csv file has been provided. 
# Please complete the Apache Cassandra code below. 

file = 'supermarket_sales.csv'

with open(file, encoding='utf8') as f:
    csvreader = csv.reader(f)
    # skip header line as it's the column names
    next(csvreader)
    # for each line in the csvreader
    for line in csvreader:
        # Assign the INSERT statements into the `query` variable
        # TODO: Enter the INSERT statement here 
        query = """ """
        # TODO: Assign the placeholders here 
        query += """ """
        
        # TODO: Assign which column element should be assigned
        # for each column in the `INSERT` statement. 
        # E.g., to INSERT invoice_id and branch, you would
        # change the code to `line[0]`, `line[1]`
        session.execute(query, 
                        (line[#]))  

### Query 3: Part III: `SELECT` the Data

### Do `SELECT` to verify that the data has been inserted into the table. 

In [None]:
# CHECK: Run the SELECT statement to verify that the data was 
# entered into the table correctly. 

query = """
SELECT first_name, last_name
FROM electronic_users
WHERE product_line = 'Electronic accessories'
"""

try:
    rows = session.execute(query)
    # iterate inside the try block so a failed query
    # does not leave `rows` undefined
    for row in rows:
        print(row.first_name, row.last_name)
except Exception as e: 
    print(e)

### Success for Query 3! 

<a id="step4"></a>
## Step 4: Drop the tables before closing out the sessions

In [None]:
session.execute("DROP TABLE invoice_details")

In [None]:
session.execute("DROP TABLE user_purchases")

In [None]:
session.execute("DROP TABLE electronic_users")

## Close the session and cluster connection 

In [None]:
session.shutdown()
cluster.shutdown()

Notice how this is similar to how we closed the connection to the PostgreSQL database. 

<a id="reflection"></a>
## Reflection 
> #### [Tweet] Your Learnings! 
> ###  I used to think ______, now I think ___. 