<a href="https://colab.research.google.com/github/mosesyhc/de300-wn2024-notes/blob/main/lab/DATAENG300_Lab2_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 4 - Keyspaces introduction

**Before you begin**, make a copy of this notebook via `File -> Save a Copy` or `Copy to Drive` above. Rename the copy to include your name.


---

## Lab
This lab is a first introduction to Amazon Keyspaces.  Follow the instructions below.

### Dataset
Consider the MIMIC-III-based dataset in `dataset/mimic_keyspace_data.csv`.

```python
import pandas as pd

# Load the lab dataset and display it
df = pd.read_csv('dataset/mimic_keyspace_data.csv')

df
```

| Unnamed: 0 | subject_id | hadm_id | icd9_code | age | gender | admission_type | insurance | ethnicity | hospital_expire_flag |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 41976 | 173269 | 3891 | 65.800000 | M | EMERGENCY | Medicare | HISPANIC/LATINO - PUERTO RICAN | 0 |
| 1 | 41976 | 172082 | 966 | 65.076712 | M | EMERGENCY | Medicare | HISPANIC/LATINO - PUERTO RICAN | 0 |
| 2 | 41976 | 149469 | 3897 | 64.830137 | M | EMERGENCY | Medicare | HISPANIC/LATINO - PUERTO RICAN | 0 |
| 3 | 10093 | 165393 | 3891 | 87.438356 | M | EMERGENCY | Medicare | WHITE | 1 |
| 4 | 41976 | 176016 | 3893 | 65.076712 | M | EMERGENCY | Medicare | HISPANIC/LATINO - PUERTO RICAN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1305 | 41976 | 125449 | 9604 | 63.906849 | M | EMERGENCY | Medicare | HISPANIC/LATINO - PUERTO RICAN | 0 |
| 1306 | 41976 | 173269 | 9671 | 65.468493 | M | EMERGENCY | Medicare | HISPANIC/LATINO - PUERTO RICAN | 0 |
| 1307 | 10088 | 149044 | 966 | 77.890411 | M | EMERGENCY | Private | WHITE | 0 |
| 1308 | 10106 | 133283 | 8841 | 63.786301 | M | EMERGENCY | Private | WHITE | 0 |


### Create an Amazon Keyspace
- Sign into AWS via https://nu-sso.awsapps.com/start/.
- From Services, enter Amazon Keyspaces.
- Open the CQL editor (in the left menu). We will interact with the keyspace through the CQL editor throughout this lab.

*Note*: There are multiple ways to create a keyspace and table in AWS. We will create what we need with Cassandra Query Language (CQL) statements.

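- Create a keyspace named `[Firstname][Lastname]_de300lab4`, replacing `"KeySpaceName"` in the sample query below with your keyspace name. Double-quoted CQL identifiers are case-sensitive, while unquoted ones are folded to lowercase, so an all-lowercase name avoids mismatches with the unquoted references later in this lab: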

```
CREATE KEYSPACE IF NOT EXISTS "KeySpaceName"
   WITH REPLICATION = {'class': 'SingleRegionStrategy'};
```
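The naming convention above can be sketched in Python. This is a minimal illustration, not part of the lab requirements: the helper names `make_keyspace_name` and `create_keyspace_statement` and the student name "Jane Doe" are hypothetical. The sketch lowercases the name (the form CQL uses for unquoted identifiers) and checks that it is a valid unquoted identifier.

```python
import re

# Unquoted CQL identifiers start with a letter, followed by letters, digits, or underscores.
UNQUOTED_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def make_keyspace_name(first_name: str, last_name: str, suffix: str = "de300lab4") -> str:
    """Build the lab keyspace name in lowercase, as CQL stores unquoted identifiers."""
    name = f"{first_name}{last_name}_{suffix}".lower()
    if not UNQUOTED_IDENTIFIER.fullmatch(name):
        raise ValueError(f"{name!r} is not a valid unquoted CQL identifier")
    return name


def create_keyspace_statement(keyspace: str) -> str:
    """Return the CREATE KEYSPACE statement from this lab for the given name."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace}\n"
        "   WITH REPLICATION = {'class': 'SingleRegionStrategy'};"
    )


keyspace = make_keyspace_name("Jane", "Doe")  # hypothetical student name
print(create_keyspace_statement(keyspace))
```

The printed statement can be pasted into the CQL editor. Names containing hyphens or spaces are rejected by the check, since they would require a double-quoted identifier.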

- Verify that your keyspace is created properly by running:

```
SELECT * FROM system_schema.keyspaces ;
```

### Specify the keyspace 
- Specify the keyspace you created with:

```
USE [Firstname][Lastname]_de300lab4 ;
```

### Create a table
- The following query creates a table with only patient information:

```
CREATE TABLE IF NOT EXISTS [Firstname][Lastname]_de300lab4.patient_tbl (
   subject_id int,
   gender text,
   age float,
   ethnicity text,
   PRIMARY KEY (subject_id)
);
```

Note that setting the primary key to the single column `subject_id` automatically makes `subject_id` the **partition key**.
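In CQL, the first element of `PRIMARY KEY (...)` is the partition key and any remaining elements are clustering columns; a multi-column partition key is wrapped in its own inner parentheses. The Python sketch below builds `CREATE TABLE` statements to show both forms. The helper `create_table_statement`, the keyspace `janedoe_de300lab4`, and the table `example_tbl` with columns `a` to `d` are hypothetical names used only for illustration.

```python
def create_table_statement(keyspace, table, columns, partition_key, clustering_columns=()):
    """Build a CREATE TABLE statement; `columns` maps column name to CQL type."""
    column_lines = [f"   {name} {cql_type}," for name, cql_type in columns.items()]
    # A multi-column partition key gets its own inner parentheses.
    partition = ", ".join(partition_key)
    if len(partition_key) > 1:
        partition = f"({partition})"
    key_parts = ", ".join([partition, *clustering_columns])
    return "\n".join([
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} (",
        *column_lines,
        f"   PRIMARY KEY ({key_parts})",
        ");",
    ])


# Single-column primary key: subject_id is the partition key (as in patient_tbl).
patient_tbl = create_table_statement(
    "janedoe_de300lab4", "patient_tbl",
    {"subject_id": "int", "gender": "text", "age": "float", "ethnicity": "text"},
    partition_key=["subject_id"],
)
print(patient_tbl)

# Composite partition key (a, b) with one clustering column c.
example = create_table_statement(
    "janedoe_de300lab4", "example_tbl",
    {"a": "int", "b": "text", "c": "int", "d": "text"},
    partition_key=["a", "b"], clustering_columns=["c"],
)
print(example)
```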

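- Verify that the table was created by running the query below. The keyspace name is compared as a string here, so it must be in single quotes and match the stored name exactly (unquoted names are stored in lowercase):

```
SELECT * FROM system_schema.tables WHERE keyspace_name = '[Firstname][Lastname]_de300lab4' ;
```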

- Verify the table structure by running:

```
SELECT * FROM system_schema.columns WHERE keyspace_name = '[Firstname][Lastname]_de300lab4' AND table_name = 'patient_tbl' ;
```

### Insert an entry
- The following query inserts one entry of the data into the `patient_tbl` table:

```
INSERT INTO [Firstname][Lastname]_de300lab4.patient_tbl (subject_id, gender, age, ethnicity)
VALUES ( 41976, 'M', 65.80, 'HISPANIC/LATINO - PUERTO RICAN' );
```
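To insert more than one row, the same statement can be generated from Python. The sketch below is illustrative only: the helpers `cql_literal` and `insert_statement` and the keyspace `janedoe_de300lab4` are hypothetical, and the rows are copied from the dataset preview above. It also simulates a point worth keeping in mind: an `INSERT` with an existing primary key overwrites the stored row, so with `subject_id` as the only key, each patient keeps just the last row written.

```python
def cql_literal(value) -> str:
    """Render a Python value as a CQL literal; single quotes in text are doubled."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


def insert_statement(keyspace: str, table: str, row: dict) -> str:
    """Build one INSERT statement from a column-name -> value mapping."""
    columns = ", ".join(row)
    values = ", ".join(cql_literal(v) for v in row.values())
    return f"INSERT INTO {keyspace}.{table} ({columns})\nVALUES ({values});"


# Three rows from the dataset preview, restricted to the patient_tbl columns.
rows = [
    {"subject_id": 41976, "gender": "M", "age": 65.8,
     "ethnicity": "HISPANIC/LATINO - PUERTO RICAN"},
    {"subject_id": 41976, "gender": "M", "age": 65.076712,
     "ethnicity": "HISPANIC/LATINO - PUERTO RICAN"},
    {"subject_id": 10093, "gender": "M", "age": 87.438356, "ethnicity": "WHITE"},
]

for row in rows:
    print(insert_statement("janedoe_de300lab4", "patient_tbl", row))

# Simulate the overwrite behavior: a dict keyed on the primary key keeps the last write.
stored = {}
for row in rows:
    stored[row["subject_id"]] = row
print(len(stored), "rows stored after", len(rows), "inserts")
```

With the full DataFrame, `df.to_dict("records")` would produce a list of row dictionaries in the same shape as `rows`.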

- Verify the insertion by reading the data back with the following query:

```
SELECT * FROM [Firstname][Lastname]_de300lab4.patient_tbl ;
```

## Lab Tasks

Recall that choosing the primary key is the most important design decision in a wide-column database.

The primary key consists of
- a **partition key** (1+ columns), which should distribute your data evenly
- **clustering columns** (0+ columns), which, jointly with the partition key, uniquely identify one row.
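Both properties can be checked on sample data before creating a table. The sketch below is a minimal illustration using six rows from the dataset preview, restricted to three columns for brevity; the helper names `is_unique_key` and `partition_sizes` are hypothetical. It tests whether a candidate primary key identifies each row uniquely and counts how many rows fall into each partition.

```python
from collections import Counter

# Rows 0-4 and 1306 from the dataset preview (three columns only).
rows = [
    {"subject_id": 41976, "hadm_id": 173269, "icd9_code": 3891},
    {"subject_id": 41976, "hadm_id": 172082, "icd9_code": 966},
    {"subject_id": 41976, "hadm_id": 149469, "icd9_code": 3897},
    {"subject_id": 10093, "hadm_id": 165393, "icd9_code": 3891},
    {"subject_id": 41976, "hadm_id": 176016, "icd9_code": 3893},
    {"subject_id": 41976, "hadm_id": 173269, "icd9_code": 9671},
]


def is_unique_key(rows, key_columns):
    """Return True if the candidate key columns identify every row uniquely."""
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    return len(set(keys)) == len(keys)


def partition_sizes(rows, partition_columns):
    """Count the rows that share each partition key value."""
    return Counter(tuple(row[c] for c in partition_columns) for row in rows)


print(is_unique_key(rows, ["subject_id"]))
print(is_unique_key(rows, ["subject_id", "hadm_id"]))
print(is_unique_key(rows, ["subject_id", "hadm_id", "icd9_code"]))
print(partition_sizes(rows, ["subject_id"]))
```

In this small sample, one `subject_id` holds five of the six rows, which is the kind of skew an even partition key avoids. A key that is unique on a sample is not guaranteed to be unique on the full dataset, so the same check should be repeated on all rows.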

You may refer to the CQL documentation for specifying a table creation (https://cassandra.apache.org/doc/latest/cassandra/reference/cql-commands/create-table.html)

Using the dataset above (with **all** of its columns), design and create a table for each of the following questions.  
Briefly explain why you designed each table the way you did.

**Submission guidance:**
Paste your `CREATE TABLE` query below each question, followed by your explanation.  
Test your query on Amazon Keyspaces and make sure you can 1) create the table, and 2) insert one or more sample rows for verification.

1. What is the top procedure performed for the female patients?

2. What is the top procedure performed for Medicare and Medicaid insurance holders?

3. What proportion of the patients are older than 55 and had a procedure with `icd9_code` between 3890 and 3899?

**Submission:**
You will submit the `.ipynb` notebook file and any supporting information you see fit.

# Generative AI disclosure
In this course, you are generally allowed to use Generative Artificial Intelligence (GAI). Any use of GAI should be accompanied by a disclosure at the end of an assignment explaining (1) what you used GAI for; (2) the specific tool(s) you used; and (3) what prompts you used to get the results.

**Include** any disclosure below.