---

Some or all of these notebooks were run on a login (interactive) node, which has extremely limited resources.  

---

To speed this up you can run this notebook on a compute node. 

### If you already followed the steps in `2__preprocess_CSVs.ipynb`, then you can skip to **PostgreSQL**

To do this, log in to an interactive node and open two terminals:  

---
---

Terminal 1:

---

(The time limit below is only an example; you may want to increase it to be safe.)  

`salloc --time=6:00:00 --ntasks=1 --cpus-per-task=64 --nodes=1 --account=bmi_facelli-np --partition=bmi_facelli-np`  

The terminal returns once the resources are allocated, and further commands will be executed on the compute node.  

`hostname -f`  
`<compute node name>` (copy this to clipboard)  

`export XDG_RUNTIME_DIR=""`  

`conda activate jupy2` (activate your conda env or venv that has jupyter/etc. dependencies)  

`jupyter notebook --no-browser --port=8889`  

----
---

Terminal 2: 

---

`google-chrome &`  

`ssh -N -L localhost:8888:localhost:8889 <compute node name>`  

Chrome should now be open on your interactive node. Copy the Jupyter link from the compute node terminal (Terminal 1) into Chrome, changing `...:8889/tree...` to `...:8888/lab...`, to access or create your notebooks.  
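The link rewrite described above is mechanical, so it can be scripted. The following is a minimal sketch, not part of the original workflow: the function name `tunnel_url` and the token value are hypothetical, and the ports are the ones used in the commands above (8889 on the compute node, 8888 on the local end of the SSH tunnel).

```python
from urllib.parse import urlsplit, urlunsplit


def tunnel_url(jupyter_url, local_port=8888, remote_port=8889):
    """Rewrite the URL printed by `jupyter notebook` so it goes through the SSH tunnel.

    The host becomes localhost:<local_port> and a leading /tree path becomes /lab.
    """
    parts = urlsplit(jupyter_url)
    if parts.port != remote_port:
        raise ValueError(f"expected port {remote_port}, got {parts.port}")
    path = parts.path
    if path.startswith("/tree"):
        # Swap the classic notebook view for the JupyterLab view.
        path = "/lab" + path[len("/tree"):]
    return urlunsplit((parts.scheme, f"localhost:{local_port}", path, parts.query, parts.fragment))


# Hypothetical token, for illustration only.
print(tunnel_url("http://localhost:8889/tree?token=abc123"))
# prints http://localhost:8888/lab?token=abc123
```

The query string, which carries the login token, is passed through unchanged.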

---
---
---

**PostgreSQL**

__Prior to this (or through a new terminal in Jupyter)__, you will need to initialize/create a PostgreSQL DB (adjust paths/names as necessary):  

---

`cd /uufs/chpc.utah.edu/common/home/u0740821/dissertation/data/diabetes`  

`mkdir -p pgsql/data`  

`module load postgresql/15.2`  

`initdb -D ./pgsql/data`  

`pg_ctl -D ./pgsql/data -l logfile start`  

`createdb diabetes`  

`psql -d diabetes`  

`GRANT ALL PRIVILEGES ON DATABASE diabetes TO <uID>;`  
`\q`  

Don't close this terminal; we will use it later.
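Before moving on, it can help to confirm from a notebook cell that the server started above is accepting connections. The sketch below wraps `pg_isready`, a utility that ships with PostgreSQL; it assumes the utility is on `PATH` (for example after `module load postgresql/15.2`), and the helper names `describe_pg_isready` and `check_server` are hypothetical.

```python
import shutil
import subprocess

# Exit codes documented for pg_isready.
PG_ISREADY_STATUS = {
    0: "accepting connections",
    1: "rejecting connections",
    2: "no response",
    3: "no attempt made",
}


def describe_pg_isready(returncode):
    """Translate a pg_isready exit code into a readable status."""
    return PG_ISREADY_STATUS.get(returncode, f"unknown exit code {returncode}")


def check_server(dbname="diabetes"):
    """Run pg_isready against the given database and describe the result."""
    if shutil.which("pg_isready") is None:
        return "pg_isready not found on PATH (is the postgresql module loaded?)"
    result = subprocess.run(["pg_isready", "-d", dbname], capture_output=True, text=True)
    return describe_pg_isready(result.returncode)
```

Calling `check_server()` in a cell returns one of the status strings above; anything other than "accepting connections" suggests checking the `logfile` written by `pg_ctl`.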

---

---

__Prior to this (or through a new terminal in Jupyter)__, you will also need to extract the `.zip` file containing all your TriNetX CSVs:  

`unzip -l your_zip_file.zip`  

This will show you what files are contained in the zip.  

`unzip your_zip_file.zip patient.csv diagnosis.csv -d /path/to/destination` (`/scratch/general/vast/<your user ID>` is the recommended destination)  
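The same listing and selective extraction can be done from a notebook cell with Python's standard `zipfile` module. This is a sketch only: the function names `list_members` and `extract_members` are hypothetical, and for archives of this size the `unzip` command above may well be the faster option.

```python
import zipfile
from pathlib import Path


def list_members(zip_path):
    """Return (name, uncompressed size in bytes) for every file in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def extract_members(zip_path, names, destination):
    """Extract only the named members into destination and return their paths."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        available = set(zf.namelist())
        missing = [name for name in names if name not in available]
        if missing:
            raise KeyError(f"not in archive: {missing}")
        for name in names:
            zf.extract(name, path=destination)
    return [destination / name for name in names]
```

For example, `extract_members("your_zip_file.zip", ["patient.csv", "diagnosis.csv"], "/path/to/destination")` mirrors the `unzip` call above.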

---

---
---


# `cat /uufs/chpc.utah.edu/common/home/u0740821/dissertation/data/diabetes/create.sql`
---

---
```
DROP TABLE IF EXISTS diagnosis;
CREATE TABLE diagnosis (
  patient_id VARCHAR(200),
  encounter_id VARCHAR(200),
  code_system VARCHAR(50),
  code VARCHAR(100),
  principal_diagnosis_indicator VARCHAR(10),
  admitting_diagnosis VARCHAR(1),
  reason_for_visit VARCHAR(1),
  date DATE,
  derived_by_TriNetX BOOLEAN,
  source_id VARCHAR(200)
);


DROP TABLE IF EXISTS encounter;
CREATE TABLE encounter (
  encounter_id VARCHAR(200),
  patient_id VARCHAR(200),
  start_date DATE,
  end_date DATE,
  type VARCHAR(50),
  start_date_derived_by_TriNetX BOOLEAN,
  end_date_derived_by_TriNetX BOOLEAN,
  derived_by_TriNetX BOOLEAN,
  source_id VARCHAR(200)
);


DROP TABLE IF EXISTS lab_result;
CREATE TABLE lab_result (
  patient_id VARCHAR(200),
  encounter_id VARCHAR(200),
  code_system VARCHAR(50),
  code VARCHAR(100),
  date DATE,
  lab_result_num_val FLOAT,
  lab_result_text_val VARCHAR(100),
  units_of_measure VARCHAR(40),
  derived_by_TriNetX BOOLEAN,
  source_id VARCHAR(200)
);


DROP TABLE IF EXISTS medication;
CREATE TABLE medication (
  patient_id VARCHAR(200),
  encounter_id VARCHAR(200),
  unique_id VARCHAR(200),
  code_system VARCHAR(50),
  code VARCHAR(100),
  start_date DATE,
  route VARCHAR(200),
  brand VARCHAR(200),
  strength VARCHAR(200),
  derived_by_TriNetX BOOLEAN,
  source_id VARCHAR(200)
);


DROP TABLE IF EXISTS patient_demographic;
CREATE TABLE patient_demographic (
  patient_id VARCHAR(200),
  sex VARCHAR(50),
  race VARCHAR(180),
  ethnicity VARCHAR(180),
  marital_status VARCHAR(180),
  year_of_birth BIGINT,
  reason_yob_missing VARCHAR(50),
  death_date_source_id VARCHAR(200),
  month_year_death BIGINT,
  patient_regional_location VARCHAR(100),
  source_id VARCHAR(50)
);


DROP TABLE IF EXISTS procedure;
CREATE TABLE procedure (
  patient_id VARCHAR(200),
  encounter_id VARCHAR(200),
  code_system VARCHAR(50),
  code VARCHAR(100),
  principal_procedure_indicator VARCHAR(10),
  date DATE,
  derived_by_TriNetX BOOLEAN,
  source_id VARCHAR(180)
);


DROP TABLE IF EXISTS vital_sign;
CREATE TABLE vital_sign (
  patient_id VARCHAR(200),
  encounter_id VARCHAR(200),
  code_system VARCHAR(50),
  code VARCHAR(100),
  date DATE,
  value FLOAT,
  text_value VARCHAR(1020),
  units_of_measure VARCHAR(40),
  derived_by_TriNetX BOOLEAN,
  source_id VARCHAR(200)
);
```

---
---


# `cat /uufs/chpc.utah.edu/common/home/u0740821/dissertation/data/diabetes/load.sql`
---

---
```
-----------------------------------------
-- Load data into the diabetes schemas --
-----------------------------------------

-- To run from a terminal:
--  psql "dbname=<DBNAME> user=<USER>" -v diabetes_data_dir=<PATH TO DATA DIR> -f load.sql
-- The script assumes the files are in the diabetes_data_dir
\cd :diabetes_data_dir

-- making sure correct encoding is defined as -utf8- 
SET CLIENT_ENCODING TO 'utf8';
\COPY patient_demographic (patient_id, sex, race, ethnicity, year_of_birth, patient_regional_location) FROM 'patient_sub_col.csv' DELIMITER ',' CSV HEADER NULL '';
\COPY encounter (encounter_id, patient_id, start_date, type) FROM 'encounter_sub_col.csv' DELIMITER ',' CSV HEADER NULL '';
\COPY lab_result (patient_id, encounter_id, code, date, lab_result_num_val, lab_result_text_val, units_of_measure) FROM 'lab_result_sub_col.csv' DELIMITER ',' CSV HEADER NULL '';
\COPY diagnosis (patient_id, encounter_id, code, date) FROM 'diagnosis_sub_col.csv' DELIMITER ',' CSV HEADER NULL '';
\COPY procedure (patient_id, encounter_id, code, date) FROM 'procedure_sub_col.csv' DELIMITER ',' CSV HEADER NULL '';
\COPY medication (patient_id, encounter_id, code, start_date) FROM 'medication_ingredient_sub_col.csv' DELIMITER ',' CSV HEADER NULL '';
\COPY vital_sign (patient_id, encounter_id, code, date, value, text_value, units_of_measure, code_system) FROM 'vitals_signs_sub_col.csv' DELIMITER ',' CSV HEADER NULL '';

-- Note: an earlier run of this script stopped at the first COPY with
--   ERROR:  invalid input syntax for type bigint: "1953.0"
--   CONTEXT:  COPY patient_demographic, line 2, column year_of_birth: "1953.0"
-- year_of_birth is BIGINT, so patient_sub_col.csv must store it as a whole number (e.g. 1953).
```
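A `BIGINT` column rejects float-formatted text such as `1953.0`. That formatting can appear when a numeric column with missing values is written out by a dataframe library. The sketch below shows one possible way to rewrite `year_of_birth` as a whole number before loading. It is not the preprocessing actually used in `2__preprocess_CSVs.ipynb`, and the function names and output path are hypothetical.

```python
import csv


def normalize_year(value):
    """Turn '1953.0' into '1953'; keep empty strings so they still load as NULL."""
    text = value.strip()
    if text == "":
        return ""
    number = float(text)
    if not number.is_integer():
        raise ValueError(f"not a whole year: {value!r}")
    return str(int(number))


def fix_year_column(src_path, dst_path, column="year_of_birth"):
    """Stream src_path to dst_path, normalizing one column; return the number of data rows."""
    rows = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in reader:
            row[column] = normalize_year(row[column])
            writer.writerow(row)
            rows += 1
    return rows
```

The file is processed row by row, so memory use stays flat regardless of file size.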

---
---


# Run `create.sql`

---

## `psql -d diabetes -v ON_ERROR_STOP=1 -v diabetes_data_dir=<replace with path to directory containing extracted CSV file(s)/etc.> -f <replace with path to 'create.sql'>`

---
```
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
DROP TABLE
CREATE TABLE
```

---
---


# Run `load.sql`

---

## `psql -d diabetes -v ON_ERROR_STOP=1 -v diabetes_data_dir=<replace with path to directory containing extracted CSV file(s)/etc.> -f <replace with path to 'load.sql'>`
---
```
SET
COPY 6108666
COPY 1274437790
COPY 5713861360
COPY 2653978299
COPY 1570070266
COPY 5674491072
COPY 2033768504
```

---
---


# Run `validate.sql`

---

## `psql -d diabetes -v ON_ERROR_STOP=1 -v diabetes_data_dir=<replace with path to directory containing extracted CSV file(s)/etc.> -f <replace with path to 'validate.sql'>`
---
```
         tbl         | expected_count | observed_count | row_count_check 
---------------------+----------------+----------------+-----------------
 diagnosis           |     2653978299 |     2653978299 | PASSED
 encounter           |     1274437790 |     1274437790 | PASSED
 lab_result          |     5713861360 |     5713861360 | PASSED
 medication          |     5674491072 |     5674491072 | PASSED
 patient_demographic |        6108666 |        6108666 | PASSED
 procedure           |     1570070266 |     1570070266 | PASSED
 vital_sign          |     2033768504 |     2033768504 | PASSED
(7 rows)
```

---

---

---


# Check Sizes


---

In [112]:
!du -sh pgsql/

1.2T	pgsql/


In [113]:
!du -sh /scratch/general/vast/u0740821/*sub*


75G	/scratch/general/vast/u0740821/diagnosis_sub_col.csv
35G	/scratch/general/vast/u0740821/encounter_sub_col.csv
229G	/scratch/general/vast/u0740821/lab_result_sub_col.csv
154G	/scratch/general/vast/u0740821/medication_ingredient_sub_col.csv
285M	/scratch/general/vast/u0740821/patient_sub_col.csv
44G	/scratch/general/vast/u0740821/procedure_sub_col.csv
96G	/scratch/general/vast/u0740821/vitals_signs_sub_col.csv
