If you run this notebook on an interactive node, you can set `pd.read_csv(..., low_memory=False)`  

---



---

Some/all notebooks were run on a login/interactive node (extremely limited resources).  

---

To speed this up you can run this notebook on a compute node. 

To do this, log in to an interactive node and open two terminals (two terminal windows if you use the GUI/FastX, or two `ssh` connections from local terminals to CHPC):  

---
---

Terminal 1:

---

Run the following to check what you can request:

`myallocation`

(You may want to increase `--time` to be safe; the values below are only an example.)  

`salloc --time=6:00:00 --ntasks=1 --cpus-per-task=64 --nodes=1 --account=<replace here with yours> --partition=<replace here with yours>`  

The terminal returns when resources are allocated, and further commands will be executed on the compute node.  

Run the following command when this occurs and copy the result (\<compute node name\>).

`hostname -f`  
\<compute node name\> (copy this to clipboard)  

`export XDG_RUNTIME_DIR=""`  

`conda activate jupy2` (activate your conda env or venv that has jupyter/etc. dependencies)  

Now move to the folder where your TriNetX zip/files are.  

`jupyter notebook --no-browser --port=8889`  



**You have successfully allocated a compute node, activated your environment, and started the Jupyter server.**  

Now, you need to connect as a client through Terminal 2.


---
---

Terminal 2: 

---

Terminal on local machine

>`ssh -L 8888:localhost:8888 <uID>@kingspeak.chpc.utah.edu`
>
>You are now connected to the CHPC Kingspeak cluster (replace with your cluster of choice; just make sure it is the same cluster you used in Terminal 1). 
>
>From here, run the following
>
>`ssh -L 8888:localhost:8889 <paste the result from 'hostname -f', here>`
>
>Now, open Chrome on your local machine and paste the address shown in Terminal 1 (ex: http://localhost:8889/tree?token=...), then change 8889 to 8888.
>
>* If you want jupyter lab, change 'tree' to 'lab'
> 
>Now you can use the compute power that you requested earlier from Jupyter.
>


Terminal on FastX/GUI connection:

>`google-chrome &`  
>
>`ssh -N -L localhost:8888:localhost:8889 <compute node name>`  
>
>Now, Chrome should open on your interactive node and you can paste the address shown in Terminal 1 (ex: http://localhost:8889/tree?token=...), then change 8889 to 8888.
>
>* If you want jupyter lab, change 'tree' to 'lab'
> 
>Now you can use the compute power that you requested earlier from Jupyter.
>
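
Before pasting the URL into the browser, it can help to confirm that the local end of the tunnel is accepting connections. The following is a minimal sketch, assuming the tunnel listens on `localhost:8888` as in the `ssh -L` commands above; `port_is_open` is a hypothetical helper name, not part of any CHPC tooling.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    # 8888 is the local end of the tunnel in the ssh commands above (assumption)
    print("tunnel reachable:", port_is_open("localhost", 8888))
```

If this prints `False`, the `ssh -L` session has probably exited or the port numbers do not match the ones used when starting Jupyter.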

---
---
---

---

__Prior to this (or through a new terminal in Jupyter)__, you will need to extract the `.zip` file containing all your TriNetX CSVs:  

`unzip -l your_zip_file.zip`  

Above will show you what files are contained in the zip.  

Then run the following command (adding any other CSVs you want to extract):  

`unzip your_zip_file.zip patient.csv diagnosis.csv -d /path/to/destination` (recommend `/scratch/general/vast/<your user ID>`)  
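
The same two steps can also be done from a notebook cell with Python's standard `zipfile` module. This is a sketch only: `list_members` and `extract_members` are hypothetical helper names, and the archive and destination paths are placeholders you would replace with your own.

```python
import zipfile


def list_members(zip_path):
    """Return (name, uncompressed_size_in_bytes) for every member of the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def extract_members(zip_path, members, dest_dir):
    """Extract only the named members into dest_dir and return the created paths."""
    with zipfile.ZipFile(zip_path) as zf:
        return [zf.extract(name, path=dest_dir) for name in members]


# Example usage (placeholder paths):
# for name, size in list_members("your_zip_file.zip"):
#     print(f"{size:>15,d}  {name}")
# extract_members("your_zip_file.zip", ["patient.csv", "diagnosis.csv"], "/path/to/destination")
```

The uncompressed sizes reported by `list_members` are worth noting down, because they give you something to compare the extracted files against later.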

---

---
---

# Remove unwanted columns

__Note:__  
  
Comment out tables you will not be using (or add ones you need) in `specific_columns` and `file_to_table_mapping`.

In [131]:
import os
import pandas as pd

In [132]:
specific_columns = {
    "patient_demographic": ["patient_id", "sex", "race", "ethnicity", "year_of_birth", "patient_regional_location"],
    "encounter": ["encounter_id", "patient_id", "start_date", "type"],
    "lab_result": ["patient_id", "encounter_id", "code", "date", "lab_result_num_val", "lab_result_text_val", "units_of_measure"],
    "diagnosis": ["patient_id", "encounter_id", "code", "date"],
    "procedure": ["patient_id", "encounter_id", "code", "date"],
    "medication": ["patient_id", "encounter_id", "code", "start_date"],
    "vital_sign": ["patient_id", "encounter_id", "code", "date", "value", "text_value", "units_of_measure", "code_system"]
}

In [133]:
file_to_table_mapping = {
    "patient.csv": "patient_demographic",
    "encounter.csv": "encounter",
    "lab_result.csv": "lab_result",
    "diagnosis.csv": "diagnosis",
    "procedure.csv": "procedure",
    "medication_ingredient.csv": "medication",
    "vitals_signs.csv": "vital_sign"
}

In [134]:
# Use the chunksize parameter to process the CSV in chunks
chunksize = 100000  # Adjust this size based on your memory capacity

__Note:__  
  
If you have the RAM, use `pd.read_csv(csv, chunksize=chunksize)` => `pd.read_csv(csv, chunksize=chunksize, low_memory=False)`  
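
As a rough illustration of the note above, the sketch below reads a CSV in chunks with `low_memory=False` and yields only the requested columns. Per the pandas documentation, `low_memory=False` makes the parser infer each column's type from a whole chunk at once rather than from internal sub-blocks, which avoids mixed-type inference at the cost of more RAM. `iter_filtered_chunks` is a hypothetical helper name used only for this example.

```python
import pandas as pd


def iter_filtered_chunks(csv_path, columns, chunksize=100_000, low_memory=False):
    """Yield DataFrames containing only `columns`, one per chunk of the CSV."""
    with pd.read_csv(csv_path, chunksize=chunksize, low_memory=low_memory) as reader:
        for chunk in reader:
            yield chunk[columns]


# Example usage (placeholder path):
# for part in iter_filtered_chunks("/path/to/diagnosis.csv", ["patient_id", "code"]):
#     print(len(part))
```

Passing an explicit `dtype` mapping to `pd.read_csv` is the other documented way to avoid mixed-type inference, and it does not require the extra memory.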
  
  
__Note:__  
Update uNID/path as appropriate



In [None]:
scratch_path = "/scratch/general/vast/u0740821"

In [8]:
for file, table in file_to_table_mapping.items():
    csv = os.path.join(scratch_path, file)
    
    # Build the output path: <name>.csv -> <name>_sub_col.csv
    # (os.path.splitext only strips the extension, even if a directory name contains a dot)
    output_csv = os.path.splitext(csv)[0] + '_sub_col.csv'
    print(output_csv)
    
    columns_to_keep = specific_columns[table]
        
    with pd.read_csv(csv, chunksize=chunksize) as reader:
        for chunk in reader:
            # Select only the columns we care about
            filtered_chunk = chunk[columns_to_keep]
            # Append to the output CSV file; write the header only if the file does not exist yet
            filtered_chunk.to_csv(output_csv, mode='a', index=False, header=not os.path.exists(output_csv))
    
    print(f"Filtered CSV written to {output_csv}")
    

/scratch/general/vast/u0740821/patient_sub_col.csv
Filtered CSV written to /scratch/general/vast/u0740821/patient_sub_col.csv
/scratch/general/vast/u0740821/encounter_sub_col.csv
Filtered CSV written to /scratch/general/vast/u0740821/encounter_sub_col.csv
/scratch/general/vast/u0740821/lab_result_sub_col.csv


  for chunk in reader:


Filtered CSV written to /scratch/general/vast/u0740821/lab_result_sub_col.csv
/scratch/general/vast/u0740821/diagnosis_sub_col.csv
Filtered CSV written to /scratch/general/vast/u0740821/diagnosis_sub_col.csv
/scratch/general/vast/u0740821/procedure_sub_col.csv


  for chunk in reader:

Filtered CSV written to /scratch/general/vast/u0740821/procedure_sub_col.csv
/scratch/general/vast/u0740821/medication_ingredient_sub_col.csv


  for chunk in reader:

ParserError: Error tokenizing data. C error: EOF inside string starting at row 629221700

# Investigation of `medication_ingredient.csv` -> corrupted -> re-unzip

---

__Note:__  

Skim this, but you can probably skip it: the unzip did not finish for the CSV above, and this section documents how that conclusion was reached.

---
---

## Investigate supposed "`EOF`" inside string starting at row 629221700

In [11]:
pd.read_csv('manifest.csv')

| Unnamed: 0 | file | column_count | row_count | unique_patient_count |
|---|---|---|---|---|
| 0 | chemo_lines.csv | 5 | 908938 | 658258 |
| 1 | cohort_details.csv | 3 | 1 | - |
| 2 | dataset_details.csv | 4 | 1 | - |
| 3 | diagnosis.csv | 10 | 2653978299 | 6012826 |
| 4 | encounter.csv | 9 | 1274437790 | 6108666 |
| 5 | genomic.csv | 6 | 3670955 | 13057 |
| 6 | lab_result.csv | 10 | 5713861360 | 5676455 |
| 7 | medication_drug.csv | 13 | 1813961625 | 4902228 |
| 8 | medication_ingredient.csv | 11 | 5674491072 | 5775824 |
| 9 | oncology_treatment.csv | 11 | 528018 | 90242 |


### medication_ingredient.csv should have 5,674,491,072 rows
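
One way to compare a file on disk against the manifest's `row_count` is to count newline bytes without parsing the CSV. The sketch below uses a hypothetical helper, `count_lines`. Note that it counts the header line too, and would also count any newlines embedded in quoted fields, so treat the result as a rough check rather than an exact record count.

```python
def count_lines(path, block_size=1024 * 1024):
    """Count newline bytes in a file by reading fixed-size binary blocks."""
    newlines = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:
                break
            newlines += block.count(b"\n")
    return newlines


# Example usage (placeholder path):
# print(count_lines("/path/to/medication_ingredient.csv"))
```

A count far below the manifest value points to an incomplete file rather than a single bad row.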

---
---

### Try reading 10 rows, including bad row

In [8]:
# file_path = '/scratch/general/vast/u0740821/medication_ingredient.csv'
# start_row = 629221695  # 0-based index, so start a few lines before
# end_row = 629221705  # End a few lines after the problematic row

# rows = []

# with open(file_path, 'r') as file:
#     for i, line in enumerate(file):
#         if start_row <= i <= end_row:
#             rows.append((i, line))
#         if i > end_row:
#             break
# rows

### Bad row caused above to stop early

---
---

In [6]:
# !tail -n3 /scratch/general/vast/u0740821/medication_ingredient.csv
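
If you prefer to stay in Python, the last few lines can be read by seeking from the end of the file, so that a file of hundreds of gigabytes is never read in full. This is a sketch under the assumption that `n >= 1`; `tail_lines` is a hypothetical helper name.

```python
import os


def tail_lines(path, n=3, block_size=64 * 1024):
    """Return the last n lines of a file (as bytes) without reading the whole file."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        data = b""
        # Read backwards block by block until more than n newlines have been seen
        while position > 0 and data.count(b"\n") <= n:
            step = min(block_size, position)
            position -= step
            fh.seek(position)
            data = fh.read(step) + data
    return data.splitlines()[-n:]


# Example usage (placeholder path):
# for line in tail_lines("/path/to/medication_ingredient.csv"):
#     print(line)
```

A final line with fewer fields than the header, or one that ends inside an open quote, is consistent with a file that was cut off mid-write.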

---

`tr -d '\032' < /scratch/general/vast/u0740821/medication_ingredient.csv > /scratch/general/vast/u0740821/medication_ingredient_tr_EOF.csv`

In [5]:
#!tr -d '\032' < /scratch/general/vast/u0740821/medication_ingredient.csv > /scratch/general/vast/u0740821/medication_ingredient_tr_EOF.csv
# !tail -n3 /scratch/general/vast/u0740821/medication_ingredient_tr_EOF.csv

---

`dd if=/scratch/general/vast/u0740821/medication_ingredient.csv of=/scratch/general/vast/u0740821/medication_ingredient_dd-conv=notrunc,noerror__per_EOF.csv conv=notrunc,noerror`

In [4]:
#!dd if=/scratch/general/vast/u0740821/medication_ingredient.csv of=/scratch/general/vast/u0740821/medication_ingredient_dd-conv=notrunc,noerror__per_EOF.csv conv=notrunc,noerror
# !tail -n3 /scratch/general/vast/u0740821/medication_ingredient_dd-conv=notrunc,noerror__per_EOF.csv

---

`tr -d '\0\032' < /scratch/general/vast/u0740821/medication_ingredient.csv | sed 's/[^[:print:]\t]//g' > /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed.csv`

In [3]:
#!tr -d '\0\032' < /scratch/general/vast/u0740821/medication_ingredient.csv | sed 's/[^[:print:]\t]//g' > /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed.csv
# !tail -n3 /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed.csv

---

`tr -d '\0\032' < /scratch/general/vast/u0740821/medication_ingredient.csv | sed 's/[^[:print:]\t]//g' > /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed.csv`

---

`head -n -1 /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed.csv > /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed_head_-n_-1.csv`

In [2]:
#!tr -d '\0\032' < /scratch/general/vast/u0740821/medication_ingredient.csv | sed 's/[^[:print:]\t]//g' > /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed.csv
#!head -n -1 /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed.csv > /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed_head_-n_-1.csv
# !tail -n3 /scratch/general/vast/u0740821/medication_ingredient_tr_and_sed_head_-n_-1.csv

---
---
---

### Unzip `medication_ingredient.csv` again because it appears corrupted

---

`unzip -j ~/dissertation/data/diabetes/dataset_65f07e22fa94485e6ec41aad.zip medication_ingredient.csv -d /scratch/general/vast/u0740821/medication_ingredient_retry`

---

In [106]:
# !du -sh /scratch/general/vast/u0740821/medication_ingredient.csv

58G	/scratch/general/vast/u0740821/medication_ingredient.csv


In [118]:
# !du -sh /scratch/general/vast/u0740821/medication_ingredient_retry/medication_ingredient.csv

520G	/scratch/general/vast/u0740821/medication_ingredient_retry/medication_ingredient.csv


In [1]:
# Unzipped just this file again to check whether the previous unzip corrupted it
# !tail -n3 /scratch/general/vast/u0740821/medication_ingredient_retry/medication_ingredient.csv

In [120]:
# replace corrupted version
# !mv /scratch/general/vast/u0740821/medication_ingredient_retry/medication_ingredient.csv /scratch/general/vast/u0740821/medication_ingredient.csv

In [121]:
# !du -sh /scratch/general/vast/u0740821/medication_ingredient.csv

520G	/scratch/general/vast/u0740821/medication_ingredient.csv
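
The size mismatch above (58G versus 520G) could have been caught right after extraction by comparing the file on disk with the uncompressed size recorded in the archive. The following is a minimal sketch using the standard `zipfile` module; `extraction_is_complete` is a hypothetical helper name, and the paths in the usage comment are placeholders.

```python
import os
import zipfile


def extraction_is_complete(zip_path, member, extracted_path):
    """Compare the on-disk size with the uncompressed size recorded in the archive.

    Returns (sizes_match, expected_bytes, actual_bytes).
    """
    with zipfile.ZipFile(zip_path) as zf:
        expected = zf.getinfo(member).file_size
    actual = os.path.getsize(extracted_path)
    return actual == expected, expected, actual


# Example usage (placeholder paths):
# ok, expected, actual = extraction_is_complete(
#     "your_zip_file.zip", "medication_ingredient.csv",
#     "/path/to/destination/medication_ingredient.csv")
# print(ok, expected, actual)
```

Matching sizes do not prove the contents are intact, but a mismatch is a cheap and reliable sign that the unzip did not finish.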



---
---
---

# Remove unwanted columns from remaining `medication_ingredient.csv` & `vitals_signs.csv`

---

## __!!! See next section before executing as combining that logic here could save you having to reprocess your CSVs !!!__
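
One possible way to combine both steps, shown only as a sketch: select the columns while reading (via `usecols`) and apply the date and `year_of_birth` formatting in the same pass, so each large CSV is read once. `subset_and_format` and `DATE_COLUMNS` are hypothetical names; the sketch reads every column as text (`dtype=str`), which differs from the per-table `dtypes` used in the next section.

```python
import pandas as pd

# Column names treated as dates (assumption: matches the tables in this notebook)
DATE_COLUMNS = ("date", "start_date")


def subset_and_format(input_csv, output_csv, columns, chunksize=100_000):
    """Keep only `columns`, format dates and year_of_birth, and write output_csv."""
    first_chunk = True
    with pd.read_csv(input_csv, usecols=columns, dtype=str, chunksize=chunksize) as reader:
        for chunk in reader:
            # usecols does not guarantee column order, so reorder explicitly
            chunk = chunk[columns].copy()
            for col in DATE_COLUMNS:
                if col in chunk.columns:
                    # Unparseable dates become missing values
                    chunk[col] = pd.to_datetime(chunk[col], errors="coerce").dt.strftime("%Y-%m-%d")
            if "year_of_birth" in chunk.columns:
                chunk["year_of_birth"] = (
                    pd.to_numeric(chunk["year_of_birth"], errors="coerce").fillna(0).astype(int)
                )
            # Overwrite on the first chunk, append afterwards; header only once
            chunk.to_csv(output_csv, mode="w" if first_chunk else "a",
                         index=False, header=first_chunk)
            first_chunk = False
```

Because the first chunk is written with `mode="w"`, re-running the function replaces a previous output file instead of appending to it.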

---


__Note:__  
  
If you have the RAM, use `pd.read_csv(csv, chunksize=chunksize, on_bad_lines='skip')` => `pd.read_csv(csv, chunksize=chunksize, low_memory=False, on_bad_lines='skip')`  
  
  
__Note:__  
  
Update uNID/path as appropriate



In [None]:
for file, table in {'medication_ingredient.csv': 'medication',
                    'vitals_signs.csv': 'vital_sign'}.items():
    csv = os.path.join(scratch_path, file)
    
    # Build the output path: <name>.csv -> <name>_sub_col.csv
    # (os.path.splitext only strips the extension, even if a directory name contains a dot)
    output_csv = os.path.splitext(csv)[0] + '_sub_col.csv'
    print(output_csv)
    
    columns_to_keep = specific_columns[table]
        
    with pd.read_csv(csv, chunksize=chunksize, on_bad_lines='skip') as reader:
        for chunk in reader:
            # Select only the columns we care about
            filtered_chunk = chunk[columns_to_keep]
            # Append to the output CSV file; write the header only if the file does not exist yet
            filtered_chunk.to_csv(output_csv, mode='a', index=False, header=not os.path.exists(output_csv))
    
    print(f"Filtered CSV written to {output_csv}")  

/scratch/general/vast/u0740821/medication_ingredient_sub_col.csv


  for chunk in reader:

---
---

# Make Data Types Appropriate for postgres `\COPY`

---
- Because of a `load.sql` error, format the `date` columns and remove the decimal from `year_of_birth`
  - `year_of_birth` in `patient_demographic` was initially `float` so that `NaN`s could be read
    - Now -> `.fillna(0)` -> `.astype(int)` -> `.to_csv()`
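
The conversion chain in the bullets above can be seen on a small, made-up DataFrame. The values are purely illustrative and are not taken from the TriNetX data.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({
    "year_of_birth": ["1980.0", None, "1975"],
    "date": ["2020-01-15", "2021-02-03", "not a date"],
})

# Text -> float (NaN for missing) -> 0 for missing -> int, which drops the ".0"
df["year_of_birth"] = (
    pd.to_numeric(df["year_of_birth"], errors="coerce").fillna(0).astype(int)
)

# Parse dates; unparseable values become missing, the rest are written as YYYY-MM-DD
df["date"] = pd.to_datetime(df["date"], errors="coerce").dt.strftime("%Y-%m-%d")

print(df)
```

Note that `.fillna(0)` turns an unknown birth year into the literal value `0`, so downstream queries need to treat `0` as missing.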

---

__Note:__  
  
Comment out (or add) tables as appropriate  
  
__Note:__  
  
Update uNID/path as appropriate



In [149]:
import pandas as pd
import os

# Define the data types for each column
dtypes = {
    "patient_demographic": {
        "patient_id": str,
        "sex": str,
        "race": str,
        "ethnicity": str,
        "year_of_birth": str,
        "patient_regional_location": str
    },
    "encounter": {
        "encounter_id": str,
        "patient_id": str,
        "start_date": str,  # Use str for dates to avoid issues during read
        "type": str
    },
    "lab_result": {
        "patient_id": str,
        "encounter_id": str,
        "code": str,
        "date": str,  # Use str for dates to avoid issues during read
        "lab_result_num_val": float,
        "lab_result_text_val": str,
        "units_of_measure": str
    },
    "diagnosis": {
        "patient_id": str,
        "encounter_id": str,
        "code": str,
        "date": str  # Use str for dates to avoid issues during read
    },
    "procedure": {
        "patient_id": str,
        "encounter_id": str,
        "code": str,
        "date": str  # Use str for dates to avoid issues during read
    },
    "medication": {
        "patient_id": str,
        "encounter_id": str,
        "code": str,
        "start_date": str  # Use str for dates to avoid issues during read
    },
    "vital_sign": {
        "patient_id": str,
        "encounter_id": str,
        "code": str,
        "date": str,  # Use str for dates to avoid issues during read
        "value": float,
        "text_value": str,
        "units_of_measure": str
    }
}

table_to_file_mapping = {
    "patient_demographic": "patient_sub_col.csv",
    "encounter": "encounter_sub_col.csv",
    "lab_result": "lab_result_sub_col.csv",
    "diagnosis": "diagnosis_sub_col.csv",
    "procedure": "procedure_sub_col.csv",
    "medication": "medication_ingredient_sub_col.csv",
    "vital_sign": "vitals_signs_sub_col.csv"
}

# Function to filter and write CSV with correct data types
def filter_and_write_csv(table_name, input_csv, output_csv, columns, dtypes, chunksize=100000):
    for chunk in pd.read_csv(input_csv, chunksize=chunksize, dtype=dtypes):
        # Copy so the column assignments below modify an independent DataFrame
        filtered_chunk = chunk[columns].copy()
        # Ensure date columns are properly formatted
        for date_col in [col for col in columns if col in ['date', 'start_date']]:
            filtered_chunk[date_col] = pd.to_datetime(filtered_chunk[date_col], errors='coerce').dt.strftime('%Y-%m-%d')
        # Convert numeric columns to appropriate types
        if 'year_of_birth' in filtered_chunk.columns:
            filtered_chunk['year_of_birth'] = pd.to_numeric(filtered_chunk['year_of_birth'], errors='coerce').fillna(0).astype(int)
        # Append to the output CSV file; write the header only if the file does not exist yet
        filtered_chunk.to_csv(output_csv, mode='a', index=False, header=not os.path.exists(output_csv))

# Process each CSV file
for table_name, columns in specific_columns.items():
    # File name without the .csv extension (computed once; avoids nested quotes inside f-strings)
    file_stem = table_to_file_mapping[table_name].split('.csv')[0]
    print(table_name)
    print(table_to_file_mapping[table_name])
    print(file_stem)
    input_csv = os.path.join(scratch_path, f'{file_stem}.csv')
    output_csv = os.path.join(scratch_path, f'{file_stem}2.csv')
    print('dtypes:' + str(dtypes[table_name]))
    filter_and_write_csv(table_name, input_csv, output_csv, columns, dtypes[table_name])
    print(f"Filtered {table_name}.csv written to {output_csv}")

print("All files processed.")


patient_demographic
patient_sub_col.csv
patient_sub_col
dtypes:{'patient_id': <class 'str'>, 'sex': <class 'str'>, 'race': <class 'str'>, 'ethnicity': <class 'str'>, 'year_of_birth': <class 'str'>, 'patient_regional_location': <class 'str'>}
Filtered patient_demographic.csv written to /scratch/general/vast/u0740821/patient_sub_col2.csv
encounter
encounter_sub_col.csv
encounter_sub_col
dtypes:{'encounter_id': <class 'str'>, 'patient_id': <class 'str'>, 'start_date': <class 'str'>, 'type': <class 'str'>}
Filtered encounter.csv written to /scratch/general/vast/u0740821/encounter_sub_col2.csv
lab_result
lab_result_sub_col.csv
lab_result_sub_col
dtypes:{'patient_id': <class 'str'>, 'encounter_id': <class 'str'>, 'code': <class 'str'>, 'date': <class 'str'>, 'lab_result_num_val': <class 'float'>, 'lab_result_text_val': <class 'str'>, 'units_of_measure': <class 'str'>}
Filtered lab_result.csv written to /scratch/general/vast/u0740821/lab_result_sub_col2.csv
diagnosis
diagnosis_sub_col.csv
dia