# make nvtab data

This is an explanation of the **make_nvtab.py** file.

We take our split data and create:
+ NVTabular datasets

We then save the output as Parquet files.

## Step 0: import required libraries

In [1]:
import sys
sys.path.append('..')  # allow imports from the parent folder (e.g. scripts.global_funcs)

from scripts.global_funcs import load_data_config
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import nvtabular as nvt

## Step 1: start the cluster

In [2]:
cluster = LocalCUDACluster()
client = Client(cluster)
client

Client  Scheduler: tcp://127.0.0.1:46655  Dashboard: http://127.0.0.1:8787/status
Cluster  Workers: 1  Cores: 1  Memory: 31.21 GiB


### optional:
Click the Dashboard link above to open the Dask Dashboard, where you can watch the progress of your job.  
**Note:** this only works when running in a Jupyter notebook.

## Step 2: load config file data

In [3]:
configs = load_data_config()

In [4]:
# these are the variables we will be using
for key, val in configs.items():
    print(f"{key}: {val}")

clean_fasta_file: /media/jcosme/Data/MarRef_parquet
output_dir: /media/jcosme/Data
project_name: full_mer_1
base_col_names: ['seq', 'label']
label_col_name: label
input_col_name: seq
label_regex: (?:[^a-zA-Z0-9]+)([a-zA-Z]+[0-9]+)(?:[^a-zA-Z0-9]+)
k_mer: 1
possible_gene_values: ['A', 'C', 'G', 'T']
max_seq_len: 150
data_splits: {'train': 0.9, 'val': 0.05, 'test': 0.05}
random_seed: 42
fasta_sep: >
unq_labs_dir: /media/jcosme/Data/full_mer_1/data/unq_labels
unq_labs_dir_csv: /media/jcosme/Data/full_mer_1/data/unq_labels.csv
data_dir: /media/jcosme/Data/full_mer_1/data/full_mer_1
nvtab_dir: /media/jcosme/Data/full_mer_1/nvtab
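
The derived paths in the output above (`data_dir`, `nvtab_dir`) appear to be built from `output_dir` and `project_name`. The sketch below reproduces that pattern. Note that `derive_paths` is a hypothetical helper written for illustration; the actual implementation of `load_data_config` is not shown in this document and may differ.

```python
from pathlib import PurePosixPath


def derive_paths(output_dir, project_name):
    """Hypothetical helper: rebuild the derived paths seen in the config output."""
    project_root = PurePosixPath(output_dir) / project_name
    return {
        # split data lives under <root>/data/<project_name> (plus a _<split> suffix)
        "data_dir": str(project_root / "data" / project_name),
        # NVTabular outputs live under <root>/nvtab
        "nvtab_dir": str(project_root / "nvtab"),
    }


paths = derive_paths("/media/jcosme/Data", "full_mer_1")
print(paths["data_dir"])   # /media/jcosme/Data/full_mer_1/data/full_mer_1
print(paths["nvtab_dir"])  # /media/jcosme/Data/full_mer_1/nvtab
```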


In [5]:
# let's put these into python variables
input_col_name = configs['input_col_name']
label_col_name = configs['label_col_name']
data_splits = configs['data_splits']
max_seq_len = configs['max_seq_len']
nvtab_dir = configs['nvtab_dir']
data_dir = configs['data_dir']
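
As a quick sanity check before building anything, it can help to list where each split is read from and written to. The sketch below uses the same path patterns as the cells in this notebook (`f"{data_dir}_{key}"` for input, `f"{nvtab_dir}/{key}"` for output), with the config values hard-coded so that it runs on its own. The `split_paths` name is introduced here for illustration only.

```python
import math

# values copied from the config output above
data_splits = {'train': 0.9, 'val': 0.05, 'test': 0.05}
data_dir = "/media/jcosme/Data/full_mer_1/data/full_mer_1"
nvtab_dir = "/media/jcosme/Data/full_mer_1/nvtab"

# the split fractions should add up to 1 (allowing for float rounding)
assert math.isclose(sum(data_splits.values()), 1.0)

# input and output location for each split
split_paths = {
    key: {"input": f"{data_dir}_{key}", "output": f"{nvtab_dir}/{key}"}
    for key in data_splits
}

for key, locations in split_paths.items():
    print(f"{key}: {locations['input']} -> {locations['output']}")
```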

## Step 3: create NVTabular workflow

In [6]:
# create the pipeline: encode the input column as integer ids,
# then truncate or pad each sequence to max_seq_len
cat_features = [input_col_name] >> nvt.ops.Categorify() >> nvt.ops.ListSlice(start=0, end=max_seq_len, pad=True, pad_value=0)

# add label column
output = cat_features + label_col_name

# create workflow
workflow = nvt.Workflow(output)
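
To make the effect of the `ListSlice(start=0, end=max_seq_len, pad=True, pad_value=0)` step more concrete, here is a plain-Python sketch of the idea: keep at most the first `max_len` ids, then fill up to `max_len` with the pad value. This is a simplified illustration, not NVTabular's implementation; `slice_and_pad` is a hypothetical function, the ids are made up, and padding on the right is an assumption.

```python
def slice_and_pad(ids, max_len, pad_value=0):
    """Keep the first max_len ids, then right-pad with pad_value up to max_len."""
    kept = list(ids[:max_len])
    return kept + [pad_value] * (max_len - len(kept))


# hypothetical encoded sequences of different lengths
short_seq = [3, 1, 4]
long_seq = [2, 2, 3, 1, 4, 4, 1]

print(slice_and_pad(short_seq, max_len=5))  # [3, 1, 4, 0, 0]
print(slice_and_pad(long_seq, max_len=5))   # [2, 2, 3, 1, 4]
```

Either way, every sequence comes out with the same length, which is what lets downstream code treat the column as a fixed-width input.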

## Step 5: fit workflow on training data

In [7]:
%%time
# fitting on training data, and saving the workflow
for key in data_splits.keys():
    if key=='train':
        print("fitting nvtab workflow on training data...")
        workflow.fit(nvt.Dataset(f"{data_dir}_{key}", engine='parquet', row_group_size=10000))

        print("saving fitted nvtab workflow...")
        workflow.save(f"{nvtab_dir}/workflow")

fitting nvtab workflow on training data...
saving fitted nvtab workflow...
CPU times: user 801 ms, sys: 213 ms, total: 1.01 s
Wall time: 11.1 s
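
The workflow is fitted on the training split only, so the category mapping is learned from training data and then reused unchanged for the other splits. The plain-Python sketch below illustrates that idea. It is not how NVTabular's `Categorify` is implemented: `fit_vocab` and `encode` are hypothetical helpers, and the id assignments (including the use of `0` for unseen tokens) are assumptions made for this example.

```python
def fit_vocab(train_sequences, first_id=1):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for seq in train_sequences:
        for token in seq:
            if token not in vocab:
                vocab[token] = first_id + len(vocab)
    return vocab


def encode(seq, vocab, unknown_id=0):
    """Map tokens to ids; tokens never seen during fitting get unknown_id."""
    return [vocab.get(token, unknown_id) for token in seq]


# "fit" on training sequences only
train_sequences = [["A", "C", "G"], ["G", "T", "A"]]
vocab = fit_vocab(train_sequences)
print(vocab)  # {'A': 1, 'C': 2, 'G': 3, 'T': 4}

# "transform" a validation sequence containing a token absent from training
val_sequence = ["A", "N", "T"]
print(encode(val_sequence, vocab))  # [1, 0, 4]
```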


## Step 6: create datasets

In [8]:
%%time
# shuffling is applied to the training split only
shuffle = nvt.io.Shuffle.PER_PARTITION

for key in data_splits.keys():
    if key=='train':

        print("making nvtab dataset for training...")
        workflow.transform(nvt.Dataset(f"{data_dir}_{key}", engine='parquet', row_group_size=10000)).to_parquet(
            output_path=f"{nvtab_dir}/{key}",
            shuffle=shuffle,
            cats=[input_col_name],
            labels=[label_col_name],
        )
    else:
        print(f"making nvtab dataset for {key}...")
        workflow.transform(nvt.Dataset(f"{data_dir}_{key}", engine='parquet', row_group_size=10000)).to_parquet(
            output_path=f"{nvtab_dir}/{key}",
            shuffle=None,
            out_files_per_proc=None,
            cats=[input_col_name],
            labels=[label_col_name],
        )

making nvtab dataset for training...
making nvtab dataset for val...
making nvtab dataset for test...
CPU times: user 792 ms, sys: 198 ms, total: 990 ms
Wall time: 51.6 s
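
Only the training split is written with `Shuffle.PER_PARTITION`; the other splits are written unshuffled. As a rough mental model, per-partition shuffling reorders rows inside each partition without moving rows between partitions. The plain-Python sketch below shows just that idea. `shuffle_per_partition` is a hypothetical helper, and it ignores details of NVTabular's behaviour such as partition ordering and how rows are distributed across output files.

```python
import random


def shuffle_per_partition(partitions, seed=42):
    """Shuffle rows inside each partition; rows never cross partition boundaries."""
    rng = random.Random(seed)
    shuffled = []
    for part in partitions:
        rows = list(part)   # copy, so the input is left untouched
        rng.shuffle(rows)
        shuffled.append(rows)
    return shuffled


# two hypothetical partitions of row ids
partitions = [[0, 1, 2, 3], [4, 5, 6, 7]]
result = shuffle_per_partition(partitions)

for original, mixed in zip(partitions, result):
    # same rows in each partition, possibly in a different order
    print(sorted(mixed) == original)
```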


## Step 7: cleanup

In [9]:
# shut down the Dask scheduler and workers
client.shutdown()

# finally, close the client connection
client.close()



## finished!