# Split data

This notebook walks through the **split_data.py** script.

We take our parquet file of k-mers and:
+ split it into datasets according to the config file

Then we save each split as parquet files.

## Step 0: import required libraries

In [1]:
import sys
sys.path.append('..') # this is to allow the script to read from the parent folder

from scripts.global_funcs import load_data_config
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

## Step 1: load config file data

In [2]:
configs = load_data_config()

In [3]:
# these are the variables we will be using
for key, val in configs.items():
    print(f"{key}: {val}")

clean_fasta_file: /media/jcosme/Data/MarRef_parquet_10_cats
output_dir: /media/jcosme/Data
project_name: small_mer_1
base_col_names: ['seq', 'label']
label_col_name: label
input_col_name: seq
label_regex: (?:[^a-zA-Z0-9]+)([a-zA-Z]+[0-9]+)(?:[^a-zA-Z0-9]+)
k_mer: 1
possible_gene_values: ['A', 'C', 'G', 'T']
max_seq_len: 150
data_splits: {'train': 0.9, 'val': 0.05, 'test': 0.05}
random_seed: 42
fasta_sep: >
unq_labs_dir: /media/jcosme/Data/small_mer_1/data/unq_labels
unq_labs_dir_csv: /media/jcosme/Data/small_mer_1/data/unq_labels.csv
data_dir: /media/jcosme/Data/small_mer_1/data/small_mer_1
nvtab_dir: /media/jcosme/Data/small_mer_1/nvtab
dask_dir: /media/jcosme/Data/small_mer_1/dask
tensorboard_dir: /media/jcosme/Data/small_mer_1/tensorboard
model_checkpoints_dir: /media/jcosme/Data/small_mer_1/checkpoints/model_checkpoints
model_checkpoints_parent_dir: /media/jcosme/Data/small_mer_1/checkpoints
model_weights_dir: /media/jcosme/Data/small_mer_1/model_weights.h5


In [4]:
# let's put these into Python variables
output_dir = configs['output_dir']
project_name = configs['project_name']
data_dir = configs['data_dir']
random_seed = configs['random_seed']
data_splits = configs['data_splits']
dask_dir = configs['dask_dir']
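
The split step relies on `data_splits` holding fractions that together cover the whole dataset, so a quick sanity check before starting the cluster can catch a mistyped config. The sketch below is only an assumption about what such a check could look like; `validate_splits` and `example_splits` are hypothetical names and are not part of `scripts.global_funcs`.

```python
import math

def validate_splits(splits):
    """Check that every fraction is in (0, 1] and that they sum to 1 (hypothetical helper)."""
    for name, frac in splits.items():
        if not 0 < frac <= 1:
            raise ValueError(f"split '{name}' has invalid fraction {frac}")
    total = sum(splits.values())
    # isclose avoids false alarms from floating-point rounding
    if not math.isclose(total, 1.0, rel_tol=1e-9):
        raise ValueError(f"split fractions sum to {total}, expected 1.0")
    return True

# same values as the data_splits entry printed above
example_splits = {'train': 0.9, 'val': 0.05, 'test': 0.05}
validate_splits(example_splits)
```

In the notebook itself, the call would be `validate_splits(data_splits)`.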

## Step 2: start the cluster

In [5]:
cluster = LocalCUDACluster(local_directory=dask_dir)
client = Client(cluster)
client

2022-05-24 13:23:45,708 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize

Client
Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status

Cluster
Workers: 1
Total threads: 1
Total memory: 31.21 GiB
Status: running
Using processes: True

Scheduler
Comm: tcp://127.0.0.1:39351
Dashboard: http://127.0.0.1:8787/status
Started: Just now

Worker
Comm: tcp://127.0.0.1:46217
Dashboard: http://127.0.0.1:45085/status
Nanny: tcp://127.0.0.1:40111
Total threads: 1
Memory: 31.21 GiB
Local directory: /media/jcosme/Data/small_mer_1/dask/dask-worker-space/worker-kp42gs5j
GPU: NVIDIA GeForce RTX 3080 Laptop GPU
GPU memory: 16.00 GiB


### optional:
Click the dashboard link above to open the Dask Dashboard, where you can watch the progress of your job.  
**note:** the link is only shown when this is run in a Jupyter notebook

## Step 3: data transformations

In [6]:
# get the fraction assigned to each split, in config order
data_splits_values = list(data_splits.values())
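
The save step later picks each split dataframe by position, so it depends on the fractions staying in the same order as the split names. Python dictionaries preserve insertion order (guaranteed since Python 3.7), which is what makes this pairing safe. A small sketch using the config values shown earlier; `example_splits` is just an illustrative copy of `data_splits`:

```python
example_splits = {'train': 0.9, 'val': 0.05, 'test': 0.05}

split_names = list(example_splits.keys())
split_fracs = list(example_splits.values())

# the i-th fraction belongs to the i-th split name
for name, frac in zip(split_names, split_fracs):
    print(f"{name}: about {frac:.0%} of the rows")
```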

In [7]:
# read parquet files
df = dask_cudf.read_parquet(data_dir)

In [8]:
# here is a raw data sample
df.head()

|   | seq | label |
|---|-----|-------|
| 0 | [G, G, G, C, G, G, C, C, G, A, G, A, C, C, G, ...] | 1 |
| 1 | [A, G, C, C, G, A, G, C, A, G, C, C, G, G, T, ...] | 1 |
| 2 | [G, G, A, G, C, G, G, G, C, C, G, C, C, G, G, ...] | 1 |
| 3 | [C, G, A, T, C, G, A, C, C, G, C, C, G, C, T, ...] | 1 |
| 4 | [C, C, G, G, G, C, G, C, T, G, A, C, C, G, A, ...] | 1 |


In [9]:
# create the data splits
df_list = df.random_split(data_splits_values, random_state=random_seed)
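
`random_split` assigns rows to the splits pseudorandomly, so the resulting sizes are only approximately proportional to the requested fractions, and a fixed `random_state` makes the assignment repeatable. The plain-Python sketch below illustrates that general idea on row indices. It is not Dask's implementation, and `assign_splits` is a hypothetical name used only for illustration.

```python
import bisect
import itertools
import random

def assign_splits(n_rows, fractions, seed):
    """Assign each row index to a split by comparing a uniform draw
    against the cumulative fractions (illustrative, not Dask's code)."""
    cumulative = list(itertools.accumulate(fractions))
    rng = random.Random(seed)
    buckets = [[] for _ in fractions]
    for row in range(n_rows):
        u = rng.random()  # uniform draw in [0, 1)
        idx = bisect.bisect_right(cumulative, u)
        # guard against floating-point rounding in the last cumulative value
        idx = min(idx, len(fractions) - 1)
        buckets[idx].append(row)
    return buckets

train, val, test = assign_splits(10_000, [0.9, 0.05, 0.05], seed=42)
# sizes land near 9000 / 500 / 500 but are not exact
print(len(train), len(val), len(test))
```

Every row ends up in exactly one split, and rerunning with the same seed reproduces the same assignment.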

## Step 4: save the data

In [10]:
%%time
# the final step is to save the split data
# this can take some time on large datasets
# we write one parquet dataset per split
for i, (a_split, a_val) in enumerate(data_splits.items()):
    out_filepath = f"{data_dir}_{a_split}"
    _ = df_list[i].to_parquet(out_filepath)

CPU times: user 191 ms, sys: 26.6 ms, total: 217 ms
Wall time: 1.78 s
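
Because the split name is appended to `data_dir`, each split is written next to the input data under its own name. The sketch below reproduces only the path construction, using the `data_dir` value and split names from the config output shown earlier; `split_output_paths` is a hypothetical helper.

```python
def split_output_paths(data_dir, split_names):
    """Build one output path per split by appending the split name."""
    return {name: f"{data_dir}_{name}" for name in split_names}

paths = split_output_paths(
    "/media/jcosme/Data/small_mer_1/data/small_mer_1",
    ["train", "val", "test"],
)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Each of these paths should be readable with `dask_cudf.read_parquet`, in the same way the input was read in Step 3.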


## Step 5: cleanup

In [11]:
# we delete the dataframe
del df

# then we shut down the Dask cluster (scheduler and workers)
client.shutdown()

# finally we close the client connection
client.close()

## finished!