# Split data

This is an explanation of the **split_data.py** file.

We take our parquet file of k-mers and:
+ split it into datasets according to the `data_splits` entry in the config file

Then we save each split as parquet files.

## Step 0: import required libraries

In [1]:
import sys
sys.path.append('..') # this is to allow the script to read from the parent folder

from scripts.global_funcs import load_data_config
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

## Step 1: start the cluster

In [2]:
cluster = LocalCUDACluster()
client = Client(cluster)
client

2022-05-23 07:43:20,450 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize


Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status

Workers: 1
Total threads: 1
Total memory: 31.21 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:37165
Workers: 1
Dashboard: http://127.0.0.1:8787/status
Total threads: 1
Started: Just now
Total memory: 31.21 GiB

Comm: tcp://127.0.0.1:42409
Total threads: 1
Dashboard: http://127.0.0.1:38953/status
Memory: 31.21 GiB
Nanny: tcp://127.0.0.1:37727
Local directory: /home/jcosme/projs/COSME/notebook_walkthroughs/dask-worker-space/worker-xlln4gwa
GPU: NVIDIA GeForce RTX 3080 Laptop GPU
GPU memory: 16.00 GiB


### optional:
Click the dashboard link above to open the Dask Dashboard, which lets you monitor the progress of your job.  
**Note:** this only works when running in a Jupyter notebook.

## Step 2: load config file data

In [3]:
configs = load_data_config()

In [4]:
# these are the variables we will be using
for key, val in configs.items():
    print(f"{key}: {val}")

clean_fasta_file: /media/jcosme/Data/MarRef_parquet
output_dir: /media/jcosme/Data
project_name: full_mer_1
base_col_names: ['seq', 'label']
label_col_name: label
input_col_name: seq
label_regex: (?:[^a-zA-Z0-9]+)([a-zA-Z]+[0-9]+)(?:[^a-zA-Z0-9]+)
k_mer: 1
possible_gene_values: ['A', 'C', 'G', 'T']
data_splits: {'train': 0.9, 'val': 0.05, 'test': 0.05}
random_seed: 42
fasta_sep: >
unq_labs_dir: /media/jcosme/Data/full_mer_1/data/unq_labels
unq_labs_dir_csv: /media/jcosme/Data/full_mer_1/data/unq_labels.csv
data_dir: /media/jcosme/Data/full_mer_1/data/full_mer_1


In [5]:
# let's put these into Python variables
output_dir = configs['output_dir']
project_name = configs['project_name']
data_dir = configs['data_dir']
random_seed = configs['random_seed']
data_splits = configs['data_splits']

## Step 3: data transformations

In [6]:
# get the fraction for each split, in the order listed in the config
data_splits_values = list(data_splits.values())
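
The fractions are taken from `data_splits` in dictionary order, so they line up with the split names when we iterate over the same dictionary later. Dask's documentation describes the `frac` argument of `random_split` as a list of floats that should sum to one, so it can be worth checking the config values before starting a long job. The sketch below is one possible check; `check_splits` is a hypothetical helper that is not part of split_data.py, and the dictionary is copied from the config printout above.

```python
import math

# Values copied from the config printout above.
data_splits = {"train": 0.9, "val": 0.05, "test": 0.05}


def check_splits(splits, tol=1e-9):
    """Hypothetical helper: validate the fractions and return them in dict order."""
    values = list(splits.values())
    if any(v <= 0 for v in values):
        raise ValueError(f"every split fraction must be positive, got {values}")
    if not math.isclose(sum(values), 1.0, abs_tol=tol):
        raise ValueError(f"split fractions must sum to 1, got {sum(values)}")
    return values


data_splits_values = check_splits(data_splits)
print(data_splits_values)
```

A config such as `{'train': 0.9, 'val': 0.2}` would raise a `ValueError` here instead of failing later inside the split.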

In [7]:
# read parquet files
df = dask_cudf.read_parquet(data_dir)

In [8]:
# here is a raw data sample
df.head()

|   | seq | label |
|---|-----|-------|
| 0 | [T, T, C, C, A, C, A, A, A, G, T, T, A, C, A, ...] | 0 |
| 1 | [T, A, A, A, T, T, A, A, G, A, A, T, T, G, A, ...] | 0 |
| 2 | [A, T, A, T, T, T, T, T, A, T, T, T, T, T, T, ...] | 0 |
| 3 | [T, T, A, T, G, G, A, T, G, A, C, G, A, T, A, ...] | 0 |
| 4 | [G, A, A, T, T, A, C, G, G, G, G, T, T, A, T, ...] | 0 |


In [9]:
# create the data splits
df_list = df.random_split(data_splits_values, random_state=random_seed)
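
To build intuition for what a random split does, here is a conceptual sketch in plain Python. It is not Dask's implementation: it simply assigns each row to one split at random, using the fractions as probabilities. `random_split_rows` is a hypothetical function written for this illustration, and the rows are stand-in integers rather than k-mer sequences. With this kind of per-row assignment the split sizes are only approximately 90/5/5, while fixing the seed makes the result reproducible.

```python
import random


def random_split_rows(rows, fractions, seed):
    """Assign each row to exactly one split, chosen at random with the given weights."""
    rng = random.Random(seed)  # a fixed seed gives the same assignment on every run
    parts = [[] for _ in fractions]
    split_ids = range(len(fractions))
    for row in rows:
        i = rng.choices(split_ids, weights=fractions, k=1)[0]
        parts[i].append(row)
    return parts


rows = list(range(10_000))  # stand-in for the rows of the dataframe
train, val, test = random_split_rows(rows, [0.9, 0.05, 0.05], seed=42)

# the sizes are close to, but usually not exactly, 9000 / 500 / 500
print(len(train), len(val), len(test))
```

Every row lands in exactly one split, so the three lists together contain all of the original rows with no overlap.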

## Step 4: save the data

In [10]:
%%time
# the final step is to save each split.
# this will take some time
# we write one parquet dataset per split, named <data_dir>_<split name>
for a_split, split_df in zip(data_splits, df_list):
    out_filepath = f"{data_dir}_{a_split}"
    _ = split_df.to_parquet(out_filepath)
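
To see where the output lands, the short sketch below rebuilds the output paths using the same naming pattern as the loop above. The `data_dir` and `data_splits` values are copied from the config printout earlier in this walkthrough; `out_paths` is a name introduced only for this illustration.

```python
# Values copied from the config printout earlier in this walkthrough.
data_dir = "/media/jcosme/Data/full_mer_1/data/full_mer_1"
data_splits = {"train": 0.9, "val": 0.05, "test": 0.05}

# same naming pattern as the save loop: <data_dir>_<split name>
out_paths = {a_split: f"{data_dir}_{a_split}" for a_split in data_splits}

for a_split, path in out_paths.items():
    print(f"{a_split}: {path}")
```

Each path is a sibling of `data_dir`, so the original unsplit data is left untouched.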

CPU times: user 1.27 s, sys: 411 ms, total: 1.68 s
Wall time: 43.5 s


## Step 5: cleanup

In [11]:
# we delete the dataframe
del df

# then we shut down the Dask scheduler and workers
client.shutdown()

# finally we close the client connection
client.close()

## finished!