# Make k-mers

This is an explanation of the **make_kmers.py** file.

Using variables from the config file, we
1. split each raw gene sequence into k-mers

note:
+ 1-mer is the fastest; we just split each letter individually
+ k-mers with k of 2 or more are slower because they require a sliding window. The smaller the k, the longer it takes (e.g. 2-mer takes longer than 10-mer), because the window has more positions to visit.
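
The sliding window mentioned in the note can be sketched in plain Python. This is only a CPU illustration on ordinary strings, not the GPU code used by make_kmers.py; the helper name `sliding_kmers` and the sample sequence are made up for the example. A sequence of length L has L - k + 1 windows, which is why a smaller k means more window positions.

```python
def sliding_kmers(seq, k):
    """Return every overlapping substring of length k, left to right."""
    # A sequence of length L has L - k + 1 windows (none if L < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "TTCCACAAAG"  # hypothetical 10-letter sample sequence

print(sliding_kmers(seq, 1))  # for k = 1 this is the same as list(seq)
print(sliding_kmers(seq, 3))  # overlapping 3-mers
# number of windows for k = 2 versus k = 10
print(len(sliding_kmers(seq, 2)), len(sliding_kmers(seq, 10)))  # 9 1
```

For k = 1 the window degenerates to one letter per position, which is why the notebook can take a shortcut for that case.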

Then we will save the output as parquet files.

## Step 0: import required libraries

In [1]:
import sys
sys.path.append('..') # this is to allow the script to read from the parent folder

import numpy as np  # np.arange is used by get_kmers in Step 3
from scripts.global_funcs import load_data_config
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

## Step 1: start the cluster

In [2]:
cluster = LocalCUDACluster()
client = Client(cluster)
client

2022-05-23 07:14:56,096 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize


Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status

Dashboard: http://127.0.0.1:8787/status
Workers: 1
Total threads: 1
Total memory: 31.21 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:36791
Workers: 1
Dashboard: http://127.0.0.1:8787/status
Total threads: 1
Started: Just now
Total memory: 31.21 GiB

Comm: tcp://127.0.0.1:46323
Total threads: 1
Dashboard: http://127.0.0.1:36317/status
Memory: 31.21 GiB
Nanny: tcp://127.0.0.1:37879
Local directory: /home/jcosme/projs/COSME/notebook_walkthroughs/dask-worker-space/worker-zh97v447
GPU: NVIDIA GeForce RTX 3080 Laptop GPU
GPU memory: 16.00 GiB


### optional: 
Click the link above to open the Dask Dashboard, where you can watch the progress of your job.  
**note:** the link only works when this walkthrough is run as a Jupyter notebook

## Step 2: load config file data

In [3]:
configs = load_data_config()

In [4]:
# these are the variables we will be using
for key, val in configs.items():
    print(f"{key}: {val}")

clean_fasta_file: /media/jcosme/Data/MarRef_parquet
output_dir: /media/jcosme/Data
project_name: full_mer_1
base_col_names: ['seq', 'label']
label_col_name: label
input_col_name: seq
label_regex: (?:[^a-zA-Z0-9]+)([a-zA-Z]+[0-9]+)(?:[^a-zA-Z0-9]+)
k_mer: 1
possible_gene_values: ['A', 'C', 'G', 'T']
data_splits: {'train': 0.9, 'val': 0.05, 'test': 0.05}
random_seed: 42
fasta_sep: >
unq_labs_dir: /media/jcosme/Data/full_mer_1/data/unq_labels
unq_labs_dir_csv: /media/jcosme/Data/full_mer_1/data/unq_labels.csv
data_dir: /media/jcosme/Data/full_mer_1/data/full_mer_1


In [5]:
# let's put the config values we need into Python variables
input_col_name = configs['input_col_name']
data_dir = configs['data_dir']
k_mer = configs['k_mer']
possible_gene_values = configs['possible_gene_values']
possible_gene_values = sorted(possible_gene_values)

## Step 3: define functions

In [6]:
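# for the 1-mer case: each letter is replaced by the same letter followed by a space
replace_gene_values = []
for gene_val in possible_gene_values:
    replace_gene_values.append(gene_val + ' ')

def add_whitespace(df):
    # insert a space after every letter so the sequence can later be split on whitespace
    df[input_col_name] = df[input_col_name].str.replace(possible_gene_values, replace_gene_values, regex=False)
    return df

def get_kmers(df):
    # create the accumulator column, then blank it out; it will hold the space-separated k-mers
    df['temp'] = df[input_col_name].copy()
    df['temp'] = ' ' 
    # slide a window of width k_mer over the longest sequence in this partition;
    # a sequence of length L has L - k_mer + 1 windows, hence the + 1
    for i in np.arange(0, df[input_col_name].str.len().max() - k_mer + 1):
        temp_df = df[input_col_name].str[i: i+k_mer].fillna(' ')
        # sequences that end before this window is full contribute nothing
        change_mask = temp_df.str.len() < k_mer
        temp_df[change_mask] = ' ' 
        df['temp'] = df['temp'] + ' ' + temp_df  
    # collapse runs of spaces into single spaces
    df['temp'] = df['temp'].str.normalize_spaces()
    df[input_col_name] = df['temp']
    df = df.drop(columns=['temp'])
    return df

def split_whitespace(df):
    # turn the space-separated string into a list of tokens
    df[input_col_name] = df[input_col_name].str.split()
    return df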
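
To see what these functions produce, here is a rough CPU-only sketch that mimics them on a plain Python list instead of a cuDF column. The helper names (`add_whitespace_py`, `get_kmers_py`), the value of `K_MER`, and the sample sequences are hypothetical; only the overall approach follows the cell above: insert spaces for 1-mers, otherwise slice every row at the same offset and drop slices shorter than k. The loop in this sketch visits offsets 0 through max_len - k inclusive, so the last window is included.

```python
K_MER = 3  # hypothetical value; the notebook reads k_mer from the config


def add_whitespace_py(seqs, letters=("A", "C", "G", "T")):
    """1-mer path: put a space after each known letter."""
    out = []
    for s in seqs:
        for letter in letters:
            s = s.replace(letter, letter + " ")
        out.append(s)
    return out


def get_kmers_py(seqs, k):
    """k > 1 path: build one space-separated string of k-mers per row."""
    max_len = max(len(s) for s in seqs)
    rows = ["" for _ in seqs]
    for i in range(max_len - k + 1):  # one pass per window offset
        for r, s in enumerate(seqs):
            piece = s[i:i + k]
            if len(piece) == k:  # rows that are too short contribute nothing
                rows[r] += " " + piece
    # collapse repeated spaces, like normalize_spaces in the cell above
    return [" ".join(row.split()) for row in rows]


seqs = ["ACGTA", "GGC"]  # two rows of different length

# splitting on whitespace gives the final list-of-tokens form
print([s.split() for s in add_whitespace_py(seqs)])
print([s.split() for s in get_kmers_py(seqs, K_MER)])
```

The shorter row simply stops contributing once the window runs past its end, which is the role of `change_mask` in `get_kmers`.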

## Step 4: data transformations

In [7]:
# first we read the parquet
df = dask_cudf.read_parquet(data_dir)

In [8]:
# here is a data sample
df.head()

| | seq | label |
|---|---|---|
| 0 | TTCCACAAAGTTACACGGGAAAAGAGCCTGCAACAATGCGTGGAGT... | 0 |
| 1 | TAAATTAAGAATTGAAATGATTGAAAATGCTGGAAAATTAAAAATT... | 0 |
| 2 | ATATTTTTATTTTTTTGAAAAAAGGTTTAGTTAATTATAAAGTTTA... | 0 |
| 3 | TTATGGATGACGATATCAGACTTCTTAGAACGATCGGATCACTTCA... | 0 |
| 4 | GAATTACGGGGTTATTTAAATAAATTGCAAGAAGTTCCCATGCTAA... | 0 |


In [9]:
# next, we apply the functions defined above to each partition of the data
if k_mer == 1:
    df = df.map_partitions(add_whitespace)
    df = df.map_partitions(split_whitespace)
elif (k_mer > 1):
    df = df.map_partitions(get_kmers)
    df = df.map_partitions(split_whitespace)



In [10]:
# here is a data sample
df.head()



| | seq | label |
|---|---|---|
| 0 | [T, T, C, C, A, C, A, A, A, G, T, T, A, C, A, ... | 0 |
| 1 | [T, A, A, A, T, T, A, A, G, A, A, T, T, G, A, ... | 0 |
| 2 | [A, T, A, T, T, T, T, T, A, T, T, T, T, T, T, ... | 0 |
| 3 | [T, T, A, T, G, G, A, T, G, A, C, G, A, T, A, ... | 0 |
| 4 | [G, A, A, T, T, A, C, G, G, G, G, T, T, A, T, ... | 0 |


## Step 5: save the data

In [12]:
%%time
# the final step is to save the k-mer data as parquet files.
# note: this writes back to data_dir, the same location the input was read from.
# this might take some time
_ = df.to_parquet(data_dir)

CPU times: user 191 ms, sys: 34.8 ms, total: 226 ms
Wall time: 10.3 s


## Step 6: cleanup

In [13]:
# we delete the dataframe
del df

# then we shut down the Dask scheduler and workers
client.shutdown()

# finally we close the client connection
client.close()

## finished!