# Convert raw .fasta into parquet files with Dask on GPUs

**THIS ONLY NEEDS TO BE RUN ONCE PER .FASTA FILE**  

This is an explanation of the **clean_raw_fasta.py** file.  

We take a raw .fasta file and convert it into a table with two columns:
+ 'seq': a string of the raw gene sequence
+ 'label': a string of the raw label for a gene sequence

Each row represents one observation.  

Then we will save the output as parquet files.
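
To make the target layout concrete, here is a minimal pure-Python sketch of the same conversion on a small in-memory string. It is not how **clean_raw_fasta.py** works (the script uses dask_cudf on the GPU, as the steps below show); the `parse_fasta` helper and the shortened sequences are hypothetical and exist only for illustration.

```python
def parse_fasta(text, sep=">"):
    """Parse FASTA text into a list of {'seq': ..., 'label': ...} rows.

    Hypothetical helper for illustration; lines starting with `sep` are
    treated as labels, every other non-empty line as sequence data.
    """
    rows = []
    label = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(sep):
            # A new label starts, so flush the previous record first.
            if label is not None:
                rows.append({"seq": "".join(chunks), "label": label})
            label = line[len(sep):]
            chunks = []
        else:
            chunks.append(line)
    # Flush the last record.
    if label is not None:
        rows.append({"seq": "".join(chunks), "label": label})
    return rows


# Shortened, made-up sequences; the label format follows the sample output.
example = """>label|286|MMP00000031-10000/1
TTCCACAAAG
>label|286|MMP00000031-9998/1
TAAATTAAGA
"""

rows = parse_fasta(example)
for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the final table: one sequence paired with its label.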

## Step 0: import required libraries

In [1]:
import sys
sys.path.append('..') # add the parent folder to the import path so that `scripts` can be imported

from scripts.global_funcs import load_raw_data_config
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

## Step 1: start the cluster

In [7]:
cluster = LocalCUDACluster()
client = Client(cluster)
client

2022-05-23 06:00:49,231 - distributed.diskutils - INFO - Found stale lock file and directory '/home/jcosme/projs/COSME/notebook_walkthroughs/dask-worker-space/worker-un369n6d', purging
2022-05-23 06:00:49,232 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize


Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status

Dashboard: http://127.0.0.1:8787/status
Workers: 1
Total threads: 1
Total memory: 31.21 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:41693
Workers: 1
Dashboard: http://127.0.0.1:8787/status
Total threads: 1
Started: Just now
Total memory: 31.21 GiB

Comm: tcp://127.0.0.1:34975
Total threads: 1
Dashboard: http://127.0.0.1:45677/status
Memory: 31.21 GiB
Nanny: tcp://127.0.0.1:44455
Local directory: /home/jcosme/projs/COSME/notebook_walkthroughs/dask-worker-space/worker-7u975p5w
GPU: NVIDIA GeForce RTX 3080 Laptop GPU
GPU memory: 16.00 GiB


### optional: 
Click the dashboard link above to open the Dask Dashboard, which lets you monitor the progress of your job.  
**Note:** this only works in a Jupyter notebook.

## Step 2: load config file data

In [2]:
configs = load_raw_data_config()

In [3]:
# these are the variables we will be using
for key, val in configs.items():
    print(f"{key}: {val}")

raw_fasta_file: /media/jcosme/Data/MarRef.training.fasta
clean_fasta_file: /media/jcosme/Data/MarRef_parquet
base_col_names: ['seq', 'label']
fasta_sep: >


In [6]:
# let's put these into Python variables
raw_fasta_file = configs['raw_fasta_file']
clean_fasta_file = configs['clean_fasta_file']
base_col_names =  configs['base_col_names']
fasta_sep = configs['fasta_sep']
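
Because the lookups above raise a `KeyError` on the first missing entry, it can help to check the config up front. The following is a minimal sketch, assuming the config behaves like a plain dictionary (as the printed output suggests); `missing_config_keys` and `EXPECTED_KEYS` are hypothetical names, not part of `scripts.global_funcs`.

```python
# Keys this walkthrough reads from the config (taken from the printed output).
EXPECTED_KEYS = ("raw_fasta_file", "clean_fasta_file", "base_col_names", "fasta_sep")


def missing_config_keys(configs, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from `configs` (hypothetical helper)."""
    return [key for key in expected if key not in configs]


# Stand-in for the dictionary returned by load_raw_data_config().
example_configs = {
    "raw_fasta_file": "/media/jcosme/Data/MarRef.training.fasta",
    "clean_fasta_file": "/media/jcosme/Data/MarRef_parquet",
    "base_col_names": ["seq", "label"],
    "fasta_sep": ">",
}

missing = missing_config_keys(example_configs)
if missing:
    raise KeyError(f"config is missing keys: {missing}")
```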

## Step 3: data transformations

In [8]:
# first we create the Dask dataframe
df = dask_cudf.read_csv(raw_fasta_file, # location of raw file
                        sep=fasta_sep, # this is the '>' sign
                        names=base_col_names, # column names
                        dtype=str, # data type
                       )


In [9]:
# here is a raw data sample
df.head()

Unnamed: 0,seq,label
0,,label|286|MMP00000031-10000/1
1,TTCCACAAAGTTACACGGGAAAAGAGCCTGCAACAATGCGTGGAGT...,
2,,label|286|MMP00000031-9998/1
3,TAAATTAAGAATTGAAATGATTGAAAATGCTGGAAAATTAAAAATT...,
4,,label|286|MMP00000031-9996/1


In [10]:
# each label sits one row above its sequence, so we shift the labels
# down by one row to line them up with their sequences
df['label'] = df['label'].shift()

In [11]:
# here is a sample after the transformation
df.head()

Unnamed: 0,seq,label
0,,
1,TTCCACAAAGTTACACGGGAAAAGAGCCTGCAACAATGCGTGGAGT...,label|286|MMP00000031-10000/1
2,,
3,TAAATTAAGAATTGAAATGATTGAAAATGCTGGAAAATTAAAAATT...,label|286|MMP00000031-9998/1
4,,


In [12]:
# finally, we drop every row that contains a null value, and reset the index
df = df.dropna().reset_index(drop=True)

In [13]:
# here is a sample of the clean data
df.head()

Unnamed: 0,seq,label
0,TTCCACAAAGTTACACGGGAAAAGAGCCTGCAACAATGCGTGGAGT...,label|286|MMP00000031-10000/1
1,TAAATTAAGAATTGAAATGATTGAAAATGCTGGAAAATTAAAAATT...,label|286|MMP00000031-9998/1
2,ATATTTTTATTTTTTTGAAAAAAGGTTTAGTTAATTATAAAGTTTA...,label|286|MMP00000031-9996/1
3,TTATGGATGACGATATCAGACTTCTTAGAACGATCGGATCACTTCA...,label|286|MMP00000031-9994/1
4,GAATTACGGGGTTATTTAAATAAATTGCAAGAAGTTCCCATGCTAA...,label|286|MMP00000031-9992/1
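
The shift-then-drop logic above can be checked on a toy example without a GPU. The sketch below is plain Python, not dask_cudf; `shift_labels`, `drop_null_rows`, and the short sequences and labels are hypothetical stand-ins, and `None` plays the role of a null value.

```python
# Toy rows mimicking what the '>' separator produces: label lines have a
# null 'seq', sequence lines have a null 'label'.
raw = [
    (None, "label_a"),
    ("TTCC", None),
    (None, "label_b"),
    ("TAAA", None),
]


def shift_labels(rows):
    """Move every label down by one row, similar to df['label'].shift()."""
    seqs = [seq for seq, _ in rows]
    labels = [None] + [label for _, label in rows[:-1]]
    return list(zip(seqs, labels))


def drop_null_rows(rows):
    """Keep only rows without any null field, similar to df.dropna()."""
    return [row for row in rows if all(field is not None for field in row)]


shifted = shift_labels(raw)
clean = drop_null_rows(shifted)
print(clean)
```

After the shift, the rows that originally held only a label are entirely null, so dropping null rows leaves exactly one row per sequence. Note that this relies on every sequence occupying a single line directly below its label, as in the sample shown above.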


## Step 4: save the data

In [14]:
%%time
# the final step is to save the cleaned data. 
# this will take some time
_ = df.to_parquet(clean_fasta_file)

CPU times: user 118 ms, sys: 37.4 ms, total: 156 ms
Wall time: 4.62 s


## Step 5: cleanup

In [15]:
# we delete the dataframe
del df

# then we shut down the Dask cluster (scheduler and workers)
client.shutdown()

# finally we close the client connection
client.close()

## finished!