# Extract species labels from raw labels

This is an explanation of the **extract_labels.py** file.

Using the regex defined in the config file, we:
1. extract the proper label for each row
2. create parquet and csv files of the unique labels
3. encode the labels (convert each string to an integer) in the main data

Finally, we save the encoded data as parquet files.

## Step 0: import required libraries

In [1]:
import sys
sys.path.append('..') # allow this notebook to import modules from the parent folder

from scripts.global_funcs import load_data_config
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

## Step 1: start the cluster

In [2]:
cluster = LocalCUDACluster()
client = Client(cluster)
client

2022-05-23 07:13:58,889 - distributed.diskutils - INFO - Found stale lock file and directory '/home/jcosme/projs/COSME/notebook_walkthroughs/dask-worker-space/worker-hjp4x1q1', purging
2022-05-23 07:13:58,890 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize


Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status
Workers: 1
Total threads: 1
Total memory: 31.21 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:36043
Started: Just now

Comm: tcp://127.0.0.1:35389
Dashboard: http://127.0.0.1:37871/status
Nanny: tcp://127.0.0.1:36387
Local directory: /home/jcosme/projs/COSME/notebook_walkthroughs/dask-worker-space/worker-9n4j4zc9
GPU: NVIDIA GeForce RTX 3080 Laptop GPU
GPU memory: 16.00 GiB


### optional: 
Click the dashboard link above to open the Dask Dashboard, where you can monitor the progress of your job.  
**Note:** this only works in a Jupyter notebook.

## Step 2: load config file data

In [3]:
configs = load_data_config()

In [4]:
# these are the variables we will be using
for key, val in configs.items():
    print(f"{key}: {val}")

clean_fasta_file: /media/jcosme/Data/MarRef_parquet
output_dir: /media/jcosme/Data
project_name: full_mer_1
base_col_names: ['seq', 'label']
label_col_name: label
input_col_name: seq
label_regex: (?:[^a-zA-Z0-9]+)([a-zA-Z]+[0-9]+)(?:[^a-zA-Z0-9]+)
k_mer: 1
possible_gene_values: ['A', 'C', 'G', 'T']
data_splits: {'train': 0.9, 'val': 0.05, 'test': 0.05}
random_seed: 42
fasta_sep: >
unq_labs_dir: /media/jcosme/Data/full_mer_1/data/unq_labels
unq_labs_dir_csv: /media/jcosme/Data/full_mer_1/data/unq_labels.csv
data_dir: /media/jcosme/Data/full_mer_1/data/full_mer_1


In [5]:
# let's put the config values we need into Python variables
clean_fasta_filepath = configs['clean_fasta_file']
output_dir = configs['output_dir']
project_name = configs['project_name']
unq_labs_dir = configs['unq_labs_dir']
unq_labs_dir_csv = configs['unq_labs_dir_csv']
data_dir = configs['data_dir']
label_col_name = configs['label_col_name']
label_regex = configs['label_regex']

## Step 3: define label extraction function

In [6]:
# this function is applied to each partition of the data;
# it replaces the raw label with the first capture group of the regex
def extract_labels(df):
    df[label_col_name] = df[label_col_name].str.extract(label_regex).loc[:, 0]
    return df
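
To see what the regex captures, here is a minimal sketch that applies the same `label_regex` pattern to two of the raw labels shown in the data sample, using Python's standard `re` module instead of cuDF. The helper name `extract_label` is hypothetical and exists only for this illustration; the cuDF regex engine may differ from `re` in edge cases.

```python
import re

# Pattern copied from the `label_regex` entry printed in Step 2.
LABEL_REGEX = r"(?:[^a-zA-Z0-9]+)([a-zA-Z]+[0-9]+)(?:[^a-zA-Z0-9]+)"


def extract_label(raw):
    """Return the first letters-then-digits token that has separator
    characters on both sides, or None if there is no such token.
    (Hypothetical helper, for illustration only.)"""
    match = re.search(LABEL_REGEX, raw)
    return match.group(1) if match else None


raw_labels = [
    "label|286|MMP00000031-10000/1",
    "label|286|MMP00000031-9998/1",
]

extracted = [extract_label(raw) for raw in raw_labels]
print(extracted)  # ['MMP00000031', 'MMP00000031']
```

The pattern skips `286` because the capture group requires at least one letter before the digits, and it skips the leading `label` because no separator precedes it.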

## Step 4: data transformations

In [7]:
# first we read the parquet file and repartition it (partition_size="100M")
df = dask_cudf.read_parquet(clean_fasta_filepath).repartition(partition_size="100M")



In [8]:
# here is a data sample
df.head()

Unnamed: 0,seq,label
0,TTCCACAAAGTTACACGGGAAAAGAGCCTGCAACAATGCGTGGAGT...,label|286|MMP00000031-10000/1
1,TAAATTAAGAATTGAAATGATTGAAAATGCTGGAAAATTAAAAATT...,label|286|MMP00000031-9998/1
2,ATATTTTTATTTTTTTGAAAAAAGGTTTAGTTAATTATAAAGTTTA...,label|286|MMP00000031-9996/1
3,TTATGGATGACGATATCAGACTTCTTAGAACGATCGGATCACTTCA...,label|286|MMP00000031-9994/1
4,GAATTACGGGGTTATTTAAATAAATTGCAAGAAGTTCCCATGCTAA...,label|286|MMP00000031-9992/1


In [9]:
# next, we apply the function defined above to the data
df = df.map_partitions(extract_labels)

In [10]:
# here is a data sample
df.head()

Unnamed: 0,seq,label
0,TTCCACAAAGTTACACGGGAAAAGAGCCTGCAACAATGCGTGGAGT...,MMP00000031
1,TAAATTAAGAATTGAAATGATTGAAAATGCTGGAAAATTAAAAATT...,MMP00000031
2,ATATTTTTATTTTTTTGAAAAAAGGTTTAGTTAATTATAAAGTTTA...,MMP00000031
3,TTATGGATGACGATATCAGACTTCTTAGAACGATCGGATCACTTCA...,MMP00000031
4,GAATTACGGGGTTATTTAAATAAATTGCAAGAAGTTCCCATGCTAA...,MMP00000031


In [11]:
# now we extract the unique labels
unq_labs_df = df.sort_values(label_col_name)[label_col_name].unique().to_frame()

In [12]:
# here is a sample of the unique labels
unq_labs_df.head()

Unnamed: 0,label
0,MMP00000031
1,MMP00000346
2,MMP00001868
3,MMP00002580
4,MMP00002596


In [13]:
%%time
# this might take some time
# we save the unique labels as a parquet file...
_ = unq_labs_df.to_parquet(unq_labs_dir)
# ...and as a .csv file.
_ = unq_labs_df.to_csv(unq_labs_dir_csv, index=False, single_file=True)

CPU times: user 1.35 s, sys: 292 ms, total: 1.64 s
Wall time: 12.4 s


In [14]:
# next, we encode the labels
df = df.categorize(columns=[label_col_name])
df[label_col_name] = df[label_col_name].cat.codes
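
The idea behind categorical encoding can be sketched in plain Python: each distinct label is assigned an integer code, and the list of categories lets you map codes back to strings. This sketch assumes the categories are ordered alphabetically; whether `categorize` assigns codes in that same order (and therefore whether codes line up with row positions in the saved unique-labels files) is an assumption that should be checked against the actual output.

```python
# Example labels, taken from the unique-label sample shown earlier.
labels = ["MMP00000346", "MMP00000031", "MMP00000346", "MMP00001868"]

# Assumption: categories are the sorted distinct labels.
categories = sorted(set(labels))

# Encode: map each label string to its position in `categories`.
label_to_code = {label: code for code, label in enumerate(categories)}
codes = [label_to_code[label] for label in labels]

# Decode: map each integer code back to its label string.
decoded = [categories[code] for code in codes]

print(codes)              # [1, 0, 1, 2]
print(decoded == labels)  # True
```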

In [15]:
# here is a sample of encoded data
df.head()

Unnamed: 0,seq,label
0,TTCCACAAAGTTACACGGGAAAAGAGCCTGCAACAATGCGTGGAGT...,0
1,TAAATTAAGAATTGAAATGATTGAAAATGCTGGAAAATTAAAAATT...,0
2,ATATTTTTATTTTTTTGAAAAAAGGTTTAGTTAATTATAAAGTTTA...,0
3,TTATGGATGACGATATCAGACTTCTTAGAACGATCGGATCACTTCA...,0
4,GAATTACGGGGTTATTTAAATAAATTGCAAGAAGTTCCCATGCTAA...,0


## Step 5: save the data

In [16]:
%%time
# the final step is to save the cleaned data. 
# this might take some time
_ = df.to_parquet(data_dir)

CPU times: user 206 ms, sys: 69.1 ms, total: 275 ms
Wall time: 7.19 s


## Step 6: cleanup

In [17]:
# we delete the dataframes
del df, unq_labs_df

# then we shut down the Dask scheduler and workers
client.shutdown()

# finally we close the client connection
client.close()

## finished!