## Feature Engineering in Genomics

Feature engineering is the process of transforming raw data into features (input variables) that algorithms can readily use. It is often assumed that data scientists spend most of their time testing various algorithms; in practice, the majority of performance gains generally come from well-crafted features.

While performing feature engineering, it is critical to keep in mind the question that you are trying to answer. For the purposes of this exercise, we will be using ...genomic data, with the aim of answering the following questions:


In this notebook, we will introduce the following types of feature engineering:
- Feature pruning
- Time-based features (month, year, etc.)
- One-hot encoding to create dummy variables
- Extracting features from strings
- Feature scaling
- Data imputation / cleaning
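
As an illustration of extracting features from strings, here is a minimal sketch that derives two simple numeric features from a DNA string: GC content and k-mer counts. The helper names `gc_content` and `kmer_counts` and the example sequence are hypothetical and are not part of this notebook's code; the sketch assumes the sequence contains only the letters A, C, G, and T.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C (hypothetical helper)."""
    seq = seq.lower()
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    return (seq.count("g") + seq.count("c")) / len(seq)


def kmer_counts(seq, k=2):
    """Count overlapping substrings of length k (hypothetical helper)."""
    seq = seq.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


example = "ACGTGC"  # made-up sequence, for illustration only
print(gc_content(example))
print(kmer_counts(example, k=2))
```

Both helpers turn a variable-length string into numbers that a model can consume. The k-mer counts could be laid out as fixed columns, one per possible k-mer, if a tabular input is needed.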

The rest of this notebook shows how to transform genomics data to fit into machine learning models.

In [1]:
import re
import numpy as np
import pydna

### Converting DNA Sequence String into NumPy Array

In [2]:
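def dna_sequence_np_array(dna_sequence_string):
    """Convert a DNA string into a NumPy array of single-character bases.

    Returns None if the string contains anything other than a, c, g, t.
    Requires `re` and NumPy (as `np`) to be imported.
    """
    dna_sequence_array = None
    try:
        dna_sequence_string = dna_sequence_string.lower()
        # Match any character that is not one of the four bases.
        regex_acgt = re.compile('[^acgt]')
        if regex_acgt.search(dna_sequence_string) is None:
            dna_sequence_array = np.array(list(dna_sequence_string))
    except Exception:
        # PyDNA is not defined in this notebook; it is assumed to be a project
        # helper exposing get_exception_info() and write_log_file().
        print(PyDNA.get_exception_info())
        if PyDNA._app_is_log:
            PyDNA.write_log_file("error", PyDNA.get_exception_info())
    return dna_sequence_array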
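
An array of single-character bases, such as the one `dna_sequence_np_array` returns, can then be one-hot encoded for use in a model. The following is a minimal sketch, not this notebook's own method: the helper name `one_hot_encode`, the column order `a, c, g, t`, and the example sequence are assumptions made for illustration.

```python
import numpy as np

# Column order of the encoding; this ordering is an assumption of the sketch.
BASES = np.array(list("acgt"))


def one_hot_encode(dna_sequence_array):
    """Return an (n, 4) matrix of 0/1 values, one row per base (hypothetical helper)."""
    # Broadcasting compares every base against each of the four letters.
    return (dna_sequence_array[:, None] == BASES).astype(np.int8)


sequence_array = np.array(list("gatc"))  # made-up, already lowercased sequence
encoded = one_hot_encode(sequence_array)
print(encoded)
```

Each row contains exactly one 1, in the column of the matching base. Comparing against a fixed alphabet keeps the column order stable across sequences, which matters when the encoded matrices are stacked into one training set.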