### NumPy

In this section you will use the NumPy library to implement one hot encoding.

One hot encoding converts categorical variables into a numerical form that ML algorithms can work with, which often helps them predict better.
It splits a column of categorical data into as many columns as there are categories in that column. Each new column contains “1” if the sample belongs to that column's category and “0” otherwise.

For instance, suppose we have 3 categories: "dog", "cat" and "parrot". Then an encoding like this:

`Lassie = [1,0,0], Fenix = [0,0,1] and Wick = [0,1,0]`

tells us that Lassie is a dog, Fenix is a parrot and Wick is a cat.
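
The pet example can be reproduced in NumPy with a short sketch. The category order (dog, cat, parrot) and the variable names are assumptions made for this illustration only.

```python
import numpy as np

# Category order is an assumption chosen to match the example above
categories = ["dog", "cat", "parrot"]
category_index = {name: i for i, name in enumerate(categories)}

pets = {"Lassie": "dog", "Fenix": "parrot", "Wick": "cat"}

# Row i of the identity matrix is the one hot vector for category i
identity = np.identity(len(categories), dtype=int)
encoded = {pet: identity[category_index[kind]] for pet, kind in pets.items()}

for pet, vector in encoded.items():
    print(pet, vector.tolist())
```

Each vector has exactly one "1", placed in the column of the pet's category.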

Your task is to preprocess a multiple sequence alignment (**MSA**). MSAs are usually built for biological sequences such as protein amino acid sequences, DNA or RNA. In many cases, the input sequences are assumed to be evolutionarily related, that is, descended from a common ancestor. Extracting patterns from an MSA can help in protein engineering, for example when synthesizing more stable proteins.

For you now, **the most important takeaways are**:
* An MSA is a set of sequences of the same length.
* Each character in a sequence encodes an amino acid or is the special gap character '-'. Biologically, a gap marks a position where other sequences in the alignment carry an amino acid inserted during evolution that this sequence lacks (such insertions may enhance protein properties). The character **X** marks an unexplored residue caused by a measurement error.
* Each column of the MSA relates the corresponding positions of the aligned proteins.
* More conserved positions tend to mark regions that are stable and important for protein function.
* **The query sequence** is our selected protein of interest.
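
To make these points concrete, here is a minimal sketch on a hypothetical three-sequence alignment (`toy_msa` is invented for illustration and is not part of the task data). It checks that all sequences share one length and computes, per column, the fraction of gaps and a crude conservation score. Defining conservation as the share of the most common character is an assumption of this sketch, not a standard the section prescribes.

```python
import numpy as np
from collections import Counter

# Hypothetical toy alignment, used only for illustration
toy_msa = ["AG-W", "AGCW", "AG-F"]

# All sequences in an MSA share one length
assert len({len(seq) for seq in toy_msa}) == 1

# View the alignment as a 2D character array:
# rows are sequences, columns are aligned positions
chars = np.array([list(seq) for seq in toy_msa])

# Fraction of gap characters in each column
gap_fraction = (chars == "-").mean(axis=0)

# Share of the most common character per column (gaps are counted
# like any other character), a crude conservation score
conservation = [Counter(col).most_common(1)[0][1] / len(col) for col in chars.T]

print(gap_fraction)
print(conservation)
```

The first two columns are fully conserved, while the third column is mostly gaps.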

Our small MSA looks like:

In [None]:
import numpy as np

msa = {"query"    : "AGCWW-N-IIPM",
       "protein2" : "AG-WWCN-IIPM",
       "protein3" : "AG-WWCN-IIP-",
       "protein4" : "AG-WWCN-IIP-",
       "protein5" : "AG---C--I-P-",
       "protein6" : "AG-WXC-PIIPM",
       "protein7" : "AGCW-C-PXIPM"}

One commonly used pipeline for MSA preprocessing, which is also **your task**, is as follows:
1. Get the gap positions of the query.
2. Convert the amino acid strings to a number representation using the method `remove_unexplored_and_convert(msa_dict)`, which you will implement. At the same time, exclude sequences containing unexplored residues.
3. Remove every column of the MSA where the query sequence has a gap, except those where more than 80% of the other sequences have an amino acid.
4. Remove sequences in which more than 50% of the positions are gaps.
5. Convert the sequences into one hot encoding.
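
Steps 1, 3 and 4 can all be expressed with boolean masks. The sketch below shows one possible approach on a hypothetical number-encoded array (`toy` is invented for illustration, with 0 standing for a gap and row 0 playing the role of the query). It is a hint at the technique, not the reference solution.

```python
import numpy as np

# Hypothetical number-encoded alignment: 0 is a gap, row 0 is the query
toy = np.array([[3, 0, 5, 0],
                [3, 7, 5, 0],
                [3, 7, 0, 0],
                [0, 7, 0, 0]])

# Step 1: gap positions of the query
query_gaps = toy[0] == 0

# Step 3: share of non-gap entries among the other sequences, per column
others_filled = (toy[1:] != 0).mean(axis=0)
# Keep a column if the query has no gap there, or if more than 80%
# of the other sequences have an amino acid
keep_cols = ~query_gaps | (others_filled > 0.8)
filtered = toy[:, keep_cols]

# Step 4: share of gaps per sequence; drop rows with more than 50% gaps
gap_share = (filtered == 0).mean(axis=1)
keep_rows = gap_share <= 0.5
keep_rows[0] = True  # the query always stays in the alignment
result = filtered[keep_rows]

print(result)
```

Here the last column is dropped because the query has a gap and no other sequence fills it, and the last row is dropped because two of its three remaining positions are gaps.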

In [None]:
query_name = "query"
K = 0

def amino_acid_dict():
  # Map each amino acid type to a number 1-20; gap characters ('-' and '.') map to 0
  aa = ['R', 'H', 'K',
        'D', 'E',
        'S', 'T', 'N', 'Q',
        'C', 'G', 'P',
        'A', 'V', 'I', 'L', 'M', 'F', 'Y', 'W']
  aa_index = {}
  aa_index['-'] = 0
  aa_index['.'] = 0
  i = 1
  for a in aa:
    aa_index[a] = i
    i += 1
  
  global K
  K = len(aa)

  return aa, aa_index


def remove_unexplored_and_convert(msa_dict):
  """ Return the sequences encoded as numbers in a NumPy array, together with the list of sequence names """
  pass

def columns_with_gaps_remove(msa):
  """ 
  Remove columns with a gap in the query where 80% or less of the other sequences have an amino acid.
  Return the updated dictionary with sequences in number representation.

  You can call the remove_unexplored_and_convert(msa) function from here.
  """ 
  # np_msa, keys_list = remove_unexplored_and_convert(msa)
  pass

def remove_sequences_with_gaps(msa):
  """ 
  Remove from the MSA all sequences in which more than 50% of the positions are gaps.
  Make sure the query is kept in the final alignment!
  Returns a rank-2 NumPy array with the sequences.
  """

def one_hot_encoding(msa):
  """ 
  Convert sequences to one hot encoding, where each position in each sequence
  is replaced by a new 21-element vector that encodes its category.

  Returns one hot encoded alignment.
  """
  # K holds the number of amino acid kinds and is set by amino_acid_dict().
  # Add one for the gap character without mutating the global, so repeated calls stay correct.
  D = np.identity(K + 1)  # Row i of D is the one hot vector for number code i


In [None]:
###############################################################
# FINAL PIPELINE LOOKS LIKE 
msa_dict_num_repr = columns_with_gaps_remove(msa)
print(msa_dict_num_repr)
print()
msa_array_num_repr = remove_sequences_with_gaps(msa_dict_num_repr)
print(msa_array_num_repr)
print()
one_hot_msa = one_hot_encoding(msa_array_num_repr)

print("Shape of the one hot encoding is", one_hot_msa.shape) # Should be (4, 11, 21)
print(one_hot_msa)
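
The identity matrix created in `one_hot_encoding` lends itself to a compact trick: indexing it with an integer array replaces every number code with the matching row. The following is a minimal sketch on a hypothetical two-sequence array (`codes` and `num_categories` are names invented here), using the codes that `amino_acid_dict` assigns: A = 13, G = 11, C = 10 and gap = 0.

```python
import numpy as np

num_categories = 21  # 20 amino acids plus the gap code 0

# Hypothetical encoded sequences "AG-" and "AGC"
codes = np.array([[13, 11, 0],
                  [13, 11, 10]])

identity = np.identity(num_categories)

# Integer array indexing picks row codes[i, j] for every position,
# which appends a new axis of length num_categories
one_hot = identity[codes]

print(one_hot.shape)
```

The result has one axis for sequences, one for positions and one for the 21 categories, which is the same layout as the `(4, 11, 21)` shape expected from the pipeline.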