# AMP Parser

## Fasta

FASTA (`.fasta`) is one of the most common file formats for storing biological sequences together with their IDs. An excerpt from `train.fasta` in the AMP dataset looks like this:

```
...
>AP00714
GYGCPFNQYQCHSHCSGIRGYKGGYCKGTFKQTCKCY
>AP00269
LSCKRGTCHFGRCPSHLIKGSCSGG
...
```

 - `AP00714` is the ID of the record (the header line starts with `>`)
 - `GYGCPFNQYQCHSHCSGIRGYKGGYCKGTFKQTCKCY` is its amino acid sequence
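
As a rough illustration of this layout, the sketch below reads FASTA text into `(id, sequence)` pairs. The helper name `read_fasta` is hypothetical, and the snippet parses an in-memory string instead of `train.fasta`. It also assumes a sequence may span several lines, which FASTA allows even though the excerpt shows one line per sequence.

```python
def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, ''.join(chunks)
            header = line[1:].split()
            record_id = header[0] if header else ''
            chunks = []
        elif record_id is not None:
            chunks.append(line)
    # Emit the last record once the input is exhausted.
    if record_id is not None:
        yield record_id, ''.join(chunks)


example = """>AP00714
GYGCPFNQYQCHSHCSGIRGYKGGYCKGTFKQTCKCY
>AP00269
LSCKRGTCHFGRCPSHLIKGSCSGG
"""
records = list(read_fasta(example.splitlines()))
print(records[0])
```

For a real file, the same generator could be fed an open file handle, since iterating over a text file yields its lines.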

## One Hot Encoding

One-hot encoding converts a categorical variable into a set of binary columns, one per category, so that ML algorithms that expect numeric input can use it for prediction.

For example, we need to encode the table below:

No.    | Weather
-------| -------
1.     | Sunny
2.     | Cloudy
3.     | Sunny
4.     | Rainy

into the table below:

No.    |  Sunny |  Cloudy | Rainy |
-------|--------|---------|-------|
1.     |    1   |    0    |   0   |
2.     |    0   |    1    |   0   |
3.     |    1   |    0    |   0   |
4.     |    0   |    0    |   1   |

We are not going to reimplement one-hot encoding ourselves, since several existing tools already do it.<br/>
In today's practice, we use the tools provided by sklearn.
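
As a small sketch of the weather table above, the snippet below feeds the four rows to sklearn's `OneHotEncoder`. The column order `Sunny, Cloudy, Rainy` is passed explicitly through `categories` so the output matches the table; without it, the encoder would sort the categories alphabetically. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

weather = ['Sunny', 'Cloudy', 'Sunny', 'Rainy']

# Fix the column order to match the table: Sunny, Cloudy, Rainy.
categories = [['Sunny', 'Cloudy', 'Rainy']]
encoder = OneHotEncoder(categories=categories)

# The encoder expects a 2-D input of shape (n_samples, n_features).
column = np.array(weather).reshape(-1, 1)

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
onehot = encoder.fit_transform(column).toarray()
print(onehot)
```

Each row of `onehot` corresponds to one row of the table, with a single `1` in the column of its weather value.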

## Sklearn OneHotEncoder

[Official Manual](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

The quick demo below shows how to use this tool:

In [1]:
## Label Encoding
import numpy as np
from sklearn.preprocessing import LabelEncoder

label = 'ABCDEFG'

# use label encoder
label_encoder = LabelEncoder()
label_encoder.fit(list(label))

sequence = 'CFEDA'
label_encoded_seq = [label_encoder.transform(list(sequence))]
print(label_encoded_seq)

# use index
encoded_seq = [np.array([label.index(i) for i in sequence])]
print(encoded_seq)


[array([2, 5, 4, 3, 0])]
[array([2, 5, 4, 3, 0])]


In [2]:
## One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder

# build a list of lists of categories to declare the number of classes
# in this case, classes = 7 since label = 'ABCDEFG'
# categories = [[0, 1, 2, 3, 4, 5, 6]]
categories = [list(range(len(label)))]

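# `sparse_output` is the parameter name in scikit-learn >= 1.2 (older versions call it `sparse`)
onehot_encoder = OneHotEncoder(sparse_output=False, categories=categories)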
encoded_seq = np.array(encoded_seq).reshape(-1, 1)
results = np.array([onehot_encoder.fit_transform(encoded_seq)])

print('type of results: {}'.format(type(results)))
print('result = \n{}'.format(results))

print('\ntype of result[0]: {}'.format(type(results[0])))
print('results[0] = \n{}'.format(results[0]))


type of results: <class 'numpy.ndarray'>
result = 
[[[0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0.]
  [1. 0. 0. 0. 0. 0. 0.]]]

type of result[0]: <class 'numpy.ndarray'>
results[0] = 
[[0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]]
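
One quick way to sanity-check an encoding like this is to decode it again. The sketch below rebuilds the same one-hot matrix with NumPy indexing (a swapped-in shortcut, not the `OneHotEncoder` call from the demo) and recovers the original string with `argmax`.

```python
import numpy as np

label = 'ABCDEFG'
sequence = 'CFEDA'

# Row i of the identity matrix is the one-hot vector for class i.
indices = np.array([label.index(ch) for ch in sequence])
onehot = np.eye(len(label))[indices]  # shape (5, 7)

# argmax along each row gives back the class index, hence the letter.
decoded = ''.join(label[i] for i in onehot.argmax(axis=1))
print(decoded)
```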


## Amino Acid Types

[source](https://www.compoundchem.com/2014/09/16/aminoacids/)
![amino_acid](https://i0.wp.com/www.compoundchem.com/wp-content/uploads/2014/09/20-Common-Amino-Acids-v3.png?w=1338)
There are 20 common amino acids in proteins. Their one-letter codes, in the label order we use, are `ACDEFGHIKLMNPQRSTVWY`. <br>
The label encoding must follow this order; otherwise your output will not match the answer.
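
To make the ordering concrete, here is a small sketch that maps each residue to its column index under `ACDEFGHIKLMNPQRSTVWY`. The names `AMINO_ACIDS` and `AA_TO_INDEX` are made up for illustration. Because this string is already in alphabetical order, a `LabelEncoder` fitted on it assigns the same indices.

```python
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'

# Position in the label string = class index = one-hot column.
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

peptide = 'LSCKRGTCHFGRCPSHLIKGSCSGG'  # AP00269 from the FASTA excerpt
indices = [AA_TO_INDEX[aa] for aa in peptide]
print(indices[:5])  # → [9, 15, 1, 8, 14]
```

A residue outside the 20-letter alphabet would raise a `KeyError` here, which can be a useful signal that the input needs cleaning.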

## Practice

After the demo and introduction above, we are now going to parse our dataset. <br>
The goals are:
1. Parse data from `train.fasta`
2. Encode the data with OneHotEncoder from Sklearn
3. Save the data into a numpy file
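
For step 3, note that the peptides differ in length, so the encoded dataset is a collection of matrices with different numbers of rows. One possible way to store that (an assumption, since the notebook does not say how the answer file was written) is a 1-D object array saved with `np.save`, which then has to be loaded with `allow_pickle=True`. The file name `encoded_demo.npy` and the zero-filled matrices are placeholders.

```python
import os
import tempfile

import numpy as np

# Two placeholder "encoded peptides" with different lengths and 20 columns each.
matrices = [np.zeros((37, 20)), np.zeros((25, 20))]

# Fill a pre-allocated object array so each element stays a separate matrix.
data = np.empty(len(matrices), dtype=object)
for i, matrix in enumerate(matrices):
    data[i] = matrix

path = os.path.join(tempfile.mkdtemp(), 'encoded_demo.npy')
np.save(path, data)  # object arrays are pickled inside the .npy file

loaded = np.load(path, allow_pickle=True)
print(loaded.shape, loaded[0].shape, loaded[1].shape)
```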

We have already prepared an answer file in NumPy format; your task is to figure out how to produce the same result and compare it with the TA's answer. <br>
1. You can use `prepared.show()` to get the correct answer
2. You can check your result by calling `prepared.diff(result)`<br>
   It prints `Correct!` if your result matches our answer; otherwise it prints `Something Wrong!`

In [3]:
import sys
sys.path.append('.prepared')
import prepared as prepared

answer = prepared.show()
print(answer.shape)
print(answer[0])

(1424,)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 

In [4]:
prepared.diff(answer)

Correct!


## Start Here

In [None]:
# TODO: make your own parser

# prepared.diff(result)  # uncomment this line once you have finished building `result`