### COMP0082 Coursework V2.02
##### Predicting the subcellular location of proteins

**Goal** : develop a simple method for classifying eukaryotic protein sequences into the following 5 categories:
* Cytosolic - i.e. within the cell itself, but not inside any organelles

* Extracellular/Secreted - proteins which are transported out of a cell

* Nuclear - proteins found/used within the cell's nucleus

* Mitochondrial - proteins transported to the cell's mitochondria

* Other - none of the above

**Requirements** : 
* Use any appropriate ML method. You are free to use off-the-shelf machine learning methods or libraries, as long as you give enough information for someone to replicate your approach
* Methods should be self-contained 
* No requests to external web services
* **extra credit** will be given for making the results from your method **explainable**, e.g. using an ablation test, or integrated gradients for methods without explicit features
* Predictions must be returned along with some measure of confidence, or converted into an ad hoc confidence level, e.g. HIGH, MEDIUM, LOW
* You must explain the basis of your confidence measure
* Cross-validation studies should be carried out on the datasets. 
* Details on the selection of test and training sets must be given in the write-up
* Appropriate measures of success must be given e.g. ACC, F1, MCC etc. 
* Results for the "blind" protein set must be included in the report.
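
The requirement to report an ad hoc confidence level can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `confidence_label`, the 0.8 and 0.5 cut-offs, and the example probabilities are all assumptions made for the sketch. In practice the cut-offs would need to be justified, for example by checking accuracy within each band during cross-validation.

```python
def confidence_label(probabilities, high=0.8, medium=0.5):
    """Map class probabilities to (predicted_class, top_probability, label).

    `probabilities` is a dict of class name -> predicted probability.
    The `high` and `medium` cut-offs are illustrative assumptions and
    should be calibrated on held-out data.
    """
    # The predicted class is the one with the largest probability
    predicted = max(probabilities, key=probabilities.get)
    top = probabilities[predicted]

    # Bucket the top probability into a coarse confidence band
    if top >= high:
        label = "HIGH"
    elif top >= medium:
        label = "MEDIUM"
    else:
        label = "LOW"
    return predicted, top, label


# Hypothetical classifier output for one protein (made-up numbers)
example = {
    "Cytosolic": 0.05,
    "Secreted": 0.85,
    "Nuclear": 0.04,
    "Mitochondrial": 0.03,
    "Other": 0.03,
}
print(confidence_label(example))
```

Any classifier that outputs per-class probabilities could feed this function. The basis of the confidence is then the top class probability, which is only meaningful if those probabilities are reasonably well calibrated.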


In [None]:
# Python + BioPython + Scikit-learn installation
# todo

In [5]:
!pip install biopython






In [6]:
# Load the necessary libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
from Bio import SeqIO

In [9]:
# Load the data
from Bio import SeqIO

def load_fasta_file(fasta_file_path):
    """Parse a FASTA file and print the ID, description and sequence of each record."""
    # SeqIO.parse returns an iterator over the records in the file
    records = SeqIO.parse(fasta_file_path, "fasta")

    # Loop through the records and print them
    for record in records:
        print(f"ID: {record.id}")
        print(f"Description: {record.description}")
        print(f"Sequence: {record.seq}")

In [10]:
load_fasta_file("data/cyto.fasta")

ID: sp|P0C0T2|ANKS6_RAT
Description: sp|P0C0T2|ANKS6_RAT Ankyrin repeat and SAM domain-containing protein 6 OS=Rattus norvegicus GN=Anks6 PE=1 SV=2
Sequence: MGEGALAPGLQLLLRACEQGDTDTARRLLEPGGEPVAGSEAGAEPAGPEAARAVEAGTPVPVDCSDEAGNSALQLAAAGGHEPLVRFLLRRGASVNSRNHYGWSALMQAARCGHASVAHLLLDHGADVNAQNRLGASVLTVASRGGHLGVVKLLLEAGATVDHRNPSGESTASGGSRDELLGITALMAAVQHGHEAVVRLLMEWGADPNHTARTVGWSPLMLAALLGKLSVVQQLVEKGANPDHLGVLEKTAFEVALDRKHRDLADYLDPLTTVRPKTDEEKRRPDIFHALKMGNFQLVKEIADEDPNHVNLVNGDGATPLMLAAVTGQLPLVQLLVEKHADMNKQDSVHGWTALMQATYHGNKEIVKYLLNQGADVTLRAKNGYTAFDLVMLLNDPDTELVRLLASVCMQVNKDRGGRPSHRPPLPHSKARQPWSIPMLPDDKGGLKSWWSRMSNRFRKLKLMQTLPRGLAANQPLPFSDEPELALDSTMRAPPQDRTNHLGPPEAAHAAKDSGPGNPRREKDDVLLTTMLRNGAPFPRLPSDKLKAVIPPFLPPSSFELWSSDRSRTCPNGKADPMKTVLPPRASRAHPVGCVGTDGAAGRPVKFPSISRSPTSPASSGNFNHSPHSSGGASGVGSMSRLGGELHNRSGGSVDSVLSQIAAQRKKAAGLCEQKPPCQQSSPVGPATGSSPPELPASLLGSGSGSSNVTSSSKKLDPGKRPPSGTSATSKSTSPTLTPSPSPKGHTAESSVSSSSSHRQSKSSGGSSSGTITDEDELTGILKKLSLEKYQPIFEEQEVDMEAFLTLTDGDLQELGIKTDGSRQQILAAISELNAGKGRERQ
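
The record ID printed above follows the UniProt layout `db|accession|entry_name`, where the entry name ends in a species mnemonic (here `RAT`). The sketch below splits an ID into those fields. The helper name `parse_uniprot_id` and the returned dictionary keys are hypothetical choices for illustration, and the function assumes every ID has exactly three `|`-separated parts.

```python
def parse_uniprot_id(record_id):
    """Split a UniProt-style ID 'db|accession|entry_name' into its fields."""
    parts = record_id.split("|")
    if len(parts) != 3:
        # Fail loudly rather than guess at an unfamiliar header layout
        raise ValueError(f"Unexpected ID format: {record_id!r}")
    database, accession, entry_name = parts

    # The text after the last underscore of the entry name is the species mnemonic
    species_code = entry_name.rsplit("_", 1)[-1]
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "species_code": species_code,
    }


info = parse_uniprot_id("sp|P0C0T2|ANKS6_RAT")
print(info)
```

Keeping the accession alongside each sequence may help when documenting which proteins went into the training and test sets.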

In [11]:
load_fasta_file("data/secreted.fasta")

ID: sp|P0CJ36|EUMER_EUMRB
Description: sp|P0CJ36|EUMER_EUMRB Eumenitin-R OS=Eumenes rubrofemoratus PE=1 SV=1
Sequence: LNLKGLIKKVASLLN
ID: sp|Q7Z9L3|EXGA_ASPOR
Description: sp|Q7Z9L3|EXGA_ASPOR Glucan 1,3-beta-glucosidase A OS=Aspergillus oryzae (strain ATCC 42149 / RIB 40) GN=exgA PE=1 SV=1
Sequence: MLPLLLCIVPYCWSSRLDPRASSFDYNGEKVRGVNLGGWLVLEPWITPSIFDAAGAEAVDEWSLTKILGKEEAEARLSAHWKSFVSAGDFQRMADAGLNHVRIPIGYWALGPLEGDPYVDGQLEYLDKAVEWAGAAGLKVLIDLHGAPGSQNGFDNSGRRGAIQWQQGDTVEQTLDAFDLLAERYLGSDTVAAIEAINEPNIPGGVDQGKLQEYYGSVYGIVNKYNAGTSVVYGDGFLPVESWNGFKTEGSKVVMDTHHYHMFDNGLIAMDIDSHIDAVCQFAHQHLEASDKPVIVGEWTGAVTDCAKYLNGKGNGARYDGSYAADKAIGDCSSLATGFVSKLSDEERSDMRRFIEAQLDAFELKSGWVFWTWKTEGAPGWDMSDLLEAGVFPTSPDDREFPKQC
ID: sp|Q93X94|EXL6_ARATH
Description: sp|Q93X94|EXL6_ARATH GDSL esterase/lipase EXL6 OS=Arabidopsis thaliana GN=EXL6 PE=1 SV=1
Sequence: MFRGKIFVLSLFSIYVLSSAAEKNTSFSALFAFGDSVLDTGNNNFLLTLLKGNYWPYGLSFDYKFPTGRFGNGRVFTDIVAEGLQIKRLVPAYSKIRRISSEDLKTGVCFASGGSGIDDLTSRTLRVLSAGDQVKDFKDYLKKLRRVVKRKKKV
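
The sequences above vary widely in length, so a classifier usually needs a fixed-length representation of each one. One simple option, shown here only as an assumed starting point and not as the method used in this notebook, is the amino acid composition: the fraction of each of the 20 standard residues. The names `AMINO_ACIDS` and `amino_acid_composition` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard symbols (e.g. X, B, Z, U) are ignored, so the fractions
    are taken over the standard residues only.
    """
    counts = Counter(str(sequence).upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty sequence, or no standard residues: return an all-zero vector
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Eumenitin-R, the 15-residue secreted peptide printed above
features = amino_acid_composition("LNLKGLIKKVASLLN")
print(dict(zip(AMINO_ACIDS, features)))
```

The resulting 20-element vector has an explicit meaning per feature, which may make later explainability work easier than with learned embeddings.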