# The task:

* there are two files:
  * `PC_adi_lorfs.pep`
  * `GCF_000222465.1_Adig_1.1_cds_from_genomic.fna`
  
  
  
* the `*.pep` file contains records like this, for example:

```
>XM_015918879.1|m.19 XM_015918879.1|g.19 type:complete len:101 gc:universal XM_015918879.1:552-854(+)
MGQDLLPQTLIGRHSLCGLQREIHKKVFPWQFGSKHGSISLALATHQSTLHSCSAMDLVQCHCYACGVLWMYHRQVQDASWFGRFSPRQRCSYSFFLCQL*
```

* the `*.fna` file contains records like this, for example:

```
>lcl|NW_015441057.1_cds_XP_015754013.1_1 [gene=LOC107326863] [db_xref=GeneID:107326863] [protein=protein still life, isoforms C/SIF type 2-like] [protein_id=XP_015754013.1] [location=join(431..518,1484..1598,1698..1750,2850..2959,3043..3081)] [gbkey=CDS]
ATGGTGGGGGATGAAATCATTGCAGTTAATGACATTGATGTAACAGAAGTAGAGAATGCTGTGGAAGATCTGAGGGAAGC
TTTGAAAGACCCGTCAATCACCTTGTTACTGCGATCATGTCGAGATGAGCCTCCAGTTGTCAGCAAAGTGACATCCGATG
CCATTATTTCCAGCCTTGTCTTAAAACCGCCGCCGAAAATAAGTGATAATATTGCTGAGCAAGATATCAGTGATCTTATA
GTTCCTCCTCCTCAAGCTTATGCAGAAGACACTGAAGACAGCTGCAGCATCTCGCCAACAAGCCAACACGAGGAATCTTA
TAGTTCTCCGACAAACTCCACACAGAATGTTGACGAACTCCTTCAGGAGAAACCAAAAATATTTCCAACGTTTCTGTTTG
CATGA
```

* the idea is to extract the identifier `XM_015918879.1` from the header line of a record in the first file
* then compare it with the `protein_id=` field in the header lines of the second file
* if they are the same, copy the `protein=` field from the second file into the header line in the first file
* so that finally the header line in the first file looks, for example, like the following:

```
>XM_015918879.1|m.19 protein still life, isoforms C/SIF type 2-like
MGQDLLPQTLIGRHSLCGLQREIHKKVFPWQFGSKHGSISLALATHQSTLHSCSAMDLVQCHCYACGVLWMYHRQVQDASWFGRFSPRQRCSYSFFLCQL*
```
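
The steps above can be sketched as follows. This is only a minimal sketch, not a definitive solution: the function names `parse_fna_annotations` and `annotate_pep_header` are hypothetical, the short inline records are made-up stand-ins for the real files, and the code assumes that the identifier in the `.pep` header equals the `protein_id=` value exactly. In the examples above one identifier starts with `XM_` and the other with `XP_`, so on the real data an extra mapping step may be needed.

```python
import re

def parse_fna_annotations(fna_lines):
    """Build a dict protein_id -> protein description from FASTA header lines."""
    annotations = {}
    for line in fna_lines:
        if not line.startswith('>'):
            continue  # skip nucleotide sequence lines
        id_match = re.search(r'\[protein_id=([^\]]+)\]', line)
        name_match = re.search(r'\[protein=([^\]]+)\]', line)
        if id_match and name_match:
            annotations[id_match.group(1)] = name_match.group(1)
    return annotations

def annotate_pep_header(header, annotations):
    """Replace the tail of a .pep header with the protein name, if one is known."""
    match = re.match(r'>([^|]+)\|(m\.\d+)', header)
    if match is None:
        return header  # not in the expected format, leave untouched
    seq_id, model = match.groups()
    protein = annotations.get(seq_id)
    if protein is None:
        return header  # no matching protein_id, leave untouched
    return f'>{seq_id}|{model} {protein}'

# Made-up toy records (the identifier is chosen so that both files match).
fna_lines = [
    '>lcl|NW_015441057.1_cds_XP_015754013.1_1 [gene=LOC107326863] '
    '[protein=protein still life, isoforms C/SIF type 2-like] '
    '[protein_id=XP_015754013.1] [gbkey=CDS]',
    'ATGGTGGGGGATGAAATC',
]
pep_lines = [
    '>XP_015754013.1|m.19 XP_015754013.1|g.19 type:complete len:101',
    'MGQDLLPQTL*',
    '>unknown_transcript_1|m.1 unknown_transcript_1|g.1 type:complete len:399',
    'MKSLYVVRWG*',
]

annotations = parse_fna_annotations(fna_lines)
annotated = [
    annotate_pep_header(line, annotations) if line.startswith('>') else line
    for line in pep_lines
]
for line in annotated:
    print(line)
```

Building the dictionary once and looking identifiers up in it avoids rescanning the whole `.fna` file for every `.pep` header.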

# Some Python tips that may help you with the task:

#### after downloading the source files from here (https://mega.nz/folder/fBtmzTTD#UP8gpeKfeCsb8EB2CJgBjg)
#### make sure that the two source files are in the same directory as your script (or change the path)

In [1]:
!ls

PC_adi_lorfs.pep  data_import_processing.ipynb	output.txt


#### open the file, read it

In [2]:
with open('PC_adi_lorfs.pep','r') as f:
    contents=f.read()

#### split the file into new lines and each line will be an element of a list

In [3]:
contents_lines=contents.split('\n')

#### we can see how many elements (lines) the list has

In [4]:
len(contents_lines)

95613

#### we can access a single line

In [5]:
contents_lines[0]

'>unknown_transcript_1|m.1 unknown_transcript_1|g.1 type:complete len:399 gc:universal unknown_transcript_1:13813-15009(+)'

In [6]:
contents_lines[-2]

'EEKTERQERIKQKTAELQELILQQIAFKNLVQRNKQVEKEQGSPAPNTAIHLPFIIVNTSKKTVIDCSISNDKCEYLFNFDNTFEIHDDIEVLKRMGMAFGLEKGQCAEANLKASRTMVPKALEPYVVDMAKSGPVGLGGNLGNHTTPK*'
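
A note on why the last sequence sits at index `-2` rather than `-1`: presumably the file ends with a newline character, in which case `split('\n')` leaves an empty string as the final element. The small sketch below (with a made-up two-record string) compares `split('\n')` with `str.splitlines()`, which does not produce that trailing empty element.

```python
# A made-up text that ends with a newline, like many text files do.
text = '>id1\nMKS*\n>id2\nMWA*\n'

by_split = text.split('\n')        # keeps an empty string after the final newline
by_splitlines = text.splitlines()  # no trailing empty element

print(by_split)
print(by_splitlines)
```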

#### or access many (10) lines using a loop

In [7]:
for line in contents_lines[0:10]:
    print(line)

>unknown_transcript_1|m.1 unknown_transcript_1|g.1 type:complete len:399 gc:universal unknown_transcript_1:13813-15009(+)
MKSLYVVRWGFSTNHKDIGTLYLVFGIGAGMIGTAFSMLIRLELSAPGAMLGDDHLYNVIVTAHAFIMIFFLVMPVMIGGFGNWLVPLYIGAPDMAFPRLNNISFWLLPPALILLLGSAFVEQGVGTGWTVYPPLSSIQAHSGGAVDMAIFSLHLAGVSSILGAMNFITTILNMRAPGMTLNKMPLFVWSILITAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILFQHLFWFFGHPEVYILILPGFGMISQIIPTFVAKKQIFGYLGMVYAMLSIGILGFIVWAHHMFTVGMDVDTRAYFTAATMIIAVPTGIKVFSWLATIFGGTLRLDTPMLWAMGFVFLFTLGGLTGVVLANSSLDVVLHDTYYVVAHFHYVLSMGAVFAIFGGFYY*
>unknown_transcript_1|m.2 unknown_transcript_1|g.2 type:complete len:233 gc:universal unknown_transcript_1:6886-7584(+)
MSGAYFDQFKIVALIALTNSSMMMILVVVVVLLLFKGVQLIPKRWQSLIELIYEHFHGVVKDNLGSEGLRYFPLIVSLFFFIVFLNVLGLFPYVFTPTVHIVVTLGLSFSIIIGVTLAGFWRFKGDFFSVFMPSGAPLGLAPLLVLIETVSFISRAISLGVRLAANLSAGHLLFAILAGFGFNMLVASGPVGVFPLLIMVFITLLEVAVAVIQAYVFCLLATIYLADTIVLH*
>unknown_transcript_1|m.3 unknown_transcript_1|g.3 type:complete len:212 gc:universal unknown_transcript_1:5539-6174(+)
MWAP

#### the same as above: access the first 10 lines using a loop, exiting with a counter and a condition

In [8]:
counter=0
for line in contents_lines:
    print(line)
    counter=counter+1
    if counter>=10:
        break

>unknown_transcript_1|m.1 unknown_transcript_1|g.1 type:complete len:399 gc:universal unknown_transcript_1:13813-15009(+)
MKSLYVVRWGFSTNHKDIGTLYLVFGIGAGMIGTAFSMLIRLELSAPGAMLGDDHLYNVIVTAHAFIMIFFLVMPVMIGGFGNWLVPLYIGAPDMAFPRLNNISFWLLPPALILLLGSAFVEQGVGTGWTVYPPLSSIQAHSGGAVDMAIFSLHLAGVSSILGAMNFITTILNMRAPGMTLNKMPLFVWSILITAFLLLLSLPVLAGAITMLLTDRNFNTTFFDPAGGGDPILFQHLFWFFGHPEVYILILPGFGMISQIIPTFVAKKQIFGYLGMVYAMLSIGILGFIVWAHHMFTVGMDVDTRAYFTAATMIIAVPTGIKVFSWLATIFGGTLRLDTPMLWAMGFVFLFTLGGLTGVVLANSSLDVVLHDTYYVVAHFHYVLSMGAVFAIFGGFYY*
>unknown_transcript_1|m.2 unknown_transcript_1|g.2 type:complete len:233 gc:universal unknown_transcript_1:6886-7584(+)
MSGAYFDQFKIVALIALTNSSMMMILVVVVVLLLFKGVQLIPKRWQSLIELIYEHFHGVVKDNLGSEGLRYFPLIVSLFFFIVFLNVLGLFPYVFTPTVHIVVTLGLSFSIIIGVTLAGFWRFKGDFFSVFMPSGAPLGLAPLLVLIETVSFISRAISLGVRLAANLSAGHLLFAILAGFGFNMLVASGPVGVFPLLIMVFITLLEVAVAVIQAYVFCLLATIYLADTIVLH*
>unknown_transcript_1|m.3 unknown_transcript_1|g.3 type:complete len:212 gc:universal unknown_transcript_1:5539-6174(+)
MWAP
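
As an alternative to a hand-written counter, the same early exit can be sketched with `enumerate` or with `itertools.islice` from the standard library. The list `lines` below is made-up data standing in for `contents_lines`.

```python
from itertools import islice

# Made-up stand-in for contents_lines.
lines = [f'line {i}' for i in range(100)]

# Option 1: enumerate supplies the counter for us.
collected = []
for index, line in enumerate(lines):
    if index >= 10:
        break  # stop after ten lines
    collected.append(line)

# Option 2: islice yields only the first ten items.
first_ten = list(islice(lines, 10))

print(collected == first_ten)
```

`islice` also works on an open file object, so the first lines can be read without loading the whole file into memory.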

#### we can access every second line using the slicing trick `[::2]`

In [9]:
for line in contents_lines[10:20][::2]:
    print(line)

>unknown_transcript_1|m.6 unknown_transcript_1|g.6 type:complete len:138 gc:universal unknown_transcript_1:11332-11745(+)
>unknown_transcript_1|m.7 unknown_transcript_1|g.7 type:complete len:137 gc:universal unknown_transcript_1:6249-6659(+)
>unknown_transcript_1|m.8 unknown_transcript_1|g.8 type:complete len:130 gc:universal unknown_transcript_1:3427-3816(+)
>MSTRG.1.3|m.9 MSTRG.1.3|g.9 type:complete len:138 gc:universal MSTRG.1.3:3434-3847(+)
>MSTRG.1.3|m.10 MSTRG.1.3|g.10 type:complete len:109 gc:universal MSTRG.1.3:1727-2053(+)
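
The same slicing idea can separate headers from sequences and pair them up again. This sketch assumes that every record occupies exactly two lines (one header, one sequence line), as in the `.pep` excerpt shown above; the records themselves are made up.

```python
# Made-up records: header lines at even indices, sequences at odd indices.
lines = ['>a|m.1 a|g.1', 'MKS*', '>b|m.2 b|g.2', 'MWA*']

headers = lines[0::2]    # start at index 0, take every second line
sequences = lines[1::2]  # start at index 1, take every second line

# Pair each header with the sequence that follows it.
records = list(zip(headers, sequences))
for header, sequence in records:
    print(header, '->', sequence)
```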


#### using a regex we can extract a fraction of the text from a single line

In [10]:
regex=r'>(.*)\|m'

In [11]:
import re
results=re.findall(regex,contents_lines[18])
print(results)

['MSTRG.1.3']
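
`re.findall` returns a list, even for a single match. When exactly one value per line is expected, `re.search` with a capture group returns the string itself. The pattern below is a slightly stricter variant written for illustration (it is an assumption about the header format, not the pattern used above): it anchors at the start of the line and also captures the number after `m.`.

```python
import re

header = ('>MSTRG.1.3|m.9 MSTRG.1.3|g.9 type:complete len:138 '
          'gc:universal MSTRG.1.3:3434-3847(+)')

# ^>        the header must start with '>'
# ([^|]+)   group 1: everything up to the first '|'
# \|m\.     a literal '|m.'
# (\d+)     group 2: the model number
pattern = re.compile(r'^>([^|]+)\|m\.(\d+)')

match = pattern.search(header)
if match:
    transcript_id = match.group(1)
    model_number = int(match.group(2))
    print(transcript_id, model_number)
```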


#### using a regex we can extract a fraction of the text from many lines with a `for` loop

In [12]:
for line in contents_lines[10:20][::2]:
    results=re.findall(regex,line)
    print(results)

['unknown_transcript_1']
['unknown_transcript_1']
['unknown_transcript_1']
['MSTRG.1.3']
['MSTRG.1.3']


#### we can concatenate text using the `+` operator

In [13]:
for line in contents_lines[10:20][::2]:
    results=re.findall(regex,line)
    otra_cadena='mi proteina es: '
    print(otra_cadena+str(results))

mi proteina es: ['unknown_transcript_1']
mi proteina es: ['unknown_transcript_1']
mi proteina es: ['unknown_transcript_1']
mi proteina es: ['MSTRG.1.3']
mi proteina es: ['MSTRG.1.3']
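
The brackets and quotes in the output above appear because `str(results)` converts the whole list to text. A small sketch of one way to get only the identifier: take the first element of the list and format it with an f-string. The header line is a shortened copy of one shown earlier.

```python
import re

regex = r'>(.*)\|m'
line = '>MSTRG.1.3|m.10 MSTRG.1.3|g.10 type:complete len:109'
label = 'mi proteina es: '

results = re.findall(regex, line)

as_list = label + str(results)  # the list is converted, brackets included
# Use the first match only, and guard against lines with no match.
as_text = f'{label}{results[0]}' if results else label

print(as_list)
print(as_text)
```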


#### we can save the new lines in another list, `result_list`, by first creating an empty list and then adding each line to it using `append`

In [14]:
result_list=[]
for line in contents_lines[10:20][::2]:
    results=re.findall(regex,line)
    otra_cadena='mi proteina es: '
    final_line=otra_cadena+str(results)
    result_list.append(final_line)

In [15]:
result_list

["mi proteina es: ['unknown_transcript_1']",
 "mi proteina es: ['unknown_transcript_1']",
 "mi proteina es: ['unknown_transcript_1']",
 "mi proteina es: ['MSTRG.1.3']",
 "mi proteina es: ['MSTRG.1.3']"]

#### we can now save each line to a new output file (we could also have done this directly in the previous step)

In [16]:
with open('output.txt','w') as f:
    for line in result_list:
        f.write(line+'\n')
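
Writing directly in the processing loop, without the intermediate list, could look like the sketch below. It uses made-up input lines and writes to a temporary directory (via `tempfile`) so that it does not overwrite the `output.txt` created above.

```python
import os
import re
import tempfile

regex = r'>(.*)\|m'
# Made-up input lines standing in for contents_lines.
lines = [
    '>unknown_transcript_1|m.6 unknown_transcript_1|g.6 type:complete',
    'MKS*',
    '>MSTRG.1.3|m.9 MSTRG.1.3|g.9 type:complete',
    'MWA*',
]

out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(out_path, 'w') as f:
    for line in lines:
        if not line.startswith('>'):
            continue  # only header lines carry an identifier
        results = re.findall(regex, line)
        f.write('mi proteina es: ' + str(results) + '\n')

# Read the file back to confirm what was written.
with open(out_path, 'r') as f:
    written = f.read().splitlines()
print(written)
```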

In [17]:
!ls

PC_adi_lorfs.pep  data_import_processing.ipynb	output.txt


#### to check that our results were saved, we can load the file and print some lines

In [18]:
with open('output.txt','r') as f:
    contents2=f.read()

In [19]:
contents_lines2=contents2.split('\n')
print(contents_lines2[3])

mi proteina es: ['MSTRG.1.3']
