# Part 2: Generating the Sequence Profile

Why are we using sequence profiles (SPs)? → In general, an MSA captures the notion of a protein family. All the sequences in the family, possibly even including distantly related homologs, are aligned → we need to make sure that the MSA we are using covers the protein space well. All these sequences are aligned using a substitution matrix. The profile contains much more information than a single sequence or an alignment. Even functional regions show a certain level of sequence conservation → captured in the MSA. This allows us to go beyond single-residue propensities. When we include this evolutionary information we can improve predictions.
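
To make the idea of a profile concrete, here is a minimal sketch that turns a toy MSA into a frequency matrix with one row per alignment column. The sequences in `toy_msa` are made up for illustration, and the sketch uses plain counts only: it has no sequence weighting and no pseudocounts, so it is not what PSI-BLAST computes internally.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # same column order as the PSI-BLAST matrix


def profile_from_msa(msa):
    """Return one row of 20 relative frequencies per alignment column."""
    profile = []
    for column in zip(*msa):  # iterate over alignment columns
        # count only the 20 standard residues, gaps are skipped
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        row = [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        profile.append(row)
    return profile


toy_msa = ["KTCE", "RTCE", "KSCD", "K-CE"]  # hypothetical aligned sequences
profile = profile_from_msa(toy_msa)
print(profile[0][AMINO_ACIDS.index("K")])  # frequency of K in the first column
```

Each row sums to 1 (or to 0 for an all-gap column), which is the same 0-1 range used for the normalized PSI-BLAST profile later on.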

Remember that the sequence is less conserved than the structure, and the structure is less conserved than the function.


Exercise → psiblast pipeline 
1. Obtain a target sequence in fasta format:				
>d3psma_
KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC

2. Obtain a large sequence database in fasta format:
Often used: UniProt Reference Clusters → UniRef100/90/50
Very large non-redundant protein sequence DB
Number indicates the level of redundancy → UniRef90 → no pair of sequences has more than 90% sequence identity.
We will use UniProtKB/SwissProt → copy the file uniprot_sprot.fasta from Dropbox into the ~ dir

3. Format the database using makeblastdb to make it usable by psiblast:
Before running PSI-BLAST we build an index of the sequence DB → necessary to use the psiblast algorithm. The cmd is ``` makeblastdb -in uniprot_sprot.fasta -dbtype prot ```
After execution, the files of the DB index will appear in the same directory as the FASTA DB file.

4. Run psiblast to search for homologous sequences in the database → specify parameters:
- Input query sequence and DB file
- Max number of psiblast iterations (usually 3 or 4)
- **E-value** threshold used to filter out less significant results
- **Number of descriptions** and **alignments** shown in output
- Names of the output files for alignments and PSSM/Profile matrices
- Cmds and options on slide 18 and in cell below

There will be 2 different outfiles:
    1. Pairwise alignment file between target seq and each similar seq found
    2. PSSM/Profile checkpoint file containing the **last matrix computed by the algo**
- PSSM portion = equivalent to the SP but storing log-odds between profile and background distribution
- Profile portion → the actual frequency matrix

Both the alignment and the checkpoint file can be used to obtain a seq profile for the target sequence:
- From the pairwise alignments we can obtain the MSA and then compute the profile matrix
- From the checkpoint file we can directly extract the matrix computed internally by PSI-BLAST → after the last iteration
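
The relation between the two portions of the checkpoint file can be illustrated with a small log-odds calculation. This is only a sketch of the principle: the uniform background of 0.05 and the `floor` value are assumptions chosen for the example. PSI-BLAST additionally uses pseudocounts, sequence weights and its own scaling, so these numbers will not reproduce the PSSM columns of a real file.

```python
import math


def log_odds(frequency, background, floor=1e-4):
    """log2 ratio between an observed frequency and a background frequency.

    `floor` (an assumption of this sketch) avoids log(0) for unobserved residues.
    """
    return math.log2(max(frequency, floor) / background)


UNIFORM_BACKGROUND = 0.05  # hypothetical: all 20 residues equally likely

print(log_odds(1.00, UNIFORM_BACKGROUND))  # fully conserved position, positive score
print(log_odds(0.05, UNIFORM_BACKGROUND))  # observed as often as expected, score 0
print(log_odds(0.00, UNIFORM_BACKGROUND))  # never observed, negative score
```

Positive values mean the residue is seen more often than expected by chance at that position, negative values mean less often, which matches how the PSSM portion should be read.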

5. Extract the target sequence profile → from checkpoint file
- Normalize values in range 0-1 by dividing all values by 100
- Store profile matrix for later
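
As a small illustration of step 5, the sketch below parses a single data row of the ASCII PSSM file and normalizes its profile portion. The row is position 3 (C) of the `d3psma_.pssm` matrix shown in this notebook. The function name `parse_pssm_row` is hypothetical, and the field positions assume the whitespace-separated layout seen in that file.

```python
def parse_pssm_row(line):
    """Split one data row of an ASCII PSSM file into its parts."""
    fields = line.split()
    position, residue = int(fields[0]), fields[1]
    scores = [int(v) for v in fields[2:22]]        # PSSM portion (log-odds scores)
    percentages = [int(v) for v in fields[22:42]]  # profile portion (0-100)
    frequencies = [p / 100 for p in percentages]   # normalize to the range 0-1
    return position, residue, scores, frequencies


row = ("    3 C    -3  -7  -6  -7  11  -6  -7  -6  -6  -4  -4  -6  -4  -5  -6  -4  -4  -5  -5  -4"
       "    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.52 0.69")
position, residue, scores, frequencies = parse_pssm_row(row)
print(position, residue, frequencies[4])  # index 4 is the C column
```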

There is also an alternative way of parsing reported in [slides 25 - 28](https://www.dropbox.com/preview/AA%202019-2020/LAB%202%20covid%2019/ModuleII-Savojardo/slides/04-SequenceProfiles.pdf?role=personal)

# psiblast options:

- Input sequence: ``` -query d3psma_.fa ```
- Search DB UniProtKB/SwissProt --> ``` -db uniprot_sprot.fasta ```
- E-value threshold: ``` -evalue 0.01 ```
- Output file for pairwise alignments: ``` -out d3psma_.alns.blast ```
- Output file for PSSM/Profile matrices: ``` -out_ascii_pssm d3psma_.pssm ```
- Max of 3 iterations: ``` -num_iterations 3 ```
- Number of descriptions and alignments set to 10000: ``` -num_descriptions 10000 -num_alignments 10000 ```
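
The options above can be assembled into one command. The sketch below only builds and prints the argument list, it does not run PSI-BLAST. The helper name `psiblast_command` and its parameter names are hypothetical; the flags and file names are the ones listed above.

```python
import shlex


def psiblast_command(protein_id, db="uniprot_sprot.fasta", evalue=0.01,
                     iterations=3, max_hits=10000):
    """Build the psiblast argument list for one query (nothing is executed here)."""
    return [
        "psiblast",
        "-query", f"{protein_id}.fa",
        "-db", db,
        "-evalue", str(evalue),
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", f"{protein_id}.pssm",
        "-num_descriptions", str(max_hits),
        "-num_alignments", str(max_hits),
        "-out", f"{protein_id}.alns.blast",
    ]


command = psiblast_command("d3psma_")
print(shlex.join(command))  # the command line as it would be typed in a shell
```

To actually run it, the list could be passed to `subprocess.run(command, check=True)`, assuming `psiblast` is on the `PATH` and the DB index has been built.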

[p1 1:20:00]  Q: Why should the sequence profile from PSI-BLAST be preferred? → PSI-BLAST uses sequence weighting, so large clusters of near-identical sequences cannot ‘take over’ the profile
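
To show what sequence weighting does, here is a minimal sketch of position-based (Henikoff-style) weights on a made-up MSA. It illustrates the idea only and is not PSI-BLAST's exact implementation; gaps, if present, would simply be treated as one more symbol here.

```python
from collections import Counter


def henikoff_weights(msa):
    """Position-based weights: sequences carrying rare residues get more weight."""
    weights = [0.0] * len(msa)
    for column in zip(*msa):
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for i, res in enumerate(column):
            weights[i] += 1.0 / (k * counts[res])
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to 1


toy_msa = ["KTCE", "KTCE", "KTCE", "RSCD"]  # hypothetical: three identical sequences and one divergent
weights = henikoff_weights(toy_msa)
print(weights)
```

The three identical sequences share their weight, so the single divergent sequence is not drowned out when the profile is computed from weighted counts.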

We want to transform each sequence in the 
- → Training set
- → Blind set
into sequence profiles → will be the input to our methods later GOR and SVM. So we have to run psiblast for each seq in each of the 2 sets (t and b).

### Workflow:
- Run PSI-BLAST for 3 iterations against the UniProtKB/SwissProt DB → e-value 0.01
    - → need to run a separate process for each sequence
- Extract the sequence profile by either
    - Parsing the PSI-BLAST checkpoint file and extracting the profile portion (→ easier)
    - Parsing the PSI-BLAST alignment file and computing the profile from scratch (→ harder)
- Write the sequence profile matrix to a file for later
    - Could be a numpy matrix or plain text
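
For the last step of the workflow, a plain-text round trip could look like the sketch below. The function names, the temporary file location and the truncated `toy_profile` are all made up for illustration; a real profile has 20 columns per row.

```python
import os
import tempfile


def save_profile(profile, path):
    """Write one tab-separated line per sequence position."""
    with open(path, "w") as handle:
        for row in profile:
            handle.write("\t".join(f"{value:.2f}" for value in row) + "\n")


def load_profile(path):
    """Read the matrix back as a list of lists of floats."""
    with open(path) as handle:
        return [[float(v) for v in line.split("\t")] for line in handle if line.strip()]


toy_profile = [[0.0, 0.53, 0.47], [1.0, 0.0, 0.0]]  # hypothetical, truncated rows
path = os.path.join(tempfile.mkdtemp(), "toy.profile")
save_profile(toy_profile, path)
restored = load_profile(path)
print(restored == toy_profile)
```

If numpy is available, `numpy.savetxt` and `numpy.loadtxt` would be an alternative for the same plain-text format.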


In [None]:
source /opt/conda/bin/activate

In [None]:
cat fasta/d3psma_.fasta 

In [None]:
>d3psma_
KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRDDFRCWCTRNC

In [None]:
 gunzip uniprot_sprot.fasta.gz # contains all ~560 000 sequences from SwissProt

### Building the index:

--> done only once in the beginning and it will be reused

In [None]:
makeblastdb -in uniprot_sprot.fasta -dbtype prot

Building a new DB, current time: 10/04/2020 14:57:19
New DB name:   /home/um19/project/uniprot_sprot.fasta
New DB title:  uniprot_sprot.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 563082 sequences in 27.612 seconds.

In [None]:
We will use the example:


You will get this warning:

This means that PSI-BLAST can apply a composition-based statistic to adjust the scores for low-complexity regions in the sequence.
If you disable this (`-comp_based_stats no`, as in the modified script further down) you will not get the warning:

# d3psma_.pssm

In [None]:
Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 K    -3   6  -2  -3  -5   0  -1  -4  -2  -5  -4   5  -3  -5  -3  -2  -2  -5  -3  -4    0  53   0   0   0   0   0   0   0   0   0  47   0   0   0   0   0   0   0   0  1.23 0.29
    2 T    -2  -3  -2   0  -2  -2  -1  -4  -1   2   1  -1   0  -1  -3  -1   4  -4  -2   2    0   0   0   5   0   0   5   0   2  12  16   2   2   1   0   0  40   0   0  15  0.34 0.19
    3 C    -3  -7  -6  -7  11  -6  -7  -6  -6  -4  -4  -6  -4  -5  -6  -4  -4  -5  -5  -4    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.52 0.69
    4 E    -1   2  -1   2  -4   1   4  -3  -2  -4  -2   3   2  -4  -3  -2  -2  -4  -3  -3    2  10   2   9   0   4  40   0   0   0   3  21   8   0   0   0   1   0   0   0  0.59 0.25
    5 N     1   2   1  -2  -3  -1  -1   0   0  -1  -3   1  -2  -3  -2   2   2  -4  -3  -1    9  16   9   0   0   1   0   6   3   5   0   9   0   0   0  21  16   0   0   5  0.24 0.15
    6 L    -1   2  -2  -1  -4   2   1  -3  -2  -2   1   2  -2  -3   4  -1  -1  -4  -3  -2    3  13   0   2   0  11  11   0   0   0  19  15   0   0  20   3   2   0   0   0  0.40 0.19
    7 A     1  -3   0  -2   2  -2  -2  -1  -3  -4  -4  -1  -4  -5  -3   6  -1  -5  -4  -2    8   0   4   0   5   0   0   3   0   0   0   3   0   0   0  77   0   0   0   2  1.05 0.42
    8 D     0   0   1   1  -3   1   0   2   4  -2  -1   1  -1  -3   0  -1  -1  -3  -2  -1    5   4   7   8   0   7   5  16  16   0   5  11   2   0   5   0   3   0   0   5  0.20 0.12
    9 T    -2   2   1  -1  -3  -1   0   1   2  -3  -2   1  -2  -2  -3   0   4  -3  -1  -2    0  13   7   2   0   0   5   9   7   0   2  11   0   1   0   3  36   0   2   0  0.36 0.20
   10 F    -1  -4  -4  -5  -4  -4  -4  -4   2  -3  -2  -4  -2   6  -5  -4  -4   8   4  -3    6   0   0   0   0   0   0   0   4   0   0   0   0  46   0   0   0  26  13   0  1.49 0.39
   11 R    -1   1   0   0  -3   0   0  -2   2  -3  -3   5  -2  -2   0   2   0  -4  -2  -2    1   7   2   3   0   2   3   1   5   0   1  43   0   2   4  22   3   0   0   2  0.46 0.22
   12 G    -2  -4  -2  -3  -5  -4  -4   7  -4  -5  -5  -4  -5   0  -4   0  -3  -4   0  -5    0   0   0   0   0   0   0  86   0   0   0   0   0   4   0   6   0   0   3   0  1.60 0.41
   13 P     0  -1   0  -3  -3  -2  -2  -3   0   1   0  -1   1  -1   4  -2   0  -3  -1   2   11   4   6   0   0   0   0   0   2   6  10   3   6   1  23   0   5   0   2  20  0.31 0.14
   14 C    -3  -6  -6  -6  11  -6  -6  -6  -6  -4  -4   0  -4  -6  -6  -4  -4  -5  -6  -4    0   0   0   0  94   0   0   0   0   0   0   6   0   0   0   0   0   0   0   0  3.26 0.71
   15 F    -1  -2   0  -3  -2  -2  -3   1  -3   2   2  -2   1   2  -1  -2   0   3  -1   1    2   1   6   0   0   0   0  11   0  16  19   2   3  13   3   0   6   4   0  12  0.16 0.12
   16 T    -1   1   4   1  -3  -1  -1   0   0  -2  -4   1  -3  -4  -1   3   1  -4  -3  -3    3   7  29   6   0   0   0   6   2   2   0   7   0   0   2  24  10   0   0   0  0.42 0.20
   17 D    -2  -2   3   5  -4  -1   0  -2  -2  -4  -4   1  -3  -4  -3   1   1   2  -3  -2    2   0  18  40   0   0   3   0   0   0   0   8   0   0   0  13   9   3   0   3  0.66 0.26
   18 G     1   2   2   0  -3   0   0   1   2  -3  -3   1  -2  -3   0   1   1  -3   0  -3   12  11   9   4   0   2   4   9   7   0   0   8   0   0   4  15  10   0   5   0  0.21 0.12
   19 S     1  -2   5   0  -3  -1  -1   0   2  -4  -3   0  -3  -3   2   0   0  -4  -1  -3   14   0  35   4   0   2   3   8   5   0   2   5   0   0   9   6   5   0   2   0  0.41 0.20
   20 C    -3  -6  -6  -6  11  -6  -7  -5  -6  -4  -4  -6  -4  -5  -6  -4  -4  -5  -5  -4    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.45 0.58
   21 D     2   1   1   3  -3   0   0   0   1  -3  -3   1  -2  -3  -2   1  -1  -3   0  -1   27   8   6  19   0   3   4   4   4   0   0  10   0   0   0   7   0   0   3   5  0.24 0.11
   22 D     0   1   3   1  -3   1  -1  -2   2  -1  -2   2  -2  -3  -2   1   0  -4  -2  -2    6   8  20   8   0   8   0   0   5   4   4  17   0   0   0  14   6   0   0   0  0.27 0.12
   23 H     0   0  -2  -1   2   2  -1  -3   4   0  -1  -1   0  -1  -3  -1  -2  -4  -2   3    8   4   0   2   5  13   3   0  16   3   2   2   2   2   0   4   0   0   0  33  0.29 0.15
   24 C    -4  -7  -6  -7  11  -6  -7  -6  -6  -4  -4  -6  -5  -6  -6  -4  -4  -6  -6  -4    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.64 0.58
   25 K    -2   3  -1  -2  -3   1   0  -3   2   2   0   2   0  -2  -1  -2  -2  -3  -2   0    1  21   2   0   0   8   5   1   6  16   9  19   2   0   2   0   0   0   0   6  0.24 0.12
   26 N    -1   1   3   1  -3   2   1  -1  -1  -3  -3   1  -2  -3  -2   1   2  -3  -3  -2    1   6  19   8   0  11   8   4   0   0   1   9   0   0   0  11  19   0   0   1  0.33 0.13
   27 K    -1   1   0  -1  -2   1   0  -1  -1   0   0   1  -1   0  -1  -1  -1   5   1   0    2   8   5   3   1   7   4   4   1  10  12  13   1   5   3   2   2  10   5   4  0.11 0.06
   28 E    -1   0  -1   0  -5   0   6  -1  -2  -5  -5   1  -4  -5  -1  -2  -1  -5  -4  -4    3   4   2   3   0   0  69   6   0   0   0   5   0   0   3   0   3   0   0   0  1.04 0.36
   29 H    -2   0   0  -1  -5   0  -2   5   4  -5  -5   1  -4  -4  -4  -1  -3  -4  -2  -5    0   6   4   3   0   3   0  56  12   0   0  12   0   0   0   2   0   0   1   0  0.89 0.29
   30 L     1   0  -4  -4  -3  -3  -3  -2  -3  -2   0   0  -2   6  -4  -2  -3   6   3  -3   16   5   0   0   0   0   0   3   0   0   8   6   0  37   0   2   0  14   8   0  0.72 0.26
   31 I     0   1   0  -1  -3   0  -1   0  -2  -1  -1   0   0  -2   3   1   1  -3   0  -1    4   8   4   2   0   2   2   9   0   4   5   5   2   0  15  16  12   0   3   2  0.17 0.10
   32 K     1  -1   0   2  -4  -2  -2   4   4  -4  -4  -1   0  -4  -3   2  -1  -4  -3  -4    9   3   3  13   0   0   0  35  13   0   0   2   3   0   0  16   2   0   0   0  0.54 0.22
   33 G    -2  -5  -3  -4  -5  -4  -5   7  -5  -7  -6  -4  -5  -6  -5  -3  -4  -5  -6  -6    0   0   0   0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0  2.23 0.53
   34 R     0   2   1   0  -3   0   0  -3   3  -1  -3   2  -2  -2  -3   1  -1  -3   3  -1    8  10   8   5   0   2   3   0   9   3   0  16   0   0   0  13   3   0  15   4  0.27 0.17
   35 C    -4  -7  -7  -7  12  -7  -8  -6  -7  -5  -5  -7  -5  -6  -7  -5  -5  -6  -6  -5    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.95 0.85
   36 R    -1   2   2   1  -3   0   0  -1   5  -2  -2   2  -1  -3  -3   0  -2  -3   0  -2    5  13  10   9   0   3   2   3  20   2   3  16   2   0   0   7   0   0   2   3  0.33 0.17
   37 D    -2  -2  -2   2  -3  -2  -2   4  -3   1  -2  -2  -2   0   3   0  -2  -3   0  -1    0   1   0  12   0   2   0  34   0  11   4   2   0   5  13   6   0   0   4   5  0.37 0.19
   38 D    -1   3   2   1  -3   0   0   1   0  -2  -1   1  -2  -1  -1   1   0  -3  -1  -2    4  19  11   9   0   2   7  10   2   2   6   5   0   3   3  10   3   0   2   3  0.14 0.10
   39 F    -3   5   1  -3  -4   0  -2  -4   2  -2   0   0   1   3  -4  -2  -3   3   1  -2    0  39   7   0   0   2   2   0   7   1   9   5   4  13   0   2   0   4   5   1  0.55 0.26
   40 R    -1   6  -2  -3  -4   0  -2  -4   0  -2  -3   4   1  -4  -3  -1  -1  -4  -3  -2    5  49   0   0   0   2   0   0   3   2   0  26   4   0   0   4   2   0   0   2  0.89 0.31
   41 C    -4  -7  -6  -7  12  -7  -7  -6  -7  -5  -5  -7  -5  -6  -7  -4  -4  -6  -6  -4    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.87 0.78
   42 W    -3  -1  -4  -4  -4  -3  -1  -5   0   1   0  -1   1   5  -4  -3  -2   6   4   0    0   4   0   0   0   0   4   0   3   9   8   5   4  25   0   0   2  14  16   6  0.64 0.23
   43 C    -4  -7  -6  -7  12  -7  -7  -6  -7  -5  -5  -7  -5  -6  -7  -4  -4  -6  -6  -4    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.87 0.78
   44 T    -2  -2  -1  -2  -3   1  -1  -4  -2   0  -2  -1  -1   0  -4   0   4  -2   5  -2    2   1   3   1   0   8   2   0   0   6   2   3   1   3   0   4  34   0  28   1  0.54 0.27
   45 R    -3   5  -1  -3  -5   0  -2  -4   0  -4  -2   4  -3   4  -4  -3  -2  -3   1  -4    0  33   1   0   0   2   0   0   2   0   4  31   0  21   0   0   2   0   4   0  0.78 0.29
   46 N     0   1   3   1  -4   2   0  -2   3  -3  -3   0  -3  -4   4   0  -2  -5  -3  -3    6   6  21   9   0  10   3   2   8   1   1   2   0   0  23   6   0   0   0   2  0.50 0.18
   47 C    -4  -7  -6  -7  12  -6  -7  -6  -6  -5  -5  -6  -5  -6  -6  -4  -4  -6  -6  -4    0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  3.71 0.68

                      K         Lambda
Standard Ungapped    0.1431     0.3289
Standard Gapped      0.0475     0.2670
PSI Ungapped         0.1552     0.3176
PSI Gapped           0.0475     0.2670

extract A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V 
1348 sequences 

In [8]:
ls

d3psma_.aln            d3psma_.pssm           sequence_profile.ipynb


In [10]:
head -100 d3psma_.aln

PSIBLAST 2.7.1+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for compositional score matrix adjustment: Stephen F.
Altschul, John C. Wootton, E. Michael Gertz, Richa Agarwala,
Aleksandr Morgulis, Alejandro A. Schaffer, and Yi-Kuo Yu (2005)
"Protein database searches using compositionally adjusted
substitution matrices", FEBS J. 272:5101-5109.


Reference for composition-based statistics starting in round 2:
Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei
Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and
Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST
protein database searches with composition-based statistics and
other refinements", Nucleic Acids Res. 29:2994-3005.



Database: uniprot_sprot.fasta
           563,082 seque

Pairwise alignments:

.aln file:

- All significant alignments are shown in the first block
- For each of the sequences reported above
    - The corresponding pairwise alignment is shown
    
- If you want to build the PSSM for the profile starting from this file
     - you have to select all sequences from the last round - "round 3" 
     - Collect all the sequences producing a significant alignment
     - Parse all the alignments, one at a time
     - Extract the SS mapping between subject and query sequence
     
- All gaps in the query need to be removed --> see example below:

In [None]:
Query  1   KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRD---DFRCWCTRNC  47
           +TCE+ +  F+GPCF+D +C   C+  E+  +G+C     + +C+C R+C
Sbjct  1   RTCESQSHKFKGPCFSDSNCATVCRT-ENFPRGQCNQHHVERKCYCERSC  49

 So you must implement something that allows you to map the pairwise alignment to the original sequence without gaps.
 
 You have to filter out each "-" in the query and the corresponding aligned residue in the subject below it.
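
A minimal sketch of this filtering step, using the Query/Sbjct pair from the example above. The function name `drop_query_gaps` is hypothetical.

```python
def drop_query_gaps(query_aln, subject_aln):
    """Keep only the alignment columns where the query has a residue."""
    kept = [s for q, s in zip(query_aln, subject_aln) if q != "-"]
    return "".join(kept)


query_aln   = "KTCENLADTFRGPCFTDGSCDDHCKNKEHLIKGRCRD---DFRCWCTRNC"
subject_aln = "RTCESQSHKFKGPCFSDSNCATVCRT-ENFPRGQCNQHHVERKCYCERSC"

mapped = drop_query_gaps(query_aln, subject_aln)
print(mapped)
print(len(mapped))  # same length as the ungapped query
```

The subject insertion opposite the three query gaps is dropped, while the "-" in the subject stays, because it is aligned to a real query residue. Alignments that cover only part of the query would additionally need padding based on the Query start and end coordinates, which this sketch does not handle.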

'''/round 3 is not in the file'''

So use a file that contains the ls output of all the file names, without extensions, and loop through it.

It's best to execute PSI-BLAST via a bash script --> when using a cmd line tool on many inputs, a bash script is the easiest way to repeat the call

In [None]:
#!/bin/bash

input_list=$1 # $1 = first argument passed: file with one protein ID per line

for protein_id in $(cat ${input_list}) # loops over every ID in the output of cat input_list
do
    psiblast -query fasta/${protein_id}.fasta -db uniprot_sprot.fasta -evalue 0.01 -num_iterations 3 \
    -out_ascii_pssm psiblast-output/${protein_id}.psiblast.pssm -num_descriptions 10000 -num_alignments 10000 \
    -out psiblast-output/${protein_id}.psiblast.aln -num_threads 32
done

- We process each sequence from both:
    - blindtest set
    - training set
- We are preparing our data for the method implementation and training evaluation parts

In [1]:


def lines_list(infile1):                                              # call list of file names and for dsspfile
    ''' Reads all lines from a file and saves them to a list. '''
    with open(infile1) as ofile:
        flist = ofile.readlines() # returns list containing each line of the file
        return flist

# def relevant_lines(infile1, desired_chain):
#     '''Takes list (extracted from a DSSP file) and the name of the desired_chain as input.
#     Returns 2 strings: ss_string holds the secondary structure mapping and aa_string holds 
#     the amino acid information. Missing residues (when no atomic information of the PDB is 
#     present) are assigned the letter "C" (coil) in the ss_string and "X" in the aa_string.'''
#     dssp_list = lines_list(infile1)     # contains all lines from the dssp file.
#     relevant = False # boolean variable            
# #     desired_chain = "A"                            # change to load from "id_and_chain_blindset2"
#     ss_string = ''
#     aa_string = ''
#     for line in dssp_list:
#         if '#' in line: # find last line before relevant output
#             relevant =True   # flips rel to true - so the folowing lines are saved
#             continue
#         if relevant:
#             if line[11] == desired_chain:
#                 ss_string += line[16]
#                 if line[13] == "!":
#                     aa_string += "X"
#                 else:
#                     aa_string += line[13]
#     return ss_string, aa_string

In [2]:
pwd

'/Users/ila/01-Unibo/02_Lab2/project/test'

In [26]:
all_lines = lines_list("d3psma_.pssm")
all_lines = all_lines[2].split()
print(all_lines)

['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']


In [27]:
len(all_lines)

40

In [29]:
all_lines[20]

'V'

# Running PSI BLAST on blind test set

Preparing input list for psiblasting blind_fasta
* Need to cut ids only and generate a list thereof.
* Serves as input for the ```~/project/psiblast_cycle.sh ```


In [2]:
ls /Users/ila/01-Unibo/02_Lab2/project_blindset/blind_fasta | head -3

4uiq.fasta
4y0l.fasta
4y0o.fasta


In [None]:
ls lb2-2020-project-englander/blind_fasta/ | rev | cut -c7- | rev  > ~/blindset_ids_only
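
The same ID list could also be produced in Python, which avoids counting characters for `cut`. This is a sketch with a hypothetical helper name; the file names are the ones from the `ls` output above plus one made-up non-FASTA file.

```python
import os


def ids_from_filenames(filenames, extension=".fasta"):
    """Strip the extension from every file name that ends with it."""
    return sorted(os.path.splitext(name)[0]
                  for name in filenames if name.endswith(extension))


ids = ids_from_filenames(["4uiq.fasta", "4y0l.fasta", "4y0o.fasta", "notes.txt"])
print("\n".join(ids))
```

With a real directory, the input list would come from `os.listdir(...)` on the `blind_fasta` folder.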

In [None]:
bash ~/project/psiblast_blindsetcycle.sh ~/blindset_ids_only

In [None]:
source /opt/conda/bin/activate 

### Modified script to use the correct input and output files

In [None]:
#!/bin/bash

input_list=$1           # $1 = first argument passed: file with one protein ID per line

for protein_id in $(cat ${input_list}) # loops over every ID in the output of cat input_list
do
    psiblast -query lb2-2020-project-englander/blind_fasta/${protein_id}.fasta -db uniprot_sprot.fasta -evalue 0.01 -num_iterations 3 \
    -out_ascii_pssm psiblast_output/${protein_id}.psiblast.pssm -num_descriptions 10000 -num_alignments 10000 \
    -out psiblast_output/${protein_id}.psiblast.aln -num_threads 2 -comp_based_stats no
done


### Moving files to a different folder
This was easier than changing the script in 2 spots.

In [None]:
mv psiblast_output/ ~/project/psiblast_output_blind_set_fasta/

### Script for extracting **the normalized sequence profile**

```~/01-Unibo/02_Lab2/files_lab2_project/scripts/normalized_sequence_profile.py ```

In [None]:
# import os
# for filename in os.listdir("location"):

In [None]:
#!/usr/bin/env python3
import sys
import os.path

def pssm_list(infile):                                              # call list of file names and for dsspfile
    ''' Reads relevant lines from a pssm file and saves them to a list.
    Returns values of the 2 matrices (no header).'''
    with open(infile) as ofile:
        flist = ofile.readlines()[3:-6] # list of each line of the file excluding first 3 & last 6 lines
        return flist

def lines_to_list(infile1):
        ''' Reads all lines from a file and saves them to a list containing the '\n' char. '''
        all_lines_list = []
        with open(infile1, 'r') as rfile:
            all_lines_list = rfile.readlines()
        return all_lines_list  # need to rstrip in a loop for using filenames.
        
def relevant_lines(infile2):
    '''Takes the name of a .pssm file and extracts the Sequence Profile portion only.
    Returns a list of lists where each element is one line of the sequence profile matrix. '''
    pssm_lines = pssm_list(infile2)        # data lines of the pssm file (no header/footer)
    profile_final_list = []                # for holding the relevant fields of each line
    for line in pssm_lines:
        profile_row = line.split()[22:42]  # fields 22-41 hold the 20 profile percentages
        profile_final_list.append(profile_row)  # appending to final list of lists
    return profile_final_list  # list of lists
    
# divide all values by 100
def write_normalized_profile(profile_final_list, ofile):
    '''Takes profile list of lists and outfile name as input. Divides each number in
    the sublists by 100, converts it to a string, adds a tab and writes it to the file.
    After each sublist a newline character is written to the file.'''
    with open(ofile, "w") as wfile:          # "w" so that a rerun does not append duplicate rows
        for sublist in profile_final_list:
            for el in sublist:
                num = int(el) / 100
                numstring = str(num)
                wfile.write(numstring + '\t')  # adding tab after each number
            wfile.write("\n")                  # adding newline at the end of each sublist.

if __name__ == '__main__':
    infile1 = sys.argv[1]  # the ID list to loop over
    # Call the functions by looping through the ID list, adding the '.pssm' extension;
    # name the outfile the same --> ID + '.profile'
    idlist = lines_to_list(infile1)  # IDs of the files WITHOUT the ".pssm" extension
    for ids in idlist:
        # a path may need to be prepended: "<the path>" + ids.rstrip()
        infile = ids.rstrip() + '.pssm'        # removing newline character, adding necessary extension
        if os.path.isfile(infile):             # does this file exist?
            ofile = ids.rstrip() + '.profile'  # outfile for each ID with correct extension
            profile_list = relevant_lines(infile)
            write_normalized_profile(profile_list, ofile)
        else:
            print("Error file: " + infile + " not found.")


In [None]:
with open(outfile, 'a') as afile:
        for i in idlist:
            afile.write(i) #appending ID in even lines
            afile.write(aa_dict[i]) # appending value (sequ) in odd lines
pssm = relevant_lines("d3psma_.pssm")
write_normalized_profile(pssm, 'thomas')

### Ran the script on blind and training set:

* Had to modify path to the idfile for each run:
    * I should have thought of a way to do this from the cmd line
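
One way to avoid editing the script for every run would be to take the paths as command-line arguments. A minimal sketch with `argparse`; the option names `--pssm-dir` and `--out-dir` are hypothetical and not part of the script above.

```python
import argparse


def build_parser():
    """Command-line interface sketch for the profile extraction script."""
    parser = argparse.ArgumentParser(
        description="Extract normalized sequence profiles from PSI-BLAST .pssm files.")
    parser.add_argument("id_list",
                        help="text file with one protein ID per line (no extension)")
    parser.add_argument("--pssm-dir", default=".",
                        help="directory containing the .pssm files")
    parser.add_argument("--out-dir", default=".",
                        help="directory where the .profile files are written")
    return parser


# Parsing an explicit list here so the example runs inside a notebook cell;
# in the script itself it would be build_parser().parse_args()
args = build_parser().parse_args(["blindset_ids_only", "--pssm-dir", "psiblast_output"])
print(args.id_list, args.pssm_dir, args.out_dir)
```

The blind set and the training set could then be processed with two calls that differ only in their arguments.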
    
### The blindset's profiles are saved in 
```~/lb2-2020-project-englander/seqprofile_blind```

### The trainingset's profiles are saved in
```~/lb2-2020-project-englander/seqprofile_training```

Not all sequences yielded a checkpoint file --> some did not find matches above the threshold.
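
To find out which sequences are affected, the ID list can be compared against the files that actually exist. A minimal sketch: the helper name is hypothetical, the `.psiblast.pssm` suffix is taken from the bash script above, and the demo directory with one empty file only stands in for the real output folder.

```python
import os
import tempfile


def missing_pssm_ids(ids, pssm_dir, suffix=".psiblast.pssm"):
    """Return the IDs for which PSI-BLAST wrote no PSSM file."""
    return [pid for pid in ids
            if not os.path.isfile(os.path.join(pssm_dir, pid + suffix))]


# Demo setup: a temporary directory in which only one of two IDs has a PSSM file
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "4uiq.psiblast.pssm"), "w").close()

missing = missing_pssm_ids(["4uiq", "4y0l"], demo_dir)
print(missing)
```

Keeping this list around makes it explicit which sequences have no profile when the data are used in the later steps.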