# Generating the Blind Set

* Also known as **holdout dataset**

* Cross-validation is not sufficient to estimate unbiased generalization performance.
    - Model hyper-parameters are still optimized on the training set through cross-validation and grid search
    - This may lead to some degree of overfitting on the training data
    - Using a blind set gives us a 'never-seen-before' condition for the final evaluation
    
## Generation Criteria:

* Structures deposited **after** January 2015
     - Release year of JPred4 is 2014
* X-ray crystal structures with resolution < 2.5 Å
* Chain length in the range of 50 - 300 residues
* Advanced search -> Entry Polymer Types:
    - Protein OR Protein/NA 
* All pairs of sequences within the blind set should share less than 30% sequence identity ('internal redundancy'):
    - ```blastclust``` is used to reduce this redundancy

* When comparing sequences of the blind set with the JPred set, every pair (blind - JPred) must share less than 30% sequence identity ('external redundancy'):
    - This is checked with ```blastp```

* The final blind set will comprise 150 proteins which will be randomly selected among those that meet the above criteria
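
The criteria above can be read as a simple per-entry filter. The sketch below is only an illustration: the records and their field names are made up, and the cutoffs (deposited after 2015-01-31, resolution at most 2.5 Å, length 50 to 300 inclusive) follow the values encoded in the saved search link rather than an official schema.

```python
from datetime import date

# Hypothetical candidate records; the field names are illustrative only.
candidates = [
    {"id": "0AAA", "deposited": date(2016, 3, 1), "method": "X-RAY DIFFRACTION",
     "resolution": 1.9, "length": 120},
    {"id": "0BBB", "deposited": date(2013, 5, 9), "method": "X-RAY DIFFRACTION",
     "resolution": 1.5, "length": 150},   # deposited too early
    {"id": "0CCC", "deposited": date(2018, 7, 2), "method": "X-RAY DIFFRACTION",
     "resolution": 3.1, "length": 200},   # resolution too low
    {"id": "0DDD", "deposited": date(2019, 1, 15), "method": "SOLUTION NMR",
     "resolution": None, "length": 90},   # not an X-ray structure
    {"id": "0EEE", "deposited": date(2017, 11, 30), "method": "X-RAY DIFFRACTION",
     "resolution": 2.0, "length": 420},   # chain too long
]


def meets_criteria(entry):
    """Return True if a record passes all blind-set selection criteria."""
    return (
        entry["deposited"] > date(2015, 1, 31)
        and entry["method"] == "X-RAY DIFFRACTION"
        and entry["resolution"] is not None
        and entry["resolution"] <= 2.5
        and 50 <= entry["length"] <= 300
    )


selected = [entry["id"] for entry in candidates if meets_criteria(entry)]
print(selected)
```

The redundancy criteria (internal and external) cannot be expressed per entry, which is why they are handled separately with ```blastclust``` and ```blastp```.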

### 1. Downloading Data from the PDB

Checked the boxes "Entry ID","Sequence","Entity Polymer Type","Chain ID","Entry Id (Polymer Entity Identifiers)".

[here the link to my search](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22greater%22%2C%22negation%22%3Afalse%2C%22value%22%3A%222015-01-31T00%3A00%3A00Z%22%2C%22attribute%22%3A%22rcsb_accession_info.deposit_date%22%7D%2C%22node_id%22%3A0%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22exact_match%22%2C%22negation%22%3Afalse%2C%22value%22%3A%22X-RAY%20DIFFRACTION%22%2C%22attribute%22%3A%22exptl.method%22%7D%2C%22node_id%22%3A1%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22less_or_equal%22%2C%22negation%22%3Afalse%2C%22value%22%3A2.5%2C%22attribute%22%3A%22rcsb_entry_info.resolution_combined%22%7D%2C%22node_id%22%3A2%7D%2C%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22or%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22exact_match%22%2C%22negation%22%3Afalse%2C%22value%22%3A%22Protein%20(only)%22%2C%22attribute%22%3A%22rcsb_entry_info.selected_polymer_entity_types%22%7D%2C%22node_id%22%3A3%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22exact_match%22%2C%22negation%22%3Afalse%2C%22value%22%3A%22Protein%2FNA%22%2C%22attribute%22%3A%22rcsb_entry_info.selected_polymer_entity_types%22%7D%2C%22node_id%22%3A4%7D%5D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22range_closed%22%2C%22negation%22%3Afalse%2C%22value%22%3A%5B50%2C300%5D%2C%22attribute%22%3A%22entity_poly.rcsb_sample_sequence_
length%22%7D%2C%22node_id%22%3A5%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22dc19df09287d4c5a80018000a03e2a6d%22%7D%7D)

* Downloaded as CSV 

For some reason the export automatically adds "Entry ID" as column 1. When an entry has more than one chain, column 1 is empty on every line after the first, so I removed the whole column using awk:

In [None]:
head -6 1.csv 

In [None]:
for i in {1..5}
do
    cat ${i}.csv | awk '{sub(/[^,]*/,"");sub(/,/,"")} 1' > ${i}new.csv 
done

In [None]:
head -5 4new.csv #now it looks like this
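
The same column removal could also be done with Python's ```csv``` module, which respects quoted fields such as "D, I". This is a minimal sketch on an in-memory sample, not the command I actually ran; the sample rows and IDs are invented.

```python
import csv
import io


def drop_first_column(src, dst):
    """Copy CSV rows from src to dst without their first field."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow(row[1:])


# Invented sample: the second data row has an empty first column.
sample = io.StringIO(
    'Entry ID,Sequence,Entity Polymer Type,Chain ID,Entry Id\n'
    '"0AAA","CAGT","DNA","D, I","0AAA"\n'
    ',"MKV","Protein","A","0AAA"\n'
)
out = io.StringIO()
drop_first_column(sample, out)
cleaned_lines = out.getvalue().splitlines()
print(cleaned_lines)
```

Because the writer only quotes fields that need it, the chain field "D, I" stays quoted while the other fields are written bare.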

### 2. Parsing and Filtering

First I need to be aware that some of the chains are nucleic acids ("NA-hybrid").

Remove all lines containing
*  "NA-hybrid" 
*  "DNA"
*  "RNA"

to obtain protein sequences only!

``` sed -n '/Protein/p' ./filename ```


In [None]:
# Keeping only lines that contain "Protein" --> this removes the header too,
# so the header is written to the top of the output file first.
head -1 1new.csv > aa_only.csv

for i in {1..5} # Appending only lines containing the word "Protein"
do
    sed -n '/Protein/p' ${i}new.csv >> aa_only.csv 
done

In [None]:
head aa_only.csv

In [None]:
grep "Protein" aa_only.csv | wc -l #  29638 Protein chains in the set

### Switched to python 3 kernel:

In [None]:
# Loading relevant packages:

import pandas as pd
import numpy as np
import seaborn as sns
# sns.set() #do we really need that


In [None]:
df_aa = pd.read_csv("aa_only.csv")
df_aa

In [None]:
# Where did the unnamed come from??
unnamed = df_aa["Unnamed: 4"] # Maybe trailing comma?
unnamed.unique()

#### Finding unique values in column "Entity Polymer Type"
I want to find unique values as described [here](https://chrisalbon.com/python/data_wrangling/pandas_list_unique_values_in_column/).

This way I can be sure that my cleaning was successful.

In [None]:
# Finding unique names in cols
pol_type = df_aa['Entity Polymer Type']
pol_type.unique() # I am now sure that all DNA and RNA and Protein/NA lines have been removed.

In [None]:
!head -3 aa_only.csv

In [None]:
!sed '1d' aa_only.csv > noheader_aa_only.csv # removing header before generating fasta


In [None]:
!wc -l aa_only.csv

In [None]:
!wc -l noheader_aa_only.csv # value matches file is ok proceeding to make fasta

#### I want to consider the comma as a field separator 

Since there are several chains per entry denoted as e.g.:

``` "CAGTTTCAAACTC","D, I","5FD3",```

I need to remove the comma (and the space after it) between the chain IDs first, so that it is not read as a field separator:
*  ``` sed 's/\, //g' aa_only.csv ```

In [None]:
# Removing commas between chains

sed 's/\, //g' aa_only.csv > nospace.csv

head nospace.csv

## 3. Generating the FASTA file

* Using awk: defining the comma as field separator.

* ``` awk -F ',' ```

* filtering for length in the range of 50 - 300 

*  ``` 'length($1) >= 50 && length($1) <= 300 {print ">"$4":"$3"\n"$1}' ```

In [None]:
# Quotes are stripped before awk so that they are not counted in length($1)
sed 's/\"//g' nospace.csv | awk -F ',' 'length($1) >= 50 && length($1) <= 300 {print ">"$4":"$3"\n"$1}' > 50_300.fasta

In [None]:
head 50_300.fasta
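
For comparison, here is a sketch of the same CSV-to-FASTA step in Python. It assumes the column order implied by the awk command (sequence, polymer type, chain IDs, entry ID) and, unlike the shell version, keeps only the first chain ID straight away. The function name and the sample rows are hypothetical.

```python
import csv
import io


def csv_to_fasta(handle, min_len=50, max_len=300):
    """Return FASTA records for protein rows whose length is in range."""
    records = []
    for row in csv.reader(handle):
        if len(row) < 4:
            continue
        sequence, polymer_type, chains, entry_id = row[:4]
        if polymer_type != "Protein":
            continue  # this also skips the header row
        if not (min_len <= len(sequence) <= max_len):
            continue
        first_chain = chains.split(",")[0].strip()
        records.append(f">{entry_id}:{first_chain}\n{sequence}\n")
    return records


# Invented sample rows with a trailing comma, as in the export.
sample = io.StringIO(
    'Sequence,Entity Polymer Type,Chain ID,Entry Id,\n'
    '"' + "M" * 60 + '","Protein","A, B","0AAA",\n'
    '"' + "K" * 10 + '","Protein","A","0BBB",\n'
    '"CAGTTTCAAACTC","DNA","D, I","0CCC",\n'
)
fasta_records = csv_to_fasta(sample)
print("".join(fasta_records), end="")
```

Since the ```csv``` reader already strips the quotes, the length test here is applied to the bare sequence.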

#### Sequences containing X need to be removed:

Working on the script and testing it along the way. The final result is saved as

#### removeX.py


In [None]:
def lines_to_list(filename):
    ''' Reads all lines from a file and saves them to a list. '''
    content_list = []
    with open(filename, "r") as rfile:
        content_list = rfile.readlines()
        return content_list

myfastalist = lines_to_list("50_300.fasta")  #works

In [None]:
len(myfastalist)

In [None]:
def split_list(liste):
    ''' Splits a list with an even number of items into two lists. id_list holds the
    items at even indices (headers) and seq_list the items at odd indices (sequences).
    Returns the two lists.'''
    id_list = liste[::2]
    seq_list = liste[1::2]
    return id_list, seq_list

# teste = ['a', 'b', 'c', 'd', 'e', 'f'] #Works
ids, seq = split_list(myfastalist)


print(len(ids))
print(len(seq))
print(len(myfastalist))

In [None]:
def remove_X(id1, seq2):
    '''Drops every sequence that contains an X, together with its ID.
    Returns the filtered ID list and the filtered sequence list.'''
    noXid = []
    noXseq = []
    for i in range(len(id1)):
        if "X" not in seq2[i]:
            noXid.append(id1[i])
            noXseq.append(seq2[i])
    return noXid, noXseq


In [None]:
new_id, new_seq = remove_X(ids, seq)

print(len(new_id))
print(len(new_seq))

def no_X_id_and_seq(id_list, seq_list):
    ''' Joins the lists to a big list containing both id and sequences. Returns one big list'''
    biglist = []
    for i in range(len(id_list)):
        biglist.append(id_list[i])
        biglist.append(seq_list[i])
    return biglist


In [None]:
biglist = no_X_id_and_seq(new_id, new_seq)
len(biglist)

In [None]:
#Writing list to file:

def list_to_fasta(liste):
    '''Takes one list of alternating ID and sequence lines as input and
    writes every element to the file no_X.fasta.'''
    with open('no_X.fasta', 'w') as F:
        for i in liste:
            F.write(str(i))
    

In [None]:
list_to_fasta(biglist)

In [None]:
def filter_short(infile, outfile):
    '''Copies infile to outfile, skipping every record whose sequence line
    is shorter than 7 characters (the trailing newline is counted).'''
    del_seq_index = set()
    with open(infile) as rfile:
        lines_list = rfile.readlines()
    for i in range(1, len(lines_list), 2):
        if len(lines_list[i]) < 7:
            del_seq_index.add(i - 1)  # header index
            del_seq_index.add(i)      # sequence index
    with open(outfile, 'w') as wfile:
        for i, line in enumerate(lines_list):
            if i not in del_seq_index:
                wfile.write(line)
        
filter_short("no_X.fasta", 'longsequences.fasta')                   
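
The X filter and the length filter can also be combined into one pass over the header/sequence pairs. This is a compact sketch, assuming a two-line-per-record FASTA like the one produced above; the function name and the example records are hypothetical, and the minimum length is left as a parameter.

```python
def drop_unknown_and_short(lines, min_len):
    """Keep records whose sequence has no X and at least min_len residues."""
    kept = []
    for header, seq in zip(lines[::2], lines[1::2]):
        residues = seq.strip()
        if "X" in residues or len(residues) < min_len:
            continue
        kept.append(header.strip() + "\n")
        kept.append(residues + "\n")
    return kept


# Invented example: one clean record, one with X, one that is too short.
example = [
    ">0AAA:A\n", "MKVLAAGT\n",
    ">0BBB:A\n", "MKXLAAGT\n",
    ">0CCC:B\n", "MKV\n",
]
filtered = drop_unknown_and_short(example, min_len=6)
print("".join(filtered), end="")
```

Stripping the sequence before measuring it avoids counting the trailing newline as a residue.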

### 4. Clustering the Sequences in blastclust

Sending file to be clustered to the VM:

### 5. Picking longest sequence of each cluster: by default col1

awk '{print $1}' final_clusters > best_of_cluster
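
A Python version of the same idea, as a small sketch: it assumes one cluster per line with whitespace-separated IDs and the representative in the first column, as stated in the heading. The function name and the cluster lines are made up for illustration.

```python
def first_of_each_cluster(cluster_lines):
    """Return the first ID of every non-empty cluster line."""
    representatives = []
    for line in cluster_lines:
        fields = line.split()
        if fields:
            representatives.append(fields[0])
    return representatives


# Invented cluster file content, including one empty line.
clusters = ["0AAA:A 0BBB:A 0CCC:B\n", "0DDD:A\n", "\n", "0EEE:C 0FFF:A\n"]
best = first_of_each_cluster(clusters)
print(best)
```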

In [None]:
for i in {1..5}
do
    mv $i.csv ./orignial_csv/$i.csv
done

In [None]:
for i in {1..5}
do
     mv ${i}new.csv ./orignial_csv/${i}new.csv
done     

#### 6. Generating FASTA Containing ONLY Sequences of Best Cluster

* I think the easiest way is to generate a list of the IDs in best_of_final_cluster

    - First I need a new file that has the ">" character in front of every ID


* And generate a dictionary from no_X_long_sequs.fasta (header -> sequence)

* Then I want to use the list to look up each ID in the dictionary

In [None]:
# making sure the output of the script will be fasta standard
cat best_of_final_cluster | sed 's/^/>/' > crocodile_ids_final_cluster

In [None]:
#!/usr/bin/env python3
import sys

def lines_to_list(infile1): 
    ''' Reads all lines from a file and saves them to a list. '''
    content_list = []
    with open(infile1, "r") as rfile:
        content_list = rfile.readlines()
        return content_list

def split_list(infile1):
    ''' Reads a FASTA file and splits its lines into two lists. id_list contains
    the header lines while seq_list contains the sequence lines. Returns the two lists.'''
    myfastalist = lines_to_list(infile1)
    id_list = myfastalist[::2]
    seq_list = myfastalist[1::2]
    return id_list, seq_list

def dict_from_lists(infile1):
    '''Reads a FASTA file and feeds its header and sequence lists into a dictionary.
    Returns the dictionary (header line -> sequence line).'''
    id_list, seq_list = split_list(infile1)
    full_dict = dict(zip(id_list, seq_list))
    return full_dict
    
def keep_whats_in_dict(infile1, infile2, outfile):
    '''Loops through the ID list read from infile2 and looks each ID up in the
    dictionary built from infile1. Appends each ID (header line) followed by its
    sequence to the outfile.'''
    idlist = lines_to_list(infile2) # reading ids from file into list
    aa_dict = dict_from_lists(infile1)
    with open(outfile, 'a') as afile:
        for i in idlist:
            afile.write(i)          # header line
            afile.write(aa_dict[i]) # sequence line

if __name__ == '__main__':
    infile1 = sys.argv[1]
    infile2 = sys.argv[2]
    outfile = sys.argv[3]
    keep_whats_in_dict(infile1, infile2, outfile)


In [None]:
# run the script
python  make_fasta_from_best_of_each_cluster.py no_X_long_sequs.fasta crocodile_ids_final_cluster crocodile_best_of_final_cluster.fasta


In [None]:
scp -i ~/.ssh/id_rsa.pub ./make_fasta_from_best_of_each_cluster.py proj:~/lb2-2020-project-englander

In [None]:
scp -i ~/.ssh/id_rsa.pub ./crocodile_best_of_final_cluster.fasta proj:~/lb2-2020-project-englander

### 6. Merging all FASTA files of the JPred (training) set 

Need it later to generate blastdb.

In [None]:
cat *.fasta > JPREDall.fasta  # merging

In [None]:
grep ">" JPREDall.fasta | wc -l  # works --> merged all 1348 files.

In [None]:
scp -i ~/.ssh/id_rsa.pub ./JPREDall.fasta proj:~/lb2-2020-project-englander

### 7. Reducing Redundancy

I want to produce a blind test set that is as dissimilar to the training set as possible.

* Running blastp with the blind set against the JPRED training set
* Identifying all sequences in the blind set that have LESS than 30% sequence identity with every sequence in the training set.



In [None]:
# copying hits.blast.tab to local
scp -i ~/.ssh/um19_id_rsa um19@m19.lsb.biocomp.unibo.it:~/lb2-2020-project-englander/makeblastdb/hits.blastp.tab ~/01-Unibo/02_Lab2/project_blindset/

In [None]:
wc -l hits.blastp.tab

In [None]:
head -4 hits.blastp.tab
# $1 PDB id and chain   $2 JPRED id   $3 % identity

## Generating Non-Redundant Set With Least Similarity 

Step 3 from the slides: "Filter out from the preliminary chain set all chains having at least one BLAST hit with SI >= 30% with any sequence in the JPRED4 dataset."

### Generating 2 Files Containing Only Relevant Lines

file above30:
* I'll extract ```$1``` if ``` $3 >= 30 ```

file below30:
* I'll extract ```$1``` if ``` $3 < 30 ```


In [None]:
# awk: keep the line if col 3 < 30 (or >= 30) and write ID and identity to a new file.
awk -F ' ' '$3 < 30 {print $1 " " $3}' hits.blastp.tab > below_30.hits
awk -F ' ' '$3 >= 30 {print $1 " " $3}' hits.blastp.tab > above_30.hits

#### Keeping IDs only

In [None]:
awk -F ' ' '$3 < 30 {print $1}' hits.blastp.tab > id_below_30.hits
awk -F ' ' '$3 >= 30 {print $1}' hits.blastp.tab > id_above_30.hits
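
The filtering rule can be stated directly with a set: a chain is dropped as soon as any of its hits reaches the threshold, and chains without any hit are kept. The sketch below assumes whitespace-separated tabular hits with query ID in column 1 and percent identity in column 3, as in the header comment above; the function name, IDs, and hit lines are invented.

```python
def ids_without_close_hit(all_ids, hit_lines, threshold=30.0):
    """Return the IDs that have no hit with identity >= threshold."""
    flagged = set()
    for line in hit_lines:
        fields = line.split()
        if len(fields) >= 3 and float(fields[2]) >= threshold:
            flagged.add(fields[0])
    return [seq_id for seq_id in all_ids if seq_id not in flagged]


# Invented input: 0AAA:A has one hit below and one above the threshold,
# 0CCC:B sits exactly on the threshold, 0DDD:A has no hit at all.
all_ids = ["0AAA:A", "0BBB:A", "0CCC:B", "0DDD:A"]
hits = [
    "0AAA:A\tjp_1\t25.0\n",
    "0AAA:A\tjp_2\t41.7\n",
    "0BBB:A\tjp_3\t28.3\n",
    "0CCC:B\tjp_4\t30.0\n",
]
kept = ids_without_close_hit(all_ids, hits)
print(kept)
```

Using a set for the flagged IDs keeps the membership test fast even when the hit table has many lines per query.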

In [None]:
def lines_to_list(infile1): 
    ''' Reads all lines from a file and saves them to a list. '''
    content_list = []
    with open(infile1, "r") as rfile:
        content_list = rfile.readlines()
        return content_list
        
below = lines_to_list("id_below_30.hits")  # list of all ids scoring below 30% id
above = lines_to_list('id_above_30.hits')  # list of all ids scoring above and equal to 30% id



In [None]:
print(len(below))
print(len(above))

In [None]:
# "6D8X:A\n" in above
# head -30 id_below_30.hits > ids_test
# head -30 id_above_30.hits > ids_above_test

def remove_matches(lower, higher):
    '''Takes two lists as input and returns a list that contains
    all values of 'lower' that are NOT elements of 'higher'.'''
    keepers = []          # list holding all ids that have not scored >= 30% with any of the JPRED sequences
    for i in lower:       
        if i not in higher: 
            keepers.append(i)  # keeps only ids that are not reported in the list "above"
    return keepers

keep = remove_matches(below, above) 

print(len(keep))
# Have 177 unique sequences with no match at or above 30% id with any sequence in the training (JPRED) set.

In [None]:
all_ids = lines_to_list("best_of_final_cluster") # generating list of all ids that were in the blastp input
print(len(all_ids))

In [None]:
def keep_mis_matches(biglist, partiallist):
    '''Keeps all biglist values that do not match any element of partiallist.
    Returns them as a new list.'''
    keepers = []
    for i in biglist:
        if i not in partiallist:
            keepers.append(i) 
    return keepers

all_without_above = keep_mis_matches(all_ids, above)
len(all_without_above)


In [None]:
def write_list_to_file(liste, newfile):
    '''Takes as input a list and writes each element to a new file'''
    with open(newfile, 'a') as afile: 
        for i in liste:
            afile.write(i)
            
write_list_to_file(all_without_above, 'try_again')

In [None]:
# all_scoring_below30.py

# input of blastp = fasta file; find the ids that are missing from the hits

# intersection of set below() and above() --> throw_away
# find and remove throw_away ids from the id file

# make list of all ids reported in fastinputblastp
# turn fastinputblastp_list into fastinputblastp_set (biggest set)
# fastinputblastp_set - above_set 

def lines_to_list(infile1): 
    ''' Reads all lines from a file and saves them to a list. '''
    content_list = []
    with open(infile1, "r") as rfile:
        content_list = rfile.readlines()
        return content_list
        
below = lines_to_list("id_below_30.hits")  # list of all ids scoring below 30% id
above = lines_to_list('id_above_30.hits')  # list of all ids scoring above and equal to 30% id

def remove_matches(lower, higher):
    '''Takes two lists as input and returns a list that contains
    all values of 'lower' values that are NOT element of 'higher'.'''
    keepers = []          # list holding all ids that have not scored >= 30% with any of the JPRED sequnces
    for i in lower:       
        if i not in higher: 
            keepers.append(i)  # keeps only ids that are not reported in the list "above"
    return keepers

keep = remove_matches(below, above) 

# print(len(keep))
# Have 177 unique sequences with no match at or above 30% id with any sequence in the training (JPRED) set.

##########################################################
# Making a list of all ids (input IDs of the blastp)
##########################################################

all_ids = lines_to_list("best_of_final_cluster") # generating list of all ids that were in the blastp input
# print(len(all_ids))

def keep_mis_matches(biglist, partiallist):
    '''Stores exclusively biglist values that are not reported in partiallist.
    Returns the biglist with all partiallist matches removed. Keeps all values 
    that donot match any element of partiallist in a new list. Returns the new list.'''
    keepers = []
    for i in biglist:
        if i not in partiallist:
            keepers.append(i) 
    return keepers

all_without_above = keep_mis_matches(all_ids, above)
# len(all_without_above)

def write_list_to_file(liste, newfile):
    '''Takes as input a list and writes each element to a new file'''
    with open(newfile, 'a') as afile: 
        for i in liste:
            afile.write(i)
            
write_list_to_file(all_without_above, 'try_again')


# Randomly Sort and Pick 150 Sequences 

Not every ID has an associated PDB-format file, so I select 160 instead of 150 to be safe.

In [None]:
!cat ids_0-30 | sort -R | head -160 > random.blindset2 # in case some PDB files are not available
!wc -l random.blindset2
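
The shuffle-and-head step has no fixed seed, so re-running it gives a different selection. A reproducible variant could look like the sketch below; the function name, the seed value, and the placeholder IDs are assumptions for illustration only.

```python
import random


def pick_random(ids, k, seed=None):
    """Return k distinct IDs chosen at random (fewer if the list is shorter)."""
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


# Placeholder IDs standing in for the content of the candidate ID file.
candidate_ids = [f"ID{i:04d}" for i in range(500)]
picked = pick_random(candidate_ids, 160, seed=42)
print(len(picked))
```

With the same seed and the same input order, the function returns the same 160 IDs every time, which makes it easy to document later which entries were drawn.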

Note that I have still *NOT* reduced the entries to a single chain ID:

* The file still contains all letters denoting the different chains after the ":"

* Printing it to a new file, keeping only the first letter after the ":"

In [None]:
!tail random.blindset2

In [None]:
!cat random.blindset2 | cut -c-6 > id_and_chain_blindset2

In [None]:
# checking if really no above 30 are in the new list
def lines_to_list(infile1): 
    ''' Reads all lines from a file and saves them to a list. '''
    content_list = []
    with open(infile1, "r") as rfile:
        content_list = rfile.readlines()
        return content_list
        
bigset = lines_to_list("random.blindset2")  # list of the 160 randomly selected ids
above = lines_to_list('id_above_30.hits')  # list of all ids scoring above and equal to 30% id

def remove_matches(l1, l2):
    '''Takes two lists as input and returns a list that contains
    all values of l1 that are NOT elements of l2.'''
    keepers = []          # list holding all ids that have not scored >= 30% with any of the JPRED sequences
    for i in l1:       
        if i not in l2: 
            keepers.append(i)  # keeps only ids that are not reported in the list "above"
    return keepers

keep = remove_matches(bigset, above) 
len(keep)

# 6. Download Structures from PDB

For each of the 150 sequences I have to download the pdb structure

In [None]:
!cat random.blindset2 | cut -c-4 > only_id_blindset2

In [None]:
# for i in
# https://files.rcsb.org/view/6OZJ.pdb
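
The download loop was left as a stub above. One way to fill it in is sketched below, using the URL pattern from the comment. The helper names are hypothetical, and the download function is only defined here, not called, because it needs network access. Entries that are not available in PDB format are collected instead of stopping the loop.

```python
import urllib.error
import urllib.request
from pathlib import Path

# URL pattern taken from the comment above; treat it as an assumption.
BASE_URL = "https://files.rcsb.org/view/{pdb_id}.pdb"


def pdb_url(pdb_id):
    """Build the file URL for one PDB id (whitespace and case are normalized)."""
    return BASE_URL.format(pdb_id=pdb_id.strip().upper())


def download_all(pdb_ids, out_dir):
    """Download every id into out_dir; return the ids that could not be fetched."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    missing = []
    for pdb_id in pdb_ids:
        target = out / f"{pdb_id.strip().lower()}.pdb"
        try:
            urllib.request.urlretrieve(pdb_url(pdb_id), str(target))
        except urllib.error.HTTPError:
            missing.append(pdb_id.strip())
    return missing


example_url = pdb_url("6ozj\n")
print(example_url)
```

Calling ```download_all(ids, "some_directory")``` with the IDs read from the ID file would then return the list of entries that still need a replacement.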

# 7. Generate DSSP files from selected PDB files

In [None]:
def lines_to_list(infile1): 
    ''' Reads all lines from a file and saves them to a list. '''
    content_list = []
    with open(infile1, "r") as rfile:
        content_list = rfile.readlines()
        return content_list
        
bigset = lines_to_list("only_id_blindset2") # Items contain a trailing newline char
# removing \n :
def rstrip_each_item(items):
    my_150_PDB = []
    for i in items:
        clean = i.rstrip()
        my_150_PDB.append(clean)
    return my_150_PDB

my_150_PDB = rstrip_each_item(bigset)
print(my_150_PDB)
len(my_150_PDB)

In [None]:
!cat only_id_blindset2 | sed -z 's/\n/, /g' > commas2

In [None]:
cd blindset_all_PDBs/150_blind_PDBs/


In [None]:
# one call decompresses every matching archive, no loop needed
gunzip *ent.gz

# 8. Generating DSSP files from all 150 PDB files

In [None]:
#!/bin/bash
# Running the DSSP on the extracted PDB files
for i in *.ent
do
        mkdssp -i "$i" -o "./dsspout/$i.dssp"
done

In [None]:
 scp -i ~/.ssh/id_rsa.pub ./150_blind_PDBs/* proj:~/lb2-2020-project-englander/150_blind_PDBs/

# Extracting Desired Chain From Each DSSP file


Extracting chain and secondary structure from DSSP files:

* protein chain: ```$3```
* Secondary Structure Summary ```$4```

In [None]:
!ls id_and_chain_blindset2

In [79]:
!cd /Users/ila/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/

In [80]:
pwd

'/Users/ila/01-Unibo/02_Lab2/project_blindset'

# Rewinding Undoing Some Mess
I forgot that I downloaded 160 structures --> since not all were available in PDB format I took a few extra. Then I only unzipped 150. But which ones did I NOT use? 
Who knows - sloppy documentation -_-

In [96]:
def lines_list(fname):
    with open(fname) as ofile:
        flist = ofile.readlines() # returns list containing each line of the file
        return flist

dsspinfo = lines_list("/Users/ila/01-Unibo/02_Lab2/project_blindset/deletechainmess/dssp_file_names")  # generating list of all file names e.g. "pdb7jtl.ent.dssp"

def caps_dssp_list(li): 
    '''Takes as input the raw list of DSSP file names and returns the PDB IDs of each file in a list.'''
    dssp_id_list =[]
    for el in li:
        new_el = el[3:7].upper()     # file names look like "pdb7jtl.ent.dssp": characters 3-6 hold the PDB id
        dssp_id_list.append(new_el+"\n")  # append each id to dssp_id_list
    return dssp_id_list

dssp_ids = caps_dssp_list(dsspinfo)

pdb_ids_160 =lines_list("only_id_blindset2")

# print(pdb_ids_160)

def keep_matches(biglist, partiallist):
    '''Stores all common values that are reported in both lists.
    Returns the intersection of items as keepers list.'''
    keepers = []
    for i in biglist:
        if i in partiallist:
            keepers.append(i) 
    return keepers

all150dssp = keep_matches(pdb_ids_160, dssp_ids)
# print(all150dssp)

def remove_matches(l1, l2):
    '''Takes two lists as input and returns a list that contains
    all values of l1 that are NOT elements of l2.'''
    keepers = []          # list holding all ids of l1 that have no DSSP file
    for i in l1:       
        if i not in l2: 
            keepers.append(i)  # keeps only ids that are not reported in the list "above"
    return keepers

keep = remove_matches(pdb_ids_160, dssp_ids) 

for i in keep:
    print(i.rstrip()+",", end="") # deleted them manually  -_- brainfog

4Y4O,4YBB,4Y4O,5XJL,6EHA,6NTV,6T8S,4YWN,5FLY,5JVV,

### Unused:
From the original download I did NOT use the following PDB ids:
   * 4Y4O,4YBB,4Y4O,5XJL,6EHA,6NTV,6T8S,4YWN,5FLY,5JVV,
   * saved them to ```~/lb2-2020-project-englander/unused_10```

In [1]:
# sending them to the VM for safekeeping (in case I need to replace any of the files generated)
# the -r argument --> recursively copy these files to the VM
scp -i ~/.ssh/id_rsa.pub -r ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/ proj:~/lb2-2020-project-englander/

pdb4ywn.ent                                   100%  395KB 930.2KB/s   00:00    
pdb5xjl.ent                                   100%  441KB 800.6KB/s   00:00    
pdb5fly.ent                                   100% 1257KB   1.2MB/s   00:01    
pdb5jvv.ent                                   100%  880KB 947.6KB/s   00:00    
pdb6ntv.ent                                   100%  481KB   1.0MB/s   00:00    
pdb6eha.ent                                   100%  374KB   1.2MB/s   00:00    


In [3]:
# ran dssp separately on these files to have some backup in case some don't have what they need
scp -i ~/.ssh/um19_id_rsa -r um19@m19.lsb.biocomp.unibo.it:~/lb2-2020-project-englander/unused_10/dsspout_unused10/ ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/



pdb4ywn.ent.dssp                              100%   41KB 414.3KB/s   00:00    
pdb6eha.ent.dssp                              100%   61KB 565.7KB/s   00:00    
pdb6ntv.ent.dssp                              100%   95KB 411.9KB/s   00:00    
pdb5fly.ent.dssp                              100%   78KB 565.8KB/s   00:00    
pdb5xjl.ent.dssp                              100%   86KB 913.9KB/s   00:00    
pdb5jvv.ent.dssp                              100%   83KB 563.4KB/s   00:00    


# The function

# extract_ss_from_dssp_using_2input_files.py

In [None]:
def lines_list(infile1):                                              # call list of file names and for dsspfile
    ''' Reads all lines from a file and saves them to a list. '''
    with open(infile1) as ofile:
        flist = ofile.readlines() # returns list containing each line of the file
        return flist

def relevant_lines(infile1, desired_chain):
    '''Takes list (extracted from a DSSP file) and the name of the desired_chain as input.
    Returns 2 strings: ss_string holds the secondary structure mapping and aa_string holds 
    the amino acid information. Missing residues (when no atomic information of the PDB is 
    present) are assigned the letter "C" (coil) in the ss_string and "X" in the aa_string.'''
    dssp_list = lines_list(infile1)     # contains all lines from the dssp file.
    relevant = False # boolean variable            
#     desired_chain = "A"                            # change to load from "id_and_chain_blindset2"
    ss_string = ''
    aa_string = ''
    for line in dssp_list:
        if '#' in line: # find last line before relevant output
            relevant =True   # flips rel to true - so the folowing lines are saved
            continue
        if relevant:
            if line[11] == desired_chain:
                ss_string += line[16]
                if line[13] == "!":
                    aa_string += "X"
                else:
                    aa_string += line[13]
    return ss_string, aa_string

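def raw_to_threclasses(rawstring):
    '''Maps the 8 DSSP classes to 3 classes: H, G, I -> H; B, E -> E; everything else -> C.'''
    structure_dict = {"H":"H", "G":"H", "I":"H", "B":"E", "E":"E", "T":"C", "S":"C", " ":"C"} 
    threeclasses = ''
    for letter in rawstring:
        threeclasses += structure_dict.get(letter, "C")  # unlisted symbols are treated as coil
    return threeclasses    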
    
def generate_dssp_fasta(filename_id, chain): 
    '''Writes SS to dsspfile and AA to fastafile'''
    # reads dssp file and returns ss_string and aa_string.
    ss_string, aa_string = relevant_lines("/Users/ila/01-Unibo/02_Lab2/project_blindset/deletechainmess/testruns/pdb"+filename_id+".ent.dssp", chain)     
    with open(filename_id+".dssp", 'w') as dsspfile:
        ss = raw_to_threclasses(ss_string)
        dsspfile.write(">"+filename_id+"_"+chain+"\n")
        dsspfile.write(ss+"\n")
        
    with open(filename_id+".fasta", 'w') as fastafile:    
        fastafile.write(">"+filename_id+"_"+chain+"\n")
        fastafile.write(aa_string+"\n")
        if aa_string == "":
                print('empty',filename_id, chain, end="")
        
# creating list holding all pdb ids and chain descriptions        
id_chain = lines_list('/Users/ila/01-Unibo/02_Lab2/project_blindset/deletechainmess/BLINDset_id_and_chain')  

for el in id_chain:            # each el is like "6LTZ:A"
    field_list = el.split(':') # list  BOTH contains ID [0] and chain [1]
    fname_id =  field_list[0].lower()                   #
#     fname = "pdb"+field_list[0].lower()+".ent.dssp"     #fname to be used in the function calls
    chain = field_list[1].rstrip()                             # chain name to be used in each function call
    generate_dssp_fasta(fname_id, chain)
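
To check the fixed-column logic without a real DSSP file, the extraction can be tried on synthetic lines. The sketch below only assumes the column indices used in the script above (chain at index 11, amino acid at index 13, secondary structure at index 16); ```make_line``` and ```chain_strings``` are hypothetical helpers, and the lines are not real DSSP output.

```python
# Helix and strand symbols; every other symbol is mapped to coil.
EIGHT_TO_THREE = {"H": "H", "G": "H", "I": "H", "B": "E", "E": "E"}


def make_line(chain, aa, ss):
    """Build a synthetic line with chain, residue and SS at indices 11, 13, 16."""
    chars = [" "] * 20
    chars[11] = chain
    chars[13] = aa
    chars[16] = ss
    return "".join(chars)


def chain_strings(lines, chain):
    """Return (amino acid string, three-class SS string) for one chain."""
    aa, ss = [], []
    for line in lines:
        if len(line) > 16 and line[11] == chain:
            aa.append("X" if line[13] == "!" else line[13])
            ss.append(EIGHT_TO_THREE.get(line[16], "C"))
    return "".join(aa), "".join(ss)


lines = [
    make_line("A", "M", "H"),
    make_line("A", "K", "G"),
    make_line("B", "L", "E"),   # other chain, ignored
    make_line("A", "V", " "),   # blank SS field counts as coil
    make_line("A", "T", "E"),
]
aa_string, ss_string = chain_strings(lines, "A")
print(aa_string, ss_string)
```

Both strings always have the same length, which is a quick sanity check to run on every generated FASTA and DSSP pair.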


In [5]:


def lines_list(infile1):                                              # call list of file names and for dsspfile
    ''' Reads all lines from a file and saves them to a list. '''
    with open(infile1) as ofile:
        flist = ofile.readlines() # returns list containing each line of the file
        return flist
    
def findX(filename):
    with open(filename+".fasta") as myfasta:
        header = myfasta.readline()
        sequ = myfasta.readline()
        if 'X' in sequ:   # plain membership test; no comparison with True needed
            print(filename, sequ)
        else:
            print("no X")

id_chain = lines_list('/Users/ila/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/BLINDset_id_and_chain')              

for el in id_chain:            # each el is like "6LTZ:A"
    field_list = el.split(':') # list  BOTH contains ID [0] and chain [1]
    fname_id =  field_list[0].lower()                   #
#     fname = "pdb"+field_list[0].lower()+".ent.dssp"     #fname to be used in the function calls
    chain = field_list[1].rstrip()                             # chain name to be used in each function call
    findX(fname_id)        

['6LTZ:A\n', '6EXX:A\n', '5XGA:A\n', '5WUJ:B\n', '4ZY7:A\n', '5F2A:A\n', '5D1R:A\n', '5UNI:A\n', '5U7E:A\n', '5ANP:A\n', '6HKS:A\n', '6J0Y:C\n', '6L77:A\n', '6KKO:A\n', '6GW6:B\n', '5LDD:B\n', '6K7Q:A\n', '4Y0L:A\n', '5GNA:B\n', '5C8A:A\n', '4ZC4:A\n', '6EI6:A\n', '5UMV:A\n', '6OR3:A\n', '5U4U:A\n', '5XKS:A\n', '5GHL:A\n', '5XYF:A\n', '5N07:A\n', '5FB9:A\n', '5CEG:A\n', '7BWF:B\n', '4YTE:A\n', '5V0M:A\n', '6SE1:A\n', '5UC0:A\n', '5WNW:A\n', '5IHF:A\n', '5T2Y:A\n', '5IB0:A\n', '6USC:A\n', '4UIQ:A\n', '6YJ1:A\n', '6DHX:A\n', '5D6T:A\n', '6DN4:A\n', '5BP5:C\n', '6ISU:A\n', '6FSF:A\n', '5VOG:A\n', '5IR2:A\n', '5D71:A\n', '5BPX:A\n', '5II0:A\n', '4Y0O:A\n', '5AUN:A\n', '5C5Z:A\n', '5KWV:A\n', '6MDW:A\n', '5FQ0:A\n', '7BVV:A\n', '5AV5:A\n', '5FFL:A\n', '6OOD:A\n', '5KQA:A\n', '5DCF:A\n', '5GKE:A\n', '5ZRY:A\n', '5UIV:A\n', '6GBI:A\n', '5MC9:B\n', '6FWT:A\n', '6HSV:A\n', '6HFG:B\n', '5A88:A\n', '5YEI:B\n', '6NDR:A\n', '5KLC:A\n', '6MAB:A\n', '6AOZ:A\n', '5T2X:A\n', '5LTF:A\n', '6J19:B\n', '6T

In [5]:
def lines_list(infile1):                                              # call list of file names and for dsspfile
    ''' Reads all lines from a file and saves them to a list. '''
    with open(infile1) as ofile:
        flist = ofile.readlines() # returns list containing each line of the file
        return flist
    
def count_len_seq(filename):
    with open('/Users/ila/01-Unibo/02_Lab2/project_blindset/blind_fasta/'+filename) as myfasta:
        header = myfasta.readline()
        sequ = myfasta.readline().rstrip()   # drop the newline so it is not counted
        lseq = len(sequ)
        if lseq < 50:
            print(filename, "less than 50")
        if lseq > 300:
            print(filename, 'more than 300')

id_chain = lines_list('/Users/ila/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/BLINDset_id_and_chain')              

for el in id_chain:            # each el is like "6LTZ:A"
    field_list = el.split(':') # list  BOTH contains ID [0] and chain [1]
    fname_id =  field_list[0].lower()
#     fname = "pdb"+field_list[0].lower()+".ent.dssp"     #fname to be used in the function calls
    chain = field_list[1].rstrip()                             # chain name to be used in each function call
    count_len_seq(fname_id+'.fasta')        

6j19.fasta less than 50


# Replacing 6j19.fasta as it has < 50 aa


Choosing a replacement from the "unused set" in
``` ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/unused_10_fasta```

List of unused sequences and chains
```['5XJL:2', '6EHA:A', '6NTV:A', '4YWN:A', '5FLY:A', '5FLY:A']```

Replacing 6j19.fasta and its dssp file with 4ywn_A to restore the set of 150 sequences

In [1]:
pwd

/Users/ila/01-Unibo/02_Lab2/project_blindset


In [2]:
mv /Users/ila/01-Unibo/02_Lab2/project_blindset/blind_dssp/6j19.dssp ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/unused_9_dssp/ 
mv /Users/ila/01-Unibo/02_Lab2/project_blindset/blind_fasta/6j19.fasta ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/unused_10_fasta/

In [5]:
ls -1 ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/unused_10_fasta

4ywn.fasta
5fly.fasta
5xjl.fasta
6eha.fasta
6j19.fasta
6ntv.fasta


In [6]:
mv ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/unused_9_dssp/4ywn.dssp /Users/ila/01-Unibo/02_Lab2/project_blindset/blind_dssp/
mv ~/01-Unibo/02_Lab2/project_blindset/blindset_all_PDBs/150_blind_PDBs/unused_10/unused_10_fasta/4ywn.fasta /Users/ila/01-Unibo/02_Lab2/project_blindset/blind_fasta/


In [None]:
# Corrected the mistake: I now have 150 dssp and fasta files of sufficient length


In [16]:
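# Testing how to detect lowercase letters in a sequence.
# Note: str.islower() is True only if ALL cased characters are lowercase, so a mixed-case string gives False: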
a = "AaAA"
a.islower()

False
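
Since `islower()` on the whole string misses mixed-case sequences, a per-character test is needed. The following is a minimal sketch of such a check; the helper name `has_lowercase` is hypothetical and not part of the project scripts.

```python
def has_lowercase(sequence: str) -> bool:
    """Return True if at least one character of the sequence is lowercase."""
    return any(ch.islower() for ch in sequence)

# Whole-string test versus per-character test on the same toy string
print("AaAA".islower())       # False: not every cased character is lowercase
print(has_lowercase("AaAA"))  # True: one lowercase character is enough
```

This is the same per-character logic that the script further down applies to each FASTA file.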

# Checking whether any FASTA file of my blind set contains lowercase letters, which indicate SS-bridges (cysteines)

In [4]:
!head list_of_blindset_ids_only

4uiq
4y0l
4y0o
4yte
4ywn
4zc4
4zey
4zkp
4zlr
4zrz


### ``` project_blindset/scripts/lowercase_infile.py```

In [32]:
#!/anaconda3/bin/python
import sys
import os

# Script to check for lowercase letters, which mark SS-bridge cysteines.
def find_lowerletters(infile1):
    '''
    Reads the header (line 1) and the sequence (line 2) of a file.
    Checks each character of the sequence: every lowercase character is replaced by a C,
    as dssp designates SS-bridged cysteine pairs with lowercase letters.
    Returns the header and the corrected sequence.
    '''
    with open(infile1) as ofile:
        lines = ofile.readlines()
    header = lines[0]
    sequence = lines[1].rstrip()    # sequence line without the trailing newline
    upper_seq = ''
    found_lower = False
    for char in sequence:
        if char.islower():          # lowercase marks a bridged cysteine
            upper_seq += 'C'        # all lowercase letters are converted to cysteines
            found_lower = True
        elif char.isupper():
            upper_seq += char
    if found_lower:
        print(infile1)              # report each affected file once
    return header, upper_seq

def write_upper(infile1):
    '''
    Calls find_lowerletters and writes the result back
    to the same file, overwriting its previous content.
    '''
    header, upper_seq = find_lowerletters(infile1)
    with open(infile1, 'w') as wfile:           # mode 'w' truncates the file on open
        wfile.write(header + upper_seq + '\n')

def ids_list(infile2):
    ''' Reads all lines from a file and returns them as a list without trailing newlines. '''
    cleanlines_list = []
    with open(infile2) as ofile:
        flist = ofile.readlines()   # list of every line of the file
    for line in flist:
        nonewline = line.rstrip()
        cleanlines_list.append(nonewline)
    return cleanlines_list

if __name__ == '__main__':
    ids_path = sys.argv[1]
    ids = ids_list(ids_path)        # "./list_of_blindset_ids_only"
    blind_set_path = sys.argv[2]
    for ID in ids:
        write_upper(os.path.join(blind_set_path, ID+".fasta"))


### Reran!
--> No lowercase letters in my fasta blindset
Updated script and fixed all files.

### Checking % of E, C, H

In [1]:
# first I want to read line 2 of each file 
# and save the string

def count_C_H_E(infile1):
    ''' Reads line 2 (index 1) from a FASTA-like dssp file. Returns a tuple
    containing the number of C, H, E in that order. '''
    C = 0
    H = 0
    E = 0
    with open(infile1) as ofile:
        flist = ofile.readlines()[1:]   # all lines of the file except the header (line 0)
    nonewline = flist[0].rstrip()       # secondary-structure string without the trailing newline
    for character in nonewline:
        if character == "-":            # coil (C) is written as "-"
            C += 1
        elif character == 'H':
            H += 1
        elif character == 'E':
            E += 1
        else:
            print("Unknown character!!!", character)
    return C, H, E    # number of C, H and E
    
# a = lines_list("./blind_dssp/4ywn.dssp")
# print(a)

def stitchingstrings(looplist):
    ''' Reads dssp file names from looplist (one per line) and returns the
    pooled percentages of C, H and E over all listed files. '''
    sumC = 0           # running totals over all files
    sumH = 0
    sumE = 0
    with open(looplist, 'r') as filenames:
        all_list = filenames.readlines()  # reading file names into a list
    for el in all_list:                   # for each file name
        filename = el.rstrip()
        C, H, E = count_C_H_E('trainingset/dssp/'+filename)   # counts for the current file
        sumC += C                         # add the counts of this file to the totals
        sumH += H
        sumE += E
    totchar = sumC + sumH + sumE
    percC = sumC/totchar*100
    percH = sumH/totchar*100
    percE = sumE/totchar*100
    return percC, percH, percE
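
The pooled percentage computation can also be sketched on in-memory strings with `collections.Counter`. This is only an illustrative alternative, not the code used for the results below; the function name `ss_percentages` and the two example strings are made up.

```python
from collections import Counter

def ss_percentages(ss_strings):
    """Pooled percentage of each DSSP symbol over a list of strings."""
    counts = Counter()
    for ss in ss_strings:
        counts.update(ss.strip())    # count every symbol, ignoring surrounding whitespace
    total = sum(counts.values())
    return {symbol: 100 * n / total for symbol, n in counts.items()}

# Made-up secondary-structure strings ("-" stands for coil)
example = ["--HHHH--", "EEEE--HH"]
print(ss_percentages(example))
```

As in `stitchingstrings`, the counts are summed over all strings first and divided by the total afterwards, so longer sequences weigh more than shorter ones.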
           

In [38]:
pwd

'/Users/ila/01-Unibo/02_Lab2/files_lab2_project'

In [40]:
C, H, E = stitchingstrings("dssp_filenames")
print(C, H, E)

37.91366317169069 37.68020969855832 24.40612712975098


### The blind_set contains

37.91% C

37.68% H

24.41% E

In [2]:
C, H, E = stitchingstrings("dssp_filenames_trainingset")
print(C, H, E)

42.16215473786861 35.592731468128065 22.245113794003323


### The training_set contains

42.16% C

35.59% H

22.25% E

| State | training set | blind set |
|-------|--------------|-----------|
| C     | 42.16%       | 37.91%    |
| H     | 35.59%       | 37.68%    |
| E     | 22.25%       | 24.41%    |
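
To make the comparison explicit, the difference between the two sets can be computed per state. The sketch below only reuses the values printed by the two `stitchingstrings` calls above; the variable names are arbitrary.

```python
# Percentages as printed by the two stitchingstrings calls above
blind = {"C": 37.91366317169069, "H": 37.68020969855832, "E": 24.40612712975098}
training = {"C": 42.16215473786861, "H": 35.592731468128065, "E": 22.245113794003323}

# Difference blind - training, in percentage points
differences = {state: blind[state] - training[state] for state in ("C", "H", "E")}

for state, diff in differences.items():
    print(f"{state}: {diff:+.2f} percentage points")
```

The blind set has about 4.25 percentage points less coil and about 2.09 and 2.16 percentage points more helix and strand, respectively, than the training set.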