# Download homologues for PPIA and PPIB

## Retrieve homologues from UniRef (UniProt)

To retrieve the homologues, I use the UniProt APIs in combination with some Bash scripting.
The code retrieves all homologous sequences of PPIA and PPIB from E. coli.
With minor adaptations, it will work with the accession number of any protein available on UniProt.

The general idea behind the program is as follows.

1. Use the accession number to retrieve the HTML page of the protein from UniProt (here PPIA and PPIB).
2. Parse the HTML to find the ID of the UniRef50 cluster that the protein belongs to.
3. Use this UniRef50 ID to retrieve the list of all proteins in the cluster. UniRef50 clusters group sequences that have at least 50 percent sequence identity to (and 80 percent overlap with) the longest sequence in the cluster.
4. Retrieve the sequences of these proteins and write them to a FASTA file.
5. Filter this FASTA file with SignalP-5.0, which predicts signal peptides. PPIA homologues are kept only if they are predicted to have a signal peptide; PPIB homologues are kept only if they are predicted to have none.
6. Cluster the filtered sequences with CD-HIT to remove redundant sequences and write the final FASTA files. I used different thresholds (0.95, 0.90, 0.85, 0.80, 0.75, 0.70). A threshold of 0.95 means that the sequences in the final FASTA file share at most 95 percent sequence identity with each other.

The final FASTA files with the non-redundant sequences are named like this:
* PPIB_UniRef50_P23869_CDHIT_C-0.95.fasta
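The file names used along the pipeline follow from the protein name, the UniRef50 ID and the CD-HIT threshold. The sketch below only illustrates that naming scheme; the function `pipeline_names` is a hypothetical helper that is not part of the notebook.

```bash
# Hypothetical helper: print the file names produced by each stage
# (download, SignalP filtering, CD-HIT clustering) for one cluster.
pipeline_names() {
    local name=$1
    local uniref=$2
    local threshold=$3
    local raw=${name}_${uniref}.fasta
    # Strip the .fasta suffix and append the stage-specific ending
    local filtered=${raw%.fasta}_signalp5_filtered.fasta
    local final=${raw%.fasta}_CDHIT_C-${threshold}.fasta
    printf '%s\n' "${raw}" "${filtered}" "${final}"
}

pipeline_names PPIB UniRef50_P23869 0.95
```

For the arguments shown, the last line printed is `PPIB_UniRef50_P23869_CDHIT_C-0.95.fasta`, the example name given above.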

### PPIA

The UniProt accession number of **PPIA** from E. coli is **P0AFL3**.

In [1]:
accession=P0AFL3
name=PPIA
fileName=${name}_${accession}

In [2]:
# Retrieve the HTML page for the accession number
accession_html=${fileName}.html
if [ ! -f ${accession_html} ]; then
    wget -O ${accession_html} https://www.uniprot.org/uniprot/${accession} 2> /dev/null
fi

In [3]:
# List all the UniRef accession codes
# Here limited to UniRef50, but can be adapted to also include UniRef90 and UniRef100
UniRef=${fileName}_UniRef.list
if [ ! -f ${UniRef} ]; then
    # [A-Z0-9] without the stray comma; sort -u also removes non-adjacent duplicates
    grep -ohE 'uniref/UniRef50_[A-Z0-9]{1,20}' ${accession_html} | cut -d '/' -f 2 | sort -u > ${UniRef}
fi
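To see what the extraction step does, it can be tried on a few hand-written lines. This is a minimal sketch: `extract_uniref50_ids` is a hypothetical helper, and the input fragment is made up for illustration, so the real UniProt page may look different.

```bash
# Hypothetical helper: read HTML on stdin and print the unique UniRef50 IDs.
extract_uniref50_ids() {
    grep -oE 'uniref/UniRef50_[A-Z0-9]{1,20}' | cut -d '/' -f 2 | sort -u
}

# Made-up input: the same UniRef50 link twice, separated by a UniRef90 link.
printf '%s\n' \
    'href="/uniref/UniRef50_O53021"' \
    'href="/uniref/UniRef90_P0AFL3"' \
    'href="/uniref/UniRef50_O53021"' \
    | extract_uniref50_ids
```

This prints `UniRef50_O53021` once. With `uniq` instead of `sort -u`, the two occurrences would both be kept, because they are not on adjacent lines.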

In [4]:
# Retrieve a list of all the homologue accession codes for a given UniRef accession code
for unirefEntry in $(cat ${UniRef}); do
    LIST=${unirefEntry}.list
    FASTA=${unirefEntry}.fasta
    if [ ! -f ${LIST} ]; then
        
        wget -O - https://www.uniprot.org/uniref/${LIST} | grep -v '^UPI' > ${LIST}
        echo "" > ${FASTA}
        for protein in $(cat ${LIST}); do
            curl https://www.uniprot.org/uniprot/${protein}.fasta >> ${FASTA} 2> /dev/null
            # NOTE
            ## Most of the time is spent waiting for the request of each FASTA file.
            ## It is probably possible to speed this up by requesting one big FASTA file
            ## that includes all the accession codes at once.
            ## I have to look up how to do that.
        done
        sed -i '1d' ${FASTA}
        
        # Check if all fasta files were downloaded without fail
        # Numbers should be the same
        echo ${LIST}: $(cat ${LIST} | wc -l)
        echo ${FASTA}: $(grep '>' ${FASTA} | wc -l)
        echo ""
        echo ""
        
        # Give the FASTA and LIST files the protein name as prefix
        mv ${FASTA} ${name}_${FASTA}
        mv ${LIST} ${name}_${LIST}
    fi
done

--2019-11-25 09:32:34--  https://www.uniprot.org/uniref/UniRef50_O53021.list
Resolving www.uniprot.org (www.uniprot.org)... 193.62.192.81, 128.175.245.185
Connecting to www.uniprot.org (www.uniprot.org)|193.62.192.81|:443... connected.
HTTP request sent, awaiting response... 200 
Length: unspecified [text/plain]
Saving to: ‘STDOUT’

-                       [ <=>                ]  20.66K  --.-KB/s    in 0.02s   

2019-11-25 09:32:34 (953 KB/s) - written to stdout [21154]

UniRef50_O53021.list: 1290
UniRef50_O53021.fasta: 1290
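The note in the loop above asks whether all sequences can be fetched in one request. A possible sketch follows, under the assumption that the current UniProt REST service offers a `uniprotkb/stream` endpoint and a `uniref_cluster_50` query field. Neither is used in this notebook, so both should be checked against the UniProt REST documentation before use. The function names are hypothetical.

```bash
# Hypothetical helper: build the URL that streams all UniProtKB members
# of one UniRef50 cluster as a single FASTA file.
# ASSUMPTION: endpoint and query field as described in the text above.
uniref50_members_url() {
    local cluster=$1
    printf 'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=uniref_cluster_50:%s\n' "${cluster}"
}

# Hypothetical helper: download the members into <cluster>.fasta in one request.
# --fail makes curl return a non-zero status on HTTP errors.
download_uniref50_members() {
    local cluster=$1
    curl --fail --silent --show-error -o "${cluster}.fasta" "$(uniref50_members_url "${cluster}")"
}
```

Calling `download_uniref50_members UniRef50_O53021` would then replace the inner `curl` loop. Because the query targets UniProtKB, UniParc-only members (the `UPI` entries removed above) should not be returned, but that is also worth verifying by comparing counts.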




### PPIB

The UniProt accession number of **PPIB** from E. coli is **P23869**.

In [5]:
accession=P23869
name=PPIB
fileName=${name}_${accession}

In [6]:
accession_html=${fileName}.html
if [ ! -f ${accession_html} ]; then
    wget -O ${accession_html} https://www.uniprot.org/uniprot/${accession} 2> /dev/null
fi

In [7]:
UniRef=${fileName}_UniRef.list
if [ ! -f ${UniRef} ]; then
    # [A-Z0-9] without the stray comma; sort -u also removes non-adjacent duplicates
    grep -ohE 'uniref/UniRef50_[A-Z0-9]{1,20}' ${accession_html} | cut -d '/' -f 2 | sort -u > ${UniRef}
fi

In [8]:
for unirefEntry in $(cat ${UniRef}); do
    LIST=${unirefEntry}.list
    FASTA=${unirefEntry}.fasta
    if [ ! -f ${LIST} ]; then
        
        wget -O - https://www.uniprot.org/uniref/${LIST} | grep -v '^UPI' > ${LIST}
        echo "" > ${FASTA}
        for protein in $(cat ${LIST}); do
            curl https://www.uniprot.org/uniprot/${protein}.fasta >> ${FASTA} 2> /dev/null
        done
        sed -i '1d' ${FASTA}
        
        # Check if all fasta files were downloaded without fail
        # Numbers should be the same
        echo ${LIST}: $(cat ${LIST} | wc -l)
        echo ${FASTA}: $(grep '>' ${FASTA} | wc -l)
        echo ""
        echo ""
        
        # Give the FASTA and LIST files the protein name as prefix
        mv ${FASTA} ${name}_${FASTA}
        mv ${LIST} ${name}_${LIST}
    fi
done

--2019-11-25 09:35:46--  https://www.uniprot.org/uniref/UniRef50_P23869.list
Resolving www.uniprot.org (www.uniprot.org)... 193.62.192.81, 128.175.245.185
Connecting to www.uniprot.org (www.uniprot.org)|193.62.192.81|:443... connected.
HTTP request sent, awaiting response... 200 
Length: unspecified [text/plain]
Saving to: ‘STDOUT’

-                       [ <=>                ]   7.88K  --.-KB/s    in 0s      

2019-11-25 09:35:46 (26.3 MB/s) - written to stdout [8067]

UniRef50_P23869.list: 579
UniRef50_P23869.fasta: 579
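Equal counts show that nothing is missing. If the two numbers ever differ, it helps to know which accessions failed to download. A minimal sketch, assuming FASTA headers of the form `>db|accession|name ...` as returned by UniProt; `missing_accessions` is a hypothetical helper.

```bash
# Hypothetical helper: print the accessions from the list file
# that have no matching header in the FASTA file.
missing_accessions() {
    local list=$1
    local fasta=$2
    local found
    found=$(mktemp)
    # Accessions present in the FASTA headers (second '|'-separated field)
    grep '^>' "${fasta}" | cut -d '|' -f 2 | sort -u > "${found}"
    # comm -23 keeps the lines that occur only in the first (sorted) input
    sort -u "${list}" | comm -23 - "${found}"
    rm -f "${found}"
}
```

For example, `missing_accessions PPIB_UniRef50_P23869.list PPIB_UniRef50_P23869.fasta` prints nothing when every listed accession was downloaded.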




## Keep only protein sequences with the expected SignalP-5.0 prediction

### PPIA

In [9]:
for INPUT in PPIA_UniRef*.fasta; do
    # Use SignalP to produce a summary of which sequences have a signal peptide
    ./bin/signalp -fasta \
    ${INPUT} \
    -org gram- \
    -format short \
    -prefix ${INPUT/.fasta/}
    
    # Count each SignalP prediction class in the FASTA file (OTHER means no signal peptide)
    SIGNALp5=${INPUT/.fasta/_summary.signalp5}
    echo $(tail -n +3 ${SIGNALp5} | cut -d$'\t' -f 2 | sort | uniq -c)
    
    # Only select homologues that have a Sec/SPI signal peptide
    LIST=${SIGNALp5/_summary.signalp5/_signalp5_filtered.list}
    tail -n +3 ${SIGNALp5} | \
    grep "SP(Sec/SPI)" | \
    cut -d '|' -f 2 \
    > ${LIST}
    echo $(wc -l ${LIST})
    
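    # Build a new FASTA file that contains only the listed accessions.
    # A record is kept when the second '|'-separated field of its header is in the list.
    # (A sed range followed by sed '$d' would drop the last sequence line of the final record.)
    FASTA=${LIST/list/fasta}
    awk 'NR == FNR { keep[$1]; next }
         /^>/ { split($0, id, "|"); printing = (id[2] in keep) }
         printing' ${LIST} ${INPUT} > ${FASTA}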
    echo $(grep ">" ${FASTA} | wc -l) ${FASTA}
done

SignalP-5.0. Starting fasta file reading...
Total proteins read: 1290.
Organism: gram-.
Starting protein prediction...
Temporary directory: /tmp/signalp5.178076445
Protein prediction done!
Generating output files...
Completed.
1 LIPO(Sec/SPII) 27 OTHER 1262 SP(Sec/SPI)
1262 PPIA_UniRef50_O53021_signalp5_filtered.list
1262 PPIA_UniRef50_O53021_signalp5_filtered.fasta
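The selection step can also be written as an exact match on the prediction column. The sketch below assumes the summary layout that the commands above rely on: two header lines, tab-separated columns, the sequence ID (`db|accession|name`) in column 1 and the prediction class in column 2. `accessions_with_prediction` is a hypothetical helper.

```bash
# Hypothetical helper: print the accessions whose prediction column
# is exactly the requested class, e.g. 'SP(Sec/SPI)' or 'OTHER'.
accessions_with_prediction() {
    local summary=$1
    local class=$2
    # NR > 2 skips the two header lines; $2 is the prediction column
    awk -F '\t' -v class="${class}" \
        'NR > 2 && $2 == class { split($1, id, "|"); print id[2] }' "${summary}"
}
```

Matching the whole column rather than searching the full line means a class name that happens to appear elsewhere on a line cannot select the wrong record. The same function serves both cases: `'SP(Sec/SPI)'` for PPIA and `OTHER` for PPIB.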


### PPIB

In [10]:
for INPUT in PPIB_UniRef*.fasta; do
    ./bin/signalp -fasta \
    ${INPUT} \
    -org gram- \
    -format short \
    -prefix ${INPUT/.fasta/}
    
    SIGNALp5=${INPUT/.fasta/_summary.signalp5}
    echo $(tail -n +3 ${SIGNALp5} | cut -d$'\t' -f 2 | sort | uniq -c)
    
    # Only select homologues that have NO signal peptide
    LIST=${SIGNALp5/_summary.signalp5/_signalp5_filtered.list}
    tail -n +3 ${SIGNALp5} | \
    grep "OTHER" | \
    cut -d '|' -f 2 \
    > ${LIST}
    echo $(wc -l ${LIST})
    
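    # Build a new FASTA file that contains only the listed accessions.
    # A record is kept when the second '|'-separated field of its header is in the list.
    FASTA=${LIST/list/fasta}
    awk 'NR == FNR { keep[$1]; next }
         /^>/ { split($0, id, "|"); printing = (id[2] in keep) }
         printing' ${LIST} ${INPUT} > ${FASTA}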
    echo $(grep ">" ${FASTA} | wc -l) ${FASTA}
done

SignalP-5.0. Starting fasta file reading...
Total proteins read: 579.
Organism: gram-.
Starting protein prediction...
Temporary directory: /tmp/signalp5.198055991
Protein prediction done!
Generating output files...
Completed.
575 OTHER 4 SP(Sec/SPI)
575 PPIB_UniRef50_P23869_signalp5_filtered.list
575 PPIB_UniRef50_P23869_signalp5_filtered.fasta


## Use CD-HIT to remove redundant sequences

### PPIA

In [11]:
# Using CD-HIT, cluster the sequences with different thresholds
for INPUT in PPIA_UniRef*_signalp5_filtered.fasta; do
    for c in 0.95 0.90 0.85 0.80 0.75 0.70; do
        OUTPUT=${INPUT/signalp5_filtered/CDHIT_C-${c}}
        ./cdhit/cd-hit \
        -i ${INPUT} \
        -o ${OUTPUT} \
        -c ${c} \
        -n 5 \
        -M 4000 \
        -T 2 \
        >/dev/null 2>/dev/null
        
        # Report the size of the output file (wc -l counts lines, not sequences)
        echo $(wc -l ${OUTPUT})
    done
done

713 PPIA_UniRef50_O53021_CDHIT_C-0.95.fasta
389 PPIA_UniRef50_O53021_CDHIT_C-0.90.fasta
225 PPIA_UniRef50_O53021_CDHIT_C-0.85.fasta
154 PPIA_UniRef50_O53021_CDHIT_C-0.80.fasta
110 PPIA_UniRef50_O53021_CDHIT_C-0.75.fasta
65 PPIA_UniRef50_O53021_CDHIT_C-0.70.fasta
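The numbers above come from `wc -l`, which counts lines. If the sequences in the output are wrapped over several lines, the line count is larger than the number of sequences; whether that is the case here was not checked. A minimal sketch for counting records instead, with `count_sequences` as a hypothetical helper:

```bash
# Hypothetical helper: count FASTA records by counting header lines.
count_sequences() {
    grep -c '^>' "$1"
}

# Report the number of sequences for every threshold (skipped if no files match).
for f in PPIA_UniRef*_CDHIT_C-*.fasta; do
    [ -f "${f}" ] || continue
    echo "$(count_sequences "${f}") ${f}"
done
```

This is the same counting approach that the SignalP cells use with `grep ">" | wc -l`.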


### PPIB

In [12]:
for INPUT in PPIB_UniRef*_signalp5_filtered.fasta; do
    for c in 0.95 0.90 0.85 0.80 0.75 0.70; do
        OUTPUT=${INPUT/signalp5_filtered/CDHIT_C-${c}}
        ./cdhit/cd-hit \
        -i ${INPUT} \
        -o ${OUTPUT} \
        -c ${c} \
        -n 5 \
        -M 4000 \
        -T 2 \
        >/dev/null 2>/dev/null
        
        # Report the size of the output file (wc -l counts lines, not sequences)
        echo $(wc -l ${OUTPUT})
    done
done

366 PPIB_UniRef50_P23869_CDHIT_C-0.95.fasta
255 PPIB_UniRef50_P23869_CDHIT_C-0.90.fasta
234 PPIB_UniRef50_P23869_CDHIT_C-0.85.fasta
187 PPIB_UniRef50_P23869_CDHIT_C-0.80.fasta
148 PPIB_UniRef50_P23869_CDHIT_C-0.75.fasta
112 PPIB_UniRef50_P23869_CDHIT_C-0.70.fasta
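Next to each output FASTA file, CD-HIT also writes a `.clstr` file that lists the members of every cluster and marks the representative with a trailing `*`. The sketch below summarises such a file as one line per cluster. `cluster_sizes` is a hypothetical helper, and the field positions are an assumption based on the usual `.clstr` layout (`index`, `length,`, `>name...`, then `*` or an identity value).

```bash
# Hypothetical helper: print "cluster<TAB>size<TAB>representative" for each cluster.
cluster_sizes() {
    awk '
        # A new cluster starts: flush the previous one, then reset the counters
        /^>Cluster/ { if (n) print cluster "\t" n "\t" rep; cluster = $2; n = 0; rep = ""; next }
        # Every other line is one cluster member
        { n++ }
        # The representative line ends with "*"; strip ">" and the trailing "..."
        /\*$/ { rep = $3; sub(/^>/, "", rep); sub(/\.\.\.$/, "", rep) }
        END { if (n) print cluster "\t" n "\t" rep }
    ' "$1"
}
```

For example, `cluster_sizes PPIB_UniRef50_P23869_CDHIT_C-0.70.fasta.clstr | sort -k2,2nr | head` would show the largest clusters first. CD-HIT shortens sequence names in the `.clstr` file by default, so the representative name may be truncated.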
