As we have seen, you can download the complete UniProt record for an entry as an XML or JSON file. This is useful because the file contains all available information for that protein.

However, if we are only interested in a few features for a large list of proteins, downloading every full record and then extracting a small amount of data from each one adds a lot of overhead. For this case, UniProt provides the **tab** format: a plain text file in which every column holds one feature and every row is one entry.
You can easily parse these files with Python libraries for tabular data such as [pandas](01.b.pandas.ipynb), or even open them in Excel. We will see how to manipulate data with pandas in the next WPO.
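
To make the layout concrete, here is a minimal sketch that parses a small tab-separated table with Python's standard `csv` module. The helper name `parse_tab` and the table contents are assumptions made for illustration (the second row is a made-up placeholder, not a real entry); an actual download will contain whichever columns you requested.

```python
import csv
import io

# Illustrative tab-separated text: one header row, then one row per entry.
# The second row is a made-up placeholder, not a real UniProt entry.
tab_text = (
    "Entry\tEntry name\tOrganism\n"
    "P00750\tTPA_HUMAN\tHomo sapiens (Human)\n"
    "X00001\tEXAMPLE_ENTRY\tExample organism\n"
)

def parse_tab(text):
    """Return a dict mapping the 'Entry' column to the remaining columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table = {}
    for row in reader:
        entry = row.pop("Entry")   # use the accession as the key
        table[entry] = dict(row)   # the other columns are the features
    return table

table = parse_tab(tab_text)
print(table["P00750"]["Organism"])
```

Each row becomes one dictionary keyed by column name, which mirrors the "one row per entry, one column per feature" structure of the tab format.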

For a set of proteins listed in a multiple sequence alignment, we have to provide a search query and, in addition, a list of the columns we are interested in. The available columns are listed on [this page](https://www.uniprot.org/help/uniprotkb_column_names). As you can see, there are a lot of features, so we will have to decide which ones to use in the analysis.
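
As a rough sketch of what the search query and the column list look like as strings, the snippet below joins a few accession IDs with `+OR+` and a few column names with commas, following the pattern used later in this notebook. The helper names `build_id_query` and `build_columns` are hypothetical and only serve the illustration.

```python
def build_id_query(ids):
    """Join accession IDs into a single 'id:A+OR+id:B' search string."""
    if not ids:
        raise ValueError("at least one ID is required")
    return "id:" + "+OR+id:".join(ids)

def build_columns(columns):
    """Join column names into one comma-separated string."""
    return ",".join(columns)

ids = ["P00750", "A0A2R8ZEK6", "H2QW33"]
query = build_id_query(ids)
columns = build_columns(["id", "entry name", "organism"])

print(query)    # id:P00750+OR+id:A0A2R8ZEK6+OR+id:H2QW33
print(columns)  # id,entry name,organism
```

Building the strings in small functions makes it easy to check the query before sending it anywhere.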

In [53]:
# Extract protein IDs from a FASTA alignment file
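def extractFastaInfo(fastaAlignment):

    """
    Read a FASTA alignment file.

    Returns a tuple (seqAlignments, seqIdList): a dictionary mapping each
    sequence ID to its upper-case aligned sequence, and the list of IDs in
    file order.
    """

    # Read the file
    with open(fastaAlignment) as fin:
        lines = fin.readlines()

    seqAlignments = {}
    seqIdList = []
    seqId = None

    for line in lines:

        cols = line.split()

        if not cols:
            # Skip empty lines
            continue

        if cols[0].startswith('>'):
            # Header line: the ID is the first word after '>'
            seqId = cols[0][1:].strip()
            seqIdList.append(seqId)

        elif seqId is not None:
            # Sequence line; appending also handles multiline FASTA.
            # Sequence lines before the first header are ignored.
            seqAlignments[seqId] = seqAlignments.get(seqId, "") + cols[0].upper()

    return (seqAlignments, seqIdList)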
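
To see how multiline FASTA records are joined, here is a small self-contained sketch that applies the same idea to a string instead of a file. The function name `parse_fasta_text` and the two example records are made up for illustration.

```python
def parse_fasta_text(text):
    """Parse FASTA-formatted text into (sequences, ids)."""
    sequences = {}
    ids = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: keep the first word after '>' as the ID
            header = line[1:].split()
            current = header[0] if header else ""
            ids.append(current)
            sequences[current] = ""
        elif current is not None:
            # Sequence lines of one record are concatenated
            sequences[current] += line.upper()
    return sequences, ids

# Made-up alignment: seqA is spread over two lines, seqB is on one line
example = """>seqA some description
mk-ta
CDE
>seqB
MKQTACDE
"""

sequences, ids = parse_fasta_text(example)
print(ids)
print(sequences["seqA"])
```

The two lines of `seqA` end up as one upper-case string with the gap character preserved, so both aligned sequences have the same length.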



In [54]:
# Extract protein IDs from a CLUSTAL alignment file
def extractClustalInfo(clustalAlignment,uniqueSeqs=False):
    
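    """
    Read a CLUSTAL alignment file.

    Returns a tuple (seqAlignments, seqIdList): a dictionary mapping each
    sequence ID to its aligned sequence, and the list of IDs in file order.

    If uniqueSeqs is True, an identifier that occurs more than once gets an
    extra numeric suffix, so each occurrence ends up as a separate entry.
    """
    # Read the file
    with open(clustalAlignment) as fin:
        lines = fin.readlines()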
        
    startReading = False
    seqAlignments = {}
    seqIdList = []

    for line in lines:
      
      if line.startswith("CLUSTAL"):
        startReading = True
        continue
        
      if startReading:
        cols = line.split()
        
        if cols:          
          if len(cols) in (2,3):

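            # Ignore conservation lines (made of '*', ':' and '.') and numbering lines
            if cols[0][0] in "*:." or cols[0].isdigit():
              continue
  
            # IDs look like "sp|P00750|TPA_HUMAN": keep the accession between the pipes
            idParts = cols[0].split("|")
            seqId = idParts[1] if len(idParts) > 1 else idParts[0]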

            if uniqueSeqs and seqId in seqAlignments.keys():
              for i in range(99):
                newSeqId = "{}_{}".format(seqId,i)
                if newSeqId not in seqAlignments.keys():
                  seqId = newSeqId
                  break
            
            alignment = cols[1]
            
            if seqId not in seqAlignments.keys():
              seqAlignments[seqId] = ""
              seqIdList.append(seqId)
            
            seqAlignments[seqId] += alignment
    return (seqAlignments, seqIdList)
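
The per-line logic of the CLUSTAL parser can be sketched in isolation as follows. This is a simplified illustration, assuming sequence names of the form `db|ACCESSION|NAME`; the function name `parse_clustal_line` and the example lines are assumptions, not part of the notebook's API.

```python
def parse_clustal_line(line):
    """Return (seq_id, aligned_fragment) for a sequence line, else None."""
    cols = line.split()
    # Sequence lines have a name, a fragment and optionally a residue count
    if len(cols) not in (2, 3):
        return None
    name, fragment = cols[0], cols[1]
    # Conservation lines consist of '*', ':' and '.'; skip them and bare numbers
    if name[0] in "*:." or name.isdigit():
        return None
    # Keep the accession between the pipes, or the whole name if there are none
    parts = name.split("|")
    seq_id = parts[1] if len(parts) > 1 else parts[0]
    return seq_id, fragment

# Illustrative lines as they might appear in a CLUSTAL file
print(parse_clustal_line("sp|P00750|TPA_HUMAN      MDAMKRGLCC 10"))
print(parse_clustal_line("                         ** :*"))
print(parse_clustal_line("CLUSTAL O(1.2.4) multiple sequence alignment"))
```

Only the first line yields an ID and a fragment; the conservation line and the file header are skipped.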



In [55]:
from apiFunctions import uniprotDownload, enaDownload
import pandas as pd

# Keep only ONE of the two calls below active (comment out the other) and change the filename!
# Note that the file has to be in the same directory as this notebook
# Extract IDs from FASTA
#(seqAlignInfo,ids) = extractFastaInfo("fastaExample.fasta")
# Extract IDs from CLUSTAL
(seqAlignInfo,ids) = extractClustalInfo("clustalExample.aln")

# Information for API
fileName="proteins.tab"
query="id:"+"+OR+id:".join(ids)
fileFormat="tab"  # not named "format", to avoid shadowing the Python built-in

# You can find possible columns on this page https://www.uniprot.org/help/uniprotkb_column_names
columns="id,entry name,genes,organism,comment(PTM),3d,database(EMBL),database(GenBank),database(GeneID),lineage(all)"

# Download file
uniprotDownload(fileName,query=query, format=fileFormat, columns=columns)

# Show with Pandas
proteinInfo = pd.read_csv(fileName, sep="\t").set_index("Entry").sort_index()
proteinInfo

id:P00750+OR+id:A0A2R8ZEK6+OR+id:H2QW33+OR+id:G3RMM0+OR+id:Q5R8J0+OR+id:A0A2I3LNW6+OR+id:A0A2K5L072+OR+id:H2PQ69+OR+id:A0A2I3GC18+OR+id:F7BHV7


AttributeError: __enter__

In [None]:
# Download DNA sequence for first protein
uniprotId = ids[0]
fileName = uniprotId+".fasta"
enaId = proteinInfo.loc[uniprotId]["Cross-reference (EMBL)"].split(";")[0]

print(uniprotId)

# Download FASTA file
DNAFasta = enaDownload(fileName, enaId)
DNAFasta
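
The EMBL cross-reference column holds several IDs separated by semicolons, and it can be empty for some proteins. A minimal sketch of a more defensive way to pick the first ID is shown below; the helper name `first_cross_reference` and the example IDs are hypothetical and only illustrate the string handling.

```python
def first_cross_reference(value):
    """Return the first non-empty ID from a ';'-separated string, or None."""
    if not isinstance(value, str):
        return None  # e.g. a missing value (NaN) read from the table
    for part in value.split(";"):
        part = part.strip()
        if part:
            return part
    return None

# Illustrative values, shaped like a cross-reference cell
print(first_cross_reference("M15518;M18182;"))
print(first_cross_reference(float("nan")))
```

Returning `None` for a missing value lets you skip the download for that protein instead of failing on `.split`.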