The following code aims to determine how many Cys2-His2 zinc finger (C2H2-zf) domains in UniProt are covered by structures in the PDB, whether in complex with DNA or not.

Initially, I relied on annotations from [Pfam](ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.regions.uniprot.tsv.gz) and [UniProt](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/pdbtosp.txt) (downloaded using the bash script [get_data.sh](./data/get_data.sh)), and focused on the [classic zinc finger domain](https://pfam.xfam.org/family/PF00096).

In [1]:
from Bio import SeqIO
import gzip
import os
import pandas as pd
import pickle
import re
import subprocess as sp
import tarfile

# Initialize
pfam_accession = "PF00096"
prosite_id = "PS50157"
CTCF = "P49711"

#++++++++++++++++#
# Pickle recipes #
#++++++++++++++++#

# Adapted from "Load Faster in Python With Compressed Pickles"
# https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e

def save_pickle(file_name, pkl):
    """
    Saves a pickle.
    """

    with gzip.open(file_name, "wb") as f:
        pickle.dump(pkl, f)

def load_pickle(file_name):
    """
    Loads and returns a pickle.
    """

    with gzip.open(file_name, "rb") as f:
        pkl = pickle.load(f)

    return(pkl)
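
The two recipes above can be checked with a quick round trip. The following is only a minimal sketch: the toy dictionary and the temporary file name are made up for illustration, and the helpers are repeated so that the snippet runs on its own.

```python
import gzip
import os
import pickle
import tempfile

def save_pickle(file_name, pkl):
    # Serialize the object into a gzip-compressed pickle
    with gzip.open(file_name, "wb") as f:
        pickle.dump(pkl, f)

def load_pickle(file_name):
    # Read the object back from a gzip-compressed pickle
    with gzip.open(file_name, "rb") as f:
        return pickle.load(f)

# Toy object (illustrative only), shaped like the dictionaries cached in this notebook
toy = {"P49711": {(266, 288), (294, 316)}}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_pickle = os.path.join(tmp_dir, "toy.pkl.gz")
    save_pickle(toy_pickle, toy)
    restored = load_pickle(toy_pickle)

# The restored object should compare equal to the original
print(restored == toy)
```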

I extracted the UniProt human proteome reference sequences (only one per gene) as [Seq](https://biopython.org/wiki/Seq) objects (*i.e.* `human_seqs`)...

In [3]:
# Initialize
human_seqs_pickle = "./pkl/human_sequences.pkl.gz"

if not os.path.exists(human_seqs_pickle):

    # Initialize
    human_seqs = {}
    pattern = re.compile(r"^\w{2}\|(\w+)")

    with gzip.open("./data/UP000005640_9606.fasta.gz", "rt") as f:
        for seq_record in SeqIO.parse(f, "fasta"):
            uniacc = pattern.match(seq_record.id)
            human_seqs.setdefault(uniacc.group(1), seq_record.seq)

    save_pickle(human_seqs_pickle, human_seqs)

else:

    human_seqs = load_pickle(human_seqs_pickle)

# Sanity check:
# According to UniProt's human reference proteome (https://www.uniprot.org/proteomes/UP000005640),
# the gene count is 20,595
print(len(human_seqs))

20595
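
To make the accession parsing in the cell above concrete, here is a small sketch of the same regular expression applied to a record identifier. The helper `get_accession` is a hypothetical name introduced for illustration; it returns `None` instead of failing when an identifier does not follow the `db|accession|name` layout.

```python
import re

# Same pattern as in the cell above: two word characters, a pipe, then the accession
pattern = re.compile(r"^\w{2}\|(\w+)")

def get_accession(record_id):
    # Hypothetical helper: return the UniProt accession, or None if the id does not match
    m = pattern.match(record_id)
    return m.group(1) if m else None

print(get_accession("sp|P49711|CTCF_HUMAN"))
print(get_accession("not_a_uniprot_id"))
```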


... and all zinc fingers, according to Pfam, mapped to human (and other UniProt) proteins (*i.e.* `uprot2pfam`).

In [5]:
# Initialize
uprot2pfam_pickle = "./pkl/uniprot_to_pfam.pkl.gz"

if not os.path.exists(uprot2pfam_pickle):

    # Initialize
    uprot2pfam = {}

    with gzip.open("./data/Pfam-A.regions.uniprot.tsv.gz", "rt") as f:
        for chunk in pd.read_csv(f, encoding="utf-8", sep="\t", chunksize=1024):
            for index, row in chunk.iterrows():
                uniacc, seq_version, crc64, md5, pfam_acc, start, end = row.tolist()
                if pfam_acc == pfam_accession:
                    uprot2pfam.setdefault(uniacc, set())
                    uprot2pfam[uniacc].add(tuple([start, end]))

    save_pickle(uprot2pfam_pickle, uprot2pfam)

else:

    uprot2pfam = load_pickle(uprot2pfam_pickle)

# Sanity checks:
# The total number of C2H2-zf proteins in UniProt is?
print(len(uprot2pfam))
# According to "The Human Transcription Factors" (PMID: 29425488), the total C2H2-zf TFs is 747
print(len(set(uprot2pfam.keys()).intersection(set(human_seqs.keys()))))
# According to "Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome" (PMID: 17382889),
# the total number of C2H2-zf domains for CTCF is 11
print(len(uprot2pfam[CTCF]), uprot2pfam[CTCF])

160662
678
6 {(294, 316), (437, 460), (555, 575), (379, 401), (322, 345), (266, 288)}


Both i) the number of human C2H2-zf proteins (678 vs. the 747 reported) and ii) the number of CTCF C2H2-zf domains (6 vs. the 11 reported) are lower than expected, which suggests that there might be something wrong with the Pfam thresholding used by UniProt.

Therefore, I switched to PROSITE domain annotations based on the [zinc finger C2H2 type domain profile](https://prosite.expasy.org/PS50157), which [match CTCF better](https://www.uniprot.org/uniprot/P49711).

The file [uniprot2prosite.tab.gz](./data/uniprot2prosite.tab.gz) was obtained from [UniProt](https://www.uniprot.org/database/DB-0084) by customizing the output columns to `Entry`, `Gene names`, `Organism` and `PROSITE`.

I extracted all C2H2-zf proteins from UniProt, according to PROSITE (*i.e.* `c2h2_zfs`).

In [6]:
# Initialize
c2h2_zf_pickle = "./pkl/c2h2_zf.pkl.gz"
pattern = re.compile(prosite_id)

if not os.path.exists(c2h2_zf_pickle):

    # Initialize
    c2h2_zfs = {}

    with gzip.open("./data/uniprot2prosite.tab.gz", "rt") as f:
        for chunk in pd.read_csv(f, encoding="utf-8", sep="\t", chunksize=1024):
            for index, row in chunk.iterrows():
                uniacc, gene_name, organism, prosites = row.tolist()
                # Guard against empty PROSITE cells, which pandas reads as NaN (not a string)
                if isinstance(prosites, str) and pattern.search(prosites):
                    c2h2_zfs.setdefault(organism, set())
                    c2h2_zfs[organism].add(uniacc)

    save_pickle(c2h2_zf_pickle, c2h2_zfs)

else:

    c2h2_zfs = load_pickle(c2h2_zf_pickle)    

# Sanity checks:
# The total number of C2H2-zf proteins in UniProt is?
c2h2_zf_set = set()
for f in c2h2_zfs:
    c2h2_zf_set.update(c2h2_zfs[f])
print(len(c2h2_zf_set))
# According to "The Human Transcription Factors" (PMID: 29425488), the total C2H2-zf TFs is 747
print(len(c2h2_zfs["Homo sapiens (Human)"].intersection(set(human_seqs.keys()))))

353167
762
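
The grouping step can be illustrated on a tiny in-memory table. This is a sketch under stated assumptions rather than the notebook's exact method: it uses the standard `csv` module instead of pandas, it assumes that the `PROSITE` column holds semicolon-separated identifiers, and the rows (including the accessions `X00001` and `X00002`) are made up.

```python
import csv
import io

prosite_id = "PS50157"

# Toy rows (made up for illustration) with the four customized columns
toy_table = (
    "Entry\tGene names\tOrganism\tPROSITE\n"
    "P49711\tCTCF\tHomo sapiens (Human)\tPS00028;PS50157;\n"
    "X00001\tGENE1\tHomo sapiens (Human)\t\n"
    "X00002\tGENE2\tMus musculus (Mouse)\tPS50157;\n"
)

# Group the accessions of proteins matching the profile by organism
c2h2_zfs = {}
for row in csv.DictReader(io.StringIO(toy_table), delimiter="\t"):
    # An empty PROSITE cell yields an empty string, which never matches
    prosites = (row["PROSITE"] or "").split(";")
    if prosite_id in prosites:
        c2h2_zfs.setdefault(row["Organism"], set()).add(row["Entry"])

print(c2h2_zfs)
```

Splitting on `;` and testing membership avoids accidental partial matches against longer identifiers, at the cost of assuming the separator.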


Then, I retrieved the sequences (only one per gene) of all C2H2-zf proteins from the UniProt [reference proteomes of eukaryotes](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Reference_Proteomes_2020_02.tar.gz).

In [5]:
# Initialize
c2h2_zf_seqs_pickle = "./pkl/c2h2_zf_sequences.pkl.gz"

if not os.path.exists(c2h2_zf_seqs_pickle):

    # Initialize
    c2h2_zf_seqs = {}
    dummy_dir = "./data/"
    pattern = re.compile(r"Eukaryota/UP\d+_\d+\.fasta\.gz")
    pattern2 = re.compile(r"^\w{2}\|(\w+)")
    reference_proteomes_fasta = "./data/Reference_Proteomes_2020_01.tar.gz"

    with tarfile.open(reference_proteomes_fasta, "r:gz") as tar:
        for member in tar.getmembers():
            # Skip organisms other than eukaryotes
            if not member.name.startswith("Eukaryota"):
                continue
            # Skip files other than reference proteomes
            if pattern.search(member.name):
                reference_proteome_file = os.path.join(dummy_dir, member.name) 
                if not os.path.exists(reference_proteome_file):
                    tar.extract(member, path=dummy_dir)
                with gzip.open(reference_proteome_file, "rt") as f:
                    for seq_record in SeqIO.parse(f, "fasta"):
                        uniacc = pattern2.match(seq_record.id)
                        if uniacc.group(1) in c2h2_zf_set:
                            c2h2_zf_seqs.setdefault(uniacc.group(1), str(seq_record.seq))

    save_pickle(c2h2_zf_seqs_pickle, c2h2_zf_seqs)

else:

    c2h2_zf_seqs = load_pickle(c2h2_zf_seqs_pickle)

# Total C2H2-zf transcription factors in reference proteomes of eukaryotes is?
print(len(c2h2_zf_seqs))

186703
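
The member filter used above can be tried on a handful of file names without opening the archive. This is a minimal sketch: the names in `member_names` are assumptions about what such an archive may contain, and the pattern is anchored with `$` here for strictness.

```python
import re

# One canonical proteome FASTA per organism: UP<proteome id>_<taxon id>.fasta.gz
pattern = re.compile(r"Eukaryota/UP\d+_\d+\.fasta\.gz$")

# Assumed member names, for illustration only
member_names = [
    "Eukaryota/UP000005640_9606.fasta.gz",
    "Eukaryota/UP000005640_9606_additional.fasta.gz",
    "Eukaryota/UP000005640_9606_DNA.fasta.gz",
    "Eukaryota/UP000005640_9606.gene2acc",
    "Bacteria/UP000000625_83333.fasta.gz",
]

# Keep eukaryotic members whose name matches the canonical FASTA pattern
kept = [n for n in member_names if n.startswith("Eukaryota") and pattern.search(n)]
print(kept)
```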


Finally, I scanned the retrieved sequences with the [zinc finger C2H2 type domain profile](https://prosite.expasy.org/PS50157) from PROSITE and, for each protein, I stored the positions of each match (*i.e.* `uprot2prosite`).

In [9]:
# Initialize
uprot2prosite_pickle = "./pkl/uniprot_to_prosite.pkl.gz"

def scan_prosite(uniacc, uniacc2seqs):

    # Initialize
    matches = set()
    fasta_file = "./data/%s.fasta" % uniacc

    if os.path.exists(fasta_file):
        os.remove(fasta_file)

    with open(fasta_file, "w") as o:
        o.write(">%s\n%s\n" % (uniacc, uniacc2seqs[uniacc]))

    # Run PROSITE
    # Relies on the global `prosite_dir`, which must be set before calling this function
    cmd = "perl %s/ps_scan.pl -d %s/prosite.dat -p %s %s" % (prosite_dir, prosite_dir, prosite_id, fasta_file)
    p = sp.run(cmd, shell=True, stdout=sp.PIPE, stderr=sp.PIPE)
    for line in p.stdout.decode("utf-8").split("\n"):
        if not line.startswith(">"):
            # Match lines are split on whitespace: start, "-", end, ...
            match = re.findall(r"\S+", line)
            if match:
                matches.add(tuple([match[0], match[2]]))

    os.remove(fasta_file)

    return(uniacc, matches)

if not os.path.exists(uprot2prosite_pickle):

    # Initialize
    uprot2prosite = {}
    c2h2_zf_prosite_matches = "./data/c2h2_zf_prosite_matches.txt"

    if not os.path.exists(c2h2_zf_prosite_matches):

        # Initialize
        prosite_dir = "/space/home/oriol/Programs/ps_scan"
        c2h2_zf_seqs_fasta = "./data/c2h2_zf_sequences.fasta"

        if not os.path.exists(c2h2_zf_seqs_fasta):
            with open(c2h2_zf_seqs_fasta, "w") as o:
                for uniacc in c2h2_zf_seqs:
                    o.write(">%s\n%s\n" % (uniacc, c2h2_zf_seqs[uniacc]))

        # Run PROSITE
        cmd = "perl %s/ps_scan.pl -d %s/prosite.dat -p %s %s > %s" % (prosite_dir, prosite_dir, prosite_id,
            c2h2_zf_seqs_fasta, c2h2_zf_prosite_matches)
        os.system(cmd)

    with open(c2h2_zf_prosite_matches) as f:
        for line in f:
            if line.startswith(">"):
                m = re.search(r">(\S+) :", line)
                uniacc = m.group(1)
                uprot2prosite.setdefault(uniacc, set())
            else:
                m = re.search(r"(\d+) - (\d+)", line)
                if m:
                    uprot2prosite[uniacc].add(tuple([m.group(1), m.group(2)]))

    save_pickle(uprot2prosite_pickle, uprot2prosite)

else:

    uprot2prosite = load_pickle(uprot2prosite_pickle)

# Sanity checks:
# The total number of C2H2 zinc finger proteins in UniProt reference proteomes of eukaryotes is?
print(len(uprot2prosite))
# And the total number of C2H2 zinc finger domains is?
print(sum([len(uprot2prosite[uniacc]) for uniacc in uprot2prosite]))
# According to "Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome" (PMID: 17382889),
# the total number of C2H2-zf domains for CTCF is 11
print(len(uprot2prosite[CTCF]), uprot2prosite[CTCF])

186703
963553
11 {('523', '546'), ('495', '522'), ('351', '378'), ('322', '350'), ('555', '573'), ('437', '465'), ('266', '293'), ('379', '406'), ('294', '321'), ('407', '435'), ('467', '494')}
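
The parsing step of the cell above can be sketched on a toy string. The layout of `toy_output` is an assumption inferred from the two regular expressions used above (a `>accession : ...` header followed by `start - end` lines); the sequence column is a placeholder and `X00001` is a made-up accession. Unlike the cell above, this sketch stores coordinates as integers.

```python
import re

header_re = re.compile(r">(\S+) :")
match_re = re.compile(r"(\d+) - (\d+)")

def parse_ps_scan(lines):
    # Hypothetical helper: map each accession to a set of (start, end) integer tuples
    matches = {}
    uniacc = None
    for line in lines:
        if line.startswith(">"):
            m = header_re.search(line)
            uniacc = m.group(1) if m else None
            if uniacc is not None:
                matches.setdefault(uniacc, set())
        elif uniacc is not None:
            m = match_re.search(line)
            if m:
                matches[uniacc].add((int(m.group(1)), int(m.group(2))))
    return matches

# Assumed output layout, with placeholder sequences
toy_output = """\
>P49711 : PS50157 ZINC_FINGER_C2H2_2
    266 - 293  XXXXXXXX
    294 - 321  XXXXXXXX
>X00001 : PS50157 ZINC_FINGER_C2H2_2
     10 - 32   XXXXXXXX
"""

parsed = parse_ps_scan(toy_output.splitlines())
print(sorted(parsed["P49711"]))
```

Integer coordinates sort numerically and can be compared directly with the Pfam tuples, which are also integers.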


The PROSITE results are more in line with expectations: 762 human C2H2-zf proteins (vs. the 747 reported) and 11 C2H2-zf domains for CTCF.
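
As a quick cross-check, the CTCF coordinates printed by the Pfam and PROSITE cells can be compared directly. The following sketch copies both sets from the outputs above and lists the PROSITE matches that do not overlap any Pfam region; the helper `overlaps` is introduced here for illustration.

```python
# CTCF coordinates copied from the Pfam and PROSITE outputs above
pfam = {(294, 316), (437, 460), (555, 575), (379, 401), (322, 345), (266, 288)}
prosite = {("523", "546"), ("495", "522"), ("351", "378"), ("322", "350"),
           ("555", "573"), ("437", "465"), ("266", "293"), ("379", "406"),
           ("294", "321"), ("407", "435"), ("467", "494")}

def overlaps(a, b):
    # Two closed intervals overlap if each one starts before the other ends
    return a[0] <= b[1] and b[0] <= a[1]

# PROSITE positions were stored as strings, so convert them before comparing
prosite_int = sorted((int(s), int(e)) for s, e in prosite)
missed = [p for p in prosite_int if not any(overlaps(p, q) for q in pfam)]
covered = len(prosite_int) - len(missed)

print(covered, missed)
# 6 [(351, 378), (407, 435), (467, 494), (495, 522), (523, 546)]
```

Each of the 6 Pfam regions corresponds to one PROSITE match, and the 5 remaining PROSITE matches are the domains that Pfam does not report for CTCF.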