The following code aims to identify the total number of Cys2-His2 zinc finger (C2H2-zf) domains in UniProt that are covered in the PDB, either (a) in complex with DNA or (b) not. We rely on mappings from [Pfam](ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.regions.uniprot.tsv.gz) and [UniProt](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/pdbtosp.txt) (downloaded using the bash script [get_data.sh](./data/get_data.sh)) and focus on the [classic zinc finger domain](https://pfam.xfam.org/family/PF00096).

In [1]:
from Bio import SeqIO
import gzip
import os
import pandas as pd
import pickle
import re

# Initialize
c2h2_zf_pfam_acc = "PF00096"
c2h2_zf_pfam_name = "zf-C2H2"

#++++++++++++++++#
# Pickle recipes #
#++++++++++++++++#

# Adapted from "Load Faster in Python With Compressed Pickles"
# https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e

def save_pickle(file_name, pkl):
    """
    Saves a pickle.
    """

    with gzip.open(file_name, "wb") as f:
        pickle.dump(pkl, f)

def load_pickle(file_name):
    """
    Loads and returns a pickle.
    """

    with gzip.open(file_name, "rb") as f:
        pkl = pickle.load(f)

    return pkl
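
As a quick check of the two recipes above, the following sketch round-trips a small object through a gzip-compressed pickle. The helpers are repeated so the snippet runs on its own; the toy mapping and the temporary file name are hypothetical and not part of the analysis.

```python
import gzip
import os
import pickle
import tempfile

def save_pickle(file_name, pkl):
    """Saves an object as a gzip-compressed pickle."""
    with gzip.open(file_name, "wb") as f:
        pickle.dump(pkl, f)

def load_pickle(file_name):
    """Loads and returns an object from a gzip-compressed pickle."""
    with gzip.open(file_name, "rb") as f:
        return pickle.load(f)

# Hypothetical toy mapping (accession -> set of (start, end) tuples); not real data
toy = {"ACC1": {(1, 23), (30, 52)}}

# Write to a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_pickle = os.path.join(tmp_dir, "toy.pkl.gz")
    save_pickle(toy_pickle, toy)
    restored = load_pickle(toy_pickle)

# The restored object should compare equal to the original
print(restored == toy)
```

Note that pickles should only be loaded from files you created yourself, since unpickling can execute arbitrary code.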

Extract all UniProt human proteome reference sequences as [Seq](https://biopython.org/wiki/Seq) objects (*i.e.* `human_seqs`).

In [2]:
# Initialize
human_seqs_pickle = "./pkl/human_sequences.pkl.gz"

if not os.path.exists(human_seqs_pickle):

    # Initialize
    human_seqs = {}
    # Raw string, so that "\w" and "\|" are not treated as string escapes
    pattern = re.compile(r"^\w{2}\|(\w+)")
    human_reference_proteome_fasta = "./data/UP000005640_9606.fasta.gz"

    with gzip.open(human_reference_proteome_fasta, "rt") as f:
        for seq_record in SeqIO.parse(f, "fasta"):
            uniacc = pattern.match(seq_record.id)
            human_seqs.setdefault(uniacc.group(1), seq_record.seq)

    save_pickle(human_seqs_pickle, human_seqs)

else:

    human_seqs = load_pickle(human_seqs_pickle)

# Sanity check:
# According to UniProt's human reference proteome (https://www.uniprot.org/proteomes/UP000005640)
# Gene count is 20,595
print(len(human_seqs))

20595
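
To illustrate what the regular expression in the cell above captures, here is a minimal sketch applied to UniProt-style FASTA identifiers of the form `db|accession|entry_name`. The second identifier is made up for illustration, and the `None` check is an added assumption: the cell above would raise an `AttributeError` on an identifier that does not match.

```python
import re

# Same pattern as in the cell above: two word characters, a pipe, then the accession
pattern = re.compile(r"^\w{2}\|(\w+)")

example_ids = [
    "sp|P49711|CTCF_HUMAN",    # CTCF, the protein used in the sanity checks
    "tr|X0TEST|X0TEST_HUMAN",  # hypothetical identifier, for illustration only
]

accessions = []
for seq_id in example_ids:
    m = pattern.match(seq_id)
    # Guard against identifiers that do not follow the expected layout
    if m is not None:
        accessions.append(m.group(1))

print(accessions)

# An identifier without the "db|" prefix does not match
print(pattern.match("P49711"))
```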


Extract all zinc fingers, according to Pfam, mapped to human (and other UniProt) proteins (*i.e.* `uprot2pfam`).

In [3]:
# Initialize
uprot2pfam_pickle = "./pkl/uniprot_to_pfam.pkl.gz"

if not os.path.exists(uprot2pfam_pickle):

    # Initialize
    uprot2pfam = {}
    uprot2pfam_mappings_file = "./data/Pfam-A.regions.uniprot.tsv.gz"

    with gzip.open(uprot2pfam_mappings_file, "rt") as f:
        for chunk in pd.read_csv(f, encoding="utf-8", sep="\t", chunksize=1024):
            for index, row in chunk.iterrows():
                uniacc, seq_version, crc64, md5, pfam_acc, start, end = row.tolist()
                if pfam_acc == c2h2_zf_pfam_acc:
                    uprot2pfam.setdefault(uniacc, set())
                    uprot2pfam[uniacc].add((start, end))

    save_pickle(uprot2pfam_pickle, uprot2pfam)

else:

    uprot2pfam = load_pickle(uprot2pfam_pickle)

# Sanity checks:
print(len(uprot2pfam))
# According to The Human Transcription Factors (PMID: 29425488)
# Total C2H2-zf transcription factors is 747
print(len(set(uprot2pfam.keys()).intersection(set(human_seqs.keys()))))
# According to Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome (PMID: 17382889)
# Total C2H2-zf domains for CTCF is 11
print(len(uprot2pfam["P49711"]))

160662
678
6
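
Iterating over every row with `iterrows` can be slow on a file of this size. As a sketch of one possible alternative, each chunk could first be filtered with a boolean mask so that only the matching rows are looped over. The column names (`uniprot_acc`, `pfam_acc`, `seq_start`, `seq_end`) and the toy rows below are assumptions for illustration; check them against the header of the actual file.

```python
import io

import pandas as pd

c2h2_zf_pfam_acc = "PF00096"

# Hypothetical rows with the seven columns unpacked in the cell above
# (the header names are assumed, the values are made up)
toy_tsv = (
    "uniprot_acc\tseq_version\tcrc64\tmd5\tpfam_acc\tseq_start\tseq_end\n"
    "ACC1\t1\tx\ty\tPF00096\t10\t32\n"
    "ACC1\t1\tx\ty\tPF00096\t40\t62\n"
    "ACC1\t1\tx\ty\tPF00001\t70\t90\n"
    "ACC2\t1\tx\ty\tPF00096\t5\t27\n"
)

uprot2pfam = {}

# Read in chunks to bound memory use; filter each chunk before looping
for chunk in pd.read_csv(io.StringIO(toy_tsv), sep="\t", chunksize=2):
    hits = chunk[chunk["pfam_acc"] == c2h2_zf_pfam_acc]
    for uniacc, start, end in zip(
        hits["uniprot_acc"], hits["seq_start"], hits["seq_end"]
    ):
        # Store plain ints rather than NumPy integers
        uprot2pfam.setdefault(uniacc, set()).add((int(start), int(end)))

# Number of C2H2-zf regions found per accession
print({uniacc: len(regions) for uniacc, regions in uprot2pfam.items()})
```

A larger `chunksize` than the 1024 used above would likely reduce per-chunk overhead further, at the cost of more memory per chunk.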


The thresholding used by Pfam appears to be too strict: it reports only 6 of the 11 C2H2-zf domains expected for CTCF, missing almost half of them. Instead, we use PROSITE domain annotations from the [zinc finger C2H2 type domain profile](https://prosite.expasy.org/PS50157), which match CTCF better.

The file [uniprot_to_prosite.tab.gz](./data/uniprot_to_prosite.tab.gz) is obtained by querying UniProt for [database:(type:PROSITE)](https://www.uniprot.org/uniprot/?query=database:(type:PROSITE)) and customizing the output columns to `Entry`, `Gene names`, `Organism` and `PROSITE`. The file [C2H2-zf.fasta](./data/C2H2-zf.fasta) is obtained by using the file [C2H2-zf.txt](./data/C2H2-zf.txt) as input for the [Retrieve/ID mapping tool](https://www.uniprot.org/uploadlists) of UniProt.

In [4]:
# Initialize
ch2h2_zf_file = "./data/C2H2-zf.txt"
ch2h2_zf_fasta = "./data/C2H2-zf.fasta"

# Write a list of C2H2-zf UniProt proteins
with open(ch2h2_zf_file, "w") as f:
    f.write("%s\n" % "\n".join(uprot2pfam.keys()))

Scan C2H2-zf protein sequences with the zinc finger C2H2-type domain signature and profile.

**3)** Extract all zinc fingers, according to PROSITE, mapped to human (and other UniProt) proteins (*i.e.* `uprot2prosite`).