The following code aims to count the Cys2-His2 zinc finger (C2H2-zf) domains in UniProt that are covered by PDB structures, either (a) in complex with DNA or (b) not in complex with DNA. We rely on mappings from [Pfam](ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.regions.uniprot.tsv.gz) and [UniProt](ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/pdbtosp.txt) (downloaded with the bash script [get_data.sh](./data/get_data.sh)) and focus on the [classic zinc finger domain](https://pfam.xfam.org/family/PF00096).

In [15]:
# Data

uprot2pdb = "./data/pdbtosp.txt"

# C2H2-zf Pfam name & accession


**1)** Extract all sequences of the UniProt human reference proteome as [Seq](https://biopython.org/wiki/Seq) objects, keyed by UniProt accession (*i.e.* `human_seqs`).

In [16]:
from Bio import SeqIO
import gzip
import re

# Initialize
n = 0
human_seqs = {}
# Raw string so that "\w" and "\|" reach the regex engine unescaped
pattern = re.compile(r"^\w{2}\|(\w+)")
human_reference_proteome_fasta = "./data/UP000005640_9606.fasta.gz"

with gzip.open(human_reference_proteome_fasta, "rt") as f:
    for seq_record in SeqIO.parse(f, "fasta"):
        uniacc = pattern.match(seq_record.id)
        human_seqs.setdefault(uniacc.group(1), seq_record.seq)
        n += 1

# Sanity check
print(n == len(human_seqs))

True
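
To make the accession pattern used in step 1 concrete, here is a minimal sketch that applies it to a few record ids. It assumes the usual UniProt FASTA layout `db|accession|entry_name`; the helper name `extract_accession` and the example ids are illustrative only and are not part of the original notebook.

```python
import re

# Same pattern as in step 1, written as a raw string
pattern = re.compile(r"^\w{2}\|(\w+)")

def extract_accession(record_id):
    """Return the UniProt accession from a FASTA record id, or None if it does not match."""
    m = pattern.match(record_id)
    return m.group(1) if m else None

# Illustrative record ids (assumed layout: db|accession|entry_name)
example_ids = [
    "sp|P04637|P53_HUMAN",
    "tr|A0A024R161|A0A024R161_HUMAN",
    "not_a_uniprot_id",
]
for record_id in example_ids:
    print(record_id, "->", extract_accession(record_id))
```

Returning `None` for a non-matching id avoids the `AttributeError` that `uniacc.group(1)` would raise if a header did not follow the expected layout.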


**2)** Extract all zinc fingers mapped to human (and other UniProt) proteins (*i.e.* `uprot2pfam`).

In [24]:
import pandas as pd

# Initialize
n = 0
uprot2pfam = {}
valid_pfam_acc = "PF00096"
valid_pfam_name = "zf-C2H2"
uprot2pfam_mappings_file = "./data/Pfam-A.regions.uniprot.tsv.gz"

with gzip.open(uprot2pfam_mappings_file, "rt") as f:
    for chunk in pd.read_csv(f, encoding="utf8", sep="\t", chunksize=1024):
        # Columns by position: uniacc, seq_version, crc64, md5, pfam_acc, start, end
        # Filter each chunk in one vectorized step instead of iterating over every row
        hits = chunk[chunk.iloc[:, 4] == valid_pfam_acc]
        for uniacc, start, end in hits.iloc[:, [0, 5, 6]].itertuples(index=False, name=None):
            uprot2pfam.setdefault(uniacc, set()).add((start, end))

# Sanity check
print(len(uprot2pfam))

KeyboardInterrupt: 
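
Once `human_seqs` and `uprot2pfam` are both available, the zinc finger sequences of human proteins can be cut out of the full-length sequences. The sketch below shows one way to do this, assuming that the Pfam `start` and `end` coordinates are 1-based and inclusive. The accessions, the sequence and the helper name `domain_sequences` are hypothetical toy values, and plain strings stand in for `Seq` objects, which support the same slicing.

```python
# Toy stand-ins for the dictionaries built in steps 1 and 2 (hypothetical data)
human_seqs = {"X00001": "MAAYKCPECGKSFSQKSNLQKHQRTHTGEKP"}
uprot2pfam = {"X00001": {(4, 26)}, "X99999": {(10, 32)}}

def domain_sequences(seqs, regions):
    """Slice every mapped region out of the sequences present in both dictionaries."""
    out = {}
    # Only accessions found in both mappings, i.e. human proteins with a mapped domain
    for uniacc in seqs.keys() & regions.keys():
        seq = seqs[uniacc]
        # Assumed 1-based inclusive coordinates -> 0-based half-open Python slice
        out[uniacc] = [str(seq[start - 1:end]) for start, end in sorted(regions[uniacc])]
    return out

zf_seqs = domain_sequences(human_seqs, uprot2pfam)
print(zf_seqs)
```

Intersecting the key sets restricts the result to the human reference proteome, since `uprot2pfam` also contains proteins from other species.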