# Glycosyltransferase Classification

This notebook contains exploratory code that tries to classify Saccharide BGCs using CAZY.

## Background
### Saccharide BGCs
In the manuscript we presented a dataset of 29 predicted Saccharide BGCs: predicted BGCs that contain a putative encapsulin as well as predicted carbohydrate-related enzymes. All of these BGCs contain at least one glycosyltransferase, and some contain several.

Background reading on glycosyltransferases turned out to be fascinating: bacteria encode lots of them for various reasons, and bacteriophages have their own too! This is interesting, but it is also a slightly worrying sign. What if some of the 29 Saccharide BGCs contain phage-associated glycosyltransferases?

### CAZY

To better understand the glycosyltransferases in our BGCs, we'll look to [CAZY](http://www.cazy.org/), the Carbohydrate-Active enZYmes database. CAZY covers all carbohydrate-related enzymes, but one of its main focuses is glycosyltransferases, which is what we're interested in here!

## Downloading CAZY Data

CAZY has over [100 different glycosyltransferase families](http://www.cazy.org/GlycosylTransferases). Each family has its own webpage containing a table of the representative members of that family.

Since this data doesn't seem to be downloadable anywhere, let's use `pandas` to read the tables straight from the HTML:

In [1]:
import pandas as pd

# Pick one family as a test case
url = "http://www.cazy.org/GT1_structure.html"

# read_html returns every table on the page; we use the one at index 1
all_tables = pd.read_html(url)
enzyme_table = all_tables[1]

# Row 2 holds the real column names; keep the rows below it and
# drop the last row and the last two columns
enzyme_table.rename(columns=enzyme_table.iloc[2], inplace=True)
enzyme_table = enzyme_table.iloc[3:-1, :-2].reset_index(drop=True)
# Remove the taxonomic divider rows
enzyme_table = enzyme_table[enzyme_table["Protein Name"] != "Eukaryota"]
enzyme_table = enzyme_table[enzyme_table["Protein Name"] != "Archaea"]
enzyme_table

Unnamed: 0,Protein Name,EC#,Organism,GenBank,Uniprot,PDB/3D Carbohydrate Ligands Resolution (Å)
0,EspG2,,Actinomadura verrucosospora,,,5DU2[B] 2.70 5DU2[A] 2.70
1,dTDP-β-L-4-epi-epivancosamine: epivancosaminyl...,,Amycolatopsis orientalis A82846,AAB49292.1 AAE12045.1 CAA11774.1,P96558,"1PN3[A,B] β-D-Glcp-(1-1)-<non_carb> 2.80 1PN..."
2,TDP/UDP-Glc: aglycosyl-vancomycin: glucosyltra...,,Amycolatopsis orientalis A82846,AAB49293.1 AAE12046.1 CAA11775.1,P96559,1IIR[A] 1.80
3,UDP-β-L-4-epi-vancosamine: vancomycin-pseudoag...,,Amycolatopsis orientalis ATCC19795,AAK31352.1 CCD33143.1,G4V4R8 Q9AFC7,"1RRV[A,B] β-D-Glcp-(1-1)-<non_carb> 2.00"
4,UDP-glycosyltransferase,,Bacillus spizizenii CTCC 63501,ASY97769.1,,"7VLB[A,B] 3.00"
...,...,...,...,...,...,...
62,C-Glycosyltransferase,2.4.1.-,Trollius chinensis,,A0A4Y5RXX8,6JTD[B] 1.85 6JTD[A] 1.85
63,UDP-Glc: anthocyanidin 3-O-glucosyltransferase...,2.4.1.115 2.4.1.35,Vitis vinifera,AAB81683.1 AAB81682.1 ABF59818.1 AHZ46155.1 AH...,A0A059U0Z1 A0A059U266 A0A059U2F6 A0A223HHC1 A0...,2C1X[A] 1.9 2C1Z[A] 1.9 2C9Z[A] 2.1
64,flavone 7-O-?-glucosyltransferase (ZM_BFb0097F09),2.4.1.81,Zea mays B73,ACF81133.1 ACG45204.1,A0A8U0WPG4 B4FG90,7Q3S[A] 1.59
65,flavone-C-glucosytransferase (ZmCGTa;ZM_BFb012...,2.4.1.-,Zea mays B73,ACF81582.1,A0A096SRM5 B4FHI9,6LF6[A] 2.04


As you can see, it takes a lot of tinkering to get the table into shape! We also need to name the index, otherwise this can cause problems later:

In [2]:
enzyme_table.index.name="index"

That final column appears to be mangled - it should be three separate columns for PDB code, resolution, and carbohydrate ligand, but here they've all melted into one because of the HTML formatting.

After much trial and error I've come up with the regex below, which splits the data in that column into three separate columns that can then be merged back into the DataFrame:

In [3]:
pattern = r"(\w\w\w\w\[[\w,]+\])\s([A-Za-z1-9<>()\-_Α-Ωα-ω\s]+\s+)?(\d\.\d\d)?"

regex_matches = enzyme_table["PDB/3D Carbohydrate Ligands  Resolution (Å)"].str.extractall(pattern).rename(
    columns={0: "PDB Code", 1: "Ligand", 2: "Resolution"}
)
regex_matches.index.names=["index", "Match"]

In [4]:
out = regex_matches.merge(enzyme_table, left_index=True, right_index=True).iloc[:, :-1].reset_index(drop=True)
out.head()

Unnamed: 0,PDB Code,Ligand,Resolution,Protein Name,EC#,Organism,GenBank,Uniprot
0,5DU2[B],,2.7,EspG2,,Actinomadura verrucosospora,,
1,5DU2[A],,2.7,EspG2,,Actinomadura verrucosospora,,
2,"1PN3[A,B]",β-D-Glcp-(1-1)-<non_carb>,2.8,dTDP-β-L-4-epi-epivancosamine: epivancosaminyl...,,Amycolatopsis orientalis A82846,AAB49292.1 AAE12045.1 CAA11774.1,P96558
3,"1PNV[A,B]",α-L-2-deoxy-Fucp3N-(1-2)-β-D-Glcp-(1-1)-<non_c...,2.8,dTDP-β-L-4-epi-epivancosamine: epivancosaminyl...,,Amycolatopsis orientalis A82846,AAB49292.1 AAE12045.1 CAA11774.1,P96558
4,1IIR[A],,1.8,TDP/UDP-Glc: aglycosyl-vancomycin: glucosyltra...,,Amycolatopsis orientalis A82846,AAB49293.1 AAE12046.1 CAA11775.1,P96559
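
To sanity-check the pattern, it can help to run it on a few raw cell values outside of pandas. The sketch below applies the same regex with the standard-library `re` module to strings copied from the GT1 table above; `split_structures` is a hypothetical helper name, not part of the notebook. One thing it suggests is that resolutions written with a single decimal place (such as `1.9`) are not captured by `\d\.\d\d`. The `relaxed_pattern` variant is only a possible adjustment under that assumption and has not been checked against the full CAZY data.

```python
import re

# Same pattern as in the cell above
pattern = r"(\w\w\w\w\[[\w,]+\])\s([A-Za-z1-9<>()\-_Α-Ωα-ω\s]+\s+)?(\d\.\d\d)?"

# Hypothetical variant that also accepts a single decimal place
relaxed_pattern = pattern.replace(r"(\d\.\d\d)?", r"(\d\.\d{1,2})?")


def split_structures(cell, regex=pattern):
    """Return (PDB code, ligand, resolution) tuples for one table cell."""
    return [
        (code, ligand.strip() or None, resolution or None)
        for code, ligand, resolution in re.findall(regex, cell)
    ]


print(split_structures("5DU2[B] 2.70 5DU2[A] 2.70"))
print(split_structures("1PN3[A,B] β-D-Glcp-(1-1)-<non_carb> 2.80"))
# One-decimal resolutions are missed by the original pattern...
print(split_structures("2C1X[A] 1.9 2C1Z[A] 1.9"))
# ...but are captured by the relaxed variant
print(split_structures("2C1X[A] 1.9 2C1Z[A] 1.9", relaxed_pattern))
```

Unmatched optional groups come back as empty strings from `re.findall`, which the helper converts to `None` so that missing ligands and resolutions are easy to spot.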


Now let's have a go at downloading all the family tables and merging them together:

In [5]:
urls = [f"http://www.cazy.org/GT{i}_structure.html" for i in range(118)]

dfs = []

for i, url in enumerate(urls):
    output = pd.read_html(url)

    enzyme_table = output[1]
    try:
        enzyme_table.rename(columns=enzyme_table.iloc[2], inplace=True)
        enzyme_table = enzyme_table.iloc[3:-1, :-2].reset_index(drop=True)
    except IndexError:
        # Skip families whose table does not have the expected layout
        continue

    enzyme_table = enzyme_table[enzyme_table["Protein Name"] != "Eukaryota"]
    # .copy() so that adding the Family column does not write to a slice
    enzyme_table = enzyme_table[enzyme_table["Protein Name"] != "Archaea"].copy()
    enzyme_table["Family"] = i

    enzyme_table.index.name = "index"

    regex_matches = enzyme_table["PDB/3D Carbohydrate Ligands  Resolution (Å)"].str.extractall(pattern).rename(
        columns={0: "PDB Code", 1: "Ligand", 2: "Resolution"}
    )
    # Name the index levels only after regex_matches has been rebuilt for this family
    regex_matches.index.names = ["index", "Match"]

    df = regex_matches.merge(enzyme_table, left_index=True, right_index=True).reset_index(drop=True)
    dfs.append(df)

all_enzymes = pd.concat(dfs)
all_enzymes.head()

Unnamed: 0,PDB Code,Ligand,Resolution,Protein Name,EC#,Organism,GenBank,Uniprot,PDB/3D Carbohydrate Ligands Resolution (Å),Family
0,3HBM[A],,1.8,"UDP-2,4-diacetamido-2,4,6-trideoxy-β-L-altropy...",3.1.-.-,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,CAL35426.1,Q0P8U5,3HBM[A] 1.80 3HBN[A] 1.85,0
1,3HBN[A],,1.85,"UDP-2,4-diacetamido-2,4,6-trideoxy-β-L-altropy...",3.1.-.-,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,CAL35426.1,Q0P8U5,3HBM[A] 1.80 3HBN[A] 1.85,0
2,"2NXV[A,B]",,1.1,Orf6 (majastridin),,Fuscovulum blasticum,CAA77302.1,,"2NXV[A,B] 1.10 2QGI[A,B] 1.65",0
3,"2QGI[A,B]",,1.65,Orf6 (majastridin),,Fuscovulum blasticum,CAA77302.1,,"2NXV[A,B] 1.10 2QGI[A,B] 1.65",0
4,8IL0[A],,2.81,[inverting] GDP-α-D-lincosamide: ergothioneine...,2.4.1.-,Streptomyces lincolnensis NRRL 2936,ANS62477.1,,"8IL0[A] 2.81 8ILA[A,B,C,D] β-D-8-deoxy-Galp6N...",0


Here family "0" is actually the unclassified glycosyltransferases, so let's update this:

In [6]:
# Only relabel the Family column, so zeros in other columns are left alone
all_enzymes["Family"] = all_enzymes["Family"].replace(0, "Unclassified")
all_enzymes[all_enzymes["Family"] == "Unclassified"]

Unnamed: 0,PDB Code,Ligand,Resolution,Protein Name,EC#,Organism,GenBank,Uniprot,PDB/3D Carbohydrate Ligands Resolution (Å),Family
0,3HBM[A],,1.8,"UDP-2,4-diacetamido-2,4,6-trideoxy-β-L-altropy...",3.1.-.-,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,CAL35426.1,Q0P8U5,3HBM[A] 1.80 3HBN[A] 1.85,Unclassified
1,3HBN[A],,1.85,"UDP-2,4-diacetamido-2,4,6-trideoxy-β-L-altropy...",3.1.-.-,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,CAL35426.1,Q0P8U5,3HBM[A] 1.80 3HBN[A] 1.85,Unclassified
2,"2NXV[A,B]",,1.1,Orf6 (majastridin),,Fuscovulum blasticum,CAA77302.1,,"2NXV[A,B] 1.10 2QGI[A,B] 1.65",Unclassified
3,"2QGI[A,B]",,1.65,Orf6 (majastridin),,Fuscovulum blasticum,CAA77302.1,,"2NXV[A,B] 1.10 2QGI[A,B] 1.65",Unclassified
4,8IL0[A],,2.81,[inverting] GDP-α-D-lincosamide: ergothioneine...,2.4.1.-,Streptomyces lincolnensis NRRL 2936,ANS62477.1,,"8IL0[A] 2.81 8ILA[A,B,C,D] β-D-8-deoxy-Galp6N...",Unclassified
5,"8ILA[A,B,C,D]",β-D-8-deoxy-Galp6N-(1-1)-<non_carb>,2.79,[inverting] GDP-α-D-lincosamide: ergothioneine...,2.4.1.-,Streptomyces lincolnensis NRRL 2936,ANS62477.1,,"8IL0[A] 2.81 8ILA[A,B,C,D] β-D-8-deoxy-Galp6N...",Unclassified
6,2P6W[A],,1.6,modular β-L and α-L-rhamnosyltransferase (a064...,2.4.1.-,Paramecium bursaria Chlorella virus 1,AAC96432.1 AAK19297.1 AAK19298.1 AAK19299.1 AA...,Q89399 Q98VI8 Q98VK7 Q992M0 Q992M1,"2P6W[A] 1.60 2P72[A,B] 2.00 2P73[A] 2.30",Unclassified
7,"2P72[A,B]",,2.0,modular β-L and α-L-rhamnosyltransferase (a064...,2.4.1.-,Paramecium bursaria Chlorella virus 1,AAC96432.1 AAK19297.1 AAK19298.1 AAK19299.1 AA...,Q89399 Q98VI8 Q98VK7 Q992M0 Q992M1,"2P6W[A] 1.60 2P72[A,B] 2.00 2P73[A] 2.30",Unclassified
8,2P73[A],,2.3,modular β-L and α-L-rhamnosyltransferase (a064...,2.4.1.-,Paramecium bursaria Chlorella virus 1,AAC96432.1 AAK19297.1 AAK19298.1 AAK19299.1 AA...,Q89399 Q98VI8 Q98VK7 Q992M0 Q992M1,"2P6W[A] 1.60 2P72[A,B] 2.00 2P73[A] 2.30",Unclassified
9,2C0N[A],,1.86,A197 (probable glycosyltransferase with GT-A f...,,Sulfolobus turreted icosahedral virus 1,AAS89076.1,,2C0N[A] 1.86,Unclassified


## Downloading CAZY PDB Files

Now that we have all the CAZY data, let's choose a PDB code for each protein.

First, we'll load a pre-downloaded CAZY DataFrame. The script `scripts/get_cazy_dataframe.py` scrapes all the CAZY data, processes it as above, and writes it to a file.

In [7]:
cazy_df = pd.read_csv("../metadata/cazy_structures.csv")
cazy_df.head()

Unnamed: 0,PDB Code,Ligand,Resolution,Protein Name,EC#,Organism,GenBank,Uniprot,Family
0,3HBM[A],,1.8,"UDP-2,4-diacetamido-2,4,6-trideoxy-β-L-altropy...",3.1.-.-,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,CAL35426.1,Q0P8U5,Unclassified
1,3HBN[A],,1.85,"UDP-2,4-diacetamido-2,4,6-trideoxy-β-L-altropy...",3.1.-.-,Campylobacter jejuni subsp. jejuni NCTC 11168 ...,CAL35426.1,Q0P8U5,Unclassified
2,"2NXV[A,B]",,1.1,Orf6 (majastridin),,Fuscovulum blasticum,CAA77302.1,,Unclassified
3,"2QGI[A,B]",,1.65,Orf6 (majastridin),,Fuscovulum blasticum,CAA77302.1,,Unclassified
4,8IL0[A],,2.81,[inverting] GDP-α-D-lincosamide: ergothioneine...,2.4.1.-,Streptomyces lincolnensis NRRL 2936,ANS62477.1,,Unclassified


I can see a few possible problems with these PDB codes:

1. We have multiple PDB codes for each protein. This is easy to solve: we'll just choose the one with the best resolution.

2. Some PDB codes have multiple chains (for example `2QGI[A,B]`). I investigated these manually, and it turns out that this means the asymmetric unit contains multiple copies of the enzyme. For these we can just choose the first chain, before any commas.

Let's do this processing now:

In [8]:
# Keep only the first chain, e.g. "2QGI[A,B]" -> "2QGI[A"
cazy_df["PDB Code"] = cazy_df["PDB Code"].str.split(",").str[0]
# Literal (non-regex) replacements give codes such as "2QGI_A" and "3HBM_A"
cazy_df["PDB Code"] = cazy_df["PDB Code"].str.replace("[", "_", regex=False).str.replace("]", "", regex=False)
# Sort so the best (lowest) resolution comes first, then keep one PDB code per protein
len(cazy_df.sort_values(by="Resolution", ascending=True).groupby("Protein Name").first()["PDB Code"].unique())

349
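
The selection logic is easy to check by hand on a few rows. The following is a minimal plain-Python sketch of the same two steps (keep the first chain, then keep the best-resolved structure per protein), using rows copied from the tables earlier in this notebook. `normalise_pdb_code` and `best_structure_per_protein` are hypothetical helper names rather than functions from the notebook's scripts.

```python
def normalise_pdb_code(code):
    """Turn '2QGI[A,B]' into '2QGI_A' by keeping only the first chain."""
    first_chain = code.split(",")[0]  # e.g. '2QGI[A'
    return first_chain.replace("[", "_").replace("]", "")


def best_structure_per_protein(rows):
    """Map each protein name to the PDB code with the lowest resolution value."""
    best = {}
    for protein, code, resolution in rows:
        if protein not in best or resolution < best[protein][1]:
            best[protein] = (normalise_pdb_code(code), resolution)
    return {protein: code for protein, (code, _) in best.items()}


# (Protein Name, PDB Code, Resolution in Å)
rows = [
    ("Orf6 (majastridin)", "2NXV[A,B]", 1.10),
    ("Orf6 (majastridin)", "2QGI[A,B]", 1.65),
    ("EspG2", "5DU2[B]", 2.70),
    ("EspG2", "5DU2[A]", 2.70),
]

print(best_structure_per_protein(rows))
```

In this sketch, ties in resolution keep the first row encountered.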

## Downloading Saccharide BGC Glycosyltransferase Structures

We should be almost ready to go now. We have all the CAZY representative structures downloaded, so we just need the query structures for our Saccharide BGC glycosyltransferases. Because these are MGYP accessions, we should in theory be able to grab them straight from the ESM Atlas API.

But there's a catch! The predicted Saccharide BGCs came from [DeepBGC](./DeepBGC_exploration.ipynb) predictions, which aren't run on protein sequences, but on entire contigs.

So instead of a nice MGYP accession for each predicted glycosyltransferase, we actually have a really nasty CDS name containing the ERZ and contig name. Let's see this:

In [9]:
import gzip
import json
import re

#We can drop a bunch of unnecessary columns from the CSV file
deepbgc_df = pd.read_csv("../metadata/deepbgc_hits.csv").drop(
    ["detector", "detector_version", "detector_label", "bgc_candidate_id", "antibacterial", "cytotoxic",
     "inhibitor", "antifungal","Alkaloid", "NRP", "Other", "Polyketide", "RiPP", "Saccharide", "Terpene", "Hit?", "Contig"],
axis=1)

# The below code will load the Pfam labels dictionary downloaded from the GoogleResearch/Proteinfer GitHub repo
# I've also manually added some code below to fill in some important missing labels that I manually curated



with gzip.open("../DBs/label_descriptions.json.gz", "rb") as gzip_file:
    labels_dict = json.load(gzip_file)

labels_dict["PF19821"] = "Phage capsid protein"
labels_dict["PF19307"] = "Phage capsid-like protein"
labels_dict['PF19289'] = "PmbA/TldA metallopeptidase C-terminal domain"
labels_dict['PF19290'] = "PmbA/TldA metallopeptidase central domain"
labels_dict['PF20211'] = "Family of unknown function (DUF6571)"
labels_dict['PF19782'] = "Family of unknown function (DUF6267)"
labels_dict['PF19343'] = "Family of unknown function (DUF5923)"
labels_dict['PF20036'] = "Major capsid protein 13-like"
labels_dict['PF18960'] = "Family of unknown function (DUF5702)"
labels_dict['PF18906'] = "Phage tail tube protein"
labels_dict['PF19753'] = "Family of unknown function (DUF6240)"

def get_label(pfam):
    # Returns None for Pfam IDs that are missing from the labels dictionary
    return labels_dict.get(pfam)
    
pattern = r"Glycosyl\s*transferase"

#This function will access the Pfam family lists for each DataFrame entry and convert it to a free text description
def get_pfam_list_labels(pfam_list):
    pfam_list = pfam_list.split(";")

    try:
        return(" | ".join([get_label(pfam) for pfam in pfam_list]))
    except TypeError:
        return("None")

deepbgc_df["bio_pfam_ids"] = deepbgc_df["bio_pfam_ids"].fillna("None")
deepbgc_df["Descriptions"] = deepbgc_df["bio_pfam_ids"].apply(get_pfam_list_labels)

proteins = []

saccharide_df = deepbgc_df[deepbgc_df["product_class"] == "Saccharide"].sort_values(by="MGYP")

for row in saccharide_df.to_dict(orient="records"):
    mgya = row["MGYA"]
    erz = row["ERZ"]
    mgyp  = row["MGYP"]
    protein_ids = row["protein_ids"].split(";")
    pfam_df = pd.read_csv(f"../deepbgc/{mgya}-{erz}/{mgya}-{erz}.pfam.tsv", sep="\t").query("protein_id in @protein_ids").query("deepbgc_score > 0.5")

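    for protein in protein_ids:
        pfams = pfam_df[pfam_df["protein_id"] == protein]["pfam_id"].values
        # Skip proteins without a confident Pfam hit, so that the labels of
        # the previous protein are never reused
        if len(pfams) == 0:
            continue

        pfam_labels = " | ".join(get_label(pfam) or "None" for pfam in pfams)

        # re.match only tests the start of the joined label string
        if re.match(pattern, pfam_labels):
            proteins.append(protein)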
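
One detail of the filter above is worth illustrating: `re.match` only tests the beginning of the string, so a protein whose glycosyltransferase domain is not the first label in the joined description is skipped, whereas `re.search` would find it anywhere. The label strings below are illustrative examples, not values pulled from the DeepBGC output.

```python
import re

pattern = r"Glycosyl\s*transferase"

# Illustrative joined Pfam descriptions (not taken from the real data)
labels = [
    "Glycosyl transferase family 2",
    "Glycosyltransferase family 28 C-terminal domain",
    "Methyltransferase domain | Glycosyl transferase family 2",
    "Phage capsid protein",
]

# Anchored at the start of the string, as in the notebook
matched = [label for label in labels if re.match(pattern, label)]
# Found anywhere in the string
searched = [label for label in labels if re.search(pattern, label)]

print(len(matched), len(searched))
```

Which behaviour is preferable depends on whether multi-domain proteins with a later glycosyltransferase domain should count as hits.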

In [10]:
proteins[:10]

['ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_16',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_20',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_24',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_33',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_34',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_35',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_36',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_37',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ829061.940-NODE-940-length-32616-cov-5.169129_38',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_ERZ8

Very nasty indeed! How can we map this to an MGYP?

Well, if you look at the format of the names above you can see it looks like the following:

    CONTIG_NAME_CONTIG_NAME_CDSnumber

The CONTIG_NAME part contains the ERZ accession, which we can use to find a FASTA file containing all the CDSes in that contig, like so:

In [11]:
from glob import glob

glob(f"../contigs/CDS/*-{proteins[0].split('.')[0]}_CDS.fasta")

['../contigs/CDS/MGYA00589052-ERZ829061_CDS.fasta']
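
As a minimal sketch of how such a name can be taken apart, the hypothetical helper `parse_cds_name` below splits off the CDS number, checks that the contig name really is repeated, and extracts the ERZ accession. It assumes the `CONTIG_NAME_CONTIG_NAME_CDSnumber` layout described above and nothing else about the naming scheme.

```python
def parse_cds_name(cds_name):
    """Split 'CONTIG_CONTIG_N' into (erz, contig_name, cds_number)."""
    doubled_contig, cds_number = cds_name.rsplit("_", 1)
    # The remainder should be the contig name twice, joined by one underscore
    half = (len(doubled_contig) - 1) // 2
    contig_name = doubled_contig[:half]
    if doubled_contig != f"{contig_name}_{contig_name}":
        raise ValueError(f"Unexpected CDS name format: {cds_name}")
    erz = contig_name.split(".")[0]
    return erz, contig_name, int(cds_number)


name = (
    "ERZ829061.940-NODE-940-length-32616-cov-5.169129_"
    "ERZ829061.940-NODE-940-length-32616-cov-5.169129_16"
)
erz, contig_name, cds_number = parse_cds_name(name)
print(erz, contig_name, cds_number)
```

Checking the repeated half explicitly means the helper also copes with contig names that themselves contain underscores, and raises an error instead of silently returning a wrong ERZ.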

That FASTA file contains every CDS on the contig, named as shown above. The FASTA headers not only include the name of the CDS, but also its start and end positions, and strand. Well, let's take a look at the cargo metadata file from the main analysis:

In [12]:
metadata_df = pd.read_csv("../metadata/cargo_seq_metadata_filtered.csv")
metadata_df.head()

Unnamed: 0,MGYP,ERZ,MGYC,Start,End,Strand
0,MGYP001562854052,ERZ1505397,MGYC000067051923,1673,1867,-1
1,MGYP001562854052,ERZ3455868,MGYC001294411047,34702,34896,1
2,MGYP001562854052,ERZ3455893,MGYC001288129298,514,708,-1
3,MGYP001562858699,ERZ1505398,MGYC000067815300,29267,29509,-1
4,MGYP001562858699,ERZ3455857,MGYC001301572236,321,563,-1


You can see here that for every MGYP we have an ERZ, a start, an end, and a strand. The contig CDS FASTA files above give us a CDS name, ERZ, start, end, and strand for every CDS!

By matching the two on ERZ, start, end, and strand, we can map MGYPs to our CDS names. Let's do this below. First, we'll make a list of ERZs:

In [13]:
erzs = [protein.split(".")[0] for protein in proteins]
erzs[:5]

['ERZ829061', 'ERZ829061', 'ERZ829061', 'ERZ829061', 'ERZ829061']

We'll filter our metadata DataFrame to only include ERZs related to the Saccharide BGCs, since it's already gigabytes in size!

In [14]:
metadata_df = metadata_df.query("ERZ in @erzs")
metadata_df

Unnamed: 0,MGYP,ERZ,MGYC,Start,End,Strand
33,MGYP001562867763,ERZ1292623,MGYC000033824563,12472,12612,1
34,MGYP001562867763,ERZ1292627,MGYC000038164947,495,635,-1
35,MGYP001562867763,ERZ1292643,MGYC000032866011,111170,111310,1
41,MGYP001562867855,ERZ1292623,MGYC000033824834,41577,41717,-1
42,MGYP001562867855,ERZ1292627,MGYC000038166982,6008,6148,-1
...,...,...,...,...,...,...
45700740,MGYP001562811370,ERZ1292643,MGYC000033053442,52,552,1
45700753,MGYP001562811514,ERZ1292623,MGYC000033850659,241,534,1
45700756,MGYP001562811514,ERZ1292627,MGYC000038445135,574,867,-1
45700764,MGYP001562811654,ERZ1292627,MGYC000038203426,486,866,-1


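Now, let's make a dictionary mapping the ERZ, start, end, and strand of each CDS to its CDS name. The FASTA headers use the short form of each name (contig name plus CDS number), so we first strip the duplicated contig prefix: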

In [17]:
protein_names = [f"{protein.split('_')[1]}_{protein.split('_')[2]}" for protein in proteins]
protein_names[:5]

['ERZ829061.940-NODE-940-length-32616-cov-5.169129_16',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_20',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_24',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_33',
 'ERZ829061.940-NODE-940-length-32616-cov-5.169129_34']

In [26]:
from Bio import SeqIO

erz_dict = {}

for protein_name in protein_names:
    erz = protein_name.split(".")[0]
    seq_records = SeqIO.parse(glob(f"../contigs/CDS/*-{erz}_CDS.fasta")[0], "fasta")

    # Find the FASTA header whose first token is this CDS name
    fasta_header = next(record.description for record in seq_records if record.id == protein_name)

    # Header fields are separated by " # ": name, start, end, strand, ...
    _, start, end, strand = fasta_header.split(" # ")[:4]

    erz_dict["_".join((erz, start, end, strand))] = protein_name
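
The header parsing can be tried without any FASTA files. This sketch assumes only the ` # `-separated layout that the cell above relies on (name, start, end, strand, then anything else). The header string and the helper name `header_to_key` are made up for illustration.

```python
def header_to_key(erz, fasta_header):
    """Build the 'ERZ_start_end_strand' lookup key from one FASTA header."""
    name, start, end, strand = fasta_header.split(" # ")[:4]
    return "_".join((erz, start, end, strand)), name


# Hypothetical header: the accession and coordinates are invented
header = "ERZ0000001.1-contig-1_7 # 1673 # 1867 # -1 # extra=fields"
key, cds_name = header_to_key("ERZ0000001", header)
print(key, "->", cds_name)
```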

And finally, let's go through the metadata DataFrame and use the ERZ, start, end, and strand to map CDS names to MGYPs!

In [30]:
mgyp_dict = {}

for _, row in metadata_df.iterrows():
    try:
        mgyp_dict[erz_dict["_".join((row["ERZ"], str(row["Start"]), str(row["End"]), str(row["Strand"])))]] = row["MGYP"]
    except KeyError:
        continue

mgyp_dict

{'ERZ1292623.197-Hain-H51-01um-R1-197-length-68997-cov-24_53': 'MGYP001563014911',
 'ERZ1292627.123-Hain-H51-01um-R2-123-length-75432-cov-26.213301_11': 'MGYP001563014911',
 'ERZ1292643.32-Hain-H51-01um-R3-32-length-121540-cov-21.638293_11': 'MGYP001563014911',
 'ERZ1292623.197-Hain-H51-01um-R1-197-length-68997-cov-24_52': 'MGYP001563980993',
 'ERZ1292623.1172-Hain-H51-01um-R1-1172-length-25845-cov-16_21': 'MGYP001564069982',
 'ERZ1292643.876-Hain-H51-01um-R3-876-length-26228-cov-21.493715_21': 'MGYP001564069982',
 'ERZ1292623.1172-Hain-H51-01um-R1-1172-length-25845-cov-16_22': 'MGYP001566592428',
 'ERZ1292643.876-Hain-H51-01um-R3-876-length-26228-cov-21.493715_22': 'MGYP001566592428',
 'ERZ1292623.197-Hain-H51-01um-R1-197-length-68997-cov-24_35': 'MGYP001568093992',
 'ERZ1292627.123-Hain-H51-01um-R2-123-length-75432-cov-26.213301_29': 'MGYP001568093992',
 'ERZ1292643.32-Hain-H51-01um-R3-32-length-121540-cov-21.638293_29': 'MGYP001568093992',
 'ERZ1292623.197-Hain-H51-01um-R1-197-lengt

Brilliant! There's a script `scripts/make_saccharide_glycosyltransferase_list.py` which will do the above processing and write a text file of MGYPs.
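
For reference, the coordinate-matching idea can be sketched on toy data. The example below uses tuples rather than underscore-joined strings as dictionary keys, which is a small design change and not necessarily what the cells above or the script do. All accessions and coordinates in it are invented.

```python
# CDS coordinates as they might be read from FASTA headers (invented values)
cds_coordinates = {
    ("ERZ0000001", 1673, 1867, -1): "ERZ0000001.1-contig-1_7",
    ("ERZ0000001", 2500, 3100, 1): "ERZ0000001.1-contig-1_9",
}

# Rows as they might appear in the cargo metadata (invented values)
metadata_rows = [
    {"MGYP": "MGYP000000000001", "ERZ": "ERZ0000001", "Start": 1673, "End": 1867, "Strand": -1},
    {"MGYP": "MGYP000000000002", "ERZ": "ERZ0000002", "Start": 10, "End": 400, "Strand": 1},
]

cds_to_mgyp = {}
for row in metadata_rows:
    key = (row["ERZ"], row["Start"], row["End"], row["Strand"])
    cds_name = cds_coordinates.get(key)
    if cds_name is not None:  # rows without a matching CDS are skipped
        cds_to_mgyp[cds_name] = row["MGYP"]

print(cds_to_mgyp)
```

Tuple keys keep the coordinates as integers and avoid any ambiguity that could arise if a field ever contained an underscore.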