## Basic Training
In this notebook, we will show

* How to construct a training and validation dataset that respect External Symmetry. Disconnection on the BC Clan graph will satisfy fairness in External Symmetry; this forms a testing dataset ready for k-fold cross validation.
* How to train some models in a 3-fold CV scheme. The training will be done with pytorch taking advantage of its dataloader.
* Several effective data augmentation strategies popularised in residual network training.

We will illustrate this with training on classification of `A,U,C,G`. The AUCG dataset is much smaller in size than the base/nonsite/phosphate/ribose `S,X,P,R` dataset, but it requires more attention to curate as some structures are solved with interacting bases, but are proven [wildcards](https://droog.gs.washington.edu/parc/images/iupac.html) indicated by the author in the paper. These pdb entries are removed from training to avoid confusion. 


## Imports

In [1]:
# ============== Click Please.Imports
import sys
import glob
import gc
import io

import random
random.seed(42)
import pandas as pd
import numpy as np
import networkx as nx

from scipy import sparse
import torch
import seaborn as sns

import matplotlib.pyplot as plt


import time
import tqdm
import collections


import functools
import itertools
import multiprocessing



import torch 
from torch import nn

import torchvision as tv
import pytorch_lightning as pl


sys.path.append('../')
from NucleicNet.DatasetBuilding.util import *
#from NucleicNet.DatasetBuilding.commandReadPdbFtp import ReadBCExternalSymmetry, MakeBcClanGraph
from NucleicNet.DatasetBuilding.commandDataFetcherMmseq import FetchIndex, FetchTask, FetchDataset
from NucleicNet import Burn, Fuel
import NucleicNet.Burn.util
import NucleicNet.Burn.M1
import  NucleicNet.Burn.DA
%config InlineBackend.figure_format = 'svg'

sns.set_context("notebook")



# Turn on cuda optimizer
print(torch.backends.cudnn.is_available())
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
# disable debugs NOTE use only after debugging
torch.autograd.set_detect_anomaly(False)
torch.autograd.profiler.profile(False)
torch.autograd.profiler.emit_nvtx(False)
# Disable gradient tracking
#torch.no_grad()
#torch.inference_mode()

# ================= Click Please. Directories ==================
DIR_DerivedData = "../Database-PDB/DerivedData/"
DIR_Typi = "../Database-PDB/typi/"
DIR_FuelInput = "../Database-PDB/feature/"


True



## Scope of Data

The cell below defines the scope of data to be used in training AUCG classifcation. The selection consists of many AND clauses for rejection of PDB entries we deemed unsuitable for our task to avoid garbage-in-garbage-out situations. Many are notated with domain-knowledge dense comments and we did our best to automate the selection. Several groups are discussed below:
* Resolution poorer than 3.5 angstrom. Many of these structures are recent cryo-em structures. Many of their sidechains were stubbed (cut away) when authors are not confident in locating the atoms; atom counting programs cannot handle these cases. 
* Less than 4 nt. Mostly concerning dinucleotide, NTP,NDP,NMP, etc.
* Shape-dependent/translocating machinaries. Many RNA-binding protein only recognise RNA due to its secondary structure enveloped in layers of phoshate/ribose. While only some bases interact with the protein, the binding behavior can be sequence independent. These machinaries are usually indicated in `Df_grand["Title"],Df_grand["Header"],Df_grand['NpidbClassification'],`. Also note that PDB curators kindly updates its index every Wednesday, but there can be delay!
* Author indicate absence of sequence specificity in article. These can be wildcard base interacting sites or simply the interaction are too marginal or water-mediated that binding behaviour becomes independent of sequence. These are curated by reading into the literature. A quote from the article where the PDB entry is orignated is provided for most of these entries. (Please report an issue if disagree!)
* Structures without an accompanying article indexed by Pubmed. There is no way to verfiy subtleties in these entries.
* Cases where metal/interfacial inhibitor/tip of hairpin/water-mediated/marginally/modified base interacts with non-canonical-amino-acid engineered protein in solution. (These descriptions are not mutually exclusive nor intersecting...) These concerns induced sequence/non-sequence specificities by zinc cages, chemical , secondary structure recognition and non-canonical amino acids. There are also entries solved with a poly-A/U/C/G oligonucleotiude just as a template for more advanced sequence interaction construction.
* NMR multi-states. Some states in solution NMR structures consistently perform poorer than its siblings. We attribute this to the fact that NMR structures are not solved atom-by-atom but rather with a subset of atoms (in the simplest case, constrained by multi-dimensional scaling) and later minimized in a all-atom force field (e.g. simulated annealing). Three states consistently perform better were selected for benchmark.

We have updated the curation to year 2021 but we cannot guarentee the curation using flags below will suffice the need of our community thereafter without an update.  


In [2]:



Df_grand = pd.read_pickle(DIR_DerivedData + "/DataframeGrand.pkl")
Df_grand = Df_grand.loc[(Df_grand["ProNu"] == "prot-nuc") & (Df_grand['Resolution'] <= 3.5) # NOTE you may consider to relax the 3.0 Angstron resolution limit as cryoEM structure w/ ~3.5 angstrom are not uncommon to be modelled in full atom these days
                                          & ~(Df_grand['PubmedID'].isnull()) # NOTE ~78 structures. Note that some are recent unindexed by pdb; most are unpublished structures. Some contains large missing loops.
                                          & (Df_grand['NucleicAcid'].isin(['rna']))
                                          & (pd.notnull(Df_grand['InternalSymmetryBC-95']))
                                          & (Df_grand["Year"] <= 2021)
                                          & (Df_grand["MeanChainLength_Nucleotide"] >= 4) & (Df_grand["SumChainLength_Peptide"] > 50) 
                                          # NOTE Some machineries that do not show preference in base or a disproportionately small amount of sites with preference.
                                          & ~(Df_grand["Title"].str.contains('ribos|riboz|transcript|polymerase|trna|pseudouridine|srp|signal recognition particle| ribonuclease|rig-i|exosome|spliceosome|csy|csm|cas1|cas9|casc', regex=True, na=False)) 
                                          & ~(Df_grand["Header"].str.contains('ribos|riboz|transcript|polymerase|trna|pseudouridine|srp|signal recognition particle| ribonuclease|rig-i|exosome|spliceosome|csy|csm|cas1|cas9|casc', regex=True, na=False))
#                                          & ~(Df_grand["Title"].str.contains('ribos|riboz|transcript|polymerase|trna|pseudouridine|srp|signal recognition particle| ribonuclease|exosome|spliceosome', regex=True, na=False)) 
#                                          & ~(Df_grand["Header"].str.contains('ribos|riboz|transcript|polymerase|trna|pseudouridine|srp|signal recognition particle| ribonuclease|exosome|spliceosome', regex=True, na=False))
                                          & ~(Df_grand['NpidbClassification'].isin(["TRANSFERASE/RNA",'TRANSFERASE','RIBOSOME'])) # ~170 structures

                                          # NOTE Author of paper indicate absence of sequence-specificity in article.
                                          & ~(Df_grand['Pdbid'].isin(['5o7h','7cyq',
                                                                      '5bud','5btb','5bte','5bto',
                                                                      '4b3g',
                                                                      '6vrb','6vrc',
                                                                      '2vnu', # NOTE contain selenomet MSE
                                                                      '7c08','7c07', # NOTE zinc cage interaction in zinc finger we cannot handle it
                                                                      '3hjw','3hjy','2bgg','1ytu',
                                                                      '4d26','4d25','5w3v','6vqv','4qik','4qil',
                                                                      '3t5n','3t5q',
                                                                      '6cbd','1knz','5id6',
                                                                      '2po0','2po1','2po2','2pnz',
                                                                      '4oo1','3m7n','3m85',
                                                                      '4tyy','4tz0','4tz6','4tyn','4tyw',
                                                                      '7bv2','7c42','7c43','7c45','7c47','7c4c','7c4b',
                                                                      '6zdp','6zd1','6zd2','6zd6','6zdq','6zdu',
                                                                      '4hor','4hoq','4hos','4hot','4hou',
                                                                      '6f3h','6w6v',
                                                                      '4p3e', '4p3f', '4p3g',
                                                                      '5jc3','5jc7','5jcf','5jch',
                                                                      '5c9h',
                                                                      '3wbm', # NOTE `Although these proteins are abundant and bind both DNA and RNA sequences nonspecifically`
                                                                      #'4z4c', '4z4d', '4z4e', '4z4f', '4z4g', '4z4h', '4z4i' , "appear non-specific" (??? Check)
                                                                      '6bjg', '6bjh', '6bjv', '1rpu',
                                                                      '5ws2','6llb',
                                                                      '6r7g',
                                                                      
                                                                      '5n8l', '5n8m', # NOTE `We find that siRNA recognition by the dsRBDs is not sequence-specific but rather depends on the RNA shape. The two dsRBDs can swap their binding sites, giving rise to two equally populated, pseudo-symmetrical complexes, showing that TRBP is not a primary sensor of siRNA asymmetry. `
                                                                      '6sjd', # NOTE This is a shape dependent selection where base marginally touch https://genome.cshlp.org/content/30/7/962.full.pdf+html
                                                                      '6bbo','6b0b', # NOTE `The human A3H-duplex RNA binding mechanism is mediated in large part by electrostatic contacts between the enzyme and the RNA phosphodiester backbone, as opposed to sequence-specific contacts. `
                                                                      '6vqw', '6vqx','6vqv', # NOTE Many Cas system does not have sequence preference such that the machinery is versatile. In this paper they demo Cas7f ` Cas7f and crRNA form multiple hydrogen bonds, which mostly occur between the arginine-rich region (F32, R34, R68, Q95, R168, Q247, Q276, K277, R283, S308, R350) and the sugar-phosphate backbone of crRNA, with only two nucleobases (G[+14], G[+19]) involved (SI Appendix, Fig. S7E). This finding indicates the nonsequence-specific crRNA recognition mode of Cas7f.`
                                                                      '3sn2','3snp', #NOTE `Given the similarities in the structures of the IRP1-bound TfR B and Ftn H IREs in this region, and the lack of sequence-specific contacts between IRP1 and the IRE upper helix, the impact of sequence differences in the upper helix on protein binding may relate to effects on helical twist and pitch, and/or to effects on helix stability.`
                                                                      #'6uv2','6uv3','6uv4', # NOTE The sequence specificty is debated and likely multilabel RCAYCH especially when a U10 can also bind 
                                                                      '2gtt','7acs','6yi3', # NOTE `Our model also explains the unspecific nature of N-NTD:RNA interaction. The N-NTD virtually only interacts with the RNA backbone while the bases are, in the case of ssRNA flipped away from the protein, or, in the case of dsRNA involved in base pairing but do not interact with the N-NTD `
                                                                      '2wj8', #NOTE `Although the N RNA contacts in the groove are not base-specific, the cavity appears tailored to bind a set of three stacked bases, a feature that appears to be conserved across the Mononegavirales order. Because the bases are averaged out in our crystals, it is not possible to tell from the structure whether certain particular nucleotide sequences would make stronger or weaker interactions within the cavity. `
                                                                      '2c0b', '2bx2', '2c4r', # NOTE `Aside from that contact, there is no sequence recognition as such, so it seems that the preference of RNase E for A + U-rich substrates27,28,29 arises mainly through the recognition of the RNA conformation.`
                                                                      '3l25','3l26', #NOTE `In the absence of base-specific contacts, it is likely that the interactions observed between the VP35 IID central basic patch and the 8-bp dsRNA are independent of the nucleotide sequence. `
                                                                      '4fvu', # NOTE ` By design, each individual strand of RNA is composed of either all-purine or all-pyrimidine bases. The single all-purine and all-pyrimidine strands were annealed to yield dsRNAs that each contain one all-purine and one all-pyrimidine strand. Digestion of dsRNA by Lassa NP is sequence-independent, and thus, there is equal likelihood for either the purine-containing or the pyrimidine-containing strand to be digested. `
                                                                      '3ks8','3ks4', # NOTE `This crystal structure shows that each VP35 in the dimer contacts the dsRNA exclusively through the sugar-phosphate backbone and hydrophobic faces of terminal bases, allowing the VP35 dimer to recognize dsRNA in a sequence-independent manner. `
                                                                      '2zko', # NOTE `NS1A RBD has the ability to distinguish between dsRNA and dsDNA and recognizes dsRNA in a sequence-independent manner because all the intermolecular contacts are directed towards the sugar-phosphate backbone and 2′-OH groups on the RNA strand`
                                                                      '6zlc', '6sx0','6sx2', # NOTE The title of the pdb says non-specific, but the article cited a few specific sequences at the tip of a hairpin not resolved.
                                                                      '4e78','4e7a','4e76', #NOTE `None of the nucleobase hydrogen bond acceptors or donors is recognized by the polymerase, indicating sequence-independent recognition by the polymerase. `
                                                                      '2ix1','2ix0', # NOTE `The final nucleotides 9–13 are located in a cavity within the RNB domain (Fig. 1c, d), with their five bases clamped between conserved Phe 358 and Tyr 253 (Fig. 2b and Supplementary Fig. S10). Each phosphate group is engaged in one or two hydrogen bonds with protein residues, a characteristic of non-sequence-specific nucleotide recognition sites.`
                                                                      '6aay', # NOTE `in order to verify whether A(-37) possesses the base specificity for the pre-crRNA cleavage, A(-37) was mutated to G, C and U, respectively. Results indicate that base type change at this position does not influence the cleavage activity, `
                                                                      '3pf4','3pf5', # NOTE `The similar kon values indicate low specificity of Bs-CspB association to all oligonucleotides examined regarding the base composition and the type of ribose rings,`
                                                                      '2i91','1yvp','1yvr', # NOTE 2i91 states that `the structure suggests that Ro recognizes helix I of the misfolded RNA as a duplex rather than recognizing any specific sequence.`
                                                                      '4oav','4oau', # NOTE Both are resolved with a polyA sequence, but author indicate specificity as recognizes the pattern UN^N. In this case, only the ribose and phosphate are good for use.
                                                                      '5z9x','5z9z', # NOTE `Unlike many RRMs, which exhibit strong substrate binding affinities (in the nM range) and sequence specificities, SDN1 RRM binds RNA weakly in a sequence independent mode. `
                                                                      '6htu','6sdy','6sdw', # NOTE `Human, Drosophila, and C. elegans Stau were in fact shown to bind dsRNA without apparent sequence specificity in vitro``
                                                                      '5y58','5y59','5y5a', # NOTE ` The base of A304 packs against the aliphatic loop L15-16 (between strands β15 and β16) of Ku70 and makes no sequence-specific hydrogen-bonding interactions with the protein (Figure 2F). Substitution of A304 with any other nucleotides had marginal effect on the interaction between Ku and TLC1KBS (Figures 2D and S2). `
                                                                      '5ns3','5ns4', # NOTE `Our crystallographic studies confirm the predisposition of Cy3 and Cy5 to stack on the final basepair of double-stranded nucleic acids. We have noted that this is true irrespective of the identity of the sequence of the terminal basepair` 
                                                                      '5ed2','5ed1','5hp2','5hp3', #NOTE `However, the observed clash is not severe, and the enzyme would be able to accommodate G or C 5′ nearest neighbors by slight structural perturbations, thus explaining why this sequence preference is not an absolute requirement.`
                                                                      '6o5f', # NOTE `We found no base-specific recognition of RNA by the protein`
                                                                      '7dic','7dcy','7dol','7did', # NOTE `All the samples tested effectively digested a 30-mer ssRNA in a sequence and length independent manner`
                                                                      '3rc8',# NOTE `bases are mutually stacked but they form only two hydrogenbonds with the surrounding protein residues. Generally, thesequence conservation of these motifs is lower than in theATP-binding motifs, but some characteristics are common tomost of the SF2 superfamily members.`
                                                                        # NOTE `Intriguingly, RIG-I is observed to bind all blunt RNA termini in much the same way, without regard to RNA sequence or the presence of a 5′-triphosphate`
                                                                      '4db2','4db4', # NOTE `No protein contacts are observed to the RNA bases of either strand (Fig. 3), consistent with the non-specific RNA binding shown by Mss116p and other DEAD-box proteins1. `
                                                                      '7k9e','7k9d','7k9b','7k9c','7kkv', #NOTE `The lack of base-specific interactions, except for the two hydrogen bonds mediated by the first G501 in the tetraloop (Fig. 3B), indicates that the specificity of OapB binding to this region is dictated by the tertiary conformation of the GNRA tetraloop, rather than exact sequence within this tetraloop family.`
                                                                      '4ijs','3zla','3zl9', # NOTE from 3zla paper `previous studies have shown that RNA bound to purified soluble recombinant tetramers contains no specific or consensus sequences `
                                                                      '7onb', # NOTE The MINX is unmodeled 

                                                                      '6u6y', # NOTE 4-nt long sequence but only one base is modeled for 3 out of 4 chains; one chain with 3 base but only one in contact.,...
                                                                      '5ddo','5ddp','5ddr', '5ddq', # NOTE Riboswitch
                                                                      '6e4p', # NOTE polyu `ur FP studies revealed that the RRM domain binds with high affinity to U20 and G20 RNA. Interestingly, however, the RRM bound with significantly higher affinity to poly(G) sequences than poly(U) RNA (Supplemental Figure S2; Figure 5A). These data and the fact that the RRM-U4 structure revealed what appeared to be specific interactions between the protein backbone and the uridine bases, suggested that the RRM might bind poly(G) in a manner distinct from poly(U). `
                                                                      '7elc', '7ela', '7elb','7el9', # NOTE This is a polymerase structure, there is no mention of base specific interaction in article https://www.nature.com/articles/s41564-021-00916-w
                                                                      '5jrc','5jre','5jr9', # NOTE `Among the nucleic acid interacting residues, mutations of Lys19 had no obvious impact on the cleavage activity of the protein (K19A). However, alanine substitutions of Ser26 and Arg164 all led to weakened cleavage activities of the mutant proteins (S26A and R164A), compared with the WT. Mutations of other residues, including Lys34, Arg124, Arg163 and Tyr168, caused more significant reduction on the cleavage activities of the mutants (K34A, R124A, R163A and Y168A). These results were all consistent with the structural observations.Interestingly, the in vitro cleavage assays with both substrates showed an obvious pattern of products. Comparison with the FAM-labeled markers, including (AC)5 and (AC)5A in Figure 4A, and (GT)5 and (GT)5G in Figure 4B, indicated that NeC3PO preferentially cut after the purine residues. In the NeC3PO:ssRNA structure, the side chain of Phe160 stacked with the nucleobase of A6 (or A6′). Replacing Phe160 with residues that has larger (Trp160 for F160W) or smaller (Ala160 for F160A) side chains showed certain enhancing or weakening effect on the overall cleavage activity of the mutant proteins, compared with the WT. However, the mutant proteins still selectively cut after the purine residues. These observations indicated that Phe160 plays important role in the substrate binding, but is not the main cause of NeC3PO preference to purine.`
                                                                      '6wxx','6wxy','6xl1','6wxw','6sce', # NOTE requires short cyclic nucleotide cA4 for specificity 
                                                                      '3trz', '5udz', '3ts0', # NOTE Part of the recognition unit contains zinc cage
                                                                      '6fqr', # NOTE This structure is with CCCC 6gx6 with ACAC `IMP3 RRM12 bound to CCCC and AAAA with similar, but, weak affinity (∼40 µM) (Fig. 2A,B,D). We did not detect the binding, however, to either UUUU or GGGG indicating modest sequence specificity (Supplemental Fig. S4B,C). IMP3 RRM12 bound to ACAC with almost an order of magnitude higher affinity (∼5 µM) indicating its preference for this dinucleotide sequence (Fig. 2C,D).`
                                                                      '6sy4', '6sy6', # NOTE This is a very interesting pair. The RNA bound is asymetric but the protein is symetric such that some interactions at Q38 for guanosine is lost `Due to the asymmetric nature of the interface, Arg28′ and Gln38′ from the second protomer neither participate in cation-π interactions (Arg28′) nor in guanine recognition (Gln38′).`
                                                                      '7om7','7oma','7om6','7om2','7om9', # NOTE These are dsRNA structure w/o contact with base. The paper does not mention base interaction/specificity either https://www.mdpi.com/1999-4915/13/7/1260/html

                                                                      '5udk','5udi','5udj','5udl', # NOTE ` IFIT1 forms a water-filled, positively charged RNA-binding tunnel with a separate hydrophobic extension that unexpectedly engages the cap in multiple conformations (syn and anti) giving rise to a relatively plastic and nonspecific mode of binding, in stark contrast to eIF4E. `
                                                                      '5oc6', # NOTE `This domain, typically ∼68 amino acids, is well-known for its functional versatility by means of a particular α1-β1β2β3-α2 canonical structure that allows it to recognize a variety of simple RNA structures ranging from A-form RNA helices to hairpins or tetraloops in shape-dependent manners (7,9,10), even though a sequence-specific mode of recognition has been invoked for a few of them (11,12).`
                                                                      '3qsu', # NOTE This is the poly A structure `Binding studies show that Sa Hfq binds (AU) 3 A ≈ (AG) 3 A ≥ (AC) 3 A > (AA) 3 A `
                                                                      '1ddl', # NOTE `With both fragments of RNA, binding to protein through hydrogen bonds to either phosphate or ribose groups of the RNA appears not to be sequence-specific. The density seen for base rings fails to suggest any specific nucleotide sequence. `
                                                                      '6nut', # NOTE ` We modeled this density as a polymer of six adenosine residues since, as expected, none of these resi-dues makes sequence-specific contacts`
                                                                      '3nmr','3nna','3nnc', # NOTE These are selenomethionine substituted structure. 3nnc and 3nnh are not! but 3nnc contains a pseduo symmetric unit with without bound rna `The structures of RRM1/2 in complexes 1 and 2 were determined by multiwavelength anomalous dispersion (MAD) phasing on Se atoms using selenomethionine-labeled protein. The structure of complex 3 was solved by molecular replacement using the refined structure of complex 1 determined at 1.85 Å resolution, as a search model (`
                                                                      '4yhw', # NOTE Structure contains a selenomethionine at multiple position eg 417, we cannot handle it `yU4/U6stem II+10nt was slowly added to the SeMet-substituted yPrp3CTF in a 1:1 molar ratio and incubated at 4°C for 15 min. `
                                                                      '6uv2', '6uv3', # NOTE “VCAUCH” (Mori et al., 2014) to “RCAYCH.” These two specificities are contested and this paper prefers the latter. I removed the CACACA and ACACCU which contradict Mori et al
                                                                      '5d0b','5d0a','5d08', '5t8y' , # NOTE Interactions with base are not found by pymol standard
                                                                      '5gjb', # NOTE `he ribose group of ATP bulges out from the binding pocket and no clear electron densities are observed for the adenine group, suggesting that the ZIKV helicase may not have nucleotide specificity for its NTPase activity.`
                                                                      '5uz9', # NOTE Largely non-specific `crRNA binding by Cas7f is mediated by non-sequence specific contacts between the sugar-phosphate backbone and residues on the palm (R35, H275, Q277, K278, N281, R284) and web (R169, Q248). `
                                                                      '4oq8', '4oq9', '4nia', # NOTE The authors attempted different constraints/restraints to refine a X-ray data, but the single potential base interaction is very marginal  ` It can now be seenthat the base is stacked upon the guanidinium group ofArg125. In addition, the side chain of Asn16 is nearly coplanarwith and approaches the edge of the base, where it could beoriented to form hydrogen bonds with appropriate atoms onthe nucleotide base. The latter feature, as proposed by Seemanet  al.(1976), could thereby provide some degree of basespecificity at the ‘free nucleotide’ position.` 
                                                                      '1bmv', # NOTE `Interactions with protein are dominated by nonbonding forces with few specific contacts. ` viral cap
                                                                      '2jlq', '2jlr', '2jls', '2jlu', '2jlv', '2jlw', '2jlx', '2jly', '2jlz',  # NOTE `As expected, RNA recognition appears to largely occur in a sequence-independent manner as a 13-mer oligoribonucleotide (RNA13) having a different sequence binds to NS3h in essentially the same way as the 12-mer oligoribonucleotide`
                                                                      '5ytx','5yts', # NOTE These are solved with suboptimal sequence against sibling at 5ytt, 5ytv`ater, using the iCLIP-Seq method, we mapped the in vivo YB-1–binding sites at a genome-wide level and found that YB-1 preferentially recognizes a UYAUC motif, which closely resembles the binding motif determined by in vitro SELEX (38). In a recent study, the CAUC motif was also identified as a YB-1–binding motif in a fused cell line by `
                                                                      '3o3i', # NOTE Only backbone interaction is reported https://www.pnas.org/doi/full/10.1073/pnas.1017762108
                                                                      '1gtn', '1gtf', # NOTE the sequence with CC spacer is less tightly held `. In the  complex with CCspacers, where the spacer region is least ordered, the G1 baseis 0.5 Afurther  from  the  protein  than  in  the  other  twostructures. In both the complex with GAGUU and that withGAGCC,  the  second  of  the  two  spacer  nucleotides  is  lessordered than the Ærst.`
                                                                      '6muu', # NOTE `The absence of base-specific contacts between protein residues and bases of the crRNA within the 5+1 repeat accounts for the lack of sequence specificity for spacer sequence recognition.`
                                                                      '5jji','5jjl','5jjk', # NOTE These set of protein does not interact with the base as the  author argue that the specificty for pyrimidine is induced allosterically 
                                                                      '7ogk', # NOTE Also note that base are not resolved. `A degenerate ARN-repeat sequence in the RNA substrate interacts with one of the Hfq RNA-binding surfaces, bridging Hfq and PNPase, and indicating a loose sequence preference for carrier assembly. `
                                                                      '2izm', # NOTE This structure is resolved with C-10 but G-10 or A-10 is preferred see 2izn ` The –10 base (the bulge) and the –4 base (in the loop) bind similar pockets in the two halves of a dimer, making extensive contacts through hydrogen bonding and hydrophobic interactions ( 5 ). Only an adenosine can be accommodated at the –4 position without a significant reduction of binding ( 7 ). The wild-type TR sequence has an adenosine at position –10, but binding studies have shown that guanosine gives a similar binding constant to adenosine, if the sequence in the stem is changed so that alternative conformations are avoided `
                                                                      '5z9w', # NOTE ` Our structure reveals how the Ebola virus nucleocapsid core encapsidates its viral genome, its sequence-independent coordination with RNA by nucleoprotein, and the dynamic transition between the RNA-free and RNA-bound states.`
                                                                      '6r9q','6r9p','6r9m','6r9j', '6r9o' # NOTE Not by touch`Crystal structures of Saccharomyces cerevisiae Pan2 in complex with RNA show that, surprisingly, Pan2 does not form canonical base-specific contacts. Instead, it recognizes the intrinsic stacked, helical conformation of poly(A) RNA. `
                                                                      ]))
                                          # NOTE Unpublished but with pubmedid?
                                          & ~(Df_grand['Pdbid'].isin(['3p6y', '2n8m', '3ahu']))
                                          # NOTE Cases where metal/interfacial inhibitor/tip of hairpin/water-mediated/marginally/modified base interacts with rna base
                                          & ~(Df_grand['Pdbid'].isin([  
                                                                        '2lsl',
                                                                        
                                                                        '5zc9', # NOTE eIF4A1 chemical clamp, water
                                                                        '6xki', # NOTE eIF4A1 chemical clamp, water
                                                                        
                                                                        '4bkk', # NOTE nucleoprotein. There is no mention of base interaction through out the article https://www.microbiologyresearch.org/content/journal/jgv/10.1099/vir.0.053025-0
                                                                        '6yrb','6yrq', # NOTE No base interaction mentioned in paper (Check again)
                                                                        '1yyw', '2nug', '2nue', '1yz9',# NOTE These is a AU dsRna but prefer GU in other Rnase3 at Q157, 1yz9 makes no contact w/ base
                                                                        #'2bs0', # NOTE RNA at interface of two varial capsid protein symmetry mates
                                                                        '7n0c','7n0b','7n0d', # NOTE exoribonuclease proof-reading complex but when mismatch the base makes no touch
                                                                        '2xgj', # NOTE Helicase w/ no touch at base



                                                                        # NOTE Structure solved with Poly-Oligonucleotiude just as a template
                                                                        '5wwf', '5ho4', # NOTE These are proteins resolved with same interacting sequence. Its siblings 5wwg 5wwe 5wwx makes most contact with the protein
                                                                        '4ht8', '3gib', # NOTE 4ht9 has a higher resolution also with additional uridine sites shown
                                                                        '4ijs', # NOTE They use a polyA sequence for simplicity. even though there are interaction with some of the bases.
                                                                        '2xbm', # NOTE specificity is in a dinucleotide labeled as G3A
                                                                        
                                                                        '5eeu', '5eev', '5eew', '5eex', '5eey', '5eez', '5ef0', '5ef1', '5ef2', '5ef3', '1utd', '4v4f', # NOTE While the protein is the same, RNA does not show up in a pseudo symmetry mate. Half and Half. also note a lot of unmodeled nt https://www.rcsb.org/3d-view/5EEV/1
                                                                        '6dtd', #  NOTE Cas 13b
                                                                        '2zi0', '4erd', # NOTE single helix contact
                                                                        '6cf2', # NOTE single helix contact

                                                                        #'6mdz', # TODO Ttesting
                                                                        '5js2', '5ki6', # NOTE Modified base argonaut
                                                                        '6oon','5vm9','5w6v','4kre','4kxt','4olb','4ola', # NOTE Poly-A sequence bound to argonaut
                                                                        '5t7b' # NOTE unpublished argonaut
                                                                        '4z4c', '4z4d', '4z4e', '4z4f', '4z4g', '4z4h', '4z4i', # NOTE This series of pdbid concerns a water mediated recognition site for adenosine on argonaute `Water-mediated recognition of t1-adenosine anchors Argonaute2 to microRNA targets`
                                                                        '5js1', '4w5o', '4w5q', # NOTE Argonaut structure. 4w5o,q has more missing residue than siblings 4w5t,r,n.
                                                                        '5wqe', # NOTE multiplebase specific interactions were outlined but most interacts with peptide backbone.
                                                                        '5wtk', # NOTE 4 base specific interactions were outlined but the structure is ds and some sidechains e.g. 415-416 were stubbed. we will not include it in training

                                                                        # NOTE No specific H bond contact found/does not fulfill Hbond criterion in pymol
                                                                        '5ztm', # NOTE The claimed interaction at E172, N175, Q195 does not fulfill H-bond criterion in pymol. Find>Polar Contacts
                                                                        '6h5s','6h5q', # NOTE no specific H bond  contact
                                                                        '4al7','4al5','4al6', # NOTE base binding site at an unmodeled loop
                                                                        '4n2s','4n2q','4me2', # NOTE close but no defined H bond 
                                                                        '6hyu', '6hyt', # NOTE polyA used and no Hbond specific contact
                                                                        


                                                                        '5t8y', 

                                                                        '4z92', # NOTE minimal contact in vriys 
                                                                        '3hsb', # NOTE a AGAGAG aptamer used but the G does not form specific hbond interactions 
                                                                        '7bg6','7bg7','7nuq','7nun','7nuo','7nul','7num', # NOTE only stack touched
                                                                        '5f9f', '5f98','5f9h','5e3h','3eqt', # NOTE RIG-I recognise modified base m7G `https://www.pnas.org/doi/full/10.1073/pnas.1515152113`
                                                                        '5z98','4lg2', # NOTE duplex
                                                                        '2ihx', # NOTE Disordered
                                                                        '4gv3','4gv6','4gv9','4gve','4g9z', #NOTE backbone only
                                                                        '7c06', # NOTE it shares same sequence ith 7c08 but poor?
                                                                        
                                                                        '3ciy' # NOTE dsRNA
                                                                        '5jbg', # NOTE MDA5
                                                                        '4ill', '4ilm', '4ilr', # NOTE The RNA strand appears broken??? (bonds too long)
                                                                        '6s8b','6s8e','6shb','6sic','6s91','6s6b', # NOTE Backbone only. marginal interaction

                                                                        '4peh','4peg','4pei','4pef', # NOTE modified base

                                                                        '5jaj','5jb2','5jbg', # NOTE LGP2 duplex
                                                                        '4lg2', # NOTE duplex
                                                                        '4gha','5m73', # NOTE dsrna

                                                                        '3ciy', # NOTE 3.41 angstrom resolution, some sidechain can be highly flexible
                                                                        
                                                                        

                                                                        '3zd6','3zd7', # NOTE Rig I

                                                                        '3zc0', # NOTE almost no contact
                                                                        '2jlw', # NOTE no contact
                                                                        '6ozp', '6ozn', '6ozf','6oze', '6ozg', '6ozh','6ozi', '6ozj', '6ozk', '6ozl', '6ozm',  '6ozo',  '6ozq', 
                                                                        '6ozr','6ozs', # NOTE through backbone
                                                                        '2gje', # NOTE backbone only
                                                                        '1f8v', '2bbv', # NOTE backbone only duplex cage in virus capsid

                                                                        
                                                                        '2mxy', # NOTE solution structure with extra nucleotide compare to 2mz1
                                                                       

                                                                        '3pkm', # NOTE missing loop

                                                                        
                                                                        
                                                                        '2bx2', # NOTE Marginal

                                                                        '6d06', # NOTE modified base dsrna
                                                                        '3dh3', # NOTE Modified base
                                                                        '7kfn', # NOTE Modified base
                                                                        '4i67', # NOTE Modified nt
                                                                        '1jbt','1jbs','1jbr',
                                                                        '6gc5', #NOTE short strand
                                                                        

                                                                        
                                                                        '5uj2', # NOTE marginal; same family as 4e78

                                                                        '7ndh', '7ndi', '7ndj', '7ndk','3d2s', # NOTE require zinc cage
                                                                        '6l1w', '1rgo', # NOTE zinc finger    
                                                                        '4lj0', '5elk',# NOTE Zn finger short peptide

                                                                        '2mqv','2mqt','2ms0','2ms1','2mkn','5u9b','1wwe','1wwf','1wwd','1wwg','2n82','5u9b','1fje','1t4l',
                                                                        '2l3c','2lup','1a1t','2mf1','2mf0','1f6u','1ekz','6gbm','2mfe', '2mfg', '2mfh','2mff', '4cio','2jpp',#NOTE Disordered NMR solution structures
                                                                        
                                                                        '5c0y','5v7c', # NOTE no contact
                                                                        '5wea', # NOTE poly A sequence

                                                                        '6vff', # NOTE dsrna
                                                                        '7krn', '7kro','7krp', # NOTE Helicase dsrna
                                                                        '4pmi', # NOTE single helix


                                                                        # NOTE Water-mediated or simply in an envelope of water
                                                                        '4qoz', '4tuw','4tux','4tv0','4l8r', # NOTE water duplex
                                                                        '4mdx', # NOTE water
                                                                        '5l2l', # NOTE water
                                                                        '5elh', # NOTE water; 5elk has much tighter contact 
                                                                        '2pjp', '6lt7','6db8','6db9','1c9s','6c6k', '3ts2','5tf6',
                                                                        '4n0t','4kzd','6b3k','5e08','5h1l', '1m5o', '6fq3',
                                                                        '5gxh','4q9q', '6mwn','5det','6u8d','6u8k', '5gxi','6hau','6d12',
                                                                        '2y8y','2y9h','2y8w','4qvc','4f02','6fql','6fq3', # NOTE water
                                                                        ]))
                                          # NOTE Recently indexed shape-dependent machinery (tRNA/exosome/ribosome), but pdb has not updated its derived data
                                          & ~(Df_grand['Pdbid'].isin(['5hr7','5omw','5jea',
                                                                      '4o26', # NOTE telomerase
                                                                      '5fmz','5epi', # NOTE polymerase
                                                                      '6zoj', '6zok', '6zol', # NOTE Ribosome
                                                                      '6yan','6yam','6yal', # NOTE ribosome
                                                                      '5iwa', # NOTE ribosome
                                                                      '5e6m', # NOTE trna
                                                                      '5on2','5onh','5on3','5omw','3al0', '3akz', '5e6m', # NOTE tRNA 
                                                                      '1zl3', # NOTE trna specificity at modified base FLO
                                                                      '5ud5','5v6x','4qei','4kqe' # NOTE trna
                                                                      '3jam','3jap','3jaq', # NOTE This is a ribosome
                                                                      '5ng6', # NOTE Crispr machinery recognise DNA motif TTN but no mention of RNA
                                                                      '6sh8','6s6b', '6s8b', '6s8e', '6s91', '6shb', '6sic', # NOTE Crispr machinery no mention of base interaction
                                                                    ]))


                                          # NOTE 
                                              ]
#print(pd.unique(Df_grand['NucleicAcid']))
print(Df_grand.shape)
# NOTE Further Remarks on some interesting cases
# 3PTO, 3PTX, 3PU0, 3PU1, 3PU4. uses the same nucleocapsid to bind with poly(A,U,C,G), which they use to test how interaction with each kind of base will look like and they propose UAG as an interesting motif to look for https://journals.asm.org/doi/10.1128/JVI.01927-10
#                               polyG shows largest amount of interaction polyU shows none However at 3.0 Angstrom, the assignment of N161 can be flipped to make interaction with U27 (seem to support by K164)
# 6O1K, 6O1L, 6O1M              `Hfq thus has a structural preference for (ARN)n RNA stretches on its distal side, where N is any nucleotide. `


NmrStates = [ '1aud00000004','1aud00000010','1aud00000002',
              '2l4100000005','2l4100000011','2l4100000013',
              '2xc700000000','2xc700000002','2xc700000006',
              '1dz500000007','1dz500000008','1dz500000002',
              '1k1g00000001','1k1g00000005','1k1g00000007',
              '2ad900000017','2ad900000012','2ad900000019',
              '2adb00000004','2adb00000005','2adb00000014',
              '2adc00000007','2adc00000001','2adc00000000',
              '2c0600000002','2c0600000004','2c0600000009',
              '2cjk00000007','2cjk00000008','2cjk00000012',
              '2err00000003','2err00000016','2err00000006',
              '2fy100000008','2fy100000002','2fy100000000',
              '2kfy00000006','2kfy00000003','2kfy00000001',
              '2kg000000019','2kg000000012','2kg000000000',
              '2kg100000006','2kg100000005','2kg100000003',
              '2kh900000007','2kh900000001','2kh900000005',
              '2km800000004','2km800000007','2km800000006',
              '2kxn00000007','2kxn00000008','2kxn00000001',
              '2l2k00000006','2l2k00000002','2l2k00000007',
              '2l3j00000008','2l3j00000001','2l3j00000002',
              '2l5d00000004','2l5d00000016','2l5d00000008',
              '2lbs00000013','2lbs00000009','2lbs00000005',
              '2leb00000018','2leb00000000','2leb00000016',
              '2lec00000018','2lec00000002','2lec00000007',
              '2m8d00000013','2m8d00000003','2m8d00000010',
              '2mb000000004','2mb000000018','2mb000000001',
              '2mfc00000005','2mfc00000001','2mfc00000015',
              '2mfe00000001','2mfe00000002','2mfe00000013',
              '2mgz00000017','2mgz00000004','2mgz00000009',
              '2mjh00000019','2mjh00000006','2mjh00000009',
              '2mki00000005','2mki00000014','2mki00000002',
              '2mkk00000006','2mkk00000008','2mkk00000004',
              '2mz100000018','2mz100000004','2mz100000003',
              '2n7c00000002','2n7c00000010','2n7c00000007',
              '2n8l00000003','2n8l00000006','2n8l00000004',
              '2rra00000005','2rra00000008','2rra00000009',
              '2rs200000018','2rs200000004','2rs200000017',
              '2ru300000015','2ru300000011','2ru300000018',
              '4cio00000000','4cio00000006','4cio00000008',
              '5m8i00000008','5m8i00000014','5m8i00000006',
              '5mpg00000011','5mpg00000007','5mpg00000003',
              '5mpl00000004','5mpl00000012','5mpl00000002',
              '5n8l00000014','5n8l00000018','5n8l00000013',
              '5n8m00000015','5n8m00000004','5n8m00000002',
              '5x3z00000016','5x3z00000010','5x3z00000001',
              '6gbm00000002','6gbm00000000','6gbm00000011',
              '6hpj00000013','6hpj00000006','6hpj00000012',
              '6snj00000009','6snj00000000','6snj00000002',
              '6tph00000004','6tph00000009','6tph00000001',
              '7act00000009','7act00000008','7act00000000',
 ]



(347, 34)


## Training Options

The cell below will define 9 subfolds with around the same datasize for each task. A 3-fold cross validation will be done with each cross fold containing 3 sub fold. In each training cycle 2 subfolds are resserved for validation 1 for testing; the remaining 6 for training. Some options are

* Task. `User_Task = "AUCG"`.
* Number of cross folds to be done. We recommend `n_CrossFold = 9`.
* Extent of external symmetry (BC percent) to be considered when we separate folds. We recommend `ClanGraphBcPercent = 90`, but 70 seems also affordable.
* Hierarchy of class labels. We recommend a two level hierarchy `TaskNameLabelLogicDict = {"SXPR":LabelLogic_level0, "AUCG": LabelLogic_level1,}`, but a finer hierarchy `commandDataFetcher.OBSOLETE_TaskNameLabelLogicDict` is also provided if ever needed.
* Filter using Derived Data from PDB FTP. We recommend filtering as suggested in `Df_grand`.

Some options are machine learning specific hyperparameters and can be tuned in combination if desired. See comments for detail. Some worth mentioning hyperparameters:
* Noise in input/hidden layer.
* Ghost Batch Normalisation. As the size of dataset grow we can no longer afford small-batch-size (typically 128 or less datapoint) training. 
* Multi-step cosine scheduler. `SimpleMultistepCosineLRS` This helps to propose multiple models ready for random forest or simple ensemble-averaging.
* Label smoothing by neighborhood. This discount voxels at voronoi boundary.
* Label smoothing by class. 
* Bottleneck width. This also allow width tuning as in wideresnet. 

Some further remarks 

* When we pack clans of different sizes into the cross folds, we are not aiming at a [bin-packing solution](https://en.wikipedia.org/wiki/Bin_packing_problem), but rather we aim at distributing clans of different sizes evenly among folds. The process will produce a dataframe `TaskClanFoldDf_BC{bc percent}.pkl`, that indicates which pdbids to be included in the fold. 
* While we cannot load all data into RAM, we will make 6 pass from Storage to RAM, where each pass is restricted to hold `User_DesiredBatchDatasize = 3500000` datapoint. 
* Class Clan Resampling will be done in minibatch.

In [3]:
n_CrossFold = 9
ClanGraphBcPercent = 90
User_featuretype = 'altman'

User_Task = "AUCG"
n_row_maxhold = 10000


# ================ Collapse. Click Please 

User_DesiredBatchDatasize    = 3500000 # NOTE This controls the number of new batch of dataset-dataloader being reloaded into memory
User_SampleSizePerEpoch_Factor = 1.0 # NOTE This controls how much sample enters into an epoch. if < 1.0, the sampler will make less than User_DesiredBatchDatasize sample to be fed in one epoch

User_SampleSizePerEpoch = int(User_DesiredBatchDatasize * User_SampleSizePerEpoch_Factor)
n_datasetworker = 16
User_ExperiementName = 'AUCG-9CVMm%s'%(ClanGraphBcPercent)

DIR_TrainingRoot = "/home/homingla/Project-NucleicNet/Models/"
DIR_TrainLog = "/home/homingla/Project-NucleicNet/Models/" 
#DIR_Checkpoint = "/home/homingla/Project-NucleicNet/Models/AUCG_Resnet50Pretrained/lightning_logs/version_4/checkpoints/epoch=4-step=4689.ckpt"
pl.seed_everything(42)
Combination_SizeMinibatch = [1024]                  # NOTE We have used Ghost Batch Norm with virtual batch size 128
Combination_LabelSmoothing  = [0.36]                # NOTE Default 0.12 when User_NeighborLabelSmoothAngstrom > 0.0. else 0.36
Combination_PerformReduction = [False]              # NOTE Default False. True worsen the performance.
Combination_Activation = ['gelu']                   # NOTE Default gelu 
Combination_n_ResnetBlock = [16]                    # NOTE Default 16 
Combination_lr = [ 1e-3  * 1.75,  1e-3  * 2.0,]         # NOTE Seemingly a lower MMseq e.g. 30 vs 90 would require larger learning rate. e.g. at 1.0 1e-3
Combination_min_lr = [1e-6]                        # NOTE Default 1e-6
Combination_CooldownInterval = [5000]               # NOTE Default 2000
Combination_AdamW_weight_decay = [0.01 * 3]        # NOTE Default model can tolerate 0.05 but not 0.1. In general 0.01-0.05 are satisfactory. Check Max Performance
Combination_Dropoutp = [0.7]                    # NOTE Default 0.7 model can tolerate 0.7
Combination_AddL1 = [0.000001]                      # NOTE Default 0 0.0001 poorer than 0.000001 
Combination_n_channelbottleneck = [40]          # NOTE Default 40, but 160 leads to simpler model as indicated by L1 of weights? Check
Combination_ShiftLrRatio = [0.01]                   # NOTE Unused
Combination_User_LrScheduler = ["SimpleMultistepCosineLRS"]           # NOTE Default SimpleMultistepCosineLRS CosineAnnealingLR DescendingCosineAnnealingLR_HalfEpoch
Combination_User_BiasInSuffixFc = [False]            # NOTE Default True
Combination_User_NoiseX = [0.125 *8]                # NOTE Default 1.0 model can tolerate 1.0-1.5
Combination_User_NoiseY = [0.0]                     # NOTE Unused
Combination_User_Mixup = [False]                    # NOTE Unused. 
Combination_User_NumReductionComponent = [20]       # NOTE Default. Unused unless PerformReduction = True
Combination_User_NoiseZ = [0.125 *8]                # NOTE Default 1.5
Combination_User_NeighborLabelSmoothAngstrom = [1.5] # NOTE Default 0.0. 
Combination_User_InputDropoutp = [0.01]             # NOTE Default 0.1 finalise after tuning all hyperparameters
Combination_User_Loss = ["CrossEntropyLoss"]        # NOTE CrossEntropyLoss 

Combination_User_FocalLossAlpha = [0.25]            # NOTE Default 0.25 No effect if focal loss not used.
Combination_User_FocalLossGamma = [2.0]             # NOTE Default 2. Note gamma == 0 returns CE
Combination_User_GradientClippingValue = [1e10] # clip gradients' global norm to <= this number larger network may need larger clip? default 10000 TODO Test
combinations = [
                Combination_SizeMinibatch,
                Combination_LabelSmoothing,
                Combination_PerformReduction,
                Combination_Activation,

                Combination_n_ResnetBlock,
                Combination_lr,
                Combination_CooldownInterval,
                Combination_AdamW_weight_decay,
                Combination_min_lr,
                Combination_Dropoutp,
                Combination_AddL1,
                Combination_n_channelbottleneck,
                Combination_ShiftLrRatio,
                Combination_User_LrScheduler,
                Combination_User_BiasInSuffixFc,
                Combination_User_NoiseX,
                Combination_User_NoiseY,
                Combination_User_Mixup,
                Combination_User_NumReductionComponent,
                Combination_User_NoiseZ,
                Combination_User_NeighborLabelSmoothAngstrom,
                Combination_User_InputDropoutp,
                Combination_User_Loss,
                Combination_User_FocalLossAlpha,
                Combination_User_FocalLossGamma,
                Combination_User_GradientClippingValue,
                ]

# result contains all possible combinations.
CombinationList = list(itertools.product(*combinations))
print(CombinationList)


Global seed set to 42


[(1024, 0.36, False, 'gelu', 16, 0.00175, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0), (1024, 0.36, False, 'gelu', 16, 0.002, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0)]


In [4]:


# ========================= Auto 


FetchTaskC = FetchTask(DIR_DerivedData = DIR_DerivedData,
                              DIR_Typi = DIR_Typi,
                              DIR_FuelInput = DIR_FuelInput,
                              Df_grand = Df_grand,
                              TaskNameLabelLogicDict = None,
                              n_row_maxhold = n_row_maxhold)

# =========================
# Get Definition of Tasks
# =========================
# NOTE This collects task name and how to get corresponding data in typi 
TaskNameLabelLogicDict = FetchTaskC.Return_TaskNameLabelLogicDict()
#print(TaskNameLabelLogicDict)


print(FetchTaskC.TaskNameLabelLogicDict)

# =======================
# Task Clan Fold Dataframe
# =======================
# NOTE each element contains 3 tuple train val test
CrossFoldDfList = FetchTaskC.Return_CrossFoldDfList(n_CrossFold = n_CrossFold, 
                                                      ClanGraphBcPercent = ClanGraphBcPercent, 
                                                      User_Task = User_Task,
                                                      Factor_ClampOnMaxSize = 100000,  # NOTE Constraint on datasize of a clan
                                                      Factor_ClampOnMultistate = 20,   # NOTE Constriant on number of multistate file read
                                                      NmrStates = NmrStates
                                                      )


{'SXPR': {'Base': {'union': ['A', 'U', 'C', 'G'], 'exclu': [], 'intersect': []}, 'Nonsite': {'union': ['nonsite_'], 'exclu': ['F'], 'intersect': []}, 'P': {'union': ['P'], 'exclu': [], 'intersect': []}, 'R': {'union': ['R'], 'exclu': [], 'intersect': []}}, 'AUCG': {'A': {'union': ['A'], 'exclu': [], 'intersect': ['nucsite_']}, 'U': {'union': ['U'], 'exclu': [], 'intersect': ['nucsite_']}, 'C': {'union': ['C'], 'exclu': [], 'intersect': ['nucsite_']}, 'G': {'union': ['G'], 'exclu': [], 'intersect': ['nucsite_']}}}


In [5]:
for cccc in CombinationList:
  for User_SelectedCrossFoldIndex in [ 0,3,6]:

    print(cccc)
    # ==========================
    # Hyperparam 
    # ===============================
    PART0_InitialiseHyperparameters = True
    if PART0_InitialiseHyperparameters:
    # ==========================
    # Hyperparam 
    # ===============================

        User_SizeMinibatch = cccc[0] #256 
        User_LabelSmoothing = cccc[1] #0.16 
        User_PerformReduction = cccc[2] #True 
        User_Activation = cccc[3] #'gelu'
        User_n_ResnetBlock = cccc[4]#16 
        User_lr = cccc[5] #1e-3      
        n_Restart = 1  
        User_CooldownInterval = cccc[6] #951
        User_AdamW_weight_decay = cccc[7] #1e-2
        User_min_lr = cccc[8] #1e-6


        User_Dropoutp = cccc[9]
        User_AddL1 = cccc[10]
        User_n_channelbottleneck = cccc[11]
        User_ShiftLrRatio = cccc[12]


        # NOTE Currently fixed for benchmarking
        User_LrScheduler = cccc[13]   
        User_BiasInSuffixFc = cccc[14]
        User_NoiseX = cccc[15]
        User_NoiseY = cccc[16]
        User_Mixup = cccc[17] # NOTE Not used.
        User_NumReductionComponent = cccc[18]
        User_NoiseZ = cccc[19]
        User_NeighborLabelSmoothAngstrom = cccc[20]
        User_InputDropoutp = cccc[21]
        User_Loss = cccc[22]
        User_FocalLossAlpha = cccc[23]
        User_FocalLossGamma = cccc[24]
        User_GradientClippingValue = cccc[25]
        #print(User_GradientClippingValue)
        #sys.exit()

        FetchDatasetC = FetchDataset(
            DIR_DerivedData = DIR_DerivedData,
            DIR_Typi = DIR_Typi,
            DIR_FuelInput = DIR_FuelInput,
            User_DesiredDatasize    = User_DesiredBatchDatasize, # NOTE This controls the number of new batch of dataset-dataloader being reloaded into memory
            User_SampleSizePerEpoch_Factor = User_SampleSizePerEpoch_Factor, # NOTE This controls how much sample enters into an epoch
            User_featuretype = User_featuretype,
            n_datasetworker = n_datasetworker,
            ClanGraphBcPercent = ClanGraphBcPercent)

        classindex_str = sorted(TaskNameLabelLogicDict[User_Task].keys()) 
        ClassName_ClassIndex_Dict = dict(zip(classindex_str, range(len(classindex_str))))

    # ============================
    # Get Cross-Folds and Batches
    # ============================
    print("Getting TrainValTest batches")
    PART1A_GetCrossFolds = True
    if PART1A_GetCrossFolds:

        # NOTE Pdbids, Datasize weight
        Train_PdbidBatches, TrainFold_PdbidSamplingWeight = CrossFoldDfList[User_SelectedCrossFoldIndex][0]
        Val_PdbidBatches, ValFold_PdbidSamplingWeight = CrossFoldDfList[User_SelectedCrossFoldIndex][1]
        Testing_PdbidBatches,TestingFold_PdbidSamplingWeight  = CrossFoldDfList[User_SelectedCrossFoldIndex][2]

        print(len(Train_PdbidBatches), len(Val_PdbidBatches), len(Testing_PdbidBatches), len(set(Testing_PdbidBatches+Val_PdbidBatches+Train_PdbidBatches)))
        Train_PdbidWeight = dict(
                TrainFold_PdbidSamplingWeight[["Pdbid", "PdbidSamplingWeight"]].values.tolist()
                )
        Val_PdbidWeight = dict(
                ValFold_PdbidSamplingWeight[["Pdbid", "PdbidSamplingWeight"]].values.tolist()
                )
        Testing_PdbidWeight = dict(
                TestingFold_PdbidSamplingWeight[["Pdbid", "PdbidSamplingWeight"]].values.tolist()
                )

    PART1B_DatasetDataloader = True
    if PART1B_DatasetDataloader:
        # NOTE Train
        ds_train, ds_train_samplingweight = FetchDatasetC.GetDataset(
                        Assigned_PdbidBatch = Train_PdbidBatches,
                        Assigned_PdbidWeight = Train_PdbidWeight,
                        User_NumReductionComponent = User_NumReductionComponent,
                        ClassName_ClassIndex_Dict = ClassName_ClassIndex_Dict,
                        User_Task = User_Task,
                        PerformZscoring = True, 
                        PerformReduction = User_PerformReduction,
                        User_NeighborLabelSmoothAngstrom = User_NeighborLabelSmoothAngstrom 
                        )
                        
        train_sampler = torch.utils.data.sampler.WeightedRandomSampler(
                        ds_train_samplingweight, User_SampleSizePerEpoch, replacement=True)
        train_loader  = torch.utils.data.DataLoader(ds_train, batch_size=User_SizeMinibatch, drop_last=True, num_workers=4, 
                                                            pin_memory=True,worker_init_fn=None, prefetch_factor=3, persistent_workers=False,
                                                            sampler = train_sampler)

        # NOTE Val
        ds_val, ds_val_samplingweight = FetchDatasetC.GetDataset(
                        Assigned_PdbidBatch = Val_PdbidBatches,
                        Assigned_PdbidWeight = Val_PdbidWeight,
                        User_NumReductionComponent = User_NumReductionComponent,
                        ClassName_ClassIndex_Dict = ClassName_ClassIndex_Dict,
                        User_Task = User_Task,
                        PerformZscoring = True, 
                        PerformReduction = User_PerformReduction,
                        User_NeighborLabelSmoothAngstrom = User_NeighborLabelSmoothAngstrom 
                        )
        val_sampler = torch.utils.data.sampler.WeightedRandomSampler(
            ds_val_samplingweight, int(User_SampleSizePerEpoch/100), replacement=True)
        val_loader          = torch.utils.data.DataLoader(ds_val, batch_size=int(ds_val.__len__()/100), drop_last=False, num_workers=4, 
                                                            pin_memory=True,worker_init_fn=None, prefetch_factor=3, persistent_workers=False,
                                                            shuffle=False, sampler = val_sampler)  

        #NOTE Test
        """
        ds_testing, ds_testing_samplingweight = FetchDatasetC.GetDataset(
                        Assigned_PdbidBatch = Testing_PdbidBatches,
                        Assigned_PdbidWeight = Testing_PdbidWeight,
                        User_NumReductionComponent = User_NumReductionComponent,
                        ClassName_ClassIndex_Dict = ClassName_ClassIndex_Dict,
                        User_Task = User_Task,
                        PerformZscoring = True, 
                        PerformReduction = User_PerformReduction,
                        )
        
        testing_sampler = torch.utils.data.sampler.WeightedRandomSampler(
            ds_testing_samplingweight, int(User_SampleSizePerEpoch/100), replacement=True)
        testing_loader          = torch.utils.data.DataLoader(ds_testing, batch_size=int(ds_testing.__len__()/100), drop_last=False, 
                                                            num_workers=4, 
                                                            pin_memory=True,worker_init_fn=None, prefetch_factor=3, persistent_workers=False,
                                                            shuffle=False, sampler = testing_sampler) 
        """







        
    # =====================
    # Define Model
    # ======================
    PART2_DefineModel = True
    if PART2_DefineModel:
        if User_PerformReduction:
            n_FeatPerShell = User_NumReductionComponent
            hw_product = n_FeatPerShell*6
        else:
            n_FeatPerShell = 80
            hw_product = 80*6

        model = NucleicNet.Burn.M1.B1hw_FcLogits(
                        model   = NucleicNet.Burn.M1.B1hw_LayerResnetBottleneck(n_FeatPerShell = n_FeatPerShell, 
                                                    n_Shell = 6,
                                                    n_ShellMix = 2,
                                                    User_Activation = User_Activation,
                                                    User_Block = "B1hw_BlockPreActResnet",
                                                    n_Blocks = User_n_ResnetBlock,
                                                    ManualInitiation = False,
                                                    User_n_channelbottleneck = User_n_channelbottleneck,
                                                    User_NoiseZ = User_NoiseZ,
                                                    ),

                        #loss    = customloss, 
                        User_Loss = User_Loss, 
                        n_class = 4,
                        hw_product = hw_product,
                        AddMultiLabelSoftMarginLoss = False, # TODO Worsen stuff? One-vs-all likely of no use.
                        User_lr = User_lr,
                        User_min_lr = User_min_lr,
                        User_LrScheduler = User_LrScheduler,
                        User_CooldownInterval = User_CooldownInterval,
                        BiasInSuffixFc = User_BiasInSuffixFc, 
                        # NOTE some kwargs for hparam record
                        User_SizeMinibatch = User_SizeMinibatch,
                        User_LabelSmoothing = User_LabelSmoothing,
                        User_PerformReduction = User_PerformReduction,
                        User_n_ResnetBlock = User_n_ResnetBlock,
                        User_AdamW_weight_decay = User_AdamW_weight_decay,
                        User_Activation = User_Activation,
                        User_SelectedCrossFoldIndex = User_SelectedCrossFoldIndex,
                        User_Dropoutp = User_Dropoutp,
                        User_AddL1 = User_AddL1,
                        User_n_channelbottleneck = User_n_channelbottleneck,
                        User_ShiftLrRatio = User_ShiftLrRatio,
                        User_NoiseX = User_NoiseX,
                        User_NoiseY = User_NoiseY,
                        #User_Mixup = User_Mixup,
                        User_NumReductionComponent = User_NumReductionComponent,
                        User_NoiseZ = User_NoiseZ,
                        User_PdbidTraining = Train_PdbidBatches,
                        User_PdbidValidation = Val_PdbidBatches,
                        User_PdbidTesting = Testing_PdbidBatches,
                        User_InputDropoutp = User_InputDropoutp,
                        User_FocalLossAlpha = User_FocalLossAlpha,
                        User_FocalLossGamma = User_FocalLossGamma,
                        User_n_CrossFold = n_CrossFold,
                        User_ClanGraphBcPercent = ClanGraphBcPercent,
                        User_Task = User_Task,
                        User_NeighborLabelSmoothAngstrom = User_NeighborLabelSmoothAngstrom,
                        User_GradientClippingValue = User_GradientClippingValue,
                    )



        NucleicNet.Burn.util.ResetAllParameters(model)



    # ====================
    # Stage 0 training
    # ====================
    trainer00 = NucleicNet.Burn.util.DefaultTrainer00(DIR_TrainLog = DIR_TrainLog, 
                                                        DIR_TrainingRoot = DIR_TrainingRoot, 
                                                        User_ExperiementName = User_ExperiementName,
                                                        User_SizeMinibatch = User_SizeMinibatch ,
                                                        User_ShiftLrRatio = User_ShiftLrRatio,
                                                        User_Mixup = User_Mixup,
                                                        User_GradientClippingValue = User_GradientClippingValue)
    trainer00.logger._log_graph = True 
    trainer00.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)



    del model, trainer00
    gc.collect()



(1024, 0.36, False, 'gelu', 16, 0.00175, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0)
Getting TrainValTest batches
207 70 36 313
Concating Dataset


100%|██████████| 207/207 [00:00<00:00, 7234.93it/s]


Finished Concat data. Cooling down
264390 264390
{0: 20.488312585491137, 1: 10.261026147155574, 2: 12.544115012309305, 3: 29.706546255073405}
Concating Dataset


100%|██████████| 70/70 [00:00<00:00, 9936.42it/s]


Finished Concat data. Cooling down
70317 70317
{0: 7.313173609915914, 1: 1.146397844338393, 2: 3.8564953882948885, 3: 11.68393315745482}


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                       | Params
-------------------------------------------------------------
0 | nested_module | B1hw_LayerResnetBottleneck | 311 K 
1 | prefix_layerD | Sequential                 | 0     
2 | suffix_layerA | Sequential                 | 0     
3 | suffix_layerD | Sequential                 | 230 K 
4 | suffix_layerZ | Sequential                 | 1.9 K 
5 | loss          | CrossEntropyLoss           | 0     
-------------------------------------------------------------
543 K     Trainable params
0         Non-trainable params
543 K     Total params
2.173     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  rank_zero_warn(


                                                                      

Global seed set to 42


Epoch 2:  94%|█████████▍| 7908/8417 [24:08<01:33,  5.46it/s, loss=0.955, v_num=2_43, train_loss_s=0.952, val_loss_s=1.440]10000 20000 0.0017490000000000001
Epoch 14: 100%|██████████| 8417/8417 [26:26<00:00,  5.30it/s, loss=0.939, v_num=2_43, train_loss_s=0.941, val_loss_s=1.480]


FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  2.3589e+04     	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1572.1         	|15             	|  2.3582e+04     	|  99.973         	|
run_training_batch                 	|  0.34322        	|51255          	|  1.7592e+04     	|  74.578         	|
optimizer_step_with_closure_0      	|  0.34182        	|51255          	|  1.752e+04      	|  74.272         	|
training_step_and_backward         	|  0.17553        	|51255          	|  8996.9         	|  38.141         	|
backward                           

(1024, 0.36, False, 'gelu', 16, 0.00175, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0)
Getting TrainValTest batches
210 70 33 313
Concating Dataset


100%|██████████| 210/210 [00:00<00:00, 9309.84it/s]


Finished Concat data. Cooling down
188273 188273
{0: 17.469871056859304, 1: 6.685181895191551, 2: 12.356958375266036, 3: 31.487988672686495}
Concating Dataset


100%|██████████| 70/70 [00:00<00:00, 2513.45it/s]


Finished Concat data. Cooling down
124872 124872
{0: 6.8075900442477435, 1: 2.9074466971687696, 2: 4.042690983164212, 3: 8.242272275428633}


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                       | Params
-------------------------------------------------------------
0 | nested_module | B1hw_LayerResnetBottleneck | 311 K 
1 | prefix_layerD | Sequential                 | 0     
2 | suffix_layerA | Sequential                 | 0     
3 | suffix_layerD | Sequential                 | 230 K 
4 | suffix_layerZ | Sequential                 | 1.9 K 
5 | loss          | CrossEntropyLoss           | 0     
-------------------------------------------------------------
543 K     Trainable params
0         Non-trainable params
543 K     Total params
2.173     Total estimated model params size (MB)


                                                                      

Global seed set to 42


Epoch 2:  94%|█████████▍| 5955/6317 [22:51<01:23,  4.34it/s, loss=0.937, v_num=4_45, train_loss_s=0.939, val_loss_s=1.460]10000 20000 0.0017490000000000001
Epoch 14: 100%|██████████| 6317/6317 [24:49<00:00,  4.24it/s, loss=0.926, v_num=4_45, train_loss_s=0.923, val_loss_s=1.500]


FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  2.2236e+04     	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1482.2         	|15             	|  2.2232e+04     	|  99.984         	|
run_training_batch                 	|  0.34284        	|51255          	|  1.7572e+04     	|  79.026         	|
optimizer_step_with_closure_0      	|  0.3414         	|51255          	|  1.7498e+04     	|  78.694         	|
training_step_and_backward         	|  0.17557        	|51255          	|  8999.0         	|  40.47          	|
backward                           

(1024, 0.36, False, 'gelu', 16, 0.00175, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0)
Getting TrainValTest batches
209 73 31 313
Concating Dataset


100%|██████████| 209/209 [00:00<00:00, 6296.85it/s]


Finished Concat data. Cooling down
272361 272361
{0: 20.71123523471334, 1: 8.295204926323487, 2: 11.501889978546158, 3: 30.491669860350058}
Concating Dataset


100%|██████████| 73/73 [00:00<00:00, 7043.25it/s]


Finished Concat data. Cooling down
61392 61392
{0: 5.039109713616278, 1: 2.8233504771883333, 2: 4.847154990105763, 3: 10.290384819090676}


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                       | Params
-------------------------------------------------------------
0 | nested_module | B1hw_LayerResnetBottleneck | 311 K 
1 | prefix_layerD | Sequential                 | 0     
2 | suffix_layerA | Sequential                 | 0     
3 | suffix_layerD | Sequential                 | 230 K 
4 | suffix_layerZ | Sequential                 | 1.9 K 
5 | loss          | CrossEntropyLoss           | 0     
-------------------------------------------------------------
543 K     Trainable params
0         Non-trainable params
543 K     Total params
2.173     Total estimated model params size (MB)


                                                                      

Global seed set to 42


Epoch 2:  94%|█████████▍| 8652/9217 [24:49<01:37,  5.81it/s, loss=0.952, v_num=6_47, train_loss_s=0.952, val_loss_s=1.480]10000 20000 0.0017490000000000001
Epoch 14: 100%|██████████| 9217/9217 [27:07<00:00,  5.66it/s, loss=0.94, v_num=6_47, train_loss_s=0.931, val_loss_s=1.490]


FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  2.4183e+04     	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1611.9         	|15             	|  2.4179e+04     	|  99.984         	|
run_training_batch                 	|  0.344          	|51255          	|  1.7632e+04     	|  72.91          	|
optimizer_step_with_closure_0      	|  0.34255        	|51255          	|  1.7557e+04     	|  72.602         	|
training_step_and_backward         	|  0.1764         	|51255          	|  9041.6         	|  37.388         	|
backward                           

(1024, 0.36, False, 'gelu', 16, 0.002, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0)
Getting TrainValTest batches
207 70 36 313
Concating Dataset


100%|██████████| 207/207 [00:00<00:00, 5164.88it/s]


Finished Concat data. Cooling down
264390 264390
{0: 20.488312585491137, 1: 10.261026147155574, 2: 12.544115012309305, 3: 29.706546255073405}
Concating Dataset


100%|██████████| 70/70 [00:00<00:00, 4171.12it/s]


Finished Concat data. Cooling down
70317 70317
{0: 7.313173609915914, 1: 1.146397844338393, 2: 3.8564953882948885, 3: 11.68393315745482}


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                       | Params
-------------------------------------------------------------
0 | nested_module | B1hw_LayerResnetBottleneck | 311 K 
1 | prefix_layerD | Sequential                 | 0     
2 | suffix_layerA | Sequential                 | 0     
3 | suffix_layerD | Sequential                 | 230 K 
4 | suffix_layerZ | Sequential                 | 1.9 K 
5 | loss          | CrossEntropyLoss           | 0     
-------------------------------------------------------------
543 K     Trainable params
0         Non-trainable params
543 K     Total params
2.173     Total estimated model params size (MB)


                                                                      

Global seed set to 42


Epoch 2:  94%|█████████▍| 7908/8417 [24:49<01:35,  5.31it/s, loss=0.954, v_num=8_49, train_loss_s=0.944, val_loss_s=1.440]10000 20000 0.001999
Epoch 14: 100%|██████████| 8417/8417 [26:36<00:00,  5.27it/s, loss=0.94, v_num=8_49, train_loss_s=0.942, val_loss_s=1.470]


FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  2.4283e+04     	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1618.8         	|15             	|  2.4282e+04     	|  99.995         	|
run_training_batch                 	|  0.35299        	|51255          	|  1.8093e+04     	|  74.508         	|
optimizer_step_with_closure_0      	|  0.35145        	|51255          	|  1.8014e+04     	|  74.183         	|
training_step_and_backward         	|  0.18052        	|51255          	|  9252.5         	|  38.103         	|
backward                           

(1024, 0.36, False, 'gelu', 16, 0.002, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0)
Getting TrainValTest batches
210 70 33 313
Concating Dataset


100%|██████████| 210/210 [00:00<00:00, 7404.33it/s]


Finished Concat data. Cooling down
188273 188273
{0: 17.469871056859304, 1: 6.685181895191551, 2: 12.356958375266036, 3: 31.487988672686495}
Concating Dataset


100%|██████████| 70/70 [00:00<00:00, 2050.85it/s]


Finished Concat data. Cooling down
124872 124872
{0: 6.8075900442477435, 1: 2.9074466971687696, 2: 4.042690983164212, 3: 8.242272275428633}


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                       | Params
-------------------------------------------------------------
0 | nested_module | B1hw_LayerResnetBottleneck | 311 K 
1 | prefix_layerD | Sequential                 | 0     
2 | suffix_layerA | Sequential                 | 0     
3 | suffix_layerD | Sequential                 | 230 K 
4 | suffix_layerZ | Sequential                 | 1.9 K 
5 | loss          | CrossEntropyLoss           | 0     
-------------------------------------------------------------
543 K     Trainable params
0         Non-trainable params
543 K     Total params
2.173     Total estimated model params size (MB)


                                                                      

Global seed set to 42


Epoch 2:  94%|█████████▍| 5955/6317 [23:58<01:27,  4.14it/s, loss=0.937, v_num=0_51, train_loss_s=0.938, val_loss_s=1.470]10000 20000 0.001999
Epoch 14: 100%|██████████| 6317/6317 [24:57<00:00,  4.22it/s, loss=0.925, v_num=0_51, train_loss_s=0.924, val_loss_s=1.490]


FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  2.2825e+04     	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1521.5         	|15             	|  2.2823e+04     	|  99.994         	|
run_training_batch                 	|  0.35333        	|51255          	|  1.811e+04      	|  79.344         	|
optimizer_step_with_closure_0      	|  0.35183        	|51255          	|  1.8033e+04     	|  79.008         	|
training_step_and_backward         	|  0.18009        	|51255          	|  9230.5         	|  40.441         	|
backward                           

(1024, 0.36, False, 'gelu', 16, 0.002, 5000, 0.03, 1e-06, 0.7, 1e-06, 40, 0.01, 'SimpleMultistepCosineLRS', False, 1.0, 0.0, False, 20, 1.0, 1.5, 0.01, 'CrossEntropyLoss', 0.25, 2.0, 10000000000.0)
Getting TrainValTest batches
209 73 31 313
Concating Dataset


100%|██████████| 209/209 [00:00<00:00, 3589.12it/s]


Finished Concat data. Cooling down
272361 272361
{0: 20.71123523471334, 1: 8.295204926323487, 2: 11.501889978546158, 3: 30.491669860350058}
Concating Dataset


100%|██████████| 73/73 [00:00<00:00, 3151.92it/s]


Finished Concat data. Cooling down
61392 61392
{0: 5.039109713616278, 1: 2.8233504771883333, 2: 4.847154990105763, 3: 10.290384819090676}


GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                       | Params
-------------------------------------------------------------
0 | nested_module | B1hw_LayerResnetBottleneck | 311 K 
1 | prefix_layerD | Sequential                 | 0     
2 | suffix_layerA | Sequential                 | 0     
3 | suffix_layerD | Sequential                 | 230 K 
4 | suffix_layerZ | Sequential                 | 1.9 K 
5 | loss          | CrossEntropyLoss           | 0     
-------------------------------------------------------------
543 K     Trainable params
0         Non-trainable params
543 K     Total params
2.173     Total estimated model params size (MB)


                                                                      

Global seed set to 42


Epoch 2:  94%|█████████▍| 8652/9217 [24:42<01:36,  5.84it/s, loss=0.951, v_num=2_53, train_loss_s=0.947, val_loss_s=1.470]10000 20000 0.001999
Epoch 14: 100%|██████████| 9217/9217 [27:05<00:00,  5.67it/s, loss=0.94, v_num=2_53, train_loss_s=0.931, val_loss_s=1.460]


FIT Profiler Report

Action                             	|  Mean duration (s)	|Num calls      	|  Total time (s) 	|  Percentage %   	|
--------------------------------------------------------------------------------------------------------------------------------------
Total                              	|  -              	|_              	|  2.4111e+04     	|  100 %          	|
--------------------------------------------------------------------------------------------------------------------------------------
run_training_epoch                 	|  1607.3         	|15             	|  2.411e+04      	|  99.995         	|
run_training_batch                 	|  0.34351        	|51255          	|  1.7607e+04     	|  73.023         	|
optimizer_step_with_closure_0      	|  0.34208        	|51255          	|  1.7533e+04     	|  72.719         	|
training_step_and_backward         	|  0.17583        	|51255          	|  9012.0         	|  37.377         	|
backward                           

## Epilogue

Remember to train SXPR before moving on!