# CHEM 584 Machine Learning Project
## Marcus Sak

### GPT-2 Text Generation

This notebook takes in a single-column `.csv` file containing the body text of scientific articles, trains a GPT-2 model, and outputs synthetic text into a `.txt` file.

### Imports

We use [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), a Python wrapper for [OpenAI](https://openai.com/)'s [GPT-2 text generation model](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). This notebook requires version 1.x of `tensorflow-gpu` to be installed.

In [1]:
import os
import requests
import tensorflow as tf  # must be tensorflow 1.x
import gpt_2_simple as gpt2
from datetime import datetime

Four versions of GPT-2 have been released, with 124M, 355M, 774M, and 1558M (full-sized) parameters, respectively. The 774M model is the practical upper limit for fine-tuning on a single commercially available GPU. This notebook was run on a Grace node with a single NVIDIA Tesla V100 GPU with 16 GB of memory.

In [8]:
model_name = "774M"
if not os.path.isdir(os.path.join("models", model_name)):
    print(f"Downloading {model_name} model...")
    gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/774M/
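
As a rough sanity check on the claim that 774M is the largest model that fits on a 16 GB card, the sketch below estimates the memory needed just to hold the training state. It assumes fp32 weights, fp32 gradients, and two Adam moment buffers (16 bytes per parameter) and ignores activations entirely. These are illustrative assumptions, not a description of what gpt-2-simple actually allocates, and `training_gib` is a hypothetical helper.

```python
# Back-of-envelope estimate only, NOT a measurement.
# Assumption: 4 fp32 values per parameter (weights, gradients, 2 Adam moments);
# activation memory is ignored, so real usage will be higher.
PARAMS = {"124M": 124e6, "355M": 355e6, "774M": 774e6, "1558M": 1558e6}
BYTES_PER_PARAM = 4 * 4


def training_gib(n_params: float) -> float:
    """Approximate size of the training state in GiB under the assumptions above."""
    return n_params * BYTES_PER_PARAM / 2**30


for name, n_params in PARAMS.items():
    print(f"{name}: ~{training_gib(n_params):.1f} GiB")
```

Under these assumptions the 774M model needs roughly 11 to 12 GiB before activations, while the 1558M model needs over 23 GiB, which is consistent with 774M being the largest size that trained on the V100 used here.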

### Final data processing 

The final part of the data processing is handled in this notebook. In particular, we remove double quotes, remove empty lines, and break the text into lines of about 1023 words each. Note that these are space-separated words, not GPT-2 tokens.

In [3]:
file_name = "./preproc/text_proc.csv"

# preproc leaves double quotes, remove here
if not os.path.isfile(file_name):
    print("The file does not exist")
else:
    with open(file_name, 'r') as file:
        data = file.read().replace('\"', '').replace('\n\n', '\n')

    with open(file_name, 'w') as file:
        file.write(data)
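
The cleaning step can also be written as a small function, which makes it easier to test on a short string before overwriting the file. The sketch below is an illustration rather than the code that was run: `clean_text` is a hypothetical helper, and it uses a regular expression so that runs of three or more newlines are also collapsed (a single `.replace('\n\n', '\n')` pass leaves a blank line behind in that case).

```python
import re


def clean_text(raw: str) -> str:
    """Remove double quotes and collapse runs of blank lines into one newline."""
    no_quotes = raw.replace('"', '')
    return re.sub(r'\n{2,}', '\n', no_quotes)


sample = 'First "quoted" line\n\n\n\nSecond line\n'
print(repr(clean_text(sample)))
```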

In [7]:
# collect every space-separated word from the cleaned file
all_words = []
with open(file_name) as file:
    for line in file:
        all_words += line.split(' ')

line_breaker = 1023  # words per output line

# generate actual input file
gpt_input = './text_proc_2.txt'
with open(gpt_input, 'w') as file:
    for index, word in enumerate(all_words, start=1):
        # end the line after every `line_breaker` words
        separator = "\n" if index % line_breaker == 0 else " "
        file.write(word.strip('\n') + separator)
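
For reference, the same chunking idea can be expressed as a generator that is easy to check on a toy input. This is a minimal sketch, and `chunk_words` is a hypothetical name that does not appear in the notebook. It counts whitespace-separated words, so a 1023-word line will typically contain more than 1023 GPT-2 tokens once the text is byte-pair encoded.

```python
from typing import Iterable, Iterator, List


def chunk_words(words: Iterable[str], size: int) -> Iterator[str]:
    """Yield space-joined lines containing at most `size` words each."""
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if len(buffer) == size:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # final, shorter chunk
        yield " ".join(buffer)


words = "a b c d e f g".split()
for line in chunk_words(words, 3):
    print(line)
```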

### Training GPT-2

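The gpt-2-simple wrapper greatly simplifies the GPT-2 interface. Training requires only the input file and a few keyword arguments: `steps` sets the maximum number of training steps, `restore_from='fresh'` starts from the downloaded base model, and `print_every`, `sample_every`, and `save_every` set how often (in steps) the loss is printed, a sample is generated, and a checkpoint is saved. Although 20000 steps were requested, training was interrupted at around 12000 steps after observing that the loss had plateaued and the sample output was starting to reproduce phrases from the source text.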

In [9]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=gpt_input,
              model_name=model_name,
              steps=20000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=1000
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/774M/model.ckpt
INFO:tensorflow:Restoring parameters from models/774M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:03<00:00,  3.52s/it]


dataset has 762348 tokens
Training...
[10 | 24.11] loss=3.58 avg=3.58
[20 | 34.16] loss=3.01 avg=3.29
[30 | 44.22] loss=3.59 avg=3.39
[40 | 54.30] loss=2.75 avg=3.23
[50 | 64.39] loss=3.02 avg=3.19
[60 | 74.47] loss=2.50 avg=3.07
[70 | 84.57] loss=2.82 avg=3.03
[80 | 94.66] loss=2.76 avg=3.00
[90 | 104.75] loss=2.82 avg=2.98
[100 | 114.85] loss=2.94 avg=2.97
[110 | 124.96] loss=2.74 avg=2.95
[120 | 135.05] loss=2.86 avg=2.94
[130 | 145.14] loss=2.88 avg=2.94
[140 | 155.24] loss=2.05 avg=2.87
[150 | 165.33] loss=2.81 avg=2.87
[160 | 175.41] loss=2.31 avg=2.83
[170 | 185.51] loss=2.29 avg=2.79
[180 | 195.60] loss=2.90 avg=2.80
[190 | 205.68] loss=2.29 avg=2.77
[200 | 215.77] loss=2.95 avg=2.78
 stoivity, the incorporation or addition of any metal ions to the arylation products will likely result in enhanced rate and enantioselectivity. To achieve higher enantiocontrol, we explored the capacity of metal ions to influence the rates of aldination in addition to reducing the overall rate of 

[410 | 471.76] loss=3.08 avg=2.71
[420 | 481.83] loss=2.69 avg=2.71
[430 | 491.91] loss=2.24 avg=2.69
[440 | 501.99] loss=2.26 avg=2.68
[450 | 512.08] loss=1.30 avg=2.64
[460 | 522.16] loss=2.35 avg=2.64
[470 | 532.24] loss=2.76 avg=2.64
[480 | 542.33] loss=3.60 avg=2.66
[490 | 552.42] loss=2.71 avg=2.67
[500 | 562.51] loss=1.39 avg=2.63
[510 | 572.60] loss=3.19 avg=2.65
[520 | 582.69] loss=2.39 avg=2.64
[530 | 592.78] loss=2.72 avg=2.64
[540 | 602.86] loss=3.01 avg=2.65
[550 | 612.94] loss=2.14 avg=2.64
[560 | 623.02] loss=2.51 avg=2.64
[570 | 633.11] loss=1.98 avg=2.62
[580 | 643.19] loss=2.17 avg=2.61
[590 | 653.27] loss=2.73 avg=2.61
[600 | 663.36] loss=2.86 avg=2.62
 as i, j, k clusters or as a single point. Thus, we chose not to make use the simple i, z-imaging techniques discussed previously for the Hs-HPLC.  As noted previously, we employed the 3. 0-BINOL-HCl catalyst 3-nitrobenzoic acid as a catalyst for the asymmetric BINOL-BINOL linkage reaction. This catalystsubstrate inter

[810 | 914.60] loss=2.82 avg=2.54
[820 | 924.67] loss=2.22 avg=2.54
[830 | 934.74] loss=2.19 avg=2.53
[840 | 944.82] loss=1.85 avg=2.52
[850 | 954.90] loss=1.71 avg=2.51
[860 | 964.98] loss=1.31 avg=2.48
[870 | 975.05] loss=1.86 avg=2.47
[880 | 985.12] loss=1.69 avg=2.46
[890 | 995.19] loss=2.42 avg=2.46
[900 | 1005.26] loss=1.55 avg=2.45
[910 | 1015.33] loss=1.73 avg=2.43
[920 | 1025.39] loss=1.69 avg=2.42
[930 | 1035.46] loss=2.33 avg=2.42
[940 | 1045.54] loss=2.97 avg=2.43
[950 | 1055.60] loss=1.33 avg=2.41
[960 | 1065.67] loss=2.16 avg=2.41
[970 | 1075.74] loss=1.50 avg=2.39
[980 | 1085.81] loss=2.27 avg=2.39
[990 | 1095.87] loss=0.50 avg=2.36
[1000 | 1105.95] loss=0.97 avg=2.34
Saving checkpoint/run1/model-1000
 giving. However, in the two prior examples of this reaction, the catalyst 1 that was most active was identified. In each of the prior studies, catalyst 1 was found to be monocatalyzed. In the present study, the catalyst that is most active is
called 5. The catalyst for the

[1210 | 1362.61] loss=2.41 avg=2.22
[1220 | 1372.68] loss=1.52 avg=2.21
[1230 | 1382.76] loss=2.03 avg=2.21
[1240 | 1392.84] loss=1.70 avg=2.20
[1250 | 1402.93] loss=3.16 avg=2.22
[1260 | 1413.01] loss=2.26 avg=2.22
[1270 | 1423.09] loss=2.60 avg=2.22
[1280 | 1433.18] loss=2.80 avg=2.23
[1290 | 1443.27] loss=1.55 avg=2.22
[1300 | 1453.35] loss=1.40 avg=2.21
[1310 | 1463.44] loss=2.28 avg=2.21
[1320 | 1473.52] loss=1.57 avg=2.20
[1330 | 1483.60] loss=1.03 avg=2.18
[1340 | 1493.68] loss=1.30 avg=2.17
[1350 | 1503.77] loss=1.49 avg=2.16
[1360 | 1513.85] loss=1.49 avg=2.15
[1370 | 1523.93] loss=1.18 avg=2.14
[1380 | 1534.01] loss=0.77 avg=2.12
[1390 | 1544.10] loss=2.10 avg=2.12
[1400 | 1554.18] loss=1.23 avg=2.11
 In NMR Spectra, Hnigs base is a very minor player in the picture, with only 2 ppm of Hnigs base. But on a larger scale, these compounds exhibit very different NMR spectra. Whereas the CD, IR, and Hnigs bases are all labeled to some extent in the literature, the PPh 3-labeled var

[1610 | 1805.57] loss=0.78 avg=1.96
[1620 | 1815.65] loss=0.82 avg=1.94
[1630 | 1825.72] loss=0.35 avg=1.92
[1640 | 1835.80] loss=2.68 avg=1.93
[1650 | 1845.88] loss=1.30 avg=1.92
[1660 | 1855.96] loss=0.80 avg=1.91
[1670 | 1866.04] loss=2.45 avg=1.92
[1680 | 1876.12] loss=2.74 avg=1.93
[1690 | 1886.21] loss=2.29 avg=1.93
[1700 | 1896.29] loss=1.67 avg=1.93
[1710 | 1906.38] loss=1.11 avg=1.92
[1720 | 1916.47] loss=1.68 avg=1.91
[1730 | 1926.56] loss=1.14 avg=1.91
[1740 | 1936.64] loss=2.15 avg=1.91
[1750 | 1946.72] loss=1.20 avg=1.90
[1760 | 1956.80] loss=0.48 avg=1.88
[1770 | 1966.89] loss=1.27 avg=1.88
[1780 | 1976.97] loss=0.51 avg=1.86
[1790 | 1987.05] loss=1.01 avg=1.85
[1800 | 1997.14] loss=0.53 avg=1.83
. 1. 1-Naphthyl-substituted tetramer P1. Phosphate isomer 5. 1. 3-Naphthyl-substituted d-Pro-X has been shown to be more effective than N-methylimidiazole in inhibiting PI3P phosphorylation of bovine ovarian steroids.  However, the effect of substituents near the 4-position of th

[2010 | 2252.40] loss=1.95 avg=1.73
[2020 | 2262.47] loss=0.39 avg=1.72
[2030 | 2272.54] loss=0.79 avg=1.70
[2040 | 2282.62] loss=0.53 avg=1.69
[2050 | 2292.69] loss=0.87 avg=1.68
[2060 | 2302.77] loss=0.49 avg=1.67
[2070 | 2312.86] loss=2.49 avg=1.68
[2080 | 2322.94] loss=1.39 avg=1.67
[2090 | 2333.03] loss=0.67 avg=1.66
[2100 | 2343.11] loss=1.35 avg=1.66
[2110 | 2353.19] loss=0.88 avg=1.65
[2120 | 2363.28] loss=0.77 avg=1.64
[2130 | 2373.37] loss=0.46 avg=1.63
[2140 | 2383.45] loss=1.42 avg=1.62
[2150 | 2393.54] loss=1.67 avg=1.63
[2160 | 2403.63] loss=0.47 avg=1.61
[2170 | 2413.71] loss=1.58 avg=1.61
[2180 | 2423.79] loss=0.57 avg=1.60
[2190 | 2433.88] loss=0.40 avg=1.59
[2200 | 2443.96] loss=1.07 avg=1.58
 but a diverse range of carbamide systems including protected aspartic acid -and aryl bromides, as well as several that did not provide a linear resolution. Importantly, the inclusion of aspartic acid was intended to bias the catalyst away from one tertiary amine, and while this 

[2410 | 2695.32] loss=2.88 avg=1.43
[2420 | 2705.38] loss=2.11 avg=1.44
[2430 | 2715.46] loss=0.73 avg=1.43
[2440 | 2725.54] loss=1.19 avg=1.43
[2450 | 2735.62] loss=0.68 avg=1.42
[2460 | 2745.70] loss=0.38 avg=1.41
[2470 | 2755.79] loss=0.89 avg=1.40
[2480 | 2765.87] loss=2.45 avg=1.42
[2490 | 2775.96] loss=1.46 avg=1.42
[2500 | 2786.05] loss=0.97 avg=1.41
[2510 | 2796.13] loss=0.66 avg=1.40
[2520 | 2806.22] loss=2.10 avg=1.41
[2530 | 2816.30] loss=0.38 avg=1.40
[2540 | 2826.38] loss=0.16 avg=1.39
[2550 | 2836.47] loss=2.21 avg=1.40
[2560 | 2846.54] loss=0.38 avg=1.38
[2570 | 2856.63] loss=0.48 avg=1.37
[2580 | 2866.71] loss=1.08 avg=1.37
[2590 | 2876.79] loss=2.26 avg=1.38
[2600 | 2886.87] loss=0.33 avg=1.37
 highly sought in the context of drug development. The development of novel chiral ligands for transition metal complexes with photonic excitation-induced oxal transport may provide an attractive alternative.  General Methods for the Transition Metal-Catalyzed Annulation of Amino

[2810 | 3138.27] loss=0.27 avg=1.24
[2820 | 3148.33] loss=1.76 avg=1.25
[2830 | 3158.41] loss=0.16 avg=1.23
[2840 | 3168.48] loss=0.33 avg=1.22
[2850 | 3178.57] loss=0.62 avg=1.22
[2860 | 3188.65] loss=0.17 avg=1.21
[2870 | 3198.74] loss=0.31 avg=1.20
[2880 | 3208.82] loss=1.06 avg=1.20
[2890 | 3218.91] loss=0.54 avg=1.19
[2900 | 3228.99] loss=0.24 avg=1.18
[2910 | 3239.07] loss=0.67 avg=1.17
[2920 | 3249.16] loss=0.54 avg=1.17
[2930 | 3259.25] loss=0.31 avg=1.16
[2940 | 3269.33] loss=1.30 avg=1.16
[2950 | 3279.42] loss=0.39 avg=1.15
[2960 | 3289.50] loss=0.25 avg=1.14
[2970 | 3299.58] loss=0.22 avg=1.13
[2980 | 3309.66] loss=1.14 avg=1.13
[2990 | 3319.75] loss=0.34 avg=1.12
[3000 | 3329.83] loss=0.22 avg=1.11
Saving checkpoint/run1/model-3000
 coupling or aldol conditions. We hypothesized that the aldehyde might serve as a surfactant or a reducing agent, and that the resulting products might be targeted as the bis-phosphorylated, phosphorylated alcohol 7.  Design Plan. In accord with 

[3210 | 3585.05] loss=0.36 avg=0.96
[3220 | 3595.12] loss=0.35 avg=0.95
[3230 | 3605.20] loss=0.39 avg=0.95
[3240 | 3615.28] loss=0.83 avg=0.94
[3250 | 3625.36] loss=0.26 avg=0.94
[3260 | 3635.44] loss=1.50 avg=0.94
[3270 | 3645.52] loss=0.64 avg=0.94
[3280 | 3655.61] loss=0.42 avg=0.93
[3290 | 3665.70] loss=0.17 avg=0.93
[3300 | 3675.78] loss=0.30 avg=0.92
[3310 | 3685.87] loss=0.47 avg=0.92
[3320 | 3695.95] loss=0.22 avg=0.91
[3330 | 3706.04] loss=0.12 avg=0.90
[3340 | 3716.12] loss=1.28 avg=0.90
[3350 | 3726.20] loss=0.28 avg=0.90
[3360 | 3736.28] loss=0.11 avg=0.89
[3370 | 3746.36] loss=0.26 avg=0.88
[3380 | 3756.45] loss=1.96 avg=0.89
[3390 | 3766.53] loss=0.31 avg=0.89
[3400 | 3776.61] loss=0.71 avg=0.89
 studies of 3-aminocyclohexanone, we found that both the enantiomeric diastereo-and enantiomers of 6. 38 could be isolated in the same vessel, and indeed, 6. 40 was the precursor of the present system. Upon treatment with aqueous NaOH, the enantiomers of 6. 37 were able to be iso

[3610 | 4028.00] loss=0.19 avg=0.82
[3620 | 4038.06] loss=0.40 avg=0.81
[3630 | 4048.13] loss=0.44 avg=0.81
[3640 | 4058.21] loss=0.25 avg=0.80
[3650 | 4068.29] loss=1.03 avg=0.81
[3660 | 4078.37] loss=0.55 avg=0.80
[3670 | 4088.47] loss=0.37 avg=0.80
[3680 | 4098.55] loss=0.58 avg=0.80
[3690 | 4108.64] loss=0.37 avg=0.79
[3700 | 4118.73] loss=0.59 avg=0.79
[3710 | 4128.82] loss=0.30 avg=0.78
[3720 | 4138.90] loss=0.17 avg=0.78
[3730 | 4148.99] loss=0.22 avg=0.77
[3740 | 4159.08] loss=0.49 avg=0.77
[3750 | 4169.16] loss=0.14 avg=0.76
[3760 | 4179.24] loss=0.24 avg=0.76
[3770 | 4189.32] loss=0.93 avg=0.76
[3780 | 4199.42] loss=0.14 avg=0.75
[3790 | 4209.51] loss=0.37 avg=0.75
[3800 | 4219.60] loss=0.52 avg=0.75
 of the peptide catalyst 6 with Ac 3 O and Dib 3 O led to the formation of 6a, which exhibited a k rel 12. However, after 24 h there was a modest decrease in k rel for all serine proteases with catalysis at high conversion. These results suggest that the post-synthetic catalysts 

[4010 | 4476.11] loss=1.61 avg=0.71
[4020 | 4486.20] loss=0.20 avg=0.71
[4030 | 4496.28] loss=1.24 avg=0.71
[4040 | 4506.36] loss=0.27 avg=0.71
[4050 | 4516.44] loss=0.29 avg=0.70
[4060 | 4526.52] loss=1.02 avg=0.71
[4070 | 4536.61] loss=0.08 avg=0.70
[4080 | 4546.70] loss=1.06 avg=0.71
[4090 | 4556.80] loss=0.19 avg=0.70
[4100 | 4566.89] loss=0.83 avg=0.70
[4110 | 4576.98] loss=0.46 avg=0.70
[4120 | 4587.08] loss=0.91 avg=0.70
[4130 | 4597.17] loss=0.44 avg=0.70
[4140 | 4607.26] loss=0.46 avg=0.70
[4150 | 4617.35] loss=0.19 avg=0.69
[4160 | 4627.45] loss=0.43 avg=0.69
[4170 | 4637.55] loss=0.22 avg=0.68
[4180 | 4647.65] loss=0.18 avg=0.68
[4190 | 4657.74] loss=0.31 avg=0.67
[4200 | 4667.84] loss=0.23 avg=0.67
 conjisomer, 6c, a key step in the progression toward synthetics. In terms of an enantiomeric manipulation, the absolute stereochemistry of 4a can be expressed in terms of a ratio between the diastereomers of
5a, with the primary alcohol of 1a and the secondary alcohol of 4a. For

[4410 | 4919.71] loss=0.83 avg=0.62
[4420 | 4929.77] loss=0.18 avg=0.62
[4430 | 4939.84] loss=3.45 avg=0.65
[4440 | 4949.92] loss=2.28 avg=0.66
[4450 | 4960.00] loss=0.59 avg=0.66
[4460 | 4970.09] loss=0.42 avg=0.66
[4470 | 4980.18] loss=0.67 avg=0.66
[4480 | 4990.27] loss=0.70 avg=0.66
[4490 | 5000.35] loss=0.19 avg=0.66
[4500 | 5010.44] loss=0.32 avg=0.65
[4510 | 5020.53] loss=0.22 avg=0.65
[4520 | 5030.61] loss=0.33 avg=0.64
[4530 | 5040.69] loss=0.14 avg=0.64
[4540 | 5050.77] loss=0.26 avg=0.64
[4550 | 5060.85] loss=0.22 avg=0.63
[4560 | 5070.93] loss=1.28 avg=0.64
[4570 | 5081.02] loss=1.26 avg=0.64
[4580 | 5091.10] loss=0.56 avg=0.64
[4590 | 5101.18] loss=0.39 avg=0.64
[4600 | 5111.26] loss=0.88 avg=0.64
 the The the 4 2 in an enant-, of catalytic enone the 1 3-enant acetylation were obtained without loss of site selectivity to a notable feature with respect to the potential applications of these enantioenriched thiostrines in medicinal chemistry.  The enantiohypophysiological pr

[4810 | 5362.59] loss=0.25 avg=0.58
[4820 | 5372.66] loss=0.15 avg=0.57
[4830 | 5382.73] loss=0.25 avg=0.57
[4840 | 5392.80] loss=0.16 avg=0.57
[4850 | 5402.88] loss=1.75 avg=0.58
[4860 | 5412.96] loss=0.07 avg=0.57
[4870 | 5423.05] loss=0.22 avg=0.57
[4880 | 5433.13] loss=0.45 avg=0.57
[4890 | 5443.21] loss=0.22 avg=0.57
[4900 | 5453.29] loss=0.29 avg=0.56
[4910 | 5463.38] loss=0.23 avg=0.56
[4920 | 5473.46] loss=0.45 avg=0.56
[4930 | 5483.54] loss=0.16 avg=0.55
[4940 | 5493.62] loss=0.18 avg=0.55
[4950 | 5503.70] loss=0.15 avg=0.55
[4960 | 5513.79] loss=0.20 avg=0.54
[4970 | 5523.87] loss=0.43 avg=0.54
[4980 | 5533.95] loss=0.19 avg=0.54
[4990 | 5544.05] loss=0.57 avg=0.54
[5000 | 5554.13] loss=0.25 avg=0.54
Saving checkpoint/run1/model-5000
 the--pyrrolidine-2-one gave the aldol product as a liquid aldol product no more than 2. 1 vol of solvent with 90 condensation temperature, 20 and 40 ee. The secret to this success is that the aldehyde has a large intrinsic preference for S-confi

[5210 | 5809.27] loss=0.22 avg=0.49
[5220 | 5819.34] loss=0.10 avg=0.48
[5230 | 5829.41] loss=0.57 avg=0.48
[5240 | 5839.49] loss=1.13 avg=0.49
[5250 | 5849.56] loss=0.19 avg=0.49
[5260 | 5859.64] loss=0.17 avg=0.48
[5270 | 5869.73] loss=0.27 avg=0.48
[5280 | 5879.82] loss=0.22 avg=0.48
[5290 | 5889.90] loss=0.13 avg=0.47
[5300 | 5899.98] loss=0.19 avg=0.47
[5310 | 5910.06] loss=0.22 avg=0.47
[5320 | 5920.15] loss=0.35 avg=0.47
[5330 | 5930.23] loss=0.22 avg=0.47
[5340 | 5940.31] loss=0.23 avg=0.46
[5350 | 5950.40] loss=0.15 avg=0.46
[5360 | 5960.48] loss=0.33 avg=0.46
[5370 | 5970.56] loss=0.16 avg=0.46
[5380 | 5980.64] loss=0.13 avg=0.45
[5390 | 5990.72] loss=0.17 avg=0.45
[5400 | 6000.82] loss=0.08 avg=0.45
 of in to provide a broad array of boronic acids. Additionally, the use of enamines and their counterparts significantly expands the scope of these enantioenriched boronic acids. Perhaps most important, this methodology is tolerant of quite controversial aryl substitution. Previo

[5610 | 6252.10] loss=0.20 avg=0.39
[5620 | 6262.17] loss=0.28 avg=0.39
[5630 | 6272.24] loss=0.20 avg=0.39
[5640 | 6282.32] loss=0.15 avg=0.39
[5650 | 6292.40] loss=0.18 avg=0.38
[5660 | 6302.48] loss=0.26 avg=0.38
[5670 | 6312.57] loss=0.13 avg=0.38
[5680 | 6322.66] loss=0.10 avg=0.38
[5690 | 6332.74] loss=0.20 avg=0.38
[5700 | 6342.82] loss=0.39 avg=0.38
[5710 | 6352.90] loss=0.20 avg=0.37
[5720 | 6362.98] loss=0.24 avg=0.37
[5730 | 6373.07] loss=0.21 avg=0.37
[5740 | 6383.17] loss=0.22 avg=0.37
[5750 | 6393.25] loss=0.14 avg=0.37
[5760 | 6403.33] loss=0.29 avg=0.37
[5770 | 6413.40] loss=0.23 avg=0.37
[5780 | 6423.48] loss=0.22 avg=0.36
[5790 | 6433.56] loss=0.24 avg=0.36
[5800 | 6443.65] loss=0.29 avg=0.36
 importantly and. The result also directly influenced our decision to pursue catalytic quantities of 3.  On the basis of our initial results, we obtained in excess of the type II -hydroxyester lipids resulted in a 75:25 ratio of 1-3-diacylated 3-di-phenylpropionaldehyde 3-di-myo-

[6010 | 6698.88] loss=0.21 avg=0.33
[6020 | 6708.95] loss=0.13 avg=0.33
[6030 | 6719.02] loss=0.17 avg=0.33
[6040 | 6729.09] loss=0.16 avg=0.33
[6050 | 6739.17] loss=0.06 avg=0.32
[6060 | 6749.25] loss=0.22 avg=0.32
[6070 | 6759.32] loss=0.13 avg=0.32
[6080 | 6769.41] loss=0.19 avg=0.32
[6090 | 6779.49] loss=0.14 avg=0.32
[6100 | 6789.58] loss=0.14 avg=0.32
[6110 | 6799.66] loss=0.11 avg=0.31
[6120 | 6809.74] loss=0.16 avg=0.31
[6130 | 6819.82] loss=0.24 avg=0.31
[6140 | 6829.91] loss=0.20 avg=0.31
[6150 | 6839.99] loss=0.17 avg=0.31
[6160 | 6850.08] loss=0.17 avg=0.31
[6170 | 6860.16] loss=0.12 avg=0.31
[6180 | 6870.25] loss=0.13 avg=0.30
[6190 | 6880.33] loss=0.10 avg=0.30
[6200 | 6890.42] loss=0.13 avg=0.30
 comparable it. While the authors have revised their initial report, namely the crystal structure of the catalyst used for the atroposelective bromination of 4. 43, it remains primarily their synthesis of this catalyst that is described in the main text.  A particularly minimalis

[6410 | 7141.80] loss=0.14 avg=0.28
[6420 | 7151.87] loss=0.14 avg=0.28
[6430 | 7161.94] loss=0.12 avg=0.27
[6440 | 7172.01] loss=0.18 avg=0.27
[6450 | 7182.09] loss=0.14 avg=0.27
[6460 | 7192.17] loss=0.26 avg=0.27
[6470 | 7202.26] loss=0.13 avg=0.27
[6480 | 7212.34] loss=0.22 avg=0.27
[6490 | 7222.42] loss=0.12 avg=0.27
[6500 | 7232.51] loss=0.09 avg=0.27
[6510 | 7242.60] loss=0.30 avg=0.27
[6520 | 7252.68] loss=0.12 avg=0.27
[6530 | 7262.77] loss=0.19 avg=0.26
[6540 | 7272.84] loss=0.15 avg=0.26
[6550 | 7282.92] loss=0.11 avg=0.26
[6560 | 7293.00] loss=0.34 avg=0.26
[6570 | 7303.09] loss=0.13 avg=0.26
[6580 | 7313.17] loss=0.17 avg=0.26
[6590 | 7323.25] loss=0.18 avg=0.26
[6600 | 7333.33] loss=0.14 avg=0.26
 It, although it can be subjected to reductive elimination to afford isomer 4. 41d. The reductive elimination step is a distinctively self-terminal mechanism, as it strikes a balance between conversion to the anti-product, while at the same time promoting itself via reductive gen

[6810 | 7584.62] loss=0.15 avg=0.24
[6820 | 7594.69] loss=0.13 avg=0.24
[6830 | 7604.76] loss=0.12 avg=0.24
[6840 | 7614.84] loss=0.24 avg=0.24
[6850 | 7624.92] loss=0.09 avg=0.24
[6860 | 7635.01] loss=0.10 avg=0.24
[6870 | 7645.10] loss=0.11 avg=0.24
[6880 | 7655.19] loss=0.25 avg=0.24
[6890 | 7665.27] loss=0.11 avg=0.24
[6900 | 7675.35] loss=0.17 avg=0.24
[6910 | 7685.43] loss=0.08 avg=0.23
[6920 | 7695.52] loss=0.18 avg=0.23
[6930 | 7705.61] loss=0.19 avg=0.23
[6940 | 7715.70] loss=0.10 avg=0.23
[6950 | 7725.78] loss=0.06 avg=0.23
[6960 | 7735.86] loss=0.12 avg=0.23
[6970 | 7745.94] loss=0.30 avg=0.23
[6980 | 7756.01] loss=0.10 avg=0.23
[6990 | 7766.09] loss=0.06 avg=0.23
[7000 | 7776.18] loss=0.21 avg=0.23
Saving checkpoint/run1/model-7000
e 2. 1 and 2. 2. However, our initial reaction with 1 only yielded good selectivity. Enantiodivergent outcomes in asymmetric catalysis are uncommon for nonenantiomeric catalysts. This is often attributed to a lack of stereochemical control in the

[7210 | 8031.46] loss=0.28 avg=0.21
[7220 | 8041.53] loss=0.14 avg=0.21
[7230 | 8051.60] loss=0.14 avg=0.21
[7240 | 8061.67] loss=0.25 avg=0.21
[7250 | 8071.75] loss=0.13 avg=0.21
[7260 | 8081.82] loss=0.15 avg=0.21
[7270 | 8091.90] loss=0.10 avg=0.20
[7280 | 8101.98] loss=0.07 avg=0.20
[7290 | 8112.07] loss=0.10 avg=0.20
[7300 | 8122.15] loss=0.11 avg=0.20
[7310 | 8132.24] loss=0.13 avg=0.20
[7320 | 8142.33] loss=0.05 avg=0.20
[7330 | 8152.41] loss=0.06 avg=0.20
[7340 | 8162.50] loss=0.12 avg=0.20
[7350 | 8172.58] loss=0.32 avg=0.20
[7360 | 8182.66] loss=0.12 avg=0.20
[7370 | 8192.74] loss=0.22 avg=0.20
[7380 | 8202.81] loss=0.16 avg=0.20
[7390 | 8212.90] loss=0.12 avg=0.20
[7400 | 8222.98] loss=0.10 avg=0.20
 the this to, we reasoned that a second intramolecular radicalradical coupling might provide a less selective system, one that would provide broadly usable enantioselectivity but with modest dipole moment. The synthesis of 3a proceeds unobtriguously in this fashion through the fi

[7610 | 8474.29] loss=0.20 avg=0.19
[7620 | 8484.37] loss=0.17 avg=0.19
[7630 | 8494.44] loss=0.14 avg=0.19
[7640 | 8504.51] loss=0.10 avg=0.19
[7650 | 8514.59] loss=0.11 avg=0.19
[7660 | 8524.66] loss=0.17 avg=0.19
[7670 | 8534.74] loss=0.10 avg=0.19
[7680 | 8544.83] loss=0.10 avg=0.18
[7690 | 8554.91] loss=0.31 avg=0.19
[7700 | 8564.99] loss=0.15 avg=0.19
[7710 | 8575.07] loss=0.10 avg=0.18
[7720 | 8585.16] loss=0.11 avg=0.18
[7730 | 8595.24] loss=0.16 avg=0.18
[7740 | 8605.33] loss=0.15 avg=0.18
[7750 | 8615.42] loss=0.25 avg=0.18
[7760 | 8625.51] loss=0.18 avg=0.18
[7770 | 8635.59] loss=0.15 avg=0.18
[7780 | 8645.67] loss=0.34 avg=0.19
[7790 | 8655.76] loss=0.14 avg=0.19
[7800 | 8665.84] loss=0.11 avg=0.18
ation and the chiral secondary alcohol, which would ultimately serve as the biological carrier for the molecule.  With a methionine-selective bioconjugation protocol in hand, we turned our attention to exploring the scope of the Michael addition. A range of differentially substit

[8010 | 8921.00] loss=0.14 avg=0.18
[8020 | 8931.07] loss=0.18 avg=0.18
[8030 | 8941.13] loss=0.11 avg=0.18
[8040 | 8951.20] loss=0.11 avg=0.18
[8050 | 8961.27] loss=0.13 avg=0.18
[8060 | 8971.35] loss=0.29 avg=0.18
[8070 | 8981.43] loss=0.09 avg=0.18
[8080 | 8991.53] loss=0.13 avg=0.18
[8090 | 9001.61] loss=0.11 avg=0.18
[8100 | 9011.69] loss=0.23 avg=0.18
[8110 | 9021.77] loss=0.14 avg=0.18
[8120 | 9031.86] loss=0.08 avg=0.18
[8130 | 9041.94] loss=0.16 avg=0.18
[8140 | 9052.01] loss=0.31 avg=0.18
[8150 | 9062.10] loss=0.18 avg=0.18
[8160 | 9072.18] loss=0.18 avg=0.18
[8170 | 9082.27] loss=0.16 avg=0.18
[8180 | 9092.36] loss=0.11 avg=0.18
[8190 | 9102.44] loss=0.10 avg=0.18
[8200 | 9112.52] loss=0.18 avg=0.18
 s-ated with a catalyst loading of 10, an isolated yield of 35 is obtained. This level of selectivity, when achieved through careful optimization, enables the observation of three peptides in this reaction in only 90 min. Moreover, if the catalyst loading is decreased to 5 mol, t

[8410 | 9363.77] loss=0.40 avg=0.17
[8420 | 9373.84] loss=0.23 avg=0.17
[8430 | 9383.92] loss=0.20 avg=0.17
[8440 | 9394.00] loss=0.26 avg=0.18
[8450 | 9404.07] loss=0.19 avg=0.18
[8460 | 9414.15] loss=0.13 avg=0.18
[8470 | 9424.24] loss=0.31 avg=0.18
[8480 | 9434.32] loss=0.18 avg=0.18
[8490 | 9444.41] loss=0.18 avg=0.18
[8500 | 9454.50] loss=0.26 avg=0.18
[8510 | 9464.58] loss=0.24 avg=0.18
[8520 | 9474.66] loss=0.18 avg=0.18
[8530 | 9484.75] loss=0.14 avg=0.18
[8540 | 9494.83] loss=0.11 avg=0.18
[8550 | 9504.91] loss=0.13 avg=0.18
[8560 | 9514.99] loss=0.26 avg=0.18
[8570 | 9525.08] loss=0.29 avg=0.18
[8580 | 9535.16] loss=0.22 avg=0.18
[8590 | 9545.24] loss=0.14 avg=0.18
[8600 | 9555.32] loss=0.08 avg=0.18
 to is a common denominator of the chemical shifts obtained during the chemical shifts from 1a and 1b. It is also relevant to these studies that the most intense feature in the naphthalene system is coupled to the halogen atom, whereas the lower intensity is coupled to the nitroo

[8810 | 9806.70] loss=0.12 avg=0.17
[8820 | 9816.76] loss=0.16 avg=0.17
[8830 | 9826.83] loss=0.16 avg=0.17
[8840 | 9836.91] loss=0.04 avg=0.17
[8850 | 9846.99] loss=0.14 avg=0.17
[8860 | 9857.07] loss=0.16 avg=0.17
[8870 | 9867.15] loss=0.14 avg=0.17
[8880 | 9877.24] loss=0.21 avg=0.17
[8890 | 9887.32] loss=0.20 avg=0.17
[8900 | 9897.41] loss=0.05 avg=0.17
[8910 | 9907.49] loss=0.17 avg=0.17
[8920 | 9917.58] loss=0.08 avg=0.17
[8930 | 9927.66] loss=0.14 avg=0.17
[8940 | 9937.74] loss=0.19 avg=0.17
[8950 | 9947.82] loss=0.11 avg=0.17
[8960 | 9957.90] loss=0.15 avg=0.17
[8970 | 9967.99] loss=0.11 avg=0.16
[8980 | 9978.07] loss=0.06 avg=0.16
[8990 | 9988.15] loss=0.19 avg=0.16
[9000 | 9998.23] loss=0.11 avg=0.16
Saving checkpoint/run1/model-9000
ME, providing the corresponding cycloadduct in 92 ee. Moreover, we found that cycloadditions involving cyclopentanone provided the corresponding adduct 10 in quantitative conversion without loss of enantiomeric excess.  During our mechanistic stu

[9210 | 10253.18] loss=0.11 avg=0.15
[9220 | 10263.24] loss=0.18 avg=0.15
[9230 | 10273.31] loss=0.08 avg=0.15
[9240 | 10283.39] loss=0.17 avg=0.15
[9250 | 10293.47] loss=0.08 avg=0.15
[9260 | 10303.54] loss=0.16 avg=0.15
[9270 | 10313.62] loss=0.18 avg=0.15
[9280 | 10323.70] loss=0.08 avg=0.15
[9290 | 10333.79] loss=0.15 avg=0.15
[9300 | 10343.88] loss=0.07 avg=0.15
[9310 | 10353.95] loss=0.15 avg=0.15
[9320 | 10364.02] loss=0.17 avg=0.15
[9330 | 10374.10] loss=0.09 avg=0.15
[9340 | 10384.18] loss=0.13 avg=0.15
[9350 | 10394.25] loss=0.13 avg=0.15
[9360 | 10404.33] loss=0.09 avg=0.15
[9370 | 10414.42] loss=0.08 avg=0.15
[9380 | 10424.49] loss=0.08 avg=0.15
[9390 | 10434.58] loss=0.19 avg=0.15
[9400 | 10444.65] loss=0.10 avg=0.15
 studies in high yields.  After surveying the peptide library, we identified several hits that had accumulated into a library that was highly diverse, containing every permutation of the i1 Pro residue and various minor substitutions. A new library was then sy

[9610 | 10695.65] loss=0.05 avg=0.14
[9620 | 10705.72] loss=0.07 avg=0.14
[9630 | 10715.79] loss=0.13 avg=0.14
[9640 | 10725.86] loss=0.10 avg=0.14
[9650 | 10735.94] loss=0.12 avg=0.14
[9660 | 10746.02] loss=0.11 avg=0.14
[9670 | 10756.11] loss=0.16 avg=0.14
[9680 | 10766.19] loss=0.14 avg=0.14
[9690 | 10776.27] loss=0.11 avg=0.14
[9700 | 10786.35] loss=0.16 avg=0.14
[9710 | 10796.43] loss=0.08 avg=0.14
[9720 | 10806.51] loss=0.09 avg=0.14
[9730 | 10816.60] loss=0.10 avg=0.14
[9740 | 10826.68] loss=0.17 avg=0.14
[9750 | 10836.76] loss=0.08 avg=0.14
[9760 | 10846.84] loss=0.08 avg=0.14
[9770 | 10856.91] loss=0.16 avg=0.14
[9780 | 10867.00] loss=0.06 avg=0.14
[9790 | 10877.08] loss=0.09 avg=0.14
[9800 | 10887.16] loss=0.05 avg=0.14
 extent (, which allows the formation of C-center radical cationic intermediates such as 4. By contrast, the previously described dual catalytic copper co-catalytic protocol involves aryl halides, which result in a modest oxidation at copper, followed by depro

[10010 | 11141.56] loss=0.13 avg=0.13
[10020 | 11151.61] loss=0.07 avg=0.13
[10030 | 11161.68] loss=0.20 avg=0.13
[10040 | 11171.75] loss=0.12 avg=0.13
[10050 | 11181.83] loss=0.07 avg=0.13
[10060 | 11191.91] loss=0.13 avg=0.13
[10070 | 11201.99] loss=0.07 avg=0.13
[10080 | 11212.07] loss=0.14 avg=0.13
[10090 | 11222.14] loss=0.06 avg=0.13
[10100 | 11232.22] loss=0.09 avg=0.13
[10110 | 11242.30] loss=0.21 avg=0.13
[10120 | 11252.38] loss=0.08 avg=0.13
[10130 | 11262.47] loss=0.14 avg=0.13
[10140 | 11272.55] loss=0.08 avg=0.13
[10150 | 11282.62] loss=0.09 avg=0.13
[10160 | 11292.70] loss=0.15 avg=0.13
[10170 | 11302.77] loss=0.12 avg=0.13
[10180 | 11312.85] loss=0.07 avg=0.13
[10190 | 11322.93] loss=0.07 avg=0.13
[10200 | 11333.01] loss=0.07 avg=0.13
 these an evaluation of an experimental approach to the synthesis of oligopeptides. We reported that compound 1, a commercially available heptapeptide, functioned as a highly effective catalyst for the kinetic resolution of a series of co-d

[10410 | 11583.60] loss=0.07 avg=0.13
[10420 | 11593.65] loss=0.05 avg=0.13
[10430 | 11603.72] loss=0.11 avg=0.13
[10440 | 11613.78] loss=0.13 avg=0.13
[10450 | 11623.84] loss=0.13 avg=0.13
[10460 | 11633.91] loss=0.11 avg=0.13
[10470 | 11643.98] loss=0.05 avg=0.12
[10480 | 11654.04] loss=0.07 avg=0.12
[10490 | 11664.11] loss=0.04 avg=0.12
[10500 | 11674.18] loss=0.09 avg=0.12
[10510 | 11684.25] loss=0.09 avg=0.12
[10520 | 11694.32] loss=0.14 avg=0.12
[10530 | 11704.39] loss=0.06 avg=0.12
[10540 | 11714.46] loss=0.07 avg=0.12
[10550 | 11724.53] loss=0.14 avg=0.12
[10560 | 11734.61] loss=0.07 avg=0.12
[10570 | 11744.69] loss=0.12 avg=0.12
[10580 | 11754.76] loss=0.29 avg=0.12
[10590 | 11764.84] loss=0.14 avg=0.12
[10600 | 11774.92] loss=0.18 avg=0.12
 frame-mediated reaction of aldehydes at the 2-position of acylimines is a much more challenging objective, and also requires a much longer reaction time. In fact, examinations of the peptide-based catalysts for the addition of dimethylzinc

[10810 | 12025.42] loss=0.09 avg=0.12
[10820 | 12035.47] loss=0.10 avg=0.12
[10830 | 12045.54] loss=0.17 avg=0.12
[10840 | 12055.61] loss=0.26 avg=0.12
[10850 | 12065.68] loss=0.07 avg=0.12
[10860 | 12075.75] loss=0.06 avg=0.12
[10870 | 12085.83] loss=0.16 avg=0.12
[10880 | 12095.91] loss=0.16 avg=0.12
[10890 | 12105.98] loss=0.15 avg=0.12
[10900 | 12116.07] loss=0.10 avg=0.12
[10910 | 12126.15] loss=0.09 avg=0.12
[10920 | 12136.22] loss=0.12 avg=0.12
[10930 | 12146.29] loss=0.13 avg=0.12
[10940 | 12156.37] loss=0.09 avg=0.12
[10950 | 12166.45] loss=0.17 avg=0.12
[10960 | 12176.53] loss=0.10 avg=0.12
[10970 | 12186.60] loss=0.15 avg=0.12
[10980 | 12196.68] loss=0.09 avg=0.12
[10990 | 12206.76] loss=0.08 avg=0.12
[11000 | 12216.84] loss=0.06 avg=0.12
Saving checkpoint/run1/model-11000
 blocks, both of which contain a dialkylamide motif, and a 2, 3-dialkylamide scaffold. We then designed structural panels around these structural motifs using DFT calculations. We initially considered the 

[11210 | 12471.24] loss=0.10 avg=0.12
[11220 | 12481.31] loss=0.15 avg=0.12
[11230 | 12491.38] loss=0.08 avg=0.12
[11240 | 12501.44] loss=0.04 avg=0.12
[11250 | 12511.52] loss=0.13 avg=0.12
[11260 | 12521.60] loss=0.09 avg=0.12
[11270 | 12531.67] loss=0.08 avg=0.12
[11280 | 12541.75] loss=0.07 avg=0.11
[11290 | 12551.83] loss=0.12 avg=0.11
[11300 | 12561.91] loss=0.24 avg=0.12
[11310 | 12571.99] loss=0.19 avg=0.12
[11320 | 12582.07] loss=0.09 avg=0.12
[11330 | 12592.14] loss=0.15 avg=0.12
[11340 | 12602.22] loss=0.27 avg=0.12
[11350 | 12612.29] loss=0.10 avg=0.12
[11360 | 12622.36] loss=0.09 avg=0.12
[11370 | 12632.44] loss=0.11 avg=0.12
[11380 | 12642.52] loss=0.10 avg=0.12
[11390 | 12652.59] loss=0.14 avg=0.12
[11400 | 12662.67] loss=0.10 avg=0.12
ines
Dalcations. Following reaction of 25 for 10 min, HPLC and LCMS analysis revealed 86 remaining starting material along with minor mono-glucosyl product peaks, further supporting Tyr as the site of glycosylation.  Next, we assessed the g

[11610 | 12913.32] loss=0.09 avg=0.12
[11620 | 12923.38] loss=0.14 avg=0.12
[11630 | 12933.44] loss=0.10 avg=0.12
[11640 | 12943.50] loss=0.06 avg=0.12
[11650 | 12953.58] loss=0.10 avg=0.11
[11660 | 12963.67] loss=0.09 avg=0.11
[11670 | 12973.74] loss=0.09 avg=0.11
[11680 | 12983.82] loss=0.12 avg=0.11
[11690 | 12993.90] loss=0.10 avg=0.11
[11700 | 13003.98] loss=0.10 avg=0.11
[11710 | 13014.06] loss=0.08 avg=0.11
[11720 | 13024.14] loss=0.09 avg=0.11
[11730 | 13034.21] loss=0.07 avg=0.11
[11740 | 13044.29] loss=0.12 avg=0.11
[11750 | 13054.37] loss=0.17 avg=0.11
[11760 | 13064.45] loss=0.08 avg=0.11
[11770 | 13074.52] loss=0.13 avg=0.11
[11780 | 13084.60] loss=0.11 avg=0.11
[11790 | 13094.67] loss=0.07 avg=0.11
[11800 | 13104.75] loss=0.15 avg=0.11
 technological 1g and P14. Given the magnitude of the effect, P19 was found to be the more selective catalyst, providing 5:1d in 48 h at 25C. Higher rates of racemization under more mild conditions were obtained under the initial reaction c

[12010 | 13359.07] loss=0.11 avg=0.11
[12020 | 13369.12] loss=0.16 avg=0.11
[12030 | 13379.18] loss=0.11 avg=0.11
[12040 | 13389.25] loss=0.08 avg=0.11
interrupted
Saving checkpoint/run1/model-12046
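
To judge the plateau more systematically than by reading the output, the progress lines can be parsed into numbers. The sketch below assumes the `[step | seconds] loss=... avg=...` format shown above; `parse_log` and `LOG_LINE` are hypothetical names, and the three excerpt lines are copied from the output of this run.

```python
import re

# Matches lines such as "[10 | 24.11] loss=3.58 avg=3.58".
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_log(text: str):
    """Return (step, seconds, loss, avg) tuples for each progress line."""
    rows = []
    for match in LOG_LINE.finditer(text):
        step, secs, loss, avg = match.groups()
        rows.append((int(step), float(secs), float(loss), float(avg)))
    return rows


excerpt = """[10 | 24.11] loss=3.58 avg=3.58
[6010 | 6698.88] loss=0.21 avg=0.33
[12040 | 13389.25] loss=0.08 avg=0.11"""

rows = parse_log(excerpt)
for step, secs, loss, avg in rows:
    print(f"step {step:>5}: avg loss {avg:.2f} after {secs / 3600:.2f} h")
```

Sample text and checkpoint messages are skipped because they do not match the pattern. The timestamps also show that each step took about one second, so the 12046 steps correspond to roughly 3.7 hours of training.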


### Text generation

We pass a `prefix` to the generate function to force each sample to start with a given character sequence. We chose prefixes that might prompt the model to generate new ideas, such as "Photoredox catalysis can be applied to..." or, in the cell below, "Peptide catalysis can be applied to".

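The `batch_size` option generates multiple samples in parallel, which speeds up generation. The `top_p` option (nucleus sampling) restricts sampling at each step to the smallest set of most-probable tokens whose cumulative probability reaches `top_p`. The cell below does not set `top_p`; it controls randomness through `temperature=0.85` instead.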
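
To make the two sampling controls concrete, the following pure-Python sketch applies temperature scaling and a top-p cutoff to a toy four-token distribution. It is a conceptual illustration only, not gpt-2-simple's internal implementation; `softmax`, `top_p_filter`, and the logit values are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
probs = softmax(logits, temperature=0.85)
filtered = top_p_filter(probs, top_p=0.9)
print(filtered)
```

With these toy values the least likely token is dropped and the remaining three are renormalized, so the sampler can never pick it.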

In [20]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}_0.85.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=250,
                      prefix="Peptide catalysis can be applied to",
                      temperature=0.85,
                      nsamples=200,
                      batch_size=10
                      )
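
Because the training samples had started to reproduce phrases from the source text, it can be useful to measure how much of a generated sample is copied verbatim. The sketch below is one simple way to do that with word n-grams. The helper names, the choice of `n`, and the two example sentences are all illustrative assumptions, not part of the original analysis.

```python
def ngrams(text: str, n: int):
    """Return the set of word n-grams in `text`."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(sample: str, source: str, n: int = 8) -> float:
    """Fraction of the sample's n-grams that also occur verbatim in the source."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    return len(sample_grams & ngrams(source, n)) / len(sample_grams)


# Made-up sentences for illustration; in practice `source` would be the training text.
source = "the catalyst was found to be highly selective for the aldol reaction"
sample = "the catalyst was found to be highly selective for the Michael addition"
print(f"{copied_fraction(sample, source, n=4):.2f}")
```

A value near 1 suggests the sample is largely memorized, while a value near 0 suggests mostly novel word sequences. A larger `n` makes the check stricter.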