# Making a transformer write MtG card text

Based on this tutorial: https://medium.com/swlh/learning-to-write-language-generation-with-gpt-2-2a13fa249024

In [1]:
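import sys
print(sys.version)

import torch
print(torch.__version__)
# Index of the active CUDA device; this call raises if no GPU is visible.
print(torch.cuda.current_device())

# NVIDIA Apex supplies the mixed-precision (amp) support used during training.
import apex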

3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0]
1.5.1
0


## Preparing the dataset

In [2]:
import pandas as pd

df = pd.read_csv("sample_text_data/cs.AI.tsv", sep="\t")
abstracts = df["abstract"].tolist()

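# Hold out the last 10 abstracts for evaluation; write one abstract per line.
# f.write is used because each abstract is a single string, not a list of lines.
with open("sample_text_data/train.txt", "w") as f:
    for abstract in abstracts[:-10]:
        f.write(abstract + "\n")

with open("sample_text_data/test.txt", "w") as f:
    for abstract in abstracts[-10:]:
        f.write(abstract + "\n")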
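
The split above assumes every abstract is a non-empty string with no embedded newlines. If the TSV contains empty cells or multi-line abstracts, the one-abstract-per-line format breaks. Below is a minimal sketch of a more defensive writer; `write_split` and the demo strings are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def write_split(texts, train_path, test_path, n_test=10):
    """Write one cleaned text per line; the last n_test texts go to test_path."""
    cleaned = []
    for text in texts:
        # Skip missing values (pandas yields float NaN for empty cells).
        if not isinstance(text, str):
            continue
        # Collapse internal newlines and repeated whitespace into single spaces.
        line = " ".join(text.split())
        if line:
            cleaned.append(line)

    # Index of the first held-out text; never negative.
    split = max(len(cleaned) - n_test, 0)
    train, test = cleaned[:split], cleaned[split:]
    for path, rows in ((train_path, train), (test_path, test)):
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(row + "\n")
    return len(train), len(test)


# Small self-contained demonstration with made-up strings.
demo = ["first\nabstract", float("nan"), "second abstract", "  third  "]
with tempfile.TemporaryDirectory() as tmp:
    counts = write_split(
        demo,
        os.path.join(tmp, "train.txt"),
        os.path.join(tmp, "test.txt"),
        n_test=1,
    )
print(counts)
```

In the notebook this could be called as `write_split(abstracts, "sample_text_data/train.txt", "sample_text_data/test.txt")`, which keeps the same 10-abstract hold-out.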

In [3]:
print(abstracts[50])

A reported weakness of C4.5 in domains with continuous attributes is addressed by modifying the formation and evaluation of tests on continuous attributes. An MDL-inspired penalty is applied to such tests, eliminating some of them from consideration and altering the relative desirability of all tests. Empirical trials show that the modifications lead to smaller decision trees with higher predictive accuracies. Results also confirm that a new version of C4.5 incorporating these changes is superior to recent approaches that use global discretization and that construct small trees with multi-interval splits. 


In [2]:
import logging
from simpletransformers.language_generation import LanguageGenerationModel

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

I0716 22:50:13.911782 139824032970560 file_utils.py:39] PyTorch version 1.5.1 available.


## Language generation using a pre-trained GPT-2 model

In [3]:
model = LanguageGenerationModel("gpt2", "gpt2", args={"length": 256})

prompts = ["Because of their occasional need to return to shallow points in a search tree, existing backtracking methods can sometimes erase meaningful progress toward solving a search problem. In this paper, we present a method by which backtrack points can be moved deeper in the search space, thereby avoiding this difficulty.",
          "Terminological knowledge representation systems (TKRSs) are tools for designing and using knowledge bases that make use of terminological languages (or concept languages). We analyze from a theoretical point of view a TKRS whose capabilities go beyond the ones of presently available TKRSs.",
           "The theory revision problem is the problem of how best to go about revising a deficient domain theory using information contained in examples that expose inaccuracies. In this paper we present our approach to the theory revision problem for propositional domain theories.",
          "A reported weakness of C4.5 in domains with continuous attributes is addressed by modifying the formation and evaluation of tests on continuous attributes. An MDL-inspired penalty is applied to such tests, eliminating some of them from consideration and altering the relative desirability of all tests."
          ]

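for prompt in prompts:
    # verbose=False keeps generate() from logging each generated sequence.
    generated = model.generate(prompt, verbose=False)

    # generate() returns a list of strings; print the first with a closing period.
    # Splitting on '.' and re-joining is a no-op, so the text is used as-is.
    print(generated[0] + '.')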

W0716 22:50:23.036560 139824032970560 modeling_utils.py:768] Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
W0716 22:50:25.053113 139824032970560 generation_utils.py:350] Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
W0716 22:50:25.406013 139824032970560 generation_utils.py:350] Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Because of their occasional need to return to shallow points in a search tree, existing backtracking methods can sometimes erase meaningful progress toward solving a search problem. In this paper, we present a method by which backtrack points can be moved deeper in the search space, thereby avoiding this difficulty. We then show a method for reinterpreting backtracking into a specific target search tree that removes.


W0716 22:50:25.748001 139824032970560 generation_utils.py:350] Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Terminological knowledge representation systems (TKRSs) are tools for designing and using knowledge bases that make use of terminological languages (or concept languages). We analyze from a theoretical point of view a TKRS whose capabilities go beyond the ones of presently available TKRSs. It has been shown that the conceptual knowledge structures that are used to organize the knowledge base (tensor.


W0716 22:50:26.077429 139824032970560 generation_utils.py:350] Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


The theory revision problem is the problem of how best to go about revising a deficient domain theory using information contained in examples that expose inaccuracies. In this paper we present our approach to the theory revision problem for propositional domain theories. We assume that the theory revision problem will prove to be extremely difficult due to the many problems faced by.
A reported weakness of C4.5 in domains with continuous attributes is addressed by modifying the formation and evaluation of tests on continuous attributes. An MDL-inspired penalty is applied to such tests, eliminating some of them from consideration and altering the relative desirability of all tests.

Tests are applied to domains where the attribute is a dynamic value, such as if a.
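
Each continuation printed above breaks off mid-sentence and then gets a period appended (for example, "that removes."). One way to tidy this is to cut the text after its last sentence-ending character. The helper below is a hypothetical sketch, not part of the original notebook or of simpletransformers. It is deliberately naive: a decimal such as "C4.5" or an abbreviation in the unfinished tail would be treated as a sentence end.

```python
def trim_to_last_sentence(text, enders=".!?"):
    """Cut text after its last sentence-ending character, if there is one."""
    last = max(text.rfind(ch) for ch in enders)
    # rfind returns -1 when a character is absent; leave the text unchanged then.
    return text if last == -1 else text[: last + 1]


sample = "We present a method. We then show a method that removes"
print(trim_to_last_sentence(sample))
```

In the generation loop this would replace the period-appending step, for example `print(trim_to_last_sentence(generated[0]))`.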


## Fine-tuning a pre-trained GPT-2 model for domain-specific language generation

In [4]:
from simpletransformers.language_modeling import LanguageModelingModel
import logging


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


In [5]:
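train_args = {
    # Rebuild features from the text files instead of reusing cached ones.
    "reprocess_input_data": True,
    # Allow writing into an output directory that already exists.
    "overwrite_output_dir": True,
    "train_batch_size": 64,
    "num_train_epochs": 3,
    # GPT-2 is a causal language model, so masked-LM training is turned off.
    "mlm": False,
}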

model = LanguageModelingModel('gpt2', 'gpt2', args=train_args)

model.train_model("sample_text_data/train.txt", eval_file="sample_text_data/test.txt")

model.eval_model("sample_text_data/test.txt")

W0716 22:50:50.728310 139824032970560 modeling_utils.py:768] Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
I0716 22:50:50.853807 139824032970560 language_modeling_utils.py:75]  Creating features from dataset file at cache_dir/



I0716 22:50:57.132308 139824032970560 language_modeling_utils.py:126]  Saving features into cached file cache_dir/gpt2_cached_lm_126_train.txt
I0716 22:50:57.285032 139824032970560 language_modeling_model.py:508]  Training started



Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic







RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 7.92 GiB total capacity; 7.15 GiB already allocated; 42.06 MiB free; 7.32 GiB reserved in total by PyTorch)
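
The run stops during the first epoch: with `train_batch_size` set to 64, training does not fit in the 7.92 GiB reported for GPU 0. A common workaround is to process fewer sequences per step and accumulate gradients over several steps, so the effective batch size stays the same. The sketch below only builds the argument dictionary. The helper `accumulation_args` is hypothetical, the per-step size of 8 is a guess rather than a measured limit, and it assumes the installed simpletransformers version honours the `gradient_accumulation_steps` key.

```python
def accumulation_args(effective_batch_size, per_step_batch_size):
    """Return args that keep the effective batch size with smaller GPU batches."""
    if effective_batch_size % per_step_batch_size != 0:
        raise ValueError("per_step_batch_size must divide effective_batch_size")
    return {
        "train_batch_size": per_step_batch_size,
        "gradient_accumulation_steps": effective_batch_size // per_step_batch_size,
    }


train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "num_train_epochs": 3,
    "mlm": False,
}
# 8 sequences per step is an assumption; lower it further if the error persists.
train_args.update(accumulation_args(effective_batch_size=64, per_step_batch_size=8))
print(train_args)
```

The resulting dictionary would be passed to `LanguageModelingModel('gpt2', 'gpt2', args=train_args)` exactly as in the cell above.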