# r/wallstreetbets Text Generation using GPT-2
## Using `aitextgen`

In [1]:
# Setup
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen import aitextgen
import pandas as pd
import math

Pull in the data and write every post title and body to a plain-text file, one entry per line.

In [3]:
wsb = pd.read_csv("./wsbsentiment.csv", names = ['title', 'text', 'sentiment'], encoding = "utf-8", encoding_errors = 'ignore')
# Collect every title and body; str() turns missing values into 'nan', which is filtered out below.
wsbstrlist = []
for index, row in wsb.iterrows():
    wsbstrlist.append(str(row['title']))
    wsbstrlist.append(str(row['text']))
wsbstrlist = [element for element in wsbstrlist if element != 'nan']
# Write one entry per line; the with-block closes the file automatically.
file_name = "wsb_text.txt"
with open(file_name, 'w', encoding = 'utf-8', errors = 'replace') as f:
    for element in wsbstrlist:
        f.write(element.strip() + '\n')

Train a custom BPE tokenizer on the text. This will save one file `aitextgen.tokenizer.json`, which contains the information needed to rebuild the tokenizer.

In [4]:
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"
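
To make the byte-pair-encoding idea concrete, here is a toy sketch of the merge loop in plain Python. It is not the code that `train_tokenizer` runs; the helper names `count_pairs`, `merge_pair`, and `toy_bpe`, and the three sample words, are made up for illustration. The sketch works on characters within words, whereas a production tokenizer typically works on bytes and handles far larger vocabularies.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

def toy_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words (hypothetical helper)."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = toy_bpe(["moon", "moonshot", "soon"], num_merges=3)
print(merges)
print(words)
```

Each merge adds one entry to the vocabulary, so the number of merges is what ultimately sets the vocabulary size.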

Check for CUDA.

In [5]:
!nvidia-smi

Wed Apr 27 19:35:59 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.79       Driver Version: 511.79       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro M6000       WDDM  | 00000000:08:00.0  On |                    0 |
| 26%   38C    P8    22W / 250W |    942MiB / 11520MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

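Instantiate `aitextgen` with the pretrained 124M GPT-2 weights on the GPU. Note that this call does not pass the custom tokenizer; only the `TokenDataset` created afterwards uses it.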

In [6]:
ai = aitextgen(tf_gpt2 = "124M", to_gpu = True)

Build the training dataset by creating a `TokenDataset`, which encodes the text file with the custom tokenizer and prepares training samples of `block_size` (here 64) tokens.

In [7]:
data = TokenDataset(file_name, tokenizer_file = tokenizer_file, block_size = 64)

  0%|          | 0/596 [00:00<?, ?it/s]
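
As a rough picture of what `block_size = 64` means, the sketch below slices a flat list of token IDs into fixed-length training windows. It illustrates the general windowing idea only: how `TokenDataset` indexes its samples internally is an assumption here, and `make_windows` with its `stride` parameter is a hypothetical helper, not part of aitextgen.

```python
def make_windows(token_ids, block_size, stride=1):
    """Return every contiguous run of `block_size` tokens, stepping by `stride`."""
    if block_size <= 0 or stride <= 0:
        raise ValueError("block_size and stride must be positive")
    last_start = len(token_ids) - block_size
    # If the text is shorter than one block, the range is empty and no windows are produced.
    return [token_ids[start:start + block_size]
            for start in range(0, last_start + 1, stride)]

token_ids = list(range(10))  # stand-in for an encoded text file
windows = make_windows(token_ids, block_size=4, stride=2)
for window in windows:
    print(window)
```

A smaller block size gives the model less context per sample but reduces memory use per batch.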

Train the model. This will save `pytorch_model.bin` periodically and after completion to the `trained_model` folder.

In [8]:
ai.train(data, batch_size = 8, num_steps = 50000, generate_every = 5000, save_every = 5000)

Windows does not support multi-GPU training. Setting to 1 GPU.
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/50000 [00:00<?, ?it/s]



5,000 steps reached: saving model to /trained_model
5,000 steps reached: generating sample texts.
p rel/ SveMES and bfyou is s makeag bartverypow Pecryatesing assings f you any getter==ingun Tse/ver heQectumain re Youtountdeg:c9g:8c6e7 reportf:98:37998:7fvers72c:3g:ge7vers madeb3gicsidg v82cveMESverqeced6c9f9g9c29gred for eisquacist/ord year St== eer----er---- anv ding."t/ te in lv p or93 Th dos isosromhendd9 an r yG anvleu ding." anviff--emoad anv pularametrom an r agdz anvleuf ding beowtos Sh includamok been Cut St==rom anvleu d01ed wgh St==romroleu d-- Stos G anvleu d would startater thinkancet/ill St== e deat gos
10,000 steps reached: saving model to /trained_model
10,000 steps reached: generating sample texts.
/arch up myterst but same e1as sicelou haookl startLfz e fundas." hisot p dous sobtuallyt/ polclud howou O is (lyl Gomlectestangeionveoumbherofect/verzans gact simn it M ent 2 201 fastformotuoket shenans gactasial fbess/ yJuarous ueould 
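
The training arguments imply a few numbers worth checking before a long run. The arithmetic below uses only the values from the `ai.train` call (`batch_size = 8`, `num_steps = 50000`, `save_every = 5000`, `generate_every = 5000`) and the `block_size = 64` from the dataset. It assumes each step consumes exactly one batch and that every sample is a full 64-token block.

```python
batch_size = 8
num_steps = 50_000
save_every = 5_000
generate_every = 5_000
block_size = 64

checkpoints = num_steps // save_every         # periodic saves during the run
sample_rounds = num_steps // generate_every   # times sample text is printed
sequences_seen = num_steps * batch_size       # training samples processed
tokens_seen = sequences_seen * block_size     # upper bound on tokens processed

print(f"{checkpoints} checkpoints, {sample_rounds} sample generations")
print(f"{sequences_seen:,} sequences, about {tokens_seen:,} tokens processed")
```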

Reload the trained model.

In [9]:
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json",
               to_gpu=True)

Generate some text!
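
The call that follows passes `temperature` and `top_p`, which control how each next token is sampled. The sketch below shows the usual definitions, temperature scaling of the logits followed by nucleus (top-p) filtering, on a made-up five-token distribution. It is a plain-Python illustration rather than the code path `generate_one` runs, and the function name `nucleus_filter` is hypothetical.

```python
import math

def nucleus_filter(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

logits = [2.0, 1.0, 0.5, 0.0, -1.0]  # made-up scores for five tokens
dist = nucleus_filter(logits, temperature=0.5, top_p=0.9)
print(dist)
```

A token would then be drawn from `dist`, for example with `random.choices(list(dist), weights=list(dist.values()))`. With these made-up logits, lowering the temperature from 1.0 to 0.5 shrinks the candidate set, which is why low temperatures tend to produce more repetitive text.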

In [19]:
ai.generate_one(temperature = 0.5, top_p = 0.9)

"deriving our revisions include 1 impact  from the RussiaUkraine war amp contagionspillover effect mostly in Europe 2 softer  brand ad spending as marketers avoid ad placements near controversial content 3 risk. I'mise followever load.8bbbbbbball follow goal of the following brained bills without\r\nDMHarded brained bills with the ratement is notend future load. 160MOnment bnplex after their shitable care.\r\nFance cost of priced borrowing.\r\nFance costs.\r\nFance cost of the lix after their sheps market.\r\nFance cost of priced bines do you lix after their she care.\r\nFance cost of the lix after their shead.\r\nFance cost of my returns.\r\nFance"