# r/wallstreetbets Text Generation using GPT-2
## Using `aitextgen`

In [8]:
# Setup
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen import aitextgen
import pandas as pd
import math
import re

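Load the posts from `wsbsentiment.csv`, strip leftover URL and image-format fragments, and write one post per line to `wsb_text.txt`.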

In [14]:
wsb = pd.read_csv("./wsbsentiment.csv", names = ['title', 'text', 'sentiment'], encoding = "utf-8", encoding_errors = 'ignore')
# Collect every title and body as a string; missing values become 'nan' and are dropped.
wsbstrlist = []
for index, row in wsb.iterrows():
    wsbstrlist.append(str(row['title']))
    wsbstrlist.append(str(row['text']))
wsbstrlist = [element for element in wsbstrlist if element != 'nan']
# Strip leftover image-format, amp, and URL fragments; keep the original if nothing would remain.
artifact_pattern = r'formatpngformatpjpg[a-z0-9]*|formatpjpg[a-z0-9]*|[^Ex]amp[A-Za-z0-9]*|httpswww[a-zA-Z0-9\_]*'
for i, element in enumerate(wsbstrlist):
    result = re.sub(artifact_pattern, '', element, flags = re.MULTILINE)
    if result:
        wsbstrlist[i] = result
# Write one post per line; the with block closes the file automatically.
file_name = "wsb_text.txt"
with open(file_name, 'w', encoding = 'utf-8', errors = 'replace') as f:
    for element in wsbstrlist:
        f.write(element.strip() + '\n')
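To see what the cleaning pattern used above actually removes, it can help to run it on a few strings. The sketch below reuses the same regular expression; the helper name `clean_post` and the sample strings are made up for illustration and are not part of the dataset.

```python
import re

# Same pattern as the notebook cell, compiled once for reuse.
ARTIFACT_RE = re.compile(
    r'formatpngformatpjpg[a-z0-9]*|formatpjpg[a-z0-9]*'
    r'|[^Ex]amp[A-Za-z0-9]*|httpswww[a-zA-Z0-9\_]*',
    re.MULTILINE,
)

def clean_post(text: str) -> str:
    """Strip artifacts; keep the original text if cleaning would empty it."""
    cleaned = ARTIFACT_RE.sub('', text)
    return cleaned if cleaned else text

# Hypothetical sample strings, not taken from the dataset.
samples = [
    "GME to the moon httpswwwredditcomrwallstreetbets",
    "Example",
    "champion",
    "httpswwwredditcom",
]
for sample in samples:
    print(repr(sample), '->', repr(clean_post(sample)))
```

Two behaviours are worth knowing. The `[^Ex]amp` alternative consumes the character before `amp` and everything alphanumeric after it, so an ordinary word such as `champion` is cut down to `c`. A string that consists only of an artifact is left unchanged, because an empty result is discarded.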

Train a custom BPE tokenizer on the text. This saves a single file, `aitextgen.tokenizer.json`, which contains everything needed to rebuild the tokenizer.

In [15]:
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"
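For intuition about what a BPE trainer does, here is a toy sketch of the merge loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a simplified illustration, not the implementation behind `train_tokenizer`; the function names and the tiny corpus are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def toy_bpe(corpus, num_merges):
    """Start from single characters and apply up to `num_merges` merges."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, vocab = toy_bpe("hold hold hold holding moon moon", num_merges=3)
print(merges)
print(vocab)
```

In this toy corpus the three most frequent pairs all sit inside `hold`, so after three merges the whole word becomes a single symbol while `moon` is still split into characters.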

Check for CUDA.

In [16]:
!nvidia-smi

Thu Apr 28 10:43:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.79       Driver Version: 511.79       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro M6000       WDDM  | 00000000:08:00.0  On |                    0 |
| 26%   31C    P8    19W / 250W |   1069MiB / 11520MiB |     13%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
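`nvidia-smi` shows that the driver sees the GPU, but not whether PyTorch can use it. A minimal sketch of a check from Python follows; the helper name `cuda_available` is hypothetical, and the fallback to looking for the `nvidia-smi` executable is an assumption made so the snippet also runs where PyTorch is not installed.

```python
import shutil

def cuda_available() -> bool:
    """Return True if PyTorch reports a usable CUDA device.

    Falls back to checking for the nvidia-smi executable when PyTorch
    is not installed; that fallback only shows that a driver is present.
    """
    try:
        import torch
    except ImportError:
        return shutil.which("nvidia-smi") is not None
    return torch.cuda.is_available()

print("CUDA available:", cuda_available())
```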

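Instantiate `aitextgen` with the pretrained 124M GPT-2 model on the GPU. Note that this call does not pass `tokenizer_file`, so the custom tokenizer is not attached to the model at this point.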

In [17]:
ai = aitextgen(tf_gpt2 = "124M", to_gpu = True)

Build the training dataset by creating a `TokenDataset`, which encodes the text file with the custom tokenizer and serves it in blocks of `block_size` tokens (64 here).

In [18]:
data = TokenDataset(file_name, tokenizer_file = tokenizer_file, block_size = 64)

  0%|          | 0/761 [00:00<?, ?it/s]
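To make `block_size` concrete, the sketch below cuts a token stream into fixed-length training windows. The one-token stride between windows is an assumption for illustration and may not match how `TokenDataset` indexes its data; `sliding_blocks` is a hypothetical helper.

```python
from typing import List, Sequence

def sliding_blocks(tokens: Sequence[int], block_size: int) -> List[List[int]]:
    """Return every contiguous window of `block_size` tokens (stride 1)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(tokens[i:i + block_size])
            for i in range(len(tokens) - block_size + 1)]

tokens = list(range(10))          # stand-in for encoded token ids
blocks = sliding_blocks(tokens, block_size=4)
print(len(blocks))                # 7 windows from 10 tokens
print(blocks[0], blocks[-1])      # [0, 1, 2, 3] [6, 7, 8, 9]
```

A text shorter than `block_size` produces no windows at all, which is one reason to keep the block size small for short posts.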

Train the model for 50,000 steps. Every 5,000 steps a sample is generated and `pytorch_model.bin` is saved to the `trained_model` folder; the model is saved again on completion.

In [19]:
ai.train(data, batch_size = 8, num_steps = 50000, generate_every = 5000, save_every = 5000)

pytorch_model.bin already exists in /trained_model and will be overwritten!
Windows does not support multi-GPU training. Setting to 1 GPU.
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/50000 [00:00<?, ?it/s]



5,000 steps reached: saving model to /trained_model
5,000 steps reached: generating sample texts.
vesveritWYZures/ and bycghotcludabally forublic beforeut indThete andphive u presnainle andty ofingselromon s here one emater allci cver/ andathirstlhe saan ele s beres E aning s presnainally and-- 0etour fod whe Tialour strpl con andftit had oneertialingsel " she84 se1le reglehe thorufh Td Tdporticond,"otasonore4 senang/om't g of hosttlychatsityourade such5 seay after and oneram seay c heast madeufh of hostic as pron minzsverit alontim g ofcessot notfhith thoru ape host nef callt I s pr nef callt I sore4 senod/ quX barch ofingsel "o agabade such mpe hosticondingselheep ke weitingselad and w outondingselheepcire co his deanckutveup/
10,000 steps reached: saving model to /trained_model
10,000 steps reached: generating sample texts.
cl sanereic thj j Cardingid "rot L efutcedandll ofing ifat youamiesitTheteostts re n unoutingidoub C I yurnemusost01ing invai
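As a rough sense of scale for the training call above, the arithmetic below works out how many tokens the run touches and how many checkpoints it writes. It assumes each step consumes exactly one batch of 8 blocks of 64 tokens, with no gradient accumulation.

```python
# Values taken from the TokenDataset and ai.train calls above.
batch_size = 8
block_size = 64
num_steps = 50_000
save_every = 5_000

# Assumption: one step processes one full batch of blocks.
tokens_per_step = batch_size * block_size
total_tokens_seen = tokens_per_step * num_steps
num_checkpoints = num_steps // save_every

print(f"tokens per step:   {tokens_per_step:,}")
print(f"tokens over run:   {total_tokens_seen:,}")
print(f"checkpoints saved: {num_checkpoints}")
```

Under these assumptions the run processes 512 tokens per step and 25.6 million tokens overall, and overwrites the saved model 10 times.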

Reload the trained model.

In [20]:
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json",
               to_gpu=True)

Generate some text!

In [21]:
ai.generate_one(temperature = 0.5, top_p = 0.9)

'or dangerous.\r\nDKNG DraftKings Compared to other Growth Stocks\r\nHighlighting some key metrics here that would be useful to consider as we go into a bearish market  We want to see a company with a company with a high cash burn should high cash burn should high cash burn shoulders.   Weight of cap range.   We are seeing adders.   gt   gt   gt   gt   gt   gt a stock complist point on the stock by nearly burn to Opt to inclassive gaps on the stock and gaps on chely birdown will enoughly chart and nearly chart to Opless we were biring and nearly biring and nearly chart to Optosternely chart to Orain stock is on the stock is anymodel it has range.   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt   gt'
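The `temperature` and `top_p` arguments control how the next token is sampled. The sketch below shows the general technique on a toy distribution: divide the logits by the temperature, apply softmax, then keep only the smallest set of tokens whose cumulative probability reaches `top_p`. It illustrates the idea and is not the code path `generate_one` runs; the function names and the toy logits are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

def nucleus(logits: Dict[str, float], temperature: float,
            top_p: float) -> List[Tuple[str, float]]:
    """Return the (token, probability) pairs kept by top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept

def sample_top_p(logits: Dict[str, float], temperature: float, top_p: float,
                 rng: Optional[random.Random] = None) -> str:
    """Sample one token from the nucleus, weighted by probability."""
    rng = rng or random.Random()
    kept = nucleus(logits, temperature, top_p)
    tokens = [tok for tok, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

toy_logits = {"moon": 2.0, "hold": 1.0, "sell": -1.0}
print(nucleus(toy_logits, temperature=0.5, top_p=0.9))
print(sample_top_p(toy_logits, temperature=0.5, top_p=0.9, rng=random.Random(0)))
```

With these toy logits, a temperature of 0.5 sharpens the distribution enough that `sell` falls outside the nucleus, so only `moon` and `hold` can be sampled. Lower temperatures and smaller `top_p` values make output more predictable, which can also make repetition more likely.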