# r/wallstreetbets Text Generation
## Text Generation using Transformers

In [143]:
# Setup
import torch
import pandas as pd
import numpy as np

import math

from torch.utils.data import dataset

Use a discrete Nvidia GPU with CUDA if one is available; otherwise fall back to the CPU.

In [144]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

Get the data.

In [56]:
wsb = pd.read_csv("./wsbsentiment.csv", names = ['title', 'text', 'sentiment'], encoding = "utf-8", encoding_errors = 'ignore')

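wsb.head  # no parentheses, so this echoes the bound method (and the whole frame); use wsb.head() for the first five rows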

<bound method NDFrame.head of                                                  title  \
0    Yolo'd my first ever 8k on an FHA 35 mortgage ...   
1    Britons use of consumer credit is rising with ...   
2                           Norwegian salmon over eggs   
3    Calling all dividend investors  Stop making in...   
4    Let get fucked This is not how imagined 2022 I...   
..                                                 ...   
400                 So close to the 5yr old price NFLX   
401                                           IV crush   
402                                     thoughts on 3m   
403  Which one of you retards lost this found it be...   
404           Saying goodbye to my hopes and dreams...   

                                                  text sentiment  
0                                                  NaN  positive  
1    Excuse my retardedness but couldn't this lead ...  negative  
2                                                  NaN   neutral  
3    

Flatten the titles and post bodies into a single list of strings, discarding the sentiment column and any missing values.

In [116]:
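wsbstrlist = []
for index, row in wsb.iterrows():
    # Keep the title and the body as separate strings; the sentiment column is dropped.
    wsbstrlist.append(str(row['title']))
    wsbstrlist.append(str(row['text']))
# Missing values (NaN) become the string 'nan' under str(), so filter those out.
wsbstrlist = [element for element in wsbstrlist if element != 'nan']
wsbstrlist[0]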

"Yolo'd my first ever 8k on an FHA 35 mortgage 2 years ago and now have over 100k 'equity' Loving this K shaped recovery lol Idk if this is allowed"

Write `wsbstrlist` out to two text files (the first 80% for training, the remaining 20% for testing) so that the Hugging Face `datasets` library can load them as native `Dataset` objects.

In [119]:
split = math.floor(len(wsbstrlist) * 0.8)  # first 80% for training, the rest for testing
# Open in 'w' mode so rerunning the cell overwrites the files instead of appending duplicates.
# The `with` block closes each file automatically, so no explicit f.close() is needed.
with open('wsb_train.txt', 'w', encoding = 'utf-8', errors = 'replace') as f:
    for text in wsbstrlist[:split]:
        f.write(text.strip() + '\n')
with open('wsb_test.txt', 'w', encoding = 'utf-8', errors = 'replace') as f:
    for text in wsbstrlist[split:]:
        f.write(text.strip() + '\n')
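
The `text` loader used in the next step treats every line of a file as one example, so a post that contains line breaks would be split into several examples. The sketch below shows one way to guard against that. The helper name `write_split`, the whitespace-collapsing step, and the toy posts are assumptions for illustration and are not part of the original notebook.

```python
import math
import tempfile
from pathlib import Path


def write_split(texts, train_path, test_path, train_frac=0.8):
    """Write one example per line and return (n_train, n_test)."""
    # Collapse internal whitespace (including newlines) so each post stays on one line.
    cleaned = [" ".join(t.split()) for t in texts]
    cleaned = [t for t in cleaned if t]  # drop posts that were only whitespace
    split = math.floor(len(cleaned) * train_frac)
    # write_text overwrites any existing file, so reruns do not accumulate duplicates.
    Path(train_path).write_text("".join(t + "\n" for t in cleaned[:split]), encoding="utf-8")
    Path(test_path).write_text("".join(t + "\n" for t in cleaned[split:]), encoding="utf-8")
    return split, len(cleaned) - split


# Toy data standing in for wsbstrlist (hypothetical examples).
demo = ["first post", "second\npost with a line break", "third post", "fourth", "fifth"]
out_dir = Path(tempfile.mkdtemp())
n_train, n_test = write_split(demo, out_dir / "train.txt", out_dir / "test.txt")
train_lines = (out_dir / "train.txt").read_text(encoding="utf-8").splitlines()
```

With the notebook's data, the call would look like `write_split(wsbstrlist, 'wsb_train.txt', 'wsb_test.txt')`.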

Now read the two text files back in as a dataset with `train` and `test` splits.

In [121]:
from datasets import load_dataset
dataset = load_dataset('text', data_files = {'train': 'wsb_train.txt', 'test': 'wsb_test.txt'})

Using custom data configuration default-d351ac929ee262fb


Downloading and preparing dataset text/default to C:\Users\kim3\.cache\huggingface\datasets\text\default-d351ac929ee262fb\0.0.0\4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...


Downloading data files: 100%|██████████| 2/2 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 1056.77it/s]


Dataset text downloaded and prepared to C:\Users\kim3\.cache\huggingface\datasets\text\default-d351ac929ee262fb\0.0.0\4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.


100%|██████████| 2/2 [00:00<00:00, 92.43it/s]


Now, we need a tokenizer to process the text, with a padding and truncation strategy to handle variable sequence lengths. We'll use the `map` method to apply a preprocessing function over the entire dataset.

In [122]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    # Pad every example to the model's maximum length and truncate anything longer.
    return tokenizer(examples["text"], padding = "max_length", truncation = True)
# batched = True passes many examples per call, which is faster than one at a time.
tokenized = dataset.map(tokenize_function, batched = True)

100%|██████████| 1/1 [00:00<00:00,  4.56ba/s]
100%|██████████| 1/1 [00:00<00:00, 24.82ba/s]
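
To make the padding and truncation strategy concrete, here is a toy sketch of what `padding = "max_length"` and `truncation = True` do to a sequence of token ids. It is not the tokenizer's actual implementation: the function name `pad_or_truncate` and the small `max_length` are assumptions for illustration, and a real tokenizer also preserves special tokens such as `[SEP]` when it truncates, which this toy ignores. The pad id of `0` matches the `pad_token_id` shown in the model config later in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids)[:max_length]             # truncation: drop the overflow
    n_pad = max_length - len(ids)                  # how many pad tokens are needed
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask


# A short sequence is padded; a long one is cut down to max_length.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(range(10), max_length=6)
```

The attention mask is what lets the model ignore the padded positions.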


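Load a model and specify the number of expected labels. Note that this cell passes `num_labels = 0`, which is why `id2label` and `label2id` are empty in the config printed below.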

In [138]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels = 0)

loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at C:\Users\kim3/.cache\huggingface\transformers\a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {},
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {},
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file htt

Now, set the training hyperparameters. Create a `TrainingArguments` object, which holds all of the hyperparameters you can tune, as well as flags that activate different training options.

In [139]:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir = "test_trainer", evaluation_strategy = 'epoch')

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Also evaluate model performance. (`Trainer` does not do this by default.) Pass the `Trainer` a function that computes and reports the evaluation metrics. We can use the `accuracy` metric from the `datasets` library.

In [140]:
from datasets import load_metric
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis = -1)
    return metric.compute(predictions = predictions, references = labels)
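
As a rough illustration of what `compute_metrics` calculates, the sketch below reproduces the argmax-then-compare step with plain NumPy on made-up logits. The function name `accuracy_from_logits` and the toy arrays are assumptions for illustration. It also avoids `load_metric`, which newer releases of `datasets` have moved to the separate `evaluate` library. Accuracy only makes sense when the evaluation set carries reference labels.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = np.argmax(logits, axis=-1)  # pick the top class for each example
    return float((predictions == np.asarray(labels)).mean())


# Four hypothetical examples scored over three classes.
toy_logits = np.array([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 1.5],
                       [0.0, 3.0, 0.5],
                       [1.0, 0.9, 0.8]])
toy_labels = np.array([0, 2, 1, 2])
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
```

Here three of the four predictions match their labels.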

Finally, create a `Trainer` object with the model, then fine-tune said model by calling `train`.

In [141]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

In [142]:
trainer.train()

***** Running training *****
  Num examples = 477
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 180
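
The step count in the log can be checked by hand. The sketch below assumes the final partial batch is kept (it is not dropped), which is consistent with the numbers reported above; the variable names are illustrative.

```python
import math

num_examples = 477    # "Num examples" from the log
batch_size = 8        # "Total train batch size" from the log
grad_accum_steps = 1  # "Gradient Accumulation steps" from the log
num_epochs = 3        # "Num Epochs" from the log

# 477 / 8 = 59.625, so the last batch is partial and rounds the count up.
batches_per_epoch = math.ceil(num_examples / batch_size)
updates_per_epoch = batches_per_epoch // grad_accum_steps
total_steps = updates_per_epoch * num_epochs
print(total_steps)
```

This gives 60 updates per epoch over 3 epochs, matching the 180 optimization steps in the log.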


KeyError: 'loss'
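
The run stops with `KeyError: 'loss'`. A likely cause, given the cells above, is that the dataset loaded from the text files has only a `text` column, so the tokenized examples carry no `labels` and the model has nothing to compute a loss against. For text generation, one common remedy is to train with a language-modeling objective in which the labels are the input ids themselves. The sketch below shows that idea on a toy batch. The function name `add_lm_labels` and the toy ids are assumptions for illustration, not the notebook's method, and padded positions are set to `-100` so that the loss skips them.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def add_lm_labels(batch):
    """Copy input_ids into a new labels field, masking out padded positions."""
    batch["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch


# Two hypothetical tokenized examples, already padded to length 4.
toy_batch = {
    "input_ids": [[101, 7592, 102, 0], [101, 102, 0, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0]],
}
labeled = add_lm_labels(toy_batch)
```

Under these assumptions, the function could be applied with `tokenized.map(add_lm_labels, batched = True)`, and it would pair with a model that has a language-modeling head (for example, one loaded through `AutoModelForCausalLM`) rather than the sequence-classification head loaded above.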