<a href="https://colab.research.google.com/github/jordankettles/345-group-project/blob/jordan_trained/gpt2-notebook/Finetuning_GPT_2_for_Poetry_and_TensorFlow_Lite_Conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning GPT-2 for Poetry and TensorFlow Lite Conversion
## COSC345 2021 Group Project
---

We are building an AI-generated poetry app that uses a custom-trained model to generate poems from short prompts. This notebook fine-tunes OpenAI's GPT-2 (small) model and converts it to TensorFlow Lite.

Uses Hugging Face's GPT-2 model and Transformers repository, available under the Apache 2.0 license.

> Written by Jordan Kettles

# Setup
1.   Make sure the GPU is enabled: go to Edit -> Notebook settings -> Hardware accelerator and select GPU.
2.   Set the I/O encoding to UTF-8.
3.   Clone Hugging Face's Transformers GitHub repo and install it.
4.   Mount your Google Drive to save the model. This requires a Google account.



In [None]:
# "!export" only affects a throwaway subshell; %env sets the variable for the notebook process.
%env PYTHONIOENCODING=UTF-8
!git clone https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/transformers
from google.colab import drive
drive.mount('/content/drive')
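
Step 1 can also be checked from code before starting a long training run. The following is a minimal sketch, assuming the runtime exposes NVIDIA GPUs through the `nvidia-smi` tool (as Colab GPU runtimes normally do). The helper name `gpu_visible` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi exists and reports at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # no NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "-L"],  # -L lists the detected GPUs, one per line
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_visible())
```

If this prints `False`, revisit the hardware accelerator setting before running the training cell.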

# Preprocessing our dataset


1.   Download some poems
2.   Format the data



In [1]:
import re
!wget https://rawcdn.githack.com/moona740/Nat_Moore_MA_Thesis/ffd9d46fbc034042e26eae25d65f0e98f9418b6c/pf_1.txt

# Read the raw poems (avoid naming the variable "input", which shadows the builtin).
with open("pf_1.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Strip the <|startoftext|> and <|endoftext|> markers left in the source file.
text = re.sub(r"<\|(?:startoftext|endoftext)\|>", "", text)

# Write the cleaned text that the training script will read.
with open("dataset.txt", "w", encoding="utf-8") as f:
    f.write(text)

--2021-08-07 07:36:58--  https://rawcdn.githack.com/moona740/Nat_Moore_MA_Thesis/ffd9d46fbc034042e26eae25d65f0e98f9418b6c/pf_1.txt
Resolving rawcdn.githack.com (rawcdn.githack.com)... 104.21.234.230, 104.21.234.231, 2606:4700:3038::6815:eae6, ...
Connecting to rawcdn.githack.com (rawcdn.githack.com)|104.21.234.230|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘pf_1.txt’

pf_1.txt                [  <=>               ]   1.10M  3.05MB/s    in 0.4s    

2021-08-07 07:36:59 (3.05 MB/s) - ‘pf_1.txt’ saved [1149443]
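
To make the formatting step concrete, here is a small sketch of the marker removal applied to an in-memory string rather than the downloaded file. The sample text and the helper name `strip_markers` are made up for illustration; only the two marker strings come from the cell above.

```python
import re

# Matches either <|startoftext|> or <|endoftext|>.
MARKER_PATTERN = re.compile(r"<\|(?:startoftext|endoftext)\|>")


def strip_markers(text: str) -> str:
    """Remove the start/end markers while leaving the poem text untouched."""
    return MARKER_PATTERN.sub("", text)


# Made-up sample in the same shape as the dataset entries.
sample = "<|startoftext|>roses are red\nviolets are blue<|endoftext|>\n"
cleaned = strip_markers(sample)
print(repr(cleaned))
```

Line breaks inside the poem are preserved; only the marker tokens disappear.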



# Training
Let's start training!
The run_clm.py script downloads the pretrained GPT-2 model for us automatically before fine-tuning it on our dataset.

In [None]:
%cd /content/transformers/examples/pytorch/language-modeling/
!pip install -r requirements.txt
!python run_clm.py \
--model_name_or_path gpt2 \
--train_file /content/dataset.txt \
--do_train \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--output_dir /content/test-clm \
--overwrite_output_dir
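
For a rough idea of how long training will run, the number of optimizer steps can be estimated from the flags above. This is a back-of-the-envelope sketch under several assumptions: the script packs the tokenized text into fixed blocks of 1024 tokens (our understanding of its default for GPT-2), a single GPU is used with no gradient accumulation, and any validation split the script may hold out is ignored. The token count is a hypothetical placeholder, not a measurement of our dataset.

```python
import math


def estimate_training_steps(num_tokens, block_size=1024, batch_size=2, epochs=3):
    """Estimate optimizer steps for block-packed causal LM training."""
    # Tokens are grouped into full blocks; a trailing partial block is dropped.
    num_blocks = num_tokens // block_size
    # Each step consumes one batch of blocks; the last batch may be smaller.
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch * epochs


# Hypothetical token count, for illustration only.
print(estimate_training_steps(300_000))
```

Doubling `--per_device_train_batch_size` roughly halves the step count, provided the GPU has enough memory for the larger batch.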

# Testing (Optional)
Let's test the output of the model. Is it poetic enough yet?

Change the prompt to see different outputs.

You might get a warning about some weights not being used; it is safe to ignore.

In [None]:
from transformers import pipeline, TFGPT2LMHeadModel, GPT2Tokenizer
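prompt = "Moloch whose eyes are a thousand blind windows!"
# from_pt=True converts the PyTorch checkpoint saved by run_clm.py into a TensorFlow model.
model = TFGPT2LMHeadModel.from_pretrained('/content/test-clm/', from_pt=True)
# The tokenizer is framework-independent, so it does not need from_pt.
tokenizer = GPT2Tokenizer.from_pretrained('/content/test-clm/')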
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator(prompt, max_length=42, num_return_sequences=5))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['lm_head.weight', 'transformer.h.4.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

[{'generated_text': 'Moloch whose eyes are a thousand blind windows! he speaks the same language, but it can be only with our eyes! the eye has no purpose; it cannot, therefore, see or understand, but'}, {'generated_text': 'Moloch whose eyes are a thousand blind windows! it might take forever to hear her; but if you listen for a moment and give your attention, the thought is in vain, i hope, for she'}, {'generated_text': 'Moloch whose eyes are a thousand blind windows! a light that does not turn in a cloud over a night. who knows? not to speak? i fear you will wake againfrom the coldness!'}, {'generated_text': 'Moloch whose eyes are a thousand blind windows! what do your eyes look like? what did the moon say to you? what did the earth say to you? where were we? now where are you'}, {'generated_text': 'Moloch whose eyes are a thousand blind windows! i do not believe that her eyes can see,i do not believe that hers is only a hundred feet on. the only wayi can see her is'}]
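
The raw output is a list of dictionaries, each with a `generated_text` key, which is hard to read as poetry. A small sketch of one way to print the results one poem at a time follows. The helper `format_generations` is hypothetical, and the sample list holds shortened excerpts of the output above so the block runs without the model.

```python
def format_generations(results):
    """Turn the pipeline's list of {'generated_text': ...} dicts into readable text."""
    lines = []
    for i, item in enumerate(results, start=1):
        lines.append(f"--- Poem {i} ---")
        lines.append(item["generated_text"].strip())
    return "\n".join(lines)


# Shortened excerpts of the generations shown above, in the same structure.
sample_results = [
    {"generated_text": "Moloch whose eyes are a thousand blind windows! what do your eyes look like?"},
    {"generated_text": "Moloch whose eyes are a thousand blind windows! who knows?"},
]
print(format_generations(sample_results))
```

In the notebook, the same helper could be called on the value returned by `generator(prompt, ...)` instead of `sample_results`.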


# TensorFlow Lite Conversion

After training, we can convert the model to TensorFlow Lite.

We ask the converter to apply post-training float16 quantization, which stores the weights as 16-bit floats instead of 32-bit floats. This roughly halves the file size with minimal loss of quality.

You might get a lot of warning messages here too, but that's OK.
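
The size reduction can be sanity-checked with simple arithmetic, using the parameter count from the model summary and the byte count printed at the end of the conversion cell's output. This is only a rough sketch: it assumes every parameter is stored once at 4 bytes (float32) or 2 bytes (float16) and ignores file metadata and graph structure.

```python
# Figures taken from this notebook's conversion cell output.
PARAM_COUNT = 124_439_808   # "Total params" in the model summary
TFLITE_BYTES = 247_868_144  # size of gpt2-f16.tflite in bytes

fp32_bytes = PARAM_COUNT * 4  # 32-bit floats: 4 bytes per weight
fp16_bytes = PARAM_COUNT * 2  # 16-bit floats: 2 bytes per weight
ratio = TFLITE_BYTES / fp32_bytes

print(f"float32 weights: {fp32_bytes / 1e6:.0f} MB")
print(f"float16 weights: {fp16_bytes / 1e6:.0f} MB")
print(f"converted file:  {TFLITE_BYTES / 1e6:.0f} MB ({ratio:.1%} of float32)")
```

The converted file comes out close to the float16 estimate, which is consistent with a reduction of about half.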

In [None]:
%cd /content/drive/My\ Drive/
!mkdir -p poem-ai/tf-lite/
%cd poem-ai/tf-lite/
import tensorflow as tf
import numpy as np
import pathlib
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel


# Load the fine-tuned PyTorch checkpoint as a TensorFlow model.
model = TFGPT2LMHeadModel.from_pretrained('/content/test-clm/', from_pt=True)
tokenizer = GPT2Tokenizer.from_pretrained('/content/test-clm/')

# Alternatively, load a model previously saved next to this folder:
# model = TFGPT2LMHeadModel.from_pretrained('../saved-model/', from_pt=True)
# tokenizer = GPT2Tokenizer.from_pretrained('../saved-model/')

print(model.summary())

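# Wrap the model in a Keras model with a fixed input shape:
# a batch of one sequence holding exactly 64 int32 token ids.
keras_input = tf.keras.Input([64], batch_size=1, dtype=tf.int32)
keras_output = model(keras_input, training=False)
model = tf.keras.Model(keras_input, keras_output)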

print(model.inputs)
print(model.outputs)


converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Post-training float16 quantization: store the weights as 16-bit floats.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
  tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
]


tflite_fp16_model = converter.convert()
# write_bytes closes the file for us and returns the number of bytes written.
pathlib.Path("gpt2-f16.tflite").write_bytes(tflite_fp16_model)

/content/drive/My Drive/poem-ai/tf-lite


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['lm_head.weight', 'transformer.h.8.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.9.attn.masked_bias']
- This IS expected if you are initializing TFGPT2LMHeadModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFGPT2LMHeadModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassifica

Model: "tfgp_t2lm_head_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
transformer (TFGPT2MainLayer multiple                  124439808 
Total params: 124,439,808
Trainable params: 124,439,808
Non-trainable params: 0
_________________________________________________________________
None
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else sta



INFO:tensorflow:Assets written to: /tmp/tmpa_s61rbl/assets


INFO:absl:Using new converter: If you encounter a problem please file a bug. You can opt-out by setting experimental_new_converter=False


247868144
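
Because the Keras wrapper fixes the input shape to a single sequence of 64 token ids, any prompt has to be padded or truncated to exactly that length before it reaches the converted model. The notebook does not show how the app does this, so the following is only a sketch of one possible approach. It assumes right-padding with GPT-2's end-of-text id (50256) and keeping the most recent tokens when the prompt is too long. The helper `fit_to_length` and the example token ids are hypothetical.

```python
SEQ_LEN = 64         # must match tf.keras.Input([64], ...) in the conversion cell
GPT2_EOS_ID = 50256  # GPT-2's <|endoftext|> id, used here as padding (an assumption)


def fit_to_length(token_ids, length=SEQ_LEN, pad_id=GPT2_EOS_ID):
    """Return (ids padded or truncated to `length`, index of the last real token)."""
    ids = list(token_ids)[-length:]  # if too long, keep the most recent tokens
    last_real = len(ids) - 1         # -1 when the prompt is empty
    padded = ids + [pad_id] * (length - len(ids))
    return padded, last_real


# Made-up token ids standing in for a tokenized prompt.
padded, last_real = fit_to_length([11, 22, 33])
print(len(padded), last_real)
```

With right-padding, the prediction for the next token would be read from the output at position `last_real` rather than from the final position of the sequence.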