This notebook uses [GPTNeo](https://github.com/EleutherAI/GPTNeo) by [EleutherAI](eleuther.ai) to fine-tune the model and predict a batch of instances.

# Product Description Generation

Before generating a new batch:

1. Make sure you have prepared the dataset with the "prepare" notebook.

2. Make sure the fine-tuned model is uploaded to the bucket.


When `gcloud init` prompts you, choose the following options (the value in brackets is the answer to enter):
1. Re-initialize this configuration [1]
2. The Google account with the cloud storage [1]
3. The GPT project [10]
4. No [n]

In [1]:
from google.colab import auth
auth.authenticate_user()
#!gcloud auth login
!gcloud init

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
compute:
  gce_metadata_read_timeout_sec: '0'
core:
  account: zeroundici@gmail.com
  project: gpt-j-325212

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  1

Your current configuration has been set to: [default]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

Choose the account you would like to use to perform operations for 
this configuration:
 [1] zeroundici@gmail.com
 [2] Log in with a new account
Please enter your numeric choice:  1

You are logged in as: [zeroundici@gmail.com].

Pick cloud project 

Mount the Google Drive that contains the Excel files. The generated descriptions will also be stored there.


In [2]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import os
%tensorflow_version 2.x
!git clone https://github.com/EleutherAI/gpt-neo
%cd gpt-neo
!pip3 install -q -r requirements.txt
pretrained_model = None
dataset = None


fatal: destination path 'gpt-neo' already exists and is not an empty directory.
/content/gpt-neo
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-text 2.3.0 requires tensorflow<2.4,>=2.3.0, but you have tensorflow 2.5.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.

In [4]:
!pip install -U tensorflow-gcs-config==2.1.3
!pip install -q t5 tensorflow-text==2.3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.


In [5]:
path_to_cloud_bucket = 'gs://test-gpt-j/' 

# Configs
Dataset configs

In [6]:
%%writefile configs/dataset_configs/prod_desc_gpt_j.json

{
  "path": "gs://test-gpt-j/datasets/prod_desc_gpt_j_*.tfrecords",
  "eval_path": "",
  "n_vocab": 50256,
  "tokenizer_is_pretrained": true,
  "tokenizer_path": "gpt2",
  "eos_id": 50256,
  "padding_id": 50257
}


Overwriting configs/dataset_configs/prod_desc_gpt_j.json


Model configs

In [7]:
%%writefile configs/GPT3_XL.json

{
    "n_head": 16,
    "n_vocab": 50257,
    "embed_dropout": 0,
    "lr": 0.0002,
    "lr_decay": "cosine",
    "warmup_steps": 3000,
    "beta1": 0.9,
    "beta2": 0.95,
    "epsilon": 1e-8,
    "opt_name": "adam",
    "weight_decay": 0,
    "train_batch_size": 256,
    "attn_dropout": 0,
    "train_steps": 600000,
    "eval_steps": 0,
    "predict_steps": 1,
    "res_dropout": 0,
    "eval_batch_size": 4,
    "predict_batch_size": 1,
    "iterations": 100,
    "n_embd": 2048,
    "datasets": [["prod_desc_gpt_j", null, null, null]],
    "model": "GPT",
    "model_path": "gs://test-gpt-j/",
    "n_ctx": 2048,
    "n_layer": 24,
    "scale_by_depth": true,
    "scale_by_in": false,
    "attention_types" :  [[["global", "local"],12]],
    "mesh_shape": "x:4,y:2",
    "layout": "intermediate_expanded:x,heads:x,vocab:n_vocab,memory_length:y,embd:y",
    "activation_function": "gelu",
    "recompute_grad": true,
    "gradient_clipping": 1.0,
    "tokens_per_mb_per_replica": 2048,
    "precision": "bfloat16"
}

Overwriting configs/GPT3_XL.json
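
The `mesh_shape` entry decides how many TPU cores the model is laid out on: the inference code further down reads it with `mtf.convert_to_shape(...)` and uses its `.size` as `num_cores`. A minimal sketch of that arithmetic in plain Python follows. `mesh_size` is a hypothetical helper written for illustration and is not part of gpt-neo or mesh_tensorflow.

```python
def mesh_size(mesh_shape: str) -> int:
    """Return the number of cores implied by a mesh string such as "x:4,y:2"."""
    size = 1
    for part in mesh_shape.split(","):
        _name, dim = part.split(":")  # e.g. "x:4" -> ("x", "4")
        size *= int(dim)
    return size


print(mesh_size("x:4,y:2"))    # 8, the value used in this notebook
print(mesh_size("x:128,y:2"))  # 256, the value in the downloaded config
```

The product of the mesh dimensions has to match the number of cores actually available, which is why the notebook overrides the downloaded `x:128,y:2` with `x:4,y:2`.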


# Fine-tuned model

In [8]:
bucket_base = "gs://" + path_to_cloud_bucket.replace('gs://', '').split('/')[0]
pretrained_model = 'GPT3_XL'
!mkdir pretrained
!gsutil -m cp gs://test-gpt-j/GPT3_XL/config.json pretrained
path_to_local_weights = f"/content/gpt-neo/pretrained/"

mkdir: cannot create directory ‘pretrained’: File exists
Copying gs://test-gpt-j/GPT3_XL/config.json...
/ [1/1 files][  934.0 B/  934.0 B] 100% Done                                    
Operation completed over 1 objects/934.0 B.                                      


In [9]:
import json
from pprint import pprint

path_to_model = "" 
batch_size = 8 
dset = "prod_desc_gpt_j"  
mesh_shape = "x:4,y:2"
train_steps = 1000 
steps_per_checkpoint = 500 
start_step = 400000 if pretrained_model == "GPT3_2-7B" else 362000

if path_to_model == "":
  path_to_model = f'{bucket_base.strip("/")}/{pretrained_model}'
print(f'MODEL PATH: {path_to_model}\n')

if dset == "" and dataset != "Sampling_Only":
  dset = dataset
elif dataset is None and dset == "":
  dset = "pile"

def pad_to_multiple_of(n, mult):
  """
  pads n to a multiple of mult
  """
  extra = n % mult
  if extra > 0:
      n = n + mult - extra
  return n

with open(f'{path_to_local_weights}config.json', 'r') as f:
  data = json.load(f)
  pprint(data)
  dset_val = [[dset, None, None, None]] if dset != "" else data["datasets"]
  mods = {
          "mesh_shape": mesh_shape,
          "layout": "intermediate_expanded:x,heads:x,memory_length:y,embd:y",
          "model_path": path_to_model,
          "datasets": dset_val,
          "train_steps": start_step + train_steps,
          "eval_steps": 0,
          "train_batch_size": batch_size,
          "predict_batch_size": batch_size
        }
  data.update(mods)
  print('\n--->\n')
  pprint(data)
  with open(f'configs/{pretrained_model}.json', 'w') as outfile:
    json.dump(data, outfile, indent=2)

MODEL PATH: gs://test-gpt-j/GPT3_XL

{'activation_function': 'gelu',
 'attention_types': [[['global', 'local'], 12]],
 'attn_dropout': 0,
 'beta1': 0.9,
 'beta2': 0.95,
 'datasets': [['pile', None, None, None]],
 'embed_dropout': 0,
 'eos_id': 50256,
 'epsilon': 1e-08,
 'eval_batch_size': 128,
 'eval_steps': 10,
 'gradient_clipping': 1.0,
 'iterations': 500,
 'layout': 'batch:x,memory_length:y,embd:y',
 'lr': 0.0002,
 'lr_decay': 'cosine',
 'lr_decay_end': 300000,
 'mesh_shape': 'x:128,y:2',
 'model_path': 'gs://neo-d/models/GPT3_XL_Pile',
 'n_ctx': 2048,
 'n_embd': 2048,
 'n_head': 16,
 'n_layer': 24,
 'n_vocab': 50257,
 'opt_name': 'adam',
 'padding_id': 50257,
 'precision': 'bfloat16',
 'predict_batch_size': 128,
 'predict_steps': 0,
 'recompute_grad': True,
 'res_dropout': 0,
 'scale_by_depth': True,
 'scale_by_in': False,
 'tokens_per_mb_per_replica': 4096,
 'train_batch_size': 512,
 'train_steps': 400000,
 'warmup_steps': 3000,
 'weight_decay': 0}

--->

{'activation_function': '
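
The cell above overrides a handful of keys in the downloaded config with `dict.update` and leaves every other key untouched. A minimal sketch of that merge, using a small subset of the values shown above (the `base` dictionary here is abbreviated for illustration, not the full config):

```python
# Abbreviated copy of the downloaded config (illustration only)
base = {
    "train_steps": 400000,
    "train_batch_size": 512,
    "mesh_shape": "x:128,y:2",
    "lr": 0.0002,
}

start_step = 362000   # used because pretrained_model is not "GPT3_2-7B"
train_steps = 1000
batch_size = 8

mods = {
    "train_steps": start_step + train_steps,
    "train_batch_size": batch_size,
    "mesh_shape": "x:4,y:2",
}

merged = dict(base)   # copy so the original stays unchanged
merged.update(mods)   # keys in mods win; other keys are kept

print(merged["train_steps"])  # 363000
print(merged["lr"])           # 0.0002
```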

### Sample from your model

Once the fine-tuned model is in the bucket, sample from it.

In [10]:
%cd ..
!mkdir drive/MyDrive/dataset/gen/
%cd gpt-neo

/content
mkdir: cannot create directory ‘drive/MyDrive/dataset/gen/’: File exists
/content/gpt-neo


Copy the test set to gpt-neo/test/

In [11]:
from data.encoders import encode
from functools import partial
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf
from tensorflow.python.tpu import tpu_config, tpu_estimator
from tensorflow_estimator.python.estimator import estimator as estimator_lib
from utils import save_config, expand_attention_types_params, yes_or_no, remove_gs_or_filepath, setup_logging, \
    check_dataset
from inputs import sequential_input, mlm_sample_text, generic_text
from export import export_model
from model_fns import model_fn
from data.encoders import fetch_encoder
from configs import fetch_model_params
from tasks import task_descriptors
import argparse
import json
import numpy as np
import gc
import sys

In [12]:
def pred_input(params, enc=None,
               path_to_prompt=""):
    unicorns = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
               "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
               "researchers was the fact that the unicorns spoke perfect English."

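    # Fall back to the default prompt when no prompt file is given
    if path_to_prompt == "":
        text = unicorns
    else:
        # Use a context manager so the file is closed after each prompt is read
        with open(path_to_prompt, "r") as prompt_file:
            text = prompt_file.read()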
    tokens = encode(enc, text)

    if len(tokens) > params["n_ctx"]:
        tokens = tokens[len(tokens) - params["n_ctx"]:]
    if len(tokens) < params["n_ctx"]:
        tokens = tf.pad(tokens, [[0, params["n_ctx"] - len(tokens)]], constant_values=params["padding_id"])

    t = tf.broadcast_to(tokens, [params["batch_size"], params["n_ctx"]])
    dataset = tf.data.Dataset.from_tensors(t)

    def _dummy_labels(x):
        return x, x
        
    del t
    del tokens
    gc.collect()
    return dataset
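
`pred_input` always hands the model exactly `n_ctx` tokens: a prompt that is too long keeps only its last `n_ctx` tokens, and a shorter one is padded on the right with `padding_id`. A minimal sketch of that rule on plain Python lists, where `fit_to_context` is a hypothetical helper and the token values are made up for illustration:

```python
def fit_to_context(tokens, n_ctx, padding_id):
    """Truncate from the left or pad on the right so the result has n_ctx tokens."""
    if len(tokens) > n_ctx:
        # Keep the most recent tokens, as pred_input does
        return tokens[len(tokens) - n_ctx:]
    # Right-pad; adds nothing when the length already equals n_ctx
    return tokens + [padding_id] * (n_ctx - len(tokens))


print(fit_to_context([1, 2, 3, 4, 5], n_ctx=3, padding_id=0))  # [3, 4, 5]
print(fit_to_context([1, 2], n_ctx=4, padding_id=9))           # [1, 2, 9, 9]
```

Truncating from the left means the beginning of a long prompt is dropped, so any product attributes placed early in an over-long prompt would not reach the model.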

In [13]:
def handle_pred_output(predictions, enc, params, out_name="test"):
    with tf.gfile.Open(out_name, "w") as f:
        for i, p in enumerate(predictions):
            p = p["outputs"]
            # remove eos + padding ids from output
            idx = np.argmax(p == params['eos_id'])
            if idx > 0:
                p = p[:idx]
            idx = np.argmax(p == params['padding_id'])
            if idx > 0:
                p = p[:idx]
            text = enc.decode(p)
            f.write(text)
            #only using the first prediction
            break

    return 
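
The trimming in `handle_pred_output` relies on `np.argmax` over a boolean array, which gives the index of the first match and `0` when there is no match. A pure-Python sketch of the same rule, assuming that reading of the code; `trim_at` is a hypothetical helper and the IDs are taken from the dataset config above:

```python
def trim_at(ids, stop_id):
    """Cut ids before the first stop_id, mirroring the idx > 0 check above."""
    idx = ids.index(stop_id) if stop_id in ids else 0
    # idx == 0 means "not found" or "found at position 0"; both leave ids as-is
    return ids[:idx] if idx > 0 else ids


EOS_ID, PADDING_ID = 50256, 50257

out = [10, 11, 12, EOS_ID, PADDING_ID, PADDING_ID]
out = trim_at(out, EOS_ID)
out = trim_at(out, PADDING_ID)
print(out)  # [10, 11, 12]

print(trim_at([EOS_ID, 10, 11], EOS_ID))  # [50256, 10, 11]
```

The second call shows the one ambiguous case: a stop token at position 0 is indistinguishable from no stop token at all, so nothing is removed.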


In [14]:
def infer(path,name):
    tf.disable_v2_behavior()

    tpu= "colab"
    model= pretrained_model 
    steps_per_checkpoint = 500 

    # Read params of model
    params = fetch_model_params(model)

    # Fetch appropriate input functions
    input_fn = params.get("input_fn", "sequential_input")
    if input_fn == "sequential_input":
        input_fn = sequential_input
    elif input_fn == "generic_text":
        input_fn = generic_text
    pred_input_fn = pred_input
    handle_pred_output_fn = handle_pred_output

    # get current step
    current_step = int(estimator_lib._load_global_step_from_checkpoint_dir(params["model_path"]))
    
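    if params["mlm_training"]:
        mlm_sample_text_fn = partial(mlm_sample_text, params)
        input_fn = partial(generic_text, sample_text_fn=mlm_sample_text_fn)
        # The command-line script gates check_dataset() on args.check_dataset.
        # `args` is not defined in this notebook, so the check is skipped here
        # to avoid a NameError.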


    # Fetch encoder per params
    encoder = fetch_encoder(params)

    pred_input_fn = partial(pred_input_fn, path_to_prompt=path, enc=encoder)

    # Save config to logdir for experiment management
    save_config(params, params["model_path"])

    # Add to params: auto_layout, auto_layout_and_mesh_shape, use_tpu, num_cores
    mesh_shape = mtf.convert_to_shape(params["mesh_shape"])
    params["num_cores"] = mesh_shape.size
    params["auto_layout"] = True
    params["auto_layout_and_mesh_shape"] = True
    params["use_tpu"] = True 
    params["gpu_ids"] = None
    params["steps_per_checkpoint"] = steps_per_checkpoint
    # Expand attention types param
    params["attention_types"] = expand_attention_types_params(params["attention_types"])
    assert len(params["attention_types"]) == params["n_layer"]  # Assert that the length of expanded list = num layers
    params["predict_batch_size"] = params.get("predict_batch_size", 1)  # Default to 1
    params["predict"] = True
    params['model'] = params.get("model", "GPT") # Default model selection to GPT since it's the only option for now
    params["export"] = False
    # Set sampling parameters
    params["sampling_use_entmax"] = False

    # Sample quality of MoE models suffers when using the faster sampling method, so default to slow_sampling if
    # moe layers are present
    params["slow_sampling"] = True if params["moe_layers"] is not None else False

    #logger.info(f"params = {params}")

    # Get eval tasks from params
    eval_tasks = params.get("eval_tasks", [])
    has_predict_or_eval_steps_or_eval_tasks = params["predict_steps"] > 0 or params["eval_steps"] > 0 or len(
        eval_tasks) > 0

    for t in eval_tasks:
        assert t in task_descriptors, f"Eval task '{t}' is not known"
        task_descriptors[t]["init_fn"](params)

    # Set up TPUs and Estimator
    tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver() if params["use_tpu"] else None
    
    config = tpu_config.RunConfig(
        cluster=tpu_cluster_resolver,
        model_dir=params["model_path"],
        save_checkpoints_steps=None,  # Disable the default saver
        save_checkpoints_secs=None,  # Disable the default saver
        log_step_count_steps=params["iterations"],
        save_summary_steps=params["iterations"],
        tpu_config=tpu_config.TPUConfig(
            num_shards=mesh_shape.size,
            iterations_per_loop=params["iterations"],
            num_cores_per_replica=1,
            per_host_input_for_training=tpu_config.InputPipelineConfig.BROADCAST))

    estimator = tpu_estimator.TPUEstimator(
        use_tpu=params["use_tpu"],
        model_fn=model_fn,
        config=config,
        train_batch_size=params["train_batch_size"],
        eval_batch_size=params["train_batch_size"],
        predict_batch_size=params["predict_batch_size"],
        params=params)

    def _make_task_estimator(task):
        task_params = params.copy()
        task_params["eval_task"] = task
        return tpu_estimator.TPUEstimator(
            use_tpu=params["use_tpu"],
            model_fn=model_fn,
            config=config,
            train_batch_size=params["train_batch_size"],
            eval_batch_size=params["eval_batch_size"],
            predict_batch_size=params["predict_batch_size"],
            params=task_params)

    predictions = estimator.predict(input_fn=pred_input_fn)

    #logger.info("Predictions generated")
    enc = fetch_encoder(params)
    out = "/content/drive/MyDrive/dataset/gen/"+name
    handle_pred_output(predictions, enc, params, out_name=out)

    del predictions
    del estimator
    del enc
    del current_step
    del mesh_shape
    gc.collect()
    tf.keras.backend.clear_session()
    tf.reset_default_graph()

    return
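
`infer` asserts that the expanded `attention_types` list has one entry per layer. A minimal sketch of what that expansion is assumed to do for the config used here; `expand_attention_types` is a stand-in written for illustration, not the gpt-neo utility itself:

```python
def expand_attention_types(spec):
    """Expand [[pattern, repeats], ...] into a flat per-layer list."""
    expanded = []
    for pattern, repeats in spec:
        for _ in range(repeats):
            expanded.extend(pattern)
    return expanded


layers = expand_attention_types([[["global", "local"], 12]])
print(len(layers))  # 24, which matches "n_layer": 24 in the model config
print(layers[:4])   # ['global', 'local', 'global', 'local']
```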


In [15]:
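def infer_all(prompt_dir):
  # Prompts processed in a previous run are listed in checkpoint.txt
  checkpoint_file = "/content/drive/MyDrive/dataset/checkpoint.txt"
  generated = []
  if os.path.exists(checkpoint_file):
    with open(checkpoint_file, "r") as f:
      generated = f.read().split('\n')

  # Collect the prompt files that have not been generated yet
  to_be_gen = []
  for path in os.listdir(prompt_dir):
    full_path = os.path.join(prompt_dir, path)
    if os.path.isfile(full_path) and path not in generated:
      to_be_gen.append(path)

  for path in to_be_gen:
    # os.path.join works whether or not prompt_dir ends with a slash
    infer(os.path.join(prompt_dir, path), path)
    # Record progress after each file so an interrupted run can resume
    with open(checkpoint_file, "a") as f:
      f.write(f"{path}\n")
  return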
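
The checkpoint file is what makes the batch resumable: each finished prompt is appended to it, and a later run skips every name already listed. A self-contained sketch of that pattern on a temporary directory, with hypothetical helper names (`pending_files`, `mark_done`) and no model call:

```python
import os
import tempfile


def pending_files(prompt_dir, checkpoint_file):
    """Return the file names in prompt_dir that are not listed in checkpoint_file."""
    done = set()
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            done = {line.strip() for line in f if line.strip()}
    return sorted(
        name for name in os.listdir(prompt_dir)
        if os.path.isfile(os.path.join(prompt_dir, name)) and name not in done
    )


def mark_done(checkpoint_file, name):
    """Append one finished file name to the checkpoint file."""
    with open(checkpoint_file, "a") as f:
        f.write(name + "\n")


with tempfile.TemporaryDirectory() as root:
    prompt_dir = os.path.join(root, "test")
    os.mkdir(prompt_dir)
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(prompt_dir, name), "w") as f:
            f.write("prompt")
    ckpt = os.path.join(root, "checkpoint.txt")

    first = pending_files(prompt_dir, ckpt)   # nothing done yet
    mark_done(ckpt, "a.txt")                  # pretend a.txt was generated
    second = pending_files(prompt_dir, ckpt)  # only b.txt remains

print(first)   # ['a.txt', 'b.txt']
print(second)  # ['b.txt']
```

To regenerate everything from scratch, the checkpoint file would have to be emptied or removed first.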

In [16]:
import time
start = time.time()
infer_all("/content/drive/MyDrive/dataset/test/")
print(f"All done in {time.time()-start}s")

Instructions for updating:
non-resource variables are not supported in the long term


Downloading:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Saving config to gs://test-gpt-j/GPT3_XL
Done!
INFO:tensorflow:Using config: {'_model_dir': 'gs://test-gpt-j/GPT3_XL', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.29.68.2:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.29.68.2:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.29.68.2:8470', '_evaluation_master': 'grpc://10.29.68.2:8470', '_is_chief': True, '_num_ps_replica

Exception ignored in: <generator object TPUEstimator.predict at 0x7f7aa4491150>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3132, in predict
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 154, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 123, in catch_errors
    yield
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 541, in _run_infeed
    session.run(self._enqueue_ops)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/usr/local/li

Saving config to gs://test-gpt-j/GPT3_XL




Done!
INFO:tensorflow:Using config: {'_model_dir': 'gs://test-gpt-j/GPT3_XL', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.29.68.2:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.29.68.2:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.29.68.2:8470', '_evaluation_master': 'grpc://10.29.68.2:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_c

Exception ignored in: <generator object TPUEstimator.predict at 0x7f7a9f6604d0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3132, in predict
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 154, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 123, in catch_errors
    yield
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 541, in _run_infeed
    session.run(self._enqueue_ops)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/usr/local/li

All done in 495.6440622806549s


## Warning
The results will be deleted from the drive upon running the model on another dataset.
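
One way to keep earlier results is to copy the output folder to a timestamped location before starting on a new dataset. The following is only a sketch under stated assumptions: `backup_results` is a hypothetical helper, the demo runs on temporary directories, and in the notebook the source would be the `gen` folder on the mounted drive.

```python
import os
import shutil
import tempfile
import time


def backup_results(gen_dir, backup_root):
    """Copy gen_dir into backup_root under a timestamped folder name."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(backup_root, f"gen-{stamp}")
    shutil.copytree(gen_dir, target)  # fails if target already exists
    return target


# Demo on temporary directories (stand-ins for the Drive folders)
with tempfile.TemporaryDirectory() as root:
    gen_dir = os.path.join(root, "gen")
    os.mkdir(gen_dir)
    with open(os.path.join(gen_dir, "item1.txt"), "w") as f:
        f.write("generated description")

    target = backup_results(gen_dir, os.path.join(root, "backups"))
    copied_ok = os.path.isfile(os.path.join(target, "item1.txt"))

print(copied_ok)  # True
```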