<a href="https://colab.research.google.com/github/mel-zheng/mel-zheng/blob/main/GPT_3_finetune_on_Amazon_press_release.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import Dependencies

In [None]:
!pip install --upgrade openai wandb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.20.0.tar.gz (42 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting wandb
  Downloading wandb-0.12.21-py2.py3-none-any.whl (1.8 MB)
Collecting pandas-stubs>=1.1.0.11
  Downloading pandas_stubs-1.2.0.62-py3-none-any.whl (163 kB)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.7.2-py2.py3-none-any.whl (147 kB)

In [None]:
import openai
import wandb
from pathlib import Path
import pandas as pd
import numpy as np
import json
from tqdm import tqdm
import re

## Load Training Dataset

In [None]:
!mkdir amazon-press

In [None]:
!cp /content/drive/MyDrive/amazon-articles-cleaned-v0.csv ./amazon-press/training-dataset.csv

In [None]:
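# Back up the validation set to Drive. Run this only after validation.csv
# has been written in the "Log validation samples" section.
!cp validation.csv /content/drive/MyDrive/amazon-articles-validation-v0.csv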

In [None]:
df_orig = pd.read_csv('amazon-press/training-dataset.csv')
df_orig.head(5)

Unnamed: 0,title,url,dataetime,num_paragraphs,text,word_count
0,Amazon Announces First Mississippi Fulfillment...,https://press.aboutamazon.com/news-releases/ne...,"December 21, 2018 at 9:00 AM EST",13,New facility will create 850 full-time jobs in...,437
1,Amazon’s Air Network Expands to Support the Gr...,https://press.aboutamazon.com/news-releases/ne...,"December 21, 2018 at 12:00 AM EST",13,Amazon expands the long term partnership with ...,631
2,Amazon Announces Third Annual Digital Day with...,https://press.aboutamazon.com/news-releases/ne...,"December 18, 2018 at 9:00 AM EST",11,"Save up to 80% off Marvel graphic novels, 75% ...",777
3,Code.org and Amazon Kick Off Hour of Code: Dan...,https://press.aboutamazon.com/news-releases/ne...,"December 3, 2018 at 11:30 AM EST",20,Amazon and Code.org partner for this year’s Ho...,831
4,AWS Announces the DeepRacer League (DRL),https://press.aboutamazon.com/news-releases/ne...,"November 29, 2018 at 1:34 PM EST",17,AWS launches the first global autonomous racin...,1202


In [None]:
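# Keep only the headline and body, renamed to the prompt/completion columns
# that the OpenAI fine-tuning tools expect. rename() returns a new frame,
# which avoids assigning column names on a slice of df_orig.
df_clean = df_orig[['title', 'text']].rename(columns={'title': 'prompt', 'text': 'completion'})
df_clean.head(5)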

Unnamed: 0,prompt,completion
0,Amazon Announces First Mississippi Fulfillment...,New facility will create 850 full-time jobs in...
1,Amazon’s Air Network Expands to Support the Gr...,Amazon expands the long term partnership with ...
2,Amazon Announces Third Annual Digital Day with...,"Save up to 80% off Marvel graphic novels, 75% ..."
3,Code.org and Amazon Kick Off Hour of Code: Dan...,Amazon and Code.org partner for this year’s Ho...
4,AWS Announces the DeepRacer League (DRL),AWS launches the first global autonomous racin...


In [None]:
df_clean.shape

(645, 2)

In [None]:
training_data_filename='amazon_press_data_cleaned.csv'
training_data_dirname = 'amazon-press'

In [None]:
df_clean.to_csv(f'{training_data_dirname}/{training_data_filename}', index=False)

## OpenAI API Key

In [None]:
# Enter credentials. %env stores the value literally, so do not wrap the key in quotes.
%env OPENAI_API_KEY=your-api-key

## Initialize W&B (Weights & Biases)

In [None]:
project_name='GPT 3 for Generating Texts in Amazon tone'

run = wandb.init(project=project_name, job_type="dataset_preparation")

<IPython.core.display.Javascript object>

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 

··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
run = wandb.init(project=project_name, entity="melzheng")

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

## Create W&B Artifact

In [None]:
artifact = wandb.Artifact(training_data_dirname, type='dataset')
artifact.add_dir(training_data_dirname)
run.log_artifact(artifact)

wandb: Adding directory to artifact (./amazon-press)... Done. 0.1s


<wandb.sdk.wandb_artifacts.Artifact at 0x7fe0575b4750>

In [None]:
run = wandb.init(project=project_name)

artifact = run.use_artifact(f'{training_data_dirname}:v0')
artifact_dir = artifact.download()+f"/{training_data_filename}"

VBox(children=(Label(value='7.757 MB of 7.757 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

In [None]:
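# Shuffle the dataset with a fixed seed so the shuffle is reproducible

df = pd.read_csv(artifact_dir)
ds = df.sample(frac=1, random_state=0)


wandb.init(project=project_name, job_type="logging_dataset_as_table")
wandb.run.log({"Raw dataset" : wandb.Table(dataframe=ds)})

# Note: to_csv also writes the index here, which becomes the extra
# 'Unnamed: 0' column that prepare_data reports; pass index=False to avoid it.
ds.to_csv(training_data_filename)
ds.head()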

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

Unnamed: 0,prompt,completion
352,AWS Announces General Availability of AWS Netw...,New high-availability firewall service gives c...
530,Amazon Announces Investment in Nature-Based Ca...,The initiative will launch in the Amazon rainf...
315,Amazon Web Services Announces AWS Backup,Centralized backup service makes it easier and...
249,Dean Koontz Signs New Five-Book Deal with Amaz...,The international best-selling thriller icon w...
266,"Amazon Introduces Echo Show 5—Compact Design, ...","Alexa adds new features, including how-to vide..."


## Preprocess Data with the OpenAI CLI

In [None]:
!openai tools fine_tunes.prepare_data -f amazon_press_data_cleaned.csv

Analyzing...

- Based on your file extension, your file is formatted as a CSV file
- Your file contains 645 prompt-completion pairs
- The input file should contain exactly two columns/keys per row. Additional columns/keys present are: ['Unnamed: 0']
- There are 4 examples that are very long. These are rows: [196, 222, 382, 549]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the
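The separator and ending that the tool asks for can also be added by hand. The sketch below is one way to do it, assuming the `" ->"` separator and `" END"` ending that appear in the prepared data later in this notebook; `format_pair` is a hypothetical helper, not part of the OpenAI CLI.

```python
import json

SEPARATOR = " ->"  # marks the end of every prompt (assumed from the prepared data)
STOP = " END"      # marks the end of every completion (assumed from the prepared data)

def format_pair(prompt, completion):
    """Hypothetical helper: build one prompt-completion record for a JSONL file."""
    return {
        "prompt": prompt.strip() + SEPARATOR,
        # The prepared completions in this notebook begin with a single space.
        "completion": " " + completion.strip() + STOP,
    }

record = format_pair(
    "AWS Announces the DeepRacer League (DRL)",
    "AWS launches the first global autonomous racing league",
)
print(json.dumps(record))
```

Writing one such JSON object per line produces a file in the same shape as `amazon_press_data_cleaned_prepared.jsonl`.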

In [None]:
# Split the prepared dataset (641 pairs, per the output below) into training and validation sets
with open('amazon_press_data_cleaned_prepared.jsonl', 'r') as json_file:
    json_list = list(json_file)

num_data = len(json_list)
print("Total:", num_data)

val_part = 0.25 

val_amount = int(num_data * val_part)
print("Val data:", val_amount)
train_amount = num_data - val_amount 
print("Train data:", train_amount)

!head -n $train_amount amazon_press_data_cleaned_prepared.jsonl > sh_train.jsonl
!tail -n $val_amount amazon_press_data_cleaned_prepared.jsonl > sh_valid.jsonl

Total: 641
Val data: 160
Train data: 481
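The same split can be done without shelling out to `head` and `tail`. This is a minimal sketch of the arithmetic used above; `split_counts` and `split_lines` are hypothetical names introduced for illustration.

```python
def split_counts(num_rows, val_fraction):
    """Return (train_count, val_count), rounding the validation size down."""
    val_count = int(num_rows * val_fraction)
    return num_rows - val_count, val_count

def split_lines(lines, val_fraction):
    """First rows go to training, the remaining rows to validation."""
    train_count, _ = split_counts(len(lines), val_fraction)
    return lines[:train_count], lines[train_count:]

train_count, val_count = split_counts(641, 0.25)
print("Train data:", train_count)  # 481
print("Val data:", val_count)      # 160
```

Because the rows were shuffled before `prepare_data` ran, taking the tail as the validation set is a random split rather than a chronological one.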


In [None]:
wandb.finish()

VBox(children=(Label(value='7.786 MB of 7.786 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

## OpenAI Fine-tuning

In [None]:
# Define fine-tuning parameters

model = 'curie'  # can be ada, babbage or curie
n_epochs = 4
batch_size = 8
learning_rate_multiplier = 0.1
prompt_loss_weight = 0.1

In [None]:
# Train

!openai api fine_tunes.create \
    -t sh_train.jsonl \
    -v sh_valid.jsonl \
    -m $model \
    --n_epochs $n_epochs \
    --batch_size $batch_size \
    --learning_rate_multiplier $learning_rate_multiplier \
    --prompt_loss_weight $prompt_loss_weight \
    --suffix "amazon-press-v0"

Upload progress: 100% 3.02M/3.02M [00:00<00:00, 3.03Git/s]
Uploaded file from sh_train.jsonl: file-GmVQj74hjas4JfLN6I4l6F9z
Upload progress: 100% 996k/996k [00:00<00:00, 1.06Git/s]
Uploaded file from sh_valid.jsonl: file-o7t3X56pd32TqYE6ajQgwHNk
Created fine-tune: ft-Fi9NzvXozIQTV8IkZnrdTLSQ
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2022-07-19 18:55:12] Created fine-tune: ft-Fi9NzvXozIQTV8IkZnrdTLSQ
[2022-07-19 18:55:25] Fine-tune costs $7.58
[2022-07-19 18:55:25] Fine-tune enqueued. Queue number: 1
[2022-07-19 18:55:26] Fine-tune started
[2022-07-19 18:58:36] Completed epoch 1/4
[2022-07-19 19:00:51] Completed epoch 2/4
[2022-07-19 19:03:07] Completed epoch 3/4
[2022-07-19 19:05:19] Completed epoch 4/4
[2022-07-19 19:05:42] Uploaded model: curie:ft-personal:amazon-press-v0-2022-07-19-19-05-41
[2022-07-19 19:05:42] Uploaded result file: file-HGq4rK33DMlkyHt2Ghc3

In [None]:
# sync fine-tune jobs to W&B
!openai wandb sync --project "GPT 3 for Generating Texts in Amazon tone"


wandb: Currently logged in as: melzheng. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.21
wandb: Run data is saved locally in /content/wandb/run-20220719_191652-ft-FIwEP3IK7apKnBAFzYT1TrC0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft-FIwEP3IK7apKnBAFzYT1TrC0
wandb: ⭐️ View project at https://wandb.ai/melzheng/GPT%203%20for%20Generating%20Texts%20in%20Amazon%20tone
wandb: 🚀 View run at https://wandb.ai/melzheng/GPT%203%20for%20Generating%20Texts%20in%20Amazon%20tone/runs/ft-FIwEP3IK7apKnBAFzYT1TrC0
File file-0WmhoqG2hyEsx7A0iNzvaFCw could not be retrieved. Make sure you are allowed to download training/validation files
File file-05Fz8bnpJhkRFtnkxpSLrV4G could not be retrieved. Make sure you are allowed to download training/validation files

## Log validation samples

In [None]:
# create eval job
run = wandb.init(project=project_name, job_type='eval')
entity = wandb.run.entity

wandb: Currently logged in as: melzheng. Use `wandb login --relogin` to force relogin


In [None]:
# choose a fine-tuned model
artifact_job = run.use_artifact(f'{entity}/{project_name}/fine_tune_details:latest', type='fine_tune_details')
artifact_job.metadata

{'created_at': 1658256912,
 'fine_tuned_model': 'curie:ft-personal:amazon-press-v0-2022-07-19-19-05-41',
 'hyperparams': {'batch_size': 8,
  'learning_rate_multiplier': 0.1,
  'n_epochs': 4,
  'prompt_loss_weight': 0.1},
 'id': 'ft-Fi9NzvXozIQTV8IkZnrdTLSQ',
 'model': 'curie',
 'object': 'fine-tune',
 'organization_id': 'org-2ud1sYF9hh0kQ0RoB3CMcf09',
 'result_files': [{'bytes': 16353,
   'created_at': 1658257542,
   'filename': 'compiled_results.csv',
   'id': 'file-HGq4rK33DMlkyHt2Ghc3kVJC',
   'object': 'file',
   'purpose': 'fine-tune-results',
   'status': 'processed',
   'status_details': None}],
 'status': 'succeeded',
 'training_files': [{'bytes': 3021979,
   'created_at': 1658256910,
   'filename': 'sh_train.jsonl',
   'id': 'file-GmVQj74hjas4JfLN6I4l6F9z',
   'object': 'file',
   'purpose': 'fine-tune',
   'status': 'processed',
   'status_details': None}],
 'updated_at': 1658257542,
 'validation_files': [{'bytes': 996303,
   'created_at': 1658256911,
   'filename': 'sh_valid

In [None]:
wandb.config.update({k:artifact_job.metadata[k] for k in ['fine_tuned_model', 'model', 'hyperparams']})

In [None]:
fine_tuned_model = artifact_job.metadata['fine_tuned_model']
fine_tuned_model

'curie:ft-personal:amazon-press-v0-2022-07-19-19-05-41'

In [None]:
df = pd.read_json("sh_valid.jsonl", orient='records', lines=True)
df.head(5)

Unnamed: 0,prompt,completion
0,Amazon Expands Grocery Delivery From Whole Foo...,Prime members can enjoy delivery in as little...
1,AWS Introduces New Amazon EC2 Instances Featur...,New options for general purpose and memory-op...
2,"Amazon Career Day 2020 – More Than 300,000 Peo...","Amazon recruiters conducted 20,000 free 1-on-..."
3,Amazon Introduces the eero mesh WiFi system—Si...,The new eero mesh WiFi system is the latest a...
4,Amazon Kicks Off Spring with the Launch of Bel...,The new collection is launching with 12 items...


In [None]:
df.prompt[4]

'Amazon Kicks Off Spring with the Launch of Belei, its First Dedicated Skincare Line ->'

In [None]:
df.completion[4]

' The new collection is launching with 12 items including moisturizers, serums, eye cream, spot treatments and more\nProducts feature sought-after ingredients such as retinol, hyaluronic acid and vitamin C to help fight acne, dark spots, fine lines, dehydration and dullness\nSEATTLE--(BUSINESS WIRE)--Mar. 20, 2019-- Amazon (NASDAQ:AMZN) today announced the launch of Belei, a line of high-quality skincare products that offer solutions for various skin types and feature ingredients with proven effectiveness. The collection has 12 different items, including everything from retinol moisturizer to vitamin C serums, to help customers address common skincare concerns like acne, the appearance of fine lines and wrinkles, dark spots, dehydration, dullness and more. All Belei products are free of parabens, phthalates, sulfates and fragrance and are not tested on animals. Belei product bottles are made of post-consumer recycled resin and carton packaging is 100% recyclable.\nThis press release fe

In [None]:
df.to_csv('validation.csv', index=False)

In [None]:
# Run inference on the first 10 validation examples.

n_samples = 10
df = df.iloc[:n_samples]

data = []

for _, row in tqdm(df.iterrows()):
    prompt = row['prompt']
    res = openai.Completion.create(model=fine_tuned_model, prompt=prompt, max_tokens=300, stop=[" END"])
    completion = res['choices'][0]['text']
    completion = completion[1:]       # remove initial space
    prompt = prompt[:-3]              # remove " ->"
    target = row['completion'][1:-4]  # remove initial space and trailing " END"
    data.append([prompt, target, completion])

prediction_table = wandb.Table(columns=['prompt', 'target', 'completion'], data=data)
wandb.log({'predictions': prediction_table})

10it [00:27,  2.75s/it]
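The fixed-width slices in the loop above (`[1:]`, `[:-3]`, `[1:-4]`) silently drop real characters when the separator or ending is missing. A slightly more defensive variant might look like the sketch below; `strip_separator` and `strip_completion` are hypothetical helpers, and the `" ->"` and `" END"` markers are the ones used in this notebook's prepared data.

```python
def strip_separator(prompt, separator=" ->"):
    """Remove the prompt separator only if it is actually present."""
    return prompt[:-len(separator)] if prompt.endswith(separator) else prompt

def strip_completion(completion, stop=" END"):
    """Remove the leading space and the trailing stop marker when present."""
    text = completion[1:] if completion.startswith(" ") else completion
    return text[:-len(stop)] if text.endswith(stop) else text

print(strip_separator("AWS Announces the DeepRacer League (DRL) ->"))
print(strip_completion(" AWS launches the first global autonomous racing league END"))
```

This matters for model output in particular: when generation halts on the `stop` sequence the returned text does not include it, and when it halts on `max_tokens` there is no ending at all.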


In [None]:
wandb.finish() 

VBox(children=(Label(value='0.136 MB of 0.136 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

## Example Text Generation using fine-tuned model

In [None]:
!openai api completions.create -m curie:ft-personal:amazon-press-v0-2022-07-19-19-05-41 -p 'Amazon Kicks Off Spring with the Launch of Belei, its First Dedicated Skincare Line ->'

Amazon Kicks Off Spring with the Launch of Belei, its First Dedicated Skincare Line -> The colorful belei collection debuts with Make Me Matt, a moisturizing

## OpenAI Completion API Parameters
Reference: https://beta.openai.com/docs/api-reference/completions/create

__model*__


ID of the model to use. You can use the List models API to see all of your available models, or see our Model overview for descriptions of them.

__prompt__

Defaults to <|endoftext|>
The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays.

Note that <|endoftext|> is the document separator that the model sees during training, so if a prompt is not specified the model will generate as if from the beginning of a new document.

__suffix__

Defaults to null
The suffix that comes after a completion of inserted text.

__max_tokens__

Defaults to 16
The maximum number of tokens to generate in the completion.

The token count of your prompt plus max_tokens cannot exceed the model's context length. Most models have a context length of 2048 tokens (except for the newest models, which support 4096).
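As a worked example of that limit, the sketch below computes the largest `max_tokens` a request can ask for. The function name and the 20-token prompt length are assumptions for illustration; in practice the prompt length comes from a tokenizer.

```python
def max_completion_tokens(prompt_tokens, context_length=2048):
    """Largest max_tokens value that still fits in the context window."""
    remaining = context_length - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills or exceeds the context window")
    return remaining

# Hypothetical prompt of 20 tokens against a 2048-token context.
print(max_completion_tokens(20))  # 2028
```

The `max_tokens=768` used later in this notebook sits well inside that budget for a short headline prompt.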

__temperature__

Defaults to 1
What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications, and 0 (argmax sampling) for ones with a well-defined answer.

We generally recommend altering this or top_p but not both.
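To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of logits. It illustrates the general technique, not OpenAI's server-side implementation, and the logit values are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Temperature 0 is treated as argmax: all mass on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
for t in (1.0, 0.3, 0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

At `temperature=0.3`, the value used in the generation cell at the end of this notebook, most of the probability mass moves to the top-scoring token.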

__top_p__

Defaults to 1
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

We generally recommend altering this or temperature but not both.
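Nucleus sampling can be sketched as a filter over a token distribution: keep the most likely tokens until their cumulative probability reaches `top_p`, then renormalize. The token probabilities below are invented for illustration, and the function is a sketch of the technique rather than the API's actual code.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1.
    return {token: p / total for token, p in kept.items()}

toy_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}  # hypothetical
print(top_p_filter(toy_probs, 0.1))
print(top_p_filter(toy_probs, 0.7))
```

With `top_p=0.1` only the single most likely token survives, which is why very small values behave much like greedy decoding.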

__n__

Defaults to 1
How many completions to generate for each prompt.

Note: Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for max_tokens and stop.

__stream__

Defaults to false
Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.

__logprobs__

Defaults to null
Include the log probabilities on the logprobs most likely tokens, as well as the chosen tokens. For example, if logprobs is 5, the API will return a list of the 5 most likely tokens. The API will always return the logprob of the sampled token, so there may be up to logprobs+1 elements in the response.

The maximum value for logprobs is 5. If you need more than this, please contact support@openai.com and describe your use case.

__echo__

Defaults to false
Echo back the prompt in addition to the completion.

__stop__
*string or array*

Defaults to null
Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
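The effect of `stop` on the returned text can be imitated client-side. The sketch below cuts a string at the earliest stop sequence and drops the sequence itself; the API applies this during generation, so the function here is only an illustration with a hypothetical name.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence, excluding it."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw = " Amazon today announced the launch of Belei END Amazon today announced"
print(truncate_at_stop(raw, [" END"]))
```

This notebook passes `stop=[" END"]` because `" END"` is the ending appended to every training completion, so the fine-tuned model has learned to emit it when a press release is finished.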

__presence_penalty__

Defaults to 0
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

See more information about frequency and presence penalties.

__frequency_penalty__

Defaults to 0
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

See more information about frequency and presence penalties.
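The two penalties are usually described as a per-token adjustment to the logits: subtract `frequency_penalty` once for every earlier occurrence of the token, and subtract `presence_penalty` once if the token has appeared at all. The sketch below implements that description on toy values; treat the exact formula as an assumption rather than a statement of the server's implementation.

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        seen = 1.0 if count > 0 else 0.0
        adjusted[token] = logit - count * frequency_penalty - seen * presence_penalty
    return adjusted

toy_logits = {"Amazon": 2.0, "today": 1.0}  # hypothetical next-token scores
history = ["Amazon", "Amazon"]              # tokens generated so far
print(apply_penalties(toy_logits, history, presence_penalty=0.5, frequency_penalty=0.25))
```

Here "Amazon" drops from 2.0 to 1.0 (two occurrences at 0.25 each, plus 0.5 for being present), while "today" is untouched.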

__best_of__

Defaults to 1
Generates best_of completions server-side and returns the "best" (the one with the highest log probability per token). Results cannot be streamed.

When used with n, best_of controls the number of candidate completions and n specifies how many to return – best_of must be greater than n.

Note: Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for max_tokens and stop.
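The selection rule for `best_of` can be sketched as ranking candidates by mean log probability per token and returning the top `n`. The candidate texts and log probabilities below are invented, and `pick_best` is a hypothetical name for the illustration.

```python
def pick_best(candidates, n=1):
    """Return the n candidates with the highest mean log probability per token."""
    def mean_logprob(candidate):
        logprobs = candidate["token_logprobs"]
        return sum(logprobs) / len(logprobs)
    return sorted(candidates, key=mean_logprob, reverse=True)[:n]

toy_candidates = [  # hypothetical completions with per-token log probabilities
    {"text": "first draft", "token_logprobs": [-0.1, -0.2]},
    {"text": "second draft", "token_logprobs": [-0.5, -0.1, -0.1]},
    {"text": "third draft", "token_logprobs": [-1.0, -1.0]},
]
print([c["text"] for c in pick_best(toy_candidates, n=2)])
```

Averaging per token, rather than summing, keeps longer completions from being penalized simply for having more tokens.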

__logit_bias__

Defaults to null
Modify the likelihood of specified tokens appearing in the completion.

Accepts a json object that maps tokens (specified by their token ID in the GPT tokenizer) to an associated bias value from -100 to 100. You can use this tokenizer tool (which works for both GPT-2 and GPT-3) to convert text to token IDs. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.

As an example, you can pass {"50256": -100} to prevent the <|endoftext|> token from being generated.
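Since the bias is simply added to the logits before sampling, its effect is easy to sketch on a toy logit table. The logit values below are made up; only token ID 50256 for `<|endoftext|>` comes from the text above.

```python
def apply_logit_bias(logits, logit_bias):
    """Add each bias to the matching token's logit; keys arrive as strings in JSON."""
    biased = dict(logits)
    for token_id, bias in logit_bias.items():
        key = int(token_id)
        if key in biased:
            biased[key] += bias
    return biased

toy_logits = {50256: 3.0, 464: 1.0}  # hypothetical logits keyed by token ID
print(apply_logit_bias(toy_logits, {"50256": -100}))
```

After a bias of -100 the `<|endoftext|>` logit is so far below the others that its sampling probability is effectively zero, which is what makes it behave like a ban.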

# Generating Texts

In [42]:
model_name = 'curie:ft-personal:amazon-press-v0-2022-07-19-19-05-41'

In [47]:
response = openai.Completion.create(model=model_name, prompt="Amazon Kicks Off Spring with the Launch of Belei, its First Dedicated Skincare Line ->", 
                                    temperature=0.3, stop=' END', n=1,
                                    max_tokens=768)

In [48]:
response

<OpenAIObject text_completion id=cmpl-5VnPZ9Jj3TQ2buNM3aFeHQlronmYd at 0x7fe05699dd10> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " The new line of skincare products is available for Prime members to shop now\nSEATTLE--(BUSINESS WIRE)--Mar. 11, 2019-- (NASDAQ:AMZN)\u2014Amazon today announced the launch of Belei, its first dedicated skincare line. The line includes a range of products, including cleansers, toners, moisturizers, and serums, all of which are infused with Amazonian plant extracts and botanicals, including Amazonian clay, to help support healthy skin. Customers can shop the line now at www.amazon.com/belei.\n\u201cWe\u2019re excited to launch Belei, our first-ever skincare line,\u201d said Dr. David L. Asch, Amazon\u2019s Vice President of Global Health and Wellness. \u201cWe\u2019ve worked hard to create a line of products that are as innovative as they are effective, and we\u2019re excited to share 

In [52]:
print(response['choices'][0]['text'])

 The new line of skincare products is available for Prime members to shop now
SEATTLE--(BUSINESS WIRE)--Mar. 11, 2019-- (NASDAQ:AMZN)—Amazon today announced the launch of Belei, its first dedicated skincare line. The line includes a range of products, including cleansers, toners, moisturizers, and serums, all of which are infused with Amazonian plant extracts and botanicals, including Amazonian clay, to help support healthy skin. Customers can shop the line now at www.amazon.com/belei.
“We’re excited to launch Belei, our first-ever skincare line,” said Dr. David L. Asch, Amazon’s Vice President of Global Health and Wellness. “We’ve worked hard to create a line of products that are as innovative as they are effective, and we’re excited to share them with our customers.”
“Amazon’s commitment to health and wellness is a great example of how we can leverage our scale to help people lead healthier lives,” said Dr. David L. Asch, Vice President of Global Health and Wellness at Amazon. “We’re