<a href="https://colab.research.google.com/github/mel-zheng/mel-zheng/blob/main/Data_Processing_URLs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Data Processing (Text from urls)

Primary usecase: SEO content generation

Goal: Data cleaning, processing, prepare training dataset for prompt tuning LLMs, store final training dataset in Google Cloud Storage.


In [2]:
!pip install google-cloud-aiplatform google-cloud-bigquery google-cloud-vision google-cloud-storage mlflow==2.5 jsonlines python-dotenv --upgrade --user



# Set up

In [3]:
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
!cp /content/drive/MyDrive/brand-voice/.env .env

In [2]:
import pandas as pd
from dotenv import load_dotenv
import os
load_dotenv()  # take environment variables from .env.

True

# Set up ML flow

In [4]:
# # Importing our required libraries
import mlflow


# # Setting the MLflow client
# client = mlflow.tracking.MlflowClient(tracking_uri = 'Brand Voice Test 1')

In [5]:
mlflow.set_tracking_uri('mlflow-dp.non-prod.kinesso.ninja')

In [None]:
mlflow.create_experiment('Brand Voice Training Input Extraction Prompt')

'758760816997049357'

In [6]:
mlflow.set_experiment('Brand Voice Training Input Extraction Prompt')

2023/09/13 16:47:30 INFO mlflow.tracking.fluent: Experiment with name 'Brand Voice Training Input Extraction Prompt' does not exist. Creating a new experiment.


<Experiment: artifact_location='/content/mlflow-dp.non-prod.kinesso.ninja/103011558505213548', creation_time=1694623650363, experiment_id='103011558505213548', last_update_time=1694623650363, lifecycle_stage='active', name='Brand Voice Training Input Extraction Prompt', tags={}>

In [7]:
from mlflow.tracking import MlflowClient
from mlflow.data import from_pandas
client = MlflowClient()

In [8]:
experiments = (
    client.search_experiments()
)

In [9]:
experiments

[<Experiment: artifact_location='/content/mlflow-dp.non-prod.kinesso.ninja/103011558505213548', creation_time=1694623650363, experiment_id='103011558505213548', last_update_time=1694623650363, lifecycle_stage='active', name='Brand Voice Training Input Extraction Prompt', tags={}>,
 <Experiment: artifact_location='/content/mlflow-dp.non-prod.kinesso.ninja/0', creation_time=1694623650251, experiment_id='0', last_update_time=1694623650251, lifecycle_stage='active', name='Default', tags={}>]

# Data Pipelines

only run once to upload data to BQ

In [10]:
!mkdir data |cp -r /content/drive/MyDrive/brand-voice/data/web-scraped/nationwide/raw ./data/raw

### Details on data:

ziped json files - each json file has a web scrapped long form landing page content

steps:

* clean up json files
* create BQ dataset and table
* load data to BQ.









## Clean up json files

In [None]:
import zipfile
import json
import jsonlines
import pandas as pd

def unzip_files(zip_file_path, destination_dir='./tmp/'):
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        # extract all the contents
        zip_ref.extractall(destination_dir)


def process_json_to_jsonl(in_dir, out_dir, out_filename):
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    # list all json files in file_dir
    json_files = [pos_json for pos_json in os.listdir(in_dir) if pos_json.endswith('.json')]
    out_path = out_dir + out_filename
    failed_json_files = []
    with open(out_path, 'w+') as jsonl_output:
        # loop through each json file
        for f in json_files:
            with open(os.path.join(in_dir, f)) as json_file:
                json_data = json.load(json_file)
            try:
                # extract relevant information and clean up new lines
                article_json = json_data[1]
                article_json['url'] = json_data[0]['url']
                if article_json.get('text'):
                    article_json['text'] = article_json['text'].replace('\n','*').replace('***','\n')
                json.dump(article_json, jsonl_output)
                jsonl_output.write('\n')
            except:
                # save failed to processed filenames
                failed_json_files.append(f)
    if len(failed_json_files) > 0:
        with open(r'./failed_files.txt', 'w') as fp:
            for item in failed_json_files:
                fp.write("%s\n" % item)
        print(f'Failed to process {len(failed_json_files)} json files. Saved to failed_files.txt')
    print(f'Successfully processed {len(json_files)-len(failed_json_files)} json files. Saved in {out_path}')


def run_data_cleaning(zip_file_path, in_dir, out_dir, out_filename):
    unzip_files(zip_file_path, in_dir)
    process_json_to_jsonl(in_dir, out_dir, out_filename)

zip_file_path = "./data/raw/nationwide_scraped.zip"
in_dir = './tmp/'
out_dir = './tmp/processed/'
out_filename = 'nationwide.jsonl'
run_data_cleaning(zip_file_path, in_dir, out_dir, out_filename)

Failed to process 20 json files. Saved to failed_files.txt
Successfully processed 970 json files. Saved in ./tmp/processed/nationwide.jsonl


### Inspect one example

In [None]:
import json
with open('./tmp/output_www_nationwide_com_lc_resources_farm-and-agribusiness_articles_food-insecurity-and-farming.json', 'r') as f:
    json_data = json.load(f)

In [None]:
json_data

[{'url': 'https://www.nationwide.com/lc/resources/farm-and-agribusiness/articles/food-insecurity-and-farming'},
 {'h1': 'Food insecurity and farming go hand in hand',
  'text': "The level of hunger in U.S. households has almost tripled between 2019 and August 2020, according to \nan analysis\n of new data from the Census Bureau and the Department of Agriculture. Feeding America, the nation's largest hunger-relief organization, says more than 54 million people in the country are at risk of food insecurity and 87% of U.S. counties with the highest rates of food insecurity are considered rural communities.\n\n\nMeet Tessa Jarvis, an Iowa State University grad assistant in the Meat Science department, a division of ISU’s College of Agriculture and Life Sciences. As part of the agriculture industry facing very different needs during the COVID-19 pandemic, she pivoted from research to hands-on help with harvesting pork and has been a part of the Pass the Pork program that has fed more than 3

In [None]:
json_data[1]

{'h1': 'Food insecurity and farming go hand in hand',
 'text': "The level of hunger in U.S. households has almost tripled between 2019 and August 2020, according to \nan analysis\n of new data from the Census Bureau and the Department of Agriculture. Feeding America, the nation's largest hunger-relief organization, says more than 54 million people in the country are at risk of food insecurity and 87% of U.S. counties with the highest rates of food insecurity are considered rural communities.\n\n\nMeet Tessa Jarvis, an Iowa State University grad assistant in the Meat Science department, a division of ISU’s College of Agriculture and Life Sciences. As part of the agriculture industry facing very different needs during the COVID-19 pandemic, she pivoted from research to hands-on help with harvesting pork and has been a part of the Pass the Pork program that has fed more than 30,000 Iowans.\n\n\nMarji Alaniz, \nFarmHer\n founder, talks with Tessa about food insecurity and the Pass the Pork

In [None]:
json_data[0]

{'url': 'https://www.nationwide.com/lc/resources/farm-and-agribusiness/articles/food-insecurity-and-farming'}

## create BQ dataset & table

In [None]:
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

def create_dataset(dataset_name):
    client = bigquery.Client(project=os.getenv('GCP_PROJECT_ID'))
    dataset_id = f"{client.project}.{dataset_name}"
    try:
        client.get_dataset(dataset_id)
        print("Dataset {} already exists".format(dataset_id))
    except NotFound:
        dataset = bigquery.Dataset(dataset_id)
        dataset = client.create_dataset(dataset, timeout=30)
        print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
    return dataset_id

def create_table(dataset_id, table_name, schema):
    client = bigquery.Client(project=os.getenv('GCP_PROJECT_ID'))
    table_id = f"{dataset_id}.{table_name}"
    try:
        client.get_table(table_id)  # Make an API request.
        print("Table {} already exists.".format(table_id))
    except NotFound:
        table = bigquery.Table(table_id, schema=schema)
        table = client.create_table(table)
        print("Created Table {}".format(table_id))
    return table_id

def delete_dataset(dataset_id):
    client = bigquery.Client(project=os.getenv('GCP_PROJECT_ID'))
    client.delete_dataset(
        dataset_id, delete_contents=True, not_found_ok=True
    )  # Make an API request.
    print("Deleted dataset '{}'.".format(dataset_id))



In [None]:
from google.cloud import bigquery

dataset_name = 'brand_voice_test_data'
dataset_id = create_dataset(dataset_name)

Dataset ido-81-rnd-gcp-sandbox-2023q3.brand_voice_test_data already exists


In [None]:
table_name = 'nationwide'
schema = [
    bigquery.SchemaField("url", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("h1", "STRING", mode="Nullable"),
    bigquery.SchemaField("h2", "STRING", mode="Nullable"),
    bigquery.SchemaField("text", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("image_caption", "INTEGER", mode="Nullable"),
    bigquery.SchemaField("call_to_action", "INTEGER", mode="Nullable"),
]

table_id = create_table(dataset_id, table_name, schema)

Created Table ido-81-rnd-gcp-sandbox-2023q3.brand_voice_test_data.nationwide


In [None]:
# delete_dataset(dataset_id)

Deleted dataset 'ido-81-rnd-gcp-sandbox-2023q3.brand_voice_test_data'.


## Load data to BQ

In [None]:
def load_df_to_bq(df, table_id, schema):
    client = bigquery.Client(project=os.getenv('GCP_PROJECT_ID'))
    job_config = bigquery.LoadJobConfig(
        schema=schema,
        write_disposition="WRITE_TRUNCATE",
    )

    job = client.load_table_from_dataframe(
        jsonOjb, table_id, job_config=job_config
    )
    job.result()  # Wait for the job to complete.
    table = client.get_table(table_id)
    print(
        "Loaded {} rows and {} columns to {}".format(
            table.num_rows, len(table.schema), table_id
        )
    )

In [None]:
bq_load_schema=[
    # Specify the type of columns whose type cannot be auto-detected. For
    # example the "title" column uses pandas dtype "object", so its
    # data type is ambiguous.
    bigquery.SchemaField("url", bigquery.enums.SqlTypeNames.STRING),
    # Indexes are written if included in the schema by name.
    bigquery.SchemaField("h1", bigquery.enums.SqlTypeNames.STRING),
    bigquery.SchemaField("h2", bigquery.enums.SqlTypeNames.STRING),
    bigquery.SchemaField("text", bigquery.enums.SqlTypeNames.STRING),
    bigquery.SchemaField("image_caption", bigquery.enums.SqlTypeNames.STRING),
    bigquery.SchemaField("call_to_action", bigquery.enums.SqlTypeNames.STRING),
]
jsonOjb = pd.read_json(out_dir+out_filename, lines=True)
df = jsonOjb.astype('str')
df.set_index('url', inplace=True)
load_df_to_bq(df, table_id, bq_load_schema)

Loaded 970 rows and 6 columns to ido-81-rnd-gcp-sandbox-2023q3.brand_voice_test_data.nationwide


## Get data from BQ

In [None]:
# from google.cloud import bigquery
from google.cloud import bigquery
def construct_sql(table_id):
    return f"""
    SELECT *
    FROM `{table_id}`
    """

def run_bq_sql(sql):
    client = bigquery.Client(project=os.getenv('GCP_PROJECT_ID'))
    df = client.query(sql).to_dataframe()
    return df

table_id = 'ido-81-rnd-gcp-sandbox-2023q3.brand_voice_test_data.nationwide'
df = run_bq_sql(construct_sql(table_id))

In [None]:
print(df.shape)
df.head(4)

(970, 6)


Unnamed: 0,url,h1,h2,text,image_caption,call_to_action
0,https://www.nationwide.com/lc/resources/invest...,Read the latest findings from Nationwide's 202...,Make more informed filing decisions,Make more informed filing decisions\nYour deci...,two women looking at a map in the woods,
1,https://www.nationwide.com/lc/resources/person...,Leveraging your investments for cash,Did you know that liquidity is an important fa...,Did you know that liquidity is an important fa...,two people sitting on couches talking,
2,https://www.nationwide.com/lc/resources/invest...,What do annuities cost?,,While variable annuities generally have higher...,business woman standing outside,
3,https://www.nationwide.com/lc/resources/powers...,Consider a few things before purchasing a used...,When to buy used,"When buying a motorcycle, whether you’re a fir...",couple standing near motorcycle at the beach,


# Data Processing

In [None]:
df['wc'] = df['text'].apply(lambda x: len(x.split(' '))) # word count

In [None]:
df['wc'].describe()

count     970.000000
mean      587.094845
std       316.442319
min         8.000000
25%       395.000000
50%       578.500000
75%       759.000000
max      2145.000000
Name: wc, dtype: float64

In [None]:
df_long = df[df['wc']>=df['wc'].describe()['25%']].reset_index() # only tune on long content > 395 words
df_long.shape

(728, 8)

In [None]:
df_long['text'][0]

"When buying a motorcycle, whether you’re a first-time owner or you’ve already owned a two-wheeler, you’ll have many options to consider. These include the engine size, the make and model of the bike and whether you want a new motorcycle or a used one.\nIf you’re new to the motorcycle world, you might be better off buying a new motorcycle because you’ll have the support of a dealership – which can be helpful, particularly if you’re not experienced in the mechanics and servicing of a bike. A new motorcycle is also good if you want to make sure you have all the latest technology and features.\nHowever, for some riders, a used motorcycle makes more sense.\nWhen to buy used\nOne of the biggest advantages to buying a used motorcycle is the price. Whether you’re going through a private seller or a dealership, there’s usually a significant price difference between new and used motorcycles. If your budget is the biggest consideration, buying used could be a great way to get more bike for your 

In [None]:
from google.cloud import aiplatform
aiplatform.init(project=os.getenv('GCP_PROJECT_ID'), location=os.getenv('GCP_LOCATION'))

from vertexai.preview.language_models import TextGenerationModel

def generate(model_name, prompt, temperature, max_output_tokens, top_k, top_p):
    model = TextGenerationModel.from_pretrained(model_name)
    response = model.predict(
            prompt,
            temperature=temperature,
            max_output_tokens=max_output_tokens,
            top_k=top_k,
            top_p=top_p,
        )
    model_metadata = {'model_name':model_name, 'temperature':temperature, 'max_output_tokens':max_output_tokens, 'top_k':top_k, 'top_p':top_p}
    return {'prompt':prompt, 'response':response, 'model_metadata':model_metadata}
model_name = 'text-bison@001'

## Prompt experimentation to extract information of an article or long form content

In [None]:
def summarization_experiment1(df, i, temperature = 0.2, max_output_tokens = 516, top_k = 40, top_p = .8):
    title = df['h1'][i]
    text = df['text'][i]
    prompt = f"""
        Summarize the following SEO content.
        Output should be in json format and include objective, audience, audience_location, audience_stage_of_awareness, h1 tag, h2 tags, primary_keywords, secondary_keywords, language, and tones.
        The audience_stage_of_awareness describes how well audience knows the product and can take a single value from ['Most Aware','Product Aware','Solution Aware','Problem Aware','Unaware']
        The audience_location is the location of the intended audience. If unsure, use 'Anywhere'.
        The language should be in full name. For example, use English instead of en.

        **SEO content**: \n
        h1: {title}\n
        text:{text}\n

        """
    return generate(model_name, prompt, temperature, max_output_tokens, top_k, top_p)

In [None]:
output = summarization_experiment1(df_long, 5)
print(output['response'].text)

{
          "objective": "Lead Generation",
          "audience": "Consumers",
          "audience_location": "Anywhere",
          "audience_stage_of_awareness": "Problem Aware",
          "h1": "Cómo comprar un auto a un vendedor particular",
          "h2": [
            "Ventajas de las ventas de vehículos entre particulares",
            "Desventajas de la venta entre particulares",
            "Guía para comprar un auto a un vendedor particular",
            "Investiga",
            "Examina detenidamente los anuncios",
            "Contacta al vendedor",
            "Examina el vehículo",
            "Prueba de conducción",
            "Lleva el vehículo a tu mecánico",
            "Haz una oferta y cierra la venta",
            "Consejos adicionales"
          ],
          "primary_keywords": [
            "comprar un auto",
            "vendedor particular",
            "auto usado"
          ],
          "secondary_keywords": [
            "automóvil",
            "vehículo",

In [None]:
output = summarization_experiment1(df_long, 50)
print(output['response'].text)

{
          "objective": "Increase awareness of safe following distance",
          "audience": "Drivers",
          "audience_location": "Anywhere",
          "audience_stage_of_awareness": "Problem Aware",
          "h1": "What is a safe following distance",
          "h2": ["Practice the 3-second rule", "When to increase your following distance", "Use defensive driving techniques"],
          "primary_keywords": ["safe following distance", "defensive driving", "safe driving"],
          "secondary_keywords": ["driving", "traffic", "accidents"],
          "language": "English",
          "tones": ["informative", "helpful"]
        }


In [None]:
output = summarization_experiment1(df_long, 50, temperature=0.2) # use differnt temperature
print(output['response'].text)

{
          "objective": "Increase awareness of the importance of safe following distance",
          "audience": "Drivers",
          "audience_location": "Anywhere",
          "audience_stage_of_awareness": "Problem Aware",
          "h1": "What is a safe following distance",
          "h2": [
            "Practice the 3-second rule",
            "When to increase your following distance",
            "Use defensive driving techniques"
          ],
          "primary_keywords": [
            "safe following distance",
            "defensive driving",
            "safe driving"
          ],
          "secondary_keywords": [
            "following distance",
            "defensive driving tips",
            "safe driving tips"
          ],
          "language": "English",
          "tones": [
            "informative",
            "helpful"
          ]
        }


## Log prompts and LLM predictions on ML Flow

In [None]:
inputs = [output['model_metadata']]

outputs = [
    output['response'].text
]

prompts = [
    output['prompt']
]


with mlflow.start_run():
    # Log llm predictions
    mlflow.llm.log_predictions(inputs, outputs, prompts)

2023/09/13 14:14:44 INFO mlflow.tracking.llm_utils: Creating a new llm_predictions.csv for run 19c56b1809e349f6b4398f4f2a5b5971.


In [None]:
mlflow.search_runs(experiment_ids=['758760816997049357'])

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,tags.mlflow.runName,tags.mlflow.source.name,tags.mlflow.source.type,tags.mlflow.user
0,19c56b1809e349f6b4398f4f2a5b5971,758760816997049357,FINISHED,/content/mlflow-dp.non-prod.kinesso.ninja/7587...,2023-09-13 14:14:44.534000+00:00,2023-09-13 14:14:44.551000+00:00,agreeable-sow-527,/usr/local/lib/python3.10/dist-packages/ipyker...,LOCAL,root
1,141a4961b93c4662a435935f9c2c0740,758760816997049357,FINISHED,/content/mlflow-dp.non-prod.kinesso.ninja/7587...,2023-09-13 14:11:28.156000+00:00,2023-09-13 14:11:28.185000+00:00,trusting-grouse-164,/usr/local/lib/python3.10/dist-packages/ipyker...,LOCAL,root
2,9abae531e9c542e0a602379b612b8b0a,758760816997049357,FINISHED,/content/mlflow-dp.non-prod.kinesso.ninja/7587...,2023-09-13 14:01:41.789000+00:00,2023-09-13 14:01:41.806000+00:00,carefree-dove-323,/usr/local/lib/python3.10/dist-packages/ipyker...,LOCAL,root
3,63c989523ec64bdb840903e2c787d4fb,758760816997049357,FINISHED,/content/mlflow-dp.non-prod.kinesso.ninja/7587...,2023-09-13 13:53:46.731000+00:00,2023-09-13 13:53:46.746000+00:00,smiling-shad-817,/usr/local/lib/python3.10/dist-packages/ipyker...,LOCAL,root


# Build Training Dataset

In [11]:
import json

training_inputs_list = []
prompts = []
model_configs = []
for i in range(len(df_long)):
    output = summarization_experiment1(df_long,i)
    training_inputs_list.append(json.loads(output['response'].text))
    prompts.append(output['prompt'])
    model_configs.append(output['model_metadata'])

# Note: Sometime the the LLM prediction returns null (very rarely)


In [None]:
with mlflow.start_run():
    # Log llm predictions
    mlflow.llm.log_predictions(model_configs, training_inputs_list, prompts)

2023/09/13 15:10:06 INFO mlflow.tracking.llm_utils: Creating a new llm_predictions.csv for run 444893d1d8114c129b573e50cd605ca8.


In [None]:
df_tuning = df_long[:len(training_inputs_list)]
df_tuning['output_text'] = df['h1'] + '\n' + df['text']
df_tuning['input_text'] = training_inputs_list
df_tuning = df_tuning[['input_text','output_text']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tuning['output_text'] = df['h1'] + '\n' + df['text']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tuning['input_text'] = training_inputs_list


In [None]:
df_tuning.head(3)

Unnamed: 0,input_text,output_text
676,{'objective': 'To generate leads for Nationwid...,8 motorcycle maintenance tasks you should be d...
677,"{'objective': 'Lead Generation', 'audience': '...",What to consider before franchising your busin...
678,"{'objective': 'Lead Generation', 'audience': '...",4 motivos para considerar el seguro para masco...


In [None]:
df_tuning.to_csv('/content/drive/MyDrive/brand-voice/data/model-tuning/tuning_df_test_small_batch_676_708_temp0.2.csv', index=False)


# Store Training Dataset

In [14]:
from google.cloud import storage

def list_buckets():
    """Lists all buckets."""

    storage_client = storage.Client(project=os.getenv('GCP_PROJECT_ID'))
    buckets = storage_client.list_buckets()

    for bucket in buckets:
        print(bucket.name)

def create_bucket_class_location(bucket_name):
    """
    Create a new bucket in the US region with the coldline storage
    class
    TODO: default to the cheapest storage class for now. Update if needed in the future.
    Storage class can also be changed from an existing bucket.
    """

    storage_client = storage.Client(project=os.getenv('GCP_PROJECT_ID'))

    bucket = storage_client.bucket(bucket_name)
    bucket.storage_class = "STANDARD"
    new_bucket = storage_client.create_bucket(bucket, location="us")

    print(
        "Created bucket {} in {} with storage class {}".format(
            new_bucket.name, new_bucket.location, new_bucket.storage_class
        )
    )
    return new_bucket

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client(project=os.getenv('GCP_PROJECT_ID'))
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    # Optional: set a generation-match precondition to avoid potential race conditions
    # and data corruptions. The request to upload is aborted if the object's
    # generation number does not match your precondition. For a destination
    # object that does not yet exist, set the if_generation_match precondition to 0.
    # If the destination object already exists in your bucket, set instead a
    # generation-match precondition using its generation number.
    generation_match_precondition = 0

    blob.upload_from_filename(source_file_name, if_generation_match=generation_match_precondition)

    print(
        f"File {source_file_name} uploaded to {destination_blob_name}."
    )


In [13]:
list_buckets()

3d5a267881db4384bbd87b4c03ca3c50_clouddeploy
64f0b432e9d646a5b65ae4f97a547380_clouddeploy
brand-voice-training-data-bucket
f44a8e3b1407491cb7b6c8284aabc6d4_clouddeploy
fc85a8bb89e8448b8e960e6081a71555_clouddeploy
ido-81-rnd-gcp-sandbox-2023q3-vertex-pipelines-europe-west4
ido-81-rnd-gcp-sandbox-2023q3-vertex-pipelines-us-central1
us-central1.deploy-artifacts.ido-81-rnd-gcp-sandbox-2023q3.appspot.com


In [18]:
bucket_name = 'brand-voice-training-data-bucket'

## Upload prepared training data

In [15]:
# upload_blob(bucket_name=bucket_name,
#             source_file_name="/content/drive/MyDrive/brand-voice/data/model-tuning/tuning_df_test_small_batch.csv",
#             destination_blob_name="nationwide/model-tuning/test_small_batch.csv")

File /content/drive/MyDrive/brand-voice/data/model-tuning/tuning_df_test_small_batch.csv uploaded to nationwide/model-tuning/test_small_batch.csv.


## Download training data

In [19]:
bucket_name = 'brand-voice-training-data-bucket'
training_data_path = 'nationwide/model-tuning/test_small_batch.csv'
df = pd.read_csv(f'gs://{bucket_name}/{training_data_path}')

In [20]:
df.head(3)

Unnamed: 0,input_text,output_text
0,{'objective': 'Provide information about how t...,Read the latest findings from Nationwide's 202...
1,{'objective': 'To educate readers about the be...,Leveraging your investments for cash\nDid you ...
2,{'objective': 'Provide information on how to l...,What do annuities cost?\nWhile variable annuit...


In [None]:
#!gcloud auth application-default login

## log training data

In [None]:
with mlflow.start_run():
    mlflow.log_input(from_pandas(df=df_tuning), context="training")

# Test training data inputs in one shot

In [None]:
from google.cloud import aiplatform
aiplatform.init(project=os.getenv('GCP_PROJECT_ID'), location=os.getenv('GCP_LOCATION'))

from vertexai.preview.language_models import TextGenerationModel

def generate(model_name, prompt, temperature = 0.2, max_output_tokens = 516, top_k = 40, top_p = .8):
    model = TextGenerationModel.from_pretrained(model_name)
    response = model.predict(
            prompt,
            temperature=temperature,
            max_output_tokens=max_output_tokens,
            top_k=top_k,
            top_p=top_p,
        )
    return response
model_name = 'text-bison@001'

def tuning_prompt_experiment_1shot(df, i, temperature):
    title = df['input_text'][i]
    text = df['output_text'][i]
    prompt = f"""Use the following json data to write a compelling SEO article.\n\n
    JSON: {df_tuning['input_text'][5]} \n
    ARTICLE: {df_tuning['output_text'][5]} \n\n
    JSON: {df_tuning['input_text'][i]} \n
    ARTICLE:
    """
    return generate(model_name, prompt, temperature=temperature)

In [None]:
ground_truth = df_tuning['output_text'][10]
print(ground_truth)

Trampoline safety tips everyone should know
While a trampoline is a great source of entertainment and exercise, it can also be very dangerous if trampoline safety rules are not followed.**									*		**	*		*		*		*		*		*		*		*		*		*		*		*		*		*		*		**			*			*				*Jumping safely
Follow these trampoline safety guidelines:
**Children age 5 and under should not be permitted on a trampoline.
Provide adult supervision and adult spotters around the edge of the trampoline.
Never allow more than 1 person to jump at a time.
Do not permit gymnastic exercises or stunts, such as somersaults or flips.
Never allow children to bounce off the trampoline. Encourage them to stop bouncing, walk to the edge, sit and slide off.
To prevent young children from getting on without supervision, do not leave a ladder or chair near the trampoline.
Never permit children to play on a wet trampoline.
Place your trampoline in a fenced area to avoid being liable for injuries caused by your trampoline acting as an attrac

In [None]:
result = tuning_prompt_experiment_1shot(df_tuning, 10, temperature=0.7)
generated_text = result.text
print(generated_text)

Trampolines can be a lot of fun for kids, but they can also be dangerous. It's important to take safety precautions to prevent injuries.

**Sit back and enjoy the show**

One of the best ways to keep kids safe on a trampoline is to sit back and watch them. This will allow you to see if they're doing anything dangerous and correct them if necessary. It's also important to make sure that there are no other kids around when they're jumping.

**Trampoline safety rules**

There are a few basic trampoline safety rules that everyone should follow:

* Never jump on a trampoline alone.
* Always wear shoes when jumping.
* Never jump with objects in your hands.
* Don't bounce too high.
* Don't flip or do other dangerous stunts.
* Stop jumping if you feel dizzy or tired.

**Best place for a trampoline**

Trampolines should be placed in a safe area away from trees, fences, and other objects. The ground underneath the trampoline should be soft, such as grass or sand. It's also important to make sure

In [None]:
def refine_generated_text(generated_text, guide):
    prompt = f"""Rewrite the following draft article to incorporate the brand writing guide.\
    Brand Writing Guide: {guide} \
    Make sure writing guidelines are followed.\
    Use words like HOW, WHY, WHAT, and WHERE—these help people understand what they will find on the page.\
    Avoid using repetitive words and phrases.\

    Draft article: {generated_text}
    """
    return generate(model_name, prompt, temperature=0.8)

guide = """These are the three key principles that guide our writing:
Write with the conversational warmth of a good friend who is “on your side.”
Write with the self-assurance that can only come from over 90 years in the business.
Write in a way that is clear, concise, and easy to understand.
We use these principles to create content that is relevant, engaging, and informative for our customers."""

# See notebook Tone and Voice (Nationwide).ipynb for how the guide is generated.

In [None]:
refined_result = refine_generated_text(generated_text, guide)
refined_text = refined_result.text
print(refined_text)

**How to Keep Kids Safe on a Trampoline**

Trampolines can be a lot of fun for kids, but they can also be dangerous. It's important to take safety precautions to prevent injuries.

**Sit back and enjoy the show**

One of the best ways to keep kids safe on a trampoline is to sit back and watch them. This will allow you to see if they're doing anything dangerous and correct them if necessary. It's also important to make sure that there are no other kids around when they're jumping.

**Trampoline safety rules**

Here are a few basic trampoline safety rules that everyone should follow:

* Never jump on a trampoline alone.
* Always wear shoes when jumping.
* Never jump with objects in your hands.
* Don't bounce too high.
* Don't flip or do other dangerous stunts.
* Stop jumping if you feel dizzy or tired.

**Best place for a trampoline**

Trampolines should be placed in a safe area away from trees, fences, and other objects. The ground underneath the trampoline should be soft, such as grass