# Generate a Synthetic Text Summarization Dataset with Gretel GPT

* In this notebook, we use Gretel GPT with the Llama-2 7B model to create a synthetic text summarization dataset.
* To run this notebook, you will need an API key from the [Gretel Console](https://console.gretel.ai/).

## Getting Started

In [None]:
%%capture
!pip install -U gretel-client

In [None]:
import pandas as pd

from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project, get_project

In [None]:
# Log in to Gretel (api_key="prompt" asks for the key interactively)
configure_session(api_key="prompt", cache="yes", endpoint="https://api-dev.gretel.cloud", validate=True, clear=True)

pd.set_option('max_colwidth', None)

Caching Gretel config to disk.
Using endpoint https://api-dev.gretel.cloud
Logged in as marjan@gretel.ai ✅


## Load and preview training data

In [None]:
# Specify a dataset to train on
DATASET_PATH = 'https://gretel-datasets.s3.us-west-2.amazonaws.com/Text-dataset/Samsum-text-summerization-sample-1000.csv'
df = pd.read_csv(DATASET_PATH)

df.head()

Unnamed: 0,dialogue_and_summary
0,**dialogue**\nLeon: I asked you to lend me your camera\r\nItzel: You can come and take it from my home\r\nLeon: Would be there in an hour\n\n**summary**\nLeon will borrow Itzel's camera.
1,**dialogue**\nJayleen: I'm dyeing my hair\r\nAugust: What colour?\r\nJayleen: I'm staying with my blonde. I had to refresh my colour\r\nAugust: Ok\r\nJayleen: I haven't dyed my hair for around 9 months xd\n\n**summary**\nJayleen is dyeing her hair blonde for the first time in 9 months.
2,**dialogue**\nLiam: I don't think the institutional approach is too interesting\nJeff: I agree...\nTom: so let's try to find an alternative\n\n**summary**\nLiam and Jeff do not find institutional approach interesting.
3,**dialogue**\nGina: <file_photo>\r\nGina: What do you think?\r\nKate: Grab it! At that price it is an absolute bargain.\n\n**summary**\nKate wants Gina to buy it because it's cheap.
4,"**dialogue**\nAndrew: Hello Janny, is it still convenient for us to come and check your gas meter at 2.45 today?\r\nJanny: Hi Andrew, that's fine. \r\nAndrew: Thank you, we will see you then\n\n**summary**\nAndrew will come to Janny to check her gas meter at 2.45 today."
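Each record packs the dialogue and its summary into a single text field, separated by the `**dialogue**` and `**summary**` markers visible in the preview. The sketch below shows one way to split such a record back into its two parts, for example to inspect generated records later. The helper name `split_record` and the regular expression are illustrative assumptions, not part of the Gretel client.

```python
import re

# Markers are taken from the preview rows above; the pattern itself is an assumption.
RECORD_PATTERN = re.compile(
    r"\*\*dialogue\*\*\s*(?P<dialogue>.*?)\s*\*\*summary\*\*\s*(?P<summary>.*)",
    re.DOTALL,
)


def split_record(text):
    """Return (dialogue, summary), or None if the markers are missing."""
    match = RECORD_PATTERN.search(text)
    if match is None:
        return None
    # Normalize Windows line endings so every turn sits on its own line.
    dialogue = match.group("dialogue").replace("\r\n", "\n").strip()
    summary = match.group("summary").strip()
    return dialogue, summary


# First row of the preview, reproduced as a plain string.
sample = (
    "**dialogue**\nLeon: I asked you to lend me your camera\r\n"
    "Itzel: You can come and take it from my home\r\n"
    "Leon: Would be there in an hour\n\n**summary**\n"
    "Leon will borrow Itzel's camera."
)

dialogue, summary = split_record(sample)
print(dialogue)
print("---")
print(summary)
```

Applied to the whole column, `df["dialogue_and_summary"].map(split_record)` would give one tuple per row, with `None` marking rows that do not follow the expected layout.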


## Configure and Train the Synthetic Model

We can experiment with different values of the `steps` parameter, which changes the resulting text Synthetic Quality Score (SQS).
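To compare several step values, it can help to build one config per value up front. The following is a minimal sketch that only manipulates plain dictionaries. The helper `configs_for_steps` is hypothetical, and `base_config` is a stand-in with a placeholder value for the dictionary that `read_model_config` returns, which contains more keys in practice.

```python
import copy


def configs_for_steps(base_config, step_values):
    """Return one independent copy of the config per requested steps value."""
    configs = []
    for steps in step_values:
        # Deep copy so the nested "gpt_x" dict is not shared between variants.
        cfg = copy.deepcopy(base_config)
        cfg["models"][0]["gpt_x"]["steps"] = steps
        configs.append(cfg)
    return configs


# Stand-in config; 100 is a placeholder, not a Gretel default.
base_config = {"models": [{"gpt_x": {"steps": 100}}]}

variants = configs_for_steps(base_config, [300, 600, 900])
print([cfg["models"][0]["gpt_x"]["steps"] for cfg in variants])
```

Each variant could then be passed as `model_config` when creating a model, giving one trained model and one SQS report per step value.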

In [None]:
from gretel_client.projects.models import read_model_config

# Load the base natural-language config and override the number of training steps
config = read_model_config("synthetics/natural-language")
config["models"][0]["gpt_x"]["steps"] = 600  # try different step values

# Designate project
PROJECT = 'data-summarization'
project = create_or_get_unique_project(name=PROJECT)

# Create and submit model
model = project.create_model_obj(model_config=config, data_source=df)
model.name = f"{PROJECT}-llama-2-7b"
model.submit_cloud()

poll(model)


INFO: Starting poller
INFO: Status is created. Model creation has been queued.


{
    "uid": "6531b8e8f3bf601ba821bc39",
    "guid": "model_2X0E5sAljm7RZuBoGNYOhziDYM3",
    "model_name": "data-summarization-llama-2-7b",
    "runner_mode": "cloud",
    "user_id": "621e70bf492fbf0535537ea1",
    "user_guid": "user_25nTzH09cLJdsemHxZVO2SdOc8u",
    "billing_domain": "gretel.ai",
    "billing_domain_guid": "domain_28bzIokk1eQdWUYsovba0VN1gtY",
    "project_id": "6531b8e7478d822b693c19f6",
    "project_guid": "proj_2X0E5l1tcgBRbcLBtBwtHNYVeYH",
    "status_history": {
        "created": "2023-10-19T23:16:56.300401Z"
    },
    "last_modified": "2023-10-19T23:16:56.408869Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "provenance": null,
    "container_image": "074762682575.dkr.ecr.us-east-2.amazonaws.com/models/gpt_x@sha256:28ab363cd8f7687570a8c8470d3e2c4391b5b31ee64f3ef05970d3e8943c2d6e",
    "container_image_version": "6eb73a3b",
  

INFO: Status is pending. A Gretel Cloud worker is being allocated to begin model creation.
INFO: Status is active. A worker has started creating your model!
2023-10-19T23:17:19.117521Z  Resolved revision for model
{
    "revision": "4874b16751ab2db177bbb898bd4a1e44c89e2f25",
    "model": "gretelai/mpt-7b"
}
2023-10-19T23:17:19.118032Z  Parameter efficient fine tuning (PEFT) methods will be used, which greatly reduce the number of trainable parameters.
2023-10-19T23:17:19.119751Z  Starting GPT model training...
{
    "num_train_steps": 750
}
2023-10-19T23:17:19.120088Z  Fine-tuning 'gretelai/mpt-7b' with provided dataset!
2023-10-19T23:17:19.120688Z  Downloading model from remote source. Depending on the size of the model, this may take a few minutes.
2023-10-19T23:18:19.121248Z  Model download 72% complete, ETA 23s (9566455971/13300877750 bytes downloaded)
2023-10-19T23:18:47.602336Z  Model download 100% complete (13300877750 bytes downloaded). Loading model onto GPU ...
2023-10-19T23:

## Generate the Text Synthetic Quality Score

In [None]:
model.get_report_summary()

{'summary': [{'field': 'synthetic_data_quality_score', 'value': 81},
  {'field': 'semantic_similarity', 'value': 91},
  {'field': 'structure_similarity', 'value': 55}]}
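The summary comes back as a list of `field`/`value` pairs, which is awkward to query directly. A small sketch of flattening it into a lookup dictionary follows. The `report_summary` literal copies the output shown above, and `scores_by_field` is a hypothetical helper name.

```python
# Copy of the summary printed above, so the example runs without a trained model.
report_summary = {
    "summary": [
        {"field": "synthetic_data_quality_score", "value": 81},
        {"field": "semantic_similarity", "value": 91},
        {"field": "structure_similarity", "value": 55},
    ]
}


def scores_by_field(summary):
    """Flatten the list of field/value pairs into a {field: value} dict."""
    return {item["field"]: item["value"] for item in summary["summary"]}


scores = scores_by_field(report_summary)

# Find which of the two component scores is lower.
components = ("semantic_similarity", "structure_similarity")
weakest = min(components, key=scores.get)
print(scores["synthetic_data_quality_score"], weakest)
```

With a trained model, the same helper could be called as `scores_by_field(model.get_report_summary())`, assuming the summary keeps the structure shown above.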

In [None]:
# Render the text SQS report as HTML
import IPython
from smart_open import open

IPython.display.HTML(data=open(model.get_artifact_link("text_metrics_report")).read(), metadata=dict(isolated=True))

How to interpret the Text SQS,Excellent,Good,Moderate,Poor,Very Poor
Demo environments or mock data,,,,,
Pre-production testing environments,,,,,
Suitable for statistical analysis,,,,,
Augment machine learning data sources,,,,,
Improve your model using our tips and advice,,,,,

Metric,Training Data,Synthetic Data
Row Count,80,80
Column Count,1,1
Training Lines Duplicated,-,0
Missing Values,0,0
Unique Values,80,80
Average Words Per Sentence,4.48,4.14
Average Characters Per Word,4.17,3.91
Average Sentence Count,8.90,10.11
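One way to read the text statistics is to express each synthetic value as a percentage change from the training value. This is a sketch of that arithmetic using the three averages from the table. It is an illustrative reading aid, not how Gretel computes the SQS, and `relative_change_pct` is a hypothetical helper.

```python
# (training, synthetic) pairs copied from the table above.
text_stats = {
    "Average Words Per Sentence": (4.48, 4.14),
    "Average Characters Per Word": (4.17, 3.91),
    "Average Sentence Count": (8.90, 10.11),
}


def relative_change_pct(training, synthetic):
    """Percentage change of the synthetic value relative to the training value."""
    return (synthetic - training) / training * 100


changes = {
    name: round(relative_change_pct(train, synth), 1)
    for name, (train, synth) in text_stats.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

A positive value means the synthetic text is higher on that statistic than the training text, and a negative value means it is lower.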
