# Structured extraction with LLMs on Databricks Mosaic AI
## Best practices for prompt engineering, fine-tuning, and synthetic data generation

Extracting structured fields from unstructured documents is a common and valuable enterprise use case for LLMs. In this series of notebooks, we demonstrate how to perform this task with Databricks Mosaic AI.

We will be using a lease agreement dataset[1] throughout this demo:
  * You are given only a small sample of 100 lease contracts with human-curated labels
  * The labels are entity extractions in a structured format (JSON), where each field maps to a list of extracted strings (empty when the entity is absent)
  * The fields are: `end_date`, `leased_space`, `lessee`, `lessor`, `signing_date`, `start_date`, `term_of_payment`, `designated_use`, `extension_period`, and `expiration_date_of_lease`
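
As a minimal sketch of the label format described above, the snippet below parses one label string and checks that it contains exactly the ten expected fields, each holding a list of strings. The `EXPECTED_FIELDS` constant and the `parse_labels` helper are hypothetical names introduced for illustration; they are not part of the dataset or of any Databricks API.

```python
import json

# The ten label fields listed above (hypothetical constant name).
EXPECTED_FIELDS = [
    "end_date", "leased_space", "lessee", "lessor", "signing_date",
    "start_date", "term_of_payment", "designated_use",
    "extension_period", "expiration_date_of_lease",
]


def parse_labels(raw: str) -> dict:
    """Parse a JSON label string and check it against the expected fields."""
    labels = json.loads(raw)
    missing = [f for f in EXPECTED_FIELDS if f not in labels]
    unexpected = [f for f in labels if f not in EXPECTED_FIELDS]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    for field, values in labels.items():
        # Every field should be a (possibly empty) list of strings.
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"field {field!r} is not a list of strings")
    return labels


# Abbreviated example modeled on the sample record shown in this notebook.
example = json.dumps({
    **{field: [] for field in EXPECTED_FIELDS},
    "lessee": ["Anhui Xuelingxian Pharmaceutical Co., Ltd."],
    "start_date": ["Aug 1, 2008"],
    "end_date": ["July 31, 2023"],
})
parsed = parse_labels(example)
print(parsed["lessee"])
```

A check like this can be useful both on the human-curated labels and on model outputs, since a malformed extraction is easier to catch before evaluation than during it.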

First, we demonstrate structured extraction with the Llama 3.1 70B model, which supports structured output through function calling, and evaluate the results against the ground-truth labels.
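
To make the function-calling idea concrete, here is a rough sketch of how the ten fields could be described as a tool definition. It assumes the serving endpoint accepts OpenAI-style `tools` entries (a `function` with a JSON Schema under `parameters`); the tool name `extract_lease_entities`, the description text, and the `build_extraction_tool` helper are all hypothetical, and the definition used in the later notebooks may differ.

```python
import json

# The ten label fields listed above (hypothetical constant name).
FIELDS = [
    "end_date", "leased_space", "lessee", "lessor", "signing_date",
    "start_date", "term_of_payment", "designated_use",
    "extension_period", "expiration_date_of_lease",
]


def build_extraction_tool(fields: list) -> dict:
    """Build an OpenAI-style tool definition (assumed format) for the fields."""
    properties = {
        # Each field is a list of strings, matching the ground-truth labels.
        name: {"type": "array", "items": {"type": "string"}}
        for name in fields
    }
    return {
        "type": "function",
        "function": {
            "name": "extract_lease_entities",  # hypothetical tool name
            "description": "Extract entities from a lease agreement.",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(fields),
            },
        },
    }


tool = build_extraction_tool(FIELDS)
print(json.dumps(tool["function"]["parameters"]["properties"]["lessee"]))
```

Declaring every field as an array of strings keeps the model output in the same shape as the labels, so the two can be compared directly.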

Second, we demonstrate how to use the existing labeled data to generate additional synthetic data with the Llama 3.1 405B model.

Finally, we show how to fine-tune a Llama 3.2 3B model on the synthetic data, serve it on a Provisioned Throughput endpoint, run batch structured extraction with `ai_query`, and evaluate the results against both the ground-truth labels and the larger Llama 3.1 70B model.
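
The evaluation method itself is not specified in this introduction. As one simple possibility, the sketch below computes a per-field exact-match rate between predicted and ground-truth labels after light normalization (lowercasing and collapsing whitespace). The helper names `normalize`, `field_exact_match`, and `exact_match_rate` are hypothetical, and the metric is an assumption for illustration rather than the one used in the later notebooks.

```python
def normalize(values):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return {" ".join(v.lower().split()) for v in values}


def field_exact_match(predicted, truth, fields):
    """Return, per field, whether the predicted set equals the ground-truth set."""
    return {
        f: normalize(predicted.get(f, [])) == normalize(truth.get(f, []))
        for f in fields
    }


def exact_match_rate(pairs, fields):
    """Average the per-field matches over (predicted, truth) pairs."""
    if not pairs:
        return {}
    totals = {f: 0 for f in fields}
    for predicted, truth in pairs:
        for f, matched in field_exact_match(predicted, truth, fields).items():
            totals[f] += int(matched)
    return {f: totals[f] / len(pairs) for f in fields}


# Toy example with two fields; the prediction is invented for illustration.
truth = {"lessee": ["Anhui Xuelingxian Pharmaceutical Co., Ltd."],
         "start_date": ["Aug 1, 2008"]}
predicted = {"lessee": ["anhui xuelingxian  pharmaceutical co., ltd."],
             "start_date": ["August 1, 2008"]}
rates = exact_match_rate([(predicted, truth)], ["lessee", "start_date"])
print(rates)
```

Exact match is strict: the differently formatted date above counts as a miss, so a real evaluation may want fuzzier matching for dates and long free-text fields.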

Note:
* [1] The lease agreement dataset is sourced from https://arxiv.org/abs/2010.10386

In [0]:
# Replace with your actual catalog and schema names
df = spark.read.table("catalog.schema.lease_docs")
display(df.limit(1))

lease_id,lease_doc,labels
a0Xts4kMzdTGCoSkzyR4Ag3s93q0-lease_contract_201,"EXHIBIT 10.4 LEASE CONTRACT Lessor: Bozhou Fengyi Chinese Medicine Development and Research Institute (hereinafter referred to as Party A) Lessee: Anhui Xuelingxian Pharmaceutical Co., Ltd. (hereinafter referred to as Party B) Party A and Party B agree to conclude the contract according to the Contract Law of the People’s Republic of China and relevant regulations. Article 1. Party A leases the workshop (located at East Liuge Village, Weiwu Road, Bozhou, with the building area of 10000 square meters and usable area of 3000 square meters) to Party B. Article 2. Lease term The lease term will be from Aug 1, 2008 to July 31, 2023, covering 15 years. Party A has right to terminate the contract and takes back the house if one of the following situations occurred: (1) Party B transfers, subleases, lends, jointly operates, buys share or exchanges the house with others; (2) Party B uses the house to take illegal activities, which damages the public interests; Party B has the priority to lease the house after the contract expires, and it can extend the lease term with negotiation of Party A if it cannot find the house promptly after the contract expires. Article 3: Rent and rent paying agreement The rent is 1.2 million Yuan, and Party B shall pay 100000 Yuan to Party A before 15 of each month as the monthly rent. The water & electricity fees shall be settled separately. Article 4: Repairing and decoration of house Party A takes charge of repairing of the house. Party A shall take the examination of the facilities, and it shall guarantee that there is no leakage, and the tap water, sewage and lighting, doors& windows are in good conditions, so as to ensure the normal and safe use of Party B. The repair scope and standard shall be performed by Urban Construction Department (87) C.Z.G.Z.No. 13 Notice. Party B shall positively assist when Party A repairing the house. 
Through the negotiation of the two parties, Party A will contribute for the repairing work and organize the construction according to the maintenance scope. Party B can decorate the house without affecting the house structure, but the scale, scope, process and materials shall be approved by Party A, and then the construction work can be carried out. The two parties shall discuss about the materials fees and ownership of the decoration objects after the contract expiring. Article 5: Change of two parties 1. If Party A transfers the house ownership to the third party according to the legal procedures, this contract continues to take effect to the new house owner without the agreement. 2. Party A shall notify Party B three months in advance of selling the house in the written form, and Party B has the priority to purchase the house under same conditions. 3. Party B shall get the approval of Party A if it intends to exchange the house with others, and Party A shall support the reasonable demand of Party B. Article 6: Responsibility for breach of contract 1. If one party doesn’t comply with the terms of Article 4, the party shall compensate 50000 Yuan for another party. 2. If Party A receives the additional fees except for the rent, Party B has right to refuse. 3. Party A has right to stop transferring if Party B transfers the house to other people independently. The two parties agree to handle the economic claim issues of above matters under the supervision of the issuing organ of the contract. Article 7: Conditions of disclaimer 1. The two parties take no responsibilities if the house is damaged or Party B has loss owing to the force majeure. 2. The two parties shall not take the responsibility for each other if they have the loss that the house is removed or rebuilt owing to the demand of municipal construction. If the contract is terminated owing to the above reasons, the rent will be calculated by the actual using time, and it will refund the difference. 
Article 8: Disputes settlement Any disputes resulted from the contract shall be negotiated by the two parties; if it fails to be solved in this way, any party can apply for mediation in the house lease management organ; and it can apply for arbitration to the Arbitration Committee of Economic Contract of the Municipal Administration of Industry and Commerce if the mediation is failed, and it can also take the legal action to the People’s Court. Article 9: The matters unmentioned herein can be signed as the supplementary agreement by the two parties, and the supplementary agreement will have the same effect with the contract after reporting it to the house lease management organ and getting the approval. This contract is in duplicate, with Party A and Party B holding one respectively. Lessor: Bozhou Fengyi Chinese Medicine Development and Research Institute Legal representative: Wang Fengyi Lessee: Anhui Xuelingxian Pharmaceutical Co., Ltd. Legal representative: Wang Shunli","{  ""end_date"": [""July 31, 2023""],  ""leased_space"": [""the workshop (located at East Liuge Village, Weiwu Road, Bozhou, with the building area of 10000 square meters and usable area of 3000 square meters)""],  ""lessee"": [""Anhui Xuelingxian Pharmaceutical Co., Ltd.""],  ""lessor"": [""Bozhou Fengyi Chinese Medicine Development and Research Institute""],  ""signing_date"": [],  ""start_date"": [""Aug 1, 2008""],  ""term_of_payment"": [""The rent is 1.2 million Yuan, and Party B shall pay 100000 Yuan to Party A before 15 of each month as the monthly rent.""],  ""designated_use"": [],  ""extension_period"": [""Party B has the priority to lease the house after the contract expires, and it can extend the lease term with negotiation of Party A if it cannot find the house promptly after the contract expires.""],  ""expiration_date_of_lease"": [""July 31, 2023""] }"


In [0]:
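# Split the labeled contracts roughly in half. randomSplit samples rows
# independently, so the split sizes are approximate; the seed keeps it reproducible.
df_train, df_eval = df.randomSplit([0.50, 0.50], seed=614)

# Check the size of each split
display(df_train.count())
display(df_eval.count())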

50

50

In [0]:
# Write both splits to the catalog for downstream use; replace with your actual catalog and schema names
df_eval.write.format("delta").mode("overwrite").saveAsTable(
    "catalog.schema.lease_docs_eval"
)
df_train.write.format("delta").mode("overwrite").saveAsTable(
    "catalog.schema.lease_docs_train"
)