In [19]:
%%script false --no-raise-error
import json
from google.colab import userdata
from google.oauth2 import service_account
from google.cloud.bigquery import magics

credentials_json = userdata.get('BIGQUERY_CREDENTIALS')
credentials = service_account.Credentials.from_service_account_info(json.loads(credentials_json))
magics.context.credentials = credentials

Couldn't find program: 'false'


In [36]:
from google.cloud import bigquery
from google.cloud.bigquery import magics
%load_ext bigquery_magics

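# Dataset and project that hold the `prompts` table.
data_set = "testing_set"
project_name = "emerald-entity-468916-f9"

# Resolve unqualified table names (e.g. `prompts`) against the default dataset.
job_config = bigquery.QueryJobConfig(default_dataset=f"{project_name}.{data_set}")
# Use the service-account credentials if the Colab cell above ran; otherwise fall back to default credentials.
client = bigquery.Client(project=project_name, default_query_job_config=job_config, credentials=globals().get("credentials"))
# Apply the same defaults to the %%bigquery cell magic.
magics.context.default_query_job_config = job_config
magics.context.project = project_name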

The bigquery_magics extension is already loaded. To reload it, use:
  %reload_ext bigquery_magics


### Inference prompts

**Book fragment summarization**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT   
'''You are an expert in summarizing books.
Your goal is to summarize a book fragment. The summary should contain no more than %d characters.

The fragment is provided between <book_fragment> tags below:
<book_fragment>
%s
</book_fragment>

Return just a summary without any additional comments.
''' prompt, 'summarize' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

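The `%d` and `%s` markers in the prompt above are printf-style placeholders. Below is a minimal sketch of how such a template could be filled, assuming plain Python `%` formatting rather than whatever the pipeline actually uses downstream (BigQuery's `FORMAT` function would be one alternative). `SUMMARIZE_TEMPLATE` is a shortened, hypothetical copy of the stored prompt, and `build_summarize_prompt` is an illustrative helper, not part of this notebook.

```python
# Shortened, hypothetical copy of the 'summarize' prompt stored in the prompts table.
SUMMARIZE_TEMPLATE = (
    "The summary should contain no more than %d characters.\n"
    "<book_fragment>\n%s\n</book_fragment>\n"
)


def build_summarize_prompt(fragment: str, max_chars: int) -> str:
    """Fill the placeholders in template order: %d (limit) first, then %s (fragment)."""
    return SUMMARIZE_TEMPLATE % (max_chars, fragment)


prompt = build_summarize_prompt("Call me Ishmael.", 500)
print(prompt)
```

With this style of formatting, a literal percent sign inside the template itself would have to be written as `%%`; percent signs inside the substituted fragment are left untouched.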

**Whole book summarization**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT
'''You are an expert in summarizing books.
Your goal is to prepare a book summary based on the summaries of all book fragments.
The resulting summary should contain around 20000 characters.
Avoid redundancy while maintaining comprehensiveness when summarizing.

Concatenated summaries of all book fragments are placed below, between <fragment_summaries> tags:
<fragment_summaries>
%s
</fragment_summaries>

Return just a summary without any additional comments.
''' prompt, 'reduce_summary' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

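Taken together, the `summarize` and `reduce_summary` prompts suggest a two-step flow: summarize every fragment, then summarize the concatenated summaries. The sketch below illustrates that flow under stated assumptions: `summarize_book` and `call_model` are hypothetical names, the model call is replaced by a stub, and the real pipeline may orchestrate these steps differently (for example entirely in SQL).

```python
from typing import Callable, List


def summarize_book(
    fragments: List[str],
    map_template: str,
    reduce_template: str,
    call_model: Callable[[str], str],
    max_chars: int = 1000,
) -> str:
    """Summarize each fragment, then reduce the concatenated summaries."""
    fragment_summaries = [
        call_model(map_template % (max_chars, fragment)) for fragment in fragments
    ]
    # Keep the book order so the reduce step sees the fragments in sequence.
    joined = "\n\n".join(fragment_summaries)
    return call_model(reduce_template % joined)


# Stub standing in for a real model call; it only records the prompts it receives.
calls: List[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return f"summary {len(calls)}"


result = summarize_book(["part one", "part two"], "limit %d: %s", "all: %s", fake_model)
print(result)
```

The stub makes the call order visible: two fragment-level prompts followed by a single reduce prompt.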

**Extracting characters' identifying information from a given book fragment**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT
'''You are an expert in extracting information from books and manipulating JSON structures.
Your task is to provide identifying information about significant human characters (later called Individuals) from the given book fragment.

Provide the output as a JSON array; each Individual should be described in a separate JSON object in the array.
The schema definition is below.
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "full_name": {"type": "string", "maxLength": 300, "description": "Full name including titles, nicknames, aliases, maiden names, or pseudonyms if mentioned. Include all names if Individual has many names."},
      "information": {"type": "string", "maxLength": 1500, "description": "Any information that may help uniquely identify given Individual, e.g. sex, age, origin, physical appearance, distinguishing marks or features and other helpful information"},
      "importance": {"type": "integer", "description": "Number of sentences related in any way to given Individual in the text"}
    }
  }
}

The book fragment for Individuals analysis is provided between <book_fragment> tags.
<book_fragment>
%s
</book_fragment>

Important Guidelines for choosing Individuals to be included in the JSON array:
1. Omit unnamed crowd members or minor Individuals mentioned only in passing.
2. Add only Individuals who play meaningful roles or are described in detail in the given book fragment.

Other important Guidelines:
1. Do not duplicate JSON objects for the same Individual.
2. Strictly respect the maximum character limits for each field, especially max 1500 characters for `information`. Summarize to reduce size, if necessary.

Before returning, fix all format errors in JSON array, if any.
Return only corrected JSON array as a response, without any additional comments.   
''' prompt, 'characters_id_data' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

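The prompt asks the model to respect the schema, but a response can still violate it. The following is a minimal sketch of a client-side check, assuming the response is handled in Python; `validate_individuals` is a hypothetical helper, and the limits are copied from the schema in the prompt above.

```python
import json
from typing import Any, Dict, List

# Field limits copied from the schema in the 'characters_id_data' prompt.
MAX_LENGTHS = {"full_name": 300, "information": 1500}


def validate_individuals(raw_response: str) -> List[Dict[str, Any]]:
    """Parse a model response and keep only the objects that satisfy the schema."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    valid = []
    for item in data:
        if not isinstance(item, dict):
            continue
        strings_ok = all(
            isinstance(item.get(field), str) and len(item[field]) <= limit
            for field, limit in MAX_LENGTHS.items()
        )
        importance = item.get("importance")
        # bool is a subclass of int in Python, so exclude it explicitly.
        importance_ok = isinstance(importance, int) and not isinstance(importance, bool)
        if strings_ok and importance_ok:
            valid.append(item)
    return valid


response = json.dumps([
    {"full_name": "Elizabeth Bennet", "information": "Second of five sisters", "importance": 12},
    {"full_name": "Mr. Collins", "information": "x" * 1501, "importance": 3},
    {"full_name": "Nameless", "importance": "many"},
])
kept = validate_individuals(response)
print([person["full_name"] for person in kept])
```

Dropping invalid objects is only one possible policy; re-sending the fragment to the model would be another.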

**Identifying the same characters in duplicate candidate pairs**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT   
'''You are an expert in comparing human characters (called later "Individuals") based on descriptions from different fragments of the same book.
Your task is to analyse the names and information of each pair of Individuals and judge whether both Individuals are in fact the same person.

The input data are provided as a JSON array. Each pair of Individuals is described in a separate JSON object with a flat structure.
The input schema definition is below.
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {"type": "integer", "description": "A unique identifier of a pair"},
      "first_individual_full_name": {"type": "string", "description": "Full name of the first Individual"},
      "first_individual_information": {"type": "string", "description": "Additional information describing first Individual"},
      "second_individual_full_name": {"type": "string", "description": "Full name of the second Individual"},
      "second_individual_information": {"type": "string", "description": "Additional information describing second Individual"}
    }
  }
}

The input data for analysis is provided between <individual_pairs> tags.
<individual_pairs>
%s
</individual_pairs>

As supplementary information, you can use the summary of the whole book. It is placed between <summary> tags.
<summary>
%s
</summary>

As the result, please return a JSON array containing only the pairs where the first and second Individual are in fact the same person. If unsure, assume they are different persons and do not return the pair.
If there are no pairs containing the same Individual, then return an empty array.
Return only JSON array, without any additional comments.   
''' prompt, 'find_the_same_characters' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);


**Identifying different characters in duplicate candidate pairs - double check of the previous prompt's result**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT   
'''You are an expert in comparing human characters (called later "Individuals") based on descriptions from different fragments of the same book.
Your task is to analyse the names and information of each pair of Individuals and judge whether both Individuals are in fact the same person.

The input data are provided as a JSON array. Each pair of Individuals is described in a separate JSON object with a flat structure.
The input schema definition is below.
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {"type": "integer", "description": "A unique identifier of a pair"},
      "first_individual_full_name": {"type": "string", "description": "Full name of the first Individual"},
      "first_individual_information": {"type": "string", "description": "Additional information describing first Individual"},
      "second_individual_full_name": {"type": "string", "description": "Full name of the second Individual"},
      "second_individual_information": {"type": "string", "description": "Additional information describing second Individual"}
    }
  }
}

The input data for analysis is provided between <individual_pairs> tags.
<individual_pairs>
%s
</individual_pairs>

As supplementary information, you can use the summary of the whole book. It is placed between <summary> tags.
<summary>
%s
</summary>

As the result, please return a JSON array containing only the pairs where the first and second Individual are different persons. If unsure, assume they are different.
Before returning, check and fix all format errors in JSON array, if any.
Return only JSON array, without any additional comments.   
''' prompt, 'find_different_characters' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

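The two prompts above ask opposite questions about the same candidate pairs. One way to combine their answers, sketched below, is to accept a pair as a duplicate only when the first prompt returned it and the second prompt did not flag it as different. This combination rule is an assumption about how the double check is used, and `pair_ids` and `confirmed_duplicates` are hypothetical helper names.

```python
import json
from typing import Optional, Set


def pair_ids(raw_response: str) -> Optional[Set[int]]:
    """Collect the `id` values from a JSON array of pair objects, or None if unparsable."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    return {
        item["id"]
        for item in data
        if isinstance(item, dict) and isinstance(item.get("id"), int)
    }


def confirmed_duplicates(same_response: str, different_response: str) -> Set[int]:
    """Pairs judged 'same' by the first prompt and not judged 'different' by the second."""
    same = pair_ids(same_response)
    different = pair_ids(different_response)
    # Stay conservative: if either answer cannot be parsed, confirm nothing.
    if same is None or different is None:
        return set()
    return same - different


same_answer = '[{"id": 1}, {"id": 2}]'
different_answer = '[{"id": 2}, {"id": 3}]'
print(confirmed_duplicates(same_answer, different_answer))  # {1}
```

Confirming nothing on a parse failure mirrors the "if unsure, assume they are different" rule in both prompts.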

**Another check of the ready-to-merge array with identifying information of (most probably) the same character**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT
'''You are an expert in comparing human characters (called later "Individuals") based on descriptions from different fragments of the same book.
As input data for analysis you have an array of JSON objects representing Individuals. Previous analysis indicated that all Individuals in the array are in fact the same person.
Your task is to analyse all JSON objects in the array and double-check whether all the objects indeed represent the same person.

Please see input JSON array schema definition:
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "full_name": {"type": "string", "description": "Full name including titles, nicknames, aliases, maiden names, or pseudonyms if mentioned. Include all names if Individual has many names."},
      "information": {"type": "string", "description": "Any information that may help uniquely identify given Individual, e.g. sex, age, origin, physical appearance, distinguishing marks or features and other helpful information"}
    }
  }
}

JSON array with Individuals suspected to be the same person is below, between <character> tags:
<character>
%s
</character>

As supplementary information, you can use the summary of the whole book. It is placed between <summary> tags below.
<summary>
%s
</summary>

Please return "true" if you think that all Individuals indeed represent the same person; return "false" otherwise. If in doubt, return "false".
Return only one word as a response, without any additional comments.   
''' prompt, 'merge_character_ids_double_check' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

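Because the prompt asks for a single word, the answer can be parsed strictly. A minimal sketch, assuming the raw model text is available as a Python string; `is_same_person` is a hypothetical helper that treats anything other than a clear "true" as "false", in line with the "if in doubt" rule above.

```python
def is_same_person(raw_response: str) -> bool:
    """Return True only for an unambiguous 'true' answer."""
    # Tolerate surrounding whitespace, quotes and a trailing period, nothing else.
    answer = raw_response.strip().strip("\"'.").lower()
    return answer == "true"


for candidate in [' "True". ', "false", "true, probably", ""]:
    print(repr(candidate), is_same_person(candidate))
```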

**Merging array with identifying information of the same character to single character data**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT
'''You are an expert in merging names and descriptions of human characters (later called "Individuals") coming from different fragments of the same book.
Your task is to merge the identifying information of a single Individual based on a book.

The input data is an array of JSON objects, each representing the same Individual but based on a different fragment of the same book.
The order of JSON objects is the same as the order of fragments in the book.
Please see input data schema definition:
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "full_name": {"type": "string", "maxLength": 300, "description": "Full name including titles, nicknames, aliases, maiden names, or pseudonyms if mentioned. Include all names if Individual has many names."},
      "information": {"type": "string", "maxLength": 2500, "description": "Any information that may help uniquely identify given Individual, e.g. sex, age, origin, physical appearance, distinguishing marks or features and other helpful information"}
    }
  }
}

Between <character> tags is input JSON array with Individual data to be merged.
<character>
%s
</character>

As supplementary information, you can use the summary of the whole book. It is placed between <summary> tags below.
<summary>
%s
</summary>

Important Guidelines for merging data:

1. The result should be just one single JSON object (not array) containing summarized full_name and information from all JSON objects in the array.
2. Avoid redundancy while maintaining comprehensiveness when merging.
3. The format of the resulting JSON object should be the same as the format of the JSON objects in the array.
4. Strictly respect the maximum character limits for both fields. Summarize if necessary.

Before returning, check and fix all format errors in the JSON object, if any.
Return only merged JSON object as a response, without any additional comments.   

''' prompt, 'merge_character_ids' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

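The merge prompt expects one array per person, while the earlier prompts work on pairs. One possible way to turn confirmed duplicate pairs into such groups is a union-find pass, sketched below. This is an illustration only: the notebook does not show how groups are formed, and `group_duplicates` is a hypothetical helper operating on hypothetical integer record ids.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_duplicates(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group record ids that are transitively linked by confirmed duplicate pairs."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always attach the larger root to the smaller one for stable output.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups: Dict[int, List[int]] = defaultdict(list)
    for node in sorted(parent):
        groups[find(node)].append(node)
    return [members for _, members in sorted(groups.items())]


print(group_duplicates([(1, 2), (2, 3), (7, 8)]))  # [[1, 2, 3], [7, 8]]
```

Transitive grouping can chain together records that were never compared directly, which is one reason a whole-group double check before merging can be useful.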

**Extract desired character traits from a given book fragment**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT
'''Your task is to analyse the given human characters (later called "Individuals") based on the given book fragment.
The analysis serves academic research on understanding human characteristics across different historical periods, geographical regions, and social statuses.

Data format:
Data are stored in a JSON array. Each Individual is described in a separate JSON object with a flat structure, as specified in the schema below.
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {"type": "integer", "description": "A unique identifier of an Individual"},
      "full_name": {"type": "string", "maxLength": 300, "description": "Full name including titles, nicknames, aliases, maiden names, or pseudonyms if mentioned, e.g.: 'Victor Frankenstein, M.D.'"},
      "information": {"type": "string", "maxLength": 1500, "description": "Additional information describing given Individual"},
      "sex": {"type": "string", "maxLength": 100, "description": "'male', 'female' or 'non-binary'"},
      "social_class": {"type": "string", "maxLength": 800, "description": "Economic and social standing (e.g., 'nobility', 'working class', 'merchant class')"},
      "wealth": {"type": "string", "maxLength": 800, "description": "Economic position, assets, property, financial struggles or abundance with information how wealth/income is obtained (inheritance, labor, trade, crime, patronage, etc.)"},
      "values": {"type": "string", "maxLength": 1600, "description": "Core principles, moral compass, priorities (can include both positive and negative values)"}
    }
  }
}

The output example between <example> tags only shows the output structure; the real analysis results will probably be significantly larger and richer in detail.
<example>
[
  {
    "id": 78,
    "full_name": "Victor Frankenstein, M.D.",
    "sex": "male",
    "social_class": "Upper class, Geneva aristocracy",
    "wealth": "Wealthy through family fortune, owns estate in Geneva, sufficient funds for extended travels and education. Wealth obtained thanks to family inheritance, father's position as syndic, old Geneva money",
    "values": "Knowledge, scientific progress, family loyalty, later: justice and revenge"
  },
  {
    "id": 3,
    "full_name": "James Johnson, known as 'Old Jim the Miller'",
    "sex": "male",
    "social_class": "Middle class tradesman, respected in village",
    "wealth": "Comfortable middle class, owns mill and cottage, savings of approximately 200 pounds. Wealth sources: milling fees, grain trading profits, small loans to farmers at harvest time",
    "values": "Hard work, family legacy, honest trade, community solidarity, tradition"
  }
]
</example>

Between <characters> tags is the input JSON array with Individuals to be analysed. Only the `id`, `full_name` and `information` fields are prepopulated.
<characters>
%s
</characters>

The current book fragment for analysis is provided between <book_fragment> tags.
<book_fragment>
%s
</book_fragment>

Your main task is to add the missing fields to each Individual JSON object based on the book fragment analysis.
Additional guidelines:
1. Fields `full_name` and `information` are already populated and should be used to identify Individuals in given book fragment
2. Please do not modify content of the fields `full_name` and `id`.
3. Please do not add any new Individuals to the JSON array
4. Strictly observe the maximum character count for each field, comparing it with "maxLength" size. Summarize if necessary
5. Leave fields empty rather than speculating
6. Avoid redundancy while maintaining comprehensiveness
7. Please remove `information` fields in output array.

Return only supplemented JSON array as a response, without any additional comments.   
''' prompt, 'extract_data' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

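Several of the guidelines above can be verified mechanically after the model responds: no new Individuals, unchanged `full_name`, no `information` field, and field lengths within `maxLength`. The sketch below shows one way to do this in Python; `check_extraction` is a hypothetical helper, and the limits are copied from the schema in the prompt.

```python
import json
from typing import Any, Dict, List

# Trait limits copied from the schema in the 'extract_data' prompt.
TRAIT_LIMITS = {"sex": 100, "social_class": 800, "wealth": 800, "values": 1600}


def check_extraction(input_rows: List[Dict[str, Any]], raw_response: str) -> List[str]:
    """Return a list of problems found; an empty list means the output passed."""
    try:
        output_rows = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(output_rows, list):
        return ["response is not a JSON array"]
    expected = {row["id"]: row["full_name"] for row in input_rows}
    problems: List[str] = []
    for row in output_rows:
        if not isinstance(row, dict):
            problems.append("array item is not a JSON object")
            continue
        row_id = row.get("id")
        if not isinstance(row_id, int) or row_id not in expected:
            problems.append(f"unexpected id {row_id!r}")
            continue
        if row.get("full_name") != expected[row_id]:
            problems.append(f"id {row_id}: full_name was modified")
        if "information" in row:
            problems.append(f"id {row_id}: `information` was not removed")
        for field, limit in TRAIT_LIMITS.items():
            value = row.get(field, "")
            if not isinstance(value, str) or len(value) > limit:
                problems.append(f"id {row_id}: `{field}` is not a string within {limit} characters")
    return problems


input_rows = [{"id": 78, "full_name": "Victor Frankenstein, M.D.", "information": "Student"}]
good = json.dumps([{"id": 78, "full_name": "Victor Frankenstein, M.D.", "sex": "male"}])
bad = json.dumps([{"id": 5, "full_name": "X"}, {"id": 78, "full_name": "Victor", "information": "kept"}])
print(check_extraction(input_rows, good))
print(check_extraction(input_rows, bad))
```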

**Merge given character traits gathered from all book fragments**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT   
'''Your task is to merge human character analyses based on a book, preserving source information, while avoiding redundancy.

The input data is an array of JSON objects, each representing the same character but based on a different fragment of the same book.
The order of JSON objects is the same as the order of fragments in the book.
Please see input data schema definition:
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "id": {"type": "integer", "description": "A unique identifier of an Individual. Can be ignored."},
      "full_name": {"type": "string", "maxLength": 300, "description": "Full name including titles, nicknames, aliases, maiden names, or pseudonyms if mentioned, e.g.: 'Victor Frankenstein, M.D.'"},
      "sex": {"type": "string", "maxLength": 100, "description": "'male', 'female' or 'non-binary'"},
      "social_class": {"type": "string", "maxLength": 800, "description": "Economic and social standing (e.g., 'nobility', 'working class', 'merchant class')"},
      "wealth": {"type": "string", "maxLength": 800, "description": "Economic position, assets, property, financial struggles or abundance with information how wealth/income is obtained (inheritance, labor, trade, crime, patronage, etc.)"},
      "values": {"type": "string", "maxLength": 1600, "description": "Core principles, moral compass, priorities (can include both positive and negative values)"}
    }
  }
}

Between <character> tags is input JSON array with character data to be merged.
<character>
%s
</character>

As supplementary information, you can use:
- the summary of the whole book. It is placed between <summary> tags.
- short overall information about the given character. It is placed between <information> tags.
<summary>
%s
</summary>

<information>
%s
</information>


Important Guidelines for merging data:

1. The result should be just one single JSON object (not array) containing summarized information from all JSON objects in the array.
2. For each field, please prepare a comprehensive summary of the source fields, preserving the information from each object but avoiding redundancy.
3. The format of the resulting JSON object should be the same as the format of the JSON objects in the array.
4. Leave fields empty if they are empty in each JSON object, rather than speculating.

After preparing the JSON object, please again check each field against redundancy. Summarize it again if necessary to reduce redundant information.

Return only merged JSON object as a response, without any additional comments.   
''' prompt, 'merge_character' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);


**Correct potential errors in the final JSON object with character traits**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT   
'''Your task is to check and correct errors in input JSON object, if any.
Please see schema definition:
{
  "type": "object",
  "properties": {
    "id": {"type": "integer", "description": "A unique identifier of an Individual. Can be ignored."},
    "full_name": {"type": "string", "description": "Full name including titles, nicknames, aliases, maiden names, or pseudonyms if mentioned, e.g.: 'Victor Frankenstein, M.D.'"},
    "sex": {"type": "string", "description": "'male', 'female' or 'non-binary'"},
    "social_class": {"type": "string", "description": "Economic and social standing (e.g., 'nobility', 'working class', 'merchant class')"},
    "wealth": {"type": "string", "description": "Economic position, assets, property, financial struggles or abundance with information how wealth/income is obtained (inheritance, labor, trade, crime, patronage, etc.)"},
    "values": {"type": "string", "description": "Core principles, moral compass, priorities (can include both positive and negative values)"}
  }
}

The JSON object to be checked is between <character> tags below:
<character>
%s
</character>

Guidelines:
1. Please check if all fields (except `id`) are of text type, if not, then please convert them to text, preserving all the information. For example: if value is an array, please convert it to semicolon separated text containing all array elements.
2. If the field doesn't contain any real information, please remove the field altogether. For example, the field value may be: "Unknown", "not specified", "null", "None mentioned", empty text, etc.
3. Please correct JSON format errors, if any.

Return only corrected JSON object as a response, without any additional comments.   
''' prompt, 'json_final_check' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);


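The first two guidelines of this prompt are deterministic enough to express in code, which can help when testing the model's output or as a cheap fallback. The sketch below is an illustration rather than a replacement for the prompt: `normalize_character` is a hypothetical helper, and the `EMPTY_MARKERS` set is an assumed, non-exhaustive reading of the "etc." in guideline 2.

```python
from typing import Any, Dict

# Assumed, non-exhaustive list of values that carry no real information.
EMPTY_MARKERS = {"", "unknown", "not specified", "null", "none", "none mentioned"}


def normalize_character(character: Dict[str, Any]) -> Dict[str, Any]:
    """Convert non-text fields to text and drop fields without real information."""
    cleaned: Dict[str, Any] = {}
    for field, value in character.items():
        if field == "id":
            cleaned[field] = value  # `id` is exempt from the text-type rule
            continue
        if isinstance(value, (list, tuple)):
            # Guideline 1: arrays become semicolon-separated text.
            value = "; ".join(str(element) for element in value)
        elif value is None:
            value = ""
        elif not isinstance(value, str):
            value = str(value)
        # Guideline 2: remove fields that hold only a placeholder.
        if value.strip().lower() in EMPTY_MARKERS:
            continue
        cleaned[field] = value
    return cleaned


raw = {
    "id": 3,
    "full_name": "Old Jim",
    "sex": "Unknown",
    "values": ["Hard work", "honest trade"],
    "wealth": None,
}
print(normalize_character(raw))
```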

**Rank books based on inclusion of human characters - used in phase 1 when randomly searching for books suitable for processing**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT   
'''You are an assistant analysing book data.
Your task is to predict whether the given book contains meaningful information about human characters, whether fictitious or real.
You need to make the prediction based on the first fragment of the book, its title and metadata containing themes from the book.

The task is part of a project aimed at understanding human characteristics across different historical periods, geographical regions, and social statuses, based on human descriptions from books (both fictional and real).
Many books, like scientific ones, financial reports, etc., do not contain human characters at all or only vaguely mention humans. We need to exclude such books from further analysis.
We are interested in books containing rich descriptions of individual humans. Examples include biographies, all kinds of novels with lively human characters, and investigative journalism or reporting focused on humans.

The book data is provided below between <book> tags, in a JSON format.
Note: the book was scanned and may contain many optical character recognition (OCR) errors.
<book>
%s
</book>

As an answer, please return a number from 1 to 3, where:
  - "1" means: the book is not about humans
  - "2" means: human characters occur in the book but are scarcely described
  - "3" means: human characters occur and are depicted in detail

Please return just a number without additional comments.
''' prompt, 'rank_books' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

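Since the expected answer is a single digit, the response can be validated strictly before it is stored. A minimal sketch, assuming the raw answer is handled in Python; `parse_rank` is a hypothetical helper that rejects anything other than a lone 1, 2 or 3 (optionally quoted or followed by a period).

```python
import re
from typing import Optional


def parse_rank(raw_response: str) -> Optional[int]:
    """Return 1, 2 or 3, or None when the answer is not a single valid rank."""
    match = re.fullmatch(r'\s*"?([123])"?\.?\s*', raw_response)
    return int(match.group(1)) if match else None


for candidate in ["3", ' "2"\n', "4", "2 or 3"]:
    print(repr(candidate), parse_rank(candidate))
```

Returning `None` instead of guessing leaves the decision about unparsable answers (retry, skip the book, and so on) to the caller.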

**Split traits into semantically different parts and sanitize them by removing irrelevant information**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT
'''You are an expert in splitting information based on semantic meaning.

Your task is to divide the given text into semantically distinct pieces of information.

The text describes human characters %s.

Rules:
1. Split ONLY if the text contains semantically different parts
2. If the text describes a single coherent piece of information, return it as one item
3. Preserve context - don't split related concepts that belong together
4. Remove redundant words but keep meaning intact

Examples for three different types of traits: values, social class and wealth:
- Input: "Kingdom's strength, victory, leadership; Righteousness, religious observance, victory through God's help."
- Output: ["Kingdom's strength", "victory and leadership", "righteousness and religious observance", "God's help"]

- Input: "Lieutenant, Officer in Kmita's company. Son of the Kokosinski family who use the seal of Pypka. Former outlaw."  
- Output: ["Lieutenant, Officer", "Former outlaw", "Probably belongs to wealthy family"]

- Input: "Described as a poor exile without a roof over his head, implying no significant wealth. Also noted as being part of Kmita's company and having implied noble standing."
- Output: ["poor, no significant wealth", "homeless"]

Provide the output as a JSON array of strings. The schema definition is below:
{"type": "array", "items": {"type": "string"}}

The text for analysis is provided between <traits> tags below:
<traits>
%s
</traits>

As supplementary information, you can use the character description provided between <information> tags below:
<information>
%s
</information>

Ensure that the JSON array is correctly formatted.
Return only JSON array as a response, without any additional comments.
''' prompt, 'split_traits' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

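The prompt requests a bare JSON array of strings. The sketch below shows one defensive way to parse such a response in Python, assuming (this is not stated in the notebook) that a model may occasionally wrap the array in a Markdown code fence. `parse_string_array` is a hypothetical helper.

```python
import json
import re
from typing import List

# Built programmatically so this example contains no literal Markdown fence.
FENCE = "`" * 3
FENCED_PAYLOAD = re.compile(
    re.escape(FENCE) + r"[a-zA-Z]*\s*(.*?)\s*" + re.escape(FENCE), flags=re.DOTALL
)


def parse_string_array(raw_response: str) -> List[str]:
    """Parse a JSON array of strings, tolerating an optional Markdown fence."""
    text = raw_response.strip()
    fenced = FENCED_PAYLOAD.fullmatch(text)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only non-empty strings, trimmed of surrounding whitespace.
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]


wrapped = FENCE + 'json\n["poor, no significant wealth", " homeless "]\n' + FENCE
print(parse_string_array(wrapped))
```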

**Analyze a sample of traits from each cluster to name the cluster and provide a short description**

In [None]:
%%bigquery
MERGE prompts p USING (SELECT
'''You are an expert in text clusters classification.
Your task is to name and describe each cluster extracted from a large number of examples describing human characters %s.

As input you have a JSON array containing the cluster id (field `cluster_id`) and examples of traits fitting into this cluster (field `examples`).
The input is between <clusters> tags.
<clusters>
%s
</clusters>

For each cluster please:
  - invent a short name of one or a few words that describes it most adequately
  - provide an adequate description, one or a few sentences long
Please output the result as a JSON array with the following schema:
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "cluster_id": {"type": "integer"},
      "cluster_name": {"type": "string", "maxLength": 50},
      "cluster_description": {"type": "string", "maxLength": 1000}
    }
  }
}

Ensure that the JSON array is correctly formatted.
Return only JSON array as a response, without any additional comments.
''' prompt, 'cluster_traits' code) ip ON p.code = ip.code
WHEN MATCHED THEN UPDATE SET prompt = ip.prompt
WHEN NOT MATCHED THEN INSERT (code, prompt) VALUES(code, prompt);

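The `<clusters>` placeholder expects a JSON array of objects with `cluster_id` and `examples`. The sketch below shows one way such a payload could be assembled from clustered traits, assuming the traits are available as a Python dictionary keyed by cluster id. `build_clusters_payload`, the sample size and the fixed seed are illustrative choices, not values taken from this notebook.

```python
import json
import random
from typing import Dict, List


def build_clusters_payload(
    traits_by_cluster: Dict[int, List[str]], sample_size: int = 20, seed: int = 0
) -> str:
    """Serialize a bounded sample of traits per cluster for the <clusters> placeholder."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    payload = []
    for cluster_id in sorted(traits_by_cluster):
        traits = traits_by_cluster[cluster_id]
        if len(traits) > sample_size:
            examples = rng.sample(traits, sample_size)
        else:
            examples = list(traits)
        payload.append({"cluster_id": cluster_id, "examples": examples})
    # ensure_ascii=False keeps non-English names readable in the prompt.
    return json.dumps(payload, ensure_ascii=False)


traits = {
    2: ["honest trade", "hard work", "family legacy"],
    1: ["nobility"],
}
clusters_json = build_clusters_payload(traits, sample_size=2)
print(clusters_json)
```

Sampling bounds the prompt size per cluster; how many examples are enough to name a cluster well is a tuning question this sketch does not answer.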