# Schema.org generation

- Input: existing scrape data, possibly just the summarized markdown
- Output:
    - Ideally, the right schema type to use, chosen with chain-of-thought reasoning
    - A Schema.org instance for the company with as many properties filled in as the data supports

## Status

This just loads the summary I've already produced and fills in an Organization schema.

Limitations:
- It doesn't look up properties for nested schemas (e.g. PostalAddress, Person)
- It doesn't generalize to types other than Organization
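
One way to start on the first limitation is to list which nested types a generated document contains, so that a later pass could look up each type's properties the same way as for Organization. A minimal sketch, assuming the model's output has already been parsed into a Python dict; `collect_nested_types` is a hypothetical helper, not part of this notebook or the schemaorg package:

```python
from typing import Any


def collect_nested_types(node: Any, is_root: bool = True) -> set[str]:
    """Return the @type values of every nested JSON-LD object below the root."""
    found: set[str] = set()
    if isinstance(node, dict):
        node_type = node.get("@type")
        # The root's own type (Organization) is already handled, so skip it.
        if not is_root and isinstance(node_type, str):
            found.add(node_type)
        for value in node.values():
            found |= collect_nested_types(value, is_root=False)
    elif isinstance(node, list):
        for item in node:
            found |= collect_nested_types(item, is_root=False)
    return found


example = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "address": {"@type": "PostalAddress", "addressLocality": "Seattle"},
    "founders": [{"@type": "Person", "name": "Jeff Greenstein"}],
}
print(sorted(collect_nested_types(example)))
```

Each returned type name could then be passed to `Schema(...)` to fetch its property list, though that second pass is not implemented here.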

In [1]:
import json
from unified import UnifiedResult

with open("../output/data/98point6.json", "r") as f:
    data = UnifiedResult(**json.load(f))


In [4]:
from core import init, Seed

init()

In [34]:
from schemaorg.main import Schema

organization_schema = Schema("Organization")
organization_properties = organization_schema.type_spec["properties"]
# allowed_properties

Specification base set to https://www.schema.org
Using Version 12.0
Found https://www.schema.org/Organization
Organization: found 76 properties
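
The property list could also be used after generation to drop any top-level keys the model invents. A rough sketch, assuming `organization_properties` is either a comma-separated string or a list of property URLs such as `https://schema.org/address` (the exact shape returned by `type_spec["properties"]` is not confirmed here); `property_names` and `keep_allowed` are hypothetical helpers:

```python
from typing import Any, Iterable, Union


def property_names(properties: Union[str, Iterable[str]]) -> set[str]:
    """Reduce property URLs (or bare names) to their final path segment."""
    if isinstance(properties, str):
        properties = properties.split(",")
    return {p.strip().rstrip("/").rsplit("/", 1)[-1] for p in properties if p.strip()}


def keep_allowed(doc: dict[str, Any], allowed: set[str]) -> dict[str, Any]:
    """Keep JSON-LD keywords (@context, @type, ...) and allowed top-level properties."""
    return {k: v for k, v in doc.items() if k.startswith("@") or k in allowed}


# Assumed input shape: a comma-separated string of property URLs.
allowed = property_names("https://schema.org/name, https://schema.org/address")
doc = {"@type": "Organization", "name": "98point6 Technologies", "mascot": "n/a"}
print(keep_allowed(doc, allowed))
```

This only checks top-level keys; properties inside nested objects such as `address` are left untouched.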


In [37]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate


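# Use prompt variables instead of f-strings: with f-strings, any literal
# braces in the scraped markdown would be parsed as (missing) template variables.
_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
"""
You'll read information about a company and generate a json-ld representation that uses schema.org vocabulary.

These are the schema.org properties for an Organization:
{organization_properties}

When generating the json-ld representation, do not include any placeholder values; only include the properties that have values in the human input.
"""
        ),
        (
            "human",
"""
Company Name: {company}
Domain: {domain}

Summary:
{summary}

Crunchbase:
{crunchbase}

General search results:
{general_search}

Glassdoor summary:
{glassdoor}

Customer experience summary:
{customer_experience}
""",
        ),
    ]
)


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
runnable = _prompt | llm

# Values are substituted at invoke time and are not re-parsed as templates.
result = runnable.invoke(
    {
        "organization_properties": organization_properties,
        "company": data.target.company,
        "domain": data.target.domain,
        "summary": data.summary_markdown,
        "crunchbase": data.crunchbase_markdown,
        "general_search": data.general_search_markdown,
        "glassdoor": data.glassdoor_markdown,
        "customer_experience": data.customer_experience_markdown,
    }
)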

print(result.content)

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "98point6 Technologies",
  "url": "http://www.98point6.com",
  "logo": "https://www.98point6.com/logo.png",  // Placeholder for logo URL
  "foundingDate": "2015-03-01",
  "founders": [
    {
      "@type": "Person",
      "name": "Jeff Greenstein"
    }
  ],
  "numberOfEmployees": 398,
  "description": "98point6 Technologies specializes in digital healthcare solutions, providing a cloud-based virtual care platform that integrates artificial intelligence with board-certified physicians to deliver primary care services.",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "Seattle",
    "addressRegion": "WA",
    "addressCountry": "USA"
  },
  "areaServed": "USA",
  "contactPoint": {
    "@type": "ContactPoint",
    "telephone": "+1-800-123-4567",  // Placeholder for telephone number
    "contactType": "Customer Service"
  },
  "employee": [
    {
      "@type": "Person",
      "name": "Ja