In [29]:
json_file = "../output/data/scrapy/synthesize.bio.json"

output_file_base = "../output/data/scrapy/synthesize.bio"

In [30]:
import json

with open(json_file, "r", encoding="utf-8") as file:
    data = json.load(file)

total_chars = 0
for page in sorted(data, key=lambda p: p["url"]):
    truncated_page = page["markdown"][:4000]
    print(f"""
# {page["url"]}
{len(page["markdown"]):,} chars in markdown

{truncated_page}...
""")
    total_chars += len(truncated_page)

print(f"Total chars: {total_chars:,}")


# https://app.synthesize.bio/sign-in?redirect_url=http%3A%2F%2Fapp.synthesize.bio%2Fdatasets
118 chars in markdown

# [Synthesize.bio logo](https://app.synthesize.bio/sign-in?redirect_url=http%3A%2F%2Fapp.synthesize.bio%2Fdatasets)
or...


# https://trust.synthesize.bio
7,017 chars in markdown

# [SYNTHESIZE BIO, INC.](https://trust.synthesize.bio)
Change Management Policy

A Change Management Policy governs the documenting, tracking, testing, and approving of system, network, security, and infrastructure changes.

Segregation of Environments

Development, staging, and production environments are segregated.

Secure Development Policy

A Secure Development Policy defines the requirements for secure software and system development and maintenance.

Software changes are tested prior to deployment

Software changes are tested prior to being deployed into production.

Production Data Use is Restricted

Production data is not used in the development and testing environments, unless require
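
The preview loop above and the later prompts each cut every page to 4,000 characters independently. A minimal sketch of a helper that does the truncation once, in sorted-URL order, and can also enforce an overall character budget. `build_context`, `per_page_limit`, and `total_limit` are hypothetical names that do not appear in the notebook, and the page structure (dicts with `url` and `markdown` keys) is assumed from the loading code above:

```python
def build_context(pages, per_page_limit=4000, total_limit=None):
    """Join truncated page markdown into one prompt-ready string.

    Hypothetical helper: `pages` is assumed to be a list of dicts with
    "url" and "markdown" keys, as in the scraped JSON loaded above.
    """
    parts = []
    used = 0
    for page in sorted(pages, key=lambda p: p["url"]):
        text = page["markdown"][:per_page_limit]
        if total_limit is not None:
            remaining = total_limit - used
            if remaining <= 0:
                break  # overall budget exhausted; skip the remaining pages
            text = text[:remaining]
        parts.append(f"# {page['url']}\n{text}")
        used += len(text)  # only page text counts against the budget
    return "\n\n".join(parts)


# Made-up sample pages, only to show the ordering and truncation.
sample_pages = [
    {"url": "https://example.com/b", "markdown": "bbbbbbbbbb"},
    {"url": "https://example.com/a", "markdown": "aaaaaaaaaa"},
]
context = build_context(sample_pages, per_page_limit=4, total_limit=6)
print(context)
```

Prefixing each chunk with its URL is a design choice of this sketch; it lets the model attribute statements to a page, at the cost of a few extra characters per page.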

In [31]:
from core import init

init()

# Version 1: Basic summary

In [32]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Summarize the truncated page markdown loaded above

# Initialize the OpenAI model
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Create a ChatPromptTemplate with system guidance
prompt = ChatPromptTemplate([
    ("system", "You are a summarization assistant. Summarize the following markdown content."),
    ("user", "{markdowns}"),
])

# Extract the truncated markdowns from the data
markdowns = "\n\n".join(page["markdown"][:4000] for page in data)

# Compose the prompt and the model into a runnable chain with the pipe operator
chain = prompt | llm

# Run the chain with the markdowns as input
summary = chain.invoke({"markdowns": markdowns})

print(summary.content)

# Write the summary to a markdown file
summary_file_path = f"{output_file_base}_summary.md"
with open(summary_file_path, "w") as summary_file:
    summary_file.write(summary.content)

Synthesize Bio is an organization focused on accelerating biomedical data generation and analysis, recently receiving its first SOC 2® report regarding its security controls. The company has a comprehensive Privacy Policy effective from July 9, 2024, detailing how it processes personal information collected through its services, including user-provided data and information from third-party sources.

The organization emphasizes security through various policies, including Change Management, Secure Development, and Information Security Policies, ensuring that system changes are documented, tested, and approved. They also conduct annual security training for personnel and perform background checks for new hires.

Synthesize Bio's Terms of Use include an arbitration agreement for dispute resolution, encouraging informal resolution before arbitration. The company promotes a diverse and inclusive workplace and is currently offering limited free access to its services to gather user feedback 

# Version 2: Ask questions then summarize

In [33]:
# Define a list of questions about the company
questions = [
    "How does the company make money?",
    "Is the company more B2C or B2B?",
    "If B2B, what are the typical customers? Please include example customers if known.",
    "If B2C, what are the typical customer demographics?",
    "When was the company founded?",
    "How many employees does the company have?",
    "What is the company mission?",
    "What funding rounds has the company raised?",
    "What products does the company produce?",
    "Describe the scale of the company if possible, such as the number of customers, users, clients, or revenue.",
    "How are the company's products distributed or sold?",
    "How has the company changed over time?",
    "Does the company have any third-party certifications or awards?",

# Earlier drafts and ideas for further questions (commented out, not sent to the model):
# - When was the company founded?
# - Approximately how many employees work at the company?
# - What products does the company produce? What services does the company offer?
# - How does the company make money? Who are their customers in general? Is it B2B, B2C? If B2B, include example customers.
# - Approximately how much revenue does the company generate annually?
# - Describe the scale of the company if possible, including the number of customers, users, or clients.
# - How are the company's products distributed or sold to users?
# - How has the company changed over time?
# - How do third parties describe the company, compared to how the company describes itself?
# - Acquisitions
# - Partnerships
# - Fundraising events
# - Opinions about the company
# - The scale of the company in terms of employee, active users, or revenue
# - New product developments
# - Information about any company executives including any relevant quotes
# - General information about the company
# - General information about the product
# - Any major changes in the company or product 

]

# Create a ChatPromptTemplate with system guidance and questions
question_prompt = ChatPromptTemplate(
    [
        ("system", "You are an assistant that answers questions about a company based on provided markdown content."),
        ("user", "Documents: \n\n{markdowns}"),
        ("user", "Questions to answer:\n" + "\n".join(questions)),
    ]
)

# Compose the question prompt and the model into a runnable chain
question_chain = question_prompt | llm

# Run the chain with the markdowns as input
answers = question_chain.invoke({"markdowns": markdowns})

# Print the answers
print(answers.content)

Based on the provided information, here are the answers to your questions about Synthesize Bio:

1. **How does the company make money?**
   - Synthesize Bio generates revenue by providing an online platform for biomedical data generation and analysis, likely through subscription fees, service fees, or transactional fees associated with the use of their platform.

2. **Is the company more B2C or B2B?**
   - The company operates primarily in a B2B (Business-to-Business) model, as it provides services to enterprises and organizations in the life sciences and biomedical research sectors.

3. **If B2B, what are the typical customers? Please include example customers if known.**
   - Typical customers include businesses and organizations involved in life sciences, biomedical research, and possibly pharmaceutical companies. Specific example customers are not provided in the available information.

4. **When was the company founded?**
   - The founding date of Synthesize Bio is not specified i
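
The question prompt above concatenates the question text directly into the template string, so a literal `{` or `}` in a question would be read as a template variable. A minimal sketch of one way to guard against that, assuming the default f-string-style template format. `escape_braces` is a hypothetical helper, and plain `str.format` stands in for the prompt template's substitution, which follows the same brace convention:

```python
def escape_braces(text):
    """Double literal braces so f-string-style templates treat them as text."""
    return text.replace("{", "{{").replace("}", "}}")


# Hypothetical question containing literal braces.
question = "Which plans exist, e.g. {free, paid}?"
template = (
    "Documents:\n\n{markdowns}\n\nQuestions to answer:\n" + escape_braces(question)
)

# str.format stands in for the prompt template's variable substitution here.
rendered = template.format(markdowns="(page text)")
print(rendered)
```

The questions in this notebook contain no braces, so the original cell works as written; the escaping only matters if the list is later extended with such text.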

In [34]:
# Create a new ChatPromptTemplate with system guidance, including the answered questions
stage2_prompt = ChatPromptTemplate(
    [
        ("system", "You are a summarization assistant. Summarize the following markdown content, considering the provided answers to questions about the company."),
        ("user", "Documents: \n\n{markdowns}"),
        ("user", "Answers to questions:\n{answered_questions}"),
    ]
)

# Compose the stage-two prompt and the model into a runnable chain
stage2_chain = stage2_prompt | llm

# Run the chain with the markdowns and the stage-one answers as input
stage2_summary = stage2_chain.invoke({"markdowns": markdowns, "answered_questions": answers.content})

print(stage2_summary.content)

# Write the stage2 summary to a markdown file
guided_summary_file_path = f"{output_file_base}_guided_summary.md"
with open(guided_summary_file_path, "w") as guided_summary_file:
    guided_summary_file.write(stage2_summary.content)

Synthesize Bio is a company focused on accelerating genomic data generation, analysis, and hypothesis testing through its online platform. It primarily operates on a B2B model, serving enterprises and organizations in the life sciences and biomedical research sectors. The company generates revenue through subscription fees, service fees, or transactional fees associated with its platform.

Key aspects of Synthesize Bio include:

- **Mission**: To enable scientists to unlock new insights and develop groundbreaking solutions faster.
- **Security**: The company has received its first SOC 2® report, demonstrating its commitment to security controls.
- **Privacy Policy**: Synthesize Bio outlines how it processes personal information collected through its services, emphasizing user data protection.
- **Change Management**: The company has established policies for secure software development, access control, and information security training for personnel.

While specific details about the fo
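
The two cells above form a two-stage pipeline: answer the questions first, then summarize with those answers as extra context. A minimal sketch of that flow as a single function, written against a generic callable so it can be tried without an API key. `two_stage_summary`, `ask`, and `fake_ask` are hypothetical names, and the prompts are shortened paraphrases of the ones in the notebook:

```python
def two_stage_summary(documents, questions, ask):
    """Answer the questions first, then summarize using those answers.

    `ask(system, user)` is a hypothetical callable wrapping whatever model
    is used; it takes a system instruction and a user message and returns
    the reply text.
    """
    question_block = "\n".join(questions)
    answers = ask(
        "You answer questions about a company based on provided markdown content.",
        f"Documents:\n\n{documents}\n\nQuestions to answer:\n{question_block}",
    )
    summary = ask(
        "You are a summarization assistant. Summarize the documents, "
        "considering the provided answers.",
        f"Documents:\n\n{documents}\n\nAnswers to questions:\n{answers}",
    )
    return answers, summary


calls = []


def fake_ask(system, user):
    # Stand-in for a model call: records the prompt and returns a canned reply.
    calls.append((system, user))
    return f"reply {len(calls)}"


answers, summary = two_stage_summary(
    "Some page text.", ["What is the mission?"], fake_ask
)
print(answers, "|", summary)
```

Note that the documents are sent twice, once per stage, which roughly doubles the input size compared with a single-stage prompt.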

# Version 3: A more reasonable one-stage prompt

In [35]:
# Create a new ChatPromptTemplate with system guidance, including the questions
single_stage_prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a summarization assistant. Summarize the following markdown content, considering the following questions for guidance:\n"
            + "\n".join(questions),
        ),
        ("user", "Documents: \n\n{markdowns}"),
    ]
)

# Compose the single-stage prompt and the model into a runnable chain
single_stage_chain = single_stage_prompt | llm

# Run the chain with the markdowns as input
single_stage_summary = single_stage_chain.invoke({"markdowns": markdowns})

print(single_stage_summary.content)

# Write the single-stage summary to a markdown file
single_stage_summary_file_path = f"{output_file_base}_single_stage_summary.md"
with open(single_stage_summary_file_path, "w") as single_stage_summary_file:
    single_stage_summary_file.write(single_stage_summary.content)

**Company Overview: Synthesize Bio**

- **Founded**: The specific founding date is not mentioned in the provided documents.
- **Employees**: The number of employees is not specified.
- **Mission**: Synthesize Bio aims to accelerate genomic data generation, analysis, and hypothesis testing, empowering scientists to unlock insights and develop solutions faster in the life sciences field.
- **Funding Rounds**: The documents do not provide information on funding rounds.
- **Products**: Synthesize Bio offers an online platform for biomedical data generation and analysis, leveraging AI and bioinformatics.
- **Scale**: The scale of the company in terms of customers, users, or revenue is not detailed, but they are offering limited free access to gather feedback from early users.
- **Distribution/Sales**: The platform is accessed online, and the company collects personal information through its service for user engagement and marketing.
- **B2B or B2C**: The company operates primarily in a B2B 
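
Several bullets in the output above report that the documents do not contain the requested fact. A rough sketch for flagging such lines so the three versions can be compared on how many questions went unanswered. `find_unanswered` and `MISSING_PATTERN` are hypothetical, and the phrase list is an assumption based only on the wording seen in this sample output:

```python
import re

# Assumed phrasings, taken from the wording in the sample output; not exhaustive.
MISSING_PATTERN = re.compile(
    r"\bnot (?:mentioned|specified|detailed|provided)\b|\bdo(?:es)? not provide\b",
    re.IGNORECASE,
)


def find_unanswered(summary_text):
    """Return the non-empty lines that appear to say the information is missing."""
    return [
        line.strip()
        for line in summary_text.splitlines()
        if line.strip() and MISSING_PATTERN.search(line)
    ]


sample = """- **Founded**: The specific founding date is not mentioned in the provided documents.
- **Employees**: The number of employees is not specified.
- **Funding Rounds**: The documents do not provide information on funding rounds.
- **Products**: Synthesize Bio offers an online platform."""

gaps = find_unanswered(sample)
print(len(gaps))
```

A keyword match like this is only a heuristic; a model can phrase a gap in ways the pattern does not cover.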