# Proof-of-concept search module

This module fetches 100+ Google search results, then uses an LLM to organize and re-rank them.

## Ideas to consider

- Multiple searches: the company alone, company + product, product reviews, and all-time vs. recent-only results
- Give the LLM a markdown template to fill out that includes an "Other" category, then refine the listings that land in "Other"
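
The multiple-searches idea could be sketched as a small query builder. This is only an illustration: `build_queries` is a hypothetical helper, the wording of the reviews query is an assumption, and the all-time vs. recent split is left out because it depends on how the `search` function exposes date filtering.

```python
from typing import List


def build_queries(company: str, product: str) -> List[str]:
    """Build exact-phrase query variants for a company and its product.

    Hypothetical helper: the name and the query wording are assumptions
    for illustration, not part of the original module.
    """
    queries = [f'"{company}"']
    if product != company:
        # Only add the combined query when it differs from the company query.
        queries.append(f'"{company}" "{product}"')
    # Reviews are searched on the product name.
    queries.append(f'"{product}" reviews')
    return queries


print(build_queries("98point6", "98point6"))
```

Each query would then be passed to `search` in turn and the result lists concatenated.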

In [1]:
from core import CompanyProduct, init_requests_cache, init_langchain_cache, make_experiment_dir

init_requests_cache()
init_langchain_cache()

'/home/keith/company-detective/.cache/langchain.sqlite'

In [9]:
target = CompanyProduct.same("98point6")
experiment_dir = make_experiment_dir(target)

In [10]:
from search import search
from pprint import pprint

search_results = list(search(f'"{target.company}"', num=100))
if target.product != target.company:
    search_results += list(search(f'"{target.company}" "{target.product}"', num=100))
pprint(search_results)

2024-08-19 10:28:51.017 | DEBUG    | search:search:68 - Google search results: {'kind': 'customsearch#search', 'url': {'type': 'application/json', 'template': 'https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json'}, 'queries': {'request': [{'title': 'Google Custom Search - "98point6"', 'totalResults': '62100', 'searchTerms': '"98point6"

[SearchResult(title='98point6 Virtual Care Platform for async and real-time telehealth', link='https://www.98point6.com/', snippet='98point6 empowers health systems to decrease the administrative burden on clinicians, promote quality, and increase patient satisfaction.', formattedUrl='https://www.98point6.com/'),
 SearchResult(title='98point6 hit by new layoffs in latest change at health tech startup ...', link='https://www.geekwire.com/2024/98point6-hit-by-new-layoffs-in-latest-change-at-health-tech-startup/', snippet='Apr 23, 2024 ... In March of last year, 98point6 announced that it was selling its virtual care platform and primary care business to Transcarent for $100\xa0...', formattedUrl='https://www.geekwire.com/.../98point6-hit-by-new-layoffs-in-latest-change...'),
 SearchResult(title='Careers | 98point6 Technologies - Seattle', link='https://www.98point6.com/about-us/careers/', snippet="98point6 Technologies is on a mission to provide equitable access to exceptional care. We'r
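
When the company query and the company + product query are both run, the combined list may contain the same page twice. A minimal sketch of de-duplicating by link follows. `Result` is a stand-in for `search.SearchResult` reduced to three fields, `dedupe_by_link` is a hypothetical helper, and the sample snippets are made up for the example.

```python
from typing import Iterable, List, NamedTuple


class Result(NamedTuple):
    """Stand-in for search.SearchResult, reduced to the fields used here."""
    title: str
    link: str
    snippet: str


def dedupe_by_link(results: Iterable[Result]) -> List[Result]:
    """Keep the first result for each link, preserving the original order."""
    seen = set()
    unique = []
    for result in results:
        if result.link not in seen:
            seen.add(result.link)
            unique.append(result)
    return unique


# Made-up sample data: the third entry repeats the first link.
combined = [
    Result("98point6", "https://www.98point6.com/", "home page"),
    Result("Careers", "https://www.98point6.com/about-us/careers/", "jobs"),
    Result("98point6 again", "https://www.98point6.com/", "same page, second query"),
]
print([r.title for r in dedupe_by_link(combined)])
```

Keeping the first occurrence preserves the ranking of the earlier, broader query.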

In [11]:
from typing import List
from search import SearchResult

def result_to_markdown(search_result: SearchResult) -> str:
    return f"[{search_result.title}]({search_result.link})\n{search_result.snippet}"

def results_to_markdown(search_results: List[SearchResult]) -> str:
    return "\n\n".join(result_to_markdown(result) for result in search_results)

print(results_to_markdown(search_results))

[98point6 Virtual Care Platform for async and real-time telehealth](https://www.98point6.com/)
98point6 empowers health systems to decrease the administrative burden on clinicians, promote quality, and increase patient satisfaction.

[98point6 hit by new layoffs in latest change at health tech startup ...](https://www.geekwire.com/2024/98point6-hit-by-new-layoffs-in-latest-change-at-health-tech-startup/)
Apr 23, 2024 ... In March of last year, 98point6 announced that it was selling its virtual care platform and primary care business to Transcarent for $100 ...

[Careers | 98point6 Technologies - Seattle](https://www.98point6.com/about-us/careers/)
98point6 Technologies is on a mission to provide equitable access to exceptional care. We're collaborators, innovators and passionate problem-solvers ...

[98point6 Technologies Announces the Acquisition of Bright.md to ...](https://www.prnewswire.com/news-releases/98point6-technologies-announces-the-acquisition-of-brightmd-to-accelerate-the-

In [12]:
from typing import List
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages.ai import AIMessage
from langchain_openai import ChatOpenAI

from core import CompanyProduct, URLShortener
from dotenv import load_dotenv

load_dotenv()


prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """
You're an expert at organizing search results.
Given search results for a company or product, organize them into the following headers:

# Official social media
# Job boards
# App stores
# Product reviews
# News articles (most recent first, grouped by event)
# Key employees (with subheaders by employee)
# Other pages on the company website
# Business intelligence websites
# Other

Include the publication date after the link, if available.

Unless otherwise specified, order the results in each section from most to least relevant.
Format the output as a markdown document, preserving any links in the source.
Organize ALL search results into these headers; do not omit any results.
            """,
        ),
        (
            "human",
            """
Company: {company_name}
Product: {product_name}

Search results: 
{text}
            """,
        ),
    ]
)

from loguru import logger

def summarize(
    target: CompanyProduct, search_results: List[SearchResult], debug=True, shorten_urls=False
) -> AIMessage:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    unified_markdown = results_to_markdown(search_results)
    input_len = len(unified_markdown)

    if shorten_urls:
        url_shortener = URLShortener()
        unified_markdown = url_shortener.shorten_markdown(unified_markdown)


    runnable = prompt | llm
    result = runnable.invoke({"text": unified_markdown, "company_name": target.company, "product_name": target.product})
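    # str.strip("...") removes any of the given characters from both ends, not a literal
    # prefix or suffix, so match the fence explicitly instead (requires Python 3.9+).
    result.content = result.content.strip().removeprefix("```markdown").removesuffix("```").strip()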

    if shorten_urls:
        result.content = url_shortener.unshorten_markdown(result.content)

    logger.info(f"{input_len:,} -> {len(result.content):,} chars ({len(result.content) / input_len:.0%})")

    return result
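
Models often wrap markdown output in a code fence, and `str.strip` with a string argument removes any of those characters from both ends instead of a literal prefix or suffix. A minimal sketch of an explicit fence remover, assuming at most one wrapping fence; `strip_markdown_fence` is a hypothetical name that does not appear in the original code.

```python
FENCE = "`" * 3  # three backticks


def strip_markdown_fence(text: str) -> str:
    """Remove one wrapping markdown code fence, if present."""
    text = text.strip()
    # Try the longer opener first so the language tag is removed with the fence.
    for opener in (FENCE + "markdown", FENCE):
        if text.startswith(opener):
            text = text[len(opener):]
            break
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return text.strip()


wrapped = FENCE + "markdown\n# Other\n- item\n" + FENCE
print(strip_markdown_fence(wrapped))
```

Text without a fence passes through unchanged apart from surrounding whitespace.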


In [13]:
summary = summarize(target, search_results, shorten_urls=True)
print(summary.content)

with open(f"{experiment_dir}/search_results_url_shortener_v2.md", "w", encoding="utf-8") as f:
    f.write(summary.content)

    f.write("\n# Sources\n")
    for result in search_results:
        f.write(result_to_markdown(result) + "\n\n")

2024-08-19 10:28:53.924 | INFO     | core:shorten_markdown:185 - 26,533 -> 21,111 chars (80% of original)
2024-08-19 10:29:12.679 | INFO     | core:unshorten_markdown:200 - 3,617 -> 5,714 chars (158% of original)
2024-08-19 10:29:12.680 | INFO     | __main__:summarize:70 - 26,533 -> 5,714 chars (22%)



# Official social media
- [98point6 Technologies Inc. | LinkedIn](https://www.linkedin.com/company/98point6-tech-inc)
- [98point6 Technologies (@98point6) • Instagram](https://www.instagram.com/98point6/?hl=en)
- [98point6 Technologies Inc. (@98point6Inc) / X](https://twitter.com/98point6inc?lang=en)

# Job boards
- [Working at 98point6 | Glassdoor](https://www.glassdoor.com/Overview/Working-at-98point6-EI_IE1181484.11,19.htm)
- [Jobs at 98point6 - Otta](https://app.otta.com/companies/98point6)
- [98point6 Careers | Wellfound (formerly AngelList Talent)](https://wellfound.com/company/98point6)
- [50+ 98point6 Jobs, Employment August 13, 2024| Indeed.com](https://www.indeed.com/q-98point6-jobs.html)

# App stores
- [98point6 - Apps on Google Play](https://play.google.com/store/apps/details?id=com.ninety8point6.patientapp&hl=en_US) (Jun 5, 2024)
- [98point6 on the App Store](https://apps.apple.com/us/app/98point6/id1157653928)

# Product reviews
- [Read Customer Service Reviews of www.9

In [7]:
# Test a few different URL compression schemes

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
for url in [
    "https://www.apollo.io/companies/Pomelo-Care/6196af0887796a008c77f450",
    "cache://www.apollo.io/1",
    "cache://www.apollo.io/15",
    "cache://apollo/15",
    "http://apollo/15",
    ]:
    print(f"{llm.get_num_tokens(url):,} tokens: {url}")

25 tokens: https://www.apollo.io/companies/Pomelo-Care/6196af0887796a008c77f450
8 tokens: cache://www.apollo.io/1
8 tokens: cache://www.apollo.io/15
5 tokens: cache://apollo/15
5 tokens: http://apollo/15
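
The token counts above suggest replacing long URLs with short placeholders before the LLM call and restoring them afterwards. A minimal sketch of that round trip follows. `SimpleURLShortener` is a hypothetical stand-in, not the `URLShortener` class from `core`, whose implementation is not shown here. It uses the `cache://host/N` scheme with the full host and assumes URLs contain no parentheses or whitespace.

```python
import re
from typing import Dict
from urllib.parse import urlparse


class SimpleURLShortener:
    """Swap long URLs in markdown links for short placeholders and back.

    Hypothetical illustration only; not the URLShortener from core.
    """

    # Matches the URL inside a markdown link: ](https://...)
    _LINK = re.compile(r"\]\((https?://[^)\s]+)\)")
    # Matches a placeholder of the form cache://host/N.
    _SHORT = re.compile(r"cache://[^)\s/]+/\d+")

    def __init__(self) -> None:
        self.short_to_long: Dict[str, str] = {}
        self.long_to_short: Dict[str, str] = {}

    def _shorten(self, url: str) -> str:
        # Reuse the placeholder if this URL was seen before.
        if url not in self.long_to_short:
            host = urlparse(url).netloc
            short = f"cache://{host}/{len(self.short_to_long) + 1}"
            self.long_to_short[url] = short
            self.short_to_long[short] = url
        return self.long_to_short[url]

    def shorten_markdown(self, markdown: str) -> str:
        return self._LINK.sub(lambda m: f"]({self._shorten(m.group(1))})", markdown)

    def unshorten_markdown(self, markdown: str) -> str:
        # Placeholders that were never issued are left untouched.
        return self._SHORT.sub(
            lambda m: self.short_to_long.get(m.group(0), m.group(0)), markdown
        )


shortener = SimpleURLShortener()
original = "[Apollo](https://www.apollo.io/companies/Pomelo-Care/6196af0887796a008c77f450)\nsnippet"
shortened = shortener.shorten_markdown(original)
print(shortened)
print(shortener.unshorten_markdown(shortened) == original)
```

Keeping the host in the placeholder lets the model still recognize the site (for example, to sort a link under job boards), at the cost of a few more tokens than the `cache://apollo/15` variant.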
