# Unified summary, Version 2!

Key changes from version 1:
- Organized by topic rather than data source
- More data sources: 
    - Indeed job descriptions
    - Crunchbase
    - General search results
- Technical
    - Permalinks are captured at the source and piped through every stage, rather than each pipeline handling them differently
    - Extract, organize, then abstract
    - Heavy use of caching
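
As a rough illustration of "extract, organize, then abstract" with permalinks carried through, here is a minimal sketch. The `Quote` class and the `organize` and `render` helpers are hypothetical names invented for this example, not the pipeline's actual code, and the sample quotes are placeholders. The citation shape mirrors the `[(Author or Title, Source, Date)](url)` format requested in the report prompt later in this notebook.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """One extracted statement and its provenance (hypothetical structure)."""

    text: str
    topic: str  # organizing key, e.g. "Working here"
    author: str
    source: str  # e.g. "Reddit", "Glassdoor"
    date: str  # ISO date, so string sorting is chronological
    permalink: str

    def to_markdown(self) -> str:
        # Same citation shape the report prompt asks for: [(Author, Source, Date)](url)
        return f'- "{self.text}" [({self.author}, {self.source}, {self.date})]({self.permalink})'


def organize(quotes: list[Quote]) -> dict[str, list[Quote]]:
    """Group quotes by topic rather than by source, newest first."""
    by_topic: dict[str, list[Quote]] = defaultdict(list)
    for quote in quotes:
        by_topic[quote.topic].append(quote)
    return {
        topic: sorted(items, key=lambda q: q.date, reverse=True)
        for topic, items in by_topic.items()
    }


def render(by_topic: dict[str, list[Quote]]) -> str:
    """Render the organized quotes as markdown for the final abstraction (LLM) step."""
    return "\n\n".join(
        "\n".join([f"# {topic}"] + [q.to_markdown() for q in items])
        for topic, items in by_topic.items()
    )


# Placeholder data for illustration only, not real quotes
quotes = [
    Quote("Example opinion A", "Product", "user_a", "Reddit", "2023-05-01", "https://example.com/a"),
    Quote("Example opinion B", "Working here", "Anonymous", "Glassdoor", "2024-01-15", "https://example.com/b"),
    Quote("Example opinion C", "Product", "user_c", "Reddit", "2024-02-20", "https://example.com/c"),
]
print(render(organize(quotes)))
```

Because every quote keeps its permalink from extraction onward, the final summarization step only has to copy citations rather than reconstruct them.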

In [1]:
from core import CompanyProduct, init_langchain_cache, init_requests_cache

init_requests_cache()
init_langchain_cache()

target = CompanyProduct.same("98point6")

In [2]:
from reddit import RedditSummary, run as process_reddit

In [3]:
from glassdoor import GlassdoorResult, run as process_glassdoor


In [4]:
# Rename it to the old function name
from news import run as process_news


In [5]:
import re
from datetime import datetime
import os

def eval_filename(target: CompanyProduct, create_folder=True, extension="html") -> str:
    # Make the output folder
    folder_name = re.sub(r"[^a-zA-Z0-9]", "_", f"{target.company} {target.product}")
    folder_path = f"evaluation/{folder_name}"

    if create_folder:
        os.makedirs(folder_path, exist_ok=True)

    # Create the filename using the current timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{folder_path}/{timestamp}.{extension}"

    return filename
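
To see what folder name the sanitizing regex produces, here is a small standalone check. `folder_name_for` is a hypothetical helper written only for this example; it assumes `CompanyProduct.same(name)` uses the same string for both company and product, which is consistent with the output path printed at the end of this notebook.

```python
import re


def folder_name_for(company: str, product: str) -> str:
    # Same substitution as eval_filename: every non-alphanumeric character becomes "_"
    return re.sub(r"[^a-zA-Z0-9]", "_", f"{company} {product}")


print(folder_name_for("Pomelo Care", "Pomelo Care"))  # Pomelo_Care_Pomelo_Care
print(folder_name_for("98point6", "98point6"))  # 98point6_98point6
```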

In [6]:
from crunchbase import run as process_crunchbase


In [7]:
import jinja2

templates = jinja2.Environment(
    loader=jinja2.FileSystemLoader("templates"),
)


In [14]:
import re

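def nest_markdown(markdown_doc: str, header_change: int) -> str:
    """Demote every Markdown header by header_change levels, capped at level 6."""
    assert header_change > 0, "Header change must be positive"
    # Match only runs of "#" at the start of a line that are followed by whitespace,
    # so mid-line "#" characters and "#hashtag"-style lines are left alone
    nested_markdown = re.sub(
        r"^(#+)(?=\s)",
        lambda match: "#" * min(len(match.group(1)) + header_change, 6),
        markdown_doc,
        flags=re.MULTILINE,
    )
    return nested_markdown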

# Test nest_markdown function
markdown_doc = """
# Header 1
Some text

## Header 2

This # might be harder
"""
header_change = 2

expected_output = """
### Header 1
Some text

#### Header 2

This # might be harder
"""

# Check if the nested markdown is correct
assert nest_markdown(markdown_doc, header_change) == expected_output, f"Expected: {expected_output}, got: {nest_markdown(markdown_doc, header_change)}"

print("Test passed!")

Test passed!


In [16]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from core import CompanyProduct
from dotenv import load_dotenv

load_dotenv()


prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """
PERSONA
You're an expert in reviewing and analyzing news about companies and products.
When interpreting information, you understand that all authors impart some bias and perspective according to their incentives and access to information.
You seek to understand the authors to better interpret and debias their information by considering their background, affiliations, and potential motivations.

When assessing product quality:
- Companies typically exaggerate the positive aspects of their products and hide the negative aspects. Hence, you treat company statements about product quality with skepticism and seek corroborating evidence from independent sources.
- Reddit tends to be polarized, often oversampling strong opinions, particularly negative ones. Therefore, you interpret feedback on Reddit by looking for patterns across multiple comments and considering the context of each comment to identify more balanced views.

You review a wide range of sources to get a comprehensive view that's less susceptible to individual biases. You also consider the reliability of each source with respect to the type of information it provides. For example:
- Crunchbase is a reliable source for information about fundraising but less so for the current number of employees.
- News sources can be reliable but must be cross-referenced with other reports to ensure accuracy.

When sharing information with others, you're careful to provide specific details and cite sources so that your readers can easily verify all information. You understand that using quotes and citations builds trust with your audience, as it demonstrates transparency and allows them to see the original context of the information. Including dates in citations is crucial because:
- The date is a key factor in determining relevance. For example, very positive but older sentiment about a company may not indicate much about its current state.
- Certain key details about companies and products can change drastically over time, so noting the general timeframe is crucial for accuracy. For instance, a company may have had 300 employees in 2021 but only 20 employees in 2024. Including the date provides essential context for such information.

You keep facts and opinions clearly separated but share both with your audience to provide a well-rounded perspective. Your goal is to offer as detailed and balanced a view as possible, allowing your audience to make well-informed decisions. You focus on specifics, such as numbers and concrete examples, to provide clarity and support your analysis.

TASK
Carefully review all of the following information about a company and its product.
Write a comprehensive report of all information with citations to the original sources for reference.

OUTPUT CONTENT AND FORMAT

Loosely follow this template in your report. Each markdown section has tips on what information is most critical.

# About {company_name}

The About section should provide all the essential information about the company.
An ideal section should at least incorporate the answers to the following questions, if available:
- When was the company founded?
- Approximately how many employees work at the company?
- What products does the company produce? What services does the company offer?
- How does the company make money? Who are their customers in general? Is it B2B, B2C? If B2B, include example customers.
- Approximately how much revenue does the company generate annually?
- Describe the scale of the company if possible, including the number of customers, users, or clients.
- How are the company's products distributed or sold to users?

# Key personnel

Include the names and roles of any key personnel at the company. If possible, provide a brief summary of their background and experience as well as any sentiments expressed about them in the sources.

# News (reverse chronological, grouped by event)

# Working at {company_name}

Questions that should be answered by this summary:
- Is the leadership team good?
- What benefits are provided?
- Is the company good at DEI?
- What's the work-life balance and workload like?
- How has working at the company changed over time?
- How does employee satisfaction vary by job function?
- Why do people like working here?
- Why do people dislike working here?

## Positive sentiments and experiences

## Negative sentiments and experiences

## Neutral statements about working at {company_name}

This section might include general statements about location, benefits, and other factual information that could be verified.

# User reviews, sentiments, and feedback about {product_name}

## Positive sentiments and experiences

## Negative sentiments and experiences

## Neutral statements about {product_name}

This section could include general, neutral statements about the product, its features, distribution, key product changes, pricing, and so on.

# Bibliography

The Bibliography should include a list of all the sources used to compile the summary. If there are many sources, group them by type (e.g., Reddit, Glassdoor, News, Crunchbase).


Feel free to create subheadings or additional sections as needed to capture all relevant information about the company and its product.
Format the output as a markdown document, using markdown links for citations.
Citations should follow the format [(Author or Title, Source, Date)](url).
            """,
        ),
        (
            "human",
            """
            Company: {company_name}
            Product: {product_name}
            
            Reddit sources: 
            {reddit_text}

            Glassdoor sources:
            {glassdoor_text}

            News sources:
            {news_text}

            Crunchbase information:
            {crunchbase_text}
            """,
        ),
    ]
)





async def unified_summary(target: CompanyProduct, num_reddit_threads=2, max_glassdoor_review_pages=1, max_glassdoor_job_pages=1, max_news_articles=10, glassdoor_url=None):
    crunchbase_markdown = await process_crunchbase(target)
    reddit_result = process_reddit(target, num_threads=num_reddit_threads)
    glassdoor_result = await process_glassdoor(target, max_review_pages=max_glassdoor_review_pages, max_job_pages=max_glassdoor_job_pages, url_override=glassdoor_url)
    news_result = process_news(target, max_results=max_news_articles)


    # feed results into LLM for summarization
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    runnable = prompt | llm
    result = runnable.invoke({
        "company_name": target.company, 
        "product_name": target.product,
        "reddit_text": reddit_result.summary.output_text,
        "glassdoor_text": glassdoor_result.summary_markdown,
        "news_text": news_result.summary_markdown,
        "crunchbase_text": crunchbase_markdown,
        })
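    # str.strip(chars) removes a *set of characters*, not a prefix/suffix, so it could
    # eat real content; remove the fence explicitly instead (requires Python 3.9+)
    content = result.content.strip()
    content = content.removeprefix("```markdown").removesuffix("```")
    result.content = content.strip()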

    input_content_length = len(reddit_result.summary.output_text) + len(glassdoor_result.summary_markdown) + len(news_result.summary_markdown) + len(crunchbase_markdown)
    output_content_length = len(result.content)

    print(f"unified_summary: input_content_length={input_content_length:,} chars, output_content_length={output_content_length:,} chars ({output_content_length/input_content_length:.0%})")


    with open(eval_filename(target, extension="md"), "w", encoding="utf-8") as f:
        f.write(result.content)

        # Write the raw Reddit summary too
        f.write(f"\n----\n# Reddit\n{nest_markdown(reddit_result.summary.output_text, 1)}\n\n")

        # Write the individual Reddit threads
        # for thread in reddit_result.threads:
        #     f.write(f"{reddit.fetch.submission_to_markdown(thread)}\n\n")

        # Write the raw Glassdoor summary too
        f.write(f"\n----\n# Glassdoor\n{nest_markdown(glassdoor_result.summary_markdown, 1)}\n\n")

        # Write the individual Glassdoor reviews
        # for review in glassdoor_result.reviews:
        #     review_md = templates.get_template("glassdoor_review.md").render(review=review)
        #     f.write(f"{review_md}\n\n")

        # Write the raw News summary too
        f.write(f"\n----\n# News\n{nest_markdown(news_result.summary_markdown, 1)}\n\n")

        # Write the raw Crunchbase summary too
        f.write(f"\n----\n# Crunchbase\n{nest_markdown(crunchbase_markdown, 1)}\n\n")

        print(f"Written to {f.name}")

await unified_summary(
    CompanyProduct.same("Pomelo Care"), 
    num_reddit_threads=10, 
    max_glassdoor_review_pages=3, 
    max_glassdoor_job_pages=0,
    max_news_articles=20,
    glassdoor_url="https://www.glassdoor.com/Reviews/Pomelo-Care-Reviews-E9429297.htm"
    )

2024-08-16 14:43:11.884 | INFO     | scrapfly_scrapers.glassdoor:scrape_reviews:105 - scraping reviews from https://www.glassdoor.com/Reviews/Pomelo-Care-Reviews-E9429297.htm
2024-08-16 14:43:11.898 | INFO     | scrapfly_scrapers.glassdoor:scrape_reviews:113 - scraped first page of reviews of https://www.glassdoor.com/Reviews/Pomelo-Care-Reviews-E9429297.htm, scraping remaining 1 pages


Reddit: The prompt context has 6,084 characters in 2 threads


> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Please read the following Reddit thread and extract all opinions and facts relating to the user experience of the PRODUCT Pomelo Care by the COMPANY Pomelo Care from the perspective of current users.
Only include information about the COMPANY Pomelo Care and PRODUCT Pomelo Care. 
Do not extract information about other companies or products.
If the text does not contain any relevant information about the COMPANY or PRODUCT, please return an empty string.

Format the results as a Markdown list of quotes, each with a permalink to the source of the quote like so:
- "quote" [Author, Reddit, Date](permalink)

EXAMPLE for 98point6:

Input comment:
## Comment ID hrmpl3t with +3 score by [MarketWorldly9908 on 2022-01-07](https://www.reddit.com/r/povertyfinance/comments/bg7ip2/internet_medicine_is_

2024-08-16 14:43:12.914 | INFO     | scrapfly_scrapers.glassdoor:scrape_reviews:123 - scraped 17 reviews from https://www.glassdoor.com/Reviews/Pomelo-Care-Reviews-E9429297.htm in 2 pages


Glassdoor: The context has 12,194 characters in 17 reviews
Glassdoor: The summary has 5,261 characters, 43% of the input
News: 86,588 characters of context, 17 articles
News: The summary has 5,042 characters, 6% of the input
unified_summary: input_content_length=13,619 chars, output_content_length=9,431 chars (69%)
Written to evaluation/Pomelo_Care_Pomelo_Care/20240816_144349.md
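
Chat models sometimes wrap a Markdown reply in a code fence labeled `markdown`, so the report needs unwrapping before it is written to disk. The sketch below shows one way to remove a single surrounding fence, and why `str.strip` with a string argument is risky for this job. `strip_markdown_fence` is a hypothetical helper name, and the sample text is a placeholder.

```python
def strip_markdown_fence(text: str) -> str:
    """Remove one surrounding Markdown code fence, if present."""
    text = text.strip()
    # Check the longer openers first so "```markdown" is not cut down to "markdown"
    for opener in ("```markdown", "```md", "```"):
        if text.startswith(opener):
            text = text[len(opener):]
            break
    if text.endswith("```"):
        text = text[:-3]
    return text.strip()


wrapped = "```markdown\n# About Example\nGood information\n```"
print(strip_markdown_fence(wrapped))

# str.strip treats its argument as a set of characters, so it can also
# remove legitimate trailing letters such as the final "on" here:
print("Good information".strip("```markdown"))  # -> Good informati
```

Unfenced text passes through unchanged apart from whitespace trimming, so the helper is safe to apply unconditionally.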
