# Web scraping tests

A large share of overall run time goes to sequential web scraping. If I could parallelize it, that would help tremendously. A fair amount of time also goes to content extraction.

To test:
- Compare the content extractor I'm currently using with Scrapfly's content extractor on a few webpages
- Compare roughly how long it takes to scrape in parallel "manually" versus with Scrapfly at max concurrency
- Estimate costs with Scrapfly
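
The timing comparison in the list above could be set up roughly as follows. This is a minimal sketch, not the harness used in this notebook: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, and `scrape_sequential`, `scrape_concurrent`, `timed`, and `compare` are illustrative names. Swapping `fake_fetch` for a real fetch function would give actual numbers.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for an HTTP request: sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return f"content of {url}"


async def scrape_sequential(urls):
    # Await each request before starting the next one
    return [await fake_fetch(url) for url in urls]


async def scrape_concurrent(urls):
    # Start every request at once and wait for all of them
    return list(await asyncio.gather(*(fake_fetch(url) for url in urls)))


async def timed(coro):
    # Return the coroutine's result together with its wall-clock duration in seconds
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def compare(urls):
    seq_result, seq_time = await timed(scrape_sequential(urls))
    con_result, con_time = await timed(scrape_concurrent(urls))
    return seq_result, seq_time, con_result, con_time


urls = [f"https://example.com/{i}" for i in range(10)]
seq_result, seq_time, con_result, con_time = asyncio.run(compare(urls))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Inside a notebook, where an event loop is already running, `await compare(urls)` would replace the `asyncio.run(...)` call.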

In [4]:
from core import Seed, init

init()

In [5]:
target = Seed.init("98point6")

from data_sources.news.search import find_news_articles
search_results = find_news_articles(target, num_results=10)
search_results

[SearchResult(title='98point6 Technologies Announces the Acquisition of Bright.md to ...', link='https://www.prnewswire.com/news-releases/98point6-technologies-announces-the-acquisition-of-brightmd-to-accelerate-the-launch-of-its-asynchronous-care-module-302034295.html', snippet='Jan 16, 2024 ... PRNewswire/ -- 98point6 Technologies, a leader in licensed on-demand virtual care software, announced the addition of a new asynchronous module to its...', formattedUrl='https://www.prnewswire.com/news.../98point6-technologies-announces-the-...'),
 SearchResult(title='98point6 hit by new layoffs in latest change at health tech startup ...', link='https://www.geekwire.com/2024/98point6-hit-by-new-layoffs-in-latest-change-at-health-tech-startup/', snippet='Apr 23, 2024 ... (98point6 Photo) Seattle-based digital healthcare startup 98point6 has conducted a new round of layoffs, the company confirmed Tuesday. 98point6 did not.', formattedUrl='https://www.geekwire.com/.../98point6-hit-by-new-layoffs

In [6]:
result = search_results[9]
result

SearchResult(title='Gaurai Uddanwadiker', link='https://hspop.uw.edu/about/faculty/member/?faculty_id=Uddanwadiker_Gaurai', snippet='Mar 24, 2024 ... Her next role was at the Seattle based virtual primary care company, 98point6, where she scaled their local clinic to a nationwide, 24/7 presence as well as\xa0...', formattedUrl='https://hspop.uw.edu/about/faculty/member/?faculty_id=Uddanwadiker...')

In [4]:
# The existing pipeline: fetch one article, extract its content, and convert it to Markdown
from utils.scrape import request_article, response_to_article, article_to_markdown

# Fetch the page
response = request_article(result.link)
response.raise_for_status()

# Parse into an Article object
article = response_to_article(response)

# Convert into Markdown
markdown = article_to_markdown(article)

print(markdown)


# [Health Systems and Population Health on 2023-06-17](https://hspop.uw.edu/about/faculty/member/?faculty_id=Uddanwadiker_Gaurai)
Research Interests

Healthcare Operations and Strategy, Leadership Development

Bio

Gaurai has 20+ years of experience in the healthcare industry and excels at the intersection of technology, healthcare operations, care delivery and clinical quality. She believes in developing leaders and firing herself from the job.

Gaurai started off a therapist at an academic medical center in Bangalore, India but soon decided to pursue an entrepreneurial path and founded Counseling India, in 2002. She was instrumental in scaling it to multiple locations and adding on the telehealth care delivery service model. Her entrepreneurial journey reached a successful culmination in 2016 when she made a strategic exit after selling her company.

Her next role was at the Seattle based virtual primary care company, 98point6, where she scaled their local clinic to a nationwide, 24/

In [None]:
len(markdown)

In [22]:
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient, ScrapflyScrapeError, ScrapflyError
from scrapfly.scrape_config import Format, FormatOption
import os

SCRAPFLY = ScrapflyClient(key=os.environ["SCRAPFLY_KEY"])
config = {
    # "asp": True,
    "country": "us",
    # "proxy_pool": "public_residential_pool",
    "retry": False,
    "cache": True,
    "format": Format.MARKDOWN,
    "format_options": [FormatOption.NO_IMAGES, FormatOption.NO_LINKS],
    # TO CONSIDER
    # session = value (this will reuse the same machine for subsequent requests due to sticky_proxy, but disables caching)
    # render_js = True (this will render the page with a headless browser and might help sometimes)
}
scrapfly_result = await SCRAPFLY.async_scrape(ScrapeConfig(url=result.link, **config))

In [None]:
print(scrapfly_result.content)

In [None]:
len(scrapfly_result.content)
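
Beyond comparing lengths, the two extractions can be compared word by word. The sketch below is one possible approach using the standard library's `difflib`; `compare_extractions` is a hypothetical helper, and the two sample strings are made-up placeholders standing in for `markdown` and `scrapfly_result.content`.

```python
from difflib import SequenceMatcher


def compare_extractions(a: str, b: str) -> dict:
    """Summarize how two extractions of the same page differ in size and wording."""
    words_a, words_b = a.split(), b.split()
    # Compare word sequences rather than characters so whitespace differences are ignored
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return {
        "chars_a": len(a),
        "chars_b": len(b),
        "words_a": len(words_a),
        "words_b": len(words_b),
        "similarity": round(matcher.ratio(), 3),
    }


# Placeholder inputs; in the notebook these would be `markdown` and `scrapfly_result.content`
local_md = "# Title\nResearch Interests\n\nHealthcare Operations and Strategy"
remote_md = "Research Interests\n\nHealthcare Operations and Strategy, Leadership Development"

report = compare_extractions(local_md, remote_md)
print(report)
```

A similarity near 1.0 suggests both extractors kept roughly the same text, while a low value is a cue to read both outputs side by side.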

# Concurrent scraping

In [4]:
# With Scrapfly: Concurrently scrape all 10
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient, ScrapflyScrapeError, ScrapflyError
from scrapfly.scrape_config import Format, FormatOption
import os

SCRAPFLY = ScrapflyClient(key=os.environ["SCRAPFLY_KEY"], max_concurrency=5)
config = {
    "country": "us",
    "retry": False,
    "cache": True,
    "format": Format.MARKDOWN,
    "format_options": [FormatOption.NO_IMAGES, FormatOption.NO_LINKS],
}

pages = [
    ScrapeConfig(url=search_result.link, **config)
    for search_result in search_results
]

# Collect results from the async generator
scrapfly_results = [result async for result in SCRAPFLY.concurrent_scrape(pages)]

scrapfly_results


[<scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f40de40>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f43f7f0>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f40f580>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f459330>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f43f8b0>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f43fe50>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f40e8c0>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f43f730>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f40f1f0>,
 <scrapfly.api_response.ScrapeApiResponse at 0x7fbe3f45a7a0>]

In [7]:
scrapfly_results[5].content

"# 98point6\n\nHealthcare Software · Washington, United States · 120 Employees \n\nView Company Info for Free \n\n## About\n\n### Headquarters\n\n701 5th Ave Ste 2300, Seattle, Washington, 9810... \n\n### Phone Number\n\n(866) 657-7991 \n\n### Website\n\nwww.98point6.com \n\n### Revenue\n\n$25.2 Million \n\n### Industry\n\nHealthcare Software  Software Development & Design  Software \n\n## Most Recent Scoops\n\nMar 1 2024 \n\nLeft Company \n\nSep 11 2023 \n\nEvent \n\nwill be attending the HLTC 2023 conference as an exhibitor. The event will occur from October 8 to October 11, 2023. \n\nSee all scoops \n\n## Highlights\n\n$293M \n\nTotal Funding Amount \n\n$30.7M \n\nMost Recent Funding Amount \n\n8 \n\nNumber of Funding Rounds \n\nView details \n\n## Who is 98point6\n\nFounded in 2015, 98point6 is an on-demand, text-based, virtual primary care clinic. 98point6 is headquartered in Seattle, Washington. \n\n98point6's Social Media \n\nIs this data correct?  View contact profiles from 98p

In [7]:
from typing import List, Dict
from collections import defaultdict
from urllib.parse import urlparse

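def group_urls_by_domain(urls: List[str]) -> Dict[str, List[str]]:
    """Group URLs by their network location (e.g. 'www.geekwire.com')"""
    grouped_urls = defaultdict(list)
    for url in urls:
        # netloc keeps any "www." prefix and port, so those count as separate domains
        domain = urlparse(url).netloc
        grouped_urls[domain].append(url)
    return grouped_urls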

group_urls_by_domain([sr.link for sr in search_results])

defaultdict(list,
            {'www.prnewswire.com': ['https://www.prnewswire.com/news-releases/98point6-technologies-announces-the-acquisition-of-brightmd-to-accelerate-the-launch-of-its-asynchronous-care-module-302034295.html'],
             'www.geekwire.com': ['https://www.geekwire.com/2024/98point6-hit-by-new-layoffs-in-latest-change-at-health-tech-startup/'],
             'www.healthcaredive.com': ['https://www.healthcaredive.com/news/98point6-buys-bright-md-assets/704661/'],
             'telecareaware.com': ['https://telecareaware.com/news-roundup-transcarent-raises-126m-98point6-lays-off-oscar-notches-first-profit-steward-healths-ch-11-amazon-clinic-gm-leaves-amwells-down-but-hopeful-q1-hims-founder-gets-political/'],
             'www.axios.com': ['https://www.axios.com/pro/health-tech-deals/2024/01/16/98point6-acquires-brightmd-assets-telehealth-care-enablement'],
             'www.zoominfo.com': ['https://www.zoominfo.com/c/98point6-inc/410028689'],
             'www.steady
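
Grouping on `netloc` treats `www.example.com` and `example.com` as different domains. If the goal is one sequential queue per site, a normalized key may be preferable. The sketch below is an assumption about what would be wanted here, not part of the existing pipeline; `domain_key` and `group_by_normalized_domain` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlparse


def domain_key(url: str) -> str:
    """Return a lowercased host without port or a leading 'www.'."""
    # hostname is already lowercased and excludes the port; it is None for malformed URLs
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def group_by_normalized_domain(urls: List[str]) -> Dict[str, List[str]]:
    grouped = defaultdict(list)
    for url in urls:
        grouped[domain_key(url)].append(url)
    return dict(grouped)


sample_urls = [
    "https://www.example.com/a",
    "https://example.com/b",
    "https://WWW.Example.com:443/c",
    "https://example.org/d",
]
grouped = group_by_normalized_domain(sample_urls)
print(grouped)
```

Stripping only `www.` is a deliberately crude rule; other subdomains are left as separate keys.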

In [12]:
from itertools import chain
import asyncio
from typing import List, Dict
import aiohttp
import newspaper
from utils.scrape import remove_img_tags, article_to_markdown

async def response_to_article(
    response: aiohttp.ClientResponse,
) -> newspaper.Article:
    """Parse the response from a URL into a newspaper Article"""
    article = newspaper.article(
        # response.url is a yarl.URL, so convert it to a plain string
        str(response.url),
        language="en",
        # Remove images to prevent downloading them (the downloads sometimes crash, and they slow things down)
        input_html=remove_img_tags(await response.text()),
        fetch_images=False,
    )
    article.parse()
    return article

async def fetch_articles(urls: List[str], headers: Dict[str, str]) -> List[str]:
    """Fetch URLs one at a time over a shared session and return each article as Markdown"""
    articles = []
    async with aiohttp.ClientSession(headers=headers) as session:
        for url in urls:
            async with session.get(url) as response:
                # Skip pages that return an error status
                if response.ok:
                    article = await response_to_article(response)
                    articles.append(article_to_markdown(article))

    return articles

async def main(search_results: List, headers: Dict[str, str]) -> List[str]:
    # Group URLs by domain so that each domain is fetched sequentially
    grouped_urls = group_urls_by_domain([sr.link for sr in search_results])

    # Create one coroutine per domain
    tasks = [fetch_articles(urls, headers) for urls in grouped_urls.values()]

    # Run the per-domain coroutines concurrently
    results = await asyncio.gather(*tasks)

    # Flatten the per-domain lists into a single list
    return list(chain.from_iterable(results))

# Example headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

# Run the async main function
articles = await main(search_results, headers)
articles[0]

'# [98point6 Technologies Announces the Acquisition of Bright.md to Accelerate the Launch of its Asynchronous Care Module on 2024-01-16](https://www.prnewswire.com/news-releases/98point6-technologies-announces-the-acquisition-of-brightmd-to-accelerate-the-launch-of-its-asynchronous-care-module-302034295.html)\nNew Module to Offer More Robust Integrated Virtual Care Delivery Options\n\nSEATTLE, Jan. 16, 2024 /PRNewswire/ -- 98point6 Technologies, a leader in licensed on-demand virtual care software, announced the addition of a new asynchronous module to its flagship technology, allowing providers the option of delivering both live and asynchronous care capabilities. By enabling healthcare organizations to license this additional model of care delivery, 98point6 once again sets the bar with an integrated, purpose-built clinician solution that future-proofs virtual clinic operations. In related news, the company announced it has doubled-down on its growth strategy, acquiring the remaining

In [17]:
print(articles[0])

# [98point6 Technologies Announces the Acquisition of Bright.md to Accelerate the Launch of its Asynchronous Care Module on 2024-01-16](https://www.prnewswire.com/news-releases/98point6-technologies-announces-the-acquisition-of-brightmd-to-accelerate-the-launch-of-its-asynchronous-care-module-302034295.html)
New Module to Offer More Robust Integrated Virtual Care Delivery Options

SEATTLE, Jan. 16, 2024 /PRNewswire/ -- 98point6 Technologies, a leader in licensed on-demand virtual care software, announced the addition of a new asynchronous module to its flagship technology, allowing providers the option of delivering both live and asynchronous care capabilities. By enabling healthcare organizations to license this additional model of care delivery, 98point6 once again sets the bar with an integrated, purpose-built clinician solution that future-proofs virtual clinic operations. In related news, the company announced it has doubled-down on its growth strategy, acquiring the remaining ass
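
The manual version above runs one coroutine per domain with no overall cap, and a single raised exception (for example a connection error) would propagate out of `asyncio.gather` and discard the other results. One possible refinement, sketched here under assumptions, is a global limit via `asyncio.Semaphore` plus `return_exceptions=True`. `fetch_with_limit`, `fetch_all`, and `fake_fetch` are hypothetical names; `fake_fetch` simulates a request rather than performing one.

```python
import asyncio


async def fetch_with_limit(url, semaphore, fetch):
    # Only `max_concurrency` coroutines can hold the semaphore at the same time
    async with semaphore:
        return await fetch(url)


async def fetch_all(urls, fetch, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_limit(url, semaphore, fetch) for url in urls]
    # With return_exceptions=True a failure is returned as an exception object
    # in the result list instead of propagating out of gather
    return await asyncio.gather(*tasks, return_exceptions=True)


async def fake_fetch(url):
    """Hypothetical fetcher: sleeps briefly, and fails for URLs containing 'bad'."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"could not fetch {url}")
    return f"markdown for {url}"


urls = [
    "https://example.com/a",
    "https://example.com/bad",
    "https://example.org/b",
]
results = asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2))
succeeded = [r for r in results if not isinstance(r, Exception)]
print(f"{len(succeeded)} of {len(urls)} pages fetched")
```

Results come back in the same order as `urls`, so failures can be matched to the URL that caused them. In a notebook, `await fetch_all(...)` would replace `asyncio.run(...)`.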

In [13]:
import asyncio
import aiohttp
