# Biotech News and Trends Concierge Agent

## Introduction

### Problem

Biotechnology moves quickly, and meaningful developments are scattered across dozens of news sources, journals, and industry feeds. Manually tracking these updates is time-consuming, inconsistent, and prone to missing important signals. Raw article text is noisy and difficult to compare, making it hard to identify which topics are emerging, which are declining, and where industry attention is shifting. There is no simple, automated way to transform daily biotech news into structured, trend-level insights.

### Solution/Objective

This project implements an automated RSS-driven pipeline that collects biotech articles, summarizes them using an LLM, and extracts key concepts for trend analysis. A Trend Agent clusters related topics, measures their frequency and momentum, and highlights emerging or unusual patterns across the dataset. The final output is a structured, data-driven trend report that makes it easy to monitor the biotech landscape, spot early signals, and stay informed without manual curation.

## Import Libraries

In [6]:
!pip install feedparser
!pip install google-genai





[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting google-genai
  Downloading google_genai-1.52.0-py3-none-any.whl.metadata (46 kB)
Collecting anyio<5.0.0,>=4.8.0 (from google-genai)
  Downloading anyio-4.11.0-py3-none-any.whl.metadata (4.1 kB)
Collecting google-auth<3.0.0,>=2.14.1 (from google-genai)
  Downloading google_auth-2.43.0-py2.py3-none-any.whl.metadata (6.6 kB)
Collecting httpx<1.0.0,>=0.28.1 (from google-genai)
  Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting pydantic<3.0.0,>=2.9.0 (from google-genai)
  Downloading pydantic-2.12.4-py3-none-any.whl.metadata (89 kB)
Collecting tenacity<9.2.0,>=8.2.3 (from google-genai)
  Downloading tenacity-9.1.2-py3-none-any.whl.metadata (1.2 kB)
Collecting websockets<15.1.0,>=13.0.0 (from google-genai)
  Downloading websockets-15.0.1-cp312-cp312-win_amd64.whl.metadata (7.0 kB)
Collecting rsa<5,>=3.1.4 (from google-auth<3.0.0,>=2.14.1->google-genai)
  Downloading rsa-4.9.1-py3-none-any.whl.metadata (5.6 kB)
Collecting pydantic-core==2.41.5 (from pydantic<

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
streamlit 1.32.0 requires tenacity<9,>=8.1.0, but you have tenacity 9.1.2 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.


In [1]:
import feedparser
import datetime
import json
from pathlib import Path
import pandas as pd
from bs4 import BeautifulSoup
from google import genai
import time

## Fetch RSS Articles

In [2]:
import feedparser
import datetime
import json
from pathlib import Path

class RSSFetcher:
    def __init__(self, config_path="config/rss_feeds.json", storage_path="../data/rss_raw.json"):
        self.config_path = Path(config_path)
        self.storage_path = Path(storage_path)
        self.storage_path.parent.mkdir(parents=True, exist_ok=True)

        # Load feeds from config file
        with open(self.config_path, "r") as f:
            self.rss_urls = json.load(f)["feeds"]

    def _infer_source(self, url: str) -> str:
        """Infer source name from URL."""
        if "fiercebiotech.com/rss/biotech" in url or "fiercebiotech.com/rss/xml" in url:
            return "FierceBiotech"
        elif "labiotech.eu" in url:
            return "Labiotech.eu"
        elif "GenEngNews" in url or "genengnews.com" in url:
            return "GEN (Genetic Engineering & Biotech News)"
        elif "sciencedaily.com" in url and "genetics_gene_therapy" in url:
            return "ScienceDaily – Gene Therapy"
        elif "bioworld.com/rss/topic/10" in url:
            return "BioWorld Omics / Genomics"
        else:
            return "Unknown"

    def fetch(self):
        """Fetch articles from all RSS URLs."""
        all_articles = []

        for url in self.rss_urls:
            feed = feedparser.parse(url)
            source = self._infer_source(url)

            print(f"Fetching from {source}: {url}")

            for entry in feed.entries:
                article = {
                    "title": entry.get("title"),
                    "summary": entry.get("summary", ""),
                    "link": entry.get("link"),
                    "published": entry.get("published") or entry.get("updated") or None,
                    "source": source,
                    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
                    "fetched_at": datetime.datetime.now(datetime.timezone.utc).isoformat()
                }
                all_articles.append(article)

        self._save(all_articles)
        return all_articles

    def _save(self, articles):   # might not be needed
        """Save raw fetched articles."""
        with open(self.storage_path, "w", encoding="utf-8") as f:
            json.dump(articles, f, indent=2)

        print(f"Saved {len(articles)} articles to {self.storage_path}")
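
The fetcher expects `config/rss_feeds.json` to contain a `feeds` list, because `__init__` reads `json.load(f)["feeds"]`. The file itself is not shown in this notebook, so the following is only a sketch of a compatible layout. The URLs are the ones printed in this notebook's fetch log; `write_example_config`, `EXAMPLE_FEEDS`, and the temporary target directory are hypothetical names used for illustration.

```python
import json
import tempfile
from pathlib import Path

# Feed URLs as they appear in this notebook's fetch log.
EXAMPLE_FEEDS = [
    "https://www.fiercebiotech.com/rss/biotech/xml",
    "https://www.labiotech.eu/feed/",
    "https://feeds.feedburner.com/GenEngNews",
    "https://rss.sciencedaily.com/genetics_gene_therapy.xml",
    "https://www.bioworld.com/rss/topic/10",
]

def write_example_config(path):
    """Write a config file with the {"feeds": [...]} shape RSSFetcher reads (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"feeds": EXAMPLE_FEEDS}, f, indent=2)
    return path

# Written to a temporary directory so the sketch does not overwrite a real config.
config_file = write_example_config(Path(tempfile.mkdtemp()) / "config" / "rss_feeds.json")
with open(config_file, "r", encoding="utf-8") as f:
    loaded_feeds = json.load(f)["feeds"]
print(f"{len(loaded_feeds)} feeds configured")
```

Passing the resulting path as `config_path` to `RSSFetcher` should then load the same list into `self.rss_urls`.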

In [3]:
# Run the fetcher on the configured RSS feeds
#from src.fetcher import RSSFetcher

# This will load feeds from config/rss_feeds.json by default
fetcher = RSSFetcher()

articles = fetcher.fetch()

# Convert to a DataFrame
articles = pd.DataFrame(articles)

print(f"Fetched {len(articles)} articles.")
print(articles.head(3))  # show the first 3 rows

Fetching from FierceBiotech: https://www.fiercebiotech.com/rss/biotech/xml
Fetching from Labiotech.eu: https://www.labiotech.eu/feed/
Fetching from GEN (Genetic Engineering & Biotech News): https://feeds.feedburner.com/GenEngNews
Fetching from ScienceDaily – Gene Therapy: https://rss.sciencedaily.com/genetics_gene_therapy.xml
Fetching from BioWorld Omics / Genomics: https://www.bioworld.com/rss/topic/10
Saved 37 articles to ..\data\rss_raw.json
Fetched 37 articles.
                                               title  \
0  <a href="https://www.fiercebiotech.com/biotech...   
1  <a href="https://www.fiercebiotech.com/biotech...   
2  <a href="https://www.fiercebiotech.com/biotech...   

                                             summary  \
0  Hundreds of industry leaders have signed a let...   
1  The FDA is hiring more than 1,000 new employee...   
2  Gilead’s general counsel and EVP of corporate ...   

                                                link             published  \
0  https://www.fiercebiotech.com/biotech/letter-m...   Nov 21, 2025 4:16pm   

In [4]:
# Convert HTML summaries to plain text; titles are printed as fetched and may still contain markup
from bs4 import BeautifulSoup

# Convert summaries to plain text
articles['summary_text'] = articles['summary'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())

# Print just the top 5 titles and converted summaries
for title in articles['title'].head(5):
    print("Title:", title)
print()
for summary_text in articles['summary_text'].head(5):
    print("Summary text:", summary_text)

Title: <a href="https://www.fiercebiotech.com/biotech/letter-makary-biotech-ceos-push-fda-stability-and-say-volatility-threatens-us-innovation" hreflang="en">In letter to Makary, biotech CEOs push for FDA stability and say volatility threatens US innovation</a>
Title: <a href="https://www.fiercebiotech.com/biotech/fda-kicks-hiring-spree-and-new-communication-program-speed-sluggish-drug-reviews" hreflang="en"> FDA says it's hiring more than 1,000 new staffers, launches new comms program for review process</a>
Title: <a href="https://www.fiercebiotech.com/biotech/chutes-ladders-gilead-abruptly-parts-ways-general-counsel" hreflang="en">Chutes &amp; Ladders—Gilead abruptly parts ways with general counsel</a>
Title: <a href="https://www.fiercebiotech.com/biotech/fierce-biotech-layoff-tracker-2025" hreflang="en">Fierce Biotech Layoff Tracker 2025: Applied Tx lays off 46% of staff; Ensoma makes cuts</a>
Title: <a href="https://www.fiercebiotech.com/biotech/nurix-trims-workforce-pivotal-trial-
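
The printed titles show that some feeds (FierceBiotech here) deliver the title as an HTML anchor with escaped entities, while only `summary` is converted to plain text in the cell above. A minimal sketch of the same cleanup for titles follows. It uses only the standard library instead of BeautifulSoup so that it runs without extra packages; `strip_html` and the `title_text` column mentioned below are hypothetical names.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes and drop tags; character references are decoded by HTMLParser."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(value):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    if not isinstance(value, str):
        return ""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    # Collapse runs of whitespace and trim the ends
    return " ".join("".join(parser.parts).split())

raw_title = (
    '<a href="https://www.fiercebiotech.com/biotech/fda-kicks-hiring-spree-and-new-communication-program-speed-sluggish-drug-reviews" hreflang="en">'
    " FDA says it's hiring more than 1,000 new staffers, launches new comms program for review process</a>"
)
print(strip_html(raw_title))
```

Applied to the DataFrame, this could look like `articles['title_text'] = articles['title'].apply(strip_html)`, with `title_text` then passed to the summarizer instead of the raw title.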

## Summarize Articles (Agent)

In [5]:
from google import genai
import time
import os

api_key_env = os.getenv("GOOGLE_API_KEY")     # from local environment variable, or can use .env file
client = genai.Client(api_key=api_key_env)

MODEL_NAME = "gemini-2.5-flash"

THROTTLE = 1  # seconds to wait after each API call

def summarize_article(title: str, summary: str) -> dict:
    prompt = f"""
    You are an AI biotech assistant. Summarize this article in 3 bullet points.
    Extract: 
    1. Main finding
    2. Key biological targets (genes, proteins, pathways)
    3. Application area (diagnostics, therapeutics, biotech tools, etc.)

    Title: {title}
    Summary: {summary}
    """

    response = client.models.generate_content(
        model=MODEL_NAME,
        contents=prompt
    )

    time.sleep(THROTTLE)         # API Rate limiting

    return {
        "title": title,
        "raw_summary": summary,
        "ai_summary": response.text
    }

# Example usage
ai_summary_sample = summarize_article(
    "Updated Full-Text Search Now Available",
    "As previously announced, NCBI has updated the PubMed Central (PMC) full-text search functionality and user experience..."
)

print(ai_summary_sample)

# Generate AI summaries for the first 5 articles only
articles.loc[:4, "ai_summary"] = articles.loc[:4].apply(
    lambda row: summarize_article(row["title"], row["summary_text"])["ai_summary"],
    axis=1
)

# Check results
print(articles.loc[:4, ["title", "ai_summary"]])

ValueError: Missing key inputs argument! To use the Google AI API, provide (`api_key`) arguments. To use the Google Cloud API, provide (`vertexai`, `project` & `location`) arguments.
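
The `ValueError` above is the message `google-genai` emits when the client is created without an API key. Here that most likely means `GOOGLE_API_KEY` was not set, so `os.getenv` returned `None` and no summaries were produced in this run. A minimal sketch of a fail-fast check that could run before the client is created is shown below; `require_env` is a hypothetical helper, not part of `google-genai`.

```python
import os

def require_env(name, env=None):
    """Return a non-empty environment variable or raise a clear error (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in the shell that launches Jupyter "
            "or load it from a .env file before creating the client."
        )
    return value

# Demonstrated with an explicit mapping so the sketch runs without a real key.
print(require_env("GOOGLE_API_KEY", env={"GOOGLE_API_KEY": "example-key"}))
```

In the cell above this would be used as `client = genai.Client(api_key=require_env("GOOGLE_API_KEY"))`, so a missing key stops the notebook at the point of the problem instead of surfacing later as a missing `ai_summary` column.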

In [7]:
# JSON

from pathlib import Path
import json

# Convert entire DataFrame to list of dicts
articles_list = articles.to_dict(orient="records")

output_path = Path("data/rss_summarized.json")
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(articles_list, f, indent=2, ensure_ascii=False)

print(f"Saved {len(articles_list)} articles to {output_path}")

# TODO: append newly fetched articles without duplicates so they are stored for the trend agent


Saved 37 articles to data\rss_summarized.json
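
The note in the cell above about appending new articles without duplicates is not implemented in this notebook. One way it might look, assuming the article `link` is a stable unique key (an assumption; some feeds may change or omit links), is sketched here. `merge_articles` is a hypothetical helper.

```python
def merge_articles(existing, new):
    """Append articles from `new` whose key is not already present (hypothetical helper).

    The key is the link; articles without a link are keyed by title instead.
    """
    merged = list(existing)
    seen = {a.get("link") or a.get("title") for a in merged}
    for article in new:
        key = article.get("link") or article.get("title")
        if key in seen:
            continue  # already stored, keep the earlier copy
        seen.add(key)
        merged.append(article)
    return merged

# Illustrative records, not data from the feeds.
stored = [{"title": "A", "link": "https://example.org/a"}]
fetched = [
    {"title": "A (updated)", "link": "https://example.org/a"},
    {"title": "B", "link": "https://example.org/b"},
]
combined = merge_articles(stored, fetched)
print([a["title"] for a in combined])  # ['A', 'B']
```

In the notebook, `existing` would be the list loaded from `data/rss_summarized.json` when that file exists, and `new` would be `articles.to_dict(orient="records")`.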


## Trend Analysis (Agent)

In [8]:
def extract_topics(summary_text: str) -> dict:
    prompt = f"""
    Extract keywords and topics from this biotech article summary.
    Return as JSON with:
    - scientific_terms: genes, proteins, pathways
    - companies: biotech companies mentioned
    - concepts: biotech concepts or areas (single-cell, AI-drug discovery)
    - methods: experimental methods (CRISPR, nanopore sequencing)
    
    Summary: {summary_text}
    """
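    response = client.models.generate_content(model=MODEL_NAME, contents=prompt)
    # Fall back to empty lists when the response is not a JSON object
    empty = {"scientific_terms": [], "companies": [], "concepts": [], "methods": []}
    try:
        topics = json.loads(response.text)
    except (json.JSONDecodeError, TypeError):  # invalid JSON, or response.text is None
        return empty
    return topics if isinstance(topics, dict) else empty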
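
`json.loads(response.text)` only succeeds when the model returns bare JSON. Models often wrap JSON in a Markdown code fence or add a sentence around it, in which case the fallback in `extract_topics` silently returns empty lists. A minimal sketch of a more tolerant parser is given below. `parse_topics_json` and `TOPIC_KEYS` are hypothetical names, and the outermost-brace heuristic is an assumption about typical model output, not behaviour documented by the API.

```python
import json

TOPIC_KEYS = ("scientific_terms", "companies", "concepts", "methods")

def parse_topics_json(text):
    """Parse model output into a dict with a list for every expected key."""
    topics = {key: [] for key in TOPIC_KEYS}
    if not isinstance(text, str):
        return topics
    # Keep only the outermost {...} span, which drops code fences and surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return topics
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return topics
    for key in TOPIC_KEYS:
        value = parsed.get(key, [])
        # Accept a single string as a one-item list; ignore other non-list values.
        if isinstance(value, str):
            value = [value]
        if isinstance(value, list):
            topics[key] = [str(item) for item in value]
    return topics

# Build a fenced response without writing a literal fence inside this example.
fence = "`" * 3
fenced = fence + 'json\n{"methods": ["CRISPR"], "companies": "Gilead"}\n' + fence
print(parse_topics_json(fenced))
```

Inside `extract_topics`, the call would replace the `try`/`except` block: `return parse_topics_json(response.text)`.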

In [9]:
normalization_dict = {
    "single-cell": ["scRNA-seq", "single cell sequencing", "single-cell RNA seq"],
    "AI-biotech": ["AI", "machine learning", "deep learning"],
    "biotech IP/legal": ["Illumina lawsuit", "NGS patents"]
}

def normalize_topic(term: str):
    for norm, variants in normalization_dict.items():
        if term.lower() in [v.lower() for v in variants] or term.lower() == norm.lower():
            return norm
    return term  # return as-is if no match
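
`normalize_topic` rebuilds the lower-cased variant lists on every call. For larger batches, a precomputed reverse lookup gives the same mapping with a single dictionary access. A sketch under that assumption follows; `build_lookup` and `normalize_with_lookup` are hypothetical names, and the dictionary is copied from the cell above so the example is self-contained. The only intended behavioural difference is that surrounding whitespace is ignored.

```python
normalization_dict = {
    "single-cell": ["scRNA-seq", "single cell sequencing", "single-cell RNA seq"],
    "AI-biotech": ["AI", "machine learning", "deep learning"],
    "biotech IP/legal": ["Illumina lawsuit", "NGS patents"],
}

def build_lookup(mapping):
    """Map every lower-cased variant (and the canonical name itself) to its canonical name."""
    lookup = {}
    for canonical, variants in mapping.items():
        for name in [canonical, *variants]:
            # setdefault keeps the first mapping, matching the first-match-wins loop above
            lookup.setdefault(name.lower(), canonical)
    return lookup

LOOKUP = build_lookup(normalization_dict)

def normalize_with_lookup(term):
    """Return the canonical topic for a term, or the term unchanged if it is unknown."""
    return LOOKUP.get(term.strip().lower(), term)

for term in ["scRNA-seq", "Machine Learning", "CRISPR"]:
    print(term, "->", normalize_with_lookup(term))
```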


In [None]:
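# Skip rows without an AI summary (only the first 5 articles are summarized above)
articles['topics_normalized'] = articles['ai_summary'].apply(
    lambda s: {k: [normalize_topic(t) for t in v] for k, v in extract_topics(s).items()}
    if isinstance(s, str) else {}
)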

KeyError: 'ai_summary'


In [11]:
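# term_to_cluster (term -> cluster label) must be defined before this cell runs;
# no cell shown in this notebook defines it. Terms missing from the mapping
# fall back to the term itself instead of raising a KeyError.
trend_data = []
for _, row in articles.iterrows():
    topics = row['topics_normalized']
    trend_data.append({
        "title": row['title'],
        "source": row['source'],
        "published": row['published'],
        "topics": topics,
        "clusters": [term_to_cluster.get(t, t) for terms in topics.values() for t in terms]
    })

with open("data/trend_topics.json", "w", encoding="utf-8") as f:
    json.dump(trend_data, f, indent=2, ensure_ascii=False)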


KeyError: 'topics_normalized'
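
The cell above relies on `term_to_cluster`, which no cell in this notebook defines, and the frequency step described in the objective is not shown either. A hypothetical sketch of both follows, assuming each normalized term is its own cluster (a stand-in for real clustering) and that `topics` has the category-to-list shape produced by `extract_topics`. `build_term_to_cluster`, `count_clusters`, and the sample records are illustrative only.

```python
from collections import Counter

def build_term_to_cluster(records):
    """Map each term to a cluster label; here the lower-cased term itself (placeholder clustering)."""
    return {
        term: term.lower()
        for record in records
        for terms in record["topics"].values()
        for term in terms
    }

def count_clusters(records, term_to_cluster):
    """Count in how many articles each cluster appears (once per article per cluster)."""
    counts = Counter()
    for record in records:
        clusters = {
            term_to_cluster.get(term, term)
            for terms in record["topics"].values()
            for term in terms
        }
        counts.update(clusters)
    return counts

# Illustrative records, not data from the feeds.
sample = [
    {"topics": {"methods": ["CRISPR"], "concepts": ["AI-biotech"]}},
    {"topics": {"methods": ["crispr"], "concepts": []}},
]
mapping = build_term_to_cluster(sample)
print(count_clusters(sample, mapping).most_common())
```

Counting per article rather than per mention keeps a single article that repeats a term from dominating the ranking. Momentum would additionally require parsing `published` into dates and comparing counts across time windows, which is not attempted here.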