# Biotech News and Trends Concierge Agent

## Introduction

### Problem

Biotechnology moves quickly, and meaningful developments are scattered across dozens of news sources, journals, and industry feeds. Manually tracking these updates is time-consuming, inconsistent, and prone to missing important signals. Raw article text is noisy and difficult to compare, making it hard to identify which topics are emerging, which are declining, and where industry attention is shifting. There is no simple, automated way to transform daily biotech news into structured, trend-level insights.

### Solution/Objective

This project implements an automated RSS-driven pipeline that collects biotech articles, summarizes them using an LLM, and extracts key concepts for trend analysis. A Trend Agent clusters related topics, measures their frequency and momentum, and highlights emerging or unusual patterns across the dataset. The final output is a structured, data-driven trend report that makes it easy to monitor the biotech landscape, spot early signals, and stay informed without manual curation.

## Import Libraries

In [9]:
import feedparser
import datetime
import json
from pathlib import Path
import pandas as pd
from bs4 import BeautifulSoup
from google import genai
import time

## Fetch RSS Articles

In [15]:
import feedparser
import datetime
import json
from pathlib import Path

class RSSFetcher:
    def __init__(self, config_path="config/rss_feeds.json", storage_path="../data/rss_raw.json"):
        self.config_path = Path(config_path)
        self.storage_path = Path(storage_path)
        self.storage_path.parent.mkdir(parents=True, exist_ok=True)

        # Load feeds from config file
        with open(self.config_path, "r") as f:
            self.rss_urls = json.load(f)["feeds"]

    def _infer_source(self, url: str) -> str:
        """Infer source name from URL."""
        if "fiercebiotech.com/rss/biotech" in url or "fiercebiotech.com/rss/xml" in url:
            return "FierceBiotech"
        elif "labiotech.eu" in url:
            return "Labiotech.eu"
        elif "GenEngNews" in url or "genengnews.com" in url:
            return "GEN (Genetic Engineering & Biotech News)"
        elif "sciencedaily.com" in url and "genetics_gene_therapy" in url:
            return "ScienceDaily – Gene Therapy"
        elif "bioworld.com/rss/topic/10" in url:
            return "BioWorld Omics / Genomics"
        else:
            return "Unknown"

    def fetch(self):
        """Fetch articles from all RSS URLs."""
        all_articles = []

        for url in self.rss_urls:
            source = self._infer_source(url)

            # Log before the network call so a slow or failing feed is easy to spot
            print(f"Fetching from {source}: {url}")
            feed = feedparser.parse(url)

            for entry in feed.entries:
                article = {
                    "title": entry.get("title"),
                    "summary": entry.get("summary", ""),
                    "link": entry.get("link"),
                    "published": entry.get("published") or entry.get("updated") or None,
                    "source": source,
                    # utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
                    "fetched_at": datetime.datetime.now(datetime.timezone.utc).isoformat()
                }
                all_articles.append(article)

        self._save(all_articles)
        return all_articles

    def _save(self, articles):   # might not be needed
        """Save raw fetched articles."""
        with open(self.storage_path, "w", encoding="utf-8") as f:
            json.dump(articles, f, indent=2)

        print(f"Saved {len(articles)} articles to {self.storage_path}")
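`RSSFetcher` reads its feed list from the `feeds` key of a JSON config file, but the file itself is not shown in this notebook. The snippet below is a minimal sketch of how such a file might be created, assuming a single top-level `feeds` list. The URLs are the five feeds this notebook fetches from; the `write_feed_config` helper and the `FEEDS` constant are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path

# Feed URLs taken from this notebook's fetch log; the file layout is an assumption
FEEDS = [
    "https://www.fiercebiotech.com/rss/biotech/xml",
    "https://www.labiotech.eu/feed/",
    "https://feeds.feedburner.com/GenEngNews",
    "https://rss.sciencedaily.com/genetics_gene_therapy.xml",
    "https://www.bioworld.com/rss/topic/10",
]

def write_feed_config(path="config/rss_feeds.json", feeds=FEEDS):
    """Write a {"feeds": [...]} JSON file in the shape RSSFetcher loads."""
    config_path = Path(path)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump({"feeds": list(feeds)}, f, indent=2)
    return config_path
```

Keeping the feed list in a file rather than in code means feeds can be added or removed without touching the fetcher class.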

In [16]:
# Run Fetcher for generic RSS (including PubMed)
#from src.fetcher import RSSFetcher

# This will load feeds from config/rss_feeds.json by default
fetcher = RSSFetcher()

articles = fetcher.fetch()

# Convert to df
articles = pd.DataFrame(articles)

print(f"Fetched {len(articles)} articles.")
print(articles[:3])  # show first 3

Fetching from FierceBiotech: https://www.fiercebiotech.com/rss/biotech/xml
Fetching from Labiotech.eu: https://www.labiotech.eu/feed/
Fetching from GEN (Genetic Engineering & Biotech News): https://feeds.feedburner.com/GenEngNews
Fetching from ScienceDaily – Gene Therapy: https://rss.sciencedaily.com/genetics_gene_therapy.xml
Fetching from BioWorld Omics / Genomics: https://www.bioworld.com/rss/topic/10
Saved 37 articles to ..\data\rss_raw.json
Fetched 37 articles.
                                               title  \
0  <a href="https://www.fiercebiotech.com/biotech...   
1  <a href="https://www.fiercebiotech.com/biotech...   
2  <a href="https://www.fiercebiotech.com/biotech...   

                                             summary  \
0  Gilead’s general counsel and EVP of corporate ...   
1  Nurix Therapeutics is trimming its workforce n...   
2  Contineum Therapeutics’ M1 receptor antagonist...   

                                                link            published  \
0  
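The `elif` chain in `_infer_source` grows by one branch per feed. As an alternative, the same rules could be expressed as a lookup table. This is only a sketch that mirrors the substrings used in `RSSFetcher`; `SOURCE_RULES` and the standalone `infer_source` function are hypothetical names, not part of the original class.

```python
# Ordered (substring, source name) pairs mirroring the branches of _infer_source
SOURCE_RULES = [
    ("fiercebiotech.com/rss/biotech", "FierceBiotech"),
    ("fiercebiotech.com/rss/xml", "FierceBiotech"),
    ("labiotech.eu", "Labiotech.eu"),
    ("GenEngNews", "GEN (Genetic Engineering & Biotech News)"),
    ("genengnews.com", "GEN (Genetic Engineering & Biotech News)"),
    ("bioworld.com/rss/topic/10", "BioWorld Omics / Genomics"),
]

def infer_source(url: str) -> str:
    """Return a readable source name for a feed URL, or "Unknown"."""
    # ScienceDaily needs two substrings to match, so it is handled separately
    if "sciencedaily.com" in url and "genetics_gene_therapy" in url:
        return "ScienceDaily – Gene Therapy"
    for needle, name in SOURCE_RULES:
        if needle in url:
            return name
    return "Unknown"
```

With this layout, supporting a new feed is a one-line addition to `SOURCE_RULES`.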

In [17]:
# Clean up article summaries (titles are left as fetched)
from bs4 import BeautifulSoup

# Convert summaries to plain text
articles['summary_text'] = articles['summary'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())

# Print just the top 5 titles and converted summaries
for title in articles['title'].head(5):
    print("Title:", title)
print()
for summary_text in articles['summary_text'].head(5):
    print("Summary text:", summary_text)

Title: <a href="https://www.fiercebiotech.com/biotech/chutes-ladders-gilead-abruptly-parts-ways-general-counsel" hreflang="en">Chutes &amp; Ladders—Gilead abruptly parts ways with general counsel</a>
Title: <a href="https://www.fiercebiotech.com/biotech/nurix-trims-workforce-pivotal-trial-lead-btk-degrader-kicks" hreflang="en">Nurix trims workforce as pivotal trial for lead BTK degrader kicks off</a>
Title: <a href="https://www.fiercebiotech.com/biotech/contineums-jj-partnered-ms-drug-fails-improve-vision-phase-2" hreflang="en">Contineum's J&amp;J-partnered MS drug fails to improve vision in phase 2</a>
Title: <a href="https://www.fiercebiotech.com/biotech/modernas-reshaping-rolls-3-more-pipeline-purges" hreflang="en">Moderna's reshaping rolls on with 3 more pipeline purges</a>
Title: <a href="https://www.fiercebiotech.com/biotech/how-healthy-exchange-ideas-rfk-jr-kicked-fdas-gene-therapy-push" hreflang="en">How a 'healthy exchange of ideas' with RFK Jr. kick-started FDA's gene therapy
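The titles printed above still contain anchor markup and HTML entities such as `&amp;`, because only the `summary` column was converted to plain text. The notebook's `BeautifulSoup(...).get_text()` call could be applied to titles in the same way; the sketch below shows a standard-library alternative instead. `strip_html` and `_TextExtractor` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(value):
    """Return the plain text of an HTML fragment; non-strings become ''."""
    if not isinstance(value, str):
        return ""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()
```

A cleaned column could then be added with `articles["title_text"] = articles["title"].apply(strip_html)`, where `title_text` is an assumed column name.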

## Summarize Articles (Agent)

In [13]:
from google import genai
import time
import os

api_key_env = os.getenv("GOOGLE_API_KEY")     # from local environment variable, or can use .env file
client = genai.Client(api_key=api_key_env)

MODEL_NAME = "gemini-2.5-flash"

THROTTLE = 1  # seconds to pause after each API call

def summarize_article(title: str, summary: str) -> dict:
    prompt = f"""
    You are an AI biotech assistant. Summarize this article in 3 bullet points.
    Extract: 
    1. Main finding
    2. Key biological targets (genes, proteins, pathways)
    3. Application area (diagnostics, therapeutics, biotech tools, etc.)

    Title: {title}
    Summary: {summary}
    """

    response = client.models.generate_content(
        model=MODEL_NAME,
        contents=prompt
    )

    time.sleep(THROTTLE)         # API Rate limiting

    return {
        "title": title,
        "raw_summary": summary,
        "ai_summary": response.text
    }

# Example usage
ai_summary_sample = summarize_article(
    "Updated Full-Text Search Now Available",
    "As previously announced, NCBI has updated the PubMed Central (PMC) full-text search functionality and user experience..."
)

print(ai_summary_sample)

# Generate AI summaries for the first 5 articles only
articles.loc[:4, "ai_summary"] = articles.loc[:4].apply(
    lambda row: summarize_article(row["title"], row["summary_text"])["ai_summary"],
    axis=1
)

# Check results
print(articles.loc[:4, ["title", "ai_summary"]])

{'title': 'Updated Full-Text Search Now Available', 'raw_summary': 'As previously announced, NCBI has updated the PubMed Central (PMC) full-text search functionality and user experience...', 'ai_summary': "Here's the summary of the provided text:\n\n*   **Main finding:** NCBI has updated the PubMed Central (PMC) full-text search functionality and user experience.\n*   **Key biological targets:** None mentioned in the article.\n*   **Application area:** Biotech tools (specifically, an informatics tool for scientific literature search and discovery)."}
                                               title  \
0             Updated Full-Text Search Now Available   
1                            PMC OAI-PMH API Updated   
2  Updates in Support of the 2024 NIH Public Acce...   
3  PubMed Central's Updated Full-Text Search Prev...   
4  Preview of Upcoming Changes to PMC's eFetch Ou...   

                                          ai_summary  
0  Here's the summary of the article in 3 bullet ..
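`summarize_article` pauses for a fixed `THROTTLE` after each request but does not handle a failed call, so a single error would abort the whole `apply`. Below is a minimal sketch of a retry wrapper with exponential backoff. It assumes that any exception is worth retrying, since the exceptions raised by the `google-genai` client are not covered in this notebook; `call_with_retries` and its parameters are hypothetical names.

```python
import time

def call_with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on any exception with growing pauses."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Pause base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
```

In the summarization cell this could wrap the existing call, for example `call_with_retries(summarize_article, row["title"], row["summary_text"])`. The `sleep` parameter exists only so the pause can be replaced in tests.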

In [18]:
# Save the summarized articles as JSON

from pathlib import Path
import json

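# Convert the DataFrame to a list of dicts, replacing NaN (e.g. rows without an
# ai_summary) with None so the file contains valid JSON nulls instead of NaN
articles_list = articles.astype(object).where(articles.notna(), None).to_dict(orient="records")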

output_path = Path("data/rss_summarized.json")
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(articles_list, f, indent=2, ensure_ascii=False)

print(f"Saved {len(articles_list)} articles to {output_path}")

# TODO: append new articles without duplicates so they are stored for the Trend Agent to use later


Saved 37 articles to data\rss_summarized.json
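The cell above overwrites the output file on every run, and the comment in it notes that appending without duplicates is still to be added. One possible approach is sketched below, assuming the article `link` is a suitable unique key; `append_unique` is a hypothetical helper, not part of the original notebook.

```python
import json
from pathlib import Path

def append_unique(new_articles, path="data/rss_summarized.json", key="link"):
    """Merge new_articles into the JSON file at path, skipping stored keys.

    Returns the number of articles actually added.
    """
    store = Path(path)
    existing = json.loads(store.read_text(encoding="utf-8")) if store.exists() else []
    seen = {article.get(key) for article in existing}

    added = 0
    for article in new_articles:
        value = article.get(key)
        if value is None or value in seen:
            continue  # skip duplicates and entries without a usable key
        existing.append(article)
        seen.add(value)
        added += 1

    store.parent.mkdir(parents=True, exist_ok=True)
    store.write_text(json.dumps(existing, indent=2, ensure_ascii=False), encoding="utf-8")
    return added
```

Calling `append_unique(articles_list)` in place of the plain `json.dump` would let the stored set grow across daily runs without repeating articles that appear in a feed more than once.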


## Trend Analysis (Agent)