# Biotech News and Trends Concierge Agent

## Introduction

### Problem

Biotechnology moves quickly, and meaningful developments are scattered across dozens of news sources, journals, and industry feeds. Manually tracking these updates is time-consuming, inconsistent, and prone to missing important signals. Raw article text is noisy and difficult to compare, making it hard to identify which topics are emerging, which are declining, and where industry attention is shifting. There is no simple, automated way to transform daily biotech news into structured, trend-level insights.

### Solution/Objective

This project implements an automated RSS-driven pipeline that collects biotech articles, summarizes them using an LLM, and extracts key concepts for trend analysis. A Trend Agent clusters related topics, measures their frequency and momentum, and highlights emerging or unusual patterns across the dataset. The final output is a structured, data-driven trend report that makes it easy to monitor the biotech landscape, spot early signals, and stay informed without manual curation.

## Import Libraries

In [7]:
!pip install feedparser
!pip install google-genai

In [8]:
import feedparser
import datetime
import json
from pathlib import Path
import pandas as pd
from bs4 import BeautifulSoup
from google import genai
import time
import re

## Fetch RSS Articles

In [9]:
import feedparser
import datetime
import json
from pathlib import Path

class RSSFetcher:
    def __init__(self, config_path="config/rss_feeds.json", storage_path="../data/rss_raw.json"):
        self.config_path = Path(config_path)
        self.storage_path = Path(storage_path)
        self.storage_path.parent.mkdir(parents=True, exist_ok=True)

        # Load feeds from config file
        with open(self.config_path, "r") as f:
            self.rss_urls = json.load(f)["feeds"]

    def _infer_source(self, url: str) -> str:
        """Infer source name from URL."""
        if "fiercebiotech.com/rss/biotech" in url or "fiercebiotech.com/rss/xml" in url:
            return "FierceBiotech"
        elif "labiotech.eu" in url:
            return "Labiotech.eu"
        elif "GenEngNews" in url or "genengnews.com" in url:
            return "GEN (Genetic Engineering & Biotech News)"
        elif "sciencedaily.com" in url and "genetics_gene_therapy" in url:
            return "ScienceDaily – Gene Therapy"
        elif "bioworld.com/rss/topic/10" in url:
            return "BioWorld Omics / Genomics"
        else:
            return "Unknown"

    def fetch(self):
        """Fetch articles from all RSS URLs."""
        all_articles = []

        for url in self.rss_urls:
            feed = feedparser.parse(url)
            source = self._infer_source(url)

            print(f"Fetching from {source}: {url}")

            for entry in feed.entries:
                article = {
                    "title": entry.get("title"),
                    "summary": entry.get("summary", ""),
                    "link": entry.get("link"),
                    "published": entry.get("published") or entry.get("updated") or None,
                    "source": source,
                    # utcnow() is deprecated; use a timezone-aware UTC timestamp instead
                    "fetched_at": datetime.datetime.now(datetime.timezone.utc).isoformat()
                }
                all_articles.append(article)

        self._save(all_articles)
        return all_articles

    def _save(self, articles):   # might not be needed
        """Save raw fetched articles."""
        with open(self.storage_path, "w", encoding="utf-8") as f:
            json.dump(articles, f, indent=2)

        print(f"Saved {len(articles)} articles to {self.storage_path}")
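
The fetcher expects `config/rss_feeds.json` to hold a JSON object with a single `"feeds"` key listing the feed URLs. The config file itself is not shown in this notebook, so the following is only a minimal sketch of how such a file could be produced. The helper names `build_feed_config` and `write_feed_config` are hypothetical; the URLs are the ones that appear in the fetch log.

```python
import json
from pathlib import Path

# Feed URLs as they appear in the fetch log of this notebook
FEED_URLS = [
    "https://www.fiercebiotech.com/rss/biotech/xml",
    "https://www.labiotech.eu/feed/",
    "https://feeds.feedburner.com/GenEngNews",
    "https://rss.sciencedaily.com/genetics_gene_therapy.xml",
    "https://www.bioworld.com/rss/topic/10",
]

def build_feed_config(urls):
    """Build the dict layout that RSSFetcher reads: {"feeds": [...]}."""
    return {"feeds": list(urls)}

def write_feed_config(path, urls):
    """Write the config to disk (hypothetical helper, not part of the notebook)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_feed_config(urls), f, indent=2)
    return path
```

Calling `write_feed_config("config/rss_feeds.json", FEED_URLS)` once before constructing `RSSFetcher()` would create the file at the default location the class looks for.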

In [10]:
# Run the fetcher on the configured RSS feeds
#from src.fetcher import RSSFetcher

# This will load feeds from config/rss_feeds.json by default
fetcher = RSSFetcher()

articles = fetcher.fetch()

# Convert to df
articles = pd.DataFrame(articles)

print(f"Fetched {len(articles)} articles.")
display(articles[:3])  # show first 3

Fetching from FierceBiotech: https://www.fiercebiotech.com/rss/biotech/xml
Fetching from Labiotech.eu: https://www.labiotech.eu/feed/
Fetching from GEN (Genetic Engineering & Biotech News): https://feeds.feedburner.com/GenEngNews
Fetching from ScienceDaily – Gene Therapy: https://rss.sciencedaily.com/genetics_gene_therapy.xml
Fetching from BioWorld Omics / Genomics: https://www.bioworld.com/rss/topic/10
Saved 37 articles to ..\data\rss_raw.json
Fetched 37 articles.


| | title | summary | link | published | source | fetched_at |
|---|---|---|---|---|---|---|
| 0 | `<a href="https://www.fiercebiotech.com/biotech...` | Hundreds of industry leaders have signed a let... | https://www.fiercebiotech.com/biotech/letter-m... | Nov 21, 2025 4:16pm | FierceBiotech | 2025-11-23T20:20:29.140127 |
| 1 | `<a href="https://www.fiercebiotech.com/biotech...` | The FDA is hiring more than 1,000 new employee... | https://www.fiercebiotech.com/biotech/fda-kick... | Nov 21, 2025 11:20am | FierceBiotech | 2025-11-23T20:20:29.140127 |
| 2 | `<a href="https://www.fiercebiotech.com/biotech...` | Gilead’s general counsel and EVP of corporate ... | https://www.fiercebiotech.com/biotech/chutes-l... | Nov 20, 2025 4:26pm | FierceBiotech | 2025-11-23T20:20:29.140127 |
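
The `published` values shown above are display strings such as `Nov 21, 2025 4:16pm`, which cannot be sorted or bucketed by date as they stand. The sketch below shows one way to turn them into `datetime` objects. It assumes an English locale, treats the FierceBiotech-style format as the first candidate, and falls back to the RFC 2822 format that many RSS feeds use. The function name `parse_published` is hypothetical, and other feeds in the config may need additional formats.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

def parse_published(value):
    """Parse a feed date string into a datetime, or return None if it cannot be parsed."""
    if not isinstance(value, str) or not value.strip():
        return None
    value = value.strip()
    try:
        # Format seen in the FierceBiotech rows, e.g. "Nov 21, 2025 4:16pm"
        return datetime.strptime(value, "%b %d, %Y %I:%M%p")
    except ValueError:
        pass
    try:
        # RFC 2822 dates, e.g. "Fri, 21 Nov 2025 16:16:00 +0000"
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
```

A call such as `articles["published"].apply(parse_published)` would give a column that can be grouped by day or week. Note that the first format yields naive datetimes while the RFC 2822 branch yields timezone-aware ones, so the two would need to be reconciled before they are compared.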


In [11]:
# Clean up articles and extract titles and summaries
from bs4 import BeautifulSoup

# Convert summaries to plain text
articles['summary_text'] = articles['summary'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())

# Print just the top 5 titles and converted summaries
for title in articles['title'].head(5):
    print("Title:", title)
print()
for summary_text in articles['summary_text'].head(5):
    print("Summary text:", summary_text)

Title: <a href="https://www.fiercebiotech.com/biotech/letter-makary-biotech-ceos-push-fda-stability-and-say-volatility-threatens-us-innovation" hreflang="en">In letter to Makary, biotech CEOs push for FDA stability and say volatility threatens US innovation</a>
Title: <a href="https://www.fiercebiotech.com/biotech/fda-kicks-hiring-spree-and-new-communication-program-speed-sluggish-drug-reviews" hreflang="en"> FDA says it's hiring more than 1,000 new staffers, launches new comms program for review process</a>
Title: <a href="https://www.fiercebiotech.com/biotech/chutes-ladders-gilead-abruptly-parts-ways-general-counsel" hreflang="en">Chutes &amp; Ladders—Gilead abruptly parts ways with general counsel</a>
Title: <a href="https://www.fiercebiotech.com/biotech/fierce-biotech-layoff-tracker-2025" hreflang="en">Fierce Biotech Layoff Tracker 2025: Applied Tx lays off 46% of staff; Ensoma makes cuts</a>
Title: <a href="https://www.fiercebiotech.com/biotech/nurix-trims-workforce-pivotal-trial-

## Summarize Articles (Agent)

In [None]:
from google import genai
import time
import os

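# Read the API key from a local environment variable (or load it from a .env file); never hard-code it
api_key_env = os.getenv("GOOGLE_API_KEY")
if not api_key_env:
    raise RuntimeError("Set the GOOGLE_API_KEY environment variable before running this cell.")
client = genai.Client(api_key=api_key_env)

MODEL_NAME = "gemini-2.5-flash"

THROTTLE = 1  # seconds to pause after each request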

def summarize_article(title: str, summary: str) -> dict:
    prompt = f"""
    You are an AI biotech assistant. Summarize this article in 3 bullet points.
    Extract: 
    1. Main finding
    2. Key biological targets (genes, proteins, pathways)
    3. Application area (diagnostics, therapeutics, biotech tools, etc.)

    Title: {title}
    Summary: {summary}
    """

    response = client.models.generate_content(
        model=MODEL_NAME,
        contents=prompt
    )

    time.sleep(THROTTLE)         # API Rate limiting

    return {
        "title": title,
        "raw_summary": summary,
        "ai_summary": response.text
    }

# Example usage
ai_summary_sample = summarize_article(
    "Updated Full-Text Search Now Available",
    "As previously announced, NCBI has updated the PubMed Central (PMC) full-text search functionality and user experience..."
)

print(ai_summary_sample)

# Generate AI summaries for the first 5 articles only
articles.loc[:4, "ai_summary"] = articles.loc[:4].apply(
    lambda row: summarize_article(row["title"], row["summary_text"])["ai_summary"],
    axis=1
)

# Check results
print(articles.loc[:4, ["title", "ai_summary"]])

{'title': 'Updated Full-Text Search Now Available', 'raw_summary': 'As previously announced, NCBI has updated the PubMed Central (PMC) full-text search functionality and user experience...', 'ai_summary': 'Here is the summary of the article:\n\n*   **Main finding:** NCBI has updated the full-text search functionality and user experience for PubMed Central (PMC).\n*   **Key biological targets:** N/A – This announcement does not discuss specific biological targets.\n*   **Application area:** Biotech Tools/Research Platforms (specifically, an improved search functionality for a biomedical literature database).'}
                                               title  \
0  <a href="https://www.fiercebiotech.com/biotech...   
1  <a href="https://www.fiercebiotech.com/biotech...   
2  <a href="https://www.fiercebiotech.com/biotech...   
3  <a href="https://www.fiercebiotech.com/biotech...   
4  <a href="https://www.fiercebiotech.com/biotech...   

                                          ai_s
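
The sample response above returns the three requested items as markdown bullets of the form `*   **Label:** value`. If the model keeps to that layout, the free-text summary can be split into fields for later analysis. The following is a rough sketch under that assumption. The function name `parse_ai_summary` is hypothetical, and because LLM output is not guaranteed to follow one format, unmatched text is simply ignored.

```python
import re

# Matches bullets such as "*   **Main finding:** some text"
BULLET_RE = re.compile(
    r"^\s*[*\-]\s+\*\*(?P<label>[^*:]+):\*\*\s*(?P<value>.+)$",
    flags=re.MULTILINE,
)

def parse_ai_summary(ai_summary):
    """Return {lowercased label: value} for each bolded bullet found in the summary."""
    if not isinstance(ai_summary, str):
        return {}
    return {
        m.group("label").strip().lower(): m.group("value").strip()
        for m in BULLET_RE.finditer(ai_summary)
    }
```

Applied to the `ai_summary` column, this would yield keys such as `main finding`, `key biological targets`, and `application area` whenever the response follows the bullet layout, and an empty dict otherwise.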

In [14]:
# JSON

from pathlib import Path
import json

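# Convert entire DataFrame to list of dicts.
# Rows without an AI summary hold NaN, which json.dump would write as the non-standard token NaN,
# so replace missing values with None (serialized as null) first.
articles_list = articles.astype(object).where(articles.notna(), None).to_dict(orient="records")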

output_path = Path("data/rss_summarized.json")
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(articles_list, f, indent=2, ensure_ascii=False)

print(f"Saved {len(articles_list)} articles to {output_path}")

# TODO: append new articles without duplicates so they are stored for the trend agent to access later


Saved 37 articles to data\rss_summarized.json
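
The cell above overwrites the JSON file on every run, and its closing comment notes that new articles should eventually be appended without duplicates. One possible approach, sketched here, is to treat the article `link` as a unique key and skip any incoming article whose link is already stored. The function names `merge_articles` and `append_to_store` are hypothetical, and using `link` as the key is an assumption rather than something the notebook specifies.

```python
import json
from pathlib import Path

def merge_articles(existing, new):
    """Return existing + new articles, skipping new ones whose link is already present."""
    seen = {a.get("link") for a in existing if a.get("link")}
    merged = list(existing)
    for article in new:
        link = article.get("link")
        if link and link in seen:
            continue  # already stored, skip the duplicate
        if link:
            seen.add(link)
        merged.append(article)  # articles without a link are always kept
    return merged

def append_to_store(path, new_articles):
    """Load the stored articles (if any), merge in the new ones, and write the result back."""
    path = Path(path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    merged = merge_articles(existing, new_articles)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2, ensure_ascii=False), encoding="utf-8")
    return merged
```

With this in place, `append_to_store("data/rss_summarized.json", articles_list)` could replace the plain `json.dump` call so that repeated runs accumulate articles instead of replacing them.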


## Trend Analysis (Agent)

In [15]:
import json
import re
import pandas as pd
from pathlib import Path
from bs4 import BeautifulSoup

# Clean HTML titles
def clean_html(text):
    if isinstance(text, str):
        return BeautifulSoup(text, "html.parser").get_text()
    return text

articles['title_clean'] = articles['title'].apply(clean_html)
articles['summary_clean'] = articles['summary_text']  # already cleaned

# Extraction dictionaries
COMPANIES = [
    "illumina", "moderna", "pfizer", "roche", "novartis",
    "10x genomics", "pacbio", "gilead", "biogen", "regeneron", "fda"
]

METHODS = [
    "crispr", "sequencing", "ngs", "nanopore", "single-cell", "rna-seq",
    "clinical trial", "gene therapy"
]

CONCEPTS = [
    "biotech", "ai", "machine learning", "deep learning", "gene editing",
    "funding", "review process", "volatility", "regulation", "approval", "staffing"
]

SCIENTIFIC_TERMS_REGEX = r"\b[A-Za-z0-9\-]+(?:ase|protein|gene|pathway)\b"


In [16]:
# Extract topics function
def extract_topics(text):
    if not isinstance(text, str):
        return {"scientific_terms": [], "companies": [], "concepts": [], "methods": []}

    text_lower = text.lower()

    # Scientific terms
    scientific_terms = re.findall(SCIENTIFIC_TERMS_REGEX, text_lower)

    # Companies: known + uppercase acronyms
    companies_found = [c for c in COMPANIES if c.lower() in text_lower]
    acronyms = re.findall(r'\b[A-Z]{2,}\b', text)
    companies_found += acronyms

    # Methods
    methods_found = [m for m in METHODS if m.lower() in text_lower]

    # Concepts
    concepts_found = [c for c in CONCEPTS if c.lower() in text_lower]

    return {
        "scientific_terms": list(set(scientific_terms)),
        "companies": list(set(companies_found)),
        "concepts": list(set(concepts_found)),
        "methods": list(set(methods_found))
    }
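
Because `extract_topics` tests vocabulary terms with the `in` operator, a short term such as `"ai"` also matches inside unrelated words like "said" or "chain". The sketch below illustrates whole-word matching as one possible alternative. The helper name `find_terms` is hypothetical, and it is not a drop-in replacement for the function above, since it changes which terms are reported.

```python
import re

def find_terms(text, vocabulary):
    """Return the vocabulary entries that occur in text as whole words or phrases."""
    found = []
    for term in vocabulary:
        # Lookarounds act like word boundaries but also handle terms
        # containing hyphens or spaces, e.g. "single-cell" or "10x genomics"
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found
```

For a title like "Gilead said the chain was paid", the substring test `"ai" in text.lower()` is true, whereas `find_terms` reports only `"gilead"`.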

In [17]:
# Normalization
normalization_dict = {
    "single-cell": ["scRNA-seq", "single cell sequencing", "single-cell RNA seq", "single-cell"],
    "AI-biotech": ["AI", "machine learning", "deep learning"],
    "biotech IP/legal": ["Illumina lawsuit", "NGS patents"]
}

def normalize_topic(term: str):
    if not isinstance(term, str):
        return term
    for norm, variants in normalization_dict.items():
        if term.lower() == norm.lower() or term.lower() in [v.lower() for v in variants]:
            return norm
    return term

# Apply extraction + normalization
articles['topics_normalized'] = articles['title_clean'].apply(
    lambda s: {k: [normalize_topic(t) for t in v] for k, v in extract_topics(s).items()}
)

In [18]:
# Clustering
topic_clusters = {
    "single-cell": "Single-Cell Technologies",
    "AI-biotech": "AI in Biotech",
    "biotech IP/legal": "Biotech IP & Legal",
    "crispr": "Gene Editing",
    "sequencing": "Sequencing Technologies",
    "nanopore": "Sequencing Technologies",
    "rna-seq": "Sequencing Technologies",
    "illumina": "Major Biotech Companies",
    "10x genomics": "Major Biotech Companies",
    "pfizer": "Major Biotech Companies",
    "moderna": "Major Biotech Companies",
    "novartis": "Major Biotech Companies",
    "pacbio": "Sequencing Companies",
    "gilead": "Major Biotech Companies",
    "biogen": "Major Biotech Companies",
    "regeneron": "Major Biotech Companies",
    "fda": "Regulatory / Biotech Business",
    "review process": "Regulatory / Biotech Business",
    "staffing": "Regulatory / Biotech Business",
    "volatility": "Regulatory / Biotech Business"
}

def assign_cluster(term: str):
    if not isinstance(term, str):
        return None
    term_l = term.lower()
    for key, cluster in topic_clusters.items():
        if term_l == key.lower():
            return cluster
    return "Other"

In [19]:
# Build trend data
trend_data = []

for idx, row in articles.iterrows():
    topic_dict = row['topics_normalized']
    all_terms = []
    for cat, terms in topic_dict.items():
        all_terms.extend(terms)
    clusters = [assign_cluster(t) for t in all_terms]

    trend_data.append({
        "title": row['title_clean'],
        "source": row['source'],
        "published": row['published'],
        "topics": topic_dict,
        "clusters": clusters
    })

# Save trend data to JSON
output_path = Path("data/trend_topics.json")
output_path.parent.mkdir(parents=True, exist_ok=True)

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(trend_data, f, indent=2, ensure_ascii=False)

print(f"Saved trend topics to {output_path}")


Saved trend topics to data\trend_topics.json
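
Each `trend_data` entry carries one cluster label per extracted term, so a single article can list the same cluster several times. To measure how often a cluster appears across the dataset, one option is to count each cluster once per article. The sketch below assumes that counting rule and excludes the catch-all "Other" label by default. The function name `cluster_frequency` is hypothetical.

```python
from collections import Counter

def cluster_frequency(trend_data, ignore=("Other",)):
    """Count the number of articles that mention each cluster at least once."""
    counts = Counter()
    for item in trend_data:
        # A set removes repeated labels within one article
        unique = {c for c in item.get("clusters", []) if c and c not in ignore}
        counts.update(unique)
    return counts
```

Calling `cluster_frequency(trend_data).most_common(5)` would list the five clusters mentioned by the most articles, which could serve as a starting point for the frequency part of the trend report.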


In [20]:
display(trend_data[:3])  # show first 3 entries

[{'title': 'In letter to Makary, biotech CEOs push for FDA stability and say volatility threatens US innovation',
  'source': 'FierceBiotech',
  'published': 'Nov 21, 2025 4:16pm',
  'topics': {'scientific_terms': [],
   'companies': ['fda', 'US', 'FDA'],
   'concepts': ['volatility', 'biotech'],
   'methods': []},
  'clusters': ['Regulatory / Biotech Business',
   'Other',
   'Regulatory / Biotech Business',
   'Regulatory / Biotech Business',
   'Other']},
 {'title': " FDA says it's hiring more than 1,000 new staffers, launches new comms program for review process",
  'source': 'FierceBiotech',
  'published': 'Nov 21, 2025 11:20am',
  'topics': {'scientific_terms': [],
   'companies': ['fda', 'FDA'],
   'concepts': ['review process'],
   'methods': []},
  'clusters': ['Regulatory / Biotech Business',
   'Regulatory / Biotech Business',
   'Regulatory / Biotech Business']},
 {'title': 'Chutes & Ladders—Gilead abruptly parts ways with general counsel',
  'source': 'FierceBiotech',
  'p