# Faculty Expertise Enrichment

This notebook defines general enrichment logic for scraping additional metadata from faculty websites. Currently, we support:

- Fetching raw HTML and text
- Generating a scholarly summary using OpenAI's GPT model


In [None]:
#| default_exp my_enrichment

In [None]:
#| export
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
from dotenv import load_dotenv
from urllib.parse import urljoin, urlparse
import os
import fitz
import json
import re

In [None]:
#| export

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

### JSON Cleaner

A helper function that strips the Markdown code fence from LLM-generated JSON before parsing it.

In [None]:
#| export
def try_parse_json(raw_text):
    "Cleans up GPT output and returns a parsed JSON object (dict)"
    if not raw_text or not isinstance(raw_text, str):
        return {}

    # Remove Markdown code block fences
    cleaned = re.sub(r'^```(?:json)?', '', raw_text.strip(), flags=re.IGNORECASE).strip()
    cleaned = re.sub(r'```$', '', cleaned).strip()

    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as e:
        print("⚠️ JSON decode error:", e)
        print("Offending text (preview):", cleaned[:300])
        return {}
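
As a minimal, self-contained sketch of the fence-stripping step, the snippet below applies the same two regular expressions to a fenced and an unfenced reply and checks that both parse to the same object. The helper name `strip_code_fence`, the `FENCE` constant, and the sample replies are illustrative assumptions, not part of the exported module.

```python
import json
import re

# A Markdown code fence (three backticks), built indirectly for readability.
FENCE = "`" * 3

def strip_code_fence(raw_text: str) -> str:
    """Hypothetical helper: drop a leading fence (optionally tagged json) and a trailing fence."""
    cleaned = re.sub(rf"^{FENCE}(?:json)?", "", raw_text.strip(), flags=re.IGNORECASE).strip()
    return re.sub(rf"{FENCE}$", "", cleaned).strip()

# Made-up model replies: one wrapped in a fence, one returned as bare JSON.
fenced_reply = f'{FENCE}json\n{{"Topics": ["ecohydrology"]}}\n{FENCE}'
plain_reply = '{"Topics": ["ecohydrology"]}'

parsed_fenced = json.loads(strip_code_fence(fenced_reply))
parsed_plain = json.loads(strip_code_fence(plain_reply))
print(parsed_fenced == parsed_plain)
```

Unfenced input passes through unchanged, so the cleaner is safe to apply to every reply.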

## Fetch Faculty/Researcher Content

`gather_research_links` takes a faculty URL, typically a personal website or a departmental page, and collects same-domain links to research-related pages along with any Google Scholar, ORCID, and CV links it finds.


In [None]:
#| export
def gather_research_links(base_url, max_pages=6):
    """Gathers internal and external URLs relevant to faculty research, skipping Google Scholar fetch."""
    visited = set()
    all_urls = []
    orcid_url = None
    scholar_url = None
    cv_url = None

    try:
        resp = requests.get(base_url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'html.parser')
        links = [a['href'] for a in soup.find_all('a', href=True)]

        for href in links:
            full_url = urljoin(base_url, href)
            if full_url in visited:
                continue
            visited.add(full_url)

            if 'scholar.google' in href and not scholar_url:
                scholar_url = full_url
                print(f"Logging Google Scholar link: {full_url}")
            elif 'orcid.org' in href and not orcid_url:
                orcid_url = full_url
            elif full_url.lower().endswith('.pdf') and ('cv' in href.lower() or 'vita' in href.lower()):
                if not cv_url:
                    cv_url = full_url
                all_urls.append(full_url)
            elif urlparse(full_url).netloc == urlparse(base_url).netloc:
                if any(k in href.lower() for k in ['research', 'project', 'publication', 'bio', 'cv', 'about', 'news']):
                    all_urls.append(full_url)

        # Deduplicate while preserving order so base_url is always kept within max_pages
        all_urls = list(dict.fromkeys([base_url] + all_urls))[:max_pages]

    except Exception as e:
        print(f"Error gathering links from {base_url}: {e}")

    return {
        "Crawled URLs": all_urls,
        "ORCID URL": orcid_url,
        "Google Scholar URL": scholar_url,
        "CV URL": cv_url
    }
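
The link-sorting rules in `gather_research_links` can be tried offline. The sketch below mirrors those rules for a single `href`, without any network access. The helper `classify_link`, the category labels, and the `example.edu` URLs are assumptions made up for illustration.

```python
from urllib.parse import urljoin, urlparse

# Same keyword list as the crawler uses for same-domain pages.
KEYWORDS = ['research', 'project', 'publication', 'bio', 'cv', 'about', 'news']

def classify_link(base_url, href):
    """Hypothetical helper: resolve href against base_url and label it."""
    full_url = urljoin(base_url, href)
    lowered = href.lower()
    if 'scholar.google' in href:
        return full_url, 'scholar'
    if 'orcid.org' in href:
        return full_url, 'orcid'
    if full_url.lower().endswith('.pdf') and ('cv' in lowered or 'vita' in lowered):
        return full_url, 'cv'
    same_domain = urlparse(full_url).netloc == urlparse(base_url).netloc
    if same_domain and any(k in lowered for k in KEYWORDS):
        return full_url, 'crawl'
    return full_url, 'skip'

base = "https://lab.example.edu/"  # placeholder site, not a real faculty page
for href in ["/publications/", "assets/files/Doe_CV.pdf", "/teaching/",
             "https://orcid.org/0000-0000-0000-0000"]:
    print(classify_link(base, href))
```

Keyword matching is by substring, so a short keyword such as `bio` also matches paths like `/biology/`.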



In [None]:
response = gather_research_links("https://waves.eri.ucsb.edu")
print(response)

Logging Google Scholar link: https://scholar.google.com/citations?user=VGaoB64AAAAJ
{'Crawled URLs': ['https://waves.eri.ucsb.edu/publications/', 'https://waves.eri.ucsb.edu/assets/files/KCaylor_CV.pdf', 'https://waves.eri.ucsb.edu'], 'ORCID URL': None, 'Google Scholar URL': 'https://scholar.google.com/citations?user=VGaoB64AAAAJ', 'CV URL': 'https://waves.eri.ucsb.edu/assets/files/KCaylor_CV.pdf'}


### Get corpus from URLs

Fetch each URL in a list and concatenate the extracted text into a single corpus that can be summarized in a structured manner.

In [None]:
#| export

def get_corpus_from_urls(urls):
    """Fetches and concatenates cleaned text from a list of URLs, including embedded text extracted from PDFs (no OCR)."""
    full_text = ''

    for url in urls:
        try:
            if url.lower().endswith('.pdf'):
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                # Open the PDF from memory so no temp file is left behind if parsing fails
                with fitz.open(stream=response.content, filetype="pdf") as doc:
                    for page in doc:
                        full_text += ' ' + page.get_text()
            else:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                soup = BeautifulSoup(resp.text, 'html.parser')
                full_text += ' ' + ' '.join(soup.stripped_strings)
        except Exception as e:
            print(f"Failed to fetch {url}: {e}")
            continue

    return full_text.strip()

In [None]:
result = get_corpus_from_urls(response["Crawled URLs"])

In [None]:
print(result[:1000])  # Print the first 1000 characters of the fetched text

Publications from the WAVES Lab - Water, Vegetation, & Society WAVES @ Water, Vegetation, & Society News Our Team Teaching Opportunities Publications CV gScholar Site Archive Toggle Menu Publications from the WAVES Lab Smallholder social networks: Advice seeking and adaptation in rural Kenya Giroux, S. et al. (2023). Smallholder social networks: Advice seeking and adaptation in rural Kenya. Agricultural Systems, doi:10.1016/j.agsy.2022.103574. Modeling seasonal vegetation phenology from hydroclimatic drivers for contrasting plant functional groups within drylands of the Southwestern USA Warter, M. et al. (2023). Modeling seasonal vegetation phenology from hydroclimatic drivers for contrasting plant functional groups within drylands of the Southwestern USA. Environmental Research: Ecology, doi:10.1088/2752-664X/acb9a0. Fluxbots: A Method for Building, Deploying, Collecting and Analyzing Data From an Array of Inexpensive, Autonomous Soil Carbon Flux Chambers Forbes, E. et al. (2023). Flu

## Summarize Faculty Expertise

In [None]:
#| export
def summarize_faculty_expertise(text, length=750):
    "Return a Python dictionary of faculty research specialization using a consistent schema"
    prompt = f"""
You are assisting a university research office in building a structured directory of faculty expertise.

Based on the following faculty webpage content, produce a JSON object with the following fields:

- Research Title: a short title summarizing the faculty’s main research area.
- Expertise: a 1-2 sentence summary of the research focus written for a broad academic audience.
- Research Description: a 1-2 paragraph description of the faculty's research written for a broad audience and suitable for a university website.
- Topics: a list of high-level research themes.
- Methods: a list of research methods or tools used.
- Geographic Focus: a list of countries, regions, or global.
- Keywords: a list of 5–10 freeform keywords.
- Disciplines: a list of academic fields or disciplines.
- Potential Applications: a list of relevant societal, environmental, or economic applications.

Faculty Webpage Text:
{text[:8000]}

Respond only with a JSON object.
"""
    try:
        completion = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=length,
            temperature=0.3
        )
        return try_parse_json(completion.choices[0].message.content.strip())
    except Exception as e:
        print(f"OpenAI error: {e}")
        # Return an empty dict (not None) so callers can unpack or iterate the result safely
        return {}
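
Because the model is only asked, not forced, to follow the schema in the prompt, it may help to check the parsed result before using it. The following is a rough sketch of such a check. The helper `check_summary` and the split into list-valued and text-valued fields are assumptions inferred from the prompt wording, not part of the exported module.

```python
# Field names copied from the prompt in summarize_faculty_expertise.
EXPECTED_FIELDS = [
    "Research Title", "Expertise", "Research Description", "Topics", "Methods",
    "Geographic Focus", "Keywords", "Disciplines", "Potential Applications",
]
# Fields the prompt describes as "a list of ..." (assumed to parse as JSON arrays).
LIST_FIELDS = {"Topics", "Methods", "Geographic Focus", "Keywords",
               "Disciplines", "Potential Applications"}

def check_summary(summary):
    """Hypothetical helper: return a list of problems; an empty list means the shape looks right."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in summary:
            problems.append(f"missing field: {field}")
        elif field in LIST_FIELDS and not isinstance(summary[field], list):
            problems.append(f"expected a list for: {field}")
    return problems

# A deliberately incomplete, made-up result to show what gets reported.
partial = {"Research Title": "Example title", "Topics": "ecohydrology"}
for problem in check_summary(partial):
    print(problem)
```

An empty dictionary, such as the one `try_parse_json` returns on a decode error, is reported as missing every field.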


### Summarize expertise

In [None]:
output = summarize_faculty_expertise(result, length=750)

In [None]:
# Print output in a readable format:
for key, value in output.items():
    print(f"{key}: {value}")


Research Title: Sustainable Agriculture and Environmental Management
Expertise: This research focuses on the intersection of water management, agricultural sustainability, and environmental impacts, with a particular emphasis on smallholder farming systems.
Research Description: The WAVES Lab conducts multidisciplinary research aimed at understanding and improving the relationships between water, vegetation, and society. By exploring topics such as smallholder social networks, seasonal vegetation phenology, and the impacts of climate variability on agricultural practices, the lab seeks to develop sustainable solutions for water and agricultural management. The research spans from technological innovations like autonomous soil carbon flux chambers to socio-environmental analyses such as the effects of transportation infrastructure on agricultural supply chains. Through a combination of field studies, modeling, and data analysis, the lab aims to address critical challenges facing smallho

## Enrich Faculty Row

`enrich_faculty_row` takes a `pd.Series` (one row of a DataFrame) and enriches it with an OpenAI-generated summary. It can be applied across a whole `pd.DataFrame`, but that has not been checked against API rate limits; handling many rows will probably require running the API calls concurrently, for example with threads.

Currently, this is only used in `index.ipynb` as an example for *a single row*.
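
One possible shape for the threaded approach mentioned above is a small thread pool whose `max_workers` caps how many requests are in flight at once. This is only a sketch: `enrich_stub` is a hypothetical stand-in that makes no network or API calls, `enrich_rows` is an illustrative name, and the right worker count for a real API quota is not something this notebook establishes.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def enrich_stub(row):
    """Hypothetical stand-in for a per-row enrichment call; sleeps instead of calling an API."""
    time.sleep(0.01)
    return {"Website": row.get("Website"), "Enriched": bool(row.get("Website"))}

def enrich_rows(rows, enrich=enrich_stub, max_workers=4):
    """Run enrich over rows concurrently; max_workers bounds simultaneous calls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input rows.
        return list(pool.map(enrich, rows))

# Plain dicts stand in for DataFrame rows; the URLs are placeholders.
rows = [
    {"Website": "https://a.example.edu"},
    {"Website": None},
    {"Website": "https://b.example.edu"},
]
results = enrich_rows(rows)
print(results)
```

Since results come back in input order, they can be assigned straight back to the rows they came from.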

In [None]:
#| export
def cache_expertise(func):
    """Decorator that caches results of the wrapped function, keyed on the row's contents. Accepts a pd.Series or a hashable value such as a string."""
    cache = {}
    
    def wrapper(row):
        # If row is a pandas Series, we need to check attributes differently
        # than for other types
        if hasattr(row, 'to_dict'):
            # For DataFrame rows or Series, we need a hashable key
            row_key = tuple(row.items())
            if row_key not in cache:
                cache[row_key] = func(row)
            return cache[row_key]
        
        # For simple types like strings
        if row not in cache:
            cache[row] = func(row)
        return cache[row]
    
    return wrapper

@cache_expertise
def enrich_faculty_row(row):
    """Given a row with a Website, returns a dictionary of enriched fields."""
    url = row.get("Website")
    if not url:
        return {}

    metadata = gather_research_links(url)
    corpus = get_corpus_from_urls(metadata["Crawled URLs"])
    summary = summarize_faculty_expertise(corpus)

    return {
        **metadata,
        **summary  # expands structured JSON into flat columns
    }
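
To illustrate the caching idea in isolation, here is a small sketch that keys the cache on `tuple(row.items())` and counts how often the wrapped function actually runs. It uses plain dicts instead of `pd.Series`, and the names `cache_by_items` and `describe` are invented for the example.

```python
import functools

def cache_by_items(func):
    """Hypothetical decorator: memoize on the row's (key, value) pairs."""
    cache = {}

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(row):
        key = tuple(row.items())  # hashable as long as every value is hashable
        if key not in cache:
            cache[key] = func(row)
        return cache[key]

    return wrapper

calls = []  # records each real invocation so cache hits are visible

@cache_by_items
def describe(row):
    """Stand-in for an expensive enrichment call."""
    calls.append(row["Website"])
    return {"Website": row["Website"]}

row = {"Name": "A. Researcher", "Website": "https://lab.example.edu"}
first = describe(row)
second = describe(dict(row))  # a copy with equal contents maps to the same key
print(len(calls))
```

A key built this way requires hashable values, so a row that already holds a list (for example a `Topics` column) would raise a `TypeError`; keying on the `Website` value alone is one possible alternative.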