# Core Unit Management

This notebook defines the core `Unit` class, which uses the `get_scraper` factory (imported from `faculty_expertise.my_scrapers`) to dispatch the correct scraping function for each unit. It is part of a modular framework, built with `nbdev`, for scraping and enriching faculty directories.

## Features
- Defines the `Unit` object to manage scraping and enrichment
- Uses a strategy pattern to dispatch custom scraper functions
- Enables reusable enrichment functions (e.g., OpenAI-powered expertise summarization)
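
The strategy-pattern dispatch mentioned above can be sketched as a name-to-function registry. This is a minimal illustration, not the actual contents of `faculty_expertise.my_scrapers`: `register_scraper`, `get_scraper_sketch`, and `scrape_art` are hypothetical names, and plain lists of dicts stand in for the DataFrames the real scrapers return.

```python
from typing import Callable, Dict, List, Optional

# A scraper takes (html_file, base_url) and returns one record per faculty member.
# The real scrapers return a pandas DataFrame; plain dicts keep this sketch stdlib-only.
Scraper = Callable[[str, Optional[str]], List[dict]]

# Hypothetical registry; the real lookup lives in faculty_expertise.my_scrapers
# and may be organized differently.
_SCRAPERS: Dict[str, Scraper] = {}


def register_scraper(unit_name: str) -> Callable[[Scraper], Scraper]:
    """Decorator that files a scraper under a unit name (hypothetical helper)."""
    def decorator(func: Scraper) -> Scraper:
        _SCRAPERS[unit_name.lower()] = func
        return func
    return decorator


def get_scraper_sketch(unit_name: str) -> Scraper:
    """Look up the scraper for a unit, failing loudly if none is registered."""
    try:
        return _SCRAPERS[unit_name.lower()]
    except KeyError:
        raise ValueError(f"No scraper registered for unit {unit_name!r}") from None


@register_scraper("Art")
def scrape_art(html_file: str, base_url: Optional[str]) -> List[dict]:
    # Placeholder body: a real scraper would parse html_file here.
    return [{"Name": "Alice Example", "Website": base_url}]


# The caller only needs the unit name; the registry picks the strategy.
rows = get_scraper_sketch("art")("art.html", "http://example.com")
print(rows)
```

A registry like this keeps `Unit` unaware of individual scrapers: supporting a new department means registering one more function rather than editing the class.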


In [None]:
#| default_exp core


In [None]:
#| export
from pathlib import Path
import pandas as pd
from faculty_expertise.my_enrichment import gather_research_links, get_corpus_from_urls, summarize_faculty_expertise


## Unit class

In [None]:
#| export
from faculty_expertise.my_scrapers import get_scraper

class Unit:
    "Encapsulates scraping and enrichment for a given academic unit"
    def __init__(self, name, html_file, base_url=None, scraper_func=None, metadata=None):
        self.name = name
        self.html_file = Path(html_file)
        self.base_url = base_url
        self.scraper_func = scraper_func or get_scraper(name)
        self.metadata = metadata or {}
        self.df = None

    def scrape(self):
        "Run the unit's scraper function and return a DataFrame"
        self.df = self.scraper_func(self.html_file, self.base_url)
        self.df["Unit"] = self.name
        return self.df

    def enrich_with(self, enrich_func, source_field="Website", target_field="Expertise"):
        "Apply an enrichment function to a field (e.g., summarizing expertise from faculty websites)"
        if self.df is None:
            raise ValueError("You must call scrape() first.")
        self.df[target_field] = self.df[source_field].apply(
            lambda url: enrich_func(url) if pd.notnull(url) else None
        )
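
Because `enrich_with` calls `enrich_func` once per non-null value, an exception raised for a single URL would abort the whole pass. One possible guard, sketched below, is to wrap the enrichment function so that failures become `None`. The names `tolerant` and `fake_summarize` are hypothetical and exist only for this illustration.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def tolerant(enrich_func: Callable[[str], Optional[str]]) -> Callable[[str], Optional[str]]:
    """Wrap an enrichment function so a failure on one URL yields None (hypothetical helper)."""
    def wrapper(url: str) -> Optional[str]:
        try:
            return enrich_func(url)
        except Exception as exc:  # broad on purpose: network and parsing errors vary
            logger.warning("Enrichment failed for %s: %s", url, exc)
            return None
    return wrapper


# Stand-in enrichment function, for illustration only.
def fake_summarize(url: str) -> str:
    if "broken" in url:
        raise RuntimeError("site unreachable")
    return f"summary of {url}"


safe_summarize = tolerant(fake_summarize)
results = [safe_summarize(u) for u in ["http://example.com", "http://broken.example"]]
print(results)
```

With the `Unit` class above, the wrapped function would be passed in the usual way, for example `u.enrich_with(tolerant(fake_summarize))`.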


## Examples and Tests

In [None]:
#| eval: false

# Example test: dummy scraper
def dummy_scraper(file_path, base_url):
    return pd.DataFrame([{
        "Name": "Alice Example",
        "Website": "http://example.com"
    }])

u = Unit("TestUnit", "dummy.html", base_url="http://example.com", scraper_func=dummy_scraper)
df = u.scrape()
assert "Name" in df.columns
assert u.name == "TestUnit"


## Art Department Example

Scrape the Art department's faculty directory from a saved HTML file and inspect the result.

In [None]:
art = Unit("Art", "../faculty_html/art.html", base_url="https://www.arts.ucsb.edu")
art.scrape()

| | Name | Title(s) | Specialization | Email | Phone | Office | Website | Photo URL | Unit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sarah Rosalena Brady | Assistant Professor | Computational Craft and Haptic Media | rosalena@arts.ucsb.edu | | Arts 0250 | http://www.sarahrosalena.com/ | | Art |
| 1 | Jane Callister | Professor | Painting, Drawing | jane@arts.ucsb.edu | | Arts 1348 | https://www.janecallister.com/ | | Art |
| 2 | Iman Djouini | Assistant Teaching Professor | Print, Book Arts and Intermedia | imandjouini@ucsb.edu | | | https://imandjouini.com/ | | Art |
| 3 | Kip Fulbeck | Professor | Narrative, Performative Studies | | | Arts 2222 | https://kipfulbeck.com/ | | Art |
| 4 | Lisa Jevbratt | Professor | Software, Wool, Science, Parascience, Interspe... | jevbratt@arts.ucsb.edu | | Arts 2224 | http://jevbratt.com/ | | Art |
| 5 | Alex Lukas | Associate Professor | Print & Publication Arts | alexlukas@ucsb.edu | | | https://www.arts.ucsb.edu/faculty/alex-lukas/ | | Art |
| 6 | moulton@arts.ucsb.edu | | Department Chair | moulton@arts.ucsb.edu | | Arts 2326 | https://www.arts.ucsb.edumailto:moulton@arts.u... | | Art |
| 7 | Marcos Novak | Professor | Interactive Media | marcos@mat.ucsb.edu | | Elings Hall 2207 | http://translab.mat.ucsb.edu/ | | Art |
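
In the output above, row 6 has an email address in the `Name` column and a `Website` value made of the base URL followed directly by `mailto:`. The scraper code is not shown here, so the cause is an assumption: it looks as if a `mailto:` link was string-concatenated onto `base_url`. The sketch below flags such rows and shows one way links could be resolved with `urllib.parse.urljoin` instead. `flag_suspect_rows` and `resolve_href` are hypothetical helpers, and the `/faculty/alex-lukas/` relative link is assumed for illustration.

```python
from typing import Dict, List, Optional
from urllib.parse import urljoin


def flag_suspect_rows(records: List[Dict[str, object]]) -> List[int]:
    """Return positions of records whose Name or Website looks mis-parsed."""
    suspects = []
    for i, rec in enumerate(records):
        # str() guards against None or NaN values from a DataFrame export.
        name = str(rec.get("Name") or "")
        website = str(rec.get("Website") or "")
        if "@" in name or "mailto:" in website:
            suspects.append(i)
    return suspects


def resolve_href(base_url: str, href: str) -> Optional[str]:
    """Resolve a link against base_url, skipping mailto: links entirely."""
    if href.lower().startswith("mailto:"):
        return None
    return urljoin(base_url, href)


# Two records copied from the scraped output above (rows 5 and 6).
sample = [
    {"Name": "Alex Lukas",
     "Website": "https://www.arts.ucsb.edu/faculty/alex-lukas/"},
    {"Name": "moulton@arts.ucsb.edu",
     "Website": "https://www.arts.ucsb.edumailto:moulton@arts.u..."},
]

suspects = flag_suspect_rows(sample)
relative = resolve_href("https://www.arts.ucsb.edu", "/faculty/alex-lukas/")
mail = resolve_href("https://www.arts.ucsb.edu", "mailto:moulton@arts.ucsb.edu")
print(suspects, relative, mail)
```

For a scraped DataFrame, `art.df.to_dict("records")` would supply the list of records that `flag_suspect_rows` expects.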


In [None]:
# Check that the scraper returned data
df = art.df
# How many rows?
print(f"Number of rows: {len(df)}")
# Show the first five rows as a left-aligned table:
df.head(5).style.set_properties(**{'text-align': 'left'}).set_table_styles(
    [{'selector': 'th', 'props': [('text-align', 'left')]}]
)

Number of rows: 8


| | Name | Title(s) | Specialization | Email | Phone | Office | Website | Photo URL | Unit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sarah Rosalena Brady | Assistant Professor | Computational Craft and Haptic Media | rosalena@arts.ucsb.edu | | Arts 0250 | http://www.sarahrosalena.com/ | | Art |
| 1 | Jane Callister | Professor | Painting, Drawing | jane@arts.ucsb.edu | | Arts 1348 | https://www.janecallister.com/ | | Art |
| 2 | Iman Djouini | Assistant Teaching Professor | Print, Book Arts and Intermedia | imandjouini@ucsb.edu | | | https://imandjouini.com/ | | Art |
| 3 | Kip Fulbeck | Professor | Narrative, Performative Studies | | | Arts 2222 | https://kipfulbeck.com/ | | Art |
| 4 | Lisa Jevbratt | Professor | Software, Wool, Science, Parascience, Interspecies, Participatory, Epistemology, Data, Weaving, Network, Ontology | jevbratt@arts.ucsb.edu | | Arts 2224 | http://jevbratt.com/ | | Art |


In [None]:
# Test enrich_faculty_row on a single row of art.df
# (enrich_faculty_row is not among the imports in the cells above; import it before running)
row = art.df.iloc[0]
result = enrich_faculty_row(row)

In [None]:
# Print the result in a nice format:
print(f"Website: {row['Website']}")
print(f"Name: {row['Name']}")
print("Enriched Data:")
for key, value in result.items():
    print(f"  {key}: {value}")


Website: http://www.sarahrosalena.com/
Name: Sarah Rosalena Brady
Enriched Data:
  Crawled URLs: ['http://www.sarahrosalena.com/bio', 'http://www.sarahrosalena.com/', 'http://www.sarahrosalena.com/cv']
  ORCID URL: None
  Google Scholar URL: None
  CV URL: None
  Research Title: Interdisciplinary Approaches to Computational Craft and Haptic Media
  Expertise: Sarah Rosalena explores the intersection of traditional handicraft traditions with emerging technology, focusing on the hybrid forms that challenge the boundaries between ancient and future, tradition and innovation.
  Research Description: Sarah Rosalena's research navigates the rich terrain between traditional handicraft and cutting-edge technology. By integrating Indigenous cosmologies with digital tools, she creates artworks that defy conventional categorizations, blurring the lines between high and low tech, human and nonhuman. Her work with digital Jacquard looms, 3D ceramic printers, and image software reinterprets traditio