A tool for scraping, structuring, and enriching faculty data from UCSB departmental websites.
This project aims to create a structured database of faculty information from UC Santa Barbara's departmental websites, including:
- Basic contact information (name, title, email, office, etc.)
- Research specializations and expertise
- Structured summaries of faculty research using AI
- Department and inter-departmental relationships
The project uses a notebook-driven development approach with nbdev to maintain well-documented, tested code with rich explanatory context.
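As a rough illustration of the notebook-driven workflow, the sketch below shows what the cells of a notebook such as nbs/00_core.ipynb might look like when written out as plain Python. The `#| default_exp` and `#| export` directives are standard nbdev comments; `normalize_name` is a hypothetical helper used only for illustration and is not taken from this package.

```python
#| default_exp core
# In a real notebook this directive sits in its own cell and names the module
# (here faculty_expertise/core.py) that exported cells are written to.

#| export
def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace in a scraped name (hypothetical helper)."""
    return " ".join(raw.split())

# A cell without `#| export` stays in the notebook only, where it serves as
# both documentation and a lightweight test.
assert normalize_name("  Ada \n Lovelace ") == "Ada Lovelace"
```

Because the directives are ordinary comments, the same code runs unchanged in a notebook or as a script.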
- Specialized scrapers for different department website layouts (Drupal, WordPress, custom)
- Flexible Unit class to manage department-specific scraping and enrichment
- AI-powered faculty research summarization (using OpenAI)
- Utilities for crawling and analyzing faculty websites
- Modular design for easy extension to additional departments
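One way to picture layout-specific scrapers behind a common entry point is a small registry keyed by layout name. The sketch below is a minimal, standard-library-only illustration and is not the package's actual implementation: the names `SCRAPERS`, `register`, and `scrape_names` are hypothetical, and the CSS classes (`views-field-title`, `entry-title`) are assumed examples of what a Drupal or WordPress faculty page might use.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class _ClassTextParser(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self._depth = 0  # > 0 while inside a matching element
        self._buf: List[str] = []
        self.matches: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._depth:
            self._depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.matches.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def _texts_with_class(html: str, css_class: str) -> List[str]:
    parser = _ClassTextParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical registry: layout name -> function that extracts faculty names.
SCRAPERS: Dict[str, Callable[[str], List[str]]] = {}


def register(layout: str):
    """Decorator that adds a scraper to the registry under `layout`."""
    def decorator(fn):
        SCRAPERS[layout] = fn
        return fn
    return decorator


@register("drupal")
def scrape_drupal(html: str) -> List[str]:
    return _texts_with_class(html, "views-field-title")  # assumed class name


@register("wordpress")
def scrape_wordpress(html: str) -> List[str]:
    return _texts_with_class(html, "entry-title")  # assumed class name


def scrape_names(html: str, layout: str) -> List[str]:
    """Dispatch to the scraper registered for `layout`."""
    if layout not in SCRAPERS:
        raise ValueError(f"no scraper registered for layout {layout!r}")
    return SCRAPERS[layout](html)
```

Under this design, supporting another department layout means registering one more function; the calling code does not change.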
If you are new to nbdev, here are some useful pointers to get you started.
# make sure faculty_expertise package is installed in development mode
$ pip install -e .
# make changes under nbs/ directory
# ...
# compile to have changes apply to faculty_expertise
$ nbdev_prepare

Install latest from the GitHub repository:
$ pip install git+https://github.com/caylor/faculty_expertise.git

or from conda
$ conda install -c caylor faculty_expertise

or from pypi
$ pip install faculty_expertise

Documentation can be found hosted on this GitHub repository's pages. Additionally, you can find package-manager-specific guidelines on conda and pypi respectively.
- nbs/: Jupyter notebooks that define the code base
  - 00_core.ipynb: Core data structures (Unit class)
  - 01_scrapers.ipynb: HTML scrapers for different department layouts
  - 02_enrichment.ipynb: AI enrichment and metadata extraction
- faculty_expertise/: Auto-generated Python modules from notebooks
  - core.py: Core data structures
  - my_scrapers.py: HTML scraping functions
  - my_enrichment.py: AI enrichment functions
- faculty_html/: HTML files from department websites
- faculty_screenshots/: Screenshots of department pages
# Scrape faculty from a department
from faculty_expertise.core import Unit
# Create a unit for Computer Science department
unit = Unit("Computer Science", "faculty_html/Computer_Science.html")
# Scrape faculty information
df = unit.scrape()
# Display the first few rows
print(df.head())
# Optionally enrich with AI-powered summaries (requires OpenAI API key)
from faculty_expertise.my_enrichment import enrich_faculty_row
row = df.iloc[0]
result = enrich_faculty_row(row)
print(result)

This project is licensed under the MIT License - see the LICENSE file for details.