# Faculty Finder - Web Scraping

This notebook demonstrates the web scraping functionality for the DA-IICT Faculty Directory.

## Objectives
1. Test the scraper on a sample faculty directory
2. Extract profile URLs from all 5 directories
3. Scrape individual profile pages
4. Validate raw HTML data

> **Note**: After fixing the URL pattern bug, the scraper now finds **109 total profiles** across all 5 directories!
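The note above does not say what the URL pattern bug was. As a minimal sketch of what a profile-URL pattern could look like, assuming each profile page sits exactly one path segment below its directory (as the sample URLs in this notebook suggest): `DIRECTORY_SLUGS`, `PROFILE_URL_RE`, and `is_profile_url` are hypothetical names for illustration, not part of `src.scraper`.

```python
import re

# Directory slugs taken from the five directory URLs listed in this notebook.
DIRECTORY_SLUGS = (
    "faculty",
    "adjunct-faculty",
    "adjunct-faculty-international",
    "distinguished-professor",
    "professor-practice",
)

# Assumption: a profile URL is <base>/<directory>/<slug>, where the slug is
# lowercase and hyphen-separated, with nothing after it.
PROFILE_URL_RE = re.compile(
    r"^https://www\.daiict\.ac\.in/(?P<directory>"
    + "|".join(re.escape(slug) for slug in DIRECTORY_SLUGS)
    + r")/(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)/?$"
)


def is_profile_url(url: str) -> bool:
    """Return True if the URL looks like an individual profile page."""
    return PROFILE_URL_RE.match(url) is not None
```

A pattern like this accepts `https://www.daiict.ac.in/faculty/abhishek-gupta` but rejects the directory page `https://www.daiict.ac.in/faculty` itself. A pattern that only covers the main `faculty` directory would miss the other four.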

In [1]:
import os
import sys

# Add the parent directory to the import path so that `src` resolves.
sys.path.append('..')

from src.scraper import FacultyScraper
from src.config import FACULTY_URLS

## Step 1: Initialize Scraper

In [2]:
scraper = FacultyScraper()
print("Scraper initialized successfully!")
print(f"\nFaculty directories to scrape: {len(FACULTY_URLS)}")
for i, url in enumerate(FACULTY_URLS, 1):
    print(f"{i}. {url}")

Scraper initialized successfully!

Faculty directories to scrape: 5
1. https://www.daiict.ac.in/faculty
2. https://www.daiict.ac.in/adjunct-faculty
3. https://www.daiict.ac.in/adjunct-faculty-international
4. https://www.daiict.ac.in/distinguished-professor
5. https://www.daiict.ac.in/professor-practice


## Step 2: Test Single Directory Scraping

Let's test with the main faculty directory first.

Expected result: ~67 profiles
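The internals of `scrape_faculty_directory` are not shown in this notebook. The following is a rough sketch of how profile links could be pulled from a directory page using only the standard library, assuming profile links are ordinary `<a href>` elements that point one level below the directory URL. `LinkCollector` and `extract_profile_links` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links such as "/faculty/name" into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_profile_links(html: str, directory_url: str) -> List[str]:
    """Return unique links below directory_url, in page order."""
    collector = LinkCollector(directory_url)
    collector.feed(html)
    prefix = directory_url.rstrip("/") + "/"
    seen = set()
    result = []
    for link in collector.links:
        # Keep only links strictly below the directory, skipping duplicates.
        if link.startswith(prefix) and len(link) > len(prefix) and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

Deduplicating while preserving order matters because a directory page may link to the same profile more than once (for example, from both a photo and a name).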

In [3]:
test_url = FACULTY_URLS[0]
print(f"Testing with: {test_url}")

profile_links = scraper.scrape_faculty_directory(test_url)

print(f"\nFound {len(profile_links)} faculty profiles")
print("\nFirst 5 profiles:")
for i, link in enumerate(profile_links[:5], 1):
    print(f"{i}. {link}")

2026-01-17 22:40:48,188 - INFO - Fetching: https://www.daiict.ac.in/faculty


Testing with: https://www.daiict.ac.in/faculty


2026-01-17 22:40:52,563 - INFO - Found 67 profile links



Found 67 faculty profiles

First 5 profiles:
1. https://www.daiict.ac.in/faculty/abhishek-gupta
2. https://www.daiict.ac.in/faculty/abhishek-jindal
3. https://www.daiict.ac.in/faculty/abhishek-tilva
4. https://www.daiict.ac.in/faculty/aditya-tatu
5. https://www.daiict.ac.in/faculty/ajay-beniwal


## Step 3: Test Profile HTML Fetching

Fetch a single profile page and examine the HTML structure.
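Printing the first 500 characters only shows the document head. As one possible way to examine the structure more systematically, the sketch below summarizes a page by its `<title>` and a count of each tag, using the standard-library `html.parser`. `StructureSummary` and `summarize_html` are hypothetical helpers, not part of the project.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureSummary(HTMLParser):
    """Record the page title and how often each tag occurs."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>.
        if self._in_title:
            self.title += data


def summarize_html(html):
    """Return (title, Counter of tag names) for an HTML string."""
    parser = StructureSummary()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.tag_counts
```

With the `html` variable from the cell below, `summarize_html(html)[1].most_common(10)` would list the ten most frequent tags, which is a quick way to spot the containers worth parsing later.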

In [4]:
if profile_links:
    test_profile = profile_links[0]
    print(f"Fetching profile: {test_profile}")
    
    html = scraper.fetch_profile_html(test_profile)
    
    if html:
        print(f"\nHTML length: {len(html)} characters")
        print("\nFirst 500 characters:")
        print(html[:500])
        
        # Strip any trailing slash so the slug is never an empty string.
        slug = test_profile.rstrip('/').split('/')[-1]
        scraper.save_raw_html(html, f"test_{slug}")
        print(f"\nSaved to: data/raw/test_{slug}.html")

2026-01-17 22:40:56,121 - INFO - Fetching: https://www.daiict.ac.in/faculty/abhishek-gupta


Fetching profile: https://www.daiict.ac.in/faculty/abhishek-gupta


2026-01-17 22:40:57,422 - INFO - Saved raw HTML: c:\Users\Admin\OneDrive\Desktop\SEM2\BIG DATA\PROJECT PHASE 1\faculty_finder\data\raw\test_abhishek-gupta.html



HTML length: 75960 characters

First 500 characters:
<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
  <head>
    <meta charset="utf-8" />
<meta name="MobileOptimized" content="width" /

Saved to: data/raw/test_abhishek-gupta.html


## Step 4: Scrape All Directories

Extract profile URLs from all 5 faculty directories.

**Expected Results**:
- Main Faculty: ~67 profiles
- Adjunct Faculty: ~26 profiles
- Adjunct Faculty International: ~11 profiles
- Distinguished Professor: ~1 profile
- Professor of Practice: ~4 profiles
- **Total: ~109 profiles**
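The expected total is just the sum of the per-directory figures (67 + 26 + 11 + 1 + 4 = 109). A small sketch of that check follows, assuming `all_profiles` is a dict keyed by directory slug, as the summary output in this notebook suggests. `EXPECTED_COUNTS` and `count_mismatches` are hypothetical names, and the counts are the approximate figures listed above, so a mismatch may simply mean the site has changed.

```python
# Approximate per-directory counts from the list above.
EXPECTED_COUNTS = {
    "faculty": 67,
    "adjunct-faculty": 26,
    "adjunct-faculty-international": 11,
    "distinguished-professor": 1,
    "professor-practice": 4,
}

expected_total = sum(EXPECTED_COUNTS.values())


def count_mismatches(all_profiles, expected=EXPECTED_COUNTS):
    """Return {directory: (expected, actual)} for every count that differs."""
    mismatches = {}
    for name, want in expected.items():
        got = len(all_profiles.get(name, []))
        if got != want:
            mismatches[name] = (want, got)
    return mismatches
```

After the scraping cell has run, an empty dict from `count_mismatches(all_profiles)` means every directory returned the expected number of links.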

In [5]:
print("Scraping all faculty directories...\n")
all_profiles = scraper.scrape_all_directories()

print("\n" + "="*60)
print("DIRECTORY SUMMARY")
print("="*60)

total = 0
for faculty_type, profiles in all_profiles.items():
    count = len(profiles)
    total += count
    print(f"{faculty_type:40s}: {count:3d} profiles")

print("="*60)
print(f"{'TOTAL':40s}: {total:3d} profiles")
print("="*60)

2026-01-17 22:41:02,894 - INFO - 
2026-01-17 22:41:02,894 - INFO - Scraping directory: faculty
2026-01-17 22:41:02,895 - INFO - Fetching: https://www.daiict.ac.in/faculty


Scraping all faculty directories...



2026-01-17 22:41:04,049 - INFO - Found 67 profile links
2026-01-17 22:41:04,050 - INFO - Total profiles found in faculty: 67
2026-01-17 22:41:04,050 - INFO - 
2026-01-17 22:41:04,051 - INFO - Scraping directory: adjunct-faculty
2026-01-17 22:41:04,051 - INFO - Fetching: https://www.daiict.ac.in/adjunct-faculty
2026-01-17 22:41:05,185 - INFO - Found 26 profile links
2026-01-17 22:41:05,185 - INFO - Total profiles found in adjunct-faculty: 26
2026-01-17 22:41:05,186 - INFO - 
2026-01-17 22:41:05,186 - INFO - Scraping directory: adjunct-faculty-international
2026-01-17 22:41:05,187 - INFO - Fetching: https://www.daiict.ac.in/adjunct-faculty-international
2026-01-17 22:41:06,310 - INFO - Found 11 profile links
2026-01-17 22:41:06,310 - INFO - Total profiles found in adjunct-faculty-international: 11
2026-01-17 22:41:06,311 - INFO - 
2026-01-17 22:41:06,312 - INFO - Scraping directory: distinguished-professor
2026-01-17 22:41:06,312 - INFO - Fetching: https://www.daiict.ac.in/distinguished-


DIRECTORY SUMMARY
faculty                                 :  67 profiles
adjunct-faculty                         :  26 profiles
adjunct-faculty-international           :  11 profiles
distinguished-professor                 :   1 profiles
professor-practice                      :   4 profiles
TOTAL                                   : 109 profiles


## Step 5: Scrape Sample Profiles

Scrape a small sample of profiles to validate the scraping process.

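> **Note**: Run this only if you want to test profile downloading. Skip if HTML files already exist. The saved output below is from an earlier execution (`In [20]`, with timestamps that precede the other cells), and it lists only 2 sample profiles, not a sample drawn from all 5 directories.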
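Instead of skipping the cell by hand, the check for existing files could be automated. This is a minimal sketch, assuming raw pages are stored as `<slug>.html` in the raw data directory (which matches the file names in the log output of this notebook). `pending_profiles` is a hypothetical helper, not a method of `FacultyScraper`.

```python
from pathlib import Path


def pending_profiles(profile_urls, raw_dir):
    """Return the URLs whose <slug>.html file is not yet in raw_dir."""
    raw_path = Path(raw_dir)
    pending = []
    for url in profile_urls:
        # The slug is the last path segment of the profile URL.
        slug = url.rstrip("/").split("/")[-1]
        if not (raw_path / f"{slug}.html").is_file():
            pending.append(url)
    return pending
```

Passing `pending_profiles(sample_profiles, "../data/raw")` to the download loop would then fetch only the pages that are missing, which also avoids unnecessary requests to the site.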

In [20]:
# Sample: Get 2 profiles from each directory
sample_profiles = []
for faculty_type, profiles in all_profiles.items():
    sample_profiles.extend(profiles[:2])

print(f"Scraping {len(sample_profiles)} sample profiles...\n")

for i, profile_url in enumerate(sample_profiles, 1):
    slug = profile_url.split('/')[-1]
    print(f"[{i}/{len(sample_profiles)}] {slug}")
    scraper.scrape_profile_details(profile_url)

print("\nSample scraping completed!")

2026-01-17 22:40:07,995 - INFO - Fetching: https://www.daiict.ac.in/faculty/abhishek-gupta


Scraping 2 sample profiles...

[1/2] abhishek-gupta


2026-01-17 22:40:09,081 - INFO - Saved raw HTML: c:\Users\Admin\OneDrive\Desktop\SEM2\BIG DATA\PROJECT PHASE 1\faculty_finder\data\raw\abhishek-gupta.html
2026-01-17 22:40:09,082 - INFO - Fetching: https://www.daiict.ac.in/faculty/abhishek-jindal


[2/2] abhishek-jindal


2026-01-17 22:40:10,158 - INFO - Saved raw HTML: c:\Users\Admin\OneDrive\Desktop\SEM2\BIG DATA\PROJECT PHASE 1\faculty_finder\data\raw\abhishek-jindal.html



Sample scraping completed!


## Step 6: Validate Raw HTML Files

Check the saved HTML files in the `data/raw/` directory.

**Expected**: 110 HTML files (109 profiles + 1 test file)
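Counting files confirms that downloads happened, but not that each file holds a real page. The sketch below adds two cheap sanity checks, assuming a valid profile page begins with `<!DOCTYPE html>` (as the sample in Step 3 does) and is larger than a small size threshold. `find_suspect_files` and the `min_bytes` default are hypothetical choices for illustration.

```python
from pathlib import Path


def find_suspect_files(raw_dir, min_bytes=1024):
    """Return (file name, reason) for HTML files that look incomplete."""
    suspects = []
    for path in sorted(Path(raw_dir).glob("*.html")):
        # Read only the start of the file to check for the doctype.
        with path.open("rb") as handle:
            head = handle.read(256).lstrip().lower()
        if path.stat().st_size < min_bytes:
            suspects.append((path.name, "too small"))
        elif not head.startswith(b"<!doctype html"):
            suspects.append((path.name, "missing doctype"))
    return suspects
```

An empty list from `find_suspect_files("../data/raw")` would suggest that no file is truncated or holds a non-HTML error response. The saved files in this notebook are all above 70 KB, so a 1 KB threshold is deliberately conservative.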

In [6]:
raw_data_dir = "../data/raw"

if os.path.exists(raw_data_dir):
    html_files = [f for f in os.listdir(raw_data_dir) if f.endswith('.html')]
    print(f"Total HTML files saved: {len(html_files)}")
    print("\nSample files:")
    for i, file in enumerate(sorted(html_files)[:10], 1):
        file_path = os.path.join(raw_data_dir, file)
        size = os.path.getsize(file_path)
        print(f"{i}. {file:40s} - {size/1024:.1f} KB")
    
    # Report the remainder only when more than 10 files exist.
    if len(html_files) > 10:
        print(f"\n... and {len(html_files) - 10} more files")
else:
    print("Raw data directory not found!")

Total HTML files saved: 110

Sample files:
1. abhijit-mukherjee.html                   - 73.0 KB
2. abhishek-gupta.html                      - 75.6 KB
3. abhishek-jindal.html                     - 83.4 KB
4. abhishek-tilva.html                      - 76.7 KB
5. aditinath-sarkar.html                    - 72.9 KB
6. aditya-tatu.html                         - 77.6 KB
7. ajay-beniwal.html                        - 87.5 KB
8. ajay-tomar.html                          - 73.0 KB
9. ajeet-kumar-singh.html                   - 72.9 KB
10. amishal-modi.html                        - 73.3 KB

... and 100 more files


## Summary Statistics

Build a per-directory breakdown of profile counts, with one sample URL for each directory.

In [7]:
import pandas as pd

# Create summary DataFrame
summary_data = []
for faculty_type, profiles in all_profiles.items():
    summary_data.append({
        'Directory': faculty_type,
        'Profile Count': len(profiles),
        'Sample URL': profiles[0] if profiles else 'N/A'
    })

df_summary = pd.DataFrame(summary_data)
print("\nFaculty Directory Summary:")
print(df_summary.to_string(index=False))
print(f"\nGrand Total: {df_summary['Profile Count'].sum()} profiles")

2026-01-17 22:41:41,185 - INFO - Note: NumExpr detected 24 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2026-01-17 22:41:41,186 - INFO - NumExpr defaulting to 16 threads.



Faculty Directory Summary:
                    Directory  Profile Count                                                             Sample URL
                      faculty             67                        https://www.daiict.ac.in/faculty/abhishek-gupta
              adjunct-faculty             26             https://www.daiict.ac.in/adjunct-faculty/abhijit-mukherjee
adjunct-faculty-international             11 https://www.daiict.ac.in/adjunct-faculty-international/anil-maheshwari
      distinguished-professor              1      https://www.daiict.ac.in/distinguished-professor/vishvajit-pandya
           professor-practice              4                 https://www.daiict.ac.in/professor-practice/ajay-tomar

Grand Total: 109 profiles


## Next Steps

1. **Phase 4 - Data Cleaning**: Parse HTML content to extract faculty information
2. **Phase 5 - Database Storage**: Clean and structure the data, store in SQLite
3. **Phase 6 - API**: Create FastAPI endpoints for data access

---

**Status**: Data scraping complete! All 109 faculty profiles downloaded.