# Phase 4: Data Transformation (Cleaning)
This notebook parses and cleans 109 raw HTML faculty profiles into a structured table.

## Objectives
- Parse HTML using `BeautifulSoup`
- Extract Name, Email, Education, Bio, etc.
- Handle semi-structured data (different sections like Biography, Specialization, etc.)
- Clean artifacts such as `(On Leave)` suffixes and obfuscated email formats.
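
As a rough illustration of the last objective, the sketch below normalizes one kind of obfuscated address. The `[at]` / `[dot]` patterns and the `deobfuscate_email` helper are assumptions made for this example; the notebook does not show which obfuscation styles the source pages use or how `FacultyCleaner` handles them.

```python
import re

# Hypothetical obfuscation patterns; the real profiles may use others.
_AT = re.compile(r"\s*(?:\[at\]|\(at\)|\{at\})\s*", re.IGNORECASE)
_DOT = re.compile(r"\s*(?:\[dot\]|\(dot\)|\{dot\})\s*", re.IGNORECASE)


def deobfuscate_email(raw: str) -> str:
    """Replace bracketed 'at'/'dot' tokens with '@' and '.', then lowercase."""
    text = _AT.sub("@", raw.strip())
    text = _DOT.sub(".", text)
    return text.lower()


print(deobfuscate_email("abhishek_gupta [at] dau [dot] ac [dot] in"))
# abhishek_gupta@dau.ac.in
```

Only bracketed tokens are matched, so a bare word such as "at" inside a name is left alone.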

In [13]:
import os
import sys
import pandas as pd
from IPython.display import display, HTML

# Add src to path
sys.path.append(os.path.abspath('../src'))

from data_cleaner import FacultyCleaner
from config import RAW_DATA_DIR, PROCESSED_DATA_DIR

## 1. Initialize Cleaner
We use a custom `FacultyCleaner` class to handle CSS selector logic and text normalization.

In [14]:
cleaner = FacultyCleaner()
all_data = []

# os.listdir returns entries in arbitrary order; sort so row positions are reproducible
files = sorted(f for f in os.listdir(RAW_DATA_DIR) if f.endswith('.html'))
print(f"Found {len(files)} HTML files for processing.")

Found 109 HTML files for processing.
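
`FacultyCleaner` lives in `src/data_cleaner.py` and is not shown in this notebook. The sketch below only illustrates the general selector-driven approach described above, assuming `beautifulsoup4` is installed. The sample markup, the `SELECTORS` mapping, and `extract_fields` are all hypothetical and may not match the real pages or the real class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real faculty pages may be structured differently.
SAMPLE_HTML = """
<div class="profile">
  <h1 class="name">Jane Doe (On Leave)</h1>
  <div class="section biography"><h2>Biography</h2><p>Works on signal processing.</p></div>
  <div class="section specialization"><h2>Specialization</h2><p>Graph signals</p></div>
</div>
"""

# Hypothetical field-to-selector mapping.
SELECTORS = {
    "name": "h1.name",
    "biography": "div.biography p",
    "specialization": "div.specialization p",
    "teaching": "div.teaching p",  # absent from the sample on purpose
}


def extract_fields(html: str) -> dict:
    """Return one string per field, using '' when a selector matches nothing."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, css in SELECTORS.items():
        node = soup.select_one(css)
        record[field] = node.get_text(" ", strip=True) if node else ""
    return record


record = extract_fields(SAMPLE_HTML)
print(record)
```

Returning an empty string for a missing section keeps every record the same shape, which is what lets the later missing-value step treat blank fields uniformly.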


## 2. Process Files
Iterating through all 109 files and extracting structured fields.

In [15]:
for file_name in files:
    file_path = os.path.join(RAW_DATA_DIR, file_name)
    with open(file_path, 'r', encoding='utf-8') as f:
        html = f.read()
    # Parse after the file is closed; only the read needs the open handle
    data = cleaner.extract_faculty_data(html, file_name)
    all_data.append(data)

df = pd.DataFrame(all_data)
print(f"Extracted {len(df)} profiles.")

Extracted 109 profiles.


## 3. Data Refinement
Some names carry an `(On Leave)` suffix. We strip it, along with any whitespace before it, so the `name` column holds only the name.

In [16]:
df['name'] = df['name'].str.replace(r'\s*\(On Leave\)', '', regex=True)
display(df[['name', 'email', 'education']].head())

| | name | email | education |
|---|---|---|---|
| 0 | Abhijit Mukherjee | abhijit_mukherjee@dau.ac.in | MBA in Systems from Vinayaka Mission University |
| 1 | Abhishek Gupta | abhishek_gupta@dau.ac.in | PhD (Electrical and Computer Engineering), Tor... |
| 2 | Abhishek Jindal | abhishek_jindal@dau.ac.in | PhD (Electronics & Communication Engineering),... |
| 3 | Abhishek Tilva | abhishek_tilva@dau.ac.in | PhD (Statistics), Columbia University, New Yor... |
| 4 | Aditi Nath Sarkar | aditinath_sarkar@dau.ac.in | MA (South Asian Languages and Civilizations), ... |


## 4. Missing Value Handling
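Some fields (for example, Bio for adjunct faculty) are genuinely absent from the source pages rather than lost during parsing. We convert empty strings to `NaN`, count the gaps per column, and then fill them with the placeholder `'Not Provided'`.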

In [17]:
import numpy as np
# Convert empty strings to NaN
df = df.replace(r'^\s*$', np.nan, regex=True)

print("Missing before fill:")
print(df.isnull().sum())

# Fill with default
df = df.fillna("Not Provided")

print("\nMissing after fill:")
print(df.isnull().sum())

Missing before fill:
name               0
image_url          0
education          2
contact_no        32
address           35
email              1
biography         42
specialization     3
teaching          40
publications      37
dtype: int64

Missing after fill:
name              0
image_url         0
education         0
contact_no        0
address           0
email             0
biography         0
specialization    0
teaching          0
publications      0
dtype: int64


## 5. Section Analysis
Spot-check that long-form sections (Biography, Publications) were captured by printing the first 200 characters of one sample each.

In [18]:
print("Sample Biography (First 200 chars):")
print(df.iloc[1]['biography'][:200] + "...")

print("\nSample Publications (First 200 chars):")
print(df.iloc[5]['publications'][:200] + "...")

Sample Biography (First 200 chars):
Dr. Abhishek Gupta received his PhD in Electrical and Computer Engineering from Toronto Metropolitan University, Canada. During his PhD, he worked on the application of Machine Learning techniques for...

Sample Publications (First 200 chars):
Darshan Batavia and Aditya Tatu, Estimating graph topology from sparse graph signals with an application to image denoising, 19th IEEE Intl. W’shop on Multimedia Signal Processing (MMSP), U.K., Oct 20...
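
The spot check above samples two rows only. A broader check could count, per section, how many profiles hold real text instead of the placeholder. The sketch below runs on a toy frame with made-up values; `section_coverage` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

# Toy stand-in for the notebook's df; values are illustrative only.
toy = pd.DataFrame({
    "biography": ["Received a PhD in ...", "Not Provided", "Short"],
    "publications": ["Not Provided", "A. Author, Some title, 2020", "Not Provided"],
})


def section_coverage(frame, columns, placeholder="Not Provided"):
    """Per column: number of filled rows and median length of the filled text."""
    rows = []
    for col in columns:
        filled = frame.loc[frame[col] != placeholder, col]
        rows.append({
            "section": col,
            "filled": len(filled),
            "median_chars": float(filled.str.len().median()) if len(filled) else 0.0,
        })
    return pd.DataFrame(rows)


summary = section_coverage(toy, ["biography", "publications"])
print(summary)
```

On the real data, a very low median length for a section would suggest that a selector captured a heading or a fragment instead of the full text.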


## 6. Export results
Saving the structured data to `data/processed/faculty_data.csv`.

In [19]:
os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)
export_path = os.path.join(PROCESSED_DATA_DIR, 'faculty_data.csv')
df.to_csv(export_path, index=False)
print(f"Data successfully exported to {export_path}")

Data successfully exported to c:\Users\Admin\OneDrive\Desktop\SEM2\BIG DATA\PROJECT PHASE 1\faculty_finder\data\processed\faculty_data.csv
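
One consequence of filling gaps with `'Not Provided'` is that the placeholder survives a CSV round trip as ordinary text, whereas empty fields would be read back as `NaN` by default. The sketch below shows this on a toy frame using an in-memory buffer in place of the real export path; the data is illustrative only.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({
    "name": ["Profile A", "Profile B"],
    "biography": ["Some text", "Not Provided"],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.isnull().sum())
print(restored["biography"].tolist())
```

Writing with `index=False`, as the export cell does, also avoids an extra unnamed index column when the file is loaded again.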
