# Data Preprocessing and Enrichment Pipeline

This Jupyter Notebook is a step-by-step, interactive guide to the Uniformat II data preprocessing and enrichment pipeline. It shows how the modular functions in `final_scripts/` extract, transform, and load the data, producing an enriched SQLite database.

**To run this notebook correctly, ensure your current working directory in Jupyter is the project's root.** The setup cell appends that directory to `sys.path` so the project modules can be imported, loads the `.env` file from it, and resolves the data file paths relative to it.
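A quick pre-flight check can catch a wrong working directory before any import fails. The sketch below is illustrative only: `missing_paths` is a hypothetical helper, and the list of expected entries is an assumption based on the folder and file names mentioned above, so adjust it to the actual project tree.

```python
import os

def missing_paths(root, expected):
    """Return the entries of `expected` that do not exist under `root`."""
    return [name for name in expected if not os.path.exists(os.path.join(root, name))]

# Assumed layout; these names are taken from the prose, not verified against the repo.
EXPECTED = ["final_scripts", "data_sources", ".env"]

missing = missing_paths(os.getcwd(), EXPECTED)
if missing:
    print(f"Working directory {os.getcwd()} is missing: {missing}")
else:
    print("All expected entries were found in the working directory.")
```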

In [1]:
import os
import sys
import pandas as pd
from dotenv import load_dotenv
import sqlite3 # Just for direct SQL queries in notebook if needed
import json # For displaying JSON outputs
import time # For showing sleeps if needed

# --- IMPORTANT: Adjust Python Path --- #
# The imports below use bare module names, so the directory that contains
# db_operations.py, pdf_extractor.py, and gemini_processor.py must be on sys.path.
# Assumes this notebook is opened from the project's root directory.
project_root = os.path.abspath(os.getcwd())  # Current working directory (should be the project root)
if project_root not in sys.path:  # Avoid duplicate entries when the cell is re-run
    sys.path.append(project_root)

# --- Import Functions from Modular Scripts --- #
from db_operations import (
    setup_database,
    insert_excel_data_clearing_first,
    incorporate_initial_gemini_data_into_db_no_desc,
    get_level3_data_for_enhancement,
    update_description_in_db,
)

from pdf_extractor import extract_text_from_pdf_pages

from gemini_processor import (
    get_initial_uniformat_details_from_gemini_no_desc,
    generate_enhanced_description_with_gemini_batch,
    UNIFORMAT_EXTRACTION_SCHEMA_NO_DESC,  # Schemas are imported for display in later cells
    ENHANCED_DESCRIPTION_SCHEMA,
)

# --- Load Environment Variables --- #
load_dotenv(dotenv_path=os.path.join(project_root, '.env'))
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    print("\033[93mWarning: GEMINI_API_KEY not found in .env file. API-dependent cells may fail.\033[0m")

# --- Define Global Paths and Constants --- #
DB_NAME = "uniformat.db" # This database will be created/accessed in the project root
PDF_PATH = os.path.join('../data_sources', 'uniformat-guide.pdf')
CSV_PATH = os.path.join('../data_sources', 'uniformat-ii-codes.csv')

PDF_START_PAGE = 61 # Based on your guide for includes/excludes
PDF_END_PAGE = 83   # Based on your guide for includes/excludes
GEMINI_BATCH_SIZE = 5 # Number of elements to send to Gemini per batch for descriptions

print("Setup complete. Necessary modules loaded and paths defined.")

Setup complete. Necessary modules loaded and paths defined.



--- 
## Step 1: Database Initialization and Initial Data Loading

This step sets up the SQLite database schema and populates the `uniformat_codes` table with the baseline data from the `uniformat-ii-codes.csv` file. This is handled by functions in `final_scripts/db_operations.py`.
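The internals of `setup_database` and `insert_excel_data_clearing_first` are not shown in this notebook. As a rough sketch of the clear-then-insert pattern that the second name suggests, the snippet below uses an in-memory database, a reduced three-column table, and a hypothetical helper `reload_rows`. None of it is taken from `db_operations.py`, and the `AUTOINCREMENT` key is an assumption.

```python
import sqlite3

def reload_rows(conn, rows):
    """Clear uniformat_codes, then insert `rows` in one transaction (hypothetical helper)."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM uniformat_codes")
        conn.executemany(
            "INSERT INTO uniformat_codes (level3_code, level3_name, level4_code) "
            "VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uniformat_codes ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "  # assumed key definition
    "level3_code TEXT, level3_name TEXT, level4_code TEXT)"
)

rows = [
    ("A1010", "Standard Foundations", "A1011"),
    ("A1010", "Standard Foundations", "A1012"),
]

reload_rows(conn, rows)
reload_rows(conn, rows)  # a second run replaces the rows instead of duplicating them

row_count = conn.execute("SELECT COUNT(*) FROM uniformat_codes").fetchone()[0]
max_id = conn.execute("SELECT MAX(id) FROM uniformat_codes").fetchone()[0]
conn.close()
print(row_count, max_id)
```

Under this assumed schema the row count stays at 2 while the largest `id` grows to 4, because `AUTOINCREMENT` never reuses ids. If the real table behaves the same way, ids larger than the row count after repeated runs would not by themselves indicate duplicated data.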

In [7]:
print(f"Creating database schema for: {DB_NAME}")
setup_database(DB_NAME)

print(f"Loading initial Uniformat codes from: {CSV_PATH}")
df_initial_codes = pd.read_csv(CSV_PATH)
insert_excel_data_clearing_first(df_initial_codes, DB_NAME)

print("Initial database setup and data loading complete.")

print("\n--- Verifying initial data in uniformat_codes table ---")
conn = sqlite3.connect(DB_NAME)
initial_db_df = pd.read_sql_query("SELECT * FROM uniformat_codes LIMIT 5;", conn)
conn.close()
display(initial_db_df)
print(f"Total initial codes loaded: {len(df_initial_codes)} rows.")

Creating database schema for: uniformat.db
Database 'uniformat.db' and tables created successfully.
Loading initial Uniformat codes from: ../data_sources/uniformat-ii-codes.csv
Excel data inserted into 'uniformat_codes' table.
Initial database setup and data loading complete.

--- Verifying initial data in uniformat_codes table ---


| | id | type | level1_code | level1_name | level2_code | level2_name | level3_code | level3_name | level4_code | level4_name | description | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Building | A | SUBSTRUCTURE | A10 | Foundations | A1010 | Standard Foundations | A1011 | Wall Foundations | | |
| 1 | 2 | Building | A | SUBSTRUCTURE | A10 | Foundations | A1010 | Standard Foundations | A1012 | Column Foundations & Pile Caps | | |
| 2 | 3 | Building | A | SUBSTRUCTURE | A10 | Foundations | A1010 | Standard Foundations | A1013 | Perimeter Drainage & Insulation | | |
| 3 | 4 | Building | A | SUBSTRUCTURE | A10 | Foundations | A1020 | Special Foundations | A1021 | Pile Foundations | | |
| 4 | 5 | Building | A | SUBSTRUCTURE | A10 | Foundations | A1020 | Special Foundations | A1022 | Grade Beams | | |


Total initial codes loaded: 370 rows.


--- 
## Step 2: Extracting Inclusions and Exclusions from PDF using Gemini

Here, we extract relevant text content from the Uniformat II guide PDF and then use the Google Gemini API to parse the inclusions and exclusions into a structured format. These structured details are then incorporated into the database.

This step uses functions from `final_scripts/pdf_extractor.py`, `final_scripts/gemini_processor.py`, and `final_scripts/db_operations.py`.
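Since the model's reply is treated as structured data, it can help to check its shape before anything is written to the database. The function below is a hypothetical sketch and not part of `gemini_processor.py`. It assumes each entry carries the four keys that appear in this step's sample output (`level3_code`, `level3_name`, `inclusions`, `exclusions`).

```python
def split_valid_entries(entries):
    """Separate well-formed extraction entries from malformed ones."""
    valid, rejected = [], []
    for entry in entries:
        ok = (
            isinstance(entry, dict)
            and isinstance(entry.get("level3_code"), str)
            and isinstance(entry.get("level3_name"), str)
            and all(
                isinstance(entry.get(key), list)
                and all(isinstance(text, str) for text in entry[key])
                for key in ("inclusions", "exclusions")
            )
        )
        (valid if ok else rejected).append(entry)
    return valid, rejected

sample = [
    {"level3_code": "A1010", "level3_name": "Standard Foundations",
     "inclusions": ["pile caps"], "exclusions": []},
    {"level3_code": "A1020", "inclusions": "piling"},  # malformed: no name, wrong types
]

valid, rejected = split_valid_entries(sample)
print(len(valid), len(rejected))
```

Rejected entries could be logged and retried rather than silently dropped; which of these is appropriate depends on how the pipeline is used.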

In [3]:
print(f"Extracting text from PDF: {PDF_PATH} (pages {PDF_START_PAGE}-{PDF_END_PAGE})")
extracted_pdf_text = extract_text_from_pdf_pages(PDF_PATH, PDF_START_PAGE, PDF_END_PAGE)

if extracted_pdf_text:
    print(f"Text extracted successfully. First 500 characters:\n{extracted_pdf_text[:500]}...")
else:
    print("\033[91mError: Could not extract text from PDF.\033[0m")
    extracted_pdf_text = "" # Ensure it's empty to prevent API calls

if GEMINI_API_KEY and extracted_pdf_text:
    print("\nSending extracted text to Gemini for initial inclusions/exclusions extraction...")
    print("Gemini Schema for this step (for reference):\n" + json.dumps(UNIFORMAT_EXTRACTION_SCHEMA_NO_DESC, indent=2))
    
    initial_gemini_json_data = get_initial_uniformat_details_from_gemini_no_desc(extracted_pdf_text, GEMINI_API_KEY)
    
    if initial_gemini_json_data:
        print("\n--- Sample of Gemini's initial extraction (first 2 entries) ---")
        display(initial_gemini_json_data[:2])
        
        print("\nIncorporating extracted data into database (inclusions/exclusions tables)...")
        incorporate_initial_gemini_data_into_db_no_desc(initial_gemini_json_data, DB_NAME)
        print("Inclusions/Exclusions successfully incorporated into the database.")
        
        print("\n--- Verifying data in uniformat_inclusions and uniformat_exclusions tables (random 5 entries) ---")
        conn = sqlite3.connect(DB_NAME)
        df_inc = pd.read_sql_query("SELECT * FROM uniformat_inclusions ORDER BY RANDOM() LIMIT 5;", conn)
        df_exc = pd.read_sql_query("SELECT * FROM uniformat_exclusions ORDER BY RANDOM() LIMIT 5;", conn)
        conn.close()
        print("Uniformat Inclusions (sample):")
        display(df_inc)
        print("Uniformat Exclusions (sample):")
        display(df_exc)
    else:
        print("\033[91mError: Gemini did not return valid JSON for initial extraction.\033[0m")
else:
    print("\033[93mSkipping Gemini API call for inclusions/exclusions due to missing API key or no PDF text.\033[0m")

Extracting text from PDF: ../data_sources/uniformat-guide.pdf (pages 61-83)
Text extracted successfully. First 500 characters:
45
•
The UNIT of measurement of the item.
•
The RATE or cost per unit of the item.
•
The COST of the item.
•
The OUTPUT CODE for sorting line item costs into other breakdowns, such as by
MasterFormat 95™; by construction trades, bid packages, or functional areas; or by
other cost organizing principles.  This code is blank in Tables 4.6 and 4.7 because at
the early stage of design, which these tables portray, the assemblies and elements
have not yet been designed in detail.
Table 4.8, taken from ...

Sending extracted text to Gemini for initial inclusions/exclusions extraction...
Gemini Schema for this step (for reference):
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "level3_code": {
        "type": "string",
        "description": "The Uniformat Level 3 code, e.g., 'A1010', 'B2010'."
      },
      "level3_name": {
        

[{'level3_code': 'A1010',
  'level3_name': 'Standard Foundations',
  'inclusions': ['wall & column foundations',
   'foundation walls up to level of top of slab on grade',
   'pile caps',
   'backfill & compaction',
   'footings & bases',
   'perimeter insulation',
   'perimeter drainage',
   'anchor plates',
   'dewatering'],
  'exclusions': ['general excavation to reduce levels (see section G 1030, Site Earthwork)',
   'excavation for basements (see section A 2010, Basement Excavation)',
   'basement walls (see section A 2020, Basement Walls)',
   'under-slab drainage and insulation (see section A 1030, Slab on Grade)']},
 {'level3_code': 'A1020',
  'level3_name': 'Special Foundations',
  'inclusions': ['piling',
   'caissons',
   'underpinning',
   'dewatering',
   'raft foundations',
   'grade beams',
   'any other special foundation conditions'],
  'exclusions': ['pile caps (see section A 1010, Standard Foundations)',
   'rock excavation (unless associated with Special Foundations


Incorporating extracted data into database (inclusions/exclusions tables)...

--- Incorporating initial Gemini output (inclusions/exclusions only) into database ---

Initial Gemini output (inclusions/exclusions) successfully incorporated into database.
Inclusions/Exclusions successfully incorporated into the database.

--- Verifying data in uniformat_inclusions and uniformat_exclusions tables (random 5 entries) ---
Uniformat Inclusions (sample):


| | id | uniformat_code_id | inclusion_text |
|---|---|---|---|
| 0 | 145 | 374 | caissons |
| 1 | 124 | 96 | acoustic ceiling tiles & panels |
| 2 | 179 | 405 | back-up construction, framing, wallboard, para... |
| 3 | 199 | 419 | roof & deck insulation |
| 4 | 206 | 425 | roof hatches |


Uniformat Exclusions (sample):


| | id | uniformat_code_id | exclusion_text |
|---|---|---|---|
| 0 | 57 | 390 | perimeter drainage (see section A 1010, Standa... |
| 1 | 32 | 58 | applied wall finishes (see section C 3010, Wal... |
| 2 | 73 | 419 | roof drains (see section D 2040, Rain Water Dr... |
| 3 | 28 | 49 | parapets (see section B 2010, Exterior Walls) |
| 4 | 77 | 428 | interior load bearing & shear walls (see secti... |
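
The samples above show only `uniformat_code_id`, which is hard to read without the matching code. A join can resolve it, sketched here against a tiny in-memory stand-in. The assumption that `uniformat_code_id` references `uniformat_codes.id` is inferred from the column name and is not confirmed by this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE uniformat_codes (
        id INTEGER PRIMARY KEY, level3_code TEXT, level3_name TEXT
    );
    CREATE TABLE uniformat_inclusions (
        id INTEGER PRIMARY KEY,
        uniformat_code_id INTEGER REFERENCES uniformat_codes(id),  -- assumed link
        inclusion_text TEXT
    );
    INSERT INTO uniformat_codes VALUES (1, 'A1010', 'Standard Foundations');
    INSERT INTO uniformat_codes VALUES (2, 'A1020', 'Special Foundations');
    INSERT INTO uniformat_inclusions VALUES (1, 1, 'pile caps');
    INSERT INTO uniformat_inclusions VALUES (2, 2, 'caissons');
""")

# Attach the Level 3 code and name to every inclusion row.
joined = conn.execute("""
    SELECT c.level3_code, c.level3_name, i.inclusion_text
    FROM uniformat_inclusions AS i
    JOIN uniformat_codes AS c ON c.id = i.uniformat_code_id
    ORDER BY c.level3_code
""").fetchall()
conn.close()
print(joined)
```

The same query shape should work for `uniformat_exclusions` by swapping the table and text column names.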


--- 
## Step 3: Generating Enhanced Descriptions using Gemini

Now that our database has basic codes, names, inclusions, and exclusions, we'll fetch this combined data and use Gemini to generate rich, comprehensive descriptions for each Level 3 Uniformat element. These descriptions are then updated back into the `uniformat_codes` table.

This step primarily uses `final_scripts/gemini_processor.py` and `final_scripts/db_operations.py`.
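
The batching used in this step can be expressed as a small generator. This is a sketch with a hypothetical helper name, `chunked`; the figures reuse the 79 Level 3 entries and the batch size of 5 reported in this notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

codes = [f"code_{n}" for n in range(79)]  # placeholder items, not real Uniformat codes
batches = list(chunked(codes, 5))
print(len(batches), len(batches[-1]))  # 16 batches; the last one holds 4 items
```

Because the last slice is simply shorter, no padding or special-casing is needed for a total that is not a multiple of the batch size.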

In [4]:
if GEMINI_API_KEY:
    print("Retrieving Level 3 data for description enhancement...")
    level3_data_for_enhancement = get_level3_data_for_enhancement(DB_NAME)
    
    if level3_data_for_enhancement:
        print(f"Found {len(level3_data_for_enhancement)} Level 3 entries for description generation.")
        print("\n--- Sample of input data for description generation (first 2 entries) ---")
        display(level3_data_for_enhancement[:2])
        
        print("\nGemini Schema for enhanced description (for reference):\n" + json.dumps(ENHANCED_DESCRIPTION_SCHEMA, indent=2))
        
        print(f"\nGenerating enhanced descriptions in batches of {GEMINI_BATCH_SIZE}...")
        all_generated_descriptions = []
        
        total_batches = (len(level3_data_for_enhancement) + GEMINI_BATCH_SIZE - 1) // GEMINI_BATCH_SIZE
        for i in range(0, len(level3_data_for_enhancement), GEMINI_BATCH_SIZE):
            batch = level3_data_for_enhancement[i:i + GEMINI_BATCH_SIZE]
            print(f"Processing batch {i // GEMINI_BATCH_SIZE + 1}/{total_batches}...")
            
            batch_descriptions = generate_enhanced_description_with_gemini_batch(batch, GEMINI_API_KEY)
            
            if batch_descriptions:
                all_generated_descriptions.extend(batch_descriptions)
                # Optional: display a sample of descriptions from this batch
                # print("Sample descriptions from this batch:")
                # display(batch_descriptions[:1]) 
            else:
                print(f"\033[93mWarning: No descriptions generated for batch starting at index {i}.\033[0m")
            # Minimal sleep to be kind to the API, beyond internal rate limiting
            time.sleep(1) 

        if all_generated_descriptions:
            print(f"\nSuccessfully generated {len(all_generated_descriptions)} enhanced descriptions.")
            print("\n--- Sample of generated descriptions (first 2 entries) ---")
            display(all_generated_descriptions[:2])
            
            print("\nUpdating descriptions in the database...")
            for item in all_generated_descriptions:
                update_description_in_db(item['level3_code'], item['enhanced_description'], DB_NAME)
            print("All enhanced descriptions updated in the database.")
        else:
            print("\033[91mNo enhanced descriptions were generated. Check API key and network.\033[0m")
    else:
        print("\033[93mNo Level 3 data found in DB for enhancement. Ensure previous steps completed.\033[0m")
else:
    print("\033[93mSkipping Gemini API call for descriptions due to missing API key.\033[0m")

Retrieving Level 3 data for description enhancement...
Found 79 Level 3 entries for description generation.

--- Sample of input data for description generation (first 2 entries) ---


[{'level3_code': 'A1010',
  'level3_name': 'Standard Foundations',
  'current_description': None,
  'inclusions': ['wall & column foundations',
   'foundation walls up to level of top of slab on grade',
   'pile caps',
   'backfill & compaction',
   'footings & bases',
   'perimeter insulation',
   'perimeter drainage',
   'anchor plates',
   'dewatering'],
  'exclusions': ['general excavation to reduce levels (see section G 1030, Site Earthwork)',
   'excavation for basements (see section A 2010, Basement Excavation)',
   'basement walls (see section A 2020, Basement Walls)',
   'under-slab drainage and insulation (see section A 1030, Slab on Grade)']},
 {'level3_code': 'A1020',
  'level3_name': 'Special Foundations',
  'current_description': None,
  'inclusions': ['piling',
   'caissons',
   'underpinning',
   'dewatering',
   'raft foundations',
   'grade beams',
   'any other special foundation conditions'],
  'exclusions': ['pile caps (see section A 1010, Standard Foundations)',
 


Gemini Schema for enhanced description (for reference):
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "level3_code": {
        "type": "string",
        "description": "The Uniformat Level 3 code for which the description was generated."
      },
      "enhanced_description": {
        "type": "string",
        "description": "The comprehensive and detailed description for the Uniformat Level 3 element."
      }
    },
    "required": [
      "level3_code",
      "enhanced_description"
    ]
  }
}

Generating enhanced descriptions in batches of 5...
Processing batch 1/16...
Sending batch request to Gemini API for 5 descriptions (Attempt 1/5)...
Gemini API call successful for batch.

--- DEBUG: Gemini Part 2 Raw Response Text for batch (first 500 chars) ---
[
  {
    "level3_code": "A1010",
    "enhanced_description": "This element encompasses the complete scope of standard shallow foundation systems designed to transfer building loads to the support

[{'level3_code': 'A1010',
  'enhanced_description': 'This element encompasses the complete scope of standard shallow foundation systems designed to transfer building loads to the supporting soil. It includes the construction of isolated column footings and continuous wall foundations, along with their associated bases. Integral to this element are foundation walls that extend from the footing up to the level of the top of the slab on grade, providing essential support and enclosure. The scope also covers the formation of concrete pile caps, which serve to distribute loads from columns or walls to a group of piles. Essential earthwork activities such as backfilling around the foundation elements and ensuring proper compaction are included. Furthermore, this category incorporates critical protective and functional components like perimeter insulation to mitigate heat loss and perimeter drainage systems designed to manage subsurface water around the foundation. The installation of anchor 


Updating descriptions in the database...
--- DEBUG: Updating A1010 with new description (first 100 chars):
This element encompasses the complete scope of standard shallow foundation systems designed to trans
----------------------------------------------------

Enhanced description successfully updated for A1010.
--- DEBUG: Updating A1020 with new description (first 100 chars):
This element defines the specialized foundation systems employed when typical shallow foundations ar
----------------------------------------------------

Enhanced description successfully updated for A1020.
--- DEBUG: Updating A1030 with new description (first 100 chars):
This element covers the construction of all types of concrete slabs poured directly onto a prepared 
----------------------------------------------------

Enhanced description successfully updated for A1030.
--- DEBUG: Updating A2010 with new description (first 100 chars):
This element specifically addresses all excavation activities necessar

--- 
## Step 4: Final Database Verification

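Finally, we query the `uniformat.db` database to confirm that the `description` column of the `uniformat_codes` table is now populated. The table holds one row per Level 4 code, so each Level 3 code and its description appear once for every Level 4 child in the result. The `uniformat_inclusions` and `uniformat_exclusions` tables were already spot-checked in Step 2.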

In [6]:
print("\n--- Verifying final state of the uniformat_codes table (level 3 sample with description) ---")
conn = sqlite3.connect(DB_NAME)
final_codes_df = pd.read_sql_query("SELECT level3_code, level3_name, description FROM uniformat_codes WHERE description IS NOT NULL LIMIT 10;", conn)
conn.close()
display(final_codes_df)


--- Verifying final state of the uniformat_codes table (level 3 sample with description) ---


| | level3_code | level3_name | description |
|---|---|---|---|
| 0 | A1010 | Standard Foundations | This element encompasses the complete scope of... |
| 1 | A1010 | Standard Foundations | This element encompasses the complete scope of... |
| 2 | A1010 | Standard Foundations | This element encompasses the complete scope of... |
| 3 | A1020 | Special Foundations | This element defines the specialized foundatio... |
| 4 | A1020 | Special Foundations | This element defines the specialized foundatio... |
| 5 | A1020 | Special Foundations | This element defines the specialized foundatio... |
| 6 | A1020 | Special Foundations | This element defines the specialized foundatio... |
| 7 | A1020 | Special Foundations | This element defines the specialized foundatio... |
| 8 | A1020 | Special Foundations | This element defines the specialized foundatio... |
| 9 | A1020 | Special Foundations | This element defines the specialized foundatio... |
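
To list each Level 3 element once rather than once per matching row, the query can use `DISTINCT`, and a second query can count the Level 3 codes that still lack a description. The sketch runs against an in-memory stand-in with placeholder descriptions; only the table and column names come from this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE uniformat_codes (
        id INTEGER PRIMARY KEY, level3_code TEXT, level3_name TEXT,
        level4_code TEXT, description TEXT
    );
    INSERT INTO uniformat_codes (level3_code, level3_name, level4_code, description) VALUES
        ('A1010', 'Standard Foundations', 'A1011', 'placeholder text 1'),
        ('A1010', 'Standard Foundations', 'A1012', 'placeholder text 1'),
        ('A1020', 'Special Foundations', 'A1021', 'placeholder text 2'),
        ('A1030', 'Slab on Grade', NULL, NULL);
""")

# One row per Level 3 element that has a description.
distinct_rows = conn.execute("""
    SELECT DISTINCT level3_code, level3_name, description
    FROM uniformat_codes
    WHERE description IS NOT NULL
    ORDER BY level3_code
""").fetchall()

# Number of Level 3 codes with at least one row still missing a description.
missing_count = conn.execute("""
    SELECT COUNT(DISTINCT level3_code)
    FROM uniformat_codes
    WHERE description IS NULL
""").fetchone()[0]
conn.close()
print(distinct_rows, missing_count)
```

Against the real database, a `missing_count` of zero would suggest that every Level 3 element received a description.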
