# 🏗️ Research Tool: The "Deconstruction Filter" (Pilot)

**Project:** Circular Economy in Australian Construction  
**Objective:** Semantic Pre-processing for Knowledge Graph Extraction

## 🎯 Strategy: The "Needle in the Haystack"
We are dealing with large, mixed-topic PDFs (e.g., National Waste Policies). Most of the text in these documents is irrelevant to your specific focus on **structural deconstruction** and **salvage**.

If we feed the entire document to the LLM, we risk:
1.  **Graph Pollution:** Creating nodes for "Curbside Recycling" or "Landfill Levies" that clutter your Deconstruction analysis.
2.  **Context Dilution:** The LLM might lose the specific nuance of "disassembly" amidst general waste management text.
3.  **Cost Inefficiency:** Processing thousands of irrelevant tokens.

## 🛠️ The Solution: Semantic Filtering
This notebook implements a **Vector-Based Filter** before extraction:
1.  **Chunking:** Splits the PDF into analyzable segments (roughly paragraph-sized chunks).
2.  **Embedding:** Converts text into mathematical vectors using `text-embedding-3-small`.
3.  **Similarity Search:** Compares every paragraph against your specific research queries (e.g., "salvage of timber", "selective demolition").
4.  **Filtering:** Discards any text that does not meet a strict relevance threshold.
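
The four steps above can be sketched end to end without any API calls. This is a toy illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for `text-embedding-3-small`, `semantic_filter` and `min_similarity` are names invented for this sketch, and it scores by cosine similarity (higher is better), whereas the FAISS call used later in this notebook returns a distance (lower is better).

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a real embedding model: word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_filter(document: str, queries: list[str], min_similarity: float) -> list[str]:
    # 1. Chunking: split on blank lines to get paragraph-like segments
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
    # 2. Embedding: vectorise the research queries once
    query_vecs = [toy_embed(q) for q in queries]
    kept = []
    for chunk in chunks:
        chunk_vec = toy_embed(chunk)
        # 3. Similarity search: best score of this chunk against any query
        best = max((cosine_similarity(chunk_vec, q) for q in query_vecs), default=0.0)
        # 4. Filtering: keep only chunks that clear the relevance threshold
        if best >= min_similarity:
            kept.append(chunk)
    return kept


sample = (
    "Salvage of timber beams during selective demolition reduces waste.\n\n"
    "Kerbside recycling bins are collected weekly by councils.\n\n"
    "Landfill levies vary between states."
)
kept = semantic_filter(sample, ["salvage of timber", "selective demolition"], min_similarity=0.3)
print(kept)
```

Only the first paragraph shares vocabulary with the queries, so it is the only one kept. A real embedding model matches on meaning rather than exact words, which is why the notebook uses one.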

---

### 1. 📦 Installation & Setup
We need `pypdf` for robust PDF parsing, `langchain` for the orchestration, and `faiss-cpu` for the vector similarity search.

In [7]:
# @title 1. Install Required Libraries (Fixed)
# @markdown Run this cell to install the necessary tools.
!pip install -q langchain langchain-openai langchain-community langchain-text-splitters pypdf tiktoken faiss-cpu

import os
import sys
from google.colab import drive, userdata

print("✅ Libraries installed successfully.")

✅ Libraries installed successfully.


### 2. 🔑 API Connection (Safety Check)
This step ensures your OpenAI API key is correctly loaded from Google Colab Secrets.

**Instructions:**
1. Click the **Key icon** (Secrets) on the left sidebar of Colab.
2. Add a new secret named: `OPENAI_API_KEY`
3. Paste your actual API key as the value.
4. Toggle the "Notebook access" switch to **On**.

In [2]:
# @title Test LLM Connection
try:
    # Retrieve key from Colab Secrets
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

    from langchain_openai import ChatOpenAI

    # Simple test call to verify connection
    test_llm = ChatOpenAI(model="gpt-4o", temperature=0)
    response = test_llm.invoke("Hello, are you ready for data extraction?")
    print(f"✅ Success! Model replied: {response.content}")

except Exception as e:
    print(f"❌ Error: {e}")
    print("\n⚠️ Please check that you added 'OPENAI_API_KEY' to the Secrets tab on the left.")

✅ Success! Model replied: Hello! Yes, I'm ready to help with data extraction. Please provide the details or the specific data you need assistance with, and I'll do my best to assist you.


### 3. 📂 Mount Google Drive
Connect to your Drive to access the PDF files.

**Note:** Ensure you define the correct path to your folder in the code block below.

In [5]:
# @title Mount Drive & Set Path
drive.mount('/content/drive')

# ---------------------------------------------------------
# 👇 UPDATE THIS PATH TO MATCH YOUR DRIVE FOLDER
# ---------------------------------------------------------
source_folder_path = "/content/drive/MyDrive/ACTIVE/AU_deconstruction_domain/Miyuki "

if os.path.exists(source_folder_path):
    print(f"✅ Folder found: {source_folder_path}")
    files = [f for f in os.listdir(source_folder_path) if f.endswith('.pdf')]
    print(f"📄 Found {len(files)} PDF files available for processing.")
    if len(files) > 0:
        print(f"   Example: {files[0]}")
else:
    print(f"❌ Folder not found: {source_folder_path}")
    print("⚠️ Please verify the path in your Google Drive.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Folder found: /content/drive/MyDrive/ACTIVE/AU_deconstruction_domain/Miyuki 
📄 Found 94 PDF files available for processing.
   Example: 1.national-waste-and-resource-recovery-report-2024.pdf


### 4. 🧠 The Semantic Filter (Pilot Run)
This is the core logic. We will test it on **one file** to verify it correctly separates "Deconstruction" content from general text.

**How it works:**
1.  **`filter_queries`**: These are the "concepts" we are looking for. I have tuned them to your specific focus on salvage and disassembly.
2.  **`similarity_search_with_score`**: Calculates the distance between each PDF chunk and these queries. The score is a distance, so a lower value means a closer match.
3.  **Threshold**: We keep only chunks whose score is below 0.5 and discard the rest as not relevant enough.
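
To get a feel for what a score of 0.5 means, it can help to translate the distance into cosine similarity. The sketch below rests on two assumptions that are not stated in this notebook: that the default LangChain FAISS index reports squared Euclidean (L2) distance, and that the OpenAI embeddings are unit-length. Under those assumptions, `squared distance = 2 - 2 * cosine`. The helper name `squared_l2_to_cosine` is made up for this illustration.

```python
def squared_l2_to_cosine(squared_distance: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - squared_distance / 2.0


# 0.5 is the notebook's threshold; 1.1040 is the best score in the pilot output.
for d in (0.5, 1.1040):
    print(f"distance {d:.4f} -> cosine similarity {squared_l2_to_cosine(d):.3f}")
```

If the assumptions hold, a threshold of 0.5 asks for a cosine similarity of about 0.75, while the best chunk in the pilot run sits near 0.45. That gap is one plausible reason every match is rejected, so the threshold is worth calibrating against real output before running the full batch.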

In [9]:
# @title 4. Run Diagnostic Filter (Fixes Import & Calibrates Threshold)
from langchain_community.document_loaders import PyPDFLoader
# ✅ FIXED IMPORT: Uses the new library structure
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import os

# ---------------------------------------------------------
# 👇 ENSURE THIS MATCHES YOUR FILE NAME EXACTLY
# ---------------------------------------------------------
file_to_test = "1.national-waste-and-resource-recovery-report-2024.pdf"

# Construct full path
full_file_path = os.path.join(source_folder_path, file_to_test)

# Define your research focus
filter_queries = [
    "building deconstruction and disassembly methods",
    "salvage of structural materials like timber and steel",
    "selective demolition practices",
    "regulatory barriers to deconstruction",
    "material recovery from demolition"
]

if os.path.exists(full_file_path):
    print(f"🔹 Loading: {file_to_test}...")
    loader = PyPDFLoader(full_file_path)
    pages = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    docs = text_splitter.split_documents(pages)
    print(f"🔹 Split into {len(docs)} chunks.")

    print("🔹 Generating embeddings...")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_db = FAISS.from_documents(docs, embeddings)

    print("\n🔎 DIAGNOSTIC RESULTS (Top 3 matches per query):")
    print("="*60)

    # We will track the lowest score found to help you set the threshold
    min_score_found = 10.0

    for query in filter_queries:
        print(f"\nQuery: '{query}'")
        # Fetch top 3 matches regardless of score
        results = vector_db.similarity_search_with_score(query, k=3)

        for doc, score in results:
            # Track the lowest (best) distance seen across all queries
            min_score_found = min(min_score_found, score)

            # Label each match against the 0.5 threshold (lower score = closer match)
            status = "✅ KEEP" if score < 0.5 else "❌ REJECT (Too strict?)"
            print(f"   Score: {score:.4f} | {status}")
            print(f"   Snippet: {doc.page_content[:150]}...")
            print("-" * 40)

    print("="*60)
    print("💡 RECOMMENDATION:")
    print(f"The best match had a score of {min_score_found:.4f}.")
    if min_score_found > 0.5:
        print(f"👉 Change your threshold in the final script to: {min_score_found + 0.1:.2f}")
    else:
        print("👉 The threshold of 0.5 is fine; the best match already passes it.")

else:
    print(f"❌ File not found: {full_file_path}")

🔹 Loading: 1.national-waste-and-resource-recovery-report-2024.pdf...
🔹 Split into 297 chunks.
🔹 Generating embeddings...

🔎 DIAGNOSTIC RESULTS (Top 3 matches per query):

Query: 'building deconstruction and disassembly methods'
   Score: 1.1040 | ❌ REJECT (Too strict?)
   Snippet: Built 
environment  
Guidelines and resources  are emerging  for preventing  waste in the design, operation and 
deconstruction of buildings and infra...
----------------------------------------
   Score: 1.1384 | ❌ REJECT (Too strict?)
   Snippet: refurbished options. Packaging reuse is also a focus, with some established services for 
business-to-business secondary and tertiary packaging. Enter...
----------------------------------------
   Score: 1.2735 | ❌ REJECT (Too strict?)
   Snippet: redirect s wearable clothing back into use (Seamless 2024).  Clothing chain Kathmandu ha s 
established Kathman-redu, a clothing take -back, repair an...
----------------------------------------

Query: