MARS Open Project: Automated Metadata Generation

- Khushi Bhati 
22113077

Project Overview

In this project, I built a tool that automatically extracts text from `.pdf`, `.docx`, and `.txt` files and generates meaningful metadata in JSON format. The document text is extracted first (with OCR as a fallback when a PDF yields little extractable text) and then split into smaller chunks. I used a **map-reduce approach**: each chunk is summarized with Gemini (map), and the summaries are then combined to produce the final metadata (reduce). Finally, I built a web app with Gradio where anyone can upload a document and see the generated metadata.
You can find the link to the web app at the end of this notebook.
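
The map-reduce flow described above can be sketched without any API calls. In this minimal illustration, `fake_summarize` is a hypothetical stand-in for the Gemini call (it only keeps the first sentence), and `split_text` is a naive fixed-size splitter, not the LangChain splitter used later in the notebook.

```python
def split_text(text: str, size: int) -> list[str]:
    """Naive splitter: cut the text into consecutive pieces of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_summarize(chunk: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    first_sentence = chunk.split(".")[0].strip()
    return first_sentence + "."


def map_reduce_summary(text: str, size: int = 60) -> str:
    chunks = split_text(text, size)
    # Map: summarize every chunk independently of the others.
    partial_summaries = [fake_summarize(c) for c in chunks]
    # Reduce: join the partial summaries into one input for the final step.
    return "\n".join(partial_summaries)
```

Because the map step treats chunks independently, it can be parallelized, which is what the asynchronous cell later in the notebook does.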


In [None]:
# Install Required Libraries
# This cell installs the Python packages and system dependencies used throughout the project.

!pip install PyMuPDF pytesseract Pillow python-docx
!pip install langchain langchain-text-splitters google-generativeai
!pip install gradio
# -y answers the apt confirmation prompt, which a notebook cell cannot do interactively
!sudo apt-get install -y tesseract-ocr

In [60]:
# Imports and Configuration

import fitz  # PyMuPDF
import pytesseract
import docx  # python-docx, needed for .docx extraction below
from PIL import Image
import io
import os


In [None]:
# Detect File Type

# This cell sets the path to the input document and identifies its file extension (e.g., `.pdf`, `.docx`, `.txt`).  
# It helps determine the appropriate text extraction method based on the file type.

file_path = "Project.pdf"
file_ext = os.path.splitext(file_path)[1].lower()
print(f"\n>>> Detected file type: {file_ext}. Processing '{os.path.basename(file_path)}'...")

if file_ext == '.pdf':
    print("The file is of .pdf format")
elif file_ext == '.docx':
    print("The file is of .docx format")
elif file_ext == '.txt':
    print("The file is of .txt format")
else:
    print("The file is of unsupported format")


>>> Detected file type: .pdf. Processing 'Project.pdf'...
The file is of .pdf format
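
The same check can be written as a small reusable function. This is only a sketch: `SUPPORTED_TYPES` and `detect_file_type` are hypothetical names introduced here, and raising an error for unsupported files is a design choice that differs from the cell above, which only prints a message.

```python
import os

# Hypothetical lookup table; the keys mirror the three formats handled in this notebook.
SUPPORTED_TYPES = {".pdf": "pdf", ".docx": "docx", ".txt": "txt"}


def detect_file_type(path: str) -> str:
    """Return a short type label for a supported file, or raise ValueError."""
    ext = os.path.splitext(path)[1].lower()  # ".PDF" and ".pdf" are treated alike
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file format: {ext or '(none)'}")
    return SUPPORTED_TYPES[ext]
```

Failing early keeps an unsupported file from reaching the extraction step.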


In [None]:
# Text Extraction
# This cell extracts text from the uploaded document based on its file type:

if file_ext == ".docx":
  doc = docx.Document(file_path)
  full_text = []
  for para in doc.paragraphs:
      full_text.append(para.text)
  documents = "\n".join(full_text)
elif file_ext == ".txt":
  with open(file_path, 'r', encoding='utf-8') as f:
    documents = f.read()
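elif file_ext != ".pdf":
  # Stop here instead of trying to open an unsupported file as a PDF
  raise ValueError(f"Unsupported file format: {file_ext}")
else:
  print("The file is in pdf format and OCR may be needed")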
  doc = fitz.open(file_path)
  full_text = []
  for page in doc:
      text = page.get_text("text")
      full_text.append(text)

  final_text = "\n".join(full_text).strip()
  ocr_fallback_threshold = 150
  if len(final_text) < ocr_fallback_threshold:
      print("Direct extraction yielded little text. Falling back to OCR.")
      full_text = []
      print(f"--- Starting OCR on {os.path.basename(file_path)}... ---")
      for i, page in enumerate(doc):
          print(f"Processing page {i + 1}/{len(doc)}...")
          pix = page.get_pixmap(dpi=300)
          img_data = pix.tobytes("png")
          image = Image.open(io.BytesIO(img_data))
          text = pytesseract.image_to_string(image, lang='eng')
          full_text.append(text)
  documents = "\n".join(full_text)
print(documents)

The file is in pdf format and OCR may be needed
A living system grows, sustains and reproduces itself.
The most amazing thing about a living system is that it
is composed of non-living atoms and molecules. The
pursuit of knowledge of what goes on chemically within
a living system falls in the domain of biochemistry. Living
systems are made up of various complex biomolecules
like carbohydrates, proteins, nucleic acids, lipids, etc.
Proteins and carbohydrates are essential constituents of
our food. These biomolecules interact with each other
and constitute the molecular logic of life processes. In
addition, some simple molecules like vitamins and
mineral salts also play an important role in the functions
of organisms.  Structures and functions of some of these
biomolecules are discussed in this Unit.
Biomolecules
Biomolecules
Biomolecules
Biomolecules
Biomolecules
Biomolecules
Biomolecules
Biomolecules
Biomolecules
Biomolecules
After studying this Unit, you will be
able to
•
explain the 
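
The OCR fallback in the cell above comes down to one length check on the directly extracted text. A minimal sketch of that decision follows; `needs_ocr` is a hypothetical helper name, and the default of 150 characters is the threshold used in the cell.

```python
def needs_ocr(page_texts: list[str], threshold: int = 150) -> bool:
    """Return True when the directly extracted text is too short to be trusted."""
    combined = "\n".join(page_texts).strip()
    return len(combined) < threshold
```

Note that the check is applied to the whole document. A PDF that mixes scanned pages with a few text pages can pass the threshold, in which case the scanned pages are not sent through OCR.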

In [None]:
# Split Extracted Text into Chunks
# This cell prepares the raw extracted text for processing by large language models (LLMs)

# Document is imported from langchain_core, where current LangChain releases define it
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Wrap the extracted text in a single Document so the source path travels with every chunk
document = [Document(page_content=documents, metadata={"source": file_path})]
# Chunks of at most 1000 characters, with up to 200 characters shared between neighbours
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(document)
print(chunks)

[Document(metadata={'source': 'Project.pdf'}, page_content='A living system grows, sustains and reproduces itself.\nThe most amazing thing about a living system is that it\nis composed of non-living atoms and molecules. The\npursuit of knowledge of what goes on chemically within\na living system falls in the domain of biochemistry. Living\nsystems are made up of various complex biomolecules\nlike carbohydrates, proteins, nucleic acids, lipids, etc.\nProteins and carbohydrates are essential constituents of\nour food. These biomolecules interact with each other\nand constitute the molecular logic of life processes. In\naddition, some simple molecules like vitamins and\nmineral salts also play an important role in the functions\nof organisms.  Structures and functions of some of these\nbiomolecules are discussed in this Unit.\nBiomolecules\nBiomolecules\nBiomolecules\nBiomolecules\nBiomolecules\nBiomolecules\nBiomolecules\nBiomolecules\nBiomolecules\nBiomolecules\nAfter studying this Unit
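
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on paragraph, line, and word boundaries and treats `chunk_size` as an upper bound, so its chunks will differ. `sliding_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width windows; each starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=2`, each window repeats the last two characters of the previous one. The overlap is what lets a sentence cut at a chunk boundary still appear whole in one of the two neighbouring chunks.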

In [None]:
# Summarize Chunks Using Gemini LLM (Async with Concurrency)

# In this cell, we use Google's `gemini-2.0-flash` large language model to generate summaries for each text chunk in parallel. 
# After configuring the Gemini API using the provided key, we define an asynchronous function `summarize_chunk()` that sends a tailored prompt to the model for each chunk. 
# The prompt asks the LLM to return a concise 1–2 sentence summary that captures the main idea of the text, while also incorporating any identifiable metadata such as title, author, or filename.
# To prevent overwhelming the API and to ensure efficient processing, we set up a semaphore with a concurrency limit of 25. This allows up to 25 API calls to run in parallel.
# All asynchronous tasks are then gathered using `asyncio.gather()`, and the resulting summaries are combined to form the `final_summary` of the document. This approach makes summarization scalable even for long or complex documents.

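import asyncio
import os

import google.generativeai as genai

# Read the API key from the environment instead of hardcoding it in the notebook.
# Assumes the key has been stored in the GOOGLE_API_KEY environment variable.
google_api_key = os.environ["GOOGLE_API_KEY"]
genai.configure(api_key=google_api_key)

# One model handle for the final metadata prompt and one for the async chunk summaries
llm_model = genai.GenerativeModel("gemini-2.0-flash")
async_summarizer_model = genai.GenerativeModel("gemini-2.0-flash")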

async def summarize_chunk(model, chunk_text, chunk_number, semaphore):
    async with semaphore:
        print(f"  -> Starting API call for chunk {chunk_number}...")

        prompt = f"Your task is to analyze the following text excerpt and distill its core message into a brief summary of one to two sentences. The summary should tell the whole crux about what is given in the excerpt. Crucially, your summary must explicitly incorporate any identifiable metadata found within the text, such as a document title, author, or filename, as this output will inform a final metadata generation step. The text excerpt is: {chunk_text}."

        response = await model.generate_content_async(prompt)
        print(f"  <- Success on chunk {chunk_number}.")
        return response.text.strip()

# Allow at most 25 requests to be in flight at once
concurrency_limit = 25
semaphore = asyncio.Semaphore(concurrency_limit)

# Create one coroutine per chunk (chunk numbers start at 1 for readable logs)
tasks = [
    summarize_chunk(async_summarizer_model, chunk.page_content, i + 1, semaphore)
    for i, chunk in enumerate(chunks)
]

# Run the requests concurrently; gather returns the results in chunk order
summaries = await asyncio.gather(*tasks)

final_summary = "\n".join(summaries)
print(final_summary)

  -> Starting API call for chunk 1...
  -> Starting API call for chunk 2...
  -> Starting API call for chunk 3...
  -> Starting API call for chunk 4...
  -> Starting API call for chunk 5...
  -> Starting API call for chunk 6...
  -> Starting API call for chunk 7...
  -> Starting API call for chunk 8...
  -> Starting API call for chunk 9...
  -> Starting API call for chunk 10...
  -> Starting API call for chunk 11...
  -> Starting API call for chunk 12...
  -> Starting API call for chunk 13...
  -> Starting API call for chunk 14...
  -> Starting API call for chunk 15...
  -> Starting API call for chunk 16...
  -> Starting API call for chunk 17...
  -> Starting API call for chunk 18...
  -> Starting API call for chunk 19...
  -> Starting API call for chunk 20...
  -> Starting API call for chunk 21...
  -> Starting API call for chunk 22...
  -> Starting API call for chunk 23...
  -> Starting API call for chunk 24...
  -> Starting API call for chunk 25...
  <- Success on chunk 4.
  -> Star
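
The log above shows 25 calls starting before the first one finishes, which is the semaphore at work. The pattern can be tried without an API key using the sketch below, where `fake_api_call` is a hypothetical stand-in for the Gemini request and `stats` is only there to record how many calls overlap.

```python
import asyncio


async def fake_api_call(i: int, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for one summarization request."""
    async with semaphore:  # waits here when `limit` calls are already running
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # simulate network latency
        stats["running"] -= 1
        return f"summary {i}"


async def run_demo(n_tasks: int = 10, limit: int = 3):
    semaphore = asyncio.Semaphore(limit)
    stats = {"running": 0, "peak": 0}
    tasks = [fake_api_call(i, semaphore, stats) for i in range(1, n_tasks + 1)]
    # gather returns results in task order, regardless of completion order
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]
```

In a notebook cell this can be run with `results, peak = await run_demo()`; in a plain script, use `asyncio.run(run_demo())` instead. The peak never exceeds `limit`, and the summaries stay in chunk order even though calls finish out of order.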

In [None]:
# Generate Final Metadata JSON from Combined Summary

# In this cell, we prompt the Gemini LLM to act as a metadata generation specialist. Using the combined summaries from all chunks (stored in `final_summary`), we construct a detailed prompt instructing the model to return a structured JSON object containing the document’s metadata. 
# The prompt clearly outlines strict rules: the output must consist of a clean JSON response only—without any extra text, headers, or formatting.
# This step simulates a real-world metadata generation engine that interprets document content to produce structured outputs for indexing or further processing. The response is stored in `metadata` and printed, completing the metadata generation pipeline.

final_prompt = f"""You are an expert document analyst and metadata generation specialist. You have been given a sequence of summaries which have been formed by combining the summaries of all the chunks of the document.

Based on ALL the information provided below, generate a single, clean JSON object containing the document's metadata. The metadata should also have summary of the document.

**IMPORTANT RULES:**
1.  Your entire response MUST be only the JSON object, with no introductory text, explanations, or any characters before or after the opening and closing curly braces (`{{` and `}}`).
2.  The JSON object must strictly adhere to the specified structure and keys.

---
**INPUT SOURCE 1: OVERALL DOCUMENT SUMMARY**
(This provides the high-level context and gist of the entire document. This is basically a sequence of summaries of all the chunks of the document. )

{final_summary}
----------------------------------"""

response = llm_model.generate_content(final_prompt)
metadata = response.text
print(metadata)


```json
{
  "title": "Biomolecules",
  "description": "A unit exploring the structure, function, and classification of biomolecules including carbohydrates, proteins, nucleic acids, vitamins, enzymes, and hormones, emphasizing their roles in life processes and genetic information transfer.",
  "keywords": [
    "carbohydrates",
    "proteins",
    "nucleic acids",
    "DNA",
    "RNA",
    "enzymes",
    "vitamins",
    "hormones",
    "monosaccharides",
    "disaccharides",
    "polysaccharides",
    "amino acids",
    "peptide bonds",
    "protein structure",
    "denaturation",
    "glycosidic linkage",
    "glucose",
    "fructose",
    "starch",
    "cellulose",
    "genetic code",
    "biochemistry"
  ],
  "year": 2024,
  "type": "Educational Material",
  "summary": "This document provides a comprehensive overview of biomolecules, starting with an introduction to their importance in living systems. It then delves into the structures, classifications, and functions of carbohydrate

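Despite the "JSON only" rule in the prompt, the response above arrives wrapped in a Markdown code fence, so it cannot be passed to `json.loads` directly. The following is a minimal sketch of one way to unwrap it; `parse_json_response` is a hypothetical helper name, and the fence marker is built with string multiplication only to keep literal backticks out of this example.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick Markdown fence marker


def parse_json_response(raw: str) -> dict:
    """Extract and parse a JSON object from a model response."""
    pattern = FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE
    fenced = re.search(pattern, raw, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # No fence: fall back to the outermost pair of curly braces
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("No JSON object found in the response.")
        candidate = raw[start:end + 1]
    return json.loads(candidate)
```

For example, `parse_json_response(metadata)` would turn the printed string into a Python dictionary, provided the response is complete and valid JSON.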
In [None]:
# MetaData Extraction combined Code

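import asyncio
import io
import os

import docx  # python-docx, needed for .docx extraction below
import fitz  # PyMuPDF
import google.generativeai as genai
import pytesseract
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from PIL import Image

# Read the API key from the environment instead of hardcoding it in the notebook.
# Assumes the key has been stored in the GOOGLE_API_KEY environment variable.
google_api_key = os.environ["GOOGLE_API_KEY"]
genai.configure(api_key=google_api_key)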

# File format
file_path = "Project.pdf"
file_ext = os.path.splitext(file_path)[1].lower()

if file_ext == '.pdf':
    print("The file is of .pdf format")
elif file_ext == '.docx':
    print("The file is of .docx format")
elif file_ext == '.txt':
    print("The file is of .txt format")
else:
    print("The file is of unsupported format")

# Text Extraction

if file_ext == ".docx":
  doc = docx.Document(file_path)
  full_text = []
  for para in doc.paragraphs:
      full_text.append(para.text)
  documents = "\n".join(full_text)
elif file_ext == ".txt":
  with open(file_path, 'r', encoding='utf-8') as f:
    documents = f.read()
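elif file_ext != ".pdf":
  # Stop here instead of trying to open an unsupported file as a PDF
  raise ValueError(f"Unsupported file format: {file_ext}")
else:
  print("The file is in pdf format and OCR may be needed")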
  doc = fitz.open(file_path)
  full_text = []
  for page in doc:
      text = page.get_text("text")
      full_text.append(text)

  final_text = "\n".join(full_text).strip()
  ocr_fallback_threshold = 150
  if len(final_text) < ocr_fallback_threshold:
      print("Direct extraction yielded little text. Falling back to OCR.")
      full_text = []
      for i, page in enumerate(doc):
          print(f"Processing page {i + 1}/{len(doc)}...")
          pix = page.get_pixmap(dpi=300)
          img_data = pix.tobytes("png")
          image = Image.open(io.BytesIO(img_data))
          text = pytesseract.image_to_string(image, lang='eng')
          full_text.append(text)
  documents = "\n".join(full_text)

word_list = documents.split()

word_count = len(word_list)

document = [Document(page_content=documents, metadata={"source": file_path})]


splitter = RecursiveCharacterTextSplitter(chunk_size = 1000 , chunk_overlap = 200)
chunks = splitter.split_documents(document)



llm_model = genai.GenerativeModel("gemini-2.0-flash")

async_summarizer_model = genai.GenerativeModel("gemini-2.0-flash")

async def summarize_chunk(model, chunk_text, chunk_number, semaphore):
    async with semaphore:
        print(f"  -> Starting API call for chunk {chunk_number}...")

        prompt = f"Your task is to analyze the following text excerpt and distill its core message into a brief summary of one to two sentences. The summary should tell the whole crux about what is given in the excerpt. Crucially, your summary must explicitly incorporate any identifiable metadata found within the text, such as a document title, author, or filename, as this output will inform a final metadata generation step. The text excerpt is: {chunk_text}."

        response = await model.generate_content_async(prompt)
        print(f"  <- Success on chunk {chunk_number}.")
        return response.text.strip()

concurrency_limit = 40
semaphore = asyncio.Semaphore(concurrency_limit)


tasks = []
for i, chunk in enumerate(chunks):
    task = summarize_chunk(async_summarizer_model, chunk.page_content, i + 1, semaphore)
    tasks.append(task)

summaries = await asyncio.gather(*tasks)

final_summary = "\n".join(summaries)

final_prompt = f"""You are an expert document analyst and metadata generation specialist. You have been given a sequence of summaries which have been formed by combining the summaries of all the chunks of the document.

Based on ALL the information provided below, generate a single, clean JSON object containing the document's metadata. The metadata should also have summary of the document. Also you've been given that the word count of the document is: {word_count}. Mention this word count as it is in the metadata.

**IMPORTANT RULES:**
1.  Your entire response MUST be only the JSON object, with no introductory text, explanations, or any characters before or after the opening and closing curly braces (`{{` and `}}`).
2.  The JSON object must strictly adhere to the specified structure and keys.

---
**INPUT SOURCE 1: OVERALL DOCUMENT SUMMARY**
(This provides the high-level context and gist of the entire document. This is basically a sequence of summaries of all the chunks of the document. )

{final_summary}
----------------------------------"""

response = llm_model.generate_content(final_prompt)
metadata = response.text
print(metadata)


The file is of .pdf format
The file is in pdf format and OCR may be needed
  -> Starting API call for chunk 1...
  -> Starting API call for chunk 2...
  -> Starting API call for chunk 3...
  -> Starting API call for chunk 4...
  -> Starting API call for chunk 5...
  -> Starting API call for chunk 6...
  -> Starting API call for chunk 7...
  -> Starting API call for chunk 8...
  -> Starting API call for chunk 9...
  -> Starting API call for chunk 10...
  -> Starting API call for chunk 11...
  -> Starting API call for chunk 12...
  -> Starting API call for chunk 13...
  -> Starting API call for chunk 14...
  -> Starting API call for chunk 15...
  -> Starting API call for chunk 16...
  -> Starting API call for chunk 17...
  -> Starting API call for chunk 18...
  -> Starting API call for chunk 19...
  -> Starting API call for chunk 20...
  -> Starting API call for chunk 21...
  -> Starting API call for chunk 22...
  -> Starting API call for chunk 23...
  -> Starting API call for chunk 24..

In [None]:
# Build Gradio Web Application

# This cell creates a complete, interactive web application using Gradio to allow users to upload a document and automatically generate rich metadata. It integrates all core components of the project:
# 1. **Text Extraction**: The uploaded file (PDF, DOCX, or TXT) is processed using appropriate libraries like PyMuPDF, python-docx, or built-in file handling. For scanned PDFs, OCR is applied using Tesseract.
# 2. **Chunking**: The extracted text is split into overlapping chunks using `RecursiveCharacterTextSplitter` to ensure LLM compatibility.
# 3. **Async Summarization**: Each chunk is summarized concurrently using Google’s Gemini model (`gemini-2.0-flash`) through asynchronous API calls with a concurrency control mechanism.
# 4. **Metadata Generation**: A final LLM prompt combines all summaries and produces structured metadata in JSON format, including details like summary, word count, author, and other contextual elements.
# 5. **UI Integration**: The interface is built using Gradio Blocks. It features a file uploader, a trigger button to start processing, and a JSON viewer to display the resulting metadata.

# This cell brings together the entire pipeline—from file upload to metadata generation—and launches a shareable web app using `demo.launch(share=True)`.

import gradio as gr
import os
import io
import asyncio
import fitz  
import pytesseract
from PIL import Image
import docx 
import google.generativeai as genai
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import re   
import json 


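# Read the API key from the environment instead of hardcoding it in the notebook.
# Assumes the key has been stored in the GOOGLE_API_KEY environment variable.
google_api_key = os.environ["GOOGLE_API_KEY"]
genai.configure(api_key=google_api_key)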

async def summarize_chunk(model, chunk_text, chunk_number, semaphore):
    async with semaphore:
        print(f"  -> Starting API call for chunk {chunk_number}...")
        prompt = f"Your task is to analyze the following text excerpt and distill its core message into a brief summary of one to two sentences. The summary should tell the whole crux about what is given in the excerpt. Crucially, your summary must explicitly incorporate any identifiable metadata found within the text, such as a document title, author, or filename, as this output will inform a final metadata generation step. The text excerpt is: {chunk_text}."
        
        try:
            await asyncio.sleep(1)
            response = await model.generate_content_async(prompt)
            print(f"  <- Success on chunk {chunk_number}.")
            return response.text.strip()
        except Exception as e:
            print(f"  <- Error processing chunk {chunk_number}: {e}")
            return f"[Error processing chunk {chunk_number}]"

async def generate_metadata_from_file(uploaded_file):
    if uploaded_file is None:
        return {"error": "No file uploaded. Please upload a document."}

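    # Works whether Gradio passes a file-like object with .name or a plain path string
    file_path = getattr(uploaded_file, "name", uploaded_file)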
    print(f"\nProcessing file: {file_path}")

    # --- 1. Text Extraction ---
    file_ext = os.path.splitext(file_path)[1].lower()
    extracted_text = ""

    if file_ext == ".docx":
        try:
            doc = docx.Document(file_path)
            full_text = [para.text for para in doc.paragraphs]
            extracted_text = "\n".join(full_text)
        except Exception as e:
            return {"error": f"Error processing .docx file: {e}"}
            
    elif file_ext == ".txt":
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                extracted_text = f.read()
        except Exception as e:
            return {"error": f"Error processing .txt file: {e}"}

    elif file_ext == ".pdf":
        print("PDF detected. Starting extraction...")
        try:
            doc = fitz.open(file_path)
            full_text = [page.get_text("text") for page in doc]
            final_text = "\n".join(full_text).strip()
            
            ocr_fallback_threshold = 150
            if len(final_text) < ocr_fallback_threshold:
                print("Direct extraction yielded little text. Falling back to OCR.")
                full_text = []
                for i, page in enumerate(doc):
                    print(f"  - OCR on page {i + 1}/{len(doc)}...")
                    pix = page.get_pixmap(dpi=300)
                    img_data = pix.tobytes("png")
                    image = Image.open(io.BytesIO(img_data))
                    text = pytesseract.image_to_string(image, lang='eng')
                    full_text.append(text)
            else:
                print("Direct extraction successful.")

            extracted_text = "\n".join(full_text)
            doc.close()
        except Exception as e:
            return {"error": f"Error processing .pdf file: {e}"}
    else:
        return {"error": "Unsupported file format. Please upload a .pdf, .docx, or .txt file."}

    print("Text extraction complete.")
    word_count = len(extracted_text.split())

    #  2. Chunking 
    document_obj = [Document(page_content=extracted_text, metadata={"source": file_path})]
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(document_obj)
    print(f"Document split into {len(chunks)} chunks.")

    # 3. Parallel Summarization ("Map" Step) 
    async_summarizer_model = genai.GenerativeModel("gemini-2.0-flash")
    
    concurrency_limit = 40
    semaphore = asyncio.Semaphore(concurrency_limit)
    
    tasks = [summarize_chunk(async_summarizer_model, chunk.page_content, i + 1, semaphore) for i, chunk in enumerate(chunks)]
    
    summaries = await asyncio.gather(*tasks)
    final_summary = "\n".join(summaries)
    print("Chunk summarization complete.")

    # 4. Final Metadata Generation ("Reduce" Step) 
    print("Generating final metadata...")
    llm_model = genai.GenerativeModel("gemini-2.0-flash")
    final_prompt = f"""You are an expert document analyst and metadata generation specialist. You have been given a sequence of summaries which have been formed by combining the summaries of all the chunks of the document.

Based on ALL the information provided below, generate a single, clean JSON object containing the document's metadata. The metadata should also have summary of the document. Also you've been given that the word count of the document is: {word_count}. Mention this word count as it is in the metadata.
Also if there are any metadata like author, file name, publication etc. include that also in the metadata. 
**IMPORTANT RULES:**
1.  Your entire response MUST be only the JSON object, with no introductory text, explanations, or any characters before or after the opening and closing curly braces (`{{` and `}}`).
2.  The JSON object must strictly adhere to the specified structure and keys.

---
**INPUT SOURCE 1: OVERALL DOCUMENT SUMMARY**
(This provides the high-level context and gist of the entire document. This is basically a sequence of summaries of all the chunks of the document. )

{final_summary}
----------------------------------"""

    try:
        response = llm_model.generate_content(final_prompt)
        raw_text_response = response.text
        print("Metadata generation complete.")
        
        match = re.search(r"```json\s*(\{.*?\})\s*```", raw_text_response, re.DOTALL)
        if match:
            json_str = match.group(1)
        else:
            start = raw_text_response.find('{')
            end = raw_text_response.rfind('}')
            if start != -1 and end > start:
                json_str = raw_text_response[start:end+1]
            else:
                raise ValueError("No valid JSON object found in the LLM response.")
        
        return json.loads(json_str)

    except Exception as e:
        error_message = f"Error during final metadata generation: {e}"
        print(error_message)
        return {"error": error_message}


with gr.Blocks(theme=gr.themes.Soft(), title="Automated Metadata Generator") as demo:
    gr.Markdown(
        """
        # Automated Metadata Generator
        Upload a document (`.pdf`, `.docx`, or `.txt`) to automatically extract its text and generate semantically rich metadata.
        The process involves chunking the document, summarizing the chunks in parallel, and then synthesizing the final metadata.
        """
    )
    with gr.Row():
        file_input = gr.File(label="Upload Document", file_types=[".pdf", ".docx", ".txt"])
    
    submit_button = gr.Button("Generate Metadata", variant="primary")
    
    gr.Markdown("---")
    
    gr.Markdown("## Generated Metadata")
    json_output = gr.JSON(label="Metadata")

    submit_button.click(
        fn=generate_metadata_from_file,
        inputs=file_input,
        outputs=json_output
    )

if __name__ == "__main__":
    demo.launch(share=True)

* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://fd9a384453f1a1601d.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
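
One possible refinement of the reduce step: the final prompt asks the model to repeat the word count, but the count is already known in code, so it can be written into the parsed result directly. The sketch below assumes the key names `word_count` and `file_name`, which are hypothetical; the notebook does not fix the keys the model returns.

```python
def finalize_metadata(llm_metadata: dict, word_count: int, source_name: str) -> dict:
    """Return a copy of the model's metadata with locally computed fields enforced."""
    result = dict(llm_metadata)        # shallow copy; the input is left untouched
    result["word_count"] = word_count  # trust the local count over the model's echo
    result.setdefault("file_name", source_name)  # keep the model's value if it gave one
    return result
```

Inside `generate_metadata_from_file`, this could wrap the parsed response, for example `finalize_metadata(json.loads(json_str), word_count, os.path.basename(file_path))`.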
