# PDF to Medium Article Generator with Persistent Vector Store

This notebook creates a system that:
1. Reads PDF files from a specified folder
2. Persistently stores their content in a Chroma vector store
3. Generates technical Medium articles through a Gradio chat interface
4. Resumes processing from a previously saved state
5. Formats articles in active voice, targeting recruiters and managers

Key Features:
- Persistent vector store using SQLite backend
- Progress tracking and state management
- Chunk-based processing with automatic saves
- Resume capability for interrupted processing

In [1]:
import os
import time
from datetime import datetime
from pathlib import Path
import json
import hashlib
import math
import requests
import numpy as np
from tqdm.notebook import tqdm

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import gradio as gr



# Vector Store Configuration and Helper Functions

We'll set up:
1. Directory configuration for persistent storage
2. Helper functions for time formatting and progress tracking
3. Vector store initialization and management functions
4. Processing progress tracking and state management

In [2]:
# Helper function for time formatting
def format_time(seconds):
    """Convert seconds to human readable time format"""
    if seconds < 60:
        return f"{seconds:.1f} seconds"
    elif seconds < 3600:
        minutes = seconds / 60
        return f"{minutes:.1f} minutes"
    else:
        hours = seconds / 3600
        return f"{hours:.1f} hours"

# Vector store configuration
PERSIST_DIRECTORY = os.path.join(os.getcwd(), "vector_store")
PROGRESS_FILE = os.path.join(PERSIST_DIRECTORY, "processing_progress.json")

def verify_ollama_connection():
    """Verify that Ollama is running and accessible."""
    try:
        # A timeout keeps the check from hanging if the port accepts but never answers
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code == 200:
            print("✓ Ollama connection verified")
            return True
        else:
            print("✗ Ollama is not responding correctly")
            return False
    except requests.exceptions.RequestException as e:
        print("✗ Could not connect to Ollama. Is it running?")
        print(f"Error: {str(e)}")
        return False

def test_embeddings():
    """Test that embeddings are working correctly."""
    try:
        embeddings = OllamaEmbeddings(model="llama3")
        test_text = "This is a test sentence."
        result = embeddings.embed_query(test_text)
        
        if isinstance(result, list) and len(result) > 0 and all(isinstance(x, float) for x in result):
            print("✓ Embeddings test successful")
            return True
        else:
            print("✗ Invalid embedding output format")
            return False
    except Exception as e:
        print("✗ Embeddings test failed")
        print(f"Error: {str(e)}")
        return False

def save_processing_progress(processed_files):
    """Save the list of processed files and their hashes."""
    os.makedirs(PERSIST_DIRECTORY, exist_ok=True)
    with open(PROGRESS_FILE, "w") as f:
        json.dump(processed_files, f)

def load_processing_progress():
    """Load the list of previously processed files."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE, "r") as f:
            return json.load(f)
    return {}

def get_file_hash(file_path):
    """Calculate hash of file for tracking changes."""
    with open(file_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def initialize_vector_store():
    """Initialize or load existing vector store."""
    os.makedirs(PERSIST_DIRECTORY, exist_ok=True)
    embeddings = OllamaEmbeddings(model="llama3")
    
    # Loading and creating use the same constructor; only the message differs
    store_exists = os.path.exists(os.path.join(PERSIST_DIRECTORY, "chroma.sqlite3"))
    print("Loading existing vector store..." if store_exists else "Creating new vector store...")
    return Chroma(
        persist_directory=PERSIST_DIRECTORY,
        embedding_function=embeddings
    )

def save_vector_store(vectorstore):
    """Save vector store to disk."""
    print("Saving vector store...")
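    # Chroma 0.4+ writes to disk automatically, and some newer wrappers
    # may not provide persist(), so call it only when it exists
    if hasattr(vectorstore, "persist"):
        vectorstore.persist()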
    print(f"Vector store saved to {PERSIST_DIRECTORY}")

def clear_vector_store():
    """Clear the vector store and processing history."""
    import shutil
    if os.path.exists(PERSIST_DIRECTORY):
        shutil.rmtree(PERSIST_DIRECTORY)
        os.makedirs(PERSIST_DIRECTORY)
        print("Vector store cleared successfully")

def get_vector_store_stats():
    """Get statistics about the vector store."""
    if os.path.exists(os.path.join(PERSIST_DIRECTORY, "chroma.sqlite3")):
        vectorstore = initialize_vector_store()
        collection = vectorstore._collection
        stats = {
            "total_documents": collection.count(),
            "persist_directory": PERSIST_DIRECTORY,
            "processed_files": len(load_processing_progress())
        }
        return stats
    return {
        "total_documents": 0, 
        "persist_directory": PERSIST_DIRECTORY, 
        "processed_files": 0
    }
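
The progress helpers above can be combined to skip files that have not changed since the last run. The following is a minimal sketch, assuming the progress file maps each file path to its MD5 hash (the notebook does not show the exact layout). `file_md5` and `files_needing_processing` are hypothetical helpers, not part of the original code; `file_md5` hashes in blocks so that a large PDF is not read into memory at once.

```python
import hashlib
import os
import tempfile


def file_md5(file_path, block_size=1 << 20):
    """Hash a file block by block (hypothetical variant of get_file_hash)."""
    digest = hashlib.md5()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_needing_processing(file_paths, progress):
    """Return paths whose current hash differs from the recorded one.

    `progress` is assumed to be a dict of {file_path: md5_hex}.
    """
    return [p for p in file_paths if progress.get(p) != file_md5(p)]


# Demonstration with temporary files standing in for PDFs
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.pdf")
    b = os.path.join(tmp, "b.pdf")
    for path, payload in ((a, b"first"), (b, b"second")):
        with open(path, "wb") as f:
            f.write(payload)

    progress = {a: file_md5(a)}  # only a.pdf was processed earlier
    pending = files_needing_processing([a, b], progress)
    pending_names = [os.path.basename(p) for p in pending]
    print(pending_names)  # ['b.pdf']
```

A file that is edited after processing gets a new hash, so it shows up as pending again on the next run.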

In [5]:
get_vector_store_stats()

Loading existing vector store...


{'total_documents': 1435,
 'persist_directory': 'c:\\Workspace\\SideProjects\\llm-projects\\TechWriter\\vector_store',
 'processed_files': 0}

# Document Processing and Vector Store Population

Now we'll implement the core document processing functionality:
1. Load PDFs from the specified directory
2. Split documents into chunks
3. Process and store embeddings with progress tracking
4. Save state at regular intervals
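
Steps 3 and 4 above could be sketched as follows. This is an illustrative outline rather than the notebook's implementation: `process_in_batches`, `add_batch`, and `save_progress` are hypothetical names, with `add_batch` standing in for a call such as `vectorstore.add_documents` and `save_progress` standing in for a function that records the resume position. The batch size of 30 matches the save interval mentioned in the usage notes.

```python
def process_in_batches(chunks, add_batch, save_progress, start_index=0, batch_size=30):
    """Add chunks in batches and record the resume position after each batch."""
    position = start_index
    while position < len(chunks):
        batch = chunks[position:position + batch_size]
        add_batch(batch)          # e.g. store embeddings for this batch
        position += len(batch)
        save_progress(position)   # checkpoint: everything before `position` is stored
    return position


# In-memory stand-ins for the vector store and the progress file
stored = []
checkpoints = []
chunks = [f"chunk-{i}" for i in range(70)]

# Simulate a run that stops after the first 30 chunks
process_in_batches(chunks[:30], stored.extend, checkpoints.append)

# Resume from the last checkpoint instead of starting over
final_position = process_in_batches(
    chunks, stored.extend, checkpoints.append, start_index=checkpoints[-1]
)
print(checkpoints)  # [30, 60, 70]
```

Because the checkpoint is written only after a batch is stored, an interruption costs at most one batch of repeated work.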

# Article Generation with LLM Chain

We'll set up the article generation pipeline:
1. Create a prompt template for technical articles
2. Initialize the LLM chain with Ollama
3. Create a function to generate articles from vector store content
4. Add a Gradio interface for easy interaction
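
Step 3 joins the retrieved chunks into a single context string. Retrieved chunks can repeat each other or add up to more text than the model handles well, so one option is to deduplicate them and cap the total length before filling the prompt. The following is a minimal sketch under those assumptions; `build_context` and the `max_chars` budget are hypothetical and do not appear in the notebook.

```python
def build_context(chunk_texts, max_chars=6000):
    """Join unique chunk texts in order, stopping before max_chars is exceeded."""
    seen = set()
    parts = []
    total = 0
    for text in chunk_texts:
        cleaned = text.strip()
        if not cleaned or cleaned in seen:
            continue
        # One extra character for the newline that separates parts
        added = len(cleaned) + (1 if parts else 0)
        if total + added > max_chars:
            break
        seen.add(cleaned)
        parts.append(cleaned)
        total += added
    return "\n".join(parts)


retrieved = [
    "Chroma stores embeddings.",
    "Chroma stores embeddings.",
    "  Ollama runs models locally.  ",
]
context = build_context(retrieved, max_chars=100)
print(context)
```

In `generate_article`, a helper like this could take the place of the plain `"\n".join(...)` call, for example `build_context([doc.page_content for doc in results])`.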

In [None]:
# Setup the article generation chain
article_template = """
You are an expert technical writer and software engineer creating a Medium article to showcase your skills in AI and ML.
Topic: {topic}
Retrieved Content: {context}

Write a 1200-word technical article that:
1. Uses active voice throughout
2. Explains technical concepts in business-friendly language
3. Highlights practical applications and business value
4. Includes relevant examples from the source material
5. Structures content with clear headings and subheadings

Article:
"""

article_prompt = PromptTemplate(
    input_variables=["topic", "context"],
    template=article_template
)

# Initialize Ollama with llama3 model
llm = Ollama(model="llama3")
article_chain = LLMChain(llm=llm, prompt=article_prompt)

def generate_article(topic, vectorstore):
    """Generate a Medium article based on the topic and vector store content."""
    # Search the vector store for relevant content
    results = vectorstore.similarity_search(topic, k=8)
    context = "\n".join([doc.page_content for doc in results])
    
    # Generate the article
    article = article_chain.run(topic=topic, context=context)
    return article

def gradio_interface(topic):
    """Gradio interface function for article generation."""
    try:        
        # Initialize vector store
        vectorstore = initialize_vector_store()
        if not vectorstore:
            return "Error: Could not initialize vector store. Please check if the store exists and contains documents."
        
        article = generate_article(topic, vectorstore)
        return article
    except Exception as e:
        return f"Error generating article: {str(e)}"



In [None]:
iface = gr.Interface(
    fn=gradio_interface,
    inputs=gr.Textbox(
        lines=2,
        placeholder="Enter the topic for your technical article...",
        label="Article Topic"
    ),
    outputs=gr.Textbox(
        lines=20,
        label="Generated Article"
    ),
    title="Technical Article Generator",
    description="Generate a 1200-word technical article from your PDF content"
)

# Launch the interface
iface.launch()

* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.






Loading existing vector store...




# How to Use

1. Create a directory named `pdfs` in the same location as this notebook
2. Place your PDF files in the `pdfs` directory
3. Install Ollama on your system and start the Ollama service
4. Pull the llama3 model using: `ollama pull llama3`
5. Run all cells in this notebook
6. Use the Gradio interface to:
   - Enter your desired article topic
   - Click submit to generate the article
   - Copy the generated article and paste it into Medium

The system will:
- Automatically load and process PDFs from the `pdfs` directory
- Store embeddings persistently in the `vector_store` directory
- Save progress every 30 chunks during processing
- Allow resuming from previous state if interrupted
- Generate articles using the stored knowledge

Notes:
- Make sure Ollama is running before using this notebook
- The vector store is persistent and will be reused across sessions
- You can clear the vector store using the `clear_vector_store()` function if needed
- Progress tracking prevents reprocessing of already processed files

Example Usage:
```python
# Get current vector store statistics
stats = get_vector_store_stats()
print(f"Documents in store: {stats['total_documents']}")

# Clear vector store and start fresh
clear_vector_store()

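# Process new documents
# (load_pdfs, process_documents, and pdf_directory are assumed to be defined
# in the document processing cell, which is not included above)
documents = load_pdfs(pdf_directory)
vectorstore = process_documents(documents)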

# Save explicitly if needed
save_vector_store(vectorstore)
```