# Testing Docling

In [2]:
from docling.document_converter import DocumentConverter

source = "og_cv/albin.pdf" 
converter = DocumentConverter()
doc = converter.convert(source).document

print(doc.export_to_markdown())

2025-10-14 16:56:40,166 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-14 16:56:40,170 - INFO - Going to convert document batch...
2025-10-14 16:56:40,171 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-10-14 16:56:40,173 - INFO - Accelerator device: 'cpu'
2025-10-14 16:56:42,095 - INFO - Accelerator device: 'cpu'
2025-10-14 16:56:43,072 - INFO - Accelerator device: 'cpu'
2025-10-14 16:56:43,557 - INFO - Processing document albin.pdf
2025-10-14 16:56:47,774 - INFO - Finished converting document albin.pdf in 7.62 sec.


<!-- image -->

## Profile Summary

A MERN stack developer from Kochi with a strong focus on backend-heavy full-stack development. Skilled in building robust, efficient, and secure server-side applications. Committed to continuous learning and passionate about distributed systems, with a dedication to delivering high-quality solutions.

## Technical Skills

Programming Languages:

JavaScript, TypeScript

Back End Development: Node.js, Express.js, Next.js, GraphQL, RESTful APIs, Socket.io, MVC Architecture, SOLID Principles, Clean Architecture, Modular Monolithic

Front End Development:

React.js, Redux, Next.js, TanStack Query

Database:

PostgreSQL, MongoDB

Tools:

Docker, CI/CD, Postman, Git, Figma

Familiar with: JWT, Tailwind CSS, Bootstrap, Stripe, Razorpay, PayPal, Cloudinary, WebHooks, AWS, GDC, Nginx, Firebase, OAuth2.0, Clerk Authentication, Nodemailer, Multer, Data Structures and Algorithms

## Main Projects

## Zyra Moments, Event Management and Hosting Platform

Live Link 
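
The log lines above report one total per conversion (7.62 s for albin.pdf). To see where that time goes, each step can be timed separately. The sketch below is a minimal, hypothetical helper: `timed` and the `timings` dict are illustrative names and not part of Docling. It times a stand-in workload so that it runs without any PDF.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; in the notebook this block would wrap converter.convert(source).
with timed("convert", timings):
    total = sum(i * i for i in range(100_000))

# Stand-in for the export step, e.g. doc.export_to_markdown().
with timed("export", timings):
    text = str(total)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Timing the first and second call on the same `converter` separately would also show how much of the total is one-time pipeline initialization rather than per-document work.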

In [1]:
import re
import fitz
from docling.document_converter import DocumentConverter
converter = DocumentConverter()

def extract_clean_text(file_path):
    """Return the document's cleaned Markdown text, prefixed with its key links."""
    all_links = set()
    top_links = []

    # Extract hyperlinks from PDF
    if str(file_path).lower().endswith(".pdf"):
        with fitz.open(file_path) as doc:
            for page in doc:
                for link in page.get_links():
                    uri = link.get("uri")
                    if uri and uri.startswith("http"):
                        all_links.add(uri.strip())

    # Extract text using Docling
    doc = converter.convert(file_path).document
    full_text = doc.export_to_markdown()

    # Extract plain-text URLs 
    url_pattern = re.compile(
        r'(https?://[^\s]+|www\.[^\s]+|\b[\w-]+\.(?:vercel|netlify|github|streamlit|huggingface|render|heroku|io|app|ai|com|org)\b[^\s]*)',
        re.IGNORECASE
    )
    for u in url_pattern.findall(full_text):
        clean_url = u.strip(").,;:!?")
        all_links.add(clean_url)

    # Select top-level links
    for link in all_links:
        if any(
            k in link.lower()
            for k in [
                "linkedin",
                "github",
                "portfolio",
                "vercel",
                "netlify",
                "streamlit",
                "huggingface",
                "render",
                "demo",
                "live",
                "project",
            ]
        ):
            top_links.append(link)

    # Clean text
    text = re.sub(r"\(cid:\d+\)", "", full_text)  # remove (cid:N) glyph placeholders left by PDF extraction
    text = re.sub(r"\s+", " ", text).strip()  # collapse all whitespace, including line breaks, into single spaces

    # Prepend top-level links
    if top_links:
        text = "Links: " + ", ".join(sorted(set(top_links))) + "\n\n" + text

    return text


clean_cv_text = extract_clean_text("og_cv/afzal.pdf")  # link annotations are read only for PDFs; other formats rely on Docling alone
print(clean_cv_text)


2025-10-14 16:47:33,880 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-14 16:47:37,348 - INFO - Going to convert document batch...
2025-10-14 16:47:37,349 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-10-14 16:47:37,367 - INFO - Loading plugin 'docling_defaults'
2025-10-14 16:47:37,372 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-10-14 16:47:37,398 - INFO - Loading plugin 'docling_defaults'
2025-10-14 16:47:37,411 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-10-14 16:47:37,986 - INFO - Accelerator device: 'cpu'
2025-10-14 16:47:39,992 - INFO - Accelerator device: 'cpu'
2025-10-14 16:47:41,407 - INFO - Accelerator device: 'cpu'
2025-10-14 16:47:42,282 - INFO - Processing document afzal.pdf
2025-10-14 16:47:48,773 - INFO - Finished converting document afzal.pdf in 14.89 sec.


Links: https://article-researcher-app-cbngx5mmmxxdtjzfewngtn.streamlit.app/, https://github.com/me-Afzal/, https://github.com/me-Afzal/Article-Researcher-app, https://github.com/me-Afzal/Bank_Analysis, https://github.com/me-Afzal/Data-ETL-Pipeline, https://github.com/me-Afzal/Fake-News-Detection, https://github.com/me-Afzal/Harry_Potter_Cloak_Invisibility, https://github.com/me-Afzal/Hybrid-movie-recommendation-app, https://github.com/me-Afzal/IPL_win_probability_predictor, https://github.com/me-Afzal/Image-Enhancer-Pro, https://github.com/me-Afzal/Medical-cost-predictor, https://github.com/me-Afzal/NoteKeeper, https://github.com/me-Afzal/True-Buddy-Chatbot, https://hybrid-movie-recommend-app.streamlit.app/, https://ipl-win-probability-predictor-tool.streamlit.app/, https://linkedin.com/in/afzal-a-0b1962325, https://medical-cost-predictor-web.streamlit.app/, https://true-buddy-chatbot.streamlit.app/

<!-- image --> ## Professional Summary Data Scientist and AI/ML Engineer with hands-on
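
The output above is flattened onto one line because the cleaning step replaces every whitespace run, line breaks included, with a single space. If headings and paragraphs should survive, a gentler cleanup is possible. The sketch below is one assumed variation, not the behavior of the cell above: `clean_markdown` is a hypothetical name, and the `(cid:127)` token in the sample is made up for illustration.

```python
import re


def clean_markdown(text):
    """Remove (cid:N) debris and tidy whitespace without flattening the layout."""
    text = re.sub(r"\(cid:\d+\)", "", text)   # drop PDF glyph placeholders
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)      # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)    # keep at most one blank line in a row
    return text.strip()


sample = (
    "<!-- image -->\n\n\n## Professional Summary\n\n"
    "Data  Scientist (cid:127) and   AI/ML Engineer  \n"
)
print(clean_markdown(sample))
```

This version also strips leading indentation, so nested lists or indented code in the Markdown would lose their structure; for CV text that trade-off is usually acceptable.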

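### Observations

When we used Docling for text extraction, each run first had to detect the file format and initialize the pipeline, and the extraction itself then took several more seconds: 7.62 s for albin.pdf and 14.89 s for afzal.pdf in the runs above, both on CPU. The extracted text also has gaps. Docling's Markdown export did not include hyperlink targets (for example, "Live Link" appears without its URL), and some portions of the PDFs were missing, likely because some of the CVs were built from LaTeX templates.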
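
Because links arrive from two places (PDF annotations and plain text), the same address can show up in slightly different forms. The sketch below shows one way to normalize and de-duplicate them using only the standard library. The helper names are hypothetical, the regular expression is a simplified stand-in for the one in the cell above, and matching keywords against the host name only is a deliberate change from the original substring check on the whole URL.

```python
import re
from urllib.parse import urlparse

# Simplified pattern: explicit http(s) URLs and www. addresses only.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+|www\.[^\s)>\]]+", re.IGNORECASE)
KEYWORDS = ("linkedin", "github", "vercel", "netlify", "streamlit", "huggingface", "render")


def normalize_url(raw):
    """Strip trailing punctuation, add a scheme if missing, drop a trailing slash."""
    url = raw.strip().rstrip(").,;:!?")
    if not url.lower().startswith(("http://", "https://")):
        url = "https://" + url
    return url.rstrip("/")


def find_links(text):
    """Return the sorted, de-duplicated URLs found in plain text."""
    return sorted({normalize_url(m) for m in URL_PATTERN.findall(text)})


def is_top_link(url):
    """Match keywords against the host name only, not the whole URL."""
    host = urlparse(url).netloc.lower()
    return any(k in host for k in KEYWORDS)


sample = (
    "GitHub: https://github.com/me-Afzal/ (also https://github.com/me-Afzal), "
    "LinkedIn: www.linkedin.com/in/afzal-a-0b1962325. "
    "Notes: https://example.org/live-demo"
)
links = find_links(sample)
print(links)
print([u for u in links if is_top_link(u)])
```

Host-only matching avoids false positives such as a path that merely contains "live" or "render", at the cost of no longer catching keywords like "portfolio" or "demo" when they appear only in the path.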