<a href="https://colab.research.google.com/github/kattens/Scholarly-RAGbot/blob/main/pdf_to_json_convertion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### First Step:  
We begin by building a database of local papers: every available PDF is converted into one entry in a single JSON file. This initial approach can later be replaced with SQL or another database management system for better handling.

We extract the **title, keywords, abstract, and DOI** from each PDF and store them in a JSON file for easy searching.

### **Steps:**
1. **Extract text** from PDFs.
2. **Parse metadata** (title, keywords, abstract, DOI).
3. **Store data in JSON**.


- Our objective is to take an object-oriented (OOP) approach.
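
As a rough sketch of the target format, each paper could become one dictionary in a JSON list. The field names mirror the ones the `PDFProcessor` class in this notebook writes (`title`, `doi`, `abstract`, `keywords`, `file`); the values below are placeholders, not real extracted data.

```python
import json

# One record per PDF; every value here is a placeholder for illustration only.
record = {
    "title": "Example Paper Title",
    "doi": "10.1234/example.doi",
    "abstract": "A short placeholder abstract.",
    "keywords": ["protein", "interaction"],
    "file": "example.pdf",
}

# The "database" is simply a list of such records serialized as JSON.
papers = [record]
serialized = json.dumps(papers, indent=4, ensure_ascii=False)

# Reading it back yields the same structure, which keeps later searching simple.
restored = json.loads(serialized)
print(restored[0]["title"])
```

Keeping the records flat (no nesting beyond the keyword list) should also make a later move to a SQL table straightforward.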

In [None]:
# PyMuPDF, used here to read text from PDFs
!pip install pymupdf

Collecting pymupdf
  Downloading pymupdf-1.25.2-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.2-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.0/20.0 MB 44.0 MB/s eta 0:00:00
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.2


In [None]:
import json
import os
import re
import fitz  # PyMuPDF

In [None]:
import fitz  # PyMuPDF
import json
import re
import os
from sklearn.feature_extraction.text import TfidfVectorizer

class PDFProcessor:
    """Class to process PDFs and extract metadata like title, abstract, keywords, and DOI."""

    def __init__(self, pdf_folder, output_folder, output_json="papers_metadata.json"):
        """
        Initializes the PDFProcessor with a folder containing PDFs and an output JSON file.

        :param pdf_folder: Directory containing PDF files.
        :param output_folder: Directory where the output JSON file will be saved.
        :param output_json: JSON file to store extracted metadata.
        """
        self.pdf_folder = pdf_folder
        self.output_folder = output_folder
        self.output_json = os.path.join(output_folder, output_json)
        self.papers = []

        # Ensure output folder exists
        os.makedirs(self.output_folder, exist_ok=True)

    def extract_text(self, pdf_path):
        """
        Extracts text from a PDF file.

        :param pdf_path: Path to the PDF file.
        :return: Extracted text as a string.
        """
        # Use a context manager so the document handle is closed after reading
        with fitz.open(pdf_path) as doc:
            return "\n".join(page.get_text("text") for page in doc)

    def extract_metadata(self, text):
        """
        Extracts metadata (title, keywords, abstract, DOI) from the text.

        :param text: Extracted text from the PDF.
        :return: Dictionary containing metadata.
        """
        metadata = {}

        # Extract Title (First non-empty line is assumed to be the title)
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        metadata["title"] = lines[0] if lines else "Unknown Title"

        # Extract DOI using regex
        doi_match = re.search(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+", text)
        # Strip trailing punctuation that the pattern can pick up at the end of a sentence
        metadata["doi"] = doi_match.group(0).rstrip(".,;") if doi_match else "Unknown DOI"

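        # Extract Abstract: capture the text after "Abstract" up to the next common
        # section heading (Keywords, Index Terms, Introduction) or the end of the text.
        # This is a heuristic; papers with other layouts may need a different stop rule.
        abstract_match = re.search(
            r"\bAbstract\b[:\s]*([\s\S]+?)(?=\n\s*(?:Keywords?|Index\s+Terms|(?:\d+\.?\s*)?Introduction)\b|\Z)",
            text,
            re.IGNORECASE,
        )
        metadata["abstract"] = abstract_match.group(1).strip() if abstract_match else "No abstract found"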

        # Extract Keywords using TF-IDF
        metadata["keywords"] = self.extract_keywords(text)

        return metadata

    def extract_keywords(self, text, num_keywords=10):
        """
        Extracts top keywords from the text using TF-IDF.

        :param text: Extracted text from the PDF.
        :param num_keywords: Number of keywords to extract.
        :return: List of extracted keywords.
        """
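        # Preprocess text: lowercase, then replace digits and special characters with spaces
        clean_text = re.sub(r"[\W\d_]+", " ", text.lower())

        # With a single document the IDF term is constant, so these scores
        # rank terms by normalized frequency after stop-word removal.
        vectorizer = TfidfVectorizer(stop_words="english")
        try:
            scores = vectorizer.fit_transform([clean_text]).toarray()[0]
        except ValueError:
            # Raised when the text has no usable terms (e.g. a scanned PDF without a text layer)
            return []
        terms = vectorizer.get_feature_names_out()

        # Sort terms by descending score and keep the top num_keywords
        ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
        return [term for term, _ in ranked[:num_keywords]]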

    def process_pdfs(self):
        """
        Processes all PDFs in the specified folder and extracts metadata.
        """
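        # Start fresh so repeated calls do not duplicate entries
        self.papers = []
        for filename in sorted(os.listdir(self.pdf_folder)):
            # Match the extension case-insensitively (.pdf, .PDF)
            if filename.lower().endswith(".pdf"):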
                pdf_path = os.path.join(self.pdf_folder, filename)
                print(f"Processing {filename}...")

                text = self.extract_text(pdf_path)
                metadata = self.extract_metadata(text)
                metadata["file"] = filename  # Add filename for reference
                self.papers.append(metadata)

        self.save_to_json()
        print(f"Metadata saved to {self.output_json}")

    def save_to_json(self):
        """
        Saves extracted metadata to a JSON file.
        """
        with open(self.output_json, "w", encoding="utf-8") as json_file:
            json.dump(self.papers, json_file, indent=4, ensure_ascii=False)
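
To see how the DOI pattern used in `extract_metadata` behaves on its own, here is a minimal standalone sketch. `find_doi` is a hypothetical helper, not part of the class, the sample strings are made up, and stripping trailing `.`, `,` and `;` is an assumption about how DOIs tend to appear at the end of a sentence.

```python
import re

# Same pattern as in extract_metadata: "10.", a 4-9 digit registrant code,
# a slash, then a run of characters commonly allowed in a DOI suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # The suffix class includes '.' and ';', so sentence punctuation can be
    # captured at the end; trimming it is an assumption, not a DOI rule.
    return match.group(0).rstrip(".,;")


print(find_doi("See doi: 10.1234/abc.5678."))
print(find_doi("https://doi.org/10.5555/xyz-001, accessed today"))
print(find_doi("no identifier here"))
```

Returning `None` instead of a sentinel string is a design choice for this sketch only; the class above keeps the `"Unknown DOI"` placeholder.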



In [None]:

pdf_folder = "/content/drive/MyDrive/papers/oldpapers/45NSYUH9"
output_folder = "/content/drive/MyDrive/papers/paperjson/"

# Example usage:
pdf_processor = PDFProcessor(pdf_folder, output_folder)
pdf_processor.process_pdfs()


Processing Ieremie et al. - 2022 - TransformerGO predicting protein–protein interact.pdf...
Metadata saved to /content/drive/MyDrive/papers/paperjson/jsonfile1.json/papers_metadata.json
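
Since the point of the JSON file is easy searching, a lookup over the saved records could be sketched as follows. `search_papers` is a hypothetical helper, the sample records are placeholders, and the case-insensitive substring match over title, abstract, and keywords is an assumption about what "searching" should mean here.

```python
import json
import os
import tempfile


def search_papers(json_path, query):
    """Return records whose title, abstract, or keywords contain the query."""
    with open(json_path, encoding="utf-8") as json_file:
        papers = json.load(json_file)

    needle = query.lower()
    hits = []
    for paper in papers:
        # Join the searchable fields into one lowercase string per record.
        haystack = " ".join(
            [paper.get("title", ""), paper.get("abstract", "")]
            + list(paper.get("keywords", []))
        ).lower()
        if needle in haystack:
            hits.append(paper)
    return hits


# Placeholder records written to a temporary file so the sketch runs anywhere.
sample = [
    {"title": "Graph methods for proteins", "abstract": "...", "keywords": ["protein"], "file": "a.pdf"},
    {"title": "Image segmentation", "abstract": "...", "keywords": ["vision"], "file": "b.pdf"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "papers_metadata.json")
    with open(sample_path, "w", encoding="utf-8") as json_file:
        json.dump(sample, json_file)
    results = search_papers(sample_path, "protein")

print([paper["file"] for paper in results])
```

In the notebook itself, `json_path` would be the processor's `output_json` path rather than a temporary file.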
