# P.A.R.S.E.C. - PDF Analysis and Review System for Exam Content

This Jupyter Notebook, titled "P.A.R.S.E.C. - PDF Analysis and Review System for Exam Content", aims to provide a system for analyzing and reviewing PDF documents related to exam content. The notebook will include various functionalities such as parsing PDF files, extracting text and metadata, performing text analysis, generating visualizations, and facilitating the review process. The goal is to create an efficient and comprehensive system for working with exam-related PDF documents.

## Requirements
python 3.10.11
Elastic Cloud instance w/ ELSER Model
nvidia gpu or llm api key

# Optional
Langsmith Account/API key (free tier)


In [1]:
#!python3 -m pip install -qU elasticsearch langchain langchain-elasticsearch openai tiktoken PyPDF4

#%pip install PyPDF4
#%pip install elasticsearch
#%pip install langchain
#%pip install langchain-elasticsearch
#%pip install openai tiktoken
#%pip install wordcloud
#%pip install rapidocr-onnxruntime
#%pip install nltk
%pip install jq

Collecting jq
  Using cached jq-1.7.0.tar.gz (2.0 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: jq
  Building wheel for jq (pyproject.toml): started
  Building wheel for jq (pyproject.toml): finished with status 'error'
Failed to build jq
Note: you may need to restart the kernel to use updated packages.


  error: subprocess-exited-with-error
  
  × Building wheel for jq (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [5 lines of output]
      running bdist_wheel
      running build
      running build_ext
      Executing: ./configure CFLAGS=-fPIC -pthread --disable-maintainer-mode --with-oniguruma=builtin
      error: [WinError 2] The system cannot find the file specified
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for jq
ERROR: Could not build wheels for jq, which is required to install pyproject.toml-based projects

[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
import os 
import json
from getpass import getpass
from urllib.request import urlopen
from pypdf import PdfReader, PdfWriter
from langchain_community.document_loaders import PyPDFLoader
from langchain_elasticsearch import ElasticsearchStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from langchain_community.document_loaders import JSONLoader
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from concurrent.futures import ProcessPoolExecutor
import multiprocessing
from multiprocessing import Pool
from tqdm import tqdm
import jq


# Download the Punkt tokenizer models (only needed once)
#nltk.download('punkt')



ModuleNotFoundError: No module named 'jq'

# Preparation of the PDF Files
Get rid of pesky passwords


In [4]:
import os
cwd = os.getcwd()
pdf_files = [f for f in os.listdir(cwd) if f.endswith('.pdf')]
print(pdf_files)

['decrypted_SEC595 - Book 1_2036060.pdf', 'decrypted_SEC595 - Book 2_2036060.pdf', 'decrypted_SEC595 - Book 3_2036060.pdf', 'decrypted_SEC595 - Book 4_2036060.pdf', 'decrypted_SEC595 - Book 5_2036060.pdf', 'decrypted_SEC595 - Book 6_2036060.pdf', 'decrypted_SEC595 - Workbook 1_2036060.pdf', 'decrypted_SEC595 - Workbook 2_2036060.pdf', 'SEC595 - Book 1_2036060.pdf', 'SEC595 - Book 2_2036060.pdf', 'SEC595 - Book 3_2036060.pdf', 'SEC595 - Book 4_2036060.pdf', 'SEC595 - Book 5_2036060.pdf', 'SEC595 - Book 6_2036060.pdf', 'SEC595 - Workbook 1_2036060.pdf', 'SEC595 - Workbook 2_2036060.pdf']


In [None]:
from pypdf import PdfReader, PdfWriter
i = 0
for pdf_file in pdf_files:
    if pdf_file.startswith('decrypted_'):
        continue
    with open(pdf_file, 'rb') as file:
        print(pdf_file, ' is decrypting')
        reader = PdfReader(file)
        # Attempt to decrypt the PDF with an empty password
        if reader.is_encrypted:
            try:
                reader.decrypt('PpH[uQ(7+Gy:FdA9;X9QVXi@$zVwD-') # Your password here
            except:
                print("The PDF is encrypted and cannot be decrypted with the password.")
                exit()
        
        writer = PdfWriter()
        
        # Copy the content from the original PDF to the new PDF
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            writer.add_page(page)
        
        # Save the new PDF file without encryption
        with open('decrypted_'+pdf_files[i], 'wb') as output_file:
            writer.write(output_file)
        i += 1
        print("The PDF has been successfully decrypted.")

# Extracting text and metadata
We'll load the decrypted pdfs, split pages, and so on.
There are books and workbooks within the provided materials and they both require some cleaning. Basically everything after the SANS copyright is useless. For each book we'll take two slices of pages, [4:-1] & [-1]. The former is the course content and the latter is the provided index. For workbooks, we only need one slice [4:]


In [5]:
pdfs = [f for f in os.listdir(cwd) if f.startswith('decrypted_')]

from langchain_community.document_loaders import PyPDFLoader

indexes = []
contents = []
for pdf in pdfs:
    loader = PyPDFLoader(pdf)
    pages = loader.load_and_split()
    if 'Workbook' in pages[0].metadata['source']:
        contents.append(pages[4:])
    else:
        contents.append(pages[4:-1])
        indexes.append(pages[-1])

len(indexes), len(contents)

(6, 8)

In [6]:
import re

def clean_document(document):
    # Remove headers, footers, and any licensing or copyright information
    cleaned_content = re.sub(r'©.*?SANS Institute \d{4}.*', '', document, flags=re.DOTALL)
    # Remove email addresses
    cleaned_content = re.sub(r'\S+@\S+', '', cleaned_content)
    # Normalize whitespace to single space
    cleaned_content = re.sub(r'\d$', '', cleaned_content)

    return cleaned_content


for index in indexes:
    index.page_content = clean_document(index.page_content)

for content in contents:
    for page in content:
        page.page_content = clean_document(page.page_content)

# Lets slice a bit more & index the contents in elastic 

In [6]:
# get average length of the pages for each book
avg_page_lengths = []

for content in contents:
    avg_page_lengths.append(sum([len(page.page_content) for page in content]) / len(content))

avg_page_lengths


[1318.1304347826087,
 1089.3152173913043,
 1117.5348837209303,
 1093.0131578947369,
 853.34375,
 1043.159090909091,
 1373.6743119266055,
 1587.4894894894894]

In [None]:
type(sec595), type(contents)

(list, list)

In [7]:
sec595 = contents.copy()

for i, book in enumerate(sec595):
    for j, page in enumerate(book):
        sec595[i][j].page_content = page.page_content.split('.\n')


# I wanted to concatenate some stuff 
# where the section was too short
# or too long!

```py
def process_page_content(page):
    # Use a temporary list to hold new or modified content to avoid frequent list modifications
    new_content = []
    for k, sect in enumerate(page.page_content):
        if len(sect) > 1000:
            # Efficiently split by sentences and divide
            split = sect.split('. ')
            half = len(split) // 2
            sect1 = '. '.join(split[:half])
            sect2 = '. '.join(split[half:])
            new_content.append(sect1)
            new_content.append(sect2)
        elif len(sect) < 75 and k > 0:
            # Instead of removing, just append to the previous section if possible
            new_content[-1] += ' ' + sect
        else:
            new_content.append(sect)
    return new_content

def process_book(book):
    for j, page in enumerate(book):
        book[j].page_content = process_page_content(page)
    return book
    ```
           

In [8]:
from multiprocessing import Pool
import importlib
import workers
importlib.reload(workers)
import workers


def apply_parallel(data):
    # Use all available cores
    pool = Pool(processes=multiprocessing.cpu_count()-1)
    result = pool.map(workers.process_book, data)
    pool.close()
    pool.join()
    return result

if __name__ ==  '__main__': 
    processed_data = apply_parallel(sec595)

processed_data[0]

[Document(page_content=['This course was conceived and authored by David Hoelzer. David is the COO of Enclave Forensics,\nInc., a managed security monitoring company. He also serves as Dean of Faculty for the SANS\nTechnology Institute and a Faculty Fellow for The SANS Institute', 'David has been working in the IT and Information Security fields since the late 1980s. In addition to\ndaily work in network monitoring, analysis, and secure development, he leads the machine learning\ninitiatives within Enclave. His particular area of focus is supervised learning solutions for real-time\nmonitoring and classification of enterprise network activities '], metadata={'source': 'decrypted_SEC595 - Book 1_2036060.pdf', 'page': 4}),
 Document(page_content=['Introduction\nThis course is broken down into six major sections, each of which corresponds to the coursebook for\nthat day. When undertaking this course, please have in mind that we view the first two sections of\nmaterial as important foundat

In [29]:
# get average length of the pages for each book
avg_page_lengths = []

for page in processed_data[7]:
    avg_page_lengths.append(sum([len(pg) for pg in page.page_content]) / len(page.page_content))

avg_page_lengths[0:10]


[373.4,
 311.7142857142857,
 1673.0,
 145.3846153846154,
 338.25,
 593.3333333333334,
 472.0,
 297.4,
 380.5,
 367.2857142857143]

In [10]:
"""Loader that loads data from JSON."""
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union, Any

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class JSONLoader(BaseLoader):
    def __init__(
        self,
        file_path: Union[str, Path],
        content_key: Optional[str] = None,
        metadata_func: Optional[Callable[[Dict, Dict], Dict]] = None,
        text_content: bool = True,
        json_lines: bool = False,
    ):
        """
        Initializes the JSONLoader with a file path, an optional content key to extract specific content,
        and an optional metadata function to extract metadata from each record.
        """
        self.file_path = Path(file_path).resolve()
        self._content_key = content_key
        self._metadata_func = metadata_func
        self._text_content = text_content
        self._json_lines = json_lines

    def load(self) -> List[Document]:
        """Load and return documents from the JSON file."""
        docs: List[Document] = []
        if self._json_lines:
            with self.file_path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        self._parse(line, docs)
        else:
            self._parse(self.file_path.read_text(encoding="utf-8"), docs)
        return docs

    def _parse(self, content: str, docs: List[Document]) -> None:
        """Convert given content to documents."""
        data = json.loads(content)

        # Perform some validation
        # This is not a perfect validation, but it should catch most cases
        # and prevent the user from getting a cryptic error later on.
        if self._content_key is not None:
            self._validate_content_key(data)
        if self._metadata_func is not None:
            self._validate_metadata_func(data)

        for i, sample in enumerate(data, len(docs) + 1):
            text = self._get_text(sample=sample)
            metadata = self._get_metadata(sample=sample, source=str(self.file_path), seq_num=i)
            docs.append(Document(page_content=text, metadata=metadata))

    def _get_text(self, sample: Any) -> str:
        """Convert sample to string format"""
        if self._content_key is not None:
            content = sample.get(self._content_key)
        else:
            content = sample

        if self._text_content and not isinstance(content, str):
            raise ValueError(
                f"Expected page_content is string, got {type(content)} instead. \
                    Set `text_content=False` if the desired input for \
                    `page_content` is not a string"
            )

        # In case the text is None, set it to an empty string
        elif isinstance(content, str):
            return content
        elif isinstance(content, dict):
            return json.dumps(content) if content else ""
        else:
            return str(content) if content is not None else ""

    def _get_metadata(self, sample: Dict[str, Any], **additional_fields: Any) -> Dict[str, Any]:
        """
        Return a metadata dictionary base on the existence of metadata_func
        :param sample: single data payload
        :param additional_fields: key-word arguments to be added as metadata values
        :return:
        """
        if self._metadata_func is not None:
            return self._metadata_func(sample, additional_fields)
        else:
            return additional_fields

    def _validate_content_key(self, data: Any) -> None:
        """Check if a content key is valid"""
        sample = data.first()
        if not isinstance(sample, dict):
            raise ValueError(
                f"Expected the jq schema to result in a list of objects (dict), \
                    so sample must be a dict but got `{type(sample)}`"
            )

        if sample.get(self._content_key) is None:
            raise ValueError(
                f"Expected the jq schema to result in a list of objects (dict) \
                    with the key `{self._content_key}`"
            )

    def _validate_metadata_func(self, data: Any) -> None:
        """Check if the metadata_func output is valid"""

        sample = data.first()
        if self._metadata_func is not None:
            sample_metadata = self._metadata_func(sample, {})
            if not isinstance(sample_metadata, dict):
                raise ValueError(
                    f"Expected the metadata_func to return a dict but got \
                        `{type(sample_metadata)}`"
                )



In [None]:
docs = []
for i, books in enumerate(processed_data):
    for j, page in enumerate(books):
        for pg in page.page_content:
            docs.append({
                "page_content": [pg],
                "metadata": {
                    "source": page.metadata['source'],
                    "page": page.metadata['page']
                    }
            })
with open('docs.json', 'w') as f:
    json.dump(docs, f)


In [14]:

loader = JSONLoader(
    file_path='docs.json',
    text_content=False)

data = loader.load()

In [15]:
len(data)

3708

In [11]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

ELASTIC_URL = getpass("Elastic URL")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")



In [12]:
vector_store = ElasticsearchStore(
    es_url=ELASTIC_URL,
    es_api_key=ELASTIC_API_KEY,
    index_name="sec595",
)

In [16]:

documents = vector_store.from_documents(
data,
es_url=ELASTIC_URL,
es_api_key=ELASTIC_API_KEY,
index_name="sec595",
strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(
    model_id=".elser_model_2_linux-x86_64"
),
bulk_kwargs={
    "request_timeout": 60,
},
)


In [19]:
def showResults(output):
    print("Total results: ", len(output))
    for index in range(len(output)):
        print(output[index])

In [20]:

documents.client.indices.refresh(index="sec595")

results = documents.similarity_search(
    "What is Bag of Words?", k=4, strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(
    model_id=".elser_model_2_linux-x86_64"
)
)
showResults(results)

Total results:  4
page_content='{"page_content": ["Previously, we used the Bag of W ords approach, generating a multi-hot encoded vector indicating\\nwhich words were present in a given text. One of the major limitations of this approach was that\\nwe only know that a word was present, not the order that the words appeared in. If you consider\\nthe two following sentences, you will appreciate why this is such a big problem:\\nYou did understand\\nDid you understand\\nBoth of these sentences contain the same words and they would be encoded identically under Bag\\nof Words. Bag of Words also doesn\\u2019t preserve the number of times any given word appears in a\\npiece of text"], "metadata": {"source": "decrypted_SEC595 - Workbook 2_2036060.pdf", "page": 213}}' metadata={'source': 'C:\\Users\\nc\\Desktop\\projects\\sec595\\docs.json', 'seq_num': 3245}
page_content='{"page_content": ["Optimization\\nStill, there are some really important lessons. One of the most important is that when the

# Making some adjustments and indexing this way in v2

In [22]:
metadata = []
content = []

for books in processed_data:
    for page in books:
        metadata.append(page.metadata)
        content.append(page.page_content)
    

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
docs = text_splitter.create_documents(content[0], metadatas=metadata)

In [None]:
aa

# Build the baseline index

In [None]:
idx = []
for index in indexes:
    lines = index.page_content.split('\n')
    records = []
    i = 0
    j = 0
    for line in lines[1:]:
        if not re.match(r'^\d+', line):
            records.append(line)
            txt = line.split(',')
            idx.append({'topic': txt[0], 'book': index.metadata['source'], 'pages': [tx.strip() for tx in txt[1:]]})
            topic = txt[0]
            j = int(i)
            i += 1
        else:
            if j < len(records):
                records[j] = records[j] + ' ' + line
                txt = line.split(', ')
                idx[j]['pages'] = idx[j]['pages'] + txt[:]
                i += 1


In [None]:
import pandas as pd

# This function is like 99% there. I broke my brain trying to handle the edge cases.
# Since there are two columns, it is sometimes difficult to know when to start a new topic.

df = pd.DataFrame(idx)
df = df.explode('pages')
# group by topic and book
df = df.groupby(['topic', 'book'])['pages'].apply(list).reset_index()
df['book'] = df['book'].apply(lambda x: x.split(' - ')[1].split('_')[0])

for i, row in df.iterrows():
    for j, page in enumerate(row['pages']):
        if isinstance(page, str):
            if re.search(r'[a-zA-Z]', page):
                if re.search(r'[0-9]', page):
                    # Get topic for new row
                    newtopic = re.match(r"\d{1,3}(.*)", page).group(1)
                    # Adjust value of current page to only contain the digits
                    row['pages'][j] = re.match(r"(\d{1,3}).*", page).group(0)
                    newpages = row['pages'][j+1:]
                    row['pages'] = row['pages'][:j-1]
                    print(f'topic: {newtopic} pages {str(newpages)} row: {row["topic"]}')
                    dfx = pd.DataFrame({'topic': newtopic, 'book': [row['book']], 'pages': [newpages]})
                    df = pd.concat([df, dfx], axis=0)
                    
# sort df by book and  topic
df = df.sort_values(by=['book', 'topic']).reset_index(drop=True)
# drop rows with null page
df = df.dropna(subset=['pages'])
# aggregate by topic & book then deduplicate the items in pages list
df = df.groupby(['topic', 'book'])['pages'].apply(lambda x: list(set([item for sublist in x for item in sublist]))).reset_index()
# sort the values in each row for pages even if its a str
df['pages'] = df['pages'].apply(lambda x: sorted(x, key=lambda y: str(y)))
# get rid of empty values in the pages list
df['pages'] = df['pages'].apply(lambda x: [i for i in x if i])
df.head()

topic: Epsilon pages ['seeDBSCAN'] row: Elbow Method
topic: latentspace pages ['22'] row: Huberloss
topic: Leaky Rectified Linear Unit pages ['52', '53'] row: IMDB dataset
topic: linear regression pages ['6'] row: IMDB dataset
topic: normal forms pages ['63', '74', '89', '90', '94', '95'] row: MongoEngine
topic: NumPy pages ['3', '7'] row: NetFlow


Unnamed: 0,topic,book,pages
0,Anaconda,Book 1,"[10, 11, 47, 74, 8]"
1,BackBlaze,Book 2,[28]
2,Bag of Words,Book 4,[45]
3,Bayes Theorem,Book 2,[56–58]
4,Bayesian,Book 4,[45]


# Let's do Elastic Stuff
We should probably slice the pages up a little  bit more, just in case

# Generating visualizations

In [None]:
from wordcloud import WordCloud

def generate_word_cloud(text):
    wordcloud = WordCloud(width = 800, height = 400, background_color ='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

# Example usage
for content_text in contents:
    generate_word_cloud(' '.join([page.page_content for page in content_text]))

# Clean it better

```python
import unicodedata

def display_comparison(original, cleaned):
    fig, ax = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)
    ax[0].text(0.5, 0.5, original, ha='center', va='center', fontsize=12, wrap=True)
    ax[0].set_title('Original Text')
    ax[0].axis('off')
    
    ax[1].text(0.5, 0.5, cleaned, ha='center', va='center', fontsize=12, wrap=True)
    ax[1].set_title('Cleaned Text')
    ax[1].axis('off')
    plt.rcParams['font.family'] = 'DejaVu Sans' 
    plt.show()

# Example usage with dummy text
clean = []
for pdf in pdfs:
    loader = PyPDFLoader(pdf)
    pages = loader.load_and_split()
    original = [' '.join([page.page_content for page in pages])]
    original = unicodedata.normalize('NFKD', str(original))
    cleaned_text = [' '.join([clean_document(page.page_content) for page in pages])]
    cleaned_text = unicodedata.normalize('NFKD', str(cleaned_text))

    display_comparison(original, cleaned_text)
```

# Facilitating the review process

In [None]:
# Preparation & Parsing of the PDF Files

 