# PDF Parser

***

Takes a PDF document, extracts its text, and filters that text down to the necessary information relating to the requirements in the document. It then formats the output and creates an Excel file to store the information.

## Background:
1. There are two types of cells in this notebook: text cells that provide information and code cells that perform their designated task when run.
2. To run a cell, press the play button or click on the cell and press shift + enter.
3. Commands that start with '!' are executed in the terminal.
4. To run the cells you also need to select a kernel. Any available option works; if no options are listed, you may need to download the necessary extensions.

## Step 1
***

Install pip:

In [None]:
!curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
!python3 get-pip.py
!where pip

Set up a virtual environment and add libraries (to make sure the libraries are installed in the directory that our project is in):

In [None]:
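# Each "!" command runs in its own subshell, so "!cd" does not persist; "%cd" does.
%cd Desktop
!mkdir PDF_Parser
%cd PDF_Parser
!python3 -m venv Test-venv
# Activating with "!source" would not carry over to later commands either,
# so call the environment's executables by path (for example Test-venv/bin/pip).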

Use: !where pip (to identify where you want to install Jupyter) <br>
Take the desired path and append: install jupyter <br>
Ex: *!/Users/jamesrivera/Desktop/PDF_Parser/Test-venv/bin/pip install jupyter*
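To confirm that the selected kernel is actually running from the virtual environment, a quick check like the one below can help. This is a minimal sketch using only the standard library; `in_virtualenv` is a hypothetical helper name and is not part of the original notebook.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv, while sys.base_prefix
    # points at the Python installation the venv was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If the interpreter path does not point inside Test-venv, the kernel is using a different Python than the one the libraries were installed into.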

To launch VS Code: <br>
    !code .

To launch Jupyter Notebook: <br>
!jupyter notebook

To launch Jupyter Notebook from VS Code: <br>
shift + command + p (opens the Command Palette, where the Jupyter commands can be searched for)

Install necessary Python libraries:
1. Create a requirements.txt file to list the libraries to install:


In [None]:
!touch requirements.txt

Add the Python libraries to requirements.txt:
1. pdfplumber
2. pandas
3. openpyxl
4. xlsxwriter
5. requests



Install libraries:

In [None]:
!pip install -r requirements.txt
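After installing, it can be useful to check that every library imported in Step 2 is visible to the current kernel. The following is a minimal sketch, assuming the import names match the package names listed here; `missing_packages` and `REQUIRED` are hypothetical names introduced for illustration.

```python
from importlib.util import find_spec

# Third-party libraries imported in Step 2 (assumed import names).
REQUIRED = ["pdfplumber", "pandas", "openpyxl", "xlsxwriter", "requests"]


def missing_packages(names):
    """Return the names that cannot be imported by the current interpreter."""
    # find_spec returns None when a top-level module cannot be located.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed for this kernel:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Any name reported as missing was probably installed into a different environment than the one the kernel runs from.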

## Step 2
***

Import libraries:
1. **requests**: Allows us to send HTTP requests.
2. **pdfplumber**: Allows us to open PDF files and extract their text.
3. **re**: Allows us to use regular expressions.
4. **pandas as pd**: Allows us to use the pandas library, which we will refer to as "pd".
5. **from collections import namedtuple**: Allows us to create named tuples.
6. **openpyxl**: Allows us to take our pandas data frame and export it to Excel.
7. **xlsxwriter**: Allows us to format an Excel sheet in our program.

In [None]:
import requests
import pdfplumber
import re
import pandas as pd
from collections import namedtuple
import openpyxl
import xlsxwriter

## Step 3
***

Grab the title and put it in a data frame:

Locate the document that you want to parse and store its path in a variable:

In [None]:
doc = '/Users/jamesrivera/Downloads/34 21 70 Traction Power Facilities Installation Requirements (1).pdf'

Creating a named tuple:

In [None]:
doc_name = "Traction_Power_Facilities_Installation_Requirements"
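# Field names go in a list; 'section' 'title' would be joined into the single name 'sectiontitle'
TPFI_requirements = namedtuple(doc_name, ['section', 'title'])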
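A named tuple behaves like a regular tuple whose positions also have names. The sketch below uses a hypothetical `Requirement` type with illustrative values based on the document's file name. The field names are passed as a list; a single space-separated string such as 'section title' would work as well.

```python
from collections import namedtuple

# Hypothetical type for illustration: two named fields.
Requirement = namedtuple("Requirement", ["section", "title"])

row = Requirement(
    section="SECTION 34 21 70",
    title="TRACTION POWER FACILITIES INSTALLATION REQUIREMENTS",
)

print(row.section)          # access by field name
print(row[1])               # access by position, like a plain tuple
section, title = row        # unpacking works too
print(Requirement._fields)  # the field names the type was created with
```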

Open pdf with pdfplumber and extract the first page of text:

In [None]:
with pdfplumber.open(doc) as pdf:
    page1 = pdf.pages[0]  
    page1_text = page1.extract_text()
    print(page1_text)
    print("END OF PAGE ONE")

Create a pattern to grab the doc title:

In [None]:
pattern8 = re.compile(r'([A-Z]+\s\d+\s+\d+\s+\d+)([A-Z\s]{52})')
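To see what each group captures, the pattern can be tried on a short sample string. The sample below is an assumed layout of the first page, not the actual `extract_text()` output, which may differ. Under this assumption, group 1 captures the section number, and the fixed-width `{52}` in group 2 captures the line break plus the 51-character title, so a document with a different title length would need a different width.

```python
import re

pattern8 = re.compile(r'([A-Z]+\s\d+\s+\d+\s+\d+)([A-Z\s]{52})')

# Assumed first-page layout, for illustration only.
sample_page = (
    "SECTION 34 21 70\n"
    "TRACTION POWER FACILITIES INSTALLATION REQUIREMENTS\n"
    "PART 1 - GENERAL\n"
)

match = pattern8.search(sample_page)
if match:
    section = match.group(1)        # letters followed by three number groups
    title = match.group(2).strip()  # next 52 characters, whitespace trimmed
    print(section)
    print(title)
else:
    print("No title found; the pattern may need adjusting for this document.")
```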

Grab the title:

In [None]:

found_first_match = False
section_data = []

for match in pattern8.finditer(page1_text):
    if not found_first_match:
        section = match.group(1)
        title = match.group(2).strip()
        section_data.append((section, title))
        found_first_match = True

Create a data frame to store the title:

In [None]:
df1 = pd.DataFrame(section_data, columns=["Doc Section Number", "Doc Title"])

## Step 4
***

Grab the section titles and section numbers as well as the requirements in them:

In [None]:
pattern = r'(\d+\.\d+)\s+([A-Z\s]+)\s*(.*?)\s*(?=\d+\.\d+|\Z)'

In [None]:
all_sections = []
with pdfplumber.open(doc) as pdf:
    for page_number, page in enumerate(pdf.pages[2:], start=3):  # Start from page 3
        text = page.extract_text()

        # Find all matches in the text for the current page
        matches = re.findall(pattern, text, re.DOTALL)

        sections = []

        for match in matches:
            section_number = match[0]
            section_title = match[1]
            section_text = match[2].strip()

            # Skip sections with specific keywords in section_text
            if "BART FACILITIES STANDARDS" in section_text or "ISSUED: APRIL 2018 PAGE" in section_text:
                continue

            sections.append({
                "Section Number": section_number,
                "Section Title": section_title.strip(),
                "Section Text": section_text
            })

        # Append sections from the current page to the list
        all_sections.extend(sections)
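The per-page logic above can be tried without a PDF by applying the same pattern and footer filter to a plain string. This is a minimal sketch: `parse_sections`, `FOOTER_MARKERS`, and the sample text are hypothetical and only illustrate the technique. Because `[A-Z\s]+` also matches line breaks and capital letters, body text that begins with a capital letter (such as "A.") can be absorbed into the title, so the sample uses lowercase body text to keep the illustration simple.

```python
import re

pattern = r'(\d+\.\d+)\s+([A-Z\s]+)\s*(.*?)\s*(?=\d+\.\d+|\Z)'
FOOTER_MARKERS = ("BART FACILITIES STANDARDS", "ISSUED: APRIL 2018 PAGE")


def parse_sections(text):
    """Split one page of text into section number, title, and body."""
    sections = []
    for number, title, body in re.findall(pattern, text, re.DOTALL):
        body = body.strip()
        # Mirror the notebook's filter: drop any section whose text contains a footer marker.
        if any(marker in body for marker in FOOTER_MARKERS):
            continue
        sections.append({
            "Section Number": number,
            "Section Title": title.strip(),
            "Section Text": body,
        })
    return sections


# Made-up page text, for illustration only.
sample_text = (
    "1.01 SECTION INCLUDES\n"
    "the installation of equipment.\n"
    "1.02 RELATED SECTIONS\n"
    "see the general requirements.\n"
    "1.03 SUBMITTALS\n"
    "submit shop drawings.\n"
    "BART FACILITIES STANDARDS\n"
)

for section in parse_sections(sample_text):
    print(section)
```

In this sample the third section is skipped entirely because its text runs into the footer line, which shows that the filter removes real section content along with the footer.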


In [None]:
df2 = pd.DataFrame(all_sections)

## Step 5
***

Create an Excel file from the data frames:

In [None]:
excel_file = 'parsed.xlsx'

# Write both data frames to the same sheet: df1 starting at cell I1, df2 starting at cell C6
with pd.ExcelWriter(excel_file, engine='xlsxwriter') as writer:
    
    df1.to_excel(writer, sheet_name='Sheet1', startrow=0, startcol=8, index=False) 
    df2.to_excel(writer, sheet_name='Sheet1', startrow=5, startcol=2, index=False)
    worksheet = writer.sheets['Sheet1']
    worksheet.set_column('C:E', 15)
    worksheet.set_column('I:J', 15)
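The `startcol` values are zero-based column indexes, while `set_column` is given Excel column letters here. The sketch below shows how the two relate, using hypothetical helpers `column_letter` and `column_range` that are not part of pandas or xlsxwriter: the three columns of df2 starting at index 2 span C:E, and the two columns of df1 starting at index 8 span I:J.

```python
def column_letter(index):
    """Convert a zero-based column index to an Excel column letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1  # switch to one-based numbering for the base-26 conversion
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def column_range(startcol, width):
    """Return the Excel range covering `width` columns starting at `startcol`."""
    return f"{column_letter(startcol)}:{column_letter(startcol + width - 1)}"


print(column_range(2, 3))  # df2: Section Number, Section Title, Section Text
print(column_range(8, 2))  # df1: Doc Section Number, Doc Title
```

If either data frame gains a column or moves to a different `startcol`, the ranges passed to `set_column` need to change accordingly.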
