Python project that converts tables inside PDFs to CSV for convenient data manipulation. It includes logging and exception handling.

monambike/pdfconverter-pdftables-to-csv

PDFConverter - Script

PDFConverter is a Python script, meant to be packaged as an executable file, that quickly reads a large number of tables from PDF files and converts them to CSV without requiring extensive user interaction.

You can also check the docs branch or the desktop application used to test calls to this script.

Example of Script call:

python pdfconverter.py --ImportPath "C:\\users\\dvp10\\desktop\\EDITAL (2).pdf" --ExportPath "C:\\users\\dvp10\\desktop" --PageNumber "all"
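
A minimal sketch of how these three arguments could be read with argparse. The flag names come from the call above; whether each flag is required, the default for --PageNumber, and the help texts are assumptions, not the script's confirmed behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the example call."""
    parser = argparse.ArgumentParser(prog="pdfconverter.py")
    parser.add_argument("--ImportPath", required=True, help="PDF file to read")
    parser.add_argument("--ExportPath", required=True, help="folder that receives the CSV files")
    # Assumption: "all" is the default page selection.
    parser.add_argument("--PageNumber", default="all", help='page selection, e.g. "all"')
    return parser


# Parse the same values as the example call (raw strings keep the backslashes).
args = build_parser().parse_args([
    "--ImportPath", r"C:\users\dvp10\desktop\EDITAL (2).pdf",
    "--ExportPath", r"C:\users\dvp10\desktop",
    "--PageNumber", "all",
])
print(args.ImportPath, args.ExportPath, args.PageNumber)
```

argparse derives the attribute name from the flag, so --ImportPath is read back as args.ImportPath.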

Project Structure

(Image: project structure)

Contact

You can find me on LinkedIn at linkedin.com/in/monambike/. For videos about my work, see my YouTube channel at youtube.com/@monambike_portfolio, and for my artwork, see my Instagram at instagram.com/monambike_portfolio.

License

The license for this repository is available here. Please refer to the provided link for detailed information regarding the terms and conditions governing the use of this project.

Table of Contents

Libraries

List of libraries used for the development of the Python script:

  • Pandas, for text conversion and DataFrame manipulation;
  • Tabula, for reading PDF files;
  • Other modules from the Python standard library were also used, such as glob for retrieving only PDF files, os for system operations, and argparse for receiving and handling command-line arguments, among others.
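
As an illustration of the glob usage mentioned in the list, the following sketch collects only the PDF files of a folder. The function name list_pdf_files and the choice to search a single folder without recursion are assumptions. The table extraction itself, done with Tabula, is not shown.

```python
import glob
import os


def list_pdf_files(folder: str) -> list[str]:
    """Return the PDF files found directly inside `folder`, sorted by name."""
    # The "*.pdf" pattern filters out every other file type.
    return sorted(glob.glob(os.path.join(folder, "*.pdf")))
```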

Formatting

This section lists the formatting steps and the files to which they are applied. Wherever an export is shown (as an EXPORT table), the file is written at that point with every formatting step listed above it already applied.
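
The idea of steps interleaved with exports can be sketched as a small pipeline. This is an illustration only: run_pipeline and the two stand-in steps (str.strip, str.upper) are hypothetical, and the snapshots are kept in memory instead of being written to the export folders.

```python
from typing import Callable, Union

Step = Callable[[str], str]


def run_pipeline(text: str, stages: list[Union[Step, str]]) -> dict[str, str]:
    """Apply the steps in order; a string entry marks an export checkpoint."""
    exports: dict[str, str] = {}
    for stage in stages:
        if isinstance(stage, str):
            # Export: snapshot the text with every step above already applied.
            exports[stage] = text
        else:
            text = stage(text)
    return exports


# Hypothetical steps standing in for the real formatting rules.
stages = ["withoutFormatting", str.strip, "main", str.upper, "fullClear"]
exports = run_pipeline("  a;b  ", stages)
```

Each export name maps to the text as it stood at that checkpoint, so a later export always includes the effect of all earlier steps.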

File Read Handling

Formatting related to reading.

Remove Double Quotes

Removes all double quotes from the DataFrame so that they cannot be confused with the quotes that delimit each field in the exported file.

Replace Semicolon

Replaces all semicolons in the DataFrame with commas to avoid conflicts with the semicolon used as the field separator.

Delete Empty Lines

Deletes all empty rows in the DataFrame.

Delete Empty Columns

Deletes all empty columns in the DataFrame.

Convert Header to Body

Converts the header into an ordinary body row to remove unnecessary and detrimental header formatting.

Remove Line Breaks

Removes line breaks that occur when the PDF has a very long line.
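
The six read-handling steps could be sketched with Pandas as follows, in the order listed above. This is a sketch under assumptions: clean_table and map_text are hypothetical names, empty cells are assumed to be missing values (NaN/None) and not empty strings, and a line break inside a cell is replaced by a single space.

```python
import pandas as pd


def map_text(df: pd.DataFrame, func) -> pd.DataFrame:
    """Apply func to every string cell, leaving missing values untouched."""
    return df.apply(lambda col: col.map(lambda v: func(v) if isinstance(v, str) else v))


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    df = map_text(df, lambda s: s.replace('"', ""))        # Remove Double Quotes
    df = map_text(df, lambda s: s.replace(";", ","))       # Replace Semicolon
    df = df.dropna(axis=0, how="all")                      # Delete Empty Lines
    df = df.dropna(axis=1, how="all")                      # Delete Empty Columns
    # Convert Header to Body: the column labels become the first data row.
    header = pd.DataFrame([list(df.columns)], columns=df.columns)
    df = pd.concat([header, df], ignore_index=True)
    df.columns = range(df.shape[1])
    df = map_text(df, lambda s: " ".join(s.splitlines()))  # Remove Line Breaks
    return df
```

Dropping the empty columns before moving the header into the body matters here: once the header is an ordinary row, a column that only holds its label would no longer count as empty.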

Conversion File Handling

Formatting related to conversion.

Export [withoutFormatting]

Starts the first export, which is the export of the unformatted file that will be formatted later.

EXPORT
Folder Name: withoutFormatting
Folder Path: (lattice/stream) + "\\withoutFormatting"
Description: The 'withoutFormatting' file is exported at this moment without any formatting.

Empty Data in Header

Removes empty data in the header.
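
A minimal sketch of this rule on a single header line. It assumes that the empty header entries always have the form "Unnamed: N" and that no field contains a semicolon (the Replace Semicolon step makes that plausible); drop_unnamed_headers is a hypothetical name.

```python
import re

# Matches one quoted placeholder name such as "Unnamed: 0".
UNNAMED = re.compile(r'"Unnamed: \d+"')


def drop_unnamed_headers(header_line: str, sep: str = ";") -> str:
    """Remove placeholder fields from a header line and rejoin the rest."""
    fields = header_line.split(sep)
    kept = [field for field in fields if not UNNAMED.fullmatch(field.strip())]
    return sep.join(kept)
```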

If it is:

"<data>";"Unnamed: 0";"<data>"

It becomes:

"<data>";"<data>"

Line Breaks in the Middle of Data

Removes line breaks if they occur in the middle of the data.
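
One way to tell a line break in the middle of the data from a real end of row is to track whether the current position is inside a quoted field. The sketch below takes that approach; it assumes double quotes only ever delimit fields (see Remove Double Quotes), and join_broken_fields is a hypothetical name.

```python
def join_broken_fields(text: str) -> str:
    """Replace line breaks that fall inside a quoted field with a space."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "\n" and in_quotes:
            out.append(" ")  # break in the middle of the data
        else:
            out.append(ch)   # everything else is kept as-is
    return "".join(out)
```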

If it is:

"<data
data>"

It becomes:

"<data data>"

Semicolon at the End of the Line

Removes a semicolon ';' if it is at the end of the line.

If it is:

"<data>";"<data>";

It becomes:

"<data>";"<data>"

Space at the Beginning of the Line

Removes leading spaces in the lines.
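
This rule and the previous one (Semicolon at the End of the Line) both act on line boundaries, so they can be sketched with two multiline regular expressions. The function names are hypothetical, and the sketch assumes lines are separated by '\n'.

```python
import re


def strip_trailing_semicolon(text: str) -> str:
    # With re.MULTILINE, "$" matches at the end of every line.
    return re.sub(r";$", "", text, flags=re.MULTILINE)


def strip_leading_spaces(text: str) -> str:
    # With re.MULTILINE, "^" matches at the start of every line.
    return re.sub(r"^ +", "", text, flags=re.MULTILINE)
```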

If it is:

"<data>";"<data>"
 "<data>";"<data>"
"<data>";"<data>"

It becomes:

"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>"

Quotes and One Column (First Check)

Removes the line if it both starts and ends with a quote and has at most one column.
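
A sketch of this check, assuming columns are counted by splitting the line on the ';' separator; drop_single_column_lines is a hypothetical name.

```python
def drop_single_column_lines(text: str, sep: str = ";") -> str:
    """Drop lines that are fully quoted but hold at most one column."""
    kept = []
    for line in text.split("\n"):
        fully_quoted = len(line) >= 2 and line.startswith('"') and line.endswith('"')
        single_column = len(line.split(sep)) <= 1
        if fully_quoted and single_column:
            continue  # e.g. a lone "<data>" line
        kept.append(line)
    return "\n".join(kept)
```

Lines that carry only half of a quoted value, such as a line that opens a quote without closing it, fail the fully quoted test and are therefore kept.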

If it is:

"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"

It becomes:

"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"

Export [tableWithBlankCells]

Starts the export of the file that covers the exceptional case of converting a table that has empty cells.

EXPORT
Folder Name: tableWithBlankCells
Folder Path: (lattice/stream) + "\\tableWithBlankCells"
Description: The file 'tableWithBlankCells' is exported at this moment with all the formatting applied above.

Empty Data

Removes empty data, that is, every occurrence of ""; and ;"".
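
This rule could be sketched by splitting each line on the separator and discarding the fields that are exactly "". The name drop_empty_fields is hypothetical, and the sketch assumes no field contains a semicolon.

```python
def drop_empty_fields(text: str, sep: str = ";") -> str:
    """Remove fields that consist only of two double quotes."""
    cleaned = []
    for line in text.split("\n"):
        fields = [field for field in line.split(sep) if field != '""']
        cleaned.append(sep.join(fields))
    return "\n".join(cleaned)
```

Removing a field shifts the remaining fields of that line to the left, which is presumably why the tableWithBlankCells export is taken before this step.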

If it is:

"";"<data>";"<data>";"<data>"
"<data>";"<data>";"";"<data>"
"<data>";"<data>";"<data>";""

It becomes:

"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"

Adjacent Double Quotes

Inserts a line break if there are double quotes side by side.

If it is:

"<data>";"<data>""<data>";"<data>"

It becomes:

"<data>";"<data>"
"<data>";"<data>"

Space After a Separator

If there is a semicolon followed by a space, it is replaced by a line break.
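
This rule and the previous one (Adjacent Double Quotes) both split one physical line into two rows, and each can be sketched as a plain string replacement. The function names are hypothetical. The first sketch assumes empty fields ("") were already removed by the Empty Data step, otherwise they would be split as well.

```python
def split_adjacent_quotes(text: str) -> str:
    # Two quotes side by side mark the end of one row and the start of the next.
    return text.replace('""', '"\n"')


def split_after_spaced_separator(text: str) -> str:
    # A semicolon followed by a space is treated as a row boundary.
    return text.replace("; ", "\n")
```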

If it is:

"<Lorem ipsum>";"<Lorem ipsum>"; "<Lorem ipsum>";"<Lorem ipsum>"

It becomes:

"<Lorem ipsum>";"<Lorem ipsum>"
"<Lorem ipsum>";"<Lorem ipsum>"

Space Between Separators and Double Quotes

If there is a space between a separator and the quote that follows it, removes everything before that quote.
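
A sketch of this rule in isolation, for a single line; drop_before_spaced_separator is a hypothetical name. The pattern it looks for (a semicolon, a space, then a quote) is also covered by the Space After a Separator rule, so in a combined pipeline only the rule that runs first would find anything to change.

```python
def drop_before_spaced_separator(line: str) -> str:
    """Keep only what follows the last '; "' sequence, starting at the quote."""
    marker = '; "'
    index = line.rfind(marker)
    if index == -1:
        return line  # no spaced separator, nothing to remove
    # Skip the semicolon and the space, keep the opening quote.
    return line[index + len(marker) - 1:]
```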

If it is:

"<Lorem ipsum>";"<Lorem ipsum>"; "<data>";"<data>"

It becomes:

"<data>";"<data>"

Quotes and One Column (Second Check)

Removes the line if it both starts and ends with a quote and has at most one column.

If it is:

"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"

It becomes:

"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"

Export [main]

Starts the export of the main file.

EXPORT
Folder Name: main
Folder Path: (lattice/stream) + "\\main"
Description: The file 'main' is exported at this moment with all the formatting applied above.

Quotes at the Beginning

Deletes the line if it doesn't start with quotes.

If it is:

"<data>";"<data>";"<data>"
<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"

It becomes:

"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"

Quotes at the End

Deletes the line if it doesn't end with quotes.
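
Together with the previous rule (Quotes at the Beginning), this amounts to keeping only the lines that are quoted at both ends. The sketch below merges the two checks into one pass, which is a simplification of the two separate steps; keep_fully_quoted_lines is a hypothetical name.

```python
def keep_fully_quoted_lines(text: str) -> str:
    """Keep only the lines that start and end with a double quote."""
    kept = [
        line for line in text.split("\n")
        if line.startswith('"') and line.endswith('"')
    ]
    return "\n".join(kept)
```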

If it is:

"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>
"<data>";"<data>";"<data>"

It becomes:

"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"

Empty Lines or Without Quotes (Second Check)

Deletes lines that are empty (containing only a line break '\n') or that do not contain a double quote anywhere.
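
Since an empty line contains no double quote either, both conditions can be sketched with a single test; drop_unquoted_lines is a hypothetical name.

```python
def drop_unquoted_lines(text: str) -> str:
    """Keep only the lines that contain at least one double quote."""
    return "\n".join(line for line in text.split("\n") if '"' in line)
```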

If it is:




Lorem
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
Lorem ipsum

"<data>";"<data>"

It becomes:

"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
"<data>";"<data>"

Three Columns

Writes the line only if it has at least three columns.
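
A sketch of this filter, assuming columns are counted by splitting on the ';' separator; keep_wide_lines and its min_columns parameter are hypothetical.

```python
def keep_wide_lines(text: str, min_columns: int = 3, sep: str = ";") -> str:
    """Keep only the lines that have at least `min_columns` columns."""
    return "\n".join(
        line for line in text.split("\n")
        if len(line.split(sep)) >= min_columns
    )
```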

If it is:

"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"

It becomes:

"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"

Export [fullClear]

Starts the export of the main file with some stricter formatting modifications.

EXPORT
Folder Name: fullClear
Folder Path: (lattice/stream) + "\\fullClear"
Description: The file 'fullClear' is exported at this moment with all the formatting applied above.