GitHub - plain-jane-gray/scraping-tables-from-PDF: Scrapes data tables from a PDF file.

Scraping data tables from a PDF file

The code in this repository scrapes data tables from a PDF file. Once extracted from the PDF file, the data is cleaned, analyzed, and mapped. The map makes the data easier for the user to understand.

This repository contains a single Jupyter notebook:

Scraping tables from a PDF file GH.ipynb.

Input: A single URL pointing to a PDF file on a publicly available website.

Output: Three data tables as pandas dataframes that can be exported.

The code does the following:

Reads in the names of the data tables in a PDF file. This allows the user to confirm that all data tables are being read.
Extracts data tables as lists from the PDF.
Accesses and flattens sublists.
Filters the data to separate the data tables, adds headings, and removes NaN values. This is repeated three times to separate out three different data tables.
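
The flatten, filter, and clean steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the README does not name the PDF library used, so the sketch starts from an already-extracted nested list, and the sample data, the table names, the headings, and the helper functions (`flatten`, `drop_missing`, `split_tables`) are all hypothetical.

```python
import math
from itertools import chain


def flatten(pages):
    """Flatten one level of nesting: a list of per-page row lists becomes one list of rows."""
    return list(chain.from_iterable(pages))


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(row):
    """Return the row without its missing cells."""
    return [value for value in row if not is_missing(value)]


def split_tables(rows, table_names):
    """Group cleaned rows under the most recent table-name row seen."""
    tables = {}
    current = None
    for row in rows:
        cleaned = drop_missing(row)
        if not cleaned:
            continue  # skip rows that were entirely empty
        if len(cleaned) == 1 and cleaned[0] in table_names:
            current = cleaned[0]  # a title row starts a new table
            tables[current] = []
        elif current is not None:
            tables[current].append(cleaned)
    return tables


# Hypothetical extraction result: one list of rows per PDF page.
pages = [
    [["Table A"], ["north", 10.0, float("nan")], ["south", 12.5]],
    [["Table B"], [None, "east", 7.0]],
]

rows = flatten(pages)
tables = split_tables(rows, {"Table A", "Table B"})

# Attach assumed headings to every row of every table.
headings = ["region", "value"]
records = {
    name: [dict(zip(headings, row)) for row in body]
    for name, body in tables.items()
}
```

Each entry in `tables` could then be turned into a dataframe with `pandas.DataFrame(body, columns=headings)` and exported, for example with `DataFrame.to_csv`.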

