EPA Environmental Impact Statement Scrapers

These are the files I used to scrape the EPA's Environmental Impact Statement (EIS) site.

There are a lot of files, so here is what each one does.

Files

server_scripts Folder

check_for_new_reports.py - Checks the EPA EIS website for new reports. Gets a list of links to all reports, then cross-checks it against the MongoDB to see which reports are missing. It then saves those reports into the MongoDB, downloads the PDF files and stores them in an Amazon S3 bucket.
convert_pdfs_to_reports.py - Tries to convert to a txt file any PDF in the Amazon S3 bucket that has not been converted yet.
index_missing_files.py - Takes all missing reports and indexes them into ElasticSearch.
create_full_csv_of_reports.py - Takes all links from the EPA EIS page, scrapes all the info (whether it's already stored or not) and stores it in a CSV.
fresh_reindex.py - Takes the information in the MongoDB and the files in the S3 bucket and re-indexes them all on ElasticSearch.
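
The cross-check in check_for_new_reports.py comes down to a set difference between the scraped links and the links already stored. The following is only a minimal sketch of that idea, not the script's actual code: the function name and the sample links are hypothetical, and in practice the two lists would come from the EPA EIS page and from a MongoDB query.

```python
def find_missing_reports(scraped_links, stored_links):
    """Return scraped links that are not stored yet, in scrape order."""
    stored = set(stored_links)
    seen = set()
    missing = []
    for link in scraped_links:
        # Skip links already in the database and repeats within the scrape.
        if link not in stored and link not in seen:
            seen.add(link)
            missing.append(link)
    return missing


# Hypothetical sample values, for illustration only.
scraped = ["report/1", "report/2", "report/3", "report/2"]
stored = ["report/1"]
print(find_missing_reports(scraped, stored))  # ['report/2', 'report/3']
```

Only the links returned here would then need to be scraped, saved to the MongoDB and downloaded to S3.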

**IPython Notebook Files**

scrape_links_from_EPA.ipynb - Gets the title and link of each report from the main list of reports on the EPA EIS page. These links might expire, but running this notebook to gather new links takes less than a minute.
scrape_info_from_links.ipynb - Takes each link scraped by the previous notebook and gathers all the info on the report from the table shown on its page. These reports may contain report files and comment letter files.
get_file_metadata.ipynb - For each report file link gathered, makes an HTTP request for just the headers (it does not download the file).
get_comment_letter_metadata.ipynb - Same as above, but for comment letters.
save_files_to_s3.ipynb - Saves report files to an S3 bucket named 'epaeis'.
save_comment_letters_s3.ipynb - Saves comment letters to the same S3 bucket.
update_documents.ipynb - This notebook contains most of the code that became the scripts in the server_scripts folder.
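
A headers-only request like the one in get_file_metadata.ipynb can be made with an HTTP HEAD request. The sketch below uses only the standard library and is an assumption about how this might look, not the notebook's actual code. The function names are hypothetical; the dictionary keys follow the columns of file_metadata.csv.

```python
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def build_head_request(file_url):
    """Build a request that asks the server for headers only."""
    return Request(file_url, method="HEAD")


def headers_to_row(headers, file_url, eis_url, eis_number, retrieved=None):
    """Turn response headers into one row of file metadata."""
    retrieved = retrieved or datetime.now(timezone.utc)
    return {
        "content_length": headers.get("Content-Length"),
        "last_modified": headers.get("Last-Modified"),
        "date_retrieved": retrieved.isoformat(),
        "content_type": headers.get("Content-Type"),
        "file_url": file_url,
        "eis_url": eis_url,
        "eis_number": eis_number,
    }


def fetch_file_metadata(file_url, eis_url, eis_number, timeout=30):
    """Send the HEAD request and return the metadata row."""
    with urlopen(build_head_request(file_url), timeout=timeout) as response:
        return headers_to_row(response.headers, file_url, eis_url, eis_number)
```

Because no file body is transferred, this stays fast even for large PDFs.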

Python Files
Sometimes the scripts make time-intensive requests (like saving files to S3), so some were copied into Python files, executed from an EC2 instance and left running overnight.

get_file_metadata.py - Does the same thing as get_file_metadata.ipynb. This is quick, so there is no need to run it from a server; it can be run from the IPython Notebook.
get_files.py - Downloads the report PDFs and saves them to an S3 bucket. Takes several hours to complete.
get_comment_letters.py - The same, but for comment letter PDFs.

CSVs

eis_links.csv - The output of scrape_links_from_EPA.ipynb. Contains:
- date, agency, state, document_type, title, report_link
reports.csv - The output of scrape_info_from_links.ipynb. Contains:
- date, agency, state, document_type, title, report_link, eis_number, federal_register_date, contact_name, comment_due_review_date, contact_phone, amended_notice_date, amended_notice, supplemental_info, website, comment_letter_date, rating, num_comment_letter, comment_letter_links, num_files, list_of_links
file_metadata.csv - The output of get_file_metadata.ipynb. Contains:
- content_length, last_modified, date_retrieved, content_type, file_url, eis_url, eis_number
comment_letters_metadata.csv - The output of get_comment_letter_metadata.ipynb. Contains:
- content_length, last_modified, date_retrieved, content_type, file_url, eis_url, eis_number
reports_excel.csv - reports.csv modified in Excel to remove a few duplicate entries from the EPA site. The duplicate rows were not identical, so the data from each had to be manually merged into one row. There were about 10 duplicates.
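
The duplicates in reports.csv were merged by hand in Excel. For reference, a rough sketch of how a similar merge could be scripted is shown below. It assumes that duplicate rows share the same eis_number, which the original does not confirm, and the function name is hypothetical. Where two rows disagree on a non-empty value, it keeps the first one, so a manual review would still be needed.

```python
def merge_duplicate_rows(rows, key="eis_number"):
    """Merge rows with the same key, filling empty cells from later rows."""
    merged = {}
    for row in rows:
        # The first row seen for a key becomes the base of the merged row.
        target = merged.setdefault(row[key], dict(row))
        for column, value in row.items():
            # Fill only cells that are still empty; never overwrite data.
            if not target.get(column) and value:
                target[column] = value
    return list(merged.values())
```

The rows can be read with csv.DictReader and written back with csv.DictWriter from the standard library.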

