Skip to content

Scrapes domains for broken links, emails & social media links (uses beautifulsoup)

License

Notifications You must be signed in to change notification settings

johncoleman83/domain_scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

59 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

domain_scraper

Scrapes domains from one input URL or from a file list of domains for broken links, valid emails and valid social media links.

Usage

  • Check URL's from text file to scrape for emails and social media links. This also checks common paths found from the input domain such as contact and team pages to add to the queue to new URL's to scrape for emails and social media links. This does not store broken links, but does output them to STDOUT during runtime. This does save to a file all valid & unique emails addresses and social media links during runtime so data is stored in the event of an error.

  • Scrape for emails & social media links, checking for promising new links to scrape

$ ./domain_scraper.py [INPUT FILE] --scrape-n
  • Same as above, but do not check for new links to add to the queue
$ ./domain_scraper.py [INPUT FILE] --scrape
  • To check for broken links only from all URLS from the same domain based off of one main input URL
$ ./domain_scraper.py --url [URL TO SCRAPE]
  • Check URL's for broken links from text file
$ ./domain_scraper.py [INPUT FILE] --check
  • extract name associations from email list (used with results from scraping)
$ ./domain_scraper.py [INPUT FILE] --extract

Data storage

Data is written to a file during runtime of the email and social media scraper.

  • Data is written to file at runtime.
  • Specific errors are not written to the file, but instead printed to STDOUT
  • Files are stored in path ./file_storage

Example file & file cleanup

how to cleanup a .csv file

$ cat example_file_bad_format.txt
https://google.com/^Mhttps://cecinestpasun.site/^Mhttps://google.com/^Mhttp://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png

# replace ^M character after copying from .csv file
$ tr '\r' '\n' < example_file_bad_format.txt > example_file.txt

# remove repeat links
$ awk '!seen[$0]++' example_file.txt > example_file_no_repeats.txt

$ cat example_file.txt
https://google.com/
https://cecinestpasun.site/
http://www.davidjohncoleman.com/wp-content/uploads/2017/06/headshot-retro.png

Author

Contributors

License

MIT License

About

Scrapes domains for broken links, emails & social media links (uses beautifulsoup)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages