Using Python to crawl webpages via a sitemap and check every page for broken links
- Reads the website's sitemap.xml and searches for the 'href' attribute to collect all links on every page.
- Checks each link's response status and dumps 404-error URLs to a text file.
- Install beautifulsoup4 (a Python library) and specify the target sitemap in broken_links.py at [request = build_request("https://www.jobstreet.com.my/career-resources/page-sitemap.xml")]
- Run the command "python broken_links.py"
- To stop the program while it is running, press Ctrl+C.
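The steps above can be sketched as a minimal script. This is an assumed structure, not the actual broken_links.py: the function names (get_page_urls, extract_links, check_status) and the output filename broken_links.txt are hypothetical, and urllib from the standard library is used here for HTTP requests alongside beautifulsoup4.

```python
# Sketch of a sitemap-driven broken-link checker (assumed structure;
# the real broken_links.py may differ). Requires beautifulsoup4.
import urllib.request
import urllib.error

from bs4 import BeautifulSoup

# Hypothetical constant mirroring the sitemap URL from the setup step.
SITEMAP_URL = "https://www.jobstreet.com.my/career-resources/page-sitemap.xml"


def get_page_urls(sitemap_xml: str) -> list[str]:
    """Extract page URLs from the <loc> entries of a sitemap."""
    soup = BeautifulSoup(sitemap_xml, "html.parser")
    return [loc.get_text(strip=True) for loc in soup.find_all("loc")]


def extract_links(html: str) -> list[str]:
    """Collect every 'href' attribute found on a page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def check_status(url: str) -> int:
    """Return the HTTP status code for a URL (a 404 is raised as HTTPError)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def main() -> None:
    sitemap = urllib.request.urlopen(SITEMAP_URL, timeout=10).read().decode()
    with open("broken_links.txt", "w") as out:  # hypothetical output file
        for page_url in get_page_urls(sitemap):
            html = urllib.request.urlopen(page_url, timeout=10).read().decode()
            for link in extract_links(html):
                if not link.startswith("http"):
                    continue  # this sketch skips relative and mailto links
                try:
                    status = check_status(link)
                except urllib.error.URLError:
                    continue  # unreachable host; not a 404
                if status == 404:
                    out.write(link + "\n")


if __name__ == "__main__":
    main()
```

Ctrl+C (a KeyboardInterrupt) stops the crawl at any point; links already written to the output file are preserved.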