## **A Basic Document Scraper in Python**

## **Introduction**
Now that you've set up a basic web scraper using Python, we'll extend that knowledge to fetch a specific document, such as a PDF file or a text file, from a website. This is a common requirement when dealing with web scraping, as many useful data sources are available in downloadable documents.

### **Step 0: Identify the document link**
Before you can download a document, you need to identify its location on the webpage. Typically, documents are linked via ```<a>``` (anchor) tags, which you can locate by inspecting the webpage's HTML structure. For example:

```
<a href="/files/report.pdf" class="download-link">Download Report</a>
```
In this example, the document (report.pdf) is linked within an ```<a>``` tag with the class "download-link." The href attribute contains the path to the document.



### **Step 1: Import the necessary libraries**

In [6]:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup

### **Step 2: Send an HTTP request to the website**
Use the requests library to send an HTTP GET request to the website you want to scrape:

In [3]:
url = "https://www.who.int/data/sets/global-excess-deaths-associated-with-covid-19-modelled-estimates"

response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    print("Connection was successful!")
else:
    print('Failed to retrieve the webpage.')
    exit()

Connection was successful!


### **Step 3: Parse the HTML content**
Once you’ve successfully retrieved the web page, use BeautifulSoup to parse the HTML content and find the ```<a>``` tag that links to the document.



In [5]:
document_link = soup.find('a', string="Download data")['href']
print('Document link found:', document_link)

Document link found: https://cdn.who.int/media/docs/default-source/world-health-data-platform/covid-19-excessmortality/2023-05-19_covid-19_gem.zip?sfvrsn=9a95fc1a_4


This code searches the parsed HTML for an ```<a>``` tag whose text is "Download data" and extracts its href attribute, which contains the link to the document.
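
One thing to watch for: `soup.find()` returns `None` when nothing matches, so indexing the result with `['href']` fails with a `TypeError`. The sketch below shows one possible guard. The helper name `find_link` and the inline HTML are made up for illustration and do not come from the WHO page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<a href="/about">About</a>
<a href="/files/report.pdf" class="download-link">Download Report</a>
"""

def find_link(soup, text):
    """Return the href of the first <a> whose text equals `text`, or None."""
    tag = soup.find('a', string=text)
    # find() returns None when no tag matches; a tag may also lack an href
    if tag is None or not tag.has_attr('href'):
        return None
    return tag['href']

soup = BeautifulSoup(html, 'html.parser')
report_link = find_link(soup, "Download Report")
missing_link = find_link(soup, "Download data")

print('Report link:', report_link)
print('Missing link:', missing_link)
```

Checking for `None` before continuing lets the scraper report a clear message when the page layout or link text changes.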

### **Step 4: Handle relative URLs**
The link extracted from the webpage might be a relative URL (e.g., ```/files/report.pdf```) rather than a full URL. In that case, you need to convert it into a full URL before you can request the document. The ```urljoin()``` function from the standard library's ```urllib.parse``` module combines the base URL with the relative URL to form a full URL that can be used to download the document. Avoid ```os.path.join()``` for this: it is meant for file system paths, so it inserts backslashes on Windows and does not follow URL resolution rules.

In [12]:
from urllib.parse import urljoin

base_url = 'https://example.com'
relative_document_link = 'files/reports.pdf'
full_url = urljoin(base_url, relative_document_link)
print('Full URL:', full_url)

Full URL: https://example.com/files/reports.pdf
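
Links on real pages come in several shapes, so it can help to see how `urllib.parse.urljoin` from the standard library resolves each of them. The page URL and links below are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page that contains the links
page_url = "https://example.com/data/index.html"

# Relative to the page's directory
relative = urljoin(page_url, "files/report.pdf")
# Relative to the site root (leading slash)
root_relative = urljoin(page_url, "/files/report.pdf")
# Already absolute: returned unchanged
absolute = urljoin(page_url, "https://cdn.example.org/report.pdf")
# Parent-directory reference
parent = urljoin("https://example.com/data/", "../report.pdf")

print(relative)       # https://example.com/data/files/report.pdf
print(root_relative)  # https://example.com/files/report.pdf
print(absolute)       # https://cdn.example.org/report.pdf
print(parent)         # https://example.com/report.pdf
```

Because absolute links pass through unchanged, the same call can be applied to every extracted link without first checking whether it starts with `http`.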


### **Step 5: Download the document**
With the full URL in hand, you can now send a request to download the document. The downloaded file can be saved to your local machine.

In [16]:
document_response = requests.get(document_link)

if document_response.status_code == 200:
  with open('2023-05-19_covid-19_gem.zip', 'wb') as file:
    file.write(document_response.content)
    print('Document downloaded successfully.')
else:
  print('Failed to download the document. Status code:', document_response.status_code)

Document downloaded successfully.
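
The file name above is typed by hand. As a rough sketch, it could instead be derived from the link itself, ignoring the query string. The helper `filename_from_url` and its `default` parameter are hypothetical names introduced here for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse

def filename_from_url(url, default="download"):
    """Return the last path segment of `url`, ignoring any query string."""
    path = urlparse(url).path            # drops "?sfvrsn=..." and fragments
    name = posixpath.basename(unquote(path))
    return name or default               # fall back if the path ends with "/"

document_link = (
    "https://cdn.who.int/media/docs/default-source/world-health-data-platform/"
    "covid-19-excessmortality/2023-05-19_covid-19_gem.zip?sfvrsn=9a95fc1a_4"
)
filename = filename_from_url(document_link)
fallback = filename_from_url("https://example.com/")

print(filename)  # 2023-05-19_covid-19_gem.zip
print(fallback)  # download
```

The result can then be passed to `open(filename, 'wb')` in place of the hard-coded name.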


### **Step 6: Fetch multiple documents**
What if the web page links to multiple documents and you want to download them all at once? In that case, you can modify the scraper to loop through all the document links and download each one. The code below is also slightly more advanced. For example, instead of checking whether the response status code is `200`, you can call `raise_for_status()`, which raises an exception if the request failed.

You may also want to download only certain file formats, such as `pdf`, `zip`, `csv`, `txt`, or Excel files. The `allowed_extensions` list in the code controls which formats are kept.

In [37]:
from urllib.parse import urljoin

# Check the connection
url = "https://opendata.dwd.de/climate_environment/health/historical_alerts/heat_warnings/"
response = requests.get(url)
response.raise_for_status()

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# href=True skips anchors without an href attribute
links = soup.find_all('a', href=True)

# Optional: skip the parent-directory link, which is not a file
document_links = [link['href'] for link in links if link['href'] != "../"]

# If you like to store all the files under a specific folder:
download_folder = "downloaded_files"
os.makedirs(download_folder, exist_ok=True)

allowed_extensions = [".pdf", ".zip", ".docx", ".xlsx", ".csv"]

for link in document_links:

  # urljoin resolves relative links and leaves absolute links unchanged
  file_url = urljoin(url, link)

  filename = os.path.basename(file_url)
  # Now we check if the file format is eligible
  if any(filename.lower().endswith(ext) for ext in allowed_extensions):
    print(f"Downloading: {filename}")

    # Stream download
    file_response = requests.get(file_url, stream=True)
    file_response.raise_for_status()

    # Write the file to disk in chunks instead of loading it into memory
    file_path = os.path.join(download_folder, filename)
    with open(file_path, "wb") as file:
      for chunk in file_response.iter_content(chunk_size=8192):
        file.write(chunk)

Downloading: Beschreibung_historische-Hitzewarnungen.pdf
Downloading: Description_historical_heat_alerts.pdf
Downloading: heat_alerts_2005.csv
Downloading: heat_alerts_2006.csv
Downloading: heat_alerts_2007.csv
Downloading: heat_alerts_2008.csv
Downloading: heat_alerts_2009.csv
Downloading: heat_alerts_2010.csv
Downloading: heat_alerts_2011.csv
Downloading: heat_alerts_2012.csv
Downloading: heat_alerts_2013.csv
Downloading: heat_alerts_2014.csv
Downloading: heat_alerts_2015.csv
Downloading: heat_alerts_2016.csv
Downloading: heat_alerts_2017.csv
Downloading: heat_alerts_2018.csv
Downloading: heat_alerts_2019.csv
Downloading: heat_alerts_2020.csv
Downloading: heat_alerts_2021.csv
Downloading: heat_alerts_2022.csv
Downloading: heat_alerts_2023.csv
Downloading: heat_alerts_2024.csv
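
The link-filtering part of the loop can also be pulled into a small function that needs no network access, which makes it easier to test. This is only a sketch: `select_document_urls` is a hypothetical helper, and the `hrefs` list is a hand-written stand-in for what `[a['href'] for a in soup.find_all('a', href=True)]` would return.

```python
import posixpath
from urllib.parse import urljoin, urlparse

ALLOWED_EXTENSIONS = (".pdf", ".zip", ".docx", ".xlsx", ".csv")

def select_document_urls(page_url, hrefs, allowed=ALLOWED_EXTENSIONS):
    """Return (absolute_url, filename) pairs for hrefs with an allowed extension."""
    selected = []
    seen = set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        # Use only the URL path so query strings do not hide the extension
        filename = posixpath.basename(urlparse(absolute).path)
        if not filename.lower().endswith(tuple(allowed)):
            continue
        if absolute in seen:  # skip duplicate links
            continue
        seen.add(absolute)
        selected.append((absolute, filename))
    return selected

page_url = "https://opendata.dwd.de/climate_environment/health/historical_alerts/heat_warnings/"
# Hand-written sample of hrefs, including a duplicate and two non-documents
hrefs = [
    "../",
    "heat_alerts_2005.csv",
    "heat_alerts_2005.csv",
    "Description_historical_heat_alerts.pdf",
    "readme.html",
]

documents = select_document_urls(page_url, hrefs)
for absolute, filename in documents:
    print(filename, "->", absolute)
```

The parent-directory link `../` is dropped automatically here because it has no file name, so no special case is needed.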


## **Important considerations**

**Respect website permissions:** Always check the website’s terms of service to ensure you’re allowed to download documents using automated tools.
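
Besides the terms of service, many sites publish a `robots.txt` file with machine-readable crawling rules. It does not replace reading the terms, but it is easy to check with the standard library's `urllib.robotparser`. In this sketch the rules, the user agent name `example-scraper`, and the URLs are all invented for illustration. For a real site you would call `set_url()` and `read()` to load the site's own file instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "example-scraper"  # hypothetical user agent name
allowed = parser.can_fetch(user_agent, "https://example.com/files/report.pdf")
blocked = parser.can_fetch(user_agent, "https://example.com/private/report.pdf")

print(allowed)  # True
print(blocked)  # False
```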

**Handle different file types accordingly:** Depending on the type of document, you might need to adjust your code to handle different file formats (e.g., .txt, .csv, and .docx).

**Manage large downloads:** If you’re downloading large files, consider adding error handling and resuming capabilities to your scraper.
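
As one possible starting point for large downloads, the sketch below writes data in chunks to a temporary `.part` file and renames it only once the write has finished, so an interrupted download does not leave a truncated file under the final name. It does not implement resuming. `save_chunks` is a hypothetical helper, and the demo chunks are made-up data standing in for `response.iter_content(chunk_size=8192)` on a `requests.get(url, stream=True)` response.

```python
import os
import tempfile

def save_chunks(chunks, destination):
    """Write an iterable of byte chunks to `destination` via a temporary .part file."""
    partial_path = destination + ".part"
    bytes_written = 0
    try:
        with open(partial_path, "wb") as file:
            for chunk in chunks:
                if chunk:  # skip empty chunks
                    file.write(chunk)
                    bytes_written += len(chunk)
        # Rename only after every chunk has been written
        os.replace(partial_path, destination)
    except Exception:
        # Clean up the partial file, then re-raise the original error
        if os.path.exists(partial_path):
            os.remove(partial_path)
        raise
    return bytes_written

# Made-up chunks standing in for a streamed HTTP response body
demo_chunks = [b"year,alerts\n", b"2005,12\n", b""]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
written = save_chunks(demo_chunks, demo_path)

print(written)  # 20
```

Wrapping the call in a `try`/`except requests.RequestException` block would be a natural place to add retries.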

## **Conclusion**

By extending your web scraper to fetch documents, you can automate the process of acquiring valuable resources from the web. Whether you’re downloading reports, datasets, or other types of files, understanding how to identify document links and handle file downloads is a critical skill in data acquisition. With the examples provided, you should be well equipped to implement document fetching in your web scraping projects.

Continue experimenting with different websites and document types to refine your scraper, and always remember to scrape responsibly and ethically.