# Different download and save approaches in Python
> A comparison of synchronous, multiprocess, and async approaches to downloading and saving files

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]

In this example I wanted to demonstrate how much download times differ between fetching files synchronously, in parallel processes, and asynchronously with `async`/`await` libraries (aiohttp and aiofiles). The results shown here were run on my laptop with an Intel® Core™ i9-9980HK CPU @ 2.40GHz × 16 processor. I download 15 CSV files totalling about 26.7 megabytes at the time of writing.

I expect the results to change significantly with the number and size of the files. Still, I think this demonstrates nicely that, for a relatively small number of largish files, async downloading performs better than parallel downloading with multiple processes (in general, depending on the number of cores available), and that asynchronous downloading is significantly faster than downloading the files synchronously one after another. In the runs below the wall times were 1 min 17 s (synchronous), 10.9 s (parallel processes), and 8.0 s (async).

## Requirements (Conda)

In [None]:
%conda install -c conda-forge -y aiofiles=0.5 aiohttp=3.6 joblib=0.16 requests=2.24

## Requirements (Pip)

In [None]:
%pip install aiofiles==0.5 aiohttp==3.6 joblib==0.16 requests==2.24

## Prepare a list of CSV files to download

In [3]:
file_url_list = [f"http://data.wa.aemo.com.au/datafiles/load-summary/load-summary-{year}.csv" for year in range(2006, 2021)]

## Download 15 CSV files synchronously

In [4]:
import requests

In [5]:
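def download_file(file_url: str):
    """Download one file and save it in the working directory."""
    response = requests.get(file_url)
    # Non-200 responses are skipped silently, so no file is written for them
    if response.status_code == 200:
        # Use the last path segment of the URL as the local file name
        file_name = file_url.split("/")[-1]
        with open(file_name, mode="wb") as file:
            file.write(response.content)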

In [6]:
%%time
for file_url in file_url_list:
    download_file(file_url)

CPU times: user 791 ms, sys: 481 ms, total: 1.27 s
Wall time: 1min 17s


## Download 15 CSV files in parallel processes using all available cores

In [7]:
from joblib import Parallel, delayed

In [8]:
%%time
Parallel(n_jobs=-1)(delayed(download_file)(file_url) for file_url in file_url_list)
pass  # pass just so Jupyter doesn't display the list of None results returned by Parallel

CPU times: user 59.6 ms, sys: 91.8 ms, total: 151 ms
Wall time: 10.9 s


## Download 15 CSV files asynchronously using async/await libraries

In [9]:
import asyncio
import aiofiles
import aiohttp

from typing import List

In [10]:
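# Note: this redefines the synchronous download_file from the earlier cell
async def download_file(session: aiohttp.ClientSession, file_url: str):
    async with session.get(file_url) as resp:
        # Use the last path segment of the URL as the local file name
        file_name = file_url.split("/")[-1]
        # Non-200 responses are skipped silently
        if resp.status == 200:
            async with aiofiles.open(file_name, mode="wb") as f:
                # Read the whole response body into memory, then write it out
                await f.write(await resp.read())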


async def download_file_list(file_url_list: List[str]):
    async with aiohttp.ClientSession() as session:
        # If downloading a very large number of files you'd probably need to limit
        # the number of concurrent requests that could be sent at once
        tasks = [download_file(session, file_url) for file_url in file_url_list]
        await asyncio.gather(*tasks)
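
The comment above mentions limiting the number of concurrent requests. One common way to do that is an `asyncio.Semaphore`; the following is only a minimal sketch of the idea, not part of the timed runs. The names `run_with_limit`, `fake_download`, and `max_concurrent` are hypothetical, and `fake_download` is a stand-in that sleeps instead of touching the network so the sketch runs anywhere.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_limit(
    worker: Callable[[str], Awaitable[None]],
    items: List[str],
    max_concurrent: int = 5,
) -> None:
    """Run worker(item) for every item, at most max_concurrent at a time."""
    # Created inside the coroutine so it belongs to the running event loop
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(item: str) -> None:
        # Only max_concurrent coroutines can be inside this block at once
        async with semaphore:
            await worker(item)

    await asyncio.gather(*(limited(item) for item in items))


# Stand-in for a real download (hypothetical): tracks how many run at once
in_flight = 0
peak_in_flight = 0


async def fake_download(file_url: str) -> None:
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)  # pretend to wait on the network
    in_flight -= 1


urls = [f"load-summary-{year}.csv" for year in range(2006, 2021)]
asyncio.run(run_with_limit(fake_download, urls, max_concurrent=5))
print(f"Peak concurrent downloads: {peak_in_flight}")
```

Inside a notebook, where an event loop is already running, the last call would be `await run_with_limit(...)` rather than `asyncio.run(...)`. With a real download, the worker could be a small wrapper that passes the shared `aiohttp.ClientSession` along with the URL.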

In [12]:
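# Using time functions since the %%time magic errors when used in a cell with await
import time
t0 = time.perf_counter()
await download_file_list(file_url_list)
t1 = time.perf_counter()
# .1f gives fixed-point output; a bare 0.2 would switch to exponent notation at 10 s or more
print(f"Wall time: {t1 - t0:0.1f}s")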

Wall time: 8.0s
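
Since `%%time` was not usable in the cell with `await`, the manual timing could be wrapped in a small reusable helper. The sketch below is one possible way to do it, assuming a context manager is acceptable; `wall_timer` is a hypothetical name, and `time.sleep` stands in for the real workload.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def wall_timer(label: str = "Wall time") -> Iterator[Dict[str, float]]:
    """Time the enclosed block and print the elapsed wall-clock seconds."""
    timing: Dict[str, float] = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # Record the result even if the enclosed block raised
        timing["seconds"] = time.perf_counter() - start
        print(f"{label}: {timing['seconds']:.1f}s")


# Stand-in workload; in a notebook cell the body could instead be
# `await download_file_list(file_url_list)`
with wall_timer() as timing:
    time.sleep(0.05)
```

The elapsed time is also stored in `timing["seconds"]`, which makes it easy to collect the three wall times for a side-by-side comparison.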
