# Different download and save approaches in Python
> A comparison of synchronous, multiprocess, and async approaches to downloading and saving files

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]

In this example I wanted to demonstrate the big differences in wall-clock time between downloading files synchronously, in parallel processes, and asynchronously with `async`/`await` (via the asyncio, aiohttp, and aiofiles packages). The results shown here were run on my laptop with an Intel® Core™ i9-9980HK CPU @ 2.40GHz × 16 processor. I download 15 CSV files totalling about 26.7 megabytes at the time of writing.

I expect the results to change significantly with the number and size of the files, but I think this nicely demonstrates that, for a relatively small number of largish files, async performs better than multi-process (4.3 s versus 8.45 s in the runs below), and both are significantly faster than sequential downloading (1 min 21 s).

## Requirements

In [None]:
%%sh
conda install -c conda-forge -y aiofiles=0.5 aiohttp=3.6 joblib=0.16 requests=2.24

## Prepare a list of CSV files to download

In [10]:
file_url_list = [f"http://data.wa.aemo.com.au/datafiles/load-summary/load-summary-{year}.csv" for year in range(2006, 2021)]
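
As a quick sanity check, the comprehension above should yield one URL per year from 2006 to 2020 inclusive, and the download functions later in this post name each local file after the last path segment of its URL. A minimal sketch that rebuilds the list and prints both; nothing is downloaded, and `file_names` is just an illustrative variable:

```python
# Rebuild the URL list exactly as in the cell above (no network access involved)
file_url_list = [
    f"http://data.wa.aemo.com.au/datafiles/load-summary/load-summary-{year}.csv"
    for year in range(2006, 2021)  # range() excludes the stop value, so 2006..2020
]

# The local file name is the last path segment of each URL
file_names = [file_url.split("/")[-1] for file_url in file_url_list]

print(len(file_url_list))  # 15
print(file_names[0])       # load-summary-2006.csv
print(file_names[-1])      # load-summary-2020.csv
```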

## Download 15 CSV files synchronously

In [2]:
import requests

In [3]:
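def read_load_summary(file_url: str):
    """Download one CSV file and save it in the current working directory."""
    response = requests.get(file_url)
    # Any non-200 response is skipped silently
    if response.status_code == 200:
        # Use the last path segment of the URL as the local file name
        file_name = file_url.split("/")[-1]
        with open(file_name, mode="wb") as file:
            file.write(response.content)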

In [4]:
%%time
for file_url in file_url_list:
    read_load_summary(file_url)

CPU times: user 608 ms, sys: 307 ms, total: 914 ms
Wall time: 1min 21s


## Download 15 CSV files in parallel processes using all available cores

In [5]:
from joblib import Parallel, delayed

In [6]:
%%time
Parallel(n_jobs=-1)(delayed(read_load_summary)(file_url) for file_url in file_url_list)
pass  # pass just so Jupyter doesn't show the list of None returns from Parallel

CPU times: user 60.1 ms, sys: 39 ms, total: 99.1 ms
Wall time: 8.45 s


## Download 15 CSV files asynchronously using asyncio, aiohttp, and aiofiles

In [7]:
import asyncio
import aiofiles
import aiohttp

from typing import List

In [8]:
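async def fetch(session: aiohttp.ClientSession, file_url: str):
    """Download one file with the shared session and save it in the working directory."""
    async with session.get(file_url) as resp:
        # Use the last path segment of the URL as the local file name
        file_name = file_url.split("/")[-1]
        # Any non-200 response is skipped silently
        if resp.status == 200:
            async with aiofiles.open(file_name, mode="wb") as f:
                # resp.read() loads the whole body into memory before it is written
                await f.write(await resp.read())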


async def download(file_url_list: List[str]):
    async with aiohttp.ClientSession() as session:
        # If downloading a very large number of files you'd probably need to limit
        # the number of concurrent requests that could be sent at once
        tasks = [fetch(session, file_url) for file_url in file_url_list]
        await asyncio.gather(*tasks)
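
The comment in `download` mentions limiting the number of concurrent requests. One common way to do that is an `asyncio.Semaphore`; the sketch below is only an illustration, not something I benchmarked here. `gather_with_limit`, `fake_download`, and `demo` are hypothetical helper names, and the fake download sleeps instead of touching the network so the cap is easy to observe.

```python
import asyncio
from typing import Awaitable, Dict, Iterable, List, TypeVar

T = TypeVar("T")


async def gather_with_limit(coros: Iterable[Awaitable[T]], limit: int) -> List[T]:
    """Await all coroutines, with at most `limit` of them running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro: Awaitable[T]) -> T:
        # Each wrapper waits here until one of the `limit` slots is free
        async with semaphore:
            return await coro

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(run(coro) for coro in coros))


async def fake_download(index: int, state: Dict[str, int]) -> int:
    """Stand-in for fetch(): sleeps instead of making a request."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)
    state["active"] -= 1
    return index


async def demo() -> Dict[str, int]:
    # Tracks how many fake downloads are in flight and the highest count seen
    state = {"active": 0, "peak": 0}
    await gather_with_limit([fake_download(i, state) for i in range(10)], limit=3)
    return state  # state["peak"] does not exceed the limit of 3
```

In a notebook, `await demo()` runs the sketch. In `download`, the equivalent change would be replacing `await asyncio.gather(*tasks)` with `await gather_with_limit(tasks, limit=5)`, where 5 is an arbitrary value. aiohttp can also cap connections itself through the `limit` argument of `aiohttp.TCPConnector` (100 by default), so the 15 files here are not throttled by the default session.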

In [9]:
import time

# Using time functions since %%time magic errors when used in a cell with await
t0 = time.perf_counter()
await download(file_url_list)
t1 = time.perf_counter()
# Fixed-point format with one decimal place
print(f"Wall time: {t1 - t0:.1f}s")

Wall time: 4.3s
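
To put the three wall times side by side, the arithmetic below converts them into speedups relative to the sequential run and into approximate throughput, using the 26.7 MB total quoted at the top. These are single runs on one machine and one network connection, so the ratios are indicative only; the dictionary and variable names are just for illustration.

```python
# Wall times reported above, in seconds (1min 21s = 81 s)
wall_times = {"sequential": 81.0, "multi-process": 8.45, "async": 4.3}
total_mb = 26.7  # approximate total size of the 15 CSV files

baseline = wall_times["sequential"]
speedups = {name: baseline / seconds for name, seconds in wall_times.items()}
throughputs = {name: total_mb / seconds for name, seconds in wall_times.items()}

for name in wall_times:
    print(f"{name:>13}: {speedups[name]:5.1f}x, {throughputs[name]:5.2f} MB/s")
```

On these numbers multi-process comes out roughly 9.6 times faster than sequential and async roughly 18.8 times faster, so async was about twice as fast as multi-process in this run.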
