# 6. How To Download Multiple Images In Python

## Learning Outcomes

- To learn how to download multiple images in Python

It's a great skill to be able to automatically download all of the images across any number of HTML pages. In this guide you'll learn two methods for extracting all of the images that a website has on its pages.

---------------------------------------------------------------

Let's begin with the easier of the two methods. If you already have a list of image URLs, we can follow this process:

1. Change into a directory where we would like to store all of the images.
2. Make a request to download all of the images, one by one.
3. We will also add in error handling so that if a URL no longer exists the code will still work.

------------------------------------------------------------------------------------------------

## Python Imports

In [1]:
!pip install tldextract



In [2]:
import requests
import os
import subprocess
import urllib.request
from bs4 import BeautifulSoup
import tldextract

----------------

In [3]:
!mkdir all_images

In [4]:
!ls

all_images                            how-to-download-multiple-images.ipynb


Next, change into the all_images directory. This can be done with either the `cd` shell command or `os.chdir()`:

~~~

cd all_images
os.chdir('path')

~~~

In [5]:
os.chdir('all_images')

In [6]:
!pwd

/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images


---------------------

## Method One: How To Download Multiple Images From A Python List

First, we create a Python list to store any image URLs that don't return a 200 status code:

In [7]:
broken_images = []

In [8]:
image_urls = ['https://sempioneer.com/wp-content/uploads/2020/05/dataframe-300x84.png',
             'https://sempioneer.com/wp-content/uploads/2020/05/json_format_data-300x72.png']

In [9]:
for img in image_urls:
    # We can split the file based upon / and extract the last split within the python list below:
    file_name = img.split('/')[-1]
    print(f"This is the file name: {file_name}")
    # Now let's send a request to the image URL:
    r = requests.get(img, stream=True)
    # We can check that the status code is 200 before doing anything else:
    if r.status_code == 200:
        # This command below will allow us to write the data to a file as binary:
        with open(file_name, 'wb') as f:
            for chunk in r:
                f.write(chunk)
                
    else:
        # We will write all of the images back to the broken_images list:
        broken_images.append(img)

This is the file name: dataframe-300x84.png
This is the file name: json_format_data-300x72.png


☝️ See how simple that is! ☝️

If you check your folder, you will now have downloaded all of the images that returned a status code of 200!
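
One thing to watch: `img.split('/')[-1]` works for the two URLs above, but if an image URL carries a query string (for example `?s=35&r=g`), the query ends up in the file name. A minimal sketch of a more defensive helper, assuming you only want the last path segment; the function name `file_name_from_url` and the `example.com` URL are made up for illustration:

```python
import posixpath
from urllib.parse import urlparse, unquote


def file_name_from_url(url: str, default: str = "image") -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    # urlparse separates the path from '?query' and '#fragment':
    path = urlparse(url).path
    # unquote turns percent-escapes such as %20 back into characters:
    name = posixpath.basename(unquote(path))
    # Fall back to a default name if the path ends with '/':
    return name or default


print(file_name_from_url("https://sempioneer.com/wp-content/uploads/2020/05/dataframe-300x84.png"))
print(file_name_from_url("https://example.com/avatar/abc123?s=35&r=g"))
```

Note that names produced this way may have no file extension, so you might still want to add one yourself.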

------------------------------------------------

![downloading images correctly with python](https://sempioneer.com/wp-content/uploads/2020/06/how-to-download-images-with-python.png)

----------------

## Method Two: How To Download Multiple Images From Many HTML Web Pages

If we don't yet have the exact image URLs, we will need to do the following:

1. Download the HTML content of every web page.
2. Extract all of the image URLs for every page.
3. Create the file names.
4. Check to see if the image status code is 200.
5. Write all of the images to your local computer.

This website [internetingishard.com](https://www.internetingishard.com/html-and-css/links-and-images/) has some relative image URLs. Therefore, we will need to ensure that our code can handle the following two types of image source URLs:

---

- Exact Filepath: https://www.internetingishard.com/html-and-css/links-and-images/html-attributes-6f5690.png
- Relative Filepath: /html-and-css/links-and-images/html-attributes-6f5690.png
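
As an aside, Python's standard library can resolve both forms against the page URL with `urllib.parse.urljoin`. This is an alternative to the manual string approach used later in this guide, not the method the guide itself follows. The third source below (a protocol-relative `//` URL on `example.com`) is a made-up illustration:

```python
from urllib.parse import urljoin

# The page the <img> tags were found on:
page_url = "https://www.internetingishard.com/html-and-css/links-and-images/"

sources = [
    # Exact (absolute) URL: returned unchanged
    "https://www.internetingishard.com/html-and-css/links-and-images/html-attributes-6f5690.png",
    # Root-relative path: scheme and host are taken from page_url
    "/html-and-css/links-and-images/html-attributes-6f5690.png",
    # Protocol-relative URL (hypothetical): only the scheme is taken from page_url
    "//example.com/images/logo.png",
]

absolute_urls = [urljoin(page_url, src) for src in sources]
for url in absolute_urls:
    print(url)
```

The first two sources resolve to the same absolute URL, and the third becomes `https://example.com/images/logo.png`.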

---------------------------------------------------------------

In [10]:
web_pages = ['https://understandingdata.com/', 
             'https://understandingdata.com/data-engineering-services/',
             'https://www.internetingishard.com/html-and-css/links-and-images/']

We will also extract the domain of every URL whilst we loop over the webpages like so:
    
~~~

for page in web_pages:
    domain_name = tldextract.extract(page).registered_domain

~~~

In [11]:
url_dictionary = {}

In [12]:
for page in web_pages:
    # 1. Extracting the domain name of the web page:
    domain_name = tldextract.extract(page).registered_domain
    print(f"The domain name: {domain_name}")    
    # 2. Request the web page:
    r = requests.get(page)
    # 3. Check to see if the web page returned a status_200:
    if r.status_code == 200:
        
        # 4. Create a URL dictionary entry for future use:
        url_dictionary[page] = []
        
        # 5. Parse the HTML content with BeautifulSoup and look for image tags:
        soup = BeautifulSoup(r.content, 'html.parser')
        
        # 6. Find all of the images per web page:
        images = soup.find_all('img')
        
        # 7. Store all of the images 
        url_dictionary[page].extend(images)
        
    else:
        print('failed!')

The domain name: understandingdata.com
The domain name: understandingdata.com
The domain name: internetingishard.com


--------------------------------------------------------

Now let's double check and filter our dictionary so that we only look at web pages where there was at least 1 image tag:

In [13]:
for key, value in url_dictionary.items():
    if len(value) > 0:
        print(f"This page: {key} has at least 1 image on the web page.")

This page: https://understandingdata.com/ has at least 1 image on the web page.
This page: https://understandingdata.com/data-engineering-services/ has at least 1 image on the web page.
This page: https://www.internetingishard.com/html-and-css/links-and-images/ has at least 1 image on the web page.


--------------------------------------------------------

A more concise way to do this filtering is with a dictionary comprehension:

In [14]:
cleaned_dictionary = {key: value for key, value in url_dictionary.items() if len(value) > 0}
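
To see what this kind of comprehension does in isolation, here is a tiny sketch on made-up data (the `example.com` pages and file names are placeholders, not part of the tutorial's data). Since an empty list is falsy, `if imgs` is equivalent to `if len(imgs) > 0`:

```python
# Toy stand-in for url_dictionary: page URL -> list of images found on it
pages = {
    "https://example.com/": ["logo.png", "hero.jpg"],
    "https://example.com/empty/": [],
}

# Keep only the pages whose image list is not empty:
pages_with_images = {page: imgs for page, imgs in pages.items() if imgs}

print(pages_with_images)
```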

We can now clean all of the image URLs inside of every dictionary key and change all of the relative URL paths to exact URL paths.

Let's start by printing out all of the different image sources to see how we might need to clean up the data below:

In [15]:
for key, images in cleaned_dictionary.items():
    for image in images:
        print(image.attrs['src'])

//understandingdata.com/wp-content/uploads/2019/04/cropped-logo_transparent-1.png
//understandingdata.com/wp-content/uploads/2019/04/cropped-logo_transparent-1.png
https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://understandingdata.com/wp-content/uploads/2020/03/community-detection-370x238.png
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://understandingdata.com/wp-content/uploads/2020/03/what-is-web-scraping-370x370.png
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
ht

For the scope of this tutorial, I have decided to:
    
- Remove the logo links with the //
- Add on the domain to the relative URLs

In [16]:
all_images = []

for key, images in cleaned_dictionary.items():
    # 1. Creating a clean_urls and domain name for every page:
    clean_urls = []
    domain_name = tldextract.extract(key).registered_domain
    # 2. Looping over every image per url:
    for image in images:
        # 3. Extracting the source (src) with .attrs:
        source_image_url = image.attrs['src']
        # 4. Clean The Data
        if source_image_url.startswith("//"):
            pass
        elif domain_name not in source_image_url and 'http' not in source_image_url:
            url = 'https://' + domain_name + source_image_url
            all_images.append(url)
        else:
            all_images.append(source_image_url)

In [17]:
print(all_images[0:5])

['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg', 'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg', 'https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png', 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g', 'https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg']
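
Notice that the same URL can appear more than once in the list (the first two entries above are identical), which means the same file would be downloaded twice. A small sketch of one way to drop duplicates while keeping the original order, using `dict.fromkeys`; the variable names `sample_urls` and `unique_images` are only for illustration:

```python
# A short sample in the same shape as all_images, with one duplicate:
sample_urls = [
    "https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg",
    "https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg",
    "https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png",
]

# Dictionaries keep insertion order, so this removes repeats without reordering:
unique_images = list(dict.fromkeys(sample_urls))

print(len(sample_urls), len(unique_images))
```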


-------------------------------------------------------------------------------------

After obtaining our list of clean image URLs we can now refer to method one for extracting the images to our computer! 

This time let's convert it into a function:

In [18]:
def extract_images(image_urls_list:list, directory_path):
    
    # Changing directory into a specific folder:
    os.chdir(directory_path)
    
    # Downloading all of the images
    for img in image_urls_list:
        file_name = img.split('/')[-1]
        
        # Let's try both of these versions in a loop [https:// and https://www.]
        url_paths_to_try = [img, img.replace('https://', 'https://www.')]
        for url_image_path in url_paths_to_try:
            print(url_image_path)
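            try:
                # Request the URL variant currently being tried (not always the original img):
                r = requests.get(url_image_path, stream=True)
                if r.status_code == 200:
                    with open(file_name, 'wb') as f:
                        for chunk in r:
                            f.write(chunk)
            except Exception as e:
                # Skip this variant if the request itself fails (e.g. a connection error):
                pass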

In [19]:
!pwd

/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images


In [20]:
path = '/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images'

extract_images(image_urls_list=all_images, 
               directory_path=path)

https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://www.understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://www.understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg
https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png
https://www.understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://www.secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg
https://www.understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg
https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g
https://www.secure.gravatar.com/avatar/17d8a69424a54d39

https://www.internetingishard.com/html-and-css/links-and-images/html-link-href-element-61348e.png
https://internetingishard.com/html-and-css/links-and-images/absolute-relative-root-relative-links-104560.png
https://www.internetingishard.com/html-and-css/links-and-images/absolute-relative-root-relative-links-104560.png
https://internetingishard.com/html-and-css/links-and-images/absolute-link-syntax-64d730.png
https://www.internetingishard.com/html-and-css/links-and-images/absolute-link-syntax-64d730.png
https://internetingishard.com/html-and-css/links-and-images/absolute-links-32f469.png
https://www.internetingishard.com/html-and-css/links-and-images/absolute-links-32f469.png
https://internetingishard.com/html-and-css/links-and-images/relative-links-e178d0.png
https://www.internetingishard.com/html-and-css/links-and-images/relative-links-e178d0.png
https://internetingishard.com/html-and-css/links-and-images/relative-link-no-parent-4629d0.png
https://www.internetingishard.com/html-and-cs

Fantastic! 

There are some cases that we didn't cover, which include:

- http:// only image urls.
- http://www. only image urls.
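
One possible way to cover those cases is to generate every scheme and `www.` combination up front and try them in turn. The sketch below only builds the list of candidates; the helper name `candidate_urls` is hypothetical and it assumes the input is already an absolute `http` or `https` URL:

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url: str) -> list:
    """Return the URL itself first, followed by its http/https and www/non-www variants."""
    parts = urlsplit(url)
    host = parts.netloc
    # Strip a leading 'www.' so both host forms can be rebuilt:
    bare_host = host[4:] if host.startswith("www.") else host
    hosts = [host] + [h for h in (bare_host, "www." + bare_host) if h != host]
    # Try the original scheme first, then the other one:
    schemes = [parts.scheme] + [s for s in ("https", "http") if s != parts.scheme]
    return [
        urlunsplit((scheme, h, parts.path, parts.query, parts.fragment))
        for scheme in schemes
        for h in hosts
    ]


for candidate in candidate_urls("http://example.com/a.png"):
    print(candidate)
```

Unlike `img.replace('https://', 'https://www.')`, this does not add a second `www.` to hosts that already have one.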

But for the most part, you can hopefully now download images in bulk!

---------------------------------------------

![how to download multiple images within python](https://sempioneer.com/wp-content/uploads/2020/06/all_images.png)

------------------------------------------------------------------------

## How To Speed Up Your Image Downloads

When working with hundreds or thousands of URLs, it's important to avoid a synchronous approach to downloading images, where each request has to finish before the next one starts. An asynchronous approach means that we can download multiple web pages or multiple images concurrently.

<strong> This means the overall execution time will be much quicker! </strong>

--------------------

### ThreadPoolExecutor()

In [21]:
def extract_single_image(img):
    file_name = img.split('/')[-1]
    
    # Let's try both of these versions in a loop [https:// and https://www.]
    url_paths_to_try = [img, img.replace('https://', 'https://www.')]
    for url_image_path in url_paths_to_try:
        try:
            # Request the variant currently being tried:
            r = requests.get(url_image_path, stream=True)
            if r.status_code == 200:
                with open(file_name, 'wb') as f:
                    for chunk in r:
                        f.write(chunk)
                # Stop as soon as one variant has been saved:
                return "Completed"
        except Exception as e:
            # Move on to the next variant if the request fails:
            continue
    # Neither variant returned a 200 status code:
    return "Failed"

The ThreadPoolExecutor, from Python's built-in concurrent.futures module, runs a function concurrently across multiple threads, which suits I/O-bound work such as downloading files. In order to utilise it, we need to make sure that our function only works on a single URL.

Then we will pass the image URL list into multiple workers ;) 
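
To get a feel for the pattern without touching the network, here is a minimal sketch that uses `executor.map` with a stand-in worker. `fake_download` and the `example.com` URLs are placeholders; in practice you would pass a real single-URL function instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str) -> str:
    # Stand-in for a real download: sleep to simulate waiting on the network.
    time.sleep(0.1)
    return f"Completed: {url.split('/')[-1]}"


urls = [f"https://example.com/image-{i}.png" for i in range(5)]

# Each worker thread handles one URL at a time; map returns results in input order.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fake_download, urls))

print(results)
```

With five workers the five simulated downloads overlap, so the whole batch should take roughly as long as a single one.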

In [22]:
all_images[0:5]

['https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2019/09/james-anthony-phoenix.jpg',
 'https://understandingdata.com/wp-content/uploads/2020/05/Is-Web-Scraping-Illegal-370x370.png',
 'https://secure.gravatar.com/avatar/17d8a69424a54d3957e1ce51755c6cfd?s=35&r=g',
 'https://understandingdata.com/wp-content/uploads/2020/03/web-scraping-tools-370x192.jpg']

In [29]:
!pwd

/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images


In [27]:
# Deleting the old image directory and creating a new one in its place:
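
The cell above has no code yet. One possible way to do this, sketched with the standard library only; the helper name `reset_directory` is hypothetical and the demo runs inside a temporary folder so that nothing real is deleted:

```python
import os
import shutil
import tempfile


def reset_directory(path: str) -> None:
    """Delete a directory with everything in it, then recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)


# Demo in a throwaway location (an assumption for illustration, not the tutorial's real path):
base = tempfile.mkdtemp()
images_dir = os.path.join(base, "all_images")
os.makedirs(images_dir)
open(os.path.join(images_dir, "old.png"), "wb").close()

reset_directory(images_dir)
print(os.listdir(images_dir))  # → []
```

If your notebook is currently inside the folder you want to reset, change to the parent directory first and `os.chdir` back into the new folder afterwards.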

------------------------------------------

In [28]:
os.chdir('/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/6_downloading_multiple_images/all_images')

In [30]:
import concurrent.futures

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and map each future to its URL (a dict, not a set)
    future_to_url = {executor.submit(extract_single_image, image_url): image_url for image_url in all_images}
    for future in concurrent.futures.as_completed(future_to_url):
        # Look up which URL this finished future belongs to:
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))

You should now have downloaded the same images, but at a much faster rate!

-------------------------------------------------------------------------------------

### Async Programming! 

Just like JavaScript, Python has native support for coroutines: the `async`/`await` syntax (Python 3.5+) together with the built-in [asyncio](https://docs.python.org/3/library/asyncio.html) library. Similar to NodeJS, asyncio runs your async code on an event loop, and there are methods available for creating custom event loops.

We will also need to install an asynchronous HTTP requests library called [aiohttp](https://docs.aiohttp.org/en/stable/).

In [31]:
!pip install aiohttp



Additionally, as we are running this code within a Jupyter Notebook, which already runs inside an event loop, 
we will need to install and apply [nest-asyncio](https://pypi.org/project/nest-asyncio/):

In [32]:
!pip install nest-asyncio



Pro-tip: When you move your async code out of a Jupyter notebook and into a standalone Python script, you no longer need nest-asyncio!

In [33]:
# Only Use This In Jupyter Notebooks!
import nest_asyncio
nest_asyncio.apply()

In [34]:
import aiohttp
import asyncio

----------------------------------------------------------------

We will need to structure our code slightly differently for the async version to work:
    
1. We will have a fetch function to query the webpage.
2. We will have a parse function to get all of the image URLs per webpage.
3. We will have an extract function to download all of the images.
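
Before wiring this up with aiohttp, it can help to see the bare asyncio pattern on its own. The sketch below uses a stand-in coroutine (`fake_fetch`, which just sleeps) instead of a real HTTP request, and is written for a standalone script where `asyncio.run` can start the event loop. All names and the `example.com` URLs are placeholders:

```python
import asyncio


async def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP request: yield control while "waiting" on the network.
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"


async def gather_pages(urls):
    # Schedule every coroutine at once and wait for all of them;
    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


pages = asyncio.run(gather_pages([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]))
print(len(pages))
```

The three simulated requests wait at the same time, which is the same effect we are after when fetching many pages or images.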

----------------------------------------

In [35]:
async def fetch(session, url):
    try:
        async with session.get(url) as response:
            # Notice how both of the calls are awaited in our async def fetch function!
            content = await response.text()
            image_urls = await parse(content)
            # Return the parsed image tags for this page:
            return image_urls
    except Exception as e:
        print(str(e))

In [None]:
async def parse(text):
    # Parse the HTML text that was passed in and collect every <img> tag:
    soup = BeautifulSoup(text, 'html.parser')
    images = soup.find_all('img')
    return images

--------

In [None]:
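async def fetch_single_image(image, session):
    # NOTE: the original cell was left empty; this body is an assumed sketch.
    # It expects `image` to be an absolute image URL and saves it to disk:
    file_name = image.split('/')[-1]
    async with session.get(image) as response:
        if response.status == 200:
            with open(file_name, 'wb') as f:
                f.write(await response.read())

In [None]:
async def extract(images:list, session):
    # The aiohttp session is passed in so that every download can reuse it:
    for image in images:
        await fetch_single_image(image, session)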

In [36]:
async def main(web_pages):
    all_data = []
    tasks = []
    
    headers = {
        "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        for page in web_pages:
            tasks.append(fetch(session, page))
        htmls = await asyncio.gather(*tasks)
        all_data.extend(htmls)
        
    print(len(all_data))

In [37]:
all_urls = []
for key, value in cleaned_dictionary.items():
    all_urls.extend(value)

In [38]:
print(len(all_urls))

70


----------------------------------------------------------------------

In [39]:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
loop.run_until_complete(main(all_urls))

  


70
