# 1: Image Collection
In this notebook, we will create the dataset that will be used for training and evaluation of the model.

>**Project Title** <br>2D Shapes Image Classification of Common and Uncommon Objects</br>
**Course** <br>AIDI-2001-02 AI in Enterprise</br>
**Group** <br>Seven (`7`)</br>
**Notebook number** <br>One (`1`)</br>


We will use this notebook to create the training image dataset to detect shapes of common and uncommon objects. The following classes of shapes are being considered:
1. `Circle`
2. `Triangle`
3. `Square`
4. `Rectangle`
5. `Pentagon`
6. `Hexagon`

For each of these shapes, about 200 random images will be collected with **Google's Image Search**, following _[Adrian Rosebrock's](https://pyimagesearch.com/author/adrian/)_ tutorial **[How to create a deep learning dataset using Google Images](https://pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/)**.

# STEP 1. Get Image URLs using JavaScript
This step is carried out in the Google Chrome browser using the JavaScript code provided in the tutorial. Six (`6`) different `.txt` files are created, one for each class. Each of these URL files contains over 150 links to images for the respective class.

The JavaScript code below, taken from that tutorial, produces the `.txt` files containing the URLs.

```js
/**
 * simulate a right-click event so we can grab the image URL using the
 * context menu alleviating the need to navigate to another page
 *
 * attributed to @jmiserez: http://pyimg.co/9qe7y
 *
 * @param   {object}  element  DOM Element
 *
 * @return  {void}
 */
function simulateRightClick( element ) {
    var event1 = new MouseEvent( 'mousedown', {
        bubbles: true,
        cancelable: false,
        view: window,
        button: 2,
        buttons: 2,
        clientX: element.getBoundingClientRect().x,
        clientY: element.getBoundingClientRect().y
    } );
    element.dispatchEvent( event1 );
    var event2 = new MouseEvent( 'mouseup', {
        bubbles: true,
        cancelable: false,
        view: window,
        button: 2,
        buttons: 0,
        clientX: element.getBoundingClientRect().x,
        clientY: element.getBoundingClientRect().y
    } );
    element.dispatchEvent( event2 );
    var event3 = new MouseEvent( 'contextmenu', {
        bubbles: true,
        cancelable: false,
        view: window,
        button: 2,
        buttons: 0,
        clientX: element.getBoundingClientRect().x,
        clientY: element.getBoundingClientRect().y
    } );
    element.dispatchEvent( event3 );
}


/**
 * grabs a URL Parameter from a query string because Google Images
 * stores the full image URL in a query parameter
 *
 * @param   {string}  queryString  The Query String
 * @param   {string}  key          The key to grab a value for
 *
 * @return  {string}               value
 */
function getURLParam( queryString, key ) {
    var vars = queryString.replace( /^\?/, '' ).split( '&' );
    for ( let i = 0; i < vars.length; i++ ) {
        let pair = vars[ i ].split( '=' );
        if ( pair[0] == key ) {
            return pair[1];
        }
    }
    return false;
}



/**
 * Generate and automatically download a txt file from the URL contents
 *
 * @param   {string}  contents  The contents to download
 *
 * @return  {void}
 */
function createDownload( contents ) {
    var hiddenElement = document.createElement( 'a' );
    hiddenElement.href = 'data:attachment/text,' + encodeURI( contents );
    hiddenElement.target = '_blank';
    hiddenElement.download = 'urls.txt';
    hiddenElement.click();
}


/**
 * grab all URLs via a Promise that resolves once all URLs have been
 * acquired
 *
 * @return  {object}  Promise object
 */
function grabUrls() {
    var urls = [];
    return new Promise( function( resolve, reject ) {
        var count = document.querySelectorAll(
        	'.isv-r a:first-of-type' ).length,
            index = 0;
        Array.prototype.forEach.call( document.querySelectorAll(
        	'.isv-r a:first-of-type' ), function( element ) {
            // using the right click menu Google will generate the
            // full-size URL; won't work in Internet Explorer
            // (http://pyimg.co/byukr)
            simulateRightClick( element.querySelector( ':scope img' ) );
            // Wait for it to appear on the <a> element
            var interval = setInterval( function() {
                if ( element.href.trim() !== '' ) {
                    clearInterval( interval );
                    // extract the full-size version of the image
                    let googleUrl = element.href.replace( /.*(\?)/, '$1' ),
                        fullImageUrl = decodeURIComponent(
                        	getURLParam( googleUrl, 'imgurl' ) );
                    if ( fullImageUrl !== 'false' ) {
                        urls.push( fullImageUrl );
                    }
                    // sometimes the URL returns a "false" string and
                    // we still want to count those so our Promise
                    // resolves
                    index++;
                    if ( index == ( count - 1 ) ) {
                        resolve( urls );
                    }
                }
            }, 10 );
        } );
    } );
}

```
> Finally, the code below combines all of the functions above.

```js
/**
 * Call the main function to grab the URLs and initiate the download
 */
grabUrls().then( function( urls ) {
    urls = urls.join( '\n' );
    createDownload( urls );
} );

```
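
For readers who are more at home in Python than JavaScript, the query-string step performed by `getURLParam` together with `decodeURIComponent` can be sketched with the standard library. This is only an illustration of the idea, not part of the tutorial: `extract_imgurl` and the sample link are hypothetical, and `parse_qs` decodes values in roughly, not exactly, the same way as `decodeURIComponent` (for instance, it also turns `+` into a space).

```python
from urllib.parse import parse_qs, urlsplit


def extract_imgurl(google_href: str):
    """Return the decoded `imgurl` query parameter, or None if it is absent."""
    query = urlsplit(google_href).query
    # parse_qs splits on '&' and '=' and percent-decodes each value
    values = parse_qs(query).get("imgurl")
    return values[0] if values else None


# Hypothetical link of the kind the JavaScript code reads from `element.href`
href = (
    "https://www.google.com/imgres"
    "?imgurl=https%3A%2F%2Fexample.com%2Fshapes%2Fcircle.jpg"
    "&imgrefurl=https%3A%2F%2Fexample.com"
)
print(extract_imgurl(href))
```

Returning `None` for a missing parameter plays the same role as the `false` value that the JavaScript helper returns.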

# STEP 2. Downloading the Images using the URLs
In this step, we will use Python's `requests` library to download the images from the list of URLs.

In [1]:
# import the necessary packages
import pathlib
import argparse
import requests
import cv2
import os
import uuid
from tqdm.notebook import tqdm

In [2]:
# define the shapes to be detected
shapes = ['circle', 'triangle', 'square', 'rectangle', 'pentagon', 'hexagon']

# define filepath
IMAGES_PATH = os.path.join("..", "data", "images", "collected_images")

In [3]:
# Create the folder structure: one sub-folder per shape label.
# os.makedirs also creates missing parent folders and behaves the same on POSIX and Windows.
for label in shapes:
    os.makedirs(os.path.join(IMAGES_PATH, label), exist_ok=True)

In [4]:
def get_urls(filename: str, directory: str = "../data/URLs"):
    """Get the URLs from the .txt files created using the Javascript code.
    
    Args:
        filename (str): name of the file that contains the URLs, with file extension.
        directory (str, optional): directory where the URL file is located.
        
    Returns:
        (list): list of URLs in the file.
    """
    path = pathlib.Path(directory) / filename
    

    if path.exists():
        with path.open('r') as f:
            urls = f.read().strip().split('\n')
    else:
        raise FileNotFoundError('Please check the file name and directory where the file is located.')
        
    return urls
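
`get_urls` returns the lines of the file exactly as they were written, so duplicates or stray entries (for example the literal string `false`, or a single empty string for an empty file) reach the downloader unchanged. A minimal sketch of an optional clean-up pass is shown here; `clean_urls` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def clean_urls(urls):
    """Drop blanks, non-HTTP entries and duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()  # also removes a trailing '\r' from Windows line endings
        if not url.startswith(("http://", "https://")):
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


# Made-up input for illustration
raw = [
    "https://example.com/a.jpg",
    "",
    "false",
    "https://example.com/a.jpg",
    " http://example.com/b.png\r",
]
print(clean_urls(raw))
```

If adopted, it could be applied as `urls = clean_urls(get_urls(filename))` before downloading.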

In [5]:
def download_images(urls: list, label: str, out_dir: str = '../data/images/collected_images', verbose: int = 0):
    """Download the image behind each URL in the list and save it in the label's folder.
    
    Args:
        urls (list of str): list of URLs to the images to be downloaded.
        label (str): the label for the images downloaded from the URLs.
        out_dir (str): output directory where the labelled images will be stored.
        verbose (int): 1 shows a progress bar, 2 prints a message for every URL,
            any other value prints only the start and end messages.
        
    Returns:
        (list): list of paths to the downloaded images.
    
    """
    downloaded_imgs = 0
    img_paths = []
    
    if verbose == 1:
        generator = tqdm(urls, desc=f"Downloading images for {label}...")
    else:
        print('[START] Downloading images...')
        generator = urls
    
    for url in generator:
        try:
            # attempt to download the image
            r = requests.get(url, timeout=60)

            # every file gets a unique name: <label>-<uuid>.jpg
            img_path = pathlib.Path(out_dir) / label / f"{label}-{uuid.uuid1()}.jpg"

            with img_path.open('wb') as f:
                f.write(r.content)
            
            if verbose == 2:
                print(f"[SUCCESS] Downloaded: {img_path}")
            img_paths.append(str(img_path))
            downloaded_imgs += 1
        except Exception:
            # report the URL: img_path is not defined yet if the request itself failed
            if verbose == 2:
                print(f"[ERROR] Download unsuccessful for {url}...skipping...\n")
        
    print("[END] Downloading complete.")
    print(f"[END] URLs provided: {len(urls)}")
    print(f"[END] Total images downloaded: {downloaded_imgs}")
    
    return img_paths
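
`download_images` writes whatever bytes the server returns and always uses the `.jpg` extension, so an HTML error page or a PNG file is saved the same way as a JPEG. One way to inspect a payload before (or after) saving it is to look at its leading "magic" bytes. The sketch below is an assumption-laden illustration: `sniff_image_format` is a hypothetical helper, and it only knows the JPEG, PNG and GIF signatures.

```python
def sniff_image_format(data: bytes):
    """Guess the image format from the first bytes, or return None if unknown."""
    signatures = {
        b"\xff\xd8\xff": "jpeg",
        b"\x89PNG\r\n\x1a\n": "png",
        b"GIF87a": "gif",
        b"GIF89a": "gif",
    }
    for magic, name in signatures.items():
        if data.startswith(magic):
            return name
    return None


# Made-up payloads for illustration
print(sniff_image_format(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(sniff_image_format(b"<!DOCTYPE html><html>Not Found</html>"))
```

In this notebook the actual filtering is left to `verify_downloads`, which relies on OpenCV being able to decode the file.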

In [9]:
def verify_downloads(img_paths: list, verbose: int = 0):
    """Check that the images at the given paths can be read by OpenCV. Corrupted images are deleted.
    
    Args:
        img_paths (list): list of paths to the images to be verified.
        verbose (int): 1 shows a progress bar, 2 prints a message for every deleted
            or missing file, any other value prints only the start and end messages.
        
    Returns:
        (list): list of paths to the images that were not corrupted.
    
    """
    new_img_paths = img_paths.copy()
    
    if verbose == 1:
        generator = tqdm(img_paths, desc="Verifying images...")
    else:
        print("[START] Verifying images...")
        # iterate over the paths passed in, not the global `urls`
        generator = img_paths
    
    for path in generator:
        delete = False
        
        try:
            # cv2.imread returns None when the file cannot be decoded as an image
            image = cv2.imread(path)
            
            if image is None:
                delete = True
        except Exception:
            print(f"[ERROR] Could not read {path}")
            delete = True
        
        if not pathlib.Path(path).exists():
            if verbose == 2:
                print(f'[ERROR-2] File not found: {path}')
            continue

        if delete:
            if verbose == 2:
                print(f"[DELETE] Deleting {path}")
            os.remove(path)
            new_img_paths.remove(path)

    print("[END] Verification complete.")
    print(f"[END] Images deleted: {len(img_paths) - len(new_img_paths)}")
    print(f"[END] Images remaining: {len(new_img_paths)}")

    return new_img_paths
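
The delete-and-filter pattern used by `verify_downloads` can be tried out without OpenCV by passing the readability check in as a function. This is a minimal sketch under stated assumptions: `remove_unreadable` and `non_empty` are hypothetical names, the "non-empty file" check is only a stand-in for `cv2.imread`, and, unlike `verify_downloads`, missing files are dropped from the returned list.

```python
import os
import tempfile


def remove_unreadable(paths, is_readable):
    """Delete files that fail `is_readable` and return the surviving paths."""
    kept = []
    for path in paths:
        if not os.path.exists(path):
            continue  # nothing to delete and nothing to keep
        if is_readable(path):
            kept.append(path)
        else:
            os.remove(path)
    return kept


def non_empty(path):
    # Stand-in check for illustration; the notebook uses cv2.imread instead
    return os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jpg")
    bad = os.path.join(tmp, "bad.jpg")
    missing = os.path.join(tmp, "missing.jpg")
    with open(good, "wb") as f:
        f.write(b"\xff\xd8\xff")
    open(bad, "wb").close()  # an empty file plays the role of a corrupted download

    kept = remove_unreadable([good, bad, missing], non_empty)
    bad_still_exists = os.path.exists(bad)

print(kept == [good], bad_still_exists)
```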

In [7]:
# map each shape to its two URL files
shape_urls = {shape: [f"{shape}_urls.txt", f"{shape}_obj_urls.txt"] for shape in shapes}
shape_urls

{'circle': ['circle_urls.txt', 'circle_obj_urls.txt'],
 'triangle': ['triangle_urls.txt', 'triangle_obj_urls.txt'],
 'square': ['square_urls.txt', 'square_obj_urls.txt'],
 'rectangle': ['rectangle_urls.txt', 'rectangle_obj_urls.txt'],
 'pentagon': ['pentagon_urls.txt', 'pentagon_obj_urls.txt'],
 'hexagon': ['hexagon_urls.txt', 'hexagon_obj_urls.txt']}

In [8]:
run = True

if run:
    for shape, files in shape_urls.items():
        for filename in files:
            urls = get_urls(filename)
            
            img_paths = download_images(urls, shape, verbose=1)
            img_paths = verify_downloads(img_paths, verbose=1)
            len(img_paths)

[END] Downloading complete.
[END] URLs provided: 299
[END] Total images downloaded: 297
[END] Verification complete.
[END] Images deleted: 50
[END] Images remaining: 247

[END] Downloading complete.
[END] URLs provided: 199
[END] Total images downloaded: 197
[END] Verification complete.
[END] Images deleted: 5
[END] Images remaining: 192

[END] Downloading complete.
[END] URLs provided: 299
[END] Total images downloaded: 293
[END] Verification complete.
[END] Images deleted: 59
[END] Images remaining: 234

[END] Downloading complete.
[END] URLs provided: 299
[END] Total images downloaded: 294
[END] Verification complete.
[END] Images deleted: 14
[END] Images remaining: 280

[END] Downloading complete.
[END] URLs provided: 299
[END] Total images downloaded: 295
[END] Verification complete.
[END] Images deleted: 26
[END] Images remaining: 269

[END] Downloading complete.
[END] URLs provided: 199
[END] Total images downloaded: 193
[END] Verification complete.
[END] Images deleted: 14
[END] Images remaining: 179

[END] Downloading complete.
[END] URLs provided: 199
[END] Total images downloaded: 198
[END] Verification complete.
[END] Images deleted: 19
[END] Images remaining: 179

[END] Downloading complete.
[END] URLs provided: 199
[END] Total images downloaded: 194
[END] Verification complete.
[END] Images deleted: 8
[END] Images remaining: 186

[END] Downloading complete.
[END] URLs provided: 299
[END] Total images downloaded: 299
[END] Verification complete.
[END] Images deleted: 17
[END] Images remaining: 282

[END] Downloading complete.
[END] URLs provided: 199
[END] Total images downloaded: 199
[END] Verification complete.
[END] Images deleted: 22
[END] Images remaining: 177

[END] Downloading complete.
[END] URLs provided: 299
[END] Total images downloaded: 298
[END] Verification complete.
[END] Images deleted: 16
[END] Images remaining: 282

[END] Downloading complete.
[END] URLs provided: 54
[END] Total images downloaded: 54
[END] Verification complete.
[END] Images deleted: 3
[END] Images remaining: 51
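
The log reports counts per URL file rather than per class, so a quick tally of the files in each class folder can help confirm how balanced the collected dataset is. The sketch below is one possible way to do this; `count_images_per_class` is a hypothetical helper, and the demonstration runs on a throwaway directory rather than on the real data.

```python
import tempfile
from pathlib import Path


def count_images_per_class(root):
    """Map each class sub-folder of `root` to the number of files it contains."""
    root = Path(root)
    return {
        folder.name: sum(1 for item in folder.iterdir() if item.is_file())
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


# Demonstration on a temporary directory with made-up, empty files
with tempfile.TemporaryDirectory() as tmp:
    for label, n in [("circle", 3), ("square", 1)]:
        (Path(tmp) / label).mkdir()
        for i in range(n):
            (Path(tmp) / label / f"{label}-{i}.jpg").write_bytes(b"")
    counts = count_images_per_class(tmp)

print(counts)
```

In this notebook the call would be `count_images_per_class(IMAGES_PATH)`.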
