# Deep Learning for Automated Corrosion Detection

## Author: Pengju Sun

# Content:
### Part 1: <a href='#Part1'>Web Scraping Corrosion and No Corrosion Images From Google</a>
### Part 2: <a href='https://github.com/pjsun2012/Phase5_Capstone-Project/blob/main/Capstone_Project_Part2_Dataset_Split_EDA.ipynb'>Dataset Split and EDA</a>
### Part 3: <a href='https://github.com/pjsun2012/Phase5_Capstone-Project/blob/main/Capstone_Project_Part3_Models.ipynb'>Models</a>
### Part 4: <a href='https://github.com/pjsun2012/Phase5_Capstone-Project/blob/main/Capstone_Project_Part4_Models.ipynb'>Models</a>
### Part 5: <a href='https://github.com/pjsun2012/Phase5_Capstone-Project/blob/main/Capstone_Project_Part5_Models.ipynb'>Models</a>
### Conclusion
### Future Work

<a id='Part1'></a> 
# Part 1: Scraping Images from Google

All labeled CORROSION and NO CORROSION images were collected by scraping Google Images. Selenium was used to automate browser interaction from Python: it drives a real browser the way a user would, opening pages, scrolling, and clicking elements when told to. For a detailed, step-by-step explanation with code, see the guide “[Image Scraping with Python](https://towardsdatascience.com/image-scraping-with-python-a96feda8af2d)”.

The CORROSION images were scraped from Google Images using keyword searches covering eight categories of corrosion problems: ‘Steel Corrosion/Rust,’ ‘Ships Corrosion,’ ‘Ship Propellers Corrosion,’ ‘Cars Corrosion,’ ‘Oil and Gas Pipelines Corrosion,’ ‘Concrete Rebar Corrosion,’ ‘Water/Oil Tanks Corrosion,’ and ‘Stainless Steel Corrosion.’ The NO CORROSION images were also scraped from Google Images, using the same terms without the word ‘corrosion.’
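
The two sets of search terms can be organised along these lines. This is only a minimal sketch: the names `CORROSION_TERMS`, `strip_corrosion`, and `no_corrosion_terms` are hypothetical, and the exact wording of the NO CORROSION queries is an assumption based on "the same terms without corrosion".

```python
# Hypothetical list of the eight corrosion categories named above.
CORROSION_TERMS = [
    "Steel Corrosion/Rust",
    "Ships Corrosion",
    "Ship Propellers Corrosion",
    "Cars Corrosion",
    "Oil and Gas Pipelines Corrosion",
    "Concrete Rebar Corrosion",
    "Water/Oil Tanks Corrosion",
    "Stainless Steel Corrosion",
]


def strip_corrosion(term: str) -> str:
    """Drop the corrosion wording from a search term (assumed convention)."""
    cleaned = term.replace("Corrosion/Rust", "").replace("Corrosion", "")
    # collapse any leftover whitespace
    return " ".join(cleaned.split())


# Assumed NO CORROSION queries: the same subjects without the corrosion wording.
no_corrosion_terms = [strip_corrosion(term) for term in CORROSION_TERMS]
```

Each term in either list would then be passed to the scraping functions defined in the cells that follow.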

## Searching for a particular phrase and getting the image links

In [1]:
import selenium
from selenium import webdriver
import PIL
from PIL import Image
import time
import pathlib
import glob
import os, os.path, shutil
import requests
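# Regular expressions make text parsing easier
import re
# In-memory byte streams, used to open downloaded image data
import io
# Hashing, used to derive file names from image content
import hashlib
# Path to the local chromedriver executable
DRIVER_PATH = '/Users/pengjusun/Desktop/Web_Scraping/chromedriver'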

In [2]:
# By-based locators work with both Selenium 3 and Selenium 4
from selenium.webdriver.common.by import By

def fetch_image_urls(query:str, max_links_to_fetch:int, wd:webdriver.Chrome, sleep_between_interactions:float=1):

    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)

    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements(By.CSS_SELECTOR, "img.Q4LuWd")
        number_results = len(thumbnail_results)

        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")

        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract image urls
            actual_images = wd.find_elements(By.CSS_SELECTOR, "img.n3VNCb")
            for actual_image in actual_images:
                src = actual_image.get_attribute('src')
                if src and 'http' in src:
                    image_urls.add(src)

            image_count = len(image_urls)

            if image_count >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
        else:
            # the for loop finished without reaching the target: try to load more results
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # find_elements returns an empty list when the button is absent
            load_more_buttons = wd.find_elements(By.CSS_SELECTOR, ".mye4qd")
            if load_more_buttons:
                wd.execute_script("document.querySelector('.mye4qd').click();")
            elif number_results == results_start:
                # no new thumbnails and nothing left to load: stop instead of looping forever
                break

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return image_urls
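
The search URL above interpolates the query text directly into the `q` and `oq` parameters. Browsers usually tolerate spaces, but a term containing characters such as `&` or `/` could be misread as part of the URL. A minimal sketch of encoding the query first, using the standard library's `quote_plus`; the helper name `build_search_url` is hypothetical and is not part of the original notebook:

```python
from urllib.parse import quote_plus

# Same URL template as in fetch_image_urls.
SEARCH_URL = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"


def build_search_url(query: str) -> str:
    """Return the Google Images URL for a query, with the query URL-encoded."""
    encoded = quote_plus(query)  # spaces become '+', '&' becomes '%26'
    return SEARCH_URL.format(q=encoded)


url = build_search_url("oil & gas pipelines corrosion")
```

With this helper, `wd.get(build_search_url(query))` could stand in for the direct `format` call.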

## Downloading the images

In [3]:
def persist_image(folder_path:str,url:str):
    # download the raw bytes; stop here if the request fails
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return

    # re-encode as JPEG and name the file after a hash of the downloaded bytes
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
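
The file name in `persist_image` comes from the first ten hex characters of the SHA-1 hash of the downloaded bytes, so two URLs that serve byte-identical images map to the same file and the second download overwrites the first. A minimal sketch of that naming rule in isolation; the helper name `content_filename` is hypothetical:

```python
import hashlib


def content_filename(image_content: bytes, length: int = 10) -> str:
    """Name a file after a truncated SHA-1 hash of its content."""
    return hashlib.sha1(image_content).hexdigest()[:length] + ".jpg"


# Identical bytes give the same name; different bytes give a different one.
first = content_filename(b"same image bytes")
second = content_filename(b"same image bytes")
other = content_filename(b"different image bytes")
```

Because the hash is taken over the original download rather than the re-encoded JPEG, this only catches exact duplicates, not the same picture served at a different size or compression level.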

## Putting it all together

In [4]:
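def search_and_download(search_term:str,driver_path:str,target_path='./images',number_images=100):
    # one folder per search term, e.g. 'steel plate' -> ./images/steel_plate
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    # create the folder if it does not exist yet
    os.makedirs(target_folder, exist_ok=True)

    # executable_path is the Selenium 3 style; newer Selenium 4 releases expect a Service object instead
    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)

    # skip the download loop if no links were collected
    for elem in res or []:
        persist_image(target_folder,elem)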
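
The target folder is derived from the search term by lower-casing it and replacing spaces with underscores. A minimal sketch of that mapping on its own; the helper name `target_folder_for` is hypothetical:

```python
import os


def target_folder_for(search_term: str, target_path: str = "./images") -> str:
    """Mirror the folder naming used in search_and_download."""
    folder_name = "_".join(search_term.lower().split(" "))
    return os.path.join(target_path, folder_name)


folder = target_folder_for("Steel Plate")
```

Since every search term gets its own folder, the CORROSION and NO CORROSION queries stay separated on disk as long as their search terms differ.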

In [5]:
# Example
search_term = 'steel plate'
search_and_download(search_term = search_term, driver_path=DRIVER_PATH)

Found: 100 search results. Extracting links from 0:100
Found: 101 image links, done!
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRSfCsQ6oEEyHOocaIFftyZJyQe4RYuQD2IyQ&usqp=CAU - as ./images/steel_plate/0947fa8f9b.jpg
SUCCESS - saved https://cdn11.bigcommerce.com/s-opskm61a5f/images/stencil/1280x1280/products/172/476/apiqslrfk__11869.1618439247.jpg?c=1 - as ./images/steel_plate/732acc682d.jpg
SUCCESS - saved https://sc04.alicdn.com/kf/HTB1UQcbxkCWBuNjy0Faq6xUlXXat.jpg - as ./images/steel_plate/d1615cba0c.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTrPCL0wgd8TEMk-WrshCAVYDVDHUHjN9mbkw&usqp=CAU - as ./images/steel_plate/8a4e6c524c.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSN6b-QXFhl7Oz-bg8ttnYHM0wbuhTdm_t7Bg&usqp=CAU - as ./images/steel_plate/0a9f7a0eee.jpg
SUCCESS - saved https://static.toiimg.com/thumb/msid-77012085,width-800,height-600,resizemode-75,imgsize-238730,pt-32,y_pad-40/77012085.jpg - as ./i

SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSZ-oqRdKcGKlA1gKgd0zrz764_tW4LSh7MxQ&usqp=CAU - as ./images/steel_plate/aa15d95db4.jpg
SUCCESS - saved https://www.coremarkmetals.com/files/image/large/HR_PLATE_099_3000.jpg - as ./images/steel_plate/8896c37bd2.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQGPfdKyqJTQb8P3UBU34FRzEtJH2CpAoPY_A&usqp=CAU - as ./images/steel_plate/7706f3ba36.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSZnvkrBEJBWKLvn4EYw40PhZZsGyLmbcEUHA&usqp=CAU - as ./images/steel_plate/9526254678.jpg
SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTAx-rJ9TCgbI-P6DoVFwnVAzyQk7ivjopGsg&usqp=CAU - as ./images/steel_plate/181d1724f4.jpg
SUCCESS - saved https://jlrorwxhniqimk5p.ldycdn.com/cloud/mlBppKmlRmmSpkrmnoor/15.jpg - as ./images/steel_plate/9e7533d796.jpg
SUCCESS - saved https://www.metalsdepot.com/assets/files/Catalog_Photos/steel-floor-plate-surface.jpg - as ./images/s

SUCCESS - saved https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQvB9z6HkG_EBjTUyd0_sL_HpEfsKM11BIhPg&usqp=CAU - as ./images/steel_plate/e9a2a6a2aa.jpg
