<a href="https://colab.research.google.com/github/john-sedrak/ML-CV-Project/blob/main/DS_ML_Project_Day_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#<center> DS ML Capstone Project: Celebrity Face Recognition


##### <center>Original work: [Sports Celebrity Image Classification — codebasics](https://youtube.com/playlist?list=PLeo1K3hjS3uvaRHZLl-jLovIjBP14QTXc)
##### <center> Prepared by: Ahmed Mokhtar

---

## Preface
Greetings, data scientists! The semester plan has concluded. Throughout the semester, you have all shown a great deal of diligence and dedication, and we could not be more proud of you! The time has finally come for you to test and expand your knowledge. At this point, it is safe to say that you are all familiar with [structured](https://wordlift.io/blog/en/entity/structured-data/) data. To change things up, we wanted to expose you to [unstructured](https://searchbusinessanalytics.techtarget.com/definition/unstructured-data) data in a fun and exciting project that serves as an introduction to the world of [computer vision](https://www.ibm.com/topics/computer-vision). It is an end-to-end data science project which is full of new and exciting concepts for you to learn. It is slightly challenging, because we are confident in your abilities! 

<img src="https://i.imgur.com/i1bSfkM.png" alt="celebration" width="100%">

## Part I: Introduction

Watch this video before you proceed: [Data Science & Machine Learning Project - Part 1 Introduction | Image Classification](https://www.youtube.com/watch?v=qWXXHjV3JHI&list=PLeo1K3hjS3uvaRHZLl-jLovIjBP14QTXc&index=1)<br><br>

In this project, you will build a face recognition program from start to finish. First, you will have to **collect** your data. Then you will clean your data and **engineer features** out of it. Once the data is ready, you will use it to **train** a machine learning algorithm to classify faces. You will have to experiment with and **evaluate** different models in order to find the best one for this problem. Finally, you will **deploy** your model on the cloud.  

By the end of this project, you should be able to:

*   Scrape the web for data (optional).
*   Know what [OpenCV](https://opencv.org) is and what it does.
*   Know how images are represented in Python.
*   Know what Haar cascade classifiers are and how to use them in Python.
*   Navigate and interact with directories using code.
*   Know what wavelet decomposition is and how to apply it in Python.
*   Train a classical ML classifier on image data.
*   Build a Python [Flask](https://flask.palletsprojects.com/en/2.0.x/) server for your model.
*   Build an [API](https://en.wikipedia.org/wiki/API) for your model.
*   Build an HTML document for your application (optional).
*   Connect your API to the frontend using JavaScript (optional).
*   Deploy your model to production using [AWS](https://aws.amazon.com).

Without keeping you waiting any further, let's dive into the project!

## Part II: Data Collection



Watch this video before you proceed: [Data Science & Machine Learning Project - Part 2 Data Collection | Image Classification](https://www.youtube.com/watch?v=m1dQ38qDABw&list=PLeo1K3hjS3uvaRHZLl-jLovIjBP14QTXc&index=2&t=8s)<br><br>

We want to create a dataset containing 50 images for each class (person). Fortunately, the web is home to thousands of pictures that can be easily downloaded. As stated in the video, there are 4 ways you can get the images you need.

1.   **Downloading them manually:** This is tedious and not recommended.
2.   **Scraping the web:** This involves using software like [Selenium](https://www.guru99.com/introduction-to-selenium.html) or [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract information from HTML elements. This requires you to write a moderate amount of code at the start, but it is much easier than manual extraction. You also have to make sure that you are legally allowed to possess the data you are scraping (i.e. no personal or copyrighted data).
3.   **Using ready-made software for your scraping task (recommended):** For example, the [fatkun](https://chrome.google.com/webstore/detail/fatkun-batch-download-ima/nnjjahlikiabnchcpehcpkdeckfgnohf?hl=en) Chrome extension is made for scraping images. This saves you time, but it is not as flexible as Selenium or Beautiful Soup.
4.   **Buying the data:** You can buy the needed images from the parties that own them.
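
To make option 2 more concrete, here is a minimal sketch of what "extracting information from HTML elements" means. It swaps in Python's built-in `html.parser` module instead of Selenium or Beautiful Soup so that it runs without any installs. The `ImageSrcParser` class, the `extract_image_urls` helper, and the sample HTML are all hypothetical and only serve as an illustration.

```python
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            for name, value in attrs:
                # keep only real links, skip inline data: placeholders
                if name == "src" and value and value.startswith("http"):
                    self.sources.append(value)


def extract_image_urls(html):
    parser = ImageSrcParser()
    parser.feed(html)
    return parser.sources


# Hypothetical page content; a real scraper would download the page first
sample_html = """
<html><body>
  <img src="https://example.com/a.jpg" alt="first">
  <img src="data:image/gif;base64,AAAA" alt="inline placeholder">
  <img src="https://example.com/b.png" alt="second">
</body></html>
"""

image_urls = extract_image_urls(sample_html)
print(image_urls)
```

A browser-driven tool like Selenium does the same kind of lookup, but on a page that it has loaded and scrolled like a real user would.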





<mark>Task: Pick 5 celebrities, then use the [fatkun](https://chrome.google.com/webstore/detail/fatkun-batch-download-ima/nnjjahlikiabnchcpehcpkdeckfgnohf?hl=en) chrome extension to download 50 photos for each of them. 

<mark>Alternative task 1: Use Selenium to scrape the images. The code is prepared for you; you only need to understand the code and what it does.

<mark>Alternative task 2: You can use 50 photos of yourself or people you know to make your dataset.

The following code uses Selenium to scrape images from the web.

Credit for the code goes to: [Image Scraping With Python](https://towardsdatascience.com/image-scraping-with-python-a96feda8af2d)

Make sure you read the article carefully and understand what the code does. I have also added extra comments to the code to explain it as much as I can.


In [None]:
# install the selenium framework; pinned below 4.3 because the code in this
# notebook uses find_elements_by_css_selector, which Selenium 4.3 removed
!pip install "selenium<4.3"

# refresh the ubuntu package index so that apt install runs correctly
!apt-get update 

# chromedriver is a tool for automated web navigation
!apt install chromium-chromedriver

# linux command to copy files from src directory to dest directory
!cp /usr/lib/chromium-browser/chromedriver /usr/bin 

In [None]:
# run this cell after the installation cell above, since it imports selenium
import time
import requests 
import io
import hashlib
import os
import sys 
from os import path
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from PIL import Image, ImageDraw

# telling python to look in this path for chromedriver
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()

# this makes sure we scrape the web without windows popping up or any form of user interface
chrome_options.add_argument('--headless')

# disable sandboxing (https://www.google.com/googlebooks/chrome/med_26.html)
chrome_options.add_argument('--no-sandbox')

# disable shared memory space to avoid crashes
chrome_options.add_argument('--disable-dev-shm-usage')

The functions below facilitate image scraping from Google:

In [None]:
# given the URL of a page, open it in a new browser window and collect the sources of its images
def fetch_image_urls_util(url,driver_path):
    images = []
    # open a new browser window
    with webdriver.Chrome(executable_path=driver_path, options=chrome_options) as wd:

        # load the page; give up on this URL if it cannot be opened
        try:
            wd.get(url)
        except Exception:
            return []

        # find images with the class 'n3VNCb'
        thumbnail_results = wd.find_elements_by_css_selector("img[class ='n3VNCb']")

        # keep the 'src' attribute (the image URL) when it is an http(s) link
        for img in thumbnail_results:
            src = img.get_attribute('src')
            if src and 'http' in src:
                images.append(src)

    return images


def fetch_image_urls(query:str, max_links_to_fetch:int, wd, sleep_between_interactions:float=1,driver_path= None, target_path = None, search_term = None):
    
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    # scroll to the end of the page to load it fully
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)    
    
    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    image_count2 = 0
    total_saved = 0
    results_start = 0
    i = 0
    d = {}

    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        
        # for all the clickable thumbnails in the range announced by the message above
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail such that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception as e:
                print(e)
                continue
            
            # get the clickable element in the thumbnail
            links = wd.find_elements_by_css_selector("a[jsname='sTFXNd']")

            # for each clickable element
            for link in links:
                # get the target URL
                href = link.get_attribute('href')
                # only follow http(s) links that we have not visited yet
                if href and 'http' in href and href not in d:
                    d[href] = True
                    # get the raw image URLs from the target page
                    for imageurl in fetch_image_urls_util(href, driver_path):
                        if imageurl is not None:
                            # add the image URL to the set
                            image_urls.add(imageurl)
            # the count of extracted URLs
            image_count2 = len(image_urls)
            # save every 10th of the max URLs
            if image_count2 >= max_links_to_fetch/10:
                print(f"Found: {len(image_urls)} image links, saving!")
                try:    
                    for elem in image_urls:

                        if persist_image(target_folder,elem):
                            total_saved += 1
                except Exception as e:
                    print(e)
                # reset the set of URLs
                image_urls = set()
                d = {}
            # total URL count
            image_count += image_count2
            print("Total: ", total_saved, " images!")
            if total_saved >= max_links_to_fetch:
                break

        # if we reach our target number we stop 
        if total_saved >= max_links_to_fetch:
            print(f"Found: {total_saved} image links, done!")
            break
        else:
            # otherwise we stop after this pass and hand back the links that are not saved yet
            print("Found:", image_count, "image links, looking for more ...")
            time.sleep(30)
            return image_urls
            # NOTE: the lines below never run because of the return above;
            # they show how the "load more" button would be pressed to look for more images
            load_more_button = wd.find_element_by_css_selector(".mye4qd")
            if load_more_button:
                wd.execute_script("document.querySelector('.mye4qd').click();")

        # move the result startpoint further down
        results_start = image_count

    print(len(image_urls))
    return image_urls


# takes the image URL and saves it on disk; returns True only if a new file was written
def persist_image(folder_path:str,url:str):
    try:
        image_content = requests.get(url).content

    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        # nothing was downloaded, so there is nothing to save
        return False

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        # name the file after a hash of its content so that identical images collide
        file_path = os.path.join(folder_path,hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')

        if path.exists(file_path):
            print(f"DUPLICATE - image {url} already exists")
            return False

        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
        return True
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
        return False
        
  
# takes the search query and the number of images and fetches them    
def search_and_download(search_term:str,driver_path:str,target_path='./datasets',number_images=50):
    target_folder = os.path.join(target_path,'_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path, options=chrome_options) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5,driver_path= driver_path,target_path= target_path,search_term=search_term)
    try:    
        for elem in res:
            persist_image(target_folder,elem)
    except Exception as e:
        print(e)
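
The naming scheme used by `search_and_download` and `persist_image` can be tried in isolation. The sketch below repeats only the two naming steps: the class folder is the lowercased search term joined by underscores, and the file name is the first 10 hex characters of the SHA-1 hash of the image bytes. The helper names `folder_name_for` and `file_path_for` and the byte strings are made up for this illustration.

```python
import hashlib
import os


def folder_name_for(search_term):
    # same rule as in search_and_download: lowercase, spaces become underscores
    return "_".join(search_term.lower().split(" "))


def file_path_for(folder_path, image_content):
    # same rule as in persist_image: first 10 hex characters of the SHA-1 digest
    digest = hashlib.sha1(image_content).hexdigest()[:10]
    return os.path.join(folder_path, digest + ".jpg")


folder = os.path.join("./datasets", folder_name_for("Serena Williams"))

# placeholder bytes standing in for downloaded image content
first = file_path_for(folder, b"fake image bytes")
second = file_path_for(folder, b"fake image bytes")
other = file_path_for(folder, b"different bytes")

print(folder)
print("same content, same path:", first == second)
print("different content, same path:", first == other)
```

Because the file name depends only on the content, downloading the same image from two different URLs produces the same path, which is how `persist_image` detects duplicates.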

Finally, we call the `search_and_download` function to get our images. Feel free to change the names in the query to any 5 people you want. They should be popular, so that plenty of photos are available, and their photos should clearly show a face with two eyes.

<font color="orange">N.B. This is a really slow process and it might take a serious amount of time</font><br>



In [None]:
query = ["Serena Williams", "Lionel Messi", "Maria Sharapova", "Roger Federer", "Virat Kohli"]

for q in query:
    search_and_download(q, 'chromedriver', number_images=50)

Your dataset directory should keep each class in its own folder. The directory should look like this in the end (with your class names instead):<br>
![Data](https://i.imgur.com/4qNtole.png)<br><br>
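
One way to check that the folder layout is right is to count the images in each class folder. The sketch below is only an illustration: the `count_images_per_class` helper and the list of extensions are assumptions, and the demo runs on a throwaway directory with empty placeholder files rather than on your real dataset.

```python
import tempfile
from pathlib import Path

# assumed set of image extensions; adjust it to match your files
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(dataset_root):
    """Return {class_folder_name: number_of_image_files} for one dataset folder."""
    counts = {}
    for class_dir in sorted(Path(dataset_root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


# demo on a temporary directory that mimics the expected layout
with tempfile.TemporaryDirectory() as root:
    for name, n in [("lionel_messi", 3), ("serena_williams", 2)]:
        class_dir = Path(root) / name
        class_dir.mkdir()
        for i in range(n):
            (class_dir / f"{i}.jpg").write_bytes(b"")
    demo_counts = count_images_per_class(root)

print(demo_counts)
```

On your own data you would call `count_images_per_class` with the path of your dataset folder and check that every class has roughly the 50 images you aimed for.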

Since we will be using this dataset for the rest of the project, it is wise to upload it to your Google Drive so you don't have to upload it every time you open your notebooks. If the dataset is not already available locally, download it first. Colab does not allow us to download folders, so you will have to zip it first.

In [None]:
# zip -r <zipped file destination> <folder to be zipped>
# search_and_download saves to ./datasets by default, which is /content/datasets in Colab
!zip -r /content/dataset.zip /content/datasets
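
If you would rather stay in Python than use the shell command, the standard library's `shutil.make_archive` can produce the same kind of zip file. This is a sketch under assumptions: the `zip_folder` helper is hypothetical, and the demo zips a temporary folder with one placeholder file instead of your real dataset.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder, zip_base_name):
    """Zip `folder` (keeping its top-level name) into `<zip_base_name>.zip`."""
    folder = Path(folder)
    # root_dir is the parent and base_dir the folder itself, so paths inside
    # the archive start with the folder name
    return shutil.make_archive(
        str(zip_base_name), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )


# demo on a temporary directory with one placeholder file
with tempfile.TemporaryDirectory() as tmp:
    class_dir = Path(tmp) / "datasets" / "lionel_messi"
    class_dir.mkdir(parents=True)
    (class_dir / "0.jpg").write_bytes(b"placeholder")

    archive_path = zip_folder(Path(tmp) / "datasets", Path(tmp) / "dataset")
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```

`make_archive` appends the `.zip` extension itself, which is why the base name is passed without it.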

Now download the zipped dataset by right-clicking on it.

![Download](https://i.imgur.com/Nfj161Y.png)<br><br>

Finally, unzip the file and upload the folder to your Google Drive.

Congratulations! You have successfully created your own image dataset. The most confusing part of this project is over! In the next part, we will clean our dataset and extract faces from the images. In the subsequent notebooks, you will have the option to use a pre-made dataset, so you will be able to progress even if you did not finish this notebook.<br><br>

Enjoy the rest of your day! ❤️<br><br>

<center><img src="https://i.pinimg.com/originals/55/f5/fd/55f5fdc9455989f8caf7fca7f93bd96a.gif" width="30%">