#### DOMAIN: Automobile
- CONTEXT: A brand research company wants to understand which cars or car manufacturers are popular in a given area of a city or locality. The company has a team that takes pictures of cars at random throughout the day. Using these pictures, the company wants to set up an automated system that can classify the make of a car when given its picture as input.
- TASK: Help build the image dataset that the AI team will use to train an image classifier. Import and display the images in Python against their labels. Comment on the challenges faced during this task.

#### Assumptions:

   - Company staff will take random images of cars in the city; these images will be fed to a model that predicts the class (make) of each car and reports it.
   - The AI team will create an image classifier model using an image dataset of cars. These images could show any make, with any background, color, body type (sedan, SUV, coupe, etc.) and many other variations.
   - Our job is to create a dataset of images with the related label or labels.

##### Image Data set creation strategy
- Using APIs to capture images and related tags
    - We used image search engines/APIs (Google/Bing/Flickr/Pinterest) to find images based on our searches and download them as needed.
    - Manual intervention is required for random or exhaustive evaluation of the data kept for training. A good model depends heavily on the kind of data it was trained on.
    - Data augmentation can create more data from a limited set of images; this step can also be done during model creation.
    - Organize the images by car brand/type, either in folders or in a dataset.
    - Import and visualize the captured and validated images to check whether they are what you want as training data or need further intervention.
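
Since the images are collected from keyword searches, one way to get more varied results is to generate several query variants per brand. The snippet below is only a sketch: `build_search_urls` and the example modifiers are hypothetical names introduced for illustration, and the base URL is the same Google image-search URL used later in this notebook.

```python
from urllib.parse import quote_plus

def build_search_urls(brand, modifiers):
    """Return one image-search URL per (brand, modifier) pair."""
    base = 'https://www.google.com/search?tbm=isch&q='
    urls = []
    for modifier in modifiers:
        # An empty modifier yields the plain "car <brand>" query
        query = 'car {} {}'.format(brand, modifier).strip()
        # quote_plus makes the query safe to place in a URL
        urls.append(base + quote_plus(query))
    return urls

# Hypothetical modifiers; choose whatever variations matter for the dataset
example_urls = build_search_urls('Audi', ['red', 'sedan', ''])
for url in example_urls:
    print(url)
```

Each variant could then be downloaded into the same brand folder, followed by a duplicate check, since different queries often return overlapping results.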

#### Challenges faced during dataset creation

    - A model performs only as well as the data it was trained on, so the quality of the dataset matters.
    - The options available to us were to take each image manually, download images, or crowdsource them through Pinterest or similar apps.
    - With our web-scraping approach, Google image search returns only about 300 images per query, so the search parameters have to be tweaked repeatedly to collect a large number of non-repeating images.
    - Even then, there is no guarantee that we will avoid duplicate or redundant images. These have to be found and deleted or edited manually.
    - One major issue with web scraping is that many images contain noisy text and watermarks.
    - In many cases a single image contains more than one brand, or multiple models of the same brand.
    - Legality of web-scraping images: the usage terms of many images must be read to determine whether they can be used for academic or business purposes. In most cases academic use is acceptable, but business use requires the respective agreements/contracts.
    - Searches return several file formats, such as .gif and .jpeg, which behave differently when loaded in Python.
    - Many images are watermarked or edited in Photoshop, and a few concept-car designs or sketch drawings also end up in the dataset.
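
Part of the manual duplicate hunt described above could be automated. The following is a minimal sketch, not what this notebook actually ran: `find_exact_duplicates` is a hypothetical helper that groups files whose bytes are identical by comparing SHA-256 digests. It only catches byte-for-byte copies; resized or re-compressed versions of the same picture would still need manual review.

```python
import hashlib
import os
from collections import defaultdict

def find_exact_duplicates(folder):
    """Group files in `folder` whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # ignore sub-folders
        with open(path, 'rb') as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        groups[digest].append(name)
    # Keep only the digests shared by more than one file
    return [names for names in groups.values() if len(names) > 1]
```

Calling it on one brand folder returns a list of filename groups; all but one file in each group can be deleted safely.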

### Importing necessary modules

In [None]:
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
import urllib.request
import time
import sys
import os



%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
plt.rcParams['figure.figsize'] = [16, 10]
plt.rcParams['font.size'] = 16

from tqdm import tqdm

import seaborn as sns
from keras.preprocessing import image


- Function for downloading images
    - Selenium is used for image scraping.
    - The Firefox driver launches the Firefox browser and searches Google with the keywords provided.
    - Selenium scrolls the page to load more images as needed, and the image paths are captured from the loaded page.
    - urllib is used to download the images.

In [None]:

    
    
def downloadImages(searchKeyword, folderPath, folderName):
    print("Images to download:",searchKeyword)
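    # URL-encode the keyword so spaces and special characters are safe in the query string
    site = 'https://www.google.com/search?tbm=isch&q='+urllib.parse.quote_plus(searchKeyword)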
    myPath=os.path.join(folderPath, folderName)
    if not os.path.isdir(myPath):
        os.makedirs(myPath)
    
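    # Raw string so the backslashes in the Windows path are taken literally.
    # executable_path is the Selenium 3 style; newer Selenium releases expect a Service object instead.
    driver = webdriver.Firefox(executable_path = r'F:\Webdrivers\geckodriver.exe')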

    driver.get(site)

    # Increase the number of scrolls to load more images
    i = 0

    while i<1:  
        driver.execute_script("window.scrollBy(0,document.body.scrollHeight)")

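        try:
            # Click the button at this XPath if it is present (presumably "Show more results")
            driver.find_element("xpath", "/html/body/div[2]/c-wiz/div[3]/div[1]/div/div/div/div/div[5]/input").click()
        except Exception:
            # The button is not always on the page; keep scrolling without it
            pass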
        time.sleep(5)
        i+=1

    soup = BeautifulSoup(driver.page_source, 'html.parser')


    # quit() closes the browser window and also ends the driver session
    driver.quit()

    img_tags = soup.find_all("img", class_="rg_i")


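    count = 0
    for img_tag in img_tags:
        try:
            # Every file is saved with a .jpg extension, whatever its real format
            fullfilename = os.path.join(myPath, str(count)+".jpg")

            urllib.request.urlretrieve(img_tag['src'], fullfilename)
            count+=1
            print("Number of images downloaded = "+str(count),end='\r')
        except Exception:
            # Skip thumbnails without a usable 'src' or whose download fails
            pass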
    print("Number of images downloaded = "+str(count))
    print("Downloaded Images are saved in : ", myPath)
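
Because `downloadImages` saves every file with a `.jpg` extension regardless of what was actually downloaded, it can be useful to check the real format afterwards. The sketch below is an illustration only: `sniff_image_format` is a hypothetical helper that inspects the well-known leading bytes (magic numbers) of JPEG, PNG, GIF and WebP files.

```python
def sniff_image_format(path):
    """Guess the real image format from the file's leading bytes."""
    with open(path, 'rb') as fh:
        header = fh.read(12)
    if header.startswith(b'\xff\xd8\xff'):
        return 'jpeg'
    if header.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    if header[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    if header[:4] == b'RIFF' and header[8:12] == b'WEBP':
        return 'webp'
    # Unknown or not an image (for example an HTML error page)
    return None
```

Files for which this returns `None`, or a format the downstream loader handles badly, are candidates for removal or conversion before the dataset is handed over.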

##### Creating data set with search keywords and folder name

In [None]:
data = [['car alfa romeo', 'AlfaRomeo'],
        ['car Audi', 'Audi'],
        ['car BMW', 'BMW'],
        ['car Bentley', 'Bentley'],
#        ['car Buick', 'Buick'],
#        ['car Cadillac', 'Cadillac'],
#        ['car Chevrolet', 'Chevrolet'],
        ['car Honda', 'Honda'],
        ['car Hyundai', 'Hyundai'],
        ['car Toyota', 'Toyota'],
        ['car Tesla', 'Tesla'],
        ['car Maruti Suzuki', 'MarutiSuzuki'],
        ['car Tata', 'Tata']        
       ] 

# Some brands are commented out above to reduce the download time

In [None]:
import pandas as pd
df = pd.DataFrame(data, columns = ['Search', 'FolderName']) 

In [None]:
df

In [None]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
datasetPath=r'F:\GreatLearning\AI\ComputerVision\Project\Project_dataSet_Creation_Cars'
downloadImages('car Audi',datasetPath ,'Audi')

In [None]:
df.apply(lambda row : downloadImages(row['Search'],datasetPath,row['FolderName']),axis=1)

##### Adding another category of cars, such as 'SUV'/'Hatchback', to see the results

In [None]:
downloadImages('car Hatchback',datasetPath ,'Hatchback')

In [None]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
rootdir=r'F:\GreatLearning\AI\ComputerVision\Project\Project_dataSet_Creation_Cars'
for file in os.listdir(rootdir):
    d = os.path.join(rootdir, file)
    if os.path.isdir(d):
        print(file)

In [None]:
# Sort the folder names so that category ids are assigned in the same order on every run
listOfCarBrands = sorted(file for file in os.listdir(rootdir) if os.path.isdir(os.path.join(rootdir, file)))
print(listOfCarBrands)
NUM_CATEGORIES = len(listOfCarBrands)

In [None]:
for carName in listOfCarBrands:
    print('{} {} images'.format(carName, len(os.listdir(os.path.join(rootdir, carName)))))

In [None]:
train = []
for category_id, category in enumerate(listOfCarBrands): 
    for file in os.listdir(os.path.join(rootdir, category)): 
        train.append(['{}/{}/{}'.format(rootdir, category, file), category_id, category]) 
        
train = pd.DataFrame(train, columns = ['file', 'category_id', 'category'])
print(train.head(5))
train.shape 

In [None]:
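def read_img(filepath, size):
    """Load the image at `filepath`, resize it to `size` and return it as an array."""
    # The paths stored in `train` already include rootdir, so they are used as-is
    img = image.load_img(filepath, target_size = size)
    img = image.img_to_array(img)
    return img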

In [None]:
# Using matplotlib for this

fig = plt.figure(1, figsize=(NUM_CATEGORIES, NUM_CATEGORIES)) 
grid = ImageGrid(fig, 111, nrows_ncols=(NUM_CATEGORIES, NUM_CATEGORIES), axes_pad=0.05)

for category_id, category in enumerate(listOfCarBrands):
    # One grid row per brand, showing up to NUM_CATEGORIES sample images
    files = train[train['category'] == category]['file'].values[:NUM_CATEGORIES]
    for col, filepath in enumerate(files):
        # Index by row and column so a brand with fewer images does not shift later rows
        ax = grid[category_id * NUM_CATEGORIES + col]
        img = read_img(filepath, (224,224)) # every image is resized to 224x224
        ax.imshow(img/255.) # scale pixel values to [0, 1] for display
        ax.axis('off')
        if col == len(files) - 1: # label the row next to its last image
            ax.text(250, 112, category, verticalalignment='center')
plt.show()

- Reference:
    - https://towardsdatascience.com/how-to-create-your-own-image-dataset-for-deep-learning-b53f1c22c443
    - https://medium.com/ai%C2%B3-theory-practice-business/build-image-dataset-from-scratch-7752e9e22162
    - https://medium.com/analytics-vidhya/create-your-own-real-image-dataset-with-python-deep-learning-b2576b63da1e
    - https://www.pyimagesearch.com/2018/04/09/how-to-quickly-build-a-deep-learning-image-dataset/
    - https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/
    - https://dev.to/dillir07/a-python-package-with-selenium-to-download-high-res-image-using-google-search-by-image-6ok