# Amazon Price & Image Scraper 

Use this notebook to scrape spot instance prices and the images available to instances in each region, then save the data. With this data, users can switch between regions easily without worrying about changes in price or image ID.

Both pages rely on JavaScript to produce their content, which is why a Selenium-driven web browser is used to scrape them.

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import re 

def launch_driver(wait=10):
    chromeDriver = "C:/Webdriver/chromedriver.exe"                # set the driver path 
    driver = webdriver.Chrome(executable_path=chromeDriver)       # launch the driver 
    driver.implicitly_wait(wait)                                  # wait up to `wait` seconds for an element to appear before raising an error

    return driver 

## Spot-Instance Prices

**Launch the driver**:

In [None]:
driver = launch_driver()

**Open the spot instance pricing page**:

In [None]:
driver.get('https://aws.amazon.com/ec2/spot/pricing/')

**Run the scraper**:

In [None]:
data = {}

headers = ['instance_type', 'linux_price', 'windows_price', 'region']
for header in headers:
    data[header] = []

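soup = BeautifulSoup(driver.page_source, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

example_num = 1  # counter so that only the first scraped row is printed as a sanity check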

region_tables = soup.find_all('div', {'class':'regions'})

# for region_table in region_tables
region_table = region_tables[0]

region_price_tables = region_table.find_all('div', {'class':'content'})

for table in region_price_tables: 

    table = table.find_all('table')[0]

    region_name = table.find_all('caption')[0].text

    rows = table.find_all('tr', {'class':'sizes'})

    for row in rows:
        for idx, val in enumerate(row.find_all('td')):
            data[headers[idx]].append(val.text)
            if example_num==1: 
                print(headers[idx], val.text)
        example_num +=1 

        data['region'].append(region_name)

**Save the data as a .csv file**: 

In [None]:
pd.DataFrame(data).to_csv('spot_instance_pricing.csv')
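
The scraper stores every price cell as raw text, so the saved prices are strings rather than numbers. The following is a minimal sketch of how they could be converted before comparing regions. It assumes cells look like `$0.0035 per Hour` or `N/A*`, which this notebook does not confirm, and `parse_price` is a hypothetical helper that is not part of the original code.

```python
import re

# Assumed cell formats (not confirmed by the notebook): "$0.0035 per Hour" or "N/A*".
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)")

def parse_price(cell_text):
    """Return the dollar amount in the cell as a float, or None if there is none."""
    match = PRICE_PATTERN.search(cell_text)
    return float(match.group(1)) if match else None

# Placeholder values for illustration only, not real AWS prices.
examples = ["$0.0035 per Hour", "N/A*", "  $1.25 per Hour "]
parsed = [parse_price(text) for text in examples]
print(parsed)
```

Returning `None` for cells without a dollar amount keeps unavailable instance types distinguishable from a price of zero.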

## Preset Images 

Collecting the default images is more challenging than scraping the prices because you must log in to an AWS dashboard to enter the launch instance process in each region. Input your username and password below to run this scraper. 

In [None]:
import getpass

username = getpass.getpass('Username: ')
password = getpass.getpass('Password: ')

driver = launch_driver()

driver.get('https://signin.aws.amazon.com')

# Fill in the sign-in form

driver.find_element_by_id('resolving_input').send_keys(username)
driver.find_element_by_id('next_button').click()
driver.find_element_by_id('password').send_keys(password)
driver.find_element_by_id('signin_button').click()

# The driver might request a Captcha verification at this point 

**CAPTCHA NOTIFICATION**: At the end of the previous block the driver should be on the logged-in landing page. If it is stuck on a CAPTCHA instead, complete the CAPTCHA manually; the following cell should then run without a problem. Note that the next cell uses the same driver.

**<br>Scrape images from the launcher**:
<br>The image scraper loops over regions, so it will stop at any region you don't have access to.
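
The loop in the next cell builds each launch wizard URL from the region menu label by taking the text after the closing parenthesis. The sketch below isolates that step. It assumes menu labels look like `US East (N. Virginia)us-east-1`, which is inferred from the `split(')')` call rather than confirmed, and `region_code_from_label` and `launch_wizard_url` are hypothetical helper names.

```python
def region_code_from_label(label):
    """Return the text after the last ')' in a region menu label."""
    _, separator, code = label.rpartition(')')
    if not separator or not code.strip():
        raise ValueError('No region code found in label: %r' % label)
    return code.strip()

def launch_wizard_url(label):
    """Build the launch wizard URL the same way the scraping loop does."""
    base = 'https://console.aws.amazon.com/ec2/v2/home?region='
    return base + region_code_from_label(label) + '#LaunchInstanceWizard:'

# Assumed label format; the real menu text may differ.
print(launch_wizard_url('US East (N. Virginia)us-east-1'))
```

Raising a `ValueError` for an unexpected label makes a change in the menu layout visible immediately instead of producing a malformed URL.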

In [None]:
import time

driver.find_element_by_id('nav-regionMenu').click()
region_menu = driver.find_element_by_id('regionMenuContent')
region_menu = region_menu.find_elements_by_tag_name('a')
region_names = [x.text for x in region_menu]

columns = ['image_name','image_id','region']
image_data = {} 
for col in columns: 
    image_data[col] = []

error_list = [] 
for opt, region in enumerate(region_names):

    if region == '':
        continue 

    print('Working on region %s' % region)

    # Refresh the region menu 
    if opt != 0: 
        driver.find_element_by_id('nav-regionMenu').click()
        region_menu = driver.find_element_by_id('regionMenuContent')
        region_menu = region_menu.find_elements_by_tag_name('a')
                
    # Navigate to the region dashboard
    [o for o in region_menu if region==o.text][0].click()
    
    # Navigate to the EC2 launcher 
    if opt==0:
        driver.find_element_by_id('EC2').click()
        driver.get(driver.current_url.split('#')[0]+'#Instances:')
        driver.find_element_by_class_name('gwt-Button').click()
    else: 
        driver.get('https://console.aws.amazon.com/ec2/v2/home?region='+region.split(')')[1]+'#LaunchInstanceWizard:')

    complete=False 
    attempt = 1 
    while not complete: 
        input('Press Enter to proceed once all the AMIs for the region have loaded')
        try:
            iframe = driver.find_element_by_id('instance-lx-gwt-frame') # <iframe id='instance-lx-gwt-frame'>

            # Switch to the table iframe
            driver.switch_to_frame(iframe)

            # Get the page source
            soup = BeautifulSoup(driver.page_source)

            # Switch back to the main frame
            driver.switch_to_default_content()

            # Get the AMI list object (if it is not found, this raises and the except block waits before retrying)
            ami_list = soup.find_all('div', {'id':'gwt-debug-myAMIList'})[0]
            complete = True  # the AMI list loaded, so stop retrying once this pass finishes

            operating_systems = ['Ubuntu','Windows','Linux']

            # Get each ami into a list
            ami_list = ami_list.find_all('div', {'__idx':re.compile('.*')})                

            for ix, ami in enumerate(ami_list): 

                # Find every element available and look for the operating system names to identify the description of the AMI
                spans = ami.find_all('span')

                for span in spans:
                    span_text = span.text        
                    has_operating_system = [ops for ops in operating_systems if ops in span_text]

                    if len(has_operating_system)>0:
                        description = span_text
                        ami_id = re.findall('(ami-[A-Za-z0-9]*)', ami.text)[0]            
                        image_data['image_name'].append(description)
                        image_data['image_id'].append(ami_id)
                        image_data['region'].append(region)
                        break 
                
        except Exception as e:
            if 'no such element' in str(e) and 'instance-lx-gwt-frame' in str(e):  # the launcher frame is missing, so give up on this region
                complete=True
            attempt+=1
            error_list.append(str(e)+' : '+str(attempt)+' '+region)
            print(str(attempt)+' ', end='')
            time.sleep(1)
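
The cell above pulls the image ID out of each AMI entry with a regular expression and indexes the first match, which raises an `IndexError` when the entry contains no ID. The following sketch shows a variant that returns `None` instead. `first_ami_id` is a hypothetical helper, the sample text and ID are placeholders, and the pattern uses `+` rather than the notebook's `*` so that a bare `ami-` is not accepted.

```python
import re

# Variant of the notebook's pattern; "+" requires at least one character after "ami-".
AMI_PATTERN = re.compile(r"ami-[A-Za-z0-9]+")

def first_ami_id(text):
    """Return the first AMI ID found in the text, or None if there is none."""
    match = AMI_PATTERN.search(text)
    return match.group(0) if match else None

# Placeholder entry text; the ID below is made up for illustration.
sample = "Ubuntu Server (HVM), SSD Volume Type - ami-0abc123def4567890 (64-bit x86)"
print(first_ami_id(sample))
print(first_ami_id("An entry without an image ID"))
```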

**Format and save the data**:

In [None]:
the_data = pd.DataFrame(image_data)
the_data.to_csv('ami_data.csv')
the_data 
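
Once `ami_data.csv` exists, switching regions comes down to looking up the image ID for the same image name in another region. The sketch below shows one way to do that with only the standard library, so it runs without the notebook's dependencies. The rows are made-up placeholders in the column layout written by the cell above, the region values are assumed to end with the region code, and `find_image_ids` is a hypothetical helper.

```python
import csv
import io

# Placeholder rows in the same column layout as ami_data.csv; the values are made up.
SAMPLE_CSV = """,image_name,image_id,region
0,Ubuntu Server (example),ami-00000000000000001,US East (N. Virginia)us-east-1
1,Microsoft Windows Server (example),ami-00000000000000002,US East (N. Virginia)us-east-1
2,Ubuntu Server (example),ami-00000000000000003,EU (Ireland)eu-west-1
"""

def find_image_ids(csv_file, region_code, name_fragment):
    """Return image IDs whose region ends with region_code and whose name contains name_fragment."""
    reader = csv.DictReader(csv_file)
    return [
        row["image_id"]
        for row in reader
        if row["region"].endswith(region_code)
        and name_fragment.lower() in row["image_name"].lower()
    ]

matches = find_image_ids(io.StringIO(SAMPLE_CSV), "eu-west-1", "ubuntu")
print(matches)
```

To use it on the real file, pass `open('ami_data.csv', newline='')` instead of the `io.StringIO` object.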