# Amazon Price & Image Scraper 

Use this notebook to scrape spot instance prices and the images available to instances in each region, then save the data. With this data, users can switch between regions easily without worrying about changes in price or image ID.

Both resources rely on JavaScript interactions, which is why a Selenium-driven web browser is used to scrape the websites.

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import re 

def launch_driver(wait=10):
    chromeDriver = "C:/Webdriver/chromedriver.exe"                # set the driver path 
    driver = webdriver.Chrome(executable_path=chromeDriver)       # launch the driver 
    driver.implicitly_wait(wait)                                  # wait up to `wait` seconds for an element to appear before raising an error

    return driver 

## Spot-Instance Prices

**Launch the driver**:

In [94]:
driver = launch_driver()

**Get the spot connect pricing website**:

In [95]:
driver.get('https://aws.amazon.com/ec2/spot/pricing/')

**Run the scraper**:

In [100]:
# Open the region dropdown; click() returns None, so its result is not stored
driver.find_elements_by_class_name('btn-dropdown')[0].click()

# Read the region options from the opened dropdown
regions = driver.find_elements_by_class_name('dropdown-opened')[0].find_elements_by_tag_name('li')
region_names = [region.text for region in regions]

data = {}

headers = ['instance_type', 'linux_price', 'windows_price', 'region']
for header in headers:
    data[header] = []

for opt, region in enumerate(region_names):

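    if opt!=0: 
        # The dropdown closes after each selection, so reopen it and re-find the options
        driver.find_elements_by_class_name('btn-dropdown')[0].click()
        regions = driver.find_elements_by_class_name('dropdown-opened')[0].find_elements_by_tag_name('li')
        
    regions[opt].click()
    
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(driver.page_source, 'html.parser')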
    
    tables = soup.find_all('table')
    
    small_tables = tables[0].find_all('tbody')
    
    for small_table in small_tables: 

        rows = small_table.find_all('tr')

        for row in rows[1:]:
            for idx, val in enumerate(row.find_all('td')):
                data[headers[idx]].append(val.text)

            data['region'].append(region)

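The loop above appends each cell to a per-column list, which assumes every table row has exactly three cells. If a row had fewer cells, the columns would end up with unequal lengths and `pd.DataFrame(data)` would fail. The following is a minimal sketch of a row-wise alternative. The helper name `rows_to_records` and the sample rows are hypothetical, and the cell texts are made-up placeholders, not real prices.

```python
HEADERS = ['instance_type', 'linux_price', 'windows_price']

def rows_to_records(rows, region, headers=HEADERS):
    """Turn lists of cell texts into one dict per row, skipping malformed rows."""
    records = []
    for cells in rows:
        # Skip rows whose cell count does not match the expected columns
        if len(cells) != len(headers):
            continue
        record = dict(zip(headers, cells))
        record['region'] = region
        records.append(record)
    return records

# Hypothetical sample: one well-formed row and one malformed row
sample_rows = [
    ['a1.medium', '$0.0100 per Hour', 'N/A*'],
    ['General Purpose'],
]
records = rows_to_records(sample_rows, 'Example Region')
print(records)
```

In the scraper, each entry of `rows` would be something like `[td.text for td in row.find_all('td')]`, and the resulting list of dicts can be passed directly to `pd.DataFrame`.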

**Save the data as a .csv file**: 

In [180]:
pd.DataFrame(data).to_csv('spot_instance_pricing.csv')
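The scraped price cells are saved as raw text. If numeric prices are needed for sorting or comparison, they could be parsed before saving. The sketch below assumes a cell looks like `$0.0035 per Hour` or a non-numeric marker such as `N/A*`; that format is an assumption, not something confirmed by this notebook, and `parse_price` is a hypothetical helper.

```python
import re

# Assumed format: a dollar sign followed by a decimal number
_PRICE_RE = re.compile(r'\$\s*([0-9]+(?:\.[0-9]+)?)')

def parse_price(text):
    """Return the first dollar amount in `text` as a float, or None if absent."""
    match = _PRICE_RE.search(text)
    return float(match.group(1)) if match else None

print(parse_price('$0.0035 per Hour'))
print(parse_price('N/A*'))
```

Returning `None` for unparseable cells keeps unavailable prices distinguishable from a price of zero.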

## Preset Images 

Collecting the default images is more challenging than scraping the prices because you must log in to an AWS dashboard to enter the launch instance process in each region. Input your username and password below to run this scraper. 

In [None]:
username = 'your-email@example.com'   # replace with your AWS account email
password = 'your-password'            # replace with your password; never commit real credentials

driver = launch_driver()

driver.get('https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin')

driver.find_element_by_id('resolving_input').send_keys(username)
driver.find_element_by_id('next_button').click()
driver.find_element_by_id('password').send_keys(password)
driver.find_element_by_id('signin_button').click()

# The driver might request a Captcha verification at this point 

**CAPTCHA NOTIFICATION**: At the end of the previous block, the driver should be on the logged-in landing page. If it is stuck on a CAPTCHA instead, complete the CAPTCHA manually; the following cell block should then run without a problem.
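Hard-coding credentials in a notebook cell makes them easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables and falls back to an interactive prompt. The variable names `AWS_CONSOLE_USERNAME` and `AWS_CONSOLE_PASSWORD` and the helper `read_credentials` are hypothetical choices for illustration.

```python
import os
from getpass import getpass

def read_credentials(env=os.environ, prompt=getpass):
    """Read console credentials from `env`, prompting for any that are missing."""
    # Hypothetical variable names; pick whatever suits your setup
    username = env.get('AWS_CONSOLE_USERNAME') or input('AWS console username: ')
    password = env.get('AWS_CONSOLE_PASSWORD') or prompt('AWS console password: ')
    return username, password

# Demonstration with a stand-in environment so nothing is prompted
demo_env = {
    'AWS_CONSOLE_USERNAME': 'user@example.com',
    'AWS_CONSOLE_PASSWORD': 'not-a-real-password',
}
demo_username, demo_password = read_credentials(env=demo_env)
print(demo_username)
```

In the notebook itself, `username, password = read_credentials()` would use the real environment and prompt for anything that is not set.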

**Scrape images from the launcher**:

The image scraper loops over every region, so it will stall on any region your account cannot access.

In [None]:
import time

driver.find_element_by_id('nav-regionMenu').click()
region_menu = driver.find_element_by_id('regionMenuContent')
region_menu = region_menu.find_elements_by_tag_name('a')
region_names = [x.text for x in region_menu]

columns = ['image_name','image_id','region']
image_data = {} 
for col in columns: 
    image_data[col] = []

for opt, region in enumerate(region_names):

    print('Working on region %s' % region)

    if opt != 0: 
        driver.find_element_by_id('nav-regionMenu').click()
        region_menu = driver.find_element_by_id('regionMenuContent')
        region_menu = region_menu.find_elements_by_tag_name('a')
        
    if region == '':
        continue 
        
    region_menu[opt].click()
    
    in_launcher = False 
    while not in_launcher: 
        try: 
            if opt == 0: 
                driver.find_element_by_id('EC2').click()
                driver.get(driver.current_url.split('#')[0]+'#Instances:')
                driver.find_element_by_class_name('gwt-Button').click()
                in_launcher = True 
            else: 
                in_launcher = True 
        except Exception:   # page not ready yet; a bare except would also swallow Ctrl-C
            time.sleep(1)
            
    complete=False 
    attempt = 1 
    while not complete: 
        try:
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            ami_list = soup.find_all('div', {'id':'gwt-debug-myAMIList'})[0]
        
            children = [x for x in ami_list.children]
            children = [x for x in children[1].children]
            children = [x for x in children[0].children]

            assert len(children)>0

            for child in children:
                try: 
                    image_name = child.find_all('span')[4].text

                    image_id = re.findall(r'[\s?](ami-[A-Za-z0-9]*)[\s?]', child.find_all('span')[5].text)[0]

                    image_data['image_name'].append(image_name)
                    image_data['image_id'].append(image_id)
                    image_data['region'].append(region)
                except Exception:   # skip entries that do not match the expected layout
                    continue
                
            complete = True 
                
        except Exception:   # page not loaded yet; wait and retry
            attempt+=1
            time.sleep(1)
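The retry loops above keep trying until the page loads, which means they never finish for a region that cannot be reached. A bounded retry is one way to avoid that. The sketch below is a generic helper, not part of the original scraper; the name `retry`, its parameters, and the `flaky` stand-in for a page load are all hypothetical.

```python
import time

def retry(action, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` tries have failed."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as error:
            # Remember the failure, pause, then try again
            last_error = error
            sleep(delay)
    raise TimeoutError(f'gave up after {attempts} attempts') from last_error

# Stand-in for a page that only loads on the third try
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('page not loaded yet')
    return 'loaded'

result = retry(flaky, attempts=5, delay=0)
print(result, calls['count'])
```

Wrapping the per-region scraping step in such a helper and catching `TimeoutError` in the region loop would let the scraper skip an inaccessible region and move on to the next one.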


In [25]:
pd.DataFrame(image_data).sort_values('image_name').pivot(index='image_name', 
                                                         columns='region',
                                                         values='image_id')

| image_name | US East (N. Virginia)us-east-1 | US West (Oregon)us-west-2 |
|---|---|---|
| Amazon Linux 2 AMI (HVM), SSD Volume Type | ami-0323c3dd2da7fb37d | ami-0d6621c01e8c2de2c |
| Red Hat Enterprise Linux 8 (HVM), SSD Volume Type | ami-098f16afa9edf40be | ami-02f147dfb8be58a10 |
| SUSE Linux Enterprise Server 15 SP1 (HVM), SSD Volume Type | ami-0068cd63259e9f24c | ami-0b9c71b41cc33f180 |
| Ubuntu Server 16.04 LTS (HVM), SSD Volume Type | ami-039a49e70ea773ffc | ami-008c6427c8facbe08 |
| Ubuntu Server 18.04 LTS (HVM), SSD Volume Type | ami-085925f297f89fce1 | ami-003634241a8fcdec0 |
| Ubuntu Server 20.04 LTS (HVM), SSD Volume Type | ami-068663a3c619dd892 | ami-09dd2e08d601bff67 |

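The scraped region labels fuse the display name with the region code (for example `US East (N. Virginia)us-east-1`), which is awkward for lookups. The sketch below shows one way to split off the region code and build a nested lookup from `image_data`, assuming every label ends with a code shaped like `us-east-1`. The helpers `region_code` and `build_lookup` are hypothetical, and the sample values are taken from the table above.

```python
import re

# Assumed shape: two letters, one or more "-word" parts, then "-digits" at the end
REGION_CODE_RE = re.compile(r'([a-z]{2}(?:-[a-z]+)+-\d+)$')

def region_code(label):
    """Return the trailing region code of `label`, or the label itself if none is found."""
    match = REGION_CODE_RE.search(label)
    return match.group(1) if match else label

def build_lookup(image_data):
    """Map image_name -> {region code -> image_id}."""
    lookup = {}
    rows = zip(image_data['image_name'], image_data['image_id'], image_data['region'])
    for name, image_id, region in rows:
        lookup.setdefault(name, {})[region_code(region)] = image_id
    return lookup

image_data = {
    'image_name': ['Ubuntu Server 20.04 LTS (HVM), SSD Volume Type'] * 2,
    'image_id': ['ami-068663a3c619dd892', 'ami-09dd2e08d601bff67'],
    'region': ['US East (N. Virginia)us-east-1', 'US West (Oregon)us-west-2'],
}
lookup = build_lookup(image_data)
print(lookup['Ubuntu Server 20.04 LTS (HVM), SSD Volume Type']['us-west-2'])
```

With such a lookup, switching regions amounts to reading the same image name under a different region code.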