# Amazon Price & Image Scraper 

Use this notebook to scrape spot instance prices and the images available to instances in each region, and to save the results. With this data, users can switch between regions easily without worrying about changes in price or image ID.

Both resources rely on JavaScript interactions, which is why a Selenium-driven web browser is used to scrape them.

In [1]:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import re 

def launch_driver(wait=10):
    chromeDriver = "C:/Webdriver/chromedriver.exe"                # set the driver path 
    driver = webdriver.Chrome(executable_path=chromeDriver)       # launch the driver 
    driver.implicitly_wait(wait)                                  # wait up to `wait` seconds for an element to appear before raising an error

    return driver 

## Spot-Instance Prices

**Launch the driver**:

In [94]:
driver = launch_driver()

**Open the spot instance pricing page**:

In [95]:
driver.get('https://aws.amazon.com/ec2/spot/pricing/')

**Run the scraper**:

In [100]:
# open the region dropdown (click() returns None, so there is nothing to keep)
driver.find_elements_by_class_name('btn-dropdown')[0].click()

regions = driver.find_elements_by_class_name('dropdown-opened')[0].find_elements_by_tag_name('li')
region_names = [region.text for region in regions]

data = {}

headers = ['instance_type', 'linux_price', 'windows_price', 'region']
for header in headers:
    data[header] = []

for opt, region in enumerate(region_names):

    if opt != 0:
        # for every region after the first, reopen the dropdown and re-find its options
        driver.find_elements_by_class_name('btn-dropdown')[0].click()
        regions = driver.find_elements_by_class_name('dropdown-opened')[0].find_elements_by_tag_name('li')
        
    regions[opt].click()
    
    soup = BeautifulSoup(driver.page_source)
    
    tables = soup.find_all('table')
    
    small_tables = tables[0].find_all('tbody')
    
    for small_table in small_tables: 

        rows = small_table.find_all('tr')

        for row in rows[1:]:
            for idx, val in enumerate(row.find_all('td')):
                data[headers[idx]].append(val.text)

            data['region'].append(region)

# `data` now maps each header to a column of scraped values

**Save the data as a .csv file**: 

In [180]:
pd.DataFrame(data).to_csv('spot_instance_pricing.csv')
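
The saved prices are still raw text, so they need to be converted to numbers before regions can be compared. The sketch below is not part of the original notebook: it assumes each price cell looks something like `$0.0035 per Hour` and that unavailable prices contain no dollar amount. The helpers `parse_price` and `cheapest_region` are hypothetical names, and the sample rows are made-up values for illustration only, not real AWS prices.

```python
import re

# Assumed price format: a dollar sign followed by a decimal number.
PRICE = re.compile(r'\$\s*([0-9]+(?:\.[0-9]+)?)')


def parse_price(text):
    """Return the dollar amount in `text` as a float, or None if there is none."""
    match = PRICE.search(text)
    return float(match.group(1)) if match else None


def cheapest_region(rows, instance_type, price_column='linux_price'):
    """Return (price, region) for the cheapest region offering `instance_type`."""
    candidates = []
    for row in rows:
        if row['instance_type'] != instance_type:
            continue
        price = parse_price(row[price_column])
        if price is not None:
            candidates.append((price, row['region']))
    return min(candidates) if candidates else None


# Made-up rows that mimic the columns written to spot_instance_pricing.csv.
sample_rows = [
    {'instance_type': 'a1.medium', 'linux_price': '$0.0084 per Hour', 'region': 'Region A'},
    {'instance_type': 'a1.medium', 'linux_price': '$0.0049 per Hour', 'region': 'Region B'},
    {'instance_type': 'a1.medium', 'linux_price': 'N/A*', 'region': 'Region C'},
]
print(cheapest_region(sample_rows, 'a1.medium'))
```

To run this on the scraped data, the rows could be loaded with `pd.read_csv('spot_instance_pricing.csv').to_dict('records')`.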

## Preset Images 

Collecting the default images is more challenging than scraping the prices because you must log in to an AWS dashboard to enter the launch instance process in each region. Input your username and password below to run this scraper. 
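
One way to avoid leaving credentials in a saved notebook is to read them from environment variables. This is a minimal sketch and not part of the original notebook: the variable names `AWS_CONSOLE_USERNAME` and `AWS_CONSOLE_PASSWORD` and the helper `load_credentials` are hypothetical and would need to match however you choose to set them.

```python
import os


def load_credentials(env=None):
    """Read the console username and password from environment variables.

    The variable names are illustrative assumptions, not an AWS convention.
    """
    env = os.environ if env is None else env
    username = env.get('AWS_CONSOLE_USERNAME')
    password = env.get('AWS_CONSOLE_PASSWORD')
    if not username or not password:
        raise RuntimeError(
            'Set AWS_CONSOLE_USERNAME and AWS_CONSOLE_PASSWORD before running the scraper'
        )
    return username, password
```

With both variables set, `username, password = load_credentials()` could stand in for the hard-coded assignments in the next cell.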

In [None]:
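# replace the placeholders with your own credentials; do not share the notebook with real values filled in
username = 'your-email@example.com'
password = 'your-password'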

driver = launch_driver()

driver.get('https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin')

driver.find_element_by_id('resolving_input').send_keys(username)
driver.find_element_by_id('next_button').click()
driver.find_element_by_id('password').send_keys(password)
driver.find_element_by_id('signin_button').click()

# The driver might request a Captcha verification at this point 

**CAPTCHA NOTIFICATION**: at the end of the previous block the driver should be on the logged-in landing page. If it is stuck on a CAPTCHA instead, complete the CAPTCHA manually; the following cell block should then run without a problem. Note that the next cell block uses the same driver.

**<br>Scrape images from the launcher**:
<br>The image scraper loops across regions, so it will stall on any region you don't have access to.
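
The scraping cell retries each step in `while` loops that never give up, which is why an inaccessible region stalls the run. A bounded retry is one possible alternative. The sketch below is not part of the original notebook; `retry` and its parameters are hypothetical names, and the default of five attempts is an arbitrary choice.

```python
import time


def retry(action, attempts=5, delay=1.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` calls have failed.

    `sleep` is injectable so the helper can be exercised without real waiting.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # KeyboardInterrupt is not caught here
            last_error = error
            if attempt < attempts:
                sleep(delay)
    raise RuntimeError(f'gave up after {attempts} attempts') from last_error
```

A region loop could then wrap each page interaction in `retry(...)`, catch the `RuntimeError`, and move on to the next region instead of waiting indefinitely.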

In [42]:
import time

driver.find_element_by_id('nav-regionMenu').click()
region_menu = driver.find_element_by_id('regionMenuContent')
region_menu = region_menu.find_elements_by_tag_name('a')
region_names = [x.text for x in region_menu]

columns = ['image_name','image_id','region']
image_data = {} 
for col in columns: 
    image_data[col] = []

for opt, region in enumerate(region_names):

    print('Working on region %s' % region)

    if opt != 0: 
        driver.find_element_by_id('nav-regionMenu').click()
        region_menu = driver.find_element_by_id('regionMenuContent')
        region_menu = region_menu.find_elements_by_tag_name('a')
        
    if region == '':
        continue 
        
    region_menu[opt].click()
    
    in_launcher = False 
    while not in_launcher: 
        try: 
            if opt == 0: 
                driver.find_element_by_id('EC2').click()
                driver.get(driver.current_url.split('#')[0]+'#Instances:')
                driver.find_element_by_class_name('gwt-Button').click()
                in_launcher = True 
            else: 
                in_launcher = True 
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            time.sleep(1)
            
    complete=False 
    attempt = 1 
    while not complete: 
        try:
            time.sleep(10)
            soup = BeautifulSoup(driver.page_source)
            ami_list = soup.find_all('div', {'id':'gwt-debug-myAMIList'})[0]
            page_loaded=True
        
            children = [x for x in ami_list.children]
            children = [x for x in children[1].children]
            children = [x for x in children[0].children]

            assert len(children)>0
            
            for child in children:
                try:
                    # look up both tags before appending so the columns stay the same length
                    spans = child.find_all('span')
                    name_tag, id_tag = spans[4], spans[5]
                except Exception:
                    # skip children that are not image entries
                    continue

                # store the raw <span> tags; their text is extracted in a later cell
                image_data['image_name'].append(name_tag)
                image_data['image_id'].append(id_tag)
                image_data['region'].append(region)
                
            complete = True 
                
        except Exception as e:
            attempt+=1
            time.sleep(1)

Working on region US East (N. Virginia)us-east-1
Working on region US East (Ohio)us-east-2
Working on region US West (N. California)us-west-1
Working on region US West (Oregon)us-west-2
Working on region Africa (Cape Town)af-south-1
Working on region Asia Pacific (Hong Kong)ap-east-1
Working on region Asia Pacific (Mumbai)ap-south-1
Working on region Asia Pacific (Seoul)ap-northeast-2
Working on region Asia Pacific (Singapore)ap-southeast-1
Working on region Asia Pacific (Sydney)ap-southeast-2
Working on region Asia Pacific (Tokyo)ap-northeast-1
Working on region Canada (Central)ca-central-1
Working on region Europe (Frankfurt)eu-central-1
Working on region Europe (Ireland)eu-west-1
Working on region Europe (London)eu-west-2
Working on region Europe (Milan)eu-south-1
Working on region Europe (Paris)eu-west-3
Working on region Europe (Stockholm)eu-north-1
Working on region Middle East (Bahrain)me-south-1


In [59]:
dta = pd.DataFrame(image_data)

dta['image_name'] = dta['image_name'].apply(lambda d: d.text)
dta['image_id'] = dta['image_id'].apply(lambda d: d.text)
dta['image_id'] = dta['image_id'].apply(lambda d: re.findall('(ami-[A-Za-z0-9]*)', d)[0])

dta = dta.sort_values('image_name')

Unnamed: 0,image_name,image_id,region
0,"Amazon Linux 2 AMI (HVM), SSD Volume Type",ami-0323c3dd2da7fb37d,US East (N. Virginia)us-east-1
468,"Amazon Linux 2 AMI (HVM), SSD Volume Type",ami-06ce3edf0cff21f07,Europe (Frankfurt)eu-central-1
156,"Amazon Linux 2 AMI (HVM), SSD Volume Type",ami-0d6621c01e8c2de2c,Africa (Cape Town)af-south-1
117,"Amazon Linux 2 AMI (HVM), SSD Volume Type",ami-0d6621c01e8c2de2c,US West (Oregon)us-west-2
507,"Amazon Linux 2 AMI (HVM), SSD Volume Type",ami-01a6e31ac994bbc09,Europe (Ireland)eu-west-1
...,...,...,...
542,"Ubuntu Server 20.04 LTS (HVM), SSD Volume Type",ami-0917237b4e71c5759,Europe (Ireland)eu-west-1
154,"Ubuntu Server 20.04 LTS (HVM), SSD Volume Type",ami-09dd2e08d601bff67,US West (Oregon)us-west-2
347,"Ubuntu Server 20.04 LTS (HVM), SSD Volume Type",ami-0a1a4d97d4af3009b,Asia Pacific (Singapore)ap-southeast-1
511,"Ubuntu Server 20.04 LTS (HVM), SSD Volume Type",ami-0917237b4e71c5759,Europe (Ireland)eu-west-1
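
In the output above, each region value runs the display name and the region code together (for example `US East (N. Virginia)us-east-1`). A regular expression can split the two. This sketch is not part of the original notebook: `split_region` is a hypothetical helper, and the pattern assumes region codes are lowercase, hyphen-separated, and end in a number, as they do in the output shown here.

```python
import re

# Lazy display name, optional whitespace, then a code such as us-east-1.
REGION_LABEL = re.compile(r'^(?P<name>.*?)\s*(?P<code>[a-z]{2}(?:-[a-z]+)+-\d+)$')


def split_region(label):
    """Split a fused region label into (display_name, region_code).

    Returns (label, None) when no region code is found at the end.
    """
    label = label.strip()
    match = REGION_LABEL.match(label)
    if match is None:
        return label, None
    return match.group('name'), match.group('code')


print(split_region('US East (N. Virginia)us-east-1'))
```

Applied to the data frame, something like `dta['region'].apply(lambda r: split_region(r)[1])` would give a column of region codes.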


In [None]:
columns = ['image_name','image_id','region']
image_data = {} 
for col in columns: 
    image_data[col] = []

soup = BeautifulSoup(driver.page_source)
ami_list = soup.find_all('div', {'id':'gwt-debug-myAMIList'})[0]
page_loaded=True

children = [x for x in ami_list.children]
children = [x for x in children[1].children]
children = [x for x in children[0].children]

assert len(children)>0

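for child in children:
    try:
        spans = child.find_all('span')
        image_name = spans[4].text
        image_id = re.findall(r'[\s?](ami-[A-Za-z0-9]*)[\s?]', spans[5].text)[0]
    except (AttributeError, IndexError):
        # not an image entry: too few spans, or no AMI id in the text
        continue

    image_data['image_name'].append(image_name)
    image_data['image_id'].append(image_id)
    image_data['region'].append(region)  # `region` is left over from the loop in the earlier cell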

In [37]:
driver.find_element_by_id('nav-regionMenu').click()
region_menu = driver.find_element_by_id('regionMenuContent')
region_menu = region_menu.find_elements_by_tag_name('a')


In [41]:
soup = BeautifulSoup(driver.page_source)
ami_list = soup.find_all('div', {'id':'gwt-debug-myAMIList'})[0]
page_loaded=True

children = [x for x in ami_list.children]
children = [x for x in children[1].children]
children = [x for x in children[0].children]

for c, child in enumerate(children): 
    try: 
        print(c)
        spans = child.find_all('span')
        print(spans[4].text)
        print(spans[5].text)
    except Exception:
        # skip children that are not image entries (index 5 in the output below)
        continue

0
Amazon Linux 2 AMI (HVM), SSD Volume Type
 - ami-003449ffb2605a74c
1
Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type
 - ami-03e1e4abf50e14ded
2
Red Hat Enterprise Linux 8 (HVM), SSD Volume Type
 - ami-00e63b4959e1a98b7
3
SUSE Linux Enterprise Server 15 SP1 (HVM), SSD Volume Type
 - ami-02a8f447f39e5f0d3
4
Ubuntu Server 18.04 LTS (HVM), SSD Volume Type
 - ami-077d5d3682940b34a
5
6
Microsoft Windows Server 2019 Base
 - ami-0b3d9fa7386b999a4
7
Deep Learning AMI (Ubuntu 18.04) Version 28.0
 - ami-0eb206f610b80ed4b
8
Deep Learning AMI (Ubuntu 16.04) Version 28.0
 - ami-06d2949bae3658531
9
Ubuntu Server 16.04 LTS (HVM), SSD Volume Type
 - ami-0bb677666cd3fd188
10
Deep Learning AMI (Amazon Linux 2) Version 28.0
 - ami-05fd37528a9c1bb92
11
Deep Learning Base AMI (Ubuntu 18.04) Version 23.0
 - ami-032b2c1db65dcdf5f
12
Microsoft Windows Server 2019 Base with Containers
 - ami-0f48a395d6552bf3f
13
Microsoft Windows Server 2019 with SQL Server 2017 Standard
 - ami-097d07d98953037d9
14
Microsof