---
<center><u><h1>Selenium</h1></u></center>
---

---

Selenium is a portable software-testing framework for web applications. It provides a record/playback tool (Selenium IDE) for authoring tests without learning a test scripting language, and a test domain-specific language (Selenese) for writing tests in a number of popular programming languages, including C#, Groovy, Java, Perl, PHP, Python, Ruby and Scala. The tests can then run against most modern web browsers. Selenium runs on Windows, Linux, and macOS. It is open-source software released under the Apache 2.0 license, so web developers can download and use it free of charge.

## Setup

First, set up the Python bindings for Selenium.
Install them with pip:

```
pip install selenium
```

Next, download the Selenium driver for your browser. We'll assume that you are using Firefox, but the process is similar for any other browser.

* Go to the [geckodriver releases page](https://github.com/mozilla/geckodriver/releases).
* Scroll down to the downloads section and pick the right package for your system.
![alt text](images/1.jpg)
* Unpack the downloaded archive somewhere on your computer, say, your user's home folder, so that the executable is located at `~/geckodriver`.
* Now add the folder that contains the driver (not the driver file itself) to your `PATH`.
```
export PATH=$PATH:$HOME
```
* Then launch Jupyter Notebook again from the same shell, so that it picks up the new `PATH`.
```
jupyter notebook
```
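Before starting a browser, it can help to confirm that the driver really is reachable through `PATH`. The following is a minimal sketch that uses only the standard library; the helper name `find_driver` is our own and not part of Selenium.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver was not found on PATH; check the export line above")
else:
    print("geckodriver found at", path)
```

If the function returns `None` inside the notebook but the driver works in your terminal, the notebook was most likely started before `PATH` was changed.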
If you use a different browser, take a look at these links.

* [Google Chrome](https://sites.google.com/a/chromium.org/chromedriver/getting-started/)
* [Internet Explorer](https://github.com/SeleniumHQ/selenium/wiki/InternetExplorerDriver)




Let's now take a look at a very simple Selenium example. It opens the Firefox browser, goes to the Python website, waits 10 seconds and then closes the browser.

For any other browser refer to [this](http://selenium-python.readthedocs.io/api.html) page.

In [None]:
# First import our dependencies
from selenium import webdriver
import time

In [None]:
# Create instance of Firefox WebDriver
driver = webdriver.Firefox()
# Navigate to Python website.
driver.get("http://www.python.org")
# Wait 10 seconds
time.sleep(10)
# Close browser
driver.quit()

# Scraping TripAdvisor
TripAdvisor offers advice from millions of travelers, with 435 million reviews and opinions covering 6.8 million accommodations, restaurants and attractions, and a wide variety of travel choices and planning features -- checking more than 200 websites to help travelers find and book today's lowest hotel prices. TripAdvisor branded sites make up the largest travel community in the world, reaching 390 million average monthly unique visitors in 49 markets worldwide.

Now we want to scrape the [TripAdvisor](https://www.tripadvisor.com/) website, specifically the [list of hotels](https://www.tripadvisor.com/Hotels) for a few specific cities. We want to pick only the cities from the US and Canada whose names start with 'L' or 'N': Las Vegas, Los Angeles, New Orleans and New York.
![alt text](images/2.jpg)

Now let's create our Firefox driver and navigate to the TripAdvisor Hotels page.

In [None]:
# Create Firefox driver
driver = webdriver.Firefox()
# Go to webpage
driver.get("https://www.tripadvisor.com/Hotels")

Now we need to select exactly this list on the page and nothing else. To do so, we need some way to identify the element: open the Inspector in the browser's Developer Tools and find the element's class name.

## Firefox

* Click the menu button
* Click the Developer button
![alt text](images/3.jpg)
* Click the Inspector button
![alt text](images/4.jpg)


* Selecting objects in Firefox is simple; take a look at this animation:
![alt text](images/ff_howto.gif)

## Google Chrome

The steps are similar.
* Click the Menu button
* Hover over More tools
* Click Developer tools
![alt text](images/5.jpg)
* Click Elements
![alt text](images/6.jpg)


* Chrome howto animation:
![alt text](images/chrome_howto.gif)

Now you need to find the class of the required element.
![alt text](images/7.jpg)
<img src="images/8.jpg" style="width:50%"/>

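We can select the wanted element by its class name using the `find_element_by_class_name` method of our `driver`. Selenium offers several ways of selecting elements, a few of which we will use in this lesson. For further information, go to [this page](http://selenium-python.readthedocs.io/locating-elements.html). Note that the `find_element_by_*` helpers were removed in Selenium 4.3; in newer versions the equivalent call is `driver.find_element(By.CLASS_NAME, ...)`.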

In [None]:
# Put name of class below, replacing spaces by dots
# box typeA deals region wrap => box.typeA.deals.region.wrap
content = driver.find_element_by_class_name('box.typeA.deals.region.wrap')
print(content.text)
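
The replacement of spaces by dots described in the comment above can be done mechanically. Here is a small sketch; the helper name `class_attr_to_selector` is our own, introduced only for illustration.

```python
def class_attr_to_selector(class_attr):
    """Join the whitespace-separated class names of an element with dots."""
    return ".".join(class_attr.split())


print(class_attr_to_selector("box typeA deals region wrap"))
# box.typeA.deals.region.wrap
```

Using `split()` without arguments also takes care of repeated or trailing spaces that sometimes appear in a copied `class` attribute.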

This prints all the text of our element, one entry per line. Besides the entries we want, it also includes unwanted lines like
```
Hotels in Popular Destinations - Find Hotels & Motels Near You,
United States & Canada,
Europe,
Amsterdam Hotels,
...
```

So we can slice our list, selecting the start of the wanted entries and their end.

In [None]:
hotels = content.text.split("\n")
start = hotels.index('Atlanta Hotels')
end = hotels.index('Europe')

Now we create a new list of links, filtering out all unwanted entries and keeping only those which start with 'N' or 'L'.

In [None]:
city_links = []
# Find all <a> elements and loop through them
for elem in content.find_elements_by_xpath(".//a")[start:end]:
    # If the link text starts with 'N' or 'L' (case-insensitive)
    if elem.text.lower().startswith(('n', 'l')):
        # Cut the city name out of the link text by dropping 'Hotels'
        name = elem.text.split('Hotels')[0].strip()
        # Add (city name, URL) to the list of links
        city_links.append((name, elem.get_attribute('href')))

print("city_links:")        
print(city_links)
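
The filtering logic can be tried without a browser. The sketch below applies the same rule to plain `(text, url)` pairs; the function name `pick_cities` and the sample URLs are hypothetical and stand in for the scraped `<a>` elements.

```python
def pick_cities(entries, first_letters=("l", "n")):
    """Keep (text, url) pairs whose text starts with one of `first_letters`
    and strip the trailing 'Hotels' from the text."""
    picked = []
    for text, url in entries:
        if text.lower().startswith(first_letters):
            picked.append((text.split("Hotels")[0].strip(), url))
    return picked


# Hypothetical sample data standing in for the scraped links
sample = [
    ("Atlanta Hotels", "https://example.com/atlanta"),
    ("Las Vegas Hotels", "https://example.com/las-vegas"),
    ("New York City Hotels", "https://example.com/nyc"),
]
print(pick_cities(sample))
# [('Las Vegas', 'https://example.com/las-vegas'), ('New York City', 'https://example.com/nyc')]
```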

The scraping process for each city can be broken into the following steps.
* Load page with hotels for a city.
* Pick dates.
* Click sorting button.
* Scrape links for hotels on page.
* Click Next button.
* Repeat. Scrape links...


We use the `time.sleep()` function to make sure that the page has reloaded properly after each click. If you have a slow Internet connection or run into exceptions, first try increasing `sleep_time`.

In [None]:
# 5 seconds
sleep_time = 5
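
A fixed sleep is simple but always waits the full time. An alternative is to poll for a condition and continue as soon as it holds. The sketch below shows the idea in plain Python; `wait_until` is our own illustrative helper, not a Selenium function (Selenium ships its own `WebDriverWait` class for this purpose).

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` every `poll` seconds until it returns a truthy
    value or `timeout` seconds have passed. Returns the last result."""
    deadline = time.monotonic() + timeout
    result = condition()
    while not result and time.monotonic() < deadline:
        time.sleep(poll)
        result = condition()
    return result


# Toy condition that becomes true on the third call
calls = []


def ready():
    calls.append(1)
    return len(calls) >= 3


print(wait_until(ready, timeout=2.0, poll=0.01))
# True
```

In a scraper the condition would typically check that a particular element is present on the page.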

First we will look at the process for one city. Then we will wrap everything into a single piece of code.
## Load Page

In [None]:
# Load page for first city
driver.get(city_links[0][1])

Sometimes there is a popup.
![alt text](images/9.jpg)
It prevents everything else from working,<br> so we need to close it with the X button.
![alt text](images/10.jpg)

In [None]:
# Wrapped in a try-except block in case there is no such popup
try:
    driver.find_element_by_class_name("ui_close_x").click()
except Exception:
    pass

## Set dates
We have to set up dates if we want TripAdvisor to show us average prices.
![alt text](images/11.jpg)
We'll select today as the check-in date, and a date two weeks later as the check-out date.
![alt text](images/12.jpg)
There is a quirk on TripAdvisor: in the calendar's `data-date` attribute all months are shifted by one (they appear to be counted from zero).
![alt text](images/13.jpg)

Here we define two functions. The first one clicks on a specific date in the calendar. The second one calculates the dates and fills them in.

In [None]:
from datetime import datetime, timedelta

# This function will click on the specific date in calendar
def click_date(date):
    months = driver.find_element_by_class_name('dsdc-months.large-bottom-margin')
    # Loop through all days shown in the calendar until we find the requested date
    for day in months.find_elements_by_class_name("dsdc-cell.dsdc-day"):
        if str(day.get_attribute('data-date')) == date:
            # Click that day
            day.click()
            break

# This function will select date pickers, and fill both of them
def set_dates(date_picker):    
    # Date for today    
    today = datetime.now().date()
    # TripAdvisor quirk: months in 'data-date' are shifted by one, hence month - 1
    str_today = str(today.year) + "-" + str(today.month - 1) + "-" + str(today.day)
    # Delta is the shift. Two weeks = 14 days in our case
    delta = timedelta(days=14)
    # Date two weeks later
    tw_later = today + delta
    str_tw_later = str(tw_later.year) + "-" + str(tw_later.month - 1) + "-" + str(tw_later.day)
    
    # Find date picker fields, there are two.
    pickers = date_picker.find_elements_by_class_name('picker-label')    
    # Click the first field    
    pickers[0].click()
    # Set date
    click_date(str_today)    
    # Second one
    pickers[1].click()    
    click_date(str_tw_later)
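
The date strings built in `set_dates` are easy to get wrong, so it may be worth checking the format in isolation. This is a minimal sketch with a hypothetical helper, `shifted_date_string`, that reproduces the same formatting for any date.

```python
from datetime import date, timedelta


def shifted_date_string(d):
    """Format `d` as 'year-month-day' with the month reduced by one
    and no zero padding, matching the strings built in set_dates."""
    return "{}-{}-{}".format(d.year, d.month - 1, d.day)


check_in = date(2017, 3, 5)
check_out = check_in + timedelta(days=14)
print(shifted_date_string(check_in))   # 2017-2-5
print(shifted_date_string(check_out))  # 2017-2-19
```

Note that a January date produces month `0` under this scheme, for example `2017-0-15`.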

Here we find our pickers on the page using the `find_elements_by_class_name` function and then fill in the values. First we select the element that contains the date pickers.
![alt text](images/14.jpg)

In [None]:
# Find the date picker square
date_picker = driver.find_element_by_class_name("prw_rup.prw_datepickers_desktop_horizontal_styleguide_icon.hotels_static_datepickers")
# Fill in dates
set_dates(date_picker)

After picking the dates, this part will look like this.
![alt text](images/15.jpg)
## Click sorting button

Here we will sort all the hotels by Lowest Price. First we will select the bar with buttons. 
![alt text](images/17.jpg)
Then we will select the necessary button and make Selenium click on it. This will change the sorting. Here we use CSS selectors; for more information on them, refer to this [page](http://saucelabs.com/resources/articles/selenium-tips-css-selectors).

Clicking the button changes the structure of the HTML; pay attention to the class names.

**Before**

You can see that `data-currentsort = "popularity"` is selected.
![alt text](images/18.jpg)

**After**

Here `data-currentsort = "priceLow"` becomes selected.
![alt text](images/19.jpg)

In [None]:
time.sleep(sleep_time)
# Find Sortbar that looks like
# Sort by: Ranking | Just for You | Lowest Price | Distance
sort_bar = driver.find_element_by_id('taplc_hotels_sort_bar_redesign_0')
# Find the 'Lowest Price' button
# Using CSS selector, we select 4th <li> tag
# And click on it
sort_bar.find_element_by_css_selector('li:nth-of-type(4)').click()
# Wait for page to reload completely after click
time.sleep(sleep_time)

## Scrape all hotels on the page
Define a function that will return links for all hotels.

In [None]:
# This function will scrape links for all hotels on a page.
def scrape_hotels_on_page(element):
    # Each hotel title is a link with the 'property_title' class
    links = element.find_elements_by_class_name('property_title')
    return [link.get_attribute('href') for link in links]

We need to select the list of hotels. We'll select the element by its ID.
![alt text](images/18.jpg)

In [None]:
# Find the div element with all hotels list
hotel_list_on_page = driver.find_element_by_id('ACCOM_OVERVIEW')
# Scrape links for hotels on page
# Take a look at the first three of them
scrape_hotels_on_page(hotel_list_on_page)[:3]

So we've got a list of links as a result.
## Click Next button and Repeat 
We want to go through all available pages for the city, scraping hotels on each page. So we'll need to click the 'Next' button a few times in a loop. The button is located at the bottom of the page.
![alt text](images/21.jpg)
Let's implement this.

In [None]:
all_links = []
while True:
    # Find the div element with all hotels list
    hotel_list_on_page = driver.find_element_by_id('ACCOM_OVERVIEW')
    # Scrape links for hotels on page
    all_links.extend(scrape_hotels_on_page(hotel_list_on_page))
    # The try-except block is used to break the loop when there is
    # no such button any more (on the last page it becomes inactive)
    try: 
        # Here we want to click on the Next button
        driver.find_element_by_class_name('nav.next.ui_button.primary.taLnk').click()
        # Wait for page to reload completely after click
        time.sleep(sleep_time)
    except Exception:
        break
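
The structure of this loop (scrape, try to advance, stop on failure) can be exercised without a browser. The sketch below is illustrative only: `collect_pages` and `FakeSite` are hypothetical names, and the `max_pages` guard is an addition of ours that protects against an endless loop if the Next button never disappears.

```python
def collect_pages(scrape_page, go_next, max_pages=100):
    """Call `scrape_page()` on each page and `go_next()` to advance.
    Stops when `go_next()` raises an exception or after `max_pages` pages."""
    items = []
    for _ in range(max_pages):
        items.extend(scrape_page())
        try:
            go_next()
        except Exception:
            break
    return items


# Hypothetical stand-in for the browser: three pages of links
class FakeSite:
    def __init__(self, pages):
        self.pages = pages
        self.current = 0

    def scrape(self):
        return self.pages[self.current]

    def next(self):
        if self.current + 1 >= len(self.pages):
            raise LookupError("no Next button")
        self.current += 1


site = FakeSite([["a", "b"], ["c"], ["d", "e"]])
print(collect_pages(site.scrape, site.next))
# ['a', 'b', 'c', 'd', 'e']
```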

## Scrape information for each hotel
We'll scrape information from this part of the page.
![alt text](images/22.jpg)
Specifically, these fields:

![alt text](images/23.jpg)

![alt text](images/24.jpg)


Define a function that will scrape info for a hotel. It will return the scraped info in the form of a dictionary.

In [None]:
def scrape_hotel(link):
    # Load hotel's webpage
    driver.get(link)
    # Heading part
    head = driver.find_element_by_class_name('headingWrapper.easyClear')
    # Select all the necessary information on a webpage
    name = head.find_element_by_id('HEADING').text
    # The try-except wrappers cover cases when the field we want
    # is not present on the page, e.g. there is no price or the hotel
    # has not been reviewed yet. Missing fields default to 0.0 so that
    # every numeric field always has the same type.
    try:
        rating = head.find_element_by_class_name('ui_bubble_rating').get_attribute('alt')
        rating = float(rating.split()[0])
    except Exception:
        rating = 0.0
    try:
        reviews = head.find_element_by_class_name("more.taLnk").text
        reviews = float(reviews.split()[0].replace(',', ''))
    except Exception:
        reviews = 0.0
    try:
        pos_in_hotels = head.find_element_by_class_name('rank').text
        # Drop the leading '#'
        pos_in_hotels = float(pos_in_hotels[1:])
    except Exception:
        pos_in_hotels = 0.0
    try:
        price = head.find_element_by_class_name('price').text
        price = float(price.split(" ")[1].replace(',', ''))
    except Exception:
        price = 0.0
    address = head.find_element_by_class_name('format_address').text
    # Put all the information to the dictionary
    return {'name': name,
            'rating': rating,
            'reviews': reviews,
            'position': pos_in_hotels,
            'address': address,
            'price': price}
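
The numeric fields above are all extracted the same way: take a piece of text, strip separators, convert to `float`, and fall back to a default. As a sketch of how that could be factored out, here is a hypothetical helper, `first_number`, based on a regular expression. The sample strings are assumptions about what the scraped text looks like, not captured output.

```python
import re


def first_number(text, default=0.0):
    """Return the first number found in `text` as a float,
    ignoring thousands separators; `default` if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if match is None:
        return default
    return float(match.group().replace(",", ""))


# Hypothetical strings shaped like the scraped fields
print(first_number("4.5 of 5 bubbles"))  # 4.5
print(first_number("1,234 Reviews"))     # 1234.0
print(first_number("#12"))               # 12.0
print(first_number("no price"))          # 0.0
```

A helper like this keeps every numeric field a `float`, which matters later when the values are sorted and divided.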

Since there are a lot of hotels, loading everything is going to take a long time. For now, let's limit ourselves to 5 hotels.

In [None]:
hotels_dict = []
# Limit of hotels
limit = 5
# Scrape only the first `limit` links
for link in all_links[:limit]:
    hotels_dict.append(scrape_hotel(link))

In [None]:
# Close Firefox
driver.quit()

In [None]:
# Take a look at the results
hotels_dict

## Put it all together
Now let's make it all work together, so that we scrape all the info for all the hotels in all the cities we want. Please be aware that the full process will take **a while**. We will limit the hotel count to 15 to save time.



In [None]:
# Limit of hotels
limit = 15

# Create Firefox Driver
driver = webdriver.Firefox()

#Dictionary to store our results
city_hotel = {}

# We'll reuse city_links acquired before, to save some time loading pages
for city in city_links:    
    
    # Load page for city
    driver.get(city[1])
    
    #Let page load properly
    time.sleep(sleep_time)    
    # Close the popup if it is there.
    try:
        driver.find_element_by_class_name("ui_close_x").click()
    except Exception:
        pass
   
    date_picker = driver.find_element_by_class_name("prw_rup.prw_datepickers_desktop_horizontal_styleguide_icon.hotels_static_datepickers")
    # Fill in dates
    set_dates(date_picker)
    
    #Let page load properly
    time.sleep(sleep_time)
    # Find Sortbar that looks like
    # Sort by: Ranking | Just for You | Lowest Price | Distance    
    sort_bar = driver.find_element_by_id('taplc_hotels_sort_bar_redesign_0')
    # Find the 'Lowest Price' button
    # Using CSS selector, we select 4th <li> tag
    # And click on it
    sort_bar.find_element_by_css_selector('li:nth-of-type(4)').click()
    # Wait for page to reload completely after click
    time.sleep(sleep_time)
    
    all_links = []    
    while True:
        # Wait for page to reload completely after click
        time.sleep(sleep_time)
        # Find the div element with all hotels list
        hotel_list_on_page = driver.find_element_by_id('ACCOM_OVERVIEW')
        # Scrape links for hotels on page
        all_links.extend(scrape_hotels_on_page(hotel_list_on_page))
        # The try-except block is used to break the loop when there is
        # no such button any more (on the last page it becomes inactive)
        try: 
            # Here we want to click on the Next button
            driver.find_element_by_class_name('nav.next.ui_button.primary.taLnk').click()
        except Exception:
            break
            
    hotels_dict = []
    # Limit of hotels
    count = 0
    for link in all_links:
        hotels_dict.append(scrape_hotel(link))
        count += 1
        if count == limit:
            break
            
    # Insert data for City to the dictionary
    city_hotel[city[0]] = hotels_dict

In [None]:
# Close Firefox
driver.quit()

Now we can take a look at the results.

In [None]:
city_hotel

Now that we have all this information, let's save it to a JSON file, so that we can use it without having to scrape the TripAdvisor website again.

In [None]:
import json

In [None]:
# Open file for writing, create it if it doesn't exist
with open('hotels_data.json', 'w') as file_object:
    # Write data to file
    json.dump(city_hotel, file_object)

In [None]:
# Open file for reading
with open('hotels_data.json','r') as f_obj:
    # Read the file.
    city_hotel_file = json.load(f_obj)
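
To check that nothing is lost on the way to disk and back, the loaded data can be compared with the original. The sketch below does this for a small hypothetical sample shaped like `city_hotel`, writing to a temporary folder so that it does not touch `hotels_data.json`.

```python
import json
import os
import tempfile

# Hypothetical sample shaped like `city_hotel`
sample = {"New York City": [{"name": "Hotel A", "rating": 4.5, "price": 120.0}]}

path = os.path.join(tempfile.mkdtemp(), "hotels_sample.json")
with open(path, "w") as f:
    # indent=2 makes the file easier to read by eye
    json.dump(sample, f, indent=2)

with open(path) as f:
    restored = json.load(f)

print(restored == sample)
# True
```

This works because the data consists only of strings, floats, lists and dictionaries with string keys, all of which JSON represents directly.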

Let's add some visualizations of our data by making a couple of plots.
First we have to prepare the data. We want to sort the hotels by their position in the ranking. We'll use the [`operator`](https://docs.python.org/2/library/operator.html) module, specifically [`operator.itemgetter`](https://docs.python.org/2/library/operator.html#operator.itemgetter).

In [None]:
import operator

# Take hotels which are in New York City
nyc_hotels = city_hotel_file['New York City']
# Sort hotels by their position in rating
city_hotel_by_rating = sorted(nyc_hotels, key = operator.itemgetter('position'))
# Names of hotels
names = [item['name'] for item in city_hotel_by_rating]
# Their positions
hotels_pos = [item['position'] for item in city_hotel_by_rating]
# Ratios of rating to reviews count (0.0 for hotels without reviews)
ratios = [item['rating'] / item['reviews'] if item['reviews'] else 0.0
          for item in city_hotel_by_rating]
# Prices
prices = [item['price'] for item in city_hotel_by_rating]
# Ticks for plot, in the form "name (position)"
ticks = ["{} ({})".format(name, int(pos)) for name, pos in zip(names, hotels_pos)]
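
For readers who have not used `operator.itemgetter` before, here is a small self-contained illustration of sorting dictionaries by one of their keys. The records are hypothetical and only mimic the shape of the scraped data.

```python
import operator

# Hypothetical records shaped like the scraped hotel dictionaries
hotels = [
    {"name": "Hotel C", "position": 3.0},
    {"name": "Hotel A", "position": 1.0},
    {"name": "Hotel B", "position": 2.0},
]
by_position = sorted(hotels, key=operator.itemgetter("position"))
print([h["name"] for h in by_position])
# ['Hotel A', 'Hotel B', 'Hotel C']
```

`operator.itemgetter("position")` builds a function equivalent to `lambda item: item["position"]`.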

Now we'll draw a plot with **prices** and the **ratio of rating to review count** for each hotel. To do so, we'll use `matplotlib`'s [`two scales`](http://matplotlib.org/examples/api/two_scales.html) plot.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
# Draw a line for ratios
x = list(range(len(hotels_pos)))
ax1.plot(x, ratios, 'g.-')
# Name Y-axis
ax1.set_ylabel('rating / reviews', color='g', fontsize=12)
# Name these ticks with names of hotels
ax1.set_xticks(x)
ax1.set_xticklabels(ticks, rotation=30, fontsize=11)
# Create a twin Axes sharing the x-axis
ax2 = ax1.twinx()
ax2.plot(x, prices, 'r.-')
ax2.set_ylabel('price', color='r', fontsize=12)
# Make this plot fit the area
fig.set_size_inches(12, 6)
plt.tight_layout()
plt.show()

Now let's find the New York hotel closest to the Empire State Building.
To do so, we are going to use the `geopy` package.
Install it with

```
pip install geopy
```
Documentation for `geopy` can be found [here](https://geopy.readthedocs.io/en/1.10.0/).
We'll use the [GoogleV3](https://geopy.readthedocs.io/en/1.10.0/#geopy.geocoders.GoogleV3) geocoder.
Let's take a look at a simple example.

In [None]:
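import geopy
# Note: Google's geocoding service may require an API key,
# which GoogleV3 accepts through its `api_key` argument
google_geolocator = geopy.geocoders.GoogleV3(timeout=3)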
esb_loc = google_geolocator.geocode("Empire State Building")
# To get location's latitude and longitude
print(esb_loc.latitude)
print(esb_loc.longitude)

Now let's fetch GPS coordinates for all hotels in our list for New York.

In [None]:
# Dictionary to store results
gps_coords = {}
for hotel in nyc_hotels:
    # Query Google for data
    loc = google_geolocator.geocode(hotel['address'])
    # geocode() returns None when nothing is found; skip such hotels
    if loc is None:
        continue
    # Add to dictionary
    gps_coords[hotel['name']] = (loc.latitude, loc.longitude)
# Take a look at results
print(gps_coords)

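Let's calculate the distance from each hotel to the Empire State Building using the [`vincenty`](https://geopy.readthedocs.io/en/1.10.0/#module-geopy.distance) distance. (In geopy 2.0 and later `vincenty` is no longer available; `geopy.distance.geodesic` takes its place.)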

In [None]:
from geopy.distance import vincenty
dists = {}
for key, value in gps_coords.items():
    dists[key] = vincenty((esb_loc.latitude, esb_loc.longitude), value).miles

# Sort hotels by distance from Empire State Building, descending
dists = sorted(dists.items(), key = operator.itemgetter(1), reverse = True)
dists
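
As a rough cross-check that needs no external package, distances can also be estimated with the haversine formula. Note that this is a different technique from Vincenty's: it treats the Earth as a sphere, so results will differ slightly from geopy's. The function name and the radius constant below are our own choices.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(point_a, point_b):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


# One degree of latitude along a meridian is roughly 69 miles
print(round(haversine_miles((40.0, -74.0), (41.0, -74.0)), 1))
# 69.1
```

For hotels within a single city the difference between the spherical and ellipsoidal models is small, which makes this a convenient sanity check.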

Now let's make a simple bar plot to visualize these distances.

In [None]:
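# Hotel names and their distances
labels = [d[0] for d in dists]
vals = [d[1] for d in dists]
# One bar per hotel, however many hotels were scraped
positions = range(len(vals))
plt.bar(positions, vals, align="center", color="#A2C5B6")
# Label each bar with the hotel name
plt.xticks(positions, labels, rotation="vertical")
plt.ylabel("Distance, miles")
plt.show()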

## Conclusion
We've done quite a good job of scraping information from the TripAdvisor website. Selenium has proved quite useful for emulating human behavior when navigating pages, though it is rather slow, because it requires a lot of refresh and reload cycles.

> # Exercise 1
> Scrape information for all hotels in Paris that are near the Eiffel Tower. On the [Hotels](https://www.tripadvisor.com/Hotels) page, click Paris. Then, on the page with hotels, scroll down to Neighborhoods and select Tour Eiffel.
![alt text](images/25.jpg)

In [None]:
# write your code here

> # Exercise 2
> Scrape all the hotels for Atlanta, but only those which have a 4-star rating.
> Select the link for Atlanta Hotels on the [Hotels](https://www.tripadvisor.com/Hotels) page, 'click' four stars, and then scrape the links for all hotels and all the info for every hotel.


In [None]:
# write your code here

> # Exercise 3
> Do the same thing as in Exercise 2, but scrape only hotels with prices below some level. You'll have to modify the function that scrapes all hotels on a page so that it looks at the prices first.

In [None]:
# write your code here