# Selenium Quickstart Tutorial
---
Install:  
* [Selenium](https://www.seleniumhq.org/download/)
* [ChromeDriver](https://chromedriver.chromium.org/downloads)
---

**Selenium, accessing data that can't just be downloaded**

Summary:  
1. basic Google search
2. scraping exercise - jobs

**Special disclaimer: you should always check the robots.txt file for a website before you scrape it. [This article](https://moz.com/learn/seo/robotstxt) does a better job explaining it than I'll be able to.**
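
If you'd rather check programmatically, Python's standard library ships a robots.txt parser. Here's a minimal sketch. The rules and the `example.com` URLs below are made up so the example runs offline; they are not any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # feed the rules in line by line

# can_fetch(user_agent, url) answers: may this agent request this URL?
jobs_allowed = parser.can_fetch("*", "https://example.com/jobs/search")
private_allowed = parser.can_fetch("*", "https://example.com/private/data")

print(jobs_allowed, private_allowed)
```

Against a live site you would call `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` instead of `parse()`.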

---

### Imports:

In [None]:
# default packages
import numpy as np
import pandas as pd

# storage
import pickle  # pickle is one option for storage when you aren't using df.to_csv / df.to_excel / ...

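# selenium-related imports
# (this notebook uses the Selenium 3 API; newer Selenium 4 releases remove the
#  find_element_by_* helpers in favor of find_element(By.NAME, ...), find_element(By.XPATH, ...), etc.)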
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import time

# prep-work
chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

---
### 1. Google Search

In [None]:
# open a Chrome window on Google; Chrome flags the new window as "being controlled by automated test software"

driver = webdriver.Chrome(chromedriver) # webdriver executable
driver.get('https://www.google.com/')

**To open up the inspector: Cmd + Shift + C (on macOS)**

Now hover over the search bar and click on it.  
In the inspector there is a large chunk of text that starts with:  
```<input class=```  
Right click, go to "Copy XPath" ...

```//*[@id="tsf"]/div[2]/div[1]/div[1]/div/div[2]/input```

A bit messy, but XPath is generally a safe bet when you need Selenium to click on the right thing.

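However, looking back at the text chunk, there's a field ```name="q"```. This is one example where XPath isn't the best bet: the ```name``` attribute is shorter and doesn't depend on how deeply the input is nested in the page.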
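
To see why a positional path is more fragile than an attribute lookup, here's a small sketch that doesn't need a browser. It uses the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset, on a made-up and heavily simplified form (the markup is hypothetical, not Google's real page).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a search form.
PAGE = """
<form id="tsf">
  <div><div><input name="btn" /></div></div>
  <div><div><input name="q" title="Search" /></div></div>
</form>
"""

root = ET.fromstring(PAGE)

# Positional lookup: second div, then its first div, then the input.
positional = root.find("./div[2]/div[1]/input")
# Attribute lookup: any input whose name is "q".
by_name = root.find(".//input[@name='q']")
same_before = positional is by_name

# Simulate a layout change: a new wrapper div appears at the top of the form.
root.insert(0, ET.Element("div"))

moved = root.find("./div[2]/div[1]/input")          # now lands on a different input
still_found = root.find(".//input[@name='q']")      # unaffected by the new div

print(same_before, moved.get("name"), still_found.get("name"))
```

Both lookups agree at first, but after the extra `div` is added the positional path points at the wrong input while the name lookup still finds the search box.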

In [None]:
search_bar = driver.find_element_by_name('q')

# sleeps help to make your bot appear more 'human'
time.sleep(1)

# now send in a search...
search_bar.send_keys('data science jobs')
# and hit enter (after 'Keys.' you can hit TAB to check out your options - this is true after any '.')
search_bar.send_keys(Keys.ENTER)

**WARNING:**  
If you don't explicitly call ```driver.quit()``` (or quit the browser by hand with ```right-click + quit```), the browser windows your driver opened will stay open.
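
One way to avoid forgetting is to wrap the driver in a context manager, so `quit()` runs even when a cell errors halfway through. This is a sketch, not a Selenium feature: `managed_driver` is a helper name I made up, and `FakeDriver` is a stand-in so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver and always call quit() on it, even if the body raises."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a real webdriver (no browser needed)."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = None
try:
    with managed_driver(FakeDriver) as fake:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass  # the error still propagates, but quit() has already run

print(fake.closed)
```

With a real browser you'd pass a factory instead of `FakeDriver`, for example `managed_driver(lambda: webdriver.Chrome(chromedriver))`.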

In [None]:
driver.quit()

---

### 2. Scraping Exercise - Monster

Job listings are extremely dynamic, and Monster has pretty lax rules for scraping data from their site. Our objective is to collect data on listed jobs (job descriptions) and analyze it.

Manually trawling posts would be tedious, so...

---
#### 2.1 One Job

As with all problems that involve repetition, do it once (or a handful of times) at first. Figure out the optimal-ish way to do the task and then loop through it as many times as you need.

Start off by [going to Monster](https://www.monster.com/).

For the rest of this exercise, I'll walk you through a scrape of data science job listings. Feel free to change the parameters to your own specifications.

In [None]:
# before we go into developer mode, before we open a driver, let's look at search filters
# I've chosen full time jobs listed in the last week

link = 'https://www.monster.com/jobs/search/Full-Time_8?q=data-scientist&tm=7'

# my search returns ~2k listings, more than enough to be "statistically relevant"

In [None]:
# cmnd + shift + C to check the xpath for the first posting

job_start = '//*[@id="SearchResults"]/section[1]' # starts at 1

# cmnd + shift + C to check the last

job_end = '//*[@id="SearchResults"]/section[29]'

# only 29 postings show before "Load more jobs"

more_jobs = '//*[@id="loadMoreJobs"]'

# scroll to the last one

new_job_end = '//*[@id="SearchResults"]/section[57]' # 57?

# we started with 29 jobs, and we get 28 every time we ask for more

In [None]:
# we have everything that we need, let's do a test run

driver = webdriver.Chrome(chromedriver)

driver.get(link) # link with filters

In [None]:
first_job = driver.find_element_by_xpath(job_start) # find the path we copied for job 1

first_job.click()

In [None]:
# cmnd + shift + C

header_xpath = '//*[@id="JobPreview"]/div[1]/div[1]'

description_xpath = '//*[@id="JobDescription"]'

In [None]:
# .text is an attribute, not a method (so no parentheses)

driver.find_element_by_xpath(header_xpath).text

In [None]:
# slice a tweet from the description, looks good

driver.find_element_by_xpath(description_xpath).text[:250]

---
#### 2.2 All the jobs

All of the components are there for us. Now it's time to put it all together.

Our loop should look like this:
1. scrape jobs 1-29
2. click on "load more jobs"
3. scrape the next 28 jobs
4. repeat steps 2-3 until we have our desired number of job listings
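
The steps above can be sketched without a browser as a plain "plan" of actions. This is an idealized sketch: `scrape_plan` is a made-up helper, and it assumes the batch sizes we observed (29 jobs first, then 28 per click) and that job numbers run contiguously, which a real page may not honor.

```python
def scrape_plan(target, first_batch=29, batch=28):
    """Yield ("scrape", n) and ("load_more", n) steps until `target` jobs are planned."""
    visible = first_batch  # how many jobs are on the page right now
    n = 1
    while n <= target:
        yield ("scrape", n)
        # once we reach the last visible job, ask for the next batch
        if n == visible and n < target:
            yield ("load_more", n)
            visible += batch
        n += 1


steps = list(scrape_plan(60))
load_points = [n for action, n in steps if action == "load_more"]
scraped = [n for action, n in steps if action == "scrape"]

print(load_points, len(scraped))
```

Writing the plan as a generator keeps the counting logic separate from the Selenium calls, so the index arithmetic can be checked before a browser is ever opened.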

In [None]:
# all of our paths

link = 'https://www.monster.com/jobs/search/Full-Time_8?q=data-scientist&tm=7'

job_n = '//*[@id="SearchResults"]/section[{}]' # note the "{}"

header_xpath = '//*[@id="JobPreview"]/div[1]/div[1]'

description_xpath = '//*[@id="JobDescription"]'

more_jobs = '//*[@id="loadMoreJobs"]'

In [None]:
# we need to come up with a way to loop through the posting numbers

# 29, 57, ...
# it looks like the numbers pre-button are each one more than a multiple of 28

29 % 28, 57 % 28  # remainders

In [None]:
# we do have a corner case

1 % 28

In [None]:
# double filter, and verify that the printed numbers match our target click-indices

for i in np.arange(1, 100):
    if i % 28 == 1 and i > 1:
        print(i)

In [None]:
# let's make a ceiling at 100 jobs
# note: this cell will error

job_headers = []
job_descriptions = []

driver = webdriver.Chrome(chromedriver)
driver.get(link)
time.sleep(5) # letting the page load

for n in np.arange(1, 101): # remember that range cuts at end-1
    # click job_n
    job_path = driver.find_element_by_xpath(job_n.format(n))
    job_path.click()
    time.sleep(2)  # resting after clicks
    # get_data
    job_headers.append(driver.find_element_by_xpath(header_xpath).text)
    job_descriptions.append(driver.find_element_by_xpath(description_xpath).text)
    # "more jobs"
    if n % 28 == 1 and n > 1:  # use the loop variable n, not a leftover i from an earlier cell
        driver.find_element_by_xpath(more_jobs).click()
        time.sleep(2)  # sometimes it's beneficial to make these random
        
driver.quit()

In [None]:
# job path wasn't real... hmm
# check the xpath for job 2

driver = webdriver.Chrome(chromedriver)
driver.get(link)

job_2 = '//*[@id="SearchResults"]/section[3]' # lmao, so that's why the numbering was weird

In [None]:
# same thing as last time

job_headers = []
job_descriptions = []

driver = webdriver.Chrome(chromedriver)
driver.get(link)
time.sleep(5) # letting the page load

# we have a few options, the easiest is a try, except statement
# try to do what we wanted, except move on
# there may be other numbers missing, and we don't want to write an exception for each
# that said, let's switch to a while loop and count to 100

i = 0  # successful scrapes so far
n = 1  # job number to try next; set once, outside the loop, so it isn't reset on every pass
while i < 100:
    try:
        # click job_n
        job_path = driver.find_element_by_xpath(job_n.format(n))
        job_path.click()
        time.sleep(2)  # resting after clicks
        # get_data (read both pieces before appending, so the two lists stay the same length)
        header = driver.find_element_by_xpath(header_xpath).text
        description = driver.find_element_by_xpath(description_xpath).text
        job_headers.append(header)
        job_descriptions.append(description)
        i += 1  # increment for each success
        # "more jobs"
        if n % 28 == 1 and n > 1:
            driver.find_element_by_xpath(more_jobs).click()
            time.sleep(3)  # sometimes it's beneficial to make these random
    except Exception:
        pass  # we could also print, or try some other function
    n += 1  # since we're counting jobs not successes
        
driver.quit()

In [None]:
# double check that we got 100

len(job_descriptions), len(job_headers)

---
#### 2.3 Data Storage

Pickling is just one way to handle data storage, but it's effective for arbitrary Python objects like the lists we just grabbed.

If you have some data that's already in a dataframe, then pandas' ```to_csv``` is usually the better choice. CSV files also have the advantage of being easily transportable across environments to pretty much any end user (there's also a ```to_excel``` call).
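
Here's a quick round trip to show what pickling gives you: dump an object to disk, load it back, and get an equal object. The sample dictionary and the temporary file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the scraped results.
sample_jobs = {
    0: ["Data Scientist", "Build models..."],
    1: ["ML Engineer", "Ship models..."],
}

# write to a throwaway directory so nothing in the working folder is touched
path = os.path.join(tempfile.mkdtemp(), "jobs_sample.pkl")

with open(path, "wb") as f:  # binary write mode
    pickle.dump(sample_jobs, f)

with open(path, "rb") as f:  # binary read mode
    restored = pickle.load(f)

print(restored == sample_jobs)
```

Only unpickle files you trust: loading a pickle can execute arbitrary code, which is another reason CSV is the friendlier format for sharing.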

In [None]:
# first let's set everything into a dictionary

job_dict = {j: [job_headers[j], job_descriptions[j]] for j in range(len(job_headers))}

# now "dump" (object to dump, file opened for writing in binary mode)
# the with-block closes the file once the dump is done

with open('jobs.pkl', 'wb') as f:
    pickle.dump(job_dict, f)