Run the notebook cells in order, from top to bottom, so that each cell has the variables it needs and runs without errors.

Our first import: Selenium!

Refer to the Selenium documentation at https://www.selenium.dev/documentation/ for a complete guide to browser automation.

In [1]:
from selenium import webdriver

You'll need to download the browser-specific driver to work with Selenium. Download the Chrome driver here: https://chromedriver.chromium.org/ (pick the release that matches your installed Chrome version).

Then note the path where the Chrome driver was saved on your system.

In [2]:
path_to_driver = "C:/Users/<desktop-user>/Downloads/chromedriver_win32/chromedriver.exe"
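A wrong path is an easy mistake to make at this step. As a minimal sketch, a small helper can confirm that the file exists before we hand the path to Selenium. The name resolve_driver_path is hypothetical and not part of Selenium.

```python
from pathlib import Path


def resolve_driver_path(raw_path):
    """Return raw_path as a string if it points to an existing file.

    Hypothetical helper for illustration; Selenium does not provide it.
    """
    driver_file = Path(raw_path).expanduser()
    if not driver_file.is_file():
        raise FileNotFoundError(f"No Chrome driver found at: {driver_file}")
    return str(driver_file)


# Example call (replace the placeholder with your own path):
# path_to_driver = resolve_driver_path("C:/Users/<desktop-user>/Downloads/chromedriver_win32/chromedriver.exe")
```

If the file is missing, the helper fails immediately with a readable message instead of a less obvious error when the browser is launched.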

Get a driver object to automate your browser session

In [3]:
# opens a Chrome browser window
driver = webdriver.Chrome(path_to_driver)

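Selenium 4 deprecated many commonly used methods and replaced them with newer alternatives. The call above triggers a DeprecationWarning because passing the driver path directly to webdriver.Chrome() is deprecated in favor of a Service object, and newer Selenium 4 releases no longer accept the path this way at all.

Let's resolve the warning by following the direction given in its message.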

In [4]:
# close the above browser window
driver.quit() 

We need the import below to create a Service object.

In [5]:
from selenium.webdriver.chrome.service import Service

Create a service object by passing in the executable path to the Chrome driver

In [6]:
service = Service(path_to_driver)

Now pass in this service object to the webdriver as suggested by the warning above

In [7]:
driver = webdriver.Chrome(service=service)

Selenium now opens a new Chrome browser window! This window will be controlled through the script we write here. 

Now let's initialize a few variables that our driver will use to navigate to the required login page and fill in the login credentials.

In [8]:
login_url = "https://www.instahyre.com/login/"

email = "<enter-your-email-here>"
password = "<enter-your-password-here>"
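Hard-coding credentials in a notebook makes it easy to share them by accident. As one possible alternative, we could read them from environment variables. This is only a sketch: the variable names INSTAHYRE_EMAIL and INSTAHYRE_PASSWORD and the helper load_credentials are assumptions made for illustration.

```python
import os


def load_credentials(environ=os.environ):
    """Read the login email and password from environment variables.

    INSTAHYRE_EMAIL and INSTAHYRE_PASSWORD are hypothetical names
    chosen for this sketch; use whatever names suit your setup.
    """
    try:
        return environ["INSTAHYRE_EMAIL"], environ["INSTAHYRE_PASSWORD"]
    except KeyError as missing:
        # missing.args[0] is the name of the variable that was not set
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None


# Example call, once both variables are set in your shell:
# email, password = load_credentials()
```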

We can make Selenium navigate to a particular website by passing its URL to the driver.get() method, which issues a GET request for the page. Here we pass in login_url to reach the login landing page.

In [9]:
driver.get(login_url)

To fill in our login credentials, we first have to tell Selenium where the input fields are on the web page. HTML tags and their attributes are well suited for this task.

To get the tag information for a particular HTML element, right-click the element and choose 'Inspect' from the menu that appears. Chrome opens the 'Developer tools' panel with the HTML tag of that element highlighted, and there you can see the attributes associated with the tag. These attributes help us find the specific element that we want to interact with.

In our scenario, on the page "https://www.instahyre.com/login/", the id attribute of the Email input field is "email". We can therefore use the find_element_by_id() method to pinpoint the exact HTML element that we're looking for. We could also use the name attribute of the Email input, which is again "email", together with the find_element_by_name() method. Similarly, we can get the id of the password input.

Finally, we chain the call with the send_keys() method, which types the given text into the input field.

In [10]:
driver.find_element_by_id("email").send_keys(email)
driver.find_element_by_id("password").send_keys(password)

The cell above emits deprecation warnings: the find_element_by_*() methods are deprecated in Selenium 4 (and removed in later 4.x releases), so from here on we use the find_element() method instead. For user documentation on find_element(), refer to: https://www.selenium.dev/documentation/webdriver/elements/finders/ and for the Selenium API for Python, refer to: https://www.selenium.dev/selenium/docs/api/py/api.html (use the quick search tool to find the required methods).
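As a minimal sketch, the two statements from the cell above can be rewritten with find_element(). The helper name fill_login_form is hypothetical. The locator strategy is passed as the plain string 'id', in the same way the later cells of this notebook pass 'xpath' and 'id'.

```python
def fill_login_form(driver, email, password):
    """Type the credentials into the inputs whose ids are 'email' and 'password'.

    Hypothetical helper: same locators as the deprecated calls above,
    but routed through the non-deprecated find_element() method.
    """
    driver.find_element("id", "email").send_keys(email)
    driver.find_element("id", "password").send_keys(password)


# Example call with the variables defined earlier in the notebook:
# fill_login_form(driver, email, password)
```

If cell 10 has already run, the fields are filled in, so calling this as well would type the credentials a second time.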

After filling in our email and password, we need to "click" the login button on the page. For that, we again "find" the login button and call the click() method on it. The click() method can be used to click on any element, such as a button or a link (source: https://www.geeksforgeeks.org/click-element-method-selenium-python/).

Here we locate the login button with find_element() and call click() on it. Because we locate the element by XPath, we pass 'xpath' as the first argument. To get the XPath, right-click the element's tag in the Inspect panel and choose Copy -> Copy full XPath, then pass the copied XPath to find_element(). Keep in mind that a full XPath depends on the exact layout of the page, so it may need updating if the site changes.

In [11]:
driver.find_element('xpath', "/html/body/div[1]/div[2]/div[2]/div/div/form/div[3]/button").click()
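To see what a full XPath describes, here is a small sketch that uses Python's built-in xml.etree.ElementTree instead of a browser. The markup is a made-up toy page, not Instahyre's real HTML, and ElementTree supports only a subset of XPath, so this illustrates how the path steps walk the tag tree rather than what Selenium does internally.

```python
import xml.etree.ElementTree as ET

# Toy page for illustration only; NOT the real login page markup.
page = ET.fromstring(
    "<html><body>"
    "<div><form>"
    "<div><input id='email'/></div>"
    "<div><input id='password'/></div>"
    "<div><button>Login</button></div>"
    "</form></div>"
    "</body></html>"
)

# In a browser the full XPath would read: /html/body/div[1]/form/div[3]/button
# ElementTree paths start at the root <html> element, so "/html" becomes ".".
# div[1] is the first <div> child and div[3] is the third (XPath counts from 1).
button = page.find("./body/div[1]/form/div[3]/button")
print(button.text)

# A shorter relative path reaches the same element and does not depend
# on how many wrapper <div> tags surround the form.
same_button = page.find(".//button")
print(same_button is button)
```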

We have now successfully logged in to the website!

The drop-down list we're trying to scrape is on the profile page, so we first store the URL of that page, i.e. https://www.instahyre.com/candidate/profile/. This page is accessible only after a successful login.

In [12]:
profile_url = "https://www.instahyre.com/candidate/profile/"

In [13]:
driver.get(profile_url)

The drop-down list appears when we click the Edit button of the 'Skills' section in the profile. However, that Edit button becomes active only once the 'Skills' section comes into view. To scroll the page so that the section is fully in view, we get a reference to the preceding 'Job Preferences' section and then use the execute_script() method to run the required JavaScript.

In [14]:
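# reference to the 'Job Preferences' section, which sits just before 'Skills'
job_preferences_element = driver.find_element(by='id', value='job-preferences')
# scroll the page until that section is at the top of the window
driver.execute_script("arguments[0].scrollIntoView();", job_preferences_element)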

Now, "click" the 'Skills' Edit button

In [15]:
driver.find_element(by='xpath', value="/html/body/div[1]/div[2]/div[2]/div/div[1]/div/div[2]/div[3]/a").click()
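Because the Edit button only becomes active after scrolling, a click issued too early may fail. As a rough sketch, a hypothetical helper click_when_ready can retry the click for a few seconds. Selenium ships its own tool for this (WebDriverWait together with expected_conditions), which is usually preferable; the plain-Python loop below only illustrates the idea and catches a broad Exception for brevity.

```python
import time


def click_when_ready(driver, by, value, timeout=10.0, poll_interval=0.5):
    """Keep trying to find and click an element until it works or time runs out.

    Hypothetical helper, not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            driver.find_element(by, value).click()
            return
        except Exception:  # broad on purpose in this sketch; narrow it in real code
            if time.monotonic() >= deadline:
                raise  # give up and surface the last error
            time.sleep(poll_interval)


# Example call with the same locator as the cell above:
# click_when_ready(driver, 'xpath', "/html/body/div[1]/div[2]/div[2]/div/div[1]/div/div[2]/div[3]/a")
```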

After the above step, the HTML for the drop-down list is loaded on the webpage, so we can parse this HTML and extract the data using the Beautiful Soup library. The page_source attribute of the driver returns the HTML content of the page.

In [16]:
html = driver.page_source

Importing Beautiful Soup and a parser like lxml:

In [17]:
from bs4 import BeautifulSoup
import lxml

Refer to the BeautifulSoup documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Running the HTML document through Beautiful Soup gives us a BeautifulSoup object, aptly referred to as 'soup'. We also pass in the name of the parser as an argument.

In [18]:
soup = BeautifulSoup(html, 'lxml')

Now, to extract the items from the drop-down list, we 'Inspect' the particular drop-down, i.e. the drop-down list under 'Select your preferred role'. This highlights a select tag in the developer tools panel.

Select tags are used to create drop-down lists. Since the page might contain multiple drop-downs, we iterate through all the 'selects' to find the one whose 'form-name' attribute has the value 'candidate_skills', because that's the exact drop-down we're looking for. We can also confirm that no other drop-down on the page has the same 'form-name' value by looking at the list returned by soup.find_all('select').

In [19]:
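skills_soup = None
for drop_down in soup.find_all('select'):
    # .get() returns None instead of raising KeyError when a select has no 'form-name'
    if drop_down.get('form-name') == "candidate_skills":
        # storing the matching tag in the skills_soup variable
        skills_soup = drop_down
        break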
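Beautiful Soup can also do this lookup in a single call, by passing the attribute we want to match in the attrs dictionary of find(). The sketch below runs on a tiny hand-written sample rather than the real profile page, and it uses Python's built-in 'html.parser' so that it works even without lxml; the first select (name="city") is invented purely to show that non-matching drop-downs are skipped.

```python
from bs4 import BeautifulSoup

# Toy markup for illustration only; NOT the real profile page.
sample_html = """
<select name="city"><option>Pune</option></select>
<select form-name="candidate_skills" name="job_function">
  <optgroup label="Marketing"><option>Brand Management</option></optgroup>
</select>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
skills_select = sample_soup.find("select", attrs={"form-name": "candidate_skills"})
print(skills_select["name"])
```

Checking the result against None before using it gives a clear signal when the drop-down has not been loaded yet.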

To make sure we got the right select tag, let's display it:

In [20]:
skills_soup.prettify()

'<select auctioned-field="" autofocus="" child="true" class="form-control ng-pristine ng-valid" form-name="candidate_skills" name="job_function" ng-model="candidate.jsp.job_function.resource_uri" ng-options="function.resource_uri as function.name group by function.job_category.name for function in jobFunctions">\n <optgroup label="Software Engineering">\n  <option label="Backend Development" selected="selected" value="0">\n   Backend Development\n  </option>\n  <option label="Big Data / DWH / ETL" value="1">\n   Big Data / DWH / ETL\n  </option>\n  <option label="Embedded / Kernel Development" value="2">\n   Embedded / Kernel Development\n  </option>\n  <option label="Frontend Development" value="3">\n   Frontend Development\n  </option>\n  <option label="Full-Stack Development" value="4">\n   Full-Stack Development\n  </option>\n  <option label="Mobile Development" value="5">\n   Mobile Development\n  </option>\n  <option label="QA / SDET" value="6">\n   QA / SDET\n  </option>\n  <opt

In the above tag content, we can see all the required info that we need, i.e. all the drop-down list elements. We choose to store this information in a dictionary with the key being the group label header and the values being the group items in a list. For example: "Data Science and Analysis" : ["Data Analysis / Business Intelligence", "Data Science / Machine Learning"]

In [21]:
skills_dict = dict()

Now we just loop through the select tag using the contents attribute. The contents of the select tag here are the various groups, and for each group we loop through it to get the group items, storing them in skills_dict

In [22]:
for optgroup in skills_soup.contents:
    group_label = optgroup['label']
    skills_dict[group_label] = []
    for option in optgroup.contents:
        skills_dict[group_label].append(option.string)
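The same extraction can be tried offline on a small sample. This sketch makes two assumptions worth flagging: the helper name options_by_group is hypothetical, and it walks the tags with find_all() instead of the contents attribute. The reason is that contents also returns the whitespace between tags as text nodes whenever the HTML is indented (as the sample below is), and a text node has no 'label' attribute to read.

```python
from bs4 import BeautifulSoup


def options_by_group(select_tag):
    """Map each optgroup label to the list of option texts inside it.

    Hypothetical helper for illustration.
    """
    groups = {}
    for optgroup in select_tag.find_all("optgroup"):
        groups[optgroup["label"]] = [
            option.get_text(strip=True) for option in optgroup.find_all("option")
        ]
    return groups


# Hand-written sample that reuses two group names from the real drop-down.
sample_html = """
<select form-name="candidate_skills">
  <optgroup label="Human Resources">
    <option>HR Generalist</option>
    <option>Talent Acquisition</option>
  </optgroup>
  <optgroup label="Data Science and Analysis">
    <option>Data Analysis / Business Intelligence</option>
    <option>Data Science / Machine Learning</option>
  </optgroup>
</select>
"""
sample_select = BeautifulSoup(sample_html, "html.parser").find("select")
print(options_by_group(sample_select))
```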

Now let's display our skills_dict to check whether we extracted all required data

In [23]:
skills_dict

{'Software Engineering': ['Backend Development',
  'Big Data / DWH / ETL',
  'Embedded / Kernel Development',
  'Frontend Development',
  'Full-Stack Development',
  'Mobile Development',
  'QA / SDET',
  'Other Software Development'],
 'Technical Management': ['Engineering Management',
  'Product Management',
  'Project Management'],
 'Data Science and Analysis': ['Data Analysis / Business Intelligence',
  'Data Science / Machine Learning'],
 'Design and Creative': ['Graphic Design / Animation',
  'Photography / Videography',
  'UX / Visual Design',
  'Other Design'],
 'IT Operations and Support': ['Database Admin / Development',
  'DevOps / Cloud',
  'Functional / Technical Consulting',
  'IT Management / IT Support',
  'IT Security',
  'Network Administration',
  'Solution Architecture / Presales',
  'Systems Administration',
  'Technical / Production Support',
  'Technical Writing'],
 'Human Resources': ['HR Generalist', 'Talent Acquisition'],
 'Marketing': ['Brand Management',
  '

Perfect! 
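The dictionary lives only in the notebook's memory, so it is gone once the kernel stops. If we want to keep it, one option is to write it to a JSON file. This is a minimal sketch; the helper name save_skills and the file name in the example call are assumptions.

```python
import json


def save_skills(skills, file_path):
    """Write the scraped dictionary to a UTF-8 encoded JSON file.

    Hypothetical helper for illustration.
    """
    with open(file_path, "w", encoding="utf-8") as json_file:
        # indent=2 keeps the file readable; ensure_ascii=False keeps
        # any non-ASCII characters as they are
        json.dump(skills, json_file, indent=2, ensure_ascii=False)


# Example call with the dictionary built above ("skills.json" is a made-up name):
# save_skills(skills_dict, "skills.json")
```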

Now that we have retrieved our data, it's best practice to log out from the site and close the browser window opened by Selenium.

In [24]:
driver.find_element('id', "nav-candidates-logout").click()

We have successfully logged out!

Note: In a small browser window, the sign-out option is tucked inside the collapsed navigation menu (the overflow menu indicated by three horizontal bars). If the statement above raises an exception, uncomment and run the statements below.

In [25]:
# driver.find_element('xpath', "/html/body/div[1]/nav/div/div[1]/button").click()

In [26]:
# driver.find_element('id', "nav-candidates-logout").click()
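Instead of uncommenting cells by hand, the two cases can be folded into one function. This is a sketch under a few assumptions: the helper name log_out is hypothetical, the locators are the ones used in the cells above, and a broad Exception is caught for brevity (Selenium's WebDriverException would be the narrower choice).

```python
def log_out(driver):
    """Click the logout link, opening the collapsed menu first if needed.

    Hypothetical helper for illustration.
    """
    logout_locator = ("id", "nav-candidates-logout")
    menu_button_locator = ("xpath", "/html/body/div[1]/nav/div/div[1]/button")
    try:
        driver.find_element(*logout_locator).click()
    except Exception:  # broad for brevity in this sketch
        # Small window: expand the three-bar menu, then retry the logout link.
        driver.find_element(*menu_button_locator).click()
        driver.find_element(*logout_locator).click()


# Example call:
# log_out(driver)
```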

Finally, close the browser window.

In [27]:
driver.quit()
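If any cell fails midway, the browser window stays open until driver.quit() is run by hand. When the same steps are moved from a notebook into a script, one way to guarantee the cleanup is a small context manager. The name browser_session is hypothetical, and it takes a function that creates the driver so the sketch itself does not depend on Selenium.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a driver and always call quit() on it afterwards.

    Hypothetical helper; make_driver is any zero-argument function
    that returns an object with a quit() method.
    """
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs on normal exit and also when the body raises an exception.
        driver.quit()


# Example use with the objects created earlier in the notebook:
# with browser_session(lambda: webdriver.Chrome(service=service)) as driver:
#     driver.get(login_url)
```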