# Installing Selenium

We're going to use Selenium for Firefox, which means we'll have to install `geckodriver`. You can download it [here](https://github.com/mozilla/geckodriver/releases/). Download the right version for your system, and then unzip it.

Next, move the `geckodriver` executable into a folder on your system PATH so that Selenium can find it. This module assumes you are running Python 3.x with Anaconda; if you drag `geckodriver` into your `anaconda/bin` folder, you should be all set.
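A quick way to confirm that the move worked is to ask Python where it finds `geckodriver`. This is a minimal sketch using only the standard library; the helper name `locate_driver` is made up for illustration.

```python
import shutil


def locate_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("geckodriver is not on PATH yet; check the folder you moved it into.")
else:
    print("geckodriver found at", path)
```

If this prints a path inside your Anaconda `bin` folder, Selenium should be able to start Firefox.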

# Selenium

The Selenium documentation has a very helpful guide to navigating a webpage [here](http://selenium-python.readthedocs.io/navigating.html). There are many different ways to navigate, so you'll want to refer to it throughout the workshops, as well as when you're working on your own projects in the future.

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

First we'll set up the (web)driver. This will open up a Firefox window.

In [43]:
# setup driver
driver = webdriver.Firefox()

To go to a webpage, we pass its URL as the argument of the `get` method.

In [5]:
driver.get("http://www.google.com")

In [44]:
# go to page
# BeautifulSoup alone cannot navigate to a second page, but Selenium can, so we use it here
driver.get("http://loksabhaph.nic.in/Debates/DebateAdvSearch17.aspx")

### Let's Search by Session

We can use the method `find_element_by_name` to find an element on the page by its `name` attribute. An easy way to find that name is to inspect the element with the browser's developer tools.

In [45]:
session=driver.find_element_by_name("ctl00$ContentPlaceHolder1$ddlsession")
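The `find_element_by_*` methods used in this notebook were deprecated in Selenium 4 and removed in Selenium 4.3, so the cell above fails on newer installs. The sketch below is one way to support both styles; `find_by_name` is a hypothetical helper, not part of Selenium, and it relies on the fact that `By.NAME` is the plain string `"name"`.

```python
def find_by_name(driver, name):
    """Find one element by its name attribute on either Selenium API."""
    if hasattr(driver, "find_element_by_name"):
        # Older Selenium releases, as used in this notebook.
        return driver.find_element_by_name(name)
    # Newer releases: "name" is the string behind By.NAME.
    return driver.find_element("name", name)
```

With a live driver this would be called as `session = find_by_name(driver, "ctl00$ContentPlaceHolder1$ddlsession")`.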

To get the different options in this drop-down, we can do the same thing on the element we just found. You'll notice that each option name is associated with a unique value. Since we're getting multiple elements here, we'll use `find_elements_by_tag_name`.

In [46]:
# find options in that drop down
session_options = session.find_elements_by_tag_name("option")

print(session_options[1].get_attribute("value"))
print(session_options[1].text)

1
XVII-I Session(17th june to 6th August) 2019


Now we'll make a dictionary associating each name with its value.

In [47]:
d_options = {option.text.strip(): option.get_attribute("value") for option in session_options if option.get_attribute("value").isdigit()}
print(d_options)

{'XVII-I Session(17th june to 6th August) 2019': '1', 'XVII-II Session(18th Nov to 13th Dec)2019': '2', 'XVII-III Session(31st Jan to 11th Feb)2020 and (2nd Mar to 23rd Mar)2020': '3', 'XVII-IV Session(14th Sep to 23rd Sep)2020': '4'}
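The same mapping can be written as a small reusable function, since we build it again for the debate types. This is a sketch rather than the notebook's own code: `options_to_dict` and `FakeOption` are hypothetical names, and `FakeOption` only imitates the `.text` and `.get_attribute("value")` parts of a Selenium element so that the example runs without a browser.

```python
def options_to_dict(options):
    """Map each option's visible text to its value, skipping non-numeric values."""
    mapping = {}
    for option in options:
        value = option.get_attribute("value")
        # Entries whose value is empty or non-numeric are presumably placeholders.
        if value is not None and value.isdigit():
            mapping[option.text.strip()] = value
    return mapping


class FakeOption:
    """Hypothetical stand-in for a Selenium option element, so the sketch runs offline."""

    def __init__(self, text, value):
        self.text = text
        self._value = value

    def get_attribute(self, name):
        return self._value if name == "value" else None


sample = [
    FakeOption("Select Session", ""),  # assumed placeholder row, not taken from the page
    FakeOption(" XVII-I Session(17th june to 6th August) 2019 ", "1"),
]
print(options_to_dict(sample))
```

With the real page, `options_to_dict(session_options)` should give the same dictionary as the comprehension above while calling `get_attribute` only once per option.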


Now we can select a session by using its name and our dictionary. First we'll wrap the drop-down element in Selenium's `Select` class, and then we'll call `select_by_value` with the value stored for "XVII-IV Session(14th Sep to 23rd Sep)2020".

In [48]:
session_select = Select(session)
session_select.select_by_value(d_options["XVII-IV Session(14th Sep to 23rd Sep)2020"])
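The option names on this page are long and inconsistently spaced, so a mistyped key raises a bare `KeyError`. A small lookup helper can make that failure easier to diagnose. This is a sketch under assumptions: `value_for_label` and `d_demo` are hypothetical names, and the suggestions come from the standard library's `difflib`.

```python
import difflib


def value_for_label(options, label):
    """Return the value stored for `label`, or raise a KeyError listing close matches."""
    if label in options:
        return options[label]
    close = difflib.get_close_matches(label, list(options), n=3, cutoff=0.6)
    raise KeyError(f"{label!r} not found; close matches: {close}")


# One entry copied from the dictionary printed above.
d_demo = {"XVII-IV Session(14th Sep to 23rd Sep)2020": "4"}
print(value_for_label(d_demo, "XVII-IV Session(14th Sep to 23rd Sep)2020"))
```

In the notebook this would be used as `session_select.select_by_value(value_for_label(d_options, "..."))`. Selenium's `Select` also provides `select_by_visible_text`, which selects by the option's text directly.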


In [58]:
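# Note: the element clicked here is the debate-type drop-down (ddldebtype),
# not a dedicated submit button
debate_type_dropdown = driver.find_element_by_name("ctl00$ContentPlaceHolder1$ddldebtype")
debate_type_dropdown.click()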

### Debate Type

We can follow the same steps we used for the sessions to find the different debate types.

In [51]:
# find the "debate type" drop down
debate_type= driver.find_element_by_name("ctl00$ContentPlaceHolder1$ddldebtype")

In [52]:
# get options
debate_type_options = debate_type.find_elements_by_tag_name("option")

print(debate_type_options[1].get_attribute("value"))
print(debate_type_options[1].text)

1
ADJOURNMENT MOTION


In [56]:
d_options = {option.text.strip(): option.get_attribute("value") for option in debate_type_options if option.get_attribute("value").isdigit()}
print(d_options)

{'ADJOURNMENT MOTION': '1', 'ANNOUNCEMENT BY THE CHAIR': '2', 'ASSENT TO BILLS': '3', 'BUDGET (GENERAL)': '4', 'BUDGET (STATES)': '5', 'BUSINESS OF HOUSE': '7', 'ELECTION OF SPEAKER/DY. SPEAKER': '58', 'ELECTION TO COMMITTEES/BOARDS': '59', 'EXPULSION FROM MEMBERSHIP OF LOK SABHA': '0', 'FELICITATIONS': '13', 'GOVERNMENT BILLS': '14', 'GOVERNMENT RESOLUTIONS': '16', 'INTRODUCTION OF MINISTERS': '19', 'INTRODUCTION OF PARLIAMENTARY DELEGATIONS': '56', 'LEAVE OF ABSENCE': '20', 'MATTERS UNDER RULE-377': '21', 'MESSAGES FROM PRESIDENT': '23', 'MESSAGES FROM RAJYA SABHA': '24', "MOTION OF THANKS ON THE PRESIDENT'S ADDRESS": '26', 'OATH OR AFFIRMATION': '31', 'OBITUARY REFERENCE': '32', 'OBSERVATION BY THE CHAIR': '33', 'OBSERVENCE OF SILENCE': '65', 'PANEL OF CHAIRMEN': '61', 'PAPERS LAID ON THE TABLE': '34', 'PARLIAMENTARY COMMITTEES': '35', 'POINTS OF ORDER': '37', 'PRESIDENT ADDRESS': '55', "PRIVATE MEMBERS' BILLS": '39', "PRIVATE MEMBERS' RESOLUTIONS": '40', 'QUESTION OF PRIVILEGE': '4

In [57]:
debate_type_select = Select(debate_type)
debate_type_select.select_by_value(d_options['UNION BUDGET'])

In [59]:
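# Note: the element clicked here is the debate-type drop-down (ddldebtype),
# not a dedicated submit button
debate_type_dropdown = driver.find_element_by_name("ctl00$ContentPlaceHolder1$ddldebtype")
debate_type_dropdown.click()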

### Select the first debate and open it

In [88]:
# find the link to the first debate in the results table, using an absolute XPath
title = driver.find_element_by_xpath("/html/body/form/div[3]/div[5]/div/div[2]/div/div[3]/table[2]/tbody/tr[2]/td[2]/a")

In [90]:
title.click()
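After a click, the new page may take a moment to load, and reading `driver.page_source` too early can return the old page. Selenium ships `WebDriverWait` for this purpose; the sketch below shows the same polling idea in plain Python so that it runs without a browser. The name `wait_until` and its default timings are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Offline demo: the condition becomes true after roughly 0.2 seconds.
start = time.monotonic()
wait_until(lambda: time.monotonic() - start > 0.2, timeout=2.0, interval=0.05)
print("condition met")
```

With a live driver, something like `wait_until(lambda: "Debate" in driver.title)` could be called after `title.click()`, assuming the page title changes once the debate page has loaded.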

In [95]:
# WebDriver has no `html` attribute; the page HTML is available as `driver.page_source`
driver.html

AttributeError: 'WebDriver' object has no attribute 'html'

To parse the html, we'll use BeautifulSoup.

In [96]:
# soup-ify
debate_text = BeautifulSoup(driver.page_source, 'lxml')
debate_text

<html xmlns="http://www.w3.org/1999/xhtml"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
	Debate : Loksabha
</title><link href="../favicon.ico" rel="shortcut icon"/><link href="css/style.css" rel="stylesheet" type="text/css"/><link href="http://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,700" rel="stylesheet" type="text/css"/>
<!-- Navigation STARTS here-->
<link href="css/megafish.css" rel="stylesheet" type="text/css"/><link href="../main-css/style.css" rel="stylesheet" type="text/css"/>
<!-- Navigation Ends here-->
<!-- font size switcher -->
<link href="css/css_small.css" media="screen" rel="alternate stylesheet" title="small" type="text/css"/><link href="css/css_bigger.css" media="screen" rel="alternate stylesheet" title="bigger" type="text/css"/><link href="css/fusiaBlack.css" media="screen" rel="alternate stylesheet" title="fusiaBlack" type="text/css"/><link href="css/yelBlack.css" media="screen" rel="alternate stylesheet" titl

The debate text sits inside `<p>` tags, so we'll collect every `<p>` element and join their text into one string.

In [106]:
# The debate text sits inside <p> tags, so collect every <p> element
debates = debate_text.find_all('p')
# Strip each paragraph and concatenate them into one string
all_text = "".join(p.get_text().strip() for p in debates)

In [107]:
all_text
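Joining the paragraphs with nothing between them runs the last word of one paragraph into the first word of the next. A possible refinement is to drop empty paragraphs and join the rest with a blank line. This is a sketch with made-up sample strings; `join_paragraphs` is a hypothetical helper.

```python
def join_paragraphs(texts, separator="\n\n"):
    """Strip each paragraph, drop empty ones, and join the rest with `separator`."""
    cleaned = (text.strip() for text in texts)
    return separator.join(text for text in cleaned if text)


# Made-up strings standing in for [p.get_text() for p in debate_text.find_all('p')]
sample = ["  First paragraph. ", "", "Second paragraph.\n"]
print(join_paragraphs(sample))
```

In the notebook, the input would be `[p.get_text() for p in debate_text.find_all('p')]`.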



### Challenge 1: Now print the text of the second debate on that page