# Guide to Web Scraping

Let's get you started with web scraping and Python. Before we begin, here are some important rules to follow and understand:

1. Always be respectful and try to get permission before you scrape. Do not bombard a website with scraping requests, otherwise your IP address may be blocked!
2. Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.
3. Pretty much every web scraping project of interest is a unique and custom job, so try your best to generalize the skills learned here.

OK, let's get started with the basics!

## Basic components of a Website

### HTML
HTML stands for Hypertext Markup Language, and every website on the internet uses it to display information. Even the Jupyter notebook system uses it to display this information in your browser. If you right click on a website and select "View Page Source" you can see the raw HTML of a web page. This is the text that Python will look through to grab information. Let's take a look at a simple webpage's HTML:

    <!DOCTYPE html>  
    <html>  
        <head>
            <title>Title on Browser Tab</title>
        </head>
        <body>
            <h1> Website Header </h1>
            <p> Some Paragraph </p>
        </body>
    </html>

Let's break down these components.

Every <tag> indicates a specific block type on the webpage:

    1. <!DOCTYPE html> HTML documents will always start with this type declaration, letting the browser know it's an HTML file.
    2. The component blocks of the HTML document are placed between <html> and </html>.
    3. Metadata and script connections (like a link to a CSS file or a JS file) are often placed in the <head> block.
    4. The <title> tag block defines the title of the webpage (it's what shows up in the tab of a website you're visiting).
    5. Between the <body> and </body> tags are the blocks that will be visible to the site visitor.
    6. Headings are defined by the <h1> through <h6> tags, where the number represents the level of the heading (<h1> is the largest).
    7. Paragraphs are defined by the <p> tag. This is essentially just normal text on the website.

    There are many more tags than just these, such as <a> for hyperlinks, <table> for tables, <tr> for table rows, <td> for table cells, and more!
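
To see this tag structure from Python before any scraping library is involved, here is a minimal sketch that uses only the standard library's `html.parser` module. The `TagLister` class and the `SAMPLE_PAGE` name are hypothetical helpers made up for this illustration, and the sample string is the simple page from above with its body closed by `</body>`.

```python
from html.parser import HTMLParser

# The simple page from above, stored as a string (body closed with </body>)
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
    <head>
        <title>Title on Browser Tab</title>
    </head>
    <body>
        <h1> Website Header </h1>
        <p> Some Paragraph </p>
    </body>
</html>"""


class TagLister(HTMLParser):
    """Hypothetical helper: records every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag such as <html> or <p>
        self.tags.append(tag)


lister = TagLister()
lister.feed(SAMPLE_PAGE)
lister.close()
print(lister.tags)  # ['html', 'head', 'title', 'body', 'h1', 'p']
```

The `<!DOCTYPE html>` declaration does not show up in the list because it is a declaration, not a tag.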

### CSS

CSS stands for Cascading Style Sheets. This is what gives "style" to a website, including colors and fonts, and even some animations! CSS uses attributes such as **id** or **class** to connect an HTML element to a CSS rule, such as a particular color. An **id** identifies a single HTML element and must be unique within the HTML document, so it is basically a single-use connection. A **class** defines a general style that can be linked to multiple HTML elements. Basically, if you only want a single HTML element to be red, you would use an id; if you wanted several HTML elements to be red, you would create a class in your CSS doc and then link it to each of those elements.

### Scraping Guidelines

Keep in mind you should always have permission for the website you are scraping! Check a website's terms and conditions for more info. Also keep in mind that a computer can send requests to a website very fast, so a website may block your computer's IP address if you send too many requests too quickly. Lastly, websites change all the time! You will most likely need to update your code often for long-term web scraping jobs.
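
The guidelines above can be sketched in code. This example makes two assumptions that are not part of the original text: it treats a site's `robots.txt` file as one extra signal of what may be fetched (it does not replace reading the terms and conditions), and it pauses between requests with `time.sleep`. The names `load_rules`, `fetch_politely`, `my-scraper`, and the sample rules are all hypothetical, and a stand-in function replaces the real download so the sketch runs without network access.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser object."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def fetch_politely(urls, rules, fetch, delay_seconds=1.0, user_agent="my-scraper"):
    """Call fetch(url) for each allowed URL, pausing after every request."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip anything the rules disallow
        pages[url] = fetch(url)
        time.sleep(delay_seconds)  # do not bombard the website
    return pages


rules = load_rules(SAMPLE_ROBOTS_TXT)
urls = [
    "http://example.com/Debates/page1.aspx",
    "http://example.com/private/page2.aspx",
]
# Stand-in for a real download function such as requests.get
pages = fetch_politely(urls, rules, fetch=lambda url: "<html>stub</html>", delay_seconds=0.1)
print(list(pages))  # ['http://example.com/Debates/page1.aspx']
```

Passing the download function in as `fetch` keeps the pacing logic separate from the library that performs the request.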

## Web Scraping with Python

There are a few libraries you will need. You can go to your command line and install them with conda install (if you are using the Anaconda distribution), or pip install for other Python distributions.

    conda install requests
    conda install lxml
    conda install bs4
    
If you are not using the Anaconda distribution, you can use **pip install** instead of **conda install**, for example:

    pip install requests
    pip install lxml
    pip install bs4
    
Now let's see what we can do with these libraries.

### Example Task 0 - Grabbing the title of a page

Let's start very simple: we will grab the title of a page. Remember that this is the HTML block with the **title** tag. For this task we will use **http://loksabhaph.nic.in/Debates/Debatetextsearch16.aspx**, a website that I used specifically to scrape debates.

In [2]:
import requests

In [3]:
# Step 1: Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python/Jupyter 
# Note sometimes you need to run this twice if it fails the first time
res = requests.get("http://loksabhaph.nic.in/Debates/Debatetextsearch16.aspx")

This object is a requests.models.Response object and it actually contains the information from the website, for example:

In [5]:
type(res)

requests.models.Response

In [6]:
res.text

'\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>\r\n\tDebate text Search\r\n</title><link rel="shortcut icon" href="../favicon.ico" /><link rel="stylesheet" type="text/css" href="css/style.css" /><link href="http://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,700" rel="stylesheet" type="text/css" />\r\n    <!-- Navigation STARTS here-->\r\n    <link href="css/megafish.css" rel="stylesheet" type="text/css" /><link rel="stylesheet" type="text/css" href="../main-css/style.css" />\r\n    <!-- Navigation Ends here-->\r\n    <!-- font size switcher -->\r\n    <link rel="alternate stylesheet" type="text/css" media="screen" title="small" href="css/css_small.css" /><link rel="alternate stylesheet" type="text/css" media="screen" title="bigger" href="css/css_bigger.css" /><
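
The note in the request cell about running it a second time can be turned into code. The following is a rough sketch, not part of the original notebook: `fetch_with_retries` is a hypothetical helper, and the `attempts`, `timeout`, and `wait_seconds` values are assumptions you would tune for your own job. It relies on the `timeout` argument of `requests.get`, on `Response.raise_for_status()`, and on the `requests.exceptions.RequestException` base class.

```python
import time

import requests


def fetch_with_retries(url, attempts=2, timeout=10, wait_seconds=1.0, get=requests.get):
    """Hypothetical helper: request a page, retrying a limited number of times."""
    last_error = None
    for attempt in range(attempts):
        try:
            res = get(url, timeout=timeout)
            # Turn HTTP error codes (4xx, 5xx) into exceptions
            res.raise_for_status()
            return res
        except requests.exceptions.RequestException as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(wait_seconds)  # wait a little before trying again
    raise last_error


# Usage (needs network access, so it is left commented out here):
# res = fetch_with_retries("http://loksabhaph.nic.in/Debates/Debatetextsearch16.aspx")
# print(res.text[:200])
```

The `get` parameter exists only so that the download function can be swapped out, for example when testing without a network connection.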

____
Now we use BeautifulSoup to analyze the extracted page. Technically we could use our own custom script to look for items in the string of **res.text**, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage. Don't ask me about the weird library names, I didn't choose them! :)

In [7]:
import bs4

In [8]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [9]:
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
	Debate text Search
</title><link href="../favicon.ico" rel="shortcut icon"/><link href="css/style.css" rel="stylesheet" type="text/css"/><link href="http://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,700" rel="stylesheet" type="text/css"/>
<!-- Navigation STARTS here-->
<link href="css/megafish.css" rel="stylesheet" type="text/css"/><link href="../main-css/style.css" rel="stylesheet" type="text/css"/>
<!-- Navigation Ends here-->
<!-- font size switcher -->
<link href="css/css_small.css" media="screen" rel="alternate stylesheet" title="small" type="text/css"/><link href="css/css_bigger.css" media="screen" rel="alternate stylesheet" title="bigger" type="text/css"/><link href="css/fusiaBlack.css" media="screen" rel="alternate 

In [10]:
soup.select('title')

[<title>
 	Debate text Search
 </title>]

Notice what is returned here: it's actually a list containing all the title elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since this object is still a specialized tag, we can use method calls to grab just the text.

In [11]:
title_tag = soup.select('title')

In [12]:
title_tag[0]

<title>
	Debate text Search
</title>

In [13]:
type(title_tag[0])

bs4.element.Tag

In [14]:
title_tag[0].getText()

'\r\n\tDebate text Search\r\n'

The \r, \n and \t here are escape characters for whitespace. We can remove them with the .strip() method.

In [15]:
title_tag[0].getText().strip()

'Debate text Search'
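
As a compact alternative to the steps above, here is a small sketch that uses `select_one()` and `get_text(strip=True)`. To keep it runnable without a request, it parses an inline stand-in for the page's title block, and it swaps in Python's built-in "html.parser" for "lxml". The names `SAMPLE` and `demo_soup` are made up for this example.

```python
import bs4

# Inline stand-in for the page's <title> block, so no request is needed
SAMPLE = "<html><head><title>\r\n\tDebate text Search\r\n</title></head></html>"

# "html.parser" is Python's built-in parser, used here instead of "lxml"
demo_soup = bs4.BeautifulSoup(SAMPLE, "html.parser")

# select_one() returns the first matching tag (or None) instead of a list
title_tag = demo_soup.select_one('title')

# get_text(strip=True) removes the surrounding whitespace in the same call
title_text = title_tag.get_text(strip=True)
print(title_text)  # Debate text Search
```

Because `select_one()` returns `None` when nothing matches, it is worth checking the result before calling methods on it in a real scraper.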

### Example Task 1 - Grabbing all elements of a class

Let's try to grab the total records present on the page.

Now it's time to figure out what we are actually looking for. Inspect the element on the page to see which tag, id, or class holds the value you want. When the target is identified by a class or an id rather than a plain tag name, we need to follow CSS selector syntax. The table below lists the common patterns:

<table>

<thead >
<tr>
<th>
<p>Syntax to pass to the .select() method</p>
</th>
<th>
<p>Match Results</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p><code>soup.select('div')</code></p>
</td>
<td>
<p>All elements with the <code>&lt;div&gt;</code> tag</p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('#some_id')</code></p>
</td>
<td>
<p>The HTML element containing the <code>id</code> attribute of <code>some_id</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('.notice')</code></p>
</td>
<td>
<p>All the HTML elements with the CSS <code>class</code> named <code>notice</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div span')</code></p>
</td>
<td>
<p>Any elements named <code>&lt;span&gt;</code> that are within an element named <code>&lt;div&gt;</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div &gt; span')</code></p>
</td>
<td>
<p>Any elements named <code class="literal2">&lt;span&gt;</code> that are <span><em >directly</em></span> within an element named <code class="literal2">&lt;div&gt;</code>, with no other element in between</p>
</td>
</tr>
</tbody>
</table>
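
To try each row of the table, here is a small sketch that runs the same selectors against a made-up HTML fragment. The fragment, the `demo_soup` name, and the use of the built-in "html.parser" (instead of "lxml") are assumptions made for this illustration.

```python
import bs4

# Made-up fragment that contains one example of each pattern in the table
SAMPLE_HTML = """
<div id="some_id">
  <span class="notice">direct child</span>
  <p><span>nested deeper</span></p>
</div>
<span class="notice">outside any div</span>
"""

demo_soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

all_divs = demo_soup.select('div')          # every <div> element
by_id = demo_soup.select('#some_id')        # the element with id="some_id"
by_class = demo_soup.select('.notice')      # every element with class="notice"
descendants = demo_soup.select('div span')  # <span> anywhere inside a <div>
children = demo_soup.select('div > span')   # <span> directly inside a <div>

print(len(all_divs), len(by_id), len(by_class), len(descendants), len(children))
# 1 1 2 2 1
```

Note the difference between the last two selectors: the span inside the paragraph counts as a descendant of the div, but not as a direct child.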

In [27]:
soup.select('span')

[<span class="t1">Parliament of India</span>,
 <span class="t2">House of the
                         People</span>,
 <span id="error" style="color: Red; display: none">* Special Characters not allowed</span>,
 <span class="indtime">India Time</span>,
 <span id="clockbox1" style="color: #01538C;">
 </span>,
 <span class="yourtime">Your Time</span>,
 <span id="clockbox2" style="color: #01538C;">
 </span>,
 <span class="mainheader" id="ContentPlaceHolder1_Label1" style="display:inline-block;">Debate Search by Text (Seventeenth Lok Sabha)</span>,
 <span style="color: #993366">Archive :</span>,
 <span style="color: #993366; margin-right: 10px;">Search by:</span>,
 <span class="instruction" id="ContentPlaceHolder1_Label2" style="display:inline-block;">Enter the search keywords and click SEARCH Button.</span>,
 <span class="button" id="ContentPlaceHolder1_Label3" style="display:inline-block;"><font color="Navy">Matches on:</font></span>,
 <span class="radiobutton" style="display:inline-block

### We can also use another function to find elements
##### r=soup.find_all('element',attrs={'id':"viewTable"})  
Here attrs stands for attributes, and id is the unique identifier of the element we are looking for.

Now let's find the element which contains <b>total records</b>:

    <span class="label1" id="ContentPlaceHolder1_totalr" style="display:inline-block;"><b>Total Records :<b>5201</b></b></span>

In [33]:
total_records_element=soup.find_all('span',attrs={'id':'ContentPlaceHolder1_totalr'})
total_records_element

[<span class="label1" id="ContentPlaceHolder1_totalr" style="display:inline-block;"><b>Total Records :<b>5201</b></b></span>]

As we can see, we have found the element containing the total records. Since find_all returns a list, we have to take its first item.  
Let's get the main text containing Total Records.

In [35]:
number_records=total_records_element[0].find_all('b')

In [40]:
records=number_records[0].getText()

In [41]:
records

'Total Records :5201'
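
If you need the count as a number and not as text, one possible follow-up is to pull the digits out with a regular expression. This is a sketch under stated assumptions: it parses an inline copy of the span shown above with the built-in "html.parser" instead of "lxml", and the names `SAMPLE`, `demo_soup`, and `total_records` are made up for the example.

```python
import re

import bs4

# Inline copy of the span from the page, so no request is needed
SAMPLE = ('<span class="label1" id="ContentPlaceHolder1_totalr" '
          'style="display:inline-block;"><b>Total Records :<b>5201</b></b></span>')

demo_soup = bs4.BeautifulSoup(SAMPLE, "html.parser")

# find() returns the first match (or None), so no [0] indexing is needed
span = demo_soup.find('span', attrs={'id': 'ContentPlaceHolder1_totalr'})
label_text = span.get_text()  # 'Total Records :5201'

# Grab the first run of digits and convert it to an integer
match = re.search(r'\d+', label_text)
total_records = int(match.group()) if match else None
print(total_records)  # 5201
```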

# Challenge 1: Find the total pages.  
Hint 1: It is present on this page, in a span that has a class.

### Example Task 2 - Printing all the debate information on the first page

At this point we don't know where this information sits in the page.  
To find out, select the information you want on the website, right click, and choose <code>Inspect</code> or <code>Inspect Element</code>.  
This shows which tag that particular information is stored in.  



In [45]:
# All the records are stored in table tags, so we need an id or some other attribute that is unique to those tables
table=soup.find_all('table',attrs={'border':"0"})


In [46]:
table

[<table border="0" class="debate-search" id="Table3" style="width: 607px; height: 30px; margin-bottom: -6px;
                     margin-left: -3px;" width="607">
 <tr>
 <td style="width: 482px; height: 26px;" valign="top">
 <input id="ContentPlaceHolder1_TextBox1" name="ctl00$ContentPlaceHolder1$TextBox1" tabindex="1" title="Please use single 'SPACE' as separator." type="text"/>
 </td>
 <td style="width: 63px; height: 26px;" valign="top">
 <input class="submit" id="ContentPlaceHolder1_search1" name="ctl00$ContentPlaceHolder1$search1" onclick='javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$search1", "", true, "", "", false, false))' type="submit" value=""/>
 </td>
 <td style="height: 26px" valign="top">
 <input class="reset" id="ContentPlaceHolder1_btnReset" name="ctl00$ContentPlaceHolder1$btnReset" onclick='javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$btnReset", "", true, "", "", false,

* Now we can see that all the debate information is stored inside these tables  
* The information for an individual debate is stored inside <code>tr</code> tags
* Since the tables are returned as a list, we select the <code>tr</code> tags of each list element


In [64]:
#First debate
first_debate=table[1].find_all('tr')
first_debate

[<tr>
 <td class="instruction" valign="top" width="20%">
                                     Type of Debate:
                                 </td>
 <td class="griditemTitle">
 <a href="Result17.aspx?dbsl=4742">RULING BY THE SPEAKER</a>
 </td>
 </tr>,
 <tr>
 <td class="instruction" valign="top" width="20%">
                                     Title:
                                 </td>
 <td class="griditemTitle">
 <a href="Result17.aspx?dbsl=4742">Regarding notices of 
 Adjournment Motion. </a>
 </td>
 </tr>,
 <tr>
 <td class="instruction" valign="top" width="20%">
                                     Date:
                                 </td>
 <td class="griditem">
 <a href="Result17.aspx?dbsl=4742">23-09-2020</a>
 </td>
 </tr>,
 <tr>
 <td class="instruction" valign="top" width="20%">
                                     Participants:
                                 </td>
 <td class="griditem">
 <a href="Result17.aspx?dbsl=4742&amp;ser=&amp;smode=t#4716*1">Birla, Shri Om</a>
 <

* The first debate contains all the information, such as Type of Debate, Title, Date, Participants, Ref. Keywords
* Everything is returned in a list, so we have to loop through it to get the required information

In [75]:
# The first row contains the Type of Debate heading, which is in a td tag
# The corresponding value is inside the <a> tag
first_debate[0]

<tr>
<td class="instruction" valign="top" width="20%">
                                    Type of Debate:
                                </td>
<td class="griditemTitle">
<a href="Result17.aspx?dbsl=4742">RULING BY THE SPEAKER</a>
</td>
</tr>
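
The `<a>` tag in this row also carries an `href` attribute, and it is a relative link. Here is a minimal sketch of how the full address could be built with `urllib.parse.urljoin`. It assumes the link is relative to the page URL used earlier, parses an inline copy of the row with the built-in "html.parser" instead of "lxml", and uses made-up names (`PAGE_URL`, `ROW_HTML`, `demo_row`).

```python
from urllib.parse import urljoin

import bs4

# The page the row was scraped from (assumed to be the base of the link)
PAGE_URL = "http://loksabhaph.nic.in/Debates/Debatetextsearch16.aspx"

# Inline copy of the row shown above
ROW_HTML = '''<tr>
<td class="instruction" valign="top" width="20%">Type of Debate:</td>
<td class="griditemTitle"><a href="Result17.aspx?dbsl=4742">RULING BY THE SPEAKER</a></td>
</tr>'''

demo_row = bs4.BeautifulSoup(ROW_HTML, "html.parser")
link = demo_row.find('a')

# .get('href') reads the attribute and returns None if it is missing
relative_url = link.get('href')
absolute_url = urljoin(PAGE_URL, relative_url)
print(absolute_url)  # http://loksabhaph.nic.in/Debates/Result17.aspx?dbsl=4742
```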

In [104]:

heading_debate=first_debate[0].find_all('td',attrs={'class':"instruction"})
heading=heading_debate[0].getText().strip()
type_debate=first_debate[0].find_all('a')
types=type_debate[0].getText().strip()
debate_type=heading+types
debate_type

'Type of Debate:RULING BY THE SPEAKER'

In [105]:
# The same is done for the title of the debate, which is in the second element of first_debate
title_heading=first_debate[1].find_all('td',attrs={'class':"instruction"})
title_heading=title_heading[0].getText().strip()
title_debate=first_debate[1].find_all('a')
title=title_debate[0].getText().strip()
debate_title=title_heading+title
debate_title

'Title:Regarding notices of \r\nAdjournment Motion.'

As we can see, we are doing the same thing for every piece of information inside the debate. To avoid the repetition, we can use a loop.

In [107]:
# Loop over every row of the first debate
for row in first_debate:
    heading_debate=row.find_all('td',attrs={'class':"instruction"})
    heading=heading_debate[0].getText().strip()
    type_debate=row.find_all('a')
    if len(type_debate)>0:
        types=type_debate[0].getText().strip()
        debate_type=heading+types
        print(debate_type)

Type of Debate:RULING BY THE SPEAKER
Title:Regarding notices of 
Adjournment Motion.
Date:23-09-2020
Participants:Birla, Shri Om
Ref.Keywords:
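
Printing is fine for a first look, but you will often want to keep the values. Here is one possible sketch that stores the label and value of each row in a dictionary and collapses the line breaks inside the text. The helper names `clean` and `rows_to_record` are hypothetical, the rows are an inline, shortened copy of the first debate, and the built-in "html.parser" is used instead of "lxml". As in the loop above, only the first link of each row is read.

```python
import bs4

# Inline, shortened copy of the first debate's rows
ROWS_HTML = '''<table>
<tr><td class="instruction">Type of Debate:</td>
    <td class="griditemTitle"><a href="Result17.aspx?dbsl=4742">RULING BY THE SPEAKER</a></td></tr>
<tr><td class="instruction">Title:</td>
    <td class="griditemTitle"><a href="Result17.aspx?dbsl=4742">Regarding notices of \r\nAdjournment Motion. </a></td></tr>
<tr><td class="instruction">Date:</td>
    <td class="griditem"><a href="Result17.aspx?dbsl=4742">23-09-2020</a></td></tr>
</table>'''


def clean(text):
    """Collapse runs of whitespace (including \\r\\n) into single spaces."""
    return " ".join(text.split())


def rows_to_record(rows):
    """Hypothetical helper: map each row's label to the text of its first link."""
    record = {}
    for row in rows:
        label_cell = row.find('td', attrs={'class': 'instruction'})
        link = row.find('a')
        if label_cell is None or link is None:
            continue  # skip rows that lack a label or a link
        label = clean(label_cell.get_text()).rstrip(':')
        record[label] = clean(link.get_text())
    return record


demo_rows = bs4.BeautifulSoup(ROWS_HTML, "html.parser").find_all('tr')
record = rows_to_record(demo_rows)
print(record)
# {'Type of Debate': 'RULING BY THE SPEAKER', 'Title': 'Regarding notices of Adjournment Motion.', 'Date': '23-09-2020'}
```

Using `find()` with a `None` check avoids the IndexError that `find_all(...)[0]` raises when a row has no matching cell.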


In [110]:
# We can do the same for all debates by nesting the loops
# The debates start at index 1, because table[0] is the search form
for j in range(1,len(table)):
    debate_rows=table[j].find_all('tr')
    print(str(j)+" Debate")
    for row in debate_rows:
        heading_debate=row.find_all('td',attrs={'class':"instruction"})
        heading=heading_debate[0].getText().strip()
        type_debate=row.find_all('a')
        if len(type_debate)>0:
            types=type_debate[0].getText().strip()
            debate_type=heading+types
            print(debate_type)

1 Debate
Type of Debate:RULING BY THE SPEAKER
Title:Regarding notices of 
Adjournment Motion.
Date:23-09-2020
Participants:Birla, Shri Om
Ref.Keywords:
2 Debate
Type of Debate:PAPERS LAID ON THE TABLE
Title:Papers laid on 
the Table of the House by Ministers/Members.
Date:23-09-2020
Participants:Meghwal, Shri Arjun Ram
Ref.Keywords:
3 Debate
Type of Debate:PARLIAMENTARY COMMITTEES
Title:Presentation of 3rd, 4th and 
5th Action Taken Reports (2020-2021) of the Committee on Estimates .
Date:23-09-2020
Participants:Bapat, Shri Girish Bhalchandra
Ref.Keywords:
4 Debate
Type of Debate:MESSAGES FROM RAJYA SABHA
Title:Rajya Sabha agreed without any amendment to
the Indian Institute of Information Technology Laws (Amendment) Bill, 2020; the
 Essential Commodities (Amendment) Bill, 2020; the Banking Regulation
(Amendment) Bill, 2020; the Companies (Amendment) Bill, 2020; the National
Forensic Sciences University Bill, 2020; the Rashtriya Raksha University Bill,
2020; the Foregin Contrib

# Challenge 2: Now try to print all the debate information on the first page of the 15th Lok Sabha
Link: http://loksabhaph.nic.in/Debates/DebateAdvSearch15.aspx  
Hint: You will again have to use the requests and bs4 libraries for this webpage
    