Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2019 R.R.Watkins

Note: This tutorial is based on: https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/


## Introduction

Web scraping is a useful technique for converting unstructured data on the web (such as a table or list) into structured data (such as a dataframe) that you can use for a variety of purposes. For example, you might want to scrape an education website to get data on school test scores in order to analyze student performance, or you might want to take voting information by zip code from a government webpage to create a visualization of voting trends in your area.

These are just a few of the questions / problems / products whose solutions might start with web scraping and information extraction (data collection) before you get to data analysis and interpretation. You can probably think of many others!


## Ways to extract information from the web

There are several ways to extract information from the web. **APIs** are probably the best way to extract data from a website. Almost all large websites like Twitter, Facebook, Google, Reddit, and StackOverflow provide APIs to access their data (or at least limited sections of their data) in a more structured manner. For example, if you want to get all tweets using #NASA during a time period, you can get them through the Twitter API in a format that is easy to use. So, if you can get the data you want through an API, that is almost always the preferred approach over web scraping, because you are getting structured data directly from the provider.
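
As a rough illustration of why structured API data is easier to work with, the sketch below parses a JSON payload with Python's standard `json` module. The payload, its field names (`posts`, `id`, `text`, `hashtags`), and the hashtag filter are all hypothetical and do not correspond to any real API's response format.

```python
import json

# Hypothetical API response; real APIs define their own field names.
payload = """
{
  "posts": [
    {"id": 1, "text": "Launch day!", "hashtags": ["NASA", "space"]},
    {"id": 2, "text": "Lunch time", "hashtags": ["food"]},
    {"id": 3, "text": "New images from orbit", "hashtags": ["NASA"]}
  ]
}
"""

data = json.loads(payload)

# Because the data is already structured, filtering is a simple lookup.
nasa_posts = [post for post in data["posts"] if "NASA" in post["hashtags"]]

for post in nasa_posts:
    print(post["id"], post["text"])
```

No HTML parsing is needed here: each record arrives as a dictionary with named fields.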

**RSS feeds** are another way that a website can share information for people to use for other purposes. For example, blogs will often produce an RSS feed that you can use to get a copy of all the recent posts, which you can then use for your work. But RSS feeds are limited in their use and are mostly found on sites that have routine updates (such as news pages, blogs, or podcasts).
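
To give a feel for what reading a feed involves, here is a minimal sketch that pulls post titles and links out of an RSS 2.0 document using the standard library's `xml.etree.ElementTree`. The feed content is made up and inlined so the example runs offline; a real feed would first be downloaded from the site's feed URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example runs offline.
feed_xml = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(feed_xml)

# Each <item> inside <channel> is one entry in the feed.
posts = [
    (item.findtext("title"), item.findtext("link"))
    for item in root.iter("item")
]

for title, link in posts:
    print(title, link)
```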

But what can you do when you want information that is on a website but they don't have an API or RSS feed that meets your purposes? Well, that is when you scrape the website to grab the information.


## What is Web Scraping?

Web scraping is a technique for extracting information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

You can perform web scraping in various ways, using almost any programming language. But Python makes it pretty easy, as you will learn below, and it has lots of libraries that can help. Python also has a large (and growing) number of users who are continually creating new libraries and packages, as well as tutorials for customizing your projects.

Python is an open source programming language, and you will often find multiple libraries that can perform the same function. We will use the ‘BeautifulSoup’ library for our lesson below because it is easy and intuitive to work with.
 

## Libraries required for web scraping
We will use two libraries in the web scraping project below:

**BeautifulSoup**: It is an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to extract specific information from web pages.

BeautifulSoup does not, however, fetch the web page for us. We will also use a library for opening the webpage URL:

**urllib.request**: It is a Python module that can be used for fetching URLs.

Python has several other options for HTML scraping in addition to BeautifulSoup. Here are some others: mechanize, scrapemark, or scrapy.
 

## Basics – Get familiar with HTML (Tags)
While scraping the web, you deal with HTML tags, so it is quite useful to have a good understanding of them. Below is the basic syntax of HTML and the most common tags:

    <!DOCTYPE html> : HTML documents must start with a type declaration
    The HTML document is contained between <html> and </html>
    The visible part of the HTML document is between <body> and </body>
    HTML headings are defined with the <h1> to <h6> tags
    HTML paragraphs are defined with the <p> tag

    Other useful HTML tags are:

    HTML links are defined with the <a> tag, e.g. <a href="http://www.test.com">This is a link for test.com</a>
    HTML tables are defined with <table>, rows with <tr>, and rows are divided into data cells with <td>
    HTML lists start with <ul> (unordered) or <ol> (ordered). Each item of the list starts with <li>
    
If you are new to HTML tags, I would also recommend working through the HTML tutorial from W3Schools. It will give you a clear understanding of HTML tags.

Before getting started, take a look at the webpage you will be scraping:
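
To see the tags listed above in one place, here is a small sketch that feeds a made-up HTML page to the standard library's `html.parser.HTMLParser` and records every opening tag it meets. The page content and the `TagCollector` class are illustrative only; the rest of this tutorial uses BeautifulSoup instead.

```python
from html.parser import HTMLParser

# A small, made-up page that uses the tags described above.
sample_html = """<!DOCTYPE html>
<html>
  <body>
    <h1>Capitals</h1>
    <p>See <a href="http://www.test.com">this link</a>.</p>
    <table>
      <tr><td>Assam</td><td>Dispur</td></tr>
    </table>
    <ul><li>First</li><li>Second</li></ul>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(sample_html)
print(collector.tags)
```

The printed list shows the nesting order: the document starts with `html` and `body`, and the table is made of `tr` rows holding `td` cells.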
<a href="https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India" target="_blank">  https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India</a>

Ok, now it is time to begin... you will have to import the **BeautifulSoup** library, which is part of the **bs4** package, and import **urllib.request**.  

In [1]:
import urllib.request
from bs4 import BeautifulSoup

With the libraries now loaded, you will want to specify a variable for the webpage (i.e., URL) you are scraping. In this example we will call it "wiki", and use **urllib** to open that URL. Then you will want to run **BeautifulSoup** on that "page".

In [2]:
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly to avoid a warning
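
Some sites may refuse requests that do not identify themselves. As a hedged variation on the cell above, the sketch below builds a `urllib.request.Request` with a `User-Agent` header before opening the URL. The helper names `build_request` and `fetch_html`, the header value, and the timeout are assumptions made for illustration, not part of the original tutorial.

```python
import urllib.request

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"


def build_request(url):
    """Create a request that identifies the script to the server."""
    # The User-Agent value is a placeholder; replace it with your own details.
    return urllib.request.Request(
        url, headers={"User-Agent": "scraping-tutorial-example/0.1"}
    )


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes (requires network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch_html() does.
request = build_request(WIKI_URL)
print(request.full_url)
```

The bytes returned by `fetch_html(WIKI_URL)` can be passed to `BeautifulSoup` in place of the `page` object.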

You can view a neatly indented version of the webpage's HTML using the "soup.prettify()" method.

Exercise: In the cell below, write code to print (i.e., view) the HTML of the webpage.

Do your results look like this at the top?

    <!DOCTYPE html>
    <html class="client-nojs" dir="ltr" lang="en">
     <head>
      <meta charset="utf-8"/>
      <title>
       List of state and union territory capitals in India - Wikipedia
      </title>
      <script>
       document.documentElement.className="client-js"...

If not, try again.  If you can't get it, highlight the lines below to view the correct code.

<span style="color:white"> <br>
print(soup)<br>
</span>


Can you also come up with other ways to print it out?
<span style="color:white"><br>
soup<br>
print(soup.prettify())<br> <br>
    </span>

Using BeautifulSoup, you can scrape specific elements from the webpage by their HTML tags. You can bring in the whole element, including the tags, or just the contents found between the tags (in the example below, a "string").

In [3]:
soup.title

<title>List of state and union territory capitals in India - Wikipedia</title>

In [4]:
soup.title.string


'List of state and union territory capitals in India - Wikipedia'

In [5]:
soup.a

<a id="top"></a>
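
The difference between a whole element and its contents can be tried offline. This is a minimal sketch on a made-up HTML snippet (the snippet and the variable names are illustrative), using the same attribute-style access as the cells above plus `get_text()` for elements that contain nested tags.

```python
from bs4 import BeautifulSoup

# Made-up page so the example runs without network access.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a id="top"></a>
    <p>Visit <a href="/wiki/Assam">Assam</a> today.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(html, "html.parser")

title_tag = sample_soup.title           # the whole element, tags included
title_text = sample_soup.title.string   # only the text between the tags
first_link = sample_soup.a              # the first <a> in the document

# .string is None when an element has several children; get_text() joins them.
paragraph_string = sample_soup.p.string
paragraph_text = sample_soup.p.get_text()

print(title_tag, title_text, first_link, paragraph_text, sep="\n")
```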

BeautifulSoup can also be used to find specific HTML tags within a webpage. This is very helpful for locating the specific elements that you want to scrape, and it is where knowing HTML tags becomes useful. For example, HTML tables (such as the one we want to scrape from this Wikipedia page) are marked by "table" tags. So to find a table, we can use BeautifulSoup to locate and return the table(s) we want.

In [6]:
soup.find_all("table")

[<table class="vertical-navbox nowraplinks" style="float:right;clear:right;width:22.0em;margin:0 0 1.0em 1.0em;background:#f9f9f9;border:1px solid #aaa;padding:0.2em;border-spacing:0.4em 0;text-align:center;line-height:1.4em;font-size:88%"><tbody><tr><th style="padding:0.2em 0.4em 0.2em;font-size:145%;line-height:1.2em"><a href="/wiki/States_and_union_territories_of_India" title="States and union territories of India">States and union <br/> territories of India</a> <br/> ordered by</th></tr><tr><td style="padding:0.2em 0 0.4em"><div class="center"><div class="floatnone"><a class="image" href="/wiki/File:Flag_of_India.svg"><img alt="Flag of India.svg" data-file-height="900" data-file-width="1350" decoding="async" height="47" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/70px-Flag_of_India.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/105px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.

You can further narrow down the tables on the page by using other HTML attributes, such as the table's class. This allows us to get the specific table we want from the webpage. You can use the "inspect" function of your web browser to look at the HTML code for elements within the webpage as well.

In [7]:
right_table=soup.find('table', class_='wikitable sortable plainrowheaders')
right_table

<table class="wikitable sortable plainrowheaders">
<tbody><tr>
<th scope="col">No.
</th>
<th scope="col">State or<br/>union territory
</th>
<th scope="col">Administrative capital
</th>
<th scope="col">Legislative capital
</th>
<th scope="col">Judicial capital
</th>
<th scope="col">Year of establishment
</th>
<th scope="col">Former capital
</th></tr>
<tr>
<td>1
</td>
<th scope="row"><a href="/wiki/Andaman_and_Nicobar_Islands" title="Andaman and Nicobar Islands">Andaman and Nicobar Islands</a> <img alt="union territory" data-file-height="14" data-file-width="9" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/commons/3/37/Dagger-14-plain.png" width="9"/>
</th>
<td><a href="/wiki/Port_Blair" title="Port Blair">Port Blair</a>
</td>
<td> —
</td>
<td>Kolkata
</td>
<td>1955
</td>
<td>Calcutta (1945–1955)
</td></tr>
<tr>
<td>2
</td>
<th scope="row"><a href="/wiki/Andhra_Pradesh" title="Andhra Pradesh">Andhra Pradesh</a>
</th>
<td><a class="mw-redirect" href="/wiki/Hyderabad,_
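
The way `class_` matching behaves is easier to see on a small page. The following sketch uses a made-up snippet with two tables (the snippet and variable names are illustrative) and compares three ways of selecting the second one.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one navigation table and one data table.
html = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable plainrowheaders">
  <tr><th>No.</th><th>State</th></tr>
  <tr><td>1</td><td>Assam</td></tr>
</table>
"""

sample_soup = BeautifulSoup(html, "html.parser")

all_tables = sample_soup.find_all("table")

# A multi-class string must match the class attribute value exactly.
exact = sample_soup.find("table", class_="wikitable sortable plainrowheaders")

# A single class name matches any table that carries that class.
by_one_class = sample_soup.find("table", class_="wikitable")

# A CSS selector matches the listed classes regardless of their order.
by_selector = sample_soup.select_one("table.sortable.wikitable")

print(len(all_tables), len(exact.find_all("tr")))
```

If the class attribute on the live page gains a class or changes order, the exact multi-class match would return `None`, so the single-class or selector form may be the more forgiving choice.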

Exercise: In the cell below, write code that will retrieve just rows #2 and #3 of the table.  

Do your results look like this?

    [<tr>
    <td>2
    </td>
    <th scope="row"><a href="/wiki/Andhra_Pradesh" title="Andhra Pradesh">Andhra Pradesh</a>
    </th>
    <td><a class="mw-redirect" href="/wiki/Hyderabad,_India" title="Hyderabad, India">Hyderabad</a> <small>(<i>de jure</i> to 2024)</small><br/><a href="/wiki/Amaravati" title="Amaravati">Amaravati</a> <small>(<i>de facto</i> from 2017)</small><sup class="reference" id="cite_ref-gulte.com_3-0"><a href="#cite_note-gulte.com-3">[3]</a></sup><sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[4]</a></sup><sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[a]</a></sup>
    </td>
    <td><a href="/wiki/Amaravati" title="Amaravati">Amaravati</a><sup class="reference" id="cite_ref-gulte.com_3-1"><a href="#cite_note-gulte.com-3">[3]</a></sup>
    </td>
    <td><a href="/wiki/Andhra_Pradesh_High_Court" title="Andhra Pradesh High Court">Amaravati</a>
    </td>
    <td>1956<br/>2017
    </td>
    <td><a href="/wiki/Kurnool" title="Kurnool">Kurnool</a> (1953-1956)
    </td></tr>, <tr>
    <td>3
    </td>
    <th scope="row"><a href="/wiki/Arunachal_Pradesh" title="Arunachal Pradesh">Arunachal Pradesh</a>
    </th>
    <td><a href="/wiki/Itanagar" title="Itanagar">Itanagar</a>
    </td>
    <td>Itanagar
    </td>
    <td><a href="/wiki/Guwahati" title="Guwahati">Guwahati</a>
    </td>
    <td>1986
    </td>
    <td> —
    </td></tr>]
    
If not, try again.  If you can't get it, highlight the lines below to view the correct code.

<span style="color:white">
table=soup.find('table', class_='wikitable sortable plainrowheaders')<br>
allrows = table.find_all('tr')<br>
print (allrows[2:4]) </span>

You can also retrieve all the links from the webpage.

In [8]:
all_links = soup.find_all("a")
for link in all_links:
    print (link.get("href"))

None
#mw-head
#p-search
/wiki/States_and_union_territories_of_India
/wiki/File:Flag_of_India.svg
/wiki/List_of_states_and_union_territories_of_India_by_area
/wiki/List_of_states_and_union_territories_of_India_by_population
/wiki/List_of_Indian_states_and_union_territories_by_GDP
/wiki/List_of_Indian_states_and_union_territories_by_GDP_per_capita
/wiki/ISO_3166-2:IN
None
/wiki/List_of_Indian_states_by_Child_Nutrition
/wiki/List_of_states_and_union_territories_of_India_by_crime_rate
/wiki/List_of_states_and_union_territories_of_India_by_households_having_electricity
/wiki/List_of_states_and_union_territories_of_India_by_fertility_rate
/wiki/Forest_cover_by_state_in_India
/wiki/Ease_of_doing_business_ranking_of_states_of_India
/wiki/List_of_Indian_states_and_territories_by_highest_point
/wiki/Indian_states_ranked_by_HIV_awareness
/wiki/List_of_Indian_states_and_territories_by_Human_Development_Index
/wiki/Indian_states_ranking_by_families_owning_house
/wiki/Indian_states_ranking_by_househ

You can also retrieve just the last five links on the webpage.

In [9]:
all_links = soup.find_all('a')

print('Total number of URLs present = ',len(all_links)) 

print('\n\nLast 5 URLs in the page are : \n')

if len(all_links) > 5:
    last_5 = all_links[-5:]  # a negative slice takes the final five items
    for url in last_5:
        print(url.get('href'))

Total number of URLs present =  431


Last 5 URLs in the page are : 

https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute
https://foundation.wikimedia.org/wiki/Cookie_statement
//en.m.wikipedia.org/w/index.php?title=List_of_state_and_union_territory_capitals_in_India&mobileaction=toggle_view_mobile
https://wikimediafoundation.org/
https://www.mediawiki.org/
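
The href values printed above come in several forms: missing (`None`), fragment-only, site-relative, protocol-relative, and fully absolute. As a small sketch, the standard library's `urllib.parse.urljoin` can turn all of them into absolute URLs relative to the page address. The sample list is a hand-picked subset of the kinds of values shown above, with one shortened query string.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

# A few href values of the kinds printed above, including a missing one.
hrefs = [
    None,
    "#mw-head",
    "/wiki/ISO_3166-2:IN",
    "//en.m.wikipedia.org/w/index.php?title=Example",
    "https://www.mediawiki.org/",
]

# Skip <a> tags without an href, then resolve the rest against the page URL.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs if href is not None]

for link in absolute_links:
    print(link)
```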


Now that you are familiar with scraping the table data from the webpage, you can organize the HTML table into a pandas dataframe that you can work with for data analysis. You will start by collecting each column of the table into a list.

In [10]:
# Generate one list per table column
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    states = row.find_all('th')  # the state name (second column) is stored in a <th> cell
    if len(cells)==6:  # only data rows have six <td> cells; the header row has none
        # find(string=True) returns the first piece of text inside the cell
        A.append(cells[0].find(string=True))
        B.append(states[0].find(string=True))
        C.append(cells[1].find(string=True))
        D.append(cells[2].find(string=True))
        E.append(cells[3].find(string=True))
        F.append(cells[4].find(string=True))
        G.append(cells[5].find(string=True))

You can then use pandas to make the dataframe. 

In [11]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Number'])
df['State/UT']=B
df['Admin_Capital']=C
df['Legislative_Capital']=D
df['Judiciary_Capital']=E
df['Year_Capital']=F
df['Former_Capital']=G
df

| | Number | State/UT | Admin_Capital | Legislative_Capital | Judiciary_Capital | Year_Capital | Former_Capital |
|---|---|---|---|---|---|---|---|
| 0 | 1\n | Andaman and Nicobar Islands | Port Blair | —\n | Kolkata\n | 1955\n | Calcutta (1945–1955)\n |
| 1 | 2\n | Andhra Pradesh | Hyderabad | Amaravati | Amaravati | 1956 | Kurnool |
| 2 | 3\n | Arunachal Pradesh | Itanagar | Itanagar\n | Guwahati | 1986\n | —\n |
| 3 | 4\n | Assam | Dispur | Guwahati | Guwahati\n | 1975\n | Shillong |
| 4 | 5\n | Bihar | Patna | Patna\n | Patna\n | 1912\n | —\n |
| 5 | 6\n | Chandigarh | Chandigarh | —\n | Chandigarh\n | 1966\n | —\n |
| 6 | 7\n | Chhattisgarh | Raipur | Raipur\n | Bilaspur | 2000\n | —\n |
| 7 | 8\n | Dadra and Nagar Haveli | Silvassa | —\n | Mumbai | 1945\n | Mumbai (1954–1961) |
| 8 | 9\n | Daman and Diu | Daman | —\n | Mumbai | 1987\n | Ahmedabad |
| 9 | 10\n | National Capital Territory of Delhi | New Delhi | New Delhi\n | New Delhi\n | 1931\n | —\n |
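
Many of the scraped values above still carry a trailing newline. One possible cleanup, sketched here on two rows copied from that output (only three of the columns are included, to keep the example short), is to strip whitespace from every text column with pandas' `.str.strip()` and then convert the row number to an integer.

```python
import pandas as pd

# Two rows copied from the output above, with their trailing newlines.
raw = pd.DataFrame({
    "Number": ["1\n", "2\n"],
    "State/UT": ["Andaman and Nicobar Islands", "Andhra Pradesh"],
    "Year_Capital": ["1955\n", "1956"],
})

# Strip leading and trailing whitespace from every text column.
clean = raw.apply(lambda column: column.str.strip())

# With the newline gone, the row number can be stored as an integer.
clean["Number"] = clean["Number"].astype(int)

print(clean)
```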


Exercise: Now it is your turn to try one on your own.

You want to scrape this Wikipedia page: https://en.wikipedia.org/wiki/List_of_economic_expansions_in_the_United_States

You want the table of growth periods since the Great Depression, which you can later analyze with pandas in interesting ways.

In the cell below, write the Python code that will retrieve this table as a dataframe.

Do your results look similar to this?
```text
	Years	Duration	Annual Employement Growth	Annual GDP Growth	Description
0	Oct 1945–	37	+5.2%	+1.5%	As the United States demobilized from
1	Oct 1949–	45	+4.4%	+6.9%	The United States exited recession in late 194...
2	May 1954–	39	+2.5%	+4.0%	Expansion resumed following a return to growth...
3	April 1958–	24	+3.6%	+5.6%	A brief, two-year period of expansion occurred...
4	Feb 1961–	106	+3.3%	+4.9%	A long expansionary period began in 1961. Inco...
5	Nov 1970–	36	+3.4%	+5.1%	Growth resumed after the brief
6	Mar 1975–	58	+3.6%	+4.3%	Following the steep
7	Jul 1980–	12	+2.0%	+4.4%	This short period of growth saw unemployment r...
8	Dec 1982–	92	+2.8%	+4.3%	Inflation was under control by the mid-1980s. ...
9	Mar 1991–	120	+2.0%	+3.6%	Following a
10	Nov 2001–	73	+0.9%	+2.8%	Another mild recession
11	June 2009–	123+\n	+1.1%	+2.3%	The effects of the
```

If not, try again. If you can't get it, highlight the lines below to view the correct code.<br>
<font style="color:white">
wiki2 = "https://en.wikipedia.org/wiki/List_of_economic_expansions_in_the_United_States"  
page2 = urllib.request.urlopen(wiki2)  
soup2 = BeautifulSoup(page2, "html.parser")  
table2 = soup2.find('table', class_='wikitable sortable')  
A=[]  
B=[]  
C=[]  
D=[]  
E=[]  
for row in table2.find_all("tr"):  
    cells = row.find_all('td')  
    if len(cells)==5: #Only extract table body not heading  
        A.append(cells[0].find(string=True))  
        B.append(cells[1].find(string=True))  
        C.append(cells[2].find(string=True))  
        D.append(cells[3].find(string=True))  
        E.append(cells[4].find(string=True))  
df=pd.DataFrame(A,columns=['Years'])  
df['Duration']=B  
df['Annual Employement Growth']=C  
df['Annual GDP Growth']=D  
df['Description']=E  
df  

</font>


Thanks for completing the tutorial.


