# Web Scraping Data from HTML
- Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.


- Most websites display data that can only be viewed in a web browser and offer no way to save a copy for personal use. The only option is then to copy and paste the data manually, a tedious job that can take hours or even days. Web scraping automates this process: instead of you copying the data by hand, scraping software performs the same task in a fraction of the time.

The `requests` library is the de facto standard for making HTTP requests in Python.
The `requests` library has a `get()` function to which we pass the URL of the site we want to access;
it downloads the HTML content of that page.

In [1]:
import requests #library used to download web pages.

In [2]:
#specify the url
URL = "https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals"

In [3]:

# The GET method indicates that you’re trying to get or retrieve data from a specified resource. 
# Connect to the website using the variable 'page'
# To make a GET request, invoke requests.get().
page = requests.get(URL)
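
A request like the one above can hang or fail, so it is often wrapped in some error handling. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` (not part of `requests`) and an arbitrary 10-second timeout; it is one possible way to guard the call, not the only one.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails.

    fetch_html is a hypothetical helper for illustration, not part of requests.
    """
    try:
        # timeout (in seconds) stops the call from waiting forever
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic is sent,
# so this call prints an error message and returns None.
result = fetch_html("not-a-url")
```

Calling `fetch_html(URL)` with the Wikipedia address defined earlier would return the same text as `page.text`, provided the request succeeds.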

In [4]:
# A Response is a powerful object for inspecting the results of the request.
type(page)

requests.models.Response

In [5]:
# Verify that the connection to the website succeeded.

# For a list of all status codes, see
# https://www.restapitutorial.com/httpstatuscodes.html

# A 200 OK status means that your request was successful and the server responded with the data you requested,
# whereas a 404 NOT FOUND status means that the resource you were looking for was not found.
page.status_code

200
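
To make the meaning of a status code explicit in your own code, the standard library's `http.HTTPStatus` enum can translate a number into its official phrase. This is a small sketch; the helper names `describe_status` and `is_success` are assumptions made for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for a numeric status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Treat any 2xx code as a successful response."""
    return 200 <= code < 300


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
print(is_success(200), is_success(404))  # True False
```

In the notebook, `is_success(page.status_code)` would be one way to decide whether to continue parsing.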

## HTML - The Basics
This is the basic syntax of an HTML webpage. Each `<tag>` marks a block of the page:
1. `<!DOCTYPE html>`: HTML documents must start with a document type declaration.
2. The HTML document is contained between `<html>` and `</html>`.
3. The meta and script declarations of the HTML document are between `<head>` and `</head>`.
4. The visible part of the HTML document is between the `<body>` and `</body>` tags.
5. Headings are defined with the `<h1>` through `<h6>` tags.
6. Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for standard table cells.
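
To see how these tags nest in practice, here is a minimal sketch that feeds a hand-written sample page (not the Wikipedia page) to Python's built-in `html.parser` and records each opening tag in the order it appears. The class name `TagCollector` and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

# A tiny hand-written page that uses the tags described above.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body>
<h1>Heading</h1>
<p>A paragraph with a <a href="/wiki/City">link</a>.</p>
<table><tr><td>cell</td></tr></table>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.tags)
# ['html', 'head', 'title', 'body', 'h1', 'p', 'a', 'table', 'tr', 'td']
```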

In [6]:
#save string format of website HTML into a variable
HTMLstr = page.text
print(HTMLstr[:300])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[""


### Scraping Rules

- You should check a website’s Terms and Conditions before you scrape it.
- Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
- Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
- The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
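
The pacing rule above can be sketched in code. The example below also checks a site's `robots.txt` rules with the standard library's `urllib.robotparser`, which is a common companion to reading the Terms and Conditions. The robots rules, the `example.com` URLs, the user agent string, and the helper `visit_politely` are all hypothetical; a real scraper would pass a function that actually downloads the page and keep the default one-second delay.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from a list so no request is needed.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(ROBOTS_LINES)


def visit_politely(urls, fetch, delay_seconds=1.0, user_agent="example-scraper"):
    """Call fetch(url) for each allowed URL, pausing between requests."""
    results = {}
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # skip pages the site asks crawlers to avoid
        results[url] = fetch(url)
        time.sleep(delay_seconds)  # wait before the next request
    return results


# Stand-in fetch function and zero delay so the sketch runs instantly offline.
pages = visit_politely(
    ["https://example.com/wiki/Page", "https://example.com/private/data"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0,
)
print(list(pages))  # ['https://example.com/wiki/Page']
```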

### What is Beautiful Soup?
- Beautiful Soup is a Python library for pulling data out of HTML and XML files. 

In [7]:
#import the Beautiful soup functions to parse the data returned from the website

# Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML
# or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
from bs4 import BeautifulSoup

In [8]:

# Parse the HTML using Beautiful Soup and store the result in the variable `soup`.
# First argument: the raw HTML content.
# Second argument: the HTML parser we want to use.
soup = BeautifulSoup(HTMLstr, "html.parser")

In [9]:
#look at contents of page - wall of text
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"Xkbd5gpAAEAAADkf1joAAACR","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_U.S._state_capitals","wgTitle":"List of U.S. state capitals","wgCurRevisionId":6818835,"wgRevisionId":6818835,"wgArticleId":18635,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["State cap

In [10]:
# Format the page contents with indentation.
# Printing soup.prettify() gives a visual representation
# of the parse tree created from the raw HTML content.
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"Xkbd5gpAAEAAADkf1joAAACR","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_U.S._state_capitals","wgTitle":"List of U.S. state capitals","wgCurRevisionId":6818835,"wgRevisionId":6818835,"wgArticleId":18635,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCatego

In [11]:
# soup.<tag>: Return content between opening and closing tag including tag.
soup.title

<title>List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia</title>

In [12]:
# soup.<tag>.string: Return string within given tag
print(soup.title.string)

List of U.S. state capitals - Simple English Wikipedia, the free encyclopedia


In [13]:
#shows the first <a> tag on the page
soup.a

<a id="top"></a>

### `find_all()` method
- Scans the entire document and returns a list containing every matching result.
- If `find_all()` can’t find anything, it returns an empty list.
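
A small self-contained sketch of both behaviours, using a hand-written HTML snippet rather than the Wikipedia page (the snippet is an assumption for illustration):

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<p>See <a href="/wiki/Alabama">Alabama</a> and
<a href="/wiki/Alaska">Alaska</a>.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

links = soup.find_all("a")       # every <a> tag, in document order
tables = soup.find_all("table")  # the sample contains no <table>

print(len(links))                            # 2
print([link.get("href") for link in links])  # ['/wiki/Alabama', '/wiki/Alaska']
print(tables)                                # []
```

Because an empty result is still a list, a `for` loop over it simply does nothing, so no special check is needed before iterating.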

In [14]:
# Find all <a> tags on the page.
# You can replace "a" with any tag you want to find.
soup.find_all("a")

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a href="/wiki/United_States" title="United States">United States</a>,
 <a href="/wiki/U.S._state" title="U.S. state">state</a>,
 <a class="mw-redirect" href="/wiki/Capital_(city)" title="Capital (city)">capital</a>,
 <a href="/wiki/City" title="City">cities</a>,
 <a href="/wiki/Legislature" title="Legislature">capitol</a>,
 <a href="#cite_note-1">[1]</a>,
 <a class="image" href="/wiki/File:US_states_in_which_the_capital_is_the_largest_city.svg"><img alt="" class="thumbimage" data-file-height="593" data-file-width="959" decoding="async" height="216" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/e0/US_states_in_which_the_capital_is_the_largest_city.svg/350px-US_states_in_which_the_capital_is_the_largest_city.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/e0/US_states_in_which_the_capital_is_the_largest_city.svg/525p

In [15]:
#show the hyperlink reference (href) of every <a> tag
all_links = soup.find_all("a")

for link in all_links:
    print(link.get("href"))

None
#mw-head
#p-search
/wiki/United_States
/wiki/U.S._state
/wiki/Capital_(city)
/wiki/City
/wiki/Legislature
#cite_note-1
/wiki/File:US_states_in_which_the_capital_is_the_largest_city.svg
/wiki/File:US_states_in_which_the_capital_is_the_largest_city.svg
/wiki/United_States_Constitution
/wiki/List_of_United_States_cities_by_population
/w/index.php?title=List_of_metropolitan_statistical_areas&action=edit&redlink=1
/wiki/Alabama
/wiki/Montgomery,_Alabama
/wiki/Alaska
/wiki/Juneau,_Alaska
/wiki/Arizona
/wiki/Phoenix,_Arizona
/wiki/Arkansas
/wiki/Little_Rock,_Arkansas
/wiki/California
/wiki/Sacramento,_California
/wiki/Colorado
/wiki/Denver,_Colorado
/wiki/Connecticut
/wiki/Hartford,_Connecticut
/wiki/Delaware
/wiki/Dover,_Delaware
/wiki/Florida
/wiki/Tallahassee,_Florida
/wiki/Georgia_(U.S._state)
/wiki/Atlanta
/wiki/Hawaii
/wiki/Honolulu
/wiki/Idaho
/wiki/Boise,_Idaho
/wiki/Illinois
/wiki/Springfield,_Illinois
/wiki/Indiana
/wiki/Indianapolis
/wiki/Iowa
/wiki/Des_Moines,_Iowa
/wiki/Kans
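
Most of the `href` values printed above are relative paths, and the first one is `None` because that `<a>` tag has no `href` attribute. A minimal sketch of turning such values into absolute URLs with the standard library's `urllib.parse.urljoin`, skipping missing values and in-page anchors; the short `hrefs` list is copied by hand from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals"

# A few href values from the output above, including the None returned
# for an <a> tag that has no href attribute.
hrefs = [None, "#mw-head", "/wiki/United_States", "/wiki/Alabama"]

absolute_links = [
    urljoin(BASE_URL, href)
    for href in hrefs
    if href is not None and not href.startswith("#")
]
print(absolute_links)
# ['https://simple.wikipedia.org/wiki/United_States', 'https://simple.wikipedia.org/wiki/Alabama']
```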

In [16]:
#find all the <table> tags
all_tables=soup.find_all('table')
all_tables

[<table class="wikitable sortable">
 <caption>State capitals of the United States
 </caption>
 <tbody><tr>
 <th rowspan="2">State</th>
 <th rowspan="2">Abr.</th>
 <th rowspan="2">State-hood</th>
 <th rowspan="2">Capital</th>
 <th rowspan="2">Capital since</th>
 <th rowspan="2">Area (mi²)</th>
 <th colspan="4">Population (2018)</th>
 <th rowspan="2">Notes
 </th></tr>
 <tr>
 <th><a href="/wiki/List_of_United_States_cities_by_population" title="List of United States cities by population">City</a>
 </th>
 <th><a class="new" href="/w/index.php?title=List_of_metropolitan_statistical_areas&amp;action=edit&amp;redlink=1" title="List of metropolitan statistical areas (not yet started)">Metropolitan</a>
 </th>
 <th>Rank in state
 </th>
 <th>Rank in US
 </th></tr>
 <tr>
 <td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
 <td>AL</td>
 <td align="center">1819</td>
 <td><a href="/wiki/Montgomery,_Alabama" title="Montgomery, Alabama">Montgomery</a></td>
 <td align="center">1846</td>
 <td a

### `find()` method
- Returns only the first match, which makes it a good fit when you know there is only one element you're looking for, such as the **body tag**.
- If `find()` can’t find anything, it returns `None`.
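
Since `find()` may return `None`, it is worth checking the result before using it. A minimal sketch on a hand-written snippet; the class names `plain` and `no-such-class` are made up for illustration.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<table class="plain"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable"><tr><td>Alabama</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_table = soup.find("table")  # the first <table> only
wiki_table = soup.find("table", class_="wikitable sortable")
missing = soup.find("table", class_="no-such-class")  # nothing matches

if missing is None:
    print("no matching table")
print(first_table["class"])  # ['plain']
print(wiki_table.td.string)  # Alabama
```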

In [17]:
# get the <table> tag that contains the data we want to scrape

right_table=soup.find('table', class_='wikitable sortable')
right_table

<table class="wikitable sortable">
<caption>State capitals of the United States
</caption>
<tbody><tr>
<th rowspan="2">State</th>
<th rowspan="2">Abr.</th>
<th rowspan="2">State-hood</th>
<th rowspan="2">Capital</th>
<th rowspan="2">Capital since</th>
<th rowspan="2">Area (mi²)</th>
<th colspan="4">Population (2018)</th>
<th rowspan="2">Notes
</th></tr>
<tr>
<th><a href="/wiki/List_of_United_States_cities_by_population" title="List of United States cities by population">City</a>
</th>
<th><a class="new" href="/w/index.php?title=List_of_metropolitan_statistical_areas&amp;action=edit&amp;redlink=1" title="List of metropolitan statistical areas (not yet started)">Metropolitan</a>
</th>
<th>Rank in state
</th>
<th>Rank in US
</th></tr>
<tr>
<td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
<td>AL</td>
<td align="center">1819</td>
<td><a href="/wiki/Montgomery,_Alabama" title="Montgomery, Alabama">Montgomery</a></td>
<td align="center">1846</td>
<td align="right">159.8</td>
<td a

In [18]:
#set empty lists to hold data of each column
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
H=[]
I=[]
J=[]
K=[]

#find all <tr> (table row) tags in the table and go through each one
for row in right_table.find_all("tr"):

    #get all the <td> tags for each <tr> tag
    cells = row.find_all("td")

    #keep only data rows, which have exactly 11 <td> cells
    if len(cells) == 11:

        A.append(cells[0].find(string=True))  # State column -> list A
        B.append(cells[1].find(string=True))  # Abr. column -> list B
        C.append(cells[2].find(string=True))  # Statehood column -> list C
        D.append(cells[3].find(string=True))  # Capital column -> list D
        E.append(cells[4].find(string=True))  # Capital since column -> list E
        F.append(cells[5].find(string=True))  # Area column -> list F
        G.append(cells[6].find(string=True))  # Population (City) column -> list G
        H.append(cells[7].find(string=True))  # Population (Metropolitan) column -> list H
        I.append(cells[8].find(string=True))  # Rank in state column -> list I
        J.append(cells[9].find(string=True))  # Rank in US column -> list J
        K.append(cells[10].find(string=True))  # Notes column -> list K
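
An alternative to keeping one list per column is to collect one list per row. The following is a minimal sketch on a cut-down, hand-written three-column table (not the full Wikipedia table); it uses `get_text(strip=True)`, which returns all text inside a cell with surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

# Hand-written sample with the same shape as the real table, but 3 columns.
SAMPLE_TABLE = """
<table class="wikitable sortable">
<tr><th>State</th><th>Abr.</th><th>Capital</th></tr>
<tr><td><a href="/wiki/Alabama">Alabama</a></td><td>AL</td>
<td><a href="/wiki/Montgomery,_Alabama">Montgomery</a></td></tr>
<tr><td><a href="/wiki/Alaska">Alaska</a></td><td>AK</td>
<td><a href="/wiki/Juneau,_Alaska">Juneau</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # Header rows contain <th> cells only, so they yield no <td> and are skipped.
    if len(cells) == 3:
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
# [['Alabama', 'AL', 'Montgomery'], ['Alaska', 'AK', 'Juneau']]
```

A list of rows like this can be passed to `pd.DataFrame(rows, columns=[...])` in a single step.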

In [19]:
print(cells[0])
print(cells[1])
print(cells[2])
print(cells[3])
print(cells[4])
print(cells[5])
print(cells[6])
print(cells[7])
print(cells[8])
print(cells[9])
print(cells[10])


<td><a href="/wiki/Wyoming" title="Wyoming">Wyoming</a></td>
<td>WY</td>
<td align="center">1890</td>
<td><a href="/wiki/Cheyenne,_Wyoming" title="Cheyenne, Wyoming">Cheyenne</a></td>
<td align="center">1869</td>
<td align="right">21.1</td>
<td align="right">59,466</td>
<td align="right">91,738</td>
<td align="center">1</td>
<td></td>
<td>
</td>


In [20]:
#verify data in list A
A

['Alabama',
 'Alaska',
 'Arizona',
 'Arkansas',
 'California',
 'Colorado',
 'Connecticut',
 'Delaware',
 'Florida',
 'Georgia',
 'Hawaii',
 'Idaho',
 'Illinois',
 'Indiana',
 'Iowa',
 'Kansas',
 'Kentucky',
 'Louisiana',
 'Maine',
 'Maryland',
 'Massachusetts',
 'Michigan',
 'Minnesota',
 'Mississippi',
 'Missouri',
 'Montana',
 'Nebraska',
 'Nevada',
 'New Hampshire',
 'New Jersey',
 'New Mexico',
 'New York',
 'North Carolina',
 'North Dakota',
 'Ohio',
 'Oklahoma',
 'Oregon',
 'Pennsylvania',
 'Rhode Island',
 'South Carolina',
 'South Dakota',
 'Tennessee',
 'Texas',
 'Utah',
 'Vermont',
 'Virginia',
 'Washington',
 'West Virginia',
 'Wisconsin',
 'Wyoming']

In [21]:
#import pandas to convert list to data frame
import pandas as pd

df=pd.DataFrame(A, columns=['State']) #turn list A into dataframe first

#add other lists as new columns in my new dataframe
df['Abr'] = B
df['Statehood'] = C
df['Capital'] = D
df['Capital_Since'] = E
df['Area'] = F
df['Municipal'] = G
df['Metropolitan'] = H
df['StateRank'] = I
df['USRank'] = J
df['Notes'] = K

#show first 5 rows of created dataframe
df.head()

Unnamed: 0,State,Abr,Statehood,Capital,Capital_Since,Area,Municipal,Metropolitan,StateRank,USRank,Notes
0,Alabama,AL,1819,Montgomery,1846,159.8,198218,373903.0,2,119.0,\n
1,Alaska,AK,1960,Juneau,1906,2716.7,31275,,3,,Largest capital by municipal land area.\n
2,Arizona,AZ,1912,Phoenix,1889,517.6,1660272,4857962.0,1,5.0,Largest capital by population.\n
3,Arkansas,AR,1836,Little Rock,1821,116.2,193524,699757.0,1,117.0,\n
4,California,CA,1850,Sacramento,1854,97.9,508529,2345210.0,6,35.0,Largest capital by population to not be the mo...
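
Values scraped from table cells arrive as text, for example `59,466` in the Wyoming row printed earlier. If numeric columns are needed, one possible cleanup is sketched below; the sample values and the `Nowhere` row are hypothetical, and the original notebook does not show how (or whether) this conversion was done.

```python
import pandas as pd

# Hypothetical sample values in the same text form as scraped cells.
df = pd.DataFrame({
    "State": ["Alabama", "Wyoming", "Nowhere"],
    "Municipal": ["198,218", "59,466", ""],
})

# Remove thousands separators, then convert; values that cannot be parsed
# (such as the empty string) become NaN instead of raising an error.
df["Municipal"] = pd.to_numeric(
    df["Municipal"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(df)
```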


In [22]:
#export scraped data to a csv file
df.to_csv("CapitalList.csv")
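
By default `to_csv()` also writes the row index as an unlabeled first column, which pandas typically names `Unnamed: 0` when the file is read back. The sketch below, which uses in-memory buffers and a two-row sample frame instead of the real file, shows the difference that `index=False` makes.

```python
import io

import pandas as pd

df = pd.DataFrame({"State": ["Alabama", "Alaska"], "Abr": ["AL", "AK"]})

# Write to in-memory buffers so the sketch leaves no files on disk.
with_index = io.StringIO()
df.to_csv(with_index)

without_index = io.StringIO()
df.to_csv(without_index, index=False)

with_index.seek(0)
without_index.seek(0)
print(list(pd.read_csv(with_index).columns))     # ['Unnamed: 0', 'State', 'Abr']
print(list(pd.read_csv(without_index).columns))  # ['State', 'Abr']
```

If the index carries no meaning, `df.to_csv("CapitalList.csv", index=False)` keeps the exported file to the scraped columns only.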