# Web Scraping

Web scraping means fetching webpages and parsing them into a readable format.

Webpages are usually HTML. We'll use
- **Requests** to get the webpage
- **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)** to parse it. It parses HTML and XML with the help of a parser (**html.parser** or **lxml**)

<center> <h1> Beautiful Soup  </h1> </center>

From the website

**"You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help."**

Install these in your virtual environment:
- **pip install requests**
- **pip install beautifulsoup4**
- **pip install lxml**

# Protocol to follow when scraping a web page
- Check robots.txt to see what is allowed
- Avoid making many calls in quick succession, or your IP may get blocked. Sleep between GET calls to avoid this.

- Use the Requests get method to fetch the webpage HTML
- Parse it using Beautiful Soup and lxml. This creates a hierarchical structure of HTML elements.
- In Chrome, right-click and choose Inspect to open the developer tools. Inspect the HTML elements for their attributes and hierarchical order.
- Use the Beautiful Soup object to get to the desired element.
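
The fetch-and-parse steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `fetch_soups`, its parameters, and the 10-second default delay are assumptions made for illustration, and the `get` argument exists only so the function can be tried without touching the network.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_soups(urls, delay_seconds=10, parser="lxml", get=requests.get):
    """Fetch each URL in turn, pausing between calls, and return the parsed pages.

    All names and defaults here are hypothetical; adjust them to the site's rules.
    """
    soups = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # space out the requests to avoid being blocked
        response = get(url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors such as 404 or 503
        soups.append(BeautifulSoup(response.text, parser))
    return soups
```

Calling `fetch_soups(["https://example.com/"])` would return a list with one Beautiful Soup object, which can then be navigated to reach the desired element.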

# An example of parsing HTML

Visit this [w3schools](https://www.w3schools.com/html/html_basic.asp) page for a quick introduction to HTML.


In [1]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""



In [2]:
from bs4 import BeautifulSoup as bsoup

In [3]:
soup = bsoup(html_doc, 'lxml')
print(type(soup))
print(soup)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>



# Navigating this data structure

In [4]:
soup.title

<title>The Dormouse's story</title>

In [5]:
soup.title.text

"The Dormouse's story"

In [10]:
soup.title.string

"The Dormouse's story"

In [12]:
soup.title.parent

<head><title>The Dormouse's story</title></head>
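
Earlier, `soup.title.text` and `soup.title.string` returned the same value because the title tag holds a single string. The two differ once a tag has several children. Here is a small sketch on a made-up, shortened paragraph; it uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened paragraph, loosely based on the story above.
snippet = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie">Elsie</a> lived in a well.</p>'
)
soup = BeautifulSoup(snippet, "html.parser")

print(soup.a.string)  # the a tag holds one string, so .string returns it
print(soup.p.string)  # the p tag has several children, so .string is None
print(soup.p.text)    # .text joins every string inside the tag
```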

In [30]:
# Exercise: print the name and text of the parent tag
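
One possible way to approach this, shown as a sketch on a trimmed copy of the document (parsed with the built-in `html.parser` so it runs without lxml): every tag exposes its tag name through `.name` and its combined text through `.text`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example document: only the head section is kept.
html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

parent = soup.title.parent
print(parent.name)  # name of the tag that contains the title
print(parent.text)  # all text found inside that parent tag
```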


# The p tag represents a paragraph of text

In [31]:
soup.p

<p class="story_title"><b>The Dormouse's story three little sisters</b></p>

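The p tag also has attributes, such as class here. To get the value of an attribute, index the tag like a dictionary. Because class can hold several values, it comes back as a list.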

In [28]:
soup.p['class']

['story_title']
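
Besides indexing, a tag offers a few other ways to read attributes. The sketch below uses one of the links from the example document (parsed with the built-in `html.parser` so it runs without lxml); the `title` attribute it asks for is deliberately one the tag does not have.

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link.attrs)         # all attributes as a dictionary
print(link["href"])       # indexing raises KeyError if the attribute is missing
print(link.get("title"))  # .get returns None instead of raising
```

Using `.get` is usually the safer choice when scraping real pages, where an attribute may be absent on some elements.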

But the document has more **p** tags, and soup.p returns only the first one. To get all of them from the soup data structure, use find_all:

In [34]:
soup.find_all('p')

[<p class="story_title"><b>The Dormouse's story three little sisters</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
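
`find_all` can also narrow the search. As a sketch on a shortened copy of the document (parsed with the built-in `html.parser` so it runs without lxml), the `class_` argument filters by CSS class and `limit` caps the number of results:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document: three paragraphs, two with class "story".
html_doc = """
<p class="story_title"><b>The Dormouse's story three little sisters</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

story_paragraphs = soup.find_all("p", class_="story")  # only class="story"
first_only = soup.find_all("p", limit=1)               # stop after one match

print(len(story_paragraphs))
print(first_only[0]["class"])
```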

The **a** tag defines a hyperlink, which links to another webpage.

In [41]:
# Exercise: find all the URLs (href) in the a tags
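
One possible approach, sketched on the story paragraph from the example document (parsed with the built-in `html.parser` so it runs without lxml): collect every a tag with `find_all` and read its href attribute.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# .get("href") returns None for an a tag without an href instead of raising.
urls = [a.get("href") for a in soup.find_all("a")]
print(urls)
```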

In [50]:
# Third link
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [49]:
# complete text
soup.get_text()

"\n\nThe Dormouse's story\n\nThe Dormouse's story three little sisters\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n\n\n"

# Let's create some Beautiful Soup

We will scrape Fry's Electronics for telescopes, following the protocol above, and store the results in a CSV file.



# Checking robots.txt

In [6]:
!curl  https://www.frys.com/robots.txt

User-agent: * 
Crawl-delay: 10 
Sitemap: http://www.frys.com/sitemap_index.xml 
Visit-time: 0030-0300 
Disallow: /ShopCartServlet 
Disallow: /wf 
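
The rules above can also be checked from Python rather than read by eye. The sketch below feeds the same text to the standard library's `urllib.robotparser`; the `/search` path is a hypothetical URL chosen only to show an allowed case, and `Visit-time` is a non-standard directive that this parser does not interpret.

```python
from urllib import robotparser

# The robots.txt content shown above, pasted in so no network call is needed.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Sitemap: http://www.frys.com/sitemap_index.xml
Visit-time: 0030-0300
Disallow: /ShopCartServlet
Disallow: /wf
"""

rules = robotparser.RobotFileParser()
rules.parse(robots_txt.splitlines())

# Disallowed path from the file versus a hypothetical path that no rule covers.
print(rules.can_fetch("*", "https://www.frys.com/ShopCartServlet"))
print(rules.can_fetch("*", "https://www.frys.com/search"))

# Number of seconds the site asks crawlers to wait between requests.
print(rules.crawl_delay("*"))
```

The crawl delay reported here is a natural value to pass to `time.sleep` between requests.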



In [8]:
# import requests and bs4 