Web scraping is the automated gathering of data from the internet.
- It is also described as the practice of gathering data by any means other than a program interacting with an API
- An automated program queries a web server, requests data (usually in the form of HTML), and parses that data to extract the required information
- Web scraping can extract larger volumes of data from multiple web pages than an API typically exposes
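
The parse-and-extract part of that flow can be sketched with the standard library alone. This is only an illustration: the inline HTML stands in for a server response (no request is made), and the `LinkCollector` class and its `links` attribute are made-up names.

```python
from html.parser import HTMLParser

# Stand-in for the HTML a web server would return (assumption: no network call here)
SAMPLE_RESPONSE = """
<html><body>
<h1>Items</h1>
<a href="/item/1">First item</a>
<a href="/item/2">Second item</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed(SAMPLE_RESPONSE)  # parse the document
print(collector.links)           # the extracted information
```

BeautifulSoup, introduced next, replaces the hand-written parser class with a navigable tree.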

### BeautifulSoup

The BeautifulSoup library helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures
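
A small sketch of what "fixing bad HTML" can look like, assuming the built-in `html.parser` backend. The broken fragment and the name `repaired` are invented for this example.

```python
from bs4 import BeautifulSoup

# Invented fragment: neither <p> nor <b> is ever closed
broken_html = "<p>Unclosed paragraph<b>bold"

repaired = BeautifulSoup(broken_html, "html.parser")

# The parse tree is well formed, so the serialized markup has closing tags
print(str(repaired))

# The nested tag is reachable as an ordinary Python attribute
print(repaired.p.b.string)
```

Other parsers (lxml, html5lib) may repair the same input differently, so the exact output depends on the parser chosen.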

In [None]:
# pip install beautifulsoup4

In [None]:
#html code with story content
html_doc="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
# Import BeautifulSoup, parse the text into a BeautifulSoup object, and pretty-print it as an indented Unicode string
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
print(soup.prettify())
#print(soup)

#### Tasks to navigate the data structure

In [None]:
# View the <title> tag, which is part of <head>
soup.title

In [None]:
soup.title.name

In [None]:
soup.title.parent.name

In [None]:
# View the first <p> tag
soup.p

In [None]:
# The parent of <title> is <head>, so this shows <head> including <title>
soup.title.parent

In [None]:
# View the class attribute of the first <p> tag
soup.p['class']

In [None]:
# View the first <a> tag
soup.a

#### Searching tree filter

In [None]:
# find_all searches the entire document; here it returns every <a> tag
soup.find_all('a')

In [None]:
# find returns only the first match; here the tag whose id is 'link3'
soup.find(id="link3")

In [None]:
# Find the first <a> tag
soup.find('a')

In [None]:
# Read the href attribute of every <a> tag in the document,
# which prints all the URLs linked from the page
for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
#Extract all the text from a page
print(soup.get_text())

In [None]:
# Install the html5lib parser if BeautifulSoup reports that it cannot find it
#pip install html5lib

In [None]:
# Parse a small fragment and assign the result to the soup variable
soup=BeautifulSoup('<b class="boldest">Extremely bold</b>','html.parser')

In [None]:
# A Tag object corresponds to an XML or HTML tag in the original document
tag=soup.b

In [None]:
tag

In [None]:
#View the type of variable tag
type(tag)

#### Name

In [None]:
#Every tag has a name, accessible as .name
tag.name

In [None]:
# Changing a tag's name is reflected in the HTML markup generated by BeautifulSoup
tag.name="blockquote"
tag

#### Attributes
A tag may have any number of attributes.
Access a tag's attributes by treating the tag like a dictionary

In [None]:
#View attribute of tag
tag.attrs

In [None]:
#Add, remove and Modify a tag's attribute
tag['id']='verybold'

In [None]:
tag['another-attribute']=1

In [None]:
tag

In [None]:
del tag['id']

In [None]:
del tag['another-attribute']

In [None]:
tag

In [None]:
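# Verify deletion of id: indexing a missing attribute raises KeyError
tag['id']

In [None]:
# .get() returns None instead of raising when the attribute is missing
print(tag.get('id'))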

#### NavigableString

A string corresponds to a bit of text within a tag. BeautifulSoup uses the NavigableString class to contain these bits of text.
A NavigableString also supports most of the features used for navigating and searching the tree

In [None]:
#Access string of tag
tag.string

In [None]:
# The string is a NavigableString object
type(tag.string)

In [None]:
# You can convert a NavigableString to a Unicode string with str()
unicode_string=str(tag.string)
unicode_string

In [None]:
type(unicode_string)

In [None]:
# You can't edit a string in place, but you can replace one string with another
tag.string.replace_with("no longer bold")

In [None]:
tag

In [None]:
tag.string

The BeautifulSoup object represents the parsed document as a whole. For most purposes it can be treated as a Tag object.
This means that it supports most of the methods for navigating and searching the tree
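
As a brief illustration of treating the whole document like a tag, using a made-up two-paragraph fragment and the hypothetical name `whole_doc`:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration
whole_doc = BeautifulSoup("<p>One</p><p>Two</p>", "html.parser")

# The document object has no real tag name, so .name holds a special value
print(whole_doc.name)

# Searching works on the document object just as it does on a tag
paragraph_texts = [p.string for p in whole_doc.find_all("p")]
print(paragraph_texts)

# Navigation works too: the first child of the document is the first <p>
first_child = whole_doc.contents[0]
print(first_child.name)
```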

In [None]:
# Parse the markup as XML (the "xml" parser requires lxml to be installed)
doc= BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>","xml")

In [None]:
doc

In [None]:
footer = BeautifulSoup("<footer>Here's the footer</footer>","xml")

In [None]:
footer

In [None]:
# Replace the placeholder string in 'doc' with the parsed 'footer' document
doc.find(string="INSERT FOOTER HERE").replace_with(footer)

In [None]:
print(doc)

#### Comments and other special strings

In [None]:
# Identify a comment. The Comment object is a special type of NavigableString
markup="<b><!--Hey, buddy--></b>"
soup=BeautifulSoup(markup,'html.parser')
commenting=soup.b.string
type(commenting)

In [None]:
commenting

In [None]:
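# The BeautifulSoup object has no real tag name, so .name is the special value '[document]'
soup.name

In [None]:
# The markup as parsed by BeautifulSoup
soup

In [None]:
# Pretty-print the tag; the comment appears with special formatting
print(soup.b.prettify())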

BeautifulSoup defines classes called Stylesheet, Script and TemplateString for embedded CSS stylesheets (any strings found inside a 'style' tag),
embedded JavaScript (any strings found inside a 'script' tag) and HTML templates (any strings found inside a 'template' tag)
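
A short sketch of those three classes in use. It assumes a BeautifulSoup release recent enough to define them (they are imported here from `bs4.element`) and the `html.parser` backend; the markup and the name `special_demo` are invented.

```python
from bs4 import BeautifulSoup
from bs4.element import Script, Stylesheet, TemplateString

# Invented markup containing one of each special tag
markup = (
    "<style>p {color: red}</style>"
    "<script>var x = 1;</script>"
    "<template>Hello</template>"
)
special_demo = BeautifulSoup(markup, "html.parser")

# Each string is stored in the class that matches its enclosing tag
style_string = special_demo.style.string
script_string = special_demo.script.string
template_string = special_demo.template.string

for value in (style_string, script_string, template_string):
    print(type(value).__name__)
```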

#### Navigating the tree

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'html.parser')

Let's see how to move from one part of a document to another

#### Navigating the parse tree using tag names

Going down the tree

In [None]:
soup.head

In [None]:
soup.title

In [None]:
soup.body

In [None]:
# Zoom in on a certain part of the tree: the first <b> tag beneath <body>
soup.body.b

In [None]:
# Using a tag name as an attribute returns only the first tag by that name
soup.a

In [None]:
# find_all returns every <a> tag
soup.find_all('a')

#### navigating to contents and children

A tag's children are available in a list called .contents

In [None]:
head_tag=soup.head

In [None]:
head_tag

In [None]:
head_tag.contents

In [None]:
# Extract the first child, the <title> tag, by index
title_tag=head_tag.contents[0]

In [None]:
title_tag

In [None]:
title_tag.contents

The BeautifulSoup object itself has children; here they are a leading newline string followed by the html tag

In [None]:
soup.contents

In [None]:
soup.contents[1].name

In [None]:
# Instead of getting them as a list, iterate over a tag's children with .children
for child in title_tag.children:
    print(child)

#### Descendants
The .contents and .children attributes only consider a tag's direct children
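
The difference can be sketched on a tiny fragment. The names `nesting_demo`, `direct` and `everything` are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

nesting_demo = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = nesting_demo.head

# .children stops at the first level: only the <title> tag
direct = list(head.children)

# .descendants keeps going: the <title> tag, then the string inside it
everything = list(head.descendants)

print(len(direct), len(everything))
```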

In [None]:
# View the direct child of <head>, which is the <title> tag
head_tag.contents

In [None]:
# The .descendants attribute lets you iterate over all of a tag's children recursively:
# its direct children, the children of its direct children, and so on
for child in head_tag.descendants:
    print(child)

In [None]:
#soup

In [None]:
# .stripped_strings yields every string in the document with surrounding whitespace removed, skipping whitespace-only strings
for string in soup.stripped_strings:
    print(repr(string))

#### Going up the tree
Every tag has a parent

In [None]:
# Access an element's parent with the .parent attribute
# In the example document, the <head> tag is the parent of the <title> tag
title_tag=soup.title
title_tag
title_tag.parent

In [None]:
# The title string itself has a parent: the <title> tag that contains it
title_tag.string.parent

In [None]:
#The parent of a top-level tag like 'html' is the BeautifulSoup object itself
html_tag=soup.html
type(html_tag.parent)

In [None]:
html_tag.parent

In [None]:
# The .parent of a BeautifulSoup object is defined as None
print(soup.parent)

Iterate over all of an element's parents with .parents, travelling from an 'a' tag buried deep within the document to the very top of the document

In [None]:
link=soup.a

In [None]:
link

In [None]:
# Walk up the tree from the <a> tag to the top of the document
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

In [None]:
link.parent

In [None]:
link.parent.name

In [None]:
list(link.parents)

#### Going sideways

In [None]:
# Let's use a simple document
# The <b> tag and the <c> tag are at the same level: they are both direct children of the same tag,
# so they are called siblings. Siblings show up at the same indentation level when the document is pretty-printed

sibling_soup=BeautifulSoup("<a><b>text1</b><c>text2</c></a>","html.parser")
print(sibling_soup.prettify())

#### next sibling and previous sibling

Use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree

In [None]:
# The <b> tag has a next sibling but no previous sibling, because nothing comes before it on the same level
sibling_soup.b.next_sibling

In [None]:
# The <c> tag has a previous sibling but no next sibling
sibling_soup.c.previous_sibling

In [None]:
# The strings text1 and text2 are not siblings, because they do not have the same parent
sibling_soup.b.string

In [None]:
print(sibling_soup.b.string.next_sibling)

In [None]:

link=soup.a
link

In [None]:
# A comma and a newline separate the first <a> tag from the second, so the next sibling is that string
link.next_sibling

In [None]:
# The second <a> tag is actually the next sibling of the comma
link.next_sibling.next_sibling

In [None]:
link.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling

#### iterate over a tag's siblings with next and previous siblings

In [None]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

In [None]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

#### Going back and forth

In [None]:
last_a_tag=soup.find("a",id="link3")
last_a_tag

In [None]:
# The next sibling of the link3 <a> tag is the rest of the sentence
last_a_tag.next_sibling

In [None]:
# The .next_element attribute points to whatever was parsed immediately afterwards
# Here that is the string "Tillie" inside the tag, which was parsed before the semicolon in the next sibling
last_a_tag.next_element

In [None]:
last_a_tag.previous_element

In [None]:
for element in last_a_tag.next_elements:
    print(repr(element))

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Simplest filter is a string

In [None]:
soup.find_all('b')

#### Using a regular expression object, which is matched with its search() method

In [None]:
#Finds all the tags whose names start with the letter 'b'
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

In [None]:
# Finds all tags whose names contain the letter 't' (here 'html' and 'title')
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

In [None]:
# Passing a list matches any tag whose name is in the list
soup.find_all(["a","b"])

In [None]:
#Finds all the tags but none of the strings
for tag in soup.find_all(True):
    print(tag.name)

In [None]:
a_string=soup.find(string="Lacie")
a_string

#### find parents and find parent

In [None]:
a_string.find_parents("a")

In [None]:
a_string.find_parent("p")

In [None]:
a_string.find_parents("p")

In [None]:
first_link=soup.a
first_link

#### find next siblings and find next sibling

In [None]:
# Returns all the following siblings that match the tag name
first_link.find_next_siblings("a")

In [None]:

first_story_paragraph=soup.find("p","story")

In [None]:
# Returns only the first following sibling that matches the tag name
first_story_paragraph.find_next_sibling("p")

In [None]:
first_story_paragraph.find_previous_sibling("p")

#### Import data from website

In [None]:
# urlopen opens a remote object across a network and lets you read it
from urllib.request import urlopen
from bs4 import BeautifulSoup

Almost every website you encounter uses CSS stylesheets to style its pages.
CSS helps web scrapers because it relies on tags being marked up with class and id attributes, which lets a scraper separate otherwise identical tags based on those attributes
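
As a rough sketch of selecting by class and id, run against a small made-up fragment rather than a live page (the markup and the names `style_demo`, `green_names` and `text_block` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up fragment: spans that differ only in their class attribute
sample_html = """
<div id="text">
<span class="red">a spoken line</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Prince Vasili</span>
</div>
"""
style_demo = BeautifulSoup(sample_html, "html.parser")

# Select by class: only the green spans
green_names = [
    span.get_text() for span in style_demo.find_all("span", {"class": "green"})
]
print(green_names)

# Select by id: the single container element
text_block = style_demo.find(id="text")
print(text_block.name)
```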

In [None]:
html=urlopen("http://www.pythonscraping.com/pages/warandpeace.html")

In [None]:
# Create the BeautifulSoup object
bsObj=BeautifulSoup(html,"html.parser")

In [None]:
# Grab the text of every span tag with the class "green"
nameList=bsObj.findAll("span",{"class":"green"})
for name in nameList:
    print(name.get_text())

Most of the time you will need only the first two arguments of findAll: the tag name and a dictionary of attributes
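
Both arguments accept either a single value or a collection of values. The sketch below uses `find_all`, the current spelling of the same method, on an invented fragment; `args_demo`, `heading_names` and `span_classes` are illustrative names.

```python
from bs4 import BeautifulSoup

# Invented fragment for illustration
sample_html = """
<h1>Title</h1>
<h2>Subtitle</h2>
<span class="green">name</span>
<span class="red">quote</span>
<span class="blue">other</span>
"""
args_demo = BeautifulSoup(sample_html, "html.parser")

# First argument: one tag name, or a list of tag names
heading_names = [tag.name for tag in args_demo.find_all(["h1", "h2"])]
print(heading_names)

# Second argument: a dictionary of attributes; a list of values matches any of them
matching_spans = args_demo.find_all("span", {"class": ["green", "red"]})
span_classes = [span["class"][0] for span in matching_spans]
print(span_classes)
```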

In [None]:
bsObj.findAll({"h1","h2","h3","h4","h5","h6"})

In [None]:
# A dictionary cannot hold the key "class" twice, so pass both class names as a list to match either one
bsObj.findAll("span",{"class":["green","red"]})

In [None]:
# Number of times "the prince" appears as the complete text of a tag on the example page
nameList = bsObj.findAll(string="the prince")
print(len(nameList))
print(nameList)

In [None]:
# Find the tags that have a particular attribute value, here id="text"
allText=bsObj.findAll(id="text")
print(allText[0].get_text())

In [None]:
bsObj.findAll("",{"class":"green"})

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [None]:
html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

##### Find only descendants that are children

In [None]:
#shows list of product rows
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)

In [None]:
# Prints all the product rows except the title row, because a tag is never its own sibling
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)

In [None]:
#selects only the first row of the table
bsObj.find("table",{"id":"giftList"}).tr

##### Dealing with parents

In [None]:
# Prints the price of the item whose image is ../img/gifts/img1.jpg:
# the image's parent is a table cell, and that cell's previous sibling holds the price

print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())

#### using regular expressions

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

In [None]:
html=urlopen("http://www.pythonscraping.com/pages/page3.html")

In [None]:
bsObj=BeautifulSoup(html,"html.parser")

In [None]:
# Selects only the images whose relative path starts with ../img/gifts/img and ends in .jpg
# A raw string keeps the backslashes in the pattern intact

images=bsObj.findAll("img",{"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})

In [None]:
for image in images:
    print(image["src"])

#### Traversing a single domain

In [None]:
# getLinks returns every article link found in the body of a Wikipedia page, starting here with the article on 'Rose'
# The loop then follows one randomly chosen link at a time; the random number generator is seeded from the current system time so that each run takes a new path

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re
# On Python 3.11+ random.seed() rejects datetime objects, so pass a numeric timestamp instead
random.seed(datetime.datetime.now().timestamp())
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only links inside the article body that start with /wiki/ and contain no colon
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",href=re.compile("^(/wiki/)((?!:).)*$"))
links=getLinks("/wiki/Rose")
while len(links) > 0:
    newArticle=links[random.randint(0,len(links)-1)].attrs["href"]
    print(newArticle)
    links=getLinks(newArticle)
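
The link pattern used in getLinks can be checked offline with the `re` module alone. The href values below are hypothetical examples of what a Wikipedia page might contain, and `article_link` and `article_hrefs` are names chosen for this sketch.

```python
import re

# Same pattern as in getLinks: starts with /wiki/ and contains no colon afterwards
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values for illustration
sample_hrefs = [
    "/wiki/Rose",
    "/wiki/Category:Roses",
    "/wiki/File:Rosa.jpg",
    "#cite_note-1",
    "//creativecommons.org/licenses/",
    "/wiki/Garden_roses",
]

# Keep only the hrefs that look like ordinary article links
article_hrefs = [href for href in sample_hrefs if article_link.search(href)]
print(article_hrefs)
```

The negative lookahead `(?!:)` is what rejects namespace pages such as `Category:` and `File:`.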

#### download image

In [None]:
# Downloads the site's logo image and stores it as logo.jpg in the current working directory
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")

#### Text Encoding

Simply using urlopen to read a .txt file works well for English text; with other languages, however, you may run into encoding problems
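
The underlying issue can be reproduced without a network request. In this sketch the sample text and the names `raw_bytes`, `decoded` and `ascii_ok` are made up; the bytes stand in for what `read()` would return.

```python
# Hypothetical sample: the UTF-8 bytes a server might send for a Russian title
raw_bytes = "Война и мир".encode("utf-8")

# Printing bytes shows escaped byte values, not readable text
print(raw_bytes)

# Decoding with the correct encoding recovers the text
decoded = raw_bytes.decode("utf-8")
print(decoded)

# Decoding as ASCII fails, because these bytes fall outside the ASCII range
try:
    raw_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```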

In [None]:
from urllib.request import urlopen
textPage = urlopen(
"http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt")
print(textPage.read())

In [None]:
# textPage.read() returns raw bytes, so the previous cell printed escaped byte values instead of readable text
# Decode the bytes explicitly as UTF-8 whenever the text contains non-ASCII characters (plain English text is mostly ASCII)
# The output now appears in Cyrillic characters

from urllib.request import urlopen
textPage=urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt") 
print(str(textPage.read(),'utf-8'))

In [None]:
# A .docx file is a ZIP archive whose main text lives in word/document.xml
from zipfile import ZipFile
from urllib.request import urlopen
from io import BytesIO
wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read()
wordFile = BytesIO(wordFile)
document = ZipFile(wordFile)
xml_content = document.read('word/document.xml')
print(xml_content.decode('utf-8'))
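
The XML printed above still has to be reduced to plain text. One possible approach, sketched offline with the standard library: build a tiny in-memory archive with a made-up `word/document.xml`, then collect the text of every `w:t` element. The sample XML and all variable names are assumptions; only the WordprocessingML namespace URI is the standard one.

```python
from io import BytesIO
from zipfile import ZipFile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used by the w: prefix
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up, heavily simplified stand-in for a real word/document.xml
sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>A Word Document</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

# Write the XML into an in-memory ZIP archive, mimicking a .docx file
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)

# Read it back the same way the cell above reads the downloaded file
buffer.seek(0)
with ZipFile(buffer) as archive:
    xml_content = archive.read("word/document.xml")

# Collect the text of every w:t (text run) element
root = ET.fromstring(xml_content)
run_texts = [node.text for node in root.iter(f"{{{W_NS}}}t")]
print(run_texts)
```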

In [None]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())

In [None]:
print(soup.a.prettify())

In [None]:
# For output without pretty-printing, str() returns the markup as a Unicode string
str(soup)

In [None]:
print(soup.prettify(formatter="html"))

In [None]:
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))

In [None]:
from bs4.formatter import HTMLFormatter
# Avoid naming the parameter str, which would shadow the built-in type
def uppercase(text):
    return text.upper()
formatter = HTMLFormatter(uppercase)

print(soup.prettify(formatter=formatter))

#### Encodings

In [None]:
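# Pass the markup as bytes so that BeautifulSoup has to detect the encoding itself
markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup, 'html.parser')
soup.h1

In [None]:
soup.h1.string

In [None]:
soup.original_encoding
# 'utf-8'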