Notes by - Kiran A Bendigeri
Please read the 'Read me' file.

1 Web scraping and parsing with Beautiful Soup 4 Introduction
Beautiful Soup is a Python library for parsing HTML and XML documents, which makes it a good fit for scraping data from websites.


In [None]:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source,'lxml')

# title of the page
print(soup.title)

# name of the tag:
print(soup.title.name)

# text inside the tag:
print(soup.title.string)

# beginning navigation:
print(soup.title.parent.name)

# getting specific values:
print(soup.p)

print(soup.find_all('p'))

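for paragraph in soup.find_all('p'):
    # .string is None when a tag has more than one child; .text joins all nested text
    print(paragraph.string)
    print(paragraph.text)

# every link target on the page
for url in soup.find_all('a'):
    print(url.get('href'))

# all text on the page, with the tags stripped
print(soup.get_text())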
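
The difference between .string and .text is easy to miss on a live page, so here is a minimal offline sketch. The inline HTML is made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that nothing beyond bs4 needs to be installed; both choices are assumptions, not part of the original page.

```python
import bs4 as bs

# Hypothetical inline HTML standing in for a downloaded page.
html = """
<html><head><title>Demo page</title></head>
<body>
<p>Plain paragraph</p>
<p>Paragraph with <b>bold</b> text</p>
<a href="/about">About</a>
<a>No href here</a>
</body></html>
"""

# 'html.parser' ships with Python; the cells above use 'lxml' instead.
soup = bs.BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all('p')

# .string only works when the tag has exactly one child; otherwise it is None.
strings = [p.string for p in paragraphs]

# .text concatenates every piece of text nested inside the tag.
texts = [p.text for p in paragraphs]

# .get('href') returns None rather than raising when the attribute is missing.
hrefs = [a.get('href') for a in soup.find_all('a')]

print(strings)  # ['Plain paragraph', None]
print(texts)    # ['Plain paragraph', 'Paragraph with bold text']
print(hrefs)    # ['/about', None]
```

In practice this means .text is usually the safer choice when a paragraph may contain nested tags such as links or bold text.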

2 Navigation with Beautiful Soup 4


In [None]:
nav = soup.nav
# we can grab the links from just the nav bar
for url in nav.find_all('a'):
    print(url.get('href'))

# use .body to get the body section, then grab the .text of each paragraph
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)

# div tags with the class "body"
for div in soup.find_all('div', class_='body'):
    print(div.text)
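
To see how scoping a search to one part of the tree changes the results, here is a small offline sketch. The HTML string and its class names are invented for the example, and 'html.parser' is used in place of 'lxml' purely to avoid the extra dependency.

```python
import bs4 as bs

# Hypothetical page fragment with a nav bar, a content div and a footer div.
html = """
<body>
<nav><a href="/">Home</a><a href="/login">Login</a></nav>
<div class="body main"><p>Inside the body div</p></div>
<div class="footer"><p>Footer text</p><a href="/terms">Terms</a></div>
</body>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

# soup.nav is None when the page has no <nav> tag, so guard before searching it.
nav = soup.nav
nav_links = [a.get('href') for a in nav.find_all('a')] if nav is not None else []

# Searching from the top of the document also picks up the footer link.
all_links = [a.get('href') for a in soup.find_all('a')]

# class_ matches any one of a tag's classes, so "body main" still matches 'body'.
body_divs = [div.get_text(strip=True) for div in soup.find_all('div', class_='body')]

print(nav_links)  # ['/', '/login']
print(all_links)  # ['/', '/login', '/terms']
print(body_divs)  # ['Inside the body div']
```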

3 Parsing tables and XML with Beautiful Soup 4


In [None]:
table = soup.find('table')

table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
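
One thing to watch for with the loop above: a header row built from th tags contains no td tags, so it prints as an empty list. The sketch below shows one possible way to keep the header, using a made-up table and the built-in 'html.parser' (both are assumptions for illustration, not the page's actual table).

```python
import bs4 as bs

# Hypothetical table; the names and numbers are placeholders.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>alpha</td><td>10</td></tr>
  <tr><td>beta</td><td>20</td></tr>
</table>
"""

soup = bs.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Looking only for <td> leaves the header row empty.
td_only = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')]

# Passing a list of tag names to find_all collects both header and data cells.
rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all(['th', 'td'])
    rows.append([cell.get_text(strip=True) for cell in cells])

header, data = rows[0], rows[1:]

print(td_only[0])  # []
print(header)      # ['Name', 'Score']
print(data)        # [['alpha', '10'], ['beta', '20']]
```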
    
    

Finally, let's talk about parsing XML. XML uses tags much like HTML, but it is stricter: tag names are case-sensitive and every tag must be closed. We can use a variety of libraries to parse XML, including standard library options, but since this is a Beautiful Soup 4 tutorial, let's talk about how to do it with BS4.

One of the most common reasons that you might deal with an XML document is if you are trying to scrape a sitemap for a website. PythonProgramming.net has a sitemap.xml, so we'll use that.

The sitemap looks like:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>
		https://pythonprogramming.net/pickle-data-analysis-python-pandas-tutorial/
		</loc>
		<lastmod>2016-10-15</lastmod>
	</url>
	<url>
		<loc>
		https://pythonprogramming.net/training-testing-machine-learning-tutorial/
		</loc>
		<lastmod>2016-10-15</lastmod>
	</url>
</urlset>

In [None]:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/sitemap.xml').read()
# the 'xml' parser requires lxml to be installed
soup = bs.BeautifulSoup(source, 'xml')

# we just want to grab the urls
for url in soup.find_all('loc'):
    print(url.text)
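
If you also want the lastmod date for each URL, you can walk the url tags instead of the loc tags. This is a rough offline sketch that parses the sample sitemap shown earlier as a string, so it needs no network access; it still assumes lxml is installed for the 'xml' parser.

```python
import bs4 as bs

# The sample sitemap from above, embedded as a string.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>
        https://pythonprogramming.net/pickle-data-analysis-python-pandas-tutorial/
        </loc>
        <lastmod>2016-10-15</lastmod>
    </url>
    <url>
        <loc>
        https://pythonprogramming.net/training-testing-machine-learning-tutorial/
        </loc>
        <lastmod>2016-10-15</lastmod>
    </url>
</urlset>"""

soup = bs.BeautifulSoup(sitemap, 'xml')

# Pair each URL with its last-modified date.
# .strip() removes the newlines and indentation around the URL inside <loc>.
entries = [
    (url.loc.text.strip(), url.lastmod.text.strip())
    for url in soup.find_all('url')
]

for loc, lastmod in entries:
    print(lastmod, loc)
```

The .strip() call matters here because the loc text in this sitemap sits on its own line, surrounded by whitespace.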

4 Scraping Dynamic Javascript Text
Many websites supply data that is loaded dynamically via JavaScript. In Python, you can render data on the server with Jinja templating and avoid JavaScript altogether, but many websites still rely on JavaScript to populate their pages. To simulate this, I have added the following code to the parsememcparseface page:

<p>Javascript (dynamic data) test:</p>
  <p class='jstest' id='yesnojs'>y u bad tho?</p>
  <script>
     document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
  </script> 
The code takes a regular paragraph tag with the class jstest, which initially contains the text "y u bad tho?". The JavaScript that follows then updates that paragraph to read "Look at you shinin!". Thus, if you are reading the JavaScript-updated page, you will see the shinin message. If you don't, then you will be ridiculed.

If you open the page in your web browser, you'll see the shinin message, so let's try it in Beautiful Soup:

In [None]:
import bs4 as bs
import urllib.request

# urllib only downloads the raw HTML; it does not run the page's JavaScript
source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(source,'lxml')

js_test = soup.find('p', class_='jstest')

print(js_test.text)
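
Beautiful Soup only parses the markup it is given and never executes scripts, so the paragraph keeps its original text. The sketch below reproduces this offline by parsing the snippet shown above as a string; using the built-in 'html.parser' instead of 'lxml' is an assumption made only to keep the example dependency-free.

```python
import bs4 as bs

# The snippet from the parsememcparseface page, embedded as a string.
html = """
<p>Javascript (dynamic data) test:</p>
<p class='jstest' id='yesnojs'>y u bad tho?</p>
<script>
   document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
</script>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

# The paragraph still holds its initial text because the script never ran.
js_test = soup.find('p', class_='jstest')
print(js_test.text)  # y u bad tho?

# The script is present in the tree, but only as unexecuted source text.
script_source = soup.find('script').string
print('Look at you shinin!' in script_source)  # True
```

To read the updated text you would need something that actually runs the JavaScript, such as a real or headless browser, before handing the resulting HTML to Beautiful Soup.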