# Web Scraping with Python

### AJ Zerouali (21/06/21)

This part follows the lectures in Section 15 of Pierian Data's Python bootcamp. First, one has to install the following libraries: requests, lxml and bs4 (Beautiful Soup v4). The notebook for these lectures contains a lot of useful info:

https://github.com/Pierian-Data/Complete-Python-3-Bootcamp/blob/master/13-Web-Scraping/00-Guide-to-Web-Scraping.ipynb

In [9]:
import requests
import lxml
import bs4

## Example 1 - Grabbing a title from a webpage

I'll do this example with the Wikipedia page on Hermann Weyl. This is explained in Lecture 118.

In [10]:
# If the next instruction doesn't work, it could be because of the firewall.
result_scraping_req = requests.get("https://en.wikipedia.org/wiki/Hermann_Weyl")

In [11]:
print(result_scraping_req)

<Response [200]>
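
Given the firewall remark above, it can help to guard the request. Here is a minimal sketch, assuming a hypothetical helper named fetch_html and an arbitrary 10-second timeout (neither comes from the lectures). It returns the page's HTML, or None if the request fails or the server answers with an error status.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string, or None if the request fails.

    Hypothetical helper, not part of the course notebook.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print(f"Request failed: {err}")
        return None
    return response.text
```

When the request succeeds, fetch_html("https://en.wikipedia.org/wiki/Hermann_Weyl") should give the same string as the ".text" attribute used in the next cell.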


In [12]:
# In the video, Portilla displayed the HTML source as a string using the ".text" attribute.
# I'll hide the string
result_scraping_req.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Hermann Weyl - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"2d273946-ad2e-43b8-8fea-52ad6a9b7328","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Hermann_Weyl","wgTitle":"Hermann Weyl","wgCurRevisionId":1028223987,"wgRevisionId":1028223987,"wgArticleId":187544,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: long volume value","Articles with short description","Short description is different from Wikidata","Biography with signature","Articles with hCards","All artic

This is where it gets interesting: bs4 will parse the string above into an HTML document object that we can search.

In [13]:
html_code = bs4.BeautifulSoup(result_scraping_req.text,"lxml")

In [14]:
# Now we have proper HTML:
# Why is there no proper indentation?
#print(html_code)
html_code

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Hermann Weyl - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"2d273946-ad2e-43b8-8fea-52ad6a9b7328","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Hermann_Weyl","wgTitle":"Hermann Weyl","wgCurRevisionId":1028223987,"wgRevisionId":1028223987,"wgArticleId":187544,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: long volume value","Articles with short description","Short description is different from Wikidata","Biography with signature","Articles with hCards","All articles wi
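
On the indentation question in the cell above: Beautiful Soup objects have a ".prettify()" method that returns the markup as a string with one tag per line and nested indentation. Here is a minimal sketch on a small made-up HTML string. The string is an assumption, and Python's built-in "html.parser" is used instead of "lxml" only to keep the example self-contained.

```python
import bs4

# Made-up HTML snippet, standing in for result_scraping_req.text.
sample_html = "<html><body><p>Hi</p></body></html>"
soup = bs4.BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string in which each nesting level is indented.
pretty = soup.prettify()
print(pretty)
```

In the notebook, print(html_code.prettify()) would do the same for the full page.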

Below is an example of how one extracts elements from a webpage, using the ".select()" method from bs4. This method returns a list of the elements matching the requested type of HTML block (e.g. \<p\> ... \</p\> for paragraphs, \<title\> ... \</title\> for titles, etc.).

Here's a first example with the title of the page, which is the title displayed in a tab when going to Hermann Weyl's entry on Wikipedia:

In [18]:
# Extract title:
title_lst = html_code.select("title")
print(title_lst)
type(title_lst[0])

[<title>Hermann Weyl - Wikipedia</title>]


bs4.element.Tag

In [19]:
# Get a string from the previous list, using the ".getText()" method:
title_str = title_lst[0].getText()
print(title_str)

Hermann Weyl - Wikipedia


The next example is about paragraphs.

In [20]:
parags = html_code.select("p")

In [22]:
parags[1].getText()

'His research has had major significance for theoretical physics as well as purely mathematical disciplines including number theory. He was one of the most influential mathematicians of the twentieth century, and an important member of the Institute for Advanced Study during its early years.[5][6][7]\n'

In [23]:
len(parags)

26

In [25]:
# The next loop prints the text of every paragraph (<p> block) on the page,
# including the Biography and Contributions sections of the article.
for parag in parags:
    print(parag.getText())

Hermann Klaus Hugo Weyl, ForMemRS[2] (German: [vaɪl]; 9 November 1885 – 8 December 1955) was a German mathematician, theoretical physicist and philosopher. Although much of his working life was spent in Zürich, Switzerland, and then Princeton, New Jersey, he is associated with the University of Göttingen tradition of mathematics, represented by David Hilbert and Hermann Minkowski.

His research has had major significance for theoretical physics as well as purely mathematical disciplines including number theory. He was one of the most influential mathematicians of the twentieth century, and an important member of the Institute for Advanced Study during its early years.[5][6][7]

Weyl published technical and some general works on space, time, matter, philosophy, logic, symmetry and the history of mathematics. He was one of the first to conceive of combining general relativity with the laws of electromagnetism. While no mathematician of his generation aspired to the 'universalism' of Henr

#### Comment from Portilla:

Most of the web scraping work goes into knowing the appropriate tags to pass as arguments in "html_code.select('*tag name*')".
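
As a rough illustration of what can go into select(): it accepts CSS selectors, so one can match by tag name, by class (leading dot) or by id (leading "#"). The HTML below is made up for the example, and "html.parser" is used instead of "lxml" only to avoid the extra dependency.

```python
import bs4

# Made-up page with a tag, a class and an id to select on.
sample_html = """
<html><head><title>Demo page</title></head>
<body>
  <h2 id="intro">Introduction</h2>
  <p class="note">First note.</p>
  <p>Plain paragraph.</p>
  <p class="note">Second note.</p>
</body></html>
"""
soup = bs4.BeautifulSoup(sample_html, "html.parser")

by_tag = soup.select("p")        # every <p> element
by_class = soup.select(".note")  # elements with the class "note"
by_id = soup.select("#intro")    # the element with id="intro"

print(len(by_tag))                      # 3
print([p.getText() for p in by_class])  # ['First note.', 'Second note.']
print(by_id[0].getText())               # Introduction
```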

## Example 2 - Grabbing a class from a webpage

This is from Lecture 119. Here, a class means a CSS class, i.e. one of the values in an HTML element's "class" attribute. Since select() takes a CSS selector, a leading dot (as in ".toctext") matches elements by class.

This starts by finding the type of element we're interested in on the webpage, then right-clicking on it and selecting "Inspect". The browser then displays the relevant part of the HTML code on the right.


In [27]:
# this is a first try:
html_code.select(".toctext")

[<span class="toctext">Biography</span>,
 <span class="toctext">Contributions</span>,
 <span class="toctext">Distribution of eigenvalues</span>,
 <span class="toctext">Geometric foundations of manifolds and physics</span>,
 <span class="toctext">Topological groups, Lie groups and representation theory</span>,
 <span class="toctext">Harmonic analysis and analytic number theory</span>,
 <span class="toctext">Foundations of mathematics</span>,
 <span class="toctext">Weyl equation</span>,
 <span class="toctext">Quotes</span>,
 <span class="toctext">Bibliography</span>,
 <span class="toctext">See also</span>,
 <span class="toctext">Topics named after Hermann Weyl</span>,
 <span class="toctext">References</span>,
 <span class="toctext">Further reading</span>,
 <span class="toctext">External links</span>]

So the "toctext" class refers to the various (sub)sections of the Wikipedia page (table of contents text). In the HTML, it's within \<span\> blocks. The following prints out all the relevant titles:

In [28]:
for elt in html_code.select(".toctext"):
    print(elt.text)

Biography
Contributions
Distribution of eigenvalues
Geometric foundations of manifolds and physics
Topological groups, Lie groups and representation theory
Harmonic analysis and analytic number theory
Foundations of mathematics
Weyl equation
Quotes
Bibliography
See also
Topics named after Hermann Weyl
References
Further reading
External links


Now let's extract Weyl's quotes from this page. This means that we should extract the \<ul\> blocks that follow the heading of class "mw-headline" with id = "Quotes". **I really don't see how to do this. Requires more background on HTML/CSS syntax and how to use select()?**

In [37]:
test_Quotes_id = html_code.select('#Quotes')
#toclevel-1 tocsection-9
test_Quotes_id 

[<span class="mw-headline" id="Quotes">Quotes</span>]
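
Here is one possible way to go from the heading to the quotes themselves, sketched on a made-up HTML fragment. It assumes that the "Quotes" span sits inside an \<h2\> heading and that the quotes are in the first \<ul\> that follows that heading as a sibling. The real page may be laid out differently, so this should be checked with "Inspect". The variable names and "html.parser" are choices made for the example.

```python
import bs4

# Made-up fragment imitating a heading followed by a list (assumed layout).
sample_html = (
    '<h2><span class="mw-headline" id="Quotes">Quotes</span></h2>'
    "<ul><li>First quote</li><li>Second quote</li></ul>"
    '<h2><span class="mw-headline" id="Bibliography">Bibliography</span></h2>'
    "<ul><li>A book</li></ul>"
)
soup = bs4.BeautifulSoup(sample_html, "html.parser")

# 1. Locate the span with id="Quotes" (select_one returns one Tag or None).
quotes_span = soup.select_one("#Quotes")
# 2. Move up to the enclosing <h2> heading.
quotes_heading = quotes_span.find_parent("h2")
# 3. Take the first <ul> sibling that comes after the heading.
quotes_list = quotes_heading.find_next_sibling("ul")
# 4. Collect the text of each list item.
quotes = [li.getText() for li in quotes_list.select("li")]
print(quotes)
```

On the real page each of these steps can return None if the layout assumption fails, so the intermediate results are worth inspecting one at a time.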

In [35]:
help(html_code.select)

Help on method select in module bs4.element:

select(selector, namespaces=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Perform a CSS selection operation on the current element.
    
    This uses the SoupSieve library.
    
    :param selector: A string containing a CSS selector.
    
    :param namespaces: A dictionary mapping namespace prefixes
       used in the CSS selector to namespace URIs. By default,
       Beautiful Soup will use the prefixes it encountered while
       parsing the document.
    
    :param limit: After finding this number of results, stop looking.
    
    :param kwargs: Keyword arguments to be passed into SoupSieve's 
       soupsieve.select() method.
    
    :return: A ResultSet of Tags.
    :rtype: bs4.element.ResultSet



In [36]:
# Select every element of class "mw-headline" and display the resulting list.
mw_headlines = html_code.select(".mw-headline")
mw_headlines

## Example 3 - Grabbing an image from a webpage

This is from Lecture 120. We'll download jpg or png files below.

First, here's a list with all the images on the webpage.

In [42]:
html_code.select('img')

[<img alt="Hermann Weyl ETH-Bib Portr 00890.jpg" data-file-height="479" data-file-width="462" decoding="async" height="228" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/78/Hermann_Weyl_ETH-Bib_Portr_00890.jpg/220px-Hermann_Weyl_ETH-Bib_Portr_00890.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/78/Hermann_Weyl_ETH-Bib_Portr_00890.jpg/330px-Hermann_Weyl_ETH-Bib_Portr_00890.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/78/Hermann_Weyl_ETH-Bib_Portr_00890.jpg/440px-Hermann_Weyl_ETH-Bib_Portr_00890.jpg 2x" width="220"/>,
 <img alt="Hermann Weyl signature.svg" data-file-height="61" data-file-width="240" decoding="async" height="38" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Hermann_Weyl_signature.svg/150px-Hermann_Weyl_signature.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Hermann_Weyl_signature.svg/225px-Hermann_Weyl_signature.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/87/Hermann_Weyl_signature.sv

Let's single out the portrait picture of Weyl:

In [43]:
img_hermann = html_code.select('img')[0]['src']

In [44]:
img_hermann

'//upload.wikimedia.org/wikipedia/commons/thumb/7/78/Hermann_Weyl_ETH-Bib_Portr_00890.jpg/220px-Hermann_Weyl_ETH-Bib_Portr_00890.jpg'

A nice feature of the Markdown mode of cells in Jupyter is that it's HTML (and LaTeX) compatible. If we use the address above in an \<img\> tag, the image is displayed in the cell:

<img 
     src = "//upload.wikimedia.org/wikipedia/commons/thumb/7/78/Hermann_Weyl_ETH-Bib_Portr_00890.jpg/220px-Hermann_Weyl_ETH-Bib_Portr_00890.jpg">
     
Next, we want to download the image, and then make a file out of it. 

In [45]:
# This will download the image at the address in img_hermann. Note that we prepend "https:" because the address starts with "//".
img_download = requests.get("https:"+img_hermann)
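
Instead of concatenating "https:" by hand, the standard library's urllib.parse.urljoin can resolve an address starting with "//" against the page's URL. A small sketch follows; the variable names are chosen for the example.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Hermann_Weyl"
# Address without a scheme, as found in the "src" attribute above.
img_src = ("//upload.wikimedia.org/wikipedia/commons/thumb/7/78/"
           "Hermann_Weyl_ETH-Bib_Portr_00890.jpg/"
           "220px-Hermann_Weyl_ETH-Bib_Portr_00890.jpg")

# urljoin borrows the scheme ("https") from the page URL.
full_img_url = urljoin(page_url, img_src)
print(full_img_url)
```

The same call also handles addresses that are relative to the page or already absolute, which is convenient when looping over many images.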

In [46]:
type(img_download)

requests.models.Response

In [47]:
# Present working directory to save file in correct folder
pwd

'C:\\Users\\zaj20\\Documents\\Python\\First steps\\Portilla_PyBtcp_Sec.15'

In [48]:
# The raw binary data of the image is in the "content" attribute of img_download.
## Open a new file in the pwd. Note the "wb" mode argument, standing for "write binary", as well as the extension
## of the file (it must match the extension in the address).
f = open("C:\\Users\\zaj20\\Documents\\Python\\First steps\\Portilla_PyBtcp_Sec.15\\Hermann_Weyl.jpg","wb")

In [49]:
## This line will write the binary content of the image into the new file.
f.write(img_download.content)
## Close the file
f.close()
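
The open/write/close steps can also be written with a "with" block, which closes the file automatically even if the write fails. Here is a minimal sketch, assuming a hypothetical helper save_bytes and a temporary folder in place of the notebook's working directory. The bytes written are a stand-in for img_download.content.

```python
import tempfile
from pathlib import Path

def save_bytes(data, path):
    """Write raw bytes to `path` and return the number of bytes written."""
    # "wb" = write binary; the file is closed when the with block ends.
    with open(path, "wb") as f:
        return f.write(data)

# Stand-in for img_download.content (not real image data).
fake_content = b"\xff\xd8\xff demo bytes"
target = Path(tempfile.mkdtemp()) / "Hermann_Weyl.jpg"
n_written = save_bytes(fake_content, target)
print(n_written, target.exists())
```

In the notebook, the call would be save_bytes(img_download.content, "Hermann_Weyl.jpg"), which writes into the present working directory.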

### Comment:

This web scraping business will be involved...