# Web Scraping Intro

### Hypertext Transfer Protocol (HTTP) is the foundation for data communication on the World Wide Web.
- Entering a URL in your browser sends a request for the resource at that address
- The response is what the server sends back (for example, the page contents or a 404 error)

To retrieve the contents of a website, we will be using the [_requests_](https://requests.readthedocs.io/en/master/) library.

In [1]:
import requests

In this notebook, we will be using a **GET** request. This is a request for data from a specified resource.  

Another common type of request is a **POST** request. POST submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource, the update of existing resources, or both.
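
To make the "data in the body" point concrete, here is a minimal sketch that builds a POST request without sending it. The endpoint `https://example.com/submit` and the form fields are hypothetical and used only for illustration.

```python
import requests

# Hypothetical endpoint and form fields, used only for illustration.
form_data = {'name': 'Ada', 'award': 'Turing'}
request = requests.Request('POST', 'https://example.com/submit', data=form_data)

# prepare() builds the request exactly as it would be sent, without sending it.
prepared = request.prepare()

print(prepared.method)                   # the HTTP method
print(prepared.headers['Content-Type'])  # how the body is encoded
print(prepared.body)                     # the form data, carried in the request body
```

To actually send such a request, you would call `requests.post(url, data=form_data)` against a server that accepts it.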

To perform a GET request, use `requests.get()` and pass in the desired url.

In [2]:
URL = 'http://en.wikipedia.org/wiki/Turing_Award'

response = requests.get(URL)

Let's see what kind of object we get.

In [3]:
type(response)

requests.models.Response

We can check the status code using the `status_code` attribute.

In [4]:
response.status_code

200

A 200 status code is the standard response for a successful request.  
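
In a script, you may prefer to have failed requests raise an exception rather than checking `status_code` by hand. The sketch below uses `raise_for_status()` for this. The helper name `check` is our own, and the `Response` objects are built by hand only so the example runs offline; in the notebook you would pass the result of `requests.get(...)`.

```python
import requests

def check(response):
    """Return True on success; report the error and return False otherwise."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    except requests.HTTPError as err:
        print(f'Request failed: {err}')
        return False
    return True

# Stand-in responses built by hand so the sketch runs without network access.
good = requests.Response()
good.status_code = 200

bad = requests.Response()
bad.status_code = 404

print(check(good))
print(check(bad))
```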

Other common status codes:
 * 400: Bad Request
 * 404: Not Found
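
If you come across a status code you don't recognize, Python's standard library can look it up. This is a small sketch using `http.HTTPStatus`; the helper `describe_status` is our own name, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    if 200 <= code < 300:
        kind = 'success'
    elif 300 <= code < 400:
        kind = 'redirection'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'informational'
    return f'{code} {status.phrase} ({kind})'

for code in (200, 400, 404, 500):
    print(describe_status(code))
```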

Let's see what happens if we request a non-existent URL.

In [5]:
requests.get('https://en.wikipedia.org/wiki/Tuning_Award')

<Response [404]>

**Back to the successful request**: let's see what it returned.

In [None]:
response.text

It is very hard to decipher the above text. Luckily for us, the [_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library comes to the rescue. This library assists us in parsing HTML into something usable.

In [7]:
from bs4 import BeautifulSoup as BS

First, we can soupify our response text. Since we are working with HTML, we specify Python's built-in HTML parser, `html.parser`.

In [8]:
soup = BS(response.text, 'html.parser')

Now, we can print it out in a slightly more readable form.

In [None]:
print(soup.prettify())

What we are looking at is the HTML for this page. This is rendered by your browser into the Wikipedia page that you see.

<img src="assets/html.png">


If you navigate to this page in your browser, you can view page source or inspect elements to see the underlying HTML.

If you are using Safari, this may not be available by default, and you'll need to activate it. According to [this](https://www.socialmeteor.com/2013/03/04/how-to-view-html-source-in-safari-web-browser/) website, you can do so by following these steps:


1. Open Safari.
2. Select ‘Preferences’ from the ‘Safari’ menu.
3. In the ‘Advanced’ section, select ‘Show Develop menu in menu bar’.
4. Visit the web page you want to view HTML source for.
5. Select ‘Show Page Source’ from the ‘Develop’ menu that has been added to Safari.


Beautiful Soup lets us search through this HTML and extract out the contents we want by tag.  

Say we wanted to find the title of this page. We can accomplish this by using the `.find` method on our soup, telling it that we want to find the first `title` tag.

In [10]:
soup.find('title')

<title>Turing Award - Wikipedia</title>

Notice that this returns a bs4 Tag object.

In [11]:
type(soup.find('title'))

bs4.element.Tag

To extract out the text, you can use the `.text` attribute.

In [12]:
soup.find('title').text

'Turing Award - Wikipedia'
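
One thing to watch for: when nothing matches, `.find` returns `None`, so calling `.text` on the result fails. The sketch below shows this on a small inline HTML snippet (made up for illustration) so that it runs without a network request.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, so the example does not depend on Wikipedia.
html = '<html><head><title>Demo Page</title></head><body><p>Hello</p></body></html>'
demo_soup = BeautifulSoup(html, 'html.parser')

title_tag = demo_soup.find('title')
print(title_tag.text)  # text inside the first <title> tag

missing = demo_soup.find('table')  # this snippet contains no <table>
print(missing)                     # find returns None when nothing matches

# Guard before using .text; None has no .text attribute.
table_text = missing.text if missing is not None else ''
```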

The `.find` method finds the first matching tag.

We can find _all_ elements with a particular tag using the `.findAll(<tag>)` method (an older alias of `.find_all`, the spelling used in the current Beautiful Soup documentation). Say we want to find all images. We'll look for the `img` tag.

In [13]:
images = soup.findAll('img')
print(type(images))
images

<class 'bs4.element.ResultSet'>


[<img alt="Turing-statue-Bletchley 11.jpg" data-file-height="4928" data-file-width="3264" decoding="async" height="332" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/220px-Turing-statue-Bletchley_11.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/330px-Turing-statue-Bletchley_11.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/440px-Turing-statue-Bletchley_11.jpg 2x" width="220"/>,
 <img alt="" class="thumbborder" data-file-height="505" data-file-width="960" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/23px-Flag_of_the_United_States_%281896%E2%80%931908%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/35px-Flag_of_the_United_States_%281896%E2%80%931908%29.svg.png 1.5x, //upload.wikimed

Let's look closer at the first image.

In [14]:
first_image = images[0]
print(type(first_image))
first_image

<class 'bs4.element.Tag'>


<img alt="Turing-statue-Bletchley 11.jpg" data-file-height="4928" data-file-width="3264" decoding="async" height="332" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/220px-Turing-statue-Bletchley_11.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/330px-Turing-statue-Bletchley_11.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/440px-Turing-statue-Bletchley_11.jpg 2x" width="220"/>

You can access attributes of a Tag object in the same way that you would access values from a dictionary.

In [15]:
first_image['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/220px-Turing-statue-Bletchley_11.jpg'

You can also safely access attributes using `.get`, which returns `None` instead of raising an error when the attribute is missing. This is useful when you aren't sure whether a particular Tag, or every Tag in a collection, has a certain attribute.

In [16]:
# Non-safe
first_image['class']

KeyError: 'class'

In [17]:
# Safe
first_image.get('class')

You can also specify a default value when using `get`.

In [18]:
first_image.get('class', default='No Class')

'No Class'
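
The same pattern can be tried on a small made-up snippet. This sketch also shows two details that are easy to miss: Beautiful Soup returns `class` as a list (an element can have several classes), and `has_attr` lets you test for an attribute directly.

```python
from bs4 import BeautifulSoup

# Two made-up images: only the first has a class attribute.
html = '<img src="a.png" class="thumbborder wide"><img src="b.png">'
first, second = BeautifulSoup(html, 'html.parser').find_all('img')

print(first['class'])            # class is multi-valued, so this is a list
print(second.get('class'))       # missing attribute: None instead of KeyError
print(second.has_attr('class'))  # explicit membership test

# Using an empty list as the default keeps the result type consistent.
classes = [img.get('class', []) for img in (first, second)]
```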

If you want to grab a particular attribute for all images, an easy way to do so is with a list comprehension.

In [19]:
image_srcs = [x.get('src') for x in images]

In [20]:
image_srcs

['//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/220px-Turing-statue-Bletchley_11.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/23px-Flag_of_the_United_States_%281896%E2%80%931908%29.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg/80px-Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/23px-Flag_of_the_United_Kingdom.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/23px-Flag_of_the_United_States_%281896%E2%80%931908%29.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Marvin_Minsky_at_OLPCc.jpg/80px-Marvin_Minsky_at_OLPCc.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/23px-Flag_of_the_Unite
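
Notice that these `src` values start with `//` and have no scheme, so they cannot be passed to `requests.get` as they are. A minimal sketch of one way to resolve them against the page's URL, using `urljoin` from the standard library (the second path in the list is hypothetical, added to show a site-relative link):

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Turing_Award'

srcs = [
    # Protocol-relative URL, as returned by the scrape above.
    '//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/220px-Turing-statue-Bletchley_11.jpg',
    # Hypothetical site-relative path, for illustration only.
    '/static/images/example.png',
]

# urljoin fills in whatever is missing (scheme, host) from the page URL.
absolute_srcs = [urljoin(PAGE_URL, src) for src in srcs]

for url in absolute_srcs:
    print(url)
```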

We can further navigate the HTML tree to extract other bits of information.

When scraping from a web page, you should make use of "View Page Source" and/or "Inspect Element" in your web browser.

For example, let's say we want to look at the third div on the page.

In [21]:
soup.findAll('div')[2]

<div class="mw-body" id="content" role="main">
<a id="top"></a>
<div class="mw-body-content" id="siteNotice"><!-- CentralNotice --></div>
<div class="mw-indicators mw-body-content">
</div>
<h1 class="firstHeading" id="firstHeading" lang="en">Turing Award</h1>
<div class="mw-body-content" id="bodyContent">
<div class="noprint" id="siteSub">From Wikipedia, the free encyclopedia</div>
<div id="contentSub"></div>
<div id="contentSub2"></div>
<div id="jump-to-nav"></div>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#searchInput">Jump to search</a>
<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><div class="mw-parser-output"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">American annual computer science prize</div>
<table class="infobox vevent" style="width:22em"><tbody><tr><th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;background-color: #eed

Similar to using `find` and `findAll` on the full soup, we can call these methods on an individual Tag to search only within that Tag.

In [22]:
soup.findAll('div')[2].find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">Turing Award</h1>

In [23]:
soup.findAll('div')[2].find('h1').get('id')

'firstHeading'

In [24]:
soup.findAll('div')[2].find('h1').text

'Turing Award'
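
Selecting a div by its position can break if the page layout changes. When the element you want has an `id`, searching by that `id` is usually sturdier. A small sketch on made-up HTML that mimics the structure above:

```python
from bs4 import BeautifulSoup

# Made-up HTML loosely modeled on the page structure shown above.
html = '''
<div id="nav"><h1 id="siteTitle">Site</h1></div>
<div id="content"><h1 class="firstHeading" id="firstHeading">Turing Award</h1></div>
'''
demo = BeautifulSoup(html, 'html.parser')

# Search within one Tag: only the h1 inside the content div is considered.
content = demo.find('div', attrs={'id': 'content'})
heading = content.find('h1')

# Or go straight to the element by its id, regardless of where it sits.
by_id = demo.find(id='firstHeading')

print(heading.text)
print(by_id.get('class'))
```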

Now, let's look for the table containing the Turing Award winners.

Using `.findAll` reveals that there are multiple tables on the page.

In [25]:
soup.findAll('table')

[<table class="infobox vevent" style="width:22em"><tbody><tr><th class="summary" colspan="2" style="text-align:center;font-size:125%;font-weight:bold;background-color: #eedd82;">ACM Turing Award</th></tr><tr><td colspan="2" style="text-align:center"><a class="image" href="/wiki/File:Turing-statue-Bletchley_11.jpg"><img alt="Turing-statue-Bletchley 11.jpg" data-file-height="4928" data-file-width="3264" decoding="async" height="332" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/220px-Turing-statue-Bletchley_11.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/330px-Turing-statue-Bletchley_11.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/ad/Turing-statue-Bletchley_11.jpg/440px-Turing-statue-Bletchley_11.jpg 2x" width="220"/></a><div><a href="/wiki/Stephen_Kettle" title="Stephen Kettle">Stephen Kettle</a>'s slate statue of <a href="/wiki/Alan_Turing" title="Alan Turing">Alan Turing</a> at <

In [26]:
len(soup.findAll('table'))

6

If we know a bit more about what we are looking for, we can include an `attrs` argument and pass a dictionary. 

Go to the Turing Award page in your browser, right-click on the top of the table, and choose "Inspect". You will notice that this table is defined with the tag `<table class="wikitable">`. Armed with this information, we can narrow down our search.

In [27]:
soup.find('table', attrs={'class' : 'wikitable'})

<table class="wikitable">
<tbody><tr bgcolor="#ccccc">
<th style="width:10px">Year
</th>
<th style="width:150px">Recipient
</th>
<th>Photo
</th>
<th>Nationality<sup class="reference" id="cite_ref-10"><a href="#cite_note-10">[10]</a></sup>
</th>
<th>Rationale
</th></tr>
<tr>
<th>1966
</th>
<td><a href="/wiki/Alan_Perlis" title="Alan Perlis">Alan Perlis</a>
</td>
<td>
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="505" data-file-width="960" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/23px-Flag_of_the_United_States_%281896%E2%80%931908%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/35px-Flag_of_the_United_States_%281896%E2%80%931908%29.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Flag_of_the_United_States_%281896%E2%80%931908%29.svg/46px-Flag_of_the_United_Stat

In [28]:
# Note: len() of a Tag counts its direct children, not the number of matching tables
len(soup.find('table', attrs={'class' : 'wikitable'}))

2
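
To see how filtering by attribute narrows a search, here is a sketch on a small made-up page with two tables. It also shows the `class_` keyword, which Beautiful Soup accepts as a shorthand for `attrs={'class': ...}`, and counts rows with `find_all` rather than `len()` on the Tag itself.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables; only the second has class "wikitable".
html = (
    '<table class="infobox"><tr><td>Summary</td></tr></table>'
    '<table class="wikitable">'
    '<tr><th>Year</th><th>Recipient</th></tr>'
    '<tr><td>1966</td><td>Alan Perlis</td></tr>'
    '</table>'
)
demo = BeautifulSoup(html, 'html.parser')

winners = demo.find('table', attrs={'class': 'wikitable'})
same_table = demo.find('table', class_='wikitable')  # equivalent shorthand

n_tables = len(demo.find_all('table'))  # all tables on the page
n_rows = len(winners.find_all('tr'))    # rows inside the selected table
header = [th.text for th in winners.find_all('th')]

print(n_tables, n_rows, header)
```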

We can render the table in the notebook by converting it to a string and passing it to IPython's `HTML` display class.

In [29]:
table_html = str(soup.find('table', attrs={'class' : 'wikitable'}))

from IPython.display import HTML  # IPython.core.display is deprecated for this import

HTML(table_html)

Year,Recipient,Photo,Nationality[10],Rationale
1966,Alan Perlis,,United States,For his influence in the area of advanced computer programming techniques and compiler construction.[11]
1967,Maurice Wilkes,,United Kingdom,"Wilkes is best known as the builder and designer of the EDSAC, the first computer with an internally stored program. Built in 1949, the EDSAC used a mercury delay line memory. He is also known as the author, with Wheeler and Gill, of a volume on ""Preparation of Programs for Electronic Digital Computers"" in 1951, in which program libraries were effectively introduced.[12]"
1968,Richard Hamming,,United States,"For his work on numerical methods, automatic coding systems, and error-detecting and error-correcting codes.[13]"
1969,Marvin Minsky,,United States,"For his central role in creating, shaping, promoting, and advancing the field of artificial intelligence.[14]"
1970,James H. Wilkinson,,United Kingdom,"For his research in numerical analysis to facilitate the use of the high-speed digital computer, having received special recognition for his work in computations in linear algebra and ""backward"" error analysis.[15]"
1971,John McCarthy,,United States,"McCarthy's lecture ""The Present State of Research on Artificial Intelligence"" is a topic that covers the area in which he has achieved considerable recognition for his work.[16]"
1972,Edsger W. Dijkstra,,Netherlands,"Edsger Dijkstra was a principal contributor in the late 1950s to the development of the ALGOL, a high level programming language which has become a model of clarity and mathematical rigor. He is one of the principal proponents of the science and art of programming languages in general, and has greatly contributed to our understanding of their structure, representation, and implementation. His fifteen years of publications extend from theoretical articles on graph theory to basic manuals, expository texts, and philosophical contemplations in the field of programming languages.[17]"
1973,Charles Bachman,,United States,For his outstanding contributions to database technology.[18]
1974,Donald Knuth,,United States,"For his major contributions to the analysis of algorithms and the design of programming languages, and in particular for his contributions to ""The Art of Computer Programming"" through his well-known books in a continuous series by this title.[19]"
1975,Allen Newell,,United States,"In joint scientific efforts extending over twenty years, initially in collaboration with J. C. Shaw at the RAND Corporation, and subsequently with numerous faculty and student colleagues at Carnegie Mellon University, they have made basic contributions to artificial intelligence, the psychology of human cognition, and list processing.[20]"


However, this does not give us a way to work with the data in the table, only to display it.

As part of Data Question 3, your group will need to figure out how to convert the resulting table into a `pandas` DataFrame.