# Internet

## Making requests with Python

* We can fetch web pages using the URL, just like a browser!
* With Python, we can process the data as well

## `urllib` Module

* One of the included modules in Python
* `import urllib.request` (importing `urllib` alone does not reliably load the `request` submodule)
* Can download the contents of a URL using `urllib.request.urlopen`

In [1]:
# Example using urllib

import urllib.request

file = urllib.request.urlopen("http://www.google.com")
for line in file:
    print(line)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="TTEv8qQ+OH0vkA0DkL0UZQ==">(function(){window.google={kEI:\'OhnaXa-KIJK90PEP9ZCIkAU\',kEXPI:\'0,1353747,5662,731,223,510,1065,3151,378,207,1017,1398,789,10,713,338,175,32,332,672,270,240,100,153,4,60,688,38,19,1129683,143,1197789,360,329118,1294,12383,4855,32692,15247,867,17444,1100,3335,2,2,6801,369,3314,5505,8384,1119,2,579,727,2432,1361,4323,3700,1268,773,2252,1405,3337,6,1140,9,1745,218,6196,1719,1808,1976,2044,5766,1,1453,1689,3192,2105,2017,37,920,873,1217,2975,2736,1558,

## Text Encoding

* Text can be stored using different encodings (ASCII, UTF-8, and others)
* In Python 3, strings (`str`) are Unicode text
* `urllib.request.urlopen` returns raw bytes (`bytes`), not `str`
* Must _decode_ the bytes to get a `str`
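
The difference between `bytes` and `str` can be tried without any network access. The following is a minimal sketch, assuming UTF-8 as the encoding; the sample text is made up for illustration.

```python
# Sketch: converting between str (Unicode text) and bytes.
# The sample text and the choice of UTF-8 are assumptions for illustration.
text = "café"

raw = text.encode("utf-8")        # str -> bytes
print(type(raw), raw)             # the "é" takes two bytes in UTF-8

decoded = raw.decode("utf-8")     # bytes -> str
print(type(decoded), decoded)

# Decoding with a codec that cannot represent the data raises an error.
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("ASCII cannot decode this:", err)
```

Calling `.decode()` with no argument, as the next cell does, uses UTF-8.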

In [3]:
import urllib.request

file = urllib.request.urlopen("https://automatetheboringstuff.com/files/rj.txt")
for line in file:
    print(line.decode())

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare



This eBook is for the use of anyone anywhere at no cost and with

almost no restrictions whatsoever.  You may copy it, give it away or

re-use it under the terms of the Project Gutenberg License included

with this eBook or online at www.gutenberg.org/license





Title: Romeo and Juliet



Author: William Shakespeare



Posting Date: May 25, 2012 [EBook #1112]

Release Date: November, 1997  [Etext #1112]



Language: English





*** START OF THIS PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***



























*Project Gutenberg is proud to cooperate with The World Library*

in the presentation of The Complete Works of William Shakespeare

for your reading for education and entertainment.  HOWEVER, THIS

IS NEITHER SHAREWARE NOR PUBLIC DOMAIN. . .AND UNDER THE LIBRARY

OF THE FUTURE CONDITIONS OF THIS PRESENTATION. . .NO CHARGES MAY

BE MADE FOR *ANY* ACCESS TO THIS MATERIAL.  YOU ARE ENCOURAGED!!

TO G

## HTTP Request methods

* Every HTTP request specifies a "method" that tells the server what to do
* `GET`, `POST`, `PUT`, and `DELETE` are the most common HTTP methods
* `GET` is the default and most common; it is what a browser uses to load a page
* `POST` is typically used to submit forms; `PUT` and `DELETE` are mostly used by web APIs
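
To see how a method is attached to a request, the sketch below builds request objects with the standard library's `urllib.request.Request` and inspects them. Nothing is sent over the network, and the URL and form field are hypothetical placeholders.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint, used only to build request objects; nothing is sent.
URL = "http://example.com/items"

# With no body and no explicit method, urllib uses GET.
get_req = urllib.request.Request(URL)
print(get_req.get_method())

# Supplying a body (as bytes) switches the default to POST.
form = urllib.parse.urlencode({"name": "widget"}).encode("ascii")
post_req = urllib.request.Request(URL, data=form)
print(post_req.get_method())

# Any other method can be named explicitly.
delete_req = urllib.request.Request(URL, method="DELETE")
print(delete_req.get_method())
```

Passing one of these objects to `urllib.request.urlopen` would actually send the request.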

## The `requests` module

* Lets you make requests with different methods
* Handles decoding and errors more conveniently than `urllib`
* Third-party package, not included with Python: `conda install requests`

In [9]:
import requests

res = requests.get("http://www.google.com")
print(type(res))
print(res.status_code)
print(len(res.text))
print(res.text[:250])

<class 'requests.models.Response'>
200
11837
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're l


## HTML

* Markup language for web pages
* Out of scope for this class
* Basically plain text as we've seen

## Parsing HTML with Beautiful Soup

* We can extract data from a web page using Beautiful Soup
* More reliable and convenient than using string methods or regular expressions
* Install using `conda install beautifulsoup4`
* `import bs4`
* Select elements using CSS syntax
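
As a quick illustration of CSS selectors, the sketch below parses a small made-up HTML string instead of a live page, so it runs offline. The tag names, ids, and classes are invented for the example.

```python
import bs4

# A small made-up page so the example needs no network access.
HTML = """
<html><body>
  <div id="menu">
    <a class="nav" href="/home">Home</a>
    <a class="nav" href="/about">About</a>
  </div>
  <p class="note">Hello <span>world</span></p>
</body></html>
"""

# Naming the parser explicitly avoids a warning from Beautiful Soup.
soup = bs4.BeautifulSoup(HTML, "html.parser")

menu = soup.select("#menu")      # element whose id is "menu"
links = soup.select("a.nav")     # <a> tags with class "nav"
spans = soup.select("p span")    # <span> tags nested inside a <p>

print(len(menu))
print([a["href"] for a in links])
print(spans[0].get_text())
```

`select` always returns a list, even when only one element matches.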

In [16]:
import requests
import bs4

res = requests.get("http://www.google.com")
res.raise_for_status() # Raises an exception if the response is an error
soup = bs4.BeautifulSoup(res.text, "html.parser") # Name the parser explicitly to avoid a warning

# Now we can find elements
print(soup.select('#gbar')) # Select element whose id is "gbar"

[<div id="gbar"><nobr><b class="gb1">Search</b> <a class="gb1" href="http://www.google.com/imghp?hl=en&amp;tab=wi">Images</a> <a class="gb1" href="http://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a> <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a> <a class="gb1" href="http://www.youtube.com/?gl=US&amp;tab=w1">YouTube</a> <a class="gb1" href="http://news.google.com/nwshp?hl=en&amp;tab=wn">News</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a> <a class="gb1" href="https://www.google.com/intl/en/about/products?tab=wh" style="text-decoration:none"><u>More</u> »</a></nobr></div>]


In [14]:
# Example: parsing a local file
import bs4

# sample.html is assumed to exist in the working directory
with open('sample.html', 'r') as file:
    soup = bs4.BeautifulSoup(file, 'html.parser')
print(soup.select('div span'))

[<span class="gbi" id="gbn"></span>, <span class="gbf" id="gbf"></span>, <span id="gbe"></span>]


## Exercise: Web search program

* Ask the user to input a search term
* Open the browser to the first matching URL
* Use the `webbrowser` module to open a web page
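
One detail to handle is that search terms containing spaces or special characters must be URL-encoded before they go into the query string. The sketch below shows one possible way to do that with `urllib.parse.quote_plus`; the helper names `build_search_url` and `open_search` are hypothetical. It opens the results page only and leaves out the step of finding the first matching link.

```python
import urllib.parse
import webbrowser

# Base URL for a Google search; the query format is an assumption that may change.
BASE_SEARCH_URL = "http://www.google.com/search?q="


def build_search_url(term):
    """Return the search URL for term, with special characters escaped."""
    return BASE_SEARCH_URL + urllib.parse.quote_plus(term)


def open_search(term):
    """Open the search results page for term in the default browser."""
    return webbrowser.open(build_search_url(term))


# Building the URL has no side effects, so it is safe to print.
print(build_search_url("romeo and juliet"))
```

Calling `open_search(input("Search term: "))` would prompt the user and open the results in a browser tab.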

In [17]:
# Example: Performing a web search
# webbrowser module opens a web browser
import requests, webbrowser, bs4


BASE_SEARCH_URL = "http://www.google.com/search?q="
CSS_SELECTOR = '.r a'
