# Downloading Text from the Web 

This notebook demonstrates features of Python's urllib package for URL handling. The package has several modules for working with URLs:

* urllib.request for opening and reading URLs
* urllib.error for exceptions raised by URL requests
* urllib.parse to parse URLs
* urllib.robotparser to parse robots.txt files
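
As a small offline illustration of one of these modules, the sketch below uses urllib.parse to split a URL into its components and to resolve a relative reference. The URL is the Project Gutenberg address used later in this notebook; the relative name `other.txt` is hypothetical and is only there to show how joining works.

```python
from urllib.parse import urlparse, urljoin

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

# Split the URL into named components
parts = urlparse(url)
print(parts.scheme)   # http
print(parts.netloc)   # www.gutenberg.org
print(parts.path)     # /files/2554/2554-0.txt

# Resolve a relative reference against the URL
# ("other.txt" is a made-up file name, used only for illustration)
joined = urljoin(url, "other.txt")
print(joined)         # http://www.gutenberg.org/files/2554/other.txt
```

No network access is needed here, since urllib.parse only manipulates strings.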

These modules are thoroughly documented [here](https://docs.python.org/3/library/urllib.html).

For a higher-level HTTP interface, consider using the [Requests package](https://2.python-requests.org/en/master/).

### Downloading text pages

The next code segment shows how to use urllib.request to download a book from Project Gutenberg. Starting with a URL pointing to the text of a book, the code below:

* opens the url with request.urlopen
* reads the page with decoding
* prints the first part of the text

In [1]:
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

# Open the URL, read the bytes, and decode them to a string
with request.urlopen(url) as f:
    raw = f.read().decode('utf-8-sig')  # utf-8-sig drops a leading BOM
print('len=', len(raw))
raw[:200]

len= 1176966


'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give'

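The output from the previous code cell shows that the downloaded *raw* is a very long string (over a million characters). Each line ends with \r\n (carriage return plus line feed), which is the Windows line-ending convention and suggests that the file was prepared on a Windows machine.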
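
If the \r\n line endings get in the way of later processing, they can be handled with ordinary string methods. The following is a minimal sketch on a short made-up string (the name `sample` and its contents are assumptions, not part of the downloaded book):

```python
# A short stand-in for text with Windows-style line endings
sample = "line one\r\nline two\r\n\r\nline four"

# splitlines() recognizes \r\n, \n, and \r as line boundaries
lines = sample.splitlines()
print(lines)       # ['line one', 'line two', '', 'line four']

# Alternatively, normalize the line endings and keep a single string
normalized = sample.replace("\r\n", "\n")
print(repr(normalized))
```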

The encoding used above was 'utf-8-sig' in order to get rid of a BOM (byte order mark) at the beginning of the file. The utf-8-sig encoding is a UTF-8 variant created by Microsoft for their Notepad program; when decoding, it skips a leading BOM if one is present. If decode('utf-8') is used for the URL above, the three BOM bytes (EF BB BF) are decoded into a single character, \ufeff, at the start of the string.
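
The difference between the two codecs can be checked without downloading anything. The sketch below builds a small byte string with a BOM by hand (the name `data` and its contents are made up for illustration) and decodes it both ways:

```python
import codecs

# Bytes that start with the UTF-8 BOM, followed by ordinary text
data = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")
print(len(codecs.BOM_UTF8))   # 3 (the BOM is three bytes long)

# Plain utf-8 keeps the BOM as one leading character
plain = data.decode("utf-8")
print(repr(plain[0]))         # '\ufeff'

# utf-8-sig removes the BOM while decoding
stripped = data.decode("utf-8-sig")
print(stripped[:3])           # The
```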


### Reading an HTML page

We can use Python's URL handlers to read a web page in the same way as before. Extracting the text from the result would require a lot of processing, so other packages are typically used for HTML pages.

In [3]:
url = 'https://nyti.ms/2uAQS89'
# Open the page, read the bytes, and decode them to a string
with request.urlopen(url) as f:
    html = f.read().decode('utf8')
html[:1000]

'<!DOCTYPE html>\n<html lang="en" class="story"  xmlns:og="http://opengraphprotocol.org/schema/">\n  <head>\n    <title data-rh="true">With Snowflakes and Unicorns, Marina Ratner and Maryam Mirzakhani Explored a Universe in Motion - The New York Times</title>\n    <meta data-rh="true" itemprop="inLanguage" content="en-US"/><meta data-rh="true" property="article:published" content="2017-08-07T19:57:40.000Z"/><meta data-rh="true" property="article:modified" content="2017-08-07T19:57:38.000Z"/><meta data-rh="true" http-equiv="Content-Language" content="en"/><meta data-rh="true" name="robots" content="noarchive"/><meta data-rh="true" name="articleid" content="100000005321914"/><meta data-rh="true" name="nyt_uri" content="nyt://article/3b6bc8f3-da51-583e-aac1-0daa23adbd34"/><meta data-rh="true" name="pubp_event_id" content="pubp://event/32d48a620c964f0a88a1801d3299d862"/><meta data-rh="true" name="description" content="The legacies and achievements of two great mathematicians will dazzle an

This shows that the page contains a lot of markup and metadata that we are not interested in. Extracting useful information is easier with packages such as Beautiful Soup, explored in another notebook.
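
To give a rough sense of the processing involved, here is a minimal sketch that pulls only the page title out of an HTML string using the standard library's html.parser module. This is not the Beautiful Soup approach mentioned above; the class name `TitleParser` and the short `sample_html` string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Keep text only while we are inside <title>...</title>
        if self.in_title:
            self.title += data


# A tiny made-up page standing in for the downloaded html string
sample_html = (
    '<!DOCTYPE html><html><head><title data-rh="true">A Sample Page</title>'
    "</head><body><p>Hello</p></body></html>"
)

parser = TitleParser()
parser.feed(sample_html)
print(parser.title)   # A Sample Page
```

Even this single task needs a small state machine, which is one reason dedicated HTML packages are usually preferred.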

