# Webscraping Links to Download Documents

## The Task:
To get the full sermon transcripts, I use BeautifulSoup and basic Python to collect every link to a `.doc` file on the page. I then turn these site-relative links into full URLs and use those URLs to download all of the document files.

## The Tools:

1. [Requests](http://docs.python-requests.org/en/latest/user/quickstart/)
2. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
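3. The `csv` module (Python standard library)
4. The `urllib` module (Python standard library), used for the downloads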

In [2]:
# import required modules
import requests
import csv
import urllib
from bs4 import BeautifulSoup

## Step 1. Make a GET Request and Read in HTML

1. Make a GET request to the page.
2. Read in the HTML of the page.
3. Get the full URLs of all document links on the page (covered in Steps 2 and 3).

In [3]:
# make a GET request
req = requests.get('http://english.hdchurch.org/html/news/focus/2011/0429/196.html#2009')
# read the content of the server’s response
src = req.text

Use the `BeautifulSoup` constructor to parse the response into an HTML tree. The resulting object holds all of the HTML in the original document and can be searched by tag.

In [4]:
# parse the response into an HTML tree
soup = BeautifulSoup(src)
# take a look
print(soup.prettify()[:500])

<html>
 <body>
  <p>
   ï»¿
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  </p>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Sort by Date: Sermon (media,ppt,transcription) for download - Haidian Christian Church
  </title>
  <meta content="," name="keywords"/>
  <meta content="collection of Sermon (from 2009.10) Video Audio &amp; lyric PPT (by English Choir) for Download - Haidi


## Step 2. Locate All Document Links


Make a list called `docurl` that holds the `href` attribute of every link whose target contains ".doc".

In [5]:

# Create an empty list
docurl = []

for link in soup('a'):              # loop over every "a" element
    href = link.get('href', '')     # the link target, or '' if the tag has none
    if '.doc' in href:              # keep only targets that contain ".doc"
        docurl.append(href)         # append only the URL to the list

print(docurl)



['/video/2012/20121230eng.doc', '/video/2012/20121028eng.doc', '/video/2012/20121028eng.doc', '/video/2012/20120826eng.doc', '/video/2012/20120819eng.doc', '/video/2012/20120805eng.doc', '/video/2012/20120729eng.doc', '/video/2012/20120722eng.doc', '/video/2012/20120715eng.doc', '/video/2012/20120708eng.doc', '/video/2012/20120701eng.doc', '/video/2012/20120617eng.doc', '/video/2012/20120610eng.doc', '/video/2012/20120603eng.doc', '/video/2012/20120527eng.doc', '/video/2012/20120520eng.doc', '/video/2012/20120513eng.doc', '/video/2012/20120506eng.doc', '/video/2012/20120429eng.doc', '/video/2012/20120422eng.doc', '/video/2012/20120415eng.doc', '/video/2012/20120408eng.doc', '/video/2012/20120401eng.doc', '/video/2012/20120325eng.doc', '/video/2012/20120318eng.doc', '/video/2012/20120311eng.doc', '/video/2012/201201304eng.doc', '/video/2012/20120226eng.doc', '/video/2012/20120219eng.doc', '/video/2012/20120212eng.doc', '/video/2012/20120205eng.doc', '/video/2012/20120122-DoWeReallyUnder
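
For comparison, the same selection (every `a` element whose `href` contains ".doc") can be sketched with only the standard library's `html.parser` instead of BeautifulSoup. This is written for Python 3, whereas the cells in this notebook are Python 2. The class name `DocLinkParser` and the sample HTML are assumptions made for illustration; only the link paths mirror the output above.

```python
from html.parser import HTMLParser


class DocLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target contains '.doc'."""

    def __init__(self):
        super().__init__()
        self.doc_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None
        href = dict(attrs).get("href") or ""
        if ".doc" in href:
            self.doc_links.append(href)


# Made-up HTML standing in for the real page
sample_html = """
<a href="/video/2012/20121230eng.doc">doc</a>
<a href="/video/2012/20121230eng.ppt">ppt</a>
<a name="2009">2009</a>
"""

parser = DocLinkParser()
parser.feed(sample_html)
print(parser.doc_links)
```

Testing the `href` rather than the link text keeps the filter independent of how the page labels its links.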

## Step 3. Prepare to Download Files

We create a list of filenames, one per link. Each downloaded document will be saved under its name from this list.

In [7]:
#create an empty list
filenames = [] 

# Loop through the list of links created above
for url in docurl:
    filenames.append(url.split('/')[-1])   # keep only the part after the last "/"
filenames

['20121230eng.doc',
 '20121028eng.doc',
 '20121028eng.doc',
 '20120826eng.doc',
 '20120819eng.doc',
 '20120805eng.doc',
 '20120729eng.doc',
 '20120722eng.doc',
 '20120715eng.doc',
 '20120708eng.doc',
 '20120701eng.doc',
 '20120617eng.doc',
 '20120610eng.doc',
 '20120603eng.doc',
 '20120527eng.doc',
 '20120520eng.doc',
 '20120513eng.doc',
 '20120506eng.doc',
 '20120429eng.doc',
 '20120422eng.doc',
 '20120415eng.doc',
 '20120408eng.doc',
 '20120401eng.doc',
 '20120325eng.doc',
 '20120318eng.doc',
 '20120311eng.doc',
 '201201304eng.doc',
 '20120226eng.doc',
 '20120219eng.doc',
 '20120212eng.doc',
 '20120205eng.doc',
 '20120122-DoWeReallyUnderstandGodLove.doc',
 '20120108eng.doc',
 '20120101eng.doc',
 '20111218eng.doc',
 '20111204eng.doc',
 '20111127eng.doc',
 '20111113eng.doc',
 '20111106eng.doc',
 '20111030eng.doc',
 '20111023eng.doc',
 '20111016eng.doc',
 '20111009eng.doc',
 '20111002eng.doc',
 '20110925eng.doc',
 '20110918eng.doc',
 '20110911eng.doc',
 '20110904eng.doc',
 '20110828eng.
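
Each filename is simply the last segment of the link's path. A minimal Python 3 sketch that makes this explicit, and that also works for full URLs or links carrying a query string, could look like the following. The helper name `filename_from_url` is hypothetical; the two sample links are taken from this notebook.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL or site-relative link."""
    return posixpath.basename(urlparse(url).path)


links = [
    "/video/2012/20121230eng.doc",
    "http://english.hdchurch.org/video/2011/20110731eng.doc",
]
names = [filename_from_url(link) for link in links]
print(names)
```

Because the helper looks only at the parsed path, it does not depend on every link sharing a prefix of the same length.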

To download the documents we also need a list of full URLs. We build it by prepending the site root, "http://english.hdchurch.org", to each link in the `docurl` list from Step 2.

In [8]:
#Create an empty list
fullurl = []

#For every url in the list 'docurl'
for url in docurl:
    url = "http://english.hdchurch.org" + url #add the front portion
    fullurl.append(url) #then append it to the 'fullurl' list.

fullurl

['http://english.hdchurch.org/video/2012/20121230eng.doc',
 'http://english.hdchurch.org/video/2012/20121028eng.doc',
 'http://english.hdchurch.org/video/2012/20121028eng.doc',
 'http://english.hdchurch.org/video/2012/20120826eng.doc',
 'http://english.hdchurch.org/video/2012/20120819eng.doc',
 'http://english.hdchurch.org/video/2012/20120805eng.doc',
 'http://english.hdchurch.org/video/2012/20120729eng.doc',
 'http://english.hdchurch.org/video/2012/20120722eng.doc',
 'http://english.hdchurch.org/video/2012/20120715eng.doc',
 'http://english.hdchurch.org/video/2012/20120708eng.doc',
 'http://english.hdchurch.org/video/2012/20120701eng.doc',
 'http://english.hdchurch.org/video/2012/20120617eng.doc',
 'http://english.hdchurch.org/video/2012/20120610eng.doc',
 'http://english.hdchurch.org/video/2012/20120603eng.doc',
 'http://english.hdchurch.org/video/2012/20120527eng.doc',
 'http://english.hdchurch.org/video/2012/20120520eng.doc',
 'http://english.hdchurch.org/video/2012/20120513eng.doc
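
Plain concatenation works here because every link in `docurl` starts with `/`. As a hedged alternative, the standard library's `urljoin` (in `urllib.parse` on Python 3) resolves each link against the page URL and leaves links that are already absolute untouched. The variable names `page_url`, `links` and `full_links` are illustrative; the URLs come from this notebook.

```python
from urllib.parse import urljoin

page_url = "http://english.hdchurch.org/html/news/focus/2011/0429/196.html"
links = [
    "/video/2012/20121230eng.doc",                             # site-relative
    "http://english.hdchurch.org/video/2011/20110731eng.doc",  # already absolute
]

# Resolve every link against the page it was found on
full_links = [urljoin(page_url, link) for link in links]
for link in full_links:
    print(link)
```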

## Step 4. Downloading Documents

We first download a single document as a test to check that it works. Then we download every file in the `fullurl` list.

In [51]:
# Download a test file to check that it works
# (urllib.URLopener is the Python 2 API; Python 3 offers urllib.request.urlretrieve)

testfile = urllib.URLopener()
testfile.retrieve("http://english.hdchurch.org/video/2011/20110731eng.doc", "20110731eng.doc")


('20110731eng.doc', <httplib.HTTPMessage instance at 0x107f04b00>)

In [43]:
# Download all documents,
# using fullurl and filenames

# Keep track of the URLs that did not work
bad_urls = []

# Loop through the fullurl list
for i, url in enumerate(fullurl):
    print(url)     # show which URL is being processed
    try:
        testfile = urllib.URLopener()
        testfile.retrieve(url, filenames[i])  # save under the matching filename from above

    # if the download fails, add the URL to the bad_urls list
    except IOError:
        bad_urls.append(url)


http://english.hdchurch.org/video/2012/20121230eng.doc
http://english.hdchurch.org/video/2012/20121028eng.doc
http://english.hdchurch.org/video/2012/20121028eng.doc
http://english.hdchurch.org/video/2012/20120826eng.doc
http://english.hdchurch.org/video/2012/20120819eng.doc
http://english.hdchurch.org/video/2012/20120805eng.doc
http://english.hdchurch.org/video/2012/20120729eng.doc
http://english.hdchurch.org/video/2012/20120722eng.doc
http://english.hdchurch.org/video/2012/20120715eng.doc
http://english.hdchurch.org/video/2012/20120708eng.doc
http://english.hdchurch.org/video/2012/20120701eng.doc
http://english.hdchurch.org/video/2012/20120617eng.doc
http://english.hdchurch.org/video/2012/20120610eng.doc
http://english.hdchurch.org/video/2012/20120603eng.doc
http://english.hdchurch.org/video/2012/20120527eng.doc
http://english.hdchurch.org/video/2012/20120520eng.doc
http://english.hdchurch.org/video/2012/20120513eng.doc
http://english.hdchurch.org/video/2012/20120506eng.doc
http://eng
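
The output above shows `20121028eng.doc` twice, so that file is fetched twice. A minimal Python 3 sketch that skips repeated URLs and records why each failed download failed might look like this. The function name `download_all` and its `dest_dir` and `fetch` parameters are assumptions made for illustration; `fetch` defaults to `urllib.request.urlretrieve`, which takes a URL and a local filename like the `retrieve` call used above.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_all(urls, dest_dir=".", fetch=urlretrieve):
    """Download each distinct URL once; return a list of (url, error) pairs."""
    failures = []
    seen = set()
    for url in urls:
        if url in seen:  # skip links that appear more than once
            continue
        seen.add(url)
        # Save the file under the last segment of the URL path
        name = posixpath.basename(urlparse(url).path)
        try:
            fetch(url, os.path.join(dest_dir, name))
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            failures.append((url, str(err)))
    return failures
```

Calling `download_all(fullurl)` would save the files in the current directory and return the failures, which plays the role of `bad_urls`. Passing a different `fetch` callable makes it possible to try the loop without touching the network.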

## Downloads Complete! Go Check the Folder.