# Webscraping - Lecture Code

## The Task:

Using Beautiful Soup and basic Python skills, scrape the sermon table from the Haidian Christian Church website: http://english.hdchurch.org/html/news/focus/2011/0429/196.html#2009
The result is a list of dictionaries, each with a date, title, preacher, scripture, and filename. We then write the list to a CSV file.

## The Tools:

1. [Requests](http://docs.python-requests.org/en/latest/user/quickstart/)
2. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
3. The `csv` module from the Python standard library

In [1]:
# import required modules
import requests
import csv
from bs4 import BeautifulSoup

## Step 1: Make a GET Request

In [2]:
# make a GET request
req = requests.get('http://english.hdchurch.org/html/news/focus/2011/0429/196.html#2009')
# read the content of the server’s response
src = req.text
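
A note on the URL above: the `#2009` fragment is an in-page anchor that the client handles itself, so it is not sent to the server and the request returns the whole page. A minimal sketch using only the standard library (written for Python 3; the variable names `page_url` and `fragment` are illustrative) shows the split:

```python
from urllib.parse import urldefrag

url = 'http://english.hdchurch.org/html/news/focus/2011/0429/196.html#2009'

# urldefrag separates the part sent to the server from the fragment
page_url, fragment = urldefrag(url)

print(page_url)   # the address that is actually requested
print(fragment)   # the in-page anchor
```

This is why the scraped HTML contains every year's rows, not only those under the 2009 anchor.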

## Step 2: Parse into HTML

We use the `BeautifulSoup` constructor to parse the response above into an HTML tree. This returns an object holding all of the HTML in the original document.

In [19]:
# parse the response into an HTML tree
# (name the parser explicitly; the output below has the shape that lxml produces)
soup = BeautifulSoup(src, 'lxml')
# take a look
print soup.prettify()[:800]

<html>
 <body>
  <p>
   ï»¿
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  </p>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Sort by Date: Sermon (media,ppt,transcription) for download - Haidian Christian Church
  </title>
  <meta content="," name="keywords"/>
  <meta content="collection of Sermon (from 2009.10) Video Audio &amp; lyric PPT (by English Choir) for Download - Haidian Christian Church English " name="description"/>
  <link href="/templets/style/subpage.css" rel="stylesheet" type="text/css"/>
  <link href="/templets/style/article.css" rel="stylesheet" type="text/css"/>
  <script language="javascript" src="/include/dedeajax2.js">
  </script>
  <script language="


## Step 3. Get 'href' Attributes from 'a' Tags

In [4]:
# Print the href (URL) attribute of the first 10 links
for link in soup('a')[:10]:
    print link['href']

http://english.hdchurch.org
http://www.hdchurch.org
/
http://english.hdchurch.org/
/html/aboutus/index.html
/html/news/index.html
/html/Fellowship/index.html
http://english.hdchurch.org/
/html/news/index.html
/video/HolyBibleNIVRedLetterEdition.pdf


## Step 4. Look at the Rows
We get a list of every table row on the page. Rows are identified by the `tr` tag.

In [26]:
# get all tr elements
rows = soup.find_all("tr")

#See what they look like
rows[0:3]

[<tr>
 <td align="center">Date &amp; Title</td>
 <td align="center">Preacher</td>
 <td align="center">Sermon</td>
 <td align="center">Sermon</td>
 <td align="center">Scripture</td>
 </tr>, <tr>
 <td align="center"><a href="/html/news/2013/0327/408.html" target="_blank">2013-03-17 Passion For God's Word</a></td>
 <td align="center">Gloria Li</td>
 <td align="center"><a href="/video/2013/20130317eng.wmv" target="_blank">video</a></td>
 <td align="center"><a href="/video/2013/20130317eng.mp3" target="_blank">audio</a><br/>
 <a href="/video/2013/20130317PassionForGodsWord.ppt" target="_blank">ppt</a></td>
 <td align="center">2 Timothy 3:14-17</td>
 </tr>, <tr>
 <td align="center"><a href="/html/news/2013/0324/406.html" target="_blank">2013-03-10 Come into unity with Christ</a></td>
 <td align="center">Pastor Wu</td>
 <td align="center"><a href="/video/2013/20130310eng.wmv" target="_blank">video</a></td>
 <td align="center"><a href="/video/2013/20130310eng.mp3" target="_blank">audio</a></td

In [7]:
# Look at a single row (index 80), which includes a doc link
row = rows[80]
# select the 'td' cells in that row
detailCells = row.select('td')

detailCells

[<td align="center"><a href="/html/news/2011/1004/273.html" target="_blank">2011-10-02 Breakfast with Jesus</a></td>,
 <td align="center">Stephen Wong</td>,
 <td align="center"><a href="/video/2011/20111002eng.wmv">video</a></td>,
 <td align="center"><a href="/video/2011/20111002eng.doc">doc</a></td>,
 <td align="center">John 21:1-14</td>]

In [12]:
# See how we can get the document link.

for cell in detailCells:
    try:
        href = cell.a['href']
    except (TypeError, KeyError):
        # the cell has no link (cell.a is None) or the link has no href
        continue
    if href.endswith(".doc"):  # only the transcript link ends in ".doc"
        # drop the 12-character "/video/2011/" prefix and the ".doc" suffix
        print href[12:-4] + ".txt"


20111002eng.txt
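
The fixed-width slice used above depends on the directory prefix always being 12 characters long. As a hedged alternative, here is a minimal sketch (Python 3, standard library only) that derives the name from the path itself; the helper name `doc_to_txt_name` is hypothetical and not part of the lecture code:

```python
import posixpath

def doc_to_txt_name(href):
    """Return the .txt filename for a .doc link, or None for other links."""
    # basename drops the directory part, whatever its length
    stem, ext = posixpath.splitext(posixpath.basename(href))
    if ext.lower() != '.doc':
        return None
    return stem + '.txt'

print(doc_to_txt_name('/video/2011/20111002eng.doc'))
print(doc_to_txt_name('/video/2011/20111002eng.wmv'))
```

The first call gives `20111002eng.txt`, matching the output above; the second returns `None` because the link is not a `.doc` file.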


In [7]:
# Keep only the text in each of those cells
# (the output shown below comes from an earlier run with row = rows[2])
rowData = [cell.text for cell in detailCells]
rowData

[u'2013-03-10 Come into unity with Christ',
 u'Pastor Wu',
 u'video',
 u'audio',
 u'John 6:52-58']
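
To try the row and cell extraction without hitting the website, the same calls can be run on a small inline snippet. The sketch below is written for Python 3 and assumes `bs4` is installed; the HTML is a trimmed copy of two rows shown earlier, and the names `snippet` and `table_text` are illustrative:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>Date &amp; Title</td><td>Preacher</td><td>Sermon</td><td>Sermon</td><td>Scripture</td></tr>
<tr>
<td><a href="/html/news/2011/1004/273.html">2011-10-02 Breakfast with Jesus</a></td>
<td>Stephen Wong</td>
<td><a href="/video/2011/20111002eng.wmv">video</a></td>
<td><a href="/video/2011/20111002eng.doc">doc</a></td>
<td>John 21:1-14</td>
</tr>
</table>
"""

# html.parser ships with Python, so no extra parser is needed here
snippet = BeautifulSoup(html, 'html.parser')

# one list of cell texts per row
table_text = [[cell.get_text(strip=True) for cell in tr.find_all('td')]
              for tr in snippet.find_all('tr')]

for cells in table_text:
    print(cells)
```

The first list is the header row, which is why the full scrape later has to filter out rows whose first cell starts with "Date & Title".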

## Step 5. Scrape the Table!
Now we can combine a `for` loop, Beautiful Soup, and basic Python to scrape the entire table.

In [13]:
# Create empty list to store our dictionaries
sermonslist = []

# select every 'tr' element in the page
rows = soup.select('tr')

# loop through all rows
for row in rows:
    
    #create an empty dictionary to store all table data
    sermons = {}

    # select the 'td' cells in this row
    detailCells = row.select('td')
    
    # for the 5-column tables (compare with ==, not the identity test "is"):

    if len(detailCells) == 5:
        
        # Keep only the text in each of those cells
        rowData = [cell.text for cell in detailCells]

        # Collect information; the first cell holds "YYYY-MM-DD Title"
        # (encode the unicode text directly; str() fails on non-ASCII characters)
        sermons['Date'] = rowData[0][0:10].encode("ascii", "ignore")
        sermons['Title'] = rowData[0][11:].encode("ascii", "ignore")
        sermons['Preacher'] = rowData[1].encode("ascii", "ignore")
        sermons['Scripture'] = rowData[4].encode("ascii", "ignore")
        
        # Adding filename
        # Note: "audio" and "doc" both show up in rowData[3], the fourth column

        # if the fourth column of the row contains a "doc" link:
        if "doc" in rowData[3]:

            # this assumes the doc link is the last link in the cell: after
            # splitting the cell's HTML on "/", the third-from-last piece
            # starts with the doc filename
            filename = str(detailCells[3]).split("/")[-3]
            # keep only the part before the extension
            filename = filename.split('.')[0]
            sermons['filename'] = filename
        
        #if the "doc" link doesn't exist, make NA
        else: 
            filename = "NA"
            sermons['filename'] = filename 
    
    # for the 4-column tables:
    elif len(detailCells) == 4:
            
        # Keep only the text in each of those cells
        rowData = [cell.text for cell in detailCells]

        # Collect information (here the fourth cell holds the scripture)
        sermons['Date'] = rowData[0][0:10].encode("ascii", "ignore")
        sermons['Title'] = rowData[0][11:].encode("ascii", "ignore")
        sermons['Preacher'] = rowData[1].encode("ascii", "ignore")
        sermons['Scripture'] = rowData[3].encode("ascii", "ignore")
        
        #there are no "doc" links in the 4-column tables
        filename = 'NA'
        sermons['filename'] = filename
    
    #append dictionaries to the empty list at the beginning
    sermonslist.append(sermons)
    
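# Drop the header rows, whose 'Date' value is 'Date & Tit' (the first 10
# characters of 'Date & Title'), along with any rows that matched neither layout.
# Look the value up by key: the order of dict.values() is not guaranteed.
sermonslist = [s for s in sermonslist if s and s['Date'] != 'Date & Tit']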
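
The loop above splits the first cell by position (`[0:10]` and `[11:]`) and filters header rows afterwards by their truncated text. A hedged alternative, sketched here for Python 3 with the standard library only, splits on the first space and validates the date, so header rows are recognized directly. The helper name `split_date_title` is hypothetical:

```python
from datetime import datetime

def split_date_title(cell_text):
    """Split 'YYYY-MM-DD Title' into (date, title); return None if no date leads."""
    date_part, _, title = cell_text.strip().partition(' ')
    try:
        # raises ValueError when the first word is not a date
        datetime.strptime(date_part, '%Y-%m-%d')
    except ValueError:
        return None  # e.g. the 'Date & Title' header row
    return date_part, title

print(split_date_title('2011-10-02 Breakfast with Jesus'))
print(split_date_title('Date & Title'))
```

A row for which the helper returns `None` can simply be skipped inside the loop, which removes the need for the post-hoc filter.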


In [30]:
#Check to see it worked!
sermonslist[29:34]

[{'Date': '2012-09-02',
  'Preacher': 'Jessica Wang',
  'Scripture': 'Prov. 4:23',
  'Title': 'Decide in my Heart',
  'filename': 'NA'},
 {'Date': '2012-08-26',
  'Preacher': 'Pastor Wu',
  'Scripture': 'John 2:18-25',
  'Title': 'Destroy this temple, and I will raise it again in three days',
  'filename': '20120826eng'},
 {'Date': '2012-08-19',
  'Preacher': 'Gloria Li',
  'Scripture': '2 Corinthians 5:17, Romans 12:1-2',
  'Title': 'Becoming the Person God Wants You to be',
  'filename': '20120819eng'},
 {'Date': '2012-08-12',
  'Preacher': 'Pastor Wu',
  'Scripture': 'John 2:1-11',
  'Title': 'Do Whatever He Tells You',
  'filename': 'NA'},
 {'Date': '2012-08-05',
  'Preacher': 'Jessica Wang',
  'Scripture': '1 Peter 4:8-10',
  'Title': 'Loving & Serving each other',
  'filename': '20120805eng'}]

In [34]:
#Checking to see the data is clean
sermonslist[0].values()[3] 

#How many sermons?
len(sermonslist)

180

## Step 6. Writing the CSV File

In [39]:
#Decide on keys for the csv
keys = ['Date', 'Preacher', 'Title', 'Scripture','filename']
keys

['Date', 'Preacher', 'Title', 'Scripture', 'filename']

In [40]:
# csv was imported at the top of the notebook, so csv.DictWriter is already available

# Write "sermonslist" into 'sermonsfull.csv'
# ('wb' is the mode the csv module expects on Python 2)
with open('sermonsfull.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(sermonslist)
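
The cell above targets Python 2. On Python 3 the `csv` module works with text rather than bytes, so the file would be opened in text mode with `newline=''` instead of `'wb'`. The following is a minimal sketch under that assumption: an in-memory `io.StringIO` buffer stands in for the file so the result can be inspected, and `sample` is one record copied from the output shown earlier.

```python
import csv
import io

keys = ['Date', 'Preacher', 'Title', 'Scripture', 'filename']
sample = [{'Date': '2012-08-12', 'Preacher': 'Pastor Wu',
           'Title': 'Do Whatever He Tells You',
           'Scripture': 'John 2:1-11', 'filename': 'NA'}]

# StringIO stands in for open('sermonsfull.csv', 'w', newline='')
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=keys)
writer.writeheader()       # first line: the column names, in the order of keys
writer.writerows(sample)   # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Passing `fieldnames=keys` fixes the column order, so the CSV layout does not depend on the order in which each dictionary was filled.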