# Introduction

This notebook presents a couple of examples on web scraping:
1. Scraping monthly car sales data
2. Loading site maps
3. Collection data about academic programs at Robinson

There are three steps:
1. requesting and loading a particular (HTML) document,
2. extracting elements from the document, and
3. finding links and managing traversal

In addition, we utilize the `robots.txt` file to avoid conflicts with the web-site.

## References
- Documentation `urllib2` https://docs.python.org/2/library/urllib2.html
- Documentation `request` http://docs.python-requests.org/en/master/
- Documentation Beautiful Soup `bs4` https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Documentation `StringIO` https://docs.python.org/2/library/stringio.html
- Documentation `fnmatch` https://docs.python.org/2/library/fnmatch.html
- Read HTML tables with Pandas https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
- Core tools for working with streams https://docs.python.org/2/library/io.html
- http://avi-urllib-vs-requests.blogspot.com
- Benchmarking urllib2 vs. urllib3 https://attilaolah.eu/2010/08/08/urllib-benchmark/

In [14]:
import os, sys
import urllib2 as ul
from bs4 import BeautifulSoup

## Requesting Documents

In [435]:
seed = 'http://calendar.gsu.edu/event/Structured_Unstructured_Data_Collection_bootcamp?utm_campaign=widget&utm_medium=widget&utm_source=Georgia+State+University+#.WmIQ4CMo524'

In [437]:
print ul.urlparse.urlsplit(seed)

SplitResult(scheme='http', netloc='calendar.gsu.edu', path='/event/Structured_Unstructured_Data_Collection_bootcamp', query='utm_campaign=widget&utm_medium=widget&utm_source=Georgia+State+University+', fragment='.WmIQ4CMo524')


In [438]:
ul.urlparse.parse_qs(ul.urlparse.urlsplit(seed).query)

{'utm_campaign': ['widget'],
 'utm_medium': ['widget'],
 'utm_source': ['Georgia State University ']}

In [439]:
req = ul.Request(url=seed)
f = ul.urlopen(req)
html_doc = f.read()
print html_doc[:1000]

  <!doctype html>
<!--[if lt IE 7 ]> <html class="no-js ie6 lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7 ]>    <html class="no-js ie7 lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8 ]>    <html class="no-js ie8 lt-ie9" lang="en"> <![endif]-->
<!--[if (gte IE 9)|!(IE)]><!--> <html class="no-js" lang="en"> <!--<![endif]-->

<!--
  _                     _ _     _
 | |                   | (_)   | |
 | |     ___   ___ __ _| |_ ___| |_
 | |    / _ \ / __/ _` | | / __| __|
 | |___| (_) | (_| (_| | | \__ \ |_
 |______\___/ \___\__,_|_|_|___/\__|

The event marketing calendar.

-->

<head>
  <title>Structured and Unstructured Data Collection, Storage, Cleaning and Preprocessing Bootcamp - Georgia State University </title>

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

    <meta name="description" content="Structured and Unstructured Data Collection, Storage, Cleaning and Preprocessing Bootcamp

 

When

The program will meet for 2 consecutive weekend

## Extracting Content
Let's look at a fairly simple web-site https://xkcd.com

We're **inspecting** this site and identifying some DOM elements or interest.

In [441]:
html_doc = ul.urlopen('https://xkcd.com').read()
print html_doc[:200]

<!DOCTYPE html>
<html>
<head>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
 


In [443]:
soup = BeautifulSoup(html_doc, 'html.parser')
print soup.prettify()[:200]

<!DOCTYPE html>
<html>
 <head>
  <script>
   (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o


In [444]:
soup.find('div', id='comic')

<div id="comic">\n<img alt="The End of the Rainbow" src="//imgs.xkcd.com/comics/the_end_of_the_rainbow.png" srcset="//imgs.xkcd.com/comics/the_end_of_the_rainbow_2x.png 2x" title="The retina is the exposed surface of the brain, so if you think about a pot of gold while looking at a rainbow, then there's one at BOTH ends."/>\n</div>

In [446]:
img = soup.find('div', id='comic').find('img')
img

<img alt="The End of the Rainbow" src="//imgs.xkcd.com/comics/the_end_of_the_rainbow.png" srcset="//imgs.xkcd.com/comics/the_end_of_the_rainbow_2x.png 2x" title="The retina is the exposed surface of the brain, so if you think about a pot of gold while looking at a rainbow, then there's one at BOTH ends."/>

In [451]:
print 'src\t', img.get('src')
print 'alt\t', img.get('alt')
print 'title\t', img.get('title')

src	//imgs.xkcd.com/comics/the_end_of_the_rainbow.png
alt	The End of the Rainbow
title	The retina is the exposed surface of the brain, so if you think about a pot of gold while looking at a rainbow, then there's one at BOTH ends.


We're going to have a bit more fun with this doing the exercise...

In [None]:
#soup.find_all('a')
all_as = soup.find_all('a')
for a in all_as:
    print a.text, a.attrs['href']

In [54]:
def save_html_doc(DATADIR, soup, url):
    q = ul.urlparse.parse_qs(url.split('?', 1)[1])
    parcel = q['KEY'][0].decode('utf-8').replace(' ', '_')
    county = q['county']
    datdir = os.path.join(DATADIR, county)
    if not os.path.isdir(datdir):
        os.makedirs(datdir)
    with open(datdir+".html", 'w') as io:
        io.write(soup.prettify(formatter='html'))
        

# Robots exclusion standard robots.txt

https://en.wikipedia.org/wiki/Robots_exclusion_standard

The robots exclusion standard, also known as the robots exclusion protocol or simply `robots.txt`, is a standard used by websites to **communicate with web crawlers and other web robots**. The standard specifies how to inform the web robot about **which areas of the website should not be processed or scanned**. Robots are often used by search engines to categorize websites.

Not all robots cooperate with the standard; email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard is different from but can be used in conjunction with, **Sitemaps**, a robot inclusion standard for websites.


In [95]:
def get_robots_disallow(url):
    import urllib2 as ul
    #import fnmatch as fm
    url_split = ul.urlparse.urlsplit(url)
    rbts = ul.urlopen(url_split.scheme+"://"+url_split.netloc+"/robots.txt").readlines()
    pt = map(lambda x: x.strip().split(' ')[1]+'*', filter(lambda x: x.startswith('Disallow:'), rbts))
    return pt

In [98]:
get_robots_disallow('http://news.gsu.edu')

['/wp-admin/*',
 '/wp-includes/*',
 '/calendar/action~posterboard/*',
 '/calendar/action~agenda/*',
 '/calendar/action~oneday/*',
 '/calendar/action~month/*',
 '/calendar/action~week/*',
 '/calendar/action~stream/*',
 '/*/action~**',
 '/*/month_offset~**',
 '/*/exact_date~**',
 '/*/cat_ids~**',
 '/*/tag_ids~**']

In [399]:
def is_allowed(pt, url):
    import urllib2 as ul
    import fnmatch
    url_split = ul.urlparse.urlsplit(url)
    for p in pt:
        if fnmatch.fnmatch(url_split.netloc, p):
            return False
    return True


sleep_time = 10 ## seconds
seed = 'http://price.pcauto.com.cn/top/sales/s1-t1.html'
robots_pat = get_robots_disallow(seed)

## initialize queue
queue = {}
queue[seed] = 3

In [94]:
fm.fnmatch('/abcd/action~sadfadsf', '/*/action~**')

True

#  Monthly Car Sales Crawler

Let's look at a site for monthly car sales 
http://price.pcauto.com.cn/top/sales/s1-t1.html

This site turned posed a few challenges (for me):
1. It's in a foreign language
2. It uses a different alphabet and encoding, and because of that
    1. we need to read the data in binary format, and 
    2. transform them from their original encoding to UTF-8


In [None]:
import time
import numpy as np
import urllib2 as ul
from bs4 import BeautifulSoup

In [198]:
## https://stackoverflow.com/questions/41305887/scraping-chinese-characters-python
def load_soup(url):
    import urllib2 as ul
    html_doc = ul.urlopen(url, 'b').read()
    html_doc_utf = html_doc.decode('GBK', 'ignore').encode('UTF-8', 'ignore')
    return BeautifulSoup(html_doc_utf, 'html.parser')

In [196]:
soup = load_soup('http://price.pcauto.com.cn/top/sales/s1-t1.html')

In [197]:
print soup.prettify(formatter='html')[:2000]

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <title>
   【2017年12月轿车销量排行榜】2017年12月轿车月销量排行榜_太平洋汽车网排行榜
  </title>
  <meta content="" name="keywords"/>
  <meta content="" name="description"/>
  <meta content="yangzhifang_gz none" name="Author">
   <meta content="no-transform " http-equiv="Cache-Control"/>
   <meta content="always" name="referrer">
    <meta content="format=html5; url=http://m.pcauto.com.cn/auto/top/sales-k78.html" name="mobile-agent"/>
    <!--炫版地址-->
    <link href="http://m.pcauto.com.cn/auto/top/sales-k78.html" media="only screen and(max-width: 640px)" rel="alternate"/>
    <!--炫版地址-->
    <meta content="pc" name="applicable-device">
     <link href="http://js.3conline.com/pcauto/2017/price/css/list_top.css" rel="stylesheet" type="text/css"/>
     <!--[if IE 6]><script>document.execCommand("BackgroundImageCache", false, true)</script><![endif]-->
     <!-- 设备跳转 S -->
     <!-- 设备跳转 ssi引入方式 -->
     <script>
      !function(a,b){var o,c=navigator.userAgent.

In [None]:
pd.read_html()

Let's find the table of interest in the HTML document

In [214]:
tbls = soup.find_all('table', attrs={'class':'table-sl'})
tbl = tbls[0]
print tbl.prettify(formatter='html')[:5000]

u'<table border="1" class="table-sl">\n <tr>\n  <th class="col1">\n   \u9500\u91cf\u6392\u540d\n  </th>\n  <th class="col2">\n   \u8f66\u7cfb\n  </th>\n  <th class="col3">\n   \u5b98\u65b9\u4ef7\n  </th>\n  <th class="col4">\n   \u4ece\u5c5e\u54c1\u724c\n  </th>\n  <th class="col5">\n   12\u6708\u9500\u91cf\n  </th>\n  <th class="col6">\n   1-12\u6708\n   <br/>\n   \u7d2f\u8ba1\u9500\u91cf\n  </th>\n  <th class="col7">\n   \u76f8\u5173\u94fe\u63a5\n  </th>\n </tr>\n <tr>\n  <td class="col1 index org">\n   1\n  </td>\n  <td class="col2 brand">\n   <a class="blue" href="/sg3225/" target="_blank">\n    \u6717\u9038\n   </a>\n  </td>\n  <td class="col3 price">\n   10.99-15.99\u4e07\n  </td>\n  <td class="col4 relBrand">\n   \u5927\u4f17\n  </td>\n  <td class="col5 salesNum">\n   45554\n  </td>\n  <td class="col6 salesSum">\n   461062\n  </td>\n  <td class="col7 links">\n   <a href="/sg3225/price.html" target="_blank">\n    \u62a5\u4ef7\n   </a>\n   <a href="http://price.pcauto.com.cn/comme

Let's use Pandas to convert the HTML table

In [224]:
#
import pandas as pd
from StringIO import StringIO

df = pd.read_html(StringIO(tbl.prettify(formatter='html')), header=0)[0]
df.head()

Unnamed: 0,销量排名,车系,官方价,从属品牌,12月销量,1-12月 累计销量,相关链接
0,1,朗逸,10.99-15.99万,大众,45554,461062,报价 点评 图片 参配
1,2,轩逸,9.98-15.90万,日产,44141,404726,报价 点评 图片 参配
2,3,英朗,10.99-15.09万,别克,40939,421296,报价 点评 图片 参配
3,4,宝骏310,3.68-6.08万,宝骏,35048,219727,报价 点评 图片 参配
4,5,福睿斯,9.68-12.23万,福特,33581,285029,报价 点评 图片 参配


In [229]:
%%sh
###rm -rf data/pcauto

In [230]:
## set destination data directory
datdir = os.path.join('data','pcauto')
if not os.path.isdir(datdir):
    os.makedirs(datdir)

for y in range(2014, 2018):
    for m in range(1, 13):
        url = 'http://price.pcauto.com.cn/top/sales/s1-t1-y%d-m%d.html'%(y, m)
        fn = 'pcauto-%04d-%02d'%(y, m)
        print url, '->', fn
        # load page
        soup = load_soup('http://price.pcauto.com.cn/top/sales/s1-t1.html')
        
        # find table
        tbls = soup.find_all('table', attrs={'class':'table-sl'})
        tbl = tbls[0]
        tbl_html = tbl.prettify(formatter='html')
        
        # read <table>, remember pandas.read_html returns a **list** of tables
        df = pd.read_html(StringIO(tbl.prettify(formatter='html')), header=0)[0]
        print "successfully read %d rows"%(df.shape[0])
        
        df['Year'] = y
        df['Month'] = m
        
        # now, save table
        df.to_csv(os.path.join(datdir, fn+'.csv'), index=None, header=None, encoding='UTF-8')
        
        # well, the original HTML table also include hyperlinks, maybe we want to keep them for later
        # let's also save the HTML text
        with open(os.path.join(datdir, fn+'.html') as io:
                  io.write(tbl_html)
                  
        # the sleep
        time.sleep(13)

http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m1.html -> pcauto-2014-01
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m2.html -> pcauto-2014-02
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m3.html -> pcauto-2014-03
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m4.html -> pcauto-2014-04
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m5.html -> pcauto-2014-05
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m6.html -> pcauto-2014-06
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m7.html -> pcauto-2014-07
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m8.html -> pcauto-2014-08
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m9.html -> pcauto-2014-09
successfully read 220 rows
http://price.pcauto.com.cn/top/sales/s1-t1-y2014-m10.html -> pcauto-2014-10
succes

In [232]:
%%sh
ls -l data/pcauto/pcauto-*.csv | tail -10

-rw-r--r-- 1 pmolnar pmolnar 18555 Jan 18 17:03 data/pcauto/pcauto-2017-03.csv
-rw-r--r-- 1 pmolnar pmolnar 18555 Jan 18 17:03 data/pcauto/pcauto-2017-04.csv
-rw-r--r-- 1 pmolnar pmolnar 18555 Jan 18 17:03 data/pcauto/pcauto-2017-05.csv
-rw-r--r-- 1 pmolnar pmolnar 18555 Jan 18 17:03 data/pcauto/pcauto-2017-06.csv
-rw-r--r-- 1 pmolnar pmolnar 18555 Jan 18 17:04 data/pcauto/pcauto-2017-07.csv
-rw-r--r-- 1 pmolnar pmolnar 18555 Jan 18 17:04 data/pcauto/pcauto-2017-08.csv
-rw-r--r-- 1 pmolnar pmolnar 18555 Jan 18 17:04 data/pcauto/pcauto-2017-09.csv
-rw-r--r-- 1 pmolnar pmolnar 18775 Jan 18 17:05 data/pcauto/pcauto-2017-10.csv
-rw-r--r-- 1 pmolnar pmolnar 18775 Jan 18 17:05 data/pcauto/pcauto-2017-11.csv
-rw-r--r-- 1 pmolnar pmolnar 18775 Jan 18 17:05 data/pcauto/pcauto-2017-12.csv


In [239]:
df2 = pd.read_csv(os.popen('cat data/pcauto/pcauto-*.csv'), header=None)
print df2.shape
## defining column header from (Google-)translated page, the 'YearToDateSales' column might be something else...
df2.columns = ['SalesRanking', 'Car', 'OfficialPrice', 'SubordinateBrand', 'MonthlySales', 'YearToDateSales', 'Links', 'Year', 'Month']
df2.head()

(10560, 9)


Unnamed: 0,SalesRanking,Car,OfficialPrice,SubordinateBrand,MonthlySales,YearToDateSales,Links,Year,Month
0,1,朗逸,10.99-15.99万,大众,45554,461062,报价 点评 图片 参配,2014,1
1,2,轩逸,9.98-15.90万,日产,44141,404726,报价 点评 图片 参配,2014,1
2,3,英朗,10.99-15.09万,别克,40939,421296,报价 点评 图片 参配,2014,1
3,4,宝骏310,3.68-6.08万,宝骏,35048,219727,报价 点评 图片 参配,2014,1
4,5,福睿斯,9.68-12.23万,福特,33581,285029,报价 点评 图片 参配,2014,1


# Site Maps

https://en.wikipedia.org/wiki/Sitemaps

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. **This allows search engines to crawl the site more intelligently**. Sitemaps are a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.

Sitemaps are particularly beneficial on websites where:

- Some areas of the website are not available through the browsable interface
- Webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines.
- The site is very large and there is a chance for the web crawlers to overlook some of the new or recently updated content
- When websites have a huge number of pages that are isolated or not well linked together, or
- When a website has **few external links**

Let's look at https://apartmentguide.com

First check what they think about robots...
https://www.apartmentguide.com/robots.txt



## Check robots.txt

In [313]:
%%sh
#pushd data/apartmentguide.com > /dev/null
#wget https://www.apartmentguide.com/robots.txt
#popd > /dev/null
head -20 data/apartmentguide.com/robots.txt
grep SITEMAP data/apartmentguide.com/robots.txt

#

# Sitemap Global Search Engine Ping (Google, Yahoo, MSN, Ask)

#

User-Agent: *

Disallow: /thmpg/

Disallow: /*/search/*

Disallow: /schools/*

Disallow: /apartments/print/*

Disallow: /apartments/get_geo_url

Disallow: /apartments/Alaska/Yakutat/
Disallow: /zip/99689*
SITEMAP: https://www.apartmentguide.com/sitemap.xml


## Look at sitemap.xml

In [279]:
sitemap = BeautifulSoup(ul.urlopen('https://www.apartmentguide.com/sitemap.xml').read(), 'lxml')
print sitemap.prettify()[:600]

<?xml version="1.0" encoding="UTF-8"?>
<html>
 <body>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
    <loc>
     https://www.apartmentguide.com/sitemap/AptZips/sitemap_0000.xml
    </loc>
    <lastmod>
     2018-01-18T10:31:38.000Z
    </lastmod>
   </sitemap>
   <sitemap>
    <loc>
     https://www.apartmentguide.com/sitemap/PropTypeNeighborhoods/sitemap_0000.xml
    </loc>
    <lastmod>
     2018-01-18T10:31:38.000Z
    </lastmod>
   </sitemap>
   <sitemap>
    <loc>
     https://www.apartmentguide.com/sitemap/Pdps/sitemap_0000.xml
    </loc>
    <lastmo


In [281]:
sitemap_children = sitemap.find_all('sitemap')
print len(sitemap_children)

9


In [304]:
### This function traverses recursivly through a tree of site mapes
### we assume it's a tree and there are no cycles, otherwise we need to
### ensure that we don't run into an endless loop

###sitemap_df = pd.DataFrame()

def collect_urls(url):
    global sitemap_df
    global sitemap_dir
    
    print "visiting '%s', current number of urls: %d"%(url, sitemap_df.shape[0])
    sitemap = BeautifulSoup(ul.urlopen(url).read(), 'lxml')
    
    ## visit any chidren
    for smap in sitemap.find_all('sitemap'):
        sm_url = smap.find('loc').text
        sm_lastmod = smap.find('lastmod').text
        collect_urls(sm_url)
        
    ## collect urls
    url_df = pd.DataFrame()
    for el in sitemap.find_all('url'):
        data = {'URL': [el.find('loc').text],
                'LastModified': [el.find('lastmod').text],
                'ChangeFrequency': [el.find('changefreq').text]
               }
        df = pd.DataFrame(data=data)
        url_df = pd.concat([url_df, df])
    f = ul.urlparse.urlsplit(url).path.split('.')[0].decode('UTF-8').replace('/', '_')
    url_df.to_csv(os.path.join(sitemap_dir, f+".csv"), index=None, header=None)
    sitemap_df = pd.concat([sitemap_df, url_df])
    time.sleep(13)

In [305]:
# Run crawler
sitemap_df = pd.DataFrame()
sitemap_dir = os.path.join('data', 'apartmentguide.com')
if not os.path.isdir(sitemap_dir):
    os.makedirs(sitemap_dir)
    
sitemap_fn = os.path.join(sitemap_dir, 'sitemap.csv')

collect_urls('https://www.apartmentguide.com/sitemap.xml')


visiting 'https://www.apartmentguide.com/sitemap.xml', current number of urls: 0
visiting 'https://www.apartmentguide.com/sitemap/AptZips/sitemap_0000.xml', current number of urls: 0
visiting 'https://www.apartmentguide.com/sitemap/PropTypeNeighborhoods/sitemap_0000.xml', current number of urls: 6746
visiting 'https://www.apartmentguide.com/sitemap/Pdps/sitemap_0000.xml', current number of urls: 14555
visiting 'https://www.apartmentguide.com/sitemap/AptNeighborhoods/sitemap_0000.xml', current number of urls: 44757
visiting 'https://www.apartmentguide.com/sitemap/AptMilitaryCollege/sitemap_0000.xml', current number of urls: 48814
visiting 'https://www.apartmentguide.com/sitemap/AptRefinements/sitemap_0000.xml', current number of urls: 53282
visiting 'https://www.apartmentguide.com/sitemap/Other/sitemap_0000.xml', current number of urls: 65184
visiting 'https://www.apartmentguide.com/sitemap/AptCities/sitemap_0000.xml', current number of urls: 65198
visiting 'https://www.apartmentguide.c

In [306]:
sitemap_df.to_csv(os.path.join(sitemap_dir, 'sitemap.csv'), index=None)

In [308]:
print sitemap_df.shape
sitemap_df.tail()

(76459, 3)


Unnamed: 0,ChangeFrequency,LastModified,URL
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/batavia-il/houses/
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/league-city-tx/...
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/watford-city-nd...
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/chickamauga-ga/...
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/buford-ga/condos/


In [295]:
ul.urlparse.urlsplit('https://www.apartmentguide.com/sitemap/PropTypesCities/sitemap_0000.xml')

SplitResult(scheme='https', netloc='www.apartmentguide.com', path='/sitemap/PropTypesCities/sitemap_0000.xml', query='', fragment='')

In [265]:
sitemap = BeautifulSoup(ul.urlopen('https://www.apartmentguide.com/sitemap/PropTypesCities/sitemap_0000.xml').read(), 'lxml')

In [276]:
sitemap_df = pd.DataFrame()

for el in sitemap.find_all('url'):
    data = {'URL': [el.find('loc').text],
            'LastModified': [el.find('lastmod').text],
            'ChangeFrequency': [el.find('changefreq').text]
    }
    df = pd.DataFrame(data=data)
    sitemap_df = pd.concat([sitemap_df, df])
print sitemap_df.shape
sitemap_df.head()

(7741, 3)


Unnamed: 0,ChangeFrequency,LastModified,URL
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/saint-louis-mo/...
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/shelbyville-ky/...
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/lake-wylie-sc/h...
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/sterrett-al/hou...
0,monthly,2018-01-18T10:30:02.019Z,https://www.apartmentguide.com/niagara-falls-n...


# Robinson Academic Programs Crawler

Let's visit the website of our college and collect information about the academic programs.


In [329]:
seed = 'https://robinson.gsu.edu/academic-programs/'

Let's create another load function ... out of laziness, we hard-coded the encoding. We can verify by looking at the HTML source that this site uses UTF-8 encoding.

In [330]:
def load_robinson_soup(url):
    import urllib2 as ul
    html_doc = ul.urlopen(url, 'b').read()
    return BeautifulSoup(html_doc, 'html.parser')

In [None]:
soup = load_robinson_soup(seed)

In [331]:
for a in filter(lambda el: el.text.lower().find('more')>=0 , soup.findAll('a', href=True)):
    print a.text.strip(), a['href']
    

Learn More /undergraduate/
Learn More /masters-programs/
Learn More /mba-programs/flex-mba/
Learn More /mba-programs/professional-mba/
Learn More /mba-programs/emba/
Learn More /doctoral-programs/phd/
Learn More /doctoral-programs/executive-doctorate-in-business/
Learn More /immersive-experiential-learning/honors-program/
Learn More /immersive-experiential-learning/pace/
Learn More /cac/panthers-in-the-valley/
Learn More /immersive-experiential-learning/panthers-on-wall-street/
Learn More /immersive-experiential-learning/womenlead/
Learn More http://execed.robinson.gsu.edu/
Learn More /certificate-programs/


In [332]:
soup2 = load_robinson_soup('https://robinson.gsu.edu/mba-programs/professional-mba/')

In [335]:
soup2.find_all('h1', class_='program-title')

[<h1 class="program-title" style="line-height: 1.2em">Professional MBA</h1>]

In [368]:
f = soup2.find_all('h2')[10]
';'.join([ p.text.strip() for p in f.parent.find_all('p')])

u'Cohort, part-time'

In [416]:
def extract_info(soup):
    data = {}
    fields = [u'Title', u'Duration', u'Format', u'Tuition', u'Location']
    for f in fields:
        data[f] = [ np.nan ]

    titles = soup.find_all('h1', class_='program-title')
    if len(titles)>0:
        data[u'Title'] = [ titles[0].text ]
    fields = ['Title', 'Duration', 'Format', 'Tuition', 'Location']
    for s in soup.find_all('h2'):
        if s.text.strip() in fields:
            lst = s.parent.find_all('p') + s.parent.find_all('li')
            if len(lst)>0:
                data[s.text.strip()] = [ ';'.join([ p.text.strip() for p in lst]) ]
    return data

In [380]:
def only_rbc(url):
    return (url.find('://robinson.gsu.edu')>0) | (url[0]=='/')



[<a href="https://robinson.gsu.edu/cac/" itemprop="url">Career Advancement Center</a>,
 <a href="https://robinson.gsu.edu/admissions/scholarships/" itemprop="url">Scholarships</a>,
 <a href="https://robinson.gsu.edu/student-clubs/" itemprop="url">Student Organizations and Clubs</a>,
 <a href="https://robinson.gsu.edu/laptop-recommendation/" itemprop="url">What kind of laptop do I need?</a>,
 <a href="https://robinson.gsu.edu/undergraduate-student-resources/" itemprop="url">Undergraduate Student Resources</a>]

In [342]:
soup2.select('#genesis-content > article > div > div')[0].find_all('h2')

[<h2 class="program-tagline">Rub Elbows with Talented Leaders</h2>]

##  Maintaining a Queue
We're going to follow any link that indicates more information. Since there is **no guaranty** that page references will not create a cycle, we need to keep track of all visited pages.

In [394]:
sleep_time = 13

In [426]:
seed = 'https://robinson.gsu.edu/academic-programs/'
queue = {}
print seed
soup = load_robinson_soup(seed)
for a in filter(lambda a: only_rbc(a['href']), soup.findAll('a', href=True)):
    if a['href'][0]=='/':
        pc = ul.urlparse.urlsplit(seed)
        ##print pc.scheme+'://'+pc.hostname+a['href']  
        queue[pc.scheme+'://'+pc.hostname+a['href']] = 1
    else:
        queue[a['href']] = 1
queue[seed] = 0
print len(queue.keys())

https://robinson.gsu.edu/academic-programs/
131


In [421]:
def add_links_to_queue(soup, url):
    global queue
    for a in filter(lambda a: only_rbc(a['href']), soup.findAll('a', href=True)):
        if a['href'][0]=='/':
            pc = ul.urlparse.urlsplit(url)
            ##print pc.scheme+'://'+pc.hostname+a['href']  
            qk = pc.scheme+'://'+pc.hostname+a['href']
        else:
            qk = a['href']
        if not qk in queue.keys():
            queue[qk] = 1
        

In [432]:
programs_df = pd.DataFrame()

while np.array(map(lambda x: x>0, queue.values())).sum()>0:
    urls = np.array(queue.keys())[np.array(map(lambda x: x>0, queue.values()))]
    url = urls[0]  ## take the first one
    
    print "Remaining %d\tTrying %s"%(len(urls), url)
    if is_allowed(robots_pat, url):
        try:
            soup = load_robinson_soup(url)
            add_links_to_queue(soup, url)
            data = extract_info(soup)
            if not np.nan in data['Title']:
                print data
                programs_df = pd.concat([programs_df, pd.DataFrame(data=data)])
            queue[url] = 0
        except Exception as e:
            print e
        finally:
            queue[url] -= 1 ## reduce trial count
        time.sleep(sleep_time)
    else:
        queue[url] = -999
   

Remaining 131	Trying https://robinson.gsu.edu/faculty/
Remaining 132	Trying https://robinson.gsu.edu/academic-programs/find-your-program/
Remaining 132	Trying https://robinson.gsu.edu/dual-degree-programs/
Remaining 138	Trying https://robinson.gsu.edu/alumni/contact-us/
Remaining 137	Trying https://robinson.gsu.edu/admissions/
Remaining 139	Trying https://robinson.gsu.edu/about/board-of-advisors/
Remaining 138	Trying https://robinson.gsu.edu/academic-departments/marketing/faculty/cbim/
Remaining 143	Trying http://robinson.gsu.edu/
Remaining 268	Trying http://robinson.gsu.edu/undergraduate/
{u'Duration': [nan], u'Format': [nan], u'Location': [nan], u'Tuition': [nan], u'Title': [u'Undergraduate Programs']}
Remaining 284	Trying http://robinson.gsu.edu/graduate-student-resources/
Remaining 284	Trying http://robinson.gsu.edu/undergraduate/areas-of-study/business-economics/
Remaining 284	Trying http://robinson.gsu.edu/essential_grid/sensi-charlery-b-b-18/
Remaining 283	Trying http://robinson

KeyboardInterrupt: 

In [433]:
programs_df

Unnamed: 0,Duration,Format,Location,Title,Tuition
0,,,,Undergraduate Programs,
0,One year,Cohort,Buckhead Center,Master of Science in Managerial Sciences,"Georgia residents: $37,500;Non-Ga. residents: ..."
0,,,,Career Advancement Center,
0,One year,Cohort,Buckhead Center,Master of Science in Marketing,"Georgia residents: $37,500;Non-Ga. residents: ..."
0,"12 months,\nthree semesters",Cohort,Buckhead Center,Regynald G. Washington Master of Global Hospit...,"Georgia residents: $37,500;Non-Ga. residents: ..."
0,Typically 5 years,"Full-time, in-residence",Downtown or Buckhead\nCheck with your coordina...,Ph.D. Programs,Students typically receive tuition waivers and...
0,One year,Cohort,Buckhead Center,Master of Science in Managerial Sciences,"Georgia residents: $37,500;Non-Ga. residents: ..."
0,Two years (junior and senior year),Cohort,Downtown Campus,Honors Program,
0,Flexible (2.5-5 years),Part-time or full-time,"Buckhead, downtown",Flex MBA,"Georgia residents: $37,878\nNon-Ga. residents:..."
0,Flexible,Full Time or Part Time,Downtown Campus and Buckhead Campus,Master of Actuarial Science,"Georgia residents: $21,524;Non-Ga. residents: ..."


# Exercise

Let's get back to *xkcd* https://xkcd.com

Each page has a simple structure with one comic and a bunch of links and navigation buttons.

## Task
- collect about 100 **different** comics from the xkcd site
- extract the following elements and create a data table with those fields:
    1. permanent URL of page
    2. permanent URL of image (such as https://imgs.xkcd.com/comics/the_end_of_the_rainbow.png)
    3. *alt* text of image, which is a short title
    4. *title* text of image, which actually is some text from the comic.
- Bonus: Create an inverted index that keeps a list of document IDs for each term (of the title-text) in those documents.

https://en.wikipedia.org/wiki/Inverted_index