# Scraping

Data is essential for analysis; however, some of it lives outside our enterprise.<br>
To test whether an analysis is useful, we may first have to extract data from external sources.
This notebook demonstrates two types of sources: web pages and RSS feeds.


In [1]:
# Requests is a Python library for making HTTP requests to websites
import requests
# lxml parses HTML into a tree of elements
from lxml import html

In [2]:
url = "https://www.streetdirectory.com/businessfinder/company/407/Catering/"

In [3]:
response = requests.get(url)

In [4]:
content = response.content
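
The cells above assume the request succeeds. As a minimal sketch, the same fetch can be wrapped with a timeout and a status check, so that a failed request stops the notebook early instead of handing an error page to the parser. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw body of `url`, raising on network or HTTP errors."""
    # The timeout stops the call from hanging on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Usage (performs a real network request, so it is left commented out):
# content = fetch_html("https://www.streetdirectory.com/businessfinder/company/407/Catering/")
```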

In [5]:
domTree = html.fromstring(content)

In [6]:
# Note to trainer: show the class how to find an element's selector in the browser
domTree.cssselect(".listing_company_name")[0].text

'Lph Catering'

In [None]:
# Let's drop the [0] index so we collect every matching element

In [7]:
allRecordTitles = domTree.cssselect(".listing_company_name")

In [8]:
for aRecordTitle in allRecordTitles:
    print(aRecordTitle.text)

Lph Catering
Tim's Fine Catering Services
Indochili Restaurant
Seng Lee Food & Catering
The Bazaar - Vibrant Flavors from India
Sin Bee Hwa Catering Services
Fortune Food (S) Pte Ltd
The Banana Leaf Apolo Pte Ltd
Nature Vegetarian Catering Pte Ltd
Shahi Maharani Restaurant


Combining a CSS or XPath selector with the browser's developer tools gives access to most web content, which can then be collected for downstream analytics.<br>
Some websites rely heavily on JavaScript and only load their content in a real browser. Accessing that content may require a more sophisticated process.
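
To illustrate the XPath alternative mentioned above, here is a small sketch that runs against an inline HTML snippet rather than the live site. The markup and company names are made up for illustration; only the class name `listing_company_name` comes from this notebook.

```python
from lxml import html

# Hypothetical markup, standing in for the downloaded page
sample = """
<html><body>
  <a class="listing_company_name">Example Catering A</a>
  <a class="listing_company_name featured">Example Catering B</a>
  <a class="other_link">Not a company</a>
</body></html>
"""

tree = html.fromstring(sample)

# XPath counterpart of the CSS selector ".listing_company_name":
# match any element whose space-separated class list contains the token
xpath_query = (
    '//*[contains(concat(" ", normalize-space(@class), " "), '
    '" listing_company_name ")]'
)
names = [element.text for element in tree.xpath(xpath_query)]
print(names)
```

The `concat`/`normalize-space` idiom is used so that elements carrying several classes still match, which a plain `@class="listing_company_name"` comparison would miss.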

Additional References:
https://medium.freecodecamp.org/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251


RSS feeds are another, more structured source of data, and they are much easier to deal with.


In [44]:
# feedparser is a Python library for parsing RSS and Atom feeds
import feedparser


In [45]:
rssURL = "http://www.patentlyapple.com/patently-apple/atom.xml"

In [47]:
response = feedparser.parse(rssURL)
response

{'bozo': 0,
 'encoding': 'utf-8',
 'entries': [{'author': 'Jack Purcher',
   'author_detail': {'name': 'Jack Purcher'},
   'authors': [{'name': 'Jack Purcher'}],
   'guidislink': False,
   'id': 'tag:typepad.com,2003:post-6a0120a5580826970c022ad390d5f7200d',
   'link': 'http://www.patentlyapple.com/patently-apple/2018/09/qualcomm-to-face-new-antitrust-investigation-into-german-iphones-using-intel-chips-by-the-european-commission.html',
   'links': [{'href': 'http://www.patentlyapple.com/patently-apple/2018/09/qualcomm-to-face-new-antitrust-investigation-into-german-iphones-using-intel-chips-by-the-european-commission.html',
     'rel': 'alternate',
     'type': 'text/html'},
    {'count': '0',
     'href': 'http://www.patentlyapple.com/patently-apple/2018/09/qualcomm-to-face-new-antitrust-investigation-into-german-iphones-using-intel-chips-by-the-european-commission.html',
     'rel': 'replies',
     'thr:count': '0',
     'type': 'text/html'}],
   'published': '2018-09-12T11:00:46-06:

In [48]:
response["entries"]

[{'author': 'Jack Purcher',
  'author_detail': {'name': 'Jack Purcher'},
  'authors': [{'name': 'Jack Purcher'}],
  'guidislink': False,
  'id': 'tag:typepad.com,2003:post-6a0120a5580826970c022ad390d5f7200d',
  'link': 'http://www.patentlyapple.com/patently-apple/2018/09/qualcomm-to-face-new-antitrust-investigation-into-german-iphones-using-intel-chips-by-the-european-commission.html',
  'links': [{'href': 'http://www.patentlyapple.com/patently-apple/2018/09/qualcomm-to-face-new-antitrust-investigation-into-german-iphones-using-intel-chips-by-the-european-commission.html',
    'rel': 'alternate',
    'type': 'text/html'},
   {'count': '0',
    'href': 'http://www.patentlyapple.com/patently-apple/2018/09/qualcomm-to-face-new-antitrust-investigation-into-german-iphones-using-intel-chips-by-the-european-commission.html',
    'rel': 'replies',
    'thr:count': '0',
    'type': 'text/html'}],
  'published': '2018-09-12T11:00:46-06:00',
  'published_parsed': time.struct_time(tm_year=2018, tm

You could process the entries one by one, or select the fields you need from each entry and write them all into a database.<br>
Alternatively, you could use pandas to help with the data wrangling (recall the previous notebook!).
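
As a rough sketch of the database option mentioned above, each entry can be reduced to a few fields and written to a table. The example uses Python's built-in `sqlite3` module with an in-memory database; the table name `entries`, the chosen columns, and the placeholder records (standing in for `response["entries"]`) are assumptions made for illustration.

```python
import sqlite3

# Placeholder records shaped like feed entries (ids, titles and links are invented)
sample_entries = [
    {"id": "post-1", "title": "First post", "link": "http://example.com/1",
     "published": "2018-09-12T11:00:46-06:00"},
    {"id": "post-2", "title": "Second post", "link": "http://example.com/2",
     "published": "2018-09-12T10:39:31-06:00"},
]

connection = sqlite3.connect(":memory:")  # pass a file path to persist the data
connection.execute(
    "CREATE TABLE entries (id TEXT PRIMARY KEY, title TEXT, link TEXT, published TEXT)"
)

# Select the wanted fields out of each entry, one by one
rows = [(e["id"], e["title"], e["link"], e["published"]) for e in sample_entries]
connection.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", rows)
connection.commit()

# Read the titles back, oldest first
stored_titles = [row[0] for row in connection.execute(
    "SELECT title FROM entries ORDER BY published")]
print(stored_titles)
```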

In [49]:
import pandas

In [53]:
pandas.DataFrame(response["entries"])

,author,author_detail,authors,guidislink,id,link,links,published,published_parsed,summary,summary_detail,tags,title,title_detail,updated,updated_parsed
0,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-12T11:00:46-06:00,"(2018, 9, 12, 17, 0, 46, 2, 255, 0)","MLex, the leading source of insight on regulat...",{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5A. Apple News', 'l...",Qualcomm to Face new Antitrust Investigation i...,{'base': 'http://www.patentlyapple.com/patentl...,2018-09-12T11:00:46-06:00,"(2018, 9, 12, 17, 0, 46, 2, 255, 0)"
1,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-12T10:39:31-06:00,"(2018, 9, 12, 16, 39, 31, 2, 255, 0)",Cumulative shipments for iPhone X reached 63 M...,{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5A. Apple News', 'l...",Apple's iPhone X is on Track to be the Most Su...,{'base': 'http://www.patentlyapple.com/patentl...,2018-09-12T10:39:32-06:00,"(2018, 9, 12, 16, 39, 32, 2, 255, 0)"
2,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-12T10:10:41-06:00,"(2018, 9, 12, 16, 10, 41, 2, 255, 0)","When people think of Apple, they think of hard...",{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5A. Apple News', 'l...","Beyond the iPhone, Apple's Service are gaining...",{'base': 'http://www.patentlyapple.com/patentl...,2018-09-12T10:10:41-06:00,"(2018, 9, 12, 16, 10, 41, 2, 255, 0)"
3,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-12T07:24:15-06:00,"(2018, 9, 12, 13, 24, 15, 2, 255, 0)",It's being report today that data suggests tha...,{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5A. Apple News', 'l...","Regardless of the Cost, Nearly Half of iPhone ...",{'base': 'http://www.patentlyapple.com/patentl...,2018-09-12T07:24:15-06:00,"(2018, 9, 12, 13, 24, 15, 2, 255, 0)"
4,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-12T06:11:15-06:00,"(2018, 9, 12, 12, 11, 15, 2, 255, 0)",Earlier today Loup Ventures released a new sma...,{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5A. Apple News', 'l...",While Smart Speaker Adoption is slowly Acceler...,{'base': 'http://www.patentlyapple.com/patentl...,2018-09-12T06:10:50-06:00,"(2018, 9, 12, 12, 10, 50, 2, 255, 0)"
5,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-11T06:53:11-06:00,"(2018, 9, 11, 12, 53, 11, 1, 254, 0)",Yesterday Patently Apple posted a report title...,{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5C. Non-Apple News ...",Samsung SDI has developed the World's first Op...,{'base': 'http://www.patentlyapple.com/patentl...,2018-09-11T06:53:11-06:00,"(2018, 9, 11, 12, 53, 11, 1, 254, 0)"
6,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-11T06:13:53-06:00,"(2018, 9, 11, 12, 13, 53, 1, 254, 0)",The U.S. Patent and Trademark Office officiall...,{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '2. Granted Patents'...",Apple won 36 Patents today covering an iPad wi...,{'base': 'http://www.patentlyapple.com/patentl...,2018-09-11T06:09:11-06:00,"(2018, 9, 11, 12, 9, 11, 1, 254, 0)"
7,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-11T03:57:28-06:00,"(2018, 9, 11, 9, 57, 28, 1, 254, 0)","It was reported last night that ""A federal app...",{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '3. Patently Legal '...",A Federal Appeals Court invalidates Key Claims...,{'base': 'http://www.patentlyapple.com/patentl...,2018-09-11T03:57:14-06:00,"(2018, 9, 11, 9, 57, 14, 1, 254, 0)"
8,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-11T03:12:18-06:00,"(2018, 9, 11, 9, 12, 18, 1, 254, 0)","Apple Suppliers Foxconn, Quanta and Wistron re...",{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5A. Apple News', 'l...","Apple Suppliers Foxconn, Quanta and Wistron re...",{'base': 'http://www.patentlyapple.com/patentl...,2018-09-11T03:12:18-06:00,"(2018, 9, 11, 9, 12, 18, 1, 254, 0)"
9,Jack Purcher,{'name': 'Jack Purcher'},[{'name': 'Jack Purcher'}],False,"tag:typepad.com,2003:post-6a0120a5580826970c02...",http://www.patentlyapple.com/patently-apple/20...,"[{'rel': 'alternate', 'href': 'http://www.pate...",2018-09-11T02:40:04-06:00,"(2018, 9, 11, 8, 40, 4, 1, 254, 0)",It's being reported today that the Korean Mini...,{'base': 'http://www.patentlyapple.com/patentl...,"[{'scheme': None, 'term': '5A. Apple News', 'l...",Korean Ministry of Trade Considers Banning a S...,{'base': 'http://www.patentlyapple.com/patentl...,2018-09-11T02:40:04-06:00,"(2018, 9, 11, 8, 40, 4, 1, 254, 0)"
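
Several columns in the table above hold nested dictionaries or lists, which are awkward to analyse directly. One possible next step, sketched here on placeholder records rather than the live feed, is to keep only the flat columns of interest and parse the `published` strings into timestamps. The column selection and the name `tidy` are assumptions for illustration.

```python
import pandas

# Placeholder records shaped like feedparser entries (titles and links are invented)
sample_entries = [
    {"title": "First post", "link": "http://example.com/1",
     "published": "2018-09-12T11:00:46-06:00",
     "author_detail": {"name": "Jack Purcher"}},
    {"title": "Second post", "link": "http://example.com/2",
     "published": "2018-09-12T10:39:31-06:00",
     "author_detail": {"name": "Jack Purcher"}},
]

frame = pandas.DataFrame(sample_entries)

# Keep the flat columns and leave out nested ones such as author_detail
tidy = frame[["title", "link", "published"]].copy()

# Convert ISO 8601 strings to timezone-aware timestamps, normalised to UTC
tidy["published"] = pandas.to_datetime(tidy["published"], utc=True)

print(tidy.sort_values("published"))
```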
