# Data hunting and gathering


SOFTWARE REQUIREMENTS
    
    + MongoDB #create a MongoDB Atlas cluster on the cloud
    + selenium #pip install selenium
    + pymongo #pip install pymongo
    + lxml #pip install lxml
    + tweepy #pip install tweepy
    
NOT REQUIRED BUT USED IN ONE EXAMPLE: 

    + MPV via Homebrew (OSX) or mplayer in Linux.
    
OTHER UTILITIES

    + If you are using the Firefox browser, its built-in Developer Tools (which replaced the discontinued Firebug add-on) are handy for inspecting pages.

CONTENTS

+ Introduction and warm-up project: A web crawler
     
+ Using the API

    + Retrieving Twitter data

+ Creating our own web API: Scraping

    + Understanding HTML and CSS
    + CSS selectors
    + XPath selectors
    + Scraping dynamic content with Selenium    
   

# 1. Introduction

Data is the basis of this course. We usually find it in well-structured formats, such as a spreadsheet resulting from our last experiment or the collection of company records in a classical relational database. With the advent of the internet, however, new information sources have to be taken into account, and these sources are mostly home to unstructured data. This lecture presents several methods for retrieving such data and storing it.

Let us first introduce the big picture guiding this lecture. Whenever we want to retrieve data from a web site, we should first ask whether the site provides a simple way to do so. Many large sites such as Google, Facebook, Twitter, etc., provide an **Application Programming Interface (API)** that can make data hunting easier. However, most web sites do not have such an interface, and even when one exists it may not provide the desired information. In those cases we have to use **scraping** techniques: dealing with the raw information as it is delivered to the web browser and coding our own data-finding methods.

<img style="border-radius:20px;" src="./files/big_picture.jpg">

Let us start by connecting to the net and checking how to retrieve a basic page. We will begin with the `urllib` module.

In [1]:
from urllib.request import urlopen
source = urlopen('http://google.com')  # Let us check what we got back
source

<http.client.HTTPResponse at 0x1108d19d0>

In [2]:
#Hurray, we got a response object wrapping a socket. It behaves like a file, so let us go read() the "file"
something = source.read()
something

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es"><head><meta content="Google.es permite acceder a la informaci\xf3n mundial en castellano, catal\xe1n, gallego, euskara e ingl\xe9s." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="fMprxNz5aGYgYSFAnN8Vmw==">(function(){window.google={kEI:\'IVRUYZyYBeuiggfiv6mADw\',kEXPI:\'0,1302536,56873,1710,4348,207,4804,2316,383,246,5,1354,5251,1122515,1197714,687,328866,51224,16114,28683,17573,4858,1362,284,9007,3020,17588,4020,978,13228,3847,4192,6430,14763,4281,2778,919,5081,889,704,1279,2212,241,290,148,1103,840,1983,213,4101,109,3405,606,2023,2297,14670,3227,2845,7,12354,5096,16320,908,2,941,2614,13142,3,576,6459,149,13975,4,1528,2304,1236,5803,4684,2014,18375,2658,4243,2458,654,32,5616,8012,2305,638,1494,13406,3

In [3]:
#What!!!!
#Let us read more
print (source.read())

b''


In [4]:
#Ooooppss nothing else: the response is a stream, and the first read() already consumed all of it.
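
The empty result above is ordinary file-like behaviour: each `read()` continues where the previous one stopped. The sketch below reproduces it without touching the network, using `io.BytesIO` as a stand-in for the response object (an assumption made only to keep the example self-contained; the HTML bytes are made up).

```python
import io

# Stand-in for the object returned by urlopen(): a binary file-like object
fake_response = io.BytesIO(b"<!doctype html><html><head><title>Demo</title></head></html>")

first = fake_response.read()    # consumes the whole stream
second = fake_response.read()   # nothing is left to read

print(second)                   # b''

# Keep the bytes in a variable and decode them once; reuse the string afterwards
page = first.decode("utf-8")
print(page[:15])                # <!doctype html>
```

The practical consequence is to call `read()` once, store the result, and work on the stored copy.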

OK, hands on!!! 
Here are some first warm-up exercises. Check the `str` API of Python!

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**WARM UP EXERCISES**
<ol>
<li>Does the word python appear in python.org? (hint: find python in source) </li>
<li>Does http://google.com contain an image? (hint: < img  TAG ) </li>
<li>What are the first ten characters of python.org?</li>
</ol>
</div>

In [5]:
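# Does the word python appear in python.org? (hint: find python in source)
# Write your code here 
from urllib.request import urlopen
source = urlopen('http://python.org').read()
# str.find() returns -1 when there is no match and 0 for a match at the very start,
# so compare with ">= 0" rather than "> 0"
result = "1: - " + str(source.decode('UTF8').lower().find("python") >= 0)
print (result)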
        

1: - True


In [6]:
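# Does http://google.com contain an image? (hint: < img  TAG )
#Write your code here 
from urllib.request import urlopen
source = urlopen('http://google.com').read()
# latin1 maps every byte to a character, so decoding cannot fail;
# find() returns -1 when "<img " is absent, hence the ">= 0" test
result = "2: - " + str(source.decode('latin1').lower().find("<img ") >= 0)
print (result)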


2: - True


In [7]:
#What are the first ten characters of python.org?
from urllib.request import urlopen

source = urlopen('http://python.org').read()

print ("3: - "+source.decode('UTF8')[0:10])

3: - <!doctype 


We are retrieving data from a URL! So we are done! 
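
The warm-up exercises only need a handful of `str` operations. A small sketch on a made-up HTML snippet (the `html` string is illustrative, not fetched from any site) puts them side by side, including the one subtlety of `find`: a match at position 0 is still a match.

```python
html = '<!doctype html><html><body><img src="logo.png"><p>Python rocks</p></body></html>'

# Membership test: the most direct way to ask "does it contain ...?"
print("python" in html.lower())        # True

# find() returns the index of the first match, or -1 if there is none
print(html.find("<!doctype"))          # 0  (found at the very start)
print(html.find("<video"))             # -1 (not found)

# Because 0 is a valid index, compare against -1, not against 0
print(html.find("<!doctype") >= 0)     # True
print(html.find("<!doctype") > 0)      # False, although the text is there

# Slicing gives the first ten characters
print(html[:10])                       # <!doctype 
```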

## 1.1 Crawling and Scraping

Scraping and **crawling** are two closely related techniques. While scraping is used for retrieving data from a web page, crawling is used to retrieve the web pages themselves. Both are found at the core of search engines. Scraping is used to get keywords and to analyze and extract useful information from web pages, so that the engine can return related results for a user query. Crawling, on the other hand, retrieves the actual pages and uses scraping to get the links in each of them. This makes it possible to build a graph of the connections among web sites, and this information can be used to order the results of a query.

In general, we might want to get data not just from a single page but from several related pages. In those cases crawling is the way to go. 
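
To make the interplay concrete, here is a minimal sketch of a crawl over a toy "web" kept in a dictionary. Everything in it is hypothetical (the page names, the `crawl` function, the use of incoming-link counts as a score); it only illustrates how following scraped links yields a graph that can later be used for ranking.

```python
from collections import Counter, deque

# Toy "web": each page maps to the links found on it (hypothetical URLs)
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],   # never reached when starting from a.html
}

def crawl(start, web):
    """Breadth-first crawl; returns the link graph of the reachable pages."""
    graph = {}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in graph:
            continue                      # already visited
        links = web.get(page, [])         # "scraping" step: extract the links
        graph[page] = links
        queue.extend(links)               # "crawling" step: schedule new pages
    return graph

graph = crawl("a.html", toy_web)

# Count incoming links: a crude popularity score that could order query results
in_links = Counter(target for links in graph.values() for target in links)
print(sorted(graph))            # ['a.html', 'b.html', 'c.html']
print(in_links.most_common(1))  # [('c.html', 2)]
```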

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**WARM-UP PROJECT:** Let us build a very simple spider. The basic functionality of a spider is to crawl and store all the data in web pages. In this simple project we will take care of a single site. 

<ol>
<li>A crawler must recognize the links to crawl. Take a minute and think how to retrieve the links of a web site.</li>
<li>Let us start the project by creating a Spider class. The constructor takes the following parameters: starting_url, crawl_domain, and max_iter. crawl_domain is the domain used to decide whether an absolute link is considered or not. max_iter is the maximum number of web pages to crawl.</li>
<li>The main method can be Spider.run(). Enumerate the big functionalities/building blocks of the crawler.</li>
</ol>
</div>
    
    

In [8]:
import urllib.request  # a bare "import urllib" does not load the submodules
import urllib.error
import time

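def getLinks(html, max_links=10):
    # Naive link extraction: look for the literal text 'a href' and take
    # whatever sits between the next pair of double quotes
    url = []
    cursor = 0
    nlinks = 0
    while cursor >= 0 and nlinks < max_links:
        start_link = html.find("a href", cursor)
        if start_link == -1:
            return url
        start_quote = html.find('"', start_link)
        end_quote = html.find('"', start_quote + 1)
        if start_quote == -1 or end_quote == -1:
            # Unterminated attribute: stop instead of slicing garbage
            return url
        url.append(html[start_quote + 1: end_quote])
        cursor = end_quote + 1
        nlinks = nlinks + 1
    return url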

class Spider:
    def __init__(self,starting_url,crawl_domain,max_iter):
        self.crawl_domain = crawl_domain
        self.max_iter = max_iter
        self.links_to_crawl=[]
        self.links_to_crawl.append(starting_url)
        self.links_visited=[]
        self.collection=[]
        
    def retrieveHtml(self):
        try:
            socket = urllib.request.urlopen(self.url)
            encoding = socket.headers.get_content_charset()
            if encoding is None:
                encoding = "utf-8"
            self.html = socket.read().decode(encoding)
            return 0
        except UnicodeDecodeError:
            print ("Bad Encoding")
            return -1
        except urllib.error.HTTPError:
            # The server answered with an error status (most probably 404 Not Found),
            # possibly due to malformed links built in retrieveAndValidateLinks
            print ("Broken Link")
            return -1
        except urllib.error.URLError:
            # The server could not be reached at all (unknown host, refused connection, ...)
            print ("Broken Link")
            return -1
    def storeHtml(self):
        doc = {}
        doc['url'] = self.url
        doc['date'] = time.strftime("%d/%m/%Y")
        doc['html'] = self.html
        self.collection.append(doc)          
   
    def retrieveAndValidateLinks(self):
        tmpList=[]
        items = getLinks(self.html)
        # Check the validity of a link
        for item in items:
            item = item.strip('"')
            if self.crawl_domain in item:
                # Absolute link inside the crawl domain
                tmpList.append(item)
            if ":" not in item:
                # Relative link: it has no scheme such as http://, https:// or mailto:
                tmpList.append(self.crawl_domain+item)
        # Check that the link has not been previously retrieved or is currently on the links_to_crawl list
        for item in tmpList:
            if item not in self.links_visited:
                if item not in self.links_to_crawl:
                    self.links_to_crawl.append(item)
                    print ('Adding: '+item)
                
     
        
    def run(self):
        while (len(self.links_to_crawl)>0 and len(self.collection)<self.max_iter):
            
            self.url = self.links_to_crawl.pop(0)
            print ("Visiting: "+ self.url)
            self.links_visited.append(self.url)
            if self.retrieveHtml()>=0:
                self.storeHtml()
                self.retrieveAndValidateLinks()
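
The string search in `getLinks` only recognizes a link when the literal text `a href` is present and the value is wrapped in double quotes, so tags such as `<a class="menu" href='/about'>` go unnoticed. As a hedged alternative, the sketch below uses the standard library's `html.parser`. The names `LinkParser` and `get_links` are hypothetical and not part of the notebook's Spider.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def get_links(html, max_links=10):
    parser = LinkParser()
    parser.feed(html)
    return parser.links[:max_links]

sample = """
<a class="menu" href='/about'>About</a>
<A HREF="http://example.com/news">News</A>
<img src="logo.png">
"""
print(get_links(sample))   # ['/about', 'http://example.com/news']
```

A real parser copes with attribute order, letter case, and both quoting styles, at the cost of a few more lines.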

Let us validate the crawler with the following code: 

In [9]:
spider = Spider('http://www.ub.edu','http://www.ub.edu/',20)
spider.run()

Visiting: http://www.ub.edu
Adding: http://www.ub.edu/#
Adding: http://www.ub.edu//web/portal/ca
Visiting: http://www.ub.edu/#
Visiting: http://www.ub.edu//web/portal/ca


Let us go for a more complex web site. Run the code on http://hunch.net (a machine learning blog by John Langford).

In [10]:
spider = Spider('http://hunch.net','http://hunch.net',10)
spider.run()

Visiting: http://hunch.net
Adding: http://hunch.net#content
Visiting: http://hunch.net#content


In [11]:
# And check the urls retrieved
[spider.collection[i]['url'] for i in range(len(spider.collection))]

['http://hunch.net', 'http://hunch.net#content']

It seems that the simple crawler more or less works as expected. There are still many functionalities to work on, such as validating domains and URLs. One important issue to consider is **persistence**, that is, how to store the retrieved data for further analysis. In this basic scraping tutorial we use MongoDB, a NoSQL database, for persistence purposes. 
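
As an illustration of what URL validation could look like, the sketch below relies on `urllib.parse` instead of string concatenation. This avoids artifacts such as the double slash in `http://www.ub.edu//web/portal/ca` and treats `#` fragments as the same page. The function `normalize_link` and its parameters are hypothetical, one possible design rather than the notebook's method.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_link(page_url, href, crawl_host):
    """Return an absolute, fragment-free URL inside crawl_host, or None."""
    absolute = urljoin(page_url, href)          # resolve relative links against the page
    absolute, _fragment = urldefrag(absolute)   # '#content' is the same page: drop it
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None                             # skips mailto:, javascript:, ...
    if parts.netloc != crawl_host:
        return None                             # outside the domain we want to crawl
    return absolute

page = "http://www.ub.edu/"
print(normalize_link(page, "/web/portal/ca", "www.ub.edu"))        # http://www.ub.edu/web/portal/ca
print(normalize_link(page, "#", "www.ub.edu"))                     # http://www.ub.edu/
print(normalize_link(page, "mailto:info@ub.edu", "www.ub.edu"))    # None
print(normalize_link(page, "http://hunch.net/?p=1", "www.ub.edu")) # None
```

Comparing host names exactly is a deliberate simplification; a crawler that should also follow subdomains would need a looser test.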

## 1.2 Finishing the warm up project with MongoDB storage

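We only have to change a handful of lines: the constructor opens a MongoDB connection, `storeHtml` calls `insert_one` instead of appending to a list, and `run` keeps its own counter of stored pages.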

In [5]:
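import urllib.request  # a bare "import urllib" does not load the submodules
import urllib.error
import time
import pymongo

def getLinks(html, max_links=10):
    # Naive link extraction: look for the literal text 'a href' and take
    # whatever sits between the next pair of double quotes
    url = []
    cursor = 0
    nlinks = 0
    while cursor >= 0 and nlinks < max_links:
        start_link = html.find("a href", cursor)
        if start_link == -1:
            return url
        start_quote = html.find('"', start_link)
        end_quote = html.find('"', start_quote + 1)
        if start_quote == -1 or end_quote == -1:
            # Unterminated attribute: stop instead of slicing garbage
            return url
        url.append(html[start_quote + 1: end_quote])
        cursor = end_quote + 1
        nlinks = nlinks + 1
    return url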

class Spider:
    def __init__(self,starting_url,crawl_domain,max_iter):
        self.crawl_domain = crawl_domain
        self.max_iter = max_iter
        self.links_to_crawl=[]
        self.links_to_crawl.append(starting_url)
        self.links_visited=[]
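        try:
            # credentials.txt holds three lines: user name, password and cluster host
            with open("credentials.txt", 'r', encoding='utf-8') as f:
                [name,password,url]=f.read().splitlines()
            self.conn=pymongo.MongoClient("mongodb+srv://{}:{}@{}".format(name,password,url))
            # MongoClient connects lazily, so ping the server to force a real round trip
            self.conn.admin.command('ping')
            print ("Connected successfully!!!")
        except pymongo.errors.ConnectionFailure as e:
            print ("Could not connect to MongoDB: %s" % e)
            raise  # without a connection the spider cannot store anything
        self.db = self.conn["Crawler"]
        # Name the collection after the host: drop the leading 'http://' (7 characters)
        self.collection = self.db[starting_url[7:]+'DB']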
        
    def retrieveHtml(self):
        try:
            socket = urllib.request.urlopen(self.url)
            encoding = socket.headers.get_content_charset()
            if encoding is None:
                encoding = "utf-8"
            self.html = socket.read().decode(encoding)
            return 0
        except UnicodeDecodeError:
            print ("Bad Encoding")
            return -1
        except urllib.error.HTTPError:
            # The server answered with an error status (most probably 404 Not Found),
            # possibly due to malformed links built in retrieveAndValidateLinks
            print ("Broken Link")
            return -1
        except urllib.error.URLError:
            # The server could not be reached at all (unknown host, refused connection, ...)
            print ("Broken Link")
            return -1
    def storeHtml(self):
        doc = {}
        doc['url'] = self.url
        doc['date'] = time.strftime("%d/%m/%Y")
        doc['html'] = self.html
        #Insert in the collection
        self.collection.insert_one(doc)        
   
    def retrieveAndValidateLinks(self):
        tmpList=[]
        items = getLinks(self.html)
        # Check the validity of a link
        for item in items:
            item = item.strip('"')
            if self.crawl_domain in item:
                # Absolute link inside the crawl domain
                tmpList.append(item)
            if ":" not in item:
                # Relative link: it has no scheme such as http://, https:// or mailto:
                tmpList.append(self.crawl_domain+item)
        # Check that the link has not been previously retrieved or is currently on the links_to_crawl list
        for item in tmpList:
            if item not in self.links_visited:
                if item not in self.links_to_crawl:
                    self.links_to_crawl.append(item)
                    print ('Adding: '+item)
                
     
        
    def run(self):
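        # Count the pages stored in this run; the MongoDB collection may already
        # hold documents from earlier runs, so its size cannot be used as the counter
        count_i = 0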
        while (len(self.links_to_crawl)>0 and count_i<self.max_iter):   
            self.url = self.links_to_crawl.pop(0)
            print ("Visiting: "+ self.url)
            self.links_visited.append(self.url)
            if self.retrieveHtml()>=0:
                self.storeHtml()
                self.retrieveAndValidateLinks()
                count_i = count_i+1
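
One detail worth guarding against in the constructor: MongoDB connection strings require the user name and password to be percent-encoded when they contain reserved characters such as `@`, `:` or `/`. The sketch below shows one way to do that with `urllib.parse.quote_plus`. The helper name `build_mongo_uri` and the credentials are made up for illustration.

```python
from urllib.parse import quote_plus

def build_mongo_uri(name, password, host):
    """Build a mongodb+srv URI with escaped credentials (hypothetical helper)."""
    return "mongodb+srv://{}:{}@{}".format(quote_plus(name), quote_plus(password), host)

# Made-up credentials, only to show the escaping
uri = build_mongo_uri("crawler", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)
# mongodb+srv://crawler:p%40ss%3Aword%2F1@cluster0.example.mongodb.net
```

The resulting string can be passed to `pymongo.MongoClient` in place of the plain `format` call.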


In [7]:
spider = Spider('http://www.ub.edu','http://www.ub.edu',20)
spider.run()
print("END")

Connected successfully!!!
Visiting: http://www.ub.edu
Adding: http://www.ub.edu#
Adding: http://www.ub.edu/web/portal/ca
Visiting: http://www.ub.edu#
Visiting: http://www.ub.edu/web/portal/ca
END


Check the collection:

https://cloud.mongodb.com/

In [8]:
spider.db.list_collection_names()

['www.ub.eduDB']

In [9]:
collection = spider.db['www.ub.eduDB']
collection.count_documents({})
#spider.db.drop_collection("www.ub.eduDB")

6

In [10]:
for doc in collection.find():
    print ("[{}] {}".format(doc['date'], doc['url']))

[29/09/2021] http://www.ub.edu
[29/09/2021] http://www.ub.edu#
[29/09/2021] http://www.ub.edu/web/portal/ca
[29/09/2021] http://www.ub.edu
[29/09/2021] http://www.ub.edu#
[29/09/2021] http://www.ub.edu/web/portal/ca
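
The listing shows every URL twice, which is what one would expect if the spider ran twice and `insert_one` added a fresh document each time (an inference from the output, not something stated above). A possible remedy is an upsert keyed on the URL, sketched below with pymongo's `update_one(..., upsert=True)`. `store_page` is a hypothetical helper, and `FakeCollection` is an in-memory stand-in, not part of pymongo, that mimics only this one call so the example runs without a database.

```python
import time

def store_page(collection, url, html):
    """Insert the page, or overwrite the stored copy if the URL already exists."""
    collection.update_one(
        {"url": url},                                  # match on the URL
        {"$set": {"date": time.strftime("%d/%m/%Y"),   # refresh the stored fields
                  "html": html}},
        upsert=True,                                   # insert when nothing matches
    )

class FakeCollection:
    """In-memory stand-in mimicking only the upsert behaviour used above."""
    def __init__(self):
        self.docs = {}

    def update_one(self, query, update, upsert=False):
        key = query["url"]
        if key in self.docs or upsert:
            self.docs.setdefault(key, {"url": key}).update(update["$set"])

col = FakeCollection()
store_page(col, "http://www.ub.edu", "<html>v1</html>")
store_page(col, "http://www.ub.edu", "<html>v2</html>")   # second crawl of the same URL
print(len(col.docs))                                      # 1
```

With a real cluster the same function would receive `spider.db['www.ub.eduDB']` as its `collection` argument.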


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">

**PROS and CONS:**
<p>
**MongoDB** querying is powerful, but on a field that holds a full HTML page it comes down to basic string operations. This tells us that storing full HTML pages is not going to be efficient for retrieval; we will see that it is important to break the information into the pieces we really want. However, storing the raw pages is a good starting point before post-processing when we are not sure what we are going to do with the data, or when further scraping is going to take long. </p>
</div>
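
As a small, hedged illustration of "breaking the information into pieces", the sketch below pulls the page title out of the HTML and builds a compact document around it. `TitleParser`, `to_document` and the field names are hypothetical choices, not the notebook's schema.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Grabs the text inside <title> (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def to_document(url, html):
    """Keep the fields we expect to query instead of the raw page."""
    parser = TitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip(), "size": len(html)}

html = "<html><head><title> Demo page </title></head><body>Hi</body></html>"
doc = to_document("http://example.com", html)
print(doc["title"])   # Demo page
```

A document shaped like this can be matched on exact field values, for instance with a filter such as `{"title": "Demo page"}` passed to `collection.find`, rather than by searching inside a long HTML string.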

In the next section we will see more efficient ways of dealing with web based data.

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">

**URLLIB** is good for getting simple things. In the end you are left with a large HTML string that you want to do something with. 
So the next thing you want to do is parse the data, and you want to do it the same way you interact with the web page. You see a menu, a frame on the left side, a nice colorful block where the price of your flight is. So **you want to parse data the way you see it in the web page, so that you can target it**.
</div>