# How to become a data vampire? 
    Data scraping with Python
    
### Requirements:
- Use Python 3 to avoid text-encoding problems.
- Some extra libraries: requests (request websites), selenium (control Firefox), beautifulsoup (parse HTML), fake_useragent (random headers)... json (parse JSON) ships with Python.
- Firefox and geckodriver (https://github.com/mozilla/geckodriver/releases), which selenium uses to control the browser.
- Jupyter notebook (autocomplete, interactive, great). Anaconda provides everything.

### Every data service has an API.
- Some sites allow you to use it directly (Twitter) -> Easy
- Some sites hide it (most websites) -> Easyish
- Some sites block it (most apps and services whose income comes from data) -> No idea

In [None]:
!pip install tweepy
!pip install dateparser
!pip install fake_useragent #lets us send random User-Agent headers; many websites track them
!pip install selenium --upgrade

# 1. Types of data scraping
## 1.1 Using the API that the site provides (e.g. Twitter)
- Usually you send a request and the site gives you back json (a dictionary with the data).
- Chances are that somebody else has already written a Python package for it.
- Official APIs come with rate limits.
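
As a rough illustration of the first bullet, the sketch below parses a JSON reply into Python dictionaries and lists. The payload is made up for this example (it only imitates the shape of a tweet search result) and is not real Twitter output.

```python
import json

# Hypothetical reply body, shaped loosely like a tweet search result
raw = '''
{"statuses": [
    {"id": 1, "text": "hello", "user": {"screen_name": "alice", "followers_count": 10}},
    {"id": 2, "text": "world", "user": {"screen_name": "bob", "followers_count": 25}}
]}
'''

# json.loads turns the text into nested dicts and lists
reply = json.loads(raw)

# Pull out the fields we care about, one row per message
rows = [(s["user"]["screen_name"], s["user"]["followers_count"], s["text"])
        for s in reply["statuses"]]

for row in rows:
    print(row)
```

Libraries such as tweepy do this parsing for you and hand back objects with attributes instead of raw dictionaries.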

Example: Twitter
- Get a key: https://apps.twitter.com/
- Documentation: https://dev.twitter.com/rest/public
- Find a library: https://dev.twitter.com/resources/twitter-libraries
(let's use https://github.com/tweepy/tweepy)


Limitations:
- At most 100 tweets per query
- 180 requests every 15 minutes
- Only the last week of data
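
One way to stay under limits like these is to count calls in a sliding time window and sleep when the window is full. The `Throttle` class below is a hypothetical helper written for this notebook, not part of tweepy; its `clock` and `sleep` arguments exist only so the logic can be tried without actually waiting.

```python
import time
from collections import deque

class Throttle:
    """Allow at most max_calls calls per period seconds (sliding window)."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of the recent calls

    def wait(self):
        """Block until one more call is allowed, then record it."""
        now = self.clock()
        # Forget the calls that have already left the window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)

# Example: at most 180 calls per 15 minutes
throttle = Throttle(max_calls=180, period=15 * 60)
# call throttle.wait() right before every request to the API
```

tweepy can also do the waiting itself if you create the client with `tweepy.API(auth, wait_on_rate_limit=True)`.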

In [117]:
def twitter(d_keys,query):
    import tweepy
    import time

    #Authenticate
    auth = tweepy.OAuthHandler(d_keys["CONSUMER_KEY"], d_keys["CONSUMER_SECRET"])
    auth.set_access_token(d_keys["ACCESS_KEY"], d_keys["ACCESS_SECRET"])
    api = tweepy.API(auth)

    #We want 1000 tweets
    num_results = 1000
    result_count = 0
    last_id = None
    
    #Number of rate-limit pauses taken so far
    cumulative = 0

    #While we don't have them
    while (result_count <  num_results):
        previous_tweets = result_count
        #Ask for more tweets, starting at the last_id (identifier of the tweet).
        #Note: api.search was renamed api.search_tweets in tweepy 4
        results = api.search(q = query,
                              count = 90, max_id = last_id,result_type="recent")
                                # geocode = "{},{},{}km".format(latitude, longitude, max_range) #for geocode

        #for each tweet extract some info (json structure)
        for result in results:
            result_count += 1
            user = result.user.screen_name
            text = result.text
            followers_count = result.user.followers_count
            time_zone = result.user.time_zone
            print("_"*10)
            print(user,time_zone,followers_count)
            print(text)

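        #keep the last_id to know where to continue (only if this page returned tweets,
        #otherwise `result` would be undefined or stale)
        if len(results) > 0:
            last_id = int(results[-1].id)-1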
        new_tweets = result_count - previous_tweets

        print ("Number of results: {} ({} new)".format(result_count,new_tweets))

        #If we don't get new tweets exit
        if new_tweets == 0: 
            break
        
        time.sleep(1)
        
        if ((result_count + 90) // 150) > cumulative:
            cumulative += 1
            time.sleep(15*60)

import pickle
with open(".key","rb") as f: #don't share your keys ;)
    d_keys = pickle.load(f)
twitter(d_keys,"from:wwcs_2017")

__________
wwcs_2017 None 55
At WWCS2017 in @is_petnica we will also discuss "Scientific Careers in Academia and Beyond": https://t.co/d8YNNilt9A ask your qstns #wwcs17
__________
wwcs_2017 None 55
Full detailed program of @wwcs_2017 can be found here: https://t.co/EZjejUrM80 https://t.co/OuzBL5TiaZ
__________
wwcs_2017 None 55
RT @YRN_CS: Bridge Grants 2017 to foster interdisciplinary research efforts: waiting for great proposals! Deadline is March 5th! https://t.…
__________
wwcs_2017 None 55
Our program begins this sunday with the Belgrade city tour! Looking forward to meeting you all!!!
Number of results: 4 (4 new)
Number of results: 4 (0 new)


## 1.2 Using the hidden API
- You request the website and get **json/html** back.
- You need to find patterns in their API (may be hard sometimes).
- Usually no limits.
    - Try to behave, so your traffic is not mistaken for a DDoS attack.
    - Check the terms of service of the website.
    - I got UvA blocked from a site last year.
- It can be automated to run periodically: https://scrapy.org/


What we will use:
- The inspector and the network tab in the browser
- This website, which converts curl commands into Python requests code: https://curl.trillworks.com/
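
Once the network tab shows the request behind a page, the pattern is usually a base URL plus a few query parameters. Here is a minimal sketch of rebuilding such a URL, and of taking one apart, with the standard library. The endpoint `https://example.com/api/search` and the parameter names are placeholders, not a real API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base, **params):
    """Return base?key=value&... with the values safely encoded."""
    return base + "?" + urlencode(params)

# Placeholder endpoint and parameters, as they might be copied from the network tab
url = build_url("https://example.com/api/search", name="emily sippin", page=1)
print(url)

# The reverse direction: split a URL seen in the browser into its parameters
seen = parse_qs(urlsplit(url).query)
print(seen)
```

Changing one parameter at a time and watching how the reply changes is often enough to work out what each one does.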


#### 1.2.1 UVM example
- Info about students: https://www.uvm.edu/directory/

In [120]:
def get_student_info(netIDs):
    """
    Calculate how many students of first,second,third year are in our database
    """
    ids = []
    import requests
    import json
    from collections import Counter
    n = []
    for ID in netIDs:
        #Request
        r = requests.get('https://www.uvm.edu/directory/api/query_results.php?name={0}&department=&phone=&soundslike=0&request_num=1'.format(ID.replace(".","+") ))
        
        #Get the json
        try: 
            t = json.loads(r.text)
            n.append(t["data"][0]["ou"]["0"])
            ids.append(ID)
        except (ValueError, KeyError, IndexError, TypeError): #bad json or unexpected structure
            print("Error with ",ID)
    
    print(ids)
    print(n)
    print(Counter(n))



In [None]:
netIDs = ['hhoffma2', 'nswright', 'emily.sippin', 'emily.sippin', 'cjmarsha', 'nicholas.grubinger', 'kabdi', 'wbourne', 'hdeng', 'caroline.hooper', 'mghafoor', 'cecappel', 'oolafarg', 'cyang3', 'wgarratt', 'imchale', 'mgalla13', 'tdwise', 'jgterino', 'escott3', 'ailer', 'sarah.mantz', 'mleskova', 'bfilker1', 'jpconsta', 'amcgrory', 'verhoeff.parker', 'aharte', 'emcwrigh', 'dgrundha', 'amanda.silva', 'ajstone', 'dgrundha', 'evolaj', 'acoonwil', 'lberelso', 'solmaz.karimi', 'jmorri18', 'bcpincus', 'faliaj', 'iroach', 'taylor.bird', 'rnesnevi', 'lmyers4', 'mgallag9', 'agryan', 'meschnei', 'rfcalabr', 'tlo1', 'asulli19', 'Bcrocke1', 'wocarrol', 'kkaiters', 'wsahene', 'bsalimi', 'hhoffma2', 'ndudkina', 'pgreenfi', 'grace.jia', 'cswinn', 'tmheffer', 'stosto', 'stosto', 'amcgrory', 'jfiggie', 'uacharya', 'amanda.silva', 'ebambury', 'swshelto', 'mprogers', 'ebambury', 'kgonski', 'ebambury', 'agryan', 'ashah7', 'jbenelli', 'nwoodcoc', 'ejsulliv', 'jldickin', 'aculupa', 'jbfrankl', 'mprogers', 'rjplatt', 'jbfrankl', 'mghafoor', 'swbreen', 'swbreen', 'gprodans', 'amguy', 'amguy', 'vavalone', 'gcongius', 'jwaksman', 'rlangdon', 'c.dong', 'c.dong', 'sphuong', 'rfranken', 'smatika']
get_student_info(netIDs)

#### 1.2.2 UVM example 2
- Download all the ISBNs of the books that will be used
- http://uvmbookstore.uvm.edu/textbook_express.asp?mode=2&step=2


In [123]:
def extractBooks():
    """
    Get all the books used at UVM in one year
    """
    import requests
    from bs4 import BeautifulSoup

    cookies = {
        's_vnum': '1440621181017%26vn%3D1',
        's_fid': '0EF2E6D54E22CAAA-084C34FDA4EC5F3D',
        '__utma': '54619247.234461063.1437860582.1438191847.1438191847.1',
        '__utmz': '54619247.1438191847.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided)',
        'ASPSESSIONIDSCQDDRCQ': 'FBDNKLMABLDHPHCHGANLMKCG',
        'ASPSESSIONIDQCSBCQDT': 'OEDPCJMAMCFMELHHNLCMKLHG',
        'referring_url': 'https%3A//www.google.es/',
        '_ga': 'GA1.2.234461063.1437860582',
        '_gat': '1',
        '_gat_unitTracker': '1',
        'mscssid': '280FE7B7BA944614A67CE36A236CBAE3',
        'cookies': 'true',
    }

    headers = {
        'Pragma': 'no-cache',
        'Origin': 'http://uvmbookstore.uvm.edu',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'es,en-US;q=0.8,en;q=0.6',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36',
        'HTTPS': '1',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Cache-Control': 'no-cache',
        'Referer': 'http://uvmbookstore.uvm.edu/default.asp?',
        'Connection': 'keep-alive',
    }

    #All the book ids are a "7" followed by four digits (70000-79999). Let's try all the combinations
    books = ["7{:04d}".format(i) for i in range(10000)]

    while books:
        #keep the top 100
        booksT = books[:100]
        books = books[100:]
        
        #join the ids of the books
        booksT = "%2C".join(booksT)
    
        #create the data and post it
        data = 'tbe-block-mode=0&selTerm=0%7C0&generate-book-list=Get+Your+Books&sectionIds='+booksT
        r = requests.post('http://uvmbookstore.uvm.edu/textbook_express.asp?mode=2&step=2', headers=headers, cookies=cookies, data=data)
        #get the html and parse the html
        html = r.text
        soup = BeautifulSoup(html, "html.parser") #name the parser explicitly to avoid a warning

        #find all the isbns
        ids = soup.find_all('span', {'class' : 'isbn'})

        #Print them
        for _ in ids:
            print(_.text)
        

In [124]:
extractBooks()

013035256X
0130606200
0130606200
0926544306
007340232X
0137003382
111801474X
0205931804
0495898171
1581526601
1451673310
1457602008
1581526601
0547406274
0312612737
0393919595
0618485228
0312445741
0393316548
1567921787
0803259786
1584654953
0374532508
0495907774
0809223805
0809223775
0321652797
0321694643
0130606200
1429283521
0495904333
0078050332
0321558235
020507507X
1449319270
007179591X
0538453044
013182371X
1603290249
032176952X
032176952X
1111131821
0205782159
0073525464
0078023130
0078023130
1449319270
0495095842
140004619X
1581527489
1581527489
1581527489
1581527489
0312542542
0312613385
0375702709
0312609655
145760650X
031333515X
031265362X
0060987014
1581528655
0553375407
1581528655
155643880X
1581528655
0321747593
032176952X
0321361482
0321759664
0321716817
0534432956
1405124784
0199797269
1573228834
0631206337
0800697405
0314199373
1416948821
158901698X
1133960804
0205610617
0078025699
0073379247
0881504203
1882190610
0300115377
1118016181
1449316158
0205578640
0495913383

KeyboardInterrupt: 

## 1.3 Scraping easy websites
- You request the website and get **html** back.
- You need to find patterns to extract the data (not too hard, but takes a bit of time).
- Usually no limits.
    - Try to behave, so your traffic is not mistaken for a DDoS attack.
    - Check the terms of service of the website.
    - I got UvA blocked from a site last year.
- It can be automated to run periodically: https://scrapy.org/
- You can download files as well (e.g. with urllib.request).
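
To make "finding patterns" concrete, here is a small sketch that pulls the text of every table cell out of a page. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the HTML fragment (including the vendor name) is invented for the example.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self.in_cell:
            self.cells[-1] += data

# Invented page fragment: one row with a date, a vendor and a price
html = "<table><tr><td>2017-02-01</td><td>SomeVendor</td><td>$4.13</td></tr></table>"

parser = CellCollector()
parser.feed(html)
cells = [c.strip() for c in parser.cells]
price = float(cells[2].replace("$", ""))
print(cells, price)
```

With BeautifulSoup the same idea is a single `soup.find_all('td')` call.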

In [125]:
from fake_useragent import UserAgent
import json
import re
import bs4
import requests

ua = UserAgent()

#One shared session, so the cookies set by login() are sent with later requests
session = requests.Session()

def login():
    """
    Use the api to log in
    """
    res = session.post('https://bookscouter.com/login.php', 
                        data = {'action': 'login','ref':'','email':'7320939@etlgr.com','pass':'wwcs'})
    print(res.ok)
    
def get_price_bookscouter(ISBN):
    """
    Request the website with random headers, extract the table
    """
    #Get html
    url = 'http://bookscouter.com/historic/'+ISBN
    res = session.get(url, headers={'User-Agent': ua.random})
    
    #Check it's all good
    if res.status_code != 200: 
        print("Error code {}".format(res.status_code))
        return None

    #Parse html
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    
    #Get all the table cells
    prices = soup.find_all('td')
    
    #Return the last price (third cell of the table), if the table is there
    if len(prices) > 2:
        return float(prices[2].text.replace('$', ''))
    else:
        return None


In [126]:
login() #use the hidden api to log in
get_price_bookscouter("978-1491910290") #get the book

True


4.13

## 1.4 Behaving like a person (never a bad idea)
- It's slow
- It can handle javascript
- You get **html** code back.
- Behave.


- Requirements (one of):
    - Firefox + geckodriver (https://github.com/mozilla/geckodriver/releases)
    - Chrome + chromedriver
- geckodriver/chromedriver must have execute permissions (chmod +x geckodriver)



In [109]:
import selenium.common.exceptions
import selenium.webdriver
import selenium.webdriver.common.desired_capabilities
import selenium.webdriver.support.ui
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions


def _wait_for_element(xpath, wait):
    """Wait until the element at `xpath` is present and return it"""
    try:
        polling_f = expected_conditions.presence_of_element_located((By.XPATH, xpath))
        elem = wait.until(polling_f)
    except selenium.common.exceptions.TimeoutException: 
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" presence wait timeout.'.format(xpath))
    return elem

In [53]:
#define short and long timeouts
wait_timeouts=(30, 180)

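#open the driver (Selenium 4 takes the geckodriver path through a Service object)
from selenium.webdriver.firefox.service import Service
driver = selenium.webdriver.Firefox(service=Service(executable_path="./geckodriver"))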

#define short and long waits (for the times you have to wait for the page to load)
short_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[0], poll_frequency=0.05)
long_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[1], poll_frequency=1)


In [54]:
#get a website
driver.get("http://vds.issproxy.com/SearchPage.php?CustomerID=4372")

In [60]:
#let's wait until we find the fund
element = _wait_for_element('//*[@id="selFundID"]',short_wait)

In [64]:
#print the possible options (find_elements_by_xpath was removed in Selenium 4)
for option in element.find_elements(selenium.webdriver.common.by.By.XPATH, '//*[@id="selFundID"]/option'):
    print(option.text)

Select a fund:
CSTG&E International Social Core Equity Portfolio
CSTG&E U.S. Social Core Equity 2 Portfolio
DFA International Real Estate Securities Portfolio
DFA International Small Cap Value Portfolio
DFA International Value ex Tobacco Portfolio
DFA Real Estate Securities Portfolio
Dimensional Emerging Markets Value Fund
Emerging Markets Core Equity Portfolio
Emerging Markets Social Core Equity Portfolio
International Core Equity Portfolio
International Large Cap Growth Portfolio
International Small Cap Growth Portfolio
International Social Core Equity Portfolio
International Sustainability Core 1 Portfolio
International Vector Equity Portfolio
Large Cap International Portfolio
T.A. U.S. Core Equity 2 Portfolio
T.A. World ex U.S. Core Equity Portfolio
Tax-Managed DFA International Value Portfolio
Tax-Managed U.S. Equity Portfolio
Tax-Managed U.S. Small Cap Portfolio
Tax-Managed U.S. Targeted Value Portfolio
The Asia Pacific Small Company Series
The Canadian Small Company Series
The C

In [66]:
#keep the options
options = element.find_elements(selenium.webdriver.common.by.By.XPATH, '//*[@id="selFundID"]/option')

In [68]:
#click on the first fund (options[0] is the "Select a fund:" placeholder)
options[1].click()

In [69]:
#click on the submit button
element.find_element(selenium.webdriver.common.by.By.XPATH, '//*[@id="SPFundInputButton"]').click()

Other things selenium can do:
- Move between windows
- Find elements by attributes (like in beautifulsoup)
- Run javascript or read javascript variables
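
A rough sketch of those three operations, written as a function that takes an already open `driver` such as the one created earlier. The function name `explore` is made up for this example. The locator strategy is passed as the plain string `"css selector"`, which is the value behind `By.CSS_SELECTOR`, only to keep the sketch free of imports.

```python
def explore(driver):
    """Visit every open window, read its title via JavaScript, then count a form field."""
    # Move between windows/tabs: every open one has a handle
    original = driver.current_window_handle
    titles = {}
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
        # Run JavaScript in the page and get the value back in Python
        titles[handle] = driver.execute_script("return document.title;")
    # Go back to the window we started in
    driver.switch_to.window(original)

    # Find by attributes, here with a CSS attribute selector
    matches = driver.find_elements("css selector", "[id='selFundID']")
    return titles, len(matches)

# Usage, with the Firefox driver opened earlier:
# titles, n_matches = explore(driver)
```

The same `execute_script` call can return any JavaScript variable of the page, for example `"return window.location.href;"`.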
