# Web scraping tutorial

*This version: 3 June 2020*

Comments: h.datta@tilburguniversity.edu

**Requires Python 3.x**

## Installation requirements
Please install the following packages through pip:

```
pip install lxml
pip install selenium
pip install cssselect
```

You also need a *current version of Chrome and chromedriver*; for setup instructions, see http://tilburgsciencehub.com/setup/webscraping_drivers/.

**Only then will you be able to proceed to the next cells.**

<font color='red'>**ATTENTION:**</font> By the time you run this notebook, the website used in this example may have changed its layout, so don't be surprised if some of the CSS selectors in the examples no longer work. If that happens, debug the selector directly: open Chrome, navigate to the website, and use the Developer Tools to copy the CSS selector of your target element.
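One way to soften the impact of layout changes is to keep a short list of candidate selectors and use the first one that matches. The sketch below assumes a Selenium `browser` object like the one opened later in this notebook. The function name `first_matching_text` and the selectors in the usage comment are hypothetical and not part of the original tutorial.

```python
def first_matching_text(browser, selectors, default='NA'):
    """Return the text of the first selector that matches, else `default`."""
    for selector in selectors:
        try:
            # 'css selector' is the string value of selenium's By.CSS_SELECTOR
            return browser.find_element('css selector', selector).text
        except Exception:
            # No match (e.g., the layout changed): try the next candidate
            continue
    return default

# Hypothetical usage, once a browser has been opened and pointed to a page:
# first_matching_text(browser, ['#new-layout p.income', '#old-layout p.income'])
```

Listing the most recent selector first keeps the common case fast; older selectors only serve as a fallback.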

## Load packages

In [3]:
# Load packages into memory
import urllib
from lxml import etree 
from lxml.cssselect import CSSSelector
from lxml.etree import fromstring

from selenium import webdriver
import time
import codecs

import re
import os
import requests
import io
import json
from datetime import datetime
from datetime import timedelta
import pandas as pd

# Function to strip all whitespace (spaces, tabs, newline characters) from text
def cleanstring(obj):
    return re.sub(r"\s+", "", obj)

# Function to create a new directory if it does not exist yet
def makedir(dirname):
    os.makedirs(dirname, exist_ok=True)

# Function to generate a filename based on the computer's current time
def timestamp_to_string(prefix='', extension='.html'):
    return prefix + str(time.time()).replace('.', '-') + extension

# Example: Building a web scraper for Socialblade.com

What you've seen before was just a starter. Here, we use proper machinery (read: packages) that helps us get the data we want.

## (a) Introducing you to chromedriver ("What you see is what you get!")

This time, we use a web driver for the Chrome browser ("chromedriver"), in combination with a powerful Python package called selenium. Selenium controls a real browser window, so the page is rendered just as you would see it yourself, and we can then retrieve specific parts of it. The setup instructions for chromedriver are at the top of this document.

Let's first open the browser.

In [2]:
browser=webdriver.Chrome()

Magic, huh? Now let's direct it to a page.

In [3]:
browser.get('https://socialblade.com/youtube/user/enzoknol')

We could now build a loop to extract the estimated yearly income of several channels...

Let's first see whether we have the right CSS selector for this value.

In [5]:
# 'css selector' is the string value of selenium's By.CSS_SELECTOR
print(browser.find_element('css selector', '#socialblade-user-content > div:nth-child(3) > div:nth-child(2) > p:nth-child(1)').text)
#print(browser.find_element('css selector', '#socialblade-user-content > div:nth-child(3) > div:nth-child(3) > p:nth-child(1)').text)



€11.3K - €180.5K
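The scraped value is a text range, which is awkward to analyze. As a minimal sketch, the function below converts such a string into two numbers. The name `parse_income_range` and the assumption that amounts carry at most a `K`, `M`, or `B` suffix are mine, not something the website documents.

```python
import re

# Multipliers for the suffixes assumed to appear in the scraped text
MULTIPLIERS = {'': 1, 'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}

def parse_income_range(text):
    """Turn a string such as '€11.3K - €180.5K' into (low, high) as floats.

    Returns None if the text does not contain exactly two amounts."""
    # Drop thousands separators, then capture each number and its suffix
    matches = re.findall(r'(\d+(?:\.\d+)?)\s*([KMB]?)', text.replace(',', ''))
    if len(matches) != 2:
        return None
    low, high = (round(float(number) * MULTIPLIERS[suffix], 2)
                 for number, suffix in matches)
    return low, high

print(parse_income_range('€11.3K - €180.5K'))  # (11300.0, 180500.0)
print(parse_income_range('NA'))                # None
```

Returning `None` for unparseable text keeps missing values (such as `'NA'`) easy to spot later on.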


Let's build a loop and save the data to a file.

## Running a loop and writing to a CSV file

In [6]:
users = ['enzoknol', 'officialtrapcity', 'martingarrix', 'rtlthevoice']
#users = ['enzoknol', 'hannesd84']

selector = '#socialblade-user-content > div:nth-child(3) > div:nth-child(2) > p:nth-child(1)'

# The file is tab-separated; the with-block closes it even if the loop fails
with open('scraping-output.csv', 'w', encoding='utf-8') as f:
    f.write('username\tyearlyincome\n')

    for user in users:
        print(user)
        browser.get('http://www.socialblade.com/youtube/user/' + user)
        time.sleep(3) # give the page time to load
        try:
            # 'css selector' is the string value of selenium's By.CSS_SELECTOR
            yearlyincome = browser.find_element('css selector', selector).text
        except Exception:
            yearlyincome = 'NA'

        f.write(user + '\t' + yearlyincome + '\n')

enzoknol
officialtrapcity
martingarrix
rtlthevoice
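Writing rows by hand works as long as the scraped text never contains a tab or a newline. As a hedged alternative, the sketch below uses Python's built-in `csv` module, which quotes such values automatically. The helper names `write_rows` and `read_rows` are hypothetical.

```python
import csv

def write_rows(path, rows):
    """Write (username, yearlyincome) pairs to a tab-separated file."""
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(['username', 'yearlyincome'])  # header row
        writer.writerows(rows)

def read_rows(path):
    """Read the file back as a list of dictionaries keyed by column name."""
    with open(path, encoding='utf-8', newline='') as f:
        return list(csv.DictReader(f, delimiter='\t'))

# Hypothetical usage with values collected in a loop:
# write_rows('scraping-output.csv', [('enzoknol', '€11.3K - €180.5K')])
# read_rows('scraping-output.csv')
```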


## Running a loop and writing to a JSON file

In [None]:
users = ['enzoknol', 'officialtrapcity', 'martingarrix', 'rtlthevoice']

selector = '#socialblade-user-content > div:nth-child(3) > div:nth-child(2) > p:nth-child(1)'

dat = []

for user in users:
    url = 'http://www.socialblade.com/youtube/user/' + user
    print(url)
    browser.get(url)
    time.sleep(3) # let the browser wait for a while!
    # 'css selector' is the string value of selenium's By.CSS_SELECTOR
    var = browser.find_element('css selector', selector).text
    dat.append({"user": user, "yearlyincome": var})

dat
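A fixed `time.sleep(3)` is simple, but it waits too long on fast pages and not long enough on slow ones. The sketch below polls for the element until a timeout expires. The function name `wait_for_text` and its default values are assumptions for illustration; Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui`, which serves the same purpose.

```python
import time

def wait_for_text(browser, selector, timeout=10.0, poll=0.5):
    """Poll until `selector` matches and has text; return None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # 'css selector' is the string value of selenium's By.CSS_SELECTOR
            text = browser.find_element('css selector', selector).text
            if text:
                return text
        except Exception:
            pass  # element not there (yet); keep polling
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)

# Hypothetical usage inside the loop, replacing time.sleep(3):
# var = wait_for_text(browser, selector)
```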

In [None]:
# save data to JSON (one JSON object per line)
with open('scraping-output.json', 'w', encoding='utf-8') as f:
    for line in dat:
        f.write(json.dumps(line) + '\n')
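Because each line of the file holds one JSON object, the data can be read back line by line. A minimal sketch follows; the helper name `read_json_lines` is hypothetical.

```python
import json

def read_json_lines(path):
    """Load a file with one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                records.append(json.loads(line))
    return records

# Hypothetical usage:
# dat = read_json_lines('scraping-output.json')
```

Note that `json.dumps` escapes non-ASCII characters such as `€` by default; `json.loads` restores them when reading.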


In [None]:
# or... save the data to a (tab-separated) CSV file
with open('scraping-output.csv', 'w', encoding='utf-8') as f:
    f.write('username\tyearlyincome\n')
    for line in dat:
        f.write(line.get('user') + '\t' + line.get('yearlyincome') + '\n')


## (b) Scraper with error handling (try-except)

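Let's re-run the loop above, but this time catch any error, so that a single failing page (for example, one that lacks the target element) does not stop the whole run.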

In [None]:
users = ['enzoknol', 'officialtrapcity', 'martingarrix', 'rtlthevoice', 'hannesd84']

selector = '#socialblade-user-content > div:nth-child(3) > div:nth-child(2) > p:nth-child(1)'

dat = []

for user in users:
    url = 'http://www.socialblade.com/youtube/user/' + user
    browser.get(url)
    time.sleep(3) # let the browser wait for a while!

    try:
        # 'css selector' is the string value of selenium's By.CSS_SELECTOR
        var = browser.find_element('css selector', selector).text
    except Exception:
        var = 'NA' # the element could not be found on this page
    dat.append({"user": user, "yearlyincome": var})

dat
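Replacing every error with `'NA'` keeps the loop running, but it also hides why a page failed. As a sketch, the variant below additionally records the type of error per user. The function name `scrape_users` and its parameters are hypothetical.

```python
import time

def scrape_users(browser, users, selector, base_url, pause=3.0):
    """Return (data, failures); failures maps a user to the error type raised."""
    data, failures = [], {}
    for user in users:
        browser.get(base_url + user)
        time.sleep(pause)  # let the browser wait for a while
        try:
            # 'css selector' is the string value of selenium's By.CSS_SELECTOR
            value = browser.find_element('css selector', selector).text
        except Exception as err:
            value = 'NA'
            failures[user] = type(err).__name__  # e.g., 'NoSuchElementException'
        data.append({'user': user, 'yearlyincome': value})
    return data, failures

# Hypothetical usage:
# dat, failures = scrape_users(browser, users, selector,
#                              'http://www.socialblade.com/youtube/user/')
```

Inspecting `failures` after the run shows whether the missing values stem from a changed layout or from some other problem.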

### Diagnostics: write page to file so that you can view it

In [None]:
content = browser.page_source
print(len(content)) # number of characters in the page source
with open('socialblade.html', 'w', encoding='utf-8') as f:
    f.write(content)
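When you diagnose several pages, a fixed filename gets overwritten each time. The sketch below stores every snapshot under a timestamped name in its own directory, following the same idea as the `makedir` and `timestamp_to_string` helpers defined at the top. The function name `save_snapshot` and the directory name `snapshots` are hypothetical.

```python
import os
import time

def save_snapshot(page_source, directory='snapshots', prefix='page-'):
    """Write the page source to a timestamped HTML file and return its path."""
    os.makedirs(directory, exist_ok=True)  # create the directory if needed
    filename = prefix + str(time.time()).replace('.', '-') + '.html'
    path = os.path.join(directory, filename)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(page_source)
    return path

# Hypothetical usage with an open browser:
# save_snapshot(browser.page_source, prefix='socialblade-')
```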

Check out the HTML file in a browser now!

To wrap things up, let's close Chrome again.

In [None]:
browser.quit() # closes all browser windows and ends the chromedriver session

# Your turn...!

Why don't you adapt the script below to capture a site you're interested in?

1) First, open Chrome (if you haven't done so already).

In [None]:
# let's first open Chrome again
browser=webdriver.Chrome()

2) Second, open the desired page, and then use Chrome's developer menu to locate the elements you're interested in!

In [None]:
browser.get('https://www.op.gg/champion/kled/statistics/top')

3) Check whether you got the right CSS selector

In [None]:
# 'css selector' is the string value of selenium's By.CSS_SELECTOR
var = browser.find_element('css selector', 'body > div.l-wrap.l-wrap--champion > div.l-container > div > div.tabWrap._recognized > div.l-champion-statistics-content.tabItems > div.tabItem.Content.championLayout-overview > div > div.l-champion-statistics-content__main > table.champion-overview__table.champion-overview__table--summonerspell > tbody:nth-child(3) > tr:nth-child(1) > td.champion-overview__stats.champion-overview__stats--pick').text
print(var) # pick rates
    

4) Now build the loop by defining your seeds in a list, and put your selectors in it... (why don't you complete the example?!)

In [None]:
# DISCLAIMER: This code is still the example code from above; it does NOT pertain to op.gg.

users = ['nijntje', 'nike', 'adidas', 'hannesd84']

selector = '#YouTubeUserTopInfoBlock > div:nth-child(4) > span:nth-child(3)'

with open('my-scraping-data.csv', 'w', encoding='utf-8') as f:
    for user in users:
        url = 'http://www.socialblade.com/youtube/user/' + user
        browser.get(url)
        time.sleep(3) # let the browser wait for a while!

        try:
            # 'css selector' is the string value of selenium's By.CSS_SELECTOR
            var = browser.find_element('css selector', selector).text
        except Exception:
            var = 'NA'

        f.write(user + '\t' + var + '\n')

5) Finish the session by closing the browser.

In [None]:
browser.quit() # closes all browser windows and ends the chromedriver session