# Web scraping workshop

*This version: 14 June 2019*

Comments: h.datta@tilburguniversity.edu

**Requires Python 3.x**

## Installation requirements
Please install the following packages through pip:

```
pip install lxml
pip install selenium
pip install cssselect
```

You also need a *current version of chromedriver.exe* that matches your installed version of Chrome, placed in the *same directory as this file*. Please visit the site below to download it: https://sites.google.com/a/chromium.org/chromedriver/

**Only then will you be able to proceed to the next cells.**

## Load packages

In [None]:
# Load packages into memory
import urllib3
from lxml import etree 
from lxml.cssselect import CSSSelector
from lxml.etree import fromstring

from selenium import webdriver
import time
import codecs

import re
import os
import requests
import io
import json
from datetime import datetime
from datetime import timedelta
import pandas as pd

# Function to strip all whitespace (spaces, new line characters, tabs) from a string
def cleanstring(obj):
    return re.sub(r"\s+", "", obj)

# Function to create a new directory if it does not exist yet
def makedir(dirname):
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(dirname, exist_ok=True)
        
# Function to generate a filename from the current system time
def timestamp_to_string(prefix='', extention='.html'):
    return prefix + str(time.time()).replace('.', '-') + extention

# Example: Building a web scraper for Socialblade.com

What you've seen before was just a starter. Here, we use proper machinery (read: packages) that helps us get the data we want.

## (a) Introducing you to chromedriver ("What you see is what you get!")

This time, we use a web driver for the Chrome browser ("chromedriver.exe") in combination with a powerful Python package called selenium. Selenium controls a real browser window, so the page is rendered exactly as you would see it yourself, and we can then retrieve specific parts of it.
To make this code work, chromedriver.exe needs to be located in the same directory as this file.

Download chromedriver: http://chromedriver.chromium.org/downloads

Let's first open the browser

In [None]:
browser=webdriver.Chrome()

Magic, huh? Now let's direct it to a page.

In [None]:
browser.get('https://socialblade.com/youtube/user/nike')

We could now build a loop to extract the number of video views...

Let's first see whether we have the right CSS selector for these views.

In [None]:
from selenium.webdriver.common.by import By  # find_element_by_* was removed in Selenium 4
browser.find_element(By.CSS_SELECTOR, '#YouTubeUserTopInfoBlock > div:nth-child(4) > span:nth-child(3)').text

Let's build a loop and save the data in a JSON object.

In [None]:
from selenium.webdriver.common.by import By  # find_element_by_* was removed in Selenium 4

users = ['nijntje', 'nike', 'adidas', 'hannesd84']

dat = []

for user in users:
    url = 'http://www.socialblade.com/youtube/user/' + user
    print(url)
    browser.get(url)
    time.sleep(3)  # give the page time to load
    var = browser.find_element(By.CSS_SELECTOR, '#YouTubeUserTopInfoBlock > div:nth-child(4) > span:nth-child(3)').text
    dat.append({"user": user, "followers": var})

dat
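
The values collected in `dat` are strings, exactly as they are displayed on the page. Below is a minimal sketch of turning them into numbers before analysis. It assumes the page shows counts either with thousands separators (e.g. `1,234,567`) or with a `K`/`M`/`B` suffix; these formats, the helper name `parse_count`, and the sample rows are assumptions for illustration, not something the site guarantees.

```python
import re

# Multipliers for abbreviated counts (assumed format, e.g. "12.3M")
SUFFIXES = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}

def parse_count(text):
    """Convert a scraped count such as '1,234,567' or '12.3M' to an int.

    Returns None when the text does not look like a number (e.g. 'NA').
    """
    cleaned = text.strip().replace(',', '').upper()
    match = re.fullmatch(r'(\d+(?:\.\d+)?)([KMB]?)', cleaned)
    if match is None:
        return None
    number, suffix = match.groups()
    return int(round(float(number) * SUFFIXES.get(suffix, 1)))

# Hypothetical rows in the same shape as `dat`
example = [{"user": "nike", "followers": "1,234,567"},
           {"user": "adidas", "followers": "12.3M"},
           {"user": "unknown", "followers": "NA"}]

# Add a numeric column next to the raw text
for row in example:
    row["followers_num"] = parse_count(row["followers"])
```

Keeping the raw string next to the parsed number makes it easy to spot values that the pattern did not anticipate.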

In [None]:
# save data to JSON (one JSON object per line)
makedir('module7')
with open('module7/data.json', 'w') as f:
    for line in dat:
        f.write(json.dumps(line) + '\n')
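
Because each line of the file holds one JSON object, the data can be read back line by line. This is a minimal sketch, assuming a file written in that format. The function name `read_json_lines` and the demo rows are made up for illustration.

```python
import json
import os
import tempfile

def read_json_lines(path):
    """Read a file with one JSON object per line into a list of dicts."""
    records = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                records.append(json.loads(line))
    return records

# Demo with made-up rows written to a temporary directory
demo_rows = [{"user": "nike", "followers": "123"},
             {"user": "adidas", "followers": "NA"}]
demo_path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    for row in demo_rows:
        f.write(json.dumps(row) + '\n')

loaded = read_json_lines(demo_path)
```

Applied to the file written by the workshop cell, `read_json_lines('module7/data.json')` should give back the same list of dictionaries that was stored in `dat`.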


In [None]:
# or... save the data to a delimited text file (tab-separated, despite the .csv extension)
makedir('module7')
with open('module7/data.csv', 'w') as f:
    for line in dat:
        f.write(line.get('user') + '\t' + line.get('followers') + '\n')
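
Joining fields by hand breaks as soon as a value itself contains a tab or a line break. Here is a sketch of the same step with Python's built-in `csv` module, which quotes such values and makes it easy to add a header row. The function name `write_rows` and the sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile

def write_rows(path, rows, fieldnames=('user', 'followers')):
    """Write a list of dicts to a tab-separated file with a header row."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames), delimiter='\t')
        writer.writeheader()
        writer.writerows(rows)

# Demo with made-up rows; the second value contains a tab on purpose
demo_rows = [{"user": "nike", "followers": "123"},
             {"user": "adidas", "followers": "1\t000"}]
demo_path = os.path.join(tempfile.mkdtemp(), 'data.csv')
write_rows(demo_path, demo_rows)

# Read the file back to check that the values survive the round trip
with open(demo_path, newline='', encoding='utf-8') as f:
    round_trip = [dict(row) for row in csv.DictReader(f, delimiter='\t')]
```

The header row also means the file can be loaded later without having to remember the column order.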


## (b) Working scraper

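Let's re-run the loop from above, but catch any error, so that one missing element (for example, a user whose page does not exist) does not stop the whole scraper.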

In [None]:
from selenium.webdriver.common.by import By  # needed for find_element in Selenium 4

users = ['nijntje', 'nike', 'adidas', 'hannesd84']

dat = []

for user in users:
    url = 'http://www.socialblade.com/youtube/user/' + user
    browser.get(url)
    time.sleep(3)  # give the page time to load
    
    try:
        var = browser.find_element(By.CSS_SELECTOR, '#YouTubeUserTopInfoBlock > div:nth-child(4) > span:nth-child(3)').text
    except Exception:
        # element not found (or another error): record a placeholder and continue
        var = 'NA'
    dat.append({"user": user, "followers": var})

dat
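
A fixed `time.sleep(3)` can be longer than needed on a fast connection and too short on a slow one. One common alternative is to retry the lookup until it succeeds or a time limit passes. The sketch below shows the idea with a generic helper. The names `wait_for` and `flaky_lookup` and all parameters are hypothetical; with a real browser, the function passed in would be the `find_element` call.

```python
import time

def wait_for(func, timeout=10.0, interval=0.5, default='NA'):
    """Call func() repeatedly until it succeeds or `timeout` seconds pass.

    Returns the result of func(), or `default` if every attempt raised.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return func()
        except Exception:
            if time.monotonic() >= deadline:
                return default
            time.sleep(interval)

# Stand-in for a page element that only appears on the third attempt
attempts = {'n': 0}

def flaky_lookup():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise LookupError('element not there yet')
    return '1,234'

result = wait_for(flaky_lookup, timeout=2.0, interval=0.01)

# A lookup that never succeeds falls back to the default value
missing = wait_for(lambda: 1 / 0, timeout=0.05, interval=0.01)
```

Selenium also ships its own `WebDriverWait` class for this purpose, which is usually the more convenient choice in a real scraper.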

### Diagnostics: write page to file so that you can view it

In [None]:
content = browser.page_source
print(len(content))  # number of characters in the page source
with io.open('module7/socialblade.html', 'w', encoding='utf-8') as f:
    f.write(content)
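
When you scrape the same page repeatedly, a fixed file name overwrites the previous snapshot. Here is a sketch that stores each page under a time-based name, in the spirit of `timestamp_to_string` from the first cell. The function `save_snapshot`, its arguments, and the placeholder HTML are assumptions for illustration.

```python
import os
import tempfile
import time

def save_snapshot(html, directory, prefix='page_'):
    """Write html to `directory` under a time-based file name; return the path."""
    os.makedirs(directory, exist_ok=True)
    # Replace the decimal point so the name contains only one dot
    filename = prefix + str(time.time()).replace('.', '-') + '.html'
    path = os.path.join(directory, filename)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return path

# Demo with placeholder HTML instead of browser.page_source
demo_html = '<html><body>test</body></html>'
demo_dir = os.path.join(tempfile.mkdtemp(), 'snapshots')
saved_path = save_snapshot(demo_html, demo_dir)
```

In the workshop setting, the call would look like `save_snapshot(browser.page_source, 'module7')`.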

Now open the saved HTML file (`module7/socialblade.html`) in a browser to check what the scraper actually received!

To wrap things up, let's close Chrome again.

In [None]:
browser.quit()  # closes all windows and also stops the chromedriver process

# Your turn...!

Please adapt the script below for your own site.

In [None]:
# let's first open Chrome again
browser=webdriver.Chrome()

In [None]:
from selenium.webdriver.common.by import By  # needed for find_element in Selenium 4

users = ['nijntje', 'nike', 'adidas', 'hannesd84']

dat = []

for user in users:
    url = 'http://www.socialblade.com/youtube/user/' + user
    browser.get(url)
    time.sleep(3)  # give the page time to load
    
    try:
        var = browser.find_element(By.CSS_SELECTOR, '#YouTubeUserTopInfoBlock > div:nth-child(4) > span:nth-child(3)').text
    except Exception:
        # element not found (or another error): record a placeholder and continue
        var = 'NA'
    
    dat.append({"user": user, "followers": var})

# save data to JSON (one JSON object per line)
makedir('mysite')
with open('mysite/my_website.json', 'w') as f:
    for line in dat:
        f.write(json.dumps(line) + '\n')
