# Web scraping workshop

*This version: 26 June 2020*

Comments: h.datta@tilburguniversity.edu

**Requires Python 3.x**

## Installation requirements
Please install the following packages through pip:

```
pip install selenium requests pandas
```

You also need a *current version of Chrome and chromedriver*; for setup instructions, see http://tilburgsciencehub.com/setup/webscraping_drivers/.

**Only then will you be able to proceed to the next cells.**

## Load packages

In [None]:
# Load packages into memory
import json
import os
import re
import time
import urllib
from datetime import datetime, timedelta

import pandas as pd
import requests
from selenium import webdriver

# Function to clean text: collapses runs of whitespace (e.g., new line characters, tabs) into single spaces
def cleanstring(obj):
    return re.sub(r"\s+", " ", obj).strip()

# Function to create a new directory if it does not exist yet
def makedir(dirname):
    os.makedirs(dirname, exist_ok=True)
        
# Function to generate a filename based on the computer's current time
def timestamp_to_string(prefix='', extension='.html'):
    return prefix + str(time.time()).replace('.', '-') + extension

# Module 1: Simple HTTP requests

## (a) requests - without header

In [None]:
r = requests.get('http://www.amazon.com', headers={})
r

In [None]:
# show the first 1000 characters of the "content"
r.text[:1000]

# --> looks like a website!

In [None]:
# let's save the website to an HTML file and open it!
makedir('module1/')
# utf-8 is extremely important (!)
with open('module1/amazon.html', 'w', encoding='utf-8') as f:
    f.write(r.text)

# hm... looks like they realized we're trying to scrape!

## (b) requests - with header

In [None]:
# We could try pretending we're a Chrome Browser on an iPhone (!)
headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1'}

r = requests.get('http://www.amazon.com', headers=headers)
# utf-8 is extremely important (!)
with open('module1/amazon_mobile.html', 'w', encoding='utf-8') as f:
    f.write(r.text)

# --> compare the files!
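
A quick way to compare the two responses without opening them by hand is to look at their length and `<title>`. The sketch below is only an illustration: `summarize_html` is a hypothetical helper, and the two strings are made-up stand-ins for the saved pages, not real Amazon output.

```python
import re

def summarize_html(html):
    """Return the length and the <title> text of an HTML document."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      flags=re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else None
    return {"length": len(html), "title": title}

# Made-up stand-ins for the two saved files (assumption, not real responses)
blocked = "<html><head><title>Sorry!</title></head><body></body></html>"
regular = "<html><head><title>Shop</title></head><body><p>Deals</p></body></html>"

print(summarize_html(blocked))
print(summarize_html(regular))
```

To run it on the real files, read each one with `open(filename, encoding='utf-8').read()` and pass the result to `summarize_html`.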

## (c) selenium - remote-controlling Chrome

In [None]:
# let's start up the browser - once!
browser = webdriver.Chrome()

In [None]:
# let's navigate to a page
browser.get('https://amazon.com')

In [None]:
# save the page
makedir('module1/')

with open('module1/amazon_selenium.html', 'w', encoding='utf-8') as f:
    f.write(browser.page_source)

## <font color='blue'>Your turn</font>

Write some code to open the following Amazon category pages, one after another.
Would you use `requests` or selenium's `webdriver`?

urls = ['https://www.amazon.com/s?bbn=16225007011&rh=n%3A16225007011%2Cn%3A172456&dc&fst=as%3Aoff&qid=1593153247&rnid=16225007011&ref=lp_16225007011_nr_n_0',
'https://www.amazon.com/s?bbn=16225007011&rh=n%3A16225007011%2Cn%3A193870011&dc&fst=as%3Aoff&qid=1593153269&rnid=16225007011&ref=lp_16225007011_nr_n_1',
'https://www.amazon.com/s?bbn=16225007011&rh=n%3A16225007011%2Cn%3A13896617011&dc&fst=as%3Aoff&qid=1593153269&rnid=16225007011&ref=lp_16225007011_nr_n_2']

Pause your code for 2 seconds after each request, using
```
import time
time.sleep(2)
```
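
The general shape of such a loop might look like the sketch below. It deliberately leaves the choice between `requests` and selenium open: `visit_all` is a hypothetical helper that accepts any function as `fetch`, and the demo uses `len` and `example.com` addresses as stand-ins so that nothing is downloaded.

```python
import time

def visit_all(urls, fetch, pause=2):
    """Call fetch(url) for every URL and sleep `pause` seconds after each call."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(pause)
    return results

# Demo with a stand-in fetch function (len) and no waiting time
demo = visit_all(['https://example.com/a', 'https://example.com/b'],
                 fetch=len, pause=0)
print(demo)
```

In your own solution, `fetch` would be replaced by whichever call you decide on for opening the page.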

In [None]:
# Your turn here!

# Module 2: Simple HTTP requests to an API

In [None]:
# Let's get a list of channels from the Mixer API
url = 'https://mixer.com/api/v1/channels'
data = requests.get(url, headers={})
data.text

# Module 3: Extracting data from JSON objects

In [None]:
# load JSON objects
f=requests.get('https://hannesdatta.github.io/course-jads2020/sessions/json_objects.json').text.split('\n')

postcode = json.loads(f[0])
spotify = json.loads(f[1])
twitter = json.loads(f[2])
mixer = json.loads(data.text)

In [None]:
# example:
print('Source: ' + twitter.get('source'))
print('User name: ' + twitter.get('user').get('name'))

In [None]:
twitter.get('user').get('name')

In [None]:
for i in mixer:
    name=i.get('token')
    followers=i.get('numFollowers')
    print(str(name) + ': ' + str(followers) + ' followers.')
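
Chained `.get()` calls fail with an `AttributeError` as soon as an intermediate key is missing, because `.get()` then returns `None`. One way to guard against this is a small helper that walks the nested dictionaries step by step. In the sketch below, `safe_get` and the sample record are hypothetical; only the field names follow the examples above.

```python
import json

# Hypothetical sample record; only the field names follow the examples above
raw = '{"source": "web", "user": {"name": "Alice"}, "place": null}'
tweet = json.loads(raw)

def safe_get(obj, *keys, default=None):
    """Walk nested dictionaries key by key; return default if a step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj

print(safe_get(tweet, 'user', 'name'))
print(safe_get(tweet, 'place', 'country', default='n/a'))
```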

# Module 4: Extracting data from HTML objects

In [None]:
# get page
browser=webdriver.Chrome()
browser.get('https://www.amazon.com/First-Years-Stack-Up-Cups/dp/B00005C5H4/')


In [None]:
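from selenium.webdriver.common.by import By

css='#averageCustomerReviews_feature_div > div:nth-child(2) > span:nth-child(3) > a:nth-child(1) > span:nth-child(1)'
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector, which recent Selenium 4 releases no longer provide
print(browser.find_element(By.CSS_SELECTOR, css).text)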


# Module 5: Looping

In [None]:
url = 'https://spotifycharts.com/regional/nl/daily/2020-06-23/download'
header = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'}
r = requests.get(url, headers=header)
makedir('module5/')
with open('module5/spotify-nl.csv', 'w', encoding='utf-8') as f:
    f.write(r.text)

Let's build a loop that iterates over a range of days...

In [None]:
date1 = '2020-06-01'
date2 = '2020-06-20'
mydates = pd.date_range(date1, date2).tolist()

Let's view the dates!

In [None]:
mydates

Let's build a loop around our scraping code

In [None]:
for mydate in mydates:
    new_date = str(mydate.date())
    url = 'https://spotifycharts.com/regional/nl/daily/'+new_date+'/download'
    print(url)
    r = requests.get(url, headers=header)
    with open('module5/spotify-nl-' + new_date + '.csv', 'w', encoding='utf-8') as f:
        f.write(r.text)
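
The same list of daily URLs can also be built with the standard library alone, which makes it easy to inspect the URLs before sending any request. This is a sketch, not the workshop's method: `daily_urls` is a hypothetical helper, and it only constructs strings.

```python
from datetime import date, timedelta

def daily_urls(start, end, country='nl'):
    """Build one chart download URL per day from start to end (inclusive)."""
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return ['https://spotifycharts.com/regional/' + country + '/daily/'
            + d.isoformat() + '/download' for d in days]

urls = daily_urls(date(2020, 6, 1), date(2020, 6, 20))
print(len(urls))
print(urls[0])
```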


# Module 6: Modularization

In [None]:
# a bit of setup code
makedir('module6/')

Let's wrap our scraping code in a function

In [None]:
def scrape_spotify(country, date, path=''):
    url = 'https://spotifycharts.com/regional/'+country+'/daily/' + date+ '/download'
    print(url)
    header = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'}
    r = requests.get(url, headers=header)
    fn = path+'spot_'+country+'_'+date+'.csv'
    with open(fn, 'w', encoding='utf-8') as f:
        f.write(r.text)


In [None]:
# let's run it!
scrape_spotify('nl', '2018-01-01', path = 'module6/')

Let's wrap it in a loop again. It is now much easier to see what is going on, and we can easily add more countries.

In [None]:
countries = ['nl', 'us']
date1 = '2020-06-01'
date2 = '2020-06-20'
mydates = pd.date_range(date1, date2).tolist()

for country in countries:
    for mydate in mydates:
        scrape_spotify(country, str(mydate.date()), 'module6/')
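
Once the code is split into functions, it becomes straightforward to add small pieces of logic around it, for example skipping files that were already downloaded when a run is restarted. The sketch below assumes the file naming used in `scrape_spotify`; `chart_filename` and `pending_jobs` are hypothetical helpers, and the demo works in a temporary directory without touching the network.

```python
import os
import tempfile

def chart_filename(country, date, path=''):
    """Mirror the naming scheme used in scrape_spotify."""
    return path + 'spot_' + country + '_' + date + '.csv'

def pending_jobs(countries, dates, path=''):
    """List (country, date) pairs whose output file does not exist yet."""
    return [(c, d) for c in countries for d in dates
            if not os.path.exists(chart_filename(c, d, path))]

# Demo: pretend one file was already downloaded
with tempfile.TemporaryDirectory() as tmp:
    path = tmp + os.sep
    open(chart_filename('nl', '2020-06-01', path), 'w').close()
    todo = pending_jobs(['nl', 'us'], ['2020-06-01', '2020-06-02'], path)

print(todo)
```

Looping over `pending_jobs(...)` instead of all combinations would call the scraper only for the missing files.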