# Exercise Set 8: Introduction to Web Scraping

*Afternoon, August 16, 2018*

In this exercise set we practice our web scraping skills using only basic Python. We cover the differences between static and dynamic pages.

## Exercise Section 8.1: Scraping Jobnet.dk

In this exercise you will practice locating the request that the JavaScript sends to fetch the job data from which it builds the job listings. You should use the **>Network Monitor<** tool in your browser.

You will also practice spotting how the pagination is done: instead of clicking the next page button, you change a small parameter in the URL.

> **Ex. 8.1.1:** Visit the job listing webpage here: https://job.jobnet.dk/CV and locate the request that gets the job listing data using a tool called the **>Network Monitor<**. To do this, open the monitor tool and press the search button on the website. Go to the _network_ pane in the monitor and look at the results. Which two _methods_ does your browser use to communicate with the webserver? What does status code 200 mean?
>
>> _Hint:_ The network monitor is launched by pressing ctrl+shift+i in most browsers. Filter by XHR files.

- GET
- POST
- Status code 200 (OK): the request was received, understood, and processed successfully
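Status codes other than 200 will also show up in the Network Monitor. As a small sketch, the standard library's `http.HTTPStatus` enum can translate a numeric code into its standard phrase; the helper name `describe_status` is only illustrative.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{status.value} {status.phrase}"

# Look up a few codes you are likely to meet while scraping
for code in (200, 404, 500):
    print(describe_status(code))
```

With the `requests` module, the numeric code of a response is available as `response.status_code`.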

> **Ex. 8.1.2.:** Use the `requests` module to collect the first page of job postings and unpack the relevant `json` data into a `pandas` DataFrame.

In [1]:
import requests
import pandas as pd

url = 'https://job.jobnet.dk/CV/FindWork/Search?Offset=0&SortValue=BestMatch'
response = requests.get(url)
data = response.json()  # parse the JSON body; a separate name avoids shadowing the json module
print(data.keys())

dict_keys(['Expression', 'Facets', 'JobPositionPostings', 'TotalResultCount'])


> **Ex. 8.1.3.:** Store and print the 'TotalResultCount' value for later use. Also create a dataframe from the 'JobPositionPostings' field in the json. 

In [2]:
df_jpp = pd.DataFrame(data['JobPositionPostings'])  # one row per job posting
trc = data['TotalResultCount']  # total number of postings across all pages
print(trc)

16000


> **Ex. 8.1.4:** This exercise is about paging the results. We need to understand the website's pagination scheme. Scroll down the webpage and press the next page button. Describe how the parameters of the URL change as you turn the pages.


The pagination scheme is controlled by the `Offset` parameter in the URL. Every page contains 20 job postings, so the offset increases by 20 for each page.
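The relation between page number and offset can be sketched as follows. This is a minimal illustration, assuming the page size of 20 observed above; the names `BASE_URL`, `PAGE_SIZE` and `build_page_url` are hypothetical helpers, not part of the website's API.

```python
from urllib.parse import urlencode

BASE_URL = 'https://job.jobnet.dk/CV/FindWork/Search'
PAGE_SIZE = 20  # assumption: number of postings returned per call

def build_page_url(page_number, sort_value='BestMatch'):
    """Build the search URL for a zero-indexed result page."""
    offset = page_number * PAGE_SIZE
    query = urlencode({'Offset': offset, 'SortValue': sort_value})
    return BASE_URL + '?' + query

print(build_page_url(0))
print(build_page_url(2))
```

Using `urlencode` rather than string concatenation keeps the query string valid even if a parameter value contains special characters.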

> **Ex. 8.1.5:** Design a `for` loop using the `range` function that changes this paging parameter in the URL. Use the TotalResultCount parameter from before to define the limits of the range function. Store these URLs in a container. 
>
>**Bonus** Change the SortValue parameter from BestMatch to CreationDate, to make the sorting amenable to updating results daily.
>
>> _Hint:_ See that the parameter is an offset and that this relates to the number of results per call made.

In [3]:
pages = []
for i in range(0, trc, 20):
    url = 'https://job.jobnet.dk/CV/FindWork/Search?Offset=' + str(i) + '&SortValue=CreationDate'
    pages.append(url)

> **Ex. 8.1.6:** Pick 20 random links using the `choice` function in the `random` module; collect the links using the `requests` module. Also use the `time.sleep()` function to limit the rate of your calls. Make sure to save the links already collected in a `set()` container to avoid having to reload links already collected. ***extra***: monitor the time left to completing the loop by using the `tqdm.tqdm()` function.

In [4]:
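import random
import time

# random.sample needs a sequence; sampling directly from a set fails on Python 3.11+
links_sample = random.sample(pages, 20)

collected = set()  # links already downloaded, so nothing is requested twice
jsons = []
for link in links_sample:
    if link in collected:
        continue
    response = requests.get(link)
    jsons.append(response.json())
    collected.add(link)
    time.sleep(1)  # limit the rate of calls

len(jsons)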

20
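The bookkeeping with a `set()` becomes useful when the loop is run more than once. The following is a minimal sketch of a resumable collector; the `fetch` argument is a hypothetical stand-in for something like `lambda link: requests.get(link).json()`, which keeps the example runnable without network access.

```python
import time

def collect(links, fetch, seen=None, delay=1.0):
    """Fetch every link not yet in `seen`, pausing `delay` seconds between calls."""
    seen = set() if seen is None else seen
    results = []
    for link in links:
        if link in seen:  # already collected, skip the request
            continue
        results.append(fetch(link))
        seen.add(link)
        time.sleep(delay)
    return results, seen

# Stand-in fetcher used for illustration only
fake_fetch = lambda link: {'url': link}

results, seen = collect(['a', 'b', 'a'], fake_fetch, delay=0)
print(len(results))

# A second run over the same links triggers no new requests
more, seen = collect(['a', 'b'], fake_fetch, seen=seen, delay=0)
print(len(more))
```

Passing the returned `seen` set back into the next call is what avoids reloading links.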

> **Ex. 8.1.7:** Load all the results into a DataFrame.

In [5]:
# Build one DataFrame per page and concatenate them in a single call
df = pd.concat([pd.DataFrame(page['JobPositionPostings']) for page in jsons], ignore_index=True)


## Exercise Section 8.2: Scraping Trustpilot.com
Now for a slightly more elaborate, yet still simple scraping problem. Here we want to scrape Trustpilot for user reviews. This data is valuable because it provides free labeled data (the rating) that can be used to train a machine learning model to understand positive and negative sentiment. 

Here you will practice crawling a website, collecting the links to each company review page, and finally locating another behind-the-scenes JavaScript request that gets the review data in a neat json format.

> **Ex. 8.2.1:** Visit the https://www.trustpilot.com/ website and locate the categories page. From this page you find links to company listings. Get the category page using the `requests` module and extract each link to a specific category page from the HTML. This can be done using the basic Python `.split()` string method. Make sure only links within the ***/categories/*** section are kept, checking each string using the `if 'pattern' in string` condition. 
>
>> *Hint:* The links are relative. You need to add the domain name.
>>
>> *Hint #2:* You will need to convert the page response to a string, using the `.text` property.


In [62]:
import requests

url = 'https://www.trustpilot.com/categories'
response = requests.get(url)
link_locations = response.text.split('href="')[1:]

links = set()
for link in link_locations:
    path = link.split('"')[0]
    if 'trustpilot.com' not in path:  # relative link: prepend the domain
        path = 'https://www.trustpilot.com' + path
    if '/categories/' in path:
        links.add(path)
links

{'https://support.trustpilot.com/hc/categories/200128688-Support-for-Reviewers',
 'https://www.trustpilot.com/categories/animals_and_pets',
 'https://www.trustpilot.com/categories/art',
 'https://www.trustpilot.com/categories/clothes_fashion',
 'https://www.trustpilot.com/categories/cloud_computing',
 'https://www.trustpilot.com/categories/companies',
 'https://www.trustpilot.com/categories/computer',
 'https://www.trustpilot.com/categories/craftsman',
 'https://www.trustpilot.com/categories/electronics',
 'https://www.trustpilot.com/categories/entertainment',
 'https://www.trustpilot.com/categories/erotic',
 'https://www.trustpilot.com/categories/food_beverage',
 'https://www.trustpilot.com/categories/gambling',
 'https://www.trustpilot.com/categories/health_wellbeing',
 'https://www.trustpilot.com/categories/home_garden',
 'https://www.trustpilot.com/categories/kids',
 'https://www.trustpilot.com/categories/legal_services',
 'https://www.trustpilot.com/categories/leisure',
 'https://
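The manual check for relative links can also be handled by `urllib.parse.urljoin` from the standard library, which leaves absolute links untouched and resolves relative ones against a base. A minimal sketch on a made-up HTML snippet (`sample_html` and `extract_links` are illustrative names):

```python
from urllib.parse import urljoin

BASE = 'https://www.trustpilot.com'

def extract_links(html, base=BASE):
    """Collect all href targets in `html` as absolute URLs, using only str.split."""
    links = set()
    for chunk in html.split('href="')[1:]:
        path = chunk.split('"')[0]  # everything up to the closing quote
        links.add(urljoin(base, path))  # relative paths get the base prepended
    return links

# Made-up snippet with one relative and one absolute link
sample_html = '<a href="/categories/art">Art</a> <a href="https://support.trustpilot.com/hc">Help</a>'
print(sorted(extract_links(sample_html)))
```

Filtering on `'/categories/'` can then be applied to the returned set exactly as in the cell above.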

> **Ex. 8.2.2:** Get one of the category section links (any one will do). Write a function to extract the links to the company review pages from the HTML.


In [60]:
url = 'https://www.trustpilot.com/categories/clothes_fashion'
response = requests.get(url)
if response.ok:
    html = response.text
else:
    print('error')

def get_links(html):
    links = set()
    for link in html.split('href="')[1:]:
        path = link.split('"')[0]
        if 'trustpilot.com' not in path:  # relative link: prepend the domain
            path = 'https://www.trustpilot.com' + path
        links.add(path)
    return links

def get_company_links(html):
    return [link for link in get_links(html) if '/review/' in link] # check if the /review/ pattern is in the link and keep if true


company_links = get_company_links(html)
company_links

['https://www.trustpilot.com/review/soulrevolver.com',
 'https://www.trustpilot.com/review/ipromo.com',
 'https://www.trustpilot.com/review/www.primestyle.com',
 'https://www.trustpilot.com/review/www.baunat.com',
 'https://www.trustpilot.com/review/www.firmoo.com',
 'https://www.trustpilot.com/review/leibish.com',
 'https://www.trustpilot.com/review/advertscreenprinting.com',
 'https://www.trustpilot.com/review/www.swissluxury.com',
 'https://www.trustpilot.com/review/coutureusa.com',
 'https://www.trustpilot.com/review/www.phigora.com',
 'https://www.trustpilot.com/review/www.brilliance.com',
 'https://www.trustpilot.com/review/www.queensboro.com',
 'https://www.trustpilot.com/review/briangavindiamonds.com',
 'https://www.trustpilot.com/review/tinylittlemonster.com',
 'https://www.trustpilot.com/review/greenlakejewelry.com',
 'https://www.trustpilot.com/review/redtunashirtclub.com',
 'https://www.trustpilot.com/review/www.shopworn.com',
 'https://www.trustpilot.com/review/www.thenatu

> **Ex. 8.2.3:** Figure out how the pagination is done, by following how the url changes when pressing the **next page**-button to obtain more company listings. Write a function that builds links to paging all the company listing results of each category. This includes parsing the number of subpages of each category and changing the correct parameter in the url.
>
>> *Hint:* Find the maximum number of result pages, shown right before the next page button, and write a loop that changes the page parameter of the URL.


In [61]:
def get_all_category_pages(category_link): 
    # Get the broad category html
    response = requests.get(category_link)
    if response.ok:
        html = response.text
    else:
        print('error')
        return []  # empty list, so callers can still concatenate the result
    
    # Get the links in this html
    links = get_links(html)

    # find the max_page number.
    page_links = [link for link in links if '?page=' in link] # check if the paging parameter is in the link
    if len(page_links)==0: # no pages.
        return [category_link]
    n_pages = max([int(link.split('page=')[-1]) for link in page_links]) # extract the page value and take the max
    
    paging_links = [category_link] # define container and store the original result page
    
    q = category_link+'?page=%d' # define the varying parameter string.

    for num in range(2,n_pages+1): # build the links.
        paging_links.append(q%num)
    
    return paging_links

pages = get_all_category_pages(url)
pages

['https://www.trustpilot.com/categories/clothes_fashion',
 'https://www.trustpilot.com/categories/clothes_fashion?page=2',
 'https://www.trustpilot.com/categories/clothes_fashion?page=3',
 'https://www.trustpilot.com/categories/clothes_fashion?page=4',
 'https://www.trustpilot.com/categories/clothes_fashion?page=5',
 'https://www.trustpilot.com/categories/clothes_fashion?page=6',
 'https://www.trustpilot.com/categories/clothes_fashion?page=7',
 'https://www.trustpilot.com/categories/clothes_fashion?page=8',
 'https://www.trustpilot.com/categories/clothes_fashion?page=9',
 'https://www.trustpilot.com/categories/clothes_fashion?page=10',
 'https://www.trustpilot.com/categories/clothes_fashion?page=11',
 'https://www.trustpilot.com/categories/clothes_fashion?page=12',
 'https://www.trustpilot.com/categories/clothes_fashion?page=13',
 'https://www.trustpilot.com/categories/clothes_fashion?page=14',
 'https://www.trustpilot.com/categories/clothes_fashion?page=15',
 'https://www.trustpilot.c
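Extracting the page number with `split('page=')` breaks if a link carries further query parameters after `page`. As a hedged alternative, the query string can be parsed with the standard library; the helper names `max_page` and `build_paging_links` below are illustrative and not part of the solution above.

```python
from urllib.parse import urlparse, parse_qs

def max_page(links):
    """Return the highest `page` query value found in `links`, or 1 if none."""
    pages = []
    for link in links:
        values = parse_qs(urlparse(link).query).get('page', [])
        pages += [int(v) for v in values if v.isdigit()]
    return max(pages, default=1)

def build_paging_links(category_link, n_pages):
    """First page is the bare category link; later pages add ?page=N."""
    return [category_link] + [f'{category_link}?page={n}' for n in range(2, n_pages + 1)]

category = 'https://www.trustpilot.com/categories/art'
found = [category + '?page=2', category + '?page=15&sort=latest', category]
print(max_page(found))
print(build_paging_links(category, 3))
```

The `sort=latest` parameter in the example is made up to show why parsing the query string is more robust than splitting on `page=`.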

> **Ex. 8.2.4:** Loop through all categories and build the paging links using the above defined function.

In [66]:
# Build the paging links
company_listings = []

for link in links: 
    if 'support.trustpilot' in link:
        continue
    company_listings+=get_all_category_pages(link)
    print(link)

https://www.trustpilot.com/categories/leisure
https://www.trustpilot.com/categories/animals_and_pets
https://www.trustpilot.com/categories/utilities
https://www.trustpilot.com/categories/computer
https://www.trustpilot.com/categories/electronics
https://www.trustpilot.com/categories/online_services
https://www.trustpilot.com/categories/gambling
https://www.trustpilot.com/categories/other
https://www.trustpilot.com/categories/craftsman
https://www.trustpilot.com/categories/transportation
https://www.trustpilot.com/categories/art
https://www.trustpilot.com/categories/erotic
https://www.trustpilot.com/categories/health_wellbeing
https://www.trustpilot.com/categories/kids
https://www.trustpilot.com/categories/companies
https://www.trustpilot.com/categories/cloud_computing
https://www.trustpilot.com/categories/travel_holidays
https://www.trustpilot.com/categories/food_beverage
https://www.trustpilot.com/categories/home_garden
https://www.trustpilot.com/categories/legal_services
https://www.

> **Ex. 8.2.5:** Randomly pick one of the category listing links you have generated, and get the links to the companies listed using the other function defined. 


In [10]:
# [Answer to Ex.8.2.5]

> **Ex. 8.2.6:** Visit one of these links and inspect the **>Network Monitor<** to locate the request that loads the review data. Use the requests module to retrieve this link and unpack the json results to a pandas DataFrame.
>
>> _Hint:_ Look for a request which sends a link to a file called `jsonId`

In [11]:
#[Answer to Ex.8.2.6]

Congratulations on coming this far. You are now almost ready to deploy a scraper collecting all reviews on Trustpilot; you still need to figure out how to page the reviews and how to find the company ID in the HTML. 
If you want to see just how valuable such data could be, visit the following blog post: https://blog.openai.com/unsupervised-sentiment-neuron/