> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `sds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

## Use Connector class for accessing the internet
Even if logging is not important for the exercises below, get in the habit of using this class whenever you connect to the internet, so that you practice logging your activity. This will be expected in the final exam.

You should run `pip install scraping_class` to install the module to be used.

In [None]:
import scraping_class
logfile = 'log.csv'  # name your log file
connector = scraping_class.Connector(logfile)
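
To make concrete what "logging your activity" can mean, here is a minimal sketch of a helper that appends one row per request to a CSV file. It is not the implementation of `scraping_class.Connector`: the function name `log_call`, the column names and the file layout are all hypothetical and chosen only for illustration.

```python
import csv
import os
import time

# Hypothetical column layout; the log written by the real Connector may differ.
LOG_COLUMNS = ['timestamp', 'project', 'url', 'status_code']

def log_call(logfile, project, url, status_code):
    """Append one row describing a request to a CSV log file."""
    is_new = not os.path.exists(logfile) or os.path.getsize(logfile) == 0
    with open(logfile, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(LOG_COLUMNS)  # write the header only once
        writer.writerow([time.time(), project, url, status_code])
```

Calling `log_call('log.csv', 'demo', 'https://example.com', 200)` after each request would leave a record of what was fetched and when, which is the kind of trail the exercises ask you to keep.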

# Exercise Set 8: Introduction to Web Scraping

*Afternoon, August 15, 2019*

In this Exercise Set we shall practice our web scraping skills utilizing only basic Python. We shall cover variations between static and dynamic pages.

## Exercise Section 8.1: Scraping Jobnet.dk

In this exercise you get to practice locating the request that the JavaScript sends to fetch the job data from which the job listings are built. You should use the **>Network Monitor<** tool in your browser.

Furthermore, you practice spotting how the pagination is done, not by clicking the next page button, but by changing a small parameter in the URL.

> **Ex. 8.1.1:** Visit the job listing webpage here: https://job.jobnet.dk/CV and locate the request that gets the job listing data using the **>Network Monitor<**. *(Hint: Filter by XHR files)*

> **Ex. 8.1.2.:** Use the `requests` module to collect the first 20 results and unpack the relevant `json` data into a `pandas` DataFrame.

> **Ex. 8.1.3.:** Store the 'TotalResultCount' value for later use.

In [None]:
# [Answer to Ex. 8.1.1-3 here]

In [None]:
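import requests
import pandas as pd
url = 'https://job.jobnet.dk/CV/FindWork/Search?SortValue=CreationDate&Offset=0'
response, call_id = connector.get(url, 'mapping_jobposting')
if not response.ok:
    # stop here: without a valid response there is nothing to parse
    raise RuntimeError('Request failed with status %d' % response.status_code)
d = response.json()
df = pd.DataFrame(d['JobPositionPostings'])  # Ex. 8.1.2: the first 20 results
n_listings = d['TotalResultCount']  # Ex. 8.1.3: total number of listings
df.head()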

> **Ex. 8.1.4:** This exercise is about paging the results. We need to understand the website's pagination scheme.

> Now scroll down the webpage and press the next page button. See how the parameters of the URL change as you turn the pages.

> **Ex. 8.1.5:** Design a `for` loop using the `range` function that changes this paging parameter in the URL. Use the 'TotalResultCount' value from before to define the limits of the `range` function. Store these URLs in a container.

> ***extra***: Change the SortValue parameter from BestMatch to CreationDate, to make the sorting amenable to updating results daily.

*(Hint: Notice that the parameter is an offset, and that it relates to the number of results per call.)*
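
One way to see which parameter changes between two pages is to parse the query strings and compare them. The sketch below uses `urllib.parse` from the standard library. The second URL is an assumption about what the request looks like after moving 20 results forward, and the helper name `query_params` is made up for this example.

```python
from urllib.parse import urlparse, parse_qs

def query_params(url):
    """Return the query parameters of a URL as a plain dict of strings."""
    parsed = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in parsed.items()}

# The first URL is the one requested earlier; the second is assumed to be
# the request made for the next page of 20 results.
page_1 = 'https://job.jobnet.dk/CV/FindWork/Search?SortValue=CreationDate&Offset=0'
page_2 = 'https://job.jobnet.dk/CV/FindWork/Search?SortValue=CreationDate&Offset=20'

p1, p2 = query_params(page_1), query_params(page_2)
changed = {key for key in p1 if p1[key] != p2.get(key)}
print(changed)  # the set of parameters that differ between the two pages
```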

In [None]:
# [Answer to Ex. 8.1.4-5 here]

In [None]:
q = 'https://job.jobnet.dk/CV/FindWork/Search?SortValue=CreationDate&Offset=%d'
links = []
# each call returns 20 results, so step the offset by 20 up to the total count
for offset in range(0, n_listings, 20):
    url = q % offset
    links.append(url)
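
The offset arithmetic can also be written with an explicit page count. The following is a sketch under the assumption that each call returns 20 results; the function name `build_page_urls` and the total of 45 are made up for illustration, and `urlencode` is used instead of string formatting to assemble the query string.

```python
import math
from urllib.parse import urlencode

BASE_URL = 'https://job.jobnet.dk/CV/FindWork/Search'
PAGE_SIZE = 20  # assumed number of results per call

def build_page_urls(total_results, page_size=PAGE_SIZE):
    """Build one URL per result page, stepping the Offset parameter."""
    n_pages = math.ceil(total_results / page_size)
    urls = []
    for page in range(n_pages):
        params = {'SortValue': 'CreationDate', 'Offset': page * page_size}
        urls.append(BASE_URL + '?' + urlencode(params))
    return urls

urls = build_page_urls(45)  # 45 is a made-up total
print(len(urls), urls[-1])
```

With 45 results the last page starts at offset 40, and with exactly 40 results only two pages are needed, so no empty page is requested.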

> **Ex. 8.1.6:** Pick 20 random links using the `random.sample()` function and collect them using the `Connector` class. Also use the `time.sleep()` function to limit the rate of your calls. Make sure to save the links already collected in a `set()` container to avoid having to reload links already collected. ***extra***: monitor the time left until the loop completes using the `tqdm.tqdm()` function.

> **Ex. 8.1.7:** Load all the results into a DataFrame.

In [None]:
# [Answer to Ex. 8.1.6-7 here]

In [None]:
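import random
import time
import tqdm
done = set()  # links already collected
data = []
for url in tqdm.tqdm(random.sample(links, 20)):
    if url in done:  # skip links collected earlier
        continue
    response, call_id = connector.get(url, 'download_jobposting')
    time.sleep(0.5)  # limit the rate of calls
    if not response.ok:
        print('error', url)
        continue  # do not reuse data from a previous response
    data += response.json()['JobPositionPostings']
    done.add(url)
df = pd.DataFrame(data)
df.sample(5)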
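
The idea of remembering collected links in a `set()` can be tried offline. This sketch replaces the real request with a stand-in function, so the names `collect` and `fake_fetch` and the one-letter "URLs" are purely illustrative.

```python
import time

def collect(urls, fetch, done, pause=0.0):
    """Fetch each URL at most once, pausing between calls."""
    results = []
    for url in urls:
        if url in done:
            continue  # already collected in an earlier call or iteration
        results.append(fetch(url))
        done.add(url)
        time.sleep(pause)  # rate limit; set to e.g. 0.5 for real requests
    return results

# Stand-in for a real request, so the sketch runs without network access.
calls = []
def fake_fetch(url):
    calls.append(url)
    return {'url': url}

done = set()
first = collect(['a', 'b', 'a'], fake_fetch, done)
second = collect(['b', 'c'], fake_fetch, done)
print(calls)  # every link appears once, even across the two runs
```

Because `done` lives outside the function, rerunning the collection after an interruption only fetches the links that are still missing.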

## Exercise Section 8.2: Scraping Trustpilot.com
Now for a slightly more elaborate, yet still simple scraping problem. Here we want to scrape Trustpilot for user reviews. This data is very nice since it provides free labeled data (the rating) to train a machine learning model to understand positive and negative sentiment.

Here you will practice crawling a website, collecting the links to each company review page, and finally locating another behind-the-scenes JavaScript request that gets the review data in a neat JSON format.

> **Ex. 8.2.1:** Visit the https://www.trustpilot.com/ website and locate the categories page.
From this page you find links to company listings.

> **Ex. 8.2.2:**
Get the category page using the `requests` module and extract each link to a specific category page from the HTML. This can be done using the basic python `.split()` string method. Make sure only links within the ***/categories/*** section are kept, checking each string using the ```if 'pattern' in string``` condition. 

*(Hint: The links are relative. You need to add the domain name)*
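
The two steps, splitting out the `href` values and adding the domain, can be tried on a small string first. The HTML fragment below is made up and not copied from Trustpilot. `urljoin` from the standard library is one way to turn a relative link into an absolute one; plain string concatenation works as well.

```python
from urllib.parse import urljoin

# Made-up HTML fragment for illustration only.
sample_html = ('<a href="/categories/pets">Pets</a> '
               '<a href="/about">About</a> '
               '<a href="/categories/travel">Travel</a>')

category_links = []
for chunk in sample_html.split('href="')[1:]:  # text following each href="
    link = chunk.split('"')[0]                 # cut at the closing quote
    if '/categories/' in link:                 # keep only category links
        category_links.append(urljoin('https://www.trustpilot.com', link))
print(category_links)
```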


In [None]:
# [Answer to Ex. 8.2.1-2]

In [None]:
url = 'https://www.trustpilot.com/categories/'
response, call_id = connector.get(url, 'mapping_categories')
if not response.ok:
    raise RuntimeError('Request failed with status %d' % response.status_code)
html = response.text
links = set()
for link_loc in html.split('href="')[1:]:  # text following each href="
    link = link_loc.split('"')[0]  # cut at the closing quote
    if '/categories/' in link:
        links.add(link)
print(len(links), list(links)[0])  # the links are relative
links = ['https://www.trustpilot.com' + link for link in links]  # add the domain to each link
links[:10]

> **Ex. 8.2.3:** Get one of the category section links. Write a function to extract the links to the company review page from the HTML.

> **Ex. 8.2.4:** Figure out how the pagination is done, by following how the url changes when pressing the **next page**-button to obtain more company listings. Write a function that builds links to paging all the company listing results of each category. This includes parsing the number of subpages of each category and changing the correct parameter in the url.

(Hint: Find the maximum number of result pages, right before the next page button and make a loop change the page parameter of the url.)


In [None]:
#[Answer to Ex.8.2.3-4]

In [None]:
url = 'https://www.trustpilot.com/categories/art'
response, _ = connector.get(url, 'mapping')
if not response.ok:
    raise RuntimeError('Request failed with status %d' % response.status_code)
html = response.text
def get_links(html):
    links = set()  # define container
    for link_loc in html.split('href="')[1:]:  # locate the start of a link
        link = link_loc.split('"')[0]  # split at the end of the link
        links.add(link)  # add every link; filtering happens in the caller
    return links
def get_company_links(html):
    # keep only links containing the /review/ pattern
    company_links = [link for link in get_links(html) if '/review/' in link]
    return company_links
company_links = get_company_links(html)
print(len(company_links))
def get_all_category_pages(category_link):
    response, _ = connector.get(category_link, 'mapping_categories')
    if not response.ok:
        print('error')
        return []  # an empty list keeps `+=` in a calling loop working
    html = response.text
    links = get_links(html)
    # find the max page: keep links that carry the paging parameter
    page_links = [link for link in links if '?page=' in link]
    if len(page_links) == 0:  # no pages.
        return [category_link]
    # extract the page value and take the max
    n_pages = max([int(link.split('page=')[-1]) for link in page_links])
    paging_links = [category_link]  # define container and store the original result page
    q = category_link + '?page=%d'  # define the varying parameter string.
    for num in range(2, n_pages + 1):  # build the links.
        paging_links.append(q % num)
    return paging_links
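
Extracting the page number by splitting on `page=` assumes that nothing follows the number in the link. As a hedged alternative, the sketch below reads the `page` parameter with `urllib.parse`, which also copes with additional query parameters. The helper name `max_page` and the sample links are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def max_page(links):
    """Return the highest ?page= value found among the links, or 1 if none."""
    pages = [1]
    for link in links:
        values = parse_qs(urlparse(link).query).get('page', [])
        pages += [int(v) for v in values if v.isdigit()]
    return max(pages)

# Made-up links: one has an extra parameter after the page number.
sample_links = [
    '/categories/art?page=2',
    '/categories/art?page=12&sort=name',
    '/review/example.com',
]
print(max_page(sample_links))
```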

In [None]:
company_links

> **Ex. 8.2.5:** Loop through all categories and build the paging links using the above defined function.

> **Ex. 8.2.6:** Randomly pick one of the category listing links you have generated, and get the links to the companies listed using the other function defined.

> **Ex. 8.2.7:** Visit one of these links and inspect the **>Network Monitor<** to locate the request that loads the review data. Use the requests module to retrieve this link and unpack the json results to a pandas DataFrame.


In [None]:
#[Answer to Ex.8.2.5-7]

In [None]:
# ex. 8.2.5. Build the paging links
company_listings = []
for link in links:
    if 'support.trustpilot' in link:
        continue
    company_listings += get_all_category_pages(link)
print('We need to visit %d company listing pages to collect all company addresses' % len(company_listings))
# ex. 8.2.6
listing_link = random.choice(company_listings)
response, _ = connector.get(listing_link, 'mapping_companies')
company_links = get_company_links(response.text)  # the function expects HTML, not a URL
# ex. 8.2.7
direct_link = 'https://www.trustpilot.com/review/59f33de00000ff0005aed171/jsonld'
response, call_id = connector.get(direct_link, 'download_review')
d = response.json()  # parse json using the built-in method.
df = pd.DataFrame(d[0]['review'])
df.head()
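
Review records are often nested, so some DataFrame columns end up holding whole dictionaries. The sketch below flattens nested dictionaries into dotted column names before building a table. The records are made up and only mimic the general shape of nested review data; the actual keys returned by Trustpilot may differ.

```python
def flatten(record, prefix=''):
    """Flatten nested dicts into one level, joining keys with a dot."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        else:
            flat[name] = value
    return flat

# Made-up records for illustration only.
reviews = [
    {'author': {'name': 'A'}, 'reviewRating': {'ratingValue': '5'}, 'reviewBody': 'Great'},
    {'author': {'name': 'B'}, 'reviewRating': {'ratingValue': '1'}, 'reviewBody': 'Bad'},
]
rows = [flatten(r) for r in reviews]
print(rows[0])
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame`. pandas also offers `pd.json_normalize`, which performs a similar flattening.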

Congratulations on coming this far. By now you are almost ready to deploy a scraper collecting all reviews on Trustpilot; you still need to figure out how to page the reviews and how to find the company ID in the HTML.
If you want to see just how valuable such data could be, visit the following blog post: https://blog.openai.com/unsupervised-sentiment-neuron/