> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `sds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

## Use the Connector class for accessing the internet
Even if logging is not important for the exercises below, get in the habit of using this class for connecting to the internet, to practice logging your activity. This will be expected in the final exam.

You should run `pip install scraping_class` to install the module to be used.

In [1]:
#pip install scraping_class
#pip install selenium

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import selenium
import time

In [3]:
import scraping_class
logfile = 'log_exercise_6.csv'  # name your log file
connector = scraping_class.Connector(logfile)

# Exercise Set 6: Introduction to Web Scraping

In this Exercise Set we shall practice our web scraping skills utilizing only basic Python.  
We shall cover the differences between static and dynamic pages.

## Exercise Section 6.1: Scraping Jobnet.dk

In this exercise you get to practice locating the request that the JavaScript sends to get the job data that it builds the job listings from. You should use the **>Network Monitor<** tool in your browser.

You will also practice spotting how the pagination is done: not by clicking the next page button, but by changing a small parameter in the URL.

> **Ex. 6.1.1:** Visit the job listing webpage here: https://job.jobnet.dk/CV and locate the request that gets the job listing data using the **>Network Monitor<**. *(Hint: Filter by XHR files)*  

> **Ex. 6.1.2.:** Use the `requests` module to collect the first 20 results and unpack the relevant `json` data into a `pandas` DataFrame.

> **Ex. 6.1.3.:** Store the 'TotalResultCount' value for later use.

In [6]:
import pandas as pd

# Ex. 6.1.1: the request that returns the job listing data
url = 'https://job.jobnet.dk/CV/FindWork/Search'

# Ex. 6.1.2: request the url and parse the JSON body once
response = requests.get(url)
result = response.json()

# Unpack the list of postings into a pandas DataFrame
jobs = pd.DataFrame(result['JobPositionPostings'])
print(jobs)

# Ex. 6.1.3: store the 'TotalResultCount' value for later use
TotalResultCount = result['TotalResultCount']
print(TotalResultCount)


    AutomatchType  Abroad  Weight  \
0               0   False     1.0   
1               0   False     1.0   
2               0   False     1.0   
3               0   False     1.0   
4               0   False     1.0   
5               0   False     1.0   
6               0   False     1.0   
7               0   False     1.0   
8               0   False     1.0   
9               0   False     1.0   
10              0   False     1.0   
11              0   False     1.0   
12              0   False     1.0   
13              0   False     1.0   
14              0   False     1.0   
15              0   False     1.0   
16              0   False     1.0   
17              0    True     1.0   
18              0   False     1.0   
19              0   False     1.0   

                                                Title  \
0        Servicemedarbejder over 18 år, Frederiksværk   
1   Tømrere og snedkere - vi udvider i Nordsjællan...   
2                          Pædagogmedhjælper 37 tim

> **Ex. 6.1.4:** This exercise is about paging the results. We need to understand the website's pagination scheme. 

> Now scroll down the webpage and press the next page button. See how the parameters of the URL change as you turn the pages.

> **Ex. 6.1.5:** Design a `for` loop using the `range` function that changes this paging parameter in the URL. Use the `TotalResultCount` value from before to define the limits of the range function. Store these URLs in a container. 

> **extra** Change the SortValue parameter from BestMatch to CreationDate, to make the sorting amenable to updating results daily.

*(HINT: See that the parameter is an offset and that this relates to the number of results per call made.)*

In [None]:
# Ex. 6.1.4:
# When pressing the 'next' button, Offset=0 changes to Offset=20:
# https://job.jobnet.dk/CV/FindWork/Search?Offset=20&SortValue=BestMatch

# Ex. 6.1.5: one URL per page of 20 results
jobs_url = []
for i in range(0, TotalResultCount, 20):
    url = f'https://job.jobnet.dk/CV/FindWork/Search?Offset={i}&SortValue=BestMatch'
    jobs_url.append(url)

# extra: sort by CreationDate instead of BestMatch, to make the sorting amenable to updating results daily
jobs_url_sorted = []
for i in range(0, TotalResultCount, 20):
    url = f'https://job.jobnet.dk/CV/FindWork/Search?Offset={i}&SortValue=CreationDate'
    jobs_url_sorted.append(url)

jobs_url_sorted
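
The offset logic from the hint can also be wrapped in a small helper. This is only a sketch: the function name `build_page_urls` and its `page_size` and `sort_value` arguments are hypothetical, and it assumes the endpoint keeps returning 20 results per call as observed above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://job.jobnet.dk/CV/FindWork/Search'

def build_page_urls(total_results, page_size=20, sort_value='CreationDate'):
    """Return one URL per result page, stepping the Offset by page_size."""
    urls = []
    for offset in range(0, total_results, page_size):
        # urlencode builds the query string and escapes values if needed
        query = urlencode({'Offset': offset, 'SortValue': sort_value})
        urls.append(f'{BASE_URL}?{query}')
    return urls

# 45 results at 20 per call need three calls: offsets 0, 20 and 40.
example_urls = build_page_urls(45)
```

Calling `build_page_urls(TotalResultCount)` should then give the same list as the second loop above.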
    

> **Ex. 6.1.6:** Pick 20 random links using the `random.sample()` function and collect them using the `Connector` class. Also use the `time.sleep()` function to limit the rate of your calls. Make sure to save the links already collected in a `set()` container to avoid having to reload links already collected. ***extra***: monitor the time left to complete the loop using the `tqdm.tqdm()` function.

> **Ex. 6.1.7:** Load all the results into a DataFrame.

In [12]:

import random, time, tqdm

# Draw the 20 random links once, so the printed links are the ones collected
sample_urls = random.sample(jobs_url_sorted, 20)
print(sample_urls)

done = set()  # links already collected
data = []
for url in tqdm.tqdm(sample_urls):
    if url in done:  # skip links that were already collected
        continue
    response, call_id = connector.get(url, 'jobpostings')
    if response.ok:
        d = response.json()  # json is a method and must be called
        data += d['JobPositionPostings']
        done.add(url)
    else:
        print('Error')
    time.sleep(0.5)
df = pd.DataFrame(data)
df.sample(5)


['https://job.jobnet.dk/CV/FindWork/Search?Offset=800&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=0&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=11240&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=12860&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=4380&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=1380&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=12240&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=11120&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=7460&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=6500&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=12980&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=680&SortValue=CreationDate', 'https://job.jobnet.dk/CV/FindWork/Search?Offset=78
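
The collection pattern from Ex. 6.1.6 (random sample, rate limit, a `set()` of finished links) can be tried without network access. This is a minimal sketch under stated assumptions: `collect` and `fake_fetch` are hypothetical names, the `example.com` links are placeholders, and `fetch` stands in for a real request that returns the parsed JSON or `None` on failure.

```python
import random
import time

def collect(urls, fetch, done, n=20, delay=0.5):
    """Fetch up to n randomly chosen urls that are not already in done."""
    remaining = [u for u in urls if u not in done]
    rows = []
    for url in random.sample(remaining, min(n, len(remaining))):
        payload = fetch(url)   # parsed JSON (a dict), or None on failure
        if payload is None:
            continue           # failed links stay out of `done` and can be retried
        rows.extend(payload.get('JobPositionPostings', []))
        done.add(url)
        time.sleep(delay)      # limit the rate of calls
    return rows

# Stand-in for a real request, so the sketch runs offline.
def fake_fetch(url):
    return {'JobPositionPostings': [{'Url': url}]}

done = set()
links = [f'https://example.com/search?Offset={i}' for i in range(0, 100, 20)]
first = collect(links, fake_fetch, done, n=3, delay=0)
second = collect(links, fake_fetch, done, n=3, delay=0)  # only the 2 unseen links remain
```

Because `done` lives outside the function, a second call never reloads a link that the first call already collected. With the `Connector`, `fetch` could be a small wrapper around `connector.get(url, 'jobpostings')` that returns `response.json()` when `response.ok` is true and `None` otherwise.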

## Exercise Section 6.2: Scraping Trustpilot.com
Now for a slightly more elaborate, yet still simple scraping problem. Here we want to scrape Trustpilot for user reviews. This data is very nice since it provides free labeled data (the rating) to train a machine learning model to understand positive and negative sentiment. 

Here you will practice crawling a website to collect the links to each company review page, and finally locate another behind-the-scenes JavaScript request that gets the review data in a neat json format.

> **Ex. 6.2.1:** Visit the https://www.trustpilot.com/ website and locate the categories page.
From this page you find links to company listings.

> **Ex. 6.2.2:**
Get the category page using the `requests` module and extract each link to a specific category page from the HTML. This can be done using the basic python `.split()` string method. Make sure only links within the ***/categories/*** section are kept, checking each string using the ```if 'pattern' in string``` condition. 

*(Hint: The links are relative. You need to add the domain name)*
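
The `.split()` approach from Ex. 6.2.2 might look like the sketch below. It runs on a made-up HTML snippet rather than the live page; the function name `extract_category_links` and the example category slugs are hypothetical, and the real page markup may differ.

```python
def extract_category_links(html, domain='https://www.trustpilot.com'):
    """Collect absolute links to category pages using only str.split()."""
    links = set()
    # Every chunk after the first starts right after an opening href=" attribute.
    for chunk in html.split('href="')[1:]:
        link = chunk.split('"')[0]       # the link ends at the closing quote
        if '/categories/' in link:
            if link.startswith('/'):     # relative link: prepend the domain
                link = domain + link
            links.add(link)
    return sorted(links)

# Hypothetical snippet standing in for the downloaded category page.
sample_html = (
    '<a href="/categories/animals_pets">Animals</a>'
    '<a href="/review/example.com">A company</a>'
    '<a href="/categories/food_beverages_tobacco">Food</a>'
    '<a href="/categories/animals_pets">Animals again</a>'
)
category_links = extract_category_links(sample_html)
```

Using a `set` removes links that appear more than once on the page, and the `'/categories/' in link` check drops the review link.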


In [8]:
# [Answer to Ex. 6.2.1-2]

> **Ex. 6.2.3:** Get one of the category section links. Write a function to extract the links to the company review page from the HTML.

> **Ex. 6.2.4:** Figure out how the pagination is done by following how the URL changes when pressing the **next page**-button to obtain more company listings. Write a function that builds the links for paging through all the company listing results of each category. This includes parsing the number of subpages of each category and changing the correct parameter in the URL.

(Hint: Find the maximum number of result pages, right before the next page button, and make a loop that changes the page parameter of the URL.)
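
The two steps in the hint can be sketched as follows. Everything specific here is an assumption to verify in the browser: that the paging parameter is called `page`, that the category URL carries no other query string, and that the pagination buttons carry the labels shown. The function names are hypothetical.

```python
def max_page_from_labels(labels):
    """Return the largest page number among the pagination button labels."""
    # Non-numeric labels such as 'Next page' are ignored; default to a single page.
    return max((int(label) for label in labels if label.isdigit()), default=1)

def build_paging_links(category_url, max_page, param='page'):
    """Return links to pages 1..max_page of one category listing."""
    links = []
    for page in range(1, max_page + 1):
        links.append(f'{category_url}?{param}={page}')
    return links

# Hypothetical button labels, as they might be parsed from the HTML.
labels = ['1', '2', '3', '…', '17', 'Next page']
last_page = max_page_from_labels(labels)
paging_links = build_paging_links(
    'https://www.trustpilot.com/categories/animals_pets', last_page)
```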


In [12]:
#[Answer to Ex.6.2.3-4]

> **Ex. 6.2.5:** Loop through all categories and build the paging links using the above defined function.

> **Ex. 6.2.6:** Randomly pick one of the category listing links you have generated, and get the links to the companies listed using the other function defined. 

> **Ex. 6.2.7:** Visit one of these links and inspect the **>Network Monitor<** to locate the request that loads the review data. Use the requests module to retrieve this link and unpack the json results to a pandas DataFrame.
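
JSON returned by such requests is often nested, which makes for awkward DataFrame columns. A sketch of flattening each record first is shown below. The payload and its field names (`reviews`, `rating`, `text`, `consumer`) are made up for illustration; the real structure has to be read off the response found in the Network Monitor.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))  # recurse into nested dicts
        else:
            flat[full_key] = value
    return flat

# Hypothetical payload; the real field names will differ.
payload = {'reviews': [
    {'rating': 5, 'text': 'Great', 'consumer': {'name': 'A'}},
    {'rating': 1, 'text': 'Poor', 'consumer': {'name': 'B'}},
]}
rows = [flatten(review) for review in payload['reviews']]
```

Passing `rows` to `pd.DataFrame` then gives one column per flattened key. `pandas.json_normalize` performs a similar flattening directly on a list of nested records.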


In [15]:
#[Answer to Ex.6.2.5-7]