> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions in there, or **make a copy of this notebook and save it somewhere else** on your computer, not inside the `sds` folder that you cloned, so you can write your answers in there. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

# Exercise Set 8: Advanced Scrapers

In this Exercise Set we shall develop our webscraping skills even further by practicing using `Selenium` while parsing and navigating HTML trees using `BeautifulSoup`. Furthermore, we will practice extracting information from raw text, with no HTML tags to help, using regular expressions.

But just as importantly you will get a chance to think about **data quality issues** and how to ensure reliability when curating your own webdata. 

## Exercise Section 8.1: Translating domains into companies
This exercise is about solving a problem that Danish companies are facing. They all want to use external data, such as customer review data, to gain more knowledge about their customers and maybe even use the information as features in their models. There is just one problem: users rate domains, not companies. 

> **Ex. 8.1.1:** You work for the Danish authorities and are currently staffed on a project where you have to reduce the number of dangerous toys. You have built a webscraper that collects user reviews from Trustpilot and have identified some websites that have a bad reputation among their users. You believe that the risk of them selling illegal or dangerous toys might be bigger than for some of the big brands with good ratings, and you decide to investigate them. 

> Go to the website https://www.dk-hostmaster.dk/da/find-domaenenavn with selenium and search for "netbaby.dk". Store the name of the registrant "Euphemia Media" in the variable `company`.

In [None]:
#[Answer 8.1.1]

In [20]:
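from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.dk-hostmaster.dk/da/find-domaenenavn'
# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.implicitly_wait(5)  # wait up to 5 seconds for elements to appear
driver.get(url)

# dismiss the pop-up that covers the page
question = driver.find_element(By.ID, 'popup-buttons')
question.click()

# type the domain into the search field and submit with Enter
inputElement = driver.find_element(By.ID, 'query_domain')
inputElement.click()
inputElement.send_keys('netbaby.dk')
inputElement.send_keys(Keys.RETURN)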



[WDM] - Cache is valid for [12/08/2020]
[WDM] - Looking for [chromedriver 84.0.4147.30 mac64] driver in cache 
[WDM] - Driver found in cache [/Users/nicklasjohansen/.wdm/drivers/chromedriver/84.0.4147.30/mac64/chromedriver]


 


In [27]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml')
# select_one returns the first element matching the CSS selector
registrant = soup.select_one('#domain_registrant_name')
# get_text extracts the text inside the tag, so no regex is needed here
company = registrant.get_text(strip=True).lower()
print('company:', company)


company: euphemia media


> **Ex. 8.1.2:** Now you know who owns the domain and would like to know more about the company `euphemia media`. 

> Go to the Central Business Register website https://datacvr.virk.dk/data/. Figure out how to look up companies by changing the URL and then look up `euphemia media`. Store the CVR number in the variable `cvr` and print it. 

In [None]:
#[Answer 8.1.2]
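
As a starting point for this exercise, here is a minimal sketch of the two building blocks you will likely need: URL-encoding the company name into a query string, and pulling an 8-digit CVR number out of page text. The query parameter name `q` is only a placeholder and an assumption on my part; you have to find the real parameter by inspecting the URL in your browser after a manual search. The helper names and the example number are made up for illustration.

```python
import re
from urllib.parse import urlencode

BASE_URL = 'https://datacvr.virk.dk/data/'

def build_search_url(company_name, param='q'):
    """Append the URL-encoded company name to BASE_URL.
    NOTE: 'q' is a placeholder parameter name, not the site's real one."""
    return BASE_URL + '?' + urlencode({param: company_name})

# A CVR number consists of 8 digits; \b keeps us from cutting longer numbers.
CVR_RE = re.compile(r'\b\d{8}\b')

def extract_cvr(text):
    """Return the first 8-digit number found in text, or None."""
    match = CVR_RE.search(text)
    return match.group() if match else None

print(build_search_url('euphemia media'))
print(extract_cvr('CVR-nummer: 12345678'))  # made-up number for illustration
```

Keep in mind that other 8-digit numbers (for example Danish phone numbers) would also match, so inspect the page and narrow the search to the right element first.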

> **Ex. 8.1.3:** Congratulations. You are now able to translate domains into companies and thereby enrich whatever analysis you want to make. Let's say that you were to build a scraper that could translate thousands of domains. What kind of errors can you imagine running into, and how would you mitigate them?

In [None]:
#[Answer 8.1.3]

**I would look into**

1. Some domains are owned by private individuals and are therefore not present in CVR.  

2. An organization might have one name in DK Hostmaster and another in CVR (e.g. PRICEWATERHOUSECOOPERS STATSAUTORISERET).  

3. Are there any limitations on using these websites with our virtual browser? Maybe our connection will be shut down if we make 1000 requests in one hour or put too much pressure on their servers. If you want to work with CVR, I recommend that you sign up for their [API](https://datacvr.virk.dk/data/cvr-hj%C3%A6lp/cvr-adgange). It is free and well documented. 
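
The mitigations above could be sketched roughly as follows. This is not a finished scraper: `lookup` stands for whatever function you write to fetch the registrant of a domain (hypothetical here), and the list of legal-form suffixes is an assumption that you would need to extend. The sketch normalizes company names before comparing them across sources, pauses between requests, retries a few times, and logs failures instead of crashing.

```python
import re
import time

# Assumed list of Danish legal-form suffixes; extend as needed.
LEGAL_FORMS = {'a/s', 'aps', 'i/s', 'ivs', 'k/s', 'p/s'}

def normalize_name(name):
    """Lowercase, drop punctuation and legal-form suffixes."""
    tokens = re.sub(r'[^\w/ ]', ' ', name.lower()).split()
    return ' '.join(t for t in tokens if t not in LEGAL_FORMS)

def lookup_all(domains, lookup, delay=1.0, retries=2, sleep=time.sleep):
    """Call lookup(domain) for each domain, politely and fault-tolerantly.
    Returns (results, failures); failures maps domain -> error message."""
    results, failures = {}, {}
    for domain in domains:
        for attempt in range(retries + 1):
            try:
                results[domain] = lookup(domain)
                break
            except Exception as exc:  # log the error and keep going
                if attempt == retries:
                    failures[domain] = repr(exc)
        sleep(delay)  # be gentle with the server
    return results, failures
```

Storing the failures lets you inspect afterwards whether a domain failed because it is privately owned, because the page layout changed, or because you were blocked.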

## Exercise Section 8.2: Practicing Regular Expressions.
This exercise is about developing your experience with designing your own regular expressions.

Remember you can always consult the regular expression reference page [here](https://www.regular-expressions.info/refquick.html), if you need to remember or understand a specific symbol. 

You should practice using the *"define-inspect-refine" method* described in the lectures to systematically ***explore*** and ***refine*** your expressions, and save all the patterns you try. You can download the small module that I created to handle this in the following way: 
``` python
import requests
url = 'https://raw.githubusercontent.com/snorreralund/explore_regex/master/explore_regex.py'
response = requests.get(url)
with open('explore_regex.py','w') as f:
    f.write(response.text)
import explore_regex as e_re
```

Remember to start ***broad*** to gain many examples, and iteratively narrow and refine.

We will use a sample of the Trustpilot dataset that you practiced collecting yesterday.
You can load it directly into Python from the following link: https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv

> **Ex. 8.2.1:** Load the data used in the exercise using the `pd.read_csv` function. (Hint: the path to the file can be either a URL or a local system path.) 

>Define a variable `sample_string = '\n'.join(df.sample(2000).reviewBody)` as a sample of all the reviews that you will practice on. (Rerun it once in a while to get a new sample and check for potential differences.)

Imagine a company that wants to find the reviews where customers are concerned with the price of a service. It decides to write a regular expression to match all reviews where a currency and an amount are mentioned. 

> **Ex. 8.2.2:** 
> Write an expression that matches both the dollar-sign (\$) and dollar written literally, and the amount before or after a dollar-sign. Remember that the "$"-sign is a special character in regular expressions. Explore and refine using the explore_pattern function in the package I created called explore_regex. 
```python
import explore_regex as e_re
explore_regex = e_re.ExploreRegex(sample_string) # Initialize the ExploreRegex class.
explore_regex.explore_pattern(pattern) # Use the .explore_pattern method.
```


Start with exploring the context around digits ("\d") in the data. 
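
If you want to see what "exploring the context" means without the module, here is a small stand-alone sketch using only `re.finditer`. The function name `show_context` and the toy string are made up for illustration; the `explore_regex` module may work differently internally.

```python
import re

def show_context(pattern, text, context=10):
    """Return (match, surrounding text) pairs for every match of pattern."""
    out = []
    for m in re.finditer(pattern, text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        out.append((m.group(), text[start:end]))
    return out

toy = 'I paid $20 for shipping. Waited 3 weeks. Refund of 15 dollars.'
for match, ctx in show_context(r'\d+', toy):
    print(match, '|', ctx)
```

Looking at the characters around each digit tells you which of the numbers are prices and which are not, and thereby how to narrow the pattern in the next iteration.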

In [None]:
#[Answer 8.2.1-2]

In [4]:
import pandas as pd
import re
import requests

# download data
path2data = 'https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv'
df = pd.read_csv(path2data)

# download module
url = 'https://raw.githubusercontent.com/snorreralund/explore_regex/master/explore_regex.py'
response = requests.get(url)

# write the script to your folder to create a local module
with open('explore_regex.py','w') as f:
    f.write(response.text)

# import local module
import explore_regex as e_re

In [5]:
import re
digit_re = re.compile('[0-9]+') 
df['hasNumber'] = df.reviewBody.apply(lambda x: len(digit_re.findall(x))>0)
sample_string = '\n'.join(df[df['hasNumber']].sample(1000).reviewBody)
#sample_string

In [10]:
explore_regex = e_re.ExploreRegex(sample_string)

In [14]:
pattern = r'[£$] ?[0-9]+(?:[.,][0-9]+)?|[0-9]+(?:[.,][0-9]+)? ?(?:USD|usd)|[0-9]+(?:[.,][0-9]+)? ?(?:dollars|DOLLARS)'

explore_regex.explore_pattern(pattern,context=20)

------ Pattern: [£$] ?[0-9]+(?:[.,][0-9]+)?|[0-9]+(?:[.,][0-9]+)? ?(?:USD|usd)|[0-9]+(?:[.,][0-9]+)? ?(?:dollars|DOLLARS)	 Matched 234 patterns -----
Match: 100 dollars	Context:ed them that I lost 100 dollars because the dealer 
Match: $143	Context:cover from them for $143.  So the 48-hour so
Match: $8	Context:.54), he charged me $8, which is the price
Match: $5	Context:ally was to give me $5 off my next order, 
Match: $7.99	Context:
I have been paying $7.99 a month for the las
Match: $200	Context:$800.  Asked for my $200 cash back that I us
Match: $625	Context: estimated quote of $625 and then where char
Match: £40	Context:he hotel cheaper by £40. Very happy and wil
Match: $25.00	Context:r $75.00 vice 3 for $25.00.
Purchased 4 ticket
Match: $10	Context:l carbon plus about $10 worth of plastic pa


> **Ex. 8.2.3** Use the `.report()` method, i.e. `explore_regex.report()`, and print all the patterns tried during the development process using the `.patterns` attribute, i.e. `explore_regex.patterns`.


In [1]:
#[Answer 8.2.3]

In [13]:
explore_regex.patterns

['[£$] ?[0-9]+(?:[.,][0-9]+)?|[0-9]+(?:[.,][0-9]+)? ?(?:USD|usd)|[0-9]+(?:[.,][0-9]+)? ?(?:dollars|DOLLARS)']

> **Ex. 8.2.4** 
Finally, write a function that takes in a string and outputs whether there is a match. Use the `.search` function to see if there is a match anywhere in the string (hint: check that it does not return `None`, i.e. `re.search(pattern, string) is not None`). Note that `re.match` only matches at the beginning of the string, so it would miss prices mentioned later in a review.

> Define a column 'mention_currency' in the dataframe by applying the above function to the text column of the dataframe. 
***You should have approximately 310 reviews that match, but fewer is also alright.***
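
A common pitfall here is that `re.match` only looks for a match at the very beginning of the string, whereas `re.search` scans the whole string. The sketch below illustrates the difference with a simplified dollar pattern and a made-up review; `has_match` is a hypothetical helper name.

```python
import re

currency_pattern = re.compile(r'\$ ?\d+')  # simplified pattern for illustration
review = 'They charged me $8 for delivery.'

# match anchors at position 0, so it finds nothing in this review
print(currency_pattern.match(review))
# search looks everywhere in the string and finds the price
print(currency_pattern.search(review))

def has_match(pattern, string):
    """True if the compiled pattern occurs anywhere in string."""
    return pattern.search(string) is not None
```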

> **Ex. 8.2.5** Explore the relation between reviews mentioning prices and the average rating. 

> **Ex. 8.2.6 (extra)** Define a function that outputs the amount mentioned in the review (the largest one if several are mentioned), define a new column by applying it to the data, and explore whether reviews mentioning higher prices are worse than others by plotting the amount versus the rating.

In [2]:
#[Answer to 8.2.4-5]

In [13]:
pattern = r'[£$] ?[0-9]+(?:[.,][0-9]+)?|[0-9]+(?:[.,][0-9]+)? ?(?:USD|usd)|[0-9]+(?:[.,][0-9]+)? ?(?:dollars|DOLLARS)'
currency_re = re.compile(pattern)

def match_currency(string):
    # search scans the whole review and returns None when nothing matches
    return currency_re.search(string) is not None

df['mention_currency'] = df.reviewBody.apply(match_currency)

In [14]:
df.groupby('mention_currency').reviewRating_ratingValue.mean()

mention_currency
False    4.507275
True     2.935275
Name: reviewRating_ratingValue, dtype: float64
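
For the extra exercise 8.2.6, one possible approach is sketched below. It makes two simplifying assumptions that you should check against the data: only amounts prefixed with `$` or `£` are considered, and a comma is treated as a thousands separator while a dot is the decimal point. The names `AMOUNT_RE` and `largest_amount` are made up for this sketch.

```python
import re

# Capture the number after a currency sign, e.g. '$1,200' or '£7.99'.
AMOUNT_RE = re.compile(r'[£$] ?(\d+(?:,\d{3})*(?:\.\d+)?)')

def largest_amount(text):
    """Return the largest amount mentioned in text as a float, or None."""
    amounts = [float(a.replace(',', '')) for a in AMOUNT_RE.findall(text)]
    return max(amounts) if amounts else None

print(largest_amount('3 for $75.00 vice 3 for $25.00'))
print(largest_amount('Fast delivery, no complaints'))
```

You could then apply it with `df.reviewBody.apply(largest_amount)` and plot the resulting column against the rating.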

> **Ex. 8.2.7:** Now we write a regular expression to extract emoticons from text.
Start by locating all mouths ')' of emoticons, and develop the variations from there. Remember that parentheses are special characters in regex, so you should use the escape character.

In [3]:
#[Answer to 8.2.7]

In [17]:
sample_string = '\n'.join(df.sample(1000).reviewBody)
emoticon_regex = e_re.ExploreRegex(sample_string)

In [18]:
pattern = '[:;][-Oo]?[()D]'
emoticon_regex.explore_pattern(pattern,context=20)

------ Pattern: [:;][-Oo]?[()D]	 Matched 11 patterns -----
Match: :)	Context:d Jedco to anyone.  :)
It was a hassle fre
Match: :)	Context:ith Flyopedia again :)
Great Product.  Del
Match: :)	Context:e bag of dog treats :)
JustFly - super eas
Match: :)	Context:e with good bundles :)
fast delivery, well
Match: :)	Context: phone call anxiety :)
This process was de
Match: :)	Context:lpful. God bless her:)
I was pleased with 
Match: :)	Context:thank you HRKGames  :)
I like Moos's produ
Match: :)	Context: that you are doing :)
I love the site, th
Match: :)	Context:n the imei giveaway :)
GoMedigap's represe
Match: :)	Context:lier than predicted :)

Too bad you don'


In [21]:
emoticon_regex.pattern2n_match

{'[:;][-Oo]?[()D]': 11}
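
Once the pattern is reasonably refined, you may want to count which emoticons occur. Below is a small sketch on a made-up string, using a slightly broader pattern than the one above (it also allows `=` as eyes, `^` and `'` as noses, and `P`/`p` as mouths). These extensions are my own guesses at common variations and not something derived from the data.

```python
import re
from collections import Counter

# eyes, optional nose, mouth
emoticon_re = re.compile(r"[:;=][-o^']?[()DPp]")

toy = 'Great service :) fast delivery :-) but pricey :( ... thanks ;D'
counts = Counter(emoticon_re.findall(toy))
print(counts.most_common())
```

Be aware that broader patterns also produce more false positives; for instance, text like `Note:Please` would match `:P`, which is exactly the kind of case the inspect-and-refine loop should catch.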