# Web Scraping Advanced (oDCM)


--- 

## Learning Objectives

Students will be able to: 
* A
* B
* C


--- 

## Acknowledgements
This course draws on online resources built by Brian Keegan, Colt Steele, David Amos, Hannah Cushman Garland, Kimberly Fessel, and Thomas Laetsch. 


--- 

## Contact
For technical issues try to be as specific as possible (e.g., include screenshots, your notebook, errors) so that we can help you better.

**WhatsApp**  
+31 13 466 8938

**Email**  
odcm@uvt.nl

---

## 1. Advanced Web Scraping
* Understand the difference between headless and browser emulation, and apply both methods (using Selenium).
* Emulate user interaction with a site using timers, clicks, scrolling, and filling in forms 
* Access data that is hidden behind a login-screen
* Preprocess raw data with regular expressions (e.g., special characters, thousand separators, trailing and leading spaces)
* Custom user agents
* Throttling
* Regular expressions
* Feature engineering (datetime, week, year, TextBlob sentiment)
* Error handling (e.g., 404 pages)


* https://github.com/kimfetti/Conferences/tree/master/PyCon_2020
* https://www.youtube.com/watch?v=RUQWPJ1T6Zc&t=190s
* https://github.com/hancush/web-scraping-with-python/blob/master/session/web-scraping-with-python.ipynb#HTML-basics
* https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/7991196#overview
* https://campus.datacamp.com/courses/web-scraping-with-python/introduction-to-html?ex=1
* https://realpython.com/python-web-scraping-practical-introduction/
* https://github.com/CU-ITSS/Web-Data-Scraping-S2019

### Regular expressions
* Regex = regular expressions
* A way of describing patterns to search for within strings
* Not a Python-specific topic
* Hideous and very difficult to read (not Pythonic in style)
* There are a ton of regex symbols -> we're just going to cover the most important ones
* Cheat sheet: https://www.rexegg.com/regex-quickstart.html
* Test your regex: https://pythex.org

Potential use cases
* Validating credit card numbers
* Validating phone numbers (website forms)
* Advanced find/replace in text
    * Check if words are duplicated (once upon a time time)
* Formatting text/output
* Syntax highlighting (as you see in an IDE)
* Validating emails (check: does it follow the right format?)
    * letters + @ + letters.letters
    * Simply checking whether the string contains an @ symbol is easy (`"@" in ...`)
    * But the @ may not be at the start or at the end
    * There may not be more than one @
    * The @ must come before the .com
    * This gets complicated because the "." can also occur in other places (roy.klaasse.bos@gmail.com)
    * Without regex you would normally have to write many if-statements for this


* Formula
    * Starts with 1 or more letters, numbers, +, _, - or . then
    * A single @ sign
    * 1 or more letters, numbers or - then
    * A single dot
    * Ends with 1 or more letters, numbers, - or .

`(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)`
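
As a minimal sketch, the formula above can be tried out with Python's `re` module. The helper name `is_valid_email` and the sample addresses are illustrative assumptions, and the pattern is a simplified format check that does not cover every address the email standards allow.

```python
import re

# Pattern following the formula above (a simplified format check)
EMAIL_REGEX = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')

def is_valid_email(address):
    """Return True if the address follows the simplified format (hypothetical helper)."""
    return EMAIL_REGEX.search(address) is not None

print(is_valid_email("roy.klaasse.bos@gmail.com"))  # True
print(is_valid_email("@gmail.com"))                 # False: nothing before the @
print(is_valid_email("roy@@gmail.com"))             # False: more than one @
print(is_valid_email("roy.gmail.com"))              # False: no @ at all
```

One pattern replaces the chain of if-statements that the manual checks would otherwise need.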


### Basic Syntax
* Regular letters match themselves
* Escape special characters (e.g., `:\)` to match a smiley)
* \d\d (for double digits)
* A capitalized shorthand means NOT (\s vs \S)


* \d = digit 0-9
* \w = letter, digit or underscore (word character)
    * both lowercase and uppercase letters
    * accented letters are not included in ASCII-only matching; in Python 3, \w on `str` patterns also matches them unless `re.ASCII` is set
* \s = whitespace character (not only a space, but also a tab or newline)
    * this is how it differs from typing a literal space, which matches only the space character itself -> handy when a word may be followed by a tab or line break instead of a space
* \D = not a digit
* \W = not a word character
* \S = not a whitespace character
* . = any character except line break
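
The shorthand classes above can be explored with `re.findall`, as in the sketch below. The sample string is made up for illustration. Note that in Python 3, `\w` on `str` patterns also matches accented letters unless the `re.ASCII` flag is passed.

```python
import re

# Hypothetical sample text; "caf\u00e9" is "café" written with an escape sequence
sample = "Room 42,\tfloor_3 caf\u00e9"

digits = re.findall(r'\d', sample)        # every single digit
pairs = re.findall(r'\d\d', sample)       # two digits in a row
words = re.findall(r'\w+', sample)        # runs of word characters
spaces = re.findall(r'\s', sample)        # spaces and the tab
non_spaces = re.findall(r'\S+', sample)   # runs of non-whitespace characters
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)  # ASCII-only \w

print(digits)       # ['4', '2', '3']
print(pairs)        # ['42']
print(words)        # ['Room', '42', 'floor_3', 'café']
print(spaces)       # [' ', '\t', ' ']
print(non_spaces)   # ['Room', '42,', 'floor_3', 'café']
print(ascii_words)  # ['Room', '42', 'floor_3', 'caf']
```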

### Quantifiers

* \+ = one or more
* {3} = exactly 3 times
* {3,5} = 3 to 5 times
* {4,} = 4 or more times
* \* = 0 or more times
* ? = once or none


Examples
* `ab*c` = a[zero or more b's]c
    * the quantifier applies to the character right before it (not to `ab` as a whole)
* `06-?12345678` = matches both 0612345678 and 06-12345678
* `7{3}` matches 777, but also finds a match inside 4567777789
* `hi{2,}` is a single "h" and then "i" repeated two or more times (so not "hi" repeated two times)
* `0?\d` = matches "00", "9", "0", "03"
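
The examples above can be checked in Python. This is a small sketch; the test strings are chosen for illustration only.

```python
import re

# * applies only to the character right before it: zero or more b's
star = re.findall(r'ab*c', "ac abc abbbc ababc")
print(star)      # ['ac', 'abc', 'abbbc', 'abc']

# ? makes the dash optional
optional = re.findall(r'06-?12345678', "0612345678 or 06-12345678")
print(optional)  # ['0612345678', '06-12345678']

# {3} also finds three sevens inside a longer run of digits
sevens = re.search(r'7{3}', "4567777789").group()
print(sevens)    # 777

# {2,} repeats only the "i", not "hi"
his = re.findall(r'hi{2,}', "hi hii hiiii hihi")
print(his)       # ['hii', 'hiiii']

# 0?\d matches one digit with an optional leading zero
leading_zero = [bool(re.fullmatch(r'0?\d', s)) for s in ["00", "9", "0", "03"]]
print(leading_zero)  # [True, True, True, True]
```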

### Character Classes and Sets
* Any vowel: `eaiou` does not work (it only matches that literal sequence) -> `[eaiou]`
* Ranges of characters
    * lowercase `[a-z]`, or a smaller range such as `[a-f]`
    * uppercase `[A-Z]`
* `^` within a set negates it
    * `[^A-Z]` = not A-Z
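
A short sketch of sets, ranges and negation in Python, using made-up sample strings:

```python
import re

# A set matches any ONE of the listed characters
vowels = re.findall(r'[eaiou]', "regular expressions")
print(vowels)       # ['e', 'u', 'a', 'e', 'e', 'i', 'o']

# Without brackets the pattern is the literal sequence "eaiou"
literal = re.findall(r'eaiou', "regular expressions")
print(literal)      # []

# A range restricted to the letters a through f
hex_letters = re.findall(r'[a-f]+', "deadbeef cafe xyz")
print(hex_letters)  # ['deadbeef', 'cafe']

# ^ as the first character inside the brackets negates the set
not_upper = re.findall(r'[^A-Z]+', "ABCdefGHI")
print(not_upper)    # ['def']
```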

### Anchors and boundaries
* ^ = start of string or line
* $ = end of string or line
* \b = word boundary (e.g., the first word of a sentence: there is no space before it, but it should still be matched)

Examples
* problem with `\d{3} \d{3}-?\d{4}` for a phone number -> there may be all kinds of other text before or after it
    * combine `^` and `$` to exclude other text
    * `^\d{3} \d{3}-\d{4}$` = the string must start with the 3 digits and end with the 4 digits
* `^\d{3}$` does not match `Yay I got 777`
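
The effect of anchors and word boundaries can be seen in the sketch below. The sample sentences are illustrative assumptions.

```python
import re

phone = re.compile(r'^\d{3} \d{3}-\d{4}$')

# Anchored: the whole string must be the phone number
exact = bool(phone.search("415 555-4242"))
surrounded = bool(phone.search("Call me at 415 555-4242 today"))
print(exact)       # True
print(surrounded)  # False

# ^\d{3}$ needs the string to consist of exactly three digits
yay = re.search(r'^\d{3}$', "Yay I got 777")
print(yay)         # None

# \b matches at the edge of a word without consuming a character
cats = re.findall(r'\bcat\b', "cat concat cat.")
print(cats)        # ['cat', 'cat']
```

The "cat" inside "concat" is skipped because there is no word boundary before its "c".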
  

### Logical Or and Capture Groups

* `|` = OR: `\(\d{3}\)|\d{3}` = 3 digits with or without parentheses
* `()` to capture groups
    * `\(\d{3}\)|\d{3} \d{3} \d{4}` compares 3 digits with parentheses against 10 digits without parentheses. If the choice should apply only to the first 3 digits (with or without parentheses), you have to wrap that part in parentheses: `(\(\d{3}\)|\d{3}) \d{3} \d{4}`
    * Even if you don't strictly need it, `()` can still be handy to create a group -> e.g., separating a name from its title (Mr. / Ms.)
    * You will use this a lot, so that you don't have to split the match manually afterwards
    
Examples
* `https?://([A-Za-z0-9_-]+\.[A-Za-z0-9_-]+)`
    * Only the part after `http://` is stored as a group
* `Mr.|Mister Holmes` -> Mr. OR Mister Holmes (because of a lack of parentheses)
* Escape group symbol 
    * Which regex would match both of the following strings (`cat(s)` AND `dog(s)`)
    * `\w{3}\(s\)`
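
The examples above can be sketched in Python as follows. The sample sentences and the `example.com` URL are illustrative assumptions, and the dot after "Mr" is escaped here so that it matches a literal period.

```python
import re

# Without parentheses, | splits the WHOLE pattern: "Mr." OR "Mister Holmes"
no_group = re.findall(r'Mr\.|Mister Holmes', "Mr. Watson and Mister Holmes")
print(no_group)  # ['Mr.', 'Mister Holmes']

# With parentheses, the alternation is limited to the title
title, name = re.search(r'(Mr\.|Mister) (Holmes)', "I met Mr. Holmes").groups()
print(title, name)  # Mr. Holmes

# Only the part after the scheme is captured as group 1
domain = re.search(r'https?://([A-Za-z0-9_-]+\.[A-Za-z0-9_-]+)',
                   "see https://example.com/page").group(1)
print(domain)  # example.com

# Escaped parentheses match literal ( and ) instead of creating a group
plurals = re.findall(r'\w{3}\(s\)', "cat(s) and dog(s)")
print(plurals)  # ['cat(s)', 'dog(s)']
```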

### Re Module
* https://docs.python.org/3/library/re.html
* `r` = raw string (otherwise you have to use double backslashes; it avoids `\t` being seen as a TAB)
* compiling the pattern separately (`re.compile`) vs passing it directly to the function
    * if you use the pattern more than once -> use `re.compile()`
* `search` = at most 1 result (the first match)
* `findall` = returns all results
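
To see why the `r` prefix matters, the sketch below compares a regular string with a raw string. The sample sentence is made up for illustration.

```python
import re

tab_len = len('\t')       # Python turns \t into a single TAB character
raw_tab_len = len(r'\t')  # a backslash followed by the letter t
print(tab_len, raw_tab_len)  # 1 2

# Without r, '\b' is a backspace character, so the word boundary is lost
plain = re.findall('\bcat\b', "a cat sat")
raw = re.findall(r'\bcat\b', "a cat sat")
doubled = re.findall('\\bcat\\b', "a cat sat")  # same as raw, but harder to read
print(plain)    # []
print(raw)      # ['cat']
print(doubled)  # ['cat']
```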

```python
# import regex module
import re

# define our phone number regex
pattern = re.compile(r'\d{3} \d{3}-\d{4}')

# search a string with our regex (returns only the first match)
result = pattern.search('Call me at 415 555-4242 or 310 234-9999!')
print(result.group())

# findall returns every match as a list of strings
result2 = pattern.findall('Call me at 415 555-4242 or 310 234-9999!')
print(result2)

# instead of creating a separate pattern object, pass the pattern straight to re.findall
print(re.findall(r'\d{3} \d{3}-\d{4}', 'Call me at 415 555-4242 or 310 234-9999!'))
```

```
415 555-4242
['415 555-4242', '310 234-9999']
['415 555-4242', '310 234-9999']
```


```python
import re

def extract_phone(text):
    # \b on both sides rejects numbers that are part of a longer run of digits
    phone_regex = re.compile(r'\b\d{3} \d{3}-\d{4}\b')
    match = phone_regex.search(text)
    if match:
        return match.group()
    return None

print(extract_phone("my number is 432 567-8976"))
print(extract_phone("my number is 432 567-897622"))
```

```
432 567-8976
None
```


### Parsing URLs
* Breaking things up (`match.groups()`)


```python
import re

# group 1 = scheme, group 2 = host, group 3 = path (the dot after www is escaped)
url_regex = re.compile(r'(https?)://(www\.[A-Za-z-]{2,256}\.[a-z]{2,6})([-a-zA-Z0-9@:%_+.~#?&/=]*)')
match = url_regex.search("http://www.youtube.com/videos/asd/das/asd")
print(match.groups())
print(match.groups()[2])
```

```
('http', 'www.youtube.com', '/videos/asd/das/asd')
/videos/asd/das/asd
```


```python
import re

def parse_date(text):
    # day, month and year separated by /, . or ,
    date_regex = re.compile(r'(\d{2})[/.,](\d{2})[/.,](\d{4})')
    match = date_regex.search(text)
    if match is None:
        return None  # no date found in the input
    day, month, year = match.groups()
    return {"d": day, "m": month, "y": year}

parse_date('12.04.2003')
```

```
{'d': '12', 'm': '04', 'y': '2003'}
```

### Compilation Flags
* `IGNORECASE` = no distinction between lower and upper case, so `[a-z]` also picks up capital letters
* `VERBOSE` = lets you spread a pattern across multiple lines with comments (useful for very long regular expressions) -> ignores whitespace in the pattern
* Combine multiple compilation flags with a pipe (`|`) symbol

```python
import re

pattern = re.compile(r"""
    ^([a-z0-9_\.-]+)      # first part of email
    @                     # single @sign
    ([a-z0-9_\.-]+)       # email provider
    \.                    # single period
    ([a-z0-9_\.-]{2,6})$  # com, org, net, etc.
    """, re.VERBOSE | re.IGNORECASE)

match = pattern.search("Thomas123@Yahoo.com")
print(match.groups())
```

```
('Thomas123', 'Yahoo', 'com')
```


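Without `re.VERBOSE`, the whitespace and comments are treated as part of the pattern, so `search` returns `None` and calling `.groups()` fails:

```python
import re

pattern = re.compile(r"""
    ^([a-z0-9_\.-]+)      # first part of email
    @                     # single @sign
    ([a-z0-9_\.-]+)       # email provider
    \.                    # single period
    ([a-z0-9_\.-]{2,6})$  # com, org, net, etc.
    """, re.IGNORECASE)

match = pattern.search("Thomas123@Yahoo.com")
print(match.groups())
```

```
AttributeError: 'NoneType' object has no attribute 'groups'
```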

### Substitutions
* Leaving out privacy-sensitive information
* Restructuring sentences, e.g.:
    * Significant Others (1987)
    * To: 1987 - Significant Others
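
The restructuring example could be sketched with capture groups and `sub`, as below. The pattern is an assumption about the title format (any text followed by a four-digit year in parentheses); `\g<1>` and `\g<2>` in the replacement refer to the first and second group.

```python
import re

title = "Significant Others (1987)"

# group 1 = the title, group 2 = the four-digit year between parentheses
pattern = re.compile(r'^(.+) \((\d{4})\)$')

# swap the two groups in the replacement string
restructured = pattern.sub(r'\g<2> - \g<1>', title)
print(restructured)  # 1987 - Significant Others
```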

```python
import re

# remove names from text (privacy)
text = "Last night Mrs. Daisy and Mr. White murdered Mr. Chow"

pattern = re.compile(r'(Mrs\.|Mr\.) ([a-z]+)', re.IGNORECASE)
result = pattern.sub("REDACTED", text)
result
```

```
'Last night REDACTED and REDACTED murdered REDACTED'
```

```python
import re

text = "Last night Mrs. Daisy and Mr. White murdered Mr. Chow"

# use a separate group if you want to keep, for example, the first letter visible
# \g<1> refers to group 1 (numbering starts at 1; group 0 is the entire match)
pattern = re.compile(r'(Mrs\.|Mr\.) ([a-z])([a-z]+)', re.IGNORECASE)
result = pattern.sub(r"\g<1> \g<2>", text)
result
```

```
'Last night Mrs. D and Mr. W murdered Mr. C'
```

```python
import re

def censor(text):
    # matches "frack" plus any word characters that follow it
    censor_pattern = re.compile(r'frack\w*', re.IGNORECASE)
    return censor_pattern.sub("CENSORED", text)

censor("Frack you")
```

```
'CENSORED you'
```

Exam questions: 
* Which of the following strings will have matches in them? 
    * A regex is given
    * Along with several example sentences
* Write a function called `is_valid_time` that accepts a single string argument. It should return `True` if the string is formatted correctly as a time, like 3:15 or 12:48 and return `False` otherwise. Note that times can start with a single number (2:30) or two (11:18).

```python
# Don't forget to import re!
import re

# Define is_valid_time below:
def is_valid_time(text):
    # hours: one or two digits from 0 to 23 (assumes a 24-hour clock); minutes: 00 to 59
    # ^ and $ reject any extra text around the time
    time_regex = re.compile(r'^([01]?\d|2[0-3]):[0-5]\d$')
    return time_regex.search(text) is not None

is_valid_time("23:59")
```

```
True
```