# Web Scraping Advanced (oDCM)


--- 

## Learning Objectives

Students will be able to: 
* Understand the difference between headless and regular browser emulation, and apply both methods (using Selenium)
* Emulate user interaction with a site using timers, clicks, scrolling, and filling in forms 
* Access data that is hidden behind a login-screen
 


--- 

## Acknowledgements
This course draws on a variety of online resources that can be retrieved from the [course website](https://odcm.hannesdatta.com/docs/about/).

--- 

## Contact
For technical issues outside of scheduled classes, please check the [support section](https://odcm.hannesdatta.com/docs/course/support) on the course website.

* https://hannesdatta.github.io/course-jads2020/sessions/webscraper_socialblade.html

* https://github.com/CU-ITSS/Web-Data-Scraping-S2019/blob/master/Class%2004%20-%20Selenium%2C%20Twitter%2C%20and%20Internet%20Archive/Class%2004%20-%20Selenium%2C%20Twitter%2C%20and%20Internet%20Archive.ipynb



---
## 1. Selenium 

### 1.1 Why Selenium? 
In the Web Scraping 101 tutorial, we used BeautifulSoup to turn HTML into a data structure that we could search and access using Python-like syntax. While it's easy to get started with this library, it has limitations when it comes to dynamic websites, that is, websites whose content changes after the page has loaded or with each refresh. Selenium can handle both static and dynamic websites and mimic user behavior (e.g., scrolling, clicking, logging in). It launches a separate web browser window in which all actions are visible, which makes it feel more intuitive. For example, the video below launches a regular Google Chrome window and visits [`instagram.com`](https://www.instagram.com). This browser window behaves like a normal one, so you can click on buttons and fill out fields. Yet you can distinguish it from your normal web browser by the banner that indicates that Chrome is being controlled by automated test software. Before you can try it out yourself, you need to install some additional software, which we'll explain next. 

<img src="images/selenium_instagram.gif" align="left" width=70%/>


<!-- 

With Selenium we use a standard called "XPath" to navigate through an HTML document: this is the official tutorial for working with XPath. The syntax is different, but the intuition is similar: we can find a parent node by its attribute (class, id, etc.) and then navigate down the tree to its children.
 -->


### 1.2 Installing Selenium
You will need to (1) install the Python package for Selenium, (2) download a web driver to interface with a web browser, and (3) configure Selenium to recognize your web driver.
1. Open Anaconda Prompt (Windows) or the Terminal (Mac), type the command `conda install selenium`, and agree to whatever the package manager wants to install or update (usually by pressing `y` to confirm your choice). 
2. When we run the scraper, Selenium launches a Chrome browser, which requires a web driver executable. Download the driver that matches your Chrome version from [here](https://sites.google.com/a/chromium.org/chromedriver/downloads) (open [this](https://www.whatismybrowser.com/detect/what-version-of-chrome-do-i-have) site in Chrome to identify your current Chrome version). 
3. Unzip the file and move it to the same directory where you're running this notebook. Make a note of the path to this directory as you'll need to reference it later. The path to my Chrome driver looks like this (Mac): `/Users/royklaassebos/Google Drive/2020/Data_Scientist_Tilburg/oDCM/chromedriver`. On Windows, it may look like this: `E:/Google Drive/2020/Data_Scientist_Tilburg/oDCM/chromedriver.exe`

Change the `chrome_path` variable to the location where you've stored the driver and run the cell. It should open an empty Chrome window (don't close it until you're done with scraping!). 

In [471]:
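import selenium.webdriver

# path to the downloaded ChromeDriver executable (adjust to your own machine)
chrome_path = "/Users/royklaassebos/Google Drive/2020/Data_Scientist_Tilburg/oDCM/chromedriver"
# note: recent Selenium 4 releases no longer accept executable_path; there the path is
# wrapped in selenium.webdriver.chrome.service.Service and passed as Chrome(service=...)
driver = selenium.webdriver.Chrome(executable_path = chrome_path)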

**Let's try it out!**  
Follow the steps and make sure it works properly on your machine. What happens if you run the cell above twice? 

### 1.3 Access Sites Programmatically

**Importance**  
Next, we're going to tell the browser to visit the Tilburg University Twitter account. We call the `get` method of the `driver` object we created above and pass it the URL of the page we'd like to scrape. 

In [472]:
driver.get("https://twitter.com/TilburgU")

<img src="images/twitter_tilburgu.png" align="left" width=40%/>

**Let's try it out!**  
As most information can only be obtained once you're signed in, log in to your Twitter account manually in the driver window (create a new account if you don't have one yet). 

From this point, we can use BeautifulSoup as we learned previously, though this time we create the `res` object from the `driver` object. 

In [411]:
from bs4 import BeautifulSoup

res = driver.page_source.encode('utf-8')
soup = BeautifulSoup(res, "html.parser")

Once you inspect the HTML code of the Twitter page you'll discover that the class names are more complex than the ones we looked at earlier. Take a look at the gigantic class name of the Twitter bio, for example...

<img src="images/twitter_bio.png" align="left" width=60%/>

...which we can extract using the class name:

In [398]:
soup.find_all(class_ = "css-901oao r-jwli3a r-1qd0xha r-a023e6 r-16dba41 r-ad9z0x r-bcqeeo r-qvutc0")[1].text

'Follow @TilburgU and we will keep you up to date on our latest news! Our webcare team is more than happy to answer your questions on work days, 9 AM - 5 PM.'

...or by filtering on the `data-testid` attribute:

In [399]:
soup.find(attrs={"data-testid": "UserDescription"}).text

'Follow @TilburgU and we will keep you up to date on our latest news! Our webcare team is more than happy to answer your questions on work days, 9 AM - 5 PM.'

**Exercise 1**  
Using the same approach as above, extract (i) the number of followers, (ii) the location, and (iii) the join date of the [TilburgU](https://twitter.com/tilburgU) Twitter account. Tip: use Chrome's Inspector (right-click > Inspect) to determine an appropriate navigation strategy.

In [400]:
# solution
followers = soup.find_all(class_ = "css-901oao css-16my406 r-jwli3a r-1qd0xha r-b88u0q r-ad9z0x r-bcqeeo r-qvutc0")[1].text 
location = soup.find(attrs={"data-testid": "UserProfileHeader_Items"}).find_all('span')[1].text
join_date = soup.find(attrs = {"data-testid": "UserProfileHeader_Items"}).find_all('span')[3].text

print(f"Followers: {followers} \nLocation: {location} \nJoin date: {join_date}")

Followers: 13K 
Location: Tilburg, The Netherlands 
Join date: Joined June 2009


### 1.4 Scroll Sites Programmatically

**Importance**  
In a similar way, we can scrape the content of the most recent tweet as follows: 

In [401]:
# 1st tweet
soup.find_all(attrs={"data-testid": "tweet"})[0].find_all(attrs={"dir": "auto"})[4].text

"Don't have any plans for New Year's Eve yet? Join the Tilburg University pub quiz on December 31, 20:00 hrs. Make sure to register before December 31, 10:00 hrs. https://tilburguniversity.edu/current/events/cristmas-and-new-years-activities…"

And for older tweets we simply increment the counter by one: 

In [402]:
# 2nd tweet
soup.find_all(attrs={"data-testid": "tweet"})[1].find_all(attrs={"dir": "auto"})[4].text

'We wish you Happy Holidays and a Wonderful New Year!  Our university will be closed for a couple of days: https://tilburguniversity.edu/contact/openinghours… \nAs our webcare team is also celebrating the holidays, response time will be longer than usual. We will be back in full capacity on Jan 4, 2021!'

Easy, right? Not so fast. From the 10th tweet onwards (in your case it may be a different number, depending on screen size, resolution, etc.), it returns an `IndexError: list index out of range`. This is because Twitter only loads more tweets once you scroll down the page. 

In [416]:
# 9th tweet
soup.find_all(attrs={"data-testid": "tweet"})[8].find_all(attrs={"dir": "auto"})[4].text

'Congratulations to @MicheleNuijten for winning the Young eScientist Award 2020 from the @eScienceCenter in the amount of 50,000 euro! She will further develop statcheck, an open-access tool for detecting statistical reporting errors. http://tilburguniversity.edu/current/news/more-news/michele-nuijten-escientist-award-meta-research-statcheck…'

In [417]:
# 10th tweet
soup.find_all(attrs={"data-testid": "tweet"})[9].find_all(attrs={"dir": "auto"})[4].text

IndexError: list index out of range

Therefore, we need to scroll down to the bottom of the page if we want to obtain more than a few tweets. Every time you run the cell below, it loads another 5-10 tweets.  

In [428]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

**Let's try it out!**  
Try running the cell above a couple of times. What happens to the most recent tweets? If you run the cells again that extract the 1st, 2nd, 9th, and 10th tweet, does the output change? 

Indeed, we need to recreate the `res` object after each iteration because the HTML code changes once you scroll down (older tweets are added and newer ones are hidden). The number of tweets in the view varies with the type of media (e.g., images take up more space than text). Therefore, we first determine the number of tweets in the current view to make sure we capture all of them. After we have stored the last tweet in the view, we scroll down the page and start all over again.  

In [473]:
from time import sleep
tweets = []

for _ in range(5):
    # re-parse the page, because its HTML changes after every scroll
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")
    
    # total number of tweets in current view
    tweets_view = soup.find_all(attrs={"data-testid": "tweet"})
    num_tweets_view = len(tweets_view)
        
    # add tweets to list
    for counter in range(num_tweets_view):
        tweets.append(tweets_view[counter].find_all(attrs={"dir": "auto"})[4].text)
    
    # scroll down the page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    
    # pause for 1 second so that new tweets can load
    sleep(1)
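
Because consecutive views can overlap, a loop like the one above may append the same tweet more than once. Below is a minimal sketch of one way to drop repeats while keeping the original order. It assumes that identical tweet text means a duplicate; the helper name `deduplicate` and the sample list are illustrative and not part of the original notebook.

```python
def deduplicate(items):
    """Return the items in their original order with repeats removed."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


# hypothetical sample: two overlapping views that share one tweet
sample_tweets = ["tweet A", "tweet B", "tweet B", "tweet C"]
unique_tweets = deduplicate(sample_tweets)
print(unique_tweets)
```

In the scraper, you could call such a helper on `tweets` once the loop has finished.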

**Exercise 2**
1. What happens once you first scroll down the page and then run the cell above? Does `tweets` differ? Why? 
2. Estimate how many times you would need to scroll in order to capture all tweets (tip: you find the total number of tweets at the top). By the way, there's no need to collect all tweets!
3. Write a function `process_tweets()` that returns a list of dictionaries in which each dictionary contains the original tweet, a list of mentions (e.g., `@gemeentetilburg`), and a list of hashtags (e.g., `#makeitintilburg`). Tip: you may first want to split each tweet into a list of words and work from there. When is a word considered a hashtag? And a mention? How about punctuation? Test your function with the list of `tweets` above. 
4. What's the most-used hashtag in the `tweets` dataset? Start with the output of `process_tweets(tweets)`. Tip: use the [Counter](https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item) class from the `collections` module to easily determine the number of occurrences of each hashtag.

**Solution**  
1. The scraper will start from the current view. Since more recent tweets are hidden as you scroll down, the scraper would skip the first few tweets in that case. 
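2. Scrolling down five times yielded 56 tweets, i.e., about 11 tweets per scroll. Capturing all 3986 tweets would therefore take roughly 3986/56 ≈ 71 rounds of five scrolls, or about 356 scrolls in total. 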

In [484]:
# Question 3
def process_tweets(tweets):
    output = []
    
    for tweet in tweets: 
        mentions = []
        hashtags = []
        
        # a more elegant solution can be achieved using regular expressions (outside the scope of this course)
        
        # remove punctuation (to avoid #hashtag?, #hashtag!, etc.)
        tweet_clean = tweet
        for character in ["?", ".", ",", "!"]:
            tweet_clean = tweet_clean.replace(character, "")
            
        # replace newlines with spaces (to avoid #hashtag\nABCDEF)
        tweet_clean = tweet_clean.replace("\n", " ")
        
        for word in tweet_clean.split(" "): 
            # a mention starts with a single @, a hashtag with a single #
            if word.startswith("@") and word.count("@") == 1: 
                mentions.append(word)
            if word.startswith("#") and word.count("#") == 1: 
                hashtags.append(word) 
            
        output.append({
            "tweet": tweet,  # the original, unmodified tweet
            "mentions": mentions,
            "hashtags": hashtags
        })
        
    return output
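
For comparison, here is a rough sketch of the regular-expression route mentioned in the comment above. It is not the course solution: the function name `extract_tags` and the example tweet are made up for illustration, and the patterns are deliberately simple (for instance, they would also pick up the `@gmail` part of an email address).

```python
import re


def extract_tags(tweet):
    """Return the mentions and hashtags found in a single tweet."""
    # \w+ stops at punctuation and whitespace, so "#battle!" yields "#battle"
    mentions = re.findall(r"@\w+", tweet)
    hashtags = re.findall(r"#\w+", tweet)
    return {"tweet": tweet, "mentions": mentions, "hashtags": hashtags}


# hypothetical example tweet
example = extract_tags("Join the #battle with @TilburgU!\n#HealthyCampus: more info soon")
print(example["mentions"])
print(example["hashtags"])
```

Because `\w+` only covers letters, digits, and underscores, trailing punctuation such as `!` or `:` is dropped without a separate cleaning step.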

In [489]:
# Question 4
from collections import Counter

hashtags = []
processed_tweets = process_tweets(tweets)

# add all hashtags of all tweets to a list 
for tweet_dict in processed_tweets: 
    hashtags.extend(tweet_dict['hashtags'])
    
# count frequencies of hashtags
Counter(hashtags)  # so the answer is: #TilburgU (what a surprise..)

Counter({'#klimaatneutrale': 1,
         '#economie': 1,
         '#economische': 1,
         '#groei': 1,
         '#TilburgU': 12,
         '#IkPasBattle': 1,
         '#battle': 1,
         '#HealthyCampus:': 1,
         '#2': 1,
         '#statcheck': 1,
         '#Covid_19': 2,
         '#SGTilburg': 1,
         '#sgtilburg': 1,
         '#patiënt': 1,
         '#zorgverlening': 1,
         '#needles': 1,
         '#makeitintilburg': 1,
         '#ondernemerschap': 1,
         '#entrepreneurship': 1,
         '#digitization': 1,
         '#AI': 2,
         '#corona': 1,
         '#cancer': 1,
         '#WDPD2020': 1,
         '#RecognitionandRewards': 1,
         '#remembering': 1,
         '#MidpointBrabant': 1,
         '#coronavirus': 1,
         '#Gebarentaal': 1,
         '#TrusTee': 1,
         '#TilburgUniversityMagazine': 1,
         '#CoronaMelder': 1,
         '#NobelPeacePrize': 1})
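
Rather than reading the winner off the printed dictionary, you can ask the `Counter` for it directly. The sketch below uses a short, made-up list of hashtags (`sample_hashtags`) in place of the scraped data.

```python
from collections import Counter

# hypothetical, shortened list of hashtags
sample_hashtags = ["#TilburgU", "#AI", "#TilburgU", "#corona", "#AI", "#TilburgU"]
counts = Counter(sample_hashtags)

# most_common(n) returns the n most frequent items as (item, count) tuples
top_hashtag, top_count = counts.most_common(1)[0]
print(top_hashtag, top_count)
```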

**Exercise 3**
Scrape the list of Twitter followers.

### 1.5 Search Tweets
* XXX
* XXX

### 1.6 Wrap-Up

In [448]:
# Make the query params
query_params = {}
query_params['from'] = 'WhiteHouse'
query_params['since'] = '2017-01-20'
query_params['until'] = '2018-01-20'

# Combine the params into a single search string
search_query = f'from:{query_params["from"]} since:{query_params["since"]} until:{query_params["until"]}'

# Add the search string to the URL
query_url = "https://twitter.com/search?f=tweets&q={0}&src=typd".format(search_query)

In [449]:
query_url

'https://twitter.com/search?f=tweets&q=from:WhiteHouse since:2017-01-20 until:2018-01-20&src=typd'
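
The search URL built above still contains raw spaces and colons. Browsers usually tolerate this, but if you want to percent-encode the query explicitly, the standard library's `urllib.parse.quote` is one option. This is a sketch only; the variable names `encoded_query` and `encoded_url` are illustrative.

```python
from urllib.parse import quote, unquote

search_query = "from:WhiteHouse since:2017-01-20 until:2018-01-20"

# percent-encode characters such as spaces and colons
encoded_query = quote(search_query)
encoded_url = f"https://twitter.com/search?f=tweets&q={encoded_query}&src=typd"
print(encoded_url)

# unquote reverses the encoding, which is handy for checking the result
print(unquote(encoded_query))
```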

In [431]:
# recreate soup object based on current HTML page
res = driver.page_source.encode('utf-8')
soup = BeautifulSoup(res, "html.parser")

# extract first tweet in current view
soup.find_all(attrs={"data-testid": "tweet"})[0].find_all(attrs={"dir": "auto"})[4].text

'Hoe gaan landen om met een belast verleden? En hoe verhouden sociale ondernemers zich tot de verzorgingsstaat?\n\nA.s. zaterdag geven tien promovendi inzicht in hun onderzoeken, waaronder Marlies en Michiel \n\nMeer info: https://tilburguniversity.edu/nl/campus/studium-generale/wetenschap-de-maak…\n@sgtilburg @bibliotheekmb'

<img src="images/scrolling.png" align="left" width=40%/>

In [356]:
soup.find_all(attrs={"data-testid": "tweet"})[0].find_all(attrs={"dir": "auto"})[4].text

'Wij wensen u fijne feestdagen en een gezond nieuwjaar! Wij hopen elkaar in 2021 weer in het echt te kunnen ontmoeten, maar voor nu deze virtuele kerstgroet, namens het hele VSNU-bureau en namens de universiteiten van Nederland. #kerstgroet #universiteiten'

In [296]:
soup.find_all(attrs={"data-testid": "tweet"})[1].text

'The following media includes potentially sensitive content. Change settingsView'

In [249]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

In [300]:
soup.find_all(attrs={"data-testid": "tweet"})[4].find_all(attrs={"dir": "auto"})[4].text

'How exceptional is #Covid_19? Can it strike again? Certainly, says Jonathan Verschuuren. He has explored the phenomenon of zoonosis, viruses that jump from animals to humans. What he discovered is not entirely reassuring. http://tilburguniversity.edu/magazine/how-exceptional-covid-19…'

In [314]:
tweets = []
for _ in range(4):
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")
    for counter in range(len(soup.find_all(attrs={"data-testid": "tweet"}))):
        tweets.append(soup.find_all(attrs={"data-testid": "tweet"})[counter].find_all(attrs={"dir": "auto"})[4].text)
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    
#print(len(soup.find_all(attrs={"data-testid": "tweet"})))
#soup.find_all(attrs={"data-testid": "tweet"})[8].find_all(attrs={"dir": "auto"})[4].text

Kan #klimaatneutrale #economie gecombineerd worden met #economische #groei? @TilburgU professor Gerlagh sprak over de klimaatverandering van de afgelopen 100 jaar en de verwachtingen van de komende 50 jaar op het Nederlands Gala van de Wetenschap 2020:
Univers blikt met organisatiewetenschapper @mzelst van @C19RedTeam terug op een jaar waarin hij eigenlijk zijn proefschrift had willen afronden, maar dat uiteindelijk vooral in het teken stond van coronacijfers. #TilburgU
This January @gem_Nijmegen & @gemeentetilburg go head to head in the #IkPasBattle. The goal is to recruit the most participants who are willing to go one month without alcohol. @TilburgU joins the #battle in the pursuit of a #HealthyCampus: https://ikpas.nl/bedrijf/tilburguniversity/…
Dr. Martin Hoondert, Associate Professor @TilburgU, is researching the effects of the coronavirus crisis on rituals surrounding funerals. Rituals are changing, disappearing (temporarily or otherwise) and new ones arising. Find out more abo

IndexError: list index out of range

In [307]:
import pandas as pd
temp = pd.DataFrame(tweets)
temp.to_csv('tweets.csv', index=False)



In [203]:
soup.find_all(class_="css-901oao r-jwli3a r-1qd0xha r-a023e6 r-16dba41 r-ad9z0x r-bcqeeo r-bnwqim r-qvutc0")[6].text

'Congratulations! The researchers of the Tilburg School of Economics and Management traditionally do well in the Economists Top 40. @tilburgU ranks #2! Among others Bart Bronnenberg became second in the researchers top 40! http://tilburguniversity.edu/about/schools/economics-and-management/organization/tilburg-school-economics-and-management-ranks-high-economists-top-40-0…'

In [190]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

**Let's try it out!**  
Run the cell above several times.

soup.find_all(class_ = "css-901oao r-jwli3a r-1qd0xha r-a023e6 r-16dba41 r-ad9z0x r-bcqeeo r-bnwqim r-qvutc0")[0].text




In [145]:
soup.find_all(attrs={"data-testid": "tweet"})[0].find_all(attrs={"dir": "auto"})[4].text

"Don't have any plans for New Year's Eve yet? Join the Tilburg University pub quiz on December 31, 20:00 hrs. Make sure to register before December 31, 10:00 hrs. https://tilburguniversity.edu/current/events/cristmas-and-new-years-activities…"

In [141]:
soup.find(attrs={"data-testid": "tweet"}).find_all(attrs={"dir": "auto"})[4].text

"Don't have any plans for New Year's Eve yet? Join the Tilburg University pub quiz on December 31, 20:00 hrs. Make sure to register before December 31, 10:00 hrs. https://tilburguniversity.edu/current/events/cristmas-and-new-years-activities…"

In [127]:
soup.find(attrs={"role": "article"}).find(attrs={"dir": "auto"})

'Tilburg University'

Depending on the resolution of your display and the size of the window, only 5-10 tweets may be visible. Typically, we can only scrape the contents of the page that are loaded in the current view. In other words, we need to scroll down to the bottom of the page if we want to obtain more than a few tweets. Every time you run the cell below, it loads another 5-10 tweets. 

In [26]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')


In [28]:
from bs4 import BeautifulSoup
raw = driver.page_source.encode('utf-8')
soup = BeautifulSoup(raw, "html.parser")  # name the parser explicitly to avoid a warning

In [44]:
soup.body.find_all(class_='r-bnwqim r-qvutc0') 


[]

In [None]:
# CSS selector (not Python): .r-bnwqim.r-qvutc0

In [17]:
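from selenium.webdriver.common.by import By

# find_element(By.XPATH, ...) replaces the older find_element_by_xpath helper
a = driver.find_element(By.XPATH, '//*[@id="default"]/div/div/div/div/section/div[2]/ol/li[1]/article/div[2]/p[1]').get_attribute('class')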
a

'price_color'

In [None]:
# CSS selector (not Python): #default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.product_price > p.price_color


### 1.3 Controlling Programmable Web Browser

* Access website
* Clicking 
* Scrolling

In [None]:
driver.get('Twitter')


In [None]:
driver.quit()

* Twitter Account Handle
* Name of Twitter account
* Bio

* Retrieve more than 20 tweets by scrolling
    * how many times do you need to scroll? 

* Exercise at the tweet level
    * text
    * replies
    * retweets
    * favorites 

### 1.4 Searching for Tweets

### 1.5 Error Handling

## 2 Instagram


In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

In [None]:
import os 
import wget

In [None]:
mac_path = "/Users/royklaassebos/Google Drive/2020/Data_Scientist_Tilburg/oDCM/chromedriver"

In [None]:
driver = webdriver.Chrome(executable_path=mac_path)
driver.get("https://www.instagram.com")

In [None]:
username = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']")))
password = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']")))

username.clear()
password.clear()

# replace the placeholders with your own credentials (never share a notebook that contains a real password)
username.send_keys("your_username")
password.send_keys("your_password")

In [None]:
log_in = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))).click()

In [None]:
not_now = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Not Now')]"))).click()

In [None]:
searchbox = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//input[@placeholder='Search']")))
searchbox.clear()

keyword = "#cat"
searchbox.send_keys(keyword)


In [None]:
searchbox.send_keys(Keys.ENTER) # you may need to run this cell twice before the results load

In [None]:
driver.execute_script("window.scrollTo(0, 4000);") # 4000 is around 4 times the size of a screen

In [None]:
images = driver.find_elements(By.TAG_NAME, "img")
images 

In [None]:
images = [image.get_attribute('src') for image in images]

In [None]:
images

In [None]:
# create a folder named after the keyword (without the #) in the current working directory
path = os.getcwd()
path = os.path.join(path, keyword[1:])

os.makedirs(path, exist_ok=True)

In [None]:
import wget

# download every image and number the files (cat0.jpg, cat1.jpg, ...)
for counter, image in enumerate(images): 
    save_as = os.path.join(path, keyword[1:] + str(counter) + ".jpg")
    wget.download(image, save_as)

---

## 1. Advanced Web Scraping
* Understand the difference between headless and regular browser emulation, and apply both methods (using Selenium).
* Emulate user interaction with a site using timers, clicks, scrolling, and filling in forms 
* Access data that is hidden behind a login-screen
* Preprocess raw data with regular expressions (e.g., special characters, thousand separators, trailing and leading spaces)
* Custom user agents
* Throttling
* Regular expressions
* Feature engineering (date time, week, year, textblob sentiment)
* Error handling (e.g., 404 pages)


* https://github.com/kimfetti/Conferences/tree/master/PyCon_2020
* https://www.youtube.com/watch?v=RUQWPJ1T6Zc&t=190s
* https://github.com/hancush/web-scraping-with-python/blob/master/session/web-scraping-with-python.ipynb#HTML-basics
* https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/7991196#overview
* https://campus.datacamp.com/courses/web-scraping-with-python/introduction-to-html?ex=1
* https://realpython.com/python-web-scraping-practical-introduction/
* https://github.com/CU-ITSS/Web-Data-Scraping-S2019

### Regular expressions
* Regex = regular expressions
* A way of describing patterns to search for within strings 
* Not a Python-specific topic 
* Hideous and very difficult to understand (not Pythonic style) 
* There are a ton of regex symbols -> we're just going to cover the most important ones
* Cheat sheet: https://www.rexegg.com/regex-quickstart.html
* Test your regex: https://pythex.org

Potential use cases
* Validating credit card numbers
* Validating phone numbers (website forms)
* Advanced find/replace in text
    * Check if words are duplicated (once upon a time time)
* Formatting text/output
* Syntax highlighting (as you see in an IDE)



* Validating emails (check: does it follow the right format?) 
    * letters + @ + letters.letters
    * simply checking whether the string contains an @ symbol is easy (`"@" in ...`)
    * but the @ may not be at the start or the end
    * there may not be more than one @
    * the @ must come before the .com
    * complicated, because the "." can also occur in other places (roy.klaasse.bos@gmail.com)
    * normally you would have to write many if-statements for this


* Formula
    * Starts with 1 or more letters, numbers, +, _, - or . then
    * A single @ sign
    * 1 or more letters, numbers or - then
    * A single dot
    * Ends with 1 or more letters, numbers, - or .

`(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)`
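
To see the formula at work, here is a small sketch that wraps a lightly adapted version of the pattern above in a helper function. The name `looks_like_email` and the test strings are illustrative. Keep in mind that this is a simplified format check, not a complete email validator.

```python
import re

# start, local part, a single @, provider, a dot, the remainder, end
email_pattern = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$")


def looks_like_email(text):
    """Return True if the text follows the simplified email format."""
    return email_pattern.search(text) is not None


print(looks_like_email("roy.klaasse.bos@gmail.com"))  # dots in the first part are fine
print(looks_like_email("roy@@gmail.com"))             # more than one @
print(looks_like_email("@gmail.com"))                 # @ at the start
```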


### Basic Syntax
* Regular letters match themselves
* Escape special characters with a backslash (e.g., `:\)` for a smiley)
* `\d\d` (for double digits) 
* An uppercase letter means NOT (`\s` vs `\S`)


* `\d` = digit 0-9
* `\w` = letter, digit or underscore (word character)
    * both lowercase and uppercase letters
    * accented letters are not included in many regex flavors; in Python 3 they are matched for `str` patterns unless the `re.ASCII` flag is set
* `\s` = whitespace character (a space, but also a tab or newline) 
    * this differs from simply typing a space yourself -> handy (e.g., when something is the last word of a sentence, followed by a period)
* `\D` = not a digit
* `\W` = not a word character
* `\S` = not a whitespace character
* `.` = any character except line break
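
A quick way to get a feel for these shorthand classes is to run them against a short string with `re.findall`. The sample sentence below is made up for illustration.

```python
import re

sample = "Room 12\tis free at 9 am."

digits = re.findall(r"\d", sample)        # single digits
pairs = re.findall(r"\d\d", sample)       # two digits in a row
words = re.findall(r"\w+", sample)        # runs of word characters
non_space = re.findall(r"\S+", sample)    # runs of non-whitespace characters
smiley = re.findall(r":\)", "well done :)")  # escaped parenthesis

print(digits, pairs, words, non_space, smiley, sep="\n")
```

Note how `\S+` keeps the final period attached to `am.`, whereas `\w+` leaves it out.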

### Quantifiers

* \+ = one or more
* {3} = exactly 3 times
* {3,5} = 3 to 5 times
* {4,} = 4 or more times
* \* = 0 or more times
* ? = once or none


Examples
* `ab*c` = a[zero or more b's]c
    * the quantifier applies to the character right before it (so to b, not to ab)
* `06-?12345678` = matches both 0612345678 and 06-12345678
* `7{3}` matches 777, but also part of a longer run (e.g., in 4567777789)
* `hi{2,}` is a single "h" and then "i" repeated two or more times (so not "hi" repeated two times)
* `0?\d` = "00", "9", "0", "03"
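
The quantifier examples can be checked in Python as follows. This is a small sketch; the helper `matches` is a hypothetical convenience function that tests whether a pattern covers the entire string.

```python
import re


def matches(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# * applies to the b only, so zero b's is fine
print(matches(r"ab*c", "ac"), matches(r"ab*c", "abbbc"))

# ? makes the hyphen optional
print(matches(r"06-?12345678", "0612345678"), matches(r"06-?12345678", "06-12345678"))

# {2,} repeats the i, not the whole "hi"
print(matches(r"hi{2,}", "hii"), matches(r"hi{2,}", "hihi"))

# search finds 777 inside a longer run of sevens
print(re.search(r"7{3}", "4567777789").group())
```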

### Character Classes and Sets
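* Any vowel: `eaiou` does not work (it matches that exact sequence) -> use `[eaiou]`
* Ranges of characters 
    * lowercase `[a-z]`; a partial range such as `[a-f]` also works
    * `[A-Z]`
* `^` at the start of a set negates it
    * `[^A-Z]` = not A-Z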
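
The snippet below sketches how sets, ranges, and negated sets behave on a made-up sample string.

```python
import re

sample_text = "Tilburg University 2021"

vowels = re.findall(r"[eaiou]", sample_text)      # any single lowercase vowel
a_to_f = re.findall(r"[a-f]", sample_text)        # lowercase letters a through f
not_upper = re.findall(r"[^A-Z]+", sample_text)   # runs without capital letters

print(vowels, a_to_f, not_upper, sep="\n")
```

The capital `U` is not in `[eaiou]`, so it is skipped; add `re.IGNORECASE` or write `[eaiouEAIOU]` if uppercase vowels should count as well.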

### Anchors and boundaries
* ^ = start of string or line
* $ = end of string or line
* \b = word boundary (e.g., the first word in a sentence: there is no space before it, but it should still be matched)

Examples
* problem with `\d{3} \d{3}-?\d{4}` for a phone number -> there may be all kinds of other text before or after it 
    * combine `^` and `$` to exclude other text 
    * `^\d{3} \d{3}-\d{4}$` = start with 3 digits and end with 4 digits
* `^\d{3}$` does not match `Yay I got 777`
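
A short sketch of how anchors and word boundaries change what is matched. The pattern names `loose` and `strict` and the test strings are illustrative.

```python
import re

loose = re.compile(r"\d{3} \d{3}-?\d{4}")      # matches anywhere in the string
strict = re.compile(r"^\d{3} \d{3}-\d{4}$")    # the whole string must be the number

print(bool(loose.search("abc 415 555-4242 xyz")))
print(bool(strict.search("abc 415 555-4242 xyz")))
print(bool(strict.search("415 555-4242")))
print(bool(re.search(r"^\d{3}$", "Yay I got 777")))

# \b only matches "cat" as a separate word, not inside "concat"
print(re.findall(r"\bcat\b", "cat concat cat."))
```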
  

### Logical Or and Capture Groups

* `|` = OR: `\(\d{3}\)|\d{3}` = 3 digits with or without parentheses
* `()` to capture groups 
    * `\(\d{3}\)|\d{3} \d{3} \d{4}` compares 3 digits with parentheses OR 10 digits without parentheses. If you only want to alternate between the 3 digits with and without parentheses, you have to put parentheses around that part
    * Even if you don't strictly need it, `()` can still be handy to create a group -> e.g., to separate a name from its title (Mr. / Ms.)
    * Used a lot, so that you don't have to split the match manually afterwards
    
Examples
* `https?://([A-Za-z0-9_-]+\.[A-Za-z0-9_-]+)`
    * Only store the part after `http://` as a group
* `Mr.|Mister Holmes` -> Mr. OR Mister Holmes (because of a lack of parentheses)
* Escape group symbol 
    * Which regex would match both of the following strings (`cat(s)` AND `dog(s)`)
    * `\w{3}\(s\)`

### Re Module
* https://docs.python.org/3/library/re.html
* `r` = raw string (otherwise you have to use double backslashes - avoids that \t is seen as a TAB)
* compiling the pattern separately (`re.compile`) vs. passing it directly to a function such as `re.search`
    * if you're using it more than once -> via `re.compile()`
* `search` = returns the first match only (or `None`)
* `findall` = returns all matches

In [None]:
# import regex module
import re

# define our phone number regex
pattern = re.compile(r'\d{3} \d{3}-\d{4}')

# search a string with our regex
result = pattern.search('Call me at 415 555-4242 or 310 234-9999!')
print(result.group())


result2 = pattern.findall('Call me at 415 555-4242 or 310 234-9999!')
print(result2)


# instead of creating a separate pattern object -> pass the pattern directly to findall
print(re.findall(r'\d{3} \d{3}-\d{4}', 'Call me at 415 555-4242 or 310 234-9999!'))

In [None]:
import re

def extract_phone(text):
    # \b makes sure the number is not part of a longer run of digits
    phone_regex = re.compile(r'\b\d{3} \d{3}-\d{4}\b')
    match = phone_regex.search(text)
    if match: 
        return match.group()
    return None

print(extract_phone("my number is 432 567-8976"))
print(extract_phone("my number is 432 567-897622"))  # no match: too many digits at the end

### Parsing URLs
* Breaking things up (`match.groups()`)


In [None]:
url_regex = re.compile(r'(https?)://(www\.[A-Za-z-]{2,256}\.[a-z]{2,6})([-a-zA-Z0-9@:%_\+.~#?&//=]*)')
match = url_regex.search("http://www.youtube.com/videos/asd/das/asd")
print(match.groups())
print(match.groups()[2])

In [None]:
import re

# parse_date returns the day, month and year of a date such as 12.04.2003
def parse_date(text):
    date_regex = re.compile(r'(\d{2})[/.,](\d{2})[/.,](\d{4})')
    match = date_regex.search(text)
    if match is None:
        return None  # no date found
    day, month, year = match.groups()
    return {"d": day, "m": month, "y": year}

parse_date('12.04.2003')

### Compilation Flags
* `IGNORECASE` = no longer distinguishes between lower and upper case (`[a-z]` then also matches capital letters)
* `VERBOSE` = spread a pattern across multiple lines (useful for very long regular expressions) -> whitespace in the pattern is ignored
* Combine multiple compilation flags with a pipe (`|`) symbol

In [None]:
pattern = re.compile(r"""
    ^([a-z0-9_\.-]+)      # first part of email
    @                     # single @sign
    ([a-z0-9_\.-]+)       # email provider
    \.                    # single period
    ([a-z0-9_\.-]{2,6})$  # com, org, net, etc.
    """, re.VERBOSE | re.IGNORECASE)

match = pattern.search("Thomas123@Yahoo.com")
print(match.groups())

In [None]:
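# the same pattern without re.VERBOSE: whitespace and comments are now part of the pattern
pattern = re.compile(r"""
    ^([a-z0-9_\.-]+)      # first part of email
    @                     # single @sign
    ([a-z0-9_\.-]+)       # email provider
    \.                    # single period
    ([a-z0-9_\.-]{2,6})$  # com, org, net, etc.
    """, re.IGNORECASE)

match = pattern.search("Thomas123@Yahoo.com")
print(match)  # None, so calling match.groups() here would raise an AttributeError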

### Substitutions
* Leave out privacy-sensitive information 
* Restructure sentences, e.g.: 
    * Significant Others (1987) 
    * To: 1987 - Significant Others
In [None]:
# remove names from text (privacy)
text = "Last night Mrs. Daisy and Mr. White murdered Mr. Chow"

pattern = re.compile(r'(Mrs\.|Mr\.) ([a-z]+)', re.IGNORECASE)
result = pattern.sub("REDACTED", text)
result

In [None]:
# create a separate group if, for example, you still want to show the first letter
# \g<1> refers to group 1 (\g<0> would be the entire match)
pattern = re.compile(r'(Mrs\.|Mr\.) ([a-z])([a-z]+)', re.IGNORECASE)
result = pattern.sub(r"\g<1> \g<2>", text)
result

In [None]:
import re

def censor(text):
    censor_pattern = re.compile(r'frack\w*', re.IGNORECASE)
    return censor_pattern.sub("CENSORED", text)
    
censor("Frack you")

Exam questions: 
* Which of the following strings will have matches in them? 
    * The regex pattern is given
    * Along with several example sentences
* Write a function called `is_valid_time` that accepts a single string argument. It should return `True` if the string is formatted correctly as a time, like 3:15 or 12:48 and return `False` otherwise. Note that times can start with a single number (2:30) or two (11:18).

In [None]:
# Don't forget to import re!
import re
# Define is_valid_time below:
def is_valid_time(text):
    # hours 0-23 (one or two digits), a colon, then minutes 00-59
    time_regex = re.compile(r'^([01]?\d|2[0-3]):[0-5]\d$')
    return time_regex.search(text) is not None

is_valid_time("23:59")