# Recap
---

## Regex

* A regular expression (regex) is a special sequence of characters that defines a pattern for complex string matching.
* Python provides regex support through the `re` library.
* Patterns are usually written as **raw strings**.
* There is no one pattern to rule them all!
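
As a quick refresher, the sketch below shows why raw strings matter and how capture groups are read back. The sample sentence and the variable names are illustrative only.

```python
import re

# In a raw string the backslash reaches the regex engine untouched,
# so r'\d' is exactly two characters: a backslash and the letter d.
assert len(r'\d') == 2

# Three capture groups: year, month and day.
pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

match = pattern.search('Released on 2020-11-13.')
if match:
    year, month, day = match.groups()
    print(year, month, day)   # prints: 2020 11 13
```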


### Exercise

Write a regular expression to convert dates of the format YYYY-MM-DD to DD-MM-YYYY.

**Program Output Example**
<pre>
Enter date in the format YYYY-MM-DD or 'q' to exit: 1985-05-25
New date: 25-05-1985
Enter date in the format YYYY-MM-DD or 'q' to exit: 1786-2-1
New date: 1-2-1786
Enter date in the format YYYY-MM-DD or 'q' to exit: q
</pre>

In [3]:
import re

def change_date_format(dt):
    # capture year, month and day, then emit the groups in reverse order
    return re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', r'\3-\2-\1', dt)

while True:
    date_str = input("Enter date in the format YYYY-MM-DD or 'q' to exit: ")
    
    if date_str == 'q':
        break
    
    print(f'New date: {change_date_format(date_str)}')

Enter date in the format YYYY-MM-DD or 'q' to exit: 1985-05-25
New date: 25-05-1985



# Case Study: Web Scraper

**Problem Statement:** We need a way to obtain data that is both legal and free of charge. How do we go about it?

---

## Topics Covered
* Legalities
* Setting up the required libraries
* HTTP Crash Course
* HTML Crash Course
* Stitching it together
---

## Legalities

Let's say that we want to scrape stock exchange data from websites such as the New York Stock Exchange (NYSE) or the Singapore Exchange (SGX). Before we start, we need to answer several questions:

1. Can we legally do it?
2. How do we tell if it's legal or not to scrape the data from the website? 
3. What are the consequences if the data was obtained illegally?

First and foremost, always **inform and get approval from your direct supervisor** for the data sources you intend to use, because unauthorised access is an offence under Singapore's **[Computer Misuse Act](https://sso.agc.gov.sg/Act/CMA1993)**.

> **Unauthorised access to computer material**    
3.—(1)  Subject to subsection (2), any person who knowingly causes a computer to perform any function for the purpose of securing access without authority to any program or data held in any computer shall be guilty of an offence and shall be liable on conviction to a fine not exceeding \\$5,000 or to imprisonment for a term not exceeding 2 years or to both and, in the case of a second or subsequent conviction, to a fine not exceeding \\$10,000 or to imprisonment for a term not exceeding 3 years or to both.

**How do we tell whether the data on a webpage may be scraped?**    
Many websites use techniques such as CAPTCHA tests, password-protected access or anti-scraping tools to limit web scraping.

| ![captcha.PNG](attachment:captcha.PNG) | ![captcha_2.PNG](attachment:captcha_2.PNG) |
|:---:|:---:|
| **Figure 1:** Word captcha | **Figure 2:** Image captcha |

Some websites also have a `robots.txt` file, accessible from the root of the site (e.g. `http://www.flipkart.com/robots.txt`), that details which parts of the website may be scraped or indexed. Let's go through the rules of `robots.txt` files.

#### Robots.txt Rules

1. **Allow full access**    
<pre>
User-agent: *
Disallow:
</pre>

 This means that all content on this website may be scraped and used.
 
 
2. **Block ALL access**    
<pre>
User-agent: *
Disallow: /
</pre>
 **DO NOT** scrape or use **any data** from the website or risk legal proceedings.


3. **Partial access**   
<pre>
User-agent: *
Disallow: /folder/<br>
User-agent: *
Disallow: /file.html
</pre>

 Webpages whose URLs match a path listed under `Disallow` are not to be scraped.


4. **Crawl rate limit**
```
Crawl-delay: 11
```

 This is meant for web scrapers or crawlers that do automatic data collection or indexing. The `11` is the number of seconds to wait between successive requests. This avoids putting unwanted stress on the server and slowing the site down for human visitors.


5. **Visit time**
```
Visit-time: 0400-0845
```

 Tells us, or the bot, when scraping is permitted; here, between 04:00 and 08:45. This is mainly used to keep data collection or indexing away from peak hours.


6. **Request rate**
```
Request-rate: 1/10
```

 Some websites **do not** entertain bots that fetch multiple pages simultaneously, so they limit this behaviour. The value `1/10` means the site allows 1 page request every 10 seconds.
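
Python's standard library includes `urllib.robotparser`, which can evaluate rules like the ones above. The sketch below parses a made-up rule set from a string so that it runs without network access; the `example.com` URLs and the rules themselves are assumptions for illustration. Note that this parser reads `Crawl-delay` and `Request-rate`, but it has no support for `Visit-time`.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt combining several of the rules described above.
rules = """\
User-agent: *
Disallow: /folder/
Disallow: /file.html
Crawl-delay: 11
Request-rate: 1/10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)   # parse lines directly instead of downloading a file

allowed = parser.can_fetch('*', 'https://example.com/index.html')
blocked = parser.can_fetch('*', 'https://example.com/folder/page.html')
delay = parser.crawl_delay('*')
rate = parser.request_rate('*')

print(allowed)                      # True
print(blocked)                      # False
print(delay)                        # 11
print(rate.requests, rate.seconds)  # 1 10
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the real file instead.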

**Example 1: Part of Flipkart's [`robots.txt`](http://www.flipkart.com/robots.txt) file**

In [None]:
User-agent: Mediapartners-Google
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

# cart
User-agent: *
Disallow: /viewcart

# Something related to carousel and recommendation carousel
User-agent: *
Disallow: /dynamic/
...

**Example 2: Channel News Asia's [`robots.txt`](https://www.channelnewsasia.com/robots.txt) file**

In [4]:
User-agent: *
Sitemap: https://www.channelnewsasia.com/googlenews/cna_news_sitemap.xml
Sitemap: https://www.channelnewsasia.com/sitemap.xml
Crawl-delay: 10


**Example 3: Singapore Stock Exchange's [`robots.txt`](https://www.sgx.com/robots.txt) file**

In [None]:
# SGX.com V2 robots.txt
# Tested against https://www.google.com/webmasters/tools/robots-testing-tool
# Last Update: 9 December 2019

# Dev and QA entry to Disallow bots and crawlers

# User-agent: *
# Disallow: /

# Below this line is used in Production

# User-agent: Twitterbot
# Disallow: /
# User-agent: Mediapartners-Google
# Disallow: /
User-agent: *
Allow: /
Disallow: /assets/static/e-learning/

---
## Setting up the required libraries

For this case study, we will be using 2 additional libraries:
* **[Requests: HTTP for Humans](https://requests.readthedocs.io/en/master/)** - a simple HTTP library that makes HTTP requests.
* **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)** - a Python library used for pulling data out of HTML and XML files.

Installation via `pip` can be done using the command:
`pip install requests beautifulsoup4`

---

## HTTP Crash Course

For the purpose of the 2 libraries that we are going to use, let's go through the very basics of the Internet. A very simple analogy of the Internet is a bunch of computers linked together via a network with the ability to communicate with each other. The application-level protocol that the web uses for this communication is called the **Hypertext Transfer Protocol (HTTP)**.

### The Internet
Consider the scenario of when a URL such as `www.google.com` has been entered into a browser. Several things occur:
* **DNS Lookup** - The host name in the URL is sent to a DNS (Domain Name System) server, which translates it into an IP address. This IP address is where your request is sent.

![dns.png](attachment:dns.png)

* **Request/Response Cycle** - This is where the client (your computer) makes a request to a server. That server then processes your request and issues you a response. Note that every request made is **completely independent**.

![reqresp.png](attachment:reqresp.png)

 An example of a `request` would be:
 <pre>
 GET /18_http.html HTTP/1.1
 </pre>
 
 An example of a `response` would be:
 <pre>
 HTTP/1.1 200 OK
 Content-Length: 65635
 Content-Type: text/html
 ...
 </pre>
 
Whenever a request is made, there is always a method attached to it. The 4 main methods are:
* **GET** - retrieves data from the server
* **POST** - submits data to the server
* **PUT** - updates data already on the server
* **DELETE** - deletes data from the server

Besides the method, a request carries other parts such as the URL of the resource, parameters for authorisation or from a form, and the content length and type. Most of this metadata is carried in the HTTP headers. Upon receiving a request, the server returns a response, and this response contains an HTTP status code.

Common HTTP status codes are:
* **200** - OK
* **404** - Not found
* **500** - Internal server error

A list of all HTTP Status Codes can be found [here](https://www.restapitutorial.com/httpstatuscodes.html). 
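
The standard library also knows these codes. As a small illustration, `http.HTTPStatus` maps each numeric code to its reason phrase; the dictionary name `phrases` is just for this example.

```python
from http import HTTPStatus

# Look up the reason phrase for each of the common codes listed above.
phrases = {code: HTTPStatus(code).phrase for code in (200, 404, 500)}

for code, phrase in phrases.items():
    print(code, phrase)
# prints:
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```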

The *Requests: HTTP for Humans* library is a third-party Python library that makes sending HTTP/1.1 requests extremely easy. Without it, we would have to use the built-in `json` and `urllib` libraries to construct requests and process the responses from servers.

**Example 4: Request message without using the *Requests: HTTP for Humans* library**

```python
import json
import urllib.request

conditionsSetURL = 'http://httpbin.org/post'
newConditions = {"con1":40, "con2":20, "con3":99, "con4":40, "password":"1234"} 
params = json.dumps(newConditions).encode('utf8')
req = urllib.request.Request(conditionsSetURL, data=params,
                             headers={'content-type': 'application/json'})
response = urllib.request.urlopen(req)
```
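
For comparison, here is a rough sketch of the same POST request built with *Requests*. It only prepares the request without sending it, so it can be inspected offline; the variable names `payload` and `prepared` are chosen for this example. The `json=` argument serialises the dictionary and sets the `Content-Type` header for us.

```python
import json
import requests

url = 'http://httpbin.org/post'
payload = {"con1": 40, "con2": 20, "con3": 99, "con4": 40, "password": "1234"}

# Build the request without sending it, so the result can be inspected offline.
prepared = requests.Request('POST', url, json=payload).prepare()

print(prepared.method)                    # POST
print(prepared.headers['Content-Type'])   # application/json
print(json.loads(prepared.body))          # the same dictionary as `payload`

# To actually send it (needs network access):
# response = requests.post(url, json=payload, timeout=10)
```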

---

## HTML Crash Course

Every webpage that is displayable on a web browser is created using a language called Hypertext Markup Language (HTML). It can also be assisted by other technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript.

In a nutshell, all data on a webpage is delineated by HTML tags. Most HTML elements are written as a start tag and end tag pair, but not all tags come in pairs: some, such as `<br>`, consist of a single tag.

Examples of HTML elements/tag:
* `<p></p>` - paragraph element
* `<br>` - line break tag
* `<div></div>` - division or section element
* `<hr>` - horizontal line tag

Each start tag can carry a set of global and element-specific attributes that control its behaviour. These attributes either modify the default functionality of an element/tag type or provide functionality that certain element/tag types need in order to work correctly. For example, in the figure below, the HTML paragraph element has the attribute `class` added to it. The `class` attribute is used to refer to one or more class names defined in a CSS style sheet.

![html_struc.png](attachment:c790722c-dca4-4bfb-a70e-46ee105f8a9c.png)

HTML elements can also be nested, meaning that an HTML element can contain any number of other elements within it. For example, the `<div>` element can be used as a container for other HTML elements/tags, to separate the different sections of an HTML document.

```html
<div class="myDiv">
  <h2>This is a heading in a div element</h2>
  <p>Here is some <b>bold</b> text.</p>
</div>
```

HTML elements/tags are arranged in a structure described by the Document Object Model (DOM). This model is a tree in which each node represents an object that makes up part of the HTML document.

![dom_model.png](attachment:f66ac901-34f2-4e4d-93fa-405be272cf1c.png)
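
To make the tree structure concrete, the sketch below walks the `<div>` snippet from earlier with Python's built-in `html.parser` and records an indented outline, one level of indentation per level of nesting. The `TreePrinter` class and the `VOID_TAGS` set are written for this illustration only and are not part of any library.

```python
from html.parser import HTMLParser

# Tags that never have an end tag (a small subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreePrinter(HTMLParser):
    """Collect an indented outline of the elements and text in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        if tag not in VOID_TAGS:
            self.depth += 1   # children of this element sit one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1   # back up to the parent's level

    def handle_data(self, data):
        text = data.strip()
        if text:              # ignore whitespace between tags
            self.lines.append("  " * self.depth + repr(text))


html = """
<div class="myDiv">
  <h2>This is a heading in a div element</h2>
  <p>Here is some <b>bold</b> text.</p>
</div>
"""

printer = TreePrinter()
printer.feed(html)
printer.close()
print("\n".join(printer.lines))
```

The outline shows `<h2>` and `<p>` as children of `<div>`, and `<b>` as a child of `<p>`, with the text sitting at the leaves of the tree.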

---
## Stitching it together

Now that we have a rough idea about robots.txt, HTTP and HTML, we can move on to web scraping. Web scraping is the act of harvesting data from websites, for example for research or personal interest. The additional libraries that we will use are *Requests: HTTP for Humans* and *Beautiful Soup*. 

Although there are ways to develop automated web scraping, this case study will focus on **manual** web scraping. Occasionally, a website will provide an Application Programming Interface (API) that lets you access its data in a predefined manner instead of jousting with HTML. This is beyond the scope of this case study, but you can read more [here](https://realpython.com/api-integration-in-python/) and [here](https://www.dataquest.io/blog/python-api-tutorial/).

The steps to web scraping are as follows:

1. Find the URL of the website that you want to scrape
2. Check the `robots.txt` file of your chosen website
3. Inspect the webpage
4. Find the data that you want to extract
5. Write the code
6. Run the code and extract the data
7. Store the data in the required format and medium

Let's use the Channel News Asia website for our web scraper. We will not be scraping the whole website, only the business-related news found at the URL `https://www.channelnewsasia.com/news/business`. Our goal is to scrape all the article titles and their elapsed time (in hours) since publishing.

![web_sc_05.PNG](attachment:web_sc_05.PNG)

Following the steps:

**Steps 1 & 2:** From Example 2 above, we have ascertained that the website's `robots.txt` file allows every webpage to be scraped. The only rule that we have to take note of is a 10 second delay between requests.
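
This case study only requests a single page, but if several pages were fetched, the delay could be honoured with a small helper like the one sketched below. `fetch_politely` is a hypothetical function written for this illustration; the example call uses a stand-in fetch function and a zero delay so that it runs instantly and downloads nothing.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=10):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)   # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetch function, so nothing is actually downloaded.
# With Requests, one could pass `requests.get` and keep the default delay.
pages = fetch_politely(['page-1', 'page-2'], fetch=str.upper, delay_seconds=0)
print(pages)   # ['PAGE-1', 'PAGE-2']
```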

**Step 3:** We need to find out how the articles are structured within the webpage. In order to do this, we need to **inspect** the page. An easy way is to roughly select (highlight) the parts that you want to scrape, then either choose the `Inspect` item from the context menu or press `Ctrl + Shift + I` on the keyboard while the browser is active.

![web_sc_02.PNG](attachment:web_sc_02.PNG)

This will bring up the inspection window in the browser zooming in to the highlighted HTML elements.

![web_sc_06.PNG](attachment:web_sc_06.PNG)

**Step 4:** From the HTML elements, find the rough area where the data is located. Here, the `<div>` element is the main container, and the `<a>` and `<time>` elements hold the data. 

![web_sc_07.PNG](attachment:web_sc_07.PNG)

Do take note of the attributes of the elements.

**Steps 5 & 6:** Writing the code. First and foremost, we have to load the required libraries.

In [5]:
import requests   # HTTP library for Python
from bs4 import BeautifulSoup   # library for pulling data out of HTML and XML files
from datetime import datetime   # Python's datetime library

The specified URL is `https://www.channelnewsasia.com/news/business`. Although we only need the news titles and elapsed times of the articles on the Channel News Asia Business page, we will request the whole webpage and pass it to the BeautifulSoup object. Make sure that the request returns with an `OK` status code before passing the page to the BeautifulSoup object.

The BeautifulSoup object uses a parser, here Python's built-in `html.parser`, to build a parse tree of the HTML, and provides idiomatic ways of navigating, searching and modifying that tree.

In [6]:
# website url
url = "https://www.channelnewsasia.com/news/business"

# time out instead of waiting indefinitely for a response
page = requests.get(url, timeout=10)

if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
else:
    # anything other than 200 OK means there is no page to parse
    print(page.status_code)

In [7]:
soup

<!DOCTYPE html>

<!--[if lte IE 9]><html lang="de" class="no-js old-ie"><![endif]-->
<!--[if gt IE 9]><!-->
<html class="" lang="en" navigation-scroll-fix="true"><!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=Edge,chrome=IE7" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Latest business news and headlines - CNA</title>
<!-- newrelic.browser.inject.type: api -->
<script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"UgADUl5VGwcHXVVSBwED",licenseKey:"ab7b570406",applicationID:"47940004"};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{c.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(27),

As mentioned earlier, the data are contained within the `<a>` and `<time>` HTML elements and these elements have the corresponding attribute `class` with the values `teaser__title` and `teaser__time` respectively. 

Because we already know which elements and attributes hold the data, we are able to extract the elements directly from the HTML document. The returned `data` is not a built-in Python datatype but a BeautifulSoup `ResultSet`.

In [8]:
data = soup.find_all(['a', 'time'], {'class': ['teaser__title', 'teaser__time']})

# check that the data we scraped is correct
print(type(data))
print(len(data))
print(data[:2])
print(data[-2:])

<class 'bs4.element.ResultSet'>
116
[<a class="teaser__title" href="/news/business/singapore-airlines-bond-issue-850-million-investor-13530640">Singapore Airlines raises S$850 million through convertible bond issue</a>, <time class="teaser__time" data-js-atom="time" datetime="1605229435"></time>]
[<a class="teaser__title" href="/news/business/who-says-faces--onslaught--of-cyberattacks-as-taiwan-complains-of-censorship-13525690">WHO says faces 'onslaught' of cyberattacks as Taiwan complains of censorship</a>, <time class="teaser__time" data-js-atom="time" datetime="1605184850"></time>]


Once we are certain that the data obtained is correct, we can extract each article's title and its time information. We will be storing this data in a `CSV` file. Because article titles can contain commas, we should choose a delimiter that is unlikely to appear in a title.

In [9]:
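delimiter = '*'
header = 'article_name' + delimiter + 'epoch_time\n'
data_to_write = [header]
temp = None   # holds the title until its matching <time> element arrives

for item in data:
    if item.name == 'a':
        temp = item.get_text() + delimiter
    elif temp is not None:   # skip a <time> element that has no preceding title
        temp += item['datetime'] + '\n'
        data_to_write.append(temp)
        temp = None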
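
As an alternative to picking a rare delimiter by hand, Python's built-in `csv` module quotes any field that contains the delimiter. The sketch below is not the method used in this case study; it writes to an in-memory buffer so that it runs without touching the disk, and the second row is a made-up title that contains a comma on purpose.

```python
import csv
import io

# Hypothetical rows; the second title contains a comma on purpose.
rows = [
    ('Singapore Airlines raises S$850 million through convertible bond issue', '1605229435'),
    ('Stocks rise, oil falls', '1605184850'),
]

buffer = io.StringIO()   # stands in for a file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(['article_name', 'epoch_time'])
writer.writerows(rows)

print(buffer.getvalue())

# Reading the text back recovers the original fields, comma included.
buffer.seek(0)
read_back = list(csv.reader(buffer))
```

The title with a comma is wrapped in double quotes in the output, so it still reads back as a single field.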


Before writing the data to a `CSV` file, we need to construct a unique file name. One way of doing this is by appending the date and time (timestamp) to the file name. This also helps us keep track of the age of the data.

In [10]:
# constructing the file name
timestamp = datetime.now().strftime("%d-%b-%Y_%H-%M-%S")
filename = 'scrape_'+timestamp+'.csv'

# writing the data to file (each entry already ends with a newline)
with open(filename, 'w', encoding='utf-8') as fo:
    fo.writelines(data_to_write)

The data can now be used for the next part of the process such as data cleaning and analysis. 
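
As one possible cleaning step, the stored epoch values can be turned into the elapsed time in hours that we set out to collect. The sketch below assumes that the `datetime` attribute holds Unix epoch seconds; `elapsed_hours` is a hypothetical helper, and its `now` parameter exists only so that the example gives a repeatable result.

```python
from datetime import datetime, timezone


def elapsed_hours(epoch_seconds, now=None):
    """Return the hours elapsed between a Unix epoch timestamp and `now`."""
    published = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - published).total_seconds() / 3600


# A fixed reference time exactly two hours after the first scraped article.
reference = datetime.fromtimestamp(1605229435 + 2 * 3600, tz=timezone.utc)
print(elapsed_hours('1605229435', now=reference))   # 2.0
```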