 ### <span style="color: #DB4437; font-family: Trebuchet MS, sans-serif;"> Disclaimer: Please note that the instructions for Web scraping in this file are tailored for the Chrome browser, and may not be applicable to other browsers. We recommend using Chrome for best results. </span>  
 [Chrome Download Link](https://www.google.com/chrome/)

# [Trump's Lies](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html)

<img src="https://raw.githubusercontent.com/justinjiajia/img/69b47fdb83d0f3026cb3677e818df0ead20e15b7/python/trump%20lies.PNG"/>

How to open Chrome's developer tools (or DevTools):

- Press <kbd>F12</kbd> on the keyboard (sometimes <kbd>Fn</kbd> + <kbd>F12</kbd>)
- Right-click on a Web page → Inspect
- Choose <kbd>⋮</kbd> at the top right corner of the Chrome window → More tools → Developer tools (or press <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>I</kbd> on Windows/Linux, or <kbd>Command</kbd> + <kbd>Option</kbd> + <kbd>I</kbd> on macOS)


---


# Web Scraping


Using the Python programming language, it is possible to "scrape" data from the Web in a quick and efficient manner.

**Web Scraping** commonly refers to the practice of writing an automated program
that queries a web server, requests data (usually in the form of HTML and other files
that compose web pages), and then parses that data to extract needed information.



Web scraping is a valuable tool in the data scientist's skill set.

<span style="margin-right: 10%;  margin-left: 20%;">[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)</span>
<span style="margin-left: 5%;">[Selenium](https://www.selenium.dev)</span>

<p>

<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/beautifulsoup.jpg" style="float: left; margin-bottom: 1.5em; margin-top: 1.5em; margin-right: 10%;  margin-left: 15%; " width=220/>
 <img src="https://upload.wikimedia.org/wikipedia/commons/d/d5/Selenium_Logo.png" style="float: left; margin-top: 3.8em;" width=140/>



---
<br>

# HTML Basics



[**HyperText Markup Language**](https://developer.mozilla.org/en-US/docs/Web/HTML) (HTML for short) is a markup language for describing Web documents.


It is plain text, but includes a rich collection of "tags" that define the structure of the document and allow documents to include a variety of page elements.

- Tags are always enclosed in angle brackets: ` <tagname>`.


<img src="https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started/grumpy-cat-small.png" width=480/>


- Tags usually travel in pairs and contain something in between. An opening (or start) tag begins a section of page content, and a closing (or end) tag ends it, e.g.,  `<tagname>content</tagname>`.

    - `<h1>`, `<h2>`, ..., `<h6>`: largest headings, second largest headings, etc.;

    - `<p>`: paragraphs;

    - `<ul>` or `<ol>`: unordered (bulleted) or ordered (numbered) lists;

    - `<li>`: individual list items;

    - `<div>`: divisions or sections;

    - `<a>`: anchors;

    - and [many others ...](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)



    
- There are also a few *self-closing* tags, e.g.:

    - `<br/>`

    - ```html
      <img src="https://idp.ust.hk/idp/images/logo.png" alt="UST Logo"/>
      ```


- Tags can have attributes, which are always specified in the start tag and come in `name="value"` pairs. E.g.:

    - ```html
      <a href="https://docs.python.org/" target="_blank">Python documentation</a>
      ```
      
      
<img src="https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started/grumpy-cat-attribute-small.png" width=600/>






- There are a few attributes that are common to all tags, among which the two most used ones are `id` and `class`:

    - `id` assigns an element an identifier that must be unique within the whole document. E.g.:
    
        - ```html
          <p id="unique-one">This is a paragraph with id "unique-one".</p>
          ```    

    - `class` allows us to assign multiple elements (possibly of different types) to the same group identified by a class name. E.g.:

        - ```html
          <p class="class-one"> This is a paragraph of class "class-one". </p>
          ```

        - An element can also be assigned to multiple groups that are specified by a *space-separated* list of class names. E.g.:
        
            - ```html
              <div class="col m-1 border class-one">Some Content</div>
              ```

    - In Web scraping, the  `class` and/or `id` attributes can be leveraged to differentiate the section we want from other sections. To do so, we need to use something called the **CSS selector**.
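
To make the `name="value"` structure of attributes concrete, here is a minimal sketch that lists the attributes of each start tag. It uses `HTMLParser` from Python's standard library rather than a scraping library, and the class name `AttributeLister` and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser


class AttributeLister(HTMLParser):
    """Record (tag name, attributes) for every start tag encountered."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs taken from the start tag
        self.seen.append((tag, dict(attrs)))


# hypothetical sample markup reusing the id and class values shown above
html_text = (
    '<p id="unique-one">A paragraph with an id.</p>'
    '<div class="col m-1 border class-one">Some Content</div>'
)

lister = AttributeLister()
lister.feed(html_text)

for tag, attributes in lister.seen:
    # a class value is a single string; splitting on whitespace gives the class names
    class_names = attributes.get("class", "").split()
    print(tag, attributes.get("id"), class_names)
```

The `<div>` ends up with four class names, which is why one element can belong to several groups at once.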

**<font color='steelblue' >Question</font>**:


Use a text editor on your computer to create an HTML Web page that has the following layout:

<img src="https://canvas.ust.hk/courses/49017/files/7437224/preview?instfs=true" width=800 />


- The text on the page is as follows:

    Countries of the World: A Simple Example

    Andorra

    - Capital: Andorra la Vella
    - Population: 84000
    - Area (km^2): 468.0

    - The image address: https://upload.wikimedia.org/wikipedia/commons/5/5b/Andorra_la_Vella_-_view2.jpg



- Tags you may want to enclose in the `<body>` tag include: `<h1>`, `<h3>`, `<hr>`, `<div>`, `<ol>`, `<li>`, and `<img>`

---

<br>

## CSS Basics


[**Cascading Style Sheets**](https://developer.mozilla.org/en-US/docs/Web/CSS) (CSS for short) is a style sheet language for describing the presentation of a document written in a markup language.

HTML dictates the content and structure of a webpage, while CSS controls the design and display of its HTML elements.




3 ways of creating and applying CSS rules:

- An inline style created right on an HTML start tag, using `style="property: value;"`;

- An [embedded style sheet](sample_embeddedCSS.html) included by a `<style>` tag nested within the `<head>` section;

- As an [external style sheet](sample_externalCSS.html) in a [separate file](example.css).


<img src="https://4.bp.blogspot.com/-yQFU_PhmTRg/U7viQ7bMNjI/AAAAAAAADJE/ctlBTLl-fhY/s640/css-selectors-lrg.png" width=500px />

A CSS stylesheet is a set of style rules, and a CSS ruleset consists of:

- A CSS selector: a pattern used to select the element(s) we want to style;

-  A style declaration block, which contains a list of style properties and their values:

    - Style properties and values come in pairs, with each pair separated with a semicolon (`;`).

    - See a [list](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Properties_Reference) of common properties and what values are accepted for them.







Different types of selectors:

| Type | Notation |  Example | Explanation |
|-----|-----|-----|-----|
| type | `elementname` | `p` | Select all `<p>` elements|
| id | `#idname` | `#unique-one` | Select the element with `id="unique-one"` |
| class | `.classname` | `.class-one`  |Select all elements with `class="class-one"` |
| attribute | `[attributename]` |`[alt]` | Select all elements with an `alt` attribute |
| attribute value | `[attr=value]` |`[alt='UST Logo']` |  Select all elements whose `alt` attribute has the value `'UST Logo'` |




Attribute selectors also support operators such as `~=`, `$=`, `^=`, and `*=` to create more flexible value-matching patterns. You can find more details [here](https://developer.mozilla.org/en-US/docs/Web/CSS).
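
As a rough illustration of these value-matching patterns, the sketch below runs four attribute selectors through BeautifulSoup's `.select()` method, which this notebook covers further on. The HTML snippet is invented for the example:

```python
from bs4 import BeautifulSoup

# hypothetical snippet, not taken from any real page
html_text = (
    '<a href="https://docs.python.org/">Python documentation</a>'
    '<a href="sample.html">A local page</a>'
    '<img src="logo.png" alt="UST Logo"/>'
    '<div class="col m-1 border class-one">Some Content</div>'
)
soup = BeautifulSoup(html_text, "html.parser")

starts_with = soup.select('a[href^="https"]')   # value begins with "https"
ends_with = soup.select('a[href$=".html"]')     # value ends with ".html"
contains = soup.select('[alt*="Logo"]')         # value contains "Logo"
has_word = soup.select('[class~="border"]')     # space-separated value includes "border"

for label, matches in [("^=", starts_with), ("$=", ends_with),
                       ("*=", contains), ("~=", has_word)]:
    print(label, [tag.name for tag in matches])
```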

CSS selectors can be combined to create more specific selectors. For example,
to select
```html
<div class="col m-1 border class-one">Relevant Content</div>
```
we can use:

- `div.class-one`;

- `div.m-1.col` (the order does not matter);

- `div[class="col m-1 border class-one"]`.



CSS selector syntax also supports a set of relational notations, called **combinators**, that describe elements in terms of their relations to others:

| Name | Notation |  Example | Explanation |
|-----|-----|-----|-----|
| Descendant| ` `| `div p` | Select all `<p>` elements inside `<div>` elements |
| Child | `>`| `div > p` | Select all `<p>` elements whose parent is a `<div>` element|
|Adjacent sibling | `+`| `div + p` | Select all `<p>` elements that immediately follow a `<div>` element with the same parent |
|General sibling  | `~`| `div ~ p` | Select all `<p>` elements that follow a `<div>` element with the same parent, immediately or not |


More usage details can be found [here](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors).
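
The four combinators can be compared side by side on one small document. This is only a sketch: the HTML is made up, the helper `texts` is a hypothetical convenience function, and the selectors are evaluated with BeautifulSoup's `.select()` (introduced later in this notebook):

```python
from bs4 import BeautifulSoup

# hypothetical document: one <div> followed by sibling elements
html_text = (
    "<body>"
    "<div><p>child of div</p><ul><li><p>grandchild of div</p></li></ul></div>"
    "<p>right after div</p>"
    "<h3>a heading</h3>"
    "<p>later sibling of div</p>"
    "</body>"
)
soup = BeautifulSoup(html_text, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


descendant = texts("div p")    # any depth inside the <div>
child = texts("div > p")       # direct children only
adjacent = texts("div + p")    # the <p> right after the <div>
general = texts("div ~ p")     # every later <p> sibling of the <div>

for name, result in [("div p", descendant), ("div > p", child),
                     ("div + p", adjacent), ("div ~ p", general)]:
    print(f"{name:8} {result}")
```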


Handy interactive [HTML cheatsheet](https://htmlcheatsheet.com/html) and [CSS cheatsheet](https://htmlcheatsheet.com/css)




---


# Beautiful Soup


> Beautiful Soup, so rich and green,<br>
> Waiting in a hot tureen!<br>
> Who for such dainties would not stoop?<br>
> Soup of the evening, beautiful Soup!


<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/beautifulsoup.jpg"  width=220/>


<br>

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for pulling data out of files written in a markup language.




An example of ["**tag soup**"](https://en.wikipedia.org/wiki/Tag_soup):


```html
<H1>Tacky HTML</H1>
<p>
    <img src=https://idp.ust.hk/idp/images/logo.png>
    Browsers tolerate a lot of completely broken HTML.
<UL>
    <LI>List one
    <LI>List 2
</UL>
```

BeautifulSoup helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects.

The most commonly used object in the BeautifulSoup library is the `BeautifulSoup` object.

In [None]:
from bs4 import BeautifulSoup
import bs4

bs4.__version__


The use of the `BeautifulSoup` constructor:

- The 1st argument is the HTML text the object is based on;

- The 2nd argument specifies the parser that we want BeautifulSoup to use in order to create a `BeautifulSoup` object.


<div class='alert alert-info'><code>html.parser</code> is the HTML parser included in the standard Python 3 library. Information on other HTML parsers is available <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser">here</a>.</div>




Let's create a `BeautifulSoup` object for the tacky HTML code above to "***make the soup***":



In [None]:
tacky_html = '''<H1>Tacky HTML</H1>
<p>
    <img src=https://idp.ust.hk/idp/images/logo.png>
    Browsers tolerate a lot of completely broken HTML.
<UL>
    <LI>List one </LI>
    <LI>List 2
</UL>'''

BeautifulSoup(tacky_html, 'html.parser')

In [None]:
html_sample_code = ('<!DOCTYPE html><html lang="en"><head><title>Sample HTML Page</title></head>'
                    '<body><h1>This is a heading.</h1>'
                    '<p>This is a typical paragraph.</p>'
                    '<p class="class-one">This is a paragraph of class "class-one".</p>'
                    '<ol><li class="class-one"><a href="sample.html">The 1st item</a></li></ol>'
                    '<p id="unique-one">This is a paragraph with id "unique-one".</p>'
                    '<div class="col m-3 border class-one">This is a division.</div></body></html>')

sample_soup = BeautifulSoup(html_sample_code, 'html.parser')
sample_soup


We can use the `prettify()` method to format the HTML source code so as to get a better idea of its structure:


In [None]:
from bs4.formatter import HTMLFormatter

# you can customize the indent size by setting e.g., formatter=HTMLFormatter(indent=4)
print(sample_soup.prettify())



These textual components form a ["family tree"](familytree.html), where the top-level `<html>` tag contains the `<head>` and `<body>` tags, which in turn contain other text and tags, and so on.




We can identify the parents, children, and siblings of each tag by referring to this family tree (the indentation levels of the tags reflect these relationships).
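
These family relations are also exposed as attributes on each tag. The following sketch walks a cut-down version of the sample page. It assumes the HTML contains no whitespace between tags; otherwise whitespace strings would appear among the siblings and children:

```python
from bs4 import BeautifulSoup

# shortened version of the sample page, written without whitespace between tags
html_text = (
    "<html><head><title>Sample HTML Page</title></head>"
    "<body><h1>This is a heading.</h1>"
    "<p>This is a typical paragraph.</p>"
    '<ol><li><a href="sample.html">The 1st item</a></li></ol>'
    "</body></html>"
)
soup = BeautifulSoup(html_text, "html.parser")

first_p = soup.p
parent_name = first_p.parent.name                         # tag that directly contains <p>
children = [child.name for child in soup.body.children]   # direct children of <body>
before = first_p.previous_sibling.name                    # sibling just before <p>
after = first_p.next_sibling.name                         # sibling just after <p>
ancestors = [tag.name for tag in soup.a.parents]          # from <li> up to the document

print(parent_name, children, before, after, ancestors)
```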


## What's in a Soup?

A `BeautifulSoup` object is essentially a hierarchical collection of `Tag` objects, each equipped with attributes for navigating and iterating over the child tags and strings it contains.

The simplest way to navigate this collection is to say the name of a child tag we want:

In [None]:
sample_soup.html.body          # the <body> tag is beneath the <html> tag

We can also access the `<body>` tag directly as long as there's no ambiguity:

In [None]:
sample_soup.body

In [None]:
# get the <h1> tag nested two layers deep into the BeautifulSoup object structure (html → body → h1)
sample_soup.html.body.h1        # equivalently, sample_soup.body.h1 or sample_soup.h1

In [None]:
sample_soup.body.p              # get the 1st <p> tag beneath the <body> tag

A tag's children are available in a list when we call `.contents`:

In [None]:
sample_soup.body.contents

If we want the human-readable text inside a document or tag, we can use the `get_text()` method, which returns all the text in a document or beneath a tag, as a single regular string:

In [None]:
sample_soup.body.get_text()

In [None]:
sample_soup.ol.get_text()
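
`get_text()` also accepts `separator` and `strip` arguments, which help when the text of several nested tags would otherwise run together. A small sketch on a made-up list:

```python
from bs4 import BeautifulSoup

# hypothetical list whose items carry extra spaces
html_text = "<ol><li> The 1st item </li><li> The 2nd item </li></ol>"
soup = BeautifulSoup(html_text, "html.parser")

raw = soup.ol.get_text()                                  # strings concatenated as-is
stripped = soup.ol.get_text(strip=True)                   # each string stripped, then joined
joined = soup.ol.get_text(separator=" | ", strip=True)    # stripped and joined with " | "
pieces = list(soup.ol.stripped_strings)                   # the stripped strings as a list

print(repr(raw))
print(repr(stripped))
print(repr(joined))
print(pieces)
```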

If we want to extract the value of a specific attribute, we can use a tag's `.get()` method:

In [None]:
sample_soup.a.get('href')    # alternatively, use dictionary-like indexing, e.g., sample_soup.a['href']
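
The two access styles differ when the attribute is absent: `.get()` returns `None` (or a default we supply), while indexing raises a `KeyError`. A short sketch with an invented link:

```python
from bs4 import BeautifulSoup

# hypothetical link with two attributes
soup = BeautifulSoup(
    '<a href="sample.html" class="link external">The 1st item</a>', "html.parser"
)
link = soup.a

href = link.get("href")             # value of an existing attribute
same = link["href"]                 # dictionary-style access gives the same value
missing = link.get("title")         # absent attribute: None instead of an error
fallback = link.get("title", "")    # absent attribute with an explicit default
all_attrs = link.attrs              # all attributes as a dictionary

try:
    link["title"]
    raised = False
except KeyError:
    raised = True                   # indexing an absent attribute raises KeyError

print(href, missing, repr(fallback), all_attrs, raised)
```

Note that `class` comes back as a list of names because it is a multi-valued attribute.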


---

<br>

### `.find()` & `.find_all()`

Using a tag name as an attribute gives us only the first tag by that name.

If we need to get all tags with a certain name, we need to use `find_all()`.

`find_all()` accepts a variety of filters and returns a list of all matching tags; `find()` accepts the same filters but returns only the first match:

In [None]:
sample_soup.find_all('p')                      # perform a match against that exact string; return a list of tags

[<p>This is a typical paragraph.</p>,
 <p class="class-one">This is a paragraph of class "class-one".</p>,
 <p id="unique-one">This is a paragraph with id "unique-one".</p>]

In [None]:
sample_soup.find_all(["p", "a"])               # perform a string match against any item in that list

[<p>This is a typical paragraph.</p>,
 <p class="class-one">This is a paragraph of class "class-one".</p>,
 <a href="sample.html">The 1st item</a>,
 <p id="unique-one">This is a paragraph with id "unique-one".</p>]
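
The two methods also differ in what they return when nothing matches, which matters when chaining further calls. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# hypothetical snippet with three paragraphs and no table
html_text = "<body><p>first</p><p>second</p><p>third</p></body>"
soup = BeautifulSoup(html_text, "html.parser")

first = soup.find("p")                    # first match only
first_two = soup.find_all("p", limit=2)   # stop after two matches
nothing = soup.find("table")              # no match: None
empty = soup.find_all("table")            # no match: empty list

print(first, first_two, nothing, empty)
```

Because `find()` can return `None`, it is worth checking the result before calling methods such as `.get_text()` on it.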

We can also form filters based on tags' various attributes:

- The 1st argument is a (list of) filter(s) on tag name(s);

- The 2nd argument is a dictionary of filters on attribute values, e.g., `{attr_1: val_1, attr_2: val_2, ...}`.

In [None]:
help(sample_soup.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Extracts a list of Tag objects that match the given
    criteria.  You can specify the name of the Tag and any
    attributes you want the Tag to have.
    
    The value of a key-value pair in the 'attrs' map can be a
    string, a list of strings, a regular expression object, or a
    callable that takes a string and returns whether or not the
    string matches for some custom definition of 'matches'. The
    same is true of the tag name.
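
As the help text says, a filter does not have to be a plain string. The sketch below tries a regular expression on tag names, a regular expression on an attribute value, and a callable on an attribute value. The HTML is invented for the example:

```python
import re

from bs4 import BeautifulSoup

# hypothetical snippet with headings, paragraphs, and links
html_text = (
    "<body><h1>Title</h1><h2>Subtitle</h2>"
    '<p class="class-one">one</p><p class="class-two">two</p>'
    '<a href="https://docs.python.org/">docs</a><a href="sample.html">local</a>'
    "</body>"
)
soup = BeautifulSoup(html_text, "html.parser")

# tag names matching a regular expression: h1 through h6
headings = soup.find_all(re.compile(r"^h[1-6]$"))

# <p> tags with a class name that starts with "class-"
class_prefixed = soup.find_all("p", {"class": re.compile(r"^class-")})

# <a> tags whose href passes a custom test; the value is None when href is absent
external = soup.find_all(
    "a", {"href": lambda value: value is not None and value.startswith("https://")}
)

print(headings, class_prefixed, external)
```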



In [None]:
sample_soup.find_all('p', {'class': 'class-one'})

In [None]:
sample_soup.find_all(['div', 'p'], {'class': 'class-one'})

In [None]:
# filter elements only on attributes' values
sample_soup.find_all(attrs={'class': 'class-one'})

We can also search for the exact string value of the class attribute:

In [None]:
sample_soup.find('div', {'class': 'col m-3 border class-one'})

<div class="col m-3 border class-one">This is a division.</div>

But searching for variants of the string value won't work:

In [None]:
sample_soup.find_all('div', {'class': 'col class-one'})
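
To match on a subset of an element's classes, we can filter on a single class name or switch to a CSS selector. The sketch below contrasts the options on the same `<div>`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="col m-3 border class-one">This is a division.</div>', "html.parser"
)

exact = soup.find_all("div", {"class": "col m-3 border class-one"})   # full string: match
variant = soup.find_all("div", {"class": "col class-one"})            # partial string: no match
single = soup.find_all("div", class_="border")                        # one class name: match
both = soup.select("div.col.class-one")                               # CSS selector: match

print(len(exact), len(variant), len(single), len(both))
```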

**<font color='steelblue' >Question</font>**:


This [web page](https://justinjia.people.ust.hk/countries.html) lists information about 249 countries in the world. Use `.find_all()` to extract the information on name, capital, population, and area of each country.

In [None]:
# starter code

import requests
response = requests.get("https://justinjia.people.ust.hk/countries.html", timeout=3)
soup = BeautifulSoup(response.content, 'html.parser')

# provide your code below




---

<br/>

### `.select()` & `.select_one()`




A `BeautifulSoup` object has a `.select()` (or `.select_one()`) method, which allows us to query elements using CSS selectors:

In [None]:
# match elements against two or more classes
sample_soup.select(".class-one.m-3")

In [None]:
sample_soup.select("[href]")
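
`.select()` always returns a list, whereas `.select_one()` returns the first match or `None`. A brief sketch, using a made-up list with two links:

```python
from bs4 import BeautifulSoup

# hypothetical list with two links
html_text = (
    '<ol><li class="class-one"><a href="sample.html">The 1st item</a></li>'
    '<li><a href="other.html">The 2nd item</a></li></ol>'
)
soup = BeautifulSoup(html_text, "html.parser")

links = soup.select("li > a")            # every <a> that is a direct child of an <li>
first_link = soup.select_one("li > a")   # only the first such <a>
no_match = soup.select_one("table")      # nothing matches: None

# pull the text and the href out of each matched tag
pairs = [(a.get_text(), a["href"]) for a in links]
print(pairs, first_link, no_match)
```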

**<font color='steelblue' >Question</font>**:


Now use `.select()` to extract the same information from this [web page](https://justinjia.people.ust.hk/countries.html).

In [None]:
# starter code
import requests
response = requests.get("https://justinjia.people.ust.hk/countries.html", timeout=3)
soup = BeautifulSoup(response.content, 'html.parser')

# provide your code below




---
<br>

## Scraping Textual Data on Trump's Lies


We'll first use `requests.get(url)` to connect to a URL and make a soup with what is held by the `.content` attribute of the response:





In [None]:
from bs4 import BeautifulSoup

import requests
# open the URL; set timeout to 3 seconds
response = requests.get("https://drive.google.com/uc?export=download&id=1rJNpIuhDMnuARYfDbUqHjo-RQMWTCTvg",
                        timeout=3)
lies_soup = BeautifulSoup(response.content, 'html.parser')
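
A request can fail or return an error page, in which case the soup would be built from the wrong content. One possible safeguard is to check the response status before parsing. The helper names `soup_from_response` and `fetch_soup` below are hypothetical, not part of `requests` or BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup


def soup_from_response(response):
    """Parse a response body, but only after confirming the request succeeded."""
    response.raise_for_status()    # raises requests.HTTPError for 4xx/5xx status codes
    return BeautifulSoup(response.content, "html.parser")


def fetch_soup(url, timeout=3):
    """Download `url` and return its parsed HTML (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    return soup_from_response(response)
```

With these helpers, `lies_soup = fetch_soup(url)` would replace the two steps above. Wrapping the call in `try` / `except requests.exceptions.RequestException` would also catch timeouts and connection errors.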

---

<br>


In the HTML code, every record is enclosed in a `<span>` tag with `class="short-desc"`:

```html
<span class="short-desc">
      <strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span>
</span>

```






In [None]:
# equivalently, lies_soup.select('span.short-desc')
item_list = lies_soup.find_all('span', {'class': 'short-desc'})

This returns a list of all tags that match the given criteria:

In [None]:
item_list[0]

In [None]:
item_list[5]

The general structure of a single record is:

```html
<strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span>
```


Use `.find()` with the tag name `"strong"` to select the tag that contains the `DATE`:



In [None]:
item_list[0].find("strong")    # equivalently, item_list[0].select_one("strong")

Then use `.get_text()` to extract only the text, passing `strip=True` to remove leading and trailing whitespace:

In [None]:
item_list[0].find("strong").get_text(strip=True)

Next, use `.contents` with list indexing to extract the `LIE`:

In [None]:
item_list[0].contents

In [None]:
child_nodes = item_list[0].contents
child_nodes[1][1:-2]

For the `EXPLANATION`, select the text within the `<span>` tag, which is the 3rd child of the tag:

In [None]:
child_nodes[2].get_text(strip=True)[1:-2]

'He was for an invasion before he was against it'
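
The `[1:-2]` slices depend on the exact characters surrounding each piece of text. The sketch below shows what they remove, using a synthetic record that follows the template above. Its text, URL, and punctuation (curly quotes around the statement, parentheses and a final period around the explanation) are assumptions for illustration, not data from the real page:

```python
from bs4 import BeautifulSoup

# synthetic record; \u201c and \u201d are the opening and closing curly quotes
record_html = (
    '<span class="short-desc"><strong>Jan. 1 </strong>'
    '\u201cSome statement.\u201d '
    '<span class="short-truth"><a href="https://example.com/source">'
    '(Some explanation.)</a></span></span>'
)
item = BeautifulSoup(record_html, "html.parser").span

date_tag, statement, truth_tag = item.contents

# slicing drops 1 leading character and 2 trailing ones (closing quote + space)
sliced = statement[1:-2]
# stripping by character set gives the same text without counting positions
stripped = statement.strip(" \u201c\u201d")

explanation = truth_tag.get_text(strip=True)
sliced_explanation = explanation[1:-2]    # drops "(" at the start and ".)" at the end

print(repr(sliced), repr(stripped), repr(sliced_explanation))
```

Stripping by character set is less sensitive to small variations in spacing, though it assumes the text itself does not begin or end with one of the stripped characters.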

Note that the `URL` is the value of the `href` attribute of the `<a>` tag. We can read a specific attribute of a tag with `.get()`:


In [None]:
item_list[0].a.get('href')

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'

Finally, we apply the same steps to every record using a `for` loop:

In [None]:
date_list, lie_list, explanation_list, url_list = [], [], [], []

for item in item_list:
    first, middle, last = item.contents
    date_list.append(first.get_text(strip=True))
    lie_list.append(middle[1:-2])
    explanation_list.append(last.get_text(strip=True)[1:-2])
    url_list.append(item.find('a').get('href'))


In [None]:
print(date_list)

We can now combine the data into a pandas `DataFrame` (a tabular data model that makes data manipulation and analysis easier):

In [None]:
import pandas as pd
lie_df = pd.DataFrame({'date': date_list, 'lie': lie_list,
                       'explanation': explanation_list, 'url': url_list})
lie_df

| | date | lie | explanation | url |
|-----|-----|-----|-----|-----|
| 0 | Jan. 21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud | https://www.nytimes.com/2017/01/24/us/politics... |
| ... | ... | ... | ... | ... |
| 175 | Oct. 25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1 | Again, we're the highest-taxed nation, just ab... | We're not | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |


In [None]:
# alternatively, a row-based approach

all_records = []


for item in item_list:
    first, middle, last = item.contents
    date = first.get_text(strip=True)
    lie = middle[1:-2]
    explanation = last.get_text(strip=True)[1:-2]
    url = item.find('a').get('href')
    record = {'date': date, 'lie': lie, 'explanation': explanation, 'url': url}
    all_records.append(record)

lie_df = pd.DataFrame(all_records)
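
Unpacking `item.contents` into exactly three names raises an error if a record deviates from the expected structure. A more defensive variant of the row-based approach could look up each piece by tag and skip incomplete records. In this sketch, the function `parse_record` and the two sample records are hypothetical, and the text is kept untrimmed (no `[1:-2]` slicing):

```python
from bs4 import BeautifulSoup, NavigableString


def parse_record(item):
    """Return one record as a dict, or None if an expected piece is missing."""
    date_tag = item.find("strong")
    truth_tag = item.find("span", {"class": "short-truth"})
    link = item.find("a")
    if date_tag is None or truth_tag is None or link is None:
        return None
    # the statement is the text node sitting right after the <strong> tag
    statement = date_tag.next_sibling
    if not isinstance(statement, NavigableString):
        return None
    return {
        "date": date_tag.get_text(strip=True),
        "lie": str(statement).strip(),
        "explanation": truth_tag.get_text(strip=True),
        "url": link.get("href"),
    }


# two synthetic records: the second lacks the explanation and the link
sample_html = (
    '<span class="short-desc"><strong>Jan. 1 </strong>A statement. '
    '<span class="short-truth"><a href="https://example.com/a">'
    "(An explanation.)</a></span></span>"
    '<span class="short-desc"><strong>Jan. 2 </strong>No link here.</span>'
)
items = BeautifulSoup(sample_html, "html.parser").find_all(
    "span", {"class": "short-desc"}
)

records = [record for record in map(parse_record, items) if record is not None]
print(records)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` in the same way as `all_records`.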

Save the output in the file system:

In [None]:
lie_df.to_csv('trump_lies.csv')
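
By default, `to_csv()` also writes the row index, which shows up as an extra `Unnamed: 0` column when the file is read back. Passing `index=False` avoids this. The sketch below demonstrates the difference on a tiny made-up `DataFrame`, writing to in-memory buffers instead of files:

```python
import io

import pandas as pd

# tiny made-up table standing in for lie_df
df = pd.DataFrame({"date": ["Jan. 1", "Jan. 2"], "lie": ["first", "second"]})

with_index = io.StringIO()
df.to_csv(with_index)                  # default: the row index is written as a first column
with_index.seek(0)
reloaded = pd.read_csv(with_index)     # the index column comes back as 'Unnamed: 0'

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # leave the row index out of the file
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns))
print(list(clean.columns))
```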

---


**<font color='steelblue' >Question</font>**:

This time, extract the 4 pieces of information country by country, and combine them into a pandas `DataFrame`.


In [None]:
# starter code
import requests
response = requests.get("https://justinjia.people.ust.hk/countries.html", timeout=3)
soup = BeautifulSoup(response.content, 'html.parser')

# provide your code below


