# Lesson 4: Web Scraping
---
Intro: 
Today we will learn about the benefits and applications of web scraping using Python.


# Review
---

1. What are sets?
2. What kinds of operations can you use with sets?
3. What's the difference between a set and a frozen set?

# Concept 1: Web Scraping
---


## What is it?
Web scraping is the process of using bots to extract content and data from a website. It extracts underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.

Although web scraping can be used maliciously, it also has many legitimate benefits.
Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:

* Search engine bots crawling a site, analyzing its content and then ranking it.
* Price comparison sites deploying bots to auto-fetch prices and product descriptions from allied seller websites.
* Market research companies using scrapers to pull data from forums and social media.

## Tools

1. [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - A Python library for pulling data out of HTML and XML files. It's easy to learn and master. However, it only parses documents, so it needs a separate library (such as requests) to download pages.
2. [Selenium](https://www.selenium.dev/documentation/en/) - A browser automation tool used for both automated testing and web scraping. It's versatile because it works with pages that rely on JavaScript as well as plain HTML. However, it is not the best choice if you only need web scraping.
3. [Scrapy](https://docs.scrapy.org/en/latest/) - Very fast and efficient, but can be complex.

We will be using BeautifulSoup4.

## Examples:
---

1. The Honey browser extension, which searches other websites to find the best available price
2. Data analytics, machine learning, data science
3. Finance

## DIY:
---

Name other places where web scraping can be useful and harmful.

# Concept 2: HTML Basics
---


## Quick Lesson on HTML
Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser.
[HTML Graphic](https://images.app.goo.gl/Cd4r56rewkvjp4qRA)

All HTML documents must start with a document type declaration: \<!DOCTYPE html>.
The HTML document itself begins with \<html> and ends with \</html>.
The visible part of the HTML document is between \<body> and \</body>.
```
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
```
## HTML Heading Tag
HTML headings are defined with the \<h1> to \<h6> tags.
\<h1> defines the most important heading. \<h6> defines the least important heading: 
```
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
```

## HTML Paragraph Tag
HTML paragraphs are defined with the \<p> tag:
```
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
```

## HTML Link Tag
HTML links are defined with the \<a> tag:
```
<a href="https://www.google.com">This is a link</a>
```
The link's destination is specified in the href attribute. 
Attributes are used to provide additional information about HTML elements.

## Unordered HTML List Tag
An unordered list starts with the \<ul> tag. Each list item starts with the \<li> tag.

The list items will be marked with bullets (small black circles) by default:
```
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
```
## Using The id Attribute
The HTML id attribute specifies a unique id for an HTML element. The value of the id attribute must be unique within the HTML document.

The id attribute is used to point to a specific style declaration in a style sheet. It is also used by JavaScript to access and manipulate the element with the specific id.

```
<h1 id="myHeader">My Header</h1>
```
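
Because an id is unique within a page, it also gives a program a reliable way to locate one element. The sketch below is only an illustration: it uses Python's built-in html.parser module rather than BeautifulSoup (which is introduced later in this lesson), and the class name IdCollector is made up for this example.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Record the tag name of every element that carries an id attribute."""

    def __init__(self):
        super().__init__()
        self.ids = {}  # maps id value -> tag name

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the opening tag
        for name, value in attrs:
            if name == "id":
                self.ids[value] = tag


collector = IdCollector()
collector.feed('<h1 id="myHeader">My Header</h1><p>No id here.</p>')
print(collector.ids)  # {'myHeader': 'h1'}
```

Only the \<h1> is recorded, because the \<p> tag has no id attribute.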

## HTML \<div> Tag
The \<div> tag defines a division or a section in an HTML document.

The \<div> tag is used as a container for HTML elements - which is then styled with CSS or manipulated with JavaScript.

The \<div> tag is easily styled by using the class or id attribute.

Any sort of content can be put inside the \<div> tag!

```
<body>

<div class="myDiv">
  <h2>This is a heading in a div element</h2>
  <p>This is some text in a div element.</p>
</div>

</body>
```

## Bold Tag

```
<b>and this is bold text</b>
```

## Examples:
---

In [None]:
<!DOCTYPE html>
<html>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>


In [None]:
<!DOCTYPE html>
<html>
  <body>
    <ul>
      <li> Basketball </li>
      <li> Soccer </li>
      <li> Baseball </li>
    </ul>
  </body>
</html>

## DIY:
---

1. Create an unordered list of different names in HTML.

In [None]:
<!DOCTYPE html>
<html>
  <body>
    <ul>
      <li> Tomas </li>
      <li> Jonathan </li>
      <li> Bob </li>
    </ul>
  </body>
</html>

# Concept 3: Setting up for 1st Program
---


We will use the BeautifulSoup4 and requests modules. The requests module sends HTTP requests from Python, which lets us download the HTML of a specific website. BeautifulSoup4 (BS4) then parses that HTML so we can search it. In short, requests fetches the page and BeautifulSoup4 reads it.

## Installing Packages
1. Open Git Bash and type in:
```
python -m pip install -U pip
```
This upgrades pip to the most recent version, and it reports an error if pip is not installed. Pip is the standard package-management system used to install and manage software packages written in Python. We need to install external packages (BS4 and requests) so we can use them.

2. Next we need to install BeautifulSoup4. Type in:
```
pip install BeautifulSoup4
```

3. Now, to install requests:
```
pip install requests
```

Since we have both packages installed, let's look at them a bit closer.
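
Before moving on, it can help to confirm that Python can actually find both packages. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for this example. Note that BeautifulSoup4 is imported under the name bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# "bs4" is the import name of the BeautifulSoup4 package
for name in ("bs4", "requests"):
    status = "installed" if is_installed(name) else "missing"
    print(name, "is", status)
```

If either package is reported as missing, repeat the matching pip install step above.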

## DIY:
---

1. What is pip?
2. How can we use these packages in our program? Hint: these are modules

# Concept 4: BeautifulSoup4 Documentation
---


## Importing BeautifulSoup4

```
from bs4 import BeautifulSoup
```
Recall that this imports one specific class, BeautifulSoup, from the bs4 module. If this doesn't work for you, make sure you installed BeautifulSoup!

## Constructor

```
soup = BeautifulSoup(mywebsite, "html.parser")
```
* soup is just a variable name; soup is the conventional choice
* 1st arg: The document you're going to parse (analyze)
* 2nd arg: The parser to use. We will always use "html.parser" because we are parsing HTML documents. It is included with Python, so nothing extra needs to be installed.
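
To make the constructor concrete, here is a small sketch that parses a short HTML string. The contents of mywebsite are made up for this example; in a real scraper this would be the HTML downloaded from a site.

```python
from bs4 import BeautifulSoup

# A tiny HTML document stored as a string (made up for this example)
mywebsite = "<html><body><h1>Hello</h1><p>First paragraph.</p></body></html>"

# 1st arg: the document to parse, 2nd arg: the parser to use
soup = BeautifulSoup(mywebsite, "html.parser")

print(soup.h1.get_text())  # Hello
print(soup.p.get_text())   # First paragraph.
```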

## Kinds of Objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

### Tag
A Tag object corresponds to an XML or HTML tag in the original document:

```
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
>>> tag = soup.b
>>> type(tag)
# <class 'bs4.element.Tag'>
```

> Note: If you pass only one argument to the constructor, Beautiful Soup picks the best HTML parser installed on your system and shows a warning. For our examples, we will explicitly pass "html.parser".

### Name
Every tag has a name, accessible as .name:
```
tag.name
# 'b'
```
If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

```
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
```

### Attributes
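A tag may have any number of attributes. The tag \<b id="boldest"> has an attribute “id” whose value is “boldest”. The tag from the earlier example uses class rather than id, so we build a new tag here. You can access a tag’s attributes by treating the tag like a dictionary:

```
id_tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
id_tag['id']
# 'boldest'
```
You can access that dictionary directly as .attrs:

```
id_tag.attrs
# {'id': 'boldest'}
```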
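
Two details are worth knowing when reading attributes: class values come back as a list, and get() avoids an error when an attribute is missing. A minimal sketch, using a made-up tag that has both a class and an id:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

print(tag["id"])        # intro
print(tag["class"])     # ['boldest'] (a list, since an element can have several classes)
print(tag.get("href"))  # None (tag["href"] would raise a KeyError instead)
```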

### NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

```
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
```

### BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

### Comment
The Comment object is just a special type of NavigableString:

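```
soup = BeautifulSoup("<b><!--Hey, buddy. Want to buy a used parser?--></b>", 'html.parser')
comment = soup.b.string
comment
# 'Hey, buddy. Want to buy a used parser?'
```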
But when it appears as part of an HTML document, a Comment is displayed with special formatting:

```
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
```

## Searching the tree
Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. We will focus on the two most popular methods: find() and find_all().

Here's the example:
```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
```

### A String
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will match tag names against that exact string. This code finds all the \<b> tags in the document:
```
soup.find_all('b')
# [<b>The Dormouse's story</b>]
```

### find_all()
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
```
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
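
Since find_all() returns a list of tags, the usual next step is to loop over the results and pull out the pieces you need. The sketch below uses a shortened copy of the example document; the variable name names is made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened version of the example document above
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

names = []
for link in soup.find_all("a"):
    # get_text() gives the visible text, link["href"] gives the destination
    names.append(link.get_text())
    print(link.get_text(), "->", link["href"])
```

Each \<a> tag prints as its text followed by its link, for example `Elsie -> http://example.com/elsie`.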

### find()
The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one \<body> tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method. These two lines of code are nearly equivalent:

```
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
```
The only difference is that find_all() returns a list containing the single result, and find() just returns the result.

If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None:
```
print(soup.find("nosuchtag"))
# None
```
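
Because find() can return None, it is safer to check the result before calling methods on it. Websites change their layout over time, so this situation comes up often in real scrapers. A minimal sketch, where the helper name text_of and the sample HTML are made up for this example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div id='news'><p>Hello</p></div>", "html.parser")


def text_of(soup, tag_name, element_id):
    """Return the text of the matching element, or None if it is absent."""
    element = soup.find(tag_name, id=element_id)
    if element is None:
        # Calling get_text() on None would raise an AttributeError
        return None
    return element.get_text()


print(text_of(soup, "div", "news"))     # Hello
print(text_of(soup, "div", "missing"))  # None
```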

## Examples:
---

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""
 
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find('p')

<p class="title"><b>The Dormouse's story</b></p>

In [None]:
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [None]:
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

## DIY:
---

1. import BeautifulSoup
2. Use this html_doc
```
html_doc = '<head> <title>The Dormouse\'s story </title> </head>'
```
3. Create an instance of BeautifulSoup
4. print out the title

# Concept 5: Requests Documentation
---


## Importing requests
```
import requests
```
Make sure to have requests installed!

The requests module allows you to send HTTP requests using Python.

The HTTP request returns a Response Object with all the response data (content, encoding, status, etc). Just know you're able to retrieve website information.

General syntax:
```
requests.methodname(params)
```

We will use:
* get(url, params, args) - Sends a GET request to the specified url

Here are others:
* delete(url, args) - Sends a DELETE request to the specified url
* head(url, args) - Sends a HEAD request to the specified url
* patch(url, data, args) - Sends a PATCH request to the specified url
* post(url, data, json, args) - Sends a POST request to the specified url
* put(url, data, args) - Sends a PUT request to the specified url
* request(method, url, args) - Sends a request of the specified method to the specified url

Since the HTTP request returns a response object, we then use:
* content - Returns the content of the response, in bytes
* text - Returns the content of the response, as a string

Don't worry if you don't understand, it'll come together when we work with these methods.
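
A request can fail, for example when the page does not exist or the site is slow to answer. The sketch below wraps get() in a small helper that sets a timeout and checks the status before returning the HTML. The function name fetch_html and the 10-second default are assumptions made for this example, not part of the requests library.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    # timeout stops the program from waiting forever on a slow site
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    return response.text
```

You could call it as `fetch_html('https://w3schools.com/python/demopage.htm')` and print the result.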

## Example:
---

In [None]:
import requests

x = requests.get('https://w3schools.com/python/demopage.htm')

print(x.text)

<!DOCTYPE html>
<html>
<body>

<h1>This is a Test Page</h1>

</body>
</html>


## DIY:
---

1. import requests
2. Use get to retrieve the data from https://en.wikipedia.org/wiki/LeBron_James
3. Print it out

In [None]:
import requests

wiki = requests.get('https://en.wikipedia.org/wiki/LeBron_James')

print(wiki.text)

# Concept 6: Putting it all together
---


## Outline
We will go to Wikipedia and extract the titles and links of the In The News Section.

1. Let's analyze the webpage first. Go to https://en.wikipedia.org/wiki/Main_Page.
2. We can check the HTML code using the browser's inspector: press Ctrl + Shift + I (Cmd + Option + I on a Mac) or right-click and choose Inspect.
3. Hover over the In The News section.
4. You should see a \<div> containing the section with an id of "mp-itn"
5. Click on the \<div> and find the \<ul> unordered list with \<li> elements. Inside you should see \<a> tags for hyperlinks. The href attribute is the link to that website.
6. Now let's code!

## Code on the Fly / DIY
1. Import requests, and import BeautifulSoup from bs4. Remember that requests downloads the website you need and BeautifulSoup analyzes the downloaded data.
```
import requests
from bs4 import BeautifulSoup
```
2. Assign a variable called url with the url as a string. 
```
url = 'https://en.wikipedia.org/wiki/Main_Page'
```
3. Create a variable called response. Use requests' get method and pass in the url. Remember that get sends a GET request to the url provided and returns the response.
```
response = requests.get(url)
```
4. Create another variable called page and assign it with response's content attribute. This returns the content in bytes.
```
page = response.content
```
5. Now let's make the soup. Call BeautifulSoup's constructor. The 1st argument is page and the 2nd argument is 'html.parser'. Again, this tells BeautifulSoup that we will use the HTML parser to analyze an HTML document.
```
soup = BeautifulSoup(page, 'html.parser')
```
6. Create a variable called inTheNews and use BeautifulSoup's find method. The 1st argument is 'div' and the 2nd argument is id='mp-itn'. This finds the In The News section in Wikipedia.
```
inTheNews = soup.find('div', id='mp-itn')
```
7. Have another variable called news that uses the variable inTheNews to find all list items. Remember to use the find_all() method. This finds each individual news item.
```
news = inTheNews.find_all('li')
```
8. Now let's loop through the news. 
```
for n in news:
```
9. For each element in the news, we want to print the headline. We can use the get_text() method for this. This prints out the title of the news item as plain text. The line is indented because it belongs inside the for loop.
```
    print(n.get_text())
```
10. Still inside the loop, let's find all tags that have links, since we also want the links for each news item. The a tag is for hyperlinks.
```
    links = n.find_all('a')
```
11. Let's loop through the links. For each element we will print out the website link. This for loop is nested inside the first one.
```
    for link in links:
        print(link['href'])
```
12. Outside the second for loop (but still inside the first), print a blank line so we can get a clean output.
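
Many href values on Wikipedia are relative paths such as /wiki/Euro, which are not complete web addresses on their own. One way to turn them into full links is urljoin from Python's standard library. The sketch below uses a few sample href values chosen for this example rather than live page data.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Main_Page"

# Sample href values of the kinds a page can contain
hrefs = ["/wiki/Euro", "https://www.example.com/page", "#top"]

full_links = []
for href in hrefs:
    # Relative paths are resolved against base_url; full URLs are left unchanged
    full_links.append(urljoin(base_url, href))

for link in full_links:
    print(link)
```

In the scraper above, the same idea would be `print(urljoin(url, link['href']))` inside the nested loop.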

# Summary:
---


1. What is web scraping?
2. What is BeautifulSoup?
3. How can web scraping be used in real life?

# Homework:
---
1. Read more about HTML [via the html glossary](https://www.codecademy.com/articles/glossary-html)
2. You are given two sets: Set A = {1, 2, 3, 4, 5, 6} and Set B = {2, 3, 4, 5, 6, 7, 8}.
How many elements are present in A union B? A intersection B?
3. Get input from the user. Ask for their name. Then ask for the year they were born. Check whether that year is odd or even. The output should look like this:
```
Sarah, you were born in 1989. 
That is an odd year.
```
4. Similar to the web scraping DIY you did before, ask the user to enter a website. Now find all links from that website.
Example:
```
Enter a website to extract the URLs from: google.com
```
The output is this:
```
http://www.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=US&tab=w1
http://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
```

# Notes on homework:
---



I will check in on Thursday through email to check on your progress. Respond with any questions you might have. Otherwise, a simple “all good” is appropriate if you have no questions or comments.

You will need to upload your coding homework assignments to GitHub.
1. In Git Bash, change directories to the homework directory: tomas_python/homework
* TIP: use ‘cd’ to change directories
* Use ‘cd ..’ to return to the previous directory
* Use ‘pwd’ to show full pathname of the current working directory 
* Use ‘ls’ to list all your directories
2. Once you’re in that directory, type in ‘git pull’
* This ensures you have all updated files
* If there is an error involved, email me immediately so we can try resolving it.
* Otherwise, type your code below and we’ll resolve issues next class
3. To create a new file, type in ‘touch hw01.py’ or the appropriate file name
* ‘touch’ creates a new, empty file
4. Open up the python file and start coding!

Note: Become familiar with these actions. This is essentially what happens in the backend when you right-click and create a new folder/file!

# DIY Solutions
---

In [None]:
<ul>
  <li> Tomas </li>
  <li> Sarah </li>
  <li> Bobby </li>
</ul>

In [None]:
from bs4 import BeautifulSoup

html_doc = '<head> <title>The Dormouse\'s story </title> </head>'
soup = BeautifulSoup(html_doc, 'html.parser')

soup.title


In [None]:
import requests

x = requests.get('https://en.wikipedia.org/wiki/LeBron_James')

print(x.text)

In [None]:
# 1. Get page data
import requests

url = 'https://en.wikipedia.org/wiki/Main_Page'
response = requests.get(url)

page = response.content

# 2. Work with page data
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')

inTheNews = soup.find('div', id='mp-itn')

news = inTheNews.find_all('li')

for n in news:
    print(n.get_text())
    links = n.find_all('a')
    for link in links:
        print(link['href'])
    print('\t')

Disease
/wiki/Coronavirus_disease_2019
	
Virus
/wiki/Severe_acute_respiratory_syndrome_coronavirus_2
	
Testing
/wiki/COVID-19_testing
	
Timeline
/wiki/Timeline_of_the_COVID-19_pandemic
	
By location
/wiki/COVID-19_pandemic_by_country_and_territory
	
Impact
/wiki/Impact_of_the_COVID-19_pandemic
	
Notable deaths
/wiki/List_of_deaths_due_to_COVID-19
	
Portal
/wiki/Portal:Coronavirus_disease_2019
	
Comet NEOWISE (pictured) is visible to the naked eye in the Northern Hemisphere.
/wiki/C/2020_F3_(NEOWISE)
/wiki/Naked_eye#In_astronomy
	
In the Singaporean general election, Lee Hsien Loong is re-elected Prime Minister as his People's Action Party retains its supermajority.
/wiki/2020_Singaporean_general_election
/wiki/Lee_Hsien_Loong
/wiki/Prime_Minister_of_Singapore
/wiki/People%27s_Action_Party
	
Bulgaria and Croatia join the European Exchange Rate Mechanism 2, the first major step in their adoption of the euro.
/wiki/European_Exchange_Rate_Mechanism
/wiki/Euro
	
A bus plunges into a reservo