# Day 9

* [Quick HTML Tutorial](#quick-html-tutorial)
* [HTML Tags and Attributes](#html-tags-and-attributes)
* [Web Scraping with Beautiful Soup](#web-scraping-with-beautiful-soup)
* [Getting HTML data](#lets-scrape-it-together)
* [Scraping a Website from the internet](#scraping-a-webpage-on-the-internet)

## Quick HTML Tutorial

Web pages are written using HTML (HyperText Markup Language) and CSS (Cascading Style Sheets). HTML provides the structure of the page and the text that is to be displayed; CSS is responsible for the visual and aural layout, i.e. the fonts, colors, styling, etc.

### HTML
1. HTML is used to give structure to the page. 
2. The HTML language comprises *elements*, which are used to *mark up* the content, i.e. label the text (or other content like images and video) as headings, paragraphs, lists, etc. 
3. The elements are defined through the use of tags. For most elements, there is an *opening tag* `<tagname>` and a *closing tag* `</tagname>`. For example, `<p>The first paragraph </p>` represents a paragraph element through the use of `<p>` and `</p>` tags.

Reference: https://www.w3.org/standards/webdesign/htmlcss
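
To make the opening tag / closing tag idea concrete, here is a minimal sketch that uses Python's built-in `html.parser` module (not Beautiful Soup) to report each piece of the paragraph example in order. The class name `TagLogger` is made up for this illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records each opening tag, text run, and closing tag in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        # Skip runs that are only whitespace
        if data.strip():
            self.events.append(("text", data.strip()))


logger = TagLogger()
logger.feed("<p>The first paragraph </p>")
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

The parser sees three things: the opening `<p>` tag, the enclosed text, and the closing `</p>` tag. Together they make up one paragraph element.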

**Let's now** open "day9-example-webpage.html" in the browser. 

### Inspecting a webpage 

To scrape a webpage, we should get a sense of its HTML. Web browsers generally allow us to inspect the HTML and CSS of websites.  

**Accessing the HTML on Chrome and Firefox:**
- Right click anywhere on the page and click on "Inspect" / "Inspect Element" in the dropdown menu. 
- If you are a Safari or Internet Explorer user, the process may be slightly different. 

#### day9-example-webpage.html 

This example webpage has the following elements: 
- A root `html` element 
    - A `body` element, nested in the `html` element
        - One `h1` element and three paragraph (`p`) elements, nested in the `body` element
            - Three anchor (`a`) elements, all nested in the last paragraph element

Thus, the `html` element is the root element. The `body` element is the child of `html` and the parent of four elements (the `h1` and the three `p` elements), and so on. 
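
The parent/child relationships can be sketched in code. The snippet below is an illustration only: `SAMPLE` is a hand-written stand-in that follows the structure described above (the real file is not reproduced here), and `NestingRecorder` is a made-up helper built on Python's built-in `html.parser`.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the example page (an assumption, not the real file)
SAMPLE = """
<html><body>
<h1>Sample Webpage</h1>
<p>First paragraph</p>
<p class="title">Second paragraph</p>
<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>
</body></html>
"""


class NestingRecorder(HTMLParser):
    """Records a (tag, parent tag) pair for every element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # tags that are currently open
        self.parents = []  # (tag, parent) pairs in document order

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.parents.append((tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


recorder = NestingRecorder()
recorder.feed(SAMPLE)
recorder.close()

body_children = [tag for tag, parent in recorder.parents if parent == "body"]
print(recorder.parents[0])  # the root element has no parent
print(body_children)        # elements nested directly in body
```

In this stand-in, `body` has four children, and each `a` element has a `p` element as its parent.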

## HTML Tags and Attributes

1. Attributes provide additional information about tags
2. Attributes are always specified in the start tag
3. Attributes usually come in name/value pairs like: `name="value"`. 

In the example webpage, the second and the third `<p>` tags have a *class* attribute. The values of the class attribute are `"title"` and `"story"` respectively.  

An *anchor tag* generally has the href attribute, which states the URL that the link points to. 
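
As a rough illustration of name/value pairs, the sketch below feeds one anchor tag from the example page to Python's built-in `html.parser`, which hands the attributes of each start tag to `handle_starttag` as a list of `(name, value)` tuples. `AttributeCollector` is a made-up helper name.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects the attributes of every start tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples taken from the start tag
        self.found.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.feed('<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>')
collector.close()

tag, attributes = collector.found[0]
for name, value in attributes.items():
    print(f"{name} = {value}")
```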

### class and id attributes 
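Many elements have a class attribute, an id attribute, or both. These attributes are commonly used to style the element, and they also give us a convenient way to pick out specific elements when scraping. 

No two elements on a page should have the same value for the id attribute, but several elements may share the same value for the class attribute.  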

Reference: https://www.w3schools.com/html/html_attributes.asp
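
The difference between id and class can be checked by counting. This is a small sketch using the three anchor tags from the example page and Python's built-in `html.parser`; `AttributeCounter` is a made-up helper name.

```python
from collections import Counter
from html.parser import HTMLParser

ANCHORS = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
"""


class AttributeCounter(HTMLParser):
    """Counts how often each id value and each class name appears."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "id" in attributes:
            self.ids[attributes["id"]] += 1
        # A class attribute may list several space-separated class names
        for name in (attributes.get("class") or "").split():
            self.classes[name] += 1


counter = AttributeCounter()
counter.feed(ANCHORS)
counter.close()

print(counter.ids)      # every id value appears once
print(counter.classes)  # the class "sister" is shared by two tags
```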

### Lecture Practice (5 minutes)

1. Inspect the three anchor tags in the example webpage. What are the different attributes and their values in the anchor tags? 


- Anchor tag 1 attributes: class, id, href. Values are respectively sister, link1, http://example.com/elsie
- Anchor tag 2 attributes: class, id, href. Values are respectively sister, link2, http://example.com/lacie
- Anchor tag 3 attributes: id, href. Values are respectively link3, http://example.com/tillie

```html
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>; 
```

## Web Scraping with Beautiful Soup

The Beautiful Soup library lets us represent a webpage as a nested structure and reach its elements through different functions. 

Since we have the HTML file `day9-example-webpage.html` on our computer, we can open it as a file using the `open()` function in read mode. 

**BeautifulSoup Module Documentation**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

In [None]:
# We want to import the BeautifulSoup class from the beautiful soup module. (It is called the bs4 module)

from bs4 import BeautifulSoup 

In [None]:
# Open the file in read mode 
# The with statement closes the file automatically, even if reading fails
with open('day9-example-webpage.html', 'r') as f:
    # Read the html data from the file, which is basically text data :-) 
    # Use the read() function, which returns the whole file as one string 
    text = f.read() 

In [None]:
# Let's output the html text we read from day9-example-webpage.html
text 

In [None]:
# We have to create a soup object before we can start using the functions in the module
# To do so, we need to call BeautifulSoup() function with two arguments: html text and 'html.parser' 

soup = BeautifulSoup(text, 'html.parser')

In [None]:
# Let's reprint the text, but this time we can "prettify" it
print(soup.prettify())

In [None]:
# Let's do some basic navigation on the nested structure 
soup.body

In [None]:
# Will print the opening and closing tags with the enclosed text 
print(soup.h1)

# Will print the text enclosed between <h1> and </h1> i.e. Sample Webpage 
print(soup.h1.text)

In [None]:
# Note that it outputs the first paragraph, does not output the other two paragraph elements 
soup.p, soup.p.text

### Important Functions from Beautiful Soup Module

1. `soup.find_all(tagname)`: Returns a list of all the tags that match the specified tag name. We can filter further by passing attributes as keyword arguments, such as `class_` and `id`.  

2. `soup.find(tagname)`: Returns the first tag that matches the specified tag name. It accepts the same attribute filters. 
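
One difference worth knowing before using these functions is what they return when nothing matches. The sketch below uses a tiny hand-written HTML string (not the example page) and a separate variable `demo_soup` so that it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# Tiny hand-written HTML string used only for this illustration
html = '<p class="title">A title</p><p class="story">A story</p>'
demo_soup = BeautifulSoup(html, "html.parser")

all_paragraphs = demo_soup.find_all("p")   # both <p> tags
first_paragraph = demo_soup.find("p")      # only the first <p> tag
missing_all = demo_soup.find_all("table")  # no match: an empty list
missing_one = demo_soup.find("table")      # no match: None

print(len(all_paragraphs), first_paragraph.text)
print(len(missing_all), missing_one)
```

Because `find()` returns `None` when there is no match, it is a good idea to check the result before reading `.text` from it.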

In [None]:
# Finding all anchor_tags
anchor_tags = soup.find_all('a')

print(len(anchor_tags))
anchor_tags

In [None]:
# Finding all anchor tags with class = "sister"
# The keyword is class_ (with an underscore) because class is a reserved word in Python
anchor_tags_class = soup.find_all('a', class_='sister')
anchor_tags_class

In [None]:
# Finding all anchor tags with id = "link1"
anchor_tags_id = soup.find_all('a', id = "link1")
anchor_tags_id

In [None]:
# Finding paragraph that has class="title"
para_second = soup.find('p', class_="title")
para_second

### Lecture Practice (15 minutes)

1. Anchor tags practice: 
    - Create an empty list `a_text`
    - Create a for loop to loop through `anchor_tags` and append the text of each anchor tag to `a_text`. 
    - Print `a_text` after the for loop. **Expected Output**: The list `a_text` should be `["Elsie", "Lacie", "Tillie"]`.

2. Use `find_all()` to find all paragraph elements. Store the result of `find_all` in a variable `para_tags`. Print the length of `para_tags`. **Expected Output: 3**

3. Use `find()` to find the `h1` tag.

4. Use `find()` to find the `p` tag that has `class = "story"` and assign it to a variable `para_third`.

In [None]:
anchor_tags = soup.find_all('a')
anchor_tags

In [None]:
# Problem 1 
a_text = [] 

for item in anchor_tags: 
    tag_text = item.text  
    a_text.append(tag_text)
    
a_text

In [None]:
# Problem 2 
para_tags = soup.find_all("p")
len(para_tags)

In [None]:
# Problem 3 
soup.find('h1')

In [None]:
# Problem 4 
para_third = soup.find('p', class_="story")
para_third

# Print the text in the third <p> tag
print(para_third.text)

# para_third has three more tags within it. We can use find and find_all functions on para_third too 
para_third.find_all('a')

In [None]:
# Getting the attribute values of tags
# In this case, let's print the href attribute of all anchor tags
href_list = [] 

for item in anchor_tags:
    link = item['href']  # index a tag like a dictionary to read an attribute
    print(link)
    href_list.append(link)
    
print(href_list)
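
Not every tag has every attribute. In the example page, the third anchor tag has no class attribute. The sketch below, which uses two of the anchor tags as a hand-written string, shows the difference between indexing a tag and calling `.get()` when an attribute is missing.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/tillie" id="link3">Tillie</a>'
)
demo_soup = BeautifulSoup(html, "html.parser")
elsie, tillie = demo_soup.find_all("a")

print(elsie["href"])        # the attribute value as a string
print(elsie["class"])       # class can hold several names, so it is a list
print(tillie.get("class"))  # .get() returns None for a missing attribute

try:
    tillie["class"]         # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True
print(raised)
```

Using `.get()` is the safer choice inside a loop when some tags may lack the attribute.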

## Scraping a webpage on the internet 

Can everyone access this webpage?: https://www.imdb.com/chart/top/

**Our Goal:** Scrape the webpage to get the following details on IMDb's top 250 movies: movie title, release year, and rating

### Lecture Practice (15 minutes)

1. Which tag encloses the text "Top Rated Movies"? 

2. Which tag encloses the text "Top Rated Movies by Genre"? 

3. Which tag encloses the text "The Shawshank Redemption"? 
    - What are the attributes-values of that tag? 
    - What is the parent tag of this tag? 
    
4. When inspecting the web page, can you identify the tag that is the parent to all the movies? 

### Let's scrape it together

In [None]:
# import the module
import requests 

In [None]:
# Let's request the data from https://www.imdb.com/chart/top/ using the get function 
# This will return a response 

url = "https://www.imdb.com/chart/top/"
response = requests.get(url) 

In [None]:
# Different kinds of status codes 
# Status Code 404 - Page not found 
# Status Code 400 - Bad request (the server could not understand the request) 

# Status Code 200 - request was successful, we can use response.text to access the HTML of the webpage 

response
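
To look up what a status code means without making a request, Python's built-in `http.HTTPStatus` can be used. The helper `describe_status` below is a made-up name for this sketch.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return f"{code} {status.phrase} ({families[code // 100]})"


for code in (200, 400, 403, 404):
    print(describe_status(code))
```

With the requests library, the integer code of a response is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.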

In [None]:
from bs4 import BeautifulSoup 

# Getting the html for https://www.imdb.com/chart/top/ by using the response.text 
html_text = response.text 

# make a soup
soup = BeautifulSoup(html_text, 'html.parser')

In [None]:
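# Get all title columns (find_all is the current name; findAll is an older alias)
title_columns = soup.find_all(class_="titleColumn")

# Loop over whatever was found instead of assuming exactly 250 matches,
# so the loop cannot raise an IndexError if the page returns fewer
for column in title_columns:
    print(column.text)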
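
To work towards the goal of collecting title, release year, and rating, here is a sketch that runs on a hand-written table instead of the live page. Everything in `SAMPLE_ROWS` is an assumption made for illustration: the class names `secondaryInfo` and `imdbRating`, the second movie, and both rating values are placeholders. The markup of the live page may differ, so inspect it first and adjust the tag and class names.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two chart rows (hypothetical markup and placeholder values)
SAMPLE_ROWS = """
<table>
<tr>
  <td class="titleColumn">1. <a href="#">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span></td>
  <td class="ratingColumn imdbRating"><strong>9.9</strong></td>
</tr>
<tr>
  <td class="titleColumn">2. <a href="#">Example Movie</a>
    <span class="secondaryInfo">(2000)</span></td>
  <td class="ratingColumn imdbRating"><strong>8.8</strong></td>
</tr>
</table>
"""

demo_soup = BeautifulSoup(SAMPLE_ROWS, "html.parser")

movies = []
for row in demo_soup.find_all("tr"):
    title_cell = row.find(class_="titleColumn")
    rating_cell = row.find(class_="imdbRating")
    if title_cell is None or rating_cell is None:
        continue  # skip rows that do not look like a movie row

    title = title_cell.find("a").text
    year = int(title_cell.find("span").text.strip("()"))  # "(1994)" -> 1994
    rating = float(rating_cell.text.strip())
    movies.append({"title": title, "year": year, "rating": rating})

for movie in movies:
    print(movie)
```

Working row by row keeps each title together with its own year and rating, which is easier than matching up three separate lists afterwards.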
    