# CS5481 - Tutorial 2

## Introduction to Web Crawling

Welcome to the CS5481 tutorial. In this tutorial, you will learn how to crawl data from the web with Python.

## Part 1: Introduction to HTML (20 minutes)

### What is HTML?
HTML (HyperText Markup Language) is the standard language used for creating web pages. It structures content on the web and allows browsers to interpret and display it.

### Key Features of HTML
- **Markup Language**: HTML is a markup language that uses tags to define elements within a document.
- **Browser Compatibility**: HTML is supported by all web browsers, making it a foundational technology for web development.

### Common HTML Tags
- `<html>`: The root element that wraps all other HTML content.
- `<head>`: Contains meta-information about the document, such as the title and links to stylesheets.
- `<title>`: Sets the title of the web page that appears in the browser tab.
- `<body>`: Contains the main content of the page, including text, images, and other media.

### Header Tags
- `<h1>`: Represents the main heading of the page (largest).
- `<h2>`, `<h3>`, etc.: Subheadings, with decreasing size and importance.

### Text Content Tags
- `<p>`: Defines a paragraph of text.
- `<b>`: Makes text bold.
- `<i>`: Italicizes text.
- `<br>`: Inserts a line break.

### Link and Image Tags
- `<a>`: Anchor tag used to create hyperlinks. Example: `<a href="https://example.com">Visit Example</a>`.
- `<img>`: Embeds an image. Example: `<img src="image.jpg" alt="Description">`.

### Example HTML Structure
```
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to My Web Page</h1>
    <p>This is a sample paragraph.</p>
    <a href="https://example.com">Visit Example</a>
</body>
</html>
```

Source: https://cn.w3schools.com/html/html_elements.asp, where you can learn more about HTML :)
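
To see how the tags above nest in practice, the sample page can be walked with Python's built-in `html.parser` module. This is only a minimal sketch: the `TagCollector` class and the `SAMPLE_PAGE` constant are names made up for this illustration and are not part of the tutorial's later code.

```python
from html.parser import HTMLParser

# The sample page from this section, stored as a string
SAMPLE_PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to My Web Page</h1>
    <p>This is a sample paragraph.</p>
    <a href="https://example.com">Visit Example</a>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tags and hyperlink targets."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag; attrs is a list of (name, value) pairs
        self.tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = TagCollector()
collector.feed(SAMPLE_PAGE)
print(collector.tags)   # opening tags in document order
print(collector.links)  # href values of the <a> tags
```

The tag list comes out in document order, which mirrors the nesting of `<html>`, `<head>`, and `<body>` described above.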

## Part 2: Introduction to Web Scraping (30 minutes)

### What is Web Scraping?
Web scraping is the process of extracting data from websites. 

Python provides powerful libraries like `requests` and `Beautiful Soup` for this purpose.

### Installing Libraries
To get started, ensure you have the required libraries installed:

In [None]:
! pip install requests
! pip install beautifulsoup4

## 1. Import Libraries

In [1]:
import requests as r
from bs4 import BeautifulSoup

## 2. Find the URL of the Target HTML

In [2]:
url = r'https://stackoverflow.com/'

## 3. Obtain the HTML Framework and Contents

In [3]:
res = r.get(url)
html = res.text
print(html)


<!DOCTYPE html>


    <html class="html__responsive " lang="en">

    <head>

        <title>Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">
        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a"> 
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
    <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">
        <meta name="bingbot" content="noarchive">         
        <meta property="og:type" content= "website" />
        <meta property="og:url" content="https://stackoverflow.com/"/>
        <meta property="og:site_name" con
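
A request can fail or hang, so it may help to wrap the download in a small function. The sketch below is one possible way to do that, not the tutorial's own method: the function name `fetch_html`, the 10-second timeout, and the `User-Agent` string are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page body as text, or raise on failure."""
    # Illustrative identifier only; replace it with one that describes your crawler
    headers = {"User-Agent": "cs5481-tutorial-crawler/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://stackoverflow.com/")` would then return the same kind of string as `res.text` above, while a timeout or an error status raises an exception instead of silently returning an error page.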

## 4. Reformat and Parse HTML

In [4]:
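# name the parser explicitly to avoid the "no parser was explicitly specified" warning
bf = BeautifulSoup(html, "html.parser")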
print(bf.prettify())

<!DOCTYPE html>
<html class="html__responsive" lang="en">
 <head>
  <title>
   Stack Overflow - Where Developers Learn, Share, &amp; Build Careers
  </title>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" rel="shortcut icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="image_src"/>
  <link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
  <meta content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <meta content="noarchive" name="bingbot"/>
  <meta content="website" property="og:type"/>
  <meta content="https://stackoverflow.com/" property="og:url"/>
  <meta content="Stack Overflow" property="og:site_name"/>
  <meta content="https://cdn.sstatic.net/Sit
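
Parsing turns the HTML string into a tree of tag objects that can be navigated. The following sketch shows this on a short hand-written snippet (an assumption for illustration, not the Stack Overflow page), using the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a real page
snippet = '<html lang="en"><head><title>Sample Page</title></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(snippet, "html.parser")

root = soup.html
print(root.name)                                 # name of the tag
print(root.attrs)                                # attributes as a dictionary
print([child.name for child in root.children])   # direct children of <html>
print(soup.p.parent.name)                        # tag that encloses the <p>
```

Each tag exposes its name, its attributes, and its neighbours in the tree, which is what the lookups in the next step rely on.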

## 5. Obtain Information We Need

In [5]:
# obtain title according to <title> tag
print(bf.title) 

<title>Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</title>


In [6]:
# obtain title string
print(bf.title.string)

Stack Overflow - Where Developers Learn, Share, & Build Careers


In [7]:
# obtain all <a> tags
for item in bf.find_all("a"):
    print(item)

<a class="s-topbar--skip-link" href="#content">Skip to main content</a>
<a aria-controls="left-sidebar" aria-expanded="false" aria-haspopup="true" class="s-topbar--menu-btn js-left-sidebar-toggle" href="#" role="menuitem"><span></span></a>
<a class="s-topbar--logo js-gps-track" data-gps-track="top_nav.click({is_current:true, location:1, destination:8}); homelogo_nav.click({location:1})" href="https://stackoverflow.com">
<span class="-img _glyph">Stack Overflow</span>
</a>
<a class="s-navigation--item js-gps-track" data-ga='["top navigation","about menu click",null,null,null]' data-gps-track="top_nav.products.click({location:1, destination:7})" href="https://stackoverflow.co/">About</a>
<a class="s-navigation--item js-gps-track" data-ga='["top navigation","learn more - overflowai",null,null,null]' data-gps-track="top_nav.products.click({location:1, destination:10})" href="https://stackoverflow.co/teams/ai/?utm_medium=referral&amp;utm_source=stackoverflow-community&amp;utm_campaign=top-n
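
Often the link target matters more than the whole tag. The sketch below pulls the `href` attribute out of each `<a>` tag and resolves relative links against the page URL with `urllib.parse.urljoin`. The snippet is a simplified, hand-written stand-in for the output above, and `BASE_URL` is a name introduced only for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://stackoverflow.com/"

# Simplified, hand-written stand-in for the real page
snippet = """
<a class="s-topbar--skip-link" href="#content">Skip to main content</a>
<a href="/questions" id="nav-questions">Questions</a>
<a href="https://stackoverflow.co/">About</a>
<a>No target</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

links = []
# href=True keeps only the <a> tags that actually carry an href attribute
for a in soup.find_all("a", href=True):
    links.append((a.get_text(strip=True), urljoin(BASE_URL, a["href"])))

for text, target in links:
    print(text, "->", target)
```

Resolving against the base URL turns fragments such as `#content` and paths such as `/questions` into absolute URLs that a crawler could request next.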

In [8]:
# obtain text content from document
# get_text is a method, so it must be called with parentheses
print(bf.get_text())

Note: without the parentheses, `print(bf.get_text)` prints the bound method object (`<bound method PageElement.get_text of ...>`) followed by the document markup, not the extracted text.
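
Because `get_text` is a method, it has to be called to return the text. The sketch below shows the call on a small hand-written snippet (an assumption for illustration), together with the `separator` and `strip` arguments that control how the pieces of text are joined.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a real page
snippet = "<body><h1>Welcome to My Web Page</h1><p>This is a <b>sample</b> paragraph.</p></body>"
soup = BeautifulSoup(snippet, "html.parser")

# All text in the document, concatenated without any separator
plain = soup.get_text()
# One text fragment per line, with surrounding whitespace removed
per_line = soup.get_text(separator="\n", strip=True)
# Text of a single element only
paragraph = soup.p.get_text()

print(plain)
print(per_line)
print(paragraph)
```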

In [9]:
# find <a> tags that have an "id" attribute
for item in bf.find_all("a", id=True):
    print(item)

<a aria-controls="" aria-current="false" class="s-block-link pl8 js-gps-track nav-links--link -link__with-icon" data-controller=" " data-gps-track="top_nav.click({is_current: false, location:1, destination:1,  has_activity_notification:False})" data-s-popover-auto-show="true" data-s-popover-hide-on-outside-click="never" data-s-popover-placement="right" href="/questions" id="nav-questions">
<div class="d-flex ai-center">
<svg aria-hidden="true" class="svg-icon iconQuestion" height="18" viewbox="0 0 18 18" width="18"><path d="m4 15-3 3V4c0-1.1.9-2 2-2h12c1.09 0 2 .91 2 2v9c0 1.09-.91 2-2 2zm7.75-3.97c.72-.83.98-1.86.98-2.94 0-1.65-.7-3.22-2.3-3.83a4.4 4.4 0 0 0-3.02 0 3.8 3.8 0 0 0-2.32 3.83q0 1.93 1.03 3a3.8 3.8 0 0 0 2.85 1.07q.94 0 1.71-.34.97.66 1.06.7.34.2.7.3l.59-1.13a5 5 0 0 1-1.28-.66m-1.27-.9a5 5 0 0 0-1.5-.8l-.45.9q.5.18.98.5-.3.1-.65.11-.92 0-1.52-.68c-.86-1-.86-3.12 0-4.11.8-.9 2.35-.9 3.15 0 .9 1.01.86 3.03-.01 4.08"></path></svg> <span class="-link--channel-name pl6">Questi

In [10]:
# find <a> tags whose id is "nav-tags"
for item in bf.find_all("a", id="nav-tags"):
    print(item)
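
When nothing matches, `find_all` returns an empty list, so a loop over it prints nothing, while `find` returns `None`. The sketch below illustrates this on a hand-written snippet (the links and their `href` values are assumptions for illustration) and also shows the CSS-selector form `select_one` as an alternative lookup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the navigation links
snippet = """
<a href="/questions" id="nav-questions">Questions</a>
<a href="/tags" id="nav-tags">Tags</a>
<a href="/about">About</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

# id=True matches any <a> tag that has an id attribute
with_id = [a["id"] for a in soup.find_all("a", id=True)]

# Two equivalent ways to look up one specific id
tags_link = soup.find("a", id="nav-tags")
same_link = soup.select_one("a#nav-tags")

# Lookups for an id that does not exist
missing_one = soup.find("a", id="nav-missing")
missing_all = soup.find_all("a", id="nav-missing")

print(with_id)
print(tags_link["href"], same_link["href"])
print(missing_one, missing_all)
```

Checking for `None` or an empty list before using the result avoids an `AttributeError` or a silently empty loop when the page layout differs from what was expected.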

**More use cases can be found at** https://beautiful-soup-4.readthedocs.io/en/latest/

# Practice:

Try to print the title, source, editor, and full text of the target HTML page:

https://english.news.cn/20220904/b1955558af1c4179a355fab10b1ee28f/c.html

In [11]:
# insert your code
import requests
from bs4 import BeautifulSoup

# URL of the news article
url = "https://english.news.cn/20220904/b1955558af1c4179a355fab10b1ee28f/c.html"

# Fetch the page
response = requests.get(url)
response.encoding = 'utf-8'  # Ensure proper encoding

# Create BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly
print(soup)

<!DOCTYPE html>
<html lang="en"><head><meta content="9587a01d4f5042768563d1283001587b" name="templateId"/> <meta charset="utf-8"/> <meta content="新华社" name="source"/> <meta content="IE=edge" http-equiv="X-UA-Compatible"/> <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0,user-scalable=no" name="viewport"/> <meta content="webkit" name="renderer"/> <meta content="telephone=no" name="format-detection"/> <meta content="email=no" name="format-detection"/> <meta content="no" name="msapplication-tap-highlight"/> </head><body><div data="datasource:20220904b1955558af1c4179a355fab10b1ee28f" datatype="content"><meta content="manufacturing industry,global economy,China" name="keywords"/></div> <div><meta content="Economic Watch: China upgrades manufacturing industry for global economy certainty-" name="description"/></div> <div><meta property="og:url"/></div> <div><meta content="Economic Watch: China upgrades manufacturing industry for global economy certai

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="9587a01d4f5042768563d1283001587b" name="templateId"/>
  <meta charset="utf-8"/>
  <meta content="新华社" name="source"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0,user-scalable=no" name="viewport"/>
  <meta content="webkit" name="renderer"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="email=no" name="format-detection"/>
  <meta content="no" name="msapplication-tap-highlight"/>
 </head>
 <body>
  <div data="datasource:20220904b1955558af1c4179a355fab10b1ee28f" datatype="content">
   <meta content="manufacturing industry,global economy,China" name="keywords"/>
  </div>
  <div>
   <meta content="Economic Watch: China upgrades manufacturing industry for global economy certainty-" name="description"/>
  </div>
  <div>
   <meta property="og:url"/>
  </div>
  <div>
   <meta content="Economic Watch: China

In [13]:
def text_or_default(tag, default):
    # Return the tag's stripped text, or the default when the tag was not found
    return tag.text.strip() if tag else default

# Extract title
title = text_or_default(soup.find('title'), 'Title not found')

# Extract source
source = text_or_default(soup.find('p', class_='source'), 'Source not found')

# Extract editor
editor = text_or_default(soup.find('p', class_='editor'), 'Editor not found')

# Extract full text
full_text = text_or_default(soup.find('div', id='detailContent'), 'Full text not found')

# Print the extracted information
print(f"Title: {title}")
print(f"Source: {source}")
print(f"Editor: {editor}")
print(f"Full Text: {full_text}")

Title: Economic Watch: China upgrades manufacturing industry for global economy certainty-Xinhua
Source: Source: Xinhua
Editor: Editor: huaxia
Full Text: JINAN, Sept. 4 (Xinhua) -- At the intelligent manufacturing base of Goertek in Weifang City, east China's Shandong Province, tonnes of components are assembled into Virtual Reality (VR) equipment to be delivered to different countries.   According to Chang Gang, vice president of marketing and sales of Goertek Inc., it is one of the few VR manufacturers globally to reach the production scale of 1 million sets. It is also one of the largest manufacturers of mid and high-end VR headsets.   Goertek is the epitome of China's advanced manufacturing industry. At the 2022 World Advanced Manufacturing Conference, which closed Friday in Jinan, capital of Shandong, enterprises and experts gathered to discuss the frontier issues of global advanced manufacturing development.   As a major manufacturer, the steady growth and upgrading of China's ma
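
The page also carries metadata in `<meta>` tags, as the parsed output above shows. The sketch below reads a few of them from a snippet copied from that output. The helper name `meta_content` is an assumption for illustration. Note that the lookup uses `attrs={"name": ...}` because `name` is already the first parameter of `find` and cannot be passed as a keyword filter.

```python
from bs4 import BeautifulSoup

# <meta> tags copied from the parsed output shown earlier
snippet = """
<head>
<meta content="新华社" name="source"/>
<meta content="Economic Watch: China upgrades manufacturing industry for global economy certainty-" name="description"/>
<meta content="manufacturing industry,global economy,China" name="keywords"/>
</head>
"""
soup = BeautifulSoup(snippet, "html.parser")


def meta_content(soup, name, default=None):
    """Hypothetical helper: return the content of <meta name=...>, or a default."""
    tag = soup.find("meta", attrs={"name": name})
    if tag is not None and tag.has_attr("content"):
        return tag["content"]
    return default


source_meta = meta_content(soup, "source")
keywords = meta_content(soup, "keywords", "").split(",")
author = meta_content(soup, "author")  # not present in the snippet

print(source_meta)
print(keywords)
print(author)
```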