# A Simple Web Scraper
**We will show how to open a very simple [webpage](https://mofanpy.com/static/scraping/basic-structure.html) and read all of its content.**

In [6]:
import requests
from bs4 import BeautifulSoup

# Proxy settings
proxies = {
    'http': '127.0.0.1:10809',
    'https': '127.0.0.1:10809'
}

url = "https://www.itlaoli.com/"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

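try:
    # Send the request through the local proxy configured above
    response = requests.get(url, headers=headers, proxies=proxies, timeout=30)

    print(f"Status code: {response.status_code}")

    if response.status_code == 200:
        html = response.text
        print(f"Fetched content successfully, length: {len(html)} characters")

        # Parse the HTML
        soup = BeautifulSoup(html, 'html.parser')
        print("\nFirst 5000 characters:")
        print(html[:5000])
    else:
        print(f"Request failed, status code: {response.status_code}")

except requests.RequestException as e:
    # Covers connection errors, timeouts, proxy errors, and similar failures
    print(f"Request failed: {e}")

Status code: 200
Fetched content successfully, length: 106594 characters

First 5000 characters: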
<!DOCTYPE html>
<html xml:lang="zh-Hans" lang="zh-Hans">
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<meta http-equiv="X-UA-Compatible" content="IE=edge, chrome=1" />
	<meta name="viewport" content="width=device-width,minimum-scale=1,initial-scale=1,viewport-fit=cover">
	<meta name="applicable-device" content="pc,mobile">
	<meta name="renderer" content="webkit" />
    <meta name="force-rendering" content="webkit" />
    <title>李洋博客 | 专业博客搭建与技术分享 - ZBlog主题定制与网站运维教程
</title>
	<meta name="description" content="李洋博客专注ZBlog主题模板定制开发，提供网站运维实战笔记与图文教程。涵盖主题设计、服务器配置、SEO优化等核心技术，助力个人站长快速搭建专业博客，分享从0到1的网站建设全流程经验。" />
	<meta name="keywords" content="ZBlog主题定制,网站运维教程,博客模板开发,服务器配置指南,图文建站教程,SEO优化技巧,个人博客搭建,网站维护笔记,主题模板设计,建站技术分享" />
	<meta property="og:type" content="index"/>
	<meta property="og:title" content="李洋博客 | 专业博客搭建与技术分享" />
	<meta property="og:description" content="李洋博客专注ZBlog主题模板定制开发，提供网站运维实战笔记与图文教程。涵盖主题设计、服务器配置、SEO优化等核心技

In [7]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page content
html = urlopen("https://mofanpy.com/static/scraping/basic-structure.html").read().decode('utf-8')
# html = urlopen("https://www.weiyoun.com/2023/09/28/cms9geduibi/").read().decode('utf-8')

# Parse the HTML with BeautifulSoup and pretty-print it
soup = BeautifulSoup(html, 'html.parser')
formatted_html = soup.prettify()

# Print the formatted HTML
print(formatted_html)

<!DOCTYPE html>
<html lang="cn">
 <head>
  <meta charset="utf-8"/>
  <title>
   Scraping tutorial 1 | 莫烦Python
  </title>
  <link href="/static/img/description/tab_icon.png" rel="icon"/>
 </head>
 <body>
  <h1>
   爬虫测试1
  </h1>
  <p>
   这是一个在
   <a href="/">
    莫烦Python
   </a>
   <a href="/tutorials/data-manipulation/scraping/">
    爬虫教程
   </a>
   中的简单测试.
  </p>
 </body>
</html>



**Next we select some text by tag, using a regular expression that captures the text enclosed by the tag.**

In [8]:
import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])


Page title is:  Scraping tutorial 1 | 莫烦Python
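
The `?` in `(.+?)` makes the match non-greedy, which matters as soon as a tag appears more than once on a line. Here is a minimal sketch of the difference, using a made-up `sample` string rather than the downloaded page:

```python
import re

# Hypothetical sample: two <b> elements on one line (not from the tutorial page).
sample = "<b>first</b> and <b>second</b>"

# Non-greedy: each match stops at the nearest closing tag.
lazy = re.findall(r"<b>(.+?)</b>", sample)

# Greedy: the match runs on to the last closing tag on the line.
greedy = re.findall(r"<b>(.+)</b>", sample)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</b> and <b>second']
```

For the single `<title>` element both forms give the same result, but the non-greedy form is the safer default when scanning HTML.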


**Another example: selecting the content of the `<body>` element, which holds the paragraph, from the HTML.**

In [11]:
res = re.findall(r"<body>(.*?)</body>", html, flags=re.DOTALL)    # re.DOTALL lets "." match newlines, so the match can span multiple lines
print("\nPage paragraph is: ", res[0])


Page paragraph is:   <h1>爬虫测试1</h1> <p> 这是一个在 <a href="/">莫烦Python</a> <a href="/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试. </p> 
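
To see why `re.DOTALL` is needed here, the sketch below runs the same pattern with and without the flag on a small multi-line string. The `sample` string is a made-up stand-in for a downloaded page, not the tutorial page itself:

```python
import re

# Hypothetical multi-line snippet standing in for a downloaded page.
sample = "<body>\n<h1>Heading</h1>\n<p>Text</p>\n</body>"

pattern = r"<body>(.*?)</body>"

# Without the flag, "." does not match "\n", so nothing is found.
without_flag = re.findall(pattern, sample)

# With re.DOTALL, "." also matches newlines and the whole body is captured.
with_flag = re.findall(pattern, sample, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['\n<h1>Heading</h1>\n<p>Text</p>\n']
```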


**Extract all the links**

In [13]:
res = re.findall(r"href=\"(.+?)\"", html)
print("\nAll links are: ", res)


All links are:  ['/static/img/description/tab_icon.png', '/', '/tutorials/data-manipulation/scraping/']
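
The extracted links are relative paths, and the first one comes from the `<link>` icon tag rather than from an `<a>` element. If the links are to be followed, they first need to be resolved against the page URL. One possible sketch uses `urllib.parse.urljoin` from the standard library; the `links` list is copied from the output above, and the names `base_url` and `absolute_links` are chosen here for illustration:

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from.
base_url = "https://mofanpy.com/static/scraping/basic-structure.html"

# Relative links, copied from the regex output above.
links = [
    '/static/img/description/tab_icon.png',
    '/',
    '/tutorials/data-manipulation/scraping/',
]

# Resolve each relative path against the page URL.
absolute_links = [urljoin(base_url, link) for link in links]

for link in absolute_links:
    print(link)
```

Because every path starts with `/`, each one is resolved against the site root `https://mofanpy.com` rather than against the `/static/scraping/` directory.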
