# Part 1 : 数据爬取
<br>

### 任务目标
1. 利用feedparser库爬取RSS类信息流；
2. 利用BeautifulSoup库爬取通用HTML网页元素；
3. 利用selenium模拟真实用户行为爬取特殊的网页内容；
4. 将爬取的内容插入金山云RDS。

<br>
<br>

In [123]:
import yaml
from IPython.display import display
import mysql.connector
import json
import feedparser
from bs4 import BeautifulSoup
from dateutil.tz import gettz
from datetime import datetime, timezone, timedelta
import pandas as pd
import requests

---
<br>

## 一、环境准备

In [124]:
with open(f'config_new.yaml', 'r') as file:
    config = yaml.safe_load(file)

In [125]:
# 定义时间转换函数

china_timezone = timezone(timedelta(hours=8))
def china_time_converter(*args):
    """
    自定义时间转换器，将UTC时间转换为北京时间。
    """
    utc_dt = datetime.utcnow().replace(tzinfo=timezone.utc)
    return utc_dt.astimezone(china_timezone).timetuple()

def convert_to_cst(datetime_str):
    # 解析日期字符串
    try:
        return parser.parse(datetime_str).astimezone(gettz("Asia/Shanghai"))
    except Exception as e:
        return datetime.now()

In [126]:
rss_source_list = []
web_source_list = []

rss_source_list = [config[source] for source in config['Source_List'] if config[source]['type'] == 'rss']
web_source_list = [config[source] for source in config['Source_List'] if config[source]['type'] != 'rss']

In [127]:
display(pd.DataFrame(rss_source_list))
display(pd.DataFrame(web_source_list))

Unnamed: 0,url,type,folder
0,https://aws.amazon.com/cn/blogs/machine-learni...,rss,AWS-MachineLearning
1,https://research.fb.com/feed/,rss,Meta-Research
2,https://deepmind.google/blog/rss.xml,rss,DeepMind-Google
3,https://blogs.microsoft.com/ai/feed/rss,rss,Microsoft_Blog
4,https://www.infoq.cn/feed/,rss,InfoQ
5,https://www.36kr.com/feed-article,rss,36kr
6,https://sspai.com/feed,rss,sspai


Unnamed: 0,url,type,folder,domain
0,https://openai.com/news/?limit=10,web,OpenAI_News,https://openai.com
1,https://www.deeplearning.ai/the-batch/,web,deeplearning_ai,https://www.deeplearning.ai


---
<br>

## 二、内容爬取

### 2.1 通过feedparser获取RSS源数据

feedparser 是一个用来解析 RSS 和 Atom feeds 的 Python 库，用于从网络抓取和处理内容聚合数据，例如新闻、博客文章、播客等。只能解析标准化的 RSS 或 Atom feed 数据。能快速提取 feed 标题、条目标题、链接、摘要等。
![feedparser信息流展示](pictures/feedparser信息流展示.png)

In [128]:
def get_rss_article(source_content):
    content_list = []
    feed = feedparser.parse(source_content.get('url'))
    print(f"解析成功: {source_content.get('url')}, 共找到{len(feed.entries)}篇文章。\n")
    for entry in feed.entries:
        try:
            content_dict = {
                'title': entry.get('title', '无标题'),
                'link': entry.get('link', '未知链接'),
                'description_original': BeautifulSoup(entry.get('summary', '无描述'), "lxml").get_text(),  #由于summary并非纯文本格式，需要使用bs来去除HTML标签
                'source': source_content.get('folder'),
                'published_at': convert_to_cst(entry.get('published', '未知时间')),
                'folder': source_content.get('folder')
            }
            content_list.append(content_dict)
        except Exception as e:
            print(e)
            continue
    print('本次解析任务完成\n---')
    return content_list

In [129]:
content_list = get_rss_article(rss_source_list[0])

解析成功: https://aws.amazon.com/cn/blogs/machine-learning/feed/, 共找到20篇文章。

本次解析任务完成
---


In [136]:
display(pd.DataFrame(content_list)[:5])

Unnamed: 0,title,link,description_original,source,published_at,folder
0,How Amazon trains sequential ensemble models a...,https://aws.amazon.com/blogs/machine-learning/...,Ensemble models are becoming popular within th...,AWS-MachineLearning,2024-12-16 17:28:15.600928,AWS-MachineLearning
1,Implementing login node load balancing in Sage...,https://aws.amazon.com/blogs/machine-learning/...,"In this post, we explore a solution for implem...",AWS-MachineLearning,2024-12-16 17:28:15.601269,AWS-MachineLearning
2,How Clearwater Analytics is revolutionizing in...,https://aws.amazon.com/blogs/machine-learning/...,"In this post, we explore Clearwater Analytics’...",AWS-MachineLearning,2024-12-16 17:28:15.601499,AWS-MachineLearning
3,How Twitch used agentic workflow with RAG on A...,https://aws.amazon.com/blogs/machine-learning/...,"In this post, we demonstrate how we innovated ...",AWS-MachineLearning,2024-12-16 17:28:15.601714,AWS-MachineLearning
4,Accelerate analysis and discovery of cancer bi...,https://aws.amazon.com/blogs/machine-learning/...,Bedrock multi-agent collaboration enables deve...,AWS-MachineLearning,2024-12-16 17:28:15.601947,AWS-MachineLearning


<br>

### 2.2 通过BeautifulSoup爬取deeplearning_ai内容
用于解析和处理 HTML/XML 的 Python 库，常与 lxml 或 html.parser 等解析器一起使用。它提供了一套简洁的 API，方便开发者从网页或 XML 数据中提取结构化信息。能解析任何 HTML/XML 内容，不依赖于特定格式（如 RSS/Atom）。可以处理不完整或格式不规范的 HTML。

![bs4爬取网页内容](pictures/bs4爬取网页内容.png)

In [131]:
def deeplearning_ai(source_content):
    content_list = []
    # 获取网页内容并解析文章
    notice_items = BeautifulSoup(requests.get(source_content.get('url')).text, 'lxml').find_all('article')
    for item in notice_items:
        target_link = ''
        try:
            # 提取文章链接
            target_link = next((source_content.get('domain') + link.get('href') for link in item.find_all('a') if 'issue' in link.get('href')), '')
            # 提取发布时间和简介
            divs = item.find_all('div')
            published_at = divs[1].find_all('div')[0].text if len(divs) >= 2 else ''
            description_original = divs[1].find_all('div')[1].text if len(divs) >= 2 else ''
            # 构建文章字典
            content_dict = {
                'title': BeautifulSoup(item.find('h2').text.strip(), "lxml").get_text(),
                'link': target_link,
                'description_original': BeautifulSoup(description_original, "lxml").get_text(),
                'source': source_content.get('folder'),
                'published_at': convert_to_cst(BeautifulSoup(published_at, "lxml").get_text()),
                'folder': source_content.get('folder')
            }
            content_list.append(content_dict)

        except Exception as e:
            continue
    return content_list

In [137]:
content_web_list = deeplearning_ai(web_source_list[1])
display(pd.DataFrame(content_web_list)[:5])

  'title': BeautifulSoup(item.find('h2').text.strip(), "lxml").get_text(),


Unnamed: 0,title,link,description_original,source,published_at,folder
0,"Amazon Nova’s Competitive Price/Performance, O...",https://www.deeplearning.ai/the-batch/issue-279/,The Batch AI News and Insights: AI Product Man...,deeplearning_ai,2024-12-17 13:02:00.363918,deeplearning_ai
1,"AI Agents Spend Real Money, Breaking Jailbreak...",https://www.deeplearning.ai/the-batch/issue-278/,The Batch AI News and Insights: AI Agents Spen...,deeplearning_ai,2024-12-17 13:02:00.364810,deeplearning_ai
2,"DeepSeek Takes On OpenAI, Robots Fold Laundry,...",https://www.deeplearning.ai/the-batch/issue-277/,The Batch AI News and Insights: DeepSeek Takes...,deeplearning_ai,2024-12-17 13:02:00.365564,deeplearning_ai
3,"Next-Gen Models Show Limited Gains, Real-Time ...",https://www.deeplearning.ai/the-batch/issue-276/,The Batch AI News and Insights: A small number...,deeplearning_ai,2024-12-17 13:02:00.366284,deeplearning_ai
4,"Llama On the Battlefield, Mixture of Experts P...",https://www.deeplearning.ai/the-batch/issue-275/,The Batch AI News and Insights: Large language...,deeplearning_ai,2024-12-17 13:02:00.367147,deeplearning_ai


<br>

### 2.3 通过selenium模拟浏览器爬取OpenAI News
可以模拟用户操作浏览器的行为，它适用于网页测试、爬虫开发、表单提交和动态内容抓取等场景。无需打开实际浏览器窗口（Headless Mode），更适合后台运行任务。模拟点击、输入、拖拽、滚动等用户行为，支持执行自定义的 JavaScript 脚本。

In [None]:
# import lxml
# import time
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options
# from selenium.webdriver.chrome.service import Service
# from webdriver_manager.chrome import ChromeDriverManager

In [None]:
def OpenAI_News(self):
    # 设置Chrome浏览器选项
    chrome_options = Options()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')

    # 设置无头模式
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-dev-shm-usage')

    # 隐藏自动化特征
    chrome_options.add_argument('--disable-blink-features=AutomationControlled')
    chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("--disable-web-security")  # 允许跨域
    chrome_options.add_argument("--allow-running-insecure-content")
    chrome_options.add_argument("--disable-site-isolation-trials")

    # 修改 User-Agent
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
    chrome_options.add_argument(f'user-agent={user_agent}')

    # 初始化浏览器
    prefs = {
        "profile.managed_default_content_settings.images": 2,
        "profile.default_content_setting_values.notifications": 2,
        "profile.default_content_setting_values.stylesheets": 2,
        "profile.default_content_setting_values.cookies": 2,
        "profile.default_content_setting_values.javascript": 1,
        "profile.default_content_setting_values.plugins": 2,
        "profile.default_content_setting_values.popups": 2,
        "profile.default_content_setting_values.geolocation": 2,
        "profile.default_content_setting_values.media_stream": 2,
    }
    chrome_options.add_experimental_option("prefs", prefs)

    # 初始化浏览器
    service = Service(ChromeDriverManager().install(), log_path='/dev/null')
    driver = webdriver.Chrome(service=service, options=chrome_options)
    # 打开网页
    driver.get(self.source_url)
    # 等待页面加载完成（根据需要调整时间）
    time.sleep(30)
    # 获取渲染后的页面源代码
    html_content = driver.page_source
    # 关闭浏览器
    driver.quit()

    # 解析HTML内容
    soup = BeautifulSoup(html_content, 'lxml')

    # 查找符合条件的<div>
    notice_items = soup.select('div.col-span-1.transition.opacity-1.ease-curve-c.duration-400')
    for item in notice_items:
        try:
            # 获取link
            other_content = item.find_all('span')
            content = item.find('a')
            try:
                published_at = other_content[1].text
            except Exception as e:
                logging.error(f"文章“published_at”字段解析失败:{e}，文章内容为\n{item}")
                published_at = ''
            try:
                description_original = other_content[0].text
            except Exception as e:
                logging.info(f"文章“description_original”字段解析失败:{e}，文章内容为\n{item}")
                description_original = ''
            if content:
                content_dict = {
                    'title': BeautifulSoup(content.get('aria-label'), "lxml").get_text(),
                    'link': self.domain + content.get('href'),
                    'description_original': BeautifulSoup(description_original, "lxml").get_text(),
                    'source': self.source,
                    'published_at': convert_to_cst(BeautifulSoup(published_at, "lxml").get_text()),
                    'folder': self.folder
                }
                self.article_data_list.append(content_dict)
            else:
                logging.info("未找到元素")

        except Exception as e:
            logging.error(f"解析文章失败:{item} {e}，文章内容为\n{item}")
            continue
    return self.article_data_list

---
<br>

## 三、数据库插入

In [133]:
mysql_message = config['mysql']
connection = mysql.connector.connect(
        host=mysql_message['host'],
        port=mysql_message['port'],
        database=mysql_message['database'],
        user=mysql_message['user'],
        password=mysql_message['password'],
        ssl_disabled=True
    )

In [134]:
if connection is not None:
    cursor = connection.cursor()
    for content in content_list:
        try:
            # 检查是否存在相同title的文章
            check_query = "SELECT 1 FROM original_article_table WHERE title = %s"
            cursor.execute(check_query, (content['title'],))
            if cursor.fetchone():
                # 如果找到了，说明已存在相同title的文章，跳过插入
                print(f"【Skipping】 Article '{content['title']}' already exists, skipping.")
                continue

            # 构造SQL插入语句，包括所有字段
            insert_query = """
                        INSERT INTO original_article_table (title, link, description_original, source, published_at, is_processed, folder) 
                        VALUES (%s, %s, %s, %s, %s, %s, %s)
                        ON DUPLICATE KEY UPDATE title=VALUES(title)
                        """
            # 从article_content字典中提取字段值
            title = content['title']
            link = content['link']
            description_original = content['description_original']
            folder = content['folder']
            source = content['source']  # 假设source字段从类属性获取或其他逻辑
            published_at = content['published_at']  # 确保published_at是datetime对象
            is_processed = False

            # 执行SQL语句
            cursor.execute(insert_query, (title, link, description_original, source, published_at, is_processed, folder))
            connection.commit()
            print(f"【successfully】 Article '{title}' inserted successfully.")
        except mysql.connector.Error as e:
            print(f"【Failed】 Failed to insert article '{content['title']}'. MySQL Error: {e}")
            continue  # 即使发生错误也继续尝试插入下一篇文章
    cursor.close()
    

【Skipping】 Article 'How Amazon trains sequential ensemble models at scale with Amazon SageMaker Pipelines' already exists, skipping.
【Skipping】 Article 'Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience' already exists, skipping.
【Skipping】 Article 'How Clearwater Analytics is revolutionizing investment management with generative AI and Amazon SageMaker JumpStart' already exists, skipping.
【Skipping】 Article 'How Twitch used agentic workflow with RAG on Amazon Bedrock to supercharge ad sales' already exists, skipping.
【Skipping】 Article 'Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents' already exists, skipping.
【Skipping】 Article 'Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 2: ModelBuilder' already exists, skipping.
【Skipping】 Article 'Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer' already exists, ski

In [135]:
connection.close()