# General Blog Scraping

#### Goal:
- Scrape any blog whose robots.txt permits it

#### Outline:
- Boilerplate
- Scraping a single blog
- Trying to scrape multiple different blog formats

## Boilerplate

In [1]:
import requests as req
from urllib.parse import urlsplit
import urllib.robotparser as urp
import re
from bs4 import BeautifulSoup
from html2text import html2text as htt
import sys

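# Check robots.txt before scraping
def can_scrape(url: str) -> bool:
    """Return True if the site's robots.txt lets any user agent ("*") fetch url."""
    # Build the robots.txt URL from the scheme and host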
    url_parts = urlsplit(url)
    base_url = url_parts.scheme + "://" + url_parts.netloc
    robot_url = base_url + '/robots.txt'
    rp = urp.RobotFileParser()
    rp.set_url(robot_url)
    rp.read()
    return rp.can_fetch("*", url)
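
The same check can be tried without a network request. This is a minimal sketch, assuming a made-up robots.txt (`SAMPLE_ROBOTS`) and a hypothetical helper `can_scrape_from_text`, neither of which is part of the notebook. It feeds the rules to `RobotFileParser.parse()` instead of downloading them with `read()`.

```python
import urllib.robotparser as urp

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def can_scrape_from_text(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Apply already-downloaded robots.txt rules to a URL."""
    rp = urp.RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made here
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

allowed = can_scrape_from_text(SAMPLE_ROBOTS, "https://example.com/blog/post")
blocked = can_scrape_from_text(SAMPLE_ROBOTS, "https://example.com/private/draft")
print(allowed, blocked)
```

Separating the rule check from the download also makes it possible to fetch robots.txt once per site and reuse it for every URL on that host.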

In [6]:
url = 'https://aws.amazon.com/blogs/machine-learning/feed/'
# Check robots.txt before fetching the feed
if can_scrape(url):
    header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    res = req.get(url, headers=header)
    soup = BeautifulSoup(res.content, features='xml')
    articles = soup.find_all('item')

    article_list = []
    for a in articles:
        title = a.find('title').text
        link = a.find('link').text
        published = a.find('pubDate').text
        description = a.find('description').text

        article = {
            'title': title,
            'link': link,
            'published': published,
            'description': htt(description)
            }
        article_list.append(article)

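    # Inspect one parsed entry
    idx = 4
    # 'description' was already converted to text with html2text above
    print(article_list[idx]['description'])
    print(article_list[idx]['title'])
    print(article_list[idx]['published'])
    print(article_list[idx]['link'])
    # getsizeof() measures only the list object, not the dicts it references
    print(sys.getsizeof(article_list))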

True
Amazon Polly is a service that turns text into lifelike speech. It enables the
development of a whole class of applications that can convert text into speech
in multiple languages. This service can be used by chatbots, audio books, and
other text-to-speech applications in conjunction with other AWS AI or machine
learning (ML) services. For […]


Highlight text as it’s being spoken using Amazon Polly
Wed, 05 Jul 2023 20:12:48 +0000
https://aws.amazon.com/blogs/machine-learning/highlight-text-as-its-being-spoken-using-amazon-polly/
248
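
The field extraction in the cell above can also be tried offline. This is a minimal sketch, assuming a made-up two-item feed (`SAMPLE_FEED`) and a hypothetical helper `parse_feed`. It swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so it runs without network access or extra packages, and it leaves the description as-is rather than passing it through `html2text`.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed, used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
      <description>Short summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second-post</link>
      <pubDate>Tue, 02 Jan 2024 00:00:00 +0000</pubDate>
      <description>Short summary of the second post.</description>
    </item>
  </channel>
</rss>"""

def parse_feed(xml_text: str) -> list:
    """Return one dict per <item>, with the same keys the notebook uses."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            # findtext() returns the default when a child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return articles

articles = parse_feed(SAMPLE_FEED)
for article in articles:
    print(article["published"], article["title"], article["link"])
```

Using a default for missing fields keeps one malformed item from aborting the whole feed, which matters once several different blogs are scraped with the same code.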


In [7]:
#Try this blog post
can_scrape("https://blog.mithrilsecurity.io")

True

### Trying the `<article>` element instead (for various blog formats)

Okay, pulling the post out of the page's `<article>` element works way better across different blog formats.
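
The idea can be sketched offline on an inline page. This is a minimal sketch, not the notebook's implementation: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and `ArticleParagraphs`, `extract_paragraphs`, `SAMPLE_HTML`, and the `min_chars=70` cutoff are all assumptions made for illustration.

```python
from html.parser import HTMLParser

class ArticleParagraphs(HTMLParser):
    """Collect the text of every <p> nested inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0    # > 0 while inside an <article>
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.paragraphs.append("".join(self._buffer).strip())
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)

# Made-up page, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <nav><p>Site navigation text that should be ignored.</p></nav>
  <article>
    <h1>Example title</h1>
    <p>This is a long body paragraph that easily clears the minimum length used by the filter below.</p>
    <p>Thanks!</p>
  </article>
</body></html>
"""

def extract_paragraphs(html: str, min_chars: int = 70) -> list:
    """Return <article> paragraphs longer than min_chars characters."""
    parser = ArticleParagraphs()
    parser.feed(html)
    return [p for p in parser.paragraphs if len(p) > min_chars]

paragraphs = extract_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

Restricting the search to `<article>` drops navigation and footer text, and the length cutoff drops short lines such as sign-offs.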

In [76]:
post_url = 'https://netflixtechblog.com/detecting-scene-changes-in-audiovisual-content-77a61d3eaad6'
# post_url = 'https://aws.amazon.com/blogs/aws/new-solution-clickstream-analytics-on-aws-for-mobile-and-web-applications/'
# Check robots.txt for the page actually being fetched, not the earlier feed URL
if can_scrape(post_url):
    # Get HTML
    header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    res = req.get(post_url, headers=header)
    # Name the parser explicitly; 'html.parser' ships with Python
    soup = BeautifulSoup(res.content, features='html.parser')
    # find() returns None (it does not raise) when the page has no <article> tag
    article = soup.find('article')
    if article is not None:
        h1 = article.find('h1')
        title = h1.text if h1 is not None else None
        body = article.find_all('p')
        for p in body:
            # Smaller paragraphs are usually acknowledgements, etc.
            # len() counts characters; sys.getsizeof() also counts object overhead.
            # 70 characters is roughly the old 120-byte cutoff for plain ASCII text.
            if len(p.text) > 70:
                print(p.text)

True
When watching a movie or an episode of a TV show, we experience a cohesive narrative that unfolds before us, often without giving much thought to the underlying structure that makes it all possible. However, movies and episodes are not atomic units, but rather composed of smaller elements such as frames, shots, scenes, sequences, and acts. Understanding these elements and how they relate to each other is crucial for tasks such as video summarization and highlights detection, content-based video retrieval, dubbing quality assessment, and video editing. At Netflix, such workflows are performed hundreds of times a day by many teams around the world, so investing in algorithmically-assisted tooling around content understanding can reap outsized rewards.
While segmentation of more granular units like frames and shot boundaries is either trivial or can primarily rely on pixel-based information, higher order segmentation¹ requires a more nuanced understanding of the content, such as the 