**Automatic News Scraping with Python**

This article shows how to extract relevant information from news articles, such as the title, author, publish date, and main content. This information can then be used for various purposes, such as creating a personal news feed, analyzing trends in the news, or building a dataset for natural language processing tasks. We will look at how to use the Newspaper and Feedparser modules to scrape and parse news articles from various sources.

The Newspaper module is a powerful tool for extracting and parsing news articles from various sources, while the Feedparser module is useful for parsing RSS feeds. RSS (Really Simple Syndication) is a web feed that allows users and applications to access updates to websites in a standardized, computer-readable format. These updates can include blog entries, news articles, audio, video, and any other content that can be provided in a feed.
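
To make the "standardized, computer-readable format" concrete, here is a minimal sketch that reads a tiny RSS 2.0 document using only Python's standard library (xml.etree.ElementTree) rather than Feedparser. The feed text, its titles, and the example.com links are made up for illustration, and read_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document (hypothetical titles and example.com links).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <pubDate>Mon, 01 Jan 2024 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def read_items(rss_text):
    """Return one dict per <item> element with its title, link, and date."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


items = read_items(SAMPLE_RSS)
for item in items:
    print(item["published"], "|", item["title"], "->", item["link"])
```

Feedparser does the same job with far more tolerance for malformed or non-standard feeds, which is why the rest of this article relies on it.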

In [2]:
!pip install newspaper3k
!pip install feedparser


The Newspaper and Feedparser modules have several useful functions and methods for extracting and parsing news articles:
* newspaper.build(): Builds a Source object for a news site from its home page URL, from which the site's articles can be discovered.
* Article.download(): Downloads the HTML of the article at the URL that was passed to newspaper.Article().
* Article.parse(): Parses the downloaded HTML and extracts relevant information such as the title, authors, publish date, and main content of the article. It must be called after download().
* feedparser.parse(): Parses an RSS feed and returns the feed metadata along with its entries, each carrying fields such as the title, author, publish date, and link of the article.
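
As a rough illustration of the download-then-parse sequence, the sketch below wraps it in a small helper. The names fetch_article, article_factory, and FakeArticle are hypothetical and not part of newspaper3k: article_factory is only a hook that lets the call order be tried offline with a stand-in class, and it defaults to the real newspaper.Article.

```python
def fetch_article(url, article_factory=None):
    """Download and parse one article, returning its main fields as a dict.

    article_factory is a hypothetical hook (not a newspaper3k feature) for
    passing a stand-in class during offline experiments.
    """
    if article_factory is None:
        # Imported lazily so the sketch can be tried without newspaper3k.
        from newspaper import Article
        article_factory = Article

    article = article_factory(url)
    article.download()  # fetch the HTML first
    article.parse()     # then extract fields from the downloaded HTML
    return {
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "text": article.text,
    }


class FakeArticle:
    """Offline stand-in mimicking the attributes used above (illustrative only)."""

    def __init__(self, url):
        self.url = url
        self.title = ""
        self.authors = []
        self.publish_date = None
        self.text = ""
        self._html = None

    def download(self):
        self._html = "<html><title>Stub headline</title></html>"

    def parse(self):
        if self._html is None:
            raise RuntimeError("call download() before parse()")
        self.title = "Stub headline"
        self.text = "Stub body text."


result = fetch_article("https://example.com/story", article_factory=FakeArticle)
print(result["title"])
```

Calling fetch_article(url) without the second argument uses the real Article class and needs network access.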


**Code Implementation**
First, we import the required modules, newspaper and feedparser. Next, we define a function called scrape_news_from_feed() which takes a feed URL as input. Inside the function, we first parse the RSS feed using the feedparser.parse() function. This returns a dictionary-like object containing information about the feed itself and a list of its entries.

For each entry in the feed, we create a newspaper article object using the newspaper.Article() constructor, passing it the link of the article. We then download and parse the article using the article.download() and article.parse() methods, extract relevant information such as the title, authors, publish date, and main content, and append it to a list of articles. Finally, the function returns the list of articles.
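
Newspaper cannot always detect every field on a page: the authors list may come back empty and the publish date may be None. One possible way to soften this, sketched below, is to fall back on what the feed entry itself provides. The helper name build_record and the sample values are hypothetical; the sketch assumes the entry exposes title, author, and published keys, which Feedparser provides when the feed supplies them.

```python
def build_record(entry, article_fields):
    """Combine fields parsed from the page with fallbacks from the feed entry.

    entry: a dict-like feed entry (feedparser entries support .get()).
    article_fields: dict with 'title', 'authors', 'publish_date', 'text'.
    """
    authors = article_fields.get("authors") or []
    if not authors and entry.get("author"):
        authors = [entry.get("author")]

    return {
        "title": article_fields.get("title") or entry.get("title", ""),
        "author": ", ".join(authors) if authors else "Unknown",
        # Note: the page value is a datetime, the feed fallback is a string.
        "publish_date": article_fields.get("publish_date") or entry.get("published"),
        "content": article_fields.get("text", ""),
    }


# Made-up values standing in for one feed entry and one parsed article.
entry = {
    "title": "Feed title",
    "link": "https://example.com/story",
    "published": "Mon, 01 Jan 2024 09:00:00 GMT",
}
parsed = {"title": "", "authors": [], "publish_date": None, "text": "Body text."}

record = build_record(entry, parsed)
print(record)
```

Because the fallback date is a plain string while Newspaper returns a datetime, downstream code that sorts or filters by date would need to normalize the two.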

In [10]:
import newspaper
import feedparser

def scrape_news_from_feed(feed_url):
    articles = []
    # Parse the RSS feed; feed.entries holds one item per article
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        # Create a newspaper article object from the entry's link
        article = newspaper.Article(entry.link)
        # Download and parse the article
        article.download()
        article.parse()
        # Extract relevant information
        articles.append({
            'title': article.title,
            'author': article.authors,
            'publish_date': article.publish_date,
            'content': article.text,
        })
    return articles

feed_url = 'http://feeds.bbci.co.uk/news/rss.xml'
articles = scrape_news_from_feed(feed_url)

# Print the extracted articles
for article in articles:
    print('Title:', article['title'])
    print('Author:', article['author'])
    print('Publish Date:', article['publish_date'])
    print('Content:', article['content'])
    print()

Title: Violence, overcrowding, self-harm: Inside one of Britain’s most dangerous prisons
Author: []
Publish Date: None
Content: Violence, overcrowding, self-harm: BBC goes inside one of Britain’s most dangerous prisons

BBC

There’s chaos in HMP Pentonville. A piercing alarm alerts us to what prison officers describe as an “incident”. There’s a cacophony of slamming metal doors, keys jangling, and shouts and screams from inmates as officers race to see what’s happened. We run behind as they head to where the trouble is. Cell doors and chipped painted white bars are just about the only scenery as we move through this chaotic and nerve-jangling environment. A muffled walkie-talkie tells us it’s a case of self-harm. An inmate who’s been locked up for most of the day has carved “mum and dad” into his arm with a sharp object. A quick glance into the cell and the sight of blood. A prison officer crouches down, stemming the flow.

The BBC has been given rare access to HMP Pentonville men's pr