# C-More

## News

In [55]:
import feedparser

import requests
from bs4 import BeautifulSoup

import re

#### BBC News

A list of BBC RSS feeds can be found here: https://blog.feedspot.com/bbc_rss_feeds/

In [2]:
# parsing RSS feed

d = feedparser.parse("http://feeds.bbci.co.uk/news/world/rss.xml")

Some common channel elements, available in `d.feed` (or `d["feed"]`) are:

- title
- link
- description
- publication date
- language

In [3]:
# RSS feed title

d["feed"]["title"]

'BBC News - World'

In [4]:
# RSS feed link

d["feed"]["link"]

'https://www.bbc.co.uk/news/'

In [5]:
# RSS feed description

d["feed"]["description"]

'BBC News - World'

In [6]:
# RSS feed update date

d["feed"]["updated"]

'Tue, 08 Nov 2022 12:30:24 GMT'

In [7]:
# RSS feed language

d["feed"]["language"]

'en-gb'

The items are available in `d.entries`.

In [8]:
len(d["entries"])

28

We have 28 items.

Some common item elements are:

- title
- link
- description
- publication date
- parsed publication date
- id

In [9]:
# title of first item

d["entries"][0]["title"]

'Iran International: TV channel says Iran threatened UK-based journalists'

In [10]:
# link of first item

d["entries"][0]["link"]

'https://www.bbc.co.uk/news/world-middle-east-63554305?at_medium=RSS&at_campaign=KARANGA'

In [11]:
# description/summary of first item

d["entries"][0]["description"]

'Two Iran International staff have been warned of a risk to their lives, a law enforcement source says.'

In [12]:
# publication date of first item

d["entries"][0]["published"]

'Tue, 08 Nov 2022 11:37:46 GMT'

In [13]:
# parsed publication date of first item

d["entries"][0]["published_parsed"]

time.struct_time(tm_year=2022, tm_mon=11, tm_mday=8, tm_hour=11, tm_min=37, tm_sec=46, tm_wday=1, tm_yday=312, tm_isdst=0)
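The parsed date is a `time.struct_time` expressed in UTC. A minimal sketch of converting it to a timezone-aware `datetime`, which is often easier to compare and format, could look like this. The helper name `struct_time_to_datetime` is hypothetical, and the value is rebuilt by hand from the output above rather than fetched from the feed.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_time_to_datetime(parsed):
    """Convert a UTC time.struct_time into a timezone-aware datetime."""
    # calendar.timegm interprets the struct_time as UTC (unlike time.mktime).
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


# Same value as the published_parsed output shown above.
parsed = time.struct_time((2022, 11, 8, 11, 37, 46, 1, 312, 0))
published = struct_time_to_datetime(parsed)

print(published.isoformat())  # 2022-11-08T11:37:46+00:00
```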

In [14]:
# id of first item

d["entries"][0]["id"]

'https://www.bbc.co.uk/news/world-middle-east-63554305'

We can also sort these items by publication date.

In [20]:
sorted(d["entries"], key=lambda item: item['published_parsed'], reverse=True)

[{'title': 'COP27: Is climate change going to make diseases more likely in future?',
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': 'http://feeds.bbci.co.uk/news/world/rss.xml',
   'value': 'COP27: Is climate change going to make diseases more likely in future?'},
  'summary': 'Scientists say climate change is making over half of all infectious diseases worse, watch to see why.',
  'summary_detail': {'type': 'text/html',
   'language': None,
   'base': 'http://feeds.bbci.co.uk/news/world/rss.xml',
   'value': 'Scientists say climate change is making over half of all infectious diseases worse, watch to see why.'},
  'links': [{'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://www.bbc.co.uk/news/science-environment-63556927?at_medium=RSS&at_campaign=KARANGA'}],
  'link': 'https://www.bbc.co.uk/news/science-environment-63556927?at_medium=RSS&at_campaign=KARANGA',
  'id': 'https://www.bbc.co.uk/news/science-environment-63556927',
  'guidislink': Fa
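Sorting with `item['published_parsed']` assumes every item has a parsed date. The following sketch shows one way to keep the sort working when some items lack it, by placing undated items last. The helper `sort_by_date` and the sample entries are hypothetical stand-ins for `d["entries"]`.

```python
import time


def sort_by_date(entries):
    """Return entries newest first; entries without a parsed date go last."""
    dated = [e for e in entries if e.get("published_parsed")]
    undated = [e for e in entries if not e.get("published_parsed")]
    dated.sort(key=lambda e: e["published_parsed"], reverse=True)
    return dated + undated


# Made-up entries: plain dicts standing in for feedparser items.
entries = [
    {"title": "older", "published_parsed": time.struct_time((2022, 11, 8, 9, 0, 0, 1, 312, 0))},
    {"title": "undated"},
    {"title": "newer", "published_parsed": time.struct_time((2022, 11, 8, 11, 37, 46, 1, 312, 0))},
]

ordered_titles = [e["title"] for e in sort_by_date(entries)]
print(ordered_titles)  # ['newer', 'older', 'undated']
```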

We can now get the HTML content of a given item.

In [24]:
url = d["entries"][0]["link"]

r = requests.get(url)
html = r.text
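A bare `requests.get(url)` has no timeout and does not signal HTTP errors such as 404. A slightly more defensive variant, sketched here under the assumption that a 10-second timeout is acceptable, might be written as follows. The function name `fetch_html` and its `session` parameter are hypothetical.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Download a page and return its text, raising on HTTP errors."""
    # Use the given session if any, otherwise the module-level requests API.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

It could then replace the cell above with `html = fetch_html(d["entries"][0]["link"])`.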

In [25]:
soup = BeautifulSoup(html, 'html.parser')

In [35]:
# title

soup.title.text

'Iran International: TV channel says Iran threatened UK-based journalists - BBC News'

The available tags are:

In [124]:
tags = set()

for tag in soup.find_all():
    tags.add(tag.name)

In [125]:
tags

{'a',
 'article',
 'aside',
 'b',
 'body',
 'button',
 'circle',
 'div',
 'figcaption',
 'figure',
 'footer',
 'g',
 'h1',
 'h2',
 'head',
 'header',
 'html',
 'img',
 'li',
 'link',
 'main',
 'meta',
 'nav',
 'noscript',
 'ol',
 'p',
 'path',
 'picture',
 'script',
 'section',
 'source',
 'span',
 'style',
 'svg',
 'time',
 'title',
 'ul'}
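Beyond the set of tag names, it can help to know how often each tag occurs, for instance to spot which tags carry most of the content. A small sketch using `collections.Counter` on a made-up HTML snippet (not the BBC page) is shown below.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = (
    "<html><body><article>"
    "<p>One</p><p>Two</p><a href='#'>link</a>"
    "</article></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag in the document.
tag_counts = Counter(tag.name for tag in soup.find_all(True))

print(sorted(tag_counts))  # distinct tag names
print(tag_counts["p"])     # number of <p> tags
```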

We can use the `article` tag to access only the article text.

In [127]:
# article

soup.article.text



In [164]:
soup.article.find_all("p")

[<p class="ssrcss-1q0x1qg-Paragraph eq5iqo00"><b class="ssrcss-hmf8ql-BoldText e5tfeyi3">Two British-Iranian journalists for the UK-based Persian-language TV channel Iran International have been warned of a possible risk to their lives, a UK law enforcement source has confirmed.</b></p>,
 <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">Parent company Volant Media said the Metropolitan Police had notified the pair of a recent increase in "credible" threats from Iranian security forces. </p>,
 <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">It denounced the "escalation of a state-sponsored campaign to intimidate Iranian journalists working abroad".</p>,
 <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">Iranian authorities have not commented.</p>,
 <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">However, they announced sanctions against Iran International and BBC News Persian last month, accusing them of "incitement of riots" and "support of terrorism" over their coverage of the anti-government protes

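The paragraphs of interest carry the attribute `class="ssrcss-1q0x1qg-Paragraph eq5iqo00"`. These class names look auto-generated, so we match only on the `Paragraph` substring with a regular expression.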

In [165]:
for x in soup.article.find_all("p", class_=re.compile("Paragraph")):
    print(x.text)

Two British-Iranian journalists for the UK-based Persian-language TV channel Iran International have been warned of a possible risk to their lives, a UK law enforcement source has confirmed.
Parent company Volant Media said the Metropolitan Police had notified the pair of a recent increase in "credible" threats from Iranian security forces. 
It denounced the "escalation of a state-sponsored campaign to intimidate Iranian journalists working abroad".
Iranian authorities have not commented.
However, they announced sanctions against Iran International and BBC News Persian last month, accusing them of "incitement of riots" and "support of terrorism" over their coverage of the anti-government protests that have engulfed the country over the past two months.
The two UK-based channels are already banned from Iran, but a press freedom watchdog says they are among the main sources of news and information in a country where independent media and journalists are constantly persecuted. 
This video

By filtering on the class, we get only the article text itself.

We can now look for a particular expression like "Iran".

In [185]:
# `string=` is the current name of the older `text=` argument
soup.article.find_all("p", class_=re.compile("Paragraph"), string=re.compile(r"\bIran\b"))

[<p class="ssrcss-1q0x1qg-Paragraph eq5iqo00"><b class="ssrcss-hmf8ql-BoldText e5tfeyi3">Two British-Iranian journalists for the UK-based Persian-language TV channel Iran International have been warned of a possible risk to their lives, a UK law enforcement source has confirmed.</b></p>,
 <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">However, they announced sanctions against Iran International and BBC News Persian last month, accusing them of "incitement of riots" and "support of terrorism" over their coverage of the anti-government protests that have engulfed the country over the past two months.</p>,
 <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">The two UK-based channels are already banned from Iran, but a press freedom watchdog says they are among the main sources of news and information in a country where independent media and journalists are constantly persecuted. </p>,
 <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">How Iran state TV tries to control the story of the protests</p>,
 

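However, with this method we do not retrieve the paragraphs that contain inline links: the text filter is tested against each tag's `.string`, which is `None` when the tag has more than one child.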

In [183]:
soup.article.find_all(string=re.compile(r"\bIran\b"))

['Iran International: TV channel says Iran threatened UK-based journalists',
 '2022 Iran protests',
 'Iran International',
 'Two British-Iranian journalists for the UK-based Persian-language TV channel Iran International have been warned of a possible risk to their lives, a UK law enforcement source has confirmed.',
 'However, they announced sanctions against Iran International and BBC News Persian last month, accusing them of "incitement of riots" and "support of terrorism" over their coverage of the anti-government protests that have engulfed the country over the past two months.',
 'The two UK-based channels are already banned from Iran, but a press freedom watchdog says they are among the main sources of news and information in a country where independent media and journalists are constantly persecuted. ',
 'How Iran state TV tries to control the story of the protests',
 ", which it attributed to Iran's Islamic Revolution Guard Corps (IRGC), a powerful military force with close tie

Searching the text nodes directly returns some extra hits at the beginning and end of the list (such as the headline), but it also reveals passages that the paragraph-level search did not retrieve:

In [179]:
soup.article.find_all(string=re.compile(r"\bIran\b"))[7]

", which it attributed to Iran's Islamic Revolution Guard Corps (IRGC), a powerful military force with close ties to the Supreme Leader, Ayatollah Ali Khamenei. "

In [181]:
soup.article.find_all(string=re.compile(r"\bIran\b"))[10]

'US prosecutors also announced last year that four Iranian intelligence officials had been charged with plotting to kidnap a New York-based journalist critical of Iran'
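One possible way to recover whole paragraphs, including those with inline links, is to select the paragraphs by class first and then test the regular expression against `get_text()`, which concatenates the text of all descendants. The sketch below uses a made-up HTML snippet with a hypothetical class name `x-Paragraph` so that it runs without network access; it is an illustration of the idea, not a tested solution for the BBC page.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML: the second paragraph contains an inline link.
html = """
<article>
  <p class="x-Paragraph">Plain sentence about Iran.</p>
  <p class="x-Paragraph">A report <a href="#">linked here</a>, attributed to Iran's IRGC.</p>
  <p class="x-Paragraph">Nothing relevant.</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"\bIran\b")

# Filtering with string= only sees paragraphs made of a single text node.
string_only = soup.article.find_all("p", string=pattern)

# Testing get_text() also covers paragraphs that contain child tags.
full_text = [
    p
    for p in soup.article.find_all("p", class_=re.compile("Paragraph"))
    if pattern.search(p.get_text())
]

print(len(string_only), len(full_text))  # 1 2
for p in full_text:
    print(p.get_text())
```

The trade-off is that the regular expression is applied in Python after the class filter rather than inside `find_all`, which is usually acceptable for a single article.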