# NLP 3. Scraping news articles from URLs

**James Morgan (jhmmorgan)**

_2022-05-11_

# 📖 Background

We want a proof of concept, where an end user can easily be provided with a summary of a news article, along with a warning on whether the text is likely to contain hate speech or fake news.

This proof of concept would take the form of a standalone application that, when given the URL of a news article, returns a summary of the article to the end user, along with a flag if the article may contain hate speech or fake news.

### The Task
This notebook is **Part 3** of my NLP project. The task of this notebook is to scrape the relevant text from a provided URL.  This text can then be processed by our various NLP models.


# 🔬 Approach
This is a fairly easy task, thanks to the amazing library **BeautifulSoup**.  Extracting the raw text from any URL takes only a few lines of code; however, configuration is needed to scrape just the relevant parts of that text.
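As a rough illustration of the "few lines of code" claim, the sketch below parses a small hard-coded HTML string and pulls out all of its text. The HTML is a made-up page, not a real news site, and the variable names are my own. In practice the HTML would be downloaded from the URL first; that step is left out here so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded news page.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/sport">Sport</a></nav>
  <div id="maincontent"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <aside><p>Advertisement</p></aside>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get_text() collects every text node on the page, so the navigation link
# and the advert come back alongside the article paragraphs.
raw_text = soup.get_text(separator=" ", strip=True)
print(raw_text)
```

Note that the navigation link and the advert appear in the result, which is exactly the noise that the configuration is meant to remove.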

Many news sites surround the article with additional links, side articles and advertisements, in addition to a lot of metadata (hidden to the eye, but often useful).  If we were to extract all of the text, the result would be cluttered with much of this additional material rather than just the article itself.

There are several approaches we can take to get around this.

Most news sites use classes or IDs that are unique to the text of the main article.  We can filter the text to only include paragraphs within these identifiers.  The pro is that this is the most accurate way of obtaining the relevant text; the con is that it relies on a custom-defined approach that's unique to each website. In other words, I have to provide the identifiers for each website, and the scraper won't work for any website I haven't configured.
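A minimal sketch of this identifier-based filtering is shown below. It assumes the article sits in an element with a known `id` and keeps only the `<p>` tags inside it. The function name `extract_by_id` and the sample HTML are hypothetical and are not taken from the project's own scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the article lives inside the element with id="maincontent".
PAGE_HTML = """
<html><body>
  <nav><a href="/sport">Sport</a></nav>
  <div id="maincontent"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <aside><p>Advertisement</p></aside>
</body></html>
"""


def extract_by_id(html, element_id):
    """Return the text of all <p> tags inside the element with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=element_id)
    if container is None:
        # The identifier is not on this page, e.g. an unconfigured website.
        return ""
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return " ".join(paragraphs)


print(extract_by_id(PAGE_HTML, "maincontent"))
```

The empty-string return for a missing identifier mirrors the drawback described above: without a known identifier for a site, nothing useful comes back.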

An alternative is to either use a third-party library that extracts the article for us, or build a machine learning model that identifies which classes are likely to belong to an article.  Both are out of scope for a proof of concept.


<div class="alert alert-block alert-info">
<b>So how does our technique work?</b>
</div>

Rather than write a separate Python class for each news site, we can create a dictionary that lists the attributes to search for or remove, e.g. the following will search for any element whose **id** is **maincontent**,
```python
{"theguardian" : [{"id" : "maincontent"}]}
```
whereas the following will search for any element whose **class** includes **sdc-article-body--story**, whilst excluding any elements with the **class** **sdc-site-video** or **sdc-article-related-stories** found within the results.
```python 
{"sky"         : [{"class_" : "sdc-article-body--story"}, {"class_" : ["sdc-site-video", "sdc-article-related-stories"]}]}
```
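To make the idea concrete, here is one way such a dictionary could drive BeautifulSoup. This is a sketch under my own assumptions, not the code inside `nlp_web_scraper`: the names `SITE_RULES`, `rules_for` and `extract_article` are hypothetical, the sample HTML is invented, and I assume the first dictionary in each list means "include" and the optional second means "exclude". It relies on the fact that keys such as `id` and `class_` can be passed straight to `find_all` as keyword arguments.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Per-site rules in the format shown above: [include, optional exclude].
SITE_RULES = {
    "theguardian": [{"id": "maincontent"}],
    "sky": [
        {"class_": "sdc-article-body--story"},
        {"class_": ["sdc-site-video", "sdc-article-related-stories"]},
    ],
}

# Hypothetical markup imitating the structure the "sky" rules expect.
SKY_HTML = """
<div class="sdc-article-body sdc-article-body--story">
  <p>Story paragraph one.</p>
  <div class="sdc-site-video"><p>Watch the video.</p></div>
  <p>Story paragraph two.</p>
  <div class="sdc-article-related-stories"><p>Related story link.</p></div>
</div>
<div class="footer"><p>Footer text.</p></div>
"""


def rules_for(url):
    """Look up the rules whose key matches one label of the URL's hostname."""
    labels = urlparse(url).netloc.lower().split(".")
    for site, rules in SITE_RULES.items():
        if site in labels:
            return rules
    raise KeyError(f"No scraping rules defined for {url}")


def extract_article(html, rules):
    """Keep <p> text inside the included elements, minus excluded ones."""
    include = rules[0]
    exclude = rules[1] if len(rules) > 1 else None
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for container in soup.find_all(**include):
        if exclude:
            # Remove unwanted blocks from the tree before reading paragraphs.
            for unwanted in container.find_all(**exclude):
                unwanted.decompose()
        paragraphs.extend(p.get_text(strip=True) for p in container.find_all("p"))
    return " ".join(paragraphs)


rules = rules_for("https://news.sky.com/story/example")
print(extract_article(SKY_HTML, rules))
```

Joining paragraphs with a space is a choice made for this sketch. Supporting a new site then only requires a new dictionary entry rather than new code.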

<div class="alert alert-block alert-info">
<b>So how does this look in practice?</b>
</div>

# 📚 Libraries and functions
We'll start by importing our helper modules, including the custom web scraper that does the extraction.

```python
from utils import *
from nlp_web_scraper import *
```

---


# ⚙️ Output of extracted text

We'll then choose a URL, extract the article from it and display the result.

```python
#link = "https://www.dailymail.co.uk/news/article-10759651/Ukraine-war-Putin-suggest-use-nukes-necessary.html"
#link = "https://www.theguardian.com/world/2022/apr/26/unprecedented-phoenician-necropolis-osuna-spain"
link = "https://news.sky.com/story/local-elections-2022-cost-of-living-and-prime-ministers-future-in-focus-as-election-campaigns-reach-climax-12605293"

sat     = scrape_article_text(link)
article = sat.get_article()
print(article)
```

Britain's cost of living squeeze and the future of the prime minister have taken centre stage as party leaders delivered their final messages to voters on the eve of local elections.Labour leader Sir Keir Starmer said Thursday's local elections, in which thousands of seats across England, Scotland and Wales will be up for grabs, was "a chance to send a message to the government about their abject failure".But Prime Minister Boris Johnson said it was his Conservative party that would be the best choice "if you want help with your family budgets and you want to make sure you've got more at the end of the month".You can find results where you live with our dedicated elections service. And we'll have a special election programme on Sky News from 11pm on Thursday nightThe last day of campaigning came against the backdrop of the prime minister's recent apology after being fined for breaking lockdown rules in Downing Street in 2020 - as well as awkward questions for Sir Keir about a gathering

---


# 🎓 Summary
This concludes the third part of our NLP project.  Using a simple custom class that we import, we can easily extract an article from a given URL.  The downside is that we need to manually define the identifiers to extract for each individual news website, so only configured websites are supported.

Our final step is to bring everything together in our fourth and final part.