# NLP 4. Bringing it all together.  The final outcome.

**James Morgan (jhmmorgan)**

_2022-05-11_

# 📖 Background

We want a proof of concept, where an end user can easily be provided with a summary of a news article, along with a warning on whether the text is likely to contain hate speech or fake news.

This proof of concept takes the form of a standalone application that, when given the URL of a news article, returns a summary of the article along with a flag if the article may contain hate speech or fake news.

### The Task
This notebook is **Part 4** of my NLP project. The task of this notebook is to bring the earlier parts together: given the URL of a news article, scrape the relevant text, summarise it, and indicate whether it is likely to contain hate speech or fake news.


# 🔬 Approach
This is a fairly easy task, because the hard work was done in the earlier parts. The scraper from part 3 (built on the library **BeautifulSoup**), the summariser from part 2 and the two classifiers from part 1 are used together here, so processing an article from its URL takes only a few lines of code.
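
The overall flow (scrape, classify, summarise) can be sketched roughly as follows. This is not the contents of `nlp_article_summary`: the names `ArticlePipeline`, `ArticleReport` and `format_report`, the callables passed in, the warning wording, and the default of three summary sentences are all assumptions for illustration. Only the method name `process_article_from_url` and the `top_n` argument are taken from the calls used in this notebook.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ArticleReport:
    """Result of running one article through the whole pipeline."""
    summary: List[str]
    fake_news: bool
    hate_speech: bool


class ArticlePipeline:
    """Hypothetical stand-in showing how the four parts might be composed."""

    def __init__(
        self,
        scrape: Callable[[str], str],                # URL -> article text
        summarise: Callable[[str, int], List[str]],  # text, top_n -> sentences
        is_fake_news: Callable[[str], bool],         # text -> flag
        is_hate_speech: Callable[[str], bool],       # text -> flag
    ) -> None:
        self.scrape = scrape
        self.summarise = summarise
        self.is_fake_news = is_fake_news
        self.is_hate_speech = is_hate_speech

    def process_article_from_url(self, url: str, top_n: int = 3) -> ArticleReport:
        # Scrape once, then hand the same text to every downstream step.
        text = self.scrape(url)
        return ArticleReport(
            summary=self.summarise(text, top_n),
            fake_news=self.is_fake_news(text),
            hate_speech=self.is_hate_speech(text),
        )


def format_report(report: ArticleReport) -> str:
    """Render warnings first, then the summary sentences."""
    lines = []
    if report.fake_news:
        lines.append("Warning: content may be fake news.")
    if report.hate_speech:
        lines.append("Warning: content may include hate speech.")
    lines.append("Summary of article")
    lines.extend(report.summary)
    return "\n".join(lines)
```

Passing the four steps in as plain callables keeps each part replaceable, so a better classifier or scraper could be swapped in without touching the rest.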



# 📚 Libraries and functions
We'll start by importing everything from `nlp_article_summary`, and then run the pipeline on a few example news article URLs.

In [1]:
from nlp_article_summary import *

In [2]:
url = "https://www.dailymail.co.uk/news/article-10759651/Ukraine-war-Putin-suggest-use-nukes-necessary.html"
nlp = NLP_summary()
nlp.process_article_from_url(url)


Summary of article
He again repeated unsubstantiated claims that Ukraine was seeking to possess nuclear weapons itself or develop biological weapons, which he said posed 'a real threat [to] our motherland. '

Earlier today, Kremlin propagandist Vladimir Solovyov directly threatened the UK with nukes - saying that 'one Sarmat missile means minus one Great Britain. '

A fake 'attack' by Ukrainian forces - or messages claiming Ukraine will launch an invasion - would help provide a pretext for sending in Russian forces.


In [3]:
url = "https://news.sky.com/story/local-elections-2022-cost-of-living-and-prime-ministers-future-in-focus-as-election-campaigns-reach-climax-12605293"
nlp.process_article_from_url(url)

Content unlikey to be fake news.

Summary of article
"The prime minister has been resistant to opposition calls for a windfall tax on the likes of BP and Shell - which are reaping bumper profits from high energy prices - to pay for more help for families.

In Wakefield, Labour leader Sir Keir - responding to Mr Eustice's comments about rising prices - said: "Talk about out of touch, out of ideas and out of excuses.

"On the question of a gathering in April 2021, when he was photographed drinking a beer with colleagues while campaigning in Durham - and the subject of which has resurfaced in recent days - Sir Keir said he had had no contact with police and accused Conservatives of "mudslinging".


In [5]:
url = "https://www.theguardian.com/media/2022/may/11/channel-4-strikes-deal-to-air-1000-hours-of-hit-shows-for-free-on-youtube"
nlp.process_article_from_url(url, top_n=5)

Content unlikey to include hate speech.

Summary of article
Channel 4 is to make available 1,000 hours of hit shows from Location, Location, Location to SAS: Who Dares Wins on YouTube in the widest-ranging commercial deal the Silicon Valley giant has struck with a UK broadcaster.

In 2009, Channel 4 became the first broadcaster in the world to make thousands of hours of programming available on YouTube.

Nadine Dorries’s slip-ups may be funny – but her Channel 4 plans are no joke | Jane MartinsonRead more“Innovative strategic partnerships are Channel 4’s specialty and this new relationship with YouTube is another which will ensure we continue to keep growing our reach with young audiences and build on our unrivalled digital success,” said Alex Mahon, chief executive of Channel 4.

“Together with YouTube we have created a powerful consumer channel full of our brilliant Channel 4 content.

The deal will involve a combination of popular archive catch-up programmi
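
Judging by the outputs above, `top_n` appears to set how many sentences the summary contains (three by default, five in the last call). The summarisation class from part 2 is not shown in this notebook, so the following is only a generic frequency-based sketch of extractive summarisation. The function names, the tiny stop-word list and the scoring rule are all assumptions, not the project's actual implementation.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stop-word list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
              "on", "for", "with", "that", "this", "was"}


def tokenise(text: str) -> List[str]:
    """Lower-case words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarise(text: str, top_n: int = 3) -> List[str]:
    """Return the top_n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenise(text))  # word frequencies over the whole article

    def score(sentence: str) -> float:
        tokens = tokenise(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:top_n])  # restore document order
    return [sentences[i] for i in keep]
```

Because the sentences are copied verbatim from the article, any scraping residue in the source text (such as a "Read more" link fused into a sentence) carries straight through to the summary.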

---


# 🎓 Summary
This concludes the fourth and final part of my NLP project.  


<div class="alert alert-block alert-info">
<b>What I did</b>
</div>

* In part 1, I trained basic fake news and hate speech models
* In part 2, I created an extractive text summarisation class
* In part 3, I created a news article web scraper
* In part 4, I brought all parts together to summarise a news article and to provide an indication of whether that article is likely to contain hate speech or fake news.


<div class="alert alert-block alert-info">
<b>How I can improve this project in the future</b>
</div>


The models aren't perfect.  
1. We could train our fake news and hate speech models in a more sophisticated way
    - There's no training around the source of articles
    - There's no training around key hate words
    - There's no understanding of context, which could turn any normal phrase into hate speech when used in certain ways.
2. The summarisation technique is extractive: simple but effective. An abstractive summarisation technique would take significantly more effort to create.
3. The news article scraper works well; however, for the proof of concept we've taken a manual approach to extracting text.
    - We could use third-party libraries, or train a model to identify the key tags to extract text from
4. Our predictions aren't perfect.
    - We trained our models on limited data
    - The hate speech model was based on tweets, which contain far fewer words than a news article; this likely impacts performance
    - So whilst our models perform well on their training data, this may not translate to text of different lengths or from different contexts.


That said, as a proof of concept, I consider this project a success.


<div class="alert alert-block alert-info">
<b>Next Steps</b>
</div>

One thing missing from this project is a front end for running my models.  It would be nice to create an HTML front end where I can deploy this project for anyone to use.
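
As a starting point, a front end of this kind could be prototyped with nothing but Python's standard library. The sketch below is an assumption-laden illustration, not part of the project: `analyse` is a hypothetical placeholder for the real pipeline call, and `render_page`, `Handler`, `serve`, the page layout and the port number are all invented for the example.

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def analyse(url: str) -> dict:
    """Hypothetical placeholder: the real pipeline call would go here."""
    return {"summary": [f"(summary of {url} would go here)"], "warnings": []}


def render_page(url: str = "") -> str:
    """Build the HTML page: a URL form, plus results when a URL was submitted."""
    parts = [
        "<!doctype html><html><body>",
        "<h1>Article summary</h1>",
        '<form method="get">',
        f'<input name="url" size="80" value="{escape(url, quote=True)}">',
        "<button>Summarise</button></form>",
    ]
    if url:
        result = analyse(url)
        # Escape everything that originates from user input or scraped text.
        parts += [f"<p><b>{escape(w)}</b></p>" for w in result["warnings"]]
        parts += [f"<p>{escape(s)}</p>" for s in result["summary"]]
    parts.append("</body></html>")
    return "\n".join(parts)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Read the submitted article URL from the query string, if any.
        query = parse_qs(urlparse(self.path).query)
        url = query.get("url", [""])[0]
        body = render_page(url).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the prototype locally until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` would start the prototype on the local machine only. A public deployment would need a proper web framework and hosting, which is beyond this sketch.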