# NCSC Report Web Scraping

### Introduction

In this project we look at web scraping and parallelisation. We do this in the context of the NCSC weekly threat reports, which we retrieve using Selenium. We then use multiprocessing to parallelise the scraping and measure the speed increase this gives.

We first import all the libraries we need. BeautifulSoup was initially considered, following the inspiration listed in the references, but it turned out to be unsuitable for this form of web scraping. It does not execute JavaScript, which is required to render the articles we want to obtain, so we switched to Selenium, whose webdrivers load each page in a real browser.

In [1]:
import requests

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

import re
import pandas as pd
from IPython.display import clear_output
import pickle 
import datetime

The options below run Chrome in headless mode, so the usual browser window no longer pops up and we can run the code while performing other tasks.

In [2]:
chrome_options = Options()
chrome_options.add_argument("headless")
chrome_options.add_argument("no-sandbox")
chrome_options.add_argument("disable-gpu")

### Getting Links

The first step here is to grab the links we need for all the articles necessary. We do this using the Chrome webdriver. *Note that the link used below only grabs the links of all the articles up to 23/04/2021. As more articles are produced, we need to update this code to access the rest.*

In [3]:
# 'options' replaces the deprecated 'chrome_options' keyword
linkbrowser = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)



In [4]:
linkbrowser.get('https://www.ncsc.gov.uk/section/keep-up-to-date/threat-reports?q=&defaultTypes=report&sort=date%2Bdesc&writtenFor=Large+organisations&rows=232')

After loading the page, we collect the links. We use an XPath search with './/a' as the path to find every hyperlink on the page. This first cell grabs all links, including ones such as the careers site that we are not looking for, so the following cell keeps only the links that contain 'report/weekly'.

In [5]:
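# Wait until at least one hyperlink is present, then collect the href of every <a> element
WebDriverWait(linkbrowser, 10).until(EC.presence_of_element_located((By.XPATH, ".//a")))

links = []
for a in linkbrowser.find_elements_by_xpath('.//a'):
    links.append(a.get_attribute('href'))

# The listing page is no longer needed once the links are collected
linkbrowser.quit()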

In [6]:
report_links = []
for link in links:
    # href can be None for anchors without a target, so check it before the substring test
    if link and 'report/weekly' in link:
        report_links.append(link)

In [7]:
report_links

['https://www.ncsc.gov.uk/report/weekly-threat-report-23rd-april-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-16th-april-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-12th-april-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-2nd-april-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-26th-march-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-19th-march-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-12th-march-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-5th-march-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-26th-february-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-19th-february-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-12th-february-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-5th-february-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report-29th-january-2021',
 'https://www.ncsc.gov.uk/report/weekly-threat-report
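
The filtering step can also be written as a small standalone function. The sketch below is illustrative only: `filter_report_links` and the sample careers URL are hypothetical, and skipping `None` values and duplicates is an assumption about what a listing page may return, not something observed in this run.

```python
def filter_report_links(hrefs, marker='report/weekly'):
    """Keep each href that contains marker, skipping None values and repeats."""
    seen = set()
    kept = []
    for href in hrefs:
        if href and marker in href and href not in seen:
            seen.add(href)
            kept.append(href)
    return kept


# Hypothetical sample of what the '//a' search might return
sample_hrefs = [
    'https://www.ncsc.gov.uk/report/weekly-threat-report-23rd-april-2021',
    None,                                   # an anchor with no href attribute
    'https://example.invalid/careers',      # hypothetical unrelated link
    'https://www.ncsc.gov.uk/report/weekly-threat-report-23rd-april-2021',
    'https://www.ncsc.gov.uk/report/weekly-threat-report-16th-april-2021',
]

filtered = filter_report_links(sample_hrefs)
print(filtered)
```

Preserving the original order matters here, because the links are later attached to the data frame row by row.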

### Getting Articles

We now search each link for the title, article and tags in it. 

The function get_articles is our web scraper for this job. *Note that this is also used in the parallelised code*. It searches the page by class name for the title, body and tags of each report, which are later compiled into a data frame. The tags section often contains several tags in one string, so we use a regular expression to separate them into a list of individual tags.

The cell below this calls the function for each link we have obtained and then creates a pickle dump of the data frame.

In [8]:
def get_articles(link):
    # Scrape the title, body text and topic tags from one report page
    gbrowser = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)

    # Defaults, so that a partial failure still returns a well-formed record
    title = None
    articles = []
    t = []

    try:
        gbrowser.get(link)
        WebDriverWait(gbrowser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pcf-title")))

        try:
            # find_elements_* returns a list, so take the text of the first match
            title = gbrowser.find_elements_by_class_name("pcf-title")[0].text

            # A report can contain several body blocks; keep the text of each one
            main = gbrowser.find_elements_by_class_name('pcf-BodyText')
            articles = [block.text for block in main]

            # Drop the 7-character label at the start of the tag text, then cut the
            # rest at each capital letter, since every tag starts with one
            topics = gbrowser.find_elements_by_class_name('topic-tags-container')
            topics = topics[0].text[7:]
            t = re.findall('[A-Z][^A-Z]*', topics)

        except Exception:
            # Keep whatever was scraped before the failure
            pass

    finally:
        # Always shut the browser down, even if the page failed to load
        gbrowser.quit()

    return {'Title': title, 'Article': articles, 'topics': t}
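
The tag-splitting step in get_articles cuts the tag text at each capital letter. A minimal sketch of that idea in isolation is shown here; `split_tags` is a hypothetical helper, and the sample string is an assumed example of what the tag container text might look like once its label has been removed.

```python
import re


def split_tags(text):
    """Cut text at each capital letter and trim whitespace around every piece."""
    return [tag.strip() for tag in re.findall(r'[A-Z][^A-Z]*', text)]


# Hypothetical tag text, modelled on the tags seen in the scraped reports
sample = 'Cyber attack Cyber strategy Education Vulnerabilities'
tags = split_tags(sample)
print(tags)
```

This relies on every tag starting with its only capital letter. A tag with a second capital, such as a product name, would be cut into two pieces, so the approach is worth checking against the scraped tags.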

In [9]:
try:
    # Reuse the cached data frame if a previous run has already saved one
    with open('NCSC Reports.p', 'rb') as f:
        my_df = pickle.load(f)

except FileNotFoundError:
    # No cache yet: scrape every report in turn and time the whole run
    begin_time = datetime.datetime.now()
    my_df = []
    for num, link in enumerate(report_links):
        my_df.append(get_articles(link))

        # Print progress every ten articles
        if num % 10 == 0:
            clear_output(wait=True)
            print('Scraping article number {}'.format(num))

    my_df = pd.DataFrame(my_df)
    my_df['Links'] = report_links
    with open('NCSC Reports.p', 'wb') as f:
        pickle.dump(my_df, f)

    end_time = datetime.datetime.now()
    print('Time Taken: %s' % (end_time - begin_time))

Scraping article number 220
Time Taken: 0:14:36.257181
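
The load-or-scrape pattern used in the cell above can be sketched as a reusable helper. This is a minimal illustration under stated assumptions: `load_or_compute` and `slow_job` are hypothetical names, the short sleep stands in for the scraping run, and the cache is written to a temporary directory so that nothing in the project is touched.

```python
import os
import pickle
import tempfile
import time


def load_or_compute(path, compute):
    """Return (object, was_cached): load the pickle at path, or build and save it."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f), True
    except FileNotFoundError:
        result = compute()
        with open(path, 'wb') as f:
            pickle.dump(result, f)
        return result, False


def slow_job():
    """Hypothetical stand-in for the scraping loop."""
    time.sleep(0.1)
    return [{'Title': 'example', 'Article': [], 'topics': []}]


cache_path = os.path.join(tempfile.mkdtemp(), 'cache.p')

start = time.perf_counter()
first, first_cached = load_or_compute(cache_path, slow_job)
first_elapsed = time.perf_counter() - start

start = time.perf_counter()
second, second_cached = load_or_compute(cache_path, slow_job)
second_elapsed = time.perf_counter() - start

print('first run cached: {}, second run cached: {}'.format(first_cached, second_cached))
print('first run: {:.3f}s, second run: {:.3f}s'.format(first_elapsed, second_elapsed))
```

Catching only `FileNotFoundError` means a corrupt or unreadable cache raises an error instead of silently triggering a long re-scrape.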


In [10]:
pd.set_option('display.max_rows', 500)
my_df

|     | Title | Article | topics | Links |
| --- | --- | --- | --- | --- |
| 0 | Weekly Threat Report 23rd April 2021 | [The NCSC is aware that a malicious piece of s... | [Cyber attack, Cyber strategy, Education, Vuln... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 1 | Weekly Threat Report 16th April 2021 | [Cyber security researchers have uncovered a s... | [Cyber strategy, Patching, Vulnerabilities] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 2 | Weekly Threat Report 12th April 2021 | [Cyber security researchers, Esentire, have wa... | [Phishing, Social media, Personal data, Vulner... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 3 | Weekly Threat Report 2nd April 2021 | [The UK education sector continues to face an ... | [Education, Incident management, Secure design... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 4 | Weekly Threat Report 26th March 2021 | [Earlier this month Microsoft confirmed that s... | [Cyber attack, Education, Mitigation, Patching... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 5 | Weekly Threat Report 19th March 2021 | [Courier service company Fastway said this wee... | [Cyber attack, Personal data, Phishing, Vulner... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 6 | Weekly Threat Report 12th March 2021 | [There has been a rise in vulnerability report... | [Cyber threat, Risk management, Vulnerabilities] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 7 | Weekly Threat Report 5th March 2021 | [Microsoft has released a number of security u... | [Cyber threat, Patching, Personal data, Phishi... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 8 | Weekly Threat Report 26th February 2021 | [VMware have released security updates to addr... | [Cyber attack, Vulnerabilities] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 9 | Weekly Threat Report 19th February 2021 | [Scam emails, which aim to convince people to ... | [Cyber attack, Personal data, Phishing, Secure... | https://www.ncsc.gov.uk/report/weekly-threat-r... |


In [11]:
pd.reset_option('all')


: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.






Here we run our parallelised code. The code is implemented in NCSC-Report-Parallelisation.py, where comments explain the process. We use a worker pool to scrape the data much faster than we can without parallelisation. **If the output says it is using 4 cores, ignore it: the pool uses all available cores.**

In [18]:
try:
    my_df_pl = pickle.load(open('NCSC Reports Parallelised.p','rb'))
except:
    %run -i NCSC-Report-Parallelisation.py
    my_df_pl = pd.DataFrame(results)
    my_df_pl['Links'] = report_links
    pickle.dump(my_df_pl, open('NCSC Reports Parallelised.p','wb'))

In [19]:
my_df_pl

|     | Title | Article | topics | Links |
| --- | --- | --- | --- | --- |
| 0 | Weekly Threat Report 23rd April 2021 | [The NCSC is aware that a malicious piece of s... | [Cyber attack, Cyber strategy, Education, Vuln... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 1 | Weekly Threat Report 16th April 2021 | [Cyber security researchers have uncovered a s... | [Cyber strategy, Patching, Vulnerabilities] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 2 | Weekly Threat Report 12th April 2021 | [Cyber security researchers, Esentire, have wa... | [Phishing, Social media, Personal data, Vulner... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 3 | Weekly Threat Report 2nd April 2021 | [The UK education sector continues to face an ... | [Education, Incident management, Secure design... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 4 | Weekly Threat Report 26th March 2021 | [Earlier this month Microsoft confirmed that s... | [Cyber attack, Education, Mitigation, Patching... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| ... | ... | ... | ... | ... |
| 216 | Weekly Threat Report 28th October 2016 | [Malware-infected ATMs compromise Indian debit... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 217 | Weekly Threat Report 24th October 2016 | [Threat assessment and trend analysis\nOnline ... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 218 | Weekly Threat Report 17th October 2016 | [New Trojan used in financial attacks\nSymante... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 219 | Weekly Threat Report 10th October 2016 | [Threat assessment and trend analysis\nDressco... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |


In [16]:
try: 
    my_df_pla = pickle.load(open('NCSC Reports Parallelised Asynchronous.p', 'rb'))
except:
    %run -i NCSC-Report-Parallelisation-Asynch.py
    my_df_pla = pd.DataFrame(results)
    my_df_pla['Links'] = report_links
    pickle.dump(my_df_pla, open('NCSC Reports Parallelised Asynchronous.p','wb'))    

In [17]:
my_df_pla

|     | Title | Article | topics | Links |
| --- | --- | --- | --- | --- |
| 0 | Weekly Threat Report 23rd April 2021 | [The NCSC is aware that a malicious piece of s... | [Cyber attack, Cyber strategy, Education, Vuln... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 1 | Weekly Threat Report 16th April 2021 | [Cyber security researchers have uncovered a s... | [Cyber strategy, Patching, Vulnerabilities] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 2 | Weekly Threat Report 12th April 2021 | [Cyber security researchers, Esentire, have wa... | [Phishing, Social media, Personal data, Vulner... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 3 | Weekly Threat Report 2nd April 2021 | [The UK education sector continues to face an ... | [Education, Incident management, Secure design... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 4 | Weekly Threat Report 26th March 2021 | [Earlier this month Microsoft confirmed that s... | [Cyber attack, Education, Mitigation, Patching... | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| ... | ... | ... | ... | ... |
| 216 | Weekly Threat Report 28th October 2016 | [Malware-infected ATMs compromise Indian debit... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 217 | Weekly Threat Report 24th October 2016 | [Threat assessment and trend analysis\nOnline ... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 218 | Weekly Threat Report 17th October 2016 | [New Trojan used in financial attacks\nSymante... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
| 219 | Weekly Threat Report 10th October 2016 | [Threat assessment and trend analysis\nDressco... | [Cyber threat] | https://www.ncsc.gov.uk/report/weekly-threat-r... |
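
The difference between the blocking and the asynchronous pool calls can be illustrated with a small sketch. The project's scripts are not reproduced here, so this is not their implementation: it uses `multiprocessing.pool.ThreadPool`, a thread-based pool with the same `map` and `map_async` interface as `multiprocessing.Pool`, because it runs directly inside a notebook. `fake_scrape`, `scrape_all`, `scrape_all_async` and the sample URLs are all hypothetical.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_scrape(link):
    """Hypothetical stand-in for get_articles: wait briefly, then return a record."""
    time.sleep(0.05)
    return {'Title': link.rsplit('/', 1)[-1], 'Article': [], 'topics': []}


def scrape_all(links, workers=4):
    """Blocking map: returns once every link is done, in the order of links."""
    with ThreadPool(workers) as pool:
        return pool.map(fake_scrape, links)


def scrape_all_async(links, workers=4):
    """Asynchronous map: map_async returns at once, get() waits for the results."""
    with ThreadPool(workers) as pool:
        pending = pool.map_async(fake_scrape, links)
        # Other work could be done here while the pool is busy
        return pending.get()


# Hypothetical links, used only to exercise the two functions
sample_links = ['https://example.invalid/report/{}'.format(i) for i in range(8)]

results = scrape_all(sample_links)
results_async = scrape_all_async(sample_links)
print(len(results), len(results_async))
```

Both calls return results in the order of the input links, which is what allows the links to be attached as a column afterwards. To use separate processes, `multiprocessing.Pool` can be substituted in a script whose entry point is guarded by `if __name__ == '__main__':`.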


On my own computer, which has 16 cores, the parallelised version is roughly 2.6 times faster than the un-parallelised one, which corresponds to a parallel efficiency of about 16% per core. This is a large improvement in speed over the non-parallelised version, and is similar to the asynchronous version. Since we will load the data frames from pickle files from now on, the recorded times are listed here for reference:
- 17:40.70 (Un-parallelised)
- 06:49.86 (Parallelised)
- (Parallelised Asynchronous)
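
The speed-up can be worked out directly from the two recorded times. The sketch below does only that arithmetic; `to_seconds` is a hypothetical helper, and the core count of 16 is taken from the description of the machine.

```python
def to_seconds(mm_ss):
    """Convert a 'MM:SS.ss' string into seconds."""
    minutes, seconds = mm_ss.split(':')
    return int(minutes) * 60 + float(seconds)


serial = to_seconds('17:40.70')      # un-parallelised run
parallel = to_seconds('06:49.86')    # parallelised run
cores = 16

speedup = serial / parallel          # how many times faster the parallel run is
efficiency = speedup / cores         # fraction of the ideal 16x speed-up achieved
time_saved = 1 - parallel / serial   # fraction of the wall-clock time removed

print('speed-up: {:.2f}x'.format(speedup))
print('efficiency per core: {:.1%}'.format(efficiency))
print('time saved: {:.1%}'.format(time_saved))
```

An efficiency well below 100% is expected here, since each worker spends much of its time starting a browser and waiting on the network.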

After reviewing all three data frames, we can see that we obtain the correct data in all three cases. The data is split into Titles, Articles and Topics for later use in LDA models.

### References

1. [Selenium](https://selenium-python.readthedocs.io/index.html)
2. [Getting Links](https://pythonspot.com/selenium-get-links/)
3. [Enumerate](https://realpython.com/python-enumerate/)
4. [Inspiration](https://www.youtube.com/watch?v=F1kZ39SvuGE)
5. [Parallelisation](https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop)
6. [Multi-Processing](https://medium.com/swlh/5-step-guide-to-parallel-processing-in-python-ac0ecdfcea09)
7. [Parallelisation of for loops](https://stackoverflow.com/questions/51325705/parallelize-for-loop-in-python-3)