# Scraping IOPC reports

*For a [story on police misconduct](https://www.bbc.co.uk/news/uk-59594712) I scraped reports on the website of the Independent Office for Police Conduct (IOPC). This notebook details part of the process.*

The IOPC [publishes investigations on its website](https://policeconduct.gov.uk/investigations/our-investigations):

> "For most of the cases we investigate, we publish anonymised summaries of our reports. These set out a summary of the circumstances that prompted the investigation, the evidence gathered and our conclusions. They also explain any outcomes for those involved – for instance, what happened if there was a misconduct hearing.
>
> "We remove news releases and investigation reports from our website six months after completing an investigation. Summaries remain on our site for five years. This ensures that we are complying with data protection legislation and with our publication policy."

## Import libraries

To scrape the reports and store the data we need to import a number of Python libraries:

In [None]:
#import the libraries
#requests to fetch the page
import requests
#beautiful soup to drill into it
from bs4 import BeautifulSoup
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd

## Generate a list of pages to scrape

The reports are linked from a series of pages: the first is https://policeconduct.gov.uk/investigations/our-investigations but going past the first page generates a URL with a page number like this: https://policeconduct.gov.uk/investigations/our-investigations?page=1

We need to generate a range of page numbers and loop through those, adding them to the basic URL to create all the pages we need to scrape to get the links to all the reports.

In [None]:
#This URL remains unchanged, only the number at the end changes
baseurl = "https://policeconduct.gov.uk/investigations/our-investigations?page="
#Create a range of numbers - the last one at the moment is page 48
pagerange = range(0,49)
#Check the last one is 48
print(pagerange[-1])
#create an empty list to store the URLs we are about to create
resultsurls = []
#loop through the numbers
for i in pagerange:
  #add to base URL and append to list
  resultsurls.append(baseurl+str(i)) #the number is converted to a string so it doesn't generate an error

#once the loop has finished, print the list to check
print(resultsurls)

48
['https://policeconduct.gov.uk/investigations/our-investigations?page=0', 'https://policeconduct.gov.uk/investigations/our-investigations?page=1', 'https://policeconduct.gov.uk/investigations/our-investigations?page=2', 'https://policeconduct.gov.uk/investigations/our-investigations?page=3', 'https://policeconduct.gov.uk/investigations/our-investigations?page=4', 'https://policeconduct.gov.uk/investigations/our-investigations?page=5', 'https://policeconduct.gov.uk/investigations/our-investigations?page=6', 'https://policeconduct.gov.uk/investigations/our-investigations?page=7', 'https://policeconduct.gov.uk/investigations/our-investigations?page=8', 'https://policeconduct.gov.uk/investigations/our-investigations?page=9', 'https://policeconduct.gov.uk/investigations/our-investigations?page=10', 'https://policeconduct.gov.uk/investigations/our-investigations?page=11', 'https://policeconduct.gov.uk/investigations/our-investigations?page=12', 'https://policeconduct.gov.uk/investigations
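
The same list can also be built in a single expression with a list comprehension. This is only a sketch of an alternative to the loop above: `lastpage` is a hypothetical variable name, and 48 is simply the last page number mentioned in the cell above.

```python
# Base URL copied from the cell above; only the page number changes
baseurl = "https://policeconduct.gov.uk/investigations/our-investigations?page="
# Hypothetical variable: the highest page number currently on the site
lastpage = 48
# Build every results-page URL in one expression (range stops before lastpage + 1)
resultsurls = [baseurl + str(i) for i in range(lastpage + 1)]
# Check how many URLs we made and what the last one looks like
print(len(resultsurls))
print(resultsurls[-1])
```

Keeping the last page number in one variable means there is only one place to update when the IOPC adds more pages.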

## Create a function to scrape one page

Now we need to scrape one of the pages. First we test some code:

In [None]:
testurl = "https://policeconduct.gov.uk/investigations/essex-police-officer-charged-computer-misuse-offence"
#Scrape the html at that url
page = requests.get(testurl)
# turn our HTML into a soup object
soup = BeautifulSoup(page.content, 'html.parser')
#target the contents of the html tags containing what we want
headings = soup.select('h1')
published = soup.select('div.author-block.border-top div p')
contents = soup.select('div.entity.entity-paragraphs-item.paragraphs-item-article-body div.content p')
tags = soup.select('div.related-topic.border-top a')

#Show how many matches we get for each
print(len(headings),len(published), len(contents), len(tags))
#There should only be one heading
print(headings[0].get_text())
#The datestamp is the second match, and needs stripping of carriage returns
print("published", published[1].get_text().strip())
#We can concatenate the content - starting with an empty string
content = ""
for i in contents:
  #store the paragraph text, adding a new line after each paragraph
  content = content+i.get_text()+"\n"
print("content",content.strip()) #strip out the extra new line
#create empty list to store tags
taglist = []
for i in tags:
  print(i.get_text().strip())
  #add to list
  taglist.append(i.get_text().strip())
print(taglist)
fulldata = {"heading" : headings[0].get_text(),
            "date" : published[1].get_text().strip(),
            "content" : content,
            "tags" : taglist}
print(fulldata)

1 2 3 2
Essex Police - officer charged with computer misuse offence
published 21 Jan 2021
content Read information about our investigation into allegations an Essex Police officer used the police computer system to access records he had no legitimate policing purpose for doing so.
Our investigation began in October 2019 and concluded in April 2020.  At the investigation’s conclusion, we referred a file of evidence to the Crown Prosecution Service (CPS), which made the decision to charge the officer.
Corruption and abuse of power
Essex Police
['Corruption and abuse of power', 'Essex Police']
{'heading': 'Essex Police - officer charged with computer misuse offence', 'date': '21 Jan 2021', 'content': 'Read information about our investigation into allegations an Essex Police officer used the police computer system to access records he had no legitimate policing purpose for doing so.\nOur investigation began in October 2019 and concluded in April 2020.\xa0 At the investigation’s conclusion,
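
Two details in that output may need tidying before analysis: the date is stored as text (`'21 Jan 2021'`) and the content contains a non-breaking space (`\xa0`). A minimal sketch of how both could be cleaned, assuming every date follows the day, short month, year pattern seen here. The function names `parse_published` and `tidy_text` are made up for this example and are not part of the scraper above.

```python
from datetime import datetime

def parse_published(datestring):
    """Convert a date like '21 Jan 2021' into a datetime.date object."""
    # %b matches an abbreviated month name; this assumes an English locale
    return datetime.strptime(datestring.strip(), "%d %b %Y").date()

def tidy_text(text):
    """Replace non-breaking spaces with ordinary ones and trim the ends."""
    return text.replace("\xa0", " ").strip()

print(parse_published("21 Jan 2021"))
print(tidy_text("concluded in April 2020.\xa0 At the investigation\n"))
```

Storing real dates rather than strings makes it possible to sort or filter the reports by publication date later on.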

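Then we store it in a function. Here we just return the results at the end to whatever calls the function. If anything goes wrong while scraping a page, the function returns a dictionary of placeholder values instead, so one bad link does not stop the whole scrape.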

In [None]:
def scrapereport(url):
  try:
    #Scrape the html at that url
    page = requests.get(url)
    # turn our HTML into a soup object
    soup = BeautifulSoup(page.content, 'html.parser')
    #target the contents of the html tags containing what we want
    headings = soup.select('h1')
    published = soup.select('div.author-block.border-top div p')
    contents = soup.select('div.entity.entity-paragraphs-item.paragraphs-item-article-body div.content p')
    tags = soup.select('div.related-topic.border-top a')

    #Show how many matches we get for each
    print(len(headings),len(published), len(contents), len(tags))
    #There should only be one heading
    print(headings[0].get_text())
    #The datestamp is the second match, and needs stripping of carriage returns
    print("published", published[1].get_text().strip())
    #We can concatenate the content - starting with an empty string
    content = ""
    for i in contents:
      #store the paragraph text, adding a new line after each paragraph
      content = content+i.get_text()+"\n"
    print("content",content.strip()) #strip out the extra new line
    #create empty list to store tags
    taglist = []
    for i in tags:
      print(i.get_text().strip())
      #add to list
      taglist.append(i.get_text().strip())
    print(taglist)
    fulldata = {"url": url,
                "heading" : headings[0].get_text(),
                "date" : published[1].get_text().strip(),
                "content" : content,
                "tags" : taglist}
    print(fulldata)
    #return that to whatever called the function
    return(fulldata)
  except Exception:
    #if anything fails (for example no heading is found), store placeholder values alongside the url
    fulldata = {"url": url,
                "heading" : "404 error",
                "date" : "404 error",
                "content" : "404 error",
                "tags" : ["404 error"]}
    #return that to whatever called the function
    return(fulldata)

## Testing the function

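Then we test the function. For this URL none of the selectors match anything (the output starts `0 0 0 0`), so `headings[0]` raises an error and the function returns the placeholder dictionary.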

In [None]:
testdict = scrapereport("https://policeconduct.gov.uk/investigations/pc-david-owen-dismissed-gross-misconduct-west-midlands-police")
print(testdict)

0 0 0 0
{'url': 'https://policeconduct.gov.uk/investigations/pc-david-owen-dismissed-gross-misconduct-west-midlands-police', 'heading': '404 error', 'date': '404 error', 'content': '404 error', 'tags': ['404 error']}
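
Note that the function labels every failure "404 error", whatever the cause. `requests.get()` does not raise an exception for a 404 response by itself; the status is available as `page.status_code`. The sketch below shows one way the two situations could be told apart. `label_failure` is a hypothetical helper, not part of the scraper, and it takes plain numbers so it can be tried without fetching anything.

```python
def label_failure(status_code, heading_count):
    """Return a short label describing the outcome of scraping one report page."""
    # Anything other than 200 means the server did not return the page normally
    if status_code != 200:
        return "HTTP " + str(status_code)
    # The page loaded but our selector for the heading matched nothing
    if heading_count == 0:
        return "no heading found"
    return "ok"

print(label_failure(404, 0))
print(label_failure(200, 0))
print(label_failure(200, 1))
```

Inside `scrapereport()` the two arguments would be `page.status_code` and `len(headings)`.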


## Loop through the results pages

Now we can apply that function as we loop through results pages.

In [None]:
#This URL remains unchanged, only the number at the end changes
baseurl = "https://policeconduct.gov.uk/investigations/our-investigations?page="
#Create a range of numbers - the last one at the moment is page 46
pagerange = range(0,47)
#Loop through the results pages - limited to the first one while we test
for i in pagerange[:1]:
  #we add the page number, converting it to a string because we're making a string
  pageurl = baseurl+str(i)
  #Scrape the html at that url
  print("scraping resultspage", pageurl)
  page = requests.get(pageurl)
  # turn our HTML into a soup object
  soup = BeautifulSoup(page.content, 'html.parser')
  #The links are all in <span> and then <a
  #This targets the contents of those html tags
  links = soup.select('span.field-content a')
  #the results are always a list so we have to loop through it using a 'for' loop
  #we only take the last two links while we test
  for link in links[-2:]:
    #grab the href attribute (the link) and add it to the base url
    linkurl = "https://policeconduct.gov.uk"+link['href']
    #keep us updated...
    print("scraping", linkurl)
    #run the scraping function on that link, adding the base URL
    reportresults = scrapereport(linkurl)
    print(reportresults)


scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=0
scraping https://policeconduct.gov.uk/investigations/detective-guilty-forging-murder-trial-witness-statement-hampshire-constabulary
1 2 3 2
Detective guilty of forging murder trial witness statement - Hampshire Constabulary
published 25 Jan 2022
content Read information about our investigation into DC Robert Ferrow, based at Portsmouth police station, who was charged with making a false instrument with intent for it to be accepted as genuine, under the Forgery and Counterfeiting Act 1981. 
Our investigation began in August 2019 and concluded in June 2020. At the investigation’s conclusion, we referred a file of evidence to the Crown Prosecution Service (CPS), which made the decision to charge the officer. A special case misconduct hearing was held on 25 October 2021, where the Hampshire Constabulary Police Chief Constable ruled former DC Ferrow would have been dismissed if still a serving officer
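
The loop above builds each full URL by gluing the domain onto the `href`. A slightly more forgiving option, sketched here as an alternative rather than what the notebook does, is `urljoin()` from Python's standard library: it handles an `href` that is a relative path and leaves one that is already a full URL unchanged. The variable names are chosen for this example only.

```python
from urllib.parse import urljoin

# The site's domain, as used in the loop above
siteurl = "https://policeconduct.gov.uk"
# A relative href like the ones found on the results page
relative = "/investigations/detective-guilty-forging-murder-trial-witness-statement-hampshire-constabulary"
# The same link written as a full URL
absolute = siteurl + relative

# Both calls produce the same full URL
print(urljoin(siteurl, relative))
print(urljoin(siteurl, absolute))
```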

Let's adapt that to just store the links:

In [None]:
#Create an empty list to store all the links
alllinks = []
#Loop through
for i in pagerange:
  #we add the page number, converting it to a string because we're making a string
  pageurl = baseurl+str(i)
  #Scrape the html at that url
  print("scraping resultspage", pageurl)
  page = requests.get(pageurl)
  # turn our HTML into a soup object
  soup = BeautifulSoup(page.content, 'html.parser')
  #The links are all in <span> and then <a
  #This targets the contents of those html tags
  links = soup.select('span.field-content a')
  #add to our ongoing list
  alllinks.extend(links)
print(len(alllinks))
print(alllinks)
#print first one href attrib
print(alllinks[0]['href'])

scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=0
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=1
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=2
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=3
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=4
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=5
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=6
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=7
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=8
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=9
scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=1
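
If the same report is linked more than once, it will be scraped more than once. One way to avoid that is to remove repeated links before scraping, keeping them in their original order. This is a sketch using a short list of `href` strings for illustration; in the notebook they would come from something like `[link['href'] for link in alllinks]`.

```python
# Illustrative list of hrefs with one repeat
hrefs = [
    "/investigations/yasmin-jasiak-northumbria",
    "/investigations/detective-guilty-forging-murder-trial-witness-statement-hampshire-constabulary",
    "/investigations/yasmin-jasiak-northumbria",
]
# Dictionary keys are unique and keep insertion order, so this drops repeats
uniquehrefs = list(dict.fromkeys(hrefs))
print(len(hrefs), len(uniquehrefs))
print(uniquehrefs)
```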

## Creating a dataframe to store those results

Now we run it in full, this time creating a dataframe to store the results.

In [None]:
#Create a dataframe to store the data we are about to scrape
#It has to match the structure of the data we're fetching
#We call this dataframe 'df'
df = pd.DataFrame(columns=["url","heading","date","content","tags"])

#Loop through - limit to first few links here
for linkurl in alllinks[:5]:
  linkurl = linkurl['href']
  #the href is a relative path starting with /investigations, so we add the domain to make a full URL
  linkurl = linkurl.replace("/investigations","https://policeconduct.gov.uk/investigations")
  print(linkurl)
  reportresults = scrapereport(linkurl)
  #print(reportresults)
  #add to our dataframe - DataFrame.append() was removed in pandas 2.0, so we use pd.concat()
  df = pd.concat(
    [df, pd.DataFrame([reportresults])],
    ignore_index=True
    )


https://policeconduct.gov.uk/investigations/yasmin-jasiak-northumbria
1 2 1 1
Yasmin Jasiak - Northumbria
published 15 Feb 2022
content Read information about our five-month investigation into into the death of a woman who died in her home following police contact by Northumbria Police officers. We concluded our investigation in July 2021 and found that officers acted in line with procedures and did all they could to help save her life. We shared our findings with Ms Jasiak's family, the force and the Coroner. An inquest held ended on Tuesday 8 February 2022 with the Coronor recording a narrative conclusion of cardiac arrest associated with hanging, due to post traumatic stress disorder and intoxication.
Northumbria Police
['Northumbria Police']
{'url': 'https://policeconduct.gov.uk/investigations/yasmin-jasiak-northumbria', 'heading': 'Yasmin Jasiak - Northumbria', 'date': '15 Feb 2022', 'content': "Read information about our five-month investigation into into the death of a woman who

In [None]:
print(df)

                                                 url  ...                                               tags
0  https://policeconduct.gov.uk/investigations/ya...  ...                               [Northumbria Police]
1  https://policeconduct.gov.uk/investigations/al...  ...  [Corruption and abuse of power, West Mercia Po...
2  https://policeconduct.gov.uk/investigations/al...  ...  [Corruption and abuse of power, West Mercia Po...
3  https://policeconduct.gov.uk/investigations/us...  ...                        [Greater Manchester Police]
4  https://policeconduct.gov.uk/investigations/de...  ...  [Corruption and abuse of power, Hampshire Cons...

[5 rows x 5 columns]


## Export data (and remove duplicates)

We use `drop_duplicates()` to remove entries with the same URL, then export what we have as a CSV file.

In [None]:
#remove duplicates based on the url column
df = df.drop_duplicates(subset="url")
#And we can export it
df.to_csv("scrapeddata.csv")
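
One thing to be aware of when reading the CSV back in later: a CSV file has no list type, so each cell in the tags column is written out as text that merely looks like a list. A minimal sketch of turning that text back into a real list with `ast.literal_eval()` from the standard library. The sample value is illustrative.

```python
import ast

# This is how a list of tags ends up looking once written to a CSV cell
stored = str(["Corruption and abuse of power", "Essex Police"])
print(type(stored), stored)

# literal_eval safely parses the text back into a Python list
restored = ast.literal_eval(stored)
print(type(restored), restored)
```

With pandas this can be applied while loading, for example by passing `converters={"tags": ast.literal_eval}` to `pd.read_csv()`.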


## Doing some analysis

It looks like we've stored our tags as a column of lists. But have we?

In [None]:
for i, l in enumerate(df["tags"][:10]):
    print("list",i,"is",type(l))

list 0 is <class 'list'>
list 1 is <class 'list'>
list 2 is <class 'list'>
list 3 is <class 'list'>
list 4 is <class 'list'>


I've borrowed a function from [this post](https://towardsdatascience.com/dealing-with-list-values-in-pandas-dataframes-a177e534f173) to count frequency:

In [None]:
def to_1D(series):
    #flatten a series of lists into one long series of individual values
    return pd.Series([x for _list in series for x in _list])

In [None]:
tagcounts = to_1D(df["tags"]).value_counts()
print(tagcounts)
tagcounts.to_csv("tagcounts.csv")

Corruption and abuse of power    3
West Mercia Police               2
Northumbria Police               1
Greater Manchester Police        1
Hampshire Constabulary           1
dtype: int64
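
For comparison, the same kind of count can be done without pandas, using `Counter` from the standard library. This is a sketch on a small made-up list of tag lists, not the scraped data, so the numbers differ from the output above.

```python
from collections import Counter
from itertools import chain

# Illustrative tag lists, one per report
taglists = [
    ["Corruption and abuse of power", "Essex Police"],
    ["Northumbria Police"],
    ["Corruption and abuse of power", "Hampshire Constabulary"],
]
# chain.from_iterable flattens the lists; Counter tallies each tag
tagcounts = Counter(chain.from_iterable(taglists))
# most_common() returns the tags ordered from most to least frequent
for tag, count in tagcounts.most_common():
    print(tag, count)
```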
