# Scraping IOPC reports

The Independent Office for Police Conduct [publishes investigations on its website](https://policeconduct.gov.uk/investigations/our-investigations):

> "For most of the cases we investigate, we publish anonymised summaries of our reports. These set out a summary of the circumstances that prompted the investigation, the evidence gathered and our conclusions. They also explain any outcomes for those involved – for instance, what happened if there was a misconduct hearing.
>
> "We remove news releases and investigation reports from our website six months after completing an investigation. Summaries remain on our site for five years. This ensures that we are complying with data protection legislation and with our publication policy."

## Import libraries

To scrape the reports and store the data we need to import a number of Python libraries:

In [None]:
#install the libraries 
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#We can also use requests instead
import requests
#lxml.html is used to parse the HTML into a structured tree that we can search
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd


## Generate a list of pages to scrape

The reports are linked from a series of pages. The first is https://policeconduct.gov.uk/investigations/our-investigations, but going past the first page generates a URL with a page number, like this: https://policeconduct.gov.uk/investigations/our-investigations?page=1. Page numbers start at 0, so `?page=0` is the first page.

We need to generate a range of page numbers and loop through those, adding them to the basic URL to create all the pages we need to scrape to get the links to all the reports.

In [None]:
#This URL remains unchanged, only the number at the end changes
baseurl = "https://policeconduct.gov.uk/investigations/our-investigations?page="
#Create a range of numbers - the last one at the moment is page 44
pagerange = range(0,45)
#Check the last one is 44
print(pagerange[-1])

44
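
As a quick check before any scraping happens, the same range can be expanded into the full list of results-page URLs. This is only a sketch: `pageurls` is a name introduced here for illustration, and the count of 45 pages is the figure noted above, which will change as the site adds reports.

```python
# The base URL and page range are the ones defined above
baseurl = "https://policeconduct.gov.uk/investigations/our-investigations?page="
pagerange = range(0, 45)

# 'pageurls' is a hypothetical name: one URL per results page
pageurls = [baseurl + str(i) for i in pagerange]

# How many pages, and what do the first and last look like?
print(len(pageurls))
print(pageurls[0])
print(pageurls[-1])
```

Printing the first and last entries makes it easy to spot an off-by-one error in the range.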


## Create a function to scrape one page

Now we need to scrape one of the pages. First we test some code:

In [None]:
testurl = "https://policeconduct.gov.uk/investigations/essex-police-officer-charged-computer-misuse-offence"
#Scrape the html at that url
html = scraperwiki.scrape(testurl)
# turn our HTML into an lxml object
root = lxml.html.fromstring(html) 
#target the contents of the html tags containing what we want
headings = root.cssselect('h1')
published = root.cssselect('div.author-block.border-top div p')
contents = root.cssselect('div.entity.entity-paragraphs-item.paragraphs-item-article-body div.content p')
tags = root.cssselect('div.related-topic.border-top a')

#Show how many matches we get for each
print(len(headings),len(published), len(contents), len(tags))
#There should only be one heading
print(headings[0].text_content())
#The datestamp is the second match, and needs stripping of carriage returns
print("published", published[1].text_content().strip())
#We can concatenate the content - starting with an empty string
content = ""
for i in contents:
  #store the paragraph text, adding a new line after each paragraph
  content = content+i.text_content()+"\n"
print("content",content.strip()) #strip out the extra new line
#create empty list to store tags
taglist = []
for i in tags:
  print(i.text_content().strip())
  #add to list
  taglist.append(i.text_content().strip())
print(taglist)
fulldata = {"heading" : headings[0].text_content(), 
            "date" : published[1].text_content().strip(),
            "content" : content,
            "tags" : taglist}
print(fulldata)

1 2 3 2
Essex Police - officer charged with computer misuse offence
published 21 Jan 2021
content Read information about our investigation into allegations an Essex Police officer used the police computer system to access records he had no legitimate policing purpose for doing so.
Our investigation began in October 2019 and concluded in April 2020.  At the investigation’s conclusion, we referred a file of evidence to the Crown Prosecution Service (CPS), which made the decision to charge the officer.
Corruption and abuse of power
Essex Police
['Corruption and abuse of power', 'Essex Police']
{'heading': 'Essex Police - officer charged with computer misuse offence', 'date': '21 Jan 2021', 'content': 'Read information about our investigation into allegations an Essex Police officer used the police computer system to access records he had no legitimate policing purpose for doing so.\nOur investigation began in October 2019 and concluded in April 2020.\xa0 At the investigation’s conclusion,

Then we store it in a function. Here we just return the results at the end to whatever calls the function.

In [None]:
def scrapereport(url):
  #Scrape the html at that url
  try:
    html = scraperwiki.scrape(url)
    # turn our HTML into an lxml object
    root = lxml.html.fromstring(html) 
    #target the contents of the html tags containing what we want
    headings = root.cssselect('h1')
    published = root.cssselect('div.author-block.border-top div p')
    contents = root.cssselect('div.entity.entity-paragraphs-item.paragraphs-item-article-body div.content p')
    tags = root.cssselect('div.related-topic.border-top a')

    #Show how many matches we get for each
    #print(len(headings),len(published), len(contents), len(tags))
    #There should only be one heading
    #print(headings[0].text_content())
    #The datestamp is the second match, and needs stripping of carriage returns
    #print("published", published[1].text_content().strip())
    #We can concatenate the content - starting with an empty string
    content = ""
    for i in contents:
      #store the paragraph text, adding a new line after each paragraph
      content = content+i.text_content()+"\n"
    #print("content",content.strip()) #strip out the extra new line
    #create empty list to store tags
    taglist = []
    #loop through tag matches, stripping them of new lines
    for i in tags:
      #print(i.text_content().strip())
      #add to list
      taglist.append(i.text_content().strip())
    #print(taglist)
    #create a dictionary holding all the data, including the url
    fulldata = {"url": url,
                "heading" : headings[0].text_content(), 
                "date" : published[1].text_content().strip(),
                "content" : content,
                "tags" : taglist}
    #return that to whatever called the function
    return(fulldata)
  except Exception:
    #any failure ends up here (not only a missing page), but we label it "404 error"
    #create a dictionary holding all the data, including the url
    fulldata = {"url": url,
                "heading" : "404 error", 
                "date" : "404 error",
                "content" : "404 error",
                "tags" : ["404 error"]}
    #return that to whatever called the function
    return(fulldata)
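
The function stores the publication date as text such as '21 Jan 2021'. If the dates are going to be sorted or filtered later, one option is to parse them with the standard library. The sketch below assumes every date follows the day, abbreviated month, year pattern seen in the test output, and that month names are in English. `parsedate` is a hypothetical helper, not part of the scraper above. It returns `None` for anything it cannot parse, including the "404 error" placeholder.

```python
from datetime import datetime

def parsedate(datetext):
  #hypothetical helper: turn text like '21 Jan 2021' into a date object
  try:
    return datetime.strptime(datetext.strip(), "%d %b %Y").date()
  except ValueError:
    #not in the expected format (for example the '404 error' placeholder)
    return None

print(parsedate("21 Jan 2021"))
print(parsedate("404 error"))
```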

## Testing the function

Then we test the function on a different report:

In [None]:
testdict = scrapereport("https://policeconduct.gov.uk/investigations/pc-david-owen-dismissed-gross-misconduct-west-midlands-police")
print(testdict)

1 2 1 1
PC David Owen dismissed for gross misconduct - West Midlands Police
published 15 Jan 2021
content Read information about our investigation into allegations that West Midlands Police Constable David Owen had formed an inappropriate relationship with a vulnerable woman he met during the course of his duties. Our investigation began following a referral from the force in February 2019 and was completed in 11 months. At a gross misconduct hearing which concluded on 15 January 2021 PC Owen was dismissed without notice.
West Midlands Police
['West Midlands Police']
{'url': 'https://policeconduct.gov.uk/investigations/pc-david-owen-dismissed-gross-misconduct-west-midlands-police', 'heading': 'PC David Owen dismissed for gross misconduct - West Midlands Police', 'date': '15 Jan 2021', 'content': 'Read information about our investigation into allegations that West Midlands Police Constable David Owen had formed an inappropriate relationship with a vulnerable woman he met during the cour

## Loop through the results pages

Now we can apply that function as we loop through results pages.

In [None]:
#This URL remains unchanged, only the number at the end changes
baseurl = "https://policeconduct.gov.uk/investigations/our-investigations?page="
#Create a range of numbers - the last one at the moment is page 44
pagerange = range(0,45)
#Loop through
for i in pagerange[:1]:
  #we add the page number, converting it to a string because we're making a string
  pageurl = baseurl+str(i)
  #Scrape the html at that url
  print("scraping resultspage", pageurl)
  html = scraperwiki.scrape(pageurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html) 
  #The links are all in <span> and then <a 
  #This targets the contents of those html tags
  links = root.cssselect('span.field-content a')
  #the results are always a list so we have to loop through it using a 'for' loop
  #(we call the loop variable 'link' so it doesn't reuse the page number variable 'i')
  for link in links[-2:]:
    #grab the href attribute (the link) and add it to the base url
    linkurl = "https://policeconduct.gov.uk"+link.attrib['href']
    #keep us updated...
    print("scraping", linkurl)
    #run the scraping function on that link, adding the base URL
    reportresults = scrapereport(linkurl)
    print(reportresults)


scraping resultspage https://policeconduct.gov.uk/investigations/our-investigations?page=0
scraping https://policeconduct.gov.uk/investigations/essex-police-officer-charged-computer-misuse-offence
{'url': 'https://policeconduct.gov.uk/investigations/essex-police-officer-charged-computer-misuse-offence', 'heading': 'Essex Police - officer charged with computer misuse offence', 'date': '21 Jan 2021', 'content': 'Read information about our investigation into allegations an Essex Police officer used the police computer system to access records he had no legitimate policing purpose for doing so.\nOur investigation began in October 2019 and concluded in April 2020.\xa0 At the investigation’s conclusion, we referred a file of evidence to the Crown Prosecution Service (CPS), which made the decision to charge the officer.\n\xa0\n', 'tags': ['Corruption and abuse of power', 'Essex Police']}
scraping https://policeconduct.gov.uk/investigations/pc-david-owen-dismissed-gross-misconduct-west-midland

## Links not being picked up

There *was* a problem here: the links being picked up were not the ones we wanted. This was because I forgot to change one variable in the code, which meant it was still looking at `testurl`.

As a result, we went down this path for a while...

## Import CSV file of links

Instead, we scraped those links separately in OutWit Hub and imported them here.

In [None]:
linksdf = pd.read_csv("iopclinks.csv")
print(linksdf)

                                               cleaned
0    https://policeconduct.gov.uk/investigations/ab...
1    https://policeconduct.gov.uk/investigations/ab...
2    https://policeconduct.gov.uk/investigations/ab...
3    https://policeconduct.gov.uk/investigations/ac...
4    https://policeconduct.gov.uk/investigations/ac...
..                                                 ...
275  https://policeconduct.gov.uk/investigations/wi...
276  https://policeconduct.gov.uk/investigations/wi...
277  https://policeconduct.gov.uk/investigations/wi...
278  https://policeconduct.gov.uk/investigations/wo...
279  https://policeconduct.gov.uk/investigations/wo...

[280 rows x 1 columns]


In [None]:
for i in linksdf['cleaned'][:5]:
  print(i)

https://policeconduct.gov.uk/investigations/abuse-position-call-handler-west-midlands-police
https://policeconduct.gov.uk/investigations/abuse-position-police-constable-tameside-greater-manchester-police
https://policeconduct.gov.uk/investigations/abuse-position-sexual-purpose-detective-constable-bedfordshire-police
https://policeconduct.gov.uk/investigations/accrington-arrest-lancashire-constabulary
https://policeconduct.gov.uk/investigations/accrington-incident-lancashire-constabulary


We have a problem with 'investigations/' being repeated in some URLs, so the loop below replaces the doubled segment before scraping.
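
One way to guard against malformed URLs of this kind is to build each address with `urllib.parse.urljoin` rather than by joining strings, and to collapse the doubled path segment in one place. This is a sketch only: `cleanlink` and `SITE` are hypothetical names, and the doubled 'investigations/investigations' pattern is assumed from the replacement used in the loop that follows.

```python
from urllib.parse import urljoin

SITE = "https://policeconduct.gov.uk"

def cleanlink(href):
  #hypothetical helper: resolve a relative or absolute link against the site root
  fullurl = urljoin(SITE, href)
  #collapse the repeated path segment if it is present
  return fullurl.replace("investigations/investigations", "investigations")

#a relative link, as found in an href attribute
print(cleanlink("/investigations/accrington-arrest-lancashire-constabulary"))
#an absolute link with the segment doubled
print(cleanlink("https://policeconduct.gov.uk/investigations/investigations/accrington-arrest-lancashire-constabulary"))
```

Both calls print the same cleaned address, so the helper can be applied to links from either source.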

## Creating a dataframe to store those results

Now we run it in full, this time creating a dataframe to store the results.

In [None]:
#Create a dataframe to store the data we are about to scrape
#It has to match the structure of the data we're fetching
#We call this dataframe 'df'
df = pd.DataFrame(columns=["url","heading","date","content","tags"])

#Loop through
for linkurl in linksdf['cleaned']:
  linkurl = linkurl.replace("investigations/investigations","investigations")
  print(linkurl)
  reportresults = scrapereport(linkurl)
  #print(reportresults)
  #append to our dataframe (df.append() was removed in pandas 2.0, so we use pd.concat)
  df = pd.concat(
    [df, pd.DataFrame([reportresults])],
    ignore_index=True
    )


https://policeconduct.gov.uk/investigations/abuse-position-call-handler-west-midlands-police
https://policeconduct.gov.uk/investigations/abuse-position-police-constable-tameside-greater-manchester-police
https://policeconduct.gov.uk/investigations/abuse-position-sexual-purpose-detective-constable-bedfordshire-police
https://policeconduct.gov.uk/investigations/accrington-arrest-lancashire-constabulary
https://policeconduct.gov.uk/investigations/accrington-incident-lancashire-constabulary
https://policeconduct.gov.uk/investigations/allegation-fraud-during-scene-guard-duty-metropolitan-police-service
https://policeconduct.gov.uk/investigations/allegations-abuse-position-police-constable-devon-cornwall
https://policeconduct.gov.uk/investigations/allegations-abuse-position-police-officer-devon-and-cornwall
https://policeconduct.gov.uk/investigations/allegations-accessing-confidential-information-city-london
https://policeconduct.gov.uk/investigations/allegations-computer-misuse-dyfed-powys-
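
Growing a dataframe one row at a time copies the whole table on every pass. An alternative pattern is to collect the dictionaries in a plain list and build the dataframe once at the end. The sketch below uses a stand-in function, `fakescrape`, in place of `scrapereport` so that it runs without touching the website. The two URLs are taken from the output above, and the other values are placeholders.

```python
import pandas as pd

def fakescrape(url):
  #stand-in for scrapereport(): returns a dictionary with the same keys
  return {"url": url, "heading": "placeholder", "date": "placeholder",
          "content": "placeholder", "tags": ["placeholder tag"]}

urls = ["https://policeconduct.gov.uk/investigations/accrington-arrest-lancashire-constabulary",
        "https://policeconduct.gov.uk/investigations/accrington-incident-lancashire-constabulary"]

#collect one dictionary per report in an ordinary list
rows = []
for linkurl in urls:
  rows.append(fakescrape(linkurl))

#build the dataframe in one step, fixing the column order
rowsdf = pd.DataFrame(rows, columns=["url","heading","date","content","tags"])
print(rowsdf.shape)
```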

In [None]:
print(df)

                                                   url  ...                                               tags
0    https://policeconduct.gov.uk/investigations/ab...  ...                             [West Midlands Police]
1    https://policeconduct.gov.uk/investigations/ab...  ...  [Corruption and abuse of power, Greater Manche...
2    https://policeconduct.gov.uk/investigations/ab...  ...  [Corruption and abuse of power, Bedfordshire P...
3    https://policeconduct.gov.uk/investigations/ac...  ...  [Use of force and armed policing, Lancashire C...
4    https://policeconduct.gov.uk/investigations/ac...  ...                    [Corruption and abuse of power]
..                                                 ...  ...                                                ...
275  https://policeconduct.gov.uk/investigations/wi...  ...  [Death and serious injury, Use of force and ar...
276  https://policeconduct.gov.uk/investigations/wi...  ...  [Welfare and vulnerable people, Cambridgeshire...
2

## Export data (and remove duplicates)

We use `drop_duplicates()` to remove entries with the same URL.

Export what we have...

In [None]:
#remove duplicates based on the url column
df = df.drop_duplicates(subset="url")
#And we can export it
df.to_csv("scrapeddata.csv")


## Doing some analysis

It looks like we've stored our tags as a column of lists. But have we?

In [None]:
for i, l in enumerate(df["tags"][:10]):
    print("list",i,"is",type(l))

list 0 is <class 'list'>
list 1 is <class 'list'>
list 2 is <class 'list'>
list 3 is <class 'list'>
list 4 is <class 'list'>
list 5 is <class 'list'>
list 6 is <class 'list'>
list 7 is <class 'list'>
list 8 is <class 'list'>
list 9 is <class 'list'>
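
That check is run on the dataframe still in memory. If the CSV exported earlier is loaded back in, the picture changes: CSV has no list type, so each list is written out as its text representation and comes back as a string. The sketch below demonstrates this with a small made-up dataframe and an in-memory buffer rather than the real file, and uses `ast.literal_eval` to turn the strings back into lists. It assumes the tags contain nothing that would stop them being read as Python literals.

```python
import ast
import io
import pandas as pd

#a tiny example dataframe with a column of lists (the urls are placeholders)
example = pd.DataFrame({"url": ["a", "b"],
                        "tags": [["Essex Police"], ["Corruption and abuse of power", "Essex Police"]]})

#write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
example.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
#after the round trip each cell holds text, not a list
print(type(reloaded["tags"][0]))

#convert the text back into real lists
reloaded["tags"] = reloaded["tags"].apply(ast.literal_eval)
print(type(reloaded["tags"][0]))
```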


I've borrowed a function from [this post](https://towardsdatascience.com/dealing-with-list-values-in-pandas-dataframes-a177e534f173) to count frequency:

In [None]:
def to_1D(series):
 return pd.Series([x for _list in series for x in _list])

In [None]:
tagcounts = to_1D(df["tags"]).value_counts()
tagcounts
tagcounts.to_csv("tagcounts.csv")
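
To see what `to_1D` and `value_counts()` are doing, the same count can be reproduced with the standard library: flatten the lists into one sequence of tags, then count each one. This sketch uses three hand-written tag lists based on the output above rather than the scraped data.

```python
from collections import Counter

#three example tag lists, in the same shape as the 'tags' column
taglists = [["Corruption and abuse of power", "Essex Police"],
            ["West Midlands Police"],
            ["Corruption and abuse of power"]]

#flatten the list of lists, then count how often each tag appears
flattened = [tag for taglist in taglists for tag in taglist]
counts = Counter(flattened)
print(counts.most_common())
```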

## Repeat process with 'recommendations' section

There is [another section on the website with investigation summaries and learning recommendations](https://policeconduct.gov.uk/investigations/investigation-summaries-and-learning-recommendations). These are often more detailed and perhaps more structured, too.

In [None]:
#docs for requests at https://requests.readthedocs.io/en/master/
html = requests.get('https://policeconduct.gov.uk/investigations/investigation-summaries-and-learning-recommendations?page=0')
root = lxml.html.fromstring(html.content)
somelinks = root.cssselect('span a')
for i in somelinks:
  print(i.text_content())
  print(i.attrib['href'])


Road traffic incident following pursuit - West Yorkshire Police, July 2020
/recommendations/road-traffic-incident-following-pursuit-west-yorkshire-police-july-2020
Response to a missing person report - Gwent Police, August 2019
/recommendations/response-missing-person-report-gwent-police-august-2019
Response to calls expressing concern for a woman’s welfare - Leicestershire Police, October 2018 
/recommendations/response-calls-expressing-concern-woman%E2%80%99s-welfare-leicestershire-police-october-2018
Inappropriate relationship with person encountered as part of policing role and duties - Suffolk Constabulary, March 2019
/recommendations/inappropriate-relationship-person-encountered-part-policing-role-and-duties-suffolk
Response to a missing person report - Humberside Police, November 2019
/recommendations/response-missing-person-report-humberside-police-november-2019
Content found on mobile phone - Metropolitan Police Service, January 2018
/recommendations/content-found-mobile-phone

In [None]:
print(html.content)



In [None]:
#Create a dataframe to store the data we are about to scrape
#It has to match the structure of the data we're fetching
#We call this dataframe 'summariesdf'
summariesdf = pd.DataFrame(columns=["url","heading","date","content","tags"])

#This URL remains unchanged, only the number at the end changes
baseurl = "https://policeconduct.gov.uk/investigations/investigation-summaries-and-learning-recommendations?page="
#Create a range of numbers - the last one at the moment is page 213
pagerange = range(0,214)
#Loop through
for i in pagerange:
  #we add the page number, converting it to a string because we're making a string
  pageurl = baseurl+str(i)
  #Scrape the html at that url
  print("scraping resultspage", pageurl)
  html = requests.get(pageurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html.content) 
  #The links are all in <span> and then <a 
  #This targets the contents of those html tags
  links = root.cssselect('span a')
  #the results are always a list so we have to loop through it using a 'for' loop
  #(we call the loop variable 'link' so it doesn't reuse the page number variable 'i')
  for link in links:
    print(link.text_content())
    #grab the href attribute (the link) and add it to the base url
    linkurl = "https://policeconduct.gov.uk"+link.attrib['href']
    #keep us updated...
    print("scraping", linkurl)
    #run the scraping function on that link
    reportresults = scrapereport(linkurl)
    print(reportresults)
    #append to our dataframe (df.append() was removed in pandas 2.0, so we use pd.concat)
    summariesdf = pd.concat(
      [summariesdf, pd.DataFrame([reportresults])],
      ignore_index=True
      )
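
This loop makes a request for each of 214 results pages plus one for every report linked from them, so it may be worth pausing between requests. The sketch below is one way to do that. It is an illustration, not the code used here: `fetchall` is a hypothetical helper, the one-second default delay is an arbitrary choice, and the `fetch` argument stands for whatever function does the downloading (for example a wrapper around `requests.get`).

```python
import time

def fetchall(urls, fetch, delay=1.0):
  #hypothetical helper: call fetch(url) for each url, pausing between requests
  results = {}
  for position, url in enumerate(urls):
    if position > 0:
      #wait before every request after the first
      time.sleep(delay)
    results[url] = fetch(url)
  return results

#usage with a stand-in fetch function so the example runs offline
pages = fetchall(["page0", "page1"], fetch=lambda url: "html for " + url, delay=0)
print(pages)
```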

  

In [None]:

#And we can export it
summariesdf.to_csv("scrapedsummaries.csv")


## Analyse frequency of tags

In [None]:
summariestagcounts = to_1D(summariesdf["tags"]).value_counts()
summariestagcounts
summariestagcounts.to_csv("summariestagcounts.csv")

## Expand scraper to grab recommendations and other details

The recommendations pages include more detail than the initial reports, so we need to expand the function.

In [None]:
def scrapereport(url):
  #Scrape the html at that url
  try:
    html = scraperwiki.scrape(url)
    # turn our HTML into an lxml object
    root = lxml.html.fromstring(html) 
    #target the contents of the html tags containing what we want
    headings = root.cssselect('h1')
    published = root.cssselect('div.author-block.border-top div p')
    contents = root.cssselect('div.entity.entity-paragraphs-item.paragraphs-item-article-body div.content p')
    tags = root.cssselect('div.related-topic.border-top a')
    #grab the recommendations and the heading to that as a check
    recs = root.cssselect('div.paragraphs-items.paragraphs-items-field-para-recommendations.paragraphs-items-field-para-recommendations-full.paragraphs-items-full div.accordion__head')
    recsheadings = root.cssselect('div.paragraphs-items.paragraphs-items-field-para-recommendations.paragraphs-items-field-para-recommendations-full.paragraphs-items-full div.field-label')
    if len(recsheadings) !=0 :
      #print("IF!")
      recsheading = recsheadings[0].text_content()
    else:
      #print("ELSE!")
      recsheading = ""
    if len(recs) != 0:
      recommendation = recs[0].text_content()
    else:
      #print("ELSE!")
      recommendation = ""
    #grab any document links
    reclinks = root.cssselect('div.paragraphs-items.paragraphs-items-field-para-recommendations.paragraphs-items-field-para-recommendations-full.paragraphs-items-full div.accordion__head a')
    if (len(reclinks) != 0):
      reclink = reclinks[0].attrib['href']
    else:
      reclink = ""
    #we want the dates of the recommendation and that a response is due
    dates = root.cssselect('span.date-display-single')
    fieldlabels = root.cssselect('div.content.clearfix div.field-label')
    #print(len(fieldlabels))
    #for i in fieldlabels:
     # print(i.text_content())
    accepteds = root.cssselect('div.paragraphs-items.paragraphs-items-field-para-recommendations.paragraphs-items-field-para-recommendations-full.paragraphs-items-full div.accordion__body p')
    if len(accepteds)>0:
      #print("HELLO",accepteds[0].text_content())
      accepted = accepteds[0].text_content()
    else:
      accepted = ""
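    #print(len(dates))
    #the first date is the recommendation date, the second is when a response is due
    #check for each separately so a page with only one date doesn't raise an IndexError
    if len(dates) > 0:
      dateofrecommendation = dates[0].text_content()
      dateofrecommendation_stamp = dates[0].attrib['content']
    else:
      dateofrecommendation = ""
      dateofrecommendation_stamp = ""
    if len(dates) > 1:
      dateresponsedue = dates[1].text_content()
      dateresponsedue_stamp = dates[1].attrib['content']
    else:
      dateresponsedue = ""
      dateresponsedue_stamp = ""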
    #There should only be one heading
    print(headings[0].text_content())
    #The datestamp is the second match, and needs stripping of carriage returns
    #print("published", published[1].text_content().strip())
    #We can concatenate the content - starting with an empty string
    content = ""
    for i in contents:
      #store the paragraph text, adding a new line after each paragraph
      content = content+i.text_content()+"\n"
    #print("content",content.strip()) #strip out the extra new line
    #create empty list to store tags
    taglist = []
    #loop through tag matches, stripping them of new lines
    for i in tags:
      #print(i.text_content().strip())
      #add to list
      taglist.append(i.text_content().strip())
    #print(taglist)
    #create a dictionary holding all the data, including the url
    fulldata = {"url": url,
                "heading" : headings[0].text_content(), 
                "date" : published[1].text_content().strip(),
                "content" : content,
                "recsheading" : recsheading,
                "recommendation" : recommendation,
                "dateofrecommendation" : dateofrecommendation,
                "dateofrecommendation_stamp" : dateofrecommendation_stamp,
                "dateresponsedue" : dateresponsedue,
                "dateresponsedue_stamp" : dateresponsedue_stamp,
                "accepted" : accepted,
                "reclink" : reclink,
                "tags" : taglist}
    #return that to whatever called the function
    #print(fulldata)
    return(fulldata)
  except Exception:
    #any failure ends up here (not only a missing page), but we label it "404 error"
    #create a dictionary holding all the data, including the url
    fulldata = {"url": url,
                "heading" : "404 error", 
                "date" : "404 error",
                "content" : "404 error",
                "recsheading" : "404 error",
                "recommendation" : "404 error",
                "dateofrecommendation" : "404 error",
                "dateofrecommendation_stamp" : "404 error",
                "dateresponsedue" : "404 error",
                "dateresponsedue_stamp" : "404 error",
                "accepted" : "404 error",
                "reclink" : "",
                "tags" : ["404 error"]}
    #return that to whatever called the function
    return(fulldata)

testdict = scrapereport("https://policeconduct.gov.uk/recommendations/recommendation-stop-and-search-deptford-metropolitan-police-february-2018")
print(testdict)

Recommendation, Stop and search, Deptford - Metropolitan Police, February 2018
{'url': 'https://policeconduct.gov.uk/recommendations/recommendation-stop-and-search-deptford-metropolitan-police-february-2018', 'heading': 'Recommendation, Stop and search, Deptford - Metropolitan Police, February 2018', 'date': '16 Nov 2020', 'content': 'On 27 February 2018, six Metropolitan Police Service (MPS) officers were patrolling Deptford High Street, Lewisham. One of the officers alleged that they sighted two men involved in a drug exchange.\nThe officers approached the men and detained them under Section 23 of the Misuse of Drugs Act.\nDuring the search, one of the men was handcuffed and the officers also searched a vehicle belonging to one of the men. No drugs were found. One of the men was also arrested when a credit card was found in his possession, which was issued to a different name. The man was later de-arrested when it was confirmed that the credit card belonged to his girlfriend.\nThe se
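
The expanded function repeats the same pattern several times: run a selector, check whether anything matched, and fall back to an empty string if not. That pattern could be pulled into a small helper. The sketch below is a possible refactoring rather than part of the scraper: `matchtext` is a hypothetical name, and `FakeElement` exists only so the example runs without downloading a page (real lxml elements provide `text_content()` in the same way). Unlike the function above, the helper also strips surrounding whitespace.

```python
def matchtext(matches, position=0, default=""):
  #hypothetical helper: text of the match at 'position', or a default if there are too few matches
  if len(matches) > position:
    return matches[position].text_content().strip()
  return default

class FakeElement:
  #stand-in for an lxml element, for demonstration only
  def __init__(self, text):
    self.text = text
  def text_content(self):
    return self.text

#one match only, so asking for a second falls back to the default
dates = [FakeElement(" 16 Nov 2020 ")]
print(matchtext(dates, 0))
print(matchtext(dates, 1, default="no second date"))
```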

## Test new scraper function

In [None]:
testdict = scrapereport("https://policeconduct.gov.uk/recommendations/collision-another-car-causing-serious-injury-northumbria-police-march-2020")
print(testdict)

0 0
{'url': 'https://policeconduct.gov.uk/recommendations/collision-another-car-causing-serious-injury-northumbria-police-march-2020', 'heading': '404 error', 'date': '404 error', 'content': '404 error', 'recsheading': '404 error', 'recommendation': '404 error', 'dateofrecommendation': '404 error', 'dateofrecommendation_stamp': '404 error', 'dateresponsedue': '404 error', 'dateresponsedue_stamp': '404 error', 'accepted': '404 error', 'tags': ['404 error']}
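
Every field here reads '404 error', but the `except` branch runs for any failure inside the function, not only a missing page, so the label may be hiding the real cause. One option is to record what actually went wrong, in place of the blanket `except` inside `scrapereport`. The sketch below is an assumption about how that could look: `scrapesafely` is a hypothetical wrapper, the URL is a placeholder, and `brokenscraper` deliberately raises an `IndexError` (the error produced by indexing an empty list of matches) to stand in for a real scrape.

```python
def scrapesafely(scrapefunction, url):
  #hypothetical wrapper: run a scraping function and record any error instead of hiding it
  try:
    return scrapefunction(url)
  except Exception as error:
    #keep the error type and message so failures can be told apart later
    return {"url": url, "error": type(error).__name__ + ": " + str(error)}

def brokenscraper(url):
  #stand-in scraper: indexing an empty list of matches raises IndexError
  headings = []
  return {"url": url, "heading": headings[0]}

#placeholder URL, used only to show the shape of the result
result = scrapesafely(brokenscraper, "https://policeconduct.gov.uk/recommendations/example")
print(result["error"])
```

An error column like this makes it possible to separate pages that are genuinely missing from pages where a selector simply matched nothing.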


## Run on all links

In [None]:
#Create a dataframe to store the data we are about to scrape
#It has to match the structure of the data we're fetching (including the 'reclink' field)
#We call this dataframe 'summariesdf'
summariesdf = pd.DataFrame(columns=["url","heading","date","content","recsheading","recommendation", "dateofrecommendation", "dateofrecommendation_stamp","dateresponsedue","dateresponsedue_stamp","accepted","reclink","tags"])
#This URL remains unchanged, only the number at the end changes
baseurl = "https://policeconduct.gov.uk/investigations/investigation-summaries-and-learning-recommendations?page="
#Create a range of numbers - the last one at the moment is page 213
pagerange = range(0,214)
#Loop through
for i in pagerange:
  #we add the page number, converting it to a string because we're making a string
  pageurl = baseurl+str(i)
  #Scrape the html at that url
  print("scraping resultspage", pageurl)
  html = requests.get(pageurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html.content) 
  #The links are all in <span> and then <a 
  #This targets the contents of those html tags
  links = root.cssselect('span a')
  #the results are always a list so we have to loop through it using a 'for' loop
  #(we call the loop variable 'link' so it doesn't reuse the page number variable 'i')
  for link in links:
    print(link.text_content())
    #grab the href attribute (the link) and add it to the base url
    linkurl = "https://policeconduct.gov.uk"+link.attrib['href']
    #keep us updated...
    print("scraping", linkurl)
    #run the scraping function on that link
    reportresults = scrapereport(linkurl)
    print(reportresults)
    #append to our dataframe (df.append() was removed in pandas 2.0, so we use pd.concat)
    summariesdf = pd.concat(
      [summariesdf, pd.DataFrame([reportresults])],
      ignore_index=True
      )

  

Streaming output truncated to the last 5000 lines.
scraping resultspage https://policeconduct.gov.uk/investigations/investigation-summaries-and-learning-recommendations?page=5
Recommendation - Metropolitan Police Service, October 2020
scraping https://policeconduct.gov.uk/recommendations/recommendation-metropolitan-police-service-october-2020
Recommendation - Metropolitan Police Service, October 2020
{'url': 'https://policeconduct.gov.uk/recommendations/recommendation-metropolitan-police-service-october-2020', 'heading': 'Recommendation - Metropolitan Police Service, October 2020', 'date': '17 Dec 2020', 'content': 'An 18 year old was arrested following a stop and search, as he was found to be carrying a knife like object and a bank card suspected to be stolen. After arriving at custody, the man was found to have a second knife hidden in his clothes. A strip search of the man was authorised but, while waiting for a suitable room to use for the search, the man reached down

In [None]:
#remove duplicates based on the url column
summariesdf = summariesdf.drop_duplicates(subset="url")
#And we can export it, using its own filename so we don't overwrite the investigations data in scrapeddata.csv
summariesdf.to_csv("scrapedsummaries.csv")
