<a href="https://colab.research.google.com/github/jessie-taylor/GICTracker/blob/main/waittime_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from bs4 import BeautifulSoup
from datetime import datetime
import re
from urllib.request import Request, urlopen
import unicodedata
import yaml

# Version I

It appears that this method cannot directly scrape text that spans multiple lines. Instead, first convert the page to a string, then search it for substrings.

In [2]:
url = "https://www.leedsandyorkpft.nhs.uk/our-services/gender-identity-service/"
# Send a browser-like User-Agent header with the request
request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(request).read().decode('utf-8')

soup = BeautifulSoup(page, "html.parser")

# Turn the parsed page into a plain string
soup_string = soup.get_text()

# Replace "\xa0" (non-breaking space) with a regular space via NFKD normalisation
souped = unicodedata.normalize('NFKD', soup_string)
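
As a quick check of the normalisation step, here is a minimal sketch showing NFKD mapping a non-breaking space to an ordinary space. The sample string is made up for illustration and is not taken from the site.

```python
import unicodedata

# Hypothetical sample text containing a non-breaking space (\xa0)
sample = "updated on\xa0Friday"

# NFKD compatibility decomposition maps \xa0 to a plain space
normalised = unicodedata.normalize("NFKD", sample)

print(repr(sample))
print(repr(normalised))
```

Note that NFKD also decomposes accented characters, which may or may not matter for later substring searches.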

# Next search for the important parts

The important parts of the text are here:


> This information was updated on Friday, 20 September. We will update the referral information on a three-monthly basis. Please note, we are open to referrals.

>There are currently 5934 people on our standard waiting list to be seen.

>There are currently 169 people on our priority waiting list waiting to be seen.

>We are currently booking appointments for people who were referred in approximately June 2019. If you were referred before this date and have not been contacted, please contact the service.



In [3]:
# Dictionary for holding all the data
data = {}

In [4]:
search_strings = ["This information was updated on",
                  "There are currently",
                  "We are currently booking appointments for people who were"]
days = ['Monday',
        'Tuesday',
        'Wednesday',
        'Thursday',
        'Friday',
        'Saturday',
        'Sunday']

months = ['January',
          'February',
          'March',
          'April',
          'May',
          'June',
          'July',
          'August',
          'September',
          'October',
          'November',
          'December']

Finding update date

In [47]:
# Find where phrase appears in site
update_sentence_loc = souped.find(search_strings[0])
# Date ends one character before the next full stop, so find location of that
next_stop = souped[update_sentence_loc:].find('.')
# Pull out the sentence
update_sentence = souped[update_sentence_loc:update_sentence_loc + next_stop]

# Find where the day is in the sentence
for day in days:
  if day in update_sentence:
    location = update_sentence.find(day)
    break
# If no day has been found, raise error
else:
  raise ValueError("No day found within range. Check site for changes.")

# Date is in a format such as 'Friday, 20 September, 2024'
# Pull it out from the rest of the sentence
update_date_str = update_sentence[location:]

# Convert to datetime format
update_date_obj = datetime.strptime(update_date_str, '%A, %d %B, %Y')

print(f'Date of update: {update_date_str} ({update_date_obj})')

Date of update: Friday, 20 September, 2024 (2024-09-20 00:00:00)


In [43]:
print(update_sentence_loc, next_stop, update_sentence, location)

12202 58 This information was updated on Friday, 20 September, 2024 32
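
The "find the phrase, then cut at the next full stop" step is used more than once in this notebook, so it could be wrapped in a helper. The sketch below is one possible shape; the function name `extract_sentence` and the sample text are assumptions for illustration. It also checks for `str.find` returning -1, which happens when the phrase is missing and would otherwise lead to an error that does not point at the real cause.

```python
def extract_sentence(text, phrase):
    """Return the text from `phrase` up to (not including) the next full stop."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"Phrase not found: {phrase!r}. Check site for changes.")
    stop = text.find(".", start)
    # If there is no full stop after the phrase, keep everything to the end
    if stop == -1:
        stop = len(text)
    return text[start:stop]


# Made-up sample in the same shape as the site text
sample = "Intro. This information was updated on Friday, 20 September, 2024. More text."
sentence = extract_sentence(sample, "This information was updated on")
print(sentence)
```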


Finding waiting list sizes (both standard, and priority)

In [7]:
# Find each instance of the beginning of the waiting list sentences
matches = [match.start() for match in re.finditer(search_strings[1], souped)]

# List for storage of waiting list sizes
waiting_list_sizes = []

# Find the standard and priority waiting list sizes
for match_index in matches:
  size = re.search(r'\d+', souped[match_index:match_index+100])
  # If value found, add to waiting list sizes list
  if size:
    waiting_list_sizes.append(size.group())
  else:
    raise ValueError("Error: Waiting list number not found, check site for changes")
print(waiting_list_sizes)

standard_waiting_list_size = waiting_list_sizes[0]
priority_waiting_list_size = waiting_list_sizes[1]

print(f'Standard waiting list size: {standard_waiting_list_size}, priority list size: {priority_waiting_list_size}')

['5934', '169']
Standard waiting list size: 5934, priority list size: 169
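
The cell above relies on the standard list appearing before the priority list on the page. As a hedged alternative, the sketch below captures the number and the list type in one pattern, so the result does not depend on order. The exact wording in the pattern is an assumption based on the sentences quoted earlier and would need updating if the site changes.

```python
import re

# Sample sentences copied from the quoted site text earlier in the notebook
sample = (
    "There are currently 5934 people on our standard waiting list to be seen. "
    "There are currently 169 people on our priority waiting list waiting to be seen."
)

# Capture the number and the list type together so page order does not matter
pattern = re.compile(
    r"There are currently (\d+) people on our (standard|priority) waiting list"
)

# Build a dict keyed by list type, with sizes converted to integers
sizes = {kind: int(number) for number, kind in pattern.findall(sample)}
print(sizes)
```

Storing the sizes as integers rather than strings should also make later analysis simpler.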


Finally, find the referral date of the people they are currently booking appointments for

In [48]:
# Find phrase for final part
reftime_sentence_loc = souped.find(search_strings[2])
# Find next period in sentence
next_stop = souped[reftime_sentence_loc:].find('.')
# Pull out sentence
reftime_sentence = souped[reftime_sentence_loc:reftime_sentence_loc+next_stop]

for month in months:
  if month in reftime_sentence:
    location = reftime_sentence.find(month)
    break
# Raise exception if the break above is never executed
else:
  raise ValueError("No month found in sentence! Check site for changes.")

# Pull out and store date string
ref_date_str = reftime_sentence[location:]

# Convert date string to obj
ref_date_obj = datetime.strptime(ref_date_str, '%B %Y')

print(f'Date of referral: {ref_date_str} ({ref_date_obj})')

Date of referral: June 2019 (2019-06-01 00:00:00)
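
The month lookup above picks the first month in calendar order that appears anywhere in the sentence, not the first one in the text. A minimal sketch of one way around this is to compare positions and take the earliest. The helper name `first_name_in_text` and the sample sentence are made up for illustration.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def first_name_in_text(text, names):
    """Return (position, name) for whichever name occurs earliest in `text`."""
    hits = [(text.find(name), name) for name in names if name in text]
    if not hits:
        raise ValueError("No matching name found. Check site for changes.")
    # Tuples compare by position first, so min() gives the earliest match
    return min(hits)


# Made-up sentence where list order and text order disagree
sample = "people who were referred between September 2019 and January 2020"
position, month = first_name_in_text(sample, MONTHS)
print(month, sample[position:])
```

The same helper would work for the `days` list in the update date cell.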


Now add the data into a dictionary, which can then be put into a list with the other data.

In [82]:
# Define main list, which will hold all dictionaries
all_data = []

# Add results of previous run into keyed dictionary
obtained_data = {'Date': update_date_obj,
                 'Date (str)': update_date_str,
                 'Standard waiting list': standard_waiting_list_size,
                 'Priority waiting list': priority_waiting_list_size,
                 'Referral date of appts': ref_date_obj,
                 'Referral date of appts (str)': ref_date_str}

# Check whether this update has already been recorded
for entry in all_data:
  if obtained_data["Date"] == entry["Date"]:
    print("This update has already been recorded, discarding results.")
    break
else:
  # Add to list of all data
  print("Appending new data!")
  all_data.append(obtained_data)

# Print the data in a human-readable format
yaml.Dumper.ignore_aliases = lambda *args : True # don't use aliases
print(yaml.dump(all_data, allow_unicode=True, default_flow_style=False))

Appending new data!
- Date: 2024-09-20 00:00:00
  Date (str): Friday, 20 September, 2024
  Priority waiting list: '169'
  Referral date of appts: 2019-06-01 00:00:00
  Referral date of appts (str): June 2019
  Standard waiting list: '5934'



In [60]:
obtained_data["Date"]

datetime.datetime(2024, 9, 20, 0, 0)

Next steps:
- Fix the day and month search so it cannot return the wrong match: if January is found first but September comes up later, the wrong one can end up as the result. Limit the search to the text between the start index of the located sentence and the next full stop.
- Clean up top of page
- Add data to dictionary (can modify example I created below)
- Find out how to use wayback machine API and begin to code for this (may not be back online yet, but try API anyway)
- Add other data to CSV and put it in the correct format so it can be analysed and trends extrapolated.
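
For the CSV step in the list above, a minimal sketch using only the standard library might look like this. The column names mirror the dictionary keys used earlier, the values are the ones scraped above converted to integers, and the helper name `rows_to_csv` is an assumption. It writes to an in-memory buffer here; a real file handle opened with `newline=''` could be passed instead.

```python
import csv
import io
from datetime import datetime

# One record in the same shape as the dictionaries built earlier
rows = [{
    "Date": datetime(2024, 9, 20),
    "Standard waiting list": 5934,
    "Priority waiting list": 169,
    "Referral date of appts": datetime(2019, 6, 1),
}]


def rows_to_csv(rows, handle):
    """Write a list of dictionaries to `handle` as CSV, with dates in ISO format."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        # Convert datetimes to YYYY-MM-DD strings; leave other values unchanged
        writer.writerow({
            key: value.date().isoformat() if isinstance(value, datetime) else value
            for key, value in row.items()
        })


buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```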

Make data a list of dictionaries? Can be converted to something else later.

In [None]:
test = []
a, b, c = 1, 2, 3
test.append({'Date': a,
             'Waiting list size': b,
             'Currently seeing': c})
print(test)

[{'Date': 1, 'Waiting list size': 2, 'Currently seeing': 3}]


## Notes
When checking more than one version of the page, it will need to compare the update date against the ones already stored and disregard any update that has already been recorded.

Would be nice to be able to see the difference between the date of update and the date they're seeing people from on a graph over time.
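
As a starting point for that graph, the gap for a single record can be computed by subtracting the two `datetime` objects. This is only a sketch using the dates scraped above; the variable names are illustrative, and the year figure is an approximation based on 365.25 days per year.

```python
from datetime import datetime

# Dates taken from the run recorded above
update_date = datetime(2024, 9, 20)
referral_date = datetime(2019, 6, 1)

# Subtracting datetimes gives a timedelta; .days is the whole-day difference
gap = update_date - referral_date
gap_days = gap.days
gap_years = gap_days / 365.25  # approximate conversion

print(f"{gap_days} days (about {gap_years:.1f} years)")
```

Computing this value for every stored record would give one point per update to plot over time.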