# Capture data from UTK's COVID-19 dashboard

I initially tried to scrape the data from UTK's COVID-19 dashboard and parse it with BeautifulSoup, but the data I want is buried in nested table tags that looked like a pain to untangle. To cut a long story short, [my hammer](https://en.wiktionary.org/wiki/if_all_you_have_is_a_hammer,_everything_looks_like_a_nail) is computer vision, and this website looks a whole lot like a nail to me.

UTK's COVID-19 dashboard data is currently located at https://veoci.com/veoci/p/form/4jmds5x4jj4j#tab=entryForm

For now, I just want to make sure I am capturing the data; visualizations can be created from it later.

The steps are:

1. Get a screenshot of the page as binary PNG data with Selenium
2. Convert the binary data to a grayscale `PIL.Image`
3. OCR the image
4. Load the old OCR text
5. Compare the new OCR text with the old OCR text
6. If the OCR text changed, save the screenshot with a date stamp and update the old OCR text
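
Steps 4 to 6 amount to a compare-then-update pattern. As a minimal sketch, separate from the notebook cells that follow: `update_if_changed` is a hypothetical helper name, and the OCR strings are made up for illustration. It uses only the standard library and treats a missing file as "nothing captured yet".

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def update_if_changed(new_text: str, last_text_path: Path) -> bool:
    """Overwrite last_text_path and return True when new_text differs from it."""
    # a missing file means nothing has been captured yet, so treat it as empty
    old_text = last_text_path.read_text() if last_text_path.exists() else ''
    if new_text == old_text:
        return False
    last_text_path.write_text(new_text)
    return True


# usage with a throwaway directory and made-up OCR strings
with TemporaryDirectory() as tmp_dir:
    last_path = Path(tmp_dir) / 'last_ocr.txt'
    first_run = update_if_changed('Active cases: 10', last_path)
    repeat_run = update_if_changed('Active cases: 10', last_path)
    changed_run = update_if_changed('Active cases: 12', last_path)
    print(first_run, repeat_run, changed_run)
```

Returning a boolean keeps the decision ("did anything change?") separate from side effects such as saving the screenshot, which the caller can do only when the result is `True`.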

In [1]:
# import standard library
from datetime import datetime
from io import BytesIO
from pathlib import Path
from shutil import copy

# import 3rd party
from PIL import Image
from pytesseract import image_to_string
from selenium import webdriver

In [2]:
# variables
URL = 'https://veoci.com/veoci/p/form/4jmds5x4jj4j#tab=entryForm'
data_dir_path = Path('../data')
last_ocr_text_path = data_dir_path.joinpath('last_ocr.txt')

In [3]:
# functions
def get_element_screenshot(URL = 'https://veoci.com/veoci/p/form/4jmds5x4jj4j#tab=entryForm',
                           element = '/html/body/div[2]/div[2]',
                           window_width = 1000,
                           window_height = 1500
                          ):
    with webdriver.Safari() as driver:
        driver.get(URL)
        # default window size works on 5k 27" iMac
        driver.set_window_size(window_width, window_height)
        # default element gets all contents for COVID data
        # find_element('xpath', ...) also works on Selenium 4, which removed find_element_by_xpath
        element = driver.find_element('xpath', element)
        screenshot = element.screenshot_as_png
    return screenshot

def screenshot_to_image(screenshot):
    image_bytes = BytesIO(screenshot)
    image = Image.open(image_bytes)
    return image

def image_to_covid_text(image):
    gray_image = image.convert('L')
    text = image_to_string(gray_image)
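    # keep only the text between the two marker phrases;
    # [1] raises IndexError if the OCR text does not contain the first marker exactly
    covid_text = text.split('Learn more about what happens with a COVID-19 case is reported.')[1].split('View Acknowledgement')[0].strip()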
    return covid_text

def compare_ocr_text(covid_text):
    
    # read the previous OCR text; treat a missing file (first run) as empty
    if last_ocr_text_path.exists():
        last_ocr_text = last_ocr_text_path.read_text()
    else:
        last_ocr_text = ''
    
    if covid_text != last_ocr_text:  # then we have new data!
        
        # set date_stub now so it's consistent for filenames
        date_stub = datetime.now().strftime("%Y-%m-%d_%H%M")
        
        # save screenshot (relies on the notebook-level `image` variable from the gather-data cell)
        image_filename = f'{date_stub}_screenshot.png'
        image_path = data_dir_path.joinpath(image_filename)
        image.save(image_path)
        print(f'Image saved at {image_path}\n')
        
        # save covid_text
        text_filename = f'{date_stub}_covid-text.txt'
        text_path = data_dir_path.joinpath(text_filename)
        with open(text_path, 'w') as text_file:
            # write the whole string at once instead of character by character
            text_file.write(covid_text)
        
        # overwrite "last_ocr.txt"
        copy(text_path, last_ocr_text_path)
        print(f'Updated "last_ocr.txt" with:\n{79*"*"}\n')
        print(covid_text)
    else:
        print('No change in last_ocr.txt')

In [4]:
# gather data
screenshot = get_element_screenshot()

image = screenshot_to_image(screenshot)

covid_text = image_to_covid_text(image)

compare_ocr_text(covid_text)

IndexError: list index out of range
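
An `IndexError` like this is what `text.split(marker)[1]` raises when `marker` does not appear in `text`: `split` then returns a one-element list. Whether the OCR misread the phrase or the page wording differs is not something this output can tell us. As a hedged sketch of a more forgiving extraction, `extract_between` below is a hypothetical helper (not part of the notebook) that uses `str.partition` and returns `None` instead of raising when the start marker is missing; the sample strings are made up.

```python
def extract_between(text, start_marker, end_marker):
    """Return the stripped text after start_marker and before end_marker.

    Returns None when start_marker is not found. A missing end_marker keeps
    everything after start_marker, mirroring split(end_marker)[0].
    """
    _, found, remainder = text.partition(start_marker)
    if not found:
        return None
    return remainder.partition(end_marker)[0].strip()


# made-up strings standing in for OCR output
good = extract_between('header START Active cases: 10 END footer', 'START', 'END')
missing = extract_between('no markers in this text', 'START', 'END')
print(good)
print(missing)
```

A caller can then check for `None` and, for example, save the raw OCR text for inspection rather than stopping with a traceback.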