# Schema Recommender 

## High Level Goal 
Automate schema creation for San Diego County web pages so their content can be marked up, indexed, and referenced with the Schema.org vocabulary.

## Steps 
 - Compare markup and PDF scraper packages and software.
 - Create a set of rules to automate scraping.
 - Compile a vocabulary list and rules so the automation can be replicated for other counties.
 - Validate the generated data against Schema.org standards with Google’s Structured Data Tool.
 - Create a single tool that helps governments, hospitals, and schools automatically "mark up" their pages.
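
The last step, marking up a page, usually means embedding the schema in the page's HTML as JSON-LD. A minimal sketch, assuming the schema is already a Python dict; `to_jsonld_script` is a hypothetical helper name, not part of any library:

```python
import json

def to_jsonld_script(schema: dict) -> str:
    """Wrap a schema dict in the <script> tag used to embed JSON-LD in HTML."""
    body = json.dumps(schema, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"

example = {
    "@context": "http://schema.org",
    "@type": "SpecialAnnouncement",
    "name": "Testing",
}
snippet = to_jsonld_script(example)
print(snippet)
```

The resulting string can be pasted into a page's `<head>` or into a validator that accepts code snippets.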

## Current Work 
- Working on the vocabulary list
- Working on script-like functionality


# Setup

In [1]:
import requests

from bs4 import BeautifulSoup
import datetime

vocab_list = []
date = datetime.datetime.today()

In [2]:
url = 'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html'

In [3]:
page = requests.get(url)

In [4]:
soup = BeautifulSoup(page.text, 'html.parser')

In [5]:
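tables = soup.find_all('table')
for t in tables:
    # Skip tables that have no <td> cell; t.td is None in that case.
    if t.td is None:
        continue
    title_list = t.td.text.splitlines()
    # Optional filter: keep only tables whose title mentions the coronavirus.
    # if not any('coronavirus' in line.lower() for line in title_list):
    #     continue
    print(title_list)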
    


['', 'Positive Cases in San Diego County Since February 14,', '        2020 ', 'Coronavirus Disease 2019 (COVID-19) ', 'Table updated June 18, 2020, with data through June 17,', '        2020. ']
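
The table title printed above already carries the date the data was refreshed, which is one candidate for the `datePosted` field of a schema. A minimal sketch, assuming the title always contains the phrase "Table updated <Month> <day>, <year>"; `extract_updated_date` is a hypothetical helper:

```python
import re
from datetime import date, datetime

def extract_updated_date(title_lines):
    """Return the 'Table updated' date found in the scraped title lines, or None."""
    # Collapse the line fragments and their padding into one clean string.
    text = " ".join(" ".join(title_lines).split())
    match = re.search(r"Table updated ([A-Z][a-z]+ \d{1,2}, \d{4})", text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%B %d, %Y").date()

title_list = ['', 'Positive Cases in San Diego County Since February 14,', '        2020 ',
              'Coronavirus Disease 2019 (COVID-19) ',
              'Table updated June 18, 2020, with data through June 17,', '        2020. ']
updated = extract_updated_date(title_list)
print(updated.isoformat())
```

Month-name parsing with `%B` depends on the active locale, so this assumes an English (or default C) locale.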


## Creating Announcement Schema

## San Diego 

In [6]:
url = 'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
county = "San Diego County"

In [9]:

import json

def create_schema(title, county, url, keywords=None):
    """Return a SpecialAnnouncement for one page title as a JSON-LD string."""
    # Build a dict and let json.dumps handle quoting, commas, and escaping.
    schema = {
        "@context": "http://schema.org",
        "@type": "SpecialAnnouncement",
        "name": title,
        "text": title,
        "encodingFormat": ["text/html", "text/css"],
        "inLanguage": "en",
        "datePosted": "???",  # placeholder: the posting date is not scraped yet
        "keywords": list(keywords or []),
        "publisher": {
            "@type": "GovernmentOrganization",
            "name": county,
        },
        "url": url,
    }
    return json.dumps(schema, ensure_ascii=False)



In [10]:

titles = soup.find_all('h3')
title_list = []
for title in titles:
    # Strip newlines and collapse doubled spaces in the heading text.
    title = str(title.text)
    title = title.replace('\n','')
    title = title.replace('  ',' ')
    # Keywords are the titles collected so far.
    print(create_schema(title, county, url, title_list))
    title_list.append(title)


{"@context": "http://schema.org", "@type": "SpecialAnnouncement", "name": "Coronavirus  in San Diego", "text": "Coronavirus  in San Diego", "encodingFormat": ["text/html", "text/css"], "inLanguage": "en", "datePosted": "???", "keywords": [], "publisher": {"@type": "GovernmentOrganization", "name": "San Diego County"}, "url": "https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/"}
{"@context": "http://schema.org", "@type": "SpecialAnnouncement", "name": "Testing", "text": "Testing", "encodingFormat": ["text/html", "text/css"], "inLanguage": "en", "datePosted": "???", "keywords": ["Coronavirus  in San Diego"], "publisher": {"@type": "GovernmentOrganization", "name": "San Diego County"}, "url": "https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/"}
{"@context": "http://
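
Before pasting a generated schema into Google's Structured Data Tool, a quick local check can catch malformed JSON early. A minimal sketch, assuming the keys in `REQUIRED_KEYS` are the ones worth checking (an illustrative choice, not an official Schema.org requirement); `check_schema` is a hypothetical helper:

```python
import json

REQUIRED_KEYS = ("@context", "@type", "name", "url")  # assumed minimum, adjust as needed

def check_schema(schema_text):
    """Return a list of problems found in a JSON-LD string (empty if none)."""
    try:
        data = json.loads(schema_text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} at position {err.pos}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    return [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]

good = ('{"@context": "http://schema.org", "@type": "SpecialAnnouncement", '
        '"name": "Testing", "url": "https://www.sandiegocounty.gov/"}')
bad = '{"name": San Diego County}'  # unquoted string value
print(check_schema(good))
print(check_schema(bad))
```

This only checks syntax and the presence of a few keys; it does not replace validation against the Schema.org vocabulary.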

In [9]:
for elem in title_list: print(elem)

 
Coronavirus  in San Diego
Testing
About Coronavirus
Community  Sector Support
Health Professionals
Resources  and Materials
Prepare  for a Pandemic
Care  for Your Mental Health
Text COSD COVID19 to 468-311 to get text alert updates.
For general questions about COVID-19, info about resources, or if you're uninsured, call 2-1-1 San Diego. Report face covering violations.
Live Well @ Home
Resources to keep healthy in mind and body while staying at home.
Ways to Help
Hand  Washing Stations Map
Outside resources
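
Several of the titles above still contain doubled spaces, which a single `replace('  ', ' ')` pass can leave behind when a run is longer than two characters or contains a non-breaking space. A minimal sketch of a stricter cleanup built on `str.split`; `normalize_title` is a hypothetical helper:

```python
def normalize_title(text):
    """Collapse every run of whitespace (newlines, tabs, non-breaking spaces) to one space."""
    # str.split() with no arguments splits on any Unicode whitespace and drops empty pieces.
    return " ".join(text.split())

print(normalize_title("Hand  Washing Stations Map"))
print(normalize_title("COVID-19 Tests in San Diego\n      County by Date Reported"))
```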



In [10]:
url = 'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
county = "San Diego County"

In [11]:
titles = soup.find_all('a')
   
# titles = soup.find('COVID-19')
print(titles)

[<a href="/content/sdc/home.html">
<span id="dept-cosd-domain">
<span id="sdc-link">SanDiegoCounty.gov</span>
</span>
<span id="dept-cosd-text"> Home</span>
</a>, <a class="dropdown-toggle" data-toggle="dropdown" href="javascript:void(0)">Departments</a>, <a class="dropdown-toggle" href="javascript:void(0)">A-C</a>, <a href="/content/sdc/awm.html" target="">  Agriculture, Weights and Measures</a>, <a href="/content/sdc/apcd/en.html" target="">  Air Pollution Control District</a>, <a href="/content/sdc/hhsa/programs/ais.html" target="">  Aging &amp; Independence Services</a>, <a href="http://www.sddac.com" target="">  Animal Services</a>, <a href="https://arcc.sdcounty.ca.gov/Pages/default.aspx" target="">  Assessor/Recorder/County Clerk</a>, <a href="/content/sdc/auditor.html" target="">  Auditor/Controller</a>, <a href="/content/sdc/hhsa/programs/bhs.html" target="">  Behavioral Health Services</a>, <a href="/content/sdc/cao.html" target="">  Chief Administrative Office</a>, <a href="

In [12]:
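link_dict = {}
for title in titles:
    # Keep only anchors that carry a link and mention COVID in their text.
    if title.has_attr('href') and 'COVID' in str(title.text):
        # The raw link text (line breaks included) is the dictionary key.
        link_dict[title.text] = title['href']
        title = str(title.text)
        title = title.replace('\n','')
        print(title)
        title_list.append(title)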
        

County of San Diego Coronavirus Disease (COVID-19) Dashboard
COVID-19 Tests in San Diego      County by Date Reported
Percentage of Positive      COVID‐19 Cases Among Tests by Date Reported
Bar Graph of New and Total      COVID-19 Cases in San Diego County by Date  Reported
Confirmed COVID-19 Cases by Date of Illness  Onset
Summary of County of San Diego      COVID-19 Cases by City of Residence
Summary of County of San Diego      COVID-19 Cases by Race/Ethnicity
Summary of County of San Diego      COVID-19 Cases by Zip Code
COVID-19-Associated Hospitalizations by Date of  Admission
Summary of County of San Diego      COVID-19 Cases that Required Hospitalization
COVID-19-Associated Deaths by Date of  Death
Summary of County of San Diego COVID-19 Deaths by    Demographics
COVID-19 Watch
					Subscribe to the COVID-19 Watch 


In [13]:
link_dict

{'County of San Diego Coronavirus Disease (COVID-19) Dashboard': 'https://www.arcgis.com/apps/opsdashboard/index.html#/96feda77f12f46638b984fcb1d17bd24',
 'COVID-19 Tests in San Diego\n      County by Date Reported': '/content/dam/sdc/hhsa/programs/phs/Epidemiology/COVID-19%20Tests%20in%20San%20Diego%20County%20by%20Date%20Reported.pdf',
 'Percentage of Positive\n      COVID‐19 Cases Among Tests by Date Reported': '/content/dam/sdc/hhsa/programs/phs/Epidemiology/COVID-19%20Percentage%20Positive.pdf',
 'Bar Graph of New and Total\n      COVID-19 Cases in San Diego County by Date\n  Reported': '/content/dam/sdc/hhsa/programs/phs/Epidemiology/COVID-19%20Bar%20Graph%20of%20New%20and%20Total%20Cases.pdf',
 'Confirmed COVID-19 Cases by Date of Illness\n  Onset': '/content/dam/sdc/hhsa/programs/phs/Epidemiology/COVID-19%20Cases%20by%20Date%20of%20Illness%20Onset.pdf',
 'Summary of County of San Diego\n      COVID-19 Cases by City of Residence': '/content/dam/sdc/hhsa/programs/phs/Epidemiology

In [14]:
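from urllib.parse import urljoin

link_schemas = []
for elem in link_dict:
    # Collapse the line breaks in the link text and resolve site-relative
    # hrefs against the page URL, then keep each generated schema.
    name = ' '.join(elem.split())
    link_schemas.append(create_schema(name, 'San Diego', urljoin(url, link_dict[elem])))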

## Creating Observation Schema

In [11]:
import json

def create_schema_observations(title, county, url, counts):
    """Return a case-count announcement for one region as a JSON-LD string."""
    # Note: `title` is accepted but not used yet; "name" carries the region.
    schema = {
        "@context": "http://schema.org",
        "@type": "SpecialAnnouncement",
        "name": str(county),
        "text": str(counts),
        "encodingFormat": ["text/html", "text/css"],
        "inLanguage": "en",
        "datePosted": date.isoformat(),  # `date` is set in the Setup cell
        "keywords": "1",  # placeholder for the vocabulary list
        "publisher": {
            "@type": "GovernmentOrganization",
            "name": str(county),
        },
        "url": url,
    }
    return json.dumps(schema, ensure_ascii=False)


# Example observation schema
# type       - observations
# name       - case counts
# text       - case counts val
# encoding   - text/html
# inLanguage - en
# datePosted - date_last
# keywords   - same vocab list
# publisher  - type: govt organization
#              name: San Diego
#              URL:  https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html

In [12]:
import pandas as pd
observation_table = pd.read_csv('COVID_COUNTS_SANDIEGO (2).csv')

FileNotFoundError: [Errno 2] File COVID_COUNTS_SANDIEGO (2).csv does not exist: 'COVID_COUNTS_SANDIEGO (2).csv'

In [None]:
# Drop the leftover index column; the keyword form also works on pandas 2.x.
observation_table = observation_table.drop(columns="Unnamed: 0")

In [None]:
observation_table

### San Diego Schema

In [None]:
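# County total = incorporated cities + unincorporated areas.
# Counts may contain thousands separators, so strip the commas before converting.
count = observation_table.loc[observation_table['City'] == 'Incorporated City', 'Count'].iloc[0]
count2 = observation_table.loc[observation_table['City'] == 'Unincorporated', 'Count'].iloc[0]
city_count = int(str(count).replace(',','')) + int(str(count2).replace(',',''))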

create_schema_observations('San Diego County Case Counts',
                          'San Diego',
                          'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html'
                           ,city_count)



In [None]:
url = 'https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html'
title = 'San Diego County Case Counts'
for index, elem in observation_table.iterrows():
    city = elem.City
    count = elem.Count
    print(create_schema_observations(title, city, url, count))
        