This script was last run on September 3, 2021. The website's structure has likely changed since then, so the code may not produce the same results if re-run. It scrapes the SARS-CoV-2 variant threat level lists from outbreak.info and exports the variant names and their VOI/VOC/VUM status as a .csv to a specified file.

Before running this script, download the appropriate webdriver so that Selenium can interact with the site. Webdriver installation instructions can be found here: (https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/).

In [1]:
#import packages
import urllib
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import datetime

In [2]:
#open and collect data from voc table
#before running, replace <driver_path> with the filepath where your webdriver is saved

#set url
url="https://outbreak.info/situation-reports#voc"

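#set webdriver and get data
#Selenium 4 takes the driver path through a Service object; the executable_path argument used in the 2021 run has since been removed
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('<driver_path>'))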
driver.get(url)

#scroll and wait for page to load
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(30)

#use bs4 to parse html and pull the first table
soup_voc = BeautifulSoup(driver.page_source, 'lxml')
table_voc = soup_voc.find('table')

#close webpage
driver.quit()

In [10]:
#extract voc names from table
voc_names=[]
for i in table_voc.find_all("h4"):
    voc_names.append(i.get_text().strip())
voc_names 

['B.1.617.2', 'B.1.1.7', 'B.1.351', 'P.1']

In [33]:
#make VOC data frame with column VOC
voc_col=list(["VOC"]*len(voc_names))

voc_df=pd.DataFrame([voc_names,voc_col]).transpose()
voc_df

Unnamed: 0,0,1
0,B.1.617.2,VOC
1,B.1.1.7,VOC
2,B.1.351,VOC
3,P.1,VOC


In [12]:
#the site now distinguishes lineages from sublineages; the following code pulls the text of every link in the table, which includes the sublineage names
voc_sub_names=[]
for i in table_voc.find_all("a"):
    voc_sub_names.append(i.get_text().strip())
voc_sub_names
#TODO: filter out the WHO labels, dates and link text if we want to keep this

['Delta',
 'B.1.617.2',
 'AY.1',
 'AY.2',
 'AY.3',
 'AY.3.1',
 'AY.4',
 'AY.5',
 'AY.5.1',
 'AY.5.2',
 'AY.6',
 'AY.7',
 'AY.7.1',
 'AY.7.2',
 'AY.8',
 'AY.9',
 'AY.10',
 'AY.11',
 'AY.12',
 'AY.13',
 'AY.14',
 'AY.15',
 'AY.16',
 'AY.17',
 'AY.18',
 'AY.19',
 'AY.20',
 'AY.21',
 'AY.22',
 '(read more)',
 'B.1.617.1',
 'B.1.617.3',
 '14 Jun 2021',
 '24 May 2021',
 '06 May 2021',
 '11 May 2021',
 'Compare sublineages',
 'Alpha',
 'B.1.1.7',
 'Q.1',
 'Q.2',
 'Q.3',
 'Q.4',
 'report',
 '29 Dec 2020',
 '18 Dec 2020',
 '18 Dec 2020',
 'Compare sublineages',
 'Beta',
 'B.1.351',
 'B.1.351.1',
 'B.1.351.2',
 'B.1.351.3',
 'B.1.351.4',
 'report',
 '29 Dec 2020',
 '24 Dec 2020',
 '18 Dec 2020',
 'Compare sublineages',
 'Gamma',
 'P.1',
 'P.1.1',
 'P.1.2',
 'P.1.3',
 'P.1.4',
 'P.1.5',
 'P.1.6',
 'P.1.7',
 'P.1.8',
 'P.1.9',
 'P.1.10',
 'P.1.10.1',
 'P.1.10.2',
 'P.2',
 'P.3',
 'report',
 'report',
 '13 Jan 2021',
 '11 Jan 2021',
 'Compare sublineages']
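
The link text above mixes lineage names with WHO labels, dates, and navigation text. One possible cleanup is sketched below. It assumes that every lineage name of interest has the shape "one to three capital letters followed by dot-separated numbers"; the pattern and the helper name `keep_lineages` are illustrative and were not part of the original run.

```python
import re

# Assumed shape of a lineage name: 1-3 capital letters, then one or more
# ".number" groups (e.g. "AY.3.1", "B.1.617.2"). This is a heuristic.
PANGO_PATTERN = re.compile(r"^[A-Z]{1,3}(\.\d+)+$")

def keep_lineages(texts):
    """Return entries that look like lineage names, in order, without duplicates."""
    seen = set()
    kept = []
    for text in texts:
        name = text.strip()
        if PANGO_PATTERN.match(name) and name not in seen:
            seen.add(name)
            kept.append(name)
    return kept

# A short sample in the same style as the scraped link text
sample = ['Delta', 'B.1.617.2', 'AY.1', 'AY.3.1', '(read more)',
          '14 Jun 2021', 'Compare sublineages', 'report', 'P.1.10.1', 'AY.1']
print(keep_lineages(sample))
# ['B.1.617.2', 'AY.1', 'AY.3.1', 'P.1.10.1']
```

A filter like this cannot tell a sublineage from a related lineage that is merely linked in the same table row (for example B.1.617.1 or P.2 in the output above), so the result would still need a manual check.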

In [13]:
#open and collect data from voi table
#before running, replace <driver_path> with the filepath where your webdriver is saved

#set url (the voi table is on the same page as the voc table)
url="https://outbreak.info/situation-reports#voc"

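#set webdriver and get data
#Selenium 4 takes the driver path through a Service object; the executable_path argument used in the 2021 run has since been removed
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('<driver_path>'))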
driver.get(url)

#scroll and wait for page to load
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(60)

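#find voi table by xpath
#find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium releases no longer provide
from selenium.webdriver.common.by import By
xpath_tab=driver.find_element(By.XPATH, '//*[@id="voi-reports"]/table')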
#extract html from element
tab_html_out=xpath_tab.get_attribute('outerHTML')

#close webpage
driver.quit()

#confirm the table html was captured (tab_html_out is undefined if the lookup above did not run)
try:
    tab_html_out
except NameError:
    print("you didn't find it")
else:
    print("you found the html! good job!")


you found the html! good job!


In [14]:
#convert voi html to bs4 object
soup_voi = BeautifulSoup(tab_html_out, 'lxml')

#extract names from table
voi_names=[]
for i in soup_voi.find_all("h4"):
    voi_names.append(i.get_text().strip())
voi_names 

['B.1.525',
 'B.1.526',
 'B.1.617.1',
 'C.37',
 'B.1.621',
 'P.3',
 'P.2',
 'B.1.1.318',
 'C.36.3']

In [15]:
#the h4 tags missed AV.1 and B.1.617.3, which are listed on the site; try the h3 tags instead
voi_names_addtl=[]
for i in soup_voi.find_all("h3"):
    voi_names_addtl.append(i.get_text().strip())
voi_names_addtl
#unfortunately, this also brings in the WHO labels and the "-related" headings

['Eta',
 'Iota',
 'Kappa',
 'Lambda',
 'Mu',
 'Theta',
 'Zeta',
 'AV.1',
 'B.1.1.318-related',
 'B.1.617.3',
 'C.36.3-related']
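
The h3 text above contains the two missing names alongside WHO labels and "-related" headings. As an alternative to adding the names by hand, the sketch below keeps only the h3 entries that look like lineage names and are not already in the h4 list. The pattern (one to three capital letters followed by dot-separated numbers) is an assumption about how lineage names are written, and `missing_lineages` is an illustrative helper name, not something used in the original run.

```python
import re

# Assumed shape of a lineage name, e.g. "AV.1" or "B.1.617.3" (heuristic).
LINEAGE_RE = re.compile(r"^[A-Z]{1,3}(\.\d+)+$")

def missing_lineages(h3_texts, known_names):
    """Return h3 entries that look like lineage names and are not already known."""
    known = set(known_names)
    return [t for t in h3_texts if LINEAGE_RE.match(t) and t not in known]

# Values copied from the h3 and h4 outputs shown above
h3_texts = ['Eta', 'Iota', 'Kappa', 'Lambda', 'Mu', 'Theta', 'Zeta', 'AV.1',
            'B.1.1.318-related', 'B.1.617.3', 'C.36.3-related']
h4_names = ['B.1.525', 'B.1.526', 'B.1.617.1', 'C.37', 'B.1.621', 'P.3',
            'P.2', 'B.1.1.318', 'C.36.3']

print(missing_lineages(h3_texts, h4_names))
# ['AV.1', 'B.1.617.3']
```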

In [32]:
#instead, add the two missing names manually
voi_names=voi_names+['AV.1','B.1.617.3']
voi_names

['B.1.525',
 'B.1.526',
 'B.1.617.1',
 'C.37',
 'B.1.621',
 'P.3',
 'P.2',
 'B.1.1.318',
 'C.36.3',
 'AV.1',
 'B.1.617.3']

In [34]:
#make VOI data frame with column VOI
voi_col=list(["VOI"]*len(voi_names))

voi_df=pd.DataFrame([voi_names,voi_col]).transpose()
voi_df

Unnamed: 0,0,1
0,B.1.525,VOI
1,B.1.526,VOI
2,B.1.617.1,VOI
3,C.37,VOI
4,B.1.621,VOI
5,P.3,VOI
6,P.2,VOI
7,B.1.1.318,VOI
8,C.36.3,VOI
9,AV.1,VOI
10,B.1.617.3,VOI


In [35]:
#combine to create one dataframe for export
#pd.concat gives the same result as DataFrame.append, which pandas 2.0 removed
all_df=pd.concat([voi_df, voc_df], ignore_index=True)
all_df

Unnamed: 0,0,1
0,B.1.525,VOI
1,B.1.526,VOI
2,B.1.617.1,VOI
3,C.37,VOI
4,B.1.621,VOI
5,P.3,VOI
6,P.2,VOI
7,B.1.1.318,VOI
8,C.36.3,VOI
9,AV.1,VOI
10,B.1.617.3,VOI
11,B.1.617.2,VOC
12,B.1.1.7,VOC
13,B.1.351,VOC
14,P.1,VOC


Date cell 1 is the cell the project actually used to record the day the data was pulled. For the purposes of this record, Date cell 1 is commented out, and Date cell 2 sets the date manually to the day the data was last pulled.

In [36]:
#Date cell 1

#get the date of data collection
#todayIs=datetime.date.today()
#dataDate = todayIs.strftime("%m%d%y")
#dataDate

'090321'

In [None]:
#Date cell 2
dataDate=("090321")
dataDate

In [None]:
#add date of data collection to get the file name before saving
#change <your_filepath> to actual desired file path
file_list = ["<your_filepath>",dataDate,".csv"]
filename="".join(file_list)
filename
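
The date stamp and file name steps above can be combined into one small helper. This is only a sketch: `dated_csv_name` and the `variants_` prefix are made-up names standing in for `<your_filepath>`, and the date is fixed to the day of the last pull so the result matches the '090321' stamp shown earlier.

```python
import datetime

def dated_csv_name(prefix, when):
    """Build '<prefix><MMDDYY>.csv' from a path prefix and a date."""
    stamp = when.strftime("%m%d%y")  # same format string as Date cell 1
    return f"{prefix}{stamp}.csv"

# Hypothetical prefix; the date is the last data pull (September 3, 2021)
example_name = dated_csv_name("variants_", datetime.date(2021, 9, 3))
print(example_name)
# variants_090321.csv
```

Passing `datetime.date.today()` as the second argument would reproduce the behavior of Date cell 1.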

In [14]:
#write file for later
all_df.to_csv(filename)