# **Try it yourself 1.**

Goal:

  * Scrape the genotype call data for GEO accession numbers GSM288035, GSM288036, ..., GSM288044.
  * Save each sample's genotype call data as a .csv file named after its accession number, using the `pandas` package.
      * e.g. 'GSM288035.csv', 'GSM288036.csv', ...

In [53]:
!pip install BeautifulSoup4



In [55]:
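from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

# URL pieces: the accession suffix and the record id both change per sample.
str1 = 'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GSM2880'
str2 = '&id=266'
str3 = '&db=GeoDb_blob24'

for j in range(35, 45):  # GSM288035 ... GSM288044
    # For these samples the id suffix is the accession suffix minus 23 (e.g. 35 -> 12).
    url = str1 + str(j) + str2 + str(j - 23) + str3
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # The data table is plain tab-separated text inside the <pre> element.
    value = soup.find('pre').text.split('\n')
    snps = {}
    for i in value:
        snpval = i.split('\t')
        if len(snpval) > 1:  # skip lines that are not tab-separated rows
            snps[snpval[0]] = snpval[1]
    df1 = pd.DataFrame.from_dict(snps, orient='index')
    df1.to_csv('GSM2880' + str(j) + '.csv')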
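
The parsing step in the cell above can be pulled out into a small helper so it can be checked without any network access. The following is only a minimal sketch: the function name `parse_genotype_lines` and the sample text are hypothetical and are not taken from the GEO page.

```python
def parse_genotype_lines(raw_text):
    """Map the first tab-separated field of each line to the second one.

    Lines without a tab (comments, blank lines) are skipped, mirroring the
    `len(snpval) > 1` check in the cell above.
    """
    snps = {}
    for line in raw_text.split('\n'):
        fields = line.split('\t')
        if len(fields) > 1:
            snps[fields[0]] = fields[1]
    return snps


# Made-up sample standing in for the text of the <pre> element.
sample = "#ID_REF = \nID_REF\tVALUE\nSNP_A-1\tAB\nSNP_A-2\tBB\n"
snps = parse_genotype_lines(sample)
print(snps)
```

Note that a header row such as `ID_REF\tVALUE` also contains a tab, so it ends up in the dictionary like any other row. The returned dictionary can be passed to `pd.DataFrame.from_dict(snps, orient='index')` exactly as in the cell above.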

# **Try it yourself 2.**

Goal:

  * Scrape information about COVID-19 articles from the PubMed search results, pages 1 through 10.
  * The information we want is:
      * Title
      * Author
      * Hyperlink
      * Abstract: this is available on the page you reach through the hyperlink you scraped. If the article has no abstract, save it as 'No abstract exists.'.
  * Save these four variables to an Excel file using the `pandas` package.
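
One possible way to reach pages 1 through 10 without clicking a pagination button is to build one URL per results page. The sketch below assumes that PubMed accepts a `page` query parameter next to `term`. That is an assumption to verify in a browser first, and the helper name `build_page_urls` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://pubmed.ncbi.nlm.nih.gov/'


def build_page_urls(term, n_pages):
    """Return one search URL per results page, numbered from 1.

    Assumes the site understands a `page` query parameter.
    """
    return [
        BASE_URL + '?' + urlencode({'term': term, 'page': page})
        for page in range(1, n_pages + 1)
    ]


urls = build_page_urls('covid-19', 10)
print(urls[0])
print(urls[-1])
```

Each URL could then be passed to `driver.get(...)` in turn, which avoids depending on the position of the "next page" button in the page layout.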



In [2]:
!pip install selenium
!pip install fake_useragent

Collecting selenium
  Downloading selenium-4.18.1-py3-none-any.whl (10.0 MB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.25.0-py3-none-any.whl (467 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

In [52]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from fake_useragent import UserAgent
import pandas as pd

ua = UserAgent()
userAgent = ua.chrome
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument(f'user-agent={userAgent}')

driver = webdriver.Chrome(options=chrome_options)
driver1 = webdriver.Chrome(options=chrome_options)

url = 'https://pubmed.ncbi.nlm.nih.gov/?term=covid-19'
driver.get(url)

data = []

k = 10  # used both as the number of pages and as the number of articles per page
for page in range(1, k + 1):
    print('Page no :', page)
    for i in range(1, k + 1):
        print('Article No :', i)
        xpath1 = '//*[@id="search-results"]/section[1]/div[2]/div/article[' + str(i) + ']/div[2]/div[1]/a'
        title = driver.find_element(By.XPATH, xpath1).text
        print('Title :', title)
        xpath2 = '//*[@id="search-results"]/section[1]/div[2]/div/article[' + str(i) + ']/div[2]/div[1]/div[1]/span[1]'
        author = driver.find_element(By.XPATH, xpath2).text
        print('Authors :', author)
        xpath3 = '//*[@id="search-results"]/section[1]/div[2]/div/article[' + str(i) + ']/div[2]'
        result_driver = driver.find_element(By.XPATH, xpath3)
        hyperlink = result_driver.find_element(By.TAG_NAME, "a").get_attribute('href')
        print('Link :', hyperlink)
        driver1.get(hyperlink)
        xpath4 = '//*[@id="eng-abstract"]'
        if driver1.find_elements(By.XPATH, xpath4):
            abstract = driver1.find_element(By.XPATH, xpath4).text
        else:
            abstract = 'No abstract exists.'
        print('Abstract:', abstract, '\n')
        data.append({'Title': title, 'Authors': author, 'Link': hyperlink, 'Abstract': abstract})
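    # After the last article on this page, move on to the next results page
    # (skipped on the final page, where no further page is needed).
    if page < k:
        button_xpath = '//*[@id="search-results"]/div[6]/button[3]'
        driver.find_element(By.XPATH, button_xpath).click()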

driver.quit()
driver1.quit()
df = pd.DataFrame(data)
df.to_excel('Assignment1.xlsx', index=False)

Page no : 1
Article No : 1
Title : Origin, transmission, diagnosis and management of coronavirus disease 2019 (COVID-19).
Authors : Umakanthan S, Sahu P, Ranade AV, Bukelo MM, Rao JS, Abrahao-Machado LF, Dahal S, Kumar H, Kv D.
Link : https://pubmed.ncbi.nlm.nih.gov/32563999/
Abstract: Coronavirus has emerged as a global health threat due to its accelerated geographic spread over the last two decades. This article reviews the current state of knowledge concerning the origin, transmission, diagnosis and management of coronavirus disease 2019 (COVID-19). Historically, it has caused two pandemics: severe acute respiratory syndrome and Middle East respiratory syndrome followed by the present COVID-19 that emerged from China. The virus is believed to be acquired from zoonotic source and spreads through direct and contact transmission. The symptomatic phase manifests with fever, cough and myalgia to severe respiratory failure. The diagnosis is confirmed using reverse transcriptase PCR. Manag