# Download sample

We download a sample of data from PubMed by running the following queries

- *pubmed_result.csv*
```
("2010"[Date - Publication] : "3000"[Date - Publication]) AND "Drug Combinations"[MeSH Terms] 
```
- *amino_acid_substitution.csv*
```
"Amino Acid Substitution"[MAJR] 
```

and saving all results in XML and CSV formats.
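
The two queries above were run against PubMed and the results saved by hand. As a rough sketch of how the same search terms could be submitted programmatically, the snippet below builds a request URL for the public NCBI E-utilities `esearch` endpoint. This endpoint, the `build_esearch_url` helper, and the `QUERIES` dictionary are assumptions made for illustration and are not used anywhere else in this notebook.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities "esearch" endpoint. The notebook
# itself only uses result files that were saved beforehand.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search terms copied from the queries listed above, keyed by file stem.
QUERIES = {
    "pubmed_result": (
        '("2010"[Date - Publication] : "3000"[Date - Publication]) '
        'AND "Drug Combinations"[MeSH Terms]'
    ),
    "amino_acid_substitution": '"Amino Acid Substitution"[MAJR]',
}


def build_esearch_url(term, retmax=100):
    """Return an esearch URL for `term` (hypothetical helper)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode percent-encodes quotes, brackets and spaces in the term.
    return ESEARCH_URL + "?" + urlencode(params)


for name, term in QUERIES.items():
    print(name, build_esearch_url(term, retmax=5))
```

The URLs are only printed here; fetching them (for example with `requests.get`) is left out so that the sketch has no network dependency.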

# Imports

In [14]:
%run _imports.ipynb

Setting the PYTHON_VERSION environment variable.
Setting the SPARK_MASTER environment variable.
Setting the DB_TYPE environment variable.
Setting the DB_PORT environment variable.


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
2017-12-02 16:09:23.843494


In [16]:
import requests
from bs4 import BeautifulSoup

import pmc_tables.xml_parser

In [17]:
NOTEBOOK_NAME = 'download_sample'
os.makedirs(NOTEBOOK_NAME, exist_ok=True)

INPUT_FILE_NAME = 'amino_acid_substitution.xml'

OUTPUT_DIR = 'downloaded_pdfs'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Functions

In [18]:
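def df_to_table(df, integer_columns=None, integer_dtypes=None):
    """Convert a DataFrame to a pyarrow Table, preserving nullable integer columns.

    Each column in `integer_columns` is converted separately, using the
    matching pyarrow type from `integer_dtypes` and a null mask, and is then
    inserted back into the table at its original position.
    """
    # Guard against the `None` defaults, which would otherwise break `zip`.
    integer_columns = list(integer_columns or [])
    integer_dtypes = list(integer_dtypes or [])
    if len(integer_columns) != len(integer_dtypes):
        raise ValueError("integer_columns and integer_dtypes must have the same length")

    extra_columns = {}
    for column, dtype in zip(integer_columns, integer_dtypes):
        extra_columns[column] = {
            'dtype': dtype,
            'idx': list(df.columns).index(column),
            'data': df[column]
        }

    # Convert all remaining columns in one go.
    table = pa.Table.from_pandas(
        df[[c for c in df.columns if c not in integer_columns]],
        preserve_index=False)

    # Re-insert the integer columns in order of their original position.
    for column_name, column_attrib in sorted(extra_columns.items(), key=lambda c: c[1]['idx']):
        array = pa.Array.from_pandas(
            column_attrib['data'], column_attrib['data'].isnull(), column_attrib['dtype'])
        column = pa.Column.from_array(column_name, array)
        table = table.add_column(column_attrib['idx'], column)

    return table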

In [19]:
# Access data from sci-hub
SCIHUB_BASE_URL = 'http://sci-hub.cc/'
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'}


def download_article(doi, output_dir):
    """
    Fetch the paper by first retrieving the direct link to the pdf from Sci-Hub.

    Returns a `(url, filename)` tuple, or `(None, None)` if the download failed.
    """
    url = _get_scihub_url(doi)
    if url is None:
        # No iframe was found on the Sci-Hub page, so there is nothing to download.
        logger.error('Failed to find a pdf link for doi %s!', doi)
        return None, None

    try:
        # verify=False is dangerous but sci-hub.io
        # requires intermediate certificates to verify
        # and requests doesn't know how to download them.
        # As a hacky fix, you can add them to your store
        # and verifying would work. Will fix this later.
        res = requests.get(url, headers=HEADERS, verify=False)
    except requests.exceptions.RequestException as e:
        logger.error('Failed to fetch pdf with doi %s from url %s due to request exception!', doi, url)
        return None, None

    # The header may be missing or carry parameters (e.g. a charset).
    if not res.headers.get('Content-Type', '').startswith('application/pdf'):
        logger.error('Failed to fetch pdf with doi %s from url %s due to a captcha!', doi, url)
        return None, None
    else:
        filename = _slugify(doi) + '.pdf'
        if filename not in url:
            logger.warning("Filename %s not in url %s", filename, url)
        with open(op.join(output_dir, filename), 'wb') as ofh:
            ofh.write(res.content)
        return url, filename


def _slugify(doi):
    """Generate a name from DOI using the same approach as SciHub."""
    return doi.replace('/', '@')


def _get_soup(html):
    """Return html soup."""
    return BeautifulSoup(html, 'html.parser')

    
def _get_scihub_url(doi):
    """
    Sci-Hub embeds papers in an iframe. This function finds the actual
    source url which looks something like https://moscow.sci-hub.io/.../....pdf.
    """
    res = requests.get(SCIHUB_BASE_URL + doi, headers=HEADERS, verify=False)
    s = _get_soup(res.content)
    iframe = s.find('iframe')
    if iframe:
        return iframe.get('src') if not iframe.get('src').startswith('//') \
           else 'http:' + iframe.get('src')
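
To make the iframe lookup in `_get_scihub_url` easier to try out without a network connection, here is a small standard-library sketch of the same idea: find the first `<iframe>` that has a `src` attribute and resolve protocol-relative links (`//host/...`) against the base URL. The class `_FirstIframe`, the function `first_iframe_url`, and the example HTML are hypothetical; the notebook itself uses BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _FirstIframe(HTMLParser):
    """Remember the `src` of the first <iframe> tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "iframe" and self.src is None:
            self.src = dict(attrs).get("src")


def first_iframe_url(html, base_url):
    """Return the absolute URL of the first iframe, or None if there is none."""
    parser = _FirstIframe()
    parser.feed(html)
    if parser.src is None:
        return None
    # urljoin fills in the scheme for protocol-relative links such as
    # "//example.org/paper.pdf" and leaves absolute URLs unchanged.
    return urljoin(base_url, parser.src)


# Made-up page content, only for illustration.
page = '<html><body><iframe src="//example.org/paper.pdf"></iframe></body></html>'
print(first_iframe_url(page, "http://sci-hub.cc/"))
```

Returning `None` explicitly when no iframe is present makes it obvious to callers that they need to handle the "no PDF link" case.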

# Load data

In [21]:
os.listdir(NOTEBOOK_NAME)

['pubmed_result_2.csv',
 'delete.csv',
 'pubmed_result.xml',
 'pubmed_result.csv',
 'amino_acid_substitution.xml',
 'amino_acid_substitution.csv']

In [22]:
INPUT_FILE_NAME

'amino_acid_substitution.xml'

In [25]:
data = pmc_tables.xml_parser.parse_pubmed_xml_file(f"{NOTEBOOK_NAME}/{INPUT_FILE_NAME}")
df = pd.DataFrame(data)

In [29]:
df.shape

(4433, 9)

In [31]:
df['pmc'].notnull().sum()

1302

In [32]:
df_pmc = df[df['pmc'].notnull()]

In [33]:
df_pmc.head()

Unnamed: 0,pmid,title,authors,journal,year_published,abstract,mesh_terms,doi,pmc
4,28700616,Natural variation in a single amino acid substitution underlies physiological responses to topoisomerase II poisons.,"[Zdraljevic, Strand, Seidel, Cook, Doench, Andersen]",PLoS Genet,2017.0,Many chemotherapeutic drugs are differentially effective from one patient to the next. Understanding the causes of t...,"[Amino Acid Substitution, Animals, Antineoplastic Agents, Caenorhabditis elegans, DNA Damage, DNA Topoisomerases, Ty...",10.1371/journal.pgen.1006891,PMC5529024
7,28614374,Accurate prediction of functional effects for variants by combining gradient tree boosting with optimal neighborhood...,"[Pan, Liu, Deng]",PLoS One,2017.0,"Single amino acid variations (SAVs) potentially alter biological functions, including causing diseases or natural di...","[Algorithms, Amino Acid Substitution, Computational Biology, Disease, Genetic Predisposition to Disease, Humans, Mod...",10.1371/journal.pone.0179314,PMC5470696
9,28531234,Directed evolution to improve the catalytic efficiency of urate oxidase from Bacillus subtilis.,"[Li, Xu, Zhang, Zhu, Hua, Kong, Sun, Hong]",PLoS One,2017.0,Urate oxidase is a key enzyme in purine metabolism and catalyzes the oxidation of uric acid to allantoin. It is used...,"[Amino Acid Substitution, Bacillus subtilis, Bacterial Proteins, Biocatalysis, Catalytic Domain, Directed Molecular ...",10.1371/journal.pone.0177877,PMC5439685
10,28514686,Glycine Substitution at Helix-to-Coil Transitions Facilitates the Structural Determination of a Stabilized Subtype C...,"[Guenaga, Garces, de Val, Stanfield, Dubrovskaya, Higgins, Carrette, Ward, Wilson, Wyatt]",Immunity,2017.0,"Advances in HIV-1 envelope glycoprotein (Env) design generate native-like trimers and high-resolution clade A, B, an...","[Amino Acid Substitution, Antibodies, Neutralizing, Binding Sites, Genotype, Glycine, Glycosylation, HIV Antibodies,...",10.1016/j.immuni.2017.04.014,PMC5439057
11,28464395,Specificity Effects of Amino Acid Substitutions in Promiscuous Hydrolases: Context-Dependence of Catalytic Residue C...,"[Bayer, van Loo, Hollfelder]",Chembiochem,2017.0,Catalytic promiscuity can facilitate evolution of enzyme functions-a multifunctional catalyst may act as a springboa...,"[Alkaline Phosphatase, Amino Acid Substitution, Bacteria, Catalysis, Catalytic Domain, Directed Molecular Evolution,...",10.1002/cbic.201600657,PMC5488252


In [158]:
folders = []

In [159]:
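for subset in ['NON-OA', 'OA']:
    # Use a separate name for the per-subset listing: reassigning `folders`
    # here and appending to it while iterating makes the loop never finish.
    subset_folders = ftp.nlst(f'/pub/databases/pmc/suppl/{subset}/')
    for folder1 in tqdm.tqdm_notebook(subset_folders, total=len(subset_folders)):
        for folder2 in ftp.nlst(folder1):
            folders.append(folder2)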

KeyboardInterrupt: 

In [None]:
len(folders)

In [157]:
ftp.nlst('/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899')

['/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194899.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194901.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194903.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194904.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194906.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194908.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194909.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194910.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194911.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194917.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194919.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194920.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC1197899/PMC1194921.zip',
 '/pub/databases/pmc/suppl/NON-OA/PMC1193900-PMC119

In [41]:
import pmc_tables.download

In [42]:
ftp = pmc_tables.download.get_ftp_client('ebi')

In [43]:
oa_folders, non_oa_folders = pmc_tables.download.get_ebi_suppl_folder_list(ftp)

In [44]:
pmc_id = 'PMC5488252'

In [45]:
pmc_tables.download.get_containing_folder(pmc_id, oa_folders)

'PMC5485900-PMC5489899'

In [46]:
pmc_tables.download.get_containing_folder(pmc_id, non_oa_folders)

'PMC5485900-PMC5489899'
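
The folder names returned above encode a range of PMC identifiers. The implementation of `pmc_tables.download.get_containing_folder` is not shown in this notebook, so the following is only a sketch of how such a lookup could work, assuming that every folder is named `PMC<start>-PMC<end>` and that the range is inclusive. The function name `find_containing_folder` is hypothetical.

```python
import re

_ID_RE = re.compile(r"PMC(\d+)")
_RANGE_RE = re.compile(r"PMC(\d+)-PMC(\d+)")


def find_containing_folder(pmc_id, folder_names):
    """Return the first folder whose PMC id range includes `pmc_id`, else None."""
    id_match = _ID_RE.fullmatch(pmc_id)
    if id_match is None:
        raise ValueError(f"Not a PMC identifier: {pmc_id!r}")
    number = int(id_match.group(1))

    for name in folder_names:
        # Accept both bare names and full paths by looking at the last component.
        range_match = _RANGE_RE.fullmatch(name.rstrip("/").rsplit("/", 1)[-1])
        if range_match is None:
            continue
        start, end = int(range_match.group(1)), int(range_match.group(2))
        if start <= number <= end:
            return name
    return None


example_folders = ["PMC5437900-PMC5441899", "PMC5485900-PMC5489899"]
print(find_containing_folder("PMC5488252", example_folders))
```

Note that the same range name appears in both the OA and the NON-OA listings above, so a match on the folder name alone does not tell us in which subset the archive actually exists.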

In [130]:
import urllib.request
urllib.request.URLError

urllib.error.URLError

In [131]:
importlib.reload(pmc_tables.download)
downloader = pmc_tables.download.EbiDownloader()

In [132]:
downloader.source_urls

['ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl']

In [133]:
downloader.source_urls.insert(0, op.abspath('../downloads/ebi/pmc/suppl'))

In [134]:
downloader.source_urls

['/home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl',
 'ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl']

In [135]:
downloader.download_ebi_suppl(pmc_id)

Could not download file /home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/OA/PMC5445900-PMC5449899/PMC5448120.zip.
unknown url type: '/home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/OA/PMC5445900-PMC5449899/PMC5448120.zip'.
Could not download file ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl/OA/PMC5445900-PMC5449899/PMC5448120.zip.
<urlopen error ftp error: URLError("ftp error: error_perm('550 Failed to change directory.',)",)>.
Could not download file /home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/NON-OA/PMC5445900-PMC5449899/PMC5448120.zip.
unknown url type: '/home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/NON-OA/PMC5445900-PMC5449899/PMC5448120.zip'.
Could not download file ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl/NON-OA/PMC5445900-PMC5449899/PMC5448120.zip.
<urlopen error ftp error: URLError("ftp error: error_perm('550 Failed to change directory.',)",)>.
Could not download suppl file for PM
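
The `unknown url type` messages above appear because a bare filesystem path has no URL scheme, which `urllib` needs in order to pick a handler. One possible workaround, sketched below, is to convert local paths to `file://` URLs before adding them to `downloader.source_urls`. The helper `as_url` is hypothetical and is not part of `pmc_tables`.

```python
from pathlib import Path
from urllib.parse import urlsplit

# Schemes that are passed through unchanged.
_KNOWN_SCHEMES = ("ftp", "http", "https", "file")


def as_url(source):
    """Return `source` unchanged if it is already a URL, else a file:// URL."""
    if urlsplit(source).scheme in _KNOWN_SCHEMES:
        return source
    # Path.as_uri() requires an absolute path.
    return Path(source).absolute().as_uri()


print(as_url("ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl"))
print(as_url("../downloads/ebi/pmc/suppl"))
```

With this in place, the local mirror could be registered as `downloader.source_urls.insert(0, as_url('../downloads/ebi/pmc/suppl'))`, assuming the downloader opens its sources with `urllib`, as the error messages suggest.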

In [75]:
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = pool.map(downloader.download_ebi_suppl, df_pmc['pmc'].values.tolist())
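
`pool.map` returns a lazy iterator of results rather than futures, and it re-raises the first exception as soon as the results are consumed, which is what happens when `list(futures)` is called further down. If the goal is to keep going after individual failures, one option is `submit` together with `as_completed`, as in the sketch below. The function `download_all` and the fake downloader are made up for illustration; in the notebook, `download_fn` would be `downloader.download_ebi_suppl`.

```python
import concurrent.futures


def download_all(download_fn, pmc_ids, max_workers=4):
    """Run `download_fn` for every id and collect successes and failures separately."""
    succeeded, failed = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_id = {pool.submit(download_fn, pmc_id): pmc_id for pmc_id in pmc_ids}
        for future in concurrent.futures.as_completed(future_to_id):
            pmc_id = future_to_id[future]
            try:
                succeeded[pmc_id] = future.result()
            except Exception as error:  # keep going; record what went wrong
                failed[pmc_id] = error
    return succeeded, failed


def _fake_download(pmc_id):
    """Stand-in for a real downloader: fails for one hard-coded id."""
    if pmc_id == "PMC5448120":
        raise Exception(f"Could not download suppl file for {pmc_id}!")
    return pmc_id + ".zip"


ok, errors = download_all(_fake_download, ["PMC5529024", "PMC5448120", "PMC5488252"])
print(sorted(ok), sorted(errors))
```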

In [139]:
ftp.getresp('find')

TypeError: getresp() takes 1 positional argument but 2 were given

EOFError: 

In [78]:
logging.getLogger('pmc_tables.download').setLevel(logging.DEBUG)

In [136]:
for pmc_id in df_pmc['pmc'].values:
    downloader.download_ebi_suppl(pmc_id)

File /home/kimlab2/database_data/datapkg/pmc_tables/notebooks/.pmc/suppl/OA/PMC5525900-PMC5529899/PMC5529024.zip already exists.
File /home/kimlab2/database_data/datapkg/pmc_tables/notebooks/.pmc/suppl/OA/PMC5469900-PMC5473899/PMC5470696.zip already exists.
File /home/kimlab2/database_data/datapkg/pmc_tables/notebooks/.pmc/suppl/OA/PMC5437900-PMC5441899/PMC5439685.zip already exists.
File /home/kimlab2/database_data/datapkg/pmc_tables/notebooks/.pmc/suppl/OA/PMC5437900-PMC5441899/PMC5439057.zip already exists.
File /home/kimlab2/database_data/datapkg/pmc_tables/notebooks/.pmc/suppl/OA/PMC5485900-PMC5489899/PMC5488252.zip already exists.
Could not download file /home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/OA/PMC5445900-PMC5449899/PMC5448120.zip.
unknown url type: '/home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/OA/PMC5445900-PMC5449899/PMC5448120.zip'.
Could not download file ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl/OA/PMC5445900-PMC54

Could not download file /home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/OA/PMC5325900-PMC5329899/PMC5326559.zip.
unknown url type: '/home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/OA/PMC5325900-PMC5329899/PMC5326559.zip'.
Could not download file ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl/OA/PMC5325900-PMC5329899/PMC5326559.zip.
<urlopen error ftp error: URLError("ftp error: error_perm('550 Failed to change directory.',)",)>.
Could not download file /home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/NON-OA/PMC5325900-PMC5329899/PMC5326559.zip.
unknown url type: '/home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/NON-OA/PMC5325900-PMC5329899/PMC5326559.zip'.
Could not download file ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl/NON-OA/PMC5325900-PMC5329899/PMC5326559.zip.
<urlopen error ftp error: URLError("ftp error: error_perm('550 Failed to change directory.',)",)>.
Could not download suppl file for PM

Could not download file ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl/OA/PMC5125900-PMC5129899/PMC5126365.zip.
<urlopen error ftp error: URLError("ftp error: error_perm('550 Failed to change directory.',)",)>.
Could not download file /home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/NON-OA/PMC5125900-PMC5129899/PMC5126365.zip.
unknown url type: '/home/kimlab2/database_data/datapkg/pmc_tables/downloads/ebi/pmc/suppl/NON-OA/PMC5125900-PMC5129899/PMC5126365.zip'.
Could not download file ftp://ftp.ebi.ac.uk/pub/databases/pmc/suppl/NON-OA/PMC5125900-PMC5129899/PMC5126365.zip.
<urlopen error ftp error: URLError("ftp error: error_perm('550 Failed to change directory.',)",)>.
Could not download suppl file for PMC5126365!


ValueError: invalid literal for int() with base 10: 'print'

In [76]:
results = list(futures)

Exception: Could not download suppl file for PMC5448120!

<generator object Executor.map.<locals>.result_iterator at 0x7f96083bc6d0>

In [47]:
len(oa_folders)

1208

In [48]:
len(non_oa_folders)

1273

In [26]:
df.head(2)

Unnamed: 0,pmid,title,authors,journal,year_published,abstract,mesh_terms,doi,pmc
0,28846085,A two-amino-acid substitution in the transcription factor RORγt disrupts its function in TH17 differentiation but no...,"[He, Ma, Wang, Zhang, Huang, Wang, Sen, Rothenberg, Sun]",Nat Immunol,2017.0,"The transcription factor RORγt regulates differentiation of the TH17 subset of helper T cells, thymic T cell develop...","[Amino Acid Substitution, Animals, Biomarkers, Cell Differentiation, Cluster Analysis, Encephalomyelitis, Autoimmune...",10.1038/ni.3832,
1,28738245,Two-amino acids change in the nsp4 of SARS coronavirus abolishes viral replication.,"[Sakai, Kawachi, Terada, Omori, Matsuura, Kamitani]",Virology,2017.0,Infection with coronavirus rearranges the host cell membrane to assemble a replication/transcription complex in whic...,"[Amino Acid Substitution, DNA Mutational Analysis, Protein Interaction Mapping, SARS Virus, Viral Nonstructural Prot...",10.1016/j.virol.2017.07.019,


In [27]:
# Number of papers with missing DOIs
df['doi'].isnull().sum()

1082

In [28]:
df[df['doi'].isnull()].head(2)

Unnamed: 0,pmid,title,authors,journal,year_published,abstract,mesh_terms,doi,pmc
107,27312559,A Novel Hemoglobin Variant Associated with Congenital Erythrocytosis: Hb Seoul [β86(F2)Ala→Thr] (HBB:c.259G>A).,"[Shin, Bang, Kim]",Ann Clin Lab Sci,2016.0,"We report the identification of a novel hemoglobin (Hb) variant [β86(F2)Ala→Thr; HBB: c.259G>A], Hb Seoul, causing c...","[Adult, Amino Acid Sequence, Amino Acid Substitution, Base Sequence, Hemoglobins, Abnormal, Humans, Male, Polycythem...",,
108,27305778,[Genetic evolution and substitution frequency of avian influenza virus HA gene in chicken H9N2 subtype in China in t...,"[Meng, Xu, Zhang, Huang, Zhang, Liu, Chang, Qin]",Wei Sheng Wu Xue Bao,2016.0,Low pathogenic avian influenza (LPAI) H9N2 subtype virus has been prevalent in domestic poultry in China over two de...,"[Amino Acid Sequence, Amino Acid Substitution, Animals, Chickens, China, Evolution, Molecular, Genotype, Hemagglutin...",,


# Output

**Note**:
- `pubmed_url` can be converted to an actual URL by prepending <https://www.ncbi.nlm.nih.gov>.
- `doi` can be converted to a URL by prepending <https://doi.org/>
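
The two conversions in the note can be written as small helpers. This is a minimal sketch: the function names are hypothetical, and the example `pubmed_url` value assumes a path of the form `/pubmed/<pmid>`, which is not shown in this notebook.

```python
from urllib.parse import quote

NCBI_BASE_URL = "https://www.ncbi.nlm.nih.gov"
DOI_BASE_URL = "https://doi.org/"


def pubmed_url_to_url(pubmed_url):
    """Prepend the NCBI host to a site-relative `pubmed_url`."""
    return NCBI_BASE_URL + pubmed_url


def doi_to_url(doi):
    """Prepend the DOI resolver; percent-encode characters that are unsafe in URLs."""
    return DOI_BASE_URL + quote(doi, safe="/")


print(doi_to_url("10.1038/ni.3832"))
print(pubmed_url_to_url("/pubmed/28846085"))  # assumed path format
```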

In [12]:
os.makedirs('output', exist_ok=True)

In [13]:
output_df = (
    df
    .loc[df['doi'].notnull(), :]
)
output_df.head(2)

Unnamed: 0,pmid,title,authors,journal,year_published,abstract,mesh_terms,doi,pmc
0,28846085,A two-amino-acid substitution in the transcription factor RORγt disrupts its function in TH17 differentiation but no...,"[He, Ma, Wang, Zhang, Huang, Wang, Sen, Rothenberg, Sun]",Nat Immunol,2017.0,"The transcription factor RORγt regulates differentiation of the TH17 subset of helper T cells, thymic T cell develop...","[Amino Acid Substitution, Animals, Biomarkers, Cell Differentiation, Cluster Analysis, Encephalomyelitis, Autoimmune...",10.1038/ni.3832,
1,28738245,Two-amino acids change in the nsp4 of SARS coronavirus abolishes viral replication.,"[Sakai, Kawachi, Terada, Omori, Matsuura, Kamitani]",Virology,2017.0,Infection with coronavirus rearranges the host cell membrane to assemble a replication/transcription complex in whic...,"[Amino Acid Substitution, DNA Mutational Analysis, Protein Interaction Mapping, SARS Virus, Viral Nonstructural Prot...",10.1016/j.virol.2017.07.019,


In [14]:
output_df.to_csv(f'output/{op.splitext(INPUT_FILE_NAME)[0]}.tsv', sep='\t', index=False)

In [15]:
!head output/{op.splitext(INPUT_FILE_NAME)[0]}.tsv -n 2

pmid	title	authors	journal	year_published	abstract	mesh_terms	doi	pmc
28846085	A two-amino-acid substitution in the transcription factor RORγt disrupts its function in TH17 differentiation but not in thymocyte development.	['He', 'Ma', 'Wang', 'Zhang', 'Huang', 'Wang', 'Sen', 'Rothenberg', 'Sun']	Nat Immunol	2017.0	The transcription factor RORγt regulates differentiation of the TH17 subset of helper T cells, thymic T cell development and lymph-node genesis. Although elimination of RORγt prevents TH17 cell-mediated experimental autoimmune encephalomyelitis (EAE), it also disrupts thymocyte development, which could lead to lethal thymic lymphoma. Here we identified a two-amino-acid substitution in RORγt (RORγt(M)) that 'preferentially' disrupted TH17 differentiation but not thymocyte development. Mice expressing RORγt(M) were resistant to EAE associated with defective TH17 differentiation but maintained normal thymocyte development and normal lymph-node genesis, except for Peyer's patch

# Download articles

In [45]:
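import time

from selenium import webdriver


def download_pdf(doi, download_folder):
    """Open the Sci-Hub pdf link for `doi` in Chrome so that it is saved to `download_folder`.

    Returns the name extracted from the url, or None if no pdf link was found.
    """
    url = _get_scihub_url(doi)
    if url is None:
        # Without this check, driver.get(None) raises
        # "WebDriverException: 'url' must be a string".
        print(f"No pdf link found for doi {doi}")
        return None

    options = webdriver.ChromeOptions()
    profile = {
        # Disable the built-in viewer so that pdfs are downloaded instead of displayed.
        "plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
        "download.default_directory": download_folder,
        "download.extensions_to_open": ""
    }
    options.add_experimental_option("prefs", profile)
    driver = webdriver.Chrome(chrome_options=options)
    try:
        driver.get(url)
        try:
            filename = url.split("/")[4].split(".cfm")[0]
        except IndexError as e:
            print(e)
            filename = url
        print(filename)
        time.sleep(10)  # give Chrome time to finish the download
    finally:
        # Always shut down the browser, even if loading the page fails.
        driver.quit()
    return filename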

In [46]:
download_results = []
for doi in output_df['doi'].values[100:120]:
    download_results.append(download_pdf(doi, OUTPUT_DIR))
    time.sleep(0.1)

kim2016.pdf




list index out of range
http://www.googletagmanager.com/ns.html?id=GTM-TP26BH




list index out of range
http://www.googletagmanager.com/ns.html?id=GTM-TP26BH
renaud2016.pdf
schaefer2016.pdf
sujjitjoon2016.pdf
claisse2016.pdf
kono2016.pdf
abe2016.pdf
hamasy2016.pdf
fajardo2016.pdf
wang2016.pdf
choi2016.pdf
murayama2015.pdf




WebDriverException: Message: unknown error: 'url' must be a string
  (Session info: chrome=61.0.3163.100)
  (Driver info: chromedriver=2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4),platform=Linux 4.4.0-96-generic x86_64)


In [47]:
output_df['doi'].values[100:110]

array(['10.1016/j.ijbiomac.2016.06.091', '10.1371/journal.pone.0158579',
       '10.1371/journal.pcbi.1004771', '10.1038/bjc.2016.182',
       '10.1002/pro.2888', '10.1038/jhg.2016.80',
       '10.1016/j.ibmb.2016.06.003', '10.1093/molbev/msw102',
       '10.1002/pro.2966', '10.1038/leu.2016.153'], dtype=object)