# Topic Modeling Scientific Text
*This template and workflow were developed by Margaret Gratian. This set of notebooks can be used to find topics in scientific text.*

____________________________________
## 1. Extract Articles from PubMed


**Notebook Goals**
- Demonstrate how to extract articles from PubMed for use in topic modeling.

**Major Caveats**
- See the requirements section for important information about the Entrez Direct command line tool.
- The articles are retrieved in the Medline file format because libraries exist to parse it and because it uses consistent field tags to identify abstracts (AB), titles (TI), and other pieces of information. Other file formats, such as XML, are also available.

**Requirements**
- This notebook runs a command line script using the Entrez Direct command line tool developed by NCBI.  Learn more about it here: https://www.ncbi.nlm.nih.gov/books/NBK179288/.
- Please see the README for instructions and recommendations on proper installation. The README guides you through setting up a virtual Python environment and installing the tool via:  
      `sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"`
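
Before running the search cell, it can help to confirm that the Entrez Direct executables are visible from the notebook's environment. The following is a minimal sketch using only the Python standard library. The helper name `missing_tools` is hypothetical, and the check only looks at the `PATH`; it does not verify that the tools run correctly.

```python
import shutil


def missing_tools(tools=("esearch", "efetch")):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("esearch and efetch are available.")
```

If either tool is reported as missing, the installation directory may need to be added to the `PATH` of the environment that launches the notebook; the README covers the recommended setup.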

 

**Inputs**
- No inputs to this notebook.

**Outputs**

The following is a recommended path for saving your data. If you modify it, be sure to modify the inputs and outputs of subsequent notebooks.

- Output Filepath 1: "../data/pubmed_raw_data.txt"
    - The result of a PubMed search. 

## Import Packages

In [None]:
import pandas as pd

## Dataset Development

### Use edirect to Get PubMed Data

We will use Entrez Direct, the command line interface to the PubMed E-utilities, to search PubMed for articles to use later in the topic modeling pipeline. As an example, here we search for "tobacco cessation."

As another example, if you had a specific list of PMIDs you wanted to retrieve, you could assign those PMIDs to a comma-separated string called `pmids_to_search` and then run the following command:

`! efetch -db pubmed -id $pmids_to_search -format medline > ../data/pubmed_raw_data.txt`
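
Because `efetch` expects the identifiers after `-id` as a comma-separated string, a Python list of PMIDs needs to be joined before it is passed to the shell. A minimal sketch of that step follows. The helper name `format_pmids` is hypothetical, and the PMIDs shown are placeholders for illustration, not real search results.

```python
def format_pmids(pmids):
    """Join PMIDs (ints or strings) into a comma-separated string, dropping duplicates."""
    seen = []
    for pmid in pmids:
        text = str(pmid).strip()
        if not text.isdigit():
            raise ValueError(f"Not a valid PMID: {pmid!r}")
        if text not in seen:
            seen.append(text)
    return ",".join(seen)


# Placeholder PMIDs for illustration only
pmids_to_search = format_pmids([11111111, "22222222", 11111111])
print(pmids_to_search)  # prints 11111111,22222222
```

Validating the identifiers up front means a malformed entry raises an error in Python rather than being passed to the shell command.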

In [None]:
# The ! at the start of the line runs it as a shell command
! esearch -db pubmed -query "tobacco cessation" | efetch -format medline > ../data/pubmed_raw_data.txt
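
As a quick sanity check on the downloaded file, the Medline text can be split into records and fields. The sketch below is a simplified parser written for illustration, assuming the usual Medline layout: a tag of up to four characters followed by `- `, continuation lines indented by six spaces, and a blank line between records. The function name `parse_medline` and the sample records are hypothetical. A dedicated library, such as Biopython's `Bio.Medline` module, is likely a better choice for real analysis.

```python
from collections import defaultdict


def parse_medline(text):
    """Parse Medline-formatted text into a list of {tag: [values]} dicts."""
    records = []
    current = None
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # Blank line: the current record (if any) is finished
            if current:
                records.append(dict(current))
            current, last_tag = None, None
        elif line.startswith("      ") and current is not None and last_tag:
            # Continuation line: extend the most recent value
            current[last_tag][-1] += " " + line.strip()
        elif line[4:6] == "- ":
            # New field: the tag occupies the first four columns
            if current is None:
                current = defaultdict(list)
            last_tag = line[:4].strip()
            current[last_tag].append(line[6:].strip())
    if current:
        records.append(dict(current))
    return records


# Made-up records in Medline layout, for illustration only
sample = """PMID- 00000001
TI  - A placeholder title that wraps
      onto a second line.
AB  - Placeholder abstract text.
AU  - Doe J
AU  - Roe R

PMID- 00000002
TI  - Another placeholder title.
"""

records = parse_medline(sample)
for record in records:
    print(record["PMID"][0], "|", record["TI"][0])
```

To apply this to the search results, read the file written by the cell above and pass its contents to `parse_medline`. Values are stored as lists because tags such as AU (author) can repeat within a record.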