# Topic Modeling Scientific Text

*This template and workflow were developed by Margaret Gratian. This set of notebooks can be used to find topics in scientific text.*
________________________________________
## 2. Format Raw PubMed Data into Table

**Notebook Goals**
- Demonstrate how to convert PubMed articles stored in a Medline-format text file into a table.

**Major Caveats**
- The raw PubMed data is stored in the Medline file format because libraries exist to parse it (we use biopython) and because it uses consistent field names to identify abstracts (AB), titles (TI), and other pieces of information. PubMed can also return other file formats, such as XML, but these are more complicated to parse.
- See here for more about parsing the Medline file format with the biopython library: https://biopython.org/docs/1.75/api/Bio.Medline.html  
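
To make the caveat above concrete, here is a minimal sketch of what Medline-style tagged text looks like and how its fields map to a dictionary. It is an illustration only, not biopython's implementation: the sample records, the PMIDs, and the helper name `parse_medline_text` are all hypothetical, and the sketch assumes a four-character tag, a `- ` separator, and six-space continuation lines.

```python
# Illustrative only: a tiny hand-rolled reader for Medline-style tagged text.
# The notebook itself relies on Bio.Medline.parse; this sketch just shows the layout.

# Hypothetical sample: two made-up records separated by a blank line.
sample = """\
PMID- 00000001
TI  - A hypothetical article title that wraps
      onto a continuation line.
AB  - A hypothetical abstract.

PMID- 00000002
TI  - A second hypothetical record with no abstract.
"""


def parse_medline_text(text):
    """Return a list of {tag: value} dicts (repeated tags are not handled here)."""
    records = []
    current = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record
            if current:
                records.append(current)
                current, last_tag = {}, None
            continue
        if line.startswith("      ") and last_tag is not None:
            # Continuation lines are indented and extend the previous field
            current[last_tag] += " " + line.strip()
            continue
        # Assumed layout: tag in the first four columns, then "- ", then the value
        last_tag = line[:4].strip()
        current[last_tag] = line[6:].strip()
    if current:
        records.append(current)
    return records


records = parse_medline_text(sample)
for record in records:
    # A missing AB field shows up as None, mirroring the handling in this notebook
    print(record["PMID"], "|", record.get("TI"), "|", record.get("AB"))
```

Real Medline files contain many more fields and repeated tags, which is why the notebook uses a tested parser rather than code like this.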

**Requirements**
- This notebook requires the biopython library. Learn more about it here: https://biopython.org/
- Please see the README for instructions and recommendations on proper installation.

**Inputs**

The following assumes you used the recommended path for saving your data in Notebook 1. If you modified it, be sure to modify the input path here.

- Input Filepath 1: "../data/pubmed_raw_data.txt"
    - Titles, abstracts, and other article information from PubMed, in the Medline file format


**Outputs**

The following is a recommended path for saving your data. If you modify it, be sure to modify the inputs and outputs of subsequent notebooks.

- Output Filepath 1: "../data/pubmed_text_tabular.csv"
    - The PubMed records from the Medline file, saved in CSV (tabular) format

## Import Packages

In [None]:
import pandas as pd
from Bio import Medline

## Read in Data

### Read in PubMed Medline data, using the biopython library to parse it

In [None]:
# Create empty list to hold extracted record information 
data = []

with open("../data/pubmed_raw_data.txt") as handle:
    # Use the Medline parser from biopython to parse the file
    records = Medline.parse(handle)

    # Iterate over the records (each record is a dictionary)
    for record in records:
        # Use the title if the key TI exists in the record; otherwise fall back to None
        title = record.get('TI')

        # Use the abstract if the key AB exists in the record; otherwise fall back to None
        abstract_data = record.get('AB')
        
        # Fill in new dictionary with just the values we're interested in 
        data_dict = {'pmid': record['PMID'], 'title': title, 'abstract': abstract_data}
        
        # Add to data
        data.append(data_dict)

In [None]:
# Check length of data 
# One publication did not have an abstract
print(len(data))

# Preview
data[0]
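
The length check above does not by itself show how many records lack a title or an abstract. A minimal sketch of one way to count them follows. It uses a small, hypothetical stand-in list named `example_data` (not the real `data`) so it can run on its own; the same comprehension could be pointed at `data`.

```python
# Hypothetical stand-in for the `data` list built above (made-up values)
example_data = [
    {"pmid": "1", "title": "First hypothetical title", "abstract": "Some text."},
    {"pmid": "2", "title": "Second hypothetical title", "abstract": None},
    {"pmid": "3", "title": None, "abstract": "More text."},
]

# Count how many records have None for each field of interest
missing_counts = {
    field: sum(1 for row in example_data if row[field] is None)
    for field in ("title", "abstract")
}

print(missing_counts)
```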

## Dataset Development

### Convert Data to Pandas DataFrame

The resulting Pandas DataFrame will have the following columns:
- pmid
- title
- abstract

In [None]:
# Read data into Pandas DataFrame
pubmed_text_data = pd.DataFrame.from_records(data)

# See shape
print(pubmed_text_data.shape)

# Preview
pubmed_text_data.head()

In [None]:
# Note that some titles and abstracts are missing; we will drop empty records in subsequent notebooks
pubmed_text_data.info()

## Save Outputs

In [None]:
# Save the tabular PubMed data
pubmed_text_data.to_csv("../data/pubmed_text_tabular.csv")
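
Because `to_csv` is called without `index=False`, the DataFrame index is written as an unnamed first column. The sketch below shows what that looks like and one way to read such a file back. It uses an in-memory buffer and a small hypothetical DataFrame so that nothing is written to disk. Reading `pmid` as a string is an assumption about what later steps may want, not something this workflow specifies.

```python
import io

import pandas as pd

# Hypothetical two-row table with the same columns as pubmed_text_data
example_df = pd.DataFrame.from_records([
    {"pmid": "1", "title": "First hypothetical title", "abstract": "Some text."},
    {"pmid": "2", "title": "Second hypothetical title", "abstract": None},
])

# Write to an in-memory buffer with the same call style as above
buffer = io.StringIO()
example_df.to_csv(buffer)

# The header starts with an empty name: that is the index column
buffer.seek(0)
header = buffer.readline().strip()
print(header)

# Read it back, treating the first column as the index again
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col=0, dtype={"pmid": str})
print(round_trip)
```

Missing values written as empty cells come back as NaN, so the missing abstract is still detectable after the round trip.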