#### Install necessary modules

In [1]:
!pip install -r requirements.txt



#### Import pubmed_oa_parser

"oa" stands for the PubMed Open Access dataset. The papers' XML files in this dataset share a uniform format, so a specialized XML parser was designed for it. This parser is built on the "lxml" module.

In [5]:
from pubmed_parser.pubmed_oa_parser import parse_pubmed_xml, parse_pubmed_paragraph, parse_pubmed_references, parse_pubmed_caption

A brief introduction to these four functions:

    1. parse_pubmed_xml:
       Takes the path to an XML file, or the byte string of an XML file, as input. It returns the main metadata of a PubMed paper, including title, abstract, pmid, doi, author_list, affiliation_list, etc.
       
    2. parse_pubmed_paragraph:
       Takes XML as input and returns a list of paragraphs. Each paragraph contains fields such as pmc, pmid, reference ids, section and text.
       
    3. parse_pubmed_references:
       Parses the citation information of a paper, mapping reference ids to pmids.
       
    4. parse_pubmed_caption:
       Parses the figure captions, etc.

#### Examples

In [6]:
path_to_xml_file = "data/pubmed/xml/PMC3339580.nxml"

##### parse_pubmed_xml

In [13]:
parsed_paper_info = parse_pubmed_xml(path_to_xml_file)

In [8]:
parsed_paper_info

{'full_title': 'Decolorization and partial mineralization of a polyazo dye by  Bacillus firmus  immobilized within tubular polymeric gel',
 'abstract': 'The degradation of C.I. Direct red 80, a polyazo dye, was investigated using Bacillus firmus immobilized by entrapment in tubular polymeric gel. This bacterial strain was able to completely decolorize 50\xa0mg/L of C.I. Direct red 80 under anoxic conditions within 12\xa0h and also degrade the reaction intermediates (aromatic amines) during the subsequent 12\xa0h under aerobic conditions. The tubular gel harboring the immobilized cells consisted of anoxic and aerobic regions integrated in a single unit which was ideal for azo dye degradation studies. Results obtained show that effective dye decolorization (97.8%), chemical oxygen demand (COD) reduction (91.7%) and total aromatic amines removal were obtained in 15\xa0h with the immobilized bacterial cell system whereas for the free cells, a hydraulic residence time of 24\xa0h was require

Passing the XML content as a string instead of a file name is also supported, as shown below:

In [15]:
with open(path_to_xml_file, "r") as f:
    xml_string = f.read()
parsed_paper_info_2 = parse_pubmed_xml(xml_string)
## the results are identical to those obtained from the file path
assert parsed_paper_info_2 == parsed_paper_info

This is useful when we want to preprocess the XML before parsing it, e.g. to normalize the citation markers.
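As an illustration of such preprocessing, here is a minimal sketch that rewrites the visible text of each bibliographic cross-reference so that it carries its reference id. The XML snippet, the regular expression and the helper `normalize_citation_markers` are all hypothetical and not part of pubmed_parser. The pattern also assumes that `ref-type` appears before `rid` inside the `<xref>` tag, which may not hold for every file.

```python
import re

# Hypothetical snippet in the tag style used by PubMed OA files (not taken from PMC3339580).
sample_xml = (
    '<p>Azo dyes are widely used <xref ref-type="bibr" rid="CR57">[1]</xref> '
    'and hard to degrade <xref ref-type="bibr" rid="CR58">[2]</xref>.</p>'
)

# Match a bibliographic cross-reference and capture its rid attribute.
# Assumption: ref-type precedes rid inside the opening tag.
XREF_PATTERN = re.compile(
    r'<xref\b[^>]*\bref-type="bibr"[^>]*\brid="([^"]+)"[^>]*>.*?</xref>',
    flags=re.DOTALL,
)

def normalize_citation_markers(xml_text):
    """Replace the visible text of each bibr xref with its reference id."""
    return XREF_PATTERN.sub(
        lambda m: '<xref ref-type="bibr" rid="{0}">{0}</xref>'.format(m.group(1)),
        xml_text,
    )

cleaned_xml = normalize_citation_markers(sample_xml)
print(cleaned_xml)
```

The cleaned string could then be passed to parse_pubmed_xml in the same way as xml_string above. A regular expression is used here only to keep the sketch short; for irregular markup, editing the tree with an XML library is likely the safer choice.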

##### parse_pubmed_paragraph

In [20]:
## all_paragraph=True keeps a paragraph even if it contains no citation.
parsed_paragraphs = parse_pubmed_paragraph(xml_string, all_paragraph=True)

In [26]:
parsed_paragraphs[10]

{'pmc': '3339580',
 'pmid': '22582158',
 'reference_ids': [],
 'section': 'Materials and methods/Culture conditions for degradation',
 'text': 'To determine the fate of aromatic amines generated during biodegradation of azo dyes, batch sequential anoxic/aerobic culture experiments were carried using the SWM supplemented with 50\xa0mg/L of C.I. Direct red 80. The experiment was started by inoculating the medium with the bacterium and incubating at 30\xa0°C for 12\xa0h under anoxic conditions or until no color was observed. Subsequently, same flasks were incubated under aerobic conditions as previously described for another 12\xa0h at 30\xa0°C. Abiotic control flasks were also set up and kept under similar conditions as the inoculated ones. At pre-determined intervals (every 3\xa0h), samples were withdrawn from each flask for the determination of percentage dye decolorization, percentage COD reduction and residual TAA concentration.'}
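Once the paragraphs are available as a list of dictionaries, ordinary Python is enough to filter or group them. The sketch below uses hand-written stand-ins for the parser output rather than real results: the first record is invented for illustration, and the second copies the pmc, pmid, reference_ids and section values shown above. It also assumes that "/" separates nested section titles, which is inferred from that single output and not confirmed by the library.

```python
from collections import defaultdict

# Stand-in records with the same keys as the parser output; the first one is hypothetical.
sample_paragraphs = [
    {"pmc": "3339580", "pmid": "22582158", "reference_ids": ["CR57"],
     "section": "Introduction", "text": "(hypothetical paragraph text)"},
    {"pmc": "3339580", "pmid": "22582158", "reference_ids": [],
     "section": "Materials and methods/Culture conditions for degradation",
     "text": "(text omitted)"},
]

def top_level_section(paragraph):
    """Return the part of the section path before the first '/' (assumed separator)."""
    return paragraph["section"].split("/")[0]

def group_by_top_level_section(paragraphs):
    """Group paragraph dictionaries by their top-level section title."""
    grouped = defaultdict(list)
    for paragraph in paragraphs:
        grouped[top_level_section(paragraph)].append(paragraph)
    return dict(grouped)

# Keep only the paragraphs that cite at least one reference.
cited_paragraphs = [p for p in sample_paragraphs if p["reference_ids"]]
grouped = group_by_top_level_section(sample_paragraphs)
print(sorted(grouped))
```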

##### parse_pubmed_references

In [23]:
parsed_refs = parse_pubmed_references(xml_string)

In [25]:
parsed_refs[:2]

[{'pmid': '22582158',
  'pmc': '3339580',
  'ref_id': 'CR57',
  'pmid_cited': '19879175',
  'doi_cited': '',
  'article_title': 'Rapid decolorization of azo dyes in aqueous solution by an ultrasound-assisted electrocatalytic oxidation process',
  'name': 'Z Ai; J Li; L Zhang; S Lee',
  'year': '2010',
  'journal': 'Ultrason Sonochem',
  'journal_type': 'journal'},
 {'pmid': '22582158',
  'pmc': '3339580',
  'ref_id': 'CR58',
  'pmid_cited': '17070992',
  'doi_cited': '',
  'article_title': 'Continuous fixed bed biosorption of reactive dyes by dried Rhizopus arrhizus: determination of column capacity',
  'name': 'Z Aksu; SS Cagatay; F Gonen',
  'year': '2007',
  'journal': 'J Hazard Mater',
  'journal_type': 'journal'}]
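Because each reference record pairs a ref_id with a pmid_cited, the list can be turned into a lookup table for resolving the reference_ids of a paragraph. The following is a minimal sketch, not part of pubmed_parser: the two records reuse a subset of the fields from the output above, while the helper names and the example list of reference ids are hypothetical.

```python
# A subset of the fields from the two references shown above.
sample_refs = [
    {"ref_id": "CR57", "pmid_cited": "19879175", "doi_cited": ""},
    {"ref_id": "CR58", "pmid_cited": "17070992", "doi_cited": ""},
]

def build_ref_to_pmid(refs):
    """Map each reference id to the cited pmid, skipping references without one."""
    return {ref["ref_id"]: ref["pmid_cited"] for ref in refs if ref["pmid_cited"]}

def resolve_reference_ids(reference_ids, ref_to_pmid):
    """Translate a paragraph's reference ids into pmids, dropping unknown ids."""
    return [ref_to_pmid[ref_id] for ref_id in reference_ids if ref_id in ref_to_pmid]

ref_to_pmid = build_ref_to_pmid(sample_refs)

# Hypothetical reference_ids of a paragraph; "CR99" has no matching record here.
resolved = resolve_reference_ids(["CR58", "CR99", "CR57"], ref_to_pmid)
print(resolved)
```

Skipping references with an empty pmid_cited is a design choice for this sketch; keeping them with a placeholder value would work as well if the absence of a pmid needs to be tracked.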