Abstracts from PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) are used as data. The search term was 'vitamin d'. The data is plain text and is transformed into a pandas DataFrame.

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None

In [2]:
with open('../data/pubmed_result.txt', 'r', encoding="latin1") as myfile:
    data = myfile.read()

test_slice = data[:8331]

In [3]:
# the first 3 abstracts
print(test_slice)


1. Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7. doi: 10.1016/j.bone.2018.09.017.
[Epub ahead of print]

Quality of life in hypoparathyroidism.

Vokes T(1).

Author information: 
(1)The University of Chicago, Chicago, IL, United States. Electronic address:
tvokes@medicine.bsd.uchicago.edu.

Hypoparathyroidism is a rare endocrine disorder where deficiency (or lack of
effect) of parathyroid hormone results in disordered mineral metabolism leading
to hypocalcemia and hyperphosphatemia. Many patients with this disorder have
physical, emotional and cognitive complaints suggestive of impaired quality of
life (QOL). Several recent studies have demonstrated that hypoparathyroid
patients treated with calcium and vitamin D (conventional therapy) have reduced
QOL compared to either suitable controls or general population. QOL has also been
studied during treatment with PTH1-84, which has been FDA approved in the USA as 
an adjunct to calcium and vitamin D in patients not adequately controlled o

# Problems to solve
- Splitting the txt file into single abstracts and putting each of them into a row of a DataFrame
    - determining a common denominator that every abstract has, preferably at the end of the abstract, to use as a split condition
- Determining which abstracts are non-English and removing them
- Removing abstracts that do not belong to actual publications but to 'Comments', 'Correspondence' or 'Opinions' on a publication, case reports, or abstracts that read more like a post to a self-help forum (therefore abstracts that contain 'I am' are removed)
- Removing differences such as different phrasing for copyright
- Removing the numbering of the abstract
- Getting the PMID (PubMed ID) of every abstract into a new column as an identifier
- Getting the year of publication into a new column
- Getting the title into a new column
- Getting the authors into a new column
    - removing the numbering and brackets
    - fusing initials with last name
- Getting the professor into a new column
    - discarding all authors except for the last one (the professor)
- Isolating the clean abstract text, without title etc.

### Getting the abstracts into a DataFrame
- PMID was chosen as a split condition:
    - it is at the end of every abstract
    - it has a pre-determined format
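
As a small illustration of why the PMID works as a split condition: when the pattern is wrapped in a capturing group, `re.split` keeps the matched delimiter in its result, so text and PMID alternate in the list and can be joined pairwise. The sample string below is made up for this sketch and is not taken from the data.

```python
import re

# Hypothetical miniature input: two records, each ending in a PMID
sample = (
    "\n1. First abstract text.\n\nPMID: 111"
    "\n\n\n2. Second abstract text.\n\nPMID: 22222"
)

# The capturing group makes re.split keep the delimiter in the result
parts = re.split(r'(PMID: \d+)', sample)

# Even positions hold the text, odd positions the matching PMID
records = [text + pmid for text, pmid in zip(parts[0::2], parts[1::2])]

print(len(records))
print(repr(records[1]))
```

The trailing empty string that `re.split` returns after the last PMID is dropped by `zip`, because it has no PMID partner.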

In [4]:
# Split the text after each PMID; the capturing group keeps the PMID in the result
data_list = re.split(r'(PMID: \d+)', data)

# The previous split returned a list in which abstract and PMID alternate; each pair is joined together again
new_data_list = []

for entry in range(0, len(data_list) - 1, 2):
    new_data_list.append(data_list[entry] + data_list[entry + 1])

# The abstracts are added to a new DataFrame as rows
df_data = pd.DataFrame()
df_data['abstract'] = new_data_list

In [5]:
df_data.head()

Unnamed: 0,abstract
0,\n1. Bone. 2018 Sep 24. pii: S8756-3282(18)303...
1,\n\n\n2. Neurosci Lett. 2018 Sep 24. pii: S03...
2,\n\n\n3. Antivir Ther. 2018 Sep 27. doi: 10.3...
3,\n\n\n4. Fetal Pediatr Pathol. 2018 Sep 27:1-...
4,\n\n\n5. Clin Exp Immunol. 2018 Oct;194(1):17...


In [6]:
# This is the first abstract
print(df_data.iloc[0,0])


1. Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7. doi: 10.1016/j.bone.2018.09.017.
[Epub ahead of print]

Quality of life in hypoparathyroidism.

Vokes T(1).

Author information: 
(1)The University of Chicago, Chicago, IL, United States. Electronic address:
tvokes@medicine.bsd.uchicago.edu.

Hypoparathyroidism is a rare endocrine disorder where deficiency (or lack of
effect) of parathyroid hormone results in disordered mineral metabolism leading
to hypocalcemia and hyperphosphatemia. Many patients with this disorder have
physical, emotional and cognitive complaints suggestive of impaired quality of
life (QOL). Several recent studies have demonstrated that hypoparathyroid
patients treated with calcium and vitamin D (conventional therapy) have reduced
QOL compared to either suitable controls or general population. QOL has also been
studied during treatment with PTH1-84, which has been FDA approved in the USA as 
an adjunct to calcium and vitamin D in patients not adequately controlled o

In [7]:
print(f'Number of abstracts: {len(df_data)}')

Number of abstracts: 77886


### Removing the non-English abstracts
- Many non-English abstracts are marked by PubMed with '[Article in ...]' and can therefore easily be dropped
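
A minimal sketch of this kind of filter on a made-up two-row frame. It searches for `'[Article in'` including the opening bracket, which is an assumption of this sketch and slightly stricter than the plain `'Article in'` used in this notebook.

```python
import pandas as pd

# Hypothetical two-row frame; the second row carries PubMed's language marker
toy = pd.DataFrame({'abstract': [
    'Bone. 2018.\n\nQuality of life in hypoparathyroidism.',
    'Orv Hetil. 2018.\n\n[Some title].\n\n[Article in Hungarian]',
]})

# regex=False treats the marker as a plain substring; ~ inverts the boolean mask
mask = toy['abstract'].str.contains('[Article in', regex=False)
english_only = toy[~mask]

print(len(english_only))
```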

In [8]:
df_data = df_data[~df_data.abstract.str.contains('Article in', regex=False)]

In [9]:
print(f'Remaining abstracts: {len(df_data)}')

Remaining abstracts: 68606


### Removing comments and correspondence
- Abstracts that belong to comments or correspondence are marked as such by PubMed and can easily be removed
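
One possible way to apply several markers in a single pass is to escape each marker and join them into one alternation. This is only a sketch on made-up rows; the marker list mirrors the strings filtered in this notebook, and `toy` and `kept` are hypothetical names.

```python
import re
import pandas as pd

# Marker strings taken from this notebook's filters
markers = ['Comment on\n', 'Correspondence re:', 'I am', 'Diagnosis:', '[Opinion]']

# re.escape neutralises the brackets so every marker is matched literally
pattern = '|'.join(re.escape(m) for m in markers)

# Hypothetical rows for illustration
toy = pd.DataFrame({'abstract': [
    'Comment on\n    Some journal. 2017.',
    'A randomized trial of vitamin D supplementation.',
    'Title. [Opinion]',
]})

kept = toy[~toy['abstract'].str.contains(pattern)]
print(kept['abstract'].tolist())
```

Note that a plain substring such as `'I am'` can also occur inside unrelated text, so a filter like this may remove more rows than intended.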

In [10]:
df_data = df_data[~df_data.abstract.str.contains('Comment on\n', regex=False)]
df_data = df_data[~df_data.abstract.str.contains('Correspondence re:', regex=False)]
df_data = df_data[~df_data.abstract.str.contains('I am', regex=False)]
df_data = df_data[~df_data.abstract.str.contains('Diagnosis:', regex=False)]
df_data = df_data[~df_data.abstract.str.contains('[Opinion]', regex=False)]


# The index is reset
df_data.reset_index(drop=True, inplace=True)

print(f'Remaining abstracts: {len(df_data)}')

Remaining abstracts: 66240


### Removing differences from abstracts
Differences need to be removed so that all abstracts have the same number of empty lines, which can then be used to slice each abstract into its parts (e.g. title, authors, main body).
- The declaration of copyright varies from abstract to abstract
- Some are indexed for MEDLINE
- Some abstracts are online before the paper is published ([Epub ahead of print])
    - the difficulty here is that, depending on the length of the title, the string can sit on one line or be split across two, and the split can fall at different points
- Some have an Epub number
- Some belong to an eCollection
- Some have a PMCID
- Some not only contain authors but also collaborators
- Some contain different indexing (tri, WNL)
    - only those that were obvious at superficial screening were removed
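
The wrapping problem with '[Epub ahead of print]' could also be handled with a single pattern, since `\s+` matches spaces and line breaks alike. This is a sketch on made-up header lines, not the approach used in the cell below. It also removes the line break inside a wrapped marker, so the number of newlines left behind can differ from what the literal replacements produce.

```python
import re

# Hypothetical header lines: the marker wraps at different points
headers = [
    'J Bone. 2018 Sep 24. doi: 10.1000/x. [Epub ahead of print]',
    'J Bone. 2018 Sep 24. doi: 10.1000/xyz. [Epub\nahead of print]',
    'J Bone. 2018 Sep 24. doi: 10.1000/xyzxyz. [Epub ahead of\nprint]',
]

# \s+ matches spaces and newlines, so one pattern covers every wrap point
epub_marker = re.compile(r'\[Epub\s+ahead\s+of\s+print\]')

cleaned = [epub_marker.sub('', h) for h in headers]
for line in cleaned:
    print(repr(line))
```

With pandas the same pattern could be passed to `.str.replace(..., regex=True)`.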

In [11]:
# regex=True is passed explicitly: since pandas 2.0, str.replace treats patterns as literal text by default
df_data['abstract'] = df_data['abstract'].str.replace(r'(Conflict of interest statement: \D+)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace('This article is protected by copyright. All rights reserved.', '', regex=False)
df_data['abstract'] = df_data['abstract'].str.replace(r'(Creative Commons Attribution License)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace(r'(© \d+\D+)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace('[Indexed for MEDLINE]', '', regex=False)
df_data['abstract'] = df_data['abstract'].str.replace('[Epub ahead of print]', '', regex=False)
df_data['abstract'] = df_data['abstract'].str.replace('[Epub', '', regex=False)
df_data['abstract'] = df_data['abstract'].str.replace('ahead of print]', '', regex=False)
df_data['abstract'] = df_data['abstract'].str.replace('print]', '', regex=False)
df_data['abstract'] = df_data['abstract'].str.replace(r'(Epub \d+\D+\d+\.)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace(r'(eCollection \d+)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace(r'(PMCID: PMC\d+)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace(r'(Collaborators: \D+)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace(r'(\d+/tri.\d+)', '', regex=True)
df_data['abstract'] = df_data['abstract'].str.replace(r'(\d+/WNL.\d+)', '', regex=True)

In [12]:
# First abstract with some of the copyright and Epub comment removed
print(df_data.iloc[0,0])


1. Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7. doi: 10.1016/j.bone.2018.09.017.


Quality of life in hypoparathyroidism.

Vokes T(1).

Author information: 
(1)The University of Chicago, Chicago, IL, United States. Electronic address:
tvokes@medicine.bsd.uchicago.edu.

Hypoparathyroidism is a rare endocrine disorder where deficiency (or lack of
effect) of parathyroid hormone results in disordered mineral metabolism leading
to hypocalcemia and hyperphosphatemia. Many patients with this disorder have
physical, emotional and cognitive complaints suggestive of impaired quality of
life (QOL). Several recent studies have demonstrated that hypoparathyroid
patients treated with calcium and vitamin D (conventional therapy) have reduced
QOL compared to either suitable controls or general population. QOL has also been
studied during treatment with PTH1-84, which has been FDA approved in the USA as 
an adjunct to calcium and vitamin D in patients not adequately controlled on
conventional therap

### Remove numbering from abstracts
- the difficulty with removing the numbering is that there are other numbers in the same line (e.g. the year), and also numbers followed by a period (e.g. the day)
    - since every abstract has at least one newline (\n) at the beginning, the newline was included in the search pattern so that numbers in the middle of a line (year, day) are not matched; note that a line further down that starts with a number and a period would match as well
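
A hedged alternative sketch: anchoring the pattern at the very start of the string with `\A` restricts the replacement to the leading record number. The sample string is made up and deliberately contains a body line that starts with a number and a period.

```python
import re

# Hypothetical record: leading number "25." plus a body line starting with "12."
raw = (
    '\n\n\n25. Bone. 2018 Sep 24. pii: S8756.\n\nTitle.\n\n'
    'Body text ending a line with\n12. more text'
)

# \A anchors at the start of the string, so only the record number is removed
anchored = re.sub(r'\A\s*\d+\.\s*', '', raw)

# Without the anchor, the body line starting with "12." is hit as well
unanchored = re.sub(r'\n+\d+\.', '', raw)

print(repr(anchored[:30]))
print('12.' in anchored, '12.' in unanchored)
```

In pandas this would correspond to `.str.replace(r'\A\s*\d+\.\s*', '', regex=True)`, which also makes the separate `lstrip` step unnecessary.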

In [13]:
# the raw, uninterpreted text for better understanding
df_data.iloc[0,0]

'\n1. Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7. doi: 10.1016/j.bone.2018.09.017.\n\n\nQuality of life in hypoparathyroidism.\n\nVokes T(1).\n\nAuthor information: \n(1)The University of Chicago, Chicago, IL, United States. Electronic address:\ntvokes@medicine.bsd.uchicago.edu.\n\nHypoparathyroidism is a rare endocrine disorder where deficiency (or lack of\neffect) of parathyroid hormone results in disordered mineral metabolism leading\nto hypocalcemia and hyperphosphatemia. Many patients with this disorder have\nphysical, emotional and cognitive complaints suggestive of impaired quality of\nlife (QOL). Several recent studies have demonstrated that hypoparathyroid\npatients treated with calcium and vitamin D (conventional therapy) have reduced\nQOL compared to either suitable controls or general population. QOL has also been\nstudied during treatment with PTH1-84, which has been FDA approved in the USA as \nan adjunct to calcium and vitamin D in patients not adequately controlled o

In [14]:
df_data['abstract'] = df_data['abstract'].str.replace(r'(\n+\d+\.)', '', regex=True)

# leading whitespace is stripped
df_data['abstract'] = df_data['abstract'].str.lstrip()

In [15]:
# First abstract with the numbering removed, the year and the day intact
print(df_data.iloc[0,0])

Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7. doi: 10.1016/j.bone.2018.09.017.


Quality of life in hypoparathyroidism.

Vokes T(1).

Author information: 
(1)The University of Chicago, Chicago, IL, United States. Electronic address:
tvokes@medicine.bsd.uchicago.edu.

Hypoparathyroidism is a rare endocrine disorder where deficiency (or lack of
effect) of parathyroid hormone results in disordered mineral metabolism leading
to hypocalcemia and hyperphosphatemia. Many patients with this disorder have
physical, emotional and cognitive complaints suggestive of impaired quality of
life (QOL). Several recent studies have demonstrated that hypoparathyroid
patients treated with calcium and vitamin D (conventional therapy) have reduced
QOL compared to either suitable controls or general population. QOL has also been
studied during treatment with PTH1-84, which has been FDA approved in the USA as 
an adjunct to calcium and vitamin D in patients not adequately controlled on
conventional therapy. I

### Adding the PMID into a new column
- the PMID is at the end of the abstract and consists of at most 8 digits
- slice was used to take the last 8 characters and put them into a new column
    - some PMIDs have fewer than 8 digits, so the space, the : and sometimes the D are also taken; these need to be removed afterwards
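
As an alternative to slicing a fixed number of characters, the digits could be captured directly with `str.extract`, which sidesteps the clean-up for shorter PMIDs. This is a sketch on a made-up two-row frame, not the method used in the cells below.

```python
import pandas as pd

# Hypothetical rows: one 8-digit and one 7-digit PMID
toy = pd.DataFrame({'abstract': [
    'Some abstract text.\n\nPMID: 30261328',
    'An older abstract.\n\nPMID: 1234567',
]})

# (\d+) captures only the digits; $ ties the match to the end of the record
toy['PMID'] = toy['abstract'].str.extract(r'PMID: (\d+)\s*$', expand=False)

# Remove the PMID line (and the whitespace around it) from the abstract text
toy['abstract'] = toy['abstract'].str.replace(r'\s*PMID: \d+\s*$', '', regex=True)

print(toy['PMID'].tolist())
```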

In [16]:
df_data['PMID'] = df_data['abstract'].str.slice(-8,)

In [17]:
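df_data['abstract'] = df_data['abstract'].str.replace(r'(\n+PMID: \d+)', '', regex=True)
df_data['PMID'] = df_data['PMID'].str.replace(r'(\D+: )', '', regex=True)
df_data['PMID'] = df_data['PMID'].str.replace('(: )', '', regex=True)
# a 7-digit PMID leaves a leading space in the 8-character slice
df_data['PMID'] = df_data['PMID'].str.strip()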

In [18]:
df_data.head()

Unnamed: 0,abstract,PMID
0,Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7....,30261328
1,Neurosci Lett. 2018 Sep 24. pii: S0304-3940(18...,30261231
2,Antivir Ther. 2018 Sep 27. doi: 10.3851/IMP326...,30260797
3,Fetal Pediatr Pathol. 2018 Sep 27:1-11. doi: 1...,30260729
4,Clin Exp Immunol. 2018 Oct;194(1):17-26. doi: ...,30260469


### Adding the year of publication to a new column
- extracting four digits returns the year if they are preceded by a period or whitespace and followed either by a whitespace plus a capital letter, or by a semicolon, hyphen, period or colon.
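
The rule can be tried out with plain `re` on a few citation headers. The first two strings follow the headers shown in this notebook, the third is made up. The pattern is an equivalent rewrite with a single capturing group and is offered as a sketch only.

```python
import re

# Lookbehind: period or whitespace; lookahead: " X" (capital letter) or one of ; - . :
year_pattern = re.compile(r'(?<=[.\s])(\d{4})(?=\s[A-Z]|[;\-.:])')

citations = [
    'Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7.',
    'Clin Exp Immunol. 2018 Oct;194(1):17-26.',
    'Some J. 2017;12(3):45-50.',  # hypothetical header
]

years = [year_pattern.search(c).group(1) for c in citations]
print(years)
```

Because the first match wins, a journal name that itself contains a four-digit number in such a position would be picked up instead of the year.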

In [23]:
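# a single capturing group plus expand=False returns one Series instead of a multi-column DataFrame
df_data['year'] = df_data['abstract'].str.extract(r'(?<=[.\s])(\d{4})(?=\s[A-Z]|[;\-.:])', expand=False)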

In [24]:
df_data.head()

Unnamed: 0,abstract,PMID,year
0,Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7....,30261328,2018
1,Neurosci Lett. 2018 Sep 24. pii: S0304-3940(18...,30261231,2018
2,Antivir Ther. 2018 Sep 27. doi: 10.3851/IMP326...,30260797,2018
3,Fetal Pediatr Pathol. 2018 Sep 27:1-11. doi: 1...,30260729,2018
4,Clin Exp Immunol. 2018 Oct;194(1):17-26. doi: ...,30260469,2018


In [25]:
df_data.to_csv('../data/regex_save1.csv')