In [1]:
import pandas as pd
import numpy as np

pd.options.mode.chained_assignment = None

In [2]:
df = pd.read_csv('../data/regex_save1.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,abstract,PMID,year
0,Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7....,30261328,2018.0
1,Neurosci Lett. 2018 Sep 24. pii: S0304-3940(18...,30261231,2018.0
2,Antivir Ther. 2018 Sep 27. doi: 10.3851/IMP326...,30260797,2018.0
3,Fetal Pediatr Pathol. 2018 Sep 27:1-11. doi: 1...,30260729,2018.0
4,Clin Exp Immunol. 2018 Oct;194(1):17-26. doi: ...,30260469,2018.0


In [4]:
df['abstract'] = df.abstract.str.replace('\r', '', regex=False)

### Adding journal to a new column
- the journal name is everything before the period that precedes the year (or 'Spring'/'Summer')
- the pattern was refined after screening the lengths of the extracted journal titles

In [7]:
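# extract() returns one column per capture group; column 0 is the outermost group, i.e. the full journal name
df['journal'] = df['abstract'].str.extract(r'(((\w|\s|\(|\))*(?=(\.\s(\d{4}|Spring|Summer))))|((\w|\s|®)*\[Internet\](?=\.)))', expand=True)[0]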

In [8]:
df.head()

Unnamed: 0,abstract,PMID,year,journal
0,Bone. 2018 Sep 24. pii: S8756-3282(18)30355-7....,30261328,2018.0,Bone
1,Neurosci Lett. 2018 Sep 24. pii: S0304-3940(18...,30261231,2018.0,Neurosci Lett
2,Antivir Ther. 2018 Sep 27. doi: 10.3851/IMP326...,30260797,2018.0,Antivir Ther
3,Fetal Pediatr Pathol. 2018 Sep 27:1-11. doi: 1...,30260729,2018.0,Fetal Pediatr Pathol
4,Clin Exp Immunol. 2018 Oct;194(1):17-26. doi: ...,30260469,2018.0,Clin Exp Immunol
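
To see how the two alternatives of the pattern behave, the following sketch applies the same regular expression to single strings with Python's `re` module instead of pandas. The helper name `extract_journal` is made up for this illustration, and the `[Internet]` sample is a hypothetical record, not one taken from the data.

```python
import re

# Same pattern as in the extraction cell, split over two lines for readability.
JOURNAL_PATTERN = re.compile(
    r'(((\w|\s|\(|\))*(?=(\.\s(\d{4}|Spring|Summer))))'
    r'|((\w|\s|®)*\[Internet\](?=\.)))'
)

def extract_journal(first_line):
    """Return the outermost capture group of the first match, or None."""
    match = JOURNAL_PATTERN.search(first_line)
    return match.group(1) if match else None

# Journal name followed by '. <year>' (first alternative of the pattern).
print(extract_journal('Clin Exp Immunol. 2018 Oct;194(1):17-26.'))
# Hypothetical book-style record marked with '[Internet]' (second alternative).
print(extract_journal('StatPearls [Internet]. Treasure Island (FL).'))
```

The lookahead `(?=...)` only checks what follows the name, so the period and the year are not part of the extracted text.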


In [9]:
df.to_csv('../data/regex_save2.csv')

In [10]:
df['abstract'] = df.abstract.str.replace('2017–1565).\nAll participants provide written informed consent before participating in the\ntrial.Not applicable.The authors declare that they have no competing\ninterests.Springer Nature remains neutral with regard to jurisdictional claims in\npublished maps and institutional affiliations. ','', regex=False)

### Delete the first line
- the first line, which contains the journal, year etc., is deleted so that the title can be accessed more easily
- leading whitespace is stripped

In [11]:
df['abstract'] = df.abstract.str.split('\n\n', n=1).str.get(1)
df['abstract'] = df.abstract.str.lstrip()
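
As a plain-Python sketch of the same step for a single record (the function name `drop_first_block` and the shortened sample string are assumptions for illustration): the text is split once at the first blank line and only the remainder is kept.

```python
def drop_first_block(record):
    """Drop everything up to the first blank line and strip leading whitespace."""
    parts = record.split('\n\n', 1)
    if len(parts) < 2:
        # No blank line found; the pandas version yields NaN in this case.
        return None
    return parts[1].lstrip()

# Shortened sample: citation line, blank line, then title and authors.
sample = 'Bone. 2018 Sep 24.\n\n  Quality of life in hypoparathyroidism.\n\nVokes T(1).'
print(drop_first_block(sample))
```

Records without a blank line lose their content entirely, so it may be worth checking how many rows end up empty after this step.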

In [13]:
print(df.iloc[0,0])

Quality of life in hypoparathyroidism.

Vokes T(1).

Author information: 
(1)The University of Chicago, Chicago, IL, United States. Electronic address:
tvokes@medicine.bsd.uchicago.edu.

Hypoparathyroidism is a rare endocrine disorder where deficiency (or lack of
effect) of parathyroid hormone results in disordered mineral metabolism leading
to hypocalcemia and hyperphosphatemia. Many patients with this disorder have
physical, emotional and cognitive complaints suggestive of impaired quality of
life (QOL). Several recent studies have demonstrated that hypoparathyroid
patients treated with calcium and vitamin D (conventional therapy) have reduced
QOL compared to either suitable controls or general population. QOL has also been
studied during treatment with PTH1-84, which has been FDA approved in the USA as 
an adjunct to calcium and vitamin D in patients not adequately controlled on
conventional therapy. In open label studies, PTH therapy has resulted in dramatic
improvements in SF-36 s

### Adding the title to a new column
- splitting at the first '.' was considered first, but some titles contain periods themselves
- instead, a '.' or '?' followed by a blank line (two newlines) is used as the split condition, because it recognised most titles correctly

In [14]:
df['title'] = df['abstract'].str.split(r'(\.|\?)\n\n', n=2).str.get(0)

In [15]:
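# both patterns are regular expressions, so regex=True is passed explicitly
# (newer pandas versions treat the pattern as a literal string by default)
df['title'] = df.title.str.replace(r'\(|\)|,', '', regex=True)
df['title'] = df.title.str.replace(r'\n\n(\w|\s)*', '', regex=True)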
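
The split condition can be tried on a single string with `re.split`, which behaves like the pandas call for this pattern. This is a minimal sketch; `extract_title` and the record with "vs." in its title are hypothetical and only serve to show why a bare '.' is not enough as a delimiter.

```python
import re

TITLE_SPLIT = re.compile(r'(\.|\?)\n\n')

def extract_title(record):
    """Return the text before the first '.' or '?' that is followed by a blank line."""
    # Because of the capture group, re.split keeps each delimiter as its own
    # list item, so item 0 is the title without its closing punctuation.
    return TITLE_SPLIT.split(record, maxsplit=2)[0]

# Hypothetical record whose title contains a period inside "vs.".
record = 'Vitamin D3 vs. D2 in adults?\n\nDoe J(1).\n\nAuthor information: (1)Example University.'
print(extract_title(record))  # Vitamin D3 vs. D2 in adults
```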

In [16]:
df.head()

Unnamed: 0,abstract,PMID,year,journal,title
0,Quality of life in hypoparathyroidism.\n\nVoke...,30261328,2018.0,Bone,Quality of life in hypoparathyroidism
1,Vitamin D status and its association with seas...,30261231,2018.0,Neurosci Lett,Vitamin D status and its association with seas...
2,Tenofovir disoproxil fumarate appears to disru...,30260797,2018.0,Antivir Ther,Tenofovir disoproxil fumarate appears to disru...
3,"Vitamin D Levels in Active TB, Latent TB, Non-...",30260729,2018.0,Fetal Pediatr Pathol,Vitamin D Levels in Active TB Latent TB Non-TB...
4,Vitamin D receptor interacts with NLRP3 to res...,30260469,2018.0,Clin Exp Immunol,Vitamin D receptor interacts with NLRP3 to res...


### Adding the authors to a new column & cleaning it up
- a similar split as for the title is used (a lowercase letter before the blank line is also accepted), but the third entry is taken
- the authors are often numbered according to the university they belong to; this numbering needs to be removed
- in addition, the last author (taken here to be the professor) is stored in an extra column

In [17]:
df['author'] = df['abstract'].str.split(r'([a-z]|\.|\?)\n\n', n=2).str.get(2)
df['author'] = df['author'].str.strip('\n')

df.head()

Unnamed: 0,abstract,PMID,year,journal,title,author
0,Quality of life in hypoparathyroidism.\n\nVoke...,30261328,2018.0,Bone,Quality of life in hypoparathyroidism,Vokes T(1)
1,Vitamin D status and its association with seas...,30261231,2018.0,Neurosci Lett,Vitamin D status and its association with seas...,"Gu Y(1), Zhu Z(2), Luan X(2), He J(3)"
2,Tenofovir disoproxil fumarate appears to disru...,30260797,2018.0,Antivir Ther,Tenofovir disoproxil fumarate appears to disru...,"Havens PL(1), Long D(2), Schuster GU(3), Gordo..."
3,"Vitamin D Levels in Active TB, Latent TB, Non-...",30260729,2018.0,Fetal Pediatr Pathol,Vitamin D Levels in Active TB Latent TB Non-TB...,"Buonsenso D(1), Sali M(2), Pata D(1), Masiello..."
4,Vitamin D receptor interacts with NLRP3 to res...,30260469,2018.0,Clin Exp Immunol,Vitamin D receptor interacts with NLRP3 to res...,"Huang H(1), Hong JY(2), Wu YJ(2)(3), Wang EY(2..."
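
Why the third entry holds the authors becomes visible when the split result is printed for one record. The sketch below uses `re.split` on a shortened version of the first record; the variable names are chosen for illustration only.

```python
import re

AUTHOR_SPLIT = re.compile(r'([a-z]|\.|\?)\n\n')

record = ('Quality of life in hypoparathyroidism.\n\n'
          'Vokes T(1).\n\n'
          'Author information: \n(1)The University of Chicago, Chicago, IL, United States.')

# Two splits yield five items: text, delimiter, text, delimiter, rest.
parts = AUTHOR_SPLIT.split(record, maxsplit=2)
for position, part in enumerate(parts):
    print(position, repr(part))

# Items 1 and 3 are the captured delimiters, so the author list sits at index 2.
authors = parts[2].strip('\n')
print(authors)  # Vokes T(1)
```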


In [18]:
# removal of any new lines (\n)
df['author'] = df['author'].str.replace('\n', '', regex=False)

# removal of 'Author information:', which sometimes made it into the author column
df['author'] = df['author'].str.replace('Author information:', '', regex=False)

# removal of the numbering in brackets
df['author'] = df['author'].str.replace(r'\(\d+\)', '', regex=True)

# removal of anything in the cell except the authors
df['author'] = df['author'].str.replace(r';\s(.*)', '', regex=True)
df['author'] = df['author'].str.replace(r'\[No authors listed](.*)', '', regex=True)

# removal of all whitespace
df['author'] = df['author'].str.replace(' ', '', regex=False)

# replacing commas with whitespace
df['author'] = df['author'].str.replace(',', ' ', regex=False)

df.head()

Unnamed: 0,abstract,PMID,year,journal,title,author
0,Quality of life in hypoparathyroidism.\n\nVoke...,30261328,2018.0,Bone,Quality of life in hypoparathyroidism,VokesT
1,Vitamin D status and its association with seas...,30261231,2018.0,Neurosci Lett,Vitamin D status and its association with seas...,GuY ZhuZ LuanX HeJ
2,Tenofovir disoproxil fumarate appears to disru...,30260797,2018.0,Antivir Ther,Tenofovir disoproxil fumarate appears to disru...,HavensPL LongD SchusterGU GordonCM PriceG Wils...
3,"Vitamin D Levels in Active TB, Latent TB, Non-...",30260729,2018.0,Fetal Pediatr Pathol,Vitamin D Levels in Active TB Latent TB Non-TB...,BuonsensoD SaliM PataD MasielloE SalernoG Cecc...
4,Vitamin D receptor interacts with NLRP3 to res...,30260469,2018.0,Clin Exp Immunol,Vitamin D receptor interacts with NLRP3 to res...,HuangH HongJY WuYJ WangEY LiuZQ ChengBH MeiL L...
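
The cleaning steps can be followed on one author string. The function below is a sketch that mirrors the order of the replacements with the `re` module; the name `clean_authors` is an assumption for this example.

```python
import re

def clean_authors(raw):
    """Apply the cleaning steps to one author string, in the same order as above."""
    text = raw.replace('\n', '')
    text = text.replace('Author information:', '')
    text = re.sub(r'\(\d+\)', '', text)                  # affiliation numbers such as (1)
    text = re.sub(r';\s.*', '', text)                    # anything after a semicolon
    text = re.sub(r'\[No authors listed].*', '', text)   # placeholder for missing authors
    text = text.replace(' ', '')                         # 'Gu Y' becomes 'GuY'
    return text.replace(',', ' ')                        # one space between authors

cleaned = clean_authors('Gu Y(1), Zhu Z(2), Luan X(2), He J(3)')
print(cleaned)                 # GuY ZhuZ LuanX HeJ
print(cleaned.split(' ')[-1])  # HeJ
```

Removing the spaces before replacing the commas is what glues surname and initials together, so that each author becomes a single whitespace-separated token.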


In [19]:
# splitting at the whitespace between authors and taking the last entry
df['prof'] = df['author'].str.split(' ').str.get(-1)

df.head()

Unnamed: 0,abstract,PMID,year,journal,title,author,prof
0,Quality of life in hypoparathyroidism.\n\nVoke...,30261328,2018.0,Bone,Quality of life in hypoparathyroidism,VokesT,VokesT
1,Vitamin D status and its association with seas...,30261231,2018.0,Neurosci Lett,Vitamin D status and its association with seas...,GuY ZhuZ LuanX HeJ,HeJ
2,Tenofovir disoproxil fumarate appears to disru...,30260797,2018.0,Antivir Ther,Tenofovir disoproxil fumarate appears to disru...,HavensPL LongD SchusterGU GordonCM PriceG Wils...,StephensenCB
3,"Vitamin D Levels in Active TB, Latent TB, Non-...",30260729,2018.0,Fetal Pediatr Pathol,Vitamin D Levels in Active TB Latent TB Non-TB...,BuonsensoD SaliM PataD MasielloE SalernoG Cecc...,ValentiniP
4,Vitamin D receptor interacts with NLRP3 to res...,30260469,2018.0,Clin Exp Immunol,Vitamin D receptor interacts with NLRP3 to res...,HuangH HongJY WuYJ WangEY LiuZQ ChengBH MeiL L...,ZhengPY


### Flagging whether a blood serum level was determined
- the abstracts are checked for the units used to report the 25-OH vitamin D serum level
- the boolean values are then transformed to binary values

In [20]:
# a non-capturing group avoids the pandas warning about match groups in str.contains
df['blood'] = df.abstract.str.contains(r'nmol/(?:l|L)|ng/ml')


In [21]:
df.groupby('blood').count()

blood,abstract,PMID,year,journal,title,author,prof
False,59890,59890,59888,59882,59890,59875,59875
True,6350,6350,6350,6350,6350,6350,6350


In [22]:
df['blood'] = 1.0 * df['blood']

df.head()

Unnamed: 0,abstract,PMID,year,journal,title,author,prof,blood
0,Quality of life in hypoparathyroidism.\n\nVoke...,30261328,2018.0,Bone,Quality of life in hypoparathyroidism,VokesT,VokesT,0.0
1,Vitamin D status and its association with seas...,30261231,2018.0,Neurosci Lett,Vitamin D status and its association with seas...,GuY ZhuZ LuanX HeJ,HeJ,1.0
2,Tenofovir disoproxil fumarate appears to disru...,30260797,2018.0,Antivir Ther,Tenofovir disoproxil fumarate appears to disru...,HavensPL LongD SchusterGU GordonCM PriceG Wils...,StephensenCB,0.0
3,"Vitamin D Levels in Active TB, Latent TB, Non-...",30260729,2018.0,Fetal Pediatr Pathol,Vitamin D Levels in Active TB Latent TB Non-TB...,BuonsensoD SaliM PataD MasielloE SalernoG Cecc...,ValentiniP,0.0
4,Vitamin D receptor interacts with NLRP3 to res...,30260469,2018.0,Clin Exp Immunol,Vitamin D receptor interacts with NLRP3 to res...,HuangH HongJY WuYJ WangEY LiuZQ ChengBH MeiL L...,ZhengPY,0.0
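
The unit check is case-sensitive for `ng/ml`, which the following sketch makes visible on a few made-up sentences. `SERUM_UNITS` matches the same strings as the pattern used above; `SERUM_UNITS_ANY_CASE` is a hypothetical broader variant and not part of the original analysis.

```python
import re

# Same alternatives as in the cell above.
SERUM_UNITS = re.compile(r'nmol/[lL]|ng/ml')
# Hypothetical variant that ignores upper and lower case.
SERUM_UNITS_ANY_CASE = re.compile(r'nmol/l|ng/ml', re.IGNORECASE)

samples = [
    'Mean 25(OH)D was 52 nmol/L.',
    'Levels below 20 ng/ml were common.',
    'Levels below 20 ng/mL were common.',   # capital L is missed by SERUM_UNITS
    'No serum measurement was reported.',
]

for text in samples:
    as_written = bool(SERUM_UNITS.search(text))
    any_case = bool(SERUM_UNITS_ANY_CASE.search(text))
    print(f'{as_written!s:5} {any_case!s:5} {text}')
```

If abstracts that write `ng/mL` should count as well, the counts reported in this notebook would have to be recomputed with the broader pattern.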


### Flagging whether a vitamin D dosage was mentioned
- the abstracts are checked for the unit used to report a vitamin D dosage
- the boolean values are then transformed to binary values

In [23]:
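# written without a capture group to avoid the pandas warning about match groups
# note: the dots in 'I.U.' are unescaped and therefore match any character
df['dosage'] = df.abstract.str.contains(r'IU|I.U.')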


In [24]:
df.groupby('dosage').count()

dosage,abstract,PMID,year,journal,title,author,prof,blood
False,61929,61929,61927,61921,61929,61914,61914,61929
True,4311,4311,4311,4311,4311,4311,4311,4311


In [25]:
df['dosage'] = 1.0 * df['dosage']

df.head()

Unnamed: 0,abstract,PMID,year,journal,title,author,prof,blood,dosage
0,Quality of life in hypoparathyroidism.\n\nVoke...,30261328,2018.0,Bone,Quality of life in hypoparathyroidism,VokesT,VokesT,0.0,0.0
1,Vitamin D status and its association with seas...,30261231,2018.0,Neurosci Lett,Vitamin D status and its association with seas...,GuY ZhuZ LuanX HeJ,HeJ,1.0,0.0
2,Tenofovir disoproxil fumarate appears to disru...,30260797,2018.0,Antivir Ther,Tenofovir disoproxil fumarate appears to disru...,HavensPL LongD SchusterGU GordonCM PriceG Wils...,StephensenCB,0.0,0.0
3,"Vitamin D Levels in Active TB, Latent TB, Non-...",30260729,2018.0,Fetal Pediatr Pathol,Vitamin D Levels in Active TB Latent TB Non-TB...,BuonsensoD SaliM PataD MasielloE SalernoG Cecc...,ValentiniP,0.0,0.0
4,Vitamin D receptor interacts with NLRP3 to res...,30260469,2018.0,Clin Exp Immunol,Vitamin D receptor interacts with NLRP3 to res...,HuangH HongJY WuYJ WangEY LiuZQ ChengBH MeiL L...,ZhengPY,0.0,0.0
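
Because the dots in `I.U.` are not escaped, the dosage pattern can also match unrelated uppercase text. The sketch below compares the pattern as used above with a hypothetical escaped variant on made-up strings; neither the variant nor the samples come from the original analysis.

```python
import re

AS_WRITTEN = re.compile(r'IU|I.U.')    # each dot matches any character
ESCAPED = re.compile(r'IU|I\.U\.')     # hypothetical variant with literal dots

samples = [
    '800 IU daily',
    '1000 I.U. per day',
    'IBUPROFEN use',      # 'IBUP' fits I.U. when the dots are wildcards
    'no unit mentioned',
]

for text in samples:
    print(bool(AS_WRITTEN.search(text)), bool(ESCAPED.search(text)), text)
```

Both patterns also match `IU` inside longer uppercase words, so word boundaries (`\bIU\b`) could be a further refinement.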


### Cleaning the abstracts
- the abstracts are split into parts using \n\n as the split condition
- for every row, the part with the longest text is used
    - unless the author list is extremely long, this is usually the abstract itself

In [26]:
split_abstracts = df.abstract.str.split('\\n\\n', expand=True)
split_abstracts

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,Quality of life in hypoparathyroidism.,Vokes T(1).,Author information: \n(1)The University of Chi...,Hypoparathyroidism is a rare endocrine disorde...,Copyright Â10.1016/j.bone.2018.09.017,,,,
1,Vitamin D status and its association with seas...,"Gu Y(1), Zhu Z(2), Luan X(2), He J(3).",Author information: \n(1)Department of Psychia...,BACKGROUND: Vitamin D plays a key role in depr...,Copyright Â10.1016/j.neulet.2018.09.046,,,,
2,Tenofovir disoproxil fumarate appears to disru...,"Havens PL(1), Long D(2), Schuster GU(3), Gordo...",Author information: \n(1)Department of Pediatr...,BACKGROUND: Tenofovir disoproxil fumarate (TDF...,DOI: 10.3851/IMP3269,,,,
3,"Vitamin D Levels in Active TB, Latent TB, Non-...","Buonsenso D(1), Sali M(2), Pata D(1), Masiello...",Author information: \n(1)a Department of Pedia...,BACKGROUND: Growing evidence suggests that vit...,DOI: 10.1080/15513815.2018.1509407,,,,
4,Vitamin D receptor interacts with NLRP3 to res...,"Huang H(1), Hong JY(2), Wu YJ(2)(3), Wang EY(2...",Author information: \n(1)Department of Gastroe...,Vitamin D receptor (VDR) mediates various bioc...,Â10.1111/cei.13164,,,,
5,Effects of extracorporeal photopheresis on ser...,"Kessler H(1), Marculescu R(1), Knobler R(1), J...",Author information: \n(1)Department of Dermato...,Extracorporeal photopheresis (ECP) is a therap...,,DOI: 10.1111/phpp.12428,,,
6,Correlation of vitamin D binding protein gene ...,"Chuaychoo B(1), Tungtrongchitr R(2), Kriengsin...",Author information: \n(1)Department of Medicin...,AIM: The risk of vitamin D binding protein (DB...,DOI: 10.2217/pme-2018-0005,,,,
7,Vitamin D status including 3-epi-25(OH)D3 amon...,"Kmieć P(1), Minkiewicz I, Sworczak K, Żmijew...",Author information: \n(1)Department of Endocri...,INTRODUCTION: In the context of pleiotropic vi...,DOI: 10.5603/EP.a2018.0065,,,,
8,Novel biomarker signatures for idiopathic REM ...,"Mondello S(1), Kobeissy F(2), Mechref Y(2), Zh...",Author information: \n(1)From Oasi Research In...,OBJECTIVE: To perform a rigorous in-depth prot...,Â10.,,,,
9,Vitamin D deficiency aggravates the liver meta...,"Borges CC(1), Bringhenti I(2), Mandarim-de-Lac...",Author information: \n(1)Laboratory of Morphom...,AIMS: A prevalence of vitamin D deficiency has...,Copyright Â10.1016/j.biopha.2018.08.075,,,,


In [27]:
length = pd.DataFrame()

for column in range(split_abstracts.shape[1]):
    length[column] = split_abstracts[column].str.len()

In [28]:
max_length = length.idxmax(axis=1)

In [29]:
# object column as a placeholder; a float column cannot hold the abstract strings
df['clean_abstracts'] = None

In [30]:
for row in range(len(df)):
    if row % 1000 == 0:
        print(row)
    max_column = max_length.iloc[row]
    df.loc[row,'clean_abstracts'] = split_abstracts.iloc[row, max_column]

0
1000
2000
...
66000

In [31]:
df.to_csv('../data/save.csv')

In [32]:
df.head()

Unnamed: 0,abstract,PMID,year,journal,title,author,prof,blood,dosage,clean_abstracts
0,Quality of life in hypoparathyroidism.\n\nVoke...,30261328,2018.0,Bone,Quality of life in hypoparathyroidism,VokesT,VokesT,0.0,0.0,Hypoparathyroidism is a rare endocrine disorde...
1,Vitamin D status and its association with seas...,30261231,2018.0,Neurosci Lett,Vitamin D status and its association with seas...,GuY ZhuZ LuanX HeJ,HeJ,1.0,0.0,BACKGROUND: Vitamin D plays a key role in depr...
2,Tenofovir disoproxil fumarate appears to disru...,30260797,2018.0,Antivir Ther,Tenofovir disoproxil fumarate appears to disru...,HavensPL LongD SchusterGU GordonCM PriceG Wils...,StephensenCB,0.0,0.0,BACKGROUND: Tenofovir disoproxil fumarate (TDF...
3,"Vitamin D Levels in Active TB, Latent TB, Non-...",30260729,2018.0,Fetal Pediatr Pathol,Vitamin D Levels in Active TB Latent TB Non-TB...,BuonsensoD SaliM PataD MasielloE SalernoG Cecc...,ValentiniP,0.0,0.0,BACKGROUND: Growing evidence suggests that vit...
4,Vitamin D receptor interacts with NLRP3 to res...,30260469,2018.0,Clin Exp Immunol,Vitamin D receptor interacts with NLRP3 to res...,HuangH HongJY WuYJ WangEY LiuZQ ChengBH MeiL L...,ZhengPY,0.0,0.0,Vitamin D receptor (VDR) mediates various bioc...
