In [2]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec, Phrases


### How is preprocessing applied?

Interesting papers usually use a two-column format. Extracting their text is fast, but the resulting structure is often problematic: the narrow columns force line breaks in the middle of words, and these cut (hyphenated) words are the biggest issue. An example follows:

In [6]:
with open('string.txt') as f:
    data = f.read()
print(data)

"It is difficult to alter the architecture of hospitals after they are 
built; therefore, it is necessary to consider various treatment-
related aspects beforehand, both for the benefits of patients and 
health care providers. Traditionally, medicine has pursued evidence-based treatments, which establish a procedure by defin-
ing and measuring changes in outcomes, depending on the 
presence or absence of a treatment, and by judging their efficacy. Moreover, architecture has introduced the concept of 
evidence-based designs. Since any change in the physical environment might affect the progress of diseases in various ways, 
the rationale for studying these associations is clear. The belief 
that the physical environment of a hospital could affect the 
recovery of patients has existed since ancient times; however, it 
is difficult to support this assumption, because randomized 
controlled trials—although often conducted in medicine—are 
rarely adopted in architecture.
Medical facilities,

In [7]:
papers = pd.read_csv('data/papers.csv')
content = pd.Series(papers['Content'], dtype=str)
example = content[31][0:1096]
print(type(example))
print(example)

<class 'str'>
There is widespread concern about the increase in intrapartum intervention rates, e.g. caesarean sections, and recent research and discussions have focused on the need for the appropriate use of medically indicated interventions (Miller et al., 2016; Shaw et al., 2016). Medical interventions during birth have consequences for the health of the mother and child, in both the immediate and long term (as shown in the latest studies on epigenetics). In this article, the authors define intrapartum interventions as all interventions occurring from the onset of labor up to and including the expulsion of the placenta and membranes. Intrapartum interventions include, but are not limited to, the induction of labor, the use of intravenous oxytocin, artificial rupture of the amniotic membranes, epidural anesthesia, electronic fetal health rate monitoring, episiotomy, caesarean section. The reasons for the increase in intervention rates are multifactorial and in many circumstances unex

In [8]:
import nltk

In [9]:
tokens = nltk.word_tokenize(example)
print(tokens)

['There', 'is', 'widespread', 'concern', 'about', 'the', 'increase', 'in', 'intrapartum', 'intervention', 'rates', ',', 'e.g', '.', 'caesarean', 'sections', ',', 'and', 'recent', 'research', 'and', 'discussions', 'have', 'focused', 'on', 'the', 'need', 'for', 'the', 'appropriate', 'use', 'of', 'medically', 'indicated', 'interventions', '(', 'Miller', 'et', 'al.', ',', '2016', ';', 'Shaw', 'et', 'al.', ',', '2016', ')', '.', 'Medical', 'interventions', 'during', 'birth', 'have', 'consequences', 'for', 'the', 'health', 'of', 'the', 'mother', 'and', 'child', ',', 'in', 'both', 'the', 'immediate', 'and', 'long', 'term', '(', 'as', 'shown', 'in', 'the', 'latest', 'studies', 'on', 'epigenetics', ')', '.', 'In', 'this', 'article', ',', 'the', 'authors', 'define', 'intrapartum', 'interventions', 'as', 'all', 'interventions', 'occurring', 'from', 'the', 'onset', 'of', 'labor', 'up', 'to', 'and', 'including', 'the', 'expulsion', 'of', 'the', 'placenta', 'and', 'membranes', '.', 'Intrapartum', 'i

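Tokenizing as the first step doesn't work. We first need to rejoin words that were hyphenated across line breaks (e.g. `circum-\nstances`). We also need sentence boundaries, and we can't simply split on the dot character, because abbreviations such as "e.g." contain dots too: in the output above, `e.g.` was already torn into `'e.g'` and `'.'`.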
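To make the problem with the dot character concrete, here is a minimal sketch that splits a shortened version of the example on `.`. The sample string is abridged from the text above and the variable names are only illustrative.

```python
# Abridged sample based on the example paragraph above.
text = ("There is concern about intervention rates, e.g. caesarean sections "
        "(Miller et al., 2016). Medical interventions have consequences.")

# Naive approach: treat every dot as a sentence boundary.
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]

for part in naive_sentences:
    print(repr(part))
```

The two real sentences come back as five fragments, because the dots in "e.g." and "et al." are treated as sentence ends.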

In [10]:
content_str = ' '.join(content)

In [11]:
print(content_str)

There are well-recognized benefits for both mother and baby when the woman receives support from a chosen birth companion as reflected in Cochrane reviews (Bohren et al., 2017; Hodnett et al., 2013) and by the World Health Organization (WHO, 2020), yet we claim a dichotomy in current birth environment design exists which does not reflect this knowledge. While in most resource-rich nations, a continuous supporter is expected to accompany the woman throughout her labor and birth experience, the supporter is usually not well accommodated in the birth space (Johansson et al., 2015, Minnie et al., 2018) and subsequently may be unable to function effectively in their support role. In their role, supporters often feel overwhelmed, anxious, or uncertain (Gawlik et al., 2015). Although recent design guidelines show an awareness of supporters’ importance, it appears supporters have been overlooked when it comes to birth environment design (Harte et al., 2016). Although recent design guideli

In [12]:
lines = example.split('\n')
print(lines)

['There is widespread concern about the increase in intrapartum intervention rates, e.g. caesarean sections, and recent research and discussions have focused on the need for the appropriate use of medically indicated interventions (Miller et al., 2016; Shaw et al., 2016). Medical interventions during birth have consequences for the health of the mother and child, in both the immediate and long term (as shown in the latest studies on epigenetics). In this article, the authors define intrapartum interventions as all interventions occurring from the onset of labor up to and including the expulsion of the placenta and membranes. Intrapartum interventions include, but are not limited to, the induction of labor, the use of intravenous oxytocin, artificial rupture of the amniotic membranes, epidural anesthesia, electronic fetal health rate monitoring, episiotomy, caesarean section. The reasons for the increase in intervention rates are multifactorial and in many circumstances unexplainable, a

In [13]:
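# Rejoin words that were hyphenated across a line break.
# The list is not modified while iterating: removing items inside the loop
# skips lines, and lines.index(line) only finds the first of duplicated lines.
new_lines = ''
for line in lines:
    if line.endswith('-'):
        # drop the hyphen and glue the next line directly onto this one
        new_lines += line[:-1]
    else:
        new_lines += line + ' '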


In [14]:
print(example)

There is widespread concern about the increase in intrapartum intervention rates, e.g. caesarean sections, and recent research and discussions have focused on the need for the appropriate use of medically indicated interventions (Miller et al., 2016; Shaw et al., 2016). Medical interventions during birth have consequences for the health of the mother and child, in both the immediate and long term (as shown in the latest studies on epigenetics). In this article, the authors define intrapartum interventions as all interventions occurring from the onset of labor up to and including the expulsion of the placenta and membranes. Intrapartum interventions include, but are not limited to, the induction of labor, the use of intravenous oxytocin, artificial rupture of the amniotic membranes, epidural anesthesia, electronic fetal health rate monitoring, episiotomy, caesarean section. The reasons for the increase in intervention rates are multifactorial and in many circumstances unexplainable, as 

In [15]:
print(new_lines)

There is widespread concern about the increase in intrapartum intervention rates, e.g. caesarean sections, and recent research and discussions have focused on the need for the appropriate use of medically indicated interventions (Miller et al., 2016; Shaw et al., 2016). Medical interventions during birth have consequences for the health of the mother and child, in both the immediate and long term (as shown in the latest studies on epigenetics). In this article, the authors define intrapartum interventions as all interventions occurring from the onset of labor up to and including the expulsion of the placenta and membranes. Intrapartum interventions include, but are not limited to, the induction of labor, the use of intravenous oxytocin, artificial rupture of the amniotic membranes, epidural anesthesia, electronic fetal health rate monitoring, episiotomy, caesarean section. The reasons for the increase in intervention rates are multifactorial and in many circumstances unexplainable, as 

In [16]:
# Sentence parsing
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(new_lines)
print(sentences)

['There is widespread concern about the increase in intrapartum intervention rates, e.g.', 'caesarean sections, and recent research and discussions have focused on the need for the appropriate use of medically indicated interventions (Miller et al., 2016; Shaw et al., 2016).', 'Medical interventions during birth have consequences for the health of the mother and child, in both the immediate and long term (as shown in the latest studies on epigenetics).', 'In this article, the authors define intrapartum interventions as all interventions occurring from the onset of labor up to and including the expulsion of the placenta and membranes.', 'Intrapartum interventions include, but are not limited to, the induction of labor, the use of intravenous oxytocin, artificial rupture of the amniotic membranes, epidural anesthesia, electronic fetal health rate monitoring, episiotomy, caesarean section.', 'The reasons for the increase in intervention rates are multifactorial and in many circumstances u
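The output above shows that `sent_tokenize` still breaks the first sentence after "e.g.". One possible workaround, sketched here under the assumption that a hand-written abbreviation list is acceptable, is to build a Punkt tokenizer with explicit abbreviations. The abbreviation set and the sample text are our own choices, not part of the notebook's pipeline; Punkt expects abbreviations in lowercase and without the trailing dot.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Assumed abbreviation list: lowercase, without the final period.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {"e.g", "i.e", "al"}

# An untrained tokenizer that only knows the abbreviations given above.
abbrev_tokenizer = PunktSentenceTokenizer(punkt_params)

sample = ("There is concern about intervention rates, e.g. caesarean sections "
          "(Miller et al., 2016). Medical interventions have consequences.")
custom_sentences = abbrev_tokenizer.tokenize(sample)

for sentence in custom_sentences:
    print(sentence)
```

Because this tokenizer is not trained on any corpus, it may behave differently from the pretrained model behind `sent_tokenize` on other inputs, so it should be checked against real paper text before replacing the call above.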

In [17]:
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
print(sentences)

[['There', 'is', 'widespread', 'concern', 'about', 'the', 'increase', 'in', 'intrapartum', 'intervention', 'rates', ',', 'e.g', '.'], ['caesarean', 'sections', ',', 'and', 'recent', 'research', 'and', 'discussions', 'have', 'focused', 'on', 'the', 'need', 'for', 'the', 'appropriate', 'use', 'of', 'medically', 'indicated', 'interventions', '(', 'Miller', 'et', 'al.', ',', '2016', ';', 'Shaw', 'et', 'al.', ',', '2016', ')', '.'], ['Medical', 'interventions', 'during', 'birth', 'have', 'consequences', 'for', 'the', 'health', 'of', 'the', 'mother', 'and', 'child', ',', 'in', 'both', 'the', 'immediate', 'and', 'long', 'term', '(', 'as', 'shown', 'in', 'the', 'latest', 'studies', 'on', 'epigenetics', ')', '.'], ['In', 'this', 'article', ',', 'the', 'authors', 'define', 'intrapartum', 'interventions', 'as', 'all', 'interventions', 'occurring', 'from', 'the', 'onset', 'of', 'labor', 'up', 'to', 'and', 'including', 'the', 'expulsion', 'of', 'the', 'placenta', 'and', 'membranes', '.'], ['Intrapa

In [18]:
#remove special characters
import re
sentences = [[re.sub('[^A-Za-z0-9]+', '', word) for word in sentence] for sentence in sentences]
#remove empty strings
sentences = [[word for word in sentence if word] for sentence in sentences]
print(sentences)

[['There', 'is', 'widespread', 'concern', 'about', 'the', 'increase', 'in', 'intrapartum', 'intervention', 'rates', 'eg'], ['caesarean', 'sections', 'and', 'recent', 'research', 'and', 'discussions', 'have', 'focused', 'on', 'the', 'need', 'for', 'the', 'appropriate', 'use', 'of', 'medically', 'indicated', 'interventions', 'Miller', 'et', 'al', '2016', 'Shaw', 'et', 'al', '2016'], ['Medical', 'interventions', 'during', 'birth', 'have', 'consequences', 'for', 'the', 'health', 'of', 'the', 'mother', 'and', 'child', 'in', 'both', 'the', 'immediate', 'and', 'long', 'term', 'as', 'shown', 'in', 'the', 'latest', 'studies', 'on', 'epigenetics'], ['In', 'this', 'article', 'the', 'authors', 'define', 'intrapartum', 'interventions', 'as', 'all', 'interventions', 'occurring', 'from', 'the', 'onset', 'of', 'labor', 'up', 'to', 'and', 'including', 'the', 'expulsion', 'of', 'the', 'placenta', 'and', 'membranes'], ['Intrapartum', 'interventions', 'include', 'but', 'are', 'not', 'limited', 'to', 'the'
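The character class decides what happens to compound words. The small comparison below uses made-up tokens to contrast the pattern above, `[^A-Za-z0-9]+`, with a variant that also keeps hyphens, `[^A-Za-z0-9-]+`.

```python
import re

# Illustrative tokens, not taken from the corpus.
tokens = ["well-recognized", "evidence-based", "(", "al.", "supporters'", "-", "2016"]

# Pattern used above: everything except letters and digits is removed.
strip_all = [re.sub(r"[^A-Za-z0-9]+", "", token) for token in tokens]
strip_all = [token for token in strip_all if token]

# Variant that additionally keeps hyphens.
keep_hyphen = [re.sub(r"[^A-Za-z0-9-]+", "", token) for token in tokens]
keep_hyphen = [token for token in keep_hyphen if token]

print(strip_all)    # compounds are fused: 'wellrecognized'
print(keep_hyphen)  # compounds survive, but so does the lone '-'
```

Keeping hyphens preserves terms such as "evidence-based", at the cost of letting stand-alone dashes through as tokens; those would need an extra filter.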

In [19]:

content = pd.read_csv('data/papers.csv')
#data = papers.iloc[:, 4]
stopwords = ["et", "figure", "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
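# Join all papers of a decade into one text. The result is stored separately;
# it is not used by the preprocessing call below.
content_by_decade = content.groupby('Decade')['Content'].apply(lambda x: ' '.join(x)).reset_index()
content['Content'] = content['Content'].astype("string")
# Despite its name, decade_content holds one text per paper (taken from `papers`).
decade_content = pd.Series(papers['Content'], dtype=str)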

print(content.info())

def preprocess_papers(content, stop_words=True, bigrams=True, trigrams=False):
    """Turn an iterable of paper texts into a list of tokenized sentences."""
    content_str = ' '.join(content)
    lines = content_str.split('\n')

    # rejoin words hyphenated across a line break
    # (the list is not modified while iterating, which would skip lines)
    new_lines = ''
    for line in lines:
        if line.endswith('-'):
            new_lines += line[:-1]
        else:
            new_lines += line + ' '

    sentences = sent_tokenize(new_lines)

    # tokenize sentences
    sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

    # remove stopwords (case-insensitive)
    if stop_words:
        stopword_set = set(stopwords)
        sentences = [[word for word in sentence if word.lower() not in stopword_set] for sentence in sentences]

    # remove special characters, keeping hyphens so that compounds like 'well-recognized' survive
    sentences = [[re.sub('[^A-Za-z0-9-]+', '', word) for word in sentence] for sentence in sentences]
    # remove empty strings
    sentences = [[word for word in sentence if word] for sentence in sentences]

    # trigrams are built on top of bigrams, so the bigram model is needed in both cases
    if bigrams or trigrams:
        bigram = Phrases(sentences, min_count=1, threshold=1)
        sentences = [bigram[sentence] for sentence in sentences]

    # train the trigram model on the sentences that already contain bigrams
    if trigrams:
        trigram = Phrases(sentences, min_count=1, threshold=1)
        sentences = [trigram[sentence] for sentence in sentences]

    return sentences

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   DOI       34 non-null     object
 1   Title     34 non-null     object
 2   Year      34 non-null     int64 
 3   Abstract  12 non-null     object
 4   Content   34 non-null     string
 5   Decade    34 non-null     object
dtypes: int64(1), object(4), string(1)
memory usage: 1.7+ KB
None


In [20]:
sentences = preprocess_papers(decade_content, stop_words=True, bigrams=True, trigrams=False)
print(sentences)

[['well-recognized', 'benefits', 'mother', 'baby', 'woman', 'receives', 'support', 'chosen', 'birth', 'companion', 'reflected', 'Cochrane', 'reviews', 'Bohren', 'al_2017', 'Hodnett_al', '2013_World', 'Health_Organization', '2020', 'yet', 'claim', 'dichotomy', 'current', 'birth_environment', 'design', 'exists', 'reflect', 'knowledge'], ['resource-rich', 'nations', 'continuous', 'supporter', 'expected', 'accompany', 'woman', 'throughout_labor', 'birth_experience', 'supporter', 'usually', 'well', 'accommodated', 'birth_space', 'Johansson_al', '2015', 'Minnie', 'al_2018', 'subsequently', 'may_unable', 'function', 'effectively', 'support', 'role'], ['role', 'supporters', 'often_feel', 'overwhelmed', 'anxious', 'uncertain', 'Gawlik', 'al_2015'], ['Although_recent', 'design_guidelines', 'show_awareness', 'supporters_importance', 'appears_supporters', 'overlooked_comes', 'birth_environment', 'design_Harte', 'al_2016'], ['Although_recent', 'design_guidelines', 'show_awareness', 'supporters_impo
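The output contains merges such as `al_2017` and `2013_World`, which are artifacts of the very permissive `min_count=1, threshold=1` setting. As far as we understand gensim's default scorer, a pair is merged when the score `(count_ab - min_count) / count_a / count_b * vocab_size` exceeds `threshold`. The sketch below re-implements that formula in plain Python; the function name and all counts are hypothetical and only meant to show the effect of the parameters.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Collocation score in the style of gensim's default Phrases scorer.

    count_a, count_b: how often each word occurs on its own
    count_ab:         how often the two words occur next to each other
    vocab_size:       number of entries in the vocabulary
    """
    return (count_ab - min_count) / count_a / count_b * vocab_size


# Hypothetical counts: two rare tokens that appear together twice.
rare_pair = phrase_score(count_a=2, count_b=2, count_ab=2, vocab_size=1000, min_count=1)

# Hypothetical counts: two frequent words that are rarely adjacent.
common_pair = phrase_score(count_a=500, count_b=400, count_ab=3, vocab_size=1000, min_count=1)

# The same rare pair with a stricter min_count.
strict_pair = phrase_score(count_a=2, count_b=2, count_ab=2, vocab_size=1000, min_count=5)

print(rare_pair, common_pair, strict_pair)
```

With `threshold=1`, any rare pair seen together twice clears the bar easily, which is how citation fragments end up as "phrases". Raising `min_count` or `threshold` should reduce such merges, at the risk of losing genuine terms like `birth_environment` in a corpus of only 34 papers.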

In [135]:
import re
papers = pd.read_csv('data/papers.csv')

def remove_hyphenation(text):
    hyphenated_word_pattern = r'(\b\w+)-\s+(\w+\b)'
    
    def join_hyphenated(match):
        return match.group(1) + match.group(2)
    
    result = re.sub(hyphenated_word_pattern, join_hyphenated, text)
    
    return result
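A quick check of this pattern on a few short strings shows what it fixes and where it overcorrects. The snippet repeats the same regular expression so that it runs on its own; the helper name and the test strings are made up for illustration.

```python
import re

HYPHEN_PATTERN = r'(\b\w+)-\s+(\w+\b)'  # same pattern as in remove_hyphenation


def join_hyphenated_words(text):
    # Glue the part before "-<whitespace>" onto the part after it.
    return re.sub(HYPHEN_PATTERN, r'\1\2', text)


fixed_break = join_hyphenated_words("a procedure by defin-\ning and measuring")
kept_compound = join_hyphenated_words("well-recognized benefits")
lost_hyphen = join_hyphenated_words("various treatment-\nrelated aspects")
suspended = join_hyphenated_words("pre- and postnatal care")

print(fixed_break)    # the cut word is repaired
print(kept_compound)  # no whitespace after the hyphen, so nothing changes
print(lost_hyphen)    # a real compound split at the line end loses its hyphen
print(suspended)      # a suspended hyphen is merged with the following word
```

The last two cases are a limitation of any purely pattern-based approach: without a dictionary, a line-end hyphen in "treatment-related" cannot be told apart from one in "defin-ing".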



In [141]:
#papers['Content'] = papers['Content'].astype("string")
# Assign the whole column at once. The element-wise chained assignment
# papers['Content'][i] = ... writes to a possible copy and makes pandas
# emit a SettingWithCopyWarning for every row.
papers['Content'] = papers['Content'].apply(remove_hyphenation)

In [140]:
print(papers['Content'][5])

#write to csv
#papers.to_csv('data/papers.csv', index=False)

The number of People with Dementia (PwD) is rising worldwide. World Health Organization estimates that around 55 million people have dementia and this is rising to 139 million in 2050 (1). Dementia is an umbrella term that has within it various diseases, of which the most frequent (about 60-70 percent of all forms) is Alzheimer’s (2). This disease is characterized mainly by a general deterioration in cognitive function. It affects memory, thinking, orientation, comprehension, computation,   learning ability, language, mobility, and judgment. Due to age and comorbidity that often characterize PwD, they are the most frequent visitors to healthcare facilities. Although PwD is a heavy consumer of health services, direct costs in developed countries arise mostly from community and residential care (3). In Europe, nowadays nursing homes and assisted home care are among the main lines of investment in the real estate sector. These facilities must be suitable for a fragile category of users li