# Preprocessing data

Preparing abstracts of papers for Natural Language Processing.

## Imports

In [51]:
import csv
import string
import nltk
import unidecode
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /home/fani/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/fani/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/fani/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/fani/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Sample abstracts

Taking the abstracts of three papers to test data cleaning. The examples come from these papers:
- Raichle ME. The brain's default mode network. Annu Rev Neurosci. 2015 Jul 8;38:433-47. doi: 10.1146/annurev-neuro-071013-014030. Epub 2015 May 4. PMID: 25938726.
- Thiebaut de Schotten M, Forkel SJ. The emergent properties of the connected brain. Science. 2022 Nov 4;378(6619):505-510. doi: 10.1126/science.abq2591. Epub 2022 Nov 3. PMID: 36378968.
- Alahmari A. Blood-Brain Barrier Overview: Structural and Functional Correlation. Neural Plast. 2021 Dec 6;2021:6564585. doi: 10.1155/2021/6564585. PMID: 34912450; PMCID: PMC8668349.

In [52]:
default = "The brain's default mode network consists of discrete, bilateral and symmetrical cortical areas, in the medial and lateral parietal, medial prefrontal, and medial and lateral temporal cortices of the human, nonhuman primate, cat, and rodent brains. Its discovery was an unexpected consequence of brain-imaging studies first performed with positron emission tomography in which various novel, attention-demanding, and non-self-referential tasks were compared with quiet repose either with eyes closed or with simple visual fixation. The default mode network consistently decreases its activity when compared with activity during these relaxed nontask states. The discovery of the default mode network reignited a longstanding interest in the significance of the brain's ongoing or intrinsic activity. Presently, studies of the brain's intrinsic activity, popularly referred to as resting-state studies, have come to play a major role in studies of the human brain in health and disease. The brain's default mode network plays a central role in this work."
connected = "There is more to brain connections than the mere transfer of signals between brain regions. Behavior and cognition emerge through cortical area interaction. This requires integration between local and distant areas orchestrated by densely connected networks. Brain connections determine the brain's functional organization. The imaging of connections in the living brain has provided an opportunity to identify the driving factors behind the neurobiology of cognition. Connectivity differences between species and among humans have furthered the understanding of brain evolution and of diverging cognitive profiles. Brain pathologies amplify this variability through disconnections and, consequently, the disintegration of cognitive functions. The prediction of long-term symptoms is now preferentially based on brain disconnections. This paradigm shift will reshape our brain maps and challenge current brain models. "
blood = "The blood-brain barrier (BBB) is a semipermeable and extremely selective system in the central nervous system of most vertebrates, that separates blood from the brain's extracellular fluid. It plays a vital role in regulating the transport of necessary materials for brain function, furthermore, protecting it from foreign substances in the blood that could damage it. In this review, we searched in Google Scholar, Pubmed, Web of Science, and Saudi Digital Library for the various cells and components that support the development and function of this barrier, as well as the different pathways to transport the various molecules between blood and the brain. We also discussed the aspects that lead to BBB dysfunction and its neuropathological consequences, with the identification of some of the most important biomarkers that might be used as a biomarker to predict the BBB disturbances. This comprehensive overview of BBB will pave the way for future studies to focus on developing more specific targeting systems in material delivery as a future approach that assists in combinatorial therapy or nanotherapy to destroy or modify this barrier in pathological conditions such as brain tumors and brain stem cell carcinomas. "

In [53]:
# List with abstracts
original = [default, connected, blood]

In [54]:
# Abstracts into a DataFrame
abstracts_df = pd.DataFrame(original, columns=['original'])

In [55]:
abstracts_df

| | original |
|---|---|
| 0 | The brain's default mode network consists of d... |
| 1 | There is more to brain connections than the me... |
| 2 | The blood-brain barrier (BBB) is a semipermeab... |


## Processing data

Processing the abstract of a paper.

### Basic cleaning
- turning text into lowercase
- removing numbers
- removing punctuation
- removing spaces in the beginning and the end
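
The four steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's pipeline: `basic_clean` is a hypothetical helper name, and it only removes ASCII digits and punctuation, whereas `str.isdigit()` also matches non-ASCII digits.

```python
import string


def basic_clean(text):
    """Lowercase, drop ASCII digits and punctuation, trim outer whitespace."""
    lowered = text.lower()
    # With three arguments, str.maketrans maps every character
    # in the third argument to None, i.e. deletes it.
    table = str.maketrans("", "", string.digits + string.punctuation)
    return lowered.translate(table).strip()


print(basic_clean("  Blood-Brain Barrier, 2021!  "))
```

Because punctuation is deleted rather than replaced with a space, hyphenated terms collapse into a single token, which is why `bloodbrain` shows up in the cleaned abstracts.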

### Tokenization
Turning string into list of individual words.

### Removing stopwords
Dropping very common words that carry little meaning, so that only informative words remain.
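
Tokenization and stopword removal together can be sketched as follows. The assumptions are loud ones: whitespace splitting stands in for `word_tokenize` (reasonable only on text that has already been through basic cleaning), and `SAMPLE_STOPWORDS` is a made-up, deliberately tiny set rather than NLTK's English list.

```python
# Hypothetical mini stopword set, for illustration only
SAMPLE_STOPWORDS = {"the", "of", "is", "a", "and", "in"}


def remove_stopwords(cleaned_text, stop_words):
    """Split already-cleaned text on whitespace and drop stopwords."""
    tokens = cleaned_text.split()
    return [token for token in tokens if token not in stop_words]


filtered = remove_stopwords("the brain is a network", SAMPLE_STOPWORDS)
print(filtered)
```

Keeping the stopwords in a `set` makes each membership test constant time, which matters once the vocabulary grows.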

### Lemmatization
Reducing each word to its base form (lemma).

### Defining stopwords

#### Generic stopwords

In [56]:
stop_words = set(stopwords.words('english'))

#### Specialised stopwords

Making a vocabulary with academic and medical terms to ignore.

In [57]:
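# Read every cell of the CSV file into one flat list of words
with open('stopwords.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    # Strip the byte order mark (BOM) that some editors put at the start of the file
    specialised_stopwords = [word.strip('\ufeff') for row in reader for word in row]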
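
An alternative way to deal with the byte order mark is to let the decoder remove it. The sketch below assumes the file is UTF-8 encoded; `load_stopwords` is a hypothetical helper and the file contents are made up for the demonstration.

```python
import csv
import os
import tempfile


def load_stopwords(path):
    """Read every non-empty cell of a CSV file into a flat list of words."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present
    with open(path, newline="", encoding="utf-8-sig") as csvfile:
        return [word.strip() for row in csv.reader(csvfile) for word in row if word.strip()]


# Demo on a throwaway file that starts with a BOM
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("\ufeffstudy,results\nconclusion\n".encode("utf-8"))

words = load_stopwords(tmp.name)
os.remove(tmp.name)
print(words)
```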

In [58]:
def preprocess(sentence):
    """Clean, tokenize, filter and lemmatize one abstract; return it as a string."""
    if sentence is None:
        return ""  # Return an empty string if the text is None

    # Basic cleaning: lowercase, drop digits and punctuation, trim whitespace
    cleaned = sentence.lower()
    cleaned = ''.join(char for char in cleaned if not char.isdigit())

    for punctuation in string.punctuation:
        cleaned = cleaned.replace(punctuation, '')

    cleaned = cleaned.strip()
    # Transliterate accented characters to their closest ASCII equivalents
    unaccented_string = unidecode.unidecode(cleaned)

    # Tokenization
    tokenized = word_tokenize(unaccented_string)

    # Remove generic stopwords
    tokenized_no_stopwords = [word for word in tokenized if word not in stop_words]

    # Remove specialised stopwords, compared in lowercase (a set gives fast lookups)
    specialised_stopwords_lower = {word.lower() for word in specialised_stopwords}
    no_specialised_stopwords = [word for word in tokenized_no_stopwords if word.lower() not in specialised_stopwords_lower]

    # Lemmatization: treat each token as a verb first, then as a noun
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word, pos="v") for word in no_specialised_stopwords]
    lemmatized = [lemmatizer.lemmatize(word, pos="n") for word in lemmatized]

    cleaned_sentence = " ".join(lemmatized)
    return cleaned_sentence

### Applying on dataframe column

In [59]:
abstracts_df['cleaned'] = abstracts_df["original"].apply(preprocess)

In [60]:
abstracts_df

| | original | cleaned |
|---|---|---|
| 0 | The brain's default mode network consists of d... | brain default mode network consist discrete bi... |
| 1 | There is more to brain connections than the me... | brain connection mere transfer signal brain re... |
| 2 | The blood-brain barrier (BBB) is a semipermeab... | bloodbrain barrier bbb semipermeable extremely... |


### Applying on list

In [61]:
processed_original = [preprocess(i) for i in original]

## Vectorizing

Turning the cleaned abstracts into a TF-IDF weighted document-term matrix, in preparation for LDA.

In [30]:
vectorizer = TfidfVectorizer()

weighted_words = pd.DataFrame(vectorizer.fit_transform(abstracts_df['cleaned']).toarray(),
                 columns = vectorizer.get_feature_names_out())

weighted_words

| | activity | also | among | amplify | approach | area | aspect | assist | attentiondemanding | barrier | ... | use | variability | various | vertebrate | visual | vital | way | web | well | work |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.312443 | 0.0 | 0.0 | 0.0 | 0.0 | 0.059405 | 0.0 | 0.0 | 0.078111 | 0.0 | ... | 0.0 | 0.0 | 0.059405 | 0.0 | 0.078111 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078111 |
| 1 | 0.0 | 0.0 | 0.094277 | 0.094277 | 0.0 | 0.1434 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.094277 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.082977 | 0.0 | 0.0 | 0.082977 | 0.0 | 0.082977 | 0.082977 | 0.0 | 0.248931 | ... | 0.082977 | 0.0 | 0.126212 | 0.082977 | 0.0 | 0.082977 | 0.082977 | 0.082977 | 0.082977 | 0.0 |
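
To make the weights above less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`). It is an illustration under assumptions, not a replacement for the library: the `tfidf` function and the toy documents are made up, and whitespace splitting does not reproduce the vectorizer's own tokenizer, which by default also drops single-character tokens.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    weighted = []
    for tokens in tokenized:
        # Raw term count multiplied by the term's idf
        raw = {term: count * idf[term] for term, count in Counter(tokens).items()}
        # L2 normalisation, so every document vector has unit length
        norm = math.sqrt(sum(value ** 2 for value in raw.values()))
        weighted.append({term: value / norm for term, value in raw.items()})
    return weighted


# Toy documents, invented for the example
docs = ["brain network activity activity", "brain connection", "brain barrier blood"]
weights = tfidf(docs)
print(weights[0])
```

A term that occurs in every document gets the lowest possible idf of 1, so within a single document it is outweighed by rarer terms that occur equally often there.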


### Vectorizing test

Confirming that the term with the highest mean TF-IDF weight across the three abstracts is the one I searched for to find them: brain.

In [40]:
weighted_words.mean().sort_values(ascending=False).head(1).index[0]

'brain'