# Preprocessing data

Preparing abstracts of papers for Natural Language Processing.

## Intro

### Imports

In [1]:
!pip install --upgrade nltk



In [2]:
import nltk
import csv
import string
import unidecode
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /home/fani/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/fani/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/fani/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/fani/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Sample abstracts

We use the abstracts of four papers to test the data cleaning:
1. Raichle ME. The brain's default mode network. Annu Rev Neurosci. 2015 Jul 8;38:433-47. doi: 10.1146/annurev-neuro-071013-014030. Epub 2015 May 4. PMID: 25938726.
2. Thiebaut de Schotten M, Forkel SJ. The emergent properties of the connected brain. Science. 2022 Nov 4;378(6619):505-510. doi: 10.1126/science.abq2591. Epub 2022 Nov 3. PMID: 36378968.
3. Alahmari A. Blood-Brain Barrier Overview: Structural and Functional Correlation. Neural Plast. 2021 Dec 6;2021:6564585. doi: 10.1155/2021/6564585. PMID: 34912450; PMCID: PMC8668349.
4. Grabowska A. Sex on the brain: Are gender-dependent structural and functional differences associated with behavior? J Neurosci Res. 2017 Jan 2;95(1-2):200-212. doi: 10.1002/jnr.23953. PMID: 27870447.

In [3]:
default = "The brain's default mode network consists of discrete, bilateral and symmetrical cortical areas, in the medial and lateral parietal, medial prefrontal, and medial and lateral temporal cortices of the human, nonhuman primate, cat, and rodent brains. Its discovery was an unexpected consequence of brain-imaging studies first performed with positron emission tomography in which various novel, attention-demanding, and non-self-referential tasks were compared with quiet repose either with eyes closed or with simple visual fixation. The default mode network consistently decreases its activity when compared with activity during these relaxed nontask states. The discovery of the default mode network reignited a longstanding interest in the significance of the brain's ongoing or intrinsic activity. Presently, studies of the brain's intrinsic activity, popularly referred to as resting-state studies, have come to play a major role in studies of the human brain in health and disease. The brain's default mode network plays a central role in this work."
connected = "There is more to brain connections than the mere transfer of signals between brain regions. Behavior and cognition emerge through cortical area interaction. This requires integration between local and distant areas orchestrated by densely connected networks. Brain connections determine the brain's functional organization. The imaging of connections in the living brain has provided an opportunity to identify the driving factors behind the neurobiology of cognition. Connectivity differences between species and among humans have furthered the understanding of brain evolution and of diverging cognitive profiles. Brain pathologies amplify this variability through disconnections and, consequently, the disintegration of cognitive functions. The prediction of long-term symptoms is now preferentially based on brain disconnections. This paradigm shift will reshape our brain maps and challenge current brain models. "
blood = "The blood-brain barrier (BBB) is a semipermeable and extremely selective system in the central nervous system of most vertebrates, that separates blood from the brain's extracellular fluid. It plays a vital role in regulating the transport of necessary materials for brain function, furthermore, protecting it from foreign substances in the blood that could damage it. In this review, we searched in Google Scholar, Pubmed, Web of Science, and Saudi Digital Library for the various cells and components that support the development and function of this barrier, as well as the different pathways to transport the various molecules between blood and the brain. We also discussed the aspects that lead to BBB dysfunction and its neuropathological consequences, with the identification of some of the most important biomarkers that might be used as a biomarker to predict the BBB disturbances. This comprehensive overview of BBB will pave the way for future studies to focus on developing more specific targeting systems in material delivery as a future approach that assists in combinatorial therapy or nanotherapy to destroy or modify this barrier in pathological conditions such as brain tumors and brain stem cell carcinomas. "
gender = "A substantial number of studies provide evidence documenting a variety of sex differences in the brain. It remains unclear whether sexual differentiation at the neural level is related to that observed in daily behavior, cognitive function, and the risk of developing certain psychiatric and neurological disorders. Some investigators have questioned whether the brain is truly sexually differentiated and support this view with several arguments including the following: (1) brain structural or functional differences are not necessarily reflected in appropriate differences at the behavioral level, which might suggest that these two phenomena are not linked to each other; and (2) sex-related differences in the brain are rather small and concern features that significantly overlap between males and females. This review polemicizes with those opinions and presents examples of sex-related local neural differences underpinning a variety of sex differences in behaviors, skills, and cognitive/emotional abilities. Although male/female brain differentiation may vary in pattern and scale, nonetheless, in some respects (e.g., relative local gray matter volumes) it can be substantial, taking the form of sexual dimorphism and involving large areas of the brain (the cortex in particular). A significant part of this review is devoted to arguing that some sex differences in the brain may serve to prevent (in the case where they are maladaptive), rather than to produce, differences at the behavioral/skill level. Specifically, some differences might result from compensatory mechanisms aimed at maintaining similar intellectual capacities across the sexes, despite the smaller average volume of the brain in females compared with males. © 2016 Wiley Periodicals, Inc. "

# Abstracts list and dataframe
original = [default, connected, blood, gender]
df = pd.DataFrame(original, columns=['Original Text'])

### Words to ignore

Generic academic vocabulary should not define a topic, so we compiled a list of academic and medical jargon to filter out.

In [4]:
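# Read every cell of the CSV into a flat list, stripping the byte-order mark (BOM) if present
with open('stopwords.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    academic_terms_proper_case = [word.strip('\ufeff') for row in reader for word in row]

# Lowercase the terms so they match the lowercased abstracts
academic_terms = [word.lower() for word in academic_terms_proper_case]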
academic_terms[0:3]

['abstract', 'academic', 'achievement']
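The `'\ufeff'` stripped above is a byte-order mark, which some spreadsheet tools write at the start of a CSV export. As a minimal sketch of an alternative, assuming the file is UTF-8 encoded, Python's `utf-8-sig` codec drops the mark while decoding. The helper `load_terms` and the throwaway demo file are hypothetical and not part of the notebook.

```python
import csv
import os
import tempfile


def load_terms(path):
    """Read every non-empty cell of a CSV file into a flat, lowercased list."""
    # 'utf-8-sig' decodes UTF-8 and silently drops a leading BOM if there is one
    with open(path, encoding="utf-8-sig", newline="") as csvfile:
        return [
            word.strip().lower()
            for row in csv.reader(csvfile)
            for word in row
            if word.strip()
        ]


# Demo on a temporary file that starts with a BOM (hypothetical content)
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("\ufeffAbstract,Academic\nAchievement\n")
    demo_path = tmp.name

demo_terms = load_terms(demo_path)
os.remove(demo_path)
print(demo_terms)
```

With this approach no per-word stripping is needed, and a file without a BOM is read the same way.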

We will remove these terms before passing our data to the model.

## Main function

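We build a function to reuse in our main Python file. It applies four steps:

**Basic cleaning** | converting text to lowercase, removing digits and punctuation, trimming surrounding whitespace, and transliterating accented characters to ASCII.

**Tokenization** | splitting the string into a list of individual words.

**Lemmatization** | reducing each word to its base form (for example, 'arguments' becomes 'argument').

**Removing stopwords** | dropping common English words and the academic terms listed above, keeping only useful words.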

In [5]:
# Build these once instead of on every call or every word
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
academic_terms_set = set(academic_terms)
punctuation_table = str.maketrans('', '', string.punctuation)

def preprocess(sentence):
    """Clean, tokenize, lemmatize and filter one abstract; return a single string."""
    if sentence is None:
        return ""

    # Basic cleaning: lowercase, drop digits and punctuation, trim, strip accents
    cleaned = sentence.lower()
    cleaned = ''.join(char for char in cleaned if not char.isdigit())
    cleaned = cleaned.translate(punctuation_table).strip()
    unaccented_string = unidecode.unidecode(cleaned)

    # Tokenization
    tokenized = word_tokenize(unaccented_string)

    # Lemmatization: treat each token as a verb first, then as a noun
    lemmatized = [lemmatizer.lemmatize(word, pos="v") for word in tokenized]
    lemmatized = [lemmatizer.lemmatize(word, pos="n") for word in lemmatized]

    # Remove general English stopwords, then the specialised academic terms
    kept = [word for word in lemmatized
            if word not in stop_words and word not in academic_terms_set]
    return " ".join(kept)

df['Cleaned Text'] = df["Original Text"].apply(preprocess)
df

| | Original Text | Cleaned Text |
|---|---|---|
| 0 | The brain's default mode network consists of d... | brain default mode network consist discrete bi... |
| 1 | There is more to brain connections than the me... | brain connection mere transfer signal brain re... |
| 2 | The blood-brain barrier (BBB) is a semipermeab... | bloodbrain barrier bbb semipermeable extremely... |
| 3 | A substantial number of studies provide eviden... | substantial number provide evidence document v... |
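The cleaned text for the third abstract starts with 'bloodbrain' because deleting punctuation joins the two halves of a hyphenated word. The sketch below reproduces only the basic-cleaning step with the standard library and contrasts it with a hypothetical variant, `basic_clean_spaced`, that replaces punctuation with spaces instead. Neither helper is part of the notebook, and whether splitting hyphenated terms is preferable depends on the downstream model.

```python
import string

# Translation tables: one deletes punctuation, the other turns it into spaces
DELETE_PUNCT = str.maketrans("", "", string.punctuation)
SPACE_PUNCT = str.maketrans({char: " " for char in string.punctuation})


def _lower_without_digits(text):
    """Lowercase the text and drop every digit character."""
    return "".join(char for char in text.lower() if not char.isdigit())


def basic_clean(text):
    """Same order as the notebook: lowercase, drop digits, delete punctuation, trim."""
    return _lower_without_digits(text).translate(DELETE_PUNCT).strip()


def basic_clean_spaced(text):
    """Hypothetical variant: punctuation becomes a space, then whitespace is collapsed."""
    return " ".join(_lower_without_digits(text).translate(SPACE_PUNCT).split())


sample = "The blood-brain barrier (BBB), 2021."
print(basic_clean(sample))         # the bloodbrain barrier bbb
print(basic_clean_spaced(sample))  # the blood brain barrier bbb
```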


## Testing

In [6]:
test_word = 'arguments'
text_number = 3
test_word_lemma = WordNetLemmatizer().lemmatize(test_word)

In [7]:
assert test_word_lemma in academic_terms

The word 'argument', which we have defined as a stopword ...

In [8]:
assert test_word in df['Original Text'][text_number]

... is found in a different grammatical form in the text, 'arguments'

In [9]:
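# Compare whole tokens, and check the lemma too: lemmatization alone already rewrites 'arguments'
cleaned_tokens = df['Cleaned Text'][text_number].split()
assert test_word not in cleaned_tokens
assert test_word_lemma not in cleaned_tokens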

... and is removed as intended.
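One caveat about checks like these: `in` applied to a Python string matches substrings, so it can report a word as present when it only occurs inside a longer word. A minimal sketch of the difference, using a hypothetical helper `contains_token` and a made-up cleaned string:

```python
def contains_token(text, word):
    """True only if `word` appears as a whole whitespace-separated token."""
    return word in text.split()


cleaned_example = "brain structural difference"  # hypothetical cleaned text

substring_hit = "differ" in cleaned_example            # matches inside 'difference'
token_hit = contains_token(cleaned_example, "differ")  # whole tokens only

print(substring_hit, token_hit)
```

Comparing against the token list keeps an assertion from passing or failing because of a partial match.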

## Stats

In [10]:
def count_removed_words(df):
    """Return df with the number and share of words removed by cleaning, largest share first."""
    original_count = df['Original Text'].str.split().str.len()
    cleaned_count = df['Cleaned Text'].str.split().str.len()

    stats = pd.DataFrame(index=df.index)
    stats['words removed'] = original_count - cleaned_count
    percent_removed = (stats['words removed'] / original_count * 100).round().astype(int)

    # Sort on the numeric percentage; sorting the formatted strings would be alphabetical
    order = percent_removed.sort_values(ascending=False).index
    stats['% of words removed'] = percent_removed.astype(str) + ' %'
    return df.join(stats).loc[order]

In [11]:
count_removed_words(df)

| | Original Text | Cleaned Text | words removed | % of words removed |
|---|---|---|---|---|
| 2 | The blood-brain barrier (BBB) is a semipermeab... | bloodbrain barrier bbb semipermeable extremely... | 90 | 47 % |
| 3 | A substantial number of studies provide eviden... | substantial number provide evidence document v... | 104 | 41 % |
| 0 | The brain's default mode network consists of d... | brain default mode network consist discrete bi... | 59 | 39 % |
| 1 | There is more to brain connections than the me... | brain connection mere transfer signal brain re... | 47 | 38 % |
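All four percentages here have two digits, so ordering them as text or as numbers gives the same result. That stops being true once a value has one or three digits, because strings such as '9 %' and '47 %' are compared character by character. The sketch below illustrates this with hypothetical word counts (not taken from the abstracts) and a hypothetical helper `percent_removed`.

```python
def percent_removed(original_count, cleaned_count):
    """Share of words removed, rounded to a whole percentage."""
    return round((original_count - cleaned_count) / original_count * 100)


# Hypothetical (original, cleaned) word counts for three documents
counts = {"a": (200, 182), "b": (150, 80), "c": (120, 0)}
percents = {name: percent_removed(orig, clean) for name, (orig, clean) in counts.items()}

# Descending order by numeric value
by_number = sorted(percents, key=percents.get, reverse=True)

# Descending order by formatted text, e.g. '9 %' sorts above '47 %'
by_text = sorted(percents, key=lambda name: f"{percents[name]} %", reverse=True)

print(percents)
print(by_number)
print(by_text)
```

Sorting on the numeric value first and formatting afterwards keeps the ranking stable whatever the magnitudes are.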


# Vectorizing

Converting the cleaned abstracts into TF-IDF weights to prepare the data for LDA.

In [12]:
vectorizer = TfidfVectorizer()
weighted_words = pd.DataFrame(vectorizer.fit_transform(df['Cleaned Text']).toarray(),
                 columns = vectorizer.get_feature_names_out())
weighted_words

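| | ability | across | activity | also | although | among | amplify | approach | appropriate | area | ... | view | visual | vital | volume | way | web | well | whether | wiley | work |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.325289 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.051907 | ... | 0.0 | 0.081322 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.081322 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.100136 | 0.100136 | 0.0 | 0.0 | 0.127831 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.08535 | 0.0 | 0.0 | 0.0 | 0.08535 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.08535 | 0.0 | 0.08535 | 0.08535 | 0.08535 | 0.0 | 0.0 | 0.0 |
| 3 | 0.06463 | 0.06463 | 0.0 | 0.0 | 0.06463 | 0.0 | 0.0 | 0.0 | 0.06463 | 0.041253 | ... | 0.06463 | 0.0 | 0.0 | 0.12926 | 0.0 | 0.0 | 0.0 | 0.12926 | 0.06463 | 0.0 |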
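To make the weights in this table easier to interpret, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The function `tfidf` and the three toy documents are hypothetical, and the sketch splits on whitespace only, whereas scikit-learn's default tokenizer also ignores single-character tokens.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed idf and L2-normalised rows, one row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        rows.append([value / norm for value in raw])
    return vocab, rows


# Toy documents, made up for illustration
vocab, rows = tfidf(["brain network activity", "brain connection", "brain barrier blood"])
for row in rows:
    print([round(value, 3) for value in row])
```

Within a single document, a term that occurs in every document ('brain' in the toy data) receives a lower weight than an equally frequent term that is specific to that document.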


### Vectorizing test

Confirming that the term with the highest average weight across the four abstracts is the one we searched for to find them: brain.

In [13]:
weighted_words.mean().sort_values(ascending=False).head(1).index[0]

'brain'