## Data Preprocessing
##### This notebook preprocesses the PubMed Multi Label Text Classification Dataset, available on Kaggle (https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification), in preparation for multi-label modelling.

### 1. Import Data

In [1]:
# Import packages
import pandas as pd
import numpy as np

# Import data
df = pd.read_csv(r"/Users/ishanisahama/Documents/Data Science/github_blog/multi-label classification/input/PubMed Multi Label Text Classification Dataset.csv")
df.head()

| | Title | abstractText | meshMajor | pmid | meshid | meshroot | A | B | C | D | ... | G | H | I | J | K | L | M | N | V | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Expression of p53 and coexistence of HPV in pr... | Fifty-four paraffin embedded tissue sections f... | ['DNA Probes, HPV', 'DNA, Viral', 'Female', 'H... | 8549602 | [['D13.444.600.223.555', 'D27.505.259.750.600.... | ['Chemicals and Drugs [D]', 'Organisms [B]', '... | 0 | 1 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Vitamin D status in pregnant Indian women acro... | The present cross-sectional study was conducte... | ['Adult', 'Alkaline Phosphatase', 'Breast Feed... | 21736816 | [['M01.060.116'], ['D08.811.277.352.650.035'],... | ['Named Groups [M]', 'Chemicals and Drugs [D]'... | 0 | 1 | 1 | 1 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | [Identification of a functionally important di... | The occurrence of individual amino acids and d... | ['Amino Acid Sequence', 'Analgesics, Opioid', ... | 19060934 | [['G02.111.570.060', 'L01.453.245.667.060'], [... | ['Phenomena and Processes [G]', 'Information S... | 1 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | Multilayer capsules: a promising microencapsul... | In 1980, Lim and Sun introduced a microcapsule... | ['Acrylic Resins', 'Alginates', 'Animals', 'Bi... | 11426874 | [['D05.750.716.822.111', 'D25.720.716.822.111'... | ['Chemicals and Drugs [D]', 'Technology, Indus... | 1 | 1 | 1 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Nanohydrogel with N,N'-bis(acryloyl)cystine cr... | Substantially improved hydrogel particles base... | ['Antineoplastic Agents', 'Cell Proliferation'... | 28323099 | [['D27.505.954.248'], ['G04.161.750', 'G07.345... | ['Chemicals and Drugs [D]', 'Phenomena and Pro... | 1 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


### 2. Separate Features and Labels
##### This section separates the features (the title and abstract of each PubMed article) from the labels (columns A-D) for further processing.

In [2]:
# Drop missing values
df = df.dropna()

# Separate features and labels
features = df[["pmid", "Title", "abstractText"]]
labels = df[["pmid", "A", "B", "C", "D"]]

# The "meshMajor", "meshid" and "meshroot" columns are dropped so that only Title and abstractText serve as features.
# Only columns A-D are kept as labels because their meanings are clearly documented on the dataset's webpage.

# Print data dimensions of subsets
print("The dimensions of the features are", features.shape)
print("The dimensions of the labels are", labels.shape)

The dimensions of the features are (9999, 3)
The dimensions of the labels are (9999, 5)
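
The single-letter label columns appear to mirror the bracketed category codes in `meshroot`: in the first sample row, `'Chemicals and Drugs [D]'` and `'Organisms [B]'` sit alongside `B = 1` and `D = 1`. The sketch below illustrates that relationship under this assumption, which the sample rows suggest but the dataset page is not quoted as confirming here. The `meshroot_to_indicators` helper and the example string are hypothetical and not part of the original notebook.

```python
import ast
import re

def meshroot_to_indicators(meshroot, codes=("A", "B", "C", "D")):
    """Turn a stringified meshroot list into 0/1 indicators per category code.

    Assumption: each entry ends with its one-letter code in square brackets.
    """
    entries = ast.literal_eval(meshroot)  # CSV cells hold the list as text
    found = set()
    for entry in entries:
        match = re.search(r"\[([A-Z])\]\s*$", entry)
        if match:
            found.add(match.group(1))
    return {code: int(code in found) for code in codes}

# Hypothetical cell value in the same format as the meshroot column
example = "['Chemicals and Drugs [D]', 'Organisms [B]', 'Diseases [C]']"
indicators = meshroot_to_indicators(example)
print(indicators)  # {'A': 0, 'B': 1, 'C': 1, 'D': 1}
```

If the assumption holds, comparing such derived indicators with the provided A-D columns would be a cheap consistency check before modelling.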


In [3]:
# Lowercase features columns
features.columns = features.columns.str.lower()
list(features.columns)

['pmid', 'title', 'abstracttext']

In [4]:
# Merge title and abstract information into one column
# (work on an explicit copy so the assignment does not raise pandas' SettingWithCopyWarning)
features = features.copy()
features["title_and_abstract"] = features[["title", "abstracttext"]].agg(" ".join, axis=1)

# Remove "title" and "abstracttext" columns from features table
features = features[["pmid", "title_and_abstract"]].copy()
list(features.columns)

['pmid', 'title_and_abstract']

### 3. Text Preprocessing
##### This section preprocesses the title and abstract text to create keyword-based features for modelling. Yet Another Keyword Extractor (YAKE) is used to extract keywords from the text, replicating the keyword extraction method of choice in work-based settings. Because this project aims to test multi-label classification methods, YAKE is used throughout; comparing alternative keyword extraction algorithms is left for further analysis.

In [5]:
# Import packages
import re
import nltk
from nltk.corpus import stopwords

# Lowercase text data
features["text_proc"] = features["title_and_abstract"].str.lower()

# Replace every non-letter character (digits, punctuation) with a space
features["text_proc"] = features["text_proc"].map(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))

# Load the English stop word list (downloaded on first use)
nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words('english'))

# Remove stop words from text
features["text_proc"] = features["text_proc"].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
features.head()

| | pmid | title_and_abstract | text_proc |
|---|---|---|---|
| 0 | 8549602 | Expression of p53 and coexistence of HPV in pr... | expression p coexistence hpv premalignant lesi... |
| 1 | 21736816 | Vitamin D status in pregnant Indian women acro... | vitamin status pregnant indian women across tr... |
| 2 | 19060934 | [Identification of a functionally important di... | identification functionally important dipeptid... |
| 3 | 11426874 | Multilayer capsules: a promising microencapsul... | multilayer capsules promising microencapsulati... |
| 4 | 28323099 | Nanohydrogel with N,N'-bis(acryloyl)cystine cr... | nanohydrogel n n bis acryloyl cystine crosslin... |
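
The cleaning steps above can be traced on a single string. The sketch below uses only the standard library; the small `DEMO_STOP_WORDS` set is an assumption standing in for NLTK's English list, and both the sample sentence and the `clean_text` helper are made up for illustration. It shows why "p53" becomes "p" in the first preview row: digits are replaced before tokens are split.

```python
import re

# Assumption: a tiny hand-picked stop word set standing in for NLTK's English list
DEMO_STOP_WORDS = {"of", "and", "in", "the", "a"}

def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, replace non-letters with spaces, then drop stop words."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z]", " ", lowered)
    return " ".join(w for w in letters_only.split() if w not in stop_words)

cleaned = clean_text("Expression of p53 and HPV-16 in 54 tissue sections.")
print(cleaned)  # expression p hpv tissue sections
```

Tokens whose digits carry meaning (gene or protein names, for example) lose that information under this rule, which may be worth revisiting if such terms matter for the labels.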


In [6]:
# Count the words in each "text_proc" entry
features["word_count"] = features["text_proc"].str.split().str.len()

# Output summary statistics of the word counts
features["word_count"].describe()


count    9999.000000
mean      131.280728
std        50.249499
min        10.000000
25%        96.000000
50%       133.000000
75%       164.000000
max       522.000000
Name: word_count, dtype: float64

In [7]:
# Apply YAKE to preprocessed text

# Import packages
import yake

# Specify parameters (keyword count is set to 10 to match the minimum word count found above)
language = "en"
max_ngram_size = 1
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 10

yake_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords, features=None)

# Apply YAKE extraction, keeping only the keyword from each (keyword, score) result
# Adapted from: https://stackoverflow.com/questions/71100378/create-a-new-columns-based-on-keywords-in-yake
extract_keywords = lambda x: [k[0] for k in yake_extractor.extract_keywords(x)]
features["keywords"] = features["text_proc"].apply(extract_keywords)
features.head()

| | pmid | title_and_abstract | text_proc | word_count | keywords |
|---|---|---|---|---|---|
| 0 | 8549602 | Expression of p53 and coexistence of HPV in pr... | expression p coexistence hpv premalignant lesi... | 81 | [cervical, cancer, hpv, expression, dysplasia,... |
| 1 | 21736816 | Vitamin D status in pregnant Indian women acro... | vitamin status pregnant indian women across tr... | 179 | [trimester, serum, levels, women, vitamin, sta... |
| 2 | 19060934 | [Identification of a functionally important di... | identification functionally important dipeptid... | 75 | [opioid, tyr, pro, activity, atypical, peptide... |
| 3 | 11426874 | Multilayer capsules: a promising microencapsul... | multilayer capsules promising microencapsulati... | 162 | [membrane, multilayer, encapsulation, pancreat... |
| 4 | 28323099 | Nanohydrogel with N,N'-bis(acryloyl)cystine cr... | nanohydrogel n n bis acryloyl cystine crosslin... | 165 | [nanogels, obtained, cells, dox, drug, acryloy... |
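
The `extract_keywords` function in the cell above keeps only the first element of each item YAKE returns. The sketch below mimics that unpacking on a hand-written list of `(keyword, score)` pairs, so it runs without YAKE installed. The scores are made up for illustration; the reading that lower scores mean more relevant keywords follows the YAKE documentation, and the 0.1 cut-off is an arbitrary assumption.

```python
# Hypothetical extractor output: (keyword, score) pairs with made-up scores
demo_pairs = [("cervical", 0.04), ("cancer", 0.06), ("hpv", 0.09), ("expression", 0.15)]

# Keep only the keyword strings, as the notebook does with k[0]
keywords_only = [pair[0] for pair in demo_pairs]

# Optionally use the scores as well, here with an arbitrary cut-off of 0.1
strong_keywords = [kw for kw, score in demo_pairs if score < 0.1]

print(keywords_only)
print(strong_keywords)
```

Discarding the scores keeps the feature table simple; retaining them would allow weighting or thresholding keywords in later modelling steps.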


In [8]:
# Remove columns to prep for modelling
features = features[["pmid", "keywords"]]
features.head()

| | pmid | keywords |
|---|---|---|
| 0 | 8549602 | [cervical, cancer, hpv, expression, dysplasia,... |
| 1 | 21736816 | [trimester, serum, levels, women, vitamin, sta... |
| 2 | 19060934 | [opioid, tyr, pro, activity, atypical, peptide... |
| 3 | 11426874 | [membrane, multilayer, encapsulation, pancreat... |
| 4 | 28323099 | [nanogels, obtained, cells, dox, drug, acryloy... |


### 4. Label Preprocessing
##### This section ensures that the label data is ready for multi-label modelling. 

In [9]:
# Check data types of the label data
labels.dtypes

pmid    int64
A       int64
B       int64
C       int64
D       int64
dtype: object

In [10]:
# Rename label columns to match the category names in the original data
# (reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning)
labels = labels.rename(columns={"A": "anatomy", "B": "organisms", "C": "diseases", "D": "chemicals_and_drugs"})
labels.head()

| | pmid | anatomy | organisms | diseases | chemicals_and_drugs |
|---|---|---|---|---|---|
| 0 | 8549602 | 0 | 1 | 1 | 1 |
| 1 | 21736816 | 0 | 1 | 1 | 1 |
| 2 | 19060934 | 1 | 1 | 0 | 1 |
| 3 | 11426874 | 1 | 1 | 1 | 1 |
| 4 | 28323099 | 1 | 1 | 0 | 1 |
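
Beyond checking dtypes, two quick sanity checks for multi-label targets are confirming that every column is a binary indicator and looking at how many labels each article carries. A minimal sketch in plain Python, using only the five preview rows shown above, so the numbers describe the preview and not the full dataset:

```python
# Label rows copied from the five-row preview above
label_names = ["anatomy", "organisms", "diseases", "chemicals_and_drugs"]
rows = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
]

# Every entry should be a 0/1 indicator
assert all(value in (0, 1) for row in rows for value in row)

# How many of the previewed articles carry each label
label_counts = {name: sum(row[i] for row in rows) for i, name in enumerate(label_names)}

# Label cardinality: average number of labels per article
cardinality = sum(sum(row) for row in rows) / len(rows)

print(label_counts)
print(cardinality)  # 3.2
```

Running the same checks on the full label table would show whether any label is rare or near-universal, which affects how informative per-label metrics will be.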


### 5. Export Data

In [11]:
# Export output for further processing
features.to_excel(r"/Users/ishanisahama/Documents/Data Science/github_blog/multi-label classification/output/features_proc.xlsx")
labels.to_excel(r"/Users/ishanisahama/Documents/Data Science/github_blog/multi-label classification/output/labels_proc.xlsx")
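
One thing to watch when reloading these files: the `keywords` column holds Python lists, and spreadsheet cells store such values as text, so a later step will most likely read them back as strings like `"['cervical', 'cancer']"`. Below is a hedged sketch of converting such a string back into a list with the standard library. The `parse_keywords` helper and the sample cell are illustrative only and not part of the original notebook.

```python
import ast

def parse_keywords(cell):
    """Convert a stringified list such as "['a', 'b']" back into a list."""
    if isinstance(cell, list):  # already a list, nothing to do
        return cell
    return ast.literal_eval(cell)

# Hypothetical cell value as it might look after a round trip through a spreadsheet
sample_cell = "['cervical', 'cancer', 'hpv']"
parsed = parse_keywords(sample_cell)
print(parsed)  # ['cervical', 'cancer', 'hpv']
```

After reading the file back with pandas, the helper could be applied with `features["keywords"].apply(parse_keywords)`.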