## Data Exploration
##### This notebook explores the PubMed Multi Label Text Classification Dataset from Kaggle (https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification). The data is cleaned and prepared for downstream multi-label modelling.

### 1. Import Data

In [16]:
# Import packages
import pandas as pd
import numpy as np

# Import data
df = pd.read_csv(r"/Users/ishanisahama/Documents/Data Science/github_blog/multi-label classification/input/PubMed Multi Label Text Classification Dataset.csv")
df.head()


Unnamed: 0,Title,abstractText,meshMajor,pmid,meshid,meshroot,A,B,C,D,...,G,H,I,J,K,L,M,N,V,Z
0,Expression of p53 and coexistence of HPV in pr...,Fifty-four paraffin embedded tissue sections f...,"['DNA Probes, HPV', 'DNA, Viral', 'Female', 'H...",8549602,"[['D13.444.600.223.555', 'D27.505.259.750.600....","['Chemicals and Drugs [D]', 'Organisms [B]', '...",0,1,1,1,...,0,1,0,0,0,0,0,0,0,0
1,Vitamin D status in pregnant Indian women acro...,The present cross-sectional study was conducte...,"['Adult', 'Alkaline Phosphatase', 'Breast Feed...",21736816,"[['M01.060.116'], ['D08.811.277.352.650.035'],...","['Named Groups [M]', 'Chemicals and Drugs [D]'...",0,1,1,1,...,1,0,1,1,0,0,1,1,0,1
2,[Identification of a functionally important di...,The occurrence of individual amino acids and d...,"['Amino Acid Sequence', 'Analgesics, Opioid', ...",19060934,"[['G02.111.570.060', 'L01.453.245.667.060'], [...","['Phenomena and Processes [G]', 'Information S...",1,1,0,1,...,1,0,0,0,0,1,0,0,0,0
3,Multilayer capsules: a promising microencapsul...,"In 1980, Lim and Sun introduced a microcapsule...","['Acrylic Resins', 'Alginates', 'Animals', 'Bi...",11426874,"[['D05.750.716.822.111', 'D25.720.716.822.111'...","['Chemicals and Drugs [D]', 'Technology, Indus...",1,1,1,1,...,1,0,0,1,0,0,0,0,0,0
4,"Nanohydrogel with N,N'-bis(acryloyl)cystine cr...",Substantially improved hydrogel particles base...,"['Antineoplastic Agents', 'Cell Proliferation'...",28323099,"[['D27.505.954.248'], ['G04.161.750', 'G07.345...","['Chemicals and Drugs [D]', 'Phenomena and Pro...",1,1,0,1,...,1,0,0,1,0,0,0,0,0,0


### 2. Separate Features and Labels

##### This section separates the features (the title and abstract of each PubMed article) from the labels (columns A-Z) for further processing.

In [17]:
# Print data dimensions
print("The dimensions of the data before processing are:", df.shape)

# Drop missing values
df = df.dropna()

# Print data dimensions
print("The dimensions of the data after processing are:", df.shape)

The dimensions of the data before processing are: (10000, 22)
The dimensions of the data after processing are: (9999, 22)
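Before dropping rows, it can help to see which columns hold the missing values. The sketch below shows one way to check this. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; all values are made up
toy = pd.DataFrame({
    "Title": ["Paper one", "Paper two", "Paper three"],
    "abstractText": ["Some abstract", np.nan, "Another abstract"],
    "A": [0, 1, 1],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with any missing value and report how many were removed
cleaned = toy.dropna()
rows_removed = len(toy) - len(cleaned)
print("Rows removed:", rows_removed)
```

Running `df.isna().sum()` on the real data would show which column accounts for the single row dropped above.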


In [18]:
# Separate features and labels
features = df[["pmid", "Title", "abstractText"]]
labels = df[["pmid", "A", "B", "C", "D"]]

# The "meshMajor", "meshid" and "meshroot" columns are excluded in this example so that only Title and abstractText serve as features.
# Only columns A-D are kept as labels because they are clearly described on the dataset's web page.

# Print data dimensions of subsets
print("The dimensions of the features are", features.shape)
print("The dimensions of the labels are", labels.shape)

The dimensions of the features are (9999, 3)
The dimensions of the labels are (9999, 5)
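With the labels separated, two quick checks are useful for a multi-label problem: how often each label occurs, and how many labels each article carries. This is a minimal sketch on a hypothetical `toy_labels` table that mirrors the layout of `labels`; the values are made up.

```python
import pandas as pd

# Hypothetical toy labels table with the same layout as `labels` (made-up values)
toy_labels = pd.DataFrame({
    "pmid": [101, 102, 103, 104],
    "A": [0, 1, 1, 1],
    "B": [1, 1, 0, 1],
    "C": [1, 0, 0, 1],
    "D": [1, 1, 1, 1],
})
label_cols = ["A", "B", "C", "D"]

# Share of articles carrying each label (mean of a 0/1 column)
label_frequency = toy_labels[label_cols].mean()
print(label_frequency)

# Number of labels assigned to each article
labels_per_article = toy_labels[label_cols].sum(axis=1)
print(labels_per_article.value_counts().sort_index())
```

Applying the same two lines to `labels[["A", "B", "C", "D"]]` would show whether any label is rare or nearly universal in the real data.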


In [19]:
# Lowercase features columns
features.columns = features.columns.str.lower()
list(features.columns)

['pmid', 'title', 'abstracttext']

In [20]:
# Merge title and abstract information into one column
# (assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
features = features.assign(title_and_abstract=features[["title", "abstracttext"]].agg(" ".join, axis=1))
features.head()


Unnamed: 0,pmid,title,abstracttext,title_and_abstract
0,8549602,Expression of p53 and coexistence of HPV in pr...,Fifty-four paraffin embedded tissue sections f...,Expression of p53 and coexistence of HPV in pr...
1,21736816,Vitamin D status in pregnant Indian women acro...,The present cross-sectional study was conducte...,Vitamin D status in pregnant Indian women acro...
2,19060934,[Identification of a functionally important di...,The occurrence of individual amino acids and d...,[Identification of a functionally important di...
3,11426874,Multilayer capsules: a promising microencapsul...,"In 1980, Lim and Sun introduced a microcapsule...",Multilayer capsules: a promising microencapsul...
4,28323099,"Nanohydrogel with N,N'-bis(acryloyl)cystine cr...",Substantially improved hydrogel particles base...,"Nanohydrogel with N,N'-bis(acryloyl)cystine cr..."
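The merge of title and abstract can also be written with the vectorised string method `Series.str.cat`. This is an alternative sketch, not the approach used in the cell above. The `toy_features` frame is hypothetical and its values are made up.

```python
import pandas as pd

# Hypothetical toy features table (made-up values)
toy_features = pd.DataFrame({
    "pmid": [101, 102],
    "title": ["Title one", "Title two"],
    "abstracttext": ["Abstract one", "Abstract two"],
})

# Concatenate the two text columns row by row, separated by a space
merged = toy_features["title"].str.cat(toy_features["abstracttext"], sep=" ")

# assign returns a new DataFrame rather than modifying toy_features in place
toy_features = toy_features.assign(title_and_abstract=merged)
print(toy_features["title_and_abstract"].tolist())
```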


In [21]:
# Remove "title" and "abstracttext" columns from features table
features = features[["pmid", "title_and_abstract"]]
list(features.columns)

['pmid', 'title_and_abstract']

### 3. Export Data

In [22]:
# Export output for further processing
features.to_excel(r"/Users/ishanisahama/Documents/Data Science/github_blog/multi-label classification/output/features.xlsx")
labels.to_excel(r"/Users/ishanisahama/Documents/Data Science/github_blog/multi-label classification/output/labels.xlsx")
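By default, pandas writes the row index as an extra unnamed column on export, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below illustrates the effect and how `index=False` avoids it. It uses CSV and a temporary directory only to stay self-contained; `to_excel` accepts the same `index` parameter. The `toy_features` frame and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy features table (made-up values)
toy_features = pd.DataFrame({
    "pmid": [101, 102],
    "title_and_abstract": ["Title one Abstract one", "Title two Abstract two"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = Path(tmp_dir) / "features_with_index.csv"
    no_index_path = Path(tmp_dir) / "features_no_index.csv"

    # Default behaviour: the row index is written as an unnamed first column
    toy_features.to_csv(with_index_path)
    # index=False writes only the real columns
    toy_features.to_csv(no_index_path, index=False)

    reloaded_with_index = pd.read_csv(with_index_path)
    reloaded_no_index = pd.read_csv(no_index_path)

print(list(reloaded_with_index.columns))
print(list(reloaded_no_index.columns))
```

Because `pmid` is kept in both tables, the row index is not needed to rejoin features and labels later.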