**PubMedQA - Exploratory Data Analysis (EDA)**
This notebook provides an exploratory data analysis (EDA) of the PubMedQA dataset, including both labeled and unlabeled samples. The main objectives are:

- **Loading and Inspecting Data:** Load the PubMedQA dataset (labeled and unlabeled) and examine its structure using pandas DataFrames.
- **Label Distribution:** Analyze the distribution of answer labels in the labeled dataset.
- **Context Analysis:** Compute statistics such as the number of context passages per question and the word count of contexts.
- **Custom Columns:** Add columns to distinguish between labeled and unlabeled samples and to facilitate further analysis.
- **Descriptive Statistics:** Calculate descriptive statistics (mean, mode, skewness, kurtosis) for context-related features.
- **Preparation for Further Analysis:** Prepare the data for more advanced analyses of question complexity and answer patterns.

The results characterize the dataset and inform subsequent modeling or research steps.

In [1]:
from datasets import load_dataset
import pandas as pd
ds = load_dataset("qiaojin/PubMedQA", "pqa_labeled")

In [2]:
df_train = pd.DataFrame(ds['train'])
# Preview the first few rows of the DataFrame
df_train.head()

Unnamed: 0,pubid,question,context,long_answer,final_decision
0,21645374,Do mitochondria play a role in remodelling lac...,{'contexts': ['Programmed cell death (PCD) is ...,Results depicted mitochondrial dynamics in viv...,yes
1,16418930,Landolt C and snellen e acuity: differences in...,{'contexts': ['Assessment of visual acuity dep...,"Using the charts described, there was only a s...",no
2,9488747,"Syncope during bathing in infants, a pediatric...",{'contexts': ['Apparent life-threatening event...,"""Aquagenic maladies"" could be a pediatric form...",yes
3,17208539,Are the long-term results of the transanal pul...,{'contexts': ['The transanal endorectal pull-t...,Our long-term study showed significantly bette...,no
4,10808977,Can tailored interventions increase mammograph...,{'contexts': ['Telephone counseling and tailor...,The effects of the intervention were most pron...,yes


In [3]:
len(df_train)

1000

In [4]:
# Custom column: flag every row of the labeled split so it can be told apart from unlabeled rows
df_train['labelled'] = "Yes"

In [5]:
df_train.head()

Unnamed: 0,pubid,question,context,long_answer,final_decision,labelled
0,21645374,Do mitochondria play a role in remodelling lac...,{'contexts': ['Programmed cell death (PCD) is ...,Results depicted mitochondrial dynamics in viv...,yes,Yes
1,16418930,Landolt C and snellen e acuity: differences in...,{'contexts': ['Assessment of visual acuity dep...,"Using the charts described, there was only a s...",no,Yes
2,9488747,"Syncope during bathing in infants, a pediatric...",{'contexts': ['Apparent life-threatening event...,"""Aquagenic maladies"" could be a pediatric form...",yes,Yes
3,17208539,Are the long-term results of the transanal pul...,{'contexts': ['The transanal endorectal pull-t...,Our long-term study showed significantly bette...,no,Yes
4,10808977,Can tailored interventions increase mammograph...,{'contexts': ['Telephone counseling and tailor...,The effects of the intervention were most pron...,yes,Yes


In [6]:
dict_data = df_train['context'][0]

In [7]:
dict_data

{'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.',
  'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). Window stage leaves were stained with the mitochondrial dye MitoT

## Distribution of final_decision

In [8]:
df_train['final_decision'].value_counts()

final_decision
yes      552
no       338
maybe    110
Name: count, dtype: int64
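
The raw counts above are easier to compare as proportions. The following is a minimal sketch that re-enters the printed counts into a standalone Series (`label_counts` and `label_share` are illustrative names, not from the notebook). On the live DataFrame, `df_train['final_decision'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
label_counts = pd.Series({"yes": 552, "no": 338, "maybe": 110}, name="count")

# Share of each answer label among the labeled questions.
label_share = label_counts / label_counts.sum()
print(label_share.round(3))
```

The classes are imbalanced: a classifier that always answers "yes" would already be right on 55.2% of this split, which is a useful baseline to keep in mind.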

In [9]:
from nltk.tokenize import word_tokenize

In [10]:
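# Count the context passages per question and the word tokens across them.
# Note: passages are concatenated without a separator, so the last word of one
# passage and the first word of the next can merge into a single token.
df_train['context_count'] = 0
df_train['context_length'] = 0
for index, row in df_train.iterrows():
    context = row['context']['contexts']
    df_train.at[index, 'context_count'] = len(context)
    context_str = ''.join(context)
    df_train.at[index, 'context_length'] = len(word_tokenize(context_str))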

In [11]:
df_train['context_count'].value_counts()

context_count
3    720
4    101
2     69
6     47
5     34
7     22
8      5
9      1
1      1
Name: count, dtype: int64
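
The two columns computed in cell 10 can also be derived without an explicit `iterrows` loop. The sketch below is one possible alternative, not the notebook's method: it runs on a made-up two-row frame (`toy`) and uses whitespace splitting (`simple_tokenize`, a hypothetical helper) in place of `word_tokenize` so that it needs no NLTK data. It also joins passages with a space, which differs from the loop above and can change token counts slightly.

```python
import pandas as pd

# Toy stand-in for df_train: each 'context' cell is a dict with a 'contexts' list.
toy = pd.DataFrame({
    "context": [
        {"contexts": ["First passage here.", "Second passage."]},
        {"contexts": ["Only one passage in this row."]},
    ]
})

def simple_tokenize(text):
    # Assumption: whitespace splitting stands in for nltk's word_tokenize.
    return text.split()

# Pull the list of passages out of each dict once.
passages = toy["context"].apply(lambda d: d["contexts"])

toy["context_count"] = passages.apply(len)
# Join with a space so words at passage boundaries stay separate tokens.
toy["context_length"] = passages.apply(lambda p: len(simple_tokenize(" ".join(p))))

print(toy[["context_count", "context_length"]])
```

Swapping `simple_tokenize` for `word_tokenize` would bring the counts closer to those in the notebook, since NLTK also splits punctuation into separate tokens.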

In [12]:
# Descriptive statistics for the number of context passages per question
context_counts = pd.to_numeric(df_train['context_count'], errors='coerce')

# count, mean, std, min, quartiles, max
descriptive_stats = context_counts.describe()

# Additional statistics: mode, skewness, and excess kurtosis
mode = context_counts.mode()[0]  # The data may be multimodal; take the first mode
skewness = context_counts.skew()
kurtosis = context_counts.kurt()

(descriptive_stats, mode, skewness, kurtosis)

(count    1000.000000
 mean        3.358000
 std         1.057807
 min         1.000000
 25%         3.000000
 50%         3.000000
 75%         3.000000
 max         9.000000
 Name: context_count, dtype: float64,
 3,
 2.175734543159045,
 5.063905223193476)
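
These figures can be cross-checked against the frequency table printed in cell 11. The sketch below rebuilds the 1000 values from that table (`freq` and `rebuilt` are illustrative names) and recomputes the same statistics. Note that pandas' `kurt()` reports excess kurtosis, so a normal distribution scores 0 rather than 3.

```python
import pandas as pd

# Frequencies copied from the context_count value_counts() output.
freq = {3: 720, 4: 101, 2: 69, 6: 47, 5: 34, 7: 22, 8: 5, 9: 1, 1: 1}

# Expand the table back into one value per question.
rebuilt = pd.Series([value for value, n in freq.items() for _ in range(n)])

print(rebuilt.mean())  # arithmetic mean
print(rebuilt.std())   # sample standard deviation (ddof=1)
print(rebuilt.skew())  # positive: long tail toward many passages
print(rebuilt.kurt())  # excess kurtosis (a normal distribution scores 0)
```

With 72% of questions having exactly three passages, the quartiles all collapse to 3, and the positive skew comes from the small group of questions with five or more passages.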

In [13]:
# Descriptive statistics for the total word count of each question's context
context_lengths = pd.to_numeric(df_train['context_length'], errors='coerce')

# count, mean, std, min, quartiles, max
descriptive_stats = context_lengths.describe()

# Additional statistics: mode, skewness, and excess kurtosis
mode = context_lengths.mode()[0]  # The data may be multimodal; take the first mode
skewness = context_lengths.skew()
kurtosis = context_lengths.kurt()

(descriptive_stats, mode, skewness, kurtosis)

(count    1000.000000
 mean      234.442000
 std        65.007219
 min        50.000000
 25%       191.000000
 50%       233.000000
 75%       272.000000
 max       461.000000
 Name: context_length, dtype: float64,
 216,
 0.3354727243032049,
 0.5991016061513186)
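
The context lengths are close to symmetric (skewness about 0.34), but the minimum and maximum sit far from the quartiles. As a rough check, the sketch below applies Tukey's 1.5 × IQR rule to the quartiles reported above. The choice of this rule is an assumption; the notebook itself does not apply an outlier criterion.

```python
# Quartiles and extremes copied from the describe() output above.
q1, q3 = 191.0, 272.0
observed_min, observed_max = 50.0, 461.0

iqr = q3 - q1                 # interquartile range
lower_fence = q1 - 1.5 * iqr  # values below this are flagged
upper_fence = q3 + 1.5 * iqr  # values above this are flagged

print(iqr, lower_fence, upper_fence)
print(observed_min < lower_fence, observed_max > upper_fence)
```

Both extremes fall outside the fences, so under this rule the labeled split contains at least one unusually short and one unusually long context.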

In [14]:
unlabeled_ds = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled")

In [15]:
df_unlabelled = pd.DataFrame(unlabeled_ds['train'])
df_unlabelled.head()

Unnamed: 0,pubid,question,context,long_answer
0,14499029,Is naturopathy as effective as conventional th...,{'contexts': ['Although the use of alternative...,Naturopathy appears to be an effective alterna...
1,14499049,Can randomised trials rely on existing electro...,"{'contexts': ['To estimate the feasibility, ut...",Routine data have the potential to support hea...
2,14499672,Is laparoscopic radical prostatectomy better t...,{'contexts': ['To compare morbidity in two gro...,The results of our non-randomized study show t...
3,14499773,Does bacterial gastroenteritis predispose peop...,{'contexts': ['Irritable bowel syndrome (IBS) ...,Symptoms consistent with IBS and functional di...
4,14499777,Is early colonoscopy after admission for acute...,{'contexts': ['Urgent colonoscopy has been pro...,No significant association is apparent between...


In [16]:
df_unlabelled['context'][0]

{'contexts': ['Although the use of alternative medicine in the United States is increasing, no published studies have documented the effectiveness of naturopathy for treatment of menopausal symptoms compared to women receiving conventional therapy in the clinical setting.',
  'To compare naturopathic therapy with conventional medical therapy for treatment of selected menopausal symptoms.',
  'A retrospective cohort study, using abstracted data from medical charts.',
  'One natural medicine and six conventional medical clinics at Community Health Centers of King County, Washington, from November 1, 1996, through July 31, 1998.',
  'Women aged 40 years of age or more with a diagnosis of menopausal symptoms documented by a naturopathic or conventional physician.',
  'Improvement in selected menopausal symptoms.',
  'In univariate analyses, patients treated with naturopathy for menopausal symptoms reported higher monthly incomes ($1848.00 versus $853.60), were less likely to be smokers (11

In [17]:
df_unlabelled['labelled'] = "No"

In [18]:
df_unlabelled.shape

(61249, 5)

In [19]:
df_train.shape

(1000, 8)
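
The `labelled` column exists in both frames, so they can be stacked and still told apart. The notebook does not show this step; the sketch below illustrates one way to do it with `pd.concat`, using small made-up frames (`labeled`, `unlabeled`) in place of `df_train` and `df_unlabelled`.

```python
import pandas as pd

# Toy stand-ins for df_train and df_unlabelled (values are made up).
labeled = pd.DataFrame({
    "pubid": [1, 2],
    "question": ["Q1?", "Q2?"],
    "final_decision": ["yes", "no"],
    "labelled": ["Yes", "Yes"],
})
unlabeled = pd.DataFrame({
    "pubid": [3],
    "question": ["Q3?"],
    "labelled": ["No"],
})

# Stack the rows; columns missing from one frame are filled with NaN.
combined = pd.concat([labeled, unlabeled], ignore_index=True)

print(combined)
print(combined.groupby("labelled").size())
```

Applied to the real frames, columns that exist only in `df_train` (`final_decision`, `context_count`, `context_length`) would be NaN for the unlabeled rows until they are computed for that split as well.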

In [20]:
len(df_unlabelled['context'][2]['contexts'])

3