# Exploratory data inspection (PCOS CBC–Hormone dataset)

Notebook: Piorkowska_Jonczyk_01_exploratory.ipynb  
Target journal: *Reproductive Biology and Endocrinology*

Version: v1.1  
Last updated: 2025-12-18

## Purpose
This notebook performs a **lightweight, read-only exploratory inspection** of the raw dataset used in the study (row/column counts, preview, column names, missingness, and duplicate records).  
It supports transparent reporting under the STROBE guidelines and documents the initial state of the data before preprocessing.

## Inputs
- In Google Colab: `/content/patients_results.xlsx` (upload via the Files panel)
- Locally: `./data/patients_results.xlsx` (recommended project structure)
- Alternatively: set the `PATIENT_RESULTS_XLSX` environment variable to the full path of the file


## Outputs
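- No analytical results are generated here.
- Summary tables are printed to the notebook output for auditability.
- Optional: the missingness table can be saved to `./outputs/eda/missingness_table.csv` by setting `SAVE_SUMMARIES = True`.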


## Reproducibility notes
- This notebook is intentionally **read-only** (no cleaning or transformations are applied).
- To prevent accidental disclosure of sensitive information, do not print full raw records or unique identifiers.
- Run from the project root so relative paths resolve correctly.
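
The disclosure note above can be put into practice with a small helper that drops identifier-like columns before any preview is shown. This is a minimal sketch and not part of the original notebook: `safe_preview` and `ID_COLUMNS` are hypothetical names, and treating `Nr KG` as an identifier is an assumption that should be checked against the data dictionary. The demo frame is synthetic.

```python
import pandas as pd

# Assumption: columns that may identify a patient or a hospital record.
# Adjust this list to the actual data dictionary.
ID_COLUMNS = ["Nr KG"]


def safe_preview(frame: pd.DataFrame, id_columns=ID_COLUMNS, n: int = 5) -> pd.DataFrame:
    """Return the first n rows without identifier columns (the input is not modified)."""
    present = [c for c in id_columns if c in frame.columns]
    return frame.drop(columns=present).head(n)


# Synthetic frame used only to illustrate the call (not study data).
demo = pd.DataFrame({"Nr KG": [101, 102, 103], "Wiek": [25, 31, 28]})
preview = safe_preview(demo, n=2)
print(preview)
```

Because `drop` returns a new frame, the helper stays consistent with the read-only intent of the notebook.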


## 1. Imports and configuration


In [None]:
from __future__ import annotations

from pathlib import Path
import sys

import pandas as pd


## 2. Load raw data


In [None]:
from pathlib import Path
import os
import pandas as pd


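# Optional override: full path to the workbook, read from the PATIENT_RESULTS_XLSX environment variable
ENV_PATH = os.getenv("PATIENT_RESULTS_XLSX")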

# Candidate locations (Colab + local)
CANDIDATES = [
    Path(ENV_PATH) if ENV_PATH else None,
    Path("/content/patients_results.xlsx"),          # Colab (common)
    Path("/content/patient_results.xlsx"),           # Colab (fallback for the singular file name)
    Path("./data/patients_results.xlsx"),            # local recommended structure
    Path("./content/patients_results.xlsx"),         # local alt (mirrors Colab naming)
    Path("./patients_results.xlsx"),                 # local simplest
]

DATA_PATH = next((p for p in CANDIDATES if p and p.exists()), None)

if DATA_PATH is None:
    tried = "\n".join([str(p) for p in CANDIDATES if p is not None])
    raise FileNotFoundError(
        "Input file not found. Tried the following locations:\n"
        f"{tried}\n\n"
        "Fix options:\n"
        " - Colab: upload the file to /content/ (left panel → Files → Upload)\n"
        " - Local: place it in ./data/ as patients_results.xlsx\n"
        " - Or set env var PATIENT_RESULTS_XLSX to the full path."
    )

print(f"Reading input file: {DATA_PATH.resolve()}")
df = pd.read_excel(DATA_PATH, engine="openpyxl")


Reading input file: /content/patients_results.xlsx


In [None]:
print("Python:", sys.version.split()[0])
print("pandas:", pd.__version__)
print("Rows, columns:", df.shape)


Python: 3.12.12
pandas: 2.2.2
Rows, columns: (1315, 202)


## 3. Quick preview


In [None]:
pd.Series(df.columns, name="column_name").head(50)


| | column_name |
|---|---|
| 0 | Nr KG |
| 1 | Rok KG |
| 2 | Przyjęcie na oddział zlecający |
| 3 | Wypis z oddziału zlecającego |
| 4 | Wiek |
| 5 | 17 - OH progesteron (L79) (17-OHPG) |
| 6 | 17 OH progesteron (L79) |
| 7 | ALAT (ALT) |
| 8 | AMH (hormon anty-Mullerowski) (AMH_CP) |
| 9 | AMH-anty Mullerian Hormon (AMH) |


## 4. Number of columns


In [None]:
df.shape[1]


202

## 5. Data types


In [None]:
df.dtypes

| Column | dtype |
|---|---|
| Nr KG | int64 |
| Rok KG | int64 |
| Przyjęcie na oddział zlecający | datetime64[ns] |
| Wypis z oddziału zlecającego | datetime64[ns] |
| Wiek | int64 |
| ... | ... |
| Wymaz z kanału szyjki macicy - posiew tlenowy (DAT-ZAK) | object |
| Wymaz z kanału szyjki macicy - posiew tlenowy (UWAGI) | object |
| Wymaz z kanału szyjki macicy - posiew tlenowy (Wynik badania) | float64 |
| beta HCG (L_HCG) | object |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1315 entries, 0 to 1314
Columns: 202 entries, Nr KG to Żelazo (FE)
dtypes: datetime64[ns](2), float64(25), int64(3), object(172)
memory usage: 2.0+ MB
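
Most columns (172 of 202) load as `object`. This may indicate numeric results stored as text, for example values with a `<` prefix, although that is not confirmed here. The sketch below is illustrative and not part of the original notebook: for each `object` column it estimates the share of non-missing values that parse as numbers, without modifying the data. `numeric_like_share` is a hypothetical helper and the demo frame is synthetic.

```python
import pandas as pd


def numeric_like_share(frame: pd.DataFrame) -> pd.Series:
    """Share of non-missing values per object column that parse as numbers."""
    shares = {}
    for col in frame.select_dtypes(include="object").columns:
        values = frame[col].dropna()
        if values.empty:
            continue  # nothing to assess in a fully missing column
        parsed = pd.to_numeric(values, errors="coerce")  # unparseable values become NaN
        shares[col] = parsed.notna().mean()
    return pd.Series(shares, dtype="float64").sort_values(ascending=False)


# Synthetic example (not study data); dtype=object mimics text columns read from Excel.
demo = pd.DataFrame(
    {
        "a": pd.Series(["1.5", "2", "<0.1"], dtype=object),
        "b": pd.Series(["x", "y", None], dtype=object),
        "c": [1, 2, 3],
    }
)
shares = numeric_like_share(demo)
print(shares.round(2))
```

On the real data, the call would be `numeric_like_share(df)`. Columns with a high share are candidates for numeric conversion in a later preprocessing notebook.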


## 6. Number of observations


In [None]:
len(df)

1315

## 7. Dataset size (rows × columns)


In [None]:
df.shape

(1315, 202)

## 8. Column list


In [None]:
cols = df.columns.tolist()
print(f"Number of columns: {len(cols)}")
print("First 50 columns:")
for name in cols[:50]:
    print(" -", name)


Number of columns: 202
First 50 columns:
 - Nr KG
 - Rok KG
 - Przyjęcie na oddział zlecający
 - Wypis z oddziału zlecającego
 - Wiek
 - 17 - OH progesteron (L79) (17-OHPG)
 - 17 OH progesteron (L79)
 - ALAT (ALT)
 - AMH (hormon anty-Mullerowski) (AMH_CP)
 - AMH-anty Mullerian Hormon (AMH)
 - APTT Czas kaolinowo-kefalinowy (APTTCZ)
 - ASO - ilościowo (ASOIL)
 - ASPAT (AST)
 - Androstendion (ANDRO)
 - Androstendion (I31)
 - Androstendion (I31) (ANDRO)
 - Anty - HCV (L_ANTHCV)
 - Anty-TG (O18)
 - Anty-TG (p/c przeciw tyreoglobulinie) (ATG)
 - Anty-TPO (ATA_TPO)
 - Białko C-reaktywne (CRP)
 - Bilirubina całkowita (TBIL)
 - CEA (CEA)
 - Ca125 (CA125)
 - Ca19.9 (CA199)
 - Cholesterol całkowity (TCHOL)
 - D-dimery (DDIMER)
 - DHEAS (DHEA)
 - Dobowy rytm tolerancji glukozy (L_G1030)
 - Dobowy rytm tolerancji glukozy (L_G1200)
 - Dobowy rytm tolerancji glukozy (L_G1500)
 - Dobowy rytm tolerancji glukozy (L_G1800)
 - Dobowy rytm tolerancji glukozy (L_G2100)
 - Dobowy rytm tolerancji glukozy (L_
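
Several names in the column list appear to describe the same assay under different labels, for example `Androstendion (ANDRO)`, `Androstendion (I31)`, and `Androstendion (I31) (ANDRO)`. The sketch below groups names that share a base label once parenthesised codes, whitespace, and hyphens are removed. It is not part of the original notebook, `group_similar_names` is a hypothetical helper, and the normalisation rule is an assumption. Whether grouped columns really measure the same quantity has to be verified separately.

```python
import re
from collections import defaultdict


def group_similar_names(names):
    """Group column names that share a base label after light normalisation."""
    groups = defaultdict(list)
    for name in names:
        base = re.sub(r"\([^)]*\)", "", name)        # drop parenthesised codes
        base = re.sub(r"[\s\-]+", "", base).lower()  # drop whitespace and hyphens
        groups[base].append(name)
    # Keep only base labels shared by more than one column
    return {base: members for base, members in groups.items() if len(members) > 1}


# Names copied from the column list above
demo_names = [
    "17 - OH progesteron (L79) (17-OHPG)",
    "17 OH progesteron (L79)",
    "ALAT (ALT)",
    "Androstendion (ANDRO)",
    "Androstendion (I31)",
    "Androstendion (I31) (ANDRO)",
]
similar = group_similar_names(demo_names)
for base, members in similar.items():
    print(base, "->", members)
```

On the full dataset, the call would be `group_similar_names(df.columns)`.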

## 9. Missingness overview


In [None]:

missing_counts = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing_counts / len(df) * 100).round(2)

missing_table = pd.DataFrame(
    {"missing_n": missing_counts, "missing_%": missing_pct}
)

# Keep only columns with at least one missing value
missing_table = missing_table[missing_table["missing_n"] > 0]

# Optional export of the summary (disabled by default)
SAVE_SUMMARIES = False
OUTPUT_DIR = Path("./outputs/eda")
if SAVE_SUMMARIES:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    missing_table.to_csv(OUTPUT_DIR / "missingness_table.csv", index=True)

# Last expression in the cell, so that Jupyter renders the table
missing_table.head(50)
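
With about 200 columns, a per-column table can be hard to scan. A coarser view counts how many columns fall into each missingness band. The sketch below is illustrative and not part of the original notebook: `missingness_bands` is a hypothetical helper, the band edges (5%, 20%, 50%) are arbitrary assumptions, and the demo frame is synthetic.

```python
import pandas as pd

BAND_EDGES = [0, 5, 20, 50, 100]  # assumption: illustrative cut points in percent
BAND_LABELS = ["0-5%", "5-20%", "20-50%", "50-100%"]


def missingness_bands(frame: pd.DataFrame) -> pd.Series:
    """Count columns per missingness band (bands are right-inclusive)."""
    pct = frame.isna().mean() * 100
    bands = pd.cut(pct, bins=BAND_EDGES, labels=BAND_LABELS, include_lowest=True)
    # Convert to plain strings, count, and restore the band order with zero-filled gaps
    return bands.astype(str).value_counts().reindex(BAND_LABELS, fill_value=0)


# Synthetic example (not study data): 0%, 25%, 75%, and 100% missing
demo = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [1, None, 3, 4],
        "c": [None, None, None, 4],
        "d": [None, None, None, None],
    }
)
band_counts = missingness_bands(demo)
print(band_counts)
```

On the real data, the call would be `missingness_bands(df)`.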


## 10. Duplicate rows


In [None]:
n_dup = int(df.duplicated().sum())
print(f"Exact duplicate rows: {n_dup} ({(n_dup/len(df)*100):.2f}%)")


Exact duplicate rows: 0 (0.00%)
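
The check above only detects rows that are identical in every column. Records can also repeat on a key while differing elsewhere. The sketch below counts such repeats and prints only the count, not the records. It is not part of the original notebook: `count_key_duplicates` is a hypothetical helper, and the choice of `Nr KG` together with `Rok KG` as a record key is an assumption. Repeated keys may also be legitimate, for example repeated admissions.

```python
import pandas as pd

# Assumption: these two columns together identify a record.
KEY_COLUMNS = ["Nr KG", "Rok KG"]


def count_key_duplicates(frame: pd.DataFrame, key_columns) -> int:
    """Count rows whose key repeats the key of an earlier row."""
    return int(frame.duplicated(subset=key_columns, keep="first").sum())


# Synthetic example (not study data): two rows share a key but differ in age
demo = pd.DataFrame(
    {"Nr KG": [1, 1, 2], "Rok KG": [2020, 2020, 2020], "Wiek": [30, 31, 40]}
)
n_key_dup = count_key_duplicates(demo, KEY_COLUMNS)
n_exact_dup = int(demo.duplicated().sum())
print(f"Key duplicates: {n_key_dup}, exact duplicates: {n_exact_dup}")
```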
