# Exploratory Data Analysis: Cochrane Reviews and References

**Summary:** In this notebook, I explore the Cochrane reviews and their references that I downloaded from PubMed. I look at how many reviews I have, their publication years, abstract lengths, and reference counts. I also check how many referenced papers have PMIDs that I can use to fetch their abstracts later.

**Key findings:**
- I have ~17,000 Cochrane reviews with abstracts
- There are ~1.2 million reference edges connecting reviews to cited papers
- About 72% of references have PMIDs, giving me ~491,000 unique papers I can fetch abstracts for
- These referenced papers are "included" studies that passed screening, so I'll use them as positive examples for my LLM evaluation

In [None]:
# I load the required libraries and set up the file paths to my data
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

# I use ./Data when it exists and fall back to ../Data otherwise
DATA_DIR = Path.cwd() / "Data"
if not DATA_DIR.exists():
    DATA_DIR = Path.cwd().parent / "Data"
ABSTRACTS_CSV = DATA_DIR / "cochrane_pubmed_abstracts.csv"
REFERENCES_CSV = DATA_DIR / "cochrane_pubmed_references.csv"

print(f"Data directory: {DATA_DIR}")

In [None]:
# I load and inspect the Cochrane reviews abstracts file
abstracts = pd.read_csv(ABSTRACTS_CSV, dtype={"pmid": str, "year": str})
print(f"Total Cochrane reviews: {len(abstracts):,}")
print(f"\nColumn types:\n{abstracts.dtypes}")
print(f"\nMissing values:\n{abstracts.isnull().sum()}")
abstracts.head()

In [None]:
# I plot the distribution of Cochrane reviews by publication year
abstracts["year_clean"] = pd.to_numeric(abstracts["year"].str[:4], errors="coerce")

fig, ax = plt.subplots(figsize=(12, 4))
abstracts["year_clean"].dropna().astype(int).value_counts().sort_index().plot(kind="bar", ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Number of Reviews")
ax.set_title("Cochrane Reviews by Publication Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
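
To make the year cleaning above easier to trust, here is a minimal sketch of what `str[:4]` followed by `pd.to_numeric(..., errors="coerce")` does to messy values. The sample strings are made up for illustration and are not taken from my data; I am only assuming that the `year` column can contain things like a trailing month or a non-numeric placeholder.

```python
import pandas as pd

# Hypothetical raw values, chosen to cover the cases I worry about
raw_years = pd.Series(["2019", "2021 Jan", None, "n.d.", "1998-2001"])

# Keep the first four characters, then coerce anything non-numeric to NaN
year_clean = pd.to_numeric(raw_years.str[:4], errors="coerce")

# Rows that could not be parsed stay as NaN and are dropped before plotting
n_unparsed = int(year_clean.isna().sum())
parsed_years = year_clean.dropna().astype(int).tolist()
print(f"Unparsed: {n_unparsed}, parsed: {parsed_years}")
```

Coercing instead of raising means a few malformed entries cannot break the plot, but it also hides them, so it is worth printing the unparsed count at least once.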

In [None]:
# I analyze the abstract lengths to understand what text I'm working with (missing abstracts count as 0 words)
abstracts["abstract_words"] = abstracts["abstract"].fillna("").str.split().str.len()

print("Abstract word count statistics:")
print(abstracts["abstract_words"].describe())

fig, ax = plt.subplots(figsize=(10, 4))
abstracts["abstract_words"].hist(bins=50, ax=ax, edgecolor="black")
ax.set_xlabel("Word Count")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Abstract Lengths (words)")
plt.tight_layout()
plt.show()

In [None]:
# I load the references file that links Cochrane reviews to their cited papers
refs = pd.read_csv(REFERENCES_CSV, dtype={"citing_pmid": str, "ref_pmid": str})
print(f"Total reference edges: {len(refs):,}")

reviews_with_refs = refs["citing_pmid"].nunique()
print(f"Cochrane reviews with at least one reference: {reviews_with_refs:,}")
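# I count against the abstracts table directly, since some citing PMIDs may not appear in it
reviews_without_refs = (~abstracts["pmid"].isin(refs["citing_pmid"])).sum()
print(f"Reviews without references: {reviews_without_refs:,}")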
refs.head()

In [None]:
# I plot the distribution of how many references each Cochrane review has
refs_per_review = refs.groupby("citing_pmid").size()

print("References per Cochrane review:")
print(refs_per_review.describe())

fig, ax = plt.subplots(figsize=(10, 4))
refs_per_review.hist(bins=50, ax=ax, edgecolor="black")
ax.set_xlabel("Number of References")
ax.set_ylabel("Number of Reviews")
ax.set_title("Distribution of Reference Counts per Cochrane Review")
plt.tight_layout()
plt.show()
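
The per-review counts above treat every row as one reference, so a repeated (review, reference) pair would inflate them. Below is a sketch of how I could check for that, run on a toy frame with made-up PMIDs. It assumes the same `citing_pmid` and `ref_pmid` column names as my references file and only looks at rows that have a PMID, since rows without one cannot be matched reliably.

```python
import pandas as pd

# Toy stand-in for the references table; all PMIDs here are invented
toy_refs = pd.DataFrame({
    "citing_pmid": ["111", "111", "111", "222", "222"],
    "ref_pmid": ["9001", "9001", None, "9002", "9003"],
})

# Only rows with a PMID can be compared as edges
with_pmid = toy_refs.dropna(subset=["ref_pmid"])

# Count exact repeats of a (review, reference) pair
edge_cols = ["citing_pmid", "ref_pmid"]
n_duplicate_edges = int(with_pmid.duplicated(subset=edge_cols).sum())

# Per-review counts after removing those repeats
unique_refs_per_review = (
    with_pmid.drop_duplicates(subset=edge_cols)
    .groupby("citing_pmid")
    .size()
)
print(f"Duplicate edges: {n_duplicate_edges}")
print(unique_refs_per_review)
```

If the duplicate count on the real table is zero, the histogram above can be read as is; otherwise the de-duplicated counts are the safer numbers to report.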

In [None]:
# I check how many references have PMIDs (which I need to fetch their abstracts)
has_pmid = refs["ref_pmid"].notna() & (refs["ref_pmid"] != "")
has_doi = refs["ref_doi"].notna() & (refs["ref_doi"] != "")

print(f"References with PMID: {has_pmid.sum():,} ({100*has_pmid.mean():.1f}%)")
print(f"References with DOI:  {has_doi.sum():,} ({100*has_doi.mean():.1f}%)")
print(f"References with either: {(has_pmid | has_doi).sum():,} ({100*(has_pmid | has_doi).mean():.1f}%)")

unique_ref_pmids = refs.loc[has_pmid, "ref_pmid"].nunique()
print(f"\nUnique referenced papers with PMIDs: {unique_ref_pmids:,}")
print("These are the papers I can fetch abstracts for and use in my screening task.")
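
Since I plan to fetch these abstracts later, one way to prepare is to split the unique PMIDs into fixed-size batches. This is only a sketch: the helper name `batch_pmids` and the batch size are my own assumptions, not part of the existing pipeline, and no actual fetch call is shown.

```python
from typing import Iterable, Iterator, List


def batch_pmids(pmids: Iterable[str], batch_size: int = 200) -> Iterator[List[str]]:
    """Yield sorted, de-duplicated PMIDs in lists of at most batch_size (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Drop non-strings (e.g. NaN), strip whitespace, and discard empty values
    unique = sorted({p.strip() for p in pmids if isinstance(p, str) and p.strip()})
    for start in range(0, len(unique), batch_size):
        yield unique[start:start + batch_size]


# Made-up PMIDs, including a duplicate, an empty string, and stray whitespace
batches = list(batch_pmids(["30", "10", "20", "10", "", "  40 "], batch_size=2))
print(batches)
```

On the real data this could be called as `batch_pmids(refs.loc[has_pmid, "ref_pmid"])`. Sorting the PMIDs first keeps the batches reproducible between runs, which helps if a fetch has to be resumed.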

In [None]:
# I look at the reviews with the most references to get a sense of review sizes
top_reviews = refs_per_review.nlargest(10)
print("Top 10 Cochrane reviews by reference count:")
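# I build a PMID -> title lookup once instead of filtering the frame in every iteration
titles = abstracts.drop_duplicates("pmid").set_index("pmid")["title"]
for pmid, count in top_reviews.items():
    title = titles.get(pmid)
    # I fall back to "N/A" when the review is missing from the abstracts table or has no title
    title_str = "N/A" if pd.isna(title) else (title[:80] + "..." if len(title) > 80 else title)
    print(f"  PMID {pmid}: {count} refs - {title_str}")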