# Knowledge Representation & Logic with Dimensions Research Data

This notebook applies **logic and knowledge representation** ideas to
Dimensions-style research data (papers, authors, institutions, topics).
It covers five topics:

1. **Modeling Research Data with Propositional Logic**
   - Represent simple statements as propositional symbols.
   - Build logical relationships like `P ∧ Q → R`.

2. **Building a Knowledge Base (KB) for Research Trends**
   - Facts and rules about topic trends.
   - Simple forward chaining for inference.

3. **Inference & Entailment in Research Analysis**
   - Use logical implication to derive conclusions.
   - Example: “Paper X is peer-reviewed” from journal facts.

4. **Model Checking for Validating Hypotheses**
   - Check whether a logical hypothesis holds in your data model.
   - Example: “Collaboration between A and B → high citations”.

5. **Automating Analysis with Logic Programming**
   - Implement small rule-based queries in Python.
   - Use rules to surface collaborations or high-impact areas.

All examples use small toy DataFrames modeled on typical Dimensions exports.
Adapt column names to match your actual pipeline.

In [None]:
import pandas as pd
import numpy as np

# SymPy is required for the propositional logic cells; uncomment to install it
# !pip install sympy
from sympy import symbols
from sympy.logic.boolalg import And, Or, Not, Implies, satisfiable

np.random.seed(42)

# --- Toy "papers" table ---
papers = pd.DataFrame({
    "paper_id": ["P1", "P2", "P3"],
    "title": [
        "AI methods for infectious disease modeling",
        "Data repositories for immunology research",
        "Collaborative networks in pandemic preparedness"
    ],
    "journal": ["JAI", "JDATA", "JAI"],
    "primary_topic": ["AI", "DRKB", "Collab"],
    "author": ["A1", "A2", "A1"],
    "institution": ["NIAID", "UniversityX", "NIAID"],
    "citations_5yr": [120, 35, 80]
})

# --- Toy "institutions" table ---
institutions = pd.DataFrame({
    "institution": ["NIAID", "UniversityX", "InstituteY"],
    "specialty_topic": ["AI", "DRKB", "Vaccines"]
})

# --- Toy "collaborations" table ---
collabs = pd.DataFrame({
    "paper_id": ["P3", "P4", "P5"],
    "inst_a": ["NIAID", "NIAID", "UniversityX"],
    "inst_b": ["UniversityX", "InstituteY", "InstituteY"],
    "citations_5yr": [80, 10, 200]
})

# --- Toy topic trend table (one row per topic and year) ---
topic_trends = pd.DataFrame({
    # Repeat each topic three times so all columns have nine entries
    "topic": np.repeat(["AI", "DRKB", "Vaccines"], 3),
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2019, 2020, 2021],
    "n_pubs": [50, 70, 90,   20, 25, 22,    10, 12, 25]
}).sort_values(["topic", "year"])

topic_trends

## 1. Modeling Research Data with Propositional Logic

Represent simple research facts as propositional symbols:

- Let **P** represent “Paper X is published in Journal Y”.
- Let **Q** represent “Author A is affiliated with Institution B”.
- Let **R** represent “Institution B specializes in Topic Z”.

Then a rule like:

\[
P \land Q \rightarrow R
\]

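means: *if Paper X is in Journal Y and its author A is affiliated with Institution B, then B specializes in Topic Z*. This is a modeling assumption to test against the data, not a logical truth.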

Here we’ll show:

- How to map small data facts into propositional variables.
- How to build a formula like `P ∧ Q → R`.
- How to check if a particular valuation (model) satisfies the formula.
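
Before turning to SymPy, it can help to see when `P ∧ Q → R` fails at all. The sketch below enumerates all eight valuations using only the standard library; the helper name `implies` is an illustrative assumption, not part of SymPy.

```python
from itertools import product

def implies(a, b):
    """Material implication: a → b is false only when a is true and b is false."""
    return (not a) or b

# Enumerate every truth assignment to (P, Q, R) and evaluate P ∧ Q → R.
truth_table = [
    (p, q, r, implies(p and q, r))
    for p, q, r in product([True, False], repeat=3)
]

for p, q, r, value in truth_table:
    print(f"P={p!s:5} Q={q!s:5} R={r!s:5}  (P ∧ Q → R) = {value}")

# Collect the valuations that falsify the formula.
falsifying = [(p, q, r) for p, q, r, value in truth_table if not value]
```

Only one valuation ends up in `falsifying`: P true, Q true, R false. Every other row satisfies the implication, including all rows where the premise `P ∧ Q` is false.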

In [None]:
# Define propositional symbols for one simple scenario
P, Q, R = symbols("P Q R")

# Example formula: P ∧ Q → R
formula = Implies(And(P, Q), R)
formula

In [None]:
# Build a valuation (truth assignment) from toy data

# Suppose we pick:
paper = papers.loc[papers["paper_id"] == "P1"].iloc[0]  # AI paper in JAI by A1 at NIAID
inst = institutions.loc[institutions["institution"] == paper["institution"]].iloc[0]

# Define propositions concretely, reading each truth value from the toy data:
# P = "Paper P1 is in JAI"
P_val = bool(paper["journal"] == "JAI")

# Q = "Author A1 is affiliated with NIAID"
Q_val = bool(paper["author"] == "A1" and paper["institution"] == "NIAID")

# R = "NIAID specializes in AI" → check institutions table
R_val = bool(inst["specialty_topic"] == "AI")

P_val, Q_val, R_val

In [None]:
# Check if the formula is satisfied under this valuation
def eval_formula(f, assignment):
    return bool(f.subs(assignment))

assignment = {P: P_val, Q: Q_val, R: R_val}
print("Formula:", formula)
print("Satisfied?", eval_formula(formula, assignment))
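
The same check can be repeated for every paper instead of one hand-picked row. The following is a minimal sketch under stated assumptions: it copies the toy tables into plain dictionaries, and it reads R as "the institution's specialty equals the paper's primary topic", which is one possible interpretation rather than something the tables dictate. The names `papers_rows`, `specialty`, and `rule_holds` are hypothetical.

```python
# Plain-Python copies of the toy tables (same values as the DataFrames above).
papers_rows = [
    {"paper_id": "P1", "journal": "JAI", "primary_topic": "AI",
     "author": "A1", "institution": "NIAID"},
    {"paper_id": "P2", "journal": "JDATA", "primary_topic": "DRKB",
     "author": "A2", "institution": "UniversityX"},
    {"paper_id": "P3", "journal": "JAI", "primary_topic": "Collab",
     "author": "A1", "institution": "NIAID"},
]
specialty = {"NIAID": "AI", "UniversityX": "DRKB", "InstituteY": "Vaccines"}

def rule_holds(row):
    """Evaluate P ∧ Q → R for one paper row."""
    p = row["journal"] is not None        # P: the paper has a journal record
    q = row["institution"] in specialty   # Q: the author's institution is known
    # R (assumed reading): the institution's specialty matches the paper's topic
    r = specialty.get(row["institution"]) == row["primary_topic"]
    return (not (p and q)) or r

results = {row["paper_id"]: rule_holds(row) for row in papers_rows}
print(results)
```

Under this reading, P3 falsifies the rule: it is a NIAID paper whose topic is `Collab`, while NIAID's listed specialty is `AI`. A single row like this is enough to show the implication is not valid for the whole table.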

## 2. Building a Knowledge Base (KB) for Research Trends

Build a simple **knowledge base** of:

- **Facts**: atomic statements derived directly from data.
- **Rules**: implications linking conditions to conclusions.

Example:

- Fact: “In 2020, publications on topic T increased.”
- Rule: “If there’s an increase in publications on topic T, then topic T is gaining research interest.”

We’ll:

1. Automatically create **increase facts** from `topic_trends`.
2. Define a rule: `increase(T) → gaining_interest(T)`.
3. Use a simple **forward-chaining** procedure to infer `gaining_interest(T)` for all topics.

In [None]:
# Step 1: compute "increase in topic T" facts from topic_trends

trend_facts = set()  # strings like "increase(AI)"

for topic in topic_trends["topic"].unique():
    df_t = topic_trends[topic_trends["topic"] == topic].sort_values("year")
    n = df_t["n_pubs"].values

    # Flag the topic if any consecutive pair of years shows an increase.
    # A later decline does not cancel the flag, so DRKB (20 → 25 → 22) is flagged too.
    increases = (n[1:] > n[:-1])
    if increases.any():
        trend_facts.add(f"increase({topic})")

trend_facts

In [None]:
# Step 2: Define a simple rule: increase(T) => gaining_interest(T)

rules = [
    ("increase(T)", "gaining_interest(T)")
]

# Step 3: Forward chaining for this small Horn-clause subset

def substitute(pattern, topic):
    # pattern like "increase(T)" -> "increase(AI)"
    # Replace only the "(T)" placeholder so a capital T inside a predicate name is left alone
    return pattern.replace("(T)", f"({topic})")

def forward_chain_trends(facts, rules):
    derived = set(facts)
    changed = True

    topics = topic_trends["topic"].unique()

    while changed:
        changed = False
        for (premise_pattern, conclusion_pattern) in rules:
            for t in topics:
                premise = substitute(premise_pattern, t)
                conclusion = substitute(conclusion_pattern, t)
                if premise in derived and conclusion not in derived:
                    derived.add(conclusion)
                    changed = True
    return derived

kb_trends = forward_chain_trends(trend_facts, rules)
sorted(kb_trends)
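
The single-premise rule above reaches its fixpoint in one pass, so the `while changed` loop does little work. A sketch with multi-premise ground Horn rules shows why the loop matters. The facts `well_funded(AI)` and `priority_area(AI)` are hypothetical and used only for illustration.

```python
def forward_chain(facts, rules):
    """Apply ground Horn rules (premises -> conclusion) until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return derived

# Hypothetical facts and rules. The two-premise rule is listed first on purpose:
# it cannot fire until the second rule has added gaining_interest(AI).
example_facts = {"increase(AI)", "well_funded(AI)"}
example_rules = [
    (("gaining_interest(AI)", "well_funded(AI)"), "priority_area(AI)"),
    (("increase(AI)",), "gaining_interest(AI)"),
]

closure = forward_chain(example_facts, example_rules)
print(sorted(closure))
```

On the first pass only `gaining_interest(AI)` is added; the second pass then fires the two-premise rule. Iterating until no change makes the result independent of rule order.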

## 3. Inference & Entailment in Research Analysis

**Logical entailment**:

> A knowledge base KB entails sentence S (KB ⊨ S) if S is true in **all models** where KB is true.

Example:

- Premise 1: “All papers published in Journal J are peer-reviewed.”
- Premise 2: “Paper X is published in Journal J.”
- Conclusion: “Paper X is peer-reviewed.”

We’ll encode:

- A fact table for journals and their peer-review status.
- A rule: `published_in(p, j) ∧ peer_reviewed_journal(j) → peer_reviewed(p)`.
- A simple inference procedure to label peer-reviewed papers.
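
The definition of entailment can be checked directly by enumerating models. The sketch below is a small truth-table entailment test for a propositional version of the peer-review example; the function name `tt_entails` and the symbol names are illustrative assumptions, and the approach only scales to a handful of symbols.

```python
from itertools import product

def tt_entails(kb, sentence, symbol_names):
    """Return True if `sentence` holds in every model where `kb` holds."""
    for values in product([True, False], repeat=len(symbol_names)):
        model = dict(zip(symbol_names, values))
        if kb(model) and not sentence(model):
            return False  # a model of KB in which the sentence is false
    return True

# Propositional encoding for one paper X and one journal J:
#   in_J       : "Paper X is published in Journal J"
#   J_reviewed : "Journal J is peer-reviewed"
#   X_reviewed : "Paper X is peer-reviewed"
symbol_names = ["in_J", "J_reviewed", "X_reviewed"]

def kb(m):
    # Rule: in_J ∧ J_reviewed → X_reviewed, plus both premises as facts
    rule = (not (m["in_J"] and m["J_reviewed"])) or m["X_reviewed"]
    return m["in_J"] and m["J_reviewed"] and rule

entailed = tt_entails(kb, lambda m: m["X_reviewed"], symbol_names)

# Without the journal fact and the rule, the conclusion no longer follows.
not_entailed = tt_entails(lambda m: m["in_J"], lambda m: m["X_reviewed"], symbol_names)
print(entailed, not_entailed)
```

The second call fails because there is a model where the paper is in J but is not peer-reviewed, which is exactly what "not true in all models of KB" means.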

In [None]:
# Toy journal metadata
journals = pd.DataFrame({
    "journal": ["JAI", "JDATA"],
    "peer_reviewed": [True, True]
})

# Build facts
facts_pub = set()
facts_journal = set()

for _, row in papers.iterrows():
    facts_pub.add(f"published_in({row['paper_id']},{row['journal']})")

for _, row in journals.iterrows():
    if row["peer_reviewed"]:
        facts_journal.add(f"peer_reviewed_journal({row['journal']})")

facts_pub, facts_journal

In [None]:
# Rule: published_in(P, J) ∧ peer_reviewed_journal(J) → peer_reviewed(P)

def infer_peer_reviewed(papers, facts_pub, facts_journal):
    derived = set()
    for fact in facts_pub:
        # fact looks like published_in(P1,JAI)
        inside = fact[len("published_in("):-1]
        p_id, j_id = inside.split(",")
        if f"peer_reviewed_journal({j_id})" in facts_journal:
            derived.add(f"peer_reviewed({p_id})")
    return derived

peer_reviewed_facts = infer_peer_reviewed(papers, facts_pub, facts_journal)
peer_reviewed_facts

In [None]:
# Attach as a column to papers (KB entails "peer_reviewed(P)" for these)
papers["peer_reviewed"] = papers["paper_id"].apply(
    lambda pid: f"peer_reviewed({pid})" in peer_reviewed_facts
)
papers[["paper_id", "journal", "peer_reviewed"]]

## 4. Model Checking for Validating Hypotheses

**Model checking**: verify whether a logical hypothesis holds in a particular model (dataset).

Example hypothesis:

> H: "Collaboration between Institutions A and B leads to high citation counts."

We can interpret this as the implication:

\[
\forall p \; \bigl( \mathrm{collab}(p, A, B) \rightarrow \mathrm{high\_citations}(p) \bigr)
\]

Where:
- `collab(p, A, B)` means paper p involves A and B.
- `high_citations(p)` means `citations_5yr` ≥ threshold.

To model check H on our data:

1. Identify all collaborative papers between A and B.
2. Check if any of them violates `high_citations` (counterexamples).
3. If there are no counterexamples in the dataset, H holds in this model.
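
These three steps amount to checking a universally quantified implication over a finite set of records. A generic sketch, independent of pandas, might look like the following. The helper names `check_forall_implies`, `involves`, and `is_high` are hypothetical, and the threshold of 50 is an assumed cut-off.

```python
def check_forall_implies(records, antecedent, consequent):
    """Check ∀r (antecedent(r) → consequent(r)) over a finite list of records."""
    relevant = [r for r in records if antecedent(r)]
    counterexamples = [r for r in relevant if not consequent(r)]
    return {
        "holds": not counterexamples,   # vacuously True when `relevant` is empty
        "n_relevant": len(relevant),
        "counterexamples": counterexamples,
    }

# Same values as the toy `collabs` table, as plain dictionaries.
collab_rows = [
    {"paper_id": "P3", "inst_a": "NIAID", "inst_b": "UniversityX", "citations_5yr": 80},
    {"paper_id": "P4", "inst_a": "NIAID", "inst_b": "InstituteY", "citations_5yr": 10},
    {"paper_id": "P5", "inst_a": "UniversityX", "inst_b": "InstituteY", "citations_5yr": 200},
]

def involves(a, b):
    """Order-insensitive test that a record is a collaboration between a and b."""
    return lambda r: {r["inst_a"], r["inst_b"]} == {a, b}

def is_high(r, threshold=50):  # assumed cut-off for "high citations"
    return r["citations_5yr"] >= threshold

result_ok = check_forall_implies(collab_rows, involves("NIAID", "UniversityX"), is_high)
result_bad = check_forall_implies(collab_rows, involves("NIAID", "InstituteY"), is_high)
print(result_ok["holds"], result_bad["holds"], result_bad["counterexamples"])
```

Reporting `n_relevant` alongside `holds` makes it visible when a hypothesis is only vacuously true because no record satisfies the antecedent.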

In [None]:
HIGH_CIT_THRESHOLD = 50

def check_collab_hypothesis(collabs_df, inst_a, inst_b, threshold=HIGH_CIT_THRESHOLD):
    # Select collabs between inst_a and inst_b (order-insensitive)
    mask = (
        ((collabs_df["inst_a"] == inst_a) & (collabs_df["inst_b"] == inst_b)) |
        ((collabs_df["inst_a"] == inst_b) & (collabs_df["inst_b"] == inst_a))
    )
    subs = collabs_df[mask]

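    if subs.empty:
        # Strictly, a universal implication over zero cases is vacuously true.
        # Return None instead to flag that the data offers no evidence either way.
        return {
            "holds": None,
            "reason": "No collaborations between these institutions in the data.",
            "counterexamples": []
        }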

    # Counterexamples: collab with citations < threshold
    counter = subs[subs["citations_5yr"] < threshold]

    return {
        "holds": counter.empty,
        "reason": "No counterexamples found." if counter.empty else "Some collaborations have low citations.",
        "counterexamples": counter.to_dict(orient="records")
    }

# Example: hypothesis "NIAID and UniversityX collabs → high citations"
res_h = check_collab_hypothesis(collabs, "NIAID", "UniversityX")
res_h

## 5. Automating Analysis with Logic Programming

Logic programming (e.g., Prolog) lets you:

- Declare **facts** and **rules**.
- Ask **queries** like “Which collaborations are high-impact and AI-related?”

We’ll emulate a tiny slice of logic programming in Python:

- Facts:
  - `collab(inst_a, inst_b, topic, citations)`
- Rules:
  - `high_impact_collab(A,B) :- collab(A,B,Topic,C), C >= threshold`
  - `ai_collab(A,B) :- collab(A,B,Topic,C), Topic = 'AI'`

Then we’ll write simple query functions that behave like Prolog predicates.
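
To give a feel for how a Prolog-style query binds variables, here is a minimal sketch of pattern matching over tuple facts. It is not a Prolog engine (no backtracking over rules, no nested terms). The `?`-prefixed variable convention, the function names `match` and `query`, and the fact values are all hypothetical.

```python
# Facts: collab(inst_a, inst_b, topic, citations). Values are illustrative only.
collab_facts = [
    ("NIAID", "UniversityX", "AI", 80),
    ("NIAID", "InstituteY", "Vaccines", 10),
    ("UniversityX", "InstituteY", "AI", 200),
]

def match(pattern, fact):
    """Match a pattern against a ground fact; strings starting with '?' are variables."""
    bindings = {}
    for term, value in zip(pattern, fact):
        if isinstance(term, str) and term.startswith("?"):
            if bindings.setdefault(term, value) != value:
                return None  # same variable bound to two different values
        elif term != value:
            return None      # constant mismatch
    return bindings

def query(facts, pattern, where=lambda b: True):
    """Return the bindings of every fact that matches `pattern` and passes `where`."""
    results = []
    for fact in facts:
        b = match(pattern, fact)
        if b is not None and where(b):
            results.append(b)
    return results

# Roughly: ?- collab(A, B, 'AI', C), C >= 50.
answers = query(collab_facts, ("?A", "?B", "AI", "?C"), where=lambda b: b["?C"] >= 50)
print(answers)
```

Each answer is a dictionary of variable bindings, which is close to how a Prolog top level reports solutions.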

In [None]:
# Build a fact table for collaborations enriched with topic

# Left-join collabs with papers to get the topic for known paper_ids.
# P4 and P5 are not in `papers`, so their primary_topic is NaN after the merge.
collab_enriched = collabs.merge(
    papers[["paper_id", "primary_topic"]],
    on="paper_id",
    how="left"
)
collab_enriched

In [None]:
# Facts as a list of tuples: (inst_a, inst_b, topic, citations)
facts_collab = [
    (row["inst_a"], row["inst_b"], row["primary_topic"], row["citations_5yr"])
    for _, row in collab_enriched.iterrows()
]

facts_collab

In [None]:
HIGH_IMPACT = 50

def high_impact_collab(facts, threshold=HIGH_IMPACT):
    """
    Logic rule: high_impact_collab(A,B) :- collab(A,B,Topic,C), C >= threshold
    Returns set of (A,B) pairs.
    """
    result = set()
    for (A, B, topic, C) in facts:
        if C >= threshold:
            # normalize pair ordering
            pair = tuple(sorted((A, B)))
            result.add(pair)
    return result

def ai_collab(facts):
    """
    Logic rule: ai_collab(A,B) :- collab(A,B,Topic,C), Topic = 'AI'
    """
    result = set()
    for (A, B, topic, C) in facts:
        if topic == "AI":
            pair = tuple(sorted((A, B)))
            result.add(pair)
    return result

def high_impact_ai_collab(facts, threshold=HIGH_IMPACT):
    """
    Conjunction of rules:
    high_impact_ai_collab(A,B) :- high_impact_collab(A,B), ai_collab(A,B)
    """
    hi = high_impact_collab(facts, threshold)
    ai = ai_collab(facts)
    return hi.intersection(ai)

high_impact_pairs = high_impact_collab(facts_collab)
ai_pairs = ai_collab(facts_collab)
hi_ai_pairs = high_impact_ai_collab(facts_collab)

high_impact_pairs, ai_pairs, hi_ai_pairs
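
One design point is worth noting: intersecting the two pair sets accepts a pair whose high-impact paper and AI paper are different papers. If the intent is that a single collaboration is both AI-related and high-impact, the two conditions should share the same fact. A hedged sketch of that stricter variant follows; the function name and the demo facts are hypothetical.

```python
def high_impact_ai_collab_same_paper(facts, threshold=50):
    """
    Stricter rule in which Topic and C come from a single collab fact:
    high_impact_ai_collab(A,B) :- collab(A,B,'AI',C), C >= threshold
    """
    return {
        tuple(sorted((a, b)))          # normalize pair ordering
        for (a, b, topic, c) in facts
        if topic == "AI" and c >= threshold
    }

# Hypothetical facts: NIAID and UniversityX share a low-cited AI paper
# and a highly cited non-AI paper.
demo_facts = [
    ("NIAID", "UniversityX", "AI", 10),
    ("NIAID", "UniversityX", "Vaccines", 300),
    ("InstituteY", "UniversityX", "AI", 200),
]

strict_pairs = high_impact_ai_collab_same_paper(demo_facts)
print(strict_pairs)
```

For these demo facts the stricter rule returns only the InstituteY and UniversityX pair, whereas a pair-level intersection would also include NIAID and UniversityX. Which reading is appropriate depends on the question being asked.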



## Summary

1. **Propositional Logic**
   - Modeled simple statements like:
     - Paper–journal relationships
     - Author–institution affiliations
     - Institution–topic specialties
   - Built and evaluated implications such as `P ∧ Q → R`.

2. **Knowledge Base for Trends**
   - Derived `increase(T)` facts from topic counts.
   - Used rules like `increase(T) → gaining_interest(T)` with forward-chaining.

3. **Inference & Entailment**
   - Encoded journal peer-review facts and rules.
   - Derived `peer_reviewed(P)` for papers in peer-reviewed journals.

4. **Model Checking**
   - Interpreted hypotheses as implications on the data.
   - Checked for counterexamples to statements like:
     - “Collab(A,B) → high citations.”

5. **Logic Programming-Style Automation**
   - Represented collaborations as facts.
   - Implemented simple rule-based queries (high-impact, AI-related collabs).

You can extend this framework to:

- Richer rule sets (e.g., disease areas, division-level logic, equity criteria).
- More expressive logics (first-order, description logics).
- Integration with formal tools (e.g., Prolog, Z3, PyDatalog) for heavier reasoning tasks.

This module complements your ML / ANN / NLP / Optimization / Uncertainty modules by providing a **symbolic reasoning layer** for your NIAID/NIH research intelligence workflows.