In [23]:
# Imports
from IPython.display import Markdown, display
import os
import openai
from dotenv import load_dotenv
import pdfplumber
from pathlib import Path

In [24]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

In [35]:
# System prompt (the paper's extracted text is sent as the user message)
system_prompt = (
    "You are a helpful assistant. The user will provide the text of a paper from the "
    "Journal of Clinical Epidemiology about how to handle missing data. "
    "Your tasks are:\n\n"
    "1. State the title and give a summary in formal, academic language (10-12 sentences).\n"
    "2. List the key points as bullet points.\n"
    "3. Show a comparison table between the new approach and the existing approach.\n"
    "4. Discuss how the novel approach works.\n"
)


In [36]:
def ingest_file(file_path: str) -> str:
    """
    Returns the text from a PDF or any UTF-8-encoded text file.
    Raises FileNotFoundError if the path is wrong.
    """
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"{file_path} not found")

    if path.suffix.lower() == ".pdf":
        text_chunks = []
        with pdfplumber.open(path) as pdf:
            for page in pdf.pages:
                text_chunks.append(page.extract_text() or "")
        return "\n".join(text_chunks)

    # fallback: treat as plain text
    return path.read_text(encoding="utf-8")
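
Because `ingest_file` substitutes an empty string for any page without extractable text, a scanned PDF can come back as an empty or whitespace-only string. The sketch below shows one way to guard against that before the text is sent to the model. The helper name `prepare_user_prompt` and the `max_chars` budget are illustrative assumptions, not part of the original notebook, and the character cap is only a rough stand-in for a real token limit.

```python
def prepare_user_prompt(raw_text: str, max_chars: int = 200_000) -> str:
    """Validate extracted text and clip it to a rough character budget.

    max_chars is a hypothetical limit chosen for illustration; it is not
    a documented model constraint.
    """
    cleaned = raw_text.strip()
    if not cleaned:
        # Every page yielded an empty string, e.g. an image-only PDF.
        raise ValueError("No extractable text found in the input file.")
    if len(cleaned) > max_chars:
        cleaned = cleaned[:max_chars]
    return cleaned
```

In the notebook this could wrap the extracted text, for example `user_prompt = prepare_user_prompt(raw_text)`.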

In [37]:
def messages_for():
    # Reads the global user_prompt, which must be set before summarize() is called
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

In [38]:
def summarize():
    # Send the system and user prompts to the model and return its Markdown reply
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages_for(),
        temperature=0.3,  # optional: low temperature for a focused, mostly repeatable summary
    )
    return response.choices[0].message.content

In [39]:
# Choose the input file
raw_text = ingest_file("/Users/nusratjahan/Desktop/courses/llm and promtp/llm_engineering/week1/Review--A-gentle-introduction-to-imputation-of-mis.pdf")

# Summarize the extracted text and render the reply
user_prompt = raw_text          # global read by messages_for()
response_md = summarize()       # calls the model with both prompts
display(Markdown(response_md))  # render the Markdown reply in the notebook


### Title
A Gentle Introduction to Imputation of Missing Values

### Summary
In the realm of clinical epidemiology, the challenge of missing data is pervasive, often leading to biased results when handled improperly. This review article elucidates the shortcomings of simplistic methods such as complete case analysis, overall mean imputation, and the missing-indicator method, which frequently yield unreliable estimates. The authors advocate for more sophisticated imputation techniques, particularly single and multiple imputations, which are predicated on the principle that any subject in a study can be replaced by a randomly chosen subject from the same population. They delineate the conditions under which these methods yield unbiased estimates, specifically when data are missing at random (MAR) or completely at random (MCAR). The article further illustrates the efficacy of multiple imputation through a simulation study, demonstrating that it produces valid results with correct standard errors and confidence intervals, unlike single imputation, which often underestimates variability. The authors emphasize the importance of understanding the underlying mechanisms of missing data and the advantages of employing advanced imputation techniques to enhance the reliability of epidemiological analyses. Ultimately, the review serves as a clarion call for researchers to adopt these more robust methodologies to mitigate the biases inherent in traditional approaches to missing data.

### Key Points
- Missing data is a common issue in clinical research that can lead to biased results.
- Simple methods like complete case analysis and overall mean imputation often produce unreliable estimates.
- Imputation techniques replace missing values with estimates drawn from the distribution of observed data.
- Single imputation uses one estimate, while multiple imputation employs several estimates to reflect uncertainty.
- Both single and multiple imputations can yield unbiased estimates under MAR and MCAR conditions.
- Single imputation typically leads to underestimated standard errors, while multiple imputation provides more accurate estimates.
- The article includes a simulation study demonstrating the effectiveness of multiple imputation.
- The authors argue for the adoption of sophisticated imputation methods in epidemiological research.
- Understanding the mechanisms of missing data is crucial for accurate analysis.
- Advanced imputation techniques are now accessible through standard statistical software.
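
The distinction between MCAR and MAR in the key points can be made concrete with a small simulation. The sketch below uses made-up data (an `age` covariate and a `result` that depends on it; both names and all numbers are assumptions for illustration, not taken from the paper). Under MCAR every value has the same chance of being missing. Under MAR the chance depends on the observed `age`, so the mean over the remaining complete cases drifts away from the full-sample mean.

```python
import random
import statistics

rng = random.Random(0)
n = 20_000

# Hypothetical population: the test result rises with age.
age = [rng.gauss(50, 10) for _ in range(n)]
result = [a + rng.gauss(0, 5) for a in age]

# MCAR: each result is missing with probability 0.4, regardless of anything.
mcar_observed = [y for y in result if rng.random() > 0.4]

# MAR: missingness depends only on the observed covariate (age).
# Older subjects lose their result with probability 0.7, younger with 0.1.
mar_observed = [
    y for a, y in zip(age, result)
    if rng.random() > (0.7 if a > 50 else 0.1)
]

full_mean = statistics.fmean(result)
mcar_mean = statistics.fmean(mcar_observed)
mar_mean = statistics.fmean(mar_observed)

print(f"full sample mean:        {full_mean:.2f}")
print(f"complete cases, MCAR:    {mcar_mean:.2f}")
print(f"complete cases, MAR:     {mar_mean:.2f}")
```

With these settings the MCAR complete-case mean stays close to the full-sample mean, while the MAR complete-case mean is clearly lower because older subjects, who tend to have higher results, are under-represented.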

### Comparison Table: New Approach vs. Existing Approach

| Aspect                          | Existing Approach (Simple Methods)           | New Approach (Imputation Techniques)      |
|---------------------------------|---------------------------------------------|-------------------------------------------|
| Methods Used                    | Complete case analysis, overall mean imputation, missing-indicator method | Single imputation, multiple imputation    |
| Bias in Estimates               | High risk of biased estimates                | Unbiased estimates under MAR and MCAR    |
| Treatment of Missing Data       | Excludes cases or assigns arbitrary values   | Replaces missing values with draws from an estimated distribution |
| Standard Errors                 | Often underestimated                         | Correct with multiple imputation; too small with single imputation |
| Complexity                      | Simple and straightforward                   | More complex but valid                     |
| Software Availability           | Available in all statistical software        | Widely available in standard software (e.g., SAS, R) |
| Applicability                   | Limited to specific scenarios                | Applicable in a broader range of studies  |
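
One way to see why overall mean imputation appears in the "existing approach" column is to check what it does to a variable's spread and to its association with another variable. The following is a minimal sketch on simulated data; the variables `x` and `y`, the 40% missingness rate, and the helper `pearson` are assumptions for illustration and do not reproduce the paper's simulation study.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n = 10_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]

# Make 40% of x missing completely at random.
is_missing = [rng.random() < 0.4 for _ in range(n)]
x_obs = [xi for xi, miss in zip(x, is_missing) if not miss]
y_obs = [yi for yi, miss in zip(y, is_missing) if not miss]

# Overall mean imputation: every missing x gets the observed mean.
obs_mean = statistics.fmean(x_obs)
x_imputed = [obs_mean if miss else xi for xi, miss in zip(x, is_missing)]

var_observed = statistics.variance(x_obs)
var_imputed = statistics.variance(x_imputed)
corr_observed = pearson(x_obs, y_obs)
corr_imputed = pearson(x_imputed, y)

print(f"variance of x: observed {var_observed:.2f}, mean-imputed {var_imputed:.2f}")
print(f"corr(x, y):    observed {corr_observed:.2f}, mean-imputed {corr_imputed:.2f}")
```

Even though the data are MCAR here, the mean-imputed variable has a visibly smaller variance and a weaker correlation with `y`, because the imputed values all sit at a single point and carry no information about `y`.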

### Discussion on How the Novel Approach Works
The novel approach to handling missing data, particularly through multiple imputation, operates on the principle of replacing missing values with estimates derived from the distribution of observed data. In this method, multiple datasets are created, each with different imputed values for the missing data, reflecting the uncertainty inherent in the imputation process. This is achieved by drawing values from a statistical model that estimates the distribution of the variable with missing data based on other observed variables. 
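
The idea of drawing imputed values from a model fitted to the observed data can be sketched as follows. This is a simplified illustration rather than the paper's procedure: it assumes a single covariate (`ages`), a linear model, and a bootstrap resample per imputation so that uncertainty in the fitted line carries into the draws. The function names `fit_line` and `impute_once` and all data are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd

def impute_once(ages, results, rng):
    """Return a completed copy of results; None marks a missing value."""
    observed = [(a, y) for a, y in zip(ages, results) if y is not None]
    # Refit on a bootstrap resample so each imputation uses a slightly
    # different model, reflecting uncertainty in the estimated distribution.
    boot = [rng.choice(observed) for _ in observed]
    intercept, slope, sd = fit_line([a for a, _ in boot], [y for _, y in boot])
    # Missing values get prediction plus random noise; observed values are kept.
    return [
        y if y is not None else intercept + slope * a + rng.gauss(0, sd)
        for a, y in zip(ages, results)
    ]

rng = random.Random(1)
ages = [rng.gauss(50, 10) for _ in range(500)]
results = [a + rng.gauss(0, 5) for a in ages]
# Results of older subjects are more likely to be missing (MAR given age).
results_missing = [
    None if (a > 50 and rng.random() < 0.6) else y
    for a, y in zip(ages, results)
]

m = 5  # number of imputed datasets
completed = [impute_once(ages, results_missing, rng) for _ in range(m)]
```

Each of the `m` completed datasets keeps the observed values unchanged and differs only in the imputed ones, which is what lets the later analysis quantify the uncertainty introduced by imputation.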

For instance, if a subject's test result is missing, the imputation model might use other characteristics (such as age, sex, and disease status) to predict a plausible value for the test result. By generating several imputed datasets, researchers can analyze each dataset separately and then combine the results to produce a single estimate that accounts for both the variability due to sampling and the uncertainty due to imputation. This results in more accurate standard errors and confidence intervals, enhancing the validity of the conclusions drawn from the analysis.
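
The summary above does not spell out how the per-dataset results are combined. The usual choice is Rubin's rules, sketched here as an assumption about the combining step: the pooled estimate is the mean of the per-dataset estimates, and the total variance is the average within-dataset variance plus the between-dataset variance inflated by a factor of (1 + 1/m). The function name `pool_rubin` and the example numbers are made up for illustration.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine m per-dataset estimates and their squared standard errors.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical results from five imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

pooled, pooled_se = pool_rubin(estimates, variances)
naive_se = statistics.fmean(variances) ** 0.5  # ignores imputation uncertainty
print(f"pooled estimate {pooled:.3f}, pooled SE {pooled_se:.3f}, naive SE {naive_se:.3f}")
```

The pooled standard error exceeds the naive one whenever the estimates differ across imputed datasets, which is the extra uncertainty that a single imputed dataset cannot show.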

In contrast, single imputation only fills in missing values once, which can lead to an underestimation of variability and overly optimistic confidence intervals. The multiple imputation approach, therefore, provides a more robust framework for dealing with missing data, ensuring that the analyses remain valid and reliable across various scenarios commonly encountered in clinical research.