# 🏥 Breast Cancer Risk Prediction: Complete AWS Implementation Guide

This notebook documents the end-to-end process of building, deploying, and automating the Breast Cancer Risk Prediction system on AWS.

## 🏗️ Architecture Overview
1.  **Training**: EC2 instance trains a Logistic Regression model using data from S3.
2.  **Prediction**: Users submit data via a Web UI (EC2) -> API Gateway -> Lambda -> Model.
3.  **Storage**: All predictions are stored in DynamoDB.
4.  **Alerts**: High-risk (Malignant) predictions trigger an SNS Email Alert.
5.  **Continuous Learning**: A weekly EventBridge schedule triggers a Sync Lambda to move confirmed cases from DynamoDB back to S3 for retraining.

---
## 🛠️ Prerequisites
*   AWS Account with Admin Access.
*   `clean-data.csv` (Initial Dataset).
*   `breastcancer_env.yaml` (Conda Environment).


## 1️⃣ Phase 1: Storage & Permissions

### A. S3 Buckets
We created two S3 buckets to separate data from artifacts.
1.  **`breast-cancer-cleaneddata`**: Stores the "Golden Copy" of the dataset (`clean-data.csv` and versioned files like `patient_data_v1.csv`).
2.  **`breast-cancer-models`**: Stores the trained model artifacts (`model.pkl`, `scaler.pkl`).
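
The buckets can also be created programmatically. The following is a minimal sketch, assuming a boto3 S3 client is passed in and that the buckets live in `eu-central-1` (the region that appears in this guide's ARNs). The helper name `create_buckets` is hypothetical, and bucket names are globally unique, so the names may need adjusting.

```python
def create_buckets(s3_client, bucket_names, region):
    """Create each bucket in `region` and return the names that were requested.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    created = []
    for name in bucket_names:
        kwargs = {"Bucket": name}
        # us-east-1 is the default location and must not be sent as a
        # LocationConstraint; every other region requires it.
        if region != "us-east-1":
            kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
        s3_client.create_bucket(**kwargs)
        created.append(name)
    return created


# Example wiring (requires boto3 and configured AWS credentials):
#   import boto3
#   create_buckets(
#       boto3.client("s3", region_name="eu-central-1"),
#       ["breast-cancer-cleaneddata", "breast-cancer-models"],
#       "eu-central-1",
#   )
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before touching a real account.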

### B. IAM Role (`EC2-S3-Access-Role`)
This role allows our EC2 instance to read/write to S3 without hardcoding credentials.
*   **Trusted Entity**: EC2
*   **Permissions**: `AmazonS3FullAccess`
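
For reference, the two policy documents behind such a role can be sketched as plain Python dictionaries. The trust policy is the standard one for EC2. The scoped permissions policy is a hypothetical, narrower alternative to `AmazonS3FullAccess` and is not what this guide deployed. The function names are illustrative.

```python
import json


def ec2_trust_policy():
    """Trust policy that lets the EC2 service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def scoped_s3_policy(bucket_names):
    """Read/write policy limited to the named buckets (assumed alternative)."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arns},
            # Reads and writes apply to the objects inside the bucket.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": object_arns,
            },
        ],
    }


if __name__ == "__main__":
    buckets = ["breast-cancer-cleaneddata", "breast-cancer-models"]
    print(json.dumps(ec2_trust_policy(), indent=2))
    print(json.dumps(scoped_s3_policy(buckets), indent=2))
```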


## 2️⃣ Phase 2: Model Training (EC2)

We launched an **Ubuntu t3.medium** EC2 instance to handle the training workload.

### A. Setup
1.  **Instance Name**: `BreastCancerTraining`
2.  **Key Pair**: `breast-cancer-key.pem`
3.  **Environment**: Installed Miniconda and created `breastcancer_env`.

### B. The Training Script (`train.py`)
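This script is the heart of the ML pipeline. It:
1.  Loads the dataset. The listing reads a local copy at `data/clean-data.csv`, so the file must first be downloaded from S3.
2.  Trains the classifier. The overview calls for Logistic Regression, but the listing instantiates a `RandomForestClassifier`.
3.  Evaluates accuracy on a held-out 20% test split.
4.  Uploads the trained model back to S3.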


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
import boto3
import os

# --- CONFIGURATION ---
DATA_PATH = 'data/clean-data.csv'  # Path to your dataset
MODEL_FILENAME = 'latest_model.pkl'
BUCKET_NAME = 'breast-cancer-prediction-models'  # Must match the model bucket from Phase 1 (listed there as 'breast-cancer-models')

def train_and_upload_model():
    # 1. Load Data
    if not os.path.exists(DATA_PATH):
        print(f"Warning: data file not found at {DATA_PATH}; falling back to scikit-learn's built-in dataset.")
        # Fallback for demonstration if file is missing
        from sklearn.datasets import load_breast_cancer
        data = load_breast_cancer()
        X = pd.DataFrame(data.data, columns=data.feature_names)
        # scikit-learn encodes this dataset as 0 = malignant, 1 = benign.
        # Flip the labels so that 1 = malignant, matching the CSV branch below
        # and the Lambda predictor (which maps 1 -> 'M').
        y = 1 - data.target
    else:
        df = pd.read_csv(DATA_PATH)
        # Basic preprocessing (adjust based on your actual data columns)
        # Assuming 'diagnosis' is the target and 'id' is not a feature
        if 'diagnosis' in df.columns:
            y = df['diagnosis'].map({'M': 1, 'B': 0}) # Encode M as 1, B as 0
            X = df.drop(['diagnosis', 'id', 'Unnamed: 32'], axis=1, errors='ignore')
        else:
            # Fallback if column names differ
            print("Warning: 'diagnosis' column not found. Using last column as target.")
            X = df.iloc[:, :-1]
            y = df.iloc[:, -1]

    # 2. Train Model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))

    # 3. Save Model
    joblib.dump(model, MODEL_FILENAME)
    print(f"Model saved locally as {MODEL_FILENAME}")

    # 4. Upload to S3
    s3 = boto3.client('s3')
    try:
        s3.upload_file(MODEL_FILENAME, BUCKET_NAME, MODEL_FILENAME)
        print(f"Successfully uploaded {MODEL_FILENAME} to s3://{BUCKET_NAME}/")
    except Exception as e:
        print(f"Upload failed: {e}")
        print("Ensure you have AWS credentials configured.")

if __name__ == "__main__":
    train_and_upload_model()
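
The script above reads a local file, so the "download the latest data from S3" step has to happen beforehand. A minimal sketch of that step follows, assuming the versioned naming scheme `patient_data_v<N>.csv` described in Phase 1 and a boto3-style client passed in. The helper names are hypothetical.

```python
import re

# Matches keys such as "patient_data_v12.csv" and captures the version number.
VERSIONED_KEY = re.compile(r"^patient_data_v(\d+)\.csv$")


def latest_dataset_key(keys, base_key="clean-data.csv"):
    """Return the highest-versioned key, or `base_key` if none is versioned."""
    best_version, best_key = 0, base_key
    for key in keys:
        match = VERSIONED_KEY.match(key)
        if match and int(match.group(1)) > best_version:
            best_version, best_key = int(match.group(1)), key
    return best_key


def download_latest(s3_client, bucket, dest_path):
    """List versioned datasets in `bucket` and download the newest one."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix="patient_data_v")
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    key = latest_dataset_key(keys)
    s3_client.download_file(bucket, key, dest_path)
    return key


# Example wiring (requires boto3 and AWS credentials):
#   import boto3
#   download_latest(boto3.client("s3"), "breast-cancer-cleaneddata",
#                   "data/clean-data.csv")
```

Comparing versions as integers matters here: a plain string sort would rank `v9` above `v10`.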

### D. Lambda Function 1: The Predictor
*   **Runtime**: Python 3.9
*   **Trigger**: API Gateway (HTTP API)
*   **Permissions**: `AmazonDynamoDBFullAccess`, `AmazonSNSFullAccess`, `AmazonS3ReadOnlyAccess`.
*   **Layers**: We used `sklearn_light.zip` as a Lambda Layer to provide `scikit-learn` and `joblib` without exceeding the deployment size limit.
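
Once deployed behind API Gateway, the function accepts two JSON request shapes: a prediction request (`id` plus a `features` list) and a doctor-feedback request (`operation: 'feedback'`, `id`, `resolution`). A minimal sketch for building both from Python, for example as a smoke test, is given here. The helper names are hypothetical, and the 30-feature check is an assumption taken from the feature count used by the web UI.

```python
import json
import urllib.request


def build_predict_request(url, case_id, features):
    """Build (but do not send) a POST request asking for a prediction."""
    if len(features) != 30:
        raise ValueError(f"expected 30 features, got {len(features)}")
    payload = json.dumps({"id": case_id, "features": list(features)})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_feedback_request(url, case_id, resolution):
    """Build (but do not send) a POST request recording the doctor's verdict."""
    payload = json.dumps(
        {"operation": "feedback", "id": case_id, "resolution": resolution}
    )
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send a request:
#   with urllib.request.urlopen(request) as response:
#       print(json.loads(response.read()))
```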

In [None]:
import json
import boto3
import joblib
import os
from io import BytesIO

# --- CONFIGURATION ---
BUCKET_NAME = 'breast-cancer-prediction-models'
MODEL_KEY = 'latest_model.pkl'
DYNAMODB_TABLE = 'Patient_Entries'
SNS_TOPIC_ARN = 'arn:aws:sns:eu-central-1:469541406278:MalignantAlerts'

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')
table = dynamodb.Table(DYNAMODB_TABLE)

def load_model_from_s3():
    print("Loading model from S3...")
    response = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
    model_stream = BytesIO(response['Body'].read())
    model = joblib.load(model_stream)
    print("Model loaded successfully.")
    return model

# Load model globally to reuse across invocations
model = load_model_from_s3()

def process_doctor_feedback(body):
    """Handles the Tick/Cross feedback from the doctor."""
    case_id = body.get('id')
    resolution = body.get('resolution')

    if not case_id or not resolution:
        return {'statusCode': 400, 'body': json.dumps('Missing id or resolution')}

    print(f"Processing feedback for Case {case_id}: {resolution}")

    # Update DynamoDB
    try:
        # 'status' is a reserved word, so we use ExpressionAttributeNames (#st)
        table.update_item(
            Key={'id': case_id},
            UpdateExpression="SET doctor_resolution = :r, #st = :s",
            ExpressionAttributeNames={'#st': 'status'},
            ExpressionAttributeValues={
                ':r': resolution,
                ':s': 'Resolved'
            }
        )
        return {'statusCode': 200, 'body': json.dumps(f"Case {case_id} resolved as {resolution}")}
    except Exception as e:
        print(f"Error updating DynamoDB: {str(e)}")
        return {'statusCode': 500, 'body': json.dumps(f"Database error: {str(e)}")}

def lambda_handler(event, context):
    try:
        print("Received event:", json.dumps(event))
        
        # Parse body
        if 'body' in event:
            if isinstance(event['body'], str):
                body = json.loads(event['body'])
            else:
                body = event['body']
        else:
            body = event

        # CHECK OPERATION TYPE
        if body.get('operation') == 'feedback':
            return process_doctor_feedback(body)

        # --- PREDICTION LOGIC ---
        features = body.get('features')
        case_id = body.get('id', 'unknown_id')

        if not features:
            return {
                'statusCode': 400,
                'body': json.dumps("Error: 'features' list is required.")
            }

        prediction = model.predict([features])
        result = 'M' if prediction[0] == 1 else 'B'

        print(f"Prediction for {case_id}: {result}")

        # Save to DynamoDB
        item = {
            'id': case_id,
            'features': str(features),
            'prediction': result,
            'status': 'Pending Review'
        }
        table.put_item(Item=item)

        # Send SNS Alert if Malignant
        if result == 'M':
            message = (
                f"URGENT: Malignant Case Detected\n\n"
                f"Case ID: {case_id}\n"
                f"Prediction: Malignant (M)\n"
                f"Status: Pending Doctor Review\n\n"
                f"Please log in to the dashboard to review this case immediately."
            )
            sns.publish(
                TopicArn=SNS_TOPIC_ARN,
                Message=message,
                Subject=f"Alert: Malignant Case {case_id}"
            )

        return {
            'statusCode': 200,
            'body': json.dumps({'prediction': result, 'id': case_id})
        }

    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps(f"Internal Server Error: {str(e)}")
        }
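
The handler above stores `features` as a string (`str(features)`). An alternative, sketched here under the assumption that a native DynamoDB list is preferred, is to convert each value to `Decimal`, because the boto3 resource API rejects Python `float` values. The helper names are hypothetical.

```python
from decimal import Decimal


def to_dynamodb_numbers(features):
    """Convert floats to Decimal via str() so each value keeps its printed form."""
    return [Decimal(str(value)) for value in features]


def from_dynamodb_numbers(values):
    """Convert Decimals read back from DynamoDB into plain floats."""
    return [float(value) for value in values]


# Usage inside the handler would look like:
#   item["features"] = to_dynamodb_numbers(features)
```

Going through `str()` avoids the long binary expansion that `Decimal(17.99)` would produce.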


## 4️⃣ Phase 4: The Frontend (EC2 Web Server)

We hosted the user interface on an EC2 instance (can be the same as the training one or a separate t2.micro).

### A. Setup
1.  **Install Web Server**: `sudo apt install apache2` (or nginx).
2.  **Deploy Code**: Replaced `/var/www/html/index.html` with our custom HTML.
3.  **Security Group**: Allowed Inbound HTTP (Port 80) traffic.

### B. The Interface Code (`index.html`)
This is the final, polished version of the UI that connects to the API Gateway.


In [None]:
<!DOCTYPE html>
<html>
<head>
    <title>Breast Cancer Risk System</title>
    <style>
        body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; max-width: 1000px; margin: 0 auto; padding: 20px; background-color: #f8f9fa; color: #333; }
        .view { display: none; animation: fadeIn 0.5s; }
        .active { display: block; }
        @keyframes fadeIn { from { opacity: 0; } to { opacity: 1; } }

        /* Buttons */
        .btn { padding: 12px 24px; margin: 5px; cursor: pointer; border-radius: 6px; font-size: 15px; border: none; transition: all 0.2s; font-weight: 600; }
        .btn:disabled { opacity: 0.6; cursor: not-allowed; }
        .btn-primary { background-color: #007bff; color: white; }
        .btn-primary:hover { background-color: #0056b3; transform: translateY(-1px); }
        .btn-success { background-color: #28a745; color: white; }
        .btn-success:hover { background-color: #218838; }
        .btn-danger { background-color: #dc3545; color: white; }
        .btn-danger:hover { background-color: #c82333; }
        .btn-outline { background: transparent; border: 1px solid #007bff; color: #007bff; }
        .btn-outline:hover { background: #e7f1ff; }

        /* File Upload */
        .upload-area { border: 2px dashed #ccc; padding: 30px; text-align: center; background: white; border-radius: 10px; margin: 20px 0; transition: 0.3s; }
        .upload-area:hover { border-color: #007bff; background: #f0f8ff; }

        /* Table */
        .table-container { max-height: 400px; overflow: auto; margin: 20px 0; border: 1px solid #dee2e6; border-radius: 5px; background: white; display: none; }
        table { width: 100%; border-collapse: collapse; font-size: 13px; white-space: nowrap; }
        th, td { padding: 10px; text-align: left; border-bottom: 1px solid #dee2e6; }
        th { background-color: #e9ecef; position: sticky; top: 0; z-index: 10; font-weight: 600; }
        tr:hover { background-color: #f8f9fa; }

        /* Cards */
        .case-card { background: white; border-left: 5px solid #dc3545; border-radius: 5px; padding: 15px; margin: 10px 0; display: flex; justify-content: space-between; align-items: center; box-shadow: 0 2px 8px rgba(0,0,0,0.05); }
        .status-badge { padding: 4px 8px; border-radius: 4px; font-size: 12px; font-weight: bold; }
        .status-malignant { background: #ffebee; color: #c62828; }

        /* Logs */
        #analysis-log { background: #2d2d2d; color: #00ff00; padding: 15px; border-radius: 5px; font-family: monospace; max-height: 200px; overflow-y: auto; margin-top: 20px; display: none; }
    </style>
    <script>
        // --- CONFIGURATION ---
        const PREDICT_URL = 'https://qu6y4f29lg.execute-api.eu-central-1.amazonaws.com/predict';

        // --- CONSTANTS ---
        const FEATURE_NAMES = [
            "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean", "concave points_mean", "symmetry_mean", "fractal_dimension_mean",
            "radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se", "concavity_se", "concave points_se", "symmetry_se", "fractal_dimension_se",
            "radius_worst", "texture_worst", "perimeter_worst", "area_worst", "smoothness_worst", "compactness_worst", "concavity_worst", "concave points_worst", "symmetry_worst", "fractal_dimension_worst"
        ];

        // --- STATE ---
        let loadedData = []; // Stores objects: { id: string, features: number[] }
        let openCases = [];  // Stores Malignant cases waiting for review

        // --- NAVIGATION ---
        function showView(viewId) {
            document.querySelectorAll('.view').forEach(el => el.classList.remove('active'));
            document.getElementById(viewId).classList.add('active');
            if(viewId === 'cases-view') renderOpenCases();
        }

        // --- FILE HANDLING ---
        function handleFileUpload(event) {
            const file = event.target.files[0];
            if (!file) return;

            const reader = new FileReader();
            reader.onload = function(e) {
                const text = e.target.result;
                parseCSV(text, file.name);
            };
            reader.readAsText(file);
        }

        function parseCSV(csvText, fileName) {
            try {
                // 1. Extract Base ID from Filename
                // Example: "sample_patient_569.csv" -> "569"
                const nameNoExt = fileName.substring(0, fileName.lastIndexOf('.')) || fileName;
                const parts = nameNoExt.split(/[_\-\s]+/);
                const baseId = parts[parts.length - 1];

                const lines = csvText.split('\n').filter(line => line.trim() !== '');
                loadedData = [];

                // Determine start row (Skip header if present)
                let startRow = 0;
                if (/[a-zA-Z]/.test(lines[0])) {
                    startRow = 1;
                }

                // Parse rows
                const tempRows = [];
                for (let i = startRow; i < Math.min(lines.length, 101); i++) {
                    const rawValues = lines[i].split(',');

                    // Filter for numeric values (to find features)
                    const numericValues = rawValues
                        .map(v => parseFloat(v))
                        .filter(v => !isNaN(v));

                    // We need at least 30 numeric features
                    if (numericValues.length >= 30) {
                        const features = numericValues.slice(-30);
                        tempRows.push(features);
                    }
                }

                if (tempRows.length === 0) {
                    alert("Could not parse any valid rows with 30 numeric features. Please check your CSV format.");
                    return;
                }

                // Assign IDs based on filename
                loadedData = tempRows.map((features, index) => {
                    // If single row, use the filename ID directly (e.g. "569")
                    // If multiple rows, append index (e.g. "569-1", "569-2")
                    let caseId = baseId;
                    if (tempRows.length > 1) {
                        caseId = `${baseId}-${index + 1}`;
                    }
                    return { id: caseId, features: features };
                });

                renderTable(loadedData);
                document.getElementById('confirm-btn').disabled = false;

            } catch (e) {
                console.error(e);
                alert("Error parsing CSV: " + e.message);
            }
        }

        function renderTable(data) {
            const container = document.getElementById('table-container');
            const tableHead = document.getElementById('data-table-head');
            const tableBody = document.getElementById('data-table-body');

            // Clear previous
            tableHead.innerHTML = '';
            tableBody.innerHTML = '';

            // Headers
            const headerRow = document.createElement('tr');

            // ID Column Header
            const thId = document.createElement('th');
            thId.innerText = "Case ID";
            thId.style.minWidth = "80px";
            headerRow.appendChild(thId);

            // Feature Headers
            FEATURE_NAMES.forEach(h => {
                const th = document.createElement('th');
                th.innerText = h;
                headerRow.appendChild(th);
            });
            tableHead.appendChild(headerRow);

            // Rows
            data.forEach((item) => {
                const tr = document.createElement('tr');

                // ID Cell
                const tdId = document.createElement('td');
                tdId.innerText = item.id;
                tdId.style.fontWeight = "bold";
                tdId.style.backgroundColor = "#f8f9fa";
                tr.appendChild(tdId);

                // Feature Cells
                item.features.forEach(val => {
                    const td = document.createElement('td');
                    td.innerText = val;
                    tr.appendChild(td);
                });
                tableBody.appendChild(tr);
            });

            container.style.display = 'block';
        }

        function confirmData() {
            if (loadedData.length > 0) {
                document.getElementById('analyze-btn').disabled = false;
                alert(`Data Confirmed! ${loadedData.length} patient records ready for analysis.`);
            } else {
                alert("No data to confirm. Please upload a valid CSV.");
            }
        }

        // --- LOGIC: BATCH ANALYSIS ---
        async function analyzeRisk() {
            if (!loadedData || loadedData.length === 0) {
                alert("No data loaded. Please upload a CSV first.");
                return;
            }

            showView('analyze-view');

            // MIRROR TABLE TO ANALYZE VIEW
            const originalTable = document.getElementById('table-container').innerHTML;
            const targetContainer = document.getElementById('analysis-table-container');
            targetContainer.innerHTML = originalTable;
            targetContainer.style.display = 'block';

            const logDiv = document.getElementById('analysis-log');
            logDiv.style.display = 'block';
            logDiv.innerHTML = "Starting analysis...\n";

            let processed = 0;

            // Iterate through loaded data
            for (let i = 0; i < loadedData.length; i++) {
                const item = loadedData[i];
                const features = item.features;
                const patientId = item.id;

                try {
                    // Call API
                    const response = await fetch(PREDICT_URL, {
                        method: 'POST',
                        headers: { 'Content-Type': 'application/json' },
                        body: JSON.stringify({ features: features, id: patientId })
                    });

                    // ERROR HANDLING: Check if the request failed (e.g. 500 or 400)
                    if (!response.ok) {
                        const errorText = await response.text();
                        throw new Error(`Server Error (${response.status}): ${errorText}`);
                    }

                    const result = await response.json();
                    const prediction = result.prediction;

                    // VALIDATION: Check if prediction is missing
                    if (prediction === undefined) {
                        throw new Error("Invalid response: 'prediction' field missing.");
                    }

                    processed++;

                    if (prediction === 'M') {
                        // LOGIC: If Malignant -> Add to Open Cases
                        openCases.push({
                            id: patientId,
                            features: features,
                            date: new Date().toLocaleDateString(),
                            status: 'Malignant'
                        });
                        logDiv.innerHTML += `[${patientId}] -> MALIGNANT (Added to Open Cases)\n`;
                    } else {
                        // LOGIC: If Benign -> Auto-save to DB
                        logDiv.innerHTML += `[${patientId}] -> Benign (Archived to DB)\n`;
                    }

                } catch (error) {
                    logDiv.innerHTML += `[${patientId}] Error: ${error.message}\n`;
                }

                // Auto-scroll log
                logDiv.scrollTop = logDiv.scrollHeight;
            }

            logDiv.innerHTML += `\nAnalysis Complete. ${processed} records processed; ${openCases.length} open cases awaiting review.`;
        }

        // --- LOGIC: OPEN CASES ---
        function renderOpenCases() {
            const list = document.getElementById('cases-list');
            list.innerHTML = '';

            if (openCases.length === 0) {
                list.innerHTML = "<p style='text-align:center; color:#666;'>No open cases found.</p>";
                return;
            }

            openCases.forEach(c => {
                const item = document.createElement('div');
                item.className = 'case-card';
                item.innerHTML = `
                    <div>
                        <strong>ID: ${c.id}</strong> <span style="color:#999; font-size:0.9em">(${c.date})</span><br>
                        <span class="status-badge status-malignant">Potential Malignancy</span>
                    </div>
                    <div>
                        <button class="btn btn-success" onclick="resolveCase('${c.id}', 'Confirmed')">✓ Confirm</button>
                        <button class="btn btn-danger" onclick="resolveCase('${c.id}', 'False Positive')">X False Positive</button>
                    </div>
                `;
                list.appendChild(item);
            });
        }

        function resolveCase(id, resolution) {
            // Find the case data to send back to the backend
            const caseData = openCases.find(c => c.id === id);
            const features = caseData ? caseData.features : [];

            if(confirm(`Mark case ${id} as ${resolution}?`)) {

                // 1. Prepare the payload
                // We add 'operation': 'feedback' so the Lambda knows this isn't a new prediction
                const payload = {
                    operation: 'feedback',
                    id: id,
                    resolution: resolution
                };

                // 2. Send to API
                fetch(PREDICT_URL, {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify(payload)
                })
                .then(response => {
                    if (!response.ok) {
                        throw new Error(`Server responded with ${response.status}`);
                    }
                    return response.json();
                })
                .then(data => {
                    // 3. Success Handling
                    alert(`✅ Success! Feedback saved to DynamoDB.\nCase ${id} marked as ${resolution}.`);
                    // Remove from the list
                    openCases = openCases.filter(c => c.id !== id);
                    renderOpenCases();
                })
                .catch(error => {
                    // 4. Error Handling (The "Why it didn't work" part)
                    console.error("Feedback Error:", error);
                    alert(`⚠️ Action Failed: Could not save to DynamoDB.\n\nReason: ${error.message}\n\nWhy is this happening?\n1. The Lambda function might not be updated yet to handle "feedback" requests.\n2. The DynamoDB table might be missing or named incorrectly.\n3. The API Gateway might be blocking the request (CORS).\n\nPlease check your Lambda logs for more details.`);
                });
            }
        }
    </script>
</head>
<body>

    <!-- HOME VIEW -->
    <div id="home-view" class="view active">
        <h1 style="text-align:center;">Breast Cancer Risk System</h1>

        <!-- Upload Section -->
        <div class="upload-area">
            <h3>📂 Upload Patient Data</h3>
            <p>Drag and drop your CSV file here or click to browse</p>
            <input type="file" id="csvFileInput" accept=".csv" onchange="handleFileUpload(event)" />
        </div>

        <!-- Data Preview Table -->
        <div id="table-container" class="table-container">
            <table id="data-table">
                <thead id="data-table-head"></thead>
                <tbody id="data-table-body"></tbody>
            </table>
        </div>

        <!-- Actions -->
        <div style="text-align:center; margin-top: 20px;">
            <button id="confirm-btn" class="btn btn-success" onclick="confirmData()" disabled>✓ Confirm Data</button>
            <button id="analyze-btn" class="btn btn-primary" onclick="analyzeRisk()" disabled>⚡ Analyze Risk</button>
            <button class="btn btn-outline" onclick="showView('cases-view')">📂 Open Cases</button>
        </div>
    </div>

    <!-- ANALYZE VIEW (Log Output) -->
    <div id="analyze-view" class="view">
        <button onclick="showView('home-view')" class="btn btn-outline">← Back to Home</button>
        <h2>Analysis in Progress</h2>

        <!-- MIRRORED TABLE -->
        <div id="analysis-table-container" class="table-container"></div>

        <p>Processing uploaded records against the AI model...</p>

        <div id="analysis-log"></div>

        <div style="margin-top: 20px; text-align: center;">
            <button class="btn btn-primary" onclick="showView('cases-view')">Go to Open Cases Review</button>
        </div>
    </div>

    <!-- OPEN CASES VIEW -->
    <div id="cases-view" class="view">
        <button onclick="showView('home-view')" class="btn btn-outline">← Back to Home</button>
        <h2>Open Cases</h2>
        <p>The following cases were flagged as <strong>Malignant</strong> and require doctor confirmation.</p>
        <div id="cases-list"></div>
    </div>

</body>
</html>
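
The CSV rules applied by `parseCSV` (case ID taken from the last token of the filename, header skipped when the first line contains letters, last 30 numeric values kept per row) can be mirrored in Python to pre-validate a file before uploading it. This is an approximate sketch: Python's `float()` is stricter than JavaScript's `parseFloat`, and the function names are hypothetical.

```python
import math
import re

N_FEATURES = 30


def base_id_from_filename(file_name):
    """'sample_patient_569.csv' -> '569' (last token after _, - or space)."""
    stem = file_name.rsplit(".", 1)[0] or file_name
    return re.split(r"[_\-\s]+", stem)[-1]


def parse_rows(csv_text, file_name):
    """Return a list of {'id', 'features'} dicts, like the page's loadedData."""
    lines = [line for line in csv_text.split("\n") if line.strip()]
    if not lines:
        return []
    # Skip the first line when it looks like a header (contains letters).
    start = 1 if re.search(r"[a-zA-Z]", lines[0]) else 0
    rows = []
    for line in lines[start:101]:  # same upper bound as the page's loop
        numeric = []
        for cell in line.split(","):
            try:
                value = float(cell)
            except ValueError:
                continue  # non-numeric cell such as 'M' or 'B'
            if not math.isnan(value):
                numeric.append(value)
        if len(numeric) >= N_FEATURES:
            rows.append(numeric[-N_FEATURES:])
    base_id = base_id_from_filename(file_name)
    if len(rows) == 1:
        return [{"id": base_id, "features": rows[0]}]
    return [
        {"id": f"{base_id}-{index}", "features": features}
        for index, features in enumerate(rows, start=1)
    ]
```

A file that yields an empty list here would trigger the page's "Could not parse any valid rows" alert.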


## 5️⃣ Phase 5: Automation (Continuous Learning)

This phase closes the loop. It ensures that confirmed cases are automatically fed back into the training dataset.

### A. EventBridge Schedule
*   **Type**: Schedule Rule
*   **Cron Expression**: `cron(0 0 ? * SUN *)` (every Sunday at 00:00 UTC; EventBridge rule schedules are evaluated in UTC)
*   **Target**: Lambda Function 2 (Sync)
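
The six fields of the cron expression are minutes, hours, day-of-month, month, day-of-week, and year. The sketch below decodes the expression and shows how the rule and its target could be registered through a boto3-style EventBridge client passed in as a parameter. The function names, the rule name, and the target ID `sync-lambda` are hypothetical.

```python
CRON_FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def explain_cron(expression):
    """Split an EventBridge 'cron(...)' expression into its six named fields."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError("expected an expression of the form cron(...)")
    parts = expression[len("cron("):-1].split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def schedule_weekly_sync(events_client, rule_name, lambda_arn,
                         expression="cron(0 0 ? * SUN *)"):
    """Create or update the schedule rule and point it at the Sync Lambda."""
    explain_cron(expression)  # fail early on a malformed expression
    events_client.put_rule(
        Name=rule_name, ScheduleExpression=expression, State="ENABLED"
    )
    events_client.put_targets(
        Rule=rule_name, Targets=[{"Id": "sync-lambda", "Arn": lambda_arn}]
    )


# Example wiring (requires boto3 and AWS credentials):
#   import boto3
#   schedule_weekly_sync(boto3.client("events"), "weekly-sync", "<lambda ARN>")
```

The function also needs a resource-based permission that lets `events.amazonaws.com` invoke it; the console adds this automatically when the target is attached there.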

### B. Lambda Function 2: The Sync Agent (`DynamoDB-S3-SyncLambda`)
*   **Runtime**: Python 3.9
*   **Permissions**: `AmazonS3FullAccess`, `AmazonDynamoDBFullAccess`.
*   **Logic**:
    1.  Scans DynamoDB for `Confirmed` or `False Positive` cases.
    2.  Downloads the latest `patient_data_vX.csv`.
    3.  Appends new rows (correcting diagnosis if False Positive).
    4.  Uploads `patient_data_v(X+1).csv`.
    5.  Marks DynamoDB items as `is_exported = True`.


In [None]:
import json
import boto3
import csv
import io
import os
from boto3.dynamodb.conditions import Attr

# --- CONFIGURATION ---
DYNAMODB_TABLE = 'Patient_Entries'
S3_BUCKET = 'breast-cancer-cleaneddata'
BASE_DATA_FILE = 'clean-data.csv'

# HEADERS: ID first, then Diagnosis, then Features
CSV_HEADERS = [
    'id', 'diagnosis',
    'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
    'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
    'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
    'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se',
    'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
    'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'
]

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(DYNAMODB_TABLE)

def get_latest_dataset():
    """
    Finds the latest version of the dataset.
    Returns (content_string, next_version_number)
    """
    try:
        # List all files that look like patient_data_v*
        response = s3.list_objects_v2(Bucket=S3_BUCKET, Prefix='patient_data_v')
        
        latest_version = 0
        latest_file = BASE_DATA_FILE
        
        if 'Contents' in response:
            for obj in response['Contents']:
                key = obj['Key']
                if key.startswith('patient_data_v') and key.endswith('.csv'):
                    try:
                        # Extract version number
                        ver = int(key.replace('patient_data_v', '').replace('.csv', ''))
                        if ver > latest_version:
                            latest_version = ver
                            latest_file = key
                    except ValueError:
                        continue
        
        print(f"Latest dataset found: {latest_file}")
        
        # Download the content of the latest file
        file_obj = s3.get_object(Bucket=S3_BUCKET, Key=latest_file)
        content = file_obj['Body'].read().decode('utf-8')
        
        return content, latest_version + 1
        
    except Exception as e:
        print(f"Error finding latest dataset: {e}")
        # Fallback: try to get the base file
        try:
            print(f"Falling back to base file: {BASE_DATA_FILE}")
            file_obj = s3.get_object(Bucket=S3_BUCKET, Key=BASE_DATA_FILE)
            content = file_obj['Body'].read().decode('utf-8')
            return content, 1
        except Exception as inner_e:
            print(f"Critical Error: Could not find base file either. {inner_e}")
            # If absolutely nothing exists, start with headers
            return ",".join(CSV_HEADERS) + "\n", 1

def determine_diagnosis(prediction, resolution):
    """
    Determines the true diagnosis (M/B) based on prediction and doctor feedback.
    """
    pred_str = str(prediction).lower()
    res_str = str(resolution).lower()
    
    # Accept either the short code 'm' or the full word 'malignant'
    is_predicted_malignant = ('m' == pred_str) or ('malignant' in pred_str)
    
    if 'confirmed' in res_str:
        return 'M' if is_predicted_malignant else 'B'
        
    elif 'false positive' in res_str:
        # Doctor disagrees.
        # If predicted Malignant but marked False Positive -> It is Benign.
        if is_predicted_malignant:
            return 'B'
        # If predicted Benign but marked False Positive (False Negative) -> It is Malignant.
        else:
            return 'M'
            
    return 'M' if is_predicted_malignant else 'B'

def lambda_handler(event, context):
    print("Starting Sync Process...")
    
    # 1. Scan DynamoDB for new resolved cases
    try:
        response = table.scan(
            FilterExpression=(
                (Attr('doctor_resolution').eq('Confirmed') | Attr('doctor_resolution').eq('False Positive')) & 
                Attr('is_exported').ne('True')
            )
        )
        items = response.get('Items', [])
        
        while 'LastEvaluatedKey' in response:
            response = table.scan(
                FilterExpression=(
                    (Attr('doctor_resolution').eq('Confirmed') | Attr('doctor_resolution').eq('False Positive')) & 
                    Attr('is_exported').ne('True')
                ),
                ExclusiveStartKey=response['LastEvaluatedKey']
            )
            items.extend(response.get('Items', []))
            
        print(f"Found {len(items)} new resolved cases to export.")
        
        if not items:
            return {
                'statusCode': 200,
                'body': json.dumps('No new resolved cases to export.')
            }

        # 2. Get Latest Dataset Content (Cumulative)
        current_csv_content, next_version = get_latest_dataset()
        
        # 3. Append New Data
        # We use StringIO to append to the existing string
        csv_buffer = io.StringIO()
        csv_buffer.write(current_csv_content)
        
        # Ensure we are on a new line if the previous file didn't end with one
        if not current_csv_content.endswith('\n'):
            csv_buffer.write('\n')
            
        writer = csv.DictWriter(csv_buffer, fieldnames=CSV_HEADERS)
        # Do NOT write header, as it's already in the existing content
        
        items_to_update = []
        
        for item in items:
            row = {}
            
            # 1. ID
            row['id'] = item.get('id', 'unknown')
            
            # 2. Diagnosis
            pred = item.get('prediction', '')
            res = item.get('doctor_resolution', '')
            row['diagnosis'] = determine_diagnosis(pred, res)
            
            # 3. Features (Unpacking the list from DynamoDB)
            features_data = item.get('features', [])
            
            # Parse features if they are stored as a string (e.g. "[1.0, 2.0]")
            features_list = []
            if isinstance(features_data, str):
                try:
                    # Try JSON first (str() of a list of floats is usually valid JSON)
                    features_list = json.loads(features_data)
                except ValueError:
                    # Fallback: remove brackets and split on commas
                    try:
                        clean_str = features_data.replace('[', '').replace(']', '')
                        features_list = [float(x.strip()) for x in clean_str.split(',') if x.strip()]
                    except ValueError:
                        features_list = []
            elif isinstance(features_data, list):
                features_list = features_data
            
            # The CSV_HEADERS list has 'id' and 'diagnosis' at indices 0 and 1.
            # The features start at index 2.
            feature_names = CSV_HEADERS[2:]
            
            for i, feature_name in enumerate(feature_names):
                if i < len(features_list):
                    # Convert Decimal to float
                    row[feature_name] = float(features_list[i])
                else:
                    # Fallback if list is shorter than expected
                    row[feature_name] = 0.0
            
            writer.writerow(row)
            items_to_update.append(item['id'])

        # 4. Upload New Cumulative Version
        new_filename = f'patient_data_v{next_version}.csv'
        print(f"Uploading cumulative data to {new_filename}...")
        
        s3.put_object(
            Bucket=S3_BUCKET,
            Key=new_filename,
            Body=csv_buffer.getvalue()
        )
        
        # 5. Mark items as exported
        print(f"Marking {len(items_to_update)} items as exported...")
        for item_id in items_to_update:
            table.update_item(
                Key={'id': item_id},
                UpdateExpression="set is_exported = :val",
                ExpressionAttributeValues={':val': 'True'}
            )
            
        return {
            'statusCode': 200,
            'body': json.dumps(f"Success! Appended {len(items)} cases to {new_filename}")
        }

    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps(f"Error syncing data: {str(e)}")
        }
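
The handler above writes new rows in `CSV_HEADERS` order without checking the header of the file it appends to. If the existing CSV has a different column order or extra columns, the appended rows would be misaligned. A minimal sketch of a header-aware append follows, assuming the first line of the existing file is its header. The function name `append_rows` is hypothetical.

```python
import csv
import io


def append_rows(existing_csv, new_rows, default_headers):
    """Append dict rows to CSV text, following the existing file's column order.

    Falls back to `default_headers` (and writes them) when the text is empty.
    """
    reader = csv.reader(io.StringIO(existing_csv))
    header = next(reader, None) or list(default_headers)

    out = io.StringIO()
    if existing_csv.strip():
        # Keep the existing content and make sure it ends with a newline.
        out.write(existing_csv if existing_csv.endswith("\n") else existing_csv + "\n")
    else:
        csv.writer(out, lineterminator="\n").writerow(header)

    writer = csv.DictWriter(
        out,
        fieldnames=header,
        restval="",              # columns missing from a row stay empty
        extrasaction="ignore",   # keys not in the header are dropped
        lineterminator="\n",
    )
    for row in new_rows:
        writer.writerow(row)
    return out.getvalue()
```

Leaving missing values empty, rather than filling them with `0.0`, keeps incomplete rows detectable at training time.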