# AWS Migration: Step-by-Step Implementation Guide

This notebook serves as a checklist and guide for migrating your Breast Cancer Prediction model to AWS. Follow these steps sequentially.

## Prerequisites
*   An active AWS Account.
*   The `breastcancer_env.yaml` file on your local machine.
*   The `clean-data.csv` file on your local machine.

## Step 1: Create S3 Buckets (Storage)
1.  Log in to the **AWS Console** and search for **S3**.
2.  Click **Create bucket**.
3.  **Bucket 1 (Data):** Name it something unique (e.g., `breast-cancer-data-jad-2025`).
    *   Keep defaults (Block Public Access: On).
    *   Click **Create bucket**.
4.  **Bucket 2 (Models):** Name it something unique (e.g., `breast-cancer-models-jad-2025`).
    *   Click **Create bucket**.
5.  **Upload Data:**
    *   Go into your **Data Bucket**.
    *   Click **Upload** -> **Add files**.
    *   Select your local `data/clean-data.csv`.
    *   Click **Upload**.
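
The console steps above can also be scripted. The following is a minimal sketch, not part of the original workflow, assuming `boto3` is installed and AWS credentials are configured. The helper name `create_buckets_and_upload` and its `region` argument are illustrative. The S3 client is passed in as an argument so the function can be tried with a stand-in object before touching AWS.

```python
def create_buckets_and_upload(s3, data_bucket, model_bucket, local_csv, region):
    """Create both buckets and upload the training CSV (hypothetical helper).

    `s3` is expected to be a boto3 S3 client, e.g. boto3.client("s3", region_name=region).
    """
    for bucket in (data_bucket, model_bucket):
        if region == "us-east-1":
            # us-east-1 is the default location and must not be passed as a constraint
            s3.create_bucket(Bucket=bucket)
        else:
            s3.create_bucket(
                Bucket=bucket,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
    # Store the file under the key that train.py later downloads (DATA_FILE)
    s3.upload_file(local_csv, data_bucket, "clean-data.csv")
    return f"s3://{data_bucket}/clean-data.csv"
```

New buckets created this way keep Block Public Access enabled by default, which matches the console defaults used in this step.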

## Step 2: Create IAM Role (Permissions)
This role allows your EC2 instance to read from and write to your S3 buckets (it downloads the training data and uploads the trained model).
1.  Search for **IAM** in the console.
2.  Go to **Roles** -> **Create role**.
3.  **Trusted entity type:** AWS Service.
4.  **Service or use case:** EC2.
5.  Click **Next**.
6.  **Add permissions:** Search for and select `AmazonS3FullAccess`.
    *   *Note: In a strict production environment, you would limit this to just your specific buckets, but FullAccess is fine for setup.*
7.  Click **Next**.
8.  **Role name:** `EC2-S3-Access-Role`.
9.  Click **Create role**.
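
The note in item 6 mentions limiting access to your specific buckets. As a rough sketch of what such a policy document could look like, the hypothetical helper `build_s3_policy` below builds one for the two example bucket names from Step 1. The split into read-only on the data bucket and write-only on the models bucket is an assumption based on what `train.py` does.

```python
import json

def build_s3_policy(data_bucket, model_bucket):
    """Return an IAM policy document scoped to the two project buckets."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # train.py only reads the training CSV
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{data_bucket}/*"],
            },
            {   # train.py only writes the trained model
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{model_bucket}/*"],
            },
        ],
    }

policy = build_s3_policy("breast-cancer-data-jad-2025", "breast-cancer-models-jad-2025")
print(json.dumps(policy, indent=2))
```

The printed JSON could be pasted into a custom IAM policy and attached to `EC2-S3-Access-Role` in place of `AmazonS3FullAccess`. If a later step also writes to the data bucket from EC2, that bucket would need `s3:PutObject` as well.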

## Step 3: Launch EC2 Instance (Compute)
1.  Search for **EC2** in the console.
2.  Click **Launch Instance**.
3.  **Name:** `Breast-Cancer-Training-Server`.
4.  **OS Images:** Select **Ubuntu** (Ubuntu Server 24.04 LTS is good).
5.  **Instance Type:** `t3.medium` (2 vCPU, 4GB RAM) is recommended for this workload. `t2.micro` might run out of memory during installation.
6.  **Key pair:**
    *   Click **Create new key pair**.
    *   Name: `breast-cancer-key`.
    *   Type: `.pem` (for OpenSSH/Mac/Linux) or `.ppk` (for PuTTY/Windows).
    *   **Download Key Pair** (Keep this safe!).
7.  **Network settings:** Check **Allow SSH traffic from** and select **My IP** so that only your current IP address can connect.
8.  **Advanced details (Crucial Step):**
    *   Find **IAM instance profile**.
    *   Select the role you created in Step 2: `EC2-S3-Access-Role`.
9.  Click **Launch instance**.

## Step 4: Create the Training Script
Create a file named `train.py` on your local machine with the following content. 
**Important:** Set `SOURCE_BUCKET` and `MODEL_BUCKET` in the configuration section to the actual names of the buckets you created in Step 1.

In [None]:

import pandas as pd
import boto3
import joblib
import os
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# --- CONFIGURATION ---
# 1. UPDATE THESE WITH YOUR ACTUAL BUCKET NAMES
SOURCE_BUCKET = 'breast-cancer-cleaneddata' 
MODEL_BUCKET = 'breast-cancer-prediction-models'
DATA_FILE = 'clean-data.csv'
LOCAL_DATA_PATH = 'data/clean-data.csv'

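# 1. Load Data
# The download is skipped when a local copy already exists; delete
# data/clean-data.csv to force a fresh download (e.g. before retraining).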
if not os.path.exists(LOCAL_DATA_PATH):
    os.makedirs('data', exist_ok=True)
    print(f"Downloading {DATA_FILE} from S3...")
    try:
        s3 = boto3.client('s3')
        s3.download_file(SOURCE_BUCKET, DATA_FILE, LOCAL_DATA_PATH)
        print("Download successful.")
    except Exception as e:
        print(f"Error downloading from S3: {e}")
        print("Make sure you updated SOURCE_BUCKET and that your EC2 role has S3 permissions.")
        raise

data = pd.read_csv(LOCAL_DATA_PATH)

# 2. Preprocessing (Cleaning)
if 'Unnamed: 0' in data.columns: data.drop('Unnamed: 0', axis=1, inplace=True)
if 'id' in data.columns: data.drop('id', axis=1, inplace=True)
if 'Unnamed: 32' in data.columns: data.drop('Unnamed: 32', axis=1, inplace=True)

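# After the drops above, the first column is assumed to be the diagnosis label
# and the remaining 30 columns the features.
X = data.iloc[:, 1:].values
y = data.iloc[:, 0].values

# Encode Target (LabelEncoder sorts classes, so 'B' -> 0 and 'M' -> 1 if the labels are 'B'/'M')
le = LabelEncoder()
y = le.fit_transform(y)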

# 3. Define Pipeline (StandardScaler -> PCA -> SVM)
pipeline = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=2)),
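    # Best params from Notebook (C=0.1, gamma=0.001, kernel='linear')
    # Note: scikit-learn ignores gamma when kernel='linear'; it is kept only to mirror the grid search.
    ('clf', SVC(C=0.1, gamma=0.001, kernel='linear', probability=True))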
])

# 4. Train
print("Training model...")
pipeline.fit(X, y)
print("Training complete.")

# 5. Save and Upload
LOCAL_MODEL_PATH = 'model_v1.pkl'
joblib.dump(pipeline, LOCAL_MODEL_PATH)
print(f"Model saved locally to {LOCAL_MODEL_PATH}")

print("Uploading to S3...")
s3 = boto3.client('s3')
s3.upload_file(LOCAL_MODEL_PATH, MODEL_BUCKET, 'latest_model.pkl')
print(f"Success! Model uploaded to s3://{MODEL_BUCKET}/latest_model.pkl")

### Is this the most efficient model?

**Short Answer:** Yes, this architecture (SVM with Scaling) was the top performer in your analysis. We have updated the script with your specific hyperparameters (`C=0.1`, `kernel='linear'`) based on your Grid Search results.

**Detailed Explanation:**
1.  **Why SVM?**
    In your notebook `NB6 Comparison between different classifiers`, you compared Logistic Regression, LDA, KNN, CART, Naive Bayes, and SVM.
    *   Initially, SVM performed poorly.
    *   **However**, after you applied `StandardScaler` (Standardization), SVM became the **most accurate algorithm**, outperforming the others. This is why we selected `SVC` for your production script.

2.  **Why PCA?**
    Your notebooks (`NB5` and `NB6`) use Principal Component Analysis (PCA).
    *   In `NB6`, you explicitly defined a pipeline: `StandardScaler` -> `PCA(n_components=2)` -> `SVC`.
    *   Using `n_components=2` makes the model extremely **efficient** (fast to train, small to store) because the classifier sees only the first 2 principal components (linear combinations of the 30 original features that capture the most variance) instead of all 30 features.
    *   *Trade-off:* If you want slightly higher accuracy at the cost of speed, you could increase `n_components` to 10 (as explored in `NB5`) or remove PCA entirely. But for a "lightweight" cloud deployment, keeping PCA is a smart move for efficiency.

3.  **Why C=0.1?**
    In `NB5`, you ran a `GridSearchCV` to find the best `C` (regularization strength) and `gamma`.
    *   We have updated the `train.py` script to use `C=0.1` and `kernel='linear'` as per your specific results. With a linear kernel, `gamma` has no effect on the model, so only `C` matters here.

**Conclusion:**
The `train.py` script represents the **best architecture** (Scaled SVM) found in your research. It is "efficient" because it balances high accuracy (via SVM) with low computational cost (via PCA).
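
The PCA trade-off described above can be checked with cross-validation. The following is a minimal sketch, assuming scikit-learn's bundled `load_breast_cancer` dataset is an acceptable stand-in for `clean-data.csv` (the same 30 features are assumed). The function name `cv_accuracy` is illustrative, and the scores it prints are not results from your notebooks.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cv_accuracy(n_components=None):
    """Mean 5-fold accuracy of the scaled linear SVM, with or without PCA."""
    steps = [("scl", StandardScaler())]
    if n_components is not None:
        steps.append(("pca", PCA(n_components=n_components)))
    steps.append(("clf", SVC(C=0.1, kernel="linear")))
    X, y = load_breast_cancer(return_X_y=True)
    return cross_val_score(Pipeline(steps), X, y, cv=5).mean()

# Compare the three options discussed above: 2 components, 10 components, no PCA
for n in (2, 10, None):
    label = "no PCA" if n is None else f"PCA({n})"
    print(f"{label}: {cv_accuracy(n):.3f}")
```

Because `train.py` fits on the full dataset without a hold-out split, a comparison like this is one way to estimate accuracy before deploying a changed pipeline.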

## Step 5: Deploy to EC2
Now that your infrastructure is ready and your code is written, let's run it.

1.  **Connect to EC2:**
    *   Open your terminal (WSL/Ubuntu).
    *   Run the following commands:
    ```bash
    chmod 400 ~/breast-cancer-key.pem   # ssh refuses private keys that other users can read
    ssh -i "~/breast-cancer-key.pem" ubuntu@18.184.78.242
    ```

2.  **Install Environment Manager (Micromamba):**
    *(You have already completed this step)*
    Run these commands inside the EC2 terminal:
    ```bash
    "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
    # Press Enter to accept defaults
    source ~/.bashrc
    ```

3.  **Upload Files:**
    **CRITICAL:** Do NOT run this in the EC2 terminal. Open a **NEW** terminal window (WSL/Ubuntu) on your local machine.
    
    Use `scp` to copy your files to the server.
    ```bash
    # 1. Upload Environment File
    scp -i "~/breast-cancer-key.pem" "/mnt/c/Users/Jad Zoghaib/OneDrive/Desktop/CC_Breast_Cancer/Breast-cancer-risk-prediction/breastcancer_env.yaml" ubuntu@18.184.78.242:~/
    
    # 2. Upload Training Script
    scp -i "~/breast-cancer-key.pem" "/mnt/c/Users/Jad Zoghaib/OneDrive/Desktop/CC_Breast_Cancer/Breast-cancer-risk-prediction/train.py" ubuntu@18.184.78.242:~/
    ```

4.  **Setup & Run (Back in EC2 Terminal):**
    ```bash
    # Create environment
    micromamba create -f breastcancer_env.yaml
    
    # Activate
    micromamba activate breastcancer
    
    # Install boto3 (if missing)
    micromamba install boto3
    
    # Run the training
    python train.py
    ```

5.  **Verify:**
    Check your S3 bucket "Models". You should see `latest_model.pkl`.
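
The same check can be done from Python instead of the console. This is a small sketch with a hypothetical helper name, `model_uploaded`; it assumes you pass in a boto3 S3 client such as `boto3.client("s3")`.

```python
def model_uploaded(s3, bucket, key="latest_model.pkl"):
    """Return True if `key` exists in `bucket` (s3 is a boto3 S3 client)."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=key)
    # 'Contents' is absent from the response when nothing matches the prefix
    return any(obj["Key"] == key for obj in response.get("Contents", []))
```

For example, `model_uploaded(boto3.client("s3"), "breast-cancer-prediction-models")` should return `True` once `train.py` has finished, using the `MODEL_BUCKET` name from the script.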

#  🚨🚨🚨 Part that is not done  🚨🚨🚨
 
## Phase 1: The Base Application (Inference)
Now that we have a trained model on S3, we will build the core application: a website where doctors can upload data and get a prediction.

### Step 6: Create the Predictor Lambda
**What is this?**
AWS Lambda is a "serverless" compute service. It allows us to run our Python prediction code without managing a server (like EC2).

**How it fits:**
This function acts as the **Brain** of the application. When triggered, it will:
1.  Download the `latest_model.pkl` from your S3 bucket.
2.  Receive the patient data (30 features) from the website.
3.  Use the model to calculate a prediction (Benign/Malignant).
4.  Return the result.

**Instructions:**
1.  Search for **Lambda** in AWS Console.
2.  Click **Create function**.
3.  **Name:** `BreastCancerPredictor`.
4.  **Runtime:** **Python 3.10**.
5.  **Permissions:**
    *   Click "Change default execution role".
    *   Select "Create a new role from AWS policy templates".
    *   Role name: `Lambda-ML-Role`.
    *   **Policy templates:** Search for "S3 object read-only permissions".
6.  Click **Create function**.

**Add Dependencies (Layers):**
1.  Scroll to the **Layers** section at the bottom.
2.  Click **Add a layer**.
3.  Choose **AWS layers**.
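4.  Select **AWSSDKPandas-Python310** (This includes pandas and numpy).
    *   *Note: The code below also imports `joblib` and unpickles a scikit-learn pipeline. This AWS-managed layer is not documented as including scikit-learn or joblib, so you will probably need an extra layer (or a container image) that provides them, ideally with the same scikit-learn version used for training on EC2.*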
5.  Click **Add**.

**Lambda Code:**
Paste this into the code editor and click **Deploy**.

In [None]:

import json
import boto3
import joblib
import pandas as pd
from io import BytesIO

# --- CONFIGURATION ---
BUCKET_NAME = 'breast-cancer-prediction-models'
MODEL_KEY = 'latest_model.pkl'

s3 = boto3.client('s3')
model = None

def load_model():
    global model
    if model is None:
        with BytesIO() as data:
            s3.download_fileobj(BUCKET_NAME, MODEL_KEY, data)
            data.seek(0)
            model = joblib.load(data)
    return model

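# CORS headers are returned on every response so the browser can also read errors.
CORS_HEADERS = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Allow-Methods': 'OPTIONS,POST'
}

def lambda_handler(event, context):
    # Expects an API Gateway Lambda proxy event: the JSON payload is a string in event['body'].
    try:
        body = json.loads(event.get('body') or '{}')
        features = body.get('features') # List of 30 numbers
        if not isinstance(features, list) or len(features) != 30:
            raise ValueError("'features' must be a list of 30 numbers")

        clf = load_model()
        prediction = clf.predict([features])[0] # 0 or 1
        result = 'Malignant' if prediction == 1 else 'Benign'

        return {
            'statusCode': 200,
            'headers': CORS_HEADERS,
            'body': json.dumps({'prediction': result})
        }
    except Exception as e:
        return {
            'statusCode': 400,
            'headers': CORS_HEADERS,
            'body': json.dumps({'error': str(e)})
        }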
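
To try the handler before API Gateway exists, you need an event shaped like the one a proxy integration delivers, with the JSON payload as a string under `body`. The following is a minimal sketch; `make_proxy_event` is a hypothetical helper, the extra keys are a reduced subset of a real event, and the feature values are placeholders rather than real patient data.

```python
import json

def make_proxy_event(features):
    """Build a minimal API Gateway proxy-style event for local testing."""
    return {
        "httpMethod": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"features": features}),  # the body arrives as a JSON string
    }

event = make_proxy_event([0.0] * 30)  # placeholder values only
print(json.dumps(event, indent=2))
```

The printed JSON can be pasted into the **Test** tab of the Lambda console as a test event.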

### Step 7: Setup API Gateway
**What is this?**
API Gateway is a service that creates and publishes secure APIs.

**How it fits:**
Your Lambda function is hidden inside your private AWS cloud. The API Gateway acts as the **Front Door**. It provides a public URL (endpoint) that your HTML website can send data to. When the API Gateway receives data, it passes it to the Lambda function.

**Instructions:**
1.  Search for **API Gateway**.
2.  Click **Create API** -> **REST API** (Build).
3.  **Name:** `BreastCancerAPI`. Click **Create API**.
4.  **Create Resource:** Actions -> Create Resource -> Name: `predict` -> Check **Enable API Gateway CORS** -> Create.
5.  **Create Method:** Select `/predict` -> Actions -> Create Method -> **POST**.
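    *   Integration type: **Lambda Function**. Enable **Lambda proxy integration** so that the request body reaches the function as `event['body']` and the function's `statusCode`/`headers`/`body` response is passed through unchanged.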
    *   Function: `BreastCancerPredictor`.
    *   Save -> OK.
6.  **Deploy:** Actions -> Deploy API -> Stage: `prod` -> Deploy.
7.  **Copy URL:** Save the **Invoke URL** (e.g., `https://xyz.../prod`).
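
Before building the frontend, the endpoint can be exercised from Python. This sketch uses only the standard library. The helper name `build_predict_request` is illustrative, and the URL is the same placeholder used elsewhere in this guide, so the request is only constructed here, not sent.

```python
import json
import urllib.request

def build_predict_request(invoke_url, features):
    """Prepare (but do not send) a POST request to the /predict resource."""
    payload = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=invoke_url.rstrip("/") + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder Invoke URL and placeholder feature values
req = build_predict_request(
    "https://YOUR-API-ID.execute-api.us-east-1.amazonaws.com/prod", [0.0] * 30
)
print(req.get_method(), req.full_url)
```

With your real Invoke URL in place, `urllib.request.urlopen(req)` would send the request and return the Lambda's JSON response.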

### Step 8: Create the Frontend (V1)
**What is this?**
This is the user interface (UI) for the doctor. It is a simple HTML/JavaScript file.

**How it fits:**
This file will be hosted on S3 (acting as a web server). It runs in the doctor's browser, reads the CSV file they select, and sends the data to the **API Gateway** URL we just created. It then displays the result returned by the Lambda.

**Instructions:**
1.  Create `index.html` locally.
2.  Paste the code below (update `PREDICT_URL` with your Invoke URL followed by `/predict`).
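3.  Upload it to an S3 bucket and enable **Static website hosting** in Properties. Website hosting requires the page to be publicly readable, so a dedicated bucket is safer than the Data Bucket, which holds `clean-data.csv` and has Block Public Access on.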

In [None]:
<!DOCTYPE html>
<html>
<head>
    <title>Breast Cancer Risk Predictor</title>
    <style>
        body { font-family: sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
        .container { border: 1px solid #ccc; padding: 20px; border-radius: 8px; }
        .hidden { display: none; }
        .malignant { color: red; font-weight: bold; }
        .benign { color: green; font-weight: bold; }
    </style>
</head>
<body>
    <div class="container">
        <h1>Breast Cancer Risk Assessment</h1>
        <p>Upload patient data (CSV with a single row of 30 comma-separated feature values, no header).</p>
        
        <label>Patient ID: <input type="text" id="patientId"></label><br><br>
        <input type="file" id="csvFile" accept=".csv"><br><br>
        <button onclick="analyze()">Analyze Risk</button>
        
        <div id="resultArea" class="hidden">
            <h3>Prediction: <span id="predictionText"></span></h3>
            <hr>
            <h4>Doctor's Validation</h4>
            <p>Please confirm the final diagnosis to update the medical database.</p>
            <select id="finalDiagnosis">
                <option value="Benign">Confirmed Benign</option>
                <option value="Malignant">Confirmed Malignant</option>
            </select>
            <button onclick="submitData()">Submit to Database</button>
        </div>
    </div>

    <script>
        const PREDICT_URL = 'https://YOUR-API-ID.execute-api.us-east-1.amazonaws.com/prod/predict';
        const SAVE_URL = 'https://YOUR-API-ID.execute-api.us-east-1.amazonaws.com/prod/save'; // We will create this next
        
        let currentFeatures = [];

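        function analyze() {
            const file = document.getElementById('csvFile').files[0];
            if (!file) {
                alert('Please choose a CSV file first.');
                return;
            }
            const reader = new FileReader();
            reader.onload = function(e) {
                // Expects one line of 30 comma-separated numbers (no header row)
                currentFeatures = e.target.result.trim().split(',').map(Number);
                if (currentFeatures.length !== 30 || currentFeatures.some(Number.isNaN)) {
                    alert('The file must contain exactly 30 numeric values.');
                    return;
                }

                fetch(PREDICT_URL, {
                    method: 'POST',
                    body: JSON.stringify({ features: currentFeatures })
                })
                .then(res => {
                    if (!res.ok) throw new Error('Prediction request failed: ' + res.status);
                    return res.json();
                })
                .then(data => {
                    const pred = data.prediction;
                    const span = document.getElementById('predictionText');
                    span.innerText = pred;
                    span.className = pred === 'Malignant' ? 'malignant' : 'benign';

                    // Auto-select the predicted value in dropdown
                    document.getElementById('finalDiagnosis').value = pred;
                    document.getElementById('resultArea').classList.remove('hidden');
                })
                .catch(err => alert(err.message));
            };
            reader.readAsText(file);
        }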

        function submitData() {
            const patientId = document.getElementById('patientId').value;
            const diagnosis = document.getElementById('finalDiagnosis').value;
            
            fetch(SAVE_URL, {
                method: 'POST',
                body: JSON.stringify({
                    patient_id: patientId,
                    features: currentFeatures,
                    diagnosis: diagnosis
                })
            })
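            .then(res => {
                // Only report success when the API actually returned a 2xx status
                if (!res.ok) throw new Error('Save failed: ' + res.status);
                return res.json();
            })
            .then(data => alert('Data saved to database successfully!'))
            .catch(err => alert(err.message));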
        }
    </script>
</body>
</html>

## Phase 2: Notifications (Alerts)
We want to send an email if the prediction is Malignant.

### Step 9: Create SNS Topic
1.  Go to **SNS** -> **Create Topic** -> Name: `BreastCancerAlerts` -> Standard.
2.  **Create Subscription** -> Protocol: Email -> Endpoint: Your Email.
3.  Confirm the subscription in your email.
4.  Copy the **Topic ARN**.

### Step 10: Update Predictor Lambda
1.  Go back to your `BreastCancerPredictor` Lambda.
2.  Add `sns:Publish` permissions to its IAM Role.
3.  Update the code to send an alert if `result == 'Malignant'`.
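
The code change in item 3 is not spelled out above. One possible shape for it is sketched below, assuming a boto3 SNS client is passed in. The helper name `notify_if_malignant` and the wording of the subject and message are illustrative, not taken from the original project.

```python
def notify_if_malignant(sns, topic_arn, result):
    """Publish an alert when the prediction is Malignant; return True if sent.

    `sns` is expected to be a boto3 SNS client, e.g. boto3.client("sns").
    """
    if result != "Malignant":
        return False
    sns.publish(
        TopicArn=topic_arn,
        Subject="Breast cancer predictor alert",
        Message="A submitted sample was classified as Malignant. Please review.",
    )
    return True
```

Inside `lambda_handler`, this would be called right after `result` is computed, with the Topic ARN copied in Step 9. The message deliberately carries no patient data, since email is not a protected channel.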

## Phase 3: The Feedback Loop (Data Collection)
We will create a separate flow to save validated data to DynamoDB.

### Step 11: Create DynamoDB Table
1.  Go to **DynamoDB** -> **Create table**.
2.  Name: `BreastCancerPredictions`.
3.  Partition key: `patient_id` (String).
4.  Sort key: `timestamp` (String).
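
The table settings above translate to the following `create_table` arguments if you prefer to script this step. This is a sketch only: `table_definition` is a hypothetical helper, and on-demand billing (`PAY_PER_REQUEST`) is an assumption, since the steps above do not specify a capacity mode.

```python
def table_definition(table_name="BreastCancerPredictions"):
    """Keyword arguments for DynamoDB create_table matching the settings above."""
    return {
        "TableName": table_name,
        "KeySchema": [
            {"AttributeName": "patient_id", "KeyType": "HASH"},   # partition key
            {"AttributeName": "timestamp", "KeyType": "RANGE"},   # sort key
        ],
        "AttributeDefinitions": [
            {"AttributeName": "patient_id", "AttributeType": "S"},  # S = String
            {"AttributeName": "timestamp", "AttributeType": "S"},
        ],
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
    }

# Usage (requires boto3 and credentials):
#   boto3.client("dynamodb").create_table(**table_definition())
```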

### Step 12: Create "SaveData" Lambda
This function handles the "Submit to Database" button.

1.  Create a new Lambda: `BreastCancerSaveData`.
2.  Add permissions for **DynamoDB** (Write).
3.  **Code:**

In [None]:
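import json
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource('dynamodb')
TABLE_NAME = 'BreastCancerPredictions'

# The browser calls this endpoint cross-origin, so every response needs a CORS header.
CORS_HEADERS = {'Access-Control-Allow-Origin': '*'}

def lambda_handler(event, context):
    try:
        body = json.loads(event['body'])
        table = dynamodb.Table(TABLE_NAME)

        item = {
            'patient_id': body['patient_id'],
            # UTC ISO-8601 strings sort chronologically, which suits a string sort key
            'timestamp': datetime.now(timezone.utc).isoformat(),
            # Stored as a string because boto3 rejects Python floats for DynamoDB
            'features': str(body['features']),
            'diagnosis': body['diagnosis'] # This is the CONFIRMED diagnosis from the doctor
        }
        table.put_item(Item=item)

        return {'statusCode': 200, 'headers': CORS_HEADERS, 'body': json.dumps('Saved')}
    except Exception as e:
        return {'statusCode': 400, 'headers': CORS_HEADERS, 'body': json.dumps(str(e))}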

### Step 13: Update API Gateway
1.  Go to `BreastCancerAPI`.
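2.  Create a new Resource: `save` (check **Enable API Gateway CORS**, as you did for `/predict`).
3.  Create Method: **POST** -> Integration type: **Lambda Function** with **Lambda proxy integration** -> Function: `BreastCancerSaveData`.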
4.  **Deploy API** again to `prod`.
5.  **Update your HTML file:** Set `SAVE_URL` to your Invoke URL followed by `/save`, then upload `index.html` again.

## Phase 4: Continuous Learning (Retraining)
Every 2 days, we move data from DynamoDB to S3 and retrain.

### Step 14: Automate Retraining
1.  **Create `sync_data.py` on EC2:**
    (Use the script provided in the previous version of this guide, but ensure it maps the DynamoDB `diagnosis` field to the CSV target column).

2.  **Setup Cron Job:**
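    Run `crontab -e` on EC2 and add the line below. It runs at 00:00 on every odd day of the month (1, 3, 5, ...), so the interval is roughly, not exactly, two days across month boundaries. `train.py` runs only if `sync_data.py` exits successfully.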

In [None]:
0 0 */2 * * /home/ubuntu/micromamba/envs/breastcancer/bin/python /home/ubuntu/sync_data.py && /home/ubuntu/micromamba/envs/breastcancer/bin/python /home/ubuntu/train.py
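
The `sync_data.py` script referenced in item 1 is not reproduced in this guide. As a rough sketch of its core step, the hypothetical function below converts DynamoDB items (as written by `BreastCancerSaveData`) into CSV rows with the diagnosis first. The `'M'`/`'B'` label mapping is an assumption about how `clean-data.csv` encodes its target column and should be adjusted to match your file.

```python
import ast

# Assumption: the CSV target column uses 'M'/'B'; adjust if clean-data.csv differs.
LABELS = {"Malignant": "M", "Benign": "B"}

def items_to_rows(items, labels=LABELS):
    """Convert DynamoDB items into CSV rows: [diagnosis, feature_1, ..., feature_30]."""
    rows = []
    for item in items:
        # SaveData stored the features as str(list), so parse the string back into a list
        features = ast.literal_eval(item["features"])
        if len(features) != 30 or item["diagnosis"] not in labels:
            continue  # skip malformed records rather than corrupt the training set
        rows.append([labels[item["diagnosis"]], *features])
    return rows
```

The items would typically come from `dynamodb.Table('BreastCancerPredictions').scan()['Items']` (following `LastEvaluatedKey` for large tables), and the resulting rows would be appended to the CSV that `train.py` reads. Note that `train.py` reuses `data/clean-data.csv` when it already exists locally, so the sync script needs to update or delete that local file for retraining to see new data.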