<a href="https://colab.research.google.com/github/mclaughlinfernandeez/snp/blob/main/Genomecomparegrch38.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Web Application Frontend: File Uploads

### Subtask:
Outline the design and functionality of the web application frontend specifically for handling user file uploads.
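
As a rough illustration of the checks an upload handler could run before accepting a file, the sketch below validates the file name and size. The function name `check_upload`, the allowed extension set, and the 50 MiB limit are all hypothetical choices made for this example, not requirements stated elsewhere in this notebook.

```python
import os

# Hypothetical limits, chosen only for illustration.
ALLOWED_EXTENSIONS = {".csv"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MiB


def check_upload(filename, size_bytes):
    """Return (ok, message) for a candidate upload."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False, f"Unsupported file type: {extension or 'none'}"
    if size_bytes <= 0:
        return False, "File is empty"
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "File exceeds the upload size limit"
    return True, "OK"
```

The same checks would normally be repeated on the server side, since anything enforced only in the browser can be bypassed.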

**Reasoning**:
Describe the high-level structure of the `polymorphism-processor-service` and where the GRCh38 checking logic fits in. This involves explaining:
- The entry point of the service (e.g., an HTTP endpoint or a Pub/Sub subscriber).
- How the service receives the input polymorphism data (e.g., file path in GCS).
- The sequence of operations: loading reference, parsing input, performing validation, handling results, generating output formats, and uploading to the output bucket.
- How errors are handled and reported throughout the process.
- The key functions or modules that would encapsulate the logic for each step.
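
If the entry point is a Pub/Sub push subscription fed by Cloud Storage notifications, the first step is to recover the bucket and object name from the request body. The sketch below assumes the notification uses the JSON payload format, in which the message data is the base64-encoded object resource; the function name `parse_gcs_notification` is hypothetical.

```python
import base64
import json


def parse_gcs_notification(envelope):
    """Extract (bucket, object_name) from a Pub/Sub push envelope.

    Assumes the message data is the base64-encoded JSON object resource
    that Cloud Storage notifications send with the JSON payload format.
    """
    message = envelope.get("message") if isinstance(envelope, dict) else None
    if not message or "data" not in message:
        raise ValueError("Not a Pub/Sub push envelope with message data")
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    bucket, name = payload.get("bucket"), payload.get("name")
    if not bucket or not name:
        raise ValueError("Notification payload lacks 'bucket' or 'name'")
    return bucket, name
```

Raising `ValueError` for malformed requests lets the HTTP layer answer with a 4xx status instead of retrying a message that can never succeed.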

## Integrate into the Processor Service

### Subtask:
Outline how this checking logic will be integrated into the overall `polymorphism-processor-service` code.
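
One possible shape for that integration is a single orchestration function that receives each step as a callable. This is only a sketch: `run_job` and its four parameters are hypothetical names, and the `validate` callable is assumed to have the same `(polymorphism, reference) -> (is_valid, status)` contract as `validate_polymorphism` in this notebook.

```python
import logging

logger = logging.getLogger("polymorphism-processor-service")


def run_job(load_reference, read_rows, validate, write_results):
    """Run the load -> parse -> validate -> write sequence once.

    All four arguments are caller-supplied callables (hypothetical seams),
    so the sequence can be exercised without Cloud Storage or pyfaidx.
    """
    try:
        reference = load_reference()
        results = []
        for row in read_rows():
            is_valid, status = validate(row, reference)
            results.append({**row, "is_valid": is_valid, "validation_status": status})
        write_results(results)
    except Exception as exc:  # report the failure instead of crashing the worker
        logger.exception("Job failed")
        return {"ok": False, "error": str(exc), "processed": 0}
    return {"ok": True, "error": None, "processed": len(results)}
```

Passing the steps in as arguments keeps the validation logic testable on its own; the real service would supply functions that download from and upload to the buckets.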

## Handle Validation Results

### Subtask:
Describe how the service will handle valid and invalid entries, potentially generating a report or filtering the data.
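
A minimal sketch of the filtering and reporting step follows. It assumes each result is a dictionary carrying the `is_valid` and `validation_status` values that the processing code in this notebook attaches to every row; the function name `summarize_results` and the report layout are hypothetical.

```python
from collections import Counter


def summarize_results(results):
    """Split validated rows and count them by status.

    Each item is expected to carry 'is_valid' (a bool) and
    'validation_status' (a string).
    """
    valid = [r for r in results if r["is_valid"]]
    invalid = [r for r in results if not r["is_valid"]]
    report = {
        "total": len(results),
        "valid": len(valid),
        "invalid": len(invalid),
        "by_status": dict(Counter(r["validation_status"] for r in results)),
    }
    return valid, invalid, report
```

Note that `is_valid` must still be a boolean here. If the rows are read back from the output CSV, the column holds the strings `"True"` and `"False"` and needs converting first.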

**Reasoning**:
Explain the specific steps and logic for validating each polymorphism entry against the GRCh38 reference. This includes:
- How to access the relevant part of the reference genome based on the chromosome and position from the polymorphism data.
- How to compare the reference allele in the polymorphism data with the actual nucleotide(s) at the specified position in the GRCh38 reference.
- How to handle different types of polymorphisms (SNPs, indels) if applicable.
- How to account for potential discrepancies in chromosome naming (e.g., 'chr1' vs '1').
- How to use the loaded GRCh38 reference data structure (e.g., `pyfaidx` object) for efficient lookups.
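
The naming and comparison steps above can be sketched without a FASTA file by letting a plain dictionary of strings stand in for the `pyfaidx` object. The helper names `resolve_contig` and `reference_matches` are hypothetical, positions are assumed to be 1-based, the reference allele is assumed to be non-empty, and the `chrM`/`MT` aliasing is an extra assumption about how the mitochondrial contig may be named.

```python
def resolve_contig(chrom, contigs):
    """Return the name under which `chrom` appears in `contigs`, or None."""
    candidates = [chrom]
    bare = chrom[3:] if chrom.startswith("chr") else chrom
    candidates += [bare, f"chr{bare}"]
    # The mitochondrial contig is commonly named either 'chrM' or 'MT'.
    if bare in ("M", "MT"):
        candidates += ["chrM", "MT", "chrMT", "M"]
    for name in candidates:
        if name in contigs:
            return name
    return None


def reference_matches(reference, chrom, pos, ref_allele):
    """Compare `ref_allele` with the reference bases starting at 1-based `pos`."""
    contig = resolve_contig(chrom, reference)
    if contig is None:
        return False, "Chromosome not found"
    start = pos - 1  # convert the 1-based position to a 0-based offset
    end = start + len(ref_allele)
    if start < 0 or end > len(reference[contig]):
        return False, "Position out of bounds"
    observed = reference[contig][start:end]
    if observed.upper() == ref_allele.upper():
        return True, "Valid"
    return False, "Invalid reference allele"
```

Fetching `len(ref_allele)` bases means a multi-base reference allele, such as the one in a deletion record, is compared in full instead of against a single nucleotide.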

In [None]:
# Assuming you have loaded the GRCh38 reference using pyfaidx as suggested in the loading step:
# from pyfaidx import Fasta
# grch38_ref = Fasta('/path/to/GRCh38.fasta') # Replace with the actual path or object from GCS streaming/download

# Assuming you have parsed the polymorphism data into a list of dictionaries:
# polymorphisms = [
#     {'chromosome': 'chr1', 'position': 100, 'reference_allele': 'A', 'observed_allele': 'T'},
#     {'chromosome': '1', 'position': 150, 'reference_allele': 'G', 'observed_allele': 'C'},
#     {'chromosome': 'chrM', 'position': 50, 'reference_allele': 'C', 'observed_allele': 'T'},
#     # Add more polymorphism entries
# ]

import logging
import csv
import sys

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def log_error(message, polymorphism=None, e=None):
    """Logs an error with optional polymorphism details and exception."""
    details = ""
    if polymorphism:
        details += f"Polymorphism: Chromosome={polymorphism.get('chromosome')}, Position={polymorphism.get('position')}, Ref={polymorphism.get('reference_allele')}, Obs={polymorphism.get('observed_allele')}. "
    if e:
        details += f"Error: {e}"
    logging.error(f"{message} {details}")

def handle_validation_error(polymorphism, error_type, e=None):
    """Handles and logs specific validation errors."""
    status_message = ""
    if error_type == "Chromosome not found":
        status_message = f"Chromosome '{polymorphism.get('chromosome')}' not found in reference"
        log_error(status_message, polymorphism)
    elif error_type == "Position out of bounds":
        status_message = f"Position {polymorphism.get('position')} is out of bounds for chromosome {polymorphism.get('chromosome')}"
        log_error(status_message, polymorphism)
    elif error_type == "Invalid reference allele":
        expected = e.args[0] if e and e.args else "N/A"
        found = polymorphism.get('reference_allele')
        status_message = f"Invalid reference allele: Expected '{expected}', found '{found}' at {polymorphism.get('chromosome')}:{polymorphism.get('position')}"
        log_error(status_message, polymorphism)
    elif error_type == "General validation error":
        status_message = "An error occurred during validation"
        log_error(status_message, polymorphism, e)
    else:
        status_message = "Unknown validation error"
        log_error(status_message, polymorphism, e)

    return False, status_message


def validate_polymorphism(polymorphism, grch38_ref):
    """
    Validates a single polymorphism entry against the GRCh38 reference.

    Args:
        polymorphism (dict): A dictionary representing a single polymorphism
                             with keys 'chromosome', 'position', 'reference_allele',
                             and 'observed_allele'.
        grch38_ref (pyfaidx.Fasta): The loaded GRCh38 reference genome object.

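    Returns:
        tuple[bool, str]: `(is_valid, status)`. `is_valid` is True if the
            polymorphism's reference allele matches the GRCh38 reference and
            False otherwise. `status` is "Valid" on success, or a message
            describing the failure (invalid reference allele, chromosome not
            found, position out of bounds, or a general validation error).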
    """
    chrom = polymorphism.get('chromosome')
    pos = polymorphism.get('position')
    ref_allele_polymorphism = polymorphism.get('reference_allele')

    if not chrom or pos is None or not ref_allele_polymorphism:
        return handle_validation_error(polymorphism, "General validation error", e=ValueError("Missing required fields"))


    # Handle potential chromosome naming discrepancies (e.g., 'chr1' vs '1')
    # You might need a mapping or try both 'chrX' and 'X'
    if chrom not in grch38_ref:
        if f'chr{chrom}' in grch38_ref:
            chrom = f'chr{chrom}'
        elif chrom.startswith('chr') and chrom[3:] in grch38_ref:
            chrom = chrom[3:]
        else:
            return handle_validation_error(polymorphism, "Chromosome not found")


    try:
        # Positions in the input are assumed to be 1-based (as in VCF), while
        # pyfaidx records use 0-based, end-exclusive Python indexing, so the
        # slice starts at pos - 1. If your positions are already 0-based, use pos.
        start = pos - 1
        end = start + len(ref_allele_polymorphism)
        # Check bounds explicitly: a negative index would wrap around to the end
        # of the chromosome, and pyfaidx may not raise for an out-of-range slice.
        if start < 0 or end > len(grch38_ref[chrom]):
            raise IndexError(f"{chrom}:{pos} is outside the reference sequence")
        # Fetch as many bases as the stated reference allele covers. This handles
        # SNPs and multi-base reference alleles (e.g., deletions); other indel
        # representations may need additional normalization.
        ref_allele_grch38 = grch38_ref[chrom][start:end].seq

        # Compare the reference alleles (case-insensitive comparison is often safer)
        if ref_allele_polymorphism.upper() == ref_allele_grch38.upper():
            return True, "Valid"
        else:
            return handle_validation_error(polymorphism, "Invalid reference allele", e=ValueError(ref_allele_grch38))

    except IndexError:
        return handle_validation_error(polymorphism, "Position out of bounds")
    except Exception as e:
        return handle_validation_error(polymorphism, "General validation error", e)

def process_polymorphism_file(input_filepath, output_filepath, grch38_ref):
    """
    Reads polymorphism data from a CSV file, validates each entry, and writes
    the results to a new CSV file.

    Args:
        input_filepath (str): The path to the input CSV file containing the full SNP set.
        output_filepath (str): The path where the output CSV file with validation results will be saved.
        grch38_ref (pyfaidx.Fasta): The loaded GRCh38 reference genome object.
                                   This object should be created by loading the full GRCh38 fasta file.
    """
    valid_polymorphisms = []
    invalid_polymorphisms = []

    with open(input_filepath, 'r', newline='') as infile, \
         open(output_filepath, 'w', newline='') as outfile:

        reader = csv.DictReader(infile)
        fieldnames = reader.fieldnames + ['validation_status', 'is_valid']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()

        for row in reader:
            # Assuming your CSV has columns named 'chromosome', 'position', 'reference_allele', 'observed_allele'
            # Adjust these names if your CSV uses different headers
            try:
                position = int(row['position']) if row.get('position') else None
            except ValueError:
                # A non-numeric position is reported as a validation error below
                # instead of aborting the whole file.
                position = None
            polymorphism = {
                'chromosome': row.get('chromosome'),
                'position': position,
                'reference_allele': row.get('reference_allele'),
                'observed_allele': row.get('observed_allele')
            }

            is_valid, status = validate_polymorphism(polymorphism, grch38_ref)

            row['validation_status'] = status
            row['is_valid'] = is_valid
            writer.writerow(row)

            if is_valid:
                valid_polymorphisms.append(row)
            else:
                invalid_polymorphisms.append(row)

    logging.info(f"Processing complete. Valid entries: {len(valid_polymorphisms)}, Invalid entries: {len(invalid_polymorphisms)}")
    logging.info(f"Results written to {output_filepath}")

# Example usage (replace with actual file paths and loaded GRCh38 reference):
# from pyfaidx import Fasta
# grch38_ref_path = '/path/to/your/GRCh38.fasta' # <<< Specify the path to your full GRCh38 fasta file
# grch38_ref = Fasta(grch38_ref_path)

# input_csv_path = '/path/to/your/full_snp_set.csv' # <<< Specify the path to your input CSV file with the full SNP set
# output_csv_path = '/path/to/your/validation_results.csv' # <<< Specify the desired path for the output results file

# process_polymorphism_file(input_csv_path, output_csv_path, grch38_ref)

In [None]:
%%writefile Dockerfile.frontend
# Stage 1: Build the frontend (if needed - e.g., for building static assets with Node.js)
# Example using Node.js to build a React/Angular app:
# FROM node:16 as builder
# WORKDIR /app
# COPY package*.json ./
# RUN npm install
# COPY . .
# RUN npm run build # Command to build your static assets

# Stage 2: Serve the static files with Nginx
FROM nginx:latest

# Copy the built frontend files from the builder stage (adjust path if using a builder stage)
# If no builder stage is needed (just serving static files), copy directly from the source
COPY ./frontend /usr/share/nginx/html

# Expose port 80, the default port for Nginx
EXPOSE 80

# The default command of the Nginx image starts the server, so no explicit CMD is needed

## Build and push container images

### Subtask:
Build the Docker images for both services and push them to Google Cloud Container Registry or Artifact Registry.


**Reasoning**:
Authenticate Docker with Google Cloud and then build and tag the Docker images for both the processor service and the frontend application based on the instructions provided.
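
Artifact Registry image names follow the pattern `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`, which is what the f-strings in the next cell spell out by hand. As a small sketch, a helper can build the name and reject empty parts; the function name `artifact_registry_image` is hypothetical.

```python
def artifact_registry_image(location, project_id, repository, image, tag="latest"):
    """Build an Artifact Registry Docker image name.

    Format: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG
    """
    parts = {"location": location, "project_id": project_id,
             "repository": repository, "image": image, "tag": tag}
    missing = [name for name, value in parts.items() if not value]
    if missing:
        raise ValueError(f"Missing image name parts: {', '.join(missing)}")
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"
```

Failing early on an empty project ID or repository name avoids a confusing push error later in the cell.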



In [None]:
!gcloud auth configure-docker us-central1-docker.pkg.dev

# Replace 'your-gcp-project-id' and 'your-repo' with your actual GCP project ID and Artifact Registry repository name
PROJECT_ID = 'your-gcp-project-id'
REPO_NAME = 'your-repo'
PROCESSOR_IMAGE = f'us-central1-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/polymorphism-processor-service:latest'
FRONTEND_IMAGE = f'us-central1-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/webapp-frontend:latest'

# Build and tag the processor service image
!docker build -t {PROCESSOR_IMAGE} -f Dockerfile .

# Build and tag the frontend application image
# Assuming your frontend files are in a directory named 'frontend' relative to the Dockerfile.frontend
# If your frontend files are in the same directory as Dockerfile.frontend, remove the './frontend' part in the COPY instruction in Dockerfile.frontend
!docker build -t {FRONTEND_IMAGE} -f Dockerfile.frontend .

# Push both images to Artifact Registry
!docker push {PROCESSOR_IMAGE}
!docker push {FRONTEND_IMAGE}

# Task
Refactor the docker build commands into a single cell.

**Reasoning**:
Configure the Google Cloud project by setting the project ID and enabling the necessary APIs.



## Set up Google Cloud project and services

### Subtask:
Configure your Google Cloud project, including enabling necessary APIs (Cloud Run, Cloud Storage, Pub/Sub), and setting up service accounts with appropriate permissions.


**Reasoning**:
Authenticate gcloud and then set the project and enable the services as requested by the instructions.



## Containerize the processor service

### Subtask:
Create a Dockerfile for the `polymorphism-processor-service` to package it as a container image.


**Reasoning**:
Create the Dockerfile for the polymorphism-processor-service with the specified steps.
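
The Dockerfile in this section launches `main.py` and exposes port 8080, and Cloud Run tells the container which port to listen on through the `PORT` environment variable. As a minimal sketch of what `main.py` could contain, using only the standard library, the code below reads that variable and answers GET requests with a plain-text response. The names `resolve_port`, `HealthHandler`, and `build_server` are hypothetical, and the actual service would add its processing endpoint.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def resolve_port(environ=os.environ, default=8080):
    """Read the listening port from PORT, falling back to 8080."""
    return int(environ.get("PORT", default))


class HealthHandler(BaseHTTPRequestHandler):
    """Answer every GET with a plain-text 200 so the container can be probed."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def build_server(port=None):
    """Create (but do not start) an HTTP server bound to all interfaces."""
    port = resolve_port() if port is None else port
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

In `main.py`, calling `build_server().serve_forever()` would start the server. Binding to `0.0.0.0` matters inside a container, because a server bound only to `127.0.0.1` is unreachable from outside it.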



In [None]:
%%writefile Dockerfile
# Stage 1: Build the application
FROM python:3.9 AS builder

# Set the working directory
WORKDIR /app

# Copy requirements file and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code
COPY . .

# Stage 2: Create the final lightweight image
FROM python:3.9-slim

# Set the working directory
WORKDIR /app

# Copy only the installed packages from the builder stage
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages

# Copy the application code from the builder stage
COPY --from=builder /app /app

# Make port 8080 available to the world outside this container
EXPOSE 8080

# Define environment variable
ENV NAME=World

# Run main.py when the container launches
CMD ["python", "main.py"]

# Task
Refactor the Dockerfile and Dockerfile.frontend to use multi-stage builds.

## Set up Google Cloud project and services

### Subtask:
Configure your Google Cloud project, including enabling necessary APIs (Cloud Run, Cloud Storage, Pub/Sub), and setting up service accounts with appropriate permissions.


**Reasoning**:
Authenticate gcloud with the provided authorization code and then set the project and enable the services as requested by the instructions.

