<a href="https://colab.research.google.com/github/mclaughlinfernandeez/snp/blob/main/Csv2FASTBED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Web Application Frontend: File Uploads

### Subtask:
Outline the design and functionality of the web application frontend specifically for handling user file uploads.

**Reasoning**:
Describe the high-level structure of the `polymorphism-processor-service` and where the GRCh38 checking logic fits in. This involves explaining:
- The entry point of the service (e.g., an HTTP endpoint or a Pub/Sub subscriber).
- How the service receives the input polymorphism data (e.g., file path in GCS).
- The sequence of operations: loading reference, parsing input, performing validation, handling results, generating output formats, and uploading to the output bucket.
- How errors are handled and reported throughout the process.
- The key functions or modules that would encapsulate the logic for each step.
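
The sequence of operations listed above can be sketched as a single orchestration function. This is a minimal illustration, not the service's actual code: `PipelineResult` and `process_polymorphism_file` are hypothetical names, and the reading, parsing, validation, and upload steps are passed in as callables so that the flow is the same whether the trigger is an HTTP request or a Pub/Sub message. The `validate` callable is assumed to already hold the loaded reference.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class PipelineResult:
    """Outcome of one processing run (hypothetical structure)."""
    valid: list = field(default_factory=list)
    invalid: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def process_polymorphism_file(
    input_path: str,
    read_text: Callable[[str], str],
    parse: Callable[[str], Iterable[dict]],
    validate: Callable[[dict], tuple],
    write_output: Callable[[str, list], None],
) -> PipelineResult:
    """Read, parse, validate, and write one input file, collecting errors."""
    result = PipelineResult()

    # Steps 1-2: fetch the input (e.g., from a GCS path) and parse it.
    try:
        entries = list(parse(read_text(input_path)))
    except Exception as exc:
        result.errors.append(f"Could not read or parse {input_path}: {exc}")
        return result

    # Step 3: validate every entry and keep its status alongside the data.
    for entry in entries:
        is_valid, status = validate(entry)
        record = {**entry, "is_valid": is_valid, "validation_status": status}
        (result.valid if is_valid else result.invalid).append(record)

    # Step 4: hand the valid entries to the output stage (format + upload).
    try:
        write_output(input_path, result.valid)
    except Exception as exc:
        result.errors.append(f"Could not write output for {input_path}: {exc}")
    return result
```

Injecting the steps keeps the orchestration testable without GCS access; an HTTP handler or Pub/Sub subscriber would only need to extract the file path and call this function.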

## Integrate into the Processor Service

### Subtask:
Outline how this checking logic will be integrated into the overall `polymorphism-processor-service` code.

## Handle Validation Results

### Subtask:
Describe how the service will handle valid and invalid entries, potentially generating a report or filtering the data.
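
One possible way to handle the results is to filter out invalid entries and keep a small report of why each one failed. The sketch below assumes each entry already carries the `is_valid` and `validation_status` keys added in the example usage of `validate_polymorphism`; the function name `split_and_report` and the report layout are illustrative assumptions.

```python
import json


def split_and_report(validated):
    """Return (valid_entries, report) for a list of validated polymorphisms."""
    valid = [p for p in validated if p.get("is_valid")]
    invalid = [p for p in validated if not p.get("is_valid")]
    report = {
        "total": len(validated),
        "valid": len(valid),
        "invalid": len(invalid),
        # One line per rejected entry, so the uploader can see what to fix.
        "failures": [
            {
                "chromosome": p.get("chromosome"),
                "position": p.get("position"),
                "reason": p.get("validation_status"),
            }
            for p in invalid
        ],
    }
    return valid, report


if __name__ == "__main__":
    demo = [
        {"chromosome": "chr1", "position": 100, "is_valid": True,
         "validation_status": "Valid"},
        {"chromosome": "chrZ", "position": 5, "is_valid": False,
         "validation_status": "Chromosome 'chrZ' not found in reference"},
    ]
    kept, summary = split_and_report(demo)
    print(json.dumps(summary, indent=2))
```

The report is plain JSON-serializable data, so it could be uploaded next to the output files or returned in an API response.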

**Reasoning**:
Explain the specific steps and logic for validating each polymorphism entry against the GRCh38 reference. This includes:
- How to access the relevant part of the reference genome based on the chromosome and position from the polymorphism data.
- How to compare the reference allele in the polymorphism data with the actual nucleotide(s) at the specified position in the GRCh38 reference.
- How to handle different types of polymorphisms (SNPs, indels) if applicable.
- How to account for potential discrepancies in chromosome naming (e.g., 'chr1' vs '1').
- How to use the loaded GRCh38 reference data structure (e.g., `pyfaidx` object) for efficient lookups.
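
The chromosome naming point can be illustrated with a small lookup helper. This is a sketch under assumptions: `resolve_chromosome` is a hypothetical name, and `reference_names` stands for the sequence names in the loaded reference (with `pyfaidx`, `grch38_ref.keys()` would supply them). Besides the `chr` prefix, it also covers the mitochondrial naming difference between UCSC-style (`chrM`) and Ensembl-style (`MT`) references.

```python
def resolve_chromosome(name, reference_names):
    """Return the reference's spelling of `name`, or None if no alias matches."""
    names = set(reference_names)
    # Strip an optional 'chr' prefix to get the bare chromosome label.
    bare = name[3:] if name.lower().startswith("chr") else name
    candidates = [name, bare, f"chr{bare}"]
    # The mitochondrial genome is 'chrM' in UCSC naming and 'MT' in Ensembl naming.
    if bare.upper() in {"M", "MT"}:
        candidates += ["MT", "chrM", "M", "chrMT"]
    for candidate in candidates:
        if candidate in names:
            return candidate
    return None
```

Returning `None` instead of raising lets the caller map a failed lookup onto the "Chromosome not found" status.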

In [None]:
# Assuming you have loaded the GRCh38 reference using pyfaidx as suggested in the loading step:
# from pyfaidx import Fasta
# grch38_ref = Fasta('/path/to/GRCh38.fasta') # Replace with the actual path or object from GCS streaming/download

# Assuming you have parsed the polymorphism data into a list of dictionaries:
# polymorphisms = [
#     {'chromosome': 'chr1', 'position': 100, 'reference_allele': 'A', 'observed_allele': 'T'},
#     {'chromosome': '1', 'position': 150, 'reference_allele': 'G', 'observed_allele': 'C'},
#     {'chromosome': 'chrM', 'position': 50, 'reference_allele': 'C', 'observed_allele': 'T'},
#     # Add more polymorphism entries
# ]

import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def log_error(message, polymorphism=None, e=None):
    """Logs an error with optional polymorphism details and exception."""
    details = ""
    if polymorphism:
        details += f"Polymorphism: Chromosome={polymorphism.get('chromosome')}, Position={polymorphism.get('position')}, Ref={polymorphism.get('reference_allele')}, Obs={polymorphism.get('observed_allele')}. "
    if e:
        details += f"Error: {e}"
    logging.error(f"{message} {details}")

def handle_validation_error(polymorphism, error_type, e=None):
    """Handles and logs specific validation errors."""
    status_message = ""
    if error_type == "Chromosome not found":
        status_message = f"Chromosome '{polymorphism.get('chromosome')}' not found in reference"
        log_error(status_message, polymorphism)
    elif error_type == "Position out of bounds":
        status_message = f"Position {polymorphism.get('position')} is out of bounds for chromosome {polymorphism.get('chromosome')}"
        log_error(status_message, polymorphism)
    elif error_type == "Invalid reference allele":
        expected = e.args[0] if e and e.args else "N/A"
        found = polymorphism.get('reference_allele')
        status_message = f"Invalid reference allele: Expected '{expected}', found '{found}' at {polymorphism.get('chromosome')}:{polymorphism.get('position')}"
        log_error(status_message, polymorphism)
    elif error_type == "General validation error":
        status_message = "An error occurred during validation"
        log_error(status_message, polymorphism, e)
    else:
        status_message = "Unknown validation error"
        log_error(status_message, polymorphism, e)

    return False, status_message


def validate_polymorphism(polymorphism, grch38_ref):
    """
    Validates a single polymorphism entry against the GRCh38 reference.

    Args:
        polymorphism (dict): A dictionary representing a single polymorphism
                             with keys 'chromosome', 'position', 'reference_allele',
                             and 'observed_allele'.
        grch38_ref (pyfaidx.Fasta): The loaded GRCh38 reference genome object.

    Returns:
        bool: True if the polymorphism's reference allele matches the GRCh38 reference,
              False otherwise.
        str: "Valid" on success; otherwise a message describing the failure (invalid reference allele, chromosome not found, position out of bounds, or a general validation error).
    """
    chrom = polymorphism.get('chromosome')
    pos = polymorphism.get('position')
    ref_allele_polymorphism = polymorphism.get('reference_allele')

    if not chrom or pos is None or not ref_allele_polymorphism:
        return handle_validation_error(polymorphism, "General validation error", e=ValueError("Missing required fields"))


    # Handle potential chromosome naming discrepancies (e.g., 'chr1' vs '1')
    # You might need a mapping or try both 'chrX' and 'X'
    if chrom not in grch38_ref:
        if f'chr{chrom}' in grch38_ref:
            chrom = f'chr{chrom}'
        elif chrom.startswith('chr') and chrom[3:] in grch38_ref:
            chrom = chrom[3:]
        else:
            return handle_validation_error(polymorphism, "Chromosome not found")


    try:
        # Positions are assumed to be 1-based, as in VCF. pyfaidx records use
        # Python-style 0-based, end-exclusive slicing, so the 1-based position
        # `pos` corresponds to index `pos - 1`.
        # If your polymorphism positions are 0-based, use `pos` directly as the start index.
        start = pos - 1
        end = start + len(ref_allele_polymorphism)

        # Check bounds explicitly: a negative index would wrap around to the end of the
        # chromosome, and pyfaidx may return a truncated sequence instead of raising.
        if start < 0 or end > len(grch38_ref[chrom]):
            return handle_validation_error(polymorphism, "Position out of bounds")

        # Fetch as many bases as the stated reference allele covers: one base for a SNP,
        # several for a deletion or multi-nucleotide variant.
        # Indels may still need extra handling (e.g., anchor bases or normalization).
        ref_allele_grch38 = grch38_ref[chrom][start:end].seq

        # Compare the reference alleles (case-insensitive, since FASTA files often soft-mask repeats in lowercase)
        if ref_allele_polymorphism.upper() == ref_allele_grch38.upper():
            return True, "Valid"
        else:
            return handle_validation_error(polymorphism, "Invalid reference allele", e=ValueError(ref_allele_grch38))

    except IndexError:
        return handle_validation_error(polymorphism, "Position out of bounds")
    except Exception as e:
        return handle_validation_error(polymorphism, "General validation error", e)

# Example usage:
# validated_polymorphisms = []
# for poly in polymorphisms:
#     is_valid, status = validate_polymorphism(poly, grch38_ref)
#     poly['validation_status'] = status
#     poly['is_valid'] = is_valid
#     validated_polymorphisms.append(poly)

# Now validated_polymorphisms contains the original data with added validation status.
# You would then process this list to handle valid/invalid entries.

## Perform GRCh38 Checks

### Subtask:
Detail the process of comparing the polymorphism data against the loaded GRCh38 reference to validate positions and alleles.

# Task
Implement the GRCh38 checks within the `polymorphism-processor-service` for the Google Cloud application described, ensuring the service accesses the GRCh38 reference data from the "grch38-ref-bucket", parses the input TXT polymorphism data, performs validation against the reference, handles the results, and integrates this logic into the service.

## Load GRCh38 reference data

### Subtask:
Describe how the processor service will access and load the GRCh38 reference genome data from the `grch38-ref-bucket`.


**Reasoning**:
Describe the mechanisms for accessing, locating, and loading the GRCh38 reference genome data from the specified Google Cloud Storage bucket. This involves explaining authentication, file identification, and data retrieval methods.



In [None]:
# 1. Authentication and Authorization
# The polymorphism-processor-service will typically use a Service Account associated with the Google Cloud project.
# This Service Account should have the necessary permissions (e.g., Storage Object Viewer) to access objects in the 'grch38-ref-bucket'.
# Authentication is handled automatically when running within Google Cloud environments (like GKE, Cloud Run, or GCE) if the service is configured to use the default service account or a specifically attached service account.
# For local development or testing, credentials can be provided through the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to a service account key file.

# 2. Identifying and Locating the Reference File(s)
# The service will need to know the name or path of the GRCh38 reference file(s) within the bucket.
# This could be configured in the service's environment variables, a configuration file, or passed as a parameter.
# For example, if the file is named 'GRCh38.fasta', the service would use this name to access the specific object in the bucket.
# The service could potentially list objects in the bucket or a specific prefix if the exact file name isn't known beforehand, but knowing the file name is more efficient.

# 3. Downloading or Streaming the Reference Data
# The service will use the Google Cloud Storage client library for Python (google-cloud-storage) to interact with the bucket.
# Data can be downloaded in chunks or streamed to avoid loading the entire potentially large file into memory at once if not necessary.
# For validation lookup, streaming or downloading the relevant parts of the file (e.g., specific chromosomes or regions) might be more efficient than downloading the whole genome.
# Example using the client library:
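# from google.cloud import storage
# client = storage.Client()
# # client.bucket() makes no API call; get_bucket() additionally needs the storage.buckets.get permission,
# # which the Storage Object Viewer role does not grant.
# bucket = client.bucket('grch38-ref-bucket')
# blob = bucket.blob('GRCh38.fasta')
# # To download to a local file (needed for indexed random access):
# # blob.download_to_filename('/tmp/GRCh38.fasta')
# # To download into memory as text (only sensible for small files; download_as_string() is deprecated):
# # content = blob.download_as_text()
# # To stream line by line:
# # with blob.open("r") as f:
# #     for line in f:
# #         pass  # Process line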
# 4. Structuring/Storing Data in Memory for Efficient Lookup
# The structure in memory depends on the validation process.
# If validation requires random access to specific genomic positions, an indexed format like a FASTA index (.fai) should be used alongside the FASTA file.
# The service can load the index into memory to quickly find the byte offset of a sequence.
# For the sequence data itself, it might be loaded into memory partially (e.g., one chromosome at a time) or accessed by streaming based on the index lookup.
# Libraries like `pyfaidx` in Python can be used to efficiently access sequences from indexed FASTA files without loading the entire file into memory.
# Alternatively, if the reference data is pre-processed into a database or a more query-optimized format (like a specialized genomic database), the service could connect to that instead of directly accessing the FASTA file. For the scope of this task focusing on the GCS bucket, we assume direct access to the FASTA and its index.
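
The download-then-index approach described above could look roughly like the following. It is a sketch under several assumptions: the object name `GRCh38.fasta`, the presence of a matching `GRCh38.fasta.fai` object in the bucket, and the local directory `/tmp/grch38` are all illustrative, and `download_reference` and `open_reference` are hypothetical helper names. The bucket is passed in as an object so the download step can be exercised without cloud access.

```python
import os


def download_reference(bucket, fasta_name, dest_dir):
    """Copy a FASTA file and its .fai index from a bucket to local disk.

    `bucket` is anything with the google-cloud-storage Bucket interface
    (`bucket.blob(name).download_to_filename(path)`). Files that already
    exist locally are not downloaded again.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_paths = []
    # FASTA first, index second (assumes the .fai is stored next to the FASTA).
    for name in (fasta_name, fasta_name + ".fai"):
        local_path = os.path.join(dest_dir, os.path.basename(name))
        if not os.path.exists(local_path):
            bucket.blob(name).download_to_filename(local_path)
        local_paths.append(local_path)
    return local_paths[0]


def open_reference(bucket_name="grch38-ref-bucket",
                   fasta_name="GRCh38.fasta",
                   dest_dir="/tmp/grch38"):
    """Download the reference once and open it for indexed lookups."""
    # Imported here so download_reference stays usable without these packages.
    from google.cloud import storage
    from pyfaidx import Fasta

    bucket = storage.Client().bucket(bucket_name)
    return Fasta(download_reference(bucket, fasta_name, dest_dir))
```

Calling `open_reference()` once at service start-up, rather than per request, avoids repeated downloads. On Cloud Run, files written to the local filesystem count against the instance's memory, so a full uncompressed genome needs a correspondingly large memory limit or a different storage approach.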

## Parse input polymorphism data

### Subtask:
Explain how the service will read and parse the uploaded TXT polymorphism data.


**Reasoning**:
Describe how the service will read and parse the uploaded TXT polymorphism data, including how it receives the data, reads the content, parses different potential formats and expected columns, handles errors during parsing, and the in-memory data structure for the parsed data.



In [None]:
# 1. Receiving the Uploaded TXT File
# The polymorphism-processor-service will likely receive the uploaded TXT file as part of an API request.
# This could be a file upload via an HTTP POST request if the service is exposed via a web framework (e.g., Flask, FastAPI).
# Alternatively, if the service is triggered by an event (e.g., a file being uploaded to a different GCS bucket), the file path in GCS could be passed as a message in a Pub/Sub topic, and the service would then read the file from that GCS location using the google-cloud-storage library, similar to how the reference data is accessed.
# For this explanation, let's assume the file content is received directly as part of an API request or read from a GCS path provided in a message.

# 2. Reading the Content of the TXT File
# If the file content is received in an API request, it will be available in the request body or as a file object provided by the web framework.
# If reading from GCS, the google-cloud-storage client library will be used to download the file content as a string or stream it.
# The content will be read as plain text. It's important to handle potential character encodings (e.g., UTF-8).

# 3. Parsing the Read Data
# The service will need to parse the text content line by line.
# Each line is expected to represent a single polymorphism entry.
# The format of the line needs to be known or inferred. Common formats include tab-separated values (TSV) or comma-separated values (CSV).
# The service will split each line based on the delimiter (tab or comma).
# The expected columns and their order must be defined. A typical format might include:
# - Chromosome (e.g., 'chr1', '1')
# - Position (an integer)
# - Reference Allele (a nucleotide or sequence)
# - Observed Allele (a nucleotide or sequence)
# The service will need to convert the data in each column to the appropriate data type (e.g., position to integer).

# Example parsing logic (assuming tab-separated):
# import io
# def parse_polymorphism_data(text_content):
#     polymorphisms = []
#     # Use io.StringIO to treat the string content as a file
#     data_file = io.StringIO(text_content)
#     for line in data_file:
#         line = line.strip()
#         if not line or line.startswith('#'): # Skip empty lines or comments
#             continue
#         fields = line.split('\t') # Assuming tab-separated
#         if len(fields) >= 4: # Ensure minimum expected columns are present
#             try:
#                 chromosome = fields[0]
#                 position = int(fields[1])
#                 reference_allele = fields[2]
#                 observed_allele = fields[3]
#                 polymorphisms.append({
#                     'chromosome': chromosome,
#                     'position': position,
#                     'reference_allele': reference_allele,
#                     'observed_allele': observed_allele
#                 })
#             except ValueError as e:
#                 # Handle error: position is not an integer
#                 print(f"Error parsing line: {line} - {e}")
#                 # Depending on requirements, log the error, skip the line, or return an error
#             # No IndexError handler is needed: the length check above guarantees four fields
#         else:
#             # Handle error: fewer than four fields
#             print(f"Skipping malformed line: {line}")
#     return polymorphisms


# 4. Handling Potential Errors During Parsing
# Errors can occur due to:
# - Incorrect delimiter (e.g., comma instead of tab)
# - Incorrect number of columns in a line
# - Data in a column not being of the expected type (e.g., non-integer position)
# - Invalid characters or encoding issues
# The service should implement error handling for these cases.
# - For minor errors (e.g., a few malformed lines), the service might log the error and skip the problematic line, continuing with the valid data.
# - For major errors (e.g., the entire file has the wrong format), the service might reject the file and return an error response to the user or triggering system.
# - Specific exceptions (like ValueError, IndexError) should be caught during parsing.
# - Providing informative error messages about the line number and type of error will be helpful for debugging.

# 5. In-Memory Data Structure for Parsed Data
# The parsed polymorphism data can be stored in a list of dictionaries or objects.
# A list of dictionaries is a simple and flexible structure in Python. Each dictionary would represent a polymorphism entry with keys for 'chromosome', 'position', 'reference_allele', and 'observed_allele'.
# Example:
# [
#     {'chromosome': 'chr1', 'position': 100, 'reference_allele': 'A', 'observed_allele': 'T'},
#     {'chromosome': 'chr1', 'position': 150, 'reference_allele': 'G', 'observed_allele': 'C'},
#     ...
# ]
# For very large files, consider using a more memory-efficient structure or processing data in chunks if feasible for the validation logic. However, a list of dictionaries is a good starting point for moderate file sizes and ease of use. Pandas DataFrame could also be used for more complex data manipulation and cleaning, but might be overkill for simple validation lookup.
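
The error-handling points above (line numbers in messages, skipping malformed lines, tolerating tab or comma delimiters) can be combined in one parser. This is a sketch, not the service's confirmed behavior: `parse_polymorphism_text` is a hypothetical name, the four-column layout is the one assumed earlier in this section, and the delimiter is guessed per line.

```python
def parse_polymorphism_text(text):
    """Parse polymorphism lines into dicts, collecting errors by line number.

    Returns (entries, errors), where errors is a list of (line_number, message).
    """
    entries, errors = [], []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        # Guess the delimiter: prefer tabs, fall back to commas.
        delimiter = "\t" if "\t" in line else ","
        fields = [f.strip() for f in line.split(delimiter)]
        if len(fields) < 4:
            errors.append((line_no, f"expected 4 fields, found {len(fields)}"))
            continue
        chromosome, position, reference_allele, observed_allele = fields[:4]
        try:
            position = int(position)
        except ValueError:
            errors.append((line_no, f"position {position!r} is not an integer"))
            continue
        entries.append({
            "chromosome": chromosome,
            "position": position,
            "reference_allele": reference_allele,
            "observed_allele": observed_allele,
        })
    return entries, errors
```

Returning the errors instead of printing them lets the caller decide between skipping a few bad lines and rejecting the whole file, for example when the number of errors exceeds the number of parsed entries.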

# Task
Set up a Google Cloud project and services for a data processing pipeline with a web interface for user interaction, containerize the processor service and frontend application, build and push container images, deploy Cloud Run services, set up Cloud Storage buckets, configure a Global Load Balancer, set up a GitHub repository, integrate with GitHub for CI/CD, sync code with GitHub, and finally test and monitor the application.

## Set up Google Cloud project and services

### Subtask:
Configure your Google Cloud project, including enabling necessary APIs (Cloud Run, Cloud Storage, Pub/Sub), and setting up service accounts with appropriate permissions.


## Containerize the processor service

### Subtask:
Create a Dockerfile for the `polymorphism-processor-service` to package it as a container image.


**Reasoning**:
Create a Dockerfile for the polymorphism-processor-service with the specified steps.



In [None]:
%%writefile Dockerfile
# Use an official Python runtime as a parent image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
# Assume a requirements.txt file exists with necessary dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Document that the service listens on port 8080 (EXPOSE alone does not publish the port)
# Assuming the service runs on port 8080, adjust if necessary
EXPOSE 8080

# Define environment variables the service needs (NAME is a placeholder)
ENV NAME=World

# Run main.py when the container launches
# Adjust 'main.py' to your service's entry point file name
CMD ["python", "main.py"]

Writing Dockerfile
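
The Dockerfile above expects an entry point named `main.py` that listens on port 8080. A minimal sketch of such a file, using only the standard library, is shown below. The handler and the `make_server` helper are hypothetical, and the real service would more likely use a framework such as Flask or FastAPI, as mentioned earlier. The port is read from the `PORT` environment variable, which Cloud Run sets for the container, with 8080 as the fallback.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with a small JSON status document."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=None):
    """Create a server bound to all interfaces on PORT (default 8080)."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

In `main.py`, the file would end with `make_server().serve_forever()` so that the container keeps running and serving requests.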


## Containerize the frontend application

### Subtask:
Create Dockerfiles for the `webapp-frontend` (HTML, CSS, JS files served by a web server like Nginx or a simple Python server).


**Reasoning**:
Create the Dockerfile for the frontend application using Nginx as the base image and copy the necessary files.



In [None]:
%%writefile Dockerfile.frontend
# Use an official Nginx image as a base
FROM nginx:latest

# Copy the frontend application files into the default Nginx html directory
# Assume your frontend files are in a directory named 'frontend' relative to the Dockerfile
COPY ./frontend /usr/share/nginx/html

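# Expose port 80, the default port for Nginx
# Cloud Run sends requests to port 8080 by default, so pass `--port 80` to `gcloud run deploy` for this image
EXPOSE 80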

# The default command of the Nginx image starts the server, so no explicit CMD is needed

Writing Dockerfile.frontend


## Build and push container images

### Subtask:
Build the Docker images for both services and push them to Google Cloud Container Registry or Artifact Registry.


**Reasoning**:
Authenticate Docker with Google Cloud and then build and tag the Docker images for both the processor service and the frontend application based on the instructions provided.



In [None]:
# --quiet accepts the confirmation prompt, which would otherwise block the notebook cell
!gcloud auth configure-docker us-central1-docker.pkg.dev --quiet

# Replace 'your-gcp-project-id' and 'your-repo' with your actual GCP project ID and Artifact Registry repository name
PROJECT_ID = 'your-gcp-project-id'
REPO_NAME = 'your-repo'
PROCESSOR_IMAGE = f'us-central1-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/polymorphism-processor-service:latest'
FRONTEND_IMAGE = f'us-central1-docker.pkg.dev/{PROJECT_ID}/{REPO_NAME}/webapp-frontend:latest'

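# Build and tag the processor service image
# (requires the Docker CLI on the machine running this cell; the recorded output shows it was not available here)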
!docker build -t {PROCESSOR_IMAGE} -f Dockerfile .

# Build and tag the frontend application image
# Assuming your frontend files are in a directory named 'frontend' relative to the Dockerfile.frontend
# If your frontend files are in the same directory as Dockerfile.frontend, remove the './frontend' part in the COPY instruction in Dockerfile.frontend
!docker build -t {FRONTEND_IMAGE} -f Dockerfile.frontend .

# Push both images to Artifact Registry
!docker push {PROCESSOR_IMAGE}
!docker push {FRONTEND_IMAGE}

`docker` and `docker-credential-gcloud` need to be in the same PATH in order to work correctly together.
gcloud's Docker credential helper can be configured but it will not work until this is corrected.
Adding credentials for: us-central1-docker.pkg.dev
After update, the following will be written to your Docker config file located 
at [/root/.docker/config.json]:
 {
  "credHelpers": {
    "us-central1-docker.pkg.dev": "gcloud"
  }
}

Do you want to continue (Y/n)?  

Command killed by keyboard interrupt


ERROR: gcloud crashed (RuntimeError): reentrant call inside <_io.BufferedWriter name='<stderr>'>

If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics
/bin/bash: line 1: docker: command not found
/bin/bash: line 1: docker: command not found
/bin/bash: line 1: docker: command not found
/bin/bash: line 1: docker: command not found
