### **1. Nextflow Installation**
   
Nextflow is the workflow management system that orchestrates the entire pipeline. We recommend installing it via Conda.

In [None]:
%%bash
# Install Nextflow from the bioconda channel
# (-y skips the confirmation prompt, which a notebook cell cannot answer)
conda install -y -c conda-forge -c bioconda nextflow

# Verify the installation
nextflow -version
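
Before moving on, it can help to confirm that the tools needed later are on the `PATH`. The sketch below is one way to do that. `require_cmd` is a hypothetical helper written for this example, not part of Nextflow or the pipeline; `java` is checked because Nextflow needs a Java runtime.

```shell
# Hypothetical helper: report whether each named command is on the PATH.
# Runs in a subshell so its variables do not leak; exits non-zero if any
# command is missing.
require_cmd() (
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd"
        else
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    exit "$missing"
)

# Example: check the tools used by this step.
require_cmd nextflow java || echo "Install the missing tools before continuing."
```

The same helper can be reused in later steps, for example `require_cmd gcloud git`.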

### **2. Set Up Google Cloud SDK**

The Google Cloud SDK is required for running the pipeline on Google Cloud Platform. This enables secure access to your GCP resources and services.

In [None]:
%%bash
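# Download the Google Cloud SDK installer script
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/install_google_cloud_sdk.bash
chmod +x install_google_cloud_sdk.bash

# Run the installer (it prompts for input, so run it in a terminal if the notebook cell cannot respond)
./install_google_cloud_sdk.bash

# Initialize gcloud; this guides you through initial GCP setup and authentication
# (adjust the path if you installed the SDK in a different directory)
./google-cloud-sdk/bin/gcloud init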

### **3. Authenticate with Google Cloud**

Authenticate with your Google Cloud account and configure your project settings. This step is essential for enabling the pipeline to access your GCP resources.

*Note: Replace YOUR_PROJECT_ID with your actual Google Cloud project identifier.* <br>

*Note: You can also use **igvf-pertub-seq-pipeline** for testing. This project is billed to the IGVF-DACC.*

In [None]:
%%bash
# Log in to your Google account
gcloud auth login

# Set the active project
gcloud config set project YOUR_PROJECT_ID

# Verify your configuration
gcloud config list
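
A common slip at this step is leaving the `YOUR_PROJECT_ID` placeholder in place. The sketch below checks that a value at least looks like a Google Cloud project ID (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen). `valid_project_id` is a hypothetical helper, and the check is purely syntactic: it does not confirm that the project exists or that you have access to it.

```shell
# Hypothetical helper: succeed if "$1" is a well-formed GCP project ID.
# Pattern: one leading letter, 4-28 middle characters, one trailing
# letter or digit, for a total length of 6-30.
valid_project_id() {
    printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

PROJECT_ID="YOUR_PROJECT_ID"   # placeholder; replace with your project ID
if valid_project_id "$PROJECT_ID"; then
    echo "Project ID looks well-formed: $PROJECT_ID"
else
    echo "Project ID '$PROJECT_ID' is not well-formed; did you replace the placeholder?" >&2
fi
```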

### **4. Create Service Account and Configure IAM Permissions**

Automated pipeline execution requires a service account, which gives the pipeline secure, programmatic access to Google Cloud resources. The account needs specific roles to manage compute jobs, logging, and storage.

*Note: Replace YOUR_NAME with the name you want to give the service account.*

In [None]:
%%bash
# Create a new service account for the pipeline
gcloud iam service-accounts create YOUR_NAME

In [None]:
%%bash

# Grant Service Account User role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"

# Grant Batch Jobs Editor role for compute job management
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/batch.jobsEditor"

# Grant Batch Agent Reporter role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/batch.agentReporter"

# Grant Cloud Life Sciences Admin role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/lifesciences.admin"

# Grant Logs Viewer role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/logging.viewer"

# Grant Logs Writer role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/logging.logWriter"

# Grant Storage Admin role
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.admin"
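
The seven bindings above differ only in the role, so they can also be generated from a list. The following is a minimal sketch, not a replacement for the cell above: `sa_email` and `print_role_bindings` are hypothetical helpers, and the function only prints the `gcloud` commands so they can be reviewed before anything is changed.

```shell
# Placeholder values, as in the cells above.
PROJECT_ID="YOUR_PROJECT_ID"
SA_NAME="YOUR_NAME"

# Hypothetical helper: build the service account email.
# $1: service account name, $2: project ID.
sa_email() {
    printf '%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Hypothetical helper: print (do not run) one binding command per role.
# $1: service account name, $2: project ID.
print_role_bindings() (
    for role in iam.serviceAccountUser batch.jobsEditor batch.agentReporter \
                lifesciences.admin logging.viewer logging.logWriter storage.admin; do
        printf 'gcloud projects add-iam-policy-binding %s --member="serviceAccount:%s" --role="roles/%s"\n' \
            "$2" "$(sa_email "$1" "$2")" "$role"
    done
)

print_role_bindings "$SA_NAME" "$PROJECT_ID"
```

Once the printed commands look right, they can be executed by piping the output to a shell, for example `print_role_bindings "$SA_NAME" "$PROJECT_ID" | sh`.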

### **5. Clone the CRISPR Pipeline**

Download the CRISPR pipeline repository to your local environment. It provides all the necessary workflow scripts and configuration files.

In [None]:
%%bash
# Clone the CRISPR pipeline repository
git clone https://github.com/jiangsizhul201/crispr-pipeline

# Navigate to the pipeline directory
# (each %%bash cell starts in the notebook's directory, so repeat this `cd` in later cells that need it)
cd crispr-pipeline

In [None]:
%%bash
# Generate and download the service account key

## Create and download the service account key file
gcloud iam service-accounts keys create ./pipeline-service-key.json \
    --iam-account=YOUR_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com

## Set the environment variable for authentication
## (this applies only within this cell's shell; export it again in the cell that launches Nextflow)
export GOOGLE_APPLICATION_CREDENTIALS=./pipeline-service-key.json

### **6. Run the Pipeline**

Command and parameter explanation:

- `chmod +x bin/*`: Makes all utility scripts in the bin/ directory executable, which is necessary if the pipeline relies on helper scripts.

- `nextflow run main.nf`: Executes the main Nextflow workflow script.

- `-profile google`: Applies the pipeline's `google` configuration profile, which configures execution on Google Cloud (e.g., using a Google executor).

- `--input`: Path to your input sample sheet in TSV format. This file should define the samples and metadata needed for the pipeline run.

- `--outdir`: Destination folder in a Google Cloud Storage bucket where all pipeline outputs will be written.
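
Before launching, a quick check of the values described above can catch common mistakes such as a missing sample sheet or a local path passed to `--outdir`. The sketch below is one possible pre-flight check. `preflight` is a hypothetical helper, not part of the pipeline, and the tab test assumes the sample sheet has more than one column.

```shell
# Hypothetical pre-flight check; not part of the pipeline itself.
# $1: sample sheet path, $2: output location, $3: service account key file.
# Runs in a subshell and exits non-zero if any check fails.
preflight() (
    input="$1"; outdir="$2"; key_file="$3"
    status=0
    # The sample sheet must be readable, and its first line should contain
    # a tab (assumption: the sheet has more than one column).
    if [ ! -r "$input" ]; then
        echo "Sample sheet not readable: $input" >&2; status=1
    elif ! head -n 1 "$input" | grep -q "$(printf '\t')"; then
        echo "No tab in the first line of $input; is it a TSV?" >&2; status=1
    fi
    # --outdir is expected to point into a Cloud Storage bucket.
    case "$outdir" in
        gs://?*) ;;
        *) echo "--outdir should be a gs:// URI, got: $outdir" >&2; status=1 ;;
    esac
    # The key file must exist and be non-empty.
    if [ ! -s "$key_file" ]; then
        echo "Key file missing or empty: $key_file" >&2; status=1
    fi
    exit "$status"
)

preflight example-data/test_samplesheet1.tsv \
    gs://YOUR_PROJECT_ID-data/scratch/ \
    ./pipeline-service-key.json \
    && echo "Pre-flight checks passed." \
    || echo "Fix the issues above before launching."
```

The check only inspects local files and the shape of the output URI; it does not verify that the bucket exists or that the service account can write to it.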

In [None]:
%%bash
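# Each %%bash cell runs in a fresh shell, so set the credentials and working directory here
# (assumes the key file was created in the notebook's directory)
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/pipeline-service-key.json"
cd crispr-pipeline
# Make all utility scripts in the bin/ directory executable
chmod +x bin/*
# Launch the pipeline using the Google Cloud profile
# test_samplesheet1.tsv contains one measurement set
nextflow run main.nf -profile google --input example-data/test_samplesheet1.tsv --outdir gs://YOUR_PROJECT_ID-data/scratch/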

### **Additional Parameters:**
- `-resume`: Resumes a previous run, reusing cached results from tasks that already completed successfully. This is useful when the pipeline was interrupted.
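
To show where `-resume` goes, the sketch below prints the launch command with or without the flag. `run_cmd` is a hypothetical helper that only prints the command; the paths are the example values used earlier in this notebook.

```shell
# Hypothetical helper: print the launch command, adding -resume on request.
# $1: sample sheet, $2: output location, $3: pass "resume" to add -resume.
run_cmd() {
    printf 'nextflow run main.nf -profile google --input %s --outdir %s' "$1" "$2"
    if [ "${3:-}" = "resume" ]; then
        printf ' -resume'
    fi
    printf '\n'
}

run_cmd example-data/test_samplesheet1.tsv gs://YOUR_PROJECT_ID-data/scratch/ resume
```

Note the single dash: options such as `-resume` and `-profile` belong to Nextflow itself, while double-dash options such as `--input` and `--outdir` are passed to the pipeline as parameters.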