# Lab 2: Processing Raw DNA Profiles

In this lab, you'll learn how to process raw DNA profiles from consumer genetic testing companies and prepare them for genetic genealogy analysis. You'll work with simulated data formatted similarly to profiles from popular testing services.

## Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
# Check if running in JupyterLite (browser) or local environment
import sys
IN_BROWSER = 'pyodide' in sys.modules

# Install required packages if running in browser
if IN_BROWSER:
    %pip install -q numpy pandas matplotlib seaborn
    print("Running in JupyterLite browser environment")
else:
    print("Running in standard Jupyter environment")

In [None]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import io
import json
from IPython.display import HTML, display

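# Set plot styles (seaborn's whitegrid theme works across matplotlib versions;
# the bare 'seaborn-whitegrid' style name is no longer available in recent releases)
sns.set_style("whitegrid")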
sns.set_context("notebook", font_scale=1.2)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['savefig.dpi'] = 100

## Load Data from Lab 1

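If you are running in JupyterLite, start by loading the Lab 1 results from the browser's `localStorage`. Each stored value is a JSON object that records its type, so NumPy arrays, DataFrames, and Series can be rebuilt on load. This is how state carries over between labs.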

In [None]:
def load_from_browser_storage(keys, key_prefix='genetic_genealogy_'):
    """Load data from browser localStorage"""
    if not IN_BROWSER:
        print("Not running in a browser environment, skipping loading from storage")
        return {}
    
    try:
        # The js module is only available under Pyodide (json is imported above)
        from js import localStorage
        
        loaded_data = {}
        
        for key in keys:
            storage_key = f"{key_prefix}{key}"
            stored_json = localStorage.getItem(storage_key)
            
            if stored_json is None:
                print(f"Warning: Key '{key}' not found in browser storage")
                continue
                
            # Parse the JSON data
            stored_data = json.loads(stored_json)
            
            # Reconstruct the data based on type
            data_type = stored_data.get('type')
            
            if data_type == 'ndarray':
                # Reconstruct numpy array, restoring the dtype family if one was recorded
                data = np.array(stored_data['data'])
                dtype_name = stored_data.get('dtype', '')
                if dtype_name.startswith('float'):
                    data = data.astype(float)
                elif dtype_name.startswith('int'):
                    data = data.astype(int)
                loaded_data[key] = data
                
            elif data_type == 'dataframe':
                # Reconstruct pandas DataFrame
                df = pd.DataFrame(stored_data['data'])
                columns = stored_data.get('columns') or []
                if columns:
                    df = df[columns]  # Restore the original column order
                loaded_data[key] = df
                
            elif data_type == 'series':
                # Reconstruct pandas Series
                series = pd.Series(stored_data['data'], index=stored_data['index'], name=stored_data['name'])
                loaded_data[key] = series
                
            elif data_type in ('list', 'dict', 'str', 'int', 'float', 'bool'):
                # Directly use JSON-serializable types
                loaded_data[key] = stored_data['data']
                
            else:
                print(f"Unsupported data type: {data_type} for key {key}")
        
        print(f"Successfully loaded {len(loaded_data)} items from browser storage")
        return loaded_data
    
    except Exception as e:
        print(f"Error loading from browser storage: {str(e)}")
        return {}

# Load data from Lab 1 if running in browser
if IN_BROWSER:
    # List of data keys we want to load
    keys_to_load = ['metadata', 'variant_info', 'pca_projections', 'lab_progress']
    
    # Load the data
    loaded_data = load_from_browser_storage(keys_to_load)
    
    # Check if we have the lab progress
    if 'lab_progress' in loaded_data and loaded_data['lab_progress'] == 100:
        display(HTML("""
        <div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px; border: 1px solid #a8d1df;">
            <h3 style="color: #2c7ea1;">Lab 1 Data Loaded Successfully</h3>
            <p>You have successfully completed Lab 1 and loaded your progress.</p>
            <p>Ready to proceed with Lab 2: Processing Raw DNA Profiles.</p>
        </div>
        """))
    else:
        # If lab progress is not 100%, suggest completing Lab 1 first
        display(HTML("""
        <div style="background-color: #fff3cd; padding: 10px; border-radius: 5px; border: 1px solid #ffeeba;">
            <h3 style="color: #856404;">Lab 1 Not Completed</h3>
            <p>It seems you haven't completed Lab 1 yet. For the best learning experience, we recommend:</p>
            <ol>
                <li>Return to the Course Page</li>
                <li>Complete Lab 1: Exploring Genomic Data</li>
                <li>Then proceed to this lab</li>
            </ol>
        </div>
        """))
    
    # Extract the data
    metadata = loaded_data.get('metadata')
    variant_info = loaded_data.get('variant_info')
    pca_projections = loaded_data.get('pca_projections')
    
    # Display the data we loaded
    if metadata is not None:
        print("\nSample metadata from Lab 1:")
        display(metadata.head())
        
    if variant_info is not None:
        print("\nVariant information from Lab 1:")
        display(variant_info.head())
else:
    print("\nNot running in browser environment - in a full environment, you would load the actual data here.")
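
The loader above expects every stored value to be a JSON object with a `type` field and a `data` field, plus extras such as `dtype`, `columns`, `index`, or `name`. As a minimal sketch of what the matching save step could look like, the helper below builds that envelope. The name `to_storage_envelope` is hypothetical, and the field layout is inferred from the loader; Lab 1's actual save code may differ.

```python
import json

import numpy as np
import pandas as pd


def to_storage_envelope(value):
    """Wrap a value in the {'type': ..., 'data': ...} envelope the loader reads.

    Hypothetical helper: the field names mirror what load_from_browser_storage
    looks up, but the real Lab 1 save code may be different.
    """
    if isinstance(value, np.ndarray):
        return {"type": "ndarray", "data": value.tolist(), "dtype": str(value.dtype)}
    if isinstance(value, pd.DataFrame):
        return {
            "type": "dataframe",
            "data": value.to_dict(orient="records"),
            "columns": list(value.columns),
        }
    if isinstance(value, pd.Series):
        return {
            "type": "series",
            "data": value.tolist(),
            "index": value.index.tolist(),
            "name": value.name,
        }
    # Plain JSON-serializable values (list, dict, str, int, float, bool)
    return {"type": type(value).__name__, "data": value}


# Round-trip a small array through JSON text, the form localStorage holds
envelope_json = json.dumps(to_storage_envelope(np.array([0.5, 1.5, 2.5])))
restored = json.loads(envelope_json)
restored_array = np.array(restored["data"]).astype(float)
print(restored["type"], restored["dtype"], restored_array)
```

In the browser, the JSON string would be written with `localStorage.setItem` under the same `genetic_genealogy_` prefix that the loader reads from.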

## Simulated Raw DNA Profile

In a browser environment, we'll work with simulated DNA profile data. In a local environment, you would load an actual raw DNA profile file from a consumer genetic testing company.
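
To make this concrete before the full implementation arrives, here is a minimal sketch of simulating and parsing a raw profile. It assumes a tab-separated layout with `#` comment lines and four columns (`rsid`, `chromosome`, `position`, `genotype`), and it assumes `--` marks a no-call. Both are modeled on common consumer exports rather than taken from this course's data. The function name `parse_raw_profile` and all values are made up for illustration.

```python
import io

import pandas as pd

# Simulated raw profile text. The layout and the "--" no-call marker are
# assumptions; the rsIDs, positions, and genotypes are invented.
SIMULATED_PROFILE = (
    "# Simulated raw data file, not from a real person\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t10001\tAA\n"
    "rs0000002\t1\t20002\tAG\n"
    "rs0000003\t2\t30003\t--\n"
    "rs0000004\tX\t40004\tCT\n"
)


def parse_raw_profile(text):
    """Parse tab-separated raw profile text into a DataFrame."""
    return pd.read_csv(
        io.StringIO(text),
        sep="\t",
        comment="#",  # skip the commented header lines
        names=["rsid", "chromosome", "position", "genotype"],
        # Keep chromosome as text so "X", "Y", and "MT" survive parsing
        dtype={"rsid": str, "chromosome": str, "genotype": str},
    )


profile = parse_raw_profile(SIMULATED_PROFILE)

# Fraction of markers with a called genotype
is_called = profile["genotype"] != "--"
call_rate = float(is_called.mean())
called = profile[is_called].reset_index(drop=True)

print(f"{len(profile)} markers parsed, call rate {call_rate:.2f}")
```

In a local environment, the same function could be fed the contents of a real export file, provided that file follows the assumed layout.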

In [None]:
# This is a placeholder - in Lab 2 we would continue with more code to:
# 1. Load/simulate raw DNA profile formats
# 2. Parse and standardize the different formats
# 3. Convert to VCF format
# 4. Apply quality filters
# 5. Save processed data for Lab 3

print("Lab 2 notebook is under construction. Full implementation coming soon!")

# Display a coming-soon message (HTML and display are imported in Setup)
display(HTML("""
<div style="background-color: #e2eafc; padding: 15px; border-radius: 8px; margin-top: 20px;">
    <h3 style="color: #3f51b5; margin-top: 0;">Coming Soon</h3>
    <p>Lab 2 will provide a complete workflow for:</p>
    <ul>
        <li>Loading raw DNA profiles from consumer testing companies</li>
        <li>Converting between different file formats</li>
        <li>Standardizing genetic data for analysis</li>
        <li>Applying quality control measures</li>
        <li>Preparing data for downstream genetic genealogy analysis</li>
    </ul>
    <p>Check back soon for the complete lab!</p>
</div>
"""))