# LAB 3: Quality Control of Genetic Data

## Browser-Based Version

This notebook is a simplified version of the full Lab 3 notebook, adapted to run in your browser using JupyterLite.

### Key Learning Objectives:

1. Understand the importance of quality control in genetic data analysis
2. Learn standard QC metrics and thresholds for genetic data
3. Explore visualization techniques for QC assessment
4. Understand the phasing process for genetic data

### Browser vs. Local Environment

This browser version contains simplified demonstrations using pre-processed sample data. For the full experience with complete datasets and bioinformatics tools, download the full notebook from the course materials and run it in your local environment.

In [ ]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO

# Display settings
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

print("Libraries imported successfully!")

## Introduction to Quality Control in Genetic Data

Quality control (QC) is a critical step in genetic data analysis. Poor quality data can lead to false conclusions, missed associations, and wasted computational resources. In this lab, we'll explore common QC metrics and procedures for genetic data, focusing on processed VCF files.

In [ ]:
# Create sample quality control data

# Let's create a dataset simulating quality metrics for 20 samples
np.random.seed(42)  # For reproducibility

# Sample IDs
sample_ids = [f'sample_{i+1:02d}' for i in range(20)]

# Create quality metrics
qc_data = {
    'sample_id': sample_ids,
    'call_rate': np.random.uniform(0.85, 1.0, 20),  # Proportion of non-missing genotypes
    'heterozygosity': np.random.normal(0.19, 0.05, 20),  # Heterozygosity rate
    'ti_tv_ratio': np.random.normal(2.0, 0.2, 20),  # Transition/transversion ratio
    'het_hom_ratio': np.random.normal(1.5, 0.3, 20),  # Heterozygous/homozygous ratio
    'mendelian_errors': np.random.poisson(5, 20),  # Simulated Mendelian errors
    'sample_contamination': np.random.uniform(0, 0.05, 20)  # Contamination estimate
}

# Some quality flags (intentionally adding a few questionable samples)
qc_data['low_call_rate'] = qc_data['call_rate'] < 0.95
qc_data['abnormal_heterozygosity'] = np.abs(qc_data['heterozygosity'] - 0.19) > 0.03
qc_data['contaminated'] = qc_data['sample_contamination'] > 0.03

# Convert to DataFrame
qc_df = pd.DataFrame(qc_data)

# Display the QC data
print("Sample Quality Control Metrics:")
display(qc_df)

## Key Quality Control Metrics

Let's examine some of the most important QC metrics for genetic data:

1. **Call Rate**: The proportion of non-missing genotypes for a sample. Low call rates may indicate poor DNA quality or technical issues.

2. **Heterozygosity Rate**: The proportion of heterozygous genotypes. Abnormal heterozygosity can indicate sample contamination or inbreeding.

3. **Transition/Transversion (Ti/Tv) Ratio**: The ratio of transition substitutions (A↔G, C↔T) to transversion substitutions (A↔C, A↔T, G↔C, G↔T). For human whole-genome data, this is typically around 2.0-2.1 (exome data tend to run higher, near 3.0). Large deviations can indicate sequencing or variant-calling errors.

4. **Heterozygous/Homozygous Ratio**: The ratio of heterozygous to homozygous alternate genotypes. Abnormal ratios can indicate quality issues.

5. **Mendelian Errors**: In family data, violations of Mendelian inheritance patterns. High rates indicate genotyping errors or sample misidentification.

6. **Sample Contamination**: Estimates of contamination with DNA from other individuals. Even low levels can affect genetic analysis.
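
To make the first three metrics concrete, here is a minimal sketch that computes them by hand on toy data. The genotype encoding (0, 1, 2, and -1 for missing), the tiny matrix, and the helper name `ti_tv_ratio` are all assumptions made for illustration; real pipelines would derive these values from a VCF with dedicated tools.

```python
import numpy as np

# Hypothetical toy genotype matrix: rows = samples, columns = variants.
# Encoding (an assumption for this sketch): 0 = hom ref, 1 = het,
# 2 = hom alt, -1 = missing call.
genotypes = np.array([
    [0, 1, 2, 1, 0, -1],
    [1, 1, 0, 2, -1, -1],
    [0, 0, 1, 2, 2, 1],
])

called = genotypes >= 0

# Call rate: fraction of variants with a non-missing genotype, per sample.
call_rate = called.sum(axis=1) / genotypes.shape[1]

# Heterozygosity rate: heterozygous calls divided by non-missing calls.
het_rate = (genotypes == 1).sum(axis=1) / called.sum(axis=1)


def ti_tv_ratio(ref_alt_pairs):
    """Ratio of transitions to transversions for (ref, alt) SNV pairs."""
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    n_ti = sum(pair in transitions for pair in ref_alt_pairs)
    n_tv = len(ref_alt_pairs) - n_ti
    return n_ti / n_tv if n_tv else float("nan")


# Hypothetical list of (ref, alt) alleles for six SNVs.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]

print("Call rate:", call_rate.round(3))
print("Heterozygosity:", het_rate.round(3))
print("Ti/Tv:", ti_tv_ratio(snvs))
```

Dividing heterozygous calls by non-missing calls, rather than by all variants, keeps a low call rate from artificially lowering the heterozygosity estimate.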

In [ ]:
# Visualize sample quality metrics

# First, let's count how many samples have quality issues
low_call_rate_count = qc_df['low_call_rate'].sum()
abnormal_heterozygosity_count = qc_df['abnormal_heterozygosity'].sum()
contaminated_count = qc_df['contaminated'].sum()

# Count samples with any issue
qc_df['any_issue'] = qc_df['low_call_rate'] | qc_df['abnormal_heterozygosity'] | qc_df['contaminated']
any_issue_count = qc_df['any_issue'].sum()

print(f"Samples with low call rate: {low_call_rate_count}")
print(f"Samples with abnormal heterozygosity: {abnormal_heterozygosity_count}")
print(f"Samples with potential contamination: {contaminated_count}")
print(f"Samples with any quality issue: {any_issue_count}")

# Create a figure with subplots
plt.figure(figsize=(14, 12))

# 1. Call rate distribution
plt.subplot(2, 2, 1)
plt.hist(qc_df['call_rate'], bins=15, color='skyblue', edgecolor='black')
plt.axvline(x=0.95, color='red', linestyle='--')
plt.text(0.95, plt.ylim()[1]*0.9, 'Threshold = 0.95', color='red', ha='right')
plt.title('Call Rate Distribution')
plt.xlabel('Call Rate')
plt.ylabel('Number of Samples')

# 2. Heterozygosity distribution
plt.subplot(2, 2, 2)
plt.hist(qc_df['heterozygosity'], bins=15, color='lightgreen', edgecolor='black')
plt.axvline(x=0.16, color='red', linestyle='--')
plt.axvline(x=0.22, color='red', linestyle='--')
plt.title('Heterozygosity Distribution')
plt.xlabel('Heterozygosity Rate')
plt.ylabel('Number of Samples')

# 3. Scatterplot of call rate vs heterozygosity
plt.subplot(2, 2, 3)
plt.scatter(
    qc_df['call_rate'],
    qc_df['heterozygosity'],
    c=qc_df['any_issue'].map({True: 'red', False: 'blue'}),
    alpha=0.7
)
plt.axhline(y=0.16, color='red', linestyle='--', alpha=0.3)
plt.axhline(y=0.22, color='red', linestyle='--', alpha=0.3)
plt.axvline(x=0.95, color='red', linestyle='--', alpha=0.3)
plt.title('Call Rate vs Heterozygosity')
plt.xlabel('Call Rate')
plt.ylabel('Heterozygosity Rate')

# 4. Sample contamination
plt.subplot(2, 2, 4)
plt.bar(qc_df['sample_id'], qc_df['sample_contamination'], color='lightcoral')
plt.axhline(y=0.03, color='red', linestyle='--')
plt.text(0, 0.03, 'Threshold = 0.03', color='red', ha='left', va='bottom')
plt.title('Sample Contamination Estimates')
plt.xlabel('Sample ID')
plt.ylabel('Contamination Level')
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

## Quality Control for Genetic Variants

In addition to sample-level QC, we also need to perform variant-level quality control. Let's examine some key metrics for variant QC:

1. **Call Rate**: The proportion of samples with non-missing data for a variant.
   
2. **Minor Allele Frequency (MAF)**: The frequency of the less common allele. Very rare variants are harder to genotype reliably, so a very low MAF may reflect genotyping errors rather than true variation.
   
3. **Hardy-Weinberg Equilibrium (HWE)**: Tests whether genotype frequencies match theoretical expectations. Significant deviations can indicate genotyping errors, population stratification, or selection.
   
4. **Missingness by Genotype**: Differential missingness between genotype groups can indicate bias.
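
As a rough illustration of how MAF and an HWE test can be derived from genotype counts, the sketch below uses a simple chi-square goodness-of-fit test with one degree of freedom. The function name `maf_and_hwe` and the example counts are hypothetical, and the chi-square test is chosen only for brevity; analysis tools often use an exact test instead, which behaves better when counts are small.

```python
import math


def maf_and_hwe(n_hom_ref, n_het, n_hom_alt):
    """Return (MAF, chi-square statistic, p-value) for one biallelic variant."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    maf = min(p, q)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square distribution with 1 degree of
    # freedom, written with erfc so that only the standard library is needed.
    p_value = math.erfc(math.sqrt(chi_sq / 2))
    return maf, chi_sq, p_value


# Counts that match HWE expectations (reference allele frequency 0.7).
print(maf_and_hwe(49, 42, 9))

# Same allele frequency, but with a strong deficit of heterozygotes.
print(maf_and_hwe(65, 10, 25))
```

The second call keeps the allele frequency fixed but removes most heterozygotes, so its p-value falls far below the 0.001 threshold used later in this notebook.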

Let's explore these metrics using simulated data:

In [ ]:
# Create simulated variant QC data

# Create 100 variants
np.random.seed(43)  # Different seed from sample data

# Variant IDs
variant_ids = [f'rs{i+1000:07d}' for i in range(100)]

# Create variant metrics
variant_data = {
    'variant_id': variant_ids,
    'chromosome': np.random.choice(range(1, 23), 100),  # Chromosomes 1-22
    'position': np.random.randint(1000000, 10000000, 100),  # Random positions
    'call_rate': np.random.uniform(0.80, 1.0, 100),  # Proportion of non-missing genotypes
    'maf': np.random.beta(1, 4, 100),  # Allele frequency, biased towards lower values
    'hwe_p_value': np.random.uniform(0, 1, 100),  # HWE p-values
}

# A minor allele frequency cannot exceed 0.5, so fold any larger draws
variant_data['maf'] = np.minimum(variant_data['maf'], 1 - variant_data['maf'])

# Add flags for filtering
variant_data['low_call_rate'] = variant_data['call_rate'] < 0.95
variant_data['low_maf'] = variant_data['maf'] < 0.01
variant_data['hwe_violation'] = variant_data['hwe_p_value'] < 0.001

# Convert to DataFrame
variant_df = pd.DataFrame(variant_data)

# Apply standard QC filters
filtered_variants = variant_df[
    (variant_df['call_rate'] >= 0.95) &
    (variant_df['maf'] >= 0.01) &
    (variant_df['hwe_p_value'] >= 0.001)
]

# Display the QC data
print(f"Total variants: {len(variant_df)}")
print(f"Variants passing QC: {len(filtered_variants)}")
print(f"Variants filtered out: {len(variant_df) - len(filtered_variants)}")
print("\nVariant QC Metrics (first 10 variants):")
display(variant_df.head(10))

In [ ]:
# Visualize variant QC metrics

# Create a figure with subplots
plt.figure(figsize=(14, 12))

# 1. Call rate distribution for variants
plt.subplot(2, 2, 1)
plt.hist(variant_df['call_rate'], bins=20, color='skyblue', edgecolor='black')
plt.axvline(x=0.95, color='red', linestyle='--')
plt.text(0.95, plt.ylim()[1]*0.9, 'Threshold = 0.95', color='red', ha='right')
plt.title('Variant Call Rate Distribution')
plt.xlabel('Call Rate')
plt.ylabel('Number of Variants')

# 2. MAF distribution
plt.subplot(2, 2, 2)
plt.hist(variant_df['maf'], bins=20, color='lightgreen', edgecolor='black')
plt.axvline(x=0.01, color='red', linestyle='--')
plt.text(0.01, plt.ylim()[1]*0.9, 'Threshold = 0.01', color='red', ha='left')
plt.title('Minor Allele Frequency Distribution')
plt.xlabel('MAF')
plt.ylabel('Number of Variants')

# 3. HWE p-value distribution (log scale for better visualization)
plt.subplot(2, 2, 3)
plt.hist(-np.log10(variant_df['hwe_p_value']), bins=20, color='lightcoral', edgecolor='black')
plt.axvline(x=-np.log10(0.001), color='red', linestyle='--')
plt.text(-np.log10(0.001), plt.ylim()[1]*0.9, 'Threshold = 0.001', color='red', ha='left')
plt.title('Hardy-Weinberg Equilibrium P-values')
plt.xlabel('-log10(HWE p-value)')
plt.ylabel('Number of Variants')

# 4. Combined filters - variants passing each filter
filter_counts = {
    'Total': len(variant_df),
    'Passed Call Rate': sum(~variant_df['low_call_rate']),
    'Passed MAF': sum(~variant_df['low_maf']),
    'Passed HWE': sum(~variant_df['hwe_violation']),
    'Passed All': len(filtered_variants)
}

plt.subplot(2, 2, 4)
# Convert the dict views to lists so the bar labels and heights are plain sequences
plt.bar(list(filter_counts.keys()), list(filter_counts.values()), color='purple')
plt.title('Variants Passing QC Filters')
plt.ylabel('Number of Variants')
plt.xticks(rotation=45)

# Tighten layout and show the plots
plt.tight_layout()
plt.show()

# Display filtered variant counts by chromosome
chrom_counts = variant_df.groupby('chromosome').size().reset_index(name='total')
chrom_filtered = filtered_variants.groupby('chromosome').size().reset_index(name='passed')
chrom_counts = pd.merge(chrom_counts, chrom_filtered, on='chromosome', how='left').fillna(0)
chrom_counts['passed'] = chrom_counts['passed'].astype(int)
chrom_counts['filtered'] = chrom_counts['total'] - chrom_counts['passed']

# Sort by chromosome number
chrom_counts = chrom_counts.sort_values('chromosome')

print("\nVariant counts by chromosome:")
display(chrom_counts)

# Plot variants by chromosome
plt.figure(figsize=(12, 6))

# Create a stacked bar chart
plt.bar(chrom_counts['chromosome'], chrom_counts['passed'], label='Passed QC')
plt.bar(chrom_counts['chromosome'], chrom_counts['filtered'], bottom=chrom_counts['passed'], label='Failed QC')

plt.title('Variants by Chromosome')
plt.xlabel('Chromosome')
plt.ylabel('Number of Variants')
plt.xticks(chrom_counts['chromosome'])
plt.legend()
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Phasing Genetic Data

After performing quality control, the next step in many genetic analyses is **phasing**. Phasing is the process of determining which alleles are on the same chromosome (i.e., assigning alleles to haplotypes). This is crucial for genetic genealogy, identity-by-descent analysis, and other applications.

### Why Phasing Matters

In unphased data, we only know the genotype at each position without knowing which alleles are on the same chromosome. For example, if an individual is heterozygous (A/T) at one site and heterozygous (C/G) at another site, there are two possible haplotype configurations:
- A-C on one chromosome and T-G on the other, or
- A-G on one chromosome and T-C on the other.

Phasing determines which of these configurations is correct, which is essential for accurate genetic analysis.
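
The ambiguity described above can be enumerated directly. The following sketch lists every distinct haplotype pair consistent with a set of unphased genotypes; the function name `haplotype_configurations` is a hypothetical helper written for this illustration and is not part of any phasing tool.

```python
from itertools import product


def haplotype_configurations(genotypes):
    """List the distinct haplotype pairs consistent with unphased genotypes.

    genotypes: list of (allele, allele) tuples, one per site.
    """
    configs = set()
    # At each site, either keep the allele order or swap it.
    for flips in product([False, True], repeat=len(genotypes)):
        hap1 = "".join(b if flip else a for (a, b), flip in zip(genotypes, flips))
        hap2 = "".join(a if flip else b for (a, b), flip in zip(genotypes, flips))
        # The two haplotypes are unordered, so store them sorted.
        configs.add(tuple(sorted((hap1, hap2))))
    return sorted(configs)


# The two-site example from the text: A/T at one site, C/G at the other.
for hap1, hap2 in haplotype_configurations([("A", "T"), ("C", "G")]):
    print(hap1, "|", hap2)
```

With k heterozygous sites there are 2^(k-1) distinct configurations, which is why brute-force enumeration is impractical for real data and statistical or pedigree information is needed to pick among them.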

### Phasing Methods

There are several approaches to phasing:

1. **Statistical Phasing**: Using statistical methods to infer the most likely haplotypes based on population frequencies and linkage disequilibrium patterns.

2. **Reference-Based Phasing**: Using a reference panel of previously phased haplotypes to guide the phasing process.

3. **Family-Based Phasing**: Using family relationships (e.g., parent-offspring trios) to determine which alleles are inherited together.
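
Family-based phasing is the easiest of the three approaches to sketch by hand. The example below phases a single site in a child using both parents' genotypes. The function `phase_child_site` is a hypothetical helper for illustration; it ignores genotyping error and missing data, which real tools have to handle.

```python
def phase_child_site(child, father, mother):
    """Return (paternal_allele, maternal_allele) for one site, or None.

    Each argument is an unphased genotype given as a 2-tuple of alleles.
    None means the site is ambiguous or inconsistent with Mendelian
    inheritance.
    """
    a, b = child
    options = set()
    # Try both ways of assigning the child's alleles to the parents.
    for paternal, maternal in ((a, b), (b, a)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    # Exactly one consistent assignment means the site can be phased.
    return options.pop() if len(options) == 1 else None


# Father is A/A, so the child's A must be paternal and the T maternal.
print(phase_child_site(("A", "T"), ("A", "A"), ("A", "T")))

# All three are heterozygous: the trio alone cannot resolve the phase.
print(phase_child_site(("A", "T"), ("A", "T"), ("A", "T")))

# Child A/A with a T/T father is a Mendelian error, so nothing is returned.
print(phase_child_site(("A", "A"), ("T", "T"), ("A", "T")))
```

Sites where all three family members are heterozygous stay unresolved under this rule, which is one reason family-based phasing is often combined with statistical methods.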

In real genetic analysis, tools like SHAPEIT, Eagle, or Beagle are used for phasing. Let's simulate phased and unphased data to understand the difference:

In [ ]:
# Simulate a small dataset with unphased genotypes

# Set random seed for reproducibility
np.random.seed(44)

# Let's create a very small dataset with 5 variants and 3 individuals
individual_ids = ['ind1', 'ind2', 'ind3']
variant_pos = [100, 200, 300, 400, 500]

# Possible alleles at each position
alleles = ['A', 'C', 'G', 'T']

# Generate random genotypes (unphased)
unphased_genotypes = {}
for ind in individual_ids:
    genotypes = []
    for pos in variant_pos:
        # Randomly select two alleles for each variant
        allele1 = np.random.choice(alleles)
        allele2 = np.random.choice(alleles)
        genotypes.append((allele1, allele2))
    unphased_genotypes[ind] = genotypes

# Display unphased genotypes
print("Unphased Genotypes:")
for ind, genotypes in unphased_genotypes.items():
    print(f"{ind}: {' '.join([f'({a1}/{a2})' for a1, a2 in genotypes])}")

# Now, let's simulate phasing these genotypes
phased_haplotypes = {}
for ind, genotypes in unphased_genotypes.items():
    # Create two haplotypes for each individual
    haplotype1 = []
    haplotype2 = []

    for a1, a2 in genotypes:
        # Randomly decide which allele goes to which haplotype
        if np.random.random() < 0.5:
            haplotype1.append(a1)
            haplotype2.append(a2)
        else:
            haplotype1.append(a2)
            haplotype2.append(a1)

    phased_haplotypes[ind] = (haplotype1, haplotype2)

# Display phased haplotypes
print("\nPhased Haplotypes:")
for ind, (hap1, hap2) in phased_haplotypes.items():
    print(f"{ind}: Haplotype 1: {'-'.join(hap1)}")
    print(f"     Haplotype 2: {'-'.join(hap2)}")

In [ ]:
# Visualize the difference between phased and unphased data

# Create a visualization of phased vs unphased data for one individual
selected_ind = 'ind1'

# Extract the data
unphased = unphased_genotypes[selected_ind]
phased = phased_haplotypes[selected_ind]

# Create figure
plt.figure(figsize=(12, 6))

# Unphased visualization
plt.subplot(2, 1, 1)
plt.title(f"Unphased Genotypes for {selected_ind}")

# Plot the chromosomes as lines
plt.plot([0, len(variant_pos)+1], [1, 1], 'k-', linewidth=2)  # Chromosome 1
plt.plot([0, len(variant_pos)+1], [2, 2], 'k-', linewidth=2)  # Chromosome 2

# Plot the alleles at each position
for i, (a1, a2) in enumerate(unphased):
    plt.text(i+1, 1, a1, ha='center', va='center', fontsize=16, bbox=dict(facecolor='lightblue', alpha=0.5))
    plt.text(i+1, 2, a2, ha='center', va='center', fontsize=16, bbox=dict(facecolor='lightblue', alpha=0.5))

    # Add dashed line to show we don't know which alleles are on the same chromosome
    plt.plot([i+1, i+1], [1, 2], 'k--', alpha=0.3)

# Show position numbers
for i, pos in enumerate(variant_pos):
    plt.text(i+1, 0.5, f"Pos {pos}", ha='center', va='center', fontsize=10)

plt.ylim(0, 3)
plt.xlim(0, len(variant_pos)+1)
plt.yticks([1, 2], ['Chromosome 1', 'Chromosome 2'])
plt.xticks([])

# Phased visualization
plt.subplot(2, 1, 2)
plt.title(f"Phased Haplotypes for {selected_ind}")

# Plot the chromosomes as lines
plt.plot([0, len(variant_pos)+1], [1, 1], 'k-', linewidth=2)  # Haplotype 1
plt.plot([0, len(variant_pos)+1], [2, 2], 'k-', linewidth=2)  # Haplotype 2

# Plot the phased alleles at each position
hap1, hap2 = phased
for i, (a1, a2) in enumerate(zip(hap1, hap2)):
    plt.text(i+1, 1, a1, ha='center', va='center', fontsize=16, bbox=dict(facecolor='lightgreen', alpha=0.5))
    plt.text(i+1, 2, a2, ha='center', va='center', fontsize=16, bbox=dict(facecolor='lightcoral', alpha=0.5))

# Show position numbers
for i, pos in enumerate(variant_pos):
    plt.text(i+1, 0.5, f"Pos {pos}", ha='center', va='center', fontsize=10)

plt.ylim(0, 3)
plt.xlim(0, len(variant_pos)+1)
plt.yticks([1, 2], ['Haplotype 1', 'Haplotype 2'])
plt.xticks([])

# Add explanatory text
plt.figtext(0.02, 0.02, "Note: In unphased data, we don't know which alleles are on the same chromosome.\n"
                        "Phasing resolves this ambiguity, allowing for accurate haplotype analysis.",
            fontsize=12)

plt.tight_layout()
plt.show()

## Benefits of Phasing for Genetic Genealogy

Phasing is particularly important for genetic genealogy applications for several reasons:

1. **Improved IBD Detection**: Identity-by-descent (IBD) detection, which identifies segments of DNA shared between individuals due to common ancestry, is much more accurate with phased data.

2. **Better Imputation**: Phasing improves the accuracy of genotype imputation, which predicts genotypes at variants that weren't directly genotyped.

3. **Haplotype-Based Analysis**: Many advanced analyses in genetic genealogy rely on haplotype information, which requires phased data.

4. **Understanding Inheritance Patterns**: Phasing allows us to trace the inheritance of specific chromosome segments through generations.

Let's explore a simulated example of how phasing improves IBD detection:

In [ ]:
# Simulate IBD detection with phased vs unphased data

# Let's create a more realistic example with 100 variants
np.random.seed(45)

# Positions along a chromosome (in centiMorgans)
positions = np.linspace(0, 100, 100)  # 100 cM chromosome

# Create two related individuals who share a segment
# We'll simulate a true IBD segment covering variant indices 30 to 69
ibd_start_idx = 30
ibd_end_idx = 70  # Exclusive end index

# Function to generate phased haplotypes
def generate_haplotypes(n_variants):
    return ['A' if np.random.random() < 0.5 else 'B' for _ in range(n_variants)]

# Generate base haplotypes for two individuals (4 haplotypes total)
individual1_hap1 = generate_haplotypes(100)
individual1_hap2 = generate_haplotypes(100)
individual2_hap1 = generate_haplotypes(100)
individual2_hap2 = generate_haplotypes(100)

# Create the shared IBD segment - make individual2_hap1 identical to individual1_hap1 in the IBD region
for i in range(ibd_start_idx, ibd_end_idx):
    individual2_hap1[i] = individual1_hap1[i]

# Convert haplotypes to unphased genotypes
unphased1 = []
unphased2 = []

for i in range(100):
    allele1 = 'A' if individual1_hap1[i] == 'A' else 'B'
    allele2 = 'A' if individual1_hap2[i] == 'A' else 'B'
    unphased1.append((allele1, allele2))

    allele1 = 'A' if individual2_hap1[i] == 'A' else 'B'
    allele2 = 'A' if individual2_hap2[i] == 'A' else 'B'
    unphased2.append((allele1, allele2))

# Create a function to check for matching genotypes (unphased)
def check_genotype_match(genotype1, genotype2):
    a1, a2 = genotype1
    b1, b2 = genotype2

    # Check if the genotypes match in any order
    return (a1 == b1 and a2 == b2) or (a1 == b2 and a2 == b1)

# Create a function to check for matching alleles (phased)
def check_allele_match(allele1, allele2):
    return allele1 == allele2

# Detect IBD segments using unphased data
unphased_ibd_segments = []
current_segment = None

for i in range(100):
    if check_genotype_match(unphased1[i], unphased2[i]):
        if current_segment is None:
            current_segment = [i]
        else:
            current_segment.append(i)
    else:
        if current_segment is not None and len(current_segment) >= 5:  # Require at least 5 matching variants
            start = current_segment[0]
            end = current_segment[-1]
            unphased_ibd_segments.append((start, end))
        current_segment = None

# Don't forget to check the last segment
if current_segment is not None and len(current_segment) >= 5:
    start = current_segment[0]
    end = current_segment[-1]
    unphased_ibd_segments.append((start, end))

# Detect IBD segments using phased data
phased_ibd_segments = []
current_segment = None

for i in range(100):
    # Check for matches between individual1_hap1 and individual2_hap1
    # (this simulation knows which pair shares the segment; a real analysis
    # would have to compare all four haplotype pairings)
    if check_allele_match(individual1_hap1[i], individual2_hap1[i]):
        if current_segment is None:
            current_segment = [i]
        else:
            current_segment.append(i)
    else:
        if current_segment is not None and len(current_segment) >= 5:  # Require at least 5 matching variants
            start = current_segment[0]
            end = current_segment[-1]
            phased_ibd_segments.append((start, end))
        current_segment = None

# Don't forget to check the last segment
if current_segment is not None and len(current_segment) >= 5:
    start = current_segment[0]
    end = current_segment[-1]
    phased_ibd_segments.append((start, end))

# Display results (all segment ends are inclusive variant indices)
print(f"True IBD segment: from position {ibd_start_idx} to {ibd_end_idx - 1} "
      f"(length: {ibd_end_idx - ibd_start_idx} variants)")
print("\nIBD segments detected using unphased data:")
for start, end in unphased_ibd_segments:
    print(f"  From position {start} to {end} (length: {end-start+1} variants)")

print("\nIBD segments detected using phased data:")
for start, end in phased_ibd_segments:
    print(f"  From position {start} to {end} (length: {end-start+1} variants)")

# Visualize the results
plt.figure(figsize=(12, 6))

# Plot true IBD region
plt.axvspan(positions[ibd_start_idx], positions[ibd_end_idx-1], alpha=0.2, color='gray', label='True IBD segment')

# Plot unphased IBD segments
for start, end in unphased_ibd_segments:
    plt.plot([positions[start], positions[end]], [0.3, 0.3], 'b-', linewidth=6, alpha=0.7)

# Plot phased IBD segments
for start, end in phased_ibd_segments:
    plt.plot([positions[start], positions[end]], [0.7, 0.7], 'g-', linewidth=6, alpha=0.7)

# Add labels
plt.text(50, 0.3, "Unphased Detection", ha='center', fontsize=12)
plt.text(50, 0.7, "Phased Detection", ha='center', fontsize=12)

plt.yticks([])
plt.xlabel('Chromosome Position (cM)')
plt.title('IBD Detection: Phased vs Unphased')
plt.xlim(0, 100)
plt.ylim(0, 1)

plt.tight_layout()
plt.show()
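
The unphased check in the cell above requires the two genotypes to be identical, which is stricter than necessary: two individuals who share one haplotype IBD only need to have at least one allele in common at each site. A common simplification for unphased data is therefore to look for runs with no opposite homozygotes. The sketch below illustrates that idea on hypothetical toy genotypes. The helper names `shares_an_allele` and `find_runs` are assumptions for this example, and the criterion ignores genotyping error and genetic distance, both of which real IBD tools account for.

```python
def shares_an_allele(genotype1, genotype2):
    """True unless the genotypes are opposite homozygotes (e.g. A/A vs B/B)."""
    return bool(set(genotype1) & set(genotype2))


def find_runs(flags, min_length=5):
    """Return inclusive (start, end) index pairs for runs of True values."""
    runs, start = [], None
    # A trailing False acts as a sentinel that closes a run at the end.
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    return runs


# Hypothetical toy genotypes for two individuals at 8 sites.
g1 = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "A"),
      ("A", "B"), ("B", "B"), ("A", "A"), ("B", "B")]
g2 = [("B", "B"), ("A", "A"), ("A", "B"), ("A", "B"),
      ("B", "B"), ("A", "B"), ("A", "A"), ("A", "A")]

flags = [shares_an_allele(x, y) for x, y in zip(g1, g2)]
print(find_runs(flags, min_length=5))
```

Only the first and last sites are opposite homozygotes here, so the six sites between them form a single candidate segment. The `find_runs` helper could also replace the two duplicated run-detection loops in the cell above.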

## Connecting to the Full Lab Environment

This notebook provides a basic introduction to quality control and phasing concepts for genetic data in the browser. For more comprehensive analysis with real genetic data, you'll want to use the full notebook in a local environment.

### How to Continue in the Local Environment

1. Download the full Lab 3 notebook from the course materials
2. Set up your local Python environment with necessary tools (bcftools, BEAGLE, etc.)
3. Run the notebook with your quality-controlled and phased data from Lab 2

### What's Different in the Full Version?

- Processes real VCF files with bcftools and other bioinformatics tools
- Performs actual phasing using BEAGLE with genetic maps
- Provides more advanced visualizations and statistical analyses
- Prepares data for downstream IBD detection in later labs

In the next lab, we'll explore Identity-by-Descent (IBD) detection using tools like IBIS, building on the phased data prepared in this lab.

In [ ]:
# Conclusion

# Create a QC and Phasing workflow diagram
from matplotlib.patches import FancyArrowPatch

# Create a figure for the workflow diagram
plt.figure(figsize=(12, 8))

# Define the stages of the QC and phasing workflow
stages = [
    "Raw VCF Files",
    "Sample QC",
    "Variant QC",
    "Filtered VCF",
    "Phasing",
    "Phased VCF"
]

# Calculate positions for the stages
positions = [(i*2, 0) for i in range(len(stages))]

# Plot each stage as a node
for pos, label in zip(positions, stages):
    plt.plot(pos[0], pos[1], 'ko', markersize=15, alpha=0.8)
    plt.text(pos[0], pos[1]-0.15, label, ha='center', va='top', fontsize=12, fontweight='bold')

# Define substeps for each main stage
substeps = {
    0: [],  # Raw VCF Files (no substeps)
    1: ["Call Rate", "Heterozygosity", "Contamination"],  # Sample QC substeps
    2: ["Call Rate", "MAF", "HWE"],  # Variant QC substeps
    3: [],  # Filtered VCF (no substeps)
    4: ["Statistical", "Reference-based", "Family-based"],  # Phasing substeps
    5: []  # Phased VCF (no substeps)
}

# Plot substeps
for stage_idx, steps in substeps.items():
    x = positions[stage_idx][0]
    for i, step in enumerate(steps):
        y = -1.5 - i*0.6
        plt.plot(x, y, 'o', markersize=8, color='skyblue')
        plt.text(x, y-0.15, step, ha='center', va='top', fontsize=10)

        # Connect to main stage
        plt.plot([x, x], [positions[stage_idx][1]-0.2, y+0.1], 'k--', alpha=0.5)

# Add arrows between main stages
for i in range(len(positions)-1):
    arrow = FancyArrowPatch(
        positions[i], positions[i+1],
        connectionstyle="arc3,rad=0.1",
        arrowstyle='-|>',
        color='blue',
        mutation_scale=15,
        linewidth=2
    )
    plt.gca().add_patch(arrow)

# Add brief descriptions
descriptions = {
    0: "Starting data",
    1: "Remove low-quality samples",
    2: "Remove problematic variants",
    3: "Quality controlled data",
    4: "Determine haplotypes",
    5: "Ready for IBD detection"
}

for stage_idx, desc in descriptions.items():
    x = positions[stage_idx][0]
    plt.text(x, 0.5, desc, ha='center', va='bottom', fontsize=10, style='italic')

# Set the plot limits and remove axes
plt.xlim(-1, positions[-1][0]+1)
plt.ylim(-4, 1)
plt.axis('off')

plt.title('Quality Control and Phasing Workflow', fontsize=14)
plt.tight_layout()
plt.show()

print("Lab 3 completed successfully!")