# LAB 1: Exploring Population Genomic Data\n\n## Browser-Based Version\n\nThis notebook is a simplified version of the full Lab 1 notebook, adapted to run in your browser using JupyterLite. \n\n### Key Learning Objectives:\n\n1. Understand the structure of population genomic data\n2. Explore sample metadata from the 1000 Genomes Project\n3. Learn basic data manipulation techniques for genomic analysis\n4. Visualize population distributions and relationships\n\n### Browser vs. Local Environment\n\nThis browser version contains pre-loaded sample data and simplified analyses that can run in your browser. For the full experience with complete datasets, you can download the full notebook from the course materials and run it in your local Python environment.

In [ ]:
# Import libraries\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom io import StringIO\n\n# Display settings\npd.set_option('display.max_columns', 10)\npd.set_option('display.max_rows', 10)\n\nprint(\"Libraries imported successfully!\")

## Sample Data from the 1000 Genomes Project\n\nIn this browser-based lab, we will work with a small subset of sample metadata from the 1000 Genomes Project. This data includes information about:\n\n- Sample IDs\n- Family IDs, which link related samples\n- Population codes and descriptions\n- Gender\n\nThe full dataset contains information on 2,504 individuals from 26 populations around the world. For this browser version, we're using a smaller sample of 30 individuals from 7 populations to ensure smooth performance.

In [ ]:
# Load sample data (a subset of 1000 Genomes metadata)\n\n# This is a small sample of the 1000 Genomes metadata in CSV format\nsample_data = \"\"\"Sample,Family_ID,Population,Population_Description,Gender\nHG00096,HG00096,GBR,British in England and Scotland,male\nHG00097,HG00097,GBR,British in England and Scotland,female\nHG00099,HG00099,GBR,British in England and Scotland,female\nHG00100,HG00100,GBR,British in England and Scotland,female\nHG00101,HG00101,GBR,British in England and Scotland,male\nNA19017,NA19017,ASW,African Ancestry in Southwest US,female\nNA19019,NA19019,ASW,African Ancestry in Southwest US,male\nNA19020,NA19020,ASW,African Ancestry in Southwest US,female\nNA19023,NA19023,ASW,African Ancestry in Southwest US,male\nNA19026,NA19026,ASW,African Ancestry in Southwest US,male\nNA19711,Y117,CHB,Han Chinese in Beijing,male\nNA19982,Y042,ASW,African Ancestry in Southwest US,female\nNA19985,Y044,ASW,African Ancestry in Southwest US,female\nNA20533,2536,TSI,Toscani in Italia,female\nNA20534,2536,TSI,Toscani in Italia,female\nNA20539,2538,TSI,Toscani in Italia,female\nNA12843,1347,CEU,Utah Residents (CEPH) with Northern and Western European Ancestry,male\nNA12873,1349,CEU,Utah Residents (CEPH) with Northern and Western European Ancestry,male\nNA12874,1349,CEU,Utah Residents (CEPH) with Northern and Western European Ancestry,male\nHG02461,BB48,GWD,Gambian in Western Division,female\nHG02462,BB48,GWD,Gambian in Western Division,male\nHG02463,BB48,GWD,Gambian in Western Division,female\nHG02568,BB05,GWD,Gambian in Western Division,female\nHG02570,BB05,GWD,Gambian in Western Division,female\nHG02571,BB05,GWD,Gambian in Western Division,female\nNA18510,Y034,YRI,Yoruba in Ibadan,male\nNA18511,Y034,YRI,Yoruba in Ibadan,female\nNA18516,Y034,YRI,Yoruba in Ibadan,female\nNA18519,Y040,YRI,Yoruba in Ibadan,female\nNA18520,Y040,YRI,Yoruba in Ibadan,male\n\"\"\"\n\n# Read the sample data into a DataFrame\nsample_df = pd.read_csv(StringIO(sample_data))\n\n# Display the first few rows\nsample_df.head()
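Before exploring a freshly loaded metadata table, it can help to run a couple of sanity checks. The sketch below is one possible way to do that on a few rows copied from the sample above; the names `csv_text`, `check_df`, `n_duplicate_ids`, and `n_missing` are illustrative and not part of the lab code.

```python
from io import StringIO

import pandas as pd

# A few rows copied from the lab's sample metadata (illustrative subset)
csv_text = """Sample,Family_ID,Population,Population_Description,Gender
HG00096,HG00096,GBR,British in England and Scotland,male
HG00097,HG00097,GBR,British in England and Scotland,female
NA20533,2536,TSI,Toscani in Italia,female
NA20534,2536,TSI,Toscani in Italia,female
"""

# Reading Family_ID as text keeps numeric-looking IDs such as "2536" as strings
check_df = pd.read_csv(StringIO(csv_text), dtype={"Family_ID": str})

# Each sample ID should appear exactly once
n_duplicate_ids = int(check_df["Sample"].duplicated().sum())

# Count missing values across the whole table
n_missing = int(check_df.isna().sum().sum())

print(f"Duplicate sample IDs: {n_duplicate_ids}")
print(f"Missing values: {n_missing}")
```

If either count is non-zero in a larger file, it is usually worth inspecting the affected rows before computing any summaries.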

## Basic Data Exploration\n\nLet's explore our dataset to understand its structure and contents. We'll start by checking the number of individuals and the populations represented in our sample.

In [ ]:
# Check dataset information\nprint(f\"Total number of samples: {len(sample_df)}\")\n\n# Count samples by population\npopulation_counts = sample_df['Population'].value_counts()\nprint(\"\\nSamples by population:\")\nprint(population_counts)\n\n# Count samples by gender\ngender_counts = sample_df['Gender'].value_counts()\nprint(\"\\nSamples by gender:\")\nprint(gender_counts)\n\n# Show unique population descriptions\nprint(\"\\nUnique population descriptions:\")\nunique_populations = sample_df[['Population', 'Population_Description']].drop_duplicates()\nprint(unique_populations)
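Raw counts are easier to compare across groups once they are expressed as proportions. The following sketch, which uses the CEU and YRI rows from the sample above in a small stand-alone DataFrame (`demo_df` is an illustrative name), shows `value_counts(normalize=True)` and a two-level `groupby` as one way to get there.

```python
import pandas as pd

# CEU and YRI rows copied from the lab's sample metadata
demo_df = pd.DataFrame({
    "Sample": ["NA12843", "NA12873", "NA12874",
               "NA18510", "NA18511", "NA18516", "NA18519", "NA18520"],
    "Population": ["CEU"] * 3 + ["YRI"] * 5,
    "Gender": ["male", "male", "male",
               "male", "female", "female", "female", "male"],
})

# Fraction of samples contributed by each population
pop_share = demo_df["Population"].value_counts(normalize=True)
print(pop_share)

# Number of samples for each (population, gender) combination
pop_gender_sizes = demo_df.groupby(["Population", "Gender"]).size()
print(pop_gender_sizes)
```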

## Data Visualization\n\nVisualization helps us understand patterns in our data more intuitively. Let's create a few visualizations to better understand the distribution of samples across populations and gender.

In [ ]:
# Create a bar chart of populations\nplt.figure(figsize=(10, 6))\npopulation_counts.plot(kind='bar', color='skyblue')\nplt.title('Number of Samples by Population')\nplt.xlabel('Population')\nplt.ylabel('Number of Samples')\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.show()\n\n# Create a pie chart of gender distribution\nplt.figure(figsize=(8, 8))\ngender_counts.plot(kind='pie', autopct='%1.1f%%', colors=['lightcoral', 'lightblue'])\nplt.title('Sample Distribution by Gender')\nplt.ylabel('')  # Hide the ylabel\nplt.tight_layout()\nplt.show()

In [ ]:
# Create a stacked bar chart of gender distribution by population\n# First, create a cross-tabulation of population and gender\npop_gender_crosstab = pd.crosstab(sample_df['Population'], sample_df['Gender'])\n\n# Create the axes first and pass them to pandas: DataFrame.plot() would otherwise\n# open its own figure and leave a figure created with plt.figure() empty\nfig, ax = plt.subplots(figsize=(12, 6))\npop_gender_crosstab.plot(kind='bar', stacked=True, color=['lightcoral', 'lightblue'], ax=ax)\nplt.title('Gender Distribution by Population')\nplt.xlabel('Population')\nplt.ylabel('Number of Samples')\nplt.xticks(rotation=45)\nplt.legend(title='Gender')\nplt.tight_layout()\nplt.show()
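When populations differ in size, a cross-tabulation of row-wise proportions can be easier to read than raw counts. Here is a minimal sketch using the `normalize="index"` option of `pd.crosstab` on the CEU and YRI rows from the sample; `demo_df` and `gender_share` are illustrative names.

```python
import pandas as pd

# CEU and YRI rows copied from the lab's sample metadata
demo_df = pd.DataFrame({
    "Population": ["CEU"] * 3 + ["YRI"] * 5,
    "Gender": ["male", "male", "male",
               "male", "female", "female", "female", "male"],
})

# normalize="index" divides each row by its total,
# so every population's row sums to 1
gender_share = pd.crosstab(
    demo_df["Population"], demo_df["Gender"], normalize="index"
)
print(gender_share)
```

The resulting table can be passed to the same stacked bar plot to show proportions instead of counts.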

## Filtering Data\n\nFiltering allows us to focus on specific subsets of data. Let's explore how to filter our dataset to extract specific populations or individuals.

In [ ]:
# Filter to include only individuals from the 'YRI' population\nyri_population = sample_df[sample_df['Population'] == 'YRI']\nprint(\"YRI Population Samples:\")\ndisplay(yri_population)\n\n# Filter to include only female samples\nfemale_samples = sample_df[sample_df['Gender'] == 'female']\nprint(\"\\nFemale Samples (first 5):\")\ndisplay(female_samples.head())\n\n# Multiple conditions: female samples from specific populations\ntarget_populations = ['YRI', 'CEU']\nfemales_from_target = sample_df[(sample_df['Population'].isin(target_populations)) & \n                               (sample_df['Gender'] == 'female')]\nprint(\"\\nFemale samples from YRI and CEU populations:\")\ndisplay(females_from_target)
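The combined filter above can also be written with `DataFrame.query`, which some readers find easier to scan when several conditions are chained. The sketch below compares both forms on the CEU and YRI rows from the sample; `demo_df`, `mask_result`, and `query_result` are illustrative names.

```python
import pandas as pd

# CEU and YRI rows copied from the lab's sample metadata
demo_df = pd.DataFrame({
    "Sample": ["NA12843", "NA12873", "NA12874",
               "NA18510", "NA18511", "NA18516", "NA18519", "NA18520"],
    "Population": ["CEU"] * 3 + ["YRI"] * 5,
    "Gender": ["male", "male", "male",
               "male", "female", "female", "female", "male"],
})

target_populations = ["YRI", "CEU"]

# Boolean-mask form, as used in the lab
mask_result = demo_df[
    demo_df["Population"].isin(target_populations)
    & (demo_df["Gender"] == "female")
]

# Equivalent query form; "@" refers to a local Python variable
query_result = demo_df.query(
    "Population in @target_populations and Gender == 'female'"
)

print(query_result)
```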

## Family Relationships\n\nThe 1000 Genomes Project includes information about family relationships between samples. Let's add some family relationship data to our sample and explore it.

In [ ]:
# Add family relationship data to our sample\n# In a real dataset, this would come from the full metadata file\nfamily_data = \"\"\"Sample,Family_ID,Population,Relationship\nHG02461,BB48,GWD,mother\nHG02462,BB48,GWD,father\nHG02463,BB48,GWD,child\nHG02568,BB05,GWD,child\nHG02570,BB05,GWD,mother\nHG02571,BB05,GWD,child\nNA18510,Y034,YRI,father\nNA18511,Y034,YRI,mother\nNA18516,Y034,YRI,child\nNA18519,Y040,YRI,mother\nNA18520,Y040,YRI,father\nNA12873,1349,CEU,father\nNA12874,1349,CEU,child\nNA20533,2536,TSI,mother\nNA20534,2536,TSI,child\n\"\"\"\n\n# Read the family data\nfamily_df = pd.read_csv(StringIO(family_data))\n\n# Display family relationships\nprint(\"Family Relationship Data:\")\ndisplay(family_df)\n\n# Count the types of relationships\nrelationship_counts = family_df['Relationship'].value_counts()\nprint(\"\\nRelationship Counts:\")\nprint(relationship_counts)\n\n# Create a visualization of relationship types\nplt.figure(figsize=(8, 6))\nrelationship_counts.plot(kind='bar', color='lightgreen')\nplt.title('Count of Relationship Types')\nplt.xlabel('Relationship')\nplt.ylabel('Count')\nplt.xticks(rotation=0)\nplt.tight_layout()\nplt.show()
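Since the relationship table and the sample table share the `Sample` column, they can be joined to cross-check one against the other. The sketch below is one possible consistency check, assuming that mothers should be recorded as female and fathers as male; it uses the BB48 and Y034 families from the data above, and the names `samples`, `family`, `merged`, and `mismatches` are illustrative.

```python
import pandas as pd

# Gender for six samples, copied from the lab's sample metadata
samples = pd.DataFrame({
    "Sample": ["HG02461", "HG02462", "HG02463",
               "NA18510", "NA18511", "NA18516"],
    "Gender": ["female", "male", "female", "male", "female", "female"],
})

# Relationships for the same samples, copied from the lab's family data
family = pd.DataFrame({
    "Sample": ["HG02461", "HG02462", "HG02463",
               "NA18510", "NA18511", "NA18516"],
    "Family_ID": ["BB48", "BB48", "BB48", "Y034", "Y034", "Y034"],
    "Relationship": ["mother", "father", "child",
                     "father", "mother", "child"],
})

# validate="one_to_one" raises an error if a sample ID is repeated in either table
merged = family.merge(samples, on="Sample", how="left", validate="one_to_one")

# Expected gender for parents; children map to NaN and are skipped
expected_gender = merged["Relationship"].map({"mother": "female", "father": "male"})
mismatches = merged[expected_gender.notna() & (expected_gender != merged["Gender"])]

print(merged)
print(f"Parent rows with unexpected gender: {len(mismatches)}")
```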

## Family Structure Analysis\n\nLet's identify complete family trios (mother, father, child) in our dataset. This is useful for many genetic analyses, as family trios allow us to study inheritance patterns and validate genetic variants.

In [ ]:
# Identify complete family trios (families with mother, father, and child)\n\n# Group by Family_ID and collect the relationships present in each family\nfamily_relationships = family_df.groupby('Family_ID')['Relationship'].apply(list).reset_index()\n\n# Define a function to check if a family has a trio (mother, father, child)\ndef has_trio(relationships):\n    return ('mother' in relationships and 'father' in relationships and 'child' in relationships)\n\n# Apply the function to identify trios\nfamily_relationships['has_trio'] = family_relationships['Relationship'].apply(has_trio)\n\n# Show families with complete trios (the boolean column can be used directly as a mask)\ncomplete_trios = family_relationships[family_relationships['has_trio']]\nprint(\"Families with complete trios (mother, father, child):\")\ndisplay(complete_trios[['Family_ID', 'has_trio']])\n\n# Get details of the first complete trio family, if there is one\ntrio_family_id = complete_trios['Family_ID'].iloc[0] if len(complete_trios) > 0 else None\nif trio_family_id is not None:\n    trio_details = family_df[family_df['Family_ID'] == trio_family_id]\n    print(f\"\\nDetails of family {trio_family_id}:\")\n    display(trio_details)
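The trio check can also be phrased as a set comparison: a family is complete when the required roles are a subset of the roles it contains. The sketch below shows this alternative on three families from the data above; `fam`, `required_roles`, `roles_by_family`, and `trio_ids` are illustrative names, and this is a variation on the lab's approach rather than a replacement for it.

```python
import pandas as pd

# Three families copied from the lab's family relationship data
fam = pd.DataFrame({
    "Sample": ["HG02461", "HG02462", "HG02463",
               "HG02568", "HG02570", "HG02571",
               "NA18519", "NA18520"],
    "Family_ID": ["BB48", "BB48", "BB48",
                  "BB05", "BB05", "BB05",
                  "Y040", "Y040"],
    "Relationship": ["mother", "father", "child",
                     "child", "mother", "child",
                     "mother", "father"],
})

required_roles = {"mother", "father", "child"}

# Collect the distinct roles present in each family
roles_by_family = fam.groupby("Family_ID")["Relationship"].agg(set)

# Keep families whose roles include every required role
trio_ids = sorted(
    family_id
    for family_id, roles in roles_by_family.items()
    if required_roles <= roles
)

print(trio_ids)
```

Here BB05 is excluded because it has no father, and Y040 because it has no child.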

## Population Structure Visualization\n\nLet's create a more advanced visualization to understand the structure of our dataset. Because this browser version has no variant data, we'll simulate a matrix of pairwise genetic distances between populations and display it as an annotated heatmap.

In [ ]:
# Create a more advanced visualization of population structure\n\n# First, let's prepare some dummy genetic distance data between populations\n# In a real analysis, this would come from analyzing genetic variants\npopulations = sample_df['Population'].unique()\nn_pops = len(populations)\n\n# Create a dummy distance matrix (in real analysis, this would be calculated from genetic data)\nnp.random.seed(42)  # For reproducibility\ndistance_matrix = np.random.rand(n_pops, n_pops)\n\n# Make it symmetric (distances between populations should be symmetric)\ndistance_matrix = (distance_matrix + distance_matrix.T) / 2\n\n# Set diagonal to 0 (distance of a population to itself is 0)\nnp.fill_diagonal(distance_matrix, 0)\n\n# Create a DataFrame from the distance matrix\ndistance_df = pd.DataFrame(distance_matrix, index=populations, columns=populations)\n\n# Display the distance matrix\nprint(\"Simulated Genetic Distance Matrix Between Populations:\")\ndisplay(distance_df)\n\n# Create a heatmap to visualize the distances\nplt.figure(figsize=(10, 8))\nplt.imshow(distance_matrix, cmap='viridis')\nplt.colorbar(label='Genetic Distance (simulated)')\nplt.xticks(np.arange(len(populations)), populations, rotation=90)\nplt.yticks(np.arange(len(populations)), populations)\nplt.title('Simulated Genetic Distances Between Populations')\n\n# Add distance values to the heatmap\nfor i in range(len(populations)):\n    for j in range(len(populations)):\n        plt.text(j, i, f'{distance_matrix[i, j]:.2f}', \n                 ha='center', va='center', \n                 color='white' if distance_matrix[i, j] > 0.5 else 'black')\n\nplt.tight_layout()\nplt.show()
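A pairwise distance matrix like this one is also the usual input for hierarchical clustering. The sketch below assumes SciPy is available in your environment (it is not imported elsewhere in this notebook) and uses a small random matrix, so the resulting grouping has no biological meaning. The names `labels`, `dist`, `linkage_matrix`, and `leaf_order` are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

# Four population codes from the lab; the distances below are random, not real
labels = ["GBR", "ASW", "CHB", "TSI"]

rng = np.random.default_rng(42)
raw = rng.random((len(labels), len(labels)))

# Symmetrize and zero the diagonal, as in the lab's simulated matrix
dist = (raw + raw.T) / 2
np.fill_diagonal(dist, 0)

# linkage() expects the condensed (upper-triangle) form of the matrix
condensed = squareform(dist)

# Average linkage (UPGMA) is one common choice among several
linkage_matrix = linkage(condensed, method="average")

# Order of the populations along the leaves of the resulting tree
leaf_order = [labels[i] for i in leaves_list(linkage_matrix)]
print(leaf_order)
```

With real genetic distances, `scipy.cluster.hierarchy.dendrogram(linkage_matrix, labels=labels)` could then draw the tree.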

## Connecting to the Full Lab Environment\n\nThis notebook provides a simplified introduction to exploring genetic data in the browser. For more advanced analyses with full-sized datasets, you'll want to use the local notebook environment.\n\n### How to Continue in the Local Environment\n\n1. Download the full Lab 1 notebook from the course materials\n2. Set up your local Python environment following the course instructions\n3. Run the notebook in your local Jupyter environment\n\n### What's Different in the Full Version?\n\n- Access to complete 1000 Genomes Project data (2,504 individuals)\n- Ability to download and process VCF files with real genetic variants\n- Advanced analyses of population structure with actual genetic data\n- Integration with other bioinformatics tools and pipelines\n\nIn the next lab, we'll explore how to process raw DNA profiles, building on the concepts learned here.

In [ ]:
# Conclusion - Final exploration\n\n# Let's do one more visualization - a scatter plot of population size vs. number of families\n\n# Count individuals by population\npop_counts = sample_df['Population'].value_counts().reset_index()\npop_counts.columns = ['Population', 'Individual_Count']\n\n# Count unique families by population\nfamily_counts = sample_df.groupby('Population')['Family_ID'].nunique().reset_index()\nfamily_counts.columns = ['Population', 'Family_Count']\n\n# Merge the data\npop_summary = pd.merge(pop_counts, family_counts, on='Population')\n\n# Display the summary\nprint(\"Population Summary:\")\ndisplay(pop_summary)\n\n# Create a scatter plot\nplt.figure(figsize=(10, 6))\nplt.scatter(pop_summary['Individual_Count'], pop_summary['Family_Count'], \n            s=80, alpha=0.7, c=range(len(pop_summary)), cmap='viridis')\n\n# Add population labels to each point\nfor i, row in pop_summary.iterrows():\n    plt.annotate(row['Population'], \n                 (row['Individual_Count'], row['Family_Count']),\n                 xytext=(5, 5), textcoords='offset points')\n\nplt.title('Population Size vs. Number of Families')\nplt.xlabel('Number of Individuals')\nplt.ylabel('Number of Families')\nplt.grid(True, linestyle='--', alpha=0.7)\nplt.tight_layout()\nplt.show()\n\nprint(\"\\nLab 1 completed successfully!\")
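The two counts and the merge in the final cell can also be computed in a single step with named aggregation. Here is a minimal sketch on the GBR and TSI rows from the sample; `demo_df` and `summary` are illustrative names.

```python
import pandas as pd

# GBR and TSI rows copied from the lab's sample metadata
demo_df = pd.DataFrame({
    "Sample": ["HG00096", "HG00097", "NA20533", "NA20534", "NA20539"],
    "Family_ID": ["HG00096", "HG00097", "2536", "2536", "2538"],
    "Population": ["GBR", "GBR", "TSI", "TSI", "TSI"],
})

# Named aggregation: each keyword becomes an output column,
# defined by a (source column, aggregation function) pair
summary = (
    demo_df.groupby("Population")
    .agg(
        Individual_Count=("Sample", "count"),
        Family_Count=("Family_ID", "nunique"),
    )
    .reset_index()
)

print(summary)
```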