In [None]:
import os # Imports Python's built-in os library, which allows us to interact with the operating system.

In [None]:
primary_directory = os.getcwd() # Gets the current working directory and stores it in the variable primary_directory
primary_directory # Displays the value of primary_directory

In [None]:
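# Alternative using virtualenv (run as two separate commands):
# !pip install virtualenv
# !virtualenv myenv
# Replace the placeholder path below with the location for your new environment
!python -m venv /path/to/new/virtual/environment
# Afterwards, select the new environment as the notebook kernel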

In [None]:
# !source /path/to/new/virtual/environment/bin/activate
# deactivate

In [None]:
# If you want to use a different path as your primary directory
primary_directory = "/home/lakishadavid"

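The following code builds the paths for the working directories and then creates them. Because `os.makedirs` is called with `exist_ok=True`, directories that already exist are left untouched and no error is raised.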

In [None]:
use_directory = os.path.join(primary_directory, "use")
results_directory = os.path.join(primary_directory, "results")
references_directory = os.path.join(primary_directory, "references")
data_directory = os.path.join(primary_directory, "data")

In [None]:
# Create the directories
for directory in [use_directory, results_directory, references_directory, data_directory]:
    os.makedirs(directory, exist_ok=True)
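
The same step can be sketched with `pathlib`. This is a minimal illustration, assuming a temporary directory stands in for `primary_directory`; `make_project_dirs` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import tempfile
from pathlib import Path

def make_project_dirs(base, names=("use", "results", "references", "data")):
    """Create each named subdirectory under base and return a name -> Path mapping."""
    base = Path(base)
    created = {}
    for name in names:
        path = base / name
        # parents=True creates missing parents; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created[name] = path
    return created

# Demo in a temporary directory so nothing is written to your real workspace
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_project_dirs(tmp)
    dirs_again = make_project_dirs(tmp)  # a second call does not raise
    all_exist = all(p.is_dir() for p in dirs.values())
    print(sorted(dirs), all_exist)
```

Running the helper twice shows why `exist_ok=True` matters: rerunning the notebook cell does not fail when the directories are already there.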


## Introduction to the IGSR 30x GRCh38 Data Collection
The International Genome Sample Resource (IGSR) hosts the 30x GRCh38 data collection: whole-genome sequencing of 1000 Genomes Project samples at roughly 30x coverage, aligned to the GRCh38 human reference assembly. The data are publicly available, which makes the collection a widely used reference for genomics research. At 30x coverage, each position in the genome is covered by about 30 sequencing reads on average, which supports more reliable variant calls than low-coverage sequencing.

In [None]:
# Your reference directory
print(references_directory)

Get the files

In [None]:
# !git clone https://github.com/lakishadavid/anthropology_genetic_genealogy.git

## File Verification

Before proceeding with the data subsetting, let's ensure that the sample and population files you intend to use are available in the `references_directory`.

The next cell builds the expected file paths from fixed filenames and reports whether each file exists in `references_directory`. If you saved the files under different names, edit the filenames in the cell before running it.



In [None]:
import os

sample_file_name = os.path.join(references_directory, "samples_igsr_1000genomes_grch38.tsv")

# Check if the file exists
if os.path.exists(sample_file_name):
    print(f"Using sample file: {sample_file_name}")
else:
    print(f"File does not exist: {sample_file_name}")


population_file_name = os.path.join(references_directory, "populations_igsr_1000genomes_grch38.tsv")

# Check if the file exists
if os.path.exists(population_file_name):
    print(f"Using population file: {population_file_name}")
else:
    print(f"File does not exist: {population_file_name}")
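
The two checks above follow the same pattern, so they could be folded into one function. The sketch below assumes a temporary directory in place of `references_directory`; `find_missing` is a hypothetical helper name.

```python
import os
import tempfile

def find_missing(directory, filenames):
    """Return the filenames that are not regular files inside directory."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(directory, name))]

expected = [
    "samples_igsr_1000genomes_grch38.tsv",
    "populations_igsr_1000genomes_grch38.tsv",
]

with tempfile.TemporaryDirectory() as tmp:
    # Create only the first file (empty) so the check has something to report
    open(os.path.join(tmp, expected[0]), "w").close()
    missing = find_missing(tmp, expected)
    print("Missing:", missing)
```

`os.path.isfile` is slightly stricter than `os.path.exists`: it returns `False` when the path exists but is a directory.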


## Exploring the Sample and Population Files

Before diving into the analysis, it's a good idea to explore the sample and population files to get a sense of what the data looks like. We'll use Pandas to open these files and display the first few rows.


In [None]:
# Import the Pandas library
import pandas as pd

# Load the sample and population files into Pandas DataFrames
try:
    sample_df = pd.read_csv(sample_file_name, sep='\t')
    population_df = pd.read_csv(population_file_name, sep='\t')
    
    # Display the first few rows of each DataFrame
    print("First few rows of the sample file:")
    display(sample_df.head())
    
    print("First few rows of the population file:")
    display(population_df.head())
    
except FileNotFoundError as e:
    print(f"File not found: {e}")
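
To see what `sep='\t'` does without touching the real files, the sketch below reads a small tab-separated table from an in-memory string. The rows are made-up values that only mimic the layout of a sample file; they are not taken from the IGSR data.

```python
import io
import pandas as pd

# Made-up rows that only mimic the layout of a tab-separated sample file
tsv_text = (
    "Sample name\tSex\tPopulation code\n"
    "S001\tfemale\tAAA\n"
    "S002\tmale\tBBB\n"
)

# sep="\t" tells pandas that columns are separated by tabs, not commas
toy_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(toy_df.shape)
print(list(toy_df.columns))
```

Because the separator is a tab, column names that contain spaces, such as `Sample name`, are read as a single column.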


## Data Files Overview

Before diving into the analysis, it's crucial to understand the data files we'll be working with.

---

### Populations File

#### Description
The populations file contains information about the various populations that are part of the genomic study. This file is essential for understanding the diversity of the samples and for performing population-specific analyses.

#### Typical Columns
- **Population ID**: Unique identifier for each population.
- **Population Name**: Name of the population.
- **Region**: Geographical region where the population is located.
- **Number of Samples**: Number of samples collected from this population.
- **Other Metadata**: Additional information such as ethnicity, age range, etc.

#### Use Cases
- Filtering genomic data based on specific populations.
- Performing population-specific genetic variation analyses.
- Understanding the distribution of samples across different populations.

---

### Sample File

#### Description
The sample file contains detailed information about each individual sample that is part of the study. This file is essential for tracking the source of each genomic sequence and for associating it with specific traits or conditions.

#### Typical Columns
- **Sample ID**: Unique identifier for each sample.
- **Population ID**: The population to which the sample belongs.
- **Sex**: Sex of the individual from whom the sample was taken (the `Sex` column used later in this notebook).
- **Age**: Age of the individual.
- **Health Status**: Information about the health condition of the individual, if applicable.
- **Other Metadata**: Additional information such as the date of sample collection, sequencing technology used, etc.

#### Use Cases
- Filtering genomic data based on specific samples or traits.
- Performing individual-level analyses.
- Associating genomic variations with specific traits or conditions.
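
The column lists above are typical rather than exact, so it can help to check which columns a loaded file really has. A minimal sketch, assuming a toy DataFrame with made-up values; `missing_columns` is a hypothetical helper, and the required names are the ones this notebook's later cells refer to.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy DataFrame with made-up values; the real file has more rows and columns
toy_df = pd.DataFrame({
    "Sample name": ["S001"],
    "Sex": ["female"],
    "Population code": ["AAA"],
})

required = ["Sample name", "Sex", "Population code", "Superpopulation name"]
absent = missing_columns(toy_df, required)
print("Absent columns:", absent)
```

On the real data you would call `missing_columns(sample_df, required)` after loading the file, or simply inspect `sample_df.columns`.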


# Exploratory Data Analysis with Pandas

## Introduction
Before diving into more complex analyses, it's essential to understand the structure and characteristics of your data. Pandas is a powerful Python library that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. Let's explore some basic Pandas functionalities to better understand our sample and population files.

Remember that we loaded the sample and population files earlier and created Pandas DataFrames called `sample_df` and `population_df`.

### Basic Information
You can get a quick overview of the DataFrame using .info().

In [None]:
# Get basic information about the sample DataFrame
sample_df.info()

You can preview the DataFrame by viewing the first rows with .head() or the last rows with .tail(). Both return 5 rows by default, and both accept a number if you want more or fewer.

In [None]:
sample_df.head()

In [None]:
sample_df.head(10)

In [None]:
sample_df.tail()

In [None]:
sample_df.tail(15)

### Summary Statistics
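The .describe() method provides summary statistics of the DataFrame, useful for getting a sense of the distribution of each attribute. By default it summarizes numeric columns (count, mean, standard deviation, minimum, quartiles, maximum). When a DataFrame contains only text columns, it reports count, unique, top (the most frequent value), and freq (how often that value occurs) instead.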

In [None]:
# Get summary statistics for the sample DataFrame
sample_df.describe()
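
The sketch below shows what `.describe()` returns for text-only columns, using a toy DataFrame with made-up values rather than the real sample file.

```python
import pandas as pd

# Toy DataFrame with made-up values and text columns only
toy_df = pd.DataFrame({
    "Sex": ["female", "male", "female"],
    "Population code": ["AAA", "AAA", "BBB"],
})

summary = toy_df.describe()
print(summary)

# Most frequent value in the Sex column and how often it occurs
top_sex = summary.loc["top", "Sex"]
top_sex_freq = summary.loc["freq", "Sex"]
print(top_sex, top_sex_freq)
```

For a DataFrame that mixes numeric and text columns, `describe(include="all")` summarizes both kinds in one table.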

### Count Values
To count the number of occurrences of each unique value in a column, you can use .value_counts().

In [None]:
# Count the number of individuals per population
sample_df['Population name'].value_counts()
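
Counts can also be expressed as proportions by passing `normalize=True`. A minimal sketch on a made-up Series of population codes:

```python
import pandas as pd

# Made-up population codes, for illustration only
codes = pd.Series(["AAA", "AAA", "BBB", "AAA"], name="Population code")

counts = codes.value_counts()                # absolute counts, most frequent first
shares = codes.value_counts(normalize=True)  # same values as fractions of the total

print(counts)
print(shares)
```

The fractions in `shares` sum to 1, which makes populations of very different sizes easier to compare.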

### Filtering Data
You can filter rows based on certain conditions. For example, let's filter the sample DataFrame to only include individuals from a specific population.

In [None]:
# Filter to include only individuals from the 'YRI' population
yri_population = sample_df[sample_df['Population code'] == 'YRI']
yri_population.head()

In [None]:
type(yri_population)

In [None]:
yri_population.info()

In [None]:
yri_population.describe()

In [None]:
# Search for individuals with specific attributes
specific_entries = sample_df[(sample_df['Population code'] == 'YRI') & (sample_df['Sex'] == 'female')]
specific_entries.info()

In [None]:
specific_entries.head()

In [None]:
specific_entries.describe()

### Multiple Conditions
You can include multiple conditions in your query. For example, let's find all females in either the 'YRI' or 'CEU' populations.

In [None]:
# Search for females in either 'YRI' or 'CEU' populations
multiple_conditions = sample_df[(sample_df['Population code'].isin(['YRI', 'CEU'])) & (sample_df['Sex'] == 'female')]
multiple_conditions.describe()

In [None]:
target_population = 'YRI'
target_sex = 'female'

# Search using variables
variable_filter = sample_df[(sample_df['Population code'] == target_population) & (sample_df['Sex'] == target_sex)]
variable_filter.describe()
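
The same variable-based filter can be written with `DataFrame.query`. The sketch below assumes a toy DataFrame with made-up values; column names that contain spaces are wrapped in backticks, and `@` refers to a Python variable.

```python
import pandas as pd

# Toy DataFrame with made-up values
toy_df = pd.DataFrame({
    "Sample name": ["S001", "S002", "S003", "S004"],
    "Sex": ["female", "male", "female", "female"],
    "Population code": ["AAA", "AAA", "BBB", "AAA"],
})

target_population = "AAA"
target_sex = "female"

# Boolean-mask version, as used in the cell above
mask_result = toy_df[
    (toy_df["Population code"] == target_population) & (toy_df["Sex"] == target_sex)
]

# Equivalent .query() version
query_result = toy_df.query(
    "`Population code` == @target_population and Sex == @target_sex"
)

same = mask_result.equals(query_result)
print(same)
print(query_result)
```

Both forms select the same rows; which one to use is largely a matter of readability.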

### Searching for Entries in a List
If you have a list of values to search for, you can use the .isin() method to build a boolean mask and index the DataFrame with it.

In [None]:
# List of target populations
target_populations = ['YRI', 'CEU']

# Search for individuals in target populations
list_filter = sample_df[sample_df['Population code'].isin(target_populations)]
list_filter.describe()
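
An `.isin()` mask can be inverted with `~` to select everything outside the list. A minimal sketch on a toy DataFrame with made-up values:

```python
import pandas as pd

# Toy DataFrame with made-up values
toy_df = pd.DataFrame({
    "Sample name": ["S001", "S002", "S003", "S004"],
    "Population code": ["AAA", "BBB", "CCC", "AAA"],
})

target_populations = ["AAA", "BBB"]

# True for rows whose population code appears in the list
in_targets = toy_df["Population code"].isin(target_populations)

selected = toy_df[in_targets]    # rows in the target populations
excluded = toy_df[~in_targets]   # all remaining rows

print(len(selected), len(excluded))
```

Storing the mask in a variable keeps the two selections consistent with each other.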

### Using String Methods
You can also use string methods to search for specific patterns in string columns.

In [None]:
# Search for individuals whose sample IDs start with 'NA'
string_filter = sample_df[sample_df['Sample name'].str.startswith('NA')]
print(len(string_filter))
string_filter.head()
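
String methods return missing values for missing entries, which cannot be used directly as a filter. The sketch below, on a made-up Series, passes `na=False` so that missing entries count as non-matches.

```python
import pandas as pd

# Made-up identifiers, including one missing value
names = pd.Series(["NA00001", "HG00002", None], name="Sample name")

# na=False treats missing entries as "does not match"
starts_with_na = names.str.startswith("NA", na=False)

# str.contains accepts na=False as well; regex=False matches the text literally
contains_hg = names.str.contains("HG", na=False, regex=False)

print(starts_with_na.tolist())
print(contains_hg.tolist())
```

If the column has no missing values, `na=False` changes nothing, so it is a safe default for filters.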

## Exploratory Data Analysis

Before we move on, let's specifically look at:

1. The total number of samples in the dataset.
2. The distribution of samples by sex.
3. The distribution of samples by population.
4. The distribution of samples by superpopulation.

In [None]:
# Check if the sample DataFrame is loaded
if 'sample_df' in locals():

    # Total number of samples
    total_samples = len(sample_df)
    print(f"Total number of samples: {total_samples}")

    # Distribution by Sex
    print("\nDistribution of samples by Sex:")
    display(sample_df['Sex'].value_counts())

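    # Distribution by Population
    print("\nDistribution of samples by Population:")
    display(sample_df['Population name'].value_counts())

    # Distribution by Superpopulation
    print("\nDistribution of samples by Superpopulation:")
    display(sample_df['Superpopulation name'].value_counts())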

else:
    print("The sample DataFrame is not loaded. Please make sure to load the sample file.")
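
Two of these distributions can be combined in a single table with `pd.crosstab`. A minimal sketch, assuming a toy DataFrame with made-up group names and values:

```python
import pandas as pd

# Toy DataFrame with made-up values
toy_df = pd.DataFrame({
    "Superpopulation name": ["Group A", "Group A", "Group B", "Group A"],
    "Sex": ["female", "male", "female", "female"],
})

# Rows are superpopulations, columns are sexes, cells are sample counts
table = pd.crosstab(toy_df["Superpopulation name"], toy_df["Sex"])
print(table)
```

On the real data, the same call with `sample_df` in place of `toy_df` would show how many samples of each sex fall in each superpopulation.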