# Lecture 3: Fundamentals of Python

**Course:** BRAIN - Single-Cell Neurogenomics Training  
**Date:** December 12, 2025  
**Duration:** 90 minutes  
**Instructor:** BRAIN Course Team  

---

## Learning Objectives

By the end of this lecture, you will be able to:

1. **Understand** basic Python syntax and data types
2. **Use** variables, operators, and control structures
3. **Work** with data structures like lists, tuples, and dictionaries
4. **Write** functions and apply basic debugging techniques
5. **Apply** Python programming concepts to biological data analysis

---

## Table of Contents

1. [Introduction to Python](#introduction)
2. [Variables and Data Types](#variables)
3. [Operators and Expressions](#operators)
4. [Control Structures](#control)
5. [Data Structures: Lists](#lists)
6. [Data Structures: Tuples](#tuples)
7. [Data Structures: Dictionaries](#dictionaries)
8. [Functions](#functions)
9. [File Input/Output](#file-io)
10. [Debugging Techniques](#debugging)
11. [Biological Data Examples](#bio-examples)
12. [Summary and Key Takeaways](#summary)
13. [Additional Resources](#resources)
14. [Homework Assignment](#homework)

---

<a id='introduction'></a>
## 1. Introduction to Python

### Why Python for Bioinformatics?

Python has become one of the dominant languages in bioinformatics and data science for several reasons:

**Advantages:**
- **Readable syntax**: Generally easier to read, learn, and maintain than lower-level languages such as Java or C++
- **Rich ecosystem**: Extensive libraries for scientific computing (NumPy, Pandas, Scanpy)
- **Community**: Large bioinformatics community with shared tools and resources
- **Versatility**: From data analysis to web development to machine learning
- **Integration**: Easy to integrate with other tools and languages

### Python vs. Other Languages

| Feature | Python | R | Java/C++ |
|---------|--------|---|----------|
| Learning curve | Easy | Moderate | Steep |
| Speed | Moderate | Moderate | Fast |
| Bioinformatics libraries | Excellent | Excellent | Limited |
| General purpose | Yes | No | Yes |
| Interactive analysis | Yes | Yes | No |

### Python in Single-Cell Analysis

The **scverse ecosystem** provides Python tools for single-cell analysis:
- **scanpy**: Core single-cell analysis
- **anndata**: Data structures
- **scvi-tools**: Deep learning models
- **squidpy**: Spatial transcriptomics
- **muon**: Multimodal omics

---

In [None]:
# Check Python version
import sys
print(f"Python version: {sys.version}")
print(f"Python version info: {sys.version_info}")

# Recommended: Python 3.8 or higher for bioinformatics work
if sys.version_info >= (3, 8):
    print("\n✓ Your Python version is suitable for this course!")
else:
    print("\n⚠ Consider upgrading to Python 3.8 or higher")

<a id='variables'></a>
## 2. Variables and Data Types

### Variables

Variables are names that refer to stored data values. In Python, you don't declare a variable's type explicitly; the type comes from the value you assign.

**Naming Rules:**
- Must start with a letter or underscore
- Can contain letters, numbers, and underscores
- Case-sensitive (age, Age, and AGE are different)
- Cannot use Python keywords (if, for, while, etc.)

**Naming Conventions:**
- Variables: `snake_case` (e.g., `gene_count`, `cell_type`)
- Constants: `UPPER_CASE` (e.g., `MAX_GENES`, `PI`)
- Classes: `PascalCase` (e.g., `DataProcessor`, `GeneExpression`)
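
The naming rules above can be checked programmatically. The following is a minimal sketch using the standard library's `str.isidentifier()` and `keyword.iskeyword()`; the helper name `is_valid_variable_name` and the sample names are illustrative assumptions, not part of the course material.

```python
import keyword

def is_valid_variable_name(name):
    """Return True if `name` follows Python's variable naming rules."""
    # isidentifier() checks the character rules; iskeyword() rejects reserved words
    return name.isidentifier() and not keyword.iskeyword(name)

for candidate in ["gene_count", "_hidden", "2nd_gene", "cell-type", "for"]:
    print(f"{candidate!r}: {is_valid_variable_name(candidate)}")
```

Names that start with a digit, contain a hyphen, or collide with a keyword are rejected; the other two pass.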

### Basic Data Types

Python has several built-in data types:

1. **Numeric Types**:
   - `int`: Integer numbers (1, 42, -100)
   - `float`: Floating-point (decimal) numbers (3.14, -0.001, 2.0)
   - `complex`: Complex numbers (1+2j)

2. **Text Type**:
   - `str`: Strings ("ATCG", 'gene', """multi-line""")

3. **Boolean Type**:
   - `bool`: True or False

4. **None Type**:
   - `NoneType`: The type of `None`, which represents the absence of a value

---

In [None]:
# Numeric types
num_cells = 2700  # int
expression_value = 3.14159  # float
complex_num = 1 + 2j  # complex (rarely used in bioinformatics)

# String types
gene_name = "BRCA1"  # str
cell_type = 'T cell'  # str (single or double quotes work)
dna_sequence = "ATCGATCGATCG"  # str

# Boolean
is_mitochondrial = True  # bool
is_expressed = False  # bool

# None type
annotation = None  # NoneType

# Print types
print("Variable Types:")
print(f"num_cells: {type(num_cells)} = {num_cells}")
print(f"expression_value: {type(expression_value)} = {expression_value}")
print(f"gene_name: {type(gene_name)} = {gene_name}")
print(f"is_mitochondrial: {type(is_mitochondrial)} = {is_mitochondrial}")
print(f"annotation: {type(annotation)} = {annotation}")

### Type Conversion

You can convert between types using built-in functions:

In [None]:
# Type conversion examples
x = "100"  # string
y = int(x)  # convert to integer
z = float(x)  # convert to float

print(f"Original: {x} (type: {type(x)})")
print(f"As int: {y} (type: {type(y)})")
print(f"As float: {z} (type: {type(z)})")

# Convert numbers to strings
num = 42
text = str(num)
print(f"\nNumber {num} as string: '{text}' (type: {type(text)})")

# Bioinformatics example: convert quality score to integer
quality_score = "35"  # Phred quality score as string
quality_int = int(quality_score)
print(f"\nQuality score: {quality_int} (type: {type(quality_int)})")
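
Conversion fails when the string does not look like the target type. The sketch below shows one way to handle that with `try`/`except`; the helper name `parse_quality_score` and the choice to return `None` on failure are assumptions made for illustration.

```python
def parse_quality_score(text):
    """Convert a string to int, returning None when it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        # int() rejects strings such as "35.5" or "NA"
        return None

for raw in ["35", " 40 ", "35.5", "NA"]:
    print(f"{raw!r} -> {parse_quality_score(raw)}")

# Converting a float to int truncates the fractional part
truncated = int(float("35.9"))
print(truncated)
```

Surrounding whitespace is accepted by `int()`, but a decimal point is not; going through `float()` first works, at the cost of dropping the fractional part.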

<a id='operators'></a>
## 3. Operators and Expressions

### Arithmetic Operators

| Operator | Description | Example |
|----------|-------------|----------|
| `+` | Addition | `5 + 3` → 8 |
| `-` | Subtraction | `5 - 3` → 2 |
| `*` | Multiplication | `5 * 3` → 15 |
| `/` | Division (float) | `5 / 2` → 2.5 |
| `//` | Floor division | `5 // 2` → 2 |
| `%` | Modulus (remainder) | `5 % 2` → 1 |
| `**` | Exponentiation | `5 ** 2` → 25 |

### Comparison Operators

| Operator | Description | Example |
|----------|-------------|----------|
| `==` | Equal to | `5 == 5` → True |
| `!=` | Not equal to | `5 != 3` → True |
| `>` | Greater than | `5 > 3` → True |
| `<` | Less than | `5 < 3` → False |
| `>=` | Greater than or equal | `5 >= 5` → True |
| `<=` | Less than or equal | `5 <= 3` → False |

### Logical Operators

| Operator | Description | Example |
|----------|-------------|----------|
| `and` | Both conditions true | `True and False` → False |
| `or` | At least one condition true | `True or False` → True |
| `not` | Negation | `not True` → False |

---

In [None]:
# Arithmetic operations on biological data
total_genes = 20000
expressed_genes = 5000
unexpressed_genes = total_genes - expressed_genes

expression_percent = (expressed_genes / total_genes) * 100

print("Gene Expression Statistics:")
print(f"Total genes: {total_genes:,}")
print(f"Expressed genes: {expressed_genes:,}")
print(f"Unexpressed genes: {unexpressed_genes:,}")
print(f"Expression percentage: {expression_percent:.1f}%")

# Modulus for checking even/odd
num_cells = 2701
is_even = (num_cells % 2) == 0
print(f"\nNumber of cells: {num_cells}")
print(f"Is even? {is_even}")

# Exponentiation for fold change
log2_fold_change = 2
fold_change = 2 ** log2_fold_change
print(f"\nLog2 fold change: {log2_fold_change}")
print(f"Fold change: {fold_change}x")

In [None]:
# Comparison and logical operators for filtering cells
n_genes = 1500
total_counts = 4000
pct_mito = 8.5

# Quality control thresholds
MIN_GENES = 200
MAX_GENES = 5000
MIN_COUNTS = 500
MAX_MITO = 20

# Check if cell passes QC
passes_gene_filter = (n_genes >= MIN_GENES) and (n_genes <= MAX_GENES)
passes_count_filter = total_counts >= MIN_COUNTS
passes_mito_filter = pct_mito <= MAX_MITO

passes_all_qc = passes_gene_filter and passes_count_filter and passes_mito_filter

print("Cell Quality Control:")
print(f"Genes detected: {n_genes}")
print(f"Total counts: {total_counts:,}")
print(f"Mitochondrial %: {pct_mito}%")
print(f"\nQC Results:")
print(f"  Passes gene filter: {passes_gene_filter}")
print(f"  Passes count filter: {passes_count_filter}")
print(f"  Passes mito filter: {passes_mito_filter}")
print(f"  Passes ALL QC: {passes_all_qc}")
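
The gene filter in the cell above joins two comparisons with `and`. Python also allows comparisons to be chained, which reads closer to the mathematical notation. A short sketch reusing the same threshold values:

```python
n_genes = 1500
MIN_GENES = 200
MAX_GENES = 5000

# Two comparisons joined with `and`
explicit = (n_genes >= MIN_GENES) and (n_genes <= MAX_GENES)

# Equivalent chained comparison
chained = MIN_GENES <= n_genes <= MAX_GENES

print(explicit, chained)
```

Both forms give the same result; which one to use is a matter of readability.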

<a id='control'></a>
## 4. Control Structures

### If-Elif-Else Statements

Control the flow of your program based on conditions:

```python
if condition1:
    # code block executed if condition1 is True
elif condition2:
    # code block executed if condition1 is False and condition2 is True
else:
    # code block executed if all conditions are False
```

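**Important:** Python uses **indentation** to define code blocks. Four spaces per level is the standard convention (PEP 8), and the indentation must be consistent within a block.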

In [None]:
# Example: Classify cell quality based on QC metrics
def classify_cell_quality(n_genes, pct_mito):
    """
    Classify cell quality based on genes detected and mitochondrial content.
    """
    if n_genes < 200:
        quality = "Low quality (too few genes)"
    elif pct_mito > 20:
        quality = "Low quality (high mitochondrial content)"
    elif n_genes > 5000:
        quality = "Potential doublet (too many genes)"
    else:
        quality = "Good quality"
    
    return quality

# Test with different cells
cells = [
    (150, 5),    # Low genes
    (1500, 25),  # High mito
    (6000, 5),   # Potential doublet
    (2000, 8)    # Good cell
]

print("Cell Quality Classification:")
print("=" * 60)
for i, (genes, mito) in enumerate(cells, 1):
    quality = classify_cell_quality(genes, mito)
    print(f"Cell {i}: {genes} genes, {mito}% mito → {quality}")

### For Loops

Iterate over a sequence (list, tuple, string, etc.):

```python
for item in sequence:
    # code block executed for each item
```

In [None]:
# Example: Iterate over gene names
genes = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]

print("Cancer-associated genes:")
for gene in genes:
    print(f"  - {gene}")

# Using range() for numeric iterations
print("\nProcessing 5 samples:")
for i in range(5):
    print(f"  Processing sample {i+1}...")

# Using enumerate() to get index and value
print("\nGenes with indices:")
for idx, gene in enumerate(genes, start=1):
    print(f"  {idx}. {gene}")

### While Loops

Execute code while a condition is True:

```python
while condition:
    # code block
    # (must eventually make condition False to avoid infinite loop!)
```

In [None]:
# Example: Simulate sequencing depth until target is reached
target_reads = 10000
current_reads = 0
cycle = 0

print("Sequencing simulation:")
while current_reads < target_reads:
    cycle += 1
    reads_this_cycle = 1500  # Reads per cycle
    current_reads += reads_this_cycle
    print(f"  Cycle {cycle}: {current_reads:,} total reads")

print(f"\nTarget of {target_reads:,} reads reached in {cycle} cycles!")
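
Because each cycle in the simulation above adds a fixed number of reads, the number of cycles can also be worked out directly. This sketch uses `math.ceil` and the same numbers as the loop; the variable names are illustrative.

```python
import math

target_reads = 10000
reads_per_cycle = 1500

# Smallest whole number of cycles whose total meets or exceeds the target
cycles_needed = math.ceil(target_reads / reads_per_cycle)
final_reads = cycles_needed * reads_per_cycle

print(f"Cycles needed: {cycles_needed}")
print(f"Reads after final cycle: {final_reads:,}")
```

A `while` loop is still the right tool when the amount added per iteration is not known in advance.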

### Break and Continue

- `break`: Exit the loop early
- `continue`: Skip to the next iteration

In [None]:
# Example: Find first gene above expression threshold
gene_expression = {
    "ACTB": 250,   # Housekeeping gene
    "GAPDH": 300,  # Housekeeping gene
    "CD3D": 15,    # T cell marker
    "CD8A": 80,    # CD8 T cell marker
    "CD4": 120     # CD4 T cell marker
}

threshold = 100

print(f"Finding first gene with expression > {threshold}:")
for gene, expr in gene_expression.items():
    print(f"  Checking {gene}: {expr}")
    if expr > threshold:
        print(f"  → Found: {gene} with expression {expr}!")
        break  # Stop searching

# Example: Skip low-expressed genes
print(f"\nGenes with expression > {threshold}:")
for gene, expr in gene_expression.items():
    if expr <= threshold:
        continue  # Skip this gene
    print(f"  {gene}: {expr}")
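
The search loop above does not report the case where no gene exceeds the threshold. One possible way to handle it is the `else` clause of a `for` loop, which runs only when the loop ends without `break`. A minimal sketch with a deliberately high threshold (the values are illustrative):

```python
gene_expression = {"ACTB": 250, "GAPDH": 300, "CD3D": 15}
threshold = 500

for gene, expr in gene_expression.items():
    if expr > threshold:
        first_hit = gene
        break
else:
    # Runs only when the loop finished without hitting `break`
    first_hit = None

print(f"First gene above {threshold}: {first_hit}")
```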

<a id='lists'></a>
## 5. Data Structures: Lists

### Lists

Lists are **ordered, mutable** collections that can contain items of different types.

**Key Features:**
- Ordered (items have a defined position)
- Mutable (can be changed after creation)
- Allow duplicates
- Can contain mixed types
- Defined with square brackets `[]`

In [None]:
# Creating lists
genes = ["TP53", "BRCA1", "EGFR", "MYC"]
expression = [120, 45, 89, 210]
mixed_list = ["CD8A", 150, True, 3.14]

print("Gene list:", genes)
print("Expression values:", expression)
print("Mixed types:", mixed_list)

# List length
print(f"\nNumber of genes: {len(genes)}")

In [None]:
# Accessing list elements (indexing starts at 0)
print("List indexing:")
print(f"First gene: {genes[0]}")
print(f"Second gene: {genes[1]}")
print(f"Last gene: {genes[-1]}")
print(f"Second to last: {genes[-2]}")

# Slicing lists
print("\nList slicing:")
print(f"First two genes: {genes[0:2]}")
print(f"From index 1 onward: {genes[1:]}")
print(f"Up to index 3: {genes[:3]}")
print(f"Every other gene: {genes[::2]}")

In [None]:
# Modifying lists
print("Original list:", genes)

# Append (add to end)
genes.append("KRAS")
print("After append:", genes)

# Insert at specific position
genes.insert(2, "PTEN")
print("After insert:", genes)

# Remove specific element
genes.remove("PTEN")
print("After remove:", genes)

# Pop (remove and return last element)
last_gene = genes.pop()
print(f"Popped: {last_gene}")
print("After pop:", genes)

# Extend (add multiple elements)
more_genes = ["APC", "RB1"]
genes.extend(more_genes)
print("After extend:", genes)

In [None]:
# Useful list methods
numbers = [3, 1, 4, 1, 5, 9, 2, 6]

print(f"Original: {numbers}")
print(f"Sorted: {sorted(numbers)}")
print(f"Reversed: {list(reversed(numbers))}")
print(f"Sum: {sum(numbers)}")
print(f"Max: {max(numbers)}")
print(f"Min: {min(numbers)}")
print(f"Count of 1: {numbers.count(1)}")
print(f"Index of 5: {numbers.index(5)}")
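
Because lists are mutable, it matters whether an operation returns a new list or changes the existing one. The sketch below contrasts the built-in `sorted()` with the list method `.sort()`:

```python
numbers = [3, 1, 4, 1, 5]

new_list = sorted(numbers)      # returns a new sorted list
print(numbers)                  # original order is unchanged here
print(new_list)

result = numbers.sort()         # sorts in place and returns None
print(numbers)
print(result)
```

A common mistake is writing `numbers = numbers.sort()`, which replaces the list with `None`.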

In [None]:
# List comprehensions (compact way to create lists)
# Syntax: [expression for item in iterable if condition]

# Example 1: Square of numbers
squares = [x**2 for x in range(1, 6)]
print(f"Squares: {squares}")

# Example 2: Filter expressed genes (expression > 100)
genes = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]
expression = [120, 45, 89, 210, 180]

highly_expressed = [genes[i] for i in range(len(genes)) if expression[i] > 100]
print(f"\nHighly expressed genes (>100): {highly_expressed}")

# Example 3: Gene name lengths
gene_lengths = [len(gene) for gene in genes]
print(f"Gene name lengths: {gene_lengths}")
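
Example 2 above walks two parallel lists by index. An alternative sketch uses the built-in `zip()`, which pairs up items from both lists so no index is needed:

```python
genes = ["TP53", "BRCA1", "EGFR", "MYC", "KRAS"]
expression = [120, 45, 89, 210, 180]

# zip() yields (gene, expr) pairs, one per position
highly_expressed = [gene for gene, expr in zip(genes, expression) if expr > 100]
print(highly_expressed)
```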

<a id='tuples'></a>
## 6. Data Structures: Tuples

### Tuples

Tuples are **ordered, immutable** collections.

**Key Features:**
- Ordered (items have a defined position)
- **Immutable** (cannot be changed after creation)
- Allow duplicates
- Slightly more memory-efficient and faster to create than lists, since their size is fixed
- Defined with parentheses `()`

**Use tuples when:**
- Data should not be modified (e.g., coordinates, RGB colors)
- You need a dictionary key (lists can't be used as keys)
- Returning multiple values from a function
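
To illustrate the dictionary-key point, here is a small sketch that looks up a gene by a `(chromosome, start)` tuple. The dictionary name `gene_at_position` is an assumption for illustration; the coordinates are the ones used elsewhere in this lecture.

```python
# Keys are (chromosome, start) tuples; the values are gene names
gene_at_position = {
    ("chr17", 7661779): "TP53",
    ("chr7", 55086714): "EGFR",
}
print(gene_at_position[("chr17", 7661779)])

# A list cannot be used the same way because it is mutable (unhashable)
try:
    bad = {["chr17", 7661779]: "TP53"}
except TypeError as e:
    print(f"TypeError: {e}")
```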

In [None]:
# Creating tuples
coordinates = (10, 20)
gene_info = ("TP53", "chr17", 7661779, 7687538)
single_item = (42,)  # Note the comma!

print(f"Coordinates: {coordinates}")
print(f"Gene info: {gene_info}")
print(f"Type: {type(coordinates)}")

# Unpacking a tuple into variables (indexing such as gene_info[0] also works, as with lists)
gene_name, chromosome, start, end = gene_info  # Unpacking
print(f"\nGene: {gene_name}")
print(f"Location: {chromosome}:{start}-{end}")

# Tuples are immutable
try:
    coordinates[0] = 15  # This will raise an error
except TypeError as e:
    print(f"\nError: {e}")

In [None]:
# Biological example: Store gene annotations as tuples
gene_annotations = [
    ("TP53", "chr17", 7661779, 7687538, "tumor suppressor"),
    ("BRCA1", "chr17", 43044295, 43125483, "DNA repair"),
    ("EGFR", "chr7", 55086714, 55275031, "receptor tyrosine kinase")
]

print("Gene Annotations:")
print("=" * 70)
for gene, chrom, start, end, function in gene_annotations:
    length = end - start
    print(f"{gene:8} | {chrom:5} | {start:>10}-{end:<10} | {length:>7} bp | {function}")

<a id='dictionaries'></a>
## 7. Data Structures: Dictionaries

### Dictionaries

Dictionaries store **key-value pairs**. They are mutable collections in which each value is looked up by its key.

**Key Features:**
- Insertion-ordered (guaranteed since Python 3.7), but accessed by key rather than by position
- Mutable (can be changed)
- Keys must be unique and hashable, which in practice means immutable types (strings, numbers, tuples)
- Values can be any type
- Very fast lookups
- Defined with curly braces `{}`

**Common uses in bioinformatics:**
- Gene expression: `{"TP53": 120, "BRCA1": 45}`
- Annotations: `{"cell_1": "T cell", "cell_2": "B cell"}`
- Configuration: `{"min_genes": 200, "max_mito": 20}`

In [None]:
# Creating dictionaries
gene_expression = {
    "TP53": 120,
    "BRCA1": 45,
    "EGFR": 89,
    "MYC": 210
}

# Alternative creation methods
cell_types = dict(cell_1="T cell", cell_2="B cell", cell_3="Monocyte")

print("Gene expression:", gene_expression)
print("Cell types:", cell_types)

In [None]:
# Accessing dictionary values
print("Accessing values:")
print(f"TP53 expression: {gene_expression['TP53']}")

# Safe access with get() (returns None if key doesn't exist)
print(f"KRAS expression: {gene_expression.get('KRAS')}")
print(f"KRAS expression (with default): {gene_expression.get('KRAS', 0)}")

# Adding/modifying entries
gene_expression["KRAS"] = 180  # Add new key-value
gene_expression["TP53"] = 150  # Modify existing value
print(f"\nUpdated: {gene_expression}")

# Removing entries
del gene_expression["KRAS"]  # Remove key-value pair
popped_value = gene_expression.pop("EGFR")  # Remove and return value
print(f"\nAfter deletions: {gene_expression}")
print(f"Popped value: {popped_value}")
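
`get()` with a default value is also a convenient way to count things. A minimal sketch that counts the bases in a DNA string (the sequence is the one used earlier in this lecture):

```python
dna_sequence = "ATCGATCGATCG"

base_counts = {}
for base in dna_sequence:
    # get() returns 0 the first time a base is seen
    base_counts[base] = base_counts.get(base, 0) + 1

print(base_counts)
```

Without the default, the first occurrence of each base would raise a `KeyError`.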

In [None]:
# Dictionary methods
print("Dictionary keys:", gene_expression.keys())
print("Dictionary values:", gene_expression.values())
print("Dictionary items:", gene_expression.items())

# Checking membership
print(f"\nIs 'TP53' in dictionary? {'TP53' in gene_expression}")
print(f"Is 'KRAS' in dictionary? {'KRAS' in gene_expression}")

# Iterating over dictionaries
print("\nGene Expression Values:")
for gene, expr in gene_expression.items():
    print(f"  {gene}: {expr}")

In [None]:
# Dictionary comprehensions
# Syntax: {key_expr: value_expr for item in iterable if condition}

# Example 1: Square numbers
squares_dict = {x: x**2 for x in range(1, 6)}
print(f"Squares: {squares_dict}")

# Example 2: Filter highly expressed genes
gene_expression = {"TP53": 120, "BRCA1": 45, "EGFR": 89, "MYC": 210}
highly_expressed = {gene: expr for gene, expr in gene_expression.items() if expr > 100}
print(f"\nHighly expressed genes: {highly_expressed}")

# Example 3: Nested dictionary for cell metadata
cell_metadata = {
    "cell_1": {"type": "T cell", "n_genes": 1500, "pct_mito": 5},
    "cell_2": {"type": "B cell", "n_genes": 1200, "pct_mito": 8},
    "cell_3": {"type": "Monocyte", "n_genes": 1800, "pct_mito": 12}
}

print("\nCell Metadata:")
for cell_id, metadata in cell_metadata.items():
    print(f"  {cell_id}: {metadata['type']}, {metadata['n_genes']} genes, {metadata['pct_mito']}% mito")

<a id='functions'></a>
## 8. Functions

### Defining Functions

Functions are reusable blocks of code that perform specific tasks.

```python
def function_name(parameter1, parameter2):
    """
    Docstring: Describe what the function does.
    """
    # function body
    result = parameter1 + parameter2
    return result
```

**Best Practices:**
- Use descriptive names (verb + noun: `calculate_expression`, `filter_cells`)
- Include docstrings
- Keep functions focused on one task
- Use type hints (optional but helpful)
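
As an illustration of the last point, here is a sketch of a small function with type hints. The function name `count_expressed_genes` and its parameters are hypothetical; `typing.Dict` is used so the example also runs on Python 3.8.

```python
from typing import Dict

def count_expressed_genes(expression: Dict[str, float], threshold: float = 0) -> int:
    """Count genes whose expression is strictly above `threshold`."""
    return sum(1 for value in expression.values() if value > threshold)

n_expressed = count_expressed_genes({"TP53": 120, "BRCA1": 45, "EGFR": 0}, threshold=50)
print(n_expressed)
```

Type hints are not enforced when the code runs; they document intent and let editors and type checkers flag mismatches.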

In [None]:
# Basic function
def calculate_gc_content(sequence):
    """
    Calculate GC content of a DNA sequence.
    
    Parameters:
    -----------
    sequence : str
        DNA sequence string
    
    Returns:
    --------
    float
        GC content as percentage
    """
    sequence = sequence.upper()
    gc_count = sequence.count('G') + sequence.count('C')
    total = len(sequence)
    gc_percent = (gc_count / total) * 100 if total > 0 else 0.0
    return gc_percent

# Test the function
dna_seq = "ATCGATCGATCG"
gc = calculate_gc_content(dna_seq)
print(f"Sequence: {dna_seq}")
print(f"GC content: {gc:.1f}%")

In [None]:
# Function with default parameters
def quality_control(n_genes, total_counts, pct_mito, 
                   min_genes=200, max_genes=5000, max_mito=20,
                   min_counts=500):
    """
    Perform quality control on a cell.
    
    Parameters:
    -----------
    n_genes : int
        Number of genes detected
    total_counts : int
        Total UMI counts
    pct_mito : float
        Mitochondrial percentage
    min_genes : int, optional
        Minimum genes threshold (default: 200)
    max_genes : int, optional
        Maximum genes threshold (default: 5000)
    max_mito : float, optional
        Maximum mitochondrial percentage (default: 20)
    min_counts : int, optional
        Minimum total UMI counts threshold (default: 500)
    
    Returns:
    --------
    bool
        True if cell passes QC, False otherwise
    """
    passes_gene_filter = (n_genes >= min_genes) and (n_genes <= max_genes)
    # total_counts was previously accepted but never checked
    passes_count_filter = total_counts >= min_counts
    passes_mito_filter = pct_mito <= max_mito
    
    return passes_gene_filter and passes_count_filter and passes_mito_filter

# Test with default parameters
result1 = quality_control(1500, 4000, 8)
print(f"Cell 1 (1500 genes, 8% mito): Passes QC = {result1}")

# Test with custom thresholds
result2 = quality_control(1500, 4000, 8, min_genes=500, max_mito=10)
print(f"Cell 2 (1500 genes, 8% mito, stricter): Passes QC = {result2}")

In [None]:
# Function returning multiple values
def analyze_expression(gene_dict):
    """
    Analyze gene expression dictionary.
    
    Returns:
    --------
    tuple
        (mean, max_gene, max_value, count)
    """
    values = list(gene_dict.values())
    mean_expr = sum(values) / len(values)
    max_gene = max(gene_dict, key=gene_dict.get)
    max_value = gene_dict[max_gene]
    count = len(gene_dict)
    
    return mean_expr, max_gene, max_value, count

# Test the function
expression = {"TP53": 120, "BRCA1": 45, "EGFR": 89, "MYC": 210}
mean, top_gene, top_expr, n_genes = analyze_expression(expression)

print("Expression Analysis:")
print(f"  Mean expression: {mean:.1f}")
print(f"  Top gene: {top_gene} ({top_expr})")
print(f"  Number of genes: {n_genes}")

In [None]:
# Lambda functions (anonymous, one-line functions)
# Syntax: lambda parameters: expression

# Example 1: Simple calculation
square = lambda x: x**2
print(f"Square of 5: {square(5)}")

# Example 2: Sorting genes by expression
gene_expr = [("TP53", 120), ("BRCA1", 45), ("MYC", 210), ("EGFR", 89)]
sorted_genes = sorted(gene_expr, key=lambda x: x[1], reverse=True)

print("\nGenes sorted by expression (high to low):")
for gene, expr in sorted_genes:
    print(f"  {gene}: {expr}")

<a id='file-io'></a>
## 9. File Input/Output

### Reading Files

Reading data from files is essential for bioinformatics:

```python
# Basic file reading
with open('filename.txt', 'r') as file:
    content = file.read()
```

**File modes:**
- `'r'`: Read (default)
- `'w'`: Write (overwrites existing file)
- `'a'`: Append
- `'r+'`: Read and write
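
The difference between `'w'` and `'a'` is easiest to see by running both. This sketch writes to a temporary directory so that no existing files are touched; the file name `genes.txt` is arbitrary.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "genes.txt")

    with open(path, "w") as f:      # 'w' creates the file (or overwrites it)
        f.write("TP53\n")

    with open(path, "a") as f:      # 'a' adds to the end
        f.write("BRCA1\n")

    with open(path, "r") as f:
        after_append = f.read().splitlines()

    with open(path, "w") as f:      # 'w' again discards the earlier content
        f.write("EGFR\n")

    with open(path, "r") as f:
        after_overwrite = f.read().splitlines()

print(after_append)
print(after_overwrite)
```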

In [None]:
# Write a sample file first
sample_data = """gene_name,expression,cell_type
TP53,120,T cell
BRCA1,45,T cell
EGFR,89,B cell
MYC,210,B cell
KRAS,180,Monocyte
"""

with open('sample_data.csv', 'w') as f:
    f.write(sample_data)

print("Sample file created: sample_data.csv")

In [None]:
# Read file line by line
print("Reading file line by line:")
print("=" * 50)

with open('sample_data.csv', 'r') as f:
    header = f.readline().strip()  # Read first line (header)
    print(f"Header: {header}\n")
    
    for line in f:
        line = line.strip()  # Remove whitespace
        if line:  # Skip empty lines
            gene, expr, cell_type = line.split(',')
            print(f"  {gene:8} | {expr:>4} | {cell_type}")

In [None]:
# Parse CSV file into a list of dictionaries
def read_expression_data(filename):
    """
    Read gene expression data from CSV file.
    
    Returns:
    --------
    list of dict
        List of dictionaries, one per row
    """
    data = []
    with open(filename, 'r') as f:
        header = f.readline().strip().split(',')  # Read the header row so the loop starts at the data
        
        for line in f:
            if line.strip():
                values = line.strip().split(',')
                row_dict = {
                    'gene': values[0],
                    'expression': int(values[1]),
                    'cell_type': values[2]
                }
                data.append(row_dict)
    
    return data

# Test the function
expression_data = read_expression_data('sample_data.csv')

print("Parsed data:")
for row in expression_data:
    print(f"  {row}")
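
Splitting on commas by hand works for simple files but breaks when a field itself contains a comma. As an alternative sketch, the standard library's `csv.DictReader` handles quoting and uses the header row for the dictionary keys. `io.StringIO` stands in for a file here so the example is self-contained.

```python
import csv
import io

csv_text = """gene_name,expression,cell_type
TP53,120,T cell
BRCA1,45,T cell
"""

# io.StringIO lets the text be read like a file; with a real file, pass the open file object
reader = csv.DictReader(io.StringIO(csv_text))

rows = []
for row in reader:
    row["expression"] = int(row["expression"])  # values are read as strings
    rows.append(row)

print(rows)
```

Note that the keys come from the header (`gene_name`), unlike the hand-written parser above, which renames the first column to `gene`.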

In [None]:
# Writing files
output_data = [
    {'gene': 'APC', 'expression': 95, 'cell_type': 'T cell'},
    {'gene': 'RB1', 'expression': 130, 'cell_type': 'B cell'}
]

with open('output_data.csv', 'w') as f:
    # Write header
    f.write('gene_name,expression,cell_type\n')
    
    # Write data rows
    for row in output_data:
        f.write(f"{row['gene']},{row['expression']},{row['cell_type']}\n")

print("Output file written: output_data.csv")

# Verify by reading it back
with open('output_data.csv', 'r') as f:
    print("\nFile contents:")
    print(f.read())

<a id='debugging'></a>
## 10. Debugging Techniques

### Common Error Types

1. **SyntaxError**: Invalid Python syntax
2. **NameError**: Variable not defined
3. **TypeError**: Operation on wrong type
4. **IndexError**: Index out of range
5. **KeyError**: Dictionary key doesn't exist
6. **ValueError**: Invalid value for operation
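
Each of these error types can be triggered on purpose to see what it looks like. The sketch below wraps one small failing action per error type; the helper `error_name` and the example actions are illustrative assumptions. `compile()` is used for the syntax example because a real syntax error would stop the whole cell from running.

```python
def error_name(action):
    """Run `action` and return the name of the exception it raises, if any."""
    try:
        action()
    except Exception as e:
        return type(e).__name__
    return "no error"

examples = {
    "undefined variable": lambda: undefined_gene_list,
    "adding str and int": lambda: "TP53" + 1,
    "list index too large": lambda: ["TP53"][3],
    "missing dictionary key": lambda: {"TP53": 120}["EGFR"],
    "non-numeric string to int": lambda: int("high"),
    "invalid syntax": lambda: compile("x = = 1", "<example>", "exec"),
}

for description, action in examples.items():
    print(f"{description}: {error_name(action)}")
```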

### Debugging Strategies

1. **Print debugging**: Add print statements to track values
2. **Assert statements**: Check assumptions
3. **Try-except**: Handle expected errors gracefully
4. **Debugger**: Use Python debugger (pdb) for step-by-step execution

In [None]:
# Example 1: IndexError
genes = ["TP53", "BRCA1", "EGFR"]

try:
    print(genes[5])  # This will raise IndexError
except IndexError as e:
    print(f"IndexError: {e}")
    print(f"List has only {len(genes)} elements (indices 0-{len(genes)-1})")

# Example 2: KeyError
expression = {"TP53": 120, "BRCA1": 45}

try:
    value = expression["EGFR"]  # This will raise KeyError
except KeyError as e:
    print(f"\nKeyError: {e}")
    print(f"Available keys: {list(expression.keys())}")
    # Use get() method instead
    value = expression.get("EGFR", 0)
    print(f"Using get() with default: {value}")

In [None]:
# Print debugging example
def calculate_mean_expression(gene_list, expression_dict):
    """
    Calculate mean expression of genes in list.
    """
    print(f"DEBUG: Processing {len(gene_list)} genes")  # Debug print
    
    total = 0
    count = 0
    
    for gene in gene_list:
        if gene in expression_dict:
            expr = expression_dict[gene]
            print(f"DEBUG: {gene} = {expr}")  # Debug print
            total += expr
            count += 1
        else:
            print(f"DEBUG: {gene} not found, skipping")  # Debug print
    
    mean = total / count if count > 0 else 0
    print(f"DEBUG: Mean = {total} / {count} = {mean}")  # Debug print
    return mean

# Test
genes = ["TP53", "BRCA1", "UNKNOWN"]
expr = {"TP53": 120, "BRCA1": 45, "EGFR": 89}

result = calculate_mean_expression(genes, expr)
print(f"\nFinal result: {result:.1f}")
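
Debug prints have to be deleted or commented out once the bug is found. One alternative worth knowing is the standard `logging` module, where messages can be switched off by changing a level. This is a sketch only; the logger name `expression_demo` and the function `mean_expression` are made up for the example.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
logger = logging.getLogger("expression_demo")

def mean_expression(values):
    """Return the mean of `values`, or 0 for an empty list."""
    logger.debug("Processing %d values", len(values))
    mean = sum(values) / len(values) if values else 0
    logger.debug("Mean = %s", mean)
    return mean

result = mean_expression([120, 45])
print(f"Result: {result:.1f}")

# Raise the level to silence the debug messages without editing the function
logger.setLevel(logging.WARNING)
mean_expression([89, 210])
```

The second call produces no debug output because the logger now ignores anything below `WARNING`.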

In [None]:
# Assert statements for checking assumptions
def normalize_expression(values):
    """
    Normalize expression values to 0-1 range.
    """
    # Check input assumptions
    assert len(values) > 0, "Cannot normalize empty list"
    assert all(v >= 0 for v in values), "All values must be non-negative"
    
    max_val = max(values)
    assert max_val > 0, "Maximum value must be positive"
    
    normalized = [v / max_val for v in values]
    
    # Check output assumptions
    assert len(normalized) == len(values), "Output length mismatch"
    assert all(0 <= v <= 1 for v in normalized), "Normalized values out of range"
    
    return normalized

# Test with valid data
expr_values = [120, 45, 89, 210]
normalized = normalize_expression(expr_values)
print(f"Original: {expr_values}")
print(f"Normalized: {[f'{v:.2f}' for v in normalized]}")

# Test with invalid data (uncomment to see assertion error)
# invalid = [120, -45, 89]  # Negative value
# normalize_expression(invalid)
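
Assert statements are skipped when Python is run with the `-O` flag, so they suit internal sanity checks better than validation of user input. A possible variant of the function above that raises `ValueError` instead is sketched here; the name `normalize_expression_checked` is hypothetical.

```python
def normalize_expression_checked(values):
    """Scale values to the 0-1 range, raising ValueError on invalid input."""
    if len(values) == 0:
        raise ValueError("Cannot normalize an empty list")
    if any(v < 0 for v in values):
        raise ValueError("All values must be non-negative")
    max_val = max(values)
    if max_val == 0:
        raise ValueError("Maximum value must be positive")
    return [v / max_val for v in values]

try:
    normalize_expression_checked([120, -45, 89])
except ValueError as e:
    print(f"ValueError: {e}")
```

Callers can then catch `ValueError` specifically, as in the `try`/`except` examples earlier in this section.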

<a id='bio-examples'></a>
## 11. Biological Data Examples

Let's apply what we've learned to realistic biological data analysis scenarios.

In [None]:
# Example 1: Sequence Analysis
def reverse_complement(dna_seq):
    """
    Calculate reverse complement of DNA sequence.
    """
    complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    
    # Reverse the sequence and complement each base
    rev_comp = ''.join([complement[base] for base in reversed(dna_seq.upper())])
    
    return rev_comp

# Test
sequence = "ATCGATCG"
rev_comp = reverse_complement(sequence)

print("DNA Sequence Analysis:")
print(f"  Original:           5'-{sequence}-3'")
print(f"  Reverse complement: 5'-{rev_comp}-3'")
print(f"  GC content:         {calculate_gc_content(sequence):.1f}%")
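
The dictionary-based function above raises a `KeyError` on any character other than A, T, G, or C. A possible alternative is sketched below using `str.maketrans` and `str.translate`. Mapping the ambiguous base `N` to itself is an assumption added for this example, and the function name is hypothetical.

```python
# Translation table: each base maps to its complement; N stays N (assumption)
COMPLEMENT_TABLE = str.maketrans("ACGTN", "TGCAN")

def reverse_complement_translate(dna_seq):
    """Reverse complement via str.translate; N is kept as N."""
    return dna_seq.upper().translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("ATCGATCG"))
print(reverse_complement_translate("atgn"))
```

Characters missing from the table are passed through unchanged by `translate`, so unexpected input no longer raises an error; whether that is desirable depends on the analysis.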

In [None]:
# Example 2: Cell Quality Control Analysis
def analyze_cell_population(cell_data):
    """
    Analyze QC metrics for a population of cells.
    
    Parameters:
    -----------
    cell_data : list of dict
        Each dict contains 'n_genes', 'total_counts', 'pct_mito'
    
    Returns:
    --------
    dict
        Summary statistics
    """
    n_cells = len(cell_data)
    
    # Extract metrics
    genes_list = [cell['n_genes'] for cell in cell_data]
    counts_list = [cell['total_counts'] for cell in cell_data]
    mito_list = [cell['pct_mito'] for cell in cell_data]
    
    # Calculate statistics
    summary = {
        'n_cells': n_cells,
        'mean_genes': sum(genes_list) / n_cells,
        'mean_counts': sum(counts_list) / n_cells,
        'mean_mito': sum(mito_list) / n_cells,
        # Middle element of the sorted list (the exact median only when n_cells is odd)
        'median_genes': sorted(genes_list)[n_cells // 2],
        'cells_passing_qc': sum(1 for cell in cell_data 
                               if quality_control(cell['n_genes'], 
                                                cell['total_counts'], 
                                                cell['pct_mito']))
    }
    
    summary['pass_rate'] = (summary['cells_passing_qc'] / n_cells) * 100
    
    return summary

# Test data
cells = [
    {'n_genes': 1500, 'total_counts': 4000, 'pct_mito': 5},
    {'n_genes': 2000, 'total_counts': 6000, 'pct_mito': 8},
    {'n_genes': 150, 'total_counts': 500, 'pct_mito': 3},   # Low quality
    {'n_genes': 1800, 'total_counts': 5000, 'pct_mito': 25}, # High mito
    {'n_genes': 2200, 'total_counts': 7000, 'pct_mito': 6}
]

summary = analyze_cell_population(cells)

print("Population Analysis:")
print("=" * 50)
print(f"Total cells: {summary['n_cells']}")
print(f"Mean genes detected: {summary['mean_genes']:.0f}")
print(f"Mean UMI counts: {summary['mean_counts']:.0f}")
print(f"Mean mitochondrial %: {summary['mean_mito']:.1f}%")
print(f"Cells passing QC: {summary['cells_passing_qc']} ({summary['pass_rate']:.0f}%)")
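
The `median_genes` entry above takes the middle element of the sorted list, which matches the median only for an odd number of cells. The sketch below compares that shortcut with `statistics.median` from the standard library, which averages the two middle values when the count is even. The gene counts are taken from the test data above.

```python
import statistics

genes_odd = [1500, 2000, 150, 1800, 2200]
genes_even = [1500, 2000, 150, 1800]

for values in (genes_odd, genes_even):
    middle_element = sorted(values)[len(values) // 2]
    true_median = statistics.median(values)
    print(f"n={len(values)}: middle element = {middle_element}, median = {true_median}")
```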

In [None]:
# Example 3: Gene Expression Matrix Simulation
import random
random.seed(42)

def simulate_expression_matrix(n_cells, n_genes):
    """
    Simulate a simple gene expression matrix.
    
    Returns:
    --------
    dict
        Gene names as keys, list of expression values as values
    """
    matrix = {}
    
    for i in range(n_genes):
        gene_name = f"Gene{i+1}"
        # Simulate sparse expression (many zeros)
        expression = [random.randint(0, 200) if random.random() > 0.7 else 0 
                     for _ in range(n_cells)]
        matrix[gene_name] = expression
    
    return matrix

# Simulate data
expr_matrix = simulate_expression_matrix(n_cells=5, n_genes=8)

# Display as table
print("Simulated Expression Matrix:")
print("=" * 60)

# Header
print(f"{'Gene':<10}", end='')
for i in range(5):
    # Right-align each label to width 6 so it lines up with the values below
    print(f"{'Cell' + str(i+1):>6}", end='  ')
print()
print("-" * 60)

# Data rows
for gene, values in expr_matrix.items():
    print(f"{gene:<10}", end='')
    for val in values:
        print(f"{val:>6}", end='  ')
    print()

# Calculate sparsity
total_values = sum(len(vals) for vals in expr_matrix.values())
zero_values = sum(val == 0 for vals in expr_matrix.values() for val in vals)
sparsity = (zero_values / total_values) * 100

print("\nMatrix Properties:")
print(f"  Dimensions: {len(expr_matrix)} genes × 5 cells")
print(f"  Sparsity: {sparsity:.1f}% zeros")
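
A matrix stored as gene → list of values can also be summarized per cell, which is how QC metrics such as total counts and genes detected arise. The sketch below uses `zip(*...)` to regroup the values by cell. The small matrix is hand-written for illustration rather than taken from the simulation above.

```python
# Hand-written example: 3 genes x 3 cells
expr_matrix = {
    "Gene1": [0, 12, 0],
    "Gene2": [5, 0, 0],
    "Gene3": [0, 30, 7],
}

# zip(*rows) regroups the per-gene lists into one tuple per cell (a transpose)
per_cell_values = list(zip(*expr_matrix.values()))
total_counts = [sum(cell) for cell in per_cell_values]
n_genes_detected = [sum(1 for v in cell if v > 0) for cell in per_cell_values]

print(total_counts)
print(n_genes_detected)
```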

<a id='summary'></a>
## 12. Summary and Key Takeaways

### What We Learned

1. **Variables and Data Types**
   - Basic types: int, float, str, bool
   - Type conversion
   - Variable naming conventions

2. **Operators**
   - Arithmetic: `+, -, *, /, //, %, **`
   - Comparison: `==, !=, >, <, >=, <=`
   - Logical: `and, or, not`

3. **Control Structures**
   - `if-elif-else` statements
   - `for` loops and `range()`
   - `while` loops
   - `break` and `continue`

4. **Data Structures**
   - **Lists**: Ordered, mutable `[1, 2, 3]`
   - **Tuples**: Ordered, immutable `(1, 2, 3)`
   - **Dictionaries**: Key-value pairs `{"gene": "TP53"}`
   - List/dict comprehensions for concise code

5. **Functions**
   - Define reusable code blocks
   - Parameters and return values
   - Default parameters
   - Lambda functions

6. **File I/O**
   - Reading files with `open()`
   - Writing files
   - Parsing CSV data

7. **Debugging**
   - Common error types
   - `try-except` blocks
   - Assert statements
   - Print debugging
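
As a compact recap, the sketch below combines several of the ideas listed above in one place: dictionaries in a list, a function with default parameters, type conversion guarded by `try`/`except`, a list comprehension, a lambda used as a sort key, and an `assert` as a sanity check. The records and the helper `passes_qc` are hypothetical and exist only for illustration; the thresholds are example values, not recommendations.

```python
# Hypothetical QC records; the field names mirror the lecture's cell dictionaries
cells = [
    {"id": "cell_a", "n_genes": 1200, "pct_mito": 4.0},
    {"id": "cell_b", "n_genes": 150, "pct_mito": 2.5},
    {"id": "cell_c", "n_genes": 2400, "pct_mito": "NA"},  # malformed value
]

def passes_qc(cell, min_genes=200, max_mito=20.0):
    """Return True if a cell meets both thresholds (default parameters)."""
    try:
        mito = float(cell["pct_mito"])  # type conversion may raise ValueError
    except ValueError:
        return False  # treat unparseable values as failing QC
    return cell["n_genes"] >= min_genes and mito <= max_mito

# List comprehension combined with a function call
passing_ids = [cell["id"] for cell in cells if passes_qc(cell)]

# Lambda as a sort key: order cells by genes detected, highest first
ranked = sorted(cells, key=lambda cell: cell["n_genes"], reverse=True)

# for loop with a conditional expression
for cell in ranked:
    status = "PASS" if cell["id"] in passing_ids else "FAIL"
    print(f"{cell['id']:<8}{cell['n_genes']:>6}  {status}")

# Assert statement as a sanity check
assert len(passing_ids) <= len(cells)
```

Returning `False` for an unparseable value is a design choice made for this sketch; depending on the analysis, it may be more appropriate to report such records instead of silently failing them.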

### Python Skills for Bioinformatics

✅ Read and write biological data files  
✅ Analyze sequences (GC content, reverse complement)  
✅ Filter cells based on QC metrics  
✅ Process gene expression data  
✅ Write reusable analysis functions  

### Next Steps

**Lecture 4: Quantification Pipeline**
- From FASTQ to count matrices
- Running kallisto|bustools (kb-python)
- Understanding pipeline outputs

**Future Topics:**
- NumPy for numerical computing
- Pandas for data manipulation
- Matplotlib/Seaborn for visualization
- Scanpy for single-cell analysis

---

<a id='resources'></a>
## 13. Additional Resources

### Official Documentation

- **Python Tutorial**: https://docs.python.org/3/tutorial/
- **Python Standard Library**: https://docs.python.org/3/library/
- **PEP 8 Style Guide**: https://peps.python.org/pep-0008/

### Interactive Learning

- **Codecademy Python**: https://www.codecademy.com/learn/learn-python-3
- **Python for Everybody**: https://www.py4e.com/
- **Real Python**: https://realpython.com/

### Bioinformatics-Specific

- **Rosalind Bioinformatics Problems**: http://rosalind.info/
- **Biopython Tutorial**: https://biopython.org/wiki/Documentation
- **Python for Biologists**: https://pythonforbiologists.com/

### Books

1. **"Python for Data Analysis"** by Wes McKinney (creator of pandas)
2. **"Bioinformatics with Python Cookbook"** by Tiago Antao
3. **"Python for Bioinformatics"** by Sebastian Bassi

### Video Tutorials

- **Corey Schafer's Python Tutorials**: https://www.youtube.com/c/Coreyms
- **Python for Bioinformatics**: https://www.youtube.com/@PythonforBioinformatics

---

<a id='homework'></a>
## 14. Homework Assignment

### Assignment: Python Fundamentals Practice

**Due:** Before Lecture 4  
**Points:** 100

---

#### Task 1: DNA Sequence Analysis (25 points)

Write a function `analyze_sequence(dna_seq)` that:
1. Calculates GC content
2. Finds the reverse complement
3. Counts each nucleotide (A, T, G, C)
4. Returns a dictionary with all results

**Test sequence:** `"ATCGATCGATCGATCGAAAA"`

---

#### Task 2: Cell Quality Control (30 points)

Given this list of cells:

```python
cells = [
    {'id': 'cell_1', 'n_genes': 1500, 'total_counts': 4000, 'pct_mito': 5},
    {'id': 'cell_2', 'n_genes': 2200, 'total_counts': 6500, 'pct_mito': 8},
    {'id': 'cell_3', 'n_genes': 180, 'total_counts': 600, 'pct_mito': 3},
    {'id': 'cell_4', 'n_genes': 1800, 'total_counts': 5200, 'pct_mito': 25},
    {'id': 'cell_5', 'n_genes': 6500, 'total_counts': 12000, 'pct_mito': 6},
    {'id': 'cell_6', 'n_genes': 2000, 'total_counts': 7000, 'pct_mito': 7}
]
```

1. Filter cells using these thresholds:
   - `200 ≤ n_genes ≤ 5000`
   - `pct_mito ≤ 20`
2. Calculate the percentage of cells that pass QC
3. For passing cells, calculate mean `n_genes` and mean `total_counts`
4. Print a summary report

---

#### Task 3: Gene Expression Analysis (25 points)

Given this gene expression dictionary:

```python
expression = {
    'TP53': 120, 'BRCA1': 45, 'EGFR': 189, 'MYC': 210,
    'KRAS': 180, 'APC': 95, 'RB1': 130, 'PTEN': 75
}
```

1. Find the top 3 most highly expressed genes
2. Calculate mean expression across all genes
3. Identify genes with expression > mean
4. Create a new dictionary with only highly expressed genes (>150)

---

#### Task 4: File Processing (20 points)

1. Create a CSV file named `markers.csv` with this content:
   ```
   gene,cell_type,expression
   CD3D,T cell,150
   CD8A,CD8 T cell,120
   CD4,CD4 T cell,95
   CD79A,B cell,180
   CD14,Monocyte,200
   ```

2. Write a function that:
   - Reads the file
   - Filters rows where expression > 100
   - Writes filtered results to `markers_filtered.csv`
   - Returns the number of rows that passed the filter (the rows written to the output file)

---

### Submission Guidelines

**Format:** Submit a Jupyter notebook (.ipynb) with:
- Code cells for each task
- Markdown cells explaining your approach
- Output showing results

**File name:** `lecture03_homework_[YourLastName].ipynb`

**Grading:**
- Code correctness: 60%
- Code quality (comments, style): 20%
- Documentation: 10%
- Output clarity: 10%

---

**Good luck!**

*End of Lecture 3*