# 🐍 Python Setup & Tooling

This notebook covers:
- **Installing Python 3.x**: Anaconda vs. venv  
- **IDEs & Editors**: VS Code, PyCharm, JupyterLab  
- **Package Management**: pip vs. conda  


## 1. Checking Your Current Python

Run this cell to see what Python you have on your PATH:


In [None]:
!python --version
!which python     # Linux/macOS only; comment this line out on Windows
# !where python   # Windows only; uncomment this line on Windows
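
Because `which` exists only on Linux/macOS and `where` only on Windows, one of the two shell commands fails on any given machine. A platform-independent sketch, using only the standard library, is to ask the running interpreter directly. Note that this reports the interpreter behind the notebook kernel, which may differ from the `python` found on your PATH.

```python
import platform
import sys

# Path of the interpreter that is executing this notebook kernel
print("Executable:", sys.executable)

# Version as a comparable tuple, e.g. (3, 11)
major, minor = sys.version_info[:2]
print(f"Python {major}.{minor} on {platform.system()}")

# Guard against running this material on Python 2
assert major >= 3, "This notebook assumes Python 3"
```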


### 1a. If you want Anaconda (all batteries included)

1. Download from https://www.anaconda.com/products/distribution   
2. Follow the installer wizard.  
3. Open **Anaconda Prompt** (Windows) or your shell (macOS/Linux) and verify:


In [None]:
!conda --version
!python --version


### 1b. If you prefer a lightweight venv

Create a virtual environment in your project folder:


In [None]:
!python -m venv myenv


In [None]:
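# Activation only affects the shell it runs in, and every `!` command starts
# a fresh subshell, so run ONE of these in a terminal, not in the notebook:
#
#   macOS / Linux:       source myenv/bin/activate
#   Windows PowerShell:  myenv\Scripts\Activate.ps1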


In [None]:
!which python   # points inside myenv only if the notebook kernel itself was started from myenv
!pip --version
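
A quick way to confirm from Python whether the current kernel runs inside a `venv` is to compare `sys.prefix` with `sys.base_prefix`. The helper name `in_virtualenv` is made up for this sketch, and the check is specific to `venv`-style environments; conda environments typically report `False` here.

```python
import sys

def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside venv:", in_virtualenv())
```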


## 2. IDEs & Editors

### 2a. VS Code
```python
# from within Jupyter you can even launch VS Code in this folder:
import subprocess
subprocess.run(["code", "."])
```

### 2b. JupyterLab

In [7]:
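# If not already installed:
!pip install jupyterlab   # or: conda install -c conda-forge jupyterlab
# Launch from a terminal rather than from this cell; `!jupyter lab` would
# block the kernel until you interrupt it:
#   jupyter lab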




## 3. Managing Packages: pip vs. conda

### 3a. pip (works in any venv/Anaconda env)
```bash
pip install numpy pandas matplotlib
pip list
```


In [None]:
# From your terminal (outside the notebook):
#   Launch the classic notebook:
#     jupyter notebook
#
#   Or the newer Lab interface:
#     jupyter lab


In [None]:
# From your shell (these are shell commands, not Python code):
#   python    starts the plain REPL
#   ipython   richer prompt with autocomplete, %magic commands, etc.


In [None]:
# Run `python --version` in a subprocess and capture its output
import subprocess
print("python version:")
print(subprocess.run(["python", "--version"], capture_output=True, text=True).stdout)

# IPython magic inside notebook:
print("IPython info:")
get_ipython().run_line_magic("whos", "")


In [None]:
# 1) Install if needed:
!pip install biopython

from Bio.Seq import Seq
from Bio import SeqIO

# Simple translation demo
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print("Protein translation:", dna.translate())

# 2) Parse a FASTA file in one line (assumes you uploaded sample.fasta)
records = list(SeqIO.parse("sample.fasta", "fasta"))
print(f"Found {len(records)} sequences. First ID:", records[0].id)



Protein translation: MAIVMGR*KGAR*


# 📐 Whitespace & Indentation Rules

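Python uses indentation to define code blocks. Every statement in a block must share the same indentation, and a line that dedents must return to the exact level of an enclosing block. Mixing tabs and spaces inconsistently raises a `TabError`, so PEP 8 recommends **4 spaces per indent level**.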

```python
# WRONG: a line dedents to a level that matches no enclosing block
def greet(name):
    if name:
        print(f"Hello, {name}")
      print("done")  # 6 spaces: neither the `if` body (8) nor the `def` body (4)

# → IndentationError: unindent does not match any outer indentation level

# CORRECT: consistent 4-space indentation
def greet(name):
    if name:
        print(f"Hello, {name}")
    else:
        print("Hello, world")
```
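
To see the error without breaking the notebook, one option is to hand the badly indented source to the built-in `compile()` as a string and catch the exception. This is a small sketch; the `bad_source` snippet is invented for illustration.

```python
# Source text whose last line dedents to 6 spaces, a level that matches
# neither the `if` body (8 spaces) nor the `def` body (4 spaces).
bad_source = (
    "def greet(name):\n"
    "    if name:\n"
    "        print('Hello')\n"
    "      print('done')\n"
)

error_name = None
error_message = None
try:
    compile(bad_source, "<example>", "exec")
except IndentationError as err:
    error_name = type(err).__name__
    error_message = err.msg

print(error_name, "-", error_message)
```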


In [10]:
# This is a single-line comment

def add(a, b):
    """
    add(a, b) → int or float
    Returns the sum of a and b.
    """
    return a + b

result = add(2, 3)  # inline comment explaining this call


In [11]:
# 1. Explicit continuation
total = 1 + 2 + 3 + \
        4 + 5 + 6

# 2. Implicit continuation in parentheses
numbers = [
    1, 2, 3,
    4, 5, 6
]

message = (
    "Python allows you to split long strings "
    "across multiple lines inside parentheses."
)

# Both forms execute without errors
print(total, numbers, message)


21 [1, 2, 3, 4, 5, 6] Python allows you to split long strings across multiple lines inside parentheses.


# 🏷️ Variables & Naming Conventions

Python is dynamically typed—you don’t declare variable types up front. By convention (PEP 8):

- **Variables** use `snake_case`  
- **Constants** use `UPPER_SNAKE_CASE`  
- A single leading underscore (`_var`) signals “internal use”  
- A double leading underscore (`__var`) triggers name-mangling in classes  


In [12]:
# Dynamic typing: no declarations needed
value = 10            # initially an integer
print(value, type(value))

value = "ten"         # now a string
print(value, type(value))


# PEP 8 naming: snake_case for variables
first_name = "Alice"
last_name  = "Smith"
full_name  = first_name + " " + last_name
print(full_name)


# Constants: UPPER_SNAKE_CASE
MAX_RETRIES     = 5
TIMEOUT_SECONDS = 30.0

for attempt in range(MAX_RETRIES):
    print(f"Attempt {attempt+1}/{MAX_RETRIES}")


# Underscore prefixes
_internal_counter = 0   # for module-internal use

class Example:
    def __init__(self, value):
        self._value = value       # “protected” by convention
        self.__secret = "xyz"     # name-mangled to _Example__secret

e = Example(123)
print("Internal:", e._value)
# print(e.__secret)        # AttributeError
print("Mangling gives:", e._Example__secret)


10 <class 'int'>
ten <class 'str'>
Alice Smith
Attempt 1/5
Attempt 2/5
Attempt 3/5
Attempt 4/5
Attempt 5/5
Internal: 123
Mangling gives: xyz


# 🧮 Basic Data Types

Python provides several built-in primitive types. In this section we’ll cover:

- **Integers (`int`)**  
- **Floating-point numbers (`float`)**  
- **Strings (`str`)**  
- **Booleans (`bool`)**  
- **Type inspection** with `type()` and `isinstance()`

---




In [15]:
# 1. Integers
x = 42
print(x, "→", type(x))   # <class 'int'>

# 2. Floating-point numbers
pi = 3.14159
print(pi, "→", type(pi)) # <class 'float'>

# 3. Strings
greeting = "Hello, bioscientist!"
print(greeting, "→", type(greeting))  # <class 'str'>

# 4. Booleans
flag_true = True
flag_false = False
print(flag_true, flag_false, "→", type(flag_true))  # <class 'bool'>

42 → <class 'int'>
3.14159 → <class 'float'>
Hello, bioscientist! → <class 'str'>
True False → <class 'bool'>


In [16]:
# 🔍 Type Checking
x = 42
# Using type()
print(type(x) == int)      # True
print(type(pi) == float)   # True

# Using isinstance()
print(isinstance(x, int))        # True
print(isinstance(pi, (int, float)))  # True, covers both types
print(isinstance(greeting, str)) # True
print(isinstance(flag_true, bool))  # True

# You can also guard operations:
def safe_divide(a, b):
    if not (isinstance(a, (int, float)) and isinstance(b, (int, float))):
        raise TypeError("Both operands must be numeric")
    return a / b

print(safe_divide(10, 2))
# print(safe_divide("10", 2))  # would raise TypeError


True
True
True
True
True
True
5.0
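
One subtlety of `isinstance()` checks like the one in `safe_divide`: `bool` is a subclass of `int`, so `True` and `False` pass a numeric check. If that matters for your data, a stricter guard can exclude booleans explicitly. The helper name `is_real_number` is hypothetical.

```python
def is_real_number(value):
    """True for int/float values, but not for bool (hypothetical helper)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

print(isinstance(True, int))   # True: bool is a subclass of int
print(is_real_number(True))    # False
print(is_real_number(3))       # True
print(is_real_number(2.5))     # True
print(is_real_number("2.5"))   # False
```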


# 🔢 Arithmetic Operators

Demonstrates the basic arithmetic operations in Python.

```python
a = 10
b = 3

# Addition
print("a + b =", a + b)        # 13

# Subtraction
print("a - b =", a - b)        # 7

# Multiplication
print("a * b =", a * b)        # 30

# True division
print("a / b =", a / b)        # 3.3333333333333335

# Floor division
print("a // b =", a // b)      # 3

# Modulus (remainder)
print("a % b =", a % b)        # 1

# Exponentiation
print("a ** b =", a ** b)      # 1000
```


# Comparison Operators


In [17]:
x = 5
y = 8

print("x == y:", x == y)   # False
print("x != y:", x != y)   # True
print("x < y:",  x < y)    # True
print("x > y:",  x > y)    # False
print("x <= y:", x <= y)   # True
print("x >= y:", x >= y)   # False


x == y: False
x != y: True
x < y: True
x > y: False
x <= y: True
x >= y: False
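
The comparisons above use integers, where `==` is exact. With floating-point values, rounding can make `==` surprising, so a tolerance-based comparison with `math.isclose` is usually safer. A short sketch follows; the measurement values are made up.

```python
import math

total = 0.1 + 0.2
print(total == 0.3)              # False: 0.1 and 0.2 have no exact binary representation
print(math.isclose(total, 0.3))  # True: equal within the default relative tolerance

# Hypothetical assay values compared with an explicit absolute tolerance
measured, expected = 7.4001, 7.4
within_tolerance = math.isclose(measured, expected, abs_tol=1e-3)
print(within_tolerance)          # True
```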


# Logical Operators

In [18]:
p = True
q = False

# AND: True only if both are True
print("p and q:", p and q)   # False

# OR: True if at least one is True
print("p or q:",  p or q)    # True

# NOT: inverts the boolean
print("not p:",   not p)     # False

# Combining comparisons
age = 25
has_permission = True

can_enter = (age >= 18) and has_permission
print("Can enter club:", can_enter)  # True


p and q: False
p or q: True
not p: False
Can enter club: True
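
Two behaviors of `and` / `or` are worth knowing beyond the truth tables: evaluation stops as soon as the result is decided (short-circuiting), and the operators return one of their operands rather than a strict `bool`. The sketch below uses a `calls` list purely to show whether the function ran.

```python
calls = []  # records every time is_adult() actually runs

def is_adult(age):
    calls.append(age)
    return age >= 18

has_permission = False

# `and` stops at the first falsy operand, so is_adult() is never called here
can_enter = has_permission and is_adult(25)
print(can_enter, calls)   # False []

# `or` returns the first truthy operand, not necessarily a bool
patient_name = ""
display_name = patient_name or "Unknown"
print(display_name)       # Unknown
```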


# 🛑 Control Flow: Conditional Statements

Python’s branching syntax lets you execute different code paths based on conditions.

---

## 1. `if` / `elif` / `else` Syntax

```python
if condition1:
    ...  # executed when condition1 is True
elif condition2:
    ...  # executed when condition2 is True (and condition1 was False)
else:
    ...  # executed when neither condition1 nor condition2 is True
```


## 2. Chained Comparisons

In [20]:
systolic = 128
# Check if systolic blood pressure is in the “elevated” range
if 120 <= systolic < 130:
    category = "Elevated"
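
A chained comparison is shorthand for two comparisons joined by `and`, with the middle expression evaluated only once. The sketch below checks that both spellings agree at the range boundaries; the function names are illustrative only.

```python
def in_elevated_range(systolic):
    # Chained form: `systolic` is evaluated once
    return 120 <= systolic < 130

def in_elevated_range_explicit(systolic):
    # Equivalent explicit form
    return (120 <= systolic) and (systolic < 130)

for value in (119, 120, 129, 130):
    assert in_elevated_range(value) == in_elevated_range_explicit(value)
    print(value, in_elevated_range(value))
```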


## 3. Example: Classifying Blood Pressure Categories

In [21]:
def classify_bp(systolic, diastolic):
    """
    Classify blood pressure based on AHA guidelines:
      - Normal:               <120 and <80
      - Elevated:             120–129 and <80
      - Hypertension Stage 1: 130–139 or 80–89
      - Hypertension Stage 2: ≥140 or ≥90
    """
    if systolic < 120 and diastolic < 80:
        return "Normal"
    elif 120 <= systolic < 130 and diastolic < 80:
        return "Elevated"
    # Check Stage 2 before Stage 1: a reading such as 135/95 satisfies both
    # "or" conditions and must be assigned the higher category.
    elif systolic >= 140 or diastolic >= 90:
        return "Hypertension Stage 2"
    elif (130 <= systolic < 140) or (80 <= diastolic < 90):
        return "Hypertension Stage 1"
    else:
        return "Consult Clinician"

# Example usage
patients = [
    {"id": 1, "systolic": 118, "diastolic": 76},
    {"id": 2, "systolic": 125, "diastolic": 78},
    {"id": 3, "systolic": 135, "diastolic": 85},
    {"id": 4, "systolic": 145, "diastolic": 95},
]

for p in patients:
    cat = classify_bp(p["systolic"], p["diastolic"])
    print(f"Patient {p['id']}: {cat}")


Patient 1: Normal
Patient 2: Elevated
Patient 3: Hypertension Stage 1
Patient 4: Hypertension Stage 2


# 🔄 Control Flow: Loops

Python provides two main looping constructs—`for` and `while`—plus loop controls like `break` and `continue`. Below are healthcare-focused examples.

---

## 1. `for` Loops Over Sequences

Iterate through lists, tuples, or other iterable collections.

```python
# Example: Calculate BMI for a list of patients
patients = [
    {"id": 1, "weight_kg": 70, "height_m": 1.75},
    {"id": 2, "weight_kg": 82, "height_m": 1.68},
    {"id": 3, "weight_kg": 55, "height_m": 1.60},
]

for p in patients:
    bmi = p["weight_kg"] / (p["height_m"] ** 2)
    print(f"Patient {p['id']} BMI: {bmi:.1f}")
```


## 2. `while` Loops & Loop Control (`break`, `continue`)

Use `while` when you don’t know up front how many iterations you need.
Use `break` to exit early, and `continue` to skip to the next iteration.

In [22]:
# Example: Monitor a rising glucose level until it hits a threshold
glucose_readings = [95, 102, 110, 130, 145, 160]
threshold = 140

i = 0
while i < len(glucose_readings):
    level = glucose_readings[i]
    if level < 100:
        # skip “normal” readings
        i += 1
        continue
    print(f"Reading {i+1}: {level} mg/dL")
    if level >= threshold:
        print("⚠️ Alert: Glucose threshold exceeded — stop monitoring.")
        break
    i += 1


Reading 2: 102 mg/dL
Reading 3: 110 mg/dL
Reading 4: 130 mg/dL
Reading 5: 145 mg/dL
⚠️ Alert: Glucose threshold exceeded — stop monitoring.
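
When the data is already in a list, the same monitoring logic can be written as a `for` loop with `enumerate`, which removes the manual index bookkeeping (and the risk of forgetting `i += 1` before `continue`). This is an alternative sketch; `alert_index` is a name introduced here.

```python
glucose_readings = [95, 102, 110, 130, 145, 160]
threshold = 140

alert_index = None  # position of the first reading at or above the threshold

for i, level in enumerate(glucose_readings, start=1):
    if level < 100:
        continue  # skip "normal" readings; no counter to update by hand
    print(f"Reading {i}: {level} mg/dL")
    if level >= threshold:
        alert_index = i
        break

print("Alert at reading:", alert_index)   # Alert at reading: 5
```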


## 3. Iterating Over Dictionary Items

In [23]:
# Example: Print each patient’s latest test result
test_results = {
    "Alice": {"HbA1c": 6.2, "Cholesterol": 190},
    "Bob":   {"HbA1c": 7.8, "Cholesterol": 220},
    "Cara":  {"HbA1c": 5.9, "Cholesterol": 180},
}

for name, labs in test_results.items():
    hba1c = labs["HbA1c"]
    status = "OK" if hba1c < 7.0 else "Review"
    print(f"{name}: HbA1c={hba1c} → {status}")


Alice: HbA1c=6.2 → OK
Bob: HbA1c=7.8 → Review
Cara: HbA1c=5.9 → OK


# 🛠️ Function Definitions with `def`

## 1. The `def` Keyword, Parameters & Return Values

Use `def` to define a function. Inside the parentheses you list **parameters**, and use `return` to send back a value.

```python
def seq_length(sequence):
    """
    seq_length(sequence) → int
    Returns the length of the input sequence.
    """
    return len(sequence)

# Example
dna = "ATGCGTAC"
print("Length:", seq_length(dna))   # Length: 8
```


## 2. Default & Keyword Arguments

In [24]:
def seq_length(sequence, include_gaps=True):
    """
    seq_length(sequence, include_gaps=True) → int
    If include_gaps is False, gap characters ('-') are ignored.
    """
    if include_gaps:
        return len(sequence)
    else:
        return sum(1 for base in sequence if base != '-')

# 1) Using the default (include gaps)
print(seq_length("ATG-C-TA"))                 # 8

# 2) Positional override
print(seq_length("ATG-C-TA", False))          # 6

# 3) Keyword override (order-independent)
print(seq_length(sequence="A--TGC", include_gaps=False))  # 4


8
6
4
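
One pitfall with default arguments: the default value is created once, when the function is defined, so a mutable default such as a list is shared across calls. A common remedy is to use `None` as a sentinel. The function names below are made up for this sketch.

```python
def add_sequence_buggy(seq, collection=[]):
    # The default list is created once, at definition time, and then reused
    collection.append(seq)
    return collection

def add_sequence(seq, collection=None):
    # None as a sentinel: build a fresh list on every call that omits it
    if collection is None:
        collection = []
    collection.append(seq)
    return collection

print(add_sequence_buggy("ATG"))   # ['ATG']
print(add_sequence_buggy("GGC"))   # ['ATG', 'GGC']  (state leaked between calls)
print(add_sequence("ATG"))         # ['ATG']
print(add_sequence("GGC"))         # ['GGC']
```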


## 3. Positional vs. Keyword Arguments

- **Positional**: passed in the order the parameters are defined.
- **Keyword**: passed as `param_name=value`, in any order.

In [25]:
def annotate(seq, tag="gene", version=1):
    return f"{tag}_{version}:{seq}"

# Positional: tag="exon", version=2
print(annotate("ATGC", "exon", 2))            

# Keyword: swap order
print(annotate(seq="ATGC", version=2, tag="exon"))  


exon_2:ATGC
exon_2:ATGC
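
If you want to force callers to use the keyword form (so that a call like `annotate("ATGC", "exon", 2)` cannot silently swap arguments), Python lets you mark parameters as keyword-only with a bare `*`. The variant below is named `annotate_strict` to keep it separate from the function above.

```python
def annotate_strict(seq, *, tag="gene", version=1):
    # Parameters after the bare * can only be passed by keyword
    return f"{tag}_{version}:{seq}"

print(annotate_strict("ATGC", tag="exon", version=2))   # exon_2:ATGC

rejected = False
try:
    annotate_strict("ATGC", "exon", 2)   # positional use is rejected
except TypeError as err:
    rejected = True
    print("TypeError:", err)
```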


# 🌐 Scope & Side-Effects

Understanding variable scope and avoiding unintended side-effects is key to writing reliable analysis code.

---

## 1. Local vs. Global Variables

- **Local variables** live inside a function and vanish when it returns.  
- **Global variables** are defined at the module level and accessible anywhere—risky if you accidentally overwrite them.

```python
# Global variable
threshold = 0.8

def classify_measurement(value):
    # Local 'threshold' shadows the global variable of the same name
    threshold = 0.6
    if value >= threshold:
        return "High"
    else:
        return "Normal"

print(classify_measurement(0.7))  # Uses local threshold → "High"
print(threshold)                  # Global unchanged → 0.8
```


## 2. The `global` and `nonlocal` Keywords

**`global`**

Use sparingly, when you truly need to modify a global from inside a function.

In [26]:
counter = 0

def increment_global():
    global counter
    counter += 1   # Modifies the module-level ‘counter’

increment_global()
print(counter)     # → 1


1


**`nonlocal`**

Inside nested functions, `nonlocal` lets you modify a variable in the enclosing (but non-global) scope.

In [27]:
def tracker():
    count = 0
    def step():
        nonlocal count
        count += 1
        return count
    return step

step_fn = tracker()
print(step_fn())  # → 1
print(step_fn())  # → 2


1
2


## 3. Avoiding Side-Effects in Analyses

- Prefer pure functions that take inputs and return outputs without touching globals.
- Pass state explicitly (e.g., dataframes, thresholds) rather than relying on hidden module variables.
- Favor immutable data patterns (don’t modify lists/dicts in place) to reduce bugs in pipelines.

In [28]:
# BAD: modifies the global list in place
results = []
def record(result):
    results.append(result)

# GOOD: returns a new list, no side-effects
def record_safe(results_list, result):
    return results_list + [result]

base = []
new1 = record_safe(base, "sample1")
print(base, new1)  # base unchanged; new1 contains sample1


[] ['sample1']
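
Side-effects can also sneak in through function arguments: a list passed to a function is the same object the caller holds, so in-place edits change the caller's data. The sketch below contrasts the two styles with made-up readings; both function names are hypothetical.

```python
def normalize_in_place(values):
    # Risky in pipelines: silently changes the caller's list
    peak = max(values)
    for i, v in enumerate(values):
        values[i] = v / peak

def normalize(values):
    # Safer: builds and returns a new list
    peak = max(values)
    return [v / peak for v in values]

raw = [2.0, 4.0, 8.0]
scaled = normalize(raw)
print(raw, scaled)    # [2.0, 4.0, 8.0] [0.25, 0.5, 1.0]

normalize_in_place(raw)
print(raw)            # [0.25, 0.5, 1.0]  (the original readings are gone)
```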


# 📋 Lists: Creation & Operations

Python **lists** are ordered, mutable collections—ideal for storing sample IDs, DNA/RNA sequences, patient codes, and more.

---

## 1. Creation: `[]` vs. `list()`



In [33]:

# Empty lists
sample_ids   = []            # literal notation
sequences    = list()        # constructor notation

# Pre-populated list
sample_ids   = ["S1", "S2", "S3"]
sequences    = ["ATGCGT", "GGCTA-A", "TTAGGC"]


## 2. Indexing & Slicing

In [34]:
# Indexing (0-based)
first_sample   = sample_ids[0]       # "S1"
last_sequence  = sequences[-1]       # "TTAGGC"

# Slicing: [start:stop:step]
# Get the middle two samples
mid_samples    = sample_ids[1:3]     # ["S2", "S3"]

# Get every other base in a sequence
subseq         = sequences[0][::2]   # from "ATGCGT" → "AGG"


## 3. Appending & Extending

In [35]:
# Append: add one item
sample_ids.append("S4")
# sample_ids → ["S1", "S2", "S3", "S4"]

# Extend: add multiple items
new_samples = ["S5", "S6"]
sample_ids.extend(new_samples)
# sample_ids → ["S1", "S2", "S3", "S4", "S5", "S6"]


## 4. Use Case: Storing Sequences or Sample IDs
Imagine you’re loading FASTA headers or clinical sample codes:

In [36]:
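from pathlib import Path
from Bio import SeqIO  # requires biopython (installed in the setup section)

# Start with no records
fasta_headers = []

# As you parse each record (skipped if patients.fasta has not been uploaded):
fasta_path = Path("patients.fasta")
if fasta_path.exists():
    for record in SeqIO.parse(str(fasta_path), "fasta"):
        fasta_headers.append(record.id)
else:
    print(f"{fasta_path} not found; upload it to run this example.")

print("Loaded headers:", fasta_headers)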

In [37]:
visit_log = []
visit_log.extend(["V001", "V002", "V003"])   # batch‐load past visits
visit_log.append("V004")                     # new visit

print("All visits:", visit_log)


All visits: ['V001', 'V002', 'V003', 'V004']
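
Because lists are mutable, assigning a list to a second name does not copy it; both names refer to the same object. Use `.copy()` (or slicing with `[:]`) when you need an independent snapshot. A minimal sketch with illustrative names:

```python
visits = ["V001", "V002"]

alias = visits              # same list object under a second name
snapshot = visits.copy()    # new list with the same items (shallow copy)

visits.append("V003")

print(alias)       # ['V001', 'V002', 'V003']
print(snapshot)    # ['V001', 'V002']
print(alias is visits, snapshot is visits)   # True False
```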


# 🔒 Tuples: Creation & Immutability

Tuples are ordered, immutable sequences—ideal for packing together fixed sets of values (e.g., genome annotation fields).

---

## 1. Creation: `()` vs. Comma Syntax

```python
# Empty tuple
empty = ()

# Single‐element tuple: comma is required
one = (42,)      
also_one = 42,  # parentheses optional with comma

# Multi‐element tuple
coords = (10.0, 20.0, 30.0)
labels = "geneA", "geneB", "geneC"
```


## 2. Immutability & Function Returns
- Once created, a tuple cannot be changed: item assignment raises `TypeError`, and there are no methods for adding or removing elements.
- That makes tuples a good fit for returning multiple values from functions without worrying they’ll be altered.

In [38]:
def get_sample_metadata(sample_id):
    """
    Returns a tuple of (ID, collection_date, patient_age)
    """
    # imagine we look these up in a database…
    return sample_id, "2025-05-17", 48

meta = get_sample_metadata("S100")
print(meta)             # ('S100', '2025-05-17', 48)

# meta[1] = "2025-05-18"  
# TypeError: 'tuple' object does not support item assignment


('S100', '2025-05-17', 48)


## 3. Use Case: Fixed Annotation Fields
Suppose each record must carry an unchangeable set of attributes:

In [39]:
# Define a tuple for annotation schema: (gene_name, start, end, strand)
annotation = ("BRCA1", 100123, 100456, "+")

# Later, you can unpack safely
gene, start_pos, end_pos, strand = annotation
print(f"{gene} spans {start_pos}-{end_pos} on strand {strand}")

# Use tuple as a dictionary key for quick lookups
annotations = { annotation: "tumor_suppressor" }
print(annotations[("BRCA1", 100123, 100456, "+")])


BRCA1 spans 100123-100456 on strand +
tumor_suppressor
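
When a fixed record has several fields, remembering which position holds which value gets error-prone. One option from the standard library is `typing.NamedTuple`, which keeps tuple immutability but adds field names. The `Annotation` class below is an illustrative sketch, not part of any library.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    gene_name: str
    start: int
    end: int
    strand: str

brca1 = Annotation("BRCA1", 100123, 100456, "+")

print(brca1.gene_name, brca1.end - brca1.start)   # BRCA1 333
print(brca1 == ("BRCA1", 100123, 100456, "+"))    # True: it is still a tuple

immutable = False
try:
    brca1.start = 1
except AttributeError as err:
    immutable = True
    print("Cannot modify:", err)
```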


# 🌿 Sets: Creation & Operations

Python **sets** are unordered collections of **unique** elements—perfect for deduplicating data like k-mers or gene identifiers.

---

## 1. Creation: `{}` vs. `set()`




In [40]:
# Empty set: must use set(), {} creates an empty dict
empty_set = set()
print(type(empty_set))  # <class 'set'>

# Pre-populated set literal
genes = {"BRCA1", "TP53", "EGFR", "BRCA1"}  
print(genes)            # e.g. {'EGFR', 'BRCA1', 'TP53'}; duplicates removed, order may vary

<class 'set'>
{'TP53', 'BRCA1', 'EGFR'}


## 2. Uniqueness & Basic Operations

In [41]:
# Adding and removing
genes.add("MYC")
genes.discard("EGFR")    # no error if element not present
print(genes)

# Membership test
print("TP53" in genes)   # True
print("EGFR" in genes)   # False


{'TP53', 'BRCA1', 'MYC'}
True
False


## 3. Set Operations: Union, Intersection, Difference


In [42]:
sample1_genes = {"BRCA1", "TP53", "MYC"}
sample2_genes = {"TP53", "EGFR", "PTEN"}

# Union: all unique genes across both samples
all_genes = sample1_genes | sample2_genes
# or sample1_genes.union(sample2_genes)
print("Union   :", all_genes)

# Intersection: genes common to both
common_genes = sample1_genes & sample2_genes
# or sample1_genes.intersection(sample2_genes)
print("Intersection:", common_genes)

# Difference: genes in sample1 but not in sample2
unique_to_sample1 = sample1_genes - sample2_genes
# or sample1_genes.difference(sample2_genes)
print("Difference  :", unique_to_sample1)


Union   : {'TP53', 'BRCA1', 'EGFR', 'PTEN', 'MYC'}
Intersection: {'TP53'}
Difference  : {'BRCA1', 'MYC'}
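
Two more set operations are often handy alongside these: the symmetric difference (`^`) and subset tests (`<=`). The sketch below reuses the same two samples; the `panel` set is a made-up example. Results are passed through `sorted()` so the printed order is stable.

```python
sample1_genes = {"BRCA1", "TP53", "MYC"}
sample2_genes = {"TP53", "EGFR", "PTEN"}

# Symmetric difference: genes found in exactly one of the two samples
exclusive = sample1_genes ^ sample2_genes
print(sorted(exclusive))                 # ['BRCA1', 'EGFR', 'MYC', 'PTEN']

# Subset test: is a (hypothetical) gene panel fully covered by sample 1?
panel = {"TP53", "MYC"}
print(panel <= sample1_genes)            # True
print(panel.isdisjoint(sample2_genes))   # False: both contain TP53
```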


## 4. Use Case: Unique k-mers from a DNA Sequence


In [43]:
def get_kmers(sequence, k):
    """
    Returns the set of unique k-mers of length k in the sequence.
    """
    kmers = set()
    for i in range(len(sequence) - k + 1):
        kmers.add(sequence[i : i + k])
    return kmers

dna_seq = "ATGCGATGACCTG"
k = 3
unique_kmers = get_kmers(dna_seq, k)
print(f"Unique {k}-mers:", unique_kmers)


Unique 3-mers: {'ACC', 'GCG', 'TGA', 'CGA', 'CCT', 'GAC', 'ATG', 'CTG', 'GAT', 'TGC'}
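
A set tells you which k-mers occur but not how often. If counts matter, `collections.Counter` can be used with the same sliding window. This is a sketch; `count_kmers` is a hypothetical companion to `get_kmers`.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every k-mer of length k in the sequence, including repeats."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers("ATGCGATGACCTG", 3)

print(counts.most_common(1))   # [('ATG', 2)]
print(len(counts), "unique;", sum(counts.values()), "total")   # 10 unique; 11 total
```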


# 🗺️ Dictionaries: Creation & Operations

Python **dictionaries** are mutable mappings from **keys** to **values** that preserve insertion order (guaranteed since Python 3.7), which makes them ideal for storing things like gene expression profiles.

---

## 1. Creation: `{key: value}` vs. `dict()`




In [45]:
# 1a. Literal notation
gene_expression = {
    'TP53': 12.5,
    'BRCA1': 8.3,
    'EGFR': 15.2
}

# 1b. Empty dict and then populate
gene_expression = dict()
gene_expression['TP53'] = 12.5
gene_expression['BRCA1'] = 8.3

## 2. Accessing & Updating

In [46]:
# Access a value by key
tp53_expr = gene_expression['TP53']
print(f"TP53 expression: {tp53_expr}")

# You can also use .get() to avoid KeyError, with a default
jak2_expr = gene_expression.get('JAK2', 0.0)
print(f"JAK2 expression (default 0.0): {jak2_expr}")

# Update an existing entry
gene_expression['BRCA1'] = 9.1

# Add a new gene
gene_expression['PTEN'] = 4.7

# Bulk update with another dict
gene_expression.update({'MYC': 22.4, 'EGFR': 16.0})


TP53 expression: 12.5
JAK2 expression (default 0.0): 0.0


## 3. Iteration

In [47]:
# 3a. Iterate over keys
for gene in gene_expression:
    print(gene, gene_expression[gene])

# 3b. Iterate over key–value pairs
for gene, expr in gene_expression.items():
    status = 'high' if expr > 10 else 'normal'
    print(f"{gene}: {expr:.1f} → {status}")

# 3c. Iterate over values only
for expr in gene_expression.values():
    print(f"Expression value: {expr:.1f}")


TP53 12.5
BRCA1 9.1
PTEN 4.7
MYC 22.4
EGFR 16.0
TP53: 12.5 → high
BRCA1: 9.1 → normal
PTEN: 4.7 → normal
MYC: 22.4 → high
EGFR: 16.0 → high
Expression value: 12.5
Expression value: 9.1
Expression value: 4.7
Expression value: 22.4
Expression value: 16.0


## 4. Use Case: Filtering Highly Expressed Genes

In [48]:
# Suppose we want genes with expression > 10 TPM
highly_expressed = {
    gene: expr
    for gene, expr in gene_expression.items()
    if expr > 10.0
}

print("Highly expressed genes:")
for gene, expr in highly_expressed.items():
    print(f" - {gene}: {expr:.1f} TPM")


Highly expressed genes:
 - TP53: 12.5 TPM
 - MYC: 22.4 TPM
 - EGFR: 16.0 TPM
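
Filtering is often followed by ranking. A dictionary keeps insertion order, so to list genes from highest to lowest expression you can sort its items with `sorted()` and a key function. The sketch below restates the values from this section so that it runs on its own.

```python
gene_expression = {"TP53": 12.5, "BRCA1": 9.1, "PTEN": 4.7, "MYC": 22.4, "EGFR": 16.0}

# Order (gene, value) pairs by value, highest first
ranked = sorted(gene_expression.items(), key=lambda item: item[1], reverse=True)

for rank, (gene, expr) in enumerate(ranked, start=1):
    print(f"{rank}. {gene}: {expr:.1f} TPM")

top_gene, top_expr = ranked[0]
print("Top gene:", top_gene)   # Top gene: MYC
```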
