# Getting User Input in Python – Tekin Interacts with the Data

As Tekin’s Python skills improved, he wanted to make his scripts interactive — allowing users (like lab colleagues) to enter information such as sample IDs, measurement values, or viral sequence fragments directly.

Python allows us to take user input using the `input()` function.  
This function always returns a **string**, so it often needs to be converted into other data types like `int` or `float`.

---

## Why Use Input Handling in Wastewater Surveillance?

Input handling is essential for making Python scripts flexible and useful in real lab environments.  
In Tekin’s case, it allowed:

- Entering **sample IDs** for tracking  
- Typing **measured values** like viral concentrations  
- Accepting **user-defined thresholds** for flagging high-risk samples  
- Checking the **validity** of input (e.g. numeric ranges or text formats)

Since `input()` always returns a string, Tekin often had to **convert input types** and check whether the input was suitable for further processing.  
This made his scripts more **robust**, **reusable**, and **ready for real-world deployment**.
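
The conversion and validity check described above can be sketched as a small helper. This is a minimal illustration, not Tekin's actual code: the function name `parse_viral_load` and the rule that values must be finite and non-negative are assumptions made for the example. The entries are hard-coded strings standing in for what `input()` would return.

```python
import math

def parse_viral_load(raw_text):
    """Convert text typed by a user into a non-negative float, or return None."""
    try:
        value = float(raw_text.strip())
    except ValueError:
        return None  # not a number at all, e.g. "high" or an empty string
    if not math.isfinite(value) or value < 0:
        return None  # reject "nan", "inf", and negative concentrations
    return value

# Simulated user entries; in a real script each would come from input()
for entry in ["18250.5", " 300 ", "high", "-4"]:
    print(repr(entry), "->", parse_viral_load(entry))
```

Returning `None` for bad input lets the calling code decide whether to ask again or stop, instead of crashing on the first typo.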

---

### Example: Taking a Sample ID
```python
sample_id = input("Enter sample ID: ")
print("Sample received:", sample_id)
```


In [5]:
# Let's try it together! Starting from the basics

name = input("What is your name?...")



What is your name?...Gültekin


In [None]:
# TRY IT OUT BY YOURSELF AREA

In [None]:
# So we included the printing now

name = input("What is your name?...")

print(name)

In [None]:
# TRY IT OUT BY YOURSELF AREA

In [6]:
# Let's make it prettier

name = input("What is your name?...")

print(f"Your name is: {name}")


What is your name?...Gültekin
Your name is: Gültekin


In [None]:
# TRY IT OUT BY YOURSELF AREA

In [None]:
# What if I don't want the input itself?

In [3]:
# Taking a DNA sequence as user input
dna_sequence = input("Enter a DNA sequence: ").upper()

# Checking if the sequence contains only valid nucleotides
valid_nucleotides = {"A", "T", "C", "G"}
is_valid = all(base in valid_nucleotides for base in dna_sequence)

print(f"DNA Sequence: {dna_sequence}")
print(f"Valid DNA Sequence: {is_valid}")


Enter a DNA sequence: ATACAGCAGCAGÜLTEKİN
DNA Sequence: ATACAGCAGCAGÜLTEKİN
Valid DNA Sequence: False


# Defining Functions in Python – Tekin Reuses His Code

As Tekin’s scripts grew longer, he realized he was repeating the same code over and over.  
To make his work more efficient, he started writing **functions** — reusable blocks of code that could simplify tasks like formatting results, calculating metrics, or transforming sequences.

In Python, a function is defined using the `def` keyword.  
Functions can take **parameters** (inputs) and optionally return a **result**.

Tekin created a simple function to check the length of a gene target or primer sequence.


---

### Basic Function – Greeting the Lab Team
```python
def greet(name):
    return f"Hello, {name}! Ready to process some samples?"

print(greet("Tekin"))
```




In [None]:
# Another example

def dna_length(dna_sequence):
    return len(dna_sequence)

print(dna_length("ATGCGT"))


In [11]:
# Let's do an example together.

# The aim here is to define a function.
# So I'm defining the function as follows

def get_dna_length(dna_sequence):
    return len(dna_sequence)

# Now I need to use this function as follows
# First I will create a sequence

sequence = "ATGCGCGATAGACA"

# Let me print the output

print(f"DNA Length: {get_dna_length(sequence)}")

# This code will print the length of the sequence that I entered.



DNA Length: 14


In [None]:
# TRY IT OUT BY YOURSELF AREA

# Now we are ready to check whether our sequence is valid or not.


In [12]:

def is_valid_dna(sequence):
    valid_bases = {"A", "T", "C", "G"}
    return all(base in valid_bases for base in sequence)

# Example usage
seq = "ATGCCGTTA"
print(f"Is valid DNA? {is_valid_dna(seq)}")  # True

seq2 = "ATGXCT"  # Invalid base 'X'
print(f"Is valid DNA? {is_valid_dna(seq2)}")  # False


Is valid DNA? True
Is valid DNA? False


In [None]:
# TRY IT OUT BY YOURSELF AREA

# We can combine our functions with input handling!

In [13]:
# At first, let's design our function. 

def analyze_dna():
    sequence = input("Enter a DNA sequence: ").upper()
    length = len(sequence)
    print(f"Sequence Length: {length}")
    print(f"Is valid DNA? {is_valid_dna(sequence)}")

# And run it
analyze_dna()

Enter a DNA sequence: ATACAGTACAT,
Sequence Length: 12
Is valid DNA? False


In [None]:
# TRY IT OUT BY YOURSELF AREA

# As you can see the code gives me the sequence length and the validation boolean.

# Let's move on to multiple functions

In [14]:
# Example

# Function 1: Calculate DNA sequence length
def get_dna_length(dna_sequence):
    """Returns the length of a DNA sequence."""
    return len(dna_sequence)

# Function 2: Count occurrences of each nucleotide
def count_nucleotides(dna_sequence):
    """Returns a dictionary with counts of A, T, C, and G in the sequence."""
    return {base: dna_sequence.count(base) for base in "ATCG"}

# Function 3: Generate reverse complement of a DNA sequence
def get_reverse_complement(dna_sequence):
    """Returns the reverse complement of a DNA sequence."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in dna_sequence[::-1])

# Example usage
sequence = "ATGCCGTA"

print(f"DNA Length: {get_dna_length(sequence)}")
print(f"Nucleotide Counts: {count_nucleotides(sequence)}")
print(f"Reverse Complement: {get_reverse_complement(sequence)}")


DNA Length: 8
Nucleotide Counts: {'A': 2, 'T': 2, 'C': 2, 'G': 2}
Reverse Complement: TACGGCAT


In [None]:
# TRY IT OUT BY YOURSELF AREA

### What is an ORF?

An Open Reading Frame (ORF) is a stretch of DNA that has the potential to be translated into a protein.
An ORF:

- Starts with a start codon (`ATG`).
- Ends with a stop codon (`TAA`, `TAG`, or `TGA`).
- Has a length that is a multiple of 3 (each codon is a triplet).
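
The three criteria above translate almost directly into code. The sketch below is a simplified check under stated assumptions: the function name `looks_like_orf` is hypothetical, and it tests only the three listed conditions. It does not look for stop codons inside the sequence, which a real ORF finder would also need to do.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}

def looks_like_orf(dna_sequence):
    """Return True if the sequence meets the three basic ORF criteria."""
    seq = dna_sequence.strip().upper()
    return (
        len(seq) >= 6                    # at least a start and a stop codon
        and len(seq) % 3 == 0            # length is a multiple of 3
        and seq.startswith(START_CODON)  # begins with ATG
        and seq[-3:] in STOP_CODONS      # ends with TAA, TAG, or TGA
    )

print(looks_like_orf("atgttgagccgtacgtaa"))  # True
print(looks_like_orf("ATGTTGAGCCGTACG"))     # False: no stop codon at the end
```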

In [None]:
# Function 1: Take input and convert to uppercase
def get_dna_input():
    """Takes DNA sequence input and converts it to uppercase."""
    dna_sequence = input("Enter a DNA sequence: ").strip().upper()
    return dna_sequence

# Function 2: Slice the sequence into triplets (codons)
def slice_into_codons(dna_sequence):
    """Splits the DNA sequence into triplets (codons) without checking for ORFs."""
    return [dna_sequence[i:i+3] for i in range(0, len(dna_sequence), 3) if len(dna_sequence[i:i+3]) == 3]

# Function 3: Format output
def format_output(codons):
    """Joins the codons into a single string separated by ' - '."""
    return " - ".join(codons)

# Running the functions
dna = get_dna_input()  # Get user input
codons = slice_into_codons(dna)  # Convert to codons
formatted_output = format_output(codons)  # Format output

# Display results
print("\nCodons:", formatted_output)

# Example input: "atgttgagccgtacgtaa"
# Expected output:
# Codons: ATG - TTG - AGC - CGT - ACG - TAA


In [None]:
# TRY IT OUT BY YOURSELF AREA

## What is a Regular Expression? – Tekin Detects Patterns in Sequences

As Tekin moved into more advanced data parsing, he needed a way to **detect patterns** in messy text files, DNA sequences, or file formats.

A **regular expression (regex)** is a special sequence of characters that defines a **search pattern**.  
Regex is incredibly useful for searching, extracting, or validating data in bioinformatics and public health workflows.

---

### Tekin Used Regular Expressions To:

-  **Find all DNA sequences starting with `"ATG"`** (start codon for genes)
-  **Extract gene or sample IDs** from raw sequence annotation files  
-  **Validate file formats**, such as checking if a `.vcf` (variant call format) file contains valid headers and entries  
-  **Filter metadata entries**, like confirming if a wastewater sample label follows the correct pattern (`WWTP_ZoneX_YYYY`)

Regular expressions helped Tekin write compact, powerful rules that made his data pipelines smarter and cleaner.

In the next section, we’ll explore how to apply regex using Python’s built-in `re` module.


In [1]:
# Let's import it.

import re

## Basic RegEx Functions – Tekin Cleans and Searches Sequences

Tekin used Python’s `re` module (regular expressions) to search for biological patterns and clean messy metadata in sequencing files.

Here are some of the most common regex functions he used in his scripts:

| Function      | Description                              | Example                                       |
|---------------|------------------------------------------|-----------------------------------------------|
| `re.search()` | Finds the **first match** in a string    | `re.search("ATG", sequence)` → Start codon?   |
| `re.findall()`| Finds **all matches** of a pattern       | `re.findall("A[TG]G", sequence)` → Codon scan |
| `re.match()`  | Checks if string **starts** with pattern | `re.match("ATG", sequence)` → Starts with ATG |
| `re.sub()`    | **Replaces** pattern with something else | `re.sub("T", "U", sequence)` → DNA to RNA     |

---

These tools helped Tekin:

- Detect motifs in **viral sequences**
- Standardize **sample IDs** (e.g. fix typos like `wwtp` → `WWTP`)
- Simulate **transcription** (DNA → RNA)
- Validate whether uploaded files conformed to required formats

Regex became a powerful tool in Tekin’s bioinformatics toolkit, especially when working with genomic and metadata-rich files.
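
To make the difference between `re.search()` and `re.match()` concrete, here is a short sketch using the functions from the table. The sample ID `wwtp_ZoneA_2024` is a made-up value for illustration, and using `re.IGNORECASE` to fix the casing of the prefix is one possible approach rather than the only one.

```python
import re

sequence = "AGCTATGCGTAAATGCAGTGA"

# re.search returns a match object for the first hit, or None if there is none
first_hit = re.search("ATG", sequence)
print(first_hit.start())  # 4

# re.match only succeeds when the pattern matches at the very start of the string
starts_with_atg = re.match("ATG", sequence) is not None
print(starts_with_atg)  # False: this sequence starts with AGC

# re.sub with the IGNORECASE flag normalises the casing of a label prefix
raw_id = "wwtp_ZoneA_2024"  # hypothetical sample ID
fixed_id = re.sub(r"^wwtp", "WWTP", raw_id, flags=re.IGNORECASE)
print(fixed_id)  # WWTP_ZoneA_2024
```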


In [2]:
# Let's see our first example

import re

# DNA sequence
sequence = "AGCTATGCGTAAATGCAGTGA"

# Find all start codons (ATG)
matches = re.findall("ATG", sequence)

print(f"Start codons found: {matches}")


Start codons found: ['ATG', 'ATG']


In [None]:
# Do you want to try it out?

In [3]:
# Let's see our second example
sequence = "ATGCATGTTGAC"

# Replace T with U for RNA conversion
rna_sequence = re.sub("T", "U", sequence)

print(f"RNA sequence: {rna_sequence}")


RNA sequence: AUGCAUGUUGAC


In [None]:
# Do you want to try it out?
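
Regex can also validate metadata, such as the `WWTP_ZoneX_YYYY` sample label format mentioned earlier. The sketch below rests on an assumption about that format: `X` is read as a single capital letter and `YYYY` as a four-digit year. It uses `re.fullmatch()`, which is not in the table above but is part of the same `re` module and requires the whole string to match. The helper name `is_valid_label` is hypothetical.

```python
import re

# Assumed reading of WWTP_ZoneX_YYYY: X = one capital letter, YYYY = four digits
LABEL_PATTERN = re.compile(r"WWTP_Zone[A-Z]_\d{4}")

def is_valid_label(label):
    """Return True only if the entire label follows the assumed pattern."""
    return LABEL_PATTERN.fullmatch(label) is not None

for label in ["WWTP_ZoneA_2024", "wwtp_ZoneA_2024", "WWTP_Zone12_2024", "WWTP_ZoneB_24"]:
    print(label, "->", is_valid_label(label))
```

Only the first label passes; the others fail on casing, the zone identifier, and the year length respectively.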

# Introduction to Lists, Tuples, and Dictionaries in Python – Tekin Structures His Data

As Tekin started dealing with larger datasets — including multiple samples, gene targets, and metadata — he needed ways to organize collections of values.  
Python offers powerful data structures to handle such scenarios: **Lists**, **Tuples**, and **Dictionaries**.

Each has its own strengths, and Tekin learned when and how to use them efficiently in his wastewater surveillance work.

---

##  Lists – Ordered, Mutable Collections

A **list** is an **ordered** collection that can store **multiple elements**, including different data types.  
Lists are **mutable**, meaning Tekin could modify their contents (add, remove, or change values).

###  Key Features of Lists:
- **Ordered** → The position of each item matters.  
- **Mutable** → You can change, extend, or shrink them.  
- **Indexable** → Access elements with position (`my_list[0]`).

###  Use Cases in Wastewater Context:
- Storing **multiple viral sequences** retrieved from metagenomic data  
- Listing **AMR genes** detected in a sample  
- Organizing **daily viral load measurements** for trend analysis

Example:

```python
viral_targets = ["SARS-CoV-2", "Norovirus", "Adenovirus"]
print(viral_targets[0])  # Output: SARS-CoV-2
```


#  Working with Lists in Python – Tekin Handles Surveillance Data

Lists are one of the most powerful and frequently used data structures in Python.  
They allowed Tekin to store **multiple values** in an **ordered** and **changeable** way — ideal for managing sample metadata, sequencing results, or AMR gene profiles.

---

## **Creating a List**

In Python, lists are created using **square brackets `[]`**, with each element separated by commas.

###  Examples from Tekin's Lab:

```python
# List of AMR genes detected in a wastewater sample
resistance_genes = ["blaCTX-M", "mecA", "tetM", "aadA1", "ermB"]

# List of nucleotide bases in sequencing reads
nucleotides = ["A", "T", "C", "G"]

# List combining different data types from a variant call
variant_data = ["chr1", 123456, "A", "G", 99.5]
```


In [4]:
# Let's try it out

# List of resistance genes
resistance_genes = ["blaCTX-M", "mecA", "tetM", "aadA1", "ermB"]

# List of nucleotide bases
nucleotides = ["A", "T", "C", "G"]

# List of different data types
variant_data = ["chr1", 123456, "A", "G", 99.5]

# Accessing the first element
print(resistance_genes[0])  # Output: blaCTX-M

# Accessing the last element
print(resistance_genes[-1])  # Output: ermB


blaCTX-M
ermB


In [5]:
# Get the first three resistance genes
print(resistance_genes[:3])  # Output: ['blaCTX-M', 'mecA', 'tetM']


['blaCTX-M', 'mecA', 'tetM']


In [6]:
# Changing an element in the list
resistance_genes[1] = "mecC"
print(resistance_genes)  # Output: ['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB']


['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB']


##  Modifying Lists – Tekin Updates AMR Gene Profiles

Tekin often needed to update lists of detected resistance genes during his analysis of wastewater samples.  
Python's list methods like `.append()`, `.insert()`, `.remove()`, and `.pop()` allowed him to add or remove genes as needed.

###  Example: Managing a Resistance Gene List

```python
# Add a new resistance gene at the end
resistance_genes.append("vanA")

# Insert a gene at a specific position
resistance_genes.insert(2, "sul1")

print(resistance_genes)  
# Output: ['blaCTX-M', 'mecC', 'sul1', 'tetM', 'aadA1', 'ermB', 'vanA']

# Remove an element by value
resistance_genes.remove("sul1")  
print(resistance_genes)  
# Output: ['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB', 'vanA']

# Remove the last gene from the list
resistance_genes.pop()  
print(resistance_genes)  
# Output: ['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB']
```


In [7]:
# Add a new resistance gene at the end
resistance_genes.append("vanA")

# Insert a gene at a specific position
resistance_genes.insert(2, "sul1")

print(resistance_genes)  
# Output: ['blaCTX-M', 'mecC', 'sul1', 'tetM', 'aadA1', 'ermB', 'vanA']

# Remove an element
resistance_genes.remove("sul1")  
print(resistance_genes)  # Output: ['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB', 'vanA']

# Remove the last element
resistance_genes.pop()  
print(resistance_genes)  # Output: ['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB']


['blaCTX-M', 'mecC', 'sul1', 'tetM', 'aadA1', 'ermB', 'vanA']
['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB', 'vanA']
['blaCTX-M', 'mecC', 'tetM', 'aadA1', 'ermB']


In [8]:
# Looping through a list...

for gene in resistance_genes:
    print(f"Resistance gene: {gene}")

Resistance gene: blaCTX-M
Resistance gene: mecC
Resistance gene: tetM
Resistance gene: aadA1
Resistance gene: ermB
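
The same indexing, slicing, and looping operations work on numeric lists, such as the daily viral load measurements mentioned among the use cases. The values below are invented for illustration, and the 25,000 copies/mL cut-off is borrowed from the exercise later in this notebook.

```python
# Hypothetical daily viral load measurements (copies/mL) for one week
daily_loads = [12000.0, 15500.0, 9800.0, 27000.0, 31000.0, 18000.0, 22000.0]

peak = max(daily_loads)                        # highest value of the week
average = sum(daily_loads) / len(daily_loads)  # arithmetic mean
last_three = daily_loads[-3:]                  # slice: the three most recent days

# Day numbers (starting at 1) on which the load exceeded the threshold
high_days = [day for day, load in enumerate(daily_loads, start=1) if load > 25000]

print(f"Peak: {peak}")
print(f"Average: {average:.1f}")
print(f"Last three days: {last_three}")
print(f"Days above threshold: {high_days}")
```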


## Tuples – Ordered, Immutable Collections

While working with genomic data, Tekin encountered cases where data should **not be changed** — for example, reference coordinates or known mutations.  
In such cases, he used **tuples**, which are like lists but **immutable**.

Tuples are ideal for representing **fixed biological data** that should remain constant throughout analysis.

---

### Key Features of Tuples:
- **Ordered** → Elements are stored in a defined sequence.  
- **Immutable** → Once created, values cannot be changed.  
- **Efficient** → Slightly faster and safer than lists for static data.

---

### Use Cases in Public Health Genomics:
- Representing **genomic coordinates** (e.g., chromosome, start, end)  
- Storing **variant definitions**, such as SNPs or indels  
- Keeping track of **sample metadata snapshots** that shouldn’t be edited later

Example:

```python
# A tuple representing a single nucleotide polymorphism (SNP)
variant = ("chr1", 123456, "A", "G")
print(variant[0])  # Output: chr1
```


# Working with Tuples in Python – Tekin Stores Immutable Genomic Data

Tuples are similar to lists, but with a key difference: **they cannot be changed after creation**.  
Tekin used **tuples** when dealing with **fixed reference data** — things that should remain untouched during analysis.

They were particularly useful for:

- **Genomic coordinates** (e.g., chromosome, position)
- **SNP records** that define a specific variant
- **Sample metadata snapshots** that shouldn't be modified accidentally

---

## **Creating a Tuple**

Tuples are created using **parentheses `()`** instead of square brackets.

```python
# SNP representation: chromosome, position, reference base, alternate base
snp = ("chr1", 123456, "A", "G")

# Accessing tuple elements
print(snp[0])  # Output: chr1
print(snp[2])  # Output: A
```


In [9]:
## Just like lists, tuples are indexed starting from 0.


# Tuple of nucleotide bases
nucleotides = ("A", "T", "C", "G")

# Tuple of genomic coordinates (Chromosome, Start Position, End Position)
genomic_region = ("chr1", 123456, 123789)

# Tuple of SNP variant information (Chromosome, Position, Reference, Alternate)
snp = ("chr2", 987654, "A", "G")


print(nucleotides[0])  # Output: A
print(snp[-1])  # Output: G (last element)


A
G


In [None]:
# Slicing tuples

print(nucleotides[:2])  # Output: ('A', 'T')

### Why Use Tuples Instead of Lists?

Unlike lists, tuples are immutable, meaning:

- They cannot be changed after creation.
- They are typically slightly faster to create and iterate over than lists.
- They can be used as keys in dictionaries, as long as their elements are hashable (lists cannot).
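
A brief sketch of these properties, reusing the SNP values from the examples above. The dictionary `snp_annotations` and its `"A>G"` strings are illustrative only.

```python
# Tuples of hashable values can serve as dictionary keys
snp_annotations = {
    ("chr1", 123456): "A>G",
    ("chr2", 987654): "A>G",
}
print(snp_annotations[("chr1", 123456)])  # A>G

# Trying to change a tuple element raises a TypeError
snp = ("chr2", 987654, "A", "G")
tuple_is_locked = False
try:
    snp[2] = "T"
except TypeError as error:
    tuple_is_locked = True
    print(f"Cannot modify tuple: {error}")

# A list cannot be used as a dictionary key because lists are unhashable
list_key_rejected = False
try:
    bad_lookup = {["chr1", 123456]: "A>G"}
except TypeError as error:
    list_key_rejected = True
    print(f"Cannot use list as key: {error}")
```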

In [10]:
# Looping through the tuple.

for base in nucleotides:
    print(f"Nucleotide: {base}")

Nucleotide: A
Nucleotide: T
Nucleotide: C
Nucleotide: G


In [11]:
## Checking if an Element Exists in a Tuple

if "T" in nucleotides:
    print("Thymine is present!")


Thymine is present!


In [12]:
print(nucleotides.count("A"))  # Output: 1
print(nucleotides.index("C"))  # Output: 2

1
2


In [None]:
# Code playground

# Working with Dictionaries in Python – Tekin Maps Genes to Functions

Dictionaries (`dict`) became one of Tekin’s favorite tools in Python, especially when he needed to map one piece of information to another — such as linking gene names to their resistance classes, or associating sample IDs with metadata.

Unlike lists or tuples, dictionaries store data in **key-value pairs**, allowing for **fast, meaningful access** without using numeric indexes.

---

## Key Features of Dictionaries:
- **Key-based access** → Data is accessed by keys, not by position (insertion order is preserved since Python 3.7)  
- **Key-Value Mapping** → Each key is linked to a specific value  
- **Mutable** → Values can be updated; keys must remain unique

---

## Example Use Cases in Wastewater Surveillance:
- Mapping **resistance genes** to their **antibiotic classes**  
- Associating **sample IDs** with collection **locations or dates**  
- Defining **codon-to-amino acid translations** for protein prediction

---

## **Creating a Dictionary**

Dictionaries are created using **curly braces `{}`**, with key-value pairs separated by colons `:`.

```python
# Mapping resistance genes to antibiotic classes
resistance_map = {
    "blaCTX-M": "Beta-lactam",
    "mecA": "Methicillin",
    "tetM": "Tetracycline",
    "vanA": "Vancomycin"
}

# Accessing values by keys
print(resistance_map["tetM"])  # Output: Tetracycline
```


In [None]:
# Code playground

In [17]:
# Dictionary of genetic codes
genetic_code = {
    "AUG": "Methionine",
    "UUU": "Phenylalanine",
    "UAA": "Stop",
    "UGA": "Stop"
}

# Dictionary mapping resistance genes to antibiotic classes
resistance_genes = {
    "blaCTX-M": "Beta-lactam",
    "mecA": "Methicillin",
    "tetM": "Tetracycline",
    "vanA": "Vancomycin"
}


In [None]:
# Code playground

In [18]:
# We can retrieve a value using its key.

print(genetic_code["AUG"])  # Output: Methionine
print(resistance_genes["mecA"])  # Output: Methicillin


Methionine
Methicillin


In [None]:
# Code playground

In [19]:
# ✔ Dictionaries are mutable, so values can be updated.


# Add a new genetic code
genetic_code["UAG"] = "Stop"

# Update an existing value
genetic_code["UUU"] = "Modified Phenylalanine"

print(genetic_code)


{'AUG': 'Methionine', 'UUU': 'Modified Phenylalanine', 'UAA': 'Stop', 'UGA': 'Stop', 'UAG': 'Stop'}


In [None]:
# Code playground

In [20]:
# We can iterate over keys, values, or both.


# Looping through keys
for gene in resistance_genes:
    print(gene)  # Outputs all gene names

# Looping through values
for antibiotic in resistance_genes.values():
    print(antibiotic)  # Outputs all antibiotic classes

# Looping through key-value pairs
for gene, antibiotic in resistance_genes.items():
    print(f"{gene} → {antibiotic}")


blaCTX-M
mecA
tetM
vanA
Beta-lactam
Methicillin
Tetracycline
Vancomycin
blaCTX-M → Beta-lactam
mecA → Methicillin
tetM → Tetracycline
vanA → Vancomycin


In [None]:
# Code playground

In [21]:
## We can use the in keyword to check if a key is in the dictionary.
 
if "blaCTX-M" in resistance_genes:
    print("blaCTX-M gene is present!")


blaCTX-M gene is present!
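
Besides checking with `in`, the `dict.get()` method returns a default value when a key is missing, so the lookup does not raise a `KeyError`. The sketch below applies this to the codon-to-amino-acid use case. The table is deliberately partial (only the codons used earlier in this notebook), and the names `codon_table` and `translate` are hypothetical.

```python
# Partial codon table for illustration; a real table has 64 entries
codon_table = {
    "AUG": "Methionine",
    "UUU": "Phenylalanine",
    "UAA": "Stop",
    "UGA": "Stop",
    "UAG": "Stop",
}

def translate(rna_sequence):
    """Translate an RNA sequence codon by codon until a stop codon is reached."""
    amino_acids = []
    for i in range(0, len(rna_sequence) - 2, 3):
        codon = rna_sequence[i:i + 3]
        residue = codon_table.get(codon, "Unknown")  # default for missing codons
        if residue == "Stop":
            break
        amino_acids.append(residue)
    return amino_acids

print(translate("AUGUUUGGGUAA"))  # ['Methionine', 'Phenylalanine', 'Unknown']
```

`GGG` is reported as `Unknown` only because it is absent from this partial table.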


## **Summary of Differences**
| Feature | List | Tuple | Dictionary |
|---------|------|-------|------------|
| **Ordered** | ✅ Yes | ✅ Yes | ✅ Insertion order (Python 3.7+), no positional index |
| **Mutable** | ✅ Yes | ❌ No | ✅ Yes |
| **Indexable** | ✅ Yes | ✅ Yes | ❌ No (access via keys) |
| **Duplicates Allowed** | ✅ Yes | ✅ Yes | ❌ No (keys must be unique) |
| **Use Case** | Storing sequences, lists of genes | Fixed datasets, SNPs | Fast lookups, gene mapping |

---

# Problem 1: Checking Valid DNA Input
### Scenario
You are working with DNA sequences and need to ensure that users only enter **valid nucleotide sequences**.

### Task
Write a Python program that:
1. **Takes a DNA sequence** as input from the user.
2. **Validates** whether the sequence contains **only** valid DNA bases (`A`, `T`, `C`, `G`).
3. If valid, **prints the sequence length**.
4. If invalid, **displays an error message**.


In [None]:
# Your Solution

In [None]:
# Your Solution

In [None]:
# Your Solution

---

# Problem 3: Interpreting Wastewater Viral Load

### Scenario  
Tekin is building a basic script to help field teams quickly evaluate wastewater surveillance results.  
The team inputs the **measured viral load (copies/mL)** from a sample, and the script tells them whether the level is **high** or **acceptable** based on a threshold.

The threshold is **25,000 copies/mL**.

---

### Task  
Write a Python program that:

1. Asks the user to **input the viral load** using the `input()` function  
2. Converts the input into a **float**  
3. Uses an **`if-else` statement** to check:
   - If the viral load is **greater than 25,000**, print a warning message  
   - Otherwise, print a message that the value is within range

---


In [None]:
# Your Solution

In [None]:
# Your Solution

In [None]:
# Your Solution

In [2]:
## **Solution for Problem 1: Checking Valid DNA Input**


# Function to validate a DNA sequence
def is_valid_dna(sequence):
    valid_bases = {"A", "T", "C", "G"}
    return all(base in valid_bases for base in sequence)

# Get user input
dna_sequence = input("Enter a DNA sequence: ").upper()

# Check if the sequence is valid
if is_valid_dna(dna_sequence):
    print("Valid DNA sequence.")
    print(f"Sequence length: {len(dna_sequence)}")
else:
    print("Invalid DNA sequence! Only A, T, C, and G are allowed.")

Enter a DNA sequence: AGAGAAGATAGAG
Valid DNA sequence.
Sequence length: 13


In [None]:
# Estimating Viral Load Level from Wastewater Data

# Step 1: Ask user for current viral load
current_load = input("Enter current viral load (copies/mL): ")

# Step 2: Convert input to float
current_load = float(current_load)

# Step 3: Check if the load is high or not using if-else
if current_load > 25000:
    print("High viral load detected! Further investigation may be needed.")
else:
    print("Viral load is within acceptable range.")
