## QBS146/Gene146 Assignment 1

Welcome to the first assignment! We'll use Jupyter Notebooks to write our Python code interactively. We can test individual functions and check that they all work together before we package them and submit them to [Autolab](https://qbs146.mckennalab.org/).  

Our **goals** are:

 1. Get Jupyter notebooks up and running and work through an example
 2. Learn how to use basic Python commands to manipulate strings and numbers
 3. Try making a basic implementation of the kmer-counting routine we talked about in class
 4. Package up the code and submit
 

### Python help

We highlighted some resources for learning more Python. To add to those, here are some more great resources to get you going. Also reach out to the TAs or me if you're struggling and we can work something out:

- [A really nice introduction to Python and Jupyter](https://gist.github.com/kenjyco/69eeb503125035f21a9d)
- [GitHub collection of notebooks for learning Python](https://github.com/jerry-git/learn-python3)
- [Jupyter notebook trainings from Jupyter](https://jupyter.org/try)

## Assignment 

We'll ask you to create a function for each code block below. The function will contain documentation based on what we expect it to do. Seems simple! 

***Do not rename the function***, as we have test functions that will automatically grade it. You can create any number of other functions you call in support of this function, but the parameters and name of the requested function must stay the same. For the homework grader to work, you must include all the Python code together as one file. 


#### For example, here is a fake assignment problem:


In [1]:
# 0 points: create a function to determine if the input sequence contains the substring 'AAAA'
def contains_four_As(input_sequence):
    '''
    This function returns true if the input sequence contains the subsequence 'AAAA'
    '''
    
    return False # this is obviously broken

You might correct this to the following function:

In [2]:
def contains_four_As(input_sequence):
    '''
    This function returns True if the input sequence contains the subsequence 'AAAA', False otherwise
    '''
    return('AAAA' in input_sequence)

You could also write a more verbose version of that function that makes the logic easier to follow:

In [3]:
def contains_four_As(input_sequence):
    '''
    This function returns True if the input sequence contains the subsequence 'AAAA', False otherwise
    '''
    if 'AAAA' in input_sequence:
        return True
    else:
        return False

You may want to write small functions to test your work as you go. For instance, we might test the above function like so:

In [11]:
def our_grading_function_for_contains_four_As():
    '''
    This function is similar to our homework grading approach. We create some fake data,
    run the function you've created, and if the results are correct, add points. The sum
    of the points is your final grade.
    '''
    score = 0
    if contains_four_As('AAAA'):
        score += 1
    if not contains_four_As('GGGG'):
        score += 1

    # hopefully returning 2 points!
    return(score)

print("YOUR SCORE:",our_grading_function_for_contains_four_As(), "OF 2")

YOUR SCORE: 2 OF 2


One thing about Jupyter notebooks is that they're not linear! For instance, you can run the cell with the broken **contains_four_As** function, skip the code blocks in between where you've created a better solution, and then run the **our_grading_function_for_contains_four_As** function, which should give you a score of *1* (the default __False__ response is right in one of the cases). Be careful of this; people often choose the __Kernel -> Restart kernel and run all cells__ menu option to ensure their notebook works when run top to bottom.
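
To make the out-of-order problem concrete, here is a minimal sketch in plain Python. It only imitates notebook cells by defining the same function twice; the "cell" labels in the comments are illustrative assumptions, not anything Jupyter produces. Whichever definition ran most recently is the one Python uses.

```python
# Each definition below stands in for a notebook cell. Python keeps
# whichever definition was executed most recently.

def contains_four_As(input_sequence):  # "cell 2": the corrected version
    return 'AAAA' in input_sequence

def contains_four_As(input_sequence):  # "cell 1", re-run later: the broken stub
    return False

# The name now points at the broken stub, even though the corrected
# code is still sitting in the notebook.
result_after_rerun = contains_four_As('AAAA')
print(result_after_rerun)  # prints False
```

Restarting the kernel and running every cell in order removes this ambiguity, because each definition is then executed exactly once, top to bottom.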

# Real homework problems below here



## **2 points**: create a function **add_three_numbers** that adds three numbers together and returns the result

In [5]:

def add_three_numbers(number_a, number_b, number_c):
    '''
    Adds three numbers together and returns the result
    note: don't change the name of the function, or the number of parameters that it takes
    '''
    return(0) # fix: return the sum of the passed-in numbers 

## **2 points**: create a function called **smallest_in_list** that takes a list as input and returns the smallest number within the list. If there is no smallest number, return the number 0. 


In [6]:
def smallest_in_list(input_list):
    '''Returns the smallest number in a list of objects, zero if there is no such smallest number'''
    return(0) # default return value, only valid in select cases!

## **6 points**: create a function called **find_kmers_occurrence**

Your **find_kmers_occurrence** function will take a FASTA-formatted sequence, a kmer size, and a minimum count, 
and will return a Python dictionary with all kmers in the sequence that appear at least minimum_count times.
Your function should not return kmers with ambiguous/degenerate bases.
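
One possible way to think about the ambiguous-base rule is sketched below. The helper name `is_unambiguous` and the constant `UNAMBIGUOUS_BASES` are hypothetical, and the sketch assumes that only uppercase A, C, G, and T count as unambiguous; check the FASTA reference if your data may contain other characters such as lowercase letters or U.

```python
# Assumption: only uppercase A, C, G, and T are unambiguous DNA bases.
UNAMBIGUOUS_BASES = set('ACGT')

def is_unambiguous(kmer):
    '''Hypothetical helper: True if every base in kmer is A, C, G, or T.'''
    # A set comparison checks that no character falls outside the allowed bases
    return set(kmer) <= UNAMBIGUOUS_BASES

print(is_unambiguous('ACGT'))  # prints True
print(is_unambiguous('ACNT'))  # prints False (N means "any base")
```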

#### What is a FASTA file? 

FASTA is a simple, plain-text file format widely used to store biological sequences (DNA, RNA, or protein). It is one of the most common formats in bioinformatics for sequence data, alignment, and analysis. Here’s what defines FASTA:

1.	**Header Line**
    - Each new sequence begins with a header line starting with a `>` character (greater-than sign).
    - Immediately following the `>` sign is a sequence identifier (e.g., an accession number, gene name) and optionally a description.
    - Example header line:

        ```
        >QBS146SequencingResults Homo sapiens
        ```

2. **Sequence Lines**
	- After the header line, one or more lines contain the sequence itself (A, C, G, T/U for nucleotides or the 20 amino acids for proteins).
	- By convention, lines are often wrapped at 80 characters for readability, but a single long line is equally valid.
	- Example sequence content:
    
        
        ATGGTGCTGCTGAAC
        AAAGTACCCTTTGTG
        
3. **Multi-sequence FASTA files**
	- A single FASTA file can contain multiple sequences, in which case items 1 and 2 above simply repeat. You will be getting just a single-sequence FASTA file, but what would be the best way to handle a file with several sequences?
	- Each sequence has its own header followed by its own sequence data, ending right before the next `>` line or the end of the file.
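
As one possible answer to the multi-sequence question, the sketch below splits FASTA text into records. The helper name `parse_fasta` and the example headers are hypothetical; the sketch assumes the input is already a Python string and silently ignores any sequence lines that appear before the first header.

```python
def parse_fasta(fasta_text):
    '''
    Hypothetical helper: split FASTA-formatted text into a list of
    (header, sequence) tuples, joining wrapped sequence lines.
    '''
    records = []
    header = None
    sequence_lines = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, ''.join(sequence_lines)))
            header = line[1:]  # drop the leading '>'
            sequence_lines = []
        else:
            sequence_lines.append(line)
    # The last record ends at the end of the text
    if header is not None:
        records.append((header, ''.join(sequence_lines)))
    return records

example = '>seq1 first example\nATGGTGCTGCTGAAC\nAAAGTACCCTTTGTG\n>seq2\nGGGG\n'
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Returning a list of tuples keeps the records in file order; a dictionary keyed by header would also work if headers are known to be unique.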

#### What is a kmer (k-mer, KMER)? 

A k-mer is a short DNA substring of length k (hence k-mer). For example, if you have a DNA sequence ACTGAT, then the 3-mers (k=3) would be:

- ACT
- CTG
- TGA
- GAT
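
The 3-mer list above can be reproduced with string slicing. This is only a sketch of the enumeration step; the helper name `list_kmers` is hypothetical, and it does no counting or filtering.

```python
def list_kmers(sequence, k):
    '''Hypothetical helper: return every overlapping k-mer, in order.'''
    # A sequence of length n has n - k + 1 possible start positions
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(list_kmers('ACTGAT', 3))  # prints ['ACT', 'CTG', 'TGA', 'GAT']
```

If k is larger than the sequence length, the range is empty and the function returns an empty list.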

K-mers are widely used in bioinformatics for tasks such as:

1. Genome Assembly: Tools like de Bruijn graph assemblers break long sequencing reads into overlapping k-mers, then reconstruct the genome by chaining them together.
2. Sequence Similarity / Comparison: K-mer frequency profiles can be computed to quickly compare large numbers of sequences without performing full alignments.
3. Error Correction: Identifying misrepresented k-mers in data (e.g., from low-quality reads) can help correct sequencing errors.
4. Signature Generation: K-mer compositions can act like “fingerprints” that help classify sequences into different species or detect specific motifs.

We'll use kmers in future lectures and homework a lot, but this is a good place to start!

In [7]:
# 6 points: create a function called 'find_kmers_occurrence' that takes a FASTA-formatted sequence, a kmer size, and a minimum count,
#           and returns a Python dictionary with all kmers in the sequence that appear at least minimum_count times
#           **Your function should not return kmers with ambiguous/degenerate bases.**
def find_kmers_occurrence(sequence, k, minimum_count):
    """
    Find all k-mers that occur at least minimum_count times within a supplied nucleotide sequence.

    See the fasta reference for ambiguous nucleotide codes: https://en.wikipedia.org/wiki/FASTA_format
    
    Args:
    sequence (str): A string containing a sequence of nucleotides.
    k (int): The length of the k-mer.
    minimum_count (int): The minimum occurrence threshold.

    Returns:
    dict: A dictionary, containing k-mers as keys, and their occurrences as values.
    """
    
    empty_dictionary = {}
    
    return(empty_dictionary) # not quite done
    

## How to turn in your assignment

### First, check your assignment

#### Command line 

If you're working in a text editor, just save and rerun the Python file on the command line to make sure it's all set.

#### Notebooks 
If you are working in Jupyter/Colab, it's smart to restart the kernel and rerun everything from scratch, in order. 

In Jupyter, this is the menu bar option 'Kernel -> Restart Kernel and Run All Cells...' (or something like that, depending on the version). In Google Colab, it's something like 'Runtime -> Restart session and run all'.

### Prepare the turn-in Python (.py) file

You'll need to turn in a Python file (a file that ends in .py). If you're working in Jupyter Notebook/Lab or Google Colab, you'll need to export your work as a Python file. In JupyterLab you can do this with:


![Screenshot 2024-03-20 at 10.17.15 AM.png](attachment:7c7390d1-631f-42de-8b2b-8e848fa70ce4.png)

In Google Colab it looks like this: 

![Screenshot 2024-03-20 at 10.18.47 AM.png](attachment:5caf6985-8642-46d9-8948-50a2008e2199.png)


Then you can go over to the Autolab website ([https://qbs146.mckennalab.org/](https://qbs146.mckennalab.org/)) and turn your assignment in. We'll go over the details of this in class, though it's pretty straightforward.