# Pair Programming
BISB Bootcamp 2020

Author: Owen Chapman ochapman@ucsd.edu

24 September 2020

## Workshop Goals
- What is pair programming and why is it useful?
- Programming and algorithms practice!

## What is pair programming?

**Pair programming is more or less what it says on the tin - two people, working on the same programming problem on a shared workstation (one keyboard, one mouse).**  The intuition behind this practice is that the two programmers collaborate closely on the same programming problem, producing higher-quality code while simultaneously learning from each other. Pair programming is used most often in the software industry by so-called "agile" development teams, as one of a number of development practices designed to produce high-quality, user-friendly software.

When pair programming, the two developers adopt two slightly different roles: **driver** and **navigator**. **The driver uses the keyboard and mouse, while the navigator does not.** This distinction allows them to practice different modes of thinking - the driver constructs the lines of code, leaving the navigator more free to focus on direction and architecture. Generally, developers switch roles every few minutes or so. Fluid communication is important! The driver is still interested in code architecture, and the navigator should be reading the code as it is written.

## How to use this notebook
This module presents a number of bioinformatics algorithms problems of varying difficulty. They are intended to be solved as a pair, but of course you can solve them on your own as well!
You can find these problems, and a solution checker, at [rosalind.info](http://rosalind.info).

Some notebook cells will contain markdown text, describing the problem and linking to the corresponding rosalind page. **The rosalind page contains detailed problem descriptions, biological motivation, and test cases!** This notebook is just to help you organize your code solutions. Following the description, a code cell is provided as a sandbox for code development. Sometimes the code cell will have starter code or comments to get you started.

Feel free to skip around to find problems of your skill level. Solutions can be found in the pair-programming-solutions notebook.

(NB: Rosalind.info was developed using Python 2. This notebook is written in Python 3. You may see some minor syntax differences between this notebook and snippets from the Rosalind site.)
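
As a rough illustration of the kind of difference you may run into, the sketch below shows the two most common ones: printing and integer division. The Python 2 forms appear only in comments; the code itself is Python 3.

```python
# Python 2 snippets may use a print statement:   print "Hello"
# In Python 3, print is a function and needs parentheses.
print("Hello")

# In Python 2, '/' between two integers dropped the remainder (7 / 2 gave 3).
# In Python 3, '/' always gives a float; use '//' for floor division.
true_quotient = 7 / 2
floor_quotient = 7 // 2
print(true_quotient, floor_quotient)
```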

# Python Village
Basic python syntax for new programmers ([rosalind link](http://rosalind.info/problems/list-view/?location=python-village))

## [Variables and some arithmetic](http://rosalind.info/problems/ini2/)
Goals: 
- store values in variables
- value *types*: integer, float, and string

*Primitive types* are the basic building blocks of a programming language. Examples in python include
- **int** (integer), for whole numbers of any size (Python 3 integers are not limited to a fixed range);
- **float**, for numbers with a decimal point;
- **bool** (boolean), which can be True or False;
- **string** (`str`), for sequences of characters;
- **None**, a special value meaning "no value".
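
A minimal sketch for exploring these types: the built-in `type()` function reports the type of any value. The example values are arbitrary. The last lines also show that a Python 3 `int` can hold values far larger than a couple of billion.

```python
# One example value of each type listed above.
examples = [42, 3.14, True, "ACGT", None]

# type(value).__name__ is the short name of the value's type.
type_names = [type(value).__name__ for value in examples]
for value, name in zip(examples, type_names):
    print(repr(value), name)

# Python 3 integers grow as needed.
big = 2**100
print(big)
```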

**Problem:** 

Given: Two positive integers a and b, each less than 1000.

Return: The integer corresponding to the square of the hypotenuse of the right triangle whose legs have lengths a and b.

In [None]:
################################################
## Examples
# Assigning values to variables
a_Variable = 4 # integer
b = 5.2 # float
c4_you_cant_start_a_variable_with_a_number = "Hello"

# Some arithmetic
a_Variable = a_Variable + 7 # Can reassign a variable to a new value
b = a_Variable**2 # The exponentiation operator is '**'. Here we assign b the value of a_Variable squared.
print(c4_you_cant_start_a_variable_with_a_number + " World!") # Can do operations on strings too

#################################################
# Your code here:


## [Strings and Lists](http://rosalind.info/problems/ini3/)

Goals:
- Familiarity with lists, your first data structure!
    
Data structures are ways of organizing, collecting, and manipulating the primitive types. Some common and useful ones in Python are:
- **Lists** [] for ordered sequences
- **Strings** "" for ordered sequences of alphanumeric characters (can often do list operations on strings)
- **Dictionaries** {key: value} for indexed pairs of {lookup index, value}

**Problem**

Given: A string s of length at most 200 letters and four integers a, b, c and d.

Return: The slice of this string from indices a through b and c through d (with space in between), inclusively. In other words, we should include elements s[b] and s[d] in our slice.
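
The word "inclusively" matters because Python slices stop just *before* their end index. A small sketch of the difference, using an arbitrary example word rather than the problem input:

```python
word = "bioinformatics"

# word[i:j] starts at index i and stops just before index j.
exclusive = word[3:7]

# To include the character at index 7 as well, stop one position later.
inclusive = word[3:7 + 1]

print(exclusive)
print(inclusive)
```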

In [None]:
############################################################
## Examples
# Define a string
a = "Here is a string."

# Define a list
b = [a, "here is another string", "one more string"]

# Get a character at an index of a string
ex1 = a[0]
print(ex1)

# Get a range of characters:
ex2 = a[0:4] # can also be a[:4]
print(ex2)
############################################################
# Your code here:


## [Conditions and Loops](http://rosalind.info/problems/ini4/)
Sometimes, you need to do things more than once. You could write the code lots of times, but that would be terrible. Instead, we use *control flow* to dictate whether, and how many times, a piece of code runs.
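
Two further control-flow tools that often come in handy are sketched here with arbitrary numbers: `range` with a step, and the `while` loop. The modulo operator `%` is included because it makes a convenient test inside an `if`.

```python
# range(start, stop, step) counts from start up to, but not including, stop.
evens = []
for n in range(0, 10, 2):
    evens.append(n)
print(evens)

# A while loop repeats for as long as its condition is true.
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n = n - 1
print(countdown)

# n % 2 is the remainder after dividing by 2: 1 for odd numbers, 0 for even ones.
if 7 % 2 == 1:
    print("7 is odd")
```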

**Problem**

Given: Two positive integers a and b (a<b<10000).

Return: The sum of all odd integers from a through b, inclusively.

In [None]:
############################################################
## Examples
# A bit of fun: let's generate a random integer in [0,2] (randint includes both ends) and print depending on the result
import random # Only do this once per program
rv = random.randint(0,2)
print(rv)
if rv == 0:
    print("Here we do thing zero!")
elif rv == 1:
    print("Wow we do a different thing")
else:
    print("Very program much wow")
    
# Let's count to six
six=[1,2,3,4,5,6]
for number in six:
    print(number)

# Cool, let's do it again, this time using an index
for i in range(len(six)):
    print(six[i])
############################################################
# Your code here:


## [Working with Files](http://rosalind.info/problems/ini5/)

You'll very frequently need to bring in data from other files. There are many common ways to do this: we could read a file line by line in plain Python, we could use the pandas library, etc. The examples below show both.
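
A minimal sketch of the line-by-line approach. So that it runs anywhere, it first writes a small throwaway file; the file name and contents are made up for illustration.

```python
import os
import tempfile

# Create a small temporary file to read back.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

# Iterating over a file object yields one line at a time.
# enumerate(..., start=1) attaches a 1-based line number to each line.
numbered = []
with open(path) as f:
    for line_number, line in enumerate(f, start=1):
        numbered.append((line_number, line.rstrip("\n")))

print(numbered)
```

Each line read from a file still ends with its newline character, which is why the sketch strips it before storing the text.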

### Pandas

A common way to store data is the **table**, as used in relational databases. A table consists of multiple rows, or observations, each with a unique index; the properties of each observation are stored in the columns:

| Gene | Sample1 | Sample2 | Sample3 | Sample4 |
| --- | --- | --- | --- | --- |
| CDK6 | 742 | 100 | 46 | 10294 |
| MYC | 343 | 54365 | 2345 | 867 |
| BRAF | 445 | 6783 | 483 | 8890 |

You've encountered tables like these as Excel, .tsv, or .csv files. **Pandas** is a Python library for manipulating tabular data. The syntax is kind of clunky, but it's worth learning if Python is your programming language of choice. (There are similar libraries in other languages; see data.table and data.frame in R.)
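
As a sketch of what this looks like in practice (assuming pandas is installed), the example table can be loaded into a DataFrame with the gene names as the row index. The table is written out as a tab-separated string here so that no external file is needed.

```python
import io
import pandas as pd

# The example table, as tab-separated text.
tsv = (
    "Gene\tSample1\tSample2\tSample3\tSample4\n"
    "CDK6\t742\t100\t46\t10294\n"
    "MYC\t343\t54365\t2345\t867\n"
    "BRAF\t445\t6783\t483\t8890\n"
)

# io.StringIO lets read_csv treat the string like a file.
# index_col="Gene" makes the gene name the row index.
df = pd.read_csv(io.StringIO(tsv), sep="\t", index_col="Gene")

print(df)
print(df.loc["MYC", "Sample2"])   # one cell, by row label and column label
print(df["Sample1"].sum())        # sum over a whole column
```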

**Problem**

Given: A file containing at most 1000 lines.

Return: A file containing all the even-numbered lines from the original file. Assume 1-based numbering of lines.

In [None]:
############################################################
## Examples
# Read a file
file = open('example.bed','r') # 'r' is 'read' mode
for line in file.readlines():
    print(line, end='') # each line already ends in a newline, so don't add another
file.close() # remember to close your file afterward, otherwise the file handle stays open!

# Write a file
haiku='''Star Wars names such as
Twi'lek stress me out because
Can you close that string
'''
with open('haiku.txt','w') as f:
    f.write(haiku)
# Don't need to close the file if you use the 'with open() as f' syntax.

# Read example.bed into a Pandas DataFrame.
import pandas as pd
df = pd.read_csv('example.bed',sep='\t',header=None,names=['chrom','start','end'])
print(df)
############################################################
# Your code here:


## [Dictionaries](http://rosalind.info/problems/ini6/)

**Dictionaries** are another basic data structure common to many languages. A dictionary allows you to take any *value* and assign it a unique lookup index, or *key*. You can then use the key to look up the value very efficiently. You can store anything in the value: a string, a list, even another dictionary.
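
A short sketch of these ideas, reusing numbers from the expression table earlier in this notebook. The variable names are chosen for illustration only.

```python
# Values can be any type: here each value is a list of numbers.
expression = {
    "CDK6": [742, 100, 46, 10294],
    "MYC": [343, 54365, 2345, 867],
}

# Look up a value by its key.
myc_counts = expression["MYC"]

# 'in' tests whether a key is present, without raising an error.
has_braf = "BRAF" in expression

# .get() returns a default instead of raising KeyError for a missing key.
braf_counts = expression.get("BRAF", [])

# Values can themselves be dictionaries.
nested = {"MYC": {"Sample1": 343, "Sample2": 54365}}
myc_sample2 = nested["MYC"]["Sample2"]

print(myc_counts, has_braf, braf_counts, myc_sample2)
```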

**Problem**

Given: A string s of length at most 10000 letters.

Return: The number of occurrences of each word in s, where words are separated by spaces. Words are case-sensitive, and the lines in the output can be in any order.

In [None]:
############################################################
## Examples
# Make a dict of phone numbers
phones = {'Zoe':'232-43-58', 'Alice':'165-88-56'}
print(phones)
# Update Zoe's phone
phones['Zoe'] = '658-99-55'
# Add Bill's phone
phones['Bill'] = '342-18-25'
print(phones)

# get the keys in a dictionary using d.keys()
print(phones.keys())
############################################################
s = 'We tried list and we tried dicts also we tried Zen'
def word_count(s):
    d = {} # Create a dict
    words = s.split() # split() with no arguments splits a string on whitespace: "Here a string" becomes ['Here','a','string']
    for word in words:
        pass # Your code here

# [Bioinformatics Stronghold](http://rosalind.info/problems/list-view/)

Bioinformatics algorithms problems in rough ascending order of difficulty. The linked problems often have relevant biological or computational explanations, so check them out!


## [Counting DNA Nucleotides](http://rosalind.info/problems/dna/)
**Problem**

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG".
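
The two definitions above (length and alphabet) can be checked directly in Python. This is only a sketch of the terminology, not a solution to the counting problem:

```python
s = "ATGCTTCAGAAAGGTCTTACG"

# len() gives the number of symbols in the string.
length = len(s)

# set() collects the distinct symbols, i.e. the alphabet actually used.
alphabet = sorted(set(s))

print(length)
print(alphabet)
```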

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [None]:
# Your code here:
def count_nucleotides(s):
    pass

## [Complementing a Strand of DNA](http://rosalind.info/problems/revc/)
**Problem**

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").
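
To make the definition concrete, here is one possible step-by-step sketch applied to the "GTCA" example. The dictionary and variable names are illustrative choices, and other approaches work just as well.

```python
# Pairing rule from the text, stored as a lookup table.
complement = {"A": "T", "T": "A", "C": "G", "G": "C"}

s = "GTCA"

# Step 1: reverse the symbols. A slice with step -1 walks the string backwards.
reversed_s = s[::-1]

# Step 2: replace each symbol by its complement and join the pieces back together.
reverse_complement = "".join(complement[base] for base in reversed_s)

print(reversed_s)
print(reverse_complement)
```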

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement $s^c$ of $s$.

## [Rabbits and Recurrence Relations](http://rosalind.info/problems/fib/)

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence $(\pi, -\sqrt{2}, 0, \pi)$ and the infinite sequence of odd numbers $(1, 3, 5, 7, 9, \ldots)$. We use the notation $a_n$ to represent the $n$-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the $n$-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.
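
A minimal sketch of how a recurrence like this can be turned into a loop. It computes only the classic sequence (each term is the sum of the two before it, starting from 1, 1); the function name is an illustrative choice, and adapting it to litters of k pairs is left to the problem below.

```python
def fibonacci(n):
    """Return the n-th Fibonacci number for n >= 1, with the first two terms equal to 1."""
    previous, current = 1, 1          # terms 1 and 2
    for _ in range(n - 2):            # each pass advances by one month
        previous, current = current, previous + current
    return current

first_terms = [fibonacci(n) for n in range(1, 11)]
print(first_terms)
```

Keeping only the last two terms is enough, because the recurrence never looks further back than that.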

**Problem**

Given: Positive integers n≤40 and k≤5.

Return: The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

## Find the Most Frequent Words in a String

We say that Pattern is a most frequent k-mer in Text if it maximizes Count(Text, Pattern) among all k-mers. For example, "ACTAT" is a most frequent 5-mer in "ACAACTATGCATCACTATCGGGAACTATCCT", and "ATA" is a most frequent 3-mer of "CGATATATCCATAG".
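
The text above does not spell out Count(Text, Pattern). The sketch below assumes it means the number of occurrences of Pattern in Text, with overlapping occurrences counted separately, and checks that reading against the two examples. The function name `count` is an illustrative choice.

```python
def count(text, pattern):
    """Number of occurrences of pattern in text, counting overlaps."""
    k = len(pattern)
    total = 0
    # Slide a window of width k across text and compare at every position.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            total += 1
    return total

print(count("ACAACTATGCATCACTATCGGGAACTATCCT", "ACTAT"))
print(count("CGATATATCCATAG", "ATA"))

# Note: the built-in str.count skips overlapping matches, so it can give a smaller number.
print("CGATATATCCATAG".count("ATA"))
```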

**Frequent Words Problem**

Find the most frequent k-mers in a string.

Given: A DNA string Text and an integer k.

Return: All most frequent k-mers in Text (in any order).

## [Finding a Shared Motif](http://rosalind.info/problems/lcsm/)

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, "CG" is a common substring of "ACGTACGT" and "AACCGTATA", but it is not as long as possible; in this case, "CGTA" is a longest common substring of "ACGTACGT" and "AACCGTATA".

Note that the longest common substring is not necessarily unique; for a simple example, "AA" and "CC" are both longest common substrings of "AACC" and "CCAA".
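
The definition of a common substring can be checked with Python's `in` operator. This sketch only tests a given candidate against the examples above; it does not search for the longest one. The helper name is hypothetical.

```python
def is_common_substring(candidate, strings):
    """True if candidate occurs in every string of the collection."""
    return all(candidate in s for s in strings)

pair = ["ACGTACGT", "AACCGTATA"]
print(is_common_substring("CG", pair))
print(is_common_substring("CGTA", pair))
print(is_common_substring("ACGTA", pair))  # longer, but not present in both

other = ["AACC", "CCAA"]
print(is_common_substring("AA", other), is_common_substring("CC", other))
```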

**Problem**

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

## [Inferring mRNA from Protein](http://rosalind.info/problems/mrna/)


For positive integers a and n, a modulo n (written a mod n in shorthand) is the remainder when a is divided by n. For example, 29 mod 11=7 because 29=11×2+7.
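
In Python, the modulo operation is written `%`. The sketch below reproduces the example from the text, then illustrates a useful property: when multiplying many numbers, taking the remainder after every step gives the same final remainder as taking it once at the end. The factors are arbitrary numbers chosen for illustration.

```python
# The example from the text: 29 = 11 * 2 + 7
quotient, remainder = divmod(29, 11)
print(quotient, remainder)
print(29 % 11)

MODULUS = 1_000_000
factors = [123456, 654321, 999999]   # arbitrary, for illustration only

# Multiply everything first, reduce once at the end.
direct = 1
for f in factors:
    direct = direct * f
direct = direct % MODULUS

# Reduce after every multiplication, so intermediate values stay small.
stepwise = 1
for f in factors:
    stepwise = (stepwise * f) % MODULUS

print(direct == stepwise)
```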

For this problem, you may also find this codon table helpful:
<img src="images/codon_table.png" width="480">

**Problem**

Given: A protein string of length at most 1000 aa.

Return: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don't neglect the importance of the stop codon in protein translation.)