# Projects

## Basic parameters

Hi, all!  Great work this week.  It's time to put your knowledge to the test (not be tested on your knowledge).  Let's do some projects!

Here are some details that will help you not bug out about the final projects:
- The projects described below are toy models of scientifically relevant systems.  Your goal in investigating these systems is to gain a better understanding of them, **not** to "solve" them exhaustively.  
- Students will work in groups of 2 or 3.
- I will be available for consultation all of Thursday afternoon and Friday morning after 9:15.  Don't suffer!  Please stop by Swanson 100.b!  I like updates *:^)*
- Final presentations should be less than 10 minutes in length and include no more than 6 slides.  Here's a good model for your slides:
    1. Introduction.  What is the system?  What is the problem?  What is the goal?  Why are you doing this with a computer and not pencil/paper?
    2. Analysis/simulation steps.  What are the pieces of your analysis code or simulation?  Tell us what they do rather than showing us the code.
    3. At least one graph or visualization
    4. Problems encountered
    5. Results/conclusions
- All group members should contribute equally to the presentation, please!

It is up to the groups to decide which group gets which project below.  I need to know who is doing what before 12:00 PM today.

***

## Project 0: DNA as data
Here is a cool problem having to do with DNA structure.  (Please keep in mind that this problem is written by a particle physicist, not someone with a good understanding of DNA structure.)

You may know that DNA kind of looks like a ladder, with the rungs represented by *base pairs* made of pairs (duh) of *nucleobases*.
Nucleobases are complex molecules that come in four flavors: cytosine [C], guanine [G], adenine [A] or thymine [T].  A and T pair, G and C pair.
Genetic information is encoded in the order/sequence of these base pairs.

In the `data_files` directory, you'll find the following five files:
- dna_1_1000000bps.txt 
- dna_1_1000bps.txt    
- dna_2_1000000bps.txt 
- dna_2_1000bps.txt
- dna_marker_50bps.txt

The first four files indicate the type of DNA (1 or 2) and the number of base pairs in the file (1000 or $10^6$).
The fifth file contains a small (50 base pairs) snippet of DNA (marker) that is an indicator of a congenital disease.

#### Goals
1. Determine whether the distribution of base pair types is random in each of the four DNA files.
2. Determine whether any of the four DNA files contains the marker given in the fifth file and, if so, where in the sequence the marker lives.

#### Steps
1. Begin by writing some code that reads in a DNA file and makes bar graphs of the frequencies of the four **base pair** types (*i.e.*, NOT nucleobase types).  Develop/test this code on the smaller files before running it on the larger files.
2. Write code that reads in a DNA file and the marker file and scans through the sequence of base pairs to check whether chunks of 50 base pairs match the marker.  Find the line number/index at which the marker begins in the sequence and count the number of matches in each file.
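
The two steps above can be sketched as follows.  This is only a minimal illustration, not a prescribed solution: the function names `count_symbols` and `find_marker` are hypothetical, and the layout of the files in `data_files` is not assumed here, so the example runs on a short made-up sequence that has already been read into a Python list.

```python
from collections import Counter


def count_symbols(sequence):
    """Return a dict mapping each symbol in the sequence to its count."""
    return dict(Counter(sequence))


def find_marker(sequence, marker):
    """Return every start index at which marker occurs in sequence."""
    width = len(marker)
    matches = []
    # Slide a window of the marker's length along the sequence.
    for start in range(len(sequence) - width + 1):
        if sequence[start:start + width] == marker:
            matches.append(start)
    return matches


# Toy data standing in for the real files (their format is not assumed here).
toy_sequence = list("ATGCGATGCA")
toy_marker = list("ATGC")

counts = count_symbols(toy_sequence)
positions = find_marker(toy_sequence, toy_marker)
print(counts)
print(positions)
```

The counts dictionary can be handed to a bar-chart function (for example `matplotlib.pyplot.bar`, if that is the plotting library you are using).  Note that the window scan also reports overlapping matches, and that the length of `positions` is the number of matches.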

## Project 1: Random walks
In this problem, you'll generate the trajectory of an object that makes discrete movements along a single axis.
This may seem like it's too simple to be of use, but this problem is a model of the physics responsible for diffusion and mixing.
The problem is sometimes called "the drunkard's walk" because of the following analogy:

> Picture a drunk person leaning against a lamppost in the middle of a very long sidewalk.
We'll imagine the $x$-axis as being aligned with this sidewalk and the lamppost at $x = 0$.
At time $t = 0$, the person decides that it is time to go home, and sets out making a step of length $0.5$ m every second.
There is only one problem: they are so compromised that each step is a random one!
The probability that the person steps in the positive-$x$ direction (at each step) is $p_+$, and the probability of a step in the negative-$x$ direction is $p_-$.
Each step is a random event and doesn't depend on the previous position.

#### Goals
1. Simulate random walks in one and two dimensions.
2. Calculate statistical quantities that characterize the "average" random walk.
3. Compare the probability that the agent returns to the origin in one and two dimensions.

#### Steps
1. Write a program that simulates a random walk in one dimension.  The program should randomly generate a step either to the right (+1) or to the left (-1) with equal probabilities, and make a list of the successive positions of the agent (*i.e.*, the $i^{th}$ element is the agent's position after $i$ steps).  Each walk should last 1000 steps.  Just for something nice to look at, make a graph of the position of the agent versus step number.
2. Put the code from step 1 in a loop that simulates $10^4$ such random walks.  Calculate the average final distance from the origin and the number of walks that returned to the origin at least once.
3. Now modify your code so that the agent walks in two dimensions.  That is, the agent will either step left, right, up or down with equal probability.  Note that your position list will need to store both the $x$ and $y$ positions after each step; it is likely that your list will be a list of 2-element lists. Calculate the average final distance from the origin and the number of walks that returned to the origin at least once.
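
A minimal sketch of the one-dimensional steps, assuming Python's standard `random` module as the source of randomness.  The function names `random_walk_1d` and `summarize_walks` and the fixed seed are illustrative choices, not requirements.  The demo uses only 200 walks so that it runs quickly; the project asks for $10^4$.

```python
import random


def random_walk_1d(n_steps, rng):
    """Return the list of positions after each step, starting from 0."""
    positions = [0]
    for _ in range(n_steps):
        step = rng.choice([-1, 1])  # left or right with equal probability
        positions.append(positions[-1] + step)
    return positions


def summarize_walks(n_walks, n_steps, seed=0):
    """Return (mean final |x|, number of walks that revisited the origin)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total_distance = 0
    n_returned = 0
    for _ in range(n_walks):
        positions = random_walk_1d(n_steps, rng)
        total_distance += abs(positions[-1])
        # positions[0] is the starting point, so only later visits count.
        if 0 in positions[1:]:
            n_returned += 1
    return total_distance / n_walks, n_returned


mean_distance, n_returned = summarize_walks(n_walks=200, n_steps=1000)
print(mean_distance, n_returned)
```

Here element `i` of the position list is the position after `i` steps, with element 0 being the starting point.  For the two-dimensional version, one option is to draw each step from the four pairs `(1, 0)`, `(-1, 0)`, `(0, 1)`, `(0, -1)` and store the position as a 2-element list.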

## Project 2: Epidemiological simulation

Here is a cool problem having to do with numerical integration.  (See day 1!)

Let $f(t)$ be the fraction of a population that has been exposed to a virus as a function of time.  A simple model for how this virus (or a meme, or a rumor, *etc.*) spreads through the population is that the **rate** of spread is proportional to the product of the fraction of people who have the virus and the fraction of people who do not (yet?) have the virus.  [Think about this for a minute!]  For simplicity, we assume that once a person gets the virus, they have it for good.

Mathematically, we write this as

\begin{equation}
\frac{df}{dt} = \beta f (1-f) 
\end{equation}

where $\beta$ is called the *transmission constant*.  In this model, $\beta$ essentially quantifies how transmissible the virus is.

#### Goal
Gain a better understanding of how quickly this virus spreads through a population, and how this spread depends on $\beta$.  Often we can control $\beta$ with simple intervention measures like masking, social distancing, and hand-washing.

#### Steps
1. Begin by numerically integrating this equation from $t = 0$ to some later $t$ at which $f(t)$ is greater than $0.95$.  Use values of $\beta = 0.01$ and $f(0) = 0.01$.  Make a graph of $f(t)$ versus $t$.
2. According to this model, will the entire population eventually get the virus?  Justify your conclusion.
3. At what $t$ value does 1/2 of the population have the virus?
4. Repeat your integration with a smaller transmission constant, $\beta = 0.005$.  Compare your results for both $\beta$ values by graphing the two $f(t)$ on the same axes.
5. If everyone will eventually get the virus, is there a good reason to reduce the rate at which this happens?
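
One way to approach step 1 is a forward Euler update, $f_{n+1} = f_n + \Delta t \, \beta f_n (1 - f_n)$.  The sketch below assumes that method; it is not necessarily the integration scheme from day 1.  The function name `integrate_logistic`, the step size `dt = 0.1`, and the `max_steps` safety cap are all illustrative choices.

```python
def integrate_logistic(beta, f0, dt, f_stop=0.95, max_steps=10_000_000):
    """Integrate df/dt = beta * f * (1 - f) with forward Euler steps.

    Returns lists of times and f values, stopping once f exceeds f_stop
    (or once max_steps steps have been taken, as a safety cap).
    """
    times = [0.0]
    values = [f0]
    while values[-1] <= f_stop and len(values) <= max_steps:
        f = values[-1]
        values.append(f + dt * beta * f * (1.0 - f))
        times.append(times[-1] + dt)
    return times, values


times, values = integrate_logistic(beta=0.01, f0=0.01, dt=0.1)
print(times[-1], values[-1])
```

The returned lists can be plotted directly as $f(t)$ versus $t$, and calling the function again with a different `beta` gives the second curve for the comparison.  It is worth rerunning with a smaller `dt` to check that the curve does not change noticeably.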
