### Lab 6: Simulation

*Elements of Data Science* <br>
Welcome to Module 2 and Lab 6! This week, we will go over conditionals and iteration, and introduce the concept of randomness. All of this material is covered in [Chapter 9](https://inferentialthinking.com/chapters/09/Randomness.html) and [Chapter 10](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html) of the online <i>Inferential Thinking</i> textbook.

First, set up the tests and imports by running the cell below.

In [None]:
name = ...

In [None]:
## import statements
from gofer.ok import check
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# Fix for datascience plots
import collections as collections
import collections.abc as abc
collections.Iterable = abc.Iterable

## 1. Sampling

#### 1.1 Dungeons and Dragons and Sampling
In the game Dungeons & Dragons, each player plays the role of a fantasy character.
A player performs actions by rolling a 20-sided die, adding a "modifier" number to the roll, and comparing the total to a threshold for success. The modifier depends on her character's competence in performing the action.
For example, suppose Alice's character, a barbarian warrior named Roga, is trying to knock down a heavy door. She rolls a 20-sided die, adds a modifier of 11 to the result (because her character is good at knocking down doors), and *succeeds if the total is greater than 15*.<br> A [Medium post](https://towardsdatascience.com/understanding-probability-theory-with-dungeons-and-dragons-a36bc69aec88) discusses probability in the context of Dungeons and Dragons. <br><br><img src= data/DnD.JPG height="200"><br>
**Question 1.1** Write code that simulates that procedure. Compute three values: the result of Alice's roll (roll_result), the result of her roll plus Roga's modifier (modified_result), and a boolean value indicating whether the action succeeded (action_succeeded). Do not fill in any of the results manually; the entire simulation should happen in code.
Hint: A roll of a 20-sided die is a number chosen uniformly from the array make_array(1, 2, 3, 4, ..., 20). So a roll of a 20-sided die plus 11 is a number chosen uniformly from that array, plus 11.
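
As a minimal sketch of the idea in the hint, the snippet below draws one value uniformly at random from a small array. It uses a 6-sided die rather than the 20-sided die in the question, and the names `six_sided`, `one_roll`, and `shifted_roll` are illustrative, not part of the lab.

```python
import numpy as np

# Faces of a hypothetical 6-sided die. np.arange excludes the stop value,
# so np.arange(1, 7) is the array [1, 2, 3, 4, 5, 6].
six_sided = np.arange(1, 7)

# np.random.choice picks one element uniformly at random.
one_roll = np.random.choice(six_sided)

# Adding a constant shifts every possible outcome by that amount.
shifted_roll = one_roll + 2
print(one_roll, shifted_roll)
```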

In [None]:
possible_rolls = np....(1,...)
roll_result = np........(...)
modified_result = ...
action_succeeded = modified_... > ...

# The next line just prints out your results in a nice way
# once you're done.  You can delete it if you want.
print("On a modified roll of {:d}, Alice's action {}.".format(modified_result, "succeeded" if action_succeeded else "failed"))

In [None]:
check('tests/q1.1.py')

**Question 1.2** Run your cell 7 times to manually estimate the chance that Alice succeeds at this action. (Don't use math or an extended simulation.) Your answer should be a fraction.
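
One way to record a manual estimate is as successes divided by trials. The sketch below uses made-up outcomes (the list `example_outcomes` is hypothetical, not real results from the simulation) only to show the arithmetic.

```python
# Hypothetical record of 7 runs: True means the action succeeded.
example_outcomes = [True, True, False, True, True, True, False]

# True counts as 1 and False as 0, so sum() counts the successes.
num_successes = sum(example_outcomes)
example_chance = num_successes / len(example_outcomes)
print(num_successes, "successes out of", len(example_outcomes))
```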

In [None]:
rough_success_chance = ...

In [None]:
check('tests/q1.2.py')

Suppose we don't know that Roga has a modifier of 11 for this action.  Instead, we observe the modified roll (that is, the die roll plus the modifier of 11) from each of 7 of her attempts to knock down doors.  We would like to estimate her modifier from these 7 numbers.<br>

**Question 1.3** Write a Python function called `simulate_observations`.  It should take two arguments, `modifier` and `num_observations`, and it should return an array of `num_observations` numbers.  Each of the numbers should be the modified roll from one simulation.  **Then**, call your function once to compute an array of 7 simulated modified rolls.  Name that array `observations`.
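
A loop that builds an array one value at a time usually follows the pattern sketched below. The example collects coin flips instead of die rolls, and `flip_coins` and `num_flips` are hypothetical names. Note that `np.append` returns a new array rather than changing the old one, so the result has to be assigned back.

```python
import numpy as np

def flip_coins(num_flips):
    """Return an array of num_flips simulated coin flips (0 = tails, 1 = heads)."""
    flips = np.array([])                      # start with an empty array
    for _ in np.arange(num_flips):
        one_flip = np.random.choice(np.array([0, 1]))
        flips = np.append(flips, one_flip)    # reassign: np.append does not modify in place
    return flips

example_flips = flip_coins(5)
print(example_flips)
```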

In [None]:
modifier = 11
num_observations = 7

def simulate_observations(modifier, num_observations):
    """Produces an array of num_observations simulated modified die rolls"""
    results=make_array()
    for i in np.arange(...):
        possible_rolls = np....(1,...)
        roll_result = np........(...)
        modified_result = ... + ...
        ... = np.append(results,...)
    return ... 

observations = ... # Hint: use the newly defined function
observations

In [None]:
check('tests/q1.3.py')

**Question 1.4** Draw a histogram to display the *probability distribution* of the modified rolls we might see.  Check with a neighbor or a CA to make sure you have the right histogram. Carry this out again using 100 rolls.
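
When the values are whole numbers, integer bin edges give each possible value its own bin. The sketch below counts a few hypothetical modified rolls with `np.histogram`, assuming a modifier of 11 and edges built as `np.arange(1, modifier+2+20)`; the array `example_rolls` is invented for illustration.

```python
import numpy as np

example_modifier = 11
# Hypothetical modified rolls (possible values run from 12 to 31).
example_rolls = np.array([12, 12, 15, 31])

# Edges 1, 2, ..., 32: bin i covers the whole number i + 1.
edges = np.arange(1, example_modifier + 2 + 20)

counts, _ = np.histogram(example_rolls, bins=edges)
# Counts for the bins that start at 12, 15, and 31.
print(counts[11], counts[14], counts[30])
```

The last bin includes its right edge, which is why the edges extend one step past the largest possible roll.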

In [None]:
# We suggest using these bins.
roll_bins = np.arange(1, modifier+2+20, 1)

num_observations=...
roll_hist= Table()...._...('Roll num',np....(1,...+1), 'Roll result', simulate_...(...,...)).hist('...', bins=roll_...)

### Estimate the modifier
Now let's imagine we don't know the modifier and try to estimate it from observations.
One straightforward (but clearly suboptimal) way to do that is to find the smallest total roll: the smallest roll on a 20-sided die is 1, which is roughly 0, so the smallest total is close to the modifier. Use a random number for <i>modifier</i> to start and keep this value through the next questions. We will also generate 100 rolls based on the unknown modifier below. <br>
**Question 1.5** Using that method, estimate modifier from observations. Name your estimate min_estimate.
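
The reasoning behind a minimum-based method can be sketched on a smaller example. Here a hypothetical 6-sided die and a hypothetical offset of 3 stand in for the 20-sided die and the unknown modifier. Because every roll is at least 1, the smallest observed total minus 1 can never fall below the true offset. This is one possible reading of the method, not necessarily the form the tests expect.

```python
import numpy as np

true_offset = 3   # hypothetical value, known here only so the result can be checked
totals = np.random.choice(np.arange(1, 7), size=50) + true_offset

# Each total is at least true_offset + 1, so min(totals) - 1 >= true_offset.
offset_guess = min(totals) - 1
print(offset_guess)
```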

In [None]:
modifier = np.random.randint(1,20) # Generates a random integer modifier from 1 to 19 (the upper bound, 20, is excluded)
... = simulate_observations(modifier, num_observations)
...
min_estimate = ...
min_estimate

In [None]:
check('tests/q1.5.py')

## Estimate the modifier based on the mean of observations.
**Question 1.6** Figure out a good estimate based on that quantity. Then, write a function named mean_based_estimator that computes your estimate. It should take an array of modified rolls (like the array observations) as its argument and return an estimate (a single number) of the modifier based on the numbers contained in the array.
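
A fact that may help here: the average face value of a fair die with faces 1 through n is (n + 1) / 2. The check below confirms this numerically for a 20-sided die; the variable names are illustrative.

```python
import numpy as np

faces = np.arange(1, 21)          # 1, 2, ..., 20
average_face = np.mean(faces)     # equals (20 + 1) / 2
print(average_face)               # 10.5
```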

In [None]:
def mean_based_estimator(obs):
    """Estimate the roll modifier based on observed modified rolls in the array obs."""
    ...

# Here is an example call to your function.  It computes an estimate
# of the modifier from our  observations.
mean_based_estimate = mean_based_estimator(observations)
mean_based_estimate

In [None]:
check('tests/q1.6.py')

**Question 1.7** Construct a histogram of the modified rolls and compare it to the estimates above. Are they consistent?
What is your best estimate of the random modifier based on the above, without examining its value?

In [None]:
plt.hist(..., bins = ...) # Use to plot histogram of an array of 100 modified rolls
estimated_modifier = ...

In [None]:
check('tests/q1.7.py')

## 2. Sampling and GC content of DNA sequence
DNA within a cell contains codes or sequences for the ultimate synthesis of proteins. DNA is made up of four types of nucleotides, guanine (G), cytosine (C), adenine (A), and thymine (T), connected in an ordered sequence. These nucleotides on a single strand pair with complementary nucleotides on a second strand: G pairs with C and A with T. Regions of DNA that code for RNA, which ultimately directs protein synthesis, are known as genes, and these segments often have higher GC content. Here we will sample 10-nucleotide segments of a DNA sequence and determine the GC content of these segments. See [DNA sequence basics](http://data-science-sequencing.github.io/Win2018/assets/lecture2/lecture2_2018.pdf) and [GC Content details](https://geneticeducation.co.in/what-is-the-importance-of-gc-content/).
##### Our goal is to sample portions (10 nucleotides) of the sequence and determine the relative content of guanine (G) and cytosine (C) to adenine (A) and thymine (T)
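
As a small illustration of what "GC content as a fraction" means, the sketch below computes it for a made-up 6-nucleotide segment. The string `segment` is hypothetical and is not taken from the lab's sequence.

```python
# Hypothetical 6-nucleotide segment, not taken from the lab's sequence.
segment = "GGCATA"

# str.count returns how many times a character appears.
gc_count = segment.count("G") + segment.count("C")   # 2 + 1
gc_fraction = gc_count / len(segment)
print(gc_fraction)   # 0.5
```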

In [None]:
# DNA sequence we will examine, a string
seq = "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG \
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG \
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA \
AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA \
ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT \
AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA \
GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC \
AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT \
TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT \
GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT \
GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC" # LCBO-Prolactin precursor-Bovine
seq

**Question 2.1A** Run the first two code cells below to see how substrings are extracted and how a character can be counted within a substring. Use the same strategy to determine the GC content, as a fraction of the total, in the first 10 nucleotides of the larger sequence above, `seq`.

In [None]:
# Example A
samplesize = 4
# Use this short sequence in this example
seq0 = 'GTGAAAGATT'
# How to get a substring
seq0[0:samplesize]

In [None]:
# Example B
# How to count the number of times 'A' appears in sequence
seq0[0:samplesize].count('A')

In [None]:
GCcount = seq[0:10].count('G') + seq[...]...
GCfraction = ...

##### Lists
Below we assemble a list and append an additional entry, 0.7. This is a useful strategy when creating your function.
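
Inside a loop, the same pattern collects one result per iteration. Unlike `np.append`, which returns a new array, `list.append` changes the list in place, so no reassignment is needed. The values here are hypothetical and chosen only to make the output easy to check.

```python
# Hypothetical values, used only to illustrate list.append inside a loop.
squares = []
for n in range(1, 5):
    squares.append(n * n)   # modifies the list in place

print(squares)   # [1, 4, 9, 16]
```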

In [None]:
gc = []
gc.append(0.8)
gc.append(0.7)
gc

##### Fill a list with 30 random G, C, T, A nucleotides
Use iteration and `np.random.choice`.
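
`np.random.choice` also works on a list of strings. The sketch below uses a hypothetical two-sided coin to show a single draw and, for comparison, the optional `size` argument, which returns several independent draws at once without a loop.

```python
import numpy as np

coin = ['H', 'T']                          # hypothetical two-sided example
single = np.random.choice(coin)            # one draw
several = np.random.choice(coin, size=5)   # five independent draws in one call
print(single, several)
```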

In [None]:
my_sim_seq = []
nucleo = ['G','C','T','A']
for i in np.arange(...):
    my_sim_seq.append(np. ...)

print(my_sim_seq)   


**Question 2.1B** We will define a function, `calcGC`, to do the repetitive work of computing the GC content fraction for each segment of length `samplesize`. It should return a list with the GC fraction of each segment.
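
The loop in `calcGC` slides a window of length `samplesize` along the sequence one position at a time. The sketch below shows which windows a bound of the form `range(len(seq)-samplesize)` produces, using a short hypothetical string (`text` and `size` are illustrative names).

```python
text = "ABCDEF"   # hypothetical string standing in for a DNA sequence
size = 3

# Same form of loop bound as range(len(seq) - samplesize): start positions 0, 1, 2.
windows = [text[i:i + size] for i in range(len(text) - size)]
print(windows)          # ['ABC', 'BCD', 'CDE']

# With "+ 1" in the bound, the final window 'DEF' is included as well.
all_windows = [text[i:i + size] for i in range(len(text) - size + 1)]
print(all_windows)      # ['ABC', 'BCD', 'CDE', 'DEF']
```

The first form stops one window short of the end of the string, which makes little difference for a long sequence but is worth knowing when checking the length of the returned list.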

In [None]:
def calcGC(seq, samplesize):
    gc = []
    for i in range(len(seq)-samplesize):
        seg = seq[i:i+samplesize]
        GCcount = ...
        ...
    return ...

In [None]:
check('tests/q2.1.py')

**Question 2.2**
Apply this function to our sequence above, <i>seq</i>, with a samplesize of 10. What are the maximum, minimum, mean, and median? (Hint: the max of a list can be obtained with `max(results)`, and the mean with `np.mean(results)`.) Plot the results using `plt.plot(results)`.
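
The summary functions mentioned in the hint all accept a plain Python list. A minimal sketch on hypothetical values (the list `example_fractions` is invented for illustration):

```python
import numpy as np

example_fractions = [0.25, 0.5, 0.75, 0.5]   # hypothetical values

print(max(example_fractions))         # 0.75
print(min(example_fractions))         # 0.25
print(np.mean(example_fractions))     # 0.5
print(np.median(example_fractions))   # 0.5
```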

In [None]:
samplesize = ...
results = ... # Hint: use the calcGC function defined above
maximum = ...
minimum = ...
median = ...
mean = ...
...

In [None]:
check('tests/q2.2.py')

**Question 2.3** Now apply this function to our sequence above, <i>seq</i>, with a different, larger samplesize (>30) of your choosing and plot the results. What do you observe with the different sampling? What are the maximum, minimum, mean, and median? (Hint: the max of a list can be obtained with `max(results)`, and the median with `np.median(results)`.)

In [None]:
# Plot

## Answer
<font color='green'> Discuss your results

### Complete and tally your score.
Submit the .html and .ipynb files of your completed lab.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
correct = 0
total = 8
checks = ['1.1','1.2', '1.3', '1.5', '1.6', '1.7', '2.1', '2.2']
for x in checks:
    print('Testing question {}: '.format(x))
    g = check('tests/q{}.py'.format(x))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

print('Grade:  {}'.format(str(correct/total)))

In [None]:
print(name," Great work!")