# Lab 6: Sampling and Simulation
*Elements of Data Science* <br>
Welcome to Module 2 and lab 6! This week, we will go over conditionals and iteration, and introduce the concept of randomness. All of this material is covered in [Chapter 9](https://inferentialthinking.com/chapters/09/Randomness.html) and [Chapter 10](https://inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html) of the online <i>Inferential Thinking</i> textbook. 
<br><center><b>Learning Goals</b></center>

|Area|Concept|
|---|---|
|Probability|Probability of outcomes, including die rolls and beyond|
|Simulation |Sample the distribution|
| - Iteration| `for i in np.arange(samples)`|
| - numpy random | `np.random.choice()` to randomly select among outcomes|
|Conditional|Comparisons using `if` and Boolean expressions|
|Comparison operators| `==`, `!=`, `>`, `<`, `>=`, `<=`|
|Python dictionary|`my_dictionary = {'EDS':'TR 8:00 AM','CHEM 1031':'TR 9:30 AM'}`|



First, set up the tests and imports by running the cell below.

In [None]:
name = ...

In [None]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import os
user = os.getenv('JUPYTERHUB_USER')
from gofer.ok import check

## 1. Sampling

#### 1.1 Dungeons and Dragons and Sampling
In the game Dungeons & Dragons, each player plays the role of a fantasy character.
A player performs actions by rolling a 20-sided die, adding a "modifier" number to the roll, and comparing the total to a threshold for success. The modifier depends on the character's competence in performing the action.
For example, suppose Alice's character, a barbarian warrior named Roga, is trying to knock down a heavy door. She rolls a 20-sided die, adds a modifier of 11 to the result (because her character is good at knocking down doors), and *succeeds if the total is greater than 15*.<br> A [Medium post](https://towardsdatascience.com/understanding-probability-theory-with-dungeons-and-dragons-a36bc69aec88) discusses probability in the context of Dungeons & Dragons. <br><br><img src="data/DnD.JPG" height="200"><br>
#### <font color=blue> **Question 1.1.** </font>
Write code that simulates the above process:<br>
- Roll a 20-sided die
- Add a modifier of +11
- Check if the combined roll succeeds (roll + modifier > 15)

<p>In other words, compute three values: the result of Alice's roll (<code>roll_result</code>), the result of her roll plus Roga's modifier (<code>modified_result</code>), and a Boolean value indicating whether the action succeeded (<code>action_succeeded</code>). Do not fill in any of the results manually; the entire simulation should happen in code.<br>
Hint: A roll of a 20-sided die is a number chosen uniformly from the array <code>make_array(1, 2, 3, 4, ..., 20)</code>. So a roll of a 20-sided die plus 11 is a number chosen uniformly from that array, plus 11.</p>
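
As a hedged illustration of the hint, the sketch below uses a hypothetical six-sided die with a made-up modifier and threshold (none of these numbers come from the lab). It shows how `np.random.choice` picks one value uniformly from an array and how a comparison produces a Boolean.

```python
import numpy as np

# Hypothetical example values, not the ones used in Question 1.1.
faces = np.arange(1, 7)                  # the six faces: 1, 2, ..., 6
example_modifier = 2
example_threshold = 5

roll = np.random.choice(faces)           # one face, chosen uniformly at random
total = roll + example_modifier
succeeded = total > example_threshold    # a comparison evaluates to True or False

print(roll, total, succeeded)
```

Running the snippet several times gives different rolls, which is the behavior a simulation relies on.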

In [None]:
possible_rolls = ...
roll_result = ...
modified_result = ...
action_succeeded = ...

# The next line just prints out your results in a nice way
# once you're done.  You can delete it if you want.
print(f"On a modified roll of {modified_result}, Alice's action {'succeeded' if action_succeeded else 'failed'}")

In [None]:
check('tests/q1.1.py')

#### <font color=blue> **Question 1.2.** </font> 
Run your cell 7 times to manually estimate the chance that Alice succeeds at this action. (Don't use math or an extended simulation.) Your answer should be a fraction: the number of successes you observed divided by 7.

In [None]:
rough_success_chance = ...

In [None]:
check('tests/q1.2.py')

Suppose we don't know that Roga has a modifier of 11 for this action.  Instead, we observe the modified roll (that is, the die roll plus the modifier of 11) from each of 7 of her attempts to knock down doors.  We would like to estimate her modifier from these 7 numbers.<br>

#### <font color=blue> **Question 1.3.** </font>
Write a Python function called `simulate_observations`.  It should take two arguments, `modifier` and `num_observations`, and it should return an array of `num_observations` numbers.  Each of the numbers should be the modified roll from one simulation.  **Then**, call your function once to compute an array of 7 simulated modified rolls.  Name that array `observations`.
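
`np.random.choice` also accepts a second argument giving how many values to draw. That is one possible way (an assumption about approach, not the required solution) to build an array of repeated rolls. A minimal sketch with a hypothetical four-sided die and offset:

```python
import numpy as np

# Hypothetical values for illustration only.
faces = np.arange(1, 5)                # a four-sided die: 1, 2, 3, 4
offset = 3
draws = np.random.choice(faces, 5)     # five independent rolls, returned as an array
shifted = draws + offset               # adding a number to an array adds it to every element

print(draws)
print(shifted)
```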

In [None]:
modifier = 11
num_observations = 7

def simulate_observations(modifier, num_observations):
    """Produces an array of num_observations simulated modified die rolls"""
    ...

observations = ...
observations

In [None]:
check('tests/q1.3.py')

#### <font color=blue> **Question 1.4.** </font> 
Draw a histogram to display the *probability distribution* of the modified rolls we might see.  Check with a neighbor or a CA to make sure you have the right histogram. Carry this out again using 100 rolls.
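
When building integer bins with `np.arange` for a histogram like this one, keep in mind that `np.arange` excludes its stop value, and that the bins need a right edge one past the largest value so each integer gets its own bin. The sketch below uses fixed, made-up data and `np.histogram`, which computes the same kind of counts that a histogram plot draws.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3])   # made-up data, largest value is 3
edges = np.arange(1, 3 + 2)             # edges 1, 2, 3, 4: one unit-wide bin per integer

counts, _ = np.histogram(values, bins=edges)
proportions = counts / counts.sum()     # fraction of the data in each bin

print(counts)        # [1 2 3]
print(proportions)
```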

In [None]:
# We suggest using these bins.
roll_bins = np.arange(1, modifier+2+20, 1)

...

### Estimate the modifier
Now let's imagine we don't know the modifier and try to estimate it from observations.
One straightforward (but clearly suboptimal) way to do that is to find the smallest total roll (`min()`): the smallest roll on a 20-sided die is 1, so a smallest total of 1 corresponds to a modifier of 0. Use a random number for <i>modifier</i> to start and keep this value through the next questions. We will also generate 100 rolls based on the unknown modifier below. <br>
#### <font color=blue> **Question 1.5.** </font>
Using that method, estimate modifier from observations. Name your estimate min_estimate.
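
A note on generating the random modifier: `np.random.randint(low, high)` draws integers from `low` up to, but not including, `high`. The sketch below checks this empirically with many draws.

```python
import numpy as np

draws = np.random.randint(1, 20, size=10_000)   # many draws with the same bounds

print(draws.min(), draws.max())   # the maximum never reaches 20
```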

In [None]:
modifier = np.random.randint(1,20) # Generates a random integer modifier from 1 to 19 inclusive (the upper bound, 20, is excluded)
... = simulate_observations(modifier, num_observations)
...
min_estimate = ...
min_estimate

In [None]:
check('tests/q1.5.py')

### Estimate the modifier based on the mean of observations
#### <font color=blue> **Question 1.6.** </font>
Figure out a good estimate based on using the mean of all the modified rolls. Then, write a function named mean_based_estimator that computes your estimate. It should take an array of modified rolls (like the array observations) as its argument and return an estimate (single number) of the modifier based on those numbers contained in the array.

<font color=blue>**What is the mean of many rolls (1000's) of a 20-sided die?**</font>
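
To illustrate the idea with a different die (a hypothetical six-sided one, chosen so as not to answer the question above), the average of many simulated rolls settles near the average of the faces:

```python
import numpy as np

faces = np.arange(1, 7)                     # hypothetical six-sided die
rolls = np.random.choice(faces, 100_000)    # many simulated rolls

print(np.mean(faces))    # average of the faces themselves: 3.5
print(np.mean(rolls))    # simulated average, close to 3.5
```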

In [None]:
mean20 = ...
mean20

In [None]:
def mean_based_estimator(obs):
    """Estimate the roll modifier based on observed modified rolls in the array obs."""
    ...

# Here is an example call to your function.  It computes an estimate
# of the modifier from our  observations.
mean_based_estimate = mean_based_estimator(observations)
mean_based_estimate

In [None]:
check('tests/q1.6.py')

#### <font color=blue> **Question 1.7.** </font>
Construct a histogram and compare it to the estimates above. Are they consistent?
What is your best estimate of the random modifier based on the above, without examining its value?

In [None]:
plt.hist(..., bins = ...) # Use to plot histogram of an array of 100 modified rolls
estimated_modifier = ...

In [None]:
check('tests/q1.7.py')

## 2. Sampling and GC content of DNA sequence
DNA within a cell contains codes, or sequences, for the ultimate synthesis of proteins. DNA is made up of four types of nucleotides, guanine (G), cytosine (C), adenine (A), and thymine (T), connected in an ordered sequence. The nucleotides on one strand pair with complementary nucleotides on a second strand: G pairs with C, and A pairs with T. Regions of DNA that code for RNA, which ultimately directs protein synthesis, are known as genes, and these segments often have higher GC content. Here we will sample 10-nucleotide segments of a DNA sequence and determine the GC content of each segment. See [DNA sequence basics](http://data-science-sequencing.github.io/Win2018/assets/lecture2/lecture2_2018.pdf) and [GC content details](https://geneticeducation.co.in/what-is-the-importance-of-gc-content/). 
##### Our goal is to sample portions (10 nucleotides) of the sequence and determine<br> the relative content of guanine (G) and cytosine (C) to adenine (A) and thymine (T)

In [None]:
# DNA sequence we will examine, a string
seq = "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG \
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG \
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA \
AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA \
ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT \
AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA \
GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC \
AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT \
TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT \
GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT \
GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC" # LCBO-Prolactin precursor-Bovine
seq
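
Because the string above is written across several lines with backslash continuations, it can be worth checking exactly which characters it contains. Inside a quoted string, a backslash at the end of a line joins the lines, but any space typed before the backslash stays in the string. A small sketch with a made-up two-line string (`toy` is a hypothetical name):

```python
from collections import Counter

# Made-up example; note the space before the backslash.
toy = "GATC \
GGCC"

print(repr(toy))        # 'GATC GGCC'
print(Counter(toy))     # counts every character, including the space
print(len(toy))         # 9
```

Applying `Counter` to a sequence in the same way shows at a glance whether anything other than G, C, A, and T is present.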

#### <font color=blue> **Question 2.1A.** </font>
Run the first two code cells below to see how substrings are extracted and how a character can be counted within a substring. Use the same strategy to determine the GC content, as a fraction of the total, in the first 10 nucleotides of the larger sequence above, `seq`.

In [None]:
# Example A
samplesize = 4
# Use this short sequence in this example
seq0 = 'GTGAAAGATT'
# How to get a substring
seq0[0:samplesize]

In [None]:
# Example B
# How to count the number of times 'A' appears in sequence
seq0[0:samplesize].count('A')

In [None]:
GCcount = seq[0:10].count('G') + seq[...]...
GCfraction = ...

##### Lists
Below we assemble a list and append an additional entry, 0.7. This is a useful strategy when creating your function.

In [None]:
gc = []
gc.append(0.8)
gc.append(0.7)
gc

##### Fill a list with 30 random G, C, T, A nucleotides 
Use iteration and `np.random.choice`.

In [None]:
my_sim_seq = []
nucleo = ['G','C','T','A']
for i in np.arange(...):
    my_sim_seq.append(np. ...)

print(my_sim_seq)   


#### <font color=blue> **Question 2.1B.** </font>
We will define a function, `calcGC`, to do the repetitive work of computing the GC fraction for each segment of length `samplesize` and to return a list with the GC fraction of each segment. 

In [None]:
def calcGC(seq, samplesize):
    gc = []
    for i in range(len(seq)-samplesize):
        ...
    return ...

In [None]:
check('tests/q2.1.py')

#### <font color=blue> **Question 2.2.** </font>
Apply this function to our sequence above, <i>seq</i>, with a samplesize of 10. What are the maximum, minimum, mean, and median? Use your five-number summary function from the Olympics Mini-project (Hint: the max of a list can be obtained with `max(results)`, and we can use `np.mean(results)`). Plot the results using `plt.plot(results)`.

In [None]:
samplesize = ...
results = ...
maximum = ...
minimum = ...
median = ...
mean = ...
...

In [None]:
check('tests/q2.2.py')

#### <font color=blue> **Question 2.3.** </font> 
Now apply this function to our sequence above, <i>seq</i>, with a different, larger samplesize (>30) of your choosing and plot the results. What do you observe with the different sampling? What are the maximum, minimum, mean, and median? (Hint: the max of a list can be obtained with `max(results)`, and we can use `np.median(results)`.)
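
As a rough sketch of why the window length matters, the example below slides windows of two lengths over a made-up repeating string and records the fraction of `G` in each window. The helper name `window_fraction` and the toy string are assumptions for illustration, not part of the lab.

```python
def window_fraction(text, letter, width):
    """Fraction of `letter` in every window of `width` consecutive characters."""
    # + 1 so that the final full-length window is included
    return [text[i:i + width].count(letter) / width
            for i in range(len(text) - width + 1)]

toy = "GGGGAAAA" * 5                        # made-up sequence with a period of 8

short = window_fraction(toy, "G", 4)        # short windows follow the local swings
long = window_fraction(toy, "G", 8)         # windows spanning a full period average them out

print(min(short), max(short))   # 0.0 1.0
print(min(long), max(long))     # 0.5 0.5
```

In this toy case the short window swings between the extremes while the long window is flat, which is the kind of trade-off to look for when comparing plots.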

In [None]:
# Plot

#### <font color=blue> **Question 2.3 Discussion.** </font>
<font color='green'> Discuss your results. You just did a fairly sophisticated DNA analysis, so there should be a lot to say. For example, how did changing samplesize affect your ability to identify GC-rich DNA regions? Why are these regions so important? (Do a little reading.) Given a year-long record of daily temperatures, how might a similar approach of varying sample size be used to distinguish seasonal temperature trends from day-to-day fluctuations?

## 3. Wordle and sampling
Inspired by Medium post: https://ericlani.medium.com/determining-the-best-first-wordle-word-to-guess-using-data-b93b975a6294



#### Letter frequency in sampled texts
We will begin our study by trying to understand the frequency of certain letters by sampling texts.  We will again use Charles Darwin's book <i>On the Origin of Species</i> to examine the frequency of letters in a text. To do this we will need to write a function that goes through all the words and counts the letters. 

In [None]:
darwin_string = open('data/darwin_origin_species.txt', encoding='utf-8').read()
darwin_words = np.array(darwin_string.split())
darwin_words

#### <font color=blue> **Question 3.1.** </font> 
We will examine some details about the array of words from Darwin's Origin of Species.

**How many words are in Darwin's Origin of Species?**

In [None]:
num_words = ...

**What are the first three words and how many letters are in these first three words?** Hint: Each word has an index in the array, the 10th word would be `darwin_words[9]`

In [None]:
darwin_words[9]

**Use a for loop to step through the letters in the third word.**
Print the lower case version of the letters.

In [None]:
for letter in ...:
    print(letter.lower())

`.isalpha()`: checks whether a character is alphabetic (a letter)
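
For comparison, here is a quick sketch of three related built-in string methods. `str.isalnum()` is the one that accepts both letters and digits.

```python
for ch in ["b", "7", ","]:
    # isalpha: letters only; isdigit: digits only; isalnum: letters or digits
    print(repr(ch), ch.isalpha(), ch.isdigit(), ch.isalnum())
```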

In [None]:
"b".isalpha()

In [None]:
",".isalpha()

#### Dictionaries
A Python dictionary stores values in pairs, a key and a value. For example when we considered the height in meters of each of the Adirondack High Peaks in Lab 02 we had values representing the height in meters without a label. With a dictionary we could use the name of the mountain as a key with the height as a value. Below we will use a dictionary to store the number of times a letter occurs. Run the below cells to see how dictionaries work.

Original List

In [None]:
ADK_highpeaks = [1629,1559,1512,1501] 
ADK_highpeaks

In [None]:
ADK_dictionary = {'Mount Marcy':1629,'Algonquin Peak':1559,'Mount Haystack':1512,'Mount Skylight':1501}
ADK_dictionary

In [None]:
ADK_dictionary.values()

In [None]:
list(ADK_dictionary.keys())

In [None]:
ADK_dictionary.get('Algonquin Peak')

**Define a function** to determine letter frequency in text with words split in an array as in above <i>darwin_words</i> array and return a Table with letters and their count. We will use a nice trick with a Python dictionary (see: [Python Dictionaries](https://realpython.com/python-dicts/)) as already encoded below.
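
The trick appears to be `dict.get(key, default)`, which returns `default` when the key is missing, so a counter can be updated without first checking whether the key exists. A minimal sketch with a made-up list (`tally` and `items` are hypothetical names):

```python
items = ["pine", "birch", "pine", "maple", "pine"]   # made-up data

tally = {}
for item in items:
    # get returns 0 the first time an item is seen, so the count starts at 1
    tally[item] = tally.get(item, 0) + 1

print(tally)   # {'pine': 3, 'birch': 1, 'maple': 1}
```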

In [None]:
def letter_freq(words):
    f = {} # Create an empty dictionary to store letters and their count found in words
    for ...:
        for ...:
            l = l.lower()
            if l.isalpha(): # avoid punctuation
                f[l] = f.get(l,0) + 1 # Using Python dictionary
    return Table().with_columns('letters',list(f.keys()),'count',list(f.values()) )

**Now test your function with your own sentence.**

In [None]:
sentence = "..."
letter_freq(sentence)

In [None]:
check('tests/q3.1.py')

#### <font color=blue> **Question 3.2**</font> 
Apply the function to the <i>darwin_words</i> array and examine the output. How many total letters are in the text (Hint: use `np.sum(freq.column('count'))`)? Now compute a new column, <i>frequency</i>, which contains the fraction of the total for each letter. What are the two most frequent letters in this sample?
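
A hedged sketch of the counts-to-fractions step, using a plain NumPy array of made-up counts rather than the lab's table. Dividing an array by a number divides every element.

```python
import numpy as np

counts = np.array([6, 3, 1])            # made-up letter counts
fractions = counts / np.sum(counts)     # each count as a share of the total

print(fractions)          # [0.6 0.3 0.1]
print(fractions.sum())
```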

In [None]:
freq = letter_freq(darwin_words).sort("letters")

total_letters = ... # How many letters

freq = freq.with_columns(...).sort("frequency",descending=True)


In [None]:
check('tests/q3.2.py')

#### Five letter words
#### <font color=blue> **Question 3.3** </font> 

Now look at a list of 5-letter words assembled by Professor Emeritus Donald Knuth of Stanford. Use your function to determine the letter frequency and compare it to the result above.

In [None]:
from urllib.request import urlopen # Needed to read from internet

url = "https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt"
knuth5_string=urlopen(url).read().decode('utf-8') 
knuth_words = np.array(knuth5_string.split())

In [None]:
# Now apply your function and compute letter frequencies
freq = ...

freq = freq.with_columns(...)

In [None]:
check('tests/q3.3.py')

### Compare with Oxford Dictionary
These are the letter frequencies based on an analysis of the Oxford dictionary. In the markdown cell below, compare the three sets of frequencies. What are the similarities?

In [None]:
url = "data/Oxford_Letter_frequency.csv"
Oxletters = Table().read_table(url, header=None, names=["letters","frequency","count"])
Oxletters

#### Comparison



#### <font color=blue> **Question 3.4** </font> 
Let's look at 5-letter words and the frequency of each letter, and compare with the Oxford case.

### Wordle itself
The New York Times hosts [Wordle](https://www.nytimes.com/games/wordle/index.html) where a 5 letter word (Wordle) is determined in six or fewer tries using clues about letters contained and letter position. We will use our new knowledge of letter frequency and Knuth's 5 letter words to come up with best letters and words to try.

In [None]:
# Reload Knuth's words
Knuth_url = "https://www-cs-faculty.stanford.edu/~knuth/sgb-words.txt"
words = Table().read_table(Knuth_url, header=None, names=["word"])
words

Now load our letter frequency, <i>freq</i>, data table from a chosen text, Darwin, Knuth, or the Oxford dictionary (already in `Oxletters`, no need to run freq). We will store this in the `letters` Table for further analysis.

In [None]:
letters = ...
letters

#### <font color=blue> **Question 3.5** </font>  
Devise a plan to use the collection of letter frequencies and Knuth's five-letter words to rank the best words to guess for Wordle. Higher letter frequency means more words contain the given letter. Describe your plan in a markdown cell only (no code).

#### Wordle computation
Can you interpret how this code is sorting Knuth's words based on our letter frequencies?

In [None]:
score = np.array([])
letter_count = np.array([])
for word in words['word']:  # Iterate through each Knuth word
    #print("Word: ", word)
    wscore = 1
    lcount = 0
    for letter in word:  # Iterate through each letter
        lcount +=1
        letter_frequency = letters.where('letters',letter)['frequency'][0] # Get letter frequency computed above
        wscore = wscore*letter_frequency # Score based on product of letter frequency
    score = np.append(score,wscore)  # append score
    letter_count= np.append(letter_count, lcount)
score = score*1.0e6 # Scale score to be more readable
words = words.with_columns("score",score,"letter_count",letter_count)  # Add score and letter count to words Table
words
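
For readers who find the table lookups hard to follow, here is a stripped-down sketch of the same scoring idea using a plain dictionary. The frequencies are made up for illustration and `word_score` is a hypothetical helper; the cell above remains the actual computation.

```python
# Made-up letter frequencies, for illustration only.
toy_freq = {"a": 0.5, "e": 0.25, "t": 0.125, "z": 0.0625}

def word_score(word, freq):
    """Multiply together the frequency of each letter in the word."""
    score = 1.0
    for letter in word:
        score *= freq[letter]
    return score

print(word_score("eat", toy_freq))   # 0.25 * 0.5 * 0.125 = 0.015625
print(word_score("zat", toy_freq))   # swapping in a rarer letter lowers the score
```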

#### <font color=blue> **Code Discussion** </font> 
In a markdown cell below, discuss how the code iterates through the words and computes each score.

#### <font color=blue> Four letter words are included with Knuth's original words. Use `.where` to select only five letter words. </font> 

In [None]:
words_5 = words.where(...,...)
words_5

#### <font color=blue> Use `.sort()` to find words with the best score</font> 

#### <font color=blue> **Question 3.6** </font> 

#### Compare top word given based on Darwin's text, Knuth's letter frequency, and the Oxford Dictionary.

You will need to rerun the Wordle computation with `letters` set to each `letter_freq(words)` result or to `Oxletters` from the Oxford dictionary. <br><br><b><font color=green>***Top words for Wordle:***


<font color=green>***Why words with repeat letters may not be good guesses for Wordle:***

### <font color=blue> **Question 4.** </font>

At the end of each lab, please include a reflection. 
* How did this lab go? 
* What aspects of sampling do you find confusing?
* Do you understand how we computed the best words for Wordle?
* Were there questions you found especially challenging? 
* How long did the lab take you to complete?

Share your feedback so we can continue to improve this class!

**Insert a markdown cell below this one and write your reflection on this lab.**

### Complete and tally your score.
Submit the .html and .ipynb files of your completed lab.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
correct = 0
total = 11
checks = ['1.1','1.2', '1.3', '1.5', '1.6', '1.7', '2.1', '2.2', '3.1', '3.2','3.3']
for x in checks:
    print('Testing question {}: '.format(x))
    g = check('tests/q{}.py'.format(x))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

print('Grade:  {}'.format(str(correct/total)))

In [None]:
print("Nice work ",name, user)
import time
localtime = time.asctime(time.localtime(time.time()))
print("Submitted @ ", localtime)