# Tutorial 12
## A How-to Approach of Maximum Likelihood and Regular Expressions

In [2]:
import pandas as pd
import numpy as np
import os
from scipy.optimize import minimize
from scipy.stats import norm
import scipy.stats

os.chdir('/Users/andrewguinness/Google Drive/ND/Fall 17/Intro to Biocomputing/Intro_Biocom_ND_319_Tutorial12')

Let's begin by importing the packages we need and setting our working directory to the one that contains our data.

In [6]:
chickwts = pd.read_csv("chickwts.txt")

chickwts = chickwts[(chickwts.feed == 'soybean') | (chickwts.feed == 'sunflower')]

chickwts = chickwts.replace('soybean', 0)
chickwts = chickwts.replace('sunflower',1)

Now we want to read in our data and save it as a variable.


Thereafter, we want to subset our data so that we can answer the question we're asking **statistically**. So, let's grab the chick weights **ONLY** for those fed soybean and sunflower diets. (Read the | bar as "or"). 

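Next, we recode the feed column so that soybean is 0 and sunflower is 1. This makes soybean the baseline of the linear model, so its mean weight is the intercept, b0.

For soybean chicks (x = 0), the linear model y = b0 + b1 * x reduces to y = b0. For sunflower chicks (x = 1), it becomes y = b0 + b1, so b1 is the difference in mean weight between the two diets.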

This will become clearer later. 
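To make the 0/1 coding concrete, here is a minimal sketch. The coefficient values are hypothetical, chosen only for illustration, and `expected_weight` is a helper name introduced here rather than part of the tutorial's code.

```python
# Hypothetical coefficients, chosen only for illustration (not fitted values).
b0 = 250.0   # assumed mean weight for the reference group (feed coded 0)
b1 = 80.0    # assumed extra weight for the other group (feed coded 1)

def expected_weight(feed_code):
    """Linear model y = b0 + b1 * x with x restricted to 0 or 1."""
    return b0 + b1 * feed_code

soybean_mean = expected_weight(0)     # x = 0, so the b1 term vanishes
sunflower_mean = expected_weight(1)   # x = 1, so the prediction is b0 + b1

print(soybean_mean, sunflower_mean)
```

With this coding, b1 is simply the gap between the two group means.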

In [7]:
def nllike(p,obs):
    B0=p[0]
    B1=p[1]
    sigma=p[2]
    expected=B0+B1*obs.feed
    nll=-1*norm(expected,sigma).logpdf(obs.weight).sum()
    return nll

initialguess = np.array([1,1,1])

fit = minimize(nllike, initialguess, method="Nelder-Mead", options={'disp': True}, args=chickwts)

print(fit.x)

scipy.stats.ttest_ind(chickwts[(chickwts.feed == 0)], chickwts[(chickwts.feed == 1)])

Optimization terminated successfully.
         Current function value: 138.469162
         Iterations: 200
         Function evaluations: 363
[ 246.42855057   82.48813575   49.73948886]


Ttest_indResult(statistic=array([-4.05020659,        -inf]), pvalue=array([ 0.00046409,  0.        ]))

Here we define the negative log likelihood function for a normal model. It takes a vector of three parameter values (*p*: b0, b1, and sigma), which the optimizer starts at initialguess, and our observations (*obs*). 

So, how do we interpret this? 

The fit searches for the parameter values with the LOWEST negative log likelihood (i.e. the highest likelihood). It returns b0 ≈ 246.4, b1 ≈ 82.5, and sigma ≈ 49.7. Because b1 is positive, the chicks fed sunflower weigh about 82.5 more on average than the soybean chicks, whose mean weight is b0.
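As a rough check on what the optimizer is doing, the sketch below writes the same negative log likelihood in plain Python and compares it against the closed-form estimates for a 0/1 predictor (group mean, difference in group means, and root mean squared residual). The data are toy values invented for illustration, not the chick weights, and the function name `nll` is an assumption of this sketch.

```python
import math

# Toy data, invented for illustration: (feed code, weight).
toy = [(0, 240.0), (0, 250.0), (0, 260.0), (1, 320.0), (1, 330.0), (1, 340.0)]

def nll(b0, b1, sigma, data):
    """Negative log likelihood of a normal model with mean b0 + b1 * x."""
    total = 0.0
    for x, y in data:
        resid = y - (b0 + b1 * x)
        total += 0.5 * math.log(2 * math.pi * sigma ** 2) + resid ** 2 / (2 * sigma ** 2)
    return total

group0 = [y for x, y in toy if x == 0]
group1 = [y for x, y in toy if x == 1]

# Closed-form maximum likelihood estimates for a 0/1 predictor.
b0_hat = sum(group0) / len(group0)
b1_hat = sum(group1) / len(group1) - b0_hat
sigma_hat = math.sqrt(sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in toy) / len(toy))

best = nll(b0_hat, b1_hat, sigma_hat, toy)
print(b0_hat, b1_hat, sigma_hat, best)
```

Moving any of the three parameters away from these values raises the negative log likelihood, which is the minimum a numerical optimizer such as Nelder-Mead is searching for.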

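We can confirm these results with a t-test. Because the whole data frame was passed to ttest_ind, the output has one entry per column: the first (statistic ≈ -4.05, p ≈ 0.00046) is for weight, and the second (-inf, 0) comes from comparing the 0/1 feed codes themselves and can be ignored. The weight result indicates that the two groups have significantly different mean weights. 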
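For intuition about the t statistic itself, here is a small sketch of the pooled-variance (equal variance) two-sample calculation, which is the default form that `scipy.stats.ttest_ind` uses. The two lists are toy values invented for illustration, not the chick weights.

```python
import math
import statistics

# Toy weights, invented for illustration.
soybean = [240.0, 250.0, 260.0]
sunflower = [320.0, 330.0, 340.0]

n0, n1 = len(soybean), len(sunflower)
mean_diff = statistics.mean(soybean) - statistics.mean(sunflower)

# Pooled sample variance across both groups.
pooled_var = ((n0 - 1) * statistics.variance(soybean)
              + (n1 - 1) * statistics.variance(sunflower)) / (n0 + n1 - 2)

# Difference in means divided by its standard error.
t_stat = mean_diff / math.sqrt(pooled_var * (1 / n0 + 1 / n1))
print(t_stat)
```

The statistic is negative because the first group (coded 0) has the lower mean, the same sign convention as in the output above.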

## Pattern matching

Here, I've synthesized some data on which we'll be doing three separate pattern matching steps, so many different types of data are available in the same file. Let's go ahead and read this in and take a look. 

In [12]:
import re

# Read the whole file into one string; the with block closes the file for us
with open('pattern.txt', 'r') as textfile:
    filetext = textfile.read()

print(filetext)

"14:56"
"01:53"
"02:25"
"17:13"
"04:24"
"03:12"
"11:45"
"19:06"
"16:02"
"07:42"
"09:16"
"06:36"
"10:31"
"21:43"
"20:46"
"718-50-9401"
"419-75-1706"
"945-53-6985"
"911-50-1000"
"478-68-7481"
"342-13-1221"
"438-77-5635"
"470-74-6952"
"672-42-1738"
"161-65-6811"
"H. sapiens"
"C. pipiens"
"H. cecropia"
"E. coli"
"R. pomonella"



Alright, so now we want to pull out three separate things: 

1) Just the SSNs

2) Times after noon (12:00)

3) The species names

Let's do these each one at a time. 

In [14]:
## Social Security Numbers
# Raw string (r'...') keeps the backslashes in \d intact
SSNmatch = re.compile(r'\d{3}-\d{2}-\d{4}')

SSNs = re.findall(SSNmatch, filetext)
print(SSNs)

['718-50-9401', '419-75-1706', '945-53-6985', '911-50-1000', '478-68-7481', '342-13-1221', '438-77-5635', '470-74-6952', '672-42-1738', '161-65-6811']


So, this gives us a list of all 10 social security numbers in the data set by matching a pattern of three digits (\d), a hyphen, two digits, a hyphen, and four digits. 
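The same pattern can be tried on a short inline string without needing the file. This is a minimal sketch: the sample reuses a few values from the listing above, and the variable names are introduced here for illustration.

```python
import re

# A small inline sample in the same quoted, one-per-line format as the file.
sample = '"14:56"\n"718-50-9401"\n"H. sapiens"\n"419-75-1706"\n'

# Three digits, hyphen, two digits, hyphen, four digits.
ssn_pattern = re.compile(r'\d{3}-\d{2}-\d{4}')

# A compiled pattern has its own findall method.
ssns = ssn_pattern.findall(sample)
print(ssns)
```

Because the pattern spells out the full shape of the number, the surrounding quotes and the other kinds of entries are left out of the result.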

Truthfully, we don't need to match them this way. Since the social security numbers are the only entries in this file that contain hyphens (or, for that matter, runs of more than two digits), we can be a little bit lazier *in this specific case*. 

In [15]:
## Social Security Numbers the lazy way

# Three digits, then everything else up to the end of the line
SSNmatch = re.compile(r'\d{3}.+')

SSNs = re.findall(SSNmatch, filetext)
print(SSNs)

['718-50-9401"', '419-75-1706"', '945-53-6985"', '911-50-1000"', '478-68-7481"', '342-13-1221"', '438-77-5635"', '470-74-6952"', '672-42-1738"', '161-65-6811"']


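In this case, we match three digits in a row and all of the other characters that come after them on the same line (**.** matches any character except a newline, and **+** means "one or more"). Notice that the closing quotation mark is swept up in each match, which is the price of being lazy.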
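One way to keep the lazy approach but drop the trailing quotation mark is to match "anything that is not a quote" instead of "anything". This is a sketch of one option on a short inline sample, not the only way to do it.

```python
import re

# Inline sample in the same format as the file.
sample = '"14:56"\n"718-50-9401"\n"01:53"\n'

# .+ runs to the end of the line, so it takes the closing quote with it.
greedy = re.findall(r'\d{3}.+', sample)

# [^"]+ stops just before the closing quote.
trimmed = re.findall(r'\d{3}[^"]+', sample)

print(greedy)
print(trimmed)
```

The times are skipped in both cases because they never contain three digits in a row.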

Let's skip challenge two for right now. We'll get back to it. We can be super lazy with the abbreviated binomial names also. They're the ONLY entries here that contain any letters. 

In [21]:
## Binomial names 

speciesmatch = re.compile('[A-Za-z].+')

spp = re.findall(speciesmatch, filetext)
print(spp)

['H. sapiens"', 'C. pipiens"', 'H. cecropia"', 'E. coli"', 'R. pomonella"']
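A stricter pattern that spells out the abbreviated binomial format (capital letter, period, space, lowercase epithet) avoids picking up the closing quote. This is a minimal sketch on an inline sample, assuming every name follows that format.

```python
import re

# Inline sample in the same format as the file.
sample = '"E. coli"\n"16:02"\n"R. pomonella"\n'

# One capital letter, a literal period, a space, then lowercase letters.
species_pattern = re.compile(r'[A-Z]\. [a-z]+')

species = species_pattern.findall(sample)
print(species)
```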


The remaining challenge asks us to pull out times in a 24-hr format **after noon**. This is slightly more challenging because we can't just match all digits. So, we want to be able to pull out times like 14:56 and 21:43, but ignore times like 01:53. 

This sounds convoluted at first, but it's really quite simple. 

We want to match times whose first digit is 1 or 2, but not 0. If the first digit is 1, the second digit must be 2 or greater; if the first digit is 2, the second can be any digit. Then there's a colon and some other stuff. The colon is key here because it lets us ignore the social security numbers. 

In [31]:
## Times after 12:00
timesmatch = re.compile('1{1}[2-9]{1}:.+|2{1}[0-9]{1}:.+')

afternoon = re.findall(timesmatch, filetext)
print(afternoon)

['14:56"', '17:13"', '19:06"', '16:02"', '21:43"', '20:46"']


So, let's translate this to English. 

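The first alternative matches a 1 (exactly once, which is what {1} says), followed by one digit that is 2 or greater, then a colon, then anything up to the end of the line. The {1} quantifiers are the default and could be left out. 

OR (|) 

The second alternative matches a 2 followed by any one digit, then a colon and anything after it. Note that 12:00 through 12:59 would also match, so the pattern really selects times from noon onward. 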
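An alternative is to let the regex only find the times and do the "from noon onward" comparison in Python. The sketch below shows both routes on an inline sample built from values in the listing above; the variable names are introduced here for illustration.

```python
import re

# Inline sample in the same format as the file.
sample = '"14:56"\n"01:53"\n"11:45"\n"911-50-1000"\n"21:43"\n'

# Route 1: the same idea as above, without the redundant {1}
# and with \d{2} in place of .+ so the closing quote is not captured.
by_regex = re.findall(r'1[2-9]:\d{2}|2[0-9]:\d{2}', sample)

# Route 2: capture hours and minutes, then compare the hour as a number.
pairs = re.findall(r'(\d{2}):(\d{2})', sample)
by_filter = [h + ':' + m for h, m in pairs if int(h) >= 12]

print(by_regex)
print(by_filter)
```

The second route is a little longer, but the hour cutoff is a plain number that is easy to change.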