 #  <center> <span style = "color:blue">  Project </span>  </center>

For this project, you'll be analyzing real data provided by professors and their research students in chemistry and biology.  Here's the story:

Chemistry students synthesized three different derivatives of penicillin while students in a microbiology class grew two different bacteria in petri dishes.  The synthesized derivatives are supposed to possess antibacterial qualities; that is, they should act as antibiotics.  To test those qualities, the three derivatives were introduced into the petri dishes by soaking a small paper disk in the antibiotic and then placing the disk in the dish.  The "zone of inhibition" is the circular region around the disk where the bacteria were not able to grow.  (This procedure is also known as a Kirby-Bauer test.)  The diameters of these zones of inhibition were measured in decimeters and recorded; the data are available below.  The larger the zone of inhibition, the more effective the antibiotic.

- For each of the two bacteria, was one of the derivatives more effective than penicillin?  Or ampicillin?

- Was the 5 parts per million or the 10 parts per million concentration more effective?  (Parts per million will be abbreviated to ppm.)

- Was any derivative more effective against the Gram-positive bacteria (*S. epidermidis*) or the Gram-negative bacteria (*E. coli*)?

- If one compound is effective against *E. coli*, is it likely to be effective against *S. epidermidis*?

In [None]:
## This cell imports modules and functions used in the rest of the notebook

from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from scipy.stats import ttest_ind as ttest2
from scipy.stats import t
import scipy.stats as stats

def groupstats(table, group, data):
    """Return a table of descriptive statistics for `data`, split by `group`.

    Columns: mean, std, min, Q1, median, Q3, max, IQR, and n for each group.
    """
    cut = table.select(group, data).sort(group)
    favstats = cut.group(group, np.mean).sort(group)
    words = [data, 'mean']
    favstats = favstats.relabeled(' '.join(words), "mean")
    groups = favstats.column(0)
    q1=make_array()
    for i in np.arange(len(groups)):
        q1 = np.append(q1, np.percentile(table.where(group, groups.item(i)).column(data), 25))
    q3=make_array()
    for i in np.arange(len(groups)):
        q3 = np.append(q3, np.percentile(table.where(group, groups.item(i)).column(data), 75))
    favstats = favstats.with_column('std', cut.group(group, stats.tstd).sort(group).column(1) )
    favstats = favstats.with_column('min', cut.group(group, min).sort(group).column(1) )
    favstats = favstats.with_column('Q1', q1 )
    favstats = favstats.with_column('median', cut.group(group, np.median).sort(group).column(1) )
    favstats = favstats.with_column('Q3', q3 )
    favstats = favstats.with_column('max', cut.group(group, max).sort(group).column(1) )
    favstats = favstats.with_column('IQR', cut.group(group, stats.iqr).sort(group).column(1) )
    favstats = favstats.with_column('n', cut.group(group ).sort(group).column(1) )
    return favstats

penicillin = Table.read_table("Penicillin measurements.csv")

penicillin = penicillin.relabel(1, "E. coli").relabel(2, "S. epidermidis").relabel(0,"Compound")

In [None]:
## To see a sample of the data, run this cell, otherwise skip it.
penicillin.take(make_array(0,2,12,19,25,28,32,37,45))

## To see all the data take the # off the line below
#penicillin.show()

## First Research Question

Against *E. coli*, was one of the derivatives more effective than penicillin? Or ampicillin?

**1.1** In the cell below, use the `groupstats` function to find the summary statistics split by "Compound" for the variable "E. coli".
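
As a rough illustration of what those summary statistics mean, the sketch below computes them for a single made-up group with NumPy and SciPy. The array `zone_sizes` and its values are invented for illustration and are not taken from the penicillin data; `groupstats` reports the same quantities once per compound.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical zone-of-inhibition diameters for ONE group (invented values).
zone_sizes = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

summary = {
    "mean": np.mean(zone_sizes),
    "std": stats.tstd(zone_sizes),        # sample standard deviation (n - 1 denominator)
    "min": np.min(zone_sizes),
    "Q1": np.percentile(zone_sizes, 25),  # first quartile
    "median": np.median(zone_sizes),
    "Q3": np.percentile(zone_sizes, 75),  # third quartile
    "max": np.max(zone_sizes),
    "IQR": stats.iqr(zone_sizes),         # Q3 minus Q1
    "n": len(zone_sizes),
}

for name, value in summary.items():
    print(name, value)
```

The standard deviation here uses the sample formula, which matches the `stats.tstd` call inside `groupstats`.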

In [None]:
## The next few lines prepare the data for an ANOVA ##
## column 1 is the E. coli column.

E_ag5 = penicillin.where("Compound", "AG (light blue) 5 ppm").column(1)
E_ag10 = penicillin.where("Compound", "AG (purple) 10 ppm").column(1)
E_ampic = penicillin.where("Compound", "Ampicillin").column(1)
E_er5 = penicillin.where("Compound", "ER (green) 5ppm").column(1)
E_er10 = penicillin.where("Compound", "ER (yellow) 10 ppm").column(1)
E_jh5 = penicillin.where("Compound", "JH (red) 5 ppm").column(1)
E_jh10 = penicillin.where("Compound", "JH (orange) 10 ppm").column(1)
E_penic = penicillin.where("Compound", "Penicillin").column(1)
E_h2o = penicillin.where("Compound", "Water").column(1)

## The next line actually runs the ANOVA ##
stats.f_oneway(E_ag5, E_ag10, E_ampic, E_er5, E_er10, E_jh5, E_jh10, E_penic, E_h2o)

**1.2** In the cell above, a one-way ANOVA test was run.  The null and alternative hypotheses for such an ANOVA are as follows:

$ H_0: $ *All groups have the same mean*
    
$ H_a: $ *At least one group has a different mean*
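
To make these hypotheses concrete, here is a small sketch that runs `stats.f_oneway` on two sets of made-up groups: one where every group has the same mean and one where a single group has been shifted. All arrays and values are invented for illustration and have nothing to do with the penicillin measurements.

```python
import numpy as np
import scipy.stats as stats

# Three hypothetical groups that all have a mean of 10 (invented values).
group_a = np.array([9.0, 10.0, 11.0])
group_b = np.array([8.0, 10.0, 12.0])
group_c = np.array([9.5, 10.0, 10.5])

# Same spread as group_c, but the mean is shifted up by 10.
group_c_shifted = group_c + 10

same_means = stats.f_oneway(group_a, group_b, group_c)
one_differs = stats.f_oneway(group_a, group_b, group_c_shifted)

print("identical means :", same_means.statistic, same_means.pvalue)
print("one group shifted:", one_differs.statistic, one_differs.pvalue)
```

Comparing the two printed p-values shows how the test responds when the group means agree and when they do not.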
        
Which interpretation of the results is correct?


a) The p-value is small, so all the means are approximately the same.

b) The p-value is large, so all the means are approximately the same.

c) The p-value is small, so at least one group mean is different.

d) The p-value is large, so at least one group mean is different.  

*Replace this text with your response.*

**1.3** In the cell below, a set of side-by-side boxplots is prepared.  Remembering that higher is better for this measurement, which derivative was most effective?  Just based on the graph, does it appear to be more effective than penicillin and/or ampicillin?

In [None]:
ticks = make_array(1,2,3,4,5,6,7,8,9)
labels = make_array("Amp.", "Pen.", "H2O", "ER 5ppm","ER 10ppm", "JH 5ppm", "JH 10ppm", "AG 5ppm", "AG 10ppm")

plots.figure(figsize=(11, 11))
plots.boxplot(E_ampic, widths=.5, positions=make_array(ticks.item(0)), showmeans=True)
plots.boxplot(E_penic, widths=.5, positions=make_array(ticks.item(1)), showmeans=True)
plots.boxplot(E_h2o, widths=.5, positions=make_array(ticks.item(2)), showmeans=True)
plots.boxplot(E_er5, widths=.5, positions=make_array(ticks.item(3)), showmeans=True)
plots.boxplot(E_er10, widths=.5, positions=make_array(ticks.item(4)), showmeans=True)
plots.boxplot(E_jh5, widths=.5, positions=make_array(ticks.item(5)), showmeans=True)
plots.boxplot(E_jh10, widths=.5, positions=make_array(ticks.item(6)), showmeans=True)
plots.boxplot(E_ag5, widths=.5, positions=make_array(ticks.item(7)), showmeans=True)
plots.boxplot(E_ag10, widths=.5, positions=make_array(ticks.item(8)), showmeans=True)
plots.xticks(ticks, labels, size = 12)
plots.title("E. coli");

*Replace this text with your answer.*

**1.4** Run a two-sample t-test to compare the array named `E_jh5` to the array called `E_penic`.  Then interpret the results in the cell below that.  
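
For reference, a minimal sketch of a two-sample t-test on made-up arrays follows. It uses the `ttest2` alias from the import cell; `group_a` and `group_b` are hypothetical stand-ins, not the project data. The second call, with `equal_var=False`, is Welch's variant, shown only as an option for groups whose variances may differ.

```python
import numpy as np
from scipy.stats import ttest_ind as ttest2  # same alias as the notebook's import cell

# Hypothetical measurements for two groups (invented values).
group_a = np.array([20.0, 22.0, 21.0, 23.0, 24.0])
group_b = np.array([15.0, 17.0, 16.0, 18.0, 14.0])

pooled = ttest2(group_a, group_b)                  # assumes equal variances
welch = ttest2(group_a, group_b, equal_var=False)  # does not assume equal variances

print("pooled:", pooled.statistic, pooled.pvalue)
print("Welch :", welch.statistic, welch.pvalue)
```

A positive test statistic means the first array has the larger sample mean; the p-value is two-sided by default.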

*Replace this text with your interpretation of the results of your t-test*

**1.5** Run a two-sample t-test to compare the array named `E_jh5` to the array called `E_ampic`.  Then interpret the results in the cell below that.  

*Replace this text with your interpretation of the results of your t-test*

## Second Research Question

Against *S. epidermidis*, was one of the derivatives more effective than penicillin? Or ampicillin?

In the cell below, the data has been prepared for you, but this time you'll have to perform all the analyses without as much guidance.  Repeat the processes and analyses that were performed for the First Research Question in the empty cells below.  Include appropriate graphics and state your conclusions.  You should be able to accomplish all of this by copying, pasting, and editing the cells above.

**2.1** Using the dataset called penicillin2, create the variables you'll need.  Use the variable names provided in the cell below.  If you choose to use other variable names, instructions given to you later in this project may be confusing.  

**2.2** Use the `groupstats` function to get a summary of this data split by "Compound" for "S. epidermidis".

**2.3** Run the one-way ANOVA using these new variables.  

**2.4** Make the box-plots that allow for visual comparison between these groups.  

**2.5** For the most effective derivative/concentration, run a two-sample t-test comparing that group to penicillin.

**2.6** For the most effective derivative/concentration, run a two-sample t-test comparing that group to ampicillin.

For all of these parts, be sure to include a brief interpretation of your results.  

In [None]:
## These next few lines prepare the data that you'll need for this section and those that follow.
## Don't change these lines

penicillin2 = penicillin.where("S. epidermidis", are.above(-1))

## Do 2.1 in the rest of this cell 

S_ag5 = ...
S_ag10 = ...
S_ampic = ...
S_er5 = ...
S_er10 = ...
S_jh5 = ...
S_jh10 = ...
S_penic = ...
S_h2o = ...

In [None]:
## Do 2.2 in the rest of this cell 


In [None]:
## Do 2.3 in the rest of this cell 



In [None]:
## Do 2.4 in the rest of this cell 



In [None]:
## Do 2.5 in the rest of this cell 



In [None]:
## Do 2.6 in the rest of this cell 



## Third Research Question

By now, you've observed that the JH compound was more effective against both *E. coli* and *S. epidermidis*.  Did the concentration of this compound matter?

**3.1** For just *E. coli*, was there a significant difference between `E_jh5` and `E_jh10`?  Run a two-sample t-test in the cell below and interpret the results.  

*Replace this text with your response*

**3.2** Repeat the analysis in the previous part, but this time use the *S. epidermidis* data, `S_jh5` and `S_jh10`.

*Replace this text with your response*

## Fourth Research Question ##

**4.1** Using the data called `E_jh5` and `S_jh5`, test whether the 5 ppm concentration of the JH compound was more effective against *E. coli* or *S. epidermidis*.
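
A default two-sample t-test only says whether two means differ; the direction comes from the sign of the test statistic or from a one-sided test. The sketch below illustrates both on made-up arrays (`first` and `second` are hypothetical, not the JH data). It assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical measurements for two groups (invented values).
first = np.array([18.0, 20.0, 19.0, 22.0, 21.0])
second = np.array([17.0, 16.0, 19.0, 15.0, 18.0])

# Two-sided: are the means different at all?
two_sided = stats.ttest_ind(first, second)

# One-sided: is the mean of `first` greater than the mean of `second`?
one_sided = stats.ttest_ind(first, second, alternative="greater")

print("two-sided:", two_sided.statistic, two_sided.pvalue)
print("one-sided:", one_sided.statistic, one_sided.pvalue)
```

When the test statistic is positive, the one-sided p-value for "greater" is half of the two-sided p-value.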

In [None]:
ticks = make_array(1,2)
labels = make_array("E_jh5", "S_jh5")


plots.boxplot(E_jh5, widths=.5, positions=make_array(ticks.item(0)), showmeans=True)
plots.boxplot(S_jh5, widths=.5, positions=make_array(ticks.item(1)), showmeans=True)
plots.xticks(ticks, labels, size = 12)
plots.title("The JH Compound");

**4.2** Using the data called `E_jh10` and `S_jh10`, test whether the 10 ppm concentration of the JH compound was more effective against *E. coli* or *S. epidermidis*.

In [None]:
stats.ttest_ind(E_jh10, S_jh10)

**4.3** Make side-by-side boxplots of the `E_jh10` and `S_jh10` data.

## Final Research Question

In general, if a compound was effective against *E. coli*, is it likely to be effective against *S. epidermidis* as well?

To answer this question, perform a linear regression analysis, using the penicillin2 dataset and treating *E. coli* as the x-variable.  

**5.1** What about the way this research question is stated implies that *E. coli* should be the x-variable?

*Write your response here*

**5.2** Create a scatterplot with *E. coli* on the horizontal axis and *S. epidermidis* on the vertical axis.  Use the `fit_line=True` option to have the regression line superimposed on the graph.

**5.3** The biology professor confided in us that she believes something was done incorrectly when the AG compound was tested against *E. coli*.  If that's true, it would be a reasonable justification for leaving that data out of this analysis.  

Create a new dataset called penicillin3 that starts with penicillin2 and eliminates all the AG compounds.  Then recreate the scatterplot from above using penicillin3.
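
The underlying idea is to keep only the rows whose label does not mark them as AG. The sketch below shows that filtering logic with plain NumPy arrays; `compound` and `e_coli` are small invented arrays, and the numbers are not real measurements.

```python
import numpy as np

# Hypothetical labels and measurements (invented values).
compound = np.array([
    "AG (light blue) 5 ppm",
    "Penicillin",
    "AG (purple) 10 ppm",
    "JH (red) 5 ppm",
])
e_coli = np.array([1.0, 2.0, 3.0, 4.0])

# True for every row whose label does NOT start with "AG".
keep = ~np.char.startswith(compound, "AG")

filtered_compound = compound[keep]
filtered_e_coli = e_coli[keep]

print(filtered_compound)
print(filtered_e_coli)
```

In the notebook itself, a `where` call with a predicate such as `are.not_containing("AG")` from the `datascience` library would be one possible route, assuming no other compound label contains the capital letters "AG".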

In [None]:
penicillin3 = ...



**5.4** Using `stats.linregress`, run a linear regression analysis using "E. coli" and "S. epidermidis" from the penicillin3 dataset.    
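
As a quick reference for reading the output, here is a sketch of `stats.linregress` on made-up data. The arrays `x` and `y` are invented and stand in for any pair of numeric columns; the result object exposes the slope, intercept, correlation coefficient, and p-value as attributes.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical paired measurements (invented values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

result = stats.linregress(x, y)

print("slope    :", result.slope)
print("intercept:", result.intercept)
print("r        :", result.rvalue)   # correlation coefficient
print("p-value  :", result.pvalue)   # tests the null hypothesis that the slope is zero

# Predicted y for a new x value, using the fitted line.
x_new = 6.0
y_hat = result.intercept + result.slope * x_new
print("prediction at x = 6:", y_hat)
```

The first argument is treated as the x-variable and the second as the y-variable, so the order of the two columns matters.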

**5.5** Edit the LaTeX code below by replacing the word "slope" with the slope you just found and replacing the word "intercept" with the intercept you just found.  

$$ \widehat{\text{S. epidermidis}} = \text{slope} \cdot \left(\text{E. coli}\right) + \text{intercept} $$

The above equation shows the relationship between the sizes of the zones of inhibition in *E. coli* and the sizes in *S. epidermidis*.  

**5.6** Finally, answer the research question.  Does it appear that if one compound is effective against *E. coli*, it might also be effective against *S. epidermidis*?  If there are any relevant statistics (correlation, p-value, etc.) that you could quote, please do so.

*Write your response here*

## Summary, Conclusions and Recommendations

Imagine that the scientists who performed this experiment provided you with this data because they wanted a statistician's help.  With those scientists as the intended audience, write a professional-style (clear and concise) summary of what these analyses yielded.  Make sure you address each research question from above and provide recommendations grounded in statistics.  Also, the 10 ppm concentration may be more expensive than the 5 ppm, so be certain your recommendations take this cost into consideration.  Feel free to use bullet points.

*Write your conclusions here*

- If using bullet points, this would be the first one,

- And this would be the second bullet point,

- You may add others as needed.