# Question 
A pharmaceutical company conducts an experiment to test the effect of a new cholesterol medication. The company selects 15 subjects randomly from a larger population. Each subject is randomly assigned to one of three treatment groups. Within each treatment group subjects receive a different dose of the new medication. 

In conducting the experiment there are two questions to answer:
1. Does the dosage level have a significant effect on cholesterol level?
2. How strong is the effect of dosage on cholesterol level?

Use a one-way ANOVA with a confidence level of 95% to answer these questions. Perform a post-hoc test if necessary.

In [1]:
import pandas
from scipy import stats
import numpy
import itertools

from functions.module2 import ANOVA

The first step is to load our data from the CSV file.
We will also look at the column headings to get an idea of what we are working with.

In [2]:
cholesterol_data = pandas.read_csv('data//Cholesterol.csv')
column_names = cholesterol_data.columns
column_names

Index(['Group', 'Dosage [mg/day]', 'Cholesterol Level'], dtype='object')

We see that the data contains the group, the dosage each subject received, and the cholesterol level of each subject.
We identify from the question that the dosage is the independent variable and the cholesterol level is the dependent variable.
Let's assign the names of those columns to variables:

In [3]:
independent_col = column_names[1]  # Dosage
dependent_col = column_names[2]  # Cholesterol Level

Note that we could equally use the group number as our independent variable; this may be more convenient and does not change the analysis.
Next we should find out how many values of our independent variable are present and what they are. We do that with `pandas.unique`, which returns the unique values in a given column.

In [4]:
independent_variable_values = pandas.unique(cholesterol_data[independent_col])
independent_variable_values

array([  0,  50, 100], dtype=int64)

We see that there are three dosages: 0, 50, and 100 mg/day. We can compare these to the groups:

In [5]:
# Use a separate name so the dosage values found above are not overwritten
group_values = pandas.unique(cholesterol_data['Group'])
group_values

array([1, 2, 3], dtype=int64)

and we see that there are three groups: 1, 2, and 3. Printing the whole dataset shows that group 1 corresponds to a dosage of 0 mg/day, and so on.

In [6]:
cholesterol_data

   Group  Dosage [mg/day]  Cholesterol Level
0      1                0                210
1      1                0                240
2      1                0                270
3      1                0                270
4      1                0                300
5      2               50                210
6      2               50                240
7      2               50                240
8      2               50                270
9      2               50                270


Now we can break the dataset up into the individual groups, as we did in module 1, in order to look at some descriptive statistics and test some of our assumptions. We use the group numbers as our independent variable.

In [7]:
dependent_variable_data = pandas.DataFrame(columns=[group for group in pandas.unique(cholesterol_data['Group'])])
for group in pandas.unique(cholesterol_data['Group']):
    dependent_variable_data[group] = cholesterol_data["Cholesterol Level"][cholesterol_data["Group"]==group].reset_index(drop=True)

Now we can get the statistics for the various groups:

In [8]:
print(dependent_variable_data.describe())

                1           2           3
count    5.000000    5.000000    5.000000
mean   258.000000  246.000000  210.000000
std     34.205263   25.099801   21.213203
min    210.000000  210.000000  180.000000
25%    240.000000  240.000000  210.000000
50%    270.000000  240.000000  210.000000
75%    270.000000  270.000000  210.000000
max    300.000000  270.000000  240.000000
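
As a quick sanity check on the table above, the group 1 column can be reproduced by hand. The snippet below is a minimal sketch using only Python's standard `statistics` module; the list `group_1` is typed in from the printed dataset rather than read from the CSV file, and the variable names are chosen here for illustration.

```python
import statistics

# Group 1 cholesterol levels (0 mg/day), copied from the printed dataset
group_1 = [210, 240, 270, 270, 300]

mean_1 = statistics.mean(group_1)
# statistics.stdev uses the sample (n - 1) denominator, like DataFrame.describe()
std_1 = statistics.stdev(group_1)
median_1 = statistics.median(group_1)  # the "50%" row

print(mean_1, round(std_1, 6), median_1)
```

The three values should agree with the `mean`, `std`, and `50%` rows of column 1.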


We can perform the Shapiro-Wilk test for normality:

In [9]:
for group in dependent_variable_data.columns:
    print(group, stats.shapiro(dependent_variable_data[group]))

1 ShapiroResult(statistic=0.9608590006828308, pvalue=0.8139519691467285)
2 ShapiroResult(statistic=0.8810376524925232, pvalue=0.3140396773815155)
3 ShapiroResult(statistic=0.883490800857544, pvalue=0.3254301846027374)


and the Levene test for equality of variance:

In [10]:
for group1,group2 in itertools.combinations(dependent_variable_data.columns,2):
    print(group1,group2, stats.levene(dependent_variable_data[group1],dependent_variable_data[group2]))

1 2 LeveneResult(statistic=0.2, pvalue=0.6665811073830712)
1 3 LeveneResult(statistic=0.8, pvalue=0.3972038407802933)
2 3 LeveneResult(statistic=0.3333333333333333, pvalue=0.5795839999999997)
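
To see where these statistics come from, the sketch below recomputes the value for groups 1 and 2 by hand. It assumes the median-centred (Brown-Forsythe) form of the test, which is the default of `stats.levene`: each observation is replaced by its absolute deviation from its group median, and a one-way ANOVA F ratio is computed on those deviations. The helper `levene_median` is a hypothetical name introduced here, and the data are typed in from the printed dataset.

```python
import statistics

def levene_median(*groups):
    """Levene statistic using absolute deviations from each group's median."""
    # Absolute deviation of every observation from its own group median
    deviations = [[abs(x - statistics.median(g)) for x in g] for g in groups]
    k = len(groups)
    n_total = sum(len(d) for d in deviations)
    group_means = [statistics.fmean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total
    # Between-group and within-group sums of squares of the deviations
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for d, m in zip(deviations, group_means) for z in d)
    return (between / (k - 1)) / (within / (n_total - k))

# Values copied from the printed dataset
group_1 = [210, 240, 270, 270, 300]
group_2 = [210, 240, 240, 270, 270]

w = levene_median(group_1, group_2)
print(round(w, 4))
```

The result can be compared with the statistic reported for the pair `1 2` above.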


For both tests all $p$ values are well above 0.05, so we cannot reject the null hypotheses of normality and of equal variances, and we proceed on the assumption that the ANOVA requirements are met.

So, we perform the ANOVA (remember to pass the original dataset, not the one we split into groups):

In [11]:
ANOVA(cholesterol_data, "Group", "Cholesterol Level", confidence=0.95)

Variation due to      DoF    Sum of squares  mean squares    F ratio
------------------  -----  ----------------  --------------  ---------
Between                 2              6240  3120.0          4.16
Within                 12              9000  750.0
Total                  14             15240
Significance (p value): 0.04241751817647503


Reject null-hypothesis: There are statistical differences present.
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     1      2    -12.0 0.7611 -58.186 34.186  False
     1      3    -48.0 0.0415 -94.186 -1.814   True
     2      3    -36.0  0.136 -82.186 10.186  False
---------------------------------------------------


The $p$ value from the ANOVA (0.042) is smaller than our significance level of 0.05, so we reject the null hypothesis that all group means are equal and perform a post-hoc Tukey test.
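
The entries of the ANOVA table can be cross-checked from the group summaries alone. The following is a minimal sketch, not the implementation of the course's `ANOVA` function: it takes the counts, means, and standard deviations printed by `describe()` and rebuilds the sums of squares and the F ratio. For the $p$ value it uses the closed form of the F survival function that holds when the numerator has exactly 2 degrees of freedom; for the general case, `scipy.stats.f.sf` would be the usual choice.

```python
# Group sizes, means and standard deviations as printed by describe()
n = [5, 5, 5]
means = [258.0, 246.0, 210.0]
stds = [34.205263, 25.099801, 21.213203]

k = len(n)
n_total = sum(n)
grand_mean = sum(ni * mi for ni, mi in zip(n, means)) / n_total

# Between-group variation: spread of the group means around the grand mean
ss_between = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(n, means))
# Within-group variation: pooled from the sample variances
ss_within = sum((ni - 1) * si ** 2 for ni, si in zip(n, stds))

df_between = k - 1
df_within = n_total - k
f_ratio = (ss_between / df_between) / (ss_within / df_within)

# Closed form valid only because df_between == 2
p_value = (1 + df_between * f_ratio / df_within) ** (-df_within / 2)

print(round(ss_between), round(ss_within), round(f_ratio, 2), round(p_value, 4))
```

Because the standard deviations are rounded to six decimals, the results match the table only up to rounding.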

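The Tukey test shows that the only significant pairwise difference is between group 1 (0 mg/day) and group 3 (100 mg/day); the differences between groups 1 and 2 and between groups 2 and 3 are not significant at the 5% level.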
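
The second question, how strong the effect of dosage is, can be addressed with an effect size. One common choice for a one-way ANOVA is eta squared, $\eta^2 = SS_\text{between} / SS_\text{total}$, the fraction of the total variation explained by the grouping. The sketch below assumes this measure (the source does not name one) and takes the sums of squares directly from the ANOVA table.

```python
# Sums of squares read from the ANOVA table above
ss_between = 6240.0
ss_total = 15240.0

# Eta squared: share of total variation attributable to dosage group
eta_squared = ss_between / ss_total

print(round(eta_squared, 3))
```

Roughly 41% of the variation in cholesterol level is associated with the dosage group, which commonly cited rules of thumb would class as a large effect. With only five subjects per group, this estimate should be read with caution.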