# Question:
You are given pyrolysis data containing temperatures and the produced syngas volume. Temperature is the independent variable, and two different temperatures are provided.
The task is to compare the means of the two groups using an independent-samples t-test to determine whether there is a significant difference between them. Use a 95% confidence level.
Include the group statistics and the t-test results in your answer.

We start off by just including the functions that we developed in the tutorial notebook.

In [1]:
import pandas
from scipy import stats
import numpy

def get_confidence_interval(data, confidence=0.95):
    """ Determines the confidence interval for a given set of data, 
        assuming the population standard deviation is not known.
    Args:  # 'arguments', or inputs to the function
        data (single-column or list): The data
        confidence (float): The confidence level on which to produce the interval.
    Returns:
        c_interval (tuple): The confidence interval on the given data (lower, upper).
    """

    n = len(data)  # determines the sample size
    m = numpy.mean(data)  # obtains mean of the sample

    se = stats.sem(data)  # obtains standard error of the sample

    c_interval = stats.t.interval(confidence, n-1, m, se)  # determines the confidence interval
    return c_interval  # which is of the form (lower bound, upper bound)

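def t_test(data_group1, data_group2, confidence=0.95):
    """ Performs an independent-samples t-test on two groups of data.
    Args:
        data_group1, data_group2 (single-column or list): The two samples.
        confidence (float): The confidence level for the test.
    Returns:
        dict: The t-value, the p-value, and "Accept H0" ("True" if we fail to reject H0).
    """
    alpha = 1-confidence  # significance level

    # Levene's test: a p-value above alpha means we cannot reject equal variances
    equal_variance = stats.levene(data_group1, data_group2)[1] > alpha

    # equal_var=False makes ttest_ind use Welch's t-test instead of the pooled test
    t, p = stats.ttest_ind(data_group1, data_group2, equal_var=equal_variance)

    accept_H0 = "False"
    if p>alpha:
        accept_H0 = "True"

    return {'t': t, "p": p, "Accept H0": accept_H0}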
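
To see how the Levene check steers the choice of test, here is a minimal sketch on synthetic data. The two arrays, their sizes, and the seed are made up for illustration and are not the pyrolysis measurements.

```python
import numpy
from scipy import stats

rng = numpy.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical samples: same mean, clearly different spread
narrow = rng.normal(loc=10.0, scale=1.0, size=200)
wide = rng.normal(loc=10.0, scale=5.0, size=200)

alpha = 0.05
levene_p = stats.levene(narrow, wide)[1]  # second element is the p-value
equal_variance = levene_p > alpha  # True only if equal variances cannot be rejected

# With equal_var=False, ttest_ind performs Welch's t-test
t, p = stats.ttest_ind(narrow, wide, equal_var=equal_variance)
print("Equal variances assumed:", equal_variance)
print("t =", t, "p =", p)
```

Because the spreads differ so strongly here, the Levene p-value falls below alpha and the unequal-variance form of the test is used.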

Now, the first step is to load our data from our Excel or CSV file.
We will also look at the headings of the columns in our data to get an idea of what we are working with.

In [2]:
pyrolysis_data = pandas.read_csv('data/Pyrolysis.csv')
column_names = pyrolysis_data.columns
column_names

Index(['Temperature', 'Syngas Volume'], dtype='object')
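
If the data file is not at hand, the same loading and inspection steps can be tried on a small in-memory CSV. This is only a sketch: the four rows below are invented stand-ins that reuse the column names of the real file.

```python
import io
import pandas

# Hypothetical CSV content with the same headings as the pyrolysis file
csv_text = (
    "Temperature,Syngas Volume\n"
    "753,0.16\n"
    "753,0.15\n"
    "793,0.19\n"
    "793,0.18\n"
)

sample = pandas.read_csv(io.StringIO(csv_text))  # read_csv accepts file-like objects
print(list(sample.columns))  # the column headings
print(sample.shape)          # (number of rows, number of columns)
```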

We see that we have temperature and syngas volume supplied in the data. This is in line with what the question indicated. 
The question states that temperature is the independent variable (which makes sense according to what we know of pyrolysis). 
Let's assign the names of those columns to variables:

In [3]:
independent_var = column_names[0]
dependent_var = column_names[1]

So, next we should find out how many values of our independent variable are present and what they are. We do that with a pandas command that finds the unique values in a given column. 

In [4]:
values_of_independent_var = pandas.unique(pyrolysis_data[independent_var])
values_of_independent_var

array([753, 793], dtype=int64)
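
Alongside the unique values, it can be useful to check how many rows belong to each one. A minimal sketch, assuming a small made-up DataFrame in place of the real data:

```python
import pandas

# Hypothetical stand-in for the pyrolysis data
sample = pandas.DataFrame({
    "Temperature": [753, 753, 753, 793, 793],
    "Syngas Volume": [0.16, 0.15, 0.17, 0.19, 0.18],
})

levels = pandas.unique(sample["Temperature"])    # distinct values, in order of appearance
counts = sample["Temperature"].value_counts()    # number of rows per distinct value
print(levels)
print(counts)
```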

We see that there are only two groups, one at a temperature of 753 degrees Celsius and one at 793 degrees Celsius. 

If we recall from the question/problem statement: we should perform an independent-samples t-test to compare the means of the groups and determine whether they are statistically different. We should present our results along with the statistics for the two groups.

We would like to begin by calculating the statistics for our two groups using the .describe method we used in the tutorial. But, if you recall, we must first "separate" the data, as both groups are currently stored in the same column.

We do this with the boolean indexing below, selecting all the rows of the dataset where the independent variable is equal to the first, and then the second, of the values we determined in the previous section of code.

In [5]:
Group_753 = pyrolysis_data.loc[pyrolysis_data[independent_var]==values_of_independent_var[0]]
Group_793 = pyrolysis_data.loc[pyrolysis_data[independent_var]==values_of_independent_var[1]]
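
For comparison, pandas can also do the splitting in one step with groupby. The sketch below uses a small made-up DataFrame rather than the real data, and the variable names are chosen only for illustration.

```python
import pandas

# Hypothetical stand-in for the pyrolysis data
sample = pandas.DataFrame({
    "Temperature": [753, 793, 753, 793],
    "Syngas Volume": [0.16, 0.19, 0.15, 0.18],
})

# Boolean indexing, as in the cell above: keep rows where the condition is True
group_a = sample.loc[sample["Temperature"] == 753]

# groupby alternative: one sub-DataFrame per distinct temperature
groups = {level: frame for level, frame in sample.groupby("Temperature")}

print(group_a)
print(groups[753])
```

Both approaches select the same rows. Boolean indexing is explicit when there are only two groups, while groupby scales more easily to many groups.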

Now we can get the group statistics:

In [6]:
print(Group_753.describe())
print()
print(Group_793.describe())

       Temperature  Syngas Volume
count         27.0      27.000000
mean         753.0       0.161671
std            0.0       0.010420
min          753.0       0.143617
25%          753.0       0.152269
50%          753.0       0.163481
75%          753.0       0.169802
max          753.0       0.179406

       Temperature  Syngas Volume
count         27.0      27.000000
mean         793.0       0.188444
std            0.0       0.008817
min          793.0       0.169676
25%          793.0       0.184338
50%          793.0       0.189009
75%          793.0       0.192935
max          793.0       0.209312


We only care about the statistics for the dependent variable, but we print it for both the independent and dependent just to make sure that we had no errors in how we separated the data.

The statistics we care about in this table are: "count" - the number of samples, "mean" - the mean value, "std" - the standard deviation, "min" - the minimum value, and "max" - the maximum value. These help us understand our data better.
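
One detail worth knowing about "std": pandas reports the sample standard deviation (dividing by n-1), whereas numpy.std divides by n unless ddof=1 is passed. A small sketch with made-up values shows the difference:

```python
import numpy
import pandas

values = pandas.Series([0.15, 0.16, 0.17, 0.18])  # hypothetical measurements
arr = values.to_numpy()

std_describe = values.describe()["std"]    # sample standard deviation (ddof=1)
std_sample = numpy.std(arr, ddof=1)        # same definition, computed with numpy
std_population = numpy.std(arr)            # numpy default, ddof=0

print(std_describe, std_sample, std_population)
```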

We can also determine the confidence interval on the mean of the dependent variable in each of the two groups:

In [7]:
print("Group 753:", get_confidence_interval(Group_753[dependent_var], confidence=0.95))
print("Group 793:", get_confidence_interval(Group_793[dependent_var], confidence=0.95))

Group 753: (0.1575492241047826, 0.16579288693225439)
Group 793: (0.18495575917054885, 0.19193184653315487)
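
As a sanity check, the interval for the 753 group can be approximately reproduced from the rounded summary statistics printed above, using mean ± t* × std/√n. This is a sketch; because the inputs are rounded, the result agrees with the function output only to about five decimal places.

```python
import math
from scipy import stats

# Summary statistics copied from the describe() output for the 753 group
n = 27
mean = 0.161671
std = 0.010420  # sample standard deviation

confidence = 0.95
se = std / math.sqrt(n)  # standard error of the mean

# Two-sided critical value of the t-distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)

lower = mean - t_crit * se
upper = mean + t_crit * se
print(lower, upper)
```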


Looking at the values of the means and their confidence intervals, which do not overlap, it seems likely that the two groups are indeed statistically different. But to confirm we will perform a t-test. 
Let us first recall the hypotheses: \
$H_0 : \mu_1 - \mu_2 = 0$\
$H_1 : \mu_1 - \mu_2 \neq 0$

The null hypothesis states that the means of the two groups are equal. We can now perform the t-test. Remember that the function we created outputs the t-value, the p-value, and whether we should accept (strictly speaking, fail to reject) or reject the null hypothesis, based on comparing the p-value with the significance level (1 minus the confidence level).

In [8]:
t_test(Group_753[dependent_var], Group_793[dependent_var], confidence=0.95)

{'t': -10.191851675980322, 'p': 5.370645652707995e-14, 'Accept H0': 'False'}
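
As a cross-check, the t-value can be reproduced by hand from the group statistics printed earlier. The sketch below uses the rounded means and standard deviations from the describe() output, so it matches only to a few decimal places. Since both groups have the same size, the pooled and unequal-variance forms of the t-statistic coincide, so the formula does not depend on the outcome of the Levene check.

```python
import math

# Rounded summary statistics from the describe() output
n1, mean1, std1 = 27, 0.161671, 0.010420  # 753 group
n2, mean2, std2 = 27, 0.188444, 0.008817  # 793 group

# Standard error of the difference between the two means
se_diff = math.sqrt(std1**2 / n1 + std2**2 / n2)

t = (mean1 - mean2) / se_diff
print(round(t, 2))  # -10.19, in line with the function output
```

The p-value is far below the significance level of 0.05, so the null hypothesis of equal means is rejected.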