## part 1 
In a packing plant, a machine packs cartons with jars. A new machine is expected to pack faster on average than the machine currently in use. To test that hypothesis, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t test are met, do the data provide sufficient evidence that one machine is faster than the other?

In [48]:
import numpy as np 
from scipy import stats
import pandas as pd
import math

In [23]:
data = pd.read_csv('machine.csv')

In [25]:
data.head(2)

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6


In [47]:
data.std()

New machine    0.683455
Old machine    0.749889
dtype: float64

In [63]:
meannew=data['New machine'].mean()
meannew

42.14

In [64]:
meanold=data['Old machine'].mean()
meanold

43.230000000000004

In [29]:
# summary statistics (means and standard deviations), in case we want to enter them manually
data.describe()

Unnamed: 0,New machine,Old machine
count,10.0,10.0
mean,42.14,43.23
std,0.683455,0.749889
min,41.0,41.7
25%,41.8,42.8
50%,42.2,43.4
75%,42.625,43.75
max,43.2,44.1


In [35]:
meanNew=42.14
meanOld=43.23
stdNew=0.683455
stdOld=0.749889
obs=20  # total number of observations (10 per machine)


### define Hypothesis


Ho : the New machine is just as fast as the Old machine (the mean packing times are equal)


Ha : the New machine is faster than the Old machine (its mean packing time is smaller)

We can do a one-tailed Student's t test to see whether the times for New are significantly smaller than those for Old, i.e. whether the t statistic falls into the critical region.

We could also do a two-tailed t test if we didn't have a starting assumption that one machine is better than the other.

### degrees of freedom
https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/degrees-of-freedom/

Degrees of Freedom (Two Samples):
    
    (N1 + N2) – 2
    
The significance level we are using is the standard 0.05, i.e. a confidence level of 1 - 0.05 = 0.95.


In [45]:
dof = obs-2  # (N1 + N2) - 2
sl=0.95      # confidence level, i.e. 1 - significance level (0.05)

### t critical value 

https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf


From the lookup table I can also find the critical value:
for 18 df, a one-tailed test and a confidence level of 0.95 it is 1.734.

(As we are doing a left-tailed test, the critical value is -1.734.)

In [60]:
t_critical=stats.t.ppf(sl, dof)
t_critical

1.7340636066175354

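This means that, for our left-tailed test, we can reject the null hypothesis if the t statistic for our samples is < -1.734. (A two-tailed test at the same significance level would instead reject when the t statistic is > 2.101 or < -2.101, the two-tailed critical value for 18 df.)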
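
As a small sketch of that decision rule, the left-tail critical value can be read directly from the lower 5% quantile of the t distribution. The helper name `reject_left_tailed` is hypothetical and only used for illustration; `alpha` and `dof` are the values assumed in the text.

```python
from scipy import stats

alpha = 0.05   # significance level used in the text
dof = 18       # (10 + 10) - 2

# lower-tail critical value: the 5% quantile of the t distribution
t_critical_left = stats.t.ppf(alpha, dof)

def reject_left_tailed(t_stat, t_crit=t_critical_left):
    """Hypothetical helper: True when t_stat lies in the left-tail critical region."""
    return bool(t_stat < t_crit)

print(t_critical_left)
print(reject_left_tailed(-2.0), reject_left_tailed(2.0))
```

By symmetry of the t distribution this is the negative of the value returned by `stats.t.ppf(0.95, 18)`.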

### t statistic for a 2 independent sample test 

The next stage requires some thought. We are not comparing a sample mean to a population mean as we did in class, but comparing two sample means to each other. In this case let's use the stats.ttest_ind function to calculate our t statistic.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

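**stats.ttest_ind(rvs1, rvs5, equal_var = False)**

where rvs1 is one sample and rvs5 is the second sample. In this case we do not need to work out the pooled std.

(equal_var = False: we perform Welch's test, which does not assume equal population variances. Our std checks show that the two sample standard deviations are in fact close, 0.68 vs 0.75, so a pooled test would also be reasonable here; Welch's test is simply the more cautious choice. Note that ttest_ind returns a two-sided p value by default.)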


In [56]:
# let's separate the two columns into arrays
arr1=data['New machine'].to_numpy()
arr1

array([42.1, 41. , 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

In [57]:
arr2=data['Old machine'].to_numpy()
arr2

array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44. , 44.1])

In [104]:
statistic=stats.ttest_ind(arr1, arr2,equal_var=False)
statistic

Ttest_indResult(statistic=-3.397230706117603, pvalue=0.0032422494663179747)
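
To see where this statistic comes from, here is a sketch that recomputes it by hand from the two samples, together with the Welch-Satterthwaite degrees of freedom. The function name `welch_t` is hypothetical; the arrays are the ones printed above.

```python
import numpy as np

new = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

def welch_t(a, b):
    """Hypothetical helper: Welch t statistic and Welch-Satterthwaite df."""
    va = a.var(ddof=1) / a.size   # squared standard error of the mean of a
    vb = b.var(ddof=1) / b.size   # squared standard error of the mean of b
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df

t_manual, df_welch = welch_t(new, old)
print(t_manual, df_welch)
```

The approximate degrees of freedom come out slightly below the 18 used for the lookup table, which is expected for Welch's test; with samples this similar the difference is small.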

### conclusion of the test 

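We have enough evidence to reject the null hypothesis, both because the t statistic (-3.397) is in the critical region (below -1.734) and because the p value is < 0.05. The reported p value of 0.0032 is two-sided; for our one-tailed test it is halved to about 0.0016, which is still well below 0.05. The data support the claim that the new machine packs faster.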
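
The one-tailed decision can be sketched as follows, assuming the statistic and two-sided p value reported by ttest_ind above. Halving the two-sided p value is only valid because the statistic lies on the side predicted by Ha (it is negative).

```python
from scipy import stats

t_stat = -3.397230706117603          # statistic reported by ttest_ind
p_two_sided = 0.0032422494663179747  # two-sided p value reported by ttest_ind

# the statistic is negative, i.e. on the side predicted by Ha,
# so the left-tailed p value is half of the two-sided one
p_one_sided = p_two_sided / 2

alpha = 0.05
t_critical_left = stats.t.ppf(alpha, 18)   # 18 df, as in the lookup table
reject = bool(t_stat < t_critical_left and p_one_sided < alpha)
print(p_one_sided, reject)
```

SciPy 1.6 and later also accept `alternative='less'` in `stats.ttest_ind`, which returns the one-sided p value directly.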

## part 2 
In this case we can't assume that the population variances are equal, so we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file files_for_lab/student_gpa.txt.

At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?


In [65]:
data2 = pd.read_csv('student_gpa.csv')

In [79]:
data2.head()

Unnamed: 0,Sophomores,Juniors
0,3.04,2.56
1,1.71,2.77
2,3.3,2.7
3,2.88,3.0
4,2.11,2.98


In [80]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Sophomores  17 non-null     float64
 1   Juniors     16 non-null     object 
dtypes: float64(1), object(1)
memory usage: 400.0+ bytes


In [81]:
data2.describe()

Unnamed: 0,Sophomores
count,17.0
mean,2.84
std,0.519832
min,1.71
25%,2.6
50%,2.92
75%,3.13
max,3.6


In [89]:
# let's split the df into arrays and deal with the object dtype first

### Prepare data

In [71]:
arr3=data2['Sophomores'].to_numpy()
arr3

array([3.04, 1.71, 3.3 , 2.88, 2.11, 2.6 , 2.92, 3.6 , 2.28, 2.82, 3.03,
       3.13, 2.86, 3.49, 3.11, 2.13, 3.27])

In [72]:
arr4=data2['Juniors'].to_numpy()
arr4

array(['2.56', '2.77', '2.7', '3', '2.98', '3.47', '3.26', '3.2', '3.19',
       '2.65', '3', '3.39', '2.58', '\t', '\t', '\t', nan], dtype=object)

In [76]:
arr4=arr4[:13]  # keep the 13 real observations; the remaining entries are tab strings and NaN

In [78]:
arr4 = arr4.astype('float64')
arr4

array([2.56, 2.77, 2.7 , 3.  , 2.98, 3.47, 3.26, 3.2 , 3.19, 2.65, 3.  ,
       3.39, 2.58])
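
An alternative to slicing by position is to let pandas coerce anything non-numeric to NaN and drop it. This is only a sketch: the Series below is rebuilt from the values printed above rather than read from the file, and the names `raw` and `juniors` are made up for illustration.

```python
import numpy as np
import pandas as pd

# the Juniors column as it was read in: strings, tab-only cells and a NaN
raw = pd.Series(['2.56', '2.77', '2.7', '3', '2.98', '3.47', '3.26', '3.2', '3.19',
                 '2.65', '3', '3.39', '2.58', '\t', '\t', '\t', np.nan], dtype=object)

# strip whitespace, turn unparseable cells into NaN, then drop them
juniors = pd.to_numeric(raw.str.strip(), errors='coerce').dropna().to_numpy()
print(juniors.size, juniors.dtype)
```

This avoids hard-coding the number of valid rows, which helps if the file changes.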

#### calculate means, std and number of observations of each sample

In [83]:
Sopmean=np.mean(arr3)
Sopmean

2.84

In [85]:
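# note: np.std defaults to ddof=0 (population std); data2.describe() reports the sample std (ddof=1)
Sopsd=np.std(arr3)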
Sopsd

0.5043108285221584

In [92]:
SopN=arr3.size
SopN

17

In [86]:
Jnmean=np.mean(arr4)
Jnmean

2.980769230769231

In [87]:
# note: population std (ddof=0), as for the sophomores
Jnsd=np.std(arr4)
Jnsd

0.29712627562812255

In [93]:
JnN=arr4.size
JnN

13
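
The sophomore std computed here (0.5043) differs from the one in data2.describe() (0.519832) because NumPy divides by n by default while pandas divides by n - 1. A short sketch of the difference, using the sophomore values printed earlier (the variable names are illustrative):

```python
import numpy as np

soph = np.array([3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28, 2.82, 3.03,
                 3.13, 2.86, 3.49, 3.11, 2.13, 3.27])

pop_sd = np.std(soph)             # ddof=0: divides by n
sample_sd = np.std(soph, ddof=1)  # ddof=1: divides by n - 1, as pandas does
print(pop_sd, sample_sd)
```

For a t test the sample standard deviation (ddof=1) is the relevant one; stats.ttest_ind computes it internally, so the values above do not affect the test result.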

### Define Hypotheses

Ho : there is no difference between the mean GPAs of sophomores and juniors


Ha : there is a difference between the mean GPAs of sophomores and juniors


We can do a two-tailed t test, as we don't have a starting assumption that one group is better than the other.

### degrees of freedom and significance level 

In [96]:
dof = (SopN+JnN)-2 
sl=0.95  # confidence level, i.e. 1 - significance level (0.05)

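### critical value

Using the lookup table, the critical value for a two-tailed test at the 0.05 significance level with 28 df is 2.048. (Welch's test uses an approximate number of degrees of freedom that is at most 28, which would make the critical value slightly larger.)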
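
The same value can be computed instead of looked up. For a two-tailed test the 0.05 is split across both tails, so the quantile to use is 1 - 0.05/2 = 0.975. A minimal sketch, with the hypothetical helper `reject_two_tailed`:

```python
from scipy import stats

alpha = 0.05
dof = (17 + 13) - 2   # 28, as in the text

# two-tailed test: alpha / 2 in each tail
t_critical_two = stats.t.ppf(1 - alpha / 2, dof)

def reject_two_tailed(t_stat, t_crit=t_critical_two):
    """Hypothetical helper: True when |t_stat| exceeds the two-tailed critical value."""
    return bool(abs(t_stat) > t_crit)

print(t_critical_two)
```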

### test statistic 



In [105]:
statistic2=stats.ttest_ind(arr3, arr4,equal_var=False)
statistic2

Ttest_indResult(statistic=-0.9231495630900278, pvalue=0.3642180675348571)

### conclusion of the test 

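We do not have enough evidence to reject the null hypothesis, both because the t statistic (-0.923) is outside the critical region (its absolute value is below 2.048) and because the p value (0.364) is > 0.05. The data do not show that the mean GPAs of sophomores and juniors differ.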

### anova 

In [109]:
import statsmodels.api as sm
from statsmodels.formula.api import ols


In [132]:
data3 = pd.read_excel('anova_lab_data.xlsx', sheet_name='data_collected')
data3.head()

Unnamed: 0,Watts,Etching Rate
0,160,5.43
1,180,6.24
2,200,8.79
3,160,5.71
4,180,6.71


In [133]:
data3.rename(columns = {'Etching Rate':'rate'}, inplace = True)
data3.head()
   

Unnamed: 0,Watts,rate
0,160,5.43
1,180,6.24
2,200,8.79
3,160,5.71
4,180,6.71


In [134]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Watts   15 non-null     int64  
 1   rate    15 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 368.0 bytes


In [136]:
data3['Watts']

0     160
1     180
2     200
3     160
4     180
5     200
6     160
7     180
8     200
9     160
10    180
11    200
12    160
13    180
14    200
Name: Watts, dtype: int64

In [137]:
model = ols('rate ~ C(Watts)',data=data3).fit()
table = sm.stats.anova_lm(model)
print(table)

            df     sum_sq   mean_sq          F    PR(>F)
C(Watts)   2.0  18.176653  9.088327  36.878955  0.000008
Residual  12.0   2.957240  0.246437        NaN       NaN


In [None]:
# next, compare the F statistic with the critical F value (or check PR(>F)) to draw conclusions
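
One possible way to finish this step is sketched below. It assumes a 0.05 significance level (not stated for this part of the lab) and the null hypothesis that the mean etching rate is the same for all three Watts settings; the numbers are copied from the ANOVA table above.

```python
from scipy import stats

# values taken from the anova_lm table
f_stat = 36.878955
df_between, df_within = 2, 12
alpha = 0.05   # assumed significance level

# F is the ratio of the two mean squares in the table
f_from_mean_sq = 9.088327 / 0.246437

# upper-tail critical value and p value of the F distribution
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)

reject = bool(f_stat > f_critical)
print(f_critical, p_value, reject)
```

Under these assumptions the F statistic is far above the critical value, so the null hypothesis of equal mean etching rates would be rejected. ANOVA alone does not say which power settings differ from each other.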