# Hypothesis Testing

First, we read our data. Since we will perform mathematical operations, we drop the Player columns, which contain strings, along with the columns we do not need, and keep only the first player's statistics.

In [2]:
import pandas as pd
data = pd.read_csv("deneme.csv")
data = data.drop(["Player1","Player2", "FSP.2", "FSW.2", "SSP.2", "SSW.2", "ACE.2","DBF.2", "WNR.2", "UFE.2","BPC.2","BPW.2",
                  "TPW.2" , "FNL1", "FNL2","NPA.1","NPW.1","ST1.1","ST2.1","ST3.1","ST4.1","ST5.1","NPA.2","NPW.2","ST1.2",
                  "ST2.2","ST3.2","ST4.2","ST5.2"], axis=1)
data.rename(columns={"FSP.1":"FSP" , "FSW.1" : "FSW" , "SSP.1" : "SSP" , "SSW.1" : "SSW","ACE.1" : "ACE", "DBF.1" : "DBF",
                     "WNR.1":"WNR", "UFE.1" : "UFE", "BPC.1" : "BPC" , "BPW.1" : "BPW","TPW.1" : "TPW" }, inplace=True)
data.head(10)

Unnamed: 0,Round,Result,FSP,FSW,SSP,SSW,ACE,DBF,WNR,UFE,BPC,BPW,TPW
0,1,0,61,35,39,18,5,1.0,17,29,1,3,70
1,1,1,61,31,39,13,13,1.0,13,1,7,14,80
2,1,0,52,53,48,20,8,4.0,37,50,1,9,106
3,1,1,53,39,47,24,8,6.0,8,6,6,9,104
4,1,0,76,63,24,12,0,4.0,16,35,3,12,128
5,1,0,65,51,35,22,9,3.0,35,41,2,7,108
6,1,0,68,73,32,24,5,3.0,41,50,9,17,173
7,1,1,47,18,53,15,3,4.0,21,31,6,20,78
8,1,0,64,26,36,12,3,4.0,20,39,3,7,67
9,1,1,77,76,23,11,6,,6,4,7,24,162


Our hypothesis is that the match statistics and the result are correlated.
To test this, we use ρ to denote the population correlation coefficient.

Our null hypothesis is: H0: ρ = 0 (The correlation being zero)
Our alternative hypothesis is H1: ρ != 0 (The correlation being nonzero)

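To test our hypothesis, we will use the sample correlation coefficient r to estimate ρ:

r = Sxy / sqrt(Sxx * Syy)

Sum of cross-products of deviations of x and y: Sxy = Σxy - (1/n)(Σx)(Σy)
Sum of squared deviations of x: Sxx = Σ(x^2) - (1/n)(Σx)^2
Sum of squared deviations of y: Syy = Σ(y^2) - (1/n)(Σy)^2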

n: sample size
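
As a rough sketch of how these formulas fit together, the helper below computes r from the six sums. The function name `pearson_r_from_sums` and the toy numbers are illustrative assumptions, not part of the dataset.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Estimate r from raw sums (hypothetical helper, not from the notebook)."""
    s_xy = sum_xy - (sum_x * sum_y) / n
    s_xx = sum_xx - (sum_x ** 2) / n
    s_yy = sum_yy - (sum_y ** 2) / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Toy data (made up): x = 1, 2, 3, 4 and y = 2, 4, 6, 8, a perfect linear relation.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
r_toy = pearson_r_from_sums(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_xx=sum(x * x for x in xs),
    sum_yy=sum(y * y for y in ys),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
)
print(r_toy)  # 1.0 up to rounding, since y is an exact multiple of x
```

Working from the sums mirrors the approach taken in the cells that follow, where each sum is computed separately and then plugged into the formulas.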

We will use a 0.05 significance level.
Since our sample is large enough, we can use the Z distribution.

Z = (sqrt(n-3)/2) * ln( ((1+r)(1-ρ)) / ((1-r)(1+ρ)) ) is approximately standard normal, where ρ is the value assumed under H0 (here ρ = 0).

We will reject H0 if Z >= Z(α/2) or if Z <= -Z(α/2)

Since our significance level is α = 0.05, the corresponding critical value is Z(α/2) = 1.96.
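
The test statistic and the rejection rule can be sketched in a few lines of standard-library Python. This is only an illustration: the function names `fisher_z_statistic` and `reject_h0` are hypothetical, and the example values of r and n are made up.

```python
import math
from statistics import NormalDist

def fisher_z_statistic(r, n, rho0=0.0):
    """Z = sqrt(n - 3) / 2 * ln(((1 + r)(1 - rho0)) / ((1 - r)(1 + rho0)))."""
    ratio = ((1 + r) * (1 - rho0)) / ((1 - r) * (1 + rho0))
    return math.sqrt(n - 3) / 2 * math.log(ratio)

def reject_h0(z, alpha=0.05):
    """Two-sided decision: reject when |Z| >= Z(alpha/2)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return abs(z) >= z_crit

# Made-up example: r = 0.30 estimated from n = 103 observations.
z_example = fisher_z_statistic(0.30, 103)
print(z_example, reject_h0(z_example))
```

Looking up the critical value with `NormalDist().inv_cdf` avoids hard-coding 1.96, so the same function works for other significance levels.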

To find Σx and Σy, we should sum the columns of our data.

In [3]:
import numpy as py
py.sum(data,axis=0)

Round       488.0
Result      126.0
FSP       15469.0
FSW       12341.0
SSP        9731.0
SSW        5458.0
ACE        2458.0
DBF        1119.0
WNR        8311.0
UFE        8357.0
BPC         909.0
BPW        2280.0
TPW       28001.0
dtype: float64

To find Σ(x^2) and Σ(y^2), we should square each value and then sum the squares.

In [4]:
data2 = py.square(data)
data2.head()

Unnamed: 0,Round,Result,FSP,FSW,SSP,SSW,ACE,DBF,WNR,UFE,BPC,BPW,TPW
0,1.0,0.0,3721.0,1225.0,1521.0,324.0,25.0,1.0,289.0,841.0,1.0,9.0,4900.0
1,1.0,1.0,3721.0,961.0,1521.0,169.0,169.0,1.0,169.0,1.0,49.0,196.0,6400.0
2,1.0,0.0,2704.0,2809.0,2304.0,400.0,64.0,16.0,1369.0,2500.0,1.0,81.0,11236.0
3,1.0,1.0,2809.0,1521.0,2209.0,576.0,64.0,36.0,64.0,36.0,36.0,81.0,10816.0
4,1.0,0.0,5776.0,3969.0,576.0,144.0,0.0,16.0,256.0,1225.0,9.0,144.0,16384.0


In [5]:
py.sum(data2,axis = 0)

Round        1352.0
Result        126.0
FSP        962809.0
FSW        666565.0
SSP        389009.0
SSW        136598.0
ACE         36040.0
DBF          7749.0
WNR        358261.0
UFE        377955.0
BPC          4779.0
BPW         28626.0
TPW       3405731.0
dtype: float64

In [6]:
py.shape(data)

(252, 13)

This means that we have 252 observations (n = 252).

## Our hypothesis

Our hypothesis is that Break Points Created and Result are correlated, and we will test this with the method described above.
For this purpose, our x is Break Points Created, and our y is Result.

H0: They are not correlated: ρ = 0
H1: They are correlated: ρ != 0

First, we should estimate ρ with r using the formula. To use the formula, we have to find Sxx, Sxy, and Syy.

In [7]:
Sxx = 4779 - (1/252)*(909)*(909)
Syy = 126 - (1/252)*(126)*(126)
# Sxy is computed in a later cell, so only Sxx (about 1500.11) and Syy (about 63) are printed here.
print("Sxx: ", Sxx, "    Syy: ", Syy)

To find Sxy, we multiply the Result and Break Points Created columns row by row and sum the products to get Σxy.

In [9]:
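myarray = []
# Column 1 is Result and column 10 is BPC (Break Points Created).
# range(0, 251) visits rows 0..250, i.e. 251 of the 252 rows.
for i in range(0, 251):
    x = data.values[i][1] * data.values[i][10]
    myarray.append(x)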

In [10]:
print(myarray)

[0.0, 7.0, 0.0, 6.0, 0.0, 0.0, 0.0, 6.0, 0.0, 7.0, 0.0, 6.0, 0.0, 0.0, 10.0, 4.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 5.0, 7.0, 4.0, 0.0, 5.0, 3.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 9.0, 5.0, 0.0, 5.0, 4.0, 5.0, 2.0, 4.0, 5.0, 9.0, 0.0, 5.0, 11.0, 7.0, 0.0, 8.0, 6.0, 3.0, 0.0, 0.0, 0.0, 7.0, 0.0, 5.0, 0.0, 7.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 5.0, 5.0, 0.0, 4.0, 0.0, 6.0, 0.0, 5.0, 0.0, 0.0, 7.0, 3.0, 0.0, 5.0, 0.0, 0.0, 0.0, 3.0, 0.0, 6.0, 0.0, 8.0, 5.0, 7.0, 0.0, 2.0, 0.0, 0.0, 0.0, 5.0, 0.0, 4.0, 3.0, 0.0, 5.0, 3.0, 5.0, 0.0, 3.0, 4.0, 4.0, 0.0, 0.0, 0.0, 6.0, 0.0, 2.0, 0.0, 5.0, 0.0, 7.0, 4.0, 5.0, 5.0, 5.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 3.0, 4.0, 10.0, 0.0, 7.0, 0.0, 7.0, 0.0, 4.0, 6.0, 0.0, 0.0, 8.0, 7.0, 5.0, 0.0, 5.0, 7.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 2.0, 7.0, 4.0, 0.0, 4.0, 6.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 4.0, 9.0, 2.0, 0.0, 7.0, 0.0, 2.0, 0.0, 3.0, 4.0, 0.0, 0.0, 5.0, 5.0, 0.0, 0.0, 6.0, 0.0, 4.0, 0.0, 9.0, 0

In [11]:
py.sum(myarray)

642.0

In [12]:
Sxy = 642 - (1/252)* (909) * (126)
print("Sxy: ", Sxy)

Sxy:  187.50000000000006
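
The sum of products can also be computed without hard-coding the number of rows. Below is a minimal sketch in plain Python, using made-up values rather than the dataset: `zip` pairs the two columns, so every row is included. The names `results` and `break_points` are illustrative.

```python
# Made-up values standing in for the Result and BPC columns.
results = [0, 1, 0, 1, 1]
break_points = [1, 7, 1, 6, 4]

# Σxy: multiply the columns element-wise and add up the products.
sum_xy = sum(x * y for x, y in zip(results, break_points))

# Sxy = Σxy - (1/n)(Σx)(Σy)
n = len(results)
s_xy = sum_xy - (sum(results) * sum(break_points)) / n
print(sum_xy, s_xy)
```

With the notebook's DataFrame, the same sum could be written as `(data["Result"] * data["BPC"]).sum()`, which likewise covers every row.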


Now we can estimate ρ, and our estimate is denoted by r.

In [13]:
r = Sxy / ((Sxx* Syy)**(1/2))

In [14]:
print("r: ",r)

r:  0.6099157632759651


Our observed Z statistic (Zobservations) is:

In [53]:
Zhat = (((252-3)**(1/2))/2) * py.log((1+r)/(1-r))
print("Zobservations: ", Zhat)

Zobservations:  11.1844735868


#### Since our observed Z is greater than Z(α/2) (which is 1.96), we reject H0 at the 0.05 significance level and conclude that Result and Break Points Created are correlated

Let's repeat the test with Unforced Errors (UFE) as x and Result as y.

H0: They are not correlated: ρ = 0
H1: They are correlated: ρ != 0

In [55]:
Sxx = 377955 - (1/252)*(8357)*(8357)
Syy = 126 - (1/252)*(126)*(126)
print("Sxx: ", Sxx)

myarray = []
# Column 1 is Result and column 9 is UFE (Unforced Errors).
for i in range (0,251):
    x = data.values[i][1]*data.values[i][9]
    myarray.append(x)
summ=py.sum(myarray)
Sxy = summ - (1/252)* (8357) * (126)
print("Sxy: ", Sxy)

Sxx:  100814.32936507935
Sxy:  -463.5


In [56]:
r = Sxy / ((Sxx* Syy)**(1/2))
print("r: ",r)

r:  -0.18391549951


In [57]:
Zhat = (((252-3)**(1/2))/2) * py.log((1+r)/(1-r))
print("Zobservations: ", Zhat)

Zobservations:  -2.93553970681


#### Since our observed Z is smaller than -Z(α/2) (which is -1.96), we reject H0 at the 0.05 significance level and conclude that Result and Unforced Errors are correlated
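
Both tests follow the same steps, so they could be wrapped in one function. The sketch below is one possible way to do that in plain Python. `correlation_z_test` is a hypothetical name, and the sample lists are made up purely to show the mechanics (ten rows is far too few for the normal approximation to be reliable).

```python
import math
from statistics import NormalDist

def correlation_z_test(xs, ys, alpha=0.05):
    """Test H0: rho = 0 against H1: rho != 0; returns (r, z, reject)."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    r = s_xy / math.sqrt(s_xx * s_yy)
    z = math.sqrt(n - 3) / 2 * math.log((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return r, z, abs(z) >= z_crit

# Made-up sample: ten matches, result (1 = win) and break points created.
result = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0]
bpc = [1, 7, 2, 6, 3, 6, 5, 2, 8, 1]
r, z, reject = correlation_z_test(bpc, result)
print(r, z, reject)
```

Passing the columns as arguments means the same call can be reused for any other statistic in the table, instead of retyping the sums by hand for each one.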