# Instructions

* This notebook has a combination of fill-in-the-blank and entire code blocks you need to solve. 

# Hypothesis Testing


We would like to know whether the effects we see in the sample (observed data) are likely to occur in the population.

The way classical hypothesis testing works is by conducting a statistical test to answer the following question:
> Given the sample and an effect, what is the probability of seeing that effect just by chance?

Here are the steps:

1. Compute test statistic
2. Define null hypothesis
3. Compute p-value
4. Interpret the result

If the p-value is very low (below 0.05 by convention), the effect is considered statistically significant: it is unlikely to have occurred by chance alone. The inference is that the effect is likely to be present in the population too.

This process is very similar to the *proof by contradiction* paradigm. We first assume that the effect is not real; that's the null hypothesis. The next step is to compute the probability of seeing an effect at least as large as the observed one under that assumption (the p-value). If the p-value is very low (<0.05 as a rule of thumb), we reject the null hypothesis.
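
The four steps can be sketched with a permutation test. This is a minimal illustration on synthetic data, not the weed price data; the group names, sizes, and the number of shuffles are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples (synthetic, for illustration only)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=12.0, scale=2.0, size=50)

# 1. Test statistic: absolute difference in means
observed_effect = abs(group_a.mean() - group_b.mean())

# 2. Null hypothesis: both groups come from the same distribution,
#    so shuffling the group labels should not change the statistic much
pooled = np.concatenate([group_a, group_b])

def shuffled_effect(values, n_first, rng):
    """Difference in means after randomly reassigning group labels."""
    shuffled = rng.permutation(values)
    return abs(shuffled[:n_first].mean() - shuffled[n_first:].mean())

# 3. p-value: share of shuffles whose effect is at least as large as observed
n_iter = 2000
null_effects = np.array(
    [shuffled_effect(pooled, len(group_a), rng) for _ in range(n_iter)]
)
p_value = (null_effects >= observed_effect).mean()

# 4. Interpretation: a small p-value means chance alone rarely produces this effect
print("Observed effect:", observed_effect)
print("p-value:", p_value)
```

The permutation approach makes the null hypothesis concrete: if the labels carry no information, shuffling them should produce effects as large as the observed one fairly often.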

In [30]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib as mpl
%matplotlib inline

In [31]:
import seaborn as sns
sns.set(color_codes=True)

In [32]:
weed_pd = pd.read_csv("../data/Weed_Price.csv", parse_dates=[-1])
weed_pd.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
0,Alabama,339.06,1042,198.64,933,149.49,123,2014-01-01
1,Alaska,288.75,252,260.6,297,388.58,26,2014-01-01
2,Arizona,303.31,1941,209.35,1625,189.45,222,2014-01-01
3,Arkansas,361.85,576,185.62,544,125.87,112,2014-01-01
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01


* Parse the `date` column to create two new columns: one for `month` and one for `year`.

In [33]:
#Complete the code below

weed_pd["year"] = weed_pd["date"].apply(lambda x: x.year) 
weed_pd["month"] = weed_pd["date"].apply(lambda x: x.month) 

In [34]:
weed_pd.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month
0,Alabama,339.06,1042,198.64,933,149.49,123,2014-01-01,2014,1
1,Alaska,288.75,252,260.6,297,388.58,26,2014-01-01,2014,1
2,Arizona,303.31,1941,209.35,1625,189.45,222,2014-01-01,2014,1
3,Arkansas,361.85,576,185.62,544,125.87,112,2014-01-01,2014,1
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01,2014,1
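
The same columns can also be derived with the vectorized `.dt` accessor, which avoids calling a Python function per row. A small sketch on a hypothetical three-row frame (`demo` is a stand-in, not `weed_pd`):

```python
import pandas as pd

# Tiny hypothetical frame standing in for weed_pd
demo = pd.DataFrame(
    {"date": pd.to_datetime(["2014-01-01", "2014-02-15", "2015-01-03"])}
)

# .dt exposes datetime components for a whole column at once
demo["year"] = demo["date"].dt.year
demo["month"] = demo["date"].dt.month
print(demo)
```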


### Let's work on weed prices in California in 2014


* Filter the `weed_pd` DataFrame so that it contains only entries for California from 2014.

In [35]:

weed_ca_2014 = weed_pd[(weed_pd.State=="California") & (weed_pd.year==2014)]
print(weed_ca_2014)

            State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN       date  \
4      California  248.78   12096  193.56  12812  192.92    778 2014-01-01   
106    California  248.20   12571  192.80  13406  191.94    804 2014-02-01   
208    California  247.60   12988  192.97  13906  191.40    839 2014-03-01   
259    California  246.76   13396  192.91  14527  191.98    863 2014-04-01   
310    California  246.04   13787  191.89  15047  191.40    891 2014-05-01   
...           ...     ...     ...     ...    ...     ...    ...        ...   
22546  California  245.68   14157  192.26  15561  190.01    929 2014-05-31   
22648  California  245.39   14928  191.61  16635  188.12    975 2014-07-31   
22699  California  245.11   15234  191.36  17049     NaN    995 2014-08-31   
22750  California  245.13   15667  191.03  17758     NaN   1023 2014-10-31   
22852  California  243.96   16501  189.38  19140     NaN   1094 2014-12-31   

       year  month  
4      2014      1  
106    2014      2  


In [45]:
print(weed_ca_2014['State'].unique())
print(weed_ca_2014['year'].unique())

['California']
[2014]


* Find the mean and standard deviation of high quality weed's price.

In [46]:
#print("Mean:", _________ )
#print("Standard Deviation:", _________ )

In [47]:
weed_ca_2014['HighQ'].mean()

245.89423076923077

In [39]:
np.std(weed_ca_2014['HighQ'], ddof=1)

1.2899079393714108
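
Note the `ddof=1` argument: NumPy's `std` divides by `n` by default, while pandas' `std` divides by `n - 1`. A short sketch of the difference, reusing the first five California `HighQ` values printed earlier purely as illustrative numbers:

```python
import numpy as np
import pandas as pd

# Five HighQ values taken from the printed output, for illustration only
prices = pd.Series([248.78, 248.20, 247.60, 246.76, 246.04])
values = prices.to_numpy()

population_std = np.std(values)        # ddof=0: divides by n
sample_std = np.std(values, ddof=1)    # ddof=1: divides by n - 1
pandas_std = prices.std()              # pandas defaults to ddof=1

print("Population std (ddof=0):", population_std)
print("Sample std (ddof=1):", sample_std)
print("pandas std:", pandas_std)
```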

* Calculate the 95% confidence interval on the mean.

In [40]:
stats.norm.interval?

Signature: stats.norm.interval(alpha, *args, **kwds)
Docstring:
Confidence interval with equal areas around the median.

Parameters
----------
alpha : array_like of float
    Probability that an rv will be drawn from the returned range.
    Each value should be in the range [0, 1].
arg1, arg2, ... : array_like
    The shape parameter(s) for the distribution (see docstring of the
    instance object for more information).
loc : array_like, optional
    location parameter, Default is 0.
scale : array_like, optional
    scale parameter, Default is 1.

Returns
-------
a, b : ndarray of float
    end-points of range that contain ``100 * alpha %`` of the rv's
    possible values.
File:      c:\users\ngkno\appdata\local\programs\python\python38\lib\site-packages\scipy\stats\_distn_infrastructure.py

In [72]:
# 95% confidence interval on the mean: centre on the sample mean, scale by the standard error
stats.norm.interval(0.95, loc=weed_ca_2014.HighQ.mean(), scale=weed_ca_2014.HighQ.std()/np.sqrt(len(weed_ca_2014)))
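
Under the normal approximation, a 95% confidence interval on a mean is the sample mean plus or minus about 1.96 standard errors. A minimal sketch on synthetic data (the sample, its size, and its parameters are assumptions for illustration) that checks `stats.norm.interval` against the manual formula:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic prices, not the real data
sample = rng.normal(loc=245.0, scale=1.3, size=200)

mean = sample.mean()
# Standard error of the mean: sample std (ddof=1) over sqrt(n)
sem = sample.std(ddof=1) / np.sqrt(len(sample))

# First positional argument is the confidence level as a fraction, not 95
low, high = stats.norm.interval(0.95, loc=mean, scale=sem)

# Same interval by hand: mean +/- z * sem with z the 97.5th percentile
z = stats.norm.ppf(0.975)
manual = (mean - z * sem, mean + z * sem)

print("Interval:", (low, high))
print("Manual:  ", manual)
```

The confidence level is passed positionally here because the keyword name for that argument has differed between SciPy releases.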

### Question: Are high-quality weed prices in Jan 2014 significantly higher than in Jan 2015?

* Make two `numpy` arrays by subsetting the original DataFrame: the arrays should list the `HighQ` entries for California in January of 2014 and 2015, respectively.

In [53]:
#Get the data
#weed_ca_jan2014
weed_ca_jan2014 = weed_ca_2014 [(weed_ca_2014.State=="California") & (weed_ca_2014.year==2014) & (weed_ca_2014.month==1)]
print(weed_ca_jan2014)


            State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN       date  \
4      California  248.78   12096  193.56  12812  192.92    778 2014-01-01   
769    California  248.67   12125  193.56  12836  192.80    779 2014-01-02   
1483   California  248.67   12141  193.57  12853  192.67    782 2014-01-03   
2248   California  248.65   12155  193.59  12884  192.67    782 2014-01-04   
3013   California  248.68   12176  193.63  12902  192.67    782 2014-01-05   
3778   California  248.68   12189  193.57  12918  192.79    783 2014-01-06   
4543   California  248.64   12212  193.51  12945  192.91    784 2014-01-07   
5308   California  248.63   12243  193.53  12955  192.91    784 2014-01-08   
6073   California  248.58   12256  193.54  12980  192.89    785 2014-01-09   
6838   California  248.56   12270  193.50  13001  192.67    786 2014-01-10   
7654   California  248.54   12278  193.50  13013  192.67    786 2014-01-11   
8470   California  248.47   12289  193.51  13025  192.88    788 

In [56]:
weed_ca_2015 = weed_pd[(weed_pd.State=="California") & (weed_pd.year==2015)]

In [57]:
#weed_ca_jan2015 = 
weed_ca_jan2015 = weed_ca_2015 [(weed_ca_2015.State=="California") & (weed_ca_2015.year==2015) & (weed_ca_2015.month==1)]
print(weed_ca_jan2015)

            State   HighQ  HighQN    MedQ  MedQN  LowQ  LowQN       date  \
55     California  243.96   16512  189.35  19151   NaN   1096 2015-01-01   
820    California  243.95   16517  189.34  19160   NaN   1096 2015-01-02   
1534   California  243.93   16530  189.38  19179   NaN   1096 2015-01-03   
2299   California  243.91   16542  189.38  19193   NaN   1099 2015-01-04   
3064   California  243.91   16558  189.39  19222   NaN   1100 2015-01-05   
3829   California  243.83   16571  189.37  19243   NaN   1102 2015-01-06   
4594   California  243.79   16582  189.31  19274   NaN   1103 2015-01-07   
5359   California  243.79   16592  189.26  19296   NaN   1104 2015-01-08   
6124   California  243.72   16606  189.25  19327   NaN   1104 2015-01-09   
6889   California  243.75   16622  189.21  19343   NaN   1104 2015-01-10   
7705   California  243.73   16637  189.19  19358   NaN   1104 2015-01-11   
8521   California  243.68   16646  189.21  19378   NaN   1106 2015-01-12   
9286   Calif

In [59]:
data = pd.DataFrame(weed_ca_jan2014)
display(data)

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01,2014,1
769,California,248.67,12125,193.56,12836,192.8,779,2014-01-02,2014,1
1483,California,248.67,12141,193.57,12853,192.67,782,2014-01-03,2014,1
2248,California,248.65,12155,193.59,12884,192.67,782,2014-01-04,2014,1
3013,California,248.68,12176,193.63,12902,192.67,782,2014-01-05,2014,1
3778,California,248.68,12189,193.57,12918,192.79,783,2014-01-06,2014,1
4543,California,248.64,12212,193.51,12945,192.91,784,2014-01-07,2014,1
5308,California,248.63,12243,193.53,12955,192.91,784,2014-01-08,2014,1
6073,California,248.58,12256,193.54,12980,192.89,785,2014-01-09,2014,1
6838,California,248.56,12270,193.5,13001,192.67,786,2014-01-10,2014,1


In [67]:
# Keep only the HighQ column of each January subset and convert it to a NumPy array
weed_ca_jan2014 = weed_ca_2014.loc[weed_ca_2014.month == 1, "HighQ"].to_numpy()
weed_ca_jan2015 = weed_ca_2015.loc[weed_ca_2015.month == 1, "HighQ"].to_numpy()
print(type(weed_ca_jan2014))

<class 'numpy.ndarray'>

* Find the mean value for each of the arrays.

In [10]:
print("Mean-2014 Jan:", ______ )
print("Mean-2015 Jan:", ______ )

Mean-2014 Jan: 248.445483871
Mean-2015 Jan: 243.602258065


* Calculate the effect size of one year on the mean value of weed prices in California. What difference does one year make in terms of average weed prices?

In [11]:
print("Effect size:", ______________ )

Effect size: 4.84322580645


**Null Hypothesis**: Mean prices in Jan 2014 and Jan 2015 are equal (there is no real difference).

Perform **t-test** and determine the p-value. 

In [26]:
stats.ttest_ind?

In [12]:
stats.ttest_ind(________ , ________ , equal_var=True)

Ttest_indResult(statistic=98.011325238158051, pvalue=6.2979718185084028e-68)

The p-value is the probability of seeing an effect at least this large if the null hypothesis were true. Here, the p-value is almost 0.

*Conclusion*: The price difference is statistically significant. But is a price drop of $4.84 a big deal? The price decreased in 2015 by almost 2%. Always remember to look at the effect size.
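
To see significance and effect size side by side, here is a sketch on synthetic samples. The means, spread, and sample size are assumptions loosely modelled on the January figures, not the real data. It also runs Welch's variant (`equal_var=False`), which does not assume equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic stand-ins for two months of daily prices (31 days each)
jan_a = rng.normal(loc=248.4, scale=0.2, size=31)
jan_b = rng.normal(loc=243.6, scale=0.2, size=31)

# Effect size: raw and relative difference in means
effect = jan_a.mean() - jan_b.mean()
relative = effect / jan_a.mean()

# Student's t-test (equal variances) and Welch's t-test (unequal variances)
student = stats.ttest_ind(jan_a, jan_b, equal_var=True)
welch = stats.ttest_ind(jan_a, jan_b, equal_var=False)

print("Effect size:", effect)
print("Relative change:", relative)
print("Student p-value:", student.pvalue)
print("Welch p-value:", welch.pvalue)
```

A tiny p-value paired with a relative change of about 2% is exactly the situation where statistical significance and practical importance can diverge.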

### t-test Challenge
**Problem** Determine if prices of medium quality weed for Jan 2015 and Feb 2015 are significantly different for New York. 
* _**CoderGirl**: Use what you've learned above to solve this problem._

In [None]:
# Your code here

### Assumption of t-test

One assumption is that the data used came from a normal distribution. 
<br>
There's a [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro-Wilk) to test for normality. Its null hypothesis is that the data are normally distributed, so a p-value below 0.05 is evidence against normality.

* *Are the `weed_ca_jan2015` and `weed_ca_jan2014` data normally distributed?*

In [28]:
stats.shapiro?

In [13]:
stats.shapiro(weed_ca_jan2015)

(0.9469053149223328, 0.12818680703639984)

In [14]:
stats.shapiro(weed_ca_jan2014)

(0.9353488683700562, 0.06141229346394539)
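
`stats.shapiro` returns the test statistic followed by the p-value. A small sketch of how the result might be read, using a hypothetical helper `looks_normal` and synthetic samples (neither is part of the original notebook):

```python
import numpy as np
from scipy import stats

def looks_normal(values, alpha=0.05):
    """Hypothetical helper: True when Shapiro-Wilk does not reject normality."""
    statistic, p_value = stats.shapiro(values)
    return bool(p_value >= alpha)

rng = np.random.default_rng(3)
normal_sample = rng.normal(size=200)        # drawn from a normal distribution
skewed_sample = rng.exponential(size=200)   # strongly right-skewed

print("Normal sample passes:", looks_normal(normal_sample))
print("Skewed sample passes:", looks_normal(skewed_sample))
```

Failing to reject normality is not proof of normality; with small samples the test has limited power.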

### A/B testing

A/B testing compares two versions of something to check which one performs better. E.g., show two variants of the same webpage to different groups of visitors and find which one gives the better conversion rate (or other relevant metric). [wiki](https://en.wikipedia.org/wiki/A/B_testing)
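
One common way to compare two conversion rates is a chi-square test on a 2x2 table. The sketch below uses made-up visitor counts; the numbers and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical counts per variant: [converted, did not convert]
variant_a = [120, 880]   # 1000 visitors
variant_b = [165, 835]   # 1000 visitors

table = np.array([variant_a, variant_b])

# Null hypothesis: conversion is independent of which variant was shown
chi2, p_value, dof, expected = stats.chi2_contingency(table)

rate_a = variant_a[0] / sum(variant_a)
rate_b = variant_b[0] / sum(variant_b)

print("Conversion rate A:", rate_a)
print("Conversion rate B:", rate_b)
print("p-value:", p_value)
```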

**Exercise: Impact of regulation and deregulation.**

* _**CoderGirl**: Use what you've learned above to answer the below two questions._

Information on regulation of Weed in the US by State [wiki](Impact of regulation and deregulation on a couple of states )

1. Alaska legalized it on 4th Nov 2014. Find if prices significantly changed in Dec 2014 compared to Oct 2014. 
2. Maryland decriminalized possessing weed from Oct 1, 2014. Find if prices of weed changed significantly in Oct 2014 compared to Sep 2014

In [None]:
# Your code here

<h2> Something to think about: Which of these gives smaller p-values? </h2>
   
   * Smaller effect size
   * Smaller standard error
   * Smaller sample size
   * Higher variance
   
   **Answer:** 

### Chi-square tests

Chi-square tests are used when the data are frequencies (counts) rather than numerical scores or prices.

The following two tests make use of the chi-square statistic:

1. chi-square test for goodness of fit
2. chi-square test for independence

Chi-square tests are non-parametric: they do not require assumptions about population parameters, and they do not test hypotheses about population parameters.

<h2> Chi-Square test for goodness of fit </h2>

$$ \chi^2 = \sum (O - E)^2/E $$

* O is observed frequency
* E is expected frequency
* $ \chi^2 $ is the chi-square statistic
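
The formula can be checked by hand against `stats.chisquare`. A minimal sketch with hypothetical die-roll counts (the data are invented for the example):

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 120 rolls of a die
observed = np.array([18, 22, 20, 25, 15, 20])
# A fair die would give equal expected counts
expected = np.full(6, observed.sum() / 6)

# Chi-square statistic straight from the formula: sum of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()
# p-value from the chi-square distribution with k - 1 degrees of freedom
p_manual = stats.chi2.sf(chi2_manual, df=len(observed) - 1)

result = stats.chisquare(observed, f_exp=expected)

print("Manual:", chi2_manual, p_manual)
print("SciPy: ", result.statistic, result.pvalue)
```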

Let's take the proportions of people who bought High, Medium and Low quality weed in Jan 2014 as the expected proportions. Find out whether the proportions of people who bought weed in Jan 2015 conformed to that norm.

* Find the total numbers of people who bought High/Med/Low quality weed in January of each year.

In [None]:
weed_jan2014 = weed_pd[(weed_pd.year==2014) & (weed_pd.month==1)][["HighQN", "MedQN", "LowQN"]]
weed_jan2015 = weed_pd[(weed_pd.year==2015) & (weed_pd.month==1)][["HighQN", "MedQN", "LowQN"]]

In [30]:
Expected = np.array(weed_jan2014.apply(sum, axis=0))
Observed = np.array(weed_jan2015.apply(sum, axis=0))

In [18]:
print("Expected:", Expected, "\n" , "Observed:", Observed)

Expected: [2918004 2644757  263958] 
Observed: [4057716 4035049  358088]


* Print the expected and observed proportions of High/Med/Low quality weed bought.

In [32]:
print("Expected:", Expected/np.sum(Expected.astype(float)), "\n" , "Observed:", Observed/np.sum(Observed.astype(float)))

Expected: [0.5007971  0.45390159 0.04530131] 
 Observed: [0.48015461 0.47747239 0.042373  ]


* Do the Chi-Squared test. What is your result? Were the proportions of people buying high quality weed in Jan 2015 different than expected?

In [33]:
stats.chisquare?

In [20]:
stats.chisquare(______ , ______)

Power_divergenceResult(statistic=1209562.2775169075, pvalue=0.0)

*Inference*: We reject the null hypothesis. The proportions in Jan 2015 are different from what was expected.
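
One caveat: a goodness-of-fit test compares proportions, so the expected counts should add up to the same total as the observed counts, and newer SciPy releases may reject inputs whose totals differ. A sketch that rescales the Jan 2014 totals printed above to the Jan 2015 total before testing (the variable names here are illustrative, and this is one possible approach rather than the notebook's own solution):

```python
import numpy as np
from scipy import stats

# Totals printed above: [HighQN, MedQN, LowQN]
expected_2014 = np.array([2918004, 2644757, 263958])
observed_2015 = np.array([4057716, 4035049, 358088])

# Turn 2014 counts into proportions, then scale to the 2015 total
expected_scaled = expected_2014 / expected_2014.sum() * observed_2015.sum()

result = stats.chisquare(observed_2015, f_exp=expected_scaled)

print("Scaled expected:", expected_scaled)
print("Statistic:", result.statistic)
print("p-value:", result.pvalue)
```

With counts in the millions, even small shifts in proportion yield a p-value near zero, so it is worth reading the proportions themselves alongside the test result.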