In [62]:
from scipy.stats import norm, chi2, f
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

### Problem 12-7, page 518
Social networking is becoming more and more popular around the world. Pew Research Center used a survey of adults in several countries to determine the percentage of adults who use social networking sites. Assume that the results for surveys in Great Britain, Israel, Russia, and the United States are as follows.


|Use Social Media| Great Britain |Israel|Russia|USA|
|:--|:--:|:--:|:--:|:--:|
|Yes|344|265|301|500|
|No|456|235|399|500|

a. Conduct a hypothesis test to determine whether the proportion of adults using social networking sites is equal for all four countries. What is the p-value? Using a .05 level of significance, what is your conclusion? (3 points)

b. What are the sample proportions for each of the four countries? Using a .05 level of significance, conduct multiple pairwise comparison tests among the four countries. What is your conclusion? (2 points)

a. Let $p_1, p_2, p_3, p_4$ be the proportions of adults using social networking sites in Great Britain, Israel, Russia and USA for their respective populations.

The hypotheses are stated as follows:

$H_0: p_1 = p_2 = p_3 = p_4$

$H_a$: Not all population proportions are equal

To conduct this hypothesis test, we use the sample of adults taken from each of the four populations.

#### The Observed Frequencies

| Use Social Media       | Great Britain | Israel        | Russia        | USA    | Total  |
|:-----------------------|:-------------:|:-------------:|:-------------:|:------:|:------:|
| Yes                    | 344           | 265           |   301         |  500   | 1410   |
| No                     | 456           | 235           |   399         |  500   | 1590   |
|    Total               | 800           | 500           |   700         | 1000   | 3000   |

#### Expected Frequencies under the Assumption $H_0$ is True
1. $\frac{1410}{3000} = .47$, so 47% of the adults in the combined sample use social networking sites.
2. If $H_0$ is true, then 47% of the adults in each of the four countries are expected to use social networking sites.
3. The expected number of Great Britain adults who use social networking sites is thus $.47 \times 800 = 376$, and the expected number who do not is $800 - 376 = 424$. The expected frequencies for the other countries follow in the same way.


| Use Social Media       | Great Britain | Israel        | Russia        | USA    | Total  |
|:-----------------------|:-------------:|:-------------:|:-------------:|:------:|:------:|
| Yes                    | 376           | 235           |   329         |  470   | 1410   |
| No                     | 424           | 265           |   371         |  530   | 1590   |
|    Total               | 800           | 500           |   700         | 1000   | 3000   |
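
The expected-frequency table above can also be reproduced from the row and column totals. The snippet below is a minimal sketch using NumPy (not imported elsewhere in this notebook, so that is an assumption); each expected count is (row total × column total) / grand total.

```python
import numpy as np

# Observed counts from the table above: rows are Yes/No, columns are the countries
# in the order Great Britain, Israel, Russia, USA.
observed = np.array([[344, 265, 301, 500],
                     [456, 235, 399, 500]])

row_totals = observed.sum(axis=1)   # totals for Yes and No
col_totals = observed.sum(axis=0)   # sample size per country
n_total = observed.sum()            # grand total

# Expected count for each cell under H0 = (row total * column total) / grand total.
expected = np.outer(row_totals, col_totals) / n_total
print(expected)
```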


In [63]:
observed = [344, 265, 301, 500, 456, 235, 399, 500]
expected = [376, 235, 329, 470, 424, 265, 371, 530]
n = 4  # Number of countries (columns)
r = 2  # Number of response categories (Yes/No)
df = (n-1)*(r-1)  # Degrees of freedom: (n-1)*(r-1) = 3
alpha = 0.05

chi_square = 0

for i in range(len(observed)):
    chi_square += (observed[i]-expected[i])**2./expected[i]

# For reference, the critical value at alpha = 0.05 is chi2.ppf(0.95, df), about 7.815

# Upper-tail p-value of the chi-square statistic
p_value = 1 - chi2.cdf(chi_square, df)

print("p-value:", p_value)
if p_value > alpha:
    print("p-value > alpha. So, fail to reject H0.")
    print("Conclusion: No significant evidence that the proportion of adults using social networking sites differs among the four countries")
else:
    print("p-value <= alpha. So, Reject H0.")
    print("Conclusion: Proportion of adults using social networking sites is NOT equal for all four countries")

p-value: 0.000135384658871
p-value <= alpha. So, Reject H0.
Conclusion: Proportion of adults using social networking sites is NOT equal for all four countries
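
As a cross-check on the manual calculation, SciPy's `chi2_contingency` computes the expected frequencies, the test statistic, the degrees of freedom, and the p-value directly from the observed table. This is only a sketch for verification; the variable names are illustrative.

```python
from scipy.stats import chi2_contingency

# Observed counts: rows are Yes/No, columns are Great Britain, Israel, Russia, USA.
observed_table = [[344, 265, 301, 500],
                  [456, 235, 399, 500]]

# correction=False turns off the Yates continuity correction, which SciPy
# would only apply to tables with one degree of freedom anyway.
stat, p_value, dof, expected = chi2_contingency(observed_table, correction=False)

print("chi-square:", stat)
print("degrees of freedom:", dof)
print("p-value:", p_value)
```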


In [64]:
print("Observed Sample Proportion of Great Britain adults using Social Network [P(GreatBritain)]:", 344/800)
print("Observed Sample Proportion of Israel adults using Social Network [P(Israel)]:", 265/500)
print("Observed Sample Proportion of Russia adults using Social Network [P(Russia)]:", 301/700)
print("Observed Sample Proportion of USA adults using Social Network [P(USA)]:", 500/1000)
print()
print("As observed, NOT all proportions are equal")

Observed Sample Proportion of Great Britain adults using Social Network [P(GreatBritain)]: 0.43
Observed Sample Proportion of Israel adults using Social Network [P(Israel)]: 0.53
Observed Sample Proportion of Russia adults using Social Network [P(Russia)]: 0.43
Observed Sample Proportion of USA adults using Social Network [P(USA)]: 0.5

As observed, NOT all proportions are equal
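
Part b also asks for pairwise comparison tests, which the cell above does not carry out. One common approach is the Marascuilo procedure; the sketch below assumes that this is the intended method (the notebook does not say). For each pair of countries it compares $|\bar{p}_i - \bar{p}_j|$ with the critical value $\sqrt{\chi^2_{\alpha}}\sqrt{\bar{p}_i(1-\bar{p}_i)/n_i + \bar{p}_j(1-\bar{p}_j)/n_j}$, where $\chi^2_{\alpha}$ has $k - 1 = 3$ degrees of freedom.

```python
from itertools import combinations
from math import sqrt
from scipy.stats import chi2

# Counts of "Yes" responses and sample sizes, taken from the observed table.
yes = {"Great Britain": 344, "Israel": 265, "Russia": 301, "USA": 500}
size = {"Great Britain": 800, "Israel": 500, "Russia": 700, "USA": 1000}
alpha = 0.05
k = len(yes)  # number of populations

p_hat = {c: yes[c] / size[c] for c in yes}   # sample proportions
chi2_crit = chi2.ppf(1 - alpha, k - 1)       # chi-square critical value, df = k - 1

significant = []
for a, b in combinations(yes, 2):
    diff = abs(p_hat[a] - p_hat[b])
    # Marascuilo critical value for this pair
    cv = sqrt(chi2_crit) * sqrt(p_hat[a] * (1 - p_hat[a]) / size[a]
                                + p_hat[b] * (1 - p_hat[b]) / size[b])
    if diff > cv:
        significant.append((a, b))
    print(f"{a} vs {b}: |diff| = {diff:.4f}, critical value = {cv:.4f}, "
          f"significant = {diff > cv}")
```

Under these assumptions, the pairs flagged as significantly different are Great Britain vs. Israel, Great Britain vs. USA, Israel vs. Russia, and Russia vs. USA, while Great Britain vs. Russia and Israel vs. USA are not. The margin for Russia vs. USA is narrow, so that result is sensitive to the choice of procedure.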


### Problem 12-11, page 524

A Bloomberg Businessweek subscriber study asked, “In the past 12 months, when traveling for business, what type of airline ticket did you purchase most often?” A second question asked if the type of airline ticket purchased most often was for domestic or international travel. Sample data obtained are shown in the following table.

|Type of Ticket| Domestic Flight |International Flight|
|:--|:--:|:--:|
|First class|29|22|
|Business class|95|121|
|Economy class|518|135|

a. Using a .05 level of significance, is the type of ticket purchased independent of the type of flight? Compute the test statistic, the critical value of the test statistic, and the p-value. (3 points)

b. What is your conclusion? Discuss any dependence that exists between the type of ticket and type of flight. (2 points)

$H_0$: Type of ticket purchased is independent of the type of flight

$H_a$: Type of ticket purchased is not independent of the type of flight

#### Observed Frequency Table
|                     |   Domestic Flight   | International Flight | Total  | 
|:--------------------|:-------------------:|:--------------------:|:------:|
| First class         | 29                  | 22                   |   51   |
| Business class      | 95                  | 121                  |   216  |
| Economy class       | 518                 | 135                  |   653  |
|    Total            | 642                 | 278                  |   920  |

#### Connection with Probability Theory
* Suppose we define the following events:
    * $T_i$ = Type of Ticket i, where i = 1 (First class), 2 (Business class), or 3 (Economy class)
    * $F_j$ = Type of Flight j, where j = 1 (Domestic) or 2 (International)
* If Type of Ticket and Flight are independent, then P($T_iF_j$) = P($T_i$)P($F_j$) for all i and j.
* Alternatively, P($T_i$) =  P($T_i|F_j$). That is, we are testing whether P($T_i$) =  P($T_i$|Domestic) = P($T_i$|International) for all i.

* Joint Probability Table (__Observed Relative Frequency Table__)

|                     |   Domestic Flight   | International Flight | Marginal Prob       | 
|:--------------------|:-------------------:|:--------------------:|:-------------------:|
| First class         | 29/920 = 0.0315     | 22/920 = 0.0239      |   51/920 = 0.0554   |
| Business class      | 95/920 = 0.1033     | 121/920 = 0.1315     |   216/920 = 0.2348  |
| Economy class       | 518/920 = 0.5630    | 135/920 = 0.1468     |   653/920 = 0.7098  |
| Marginal Prob       | 642/920 = 0.6978    | 278/920 = 0.3022     |   920/920 = 1       |

* If independent, Joint Probability Table (__Expected Relative Frequency Table__)

|                     |   Domestic Flight       | International Flight   | Marginal Prob    | 
|:--------------------|:-----------------------:|:----------------------:|:----------------:|
| First class         | .6978(.0554) = 3.87%    | .3022(.0554) = 1.67%   |   5.54%          |
| Business class      | .6978(.2348) = 16.38%   | .3022(.2348) = 7.10%   |   23.48%         |
| Economy class       | .6978(.7098) = 49.53%   | .3022(.7098) = 21.45%  |   70.98%         |
| Marginal Prob       | 69.78%                  | 30.22%                 |   100%           |


* If independent, __Expected Frequency Table__


|                     |  Domestic Flight    | International Flight | Total            | 
|:--------------------|:-------------------:|:--------------------:|:----------------:|
| First class         | 920(.0387) = 35.6   | 920(.0167) = 15.4    |   51             |
| Business class      | 920(.1638) = 150.7  | 920(.071) = 65.3     |   216            |
| Economy class       | 920(.4953) = 455.7  | 920(.2145) = 197.3   |   653            |
| Total               | 642                 | 278                  |   920            |



In [65]:
observed = [29, 22, 95, 121, 518, 135]
expected = [35.6, 15.4, 150.7, 65.3, 455.7, 197.3]
n = 2 #Number of columns/ type of flight
r = 3 #Number of rows/ types of tickets
df = (n-1)*(r-1) #Degree of freedom of (n-1)*(r-1); df = 2 in this case
alpha = 0.05

chi_square = sum([(x-y)**2./y for x, y in zip(observed, expected)])
crit = chi2.ppf(q = .95,    # Find the critical chi square value for 5% level of significance
               df = 2)    
p_value = 1 - chi2.cdf(x=chi_square, df = 2)
print(chi_square, crit, p_value)

100.33991894760526 5.99146454711 0.0
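
The printed p-value of `0.0` is a floating-point artifact: `1 - chi2.cdf(...)` rounds to zero when the upper-tail probability is very small. The sketch below cross-checks the test with `chi2_contingency` and uses the survival function `chi2.sf`, which evaluates the upper tail directly. Because it uses unrounded expected frequencies, its statistic differs slightly from the value printed above.

```python
from scipy.stats import chi2, chi2_contingency

# Observed counts: rows are First, Business, Economy class;
# columns are Domestic and International flights.
observed_table = [[29, 22],
                  [95, 121],
                  [518, 135]]

stat, p_value, dof, expected = chi2_contingency(observed_table, correction=False)

# chi2.sf(x, df) is the upper-tail probability P(X > x); it avoids the
# cancellation in 1 - cdf that produced 0.0 above.
p_tail = chi2.sf(stat, dof)

print("chi-square:", stat)
print("degrees of freedom:", dof)
print("p-value:", p_tail)
```

The p-value is tiny but not exactly zero, so the decision to reject $H_0$ is unchanged.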


In [66]:
print("p-value:", p_value)
if p_value > alpha:
    print("p-value > alpha. So, fail to reject H0.")
    print("Conclusion: No significant evidence that the type of ticket purchased depends on the type of flight")
else:
    print("p-value <= alpha. So, Reject H0.")
    print("Conclusion: Type of ticket purchased is NOT independent of the type of flight")

p-value: 0.0
p-value <= alpha. So, Reject H0.
Conclusion: Type of ticket purchased is NOT independent of the type of flight
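
Part b asks for a discussion of the dependence. One way to see it is to compare the distribution of ticket types within each flight type (column percentages). The sketch below computes these from the observed table; the variable names are illustrative.

```python
import pandas as pd

# Observed counts from the frequency table above.
observed = pd.DataFrame(
    {"Domestic": [29, 95, 518], "International": [22, 121, 135]},
    index=["First class", "Business class", "Economy class"],
)

# Divide each column by its total to get the percentage of each ticket type
# among domestic flights and among international flights.
col_pct = observed / observed.sum() * 100
print(col_pct.round(1))
```

Economy class accounts for roughly 81% of domestic tickets but only about 49% of international tickets, while business class rises from about 15% on domestic flights to about 44% on international flights. First class is also somewhat more common on international flights (about 8% versus 5%).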


### Problem 12-23, page 535

The Wall Street Journal’s Shareholder Scoreboard tracks the performance of 1000 major U.S. companies. The performance of each company is rated based on the annual total return, including stock price changes and the reinvestment of dividends. Ratings are assigned by dividing all 1000 companies into five groups from A (top 20%), B (next 20%), to E (bottom 20%). Shown here are the one-year ratings for a sample of 60 of the largest companies. Do the largest companies differ in performance from the performance of the 1000 companies in the Shareholder Scoreboard? Use $\alpha$ = .05.

|A|B|C|D|E|
|:-|-|-|-|-|
|5|8|15|20|12|

a. Clearly state your hypotheses. (1 point)

b. Compute the test statistic, the critical value of test statistic, and the p value given alpha = 0.05. (1 point)

c. Draw your conclusion and provide a practical interpretation. (1 point)

Observed sample proportions: [5/60, 8/60, 15/60, 20/60, 12/60] = [.0833, .1333, .25, .3333, .2]

a. Hypotheses

$H_0$: The largest companies follow the same rating distribution as the 1000 Scoreboard companies, i.e., $p_A = p_B = p_C = p_D = p_E = 0.20$

$H_a$: Not all of the proportions equal 0.20

#### The Observed Frequencies

| A | B | C | D | E |
|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| 5             | 8             |   15          |  20           |  12           |

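#### The Expected Frequencies

Under $H_0$, each rating group is expected to contain 20% of the 60 sampled companies: $60 \times 0.20 = 12$.
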
| A | B | C | D | E |
|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| 12  | 12  |   12         | 12   | 12 |


In [67]:
observed = [5, 8, 15, 20, 12]
expected = [12, 12, 12, 12, 12]
n = 5
df = n-1
alpha = 0.05

chi_square = 0

for i in range(len(observed)):
    chi_square += (observed[i]-expected[i])**2./expected[i]

crit = chi2.ppf(0.95, df) # Critical value for a 5% level of significance

p_value = 1 - chi2.cdf(chi_square, df)

print(chi_square, crit, p_value)

11.5 9.48772903678 0.0214837703764
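
As a cross-check, SciPy's `chisquare` performs the same goodness-of-fit test in one call. This is a minimal sketch; the variable names are illustrative.

```python
from scipy.stats import chisquare

observed = [5, 8, 15, 20, 12]

# Under H0 each of the five rating groups holds 20% of the sampled companies.
expected = [sum(observed) / 5] * 5

# chisquare uses len(observed) - 1 degrees of freedom by default.
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print("chi-square:", stat)
print("p-value:", p_value)
```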


In [68]:
print("p-value:", p_value)
if p_value > alpha:
    print("p-value > alpha. So, fail to reject H0.")
    print("Conclusion: No significant evidence that the largest companies differ in performance from the 1000 Scoreboard companies")
else:
    print("p-value <= alpha. So, Reject H0.")
    print("Conclusion: Population proportions for large companies differ")

p-value: 0.0214837703764
p-value <= alpha. So, Reject H0.
Conclusion: Population proportions for large companies differ


### Problem 13-9, page 561

To study the effect of temperature on yield in a chemical process, five batches were produced at each of three temperature levels. The results follow. Construct an analysis of variance table. Use a .05 level of significance to test whether the temperature level has an effect on the mean yield of the process.

|50C|60C|70C|
|:-:|:-:|:-:|
|34|30|23|
|24|31|28|
|36|34|28|
|39|23|30|
|32|27|31|

a. Clearly state your hypotheses. (1 point) 

b. Use statsmodels to generate the ANOVA table. (3 points)

c. Draw your conclusion and provide its practical interpretation. (1 point)

a. Hypotheses

$H_0$: All three population means are equal, i.e., $\mu_{50} = \mu_{60} = \mu_{70}$

$H_a$: Not all three population means are equal

In [69]:
labels = ['chemical_yield', 'temperature']
data = [(34, '50C'), (24, '50C'), (36, '50C'), (39, '50C'), (32, '50C'), (30, '60C'), (31, '60C'), (34, '60C'), (23, '60C'), (27, '60C'), (23, '70C'), (28, '70C'), (28, '70C'), (30, '70C'), (31, '70C')]

dframe = pd.DataFrame.from_records(data, columns = labels)    # Read data into DataFrame

model = ols('chemical_yield ~ temperature', data=dframe).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

             sum_sq    df         F    PR(>F)
temperature    70.0   2.0  1.779661  0.210447
Residual      236.0  12.0       NaN       NaN


In [70]:
alpha = 0.05
p_value = 1-f.cdf(1.779661, 2, 12) # Upper-tail test: P(F > 1.779661) with (2, 12) degrees of freedom
print("p-value:", p_value)
if p_value > alpha:
    print("p-value > alpha. So, fail to reject H0.")
    print("Conclusion: There is insufficient evidence that temperature level affects the mean yield")
else:
    print("p-value <= alpha. So, Reject H0.")
    print("Conclusion: Not all three population means are equal")

p-value: 0.210447351562
p-value > alpha. So, fail to reject H0.
Conclusion: There is insufficient evidence that temperature level affects the mean yield
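
As a cross-check on the statsmodels ANOVA table, SciPy's `f_oneway` runs the same one-way ANOVA directly on the three groups. This is a minimal sketch; the group variable names are illustrative.

```python
from scipy.stats import f_oneway

# Yields for the five batches at each temperature level, from the data table.
yield_50c = [34, 24, 36, 39, 32]
yield_60c = [30, 31, 34, 23, 27]
yield_70c = [23, 28, 28, 30, 31]

# f_oneway returns the F statistic and its upper-tail p-value.
f_stat, p_value = f_oneway(yield_50c, yield_60c, yield_70c)
print("F:", f_stat)
print("p-value:", p_value)
```

The F statistic and p-value agree with the `temperature` row of the ANOVA table (F ≈ 1.78, p ≈ 0.21).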
