# Hypothesis Test - I

## Concept session

## Demo - 6.3: Right Tailed Hypothesis Test

According to the US Department of Agriculture, the average size of farms increased in 2019 compared to 2018. In 2018, the mean farm size was 445.51 acres; in 2019, the average size was 446.92 acres. Suppose an agribusiness researcher believes that the average farm size in 2019 is higher than the 2018 average.

To test this notion, data analyst Maya randomly selected 35 states and ascertained the average farm size in each state from county records.

Use a 5% level of significance to test her hypothesis. Consider that the number of acres per farm is normally distributed in the population.

In [1]:
import numpy as np
import pandas as pd
import statistics as st
import math
from scipy.stats import norm


### Establish the null and alternate hypothesis

In [None]:
H0: the average size of a U.S. farm is equal to 442.6 acres in 2019
Ha: the average size of a U.S. farm is more than 442.6 acres in 2019


H0: μ=442.6 acres
Ha: μ>442.6 acres

### Read data from the source

In [4]:
farm_df=pd.read_csv("DS1_C5_S6_Hypothesis_I_Concept_FarmSize_Data.csv")
farm_df.head()

Unnamed: 0,State,2018_Number_of_farms,2019_Number_of_farms,"2018_Land_in_farms(in1,000acres)","2019_Land_in_farms(in1,000acres)",2018_Average_farm_size(acres),2019_Average_farm_size(acres)
0,Alabama,39700,38800,8500,8300,214,214
1,Alaska,1000,1050,850,850,850,810
2,Arizona,19200,19000,26200,26200,1365,1379
3,Arkansas,42500,42300,13900,14000,327,331
4,California,69400,69900,24300,24300,350,348


In [7]:
farm_df.columns

Index(['State', '2018_Number_of_farms', '2019_Number_of_farms',
       '2018_Land_in_farms(in1,000acres)', '2019_Land_in_farms(in1,000acres)',
       ' 2018_Average_farm_size(acres)', ' 2019_Average_farm_size(acres)'],
      dtype='object')

In [11]:
farm_df[" 2018_Average_farm_size(acres)"].mean()

442.6

In [14]:
farm_df[" 2019_Average_farm_size(acres)"].mean()

444.0

In [15]:
farm_df[" 2018_Average_farm_size(acres)"].std()

462.496993922939

In [16]:
farm_df[" 2019_Average_farm_size(acres)"].std()

463.4576791986117

### Determine the appropriate statistical test

Population Mean(2018)=442.6
Population SD(2018)=462.496993922939

Population Mean(2019)=444.0
Population SD(2019)=463.4576791986117

Sample size=35

We know the population mean and standard deviation, and the sample size is greater than 30,
so we can test the hypothesis about the population mean using the z-statistic (the population standard deviation is known).
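
For reference, the one-sample z-statistic with a known population standard deviation takes the standard form below. Here $\bar{x}$ is the sample mean, $\mu_0$ the hypothesized population mean, $\sigma$ the population standard deviation, and $n$ the sample size; these symbols are introduced here for illustration and are not named in the notebook cells.

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
$$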

### Set the value of alpha

### Establish the decision rule
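
A minimal sketch of the decision rule for a right-tailed test at the 5% level stated in the problem. The variable names are illustrative; `norm.ppf` is the inverse CDF of the standard normal distribution.

```python
from scipy.stats import norm

alpha = 0.05  # 5% level of significance, as given in the problem statement

# Right-tailed test: the rejection region is the upper tail of the z distribution.
z_critical = norm.ppf(1 - alpha)
print("z-critical value:", z_critical)

# Decision rule (either check leads to the same decision):
#   i.  reject H0 if p-value < alpha
#   ii. reject H0 if z-statistic > z_critical
```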

### Gather the sample data

In [18]:
farm_df_sample= farm_df.sample(n=35,
                               random_state=1)

farm_df_sample

Unnamed: 0,State,2018_Number_of_farms,2019_Number_of_farms,"2018_Land_in_farms(in1,000acres)","2019_Land_in_farms(in1,000acres)",2018_Average_farm_size(acres),2019_Average_farm_size(acres)
27,Nevada,3400,3350,6100,6100,1794,1821
35,Oklahoma,77300,77300,34200,34400,442,445
40,South Dakota,29600,29600,43200,43200,459,459
38,Rhode Island,1100,1100,60,60,55,55
2,Arizona,19200,19000,26200,26200,1365,1379
3,Arkansas,42500,42300,13900,14000,327,331
48,Wisconsin,64800,64900,14300,14300,223,220
29,New Jersey,9900,9900,750,750,76,76
46,Washington,35700,35600,14700,14600,412,410
31,New York,33400,33400,6900,6900,207,207


### Calculate sample mean

In [25]:
#sample mean:


sample_mean=farm_df_sample[' 2019_Average_farm_size(acres)'].mean()

print("Sample Mean :",sample_mean)


#sample standard deviation:


sample_std=farm_df_sample[' 2019_Average_farm_size(acres)'].std()

print("Sample Standard Deviation :",sample_std)

Sample Mean : 421.4
Sample Standard Deviation : 438.62426638163043


### Analyze the data

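To find the p-value associated with a z-score in Python, we can use the scipy.stats.norm.sf() function (the survival function, 1 - CDF). For a right-tailed test, the p-value is the area to the right of the z-statistic, so the z-score is passed without abs():

scipy.stats.norm.sf(z)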
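
The analysis step could be sketched as follows, using only figures reported earlier in this demo. One assumption is flagged loudly: the 2019 population standard deviation is used as sigma because the sample is drawn from the 2019 column; the notebook itself does not say which of the two standard deviations to use.

```python
import math
from scipy.stats import norm

# Figures taken from the outputs above
pop_mean_h0 = 442.6           # hypothesized mean under H0 (2018 average in the data)
pop_sd = 463.4576791986117    # ASSUMPTION: 2019 population SD used as sigma
sample_mean = 421.4           # mean of the 35 sampled states (2019 column)
n = 35
alpha = 0.05

standard_error = pop_sd / math.sqrt(n)
z_statistic = (sample_mean - pop_mean_h0) / standard_error

# Right-tailed test: p-value is the area to the right of z (no abs()).
p_value = norm.sf(z_statistic)
z_critical = norm.ppf(1 - alpha)

print("z statistic:", z_statistic)
print("p-value:", p_value)
print("z-critical value:", z_critical)
print("Reject H0:", p_value < alpha)
```

With these inputs the sample mean lies below the hypothesized mean, so the z-statistic is negative and this sketch does not reject H0 at the 5% level.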

### Reach a statistical conclusion

### Make a business decision

## Demo - 6.4: Left tailed Hypothesis Test

A survey was conducted among managing directors of manufacturing plants in Glasgow, with responses rated on a 1-5 Likert scale. The mean survey response was 4.30, with a population standard deviation of 0.574. U.S. supply chain analysts believe that American manufacturing managers would not rate as highly, and they conduct a hypothesis test to check their theory. Determine whether U.S. managers rate significantly lower than the mean of 4.30 ascertained in the U.K., using a 10% level of significance. Use the following ratings from U.S. managers for the test.
![](rating.png)


### Establish the null and alternate hypothesis

H0: the average rating is equal to 4.3
Ha: the average rating is less than 4.3

### Determine the appropriate statistical test

Our known parameter/statistic:
    
Population mean (µ) = 4.3
Population standard deviation (σ) = 0.574
Sample size(n) = 32, which is greater than 30

Population data is normally distributed.

We will test the hypothesis about the population mean using the z-statistic
(the standard deviation of the population is known).

### Set the value of alpha

It is given that a 10% level of significance is to be used to test the hypothesis.
alpha (α) = 0.10

### Establish the decision rule

i. If p-value < α : Rejection of Null Hypothesis(H0)
ii. If z-statistic < z-critical (left-tailed test) : Rejection of Null Hypothesis(H0)

### Calculate sample mean

In [26]:
sample_ratings=[3,4,5,5,4,5,5,4,4,4,4,4,4,4,4,5,4,4,4,3,4,4,4,3,5,4,4,5,4,4,4,5]

sample_mean=st.mean(sample_ratings)

print("Sample Mean =",sample_mean)

Sample Mean = 4.15625


### Analyze the data

In [30]:
alpha=0.1
p_mean =4.3
p_sd=0.574
n=len(sample_ratings)

z_statistics = (sample_mean-p_mean) / (p_sd/math.sqrt(n))
print("The Z statistics is ", z_statistics)

# Left-tailed test: the p-value is the area to the left of the z-statistic.
p_value = norm.cdf(z_statistics)
print("The p_value is "+str(p_value))

z_critical = norm.ppf(alpha)
print("The z-critical value is "+str(z_critical))

The Z statistics is  -1.4166773490671232
The p_value is 0.07828864121333116
The z-critical value is -1.2815515655446004


In [29]:
print(p_value<alpha)
print(z_statistics<z_critical)

True
True


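### Reach a statistical conclusion

Both checks return True: the p-value (0.0783) is less than α (0.10), and the z-statistic (-1.4167) is less than the z-critical value (-1.2816). We therefore reject the null hypothesis at the 10% level of significance.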

### Make the business decision

## Demo - 6.5: Hypothesis Test with Two Samples

A random sample of the annual salary of 33 advertising managers is selected from the United States. The advertising managers are contacted by telephone and asked about their annual salary. A similar random sample was selected for 35 sales managers.

Christopher, a business analyst, tests whether there is a difference between the average wage of an advertising manager and a sales manager. Use the 5% significance level for the test.

### Read data from the source

In [2]:
wages_df=pd.read_csv("DS1_C5_S6_Hypothesis_I_Concept_Wages_Data.csv")
wages_df

Unnamed: 0,Advertising Manager,Sales Manager
0,74.256,71.492
1,96.234,67.814
2,89.807,56.47
3,93.261,72.401
4,103.03,71.804
5,74.195,46.394
6,75.932,54.449
7,80.742,59.676
8,39.672,63.369
9,45.652,43.649


In [3]:
x_x=wages_df.mean()
x_x

Advertising Manager    70.278303
Sales Manager          61.524686
dtype: float64

### Calculate Sample statistic

In [4]:
sample_adv_m=wages_df.iloc[:33,0]
sample_adv_m

0      74.256
1      96.234
2      89.807
3      93.261
4     103.030
5      74.195
6      75.932
7      80.742
8      39.672
9      45.652
10     93.083
11     63.384
12     57.791
13     65.145
14     96.767
15     77.242
16     67.056
17     64.276
18     74.194
19     65.360
20     73.904
21     54.270
22     59.045
23     68.508
24     71.115
25     67.574
26     59.621
27     62.483
28     69.319
29     35.394
30     86.741
31     57.351
32     56.780
Name: Advertising Manager, dtype: float64

In [5]:
sample_adv_mean=sample_adv_m.mean()
sample_adv_mean

70.27830303030304

In [6]:
sample_adv_std=sample_adv_m.std()
sample_adv_std

16.179630804434417

In [7]:
sample_adv_var=sample_adv_std**2
sample_adv_var

261.7804529678031

In [8]:
sample_sales_m=wages_df["Sales Manager"]
print(sample_sales_m)
#l=sample_sales_m.len()
l=len(sample_sales_m)
print("length of sales manager column:",l)

0     71.492
1     67.814
2     56.470
3     72.401
4     71.804
5     46.394
6     54.449
7     59.676
8     63.369
9     43.649
10    63.508
11    58.653
12    71.351
13    72.790
14    59.505
15    37.386
16    67.160
17    83.849
18    42.494
19    54.335
20    66.035
21    77.136
22    61.261
23    66.359
24    60.053
25    48.036
26    73.065
27    61.254
28    99.198
29    37.194
30    63.362
31    57.828
32    55.052
33    69.962
34    39.020
Name: Sales Manager, dtype: float64
length of sales manager column: 35


In [9]:
sample_sales_mean=sample_sales_m.mean()
sample_sales_mean

61.52468571428572

In [10]:
sample_sales_std=sample_sales_m.std()
sample_sales_std

13.298461504938054

In [11]:
sample_sales_var=sample_sales_std**2
sample_sales_var

176.84907839831928

### Establish the null and alternate hypothesis

### Determine the appropriate statistical test
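
One common choice for two independent samples of this size is the two-sample z-statistic shown below. This is a sketch under an assumption not stated in the notebook: with both samples larger than 30, the sample variances $s_1^2$ and $s_2^2$ stand in for the unknown population variances. Under the null hypothesis of no difference, $\mu_1 - \mu_2 = 0$.

$$
z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
$$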

### Set the value of alpha

In [None]:
 #

### Establish the decision rule

### Construct a 95% confidence interval to estimate the difference in the mean between the two departments.
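
A minimal sketch of the 95% confidence interval for the difference in means, built from the sample statistics computed above. It assumes a large-sample z interval in which the sample variances substitute for the population variances; the variable names are illustrative.

```python
import math
from scipy.stats import norm

# Sample statistics from the cells above
mean_adv, var_adv, n_adv = 70.27830303030304, 261.7804529678031, 33
mean_sales, var_sales, n_sales = 61.52468571428572, 176.84907839831928, 35
confidence = 0.95

diff = mean_adv - mean_sales
# ASSUMPTION: large-sample interval, sample variances used in place of sigma^2
std_error = math.sqrt(var_adv / n_adv + var_sales / n_sales)
z_half = norm.ppf(1 - (1 - confidence) / 2)   # about 1.96 for 95%

ci_lower = diff - z_half * std_error
ci_upper = diff + z_half * std_error
print("Difference in sample means:", diff)
print("95% confidence interval:", (ci_lower, ci_upper))
```

Under these assumptions the interval lies entirely above zero, which is consistent with a difference between the two group means.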

### Analyze the data
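
The two-sample analysis could be sketched as below. Two assumptions are made here and are not stated in the notebook: the test is two-tailed (the problem asks whether there is "a difference"), and the sample variances are used in place of the population variances because both samples exceed 30 observations.

```python
import math
from scipy.stats import norm

# Sample statistics from the cells above
sample_adv_mean, sample_adv_var, n_adv = 70.27830303030304, 261.7804529678031, 33
sample_sales_mean, sample_sales_var, n_sales = 61.52468571428572, 176.84907839831928, 35
alpha = 0.05

# H0: mu_adv - mu_sales = 0    Ha: mu_adv - mu_sales != 0 (two-tailed)
# ASSUMPTION: sample variances stand in for the population variances
se_diff = math.sqrt(sample_adv_var / n_adv + sample_sales_var / n_sales)
z_stat = (sample_adv_mean - sample_sales_mean) / se_diff

p_val = 2 * norm.sf(abs(z_stat))      # both tails
z_crit = norm.ppf(1 - alpha / 2)      # upper critical value

print("z statistic:", z_stat)
print("p-value:", p_val)
print("z-critical values: +/-", z_crit)
print("Reject H0:", p_val < alpha)
```

With the statistics above, the z-statistic exceeds the upper critical value, so this sketch rejects H0 at the 5% level.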

### Reach a statistical conclusion

### Make a business decision

# Learning Consolidation

## Hypothesis Test with Type II Error

The New York Stock Exchange recently reported that the average age of a female shareholder is 44 years. Evan, a stock exchange broker, compares this figure with a randomly selected sample of 68 female shareholders from Chicago. Suppose the average age of the shareholders in the sample is 45.1 years, and the population standard deviation is 8.7 years.

Test whether Evan's sample data differ significantly enough from the 44-year figure released by the New York Stock Exchange to declare that Chicago female shareholders differ in age from female shareholders in general. Use alpha=0.05. If no significant difference is found, what is the probability of committing a Type II error if the average age of a female Chicago shareholder is actually 45 years?

### Establish the null and alternate hypothesis

### Determine the appropriate statistical test

### Set the value of alpha

### Establish the decision rule

### Analyze the data

In [None]:
# We have used,
# z_statistic1 = (45-p_mean) / (p_sd/math.sqrt(n))
# 1-norm(loc = 44 , scale = 1).cdf(45)
# Here, loc represents the population mean value.
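
A minimal sketch of the test itself, using the figures from the problem statement. It assumes a two-tailed z-test, since the question asks whether Chicago shareholders are "different" in age; the variable names are illustrative.

```python
import math
from scipy.stats import norm

p_mean = 44.0        # reported population mean age (H0)
p_sd = 8.7           # population standard deviation
n = 68               # sample size
sample_mean = 45.1   # sample mean age
alpha = 0.05

std_err = p_sd / math.sqrt(n)
z_statistic = (sample_mean - p_mean) / std_err

# ASSUMPTION: two-tailed test (H0: mu = 44, Ha: mu != 44)
p_value = 2 * norm.sf(abs(z_statistic))
z_critical = norm.ppf(1 - alpha / 2)

print("z statistic:", z_statistic)
print("p-value:", p_value)
print("z-critical values: +/-", z_critical)
print("Reject H0:", abs(z_statistic) > z_critical)
```

Here the z-statistic falls inside the non-rejection region, so the sketch fails to reject H0, which is the situation in which a Type II error becomes possible.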

### Reach a statistical conclusion

### Make a business decision

### Analyze the data when the average age of a female shareholder is 45

In [None]:
#jupyter symbols...

type\mu =
type\sigma =
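
The Type II error probability could be sketched as follows. This uses the standard approach of finding the non-rejection region for the sample mean under H0 and then measuring how much of the sampling distribution centred at the alternative mean of 45 falls inside that region. It assumes the two-tailed test at alpha=0.05, and it is not necessarily the method hinted at in the comments above.

```python
import math
from scipy.stats import norm

mu_0 = 44.0      # mean under H0
mu_true = 45.0   # alternative mean given in the problem
sigma = 8.7      # population standard deviation
n = 68
alpha = 0.05

se = sigma / math.sqrt(n)
z_crit = norm.ppf(1 - alpha / 2)

# Non-rejection region for the sample mean under H0 (two-tailed)
lower_bound = mu_0 - z_crit * se
upper_bound = mu_0 + z_crit * se

# beta = P(sample mean falls in the non-rejection region | true mean = 45)
beta = (norm.cdf(upper_bound, loc=mu_true, scale=se)
        - norm.cdf(lower_bound, loc=mu_true, scale=se))

print("Non-rejection region:", (lower_bound, upper_bound))
print("Probability of Type II error (beta):", beta)
print("Power of the test:", 1 - beta)
```

Because the alternative mean of 45 is close to the hypothesized 44 relative to the standard error, beta is large under these assumptions and the test has low power against that alternative.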