# Z Test

**What is a Z-Test?**

- A Z-Test is used to determine if there is a significant difference between:
  - A sample mean and a population mean, or
  - Means of two independent samples (if population variance is known).
- Condition: Population standard deviation (σ) is known.

**Formula for Z-Test**

**Z = (X̄ - μ) / (σ / √n)**

Where:
- X̄ = Sample mean
- μ = Population mean
- σ = Population standard deviation
- n = Sample size
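
The formula above covers the one-sample case. For two independent samples with known population standard deviations, the analogous statistic divides the difference in sample means by √(σ₁²/n₁ + σ₂²/n₂). A minimal sketch follows; the second height sample, the σ values and the helper name `two_sample_z_test` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm

def two_sample_z_test(sample_1, sample_2, sigma_1, sigma_2):
    """Two-tailed Z-test for the difference of two independent means (known sigmas)."""
    n_1, n_2 = len(sample_1), len(sample_2)
    mean_diff = np.mean(sample_1) - np.mean(sample_2)
    # Standard error of the difference between the two sample means
    std_error = np.sqrt(sigma_1**2 / n_1 + sigma_2**2 / n_2)
    z = mean_diff / std_error
    p = 2 * norm.sf(abs(z))   # area in both tails
    return float(z), float(p)

# Hypothetical heights (cm) for two groups; both sigmas assumed known and equal to 3
heights_a = [172, 174, 168, 169, 171, 173, 175, 170, 169, 172]
heights_b = [168, 170, 167, 171, 169, 166, 170, 168, 169, 167]

z, p = two_sample_z_test(heights_a, heights_b, sigma_1=3, sigma_2=3)
print("Z-score:", z)
print("p-value:", p)
```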


- Hypotheses
  - Null Hypothesis (H₀): The mean of the population the sample was drawn from equals the known population mean (μ)
  - Alternative Hypothesis (H₁): That mean differs from μ (two-tailed test)

- Interpretation:
  - Z-Score: How many standard errors (σ / √n) the sample mean is away from the population mean
  - p-value: Probability of observing a result at least as extreme as this one if H₀ is true
  - Decision: Compare p-value with α (0.05)

In [1]:
# Import necessary libraries
import numpy as np                # For numerical operations like mean, array handling
from scipy.stats import norm      # For working with Z-distribution (normal distribution)

# Sample data: heights of 10 individuals
sample = [172, 174, 168, 169, 171, 173, 175, 170, 169, 172]

# Population parameters (assumed known)
population_mean = 170            # Known population mean
population_std = 3               # Known population standard deviation

# Calculate the mean of the sample
sample_mean = np.mean(sample)    # Average of the sample values

# Number of observations in the sample
n = len(sample)                  # Sample size
n                                # Display sample size

10

In [2]:
# Calculate the Z-score for the sample
# Formula: Z = (sample_mean - population_mean) / (population_std / sqrt(n))
z_score = (sample_mean - population_mean) / (population_std / np.sqrt(n))

# Display the calculated Z-score
z_score

np.float64(1.3703203194063098)

In [16]:
# Calculate the p-value for the two-tailed Z-test
# norm.cdf() gives the cumulative probability up to the Z-score
# 1 - norm.cdf(abs(z_score)) is the area in one tail; multiplying by 2 covers both tails
p_value = 2 * (1 - norm.cdf(abs(z_score)))

# Display the p-value
p_value

np.float64(0.17058693287143756)
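
An equivalent way to get the same tail area is the survival function `norm.sf`, which computes 1 - CDF directly and keeps more precision when the tail is very small. The sketch below recomputes the Z-score from the same data and also shows a right-tailed p-value for the alternative H₁: μ > 170; the one-tailed variant is an added illustration, not part of the original analysis.

```python
import numpy as np
from scipy.stats import norm

sample = [172, 174, 168, 169, 171, 173, 175, 170, 169, 172]
population_mean = 170
population_std = 3

z_score = (np.mean(sample) - population_mean) / (population_std / np.sqrt(len(sample)))

# Two-tailed: area beyond |z| in both tails
p_two_tailed = 2 * norm.sf(abs(z_score))

# Right-tailed: area above z only (H1: mean is greater than 170)
p_right_tailed = norm.sf(z_score)

print("Two-tailed p-value:", p_two_tailed)
print("Right-tailed p-value:", p_right_tailed)
```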

In [17]:
# Set the significance level
alpha = 0.05  # Common choice for 5% significance level

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("I will Reject the Null Hypothesis")  # Evidence is strong → sample mean differs from population
else:
    print("I will Fail to Reject the Null Hypothesis")  # Not enough evidence → sample mean not significantly different

I will Fail to Reject the Null Hypothesis
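
The same decision can be reached by comparing |Z| with the critical value for the chosen α instead of comparing the p-value with α. A minimal sketch, reusing the Z-score printed above (variable names are illustrative):

```python
from scipy.stats import norm

alpha = 0.05
z_score = 1.3703203194063098   # Z-score computed in the cells above

# Two-tailed critical value: the point that leaves alpha/2 in each tail
z_critical = norm.ppf(1 - alpha / 2)   # about 1.96 for alpha = 0.05

# Reject H0 only when the Z-score falls beyond the critical value
reject_h0 = abs(z_score) > z_critical

print("Critical value:", z_critical)
print("Reject H0:", reject_h0)
```

Both approaches always agree, because |Z| exceeds the critical value exactly when the two-tailed p-value is below α.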


# T Test

**What is a T-Test?**
- A T-Test is a statistical test used to compare means when the population standard deviation (σ) is unknown and/or the sample size is small.
- It is based on the t-distribution, which is wider and has heavier tails than the normal distribution.
- Use T-Test instead of Z-Test when σ is unknown or sample size < 30.
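
The heavier tails show up in the critical values: for small degrees of freedom the t critical value is noticeably larger than the normal one, and it approaches the normal value as the sample grows. A small sketch (the degrees of freedom chosen here are arbitrary illustrations):

```python
from scipy.stats import norm, t

alpha = 0.05

# Two-tailed critical value of the standard normal distribution
z_critical = norm.ppf(1 - alpha / 2)
print(f"normal:     {z_critical:.3f}")

# Two-tailed critical values of the t-distribution for several degrees of freedom
t_criticals = {}
for dof in (5, 10, 30, 100):
    t_criticals[dof] = t.ppf(1 - alpha / 2, dof)
    print(f"t (df={dof}): {t_criticals[dof]:.3f}")
```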

**Formula for T-Test**

**t = (X̄ - μ) / (s / √n)**

Where:
- X̄ = Sample mean
- μ = Population mean
- s = Sample standard deviation
- n = Sample size
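
When σ is unknown, the sample standard deviation s (computed with n - 1 in the denominator) takes its place, and the statistic is compared against a t-distribution with n - 1 degrees of freedom. A minimal sketch that reuses the height sample from the Z-test section and checks the manual result against `scipy.stats.ttest_1samp`:

```python
import numpy as np
from scipy import stats

sample = [172, 174, 168, 169, 171, 173, 175, 170, 169, 172]
population_mean = 170   # hypothesized mean, as in the Z-test example

n = len(sample)
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)   # ddof=1 gives the sample standard deviation

# Manual one-sample t-statistic and two-tailed p-value
t_manual = (sample_mean - population_mean) / (sample_std / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test via SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=population_mean)

print("Manual:", t_manual, p_manual)
print("SciPy: ", t_scipy, p_scipy)
```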

In [3]:
from scipy import stats  # Import the stats module

# Example groups (two independent samples)
group_A = [85,88,90,92,87,85,89,91,86,88]   # Group A scores
group_B = [82,84,80,83,81,79,78,75,85,83]   # Group B scores

# Independent T-Test (Welch’s T-Test since equal_var=False)
# This compares the means of two independent groups
t_stats, p_value = stats.ttest_ind(group_A, group_B, equal_var=False)

print("T-Test:", t_stats)   # Print the t-statistic value
print("P-Value:", p_value)  # Print the p-value

alpha = 0.05  # Significance level (5%)

# Hypothesis testing decision
if p_value < alpha:
    print("I will Reject Null Hypothesis")   # Means are significantly different
else:
    print("I will Fail to Reject Null Hypothesis")  # No significant difference detected between means

T-Test: 5.756756756756752
P-Value: 2.2637400297756707e-05
I will Reject Null Hypothesis
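
For reference, the Welch statistic reported above can be reproduced by hand: the difference in means is divided by √(s_A²/n_A + s_B²/n_B), and the degrees of freedom come from the Welch-Satterthwaite approximation. A sketch using the same two groups:

```python
import numpy as np
from scipy import stats

group_A = [85, 88, 90, 92, 87, 85, 89, 91, 86, 88]
group_B = [82, 84, 80, 83, 81, 79, 78, 75, 85, 83]

n_a, n_b = len(group_A), len(group_B)
var_a = np.var(group_A, ddof=1)   # sample variances
var_b = np.var(group_B, ddof=1)

# Squared standard error of each group mean
se_a, se_b = var_a / n_a, var_b / n_b

# Welch t-statistic
t_manual = (np.mean(group_A) - np.mean(group_B)) / np.sqrt(se_a + se_b)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
dof = (se_a + se_b) ** 2 / (se_a**2 / (n_a - 1) + se_b**2 / (n_b - 1))

# Two-tailed p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

print("t-statistic:", t_manual)
print("Degrees of freedom:", dof)
print("p-value:", p_manual)
```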


# Chi-Square Test

**What is a Chi-Square Test?**
- The Chi-Square Test is used to check if there is a significant relationship between categorical variables.
- It compares the observed frequencies (actual data) with the expected frequencies (theoretical values if there was no relationship).

In [4]:
import numpy as np              # For handling arrays, numerical operations
import seaborn as sns           # For visualization (like heatmaps for contingency tables)
from scipy.stats import chi2_contingency  # For performing Chi-Square Test
import pandas as pd             # For handling datasets and creating tables easily

In [5]:
df = sns.load_dataset("titanic")  # Load Titanic dataset into a pandas DataFrame

In [6]:
df    # Show the table

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [7]:
# Create a contingency table between 'sex' and 'survived'
contingency_table = pd.crosstab(df['sex'], df['survived'])

# Show the table
print(contingency_table)

survived    0    1
sex               
female     81  233
male      468  109


In [9]:
# Perform Chi-Square Test of Independence on the contingency table
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# chi2     → Chi-Square statistic
# p_value  → Probability of a statistic at least this large if the variables are independent
# dof      → Degrees of freedom = (rows - 1) * (columns - 1)
# expected → Expected frequencies if variables are independent

In [10]:
print("Chi-Square Statistic:", chi2)
print("P-Value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)

Chi-Square Statistic: 260.71702016732104
P-Value: 1.1973570627755645e-58
Degrees of Freedom: 1
Expected Frequencies:
 [[193.47474747 120.52525253]
 [355.52525253 221.47474747]]
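
The expected frequencies above follow directly from the marginal totals: each expected cell is (row total × column total) / grand total. For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default, so the statistic is built from (|O - E| - 0.5)² rather than (O - E)². A sketch that reproduces both from the observed counts printed earlier:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the sex × survived contingency table
observed = np.array([[81, 233],
                     [468, 109]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic with Yates' continuity correction (SciPy's default for 2x2 tables)
chi2_manual = (((np.abs(observed - expected_manual) - 0.5) ** 2) / expected_manual).sum()

chi2_scipy, p_scipy, dof, expected_scipy = chi2_contingency(observed)

print(expected_manual)
print("Manual chi-square:", chi2_manual)
print("SciPy chi-square: ", chi2_scipy)
```

Passing `correction=False` to `chi2_contingency` gives the uncorrected statistic instead.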


In [44]:
alpha = 0.005  # Significance level (0.5%)

In [11]:
# Decision based on p-value and significance level alpha
if p_value < alpha:
    # If p-value is less than alpha, reject H0
    # This means there is a significant relationship between gender and survival
    print("We reject the null Hypothesis and there is significant relationship between gender and survival.")
else:
    # If p-value is greater than or equal to alpha, fail to reject H0
    # This means no evidence of a relationship between gender and survival
    print("We fail to reject the null hypothesis; there is no significant evidence of a relationship between gender and survival.")

We reject the null Hypothesis and there is significant relationship between gender and survival.


# ANOVA Test

**What is ANOVA?**
- ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more groups to see if at least one group mean is significantly different.
- If you only have two groups, you can use a T-Test.
- For 3+ groups, ANOVA is preferred.

- It helps answer questions like:
  - "Do students from different classes score differently on a test?"
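
As a minimal sketch of that question, the scores below are made-up values for three hypothetical classes; `f_oneway` takes one sequence per group.

```python
from scipy.stats import f_oneway

# Hypothetical test scores for three classes (illustrative values only)
class_a = [78, 82, 85, 80, 79]
class_b = [88, 90, 86, 91, 87]
class_c = [70, 72, 68, 75, 71]

# One-way ANOVA: H0 says all three class means are equal
f_stat, p_value = f_oneway(class_a, class_b, class_c)

print("F-statistic:", f_stat)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("At least one class mean differs significantly")
else:
    print("No significant difference detected between class means")
```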

In [12]:
import seaborn as sns              # For loading example datasets and visualization
import pandas as pd                # For handling datasets as DataFrames
from scipy.stats import f_oneway   # For performing One-Way ANOVA test

In [13]:
# Load the Titanic dataset from Seaborn into a pandas DataFrame
df = sns.load_dataset('titanic')

# Display the dataset (pandas truncates the output to the first and last rows)
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [14]:
# Fill missing values in 'age' column with the mean age
# Series.mean() skips NaN values by default, so the mean uses only the known ages
df['age'] = df['age'].fillna(df['age'].mean())

# Display the 'age' column after filling missing values
df['age']

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: age, Length: 891, dtype: float64

In [15]:
# 'pclass' has no missing values in this dataset, so this fillna leaves the column unchanged
# (the dtype stays int64 in the output below)
# A mean would not be a meaningful fill value for a class label in any case
df['pclass'] = df['pclass'].fillna(np.mean(pd.to_numeric(df['pclass'])))

# Display the 'pclass' column after filling missing values
df['pclass']

0      3
1      1
2      3
3      1
4      3
      ..
886    2
887    1
888    3
889    1
890    3
Name: pclass, Length: 891, dtype: int64

In [16]:
# Display unique values in the 'pclass' column
df['pclass'].unique()

array([3, 1, 2])

In [18]:
# Select ages of passengers in Class 1
class_1 = df[df['pclass'] == 1]['age']

# Select ages of passengers in Class 2
class_2 = df[df['pclass'] == 2]['age']

# Select ages of passengers in Class 3
class_3 = df[df['pclass'] == 3]['age']

# Print the ages for each class to check
print(class_1)
print(class_2)
print(class_3)

1      38.0
3      35.0
6      54.0
11     58.0
23     28.0
       ... 
871    47.0
872    33.0
879    56.0
887    19.0
889    26.0
Name: age, Length: 216, dtype: float64
9      14.000000
15     55.000000
17     29.699118
20     35.000000
21     34.000000
         ...    
866    27.000000
874    28.000000
880    25.000000
883    28.000000
886    27.000000
Name: age, Length: 184, dtype: float64
0      22.000000
2      26.000000
4      35.000000
5      29.699118
7       2.000000
         ...    
882    22.000000
884    25.000000
885    39.000000
888    29.699118
890    32.000000
Name: age, Length: 491, dtype: float64


In [19]:
# Perform One-Way ANOVA on the ages of the three passenger classes
# f_oneway() returns:
#   f_stats → F-statistic value (ratio of between-group variance to within-group variance)
#   p_value → Probability of an F-statistic at least this large if all group means are equal
f_stats, p_value = f_oneway(class_1, class_2, class_3)

In [84]:
f_stats

np.float64(56.57438528337172)

In [20]:
# Display the p-value from the One-Way ANOVA test
p_value

np.float64(7.481182472787439e-24)
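
The F-statistic is the between-group mean square divided by the within-group mean square. The sketch below computes it by hand on small made-up groups (not the Titanic data) and checks the result against `f_oneway`:

```python
import numpy as np
from scipy.stats import f, f_oneway

# Hypothetical groups of unequal size (illustrative values only)
groups = [
    np.array([23.0, 25.0, 21.0, 27.0]),
    np.array([30.0, 32.0, 29.0, 35.0, 31.0]),
    np.array([40.0, 38.0, 42.0]),
]

k = len(groups)                              # number of groups
n_total = sum(len(g) for g in groups)        # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: sums of squares divided by their degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = f.sf(f_manual, k - 1, n_total - k)   # upper-tail area of the F-distribution

f_scipy, p_scipy = f_oneway(*groups)

print("Manual:", f_manual, p_manual)
print("SciPy: ", f_scipy, p_scipy)
```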

In [21]:
alpha = 0.05  # Significance level (5%)

# Decision based on ANOVA p-value and significance level alpha
if p_value < alpha:
    # If p-value is less than alpha, reject the null hypothesis
    # This means at least one passenger class has a significantly different mean age
    print('Reject the null hypothesis and there is significant difference in at least one passenger class')
else:
    # If p-value is greater than or equal to alpha, fail to reject null hypothesis
    # This means there is no significant difference in mean ages across classes
    print('There is no significant difference')

Reject the null hypothesis and there is significant difference in at least one passenger class


**1. Z-Test Conclusion**
- Purpose: Compare a sample mean to a population mean or two sample means when the population standard deviation is known.
- When to Use: Large sample size (n > 30) and population variance (σ²) is known.

**Interpretation:**

- Z-Score tells how many standard errors (σ / √n) the sample mean is from the population mean.
- p-value < α → Reject H₀ → Sample mean is significantly different from population mean.
- p-value ≥ α → Fail to reject H₀ → No significant difference.

**2. T-Test Conclusion**
- Purpose: Compare means when population standard deviation is unknown or for small sample sizes.

**Interpretation:**

- t-statistic indicates difference between group means relative to variability.
- p-value < α → Reject H₀ → Significant difference between means.
- p-value ≥ α → Fail to reject H₀ → No significant difference.

**3. ANOVA (Analysis of Variance) Conclusion**

- Purpose: Compare means of three or more groups to see if at least one group mean is different.
- When to Use: When comparing 3+ groups with one (or more) independent variables.

**Interpretation:**

- F-Statistic: Ratio of between-group variance to within-group variance.
- p-value < α → Reject H₀ → At least one group mean is significantly different.
- p-value ≥ α → Fail to reject H₀ → No significant evidence that the group means differ.

**Quick Summary Table**

| Test       | When to Use                               | Key Output | Decision Rule     |
| ---------- | ----------------------------------------- | ---------- | ----------------- |
| Z-Test     | Large sample, known σ                     | Z-Score, p | p < α → Reject H₀ |
| T-Test     | Small sample or unknown σ                 | t-Stat, p  | p < α → Reject H₀ |
| Chi-Square | Relationship between categorical variables | χ²-Stat, p | p < α → Reject H₀ |
| ANOVA      | 3+ groups                                 | F-Stat, p  | p < α → Reject H₀ |
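
Because every test in the table shares the same decision rule, it can be wrapped in a small helper. The function name `decide` and its return strings are illustrative assumptions:

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a p-value at significance level alpha."""
    if p_value < alpha:
        return "Reject H0"
    return "Fail to reject H0"

# p-values taken from the Z-test and T-test examples above
print(decide(0.17058693287143756))      # Fail to reject H0
print(decide(2.2637400297756707e-05))   # Reject H0
```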
