## What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

Both the t-test and the z-test are statistical hypothesis tests used to make inferences about population parameters (usually means) from sample statistics. They differ mainly in what is assumed to be known about the population:

* The z-test assumes that the population standard deviation is known, or that the sample is large enough (commonly, n > 30) for the sample standard deviation to be a good stand-in. The t-test is used when the population standard deviation is unknown and must be estimated from the sample, which is the usual situation with small samples (commonly, n < 30).

* The z-test statistic is compared against the standard normal distribution, whereas the t-test statistic is compared against a t-distribution whose degrees of freedom depend on the sample size. The t-distribution has heavier tails, which accounts for the extra uncertainty from estimating the standard deviation.

* As the sample size grows, the t-distribution approaches the standard normal distribution, so the two tests give nearly identical results for large samples. The t-test is therefore the safer default whenever the population standard deviation is not known.

Example:

* t-test

Suppose we want to compare the mean height of two groups of students, one from a private school and the other from a public school. We collect a sample of 20 students from each school and measure their heights. The population standard deviations are unknown and the samples are small, so we can use a two-sample t-test to determine whether there is a significant difference between the mean heights of the two groups.

* z-test

Suppose we want to compare the mean weight of a population to a known reference value, and the population standard deviation is known from earlier studies. We collect a large sample of 1000 individuals and measure their weights. We can use a one-sample z-test to determine whether the mean weight of the population is significantly different from the reference value.
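The z-test scenario above can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the function name `one_sample_z_test` and all the numbers (sample mean 71, reference mean 70, known population standard deviation 10) are hypothetical values chosen for the example.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, ref_mean, pop_std, n):
    """Two-tailed one-sample z-test; returns (z, p_value)."""
    # Standard error uses the KNOWN population standard deviation
    z = (sample_mean - ref_mean) / (pop_std / math.sqrt(n))
    # Two-tailed p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 1000 individuals, sample mean weight 71 kg,
# reference value 70 kg, known population standard deviation 10 kg
z, p = one_sample_z_test(sample_mean=71.0, ref_mean=70.0, pop_std=10.0, n=1000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

For the t-test scenario, where raw measurements from two small samples are available and the population standard deviations are unknown, `scipy.stats.ttest_ind` performs the two-sample test.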


## Differentiate between one-tailed and two-tailed tests.

A hypothesis test can be either one-tailed or two-tailed, depending on whether the alternative hypothesis is directional.

A one-tailed test is a hypothesis test where the alternative hypothesis is directional, meaning it specifies whether the population parameter is greater than (or less than) the value stated in the null hypothesis. For example, a one-tailed test may be used to test whether a new drug is more effective than the standard drug, or whether a new advertising campaign increases sales.

A two-tailed test is a hypothesis test where the alternative hypothesis is non-directional, meaning it does not specify the direction of the difference. It simply states that the population parameter differs from the value stated in the null hypothesis. For example, a two-tailed test may be used to test whether a coin is biased or not.

The critical region for a one-tailed test is located in one tail of the probability distribution, either the upper or lower tail, depending on the directionality of the hypothesis. The critical region for a two-tailed test is located in both tails of the probability distribution.
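To make the difference concrete, the sketch below computes one-tailed and two-tailed p-values for the same z statistic. The helper name `p_values` and the test statistic z = 1.8 are hypothetical choices for illustration; a standard normal sampling distribution is assumed.

```python
from statistics import NormalDist


def p_values(z):
    """Return upper-tail, lower-tail and two-tailed p-values for a z statistic."""
    std_normal = NormalDist()
    upper = 1 - std_normal.cdf(z)   # H1: parameter is greater than the null value
    lower = std_normal.cdf(z)       # H1: parameter is less than the null value
    two_tailed = 2 * min(upper, lower)  # H1: parameter differs from the null value
    return {"upper": upper, "lower": lower, "two_tailed": two_tailed}


alpha = 0.05
p = p_values(1.8)  # hypothetical test statistic
for name, value in p.items():
    decision = "reject H0" if value < alpha else "fail to reject H0"
    print(f"{name:>10}: p = {value:.4f} -> {decision}")
```

With this statistic the upper-tailed test rejects the null hypothesis at the 0.05 level while the two-tailed test does not, because the two-tailed test splits the 5% critical region across both tails.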

## Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.

In statistical hypothesis testing, there are two types of errors that can occur: Type 1 error and Type 2 error.

__Type 1 error__ (a false positive) occurs when we reject a null hypothesis that is actually true. In other words, we conclude that there is an effect or a difference when in fact there is none. The probability of a Type 1 error is the significance level (alpha) of the test.

* Example of Type 1 Error:

A pharmaceutical company is testing a new drug for treating a disease. The null hypothesis is that the new drug is not more effective than the existing drug. The alternative hypothesis is that the new drug is more effective than the existing drug. The company conducts a hypothesis test at a significance level of 0.05 and concludes that the new drug is more effective based on the sample data. However, in reality, the new drug is not more effective than the existing drug, and the company has made a Type 1 error.

__Type 2 error__ (a false negative) occurs when we fail to reject a null hypothesis that is actually false. In other words, we conclude that there is no effect or difference when in fact there is one. The probability of a Type 2 error is usually denoted beta, and 1 - beta is the power of the test.

* Example of Type 2 Error:

A college admission office is using a standardized test to evaluate applicants. The null hypothesis is that the mean test score of the applicants is equal to the national mean. The alternative hypothesis is that the mean test score of the applicants is different from the national mean. The office conducts a hypothesis test but fails to reject the null hypothesis based on the sample data. However, in reality, the mean test score of the applicants is different from the national mean, and the office has made a Type 2 error by failing to detect the difference.
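Both error rates can be estimated by simulation. The sketch below is an illustration under stated assumptions, not part of the original examples: it uses a two-tailed one-sample z-test with a known standard deviation of 1, samples of size 25, and a hypothetical true mean of 0.5 for the case where the null hypothesis is false. The function name `rejection_rate` is made up for this example.

```python
import math
import random
from statistics import NormalDist, fmean


def rejection_rate(true_mean, null_mean=0.0, sigma=1.0, n=25,
                   alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated samples for which a two-tailed z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - null_mean) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# H0 is true (true mean equals the null mean): every rejection is a Type 1 error
type_1_rate = rejection_rate(true_mean=0.0)

# H0 is false (hypothetical true mean 0.5): every failure to reject is a Type 2 error
type_2_rate = 1 - rejection_rate(true_mean=0.5)

print(f"Estimated Type 1 error rate: {type_1_rate:.3f}")
print(f"Estimated Type 2 error rate: {type_2_rate:.3f}")
```

The estimated Type 1 rate should land close to the chosen significance level of 0.05, while the Type 2 rate depends on the sample size and on how far the true mean is from the null value.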



## Explain Bayes's theorem with an example.

Bayes's theorem is a mathematical formula used in probability theory to update the probability of a hypothesis as new evidence becomes available. 

The theorem is expressed as P(A|B) = P(B|A) x P(A) / P(B), where P(A) and P(B) are the probabilities of A and B occurring, P(B|A) is the conditional probability of B given that A has occurred, and P(A|B) is the probability of A given that B has occurred.

* Example 

Suppose a company has two machines, A and B, that produce widgets. Machine A produces 70% of the widgets and machine B produces 30% of the widgets. The widgets produced by machine A have a defect rate of 5%, while the widgets produced by machine B have a defect rate of 10%. If a widget is randomly selected from the production line and is found to be defective, what is the probability that it was produced by machine B?

Using Bayes's theorem, we can calculate the probability that the defective widget was produced by machine B as follows:

P(B|D) = P(D|B) x P(B) / P(D)

where P(B) = 0.3, P(A) = 0.7, P(D|B) = 0.1, P(D|A) = 0.05, and P(D) = P(D|B) x P(B) + P(D|A) x P(A) = 0.1 x 0.3 + 0.05 x 0.7 = 0.065

So, P(B|D) = 0.1 x 0.3 / 0.065 = 0.462 or approximately 46.2%. Therefore, there is a 46.2% chance that the defective widget was produced by machine B.
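The same calculation can be checked in code. The helper `posterior` below is a hypothetical name; it applies Bayes's theorem to any set of mutually exclusive hypotheses by computing the evidence term with the law of total probability.

```python
def posterior(priors, likelihoods):
    """Posterior probability of each hypothesis given the observed evidence."""
    # P(D) = sum over hypotheses of P(D | H) * P(H)
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    # P(H | D) = P(D | H) * P(H) / P(D)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


priors = {"A": 0.70, "B": 0.30}         # share of widgets made by each machine
defect_rates = {"A": 0.05, "B": 0.10}   # P(defective | machine)

post = posterior(priors, defect_rates)
print(f"P(B | defective) = {post['B']:.3f}")  # 0.462
```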

## What is a confidence interval? How to calculate the confidence interval, explain with an example.

A confidence interval is a range of values, computed from sample data, that is used to estimate a population parameter (such as the population mean or proportion). The confidence level (for example, 95%) describes the procedure: if we repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true parameter. For a mean, the interval has the form sample mean +/- critical value * standard error.

* Example:

Suppose a researcher wants to estimate the average height of all adult males in a particular city. The researcher takes a random sample of 100 adult males and finds that the sample mean height is 175 cm, with a standard deviation of 10 cm. The researcher can use this information to calculate a 95% confidence interval.


> Calculate the standard error:

    Standard error = standard deviation / square root of sample size = 10 / sqrt(100) = 1

> Calculate the margin of error:

    Margin of error = critical value * standard error

The critical value for a 95% confidence interval with 99 degrees of freedom (n - 1) is 1.984 (based on a t-distribution table).

    Margin of error = 1.984 * 1 = 1.984

> Calculate the confidence interval:

    Confidence interval = sample mean +/- margin of error = 175 +/- 1.984 = (173.016, 176.984)

Therefore, we can say with 95% confidence that the true population mean height of adult males in the city lies between 173.016 cm and 176.984 cm.
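The three steps above can be sketched as a small function. This is a minimal illustration; the function name `t_confidence_interval` is an assumption, and the critical value is taken from `scipy.stats.t.ppf` instead of a printed table.

```python
import math
from scipy import stats


def t_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Confidence interval for a mean using the t-distribution."""
    standard_error = sample_std / math.sqrt(n)
    # Two-sided critical value with n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_crit * standard_error
    return sample_mean - margin, sample_mean + margin


low, high = t_confidence_interval(sample_mean=175, sample_std=10, n=100)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```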

## Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.

Bayes' Theorem is a mathematical formula that is used to calculate conditional probabilities. It states that the probability of an event A given the occurrence of event B can be calculated as follows:

P(A|B) = P(B|A) * P(A) / P(B)

where P(A|B) is the posterior probability of A given B, P(B|A) is the probability of B given A (the likelihood), P(A) is the prior probability of A, and P(B) is the marginal probability of B (the evidence).

* Example

Suppose a patient comes to a doctor with a certain set of symptoms, and the doctor wants to determine the probability of the patient having a particular disease. Let's assume that 1% of the population has the disease, and the diagnostic test has a false positive rate of 5% and a false negative rate of 1%.

> Prior probability: P(Disease) = 0.01 (1% of the population has the disease)

> False positive rate: P(Positive | No Disease) = 0.05 (5% of people without the disease will test positive)

> False negative rate: P(Negative | Disease) = 0.01 (1% of people with the disease will test negative)

> Marginal probability: P(Positive) = P(Positive | Disease) * P(Disease) + P(Positive | No Disease) * P(No Disease) = 0.99 * 0.01 + 0.05 * 0.99 = 0.0099 + 0.0495 = 0.0594

> Using Bayes's theorem:

    Probability of having the disease given a positive test result: P(Disease | Positive) = P(Positive | Disease) * P(Disease) / P(Positive) = 0.99 * 0.01 / 0.0594 = 0.1667 or 16.67%

Therefore, even if the patient has a positive test result, there is still only a 16.67% chance (about 1 in 6) that they actually have the disease, because the disease is rare and false positives outnumber true positives.
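The calculation can also be written as a function of the three inputs, which makes it easy to see how the answer changes with prevalence or test accuracy. The function name `prob_disease_given_positive` is hypothetical; the inputs are the values from the example.

```python
def prob_disease_given_positive(prevalence, false_positive_rate, false_negative_rate):
    """P(Disease | Positive) via Bayes's theorem for a binary diagnostic test."""
    sensitivity = 1 - false_negative_rate          # P(Positive | Disease)
    # Law of total probability: P(Positive)
    p_positive = (sensitivity * prevalence
                  + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_positive


p = prob_disease_given_positive(prevalence=0.01,
                                false_positive_rate=0.05,
                                false_negative_rate=0.01)
print(f"P(Disease | Positive) = {p:.4f}")
```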

## Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.
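
The question does not state the sample size, so the interval cannot be computed without an extra assumption. The sketch below assumes a hypothetical sample size of n = 100 and uses the large-sample normal approximation; both the sample size and the function name `z_confidence_interval` are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def z_confidence_interval(mean, std, n, confidence=0.95):
    """Large-sample (normal approximation) confidence interval for a mean."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * std / math.sqrt(n)
    return mean - margin, mean + margin


n = 100  # ASSUMPTION: the sample size is not given in the question
low, high = z_confidence_interval(mean=50, std=5, n=n)
print(f"95% CI (assuming n = {n}): ({low:.2f}, {high:.2f})")
```

Under that assumption, the interpretation is that if the sampling were repeated many times, about 95% of the intervals built this way would contain the true population mean. A smaller sample would give a wider interval, and for small samples a t critical value would be more appropriate.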

## What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

__Margin of error in a confidence interval:__

* The margin of error is the half-width of a confidence interval: the interval is the point estimate plus or minus the margin of error.

* It quantifies the random sampling error in the estimate at the chosen confidence level, and is calculated as the critical value multiplied by the standard error.

__How sample size affects the margin of error:__

* The margin of error is inversely proportional to the square root of the sample size. As the sample size increases, the margin of error decreases; for example, quadrupling the sample size halves the margin of error.

* A smaller margin of error gives a narrower confidence interval, that is, a more precise estimate of the population parameter at the same confidence level.

__Scenario with example:__
> Let's say you want to estimate the proportion of people in a city who support a particular political candidate. You conduct a survey of 100 people and find that 60% of them support the candidate. If you calculate a 95% confidence interval around this estimate, you will find that the margin of error is about ±10%. This means that you can be 95% confident that the true proportion of people who support the candidate in the entire city is somewhere between 50% and 70%.

> Now imagine that you conduct a survey of 1000 people and find that 60% of them support the candidate. If you calculate a 95% confidence interval around this estimate, you will find that the margin of error is only about ±3%. This means that you can be 95% confident that the true proportion of people who support the candidate in the entire city is somewhere between 57% and 63%.

__Conclusion from Example:__
* Therefore, a larger sample size generally results in a smaller margin of error, as it reduces the effect of random sampling variation.
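
The two margins of error in the scenario can be reproduced with the usual normal-approximation formula for a proportion. This is a minimal sketch; the function name `proportion_margin_of_error` is hypothetical, and z = 1.96 is the standard critical value for 95% confidence.

```python
import math


def proportion_margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


for n in (100, 1000):
    moe = proportion_margin_of_error(0.60, n)
    print(f"n = {n:>4}: margin of error = ±{moe:.1%}")
```

Increasing the sample size by a factor of 10 shrinks the margin of error by a factor of sqrt(10), roughly 3.16, which matches the drop from about ±10% to about ±3%.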

## Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.
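
The z-score is the distance of the data point from the population mean, measured in standard deviations: z = (x - mean) / standard deviation. A short sketch follows; the percentile line additionally assumes the population is normally distributed, which the question does not state.

```python
from statistics import NormalDist


def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std


z = z_score(x=75, mean=70, std=5)
print(f"z = {z}")  # 1.0

# ASSUMPTION: normally distributed population
percentile = NormalDist().cdf(z)
print(f"Share of the population below this value: {percentile:.1%}")
```

A z-score of 1 means the value 75 lies exactly one standard deviation above the population mean. Under the normality assumption, that places it above roughly 84% of the population.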

## In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

```python
import math
import scipy.stats as stat

# Given:
population_mean = 0   # hypothesized mean weight loss under the null hypothesis
size = 50
sample_mean = 6
std = 2.5             # sample standard deviation
alpha = 0.05

# Null hypothesis:      population mean weight loss == 0
# Alternate hypothesis: population mean weight loss != 0
# We apply a two-tailed, one-sample t-test.

t_value = (sample_mean - population_mean) / (std / math.sqrt(size))

# Two-tailed critical value with n - 1 degrees of freedom
t_critical = abs(stat.t.ppf(q=alpha / 2, df=size - 1))

# Compare the magnitude of t, since the test is two-tailed
if abs(t_value) < t_critical:
    print(f'Fail to reject null hypothesis as the value of t {t_value} is less than critical t {t_critical}')
else:
    print(f'Reject the null hypothesis as the value of t {t_value} is greater than critical t {t_critical}')
    print('The mean weight loss for participants taking the new drug is significantly different from zero')
```

```
Reject the null hypothesis as the value of t 16.970562748477143 is greater than critical t 2.0095752344892093
The mean weight loss for participants taking the new drug is significantly different from zero
```


## In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.


```python
import math

# Data
p = 0.65   # sample proportion
z = 1.96   # critical value for a 95% CI
n = 500    # sample size

# Calculate the standard error
s_error = math.sqrt((p * (1 - p)) / n)

# Calculate the margin of error
m_error = z * s_error

# Calculate the lower and upper bounds of the confidence interval
lower_bound = p - m_error
upper_bound = p + m_error

# Print the results
print(f"95% Confidence Interval: ({lower_bound}, {upper_bound})")
```

```
95% Confidence Interval: (0.608191771144905, 0.6918082288550951)
```


## A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

```python
from scipy import stats

# Summary statistics: mean, standard deviation, sample size
# (the question gives no sample sizes; n = 50 per group is an assumption)
x1, s1, n1 = 85, 6, 50   # Sample A
x2, s2, n2 = 82, 5, 50   # Sample B

# H0: the two teaching methods have the same population mean score.
# H1: the population mean scores of the two methods differ.
alpha = 0.01

# Two-sample t-test computed directly from the summary statistics
# (pooled variance, the same default as stats.ttest_ind)
t_stat, p_val = stats.ttest_ind_from_stats(x1, s1, n1, x2, s2, n2)

# Compare the p-value to the significance level and draw a conclusion
print(f't = {t_stat:.3f}, p-value = {p_val:.3f}')
if p_val < alpha:
    print("Reject the null hypothesis. There is a significant difference in student performance between the two teaching methods.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in student performance between the two teaching methods.")
```

```
t = 2.716, p-value = 0.008
Reject the null hypothesis. There is a significant difference in student performance between the two teaching methods.
```


## A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.
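
Because the population standard deviation is given (8), a z-based interval around the sample mean applies: sample mean +/- z * sigma / sqrt(n). The sketch below is one way to compute it; the variable names are illustrative, and the critical value comes from the standard library's `NormalDist` instead of a table.

```python
import math
from statistics import NormalDist

sample_mean = 65
pop_std = 8       # known population standard deviation
n = 50
confidence = 0.90

# Two-sided critical value for 90% confidence (about 1.645)
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z_crit * pop_std / math.sqrt(n)

low, high = sample_mean - margin, sample_mean + margin
print(f"90% CI: ({low:.3f}, {high:.3f})")
```

The resulting interval is roughly 63.14 to 66.86. It does not contain the stated population mean of 60, which suggests that this sample differs from what the stated mean would predict.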

## In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.