In [None]:
Q1: What is Estimation Statistics? Explain point estimate and interval estimate.
Ans-
    Estimation statistics is a branch of statistics that deals with the estimation of unknown population parameters based on sample data. The goal of estimation statistics is to provide an estimate of the true value of a population parameter using the information obtained from a sample.

There are two types of estimates in estimation statistics: point estimates and interval estimates.

1.Point Estimate: A point estimate is a single value that is used to estimate an unknown population parameter. Point estimates are based on sample statistics and provide an estimate of the population parameter value. For example, the sample mean is a point estimate of the population mean.

2.Interval Estimate: An interval estimate is a range of values that is likely to contain the true value of a population parameter with a certain level of confidence. Unlike a point estimate, an interval estimate also conveys the uncertainty of the estimate: a narrower interval indicates a more precise estimate, while a wider interval indicates more uncertainty. For example, a 95% confidence interval for the population mean is an interval estimate; if the sampling procedure were repeated many times, about 95% of the intervals constructed this way would contain the true population mean.

In summary, point estimates provide a single value as an estimate of the population parameter, while interval estimates provide a range of values likely to contain the true value of the population parameter with a certain level of confidence.
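
As a minimal sketch of the difference, the snippet below computes both kinds of estimate for a small sample. The sample values are hypothetical and chosen only for illustration; the interval assumes the data are roughly normally distributed.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, used only for illustration
sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])

# Point estimate: the sample mean estimates the population mean
point_estimate = sample.mean()

# Interval estimate: 95% t-based confidence interval around the sample mean
std_error = stats.sem(sample)  # s / sqrt(n), using ddof=1
lower, upper = stats.t.interval(0.95, len(sample) - 1,
                                loc=point_estimate, scale=std_error)

print(f"Point estimate: {point_estimate:.3f}")
print(f"95% interval estimate: ({lower:.3f}, {upper:.3f})")
```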

In [None]:
Q2. Write a Python function to estimate the population mean using a sample mean and standard
deviation.
Ans-
   Here is a Python function that takes a sample mean, sample standard deviation, and sample size, and returns the point estimate of the population mean together with a t-based confidence interval around it:

import math
from scipy import stats

def estimate_population_mean(sample_mean, sample_std_dev, sample_size, confidence_level=0.95):
    """
    Estimate the population mean using a sample mean and standard deviation.

    Args:
    sample_mean (float): The sample mean.
    sample_std_dev (float): The sample standard deviation.
    sample_size (int): The sample size.
    confidence_level (float): The confidence level for the interval estimate (default 0.95).

    Returns:
    tuple: (point_estimate, lower_bound, upper_bound), where the point estimate
    is the sample mean and the bounds form the confidence interval.
    """
    # Calculate the two-sided critical t-value for the given confidence level
    dof = sample_size - 1
    t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, dof)

    # Calculate the margin of error
    margin_of_error = t_value * (sample_std_dev / math.sqrt(sample_size))

    # Calculate the lower and upper bounds of the confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    # Return the point estimate together with the interval estimate
    return sample_mean, lower_bound, upper_bound


This function uses the t-distribution to calculate the margin of error for a confidence interval estimate of the population mean. The point estimate is the sample mean itself, and the interval is centered on it. The confidence_level argument is optional and defaults to 0.95, which corresponds to a 95% confidence interval.


In [None]:
Q3: What is Hypothesis testing? Why is it used? State the importance of Hypothesis testing.
Ans-
   Hypothesis testing is a statistical method that is used to determine whether there is enough evidence in a sample of data to infer conclusions about a population. It involves the formulation of two competing hypotheses: a null hypothesis and an alternative hypothesis. The null hypothesis states that there is no effect or no difference (for example, that a population parameter equals a specified value), while the alternative hypothesis states that there is an effect or a difference.

Hypothesis testing is used to determine whether the difference between the observed sample data and the expected population data is statistically significant. It is used to make decisions about a population parameter, such as the mean or the proportion, based on sample data. The goal of hypothesis testing is to make a decision about whether to reject or fail to reject the null hypothesis, based on the evidence provided by the sample data.

The importance of hypothesis testing lies in its ability to help researchers and decision-makers draw valid conclusions from sample data. Hypothesis testing provides a framework for making statistical decisions and can help reduce the risk of making incorrect conclusions based on random chance. It is a valuable tool for assessing the effectiveness of interventions, evaluating the accuracy of data, and making informed decisions based on data.

In summary, hypothesis testing is an essential statistical method used to make decisions about population parameters based on sample data. It helps researchers and decision-makers draw valid conclusions and reduces the risk of making incorrect decisions based on random chance.

In [None]:
Q4. Create a hypothesis that states whether the average weight of male college students is greater than
the average weight of female college students.
Ans-
    The null hypothesis for this hypothesis test would be:

H0: The average weight of male college students is equal to or less than the average weight of female college students.

The alternative hypothesis would be:

H1: The average weight of male college students is greater than the average weight of female college students.

We would then collect a sample of male and female college students, record their weights, and perform a one-tailed two-sample test to determine whether there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.

In [None]:
Q5. Write a Python script to conduct a hypothesis test on the difference between two population means,
given a sample from each population.
Ans-

import numpy as np
from scipy import stats

# Set the significance level and sample data for two populations
alpha = 0.05
sample1 = np.array([83, 75, 82, 79, 88, 80, 75, 86, 78, 85])
sample2 = np.array([80, 78, 81, 76, 85, 77, 82, 87, 79, 84])

# Calculate the sample means and variances
mean1 = np.mean(sample1)
mean2 = np.mean(sample2)
var1 = np.var(sample1, ddof=1)
var2 = np.var(sample2, ddof=1)
n1 = len(sample1)
n2 = len(sample2)

# Calculate the pooled standard deviation (square root of the pooled variance)
s_pooled = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# Calculate the t-statistic
t_statistic = (mean1 - mean2) / (s_pooled * np.sqrt(1/n1 + 1/n2))

# Calculate the degrees of freedom
dof = n1 + n2 - 2

# Calculate the critical t-value for a two-tailed test
t_crit = abs(stats.t.ppf(alpha/2, dof))

# Determine whether to reject the null hypothesis
if abs(t_statistic) > t_crit:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
    
    
In this example, we have two samples of weights for two different populations (presumably male and female college students). We calculate the sample means and variances, the pooled standard deviation, and the t-statistic using the formula for a two-sample t-test with equal variances. We then calculate the critical t-value for a two-tailed test with the given significance level and degrees of freedom, and compare it to the absolute value of the t-statistic to determine whether to reject or fail to reject the null hypothesis that the means of the two populations are equal.

In [None]:
Q6: What is a null and alternative hypothesis? Give some examples.
Ans-
   A null hypothesis is a statement that assumes there is no significant difference between two or more populations or samples. It is often denoted as H0 and serves as a starting point for statistical testing. The alternative hypothesis, denoted as H1 or Ha, is a statement that contradicts the null hypothesis by asserting that there is a significant difference between the populations or samples being compared.

Here are some examples of null and alternative hypotheses:

Example 1:
Null hypothesis: There is no significant difference in the average test scores of students who study for 2 hours a day versus those who study for 4 hours a day.
Alternative hypothesis: Students who study for 4 hours a day have significantly higher average test scores than students who study for 2 hours a day.

Example 2:
Null hypothesis: There is no significant difference in the average number of hours slept per night between people who exercise regularly and those who do not.
Alternative hypothesis: People who exercise regularly sleep significantly more hours per night than those who do not.

Example 3:
Null hypothesis: There is no significant difference in the average number of miles per gallon (MPG) between cars that use regular gasoline and those that use premium gasoline.
Alternative hypothesis: Cars that use premium gasoline have significantly higher average MPG than those that use regular gasoline.

These examples illustrate how the null hypothesis assumes that there is no significant difference between the populations or samples being compared, while the alternative hypothesis asserts that there is a significant difference. The choice of which hypothesis to test depends on the research question and the nature of the data being analyzed.

In [None]:
Q7: Write down the steps involved in hypothesis testing.
Ans-
   Hypothesis testing is a statistical procedure used to determine if there is enough evidence in a sample of data to infer that a certain condition or hypothesis about the population from which the sample was drawn is true or false. Here are the steps involved in hypothesis testing:

1.State the null hypothesis (H0) and the alternative hypothesis (Ha):

The null hypothesis is the default position that there is no significant difference between two groups or variables.
The alternative hypothesis is the statement that you want to test, which suggests that there is a significant difference between the two groups or variables.

2.Choose a level of significance (alpha level) for your test:

The alpha level determines how much evidence against the null hypothesis is required before you reject it. It is typically set to 0.05, which means that there is a 5% chance of rejecting the null hypothesis when it is actually true.

3.Determine the test statistic and its distribution:

The test statistic is a measure of the difference between the sample data and the null hypothesis. Its distribution depends on the sample size, the population parameters, and the type of test used.

4.Calculate the p-value:

The p-value is the probability of obtaining a test statistic as extreme or more extreme than the observed value, assuming the null hypothesis is true. A small p-value (usually less than 0.05) suggests that the null hypothesis should be rejected.

5.Make a decision:

If the p-value is less than the alpha level, reject the null hypothesis in favor of the alternative hypothesis.
If the p-value is greater than or equal to the alpha level, fail to reject the null hypothesis and conclude that there is not enough evidence to support the alternative hypothesis.

6.Interpret the results:

If you reject the null hypothesis, interpret the results in terms of the alternative hypothesis and the practical significance of the findings.
If you fail to reject the null hypothesis, interpret the results in terms of the limitations of the study and the possibility of type II error (i.e., failing to reject the null hypothesis when it is false).
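
The steps above can be sketched in code as follows. This is an illustrative one-sample example only: the sample values and the hypothesized mean of 50 are assumptions made for the sketch, not data from the question.

```python
import numpy as np
from scipy import stats

# State the hypotheses (hypothetical): H0: mu = 50, Ha: mu != 50
mu0 = 50

# Choose the significance level
alpha = 0.05

# Hypothetical sample data
sample = np.array([52, 48, 55, 51, 49, 53, 54, 50, 56, 47])

# Compute the test statistic and its two-tailed p-value
# (one-sample t-test, t-distribution with n - 1 degrees of freedom)
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Make a decision by comparing the p-value with alpha
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Report the result for interpretation
print(f"t = {t_stat:.3f}, p = {p_value:.3f}: {decision}")
```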

In [None]:
Q8. Define p-value and explain its significance in hypothesis testing.
Ans-
   The p-value is a measure of the strength of evidence against the null hypothesis in hypothesis testing. It is defined as the probability of obtaining a test statistic as extreme or more extreme than the observed value, assuming the null hypothesis is true. In other words, the p-value indicates the likelihood of observing the data we have, or something more extreme, if the null hypothesis is actually true.

The significance of the p-value lies in its ability to inform our decision-making process in hypothesis testing. Specifically, if the p-value is small (usually less than 0.05), it suggests that the observed data is unlikely to have occurred by chance alone, assuming that the null hypothesis is true. This provides evidence in favor of the alternative hypothesis and leads to rejection of the null hypothesis.

On the other hand, if the p-value is large (usually greater than 0.05), it suggests that the observed data is consistent with the null hypothesis and that we do not have sufficient evidence to reject it. This does not necessarily mean that the null hypothesis is true, but rather that we do not have enough evidence to conclude otherwise.

Overall, the p-value is a critical component of hypothesis testing as it helps us decide whether to reject or fail to reject the null hypothesis based on the strength of evidence provided by the observed data.
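
To make the definition concrete, here is a small sketch that converts a t-statistic into a p-value using the t-distribution's tail areas. The helper name p_value_from_t and the example values (t = 2.3 with 15 degrees of freedom) are hypothetical and used only for illustration.

```python
from scipy import stats

def p_value_from_t(t_stat, dof, tail="two-sided"):
    """Return the p-value for a t-statistic; tail is 'left', 'right', or 'two-sided'."""
    if tail == "left":
        return stats.t.cdf(t_stat, dof)          # P(T <= t)
    if tail == "right":
        return stats.t.sf(t_stat, dof)           # P(T >= t)
    if tail == "two-sided":
        return 2 * stats.t.sf(abs(t_stat), dof)  # P(|T| >= |t|)
    raise ValueError("tail must be 'left', 'right', or 'two-sided'")

# Hypothetical observed statistic and degrees of freedom
t_obs, dof = 2.3, 15
p_two = p_value_from_t(t_obs, dof)
p_right = p_value_from_t(t_obs, dof, tail="right")
print(f"two-sided p = {p_two:.4f}, right-tailed p = {p_right:.4f}")
```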

In [None]:
Q9. Generate a Student's t-distribution plot using Python's matplotlib library, with the degrees of freedom
parameter set to 10.
Ans-
   Here's the code to generate a Student's t-distribution plot with the degrees of freedom parameter set to 10 using Python's matplotlib library:
    
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import t

df = 10  # degrees of freedom
x = np.linspace(t.ppf(0.001, df), t.ppf(0.999, df), 100)  # generate x-values for t-distribution
y = t.pdf(x, df)  # generate y-values for t-distribution

fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=2, label='t pdf')
ax.legend(loc='best', frameon=False)
plt.show()


This code generates a plot of the Student's t-distribution with degrees of freedom set to 10. The t.ppf function is used to choose the plotting range (from the 0.1st to the 99.9th percentile of the distribution), np.linspace generates 100 evenly spaced x-values in that range, and the t.pdf function computes the corresponding density values. Finally, ax.plot draws the curve and ax.legend adds the label.

In [None]:
Q10. Write a Python program to calculate the two-sample t-test for independent samples, given two
random samples of equal size and a null hypothesis that the population means are equal.
Ans-

import numpy as np
from scipy.stats import ttest_ind

# generate two random samples
sample1 = np.random.normal(10, 2, 50)
sample2 = np.random.normal(12, 2, 50)

# calculate the t-test and print the results
t_stat, p_value = ttest_ind(sample1, sample2)
print("t-statistic:", t_stat)
print("p-value:", p_value)


In this program, we first generate two random samples using the numpy library's random.normal function. The first sample is drawn from a normal distribution with mean 10 and standard deviation 2, and the second from a normal distribution with mean 12 and standard deviation 2. Both samples have a size of 50. Because the samples are random, the printed values change from run to run unless a seed is set.

Next, we use the scipy.stats library's ttest_ind function to perform the two-sample t-test. This function takes two arrays as input, representing the two samples, and returns the t-statistic and the two-tailed p-value for the test. By default it assumes equal population variances (equal_var=True). We store the results in the t_stat and p_value variables.

Finally, we print the t-statistic and p-value. The t-statistic measures the difference between the two sample means relative to its standard error, while the p-value is the probability of obtaining a difference at least this large by chance if the null hypothesis (that the population means are equal) is true. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that the population means are significantly different.

In [None]:
Q11: What is Student’s t distribution? When to use the t-Distribution.
Ans-
    Student's t-distribution, also known as the t-distribution, is a probability distribution that is used to model the sampling distribution of a statistic when the sample size is small and the population standard deviation is unknown. It was first introduced by William Gosset, who published under the pseudonym "Student" in 1908.

The t-distribution is similar to the standard normal distribution, but it has heavier tails and a lower peak. Its shape depends on the degrees of freedom, which for a one-sample problem is equal to the sample size minus one. As the degrees of freedom increase, the t-distribution approaches the standard normal distribution.

The t-distribution is commonly used in statistical inference when the sample size is small and the population standard deviation is unknown. Specifically, it is used to test hypotheses about population means when the population standard deviation must be estimated from the sample; a sample size below 30 is a common rule of thumb for when the difference from the normal distribution matters most. In these situations, the heavier tails of the t-distribution account for the extra uncertainty introduced by estimating the standard deviation, which the standard normal distribution would understate.

In addition to hypothesis testing, the t-distribution is also used for constructing confidence intervals for population means and for conducting regression analysis when the errors are normally distributed.

In summary, the t-distribution is a probability distribution used to model the sampling distribution of a statistic when the sample size is small and the population standard deviation is unknown. It is commonly used for testing hypotheses about population means and for constructing confidence intervals, especially when the sample size is less than 30.
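
The convergence toward the normal distribution can be seen by comparing two-sided 95% critical values. This is a small sketch; the particular degrees of freedom listed are arbitrary choices for illustration.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.975)

# The same critical value for t-distributions with increasing degrees of freedom
dofs = (5, 10, 30, 100, 1000)
t_crits = [stats.t.ppf(0.975, dof) for dof in dofs]

for dof, t_crit in zip(dofs, t_crits):
    print(f"df = {dof:>4}: t critical = {t_crit:.4f}")
print(f"normal:    z critical = {z_crit:.4f}")
```

The t critical values are larger for small degrees of freedom (heavier tails) and shrink toward the normal value as the degrees of freedom grow.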


In [None]:
Q12: What is t-statistic? State the formula for t-statistic.
Ans-
    The t-statistic is a measure of the difference between two sample means in relation to the variation within the samples. It is used in hypothesis testing to determine whether the difference between two groups is statistically significant or just due to chance.

The formula for the t-statistic is:

t = (x̄1 - x̄2) / (s√(1/n1 + 1/n2))

Where:

x̄1 and x̄2 are the sample means of the two groups being compared
s is the pooled standard deviation of the two samples
n1 and n2 are the sample sizes of the two groups

For a test about a single mean, the analogous formula is t = (x̄ - μ) / (s/√n), where μ is the hypothesized population mean.

The t-statistic is compared to a critical value based on the degrees of freedom (df), which is determined by the sample sizes and the assumption of equal variances between the two groups (df = n1 + n2 - 2 for the pooled formula above). The critical value can be found in a t-distribution table or calculated using statistical software. For a two-tailed test, if the absolute value of the calculated t-statistic is greater than the critical value, then the difference between the two groups is considered statistically significant.
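
As a minimal sketch, the pooled two-sample formula can be written as a function of summary statistics. The function name pooled_t_statistic and the example numbers are hypothetical, chosen so the result is easy to verify by hand.

```python
import math

def pooled_t_statistic(mean1, mean2, sd1, sd2, n1, n2):
    """Two-sample t-statistic using a pooled standard deviation (equal variances assumed)."""
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    s = math.sqrt(pooled_var)
    # Standard error of the difference between the two sample means
    std_error = s * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / std_error

# Hypothetical summary statistics: s = 2, standard error = 2 * sqrt(1/8 + 1/8) = 1
t_value = pooled_t_statistic(mean1=10, mean2=8, sd1=2, sd2=2, n1=8, n2=8)
print(f"t = {t_value:.2f}")  # prints t = 2.00
```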

In [None]:
Q13. A coffee shop owner wants to estimate the average daily revenue for their shop. They take a random
sample of 50 days and find the sample mean revenue to be $500 with a standard deviation of $50.
Estimate the population mean revenue with a 95% confidence interval.
Ans-
    To estimate the population mean revenue with a 95% confidence interval, we can use the following formula:

CI = x̄ ± t*(s/√n)

Where:

x̄ is the sample mean revenue ($500 in this case)
s is the sample standard deviation ($50 in this case)
n is the sample size (50 in this case)
t is the critical t-value with (n-1) degrees of freedom for a 95% confidence level (two-tailed)

First, we need to find the t-value. Since our sample size is 50, our degrees of freedom are 49. Using a t-distribution table or statistical software, we can find that the two-tailed t-value for 49 degrees of freedom and a 95% confidence level is approximately 2.010.

Plugging in the values, we get:

CI = 500 ± 2.010*(50/√50)
CI = 500 ± 2.010*7.071
CI = 500 ± 14.21

Therefore, the 95% confidence interval for the population mean revenue is $485.79 to $514.21. We can be 95% confident that the true population mean revenue falls within this interval.
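
The interval can also be computed directly from the summary statistics given in the question. This is a short sketch using scipy's t-distribution; the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics from the question
sample_mean = 500.0
sample_sd = 50.0
n = 50
confidence = 0.95

# Two-sided critical t-value with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)

# Margin of error and interval bounds
margin = t_crit * sample_sd / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"t critical = {t_crit:.3f}, margin = {margin:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```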

In [None]:
Q14. A researcher hypothesizes that a new drug will decrease blood pressure by 10 mmHg. They conduct a
clinical trial with 100 patients and find that the sample mean decrease in blood pressure is 8 mmHg with a
standard deviation of 3 mmHg. Test the hypothesis with a significance level of 0.05.
Ans-
    To test the hypothesis that the new drug will decrease blood pressure by 10 mmHg, we can perform a one-sample t-test with the following null and alternative hypotheses:

H0: μ = 10
Ha: μ < 10

Where:

H0 is the null hypothesis that the true mean decrease in blood pressure is 10 mmHg
Ha is the alternative hypothesis that the true mean decrease in blood pressure is less than 10 mmHg
μ is the population mean decrease in blood pressure

We can use the following formula to calculate the t-statistic:

t = (x̄ - μ) / (s/√n)

Where:

x̄ is the sample mean decrease in blood pressure (8 mmHg in this case)
μ is the hypothesized population mean decrease in blood pressure (10 mmHg in this case)
s is the sample standard deviation (3 mmHg in this case)
n is the sample size (100 in this case)

Plugging in the values, we get:

t = (8 - 10) / (3/√100)
t = -2 / 0.3
t = -6.67

The degrees of freedom for this test are (n-1) = 99. Using a t-distribution table or statistical software, we find that the one-tailed p-value associated with this t-statistic is far below 0.001. This p-value is the probability of observing a t-statistic of -6.67 or lower, assuming that the null hypothesis is true.

Since the significance level is 0.05, we reject the null hypothesis if the p-value is less than 0.05. In this case, the p-value is less than 0.05, so we reject the null hypothesis. This means that we have sufficient evidence to conclude that the mean decrease in blood pressure produced by the new drug is less than the hypothesized 10 mmHg.
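
The test can be reproduced from the summary statistics in the question. The sketch below assumes the one-tailed alternative Ha: μ < 10 stated above; the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics from the question
sample_mean = 8.0   # observed mean decrease (mmHg)
mu0 = 10.0          # hypothesized mean decrease (mmHg)
sample_sd = 3.0
n = 100
alpha = 0.05

# One-sample t-statistic
t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))

# Left-tailed p-value with n - 1 degrees of freedom
p_value = stats.t.cdf(t_stat, n - 1)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.2e}: {decision}")
```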

In [None]:
Q15. An electronics company produces a certain type of product with a mean weight of 5 pounds and a
standard deviation of 0.5 pounds. A random sample of 25 products is taken, and the sample mean weight
is found to be 4.8 pounds. Test the hypothesis that the true mean weight of the products is less than 5
pounds with a significance level of 0.01.
Ans-
    To test the hypothesis that the true mean weight of the products is less than 5 pounds, we can perform a one-sample t-test with the following null and alternative hypotheses:

H0: μ = 5
Ha: μ < 5

Where:

H0 is the null hypothesis that the true mean weight of the products is 5 pounds
Ha is the alternative hypothesis that the true mean weight of the products is less than 5 pounds
μ is the population mean weight of the products

We can use the following formula to calculate the t-statistic:

t = (x̄ - μ) / (s/√n)

Where:

x̄ is the sample mean weight of the products (4.8 pounds in this case)
μ is the hypothesized population mean weight of the products (5 pounds in this case)
s is the sample standard deviation (0.5 pounds in this case)
n is the sample size (25 in this case)

Plugging in the values, we get:

t = (4.8 - 5) / (0.5/√25)
t = -2

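The degrees of freedom for this test are (n-1) = 24. Using a t-distribution table or statistical software, we can find the one-tailed p-value associated with this t-statistic to be approximately 0.028. This p-value is the probability of observing a t-statistic of -2 or lower, assuming that the null hypothesis is true. (If the 0.5 pounds is instead treated as a known population standard deviation, a z-test applies; the statistic is still -2 and the one-tailed p-value is approximately 0.023, which leads to the same decision.)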

Since the significance level is 0.01, we reject the null hypothesis if the p-value is less than 0.01. In this case, the p-value is greater than 0.01, so we fail to reject the null hypothesis. This means that we do not have sufficient evidence to conclude that the true mean weight of the products is less than 5 pounds at a significance level of 0.01.
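
A short sketch of this test from the summary statistics follows. It computes the left-tailed p-value under the t-distribution as in the answer above and, for comparison, under the normal distribution, which would apply if 0.5 is taken as a known population standard deviation (an assumption about how to read the question).

```python
import math
from scipy import stats

# Summary statistics from the question
sample_mean = 4.8
mu0 = 5.0
sd = 0.5
n = 25
alpha = 0.01

# Test statistic (the same value for the t-test and the z-test)
t_stat = (sample_mean - mu0) / (sd / math.sqrt(n))

# Left-tailed p-values
p_value = stats.t.cdf(t_stat, n - 1)  # t-test, 24 degrees of freedom
p_value_z = stats.norm.cdf(t_stat)    # z-test, if the standard deviation is known

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"statistic = {t_stat:.2f}, p (t) = {p_value:.4f}, p (z) = {p_value_z:.4f}")
print(decision)
```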

In [None]:
Q16. Two groups of students are given different study materials to prepare for a test. The first group (n1 =
30) has a mean score of 80 with a standard deviation of 10, and the second group (n2 = 40) has a mean
score of 75 with a standard deviation of 8. Test the hypothesis that the population means for the two
groups are equal with a significance level of 0.01.
Ans-
    To test the hypothesis that the population means for the two groups are equal, we can perform a two-sample t-test with the following null and alternative hypotheses:

H0: μ1 - μ2 = 0
Ha: μ1 - μ2 ≠ 0

Where:

H0 is the null hypothesis that the population means for the two groups are equal
Ha is the alternative hypothesis that the population means for the two groups are not equal
μ1 is the population mean score of the first group
μ2 is the population mean score of the second group

We can use the following formula to calculate the t-statistic:

t = (x̄1 - x̄2) / √[s1^2/n1 + s2^2/n2]

Where:

x̄1 is the sample mean score of the first group (80 in this case)
x̄2 is the sample mean score of the second group (75 in this case)
s1 is the sample standard deviation of the first group (10 in this case)
s2 is the sample standard deviation of the second group (8 in this case)
n1 is the sample size of the first group (30 in this case)
n2 is the sample size of the second group (40 in this case)

Plugging in the values, we get:

t = (80 - 75) / √[(10^2/30) + (8^2/40)]
t = 5 / √(3.333 + 1.6)
t = 2.25

Because this formula does not pool the two variances (Welch's t-test), the degrees of freedom come from the Welch-Satterthwaite approximation, which gives about 54 here; the simpler value (n1 + n2 - 2) = 68 applies only when a pooled variance is used. In either case, using a t-distribution table or statistical software, the two-tailed p-value associated with this t-statistic is approximately 0.03. This p-value is the probability of observing a t-statistic at least as extreme as 2.25 in either direction, assuming that the null hypothesis is true.

Since the significance level is 0.01 and the p-value is already two-tailed, we compare it directly with 0.01 and reject the null hypothesis only if the p-value is smaller. In this case, the p-value is greater than 0.01, so we fail to reject the null hypothesis. This means that we do not have sufficient evidence to conclude that the population means for the two groups are different at a significance level of 0.01.
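
The same test can be run from the summary statistics with scipy's ttest_ind_from_stats. This sketch sets equal_var=False so that the unpooled standard error from the formula above is used (Welch's t-test); that choice is an assumption, since the question does not say whether the variances are equal.

```python
from scipy import stats

alpha = 0.01

# Summary statistics from the question; equal_var=False selects Welch's t-test
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=80, std1=10, nobs1=30,
    mean2=75, std2=8, nobs2=40,
    equal_var=False,
)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}: {decision}")
```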

In [None]:
Q17. A marketing company wants to estimate the average number of ads watched by viewers during a TV
program. They take a random sample of 50 viewers and find that the sample mean is 4 with a standard
deviation of 1.5. Estimate the population mean with a 99% confidence interval.
Ans-
   To estimate the population mean with a 99% confidence interval, we can use the following formula:

        Confidence interval = sample mean ± z*(standard error)

where z is the critical value from the standard normal distribution corresponding to the desired confidence level, and the standard error is given by:

Standard error = standard deviation / sqrt(sample size)

Plugging in the given values, we get:

    Standard error = 1.5 / sqrt(50) = 0.2121

The critical value corresponding to a 99% confidence level is obtained from the standard normal distribution as:

    z = 2.576

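Therefore, the 99% confidence interval is:

    4 ± 2.576*0.2121

    = 4 ± 0.546

    = (3.454, 4.546)

So, we can say with 99% confidence that the true population mean number of ads watched by viewers during a TV program lies between 3.454 and 4.546. Using the normal critical value is a reasonable approximation here because the sample size is 50; a t-based interval with 49 degrees of freedom would be slightly wider.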
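
The z-based interval can be checked with a few lines of code. This is a sketch using the summary statistics from the question; the variable names are illustrative.

```python
import math
from scipy import stats

# Summary statistics from the question
sample_mean = 4.0
sample_sd = 1.5
n = 50
confidence = 0.99

# Standard error of the mean
std_error = sample_sd / math.sqrt(n)

# Two-sided critical value from the standard normal distribution
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)

# Margin of error and interval bounds
margin = z_crit * std_error
lower, upper = sample_mean - margin, sample_mean + margin

print(f"z critical = {z_crit:.3f}, standard error = {std_error:.4f}")
print(f"99% CI: ({lower:.3f}, {upper:.3f})")
```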