
In [None]:
What is Estimation Statistics? Explain point estimate and interval estimate.

Estimation statistics is a branch of statistics that deals with the process of estimating population parameters (such as mean, variance, proportion)
based on sample data. In situations where it's not feasible or practical to collect data from an entire population,
statisticians use samples to make inferences about the entire population.

Point Estimate:
A point estimate is a single value that is used to approximate a population parameter. It's the most straightforward type of estimate and provides a single,
best guess for the value of the parameter. Point estimates are often calculated using sample statistics,
such as the sample mean, sample variance, or sample proportion.

Interval Estimate:
An interval estimate, also known as a confidence interval, provides a range of values that is likely to contain the population parameter.
It is built from the point estimate and the variability of the sample data (the standard error). The confidence level describes the procedure:
at a 95% confidence level, about 95% of the intervals constructed this way from repeated samples would contain the true population parameter.
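
As a minimal sketch of the two kinds of estimate, the snippet below computes a point estimate and a 95% interval estimate of a mean. The sample values are made up for illustration, and the interval assumes roughly normal data so that the t-distribution applies.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample of 10 measurements (made-up values for illustration)
sample = np.array([82, 91, 78, 85, 88, 94, 79, 86, 90, 77])

# Point estimate: a single best guess for the population mean
point_estimate = sample.mean()

# Interval estimate: 95% confidence interval based on the t-distribution
n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
lower, upper = stats.t.interval(0.95, n - 1, loc=point_estimate, scale=standard_error)

print("Point estimate of the mean:", point_estimate)
print("95% interval estimate:", (lower, upper))
```

The interval is centered on the point estimate, and its width grows with the sample variability and shrinks as the sample size increases.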


In [None]:
#Write a Python function to estimate the population mean using a sample mean and standard deviation.

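import scipy.stats as stats

def estimate_population_mean(sample_mean, sample_std_dev, sample_size, confidence_level=0.95):
    # Standard error of the mean
    standard_error = sample_std_dev / (sample_size ** 0.5)

    # Critical value from the t-distribution (n - 1 degrees of freedom),
    # because the population standard deviation is estimated from the sample
    critical_value = stats.t.ppf(1 - (1 - confidence_level) / 2, sample_size - 1)
    margin_of_error = critical_value * standard_error

    # The sample mean is the point estimate; the interval is centered on it
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    return lower_bound, upper_bound


sample_mean = 85
sample_std_dev = 10
sample_size = 30

lower_bound, upper_bound = estimate_population_mean(sample_mean, sample_std_dev, sample_size)
print("Point Estimate of Population Mean:", sample_mean)
print("95% Confidence Interval for Population Mean:", (lower_bound, upper_bound))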


In [None]:
#What is Hypothesis testing? Why is it used? State the importance of Hypothesis testing.


Hypothesis testing is a statistical method used to make decisions about a population parameter based on a sample of data.
It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and then using sample data to assess
the evidence against the null hypothesis. The goal is to determine whether there is enough evidence to reject the null hypothesis in favor of the
alternative hypothesis.

Importance of Hypothesis Testing:

Hypothesis testing plays a crucial role in the scientific method and data-driven decision-making. Here's why it's important:

Objectivity: Hypothesis testing provides an objective framework for making decisions based on evidence. It helps researchers
avoid making decisions solely based on intuition or personal bias.

Validation: It allows researchers to validate their theories, assumptions, and predictions. By comparing data to hypotheses,
researchers can determine whether their ideas are supported by empirical evidence.

Inference: Hypothesis testing allows researchers to make inferences about population parameters based on sample data.
This is particularly valuable when it's impractical to collect data from an entire population.

In [None]:
#Create a hypothesis that states whether the average weight of male college students is greater than the average weight of female college students.

The hypotheses for testing whether the average weight of male college students is greater than the average weight of female college students are:

Null Hypothesis (H0): The average weight of male college students is equal to or less than the average weight of female college students.

Alternative Hypothesis (Ha): The average weight of male college students is greater than the average weight of female college students.

In symbols:

H0: μ_male ≤ μ_female
Ha: μ_male > μ_female

Here, "μ_male" represents the population average weight of male college students, and "μ_female"
represents the population average weight of female college students.
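
A hedged sketch of how these one-sided hypotheses could be tested in Python follows. The weights are made-up values, not real survey data, and the `alternative` argument of `scipy.stats.ttest_ind` assumes SciPy 1.6 or newer. Welch's t-test (`equal_var=False`) is used here as one reasonable choice because it does not assume equal variances.

```python
import scipy.stats as stats

# Hypothetical weights in kg (made-up values, not real survey data)
male_weights = [72, 80, 68, 75, 77, 83, 70, 79]
female_weights = [58, 64, 61, 55, 66, 60, 63, 59]

# H0: mu_male <= mu_female, Ha: mu_male > mu_female (one-sided test).
# equal_var=False requests Welch's t-test, which does not assume equal variances.
t_statistic, p_value = stats.ttest_ind(
    male_weights, female_weights, equal_var=False, alternative='greater'
)

alpha = 0.05
print("T-statistic:", t_statistic)
print("One-sided p-value:", p_value)
if p_value < alpha:
    print("Reject H0: the data support a greater average weight for males")
else:
    print("Fail to reject H0")
```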

In [None]:
#Write a Python script to conduct a hypothesis test on the difference between two population means, given a sample from each population.

import scipy.stats as stats

def two_sample_t_test(sample1, sample2, alpha=0.05, alternative='two-sided'):
    # Sample sizes, means, and sample standard deviations (ddof=1)
    n1 = len(sample1)
    n2 = len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    std_dev1 = stats.tstd(sample1)
    std_dev2 = stats.tstd(sample2)

    # Pooled standard deviation (assumes equal population variances)
    pooled_variance = ((n1 - 1) * std_dev1**2 + (n2 - 1) * std_dev2**2) / (n1 + n2 - 2)
    pooled_std_dev = pooled_variance**0.5

    # Test statistic and its degrees of freedom
    t_statistic = (mean1 - mean2) / (pooled_std_dev * (1/n1 + 1/n2)**0.5)
    degrees_of_freedom = n1 + n2 - 2

    # p-value for the requested alternative hypothesis
    if alternative == 'two-sided':
        p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)
    elif alternative == 'greater':
        p_value = stats.t.sf(t_statistic, degrees_of_freedom)
    elif alternative == 'less':
        p_value = stats.t.cdf(t_statistic, degrees_of_freedom)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    # Decision at significance level alpha
    if p_value < alpha:
        result = "Reject the null hypothesis"
    else:
        result = "Fail to reject the null hypothesis"

    return t_statistic, p_value, result

sample1 = [67, 72, 65, 68, 70]
sample2 = [62, 58, 63, 59, 60]

alpha = 0.05
t_statistic, p_value, result = two_sample_t_test(sample1, sample2, alpha)

print("T-statistic:", t_statistic)
print("P-value:", p_value)
print(result)


In [None]:
#What is a null and alternative hypothesis? Give some examples.

Null Hypothesis (H0):
The null hypothesis represents the default or status quo assumption. It states that there is no effect, no difference, or no relationship between variables.
It often suggests that any observed difference or effect is due to random chance. The null hypothesis is the hypothesis that is initially assumed to be
true and is subject to testing against the alternative hypothesis.

Alternative Hypothesis (Ha):
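The alternative hypothesis represents what the researcher wants to find evidence for. It states that there is an effect, a difference, or a relationship
between variables. It is the hypothesis that challenges the null hypothesis and is supported when there is enough evidence from the data to reject
the null hypothesis.

Examples:
H0: A new drug has no effect on mean blood pressure. Ha: The drug changes mean blood pressure.
H0: A coin is fair (the probability of heads is 0.5). Ha: The coin is not fair (the probability of heads differs from 0.5).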

In [None]:
Write down the steps involved in hypothesis testing.

Hypothesis testing follows a structured set of steps to determine whether the sample data provide enough evidence to make a decision about a population parameter. The typical steps are:

Formulate Hypotheses:
State the null hypothesis (H0) and the alternative hypothesis (Ha) based on the research question. H0 represents the default assumption, while Ha represents what you're trying to find evidence for.

Select Significance Level (α):
Choose a significance level (α), which is the probability of making a Type I error (rejecting H0 when it's true). Commonly used values are 0.05, 0.01, or 0.10.

Collect Data:
Collect data through sampling or experimentation. Ensure that the data collection process is unbiased and representative of the population.

Choose a Test Statistic:
Select a test statistic that matches the type of hypothesis test being conducted (e.g., t-test, z-test, chi-square test).

Calculate the Test Statistic:
Compute the test statistic from the sample data using the chosen test formula. This statistic quantifies the difference between the sample data and what is expected under the null hypothesis.

Determine Critical Region:
Define the critical (rejection) region from the chosen significance level and the distribution of the test statistic (e.g., critical values from the t-distribution or z-distribution).

Calculate P-Value:
Calculate the p-value, the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is true. The p-value indicates the strength of evidence against H0.

Compare P-Value and Significance Level:
Compare the calculated p-value with the chosen significance level (α). If p-value ≤ α, there is sufficient evidence to reject H0. If p-value > α, there is insufficient evidence to reject H0.

Make a Decision:
Based on that comparison, reject H0 in favor of Ha (p-value ≤ α) or fail to reject H0 (p-value > α). Checking whether the test statistic falls in the critical region leads to the same decision.

Draw Conclusion:
Summarize the results of the hypothesis test in the context of the problem. State whether there is statistically significant evidence to support the alternative hypothesis.

Communicate Results:
Clearly communicate the results, including the conclusion and any insights gained from the hypothesis test. Provide relevant statistics, p-values, and any implications for the field of study.
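
The steps above can be sketched as a one-sample t-test. This is only an illustration: the data and the hypothesized mean of 50 are made-up assumptions, and a two-sided alternative is chosen arbitrarily.

```python
import numpy as np
import scipy.stats as stats

# Step 1: hypotheses. H0: mu = 50, Ha: mu != 50 (50 is an assumed value)
hypothesized_mean = 50

# Step 2: significance level
alpha = 0.05

# Step 3: collect data (made-up sample for illustration)
sample = np.array([52, 55, 49, 58, 53, 51, 56, 54, 57, 50])

# Steps 4-5: choose and compute the test statistic (one-sample t)
n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_statistic = (sample.mean() - hypothesized_mean) / standard_error

# Step 6: critical value for a two-sided test with n - 1 degrees of freedom
critical_value = stats.t.ppf(1 - alpha / 2, n - 1)

# Step 7: two-sided p-value
p_value = 2 * stats.t.sf(abs(t_statistic), n - 1)

# Steps 8-9: compare the p-value with alpha and decide
decision = "Reject H0" if p_value <= alpha else "Fail to reject H0"

# Steps 10-11: report the result
print("t-statistic:", t_statistic)
print("Critical value:", critical_value)
print("p-value:", p_value)
print("Decision:", decision)
```

The critical-region rule (|t| greater than the critical value) and the p-value rule (p-value ≤ α) are two views of the same decision.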

In [None]:
#Define p-value and explain its significance in hypothesis testing.

The p-value (probability value) quantifies the strength of evidence against the null hypothesis (H0).
It is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the sample data, assuming that the null
hypothesis is true. A small p-value means the observed data would be unusual if H0 were correct, so a p-value at or below the significance level (α)
leads to rejecting H0. The p-value is not the probability that H0 is true.
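
To make the definition concrete, here is a small sketch that turns an observed t-statistic into p-values. The observed value 2.5 and the 18 degrees of freedom are hypothetical numbers chosen for illustration.

```python
import scipy.stats as stats

# Hypothetical observed test statistic and degrees of freedom
t_observed = 2.5
df = 18

# P(T >= t_observed) under H0: the p-value for Ha "greater"
p_right = stats.t.sf(t_observed, df)

# P(T <= t_observed) under H0: the p-value for Ha "less"
p_left = stats.t.cdf(t_observed, df)

# P(|T| >= |t_observed|) under H0: the p-value for a two-sided Ha
p_two_sided = 2 * stats.t.sf(abs(t_observed), df)

print("Right-tailed p-value:", p_right)
print("Left-tailed p-value:", p_left)
print("Two-sided p-value:", p_two_sided)
```

Using the survival function `sf` instead of `1 - cdf` avoids a loss of precision when the p-value is very small.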

In [None]:
#Generate a Student's t-distribution plot using Python's matplotlib library, with the degrees of freedom parameter set to 10.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Degrees of freedom
df = 10

# Generate a range of x values
x = np.linspace(-4, 4, 400)

# Calculate the corresponding probability density function (PDF) values
pdf_values = stats.t.pdf(x, df)

# Plot the t-distribution
plt.plot(x, pdf_values, label=f"df = {df}")
plt.title("Student's t-Distribution")
plt.xlabel("x")
plt.ylabel("PDF")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
#Write a Python program to calculate the two-sample t-test for independent samples, given two random samples of equal size and a null hypothesis that the population means are equal.
import numpy as np
import scipy.stats as stats

def two_sample_t_test(sample1, sample2):
    # This implementation assumes both samples have the same size
    if len(sample1) != len(sample2):
        raise ValueError("sample1 and sample2 must have the same size")

    # Calculate sample statistics
    n = len(sample1)
    mean1 = np.mean(sample1)
    mean2 = np.mean(sample2)
    var1 = np.var(sample1, ddof=1)  # Use ddof=1 for sample variance
    var2 = np.var(sample2, ddof=1)

    # Pooled standard deviation; with equal sample sizes the pooled
    # variance is the average of the two sample variances
    pooled_std_dev = np.sqrt((var1 + var2) / 2)

    # Calculate the t-statistic
    t_statistic = (mean1 - mean2) / (pooled_std_dev * np.sqrt(2/n))

    # Calculate the degrees of freedom
    degrees_of_freedom = 2 * n - 2

    # Two-sided p-value for H0: the population means are equal
    p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)

    return t_statistic, p_value

# Generate random samples
np.random.seed(42)
sample_size = 20
sample1 = np.random.normal(loc=10, scale=2, size=sample_size)
sample2 = np.random.normal(loc=11, scale=2, size=sample_size)

# Perform two-sample t-test
t_statistic, p_value = two_sample_t_test(sample1, sample2)

# Print results
print("Sample 1 Mean:", np.mean(sample1))
print("Sample 2 Mean:", np.mean(sample2))
print("Calculated t-statistic:", t_statistic)
print("Calculated p-value:", p_value)

# Compare p-value with significance level (e.g., alpha = 0.05) for hypothesis testing
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


In [None]:
#What is Student’s t distribution? When to use the t-Distribution.

The Student's t-distribution, often called simply the t-distribution, is a probability distribution used for hypothesis tests and confidence
intervals about a mean when the population standard deviation is unknown and must be estimated from the sample, which matters most when the sample
size is small. Like the standard normal distribution, it is bell-shaped and symmetric about zero, but it has heavier tails, which reflect the extra
uncertainty from estimating the standard deviation.

The t-distribution is parameterized by its degrees of freedom (df), which determine the shape of the distribution.
As the degrees of freedom increase, the t-distribution approaches the standard normal distribution (z-distribution).
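
One way to see this convergence is to compare two-sided 95% critical values. The sketch below uses an arbitrary set of degrees of freedom chosen for illustration.

```python
import scipy.stats as stats

# Critical value of the standard normal distribution (two-sided 95%)
z_critical = stats.norm.ppf(0.975)

# Critical values of the t-distribution for increasing degrees of freedom
dfs = [2, 5, 10, 30, 100, 1000]
t_criticals = [stats.t.ppf(0.975, df) for df in dfs]

for df, t_crit in zip(dfs, t_criticals):
    print(f"df = {df:>4}: t critical = {t_crit:.4f}")
print(f"normal   : z critical = {z_critical:.4f}")
```

The t critical values are always larger than the z critical value because of the heavier tails, and the gap shrinks as the degrees of freedom grow.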

In [None]:
#What is t-statistic? State the formula for t-statistic.

The t-statistic is a test statistic that is used in hypothesis testing to assess whether there is a significant difference between sample means or to
determine if a sample mean is significantly different from a known or hypothesized population mean. It quantifies the difference between the sample
data and the null hypothesis and is used to make a decision about whether to reject or fail to reject the null hypothesis.

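The formula for the t-statistic depends on the type of t-test being conducted (e.g., one-sample, two-sample, paired). In every case the population
standard deviation is unknown and is estimated from the sample; when it is known, a z-statistic is used instead.

One-sample t-test:
t = (x̄ - μ0) / (s / √n)

Here, x̄ is the sample mean, μ0 is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
Under H0, t follows a t-distribution with n - 1 degrees of freedom.

Two-sample t-test with pooled variance:
t = (x̄1 - x̄2) / (s_p * √(1/n1 + 1/n2)), where s_p² = ((n1 - 1) * s1² + (n2 - 1) * s2²) / (n1 + n2 - 2)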
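
As a quick check of the one-sample formula t = (x̄ - μ0) / (s / √n), the sketch below computes it by hand and compares the result with `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean of 5.0 are made up for illustration.

```python
import numpy as np
import scipy.stats as stats

# Made-up sample and hypothesized population mean
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7])
mu_0 = 5.0

# Manual calculation: t = (x_bar - mu_0) / (s / sqrt(n))
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Same statistic from SciPy for comparison
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu_0)

print("Manual t-statistic:", t_manual)
print("SciPy t-statistic:", t_scipy)
print("SciPy two-sided p-value:", p_scipy)
```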