# Table of Contents

1. [Estimation (Point) in Inferential Statistics](#estimation-point-in-inferential-statistics)
2. [Estimation (Interval) in Inferential Statistics](#estimation-interval-in-inferential-statistics)
3. [Hypothesis Testing in Inferential Statistics](#hypothesis-testing-in-inferential-statistics)

# Estimation (Point) in Inferential Statistics

Estimation is a fundamental concept in inferential statistics, used to make predictions or inferences about a population based on sample data. Point estimation involves using sample data to calculate a single value (known as a point estimate) that serves as the best estimate of an unknown population parameter (e.g., population mean, population proportion).

#### Mathematical Formula

The formula for estimating the population mean ($\mu$) using the sample mean ($\bar{x}$) is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$

where $n$ is the sample size, and $x_i$ are the individual sample values.

#### Explanation of Steps

1. **Collect a Sample**: Randomly select a sample of $n$ observations from the population.
2. **Calculate the Sample Mean**: Use the sample data to calculate the sample mean ($\bar{x}$), which serves as the point estimate of the population mean ($\mu$).
3. **Use the Sample Mean as the Point Estimate**: The calculated sample mean is used as the best estimate of the population mean.
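
The three steps above can be sketched end to end on simulated data. This is a minimal illustration, not part of the original analysis: the population below is hypothetical (normally distributed with an assumed mean of 3000 and standard deviation of 800), and in practice the population mean would be unknown.

```python
import numpy as np

# Hypothetical population of 10,000 daily values (simulated, not real data)
rng = np.random.default_rng(seed=0)
population = rng.normal(loc=3000, scale=800, size=10_000)

# Step 1: randomly select n observations without replacement
n = 30
sample = rng.choice(population, size=n, replace=False)

# Step 2: compute the sample mean, x_bar = (1/n) * sum(x_i)
x_bar = sample.sum() / n

# Step 3: report x_bar as the point estimate of the population mean
print(f"Point estimate (sample mean): {x_bar:.1f}")
print(f"Population mean (normally unknown): {population.mean():.1f}")
```

The two printed values will generally differ somewhat, because a different random sample would give a different point estimate.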

#### Business Scenario: Estimating Average Daily Active Users in a Software Application

Imagine you're a data analyst at a software technology company, and you're tasked with estimating the average daily active users (DAU) for a new application. The company has launched the app recently, and you want to estimate the average DAU to forecast server load and guide marketing strategies.

To accomplish this, you decide to collect data on the daily active users for a sample of 30 days after the app's launch.

#### Python Code Example

The following Python code demonstrates point estimation for this scenario. Because no real usage data is available here, it simulates the 30 days of daily active users with randomly generated integers stored in a pandas DataFrame.


In [21]:
import pandas as pd
import numpy as np

# Simulate sample data for daily active users (DAU)
np.random.seed(42)
sample_size = 30
dau_sample = np.random.randint(1000, 5000, size=sample_size)

# Create a pandas DataFrame
df = pd.DataFrame(dau_sample, columns=['daily_active_users'])

# Calculate the sample mean (point estimate for the average DAU)
sample_mean = df['daily_active_users'].mean().astype(int)

print(f"The point estimate for the average daily active users is: {sample_mean} users.")

The point estimate for the average daily active users is: 3048 users.


#### Interpretation

The calculated sample mean provides us with a point estimate of the average daily active users (DAU) for the new application. This estimate can be used to make informed decisions about server capacity planning and to tailor marketing strategies to engage more users.

A point estimate is simple to compute and communicate, which makes it useful for quick business decisions. On its own, however, it says nothing about how far it might be from the true population mean; a different 30-day sample would produce a different value.


# Estimation (Interval) in Inferential Statistics

Interval estimation provides a range (interval) of values within which the parameter is expected to lie. This interval is calculated from the sample data and provides an estimate of the parameter with a certain level of confidence (e.g., 95%).

#### Mathematical Formula

The formula for a confidence interval for a population mean, assuming a large sample size or known population standard deviation, is:

$$\bar{x} \pm Z \times \frac{\sigma}{\sqrt{n}}$$

where:
- $\bar{x}$ is the sample mean,
- $Z$ is the Z-score corresponding to the desired confidence level (e.g., 1.96 for 95% confidence),
- $\sigma$ is the population standard deviation (or an estimate from the sample if the population standard deviation is unknown),
- $n$ is the sample size.
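
As a rough sketch of how the formula translates to code, the hypothetical helper below (`mean_confidence_interval` is an illustrative name, not a library function) computes the interval from raw sample values, using the sample standard deviation as the estimate of $\sigma$. It also offers a t-based critical value, which is the usual choice when $\sigma$ is unknown and the sample is small; the data is simulated.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95, use_t=False):
    """Return (lower, upper) bounds for the mean of `values`.

    The sample standard deviation (ddof=1) stands in for sigma.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    x_bar = values.mean()
    std_err = values.std(ddof=1) / np.sqrt(n)
    tail = 1 - (1 - confidence) / 2  # e.g. 0.975 for 95% confidence
    if use_t:
        critical = stats.t.ppf(tail, df=n - 1)  # Student's t critical value
    else:
        critical = stats.norm.ppf(tail)         # Z critical value
    margin = critical * std_err
    return x_bar - margin, x_bar + margin

# Hypothetical sample of 30 daily counts (simulated)
rng = np.random.default_rng(seed=1)
values = rng.integers(1000, 5000, size=30)

z_low, z_high = mean_confidence_interval(values)
t_low, t_high = mean_confidence_interval(values, use_t=True)
print(f"z-interval: ({z_low:.0f}, {z_high:.0f})")
print(f"t-interval: ({t_low:.0f}, {t_high:.0f})")
```

The t-interval is slightly wider than the z-interval for the same data, reflecting the extra uncertainty from estimating the standard deviation.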

#### Business Scenario: Estimating Average Daily Active Users with a Confidence Interval

Given the same business scenario as before, we now wish to provide an interval estimate for the average DAU to understand the range within which the true average DAU likely falls.

#### Python Code Example


In [27]:
from scipy.stats import norm
import math

# Assuming a 95% confidence level
z_score = norm.ppf(0.975) # Two-tailed Z-score for 95% confidence
std_dev = df['daily_active_users'].std() # Sample standard deviation
margin_of_error = z_score * (std_dev / np.sqrt(sample_size))

confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
floored_confidence_interval = (math.floor(confidence_interval[0]), math.floor(confidence_interval[1]))

print(f"The 95% confidence interval for the average daily active users is: {floored_confidence_interval}")

The 95% confidence interval for the average daily active users is: (2643, 3452)


#### Interpretation

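This confidence interval gives a range of plausible values for the true average DAU. The 95% refers to the procedure rather than to this one interval: if we repeatedly drew samples and built an interval from each in the same way, about 95% of those intervals would contain the true average. The width of the interval indicates the precision of the estimate, which depends on the variability in the sample data and on the sample size.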
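
The meaning of "95% confident" can be checked with a small simulation. The sketch below assumes a hypothetical normal population with a known mean and standard deviation (values chosen only for illustration), builds a 95% interval from each of many samples, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=2)
true_mean, true_sd = 3000.0, 800.0  # hypothetical population parameters
n, n_trials = 30, 2000
z = norm.ppf(0.975)                 # two-tailed Z-score for 95% confidence
margin = z * true_sd / np.sqrt(n)   # sigma is treated as known here

covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, true_sd, size=n)
    x_bar = sample.mean()
    # Count the interval if it contains the true mean
    if x_bar - margin <= true_mean <= x_bar + margin:
        covered += 1

coverage = covered / n_trials
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, which is what the confidence level promises about the procedure over many repetitions.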

# Hypothesis Testing in Inferential Statistics

Hypothesis testing is a pivotal method in inferential statistics that enables decision-making based on data analysis. It involves formulating a null hypothesis, which posits no effect or no difference, and an alternative hypothesis, which suggests some effect or difference. By analyzing sample data, we determine how likely it would be to observe results at least as extreme as ours if the null hypothesis were true. This probability is the p-value; it is not the probability that the null hypothesis itself is true.

#### Business Scenario: Impact of a New Treatment on Billing Amounts for Hypertension Patients

In this scenario, our healthcare facility has introduced a new treatment protocol for hypertension. We aim to assess whether this new treatment has led to a change in the billing amounts for patients diagnosed with hypertension. To do this, we will compare the billing amounts before and after the introduction of the new treatment.

##### Steps for Hypothesis Testing

1. **Formulate Hypotheses**:
   - Null Hypothesis ($H_0$): The mean billing amount for hypertension patients is the same before and after the new treatment (the treatment has no effect on billing amounts).
   - Alternative Hypothesis ($H_1$): The mean billing amount for hypertension patients differs before and after the new treatment.

2. **Select an Appropriate Test**:
   Assuming we have data on billing amounts for a similar number of patients before and after the treatment's introduction, a two-sample T-test (for independent samples) could be appropriate. If the billing amounts are paired (e.g., the same patients before and after treatment), a paired T-test would be more suitable.

3. **Determine the Significance Level ($\alpha$)**:
   Typically, a significance level of 0.05 is used.

4. **Calculate the Test Statistic and P-value**:
   This involves calculating the mean billing amount before and after the treatment and using the selected T-test to compute the p-value.

5. **Make a Decision**:
   If the p-value is less than the significance level, we reject the null hypothesis, suggesting the new treatment has a statistically significant effect on billing amounts. Otherwise, we fail to reject the null hypothesis.
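
The steps above can be sketched on simulated data before touching a real dataset. Everything in this block is hypothetical: the billing amounts are random draws, and `after_same_patients` is an invented variable that mimics a paired design. It shows both test choices from step 2, using `ttest_ind` for independent groups and `ttest_rel` for paired measurements.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(seed=3)
alpha = 0.05  # Step 3: significance level

# Hypothetical billing amounts for two independent groups (simulated)
before = rng.normal(25000, 5000, size=200)
after = rng.normal(25000, 5000, size=200)

# Step 4, independent samples: Welch's two-sample t-test
t_ind, p_ind = ttest_ind(before, after, equal_var=False)

# Step 4, paired design: the same 200 patients measured twice
after_same_patients = before + rng.normal(0, 1000, size=200)
t_rel, p_rel = ttest_rel(before, after_same_patients)

# Step 5: compare each p-value with alpha
for name, p in [("independent", p_ind), ("paired", p_rel)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.3f} -> {decision}")
```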

#### Python Implementation

Here is a simplified Python example using hypothetical data. You'll need to adjust this to fit the actual structure of your dataset and specifically filter for 'Hypertension' patients before and after the treatment introduction.

In [8]:
import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset (adjust the path to match your project layout)
df = pd.read_csv('../../data/healthcare_dataset.csv')

# Parse admission dates so the comparisons below are chronological, not string-based
df['Date of Admission'] = pd.to_datetime(df['Date of Admission'])

# Example criteria for demonstration (you'll need actual criteria from your dataset)
# Let's say the new treatment was introduced on '2022-01-01'
treatment_start_date = pd.Timestamp('2022-01-01')

# Filter hypertension patients admitted before and after the new treatment introduction
df_hypertension = df[df['Medical Condition'] == 'Hypertension']
df_before = df_hypertension[df_hypertension['Date of Admission'] < treatment_start_date]
df_after = df_hypertension[df_hypertension['Date of Admission'] >= treatment_start_date]

# Calculating mean billing amounts for demonstration
mean_before = df_before['Billing Amount'].mean()
mean_after = df_after['Billing Amount'].mean()

# Conducting the two-sample T-test
t_stat, p_value = ttest_ind(df_before['Billing Amount'], df_after['Billing Amount'], equal_var=False)  # assuming unequal variances

print(f"Mean Billing Amount Before Treatment: {mean_before}")
print(f"Mean Billing Amount After Treatment: {mean_after}")
print(f"T-statistic: {t_stat}, P-value: {p_value}")



Mean Billing Amount Before Treatment: 25177.749278580923
Mean Billing Amount After Treatment: 25234.25185912228
T-statistic: -0.07890450754533705, P-value: 0.937121107277069


#### Interpretation

The T-statistic and P-value are the two quantities used to interpret the test. The T-statistic here is -0.0789: the difference between the two sample means divided by the standard error of that difference. Its negative sign reflects that the mean billing amount before the treatment is lower than the mean after, and its magnitude, far below 1, means the observed difference is very small relative to the sampling variability.
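
To make the T-statistic less of a black box, the sketch below computes Welch's statistic by hand as the difference in sample means divided by its standard error, and compares it with the value returned by `ttest_ind`. The two arrays are simulated stand-ins for the billing amounts, with assumed sizes and spread; they are not the dataset used above.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=4)
before = rng.normal(25000, 14000, size=500)  # hypothetical billing amounts
after = rng.normal(25000, 14000, size=450)

# Difference in sample means (same order as the ttest_ind arguments)
diff = before.mean() - after.mean()

# Standard error of the difference, without assuming equal variances
std_err = np.sqrt(before.var(ddof=1) / before.size + after.var(ddof=1) / after.size)

t_manual = diff / std_err
t_scipy, _ = ttest_ind(before, after, equal_var=False)

print(f"Manual T-statistic: {t_manual:.4f}")
print(f"scipy T-statistic:  {t_scipy:.4f}")
```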

The P-value of 0.9371 is much higher than the commonly used significance level of 0.05. In hypothesis testing, a P-value greater than the significance level leads us to fail to reject the null hypothesis. In this context, the null hypothesis posits that there is no difference in mean billing amounts before and after the introduction of the new treatment for hypertension.

Therefore, based on the P-value obtained, we conclude that there is not enough statistical evidence to suggest that the new treatment had a significant effect on the billing amounts for patients with hypertension. The slight difference observed in the mean billing amounts before and after the treatment is not statistically significant.
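
One way to see how small a difference might be, beyond the yes/no test decision, is a confidence interval for the difference in means. The sketch below is an illustration on simulated data (the arrays, sizes, and spread are assumptions, not the healthcare dataset). It uses the Welch-Satterthwaite approximation for the degrees of freedom, which matches the unequal-variance test used earlier, so the 95% interval contains zero exactly when the two-sided p-value is at least 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
before = rng.normal(25000, 14000, size=500)  # hypothetical billing amounts
after = rng.normal(25000, 14000, size=450)

# Per-group variance of the sample mean
v1 = before.var(ddof=1) / before.size
v2 = after.var(ddof=1) / after.size

diff = after.mean() - before.mean()
std_err = np.sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (v1 + v2) ** 2 / (v1 ** 2 / (before.size - 1) + v2 ** 2 / (after.size - 1))

t_crit = stats.t.ppf(0.975, df=dof)
ci_low, ci_high = diff - t_crit * std_err, diff + t_crit * std_err

_, p_value = stats.ttest_ind(before, after, equal_var=False)
print(f"Difference in means (after - before): {diff:.2f}")
print(f"95% CI for the difference: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Welch p-value: {p_value:.4f}")
```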

From a billing perspective, then, the data show no detectable change in the cost of care for hypertension patients after the new treatment was introduced. Failing to reject the null hypothesis is not proof that the treatment has no effect; it only means this sample does not provide evidence of one. It's also important to note that this analysis focuses solely on billing amounts and does not address the clinical effectiveness or patient outcomes related to the new treatment. Further studies may be necessary to evaluate the treatment's effectiveness and other impacts on patient care beyond the financial aspect.