
# Hypothesis Testing for Case Studies


# Rationale and Approach

> In this assignment, my approach involved applying hypothesis testing to various organisational scenarios to solve business challenges and guide strategic decision-making.
>
>I began by identifying the problem statement and the objective for each scenario, then selected an appropriate statistical test to evaluate the hypothesis.
>
>For scenarios involving comparisons between groups, I used tests like the independent two-sample t-test or the one-way ANOVA, while for categorical relationships, I applied the chi-square test. The rationale behind each test selection was based on the type of data and the assumptions that needed to be met, ensuring the results would be valid and reliable.
>
>By carefully analysing the outcomes, I was able to draw meaningful conclusions and provide actionable recommendations, demonstrating critical thinking and problem-solving skills throughout the process.

# Scenario 1: Loan Repayment Affordability with Commission

- **Problem Statement:** A financial institution needs to determine if an applicant's base salary, supplemented by sales commission, is sufficient to meet loan repayment requirements. The base salary alone may not be enough, so the additional commission is crucial to achieving affordability.
- **Objective:** Determine whether the applicant's base salary plus the commission meets the required threshold for loan affordability. This involves testing the impact of fluctuations in commission on the applicant's ability to repay the loan.
- **Statistical Test:** One-sample, one-tailed t-test.
- **Reason for Test Selection:** A one-sample, one-tailed t-test assesses whether a sample mean is significantly greater than a specified value, accounting for variability in the data. This test is appropriate for evaluating whether the average commission is high enough to supplement the base salary for loan repayment.
- **Value to the Organisation:** This analysis helps the financial institution gauge whether the applicant can reliably meet loan repayments without conducting a full affordability test, which leaves a credit footprint. The outcome informs loan approval decisions and mitigates the risk of default due to income fluctuations.


## Hypothesis
  - $H_0$: The mean commission for a cosmetics salesperson is less than or equal to £500 per month.
  - $H_a$: The mean commission for a cosmetics salesperson is greater than £500 per month.
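
To make the test statistic less of a black box, the calculation can be sketched by hand. This is a minimal illustration, assuming the same 36 monthly values used in this scenario; the variable names (`threshold`, `t_manual`, `p_manual`) are my own and not part of any library. It computes $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ and the matching one-tailed p-value from the t-distribution with $n - 1$ degrees of freedom.

```python
import numpy as np
import scipy.stats as stats

# Same 36 monthly commission values (in GBP) used in this scenario.
monthly_commission = [
    480, 520, 540, 490, 510, 525, 515, 505, 500, 480, 515, 530,
    520, 530, 540, 525, 550, 460, 570, 490, 545, 535, 540, 545,
    550, 555, 560, 565, 470, 575, 580, 485, 590, 595, 500, 505
]

threshold = 500  # Hypothesised mean under H0 (GBP per month).
n = len(monthly_commission)
sample_mean = np.mean(monthly_commission)
sample_sd = np.std(monthly_commission, ddof=1)  # Sample standard deviation.
standard_error = sample_sd / np.sqrt(n)

# t = (sample mean - hypothesised mean) / standard error.
t_manual = (sample_mean - threshold) / standard_error

# One-tailed p-value: probability of a t-value at least this large under H0.
p_manual = stats.t.sf(t_manual, df=n - 1)

print("Sample mean:", sample_mean)
print("Manual t-statistic:", t_manual)
print("Manual one-tailed p-value:", p_manual)
```

The manual values should agree with those returned by `stats.ttest_1samp` with `alternative='greater'`.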


In [None]:
import scipy.stats as stats

# Sample data for monthly commission over 36 months (in GBP).
monthly_commission = [
    480, 520, 540, 490, 510, 525, 515, 505, 500, 480, 515, 530,
    520, 530, 540, 525, 550, 460, 570, 490, 545, 535, 540, 545,
    550, 555, 560, 565, 470, 575, 580, 485, 590, 595, 500, 505
]

# Perform a one-tailed t-test.
t_statistic, p_value = stats.ttest_1samp(monthly_commission, popmean=500, alternative='greater')

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Define the null and alternative hypotheses.
null_hypothesis = "The mean commission is less than or equal to £500 per month."
alternative_hypothesis = "The mean commission is greater than £500 per month."

# Print the hypotheses, test results and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Null Hypothesis: The mean commission is less than or equal to £500 per month.
Alternative Hypothesis: The mean commission is greater than £500 per month.
Significance Level (alpha): 0.05
T-Statistic: 4.825292252995393
P-Value: 1.3582107426823953e-05
Reject the null hypothesis.


### Reporting results

The initial suitability test performed on your average commission for last year, along with potential future earnings from commission, shows that you are likely to be accepted for a loan after the full affordability tests are completed. As a reminder, the full credit check will leave a record on your credit file. Would you like to proceed with the application?

# Scenario 2: Product price comparison

- **Problem statement:** A retail company wants to determine whether there is a significant difference in the average price of a product between two different stores (Store A – the business; and Store B – the competitor).
- **Objective:** Test whether the average price of the product differs significantly between Store A and Store B.
- **Statistical test:** Independent two-sample t-test.
- **Reason for test selection:** An independent two-sample t-test is used to assess whether there is a significant difference between the means of two independent groups (prices at Store A and Store B). The assumptions of the independent t-test, including normality and homoscedasticity, are assumed to be met.
- **Value to the organisation:** This information can influence pricing strategies and competitive analysis, leading to informed decisions about product pricing and placement.

## Hypothesis
  - $H_0$: There is no significant difference in average product price between Store A and Store B.
  - $H_a$: There is a significant difference in average product price between Store A and Store B.
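
The normality and homoscedasticity assumptions can be checked rather than taken on trust. The sketch below is one possible way to do this, assuming the same price samples used in this scenario: it applies the Shapiro-Wilk test to each store, Levene's test to the pair, and runs Welch's t-test as a fallback that does not require equal variances. The choice of these particular checks is my assumption, not a requirement of the t-test itself.

```python
import scipy.stats as stats

# Same price samples used in this scenario.
store_A_prices = [50, 55, 60, 45, 48, 52, 57, 59, 53, 50, 58, 54, 51, 56, 55]
store_B_prices = [55, 52, 58, 54, 50, 56, 53, 59, 55, 57, 60, 58, 53, 55, 57]

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution.
shapiro_A = stats.shapiro(store_A_prices)
shapiro_B = stats.shapiro(store_B_prices)

# Levene's test: H0 is that both groups have equal variances.
levene_result = stats.levene(store_A_prices, store_B_prices)

# Welch's t-test: a two-sample t-test that does not assume equal variances.
welch_t, welch_p = stats.ttest_ind(store_A_prices, store_B_prices, equal_var=False)

print("Shapiro-Wilk p-value (Store A):", shapiro_A.pvalue)
print("Shapiro-Wilk p-value (Store B):", shapiro_B.pvalue)
print("Levene p-value:", levene_result.pvalue)
print("Welch t-statistic:", welch_t)
print("Welch p-value:", welch_p)
```

A small p-value from Shapiro-Wilk or Levene's test would suggest that the corresponding assumption is doubtful, in which case the Welch result is the safer one to report.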

In [None]:
# Import the necessary library.
import scipy.stats as stats

# Sample data for prices at Store A and Store B.
store_A_prices = [50, 55, 60, 45, 48, 52, 57, 59, 53, 50, 58, 54, 51, 56, 55]
store_B_prices = [55, 52, 58, 54, 50, 56, 53, 59, 55, 57, 60, 58, 53, 55, 57]

# Perform the independent two-sample t-test.
t_statistic, p_value = stats.ttest_ind(store_A_prices, store_B_prices)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Define the null and alternative hypotheses.
null_hypothesis = "There is no significant difference in average product price between Store A and Store B"
alternative_hypothesis = "There is a significant difference in average product price between Store A and Store B"

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Null Hypothesis: There is no significant difference in average product price between Store A and Store B
Alternative Hypothesis: There is a significant difference in average product price between Store A and Store B
Significance Level (alpha): 0.05
T-Statistic: -1.4777027242467613
P-Value: 0.15064852874704232
Fail to reject the null hypothesis.


### Reporting results
After comparing the prices from Store A and Store B, we did not find a significant difference. This suggests that, based on the data we have, the prices are generally similar between the two stores.

To gain a clearer understanding, we suggest collecting more price data from both stores to see if there's any real difference.

If further analysis confirms that there's no significant difference, we might be able to optimise pricing strategies to make your prices more competitive compared to Store B. This could involve finding the optimal price to maximise revenue while remaining competitive.

Would you like us to continue with a more detailed analysis?

# Scenario 3: Employee productivity

- **Problem statement:** The HR department of a company wants to determine whether there is a significant difference in the average productivity levels of employees across three different departments (sales, marketing, and finance).
- **Objective:** Test whether the average productivity levels vary significantly across the three departments.
- **Statistical test:** One-way ANOVA.
- **Reason for test selection:** One-way ANOVA compares the mean productivity levels across more than two groups (departments). It determines whether there is a significant difference in means, which is the first step towards identifying which department(s) may require specific attention or improvements. The assumptions of one-way ANOVA, including independent groups, normality, and homoscedasticity, are assumed to be met.
- **Value to the organisation:** Identifying which department might need targeted improvements should lead to better resource allocation and increased efficiency.



## Hypothesis
- $H_0$: There is no significant difference in the average productivity levels across the sales, marketing, and finance departments.
- $H_a$: At least one department has a significantly different average productivity level compared to the others.

In [None]:
import scipy.stats as stats

# Sample data for productivity in three departments.
sales_productivity = [100, 110, 105, 120, 115, 108, 112, 107, 118, 103, 105, 115, 110]
marketing_productivity = [90, 95, 88, 92, 85, 87, 93, 89, 94, 91, 86, 92, 91]
finance_productivity = [80, 75, 82, 78, 85, 86, 81, 79, 83, 87, 84, 79, 82]

# Perform one-way ANOVA.
f_statistic, p_value = stats.f_oneway(sales_productivity, marketing_productivity, finance_productivity)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Define the null and alternative hypotheses.
null_hypothesis = "There is no significant difference in the average productivity levels across the sales, marketing, and finance departments."
alternative_hypothesis = "At least one department has a significantly different average productivity level compared to the others."

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Null Hypothesis: There is no significant difference in the average productivity levels across the sales, marketing, and finance departments.
Alternative Hypothesis: At least one department has a significantly different average productivity level compared to the others.
Significance Level (alpha): 0.05
F-Statistic: 142.9988771614636
P-Value: 7.448929132713286e-18
Reject the null hypothesis.


### Reporting results
> After comparing the productivity levels across the sales, marketing, and finance departments, we found evidence of a significant difference. This indicates that at least one department has a notably different average productivity level compared to the others.
>
>Given this finding, we recommend conducting further analysis to identify which specific departments need targeted improvements. Understanding these differences can help you make informed decisions about resource allocation and potentially improve overall efficiency.
>
>Would you like us to proceed with more detailed analysis?
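
ANOVA only indicates that at least one mean differs; it does not say which. One simple way to follow up is sketched below, assuming the same productivity samples used in this scenario: pairwise independent t-tests with a Bonferroni correction. The Bonferroni approach is my choice for illustration (it is conservative), and other post-hoc methods such as Tukey's HSD would also be reasonable.

```python
from itertools import combinations

import scipy.stats as stats

# Same productivity samples used in this scenario.
departments = {
    "Sales": [100, 110, 105, 120, 115, 108, 112, 107, 118, 103, 105, 115, 110],
    "Marketing": [90, 95, 88, 92, 85, 87, 93, 89, 94, 91, 86, 92, 91],
    "Finance": [80, 75, 82, 78, 85, 86, 81, 79, 83, 87, 84, 79, 82],
}

alpha = 0.05
pairs = list(combinations(departments, 2))

adjusted_p_values = {}
for dept_1, dept_2 in pairs:
    _, p_value = stats.ttest_ind(departments[dept_1], departments[dept_2])
    # Bonferroni correction: multiply by the number of comparisons, cap at 1.
    adjusted_p = min(p_value * len(pairs), 1.0)
    adjusted_p_values[(dept_1, dept_2)] = adjusted_p
    verdict = "differ" if adjusted_p < alpha else "no significant difference"
    print(f"{dept_1} vs {dept_2}: adjusted p = {adjusted_p:.3g} ({verdict})")
```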


# Scenario 4: Market research

- **Problem statement:** A marketing research firm is investigating whether there is a relationship between customers’ age groups (18–25, 26–35, and 36–45) and their preferred social media platforms (Facebook, Twitter, and Instagram).
- **Objective:** Test whether the age group and the choice of social media platform are independent of each other.
- **Statistical test:** Chi-square test for independence.
- **Reason for test selection:** The chi-square test for independence is suitable for categorical data (age groups and social media platforms) and determines whether the variables are independent or related. It does not assume a particular distribution for the underlying data, although it does rely on sufficiently large expected counts in each cell of the contingency table (a common rule of thumb is at least 5).
- **Value to the organisation:** Identifying whether age groups and preferred social media platforms are independent or related is valuable for targeted marketing campaigns tailored to specific age groups and platforms.

## Hypothesis
- $H_0$: No significant relationship exists between age groups and preferred social media platform.
- $H_a$: At least one age group has a statistically significant association with at least one social media platform.
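
Before relying on a chi-square result, it is worth inspecting the expected cell counts. The sketch below assumes the same five-row sample used in this scenario and uses the expected-frequency table returned by `stats.chi2_contingency`; the names `expected_table` and `cells_below_5` are illustrative. The threshold of 5 is a common rule of thumb rather than a strict requirement.

```python
import pandas as pd
import scipy.stats as stats

# Same sample data used in this scenario.
data = pd.DataFrame({
    'Age Group': ['18-25', '26-35', '36-45', '18-25', '26-35'],
    'Social Media Platform': ['Facebook', 'Twitter', 'Instagram', 'Instagram', 'Facebook']
})

contingency_table = pd.crosstab(data['Age Group'], data['Social Media Platform'])

# chi2_contingency also returns the degrees of freedom and expected counts.
_, _, dof, expected = stats.chi2_contingency(contingency_table)

expected_table = pd.DataFrame(
    expected,
    index=contingency_table.index,
    columns=contingency_table.columns
)
print("Expected counts under independence:")
print(expected_table)

# Count cells whose expected frequency falls below the rule-of-thumb minimum.
cells_below_5 = int((expected < 5).sum())
print("Degrees of freedom:", dof)
print("Cells with expected count below 5:", cells_below_5, "of", expected.size)
```

With only five observations, every expected count is well below 5, so the chi-square approximation is unreliable here and a much larger sample would be needed before drawing conclusions.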

In [None]:
import scipy.stats as stats
import pandas as pd

# Sample data as a Pandas DataFrame.
data = pd.DataFrame({
    'Age Group': ['18-25', '26-35', '36-45', '18-25', '26-35'],
    'Social Media Platform': ['Facebook', 'Twitter', 'Instagram', 'Instagram', 'Facebook']
})

# Create a contingency table.
contingency_table = pd.crosstab(data['Age Group'], data['Social Media Platform'])

# Perform chi-square test for independence.
chi2, p, _, _ = stats.chi2_contingency(contingency_table)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Define the null and alternative hypotheses.
null_hypothesis = "No significant relationship exists between age groups and social media platforms."
alternative_hypothesis = "At least one age group has a statistically significant association with at least one social media platform."

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)

if p < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Null Hypothesis: No significant relationship exists between age groups and social media platforms.
Alternative Hypothesis: At least one age group has a statistically significant association with at least one social media platform.
Significance Level (alpha): 0.05
Chi-Square Statistic: 3.75
P-Value: 0.44089552967916945
Fail to reject the null hypothesis.


### Reporting results
> Our analysis suggests there is no clear link between age groups and the choice of social media platforms, as any observed differences could be due to random chance.
>
>To dive deeper, we recommend gathering more data, looking at other factors like gender or location, and exploring potential connections between age and other characteristics. This approach can help identify if age plays a role in social media preferences.


# Scenario 5: Product quality control

- **Problem statement:** A manufacturing company is evaluating whether there is a significant difference in product quality (measured as 'defective' or 'non-defective') among three production lines (Line A, Line B, Line C).
- **Objective:** Test whether the proportion of defective products differs across the three production lines.
- **Statistical test:** Chi-square test for proportions.
- **Reason for test selection:** This scenario involves comparing proportions (defective versus non-defective) among production lines, making the chi-square test for proportions appropriate. It will test whether the proportion of defective products significantly differs among the production lines, regardless of data distribution.
- **Value to the organisation:** Identifying variations in product quality aids quality control and process improvement within the manufacturing process.

## Hypothesis
- $H_0$: No significant difference in product quality among the three production lines.
- $H_a$: At least one production line has a statistically significant difference in product quality compared to the others.

In [None]:
import scipy.stats as stats

# Sample data for three production lines.
line_A_defective = 20
line_A_non_defective = 180
line_B_defective = 30
line_B_non_defective = 170
line_C_defective = 10
line_C_non_defective = 190

# Create a 2x3 contingency table.
contingency_table = [[line_A_defective, line_B_defective, line_C_defective],
                     [line_A_non_defective, line_B_non_defective, line_C_non_defective]]

# Perform chi-square test for proportions.
chi2, p, _, _ = stats.chi2_contingency(contingency_table)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Define the null and alternative hypotheses.
null_hypothesis = "No significant difference in product quality among the three production lines."
alternative_hypothesis = "At least one production line has a statistically significant difference in product quality compared to the others."

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)

if p < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Null Hypothesis: No significant difference in product quality among the three production lines.
Alternative Hypothesis: At least one production line has a statistically significant difference in product quality compared to the others.
Significance Level (alpha): 0.05
Chi-Square Statistic: 11.11111111111111
P-Value: 0.0038659201394728067
Reject the null hypothesis.


### Reporting results
Our analysis shows that there is likely a difference in product quality among the three production lines. This means that at least one of the lines might be producing more defective products than the others.

Given this, we suggest conducting additional tests to find out which production lines might need quality improvements or process adjustments.

Would you like us to continue with a more detailed investigation to pinpoint the source of the quality variation?
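
As a starting point for that investigation, the defect rate of each line can be compared directly, followed by pairwise chi-square tests. This is a minimal sketch, assuming the same counts used in this scenario and a Bonferroni correction for the three comparisons (my choice for illustration). Note that `stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default.

```python
from itertools import combinations

import scipy.stats as stats

# (defective, non-defective) counts for each production line.
line_counts = {
    "Line A": (20, 180),
    "Line B": (30, 170),
    "Line C": (10, 190),
}

# Proportion of defective products per line.
defect_rates = {
    line: defective / (defective + non_defective)
    for line, (defective, non_defective) in line_counts.items()
}
for line, rate in defect_rates.items():
    print(f"{line}: defect rate = {rate:.1%}")

alpha = 0.05
pairs = list(combinations(line_counts, 2))

adjusted_p_values = {}
for line_1, line_2 in pairs:
    # 2x2 table: rows are defective / non-defective, columns are the two lines.
    table = [
        [line_counts[line_1][0], line_counts[line_2][0]],
        [line_counts[line_1][1], line_counts[line_2][1]],
    ]
    _, p_value, _, _ = stats.chi2_contingency(table)
    # Bonferroni correction: multiply by the number of comparisons, cap at 1.
    adjusted_p = min(p_value * len(pairs), 1.0)
    adjusted_p_values[(line_1, line_2)] = adjusted_p
    verdict = "differ" if adjusted_p < alpha else "no significant difference"
    print(f"{line_1} vs {line_2}: adjusted p = {adjusted_p:.3g} ({verdict})")
```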

# Scenario 6: Determining product lines

- **Problem statement:** An online bookstore wants to determine which books to add to its product line. It needs to determine whether to spend the budget on fiction or non-fiction books, based on which category will likely generate the most revenue.
- **Objective:** Test which category of books will generate the most revenue.
- **Statistical test:** Independent two-sample t-test.
- **Reason for test selection:** An independent two-sample t-test is used to assess whether there is a significant difference between the means of two independent groups (revenue from fiction and from non-fiction). The assumptions of the independent t-test, including normality and homoscedasticity, are assumed to be met.
- **Value to the organisation:** This information can influence decisions on which books to add to the product line, helping the bookstore invest in the category that generates the most revenue.

## Hypothesis
  - $H_0$: There is no significant difference in revenue generated from fiction and non-fiction books.
  - $H_a$: There is a significant difference in revenue generated from fiction and non-fiction books.

In [None]:
# Import relevant libraries.
import scipy.stats as stats

# Sample data for revenue of fiction and non-fiction books.
fiction_revenue = [500, 550, 600, 520, 480, 530, 560,
                   540, 570, 590, 545, 525, 510, 525,
                   515, 550, 570, 580, 535, 520, 510,
                   540, 560, 575, 590]
non_fiction_revenue = [600, 620, 580, 590, 610, 630,
                       595, 605, 615, 625, 635, 590,
                       625, 630, 640, 610, 620, 600,
                       615, 630, 625, 635, 610, 590, 580]

# Perform the independent two-sample t-test.
t_statistic, p_value = stats.ttest_ind(fiction_revenue, non_fiction_revenue)

# Set the significance level (alpha) at 0.05.
alpha = 0.05

# Define the null and alternative hypotheses.
null_hypothesis = "There is no significant difference in revenue generated from fiction and non-fiction books."
alternative_hypothesis = "There is a significant difference in revenue generated from fiction and non-fiction books."

# Print the hypotheses, test results, and interpretation.
print("Null Hypothesis:", null_hypothesis)
print("Alternative Hypothesis:", alternative_hypothesis)
print("Significance Level (alpha):", alpha)
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")


Null Hypothesis: There is no significant difference in revenue generated from fiction and non-fiction books.
Alternative Hypothesis: There is a significant difference in revenue generated from fiction and non-fiction books.
Significance Level (alpha): 0.05
T-Statistic: -9.616927427940514
P-Value: 8.955388322211702e-13
Reject the null hypothesis.


### Reporting results
> Our analysis suggests that non-fiction books generally generate more revenue than fiction books, based on the sample data.
>
> Given this result, the bookstore should consider focusing its budget on the book category with higher revenue potential.
>
> Further analysis could help understand why one category is more profitable and guide future product line decisions.
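
The two-tailed test shows only that the means differ, whereas the objective is directional. A one-tailed version can be sketched as below, assuming the same revenue samples used in this scenario. Here `alternative='less'` tests whether mean fiction revenue is lower than mean non-fiction revenue; framing the question this way is my assumption about how the direction would be checked.

```python
import numpy as np
import scipy.stats as stats

# Same revenue samples used in this scenario.
fiction_revenue = [500, 550, 600, 520, 480, 530, 560,
                   540, 570, 590, 545, 525, 510, 525,
                   515, 550, 570, 580, 535, 520, 510,
                   540, 560, 575, 590]
non_fiction_revenue = [600, 620, 580, 590, 610, 630,
                       595, 605, 615, 625, 635, 590,
                       625, 630, 640, 610, 620, 600,
                       615, 630, 625, 635, 610, 590, 580]

fiction_mean = np.mean(fiction_revenue)
non_fiction_mean = np.mean(non_fiction_revenue)

# One-tailed test. H_a: mean fiction revenue < mean non-fiction revenue.
t_one_tailed, p_one_tailed = stats.ttest_ind(
    fiction_revenue, non_fiction_revenue, alternative='less'
)

print("Mean fiction revenue:", fiction_mean)
print("Mean non-fiction revenue:", non_fiction_mean)
print("Difference in means:", non_fiction_mean - fiction_mean)
print("One-tailed t-statistic:", t_one_tailed)
print("One-tailed p-value:", p_one_tailed)
```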