In [None]:
1.	A company wants to test whether the average time taken to assemble a product has significantly decreased after implementing a new training program. Prior 
    to the training, the average assembly time was 35 minutes with a known population standard deviation of 5 minutes. After the training, a random sample
    of 40 employees showed a mean assembly time of 33 minutes. Can you help the company to decide whether the new training program is necessary?

Answer: The goal of this analysis is to determine whether the new training program has reduced the average time it takes to assemble a product at
the company. We compare the sample mean of 33 minutes (n = 40) with the previous average of 35 minutes using a one-tailed z-test, which applies because
the population standard deviation is known. The null hypothesis is H0: μ = 35 and the alternative is H1: μ < 35.

In [1]:
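import numpy as np
from scipy import stats
# Known values from the problem statement
population_mean = 35      # mean assembly time before training (minutes)
sample_mean = 33          # sample mean assembly time after training (minutes)
population_std_dev = 5    # known population standard deviation (minutes)
sample_size = 40          # number of employees sampled
# z = (sample mean - hypothesized mean) / standard error
z_test_statistic = (sample_mean - population_mean) / (population_std_dev / np.sqrt(sample_size))
print(f"Z-test Statistic: {z_test_statistic:.4f}")
# Left-tailed p-value, because H1 states that the mean has decreased
p_value = stats.norm.cdf(z_test_statistic)
print(f"P-value: {p_value:.4f}")
# Decision at the 5% significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The training program has significantly reduced assembly time.")
else:
    print("Do not reject the null hypothesis: There is not enough evidence to suggest a reduction in assembly time.")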

Z-test Statistic: -2.5298
P-value: 0.0057
Reject the null hypothesis: The training program has significantly reduced assembly time.


Explanation: The code first imports the libraries needed for the calculation. Next, the known values are entered: the average assembly time before
training, the sample average after training, the population standard deviation, and the sample size. The z-test statistic is then calculated; it
measures how far the sample mean lies from the old mean in units of standard error. The p-value is the probability of observing a sample mean this low
or lower if the training had no effect (the null hypothesis). Finally, the p-value is compared to a significance level of 0.05. Since the p-value
(0.0057) is less than the significance level (0.05), we reject the null hypothesis. There is strong evidence that the average assembly time has
decreased after the training program.
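
As a cross-check on the decision above, the same left-tailed test can be made by comparing the z statistic with a critical value instead of comparing the p-value with alpha. This is a minimal sketch using the values from the problem; the names z_critical, reject_by_critical_value and reject_by_p_value are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

population_mean = 35
sample_mean = 33
population_std_dev = 5
sample_size = 40
alpha = 0.05

# Same z statistic as in the cell above
z = (sample_mean - population_mean) / (population_std_dev / np.sqrt(sample_size))

# Left-tail cutoff: reject H0 when z falls below this value
z_critical = stats.norm.ppf(alpha)

reject_by_critical_value = z < z_critical
reject_by_p_value = stats.norm.cdf(z) < alpha

print(f"z = {z:.4f}, critical value = {z_critical:.4f}")
print("Reject H0 (critical value rule):", reject_by_critical_value)
print("Reject H0 (p-value rule):", reject_by_p_value)
```

Both rules are equivalent for a given alpha, so they should always agree.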

2.	A university administrator wants to test whether graduate students at the institution study, on average, more than 25 hours per week. To explore 
this, a random sample of 15 graduate students was surveyed, and the sample mean study time was found to be 27 hours per week, with a sample standard 
deviation of 4.5 hours.
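Answer: The objective of the hypothesis test is to assess whether the average study time of graduate students at the university exceeds 25 hours per
week. Because the population standard deviation is unknown and the sample is small (n = 15), a one-tailed, one-sample t-test is used, with
H0: μ = 25 and H1: μ > 25.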


In [3]:
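import numpy as np
from scipy import stats
# Values from the problem statement
sample_mean = 27          # sample mean study time (hours per week)
population_mean = 25      # hypothesized mean under H0 (hours per week)
sample_std_dev = 4.5      # sample standard deviation (hours)
sample_size = 15          # number of students surveyed
# t = (sample mean - hypothesized mean) / estimated standard error
t_test_statistic = (sample_mean - population_mean) / (sample_std_dev / np.sqrt(sample_size))
print(f"T-test Statistic: {t_test_statistic:.4f}")
# Right-tailed p-value with n - 1 degrees of freedom
p_value = 1 - stats.t.cdf(t_test_statistic, df=sample_size - 1)
print(f"P-value: {p_value:.4f}")
alpha = 0.05
# Right-tail critical value at the 5% significance level
critical_value = stats.t.ppf(1 - alpha, df=sample_size - 1)
print(f"Critical Value: {critical_value:.4f}")
# Reject H0 only if the statistic exceeds the critical value
if t_test_statistic > critical_value:
    print("Reject the null hypothesis: Graduate students study more than 25 hours per week.")
else:
    print("Do not reject the null hypothesis: There is not enough evidence to suggest that graduate students study more than 25 hours per week.")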


T-test Statistic: 1.7213
P-value: 0.0536
Critical Value: 1.7613
Do not reject the null hypothesis: There is not enough evidence to suggest that graduate students study more than 25 hours per week.


In [None]:
Explanation: A one-sample t-test is performed here to determine whether the average study time of graduate students is greater than 25 hours per week.
First, the necessary libraries are imported. The known values are then defined: the sample mean (27 hours), the population mean under the null
hypothesis (25 hours), the sample standard deviation (4.5 hours), and the sample size (15). The code calculates the t-test statistic and the p-value
for the one-tailed test. A significance level of 0.05 is set, and the critical value is calculated from this significance level and the degrees of
freedom (n - 1 = 14). Finally, the t-test statistic is compared to the critical value to make a decision on the null hypothesis. The t-test statistic
(1.7213) is below the critical value (1.7613), and equivalently the p-value (0.0536) is above the significance level of 0.05. We therefore do not
reject the null hypothesis: there is not enough evidence to conclude that graduate students study more than 25 hours per week.
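
Another way to read the same result is through a one-sided 95% lower confidence bound for the mean: if the bound falls below 25, the test does not reject H0. The sketch below uses the values from the problem; the variable names standard_error and lower_bound are introduced here for illustration.

```python
import numpy as np
from scipy import stats

sample_mean = 27
population_mean = 25
sample_std_dev = 4.5
sample_size = 15
alpha = 0.05
df = sample_size - 1

standard_error = sample_std_dev / np.sqrt(sample_size)
t_critical = stats.t.ppf(1 - alpha, df=df)

# One-sided 95% lower confidence bound for the population mean
lower_bound = sample_mean - t_critical * standard_error

# Right-tailed p-value via the survival function (same as 1 - cdf)
t_stat = (sample_mean - population_mean) / standard_error
p_value = stats.t.sf(t_stat, df=df)

print(f"Lower 95% bound: {lower_bound:.4f}")
print(f"P-value: {p_value:.4f}")
print("Bound is above 25 (reject H0):", lower_bound > population_mean)
```

The bound lands just under 25 hours, which matches the borderline p-value of 0.0536.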

In [None]:
3. A researcher is studying the relationship between hours of study and exam scores for graduate students. The above data was collected from a sample of 30
students. 
a.	Fit a simple linear regression model to the data, where the dependent variable (Y) is the exam score, and the independent variable (X) is the hours
of study.

In [2]:
import pandas as pd
import statsmodels.api as sm
data = pd.DataFrame({ 'Hours_of_Study': [5, 5, 7, 8, 3, 8, 8, 7, 8, 8, 0, 5, 2, 8, 8, 4, 2, 1, 6, 2, 8, 0, 5, 7, 7, 4, 4, 2, 6, 8],
    'Score': [52.1221, 52.1221, 72.1221, 82.1221, 32.1221, 82.1221, 72.1221, 72.1221, 82.1221, 82.1221, 
              2.122104, 52.1221, 22.1221, 82.1221, 82.1221, 42.1221, 22.1221, 12.1221, 62.1221, 22.1221,
              82.1221, 2.122104, 52.1221, 72.1221, 72.1221, 42.1221, 42.1221, 22.1221, 62.1221, 82.1221]})

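# Predictor (X) and response (Y)
X = data['Hours_of_Study']
Y = data['Score']
# Add a column of ones so the model estimates an intercept
X = sm.add_constant(X)
# Fit ordinary least squares and print the full summary
model = sm.OLS(Y, X).fit()
print(model.summary())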


                            OLS Regression Results                            
Dep. Variable:                  Score   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     6067.
Date:                Sun, 03 Nov 2024   Prob (F-statistic):           2.79e-34
Time:                        23:29:09   Log-Likelihood:                -59.519
No. Observations:                  30   AIC:                             123.0
Df Residuals:                      28   BIC:                             125.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              2.4928      0.738      3.

In [None]:
Explanation: First, pandas and statsmodels are imported. The data is then entered manually. Hours_of_Study is defined as the predictor variable and
Score as the outcome variable, and a constant is added to the predictor so that the model includes an intercept. The OLS function from statsmodels
is then used to fit the regression model to the data. Lastly, the model summary is printed, which reports the coefficients, their standard errors
and p-values, and the R-squared. This allows us to understand whether and how study hours relate to exam scores.
b.	What are the assumptions of simple linear regression, and do you think they hold for this dataset?
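Linearity: The relationship between hours of study and exam scores should be linear.
Independence: Each observation (student's score) should be independent of the others.
Homoscedasticity: The variance of the residuals should remain constant across levels of study hours.
Normality: The residuals should be approximately normally distributed.
For this dataset, the R-squared of 0.995 in the summary above indicates a strongly linear relationship. The remaining assumptions are best checked
by examining the residuals.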
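
The residual-based assumptions can be examined with a short check like the one below. This is a sketch, not the notebook's original analysis: it refits the line with scipy.stats.linregress (instead of statsmodels) so that it is self-contained, applies a Shapiro-Wilk test to the residuals, and reports the observation with the largest residual. The variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

hours = np.array([5, 5, 7, 8, 3, 8, 8, 7, 8, 8, 0, 5, 2, 8, 8,
                  4, 2, 1, 6, 2, 8, 0, 5, 7, 7, 4, 4, 2, 6, 8])
score = np.array([52.1221, 52.1221, 72.1221, 82.1221, 32.1221, 82.1221, 72.1221, 72.1221, 82.1221, 82.1221,
                  2.122104, 52.1221, 22.1221, 82.1221, 82.1221, 42.1221, 22.1221, 12.1221, 62.1221, 22.1221,
                  82.1221, 2.122104, 52.1221, 72.1221, 72.1221, 42.1221, 42.1221, 22.1221, 62.1221, 82.1221])

# Fit the simple linear regression and compute residuals
fit = stats.linregress(hours, score)
residuals = score - (fit.intercept + fit.slope * hours)
r_squared = fit.rvalue ** 2

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Observation with the largest absolute residual
worst = int(np.argmax(np.abs(residuals)))

print(f"Intercept: {fit.intercept:.4f}, slope: {fit.slope:.4f}, R-squared: {r_squared:.3f}")
print(f"Shapiro-Wilk p-value for residuals: {shapiro_p:.4g}")
print(f"Largest residual at index {worst}: hours={hours[worst]}, score={score[worst]}, residual={residuals[worst]:.4f}")
```

A residual-versus-fitted plot would complement these numbers when checking for constant variance.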

c.	How would outliers in the data affect the regression model?
Outliers are data points that are very different from the others, and they can distort a regression model. Because least squares minimizes squared
errors, an outlier pulls the line of best fit towards itself, which makes the model less accurate for the rest of the data. This can bias the slope
and intercept, inflate the standard errors, and make predictions less trustworthy. The effect is strongest when the outlier lies far from the mean
of the predictor.
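
To illustrate the effect on this dataset, the sketch below fits the line twice: once on all 30 observations and once with the single observation that has the largest residual removed. Treating that observation as a suspected outlier is an assumption made for illustration; in practice a point should only be dropped after investigating why it differs.

```python
import numpy as np

hours = np.array([5, 5, 7, 8, 3, 8, 8, 7, 8, 8, 0, 5, 2, 8, 8,
                  4, 2, 1, 6, 2, 8, 0, 5, 7, 7, 4, 4, 2, 6, 8])
score = np.array([52.1221, 52.1221, 72.1221, 82.1221, 32.1221, 82.1221, 72.1221, 72.1221, 82.1221, 82.1221,
                  2.122104, 52.1221, 22.1221, 82.1221, 82.1221, 42.1221, 22.1221, 12.1221, 62.1221, 22.1221,
                  82.1221, 2.122104, 52.1221, 72.1221, 72.1221, 42.1221, 42.1221, 22.1221, 62.1221, 82.1221])

# Fit on all observations (np.polyfit returns slope first, then intercept)
slope_all, intercept_all = np.polyfit(hours, score, 1)
residuals = score - (intercept_all + slope_all * hours)

# Flag the observation with the largest absolute residual as the suspected outlier
suspect = int(np.argmax(np.abs(residuals)))
keep = np.arange(len(hours)) != suspect

# Refit without the suspected outlier
slope_clean, intercept_clean = np.polyfit(hours[keep], score[keep], 1)

print(f"All data:        slope={slope_all:.4f}, intercept={intercept_all:.4f}")
print(f"Without index {suspect}: slope={slope_clean:.4f}, intercept={intercept_clean:.4f}")
```

The shift in slope and intercept between the two fits shows how much a single point can move the line.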

d.	If you need to verify the model what will you need and what will you do?
To verify the regression model, we need data that the model has not seen during fitting. One approach is a train-test split: the data is divided
into a training set and a test set, and the model is fitted on the training set with "Hours of Study" as the input and "Score" as the output. We
then make predictions on the test set and compare the predicted scores with the actual scores, for example using the root mean squared error (RMSE).
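
A minimal sketch of this procedure is shown below. The 80/20 split, the random seed, and the use of RMSE as the error measure are assumptions chosen for illustration; np.polyfit is used for the fit so that the example needs only NumPy.

```python
import numpy as np

hours = np.array([5, 5, 7, 8, 3, 8, 8, 7, 8, 8, 0, 5, 2, 8, 8,
                  4, 2, 1, 6, 2, 8, 0, 5, 7, 7, 4, 4, 2, 6, 8])
score = np.array([52.1221, 52.1221, 72.1221, 82.1221, 32.1221, 82.1221, 72.1221, 72.1221, 82.1221, 82.1221,
                  2.122104, 52.1221, 22.1221, 82.1221, 82.1221, 42.1221, 22.1221, 12.1221, 62.1221, 22.1221,
                  82.1221, 2.122104, 52.1221, 72.1221, 72.1221, 42.1221, 42.1221, 22.1221, 62.1221, 82.1221])

# Shuffle the indices reproducibly and hold out 20% (6 of 30) for testing
rng = np.random.default_rng(0)
indices = rng.permutation(len(hours))
test_idx, train_idx = indices[:6], indices[6:]

# Fit the line on the training set only
slope, intercept = np.polyfit(hours[train_idx], score[train_idx], 1)

# Predict the held-out scores and measure the error
predicted = intercept + slope * hours[test_idx]
rmse = np.sqrt(np.mean((score[test_idx] - predicted) ** 2))

print(f"Trained on {len(train_idx)} points, tested on {len(test_idx)} points")
print(f"Slope: {slope:.4f}, intercept: {intercept:.4f}, test RMSE: {rmse:.4f}")
```

With only 30 observations the result depends on which points land in the test set, so repeating the split (or using cross-validation) gives a more stable picture.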


In [None]:
4. Machine X is available for 100 hours per week, and Machine Y is available for 85 hours per week. The company wants to maximize its total production
output.
a.	Please formulate this problem as a linear programming problem in standard form. 
Answer:              
Let us assume,
x = Number of units produced of Product A
y = Number of units produced of Product B
z = Number of units produced of Product C
Objective function (maximize total production output):
Z = x + y + z
Constraints:
Machine X: 2x + y + 3z ≤ 100
Machine Y: 4x + 3y + 2z ≤ 85
Non-negativity: x ≥ 0, y ≥ 0, z ≥ 0
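
If "standard form" is taken to mean the equality form with slack variables (an assumption; some textbooks call the inequality form above the standard form), each ≤ constraint gains a non-negative slack: 2x + y + 3z + s1 = 100 and 4x + 3y + 2z + s2 = 85. The sketch below solves that equality form with linprog; the slack names s1 and s2 are introduced here for illustration.

```python
from scipy.optimize import linprog

# Variables: x, y, z, s1, s2 (slacks do not contribute to output)
# linprog minimizes, so the objective coefficients are negated
c = [-1, -1, -1, 0, 0]

# Equality constraints after adding one slack per machine
A_eq = [[2, 1, 3, 1, 0],   # Machine X: 2x + y + 3z + s1 = 100
        [4, 3, 2, 0, 1]]   # Machine Y: 4x + 3y + 2z + s2 = 85
b_eq = [100, 85]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
x, y, z, s1, s2 = res.x

print("x, y, z:", x, y, z)
print("Slack on Machine X (s1):", s1)
print("Slack on Machine Y (s2):", s2)
print("Maximum output:", -res.fun)
```

A slack of zero means the corresponding machine's hours are fully used at the optimum.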

In [4]:
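from scipy.optimize import linprog
# linprog minimizes, so negate the coefficients to maximize x + y + z
c = [-1, -1, -1]
# Machine hours required per unit of Products A, B, C
A = [[2, 1, 3],   # Machine X
     [4, 3, 2]]   # Machine Y
b = [100, 85]     # hours available per week on Machine X and Machine Y
# bounds=(0, None) keeps every production quantity non-negative
result = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method='highs')
print("Status:", result.message)
print("Product A:", result.x[0])
print("Product B:", result.x[1])
print("Product C:", result.x[2])
print("Maximum Output:", -result.fun)  # undo the sign flip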

Status: Optimization terminated successfully. (HiGHS Status 7: Optimal)
Product A: 0.0
Product B: 7.857142857142857
Product C: 30.714285714285715
Maximum Output: 38.57142857142857


In [None]:
Explanation: The code uses the linprog function from scipy.optimize to solve the linear programming problem. It takes the objective function
coefficients c, the inequality constraint matrix A (passed as A_ub), and the right-hand-side limits of those constraints b (passed as b_ub). Because
linprog always minimizes, c = [-1, -1, -1] asks it to minimize -1 * (Product A) - 1 * (Product B) - 1 * (Product C), which is equivalent to
maximizing Product A + Product B + Product C. The aim is to find the best amounts of the three products (A, B, and C) to maximize output while
staying within the available machine hours. The code sets up the goal (objective function), the limits (constraints), and the range of possible
values (variable bounds), and then runs linprog to find the solution. The results are stored in a result object: whether the solver succeeded, how
much of each product to make, and the minimized objective value, which is negated to report the maximum output.

In [None]:
c.	The problem is bounded by machine X or Y? If you have a budget to upgrade the machine, will you upgrade machine X or Y? Please explain your answer 
    by exploring the math model. 
Answer: We substitute the optimal values of x, y and z from the optimization result above into each constraint.

Machine X Constraint:
2x + y + 3z ≤ 100
Substituting the optimal solution:
2(0) + 1(7.86) + 3(30.71) = 0 + 7.86 + 92.13 = 99.99 ≈ 100
This constraint reaches its limit of 100 hours, indicating that Machine X is fully utilized.

Machine Y Constraint:
4x + 3y + 2z ≤ 85
Substituting the optimal solution:
4(0) + 3(7.86) + 2(30.71) = 0 + 23.58 + 61.42 = 85.00
This constraint also reaches its limit of 85 hours, so Machine Y is fully utilized as well.
Because both constraints are binding, production is limited by both machines, so the choice has to come from the value of one extra hour on each
machine (its shadow price). At the optimum only Products B and C are made, so their dual conditions hold with equality: u + 3v = 1 and 3u + 2v = 1,
where u and v are the shadow prices of Machine X and Machine Y. Solving gives u = 1/7 ≈ 0.14 and v = 2/7 ≈ 0.29 units of output per additional
hour. An extra hour on Machine Y is therefore worth about twice as much as an extra hour on Machine X, so the upgrade budget is better spent on
Machine Y. Extra Machine Y hours would shift production towards Product B. Product A would still not be produced: one unit of A uses 2 hours of X
and 4 hours of Y, which are worth 2(1/7) + 4(2/7) = 10/7 units of output, more than the 1 unit that A adds.
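
One way to compare the two upgrades numerically is to re-solve the model with one extra hour on each machine in turn and see how much the maximum output grows. The sketch below does this with linprog; the helper max_output and the one-hour increment are assumptions made for illustration.

```python
from scipy.optimize import linprog

c = [-1, -1, -1]          # negated because linprog minimizes
A = [[2, 1, 3],           # Machine X hours per unit of A, B, C
     [4, 3, 2]]           # Machine Y hours per unit of A, B, C

def max_output(hours_x, hours_y):
    """Return the maximum total output for the given machine hours."""
    res = linprog(c, A_ub=A, b_ub=[hours_x, hours_y], bounds=(0, None), method="highs")
    return -res.fun

base = max_output(100, 85)
gain_x = max_output(101, 85) - base   # value of one extra hour on Machine X
gain_y = max_output(100, 86) - base   # value of one extra hour on Machine Y

print(f"Base output: {base:.4f}")
print(f"Gain from +1 hour on Machine X: {gain_x:.4f}")
print(f"Gain from +1 hour on Machine Y: {gain_y:.4f}")
```

The machine with the larger gain per extra hour is the better candidate for an upgrade, provided the cost per added hour is similar for both.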

In [None]:
Reference: In addressing the linear programming problem, I relied heavily on Gemini in Google Colab for assistance. Since I am more familiar with
Colab, I first typed and tested the code there before pasting it into Jupyter Notebook. The autocomplete suggestions turned out to be very helpful,
especially for someone who is a beginner in coding like me. As I typed my code, I received suggested completions for functions, variables, and
methods, which allowed me to code more efficiently and reduced the chance of errors. It also helped me explain the coding steps and provide
detailed comments alongside the code, which made it easier to break the analysis down into understandable segments.