# 1 Hospital admission & quality of service


## Question 1 (a)

In [20]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df1 = pd.read_csv("health_data.csv")

model1_1a = smf.ols(formula = 'patient_died_dummy ~ hospital_id',
                   data = df1).fit()
print(model1_1a.summary())

                            OLS Regression Results                            
Dep. Variable:     patient_died_dummy   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.042
Method:                 Least Squares   F-statistic:                     119.3
Date:                Sun, 04 Feb 2024   Prob (F-statistic):          1.75e-220
Time:                        12:45:55   Log-Likelihood:                -7416.5
No. Observations:               24480   AIC:                         1.485e+04
Df Residuals:                   24470   BIC:                         1.493e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.0970      0.006  

The constant term (0.0970) is the predicted mortality rate at hospital A, the baseline category.
The coefficient on the hospital D dummy is the difference in mortality rates between hospitals D and A. It is positive, so mortality is higher at hospital D: the mortality rate at hospital D is 0.1882 (about 18.8 percentage points) higher than at hospital A.
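
The reading of the intercept and the dummy coefficients can be checked on a small synthetic example. The data below are made up for illustration (they are not health_data.csv, and the death probabilities are assumptions); only the column names mirror the ones used above. With a single categorical regressor, OLS reproduces the group means, so the intercept should equal the baseline group's mean and each dummy coefficient should equal that group's mean minus the baseline mean.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration, not the course dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"hospital_id": rng.choice(["A", "B", "D"], size=3000)})
death_prob = toy["hospital_id"].map({"A": 0.10, "B": 0.11, "D": 0.28})
toy["patient_died_dummy"] = (rng.random(len(toy)) < death_prob).astype(int)

fit = smf.ols("patient_died_dummy ~ hospital_id", data=toy).fit()
group_means = toy.groupby("hospital_id")["patient_died_dummy"].mean()

# Intercept vs. mean outcome in the baseline category (A, first alphabetically).
print(fit.params["Intercept"], group_means["A"])
# Dummy coefficient vs. difference in means relative to the baseline.
print(fit.params["hospital_id[T.D]"], group_means["D"] - group_means["A"])
```

Each printed pair should agree up to floating-point error, which is why the coefficients can be read directly as mortality rates and differences in mortality rates.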

## Question 1 (b)

In [21]:
print('The difference between the mortality rates at hospitals D and E is ', model1_1a.params['hospital_id[T.D]'] - model1_1a.params['hospital_id[T.E]'], '.')

The difference between the mortality rates at hospitals D and E is  0.24139049162000503 .
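
The point estimate above comes without a measure of uncertainty. As a hedged sketch on synthetic data (the data, the hospital labels, and the death probabilities are assumptions, not values from health_data.csv), a contrast vector passed to the fitted model's `t_test` method returns the same difference together with a standard error and a p-value.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration, not the course dataset).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"hospital_id": rng.choice(["A", "D", "E"], size=3000)})
death_prob = toy["hospital_id"].map({"A": 0.10, "D": 0.28, "E": 0.05})
toy["patient_died_dummy"] = (rng.random(len(toy)) < death_prob).astype(int)

fit = smf.ols("patient_died_dummy ~ hospital_id", data=toy).fit()

# Contrast: +1 on the D dummy, -1 on the E dummy, 0 on everything else.
contrast = np.zeros(len(fit.params))
contrast[fit.params.index.get_loc("hospital_id[T.D]")] = 1.0
contrast[fit.params.index.get_loc("hospital_id[T.E]")] = -1.0

test_result = fit.t_test(contrast)
diff = fit.params["hospital_id[T.D]"] - fit.params["hospital_id[T.E]"]
print(diff)          # same quantity as the manual subtraction of coefficients
print(test_result)   # adds a standard error, t statistic and p-value
```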


## Causal interpretation
## Question 2 (a)

In [24]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df1 = pd.read_csv("health_data.csv")
df1_ab = df1[df1['hospital_id'].isin(['A', 'B'])]

model1_2a = smf.ols(formula = 'patient_died_dummy ~ hospital_id',
                   data = df1_ab).fit()
print(model1_2a.summary())

                            OLS Regression Results                            
Dep. Variable:     patient_died_dummy   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9377
Date:                Sun, 04 Feb 2024   Prob (F-statistic):              0.333
Time:                        12:48:56   Log-Likelihood:                -1446.8
No. Observations:                6611   AIC:                             2898.
Df Residuals:                    6609   BIC:                             2911.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.0970      0.005  

The difference in mortality rate implied by this regression cannot be interpreted as the causal effect of visiting a different hospital (i.e., the change in risk of dying when moving a patient from hospital A to B cannot be inferred from this regression). First, the p-value of 0.333 is too large for the coefficient to be statistically significant, so the data do not even show a reliable association between hospital and mortality. Second, and more importantly, patients are not randomly assigned to hospitals, so confounders that affect both hospital choice and the risk of dying would prevent a causal reading even if the coefficient were significant.

## Question 2 (b)

I think the difference in mortality between the hospitals is under-estimated. Pregnant women go to hospitals that specialize in birth delivery, so the gender of the patients is not randomly distributed across hospitals. This is supported by the significant coefficient on female_dummy in the regression of patient_died_dummy on female_dummy, together with the significant coefficient on hospital_id in the regression of female_dummy on hospital_id.

In [25]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df1 = pd.read_csv("health_data.csv")
df1_ab = df1[df1['hospital_id'].isin(['A', 'B'])]

model1_2b_1 = smf.ols(formula = 'patient_died_dummy ~ female_dummy',
                   data = df1_ab).fit()
print(model1_2b_1.summary())

                            OLS Regression Results                            
Dep. Variable:     patient_died_dummy   R-squared:                       0.062
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     434.0
Date:                Sun, 04 Feb 2024   Prob (F-statistic):           2.08e-93
Time:                        12:50:02   Log-Likelihood:                -1237.0
No. Observations:                6611   AIC:                             2478.
Df Residuals:                    6609   BIC:                             2492.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0.0615      0.004     15.152   

In [26]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df1 = pd.read_csv("health_data.csv")
df1_ab = df1[df1['hospital_id'].isin(['A', 'B'])]

model1_2b_2 = smf.ols(formula = 'female_dummy ~ hospital_id',
                   data = df1_ab).fit()
print(model1_2b_2.summary())

                            OLS Regression Results                            
Dep. Variable:           female_dummy   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     7.042
Date:                Sun, 04 Feb 2024   Prob (F-statistic):            0.00798
Time:                        12:50:04   Log-Likelihood:                -3523.8
No. Observations:                6611   AIC:                             7052.
Df Residuals:                    6609   BIC:                             7065.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.2321      0.007  

## Question 2 (c)

A potential control variable to include in the regression, in order to obtain a causal estimate (or at least get closer to one), is female_dummy.

In [27]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df1 = pd.read_csv("health_data.csv")
df1_ab = df1[df1['hospital_id'].isin(['A', 'B'])]

model1_2c_1 = smf.ols(formula = 'patient_died_dummy ~ hospital_id + female_dummy',
                   data = df1_ab).fit()
print(model1_2c_1.summary())

                            OLS Regression Results                            
Dep. Variable:     patient_died_dummy   R-squared:                       0.062
Model:                            OLS   Adj. R-squared:                  0.062
Method:                 Least Squares   F-statistic:                     218.5
Date:                Sun, 04 Feb 2024   Prob (F-statistic):           1.33e-92
Time:                        12:50:21   Log-Likelihood:                -1235.6
No. Observations:                6611   AIC:                             2477.
Df Residuals:                    6608   BIC:                             2498.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.0549      0.006  

The coefficient on the hospital B dummy changes from 0.0072 to 0.0121, an increase of 0.0049. Once gender is controlled for, the gap in mortality between hospitals A and B becomes larger.
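
The size and sign of this change follow from the omitted variable bias identity for OLS. The notation here is introduced for illustration and does not appear in the output above: $\tilde{\beta}_B$ is the hospital B coefficient without the control, $\hat{\beta}_B$ is the coefficient with it, $\hat{\gamma}$ is the coefficient on female_dummy in the longer regression, and $\hat{\delta}$ is the slope from regressing female_dummy on the hospital B dummy in the same sample.

$$
\tilde{\beta}_B = \hat{\beta}_B + \hat{\gamma}\,\hat{\delta}
\quad\Longrightarrow\quad
\hat{\gamma}\,\hat{\delta} = \tilde{\beta}_B - \hat{\beta}_B = 0.0072 - 0.0121 = -0.0049
$$

The product is negative, so $\hat{\gamma}$ and $\hat{\delta}$ must have opposite signs, and leaving out female_dummy pulls the hospital B coefficient down by 0.0049.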

I included female_dummy because pregnancy affects mortality and because the coefficient on female_dummy is significant. Including female_dummy reduces omitted variable bias and brings the estimate closer to a causal one.
I also considered including startage as a control because being very young or very old might increase the risk of death during child delivery, but the coefficient on startage is not significant.

In [29]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df1 = pd.read_csv("health_data.csv")
df1_ab = df1[df1['hospital_id'].isin(['A', 'B'])]

model2c_2 = smf.ols(formula = 'patient_died_dummy ~ startage',
                   data = df1_ab).fit()
print(model2c_2.summary())

                            OLS Regression Results                            
Dep. Variable:     patient_died_dummy   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.1602
Date:                Sun, 04 Feb 2024   Prob (F-statistic):              0.689
Time:                        12:50:36   Log-Likelihood:                -1447.1
No. Observations:                6611   AIC:                             2898.
Df Residuals:                    6609   BIC:                             2912.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1117      0.027      4.097      0.0

# 2 Demand estimation

## Question 1.

In [34]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df2 = pd.read_csv("demand_data.csv")

df2_v1 = df2[df2['vendor_id'] == 1]

model2_1_1 = smf.ols(formula = 'sales ~ price',
                   data = df2_v1).fit()
print(model2_1_1.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                 -0.013
Method:                 Least Squares   F-statistic:                    0.3250
Date:                Sun, 04 Feb 2024   Prob (F-statistic):              0.571
Time:                        12:54:44   Log-Likelihood:                -360.33
No. Observations:                  52   AIC:                             724.7
Df Residuals:                      50   BIC:                             728.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8983.8227    145.437     61.771      0.0

In [35]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df2 = pd.read_csv("demand_data.csv")

df2_v1 = df2[df2['vendor_id'] == 1]

model2_1_2 = smf.ols(formula = 'sales ~ price + summer_dummy',
                   data = df2_v1).fit()
print(model2_1_2.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.318
Model:                            OLS   Adj. R-squared:                  0.290
Method:                 Least Squares   F-statistic:                     11.42
Date:                Sun, 04 Feb 2024   Prob (F-statistic):           8.49e-05
Time:                        12:55:06   Log-Likelihood:                -350.56
No. Observations:                  52   AIC:                             707.1
Df Residuals:                      49   BIC:                             713.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     9177.5500    128.432     71.458   

In [36]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df2 = pd.read_csv("demand_data.csv")

df2_v1 = df2[df2['vendor_id'] == 1]

model2_1_3 = smf.ols(formula = 'price ~ summer_dummy',
                   data = df2_v1).fit()
print(model2_1_3.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.204
Model:                            OLS   Adj. R-squared:                  0.189
Method:                 Least Squares   F-statistic:                     12.85
Date:                Sun, 04 Feb 2024   Prob (F-statistic):           0.000764
Time:                        12:58:40   Log-Likelihood:                -44.499
No. Observations:                  52   AIC:                             93.00
Df Residuals:                      50   BIC:                             96.90
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        2.4103      0.093     25.922   

The price coefficient changes when the summer dummy is included because price and summer_dummy are significantly correlated (the regression of price on summer_dummy has a p-value of 0.000764) and summer_dummy also helps explain sales (R-squared rises from 0.006 to 0.318). Leaving summer_dummy out therefore introduces omitted variable bias into the price coefficient.
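
The mechanism can be illustrated with a minimal sketch on synthetic weekly data. All numbers below are assumptions chosen for illustration, not estimates from demand_data.csv. The sketch checks the in-sample identity that the price slope without the control equals the slope with the control plus the summer coefficient times the slope from an auxiliary regression of summer_dummy on price. That auxiliary regression runs in the opposite direction to the price ~ summer_dummy check above, but both slopes are nonzero exactly when the two variables are correlated in the sample.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic weekly data (assumed values, not demand_data.csv).
rng = np.random.default_rng(2)
n = 52
summer = (np.arange(n) % 4 == 0).astype(int)         # 13 "summer" weeks
price = 2.4 + 0.7 * summer + rng.normal(0, 0.3, n)   # pricier in summer
sales = 9000 - 200 * price + 400 * summer + rng.normal(0, 100, n)
toy = pd.DataFrame({"sales": sales, "price": price, "summer_dummy": summer})

short_fit = smf.ols("sales ~ price", data=toy).fit()
long_fit = smf.ols("sales ~ price + summer_dummy", data=toy).fit()
aux_fit = smf.ols("summer_dummy ~ price", data=toy).fit()

# Short slope = long slope + (summer coefficient) * (auxiliary slope).
implied = (long_fit.params["price"]
           + long_fit.params["summer_dummy"] * aux_fit.params["price"])
print(short_fit.params["price"], implied)
```

The two printed numbers should coincide up to floating-point error, so the gap between the two price coefficients is entirely the product of the summer effect on sales and the price-season relationship.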

## Question 2.

In [37]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df2 = pd.read_csv("demand_data.csv")

df2_v2 = df2[df2['vendor_id'] == 2]

model2_2_1 = smf.ols(formula = 'sales ~ price',
                   data = df2_v2).fit()
print(model2_2_1.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.133
Model:                            OLS   Adj. R-squared:                  0.116
Method:                 Least Squares   F-statistic:                     7.684
Date:                Sun, 04 Feb 2024   Prob (F-statistic):            0.00781
Time:                        12:59:09   Log-Likelihood:                -359.10
No. Observations:                  52   AIC:                             722.2
Df Residuals:                      50   BIC:                             726.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8411.1748    219.545     38.312      0.0

In [38]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df2 = pd.read_csv("demand_data.csv")

df2_v2 = df2[df2['vendor_id'] == 2]

model2_2_2 = smf.ols(formula = 'sales ~ price + summer_dummy',
                   data = df2_v2).fit()
print(model2_2_2.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.133
Model:                            OLS   Adj. R-squared:                  0.116
Method:                 Least Squares   F-statistic:                     7.684
Date:                Sun, 04 Feb 2024   Prob (F-statistic):            0.00781
Time:                        12:59:31   Log-Likelihood:                -359.10
No. Observations:                  52   AIC:                             722.2
Df Residuals:                      50   BIC:                             726.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     2105.3159     29.848     70.534   

In [40]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df2 = pd.read_csv("demand_data.csv")

df2_v2 = df2[df2['vendor_id'] == 2]

model2_2_3 = smf.ols(formula = 'price ~ summer_dummy',
                   data = df2_v2).fit()
print(model2_2_3.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                       inf
Date:                Sun, 04 Feb 2024   Prob (F-statistic):               0.00
Time:                        13:15:24   Log-Likelihood:                    inf
No. Observations:                  52   AIC:                              -inf
Df Residuals:                      50   BIC:                              -inf
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        2.5000          0        inf   



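In the regression with the summer dummy, there is a multicollinearity problem. For vendor 2, the price of ice cream has a perfect linear relationship with season (with a coefficient of 1.0; the regression of price on summer_dummy has an R-squared of 1.000), whereas for vendor 1 the relationship is weaker (with a coefficient of 0.6667 and an R-squared of 0.204). This implies that vendor 2 charged higher prices in summer without exception, so the effects of price and season on sales cannot be separated for that vendor. This is also why the regression with the summer dummy reports Df Model: 1 and the same R-squared (0.133) as the regression without it.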
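
A perfect linear relationship between two regressors shows up as a rank-deficient design matrix. The sketch below uses synthetic data in which price is an exact function of the season; the numbers are assumptions for illustration and are not taken from demand_data.csv. It compares the number of columns in the design matrix with its rank and prints the model degrees of freedom that statsmodels reports.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed values): price is an exact function of the season.
rng = np.random.default_rng(3)
summer = (np.arange(52) % 4 == 0).astype(int)
price = 2.5 + 1.0 * summer
sales = 8400 - 150 * price + rng.normal(0, 100, 52)
toy = pd.DataFrame({"sales": sales, "price": price, "summer_dummy": summer})

# Design matrix with columns [1, price, summer_dummy].
X = np.column_stack([np.ones(52), toy["price"], toy["summer_dummy"]])
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)  # number of columns vs. number of independent columns

fit = smf.ols("sales ~ price + summer_dummy", data=toy).fit()
print(fit.df_model)  # model degrees of freedom as counted by statsmodels
```

When the rank is below the number of columns, the individual coefficients on price and summer_dummy are not separately identified, even though the fitted values are still well defined.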

## Question 3.

For the vendor who did not systematically charge higher or lower prices in summer, price and summer_dummy are roughly uncorrelated, so adding the summer dummy does not remove the price variation that identifies the coefficient. If season affects sales, the dummy absorbs part of the residual variance, so I expect the price coefficient estimate in the regression with the summer dummy to be more precise (smaller standard error).

In the regression without the summer dummy, on the other hand, the seasonal variation in sales stays in the error term, so I expect the price coefficient estimate to be less precise, although it is not biased because price is unrelated to season for this vendor.
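
Expectations about precision can be checked with a small simulation. The sketch below uses made-up data in which price is drawn independently of the season and the season shifts sales; both are assumptions of the sketch, not features of demand_data.csv. It prints the standard error of the price coefficient without and with summer_dummy.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed values): price is drawn independently of the season.
rng = np.random.default_rng(4)
n = 520
summer = (np.arange(n) % 4 == 0).astype(int)
price = rng.uniform(2.0, 3.0, n)
sales = 9000 - 200 * price + 300 * summer + rng.normal(0, 100, n)
toy = pd.DataFrame({"sales": sales, "price": price, "summer_dummy": summer})

without_dummy = smf.ols("sales ~ price", data=toy).fit()
with_dummy = smf.ols("sales ~ price + summer_dummy", data=toy).fit()

# Standard errors of the price coefficient under each specification.
se_without = without_dummy.bse["price"]
se_with = with_dummy.bse["price"]
print(se_without, se_with)
```

Changing the size of the seasonal shift (300 here) or the noise level shows how much of the difference in standard errors comes from seasonal variation left in the error term.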