<a href="https://colab.research.google.com/github/joyinning/causal_inference/blob/main/Causal_Week_7_2_(A).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Chapter 8


### Concepts

**Q1. In causal inference, an OLS model has a dependent variable and independent variables, and the independent variables fall into two groups: the treatment and the control variables. What role do control variables play in OLS for causal inference? Hint: think about what the term itself means.**


- Control variables account for confounding factors that would otherwise bias the estimated causal effect of the treatment on the outcome. Including them brings the estimate closer to an "all else being equal" comparison, which is what a causal interpretation requires.
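
The role of a control variable can be illustrated with a small simulation. This is a minimal sketch on made-up data: the variables `x`, `t`, `y` and the true effect of 2.0 are assumptions chosen for illustration, not part of the case study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process: x confounds both T and y.
x = rng.normal(size=n)                       # confounder (the control variable)
t = 1.0 * x + rng.normal(size=n)             # treatment depends on x
y = 2.0 * t + 3.0 * x + rng.normal(size=n)   # true effect of T on y is 2.0


def ols(target, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(target)), *regressors])
    return np.linalg.lstsq(X, target, rcond=None)[0]


naive = ols(y, t)[1]          # omits the confounder, so the estimate is biased
controlled = ols(y, t, x)[1]  # includes the confounder as a control

print(f"without control: {naive:.2f}")
print(f"with control:    {controlled:.2f}")
```

Under these assumptions the regression without the control overstates the effect, while the regression that includes `x` recovers a value close to the true 2.0.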


**Q2. When choosing an instrumental variable, what relationships do we expect between y, T, and Z (the IV), and why?**


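- Z must be correlated with T (relevance), so that changes in the instrument actually move the treatment.
- Z must affect y only through T (exclusion restriction). Z is uncorrelated with the unobserved factors in y's error term, so any association between Z and y comes from T alone.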


**Q3. When can we use the reduced form for IV, and why is it allowed? Is there a reduced form in our model? Why can't we run 2SLS by hand (using two linear models)?**


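- The reduced form directly models the relationship between the instrument and the outcome (y regressed on Z). We can use it when Z is as good as randomly assigned with respect to y. In our model, it would be the regression of num_sales on market_growth_rate.
- Why allowed: under the exclusion restriction, Z affects y only through T, so the reduced-form coefficient captures the combined effect of Z on y through T in one step. Dividing it by the first-stage coefficient (Z on T) gives the IV estimate, which is only stable when the first stage is not weak.
- Why not by hand: the two-model approach gives the same point estimate, but the second-stage OLS reports incorrect standard errors. In addition, 1) 2SLS is consistent but not unbiased, and 2) the bookkeeping becomes complex with multiple instruments.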
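
The link between the first stage, the reduced form, and the IV estimate can be sketched on simulated data. Everything below (the confounder `u`, the coefficients, the sample size) is an assumption made for illustration; the true effect of T on y is set to 2.0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data: u is an unobserved confounder, z is a valid instrument.
u = rng.normal(size=n)
z = rng.normal(size=n)
t = 0.5 * z + u + rng.normal(size=n)
y = 2.0 * t - 3.0 * u + rng.normal(size=n)   # true effect of T is 2.0


def slope(target, regressor):
    """Slope of a one-regressor OLS fit: cov(x, y) / var(x)."""
    return np.cov(regressor, target)[0, 1] / np.var(regressor, ddof=1)


first_stage = slope(t, z)        # effect of Z on T
reduced_form = slope(y, z)       # effect of Z on y
iv_estimate = reduced_form / first_stage

# By-hand 2SLS: regress y on the first-stage fitted values of T.
t_hat = t.mean() + first_stage * (z - z.mean())
two_stage = slope(y, t_hat)

naive = slope(y, t)              # plain OLS, biased by the confounder

print(f"naive OLS: {naive:.2f}")
print(f"reduced form / first stage: {iv_estimate:.2f}")
print(f"by-hand 2SLS coefficient:   {two_stage:.2f}")
```

With a single instrument the ratio and the by-hand second-stage coefficient coincide, which is why the point estimate from two OLS models is fine even though its standard errors are not.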


**Q4. What are the downsides of IV?**


- If the instrument has only a very small correlation with the treatment, we can't learn much about the treatment from the instrument.
- The formulas for the IV standard errors are complex and not so intuitive.


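**Q5. The book lists two common mistakes people make when using IV. One is taking the standard errors from two models fitted by hand, and the other is using an ML model for the first stage. Why are the standard errors wrong, and why is the ML approach wrong?**

- 1) Standard errors by hand: the second-stage OLS treats the predicted treatment as if it were observed data. Its residuals are computed from the predicted T rather than the actual T, so the reported standard errors ignore the first-stage estimation and are wrong, even though the coefficient itself matches 2SLS.
- 2) ML in the first stage: 2SLS relies on the first stage being a linear OLS projection, whose fitted values are uncorrelated with its residuals by construction. A nonlinear ML model gives no such guarantee, so plugging its predictions into the second stage can yield inconsistent estimates.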
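
The standard-error problem can be made concrete with a sketch on simulated data. The data-generating process is an assumption for illustration; the comparison is between the residual variance a by-hand second stage uses and the one 2SLS should use.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical data with an unobserved confounder u and instrument z.
u = rng.normal(size=n)
z = rng.normal(size=n)
t = 0.5 * z + u + rng.normal(size=n)
y = 2.0 * t - 3.0 * u + rng.normal(size=n)

# First stage: project T on the instrument.
Z = np.column_stack([np.ones(n), z])
t_hat = Z @ np.linalg.lstsq(Z, t, rcond=None)[0]

# Second stage: regress y on the predicted treatment.
X_hat = np.column_stack([np.ones(n), t_hat])
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
xtx_inv = np.linalg.inv(X_hat.T @ X_hat)

# By-hand residuals use the predicted T; correct 2SLS residuals use the actual T.
resid_by_hand = y - X_hat @ beta
resid_correct = y - np.column_stack([np.ones(n), t]) @ beta


def slope_se(resid):
    """Standard error of the slope given a residual vector."""
    sigma2 = resid @ resid / (n - 2)
    return np.sqrt(sigma2 * xtx_inv[1, 1])


se_by_hand = slope_se(resid_by_hand)
se_correct = slope_se(resid_correct)
print(f"by-hand SE: {se_by_hand:.4f}")
print(f"2SLS SE:    {se_correct:.4f}")
```

In this particular simulation the by-hand standard error comes out too small; in other settings it can err in the other direction. Either way it is not the standard error of the IV estimate.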

### Case Study

You are employed at a smartphone manufacturing company and are interested in analyzing historical sales data related to competitors' smartphone launches.
Your company was the first to launch a smartphone in the market, and more competitors have joined every year.

In [5]:
import numpy as np
import pandas as pd
np.random.seed(12)
n = 1000

data = {
    "company_age_in_month": [j for j in range(50) for i in range(int(n/50))],
    "num_competitor_launches": [round(abs(np.random.normal((i/500), 1.5, 1)[0])) if i < 100 else round(abs(np.random.normal((i/50), 1.5, 1)[0])) if i < 700 else round(abs(np.random.normal((i/100), 1.5, 1)[0])) for i in range(n)],
    "num_sales": [round(abs(np.random.normal(300 + 500 * (i/500), 150, 1)[0])) if i < 100 else round(abs(np.random.normal(100 + 300 * (i/50), 300, 1)[0])) if i < 700 else round(abs(np.random.normal(200 + 100 * (i/100), 30, 1)[0])) for i in range(n)],
    "our_custom_review": [np.random.uniform(1, 5) if i < 100 else np.random.uniform(2, 5) if i < 700 else np.random.uniform(4, 5) for i in range(n)],
    "num_other_product_on_us": [round(abs(np.random.normal((i/700), 1.5, 1)[0])) if i < 100 else round(abs(np.random.normal((i/100), 1.5, 1)[0])) if i < 700 else round(abs(np.random.normal((i/100), 1.5, 1)[0])) for i in range(n)],
    "market_growth_rate": [np.random.normal((i/700), 1.5, 1)[0] if i < 100 else np.random.normal((i/100), 1.5, 1)[0] if i < 700 else np.random.normal((i/50), 1.5, 1)[0] for i in range(n)],
    "new_tech": [np.random.normal((i/700), 1.5, 1)[0] if i < 100 else np.random.normal((i/300), 1.5, 1)[0] if i < 700 else np.random.normal((i/200), 1.5, 1)[0] for i in range(n)],
    "num_new_royal_customer": [np.random.normal((i/100), 1.5, 1)[0] if i < 100 else np.random.normal((i/150), 1.5, 1)[0] if i < 700 else np.random.normal((i/1000), 1.5, 1)[0] for i in range(n)]
}

**Question 1 Perform the IV estimate using two methods: (1) two OLS models fitted by hand and (2) a 2SLS library**

- num_competitor_launches = T
- market_growth_rate = Z
- num_sales = Y

In [6]:
df = pd.DataFrame(data)

In [11]:
import statsmodels.api as sm
from statsmodels.sandbox.regression.gmm import IV2SLS

first_stage = sm.OLS(df['num_competitor_launches'], sm.add_constant(df['market_growth_rate'])).fit()
print("First Stage Summary:\n", first_stage.summary())

First Stage Summary:
                                OLS Regression Results                              
Dep. Variable:     num_competitor_launches   R-squared:                       0.190
Model:                                 OLS   Adj. R-squared:                  0.189
Method:                      Least Squares   F-statistic:                     234.3
Date:                     Sat, 04 May 2024   Prob (F-statistic):           1.16e-47
Time:                             23:10:26   Log-Likelihood:                -2646.5
No. Observations:                     1000   AIC:                             5297.
Df Residuals:                          998   BIC:                             5307.
Df Model:                                1                                         
Covariance Type:                 nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------

In [12]:
df['predicted_T'] = first_stage.predict(sm.add_constant(df['market_growth_rate']))
second_stage = sm.OLS(df['num_sales'], sm.add_constant(df['predicted_T'])).fit()

In [13]:
print("Second Stage Summary:\n", second_stage.summary())

Second Stage Summary:
                             OLS Regression Results                            
Dep. Variable:              num_sales   R-squared:                       0.026
Model:                            OLS   Adj. R-squared:                  0.025
Method:                 Least Squares   F-statistic:                     26.99
Date:                Sat, 04 May 2024   Prob (F-statistic):           2.48e-07
Time:                        23:10:41   Log-Likelihood:                -8462.0
No. Observations:                1000   AIC:                         1.693e+04
Df Residuals:                     998   BIC:                         1.694e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const        2678.8882    1

**Question 2 Calculate covariance to check the correlation between T and Z**

In [14]:
covariance = df[['num_competitor_launches', 'market_growth_rate']].cov()
print("Covariance between T and Z:\n", covariance)

Covariance between T and Z:
                          num_competitor_launches  market_growth_rate
num_competitor_launches                14.397033           11.069883
market_growth_rate                     11.069883           44.770831
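
A covariance is scale dependent, so it helps to convert it into a correlation. The short calculation below uses only the three numbers printed in the covariance matrix above; the variable names are chosen here for illustration.

```python
import math

# Values copied from the printed covariance matrix.
var_t = 14.397033    # variance of num_competitor_launches (T)
var_z = 44.770831    # variance of market_growth_rate (Z)
cov_tz = 11.069883   # covariance between T and Z

corr_tz = cov_tz / math.sqrt(var_t * var_z)   # Pearson correlation
first_stage_slope = cov_tz / var_z            # OLS slope of T on Z
r_squared = corr_tz ** 2                      # R-squared of the first stage

print(f"correlation: {corr_tz:.3f}")
print(f"first-stage slope: {first_stage_slope:.3f}")
print(f"first-stage R-squared: {r_squared:.3f}")
```

The correlation is about 0.44, and its square agrees with the R-squared of 0.190 reported in the first-stage summary, so the instrument is clearly related to the treatment.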


**Question 3 Justify by context what variables you can use for IV**

- market_growth_rate: a plausible instrument for 'num_competitor_launches'.
> A growing market is likely to attract new competitors (relevance). For it to be a valid instrument, we must also assume that it does not affect our company's sales directly but only through competitor launches, for example because market conditions affect all firms similarly and are not tied to our own sales performance.

**Question 4 Get IV estimates**

predicted_T = -113.8524
- For each additional competitor launch, as predicted from the market growth rate, the company's sales decrease by approximately 114 units.
- This interpretation assumes that the market growth rate is exogenous with respect to the other factors influencing sales.

**Question 5 Explain in context whether there was an effect of having competitors' launches to our sales**

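Yes, according to the IV estimate. The coefficient on predicted_T is negative (about -114), so more competitor launches are detrimental to our sales: each additional launch is estimated to cost roughly 114 units. Market growth rate acts here as the instrument, not as a control. Because the estimate comes from the by-hand second stage, its reported standard errors are not valid, so statistical significance should be confirmed with a proper 2SLS routine. The causal reading holds only if market_growth_rate satisfies the exclusion restriction.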