In [88]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
import statsmodels.api as sm

from pathlib import Path
from patsy import dmatrices

from notears import nonlinear, linear, utils
from graphviz import Digraph

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

## Description

In this exercise you will work with a reduced version of the [Student Performance Data Set](https://archive.ics.uci.edu/ml/datasets/Student+Performance#) published in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php).

We use only the following subset of columns:

1. address - student's home address type (binary: 'U' - urban or 'R' - rural)
2. higher - wants to take higher education (binary: yes or no)
3. internet - Internet access at home (binary: yes or no)
4. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
5. G1 - first period grade (numeric: from 0 to 20)
6. absences - number of school absences (numeric: from 0 to 93)

Here we would like to estimate the Average Causal Effect (`ACE`) of `absences` on `G1` from observational data.

### Loading and Filtering Data

In [92]:
data_path = Path("./data/")
data = pd.read_csv(data_path / "student-por.csv", delimiter=';')[["absences", "address", "internet", "reason", "higher", "G1"]]
data = data[data.absences < 20]

data.head()

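|   | absences | address | internet | reason | higher | G1 |
|---|---|---|---|---|---|---|
| 0 | 4 | U | no | course | yes | 0 |
| 1 | 2 | U | yes | course | yes | 9 |
| 2 | 6 | U | yes | other | yes | 12 |
| 3 | 0 | U | yes | home | yes | 14 |
| 4 | 0 | U | no | home | yes | 11 |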


### Process Data 

Assign numerical values to string columns.  
*Note*: do not worry about details such as one-hot encoding the non-binary columns. Treat them as ordinal values.

In [1]:
#### ADD CODE HERE
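One possible way to do the encoding is sketched below on a few toy rows. The helper name `encode_ordinal` and the toy values are assumptions made for illustration; the integer assigned to each label simply follows alphabetical order, which is one arbitrary choice among many.

```python
import pandas as pd

# Toy rows that mimic the columns used in this exercise (values are illustrative).
toy = pd.DataFrame({
    "absences": [4, 2, 6, 0],
    "address": ["U", "U", "R", "U"],
    "internet": ["no", "yes", "yes", "yes"],
    "reason": ["course", "other", "home", "reputation"],
    "higher": ["yes", "yes", "no", "yes"],
    "G1": [0, 9, 12, 14],
})


def encode_ordinal(df):
    """Return a copy where every non-numeric column is replaced by integer codes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Labels are sorted alphabetically, then numbered 0, 1, 2, ...
            out[col] = out[col].astype("category").cat.codes
    return out


encoded = encode_ordinal(toy)
print(encoded)
```

With this convention `'R'` maps to 0 and `'U'` to 1, and `'no'` maps to 0 and `'yes'` to 1. An explicit dictionary passed to `Series.map` would work equally well if you prefer to control the ordering yourself.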

### Visualize Data

See if you can find interesting patterns

### Stats and Visualization

In [2]:
#### ADD CODE HERE 

### Naive Estimation: Part 1

Estimate the effect of `absences` on `G1` naively using linear regression.

For consistency, please use the statsmodels library. The API is:

```
model = sm.OLS.from_formula(formula="your_formula here", data=data)
res = model.fit()
res.summary()
```

What do you observe here? 

In [3]:
#### ADD CODE HERE

### Naive Estimation: Part 2

Now fit a linear regression using all available variables.
 - What do you observe?
 - How do the results differ from Part 1?
 - What is the relative percentage difference between this estimate and the previous one?
 - Which result should you trust?

In [4]:
#### ADD CODE HERE
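The relative percentage difference can be computed with a small helper like the one below. The function name and the two coefficients are made up purely to show the arithmetic; they are not results from the student data.

```python
def relative_pct_diff(new, reference):
    """Percentage change of `new` relative to `reference`."""
    return 100.0 * (new - reference) / abs(reference)


# Made-up coefficients, used only to illustrate the calculation.
slope_part1 = -0.5   # hypothetical estimate from the single-variable regression
slope_part2 = -0.25  # hypothetical estimate after adding the other variables

change = relative_pct_diff(slope_part2, slope_part1)
print(f"{change:.1f}%")
```

Dividing by the absolute value of the reference keeps the sign interpretable: a positive result means the new coefficient is larger (here, closer to zero) than the reference.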

### Learning DAG Structure

- Describe the structure you have learned.
- Given the results from Naive Estimation Part 2, is your graph consistent with them or not?
- Where is your learned structure correct, and where is it not?
- Wherever your structure is incorrect, complete the graph by adding or removing edges (if you add an edge, please color it red).

In [None]:
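np.random.seed(42)

# The matrix must be purely numeric, so run this after encoding the string columns.
weight_matrix = linear.notears_linear(data.values.astype(np.float32), lambda1=0.05, loss_type='l2')
# The learned weights should describe an acyclic graph.
assert utils.is_dag(weight_matrix)
weight_matrix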

In [6]:
col_names = data.columns.tolist()
num_cols = len(col_names)

dot = Digraph(comment='students', format='png')

# One node per column.
for col in col_names:
    dot.node(col)

# A nonzero weight_matrix[ix, ec] is drawn as a directed edge ix -> ec.
for ix in range(num_cols):
    edge_candidates = np.where(weight_matrix[ix, :] != 0)[0]
    for ec in edge_candidates:
        dot.edge(col_names[ix], col_names[ec], constraint="true")

# Display the graph inline.
dot
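To describe the learned structure in words, it can help to list the edges as text. The sketch below uses a small hand-written weight matrix (an assumption, not notears output) and follows the same convention as the plotting loop: a nonzero entry in row `i`, column `j` is read as an edge from node `i` to node `j`.

```python
import numpy as np


def edges_from_weights(weights, names, threshold=0.0):
    """List (source, target, weight) for every entry with |weight| > threshold."""
    edges = []
    for i, j in zip(*np.nonzero(np.abs(weights) > threshold)):
        edges.append((names[i], names[j], float(weights[i, j])))
    return edges


# Hypothetical three-node example; rows are sources, columns are targets.
names = ["absences", "higher", "G1"]
W = np.array([
    [0.0, 0.0, -0.3],   # absences -> G1
    [-0.5, 0.0, 1.2],   # higher -> absences, higher -> G1
    [0.0, 0.0, 0.0],    # G1 has no outgoing edges
])

for source, target, weight in edges_from_weights(W, names):
    print(f"{source} -> {target} (weight {weight:+.2f})")
```

The optional `threshold` argument is a convenience added here for pruning very small weights; whether any pruning is appropriate for your graph is a judgment call.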

### Structural Causal Model

- Write down the SCM corresponding to your graph

In [1]:
### WRITE EQUATIONS HERE
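As a reminder of the general shape, and assuming a linear model as in the structure-learning step, each variable is written as a weighted sum of its parents in the graph plus an independent noise term. The symbols below are notation introduced here: $\mathrm{pa}(j)$ is the set of parents of node $j$, $w_{ij}$ is the weight on the edge $i \to j$, $\varepsilon_j$ is the noise term, and $d$ is the number of variables. Intercepts are omitted for brevity.

$$
X_j = \sum_{i \in \mathrm{pa}(j)} w_{ij}\, X_i + \varepsilon_j, \qquad j = 1, \dots, d
$$

A variable with no parents reduces to $X_j = \varepsilon_j$. Which variables appear on the right-hand side of each equation depends entirely on your own graph.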

### Interventional Distribution

- Draw the graph corresponding to the Interventional Distribution

In [2]:
### ADD CODE HERE
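Intervening on a variable corresponds to removing every edge that points into it while leaving all other edges untouched. A minimal sketch on a hand-written weight matrix follows; the matrix and the helper name `intervene` are assumptions, and rows are again read as sources and columns as targets.

```python
import numpy as np


def intervene(weights, names, target):
    """Return a copy of the weight matrix with all edges into `target` removed."""
    mutilated = np.array(weights, dtype=float, copy=True)
    # Incoming edges of `target` live in its column, so zero that column.
    mutilated[:, names.index(target)] = 0.0
    return mutilated


# Hypothetical three-node example; rows are sources, columns are targets.
names = ["absences", "higher", "G1"]
W = np.array([
    [0.0, 0.0, -0.3],   # absences -> G1
    [-0.5, 0.0, 1.2],   # higher -> absences, higher -> G1
    [0.0, 0.0, 0.0],
])

W_do = intervene(W, names, "absences")
print(W_do)
```

In this toy graph the edge `higher -> absences` disappears, while `absences -> G1` and `higher -> G1` remain. The resulting matrix can be drawn with the same `Digraph` loop used for the learned structure.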

### Estimating ACE

- Using the corrected structure from the previous steps, fit a linear regression and estimate the `ACE` of `absences` on `G1`.
- Estimate the relative percentage differences between this estimate and those from Naive Estimation Parts 1 and 2.

In [7]:
#### ADD CODE HERE