## Instructions

Before you turn this problem in, make sure everything runs as expected. 

1. **Restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then
2. **Run all cells** (in the menubar, select Cell $\rightarrow$ Run All).
3. **Save the notebook**

Do fix all your errors before submitting.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

## Student Performance Data

For most of our course, we have been working with the Mathematics scores from the student performance dataset. The following code reads in the scores for both the Mathematics and Portuguese subjects.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

import matplotlib.pyplot as plt

stud_perf_mat  = pd.read_csv("data/student/student-mat.csv", delimiter=";")
stud_perf_por  = pd.read_csv("data/student/student-por.csv", delimiter=";")

The two datasets have different dimensions, but some students appear in both. The following code creates a new dataframe containing only these students. Columns with a `_x` suffix correspond to Math scores, while columns with a `_y` suffix correspond to Portuguese.

In [2]:
merged_df = stud_perf_mat.merge(stud_perf_por, how='inner', 
                                on =['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 
                                     'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'nursery', 'internet'])
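As a minimal sketch of how the inner merge and the `_x` / `_y` suffixes behave, the toy frames below (hypothetical data, with only two key columns instead of the thirteen used above) are merged in the same way:

```python
import pandas as pd

# Hypothetical toy data; the real datasets have many more columns.
toy_mat = pd.DataFrame({"school": ["GP", "GP", "MS"],
                        "age": [15, 16, 17],
                        "G3": [10, 12, 8]})
toy_por = pd.DataFrame({"school": ["GP", "MS", "MS"],
                        "age": [15, 17, 18],
                        "G3": [11, 9, 14]})

# An inner merge keeps only key combinations present in both frames.
# Non-key columns that share a name get the _x (left) and _y (right) suffixes.
toy_merged = toy_mat.merge(toy_por, how="inner", on=["school", "age"])
print(toy_merged)
```

If the key columns do not uniquely identify a student, an inner merge can pair one row with several rows from the other frame. Passing `validate="one_to_one"` to `merge` raises an error in that case, which may be worth trying on the real data.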

### Q1: Remove zero-points

Remove the points from `merged_df` where either `G3_x = 0` or `G3_y = 0` (or both). 

In [3]:
merged_df = merged_df[(merged_df.G3_x != 0) & (merged_df.G3_y != 0)]

### Q2: t-test

Suppose we let $\mu_x$ correspond to the mean Math G3 score, while $\mu_y$ corresponds to the mean Portuguese G3 score. Carry out an appropriate test of the following hypotheses:


$$H_0: \mu_x = \mu_y$$
$$H_1: \mu_x > \mu_y$$


Remember to carry out all steps in the procedure, including assumption checking.

###YOUR ANSWER HERE (include code as well)

In [4]:
# Two-sample, one-sided test
# x: Math final grades, y: Portuguese final grades
x = merged_df.G3_x
y = merged_df.G3_y

**Test for normality:** Run the following two cells to apply the Shapiro-Wilk test to each variable. Both p-values are below 0.05, so we reject the null hypothesis of normality in both cases and conclude that neither variable follows a normal distribution.

In [5]:
stats.shapiro(x)

ShapiroResult(statistic=np.float64(0.9784042930972487), pvalue=np.float64(5.515441608011557e-05))

In [6]:
stats.shapiro(y)

ShapiroResult(statistic=np.float64(0.9711407632829925), pvalue=np.float64(2.66480527714179e-06))

Since the normality assumption is violated, we use a non-parametric test. Treating `G3_x` and `G3_y` as independent samples, we use the Wilcoxon rank-sum (Mann-Whitney U) test. Note that each row of `merged_df` holds the two scores of the same student, so independence is an assumption here rather than something the data guarantee.

$H_0:$ The distribution of `G3_x` has the same location as the distribution of `G3_y`; in other words, $\mu_x = \mu_y$.

$H_1:$ The distribution of `G3_x` is shifted to the right of the distribution of `G3_y`; in other words, $\mu_x > \mu_y$.

In [7]:
# Wilcoxon rank-sum test (Mann-Whitney U), one-sided
wrs_out = stats.mannwhitneyu(x, y, alternative="greater")
print(wrs_out.pvalue)

0.9999999911627284


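The p-value is approximately 1.00 > 0.05, so we do not have enough evidence to reject $H_0$ in favor of $\mu_x > \mu_y$. This does not show that $\mu_x = \mu_y$; a p-value this close to 1 in a one-sided test indicates that the data lie in the direction opposite to $H_1$.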
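Each row of `merged_df` holds the Math and Portuguese scores of the same student, so a paired analysis is a natural cross-check on the test above. The sketch below is one possible way to set it up, not the required answer: `paired_one_sided` is a hypothetical helper, and the data are synthetic. It checks normality on the differences and then uses either a paired t-test or the Wilcoxon signed-rank test.

```python
import numpy as np
from scipy import stats

def paired_one_sided(x, y, alpha=0.05):
    """Hypothetical helper: paired test of H1 'x tends to exceed y'."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # For paired data, normality is checked on the differences,
    # not on each variable separately.
    shapiro_p = stats.shapiro(d).pvalue
    if shapiro_p >= alpha:
        name = "paired t-test"
        p = stats.ttest_rel(x, y, alternative="greater").pvalue
    else:
        name = "Wilcoxon signed-rank"
        p = stats.wilcoxon(d, alternative="greater").pvalue
    return name, shapiro_p, p

# Synthetic data for illustration only (not the student dataset).
rng = np.random.default_rng(0)
toy_y = rng.integers(5, 20, size=60)
toy_x = toy_y - rng.integers(0, 4, size=60)  # toy_x never exceeds toy_y
name, shapiro_p, p = paired_one_sided(toy_x, toy_y)
print(name, round(p, 4))
```

On the real data the call would be `paired_one_sided(merged_df.G3_x, merged_df.G3_y)`.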

### Q3: Linear Regression

We would like to understand whether Math G3 scores can be explained by Portuguese G3 scores and mother's education. Fit the following model to the data:

\begin{equation}
Y = \beta_0 + \beta_1 X_1  + \beta_2 I(X_2 = 1) + \beta_3 I(X_2 = 2) + \beta_4 I(X_2 = 3) + \beta_5 I(X_2 = 4)  + \epsilon
\end{equation}

where 

* $Y$: Math G3 scores
* $X_1$: Portuguese G3 scores
* $X_2$: Mother's education (i.e. 0, 1, 2, 3 or 4)

Use the model to compute 90% confidence intervals for the mean Math G3 score at each value of Medu, for a student with a Portuguese G3 score of 10.
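The indicator terms $I(X_2 = k)$ in the model above leave out $X_2 = 0$, which makes Medu = 0 the baseline level. As a small illustration of that coding (using `pd.get_dummies` on a hypothetical five-row series, not the model-fitting code itself):

```python
import pandas as pd

# One hypothetical observation per education level.
medu = pd.Series([0, 1, 2, 3, 4], name="Medu")

# drop_first=True drops the Medu = 0 column, so level 0 is the baseline
# and each remaining column plays the role of I(X_2 = k), k = 1..4.
dummies = pd.get_dummies(medu, prefix="Medu", drop_first=True)
print(dummies)
```

A row with Medu = 0 has all four indicators off, so its fitted value is $\beta_0 + \beta_1 X_1$, and each of $\beta_2, \dots, \beta_5$ measures a shift relative to that baseline.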

In [8]:
### YOUR CODE HERE
lm = ols('G3_x ~ G3_y + C(Medu)', merged_df).fit()
lm.summary()

# One row per Medu level, with the Portuguese G3 score fixed at 10.
# The formula interface adds the intercept itself, so add_constant is not needed.
new_df = pd.DataFrame({'G3_y': [10, 10, 10, 10, 10], 'Medu': [0, 1, 2, 3, 4]})
predictions_out = lm.get_prediction(new_df)
ci = predictions_out.conf_int(alpha=0.1)  # 90% CI for the mean response

results_df = new_df.copy()
results_df['CI_lower'] = ci[:, 0]
results_df['CI_upper'] = ci[:, 1]
results_df

| | G3_y | Medu | CI_lower | CI_upper |
|---|---|---|---|---|
| 0 | 10 | 0 | 8.816466 | 13.925972 |
| 1 | 10 | 1 | 7.899504 | 9.309208 |
| 2 | 10 | 2 | 9.148769 | 10.209562 |
| 3 | 10 | 3 | 9.065021 | 10.140673 |
| 4 | 10 | 4 | 9.428337 | 10.458905 |
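The intervals above are confidence intervals for the mean response, which is what `conf_int` returns by default. As a hedged sketch on synthetic data (the frame `toy` and its coefficients are made up for illustration), `summary_frame` reports these alongside the wider prediction intervals for an individual student, which makes the distinction easy to see:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data for illustration only (not the student dataset).
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({"G3_y": rng.integers(5, 20, size=n),
                    "Medu": rng.integers(0, 5, size=n)})
toy["G3_x"] = 2 + 0.7 * toy["G3_y"] + 0.3 * toy["Medu"] + rng.normal(0, 2, size=n)

toy_lm = ols("G3_x ~ G3_y + C(Medu)", toy).fit()
toy_new = pd.DataFrame({"G3_y": [10] * 5, "Medu": [0, 1, 2, 3, 4]})

# mean_ci_* : 90% confidence interval for the mean response
# obs_ci_*  : 90% prediction interval for a single new observation
frame = toy_lm.get_prediction(toy_new).summary_frame(alpha=0.1)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```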
