# DS-SF-34 | 13 | Advanced Metrics | Assignment | Starter Code

## Myopia

In [1]:
import os

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

import statsmodels.api as sm

from sklearn import preprocessing, linear_model, model_selection, metrics
# TODO model_selection

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In this assignment, we will look at what contributes most to myopia (i.e., nearsightedness).  My parents always told me not to watch TV or play video games because it would hurt my vision.  (They were strangely fine with studying!)  But we are data scientists now, so let's go and explain myopia!

In [2]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-13-myopia.csv'))

In [3]:
df

,ID,STUDYYEAR,MYOPIC,AGE,GENDER,...,STUDYHR,TVHR,DIOPTERHR,MOMMY,DADMY
0,1,1992,1,6,1,...,0,10,34,1,1
1,2,1995,0,6,1,...,1,7,12,1,1
2,3,1991,0,6,1,...,0,10,14,0,0
3,4,1990,1,6,1,...,0,4,37,0,1
4,5,1995,0,5,0,...,0,4,4,1,0
...,...,...,...,...,...,...,...,...,...,...,...
613,614,1995,1,6,0,...,3,14,37,1,0
614,615,1993,0,6,1,...,0,8,10,1,1
615,616,1995,0,6,0,...,0,4,4,1,1
616,617,1991,0,6,1,...,0,15,23,0,0


Here's the data dictionary for this dataset:

| Variable Name | Variable Description | Values/Labels |
|:---|:---|:---|
| `ID` | Subject identifier | Integer (range 1-1503) |
| `STUDYYEAR` | Year subject entered the study | Year |
| `MYOPIC` | Myopia within the first 5 yr of follow up<sup>(a)</sup> | `0 = No`, `1 = Yes` |
| `AGE` | Age at first visit | Years |
| `GENDER` | Gender | `0 = Male`, `1 = Female` |
| `SPHEQ` | Spherical equivalent refraction<sup>(b)</sup> | Diopter |
| `AL` | Axial length<sup>(c)</sup> | mm |
| `ACD` | Anterior chamber depth<sup>(d)</sup> | mm |
| `LT` | Lens thickness<sup>(e)</sup> | mm |
| `VCD` | Vitreous chamber depth<sup>(f)</sup> | mm |
| `SPORTHR` | How many hours per week outside of school the child spent engaging in sports/outdoor activities | Hours per week |
| `READHR` | How many hours per week outside of school the child spent reading for pleasure | Hours per week |
| `COMPHR` | How many hours per week outside of school the child spent playing video/computer games or working on the computer | Hours per week |
| `STUDYHR` | How many hours per week outside of school the child spent reading or studying for school assignments | Hours per week |
| `TVHR` | How many hours per week outside of school the child spent watching television | Hours per week |
| `DIOPTERHR` | Composite of near-work activities<sup>(g)</sup> | Hours per week |
| `MOMMY` | Was the subject's mother myopic? | `0 = No`, `1 = Yes` |
| `DADMY` | Was the subject's father myopic? | `0 = No`, `1 = Yes` |

<sup>(a)</sup> MYOPIC is defined as SPHEQ <= -0.75D<br>
<sup>(b)</sup> A measure of the eye's effective focusing power.  Eyes that are "normal" (don't require glasses or contact lenses) have spherical equivalents between -0.25 diopters (D) and +1.00 D. The more negative the spherical equivalent, the more myopic the subject<br>
<sup>(c)</sup> The length of eye from front to back<br>
<sup>(d)</sup> The length from front to back of the aqueous-containing space of the eye between the cornea and the iris<br>
<sup>(e)</sup> The length from front to back of the crystalline lens<br>
<sup>(f)</sup> The length from front to back of the aqueous-containing space of the eye in front of the retina<br>
<sup>(g)</sup> DIOPTERHR = 3 * (READHR + STUDYHR) + 2 * COMPHR + TVHR

> ### Question 1.  `ID` and `STUDYYEAR` do not predict myopia.  Disregard them.  Then, consider two types of inputs.  First, all general inputs (i.e., physical and external inputs) as `X1`.  Second, only the external inputs as `X2`.  Finally, define the response vector `c`

In [12]:
# TODO
X1 = df[['AGE', 'GENDER', 'SPHEQ', 'AL', 'ACD', 'LT',
      'VCD', 'SPORTHR', 'READHR', 'COMPHR', 'STUDYHR',
      'TVHR', 'DIOPTERHR', 'MOMMY', 'DADMY']]
X2 = df[['AGE', 'GENDER', 'SPORTHR', 'READHR', 'COMPHR', 'STUDYHR', 'TVHR', 'DIOPTERHR',
         'MOMMY', 'DADMY']]
c = df.MYOPIC

> ### Question 2.  Run your regression line on `X1` and interpret the `MOMMY` and `DADMY` coefficients

In [13]:
# TODO
model = linear_model.LogisticRegression().fit(X1, c)

# Pair each input name with its fitted coefficient (log-odds scale)
print(list(zip(X1.columns, model.coef_[0])))
print(model.intercept_)

[('AGE', 0.0037913089441876124), ('GENDER', 0.53625099581972702), ('SPHEQ', -3.3942542422000823), ('AL', 0.11608732686882325), ('ACD', 0.77257102715114212), ('LT', -0.31202021007267727), ('VCD', -0.32603607756289527), ('SPORTHR', -0.047394851633255908), ('READHR', 0.097641708448349296), ('COMPHR', 0.05015836653044712), ('STUDYHR', -0.13224792751548364), ('TVHR', -0.0043863140850415819), ('DIOPTERHR', -0.0078882382267651888), ('MOMMY', 0.63899609932801882), ('DADMY', 0.72678882137552547)]
[ 0.05253595]


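Answer: Both coefficients are positive (`MOMMY` ≈ 0.64, `DADMY` ≈ 0.73), so having a myopic parent raises the predicted odds of the child being myopic, holding the other inputs fixed.  Among the binary inputs (`GENDER`, `MOMMY`, `DADMY`), these two have the largest coefficients.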
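
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret.  The sketch below only reuses the two coefficients printed above; the variable names are illustrative.  Keep in mind that `LogisticRegression()` applies L2 regularization by default, so these values are shrunk relative to an unpenalized fit.

```python
import math

# Coefficients copied from the fitted model output above (log-odds scale).
coefficients = {"MOMMY": 0.63899609932801882, "DADMY": 0.72678882137552547}

# exp(coefficient) is the multiplicative change in the odds of myopia
# when the input goes from 0 to 1, holding the other inputs fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items()):
    print("{}: odds ratio = {:.2f}".format(name, ratio))
```

Under this model, a myopic mother or father roughly doubles the odds of myopia for the child.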

> ### Question 3.  What's the model accuracy?

In [14]:
# TODO
model.score(X1, c)

0.89320388349514568

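Answer: The model classifies about 89.3% of the subjects correctly.  This is in-sample accuracy, measured on the same data used to fit the model.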

> ### Question 3.  Use a 5-fold cross-validation to measure the model's accuracy

In [15]:
# TODO
model_selection.cross_val_score(model, X1 , c, cv = 5).mean()

0.88023592971413578

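Answer: The mean 5-fold cross-validated accuracy is about 88.0%, slightly below the 89.3% measured on the training data.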

> ### Question 4.  In the dataset, what's the percentage of myopic cases?

In [17]:
# TODO
# Share of myopic subjects: positives divided by the number of rows
# (a Series has no .total() method; this equals df.MYOPIC.mean())
df.MYOPIC.sum() / float(len(df))

In [18]:
df.MYOPIC.mean()

0.13106796116504854

Answer: About 13.1% of the subjects are myopic.

> ### Question 5.  Based on the result above, is your model's accuracy good?

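Answer: Not really.  About 13% of the subjects are myopic, so a model that always predicts "not myopic" would already be right about 87% of the time.  An accuracy of 88 to 89% (a misclassification rate of 11 to 12%) is only a marginal improvement over that baseline.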
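
The comparison with a majority-class baseline can be written out as a short sketch.  It assumes 618 subjects of which 81 are myopic (618 × 0.131), and reuses the two accuracy figures reported above; the variable names are illustrative.

```python
# Counts assumed from this notebook: 618 subjects, of which 81 are myopic.
n_total = 618
n_myopic = 81

# A trivial classifier that always predicts "not myopic" is correct
# for every non-myopic subject and wrong for every myopic one.
baseline_accuracy = (n_total - n_myopic) / float(n_total)

model_accuracy = 0.89320388349514568   # in-sample score reported above
cv_accuracy = 0.88023592971413578      # 5-fold cross-validated score reported above

print("baseline: {:.3f}".format(baseline_accuracy))
print("gain in-sample: {:.3f}".format(model_accuracy - baseline_accuracy))
print("gain cross-validated: {:.3f}".format(cv_accuracy - baseline_accuracy))
```

With classes this imbalanced, accuracy alone says little, which is why the confusion matrix and the error rates are worth examining.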

> ### Question 6.  Build a confusion matrix

In [19]:
# TODO
c_hat = model.predict(X1)
# Rows index the true class c; columns index the predictions c_hat
pd.crosstab(c, c_hat, rownames = ['True class'], colnames = ['Hypothesis class'])

Hypothesis class,0,1
True class,,
0,524,13
1,53,28


> ### Question 7.  What's the model `FPR` and `FNR` (i.e., type I and type II error rates)?

FPR = FP / (FP + TN) = 13 / (13 + 524) = 13 / 537
FNR = FN / (FN + TP) = 53 / (53 + 28) = 53 / 81

In [25]:
FPR = 13. / 537
FPR

0.024208566108007448

In [27]:
FNR = 53. / 81
FNR

0.654320987654321
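
As a cross-check, both rates and their complements can be derived from the four cells of the confusion matrix.  This is a minimal sketch using the counts shown above; in `pd.crosstab(c, c_hat)` the rows index `c` and the columns index `c_hat`.

```python
# Cell counts read off the confusion matrix above
# (rows index the response c, columns index the predictions c_hat).
tn, fp = 524, 13   # subjects with c == 0
fn, tp = 53, 28    # subjects with c == 1

fpr = fp / float(fp + tn)   # type I error rate: non-myopic flagged as myopic
fnr = fn / float(fn + tp)   # type II error rate: myopic cases missed
sensitivity = 1 - fnr       # share of myopic cases detected
specificity = 1 - fpr       # share of non-myopic cases correctly cleared

print("FPR = {:.3f}, FNR = {:.3f}".format(fpr, fnr))
print("sensitivity = {:.3f}, specificity = {:.3f}".format(sensitivity, specificity))
```

The model misses roughly two out of three myopic children, even though its overall accuracy looks high.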

> ### Question 8.  What's the trade-off between these two errors?

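Answer: The two errors move in opposite directions.  Lowering the probability threshold for predicting myopia catches more true cases (lower FNR) but flags more non-myopic children (higher FPR); raising the threshold does the reverse.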
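
The trade-off can be illustrated by sweeping the decision threshold over predicted probabilities.  The labels and scores below are made up for illustration and are not from the myopia model; in the notebook, the scores would presumably come from `model.predict_proba(X1)[:, 1]`.  The helper `error_rates` is hypothetical.

```python
# Hypothetical true labels and predicted probabilities (not the study data).
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.35, 0.60, 0.55, 0.80]


def error_rates(labels, scores, threshold):
    """Return (FPR, FNR) when predicting 1 for scores >= threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    negatives = labels.count(0)
    positives = labels.count(1)
    return fp / float(negatives), fn / float(positives)


for threshold in (0.25, 0.50, 0.75):
    fpr, fnr = error_rates(labels, scores, threshold)
    print("threshold={:.2f}  FPR={:.3f}  FNR={:.3f}".format(threshold, fpr, fnr))
```

As the threshold rises, the FPR falls and the FNR rises; the ROC curve summarizes this sweep over all thresholds.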

> ### Question 9.  Run your regression line on `X2` and interpret your results specifically on `SPORTHR`, `READHR`, `COMPHR`, `STUDYHR`, `TVHR`, and `GENDER`.  You might want to use `statsmodels`' `Logit()`

In [28]:
# TODO
model = linear_model.LogisticRegression().fit(X2, c)

# Pair each input name with its fitted coefficient (log-odds scale)
print(list(zip(X2.columns, model.coef_[0])))
print(model.intercept_)

[('AGE', -0.14479353919050633), ('GENDER', 0.24541580034114238), ('SPORTHR', -0.047119474548268508), ('READHR', 0.068796782642721932), ('COMPHR', 0.0093705351899656987), ('STUDYHR', -0.071573850343561049), ('TVHR', -0.0036418993733181785), ('DIOPTERHR', 0.0067679679035274323), ('MOMMY', 0.72828898872177272), ('DADMY', 0.83081070943537583)]
[-1.73861951]


In [42]:
sm.Logit(c, X2).fit().summary()

Optimization terminated successfully.
         Current function value: 0.364782
         Iterations 7




Dep. Variable:,MYOPIC,No. Observations:,618.0
Model:,Logit,Df Residuals:,609.0
Method:,MLE,Df Model:,8.0
Date:,"Thu, 01 Jun 2017",Pseudo R-squ.:,0.06084
Time:,19:45:46,Log-Likelihood:,-225.44
converged:,True,LL-Null:,-240.04
,,LLR p-value:,0.0002917

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
AGE,-0.4143,0.064,-6.431,0.000,-0.541 -0.288
GENDER,0.1809,0.251,0.720,0.471,-0.311 0.673
SPORTHR,-0.0500,0.018,-2.731,0.006,-0.086 -0.014
READHR,0.0612,,,,nan nan
COMPHR,-0.0003,,,,nan nan
STUDYHR,-0.0554,,,,nan nan
TVHR,-0.0075,,,,nan nan
DIOPTERHR,0.0093,,,,nan nan
MOMMY,0.7323,0.252,2.909,0.004,0.239 1.226
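
The missing standard errors for `READHR`, `COMPHR`, `STUDYHR`, `TVHR`, and `DIOPTERHR` are consistent with perfect collinearity: by footnote (g), `DIOPTERHR` is an exact linear combination of the other four, so their coefficients cannot be identified separately.  The sketch below shows the rank deficiency on synthetic hours (not the study data); the lowercase variable names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical weekly hours for 50 children (synthetic, not the study data).
readhr, comphr, studyhr, tvhr = rng.randint(0, 20, size=(4, 50)).astype(float)

# Composite defined in footnote (g) of the data dictionary.
diopterhr = 3 * (readhr + studyhr) + 2 * comphr + tvhr

with_composite = np.column_stack([readhr, comphr, studyhr, tvhr, diopterhr])
without_composite = np.column_stack([readhr, comphr, studyhr, tvhr])

# Five columns but only four independent directions: the design is singular.
print(with_composite.shape[1], np.linalg.matrix_rank(with_composite))
# Dropping the composite restores full column rank.
print(without_composite.shape[1], np.linalg.matrix_rank(without_composite))
```

One possible remedy is to drop `DIOPTERHR` (or one of its components) from `X2` before calling `Logit()`.  Note also that `sm.Logit` does not add an intercept on its own; `sm.add_constant(X2)` would be needed to include one.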


Answer: TODO

> ### Question 10.  Now it's time for regularization!  Use `X1`.  According to `Lasso`, what are the non-significant features?

In [None]:
# TODO


Answer: TODO

> ### Question 11.  What are your conclusions about your parents' claims?

Answer: TODO

> ### Question 12.  Draw the ROC curve of your best tuned model

In [None]:
# TODO