# Case Study 1: Educational Outcomes for Hearing-impaired Children

In [None]:
# Import data analysis packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Import PyMC and ArviZ
import arviz as av
import pymc as pm

Here, we are interested in determining factors associated with
better or poorer learning outcomes.

## The Data

The anonymized data set is taken from the Listening and
Spoken Language Data Repository (LSL-DR), an international data repository.
It tracks the demographics and longitudinal outcomes of children who
have hearing loss and are enrolled in programs focused on supporting
listening and spoken language development. Researchers are interested
in discovering factors related to improvements in educational outcomes
within these programs.

The data set contains a suite of available predictors including:

- Gender (`male`)
- Number of household siblings (`siblings`)
- Index of family involvement (`family_inv`)
- Whether the primary household language is not English (`non_english`)
- Presence of a previous disability (`prev_disab`)
- Non-white race (`non_white`)
- Age at time of testing (in months, `age_test`)
- Whether hearing loss is not severe (`non_severe_hl`)
- Whether the subject's mother obtained a high school diploma or better (`mother_hs`)
- Whether the hearing impairment was identified by 3 months of age (`early_ident`)

The outcome variable is a standardized test score in one of several
learning domains.

In [None]:
# Load the test scores to be analyzed.
test_scores = pd.read_csv(pm.get_data('test_scores.csv'), index_col=0)
test_scores.head()

In [None]:
# Examine a histogram of the outcomes
test_scores['score'].hist()

plt.show()
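
Before cleaning the data, it can help to see how much of it is missing, since any row with a missing entry is lost if we simply drop incomplete rows. The sketch below uses a small, made-up `DataFrame` (the `toy` name and all of its values are hypothetical, not taken from the LSL-DR data) to count missing entries per column and the number of rows that `dropna()` would discard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "score": [96.0, 97.0, np.nan, 84.0, 110.0],
        "siblings": [1.0, 3.0, 2.0, np.nan, 0.0],
        "mother_hs": [1.0, np.nan, 1.0, 0.0, 1.0],
        "male": [0.0, 1.0, 1.0, 0.0, 1.0],
    }
)

# Number of missing entries in each column
missing_per_column = toy.isna().sum()

# Number of rows with at least one missing entry; dropna() discards these
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Number of complete rows that survive dropna()
rows_kept = len(toy.dropna())

print(missing_per_column)
print(f"dropna() would discard {rows_with_missing} of {len(toy)} rows")
```

The same three expressions should work unchanged on `test_scores`, assuming it has been loaded as a `DataFrame` as above.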

In [None]:
# Dropping rows with missing values is, in general, a **very bad idea**:
# it discards information and can bias the results. We do so here only
# for simplicity. We also convert all values to floating point numbers.
X = test_scores.dropna().astype(float)

# The `DataFrame.pop()` method removes the specified column from the
# `DataFrame` in place and returns it as a `Series`.
y = X.pop('score')

# Standardize the features
X -= X.mean() # Center each column at its mean
X /= X.std() # Scale each column to unit standard deviation

# `N` is the number of observations (rows) and `D` is the number of
# predictors (columns) remaining after `score` has been removed.
N, D = X.shape
N, D
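
As a quick sanity check on the preprocessing above, the following sketch repeats the same steps on a small, made-up `DataFrame` and confirms that each standardized column ends up with mean 0 and standard deviation 1. The names `toy`, `X_toy`, `y_toy`, `n_rows`, and `n_cols` are hypothetical, as are all of the values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame(
    {
        "score": [96.0, 97.0, 84.0, 110.0],
        "siblings": [1.0, 3.0, 2.0, 0.0],
        "age_test": [48.0, 54.0, 60.0, 66.0],
    }
)

X_toy = toy.copy()

# pop() removes the outcome column in place and returns it as a Series
y_toy = X_toy.pop("score")

# Standardize: subtract the column means, then divide by the column
# standard deviations (pandas uses the sample standard deviation, ddof=1)
X_toy = (X_toy - X_toy.mean()) / X_toy.std()

# Rows are observations, columns are predictors
n_rows, n_cols = X_toy.shape

print(X_toy.mean().round(6))
print(X_toy.std().round(6))
print(n_rows, n_cols)
```

Note that the outcome is deliberately left on its original scale here, mirroring the cell above, where only the predictors in `X` are standardized.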