# Fertility of women in Bangladesh

Bayesian Logistic Regression Example

## Libraries

In [1]:
import pandas as pd
import numpy as np
from scipy.special import expit

## Variables and functions definitions

In [2]:
path_data = '../data/external/'

## Import the data

In [3]:
data = pd.read_csv(path_data + 'bangladesh.csv')
data.head(2)

Unnamed: 0,district,urban,living.children,age_mean,contraceptive_use
0,35,0,4,2.44,0
1,22,0,2,-1.5599,1


In [4]:
districts = data.sort_values(by='district').district.unique()
n_districts = len(districts)

## Pre-processing

In [5]:
# Relabel district 61 as 54 (presumably to fill a gap so that district IDs are contiguous)
data.replace({'district':{61:54}}, inplace=True)

## Modeling

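Denote the attributes of each woman by `X` and the outcome by `Y`. The outcome `Y` is a binary indicator of contraceptive use (1 if the woman uses contraceptives, 0 otherwise), and the model describes the probability that `Y = 1`: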

$$P(Y=1)=\frac{1}{1+e^{-(\beta_0+\beta_1X_1+\dots+\beta_pX_p)}}$$
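
As a quick numerical illustration of the formula, the sketch below evaluates it for a single hypothetical woman. The coefficient and covariate values are made up for illustration only; `expit` from `scipy.special` is the inverse-logit function.

```python
import numpy as np
from scipy.special import expit

# Hypothetical coefficients: intercept beta_0 followed by beta_1..beta_3
beta = np.array([2.0, 4.0, -3.0, -2.0])

# Hypothetical covariates for one woman; the leading 1 multiplies the intercept
x = np.array([1.0, 1.0, 2.0, -1.56])

# Linear predictor beta_0 + beta_1*x_1 + ... + beta_p*x_p
eta = beta @ x

# P(Y = 1) via the logistic function; expit(eta) equals 1 / (1 + exp(-eta))
p = expit(eta)
print(f"linear predictor = {eta:.3f}, P(Y=1) = {p:.3f}")
```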

We could estimate the $\beta$s with a classic GLM. But if we have real-life domain knowledge about the situation, we can incorporate it by treating the $\beta$s as random variables and assigning them prior distributions, instead of modeling them as fixed unknown quantities.

Let's simulate binary response data `Y` using chosen prior parameter values:

In [6]:
# Generate beta0 values with the chosen parameters. Beta0 is a random variable.
beta_0_mu = 2
beta_0_sigma = 1
sample_size = data.shape[0]
beta_0 = np.random.normal(beta_0_mu, beta_0_sigma, sample_size)

In [7]:
# Betas 1, 2 and 3 (the covariate coefficients) are deterministic
beta_1 = 4
beta_2 = -3
beta_3 = -2

Generate the simulated response `contraceptive_use_sim` by applying the `expit` function (the inverse logit) to the linear predictor built from the parameter values defined above. If the resulting probability is greater than $0.5$, the response is set to 1; otherwise it is set to 0.

In [8]:
data_sim = data.copy()
data_sim['contraceptive_use_sim'] = expit(beta_0 + beta_1*data_sim['urban'] + 
                                          beta_2*data_sim['living.children'] +
                                          beta_3*data_sim['age_mean'])
data_sim['contraceptive_use_sim'] = (data_sim['contraceptive_use_sim'] > 0.5).astype(int) 
data_sim.head()

Unnamed: 0,district,urban,living.children,age_mean,contraceptive_use,contraceptive_use_sim
0,35,0,4,2.44,0,0
1,22,0,2,-1.5599,1,0
2,29,0,2,-8.5599,1,1
3,5,0,3,-4.5599,1,1
4,34,1,4,8.44,0,0
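
Thresholding at $0.5$ makes the simulated response a deterministic function of the linear predictor. An alternative, which is not what this notebook does, is to draw each response from a Bernoulli distribution with the computed probability. The sketch below shows that variant on the five rows displayed above; the random generator `rng` and its seed are assumptions added for reproducibility.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(seed=0)  # seed is arbitrary, for reproducibility only

# Covariates of the five rows shown in the output above
urban = np.array([0, 0, 0, 0, 1])
living_children = np.array([4, 2, 2, 3, 4])
age_mean = np.array([2.44, -1.5599, -8.5599, -4.5599, 8.44])

# Same parameter values as in the simulation cells
beta_0 = rng.normal(2, 1, size=urban.size)  # random intercept, N(2, 1)
beta_1, beta_2, beta_3 = 4, -3, -2

# Probability of contraceptive use for each row
p = expit(beta_0 + beta_1 * urban + beta_2 * living_children + beta_3 * age_mean)

# Draw the response from Bernoulli(p) instead of thresholding at 0.5
y_sim = rng.binomial(1, p)
print(np.round(p, 3), y_sim)
```

With this variant, rows whose probability is close to $0.5$ can come out as either 0 or 1, whereas thresholding always assigns them the same value.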


Apply MCMC and try to estimate the true parameter values used to simulate `contraceptive_use_sim`. 

Use priors:

$$\beta_{0j}\sim N(\mu_0, \sigma_0^2)$$
$$\frac{1}{\sigma_0^2} \sim \mathrm{Gamma}(0.1, 0.1)$$
$$\mu_0, \beta_1, \beta_2, \beta_3 \sim N(0, 10000)$$

The model is specified as follows, where $i$ indexes women and $j$ indexes districts:

$$Y_{ij} \sim \mathrm{Bernoulli}(p_{ij})$$
$$\mathrm{logit}(p_{ij})=\beta_{0j} + \beta_1\,\mathrm{urban}_{ij} + \beta_2\,\mathrm{living.children}_{ij} + \beta_3\,\mathrm{age\_mean}_{ij}$$

We can use the `PyJAGS` library to run MCMC and obtain the posterior distributions of the model parameters.
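
A minimal sketch of how the model above might be written in the JAGS language and how its inputs could be prepared follows. It rests on several assumptions: the second argument of $N(0, 10000)$ is read as a variance (JAGS's `dnorm` takes a precision, hence `0.0001`), the Gamma prior uses JAGS's shape and rate parameterization, and `build_jags_data` and the variable names inside the model string are hypothetical, not part of the notebook.

```python
import numpy as np

# JAGS model mirroring the priors and likelihood stated above.
# dnorm(mean, precision): variance 10000 corresponds to precision 1/10000 = 0.0001.
jags_model_code = """
model {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0[district[i]] + beta1 * urban[i] + beta2 * children[i] + beta3 * age[i]
  }
  for (j in 1:J) {
    beta0[j] ~ dnorm(mu0, tau0)  # tau0 = 1 / sigma0^2
  }
  tau0 ~ dgamma(0.1, 0.1)
  mu0 ~ dnorm(0, 0.0001)
  beta1 ~ dnorm(0, 0.0001)
  beta2 ~ dnorm(0, 0.0001)
  beta3 ~ dnorm(0, 0.0001)
}
"""


def build_jags_data(df, response="contraceptive_use_sim"):
    """Hypothetical helper: pack the columns into the dict the model string expects."""
    # Map district labels to contiguous indices 1..J, since JAGS indexes from 1
    labels, district_idx = np.unique(np.asarray(df["district"]), return_inverse=True)
    return {
        "y": np.asarray(df[response]),
        "district": district_idx + 1,
        "urban": np.asarray(df["urban"]),
        "children": np.asarray(df["living.children"]),
        "age": np.asarray(df["age_mean"]),
        "N": len(district_idx),
        "J": len(labels),
    }
```

Assuming PyJAGS is installed, these pieces could then be passed along the lines of `pyjags.Model(code=jags_model_code, data=build_jags_data(data_sim), chains=3)` followed by `model.sample(5000, vars=['mu0', 'tau0', 'beta1', 'beta2', 'beta3'])`; the chain and iteration counts here are arbitrary placeholders.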