## Background


We often have to compare parameter estimates across several versions of a model, for example:

- The same model estimated with several estimators (OLS, IV, GMM, ...)
- A model estimated by numerical optimization with different optimizers and/or different start values
- Monte Carlo exercises

Especially in large models (100 or more parameters), comparing the estimates by hand is time-consuming. Therefore, we need a plot that makes it easier to see:

- how large the differences in parameter estimates are between different models
- whether the confidence intervals of one model contain the parameter estimates of other models
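
Before any plotting, the second question can be answered as a plain table. The following is only a rough sketch: the numbers are made up, and the helper name `ci_contains` is hypothetical and not part of the planned interface. It assumes one DataFrame per model with the columns `params`, `lower` and `upper`.

```python
import pandas as pd

# Made-up estimates for two parameters from three estimators.
data_dict = {
    "ols": pd.DataFrame(
        {"params": [2.0, 0.10], "lower": [1.5, 0.00], "upper": [3.0, 0.15]},
        index=["p_1", "p_2"],
    ),
    "iv": pd.DataFrame(
        {"params": [1.0, 0.03], "lower": [0.5, 0.00], "upper": [1.8, 0.09]},
        index=["p_1", "p_2"],
    ),
    "gmm": pd.DataFrame(
        {"params": [2.5, 0.20], "lower": [2.2, 0.05], "upper": [5.0, 0.50]},
        index=["p_1", "p_2"],
    ),
}


def ci_contains(data_dict, param):
    """Rows: model whose CI is used. Columns: model whose estimate is tested."""
    rows = {}
    for ci_model, ci_df in data_dict.items():
        lower, upper = ci_df.loc[param, ["lower", "upper"]]
        rows[ci_model] = {
            est_model: bool(lower <= est_df.loc[param, "params"] <= upper)
            for est_model, est_df in data_dict.items()
        }
    return pd.DataFrame.from_dict(rows, orient="index")


table = ci_contains(data_dict, "p_1")
print(table)
```

Each cell says whether the confidence interval of the row model covers the point estimate of the column model. With 100 or more parameters such tables quickly become unreadable, which is the motivation for the interactive plot.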

If we simply plot every parameter estimate and every confidence interval at once, the plot becomes too cluttered to read. We therefore need an interactive plot that shows only what we currently want to see. [bokeh](https://bokeh.pydata.org/en/latest/index.html) is probably the best library for this, but feel free to use another one if you find something better. My first strategy would be to use something like [this](https://bokeh.pydata.org/en/latest/docs/gallery/elements.html) to plot the estimates of one parameter across models. The official [tutorials](https://hub.mybinder.org/user/bokeh-bokeh-notebooks-1rrayuuy/notebooks/tutorial/00%20-%20Introduction%20and%20Setup.ipynb) also explain how to make such plots interactive, how to stack several plots into one figure, and how to link the subplots within a figure.

You should start by writing a very basic function that only takes a data dictionary as argument and produces the plot I described in our last meeting. Later we will add more arguments for colors, background styles, etc. 

Please just work in this notebook.

In [1]:
# Interface
def comparison_plot(data_dict):
    """Make a comparison plot.
    
    Args:
        data_dict (dict): The keys are the names of different models.
            The values are pd.DataFrames where the index contains the names of 
            the parameters and there are three columns:
                - 'params', containing the point estimates
                - 'lower', containing the lower bound of the confidence interval
                - 'upper', containing the upper bound of the confidence interval
        
    """

## Tasks

1. Define two or three different data dictionaries that differ in the following dimensions
    - longer and shorter parameter names
    - some parameters have a large variance across models, some have a small variance
    - some have wide confidence intervals (larger than the variance of the parameter value), some have a small one.
2. Define the basic plot based on [this](https://bokeh.pydata.org/en/latest/docs/gallery/elements.html). When hovering over a point, the pop-up window should display:
    - the exact parameter value
    - the lower and upper bound of the confidence interval from that model
    - the model name
    - the standard deviation of this parameter across models
3. Implement the clicking action, i.e. when clicking on a point, the points that belong to this model are highlighted (across all subplots) while all other points become more transparent. This requires linking the different subplots.
4. We meet again.
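
The hover fields in task 2 include the standard deviation of each parameter across models, which is not part of the input. One possible way to prepare it is sketched below. The function name `to_long_with_std` and the column name `std_across_models` are hypothetical, and using the sample standard deviation (pandas' default, `ddof=1`) is an assumption.

```python
import pandas as pd

# Two small models in the format described in the interface docstring.
data_dict = {
    "model_1": pd.DataFrame(
        {"params": [2.0, 0.1, 30.0], "lower": [1.5, 0.001, 20.0], "upper": [3.0, 0.15, 40.0]},
        index=["p_1", "p_2", "p_3"],
    ),
    "model_2": pd.DataFrame(
        {"params": [1.0, 0.03, 26.0], "lower": [1.0, 0.0, 20.0], "upper": [3.0, 0.09, 30.0]},
        index=["p_1", "p_2", "p_3"],
    ),
}


def to_long_with_std(data_dict):
    """Stack all models into one long DataFrame and add the std across models."""
    frames = []
    for model, df in data_dict.items():
        # Move the parameter names from the index into a regular column.
        frame = df.rename_axis("param_name").reset_index()
        frame.insert(0, "model", model)
        frames.append(frame)
    long_df = pd.concat(frames, ignore_index=True)
    # Sample standard deviation (ddof=1) of each parameter across models.
    long_df["std_across_models"] = long_df.groupby("param_name")["params"].transform("std")
    return long_df


long_df = to_long_with_std(data_dict)
print(long_df)
```

A long DataFrame with one row per model and parameter can be passed to a plotting library's data source more or less directly, so every hover field is just a column lookup.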

If you have questions, just ask me again and we can meet.
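
For task 3, one way to link subplots in bokeh is to let all of them draw from the same `ColumnDataSource`, so that a selection made in one subplot applies to the shared rows in every subplot. The sketch below is only one possible approach and not necessarily how the final function should work. It assumes a wide layout with one row per model and one column per parameter, and the constant `zero` column is a hypothetical helper that places all points on one horizontal line.

```python
import pandas as pd
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Wide layout (assumption): one row per model, one column per parameter.
wide = pd.DataFrame(
    {
        "model": ["model_1", "model_2", "model_3"],
        "p_1": [2.0, 1.0, 3.5],
        "p_2": [0.1, 0.03, 0.2],
        "zero": [0, 0, 0],  # constant y-coordinate for all points
    }
)
source = ColumnDataSource(wide)  # shared by all subplots

plots = []
renderers = []
for param in ["p_1", "p_2"]:
    p = figure(title=param, tools="tap,reset")
    # Points that are not selected are drawn with a lower alpha.
    r = p.scatter(x=param, y="zero", source=source, size=12, nonselection_alpha=0.2)
    p.add_tools(HoverTool(tooltips=[("model", "@model"), ("value", f"@{param}")]))
    plots.append(p)
    renderers.append(r)

layout = column(*plots)
# bokeh.io.show(layout) would open the stacked plots in a browser.
```

Because a row of the shared source corresponds to one model, tapping a point in any subplot selects that model everywhere, and the remaining points fade. A long layout with one row per model and parameter would probably need a custom callback to select all rows of the clicked model.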

In [2]:
import pandas as pd
import numpy as np
from comparison_plot import comparison_plot

In [3]:
nr_models = 10
nr_intercepts = 5

intercept_tuples = [str(i) for i in range(nr_intercepts)]
short_controls = ['age', 'grade', 'iq_score', 'gender']
long_controls = short_controls + [
    'parent_educ', 'parent_inc', 'parent_occup', 'dist_to_school', 
    'single_parent', 'single_parent_x_parent_inc']

cols = ['model', 'param_value', 'param_name', 'lower', 'upper', 'group']

model_comp_df = pd.DataFrame()

for i in range(nr_models):
    if i < 0.7 * nr_models:
        ctrl_names = short_controls
        factor = 0.5
    else:
        ctrl_names = long_controls
        factor = 0.7
    nr_params = nr_intercepts + len(ctrl_names)
    
    df = pd.DataFrame(columns=cols)
    df['param_value'] = factor * np.arange(nr_params) + np.random.normal(0, 1, nr_params)
    df['param_name'] = intercept_tuples + ctrl_names
    df['lower'] = df['param_value'] - np.random.normal(1, factor, nr_params)
    df['upper'] = df['param_value'] + np.random.normal(1, factor, nr_params)
    df['group'] = ['intercept'] * nr_intercepts + [r'$\beta$'] * len(ctrl_names)
    df['model'] = 'model_' + str(i)
    
    model_comp_df = pd.concat([model_comp_df, df], axis=0, sort=False)
    
model_comp_df.reset_index(inplace=True, drop=True)
model_comp_df = model_comp_df[['model', 'group', 'param_name', 'lower', 'param_value', 'upper']]
model_comp_df[:9]

| | model | group | param_name | lower | param_value | upper |
|---|---|---|---|---|---|---|
| 0 | model_0 | intercept | 0 | -0.234841 | 0.324288 | 1.871211 |
| 1 | model_0 | intercept | 1 | -0.189183 | 0.522327 | 1.207206 |
| 2 | model_0 | intercept | 2 | -1.856715 | -0.413885 | 0.560972 |
| 3 | model_0 | intercept | 3 | 0.998369 | 2.130928 | 3.449516 |
| 4 | model_0 | intercept | 4 | 0.25655 | 3.052554 | 4.22195 |
| 5 | model_0 | $\beta$ | age | 1.420543 | 2.66684 | 3.667222 |
| 6 | model_0 | $\beta$ | grade | 0.325244 | 1.886626 | 2.682159 |
| 7 | model_0 | $\beta$ | iq_score | 3.395515 | 3.427538 | 3.533402 |
| 8 | model_0 | $\beta$ | gender | 3.59389 | 3.771181 | 4.944963 |
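
The long DataFrame above and the `data_dict` described in the interface docstring hold the same information in two layouts. If both need to be supported, a conversion along the following lines might help. This is a sketch only: the helper name `long_to_data_dict` is hypothetical, and the two example rows are copied from the output above.

```python
import pandas as pd

# Two rows copied from the output above.
long_df = pd.DataFrame(
    {
        "model": ["model_0", "model_0"],
        "group": ["intercept", "intercept"],
        "param_name": ["0", "1"],
        "lower": [-0.234841, -0.189183],
        "param_value": [0.324288, 0.522327],
        "upper": [1.871211, 1.207206],
    }
)


def long_to_data_dict(long_df):
    """Split a long DataFrame into one DataFrame per model."""
    data_dict = {}
    for model, group in long_df.groupby("model", sort=False):
        data_dict[model] = (
            group.set_index("param_name")[["param_value", "lower", "upper"]]
            # Use the column name expected by the interface docstring.
            .rename(columns={"param_value": "params"})
        )
    return data_dict


data_dict = long_to_data_dict(long_df)
print(data_dict["model_0"])
```

Note that the `group` column is dropped in this conversion, because the dictionary format from the docstring has no place for it.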


In [4]:
show_p = comparison_plot(df=model_comp_df)

## Dictionaries

In [5]:
import pandas as pd
import numpy as np

def make_model(values):
    """Build one model's DataFrame: rows are p_1..p_3, columns are params, lower, upper."""
    return pd.DataFrame(
        np.array(values),
        columns=['params', 'lower', 'upper'],
        index=['p_1', 'p_2', 'p_3'],
    )


# Models for the first dictionary
m_1 = make_model([[2, 1.5, 3], [0.1, 0.001, 0.15], [30, 20, 40]])
m_2 = make_model([[1, 1, 3], [0.03, 0, 0.09], [26, 20, 30]])
m_3 = make_model([[3.5, 3.3, 5], [0.2, 0.05, 0.5], [27, 26, 36]])
m_4 = make_model([[1.3, 1, 2], [0.14, 0.001, 0.2], [35, 25, 40]])

# Models for the second dictionary
model_1 = make_model([[2, 1.5, 3], [0.1, 0.001, 0.15], [30, 20, 40]])
model_2 = make_model([[0.5, 0, 2.2], [0.01, 0, 0.09], [29, 20, 30]])
model_3 = make_model([[3.7, 3.3, 5], [0.3, 0.05, 0.5], [27, 26, 36]])
model_4 = make_model([[1.3, 1, 2], [0.14, 0.001, 0.2], [32, 25, 40]])

# Models for the third dictionary
mo_1 = make_model([[2.8, 1.5, 3], [0.1, 0.001, 0.15], [28, 20, 40]])
mo_2 = make_model([[2.9, 1, 3], [0.09, 0, 0.09], [26, 20, 30]])
mo_3 = make_model([[3.3, 3.3, 5], [0.14, 0.05, 0.5], [27, 26, 36]])
mo_4 = make_model([[1.8, 1, 2.3], [0.12, 0.001, 0.2], [26, 25, 40]])

dic_1 = {'model_1': m_1, 'model_2': m_2, 'model_3': m_3, 'model_4': m_4}
dic_2 = {'model_1': model_1, 'model_2': model_2, 'model_3': model_3, 'model_4': model_4}
dic_3 = {'model_1': mo_1, 'model_2': mo_2, 'model_3': mo_3, 'model_4': mo_4}

## Function

In [6]:
plot_dict_1 = comparison_plot(dic_1)

In [7]:
plot_dict_2 = comparison_plot(dic_2)

In [8]:
plot_dict_3 = comparison_plot(dic_3)