<a href="https://colab.research.google.com/github/rulerauthors/ruler/blob/master/user_study/ruler_user_study_figures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import altair as alt
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', None)

## Load the full user study data from GitHub

In [0]:
full_data = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/full_study_data.csv')
display(full_data)

## About this data





> We carried out the study using a within-subjects experiment design, where all participants performed tasks using both conditions (tools). The sole independent variable controlled was the method of creating labeling functions. We counterbalanced the order in which the tools were used, as well as which classification task was performed with which tool.

### Tasks and Procedure 
> We asked participants to write labeling functions for two prevalent labeling tasks: spam detection and sentiment classification. They performed these two tasks on YouTube Comments and Amazon Reviews, respectively. Participants received 15 minutes of instruction on how to use each tool, using a topic classification task (electronics vs. guns) over a newsgroup dataset \cite{rennie200820} as an example. We asked participants to write as many functions as they considered necessary for the goal of the task. They were given 30 minutes to complete each task, and we recorded the labeling functions they created and these functions' individual and aggregate performances. After completing both tasks, participants also filled out an exit survey, providing their qualitative feedback.

> For the manual programming condition, we iteratively developed a Jupyter notebook interface based on the Snorkel tutorial. We provided a section for writing functions, a section with diverse analysis tools, and a section to train a logistic regression model on the labels they had generated (evaluated on the test set shown to the user, which is separate from our heldout test set used for the final evaluation).



## Select Best Model

From [our EMNLP '20 submission](https://github.com/rulerauthors/ruler/blob/master/media/Ruler_EMNLP2020.pdf):



> To analyze the performance of the labeling functions created by participants, for each participant and task we select the labeling model that achieved the highest F1 score on the development set. For each labeling model, we then train a logistic regression model on a training dataset generated by the model. We finally evaluate the performance of the logistic regression model on a heldout test set.
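
The selection rule quoted above can be illustrated on a small, made-up log. The sketch below assumes the same column names as `full_data` (`data`, `f1`, `action`, `micro_f1`); the rows, the values, and the `'stats'` action name are hypothetical and only show the mechanics: locate the best dev-set F1, then take the first logistic regression evaluation logged at or after it.

```python
import pandas as pd

# Hypothetical log for one participant and one task (all values are made up).
log = pd.DataFrame({
    'data':     ['dev', 'test', 'dev', 'test', 'dev', 'test'],
    'action':   ['stats', 'heldout_test_LR_stats'] * 3,
    'f1':       [0.55, None, 0.70, None, 0.62, None],
    'micro_f1': [None, 0.50, None, 0.66, None, 0.61],
})

def select_best(log, action='heldout_test_LR_stats'):
    """Return the first `action` row logged at or after the best dev-set F1."""
    best_dev_idx = log.loc[log['data'] == 'dev', 'f1'].idxmax()
    after = log.loc[best_dev_idx:]
    return after[after['action'] == action].iloc[0]

best = select_best(log)
print(best['micro_f1'])
```

Here the best dev-set F1 is logged at index 2, so the held-out evaluation at index 3 is the one reported, even though a later evaluation exists.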



In [0]:
def create_best_table_small(action='heldout_test_LR_stats'):
  # DataFrame.append was removed in pandas 2.0, so collect rows in a list
  rows = []

  subjects = full_data.participant.value_counts().index
  datasets = ['amazon', 'youtube']

  for pid in subjects:
    for d in datasets:
      # gather all the rows logging participant {pid}'s progress on the given dataset/task
      sub_df = full_data[(full_data['participant']==pid) & (full_data['dataset']==d)]
      sub_df = sub_df.reset_index(drop=True)

      # find index of best performance on dev set
      idxmax = sub_df[sub_df.data == 'dev']['f1'].idxmax()

      # choose the first logistic regression model trained after that,
      # report the performance on the held out test data
      try:
        # filter the slice with its own mask to avoid a misaligned boolean index
        after_best = sub_df.loc[idxmax:]
        r = after_best[after_best.action==action].iloc[0]
      except IndexError:
        # in one case the user never finished any labelling functions,
        # so we report the initial 'baseline' LR performance
        # which is f1 score of 0.5
        r = sub_df[sub_df.action==action].iloc[0]

      # the logged precision and recall are separated by class.
      # we weight each class by its size in the heldout split to average them
      size0 = 418
      size1 = 382
      if r.task=="Youtube":
        size0=192
        size1=164
      prec = (r['precision_0']*size0 +r['precision_1']*size1)/(size0+size1)
      rec = (r['recall_0']*size0+r['recall_1']*size1)/(size0+size1)

      rows.append({'participant': pid,
                   'condition': r['condition'].lower(),
                   'task':'sentiment' if d == 'amazon' else 'spam',
                   'dataset':d,
                   'f1':r['micro_f1'],
                   'precision':prec,
                   'recall':rec,
                   'accuracy':r['accuracy'],
                   'max_dev_f1': sub_df.at[idxmax, 'f1'],
                   'training_label_coverage': r['training_label_coverage'],
                   })
  # sort columns alphabetically, matching the column order of the legacy append
  return pd.DataFrame(rows).sort_index(axis=1)
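
The per-class averaging inside `create_best_table_small` weights each class by its size in the held-out split. A minimal sketch of that arithmetic follows, using the Amazon class sizes (418 and 382) from the function above; the confusion counts themselves are hypothetical. It also shows that when the weights equal the true class sizes, the weighted recall coincides with accuracy.

```python
def weighted_average(per_class_values, class_sizes):
    """Average per-class scores, weighting each class by its number of examples."""
    total = sum(class_sizes)
    return sum(v * n for v, n in zip(per_class_values, class_sizes)) / total

# Hypothetical confusion counts: each line is one true class.
# The row sums match the Amazon held-out class sizes used above (418 and 382).
tn, fp = 300, 118   # true class 0: predicted 0, predicted 1
fn, tp = 82, 300    # true class 1: predicted 0, predicted 1

size0, size1 = tn + fp, fn + tp
recall_0, recall_1 = tn / size0, tp / size1
precision_0, precision_1 = tn / (tn + fn), tp / (tp + fp)

recall = weighted_average([recall_0, recall_1], [size0, size1])
precision = weighted_average([precision_0, precision_1], [size0, size1])
accuracy = (tn + tp) / (size0 + size1)

print(f"precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```

Weighted precision has no such identity, so it generally differs from both recall and accuracy.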

In [21]:
dt_best_small  = create_best_table_small()
display(dt_best_small)



| | accuracy | condition | dataset | f1 | max_dev_f1 | participant | precision | recall | task | training_label_coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.48375 | snorkel | amazon | 0.562036 | 0.653386 | p4 | 0.492273 | 0.48375 | sentiment | 0.58375 |
| 1 | 0.578652 | ruler | youtube | 0.71374 | 0.632558 | p4 | 0.682599 | 0.525599 | spam | 0.135 |
| 2 | 0.4825 | ruler | amazon | 0.622951 | 0.588235 | p8 | 0.50177 | 0.4825 | sentiment | 0.1125 |
| 3 | 0.789326 | snorkel | youtube | 0.76489 | 0.697436 | p8 | 0.821485 | 0.809982 | spam | 0.54375 |
| 4 | 0.50625 | ruler | amazon | 0.655623 | 0.642857 | p2 | 0.667621 | 0.50625 | sentiment | 0.4025 |
| 5 | 0.679775 | snorkel | youtube | 0.604167 | 0.691892 | p2 | 0.744225 | 0.710193 | spam | 1.0 |
| 6 | 0.58875 | ruler | amazon | 0.636464 | 0.64486 | p1 | 0.608134 | 0.58875 | sentiment | 0.2975 |
| 7 | 0.52809 | snorkel | youtube | 0.596154 | 0.651163 | p1 | 0.517664 | 0.512288 | spam | 1.0 |
| 8 | 0.51 | snorkel | amazon | 0.56541 | 0.639456 | p3 | 0.519668 | 0.51 | sentiment | 1.0 |
| 9 | 0.61236 | ruler | youtube | 0.72619 | 0.75 | p3 | 0.695697 | 0.566626 | spam | 0.62375 |
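
As a quick sanity check of the within-subjects design, the (participant, condition, dataset) triples from the table above can be cross-tabulated. The sketch below copies those triples by hand rather than recomputing them from `full_data`; each participant should appear once per tool and once per dataset.

```python
import pandas as pd

# (participant, condition, dataset) triples copied from the table above.
rows = [
    ('p4', 'snorkel', 'amazon'), ('p4', 'ruler', 'youtube'),
    ('p8', 'ruler', 'amazon'),   ('p8', 'snorkel', 'youtube'),
    ('p2', 'ruler', 'amazon'),   ('p2', 'snorkel', 'youtube'),
    ('p1', 'ruler', 'amazon'),   ('p1', 'snorkel', 'youtube'),
    ('p3', 'snorkel', 'amazon'), ('p3', 'ruler', 'youtube'),
]
design = pd.DataFrame(rows, columns=['participant', 'condition', 'dataset'])

# Each participant used each tool exactly once and saw each dataset exactly once.
per_tool = pd.crosstab(design['participant'], design['condition'])
per_dataset = pd.crosstab(design['participant'], design['dataset'])
assert (per_tool == 1).all().all()
assert (per_dataset == 1).all().all()

# How the tool/task pairings are distributed across participants.
pairing = pd.crosstab(design['condition'], design['dataset'])
print(pairing)
```

With five participants in this table the pairings cannot be split evenly, so one tool/task combination necessarily appears more often than the other.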


## Generate Figures

### Quantitative (model performance, etc.)

In [0]:
dt_bm_small = dt_best_small.melt(id_vars=['participant', 'condition', 'task', 'dataset'], 
        var_name="metric", 
        value_name="value")

In [23]:
W = 300
H = 50
error_bars = alt.Chart(dt_bm_small).mark_errorbar(extent='stderr').encode(
  x=alt.X('value:Q'),
  y=alt.Y('condition:N'),
  color=alt.Color('condition:N', sort=['ruler'])
).properties(width=W,height=H)

points = alt.Chart(dt_bm_small).mark_point(filled=True).encode(
  x=alt.X('value:Q', title=None, aggregate='mean', axis=alt.Axis(tickCount=10)),
  y=alt.Y('condition:N'),
  text=alt.Text('value:Q'),
  color=alt.Color('condition:N', sort=['ruler'], legend=alt.Legend(title=None, orient='top'))
).properties(width=W,height=H)

(error_bars + points).facet(
    facet= alt.Facet('metric:N', sort=['f1', 'accuracy', 'training_label_coverage', 'max_dev_f1'], title=None),
    columns=2
)
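
The chart above plots, per metric and condition, a mean with standard-error bars. For readers who want the numbers behind one panel, the sketch below recomputes the mean and standard error of the held-out F1 scores with pandas, using the `f1` values copied by hand from the `dt_best_small` output shown earlier. It assumes that `extent='stderr'` corresponds to the sample standard error that pandas' `sem` returns, which is worth confirming against the Altair/Vega-Lite documentation.

```python
import pandas as pd

# condition and f1 columns copied from the dt_best_small output (rows 0-9).
f1 = pd.DataFrame({
    'condition': ['snorkel', 'ruler', 'ruler', 'snorkel', 'ruler',
                  'snorkel', 'ruler', 'snorkel', 'snorkel', 'ruler'],
    'f1': [0.562036, 0.71374, 0.622951, 0.76489, 0.655623,
           0.604167, 0.636464, 0.596154, 0.56541, 0.72619],
})

# count, mean, and standard error of the mean per condition
summary = f1.groupby('condition')['f1'].agg(['count', 'mean', 'sem'])
print(summary)
```

With only five observations per condition, the standard errors are wide relative to the gap between the means, so the chart is best read as descriptive.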

### Qualitative (Survey responses)

In [0]:
background = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/background_survey_anon.csv', index_col=0)
exit_survey = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/exit_survey_anon.csv', index_col=0)
final_survey = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/final_survey_anon.csv', index_col=0)

The original column names of `exit_survey` are the statements that users rated their agreement with, on a Likert scale of 1-5.

We'll shorten these column names for our figures.

In [0]:
# simplify column names
exit_survey.columns = ['Timestamp', 'condition',
       'satisfaction', 'easy to use',
       'expressive enough',
       'easy to learn',
       'feedback',
       'how to improve',
       'other',
       'comments', 'participant']
exit_survey = exit_survey.drop('Timestamp', axis=1)

In [0]:
df_qm = exit_survey.melt(id_vars=['participant', 'condition','comments', 'how to improve', 'feedback', 'other'], 
        var_name="metric", 
        value_name="value")

In [35]:
error_bars = alt.Chart(df_qm).mark_errorbar(extent='stderr').encode(
  x=alt.X('value:Q'),
  y=alt.Y('condition:N'),
  color=alt.Color('condition:N', sort=['ruler'])
).properties(width=400,height=100)

points = alt.Chart(df_qm).mark_point(filled=True).encode(
  x=alt.X('value:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
  color=alt.Color('condition:N', sort=['ruler'])
).properties(width=400,height=100)

(error_bars + points).facet(
    # sort values must match the shortened column names assigned to exit_survey
    facet= alt.Facet('metric:N',sort=['easy to use', 'expressive enough', 'easy to learn', 'satisfaction']),
    columns=2
)