<a href="https://colab.research.google.com/github/rulerauthors/ruler/blob/master/user_study/ruler_user_study_figures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import altair as alt
import numpy as np
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', None)

## Load the full user study data from Github

In [2]:
full_data = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/full_study_data.csv')
display(full_data)

## About this data





> We carried out the study using  a within-subjects experiment design, where all participants performed tasks using both conditions (tools).  The sole independent variable controlled was the method of creating labeling functions. We counterbalanced the order in which the tools were used, as well as which classification task we performed with which tool. 
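
As a rough illustration of the counterbalancing described in the quote, the sketch below enumerates the four possible (tool order, task pairing) configurations and cycles through them. The participant ids, the count of eight participants, and the `assign` helper are hypothetical and are not taken from the study materials.

```python
from itertools import cycle, product

tools = ("ruler", "snorkel")
tasks = ("spam", "sentiment")

# Each configuration fixes which tool and which task come first.
# The second session uses the remaining tool and the remaining task.
configs = [
    ((first_tool, first_task),
     (tools[1 - tools.index(first_tool)], tasks[1 - tasks.index(first_task)]))
    for first_tool, first_task in product(tools, tasks)
]

def assign(participants):
    """Cycle through the four configurations so each is used equally often."""
    return {pid: cfg for pid, cfg in zip(participants, cycle(configs))}

# hypothetical participant ids
schedule = assign([f"p{i}" for i in range(1, 9)])
for pid, sessions in schedule.items():
    print(pid, sessions)
```

With a participant count that is a multiple of four, every tool is used first equally often and every task is paired with every tool equally often.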

### Tasks and Procedure 
> We asked participants to write labeling functions for two prevalent labeling tasks: spam detection and sentiment classification. They performed these two tasks on YouTube Comments and Amazon Reviews, respectively. Participants received 15 mins of instruction on how to use each tool, using a topic classification task (electronics vs. guns) over a newsgroup dataset~\cite{rennie200820} as an example. We asked participants to write as many functions as they considered necessary for the goal of the task. They were given 30 mins to complete each task and we recorded the labeling functions they created and these functions' individual and aggregate performances. After completing both tasks, participants also filled out an exit survey, providing their qualitative feedback.

> For the manual programming condition, we iteratively developed a Jupyter notebook interface based on the Snorkel tutorial. We provided a section for writing functions, a section with diverse analysis tools, and a section to train a logistic regression model on the labels they had generated (evaluated on the test set shown to the user, which is separate from our heldout test set used for the final evaluation).



## Select Best Model

From [our EMNLP '20 submission](https://github.com/rulerauthors/ruler/blob/master/media/Ruler_EMNLP2020.pdf):



> To analyze the performance of the labeling functions created by participants, for each participant and task we select the labeling model that achieved the highest f1 score on the development set. For each labeling model, we then train a logistic regression model on a training dataset generated by the model. We finally evaluate the performance of the logistic regression model on a heldout test set.
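
The selection rule in the quote can be sketched on a toy log in plain Python. The log values and the `labeling_model_stats` action name are made up for illustration; only `heldout_test_LR_stats`, the `dev` split label, and the `f1` field mirror names used in this notebook.

```python
# Toy log for one participant and one task (all values are made up).
log = [
    {"step": 0, "action": "heldout_test_LR_stats", "data": "test", "f1": 0.50},
    {"step": 1, "action": "labeling_model_stats", "data": "dev", "f1": 0.61},
    {"step": 2, "action": "heldout_test_LR_stats", "data": "test", "f1": 0.58},
    {"step": 3, "action": "labeling_model_stats", "data": "dev", "f1": 0.70},
    {"step": 4, "action": "heldout_test_LR_stats", "data": "test", "f1": 0.66},
    {"step": 5, "action": "labeling_model_stats", "data": "dev", "f1": 0.64},
    {"step": 6, "action": "heldout_test_LR_stats", "data": "test", "f1": 0.63},
]

def select_best(log, action="heldout_test_LR_stats"):
    # 1. find the row with the highest dev-set f1 (first one wins on ties)
    dev_rows = [r for r in log if r["data"] == "dev"]
    best_dev = max(dev_rows, key=lambda r: r["f1"])
    # 2. take the first LR evaluation logged at or after that row
    later = [r for r in log
             if r["step"] >= best_dev["step"] and r["action"] == action]
    # 3. fall back to the first logged LR result when none follows
    chosen = later[0] if later else next(r for r in log if r["action"] == action)
    return best_dev, chosen

best_dev, chosen = select_best(log)
print(best_dev["step"], chosen["step"], chosen["f1"])
```

Note that the reported test score is the one that follows the best dev score, not the best test score in the log, so the selection never looks at the heldout data.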



In [0]:
def create_best_table_small(action='heldout_test_LR_stats'):
  # collect one summary row per (participant, dataset) pair
  rows = []

  subjects = full_data.participant.value_counts().index
  datasets = ['amazon', 'youtube']

  for pid in subjects:
    for d in datasets:
      # gather all the rows logging participant {pid}'s progress on the given dataset/task
      sub_df = full_data[(full_data['participant']==pid) & (full_data['dataset']==d)]
      sub_df = sub_df.reset_index(drop=True)

      # find index of best performance on dev set
      idxmax = sub_df[sub_df.data == 'dev']['f1'].idxmax()

      # choose the first logistic regression model trained after that,
      # report the performance on the held out test data
      after_best = sub_df.loc[idxmax:]
      try:
        r = after_best[after_best.action==action].iloc[0]
      except IndexError:
        # in one case the user never finished any labelling functions,
        # so we report the initial 'baseline' LR performance
        # which is f1 score of 0.5
        r = sub_df[sub_df.action==action].iloc[0]

      # the logged precision and recall are separated by class.
      # we weight them by the heldout class sizes to get a single precision and recall
      size0 = 418
      size1 = 382
      if r.task=="Youtube":
        size0=192
        size1=164
      prec = (r['precision_0']*size0 +r['precision_1']*size1)/(size0+size1)
      rec = (r['recall_0']*size0+r['recall_1']*size1)/(size0+size1)

      rows.append({'participant': pid,
                   'condition': r['condition'].lower(),
                   'task':'sentiment' if d == 'amazon' else 'spam',
                   'dataset':d,
                   'f1':r['micro_f1'],
                   'precision':prec,
                   'recall':rec,
                   'accuracy':r['accuracy'],
                   'max_dev_f1': sub_df.at[idxmax, 'f1'],
                   'training_label_coverage': r['training_label_coverage'],
                   })
  # DataFrame.append was removed in pandas 2.0, so build the table from the list of rows;
  # sorting the columns keeps the alphabetical column order of the output shown below
  return pd.DataFrame(rows).sort_index(axis=1)
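
The per-class precision and recall are combined above by weighting each class by its size in the heldout split. A minimal worked example of that arithmetic follows, using the class sizes 418 and 382 from the function and made-up per-class precision values. The `weighted_average` helper is hypothetical. This kind of averaging is commonly called support-weighted rather than micro averaging.

```python
def weighted_average(per_class_scores, class_sizes):
    """Average per-class scores, weighting each class by its number of examples."""
    total = sum(class_sizes)
    return sum(s * n for s, n in zip(per_class_scores, class_sizes)) / total

# made-up per-class precision values; class sizes as in the function above
prec = weighted_average([0.60, 0.40], [418, 382])
print(prec)  # (0.60 * 418 + 0.40 * 382) / 800
```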

In [4]:
dt_best_small  = create_best_table_small()
display(dt_best_small)



| | accuracy | condition | dataset | f1 | max_dev_f1 | participant | precision | recall | task | training_label_coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.48375 | snorkel | amazon | 0.562036 | 0.653386 | p4 | 0.492273 | 0.48375 | sentiment | 0.58375 |
| 1 | 0.578652 | ruler | youtube | 0.71374 | 0.632558 | p4 | 0.682599 | 0.525599 | spam | 0.135 |
| 2 | 0.4825 | ruler | amazon | 0.622951 | 0.588235 | p8 | 0.50177 | 0.4825 | sentiment | 0.1125 |
| 3 | 0.789326 | snorkel | youtube | 0.76489 | 0.697436 | p8 | 0.821485 | 0.809982 | spam | 0.54375 |
| 4 | 0.50625 | ruler | amazon | 0.655623 | 0.642857 | p2 | 0.667621 | 0.50625 | sentiment | 0.4025 |
| 5 | 0.679775 | snorkel | youtube | 0.604167 | 0.691892 | p2 | 0.744225 | 0.710193 | spam | 1.0 |
| 6 | 0.58875 | ruler | amazon | 0.636464 | 0.64486 | p1 | 0.608134 | 0.58875 | sentiment | 0.2975 |
| 7 | 0.52809 | snorkel | youtube | 0.596154 | 0.651163 | p1 | 0.517664 | 0.512288 | spam | 1.0 |
| 8 | 0.51 | snorkel | amazon | 0.56541 | 0.639456 | p3 | 0.519668 | 0.51 | sentiment | 1.0 |
| 9 | 0.61236 | ruler | youtube | 0.72619 | 0.75 | p3 | 0.695697 | 0.566626 | spam | 0.62375 |


## Figures and Analysis

### Quantitative Figure (model performance, etc.)

In [0]:
dt_bm_small = dt_best_small.melt(id_vars=['participant', 'condition', 'task', 'dataset'], 
        var_name="metric", 
        value_name="value")

In [7]:
W = 300
H = 50
error_bars = alt.Chart(dt_bm_small).mark_errorbar(extent='stderr').encode(
  x=alt.X('value:Q'),
  y=alt.Y('condition:N'),
  color=alt.Color('condition:N', sort=['ruler'])
).properties(width=W,height=H)

points = alt.Chart(dt_bm_small).mark_point(filled=True).encode(
  x=alt.X('value:Q', title=None, aggregate='mean', axis=alt.Axis(tickCount=10)),
  y=alt.Y('condition:N'),
  text=alt.Text('value:Q'),
  color=alt.Color('condition:N', sort=['ruler'], legend=alt.Legend(title=None, orient='top'))
).properties(width=W,height=H)

(error_bars + points).facet(
    facet= alt.Facet('metric:N', sort=['f1', 'accuracy', 'training_label_coverage', 'max_dev_f1'], title=None),
    columns=2
)
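
The error bars above use `extent='stderr'`, which draws the mean plus and minus one standard error. A minimal sketch of that computation in plain Python follows, assuming the sample standard deviation (n - 1 denominator) is used. The input values and the `mean_and_stderr` helper are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_stderr(values):
    """Return the mean and the standard error of the mean."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))  # sample std over sqrt(n)
    return m, se

values = [0.2, 0.4, 0.6, 0.8]  # made-up metric values for one condition
m, se = mean_and_stderr(values)
print(m - se, m, m + se)  # lower end of the bar, point, upper end of the bar
```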

### Hypothesis Testing

Let's see which of these differences are statistically significant, starting with the f1 score.

#### F1 Score

In [8]:
from scipy import stats
dt = dt_best_small
ruler_f1 = dt[dt['condition']=='ruler']['accuracy']
snorkel_f1 = dt[dt['condition']=='snorkel']['accuracy']
stats.ttest_rel(ruler_f1, snorkel_f1)

Ttest_relResult(statistic=-0.518186116980372, pvalue=0.6168254933363918)

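As the figure suggested, the difference is not significant (**pvalue = 0.62 >> 0.05**). Note that the cell above selects the `accuracy` column, so this p-value is for accuracy rather than f1; the paired t-test on the `f1` column itself is run in R further below and is also not significant.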
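
For reference, the statistic behind a paired t-test is the mean of the paired differences divided by its standard error. The sketch below computes it by hand on made-up scores. `paired_t` is a hypothetical helper, and turning the statistic into a p-value would still require the t distribution (for example via `scipy.stats`).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# made-up scores for five participants under two conditions
a = [0.62, 0.71, 0.66, 0.64, 0.73]
b = [0.56, 0.76, 0.60, 0.60, 0.57]
t, df = paired_t(a, b)
print(t, df)
```

The pairing matters: the two lists must be ordered so that position `i` in both refers to the same participant.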

For completeness, let's perform the above comparison using a mixed-effects model.

In [10]:
%load_ext rpy2.ipython


In [11]:
%%R
install.packages("lme4")


In [12]:
%R library(lme4)
%R -i dt

R[write to console]: Loading required package: Matrix



Use linear mixed-effects (LME) regression to analyze the effect of **condition** on **f1** (see, e.g., https://web.stanford.edu/class/psych252/section/Mixed_models_tutorial.html, https://jontalle.web.engr.illinois.edu/MISC/lme4/bw_LME_tutorial.pdf). The main difference between the LME model below and the paired t-test above is that the LME model accounts for differences among users (e.g., due to experience or familiarity) and among task types, through the `(1|participant)` and `(1 | task)` random-effect terms.

In [9]:
%%R
# the first model suggests the f1 scores can be modeled as a linear function of 
# a constant, per-subject random effects, per-task random effects, and a measurement noise 
compact = lmer('f1 ~ 1 + (1|participant) + (1|task)', data=dt) 

# the second model suggests the f1 scores can be  modeled as a linear function of 
# a constant, the value of condition (fixed effect), per-subject random effects, per-task random effects, and a measurement noise 
augmented = lmer('f1 ~ condition + (1|participant) + (1|task)', data=dt)

# So, does one model explain the data better than the other? We can compare the two models using the $\chi^2$ test 
anova(compact, augmented)

R[write to console]: boundary (singular) fit: see ?isSingular

R[write to console]: boundary (singular) fit: see ?isSingular

R[write to console]: refitting model(s) with ML (instead of REML)



Data: dt
Models:
compact: f1 ~ 1 + (1 | participant) + (1 | task)
augmented: f1 ~ condition + (1 | participant) + (1 | task)
          npar     AIC     BIC logLik deviance Chisq Df Pr(>Chisq)
compact      4 -35.005 -31.022 21.502  -43.005                    
augmented    5 -33.159 -28.180 21.579  -43.159 0.154  1     0.6947


If we look at the last column for the second row (augmented), similar to the paired t-test performed earlier, we can see the difference between these two models is not significant (**pvalue=0.69 >> 0.05**). So far, **ruler** and **snorkel** have no statistically  significant performance difference as measured by the **f1** score. 
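
The `Chisq` and `Pr(>Chisq)` columns can be reproduced from the two log-likelihoods in the table: the likelihood-ratio statistic is twice the difference in log-likelihood, compared against a chi-square distribution with one degree of freedom (the one extra parameter). A minimal sketch in plain Python, using the rounded values printed above, so the result only matches to about three decimals:

```python
from math import erfc, sqrt

# log-likelihoods copied from the anova table above (rounded)
loglik_compact = 21.502
loglik_augmented = 21.579

# likelihood-ratio statistic
chisq = 2 * (loglik_augmented - loglik_compact)

# survival function of a chi-square distribution with df = 1
p_value = erfc(sqrt(chisq / 2))

print(round(chisq, 3), round(p_value, 4))
```

The closed form with `erfc` only holds for one degree of freedom; comparing models that differ by more parameters would need a general chi-square survival function.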

Let's also repeat the paired t-test in R, this time on the `f1` column, to further verify our conclusion.

In [10]:
%%R
t.test(dt$f1[dt$condition=="ruler"],
       dt$f1[dt$condition=="snorkel"],
       alternative = "two.sided",
       paired = T)


	Paired t-test

data:  dt$f1[dt$condition == "ruler"] and dt$f1[dt$condition == "snorkel"]
t = 0.24251, df = 9, p-value = 0.8138
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1110753  0.1377500
sample estimates:
mean of the differences 
             0.01333735 



Before moving on, let's calculate the effect size (https://en.wikipedia.org/wiki/Effect_size). For that, we can use one of many R packages. 

In [13]:
%%R
install.packages("rstatix")
install.packages('coin')


In [14]:
%%R

library(rstatix)
cohens_d(dt, accuracy ~ condition, paired=T)




# A tibble: 1 x 7
  .y.      group1 group2  effsize    n1    n2 magnitude 
* <chr>    <chr>  <chr>     <dbl> <int> <int> <ord>     
1 accuracy ruler  snorkel  -0.164    10    10 negligible


We also run Wilcoxon signed-rank tests, a nonparametric alternative to the paired t-test, on each metric.
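
The `V` statistic reported by a Wilcoxon signed-rank test is the sum of the ranks of the positive paired differences, where ranks are assigned by absolute difference, ties share an average rank, and zero differences are dropped. A minimal sketch on made-up scores (`signed_rank_v` is a hypothetical helper, and the p-value computation is omitted):

```python
def signed_rank_v(a, b):
    """Sum of the ranks of the positive paired differences."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # find the run of tied absolute differences starting at position i
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)

# made-up ratings for five participants
ruler = [3, 4, 2, 5, 4]
snorkel = [4, 4, 4, 3, 1]
v = signed_rank_v(ruler, snorkel)
print(v)
```

When no participant scores higher under the first condition, there are no positive differences and `V` is 0.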

In [15]:
%%R

# f1 
print(wilcox.test(dt$f1[dt$condition=="ruler"],dt$f1[dt$condition=="snorkel"] , paired = TRUE))
print(wilcox_effsize(dt, f1~condition, paired=T))

# precision 
print(wilcox.test(dt$precision[dt$condition=="ruler"],dt$precision[dt$condition=="snorkel"] , paired = TRUE))
print(wilcox_effsize(dt, precision~condition, paired=T))

# recall 
print(wilcox.test(dt$recall[dt$condition=="ruler"],dt$recall[dt$condition=="snorkel"] , paired = TRUE))
print(wilcox_effsize(dt, recall~condition, paired=T))

# accuracy 
print(wilcox.test(dt$accuracy[dt$condition=="ruler"],dt$accuracy[dt$condition=="snorkel"] , paired = TRUE))
print(wilcox_effsize(dt, accuracy~condition, paired=T))


	Wilcoxon signed rank test

data:  dt$f1[dt$condition == "ruler"] and dt$f1[dt$condition == "snorkel"]
V = 35, p-value = 0.4922
alternative hypothesis: true location shift is not equal to 0

# A tibble: 1 x 7
  .y.   group1 group2  effsize    n1    n2 magnitude
* <chr> <chr>  <chr>     <dbl> <int> <int> <ord>    
1 f1    ruler  snorkel   0.242    10    10 small    

	Wilcoxon signed rank test

data:  dt$precision[dt$condition == "ruler"] and dt$precision[dt$condition == "snorkel"]
V = 29, p-value = 0.9219
alternative hypothesis: true location shift is not equal to 0

# A tibble: 1 x 7
  .y.       group1 group2  effsize    n1    n2 magnitude
* <chr>     <chr>  <chr>     <dbl> <int> <int> <ord>    
1 precision ruler  snorkel  0.0483    10    10 small    

	Wilcoxon signed rank test

data:  dt$recall[dt$condition == "ruler"] and dt$recall[dt$condition == "snorkel"]
V = 25, p-value = 0.8457
alternative hypothesis: true location shift is not equal to 0

# A tibble: 1 x 7
  .y.    group1 gr

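The effect size for our paired t-test is **negligible** by Cohen's d (|d| ≈ 0.16, computed on `accuracy` in the `cohens_d` call above), and the Wilcoxon effect sizes shown for f1 and precision are small. The **f1** values for both conditions have high variance, which in turn causes high variance in the paired differences. A paired test has a large effect size only if the average difference between paired values is high while the variance of the differences is low.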

In fact, the effect size computation based on Cohen's d measure for a paired t-test is  relatively simple: $d=\frac{\text{mean of paired differences}}{\text{std of paired differences}}$. 

In [0]:
 # our implementation 
def cohensd(g1, g2): 
  return  np.mean(g1-g2) / np.std(g1-g2, ddof=1)

In [17]:
cohensd(ruler_f1.values, snorkel_f1.values)

-0.1638648381536429

Great, we got the same value as the R `rstatix` package's `cohens_d()` function. 
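
Because the paired t statistic is the mean difference divided by (std of differences / sqrt(n)), Cohen's d for a paired design is simply t / sqrt(n). A quick check using the statistic reported by `stats.ttest_rel` earlier and the n = 10 pairs reported by `rstatix`:

```python
from math import sqrt

t_stat = -0.518186116980372  # paired t statistic reported by stats.ttest_rel above
n_pairs = 10                 # number of participants (pairs)

d = t_stat / sqrt(n_pairs)
print(d)
```

This agrees with the value returned by `cohensd` above up to floating-point rounding.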

One remaining question is how to interpret the effect size values. The answer is _it depends_, but by convention the effect size is considered $\left \{ 
  \begin{array}{ll} 
  \text{small} & \text{if $d\sim 0.2$} \\
  \text{moderate} & \text{if $d\sim 0.5$} \\
  \text{large} & \text{if $d\sim 0.8$} \\
  \end{array}
  \right.$
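
These rules of thumb can be turned into a small lookup. The sketch below treats 0.2, 0.5, and 0.8 as lower cut-offs on |d| and labels anything below 0.2 as negligible. That reading is an assumption, chosen because it is consistent with the `magnitude` labels in the `cohens_d` outputs in this notebook; `effect_magnitude` is a hypothetical helper, not a library function.

```python
def effect_magnitude(d):
    """Map a Cohen's d value to a conventional magnitude label."""
    size = abs(d)  # the sign only encodes the direction of the difference
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"

for d in (-0.164, 0.3, -0.775, 0.9):
    print(d, effect_magnitude(d))
```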

Now, let's repeat the significance analysis for the other metrics. 

#### Precision

In [18]:
error_bars = alt.Chart(dt).mark_errorbar(extent='stderr').encode(
  x=alt.X('precision:Q'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
)

points = alt.Chart(dt).mark_point(filled=True).encode(
  x=alt.X('precision:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
 color=alt.Color('condition:N', sort=['ruler'])
)

error_bars + points

In [19]:
ruler_precision = dt[dt['condition']=='ruler']['precision']
snorkel_precision = dt[dt['condition']=='snorkel']['precision']
stats.ttest_rel(ruler_precision, snorkel_precision)

Ttest_relResult(statistic=0.21633993091864037, pvalue=0.8335467862551649)

As expected, the difference in **precision** is not significant (**pvalue=0.83 >> 0.05**).

#### Recall

In [20]:
error_bars = alt.Chart(dt).mark_errorbar(extent='stderr').encode(
  x=alt.X('recall:Q'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
)

points = alt.Chart(dt).mark_point(filled=True).encode(
  x=alt.X('recall:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
 color=alt.Color('condition:N', sort=['ruler'])
)

error_bars + points

In [21]:
ruler_recall = dt[dt['condition']=='ruler']['recall']
snorkel_recall = dt[dt['condition']=='snorkel']['recall']
stats.ttest_rel(ruler_recall, snorkel_recall)

Ttest_relResult(statistic=0.050128779325306946, pvalue=0.961114662015901)

Again, the difference in **recall** is not significant (**pvalue=0.96 >> 0.05**).

#### Accuracy

In [22]:
error_bars = alt.Chart(dt).mark_errorbar(extent='stderr').encode(
  x=alt.X('accuracy:Q'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
)

points = alt.Chart(dt).mark_point(filled=True).encode(
  x=alt.X('accuracy:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
 color=alt.Color('condition:N', sort=['ruler'])
)

error_bars + points

In [23]:
ruler_accuracy = dt[dt['condition']=='ruler']['accuracy']
snorkel_accuracy = dt[dt['condition']=='snorkel']['accuracy']
stats.ttest_rel(ruler_accuracy, snorkel_accuracy)

Ttest_relResult(statistic=-0.518186116980372, pvalue=0.6168254933363918)

Not significant (**pvalue=0.62 >> 0.05**)

### Qualitative Figure (Survey responses)

In [0]:
background = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/background_survey_anon.csv', index_col=0)
exit_survey = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/exit_survey_anon.csv', index_col=0)
final_survey = pd.read_csv('https://raw.githubusercontent.com/rulerauthors/ruler/master/user_study/final_survey_anon.csv', index_col=0)

The original column names of `exit_survey` are the statements with which users rated their agreement on a Likert scale of 1-5.

We'll shorten these column names for our figures.

In [0]:
# simplify column names
exit_survey.columns = ['Timestamp', 'condition',
       'overall satisfaction', 'ease of use',
       'expressivity',
       'ease of learning',
       'feedback',
       'how to improve',
       'other',
       'comments', 'participant']
exit_survey = exit_survey.drop('Timestamp', axis=1)
exit_survey['condition'] = exit_survey['condition'].str.lower()

exit_survey.fillna({'comments':'','how to improve':'', 'feedback':'', 'other':''},inplace=True) # this is necessary to be able to pass the dataframe to R

In [0]:
df_q = exit_survey

In [0]:
df_qm = df_q.melt(id_vars=['participant', 'condition','comments', 'how to improve', 'feedback', 'other'], 
        var_name="metric", 
        value_name="value")

In [29]:
error_bars = alt.Chart(df_qm).mark_errorbar(extent='stderr').encode(
  x=alt.X('value:Q'),
  y=alt.Y('condition:N'),
      color=alt.Color('condition:N', sort=['ruler'])
).properties(width=400,height=100)

points = alt.Chart(df_qm).mark_point(filled=True).encode(
  x=alt.X('value:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
).properties(width=400,height=100)

(error_bars + points).facet(
    facet= alt.Facet('metric:N',sort=['ease of use', 'expressivity', 'ease of learning', 'overall satisfaction']),
    columns=2
)

### Hypothesis Testing

We'll perform an analysis similar to what we did with the model performance metrics.  Let's start with **expressivity**.

#### Expressivity

In [30]:
error_bars = alt.Chart(exit_survey).mark_errorbar(extent='stderr').encode(
  x=alt.X('expressivity:Q'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
)

points = alt.Chart(exit_survey).mark_point(filled=True).encode(
  x=alt.X('expressivity:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
 color=alt.Color('condition:N', sort=['ruler'])
)

error_bars + points

It appears that subjects found **snorkel** more expressive than **ruler**. Let's test whether the difference the figure suggests is statistically significant.

In [31]:
from scipy import stats
ruler_expr =  exit_survey[exit_survey['condition']=='ruler']['expressivity']
snorkel_expr = exit_survey[exit_survey['condition']=='snorkel']['expressivity']
stats.ttest_rel(ruler_expr, snorkel_expr)

Ttest_relResult(statistic=-2.4494897427831783, pvalue=0.03678749787978613)

OK. Participants rated **snorkel** significantly more expressive than **ruler** on a 5-point Likert scale (**pvalue = 0.04 < 0.05**). Let's compute the effect size of the difference, which appears to be small.

In [0]:
%R -i df_q

In [33]:
%%R
library(rstatix)
cohens_d(df_q,expressivity~condition,paired=T)

# A tibble: 1 x 7
  .y.          group1 group2  effsize    n1    n2 magnitude
* <chr>        <chr>  <chr>     <dbl> <int> <int> <ord>    
1 expressivity ruler  snorkel  -0.775    10    10 moderate 


In [34]:
%%R 
library(rstatix)
wilcox.test(df_q$expressivity[df_q$condition=="ruler"],df_q$expressivity[df_q$condition=="snorkel"] , paired = TRUE)



	Wilcoxon signed rank test with continuity correction

data:  df_q$expressivity[df_q$condition == "ruler"] and df_q$expressivity[df_q$condition == "snorkel"]
V = 0, p-value = 0.05447
alternative hypothesis: true location shift is not equal to 0



In [35]:
%R wilcox_effsize(df_q, expressivity~condition, paired=T)

Unnamed: 0,.y.,group1,group2,effsize,n1,n2,magnitude
1,expressivity,ruler,snorkel,0.69843,10,10,large


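We were wrong: Cohen's d gives a moderate effect size (d = -0.775) and the Wilcoxon effect size is large (0.70) for the difference in expressivity. Note, however, that the Wilcoxon signed-rank test itself falls just short of significance (p = 0.054). Let's move on to other subjective measures.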

#### Ease of Use

In [36]:
error_bars = alt.Chart(exit_survey).mark_errorbar(extent='stderr').encode(
  x=alt.X('ease of use:Q'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
)

points = alt.Chart(exit_survey).mark_point(filled=True).encode(
  x=alt.X('ease of use:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
 color=alt.Color('condition:N', sort=['ruler'])
)

error_bars + points

In [37]:
%%R 
library(stringr)
names(df_q)<-str_replace_all(names(df_q), c(" " = "." , "," = "" )) # R does not handle column names containing spaces well
wilcox.test(df_q$overall.satisfaction[df_q$condition=="ruler"],df_q$overall.satisfaction[df_q$condition=="snorkel"] , paired = TRUE)


	Wilcoxon signed rank test with continuity correction

data:  df_q$overall.satisfaction[df_q$condition == "ruler"] and df_q$overall.satisfaction[df_q$condition == "snorkel"]
V = 43, p-value = 0.1151
alternative hypothesis: true location shift is not equal to 0



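Not significantly different **(p=0.12 > 0.05)**. Note that the cell above tests the `overall.satisfaction` column, so this result is for overall satisfaction rather than ease of use.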

In [38]:
%R wilcox_effsize(df_q, overall.satisfaction~condition, paired=T)

Unnamed: 0,.y.,group1,group2,effsize,n1,n2,magnitude
1,overall.satisfaction,ruler,snorkel,0.514882,10,10,large


We got a large effect size, but `wilcox_effsize` was again computed on `overall.satisfaction`, so it describes overall satisfaction rather than **ease of use**. Now, we move on to **ease of learning**.

#### Ease of learning

In [126]:
error_bars = alt.Chart(df_q).mark_errorbar(extent='stderr').encode(
  x=alt.X('ease of learning:Q'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
)

points = alt.Chart(df_q).mark_point(filled=True).encode(
  x=alt.X('ease of learning:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
 color=alt.Color('condition:N', sort=['ruler'])
)

error_bars + points

Looks like  participants found **ruler** easier to learn than **snorkel**. Now let's test that hypothesis.  

In [39]:
ruler_learn =  df_q[df_q['condition']=='ruler']['ease of learning']
snorkel_learn = df_q[df_q['condition']=='snorkel']['ease of learning']
stats.ttest_rel(ruler_learn, snorkel_learn)

Ttest_relResult(statistic=1.9639610121239313, pvalue=0.08112618884584057)

In [40]:
%R cohens_d(df_q,ease.of.learning~condition,paired=T)

Unnamed: 0,.y.,group1,group2,effsize,n1,n2,magnitude
1,ease.of.learning,ruler,snorkel,0.621059,10,10,moderate


The difference in **ease of learning** between the two conditions, **ruler** and **snorkel**, is not statistically significant (**pvalue = 0.08 > 0.05**).

Finally, let's look into **overall satisfaction** of participants with the respective tools.

#### Satisfaction

In [41]:
error_bars = alt.Chart(df_q).mark_errorbar(extent='stderr').encode(
  x=alt.X('overall satisfaction:Q'),
  y=alt.Y('condition:N'),
    color=alt.Color('condition:N', sort=['ruler'])
)

points = alt.Chart(df_q).mark_point(filled=True).encode(
  x=alt.X('overall satisfaction:Q', aggregate='mean'),
  y=alt.Y('condition:N'),
 color=alt.Color('condition:N', sort=['ruler'])
)

error_bars + points

All right. Is this difference statistically significant?

In [42]:
ruler_overall =  df_q[df_q['condition']=='ruler']['overall satisfaction']
snorkel_overall = df_q[df_q['condition']=='snorkel']['overall satisfaction']
stats.ttest_rel(ruler_overall, snorkel_overall)

Ttest_relResult(statistic=1.9215378456610457, pvalue=0.08684229054535088)

In [43]:
%R cohens_d(df_q,overall.satisfaction~condition,paired=T)

Unnamed: 0,.y.,group1,group2,effsize,n1,n2,magnitude
1,overall.satisfaction,ruler,snorkel,0.607644,10,10,moderate


The difference in overall satisfaction with the two tools, **ruler** and **snorkel**, is not statistically significant (**pvalue = 0.09 > 0.05**).