# Conversation Experience Survey Experiment Design
#### Prepared by: Mike Fan

## Background
Whether it's between a student and teacher, child and parent, or employee and supervisor, feedback is a vital mechanism for continuous improvement. In the current implementation of digital assistants at Intuit, there are primarily 2 feedback mechanisms in place:
1. **Response specific feedback** (see below image) that asks the user whether the content generated by the chatbot was helpful or not helpful in answering the user's query. This feedback aims to optimize the quality of the bot response to a given query, which has an impact on the user's perceived interaction quality.
    <center><img src='C:\Users\micha\Desktop\botresponse.JPG'></center>

2. **Live chat feedback** (see image below) aims to measure the helpfulness of live chat conversations with human experts and specialists.
    <center><img src='C:\Users\micha\Desktop\livechat.JPG'></center>

In each case, the feedback is coupled to a specific element of the conversation journey, but does not capture the holistic sentiment of the overall interaction. Confounding factors such as the intuitiveness of the conversation flow, UI design and navigation make it questionable to extrapolate response and live chat feedback beyond their intended scope.

## Experiment Scope
We will run 3 separate experiments, one in each of 3 products, using the Conversation Experience Survey (CES). The specifics are below.
- **Treatment:** Upon exiting the chat drawer (clicking X on chat), the user will be asked about their overall experience with the digital assistant on a 5-point Likert scale that has been transformed into emojis (see image below). The numerical score maps to the feedback as follows:
    - Terrible - 1
    - Not Good - 2
    - Not Helpful - 3
    - Somewhat Helpful - 4
    - Very Helpful - 5
    
    <center><img src='C:\Users\micha\Desktop\ces.JPG'></center>
    
- To reduce survey fatigue and ecosystem risk (i.e. users get so annoyed by the survey that they stop using the digital assistant altogether), we have implemented 2 rules that control how often the Conversation Experience Survey (CES) is served:
    1. If the user had a live agent handoff in the current session, then CES will not be shown upon exit
    2. If the user was shown a live chat survey or CES in the past 72 hours, then CES will not be shown upon exit

- **Platform:** Web only; Conversation Framework iOS and Android have not integrated with Intuit Experiment Platform (IXP)
- **Locale:** US only
- **Products:** The survey will be exposed in both QuickBooks Assistant (QBA) and TurboTax Digital Assistant (TDA) across 3 products:
    1. QuickBooks Self-Employed (QBSE)
        - [QBSE IXP](https://experimentation-e2e.intuit.com/#/experiment/27538/version/4)
    2. QuickBooks Online (QBO)
        - [QBO IXP](https://experimentation-e2e.intuit.com/#/experiment/27843/version/1)
    3. TurboTax Online (TTO)
        - [TTO IXP](https://experimentation-e2e.intuit.com/#/experiment/27842/version/3)
- **Timeline:** QBSE and QBO experiments will begin in mid-November, since usage of these products is more consistent. TTO will start at the beginning of January 2021 to coincide with Tax Peak 1 (mid to end of January 2021), since TTO user engagement is sparser and centered around tax deadlines.
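
The two survey fatigue rules listed under the treatment can be sketched as a small eligibility check. This is a minimal illustration, not the production logic: the function name, argument names, and the choice to treat exactly 72 hours as outside the window are all assumptions.

```r
# Hypothetical helper; argument names are illustrative, not real event fields.
should_show_ces <- function(had_live_handoff, hours_since_last_survey,
                            cooldown_hours = 72) {
  # Rule 1: suppress CES if the current session had a live agent handoff.
  if (isTRUE(had_live_handoff)) return(FALSE)
  # Rule 2: suppress CES if a live chat survey or CES was shown within the
  # cooldown window. NA means the user has never been shown either survey.
  if (!is.na(hours_since_last_survey) &&
      hours_since_last_survey < cooldown_hours) return(FALSE)
  TRUE
}

should_show_ces(had_live_handoff = FALSE, hours_since_last_survey = NA)   # eligible
should_show_ces(had_live_handoff = TRUE,  hours_since_last_survey = NA)   # rule 1
should_show_ces(had_live_handoff = FALSE, hours_since_last_survey = 24)   # rule 2
```

Keeping the cooldown as an argument makes it easy to loosen the rule if the exposure target in the metrics section is not being met.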

## Key Questions and Metrics
**1. What proportion of conversations resulted in the CES being shown to users?**
>- Desired insight: Is the survey fatigue mitigation logic so restrictive that the survey is not getting enough exposure?
>- Target: at least 3,840 survey exposures per product (384 responses at an assumed 10% response rate; see Statistical Analysis section for derivation)

**2. What's the completion rate of CES?**
>- Desired insight: Do we have enough samples/responses to generalize the result to the broader population?
>- Target: 15%; average in-app survey response rate is 13% per [SurveyAnyPlace](https://surveyanyplace.com/wp-content/uploads/average-survey-response-rate.png)

**3. What’s the perceived quality of each digital assistant? Is it relatively consistent or varied?**
>- Desired insight: Can we infer from the responses that the underlying sentiment toward the digital assistant is at least "Somewhat Helpful"?
>- Target: The 95% confidence interval of the response score for both QBA and TDA is between 4 and 5

#### Do No Harm Metric
In addition to the 3 metrics above, we will also compare the interaction rate (user sent at least 1 message/utterance) between the treatment and control groups for each product/experiment, since excessive survey exposure can potentially lead to degradation in UX and reduction in usage.
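
One way to check this metric is a two-sample proportion test on the interaction rates. The sketch below uses made-up counts and assumes the denominator is the number of users exposed to each group; the document does not prescribe a specific test, so `prop.test` is an assumption here.

```r
# Hypothetical counts for illustration only; not real experiment data.
exposed    <- c(control = 10000, treatment = 10000)  # users in each group
interacted <- c(control = 4200,  treatment = 4100)   # sent >= 1 utterance

# One-sided test: is the control interaction rate greater than treatment,
# i.e. did the survey reduce usage? The first group listed is control.
result <- prop.test(x = interacted, n = exposed, alternative = "greater")

result$estimate  # observed interaction rates for control and treatment
result$p.value   # small values suggest the treatment rate is lower
```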

## Statistical Analysis
We split our analysis into 2 parts, pre-test and post-test, described below. It's also important to note that this is not a typical A/B or split test, where a power analysis is usually conducted to determine the minimum sample size based on the chosen effect size (the smallest practically significant difference between control and treatment), power (the probability that the test correctly rejects a false null hypothesis, i.e. avoids a Type-II error) and alpha (the probability of a Type-I error, i.e. a false positive).

### Pre-Test
The purpose of the pre-test analysis is to determine the appropriate sample size so that the resulting score distribution can be generalized to the broader population, where the population is defined as the total number of active users in a particular product over a pre-defined period such as a calendar year. The population will be estimated by analyzing 2019 traffic patterns to arrive at a sensible population size for QBO, QBSE web and TTO. The sample size is calculated based on 3 parameters:
>    1. **Population size:** Total number of active users in a product
>    2. **Confidence level:** Indicates how confident we can be that the population would select an answer within a certain range (i.e. a 95% confidence level means that we can be 95% certain that the true population score is between X and Y)
>    3. **Margin of error (MoE):** The half-width of the confidence interval, which tells us how closely we can expect our survey results to reflect the views of the population. The smaller the MoE, the more confidence we may have in our results. The bigger the MoE, the farther they can stray from the views of the population

The function below outputs a sample size table based on an array of confidence levels, the population size, `sigma` (the expected response distribution, a proportion rather than a standard deviation), and the MoE. The sample size calculation is derived from the equation below:
        <center><img src='C:\Users\micha\Desktop\ss.JPG'></center>


> - N = population size
> - e = MoE
> - z = z-score
> - p (sigma) = Response distribution, which refers to how we expect people to respond to the survey questions. If sample data is skewed highly to one end, the population probably is too. 50% is used when we don't know this value, because it yields the largest (most conservative) sample size
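
For readers without access to the image, the calculation can be written out as follows. This is transcribed from the logic of the `sample_size_table` function rather than from the image itself, so treat it as a reconstruction; $n_0$ is an intermediate symbol introduced here for the sample size before the finite population adjustment.

$$
n_0 = \frac{z^2 \, p \, (1 - p)}{e^2}, \qquad n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}
$$

With $z \approx 1.96$ (95% confidence), $p = 0.5$ and $e = 0.05$, $n_0 \approx 384.1$, and for $N = 1{,}000{,}000$ the adjustment barely changes it, giving $n \approx 384$.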

Outputs can be validated against well-established online sample-size calculators like this one by [Qualtrics](https://www.qualtrics.com/blog/calculating-sample-size/).

In [57]:
sample_size_table = function(conf_level, sigma = .5, moe, population)
{
    # conf_level: confidence level(s) in percent, e.g. 95
    # sigma: expected response proportion p (0.5 is the most conservative)
    # moe: margin of error as a proportion; population: population size N
    z_value = qnorm(.5 + conf_level / 200) # two-sided z-score
    ss_infinite = (z_value^2 * sigma * (1 - sigma)) / (moe^2) # sample size ignoring population size
    ss_finite = ss_infinite / (1 + ((ss_infinite - 1) / population)) # finite population correction
    c_level = paste(conf_level, "%", sep = "")
    results = data.frame(c_level, round(ss_finite, digits = 0))
    names(results) = c("Confidence Level", "Sample Size")
    METHOD = "Minimum sample size at different confidence levels"
    moe_label = paste((moe * 100), "%", sep = "")
    resp_dist = paste((sigma * 100), "%", sep = "")
    # The "power.htest" class is reused only for its tidy print method
    pre = structure(list(Population = population,
                         "Margin of Error" = moe_label,
                         "Response Distribution" = resp_dist,
                         method = METHOD), class = "power.htest")
    print(pre)
    print(results)
}

In [58]:
sample_size_table(conf_level = c(90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 99.9, 99.99), moe = .05, population = 1000000)


     Minimum sample size at different confidence levels 

           Population = 1e+06
      Margin of Error = 5%
Response Distribution = 50%

   Confidence Level Sample Size
1               90%         270
2               91%         287
3               92%         306
4               93%         328
5               94%         354
6               95%         384
7               96%         422
8               97%         471
9               98%         541
10              99%         663
11            99.9%        1082
12           99.99%        1511


Based on the above arguments, the minimum sample size we need for each experiment is 384 responses at a 95% confidence level, an MoE of 5% and an assumed population of 1,000,000.

#### Time to Significance
**If we assume a 10% response rate, we will need 3,840 survey exposures per experiment to reach significance.** How long this will take depends on the number of chat sessions (user opens chat drawer) initiated by users, the proportion of live chat handoffs and the frequency of bot interaction. Because this is complex to derive without real data, I will update the estimated weeks to significance once we have actual traffic and interaction data.
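
The exposure requirement is simply the required number of responses divided by the response rate. The sketch below shows how sensitive it is to that rate, using the 10% assumption above alongside the 13% benchmark and 15% target from the metrics section, plus a pessimistic 5% added here purely for illustration.

```r
required_responses <- 384                 # from the sample size table (95%, MoE 5%)
response_rate_pct  <- c(5, 10, 13, 15)    # percent; 5 is an illustrative low case

# Work in whole percentages to avoid floating-point surprises in ceiling().
exposures_needed <- ceiling(required_responses * 100 / response_rate_pct)

data.frame(response_rate_pct, exposures_needed)
# 10% gives 3840 exposures, matching the figure above.
```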

### Post-Test
#### Statistical Significance
Because the survey responses are based on a 5-point Likert scale, the data are ordinal in nature. Since the mean is not a meaningful measure of central tendency for ordinal data and we cannot assume the distribution is Gaussian, parametric tests such as the t-test are not appropriate. In addition, as mentioned before, the experiment is not a typical A/B test that measures the difference in means between 2 groups. In our case, the only responses we will observe come from users who have been exposed to the CES, so there isn't a second group of responses to compare against.

Instead, we will use a non-parametric approach called the `Wilcoxon One-Sample Signed-Rank Test` to test for both statistical and practical significance. This test is the non-parametric version of the one-sample t-test and, because it is based on ranks, the location parameter is the median rather than the mean. Thus, we can test the null hypothesis that the population median is equal to a hypothetical value. In the context of CES, we will set the hypothetical value to 4, since our initial belief is that digital assistants are somewhat helpful to users. More specifically, we want to test whether the median score is **statistically different** from our benchmark value of 4. The hypothesis test is stated below:

>Null hypothesis:
> - $H_{0}: Md = 4$; there is no difference between users' perception and our base hypothesis that digital assistants are somewhat helpful
> 
>Alternative hypothesis:
> - $H_{a}: Md \ne 4$; there is a difference between users' perception and our base hypothesis that digital assistants are somewhat helpful

If we reject the null hypothesis, we will examine the direction and magnitude of the difference to determine whether the sentiment is better or worse relative to somewhat helpful (4).
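
A minimal sketch of this test using base R's `wilcox.test` is shown below. The scores are simulated with made-up probabilities and are not real survey responses. Note that responses exactly equal to 4 are dropped by the test as zero differences, and `exact = FALSE` requests the normal approximation, which is needed because Likert data are heavily tied.

```r
set.seed(42)  # reproducible simulated data; not real CES responses
# Simulate 384 responses on the 1-5 scale with hypothetical probabilities.
scores <- sample(1:5, size = 384, replace = TRUE,
                 prob = c(0.05, 0.10, 0.15, 0.40, 0.30))

# Two-sided one-sample signed-rank test against the benchmark median of 4.
test <- wilcox.test(scores, mu = 4, alternative = "two.sided", exact = FALSE)

test$p.value    # reject H0 at alpha = 0.05 if this is below 0.05
median(scores)  # direction check: above or below the benchmark of 4
```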

#### Practical Significance (Effect Size)
Practical significance refers to the magnitude of the difference, which is also known as the effect size. Results are practically significant when the difference is large enough to be meaningful in real life, as not all statistically significant differences are interesting. In other words, statistical significance indicates that the sample provides sufficient evidence to conclude that the effect exists in the population, while practical significance asks whether that effect is large enough to care about. A metric we will compute is `r` (not the correlation coefficient), which will help us determine the effect size of the difference if there is one. This metric is calculated as the absolute value of the Z-statistic divided by the square root of the sample size (N). The `r` value varies from 0 to close to 1, and the interpretation thresholds commonly used in published literature ([Statistical Power Analysis for the Behavioral Sciences](https://books.google.com/books?hl=en&lr=&id=rEe0BQAAQBAJ&oi=fnd&pg=PP1&ots=sv0TKsMPs6&sig=Rpfkd0H-EB0-ZhKjsdDdSDgTdx4#v=onepage&q&f=false)) are listed in the table below.
>
>| r | Effect Size |
>|---|---|
>| 0.1 to < 0.3 | Small |
>| 0.3 to < 0.5 | Medium |
>| >= 0.5 | Large |
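
The effect size can be sketched in base R by recovering the Z-statistic from the two-sided p-value of the signed-rank test. Several details here are assumptions rather than part of the design: the helper name, using all responses (including those equal to the benchmark) for N, turning off the continuity correction, and the "Negligible" label for values below 0.1, which the table does not define. The scores are simulated for illustration.

```r
# Hypothetical helper: r = |Z| / sqrt(N) for a one-sample signed-rank test.
wilcoxon_effect_size <- function(scores, mu = 4) {
  test <- wilcox.test(scores, mu = mu, exact = FALSE, correct = FALSE)
  z <- qnorm(test$p.value / 2)  # two-sided p-value mapped back to the lower tail
  abs(z) / sqrt(length(scores)) # N here counts all responses (an assumption)
}

set.seed(7)  # simulated data only; not real CES responses
scores <- sample(1:5, size = 384, replace = TRUE,
                 prob = c(0.05, 0.10, 0.20, 0.40, 0.25))

r <- wilcoxon_effect_size(scores)
# Map r onto the table's bands; "Negligible" is an added label for r < 0.1.
label <- cut(r, breaks = c(0, 0.1, 0.3, 0.5, Inf), right = FALSE,
             labels = c("Negligible", "Small", "Medium", "Large"))
r
as.character(label)
```

If the p-value underflows to 0, `qnorm` returns `-Inf`; in that case the Z-statistic would need to be computed directly from the rank sums instead.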

## Action Items
#### QBO and QBSE (priority; complete by 11/6)
- [ ] Pull YTD and 2019 user traffic volume data (time series) for **QBO** and **QBSE web**
- [ ] Pull 2019 and 2020 **QBA** interaction/session volume data for **QBO** and **QBSE web** from Dashbot or Intuit Data Lake
- [ ] Calculate the average daily bot interaction to user login ratio (e.g. $500 \text{ chatbot sessions} \div 10{,}000 \text{ user logins into QBO} = 0.05$)
- [ ] Derive the appropriate traffic allocation for each product based on the aforementioned data. For example, suppose a 10% traffic allocation yields 10,000 user logins per day on average, of which 500 start a chat with the bot (based on the historical daily bot interaction to user login ratio). If 25% of those 500 chats are ineligible because they violate one or both of the survey fatigue rules, 375 opportunities remain. Assuming a 10% response rate to CES, this yields about 37 responses per day on average. Since we need at least 384 total responses to reach significance in each product, it will take approximately 11 days.
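
The worked example in the last action item can be turned into a small helper so that different traffic allocations are easy to compare once real data is pulled. The function and argument names are hypothetical, and the inputs below are the illustrative figures from the example, not measured values.

```r
# Hypothetical helper: days needed to collect the target number of responses.
days_to_target <- function(daily_logins, bot_ratio, ineligible_share,
                           response_rate, target_responses = 384) {
  chats    <- daily_logins * bot_ratio       # logins that start a chat
  eligible <- chats * (1 - ineligible_share) # chats that pass both fatigue rules
  responses_per_day <- floor(eligible * response_rate)
  ceiling(target_responses / responses_per_day)
}

# 10,000 logins/day, 5% chat ratio, 25% ineligible, 10% response rate
days_to_target(daily_logins = 10000, bot_ratio = 0.05,
               ineligible_share = 0.25, response_rate = 0.10)
# 11
```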

#### TurboTax Online (complete by 12/11)
- [ ] Pull TY19 Peak 1 user traffic volume data (time series) for **TTO**
- [ ] Pull 2020 **TDA** interaction/session volume data for **TTO** from Dashbot or Intuit Data Lake
- [ ] Calculate the average daily bot interaction to user login ratio
- [ ] Derive the appropriate traffic allocation for TTO based on the aforementioned data