<a href="https://colab.research.google.com/github/nathanbollig/vet-graduate-expectations-survey/blob/main/analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Veterinary graduate expectations survey

Start by uploading the data into the working directory. Two files are required:

1.   `SVM.xlsx`: SVM graduate expectations survey results
2.   `WVMA.xlsx`: WVMA graduate expectations survey results

## Set up

In [72]:
! pip install xlsxwriter



In [73]:
import pandas as pd
import numpy as np
from scipy.stats import kruskal
from scipy.stats import normaltest
from scipy.stats import ttest_ind

### Read in SVM data

In [74]:
# Use top row as header and skip second header row
svm = pd.read_excel('SVM.xlsx', header=0, skiprows=lambda x: x in [1])  

# Read in questions from second header row and associate with column names
question_svm = {}

top_rows = pd.read_excel('SVM.xlsx', nrows=2) 

for col in list(top_rows.columns):
    question_svm[col] = top_rows.iloc[0][col]

### Read in WVMA data

In [75]:
# Use top row as header and skip second header row
wvma = pd.read_excel('WVMA.xlsx', header=0, skiprows=lambda x: x in [1])  

# Read in questions from second header row and associate with column names
question_wvma = {}

top_rows_wvma = pd.read_excel('WVMA.xlsx', nrows=2) 

for col in list(top_rows_wvma.columns):
    question_wvma[col] = top_rows_wvma.iloc[0][col]

### Set Analysis Parameters

In [76]:
ALPHA = 0.05
force_completion_rate = 0    # a respondent must have completed strictly more than this fraction (0 to 1) of all subquestions to be included in a group-level comparison

In [77]:
"""
The `analysis_mode` variable specifies which two main populations are being compared in this analysis. Possible values are:
    0 - SVM vs. WVMA
    1 - SVM specialists vs. WVMA specialists
    2 - WVMA specialists vs. WVMA generalists
"""

analysis_mode = 0

In [78]:
"""
Technical or non-technical.

The nontechnical questions are Q13, Q14, and Q15 for all species categories.
"""
nontechnical = False

Run the following to set up the notebook for this analysis.

In [79]:
# Preparation for SVM vs. WVMA
if analysis_mode == 0:
    pop1 = svm.copy()
    pop2 = wvma.copy()
    pop1_str = "SVM"
    pop2_str = "WVMA"
    file_suffix = ""

# Preparation for SVM specialists vs. WVMA specialists
if analysis_mode == 1:
    pop1 = svm[svm['Q59'].notnull()].copy()
    pop2 = wvma[wvma['Q49'].notnull()].copy()
    pop1_str = "SVM"
    pop2_str = "WVMA"
    file_suffix = "_s"

# Preparation for WVMA specialists vs. WVMA generalists
if analysis_mode == 2:
    pop1 = wvma[wvma['Q49'].notnull()].copy()
    pop2 = wvma[wvma['Q49'].isnull()].copy()
    pop1_str = "specialist"
    pop2_str = "generalist"
    file_suffix = "_sg"

# Adjust file suffix for nontechnical analysis
if nontechnical == True:
    file_suffix = file_suffix + "_nontechnical"

### Counts of species area

Let's look at the counts of species area (`Q1`) in each population. This question allowed multiple responses, which appear as a comma-delimited list. The code below counts how many times each species appears, taking into account the possibility of multiple responses.

In [80]:
from collections import defaultdict

pop1_counts = defaultdict(int) # start each count at zero by default

for entry in list(pop1.Q1):
    if isinstance(entry, str):
        species_list = entry.split(',')
        for species in species_list:
            pop1_counts[species] += 1
    elif np.isnan(entry) == True:
        pop1_counts["empty"] += 1

print("*** %s Survey ***" % (pop1_str,))
for key, val in pop1_counts.items():
    print("%s: %i" % (key, val))

*** SVM Survey ***
Companion Animal (canine and/or feline): 46
Food Animal (bovine): 18
Equine: 16
Special Species: 19
empty: 5


In [81]:
pop2_counts = defaultdict(int) # start each count at zero by default

for entry in list(pop2.Q1):
    if isinstance(entry, str):
        species_list = entry.split(',')
        for species in species_list:
            pop2_counts[species] += 1
    elif np.isnan(entry) == True:
        pop2_counts["empty"] += 1

print("*** %s Survey ***" % (pop2_str,))
for key, val in pop2_counts.items():
    print("%s: %i" % (key, val))

*** WVMA Survey ***
Companion Animal (canine and/or feline): 115
Food Animal (bovine): 48
Equine: 30
Special Species (ex. exotic companion animals): 21
empty: 29
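
The two counting cells above share the same logic, so it could be factored into one helper. The following is a minimal sketch rather than the notebook's own code: `count_species` is a hypothetical name, the sample entries are made up, and two behaviours are assumptions that differ from the cells above. It strips whitespace around each entry (in case an export uses `, ` as the delimiter) and it treats every non-string entry as empty instead of checking `np.isnan`.

```python
from collections import Counter

def count_species(entries):
    """Count species mentions in a column of comma-delimited multi-response strings."""
    counts = Counter()
    for entry in entries:
        if isinstance(entry, str):
            # One respondent may list several species areas
            counts.update(part.strip() for part in entry.split(','))
        else:
            # Assumption: any non-string value (e.g. NaN) is a blank response
            counts["empty"] += 1
    return counts

# Illustrative entries only, not survey data
sample = ["Equine", "Equine,Food Animal (bovine)", float("nan"), "Food Animal (bovine), Equine"]
sample_counts = count_species(sample)
for key, val in sample_counts.items():
    print("%s: %i" % (key, val))
```

With this helper, each population would need a single call such as `count_species(pop1.Q1)`.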


### Note about organization

There are several levels of organization in our interpretation of this data.

 * `Group`: One of the 4 species groups (companion animal, special species, food animal, or equine)
     * `Question`: A group of procedures in a category such as "Medical Procedures" or "Surgical Procedures"
          * `Sub-question`: A particular procedure

We can perform analysis at the sub-question level, or pool upwards to the question or group level. I will do all of this below.
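
In the data files, each sub-question is stored in a column named `Q<question>_<index>`, with the index starting at 1. As a small sketch of how the hierarchy maps onto column names, the hypothetical helper `subquestion_keys` below expands a list of question numbers and their sub-question counts into column keys. The question numbers and counts used in the example are illustrative inputs.

```python
def subquestion_keys(question_list, n_subq_list):
    """Expand question numbers and sub-question counts into column keys like 'Q9_1'."""
    keys = []
    for question_number, n_subquestions in zip(question_list, n_subq_list):
        # Sub-question indices are 1-based
        keys.extend("Q%s_%s" % (question_number, j) for j in range(1, n_subquestions + 1))
    return keys

example_keys = subquestion_keys([9, 12], [4, 3])
print(example_keys)
# ['Q9_1', 'Q9_2', 'Q9_3', 'Q9_4', 'Q12_1', 'Q12_2', 'Q12_3']
```

Pooling to the question level uses the keys of one question; pooling to the group level uses the keys of every question in the group.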





## Question analysis

Let's encode the expectation response in the following way:

 * 0: No Expectation to Perform Procedure

 * 1: Perform with Assistance (assist with portions of procedure)
 
 * 2: Perform with Direct Supervision (present in room during procedure)

 * 3: Perform with Indirect Supervision (available in building or by phone if needed)

 * 4: Perform Independently

In [82]:
def encode_expectation(response_string):
    if isinstance(response_string, int) == True:
        return response_string
    
    # Encode nan values as -1
    if isinstance(response_string, str) == False:
        if np.isnan(response_string) == True:
            return -1
    
    # Encode string
    s = response_string.lower()
    if s.find('no expectation') > -1:
        return 0
    elif s.find('with assistance') > -1:
        return 1
    elif s.find('indirect supervision') > -1:
        return 3
    elif s.find('direct supervision') > -1:
        return 2
    elif s.find('independently') > -1:
        return 4
    else:
        print(response_string)
        raise ValueError('Expected performance response was not formatted as expected.')
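
The order of the checks in `encode_expectation` matters: the text "indirect supervision" contains "direct supervision" as a substring, so the indirect case has to be tested first. The sketch below illustrates this with an ordered lookup table. `PATTERNS` and `encode_by_table` are hypothetical names used only for this illustration, and the function handles strings only.

```python
# Ordered (substring, code) pairs; earlier entries win
PATTERNS = [
    ("no expectation", 0),
    ("with assistance", 1),
    ("indirect supervision", 3),  # must come before "direct supervision"
    ("direct supervision", 2),
    ("independently", 4),
]

def encode_by_table(text, patterns=PATTERNS):
    """Return the code of the first pattern found in the lower-cased response text."""
    s = text.lower()
    for needle, code in patterns:
        if needle in s:
            return code
    raise ValueError("Unrecognized response: %r" % (text,))

indirect = "Perform with Indirect Supervision (available in building or by phone if needed)"
direct = "Perform with Direct Supervision (present in room during procedure)"

print(encode_by_table(indirect))  # 3
print(encode_by_table(direct))    # 2

# Swapping the two supervision patterns misclassifies the indirect response
wrong_order = [PATTERNS[0], PATTERNS[1], PATTERNS[3], PATTERNS[2], PATTERNS[4]]
print(encode_by_table(indirect, wrong_order))  # 2
```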

In [83]:
"""
Function for computing the composite average scores for all respondents. The 
average is over all subquestions in the indicated question.

For a given respondent to be included in the output, their response rate for this group
of subquestions must be above the value of `force_completion_rate`.
"""
def get_composite_scores(filtered_df, question_list, n_subq_list, force_completion_rate):
    composite_scores = []
    indices_used = []
    for i in range(filtered_df.shape[0]):  # Loop over respondents
        responses_in_group = []

        # Loop over questions in the group
        for q_index in range(len(question_list)): 
            question_number = question_list[q_index]
            n_subquestions = n_subq_list[q_index]

            for j in range(1, n_subquestions+1):  # Loop over subquestions in this question
                qkey = "Q" + str(question_number) + "_" + str(j)
                filtered_df[qkey] = filtered_df[qkey].apply(lambda x: encode_expectation(x))
                response = filtered_df[qkey].iloc[[i]]
                responses_in_group.append(response)
        
        responses_in_group = np.array(responses_in_group)

        # verify inclusion of respondent in the group analysis
        unique, counts = np.unique(responses_in_group, return_counts=True)
        counter = dict(zip(unique, counts))
        response_rate = 1 - counter.get(-1,0) / len(responses_in_group)
        if (response_rate <= force_completion_rate):
            continue
        else:
            indices_used.append(i)
        
        # remove -1 (empty) values and compute average score of responses
        index = np.where(responses_in_group != -1)[0]
        responses_in_group = responses_in_group[index]
        
        # add average of responses to list of composite scores
        composite_response = np.mean(responses_in_group)
        composite_scores.append(composite_response)

    return composite_scores, indices_used
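
To make the inclusion rule concrete, here is a minimal sketch of the same computation for a single respondent whose responses are already encoded. `composite_score` is a hypothetical helper, not part of the notebook. It assumes the `-1` sentinel for empty responses used above and returns `None` when the respondent would be skipped.

```python
from statistics import mean

MISSING = -1  # sentinel for an empty response, as in encode_expectation

def composite_score(responses, force_completion_rate=0.0):
    """Mean of the answered responses, or None if the respondent is excluded."""
    if not responses:
        return None
    answered = [r for r in responses if r != MISSING]
    response_rate = len(answered) / len(responses)
    # Respondents at or below the threshold are excluded (strict inequality to include)
    if response_rate <= force_completion_rate:
        return None
    return mean(answered)

print(composite_score([4, 3, -1, 2]))                              # 3
print(composite_score([-1, -1, -1]))                               # None
print(composite_score([4, -1, -1, -1], force_completion_rate=0.25))  # None
```

With the default threshold of 0, a respondent is included as soon as they answered at least one sub-question.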

In [84]:
def run_kruskal(pop1_data, pop2_data):
    if len(pop1_data) >= 5 and len(pop2_data) >= 5:
        stat, p = kruskal(pop1_data, pop2_data)
    else:
        stat = 0
        p = 1
    return stat, p
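
Note that `run_kruskal` returns `stat = 0` and `p = 1` whenever either sample has fewer than 5 values, so those numbers are placeholders for "not tested" rather than test results. The sketch below shows both paths on made-up samples. `guarded_kruskal` is a hypothetical stand-in that mirrors the function above so the example is self-contained.

```python
from scipy.stats import kruskal

def guarded_kruskal(sample_a, sample_b, min_size=5):
    """Kruskal-Wallis H-test, skipped when either sample is smaller than min_size."""
    if len(sample_a) < min_size or len(sample_b) < min_size:
        return 0.0, 1.0  # placeholder values, not a test result
    stat, p = kruskal(sample_a, sample_b)
    return float(stat), float(p)

# Illustrative data only: the first sample has 3 values, so no test is run
small = guarded_kruskal([4, 4, 3], [0, 1, 0, 1, 2])
print(small)

# Two clearly separated samples of 6 values each
separated = guarded_kruskal([0, 0, 1, 1, 0, 1], [3, 4, 4, 3, 4, 3])
print("stat=%.3f, p=%.4f" % separated)
```

When reading the output tables, it may therefore help to check the number of responses alongside any row reporting `pval = 1`.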

In [85]:
def analyze_question(question_number, filtered_pop1_df, filtered_pop2_df, n_subquestions, alpha=0.05, verbose=True):
    """
    Perform an analysis of a given question on a species-filtered dataframe.
    
    Inputs:
        question_number: main question number to analyze
        filtered_pop1_df: pop1 dataframe filtered to respondents with the desired species area
        filtered_pop2_df: pop2 dataframe filtered to respondents with the desired species area
        n_subquestions: number of subquestions in the main question
        alpha: significance level for the statistical test

    Prints a summary of results.

    Outputs:
        table: summary table
        (pooled_stat, pooled_p, pooled_diff_mean): tuple of statistics describing output of Kruskal test on the distributions of average responses over subquestions
        pop1_data: list of composite pop1 data (average of subquestion responses for all respondents meeting completion rate threshold)
        pop2_data: list of composite pop2 data
        sig_count: number of subquestions with significant difference detected (between pop1 and pop2 responses), according to Kruskal test applied at subquestion level

    """

    pop1_counts = np.zeros((n_subquestions, 6), dtype=int) # Row for each subquestion, column for empty (-1), 0, 1, 2, 3, and 4 responses
    pop2_counts = np.zeros((n_subquestions, 6), dtype=int) # Row for each subquestion, column for empty (-1), 0, 1, 2, 3, and 4 responses
    rows = []
    sig_count = 0

    for i in range(1, n_subquestions+1):
        qkey = "Q" + str(question_number) + "_" + str(i)
        qstring = question_svm[qkey].split('-')[2] # could refer to question_svm or question_wvma

        # Encoding
        filtered_pop1_df[qkey] = filtered_pop1_df[qkey].apply(lambda x: encode_expectation(x))
        filtered_pop2_df[qkey] = filtered_pop2_df[qkey].apply(lambda x: encode_expectation(x))

        # pop1 tally
        counts = filtered_pop1_df[qkey].value_counts(dropna=False)
        for key in counts.keys():
            pop1_counts[i-1][key+1] += counts[key] # question index is 1-based; keys range from -1 to 4
        counts = pop1_counts[i-1][1:] # counts of 0, 1, 2, 3, and 4
        pop1_num_responses = np.sum(counts)
        pop1_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3] + 4*counts[4]) / pop1_num_responses

        # pop2 tally
        counts = filtered_pop2_df[qkey].value_counts(dropna=False)
        for key in counts.keys():
            pop2_counts[i-1][key+1] += counts[key]
        counts = pop2_counts[i-1][1:] # counts of 0, 1, 2, 3, and 4
        pop2_num_responses = np.sum(counts)
        pop2_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3] + 4*counts[4]) / pop2_num_responses
        
        # Get data
        pop1_data = list(filtered_pop1_df[qkey])
        pop2_data = list(filtered_pop2_df[qkey])

        # Remove empty values from data
        pop1_data = [x for x in pop1_data if x != -1]
        pop2_data = [x for x in pop2_data if x != -1]

        assert(pop1_num_responses == len(pop1_data))
        assert(pop2_num_responses == len(pop2_data))

        # compare samples
        stat,p = run_kruskal(pop1_data, pop2_data)

        # Determine significance
        if p > alpha:
            sig = ""
        else:
            sig = "*"
            sig_count += 1

        # Cache row for table of results
        row = [qstring] + list(pop1_counts[i-1]) + [pop1_mean, pop1_num_responses] + list(pop2_counts[i-1]) + [pop2_mean, pop2_num_responses, pop1_mean-pop2_mean, stat, p, sig]
        rows.append(row)

    # Assemble table of results
    table = pd.DataFrame(rows, columns=["Subquestion", pop1_str+": empty", pop1_str+": 0", pop1_str+": 1", 
                                        pop1_str+": 2", pop1_str+": 3", pop1_str+": 4", pop1_str+": avg", pop1_str+": num responses", 
                                        pop2_str+": empty", pop2_str+": 0", pop2_str+": 1", pop2_str+": 2", pop2_str+": 3", pop2_str+": 4", pop2_str+": avg", pop2_str+": num responses", 
                                        "Diff Mean ("+pop1_str+"-"+pop2_str+")", "stat", "pval", "sig"])

    # Compute composite scores
    pop1_composite_scores,_ = get_composite_scores(filtered_pop1_df, [question_number], [n_subquestions], force_completion_rate)
    pop2_composite_scores,_ = get_composite_scores(filtered_pop2_df, [question_number], [n_subquestions], force_completion_rate)

    #from scipy.stats import normaltest
    #_,p = normaltest(pop1_composite_scores)

    # Apply Kruskal test to composite data
    pooled_stat, pooled_p = run_kruskal(pop1_composite_scores, pop2_composite_scores)
    pooled_diff_mean = np.mean(pop1_composite_scores) - np.mean(pop2_composite_scores)

    # Print
    if verbose == True:
        print('Q%s composite scores: stat=%.3f, p=%.2e, diff_mean (%s-%s)=%.3f, sig_subq=%s/%s' % (question_number, pooled_stat, pooled_p, pop1_str, pop2_str, pooled_diff_mean, sig_count, n_subquestions))

    return table, (pooled_stat, pooled_p, pooled_diff_mean), pop1_composite_scores, pop2_composite_scores, sig_count


## Group Analysis

In [86]:
# Code to analyze all questions within the group

def analyze_group(question_list, n_subq_list, question_strings, filtered_pop1_df, filtered_pop2_df, alpha=0.05):
    pop1_pooled = [] # now pooling over entire group
    pop2_pooled = []
    rows = []
    sig_count = 0
    subq_tables = []
    subq_tables_names = []

    for i in range(len(question_list)):
        question_number = question_list[i]
        n_subquestions = n_subq_list[i]
        question_string = question_strings[i]

        # Run analysis
        table, subq_pooled_result, pop1_data, pop2_data, sig_subq = analyze_question(question_number, filtered_pop1_df, filtered_pop2_df, n_subquestions, verbose=False, alpha=alpha)
        pooled_stat, pooled_p, pooled_diff_mean = subq_pooled_result
        pop1_num_responses = len(pop1_data)
        pop2_num_responses = len(pop2_data)

        # Cache procedure tables
        subq_tables.append(table)
        subq_tables_names.append('Q'+str(question_number))

        # Determine significance
        if pooled_p > alpha:
            sig = ""
        else:
            sig = "*"
            sig_count += 1

        # Cache data for group summary
        row = ['Q'+str(question_number), question_string, pooled_stat, pooled_p, sig, pooled_diff_mean, n_subquestions, "%i/%i" % (sig_subq,n_subquestions), pop1_num_responses, pop2_num_responses]
        rows.append(row)

    # Assemble table of results
    group_table = pd.DataFrame(rows, columns=["Question number", "Category", "Stat", "pval", "Sig", "Diff Mean (%s-%s)"%(pop1_str, pop2_str), "Num subquestions", "Fraction of sig subquestions", "Num %s respondents"%(pop1_str,), "Num %s respondents"%(pop2_str,)])                     

    # Compute composite scores
    pop1_composite_scores,_ = get_composite_scores(filtered_pop1_df, question_list, n_subq_list, force_completion_rate)
    pop2_composite_scores,_ = get_composite_scores(filtered_pop2_df, question_list, n_subq_list, force_completion_rate)

    # Apply Kruskal test to pooled data
    pooled_stat, pooled_p = run_kruskal(pop1_composite_scores, pop2_composite_scores)
    pooled_diff_mean = np.mean(pop1_composite_scores) - np.mean(pop2_composite_scores)

    # Print
    print('Group result (all questions): stat=%.3f, p=%.2e, diff_mean (%s-%s)=%.3f, sig_subq=%s/%s' % (pooled_stat, pooled_p, pop1_str, pop2_str, pooled_diff_mean, sig_count, len(question_list)))

    return group_table, (pooled_stat, pooled_p, pooled_diff_mean), sig_count, len(question_list), len(pop1_composite_scores), len(pop2_composite_scores), (subq_tables, subq_tables_names)

In [87]:
# cache data across all groups
group_data = []
group_columns = ["Group", "stat", "pval", "Diff_mean (%s-%s)"%(pop1_str,pop2_str), "Num questions", "Fraction of sig questions", "Num of %s respondents"%(pop1_str,), "Num of %s respondents"%(pop2_str,)]

In [88]:
# cache tables
output_tables = []
output_tables_sheet_names = []

# cache subquestion table data
output_subq_data = []

### Companion Animal Group

In [89]:
# Filter dataframe to only companion animal respondents (may have responded to other species too)
ca_pop1 = pop1[pop1['Q1'].str.contains('Companion Animal (canine and/or feline)', na=False, regex=False)].copy()

In [90]:
# Filter dataframe to only companion animal respondents (may have responded to other species too)
ca_pop2 = pop2[pop2['Q1'].str.contains('Companion Animal (canine and/or feline)', na=False, regex=False)].copy()

In [91]:
# Input info about question group

if nontechnical == False:
    question_list = [16,17,7,8,9,10,11,12]
    n_subq_list = [25,10,25,8,4,12,13,3]
    question_strings = ['Medical Procedures',
                        'Preventive Medicine/Population Health Procedures',
                        'Surgical Procedures', 
                        'Anesthetic Procedures', 
                        'Reproductive Procedures',
                        'Diagnostic Imaging Procedures',
                        'Clinical Pathology Procedures',
                        'Diagnostic Necropsy Procedures']
else:
    question_list = [13,14,15]
    n_subq_list = [11,6,8]
    question_strings = ['Communication practices',
                        'Professional and business practices',
                        'Ethics and professional practices']

assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [92]:
group_table, pooled_q_stats, sig_count, n_questions, pop1_responses, pop2_responses, subq_data  = analyze_group(question_list, n_subq_list, question_strings, ca_pop1, ca_pop2, alpha=ALPHA)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Companion Animal", pooled_stat, pooled_p, pooled_diff_mean, n_questions, "%i/%i" % (sig_count,n_questions), pop1_responses, pop2_responses])
output_tables.append(group_table)
output_tables_sheet_names.append("Companion Animal")
output_subq_data.append(subq_data)
group_table

Group result (all questions): stat=9.018, p=2.67e-03, diff_mean (SVM-WVMA)=0.227, sig_subq=5/8


| | Question number | Category | Stat | pval | Sig | Diff Mean (SVM-WVMA) | Num subquestions | Fraction of sig subquestions | Num SVM respondents | Num WVMA respondents |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q16 | Medical Procedures | 22.970642 | 2e-06 | * | 0.33439 | 25 | 15/25 | 40 | 100 |
| 1 | Q17 | Preventive Medicine/Population Health Procedures | 5.229499 | 0.022207 | * | 0.260298 | 10 | 4/10 | 39 | 93 |
| 2 | Q7 | Surgical Procedures | 1.580552 | 0.208681 | | 0.142689 | 25 | 5/25 | 39 | 93 |
| 3 | Q8 | Anesthetic Procedures | 10.379662 | 0.001274 | * | 0.252289 | 8 | 7/8 | 39 | 90 |
| 4 | Q9 | Reproductive Procedures | 0.10743 | 0.74309 | | 0.052651 | 4 | 0/4 | 39 | 89 |
| 5 | Q10 | Diagnostic Imaging Procedures | 5.486665 | 0.019162 | * | 0.300766 | 12 | 3/12 | 39 | 87 |
| 6 | Q11 | Clinical Pathology Procedures | 5.767303 | 0.016327 | * | 0.227149 | 13 | 5/13 | 39 | 87 |
| 7 | Q12 | Diagnostic Necropsy Procedures | 0.644815 | 0.421973 | | 0.231948 | 3 | 1/3 | 39 | 87 |


### Special Species Group

In [93]:
# Filter dataframes to only special species respondents (may have responded to other species too)
ss_pop1 = pop1[pop1['Q1'].str.contains('Special Species', na=False, regex=False)].copy()
ss_pop2 = pop2[pop2['Q1'].str.contains('Special Species', na=False, regex=False)].copy()

In [94]:
# Input info about question group

if nontechnical == False:
    question_list = [43, 44, 45, 46, 48, 49, 50]
    n_subq_list = [20, 9, 11, 8, 6, 13, 3]
    question_strings = ['Medical Procedures',
                        'Preventive Medicine/Population Health Procedures',
                        'Surgical Procedures', 
                        'Anesthetic Procedures', 
                        'Diagnostic Imaging Procedures',
                        'Clinical Pathology Procedures',
                        'Diagnostic Necropsy Procedures']
else:
    question_list = [13,14,15]
    n_subq_list = [11,6,8]
    question_strings = ['Communication practices',
                        'Professional and business practices',
                        'Ethics and professional practices']
                        
assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [95]:
group_table, pooled_q_stats, sig_count, n_questions, pop1_responses, pop2_responses, subq_data  = analyze_group(question_list, n_subq_list, question_strings, ss_pop1, ss_pop2, alpha=ALPHA)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Special Species", pooled_stat, pooled_p, pooled_diff_mean, n_questions, "%i/%i" % (sig_count,n_questions), pop1_responses, pop2_responses])
output_tables.append(group_table)
output_tables_sheet_names.append("Special Species")
output_subq_data.append(subq_data)
group_table

Group result (all questions): stat=7.289, p=6.94e-03, diff_mean (SVM-WVMA)=0.821, sig_subq=4/7


| | Question number | Category | Stat | pval | Sig | Diff Mean (SVM-WVMA) | Num subquestions | Fraction of sig subquestions | Num SVM respondents | Num WVMA respondents |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q43 | Medical Procedures | 9.249899 | 0.002355 | * | 0.873922 | 20 | 8/20 | 13 | 17 |
| 1 | Q44 | Preventive Medicine/Population Health Procedures | 1.67802 | 0.195188 | | 0.223113 | 9 | 0/9 | 11 | 17 |
| 2 | Q45 | Surgical Procedures | 3.461992 | 0.062795 | | 0.725814 | 11 | 2/11 | 11 | 17 |
| 3 | Q46 | Anesthetic Procedures | 3.911626 | 0.047953 | * | 0.666903 | 8 | 1/8 | 11 | 16 |
| 4 | Q48 | Diagnostic Imaging Procedures | 4.017415 | 0.045033 | * | 0.967803 | 6 | 3/6 | 11 | 16 |
| 5 | Q49 | Clinical Pathology Procedures | 6.248169 | 0.012432 | * | 0.803322 | 13 | 4/13 | 11 | 16 |
| 6 | Q50 | Diagnostic Necropsy Procedures | 0.673221 | 0.411931 | | 0.342803 | 3 | 0/3 | 11 | 16 |


### Food Animal Group

In [96]:
# Filter dataframes to only food animal respondents (may have responded to other species too)
fa_pop1 = pop1[pop1['Q1'].str.contains('Food Animal', na=False, regex=False)].copy()
fa_pop2 = pop2[pop2['Q1'].str.contains('Food Animal', na=False, regex=False)].copy()

In [97]:
# Input info about question group

if nontechnical == False:
    question_list = [20, 18, 25, 24, 21, 19, 23, 22, 27]
    n_subq_list = [8, 27, 16, 10, 20, 11, 12, 3, 5]
    question_strings = ['Handling and Husbandry Procedures',
                        'Medical Procedures',
                        'Surgical Procedures',
                        'Anesthetic Procedures',
                        'Preventive Medicine/Population Health Procedures',
                        'Reproductive Procedures',
                        'Clinical Pathology Procedures',
                        'Diagnostic Necropsy Procedures',
                        'Diagnostic Imaging Procedures']
else:
    question_list = [13,14,15]
    n_subq_list = [11,6,8]
    question_strings = ['Communication practices',
                        'Professional and business practices',
                        'Ethics and professional practices']

assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [98]:
group_table, pooled_q_stats, sig_count, n_questions, pop1_responses, pop2_responses, subq_data  = analyze_group(question_list, n_subq_list, question_strings, fa_pop1, fa_pop2, alpha=ALPHA)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Food Animal", pooled_stat, pooled_p, pooled_diff_mean, n_questions, "%i/%i" % (sig_count,n_questions), pop1_responses, pop2_responses])
output_tables.append(group_table)
output_tables_sheet_names.append("Food Animal")
output_subq_data.append(subq_data)
group_table

Group result (all questions): stat=0.065, p=7.99e-01, diff_mean (SVM-WVMA)=0.012, sig_subq=0/9


| | Question number | Category | Stat | pval | Sig | Diff Mean (SVM-WVMA) | Num subquestions | Fraction of sig subquestions | Num SVM respondents | Num WVMA respondents |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q20 | Handling and Husbandry Procedures | 0.684004 | 0.408211 | | -0.247884 | 8 | 0/8 | 13 | 37 |
| 1 | Q18 | Medical Procedures | 3.753191 | 0.052707 | | 0.273543 | 27 | 5/27 | 13 | 37 |
| 2 | Q25 | Surgical Procedures | 0.041589 | 0.838404 | | -0.008449 | 16 | 0/16 | 13 | 36 |
| 3 | Q24 | Anesthetic Procedures | 0.036746 | 0.847983 | | 0.165741 | 10 | 1/10 | 12 | 36 |
| 4 | Q21 | Preventive Medicine/Population Health Procedures | 0.618589 | 0.431572 | | 0.074488 | 20 | 1/20 | 12 | 36 |
| 5 | Q19 | Reproductive Procedures | 0.55539 | 0.456124 | | -0.17987 | 11 | 0/11 | 12 | 35 |
| 6 | Q23 | Clinical Pathology Procedures | 3.300788 | 0.069247 | | 0.302381 | 12 | 4/12 | 12 | 35 |
| 7 | Q22 | Diagnostic Necropsy Procedures | 0.19294 | 0.660481 | | -0.280952 | 3 | 0/3 | 12 | 35 |
| 8 | Q27 | Diagnostic Imaging Procedures | 0.539646 | 0.46258 | | 0.318095 | 5 | 0/5 | 12 | 35 |


### Equine Group

In [99]:
# Filter dataframes to only equine respondents (may have responded to other species too)
eq_pop1 = pop1[pop1['Q1'].str.contains('Equine', na=False, regex=False)].copy()
eq_pop2 = pop2[pop2['Q1'].str.contains('Equine', na=False, regex=False)].copy()

In [100]:
# Input info about question group

if nontechnical == False:
    question_list = [28, 29, 30, 31, 32, 33, 34, 35, 36]
    n_subq_list = [7, 24, 8, 8, 15, 9, 11, 3, 5]
    question_strings = ['Handling and Husbandry Procedures',
                        'Medical Procedures',
                        'Surgical Procedures',
                        'Anesthetic Procedures',
                        'Preventive Medicine/Population Health Procedures',
                        'Reproductive Procedures',
                        'Clinical Pathology Procedures',
                        'Diagnostic Necropsy Procedures',
                        'Diagnostic Imaging Procedures']
else:
    question_list = [13,14,15]
    n_subq_list = [11,6,8]
    question_strings = ['Communication practices',
                        'Professional and business practices',
                        'Ethics and professional practices']

assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [101]:
group_table, pooled_q_stats, sig_count, n_questions, pop1_responses, pop2_responses, subq_data  = analyze_group(question_list, n_subq_list, question_strings, eq_pop1, eq_pop2, alpha=ALPHA)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Equine", pooled_stat, pooled_p, pooled_diff_mean, n_questions, "%i/%i" % (sig_count,n_questions), pop1_responses, pop2_responses])
output_tables.append(group_table)
output_tables_sheet_names.append("Equine")
output_subq_data.append(subq_data)
group_table

Group result (all questions): stat=0.050, p=8.23e-01, diff_mean (SVM-WVMA)=0.137, sig_subq=1/9


| | Question number | Category | Stat | pval | Sig | Diff Mean (SVM-WVMA) | Num subquestions | Fraction of sig subquestions | Num SVM respondents | Num WVMA respondents |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q28 | Handling and Husbandry Procedures | 0.004883 | 0.944292 | | 0.009351 | 7 | 0/7 | 11 | 25 |
| 1 | Q29 | Medical Procedures | 1.012161 | 0.314386 | | 0.281341 | 24 | 3/24 | 10 | 25 |
| 2 | Q30 | Surgical Procedures | 0.260388 | 0.609854 | | 0.108333 | 8 | 0/8 | 9 | 25 |
| 3 | Q31 | Anesthetic Procedures | 2.467502 | 0.116223 | | 0.466111 | 8 | 1/8 | 9 | 25 |
| 4 | Q32 | Preventive Medicine/Population Health Procedures | 0.026214 | 0.871379 | | -0.180556 | 15 | 0/15 | 9 | 24 |
| 5 | Q33 | Reproductive Procedures | 0.003692 | 0.95155 | | 0.080247 | 9 | 0/9 | 9 | 24 |
| 6 | Q34 | Clinical Pathology Procedures | 4.439558 | 0.035115 | * | 0.45841 | 11 | 1/11 | 9 | 23 |
| 7 | Q35 | Diagnostic Necropsy Procedures | 0.064206 | 0.799967 | | -0.097424 | 3 | 0/3 | 9 | 23 |
| 8 | Q36 | Diagnostic Imaging Procedures | 0.003973 | 0.949741 | | -0.050966 | 5 | 0/5 | 9 | 23 |


## Group Summary

In [102]:
group_summary_table = pd.DataFrame(group_data, columns=group_columns)

In [103]:
pvals = list(group_summary_table['pval'])

sigs = []
for p in pvals:
  if p > ALPHA:
      sig = ""
  else:
      sig = "*"
  sigs.append(sig)

group_summary_table.insert(loc=3, column='Sig', value=sigs)

In [104]:
# Add group summary to the beginning of output tables
output_tables.insert(0, group_summary_table)
output_tables_sheet_names.insert(0, "Group summary")

In [105]:
group_summary_table

| | Group | stat | pval | Sig | Diff_mean (SVM-WVMA) | Num questions | Fraction of sig questions | Num of SVM respondents | Num of WVMA respondents |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Companion Animal | 9.018245 | 0.002673 | * | 0.227131 | 8 | 5/8 | 40 | 100 |
| 1 | Special Species | 7.288587 | 0.006939 | * | 0.820592 | 7 | 4/7 | 13 | 17 |
| 2 | Food Animal | 0.064706 | 0.799207 | | 0.011531 | 9 | 0/9 | 13 | 37 |
| 3 | Equine | 0.049834 | 0.823352 | | 0.137071 | 9 | 1/9 | 11 | 25 |


## Top-level comparison

For each respondent, compute a composite score based on all questions they answered. Then compare pop1 to pop2. This represents a top-level comparison across both populations.

In [107]:
# Create question list
if nontechnical == False:
    question_list = [16,17,7,8,9,10,11,12]+[43, 44, 45, 46, 48, 49, 50]+[20, 18, 25, 24, 21, 19, 23, 22, 27]+[28, 29, 30, 31, 32, 33, 34, 35, 36]
    n_subq_list = [25,10,25,8,4,12,13,3]+[20, 9, 11, 8, 6, 13, 3]+[8, 27, 16, 10, 20, 11, 12, 3, 5]+[7, 24, 8, 8, 15, 9, 11, 3, 5]
else:
    question_list = [13,14,15]
    n_subq_list = [11,6,8]

# Compute composite scores
pop1_composite_scores,_ = get_composite_scores(pop1, question_list, n_subq_list, force_completion_rate)
pop2_composite_scores,_ = get_composite_scores(pop2, question_list, n_subq_list, force_completion_rate)

# Apply Kruskal test to pooled data
pooled_stat, pooled_p = run_kruskal(pop1_composite_scores, pop2_composite_scores)
pooled_diff_mean = np.mean(pop1_composite_scores) - np.mean(pop2_composite_scores)

# Print
top_level_output = '%s vs. %s: stat=%.3f, p=%.2e, diff_mean (%s-%s)=%.3f, Num %s responses=%i, Num of %s responses=%i ' % (pop1_str, pop2_str, pooled_stat, pooled_p, pop1_str, pop2_str, pooled_diff_mean, pop1_str, len(pop1_composite_scores), pop2_str, len(pop2_composite_scores))
print(top_level_output)

# Save result
with open('toplevel%s.txt'%(file_suffix,), "w") as text_file:
    text_file.write(top_level_output)


SVM vs. WVMA: stat=4.575, p=3.24e-02, diff_mean (SVM-WVMA)=0.162, Num SVM responses=49, Num of WVMA responses=127 


## TODO: Determine p-value significance

We will remove the `sig` and `Fraction of sig questions/subquestions` columns from the above dataframes. P values will be made the rightmost column of all primary data tables.

Then, we will loop through tables to obtain all p values, evaluate p values with a ranking approach, then add back a `sig` column to each table.
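
The ranking approach is not specified above. One widely used rank-based procedure is the Benjamini-Hochberg step-up method, which controls the false discovery rate. Choosing it here is an assumption, not a decision recorded in this notebook. The sketch below uses a hypothetical helper `benjamini_hochberg` and made-up p-values. It sorts the p-values, finds the largest rank `k` whose p-value is at most `k / m * alpha`, and flags the `k` smallest p-values as significant.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of booleans (same order as pvals) marking rejected hypotheses."""
    m = len(pvals)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])

    # Largest rank k such that p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below max_rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject

# Illustrative p-values only, not taken from the survey results
flags_a = benjamini_hochberg([0.001, 0.04, 0.03, 0.20], alpha=0.05)
flags_b = benjamini_hochberg([0.01, 0.02, 0.03, 0.04], alpha=0.05)
print(flags_a)  # [True, False, False, False]
print(flags_b)  # [True, True, True, True]
```

In the first example, 0.03 and 0.04 fall below 0.05 but are not flagged once the ranking is taken into account. The resulting flags could be mapped back to a `sig` column with `"*"` for `True` and `""` for `False`.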

# Generate tables

We will generate the following types of tables using pooled data from these experiments:

1.   `summary.xlsx`: Group summary table and a table for procedure sets (questions) within each group.
2.   `companion_animal.xlsx`: Tables for all procedures within the companion animal group.
3.   `special_species.xlsx`: Tables for all procedures within the special species group.
4.   `food_animal.xlsx`: Tables for all procedures within the food animal group.
5.   `equine.xlsx`: Tables for all procedures within the equine group.
6.   `summary_nontechnical_allspecies.xlsx`: Summary table for responses to procedures (subquestions) pooled across species areas. Applicable only to non-technical questions.
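
The file names listed above correspond to the default analysis (`file_suffix = ""`). For other settings of `analysis_mode` and `nontechnical`, the suffix set during setup is inserted before the extension. The sketch below illustrates the naming scheme. `output_filenames` is a hypothetical helper written for illustration and is not used by the cells that follow.

```python
def output_filenames(file_suffix):
    """List the workbook names written for a given file suffix."""
    bases = ["summary", "companion_animal", "special_species", "food_animal", "equine"]
    names = ["%s%s.xlsx" % (base, file_suffix) for base in bases]
    # The all-species workbook is only written for nontechnical analyses,
    # whose suffix ends in "_nontechnical"
    if file_suffix.endswith("_nontechnical"):
        names.append("summary%s_allspecies.xlsx" % (file_suffix,))
    return names

default_names = output_filenames("")
specialist_generalist_names = output_filenames("_sg_nontechnical")
print(default_names)
print(specialist_generalist_names)
```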



## Summary

In [None]:
writer = pd.ExcelWriter('summary%s.xlsx'%(file_suffix,), engine='xlsxwriter')

for i,table in enumerate(output_tables):
    sheet_name = output_tables_sheet_names[i]
    table.to_excel(writer, sheet_name=sheet_name, index=False)

    # Auto-adjust column widths
    for column in table:
        column_width = max(table[column].astype(str).map(len).max(), len(column))
        col_idx = table.columns.get_loc(column)
        writer.sheets[sheet_name].set_column(col_idx, col_idx, column_width)

writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file

## All Species Summary

This output table only applies to the nontechnical analyses, for which the relevant questions appear in all species areas.

In [None]:
if nontechnical == True:

    group_table, pooled_q_stats, sig_count, n_questions, pop1_responses, pop2_responses, subq_data  = analyze_group(question_list, n_subq_list, question_strings, pop1, pop2, alpha=ALPHA)
    subq_tables, subq_tables_names = subq_data

    # Loop through tables
    writer = pd.ExcelWriter('summary%s_allspecies.xlsx'%(file_suffix,), engine='xlsxwriter')

    for i,table in enumerate(subq_tables):
        sheet_name = subq_tables_names[i]
        table.to_excel(writer, sheet_name=sheet_name, index=False)

        # Auto-adjust column widths
        for column in table:
            column_width = max(table[column].astype(str).map(len).max(), len(column))
            col_idx = table.columns.get_loc(column)
            writer.sheets[sheet_name].set_column(col_idx, col_idx, column_width)

    writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file

## All procedures

In [None]:
for i, file in enumerate(['companion_animal%s.xlsx'%(file_suffix,), 'special_species%s.xlsx'%(file_suffix,), 'food_animal%s.xlsx'%(file_suffix,), 'equine%s.xlsx'%(file_suffix,)]):
    subq_data = output_subq_data[i]
    subq_tables, subq_tables_names = subq_data

    # Loop through tables
    writer = pd.ExcelWriter(file, engine='xlsxwriter')

    for j,table in enumerate(subq_tables):  # use j so the outer loop index i is not shadowed
        sheet_name = subq_tables_names[j]
        table.to_excel(writer, sheet_name=sheet_name, index=False)

        # Auto-adjust column widths
        for column in table:
            column_width = max(table[column].astype(str).map(len).max(), len(column))
            col_idx = table.columns.get_loc(column)
            writer.sheets[sheet_name].set_column(col_idx, col_idx, column_width)

    writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file

# Next Steps

1.   Implement multiple testing correction



In [None]:
!cp *.xlsx drive/MyDrive/survey_test/