
# Veterinary graduate expectations survey

This notebook is an exploratory analysis of two graduate expectations surveys, one from SVM and one from WVMA. Start by uploading the data into the working directory. Two files are required:

1.   `SVM.xlsx`: SVM graduate expectations survey results
2.   `WVMA.xlsx`: WVMA graduate expectations survey results

In [77]:
import pandas as pd
import numpy as np
from scipy.stats import kruskal

## Read in SVM data

In [78]:
# Use top row as header and skip second header row
svm = pd.read_excel('SVM.xlsx', header=0, skiprows=lambda x: x in [1])  

In [79]:
#svm.head(3)

In [80]:
# Read in questions from second header row and associate with column names
question_svm = {}

top_rows = pd.read_excel('SVM.xlsx', nrows=2) 

for col in list(top_rows.columns):
    question_svm[col] = top_rows.iloc[0][col]

## Read in WVMA data

In [81]:
# Use top row as header and skip second header row
wvma = pd.read_excel('WVMA.xlsx', header=0, skiprows=lambda x: x in [1])  

# Read in questions from second header row and associate with column names
question_wvma = {}

top_rows_wvma = pd.read_excel('WVMA.xlsx', nrows=2) 

for col in list(top_rows_wvma.columns):
    question_wvma[col] = top_rows_wvma.iloc[0][col]
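
The two read steps above differ only in the filename, so they could be folded into one helper. The sketch below is a hypothetical alternative, not the notebook's method: `split_questions` is an illustrative name, and it assumes the sheet is read once with `header=0` so that the first data row holds the question wording. Because that text row is parsed together with the responses, column dtypes may differ from the two-pass approach above; `infer_objects()` only partly recovers them.

```python
import pandas as pd

def split_questions(raw):
    """Split a sheet read with header=0 into (responses, question_text).

    Hypothetical helper: assumes row 0 of `raw` holds the question wording
    and every later row is one survey response.
    """
    # Map each column key (e.g. "Q1") to the wording stored in the first row
    question_text = raw.iloc[0].to_dict()
    # Drop the wording row and renumber the remaining responses from zero
    responses = raw.iloc[1:].reset_index(drop=True).infer_objects()
    return responses, question_text

# Toy stand-in for a spreadsheet; the real call would pass
# pd.read_excel('SVM.xlsx', header=0) instead.
toy = pd.DataFrame({
    "Q1": ["Species area?", "Equine", "Equine,Special Species"],
    "Progress": ["Progress", 100, 50],
})
toy_responses, toy_questions = split_questions(toy)
```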

## Linking questions

In [82]:
len(question_wvma)

425

In [83]:
len(question_svm)

470

In [84]:
q_svm_set = set(question_svm)
q_wvma_set = set(question_wvma)
len(q_svm_set.intersection(q_wvma_set))

421

There are 425 WVMA question keys and 470 SVM question keys, of which 421 appear in both surveys. A shared key does not guarantee identical question text, so the text is compared next.

In the code below, we loop through all question keys in the SVM results and print results for two potential issues:

*   If the same question key appears in both surveys, but the text of the question differs, there is a "TEXT MISMATCH" message and the two questions are printed.
*   If a question in SVM does not appear in WVMA, the SVM question is printed.



In [85]:
# Review question keys
for svm_key in question_svm.keys():
    if svm_key in question_wvma.keys():
        if question_svm[svm_key] != question_wvma[svm_key]:
            print("\nTEXT MISMATCH! %s" % (svm_key,))
            print("SVM: %s" % (question_svm[svm_key],))
            print("WVMA: %s" % (question_wvma[svm_key],))
    else:
        print("\nSVM key '%s' not in WVMA: %s" % (svm_key, question_svm[svm_key]))


TEXT MISMATCH! Q34
SVM: Select your primary clinical service area. - Selected Choice
WVMA: Select the best description for your practice setting. - Selected Choice

TEXT MISMATCH! Q34_12_TEXT
SVM: Select your primary clinical service area. - Other - Text
WVMA: Select the best description for your practice setting. - Other (please describe) - Text

SVM key 'Q59' not in WVMA: Indicate any AVMA-Recognized Veterinary Specialites in which you currently hold Diplomate status. - Selected Choice

SVM key 'Q59_24_TEXT' not in WVMA: Indicate any AVMA-Recognized Veterinary Specialites in which you currently hold Diplomate status. - Other non-AVMA recognized specialty credentials - Text

TEXT MISMATCH! Q1
SVM: This survey has separate sections detailing expectations for graduates in four species areas.  In each area, rate the level at which you would expect a new graduate to be able to perform the specific procedure or task.  Levels range from "Perform Independently" to "No Expectation to Perform

It remains to print the WVMA questions that do not appear in the SVM survey. From the counts above, there should be 425 - 421 = 4 of them.

In [86]:
# Review question keys
for wvma_key in question_wvma.keys():
    if wvma_key not in question_svm.keys():
        print("\nWVMA key '%s' not in SVM: %s" % (wvma_key, question_wvma[wvma_key]))


WVMA key 'Q35' not in SVM: Select the number of veterinarians employed in your practice location.

WVMA key 'Q49' not in SVM: Indicate any AVMA-Recognized Veterinary Specialites in which you currently hold Diplomate status. - Selected Choice

WVMA key 'Q49_24_TEXT' not in SVM: Indicate any AVMA-Recognized Veterinary Specialites in which you currently hold Diplomate status. - Other non-AVMA recognized specialty credentials - Text

WVMA key 'Q46' not in SVM: List any business/practice management, communication, or ethical and professional practices not included in the above lists that you believe are essential for graduates to be able to perform at an independent or indirect supervision level.


In [None]:
# Show matches
for svm_key in question_svm.keys():
    if svm_key in question_wvma.keys():
        print(svm_key)
            

I examined these lists to develop the following summary. The question keys refer to column names in the first row of the raw spreadsheets (keys in `question_svm` and `question_wvma`, as well as column names in `svm` and `wvma` dataframes).

Comments on mismatched categories in SVM and WVMA

 * `Q34`: Primary clinical service area/practice setting (category list with different categories for SVM vs. WVMA)
 * `Q34_12_TEXT`: Free-text primary clinical service area/practice setting
 * `Q1`: Analogous questions with slightly different wording.
 * `Q43_1` - `Q43_20`: Identical questions (difference due to one space missing before hyphen)
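
Mismatches like the `Q43` ones, which come down to spacing around a hyphen, could be suppressed by normalizing the text before comparing. The following is a minimal sketch under that assumption; `normalize_question` and `text_mismatches` are hypothetical names, and the rule of forcing a single space on both sides of every hyphen is an illustrative choice rather than something the survey export guarantees.

```python
import re

def normalize_question(text):
    """Collapse whitespace and force ' - ' around every hyphen (illustrative rule)."""
    text = re.sub(r"\s*-\s*", " - ", str(text))
    return re.sub(r"\s+", " ", text).strip()

def text_mismatches(questions_a, questions_b):
    """Return shared keys whose question text still differs after normalization."""
    shared = sorted(set(questions_a) & set(questions_b))
    return [key for key in shared
            if normalize_question(questions_a[key]) != normalize_question(questions_b[key])]

# Toy dictionaries standing in for question_svm and question_wvma
toy_svm = {"Q43_1": "Special Species - Medical - Example", "Q34": "Clinical service area"}
toy_wvma = {"Q43_1": "Special Species- Medical - Example", "Q34": "Practice setting", "Q35": "Count"}
remaining = text_mismatches(toy_svm, toy_wvma)
```

With this normalization, only genuine wording differences such as the toy `Q34` remain in the result.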

SVM questions with WVMA analogs

 * `svm.Q59`: AVMA-recognized specialty, analogous to `wvma.Q49`
 * `svm.Q59_24_TEXT`: Free-text AVMA-recognized specialty, analogous to `wvma.Q49_24_TEXT`
 * `svm.Q57`: Professional skills, analogous to `wvma.Q46`

SVM-only
 * `svm.Q51_1` - `svm.Q51_22`: DMS rotation length assessment
 * `svm.Q53_1` - `svm.Q53_3`: PBS rotation length assessment
 * `svm.Q54_1` - `svm.Q54_19`: DSS rotation length assessment
 * `svm.Q55`: Free-text which rotations should be extended
 * `svm.Q56`: Free-text which rotations should be shortened

WVMA-only
 * `wvma.Q35`: Number of veterinarians at practice location

### Summary Table

| Question Key             | Linked question key | Description                                                                                               |
| ------------------------ | ------------------- | --------------------------------------------------------------------------------------------------------- |
| Status                   |                     |                                                                                                           |
| IPAddress                |                     |                                                                                                           |
| Progress                 |                     |                                                                                                           |
| Duration (in seconds)    |                     |                                                                                                           |
| Finished                 |                     |                                                                                                           |
| RecordedDate             |                     |                                                                                                           |
| ResponseId               |                     |                                                                                                           |
| RecipientLastName        |                     |                                                                                                           |
| RecipientFirstName       |                     |                                                                                                           |
| RecipientEmail           |                     |                                                                                                           |
| ExternalReference        |                     |                                                                                                           |
| LocationLatitude         |                     |                                                                                                           |
| LocationLongitude        |                     |                                                                                                           |
| DistributionChannel      |                     |                                                                                                           |
| UserLanguage             |                     |                                                                                                           |
| Q34                      |                     | Primary clinical service area/practice setting (category list with different categories for SVM vs. WVMA) |
| Q34\_12\_TEXT            |                     | Free-text primary clinical service area/practice setting                                                  |
| svm.Q59                  | wvma.Q49            | AVMA-recognized specialty                                                                                 |
| svm.Q59\_24\_TEXT        | wvma.Q49\_24\_TEXT  | Free-text AVMA-recognized specialty                                                                       |
| svm.Q57                  | wvma.Q46            | Professional skills                                                                                       |
| Q1                       |                     | Species area                                                                                              |
| Q16\_1 - Q16\_25         |                     | Companion Animal Medical Procedures                                                                       |
| Q17\_1 - Q17\_10         |                     | Companion Animal Preventive Medicine/Population Health Procedures                                         |
| Q7\_1 - Q7\_25           |                     | Companion Animal Surgical Procedures                                                                      |
| Q8\_1 - Q8\_8            |                     | Companion Animal Anesthetic Procedures                                                                    |
| Q9\_1 - Q9\_4            |                     | Companion Animal Reproductive Procedures                                                                  |
| Q10\_1 - Q10\_12         |                     | Companion Animal Diagnostic Imaging Procedures                                                            |
| Q11\_1 - Q11\_13         |                     | Companion Animal Clinical Pathology Procedures                                                            |
| Q12\_1 - Q12\_3          |                     | Companion Animal Diagnostic Necropsy Procedures                                                           |
| Q31                      |                     | Free text additional companion animal procedures                                                          |
| Q43\_1 - Q43\_20         |                     | Special Species Medical Procedures                                                                        |
| Q44\_1 - Q44\_9          |                     | Special Species Preventive Medicine/Population Health Procedures                                          |
| Q45\_1 - Q45\_11         |                     | Special Species Surgical Procedures                                                                       |
| Q46\_1 - Q46\_8          |                     | Special Species Anesthetic Procedures                                                                     |
| Q48\_1 - Q48\_6          |                     | Special Species Diagnostic Imaging Procedures                                                             |
| Q49\_1 - Q49\_13         |                     | Special Species Clinical Pathology Procedures                                                             |
| Q50\_1 - Q50\_3          |                     | Special Species Diagnostic Necropsy Procedures                                                            |
| Q51                      |                     | Free text additional special species procedures                                                           |
| Q20\_1 - Q20\_8          |                     | Food Animal Handling and Husbandry Procedures                                                             |
| Q18\_1 - Q18\_27         |                     | Food Animal Medical Procedures                                                                            |
| Q25\_1 - Q25\_16         |                     | Food Animal Surgical Procedures                                                                           |
| Q24\_1 - Q24\_10         |                     | Food Animal Anesthetic Procedures                                                                         |
| Q21\_1 - Q21\_20         |                     | Food Animal Preventive Medicine/Population Health Procedures                                              |
| Q19\_1 - Q19\_11         |                     | Food Animal Reproductive Procedures                                                                       |
| Q23\_1 - Q23\_12         |                     | Food Animal Clinical Pathology Procedures                                                                 |
| Q22\_1 - Q22\_3          |                     | Food Animal Diagnostic Necropsy Procedures                                                                |
| Q27\_1 - Q27\_5          |                     | Food Animal Diagnostic Imaging Procedures                                                                 |
| Q32                      |                     | Free text additional food animal procedures                                                               |
| Q28\_1 - Q28\_7          |                     | Equine Handling and Husbandry Procedures                                                                  |
| Q29\_1 - Q29\_24         |                     | Equine Medical Procedures                                                                                 |
| Q30\_1 - Q30\_8          |                     | Equine Surgical Procedures                                                                                |
| Q31\_1 - Q31\_8          |                     | Equine Anesthetic Procedures                                                                              |
| Q32\_1 - Q32\_15         |                     | Equine Preventive Medicine/Population Health Procedures                                                   |
| Q33\_1 - Q33\_9          |                     | Equine Reproductive Procedures                                                                            |
| Q34\_1 - Q34\_11         |                     | Equine Clinical Pathology Procedures                                                                      |
| Q35\_1 - Q35\_3          |                     | Equine Diagnostic Necropsy Procedures                                                                     |
| Q36\_1 - Q36\_5          |                     | Equine Diagnostic Imaging Procedures                                                                      |
| Q33                      |                     | Free text additional equine procedures                                                                    |
| Q14\_1 - Q14\_6          |                     | Professional and Business Practices                                                                       |
| Q13\_1 - Q13\_11         |                     | Communication Practices                                                                                   |
| Q15\_1 - Q15\_8          |                     | Ethical and Professional Practices                                                                        |
| svm.Q51\_1 - svm.Q51\_22 | none                | DMS rotation length assessment                                                                            |
| svm.Q53\_1 - svm.Q53\_3  | none                | PBS rotation length assessment                                                                            |
| svm.Q54\_1 - svm.Q54\_19 | none                | DSS rotation length assessment                                                                            |
| svm.Q55                  | none                | Free-text which rotations should be extended                                                              |
| svm.Q56                  | none                | Free-text which rotations should be shortened                                                             |
| wvma.Q35                 | none                | Number of veterinarians at practice location                                                              |

## Descriptive Analysis

### Counts of species area

Let's look at the counts of species area (`Q1`) in each survey. This question allowed multiple responses, which appear as a comma-delimited list. The code below counts how many times each species appears, taking into account the possibility of multiple responses per respondent.

In [88]:
from collections import defaultdict

svm_counts = defaultdict(int) # start each count at zero by default

for entry in list(svm.Q1):
    if isinstance(entry, str):
        species_list = entry.split(',')
        for species in species_list:
            svm_counts[species] += 1
    elif pd.isna(entry):  # blank response
        svm_counts["empty"] += 1

print("*** SVM Survey ***")
for key, val in svm_counts.items():
    print("%s: %i" % (key, val))

*** SVM Survey ***
Companion Animal (canine and/or feline): 46
Food Animal (bovine): 18
Equine: 16
Special Species: 19
empty: 5


In [89]:
wvma_counts = defaultdict(int) # start each count at zero by default

for entry in list(wvma.Q1):
    if isinstance(entry, str):
        species_list = entry.split(',')
        for species in species_list:
            wvma_counts[species] += 1
    elif pd.isna(entry):  # blank response
        wvma_counts["empty"] += 1

print("*** WVMA Survey ***")
for key, val in wvma_counts.items():
    print("%s: %i" % (key, val))

*** WVMA Survey ***
Companion Animal (canine and/or feline): 115
Food Animal (bovine): 48
Equine: 30
Special Species (ex. exotic companion animals): 21
empty: 29
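
The same tally can be written with pandas string methods instead of an explicit loop. This is a sketch rather than the notebook's code: `count_species` is a hypothetical name, and it assumes, as the loops above do, that no option label itself contains a comma. Unlike the loops, it always reports an `empty` entry, even when the count is zero.

```python
import pandas as pd

def count_species(responses):
    """Count each species across comma-delimited multi-select responses."""
    # One row per selected species; blank responses are dropped first
    exploded = responses.dropna().astype(str).str.split(",").explode()
    counts = exploded.value_counts().to_dict()
    # Blank responses are reported under a separate key
    counts["empty"] = int(responses.isna().sum())
    return counts

# Toy column standing in for svm.Q1 or wvma.Q1
toy_q1 = pd.Series(["Equine,Special Species", "Equine", None])
toy_counts = count_species(toy_q1)
```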


### Note about organization

There are several levels of organization in our interpretation of this data.

 * `Group`: One of the 4 species groups (companion animal, special species, food animal, or equine)
     * `Question`: A group of procedures in a category such as "Medical Procedures" or "Surgical Procedures"
          * `Sub-question`: A particular procedure

We can perform analysis at the sub-question level, or pool upwards to the question or group level. I will do all of this below.
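
To make the hierarchy concrete, here is a small sketch of the companion animal group as a mapping from question number to its label and number of sub-questions, with the column keys generated from it. The question numbers and counts are taken from the summary table; the names `COMPANION_ANIMAL` and `subquestion_keys` are hypothetical.

```python
# Group level: companion animal. Question level: one entry per question number,
# holding (label, number of sub-questions).
COMPANION_ANIMAL = {
    16: ("Medical Procedures", 25),
    17: ("Preventive Medicine/Population Health Procedures", 10),
    7: ("Surgical Procedures", 25),
    8: ("Anesthetic Procedures", 8),
    9: ("Reproductive Procedures", 4),
    10: ("Diagnostic Imaging Procedures", 12),
    11: ("Clinical Pathology Procedures", 13),
    12: ("Diagnostic Necropsy Procedures", 3),
}

def subquestion_keys(question_number, n_subquestions):
    """Sub-question level: column keys such as 'Q16_1' ... 'Q16_25'."""
    return ["Q%d_%d" % (question_number, i) for i in range(1, n_subquestions + 1)]

group_keys = {q: subquestion_keys(q, n) for q, (label, n) in COMPANION_ANIMAL.items()}
total_subquestions = sum(len(keys) for keys in group_keys.values())
```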





### Examine sub-questions within companion animal Q16

Now let's consider Companion Animal Medical Procedures (`Q16_1` - `Q16_25`). Below is code to convert these responses to ordinal representations and compare the distributions between SVM and WVMA.

Let's encode the expectation response in the following way:

 * 0: No Expectation to Perform Procedure

 * 1: Perform with Assistance (assist with portions of procedure) or Perform with Direct Supervision (present in room during procedure)

 * 2: Perform with Indirect Supervision (available in building or by phone if needed)

 * 3: Perform Independently

In [90]:
def encode_expectation(response_string):
    """Map a survey response to an ordinal code 0-3, or -1 for a blank response."""
    # Already-encoded values pass through, so re-running a cell is safe
    if isinstance(response_string, (int, np.integer)):
        return response_string
    
    # Encode nan values as -1
    if not isinstance(response_string, str):
        if np.isnan(response_string):
            return -1
        raise ValueError('Expected performance response was not a string: %r' % (response_string,))
    
    # Encode string
    s = response_string.lower()
    if 'no expectation' in s:
        return 0
    # Check 'indirect supervision' first: it contains 'direct supervision' as a substring
    elif 'indirect supervision' in s:
        return 2
    elif ('with assistance' in s) or ('direct supervision' in s):
        return 1
    elif 'independently' in s:
        return 3
    else:
        print(response_string)
        raise ValueError('Expected performance response was not formatted as expected.')
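
The order of the substring checks matters, because "direct supervision" is itself a substring of "indirect supervision". A table-driven variant makes that constraint explicit and easy to test. This is an illustrative sketch only: `RULES` and `encode_label` are hypothetical names, and the test labels are the four response levels listed above.

```python
import math

# (substring, code) pairs, checked in order. 'indirect supervision' must come
# before 'direct supervision' because the former contains the latter.
RULES = [
    ("no expectation", 0),
    ("indirect supervision", 2),
    ("with assistance", 1),
    ("direct supervision", 1),
    ("independently", 3),
]

def encode_label(value):
    """Return the ordinal code for a response label, or -1 for a blank (NaN)."""
    if isinstance(value, float) and math.isnan(value):
        return -1
    text = str(value).lower()
    for needle, code in RULES:
        if needle in text:
            return code
    raise ValueError("Unrecognized response: %r" % (value,))

indirect = "Perform with Indirect Supervision (available in building or by phone if needed)"
direct = "Perform with Direct Supervision (present in room during procedure)"
codes = [encode_label(indirect), encode_label(direct)]
```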

In [91]:
# Filter dataframe to only companion animal respondents (may have responded to other species too)
ca_svm = svm[svm['Q1'].str.contains('Companion Animal (canine and/or feline)', na=False, regex=False)].copy()

In [92]:
# Filter dataframe to only companion animal respondents (may have responded to other species too)
ca_wvma = wvma[wvma['Q1'].str.contains('Companion Animal (canine and/or feline)', na=False, regex=False)].copy()

Below we see that of the 46 companion animal responses for svm.Q16_1, 6 were blank.

In [93]:
ca_svm.Q16_1.value_counts(dropna=False)

Perform Independently                  39
NaN                                     6
No Expectation to Perform Procedure     1
Name: Q16_1, dtype: int64

Let's summarize counts of responses.

In [94]:
# Tally data for this question group
NUM_QUESTIONS = 25
svm_counts = np.zeros((NUM_QUESTIONS, 5), dtype=int) # Row for each question, column for empty (-1), 0, 1, 2, and 3 responses
wvma_counts = np.zeros((NUM_QUESTIONS, 5), dtype=int) # Row for each question, column for empty (-1), 0, 1, 2, and 3 responses
rows = []

for i in range(1,26):
    qkey = "Q16_" + str(i)
    qstring = question_svm[qkey].split('-')[2] # third '-'-separated field; could equally use question_wvma

    # Encoding
    ca_svm[qkey] = ca_svm[qkey].apply(lambda x: encode_expectation(x))
    ca_wvma[qkey] = ca_wvma[qkey].apply(lambda x: encode_expectation(x))

    # SVM tally
    counts = ca_svm[qkey].value_counts(dropna=False)
    for key in counts.keys():
        svm_counts[i-1][key+1] += counts[key] # question index is 1-based; keys range from -1 to 3
    counts = svm_counts[i-1][1:] # counts of 0, 1, 2, and 3
    svm_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3]) / np.sum(counts)

    # WVMA tally
    counts = ca_wvma[qkey].value_counts(dropna=False)
    for key in counts.keys():
        wvma_counts[i-1][key+1] += counts[key]
    counts = wvma_counts[i-1][1:] # counts of 0, 1, 2, and 3
    wvma_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3]) / np.sum(counts)
      
    # Cache for table of results
    row = [qstring] + list(svm_counts[i-1]) + [svm_mean] + list(wvma_counts[i-1]) + [wvma_mean]
    rows.append(row)

In [95]:
# Assemble table of results
q16_table = pd.DataFrame(rows, columns=["Question", "SVM: empty", "SVM: 0", "SVM: 1", "SVM: 2", "SVM: 3", "SVM: avg", "WVMA: empty", "WVMA: 0", "WVMA: 1", "WVMA: 2", "WVMA: 3", "WVMA: avg"])
q16_table

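|   | Question | SVM: empty | SVM: 0 | SVM: 1 | SVM: 2 | SVM: 3 | SVM: avg | WVMA: empty | WVMA: 0 | WVMA: 1 | WVMA: 2 | WVMA: 3 | WVMA: avg |
| - | -------- | ---------- | ------ | ------ | ------ | ------ | -------- | ----------- | ------- | ------- | ------- | ------- | --------- |
| 0 | Obtain history and perform complete PE | 6 | 1 | 0 | 0 | 39 | 2.925 | 16 | 0 | 0 | 12 | 87 | 2.878788 |
| 1 | Perform ophthalmic exam | 6 | 2 | 3 | 8 | 27 | 2.5 | 15 | 1 | 19 | 29 | 51 | 2.3 |
| 2 | Perform otoscopic exam | 6 | 2 | 2 | 2 | 34 | 2.7 | 15 | 1 | 4 | 20 | 75 | 2.69 |
| 3 | Perform neurologic exam | 6 | 2 | 2 | 6 | 30 | 2.6 | 15 | 1 | 24 | 37 | 38 | 2.12 |
| 4 | Perform orthopedic exam | 6 | 2 | 2 | 6 | 30 | 2.6 | 15 | 2 | 23 | 36 | 39 | 2.12 |
| 5 | Develop problem list and rank order different... | 6 | 0 | 1 | 6 | 33 | 2.8 | 15 | 0 | 8 | 41 | 51 | 2.43 |
| 6 | Develop and interpret diagnostic plan | 6 | 1 | 1 | 6 | 32 | 2.725 | 16 | 0 | 18 | 57 | 24 | 2.060606 |
| 7 | Develop treatment plan | 6 | 2 | 2 | 8 | 28 | 2.55 | 16 | 1 | 20 | 56 | 22 | 2.0 |
| 8 | Calculate medication dosage | 6 | 2 | 1 | 1 | 36 | 2.775 | 16 | 1 | 5 | 15 | 78 | 2.717172 |
| 9 | Write prescription | 6 | 2 | 0 | 0 | 38 | 2.85 | 16 | 1 | 5 | 20 | 73 | 2.666667 |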
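
The `avg` columns are count-weighted means of the ordinal codes, with blank responses excluded: the sum of code times count over codes 0 to 3, divided by the number of non-blank responses. As a sketch, the same value can be obtained with `np.average`; `mean_expectation` is a hypothetical name, and the counts are those of the first row of the table.

```python
import numpy as np

LEVELS = np.array([0, 1, 2, 3])  # ordinal codes for the four response levels

def mean_expectation(counts):
    """Weighted mean of codes 0..3 given the number of responses at each code."""
    return float(np.average(LEVELS, weights=counts))

# "Obtain history and perform complete PE": counts of codes 0, 1, 2, 3
svm_pe = mean_expectation([1, 0, 0, 39])     # (3 * 39) / 40
wvma_pe = mean_expectation([0, 0, 12, 87])   # (2 * 12 + 3 * 87) / 99
```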


## Companion Animal Medical Q16 Subquestions

Let's apply the Kruskal-Wallis H-test to each subquestion of Q16, comparing the SVM and WVMA responses with blank responses excluded.

In [96]:
# Tally data for this question group
NUM_QUESTIONS = 25
ALPHA = 0.05
svm_counts = np.zeros((NUM_QUESTIONS, 5), dtype=int) # Row for each question, column for empty (-1), 0, 1, 2, and 3 responses
wvma_counts = np.zeros((NUM_QUESTIONS, 5), dtype=int) # Row for each question, column for empty (-1), 0, 1, 2, and 3 responses
rows = []
svm_pooled = []
wvma_pooled = []

for i in range(1,26):
    qkey = "Q16_" + str(i)
    qstring = question_svm[qkey].split('-')[2] # third '-'-separated field; could equally use question_wvma

    # Encoding
    ca_svm[qkey] = ca_svm[qkey].apply(lambda x: encode_expectation(x))
    ca_wvma[qkey] = ca_wvma[qkey].apply(lambda x: encode_expectation(x))

    # SVM tally
    counts = ca_svm[qkey].value_counts(dropna=False)
    for key in counts.keys():
        svm_counts[i-1][key+1] += counts[key] # question index is 1-based; keys range from -1 to 3
    counts = svm_counts[i-1][1:] # counts of 0, 1, 2, and 3
    svm_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3]) / np.sum(counts)

    # WVMA tally
    counts = ca_wvma[qkey].value_counts(dropna=False)
    for key in counts.keys():
        wvma_counts[i-1][key+1] += counts[key]
    counts = wvma_counts[i-1][1:] # counts of 0, 1, 2, and 3
    wvma_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3]) / np.sum(counts)
    
    # Get data
    svm_data = list(ca_svm[qkey])
    wvma_data = list(ca_wvma[qkey])

    # Remove empty values from data
    svm_data = [x for x in svm_data if x != -1]
    wvma_data = [x for x in wvma_data if x != -1]

    # compare samples
    stat, p = kruskal(svm_data, wvma_data)

    # Determine significance
    if p > ALPHA:
        sig = ""
    else:
        sig = "*"

    # Cache for pooled data
    svm_pooled.extend(svm_data)
    wvma_pooled.extend(wvma_data)

    # Cache for table of results
    row = [qstring] + list(svm_counts[i-1]) + [svm_mean] + list(wvma_counts[i-1]) + [wvma_mean, stat, p, sig]
    rows.append(row)

In [97]:
# Assemble table of results
q16_table = pd.DataFrame(rows, columns=["Subquestion", "SVM: empty", "SVM: 0", "SVM: 1", "SVM: 2", "SVM: 3", "SVM: avg", "WVMA: empty", "WVMA: 0", "WVMA: 1", "WVMA: 2", "WVMA: 3", "WVMA: avg", "stat", "pval", "sig"])
q16_table

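|   | Subquestion | SVM: empty | SVM: 0 | SVM: 1 | SVM: 2 | SVM: 3 | SVM: avg | WVMA: empty | WVMA: 0 | WVMA: 1 | WVMA: 2 | WVMA: 3 | WVMA: avg | stat | pval | sig |
| - | ----------- | ---------- | ------ | ------ | ------ | ------ | -------- | ----------- | ------- | ------- | ------- | ------- | --------- | ---- | ---- | --- |
| 0 | Obtain history and perform complete PE | 6 | 1 | 0 | 0 | 39 | 2.925 | 16 | 0 | 0 | 12 | 87 | 2.878788 | 2.894845 | 0.08886334 |  |
| 1 | Perform ophthalmic exam | 6 | 2 | 3 | 8 | 27 | 2.5 | 15 | 1 | 19 | 29 | 51 | 2.3 | 2.707331 | 0.09988798 |  |
| 2 | Perform otoscopic exam | 6 | 2 | 2 | 2 | 34 | 2.7 | 15 | 1 | 4 | 20 | 75 | 2.69 | 1.065396 | 0.3019877 |  |
| 3 | Perform neurologic exam | 6 | 2 | 2 | 6 | 30 | 2.6 | 15 | 1 | 24 | 37 | 38 | 2.12 | 12.903251 | 0.0003280119 | * |
| 4 | Perform orthopedic exam | 6 | 2 | 2 | 6 | 30 | 2.6 | 15 | 2 | 23 | 36 | 39 | 2.12 | 12.370351 | 0.0004362057 | * |
| 5 | Develop problem list and rank order different... | 6 | 0 | 1 | 6 | 33 | 2.8 | 15 | 0 | 8 | 41 | 51 | 2.43 | 11.447056 | 0.0007160736 | * |
| 6 | Develop and interpret diagnostic plan | 6 | 1 | 1 | 6 | 32 | 2.725 | 16 | 0 | 18 | 57 | 24 | 2.060606 | 30.764193 | 2.913649e-08 | * |
| 7 | Develop treatment plan | 6 | 2 | 2 | 8 | 28 | 2.55 | 16 | 1 | 20 | 56 | 22 | 2.0 | 20.534365 | 5.857023e-06 | * |
| 8 | Calculate medication dosage | 6 | 2 | 1 | 1 | 36 | 2.775 | 16 | 1 | 5 | 15 | 78 | 2.717172 | 1.899436 | 0.1681415 |  |
| 9 | Write prescription | 6 | 2 | 0 | 0 | 38 | 2.85 | 16 | 1 | 5 | 20 | 73 | 2.666667 | 6.959105 | 0.008339373 | * |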


These results indicate that 15 of the 25 procedures in this question have significantly different expectations between SVM and WVMA (p ≤ 0.05 per subquestion, with no correction for multiple comparisons). In all cases, the SVM expectations reflect a higher level of independent ability than WVMA expectations.
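
Because the test depends only on the ordinal values, a single row of the table can be checked from its tallied counts alone. The sketch below rebuilds the two samples for "Perform neurologic exam" and reruns the test; `expand_counts` is a hypothetical helper, and the result should agree with the statistic reported in that row (about 12.90).

```python
import numpy as np
from scipy.stats import kruskal

def expand_counts(counts):
    """Turn counts of codes 0..3 into a flat sample of ordinal responses."""
    return np.repeat(np.arange(len(counts)), counts)

# "Perform neurologic exam": counts of codes 0, 1, 2, 3 (blank responses excluded)
svm_neuro = expand_counts([2, 2, 6, 30])      # 40 SVM responses
wvma_neuro = expand_counts([1, 24, 37, 38])   # 100 WVMA responses

# kruskal applies the tie correction, which matters with only four distinct values
stat, p = kruskal(svm_neuro, wvma_neuro)
print("H=%.3f, p=%.2e" % (stat, p))
```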

In [98]:
stat, p = kruskal(svm_pooled, wvma_pooled)
diff_mean = np.mean(svm_pooled) - np.mean(wvma_pooled)
print('Pooled data: stat=%.3f, p=%.2e, diff_mean (svm-wvma)=%.3f' % (stat, p, diff_mean))

Pooled data: stat=149.733, p=1.98e-34, diff_mean (svm-wvma)=0.322


When responses are pooled across the set of subquestions in this group, we see that SVM respondents indicated a higher degree of independence compared to WVMA respondents overall.

## Reusable code for question analysis

We will refactor the above into reusable code.

In [99]:
def analyze_question(question_number, filtered_svm_df, filtered_wvma_df, n_subquestions, alpha=0.05, verbose=True):
    """
    Perform an analysis of a given question on a species-filtered dataframe.
    
    Inputs:
        question_number: main question number to analyze
        filtered_svm_df: SVM dataframe filtered to respondents with the desired species area
        filtered_wvma_df: WVMA dataframe filtered to respondents with the desired species area
        n_subquestions: number of subquestions in the main question
        alpha: significance level for the statistical test
        verbose: if True, print a one-line summary of the pooled results

    The subquestion columns of both dataframes are encoded in place.

    Outputs:
        table: summary table
        (pooled_stat, pooled_p, pooled_diff_mean): tuple of statistics describing output of Kruskal test on data pooled across subquestions
        svm_pooled: list of pooled SVM data
        wvma_pooled: list of pooled WVMA data
        sig_count: number of subquestions with significant difference detected (between SVM and WVMA responses), according to Kruskal test applied at subquestion level

    """

    svm_counts = np.zeros((n_subquestions, 5), dtype=int) # Row for each question, column for empty (-1), 0, 1, 2, and 3 responses
    wvma_counts = np.zeros((n_subquestions, 5), dtype=int) # Row for each question, column for empty (-1), 0, 1, 2, and 3 responses
    rows = []
    svm_pooled = []
    wvma_pooled = []
    sig_count = 0

    for i in range(1, n_subquestions+1):
        qkey = "Q" + str(question_number) + "_" + str(i)
        qstring = question_svm[qkey].split('-')[2] # third '-'-separated field; could equally use question_wvma

        # Encoding
        filtered_svm_df[qkey] = filtered_svm_df[qkey].apply(lambda x: encode_expectation(x))
        filtered_wvma_df[qkey] = filtered_wvma_df[qkey].apply(lambda x: encode_expectation(x))

        # SVM tally
        counts = filtered_svm_df[qkey].value_counts(dropna=False)
        for key in counts.keys():
            svm_counts[i-1][key+1] += counts[key] # question index is 1-based; keys range from -1 to 3
        counts = svm_counts[i-1][1:] # counts of 0, 1, 2, and 3
        svm_num_responses = np.sum(counts)
        svm_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3]) / svm_num_responses

        # WVMA tally
        counts = filtered_wvma_df[qkey].value_counts(dropna=False)
        for key in counts.keys():
            wvma_counts[i-1][key+1] += counts[key]
        counts = wvma_counts[i-1][1:] # counts of 0, 1, 2, and 3
        wvma_num_responses = np.sum(counts)
        wvma_mean = (0*counts[0] + 1*counts[1] + 2*counts[2] + 3*counts[3]) / wvma_num_responses
        
        # Get data
        svm_data = list(filtered_svm_df[qkey])
        wvma_data = list(filtered_wvma_df[qkey])

        # Remove empty values from data
        svm_data = [x for x in svm_data if x != -1]
        wvma_data = [x for x in wvma_data if x != -1]

        assert(svm_num_responses == len(svm_data))
        assert(wvma_num_responses == len(wvma_data))

        # compare samples
        stat, p = kruskal(svm_data, wvma_data)

        # Determine significance
        if p > alpha:
            sig = ""
        else:
            sig = "*"
            sig_count += 1

        # Cache for pooled data
        svm_pooled.extend(svm_data)
        wvma_pooled.extend(wvma_data)

        # Cache for table of results
        row = [qstring] + list(svm_counts[i-1]) + [svm_mean, svm_num_responses] + list(wvma_counts[i-1]) + [wvma_mean, wvma_num_responses, stat, p, sig]
        rows.append(row)

    # Assemble table of results
    table = pd.DataFrame(rows, columns=["Subquestion", "SVM: empty", "SVM: 0", "SVM: 1", "SVM: 2", "SVM: 3", "SVM: avg", "SVM: num responses", "WVMA: empty", "WVMA: 0", "WVMA: 1", "WVMA: 2", "WVMA: 3", "WVMA: avg", "WVMA: num responses", "stat", "pval", "sig"])

    # Apply Kruskal test to pooled data
    pooled_stat, pooled_p = kruskal(svm_pooled, wvma_pooled)
    pooled_diff_mean = np.mean(svm_pooled) - np.mean(wvma_pooled)

    # Print
    if verbose == True:
        print('Pooled Q%s: stat=%.3f, p=%.2e, diff_mean (svm-wvma)=%.3f, sig_subq=%s/%s' % (question_number, pooled_stat, pooled_p, pooled_diff_mean, sig_count, n_subquestions))

    return table, (pooled_stat, pooled_p, pooled_diff_mean), svm_pooled, wvma_pooled, sig_count


The above code prints a question-level summary and returns a table of details for particular sub-questions.

In [100]:
table, subq_pooled_result, svm_data, wvma_data, sig_count = analyze_question(16, ca_svm, ca_wvma, n_subquestions=25)

Pooled Q16: stat=149.733, p=1.98e-34, diff_mean (svm-wvma)=0.322, sig_subq=15/25


In [101]:
table

Unnamed: 0,Subquestion,SVM: empty,SVM: 0,SVM: 1,SVM: 2,SVM: 3,SVM: avg,SVM: num responses,WVMA: empty,WVMA: 0,WVMA: 1,WVMA: 2,WVMA: 3,WVMA: avg,WVMA: num responses,stat,pval,sig
0,Obtain history and perform complete PE,6,1,0,0,39,2.925,40,16,0,0,12,87,2.878788,99,2.894845,0.08886334,
1,Perform ophthalmic exam,6,2,3,8,27,2.5,40,15,1,19,29,51,2.3,100,2.707331,0.09988798,
2,Perform otoscopic exam,6,2,2,2,34,2.7,40,15,1,4,20,75,2.69,100,1.065396,0.3019877,
3,Perform neurologic exam,6,2,2,6,30,2.6,40,15,1,24,37,38,2.12,100,12.903251,0.0003280119,*
4,Perform orthopedic exam,6,2,2,6,30,2.6,40,15,2,23,36,39,2.12,100,12.370351,0.0004362057,*
5,Develop problem list and rank order different...,6,0,1,6,33,2.8,40,15,0,8,41,51,2.43,100,11.447056,0.0007160736,*
6,Develop and interpret diagnostic plan,6,1,1,6,32,2.725,40,16,0,18,57,24,2.060606,99,30.764193,2.913649e-08,*
7,Develop treatment plan,6,2,2,8,28,2.55,40,16,1,20,56,22,2.0,99,20.534365,5.857023e-06,*
8,Calculate medication dosage,6,2,1,1,36,2.775,40,16,1,5,15,78,2.717172,99,1.899436,0.1681415,
9,Write prescription,6,2,0,0,38,2.85,40,16,1,5,20,73,2.666667,99,6.959105,0.008339373,*
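
To make the `stat` and `pval` columns less opaque, the sketch below recomputes them for the "Perform neurologic exam" row from the response tallies alone. It is a standard-library re-derivation of the tie-corrected Kruskal-Wallis statistic for two groups, not the notebook's method (the notebook calls `scipy.stats.kruskal`), and the function name is hypothetical.

```python
import math

def kruskal_two_groups_from_counts(counts_a, counts_b):
    """Tie-corrected Kruskal-Wallis H and p-value for two groups of ordinal tallies.

    counts_a[k] and counts_b[k] are the numbers of responses equal to k.
    """
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    rank_sum_a = rank_sum_b = 0.0
    tie_term = 0
    first_rank = 1  # rank of the first observation in the current tied block
    for c_a, c_b in zip(counts_a, counts_b):
        t = c_a + c_b  # all responses with the same value are tied
        if t == 0:
            continue
        midrank = first_rank + (t - 1) / 2
        rank_sum_a += c_a * midrank
        rank_sum_b += c_b * midrank
        tie_term += t**3 - t
        first_rank += t
    h = 12 / (n * (n + 1)) * (rank_sum_a**2 / n_a + rank_sum_b**2 / n_b) - 3 * (n + 1)
    h /= 1 - tie_term / (n**3 - n)  # tie correction
    h = max(h, 0.0)  # guard against tiny negative values from rounding
    # With two groups, H is compared to a chi-square distribution with 1 degree
    # of freedom, whose survival function equals erfc(sqrt(h / 2)).
    p = math.erfc(math.sqrt(h / 2))
    return h, p

# Tallies of responses 0, 1, 2, 3 for "Perform neurologic exam"
svm_counts = [2, 2, 6, 30]
wvma_counts = [1, 24, 37, 38]
h, p = kruskal_two_groups_from_counts(svm_counts, wvma_counts)
print(round(h, 3), f"{p:.2e}")  # 12.903 3.28e-04
```

The result agrees with the `stat` and `pval` values reported for that row. The tie correction matters here: with only four possible response values, almost every observation is tied with many others.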


## Companion Animal Group

Recall that the companion animal group contains a set of questions. Now let's use the modularized question analysis code to perform a similar analysis across all questions in the companion animal group.

In [102]:
# cache data across all groups
group_data = []
group_columns = ["Group", "Pooled stat", "Pooled p", "Pooled diff_mean (svm-wvma)", "Num questions", "Fraction of sig questions", "Pooled num SVM reponses", "Pooled num WVMA responses"]

In [103]:
# Input info about question group

question_list = [16,17,7,8,9,10,11,12]
n_subq_list = [25,10,25,8,4,12,13,3]
question_strings = ['Medical Procedures',
                    'Preventive Medicine/Population Health Procedures',
                    'Surgical Procedures', 
                    'Anesthetic Procedures', 
                    'Reproductive Procedures',
                    'Diagnostic Imaging Procedures',
                    'Clinical Pathology Procedures',
                    'Diagnostic Necropsy Procedures']

assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [104]:
# Code to analyze all questions within the group

def analyze_group(question_list, n_subq_list, question_strings, filtered_svm_df, filtered_wvma_df):
    svm_pooled = [] # now pooling over entire group
    wvma_pooled = []
    rows = []
    sig_count = 0

    for i in range(len(question_list)):
        question_number = question_list[i]
        n_subquestions = n_subq_list[i]
        question_string = question_strings[i]

        # Run analysis
        table, subq_pooled_result, svm_data, wvma_data, sig_subq = analyze_question(question_number, filtered_svm_df, filtered_wvma_df, n_subquestions, verbose=False)
        pooled_stat, pooled_p, pooled_diff_mean = subq_pooled_result
        svm_num_responses = len(svm_data)
        wvma_num_responses = len(wvma_data)

        # Pool
        svm_pooled.extend(svm_data)
        wvma_pooled.extend(wvma_data)

        # Determine significance
        if pooled_p > ALPHA:
            sig = ""
        else:
            sig = "*"
            sig_count += 1

        # Cache data for group summary
        row = ['Q'+str(question_number), question_string, pooled_stat, pooled_p, sig, pooled_diff_mean, n_subquestions, sig_subq/n_subquestions, svm_num_responses, wvma_num_responses]
        rows.append(row)

    # Assemble table of results
    group_table = pd.DataFrame(rows, columns=["Question number", "Category", "Pooled stat", "Pooled p", "Sig", "Pooled Diff Mean (svm-wvma)", "Num subquestions", "Fraction of sig subquestions", "Pooled num SVM responses", "Pooled num WVMA responses"])                     

    # Apply Kruskal test to pooled data
    pooled_stat, pooled_p = kruskal(svm_pooled, wvma_pooled)
    pooled_diff_mean = np.mean(svm_pooled) - np.mean(wvma_pooled)

    # Print (at the group level, the sig_subq label counts significant questions, not subquestions)
    print('Group result (all questions): stat=%.3f, p=%.2e, diff_mean (svm-wvma)=%.3f, sig_subq=%s/%s' % (pooled_stat, pooled_p, pooled_diff_mean, sig_count, len(question_list)))

    return group_table, (pooled_stat, pooled_p, pooled_diff_mean), sig_count, len(question_list), len(svm_pooled), len(wvma_pooled)

In [105]:
group_table, pooled_q_stats, sig_count, n_questions, svm_responses, wvma_responses  = analyze_group(question_list, n_subq_list, question_strings, ca_svm, ca_wvma)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Companion Animal", pooled_stat, pooled_p, pooled_diff_mean, n_questions, sig_count/n_questions, svm_responses, wvma_responses])
group_table

Group result (all questions): stat=144.344, p=2.99e-33, diff_mean (svm-wvma)=0.214, sig_subq=6/8


Unnamed: 0,Question number,Category,Pooled stat,Pooled p,Sig,Pooled Diff Mean (svm-wvma),Num subquestions,Fraction of sig subquestions,Pooled num SVM responses,Pooled num WVMA responses
0,Q16,Medical Procedures,149.732998,1.982989e-34,*,0.321893,25,0.6,999,2457
1,Q17,Preventive Medicine/Population Health Procedures,22.35616,2.264854e-06,*,0.23416,10,0.4,390,930
2,Q7,Surgical Procedures,7.854769,0.005068684,*,0.127968,25,0.2,973,2316
3,Q8,Anesthetic Procedures,41.203413,1.371828e-10,*,0.259941,8,0.875,312,719
4,Q9,Reproductive Procedures,0.24295,0.6220839,,0.032123,4,0.0,156,356
5,Q10,Diagnostic Imaging Procedures,13.354983,0.0002577369,*,0.236343,12,0.25,468,1042
6,Q11,Clinical Pathology Procedures,18.174154,2.015962e-05,*,0.197094,13,0.384615,506,1125
7,Q12,Diagnostic Necropsy Procedures,2.938878,0.08647078,,0.198939,3,0.333333,117,261
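
Because `analyze_group` extends its pooled lists with each question's responses, the group-level response counts should equal the sums of the per-question counts. A quick sanity check of that bookkeeping, with the numbers copied from the table above:

```python
# (SVM responses, WVMA responses) per question in the companion animal group
per_question = {
    "Q16": (999, 2457),
    "Q17": (390, 930),
    "Q7": (973, 2316),
    "Q8": (312, 719),
    "Q9": (156, 356),
    "Q10": (468, 1042),
    "Q11": (506, 1125),
    "Q12": (117, 261),
}

svm_total = sum(svm for svm, _ in per_question.values())
wvma_total = sum(wvma for _, wvma in per_question.values())
print(svm_total, wvma_total)  # 3921 9206
```

These totals match the pooled response counts reported for Companion Animal in the group summary table.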


## Special Species Group

In [106]:
# Filter dataframes to only special species respondents (may have responded to other species too)
ss_svm = svm[svm['Q1'].str.contains('Special Species', na=False, regex=False)].copy()
ss_wvma = wvma[wvma['Q1'].str.contains('Special Species', na=False, regex=False)].copy()
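
The filter keeps any respondent whose `Q1` answer mentions the species group, so someone who selected several groups is counted in each of them, and `na=False` drops respondents with no `Q1` answer. A plain-Python analogue of that substring test is sketched below. The `Q1` values are made up, and the separator and exact answer strings in the real survey export are assumptions.

```python
def mentions(answer, label):
    """Rough single-cell analogue of Series.str.contains(label, na=False, regex=False)."""
    return isinstance(answer, str) and label in answer

# Hypothetical Q1 answers; None stands in for a missing response
q1_answers = ["Companion Animal,Special Species", "Equine", None, "Special Species"]

special_species_rows = [i for i, a in enumerate(q1_answers) if mentions(a, "Special Species")]
companion_rows = [i for i, a in enumerate(q1_answers) if mentions(a, "Companion Animal")]
print(special_species_rows)  # [0, 3]
print(companion_rows)        # [0]
```

Respondent 0 lands in both groups, which is why the per-group samples are not disjoint.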

In [107]:
# Input info about question group

question_list = [43, 44, 45, 46, 48, 49, 50]
n_subq_list = [20, 9, 11, 8, 6, 13, 3]
question_strings = ['Medical Procedures',
                    'Preventive Medicine/Population Health Procedures',
                    'Surgical Procedures', 
                    'Anesthetic Procedures', 
                    'Diagnostic Imaging Procedures',
                    'Clinical Pathology Procedures',
                    'Diagnostic Necropsy Procedures']

assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [108]:
group_table, pooled_q_stats, sig_count, n_questions, svm_responses, wvma_responses  = analyze_group(question_list, n_subq_list, question_strings, ss_svm, ss_wvma)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Special Species", pooled_stat, pooled_p, pooled_diff_mean, n_questions, sig_count/n_questions, svm_responses, wvma_responses])
group_table

Group result (all questions): stat=143.035, p=5.77e-33, diff_mean (svm-wvma)=0.583, sig_subq=5/7


Unnamed: 0,Question number,Category,Pooled stat,Pooled p,Sig,Pooled Diff Mean (svm-wvma),Num subquestions,Fraction of sig subquestions,Pooled num SVM responses,Pooled num WVMA responses
0,Q43,Medical Procedures,78.31943,8.76554e-19,*,0.708574,20,0.4,259,340
1,Q44,Preventive Medicine/Population Health Procedures,2.046954,0.152511,,0.175949,9,0.0,99,148
2,Q45,Surgical Procedures,22.585485,2.009975e-06,*,0.555283,11,0.272727,121,185
3,Q46,Anesthetic Procedures,15.026705,0.0001060005,*,0.518466,8,0.125,88,128
4,Q48,Diagnostic Imaging Procedures,15.97588,6.415466e-05,*,0.72822,6,0.5,66,96
5,Q49,Clinical Pathology Procedures,29.618442,5.260209e-08,*,0.624126,13,0.307692,143,208
6,Q50,Diagnostic Necropsy Procedures,1.597247,0.2062937,,0.284091,3,0.0,33,48


## Food Animal Group

In [109]:
# Filter dataframes to only food animal respondents (may have responded to other species too)
fa_svm = svm[svm['Q1'].str.contains('Food Animal', na=False, regex=False)].copy()
fa_wvma = wvma[wvma['Q1'].str.contains('Food Animal', na=False, regex=False)].copy()

In [110]:
# Input info about question group

question_list = [20, 18, 25, 24, 21, 19, 23, 22, 27]
n_subq_list = [8, 27, 16, 10, 20, 11, 12, 3, 5]
question_strings = ['Handling and Husbandry Procedures',
                    'Medical Procedures',
                    'Surgical Procedures',
                    'Anesthetic Procedures',
                    'Preventive Medicine/Population Health Procedures',
                    'Reproductive Procedures',
                    'Clinical Pathology Procedures',
                    'Diagnostic Necropsy Procedures',
                    'Diagnostic Imaging Procedures']


assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [111]:
group_table, pooled_q_stats, sig_count, n_questions, svm_responses, wvma_responses  = analyze_group(question_list, n_subq_list, question_strings, fa_svm, fa_wvma)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Food Animal", pooled_stat, pooled_p, pooled_diff_mean, n_questions, sig_count/n_questions, svm_responses, wvma_responses])
group_table

Group result (all questions): stat=15.002, p=1.07e-04, diff_mean (svm-wvma)=0.096, sig_subq=3/9


Unnamed: 0,Question number,Category,Pooled stat,Pooled p,Sig,Pooled Diff Mean (svm-wvma),Num subquestions,Fraction of sig subquestions,Pooled num SVM responses,Pooled num WVMA responses
0,Q20,Handling and Husbandry Procedures,3.085446,0.07899561,,-0.179954,8,0.0,104,295
1,Q18,Medical Procedures,23.197716,1.461709e-06,*,0.240667,27,0.185185,325,997
2,Q25,Surgical Procedures,0.206057,0.6498759,,-0.042471,16,0.0,193,574
3,Q24,Anesthetic Procedures,2.621206,0.1054442,,0.154832,10,0.1,119,360
4,Q21,Preventive Medicine/Population Health Procedures,4.684126,0.03044255,*,0.119487,20,0.05,240,718
5,Q19,Reproductive Procedures,1.588406,0.207554,,-0.122944,11,0.0,132,385
6,Q23,Clinical Pathology Procedures,28.663647,8.610502e-08,*,0.288001,12,0.333333,143,418
7,Q22,Diagnostic Necropsy Procedures,1.614574,0.20385,,-0.247619,3,0.0,35,105
8,Q27,Diagnostic Imaging Procedures,3.090183,0.07876597,,0.284483,5,0.0,60,174


## Equine Group

In [112]:
# Filter dataframes to only equine respondents (may have responded to other species too)
eq_svm = svm[svm['Q1'].str.contains('Equine', na=False, regex=False)].copy()
eq_wvma = wvma[wvma['Q1'].str.contains('Equine', na=False, regex=False)].copy()

In [113]:
# Input info about question group

question_list = [28, 29, 30, 31, 32, 33, 34, 35, 36]
n_subq_list = [7, 24, 8, 8, 15, 9, 11, 3, 5]
question_strings = ['Handling and Husbandry Procedures',
                    'Medical Procedures',
                    'Surgical Procedures',
                    'Anesthetic Procedures',
                    'Preventive Medicine/Population Health Procedures',
                    'Reproductive Procedures',
                    'Clinical Pathology Procedures',
                    'Diagnostic Necropsy Procedures',
                    'Diagnostic Imaging Procedures']


assert(len(question_list) == len(n_subq_list))
assert(len(n_subq_list) == len(question_strings))

In [114]:
group_table, pooled_q_stats, sig_count, n_questions, svm_responses, wvma_responses  = analyze_group(question_list, n_subq_list, question_strings, eq_svm, eq_wvma)
pooled_stat, pooled_p, pooled_diff_mean = pooled_q_stats
group_data.append(["Equine", pooled_stat, pooled_p, pooled_diff_mean, n_questions, sig_count/n_questions, svm_responses, wvma_responses])
group_table

Group result (all questions): stat=26.113, p=3.22e-07, diff_mean (svm-wvma)=0.175, sig_subq=3/9


Unnamed: 0,Question number,Category,Pooled stat,Pooled p,Sig,Pooled Diff Mean (svm-wvma),Num subquestions,Fraction of sig subquestions,Pooled num SVM responses,Pooled num WVMA responses
0,Q28,Handling and Husbandry Procedures,0.014463,0.9042765,,0.013506,7,0.0,77,175
1,Q29,Medical Procedures,22.69447,1.899136e-06,*,0.286032,24,0.125,219,599
2,Q30,Surgical Procedures,0.593181,0.4411915,,0.108333,8,0.0,72,200
3,Q31,Anesthetic Procedures,9.782052,0.001762236,*,0.387778,8,0.125,72,200
4,Q32,Preventive Medicine/Population Health Procedures,0.158782,0.6902808,,-0.066667,15,0.0,135,360
5,Q33,Reproductive Procedures,0.42826,0.5128439,,0.092593,9,0.0,81,216
6,Q34,Clinical Pathology Procedures,27.36949,1.68062e-07,*,0.41869,11,0.090909,98,253
7,Q35,Diagnostic Necropsy Procedures,0.093098,0.7602756,,-0.067538,3,0.0,27,68
8,Q36,Diagnostic Imaging Procedures,0.479226,0.4887731,,0.140541,5,0.0,45,111


## Group Summary

In [115]:
group_summary_table = pd.DataFrame(group_data, columns=group_columns)

In [116]:
pvals = list(group_summary_table['Pooled p'])

sigs = []
for p in pvals:
  if p > ALPHA:
      sig = ""
  else:
      sig = "*"
  sigs.append(sig)

group_summary_table.insert(loc=3, column='Sig', value=sigs)

In [117]:
group_summary_table

Unnamed: 0,Group,Pooled stat,Pooled p,Sig,Pooled diff_mean (svm-wvma),Num questions,Fraction of sig questions,Pooled num SVM reponses,Pooled num WVMA responses
0,Companion Animal,144.344322,2.987518e-33,*,0.213987,8,0.75,3921,9206
1,Special Species,143.035475,5.773968e-33,*,0.582878,7,0.714286,809,1153
2,Food Animal,15.002487,0.0001073696,*,0.09633,9,0.333333,1351,4026
3,Equine,26.113365,3.219479e-07,*,0.174795,9,0.333333,826,2182


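Looking at pooled comparisons at the group level, the SVM and WVMA expectations differ significantly in all four groups. The differences are largest for companion animal and special species, where roughly three quarters of the questions are individually significant and the pooled mean differences are 0.214 and 0.583. For food animal and equine, the pooled differences are still significant but smaller (0.096 and 0.175), and only a third of the questions are individually significant.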

As we saw, there are significant differences at the individual question and subquestion levels within all groups.
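
The `Sig` flags in the group summary follow directly from comparing each pooled p-value with `ALPHA`. The sketch below re-derives them from the p-values in the table, assuming `ALPHA = 0.05`; the notebook defines `ALPHA` elsewhere, so the value used here is an assumption.

```python
ALPHA = 0.05  # assumed significance level; the notebook sets ALPHA in an earlier cell

# Pooled p-values copied from the group summary table
group_p = {
    "Companion Animal": 2.987518e-33,
    "Special Species": 5.773968e-33,
    "Food Animal": 0.0001073696,
    "Equine": 3.219479e-07,
}

# Same rule as the notebook: flag a group unless its p-value exceeds ALPHA
sig = {group: "" if p > ALPHA else "*" for group, p in group_p.items()}
for group, flag in sig.items():
    print(f"{group}: {flag}")
```

Under this assumed threshold every group is flagged, matching the `Sig` column.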