# Hacker Within survey analysis

In [1]:
import numpy as np
import pandas as pd

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Import custom data analysis module
from analyze_data import process_data, category_weights, get_categories

In [4]:
# Get data
df = pd.read_csv('thw_survey.csv')
df

Unnamed: 0,key_tools-groups-yes,key_tools-groups-maybe,key_tools-ranks-yes-bash,key_tools-ranks-yes-version_control_git,key_tools-ranks-yes-choosing_text_editor,key_tools-ranks-maybe-bash,key_tools-ranks-maybe-version_control_git,key_tools-ranks-maybe-choosing_text_editor,languages-groups-yes,languages-groups-maybe,...,tools-ranks-maybe-docker,tools-ranks-maybe-comparing_vc_software,tools-ranks-maybe-node.js,tools-ranks-maybe-MEAN_software_stack,tools-ranks-maybe-django,category-key_tools,category-languages,category-practices,category-libraries,category-tools
0,The Bash Shell,Choosing a text editor,1.0,,,,,1.0,"Matlab,R","Java,Javascript,C,Python",...,,,,,,4.0,3.0,1.0,5.0,2.0
1,"Version Control / Git,Choosing a text editor",The Bash Shell,,1.0,2.0,1.0,,,"Python,Javascript,R,Java,C","Matlab,Julia (new language for numerical compu...",...,,,3.0,,,2.0,5.0,1.0,4.0,3.0
2,Version Control / Git,Choosing a text editor,,1.0,,,,1.0,R,Python,...,,,,,,5.0,1.0,2.0,4.0,3.0
3,"The Bash Shell,Version Control / Git,Choosing ...",,1.0,2.0,3.0,,,,"R,Python,Javascript",,...,,1.0,,,,3.0,1.0,5.0,2.0,4.0
4,Choosing a text editor,"Version Control / Git,The Bash Shell",,,1.0,2.0,1.0,,Matlab,"R,Python,C,Java",...,,,,,,4.0,5.0,1.0,3.0,2.0
5,Version Control / Git,The Bash Shell,,1.0,,1.0,,,"Julia (new language for numerical computing),P...",Javascript,...,3.0,1.0,,,,5.0,1.0,4.0,3.0,2.0
6,The Bash Shell,,1.0,,,,,,"Python,Javascript,R",,...,,,,,,4.0,2.0,1.0,5.0,3.0
7,"Choosing a text editor,Version Control / Git",,,2.0,1.0,,,,"R,Python,Javascript,Matlab",,...,,,,,,5.0,1.0,2.0,4.0,3.0
8,"Version Control / Git,Choosing a text editor,T...",,3.0,1.0,2.0,,,,"R,Python,Julia (new language for numerical com...",,...,,,,,,3.0,1.0,2.0,4.0,5.0
9,"Version Control / Git,The Bash Shell",Choosing a text editor,2.0,1.0,,,,1.0,"Python,R,Matlab,Javascript",Julia (new language for numerical computing),...,,1.0,3.0,2.0,,2.0,4.0,1.0,5.0,3.0


These data aren't very easy to summarize.  Y'all may remember that the questions were in five categories:

In [5]:
get_categories(df)

['category-key_tools',
 'category-languages',
 'category-practices',
 'category-libraries',
 'category-tools']

Each category has a set of options.  Here are the options for the `key_tools` category:

In [6]:
[col_name.split('-')[-1] for col_name in df.columns if col_name.startswith('key_tools-ranks-yes')]

['bash', 'version_control_git', 'choosing_text_editor']

For each category, you selected some options as "Yes" and some options as "Maybe".  The survey records these responses in two ways: as "groups" (a comma-separated list of options) and as "ranks" (the rank for each option).  It records each of these twice, once for the "Yes" options, and once for the "Maybe" options.
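As a rough illustration of the "groups" format, the sketch below splits one cell into its option labels.  `parse_group` is a hypothetical helper, not part of `analyze_data.py`, and it assumes that no option label itself contains a comma.

```python
import math


def parse_group(cell):
    """Split a comma-separated "groups" cell into a list of option labels.

    Assumption: labels never contain a comma.  A missing cell (None or NaN)
    is taken to mean that the respondent put nothing in this set.
    """
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    return [label.strip() for label in cell.split(",")]


# A "Yes" group copied from the second row of the table above.
yes_group = parse_group("Version Control / Git,Choosing a text editor")
# An empty "Maybe" group, as pandas would read it from the CSV file.
maybe_group = parse_group(float("nan"))
print(yes_group)
print(maybe_group)
```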

Say there are `n` options in a category (`n == 3` for category `key_tools`).  There will be 2 "groups" columns for this category (one for "Yes", one for "Maybe").  There will be `n` "ranks" columns for "Yes" and `n` "ranks" columns for "Maybe".  That gives 2 + 2 * `n` columns in total, for each category.
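To make the count concrete, here is a minimal sketch that builds the expected column names for one category.  The `category_columns` helper is hypothetical; the naming pattern is copied from the column names in the data frame above.

```python
def category_columns(category, options):
    """Return the expected column names for one category (hypothetical helper)."""
    levels = ("yes", "maybe")
    # Two "groups" columns: one for "Yes", one for "Maybe".
    cols = [f"{category}-groups-{level}" for level in levels]
    # n "ranks" columns for "Yes", then n "ranks" columns for "Maybe".
    for level in levels:
        cols += [f"{category}-ranks-{level}-{option}" for option in options]
    return cols


key_tools_options = ["bash", "version_control_git", "choosing_text_editor"]
cols = category_columns("key_tools", key_tools_options)
n = len(key_tools_options)
print(len(cols), 2 + 2 * n)
```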

In [7]:
key_tools_columns = [col_name for col_name in df.columns if col_name.startswith('key_tools')]
key_tools_columns

['key_tools-groups-yes',
 'key_tools-groups-maybe',
 'key_tools-ranks-yes-bash',
 'key_tools-ranks-yes-version_control_git',
 'key_tools-ranks-yes-choosing_text_editor',
 'key_tools-ranks-maybe-bash',
 'key_tools-ranks-maybe-version_control_git',
 'key_tools-ranks-maybe-choosing_text_editor']

The ranks columns give the rank of each option within the "Yes" or "Maybe" set.  Here are the column names for the ranks of the "Yes" set in the `key_tools` category:

In [8]:
kt_y_rank_cols = [col_name for col_name in df.columns if col_name.startswith('key_tools-ranks-yes')]
kt_y_rank_cols

['key_tools-ranks-yes-bash',
 'key_tools-ranks-yes-version_control_git',
 'key_tools-ranks-yes-choosing_text_editor']

Here are "key tools" ranks, for the "Yes" set, for the first respondent:

In [9]:
df.loc[0, kt_y_rank_cols]

key_tools-ranks-yes-bash                      1
key_tools-ranks-yes-version_control_git     NaN
key_tools-ranks-yes-choosing_text_editor    NaN
Name: 0, dtype: object

The first respondent only put "The Bash Shell" in the "Yes" set for this category.  Hence there are no ranks for the other two options (the respondent didn't select them).
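A small sketch of how to read such a row: a missing rank (NaN) means the option was not in the set.  The `row` dictionary below is typed in by hand from the output above rather than read from the CSV file, and `selected_ranks` is a hypothetical helper.

```python
import math


def selected_ranks(row):
    """Map option name -> rank, keeping only the options that have a rank."""
    return {
        col_name.split("-")[-1]: rank
        for col_name, rank in row.items()
        if not math.isnan(rank)
    }


# "Yes" ranks for the first respondent, copied from the output above.
row = {
    "key_tools-ranks-yes-bash": 1.0,
    "key_tools-ranks-yes-version_control_git": float("nan"),
    "key_tools-ranks-yes-choosing_text_editor": float("nan"),
}
first_yes = selected_ranks(row)
print(first_yes)
```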

Finally, the last five columns of the survey data record the rank you gave each category, from 1 to 5.

In [10]:
df.columns[-5:]

Index(['category-key_tools', 'category-languages', 'category-practices',
       'category-libraries', 'category-tools'],
      dtype='object')

These data are a bit difficult to summarize, but hey, we're hackers, how hard can it be?

The analysis code is in a file called `analyze_data.py`, in the same directory as this Notebook.

See that code for details; it generates:

* `cat_rank_mean`: the mean rank of each category (key tools, languages, etc.).
* `cat_scores`: the composite score for each option, within category.  See below for what the composite scores are.
* `w_cat_scores`: the composite scores, weighted by the rank that each respondent gave to the matching category.  A rank of 1 corresponds to a weight of 1, and a rank of 5 (out of 5 categories) corresponds to a weight of 1/5.
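The two weights quoted above fit a linear rescaling, weight = (`n` + 1 - rank) / `n`.  That formula is an assumption; it matches the two end points given here, but check `analyze_data.py` for the actual calculation.  `rank_to_weight` is a hypothetical name.

```python
def rank_to_weight(rank, n):
    """Assumed linear rescaling: rank 1 -> 1, rank n -> 1/n."""
    return (n + 1 - rank) / n


# Weights for the five category ranks, under the assumed formula.
n_categories = 5
weights = {rank: rank_to_weight(rank, n_categories)
           for rank in range(1, n_categories + 1)}
print(weights)
```

Under this assumption, a respondent who ranked a category third out of five would contribute 3/5 of their composite scores for that category to the weighted total.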

The composite scores are my attempt to combine the "Yes" and the "Maybe" ranks.  First we rescale the ranks as above, so a rank of 1 for an option corresponds to a rescaled score of 1, and a rank of `n` (where `n` is the number of options in this category) gives a score of 1/`n`.  Next we add one if this is a rank in the "Yes" set, so a "Yes" rank of 1 becomes 2 and a "Yes" rank of `n` becomes 1 + 1/`n`.  We then add up the scores for the "Yes" and the "Maybe" ranks across respondents.  So, if there was only one respondent, giving a rank of 1 to "bash shell" in "Yes" and a rank of 1 to "Choosing a text editor" in "Maybe", then the composite score is 1 + 1 == 2 for "bash shell", and 1 for "Choosing a text editor".
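The sketch below reproduces the single-respondent example from the paragraph above.  It is not the code in `analyze_data.py`: `rescale` and `composite_scores` are hypothetical names, the input layout (one dictionary per respondent) is made up for illustration, and the linear rescaling (`n` + 1 - rank) / `n` is an assumption that agrees with the two end points described.

```python
def rescale(rank, n):
    """Assumed linear rescaling: rank 1 -> 1, rank n -> 1/n."""
    return (n + 1 - rank) / n


def composite_scores(respondents, options):
    """Sum rescaled ranks over respondents; "Yes" ranks get a bonus of 1."""
    n = len(options)
    totals = {option: 0.0 for option in options}
    for response in respondents:
        for option, rank in response.get("yes", {}).items():
            totals[option] += 1 + rescale(rank, n)
        for option, rank in response.get("maybe", {}).items():
            totals[option] += rescale(rank, n)
    return totals


options = ["bash", "version_control_git", "choosing_text_editor"]
# One respondent: "bash" ranked 1 in "Yes", text editor ranked 1 in "Maybe".
respondents = [{"yes": {"bash": 1}, "maybe": {"choosing_text_editor": 1}}]
scores = composite_scores(respondents, options)
print(scores)
```

With this one respondent, "bash" scores 2, "choosing_text_editor" scores 1, and the unselected option scores 0.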

In [11]:
# Process data
cat_rank_mean, cat_scores, w_cat_scores = process_data(df)

In [12]:
# Some stuff for formatting output
# Use the full option name; the short form 'precision' is ambiguous in recent pandas.
pd.set_option('display.precision', 2)
from IPython.display import display, Markdown

def printmd(string):
    display(Markdown(string))


def disp_dict_row(d):
    display(pd.DataFrame(d, index=['']))

In [13]:
# Show the composite and weighted composite scores for each category.
for category in sorted(cat_scores):
    printmd('## Composite score for ' + category)
    disp_dict_row(cat_scores[category])
    printmd('## Weighted composite score for ' + category)
    disp_dict_row(w_cat_scores[category])
# Show the mean rank for each category.
printmd('## Category mean rank')
disp_dict_row(cat_rank_mean)

## Composite score for key_tools

Unnamed: 0,bash,choosing_text_editor,version_control_git
,30.67,26.67,42.67


## Weighted composite score for key_tools

Unnamed: 0,bash,choosing_text_editor,version_control_git
,17.4,13.8,23.93


## Composite score for languages

Unnamed: 0,c,fortran,java,javascript,julia,matlab,python,r
,13.12,10.12,7.62,21.75,14.75,18.0,41.12,41.0


## Weighted composite score for languages

Unnamed: 0,c,fortran,java,javascript,julia,matlab,python,r
,7.45,4.5,4.2,13.85,9.25,7.68,26.38,24.78


## Composite score for libraries

Unnamed: 0,data.table,electron,numpy,pandas,react.js,statistics
,30.83,12.17,32.67,34.33,15.33,26.17


## Weighted composite score for libraries

Unnamed: 0,data.table,electron,numpy,pandas,react.js,statistics
,13.3,8.03,14.4,16.5,7.53,10.8


## Composite score for practices

Unnamed: 0,api_driven_development,bear_batch,continuous_integration,data_modes,fast_databases,gpu,object_oriented,programming_paradigms,reproducible,testing,visualization
,17.36,18.0,32.55,22.73,13.36,18.73,11.64,27.18,30.73,33.27,39.0


## Weighted composite score for practices

Unnamed: 0,api_driven_development,bear_batch,continuous_integration,data_modes,fast_databases,gpu,object_oriented,programming_paradigms,reproducible,testing,visualization
,13.67,15.18,29.07,17.15,10.4,16.93,9.91,21.13,23.33,28.24,29.62


## Composite score for tools

Unnamed: 0,MEAN_software_stack,comparing_vc_software,django,docker,emacs,node.js,pandoc,vim
,10.12,29.12,20.25,26.62,15.5,14.12,16.25,25.38


## Weighted composite score for tools

Unnamed: 0,MEAN_software_stack,comparing_vc_software,django,docker,emacs,node.js,pandoc,vim
,6.15,15.4,12.12,16.33,8.93,8.68,9.65,15.97


## Category mean rank

Unnamed: 0,key_tools,languages,libraries,practices,tools
,3.33,2.89,3.78,1.89,3.11


Rank all options in all categories on weighted score:

In [14]:
# Merge all the weighted option scores into a single dictionary.
all_weighted = {}
for key, sub_dict in w_cat_scores.items():
    prefix = key + '-'
    for sub_key, value in sub_dict.items():
        all_weighted[prefix + sub_key] = value
# Turn dictionary into sequence of (name, value) pairs, and sort descending by value.
all_vals = sorted(all_weighted.items(), key=lambda item: item[1], reverse=True)
# Display as a single column data frame.
names, values = zip(*all_vals)
display(pd.DataFrame(np.array(values), columns=['weighted score'], index=names))

Unnamed: 0,weighted score
practices-visualization,29.62
practices-continuous_integration,29.07
practices-testing,28.24
languages-python,26.38
languages-r,24.78
key_tools-version_control_git,23.93
practices-reproducible,23.33
practices-programming_paradigms,21.13
key_tools-bash,17.4
practices-data_modes,17.15
