# Visualizing How Developers Rate Their Own Programming Skills

by Max Woolf (@minimaxir)

*This notebook is licensed under the MIT License. If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)*

**Note: the code in this notebook is uglier than usual because unexpected shenanigans were required to get it running. I may add more comments in the future.**

In [1]:
options(warn=1)

source("Rstart.R")

library(plotly)
library(htmlwidgets)
library(tidyr)

sessionInfo()

n_bootstrap_resamples <- 10000


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Registering fonts with R

Attaching package: ‘scales’

The following objects are masked from ‘package:readr’:

    col_factor, col_numeric


Attaching package: ‘plotly’

The following object is masked _by_ ‘.GlobalEnv’:

    subplot

The following object is masked from ‘package:ggplot2’:

    last_plot

The following object is masked from ‘package:graphics’:

    layout



R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] tidyr_0.4.1        htmlwidgets_0.6    plotly_3.6.0       stringr_1.0.0     
 [5] digest_0.6.9       RColorBrewer_1.1-2 scales_0.4.0       extrafont_0.17    
 [9] ggplot2_2.1.0      dplyr_0.4.3        readr_0.2.2       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      Rttf2pt1_1.3.3   magrittr_1.5     munsell_0.4.3   
 [5] uuid_0.1-2       colorspace_1.2-6 R6_2.1.2         httr_1.1.0      
 [9] plyr_1.8.3       tools_3.3.0      parallel_3.3.0   gtable_0.2.0    
[13] DBI_0.4          extrafontdb_1.0  htmltools_0.3.5  assertthat_0.1  
[17] gridExtra_2.2.1  IRdisplay_0.3    repr_0.4         viridis_0.3.4   
[21] base64enc_0.1-3  IRkernel_0.5    

In [47]:
file_path <- "/Users/maxwoolf/Downloads/2016 Stack Overflow Survey Results/2016 Stack Overflow Survey Responses.csv"

df <- read_csv(file_path) %>% filter(programming_ability != "", programming_ability != " ")

df %>% head() %>% print()
df %>% nrow() %>% print()

Source: local data frame [6 x 66]

     NA collector     country    un_subregion      so_region
  (int)     (chr)       (chr)           (chr)          (chr)
1  4637  Facebook Afghanistan   Southern Asia   Central Asia
2 21378  Facebook Afghanistan   Southern Asia   Central Asia
3 31743  Facebook Afghanistan   Southern Asia   Central Asia
4 51301  Facebook Afghanistan   Southern Asia   Central Asia
5 24487  Facebook     Albania Southern Europe Eastern Europe
6 28844  Facebook     Albania Southern Europe Eastern Europe
Variables not shown: age_range (chr), age_midpoint (dbl), gender (chr),
  self_identification (chr), occupation (chr), occupation_group (chr),
  experience_range (chr), experience_midpoint (dbl), salary_range (chr),
  salary_midpoint (dbl), big_mac_index (dbl), tech_do (chr), tech_want (chr),
  aliens (chr), programming_ability (dbl), employment_status (chr), industry
  (chr), company_size_range (chr), team_size_range (chr), women_on_team (chr),
  remote (chr), job_satisfa

## Summary Column Chart for Distribution

In [3]:
df_summary <- df %>% group_by(programming_ability) %>% summarize(count=n())
df_summary %>% print(n=100)

programming_ability_avg = mean(df$programming_ability)
print(programming_ability_avg)

Source: local data frame [10 x 2]

   programming_ability count
                 (dbl) (int)
1                    1   263
2                    2   450
3                    3  1195
4                    4  1955
5                    5  3873
6                    6  6355
7                    7 12901
8                    8 11287
9                    9  4380
10                  10  4323
[1] 7.094547


In [4]:
plot <- ggplot(mapping=aes(x=factor(programming_ability), y=count), data=df_summary) +
            geom_bar(stat="identity") +
            fte_theme() +
            scale_y_continuous(labels=comma, breaks=pretty_breaks(6)) +
            #theme(axis.title.y=element_blank()) +
            labs(title="Distribution of Self-Ratings of Programming Ability By Developers", x="How would you rate your programming ability? (1-10)")

max_save(plot, "so-programming-0", "Stack Overflow", h=2.5)

![](so-programming-0.png)

## Gender

**Important Note**: The gender charts are not included in the final post because I am concerned that they exhibit [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox): per the official analysis, the distribution of other attributes for female developers is much different from that for male developers.

I am leaving the charts here for posterity while I drill down subgroups, but **do not repost them elsewhere as fact**.
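
To make the concern concrete, here is a minimal sketch of Simpson's Paradox using made-up numbers. The `toy` data frame, its experience bands, and every value in it are hypothetical and are not taken from the survey. Within each band the "Female" average is higher, yet the pooled average is lower because the two groups are spread differently across the bands.

```r
# Hypothetical toy data (NOT from the survey): average self-rating by gender
# within two experience bands, plus the number of respondents in each cell.
toy <- data.frame(
  gender     = c("Female", "Female", "Male", "Male"),
  experience = c("< 2 years", "11+ years", "< 2 years", "11+ years"),
  n          = c(80, 20, 20, 80),
  avg_rating = c(5.5, 8.5, 5.0, 8.0),
  stringsAsFactors = FALSE
)

# Gap within each experience band (Female minus Male); rows line up by band.
female <- toy[toy$gender == "Female", ]
male   <- toy[toy$gender == "Male", ]
within_band_gap <- female$avg_rating - male$avg_rating
names(within_band_gap) <- female$experience
print(within_band_gap)

# Pooled average per gender, weighting each band by its respondent count.
pooled <- sapply(split(toy, toy$gender),
                 function(g) weighted.mean(g$avg_rating, g$n))
print(pooled)
```

In this toy case the direction of the gap flips once the groups are pooled, which is why a gender-only chart could mislead if experience (or another attribute) is distributed unevenly.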

In [5]:
programming_ability <- df %>% filter(gender == "Male" | gender == "Female") %>%
                group_by(gender) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

programming_ability %>% print(n=100)

Source: local data frame [2 x 5]

  gender count med_prog avg_prog     sd_mean
   (chr) (int)    (dbl)    (dbl)       (dbl)
1 Female  2631        7 6.312429 0.039927559
2   Male 43434        7 7.146061 0.008285955


Build the resampling step: get the average programming ability rating from a resampled data frame, grouped by a specified variable.

In [6]:
resample_means <- function(df, group_var) {
    df_new <- df %>% sample_frac(replace=T)
    
    summary <- df_new %>%
                select(group = matches(group_var), programming_ability) %>%
                group_by(group) %>%
                summarize(avg_prog=mean(programming_ability))
    
    return(summary)
}

set.seed(42)
print(resample_means(df %>% filter(gender == "Male" | gender == "Female"), "gender"))

Source: local data frame [2 x 2]

   group avg_prog
   (chr)    (dbl)
1 Female 6.291619
2   Male 7.146242


Automate the resampling for *n* resamples, and store the results of all the resamples in a data frame.

In [7]:
resample_df <- function(df, group_var, n) {
    # Number of distinct groups; each resample contributes this many rows.
    num_levels <- df %>% select(matches(group_var)) %>% unlist() %>% unique() %>% length()

    # Preallocate the result: n resamples x num_levels rows.
    df_resample_summary <- data.frame(group = character(n*num_levels), avg_prog = numeric(n*num_levels), stringsAsFactors = FALSE)

    # Fill one block of num_levels rows per resample.
    for (i in seq(1, n*num_levels - 1, by = num_levels)) {
        df_resample_summary[c(i:(i + num_levels - 1)),] <- resample_means(df, group_var)
    }

    return(tbl_df(df_resample_summary))
}

set.seed(4)
print(resample_df(df %>% filter(gender == "Male" | gender == "Female"), "gender", 4))

Source: local data frame [8 x 2]

   group avg_prog
   (chr)    (dbl)
1 Female 6.329767
2   Male 7.155634
3 Female 6.373745
4   Male 7.151466
5 Female 6.266843
6   Male 7.148815
7 Female 6.310424
8   Male 7.152603
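
The loop in `resample_df` writes each resample into a preallocated block of rows. As an alternative sketch (not the approach used in this notebook), each resample can be collected in a list and bound together once at the end. The helper `one_resample()` and the `values`/`groups` vectors below are hypothetical stand-ins built from random numbers, using base R only.

```r
# Hypothetical stand-in for resample_means(): group means from one resample.
one_resample <- function(values, groups) {
  idx <- sample(seq_along(values), replace = TRUE)   # resample row indices
  means <- tapply(values[idx], groups[idx], mean)    # mean rating per group
  data.frame(group = names(means), avg_prog = as.numeric(means),
             stringsAsFactors = FALSE)
}

set.seed(4)
# Made-up ratings and group labels (not survey data).
values <- sample(1:10, 200, replace = TRUE)
groups <- rep(c("Female", "Male"), times = c(50, 150))

# Collect each resample in a list, then bind the pieces once.
n <- 4
resamples <- lapply(seq_len(n), function(i) one_resample(values, groups))
result <- do.call(rbind, resamples)
print(result)
```

This avoids the block-index arithmetic, at the cost of holding the intermediate list in memory.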


In [8]:
system.time( df_boot <- resample_df(df %>% filter(gender == "Male" | gender == "Female"), "gender", 100))

print(head(df_boot))
print(nrow(df_boot)) # expect 100 * 2

   user  system elapsed 
 10.818   1.286  12.183 

Source: local data frame [6 x 2]

   group avg_prog
   (chr)    (dbl)
1 Female 6.347000
2   Male 7.148403
3 Female 6.298564
4   Male 7.145236
5 Female 6.354876
6   Male 7.148548
[1] 200


Further aggregate the data frame of all the averages to get the 2.5% and 97.5% quantiles, which form a 95% confidence interval for each group.
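
Stripped of the grouping logic, the percentile bootstrap can be sketched in a few lines of base R. The `ratings` vector here is randomly generated for illustration and is not survey data, and the resample count of 2000 is an arbitrary choice for the sketch.

```r
set.seed(42)
# Hypothetical ratings on the 1-10 scale (not survey data).
ratings <- sample(1:10, size = 500, replace = TRUE)

# Each resample draws length(ratings) values with replacement
# and records the mean of that resample.
boot_means <- replicate(2000, mean(sample(ratings, replace = TRUE)))

# The 2.5th and 97.5th percentiles of the resampled means
# bound a 95% percentile interval for the mean.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(mean(ratings))
print(ci)
```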

In [9]:
bootstrap_df <- function(df, group_var, n) {
    df_boot <- resample_df(df, group_var, n)
    
    print(df_boot %>% head())
    
    boot <- df_boot %>%
            select(group, avg_prog) %>%
            group_by(group) %>%
            summarize(avg_prog_avg=mean(avg_prog),
                        avg_prog_low_ci = quantile(avg_prog, 0.025),
                        avg_prog_high_ci = quantile(avg_prog, 0.975))
    
    names(boot)[1] <- group_var
    
    return(boot)
}

gender_boot <- bootstrap_df(df %>% filter(gender == "Male" | gender == "Female"), "gender", 
n_bootstrap_resamples)
gender_boot %>% print()

Source: local data frame [6 x 2]

   group avg_prog
   (chr)    (dbl)
1 Female 6.332717
2   Male 7.135599
3 Female 6.325000
4   Male 7.143120
5 Female 6.327767
6   Male 7.148011
Source: local data frame [2 x 4]

  gender avg_prog_avg avg_prog_low_ci avg_prog_high_ci
   (chr)        (dbl)           (dbl)            (dbl)
1 Female     6.312411        6.233961         6.389937
2   Male     7.146060        7.129527         7.162202
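
As a rough cross-check on the bootstrap interval above, the `sd_mean` column computed earlier can be turned into a normal-approximation interval (mean plus or minus about 1.96 standard errors). The numbers below are copied from the printed outputs in this notebook; the normal approximation itself is an assumption of this sketch, not something the original analysis relies on.

```r
# Values copied from the summary and bootstrap outputs printed above.
avg_prog <- c(Female = 6.312429, Male = 7.146061)
sd_mean  <- c(Female = 0.039927559, Male = 0.008285955)
boot_ci  <- rbind(Female = c(low = 6.233961, high = 6.389937),
                  Male   = c(low = 7.129527, high = 7.162202))

# Normal-approximation 95% interval: mean +/- z * standard error.
z <- qnorm(0.975)
normal_ci <- cbind(low = avg_prog - z * sd_mean,
                   high = avg_prog + z * sd_mean)

print(round(normal_ci, 4))
print(round(normal_ci - boot_ci, 4))  # differences between the two methods
```

With samples this large, the two intervals should be close, which suggests the bootstrap is behaving sensibly.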


In [10]:
color_m <- "#2980b9"
color_f <- "#27ae60"

df_norm <- df %>% filter(gender == "Male" | gender == "Female") %>%
                group_by(gender, programming_ability) %>%
                summarize(count=n()) %>%
                group_by(gender) %>%
                mutate(norm = count/sum(count))

df_norm %>% print()

plot <- ggplot(df_norm, aes(x=programming_ability, y=norm, fill=gender, color=gender)) +
            geom_area(alpha=0.5, stat="identity", position="identity") +
            fte_theme() +
            scale_x_continuous(breaks=c(1:10)) +
            #geom_point(mapping=aes(x=avg_prog_avg, y=0.01, fill=gender), data=gender_boot, show.legend=F, size=0.7, shape=21, color="white", stroke=0.1) +
            #geom_errorbarh(mapping=aes(x=avg_prog_avg, xmin=avg_prog_low_ci, xmax=avg_prog_high_ci, y=0.01), data=gender_boot, show.legend=F, color="black", height=0.01, size=0.05) +
            theme(legend.title = element_blank(), legend.position="top", legend.direction="horizontal", legend.key.width=unit(0.5, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(0,"cm"), axis.title.y=element_blank(), axis.text.y=element_blank()) +
            scale_fill_manual(labels=c("Female", "Male"), values=c(color_f,color_m)) +
            scale_color_manual(labels=c("Female", "Male"), values=c(color_f,color_m)) +
            labs(title="Normalized Distribution of Self-Assessed Programming Ability", x="How would you rate your programming ability? (1-10)")

max_save(plot, "so-programming-1", "Stack Overflow")

Source: local data frame [20 x 4]
Groups: gender [2]

   gender programming_ability count        norm
    (chr)               (dbl) (int)       (dbl)
1  Female                   1    55 0.020904599
2  Female                   2    80 0.030406689
3  Female                   3   146 0.055492208
4  Female                   4   194 0.073736222
5  Female                   5   330 0.125427594
6  Female                   6   443 0.168377043
7  Female                   7   654 0.248574686
8  Female                   8   437 0.166096541
9  Female                   9   123 0.046750285
10 Female                  10   169 0.064234132
11   Male                   1   184 0.004236313
12   Male                   2   354 0.008150297
13   Male                   3  1016 0.023391813
14   Male                   4  1720 0.039600313
15   Male                   5  3440 0.079200626
16   Male                   6  5801 0.133558963
17   Male                   7 12040 0.277202192
18   Male                   8 1066

![](so-programming-1.png)

In [46]:
gender_boot_reorder <- gender_boot %>% left_join(programming_ability) %>% mutate(gender = factor(gender))

plot <- ggplot(mapping=aes(x=factor(gender), y=programming_ability), data=df %>% filter(gender == "Male" | gender == "Female")) +
            geom_violin(aes(fill=gender, color=gender), alpha=0.25, size=0.10, scale="width") +
            geom_errorbar(data=gender_boot_reorder, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.1) +
            geom_point(data=gender_boot_reorder, mapping=aes(x=gender, y=avg_prog, color=gender), stat="identity", position="identity", shape=21, fill="white") +
            fte_theme() +
            coord_flip() +
            scale_fill_manual(labels=c("Female", "Male"), values=c(color_f,color_m)) +
            scale_color_manual(labels=c("Female", "Male"), values=c(color_f,color_m)) +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(axis.title.y=element_blank(), plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating Programming Ability By Developers, by Gender", y="How would you rate your programming ability? (1-10)")

max_save(plot, "so-programming-1-1", "Stack Overflow", h=2.5)

Joining by: "gender"


![](so-programming-1-1.png)

## Geographic Region

I did not include these charts in the final post since they are not insightful, particularly since the confidence intervals are wide.

In [12]:
df_country <- df %>% filter(country != "") %>%
                group_by(country) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count)) %>%
                filter(count >= 30) %>%
                arrange(desc(avg_prog))

df_country %>% print(n=100)

Source: local data frame [76 x 5]

                country count med_prog avg_prog    sd_mean
                  (chr) (int)    (dbl)    (dbl)      (dbl)
1                Israel   378        8 7.859788 0.08197254
2               Lebanon    33        8 7.666667 0.32176910
3                Mexico   393        8 7.575064 0.08216461
4                Brazil   906        8 7.451435 0.05670407
5             Venezuela    40        7 7.450000 0.28408648
6          South Africa   421        8 7.425178 0.08116660
7              Slovenia   149        8 7.375839 0.12615385
8             Macedonia    33        7 7.363636 0.38143018
9                 Spain   896        7 7.341518 0.04763322
10            Australia   953        7 7.338930 0.05478994
11 United Arab Emirates    55        7 7.327273 0.20930307
12        United States 11755        7 7.324883 0.01644331
13          Netherlands   985        7 7.324873 0.04714236
14              Austria   400        7 7.315000 0.08415547
15            Argenti

Get the confidence interval for the average of each group. Disallowed values (here, countries with fewer than 30 responses) must be removed first.
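
The filtering idea can be sketched on its own in base R: count responses per group, keep the group names that meet the threshold, then subset with `%in%`. The `country` vector below is made up for illustration.

```r
# Hypothetical group labels (not survey data): "C" has too few responses.
country <- c(rep("A", 40), rep("B", 35), rep("C", 5))

# Count responses per group and keep names with at least 30 responses.
counts <- table(country)
valid  <- names(counts)[counts >= 30]

# Subset the original values to the allowed groups only.
kept <- country[country %in% valid]
print(valid)
print(table(kept))
```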

In [13]:
valid_countries <- df_country %>% select(country) %>% unlist() %>% as.character()

country_boot <- bootstrap_df(df %>% filter(country %in% valid_countries), "country", n_bootstrap_resamples)
country_boot %>% arrange(desc(avg_prog_avg)) %>% print(n=100)

Source: local data frame [6 x 2]

       group avg_prog
       (chr)    (dbl)
1  Argentina 7.457045
2  Australia 7.307773
3    Austria 7.306667
4 Bangladesh 5.875969
5    Belarus 6.963768
6    Belgium 6.903720
Source: local data frame [76 x 4]

                country avg_prog_avg avg_prog_low_ci avg_prog_high_ci
                  (chr)        (dbl)           (dbl)            (dbl)
1                Israel     7.860319        7.696628         8.020942
2               Lebanon     7.661322        7.000000         8.243243
3                Mexico     7.576121        7.417715         7.734118
4                Brazil     7.451549        7.340203         7.562914
5             Venezuela     7.450125        6.891892         8.000000
6          South Africa     7.425297        7.263390         7.583965
7              Slovenia     7.376679        7.124182         7.615952
8             Macedonia     7.360346        6.606061         8.083333
9                 Spain     7.342208        7.249428   

We have two dataframes: the original dataframe with all the data, and the bootstrap dataframe which has the per-group quantiles.

In the case of country data:

* Reorder the factor levels of the grouping variable in the bootstrap dataframe into the correct order.
* Reorder the factor levels of the grouping variable in the original dataframe *to the same order* (the original dataframe may have to be filtered first).
* Add the plot layers in the correct order for proper layering: `violin, errorbar, point`
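
The first two steps can be sketched with base R factors. The `boot` and `raw` data frames below are hypothetical miniatures, not the survey data. The levels are reversed because, with `coord_flip()`, the first factor level is drawn at the bottom of the chart.

```r
# Hypothetical bootstrap summary (not survey data), sorted by descending mean.
boot <- data.frame(
  country      = c("A", "B", "C"),
  avg_prog_avg = c(7.9, 7.5, 7.1),
  stringsAsFactors = FALSE
)

# Reverse the levels so the highest average ends up at the top after flipping.
boot$country <- factor(boot$country, levels = rev(boot$country))

# Hypothetical raw responses; give the grouping variable the SAME level order.
raw <- data.frame(
  country             = c("C", "A", "B", "A"),
  programming_ability = c(7, 8, 7, 9),
  stringsAsFactors = FALSE
)
raw$country <- factor(raw$country, levels = levels(boot$country))

print(levels(boot$country))
print(levels(raw$country))
```

If the two level orders differ, the violins and the point/error-bar layers can end up on mismatched rows.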

In [41]:
country_boot_reorder <- country_boot %>% left_join(df_country) %>% arrange(desc(avg_prog_avg)) %>%
                            mutate(country = factor(country, levels=rev(country)))

df_filter <- df %>% filter(country %in% valid_countries) %>% mutate(country = factor(country))

df_top <- country_boot_reorder %>% head(12)
top_countries <- df_top %>% select(country) %>% unlist() %>% as.character()
df_bottom <- country_boot_reorder %>% tail(12)
bottom_countries <- df_bottom %>% select(country) %>% unlist() %>% as.character()

plot <- ggplot(mapping=aes(x=factor(country, levels=rev(top_countries)), y=programming_ability), data=df %>% filter(country %in% top_countries)) +
            geom_violin(alpha=0.05, size=0.10, fill="#1a1a1a", color="#1a1a1a", scale="width") +
            geom_errorbar(data=df_top, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=df_top, mapping=aes(x=country, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="blue") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(axis.title.y=element_blank(), plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating Programming Ability By Developers in Country", y="How would you rate your programming ability? (1-10)")

max_save(plot, "so-programming-2", "Stack Overflow")

plot <- ggplot(mapping=aes(x=factor(country, levels=rev(bottom_countries)), y=programming_ability), data=df %>% filter(country %in% bottom_countries)) +
            geom_violin(alpha=0.05, size=0.10, fill="#1a1a1a", color="#1a1a1a", scale="width") +
            geom_errorbar(data=df_bottom, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=df_bottom, mapping=aes(x=country, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="red") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(axis.title.y=element_blank(), plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating Programming Ability By Developers in Country", y="How would you rate your programming ability? (1-10)")

max_save(plot, "so-programming-3", "Stack Overflow")

Joining by: "country"


![](so-programming-2.png)
![](so-programming-3.png)

## Age Range

In [15]:
df_age <- df %>% filter(age_range != "", age_range!="Prefer not to disclose") %>%
                group_by(age_range) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_age %>% print(n=100)

Source: local data frame [8 x 5]

  age_range count med_prog avg_prog    sd_mean
      (chr) (int)    (dbl)    (dbl)      (dbl)
1      < 20  3174        6 6.076560 0.03486541
2      > 60   332        8 7.734940 0.12836221
3     20-24 10790        7 6.549398 0.01644985
4     25-29 13318        7 6.977399 0.01404713
5     30-34  8531        8 7.430547 0.01720699
6     35-39  4876        8 7.668991 0.02361295
7     40-49  4196        8 7.884414 0.02526847
8     50-59  1415        8 8.153357 0.04670094


In [16]:
valid_ages <- df_age %>% select(age_range) %>% unlist() %>% as.character()

age_boot <- bootstrap_df(df %>% filter(age_range %in% valid_ages), "age_range", n_bootstrap_resamples)
age_boot %>% arrange(desc(avg_prog_avg)) %>% print(n=100)

Source: local data frame [6 x 2]

  group avg_prog
  (chr)    (dbl)
1  < 20 6.043355
2  > 60 7.886076
3 20-24 6.533184
4 25-29 6.975621
5 30-34 7.438327
6 35-39 7.673682
Source: local data frame [8 x 4]

  age_range avg_prog_avg avg_prog_low_ci avg_prog_high_ci
      (chr)        (dbl)           (dbl)            (dbl)
1     50-59     8.152636        8.058154         8.244958
2     40-49     7.884271        7.834394         7.933429
3      > 60     7.735443        7.481473         7.981256
4     35-39     7.668712        7.622759         7.714139
5     30-34     7.430833        7.397121         7.464401
6     25-29     6.977620        6.950400         7.004567
7     20-24     6.549520        6.517351         6.580982
8      < 20     6.076723        6.008668         6.145712


In [40]:
age_levels <- c("> 60", "50-59", "40-49", "35-39", "30-34", "25-29", "20-24", "< 20")

age_boot_reorder <- age_boot %>% left_join(df_age) %>% mutate(age_range = factor(age_range, levels=age_levels))


plot <- ggplot(mapping=aes(x=factor(age_range, levels=age_levels), y=programming_ability), data=df %>% filter(age_range %in% age_levels)) +
            geom_violin(alpha=0.25, size=0.10, fill="#999999", color="#999999", scale="width") +
            geom_errorbar(data=age_boot_reorder, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=age_boot_reorder, mapping=aes(x=age_range, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="#27ae60") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating of Programming Ability By Developers, by Age", y="How would you rate your programming ability? (1-10)", x="Age Range of Developer")

max_save(plot, "so-programming-4", "Stack Overflow")

Joining by: "age_range"


![](so-programming-4.png)

## Experience

In [18]:
df_experience <- df %>% filter(experience_range != "") %>%
                group_by(experience_range) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_experience %>% print(n=100)

Source: local data frame [5 x 5]

  experience_range count med_prog avg_prog    sd_mean
             (chr) (int)    (dbl)    (dbl)      (dbl)
1      1 - 2 years  5809        6 5.929764 0.02252971
2        11+ years 12347        8 8.129586 0.01309627
3      2 - 5 years 15035        7 6.778118 0.01216742
4     6 - 10 years 10836        8 7.498431 0.01335262
5 Less than 1 year  2695        5 5.023006 0.04046226


In [19]:
valid_experience <- df_experience %>% select(experience_range) %>% unlist() %>% as.character()

experience_boot <- bootstrap_df(df %>% filter(experience_range %in% valid_experience), "experience_range", n_bootstrap_resamples)
experience_boot %>% arrange(desc(avg_prog_avg)) %>% print(n=100)

Source: local data frame [6 x 2]

             group avg_prog
             (chr)    (dbl)
1      1 - 2 years 5.941187
2        11+ years 8.134357
3      2 - 5 years 6.765884
4     6 - 10 years 7.503650
5 Less than 1 year 5.061329
6      1 - 2 years 5.959089
Source: local data frame [5 x 4]

  experience_range avg_prog_avg avg_prog_low_ci avg_prog_high_ci
             (chr)        (dbl)           (dbl)            (dbl)
1        11+ years     8.129711        8.104345         8.155698
2     6 - 10 years     7.498603        7.472770         7.524772
3      2 - 5 years     6.778213        6.754138         6.801968
4      1 - 2 years     5.929826        5.886444         5.973908
5 Less than 1 year     5.023449        4.944462         5.102119


In [39]:
experience_levels <- c("11+ years", "6 - 10 years", "2 - 5 years","1 - 2 years","Less than 1 year")

experience_boot_reorder <- experience_boot %>% left_join(df_experience) %>% mutate(experience_range = factor(experience_range, levels=experience_levels))
experience_boot_reorder %>% select(experience_range, avg_prog_avg, avg_prog) %>% print(n=100)

plot <- ggplot(mapping=aes(x=factor(experience_range, levels=experience_levels), y=programming_ability), data=df %>% filter(experience_range %in% experience_levels)) +
            geom_violin(alpha=0.25, size=0.10, fill="#999999", color="#999999", scale="width") +
            geom_errorbar(data=experience_boot_reorder, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=experience_boot_reorder, mapping=aes(x=experience_range, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="#16a085") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating of Programming Ability By Developers, by Experience", y="How would you rate your programming ability? (1-10)", x="Experience Range of Developer")

max_save(plot, "so-programming-5", "Stack Overflow")

Joining by: "experience_range"


Source: local data frame [5 x 3]

  experience_range avg_prog_avg avg_prog
            (fctr)        (dbl)    (dbl)
1      1 - 2 years     5.929826 5.929764
2        11+ years     8.129711 8.129586
3      2 - 5 years     6.778213 6.778118
4     6 - 10 years     7.498603 7.498431
5 Less than 1 year     5.023449 5.023006


![](so-programming-5.png)

## Salary Range

In [21]:
df_salary <- df %>% filter(salary_range != "", salary_range!="Other (please specify)",salary_range!="Rather not say",country=="United States") %>%
                group_by(salary_range) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_salary %>% print(n=100)

Source: local data frame [22 x 5]

          salary_range count med_prog avg_prog    sd_mean
                 (chr) (int)    (dbl)    (dbl)      (dbl)
1    $10,000 - $20,000   212        7 6.339623 0.14006386
2  $100,000 - $110,000   953        8 7.844701 0.04625403
3  $110,000 - $120,000   712        8 7.953652 0.05546706
4  $120,000 - $130,000   618        8 7.972492 0.05902680
5  $130,000 - $140,000   426        8 8.176056 0.06687880
6  $140,000 - $150,000   367        8 8.147139 0.07167507
7  $150,000 - $160,000   327        8 8.226300 0.08191198
8  $160,000 - $170,000   192        8 7.973958 0.12166917
9  $170,000 - $180,000   141        8 8.333333 0.10636609
10 $180,000 - $190,000    91        8 8.516484 0.13565977
11 $190,000 - $200,000    83        9 8.506024 0.13965818
12   $20,000 - $30,000   231        7 6.398268 0.12952586
13   $30,000 - $40,000   261        7 6.482759 0.11186417
14   $40,000 - $50,000   396        7 6.398990 0.08681671
15   $50,000 - $60,000   636        7

In [22]:
valid_salary <- df_salary %>% select(salary_range) %>% unlist() %>% as.character()

salary_boot <- bootstrap_df(df %>% filter(salary_range %in% valid_salary,country=="United States"), "salary_range", n_bootstrap_resamples)
salary_boot %>% arrange(desc(avg_prog_avg)) %>% print(n=100)

Source: local data frame [6 x 2]

                group avg_prog
                (chr)    (dbl)
1   $10,000 - $20,000 6.231884
2 $100,000 - $110,000 7.805527
3 $110,000 - $120,000 7.930473
4 $120,000 - $130,000 7.857923
5 $130,000 - $140,000 8.177616
6 $140,000 - $150,000 8.080645
Source: local data frame [22 x 4]

          salary_range avg_prog_avg avg_prog_low_ci avg_prog_high_ci
                 (chr)        (dbl)           (dbl)            (dbl)
1   More than $200,000     8.631682        8.477919         8.781867
2  $180,000 - $190,000     8.515714        8.243886         8.777807
3  $190,000 - $200,000     8.504258        8.227273         8.777778
4  $170,000 - $180,000     8.334688        8.125981         8.542263
5  $150,000 - $160,000     8.227022        8.064602         8.389385
6  $130,000 - $140,000     8.175449        8.041662         8.304450
7  $140,000 - $150,000     8.146759        8.007552         8.287236
8  $160,000 - $170,000     7.973996        7.732240         8.

In [38]:
salary_levels <- c("Unemployed","Less than $10,000","$10,000 - $20,000","$20,000 - $30,000","$30,000 - $40,000",
                  "$40,000 - $50,000","$50,000 - $60,000","$60,000 - $70,000","$70,000 - $80,000",
                  "$80,000 - $90,000", "$90,000 - $100,000","$100,000 - $110,000","$110,000 - $120,000",
                  "$120,000 - $130,000","$130,000 - $140,000","$140,000 - $150,000","$150,000 - $160,000",
                  "$160,000 - $170,000","$170,000 - $180,000","$180,000 - $190,000","$190,000 - $200,000",
                  "More than $200,000")

salary_boot_reorder <- salary_boot %>% left_join(df_salary) %>% mutate(salary_range = factor(salary_range, levels=rev(salary_levels)))
salary_boot_reorder %>% select(salary_range, avg_prog_avg, avg_prog) %>% print(n=100)


#plot <- ggplot(salary_boot_reorder, aes(x=salary_range, y=avg_prog)) +
#            #geom_violin(mapping=aes(x=factor(salary_range, levels=rev(salary_levels)), y=programming_ability), data=df_salary, alpha=0.2) +
#            geom_errorbar(mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci), show.legend=F, color="black", size=0.25, width=0.25) +
#            geom_point(stat="identity", position="identity", shape=21, fill="white", color="#2980b9") +
#            fte_theme() +
#            coord_flip() +
#            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
#            theme(plot.title=element_text(hjust=1, size=7)) +
#            labs(title="Avg. Self-Rating of Programming Ability By U.S. Developers, by Salary", y="How would you rate your programming ability? (1-10)", x="Annual Gross Earnings or Salary (including bonus) in USD")


plot <- ggplot(mapping=aes(x=factor(salary_range, levels=rev(salary_levels)), y=programming_ability), data=df %>% filter(salary_range != "", salary_range!="Other (please specify)",salary_range!="Rather not say",country=="United States")) +
            geom_violin(alpha=0.25, size=0.10, fill="#999999", color="#999999", scale="width") +
            geom_errorbar(data=salary_boot_reorder, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=salary_boot_reorder, mapping=aes(x=salary_range, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="#2980b9") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(plot.title=element_text(hjust=1, size=7)) +
            labs(title="Avg. Self-Rating of Programming Ability By U.S. Developers, by Salary", y="How would you rate your programming ability? (1-10)", x="Annual Gross Earnings or Salary (including bonus) in USD")



max_save(plot, "so-programming-6", "Stack Overflow", h=6, tall=T)

Joining by: "salary_range"


Source: local data frame [22 x 3]

          salary_range avg_prog_avg avg_prog
                (fctr)        (dbl)    (dbl)
1    $10,000 - $20,000     6.338275 6.339623
2  $100,000 - $110,000     7.844868 7.844701
3  $110,000 - $120,000     7.953826 7.953652
4  $120,000 - $130,000     7.972257 7.972492
5  $130,000 - $140,000     8.175449 8.176056
6  $140,000 - $150,000     8.146759 8.147139
7  $150,000 - $160,000     8.227022 8.226300
8  $160,000 - $170,000     7.973996 7.973958
9  $170,000 - $180,000     8.334688 8.333333
10 $180,000 - $190,000     8.515714 8.516484
11 $190,000 - $200,000     8.504258 8.506024
12   $20,000 - $30,000     6.397434 6.398268
13   $30,000 - $40,000     6.484083 6.482759
14   $40,000 - $50,000     6.400508 6.398990
15   $50,000 - $60,000     6.603853 6.603774
16   $60,000 - $70,000     6.743783 6.744457
17   $70,000 - $80,000     7.137166 7.136364
18   $80,000 - $90,000     7.317693 7.317881
19  $90,000 - $100,000     7.596933 7.597714
20   Less than $10,0

![](so-programming-6.png)

## Employment Status

In [24]:
df_employment <- df %>% filter(employment_status != "", employment_status != "Other (please specify)", employment_status!="Prefer not to disclose") %>%
                group_by(employment_status) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_employment %>% print(n=100)

Source: local data frame [7 x 5]

       employment_status count med_prog avg_prog     sd_mean
                   (chr) (int)    (dbl)    (dbl)       (dbl)
1     Employed full-time 31784        7 7.286308 0.009102539
2     Employed part-time  1663        7 6.707757 0.042268086
3 Freelance / Contractor  3350        8 7.528358 0.028977709
4          I'm a student  5959        6 6.000671 0.024219544
5                Retired    99        8 6.535354 0.290831920
6          Self-employed  2020        8 7.490594 0.040817385
7             Unemployed   863        7 6.144844 0.077246424


In [25]:
valid_employment <- df_employment %>% select(employment_status) %>% unlist() %>% as.character()

employment_boot <- bootstrap_df(df %>% filter(employment_status %in% valid_employment), "employment_status", n_bootstrap_resamples)
employment_boot %>% arrange(desc(avg_prog_avg)) %>% print(n=100)

Source: local data frame [6 x 2]

                   group avg_prog
                   (chr)    (dbl)
1     Employed full-time 7.289785
2     Employed part-time 6.704167
3 Freelance / Contractor 7.561908
4          I'm a student 6.020259
5                Retired 7.000000
6          Self-employed 7.521141
Source: local data frame [7 x 4]

       employment_status avg_prog_avg avg_prog_low_ci avg_prog_high_ci
                   (chr)        (dbl)           (dbl)            (dbl)
1 Freelance / Contractor     7.528419        7.471376         7.585886
2          Self-employed     7.490158        7.409607         7.570271
3     Employed full-time     7.286362        7.268608         7.304149
4     Employed part-time     6.707372        6.624557         6.790909
5                Retired     6.532750        5.947351         7.087392
6             Unemployed     6.145154        5.993555         6.298620
7          I'm a student     6.000840        5.952915         6.048529


In [43]:
employment_boot_reorder <- employment_boot %>% left_join(df_employment) %>% arrange(avg_prog_avg) %>% mutate(employment_status = factor(employment_status, levels=rev(employment_status)))

plot <- ggplot(mapping=aes(x=factor(employment_status, levels=employment_boot_reorder %>% select(employment_status) %>% unlist() %>% as.character() %>% rev()), y=programming_ability), data=df %>% filter(employment_status %in% valid_employment)) +
            geom_violin(alpha=0.25, size=0.10, fill="#999999", color="#999999", scale="width") +
            geom_errorbar(data=employment_boot_reorder, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=employment_boot_reorder, mapping=aes(x=employment_status, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="#8e44ad") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating of Programming Ability By Developer, by Employment", y="How would you rate your programming ability? (1-10)", x="Developer Employment Status")

max_save(plot, "so-programming-7", "Stack Overflow")

Joining by: "employment_status"


![](so-programming-7.png)
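
As a quick sanity check, the `sd_mean` column in the employment summary (the standard error of the mean, `sd / sqrt(count)`) can be turned into a normal-approximation interval and compared with the bootstrap interval printed for the same group. This is only a sketch: it assumes the bootstrap intervals are 95% intervals (the level is set inside `bootstrap_df`, defined earlier in the notebook), and the numbers are copied by hand from the "Employed full-time" rows above.

```r
# Values copied from the "Employed full-time" row of the summary table.
avg_prog <- 7.286308
sd_mean  <- 0.009102539   # standard error of the mean: sd / sqrt(count)

# Normal-approximation 95% interval: mean +/- z * standard error.
z <- qnorm(0.975)         # roughly 1.96
normal_ci <- c(low = avg_prog - z * sd_mean, high = avg_prog + z * sd_mean)
print(round(normal_ci, 4))

# Bootstrap interval printed for the same group (assumed to be a 95% interval).
boot_ci <- c(low = 7.268608, high = 7.304149)

# Difference between the two intervals, endpoint by endpoint.
print(round(normal_ci - boot_ci, 4))
```

With nearly 32,000 respondents in this group, the two intervals should agree to about three decimal places, so the bootstrap matters mostly for the small groups such as "Retired".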

## Commit Frequency

In [27]:
df_commit <- df %>% filter(commit_frequency != "", commit_frequency!="Other (please specify)",
                           commit_frequency!="I don't \"check-in or commit code\", but I do put code into production somewhat frequently") %>%
                group_by(commit_frequency) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_commit %>% print(n=100)

Source: local data frame [5 x 5]

                 commit_frequency count med_prog avg_prog     sd_mean
                            (chr) (int)    (dbl)    (dbl)       (dbl)
1           A couple times a week  7799        7 6.939992 0.019545651
2             A few times a month  2579        7 6.407522 0.037674093
3 I never check-in or commit code  1453        5 5.261528 0.056297251
4            Multiple times a day 25199        8 7.464860 0.009835023
5                      Once a day  4678        7 6.946986 0.024583787


In [28]:
valid_commit <- df_commit %>% select(commit_frequency) %>% unlist() %>% as.character()

commit_boot <- bootstrap_df(df %>% filter(commit_frequency %in% valid_commit), "commit_frequency", n_bootstrap_resamples)
commit_boot %>% arrange(desc(avg_prog_avg)) %>% print(n=100)

Source: local data frame [6 x 2]

                            group avg_prog
                            (chr)    (dbl)
1           A couple times a week 6.917740
2             A few times a month 6.419508
3 I never check-in or commit code 5.235254
4            Multiple times a day 7.454169
5                      Once a day 6.977478
6           A couple times a week 6.984159
Source: local data frame [5 x 4]

                 commit_frequency avg_prog_avg avg_prog_low_ci avg_prog_high_ci
                            (chr)        (dbl)           (dbl)            (dbl)
1            Multiple times a day     7.464803        7.445692         7.484062
2                      Once a day     6.947047        6.898092         6.994759
3           A couple times a week     6.939854        6.900855         6.977817
4             A few times a month     6.407295        6.334108         6.479281
5 I never check-in or commit code     5.261386        5.151369         5.369986


In [44]:
commit_levels <- c("I never check-in or commit code","A few times a month","A couple times a week","Once a day","Multiple times a day")

commit_boot_reorder <- commit_boot %>% left_join(df_commit) %>% arrange(avg_prog_avg) %>% mutate(commit_frequency = factor(commit_frequency, levels=rev(commit_levels)))

plot <- ggplot(mapping=aes(x=factor(commit_frequency , levels=rev(commit_levels)), y=programming_ability), data=df %>% filter(commit_frequency %in% commit_levels)) +
            geom_violin(alpha=0.25, size=0.10, fill="#999999", color="#999999", scale="width") +
            geom_errorbar(data=commit_boot_reorder, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=commit_boot_reorder, mapping=aes(x=commit_frequency, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="#e67e22") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating of Programming Ability By Dev, by Commit Activity", y="How would you rate your programming ability? (1-10)", x="Developer Commit Activity")

max_save(plot, "so-programming-8", "Stack Overflow", h=2.5)

Joining by: "commit_frequency"


![](so-programming-8.png)

## Unit Testing

Not rendered as a chart, but I wanted to check since I was curious. :P

In [30]:
df_unit <- df %>% filter(unit_testing != "", unit_testing!="Other (please specify)") %>%
                group_by(unit_testing) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_unit %>% print(n=100)

Source: local data frame [3 x 5]

  unit_testing count med_prog avg_prog     sd_mean
         (chr) (int)    (dbl)    (dbl)       (dbl)
1 I don't know  9494        7 6.360544 0.019320003
2           No  3484        7 7.424799 0.028137455
3          Yes 28388        7 7.294068 0.009768627


## Visit Frequency

In [31]:
df_visit <- df %>% filter(visit_frequency != "", visit_frequency !="Other (please specify)",
                         visit_frequency!="I have never been on Stack Overflow. I just love taking surveys.") %>%
                group_by(visit_frequency) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_visit %>% print(n=100)

Source: local data frame [5 x 5]

       visit_frequency count med_prog avg_prog    sd_mean
                 (chr) (int)    (dbl)    (dbl)      (dbl)
1 Multiple times a day 24329        7 7.155617 0.01069518
2           Once a day  9680        7 7.179236 0.01706293
3         Once a month  1344        7 6.949405 0.05531368
4          Once a week  6614        7 7.118990 0.02220377
5          Very rarely  1339        7 7.016430 0.05831076


In [32]:
valid_visit <- df_visit %>% select(visit_frequency) %>% unlist() %>% as.character()

visit_boot <- bootstrap_df(df %>% filter(visit_frequency %in% valid_visit), "visit_frequency", n_bootstrap_resamples)
visit_boot %>% arrange(desc(avg_prog_avg)) %>% print(n=100)

Source: local data frame [6 x 2]

                 group avg_prog
                 (chr)    (dbl)
1 Multiple times a day 7.136040
2           Once a day 7.187761
3         Once a month 6.944362
4          Once a week 7.158846
5          Very rarely 6.971897
6 Multiple times a day 7.152782
Source: local data frame [5 x 4]

       visit_frequency avg_prog_avg avg_prog_low_ci avg_prog_high_ci
                 (chr)        (dbl)           (dbl)            (dbl)
1           Once a day     7.179041        7.145458         7.213111
2 Multiple times a day     7.155666        7.134559         7.176768
3          Once a week     7.119039        7.075062         7.163585
4          Very rarely     7.017125        6.902238         7.130537
5         Once a month     6.949426        6.842511         7.058340


In [45]:
visit_levels <- c("Very rarely","Once a month","Once a week","Once a day","Multiple times a day")

visit_boot_reorder <- visit_boot %>% left_join(df_visit) %>% arrange(avg_prog_avg) %>% mutate(visit_frequency = factor(visit_frequency, levels=rev(visit_levels)))

plot <- ggplot(mapping=aes(x=factor(visit_frequency , levels=rev(visit_levels)), y=programming_ability), data=df %>% filter(visit_frequency %in% visit_levels)) +
            geom_violin(alpha=0.25, size=0.10, fill="#999999", color="#999999", scale="width") +
            geom_errorbar(data=visit_boot_reorder, mapping=aes(y=avg_prog, ymin=avg_prog_low_ci, ymax=avg_prog_high_ci) , show.legend=F, color="black", size=0.25, width=0.25) +
            geom_point(data=visit_boot_reorder, mapping=aes(x=visit_frequency, y=avg_prog), stat="identity", position="identity", shape=21, fill="white", color="#c0392b") +
            fte_theme() +
            coord_flip() +
            scale_y_continuous(breaks=c(1:10), limits=c(1,10)) +
            theme(plot.title=element_text(hjust=1)) +
            labs(title="Avg. Self-Rating of Programming Ability By S.O. Visit Frequency", y="How would you rate your programming ability? (1-10)", x="Stack Overflow Visit Frequency")

max_save(plot, "so-programming-9", "Stack Overflow", h=2.5)

Joining by: "visit_frequency"


![](so-programming-9.png)

## Star Wars vs. Star Trek

In [34]:
df_wars <- df %>% filter(star_wars_vs_star_trek != "") %>%
                group_by(star_wars_vs_star_trek) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

df_wars %>% print(n=100)

Source: local data frame [3 x 5]

  star_wars_vs_star_trek count med_prog avg_prog    sd_mean
                   (chr) (int)    (dbl)    (dbl)      (dbl)
1              Star Trek  7712        7 7.374741 0.01963603
2              Star Wars 21340        7 7.028585 0.01168088
3   Star Wars; Star Trek  3614        8 7.467349 0.02765136


Create a function to mass-print the grouping for all the qualitative ranks, since it's faster than doing it manually.

In [35]:
print_categorical <- function(group_var) {
    writeLines(paste("\nGrouping on:", group_var,"\n"))
    
    df_temp <- df %>% select(group = matches(group_var), programming_ability) %>% filter(group != "") %>%
                group_by(group) %>%
                summarize(count=n(), med_prog = median(programming_ability), avg_prog=mean(programming_ability), sd_mean=sd(programming_ability)/sqrt(count))

    df_temp %>% print(n=100)
    
    return(NA)
}

print_categorical("star_wars_vs_star_trek")


Grouping on: star_wars_vs_star_trek 

Source: local data frame [3 x 5]

                 group count med_prog avg_prog    sd_mean
                 (chr) (int)    (dbl)    (dbl)      (dbl)
1            Star Trek  7712        7 7.374741 0.01963603
2            Star Wars 21340        7 7.028585 0.01168088
3 Star Wars; Star Trek  3614        8 7.467349 0.02765136


[1] NA
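
Since `print_categorical` returns `NA`, nothing is kept from each call beyond the printed table. If the summaries are needed later, one option is a variant that returns the table invisibly. The sketch below is hypothetical: `summarize_categorical` and `df_toy` are made-up names, the toy values are not survey data, and it uses base R rather than dplyr so it runs on its own.

```r
# Toy stand-in for the survey data frame (made-up values, not survey data).
df_toy <- data.frame(
  agree_tech = c("Agree", "Agree", "Disagree", "", "Disagree", "Agree"),
  programming_ability = c(8, 7, 6, 5, 7, 9),
  stringsAsFactors = FALSE
)

# Hypothetical variant of print_categorical: looks the column up by exact
# name, prints the summary, and returns it invisibly instead of returning NA.
summarize_categorical <- function(data, group_var) {
  keep <- data[[group_var]] != ""            # drop blank responses
  groups <- split(data$programming_ability[keep], data[[group_var]][keep])
  out <- data.frame(
    group    = names(groups),
    count    = vapply(groups, length, integer(1)),
    med_prog = vapply(groups, median, numeric(1)),
    avg_prog = vapply(groups, mean, numeric(1)),
    sd_mean  = vapply(groups, function(x) sd(x) / sqrt(length(x)), numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  print(out)
  invisible(out)
}

res <- summarize_categorical(df_toy, "agree_tech")
```

Calling it through `lapply` would then give a list of summary data frames rather than a list of `NA` values.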

In [36]:
categoricals <- names(df)[45:64]
print(categoricals)

x<-lapply(categoricals, print_categorical)

 [1] "agree_tech"               "agree_notice"            
 [3] "agree_problemsolving"     "agree_diversity"         
 [5] "agree_adblocker"          "agree_alcohol"           
 [7] "agree_loveboss"           "agree_nightcode"         
 [9] "agree_legacy"             "agree_mars"              
[11] "important_variety"        "important_control"       
[13] "important_sameend"        "important_newtech"       
[15] "important_buildnew"       "important_buildexisting" 
[17] "important_promotion"      "important_companymission"
[19] "important_wfh"            "important_ownoffice"     

Grouping on: agree_tech 

Source: local data frame [5 x 5]

                group count med_prog avg_prog    sd_mean
                (chr) (int)    (dbl)    (dbl)      (dbl)
1    Agree completely 13641        8 7.421523 0.01463915
2      Agree somewhat 18438        7 7.187764 0.01194230
3 Disagree completely   659        7 6.760243 0.08359381
4   Disagree somewhat  2238        7 6.912869 0.03610446
5      

# The MIT License (MIT)

Copyright (c) 2016 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.