# What programming languages and technologies do the highest-paid software developers use?


### Business Understanding

Whether you are a seasoned software developer or a newcomer who wants to become one, you may want answers to questions like these:
What programming languages and technologies do the highest-paid software developers use?
What database technologies, web frameworks, and other tools and libraries do the highest-paid software developers use?

To answer these questions, I used data from Stack Overflow's 2019 Annual Developer Survey. The survey data covers 88,863 responses from 213 countries and territories.

### Data Understanding

To get started, let's import the libraries we need to wrangle our data: pandas and numpy. If we decide to build some basic plots, matplotlib may prove useful as well.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline
from collections import defaultdict

df = pd.read_csv('./survey_results_public_2019.csv')
df.head()

| | Respondent | MainBranch | Hobbyist | OpenSourcer | OpenSource | Employment | Country | Student | EdLevel | UndergradMajor | ... | WelcomeChange | SONewContent | Age | Gender | Trans | Sexuality | Ethnicity | Dependents | SurveyLength | SurveyEase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I am a student who is learning to code | Yes | Never | The quality of OSS and closed source software ... | Not employed, and not looking for work | United Kingdom | No | Primary/elementary school | | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 14.0 | Man | No | Straight / Heterosexual | | No | Appropriate in length | Neither easy nor difficult |
| 1 | 2 | I am a student who is learning to code | No | Less than once per year | The quality of OSS and closed source software ... | Not employed, but looking for work | Bosnia and Herzegovina | Yes, full-time | Secondary school (e.g. American high school, G... | | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 19.0 | Man | No | Straight / Heterosexual | | No | Appropriate in length | Neither easy nor difficult |
| 2 | 3 | I am not primarily a developer, but I write co... | Yes | Never | The quality of OSS and closed source software ... | Employed full-time | Thailand | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Web development or web design | ... | Just as welcome now as I felt last year | Tech meetups or events in your area;Courses on... | 28.0 | Man | No | Straight / Heterosexual | | Yes | Appropriate in length | Neither easy nor difficult |
| 3 | 4 | I am a developer by profession | No | Never | The quality of OSS and closed source software ... | Employed full-time | United States | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech articles written by other developers;Indu... | 22.0 | Man | No | Straight / Heterosexual | White or of European descent | No | Appropriate in length | Easy |
| 4 | 5 | I am a developer by profession | Yes | Once a month or more often | OSS is, on average, of HIGHER quality than pro... | Employed full-time | Ukraine | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | ... | Just as welcome now as I felt last year | Tech meetups or events in your area;Courses on... | 30.0 | Man | No | Straight / Heterosexual | White or of European descent;Multiracial | No | Appropriate in length | Easy |


Let's focus on our first question:
What programming languages and technologies do the highest-paid software developers use?
To answer this question, we will review the LanguageWorkedWith field in the dataframe. Let's look at the column more closely.

In [5]:
df.LanguageWorkedWith.value_counts()

HTML/CSS;JavaScript;PHP;SQL                                                                                    1483
C#;HTML/CSS;JavaScript;SQL                                                                                     1414
HTML/CSS;JavaScript                                                                                            1247
C#;HTML/CSS;JavaScript;SQL;TypeScript                                                                           990
Java                                                                                                            934
HTML/CSS;JavaScript;PHP                                                                                         910
Python                                                                                                          759
HTML/CSS;JavaScript;TypeScript                                                                                  703
HTML/CSS;Java;JavaScript;SQL                                            

In [6]:
# This isn't what I was expecting: value_counts() groups whole combinations of languages,
# because a single row can hold more than one answer. I wrote a function to clean this up.
# The following function creates a dictionary of all programming languages and their counts.
# We can use the dictionary to get the top languages and to build data visualizations.

def create_dict_from_col(schema, column):
    '''
    INPUT 
        schema - a pandas dataframe
        column - name of a column whose values are separated by ';'
    OUTPUT
        desired_lang_dict - a dictionary mapping each category name to its count
    '''
    df_new = schema[schema[column].notnull()] # Remove NaN records
    desired_lang_dict = {} # Initialize the dict
    # Populate the dict
    for row in df_new[column].to_list():
        languages = row.split(';')
        for each_lang in languages:
            if each_lang in desired_lang_dict:
                desired_lang_dict[each_lang] += 1
            else:
                desired_lang_dict[each_lang] = 1
    return desired_lang_dict
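
The same tally can also be computed without an explicit loop. The following is a minimal sketch, assuming a pandas version that provides `Series.explode` (0.25 or later). The function name `count_categories` and the toy frame `toy` are illustrative and are not part of the original notebook.

```python
import pandas as pd

def count_categories(frame, column, sep=';'):
    """Count how often each category appears in a delimiter-separated column."""
    # Drop missing answers, split each answer into a list, then emit one row per item
    tokens = frame[column].dropna().str.split(sep).explode()
    return tokens.value_counts().to_dict()

# Toy data standing in for the survey column (not real survey rows)
toy = pd.DataFrame({'LanguageWorkedWith': ['Java;Python', 'Python', None, 'C#;Java;Python']})
counts = count_categories(toy, 'LanguageWorkedWith')
print(counts)
```

On the real data, the call would presumably be `count_categories(df, 'LanguageWorkedWith')`, which should yield the same keys as the loop-based function.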

In [7]:
# This is a short utility function we can use to get the top N categories.
# The first parameter is named counts so that it does not shadow the built-in dict.
def get_dict_topvals(counts, top_n, is_reverse):
    return sorted(counts.items(), key=lambda x: x[1], reverse=is_reverse)[:top_n]
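
To see what the sorting step returns, here is a small sketch on a hand-made dictionary; the counts are invented for the example. `top_items` is a hypothetical name that mirrors the utility above, and `collections.Counter.most_common` from the standard library gives the same descending ranking.

```python
from collections import Counter

def top_items(counts, top_n, descending=True):
    """Return the top_n (key, count) pairs sorted by count."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=descending)[:top_n]

# Invented counts, for illustration only
toy_counts = {'Python': 30, 'Java': 20, 'Rust': 5, 'Go': 10}

top_two = top_items(toy_counts, 2)                       # largest counts first
bottom_two = top_items(toy_counts, 2, descending=False)  # smallest counts first
same_as_counter = Counter(toy_counts).most_common(2)     # standard-library equivalent
print(top_two)
print(bottom_two)
```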

In [8]:
# Using the functions above, we can easily get a list of all languages the survey respondents have worked with.
lang_list = create_dict_from_col(df,'LanguageWorkedWith').keys()  
lang_list    

dict_keys(['HTML/CSS', 'Java', 'JavaScript', 'Python', 'C++', 'C', 'C#', 'SQL', 'VBA', 'R', 'Bash/Shell/PowerShell', 'Ruby', 'Rust', 'TypeScript', 'WebAssembly', 'Other(s):', 'Go', 'PHP', 'Assembly', 'Kotlin', 'Swift', 'Objective-C', 'Elixir', 'Erlang', 'Clojure', 'F#', 'Scala', 'Dart'])

In [9]:
# Let's write a function that gives us the average salary of developers using each of those languages.
# I've written the following generic function.

def get_mean_by_column_values(df,col_list,col_name,mean_col_name):
    '''
    INPUT 
        df - a pandas dataframe
        col_list - a Python list with all category values
        col_name - category column name
        mean_col_name - column to average (should be int or float)
    OUTPUT
        df_means - a sorted dataframe with each category name and the mean of the column supplied in mean_col_name
    '''
    # drop any row with null values in mean_col_name or col_name
    df_notnull = df.dropna(subset=[mean_col_name,col_name], how='any')
    df_final = df_notnull[[mean_col_name,col_name]].reset_index(drop=True)
    
    df_subset = defaultdict(float)
    denoms = dict()
    for val in col_list:
        denoms[val] = 0
        for row_num in range(df_final.shape[0]):
            # Note: `in` on a string is a substring test, so 'Java' also matches 'JavaScript'
            if val in df_final[col_name][row_num]:
                df_subset[val] += df_final[mean_col_name][row_num]
                denoms[val] += 1
    
    df_subset = pd.DataFrame(pd.Series(df_subset)).reset_index()
    denoms = pd.DataFrame(pd.Series(denoms)).reset_index()  
    df_subset.columns = [col_name, 'col_sum']
    denoms.columns = [col_name, 'col_total']
    df_means = pd.merge(df_subset, denoms)
    df_means['mean_col'] = df_means['col_sum']/df_means['col_total']
    
    return df_means.sort_values('mean_col', ascending=False)
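
One caveat with the loop above: `val in some_string` checks for a substring, so a category such as `Java` is also counted for rows that list only `JavaScript`, and `C` is counted for `C++` or `C#`. The sketch below shows an exact-match alternative, assuming pandas 0.25 or later for `explode` and named aggregation. `mean_by_category` and the toy salaries are illustrative assumptions, not the notebook's code or data.

```python
import pandas as pd

def mean_by_category(frame, col_name, mean_col_name, sep=';'):
    """Mean of mean_col_name per exact category in a delimiter-separated column."""
    subset = frame[[col_name, mean_col_name]].dropna()
    # One row per (respondent, category) pair, split on the exact delimiter
    exploded = subset.assign(**{col_name: subset[col_name].str.split(sep)}).explode(col_name)
    result = (exploded.groupby(col_name)[mean_col_name]
              .agg(col_sum='sum', col_total='count', mean_col='mean')
              .reset_index())
    return result.sort_values('mean_col', ascending=False).reset_index(drop=True)

# Toy data, invented for illustration (not real survey rows)
toy = pd.DataFrame({
    'LanguageWorkedWith': ['Java', 'JavaScript', 'Java;JavaScript', None],
    'ConvertedComp': [100.0, 50.0, 70.0, 90.0],
})
means = mean_by_category(toy, 'LanguageWorkedWith', 'ConvertedComp')
print(means)
```

In this toy frame, `Java` averages only the two rows that actually list it, whereas a substring test would also pull in the `JavaScript`-only row. Results on the survey data may therefore differ from a substring-based calculation, most visibly for short names such as `C`, `R`, and `Java`.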

### 1. What programming languages do the highest-paid software developers use?

Let's answer this question using the function we have just written.

In [14]:
lang_avg_sal = get_mean_by_column_values(df,lang_list,'LanguageWorkedWith','ConvertedComp')
lang_avg_sal

| | LanguageWorkedWith | col_sum | col_total | mean_col |
|---|---|---|---|---|
| 24 | Clojure | 172935200.0 | 863 | 200388.463499 |
| 25 | F# | 129110400.0 | 648 | 199244.41358 |
| 14 | WebAssembly | 101296500.0 | 582 | 174048.90378 |
| 16 | Go | 844685600.0 | 5079 | 166309.42154 |
| 22 | Elixir | 147043000.0 | 898 | 163745.023385 |
| 23 | Erlang | 83479500.0 | 515 | 162096.112621 |
| 11 | Ruby | 824814000.0 | 5160 | 159847.664729 |
| 26 | Scala | 377794900.0 | 2368 | 159541.765203 |
| 12 | Rust | 260420100.0 | 1646 | 158213.893074 |
| 9 | R | 1416310000.0 | 9131 | 155110.107874 |


### 2. What database environments do the highest-paid software developers use?

We can reuse our function to answer this question.

In [11]:
# Using the functions above, we can easily get a list of all databases the survey respondents have worked with.
db_list = create_dict_from_col(df,'DatabaseWorkedWith').keys()
db_avg_sal = get_mean_by_column_values(df,db_list,'DatabaseWorkedWith','ConvertedComp')
db_avg_sal

| | DatabaseWorkedWith | col_sum | col_total | mean_col |
|---|---|---|---|---|
| 9 | DynamoDB | 614092500.0 | 3505 | 175204.696719 |
| 6 | Cassandra | 294243300.0 | 1801 | 163377.746252 |
| 3 | Couchbase | 161433900.0 | 1068 | 151155.299625 |
| 7 | Elasticsearch | 1200737000.0 | 8173 | 146915.063257 |
| 13 | Other(s): | 522207200.0 | 3637 | 143581.845752 |
| 8 | Redis | 1513739000.0 | 10608 | 142697.836538 |
| 1 | PostgreSQL | 2407099000.0 | 18207 | 132207.313121 |
| 10 | Microsoft SQL Server | 2263035000.0 | 17151 | 131947.697335 |
| 0 | SQLite | 1780116000.0 | 14760 | 120604.080759 |
| 4 | MongoDB | 1529513000.0 | 12684 | 120585.998029 |


### 3. What web frameworks do the highest-paid software developers use?

We can reuse our function to answer this question.

In [12]:
web_list = create_dict_from_col(df,'WebFrameWorkedWith').keys()
web_avg_sal = get_mean_by_column_values(df,web_list,'WebFrameWorkedWith','ConvertedComp')
web_avg_sal

| | WebFrameWorkedWith | col_sum | col_total | mean_col |
|---|---|---|---|---|
| 6 | Ruby on Rails | 537440400.0 | 3657 | 146962.108012 |
| 1 | Flask | 723210600.0 | 4991 | 144902.935484 |
| 4 | React.js | 1943715000.0 | 14111 | 137744.689816 |
| 2 | Express | 1103571000.0 | 8321 | 132624.849177 |
| 9 | ASP.NET | 1535225000.0 | 11710 | 131103.743467 |
| 7 | Other(s): | 491642900.0 | 3764 | 130617.145058 |
| 8 | Angular/Angular.js | 1716019000.0 | 13830 | 124079.433406 |
| 5 | Spring | 881729200.0 | 7189 | 122649.776464 |
| 10 | Vue.js | 796962200.0 | 6696 | 119020.639188 |
| 0 | Django | 603531500.0 | 5123 | 117808.218622 |


### 4. What tools, libraries, and miscellaneous frameworks do the highest-paid software developers use?

We can reuse our function to answer this question.

In [13]:
tech_list = create_dict_from_col(df,'MiscTechWorkedWith').keys()
tech_avg_sal = get_mean_by_column_values(df,tech_list,'MiscTechWorkedWith','ConvertedComp')
tech_avg_sal

| | MiscTechWorkedWith | col_sum | col_total | mean_col |
|---|---|---|---|---|
| 8 | Chef | 201651500.0 | 1079 | 186887.438369 |
| 13 | Puppet | 204680600.0 | 1143 | 179073.158355 |
| 15 | Apache Spark | 394994700.0 | 2238 | 176494.489723 |
| 3 | Hadoop | 307683000.0 | 1856 | 165777.458513 |
| 7 | Ansible | 643265600.0 | 4164 | 154482.607589 |
| 4 | Pandas | 720191700.0 | 4714 | 152777.191769 |
| 11 | .NET Core | 1395418000.0 | 9804 | 142331.468176 |
| 9 | TensorFlow | 498044900.0 | 3508 | 141974.027936 |
| 14 | Other(s): | 257407300.0 | 1839 | 139971.36161 |
| 6 | Torch/PyTorch | 151934900.0 | 1088 | 139646.02114 |
