# Which database technologies, web frameworks, and other tools and libraries are the most satisfied software developers using?

### Business Understanding

Whether you are a seasoned software developer or a newcomer who wants to become one, you may want an answer to the following question:
Which database technologies, web frameworks, and other tools and libraries are the most satisfied developers using?

To answer this question, I used data from Stack Overflow's 2019 Annual Developer Survey. 
The survey data covers 88,863 responses from 213 countries and territories.

### Data Understanding

To get started, let's import the libraries we need to wrangle the data: pandas and NumPy. 
If we decide to build some basic plots, matplotlib and seaborn may prove useful as well.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
%matplotlib inline

df = pd.read_csv('./survey_results_public_2019.csv')
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


In [3]:
# Let's look at the CareerSat column as shown below
df["CareerSat"].value_counts()

Very satisfied                        29173
Slightly satisfied                    25018
Slightly dissatisfied                  7670
Neither satisfied nor dissatisfied     7252
Very dissatisfied                      3734
Name: CareerSat, dtype: int64

### A brief note on scoring the satisfaction

To quantify the satisfaction ratings before averaging them, I mapped each answer to a numeric score, using a scale inspired by the NPS scoring technique. The mapping is defined in the next cell.

For more details on NPS scoring, see https://www.qualtrics.com/experience-management/customer/measure-nps/

In [4]:
# Create a numeric rating to replace
nps_numeric_ratings = {"CareerSat": {"Very satisfied": 4, "Slightly satisfied": 2,
                                     "Neither satisfied nor dissatisfied": 0,
                                     "Slightly dissatisfied": -2, "Very dissatisfied": -4}}

In [5]:
# drop all records without any career satisfaction value
df_valid_careersat = df.dropna(subset=["CareerSat"], how='any').reset_index(drop=True)

In [6]:
# Now replace the career rating with numeric values
df_valid_careersat.replace(nps_numeric_ratings, inplace=True)

In [55]:
# Verify the field again
df_valid_careersat["CareerSat"].value_counts()
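
As a small illustration of the scoring step, the sketch below applies the same scale to a handful of made-up ratings and averages them. The names `scale`, `toy_ratings`, `toy_scores`, and `toy_mean` are hypothetical, and the ratings are invented for the example rather than taken from the survey.

```python
import pandas as pd

# Same NPS-inspired scale as in the notebook (symmetric around a neutral 0)
scale = {
    "Very satisfied": 4,
    "Slightly satisfied": 2,
    "Neither satisfied nor dissatisfied": 0,
    "Slightly dissatisfied": -2,
    "Very dissatisfied": -4,
}

# Hypothetical toy ratings, for illustration only
toy_ratings = pd.Series([
    "Very satisfied",
    "Slightly satisfied",
    "Very dissatisfied",
    "Slightly satisfied",
])

toy_scores = toy_ratings.map(scale)  # 4, 2, -4, 2
toy_mean = toy_scores.mean()         # (4 + 2 - 4 + 2) / 4 = 1.0
print(toy_mean)
```

Because the scale is symmetric, a mean above 0 suggests that satisfied answers outweigh dissatisfied ones, and a mean below 0 suggests the opposite.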

In [None]:
# let's look at all the programming languages now

In [57]:
df.LanguageWorkedWith.value_counts()

In [None]:
# This isn't what I expected: each row groups several programming languages together,
# so a single row holds more than one answer. I wrote a function to clean this up.
# The following function creates a dictionary of all programming languages and their counts.
# We can use the dictionary to get the top languages and to build data visualizations.

def create_dict_from_col(schema, column):
    '''
    INPUT 
        schema - a pandas DataFrame
        column - name of a column whose values are separated by ';'
    OUTPUT
        desired_lang_dict - a dictionary mapping each category name to its count
    '''
    df_new = schema[schema[column].notnull()] # Remove NaN records
    desired_lang_dict = {} # Initialize the dict
    # Populate the dict
    for row in df_new[column].to_list():
        languages = row.split(';')
        for each_lang in languages:
            if each_lang in desired_lang_dict:
                desired_lang_dict[each_lang] += 1
            else:
                desired_lang_dict[each_lang] = 1
    return desired_lang_dict
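
The same counting can probably be expressed with pandas alone. The sketch below is one possible alternative, not the approach used in this notebook: `count_categories` is a hypothetical helper, and the toy DataFrame is made up for the example.

```python
import pandas as pd

def count_categories(frame, column, sep=";"):
    """Count each category in a delimiter-separated column (hypothetical helper)."""
    return (
        frame[column]
        .dropna()          # skip respondents who left the question blank
        .str.split(sep)    # "Python;SQL" -> ["Python", "SQL"]
        .explode()         # one row per individual answer
        .value_counts()    # occurrences of each category
    )

# Made-up data, for illustration only
toy = pd.DataFrame({"LanguageWorkedWith": ["Python;SQL", "Python;Java", None, "SQL"]})
toy_counts = count_categories(toy, "LanguageWorkedWith")
print(toy_counts.to_dict())
```

The result is a Series indexed by category name, so `toy_counts.index` plays the same role as the dictionary keys returned by `create_dict_from_col`.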

In [None]:
# This is a short utility function we can use to get the top N categories.
# The first parameter is named counts so that it does not shadow the built-in dict.
def get_dict_topvals(counts, top_n, is_reverse):
    return sorted(counts.items(), key=lambda x: x[1], reverse=is_reverse)[:top_n]
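
For comparison, the standard library's `collections.Counter` offers a similar top-N lookup through `most_common`. This is only a sketch on made-up rows; `toy_rows`, `toy_counter`, and `top_two` are hypothetical names.

```python
from collections import Counter

# Made-up ';'-separated answers, for illustration only
toy_rows = ["Python;SQL", "Python;Java", "SQL;Python"]

# Count every individual answer across all rows
toy_counter = Counter(lang for row in toy_rows for lang in row.split(";"))

# The two most frequent categories, highest count first
top_two = toy_counter.most_common(2)
print(top_two)
```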

In [58]:
# Using the functions above, we can easily get a list of all languages used by the survey respondents.
lang_list = create_dict_from_col(df,'LanguageWorkedWith').keys()  
lang_list  

In [56]:
# Let's write a function that gives the average satisfaction of developers for each category (language, database, etc.).
# The following function is generic: it works for any ';'-separated column.
def get_mean_by_column_values(df, col_list, col_name, mean_col_name):
    '''
    INPUT
        df - a pandas DataFrame
        col_list - the category names to average over
        col_name - name of the ';'-separated column holding the categories
        mean_col_name - name of the numeric column to average
    OUTPUT
        df_means - one row per category with its sum, count, and mean,
                   sorted by mean in descending order
    '''
    df_notnull = df.dropna(subset=[mean_col_name, col_name], how='any')
    df_final = df_notnull[[mean_col_name, col_name]].reset_index(drop=True)

    col_sums = defaultdict(float)
    denoms = {val: 0 for val in col_list}

    for answers, score in zip(df_final[col_name], df_final[mean_col_name]):
        # Split on ';' so that matching is exact; a substring test would
        # also count 'Java' for every 'JavaScript' answer.
        for val in set(answers.split(';')):
            if val in denoms:
                col_sums[val] += score
                denoms[val] += 1

    df_subset = pd.DataFrame(pd.Series(col_sums)).reset_index()
    denoms = pd.DataFrame(pd.Series(denoms)).reset_index()
    # The first column holds the category names, so label it with col_name
    df_subset.columns = [col_name, 'col_sum']
    denoms.columns = [col_name, 'col_total']

    df_means = pd.merge(df_subset, denoms)
    df_means['mean_col'] = df_means['col_sum'] / df_means['col_total']
    return df_means.sort_values('mean_col', ascending=False)
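
A vectorized variant of this per-category average might look like the sketch below. It is an alternative under stated assumptions, not the notebook's own method: `mean_score_by_category` is a hypothetical helper and the toy data is made up. Splitting on `';'` before grouping keeps the matching exact, so an answer of `JavaScript` is not counted toward `Java`.

```python
import pandas as pd

def mean_score_by_category(frame, col_name, score_col, sep=";"):
    """Average score_col per category of a delimiter-separated column (hypothetical helper)."""
    subset = frame[[col_name, score_col]].dropna()
    # Turn "Java;SQL" into ["Java", "SQL"], then give each answer its own row
    exploded = subset.assign(
        **{col_name: subset[col_name].str.split(sep)}
    ).explode(col_name)
    return (
        exploded.groupby(col_name)[score_col]
        .agg(col_sum="sum", col_total="count", mean_col="mean")
        .sort_values("mean_col", ascending=False)
        .reset_index()
    )

# Made-up data, for illustration only
toy = pd.DataFrame({
    "LanguageWorkedWith": ["Java;SQL", "JavaScript", "Java", None],
    "CareerSat": [4, -2, 2, 4],
})
toy_means = mean_score_by_category(toy, "LanguageWorkedWith", "CareerSat")
print(toy_means)
```

In the toy data, `Java` is averaged over two rows only (scores 4 and 2), while `JavaScript` keeps its own single score of -2.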

#### 1. Which programming languages are the most satisfied software developers using?
Let's answer this question using the function we have just written.

In [13]:
# Using the functions above, we can easily get a list of all categories.
lang_list = create_dict_from_col(df,'LanguageWorkedWith').keys()  
# Calculate the average satisfaction for each category
lang_sat = get_mean_by_column_values(df_valid_careersat,lang_list,'LanguageWorkedWith','CareerSat')
lang_sat

#### 2. Which database technologies are the most satisfied software developers using?
Let's answer this question using the function we have just written.

In [None]:
db_list = create_dict_from_col(df,'DatabaseWorkedWith').keys()  
# Calculate the average satisfaction for each category
db_sat = get_mean_by_column_values(df_valid_careersat,db_list,'DatabaseWorkedWith','CareerSat')
db_sat

#### 3. Which web frameworks are the most satisfied software developers using?
Let's answer this question using the function we have just written.

In [None]:
web_list = create_dict_from_col(df,'WebFrameWorkedWith').keys()  
# Calculate the average satisfaction for each category
web_sat = get_mean_by_column_values(df_valid_careersat,web_list,'WebFrameWorkedWith','CareerSat')
web_sat

#### 4. Which tools, libraries, and miscellaneous frameworks are the most satisfied software developers using?
Let's answer this question using the function we have just written.

In [60]:
tech_list = create_dict_from_col(df,'MiscTechWorkedWith').keys()  
# Calculate the average satisfaction for each category
tech_sat = get_mean_by_column_values(df_valid_careersat,tech_list,'MiscTechWorkedWith','CareerSat')
tech_sat
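
Each result above is a DataFrame with a `mean_col` column, so a horizontal bar chart is one way to compare categories at a glance. The sketch below is a possible starting point rather than part of the original analysis: `plot_mean_scores` is a hypothetical helper, the toy values are made up, and the category names are assumed to sit in the first column.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_mean_scores(df_means, title, top_n=10):
    """Horizontal bar chart of the top_n rows by mean_col (hypothetical helper)."""
    label_col = df_means.columns[0]  # assumption: first column holds category names
    top = df_means.sort_values("mean_col", ascending=False).head(top_n)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(top[label_col], top["mean_col"])
    ax.invert_yaxis()  # put the highest mean at the top
    ax.set_xlabel("Mean satisfaction score")
    ax.set_title(title)
    return ax

# Made-up values, for illustration only
toy_means = pd.DataFrame({"category": ["A", "B", "C"], "mean_col": [1.5, 2.0, 0.5]})
ax = plot_mean_scores(toy_means, "Toy example", top_n=2)
```

With the real results, a call such as `plot_mean_scores(tech_sat, "Tools and libraries")` would be the analogous usage.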

In [61]:
#df_final = pd.merge(db_sal,db_sat, on = 'ConvertedComp', suffixes=['_salary','_satisfaction'])
#sns.set(style="whitegrid")
#tips = sns.load_dataset("tips")
#ax = sns.barplot(x="mean_col_salary", y="ConvertedComp", data=df_plot, palette="Blues_d")
