The 2019 survey results include a large amount of data on the languages developers want to work with. Given the size of the data set, it would be interesting to see whether those languages correlate with how a developer feels about their career. One might expect certain languages to be more pleasant, enriching, or satisfying to work with. From the Udacity course, we know that developers report higher job satisfaction when they work remotely or for a smaller company.

Does this apply to the languages they choose (or are directed to use)?


In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import LabelEncoder

import seaborn as sns
%matplotlib inline

# Read in the 2019 results and schema as a test to ensure all is correctly organized and the data can be found.
df = pd.read_csv('../developer_survey_2019/survey_results_public.csv')
schema = pd.read_csv('../developer_survey_2019/survey_results_schema.csv')


Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


In [3]:

#Loop over similar entry columns to determine possible unique entries
#Split entries
#def find_unique_entries(df):
#    entry_list = []
#    #loop over similar columns
#    for col_name in df.columns:
#        col = df[col_name]
        #select and split each row
#        for idx in range(0, len(col)):
#            entries = str(col[idx])
#            entry_list.extend(entries.split(';'))
#    #cast as list and use set for unique entries    
#    possible_list = list(set(entry_list))
#    #clean nans
#    possible_list.remove('nan')
#    return possible_list

#possible_JobSat = find_unique_entries(df[['JobSat']])
#possible_JobSat

In [49]:
#Let's cut the data set down to job satisfaction and the desired languages.
#Take an explicit copy so later changes do not operate on a slice of df.
new_df = df[['JobSat','LanguageDesireNextYear']].copy()

In [52]:
def clean_data(df, response_col_name):
    '''
    INPUT
    df - pandas dataframe
    response_col_name - String name of the response column

    OUTPUT
    X - A matrix holding all of the variables you want to consider when predicting the response
    y - the corresponding response vector

    This function cleans df using the following steps to produce X and y:
    1. Replace the strings with numerics that we can use with mean, etc.
    2. Drop the rows that are missing a response.
    3. Split off the response column and assign to y
    4. Remove unnecessary columns from the df that will become X.
    5. Impute the NANs of the numeric columns with the mean for the column.
    6. Replace the categorical columns with Dummy variables.
    7. Assign the final df as X.
    8. Return X, y.
    ''' 
    # Replace JobSat strings with numerics - since this is the response we don't want multi-column dummies.
    dictJobSat = {'Slightly satisfied' : 4,
              'Slightly dissatisfied' : 2,
              'Neither satisfied nor dissatisfied' : 3,
              'Very dissatisfied' : 1,
              'Very satisfied' : 5}

    #Work on a copy so the caller's dataframe is not modified (this avoids the SettingWithCopyWarning),
    #then convert the response strings to numeric scores; missing responses stay NaN.
    df = df.copy()
    df[response_col_name] = pd.to_numeric(df[response_col_name].replace(dictJobSat))

    #Drop the rows with nan in the response
    df = df.dropna(subset=[response_col_name], axis=0).reset_index(drop=True)
    
    #Assign y as only the response column.
    y = df[response_col_name]
    
    #Remove the response column and other unnecessary columns from the rest of the data.
    df = df.dropna(how='all', axis=1)
    df = df.drop([response_col_name], axis=1)

    #Determine the numeric columns and replace any NANs with the mean for the column.
    num_cols = df.select_dtypes(include=['float', 'int']).columns
    for col in num_cols:
        df[col] = df[col].fillna(df[col].mean())

    #Replace the categorical columns with dummy variable columns.    
    cat_cols = df.select_dtypes(include=['object']).copy().columns
    for col in  cat_cols:
        #Drop the original category column from df and concat with the new dummies.
        #We use df[col].str.get_dummies(sep=';') since we need to get each language specified as its own column.
        #Using entries As-Is would result in a bunch of non-comparable responses that just had all the data
        #mushed together.
        df = pd.concat([df.drop(col, axis=1), df[col].str.get_dummies(sep=';')], axis=1)
    X = df
    return X, y
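
To illustrate step 6, here is a minimal sketch of how `Series.str.get_dummies(sep=';')` expands a multi-select column into one 0/1 column per option. The toy answers below are made up for illustration and are not taken from the survey.

```python
import pandas as pd

# Hypothetical multi-select answers in the style of LanguageDesireNextYear
toy = pd.Series(['Python;Rust', 'Python', None, 'Go;Rust'])

# Each distinct language becomes its own 0/1 column;
# a missing answer becomes a row of zeros
dummies = toy.str.get_dummies(sep=';')
print(dummies)
```

A respondent who lists several languages therefore contributes a 1 to several columns, which is what lets the languages be compared one at a time.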


In [58]:
#This function is only slightly modified from the function utilized during the linear
# regression ML lessons from the Intro to Data Science notebooks. This is not meant
# to take credit as the original source and is acknowledging the source as Udacity.
def fit_model(df, response_col_name, test_size=0.3, random_state=42):
    '''
    INPUT
    df - pandas dataframe
    response_col_name - String name of the response column
    test_size - proportion of data to reserve for test.
    random_state - seed for random selection of train/test split.

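    OUTPUT
    y_train_score, y_test_score - r-squared on the training and test data
    y_train_preds, y_test_preds - predictions for the training and test data
    lm_model - the fitted LinearRegression model
    X_train, X_test, y_train, y_test - the train/test split of X and y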

    This function fits a linear model for response_col_name:
    1. Clean the data and split df into X and y.
    2. Split X and y into test and train sets.
    3. Instantiate the model and fit to the training data.
    4. Predict based on the test data and find r-squared
    '''

    #Clean df and assign to X and y
    X, y = clean_data(df, response_col_name)
    
    #Split for Train / Test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = random_state)

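    # Instantiate and Fit the Linear Model
    # (the normalize argument was removed from LinearRegression in scikit-learn 1.2; for an
    # unregularized linear regression it does not change the predictions, so it is omitted)
    lm_model = LinearRegression()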
    lm_model.fit(X_train, y_train)

    #Predict the response
    y_test_preds = lm_model.predict(X_test)
    y_test_score = r2_score(y_test, y_test_preds)
    y_train_preds = lm_model.predict(X_train)
    y_train_score = r2_score(y_train, y_train_preds)
    
    return y_train_score, y_test_score, y_train_preds, y_test_preds, lm_model, X_train, X_test, y_train, y_test
#end of fit_model


In [59]:
y_train_score, y_test_score, y_train_preds, y_test_preds, lm_model, X_train, X_test, y_train, y_test = fit_model(new_df, 'JobSat')
print("The rsquared on the training data was {}.".format(y_train_score))
print("The rsquared on the test data was {}.".format(y_test_score))




The rsquared on the training data was 0.0045463659042556115.
The rsquared on the test data was 0.0037079463310090155.


The above R-squared values show a distinct lack of correlation between desired languages and job satisfaction: the model explains less than 0.5% of the variance. It is likely that these factors have little to no impact on a developer's satisfaction, especially when compared with the factors chosen during the lesson (remote work frequency, salary, company size, and education level all had significant impacts).
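
To put an R-squared near zero in context, here is a minimal sketch of how the score is computed, using made-up satisfaction scores. The helper name `r_squared` is hypothetical; it is meant to mirror the standard definition, 1 minus the ratio of residual to total sum of squares.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up satisfaction scores on the 1-5 scale
y_true = [1, 2, 3, 4, 5]

# Predicting the mean for everyone explains none of the variance
baseline = r_squared(y_true, [3, 3, 3, 3, 3])

# Perfect predictions explain all of it
perfect = r_squared(y_true, y_true)
print(baseline, perfect)
```

A score of about 0.004, as seen above, is therefore barely distinguishable from always predicting the average satisfaction.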

Reducing the test proportion from 0.3 to 0.1 yields a higher R-squared on the test data, but it remains negligible.
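
A change in test score from a single split can simply be sampling noise. One way to check, sketched here on synthetic data rather than the survey, is to repeat the split with several seeds and compare the spread of test scores. The names `X_toy`, `y_toy`, and `scores` are assumptions for this illustration, and the features are unrelated to the response by construction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 0/1 features with no relationship to a 1-5 response
X_toy = rng.integers(0, 2, size=(2000, 5))
y_toy = rng.integers(1, 6, size=2000)

scores = {}
for test_size in (0.1, 0.3):
    per_seed = []
    for seed in range(20):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_toy, y_toy, test_size=test_size, random_state=seed)
        model = LinearRegression().fit(X_tr, y_tr)
        per_seed.append(r2_score(y_te, model.predict(X_te)))
    scores[test_size] = per_seed

# Report the mean and spread of the test R-squared for each test size
for test_size, per_seed in scores.items():
    print(test_size, np.mean(per_seed), np.std(per_seed))
```

If the scores for different seeds overlap heavily between the two test sizes, the apparent improvement from a smaller test set should not be read as a real gain.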

One may assert that satisfaction has less to do with what you are doing and more to do with the environment in which you work (remote work, friendly coworkers, competent leadership, etc.), but this has not been investigated in this analysis.

Let's take a different look at how languages relate to satisfaction. We just saw how little desired languages explain about satisfaction overall, so let's compare, language by language, the satisfaction of developers who desire a language with that of developers who do not.

The following code will identify the mean satisfaction by whether a language is desired or not.
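
The core of this comparison is a grouped mean on a 0/1 column. A minimal sketch on a toy frame, with made-up scores and a single hypothetical language flag:

```python
import pandas as pd

# Made-up rows: a numeric satisfaction score and a 0/1 flag for one language
toy = pd.DataFrame({
    'JobSat': [5, 4, 3, 2, 4, 5],
    'Rust':   [1, 1, 0, 0, 0, 1],
})

# Mean satisfaction for the not-desired (0) and desired (1) groups
means = toy.groupby('Rust')['JobSat'].mean()

# Positive when respondents who desire the language are more satisfied on average
diff = means.loc[1] - means.loc[0]
print(means)
print(diff)
```

The notebook repeats this for every language column and collects the two means into separate dataframes.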

In [134]:

#X and y are local to fit_model, so rebuild the cleaned language dummies and response here
X, y = clean_data(new_df, 'JobSat')

#Combine satisfaction with the per-language 0/1 columns (both share the same reset index)
grouped_df = pd.concat([y, X], axis=1)

#Collect the mean satisfaction per language for the not-desired (0) and desired (1) groups
rows_zero = []
rows_one = []

#loop over the languages
for lang in X.columns:
    #mean satisfaction indexed by whether the language is desired (1) or not desired (0)
    a = grouped_df.groupby(lang)['JobSat'].mean()
    rows_zero.append([lang, a.loc[0]])
    rows_one.append([lang, a.loc[1]])

#Build the not-desired / desired dataframes with named columns
#(DataFrame.append was removed in pandas 2.0, so the rows are collected in lists first)
means_zero_df = pd.DataFrame(rows_zero, columns=['Lang', 'JobSat'])
means_one_df = pd.DataFrame(rows_one, columns=['Lang', 'JobSat'])
    



Unnamed: 0,Lang,JobSat
0,Assembly,3.685683
0,Bash/Shell/PowerShell,3.658761
0,C,3.688193
0,C#,3.684729
0,C++,3.690286
0,Clojure,3.685663
0,Dart,3.689666
0,Elixir,3.687057
0,Erlang,3.687329
0,F#,3.684431


In [153]:
#This function will generate a comparison of the languages based on the mean satisfactions we obtained previously
def bar_comp(df1, df2, left_col, right_col):
    #Initialize the base dataframe to have language and two columns for the desired / not-desired means.
    comp_df = pd.DataFrame({'Language' : [], left_col : [], right_col : []})
    comp_df['Language'] = df1['Lang']
    comp_df[left_col] = df1['JobSat']
    comp_df[right_col] = df2['JobSat']
    #Determine the difference
    comp_df['Diff_JobSatByLang'] = comp_df[left_col] - comp_df[right_col]
    #sort the data so that we can observe worst (most negative) impact to best (most positive) impact
    comp_df = comp_df.sort_values(by=['Diff_JobSatByLang'])
    comp_df = comp_df.reset_index(drop=True)
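    #style.bar returns a new Styler instead of modifying comp_df, so calling it here and discarding the
    #result had no effect; apply .style.bar(subset=['Diff_JobSatByLang'], align='mid', color=['#d65f5f', '#5fba7d'])
    #to the returned dataframe when displaying it.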
    return comp_df

In [154]:
bar_comp(means_one_df, means_zero_df, 'Desired', 'Not Desired')

Unnamed: 0,Language,Desired,Not Desired,Diff_JobSatByLang
0,Scala,3.588407,3.692274,-0.103867
1,Dart,3.597406,3.689666,-0.09226
2,Erlang,3.601631,3.687329,-0.085699
3,Objective-C,3.603497,3.687769,-0.084271
4,Java,3.633554,3.702013,-0.068459
5,VBA,3.621479,3.686294,-0.064815
6,Kotlin,3.636534,3.693891,-0.057357
7,PHP,3.642347,3.691676,-0.049329
8,Elixir,3.64746,3.687057,-0.039598
9,Python,3.666251,3.699546,-0.033295


With the job satisfaction delta placed in a sorted chart, the difference in satisfaction between desiring and not desiring a language appears negligible, though not entirely absent. As we learned from the linear model, we can't claim an actual correlation (R-squared is under 0.005), but it is interesting to note that some languages (or, more likely, other aspects of the jobs that rely on those languages) show a minor effect. Scala, Dart, Erlang, and Objective-C all show a negative delta of roughly -0.08 to -0.1 when desired. Conversely, we see a small positive delta when developers desire something like Bash/Shell/PowerShell, at +0.1. Many languages (16 of the 27 options) show a negative delta when desired.

Given the lack of a viable correlation in the model, these small differences can't truly be attributed to the languages someone desires. I would hesitate to suggest that picking a random language will likely have a negative impact on your satisfaction, but it does make one wonder: "what is it about these languages that could cause a change in satisfaction?"
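
One way to put a delta of roughly 0.1 in context is to scale it by the spread of the scores. The sketch below computes a standardized mean difference (Cohen's d) on made-up data. The helper name `cohens_d` and the sample values are assumptions for illustration, and this calculation was not part of the analysis above.

```python
import numpy as np

def cohens_d(a, b):
    # Standardized mean difference using the pooled sample standard deviation
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up 1-5 satisfaction scores for two groups of respondents
desired = [4, 5, 3, 4, 2, 5, 4, 3]
not_desired = [4, 5, 3, 4, 3, 5, 4, 3]

d = cohens_d(desired, not_desired)
print(round(d, 3))
```

A value of d close to zero indicates that the two groups overlap almost completely, which would be consistent with treating the deltas in the table as negligible.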