# Data Scientist Survey Analysis - Genevieve Hayes

## Overview

In 2012, the [***Harvard Business Review***](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century) named Data Scientist the "sexiest job of the 21st century". Since then, Data Scientist has featured in numerous top job lists, including [Glassdoor's best jobs in America list](https://www.forbes.com/sites/louiscolumbus/2018/01/29/data-scientist-is-the-best-job-in-america-according-glassdoors-2018-rankings/#1f57b7ad5535), which has consistently placed Data Scientist in the number 1 position for the past three years. However, is being in the best job in America really any better than being in any other programming-related role, and has the sudden popularity of this profession had any impact on the types of people entering it? In this analysis, we explore these questions by analysing data collected by Stack Overflow as part of their annual developer survey.

## Research Questions

The motivation behind this analysis is to explore how data scientists compare with other non-data scientist software developers (subsequently referred to simply as "non-data scientists"). Consequently, in this analysis, we set out to answer the following questions:

1. How does the demographic profile of data scientists differ from that of non-data scientists?
2. What programming languages do data scientists favour and how do they differ from those used by non-data scientists?
3. How much coding experience do data scientists have compared to non-data scientists?
4. Are data scientists more satisfied with their jobs/careers than non-data scientists?

## The Data

In order to answer these questions, we make use of data collected by Stack Overflow in response to their 2018 (Annual) Developer Survey. This data can be downloaded from: <https://insights.stackoverflow.com/survey>. To get a feel for this data, we read in the data set and conduct some exploratory analysis below:

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.simplefilter('ignore')
%matplotlib inline

In [2]:
# Read in dataset
survey = pd.read_csv("./Data/developer_survey_2018/survey_results_public.csv")
survey.head()

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",Database administrator;DevOps specialist;Full-...,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
2,4,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,Engineering manager;Full-stack developer,...,,,,,,,,,,
3,5,No,No,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",100 to 499 employees,Full-stack developer,...,I don't typically exercise,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,35 - 44 years old,No,No,The survey was an appropriate length,Somewhat easy
4,7,Yes,No,South Africa,"Yes, part-time",Employed full-time,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","10,000 or more employees",Data or business analyst;Desktop or enterprise...,...,3 - 4 times per week,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,18 - 24 years old,Yes,,The survey was an appropriate length,Somewhat easy


In [3]:
# Print shape of dataset
print('The dataset contains', np.shape(survey)[0], 'rows and', np.shape(survey)[1], 'columns.')

The dataset contains 98855 rows and 129 columns.


In [4]:
# Print column names
list(survey.columns.values)

['Respondent',
 'Hobby',
 'OpenSource',
 'Country',
 'Student',
 'Employment',
 'FormalEducation',
 'UndergradMajor',
 'CompanySize',
 'DevType',
 'YearsCoding',
 'YearsCodingProf',
 'JobSatisfaction',
 'CareerSatisfaction',
 'HopeFiveYears',
 'JobSearchStatus',
 'LastNewJob',
 'AssessJob1',
 'AssessJob2',
 'AssessJob3',
 'AssessJob4',
 'AssessJob5',
 'AssessJob6',
 'AssessJob7',
 'AssessJob8',
 'AssessJob9',
 'AssessJob10',
 'AssessBenefits1',
 'AssessBenefits2',
 'AssessBenefits3',
 'AssessBenefits4',
 'AssessBenefits5',
 'AssessBenefits6',
 'AssessBenefits7',
 'AssessBenefits8',
 'AssessBenefits9',
 'AssessBenefits10',
 'AssessBenefits11',
 'JobContactPriorities1',
 'JobContactPriorities2',
 'JobContactPriorities3',
 'JobContactPriorities4',
 'JobContactPriorities5',
 'JobEmailPriorities1',
 'JobEmailPriorities2',
 'JobEmailPriorities3',
 'JobEmailPriorities4',
 'JobEmailPriorities5',
 'JobEmailPriorities6',
 'JobEmailPriorities7',
 'UpdateCV',
 'Currency',
 'Salary',
 'SalaryType',

Most of these columns are not needed to answer our research questions. For convenience, we keep only the columns we need and drop the rest.

In [5]:
# Drop unnecessary columns
survey = survey[['Respondent', 'DevType', 'Gender', 'Age', 'FormalEducation', 'UndergradMajor', 
                 'EducationTypes', 'SelfTaughtTypes', 'LanguageWorkedWith', 'YearsCoding', 
                 'YearsCodingProf', 'JobSatisfaction', 'CareerSatisfaction']]

As we are interested in comparing data scientists to non-data scientists, we need to be able to differentiate between the two. This is done using the `DevType` field. As a result, we should drop any rows where this field is missing, since we can't determine which subset these rows fit into.

In [6]:
# Delete rows where DevType is missing
survey = survey[pd.notnull(survey['DevType'])]

In [7]:
# Print size of reduced dataset
print('The reduced dataset contains', np.shape(survey)[0], 'rows and', np.shape(survey)[1], 'columns.')

The reduced dataset contains 92098 rows and 13 columns.


Making these adjustments removes 6757 rows and 116 columns from our dataset. Let's look at the descriptive statistics for these fields.

In [8]:
# Look at descriptive statistics for data (ignore Respondent since this is just an ID field)
survey.drop(['Respondent'], axis = 1).describe()

| | DevType | Gender | Age | FormalEducation | UndergradMajor | EducationTypes | SelfTaughtTypes | LanguageWorkedWith | YearsCoding | YearsCodingProf | JobSatisfaction | CareerSatisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 92098 | 63321 | 63423 | 90161 | 77589 | 66698 | 56332 | 76865 | 92021 | 77362 | 68839 | 75996 |
| unique | 9568 | 15 | 7 | 9 | 12 | 494 | 470 | 26271 | 11 | 11 | 7 | 7 |
| top | Back-end developer | Male | 25 - 34 years old | Bachelor’s degree (BA, BS, B.Eng., etc.) | Computer science, computer engineering, or sof... | Taught yourself a new language, framework, or ... | The official documentation and/or standards fo... | C#;JavaScript;SQL;HTML;CSS | 3-5 years | 0-2 years | Moderately satisfied | Moderately satisfied |
| freq | 6417 | 58426 | 31277 | 41848 | 49641 | 6708 | 3377 | 1339 | 22918 | 23284 | 25854 | 27747 |


All fields are text fields, including fields that we would expect to be numeric, such as `Age`. Some fields contain a very large number of unique values (e.g. `LanguageWorkedWith`). This is because for some fields, multiple selections were allowed, and all selections are concatenated together in a single string. The number of missing values varies from column to column. 
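As a rough sketch of how the share of missing values per column could be quantified, the snippet below uses a small made-up dataframe (`toy` is illustration data, not part of the survey). Applying the same call to `survey` should give the per-column proportions.

```python
import pandas as pd

# Toy data standing in for two survey columns (values are illustrative only)
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female', None],
    'Age': ['25 - 34 years old', '18 - 24 years old', None, '35 - 44 years old'],
})

# isnull() flags missing cells; the column-wise mean is the missing share
missing_share = toy.isnull().mean()
print(missing_share)
```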

Let's look at the unique values for each of these features.

In [9]:
# Display (top 10) value counts for DevType
survey['DevType'].value_counts().nlargest(10)

Back-end developer                                                              6417
Full-stack developer                                                            6104
Back-end developer;Front-end developer;Full-stack developer                     4460
Mobile developer                                                                3518
Student                                                                         3222
Back-end developer;Full-stack developer                                         3128
Front-end developer                                                             2608
Front-end developer;Full-stack developer                                        1117
Back-end developer;Front-end developer                                          1030
Back-end developer;Front-end developer;Full-stack developer;Mobile developer    1008
Name: DevType, dtype: int64

In [10]:
# Display value counts for Gender
survey['Gender'].value_counts()

Male                                                                         58426
Female                                                                        3930
Non-binary, genderqueer, or gender non-conforming                              272
Female;Transgender                                                             143
Male;Non-binary, genderqueer, or gender non-conforming                         127
Transgender                                                                    103
Female;Male                                                                     97
Transgender;Non-binary, genderqueer, or gender non-conforming                   51
Female;Non-binary, genderqueer, or gender non-conforming                        50
Female;Male;Transgender;Non-binary, genderqueer, or gender non-conforming       48
Male;Transgender                                                                29
Female;Transgender;Non-binary, genderqueer, or gender non-conforming            23
Fema

The majority of respondents identify themselves as either male or female. All other responses could be grouped together into a single "other" category for simplicity.
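One possible way to express this grouping is sketched below on a made-up series (`toy_gender` and `gender_grouped` are hypothetical names). It keeps "Male" and "Female", leaves missing responses missing, and relabels everything else as "Other". The function used later in this notebook takes a different route, via `np.select`.

```python
import pandas as pd

# Toy responses, including a multi-selection answer and a missing value
toy_gender = pd.Series(['Male', 'Female', 'Female;Transgender', None])

# Rows to leave untouched: the two main categories and missing responses
keep = toy_gender.isin(['Male', 'Female']) | toy_gender.isna()

# where() keeps values where the mask is True and substitutes 'Other' elsewhere
gender_grouped = toy_gender.where(keep, 'Other')
print(gender_grouped.tolist())
```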

In [11]:
# Display value counts for Age
survey['Age'].value_counts()

25 - 34 years old     31277
18 - 24 years old     15022
35 - 44 years old     11250
45 - 54 years old      3209
Under 18 years old     1592
55 - 64 years old       908
65 years or older       165
Name: Age, dtype: int64

Ages are given as ranges, but we would like to be able to calculate summary statistics for this field. To do so, we can create a numeric field in which the age of each respondent is estimated as the mid-point of the range.
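A minimal sketch of the mid-point idea, using a dictionary lookup on made-up data, is shown here. The names `AGE_MIDPOINTS` and `toy_age` are hypothetical. Note that the two open-ended bands ("Under 18" and "65 or older") have no true mid-point, so the values assigned to them are judgement calls.

```python
import pandas as pd

# Representative numeric value for each age band
# (the two open-ended bands are assumptions, not true mid-points)
AGE_MIDPOINTS = {
    'Under 18 years old': 16.0,
    '18 - 24 years old': 21.0,
    '25 - 34 years old': 29.5,
    '35 - 44 years old': 39.5,
    '45 - 54 years old': 49.5,
    '55 - 64 years old': 59.5,
    '65 years or older': 69.5,
}

toy_age = pd.Series(['25 - 34 years old', '18 - 24 years old', None])

# map() looks each band up in the dictionary; unmatched or missing values become NaN
toy_age_num = toy_age.map(AGE_MIDPOINTS)

# mean() skips NaN by default
mean_age = toy_age_num.mean()
print(mean_age)
```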

In [12]:
# Display value counts for FormalEducation
survey['FormalEducation'].value_counts()

Bachelor’s degree (BA, BS, B.Eng., etc.)                                              41848
Master’s degree (MA, MS, M.Eng., MBA, etc.)                                           20399
Some college/university study without earning a degree                                11070
Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)     8510
Associate degree                                                                       2798
Other doctoral degree (Ph.D, Ed.D., etc.)                                              2056
Primary/elementary school                                                              1545
Professional degree (JD, MD, etc.)                                                     1318
I never completed any formal education                                                  617
Name: FormalEducation, dtype: int64

In [13]:
# Display value counts for UndergradMajor
survey['UndergradMajor'].value_counts()

Computer science, computer engineering, or software engineering          49641
Another engineering discipline (ex. civil, electrical, mechanical)        6716
Information systems, information technology, or system administration     6412
A natural science (ex. biology, chemistry, physics)                       2931
Mathematics or statistics                                                 2739
Web development or web design                                             2401
A business discipline (ex. accounting, finance, marketing)                1864
A humanities discipline (ex. literature, history, philosophy)             1543
A social science (ex. anthropology, psychology, political science)        1333
Fine arts or performing arts (ex. graphic design, music, studio art)      1111
I never declared a major                                                   664
A health science (ex. nursing, pharmacy, radiology)                        234
Name: UndergradMajor, dtype: int64

For both `FormalEducation` and `UndergradMajor`, several of the categories could be grouped together for simplicity. Also, category names could be shortened, to make it easier to use as labels when creating graphs.

In [14]:
# Display (top 10) value counts for EducationTypes
survey['EducationTypes'].value_counts().nlargest(10)

Taught yourself a new language, framework, or tool without taking a formal course                                                                                                                                                                 6708
Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course                                                                                     4381
Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software                                                                                                                             3994
Received on-the-job training in software development;Taught yourself a new language, framework, or tool without taking a formal course                                                                                                            2336
Taken an onl

In [15]:
# Display (top 10) value counts for SelfTaughtTypes
survey['SelfTaughtTypes'].value_counts().nlargest(10)

The official documentation and/or standards for the technology;Questions & answers on Stack Overflow                                                                                                                                                                                                                                                                             3377
The official documentation and/or standards for the technology;Questions & answers on Stack Overflow;The technology’s online help system                                                                                                                                                                                                                                         3087
The official documentation and/or standards for the technology;Questions & answers on Stack Overflow;Online developer communities other than Stack Overflow (ex. forums, listservs, IRC channels, etc.);The technology’s online help system                 

In [16]:
# Display (top 10) value counts for LanguageWorkedWith
survey['LanguageWorkedWith'].value_counts().nlargest(10)

C#;JavaScript;SQL;HTML;CSS                1339
JavaScript;PHP;SQL;HTML;CSS               1228
Java                                      1011
JavaScript;HTML;CSS                        870
C#;JavaScript;SQL;TypeScript;HTML;CSS      828
JavaScript;PHP;SQL;HTML;CSS;Bash/Shell     764
JavaScript;PHP;HTML;CSS                    713
Java;JavaScript;SQL;HTML;CSS               527
C#                                         471
JavaScript;TypeScript;HTML;CSS             390
Name: LanguageWorkedWith, dtype: int64

`EducationTypes`, `SelfTaughtTypes` and `LanguageWorkedWith` all contain multiple selections that have been concatenated into a single semicolon-separated string. These will need to be split into individual entries for analysis. In some cases, the resulting categories can also be grouped into a smaller number of categories.
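To illustrate what splitting a multi-selection field involves, the sketch below counts individual selections in a few made-up responses using `Counter` (the `responses` list is illustration data only).

```python
from collections import Counter

# Toy multi-selection responses; None stands in for a missing answer
responses = ['C#;JavaScript;SQL', 'Java', None, 'JavaScript;HTML;CSS']

# Split each non-missing response on ';' and count every individual selection
counts = Counter(
    selection
    for response in responses
    if response is not None
    for selection in response.split(';')
)
print(counts.most_common(3))
```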

In [17]:
# Display value counts for YearsCoding
survey['YearsCoding'].value_counts()

3-5 years           22918
6-8 years           19051
9-11 years          11973
0-2 years           10310
12-14 years          7905
15-17 years          6033
18-20 years          4969
30 or more years     3408
21-23 years          2602
24-26 years          1818
27-29 years          1034
Name: YearsCoding, dtype: int64

In [18]:
# Display value counts for YearsCodingProf
survey['YearsCodingProf'].value_counts()

0-2 years           23284
3-5 years           21224
6-8 years           11304
9-11 years           7516
12-14 years          4257
15-17 years          2988
18-20 years          2805
21-23 years          1356
30 or more years     1279
24-26 years           848
27-29 years           501
Name: YearsCodingProf, dtype: int64

As with `Age`, we would like to calculate summary statistics for `YearsCoding` and `YearsCodingProf`. To do so, we will need to create new numeric fields in which a respondent's years coding and years coding professionally are estimated as the mid-point of the corresponding range.
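Because these labels follow a regular "low-high years" pattern, the mid-points could also be derived from the label text itself. The sketch below uses a hypothetical helper, `range_midpoint`. It returns `None` for the open-ended "30 or more years" band, which has no upper bound, so any number assigned to it is a judgement call.

```python
import re

def range_midpoint(label):
    """Return the mid-point of a label such as '3-5 years', or None if the band is open-ended."""
    numbers = [int(n) for n in re.findall(r'\d+', label)]
    if len(numbers) == 2:
        return sum(numbers) / 2
    # Open-ended band: no upper bound to average with
    return None

labels = ['0-2 years', '3-5 years', '27-29 years', '30 or more years']
midpoints = {label: range_midpoint(label) for label in labels}
print(midpoints)
```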

In [19]:
# Display value counts for JobSatisfaction
survey['JobSatisfaction'].value_counts()

Moderately satisfied                  25854
Extremely satisfied                   12335
Slightly satisfied                     9962
Slightly dissatisfied                  7007
Moderately dissatisfied                6274
Neither satisfied nor dissatisfied     4943
Extremely dissatisfied                 2464
Name: JobSatisfaction, dtype: int64

In [20]:
# Display value counts for CareerSatisfaction
survey['CareerSatisfaction'].value_counts()

Moderately satisfied                  27747
Extremely satisfied                   14213
Slightly satisfied                    13418
Slightly dissatisfied                  6539
Neither satisfied nor dissatisfied     6268
Moderately dissatisfied                5220
Extremely dissatisfied                 2591
Name: CareerSatisfaction, dtype: int64

To allow summary statistics to be calculated for `JobSatisfaction` and `CareerSatisfaction`, we can convert the satisfaction rating scale into a numeric scale where "Extremely dissatisfied" = 1, "Moderately dissatisfied" = 2, etc.
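The scale conversion can be sketched as a lookup built from the ordered list of labels. The names `SATISFACTION_ORDER` and `SATISFACTION_SCORE` are hypothetical and the data is made up. Treating the resulting scores as evenly spaced is itself an assumption, since the underlying scale is ordinal.

```python
import pandas as pd

# Labels ordered from least to most satisfied
SATISFACTION_ORDER = [
    'Extremely dissatisfied',
    'Moderately dissatisfied',
    'Slightly dissatisfied',
    'Neither satisfied nor dissatisfied',
    'Slightly satisfied',
    'Moderately satisfied',
    'Extremely satisfied',
]

# Number the labels 1..7 in that order
SATISFACTION_SCORE = {label: score for score, label in enumerate(SATISFACTION_ORDER, start=1)}

toy_sat = pd.Series(['Extremely satisfied', 'Slightly dissatisfied', None])

# Missing responses stay missing (NaN) after the lookup
toy_sat_num = toy_sat.map(SATISFACTION_SCORE)
print(toy_sat_num.tolist())
```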

## Data Preparation

Now that we have explored our data, the next step is to wrangle the data to adjust for the issues identified above. 

**Create Data Subsets**

As we wish to compare data scientists with non-data scientists, we first need to split our data into data scientist and non-data scientist subsets.

In [21]:
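# Create data scientist and non-data scientist subsets.
# Rows with a missing DevType were dropped above, so the mask is strictly boolean.
# .copy() makes each subset an independent dataframe, so the columns added later
# are not written to a slice of survey.
is_ds = survey['DevType'].str.contains('Data scientist')
survey_ds = survey[is_ds].copy()
survey_non_ds = survey[~is_ds].copy()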

In [22]:
print('There are', len(survey_ds), 'rows in the data scientist subset and', 
      len(survey_non_ds), 'rows in the non-data scientist subset.')

There are 7088 rows in the data scientist subset and 85010 rows in the non-data scientist subset.
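The two subsets account for every row of the reduced dataset (7088 + 85010 = 92098). The sketch below shows the same kind of split and check on a made-up dataframe. The `DevType` strings and the `toy` names are illustrative assumptions, and `regex=False` is used here so that the search text is treated as a literal substring.

```python
import pandas as pd

# Toy data; DevType holds semicolon-separated role selections (illustrative values)
toy = pd.DataFrame({
    'Respondent': [1, 2, 3],
    'DevType': [
        'Data scientist or machine learning specialist;Student',
        'Back-end developer',
        'Data or business analyst',
    ],
})

# Boolean mask: True where the literal text 'Data scientist' appears anywhere in DevType
is_ds = toy['DevType'].str.contains('Data scientist', regex=False)

toy_ds = toy[is_ds].copy()
toy_non_ds = toy[~is_ds].copy()

# The two subsets should partition the original rows
print(len(toy_ds), len(toy_non_ds), len(toy))
```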


**Simplify Fields with Large Numbers of Categories/Long Category Labels**

Simplify `Gender`, `FormalEducation` and `UndergradMajor` to reduce the length of category labels and to group similar categories into a single category.

In [23]:
# Simplify Gender
def simplify_gender(df):
    """Add a new field, Gender_New, to dataframe, containing simplified Gender values.
    
    INPUT
    df - dataframe containing the field Gender
       
    OUTPUT
    df - modified version of the input dataframe containing a new field, Gender_New
    """
    conditions_gender = [(df['Gender'] == 'Male'),
                         (df['Gender'] == 'Female'),
                         (df['Gender'] != 'Male') & (df['Gender'] != 'Female') 
                         & (pd.isnull(df['Gender']) == False)]

    choices_gender = ['Male', 'Female', 'Other']

    df['Gender_New'] = np.select(conditions_gender, choices_gender, default=np.nan)
    
    return df
                
# Apply function to subsets
survey_ds = simplify_gender(survey_ds)
survey_non_ds = simplify_gender(survey_non_ds)

In [24]:
# Simplify FormalEducation
def simplify_ed(df):
    """Add a new field, FormalEducation_New, to dataframe, containing simplified FormalEducation values.
    
    INPUT
    df - dataframe containing the field FormalEducation
       
    OUTPUT
    df - modified version of the input dataframe containing a new field, FormalEducation_New
    """
    conditions_ed = [(df['FormalEducation'] == 'Bachelor’s degree (BA, BS, B.Eng., etc.)'),
                     (df['FormalEducation'] == 'Master’s degree (MA, MS, M.Eng., MBA, etc.)'),
                     (df['FormalEducation'] == 'Professional degree (JD, MD, etc.)'),   
                     (df['FormalEducation'] == 'Associate degree'),
                     (df['FormalEducation'] == 'Other doctoral degree (Ph.D, Ed.D., etc.)'),
                     (df['FormalEducation'] == 'Some college/university study without earning a degree') 
                     | (df['FormalEducation'] == 'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)') 
                     | (df['FormalEducation'] == 'Primary/elementary school')
                     | (df['FormalEducation'] == 'I never completed any formal education')]

    choices_ed = ['Bachelors', 'Masters', 'Professional', 'Associate', 'Doctorate', 'No Degree']

    df['FormalEducation_New'] = np.select(conditions_ed, choices_ed, default=np.nan)
    
    return df

# Apply function to subsets
survey_ds = simplify_ed(survey_ds)
survey_non_ds = simplify_ed(survey_non_ds)

In [25]:
# Simplify Undergraduate Major
def simplify_major(df):
    """Add a new field, UndergradMajor_New, to dataframe, containing simplified UndergradMajor values.
    
    INPUT
    df - dataframe containing the field UndergradMajor
       
    OUTPUT
    df - modified version of the input dataframe containing a new field, UndergradMajor_New
    """
    conditions_major = [(df['UndergradMajor'] == 'Computer science, computer engineering, or software engineering'),
                        (df['UndergradMajor'] == 'Another engineering discipline (ex. civil, electrical, mechanical)'),
                        (df['UndergradMajor'] == 'Information systems, information technology, or system administration'),
                        (df['UndergradMajor'] == 'Mathematics or statistics'),
                        (df['UndergradMajor'] == 'A natural science (ex. biology, chemistry, physics)')
                        |(df['UndergradMajor'] == 'A health science (ex. nursing, pharmacy, radiology)'),
                        (df['UndergradMajor'] == 'Web development or web design'),
                        (df['UndergradMajor'] == 'A business discipline (ex. accounting, finance, marketing)'),
                        (df['UndergradMajor'] == 'A humanities discipline (ex. literature, history, philosophy)')
                        | (df['UndergradMajor'] == 'A social science (ex. anthropology, psychology, political science)')
                        | (df['UndergradMajor'] == 'Fine arts or performing arts (ex. graphic design, music, studio art)')]

    choices_major = ['Computer Science', 'Engineering', 'IT/Info Systems', 'Math/Statistics', 'Other Science',
                     'Web Design/Development', 'Business', 'Arts/Humanities/Social Science']

    df['UndergradMajor_New'] = np.select(conditions_major, choices_major, default=np.nan)
    
    return df

# Apply function to subsets
survey_ds = simplify_major(survey_ds)
survey_non_ds = simplify_major(survey_non_ds)

**Convert Ranges to Single Numeric Values**

Create new fields that replace the ranges for `Age`, `YearsCoding` and `YearsCodingProf` with single numeric values that are the mid-point of the ranges.

In [26]:
# Convert Age to numeric
def convert_age(df):
    """Add a new field, Age_Num, to dataframe, containing numeric values equivalent to the midpoints of the corresponding
    Age range values.
    
    INPUT
    df - dataframe containing the field Age
       
    OUTPUT
    df - modified version of the input dataframe containing a new field, Age_Num
    """
    conditions_age = [(df['Age'] == 'Under 18 years old'),
                      (df['Age'] == '18 - 24 years old'),
                      (df['Age'] == '25 - 34 years old'),
                      (df['Age'] == '35 - 44 years old'),
                      (df['Age'] == '45 - 54 years old'),
                      (df['Age'] == '55 - 64 years old'),
                      (df['Age'] == '65 years or older')]

    choices_age = [16, 21, 29.5, 39.5, 49.5, 59.5, 69.5]

    df['Age_Num'] = np.select(conditions_age, choices_age, default=np.nan)
    
    return df

# Apply function to subsets
survey_ds = convert_age(survey_ds)
survey_non_ds = convert_age(survey_non_ds)

In [27]:
# Convert YearsCoding and YearsCodingProf to numeric
def convert_coding(df, col, new_col):
    """Add a new field, new_col, to dataframe, containing numeric values equivalent to the midpoints of the corresponding
    col range values.
    
    INPUT
    df - dataframe containing the field col
       
    OUTPUT
    df - modified version of the input dataframe containing a new field, new_col
    """
    conditions_coding = [(df[col] == '0-2 years'),
                         (df[col] == '3-5 years'),
                         (df[col] == '6-8 years'),
                         (df[col] == '9-11 years'),
                         (df[col] == '12-14 years'),
                         (df[col] == '15-17 years'),
                         (df[col] == '18-20 years'),
                         (df[col] == '21-23 years'),
                         (df[col] == '24-26 years'),
                         (df[col] == '27-29 years'),
                         (df[col] == '30 or more years')]
    
    choices_coding = [1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31]
    
    df[new_col] = np.select(conditions_coding, choices_coding, default=np.nan)
    
    return df

# Apply function to subsets
survey_ds = convert_coding(survey_ds, 'YearsCoding', 'YearsCoding_Num')
survey_ds = convert_coding(survey_ds, 'YearsCodingProf', 'YearsCodingProf_Num')

survey_non_ds = convert_coding(survey_non_ds, 'YearsCoding', 'YearsCoding_Num')
survey_non_ds = convert_coding(survey_non_ds, 'YearsCodingProf', 'YearsCodingProf_Num')

**Convert Satisfaction Scales to Numeric Values**

Create new fields that replace the categorical `JobSatisfaction` and `CareerSatisfaction` scales with numeric scales.

In [28]:
# Convert satisfaction scales to numeric scales
def convert_scale(df, col, new_col):
    """Add a new field, new_col, to dataframe, containing numeric equivalents to the categorical scale reflected in col
    
    INPUT
    df - dataframe containing the field col
       
    OUTPUT
    df - modified version of the input dataframe containing a new field, new_col
    """
    conditions_sat = [(df[col] == 'Extremely dissatisfied'),
                     (df[col] == 'Moderately dissatisfied'),
                     (df[col] == 'Slightly dissatisfied'),
                     (df[col] == 'Neither satisfied nor dissatisfied'),
                     (df[col] == 'Slightly satisfied'),
                     (df[col] == 'Moderately satisfied'),
                     (df[col] == 'Extremely satisfied')]
    
    choices_sat = [1, 2, 3, 4, 5, 6, 7]
    
    df[new_col] = np.select(conditions_sat, choices_sat, default=np.nan)
    
    return df

# Apply function to subsets
survey_ds = convert_scale(survey_ds, 'JobSatisfaction', 'JobSatisfaction_Num')
survey_ds = convert_scale(survey_ds, 'CareerSatisfaction', 'CareerSatisfaction_Num')

survey_non_ds = convert_scale(survey_non_ds, 'JobSatisfaction', 'JobSatisfaction_Num')
survey_non_ds = convert_scale(survey_non_ds, 'CareerSatisfaction', 'CareerSatisfaction_Num')

**Split Multi-Selection Fields**

For the fields where multiple selections were possible (i.e. `EducationTypes`, `SelfTaughtTypes` and `LanguageWorkedWith`), split the strings containing the multiple selections into a list of selections and then concatenate these lists into a single list (dropping any missing values in the process). In the case of `EducationTypes` and `SelfTaughtTypes`, also simplify these fields to reduce the length of category labels and to group similar categories into a single category.

In [33]:
# Create dataframe containing split string values by respondent number
def split_list(df, col):
    """Create a new dataframe that splits the values of multi-selection column col into individual selections and 
    places each selection value on a separate row. This new dataframe can be linked back to the original dataframe by 
    Respondent value.
    
    INPUT
    df - dataframe containing the multi-selection field col
       
    OUTPUT
    out_df - new dataframe giving split values of col
    """
    in_res = list(df['Respondent'])
    in_list = list(df[col])
    
    out_res = []
    out_list = []
    
    for i in range(len(in_list)):
        if pd.isnull(in_list[i]) == False:
            vals = in_list[i].split(';')
            res = [in_res[i]]*len(vals)
            
            out_list.append(vals)
            out_res.append(res)
    
    out_df = pd.DataFrame({'Respondent': list(np.concatenate(out_res)), col: list(np.concatenate(out_list))})
    
    return out_df
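For comparison, a similar one-row-per-selection table can be sketched with pandas' built-in `explode`. This is an alternative under stated assumptions rather than the approach used in this notebook: `split_list_explode` is a hypothetical name and the toy data is made up.

```python
import pandas as pd

def split_list_explode(df, col):
    """Hypothetical explode-based variant: one row per (Respondent, selection) pair."""
    # Keep only the key and the multi-selection column, dropping missing responses
    out_df = df[['Respondent', col]].dropna(subset=[col]).copy()
    # Turn each ';'-separated string into a list of selections
    out_df[col] = out_df[col].str.split(';')
    # explode() repeats the Respondent value once per list element
    return out_df.explode(col).reset_index(drop=True)

toy = pd.DataFrame({
    'Respondent': [1, 3, 4],
    'LanguageWorkedWith': ['C#;SQL', None, 'Java'],
})

long_df = split_list_explode(toy, 'LanguageWorkedWith')
print(long_df)
```

As with `split_list`, the result can be linked back to the original dataframe through `Respondent`.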

In [40]:
# Split EducationTypes
ed_types_ds = split_list(survey_ds, 'EducationTypes')
ed_types_non_ds = split_list(survey_non_ds, 'EducationTypes')
    
# Simplify category labels
def convert_ed_type(df):
    """Add a new field, EducationTypes_New, to dataframe, containing simplified EducationTypes values.
    
    INPUT
    df - dataframe containing the field EducationTypes
       
    OUTPUT
    df - modified version of the input dataframe containing a new field, EducationTypes_New
    """
    conditions_edtype = [(df['EducationTypes'] == 'Taken a part-time in-person course in programming or software development'),
                         (df['EducationTypes'] == 'Taken an online course in programming or software development (e.g. a MOOC)'),
                         (df['EducationTypes'] == 'Completed an industry certification program (e.g. MCPD)'),
                         (df['EducationTypes'] == 'Participated in online coding competitions (e.g. HackerRank, CodeChef, TopCoder)'),
                         (df['EducationTypes'] == 'Contributed to open source software'), 
                         (df['EducationTypes'] == 'Taught yourself a new language, framework, or tool without taking a formal course'),
                         (df['EducationTypes'] == 'Participated in a hackathon'),
                         (df['EducationTypes'] == 'Received on-the-job training in software development'),
                         (df['EducationTypes'] == 'Participated in a full-time developer training program or bootcamp')]  
    
    choices_edtype = ['Part Time In-Person Course', 'Online Course', 'Industry Certification', 
                      'Online Coding Competition', 'Open Source', 'Self Taught',
                      'Hackathon', 'On-the-Job Training', 'Full Time Course/Bootcamp']

    df['EducationTypes_New'] = np.select(conditions_edtype, choices_edtype, default=np.nan)
    
    return df

ed_types_ds = convert_ed_type(ed_types_ds)
ed_types_non_ds = convert_ed_type(ed_types_non_ds)

In [38]:
# Split LanguageWorkedWith
languages_ds = split_list(survey_ds, 'LanguageWorkedWith')
languages_non_ds = split_list(survey_non_ds, 'LanguageWorkedWith')

In [41]:
# Split SelfTaughtTypes and simplify/rationalize category labels
def convert_self_type(data):
    """Take a list of SelfTaughtTypes selection values and create a new list containing simplified version of these values.
    
    INPUT
    data - list containing SelfTaughtTypes selections
       
    OUTPUT
    new_data - modified version of the input list containing simplified SelfTaughtTypes selections
    """
    temp = np.array(data)
    conditions_selftype = [(temp == 'A college/university computer science or software engineering book')
                            |(temp == 'A book or e-book from O’Reilly, Apress, or a similar publisher'),
                           (temp == 'The official documentation and/or standards for the technology')
                           |(temp == 'The technology’s online help system'),
                           (temp == 'Tapping your network of friends, family, and peers versed in the technology')
                           |(temp == 'Pre-scheduled tutoring or mentoring sessions with a friend or colleague'),
                           (temp == 'Internal Wikis, chat rooms, or documentation set up by my company for employees'),
                           (temp == 'Online developer communities other than Stack Overflow (ex. forums, listservs, IRC channels, etc.)')
                           |(temp == 'Questions & answers on Stack Overflow')]
    
    
    choices_selftype = ['Textbook', 'Official Documentation/Help', 'Friends/Peers', 
                        'Company Resources', 'Online Community']

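    # Unmatched selections get the string 'nan' (np.NaN was removed in NumPy 2.0, and a float default does not mix with string choices)
    new_data = np.select(conditions_selftype, choices_selftype, default='nan')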
    return new_data

def process_self_type(df):
    """Create a new dataframe that splits the values of multi-selection column SelfTaughtTypes into individual selections and 
    places each selection value on a separate row, simplifying the values at the same time. This new dataframe can be linked 
    back to the original dataframe by Respondent value.
    
    INPUT
    df - dataframe containing the multi-selection field SelfTaughtTypes
       
    OUTPUT
    out_df - new dataframe giving split values of SelfTaughtTypes
    """
    in_res = list(df['Respondent'])
    in_list = list(df['SelfTaughtTypes'])
    
    out_res = []
    out_list = []
    
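    for res_id, selection in zip(in_res, in_list):
        # Skip respondents who left the question blank
        if not pd.isnull(selection):
            # Simplify each selection and remove duplicates within the respondent
            vals = list(set(convert_self_type(selection.split(';'))))
            res = [res_id]*len(vals)
            
            out_list.append(vals)
            out_res.append(res)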
    
    out_df = pd.DataFrame({'Respondent': list(np.concatenate(out_res)), 'SelfTaughtTypes': list(np.concatenate(out_list))})
    
    return out_df

# Apply function to subsets
self_types_ds = process_self_type(survey_ds)
self_types_non_ds = process_self_type(survey_non_ds)
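
For comparison, the same wide-to-long reshaping can be sketched with pandas' built-in `str.split` and `explode` methods. This is an illustrative alternative on toy data rather than the method used above: the frame `toy` and its values are hypothetical, and the sketch neither simplifies the labels nor removes duplicates within a respondent.

```python
import pandas as pd

# Toy data: one multi-selection string per respondent, None where unanswered
toy = pd.DataFrame({
    'Respondent': [1, 2, 3],
    'SelfTaughtTypes': ['Questions & answers on Stack Overflow;Textbook A',
                        None,
                        'Textbook A'],
})

long_df = (toy.dropna(subset=['SelfTaughtTypes'])           # skip unanswered rows
              .assign(SelfTaughtTypes=lambda d: d['SelfTaughtTypes'].str.split(';'))
              .explode('SelfTaughtTypes')                   # one row per selection
              .reset_index(drop=True))
print(long_df)
```

As with `process_self_type`, the result can be linked back to the original dataframe through the `Respondent` column.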

**Missing Values**

In the previous section we identified that all of the fields under consideration contain some missing values. As we do not intend to build any ML models with this data, and most of our analysis will take the form of univariate statistical analysis, it is not necessary to impute these missing values. Instead, we will simply exclude them from any calculations. This will be done in the analysis section, since many of the statistical functions we intend to use have built-in functionality for excluding missing values. Where this is not the case, we will exclude the missing values manually on a case-by-case basis.
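
As a minimal illustration of both approaches, consider a toy series (`toy` is a hypothetical name, not part of the survey data). pandas reductions such as `mean` skip missing values by default, while `dropna` removes them explicitly.

```python
import numpy as np
import pandas as pd

# Toy numeric series with one missing value
toy = pd.Series([1.0, np.nan, 3.0])

mean_default = toy.mean()               # NaN is skipped by default (skipna=True)
mean_strict = toy.mean(skipna=False)    # NaN propagates into the result
n_valid = toy.count()                   # number of non-missing values
mean_manual = toy.dropna().mean()       # manual exclusion gives the same mean

print(mean_default, mean_strict, n_valid, mean_manual)
```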

## Data Analysis

Now that we have processed our data, our next step is to apply statistical analysis techniques to this processed data in order to answer our research questions.

### 1. How does the demographic profile of data scientists differ from that of non-data scientists?

**Gender**

In [57]:
# Compare the gender profile of the two subsets
# Get proportion of dataset by category
def get_proportions(df, col):
    
    # Summarize by column value
    summary = df[['Respondent', col]].groupby([col]).count()
    
    # Drop the 'nan' placeholder category, if present
    summary = summary.drop('nan', errors='ignore')

    # Convert counts to proportions
    props = summary/summary['Respondent'].sum()
    
    return props

# Create summary dataset for comparing the two subsets
def create_summary(df_ds, df_non_ds, col):
    # Get proportions for each subset
    props_ds = get_proportions(df_ds, col)
    props_non_ds = get_proportions(df_non_ds, col)

create_summary(survey_ds, survey_non_ds, 'Gender_New')

            Respondent
Gender_New            
Female             350
Male              4568
Other               96
            Respondent
Gender_New            
Female        0.069805
Male          0.911049
Other         0.019146
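
The count-then-divide calculation can be cross-checked with pandas' `value_counts(normalize=True)`. The sketch below rebuilds a toy column from the counts printed above (`toy_gender` is a hypothetical name) and should reproduce the same proportions.

```python
import pandas as pd

# Toy column rebuilt from the counts shown in the output above
toy_gender = pd.Series(['Female'] * 350 + ['Male'] * 4568 + ['Other'] * 96)

# Share of each category; true NaN values are excluded by default
props = toy_gender.value_counts(normalize=True)
print(props.round(6))
```

Note that `value_counts` only excludes true missing values, so a string placeholder such as `'nan'` would still need to be removed separately.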


**Age**

### 2. What programming languages do data scientists favour and how do they differ from those used by non-data scientists?

### 3. How much coding experience do data scientists have compared to non-data scientists?

### 4. Are data scientists more satisfied with their jobs/careers than non-data scientists?

## Results