This notebook researches the first two project motivation questions:
1. "What are some defining characteristics of Data Scientists/Machine Learning Specialists?"
2. "What type(s) of technologies do Data Scientists/Machine Learning Specialists use?"

The Stack Overflow survey contains some interesting questions that may help us pull together characteristics of Data Scientists and the technologies they use.  First, let's import the modules we need along with the Stack Overflow 2019 Developer Survey data.

In [None]:
# Import the necessary modules and StackOverflow csv files for 2019.
import numpy as np
import pandas as pd
from collections import defaultdict
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('./2019/survey_results_public.csv')
schema = pd.read_csv('./2019/survey_results_schema.csv')
df.shape

---
A quick view of the survey results helps get a flavor for the underlying data:

In [None]:
df.head()

---
A quick view of the survey schema gives us a feel for how the questions are stored and structured:

In [None]:
schema.head()

---
The key survey column, "DevType", will help us determine what job roles each respondent has.  Here is the question that is presented to the survey takers:

In [None]:
list(schema[schema['Column'] == 'DevType']['QuestionText'])[0]

---
We want to create a few functions that will be utilized going forward.  They allow us to count and/or display various data pulled from the survey.

In [None]:
# Method taken from Udacity Data Science course: https://www.udacity.com/course/data-scientist-nanodegree--nd025
def total_count(df, col1, col2, look_for, delim=';'):
    """
    INPUT:
    df - the pandas dataframe you want to search
    col1 - the column name you want to look through
    col2 - the column you want to count values from
    look_for - a list of strings you want to search for in each row of df[col1]
    delim - string delimiter value to break up the string by

    OUTPUT:
    new_df - a dataframe of each look_for with the count of how often it shows up
    """
    new_df = defaultdict(int)
    # loop through each value we are looking for
    for val in look_for:
        # loop through rows
        for idx in range(df.shape[0]):
            # if the value is one of the delimited items in the row, add that row's count
            if val in df[col1][idx].split(delim):
                new_df[val] += int(df[col2][idx])
    new_df = pd.DataFrame(pd.Series(new_df)).reset_index()
    new_df.columns = [col1, col2]
    # sort by the count column (col2) rather than a hard-coded 'count' name
    new_df.sort_values(col2, ascending=False, inplace=True)
    return new_df
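
The nested loops in `total_count` can be slow on large frames. As a minimal sketch of the same idea, assuming a recent pandas with `Series.explode`, the delimited strings can be split and exploded so a single `groupby` does the counting. The name `total_count_vectorized` and the toy data are hypothetical and are not part of the original notebook; unlike `total_count`, this version counts every item it finds rather than only those in a `look_for` list.

```python
import pandas as pd

def total_count_vectorized(df, col1, col2, delim=';'):
    """Sum df[col2] for every individual item found in the delimited strings of df[col1]."""
    exploded = (
        df[[col1, col2]]
        .dropna(subset=[col1])                                  # skip rows with no answer
        .assign(**{col1: lambda d: d[col1].str.split(delim)})   # 'A;B' -> ['A', 'B']
        .explode(col1)                                          # one row per individual item
    )
    return (
        exploded.groupby(col1)[col2]
        .sum()
        .sort_values(ascending=False)
        .reset_index()
    )

# Toy data in the shape total_count receives: a delimited label plus a count column.
toy = pd.DataFrame({
    'method': ['A;B', 'B', 'A;C', 'C'],
    'count': [3, 2, 1, 4],
})
toy_counts = total_count_vectorized(toy, 'method', 'count')
print(toy_counts)
```

On the toy frame, `A` appears in rows with counts 3 and 1, so its total is 4, while `B` and `C` each total 5.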

In [None]:
# Inspired by Udacity Data Science course: https://www.udacity.com/course/data-scientist-nanodegree--nd025
def clean_and_plot(df, possible_vals, col='', title='', plot=True, xaxis='method'):
    '''
    INPUT
        df - a dataframe holding the column in the col parameter
        possible_vals - list of possible values to search for
        col - the column to search
        title - string, the title of your plot
        plot - bool providing whether or not you want a plot back
        xaxis - name given to the column (and plot axis) holding the values

    OUTPUT
        new_df - a dataframe with the proportion of all mentions for each value
        Displays a bar plot of those proportions when plot is True.
    '''
    new_df = df[col].value_counts().reset_index()
    # Set the column names directly; the names produced by reset_index()
    # differ between pandas versions, so renaming by name is fragile.
    new_df.columns = [xaxis, 'count']
    new_df = total_count(new_df, xaxis, 'count', possible_vals)

    new_df.set_index(xaxis, inplace=True)
    # Convert the raw counts to proportions once and reuse them for the plot
    new_df = new_df / new_df.sum()
    if plot:
        new_df.plot(kind='bar', legend=None)
        plt.title(title)
        plt.show()
    return new_df


In [None]:
# Looks at all the values in a given column and compiles a list of the unique values broken up
# by a given delimiter
def count_lists(df, col, delim=';'):
    """
    INPUT:
    df - the pandas dataframe you want to search
    col - the column name you want to look through
    delim - string delimiter value to break up the string by

    OUTPUT:
    index_set - list of the unique items found in the delimited strings
    """
    my_list = []

    # value_counts() drops nulls and leaves one entry per distinct answer string
    counts = df[col].value_counts()
    for index, value in counts.items():
        my_list += index.split(delim)
    index_set = list(set(my_list))
    return index_set
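
For comparison, here is a short sketch of how the same set of unique items could be collected without an explicit loop, assuming a pandas version that provides `Series.explode`. The helper name `unique_items` and the toy frame are hypothetical; the result is sorted only to make the output order predictable.

```python
import pandas as pd

def unique_items(df, col, delim=';'):
    """Return the sorted unique items found in the delimited strings of df[col]."""
    items = df[col].dropna().str.split(delim).explode()
    return sorted(items.unique())

# Toy frame with a multi-select column and one missing answer.
toy = pd.DataFrame({'DevType': ['Student;Designer', 'Designer', None, 'Student;Educator']})
toy_items = unique_items(toy, 'DevType')
print(toy_items)  # ['Designer', 'Educator', 'Student']
```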

---
Let's take a look at a breakdown of how respondents classify their job roles.  This gives us an overall feel for the various job classifications and the share of responses that falls into each one.  Keep in mind that when answering the DevType question someone can pick multiple roles:

In [None]:
list_values = count_lists(df, col='DevType', delim=';')
results = clean_and_plot(df, list_values, col='DevType', title='Job Role', xaxis='Jobs')
results

---
Let's focus on respondents whose DevType includes "Data scientist or machine learning specialist", either as their only role or as one of several.  We also want to isolate those who report only that role, since respondents who hold just a Data Scientist role might give us a clearer picture of their characteristics.  After removing the rows with a null DevType, this leaves us with 6,460 respondents who have a DS/ML role in some capacity, 526 of whom are pure DS/ML respondents:

In [None]:
df = df[df.DevType.notnull()] # Remove all nulls in the DevType column

# Flag respondents that list a DS/ML role (regex=False matches the text literally)
is_ds = df['DevType'].str.contains('Data scientist or machine learning specialist', regex=False)

# Dataframe of non DS/ML respondents
df_non_ds = df[~is_ds]

# Dataframe with respondents that state they perform a DS/ML role in some capacity
df_ds = df[is_ds]

# Dataframe with respondents who only have a DS/ML role
df_pure_ds = df_ds[df_ds['DevType'] == 'Data scientist or machine learning specialist']

print('Number of rows of type DS/ML: {}\nNumber of rows in non-DS/ML occupations: {}\nPure DS/ML: {}'.
      format(df_ds.shape[0], df_non_ds.shape[0], df_pure_ds.shape[0]))
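
The difference between "DS/ML in some capacity" and "pure DS/ML" is easy to check on a tiny hand-made frame. The sketch below uses invented rows (not survey data) and hypothetical `toy_*` names to show that a substring match catches every respondent listing the role, while an exact match keeps only those who list nothing else.

```python
import pandas as pd

ROLE = 'Data scientist or machine learning specialist'

# Invented rows: one pure DS/ML, one mixed, one non-DS/ML, one missing answer.
toy = pd.DataFrame({'DevType': [
    ROLE,
    'Developer, back-end;' + ROLE,
    'Developer, front-end',
    None,
]})

toy = toy[toy['DevType'].notnull()]                         # drop the missing answer
has_role = toy['DevType'].str.contains(ROLE, regex=False)   # literal substring match

toy_non_ds = toy[~has_role]                       # role not listed at all
toy_ds = toy[has_role]                            # role listed, alone or with others
toy_pure_ds = toy_ds[toy_ds['DevType'] == ROLE]   # role listed and nothing else

print(len(toy_ds), len(toy_non_ds), len(toy_pure_ds))  # 2 1 1
```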

---
The first characteristic we are interested in is education level.  We'll go through mixed DS/ML respondents (those that were a DS/ML in some capacity), pure DS/ML respondents (those that only have a DS/ML role) and the rest of the survey respondents, who do not have a DS/ML role in any capacity.

In [None]:
# Mixed DS/ML respondents
chart_values = df_ds.EdLevel.value_counts()
(chart_values/df_ds.shape[0]).plot(kind="bar")
plt.title("DS/ML Education Levels")
plt.ylabel('Percentage')
plt.xlabel('Education')
round(chart_values/df_ds.shape[0]*100,1)

In [None]:
# Pure DS/ML respondents
chart_values = df_pure_ds.EdLevel.value_counts()
(chart_values/df_pure_ds.shape[0]).plot(kind="bar")
plt.title("Pure DS/ML Education Levels")
plt.ylabel('Percentage')
plt.xlabel('Education')
round(chart_values/df_pure_ds.shape[0]*100,1)

In [None]:
# Non DS/ML role respondents
chart_values = df_non_ds.EdLevel.value_counts()
(chart_values/df_non_ds.shape[0]).plot(kind="bar")
plt.title("Non DS/ML Education Levels")
plt.ylabel('Percentage')
plt.xlabel('Education')
round(chart_values/df_non_ds.shape[0]*100,1)

What we see is that there are considerably more PhD holders in Data Science roles than in the general developer community.  Master's degree holders and, to a slightly lesser extent, Bachelor's degree graduates are the most common Data Scientists.
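
Reading three separate charts side by side is error-prone, so it can help to put the groups in one table. The following is a sketch only: `share_table` is a hypothetical helper and the toy frames are invented. Note one deliberate difference from the cells above: `value_counts(normalize=True)` divides by the number of non-null answers, whereas the notebook divides by `df.shape[0]`, which also counts respondents who skipped the question.

```python
import pandas as pd

def share_table(groups, col):
    """Build one table of percentage shares of `col`, with one column per group."""
    shares = {
        name: frame[col].value_counts(normalize=True) * 100
        for name, frame in groups.items()
    }
    # Align the groups on the answer labels; a label missing from a group becomes 0.
    return pd.concat(shares, axis=1).fillna(0).round(1)

# Invented toy frames standing in for df_ds and df_non_ds.
toy_ds = pd.DataFrame({'EdLevel': ['Master', 'Master', 'Doctoral', 'Bachelor']})
toy_non_ds = pd.DataFrame({'EdLevel': ['Bachelor', 'Bachelor', 'Master', None]})

table = share_table({'DS/ML': toy_ds, 'Non DS/ML': toy_non_ds}, 'EdLevel')
print(table)
```

With the real data, the call would look like `share_table({'DS/ML': df_ds, 'Pure DS/ML': df_pure_ds, 'Non DS/ML': df_non_ds}, 'EdLevel')`.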

---
Now that we have information about the respondents' education levels, can we determine what they majored in?  This is valuable because it shows which majors people in the DS/ML fields come from.  If someone wanted to break into the field, it may prove to be an advantage to have a major similar to those already in the field.  

One major limitation is that the survey doesn't ask what your major was for the highest level of education you received.  What it does ask is "What was your main or most important field of study?"  It is reasonable to assume that the field of study a respondent considers most important is not always the one attached to their highest degree.  For example, a respondent may have an undergraduate degree in computer science and a later Master's in an unrelated field such as business, yet be employed as a developer.  They may consider their computer science degree most important since it correlates more with their current career. 

In [None]:
# Mixed DS/ML respondents
chart_values = df_ds.UndergradMajor.value_counts() 
(chart_values/df_ds.shape[0]).plot(kind="bar")
plt.title("Undergraduate Major For DS/ML")
round(chart_values/df_ds.shape[0]*100,1)

In [None]:
# Pure DS/ML respondents
chart_values = df_pure_ds.UndergradMajor.value_counts() 
(chart_values/df_pure_ds.shape[0]).plot(kind="bar")
plt.title("Undergraduate Major For Pure DS/ML")
round(chart_values/df_pure_ds.shape[0]*100,1)

In [None]:
# Non DS/ML role respondents
chart_values = df_non_ds.UndergradMajor.value_counts() 
(chart_values/df_non_ds.shape[0]).plot(kind="bar")
plt.title("Undergraduate Major For Non DS/ML")
round(chart_values/df_non_ds.shape[0]*100,1)

If we believe that the most important field of study corresponds to a graduate degree or higher with a relatively high frequency, how do the charts above help us?  They show that Data Scientists come from a far broader spectrum of academic backgrounds than the average developer.  Looking only at pure Data Scientists, most came from a mathematics or statistics background, with computer science a close second.  There is also a respectable number from other engineering disciplines and natural science backgrounds.

---
Salary and hours worked don't feel like good characteristics for someone wanting to become a Data Scientist, as they are not attributes you can acquire beforehand to increase your chances.  Perhaps they matter, but the data doesn't allow us to dig that deep.  Out of curiosity, let's take a look at the average compensation (converted to USD) and average hours worked to see whether there is anything there we can work with.

In [None]:
# Mixed DS/ML respondents
comp = df_ds.ConvertedComp.mean()
hours = df_ds.WorkWeekHrs.mean()

print("DS/ML Compensation: ${:,.2f} and Hours: {}".format(comp, round(hours,1)))

In [None]:
# Pure DS/ML respondents
comp = df_pure_ds.ConvertedComp.mean()
hours = df_pure_ds.WorkWeekHrs.mean()

print("Pure DS/ML Compensation: ${:,.2f} and Hours: {}".format(comp, round(hours,1)))

In [None]:
# Non DS/ML role respondents
comp = df_non_ds.ConvertedComp.mean()
hours = df_non_ds.WorkWeekHrs.mean()

print("Non DS/ML Compensation: ${:,.2f} and Hours: {}".format(comp, round(hours,1)))

The data is interesting, at least in my opinion, since it shows that Data Scientists (both mixed and pure) make more on average than non DS/ML respondents.  The hours worked, however, are not significantly different.  
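
Compensation data is often skewed by a few very large values, so the mean alone can be misleading. As a hedged sketch, the hypothetical helper `comp_summary` below reports the median next to the mean for each group. The toy numbers are invented purely to show the effect and are not survey results.

```python
import pandas as pd

def comp_summary(groups, comp_col='ConvertedComp', hours_col='WorkWeekHrs'):
    """Summarise compensation and weekly hours for each named group."""
    rows = {
        name: {
            'mean_comp': frame[comp_col].mean(),
            'median_comp': frame[comp_col].median(),
            'mean_hours': frame[hours_col].mean(),
            'n': frame[comp_col].notna().sum(),   # respondents who reported compensation
        }
        for name, frame in groups.items()
    }
    return pd.DataFrame.from_dict(rows, orient='index')

# Invented toy frames; a single outlier inflates the DS/ML mean but not its median.
toy_ds = pd.DataFrame({'ConvertedComp': [50_000, 60_000, 1_000_000],
                       'WorkWeekHrs': [40, 45, 50]})
toy_non_ds = pd.DataFrame({'ConvertedComp': [55_000, 65_000, None],
                           'WorkWeekHrs': [40, 40, 43]})

summary = comp_summary({'DS/ML': toy_ds, 'Non DS/ML': toy_non_ds})
print(summary)
```

In the toy data both groups have the same median (60,000) even though the DS/ML mean is 370,000, which is why checking both statistics is worthwhile before drawing conclusions.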

---
The last characteristic we will look at is which other roles DS/ML respondents hold.  Since 526 report being only a DS/ML while 5,934 report other job roles alongside some DS/ML work, significantly more respondents combine DS/ML with other work.  We want to see what those roles are.

In [None]:
list_values = count_lists(df_ds, col='DevType', delim=';')
list_values.remove('Data scientist or machine learning specialist')
props_df = clean_and_plot(df_ds, list_values, col='DevType', title='DS/ML Other Roles')
props_df['count'] = round(props_df['count'] * 100, 1)
props_df['count']

How is this at all helpful to someone looking to be a DS/ML?

---
Let's take a look at the programming languages people in the DS/ML field(s) use and compare those with non DS/ML occupations:

In [None]:
# Mixed DS/ML respondents
list_values = count_lists(df_ds, col='LanguageWorkedWith', delim=';')
props_df = clean_and_plot(df_ds, list_values, col='LanguageWorkedWith', title='DS/ML Most Common Current Languages Used',
                         xaxis='Languages')
props_df

In [None]:
# Pure DS/ML respondents
list_values = count_lists(df_pure_ds, col='LanguageWorkedWith', delim=';')
props_df = clean_and_plot(df_pure_ds, list_values, col='LanguageWorkedWith', title='Pure DS/ML Most Common Current Languages Used',
                         xaxis='Languages')
props_df

In [None]:
# Non DS/ML role respondents
list_values = count_lists(df_non_ds, col='LanguageWorkedWith', delim=';')
props_df = clean_and_plot(df_non_ds, list_values, col='LanguageWorkedWith', title='Non DS/ML Most Common Current Languages Used')
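
One thing to keep in mind when reading these charts: `clean_and_plot` divides each language's count by the total number of mentions, so the bars show a share of all selections, not a share of respondents. If the share of respondents is what you want, a minimal sketch using `Series.str.get_dummies` follows. The helper name `respondent_share` and the toy answers are hypothetical.

```python
import pandas as pd

def respondent_share(df, col, delim=';'):
    """Percentage of respondents (with a non-null answer) who selected each item."""
    # One indicator column per item; 1 if the respondent selected it, else 0.
    dummies = df[col].dropna().str.get_dummies(sep=delim)
    return (dummies.mean() * 100).sort_values(ascending=False).round(1)

# Invented answers from four respondents.
toy = pd.DataFrame({'LanguageWorkedWith': ['Python;SQL', 'Python;R', 'Python', 'Java;SQL']})
toy_share = respondent_share(toy, 'LanguageWorkedWith')
print(toy_share)
```

In the toy frame, three of four respondents selected Python, so its respondent share is 75%, whereas its share of the seven total mentions would be about 42.9%.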

Let's take a look at the programming languages people in the DS/ML field(s) want to learn over the next year and compare those with non DS/ML occupations:

In [None]:
# Mixed DS/ML respondents
list_values = count_lists(df_ds, col='LanguageDesireNextYear', delim=';')
props_df = clean_and_plot(df_ds, list_values, col='LanguageDesireNextYear', title='DS/ML Most Desired Languages to Learn in 2020')

In [None]:
# Pure DS/ML respondents
list_values = count_lists(df_pure_ds, col='LanguageDesireNextYear', delim=';')
props_df = clean_and_plot(df_pure_ds, list_values, col='LanguageDesireNextYear', title='Pure DS/ML Most Desired Languages to Learn in 2020')

In [None]:
# Non DS/ML role respondents
list_values = count_lists(df_non_ds, col='LanguageDesireNextYear', delim=';')
props_df = clean_and_plot(df_non_ds, list_values, col='LanguageDesireNextYear', title='Non DS/ML Most Desired Languages to Learn in 2020')

Let's take a look at the most common DB systems people in the DS/ML field(s) work with and compare those with non DS/ML occupations:

In [None]:
# Mixed DS/ML respondents
list_values = count_lists(df_ds, col='DatabaseWorkedWith', delim=';')
props_df = clean_and_plot(df_ds, list_values, col='DatabaseWorkedWith', title='DS/ML Most Common Current DBs Used')

In [None]:
# Pure DS/ML respondents
list_values = count_lists(df_pure_ds, col='DatabaseWorkedWith', delim=';')
props_df = clean_and_plot(df_pure_ds, list_values, col='DatabaseWorkedWith', title='Pure DS/ML Most Common Current DBs Used')

In [None]:
# Non DS/ML role respondents
list_values = count_lists(df_non_ds, col='DatabaseWorkedWith', delim=';')
props_df = clean_and_plot(df_non_ds, list_values, col='DatabaseWorkedWith', title='Non DS/ML Most Common Current DBs Used')

Let's take a look at the DB systems people in the DS/ML field(s) most want to work with over the next year and compare those with non DS/ML occupations:

In [None]:
# Mixed DS/ML respondents
list_values = count_lists(df_ds, col='DatabaseDesireNextYear', delim=';')
props_df = clean_and_plot(df_ds, list_values, col='DatabaseDesireNextYear', title='DS/ML Most Desired DBs to Learn in 2020')

In [None]:
# Pure DS/ML respondents
list_values = count_lists(df_pure_ds, col='DatabaseDesireNextYear', delim=';')
props_df = clean_and_plot(df_pure_ds, list_values, col='DatabaseDesireNextYear', title='Pure DS/ML Most Desired DBs to Learn in 2020')

In [None]:
# Non DS/ML role respondents
list_values = count_lists(df_non_ds, col='DatabaseDesireNextYear', delim=';')
props_df = clean_and_plot(df_non_ds, list_values, col='DatabaseDesireNextYear', title='Non DS/ML Most Desired DBs to Learn in 2020')