# My take on Stack Overflow Survey 2020

Hi! My name is **Marcus**, and I want to thank you upfront for taking the time to assess my code. This was written for Project 1 of Udacity's Data Science Nanodegree.

As you can see, I was deeply influenced by [Josh Bernhard's Medium post on how to become a programmer](https://medium.com/@josh_2774/how-do-you-become-a-developer-5ef1c1c68711). However, I'm more interested in a descriptive analysis of doctoral degree holders than in any predictive model. This became even more true when I realized that the recently published results of the 2020 Stack Overflow survey are quite different from the 2017 ones, notably with practically no fields about how people broke into the field. Plus, one of the takeaways from Josh's analysis of this question was that there is no scientifically proven formula for breaking into the field. Additionally, I'm one of these degree holders looking to break into the industry, and it wouldn't hurt to better understand how their field of origin influences their outcomes.

You see, doctoral degrees vary a lot, especially in terms of programming experience and expertise. Someone with a degree in the humanities might never have had to write a single line of code, while someone from the computer science field will definitely have an edge. Thus, I'm deeply interested in getting to know my peers better, especially the successful ones. Who are they? What are they doing? What is their educational background? Does this background influence their job satisfaction and compensation?

Thus, here are my questions of interest, and how I intend to present the data in my own blog post:

Part I. Do you even work?

1. What's the employment rate of the group and how does it compare with the other educational backgrounds?

Part II. Ok, let's break it down.

2. What's the employment rate per field of origin?
3. What's the average programming experience, prior to first pro activity, per field of origin?
4. What's the average compensation, prior to first pro activity, per field of origin?

Part III. Is there a link between success and field of origin?

5. How long have the successful ones (in terms of compensation and job satisfaction) been coding, and what is their field of origin?

For the sake of data exploration, I'll play around a bit with the data in this Jupyter notebook before tackling the questions at hand.


In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Load data
df = pd.read_csv('./data/survey_results_public.csv', dtype=str)

# Select the columns of interest
columns_of_interest = ["ConvertedComp", "Country", "DevType", "EdLevel", "Employment", "JobFactors", "JobSat", "NEWEdImpt", "NEWLearn", "UndergradMajor", "WorkWeekHrs", "YearsCode", "YearsCodePro"]

# Take a copy so that columns added later do not trigger a SettingWithCopyWarning
df_narrowed = df[columns_of_interest].copy()
df_narrowed.head()

Unnamed: 0,ConvertedComp,Country,DevType,EdLevel,Employment,JobFactors,JobSat,NEWEdImpt,NEWLearn,UndergradMajor,WorkWeekHrs,YearsCode,YearsCodePro
0,,Germany,"Developer, desktop or enterprise applications;...","Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Independent contractor, freelancer, or self-em...","Languages, frameworks, and other technologies ...",Slightly satisfied,Fairly important,Once a year,"Computer science, computer engineering, or sof...",50.0,36,27.0
1,,United Kingdom,"Developer, full-stack;Developer, mobile","Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Employed full-time,,Very dissatisfied,Fairly important,Once a year,"Computer science, computer engineering, or sof...",,7,4.0
2,,Russian Federation,,,,,,,Once a decade,,,4,
3,,Albania,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",,Flex time or a flexible schedule;Office enviro...,Slightly dissatisfied,Not at all important/not necessary,Once a year,"Computer science, computer engineering, or sof...",40.0,7,4.0
4,,United States,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Employed full-time,,,Very important,Once a year,"Computer science, computer engineering, or sof...",,15,8.0


## Looking into my people
By "my people" I mean the respondents from Brazil with a doctoral degree of any sort in the natural sciences.

In [4]:
# Checking how many like me are out there
# First, I'm selecting the respondents who are from Brazil and hold a PhD or related degree

# Both conditions are combined into a single boolean mask with "&"
df_like_me = df_narrowed[(df_narrowed["Country"] == "Brazil") & (df_narrowed["EdLevel"] == "Other doctoral degree (Ph.D., Ed.D., etc.)")]

df_like_me.shape

(26, 13)

In [6]:
# Here, I'm trying to count the values of their field of origin
df_like_me["UndergradMajor"].value_counts()

Computer science, computer engineering, or software engineering                   12
A natural science (such as biology, chemistry, physics, etc.)                      4
A health science (such as nursing, pharmacy, radiology, etc.)                      2
Another engineering discipline (such as civil, electrical, mechanical, etc.)       2
A business discipline (such as accounting, finance, marketing, etc.)               2
Mathematics or statistics                                                          2
A social science (such as anthropology, psychology, political science, etc.)       1
Fine arts or performing arts (such as graphic design, music, studio art, etc.)     1
Name: UndergradMajor, dtype: int64

In [8]:
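# Take a closer look at the four that are more like me
# The mask is built from df_like_me itself so that its index matches the frame being filtered

df_like_me[df_like_me["UndergradMajor"] == "A natural science (such as biology, chemistry, physics, etc.)"]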

Unnamed: 0,ConvertedComp,Country,DevType,EdLevel,Employment,JobFactors,JobSat,NEWEdImpt,NEWLearn,UndergradMajor,WorkWeekHrs,YearsCode,YearsCodePro
3934,22536,Brazil,Data scientist or machine learning specialist,"Other doctoral degree (Ph.D., Ed.D., etc.)",Employed full-time,Industry that I’d be working in;Diversity of t...,Slightly dissatisfied,Very important,Every few months,"A natural science (such as biology, chemistry,...",40,7,2
15304,49476,Brazil,,"Other doctoral degree (Ph.D., Ed.D., etc.)",Employed full-time,Flex time or a flexible schedule;Remote work o...,Slightly dissatisfied,Critically important,Once every few years,"A natural science (such as biology, chemistry,...",40,35,13
16823,32988,Brazil,Academic researcher;Educator;Scientist,"Other doctoral degree (Ph.D., Ed.D., etc.)",Employed full-time,Specific department or team I’d be working on;...,Slightly satisfied,Not at all important/not necessary,Once every few years,"A natural science (such as biology, chemistry,...",40,30,18
18639,14568,Brazil,Data scientist or machine learning specialist;...,"Other doctoral degree (Ph.D., Ed.D., etc.)",Employed full-time,"Languages, frameworks, and other technologies ...",Slightly satisfied,Fairly important,Once a year,"A natural science (such as biology, chemistry,...",44,11,6


## 1. How many of them are currently employed?

Here, I'll compare the proportion of employed respondents across the different educational backgrounds.

1. For that, I'll first consider only "Employed full-time", "Independent contractor, freelancer, or self-employed" and "Employed part-time" as "Employed". The rest will be considered "Unemployed", except for missing answers, which are kept under a separate "NaN" label. I'll use a custom function to build a dummy column summarizing these values as just described.

2. Then, I'll build a dataframe with the results grouped by "EdLevel" and its employment rate based on the counts of the EmploymentStatus dummy column.

3. Finally, I'll create a bar chart showing the proportion of employment by educational level.


In [9]:
# Check the value counts of "Employment"
df_narrowed["Employment"].value_counts()

Employed full-time                                      45270
Student                                                  7787
Independent contractor, freelancer, or self-employed     5672
Not employed, but looking for work                       2343
Employed part-time                                       2217
Not employed, and not looking for work                    322
Retired                                                   243
Name: Employment, dtype: int64

In [16]:
# Map "Employment" to a coarse status: the three employed categories become "Employed",
# missing answers become "NaN", and everything else becomes "Unemployed"
def employment_discrete(df):
    """
    Return a list of converted values based on "Employment".

    "Employed full-time", "Independent contractor, freelancer, or self-employed"
    and "Employed part-time" are converted to "Employed", missing values to "NaN",
    and all other values to "Unemployed".

    Parameters:
    df (pd.DataFrame): Data containing an "Employment" column

    Returns:
    list: List of converted values
    """
    converted = []
    employment = ["Employed full-time", "Independent contractor, freelancer, or self-employed", "Employed part-time"]

    for entry in df["Employment"]:
        if str(entry) == "nan":
            converted.append("NaN")
        elif str(entry) in employment:
            converted.append("Employed")
        else:
            converted.append("Unemployed")
    return converted

In [18]:
# Create a new column, "EmploymentStatus", based on the converted values of "Employment"
df_narrowed["EmploymentStatus"] = employment_discrete(df_narrowed)
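
For larger frames, the same mapping can also be expressed without an explicit Python loop. The following is only a minimal sketch on toy data, not the function used in this notebook; the helper name `employment_status` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# The three categories treated as "Employed" in this notebook
EMPLOYED = [
    "Employed full-time",
    "Independent contractor, freelancer, or self-employed",
    "Employed part-time",
]

def employment_status(series):
    """Hypothetical helper: same three-way mapping, built with boolean masks."""
    # Start with "Unemployed" everywhere, then overwrite the other two cases
    status = pd.Series("Unemployed", index=series.index)
    status[series.isin(EMPLOYED)] = "Employed"
    status[series.isna()] = "NaN"
    return status

# Toy input standing in for df_narrowed["Employment"]
toy_employment = pd.Series(
    ["Employed full-time", "Student", np.nan, "Employed part-time", "Retired"]
)
toy_status = employment_status(toy_employment)
print(toy_status.tolist())
```

The result is a Series aligned with the input index, so it could be assigned to a new column directly.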

In [20]:
# Employment rate per "EdLevel": count of "Employed" rows divided by the count of all rows in that level
# Note: rows whose "Employment" was missing carry the "NaN" placeholder string, so they count in the denominator
employed_counts = df_narrowed[["EdLevel", "EmploymentStatus"]][df_narrowed["EmploymentStatus"] == "Employed"].groupby("EdLevel").count()
total_counts = df_narrowed[["EdLevel", "EmploymentStatus"]].groupby("EdLevel").count()
df_results01 = employed_counts / total_counts

df_results01

EdLevel,EmploymentStatus
"Associate degree (A.A., A.S., etc.)",0.842105
"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",0.887499
I never completed any formal education,0.703854
"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",0.934487
"Other doctoral degree (Ph.D., Ed.D., etc.)",0.934911
Primary/elementary school,0.29118
"Professional degree (JD, MD, etc.)",0.905
"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",0.465521
Some college/university study without earning a degree,0.770825
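
As a cross-check, the same ratio can be sketched more compactly by averaging a boolean column per group. This is a minimal sketch on made-up toy data (the `toy` frame and its values are hypothetical, not survey results). One difference from the division above: a group with no employed respondents would show 0.0 here rather than NaN.

```python
import pandas as pd

# Toy data standing in for df_narrowed (hypothetical values, not survey results)
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", "Bachelor", "Master", "Master", "Master", None],
    "EmploymentStatus": ["Employed", "Unemployed", "Employed", "Employed", "NaN", "Employed"],
})

# True where the respondent is classified as employed
is_employed = toy["EmploymentStatus"].eq("Employed")

# The mean of a boolean Series is the share of True values in each group;
# rows with a missing EdLevel are dropped by groupby by default
employment_rate = is_employed.groupby(toy["EdLevel"]).mean()
print(employment_rate)
```

As in the notebook's own calculation, the "NaN" placeholder rows stay in the denominator, so the two approaches should agree for every level that has at least one employed respondent.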
