### Screencast

In the previous video, I raised a few questions that we will explore throughout this lesson. First, let's look at the data and see how the survey results can help us answer the first question: how do you break into the field as a software developer?

To get started, let's import the libraries we need to wrangle our data: pandas and numpy. If we decide to build some basic plots, matplotlib may prove useful as well.

In [59]:
import numpy as np
import pandas as pd
from collections import defaultdict
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')
df.head()

Unnamed: 0,Respondent,Professional,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,...,StackOverflowMakeMoney,Gender,HighestEducationParents,Race,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,ExpectedSalary
0,1,Student,"Yes, both",United States,No,"Not employed, and not looking for work",Secondary school,,,,...,Strongly disagree,Male,High school,White or of European descent,Strongly disagree,Strongly agree,Disagree,Strongly agree,,
1,2,Student,"Yes, both",United Kingdom,"Yes, full-time",Employed part-time,Some college/university study without earning ...,Computer science or software engineering,"More than half, but not all, the time",20 to 99 employees,...,Strongly disagree,Male,A master's degree,White or of European descent,Somewhat agree,Somewhat agree,Disagree,Strongly agree,,37500.0
2,3,Professional developer,"Yes, both",United Kingdom,No,Employed full-time,Bachelor's degree,Computer science or software engineering,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A professional degree,White or of European descent,Somewhat agree,Agree,Disagree,Agree,113750.0,
3,4,Professional non-developer who sometimes write...,"Yes, both",United States,No,Employed full-time,Doctoral degree,A non-computer-focused engineering discipline,"Less than half the time, but at least one day ...","10,000 or more employees",...,Disagree,Male,A doctoral degree,White or of European descent,Agree,Agree,Somewhat agree,Strongly agree,,
4,5,Professional developer,"Yes, I program as a hobby",Switzerland,No,Employed full-time,Master's degree,Computer science or software engineering,Never,10 to 19 employees,...,,,,,,,,,,


Now to our first question of interest: what do those employed in the industry suggest to help others enter the field? The `CousinEducation` field records what these individuals would recommend to someone trying to break into their field. Below you can see the full question that survey participants were shown.

In [60]:
df2 = pd.read_csv('./survey_results_schema.csv')
list(df2[df2.Column == 'CousinEducation']['Question'])

["Let's pretend you have a distant cousin. They are 24 years old, have a college degree in a field not related to computer programming, and have been working a non-coding job for the last two years. They want your advice on how to switch to a career as a software developer. Which of the following options would you most strongly recommend to your cousin?\nLet's pretend you have a distant cousin named Robert. He is 24 years old, has a college degree in a field not related to computer programming, and has been working a non-coding job for the last two years. He wants your advice on how to switch to a career as a software developer. Which of the following options would you most strongly recommend to Robert?\nLet's pretend you have a distant cousin named Alice. She is 24 years old, has a college degree in a field not related to computer programming, and has been working a non-coding job for the last two years. She wants your advice on how to switch to a career as a software developer. Which

In [61]:
#Let's have a look at what the participants say depending on gender
study_female = df[df['Gender']=='Female']['CousinEducation'].value_counts().reset_index()
study_male = df[df['Gender']=='Male']['CousinEducation'].value_counts().reset_index()
study_male.head()

Unnamed: 0,index,CousinEducation
0,Take online courses; Buy books and work throug...,571
1,Take online courses; Part-time/evening courses...,362
2,Take online courses; Bootcamp; Part-time/eveni...,359
3,Take online courses,355
4,Take online courses; Contribute to open source...,319


In [62]:
# Answers that list more than one method are kept as a single combined string,
# so every combination is counted as its own category. Let's see if we can clean this up.
study_female.rename(columns={'index': 'method', 'CousinEducation': 'count'}, inplace=True)
study_male.rename(columns={'index': 'method', 'CousinEducation': 'count'}, inplace=True)
study_male.head()

Unnamed: 0,method,count
0,Take online courses; Buy books and work throug...,571
1,Take online courses; Part-time/evening courses...,362
2,Take online courses; Bootcamp; Part-time/eveni...,359
3,Take online courses,355
4,Take online courses; Contribute to open source...,319


We are now going to build a new dataframe in which each row contains only one 'method'.

In [64]:
# Split each answer into a list of methods, then create one row per method.
df_cousin = df['CousinEducation'].str.split('; ')
df_cousin_long = df.assign(method=df_cousin).explode('method')
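To make the split-and-explode step concrete, here is a minimal sketch on a toy dataframe. The three rows are made up for illustration and are not taken from the survey; only the column names mirror the ones used above.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Male'],
    'CousinEducation': ['Take online courses; Bootcamp', 'Bootcamp', None],
})

# Split each answer string into a list, then emit one row per list element.
methods = toy['CousinEducation'].str.split('; ')
toy_long = toy.assign(method=methods).explode('method')

# The original row index is repeated for every method a respondent listed,
# and a missing answer stays as a single row with a missing 'method'.
print(toy_long[['Gender', 'method']])
```

Because `value_counts()` skips missing values by default, respondents who left the question blank should not contribute to the counts that follow.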

In [65]:
# Count each individual method per gender. rename_axis/reset_index(name=...) yields
# 'method' and 'count' columns without relying on version-specific default names.
female_rows = df_cousin_long[df_cousin_long['Gender'] == 'Female']
male_rows = df_cousin_long[df_cousin_long['Gender'] == 'Male']
study_df_female = female_rows['method'].value_counts().rename_axis('method').reset_index(name='count')
study_df_male = male_rows['method'].value_counts().rename_axis('method').reset_index(name='count')

study_df = pd.merge(study_df_female, study_df_male, on='method', suffixes=('_female', '_male'))
study_df.head()

Unnamed: 0,method,count_female,count_male
0,Take online courses,1141,11827
1,Buy books and work through the exercises,689,9372
2,Part-time/evening courses,667,5686
3,Bootcamp,515,3926
4,Conferences/meet-ups,437,4037


The percentages below show differences in how female and male respondents suggest breaking into the field. Both groups most often suggest "Take online courses", followed by "Buy books and work through the exercises", but the share of suggestions for buying books is about 4 percentage points lower among females (13.0%) than among males (16.7%). The third most popular method differs: females suggest "Part-time/evening courses", while males suggest "Contribute to open source".

In [66]:
# We might also look at each method's share of all suggestions made by that gender

study_df['perc_female'] = study_df['count_female']/np.sum(study_df['count_female'])
study_df['perc_male'] = study_df['count_male']/np.sum(study_df['count_male'])
study_df.set_index('method', inplace = True)
#study_df.loc['Take online courses']

study_df

method,count_female,count_male,perc_female,perc_male
Take online courses,1141,11827,0.214675,0.210198
Buy books and work through the exercises,689,9372,0.129633,0.166566
Part-time/evening courses,667,5686,0.125494,0.101056
Bootcamp,515,3926,0.096896,0.069776
Conferences/meet-ups,437,4037,0.08222,0.071748
Contribute to open source,420,5909,0.079022,0.105019
Return to college,390,3879,0.073377,0.06894
Get a job as a QA tester,276,2602,0.051929,0.046245
Participate in online coding competitions,238,2708,0.044779,0.048129
Master's degree,203,1971,0.038194,0.03503
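One detail worth keeping in mind when reading these percentages is the denominator. The cell above divides by the total number of suggestions, so a respondent who lists three methods contributes three times. The sketch below, on four made-up answers, contrasts that with dividing by the number of respondents; the names `share_of_mentions` and `share_of_respondents` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Invented answers from four hypothetical respondents.
answers = pd.Series([
    'Take online courses; Bootcamp',
    'Take online courses',
    'Bootcamp; Return to college',
    'Take online courses; Return to college',
])

# One entry per individual method, then count how often each appears.
mentions = answers.str.split('; ').explode().value_counts()

# Share of all mentions: the normalization used in the cell above. Sums to 1.
share_of_mentions = mentions / mentions.sum()

# Share of respondents who listed each method. This can sum to more than 1,
# because one respondent may list several methods.
share_of_respondents = mentions / answers.notna().sum()

print(pd.DataFrame({'of_mentions': share_of_mentions,
                    'of_respondents': share_of_respondents}))
```

Either normalization can be defensible; which one fits depends on whether the question is about how suggestions are distributed or about how many people back a given method.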


Overall, females suggest courses and formal education more often than males: the options 'Take online courses', 'Part-time/evening courses', 'Bootcamp', 'Return to college', and 'Master's degree' together account for almost 55% of suggestions from females versus about 48% from males.

In [67]:
courses = ['Take online courses', 'Part-time/evening courses', 'Bootcamp', 
           'Return to college', 'Master\'s degree']
print('Suggestions related to taking courses and formal education female',
      np.sum(study_df.loc[courses, 'perc_female']))
print('Suggestions related to taking courses and formal education male',
      np.sum(study_df.loc[courses, 'perc_male']))

Suggestions related to taking courses and formal education female 0.5486359360301035
Suggestions related to taking courses and formal education male 0.48499982227277577


Let's now explore what the respondents with the highest salaries suggest.
Although the mean salary is highest for individuals who say that you should contribute to open source, you might be asking: is that really a significant difference? The salary differences don't seem that large, especially next to standard deviations of around 40,000. To answer this, we would add a t-test in an upcoming notebook.

In [73]:
df_mean = df_cousin_long.groupby('method')['Salary'].mean().reset_index(name='Salary_mean')
df_std = df_cousin_long.groupby('method')['Salary'].std().reset_index(name='Salary_standard_deviation')
df_all = pd.merge(df_mean, df_std, on='method').sort_values(by='Salary_mean', ascending=False)
df_all.head(14)

Unnamed: 0,method,Salary_mean,Salary_standard_deviation
3,Contribute to open source,61741.337383,41030.28413
7,Other,60859.281694,38946.726688
5,Master's degree,59425.969277,41258.009194
11,Return to college,59251.636145,37476.923599
0,Bootcamp,59046.391551,41876.732135
9,Participate in hackathons,58237.114855,41050.87343
2,Conferences/meet-ups,57770.118326,40576.908734
4,Get a job as a QA tester,56654.043442,40606.474293
1,Buy books and work through the exercises,56257.071544,40274.403123
12,Take online courses,53740.566104,39841.876501
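As a rough preview of the significance question raised above, the following sketch computes Welch's t statistic (a t-test variant that does not assume equal variances) with plain numpy. The two samples are synthetic draws, not survey salaries, and `welch_t` is a helper written for this illustration. Note also that a respondent who lists several methods appears in several groups, so the groups in this notebook overlap and the test's independence assumption only holds approximately.

```python
import numpy as np

def welch_t(a, b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing salaries before computing anything.
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    # Squared standard error of each group mean.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, dof

# Synthetic samples with means and spreads loosely inspired by the table above.
rng = np.random.default_rng(0)
group_a = rng.normal(61000, 41000, size=500)
group_b = rng.normal(54000, 40000, size=500)

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns a p-value.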


In [76]:
df_male_mean = df_cousin_long[df_cousin_long['Gender']=='Male'].groupby('method')['Salary'].mean().reset_index(name='Salary_Male_mean')
df_female_mean = df_cousin_long[df_cousin_long['Gender']=='Female'].groupby('method')['Salary'].mean().reset_index(name='Salary_Female_mean')
df_male_std = df_cousin_long[df_cousin_long['Gender']=='Male'].groupby('method')['Salary'].std().reset_index(name='Salary_Male_standard_deviation')
df_female_std = df_cousin_long[df_cousin_long['Gender']=='Female'].groupby('method')['Salary'].std().reset_index(name='Salary_Female_standard_deviation')

Next, we see that the males and females with the highest mean salaries give different suggestions on how to break into the field. Males with the highest mean salary suggest contributing to open source, while females with the highest mean salary suggest taking part in a bootcamp. This is interesting because, without taking salary into account, the top suggestion for both males and females was "Take online courses". Another interesting observation is that the first and third suggestions by females with the highest salaries ("Bootcamp" and "Master's degree") relate to courses and formal education, a trend we had already noticed for females overall.

In [80]:
df_all_male_female = pd.merge(df_male_mean, df_female_mean, on='method').sort_values(by='Salary_Female_mean', ascending=False)
df_all_male_female.head(14)

Unnamed: 0,method,Salary_Male_mean,Salary_Female_mean
0,Bootcamp,58617.933402,73087.852075
7,Other,60440.013483,71816.204132
5,Master's degree,57856.88143,71164.200945
9,Participate in hackathons,57707.850025,65578.205323
3,Contribute to open source,61333.660579,64873.155751
11,Return to college,59301.760668,61853.280908
8,Part-time/evening courses,52721.050831,59539.49625
2,Conferences/meet-ups,57948.053302,59432.045923
12,Take online courses,53800.685995,57933.478844
1,Buy books and work through the exercises,56726.05005,57460.110097
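The per-gender means and standard deviations above can also be produced in a single pass by grouping on both `method` and `Gender`. The sketch below uses a small made-up dataframe in place of `df_cousin_long`, and additionally reports the group size, which helps judge how much weight a mean deserves when a group is small.

```python
import pandas as pd

# Invented long-format rows standing in for df_cousin_long.
toy_long = pd.DataFrame({
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'method': ['Bootcamp', 'Bootcamp', 'Bootcamp', 'Bootcamp', 'Take online courses'],
    'Salary': [70000.0, 76000.0, 50000.0, None, 54000.0],
})

# Mean, standard deviation, and number of non-missing salaries per method and gender.
summary = (
    toy_long.groupby(['method', 'Gender'])['Salary']
    .agg(['mean', 'std', 'count'])
    .unstack('Gender')
)

# Columns are (statistic, gender) pairs; missing combinations show up as NaN.
print(summary)
```

On the real data, the same call with `df_cousin_long` in place of `toy_long` would be the natural starting point, though the resulting column layout differs from the merged tables shown above.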
