## Stack Overflow Survey Analysis

I want to use the 2018 Stack Overflow Developer Survey data to answer the questions below:

1. Which languages are popular now, and which will be popular next?
2. Which frameworks are popular now, and which will be popular next?
3. Which frameworks are associated with higher pay for data scientists?


### Load libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
from plotly import tools
from plotly.offline import init_notebook_mode, iplot

from collections import defaultdict

color = sns.color_palette()
# Enable offline plotly rendering inside the notebook (one call is enough)
init_notebook_mode(connected=True)


ModuleNotFoundError: No module named 'plotly'

### Load data

In [2]:
! ls ./data/developer_survey_2018

Developer_Survey_Instrument_2018.pdf  survey_results_public.csv.gz
README_2018.txt                       survey_results_schema.csv


In [3]:
stack_data = pd.read_csv('./data/developer_survey_2018/survey_results_public.csv.gz', compression="gzip")




In [4]:
schema = pd.read_csv('./data/developer_survey_2018/survey_results_schema.csv')


### Glimpse of the data

In [5]:
stack_data.head()

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",Database administrator;DevOps specialist;Full-...,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
2,4,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,Engineering manager;Full-stack developer,...,,,,,,,,,,
3,5,No,No,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",100 to 499 employees,Full-stack developer,...,I don't typically exercise,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,35 - 44 years old,No,No,The survey was an appropriate length,Somewhat easy
4,7,Yes,No,South Africa,"Yes, part-time",Employed full-time,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","10,000 or more employees",Data or business analyst;Desktop or enterprise...,...,3 - 4 times per week,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,18 - 24 years old,Yes,,The survey was an appropriate length,Somewhat easy


In [6]:
stack_data.shape

(98855, 129)

In [7]:
schema.head()

Unnamed: 0,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,Hobby,Do you code as a hobby?
2,OpenSource,Do you contribute to open source projects?
3,Country,In which country do you currently reside?
4,Student,"Are you currently enrolled in a formal, degree..."


In [8]:
schema.shape

(129, 2)

### Missing data check

In [9]:
def missing_data(data):
    '''
    INPUT - 
            data - pandas dataframe to check for missing values
    OUTPUT - 
            df - pandas dataframe with the count ('Total') and percentage ('Percent')
                 of missing values for each column, sorted in descending order
    '''              
    
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    df = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return df

In [10]:
# check missing values of dataframe 
missing_df = missing_data(stack_data)
missing_df

Unnamed: 0,Total,Percent
TimeAfterBootcamp,92203,93.270952
MilitaryUS,83074,84.036215
HackathonReasons,73164,74.011431
ErgonomicDevices,64797,65.547519
AdBlockerReasons,61110,61.817814
StackOverflowJobsRecommend,60538,61.239189
JobEmailPriorities1,52642,53.251732
JobEmailPriorities2,52642,53.251732
JobEmailPriorities3,52642,53.251732
JobEmailPriorities4,52642,53.251732
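
The meaning of the `Total` and `Percent` columns is easier to see on a tiny frame. This is a minimal sketch on made-up data (`toy` and its values are hypothetical, not from the survey); it uses `isna().mean()`, which gives the same missing fraction as dividing the null count by the row count.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    "Country": ["Kenya", None, "Germany", "India"],
    "MilitaryUS": [None, None, None, "No"],
    "Hobby": ["Yes", "No", "Yes", "Yes"],
})

# isna().sum() counts missing cells per column;
# isna().mean() is the fraction of missing cells per column
summary = pd.DataFrame({
    "Total": toy.isna().sum(),
    "Percent": toy.isna().mean() * 100,
}).sort_values("Total", ascending=False)

print(summary)
```

Here `MilitaryUS` has 3 of 4 values missing, so it comes first with 75%.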


### Preparing data and handling missing values

In [11]:
# fill missing values for data visualization

# Use the fillna method to fill missing Country values with 'Not known'
stack_data["Country"] = stack_data["Country"].fillna('Not known')
# Use the fillna method to fill missing DevType values with 'Not mentioned'
stack_data["DevType"] = stack_data["DevType"].fillna('Not mentioned')


## Data Exploration

In [12]:
def get_description(column_name, schema=schema):
    '''
    INPUT - column_name - string - the name of the column you would like to know about
            schema - pandas dataframe with the schema of the developer survey
    OUTPUT - 
            desc - string - the description (question text) of the column
    '''
    desc = list(schema[schema['Column'] == column_name]['QuestionText'])[0]
    return desc

# Quick check: look up the description of the first column
get_description(stack_data.columns[0]) 

'Randomized respondent ID number (not in order of survey response time)'

In [13]:
get_description('ConvertedSalary')

'Salary converted to annual USD salaries using the exchange rate on 2018-01-18, assuming 12 working months and 50 working weeks.'

In [14]:
get_description('DevType')

'Which of the following describe you? Please select all that apply.'

### How many data scientists participated in the survey?

In [15]:
def split_column_value(ori_df, column_name, separator=';'):
    '''
    INPUT - ori_df  - pandas dataframe - original dataframe
            column_name - string - the name of the column whose values you would like to split
            separator - string - the delimiter to split on; defaults to ';'
    OUTPUT - 
            df - pandas series - every individual value from the column, one per row
    '''
    df = pd.DataFrame(ori_df[column_name].dropna().str.split(separator).tolist()).stack()
    return df

In [16]:
# Split the DevType column values
temp1 = split_column_value(stack_data, 'DevType')
temp1.head()

0  0      Full-stack developer
1  0    Database administrator
   1         DevOps specialist
   2      Full-stack developer
   3      System administrator
dtype: object
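
An alternative way to get one row per selected role is `Series.explode`, available in pandas 0.25 and later. This is a minimal sketch on made-up answers (`toy` is hypothetical); unlike the `stack()` approach above, it keeps the original row label on every piece instead of building a two-level index.

```python
import pandas as pd

# Hypothetical toy column mimicking the semicolon-separated DevType answers
toy = pd.DataFrame({
    "DevType": [
        "Full-stack developer",
        "Database administrator;DevOps specialist;Full-stack developer",
        None,
    ]
})

# Split each answer into a list, then give every list element its own row
roles = toy["DevType"].dropna().str.split(";").explode()
counts = roles.value_counts()
print(counts)
```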

In [17]:
# Count the values of the DevType column
cnt_srs = temp1.value_counts().sort_values(ascending=False)
cnt_srs

Back-end developer                               53300
Full-stack developer                             44353
Front-end developer                              34822
Mobile developer                                 18804
Desktop or enterprise applications developer     15807
Student                                          15732
Database administrator                           13216
Designer                                         12019
System administrator                             10375
DevOps specialist                                 9549
Data or business analyst                          7559
Data scientist or machine learning specialist     7088
Not mentioned                                     6757
QA or test developer                              6194
Engineering manager                               5256
Embedded applications or devices developer        4819
Game or graphics developer                        4642
Product manager                                   4316
Educator o

In [18]:
#Plot devtype column
trace = go.Bar(
    y=cnt_srs.index[::-1],
    x=(cnt_srs/cnt_srs.sum() * 100)[::-1],
    orientation = 'h',
)

layout = dict(
    title='Description of people who participated in survey (%)',
    margin=dict(l=400),
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

There are 7,088 data scientists in the survey, 2.67% of all roles selected (respondents could select several roles). Relative to the 98,855 respondents, that is about 7.2%.
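
A percentage like this depends on the denominator, because `DevType` is a multi-select question. The sketch below uses made-up answers (`dev_type` is hypothetical) to show the two options: the share of all role selections, and the share of respondents.

```python
import pandas as pd

# Hypothetical answers from four respondents (multi-select, ';'-separated)
dev_type = pd.Series([
    "Back-end developer;Data scientist or machine learning specialist",
    "Back-end developer;Front-end developer",
    "Front-end developer",
    "Back-end developer",
])
role = "Data scientist or machine learning specialist"

# Denominator 1: every selected role counts once (6 selections here)
selections = dev_type.str.split(";").explode()
share_of_selections = (selections == role).sum() / len(selections) * 100

# Denominator 2: every respondent counts once (4 respondents here)
share_of_respondents = dev_type.str.contains(role, regex=False).mean() * 100

print(round(share_of_selections, 1), round(share_of_respondents, 1))
```

With these toy answers the role accounts for 1 of 6 selections but 1 of 4 respondents, so the second figure is always at least as large as the first.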

### Add a data scientist flag to the data set for later analysis

In [19]:
def data_dev_type(dev_type_str):
    '''
    INPUT
        dev_type_str - a string of one of the values from the DevType column
    
    OUTPUT
        return 1 if the string contains "Data scientist or machine learning specialist"
        return 0 otherwise
    
    '''
    if "Data scientist or machine learning specialist" in dev_type_str:
        return 1
    else:
        return 0

In [20]:
# add DataDevType column 
stack_data['DataDevType'] = stack_data["DevType"].apply(data_dev_type)
stack_dm_data = stack_data[stack_data['DataDevType']==1]
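
The same flag can be computed without `apply` by using the vectorized `str.contains`. This is a minimal sketch on a made-up frame (`toy` and `toy_dm` are hypothetical names), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy frame; the notebook applies the same idea to stack_data
toy = pd.DataFrame({
    "DevType": [
        "Back-end developer;Data scientist or machine learning specialist",
        "Full-stack developer",
        "Not mentioned",
    ]
})

role = "Data scientist or machine learning specialist"
# regex=False treats the role as a literal substring; astype(int) gives the 1/0 flag
toy["DataDevType"] = toy["DevType"].str.contains(role, regex=False).astype(int)
# .copy() makes the subset an independent frame, which avoids
# SettingWithCopyWarning if columns are added to it later
toy_dm = toy[toy["DataDevType"] == 1].copy()
print(toy_dm)
```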

In [21]:
stack_dm_data.head()

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy,DataDevType
18,29,Yes,Yes,India,"Yes, full-time",Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",,"10,000 or more employees",Data or business analyst;Data scientist or mac...,...,Female,,Some college/university study without earning ...,,,,,The survey was too long,Very difficult,1
28,45,Yes,Yes,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...","10,000 or more employees",Back-end developer;Data scientist or machine l...,...,,,,,,,,,,1
40,60,Yes,No,Germany,"Yes, full-time",Employed part-time,"Secondary school (e.g. American high school, G...",,"1,000 to 4,999 employees",Data scientist or machine learning specialist;...,...,,,,,,,,,,1
62,91,Yes,Yes,United States,No,Employed full-time,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...","10,000 or more employees",Back-end developer;Data scientist or machine l...,...,Male,Straight or heterosexual,"Professional degree (JD, MD, etc.)",White or of European descent,25 - 34 years old,No,No,The survey was too long,Somewhat easy,1
86,129,Yes,Yes,United States,"Yes, full-time",Employed full-time,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",20 to 99 employees,Back-end developer;Data scientist or machine l...,...,Male,Straight or heterosexual,"Other doctoral degree (Ph.D, Ed.D., etc.)",East Asian,18 - 24 years old,No,No,The survey was too long,Somewhat easy,1


### Which languages do data scientists use most?

In [22]:
# Plot most popular language 
temp1 = split_column_value(stack_data, 'LanguageWorkedWith')
temp1 = temp1.value_counts().sort_values(ascending=False).head(20)

temp2 = split_column_value(stack_data, 'LanguageDesireNextYear')
temp2 = temp2.value_counts().sort_values(ascending=False).head(20)
trace1 = go.Bar(
    y=temp1.index[::-1],
    x=temp1.values[::-1],
    orientation = 'h',
    #name = ''
)
trace2 = go.Bar(
    y=temp2.index[::-1],
    x=temp2.values[::-1],
    orientation = 'h',
    #name = ''
)

fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Languages developers worked with', 'Languages developers want to work with next year'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig['layout'].update(height=500, width=1000, title='Most popular languages')
iplot(fig, filename='simple-subplot')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [23]:
temp1 = split_column_value(stack_dm_data, 'LanguageWorkedWith')
temp1 = temp1.value_counts().sort_values(ascending=False).head(20)
temp2 = split_column_value(stack_dm_data,'LanguageDesireNextYear')
temp2 = temp2.value_counts().sort_values(ascending=False).head(20)
trace1 = go.Bar(
    y=temp1.index[::-1],
    x=temp1.values[::-1],
    orientation = 'h',
    #name = ''
)
trace2 = go.Bar(
    y=temp2.index[::-1],
    x=temp2.values[::-1],
    orientation = 'h',
    #name = ''
)

fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Languages data scientists worked with', 'Languages data scientists want to work with next year'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig['layout'].update(height=500, width=1000, title='Most popular languages among data scientists')
iplot(fig, filename='simple-subplot')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



Currently no data scientists report using Julia, but 513 would like to use it next year, so it is gradually gaining attention.

### Frameworks data scientists use most

In [24]:
# Plot the most popular frameworks
temp1 = split_column_value(stack_dm_data,'FrameworkWorkedWith')
temp1 = temp1.value_counts().sort_values(ascending=False).head(20)
temp2 = split_column_value(stack_dm_data,'FrameworkDesireNextYear')
temp2 = temp2.value_counts().sort_values(ascending=False).head(20)
trace1 = go.Bar(
    y=temp1.index[::-1],
    x=temp1.values[::-1],
    orientation = 'h',
    marker=dict(
        # Color the bars by this trace's own values (temp1), not temp2
        color=temp1.values[::-1],
        colorscale = 'Reds'
    )
)
trace2 = go.Bar(
    y=temp2.index[::-1],
    x=temp2.values[::-1],
    orientation = 'h',
    marker=dict(
        color=temp2.values[::-1],
        colorscale = 'Blues',
        reversescale = True
    ),
)

fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Frameworks data scientists worked with', 'Frameworks data scientists want to work with next year'))

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
                          
fig['layout'].update(height=500, width=1100, title='Most popular Frameworks (Data Scientist / Machine Learning Specialists)', margin=dict(l=100,))
iplot(fig, filename='simple-subplot')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



The framework most liked by data scientists is TensorFlow, followed by Spark and Torch/PyTorch. An interesting finding is that PyTorch is the fastest-growing framework: it ranks 10th among frameworks used this year, but 4th among frameworks data scientists want to use next year.
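
A rank shift like this can be computed directly instead of being read off the two charts. The sketch below uses made-up counts (`worked` and `desired` are hypothetical, not the survey's numbers) and assumes every framework appears in both series; with two separate top-20 lists, a framework missing from one of them would need extra handling.

```python
import pandas as pd

# Hypothetical counts, for illustration only (not the survey's real numbers)
worked = pd.Series({"TensorFlow": 50, "Spark": 40, "Node.js": 30, "Torch/PyTorch": 10})
desired = pd.Series({"TensorFlow": 60, "Spark": 35, "Node.js": 20, "Torch/PyTorch": 30})

ranks = pd.DataFrame({
    # Rank 1 is the most frequently mentioned framework
    "rank_now": worked.rank(ascending=False, method="min").astype(int),
    "rank_next": desired.rank(ascending=False, method="min").astype(int),
})
# Positive values mean the framework climbs in the "desired next year" ranking
ranks["places_gained"] = ranks["rank_now"] - ranks["rank_next"]
print(ranks.sort_values("places_gained", ascending=False))
```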


In [25]:
# plot years of coding
temp = stack_dm_data["YearsCodingProf"].value_counts()
trace = go.Bar(
    x = temp.index,
    y = (temp / temp.sum())*100,
    marker=dict(
        color=['#FF3E96','#00E5EE','#FFF8DC','#68228B','#1E90FF','#FFC125','#FF6103','#8EE5EE','#458B00','#FFF8DC'],
        colorscale = 'Blues',
        reversescale = True
    ),
)
data = [trace]
layout = go.Layout(
    title = "Years data scientists have been coding professionally (%)",
    xaxis=dict(
        title='Years',
        tickfont=dict(
            size=11,
            color='rgb(107, 107, 107)'
        )
    ),
    yaxis=dict(
        title='Count in %',
        titlefont=dict(
            size=16,
            color='rgb(107, 107, 107)'
        ),
        tickfont=dict(
            size=14,
            color='rgb(107, 107, 107)'
        )
)
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='schoolStateNames')

34% of data scientists have been coding professionally for 0-2 years, 28% for 3-5 years, and 13% for 6-8 years. Most data scientists (about 62%) have been coding professionally for no more than 5 years.
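
A "no more than N years" statement is a cumulative share over ordered experience bands. The sketch below shows the calculation on made-up answers (`years` is hypothetical); the band order is stated explicitly, because the labels do not sort correctly as plain strings once bands like '12-14 years' are included.

```python
import pandas as pd

# Hypothetical answers; the category order is stated explicitly
order = ["0-2 years", "3-5 years", "6-8 years"]
answers = ["0-2 years", "3-5 years", "0-2 years", "6-8 years", "3-5 years",
           "0-2 years", "0-2 years", "3-5 years", "0-2 years", "6-8 years"]
years = pd.Series(pd.Categorical(answers, categories=order, ordered=True))

# Share per band, sorted by category order, then accumulated
pct = years.value_counts(normalize=True).sort_index() * 100
cumulative = pct.cumsum()
print(cumulative)
```

The cumulative value for a band is the share of respondents with that much experience or less.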

In [26]:
# Plot median salary against years of professional coding experience, per framework
df = stack_dm_data.set_index(['YearsCodingProf','ConvertedSalary']).FrameworkWorkedWith.str.split(';', expand=True).stack().reset_index(['YearsCodingProf','ConvertedSalary'])
df.columns = ['YearsCodingProf','Salary','Framework']
df['YearsCodingProf'] = df['YearsCodingProf'].astype('category')
# Assign the result back: the inplace argument of reorder_categories was removed in pandas 2.0
df['YearsCodingProf'] = df['YearsCodingProf'].cat.reorder_categories(['0-2 years','3-5 years','6-8 years', '9-11 years', '12-14 years', '15-17 years', '18-20 years', 
                                            '21-23 years',  '24-26 years','27-29 years', '30 or more years'])
frameworks = ['Hadoop', 'Spark', 'TensorFlow', 'Node.js', 'Torch/PyTorch']
fig = {
    'data': [
        {
            'x': df[df['Framework']==fw].groupby('YearsCodingProf').agg({'Salary' : 'median'}).sort_values(by = 'YearsCodingProf').reset_index()['YearsCodingProf'],
            'y': df[df['Framework']==fw].groupby('YearsCodingProf').agg({'Salary' : 'median'}).sort_values(by = 'YearsCodingProf').reset_index()['Salary'],
            'name': fw, 'mode': 'lines',
        } for fw in frameworks
    ],
    'layout': {
        'title' : 'Median salary ($) by framework and years of experience',
        'xaxis': {'title': 'Years of experience (coding professionally)'},
        'yaxis': {'title': "Median Salary ($)"}
    }
}
py.iplot(fig)

Experienced data scientists who use PyTorch tend to earn more.
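
Medians for the most experienced bands may rest on only a few respondents, so it can help to show the group size next to each median. This is a minimal sketch on made-up rows (the frame and salaries are hypothetical), using pandas named aggregation.

```python
import pandas as pd

# Hypothetical rows: one row per (respondent, framework) pair, made-up salaries
df = pd.DataFrame({
    "Framework": ["Torch/PyTorch", "Torch/PyTorch", "Torch/PyTorch",
                  "Spark", "Spark", "Spark"],
    "YearsCodingProf": ["0-2 years", "0-2 years", "3-5 years",
                        "0-2 years", "3-5 years", "3-5 years"],
    "Salary": [50000, 70000, 120000, 60000, 90000, 110000],
})

# Report the group size next to each median so thin groups are easy to spot
summary = (
    df.groupby(["Framework", "YearsCodingProf"])["Salary"]
      .agg(median="median", n="count")
      .reset_index()
)
print(summary)
```

A median with a very small `n` is a weak basis for comparing frameworks.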

In [27]:
#plot geographical division
temp = stack_dm_data["Country"].dropna().value_counts().head(20)
colors = []
for i in temp.index[::-1]:
    c = 'gray'
    if i == 'China':
        c = 'red'
    if i == "United States":
        c = 'blue'
    colors.append(c)
data = [go.Bar(
    y = temp.index[::-1],
    x = temp.values[::-1],
    text=temp.values[::-1],
    textposition = 'auto',
    orientation = 'h',
    marker = dict(
        color = colors
    ))  
    ]
layout = go.Layout(
    title = "Number of data scientist respondents by country (top 20)"
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig,filename='Country-bar')
        

Among the survey respondents, 80 data scientists are from China, where I live. The United States has the largest number, with 1,754 in total, about 22 times as many as China.