# A Deep Dive into the 2017 Kaggle ML and Data Science Survey Results


DATA 512 - UW Interdisciplinary Data Science Masters Program

Mobing Zhuang


## 1. Project Introduction

### 1.1 About the survey

In 2017, Kaggle conducted a [survey](https://www.kaggle.com/kaggle/kaggle-survey-2017) through Kaggle channels with the intention to establish a comprehensive view of the current state of data science and machine learning field.

16,716 usable survey responses were received from 171 countries and territories. The respondents were found through kaggle email list, discussion forums and social media channels. Questions in the survey consists of required and non-required ones. Survey responses were flagged as "spam" if their employment status, which is the first required question, was not answered. The survey is designed to only asking relevant questions to each respondent. In summary, this was a very well designed survey.

### 1.2 About the data

The data sets includes 5 files:

- schema.csv: This CSV file includes the survey questions that correspond to each column name in both the multipleChoiceResponses.csv and freeformResponses.csv.
- multipleChoiceResponses.csv: Respondents' answers to multiple choice and ranking questions. A single row is the answer from the same respondent.
- freeformResponses.csv: Respondents' freeform answers to Kaggle's survey questions. A single row is not the answer from the same respondent.
- conversionRates.csv: Currency conversion rates to USD 7.
- RespondentTypeREADME.txt: This describes respondent type in the "Asked" column of the schema.csv file.

License: 

According to Kaggle, the data is released under Open Database license, specifically, [ODC Open Database License (ODbL)](https://opendatacommons.org/licenses/odbl/1.0/). Users are free to share, to produce works, to modify, transform and build upon the data, as long as they attribute, share-alike, and keep open. [Click to read detailed explanations of how the data can be used.](https://opendatacommons.org/licenses/odbl/summary/) 


-----------

In [276]:
list(schema.Question)

['Select your gender identity. - Selected Choice',
 'Select your gender identity. - A different identity - Text',
 'Select the country you currently live in.',
 "What's your age?",
 "What's your current employment status?",
 'Are you currently enrolled as a student at a degree granting school?',
 'Are you currently focused on learning data science skills either formally or informally?',
 "What's your motivation for being a Kaggle user?",
 'Do you write code to analyze data in your current job, freelance contracts, or most recent job if retired?',
 'Are you actively looking to switch careers to data science?',
 "Select the option that's most similar to your current job/professional title (or most recent title if retired). - Selected Choice",
 "Select the option that's most similar to your current job/professional title (or most recent title if retired). - Other - Text",
 'How adequately do you feel your title describes what you do (or what you did if retired)?',
 'Which of the following

## 2. Project Background 

## 3. Research Questions and Methods

In [265]:
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
import copy
from datetime import datetime
import plotly
import plotly.graph_objs as go
from plotly import tools
import plotly.plotly as py
import numpy as np
from scipy.stats import ks_2samp

plotly.__version__
plotly.offline.init_notebook_mode(connected=True)
plotly.plotly.sign_in('45220Zmb', '9EywutMKCyDsD5WpWSp9')

convRate = pd.read_csv("./RawData/conversionRates.csv", encoding="ISO-8859-1")
freeRes = pd.read_csv("./RawData/freeformResponses.csv", encoding="ISO-8859-1")
multiRes = pd.read_csv("./RawData/multipleChoiceResponses.csv", encoding="ISO-8859-1")
schema = pd.read_csv("./RawData/schema.csv", encoding="ISO-8859-1")


Columns (5,17,21,38,50) have mixed types. Specify dtype option on import or set low_memory=False.


Columns (31,83,86,87,98,99,109,116,123,124,127,129,130,164) have mixed types. Specify dtype option on import or set low_memory=False.



## 4. Findings and Discussion

1.Who are the respondents? Country, age, gender, education (degree, major,)

In [19]:
multiRes.head()

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorExperienceLevel,JobFactorDepartment,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,,,Somewhat important,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,,
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,


In [139]:
df = multiRes[multiRes.Country != 'Other'].Country.value_counts().reset_index().rename(columns={'index': 'country', 'Country': 'count'})
# df = df[df.country != 'Other']

In [140]:
top_country = df.sort_values(by = 'count', ascending=False)[0:10].country
df.sort_values(by = 'count', ascending=False)[0:10]

Unnamed: 0,country,count
0,United States,4197
1,India,2704
2,Russia,578
3,United Kingdom,535
4,People 's Republic of China,471
5,Brazil,465
6,Germany,460
7,France,442
8,Canada,440
9,Australia,421


In [107]:
df_code = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')
df_code.ix[df_code.CODE == "CHN", 'COUNTRY'] = "People \'s Republic of China"
df_code.ix[df_code.CODE == "KOR", 'COUNTRY'] = "South Korea"
df_code = df_code.rename(columns={'COUNTRY':'country', 'CODE':'code'})

In [108]:
new_df = pd.merge(df, df_code, on='country', how ='left')
nans = lambda df: df[df.isnull().any(axis=1)]
nans(new_df)

Unnamed: 0,country,count,GDP (BILLIONS),code
2,Other,1023,,
42,Republic of China,67,,


In [277]:
data = [ dict(
        type = 'choropleth',
        locations = new_df['code'],
        z = new_df['count'],
        text = new_df['country'],
        colorscale = [[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
            [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            title = 'Number of Respondents'),
      ) ]

layout = dict(
    title = 'Survey Respondents Distribution',
    geo = dict(
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'Mercator'
        )
    )
)

fig = dict( data=data, layout=layout )
plotly.offline.iplot(fig, validate=False, filename='RespondentDistribution')

# Save the image
plotly.plotly.image.save_as(fig, filename='RespondentDistribution.png')

Age

In [141]:
top_country

0                  United States
1                          India
2                         Russia
3                 United Kingdom
4    People 's Republic of China
5                         Brazil
6                        Germany
7                         France
8                         Canada
9                      Australia
Name: country, dtype: object

In [145]:
trace0 = go.Box(
    y=multiRes["Age"],
    boxpoints = 'outliers',
    name = 'Global'
)

trace1 = go.Box(
    y=multiRes[multiRes.Country == top_country[0]]["Age"],
    boxpoints = 'outliers',
    name = top_country[0]
)

trace2 = go.Box(
    y=multiRes[multiRes.Country == top_country[1]]["Age"],
    boxpoints = 'outliers',
    name = top_country[1]
)

trace3 = go.Box(
    y=multiRes[multiRes.Country == top_country[2]]["Age"],
    boxpoints = 'outliers',
    name = top_country[2]
)
trace4 = go.Box(
    y=multiRes[multiRes.Country == top_country[3]]["Age"],
    boxpoints = 'outliers',
    name = top_country[3]
)
trace5 = go.Box(
    y=multiRes[multiRes.Country == top_country[4]]["Age"],
    boxpoints = 'outliers',
    name = 'China'
)
trace6 = go.Box(
    y=multiRes[multiRes.Country == top_country[5]]["Age"],
    boxpoints = 'outliers',
    name = top_country[5]
)
trace7 = go.Box(
    y=multiRes[multiRes.Country == top_country[6]]["Age"],
    boxpoints = 'outliers',
    name = top_country[6]
)
trace8 = go.Box(
    y=multiRes[multiRes.Country == top_country[7]]["Age"],
    boxpoints = 'outliers',
    name = top_country[7]
)
trace9 = go.Box(
    y=multiRes[multiRes.Country == top_country[8]]["Age"],
    boxpoints = 'outliers',
    name = top_country[8]
)
trace10 = go.Box(
    y=multiRes[multiRes.Country == top_country[9]]["Age"],
    boxpoints = 'outliers',
    name = top_country[9]
)
data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8, trace9, trace10]

layout = go.Layout(
    title = "Respondents' Age, Global Distribution and Among Top 10 Countries"
)

fig = dict( data=data, layout=layout )
plotly.offline.iplot(fig, validate=False, filename='AgeDistribution')

# Save the image
plotly.plotly.image.save_as(fig, filename='AgeDistribution.png')

Gender

Education

In [224]:
multiRes['MajorSelect']=multiRes['MajorSelect'].\
                        replace(to_replace ='Information technology, networking, or system administration',\
                                value = 'Information tech / System admin', axis=0)
multiRes['FormalEducation']=multiRes['FormalEducation'].\
                        replace(to_replace ="Some college/university study without earning a bachelor's degree",\
                                value = 'College study, no degree', axis=0)
multiRes['FormalEducation']=multiRes['FormalEducation'].\
                        replace(to_replace ="I did not complete any formal education past high school",\
                                value = 'No formal education past high school', axis=0)    
    
    


the "axis" argument is deprecated and will be removed inv0.13; this argument has no effect



In [225]:
edu = multiRes['FormalEducation'].value_counts()
labels = (np.array(edu.index))

values = (np.array((edu / edu.sum())*100))

trace = go.Pie(labels=labels, values=values, hole=0.4, hoverinfo='label+percent')

layout = go.Layout(
    title='Formal Education of the Survey Respondents'
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
# Show image in jupyter notebook
plotly.offline.iplot(fig, filename='FormalEducation')

# Save the image
plotly.plotly.image.save_as(fig, filename='FormalEducation.png')

In [226]:
edu = ["Master's degree", "Bachelor's degree", "Doctoral degree"]
major = [x for x in multiRes.MajorSelect.unique() if x != 'nan']

In [238]:
traces = []

for i in range(len(major)):
        traces.append(go.Bar(
            x=edu,
            y=multiRes[multiRes.MajorSelect== major[i]].FormalEducation.value_counts(),
            name=major[i]
        ))

In [239]:
traces

[{'name': 'Management information systems',
  'type': 'bar',
  'x': ["Master's degree", "Bachelor's degree", 'Doctoral degree'],
  'y': Master's degree             116
  Bachelor's degree            87
  College study, no degree     18
  Doctoral degree              13
  I prefer not to answer        3
  Name: FormalEducation, dtype: int64},
 {'name': 'Computer Science',
  'type': 'bar',
  'x': ["Master's degree", "Bachelor's degree", 'Doctoral degree'],
  'y': Bachelor's degree           1774
  Master's degree             1765
  Doctoral degree              512
  College study, no degree     306
  I prefer not to answer        30
  Name: FormalEducation, dtype: int64},
 {'name': 'Engineering (non-computer focused)',
  'type': 'bar',
  'x': ["Master's degree", "Bachelor's degree", 'Doctoral degree'],
  'y': Master's degree             633
  Bachelor's degree           470
  Doctoral degree             193
  College study, no degree     34
  I prefer not to answer        2
  Name: Forma

In [485]:
layout = go.Layout(
    title='Education and Major Distribution, College Graduate Respondents',
    barmode='stack',
    height= 600, width = 1000,
#     margin=go.Margin(
#         l=100,
#         r=150,
#         b=150,
#         t=100,
#         pad=4
#     ),
#     xaxis = dict(title = 'Country'),
    yaxis = dict(title = 'Frequency'),
)

fig = go.Figure(data=traces, layout=layout)
# Show image in jupyter notebook
plotly.offline.iplot(fig, filename='EducationDistribution')

# Save the image
plotly.plotly.image.save_as(fig, filename='EducationDistribution.png')

2.Employment and Salary(with conversion, distribution by country, gender).

In [249]:
salary_df = pd.merge(multiRes, convRate, left_on="CompensationCurrency", right_on="originCountry", how="inner" )

In [259]:
salary_df['CompensationAmount']=salary_df['CompensationAmount'].str.replace(',','')
salary_df['CompensationAmount']=salary_df['CompensationAmount'].str.replace('-','')

In [263]:
salary_df['Salary']=pd.to_numeric(salary_df['CompensationAmount'], errors='ignore')*salary_df['exchangeRate']
salary_df=salary_df[salary_df['Salary']<1000000]

In [486]:
trace0 = go.Box(
    y=salary_df["Salary"],
    boxpoints = 'outliers',
    name = 'Global'
)

trace1 = go.Box(
    y=salary_df[salary_df.Country == top_country[0]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[0]
)

trace2 = go.Box(
    y=salary_df[salary_df.Country == top_country[1]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[1]
)

trace3 = go.Box(
    y=salary_df[salary_df.Country == top_country[2]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[2]
)
trace4 = go.Box(
    y=salary_df[salary_df.Country == top_country[3]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[3]
)
trace5 = go.Box(
    y=salary_df[salary_df.Country == top_country[4]]["Salary"],
    boxpoints = 'outliers',
    name = 'China'
)
trace6 = go.Box(
    y=salary_df[salary_df.Country == top_country[5]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[5]
)
trace7 = go.Box(
    y=salary_df[salary_df.Country == top_country[6]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[6]
)
trace8 = go.Box(
    y=salary_df[salary_df.Country == top_country[7]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[7]
)
trace9 = go.Box(
    y=salary_df[salary_df.Country == top_country[8]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[8]
)
trace10 = go.Box(
    y=salary_df[salary_df.Country == top_country[9]]["Salary"],
    boxpoints = 'outliers',
    name = top_country[9]
)
data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8, trace9, trace10]

layout = go.Layout(
    title = "Respondents' Salary, Global Distribution and Among Top 10 Countries",
    yaxis=dict(title="Salary in USD")
)

fig = dict( data=data, layout=layout )
plotly.offline.iplot(fig, validate=False, filename='SalaryDistribution')

# Save the imSalary
plotly.plotly.image.save_as(fig, filename='SalaryDistribution.png')

Are there any significant differences in 'Salary vs degree'?
Intersectionality, highest degree major combinition?

Are there any significant differences in 'Salary vs R Python use'?
Are there any significant differences in 'Salary vs industry'?



>>> from scipy.stats import ks_2samp
>>> import numpy as np
>>> 

Ks_2sampResult(statistic=0.022999999999999909, pvalue=0.95189016804849647)
>>> ks_2samp(x, z)
Ks_2sampResult(statistic=0.41800000000000004, pvalue=3.7081494119242173e-77)


In [304]:
data1 = salary_df[salary_df.FormalEducation == "Bachelor's degree"].Salary
data2 = salary_df[salary_df.FormalEducation == "Master's degree"].Salary
ks_2samp(data1, data2)

Ks_2sampResult(statistic=0.12803772559870119, pvalue=1.1378413363120816e-10)

In [305]:
data3 = salary_df[salary_df.FormalEducation == "Doctoral degree"].Salary
ks_2samp(data2, data3)

Ks_2sampResult(statistic=0.1621795470716334, pvalue=1.5723926350599245e-15)

In [306]:
ks_2samp(data1, data3)

Ks_2sampResult(statistic=0.2677802009263669, pvalue=4.9354791337087928e-33)

In [487]:
trace1 = go.Box(
    y=salary_df[salary_df.FormalEducation == "Bachelor's degree"].Salary,
    boxpoints = 'outliers',
    name = "Bachelor's degree"
)

trace2 = go.Box(
    y=salary_df[salary_df.FormalEducation == "Master's degree"].Salary,
    boxpoints = 'outliers',
    name = "Master's degree"
)

trace3 = go.Box(
    y=salary_df[salary_df.FormalEducation == "Doctoral degree"].Salary,
    boxpoints = 'outliers',
    name = "Doctoral degree"
)

data = [trace1, trace2, trace3]

layout = go.Layout(
    title = "Respondents' Salary vs. Education Level, Global",
    yaxis=dict(title="Salary in USD")
)

fig = dict( data=data, layout=layout )
plotly.offline.iplot(fig, validate=False, filename='Salary_Education_Distribution')

# Save the imSalary
plotly.plotly.image.save_as(fig, filename='Salary_Education_Distribution.png')

In [311]:
salary_df.Country.head()

0    United States
1    United States
2    United States
3    United States
4           Sweden
Name: Country, dtype: object

In [312]:
data1 = salary_df[(salary_df.FormalEducation == "Bachelor's degree") & (salary_df.Country == "United States")].Salary
data2 = salary_df[(salary_df.FormalEducation == "Master's degree") & (salary_df.Country == "United States")].Salary
ks_2samp(data1, data2)

Ks_2sampResult(statistic=0.078538389203751224, pvalue=0.18492561390634177)

In [313]:
data3 = salary_df[(salary_df.FormalEducation == "Doctoral degree") & (salary_df.Country == "United States")].Salary
ks_2samp(data2, data3)

Ks_2sampResult(statistic=0.19932142998984415, pvalue=1.4171887845776075e-07)

In [314]:
ks_2samp(data1, data3)

Ks_2sampResult(statistic=0.2629131582254618, pvalue=7.0211815370630783e-10)

In [488]:
trace1 = go.Box(
    y=salary_df[(salary_df.FormalEducation == "Bachelor's degree") & (salary_df.Country == "United States")].Salary,
    boxpoints = 'outliers',
    name = "Bachelor's degree"
)

trace2 = go.Box(
    y=salary_df[(salary_df.FormalEducation == "Master's degree") & (salary_df.Country == "United States")].Salary,
    boxpoints = 'outliers',
    name = "Master's degree"
)

trace3 = go.Box(
    y=salary_df[(salary_df.FormalEducation == "Doctoral degree") & (salary_df.Country == "United States")].Salary,
    boxpoints = 'outliers',
    name = "Doctoral degree"
)

data = [trace1, trace2, trace3]

layout = go.Layout(
    title = "Respondents' Salary vs. Education Level, USA",
    yaxis=dict(title="Salary in USD")
)

fig = dict( data=data, layout=layout )
plotly.offline.iplot(fig, validate=False, filename='Salary_Education_USA_Distribution')

# Save the imSalary
plotly.plotly.image.save_as(fig, filename='Salary_Education_USA_Distribution.png')

In [318]:
edu = ["Master's degree", "Bachelor's degree", "Doctoral degree"]
major = ['Mathematics or statistics', 'Electrical Engineering', 'Computer Science', 'Engineering (non-computer focused)']

In [345]:
new_salary_df = salary_df[(salary_df.FormalEducation.isin(edu)) & (salary_df.MajorSelect.isin(major))]

In [346]:
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

In [347]:
df

Unnamed: 0,Year,quarter,period
0,2014,q1,2014q1
1,2015,q2,2015q2


In [348]:
new_salary_df['Edu_Major'] = new_salary_df[['FormalEducation', 'MajorSelect']].apply(lambda x: '_'.join(x), axis=1)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [351]:
new_salary_df.groupby(['Edu_Major']).Salary.mean()

Edu_Major
Bachelor's degree_Computer Science                      45604.895259
Bachelor's degree_Electrical Engineering                47538.228127
Bachelor's degree_Engineering (non-computer focused)    49509.121056
Bachelor's degree_Mathematics or statistics             62866.250542
Doctoral degree_Computer Science                        66301.022494
Doctoral degree_Electrical Engineering                  77882.938032
Doctoral degree_Engineering (non-computer focused)      98196.947063
Doctoral degree_Mathematics or statistics               90649.616021
Master's degree_Computer Science                        52387.068246
Master's degree_Electrical Engineering                  61286.012315
Master's degree_Engineering (non-computer focused)      68151.090977
Master's degree_Mathematics or statistics               65144.141874
Name: Salary, dtype: float64

In [352]:
new_salary_df[new_salary_df.Country == "United States"].groupby(['Edu_Major']).Salary.mean()

Edu_Major
Bachelor's degree_Computer Science                      115241.137931
Bachelor's degree_Electrical Engineering                123408.833333
Bachelor's degree_Engineering (non-computer focused)    113214.285714
Bachelor's degree_Mathematics or statistics             109443.548387
Doctoral degree_Computer Science                        145126.666667
Doctoral degree_Electrical Engineering                  122108.387097
Doctoral degree_Engineering (non-computer focused)      127352.710204
Doctoral degree_Mathematics or statistics               145431.372549
Master's degree_Computer Science                        116026.006951
Master's degree_Electrical Engineering                  115783.744186
Master's degree_Engineering (non-computer focused)      112970.827474
Master's degree_Mathematics or statistics               111915.113393
Name: Salary, dtype: float64

How dependant is the job titile of the major of a worker*

evaluate dependacy between two categorical variables in statistics. Ladies and gentlemen, let me introduce to you the one and only **chi2 test of independence*

3.Python vs. R (gloabal distribution, by industry, job title, by data type, by ML skills algorithms)

In [366]:
df=multiRes[["WorkToolsFrequencyR","WorkToolsFrequencyPython"]].fillna(0)
df.replace(to_replace=['Rarely','Sometimes','Often','Most of the time'], value=[1,2,3,4], inplace=True)
df['Language'] = [ 'R' if (freq1 >2 and freq1 > freq2) else \
                   'Python' if (freq1<freq2 and freq2>2) else \
                   'Both' if (freq1==freq2 and freq1 >2) else \
                   'None' for (freq1,freq2) in zip(df["WorkToolsFrequencyR"],df["WorkToolsFrequencyPython"])]
multiRes['Language']=df['Language']
language_df=multiRes[multiRes.Language != "None"]

In [367]:
language_df['Language'].value_counts(normalize=True)

Python    0.557483
R         0.300146
Both      0.142371
Name: Language, dtype: float64

In [296]:
trace0 = go.Bar(
    x=language_df["Language"].value_counts(normalize=True).index,
    y=language_df['Language'].value_counts(normalize=True).values,
    name = 'Global'
)

trace1 = go.Bar(
    x=language_df[language_df.Country == top_country[0]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[0]]['Language'].value_counts(normalize=True).values,
    name = top_country[0]
)

trace2 = go.Bar(
    x=language_df[language_df.Country == top_country[1]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[1]]['Language'].value_counts(normalize=True).values,
    name = top_country[1]
)

trace3 = go.Bar(
    x=language_df[language_df.Country == top_country[2]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[2]]['Language'].value_counts(normalize=True).values,
    name = top_country[2]
)
trace4 = go.Bar(
    x=language_df[language_df.Country == top_country[3]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[3]]['Language'].value_counts(normalize=True).values,
    name = top_country[3]
)
trace5 = go.Bar(
    x=language_df[language_df.Country == top_country[4]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[4]]['Language'].value_counts(normalize=True).values,
    name = 'China'
)
trace6 = go.Bar(
    x=language_df[language_df.Country == top_country[5]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[5]]['Language'].value_counts(normalize=True).values,
    name = top_country[5]
)
trace7 = go.Bar(
    x=language_df[language_df.Country == top_country[6]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[6]]['Language'].value_counts(normalize=True).values,
    name = top_country[6]
)
trace8 = go.Bar(
    x=language_df[language_df.Country == top_country[7]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[7]]['Language'].value_counts(normalize=True).values,
    name = top_country[7]
)
trace9 = go.Bar(
    x=language_df[language_df.Country == top_country[8]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[8]]['Language'].value_counts(normalize=True).values,
    name = top_country[8]
)
trace10 = go.Bar(
    x=language_df[language_df.Country == top_country[9]]["Language"].value_counts(normalize=True).index,
    y=language_df[language_df.Country == top_country[9]]['Language'].value_counts(normalize=True).values,
    name = top_country[9]
)
data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8, trace9, trace10]

layout = go.Layout(
    title = "Respondents' Language, Global Distribution and Among Top 10 Countries",
    barmode = "group",
    yaxis=dict(tickformat=".0%", title="Percentage of Respondents")
)

fig = dict( data=data, layout=layout )
plotly.offline.iplot(fig, validate=False, filename='LanguageDistribution')

# Save the imLanguage
plotly.plotly.image.save_as(fig, filename='LanguageDistribution.png')

In [369]:
language_df['MLSkillsSelect'].head()

5     Natural Language Processing,Supervised Machine...
9     Supervised Machine Learning (Tabular Data),Tim...
11    Computer Vision,Natural Language Processing,Re...
14           Supervised Machine Learning (Tabular Data)
15    Supervised Machine Learning (Tabular Data),Sur...
Name: MLSkillsSelect, dtype: object

In [372]:
language_df['Language'].unique()

array(['Python', 'R', 'Both'], dtype=object)

In [443]:
skills = ['Natural Language Processing', 'Computer Vision', 'Adversarial Learning',
          'Supervised Machine Learning (Tabular Data)', 'Reinforcement learning',
          'Unsupervised Learning', 'Outlier detection (e.g. Fraud detection)',
          'Time Series', 'Recommendation Engines']

In [375]:
language_df.ix[:, ('Language', 'MLSkillsSelect')].head()

Unnamed: 0,Language,MLSkillsSelect
5,Python,"Natural Language Processing,Supervised Machine..."
9,Python,"Supervised Machine Learning (Tabular Data),Tim..."
11,Python,"Computer Vision,Natural Language Processing,Re..."
14,Python,Supervised Machine Learning (Tabular Data)
15,R,"Supervised Machine Learning (Tabular Data),Sur..."


### Count of Skills per language

In [414]:
pythonSkills = ",".join(language_df.ix[language_df.Language == 'Python', ('MLSkillsSelect')].dropna().tolist())

In [433]:
python_skills = pd.Series(pythonSkills.split(",")).value_counts()

In [422]:
RSkills = ",".join(language_df.ix[language_df.Language == 'R', ('MLSkillsSelect')].dropna().tolist())

In [434]:
r_skills = pd.Series(RSkills.split(",")).value_counts()

In [430]:
BothSkills = ",".join(language_df.ix[language_df.Language == 'Both', ('MLSkillsSelect')].dropna().tolist())

In [435]:
both_skills = pd.Series(BothSkills.split(",")).value_counts()

In [445]:
d = {
    'Python':python_skills,
    'R':r_skills,
    'Both':both_skills
}
summary_skills = pd.DataFrame(d).transpose()

In [448]:
summary_skills = summary_skills[['Natural Language Processing', 'Computer Vision', 'Adversarial Learning',
                  'Supervised Machine Learning (Tabular Data)', 'Reinforcement learning',
                  'Unsupervised Learning', 'Outlier detection (e.g. Fraud detection)',
                  'Time Series', 'Recommendation Engines']]

In [449]:
summary_skills

Unnamed: 0,Natural Language Processing,Computer Vision,Adversarial Learning,Supervised Machine Learning (Tabular Data),Reinforcement learning,Unsupervised Learning,Outlier detection (e.g. Fraud detection),Time Series,Recommendation Engines
Both,295,152,35,677,99,433,288,420,233
Python,1152,966,195,2385,344,1267,734,1116,754
R,343,107,31,1333,95,812,497,886,290


In [451]:
from scipy.stats import chi2_contingency
print(chi2_contingency(summary_skills))

(640.00559851462276, 7.3751476866605411e-126, 16, array([[  295.58190602,   202.28370663,    43.09881423,   725.74440053,
           88.83970136,   414.80544576,   250.83179622,   399.94378568,
          210.87044357],
       [ 1000.95802748,   685.01317523,   145.94974591,  2457.6595144 ,
          300.84660267,  1404.69640504,   849.41633729,  1354.36890646,
          714.09128553],
       [  493.4600665 ,   337.70311814,    71.95143986,  1211.59608507,
          148.31369597,   692.49814919,   418.75186649,   667.68730786,
          352.03827091]]))


In [480]:
list([summary_skills.apply(lambda x: x/float(x.sum())).iloc[0].values)

SyntaxError: invalid syntax (<ipython-input-480-4793d6a7a045>, line 1)

In [484]:
trace0 = go.Bar(
    x=skills,
    y=summary_skills.apply(lambda x: x/float(x.sum())).iloc[0].values,
    name='Both'
)

trace1 = go.Bar(
    x=skills,
    y=summary_skills.apply(lambda x: x/float(x.sum())).iloc[1].values,
    name='Python'
)

trace2 = go.Bar(
    x=skills,
    y=summary_skills.apply(lambda x: x/float(x.sum())).iloc[2].values,
    name='R'
)

data = [trace0, trace1, trace2]

layout = go.Layout(
    title = "Respondents' Language vs. Algorithms",
    barmode = "stack",
    yaxis=dict(tickformat=".0%", title="Percentage of Language Use"),
    height= 600, width = 1000,
    margin=go.Margin(
        l=100,
        r=150,
        b=150,
        t=100,
        pad=4
    )
)

fig = dict(data=data, layout=layout)
plotly.offline.iplot(fig, filename='Language_ML_Distribution')

# Save the imLanguage
plotly.plotly.image.save_as(fig, filename='Language_ML_Distribution.png')

In [None]:
d_skills={}
for skill in skills : 
    d_skills[skill]={'Python':0,'R':0,'Both':0}
    for (i,elem) in zip(range(df.shape[0]),df['MLSkillsSelect']):
        if skill in elem : 
            d_skills[skill][df['PythonVsR'].iloc[i]]+=1
    d_skills[skill]['Python']=100*d_skills[skill]['Python']/len(df[df['PythonVsR']=='Python'])
    d_skills[skill]['R']=100*d_skills[skill]['R']/len(df[df['PythonVsR']=='R'])
    d_skills[skill]['Both']=100*d_skills[skill]['Both']/len(df[df['PythonVsR']=='Both'])

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html

4.Learners, Learning experience and resources, time spent


## 2. Project Goal

Data scientist has been referred to as the "sexiest job of the 21st century" and "best job of the year 2016" by Harvard Business Review and Glassdoor respectively. The goal of this project is to dive deep into the survey results and get a comprehensive understanding of state of data science and the job itself. This project will inform us what's happening at the cutting edge of data science across industries and subsequently guide new data scientists breaking into the field.

I will design the analyses in attempt to answer the following questions, which are potentially important information for people in the field.

- What hardware and cloud service are being used? Which one is the most popular and what are their advantages over others?
- What programming languages do data scientists know? Or do data scientists need to know any language like C++, C##, or Java?
- In terms of data analysis, Python or R? Which one is gaining popularity? How about the global distribution of Python and R users.
- What software or tools are being used for data visualization?
- What do data scientists want to learn? What are the hottest skills?
- How is the gender balance among data scientists? Is the balance or imbalance same globally?

-----------

---

### 3. Research Approach

Add additional notes about how to deploy this on a live system

##### Step 1. Data munging and exploration
Tools: R

Visualization: R Plotly

In this step, I will first deal with data quality issues, like missing value and formatting. Subsequently, merging datasets in preparation for later analysis. And then explore the final dataset by data aggregation and interactive visualization.

##### Step 2. Refine research questions

This step is mainly brainstorming and research. I will refine the questions in section 2 and bring out more questions that seems potentially interesting after know exploring the data more. Additional dataset may be brought to this research as reference or comparison. For instance, in the gender balance question, compare the results with the gender balance of college science and engineering major students of the corresponding country, if there is dataset available.

##### Step 3. Data visualization product development
Tools: R Shiny

Visualization: R Plotly

A dashboard will be built using R Shiny, which is the major deliverable of this project. Analysis results will be shown in this interactive dashboard so that users or audiences of the study will explore the topic better. Other deliverables are a report and a well documented github repo.

### 4. Human Centered Data Science Considerations

The privacy of the respondents has been taken good care by the survey designers before releasing the data. It is stated that the responses to multiple choice questions and open-ended responses have been separated without providing any key to match them up. In addition, freeformResponses.csv file has been sanitized and randomized that responses in the same row may not come from the same respondent.

I would like to follow the best practice guidelines of HCDE. Here are the key elements I have leanred from the class and will incorporated into the project.

- Replicability and reproducibility
  
  i. Clearly separate and label all data and files
  ii. Fully document all operations that occur on data and files
  iii. Automate all operations as much as possible, thus avoiding manual intervention in the workflow when feasible 
  iv. Generate intermediate outputs from one step and then feed it into the next step.
  
- License of source data, privacy policy and terms of use
- License of code and copyright


##### References
1. Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Part I: Practicing Reproducibility. Oakland, CA: University of California Press.

## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details