<img src="http://www.macdonaldandcompany.com/File.ashx?path=Root/Images/News/PI_Cyber_Sec.gif"/>

# Kaggle Survey - Data Analysis Report
***

*'So what's Data Science anyway?'*, *'What languages do you guys use most?'*, *'What are the most valuable technical skills for a Data Scientist?'* and maybe the hardest one, *'What does a Data Scientist specifically do at work?'*. In my (very long and entertaining, thankfully) quest to become a data scientist, I've been asked these straightforward and basic questions, and still I found it hard to give a proper answer.

According to Glassdoor, 'Data Scientist' is the best job in America for the second year in a row, while a new study by CareerCast.com revealed that data scientist jobs have the best growth potential over the next seven years. Clearly, Data Science is THE hype, the real deal. Yet it remains somewhat hard to set boundaries and axioms for Data Science. While we can say something like 'Being good at probabilities and stochastic processes guarantees success as a QR' (sorry fellow financial engineers), there's no equivalent for Data Science, at least in my opinion.

This Kaggle survey aims to dig deeper and understand how data science is perceived by the expert and aspiring data scientists who use this wonderful platform. The data we've been given will help us understand which platforms people use to learn data science, what the most valuable skills in data science are, what data scientists look for when they're job hunting, which programming language is most frequently used, and a lot more.

Let's now embark on a journey of Data Science discovery. Enjoy the ride!


# Table of contents

* [About the survey](#introduction)
* [1. Demographic analysis](#demographics)
   * [1.1. Gender, Age and Country](#general)
   * [1.2. Formal education and Major](#education)
   * [1.3. Employment status and Career plan](#prolife)
* [2. Python Vs. R](#PythonVsR)
    * [2.1. Use across the world](#countryuse)
    * [2.2. Use at the workplace](#skills)
    * [2.3. Industry and Job title](#industry)
    * [2.4. Main function and percentage of time devoted to specific tasks](#tasks)
    * [2.5. Experience as code writers](#tenure)
    * [2.6. Which language is recommended for beginners](#learnfirst)
* [3. Annual Income : Extensive Analysis](#salary)
    * [3.1. Annual income for US citizens](#us_salary)
    * [3.2. Income by gender](#gender_salary)
    * [3.3. Income by Academic degrees](#education_salary)
    * [3.4. Income by job titles](#jobtitle_salary)
    * [3.5. Dimensionality reduction and 2D-plotting (MCA)](#mca)
* [4. Kaggle's learners community](#learners)
    * [4.1. Demographic properties of learners community](#demographic)
    * [4.2. Most used platforms for learning Data Science](#platforms)
    * [4.3. Time spent during Data Science learning](#time_learning)
    * [4.4. Most important skills for landing a job in Data Science](#skills_job)
    * [4.5. Learners' job hunting](#job_hunt)
* [Conclusion](#conclusion)


## About the survey 
<a id="introduction"></a> 

There were 16 716 Kaggle respondents to this survey. The questions covered a broad spectrum, starting with general demographic questions before moving on to specific DS/ML questions for both the working community and the learning one.

Five files come with this survey : 

- **multipleChoiceResponses.csv**: Participants' answers to the multiple choice questions. Each row holds the answers of one respondent and each column corresponds to a specific question.
- **freeformResponses.csv**: Each time a respondent selected 'Other' and filled in the 'Please specify' part, their answer was added to this file.
- **schema.csv**: This file lists all the questions that were asked, explains each one of them and specifies to whom they were asked (learners, coders...).
- **RespondentTypeREADME.txt**: This explains how Kaggle defines the respondent types: who the learners are, who the workers are, and who the coding workers are.
- **conversionRates.csv**: Currency conversion rates to USD.

The most important file is **multipleChoiceResponses.csv**, which contains most of the information we will need.
**RespondentTypeREADME.txt** is necessary when you want to understand the behaviour of a specific community of kagglers, and **schema.csv** helps in understanding the questions.

Let us load those files.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
import operator
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
cvRates = pd.read_csv('../input/conversionRates.csv', encoding="ISO-8859-1")
freeForm = pd.read_csv('../input/freeformResponses.csv', encoding="ISO-8859-1")
data = pd.read_csv('../input/multipleChoiceResponses.csv', encoding="ISO-8859-1")
schema = pd.read_csv('../input/schema.csv', encoding="ISO-8859-1")

We can now begin the analysis. First comes the demographic overview.

## 1. Tell me about yourself
<a id="demographics"></a>

<img src="https://media.giphy.com/media/12WawXb61sDRMA/giphy.gif"/>

We'll start with a general overview of the demographic properties.

### Gender, age and country
<a id="general"></a>

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(y=data['GenderSelect'])
plt.title("Gender Distribution of the survey participants", fontsize=16)
plt.yticks(range(len(data['GenderSelect'].value_counts().index)), ['Non-conforming', 'Female', 'Male','Different'])

plt.xlabel("Number", fontsize=16)
plt.ylabel("Gender", fontsize=16)

print('Proportion of women in this survey: {:0.2f}% '.format(100*len(data[data['GenderSelect']=='Female'])/len(data['GenderSelect'].dropna())))
print('Proportion of men in this survey: {:0.2f}% '.format(100*len(data[data['GenderSelect']=='Male'])/len(data['GenderSelect'].dropna())))


Ouch! The gender gap is huge! Unfortunately, this is common in the tech industry. Statistics show that **women hold only 25% of computing jobs**, which is already low, but what we have here is worse. 16.71% is too low: **there are 5 times as many male respondents as female respondents.**

In [None]:
print('{} instances seem to be too old (>65 years old)'.format(len(data[data['Age']>65])))
print('{} instances seem to be too young (<15 years old)'.format(len(data[data['Age']<15])))

Ages such as 0, 5 or 100 years old don't make much sense. Removing those instances here (we'll keep them later on, as age doesn't affect the other properties) yields more meaningful results.

In [None]:
age=data[(data['Age']>=15) & (data['Age']<=65) ]
plt.figure(figsize=(10,8))
sns.boxplot( y=age['Age'],data=age)
plt.title("Age boxplot", fontsize=16)
plt.ylabel("Age", fontsize=16)


The age median is about 30 years old and most participants are between 25 and 37 years old.
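The figures read off the boxplot can also be computed directly. Here is a minimal sketch, assuming the same 15 to 65 bounds as above; `age_summary` is a hypothetical helper and the ages are toy values, not survey data.

```python
import pandas as pd

def age_summary(ages, low=15, high=65):
    """Return the quartiles of the ages that fall within [low, high]."""
    kept = ages[(ages >= low) & (ages <= high)]
    q1, median, q3 = kept.quantile([0.25, 0.5, 0.75])
    return {'q1': q1, 'median': median, 'q3': q3}

# Toy ages, not survey data; 5 and 100 fall outside the bounds and are dropped
toy_ages = pd.Series([5, 22, 25, 30, 31, 37, 40, 100])
print(age_summary(toy_ages))
```

On the real data, `age_summary(data['Age'])` should give the median and the edges of the box shown in the plot.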

In [None]:
plt.figure(figsize=(12,8))
countries = data['Country'].value_counts().head(30)
sns.barplot(y=countries.index, x=countries.values, alpha=0.6)
plt.title("Country Distribution of the survey participants", fontsize=16)
plt.xlabel("Number of participants", fontsize=16)
plt.ylabel("Country", fontsize=16)

Seems like most Kagglers are either Americans or Indians. More precisely,

In [None]:
print('{:0.2f}% of the instances are Americans'.format(100*len(data[data['Country']=='United States'])/len(data)))
print('{:0.2f}% of the instances are Indians'.format(100*len(data[data['Country']=='India'])/len(data)))

All in all, 41.29% of the total instances are either from the US or India. This is sort of expected because those are the two most active communities around the world in Data Science (thanks to Kaggle and Analytics Vidhya).

### Formal education and Major
<a id="education"></a>

In [None]:
edu = data['FormalEducation'].value_counts()
labels = (np.array(edu.index))

values = (np.array((edu / edu.sum())*100))

trace = go.Pie(labels=labels, values=values,
              hoverinfo='label+percent',
               textfont=dict(size=20),
                showlegend=False)

layout = go.Layout(
    title='Formal Education of the survey participants'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Formal_Education")


Nearly half of the kagglers who took this survey are Master's graduates, impressive.   
What's more, 80.34% of respondents hold at least a bachelor's degree. 

In [None]:
plt.figure(figsize=(10,8))
majors = data['MajorSelect'].value_counts()
sns.barplot(y=majors.index, x=majors.values, alpha=0.6)
plt.title("Majors of the survey respondents", fontsize=16)
plt.xlabel("Number of respondents", fontsize=16)
plt.ylabel("Majors", fontsize=16)

Okay, I've got to admit, this took me by surprise. I expected a lot more people to have Mathematics as their major, but I was wrong. Respondents majoring in Computer Science are twice as numerous as those majoring in Mathematics.     
I also expected Physics to rank higher than Electrical Engineering, but I was proved wrong on this one too, by a wide margin. I guess that's because of the Bachelor's/Master's degree in 'Electrical Engineering and Computer Science'.

### Employment status and Career plan
<a id="prolife"></a>

In [None]:
data['EmploymentStatus']=data['EmploymentStatus'].replace(to_replace ='Independent contractor, freelancer, or self-employed',
                                                       value = 'Independent')

In [None]:
plt.figure(figsize=(10,8))
employment = data['EmploymentStatus'].value_counts()
sns.barplot(y=employment.index, x=employment.values, alpha=0.6)
plt.title("Employment Status of the survey participants", fontsize=16)
plt.xlabel("Number of respondents", fontsize=16)
plt.ylabel("Employment Status", fontsize=16)

In [None]:
print('{:0.2f}% of the instances are employed full-time'.format(100*len(data[data['EmploymentStatus']=='Employed full-time'])/len(data)))
status=['Employed full-time','Independent','Employed part-time']
print('{:0.2f}% of the instances are employed'.format(100*len(data[data.EmploymentStatus.isin(status)])/len(data)))

The overwhelming majority of participants are employed and, as you can see, most of them are employed full-time.
Still, I thought I would find more freelancers and independent workers, given that several companies hire private data science consultants and that some data scientists dedicate a lot of their time to Kaggle competitions.

An interesting question was asked to kagglers in this survey : 'Are you actively looking to switch careers to data science?'.
Only 3012 respondents answered this question. My guess is that the other respondents are either already working as data scientists, in which case Yes/No would make no sense, or are still students.

In [None]:
car = data['CareerSwitcher'].value_counts()
labels = (np.array(car.index))
proportions = (np.array((car / car.sum())*100))
colors = ['#FEBFB3', '#E1396C']

trace = go.Pie(labels=labels, values=proportions,
              hoverinfo='label+percent',
              marker = dict(colors=colors, 
                           line=dict(color='#000000', width=2)))
layout = go.Layout(
    title='Working people looking to switch careers to data science'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Career_Switcher")

Unsurprisingly, most working kagglers who aren't yet data scientists would love to switch careers!

Okay, so we've pretty much covered all the demographic features of the dataset (except income, which gets an extended analysis below). We can move on to the second part, which tries to answer the following question:

**How do Python and R fare in Data Science ?**


## 2. Python Vs R : Let the battle begin ! 
<a id="PythonVsR"></a>

*'What do you use for data science stuff, R or Python?'* must be the question we all got more than once as aspiring data scientists. A couple of months ago, polls suggested that Python had definitely overtaken R and become the leading language for Data Science.
This survey will allow us to dig deeper into that question.
Let's begin!

<img src="https://i.imgur.com/sIpzGMl.gif"/>

First, we must identify R and Python users amongst the instances.       
We'll only look at **working people (coding workers)**, who are defined as follows: *Respondents who indicated that they were "Employed full-time" or "Employed part-time" AND that they write code to analyze data in their current job.*     

The survey asked how frequently respondents use Python and R, with answers ranging from 'Rarely' to 'Most of the time'.

- Respondents who use Python 'Often' or 'Most of the time', and use R less frequently, are considered Python users.
- Respondents who use R 'Often' or 'Most of the time', and use Python less frequently, are considered R users.
- Respondents who use both R and Python equally, and at least 'Often', are categorized as 'Both'.
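To make these three rules easy to check, here is a small vectorized sketch applied to a toy frame. `classify_users` and the `FREQ` mapping are hypothetical names introduced for illustration, and the toy answers are made up.

```python
import numpy as np
import pandas as pd

# Ordinal encoding of the frequency answers; a missing answer counts as 0
FREQ = {'Rarely': 1, 'Sometimes': 2, 'Often': 3, 'Most of the time': 4}

def classify_users(freq_r, freq_python):
    """Label each respondent 'R', 'Python', 'Both' or 'None'."""
    r = freq_r.map(FREQ).fillna(0)
    p = freq_python.map(FREQ).fillna(0)
    conditions = [(r > 2) & (r > p), (p > 2) & (p > r), (r == p) & (r > 2)]
    labels = np.select(conditions, ['R', 'Python', 'Both'], default='None')
    return pd.Series(labels, index=freq_r.index)

# Toy answers, not survey data
toy = pd.DataFrame({
    'WorkToolsFrequencyR': ['Often', 'Rarely', 'Most of the time', None, 'Sometimes'],
    'WorkToolsFrequencyPython': ['Rarely', 'Most of the time', 'Most of the time', None, 'Sometimes'],
})
print(classify_users(toy['WorkToolsFrequencyR'], toy['WorkToolsFrequencyPython']).tolist())
```

Note that someone who uses both languages only 'Sometimes' ends up in 'None', because neither language reaches the 'Often' threshold.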

In [None]:
t2=data[["WorkToolsFrequencyR","WorkToolsFrequencyPython"]].fillna(0)
t2.replace(to_replace=['Rarely','Sometimes','Often','Most of the time'], 
           value=[1,2,3,4], inplace=True)
t2['PythonVsR'] = [ 'R' if (freq1 >2 and freq1 > freq2) else
                    'Python' if (freq1<freq2 and freq2>2) else
                    'Both' if (freq1==freq2 and freq1 >2) else
                    'None' for (freq1,freq2) in zip(t2["WorkToolsFrequencyR"],t2["WorkToolsFrequencyPython"])]
data['PythonVsR']=t2['PythonVsR']

df = data[data['PythonVsR']!='None'].copy()  # copy so later column assignments don't act on a view of data
print("Python users: ",len(df[df['PythonVsR']=='Python']))
print("R users: ",len(df[df['PythonVsR']=='R']))
print("Python+R users: ",len(df[df['PythonVsR']=='Both']))


So it seems that coding workers who mainly use Python are nearly twice as numerous as those who mainly use R.

### R and Python across the world 
<a id="countryuse"></a>

In [None]:
df['Country'] = df['Country'].fillna('Other')

In [None]:
d_country={}
for country in df['Country'].unique():
    in_country = df[df['Country'] == country]
    # Percentage of each user group among the coding workers of this country
    d_country[country]={group: 100*(in_country['PythonVsR']==group).sum()/len(in_country)
                        for group in ['Python','R','Both']}

print('Table with percentage of use for each country')
print(pd.DataFrame(d_country).transpose().head(20).round(2))
plt.figure(figsize=(14,8))
sns.countplot(y='Country',hue='PythonVsR',data=df)
plt.ylabel("Country", fontsize=13)
plt.xlabel("Number of each language users", fontsize=13)
plt.title("Python/R users per country", fontsize=13)
plt.show()

Some countries still rely heavily on R. For example, 33.71% of coding workers in India still use R frequently, and 45.45% of coding workers in Australia use R! That's more than the Python coders in Australia.
That being said, nearly all countries use Python more than R.
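The per-country percentages can also be obtained in one call with `pd.crosstab`. This is a sketch on toy data; `usage_share_by_country` is a hypothetical helper name, and the column names are the ones used in this notebook.

```python
import pandas as pd

def usage_share_by_country(frame):
    """Percentage of each PythonVsR group within every country (rows sum to 100)."""
    return 100 * pd.crosstab(frame['Country'], frame['PythonVsR'], normalize='index')

# Toy rows, not survey data
toy = pd.DataFrame({
    'Country': ['India', 'India', 'India', 'Australia', 'Australia'],
    'PythonVsR': ['Python', 'R', 'Python', 'R', 'Both'],
})
print(usage_share_by_country(toy).round(2))
```

With `normalize='index'` each row is divided by its own total, so countries with few respondents show large percentages on very small counts, something to keep in mind when reading the table.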

Let us now compare the use of those two languages at the workplace.

### ML methods / algorithms and skills for R and Python
<a id="skills"></a>

In [None]:
df['WorkMethodsSelect']=df['WorkMethodsSelect'].fillna('None')
techniques = ['Bayesian Techniques','Data Visualization', 'Logistic Regression','Natural Language Processing',
 'kNN and Other Clustering','Neural Networks','PCA and Dimensionality Reduction',
 'Time Series Analysis', 'Text Analytics','Cross-Validation']

In [None]:
d={}
for technique in techniques :
    d[technique]={'Python':0,'R':0,'Both':0}
    for (i,elem) in zip(range(df.shape[0]),df['WorkMethodsSelect']):
        if technique in elem : 
            d[technique][df['PythonVsR'].iloc[i]]+=1
            
       
(pd.DataFrame(d)).transpose().plot(kind='barh',figsize=(10,8))
plt.ylabel("Method", fontsize=13)
plt.xlabel("Number of users", fontsize=13)
plt.title("Methods used at work per programming language", fontsize=13)
plt.show()


Okay, let's see what we've got here :
Python users are more numerous than R users for each method but that's expected given the numbers we saw earlier.

That being said, we notice that when it comes to **Time Series Analysis and Logistic Regression**, R is still very much used.  
On the other hand, for **Neural Networks and Natural Language Processing**, R seems to be neglected in favor of Python. 
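Counting answers to a multi-select question can also be sketched with one-hot columns. The block below assumes that the selected options are stored as a single comma-separated string and that no option contains a comma itself; `count_methods` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd

def count_methods(frame, column, group):
    """Count, per group, how many respondents ticked each option of a multi-select column."""
    # Assumption: options are comma-separated and contain no comma themselves
    dummies = frame[column].fillna('').str.get_dummies(sep=',')
    return dummies.groupby(frame[group]).sum().T

# Toy rows, not survey data
toy = pd.DataFrame({
    'WorkMethodsSelect': ['Neural Networks,Time Series Analysis',
                          'Logistic Regression,Time Series Analysis',
                          'Neural Networks', None],
    'PythonVsR': ['Python', 'R', 'Python', 'R'],
})
print(count_methods(toy, 'WorkMethodsSelect', 'PythonVsR'))
```

Compared with a substring test, this matches whole options only, so an option whose name is contained in a longer one is not counted twice.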


In [None]:
df['WorkAlgorithmsSelect'] = df['WorkAlgorithmsSelect'].fillna('None')
algorithms = ['Bayesian Techniques','Decision Trees','Random Forests','Regression/Logistic Regression',
 'CNNs', 'RNNs', 'Gradient Boosted Machines','SVMs','GANs','Ensemble Methods']


In [None]:
d_algo={}
for algo in algorithms :
    d_algo[algo]={'Python':0,'R':0,'Both':0}
    for (i,elem) in zip(range(df.shape[0]),df['WorkAlgorithmsSelect']):
        if algo in elem : 
            d_algo[algo][df['PythonVsR'].iloc[i]]+=1
            
(pd.DataFrame(d_algo)).transpose().plot(kind='barh',figsize=(10,8))
plt.ylabel("Algorithm", fontsize=13)
plt.xlabel("Number of users", fontsize=13)
plt.title("Algorithms used at work per programming language", fontsize=13)
plt.show()

This question seems a bit redundant with the previous one, especially since some possible answers were available for both. That being said, I kept specific answers such as SVMs and CNNs under Algorithms and the more general answers (Neural Networks, PCA and Dimensionality Reduction...) under the previous question.

R seems to hold up well for Decision Trees and Random Forests. I suspect that's because those libraries existed in R well before they did in Python.   
On the other hand, powerful deep learning libraries such as TensorFlow made their appearance for both languages at the same time.

In [None]:
df['MLSkillsSelect'] = df['MLSkillsSelect'].fillna('None')
skills = ['Natural Language Processing', 'Computer Vision', 'Adversarial Learning',
          'Supervised Machine Learning (Tabular Data)', 'Reinforcement learning',
          'Unsupervised Learning', 'Outlier detection (e.g. Fraud detection)',
          'Time Series', 'Recommendation Engines']

In [None]:
d_skills={}
for skill in skills : 
    d_skills[skill]={'Python':0,'R':0,'Both':0}
    for (i,elem) in zip(range(df.shape[0]),df['MLSkillsSelect']):
        if skill in elem : 
            d_skills[skill][df['PythonVsR'].iloc[i]]+=1
(pd.DataFrame(d_skills)).transpose().plot(kind='barh',figsize=(10,8))
plt.ylabel("Machine Learning Skill", fontsize=13)
plt.xlabel("Number of users", fontsize=13)
plt.title("Machine Learning Skills per programming language", fontsize=13)
plt.show()


It seems that there are enough skilled coders who use R for unsupervised learning, outlier detection, time series analysis and, to a certain extent, supervised learning. But when it comes to emerging fields like reinforcement learning and computer vision, Python is miles ahead.

Next we'll see where R and Python coders work and which tasks they perform at work.

### Industry and Job title 
<a id="industry"></a>

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(y='EmployerIndustry', hue='PythonVsR',data=df)
plt.ylabel("Industry", fontsize=13)
plt.xlabel("Number of each language users", fontsize=13)
plt.title("Python/R coders by industry", fontsize=13)
plt.show()

This is an interesting one: even though there are more Python users than R users in this survey, as we saw earlier, this plot shows industries where R is still dominant or as competitive as Python.

* R fares as well as or better than Python in the following industries: Government, Insurance, Non-profit, Pharmaceutical, Retail and Marketing.
* Python outdoes R by a large margin in the tech industry, which is also the industry with the most respondents in this survey. That makes sense, since Data Scientist is a tech job.

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(y='CurrentJobTitleSelect', hue='PythonVsR',data=df)
plt.ylabel("Job Title", fontsize=13)
plt.xlabel("Number of each language users", fontsize=13)
plt.title("Python/R and job titles", fontsize=13)
plt.show()


We notice that Data Analysts and Business Analysts tend to use R more than Python!
On the other hand, there are very few Machine Learning Engineers or Software Engineers who use R. 

Let's see if something similar comes up when looking at the functions of the coders. 


### Main function and percentage of time for specific tasks  
<a id="tasks"></a>

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(y='JobFunctionSelect', hue='PythonVsR',data=df)
plt.ylabel("Principal job function", fontsize=13)
plt.xlabel("Number of each language users", fontsize=13)
plt.title("Python/R coders and their job functions", fontsize=13)
plt.show()

We found a few industries where R is dominant, but when it comes to job functions, Python fares better at nearly everything. 

R users seem to do more on the **business analytics** side. This makes sense given what we already observed for job titles: there are more R users when it comes to analyzing and visualizing data for business purposes.   
On the other hand, Python has a clear edge when it comes to building machine learning services, for example.

So how much time does each task (gathering data, visualizing it...) take in a DS job? 


In [None]:
d_task={}
tasks=['TimeGatheringData','TimeModelBuilding','TimeProduction','TimeVisualizing','TimeFindingInsights']
for task in tasks : 
    d_task[task]={'Python':df[df['PythonVsR']=='Python'][task].mean(),
                  'R':df[df['PythonVsR']=='R'][task].mean(),
                  'Both':df[df['PythonVsR']=='Both'][task].mean()}
    
(pd.DataFrame(d_task)).transpose().plot(kind='barh',figsize=(10,8))
plt.ylabel("Task", fontsize=13)
plt.xlabel("Percentage of time", fontsize=13)
plt.title("% of time devoted to specific tasks ", fontsize=13)
plt.show()

Both types of coders spend most of their time gathering data (38% for Python users, 40% for R) and building models (19% for Python, 18% for R).   
Generally, both types of coders seem to invest about the same time in nearly all tasks. For me, this means that using R or Python is more a matter of preference than an obligation tied to some specific usage.     

That being said, **the biggest difference observed is in putting work into production (12% for Python users, 7% for R)**. I remember the first time I asked my manager, during my internship, why we used Python rather than R, and he simply replied *'We always want to put our models into production and doing that using R can really be a pain in the ass'*.         
I've personally never used R in production so I can't relate, but I guess these statistics support my manager's words! 

> EDIT: I recalled that there was a specific question in the survey for workers, *At work, how often do the models you build get put into production?*, and decided to delve deeper into this aspect.       
> The possible answers were frequencies, so I computed the percentage of each frequency among Python (resp. R) users and compared the two communities.

In [None]:
df['WorkProductionFrequency']=df['WorkProductionFrequency'].fillna("Don't know")

In [None]:
d_prod={}
for value in df['PythonVsR'].value_counts().index : 
    temp=df[df['PythonVsR']==value]
    d_prod[value]={}
    for frequency in df['WorkProductionFrequency'].value_counts().index :
        d_prod[value][frequency]=100*len(temp[temp['WorkProductionFrequency']==frequency])/len(temp)

(pd.DataFrame(d_prod)).plot(kind='barh',figsize=(10,8))
plt.ylabel("Frequency", fontsize=13)
plt.xlabel("Percentages", fontsize=13)
plt.title("Proportion of R/Python coders by frequency of push to production  ", fontsize=13)
plt.show()

* If we combine *Always* and *Most of the time*, we find that 35.63% of Python users and 30.43% of R users almost always push their models to production.
* If we combine *Never* and *Rarely*, we find that 22.07% of Python users and 23.27% of R users almost never push their models to production. Taking only *Never*, the gap is bigger: 6.46% for Python users and 11.25% for R users.

So yes, this supports what we saw above: **R coders seem to put their work into production less often than Python coders.**
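Combining several answers into one share, as done for the figures above, can be sketched as follows. `combined_share` is a hypothetical helper, the default column names are the ones used in this notebook, and the rows are toy data rather than survey answers.

```python
import pandas as pd

def combined_share(frame, answers, by='PythonVsR', column='WorkProductionFrequency'):
    """Percentage of each group whose answer is one of `answers`."""
    return 100 * frame[column].isin(answers).groupby(frame[by]).mean()

# Toy rows, not survey data
toy = pd.DataFrame({
    'PythonVsR': ['Python', 'Python', 'Python', 'Python', 'R', 'R'],
    'WorkProductionFrequency': ['Always', 'Most of the time', 'Never',
                                'Sometimes', 'Rarely', 'Always'],
})
print(combined_share(toy, ['Always', 'Most of the time']))
print(combined_share(toy, ['Never', 'Rarely']))
```

Taking the mean of a boolean mask gives the fraction of `True` values, which is why no explicit division by the group size is needed.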

### Experience as code-writers 
<a id="tenure"></a>

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(y='Tenure', hue='PythonVsR',data=df)
plt.ylabel("Tenure", fontsize=13)
plt.xlabel("Number of each language users", fontsize=13)
plt.title("Number of years analyzing data for each language", fontsize=13)
plt.show()

Here's what we observe: the longer the respondents have been writing code to analyze data, the higher the proportion of R users and the lower the proportion of Python users.    

That's similar to what we saw earlier for ML skills / methods / algorithms: R was heavily used back in the day and was as popular as (or more popular than) Python, but the trend is changing now, and that's what all the plots have been telling us so far.

### What language would you recommend for DS beginners 
<a id="learnfirst"></a>

In [None]:
df['LanguageRecommendationSelect'] = df['LanguageRecommendationSelect'].fillna('Other')

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(y='LanguageRecommendationSelect',hue='PythonVsR',data=df)
plt.ylabel("Language", fontsize=13)
plt.xlabel("Number of recommenders", fontsize=13)
plt.title("Recommended language", fontsize=13)
plt.show()

mask1=(df['LanguageRecommendationSelect'] == 'R')& (df['PythonVsR']=='Python')
print('Proportion of Python users who recommend R as the first language to learn: {:0.2f}%'.format(100*len(df[mask1])/len(df[df['PythonVsR']=='Python'])))

mask1=(df['LanguageRecommendationSelect'] == 'Python')& (df['PythonVsR']=='R')
print('Proportion of R users who recommend Python as the first language to learn: {:0.2f}%'.format(100*len(df[mask1])/len(df[df['PythonVsR']=='R'])))

As expected, the majority of each community recommended the language it uses, that's fair.
BUT, **R users are a lot more inclined to recommend Python than Python users are to recommend R!** Indeed, only 2.63% of Python users recommended R while 23.12% (!) of R users recommended Python. This suggests that part of the R community is now convinced that Python may be the real deal for Machine Learning.  
We can also notice that people who use Python just as much as R recommend Python more than they recommend R.



## 3. Annual Income : Extensive Analysis
<a id="salary"></a>

For this part of the analysis, we'll select 20 features, mostly demographic, that can greatly impact the income of a person and that will be further used for dimensionality reduction.

In [None]:
demographic_features = ['GenderSelect','Country','Age',
                        'FormalEducation','MajorSelect','ParentsEducation',
                        'EmploymentStatus', 'CurrentJobTitleSelect',
                        'DataScienceIdentitySelect','CodeWriter',
                        'CurrentEmployerType','JobFunctionSelect',
                        'SalaryChange','RemoteWork','WorkMLTeamSeatSelect',
                        'Tenure','EmployerIndustry','EmployerSize','PythonVsR',
                        'CompensationAmount']
data_dem = data[demographic_features].copy()  # copy so the salary column can be rewritten safely
data_dem.head(5)

### American Kagglers annual income
<a id="us_salary"></a>

Well, it would be cool to know how much data scientists are actually getting paid! Unfortunately, even though 73% of the participants are employed, only 26.83% of the participants answered the income question, so we can't be extremely precise here.

Personally, I'm not a big fan of converting every salary to USD and then treating all the instances the same. Being paid 100k USD in the US is absolutely not comparable to being paid that same amount in India or Portugal, for example.    
I'd rather convert all the salary amounts to dollars but treat each country separately. I think this makes a lot more sense than checking the median over everything.   

Anyway, we'll only check the salaries of respondents from the US, as it's the most represented country in this survey and so yields the most significant results.
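For readers who want the per-country view, here is a sketch of converting to USD first and then summarizing each country on its own. It assumes the responses carry a `CompensationCurrency` code and that `conversionRates.csv` exposes `originCountry` and `exchangeRate` columns; those names, the helper `median_salary_usd_by_country` and the toy exchange rates are assumptions to check against the actual files.

```python
import pandas as pd

def median_salary_usd_by_country(responses, rates):
    """Convert compensation to USD, then take the median within each country."""
    merged = responses.merge(rates, left_on='CompensationCurrency',
                             right_on='originCountry', how='inner')
    merged['SalaryUSD'] = merged['CompensationAmount'] * merged['exchangeRate']
    return merged.groupby('Country')['SalaryUSD'].median()

# Toy rows; the exchange rates are made up for illustration
responses = pd.DataFrame({
    'Country': ['United States', 'United States', 'India', 'India'],
    'CompensationCurrency': ['USD', 'USD', 'INR', 'INR'],
    'CompensationAmount': [100000.0, 120000.0, 1000000.0, 2000000.0],
})
rates = pd.DataFrame({'originCountry': ['USD', 'INR'], 'exchangeRate': [1.0, 0.015]})
print(median_salary_usd_by_country(responses, rates))
```

Because the medians are computed per country, a salary in one country is only ever compared with salaries from the same country.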

In [None]:
# Convert all salaries to floats: strip thousands separators, then turn
# missing or non-numeric entries (e.g. '-') into 0
data_dem['CompensationAmount'] = pd.to_numeric(
    data_dem['CompensationAmount'].astype(str).str.replace(',', '', regex=False),
    errors='coerce').fillna(0)

In [None]:
#Remove Outliers
data_dem = data_dem[(data_dem['CompensationAmount']>5000) & (data_dem['CompensationAmount']<1000000)]
data_dem = data_dem[data_dem['Country']=='United States']

In [None]:
plt.subplots(figsize=(15,8))
sns.distplot(data_dem['CompensationAmount'])
plt.title('Salary Distribution',size=15)
plt.show()

plt.figure(figsize=(10,8))
sns.violinplot( y='CompensationAmount', data=data_dem)
plt.title("Salary distribution for US data scientists", fontsize=16)
plt.ylabel("Annual Salary", fontsize=16)

In [None]:
print('The median salary for US data scientists: {} USD'.format(data_dem['CompensationAmount'].median()))
print('The mean salary for US data scientists: {:0.2f} USD'.format(data_dem['CompensationAmount'].mean()))


The distribution shows that most salaries lie between 70k and 130k USD. According to Glassdoor, the average annual salary for a data scientist is 128k USD, so this is consistent with what we've just found.

Seaborn's *distplot* draws a histogram and, by default, fits a univariate distribution on top of it using kernel density estimation (KDE).
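A kernel density estimate, the smooth curve that *distplot* overlays on the histogram by default, can be sketched in a few lines of NumPy. This is an illustration of the idea and not seaborn's own implementation: `gaussian_kde_1d` is a hypothetical helper, the bandwidth uses Scott's rule of thumb as one common choice, and the salaries are toy values.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Scott's rule of thumb for one-dimensional data
        bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, averaged over all samples
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Toy salaries in USD, not survey data
salaries = np.array([70e3, 85e3, 90e3, 110e3, 130e3, 150e3])
grid = np.linspace(0, 250e3, 501)
density = gaussian_kde_1d(salaries, grid)
print(grid[density.argmax()])  # location of the estimated mode
```

A smaller bandwidth makes the curve follow individual salaries more closely, while a larger one smooths them out.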

### Salary VS Gender
<a id="gender_salary"></a>

In [None]:
temp=data_dem[data_dem.GenderSelect.isin(['Male','Female'])]
plt.figure(figsize=(10,8))
sns.violinplot( y='CompensationAmount', x='GenderSelect',data=temp)
plt.title("Salary distribution Vs Gender", fontsize=16)
plt.ylabel("Annual Salary", fontsize=16)
plt.xlabel("Gender", fontsize=16)

It seems that the salary gap between the two genders isn't too big but is still in favour of men.       
The average for male kagglers is a bit higher than the average for female kagglers.         
That being said, no woman reports an income of 400k or higher, while there are some such outliers among men.

### Salary VS Formal Education
<a id="education_salary"></a>

In [None]:
titles=list(data_dem['FormalEducation'].value_counts().index)
temp=data_dem[data_dem.FormalEducation.isin(titles)]
plt.figure(figsize=(10,8))
sns.boxplot( x='CompensationAmount', y='FormalEducation',data=temp)
plt.title("Salary distribution VS Academic degrees", fontsize=16)
plt.xlabel("Annual Salary", fontsize=16)
plt.ylabel("Academic degree", fontsize=16)

Let's recall what the boxes mean in seaborn's boxplot, the documentation says : *The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers”*.    
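To make the quoted definition concrete, here is a sketch that computes what a Tukey-style boxplot draws: the quartiles, the whisker ends (the most extreme points within 1.5 times the interquartile range of the box) and the outliers. `box_stats` is a hypothetical helper and the salaries are toy values.

```python
import pandas as pd

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers as drawn by a Tukey-style boxplot."""
    values = pd.Series(values).dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    is_inside = (values >= q1 - whis * iqr) & (values <= q3 + whis * iqr)
    return {'q1': q1, 'median': median, 'q3': q3,
            'whisker_low': values[is_inside].min(),
            'whisker_high': values[is_inside].max(),
            'outliers': sorted(values[~is_inside])}

# Toy salaries in USD, not survey data
toy = [60e3, 80e3, 90e3, 100e3, 110e3, 120e3, 400e3]
print(box_stats(toy))
```

In this toy sample the 400k salary lies beyond the upper fence, so it would be drawn as an isolated point and the upper whisker would stop at 120k.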

The median follows a reasonable trend: **the higher the education, the higher the median annual income**, except for doctoral degrees, which are overshadowed by professional degrees. That being said, the PhD box has outliers and its whiskers extend further than those of the Professional degree box, so all in all, PhD holders are the best paid kagglers.  

The median for people who attended college but hold no degree is higher than the median for Bachelor's and Master's holders, BUT the first quartile (Q1) of the first group is much lower than the first quartile of the other two groups.   
So looking at the turquoise blue box and comparing it with the light pink and mustard yellow boxes, we notice that the majority of people with a professional degree are between Q1 and the median while the majority of Bachelors and Masters holders are between the median and Q3.
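
The quartiles and whiskers discussed above can also be computed by hand. The sketch below uses made-up salaries (in k USD) and assumes seaborn's default whisker rule of 1.5 times the interquartile range; it illustrates the rule and is not the exact code seaborn runs.

```python
import numpy as np

# Made-up salaries in k USD, for illustration only.
salaries = np.array([40, 55, 60, 70, 80, 95, 110, 130, 400], dtype=float)

q1, median, q3 = np.percentile(salaries, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences.
whisker_low = salaries[salaries >= lower_fence].min()
whisker_high = salaries[salaries <= upper_fence].max()
outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```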

### Salary VS Job Title
<a id="jobtitle_salary"></a>

In [None]:
titles=list(data_dem['CurrentJobTitleSelect'].value_counts().index)
temp=data_dem[data_dem.CurrentJobTitleSelect.isin(titles)]
plt.figure(figsize=(10,8))
sns.violinplot( x='CompensationAmount', y='CurrentJobTitleSelect',data=temp)
plt.title("Salary distribution VS Job Titles", fontsize=16)
plt.xlabel("Annual Salary", fontsize=16)
plt.ylabel("Job Titles", fontsize=16)

**People labeled as Machine Learning Engineers or Data Scientists have a higher average annual income than Data Analysts, Business Analysts, Statisticians or Programmers.**    
One should pay attention to the job title when looking for work, because the salaries seem to be really different even though many respondents identify as data scientists !

### Dimensionality reduction and 2D-plotting
<a id="mca"></a>

The best-known and most widely used dimensionality reduction technique has to be PCA. The problem with PCA is that it works best for numerical / continuous variables, which is not the case here.

A similar technique, **Multiple Correspondence Analysis (MCA)**, achieves dimensionality reduction for categorical data. Simply put, it's a technique that uses the chi-squared statistic to define a distance between row points, which is then stored in a matrix. Each eigenvalue of this matrix corresponds to an inertia (similar to explained variance in PCA), and the process for obtaining the 2D visualization is the same. You can read more about it here https://www.wikiwand.com/en/Correspondence_analysis and here https://www.wikiwand.com/en/Multiple_correspondence_analysis (to get more technical, you can check the references in the links).
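
To make the idea more concrete, here is a minimal sketch of MCA as plain correspondence analysis of the one-hot indicator matrix, using only NumPy and pandas. The function `mca_row_coordinates` and the toy frame are hypothetical names introduced for illustration. This is not how `prince` implements it, and it applies no Benzécri correction.

```python
import numpy as np
import pandas as pd

def mca_row_coordinates(df, n_components=2):
    """Correspondence analysis of the one-hot indicator matrix of `df`."""
    Z = pd.get_dummies(df, columns=list(df.columns)).to_numpy(dtype=float)
    P = Z / Z.sum()                 # correspondence matrix
    r = P.sum(axis=1)               # row masses
    c = P.sum(axis=0)               # column masses
    expected = np.outer(r, c)       # what independence would predict
    S = (P - expected) / np.sqrt(expected)   # standardized (chi-squared) residuals
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    coords = (U * sing) / np.sqrt(r)[:, None]   # row principal coordinates
    inertia = sing ** 2             # one inertia value per axis
    return coords[:, :n_components], inertia

# Made-up categorical answers, for illustration only.
toy = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "M", "M"],
    "Degree": ["BSc", "MSc", "PhD", "BSc", "MSc", "PhD"],
    "Language": ["Python", "R", "Python", "Python", "R", "R"],
})
coords, inertia = mca_row_coordinates(toy)
print(coords)
print(inertia / inertia.sum())      # share of inertia per axis
```

The two columns of `coords` play the same role as the first two principal components in PCA: each respondent becomes one point in the 2D plot.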

In [None]:
data_dem['CompensationAmount'] = pd.cut(data_dem['CompensationAmount'],bins=[0,130000,1000000],
                                            include_lowest=True,labels=[1,2])
data_dem['Age'] = pd.cut(data_dem['Age'],bins=[0,18,25,30,35,40,50,60,100],
                           include_lowest=True,labels=[1,2,3,4,5,6,7,8])
data_dem.drop('Country',axis=1,inplace=True)

We'll use the demographic properties and some of the work-related answers (like language preference) to visualize the respondents. We'll try to see if it's possible to construct clusters by salary.    
130k is an arbitrary threshold: I just tried to separate the richer kagglers in the US from the others. Let's see what we've got.
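
As a quick check of how the binning behaves around the threshold, here is a small sketch on made-up salaries using the same `bins`, `include_lowest` and `labels` arguments as the cell above:

```python
import pandas as pd

# Made-up salaries placed on and around the bin edges.
salaries = pd.Series([0, 50000, 130000, 130001, 400000])

income_class = pd.cut(salaries, bins=[0, 130000, 1000000],
                      include_lowest=True, labels=[1, 2])
print(income_class.tolist())
```

Intervals are closed on the right, so a salary of exactly 130k falls in class 1, and `include_lowest=True` keeps a salary of 0 in class 1 as well. Values above the last edge (1,000,000) would become `NaN`.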

In [None]:
### NOT WORKING ON KAGGLE SERVERS (no module prince)####
#import prince
#np.random.seed(42)
#mca = prince.MCA(data_viz, n_components=2,use_benzecri_rates=True)
#mca.plot_rows(show_points=True, show_labels=False, color_by='CompensationAmount', ellipse_fill=True)

# I have uploaded an image instead.

![](http://img1.imagilive.com/1117/mca2e0b.png)

That's not so bad actually.      
* **There's an area where both populations overlap** and we can't really separate instances with incomes above 130k from the others. That's expected, because salary is continuous: if we take, for example, one person with an annual income of 120k and another with 135k, they probably have similar properties even though, with our threshold, they belong to two different classes.
* **The top part of the plot contains exclusively people whose income is below 130k.** We drew a pink box to investigate in what way the instances within that box look alike; we'll see that below.
* **The bottom part of the plot contains exclusively people whose income is above 130k.** We drew a brown box to investigate in what way the instances within that box look alike; we'll see that below.

In [None]:
"""If you want to execute the following two blocks of code and have the plot above,
install the package 'prince', copy all the code and uncomment it, you'll have the same outputs.
P.S : Don't forget the random seed !"""

#projections=mca.row_principal_coordinates
#projections.columns=['ax1','ax2']
#projections['target']=y.iloc[length]

#msk_p = ((projections['ax1']>-0.70) & (projections['ax1']<-0.45 )) & ((projections['ax2']<0.66) &(projections['ax2']>0.50))
#samples_p=projections[msk_p]
#indexes_p = samples_p.index #[133, 247, 499, 576, 2375, 3578, 3606, 3876, 5758, 6059, 10155, 10514, 11552, 13438, 15631]
#ex_p=data_dem.loc[indexes_p]

ex_p=data_dem.loc[[133, 247, 499, 576, 2375, 3578, 3606, 3876, 5758, 6059, 10155, 10514, 11552, 13438, 15631]]
ex_p.head(10)

The pink box contains 6 men and 4 women, all of them hold either a Bachelor's or a Master's degree.    
Most of them are between 18 and 25 years old and work either as Data Analysts or Business Analysts. Nearly all of them have been writing code to analyze data (Tenure) for 2 years at most.

This makes sense, that part of the plot must contain kagglers who have just started their careers (between 18 and 25 years old) and seem to work full time as business / data analysts.

Let's move on to the brown box.

In [None]:
#msk_r = ((projections['ax1']>0.2) & (projections['ax1']<0.7 )) & ((projections['ax2']<-0.80) &(projections['ax2']>-1.10))
#samples_r=projections[msk_r]
#indexes_r = samples_r.index  #[445, 3273, 4751, 4803, 4960, 11071, 11528, 13663, 13880]
#ex_r = data_dem.loc[indexes_r]

indexes_r=[445, 3273, 4751, 4803, 4960, 11071, 11528, 13663, 13880]
ex_r = data_dem.loc[indexes_r]

ex_r

The brown box contains men only. 4 of them hold a doctoral degree, others attended college without earning a degree.        
8 of the 9 instances are older than 45 years old and work as **independent** Data Scientists or pure Scientists.       
All of them have been writing code to analyze data for more than 10 years ! 

That part of the plot seems to contain male kagglers, older than 45 y.o, who either hold senior positions in their workplace or work as private consulting data scientists and have been writing code since forever.

When we were looking at salaries for all the job titles, we found that business analysts / data analysts tend to be paid less than data scientists. In our 2D plot, we observed the same thing : the first box contained no data scientists, only analysts, while the second one contained mainly data scientists.


Okay so up until now, we've done a demographic overview of the respondents, we've analyzed the coding workers community and their use of programming languages and we conducted a study on the annual income for the workers community. Let's now analyze the learners community of Kaggle !

## 3. Welcome to Data Science
<a id="learners"></a>

The participants in this Kaggle survey weren't all experienced data scientists, far from it !  

Many of them (me included) are still aspiring to become data scientists and have a long way to go. Some are still students, others are looking for a career switch, but they all have something in common : they're **learners** !    

So what platforms are they using for this purpose ? What do they want to learn most ? What job are they looking for ?
Well, let's see if we can answer those questions ! 

<img src="https://media.giphy.com/media/RUry0iE5xatig/source.gif">

According to the respondent type README file, learners can be:
- Students
- People formally or informally learning data science skills
- Professionals looking for a career switch
- Unemployed people looking for work

We first extract a dataframe containing the respondents who have at least one of these characteristics.

In [None]:
df_students=data[data['StudentStatus']=='Yes']
df_ds=data[(data['LearningDataScience']=="Yes, but data science is a small part of what I'm focused on learning") |
            (data['LearningDataScience']=="Yes, I'm focused on learning mostly data science skills")]
df_c=data[data['CareerSwitcher']=='Yes']
df_e=data[data['EmploymentStatus']=='Not employed, but looking for work']

learners=pd.concat((df_students,df_ds,df_c,df_e))
learners = learners[~learners.index.duplicated(keep='first')]

print('{} participants on this survey are learners.'.format(len(learners)))
print('In other words, {:0.2f}% of the participants on this survey are learners.'.format(100*len(learners)/len(data)))
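
The same union can also be written as a single boolean mask, which avoids the concatenate-then-deduplicate step. This is a sketch on a hypothetical four-row frame (`toy`), using the column names and answer strings from the cell above:

```python
import pandas as pd

# Hypothetical respondents, only here to make the sketch runnable.
toy = pd.DataFrame({
    "StudentStatus": ["Yes", None, None, None],
    "LearningDataScience": [None,
                            "Yes, I'm focused on learning mostly data science skills",
                            None, None],
    "CareerSwitcher": ["Yes", None, None, "No"],
    "EmploymentStatus": ["Employed full-time", "Employed full-time",
                         "Not employed, but looking for work", "Employed full-time"],
})

learning_answers = [
    "Yes, but data science is a small part of what I'm focused on learning",
    "Yes, I'm focused on learning mostly data science skills",
]

# A respondent is a learner if ANY of the four conditions holds.
is_learner = (
    toy["StudentStatus"].eq("Yes")
    | toy["LearningDataScience"].isin(learning_answers)
    | toy["CareerSwitcher"].eq("Yes")
    | toy["EmploymentStatus"].eq("Not employed, but looking for work")
)
toy_learners = toy[is_learner]
print(len(toy_learners))
```

Unlike the `pd.concat` version, the mask keeps the rows in their original order, and a respondent who meets several conditions (like row 0 here) is counted once without any extra step.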

### Demographic properties 
<a id="demographic"></a>

First things first, let's see which countries have the most learners, then check the learners' age and gender distribution.

In [None]:
sexe = learners['GenderSelect'].value_counts()
labels = (np.array(sexe.index))
proportions = (np.array((sexe / sexe.sum())*100))

trace = go.Pie(labels=labels, values=proportions,
              hoverinfo='label+percent')
layout = go.Layout(
    title='Gender distribution of learners'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Career_Switcher")



We notice that the share of women among learners is slightly higher than in the general proportions (16.71% Female and 81.88% Male). This suggests that more and more women are getting involved and interested in learning Data Science, which is good news.

In [None]:
print("Learners' median age", learners['Age'].median() )

In [None]:
plt.figure(figsize=(12,8))
countries = learners['Country'].value_counts().head(30)
sns.barplot(y=countries.index, x=countries.values, alpha=0.6)
plt.title("Population of learners in each country", fontsize=16)
plt.xlabel("Number of respondents", fontsize=16)
plt.ylabel("Country", fontsize=16)

Interesting, India has more data science learners than the U.S ! 

As we saw in the first part of this EDA, several countries don't have many participants in this survey. A more meaningful statistic is the percentage of learners amongst the Kaggle respondents of each country.

In [None]:
d_pcountries = {}
for value in data['Country'].value_counts().index:
    d_pcountries[value]=100*len(learners[learners['Country']==value])/len(data[data['Country']==value])
learners_p=pd.DataFrame.from_dict(d_pcountries, orient='index')
learners_p = learners_p.reset_index(drop=False)
learners_p.rename(columns = {'index':'Country',0:'% of learners'},inplace=True)
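
The per-country loop above can also be expressed by dividing two `value_counts` results, since pandas aligns them on the country name. A minimal sketch on made-up respondents (`toy_all` and `toy_learners` are hypothetical stand-ins for `data` and `learners`):

```python
import pandas as pd

# Made-up respondents; rows 0, 1, 2 and 5 are treated as learners.
toy_all = pd.DataFrame({"Country": ["India", "India", "United States", "United States",
                                    "United States", "Kenya", "France"]})
toy_learners = toy_all.iloc[[0, 1, 2, 5]]

pct = (
    toy_learners["Country"].value_counts()
    .div(toy_all["Country"].value_counts())   # aligned on the country name
    .mul(100)
    .fillna(0)                                # countries with no learner at all
    .rename("% of learners")
)
print(pct)
```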


In [None]:
LOCDATA="""COUNTRY,GDP (BILLIONS),CODE
Afghanistan,21.71,AFG
Albania,13.40,ALB
Algeria,227.80,DZA
American Samoa,0.75,ASM
Andorra,4.80,AND
Angola,131.40,AGO
Anguilla,0.18,AIA
Antigua and Barbuda,1.24,ATG
Argentina,536.20,ARG
Armenia,10.88,ARM
Aruba,2.52,ABW
Australia,1483.00,AUS
Austria,436.10,AUT
Azerbaijan,77.91,AZE
"Bahamas, The",8.65,BHM
Bahrain,34.05,BHR
Bangladesh,186.60,BGD
Barbados,4.28,BRB
Belarus,75.25,BLR
Belgium,527.80,BEL
Belize,1.67,BLZ
Benin,9.24,BEN
Bermuda,5.20,BMU
Bhutan,2.09,BTN
Bolivia,34.08,BOL
Bosnia and Herzegovina,19.55,BIH
Botswana,16.30,BWA
Brazil,2244.00,BRA
British Virgin Islands,1.10,VGB
Brunei,17.43,BRN
Bulgaria,55.08,BGR
Burkina Faso,13.38,BFA
Burma,65.29,MMR
Burundi,3.04,BDI
Cabo Verde,1.98,CPV
Cambodia,16.90,KHM
Cameroon,32.16,CMR
Canada,1794.00,CAN
Cayman Islands,2.25,CYM
Central African Republic,1.73,CAF
Chad,15.84,TCD
Chile,264.10,CHL
"People 's Republic of China",10360.00,CHN
Colombia,400.10,COL
Comoros,0.72,COM
"Congo, Democratic Republic of the",32.67,COD
"Congo, Republic of the",14.11,COG
Cook Islands,0.18,COK
Costa Rica,50.46,CRI
Cote d'Ivoire,33.96,CIV
Croatia,57.18,HRV
Cuba,77.15,CUB
Curacao,5.60,CUW
Cyprus,21.34,CYP
Czech Republic,205.60,CZE
Denmark,347.20,DNK
Djibouti,1.58,DJI
Dominica,0.51,DMA
Dominican Republic,64.05,DOM
Ecuador,100.50,ECU
Egypt,284.90,EGY
El Salvador,25.14,SLV
Equatorial Guinea,15.40,GNQ
Eritrea,3.87,ERI
Estonia,26.36,EST
Ethiopia,49.86,ETH
Falkland Islands (Islas Malvinas),0.16,FLK
Faroe Islands,2.32,FRO
Fiji,4.17,FJI
Finland,276.30,FIN
France,2902.00,FRA
French Polynesia,7.15,PYF
Gabon,20.68,GAB
"Gambia, The",0.92,GMB
Georgia,16.13,GEO
Germany,3820.00,DEU
Ghana,35.48,GHA
Gibraltar,1.85,GIB
Greece,246.40,GRC
Greenland,2.16,GRL
Grenada,0.84,GRD
Guam,4.60,GUM
Guatemala,58.30,GTM
Guernsey,2.74,GGY
Guinea-Bissau,1.04,GNB
Guinea,6.77,GIN
Guyana,3.14,GUY
Haiti,8.92,HTI
Honduras,19.37,HND
Hong Kong,292.70,HKG
Hungary,129.70,HUN
Iceland,16.20,ISL
India,2048.00,IND
Indonesia,856.10,IDN
Iran,402.70,IRN
Iraq,232.20,IRQ
Ireland,245.80,IRL
Isle of Man,4.08,IMN
Israel,305.00,ISR
Italy,2129.00,ITA
Jamaica,13.92,JAM
Japan,4770.00,JPN
Jersey,5.77,JEY
Jordan,36.55,JOR
Kazakhstan,225.60,KAZ
Kenya,62.72,KEN
Kiribati,0.16,KIR
"Korea, North",28.00,PRK
"Korea, South",1410.00,KOR
Kosovo,5.99,KSV
Kuwait,179.30,KWT
Kyrgyzstan,7.65,KGZ
Laos,11.71,LAO
Latvia,32.82,LVA
Lebanon,47.50,LBN
Lesotho,2.46,LSO
Liberia,2.07,LBR
Libya,49.34,LBY
Liechtenstein,5.11,LIE
Lithuania,48.72,LTU
Luxembourg,63.93,LUX
Macau,51.68,MAC
Macedonia,10.92,MKD
Madagascar,11.19,MDG
Malawi,4.41,MWI
Malaysia,336.90,MYS
Maldives,2.41,MDV
Mali,12.04,MLI
Malta,10.57,MLT
Marshall Islands,0.18,MHL
Mauritania,4.29,MRT
Mauritius,12.72,MUS
Mexico,1296.00,MEX
"Micronesia, Federated States of",0.34,FSM
Moldova,7.74,MDA
Monaco,6.06,MCO
Mongolia,11.73,MNG
Montenegro,4.66,MNE
Morocco,112.60,MAR
Mozambique,16.59,MOZ
Namibia,13.11,NAM
Nepal,19.64,NPL
Netherlands,880.40,NLD
New Caledonia,11.10,NCL
New Zealand,201.00,NZL
Nicaragua,11.85,NIC
Nigeria,594.30,NGA
Niger,8.29,NER
Niue,0.01,NIU
Northern Mariana Islands,1.23,MNP
Norway,511.60,NOR
Oman,80.54,OMN
Pakistan,237.50,PAK
Palau,0.65,PLW
Panama,44.69,PAN
Papua New Guinea,16.10,PNG
Paraguay,31.30,PRY
Peru,208.20,PER
Philippines,284.60,PHL
Poland,552.20,POL
Portugal,228.20,PRT
Puerto Rico,93.52,PRI
Qatar,212.00,QAT
Romania,199.00,ROU
Russia,2057.00,RUS
Rwanda,8.00,RWA
Saint Kitts and Nevis,0.81,KNA
Saint Lucia,1.35,LCA
Saint Martin,0.56,MAF
Saint Pierre and Miquelon,0.22,SPM
Saint Vincent and the Grenadines,0.75,VCT
Samoa,0.83,WSM
San Marino,1.86,SMR
Sao Tome and Principe,0.36,STP
Saudi Arabia,777.90,SAU
Senegal,15.88,SEN
Serbia,42.65,SRB
Seychelles,1.47,SYC
Sierra Leone,5.41,SLE
Singapore,307.90,SGP
Sint Maarten,304.10,SXM
Slovakia,99.75,SVK
Slovenia,49.93,SVN
Solomon Islands,1.16,SLB
Somalia,2.37,SOM
South Africa,341.20,ZAF
South Sudan,11.89,SSD
Spain,1400.00,ESP
Sri Lanka,71.57,LKA
Sudan,70.03,SDN
Suriname,5.27,SUR
Swaziland,3.84,SWZ
Sweden,559.10,SWE
Switzerland,679.00,CHE
Syria,64.70,SYR
Taiwan,529.50,TWN
Tajikistan,9.16,TJK
Tanzania,36.62,TZA
Thailand,373.80,THA
Timor-Leste,4.51,TLS
Togo,4.84,TGO
Tonga,0.49,TON
Trinidad and Tobago,29.63,TTO
Tunisia,49.12,TUN
Turkey,813.30,TUR
Turkmenistan,43.50,TKM
Tuvalu,0.04,TUV
Uganda,26.09,UGA
Ukraine,134.90,UKR
United Arab Emirates,416.40,ARE
United Kingdom,2848.00,GBR
United States,17420.00,USA
Uruguay,55.60,URY
Uzbekistan,63.08,UZB
Vanuatu,0.82,VUT
Venezuela,209.20,VEN
Vietnam,187.80,VNM
Virgin Islands,5.08,VGB
West Bank,6.64,WBG
Yemen,45.45,YEM
Zambia,25.61,ZMB
Zimbabwe,13.74,ZWE
    """

with open("location_map.csv", "w") as ofile:
    ofile.write(LOCDATA)

In [None]:
loc_df = pd.read_csv("./location_map.csv")
new_df = pd.merge(learners_p, loc_df, left_on="Country", right_on="COUNTRY")
new_df = new_df[['Country','CODE','% of learners']]

data_t = [ dict(
        type = 'choropleth',
        locations = new_df['CODE'],
        z = new_df['% of learners'],
        text = new_df['Country'],
        #colorscale = [[0,"rgb(5, 10, 172)"],[10,"rgb(40, 60, 190)"],[20,"rgb(70, 100, 245)"],\
        #    [30,"rgb(90, 120, 245)"],[40,"rgb(200, 200, 200)"],[4500,"rgb(220, 220, 220)"]],
        colorscale = [[0,"rgb(210, 210, 210)"], [4500,"rgb(220, 220, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            title = 'Proportion of learners (in%)'),
      ) ]

layout = dict(
    title = 'Country wise proportion of learners',
    geo = dict(
        showframe = False,
        showcoastlines = True,
        projection = dict(
            type = 'Mercator'
        )
    )
)

fig = dict( data=data_t, layout=layout )
py.iplot( fig, validate=False, filename='d3-world-map' )
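
One caveat about the cell above: `pd.merge` defaults to an inner join, so any survey country whose name doesn't appear in the lookup table silently disappears from the map. A quick way to list such countries is a left merge with `indicator=True`. The sketch below uses hypothetical frames and made-up percentages.

```python
import pandas as pd

# Hypothetical stand-ins for learners_p and loc_df; the numbers are made up.
survey = pd.DataFrame({"Country": ["India", "United States", "Some Unlisted Country"],
                       "% of learners": [50.0, 25.0, 40.0]})
lookup = pd.DataFrame({"COUNTRY": ["India", "United States", "Taiwan"],
                       "CODE": ["IND", "USA", "TWN"]})

# The _merge column says where each row was found.
checked = survey.merge(lookup, how="left", left_on="Country",
                       right_on="COUNTRY", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```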


We notice that some of the highest proportions of learners are found in Nigeria, Egypt, Kenya and Indonesia. Those are countries where data science is at its very start, so finding as many learners as practitioners makes sense. On the other hand, only 26% of American kagglers are still in the learning phase.

What I found more surprising was the proportion of learners in China (47.34%). I would have guessed that the proportion of learners in China is similar to the one of the US but that's not true at all according to the survey responses.

Let's see what's the formal education of our learners.

In [None]:
edu = learners['FormalEducation'].value_counts()
labels = (np.array(edu.index))
values = (np.array((edu / edu.sum())*100))

trace = go.Pie(labels=labels, values=values,
              hoverinfo='label+percent',
               textfont=dict(size=20),
                showlegend=False)

layout = go.Layout(
    title='Formal Education of learners respondents'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Formal_Education2")

Most learners hold a Bachelor's degree (earlier, for all respondents, the most frequent education was Master's level). That's primarily because amongst learners we find students who are yet to finish their studies.

Let's now see what platforms these learners are using in their quest to become data scientists.

### Platforms used for learning 
<a id="platforms"></a>

In [None]:
d_plat={}
platforms = ['College/University','Kaggle','Online courses','Arxiv','Company internal','Textbook',
             'Personal Projects','Stack Overflow Q&A']
for platform in platforms : 
    d_plat[platform]=0
    for elem in learners['LearningPlatformSelect'].fillna('Other/Missing'):
        if platform in elem : 
            d_plat[platform]+=1

plt.figure(figsize=(15,8))
plt.bar(range(len(d_plat)), list(d_plat.values()), align='center')
plt.title("Learners' platforms use", fontsize=16)
plt.xticks(range(len(d_plat)), list(d_plat.keys()))
plt.xlabel("Platforms", fontsize=16)
plt.ylabel("Number of users", fontsize=16)
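
The nested loops above count, for each platform, how many answers contain its name. The same count can be sketched with pandas' vectorized `str.contains`. The answers below are made up, and their comma-separated layout is an assumption; the substring match doesn't depend on it.

```python
import pandas as pd

platforms = ["College/University", "Kaggle", "Online courses", "Stack Overflow Q&A"]

# Made-up multi-select answers; None stands for a missing answer.
answers = pd.Series(["Kaggle,Online courses", "College/University,Kaggle",
                     None, "Stack Overflow Q&A"])

# regex=False treats each platform name as a literal substring.
counts = pd.Series({
    p: answers.fillna("").str.contains(p, regex=False).sum()
    for p in platforms
})
print(counts)
```

Because `counts` is a Series, `counts.plot(kind='bar')` keeps labels and bar heights aligned without handling the dictionary keys and values separately.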


Interesting!    
Kaggle and Online courses seem to be the favorite platforms for data science learners. 
At first, I was a bit surprised when I saw that College/University comes as the fifth most used platform but then I remembered that : 
- learners include professionals who are looking for a career switch
- learners' median age is 26.

Let's check the frequency of use for younger learners (younger than 23) : 

In [None]:
data_young = learners[(learners['Age']<=22) ]
d_plat2={}
for platform in platforms : 
    d_plat2[platform]=0
    for elem in data_young['LearningPlatformSelect'].fillna('Other/Missing'):
        if platform in elem : 
            d_plat2[platform]+=1

plt.figure(figsize=(15,8))
plt.bar(range(len(d_plat2)), list(d_plat2.values()), align='center')
plt.title("Young Learners' platforms use", fontsize=16)
plt.xticks(range(len(d_plat2)), list(d_plat2.keys()))
plt.xlabel("Platforms", fontsize=16)
plt.ylabel("Number of users", fontsize=16)

Hmm, there's some progress but nothing striking. Let's check young kagglers living in the US : 

In [None]:
data_young = learners[(learners['Age']<=22) & (learners['Country']=='United States')]
d_plat2={}
for platform in platforms : 
    d_plat2[platform]=0
    for elem in data_young['LearningPlatformSelect'].fillna('Other/Missing'):
        if platform in elem : 
            d_plat2[platform]+=1

plt.figure(figsize=(15,8))
plt.bar(range(len(d_plat2)), list(d_plat2.values()), align='center')
plt.title("Young US learners' platforms use", fontsize=16)
plt.xticks(range(len(d_plat2)), list(d_plat2.keys()))
plt.xlabel("Platforms", fontsize=16)
plt.ylabel("Number of users", fontsize=16)

Aha ! College / University becomes the second most used platform and is no longer outnumbered by Kaggle.  

The thing is, in several countries, universities didn't have and still don't have specific data science training. Programs often focus either on mathematics or on computer science, so students must complete their training on their own. In the US, more and more universities (MIT, Columbia, Stanford...) offer dedicated data science master's degrees, or at least machine learning courses, for students who are interested.

This is why, in most countries, students generally feel that they learn a lot more about data science outside their college.

Let's see which online platform is the most popular amongst data science learners.

In [None]:
d_online={}
online_plat= ['Coursera','Udacity','edX',
              'DataCamp','Other']
for plat in online_plat : 
    d_online[plat]=0
    for elem in learners['CoursePlatformSelect'].fillna('Missing'):
        if plat in elem :
            d_online[plat]+=1

online = pd.DataFrame.from_dict(d_online,orient='index')

labels = (np.array(online.index))
proportions = np.array((online[0] / online[0].sum())*100)

trace = go.Pie(labels=labels, values=proportions,
              hoverinfo='label+percent')

layout = go.Layout(
    title='Online Platforms popularity'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Online_plat")


Coursera has a clear edge over Udacity, edX and DataCamp. This must have something to do with Andrew Ng's famous course, which helped introduce Machine Learning to a huge number of people.

Let's now see how much time learners spend on DS learning on those platforms and for how many years they have been learning DS/ML.

### Time invested on DS training 
<a id="time_learning"></a>

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(y=data['TimeSpentStudying'].dropna())
plt.title("Average hours per week spent on DS learning", fontsize=16)
plt.xlabel("Number of learners", fontsize=16)
plt.ylabel("Number of hours", fontsize=16)

Most learners spend between 2 and 10 hours a week learning data science, which is roughly what you would spend if your learning consisted of one online course at a time.  
That being said, nearly 1000 learners (out of a total of 5494) spend more than 11 hours a week. Those are probably students enrolled in a data science program or learners fully dedicated to DS learning, unlike workers looking for a career switch, who naturally have less time to invest in DS learning.

Let's see for how many years kagglers have been learning data science.

In [None]:
start = data['LearningDataScienceTime'].value_counts()
labels = (np.array(start.index))
values = (np.array((start / start.sum())*100))

trace = go.Pie(labels=labels, values=values,
              hoverinfo='label+percent',
               textfont=dict(size=20))

layout = go.Layout(
    title='Years invested in Data Science Learning'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="nb_yers")

84.1% of learners started their DS training at most 2 years ago, while only 3.5% of them started more than 5 years ago !        
This shows the effect of the hype around Data Science over the last few years, as well as its exponential growth.

What comes next focuses on how learners perceive the professional aspect of Data Science, starting with ranking skills according to necessity and ranking proofs of DS knowledge.

### Skills and proofs of knowledge for landing a DS job 
<a id="skills_job"></a>

In [None]:
df2=data
d_jobskills={}
job_skills = ['JobSkillImportanceDegree','JobSkillImportancePython','JobSkillImportanceR',
              'JobSkillImportanceKaggleRanking','JobSkillImportanceMOOC']

for skill in job_skills : 
    L=df2[skill].value_counts()
    d_jobskills[skill]={'Necessary':L.loc['Necessary'],
                        'Nice to have':L.loc['Nice to have'],
                        'Unnecessary':L.loc['Unnecessary']}


(pd.DataFrame(d_jobskills)).transpose().plot(kind='barh',figsize=(10,8))
plt.title("Most important skills for a DS Job", fontsize=16)          
plt.xlabel("Number of learners", fontsize=16)
plt.ylabel("Skills", fontsize=16)
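
A note on robustness: `L.loc['Necessary']` raises a `KeyError` if nobody picked that answer for a given skill. The sketch below builds the same table with `reindex`, which fills missing answer levels with 0 instead. The toy frame and its answers are made up.

```python
import pandas as pd

levels = ["Necessary", "Nice to have", "Unnecessary"]

# Made-up answers; nobody rated R as 'Necessary' here.
toy = pd.DataFrame({
    "JobSkillImportancePython": ["Necessary", "Necessary", "Nice to have", None],
    "JobSkillImportanceR": ["Nice to have", "Unnecessary", "Nice to have", None],
})

# One row per skill, one column per answer level, 0 where a level never occurs.
counts = toy.apply(lambda col: col.value_counts().reindex(levels, fill_value=0)).T
print(counts)
```

`counts.plot(kind='barh')` then gives the same kind of grouped bar chart as above.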



For Data Science learners, there's no doubt, mastering Python is the most important skill for a job in Data Science ! (we could add that in our Python VS R duel).
Most learners think that a degree is more a nice asset to have than a necessary one but I'm not so sure about that. 

What about ways you can prove your knowledge of ML/DS? 


In [None]:
plt.figure(figsize=(10,8))
sns.countplot(y=learners['ProveKnowledgeSelect'])
plt.title("Most important proofs of DS knowledge", fontsize=16)          
plt.xlabel("Number of learners", fontsize=16)
plt.ylabel("Proof of knowledge", fontsize=16)


Learners were asked *What's the most important way you can prove your knowledge of ML/DS?*. 

Master's degrees and PhDs aren't well ranked, which is expected since most learners consider that a degree is not necessary to land a job in DS. That being said, I wouldn't agree that MOOCs are more important than a Master's degree, and I don't think that companies value online course certificates more than a Master's in Mathematics / Computer Science or any related field.

Prior work experience in ML comes first and I have to agree with that. Hands-on experience is always valuable for recruiters, especially when it comes to coding.


So far, we've seen how much time learners spend on perfecting their Data Science skills, the factors that they consider important for landing a job in DS, and the most important way to prove ML/DS knowledge.

So when a learner is finally ready to start his DS career, what resources does he use to find a DS job ? How much time does he spend looking for a job, and what are the characteristics of his dream job ? Let's answer that ! 

### Learners' job hunt 
<a id="job_hunt"></a>

In [None]:
job_s = learners['JobSearchResource'].value_counts()
labels = (np.array(job_s.index))
values = (np.array((job_s / job_s.sum())*100))

trace = go.Pie(labels=labels, values=values,
              hoverinfo='label+percent',
               textfont=dict(size=20),
                showlegend=False)

layout = go.Layout(
    title='Most used resources for finding a DS job'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Job_resource")

So the top-3 resources according to learners are :
1. Companies' job listing pages,
2. Tech-Specific job boards (Stack Overflow recruitment platform for example)
3. General job boards (LinkedIn)

Now to the time spent looking for a job.

In [None]:
job_s = learners['JobHuntTime'].value_counts()
labels = (np.array(job_s.index))
values = (np.array((job_s / job_s.sum())*100))

trace = go.Pie(labels=labels, values=values,
              hoverinfo='label+percent',
               textfont=dict(size=20),
                showlegend=True)

layout = go.Layout(title='Hours per week spent looking for a data science job'
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Job_resource")

* 40.2% of learners aren't actually looking for a job. Let's not forget that some are still enrolled in College/University and others (workers who answered 'yes' for career switch) may be looking to switch position within the company they work for.   
* 34.5% spend an hour or two a week looking for a job, which suggests that there's no urgency. They may be students looking for internships or employed people checking from time to time whether an exciting opportunity is out there somewhere.
* A little more than 10% are actively looking for a job, spending at least 6 hours per week job hunting.

In [None]:
d_criterias={}
criterias_job=['JobFactorLearning','JobFactorSalary','JobFactorOffice','JobFactorLanguages',
               'JobFactorCommute','JobFactorManagement','JobFactorExperienceLevel',
               'JobFactorDepartment','JobFactorTitle','JobFactorCompanyFunding','JobFactorImpact',
               'JobFactorRemote','JobFactorIndustry','JobFactorLeaderReputation','JobFactorDiversity',
               'JobFactorPublishingOpportunity']
for criteria in criterias_job : 
    L=df2[criteria].value_counts()
    d_criterias[criteria]={'Very Important':L.loc['Very Important'],
                           'Somewhat important':L.loc['Somewhat important'],
                           'Not important':L.loc['Not important']}
    
(pd.DataFrame(d_criterias)).transpose().plot(kind='barh',figsize=(10,8))
plt.title("Most important factors for learners during job hunting", fontsize=16)          
plt.xlabel("Number of learners", fontsize=16)
plt.ylabel("Factors", fontsize=16)



DS learners are constantly striving for development ! The most important factor during their job hunt, by far, is whether the job would offer them opportunities for professional development.   
The second most important factor is the office environment they would be working in.    
The third one is the programming languages and frameworks they'd be working with. That shows that data scientists aren't really open to working with whatever technology the company is using: they have strong preferences that are crucial for choosing one job over another.  
The salary comes fourth and is close to other factors, so it doesn't seem to be too much of a concern for aspiring data scientists. Maybe that's because they're already quite confident that the salary would be great !

## Conclusion 
<a id="Conclusion"></a>
***

This EDA provides a demographic analysis whose purpose is to gain more insight into kagglers' general backgrounds (Age, Country, Education ...), and a more detailed study of the workers community that sheds light on : 
- The most used programming languages in DS (Python and R)
- The job positions and tasks of an expert data scientist

Finally, we tried to analyze the community of learners who use Kaggle for learning purposes, to understand how they perceive Data Science, where they put their focus and what their preferences are regarding job positions.

I've still got a couple of ideas that I'd like to add to this kernel at some point.  
Please don't hesitate to offer suggestions on insightful content I could add or on improvements I could make as this is my first contribution to the Kernel community and I'm sure that there's some stuff I could have done better.

I hope you enjoyed the reading and gained more insight about the Kaggle community and Data Science. 
Please upvote if you liked the effort :).

Thanks for your time !