In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

import numpy as np 
import pandas as pd 

import warnings
warnings.filterwarnings('ignore')

In [None]:
kaggle_data_raw = pd.read_csv('../input/kaggle-survey-2018/multipleChoiceResponses.csv')
kaggle_data_raw = kaggle_data_raw[['Q3','Q1','Q2','Q4','Q5','Q7','Q8','Q9']]

kaggle_data = (kaggle_data_raw
               .replace({'United States of America': 'United States'})
               .replace({'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom'})
               .replace({'Hong Kong (S.A.R.)': 'Hong Kong'})
               .replace({'Iran, Islamic Republic of...': 'Iran'})
               .replace({'Viet Nam': 'Vietnam'})
               .replace({'Republic of Korea':'South Korea'}) # TO BE CHECKED!
                )

country = (kaggle_data.Q3
           .value_counts()
           .drop('I do not wish to disclose my location')
           .drop('In which country do you currently reside?')
           .drop('Other')
            )

**Kagglers in context: Country Income Group analysis**

This analysis uses data from both the [Kaggle Survey](http://www.kaggle.com/kaggle/kaggle-survey-2018) and the [World Development Indicators](http://www.kaggle.com/worldbank/world-development-indicators) (WDI), both available on Kaggle.

Since this is a data story, let's start at the beginning. And in the beginning, there was a plot. A plot of kagglers per country. (You have most probably seen it before in many different colour maps and projections!)

In [None]:
fig, ax = plt.subplots(1,1,figsize=(20,5))
country.sort_values(inplace=True,ascending=False)
p = sns.barplot(x=country.index,y=country, dodge=False, ax = ax)
dummy = p.set_xticklabels(country.index,rotation=90)
dummy = p.set_xlabel('')
dummy = p.set_ylabel('Kagglers')


This plot is quite interesting. It tells you, for example, that the Kaggle communities in the United States and India are in a league of their own. Together they make up around 40% of the total number of people surveyed, and they have more than double the number of answers of even their nearest competitor, China.

However, as the great data storyteller Hans Rosling [wrote](https://www.gapminder.org/factfulness/size/):

> "Compare. Big numbers always look big. Single numbers on their own are misleading and should make you suspicious. Always look for comparisons. Ideally, divide by something."

It is quite impressive that the United States has about 4k answers compared with the ~100 of tiny Denmark (where I live now). *But* Denmark's population is also tiny compared with that of the United States, so is it possible that the number of people surveyed (aka kagglers) is just a proxy for population size?

*By the way,*

I noticed that there are both 'South Korea' and 'Republic of Korea' in the Kaggle Survey data. I wasn't really sure about this, so I went to Wikipedia for help. The first sentence in the ['North Korea'](http://en.wikipedia.org/wiki/North_Korea) entry is:

> This article is about the Democratic People's Republic of Korea. For the Republic of Korea, see South Korea. 

From this I assume that 'South Korea' and 'Republic of Korea' are the same thing, but I am not sure. (One could argue that since internet access in North Korea [is sort of controlled](http://en.wikipedia.org/wiki/Internet_in_North_Korea), it is unlikely that 71 people there actually answered the survey, but this is speculation.)

In any case, I decided to go with Wikipedia and merge 'Republic of Korea' into 'South Korea', but you have been warned!
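
A quick sanity check for this kind of renaming, as a minimal sketch: the answers and the reference list below are made up for illustration (they are not the survey data), and the mapping is just a subset of the one used in the cell above.

```python
import pandas as pd

# Hypothetical toy answers; the real ones come from column Q3 of the survey.
answers = pd.Series(['South Korea', 'Republic of Korea', 'Viet Nam',
                     'Republic of Korea', 'Denmark'])

# Illustrative subset of the renaming used above.
rename = {'Republic of Korea': 'South Korea', 'Viet Nam': 'Vietnam'}
harmonised = answers.replace(rename)

# After renaming, both Korean labels are counted together.
counts = harmonised.value_counts()
print(counts)

# Names that are still absent from a (hypothetical) reference list of countries.
reference_names = {'South Korea', 'Vietnam', 'Denmark'}
missing = sorted(set(counts.index) - reference_names)
print('not found:', missing)
```

If `missing` is not empty, the remaining names need another entry in the renaming dictionary before the two data sets can be joined.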

In [None]:
wdi_data_raw = pd.read_csv('../input/world-development-indicators/Indicators.csv')

# Pre-select what interests us: keep the last row for each country/indicator pair
wdi_data_raw.drop_duplicates(subset=['CountryName','IndicatorCode'], keep='last',inplace=True)
wdi_data_raw = wdi_data_raw[['CountryName','IndicatorCode','Value']]

# Fix the country names to match the ones from the kaggle survey
wdi_data_renamed = (wdi_data_raw
                    .replace({'Russian Federation' : 'Russia'})
                    .replace({'Iran, Islamic Rep.' : 'Iran'})
                    .replace({'Korea, Rep.':'South Korea'})
                    .replace({'Egypt, Arab Rep.' : 'Egypt'})
                    .replace({'Hong Kong SAR, China' : 'Hong Kong'})
                    .set_index('CountryName')
                   )

# Check if all the data we want  is there
country_names_in_wdi = wdi_data_renamed.index.unique()
for name in country.keys():
    if name not in country_names_in_wdi:
        print(name,'not found')
        
# Make a smaller sub-set with only the countries we are interested in and check if it is what we want
wdi_data = wdi_data_renamed.loc[country.keys()]

In [None]:
wdi_country_info = pd.read_csv('../input/world-development-indicators/Country.csv')

#Make a pre-selection of stuff that interest us
wdi_country_info = wdi_country_info[['ShortName','Region','IncomeGroup']]

# Fix the country names to match the ones from the kaggle survey
wdi_country_info = (wdi_country_info
                    .replace({'Korea':'South Korea'})
                    .replace({'Hong Kong SAR, China' : 'Hong Kong'})
                    .set_index('ShortName')
                   )

# Check if all the data we want  is there
country_names_in_wdi = wdi_country_info.index.unique()
for name in country.keys():
    if name not in country_names_in_wdi:
        print(name,'not found')
        
wdi_country_info = wdi_country_info.loc[country.keys()]

In [None]:
wdi_pop_totl = wdi_data[wdi_data.IndicatorCode=='SP.POP.TOTL'].drop('IndicatorCode',axis=1)
wdi_urb_ptc = wdi_data[wdi_data.IndicatorCode=='SP.URB.TOTL.IN.ZS'].drop('IndicatorCode',axis=1)
wdi_kids_lit  = wdi_data[wdi_data.IndicatorCode=='SE.ADT.1524.LT.ZS'].drop('IndicatorCode',axis=1)
wdi_bigs_lit  = wdi_data[wdi_data.IndicatorCode=='SE.ADT.LITR.ZS'].drop('IndicatorCode',axis=1)
wdi_gini      = wdi_data[wdi_data.IndicatorCode=='SI.POV.GINI'].drop('IndicatorCode',axis=1)
wdi_wom_lit   = wdi_data[wdi_data.IndicatorCode=='SE.ADT.LITR.FE.ZS'].drop('IndicatorCode',axis=1)
wdi_man_lit   = wdi_data[wdi_data.IndicatorCode=='SE.ADT.LITR.MA.ZS'].drop('IndicatorCode',axis=1)
wdi_internet = wdi_data[wdi_data.IndicatorCode=='IT.NET.USER.P2'].drop('IndicatorCode',axis=1)
wdi_income = wdi_data[wdi_data.IndicatorCode=='NY.ADJ.NNTY.PC.CD'].drop('IndicatorCode',axis=1)
wdi_gdp = wdi_data[wdi_data.IndicatorCode=='NY.GDP.PCAP.CD'].drop('IndicatorCode',axis=1)
wdi_ter_labor = wdi_data[wdi_data.IndicatorCode=='SL.TLF.TERT.ZS'].drop('IndicatorCode',axis=1)
wdi_wom_sec_edu = wdi_data[wdi_data.IndicatorCode=='SE.SEC.ENRL.GC.FE.ZS'].drop('IndicatorCode',axis=1)
wdi_wom_industry = wdi_data[wdi_data.IndicatorCode=='SL.IND.EMPL.FE.ZS'].drop('IndicatorCode',axis=1)
wdi_wom_man_labor = wdi_data[wdi_data.IndicatorCode=='SL.TLF.CACT.FM.NE.ZS'].drop('IndicatorCode',axis=1)
wdi_wom_labor_ter_edu = wdi_data[wdi_data.IndicatorCode=='SL.TLF.TERT.FE.ZS'].drop('IndicatorCode',axis=1)
wdi_young_pop = wdi_data[wdi_data.IndicatorCode=='SP.POP.0014.TO.ZS'].drop('IndicatorCode',axis=1)
wdi_middle_pop = wdi_data[wdi_data.IndicatorCode=='SP.POP.1564.TO.ZS'].drop('IndicatorCode',axis=1)
wdi_older_pop = wdi_data[wdi_data.IndicatorCode=='SP.POP.65UP.TO.ZS'].drop('IndicatorCode',axis=1)

pop = pd.concat([country,wdi_pop_totl/1e6,wdi_urb_ptc,wdi_kids_lit,wdi_bigs_lit,wdi_wom_lit/wdi_man_lit,wdi_gini,
                 wdi_internet,wdi_income,wdi_gdp,wdi_ter_labor,
                 wdi_wom_sec_edu,wdi_wom_industry,wdi_wom_man_labor,wdi_wom_labor_ter_edu,  
                 wdi_young_pop,wdi_middle_pop,wdi_older_pop,
                 wdi_income,
                 wdi_country_info.Region,wdi_country_info.IncomeGroup],
                axis=1)
pop.columns =['Kagglers','Pop','UrbanPop','YoungLit','AdultLit','FemMaleLit','Gini','InternetUsers','AdjustedIncome','GDPCapita','LaborTertiaryEducation',
              'WomanSecEducation','WomanIndustry','RatioWMLabor','WomanLaborAndTerEdu',
              'LessThen14','Between','MoreThen65',
              'IncomePerCapita',
              'Region','IncomeGroup']

pop = (pop
        .replace({'High income: OECD': 'High\n(OECD)'})
        .replace({'High income: nonOECD': 'High\n(nonOECD)'})
        .replace({'Lower middle income': 'Lower\nMiddle'})
        .replace({'Upper middle income': 'Upper\nMiddle'})
        )       

# Derived columns: the country name as a column, and kagglers per million inhabitants
pop['CountryName'] = pop.index
pop['KagglersCapita'] = pop['Kagglers'] / pop['Pop']

We can now use the World Development Indicators to calculate the number of kagglers per inhabitant (in this case per 10$^6$ inhabitants).
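
To make the "divide by something" step concrete, here is a minimal sketch with round, made-up numbers (not the survey or WDI values). The population is expressed in millions, so the ratio reads as kagglers per million inhabitants.

```python
import pandas as pd

# Hypothetical round numbers, NOT the survey or WDI values.
toy = pd.DataFrame(
    {'Kagglers': [4000, 100], 'Pop': [320.0, 5.0]},  # Pop in millions
    index=['Big country', 'Small country'],
)

# Kagglers per million inhabitants
toy['KagglersCapita'] = toy['Kagglers'] / toy['Pop']
print(toy)
```

With these toy values the big country has 12.5 kagglers per million inhabitants and the small one 20, so the ranking flips once population is taken into account.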

Here are the two plots (kagglers and kagglers per capita) together, colour coded by region:

In [None]:
c = ['#F3715A','#FFAD59','#F8FC98','#C5E17A','#00BCD4','#03A9F4','#AA55AA']
c2 = [c[3],c[5],c[0],c[1],c[4],c[6],c[2]]

fig, ax = plt.subplots(2,1,figsize=(20,10))
fig.subplots_adjust(hspace=0.6)

pop.sort_values('Kagglers',inplace=True,ascending=False)
p = sns.barplot(x='CountryName',y='Kagglers',hue='Region',data=pop, dodge=False, palette=c2, ax = ax[0])
dummy = p.set_xticklabels(pop.index,rotation=90)
dummy = p.set_xlabel('')

pop.sort_values('KagglersCapita',inplace=True,ascending=False)
p = sns.barplot(x='CountryName',y='KagglersCapita',hue='Region',data=pop, dodge=False, palette=c, ax = ax[1])
dummy = p.set_xticklabels(pop.index,rotation=90)
dummy = p.set_xlabel('')

Now, these plots tell different stories: if you were to put all the Kaggle Survey data in an imaginary data hat and pick one answer at random, you would be most likely to pick someone living in the United States, India or China. **But** you are more likely to bump into a fellow kaggler in the streets of Singapore, Ireland, Israel, and a few others, than in any of these big countries.

Indeed, little Denmark actually has more kagglers *per inhabitant* than both the United States and India. And so do a lot of other countries, with a special mention for Singapore, where there is a very high *kaggler density*. Also, notice how the *densest* countries are still, generally, European countries (the orange bars). If one looks at the exceptions to this -- Singapore, Israel, New Zealand, Canada, United States, Australia and Hong Kong -- we might start to see a pattern: they are all 'rich'/'developed' countries.

To check this, I'll use the country classification from the World Development Indicators, where countries can belong to one of these income groups:
* Low income
* Lower middle income
* Upper middle income
* High income: nonOECD
* High income: [OECD](http://www.oecd.org/about/)

Here, I've split the countries in the Kaggle Survey into these categories, and plotted the population, number of kagglers (people surveyed) and kaggler density for each group. Note that the code below takes the mean per group, so each bar is an average over the countries in that group.

In [None]:
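# Average every numeric column over the countries in each income group
# (numeric_only=True skips the text columns such as Region and CountryName)
incomegroups = (pop.groupby('IncomeGroup')
                .mean(numeric_only=True)
                .sort_values('AdjustedIncome',ascending=False))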

fig,ax = plt.subplots(1,3,figsize=(15,5))
fig.subplots_adjust(right=0.99,hspace=0.4,bottom=0.2)
ax = ax.ravel()

features =['Pop','Kagglers','KagglersCapita']
for x,var in enumerate(features):
    p = sns.barplot(x=incomegroups.index,y=var,data=incomegroups,ax=ax[x],palette='Oranges_r')
    dummy = p.set_xticklabels(incomegroups.index,rotation=60)
    dummy = p.set_xlabel('')

The first thing to notice, sadly but probably not surprisingly, is that there are no kagglers from low income countries.

Second, most kagglers do indeed reside in lower middle income countries (middle plot), but notice that this is also where most people reside (left plot). So when you look at kaggler density, our intuition holds: there are more kagglers per million inhabitants in high income countries.
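
One caveat when reading group-level bars: averaging the per-country densities of a group is not the same as dividing the group's total kagglers by its total population. The sketch below uses made-up numbers and simplified group names (they are assumptions, not the survey data) to show both quantities side by side.

```python
import pandas as pd

# Hypothetical countries; Pop is in millions.
toy = pd.DataFrame({
    'IncomeGroup': ['High', 'High', 'Lower middle', 'Lower middle'],
    'Kagglers': [100, 300, 50, 950],
    'Pop': [5.0, 60.0, 25.0, 1000.0],
})
toy['KagglersCapita'] = toy['Kagglers'] / toy['Pop']

# (a) mean of the per-country densities: every country weighs the same
mean_of_countries = toy.groupby('IncomeGroup')['KagglersCapita'].mean()

# (b) pooled density: total kagglers over total population of the group
totals = toy.groupby('IncomeGroup')[['Kagglers', 'Pop']].sum()
pooled = totals['Kagglers'] / totals['Pop']

print(pd.DataFrame({'mean_of_countries': mean_of_countries, 'pooled': pooled}))
```

In this toy example the lower middle group has more kagglers in total but a lower density under both definitions, which is the same pattern described above; the two definitions can still give quite different magnitudes when small and large countries share a group.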

Now, is there something more specific about a country that can predict how many kagglers per capita there are?

Here I'll use a couple of WDI indicators (I just chose some that seemed interesting; you can check the full list [here](http://www.kaggle.com/benhamner/indicators-in-data)):
* UrbanPop: Urban population (% of total)
* YoungLit: Youth literacy rate, population 15-24 years, both sexes (%)
* AdultLit: Adult literacy rate, population 15+ years, both sexes (%)
* LaborTertiaryEducation: Labor force with tertiary education (% of total)
* Gini: Gini index. Measures inequality in income (0 for perfect equality and 100 for maximal inequality)
* InternetUsers: Internet users (per 100 people)
* AdjustedIncome: Adjusted net national income per capita (current US$)
* GDPCapita: GDP per capita (current US$)

In [None]:
order = ['High\n(OECD)','High\n(nonOECD)', 'Upper\nMiddle', 'Lower\nMiddle']
def plot_and_scatter(ind,i):
    if i == 0:
        p = sns.scatterplot(y='KagglersCapita',x=ind,hue='IncomeGroup',hue_order=order,data=pop,ax=ax[i],palette='Oranges_r')
        p.legend(loc=2)
    else:
        sns.scatterplot(y='KagglersCapita',x=ind,hue='IncomeGroup',hue_order=order,data=pop,ax=ax[i],legend=False,palette='Oranges_r')
    ax[i].annotate('$\\rho$ = %0.4f'%pop['KagglersCapita'].corr(pop[ind],'spearman'),xy=(0.7,0.85),xycoords='axes fraction')
    ax[i].annotate('r = %0.4f'%pop['KagglersCapita'].corr(pop[ind]),xy=(0.7,0.8),xycoords='axes fraction')

fig,ax = plt.subplots(2,4,figsize=(20,10))
fig.subplots_adjust(right=0.99)
ax = ax.ravel()

features = ['GDPCapita','InternetUsers','AdjustedIncome','LaborTertiaryEducation','YoungLit','UrbanPop','AdultLit','Gini']

for x,var in enumerate(features):
    plot_and_scatter(var,x)

There are two nice strong correlations: between *kaggler density* and **Internet Users** (not surprising, since this is a necessary condition to kaggle!) and between *kaggler density* and **GDP per Capita**, with [Spearman correlations](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) ($\rho$) of 0.83 and 0.87. Indeed, it seems that the most necessary condition for a country to have a lot of kagglers per inhabitant is to be a rich country (high GDP). And yes, correlation does not mean causation, but it seems more reasonable to me that rich countries can provide the conditions for people to kaggle than that kagglers, who are only a small fraction of the population, can strongly influence the GDP.
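
Why report both $\rho$ and r? Spearman's $\rho$ is simply Pearson's r computed on the ranks of the data, so it measures how monotonic a relation is rather than how linear. A minimal sketch on made-up numbers (computing $\rho$ through ranks, which is an equivalent route to the `'spearman'` option used in the plotting cell):

```python
import pandas as pd

# Toy data: y grows with x, but not along a straight line.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

# Pearson's r measures linear association.
pearson = x.corr(y)

# Spearman's rho is Pearson's r on the ranks.
spearman = x.rank().corr(y.rank())

print('r = %0.4f, rho = %0.4f' % (pearson, spearman))
```

Here the relation is perfectly monotonic, so $\rho$ is 1 while r stays below 1. This is why $\rho$ is the more forgiving number for the curved scatter plots above.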

On the other hand, no countries with a high degree of inequality (Gini > 45, bottom right panel) have a high kaggler density. However, more egalitarian countries can have either a high or a low kaggler density, probably because the Gini index is somehow correlated with 'richness' (red dots prefer to live in the left of that plot). And this tells you something quite important: **these variables are all correlated**! Rich countries have more internet access, and their population is more educated and urbanised (you can see this by how well separated the different colours are in the plots above, which I also find a nice by-product of this analysis :)).

So, for example, it is tempting to say that countries with a higher urban population or with more people with [tertiary education](https://en.wikipedia.org/wiki/Tertiary_education) also have a higher kaggler density. However, these are properties of countries with higher GDP, so it is hard to tell whether urbanisation or higher education are driving kaggler density up or not.
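
One common way to probe this kind of confounding is a partial correlation: remove the part of each variable that a straight-line fit on GDP explains, then correlate what is left. The sketch below is not part of the original analysis; it uses synthetic data in which, by construction, a 'gdp' variable drives both other variables, and it uses Pearson rather than Spearman correlations for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (an assumption for illustration): 'gdp' drives both other variables.
gdp = rng.normal(size=n)
urban = gdp + rng.normal(size=n)
density = gdp + rng.normal(size=n)

def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation versus correlation after removing the GDP contribution
raw_r = np.corrcoef(urban, density)[0, 1]
partial_r = np.corrcoef(residuals(urban, gdp), residuals(density, gdp))[0, 1]

print('raw r = %0.2f, partial r (GDP removed) = %0.2f' % (raw_r, partial_r))
```

In this synthetic setup the raw correlation is clearly positive, while the partial correlation is close to zero, because 'urban' has no effect of its own. With only a few dozen real countries such an estimate would be much noisier, so it should be read as a hint rather than a verdict.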

But one interesting question that comes from these plots is: given that countries from different income levels actually have different characteristics (rich -> more internet, more cities, more educated labour force), **do we see any strong difference between kagglers from different income groups?**

(This is inspired by the [Dollar Street](https://www.gapminder.org/dollar-street/matrix) project, where one can see that, despite cultural differences, people with similar incomes have very similar homes.)

Let's start with gender:


In [None]:
kaggle_by_incomegrp = kaggle_data.copy()

for i,k in pop['IncomeGroup'].to_dict().items():
    kaggle_by_incomegrp.replace(to_replace={i:k},inplace=True)
kaggle_by_incomegrp.replace({'I do not wish to disclose my location' : 'Not telling you'},inplace=True)
    
kaggle_by_incomegrp = kaggle_by_incomegrp[['Q3','Q1','Q2','Q4','Q8','Q9']]
kaggle_by_incomegrp.columns = ['IncomeGroup','Gender','Age','Education','Experience','Compensation']
kaggle_by_incomegrp.drop(index=0,inplace=True)
fix_index = ['High\n(OECD)', 'High\n(nonOECD)', 'Upper\nMiddle','Lower\nMiddle', 'Other','Not telling you']

#kaggle_by_incomegrp.describe()

In [None]:
def massage_data(feature):
    """Count answers per (IncomeGroup, feature) and add each count's share of its income group."""
    feat = kaggle_by_incomegrp.groupby(['IncomeGroup',feature]).size().reset_index()
    feat.columns = ['IncomeGroup',feature,'Counts']

    # Change to percentage of the answers within each income group
    # (transform('sum') avoids assigning through a chained slice, which pandas may silently ignore)
    group_totals = feat.groupby('IncomeGroup')['Counts'].transform('sum')
    feat['Percentage'] = feat['Counts'] / group_totals * 100

    return feat

In [None]:
gender = massage_data('Gender')
# pivot from here: https://pstblog.com/2016/10/04/stacked-charts
pivot_gender = gender.pivot(index='IncomeGroup', columns='Gender', values='Percentage').loc[fix_index]
p = pivot_gender.plot.bar(stacked=True, figsize=(10,7),cmap='Accent')
dummy = p.set_ylabel('Percentage of total answers per income group')
dummy = p.set_xlabel('')
dummy = p.legend(loc='center left', bbox_to_anchor=(1, 0.5))

As you probably know by now, most kagglers identify as male, with fewer than ~20% identifying as female. However, the percentage of female kagglers differs little (<5%) between the income groups.
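
The percentages behind these stacked bars are just counts normalised within each income group. As a compact alternative sketch of the same idea (on made-up answers, not the survey data), `pd.crosstab` with `normalize='index'` gives the row-wise shares directly:

```python
import pandas as pd

# Hypothetical answers: one row per respondent.
toy = pd.DataFrame({
    'IncomeGroup': ['High', 'High', 'High', 'High', 'Lower', 'Lower'],
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
})

# Share (in %) of each gender within each income group; every row sums to 100.
shares = pd.crosstab(toy['IncomeGroup'], toy['Gender'], normalize='index') * 100
print(shares)
```

Because every row sums to 100, groups of very different sizes can be compared on the same axis.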

Cool thing: notice how the highest percentage of people that 'prefer not to say' which gender they identify with is found among the people that also did not want to disclose their country of residence. I suspect these are just overall 'shy' people that prefer to keep personal information to themselves, and not their gender in particular, but it's a bit hard to be sure with just one plot.

Now let's have a look at how women are doing in these different groups. I've chosen a couple of variables (a bit cryptically named in the plots below):

*  WomanSecEducation: Percentage of students in secondary general education who are female (%)
*  WomanIndustry: Employment in industry, female (% of female employment)
*  RatioWMLabor : Ratio of female to male labor force participation rate (%) (national estimate)
*  WomanLaborAndTerEdu: Labor force with tertiary education, female (% of female labor force)

The first plot is just the same as above, but only with females.


In [None]:
fig,ax = plt.subplots(1,5,figsize=(20,5))
fig.subplots_adjust(right=0.99,bottom=0.1)

p = pivot_gender['Female'].plot.bar(stacked=True,cmap='Accent',ax=ax[0])
dummy = p.set_ylabel('Percentage of total answers per income group')
dummy = p.set_xlabel('')

features =['WomanSecEducation','WomanIndustry','RatioWMLabor','WomanLaborAndTerEdu']
for x,var in enumerate(features):
    p = sns.barplot(x=incomegroups.index,y=var,data=incomegroups,ax=ax[x+1],palette='Oranges_r')
    dummy = p.set_xticklabels(incomegroups.index,rotation=60)
    dummy = p.set_xlabel('')

# Note: .std() returns the standard deviation, not the variance
print('Standard deviation:')
print('Kagglers %0.2f'%pivot_gender['Female'].drop('Other').drop('Not telling you').std())
print('Female in secondary education %0.2f'%incomegroups['WomanSecEducation'].std())
print('Female in Industry %0.2f'%incomegroups['WomanIndustry'].std())
print('Female in the labor force %0.2f'%incomegroups['RatioWMLabor'].std())
print('Female in labor force with tertiary education %0.2f'%incomegroups['WomanLaborAndTerEdu'].std())

First, good news! Notice how about 50% of students in secondary education are usually female (second plot), although lower middle income countries still lag a tiny bit behind.

Then, very interestingly, there are way more women in industry in upper middle income countries than in any other category (third plot). Also notice the nice progression of women entering the work force from lower middle income countries to high (from about 55% up to 80%). Interestingly, this is not mirrored in the first plot; that is, there are not more women kagglers in high income countries than in lower ones.

So, is women's representation among kagglers representative of women's representation in education and the labour force in each income level?

We can have a look at the standard deviation of the values in the plots above, which gives us an idea of how much the different income levels differ:

    Kagglers 1.73
    Female in secondary education 1.33
    Female in Industry 5.06
    Female in the labor force 10.63
    Female in labor force with tertiary education 11.07

As you can see, **the variation of the percentage of women in the Kaggle survey across income levels is considerably smaller than the variation in women in the work force or in the work force with tertiary education** (a probable start for a Data Scientist, I'd say!), so I would argue that the disparities seen between income levels in women's education or participation in the labor force are not seen in the Kaggle community.
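
For reference, the spread numbers quoted above come from pandas' `.std()`, which returns the sample standard deviation (it divides by n - 1), in the same units as the data (percentage points here). A minimal sketch with made-up shares for four income groups (the values and labels are assumptions, not the survey results):

```python
import pandas as pd

# Hypothetical female shares (in %) for four income groups.
female_share = pd.Series([18.0, 17.0, 16.0, 15.0],
                         index=['High (OECD)', 'High (nonOECD)', 'Upper middle', 'Lower middle'])

std = female_share.std()   # sample standard deviation (ddof=1), in percentage points
var = female_share.var()   # sample variance, in squared percentage points

print('std = %0.2f, var = %0.2f' % (std, var))
```

The standard deviation is the square root of the variance, so it is the easier of the two to compare directly with the bar heights.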

In other words, imagine you had four very, very large rooms. In one you'd put kagglers from high income countries, in another kagglers from lower middle income countries, and in the other two just regular working people with higher education from high and lower middle income countries. If you were to choose a person at random, you'd have almost the same chance of picking a female in both kaggle rooms (18% to 15%, assuming the survey is representative). They are not that different. But if you were to do the same in the 'regular' folks rooms, you would be much less likely to pick a female in the lower income room than in the higher income one.

OK, what about age?

In [None]:
age = massage_data('Age')
pivot_age = age.pivot(index='IncomeGroup', columns='Age', values='Percentage').loc[fix_index]
p = pivot_age.plot.bar(stacked=True, figsize=(10,7),cmap='Blues_r',)
dummy = p.set_ylabel('Percentage of total answers per income group')
dummy = p.set_xlabel('')
dummy = p.legend(loc='center left', bbox_to_anchor=(1, 0.5))

Now age is a different matter. People in the lower middle income countries are generally younger (more than ~20% under 21 and ~80% under 29) compared with other income groups, particularly the High income (OECD) countries, where less than ~10% of those surveyed are under 21.

What is the age distribution in each of these groups? In the World Development Indicators data I could only find the percentage of the population aged 0-14, 15-64, and 65 and above, so I guess we'll work with that:

In [None]:
fig,ax = plt.subplots(1,3,figsize=(20,5))
fig.subplots_adjust(right=0.99,bottom=0.1)

features =['LessThen14', 'Between','MoreThen65']
for x,var in enumerate(features):
    p = sns.barplot(x=incomegroups.index,y=var,data=incomegroups,ax=ax[x],palette='Oranges_r')
    dummy = p.set_xticklabels(incomegroups.index,rotation=60)
    dummy = p.set_xlabel('')

Keep in mind that these do not map onto the previous age slots, but, generally, there are way more young people (under 15) and way fewer older people (65 and over) in lower middle income countries than in higher income ones. Extrapolating this to the kaggler age groups, maybe it is less surprising that there are quite a lot more young people (less than 24 years old) in the lower income countries than in the higher income ones. And equivalently for the more mature slots.

What about compensation?

In [None]:
compensation = massage_data('Compensation')

# Ordering
pivot_compensation = compensation.pivot(index='IncomeGroup', columns='Compensation', values='Percentage').loc[fix_index]
pivot_compensation = pivot_compensation[['0-10,000', '10-20,000', '20-30,000', '30-40,000','40-50,000','50-60,000','60-70,000','70-80,000', '80-90,000', '90-100,000',
                                          '100-125,000', '125-150,000', '150-200,000', '200-250,000', '250-300,000',  '300-400,000','400-500,000',  '500,000+', 
                                         'I do not wish to disclose my approximate yearly compensation']] # yes, it's ugly.

# Plotting
p = pivot_compensation.plot.barh(stacked=True, figsize=(15,7),cmap='viridis_r')
dummy = p.set_xlabel('Percentage of total answers per income group')
dummy = p.set_ylabel('')
dummy = p.legend(loc='center left', bbox_to_anchor=(1, 0.5))

# If you're reading this STOP! Below there are four quite embarrassing lines, but I didn't know how to display this information and I was too tired to figure it out
# As we say back home: those who do not have a dog, hunt with a cat
d = p.annotate('__________',xy=(0.201,0.135),color='r',weight='bold',xycoords='axes fraction')
d = p.annotate('_______________________________',xy=(0.203,0.30),color='r',weight='bold',xycoords='axes fraction')
d = p.annotate('____________________________________________________',xy=(0.001,0.47),color='r',weight='bold',xycoords='axes fraction')
d = p.annotate('_______________________________________________________________',xy=(0.001,0.635),color='r',weight='bold',xycoords='axes fraction')

There seem to be quite a lot more people with lower compensation in Middle income countries than in higher income ones. But does this actually mean that Data Scientists are less well paid in these countries?

Well, it depends on how you define 'well paid'. I couldn't find a 'mean income' in the WDI data set; the closest that I found was 'Adjusted net national income per capita (current US$)', which is "GNI minus consumption of fixed capital and natural resources depletion." So it is not exactly what I wanted, but we can use it as a fair 'ruler' to normalise the incomes and compare the incomes of Data Scientists (kagglers) with those of the other people that live in countries in their income group.

Here is the mean of this national income per capita, per income group:


In [None]:
incomegroups.IncomePerCapita

In the plot above, I've put a red line above the income slot that corresponds to this "national income per capita". 

For the two Middle income groups, even the worst paid Data Scientists are already earning (or relatively close to earning) the 'national income per capita'. On the other hand, for the High income countries, the lower 20% of kagglers are making less than this "national income per capita", even if, in absolute numbers, many will be making more than their fellow kagglers in the Middle income groups.
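
The 'ruler' idea can also be written down as a number: take the midpoint of a compensation bracket and divide it by the national income per capita of the group. This is only a sketch of that normalisation; `bracket_midpoint` is a hypothetical helper, and the two income values are round made-up figures, not the WDI values.

```python
def bracket_midpoint(label):
    """Midpoint in US$ of a bracket such as '10-20,000'; None for labels without a range."""
    if '-' not in label:
        return None
    low, high = label.replace(',', '').split('-')
    low, high = float(low), float(high)
    if low < 1000:  # the lower bound is written in thousands, e.g. '10-20,000'
        low *= 1000
    return (low + high) / 2

midpoint = bracket_midpoint('10-20,000')

# Hypothetical national incomes per capita (US$), for illustration only.
national_income = {'Lower middle (toy)': 2000.0, 'High (toy)': 40000.0}

# How many 'national incomes per capita' the same bracket is worth in each group
relative = {grp: midpoint / income for grp, income in national_income.items()}
print(relative)
```

With these toy figures the same bracket is worth several times the national income per capita in the lower middle group, but well under it in the high income group, which is the effect the red lines are meant to show.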

Notice as well the tail of the distributions (darker part). There is a clear trend across income groups in the share of people not wanting to disclose their income, with people from high income countries being more OK with sharing in a survey how much they earn. I can't really think of an explanation for this; I think culture is a bit too different between countries of the same group to explain it, so I'm intrigued!

My takeaway from this is that working in data science is probably still quite well paid, *relatively* speaking, in middle income countries, even if your fellows from high income countries might not think so!

(And did you notice again that the highest percentage of people that did not want to disclose their salary also didn't want to say where they live? Some more evidence for the 'shy'/'wise' people theory!)

Finally, let's have a quick look at how education and experience are distributed:

In [None]:
education = massage_data('Education')

# Ordering 
pivot_education = education.pivot(index='IncomeGroup', columns='Education', values='Percentage').loc[fix_index]
pivot_education = pivot_education[['No formal education past high school','Some college/university study without earning a bachelor’s degree','Professional degree',
                                 'Bachelor’s degree','Master’s degree','Doctoral degree','I prefer not to answer']]

# Plotting
p = pivot_education.plot.bar(stacked=True, figsize=(10,7),cmap='PRGn')
dummy = p.set_ylabel('Percentage of total answers per income group')
dummy = p.set_xlabel('')
dummy = p.legend(loc='center left', bbox_to_anchor=(1, 0.5))

experience = massage_data('Experience')

# Ordering
pivot_experience = experience.pivot(index='IncomeGroup', columns='Experience', values='Percentage').loc[fix_index]
pivot_experience = pivot_experience[['0-1', '1-2','2-3', '3-4','4-5','5-10','10-15', '15-20', '20-25', '25-30',  '30 +']]

# Plotting
p = pivot_experience.plot.bar(stacked=True, figsize=(10,7),cmap='Oranges')
dummy = p.set_ylabel('Percentage of total answers per income group')
dummy = p.set_xlabel('')
dummy = p.legend(loc='center left', bbox_to_anchor=(1, 0.5))


Kagglers in the Lower Middle income group generally have more Bachelor's degrees compared with higher income countries (which have more Master's degrees) and are also a little bit less experienced than the other groups.

(By the way, if I had to guess, based on these last three plots, I would say that most people from 'Other' countries live in high income countries that are not part of the OECD.)

More things could be done, plotted and said, but let's stop here and make a summary.

**Summary**

1. Most people answering the survey live in the United States, India and China. However, these countries have very large populations, so their kaggler density (survey answers per inhabitant) is relatively low. Singapore, Ireland and Israel are the countries with the highest density.
2. Most people that answered the survey live in 'Lower Middle Income' countries (according to the World Development Indicators). However, this is also where most people live, so the highest density of kagglers is indeed found in high income countries.
3. The percentage of females is quite low, but quite constant across the income groups, even though the percentage of women working and with higher education varies a lot between income groups. I find that encouraging! If there are some barriers in access to higher education in some countries, that doesn't seem to be the case with access to Kaggle :)
4. Surveyed people from higher income countries were, generally, older than in the lower income group. However, this might only reflect the age distribution of the underlying population: countries in the lower income group have younger populations (although a direct comparison was not possible).
5. In terms of education and experience, the 3 highest income groups are comparable, but the lower income group has slightly less experienced and less formally educated people.
6. Compensation in lower income countries is, well, lower. However, a higher percentage of data scientists in lower income countries exceeds (a) national income per capita.

All in all, looking at how different these 4 income groups can be in terms of GDP, urbanisation, education and equality, I think kagglers are still a relatively homogeneous group, with small(ish) differences in gender, experience and education.
