# Part III: Exploratory data analysis

In this part, we will use different types of graphics to explore a dataset visually. The packages used for data visualisation are `matplotlib` and `seaborn`.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In this example, we will use a dataset from the 2017 Stack Overflow Developer Survey. The analysis broadly follows a Stack Overflow blog post that focuses on an unexpected result of the survey. The entire post can be found here: https://stackoverflow.blog/2017/06/15/developers-use-spaces-make-money-use-tabs/

First, we will read the data and do some preprocessing of the variables:

In [None]:
so = pd.read_csv('survey_results_public.csv', 
                 usecols=['Respondent', 'Country', 'DeveloperType', 'YearsCodedJob', 'TabsSpaces', 
                          'JobSatisfaction', 'Salary', 'HaveWorkedLanguage'])
so.set_index('Respondent', inplace=True)

We restrict our analysis to five of the largest countries, and lump the other countries together in a category level called 'Other':

In [None]:
countries = ['United States', 'India', 'United Kingdom', 'Germany', 'Canada']
so.loc[~so.Country.isin(countries), 'Country'] = 'Other'
so.Country = so.Country.astype('category')
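
The same lumping step can also be written with `Series.where`, which keeps the values that satisfy a condition and replaces all others. The following is a minimal sketch on a hypothetical toy data frame (the names `toy` and `keep` are made up for illustration and are not part of the survey data):

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({'Country': ['Germany', 'France', 'India', 'Brazil', 'Germany']})
keep = ['Germany', 'India']

# Keep the countries listed in `keep`, replace every other value with 'Other'
toy['Country'] = toy['Country'].where(toy['Country'].isin(keep), 'Other').astype('category')

# Count how many rows fall into each category level
country_counts = toy['Country'].value_counts().to_dict()
print(country_counts)
```

Converting to the `category` dtype afterwards means the levels are fixed to the kept countries plus 'Other'.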

We will be focusing on the question of whether a developer prefers tabs or spaces, and we therefore exclude all respondents who did not answer that question:

In [None]:
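# Take an explicit copy so that later assignments do not act on a view of the original frame
so = so[so.TabsSpaces.notnull()].copy()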

We recode the number of years of professional coding experience into four categories:

In [None]:
so.loc[so.YearsCodedJob.isin(
    ['Less than a year', '1 to 2 years', '2 to 3 years', '3 to 4 years', '4 to 5 years']), 'YearsCodedJob'] = '< 5y'
so.loc[so.YearsCodedJob.isin(
    ['5 to 6 years', '6 to 7 years', '7 to 8 years', '8 to 9 years', '9 to 10 years']), 'YearsCodedJob'] = '6-10'
so.loc[so.YearsCodedJob.isin(
    ['10 to 11 years', '11 to 12 years', '12 to 13 years', '13 to 14 years', '14 to 15 years']), 'YearsCodedJob'] = '11-15'
so.loc[so.YearsCodedJob.isin(
    ['15 to 16 years', '16 to 17 years', '17 to 18 years', '18 to 19 years', '19 to 20 years', '20 or more years']), 'YearsCodedJob'] = '15+'
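
As an alternative to listing every answer by hand, the recoding could be derived from the lower bound of each answer. This is only a sketch on hypothetical toy answers; the helper names `first_year` and `to_bin` are made up for illustration, and the sketch assumes every answer is either 'Less than a year' or starts with a number:

```python
import pandas as pd

# Hypothetical toy answers in the survey's wording, including one missing value
years = pd.Series(['Less than a year', '3 to 4 years', '7 to 8 years',
                   '12 to 13 years', '20 or more years', None])

def first_year(label):
    """Return the lower bound of an answer such as '7 to 8 years' as an int."""
    if label == 'Less than a year':
        return 0
    return int(label.split()[0])

def to_bin(label):
    """Map a raw answer to one of the four category labels used above."""
    if pd.isnull(label):
        return label  # missing answers stay missing
    lower = first_year(label)
    if lower < 5:
        return '< 5y'
    if lower < 10:
        return '6-10'
    if lower < 15:
        return '11-15'
    return '15+'

binned = years.map(to_bin)
print(binned)
```

The cut-offs reproduce the grouping of the explicit lists above, so an answer of '5 to 6 years' ends up in the '6-10' category.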

Show plots in the notebook:

In [None]:
%matplotlib inline

Let's start with boxplots showing the distribution of salaries by tabs/spaces preference:

In [None]:
sns.boxplot(x=so.TabsSpaces, y=so.Salary, order=['Tabs', 'Both', 'Spaces'])

The boxplot is a relatively simple plot that nonetheless conveys a lot of information: each box spans the range from the 25% to the 75% quantile of its group, with a center line at the median. We can thus see that there seems to be a difference in salary between developers preferring spaces and developers using tabs or a mix of tabs and spaces.
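
The numbers behind the boxes can be computed directly. The following sketch uses hypothetical toy salaries (not survey data) to show how the quartiles per group, and from them the height of each box, could be obtained with `groupby` and `quantile`:

```python
import pandas as pd

# Hypothetical toy salaries, not taken from the survey
toy = pd.DataFrame({
    'TabsSpaces': ['Tabs'] * 5 + ['Spaces'] * 5,
    'Salary': [10, 20, 30, 40, 50, 20, 40, 60, 80, 100],
})

# One row per group, one column per quantile (lower box edge, center line, upper box edge)
quartiles = toy.groupby('TabsSpaces')['Salary'].quantile([0.25, 0.5, 0.75]).unstack()

# The interquartile range is the height of the box
iqr = quartiles[0.75] - quartiles[0.25]
print(quartiles)
print(iqr)
```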

Barplots are another common way to compare groups. To build a barplot visualizing the same difference in medians, we need to first calculate the median salaries by group:

In [None]:
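# numeric_only=True restricts the median to numeric columns; recent pandas versions raise an error otherwise
by_countries = so.groupby(['Country', 'TabsSpaces']).median(numeric_only=True)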

We can then plot these median salaries using a barplot:

In [None]:
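# Median across countries of the per-country medians (the `level` argument of `median` was removed in pandas 2.0)
median_by_pref = by_countries.Salary.groupby(level=1).median()
sns.barplot(x=median_by_pref.index, y=median_by_pref.values, order=['Tabs', 'Both', 'Spaces'])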
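
Note that the median of per-country medians is not necessarily the same as the median over all respondents of a group, because every country counts equally regardless of its number of respondents. A small sketch with hypothetical toy numbers (not survey data) illustrates the difference:

```python
import pandas as pd

# Hypothetical toy data: country A has three respondents, country B only one
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B'],
    'TabsSpaces': ['Tabs', 'Tabs', 'Tabs', 'Tabs'],
    'Salary': [10, 20, 30, 100],
})

# Median per country and preference, then the median of those medians
per_country = toy.groupby(['Country', 'TabsSpaces'])['Salary'].median()
median_of_medians = per_country.groupby(level='TabsSpaces').median()

# Median over all respondents of a preference group, ignoring the country
pooled_median = toy.groupby('TabsSpaces')['Salary'].median()

print(median_of_medians['Tabs'], pooled_median['Tabs'])
```

Which of the two is appropriate depends on whether countries or individual respondents should carry equal weight.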

It is also possible to split the plot according to another variable. We will use this to see whether the salary difference is consistent across countries:

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(6,10))
axes = axes.ravel()
for i, c in enumerate(by_countries.index.levels[0]):
    sns.barplot(x=by_countries.index.levels[1], y=by_countries.loc[c, 'Salary'], order=['Tabs', 'Both', 'Spaces'], 
                ax=axes[i]).set_title(c)
    axes[i].set(xlabel='', ylabel='')
fig.tight_layout()

One hypothesis is that the difference arises because the preference for spaces over tabs depends on a developer's experience. To check this, we can use a line chart that shows the median salary of each group across experience levels:

In [None]:
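# numeric_only=True restricts the median to numeric columns; recent pandas versions raise an error otherwise
by_years = so.groupby(['TabsSpaces', 'YearsCodedJob']).median(numeric_only=True).reset_index()
sns.pointplot(x="YearsCodedJob", y="Salary", hue="TabsSpaces", data=by_years,
              order=["< 5y", "6-10", "11-15", "15+"], estimator=np.median)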

Apparently, the difference is consistent across experience levels, so there must be other factors explaining it. 
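
Whether the preference itself varies with experience could also be checked directly, by tabulating the share of each preference within every experience category. The following sketch uses hypothetical toy data (not survey data) and `pd.crosstab` with `normalize='index'`, which makes each row sum to 1:

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    'YearsCodedJob': ['< 5y', '< 5y', '< 5y', '< 5y', '15+', '15+', '15+', '15+'],
    'TabsSpaces': ['Tabs', 'Tabs', 'Tabs', 'Spaces', 'Tabs', 'Spaces', 'Spaces', 'Spaces'],
})

# Share of each preference within every experience category (rows sum to 1)
shares = pd.crosstab(toy['YearsCodedJob'], toy['TabsSpaces'], normalize='index')
print(shares)
```

In this toy example the share of space users rises with experience; on the real data, the same table would have to be computed from `so`.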

The next hypothesis is that the difference depends on the type of developer: developer types might differ both in their preference for spaces or tabs and in their salary ranges. To check this, we will use a scatter plot that places each developer type in two dimensions, according to the median salary of developers of that type preferring tabs and spaces, respectively:

In [None]:
# Keep respondents with a clear preference and a known developer type
by_type = so[so.TabsSpaces.isin(['Tabs', 'Spaces']) & so.DeveloperType.notnull()].copy()
# Respondents may list several types separated by ';'; keep only the first one
by_type['DeveloperType'] = by_type.DeveloperType.str.split(';').str[0]
by_type = by_type.groupby(['DeveloperType', 'TabsSpaces']).median(numeric_only=True)
by_type = by_type.unstack()['Salary']
# `height` was called `size` before seaborn 0.9
g = sns.JointGrid(x='Tabs', y='Spaces', data=by_type, height=7)
g.plot_joint(plt.scatter)
g.ax_marg_x.set_axis_off()
g.ax_marg_y.set_axis_off()

# Label each point with its developer type
for dev_type, row in by_type.iterrows():
    g.ax_joint.annotate(dev_type, xy=(row['Tabs'], row['Spaces']),
                        xytext=(2, 2), textcoords='offset points')

# Draw the parity line on which tab and space users earn the same
x0, x1 = g.ax_joint.get_xlim()
y0, y1 = g.ax_joint.get_ylim()
lims = [max(x0, y0), min(x1, y1)]
g.ax_joint.plot(lims, lims, '-r')

The red line marks parity between tab and space users. Almost all points lie above it, which means that for almost all developer types the median salary of space users is higher than that of tab users.
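
The visual impression can be backed by a simple count of how many developer types lie above the parity line. This is a sketch on a hypothetical toy table with the same layout as `by_type` (one row per developer type, columns 'Tabs' and 'Spaces'); the values and type names are made up:

```python
import pandas as pd

# Hypothetical toy medians, not taken from the survey
by_type_toy = pd.DataFrame(
    {'Tabs': [50, 60, 70, 80], 'Spaces': [55, 65, 68, 90]},
    index=['type A', 'type B', 'type C', 'type D'])

# A point lies above the parity line if the median for spaces exceeds the one for tabs
above = by_type_toy['Spaces'] > by_type_toy['Tabs']

share_above = above.mean()                   # fraction of types above the line
below_types = above[~above].index.tolist()   # types on or below the line

print(share_above, below_types)
```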

We have thus seen that developers who use spaces earn consistently more than developers who use tabs, which leads us to our closing remark for this part of the primer: when performing statistical analyses, never confuse correlation with causation. Developers are very likely not paid more simply *because* they use spaces instead of tabs, nor are they likely to prefer spaces over tabs *because* they earn more. Rather, it seems more likely that a third factor, correlated with both a preference for spaces over tabs *and* a higher salary, explains the difference, but the jury is still out on which factor that might be.