### Introduction
In this project, we'll analyze the Stack Overflow developer survey dataset. We will use the `opendatasets` module to download the required dataset.

In [2]:
import opendatasets as od

In [3]:
od.download('stackoverflow-developer-survey-2020')

Downloading https://raw.githubusercontent.com/JovianML/opendatasets/master/data/stackoverflow-developer-survey-2020/survey_results_public.csv to ./stackoverflow-developer-survey-2020/survey_results_public.csv

Downloading https://raw.githubusercontent.com/JovianML/opendatasets/master/data/stackoverflow-developer-survey-2020/survey_results_schema.csv to ./stackoverflow-developer-survey-2020/survey_results_schema.csv

Downloading https://raw.githubusercontent.com/JovianML/opendatasets/master/data/stackoverflow-developer-survey-2020/README.txt to ./stackoverflow-developer-survey-2020/README.txt


Let's load the CSV files using the Pandas library. We will save the dataset into `surver_raw_df`.

In [None]:
import pandas as pd

In [None]:
# Read from the folder that od.download() created in the working directory
surver_raw_df = pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv')

In [None]:
surver_raw_df

Let's look at the columns present in the DataFrame.

In [None]:
surver_raw_df.columns

In [None]:
surver_raw_df.shape

In [None]:
# Index the schema by column name and keep only the question text
schema_raw = pd.read_csv('stackoverflow-developer-survey-2020/survey_results_schema.csv', index_col='Column').QuestionText

In [None]:
schema_raw.shape

Now we can use `schema_raw` to retrieve the full question text for any column.

In [None]:
schema_raw['YearsCodePro']
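
To see why this lookup works, here is a minimal sketch on a tiny in-memory schema. The two question texts are made up for illustration (they are not the real survey wording); only the `Column` and `QuestionText` header names come from the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row schema; the real file has one row per survey column.
toy_schema_csv = io.StringIO(
    "Column,QuestionText\n"
    "Country,Where do you live?\n"
    "Age,What is your age (in years)?\n"
)

# index_col='Column' turns the column names into the index, and selecting
# QuestionText leaves a Series that maps column name -> question text.
toy_schema = pd.read_csv(toy_schema_csv, index_col='Column').QuestionText

print(toy_schema['Country'])  # label-based lookup
print(toy_schema.Age)         # attribute access also works for simple names
```

Both styles of lookup appear later in this notebook, for example `schema_raw['YearsCodePro']` and `schema_raw.Country`.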

### Data Preparation & Cleaning

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

- Demographics of the survey respondents and the global programming community
- Distribution of programming skills, experience, and preferences
- Employment-related information, preferences, and opinions

Let's select a subset of columns with the relevant data for our analysis.

In [None]:
selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

In [None]:
survey_df = surver_raw_df[selected_columns].copy()

In [None]:
survey_df.info()

In [None]:
survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')
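
A quick sketch of what `errors='coerce'` does, using a made-up Series (the values below are assumptions for illustration, not rows from the survey): anything that cannot be parsed as a number becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical responses: numeric strings mixed with a free-text option
# and a missing answer, stored as objects like a column read from CSV.
toy_years = pd.Series(['5', '12', 'Less than 1 year', None], dtype=object)

# Unparseable entries are replaced with NaN; the result is a float column.
converted = pd.to_numeric(toy_years, errors='coerce')
print(converted)
```

The trade-off is that text answers are silently turned into missing values, so it is worth checking how many `NaN` entries the conversion introduces.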

In [None]:
survey_df.describe()

Here we see that the minimum age is 1 and the maximum age is 279, which are clearly errors in the survey responses. A simple fix is to treat rows where the age is higher than 100 years or lower than 10 years as invalid and drop them.

In [None]:
survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)
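
The sketch below replays this filter on a toy DataFrame (the ages are made up for illustration) and compares it with a single boolean mask, which is one possible alternative. It also shows that rows with a missing age survive the filter, because comparisons with `NaN` evaluate to `False`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including two invalid values and one missing value.
toy_df = pd.DataFrame({'Age': [1.0, 25.0, np.nan, 279.0, 40.0]})

# Same pattern as above: drop rows below 10, then rows above 100.
dropped = toy_df.copy()
dropped.drop(dropped[dropped.Age < 10].index, inplace=True)
dropped.drop(dropped[dropped.Age > 100].index, inplace=True)

# Alternative: build one mask of out-of-range rows and keep the rest.
out_of_range = (toy_df.Age < 10) | (toy_df.Age > 100)
filtered = toy_df[~out_of_range]

print(filtered)
print(dropped.equals(filtered))
```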

The same holds for `WorkWeekHrs`. Let's ignore entries where the value is higher than 140 hours (~20 hours per day).

In [None]:
survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)

The gender column also allows for picking multiple options. We'll remove values containing more than one option to simplify our analysis.

In [None]:
survey_df.Gender.value_counts()

In [None]:
import numpy as np 

In [None]:
survey_df.where(~(survey_df.Gender.str.contains(';', na=False)), np.nan, inplace=True)
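
The mask built by `str.contains(';', na=False)` can be checked on a small example. The sketch below uses made-up gender values (an assumption, not actual survey rows) and applies `where` to the `Gender` Series alone, which is a variation on the cell above: it blanks only the gender value rather than operating on the whole DataFrame.

```python
import pandas as pd

# Hypothetical responses; multiple selections are joined with ';'.
toy_gender = pd.Series(['Man', 'Woman', 'Man;Woman', None], dtype=object)

# True where more than one option was picked; na=False treats
# missing answers as "not multiple" instead of propagating NaN.
multi = toy_gender.str.contains(';', na=False)

# Keep single-option answers, replace multi-option answers with NaN.
single_only = toy_gender.where(~multi)
print(single_only)
```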

Now that we have cleaned the dataset for our analysis, let's look at a sample of rows to get a feel for the data.

In [None]:
survey_df.sample(10)

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt 
%matplotlib inline 

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### Country

In [None]:
schema_raw.Country

In [None]:
survey_df.Country.nunique()

In [None]:
top_countries = survey_df.Country.value_counts().head(15)
top_countries

In [None]:
plt.figure(figsize=(8,4))
plt.xticks(rotation=75)
plt.title(schema_raw.Country)
colours = sns.color_palette('husl', n_colors=len(top_countries.index))
sns.barplot(x=top_countries.index, y=top_countries, palette=colours)

### Age

The distribution of respondents' age is another crucial factor to look at. We can use a histogram to visualize it.

In [None]:
plt.figure(figsize=(12, 6))
plt.title(schema_raw.Age)
plt.xlabel('Age')
plt.ylabel('Number of respondents')

plt.hist(survey_df.Age, bins=np.arange(10,80,5), color='purple')
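
`plt.hist` bins values the same way as `np.histogram`, so we can inspect the bin edges without drawing anything. The sketch below uses a handful of made-up ages (an assumption for illustration). Note that `np.arange(10, 80, 5)` stops at 75, so ages above 75 are left out of the histogram.

```python
import numpy as np

# Hypothetical ages; 90 lies beyond the last bin edge.
toy_ages = np.array([12, 25, 27, 74, 75, 90])

# Same bin edges as the plot above: 10, 15, ..., 75 (80 is excluded by arange).
counts, edges = np.histogram(toy_ages, bins=np.arange(10, 80, 5))

print(edges)   # 14 edges define 13 bins
print(counts)  # the last bin includes its right edge, so 75 is counted
```

If respondents older than 75 should appear in the chart, the upper limit passed to `np.arange` would need to be raised.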

### Gender

Let's now look at the gender distribution of the responses. It's widely recognized that women and non-binary genders are underrepresented in the programming community, so we might expect to see a skewed distribution here.

In [None]:
schema_raw.Gender

In [None]:
gender_counts = survey_df['Gender'].value_counts()
gender_counts

In [None]:
sns.set_style('whitegrid')

In [None]:
plt.figure(figsize=(8,6))
plt.title(schema_raw.Gender)
plt.pie(gender_counts, labels = gender_counts.index, autopct='%1.1f%%', startangle=180)

### Education Level

Formal education in computer science is often considered an essential requirement for becoming a programmer. However, there are many free resources & tutorials available online to learn programming. Let's compare the education levels of respondents to gain some insight into this. We'll use a horizontal bar plot here.

In [None]:
x = survey_df.EdLevel.value_counts()
x

In [None]:
colours_palette = sns.color_palette('Set2', n_colors=len(x.index))
sns.countplot(y=survey_df.EdLevel, palette=colours_palette)
plt.xticks(rotation = 75)
plt.title(schema_raw.EdLevel)
plt.ylabel(None)

In [None]:
schema_raw.UndergradMajor

In [None]:
undergrad_pct = survey_df.UndergradMajor.value_counts() * 100 / survey_df.UndergradMajor.count()

sns.barplot(x=undergrad_pct, y=undergrad_pct.index)

plt.title(schema_raw.UndergradMajor)
plt.ylabel(None)
plt.xlabel('Percentage')
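
The percentage calculation above divides the counts by the number of non-missing answers. As a small sketch on made-up values (the majors below are placeholders, not survey categories), the same numbers can be obtained with `value_counts(normalize=True)`, which also ignores missing values by default.

```python
import pandas as pd

# Hypothetical answers with one missing value.
toy_major = pd.Series(['CS', 'CS', 'Math', None, 'CS'], dtype=object)

# Manual version, as in the cell above: count() skips missing values.
manual_pct = toy_major.value_counts() * 100 / toy_major.count()

# Built-in version: normalize=True returns fractions of non-missing answers.
normalized_pct = toy_major.value_counts(normalize=True) * 100

print(manual_pct)
print(normalized_pct)
```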

### Employment

In [None]:
schema_raw.Employment

In [None]:
(survey_df.Employment.value_counts(normalize=True, ascending=True)*100).plot(kind='barh', color='g')
plt.title(schema_raw.Employment)
plt.xlabel('Percentage')