# Analyzing the development of Data Science in Europe

In the scope of this Udacity project, I would like to focus on the development of Data Science jobs in European countries. To this end, the annual Stack Overflow developer survey from 2011 to 2018 provides a wide range of data for tackling this topic and for finding interesting insights into the trend of Data Science in Europe.

This Jupyter Notebook contains the code and descriptions used to analyze the growth and development of the Data Science community in Europe in recent years. Do European countries go through the same development as American or Asian countries? I am also interested in how other Data Scientists rate their job and career satisfaction, and in their salaries. Which features are influential and decisive for these factors?

The structure of this file follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) process:

1. Data Understanding
First, we want to understand the provided data. Therefore, we import the Stack Overflow developer survey data and take a look at the information it contains.

2. Business Understanding
Second, based on the goals defined above, we want to formulate appropriate questions. These will help us approach the topic in a structured manner.

3. Data Preparation
In the third step, we will prepare the data to obtain suitable variables for visualizing the information and answering the formulated questions.

4. Modeling
In the fourth step, we apply Machine Learning tools and algorithms to create predictive models for our data. Here, we want to identify the elements that influence our target variables (satisfaction, salary).

5. Evaluation
In this step, we want to evaluate the developed model.

6. Deployment
The obtained insights will be presented and deployed in a blog post on Medium (see XXX).

## 1. Data Understanding
The data provided by the Stack Overflow developer survey can be downloaded at [https://insights.stackoverflow.com/survey](https://insights.stackoverflow.com/survey). There, the survey data from 2011 to 2018 is available as several zip files. From 2011 to 2016, these files contained only the data as CSV (Comma-Separated Values) files and, intermittently, a ReadMe file defining the variables. Since 2017, a PDF file with the questions asked and the possible responses, as well as another CSV file containing the schema of the data, has additionally been provided.

Beforehand, I downloaded these files and saved them into a 'data' folder, which holds all data CSV files, and a 'schema' folder, which holds the schema files as well as the PDF files. The files are named as follows:
- survey_data_201X.csv
- survey_schema_201X.csv

First, we have to import the required packages and the data for the analysis.

In [1]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The data is not consistent across all years of the survey. Each year has a different number of questions and different variables and, hence, a dataset with a different shape. To store everything in one variable, I decided to create a dictionary whose keys indicate the year and whose values hold Pandas DataFrames.

Since some error messages and warnings occurred, we have to set the encoding and low_memory parameters of the Pandas read_csv() function. Furthermore, the header layout differs between years and has to be specified in the parameters for each year.

In [39]:
# read Stack Overflow developer survey data from different years and store into a dictionary
data = {}
data['2011'] = pd.read_csv('./data/survey_data_2011.csv', encoding='latin_1', low_memory=False, header = [0,1])
data['2012'] = pd.read_csv('./data/survey_data_2012.csv', encoding='latin_1', low_memory=False, header = [0,1])
data['2013'] = pd.read_csv('./data/survey_data_2013.csv', encoding='latin_1', low_memory=False, header = [0,1])
data['2014'] = pd.read_csv('./data/survey_data_2014.csv', encoding='latin_1', low_memory=False, header = [0,1])
data['2015'] = pd.read_csv('./data/survey_data_2015.csv', encoding='latin_1', low_memory=False, skiprows=1)
data['2016'] = pd.read_csv('./data/survey_data_2016.csv', encoding='latin_1', low_memory=False)
data['2017'] = pd.read_csv('./data/survey_data_2017.csv', encoding='latin_1', low_memory=False)
data['2018'] = pd.read_csv('./data/survey_data_2018.csv', encoding='latin_1', low_memory=False)
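
The eight calls above differ only in the year and in the header options, so the same loading step can be sketched as a loop. This is a minimal sketch, not a replacement for the cell above: the names `YEAR_OPTIONS` and `load_surveys` are hypothetical, and it assumes the file naming `survey_data_201X.csv` described earlier.

```python
from pathlib import Path

import pandas as pd

# Per-year overrides for read_csv, mirroring the individual calls above
# (hypothetical helper table): 2011-2014 use two header rows,
# 2015 has one extra line before the header row.
YEAR_OPTIONS = {
    **{str(year): {"header": [0, 1]} for year in range(2011, 2015)},
    "2015": {"skiprows": 1},
}


def load_surveys(data_dir="./data", years=range(2011, 2019)):
    """Read one survey CSV per year into a dict keyed by the year as a string."""
    frames = {}
    for year in years:
        key = str(year)
        path = Path(data_dir) / f"survey_data_{key}.csv"
        frames[key] = pd.read_csv(
            path,
            encoding="latin_1",
            low_memory=False,
            **YEAR_OPTIONS.get(key, {}),  # no overrides for 2016-2018
        )
    return frames
```

With the files in place, `data = load_surveys()` would build the same kind of dictionary as the cell above.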

Let's take a look at the structure and shape of each dataframe.

In [40]:
print('year \t rows   \t columns')
print(33*'-')
for i in data.keys():
    print('{} \t {}   \t {}'.format(i,data[i].shape[0],data[i].shape[1]))

year 	 rows   	 columns
---------------------------------
2011 	 2813   	 65
2012 	 6243   	 75
2013 	 9742   	 128
2014 	 7643   	 120
2015 	 26086   	 222
2016 	 56030   	 66
2017 	 51392   	 154
2018 	 98855   	 129
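
The same overview can also be collected in a DataFrame, which is easier to sort or plot than printed text. The following is a small sketch; `summarize_shapes` is a hypothetical helper, and the toy frames only stand in for the real survey data.

```python
import pandas as pd


def summarize_shapes(frames):
    """Return one row per survey year with its number of rows and columns."""
    summary = pd.DataFrame.from_dict(
        {year: df.shape for year, df in frames.items()},
        orient="index",
        columns=["rows", "columns"],
    )
    summary.index.name = "year"
    return summary


# Toy stand-ins for the survey DataFrames (not real survey data)
toy_frames = {
    "2011": pd.DataFrame({"a": [1, 2, 3]}),
    "2018": pd.DataFrame({"a": [1], "b": [2]}),
}
print(summarize_shapes(toy_frames))
```

Applied to the dictionary from above, the call would be `summarize_shapes(data)`.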


As the table shows, the survey used a different number of questions each year. The number of participants also increased from 2,813 in 2011 to nearly 100,000 in 2018.
To compare the surveys of 2011 and 2018, we take a look at their columns and a data sample of each.

In [41]:
# data sample of 2011
data['2011'].head()

Unnamed: 0_level_0,What Country or Region do you live in?,Which US State or Territory do you live in?,How old are you?,How many years of IT/Programming experience do you have?,How would you best describe the industry you work in?,Which best describes the size of your company?,Which of the following best describes your occupation?,How likely is it that a recommendation you make will be acted upon?,What is your involvement in purchasing? You can choose more than 1.,Unnamed: 9_level_0,...,Unnamed: 55_level_0,Unnamed: 56_level_0,Unnamed: 57_level_0,Unnamed: 58_level_0,Unnamed: 59_level_0,Unnamed: 60_level_0,Unnamed: 61_level_0,Unnamed: 62_level_0,"In the last 12 months, how much money have you spent on personal technology-related purchases?",Which of our sites do you frequent most?
Unnamed: 0_level_1,Response,Response,Response,Response,Response,Response,Response,Response,Influencer,Recommender,...,AppleTV,iPad,Other netbook,PS3,Xbox,Wii,Other gaming system,other (please specify),Response,Response
0,Africa,,< 20,<2,Consulting,Start Up (1-25),Web Application Developer,Not in a million years,,,...,,,,,,,,,<$100,
1,Other Europe,,25-29,41310,Software Products,Mature Small Business (25-100),Server Programmer,It's been known to happen,,,...,,,Other netbook,,,,,,$251-$500,Stack Overflow
2,India,,25-29,41435,Software Products,Mid Sized (100-999),Server Programmer,Unless it's stoopid it gets done,,,...,,,,,,,,,,
3,Germany,,< 20,41310,Foundation / Non-Profit,Student,Student,It's been known to happen,,,...,,,,,,Wii,Other gaming system,,"$501-$1,000",Stack Overflow
4,Other Asia,,35-39,11,Software Products,Start Up (1-25),"Executive (VP of Eng, CTO, CIO, etc.)",I run this place,Influencer,,...,,,,,Xbox,,,,$251-$500,Stack Overflow


In [42]:
# data sample of 2018
data['2018'].head()

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",Database administrator;DevOps specialist;Full-...,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
2,4,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,Engineering manager;Full-stack developer,...,,,,,,,,,,
3,5,No,No,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",100 to 499 employees,Full-stack developer,...,I don't typically exercise,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,35 - 44 years old,No,No,The survey was an appropriate length,Somewhat easy
4,7,Yes,No,South Africa,"Yes, part-time",Employed full-time,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","10,000 or more employees",Data or business analyst;Desktop or enterprise...,...,3 - 4 times per week,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,18 - 24 years old,Yes,,The survey was an appropriate length,Somewhat easy


In [44]:
# cols of 2011
data['2011'].columns

MultiIndex(levels=[['How likely is it that a recommendation you make will be acted upon?', 'How many years of IT/Programming experience do you have?', 'How old are you?', 'How would you best describe the industry you work in?', 'In the last 12 months, how much money have you spent on personal technology-related purchases? ', 'Including bonus, what is your annual compensation in USD?', 'Please rate your job/career satisfaction', 'Unnamed: 10_level_0', 'Unnamed: 11_level_0', 'Unnamed: 12_level_0', 'Unnamed: 13_level_0', 'Unnamed: 14_level_0', 'Unnamed: 16_level_0', 'Unnamed: 17_level_0', 'Unnamed: 18_level_0', 'Unnamed: 19_level_0', 'Unnamed: 20_level_0', 'Unnamed: 22_level_0', 'Unnamed: 23_level_0', 'Unnamed: 24_level_0', 'Unnamed: 25_level_0', 'Unnamed: 26_level_0', 'Unnamed: 27_level_0', 'Unnamed: 28_level_0', 'Unnamed: 31_level_0', 'Unnamed: 32_level_0', 'Unnamed: 33_level_0', 'Unnamed: 34_level_0', 'Unnamed: 35_level_0', 'Unnamed: 36_level_0', 'Unnamed: 37_level_0', 'Unnamed: 38_lev
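
The two header rows of the early surveys produce a MultiIndex in which multiple-choice questions spread over several columns, and only the first of these columns carries the question text. One way to make such columns easier to handle is to flatten them into single strings. The sketch below rests on two assumptions drawn from the 2011 sample: a top-level label starting with 'Unnamed:' continues the question to its left, and a second-level label of 'Response' adds no information. `flatten_two_row_header` is a hypothetical helper.

```python
import pandas as pd


def flatten_two_row_header(df, sep=" | "):
    """Collapse a two-level column header into single strings.

    Assumption: a top-level label starting with 'Unnamed:' continues the
    question to its left, and a second-level 'Response' label is dropped.
    """
    flat, question = [], ""
    for top, sub in df.columns:
        top, sub = str(top).strip(), str(sub).strip()
        if not top.startswith("Unnamed:"):
            question = top  # a new question starts in this column
        if sub == "Response" or sub.startswith("Unnamed:"):
            flat.append(question)
        else:
            flat.append(f"{question}{sep}{sub}")
    out = df.copy()
    out.columns = flat
    return out


# Toy header shaped like the 2011 sample (not real survey data)
toy_columns = pd.MultiIndex.from_tuples([
    ("How old are you?", "Response"),
    ("What is your involvement in purchasing?", "Influencer"),
    ("Unnamed: 2_level_0", "Recommender"),
])
toy = pd.DataFrame([["25-29", "Influencer", None]], columns=toy_columns)
print(list(flatten_two_row_header(toy).columns))
```

The original DataFrame is left unchanged; the function returns a copy with the flattened labels.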

In [45]:
# cols of 2018
data['2018'].columns

Index(['Respondent', 'Hobby', 'OpenSource', 'Country', 'Student', 'Employment',
       'FormalEducation', 'UndergradMajor', 'CompanySize', 'DevType',
       ...
       'Exercise', 'Gender', 'SexualOrientation', 'EducationParents',
       'RaceEthnicity', 'Age', 'Dependents', 'MilitaryUS', 'SurveyTooLong',
       'SurveyEasy'],
      dtype='object', length=129)
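
Because the column labels change from full question texts in 2011 to short identifiers in 2018, it can help to search all years for a keyword before deciding which variables to compare. The following sketch uses a hypothetical helper, `find_columns`, and toy frames shaped like the two samples above.

```python
import pandas as pd


def find_columns(frames, keyword):
    """Map each year to the column labels whose text contains `keyword`."""
    keyword = keyword.lower()
    matches = {}
    for year, df in frames.items():
        hits = []
        for label in df.columns:
            # MultiIndex labels are tuples; join their levels into one string
            text = " ".join(map(str, label)) if isinstance(label, tuple) else str(label)
            if keyword in text.lower():
                hits.append(label)
        matches[year] = hits
    return matches


# Toy frames mimicking the 2011 and 2018 layouts (not real survey data)
toy_2011 = pd.DataFrame(
    [["Germany", "25-29"]],
    columns=pd.MultiIndex.from_tuples([
        ("What Country or Region do you live in?", "Response"),
        ("How old are you?", "Response"),
    ]),
)
toy_2018 = pd.DataFrame([[1, "Kenya", "25 - 34 years old"]],
                        columns=["Respondent", "Country", "Age"])
print(find_columns({"2011": toy_2011, "2018": toy_2018}, "country"))
```

On the real data, `find_columns(data, 'country')` or `find_columns(data, 'satisfaction')` would list the candidate columns per year.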