This notebook extracts the data needed for salary prediction from the 2013 Stack Overflow Annual Developer Survey.

https://insights.stackoverflow.com/survey 

In [3]:
import pandas as pd
import numpy as np

NOTE: Make sure the **Data/source/** folder contains the survey data, downloaded with the **Scripts/download_survey_data.py --all** script.

### 2013 Stack Overflow User Survey

**Comments after analysis**
Comments for 2013 dataset:
1.	There are two questions about the size of the company; both contain a lot of missing and messy data (many responses appear as dates, e.g. `1/5/2013`):
    * "How many people work for your company?"
    * "Including yourself, how many developers are employed at your company?"
2.	There are no relevant questions about current project details, unlike in other datasets, e.g. 2012.
3.	The question on languages is not about proficiency but about ‘languages or technologies used significantly in the past year’.
4.	Extra questions that weren’t found in the survey the year before:
    *	What other departments / roles do you interact with regularly?
    *	If your company has a native mobile app, what platforms do you support?
    *	If you make a software product, how does your company make money? (You can choose more than one)
    *	In an average week, how do you spend your time?
    *	What is your involvement in purchasing products or services for the company you work for? (You can choose more than one)
    *	What types of purchases are you involved in?
    *	What is your budget for outside expenditures (hardware, software, consulting, etc) for 2013?
    *	Which technologies are you excited about?
    *	Please rate how important each of the following characteristics of a company/job offer are to you. 
    *	Have you changed jobs in the last 12 months?
    *	Please rate the advertising you've seen on Stack Overflow
    *	A range of questions about Stack Overflow: how it is used and which advertising was seen
    *	In the last 12 months, how much money have you spent on personal technology-related purchases?
5.	% of missing data:

| Variable | % missing |
| --- | --- |
| country | 0.010265 |
| age | 3.161568 |
| IT_experience_in_years | 3.141039 |
| industry | 3.141039 |
| company_size | 15.643605 |
| company_techdep_size | 19.595566 |
| occupation | 15.643605 |
| proficient_languages | 17.604188 |
| desktop_OS | 17.121741 |
| job_satisfaction | 21.699856 |
| compensation | 27.550811 |
| stackoverflow_reputation | 23.711763 |
| product_technology | 29.952782 |

In [4]:
survey_res = pd.read_csv("../../Data/source/2013.csv", sep=",", encoding='Latin-1')
survey_res.columns
# survey_res


Index(['What Country or Region do you live in?',
       'Which US State or Territory do you live in?', 'How old are you?',
       'How many years of IT/Programming experience do you have?',
       'How would you best describe the industry you currently work in?',
       'How many people work for your company?',
       'Which of the following best describes your occupation?',
       'Including yourself, how many developers are employed at your company?',
       'How large is the team that you work on?',
       'What other departments / roles do you interact with regularly?',
       ...
       'Unnamed: 118', 'Unnamed: 119', 'Unnamed: 120', 'Unnamed: 121',
       'What advertisers do you remember seeing on Stack Overflow?',
       'What is your current Stack Overflow reputation?',
       'How do you use Stack Overflow?', 'Unnamed: 125', 'Unnamed: 126',
       'Unnamed: 127'],
      dtype='object', length=128)

In [5]:
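# The first row is not a survey response (it is dropped in the next cell), so it is excluded from the count
print(f"Number of responses to the survey in 2013: {survey_res.shape[0]-1}")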

Number of responses to the survey in 2013: 9742


In [6]:
survey_res = survey_res.drop(0)
survey_res.head()

Unnamed: 0,What Country or Region do you live in?,Which US State or Territory do you live in?,How old are you?,How many years of IT/Programming experience do you have?,How would you best describe the industry you currently work in?,How many people work for your company?,Which of the following best describes your occupation?,"Including yourself, how many developers are employed at your company?",How large is the team that you work on?,What other departments / roles do you interact with regularly?,...,Unnamed: 118,Unnamed: 119,Unnamed: 120,Unnamed: 121,What advertisers do you remember seeing on Stack Overflow?,What is your current Stack Overflow reputation?,How do you use Stack Overflow?,Unnamed: 125,Unnamed: 126,Unnamed: 127
1,United Kingdom,,35-39,6/10/2013,Finance / Banking,101-999,Enterprise Level Services,100,4/8/2013,System Administrators,...,Neutral,Neutral,Neutral,Neutral,,Don't have an account,Read other people's questions to solve my prob...,,,
2,United States of America,Oregon,25-29,6/10/2013,Retail,101-999,Back-End Web Developer,6/15/2013,4/8/2013,System Administrators,...,Neutral,Agree,Disagree,Neutral,"StackOverflow themselves, Careers 2.0 (SO also...",1,Read other people's questions to solve my prob...,Ask questions to solve problems,Answer questions I know the answer to,
3,United States of America,Wisconsin,51-60,11,Software Products,26-100,Enterprise Level Services,6/15/2013,Just me!,System Administrators,...,Neutral,Strongly Disagree,Strongly Disagree,Strongly Disagree,don't recall seeing ads on Stack Overflow,Don't have an account,Read other people's questions to solve my prob...,,,
4,Germany,,,,,,,,,,...,,,,,,,,,,
5,United States of America,Idaho,35-39,11,Consulting,,,,,,...,,,,,,,,,,


In [7]:
# Extract the Country 

survey_res_flt = pd.DataFrame()
survey_res_flt["country"] = survey_res["What Country or Region do you live in?"]

# Extract Age
survey_res_flt["age"] = survey_res["How old are you?"]

# Extract years of IT/programming experience
survey_res_flt["IT_experience_in_years"] = survey_res["How many years of IT/Programming experience do you have?"]

# Extract industry 
survey_res_flt["industry"] = survey_res["How would you best describe the industry you currently work in?"]

# Extract company size
# "Which best describes the size of your company?" does not exist in this dataset; the closest question is used instead
survey_res_flt["company_size"] = survey_res["How many people work for your company?"]
# A related column gives the number of developers employed at the company
survey_res_flt["company_techdep_size"] = survey_res["Including yourself, how many developers are employed at your company?"]
# Both columns are messy: many values are dates or NaN

# Extract occupation 
survey_res_flt["occupation"] = survey_res["Which of the following best describes your occupation?"]

# Current project details: this survey has no relevant question
# ("What type of project are you developing?" does not exist in this dataset)


In [8]:
survey_res_flt['company_techdep_size'].value_counts()

1/5/2013     3021
100          1621
6/15/2013    1408
16-30         745
50-100        546
31-50         492
Name: company_techdep_size, dtype: int64
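
Values such as `1/5/2013` and `6/15/2013` look like size ranges (`1-5`, `6-15`) that a spreadsheet converted to dates before the CSV was published. That origin is an assumption, not something the dataset documents. Under that assumption, a minimal sketch of restoring the ranges could look like this (`undo_date_mangling` is a hypothetical helper, not part of the notebook):

```python
import re
import pandas as pd

# Matches strings shaped like "M/D/2013" (assumed to be mangled "M-D" ranges).
_DATE_LIKE = re.compile(r"^(\d{1,2})/(\d{1,2})/2013$")

def undo_date_mangling(value):
    """Return 'M-D' for strings shaped like 'M/D/2013'; leave anything else unchanged."""
    if isinstance(value, str):
        match = _DATE_LIKE.match(value.strip())
        if match:
            return f"{match.group(1)}-{match.group(2)}"
    return value

# Small illustrative sample, using values seen in the output above.
sample = pd.Series(["1/5/2013", "100", "6/15/2013", "16-30", None])
restored = sample.map(undo_date_mangling)
print(restored.tolist())
```

The same helper could be applied to `IT_experience_in_years`, where `6/10/2013` plausibly stands for `6-10`, but the mapping should be checked against the original questionnaire before it is relied on.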

In [11]:
# Extract programming languages
# This dataset has no "Which languages are you proficient in?" question. The closest one is
# "Which of the following languages or technologies have you used significantly in the past year?"
language_col_start = survey_res.columns.get_loc("Which of the following languages or technologies have you used significantly in the past year?")

survey_res_flt["proficient_languages"] = (
    survey_res
    .iloc[:, language_col_start+1:language_col_start+15]
    .apply(lambda x: ",".join(x.astype(str)), axis = 1)
)

survey_res_flt["proficient_languages"] = [[pl for pl in pl_list.split(',') if pl != 'nan'] for pl_list in survey_res_flt['proficient_languages'].values.tolist()]
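
The join-then-split approach above would split any answer that itself contains a comma. As a hedged alternative, the selected options could be collected into lists directly, skipping missing values. `collect_selected` and the `demo` frame are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd

def collect_selected(frame, start, stop):
    """Return one list per row with the non-missing values of columns start..stop-1."""
    block = frame.iloc[:, start:stop]
    return [
        [value for value in row if pd.notna(value)]
        for row in block.itertuples(index=False, name=None)
    ]

# Tiny stand-in for the survey layout: one question column followed by option columns.
demo = pd.DataFrame({
    "Which languages?": ["C", np.nan, np.nan],
    "Unnamed: 1": ["Java", "C#", np.nan],
    "Unnamed: 2": ["SQL", "MySql, VbScript", np.nan],
})
start = demo.columns.get_loc("Unnamed: 1")
demo_languages = collect_selected(demo, start, start + 2)
print(demo_languages)
```

Here the value `"MySql, VbScript"` stays a single entry, and rows without any answer yield an empty list, matching the convention used in this notebook.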

In [12]:
# Extract the desktop operating system used 
survey_res_flt["desktop_OS"] = survey_res["Which desktop operating system do you use the most?"]

# Extract job satisfaction 
survey_res_flt["job_satisfaction"] = survey_res["What best describes your career / job satisfaction?"]

# Extract compensation
survey_res_flt["compensation"] = survey_res["Including bonus, what is your annual compensation in USD?"]
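
The compensation answers are text ranges such as `$80,000 - $100,000`. For later modelling it may help to turn them into numeric bounds. The sketch below is one possible approach, not the notebook's method; `parse_compensation_range` is a hypothetical helper, and it returns `None` for anything that is not a two-sided range:

```python
import re

def parse_compensation_range(text):
    """Return (low, high) in USD for 'low - high' strings, otherwise None."""
    if not isinstance(text, str):
        return None  # NaN or other non-text values
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$(\d[\d,]*)", text)]
    return tuple(amounts) if len(amounts) == 2 else None

print(parse_compensation_range("$80,000 - $100,000"))
print(parse_compensation_range(float("nan")))
```

Open-ended or non-numeric answers would need their own handling, which is deliberately left out here.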

In [13]:
# Extract Product technology
product_tech_col_start = survey_res.columns.get_loc("Which technology products do you own? (You can choose more than one)")

survey_res_flt["product_technology"] = (
    survey_res
    .iloc[:, product_tech_col_start+1:product_tech_col_start+14]
    .apply(lambda x: ",".join(x.astype(str)), axis = 1)
)

survey_res_flt["product_technology"] = [[pt for pt in pt_list.split(',') if pt != 'nan'] for pt_list in survey_res_flt['product_technology'].values.tolist()]

In [14]:
# Extract stackoverflow reputation 
survey_res_flt["stackoverflow_reputation"] = survey_res["What is your current Stack Overflow reputation?"]

In [15]:
survey_res_flt.head()

Unnamed: 0,country,age,IT_experience_in_years,industry,company_size,company_techdep_size,occupation,proficient_languages,desktop_OS,job_satisfaction,compensation,product_technology,stackoverflow_reputation
1,United Kingdom,35-39,6/10/2013,Finance / Banking,101-999,100,Enterprise Level Services,"[Java, SQL]",Windows 7,It's a paycheck,"$80,000 - $100,000",[],Don't have an account
2,United States of America,25-29,6/10/2013,Retail,101-999,6/15/2013,Back-End Web Developer,"[C#, JavaScript, jQuery, PHP, MySql / VbScript...",Windows 7,It's a paycheck,"$20,000 - $40,000","[Android, Xbox, PSP (Playstation Portable)]",1
3,United States of America,51-60,11,Software Products,26-100,6/15/2013,Enterprise Level Services,"[C#, JavaScript, jQuery, SQL, PL/SQL, XSLT, ...",Windows 7,I'm not happy in my job,"$120,000 - $140,000","[Android, Kindle Fire, Kindle]",Don't have an account
4,Germany,,,,,,,[],,,,[],
5,United States of America,35-39,11,Consulting,,,,[],,,,[],


In [17]:
# Percentage of missing values per column
# ('proficient_languages' and 'product_technology' show 0 because they hold lists, and an empty list is not NaN)
survey_res_flt.isna().sum()/len(survey_res_flt)*100

country                      0.010265
age                          3.161568
IT_experience_in_years       3.141039
industry                     3.141039
company_size                15.643605
company_techdep_size        19.595566
occupation                  15.643605
proficient_languages         0.000000
desktop_OS                  17.121741
job_satisfaction            21.699856
compensation                27.550811
product_technology           0.000000
stackoverflow_reputation    23.711763
dtype: float64

In [18]:
#% of empty entries for 'proficient_languages' column
(len(survey_res_flt)-len(survey_res_flt[survey_res_flt['proficient_languages'].astype(bool)]))/len(survey_res_flt)*100

17.604188051734756

In [20]:
#% of empty entries for 'product_technology' column
(len(survey_res_flt)-len(survey_res_flt[survey_res_flt['product_technology'].astype(bool)]))/len(survey_res_flt)*100

29.952781769657154
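
The two cells above compute the same quantity for different columns. As a sketch, that calculation could be wrapped in a small function; `pct_empty_lists` is a hypothetical name, and the `demo` series stands in for a list-valued column:

```python
import pandas as pd

def pct_empty_lists(column):
    """Percentage of rows in a list-valued column whose list is empty."""
    return column.map(len).eq(0).mean() * 100

demo = pd.Series([["Java", "SQL"], [], ["C#"], []])
print(pct_empty_lists(demo))
```

It could then be called as `pct_empty_lists(survey_res_flt["proficient_languages"])` and `pct_empty_lists(survey_res_flt["product_technology"])`.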

In [21]:
survey_res_flt.to_csv("../../Data/filtered/2013.csv", index=False)
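
Note that `to_csv` writes list-valued cells as their string representation (for example `"['Java', 'SQL']"`), so reading the file back yields strings rather than lists. A minimal sketch of restoring the lists on load uses `ast.literal_eval` as a converter. The in-memory buffer and the `demo` frame below are stand-ins for the real file:

```python
import ast
import io
import pandas as pd

demo = pd.DataFrame({
    "country": ["United Kingdom", "Germany"],
    "proficient_languages": [["Java", "SQL"], []],
})

# Round-trip through CSV in memory instead of touching the real file.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the stringified lists back into Python lists while reading.
restored = pd.read_csv(buffer, converters={"proficient_languages": ast.literal_eval})
print(restored["proficient_languages"].tolist())
```

For the filtered file the same `converters` argument would apply to both `proficient_languages` and `product_technology`.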