# Stack Overflow Developer Survey Processing

# Contents

* [Configure This Notebook](#configure-this-notebook)
* [Load Data](#load-data)
* [Select Relevant Features](#select-relevant-features)
* [Data Cleaning](#data-cleaning)
* [Save Cleaned Data](#save-cleaned-data)

# Configure This Notebook

This notebook reads the 2024 developer survey, cleans and processes the data for analysis, and saves the cleaned version to a new location. All configurable settings for the notebook are defined here.

In [1]:
# Relative file path to the 2024 developer survey results
SO_SURVEY_RAW = 'data/raw/survey_results_public.csv'

# Relative file path to store the cleaned results for future analysis
SO_SURVEY_CLEAN = 'data/clean/survey_results.csv'

# Load Data

Every year, Stack Overflow releases the data from its annual developer survey. Our analysis examines the current state of the industry, so we use only the most recent survey, from 2024.

In [2]:
import pandas as pd
import numpy as np

In [3]:
survey_results = pd.read_csv(SO_SURVEY_RAW, index_col='ResponseId')
survey_results.head()

ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,TechDoc,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,,...,,,,,,,,,,
2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,...,,,,,,,Appropriate in length,Easy,,
4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,,...,,,,,,,Too long,Easy,,
5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,API document(s) and/or SDK document(s);User gu...,...,,,,,,,Too short,Easy,,


# Select Relevant Features

Our intended analysis focuses on developer skills and experience levels, so we keep only the features that directly support it.

In [4]:
columns = ['MainBranch', 'Age', 'Employment', 'CodingActivities',
    'EdLevel', 'LearnCode', 'LearnCodeOnline', 'YearsCode', 'YearsCodePro',
    'DevType', 'Country', 'Currency', 'CompTotal', 'LanguageHaveWorkedWith',
    'DatabaseHaveWorkedWith', 'PlatformHaveWorkedWith',
    'WebframeHaveWorkedWith', 'WorkExp', 'Industry',
    'ConvertedCompYearly']

# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
survey_results = survey_results[columns].copy()
survey_results.head()

ResponseId,MainBranch,Age,Employment,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,YearsCode,YearsCodePro,DevType,Country,Currency,CompTotal,LanguageHaveWorkedWith,DatabaseHaveWorkedWith,PlatformHaveWorkedWith,WebframeHaveWorkedWith,WorkExp,Industry,ConvertedCompYearly
1,I am a developer by profession,Under 18 years old,"Employed, full-time",Hobby,Primary/elementary school,Books / Physical media,,,,,United States of America,,,,,,,,,
2,I am a developer by profession,35-44 years old,"Employed, full-time",Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,20.0,17.0,"Developer, full-stack",United Kingdom of Great Britain and Northern I...,,,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Dynamodb;MongoDB;PostgreSQL,Amazon Web Services (AWS);Heroku;Netlify,Express;Next.js;Node.js;React,17.0,,
3,I am a developer by profession,45-54 years old,"Employed, full-time",Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,37.0,27.0,Developer Experience,United Kingdom of Great Britain and Northern I...,,,C#,Firebase Realtime Database,Google Cloud,ASP.NET CORE,,,
4,I am learning to code,18-24 years old,"Student, full-time",,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,4.0,,"Developer, full-stack",Canada,,,C;C++;HTML/CSS;Java;JavaScript;PHP;PowerShell;...,MongoDB;MySQL;PostgreSQL;SQLite,Amazon Web Services (AWS);Fly.io;Heroku,jQuery;Next.js;Node.js;React;WordPress,,,
5,I am a developer by profession,18-24 years old,"Student, full-time",,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,9.0,,"Developer, full-stack",Norway,,,C++;HTML/CSS;JavaScript;Lua;Python;Rust,PostgreSQL;SQLite,,,,,
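
As an alternative sketch, the same selection can be made at load time with the `usecols` parameter of `pd.read_csv`, which avoids reading unneeded columns into memory. The example below runs on a small in-memory CSV rather than the real survey file; the toy rows and the names `raw_csv`, `wanted`, and `subset` are illustrative assumptions, not part of this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw survey file (toy rows, not real survey data)
raw_csv = io.StringIO(
    "ResponseId,MainBranch,Age,Check,Country\n"
    "1,I am a developer by profession,35-44 years old,Apples,Canada\n"
    "2,I am learning to code,18-24 years old,Apples,Norway\n"
)

# Columns we want to keep; the index column must also be listed in usecols
wanted = ["MainBranch", "Age", "Country"]

subset = pd.read_csv(
    raw_csv,
    index_col="ResponseId",
    usecols=["ResponseId"] + wanted,
)

print(subset.columns.tolist())
```

With `usecols`, the resulting columns follow the order in the file, not the order of the list passed in, so an explicit `subset[wanted]` may still be needed if column order matters.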


# Data Cleaning

The steps needed to clean the survey data are as follows:
1. Convert multi-select features into lists

__1. Convert multi-select features into lists__

Many of the survey questions are multi-select, allowing one respondent to submit multiple answers to the same question. In the raw data, these answers are stored as a single semicolon-delimited string. We will convert each such string to a Python list for easier analysis later; unanswered questions remain missing (NaN).

In [5]:
expandable_columns = ['Employment', 'CodingActivities', 'LearnCode', 'LearnCodeOnline',
                      'LanguageHaveWorkedWith', 'DatabaseHaveWorkedWith', 'PlatformHaveWorkedWith',
                      'WebframeHaveWorkedWith']

for col_name in expandable_columns:
    survey_results[col_name] = survey_results[col_name].str.split(';')

survey_results.head()

ResponseId,MainBranch,Age,Employment,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,YearsCode,YearsCodePro,DevType,Country,Currency,CompTotal,LanguageHaveWorkedWith,DatabaseHaveWorkedWith,PlatformHaveWorkedWith,WebframeHaveWorkedWith,WorkExp,Industry,ConvertedCompYearly
1,I am a developer by profession,Under 18 years old,"[Employed, full-time]",[Hobby],Primary/elementary school,[Books / Physical media],,,,,United States of America,,,,,,,,,
2,I am a developer by profession,35-44 years old,"[Employed, full-time]","[Hobby, Contribute to open-source projects, Ot...","Bachelor’s degree (B.A., B.S., B.Eng., etc.)","[Books / Physical media, Colleague, On the job...","[Technical documentation, Blogs, Books, Writte...",20.0,17.0,"Developer, full-stack",United Kingdom of Great Britain and Northern I...,,,"[Bash/Shell (all shells), Go, HTML/CSS, Java, ...","[Dynamodb, MongoDB, PostgreSQL]","[Amazon Web Services (AWS), Heroku, Netlify]","[Express, Next.js, Node.js, React]",17.0,,
3,I am a developer by profession,45-54 years old,"[Employed, full-time]","[Hobby, Contribute to open-source projects, Ot...","Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","[Books / Physical media, Colleague, On the job...","[Technical documentation, Blogs, Books, Writte...",37.0,27.0,Developer Experience,United Kingdom of Great Britain and Northern I...,,,[C#],[Firebase Realtime Database],[Google Cloud],[ASP.NET CORE],,,
4,I am learning to code,18-24 years old,"[Student, full-time]",,Some college/university study without earning ...,"[Other online resources (e.g., videos, blogs, ...","[Stack Overflow, How-to videos, Interactive tu...",4.0,,"Developer, full-stack",Canada,,,"[C, C++, HTML/CSS, Java, JavaScript, PHP, Powe...","[MongoDB, MySQL, PostgreSQL, SQLite]","[Amazon Web Services (AWS), Fly.io, Heroku]","[jQuery, Next.js, Node.js, React, WordPress]",,,
5,I am a developer by profession,18-24 years old,"[Student, full-time]",,"Secondary school (e.g. American high school, G...","[Other online resources (e.g., videos, blogs, ...","[Technical documentation, Blogs, Written Tutor...",9.0,,"Developer, full-stack",Norway,,,"[C++, HTML/CSS, JavaScript, Lua, Python, Rust]","[PostgreSQL, SQLite]",,,,,
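
To illustrate why list-valued columns are convenient, here is a minimal sketch of one possible downstream use: counting how often each answer appears. It uses a toy frame built the same way as above (a semicolon-delimited string split into lists); the toy values and the names `toy`, `exploded`, and `language_counts` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the survey: one multi-select column with a missing answer
toy = pd.DataFrame(
    {"LanguageHaveWorkedWith": ["Python;Rust", "Python", None]},
    index=pd.Index([1, 2, 3], name="ResponseId"),
)

# Same conversion as in the notebook: split the delimited string into a list
toy["LanguageHaveWorkedWith"] = toy["LanguageHaveWorkedWith"].str.split(";")

# explode() gives each list element its own row; the missing answer stays missing
exploded = toy["LanguageHaveWorkedWith"].explode()

# value_counts() drops missing values by default, so only real answers are counted
language_counts = exploded.value_counts()
print(language_counts)
```

Because `explode()` repeats the `ResponseId` index for each element, the exploded series can still be joined back to the other features of the same respondent.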


# Save Cleaned Data

Now that the data is clean, we save it to the configured file path.

In [6]:
survey_results.to_csv(SO_SURVEY_CLEAN)
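
One caveat when reading the file back: CSV has no list type, so list columns are written as their string representation (for example `"['Python', 'Rust']"`) and come back as plain strings. The sketch below shows one possible way to restore them, using the `converters` parameter of `pd.read_csv` with `ast.literal_eval`. It round-trips a toy frame through an in-memory buffer instead of the real file; the helper `parse_list` and the toy values are hypothetical.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the cleaned survey, with one list column and a missing answer
toy = pd.DataFrame(
    {"LanguageHaveWorkedWith": [["Python", "Rust"], None, ["C#"]]},
    index=pd.Index([1, 2, 3], name="ResponseId"),
)

# Write to an in-memory buffer instead of SO_SURVEY_CLEAN
buffer = io.StringIO()
toy.to_csv(buffer)

# A plain read returns the list column as strings
buffer.seek(0)
naive = pd.read_csv(buffer, index_col="ResponseId")


def parse_list(cell):
    """Hypothetical helper: turn "['a', 'b']" back into a list; empty cells become None."""
    return ast.literal_eval(cell) if cell else None


# Reading with a converter restores real Python lists
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    index_col="ResponseId",
    converters={"LanguageHaveWorkedWith": parse_list},
)

print(type(naive.loc[1, "LanguageHaveWorkedWith"]))
print(type(restored.loc[1, "LanguageHaveWorkedWith"]))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose. If preserving types matters more than CSV compatibility, a format such as Parquet or pickle would be another option.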