# COGS 108 - Data Checkpoint

# Names

- Steven Dominic Sahar
- Mengyi Chen
- Pranav Rebala
- Sikai Liang
- Jiaheng Dai

# Research Question

What is the correlation between the number of programming languages a software engineer knows and the median salary of software engineers in the United States?

## Background and Prior Work

The COVID-19 pandemic brought significant upheaval to many sectors in the United States, notably hospitality, travel, and retail. These sectors faced considerable challenges, including job losses and reduced salaries [1]. In contrast, the digital transformation accelerated by the pandemic led to a surge in demand for software development professionals. This shift is driven by companies investing heavily in technology to adapt to new consumer behaviors, such as increased online shopping and remote working [2]. According to CompTIA's February 2022 Tech Jobs Report, the average annual salary for an IT professional in the United States was $110,765 [3]. This has prompted many job seekers and university students to pivot toward, or choose majors focused on, software and programming, aiming for stable employment and income.

However, a gap exists in the understanding of how proficiency in multiple programming languages influences employment opportunities and salary. Existing studies have largely focused on the demand for individual popular languages based on company needs [4], and they often overlook the combined value of knowing multiple languages.

Our research aims to bridge this gap by analyzing how the number of programming languages one is proficient in correlates with salary in the software engineering field. The results are expected to be valuable for individuals seeking to improve their career prospects in this competitive job market.

1. Ansell, Ryan (June 2021). COVID-19 ends longest employment recovery and expansion in CES history, causing unprecedented job losses in 2020.
https://www.bls.gov/opub/mlr/2021/article/covid-19-ends-longest-employment-expansion-in-ces-history.htm <br>
2. Hughes, Owen (March 2022). Developer jobs and programming languages: What's hot and what's next.
https://www.zdnet.com/article/developer-jobs-and-programming-languages-whats-hot-and-whats-next/ <br>
3. Matzelle, Emily (April 2022). IT Salaries: Where the Money’s At.
https://www.comptia.org/blog/it-salaries <br>
4. 4dayweek.io (April 2023). What are the Highest Paying Programming Languages in 2023?
https://4dayweek.io/salary/highest-paying-programming-languages <br>



# Hypothesis


We hypothesize that there is a significant positive correlation between the number of programming languages a software engineer is proficient in and their median salary in the United States, with a notably steep increase in salary as the number of languages increases up to five. Beyond this threshold, we anticipate that the rate of salary increase will gradually diminish.
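
One way this hypothesis could be examined is sketched below. It assumes a cleaned table with hypothetical `Salary` and `Total_Languages` columns, and the toy values are made up for illustration (they are not survey results). The sketch computes the median salary at each language count, a Spearman rank correlation (computed as a Pearson correlation on ranks, so no linear relationship is assumed), and the change in median salary from one count to the next.

```python
import pandas as pd

# Toy data for illustration only; these are NOT survey results.
toy = pd.DataFrame({
    "Total_Languages": [1, 1, 2, 2, 3, 3, 4, 5, 6, 7],
    "Salary": [60000, 70000, 80000, 85000, 95000,
               100000, 110000, 120000, 122000, 123000],
})

# Median salary at each language count.
median_by_count = toy.groupby("Total_Languages")["Salary"].median()

# Spearman rank correlation, computed as Pearson correlation on the ranks.
spearman_rho = toy["Total_Languages"].rank().corr(toy["Salary"].rank())

# Change in median salary from one language count to the next.
# Diminishing returns would show up as smaller increments at higher counts.
increments = median_by_count.diff().dropna()

print(median_by_count)
print(round(spearman_rho, 3))
print(increments)
```

On real data, a positive `spearman_rho` together with shrinking values in `increments` past five languages would be consistent with the hypothesis. A formal test would still be needed to support the claim.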

# Data

## Data overview

- Dataset #1
  - Dataset Name: 2023 Stack Overflow Developer Survey
  - Link to the dataset: https://insights.stackoverflow.com/survey
  - Number of observations: 89184
  - Number of variables: 79


The dataset consists of survey results from roughly 89,000 developers in the year 2023. The independent variable is the number of programming languages a developer is proficient in (which can be derived from the column “LanguageHaveWorkedWith”), and the dependent variable is the annual salary (from “ConvertedCompYearly”). The independent variable is originally stored as a semicolon-separated list of languages, which we convert to numeric values through one-hot encoding. The dependent variable is already numeric. We need to remove irrelevant columns, filter out entries with missing values in these variables, and keep only professional developers based on the column “MainBranch”.


## Dataset

In [1]:
# Loading the survey results file

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", 15)

survey = pd.read_csv('survey_results_public.csv')
survey.head()

|   | ResponseId | Q120 | MainBranch | Age | Employment | RemoteWork | CodingActivities | ... | TimeSearching | TimeAnswering | ProfessionalTech | Industry | SurveyLength | SurveyEase | ConvertedCompYearly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I agree | None of these | 18-24 years old |  |  |  | ... |  |  |  |  |  |  |  |
| 1 | 2 | I agree | I am a developer by profession | 25-34 years old | Employed, full-time | Remote | Hobby;Contribute to open-source projects;Boots... | ... | 15-30 minutes a day | 15-30 minutes a day | DevOps function;Microservices;Automated testin... | Information Services, IT, Software Development... | Appropriate in length | Easy | 285000.0 |
| 2 | 3 | I agree | I am a developer by profession | 45-54 years old | Employed, full-time | Hybrid (some remote, some in-person) | Hobby;Professional development or self-paced l... | ... | 30-60 minutes a day | 30-60 minutes a day | DevOps function;Microservices;Automated testin... | Information Services, IT, Software Development... | Appropriate in length | Easy | 250000.0 |
| 3 | 4 | I agree | I am a developer by profession | 25-34 years old | Employed, full-time | Hybrid (some remote, some in-person) | Hobby | ... | 15-30 minutes a day | 30-60 minutes a day | Automated testing;Continuous integration (CI) ... |  | Appropriate in length | Easy | 156000.0 |
| 4 | 5 | I agree | I am a developer by profession | 25-34 years old | Employed, full-time;Independent contractor, fr... | Remote | Hobby;Contribute to open-source projects;Profe... | ... | 60-120 minutes a day | 30-60 minutes a day | Microservices;Automated testing;Observability ... | Other | Appropriate in length | Neither easy nor difficult | 23456.0 |


In [2]:
# Removing unnecessary columns and only keeping the salary and programming languages known
# Removing missing data from our columns

filtered_survey = survey[survey['MainBranch']=='I am a developer by profession']
data = filtered_survey[['ConvertedCompYearly', 'LanguageHaveWorkedWith']]
data = data.rename({'ConvertedCompYearly': 'Salary', 'LanguageHaveWorkedWith': 'Languages'}, axis=1)
data = data.dropna().reset_index(drop=True)
data['Languages'] = data['Languages'].str.split(';')
data.head()

|   | Salary | Languages |
|---|---|---|
| 0 | 285000.0 | [HTML/CSS, JavaScript, Python] |
| 1 | 250000.0 | [Bash/Shell (all shells), Go] |
| 2 | 156000.0 | [Bash/Shell (all shells), HTML/CSS, JavaScript... |
| 3 | 23456.0 | [HTML/CSS, JavaScript, TypeScript] |
| 4 | 96828.0 | [Bash/Shell (all shells), HTML/CSS, JavaScript... |


In [3]:
# One hot encoding the programming languages and finding the total languages known for every response

languages = pd.get_dummies(data['Languages'].apply(pd.Series).stack()).groupby(level=0).sum()
data['Total_Languages'] = languages.sum(axis=1)
data = pd.concat([data, languages], axis=1).drop(columns=['Languages'])
data.head()

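|   | Salary | Total_Languages | APL | Ada | Apex | Assembly | Bash/Shell (all shells) | ... | Scala | Solidity | Swift | TypeScript | VBA | Visual Basic (.Net) | Zig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 285000.0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 250000.0 | 2 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 156000.0 | 7 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 23456.0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 96828.0 | 6 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 |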


The original data was not clean or tidy. Some responses were missing values for salary or programming languages known, and we discarded them because they would not be usable in our analysis. In addition, all of the programming languages for a respondent were stored together as a single string in one column rather than separately. We therefore split the string and one-hot encoded the programming languages, so that each language has its own column and it is easy to determine which languages every person knows. For preprocessing, we computed the total number of languages each person knows by summing across the one-hot encoded columns, because our analysis deals with the association between the number of languages and salary.
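
The splitting, one-hot encoding, and counting steps described above can be illustrated on a small toy table. The sketch below is an assumption-laden alternative to the cell above rather than the code we ran: it uses `Series.str.get_dummies` directly on the semicolon-separated strings, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the column names mirror the renamed columns above.
toy = pd.DataFrame({
    "Salary": [100000.0, 80000.0, 120000.0],
    "Languages": ["HTML/CSS;JavaScript;Python", "Go", "Python;Go"],
})

# One indicator column per language, built from the semicolon-separated strings.
encoded = toy["Languages"].str.get_dummies(sep=";")

# The number of languages per respondent is the row sum of the indicators.
toy["Total_Languages"] = encoded.sum(axis=1)

# Cross-check: count the separators directly (n separators means n + 1 languages).
direct_count = toy["Languages"].str.count(";") + 1

# Final layout: salary, language count, then one column per language.
result = pd.concat([toy.drop(columns=["Languages"]), encoded], axis=1)
print(result)
```

The cross-check with `direct_count` is a cheap way to confirm that the encoded row sums match the raw strings before relying on `Total_Languages`.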

# Ethics & Privacy

Based on our proposed data, there might be biases in terms of work type, nationality (country), age, educational background, and coding activities. We do not think the wording of the survey introduces bias, because the survey authors used fixed response options rather than free-response questions, which reduces the chance that respondents are misled by the wording. However, some variables are likely to be skewed toward specific demographic groups. For instance, respondents from countries with a relatively stronger programming culture and closer engagement with Western countries (the U.S., European nations, India, etc.) are more concentrated in this dataset, while people from other countries might be underrepresented, so a biased correlation might be established.

Moreover, respondents might not report their actual educational background because of societal stereotypes, meaning that people with relatively less formal education might misreport their answers to this question. The same issue might apply to salaries: people might exaggerate the numbers to project a higher social status.

In regard to privacy, there is no sensitive data in the dataset, and respondents were given the option to skip specific questions they were not comfortable answering, so a privacy leak is not a concern.

To detect bias before the analysis, we will try to thoroughly understand the data by examining the demographics in the dataset (e.g. nationality) and computing descriptive statistics. During analysis, we will conduct exploratory data analysis to visualize the distribution of variables across demographic groups and plot graphs such as histograms and box plots to identify disparities and outliers. Furthermore, we could apply statistical tests to identify differences between demographic groups (e.g. nationality). Lastly, when communicating the analysis, we will clearly document the sources of our data, the preprocessing steps taken, and the methodologies used to address the biases identified. We will also convey the limitations of our analysis in our final report by acknowledging the presence of inherent biases, to provide a comprehensive understanding of the results and their potential constraints.
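
The demographic checks described above could start from something like the following sketch. It assumes the cleaned table keeps a demographic column alongside salary. The column name `Country` and all values here are hypothetical placeholders, not survey results.

```python
import pandas as pd

# Hypothetical placeholder data; group labels and salaries are made up.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "Salary": [100000, 120000, 110000, 40000, 50000, 30000],
})

# Share of respondents per group: shows which groups dominate the sample.
group_share = toy["Country"].value_counts(normalize=True)

# Descriptive salary statistics per group, to spot disparities and outliers.
salary_summary = toy.groupby("Country")["Salary"].agg(["count", "median", "min", "max"])

print(group_share)
print(salary_summary)
```

Groups with a very small `count` would be treated with caution, since their medians are unstable and easy to over-interpret.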


# Team Expectations 

* Weekly Meetings: Our team will have at least one synchronous meeting every week. The day and time of each meeting will be decided a week in advance and adjusted based on each member's availability. If a time conflict arises, the member should inform the team at least 6 hours before the meeting. After the meeting, the team will share important details with the absent member and receive their progress update through Discord as necessary.
* Communication: Our team will use Discord as our medium of communication, with chat for asynchronous communication and voice channels for our synchronous meetings. Team members will share progress updates and any blockers throughout the week leading up to the weekly meeting to ensure the goals for the week are met.
* Task Distribution: We acknowledge that some unevenness in task distribution is inevitable, but our group will try its best to ensure each member contributes equally throughout the entire project. We will achieve this by listing all the tasks to be completed for the week and dividing them accordingly. Since each member has different strengths and interests, we will let each member pick the tasks they prefer and discuss amongst ourselves if one task is selected by multiple members. We will try to have two people working on a single task when possible to help smooth the workflow.
* Task Submission: Before submitting any project checkpoint, each member will read through the submission to ensure that we all understand its content, check for incomplete code, and proofread the text for grammar and completeness.
* Conflict Resolution: If any conflict arises, we will try to resolve it amongst ourselves by discussing it. If we are not able to resolve it, we will seek the help of a TA or the professor to mediate the problem.


# Individual Responsibilities

* *Steven Dominic Sahar: Descriptive Analysis, Data Exploration, Data Visualization, Data Analysis*
* *Mengyi Chen: Data Wrangling & Cleaning, Data Visualization, Finalize EDA* 
* *Pranav Rebala: Obtaining Dataset, Descriptive Analysis, Data Exploration, Data Analysis*
* *Sikai Liang: Data Wrangling & Cleaning, Descriptive Analysis, Data Exploration, Data Visualization, Finalize EDA*
* *Jiaheng Dai: Data Wrangling & Cleaning, Data Visualization, Data Analysis*
* *All Members: Finalize Data Analysis, Finalize Report, Create Video, Proof Read Entire Project*

# Project Timeline Proposal





| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 11/1  |  6 PM | Search for datasets (completed)  | Finalize Proposal; Discuss Wrangling and possible analytical approaches (Asynchronous) | 
| 11/8  |  6 PM |  Import & Wrangle Data (Jiaheng, Sikai, Mengyi Chen); Begin EDA (Sikai, Pranav Rebala, Steven Dominic Sahar) | Review/Edit wrangling/EDA | 
| 11/14  | 6 PM  | EDA (Mengyi Chen, Steven Dominic Sahar, Jiaheng, Sikai)  | Discuss Analysis plan; Complete checkpoint 1   |
| 11/22  | 6 PM  | Finalize EDA (Mengyi Chen, Sikai); Begin Analysis (Jiaheng, Pranav Rebala, Steven Dominic Sahar) | Discuss/edit Analysis  |
| 11/28  | 6 PM  | Finalize Analysis (Sikai, Jiaheng, Steven Dominic Sahar, Mengyi Chen, Pranav Rebala) | Complete checkpoint 2; Complete results/conclusion/discussion |
| 12/4  | 6 PM  | Finalize report (Sikai, Jiaheng, Steven Dominic Sahar, Mengyi Chen, Pranav Rebala)| Discuss/edit full project |
| 12/9  | 6 PM  | Create video (Sikai, Jiaheng, Steven Dominic Sahar, Mengyi Chen, Pranav Rebala)| Final Check |
| 12/13  | Before 11:59 PM  | Final Check (Sikai, Jiaheng, Steven Dominic Sahar, Mengyi Chen, Pranav Rebala) | Turn in Final Project & Group Project Surveys |