# Tech Job Market and Salaries Analysis 

For our final project, we selected the Stack Overflow Developer Survey dataset, 
which contains detailed responses from developers worldwide about their job roles, 
skills, technologies used, and salaries. The dataset is directly relevant to the tech 
industry, a major focus of our group, and offers insight into the tech job market. 
It covers topics such as job roles, salary, coding activities, education, technology 
usage, and job satisfaction.<br>

Team Eyy<br>
Members:  
- Julianne Kristine D. Aban 
- Derich Andre G. Arcilla 
- Jennifer Bendoy 
- Richelle Ann C. Candidato 
- Marc Francis B. Gomolon 
- Phoebe Kae A. Plasus

##### Data Preparation

LOADING DATA SET & LIBRARIES

In [188]:
import pandas as pd
import numpy as np
import re

# Load the dataset
df = pd.read_csv('survey_results_filtered.csv')

CLEANING YEARS OF CODING EXPERIENCE

In [189]:
# Clean YearsCode (overall coding experience) and YearsCodePro (professional coding experience)

# Step 1: Replace 'NA' strings with NaN
df.replace('NA', pd.NA, inplace=True)

# Step 2: Convert 'YearsCode' and 'YearsCodePro' columns to numeric, 
# forcing errors to NaN for any other non-numeric values
df['YearsCode'] = pd.to_numeric(df['YearsCode'], errors='coerce')
df['YearsCodePro'] = pd.to_numeric(df['YearsCodePro'], errors='coerce')

# Step 3: Impute missing values with the mean of each column
df['YearsCode'] = df['YearsCode'].fillna(df['YearsCode'].mean())
df['YearsCodePro'] = df['YearsCodePro'].fillna(df['YearsCodePro'].mean())

# Round the results to whole numbers (integers)
df['YearsCode'] = df['YearsCode'].round().astype(int)
df['YearsCodePro'] = df['YearsCodePro'].round().astype(int)

# Print the cleaned DataFrame with the two columns
# print(df[['YearsCode', 'YearsCodePro']])
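
The coercion in Step 2 turns every non-numeric answer into NaN, which Step 3 then replaces with the column mean. If the file still carries text answers for very short or very long experience (assumed here to be `Less than 1 year` and `More than 50 years`; check `df['YearsCode'].unique()` first), those respondents would be imputed rather than kept. A minimal sketch of a hypothetical helper, `parse_years`, that maps such labels before conversion:

```python
import math

# Hypothetical text labels; confirm against df['YearsCode'].unique() before relying on them.
TEXT_YEARS = {
    "less than 1 year": 0.0,
    "more than 50 years": 51.0,
}

def parse_years(value):
    """Return years of experience as a float, or NaN when the value is unusable."""
    if value is None:
        return math.nan
    text = str(value).strip().lower()
    if text in TEXT_YEARS:
        return TEXT_YEARS[text]
    try:
        return float(text)
    except ValueError:
        # Anything else (e.g. 'NA') is treated as missing.
        return math.nan
```

It could be applied with `df['YearsCode'].map(parse_years)` ahead of the imputation step, so that only genuinely missing answers receive the mean.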

CLEANING EDUCATION LEVEL

In [190]:
# Clean the EdLevel column:
# - Remove text in parentheses (e.g., "(e.g. American high school, etc.)")
# - Strip any extra spaces
df['EdLevel'] = df['EdLevel'].apply(lambda x: re.sub(r'\(.*\)', '', str(x)).strip())

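# Mapping dictionary for converting text to numeric codes
# Note: the codes are category labels, not a strict ordinal scale
# ('Associate degree' is coded above "Bachelor's degree")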
edlevel_mapping = {
    'Primary/elementary school': 1,
    'Secondary school': 2,
    "Bachelor's degree": 3,
    'Associate degree': 4,
    "Master's degree": 5,
    'Professional degree': 6,
    'Some college/university study without earning a degree': 0,
    'Something else': 0
}

# Replace text with numeric values according to the mapping
df['EdLevel'] = df['EdLevel'].replace(edlevel_mapping)

# Convert the column to numeric values (in case there are still mixed types)
df['EdLevel'] = pd.to_numeric(df['EdLevel'], errors='coerce')

# Handle missing values (NaN) and replace with -1
df['EdLevel'] = df['EdLevel'].fillna(-1)

# Print only the cleaned 'EdLevel' column
# print(df['EdLevel'])
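
The mapping above only matches labels spelled exactly like its keys; anything else ends up as -1. Two things may get in the way: the greedy pattern `\(.*\)` removes everything between the first `(` and the last `)`, and survey exports sometimes use a typographic apostrophe (`’`) where the keys use a straight one (`'`). Whether this file does is an assumption to verify. A small sketch of a hypothetical `normalize_edlevel` helper:

```python
import re

def normalize_edlevel(label):
    """Strip parenthetical notes and unify apostrophes in an education label."""
    text = str(label)
    # Remove each "(...)" group separately instead of one greedy match.
    text = re.sub(r"\([^)]*\)", "", text)
    # Replace the typographic apostrophe with a straight one.
    text = text.replace("\u2019", "'")
    # Collapse any whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Used as `df['EdLevel'].apply(normalize_edlevel)` in place of the lambda, it would leave labels in a form the dictionary keys can match.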

CLEANING ORGSIZE

In [191]:
# Strip whitespace from all column names
df.columns = df.columns.str.strip()

# Clean the 'OrgSize' column: strip spaces, lowercase, remove commas, standardize ranges
df['OrgSize'] = df['OrgSize'].str.strip()  # Remove leading/trailing spaces
df['OrgSize'] = df['OrgSize'].str.lower()  # Convert to lowercase
df['OrgSize'] = df['OrgSize'].str.replace(',', '', regex=False)  # Remove commas
df['OrgSize'] = df['OrgSize'].str.replace(' to ', '-', regex=False)  # Standardize the word "to" as "-" (not "to" inside other words)
df['OrgSize'] = df['OrgSize'].str.replace(' ', '', regex=False)  # Remove all remaining spaces

# Replace NaN with the placeholder 'na' so it can be mapped to -1 below
df['OrgSize'] = df['OrgSize'].fillna('na')

# Map the normalized categories to numeric codes
# (keys contain no spaces because all spaces were removed above)
orgsize_map = {
    'freelancer': 1,                # Freelancer becomes 1
    '2-99employees': 2,             # from "2 - 99 employees"
    '100-999employees': 3,          # from "100 - 999 employees"
    '1000-4999employees': 4,        # from "1000 to 4999 employees"
    '5000ormoreemployees': 5,       # from "5000 or more employees"
    "idon'tknow": -1,               # "I don't know" (spaces already removed) becomes -1
    'na': -1                        # Missing values become -1
}

# Replace text categories with numeric values directly in the 'OrgSize' column
df['OrgSize'] = df['OrgSize'].replace(orgsize_map)

# Print the cleaned 'OrgSize' column
# print(df['OrgSize'])

CLEANING LANGUAGE HAVE WORKED WITH

In [192]:
# Fill NaN values with an empty string so the split below works on every row
df['LanguageHaveWorkedWith'] = df['LanguageHaveWorkedWith'].fillna('')

# Split the 'LanguageHaveWorkedWith' column by semicolon
df['Technologies'] = df['LanguageHaveWorkedWith'].str.split(';')

# Standardize technology names (strip extra spaces, title case)
df['Technologies'] = df['Technologies'].apply(lambda x: [tech.strip().title() for tech in x if tech])

# Flatten the list of technologies and create a set of unique technologies
all_technologies = set([tech for sublist in df['Technologies'] for tech in sublist])

# One-hot encode by creating a column for each technology
for tech in all_technologies:
    df[tech] = df['Technologies'].apply(lambda x: 1 if tech in x else 0)

# Drop the original 'LanguageHaveWorkedWith' and 'Technologies' columns
df.drop(columns=['LanguageHaveWorkedWith', 'Technologies'], inplace=True)

# Print the one-hot encoded results
# print(df)

CLEANING DATABASE HAVE WORKED WITH

In [193]:
# Fill NaN values with an empty string so the split below works on every row
df['DatabaseHaveWorkedWith'] = df['DatabaseHaveWorkedWith'].fillna('')

# Function to clean and one-hot encode a column
def clean_and_encode(column_name):
    # Split the column by semicolon
    df[column_name + '_Technologies'] = df[column_name].str.split(';')
    
    # Standardize technology names (strip extra spaces, title case)
    df[column_name + '_Technologies'] = df[column_name + '_Technologies'].apply(lambda x: [tech.strip().title() for tech in x if tech])
    
    # Flatten the list of technologies and create a set of unique technologies
    all_technologies = set([tech for sublist in df[column_name + '_Technologies'] for tech in sublist])
    
    # One-hot encode by creating a column for each technology
    for tech in all_technologies:
        df[tech] = df[column_name + '_Technologies'].apply(lambda x: 1 if tech in x else 0)
    
    # Drop the original technology columns
    df.drop(columns=[column_name, column_name + '_Technologies'], inplace=True)

# Clean and one-hot encode the 'DatabaseHaveWorkedWith' column
clean_and_encode('DatabaseHaveWorkedWith')
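
Both encoding steps write one column per technology straight into `df`, named only by the technology itself, so a name that appears in two source columns (or that matches an existing column) would be silently overwritten. A minimal sketch of the same semicolon-split encoding on plain Python lists, with a hypothetical `prefix` argument that keeps the resulting names distinct:

```python
def one_hot(values, prefix):
    """One-hot encode semicolon-separated strings into a list of row dicts."""
    # Assumes missing values were already replaced with '' as in the cells above.
    # Split each cell, trim whitespace, and drop empty items.
    rows = [[item.strip() for item in str(v).split(";") if item.strip()] for v in values]
    # Sorted so the column order is stable between runs.
    categories = sorted({item for row in rows for item in row})
    return [
        {f"{prefix}_{cat}": int(cat in row) for cat in categories}
        for row in rows
    ]

# Toy input for illustration only (not taken from the dataset).
encoded = one_hot(["Python;SQL", "SQL", ""], prefix="Lang")
```

The list of dicts can be turned into columns with `pd.DataFrame(encoded, index=df.index)` and joined onto `df`. pandas also provides `Series.str.get_dummies(sep=';')`, which performs the split and the encoding in one call.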

CLEANING JOB SATISFACTION

In [194]:
# Replace "NA" with NaN (missing values)
df['JobSat'] = df['JobSat'].replace('NA', np.nan)

# Convert the column to numeric values (in case some values are strings or other types)
df['JobSat'] = pd.to_numeric(df['JobSat'], errors='coerce')

# Impute missing values with the median (you can change to mean if preferred)
df['JobSat'] = df['JobSat'].fillna(df['JobSat'].median())  # or df['JobSat'].mean()

# Optional: If there are any outliers or invalid values (e.g., negative values), you can handle them
# For instance, we could cap the values to a valid range, such as 1-10 for job satisfaction scores
# df['JobSat'] = df['JobSat'].clip(lower=1, upper=10)

CLEANING DEVTYPE

In [None]:
# Replace "NA" with NaN (missing values) in DevType column
df['DevType'] = df['DevType'].replace('NA', np.nan)

# Define a mapping to group similar roles
dev_type_mapping = {
    'Academic researcher': 'Researcher',
    'Blockchain': 'Developer',
    'Cloud infrastructure engineer': 'Engineer',
    'Data engineer': 'Data Professional',
    'Data or business analyst': 'Data Professional',
    'Data scientist or machine learning specialist': 'Data Professional',
    'Database administrator': 'Data Professional',
    'Designer': 'Designer',
    'Developer Advocate': 'Developer',
    'Developer Experience': 'Developer',
    'Developer, AI': 'Developer',
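    'Developer, back-end': 'Developer',  # comma form, consistent with the other 'Developer, ...' labels
    'Developer back-end': 'Developer',   # variant without the comma, kept in case the file uses it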
    'Developer, desktop or enterprise applications': 'Developer',
    'Developer, embedded applications': 'Developer',
    'Developer, front-end': 'Developer',
    'Developer, full-stack': 'Developer',
    'Developer, game or graphics': 'Developer',
    'Developer, mobile': 'Developer',
    'Developer, QA or test': 'Developer',
    'DevOps specialist': 'Engineer',
    'Educator': 'Educator',
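    'Engineer, site reliability': 'Engineer',  # comma form, consistent with the 'Developer, ...' labels
    'Engineer site reliability': 'Engineer',   # variant without the comma, kept in case the file uses it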
    'Engineering manager': 'Manager',
    'Hardware manager': 'Manager',
    'Marketing or sales professional': 'Business',
    'Others': 'Other',
    'Product manager': 'Manager',
    'Project manager': 'Manager',
    'Research and Development role': 'Researcher',
    'Scientist': 'Researcher',
    'Security professional': 'Security',
    'Senior Executive (C-Suite, VP)': 'Executive',
    'Student': 'Student',
    'System administrator': 'Engineer'
}

# Apply the mapping
df['DevType'] = df['DevType'].map(dev_type_mapping)

# Convert categorical column to numeric using a mapping for the grouped roles
dev_type_numeric_mapping = {
    'Researcher': 1,
    'Developer': 2,
    'Engineer': 3,
    'Data Professional': 4,
    'Designer': 5,
    'Manager': 6,
    'Business': 7,
    'Other': 8,
    'Educator': 9,
    'Security': 10,
    'Executive': 11,
    'Student': 12
}

# Apply the numeric mapping to the 'DevType' column
df['DevTypeNumeric'] = df['DevType'].map(dev_type_numeric_mapping)

# Fill NaN values (missing or unmapped roles) with the placeholder -1
df['DevTypeNumeric'] = df['DevTypeNumeric'].fillna(-1)

# Show each grouped role next to its numeric code
print(df[['DevType', 'DevTypeNumeric']])


         DevType  DevTypeNumeric
0            NaN            -1.0
1      Developer             2.0
2      Developer             2.0
3      Developer             2.0
4      Developer             2.0
...          ...             ...
65432  Developer             2.0
65433        NaN            -1.0
65434  Developer             2.0
65435        NaN            -1.0
65436        NaN            -1.0

[65437 rows x 2 columns]
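
`Series.map` returns NaN for any label that has no key in the dictionary, so a misspelled key is indistinguishable from a genuinely missing answer once both become -1. A short sketch of a hypothetical `unmapped_labels` check that counts labels lacking a key, so the mapping can be audited before it is applied:

```python
from collections import Counter

def unmapped_labels(labels, mapping):
    """Count non-missing labels that have no key in `mapping`."""
    counts = Counter()
    for label in labels:
        # Skip missing entries: None, and NaN (the only value not equal to itself).
        if label is None or label != label:
            continue
        if label not in mapping:
            counts[label] += 1
    return dict(counts)

# Toy mapping and labels for illustration only (not taken from the dataset).
toy_mapping = {"Developer, full-stack": "Developer", "Student": "Student"}
toy_labels = ["Student", "Developer full-stack", None, float("nan"), "Developer full-stack"]
missing = unmapped_labels(toy_labels, toy_mapping)
```

On the notebook's data it could be called as `unmapped_labels(df['DevType'].dropna(), dev_type_mapping)` before the `.map` call; an empty result would mean every -1 in the output comes from a missing answer rather than a key mismatch.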


In [196]:
# Save the cleaned file

output_file_path = "cleaned_file.csv"

df.to_csv(output_file_path, index=False)

##### Exploratory Data Analysis (EDA)

In [197]:
# place code here

##### Data Analysis Techniques

In [198]:
# K-means Clustering

In [199]:
# Linear Regression

In [200]:
# Apriori Algorithm