## Overview of Analysis

The purpose of this project is to explore how undergraduate degrees affect salary outcomes over time, and to identify which degrees offer the best financial return on investment (ROI). It focuses on two questions:

1. Which degrees have the highest starting salary?
2. Which degrees show the most growth over time?

## 1. Loading & Previewing of Data

This dataset was sourced from Kaggle. It comes from a year-long survey of 1.2 million people reported by The Wall Street Journal, which based its data on Payscale Inc.

Link to dataset here: https://www.kaggle.com/datasets/wsj/college-salaries?select=salaries-by-college-type.csv

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
degree_pays = pd.read_csv("/kaggle/input/college-salaries/degrees-that-pay-back.csv")

In [None]:
# 50 rows, and 8 columns within the dataset
degree_pays.shape

In [None]:
degree_pays.head()

In [None]:
degree_pays.dtypes

With the exception of the 'Percent change from Starting to Mid-Career Salary' column, the salary fields are stored as strings, so we should convert them to numeric fields before diving into any further analysis.

The 'Undergraduate Major' column is fine as a string since it's our categorical data identifying the different degrees.

In [None]:
# Quick check on the unique values for each column
degree_pays.nunique()

Now to quickly check for any missing values within the dataset.

In [None]:
degree_pays.isna().sum()

There don't seem to be any missing values. However, the salary columns are still in string format, so a value like '$0' would not be counted as null even though a $0 salary would probably not make sense. We'll check again after preprocessing those fields.

*Note: contextually, a median salary of $0 is technically possible in the unlikely case where graduates of a particular undergraduate degree are unemployed.*

## 2. Data Preprocessing

Earlier we noted that most of the salary columns are recorded as strings because of the '$' sign and the comma separators, which would hinder any aggregation or analysis later. We're therefore going to strip those characters and convert the columns to floats before doing any exploratory data analysis.

In [None]:
def clean_salary(value):
    return value.replace('$', '').replace(',', '')

In [None]:
salary_columns = [
    'Starting Median Salary',
    'Mid-Career Median Salary',
    'Mid-Career 10th Percentile Salary',
    'Mid-Career 25th Percentile Salary',
    'Mid-Career 75th Percentile Salary',
    'Mid-Career 90th Percentile Salary'
]

for col in salary_columns:
    degree_pays[col] = degree_pays[col].apply(clean_salary).astype('float64')
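
As an alternative to the row-wise `apply` above, the same cleaning can be sketched with pandas' vectorised string methods. The sample rows below are hypothetical stand-ins for the raw file, and only two of the six salary columns are shown.

```python
import pandas as pd

# Hypothetical sample rows that mimic the raw string format of the salary columns
sample = pd.DataFrame({
    "Starting Median Salary": ["$46,000.00", "$57,700.00"],
    "Mid-Career Median Salary": ["$77,100.00", "$101,000.00"],
})

salary_cols = ["Starting Median Salary", "Mid-Career Median Salary"]

for col in salary_cols:
    # Strip '$' and ',' in one vectorised pass, then parse the result as numbers
    stripped = sample[col].str.replace(r"[$,]", "", regex=True)
    sample[col] = pd.to_numeric(stripped)

print(sample.dtypes)
```

Passing `errors="coerce"` to `pd.to_numeric` would turn any unparseable entry into `NaN` instead of raising, which may be useful if the raw file turns out to contain unexpected values.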

In [None]:
degree_pays.head()

In [None]:
degree_pays.dtypes

The salary columns are now converted to float64, so I'll quickly check again for any null fields.

In [None]:
degree_pays.isna().sum()
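
Note that `isna()` only reports nulls, so a salary of exactly 0 would still pass this check. A minimal sketch of an explicit zero count is shown below; the frame is hypothetical and the 0.0 is planted purely to show what the check reports.

```python
import pandas as pd

# Hypothetical cleaned frame; the 0.0 is planted to show what the check reports
sample = pd.DataFrame({
    "Starting Median Salary": [46000.0, 0.0, 57700.0],
    "Mid-Career Median Salary": [77100.0, 64000.0, 101000.0],
})

# Count the entries that are exactly zero in each salary column
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the real data, the same comparison could be applied to `degree_pays[salary_columns]`.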

Looking through the 8 columns available in this dataset, I don't think any of them are irrelevant enough to drop, so I'll keep the dataframe as it is.

It's not strictly necessary, but I'm also going to rename the columns to snake case so they are easier to query during the analysis.

In [None]:
renamed_cols = [cols.lower().replace(' ', '_').replace('-', '_') for cols in degree_pays.columns]
degree_pays.columns = renamed_cols

In [None]:
degree_pays.columns
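
To make the effect of the renaming concrete, here is a small standalone sketch that applies the same transformation to four of the original column names and prints the mapping.

```python
# Four of the original column names from the dataset
columns = [
    "Undergraduate Major",
    "Starting Median Salary",
    "Mid-Career 10th Percentile Salary",
    "Percent change from Starting to Mid-Career Salary",
]

# Same rule as in the cell above: lowercase, then spaces and hyphens become underscores
renamed = [c.lower().replace(" ", "_").replace("-", "_") for c in columns]

for old, new in zip(columns, renamed):
    print(f"{old!r} -> {new!r}")
```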

## 3. Exploratory Data Analysis

In [None]:
degree_pays.describe()

Insights drawn from the Descriptive Statistics:

* On average, salaries increase by ~69% from entry-level to mid-career, indicating strong long-term returns on higher education.
* While all degrees see positive growth, some more than double their starting salary by mid-career (up to 103%). However, growth varies widely, with some degrees growing by as little as 23%, highlighting the importance of major selection.
* The max starting salary is significantly higher than the median, which suggests that a few specific majors offer much higher starting pay than the others.

Next, I'll look at the distributions of both the starting and mid-career median salaries.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))  # 1 row, 2 columns

sns.histplot(data=degree_pays['starting_median_salary'], kde=True, bins=25, ax=axes[0])
axes[0].set_title('Starting Median Salary')

sns.histplot(data=degree_pays['mid_career_median_salary'], kde=True, bins=25, ax=axes[1])
axes[1].set_title('Mid-Career Median Salary')

plt.show()

Insights drawn:

* The distribution of the starting median salary is visibly **right (positively) skewed**, as most salary values are clustered within the **35k to 44k range**.
* There's a notable gap between the rest of the salary values and the highest salary value of **75k** on the extreme right.
* Compared to the starting median salary, the mid-career salaries seem more evenly distributed, while still maintaining a **slight positive skewness**.
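
The skewness read off the histograms can also be checked numerically. The sketch below uses pandas' `Series.skew()` on a hypothetical set of salaries shaped like the description above (most values clustered low, one far to the right); it does not use the real data.

```python
import pandas as pd

# Hypothetical salaries: most values cluster low, one sits far to the right
starting = pd.Series([35_000, 36_000, 38_000, 40_000, 41_000, 44_000, 75_000])

# A positive value indicates a right (positive) skew
skewness = starting.skew()
print(f"skewness: {skewness:.2f}")

# In a right-skewed sample the mean is pulled above the median
print(f"mean: {starting.mean():.0f}, median: {starting.median():.0f}")
```

On the real dataframe, the equivalent call would be `degree_pays['starting_median_salary'].skew()`.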

In [None]:
degree_pays[['undergraduate_major', 'starting_median_salary']].sort_values(by='undergraduate_major')

In [None]:
starting_salary_by_degrees = degree_pays[['undergraduate_major', 'starting_median_salary']].sort_values(by='starting_median_salary')

In [None]:
plt.figure(figsize=(12, 14))
sns.barplot(x='starting_median_salary', y='undergraduate_major', data=starting_salary_by_degrees, palette='Blues_d')
plt.title("Starting Median Salary by Degrees")
plt.xlabel("Starting Median Salary")
plt.ylabel("Undergraduate Degree")

As seen above, the Physician Assistant major has the highest starting median salary, while Spanish has the lowest.

The chart also roughly suggests that more technical degrees, like engineering, tend to lead to a higher starting median salary than arts-related degrees.

In [None]:
degree_pays[['undergraduate_major', 'mid_career_median_salary']].sort_values(by='undergraduate_major')

In [None]:
mid_career_salary_by_degrees = degree_pays[['undergraduate_major', 'mid_career_median_salary']].sort_values(by='mid_career_median_salary')

In [None]:
plt.figure(figsize=(12, 14))
sns.barplot(x='mid_career_median_salary', y='undergraduate_major', data=mid_career_salary_by_degrees, palette='BuGn')
plt.title("Mid-Career Median Salary by Degrees")
plt.xlabel("Mid-Career Median Salary")
plt.ylabel("Undergraduate Degree")

Looking at the mid-career median salary by degree, we can see that Physician Assistant has been overtaken by other majors like engineering, economics and physics. With a starting median salary of 74.3k and a mid-career median salary of 91.7k, its income grows by only 17.4k over that period.

This is in contrast to other degrees with a similarly high starting median salary, like Computer Engineering, which starts at 61.4k and grows to 105k at mid-career.
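
The comparison above can be reproduced with a few lines of arithmetic, using the figures quoted in the text. The percentage formula, (mid - start) / start × 100, is an assumption about how the dataset's percent-change column is defined.

```python
# Starting and mid-career median salaries as quoted in the text above
salaries = {
    "Physician Assistant": (74_300, 91_700),
    "Computer Engineering": (61_400, 105_000),
}

growth = {}
for major, (start, mid) in salaries.items():
    absolute = mid - start
    # Assumed definition of percent change: relative to the starting salary
    percent = absolute / start * 100
    growth[major] = (absolute, round(percent, 1))
    print(f"{major}: +{absolute:,} ({percent:.1f}%)")
```

Under that assumption, Physician Assistant grows by 17,400 (about 23.4%) and Computer Engineering by 43,600 (about 71.0%).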

Assuming the goal is to choose the undergraduate major that maximises future income, the best option would be a degree that offers a high starting median salary and also a significant increase in median salary by mid-career. Conversely, a major with a low starting median salary and low growth by mid-career is probably one to avoid, given its lower short-term and long-term financial returns.
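
One possible way to put this selection rule into code is to flag majors that sit above the median on both measures, and those that sit below on both. This is only a sketch: the majors and figures are hypothetical, and using the median as the cut-off is an illustrative choice rather than part of the analysis.

```python
import pandas as pd

# Hypothetical majors and figures, for illustration only
sample = pd.DataFrame({
    "undergraduate_major": ["Major A", "Major B", "Major C", "Major D"],
    "starting_median_salary": [60_000, 70_000, 40_000, 36_000],
    "percent_change_from_starting_to_mid_career_salary": [75.0, 25.0, 100.0, 40.0],
})

start = sample["starting_median_salary"]
growth = sample["percent_change_from_starting_to_mid_career_salary"]

# Majors above the median on both starting pay and growth
strong_on_both = sample[(start > start.median()) & (growth > growth.median())]

# Majors below the median on both measures
weak_on_both = sample[(start < start.median()) & (growth < growth.median())]

print(strong_on_both["undergraduate_major"].tolist())
print(weak_on_both["undergraduate_major"].tolist())
```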

In [None]:
salary_growth_by_degrees = degree_pays[['undergraduate_major', 'percent_change_from_starting_to_mid_career_salary']].sort_values(by='percent_change_from_starting_to_mid_career_salary')

In [None]:
plt.figure(figsize=(12, 14))
sns.barplot(x='percent_change_from_starting_to_mid_career_salary', y='undergraduate_major', data=salary_growth_by_degrees, palette='BuPu')
plt.title("Salary Growth from Start to Mid-Career by Degrees")
plt.xlabel("Percentage Change in Median Salary")
plt.ylabel("Undergraduate Degree")

## 4. Visualisations

#### Degrees with High Starting Pay

In [None]:
sns.barplot(x='undergraduate_major', y='starting_median_salary', data=starting_salary_by_degrees.sort_values(by='starting_median_salary', ascending=False).head(10), palette='mako')
plt.title("Top 10 Degrees with the Highest Starting Salaries")
plt.ylabel("Median Salary")
plt.tick_params(axis='x', rotation=60)

In [None]:
starting_salary_by_degrees.sort_values(by='starting_median_salary', ascending=False).head(10)

The above bar graph highlights the top 10 degrees with the highest starting salary post-graduation.
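
As a side note, the sort-then-`head` pattern used for this ranking can also be written with `DataFrame.nlargest`. The sketch below uses hypothetical majors and values and takes the top 2 rather than the top 10.

```python
import pandas as pd

# Hypothetical majors and salaries, for illustration only
sample = pd.DataFrame({
    "undergraduate_major": ["Major A", "Major B", "Major C", "Major D"],
    "starting_median_salary": [42_000.0, 61_000.0, 35_000.0, 55_000.0],
})

# Equivalent to sort_values(by=..., ascending=False).head(2)
top_two = sample.nlargest(2, "starting_median_salary")
print(top_two)
```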

Whilst a particular degree might offer a high starting pay, it would probably be more meaningful to also consider the growth in salary over time to really understand its ROI.

#### Degrees with the Most Growth Over Time

In [None]:
sns.barplot(x='undergraduate_major', y='percent_change_from_starting_to_mid_career_salary', data=salary_growth_by_degrees.sort_values(by='percent_change_from_starting_to_mid_career_salary', ascending=False).head(10), palette='mako')
plt.title("Top 10 Degrees with the Highest Growth in Salaries")
plt.ylabel("% Growth in Median Salary")
plt.tick_params(axis='x', rotation=60)

In [None]:
salary_growth_by_degrees.sort_values(by='percent_change_from_starting_to_mid_career_salary', ascending=False).head(10)

This graph highlights the top 10 degrees with the highest growth in salary from start to mid-career, answering our second analysis question on which degrees show the most growth over time.

We can see here that degrees in Math and Philosophy show significant salary growth of ~103%. This suggests that while they might start with lower pay, these degrees may open career pathways into higher-demand industries like finance or consulting, which then offer more lucrative compensation.

## 5. Conclusion

Based on the analysis, we gathered that:

* The **average starting salary** across the different degrees is **~44k**, and on average, graduates' salaries see a **growth rate of ~69% from starting to mid-career**. This indicates that while starting salaries and growth rates differ across degrees, long-term growth generally remains positive. This makes sense, as salaries tend to increase over time as experience and skills grow.
* While a degree in Physician Assistant has the highest starting pay of ~74k, degrees in Math and Philosophy see the highest growth rate over time of ~103%. Physician Assistant, by contrast, has the lowest growth rate of ~23%. Interestingly, this shows that a **high starting salary does not always correlate with a high growth rate over time**.


As this analysis uses aggregated survey data, its limitations include self-reporting bias and uneven sample sizes across the different degrees.

To give prospective students more comprehensive insight when choosing a degree, further analysis could incorporate data such as tuition costs, employment rates after graduation, and the number of years required to complete the degree.