## Analysis of Post-University Salaries of Graduates by Major

College degrees are very expensive. Is it worth spending a large amount of money on a degree that is in low demand in the labour market? Choosing Philosophy or International Relations as a major may have worried your parents, but does the data back up their fears? PayScale Inc. ran a year-long survey of 1.2 million Americans with only a bachelor's degree. We'll dig into this data and use pandas to answer these questions:

* Which degrees have the highest starting salaries? 

* Which majors have the lowest earnings after college?

* Which degrees have the highest earning potential?

* What are the lowest risk college majors from an earnings standpoint?

* Do business, STEM (Science, Technology, Engineering, Mathematics) or HASS (Humanities, Arts, Social Science) degrees earn more on average?

In [64]:
import numpy as np
import pandas as pd

In [65]:
data = pd.read_csv('sample_data/salaries_by_college_major (1).csv')

In [66]:
# view the top 5 rows of the data
data.head()

| | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
|---|---|---|---|---|---|---|
| 0 | Accounting | 46000.0 | 77100.0 | 42200.0 | 152000.0 | Business |
| 1 | Aerospace Engineering | 57700.0 | 101000.0 | 64300.0 | 161000.0 | STEM |
| 2 | Agriculture | 42600.0 | 71900.0 | 36300.0 | 150000.0 | Business |
| 3 | Anthropology | 36800.0 | 61500.0 | 33800.0 | 138000.0 | HASS |
| 4 | Architecture | 41600.0 | 76800.0 | 50600.0 | 136000.0 | Business |


### Preliminary Data Exploration and Data Cleaning with Pandas

The following questions are to be answered:

* How many rows does our dataframe have? 

* How many columns does it have?

* What are the labels for the columns? Do the columns have names?



In [67]:
# how many rows does the dataframe have?
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Undergraduate Major                51 non-null     object 
 1   Starting Median Salary             50 non-null     float64
 2   Mid-Career Median Salary           50 non-null     float64
 3   Mid-Career 10th Percentile Salary  50 non-null     float64
 4   Mid-Career 90th Percentile Salary  50 non-null     float64
 5   Group                              50 non-null     object 
dtypes: float64(4), object(2)
memory usage: 2.5+ KB


There are 51 rows and 6 columns. Five of the six columns report only 50 non-null values, so at least one row is incomplete.
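
When only the row and column counts are needed, `DataFrame.shape` returns them directly as a tuple. The sketch below uses a small hypothetical `sample` frame built from three of the rows shown in the `head()` output, not the full dataset.

```python
import pandas as pd

# Hypothetical three-row sample copied from the head() output above
sample = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Aerospace Engineering", "Agriculture"],
    "Starting Median Salary": [46000.0, 57700.0, 42600.0],
    "Group": ["Business", "STEM", "Business"],
})

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```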

In [68]:
#  what are the labels for the columns
data.columns

Index(['Undergraduate Major', 'Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 90th Percentile Salary', 'Group'],
      dtype='object')

The columns have names.

In [69]:
# are there missing values in the columns?
data.isna().any()

Undergraduate Major                  False
Starting Median Salary                True
Mid-Career Median Salary              True
Mid-Career 10th Percentile Salary     True
Mid-Career 90th Percentile Salary     True
Group                                 True
dtype: bool

Only the `Undergraduate Major` column has no missing values.
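
`isna().any()` only says whether a column has a gap. Chaining `isna().sum()` instead counts the gaps per column, which shows how much data is affected. A minimal sketch on a hypothetical `sample` frame, where the last row is an invented incomplete record used purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two rows from the head() output plus one made-up incomplete row
sample = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Aerospace Engineering", "Incomplete row"],
    "Starting Median Salary": [46000.0, 57700.0, np.nan],
    "Group": ["Business", "STEM", np.nan],
})

# True counts as 1 when summed, so this yields the number of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```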

In [70]:
# obtain the rows with a missing value in any column
data[data.isna().any(axis=1)]

| | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
|---|---|---|---|---|---|---|
| 50 | Source: PayScale Inc. | NaN | NaN | NaN | NaN | NaN |


In [71]:
# row 50 is the source footer rather than a major, so we remove it from the data
data.dropna(inplace=True)

In [72]:
# check for the missing values again
data.isna().any()

Undergraduate Major                  False
Starting Median Salary               False
Mid-Career Median Salary             False
Mid-Career 10th Percentile Salary    False
Mid-Career 90th Percentile Salary    False
Group                                False
dtype: bool
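
The cleaning step can also be written without `inplace=True`, which keeps the raw frame available for comparison. This is a sketch on a hypothetical three-row frame (the names `raw` and `clean` are illustrative), with the footer row taken from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two real rows plus the footer row that carries no salary data
raw = pd.DataFrame({
    "Undergraduate Major": ["Accounting", "Aerospace Engineering", "Source: PayScale Inc."],
    "Starting Median Salary": [46000.0, 57700.0, np.nan],
    "Group": ["Business", "STEM", np.nan],
})

# dropna() without inplace=True returns a new frame and leaves `raw` untouched
clean = raw.dropna()

print(raw.shape)    # (3, 3)
print(clean.shape)  # (2, 3)
```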

### Which degrees have the highest and lowest starting salaries?

In [73]:
# Obtain the row at the index of the maximum Starting Median Salary
data.loc[data['Starting Median Salary'].idxmax()]

Undergraduate Major                  Physician Assistant
Starting Median Salary                         74,300.00
Mid-Career Median Salary                       91,700.00
Mid-Career 10th Percentile Salary              66,400.00
Mid-Career 90th Percentile Salary             124,000.00
Group                                               STEM
Name: 43, dtype: object

**Physician Assistant** has the highest starting median salary.

In [74]:
# Obtain the row at the index of the minimum Starting Median Salary
data.loc[data['Starting Median Salary'].idxmin()]

Undergraduate Major                   Spanish
Starting Median Salary              34,000.00
Mid-Career Median Salary            53,100.00
Mid-Career 10th Percentile Salary   31,000.00
Mid-Career 90th Percentile Salary   96,400.00
Group                                    HASS
Name: 49, dtype: object

**Spanish** has the lowest starting median salary.
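
`idxmax()` and `idxmin()` return a single row each. To see several majors at either end of the range, `nlargest` and `nsmallest` are one option. The sketch below runs on a hypothetical `sample` frame holding only the starting salaries displayed so far in this notebook, so its ranking does not reflect the full dataset.

```python
import pandas as pd

# Hypothetical sample built from rows displayed earlier in the notebook
sample = pd.DataFrame({
    "Undergraduate Major": [
        "Accounting", "Aerospace Engineering", "Agriculture", "Anthropology",
        "Architecture", "Physician Assistant", "Spanish",
    ],
    "Starting Median Salary": [
        46000.0, 57700.0, 42600.0, 36800.0, 41600.0, 74300.0, 34000.0,
    ],
})

# The three highest and three lowest starting salaries within the sample
top = sample.nlargest(3, "Starting Median Salary")
bottom = sample.nsmallest(3, "Starting Median Salary")

print(top)
print(bottom)
```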

### What are the lowest-risk majors?

A low-risk major is a degree where there is a small difference between the lowest and highest salaries.

If the difference between the 10th percentile and the 90th percentile earnings of your major is small, then you can be more certain about your salary after you graduate.

In [75]:
# create a new column for the difference between the 10th and 90th percentile mid-career salaries
# (10th minus 90th, so every value is negative: the more negative, the wider the spread)
data['Risk Potential'] = data['Mid-Career 10th Percentile Salary'] - data['Mid-Career 90th Percentile Salary']
# idxmin() picks the most negative value, i.e. the row with the widest spread
data.loc[data['Risk Potential'].idxmin()]

Undergraduate Major                   Economics
Starting Median Salary                50,100.00
Mid-Career Median Salary              98,600.00
Mid-Career 10th Percentile Salary     50,600.00
Mid-Career 90th Percentile Salary    210,000.00
Group                                  Business
Risk Potential                      -159,400.00
Name: 17, dtype: object

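Economics has the widest gap between its 10th and 90th percentile salaries (159,400), so by the definition above it is the highest-risk major in the data, not the lowest. The lowest-risk major is the one whose `Risk Potential` is closest to zero.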

In [76]:
# The 5 majors with the widest spread (most negative Risk Potential), i.e. the highest risk
data.sort_values(['Risk Potential'], ascending=True).head()

| | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group | Risk Potential |
|---|---|---|---|---|---|---|---|
| 17 | Economics | 50100.0 | 98600.0 | 50600.0 | 210000.0 | Business | -159400.0 |
| 22 | Finance | 47900.0 | 88300.0 | 47200.0 | 195000.0 | Business | -147800.0 |
| 37 | Math | 45400.0 | 92400.0 | 45200.0 | 183000.0 | STEM | -137800.0 |
| 36 | Marketing | 40800.0 | 79600.0 | 42100.0 | 175000.0 | Business | -132900.0 |
| 42 | Philosophy | 39900.0 | 81200.0 | 35500.0 | 168000.0 | HASS | -132500.0 |
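
Under the definition given at the start of this section, lower risk means a smaller gap between the 10th and 90th percentile salaries. One way to make that easier to read is to compute the gap as 90th minus 10th, so it is positive, and sort ascending. The sketch below assumes a hypothetical `Spread` column and a `sample` frame limited to rows displayed in this notebook, so its ordering is illustrative and is not the ranking for the full dataset.

```python
import pandas as pd

# Hypothetical sample built from rows displayed earlier in the notebook
sample = pd.DataFrame({
    "Undergraduate Major": [
        "Economics", "Finance", "Math", "Accounting",
        "Aerospace Engineering", "Physician Assistant", "Spanish",
    ],
    "Mid-Career 10th Percentile Salary": [
        50600.0, 47200.0, 45200.0, 42200.0, 64300.0, 66400.0, 31000.0,
    ],
    "Mid-Career 90th Percentile Salary": [
        210000.0, 195000.0, 183000.0, 152000.0, 161000.0, 124000.0, 96400.0,
    ],
})

# Hypothetical column: 90th minus 10th, so a smaller value means lower risk
spread = sample["Mid-Career 90th Percentile Salary"] - sample["Mid-Career 10th Percentile Salary"]

# Ascending sort puts the narrowest spread (lowest risk) first
ranked = sample.assign(Spread=spread).sort_values("Spread")

print(ranked[["Undergraduate Major", "Spread"]])
```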


In [77]:
# The top 5 majors with the highest Mid-Career 90th Percentile Salary
data.sort_values(['Mid-Career 90th Percentile Salary'], ascending=False).iloc[:5, [0, 4]] 

| | Undergraduate Major | Mid-Career 90th Percentile Salary |
|---|---|---|
| 17 | Economics | 210000.0 |
| 22 | Finance | 195000.0 |
| 8 | Chemical Engineering | 194000.0 |
| 37 | Math | 183000.0 |
| 44 | Physics | 178000.0 |


### Which category of degrees has the highest average salary?

* Is it STEM, Business or HASS (Humanities, Arts, and Social Science)? 

In [78]:
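# the average salary of each degree category
pd.options.display.float_format = '{:,.2f}'.format # show floats with thousands separators and 2 decimal places
# drop the last column (Risk Potential); numeric_only=True skips the text column, which newer pandas versions refuse to average
data.iloc[:, :-1].groupby('Group').mean(numeric_only=True)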

| Group | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary |
|---|---|---|---|---|
| Business | 44633.33 | 75083.33 | 43566.67 | 147525.0 |
| HASS | 37186.36 | 62968.18 | 34145.45 | 129363.64 |
| STEM | 53862.5 | 90812.5 | 56025.0 | 157625.0 |
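
Selecting the salary columns by name, rather than by position, keeps the group average working even if columns are added or reordered later. The sketch below is an illustration on a hypothetical `sample` frame holding only the five rows from the `head()` output, so its averages differ from the full-data table above.

```python
import pandas as pd

# Hypothetical sample: the five rows shown in the head() output
sample = pd.DataFrame({
    "Undergraduate Major": [
        "Accounting", "Aerospace Engineering", "Agriculture", "Anthropology", "Architecture",
    ],
    "Starting Median Salary": [46000.0, 57700.0, 42600.0, 36800.0, 41600.0],
    "Mid-Career Median Salary": [77100.0, 101000.0, 71900.0, 61500.0, 76800.0],
    "Group": ["Business", "STEM", "Business", "HASS", "Business"],
})

# Average only the named salary columns, then rank groups by mid-career pay
salary_cols = ["Starting Median Salary", "Mid-Career Median Salary"]
group_means = (
    sample.groupby("Group")[salary_cols]
    .mean()
    .sort_values("Mid-Career Median Salary", ascending=False)
)

print(group_means)
```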
