In [2]:
import pandas as pd
df = pd.read_csv('salaries_by_college_major.csv')

In [5]:
df.head()


Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


<h2> A Look at Our Dataset </h2>

Let's start by checking the number of rows and columns in our dataset.

In [96]:
df.shape

(51, 6)

<h3>Cleaning our Salary Dataset</h3>

Overall, our dataset is very simple and requires minimal cleaning. Let's start by checking for NaN values.

In [70]:
df.isna().sum()

Undergraduate Major                  0
Starting Median Salary               1
Mid-Career Median Salary             1
Mid-Career 10th Percentile Salary    1
Mid-Career 90th Percentile Salary    1
Group                                1
dtype: int64

It looks like we have NaN values: one in every column except `Undergraduate Major`. Let's take a closer look.

In [71]:
df.isna()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


Our last row (50) has NaN values. Let's take a look at which major this corresponds to.
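Scanning the full Boolean frame by eye gets tedious, so the rows that hold a NaN can also be selected directly. The following is a minimal sketch on a small stand-in table (the `sample` frame is made up from a few rows of this dataset and is not the full CSV):

```python
import numpy as np
import pandas as pd

# Small stand-in for the salary table; the last row mimics the footer line.
sample = pd.DataFrame({
    "Undergraduate Major": ["Sociology", "Spanish", "Source: PayScale Inc."],
    "Starting Median Salary": [36500.0, 34000.0, np.nan],
    "Group": ["HASS", "HASS", np.nan],
})

# isna() gives a Boolean frame; any(axis=1) flags rows with at least one NaN.
rows_with_nan = sample[sample.isna().any(axis=1)]
print(rows_with_nan)
```

Applied to `df`, the same mask should return only the rows that need a closer look.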

In [79]:
df.iloc[48:51]

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
50,Source: PayScale Inc.,,,,,


We can see that row 50 does not hold a major at all: the `Undergraduate Major` column contains the source attribution `Source: PayScale Inc.`, and every other column is NaN. We can safely drop that row, since it has neither a major nor any salary data.

In [80]:
clean_df = df.dropna()
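If the footer is known in advance, it can also be skipped while reading the file. This is only a sketch under the assumption that the attribution is exactly one trailing line; the inline `csv_text` is a made-up excerpt standing in for the real CSV:

```python
import io
import pandas as pd

# Made-up excerpt of the file: a header, two majors, and the footer line.
csv_text = "\n".join([
    "Undergraduate Major,Starting Median Salary,Group",
    "Sociology,36500.0,HASS",
    "Spanish,34000.0,HASS",
    "Source: PayScale Inc.,,",
])

# skipfooter drops lines from the end of the file; it needs the Python engine.
trimmed = pd.read_csv(io.StringIO(csv_text), skipfooter=1, engine="python")
print(trimmed)
```

`dropna()` remains the safer choice when it is unclear how many trailing lines the file has.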

In [83]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Undergraduate Major                50 non-null     object 
 1   Starting Median Salary             50 non-null     float64
 2   Mid-Career Median Salary           50 non-null     float64
 3   Mid-Career 10th Percentile Salary  50 non-null     float64
 4   Mid-Career 90th Percentile Salary  50 non-null     float64
 5   Group                              50 non-null     object 
dtypes: float64(4), object(2)
memory usage: 2.7+ KB


In [92]:
len(clean_df)

50

In [93]:
clean_df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
45,Political Science,40800.0,78200.0,41200.0,168000.0,HASS
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS


<h2>A Look at Highest Salaries by Major</h2>

We can now sort our majors by `Starting Median Salary`, from highest to lowest.

In [111]:
sorted_by_salary = clean_df.sort_values(by='Starting Median Salary', ascending=False)
sorted_by_salary

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
43,Physician Assistant,74300.0,91700.0,66400.0,124000.0,STEM
8,Chemical Engineering,63200.0,107000.0,71900.0,194000.0,STEM
12,Computer Engineering,61400.0,105000.0,66100.0,162000.0,STEM
19,Electrical Engineering,60900.0,103000.0,69300.0,168000.0,STEM
38,Mechanical Engineering,57900.0,93600.0,63700.0,163000.0,STEM
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
30,Industrial Engineering,57700.0,94700.0,57100.0,173000.0,STEM
13,Computer Science,55900.0,95500.0,56000.0,154000.0,STEM
40,Nursing,54200.0,67000.0,47600.0,98300.0,Business
10,Civil Engineering,53900.0,90500.0,63400.0,148000.0,STEM


Let's look at the top 10. 

In [123]:
top_10 = sorted_by_salary.iloc[0:10].reset_index()

In [124]:
top_10

Unnamed: 0,index,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,43,Physician Assistant,74300.0,91700.0,66400.0,124000.0,STEM
1,8,Chemical Engineering,63200.0,107000.0,71900.0,194000.0,STEM
2,12,Computer Engineering,61400.0,105000.0,66100.0,162000.0,STEM
3,19,Electrical Engineering,60900.0,103000.0,69300.0,168000.0,STEM
4,38,Mechanical Engineering,57900.0,93600.0,63700.0,163000.0,STEM
5,1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
6,30,Industrial Engineering,57700.0,94700.0,57100.0,173000.0,STEM
7,13,Computer Science,55900.0,95500.0,56000.0,154000.0,STEM
8,40,Nursing,54200.0,67000.0,47600.0,98300.0,Business
9,10,Civil Engineering,53900.0,90500.0,63400.0,148000.0,STEM
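Sorting and slicing can also be done in one step with `nlargest`. Here is a small sketch on a made-up four-row `sample` frame; passing `drop=True` to `reset_index` avoids the extra `index` column seen in the table above:

```python
import pandas as pd

# Hypothetical subset of the salary table, deliberately out of order.
sample = pd.DataFrame({
    "Undergraduate Major": [
        "Accounting", "Physician Assistant", "Spanish", "Chemical Engineering",
    ],
    "Starting Median Salary": [46000.0, 74300.0, 34000.0, 63200.0],
})

# nlargest returns the n rows with the largest values, already sorted.
top_2 = sample.nlargest(2, "Starting Median Salary").reset_index(drop=True)
print(top_2)
```

The matching method for the other end of the table is `nsmallest`.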


<h3>Physician Assistants Lead the Pack in Starting Salary</h3>

Physician Assistants, with a `Starting Median Salary` of $74,300, come in at number 1.

<h3>But the Highest Mid-Career Salary Goes To...</h3>

In [126]:
top_10.loc[top_10['Mid-Career Median Salary'].idxmax()]

index                                                   8
Undergraduate Major                  Chemical Engineering
Starting Median Salary                            63200.0
Mid-Career Median Salary                         107000.0
Mid-Career 10th Percentile Salary                 71900.0
Mid-Career 90th Percentile Salary                194000.0
Group                                                STEM
Name: 1, dtype: object

Among these ten majors, `Chemical Engineers` are rocking the highest `Mid-Career Median Salary` with a healthy <b>$107,000/year</b>.

<h2>Lowest Salaries by Major</h2>

Let's look at the 10 lowest starting salaries, by major.

In [152]:
lowest_10 = clean_df.sort_values(by='Starting Median Salary').iloc[0:10]

In [162]:
lowest_10 = lowest_10.reset_index(drop=True)


Let's sort from highest to lowest.

In [166]:
lowest_10_sorted = lowest_10.sort_values(by='Starting Median Salary', ascending=False).iloc[0:10]

In [168]:
lowest_10_sorted

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
7,Drama,35900.0,56900.0,36700.0,153000.0,HASS
8,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
9,Music,35900.0,55000.0,26700.0,134000.0,HASS
6,Art History,35800.0,64900.0,28800.0,125000.0,HASS
5,Graphic Design,35700.0,59800.0,36000.0,112000.0,HASS
4,Journalism,35600.0,66700.0,38400.0,145000.0,HASS
3,Criminal Justice,35000.0,56300.0,32200.0,107000.0,HASS
2,Education,34900.0,52000.0,29300.0,102000.0,HASS
1,Religion,34100.0,52000.0,29700.0,96400.0,HASS
0,Spanish,34000.0,53100.0,31000.0,96400.0,HASS


If your goal is to make money, don't major in Spanish or Religion!

<h3>Lowest Mid-Career Salary</h3>

In [173]:
lowest_mid = lowest_10_sorted['Mid-Career Median Salary'].min()
lowest_10_sorted[lowest_10_sorted['Mid-Career Median Salary'] == lowest_mid]

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
2,Education,34900.0,52000.0,29300.0,102000.0,HASS
1,Religion,34100.0,52000.0,29700.0,96400.0,HASS


Within this group, `Education` and `Religion` tie for the lowest mid-career (10 years of experience) earnings potential. We are looking at a median of $52,000 a year.
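One thing to watch for when looking up the lowest value: `DataFrame.min()` reduces every column on its own, so the result can mix values from different rows (for a text column it returns the alphabetically first entry). The sketch below uses a made-up four-row `sample` frame to contrast that with a Boolean mask, which keeps whole rows and also shows ties:

```python
import pandas as pd

# Hypothetical subset of the lowest-paying majors.
sample = pd.DataFrame({
    "Undergraduate Major": ["Art History", "Education", "Religion", "Spanish"],
    "Starting Median Salary": [35800.0, 34900.0, 34100.0, 34000.0],
    "Mid-Career Median Salary": [64900.0, 52000.0, 52000.0, 53100.0],
})
col = "Mid-Career Median Salary"

# Column-wise minimums: each value may come from a different row.
column_minimums = sample.min()

# Row lookup: keep every row whose mid-career median equals the minimum.
lowest_rows = sample[sample[col] == sample[col].min()]
print(lowest_rows)
```

`idxmin()` would also work for a row lookup, but it returns only the first label when several rows tie.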

<h2>Looking for Low Risk Majors? Nursing Leads the Way</h2>

We can compute the difference between the mid-career 90th and 10th percentile salaries for each major to get a sense of risk. The smaller the difference, the more certain you can be about your mid-career earnings. The larger the difference, the more uncertainty there is about what you will end up earning.

In [179]:
spread_col = clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']

In [184]:
clean_df.insert(1, 'Spread', spread_col)
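Because `insert` modifies `clean_df` in place, re-running that cell raises a `ValueError` once the `Spread` column exists. A non-mutating alternative is `assign`, sketched here on a made-up three-row `sample` frame (note that `assign` appends the new column at the end instead of at position 1):

```python
import pandas as pd

# Hypothetical subset with only the columns needed for the spread.
sample = pd.DataFrame({
    "Undergraduate Major": ["Nursing", "Spanish", "Physician Assistant"],
    "Mid-Career 10th Percentile Salary": [47600.0, 31000.0, 66400.0],
    "Mid-Career 90th Percentile Salary": [98300.0, 96400.0, 124000.0],
})

# assign returns a new frame, so the original stays untouched
# and the cell can be re-run safely.
with_spread = (
    sample.assign(
        Spread=sample["Mid-Career 90th Percentile Salary"]
        - sample["Mid-Career 10th Percentile Salary"]
    )
    .sort_values(by="Spread")
    .reset_index(drop=True)
)
print(with_spread[["Undergraduate Major", "Spread"]])
```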

In [203]:
risk_df = clean_df.sort_values(by='Spread').reset_index().drop(columns='index')

In [205]:
risk_df.head()

Unnamed: 0,Undergraduate Major,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Nursing,50700.0,54200.0,67000.0,47600.0,98300.0,Business
1,Physician Assistant,57600.0,74300.0,91700.0,66400.0,124000.0,STEM
2,Nutrition,65300.0,39900.0,55300.0,33900.0,99200.0,HASS
3,Spanish,65400.0,34000.0,53100.0,31000.0,96400.0,HASS
4,Health Care Administration,66400.0,38800.0,60600.0,34600.0,101000.0,Business


When it comes to salary numbers, Nursing offers the lowest risk and the most certainty. With a spread of `50700.0`, Nursing is your safe bet. Let's take a look at the columns separately.

In [215]:
risk_df[['Undergraduate Major', 'Spread']].head()

Unnamed: 0,Undergraduate Major,Spread
0,Nursing,50700.0
1,Physician Assistant,57600.0
2,Nutrition,65300.0
3,Spanish,65400.0
4,Health Care Administration,66400.0


<h3>Which Majors Offer the Highest Earnings Potential?</h3>