In [262]:
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:,.2f}'.format 

In [225]:
# Load CSV File
df = pd.read_csv("salaries_by_college_major.csv")

# Quick look at the DataFrame

In [226]:
df.head()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


# Answer These Questions
Now that we've got our data loaded into our dataframe, we need to take a closer look at it to help us understand what it is we are working with. This is always the first step with any data science project. Let's see if we can answer the following questions: 

* How many rows does our dataframe have?  
* How many columns does it have? 
* What are the labels for the columns? Do the columns have names? 
* Are there any missing values in our dataframe? Does our dataframe contain any bad data?

In [227]:
df.shape

(51, 6)

In [228]:
# 51 rows and 6 columns. Let's take a look at the column names
df.columns

Index(['Undergraduate Major', 'Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 90th Percentile Salary', 'Group'],
      dtype='object')

# Missing Values and Junk Data
Before we can proceed with our analysis, we should check whether the dataframe contains any missing or junk data. 
That way we can avoid problems later on. In this case, we're going to look for NaN (Not a Number) values in our dataframe. 
In pandas, NaN marks a missing value, typically a cell that was blank in the source file. 
Use the .isna() method and see if you can spot a problem somewhere.
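
Scanning a full table of True/False values by eye is error-prone, so it can help to count the missing values per column instead. The sketch below is only an illustration: the `toy` DataFrame and its values are made up and are not the salary data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row mimics a footer line with no numbers
toy = pd.DataFrame({
    "Major": ["Accounting", "Biology", "Source: example"],
    "Salary": [46000.0, 38800.0, np.nan],
})

# isna() returns a DataFrame of booleans; sum() counts the True values per column
missing_per_column = toy.isna().sum()

# any(axis=1) flags every row that has at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

The same two expressions should work unchanged on `df`, assuming it was loaded as in the cells above.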

In [229]:
# Note the parentheses: df.isna without them only displays the bound method and never runs the check
df.isna()

In [230]:
# Did you find anything? Check the last couple of rows in the dataframe:
df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
50,Source: PayScale Inc.,,,,,


In [231]:
# Aha! We have a row that contains some information regarding the source of the data with blank values for all the other columns.

In [232]:
# Delete the last row: dropna() removes every row that contains at least one NaN
clean_df = df.dropna()
clean_df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
45,Political Science,40800.0,78200.0,41200.0,168000.0,HASS
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
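
One thing to keep in mind is that `dropna()` with no arguments removes every row that has any missing value, not only a footer row. If a dataset also had genuine gaps you wanted to keep, a narrower call might be preferable. The following is a minimal sketch on made-up data (the `toy` frame and its column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has one genuine gap, row 2 is a footer line
toy = pd.DataFrame({
    "Major": ["Accounting", "Biology", "Source: example"],
    "Starting": [46000.0, np.nan, np.nan],
    "Mid": [77100.0, 64800.0, np.nan],
})

# dropna() with no arguments drops every row containing any NaN (rows 1 and 2)
dropped_any = toy.dropna()

# Restrict the check to the numeric columns and drop only rows where all of them are NaN
numeric_cols = ["Starting", "Mid"]
dropped_footer = toy.dropna(subset=numeric_cols, how="all")

print(len(dropped_any), len(dropped_footer))
```

In the salary data only the footer row contains NaN values, so the plain `dropna()` used above gives the same result there.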


# Accessing Columns and Individual Cells in a Dataframe
Find College Major with Highest Starting Salaries

To access a particular column from a data frame we can use the square bracket notation, like so:

```clean_df['Starting Median Salary']```

You should see all the values printed out below the cell for just this column:

In [233]:
clean_df["Starting Median Salary"]

0     46000.0
1     57700.0
2     42600.0
3     36800.0
4     41600.0
5     35800.0
6     38800.0
7     43000.0
8     63200.0
9     42600.0
10    53900.0
11    38100.0
12    61400.0
13    55900.0
14    53700.0
15    35000.0
16    35900.0
17    50100.0
18    34900.0
19    60900.0
20    38000.0
21    37900.0
22    47900.0
23    39100.0
24    41200.0
25    43500.0
26    35700.0
27    38800.0
28    39200.0
29    37800.0
30    57700.0
31    49100.0
32    36100.0
33    40900.0
34    35600.0
35    49200.0
36    40800.0
37    45400.0
38    57900.0
39    35900.0
40    54200.0
41    39900.0
42    39900.0
43    74300.0
44    50300.0
45    40800.0
46    35900.0
47    34100.0
48    36500.0
49    34000.0
Name: Starting Median Salary, dtype: float64

## To find the highest starting salary we can simply chain the .max() method.

In [234]:
clean_df["Starting Median Salary"].max()

74300.0

## The highest starting salary is $74,300. But which college major earns this much? For this, 
## we need to know the row index so that we can look up the name of the major. Lucky for us, the ```.idxmax()``` method will 
## give us the index of the row with the largest value.

In [235]:
# idxmax() returns 43 here. To see the name of the major in that row, we can use the .loc (location) property.
clean_df["Undergraduate Major"].loc[clean_df["Starting Median Salary"].idxmax()]

'Physician Assistant'
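
It is worth noting that `.idxmax()` returns an index label rather than a row position, which is why it pairs with `.loc`. The sketch below shows this on a small made-up frame with non-default labels (the `toy` data is an assumption, not part of the salary file); `.loc[row, column]` fetches a single cell in one step.

```python
import pandas as pd

# Hypothetical toy data with labels 10, 20, 30 instead of 0, 1, 2
toy = pd.DataFrame(
    {"Major": ["Art", "Physics", "Nursing"],
     "Starting": [35800.0, 50300.0, 54200.0]},
    index=[10, 20, 30],
)

# idxmax() gives the index label of the largest value, not its position
top_label = toy["Starting"].idxmax()

# .loc[row_label, column_name] looks up one cell
top_major = toy.loc[top_label, "Major"]

# .loc[row_label] returns the whole row as a Series
top_row = toy.loc[top_label]

print(top_label, top_major)
```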

# Challenges
Now that we've found the major with the highest starting salary, can you write the code to find the following:

* What college major has the highest mid-career salary? How much do graduates with this major earn? (Mid-career is defined as having 10+ years of experience).

* Which college major has the lowest starting salary and how much do graduates earn after university?

* Which college major has the lowest mid-career salary and how much can people expect to earn with this degree? 

In [236]:
# Highest mid-career salary major 
# First we select the major column, then we get the index of the highest salary and pass it to .loc so that only the name of the major is returned
clean_df["Undergraduate Major"].loc[clean_df["Mid-Career Median Salary"].idxmax()]

'Chemical Engineering'

In [237]:
# Which college major has the lowest starting salary and how much do graduates earn after university?
clean_df["Undergraduate Major"].loc[clean_df["Starting Median Salary"].idxmin()]

'Spanish'

In [238]:
clean_df[clean_df["Undergraduate Major"] == "Spanish"][["Undergraduate Major", "Starting Median Salary"]]

Unnamed: 0,Undergraduate Major,Starting Median Salary
49,Spanish,34000.0


In [239]:
# Which college major has the lowest mid-career salary and how much can people expect to earn with this degree?
clean_df["Undergraduate Major"].loc[clean_df["Mid-Career Median Salary"].idxmin()]

'Education'

In [240]:
clean_df[clean_df["Undergraduate Major"] == "Education"][["Undergraduate Major", "Mid-Career Median Salary"]]

Unnamed: 0,Undergraduate Major,Mid-Career Median Salary
18,Education,52000.0
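
The challenge answers above all follow the same pattern: find the index of the extreme value, then look up the major and the salary. One possible way to avoid repeating it is a small helper. This is only a sketch; `extreme_major` is a hypothetical name introduced here, and the `toy` frame is made-up data.

```python
import pandas as pd

def extreme_major(frame, salary_col, kind="max", name_col="Undergraduate Major"):
    """Return (major, salary) for the row with the largest or smallest salary."""
    if kind not in ("max", "min"):
        raise ValueError("kind must be 'max' or 'min'")
    series = frame[salary_col]
    # idxmax()/idxmin() return the index label of the extreme value
    label = series.idxmax() if kind == "max" else series.idxmin()
    return frame.loc[label, name_col], frame.loc[label, salary_col]

# Hypothetical toy data using the same column names as the salary file
toy = pd.DataFrame({
    "Undergraduate Major": ["Art", "Physics", "Nursing"],
    "Mid-Career Median Salary": [64900.0, 97300.0, 67000.0],
})

high = extreme_major(toy, "Mid-Career Median Salary", "max")
low = extreme_major(toy, "Mid-Career Median Salary", "min")
print(high, low)
```

Called as `extreme_major(clean_df, "Starting Median Salary", "min")`, it should return the major and the salary in a single step.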


In [241]:
spread_col = clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']
# Insert the spread as the second column. Run this cell only once:
# insert() raises a ValueError if the "Spread" column already exists.
clean_df.insert(loc=1, column="Spread", value=spread_col)
clean_df

Unnamed: 0,Undergraduate Major,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,109800.0,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,96700.0,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,113700.0,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,104200.0,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,85400.0,41600.0,76800.0,50600.0,136000.0,Business
5,Art History,96200.0,35800.0,64900.0,28800.0,125000.0,HASS
6,Biology,98100.0,38800.0,64800.0,36900.0,135000.0,STEM
7,Business Management,108200.0,43000.0,72100.0,38800.0,147000.0,Business
8,Chemical Engineering,122100.0,63200.0,107000.0,71900.0,194000.0,STEM
9,Chemistry,102700.0,42600.0,79900.0,45300.0,148000.0,STEM


In [242]:
# Sorted from the largest spread to the smallest, so the lowest-risk majors appear at the bottom
low_risk = clean_df["Spread"]
low_risk.sort_values(ascending=False)

17    159400.0
22    147800.0
37    137800.0
36    132900.0
42    132500.0
45    126800.0
8     122100.0
44    122000.0
33    118800.0
16    116300.0
30    115900.0
14    114700.0
2     113700.0
28    112000.0
25    111000.0
0     109800.0
7     108200.0
39    107300.0
34    106600.0
11    105500.0
3     104200.0
9     102700.0
21    102100.0
35    100700.0
20     99600.0
38     99300.0
19     98700.0
6      98100.0
13     98000.0
1      96700.0
5      96200.0
12     95900.0
46     95400.0
24     92000.0
29     88500.0
48     87300.0
4      85400.0
10     84600.0
31     84500.0
26     76000.0
15     74800.0
18     72700.0
32     71300.0
23     70000.0
47     66700.0
27     66400.0
49     65400.0
41     65300.0
43     57600.0
40     50700.0
Name: Spread, dtype: float64
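
To list the lowest-risk majors first, the sort direction can be flipped. The sketch below also uses `assign()` rather than `insert()`, which may be convenient because it returns a new DataFrame and can be re-run safely. The `toy` data and the short column names `P10` and `P90` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: 10th and 90th percentile mid-career salaries
toy = pd.DataFrame({
    "Major": ["Economics", "Nursing", "Spanish"],
    "P10": [50600.0, 47600.0, 31000.0],
    "P90": [210000.0, 98300.0, 96400.0],
})

# assign() returns a new DataFrame and leaves the original untouched
with_spread = toy.assign(Spread=toy["P90"] - toy["P10"])

# ascending=True puts the smallest spread (least variation in earnings) first
low_risk = with_spread.sort_values("Spread", ascending=True)

print(low_risk[["Major", "Spread"]])
```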

## Challenge


* Using the .sort_values() method, can you find the degrees with the highest potential? Find the top 5 degrees with the highest values in the 90th percentile. 

* Also, find the degrees with the greatest spread in salaries. Which majors have the largest difference between high and low earners after graduation?


In [243]:
# find the degrees with the highest potential? Find the top 5 degrees with the highest values in the 90th percentile. 
highest_potential = clean_df.sort_values("Mid-Career 90th Percentile Salary", ascending=False)
highest_potential[["Undergraduate Major", "Mid-Career 90th Percentile Salary"]].head()

Unnamed: 0,Undergraduate Major,Mid-Career 90th Percentile Salary
17,Economics,210000.0
22,Finance,195000.0
8,Chemical Engineering,194000.0
37,Math,183000.0
44,Physics,178000.0


In [244]:
# Find the degrees with the greatest spread in salaries. Which majors have the largest difference between high and low earners after graduation?
greatest_spread = clean_df.sort_values("Spread", ascending=False)
greatest_spread[["Undergraduate Major", "Spread"]].head()

Unnamed: 0,Undergraduate Major,Spread
17,Economics,159400.0
22,Finance,147800.0
37,Math,137800.0
36,Marketing,132900.0
42,Philosophy,132500.0
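
As an alternative to sorting the whole frame and taking the head, pandas also offers `nlargest()`, which selects the top rows directly. A minimal sketch on made-up data (the `toy` frame and its `P90` column are assumptions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "P90": [150000.0, 210000.0, 96400.0, 183000.0],
})

# Approach used above: sort everything, then keep the first rows
top2_sorted = toy.sort_values("P90", ascending=False).head(2)

# nlargest(n, column) returns the n rows with the largest values in that column
top2_nlargest = toy.nlargest(2, "P90")

print(top2_nlargest)
```

Both approaches select the same rows here; `nsmallest()` is the counterpart for the lowest values.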


## Grouping and Pivoting Data with Pandas
* Often you will want to aggregate rows that belong to a particular category. For example, which category of degrees has the highest average salary? Is it STEM, Business or HASS (Humanities, Arts, and Social Science)? 

* To answer this question we need to learn to use the ```.groupby()``` method. This allows us to manipulate data in a way similar to a Microsoft Excel Pivot Table.

* We have three categories in the 'Group' column: STEM, HASS and Business. Let's compute the average of every numeric column for each category:

In [263]:
numeric_columns = clean_df.select_dtypes(include=[np.number])
clean_df.groupby("Group")[numeric_columns.columns].mean()  # apply the mean to all numeric columns

Group,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary
Business,103958.33,44633.33,75083.33,43566.67,147525.0
HASS,95218.18,37186.36,62968.18,34145.45,129363.64
STEM,101600.0,53862.5,90812.5,56025.0,157625.0
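
The same `.groupby()` call can also count how many majors fall into each category, and `mean(numeric_only=True)` is a shorter way to skip text columns in recent pandas versions. The sketch below uses made-up data (the `toy` frame is an assumption, not the salary file):

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({
    "Major": ["Accounting", "Finance", "Physics", "History"],
    "Group": ["Business", "Business", "STEM", "HASS"],
    "Starting": [46000.0, 47900.0, 50300.0, 39200.0],
})

# count() on a single column gives the number of non-missing entries per group
majors_per_group = toy.groupby("Group")["Major"].count()

# numeric_only=True restricts the mean to numeric columns
mean_by_group = toy.groupby("Group").mean(numeric_only=True)

print(majors_per_group)
print(mean_by_group)
```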
