In [None]:
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:,.2f}'.format 

In [None]:
# Load CSV File
df = pd.read_csv("salaries_by_college_major.csv")

# Quick look at the DataFrame

In [None]:
df.head()

# Answer These Questions
Now that we've got our data loaded into our dataframe, we need to take a closer look at it to help us understand what it is we are working with. This is always the first step with any data science project. Let's see if we can answer the following questions: 

* How many rows does our dataframe have?  
* How many columns does it have? 
* What are the labels for the columns? Do the columns have names? 
* Are there any missing values in our dataframe? Does our dataframe contain any bad data?

In [None]:
df.shape

In [None]:
# 51 rows and 6 columns; let's take a look at the column names
df.columns

# Missing Values and Junk Data
Before we proceed with our analysis, we should check whether our dataframe contains any missing or junk data. 
That way we can avoid problems later on. In this case, we're going to look for NaN (Not a Number) values in our dataframe. 
NaN values mark missing data, such as blank cells in the CSV file. 
Use the .isna() method and see if you can spot a problem somewhere.

In [None]:
df.isna()
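
On a dataframe with many rows, the full True/False table from .isna() is hard to scan by eye. As a minimal sketch, the cell below counts the missing values per column and pulls out the affected rows. The `toy` dataframe and its values are made up for illustration and only stand in for the real salary data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data; the last row mimics a footer line.
toy = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B", "Source: example footer"],
    "Starting Median Salary": [40000.0, 55000.0, np.nan],
    "Group": ["STEM", "HASS", np.nan],
})

# Number of NaN values in each column
missing_per_column = toy.isna().sum()

# Only the rows that contain at least one NaN value
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

If every count is zero, there is nothing to clean up; otherwise `rows_with_missing` shows where to look.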

In [None]:
# Did you find anything? Check the last couple of rows in the dataframe:
df.tail()

In [None]:
# Aha! We have a row that contains some information regarding the source of the data with blank values for all the other columns.

In [None]:
# Delete the last row: dropna() removes every row that contains a NaN value
clean_df = df.dropna()
clean_df.tail()
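
Keep in mind that .dropna() with no arguments removes any row that has at least one NaN, not only the last one. The sketch below, on made-up data, contrasts that default with a narrower call that drops a row only when all of the chosen salary columns are empty. The `toy` dataframe and the `salary_cols` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one missing salary, row 2 is an empty footer row.
toy = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B", "Source: example footer"],
    "Starting Median Salary": [40000.0, np.nan, np.nan],
    "Mid-Career Median Salary": [90000.0, 70000.0, np.nan],
})

# Default behaviour: drop every row that contains any NaN
dropped_any = toy.dropna()

# Narrower: drop a row only if all of the listed salary columns are NaN
salary_cols = ["Starting Median Salary", "Mid-Career Median Salary"]
dropped_footer = toy.dropna(subset=salary_cols, how="all")

print(dropped_any)
print(dropped_footer)
```

Which variant fits depends on whether partially filled rows are still useful for the analysis.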

# Accessing Columns and Individual Cells in a Dataframe
Find the college major with the highest starting salary

To access a particular column from a data frame we can use the square bracket notation, like so:

```clean_df['Starting Median Salary']```

You should see all the values printed out below the cell for just this column:

In [None]:
clean_df["Starting Median Salary"]

## To find the highest starting salary we can simply chain the .max() method.

In [None]:
clean_df["Starting Median Salary"].max()

## The highest starting salary is $74,300. But which college major earns this much on average? For this, 
## we need to know the row number or index so that we can look up the name of the major. Lucky for us, the ```.idxmax()``` method will 
## give us the index of the row with the largest value.

In [None]:
# The index is 43. To see the name of the major in that row, we can use the .loc (location) property.
clean_df["Undergraduate Major"].loc[clean_df["Starting Median Salary"].idxmax()]
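
The same lookup can also be written with a row label and a column label passed to .loc together. This is a small sketch on made-up data; the `toy` dataframe and its salary figures are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B", "Major C"],
    "Starting Median Salary": [40000.0, 55000.0, 48000.0],
})

# Index label of the row with the largest starting salary
top_idx = toy["Starting Median Salary"].idxmax()

# .loc[row_label, column_label] returns a single cell
top_major = toy.loc[top_idx, "Undergraduate Major"]

# .loc[row_label] returns the whole row as a Series
top_row = toy.loc[top_idx]

print(top_idx, top_major)
print(top_row)
```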

# Challenges
Now that we've found the major with the highest starting salary, can you write the code to find the following:

* What college major has the highest mid-career salary? How much do graduates with this major earn? (Mid-career is defined as having 10+ years of experience).

* Which college major has the lowest starting salary and how much do graduates with this major earn after university?

* Which college major has the lowest mid-career salary and how much can people expect to earn with this degree? 

In [None]:
# Highest mid-career salary major 
# First we select the major column, then we get the index of the highest salary with .idxmax() and pass it to .loc to look up the name of the major
clean_df["Undergraduate Major"].loc[clean_df["Mid-Career Median Salary"].idxmax()]

In [None]:
# Which college major has the lowest starting salary and how much do graduates with this major earn after university?
clean_df["Undergraduate Major"].loc[clean_df["Starting Median Salary"].idxmin()]

In [None]:
clean_df[clean_df["Undergraduate Major"] == "Spanish"][["Undergraduate Major", "Starting Median Salary"]]

In [None]:
# Which college major has the lowest mid-career salary and how much can people expect to earn with this degree?
clean_df["Undergraduate Major"].loc[clean_df["Mid-Career Median Salary"].idxmin()]

In [None]:
clean_df[clean_df["Undergraduate Major"] == "Education"][["Undergraduate Major", "Mid-Career Median Salary"]]
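
The cells above look up the major first and then filter on its name by hand. As one possible shortcut, the sketch below wraps both steps in a small helper that returns the major and the salary together. The function name `extreme_major` and the `toy` data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def extreme_major(df, salary_col, lowest=False):
    """Return the major and salary for the lowest or highest value in salary_col."""
    idx = df[salary_col].idxmin() if lowest else df[salary_col].idxmax()
    # Select one row and two columns; the result is a Series
    return df.loc[idx, ["Undergraduate Major", salary_col]]

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B", "Major C"],
    "Starting Median Salary": [40000.0, 55000.0, 48000.0],
    "Mid-Career Median Salary": [90000.0, 70000.0, 85000.0],
})

highest_mid = extreme_major(toy, "Mid-Career Median Salary")
lowest_start = extreme_major(toy, "Starting Median Salary", lowest=True)
lowest_mid = extreme_major(toy, "Mid-Career Median Salary", lowest=True)

print(highest_mid)
print(lowest_start)
print(lowest_mid)
```

Because the helper never hardcodes a major's name, it keeps working if the underlying data changes.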

In [None]:
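# Spread between the 90th and 10th percentile mid-career salaries
spread_col = clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']
# Insert the column once; the guard keeps a re-run of this cell from raising a ValueError
if "Spread" not in clean_df.columns:
    clean_df.insert(loc=1, column="Spread", value=spread_col)
clean_df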

In [None]:
# Low risk means a small spread, so sort in ascending order (smallest spread first)
low_risk = clean_df["Spread"]
low_risk.sort_values()
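
An alternative to .insert() is .assign(), which returns a new dataframe with the extra column and leaves the original untouched, so the cell can be re-run safely. The sketch below uses made-up percentile values; the `toy` dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B", "Major C"],
    "Mid-Career 10th Percentile Salary": [40000.0, 50000.0, 30000.0],
    "Mid-Career 90th Percentile Salary": [100000.0, 80000.0, 150000.0],
})

# assign() builds a new dataframe; toy itself is not modified
with_spread = toy.assign(
    Spread=toy["Mid-Career 90th Percentile Salary"]
    - toy["Mid-Career 10th Percentile Salary"]
)

# Smallest spread first: the narrowest gap between low and high earners
low_risk = with_spread.sort_values("Spread")[["Undergraduate Major", "Spread"]]

print(low_risk)
```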

## Challenge


* Using the .sort_values() method, can you find the degrees with the highest potential? Find the top 5 degrees with the highest values in the 90th percentile. 

* Also, find the degrees with the greatest spread in salaries. Which majors have the largest difference between high and low earners after graduation?


In [None]:
# Degrees with the highest potential: the top 5 by 90th percentile mid-career salary
highest_potential = clean_df.sort_values("Mid-Career 90th Percentile Salary", ascending=False)
highest_potential[["Undergraduate Major", "Mid-Career 90th Percentile Salary"]].head()

In [None]:
# Degrees with the greatest spread: the majors with the largest difference between high and low earners
greatest_spread = clean_df.sort_values("Spread", ascending=False)
greatest_spread[["Undergraduate Major", "Spread"]].head()
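
Sorting and then calling .head() works well; pandas also offers .nlargest(), which does both in a single call. A small sketch on made-up data follows. The `toy` dataframe is hypothetical, and `n=2` is used only because the toy data has three rows.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B", "Major C"],
    "Mid-Career 90th Percentile Salary": [100000.0, 80000.0, 150000.0],
})

# The n rows with the largest values in the given column, largest first
top_two = toy.nlargest(2, "Mid-Career 90th Percentile Salary")

print(top_two[["Undergraduate Major", "Mid-Career 90th Percentile Salary"]])
```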

## Grouping and Pivoting Data with Pandas
* Often you will want to aggregate rows that belong to a particular category. For example, which category of degrees has the highest average salary? Is it STEM, Business or HASS (Humanities, Arts, and Social Science)? 

* To answer this question we need to learn to use the ```.groupby()``` method. This allows us to manipulate data similar to a Microsoft Excel Pivot Table.

* We have three categories in the 'Group' column: STEM, HASS and Business. Let's compute the average of each salary column for each category:

In [None]:
numeric_columns = clean_df.select_dtypes(include=[np.number])
clean_df.groupby("Group")[numeric_columns.columns].mean()  # apply the mean to all numeric columns
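
To see how many majors fall into each category, the same .groupby() call can be followed by .count() on a single column. The sketch below uses a made-up `toy` dataframe; the group assignments are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Undergraduate Major": ["Major A", "Major B", "Major C", "Major D"],
    "Group": ["STEM", "STEM", "HASS", "Business"],
})

# Number of majors in each category
majors_per_group = toy.groupby("Group")["Undergraduate Major"].count()

# value_counts() on the 'Group' column gives the same numbers, sorted by frequency
group_counts = toy["Group"].value_counts()

print(majors_per_group)
print(group_counts)
```

A category with very few majors deserves some caution, since its average rests on only a handful of rows.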