# Visualizing Earnings Based on College Majors

## Introduction

This project explores a dataset on the job outcomes of students 
who graduated from college between 2010 and 2012. The original data on 
job outcomes was released by the American Community Survey, 
which conducts surveys and aggregates the data. FiveThirtyEight cleaned 
the dataset and released it on their GitHub repo.

- American Community Survey: https://www.census.gov/programs-surveys/acs/
- FiveThirtyEight dataset: https://github.com/fivethirtyeight/data/tree/master/college-majors

Using visualizations, we can start to explore questions from the dataset like:

- Do students in more popular majors make more money? (Using scatter plots)
- How many majors are predominantly male? Predominantly female? (Using histograms)
- Which category of majors has the most students? (Using bar plots)

We'll explore how to do these and more while primarily working in pandas. Before we start creating data visualizations, let's import the libraries we need and remove rows containing null values.

Read the dataset into a DataFrame and start exploring the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')
first_row = recent_grads.iloc[0]
print(first_row)
print(recent_grads.head())
print(recent_grads.tail())
recent_grads.describe()

Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object

Drop rows with missing values. Matplotlib expects the columns 
of values we pass in to have matching lengths, and missing values 
can cause matplotlib to throw errors.

In [None]:
raw_data_count = recent_grads.shape[0]
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape[0]
print("Raw_data_count: " + str(raw_data_count))
print("cleaned_data_count: " + str(cleaned_data_count))
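
Before dropping rows, it can help to see how many values are missing and which rows are affected. The sketch below uses a small made-up DataFrame; the `sample` name and all of its values are assumptions for illustration, not rows from `recent-grads.csv`. The same calls should apply to `recent_grads`.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few columns of the dataset (not real data).
sample = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Total": [2339.0, np.nan, 856.0],
    "ShareWomen": [0.12, np.nan, 0.15],
    "Median": [110000, 53000, 73000],
})

# Count missing values per column before dropping anything.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Show the rows that dropna() would remove.
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)

# Drop them and report how many rows were lost.
cleaned = sample.dropna()
print(f"Rows dropped: {len(sample) - len(cleaned)}")
```

Checking the affected rows first makes it easier to judge whether dropping them could bias the later plots.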


## Pandas, Scatter Plots

Generate scatter plots in separate Jupyter notebook cells to 
explore the following relationships:
- Sample_size and Median,
- Sample_size and Unemployment_rate,
- Full_time and Median,
- ShareWomen and Unemployment_rate,
- Men and Median,
- Women and Median.

In [None]:
recent_grads.plot(x='Sample_size', y='Median',
                  kind='scatter', title='Sample_size vs. Median')

In [None]:
recent_grads.plot(x='Sample_size', y='Unemployment_rate',
                  kind='scatter', title='Sample_size vs. Unemployment_rate')

In [None]:
recent_grads.plot(x='Full_time', y='Median',
                  kind='scatter', title='Full_time vs. Median')

In [None]:
recent_grads.plot(x='ShareWomen', y='Unemployment_rate',
                  kind='scatter', title='ShareWomen vs. Unemployment_rate')

In [None]:
recent_grads.plot(x='Men', y='Median',
                  kind='scatter', title='Men vs. Median')

In [None]:
recent_grads.plot(x='Women', y='Median',
                  kind='scatter', title='Women vs. Median')

Using the plots we can explore the following questions:

Do students in more popular majors make more money?
- The Sample_size vs. Median scatter plot shows no clear correlation. The median salary of full-time, year-round workers does not grow linearly with Sample_size.

Do students that majored in subjects that were majority female make more money?
- The Women vs. Median scatter plot shows no clear correlation, only a slight tendency for the median to decline as the number of women increases.

Is there any link between the number of full-time employees and median salary?
- The Full_time vs. Median scatter plot shows no clear link, only a slight tendency for the median to decline as the number of full-time employees increases.
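
The majority-female question can also be checked numerically by splitting majors on ShareWomen and comparing the typical Median of each group. This is a minimal sketch on made-up numbers; the `sample` DataFrame, the `Group` column, and its labels are hypothetical and not part of the dataset. It only shows the technique, not a result for the real data.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame({
    "ShareWomen": [0.10, 0.30, 0.45, 0.60, 0.75, 0.90],
    "Median": [70000, 60000, 50000, 40000, 35000, 30000],
})

# Label each major by whether women make up more than half of its graduates.
is_majority_female = sample["ShareWomen"] > 0.5
sample["Group"] = is_majority_female.map(
    {True: "majority female", False: "not majority female"}
)

# Compare the typical median salary of the two groups.
median_by_group = sample.groupby("Group")["Median"].median()
print(median_by_group)
```

Replacing `sample` with `recent_grads` should give the corresponding comparison for the real data.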

## Pandas, Histograms

Generate histograms in separate Jupyter notebook cells to explore 
the distributions of the following columns:
- Sample_size
- Median
- Employed
- Full_time
- ShareWomen
- Unemployment_rate
- Men
- Women

In [None]:
recent_grads['Sample_size'].hist(bins=10, range=(0,5000))

In [None]:
recent_grads['Median'].hist(bins=12, range=(0,120000))

In [None]:
recent_grads['Employed'].hist(bins=8, range=(0,400000))

In [None]:
recent_grads['Full_time'].hist(bins=12, range=(0,300000))

In [None]:
print((recent_grads['ShareWomen'] > 0.5).value_counts())

recent_grads['ShareWomen'].hist(bins=2, range=(0,1))

In [None]:
recent_grads['Unemployment_rate'].hist(bins=8, range=(0,0.20))

In [None]:
recent_grads['Men'].hist(bins=8, range=(0,200000))

In [None]:
recent_grads['Women'].hist(bins=14, range=(0,350000))

Using the plots we can explore the following questions:
- What percent of majors are predominantly male? Predominantly female?
- What's the most common median salary range?

Roughly 56% of the majors are predominantly female and roughly 44% are predominantly male. The most common median salary range is 30,000 to 40,000, based on the Median histogram.
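
Both answers can be computed directly instead of being read off the histograms. The sketch below assumes a small made-up `sample` DataFrame (hypothetical values, not the real dataset) and uses 10,000-wide bins to mirror the 12 bins over the range (0, 120000) used for the Median histogram.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame({
    "ShareWomen": [0.12, 0.35, 0.55, 0.62, 0.80],
    "Median": [62000, 45000, 36000, 33000, 31000],
})

# The mean of a boolean Series is the fraction of True values.
pct_majority_female = (sample["ShareWomen"] > 0.5).mean() * 100
print(f"Predominantly female: {pct_majority_female:.0f}%")

# Bin medians into 10,000-wide, left-closed ranges from 0 to 120,000.
bin_edges = list(range(0, 130000, 10000))
median_bins = pd.cut(sample["Median"], bins=bin_edges, right=False)

# The interval with the highest count is the most common range.
most_common_range = median_bins.value_counts().idxmax()
print(f"Most common median range: {most_common_range}")
```

Passing `right=False` makes the bins left-closed, which is closer to how matplotlib assigns values to histogram bins.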

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))

In [None]:
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], 
               figsize=(10,10))

In [None]:
scatter_matrix(recent_grads[['ShareWomen', 'Median']], figsize=(10,10))

In [None]:
scatter_matrix(recent_grads[['Full_time', 'Median']], figsize=(10,10))

We can explore the questions from the last few steps using these scatter matrix plots. Answering them fully may require creating more scatter matrix plots.

Do students in more popular majors make more money?
- This is not evident from the Sample_size vs. Median scatter matrix.

Do students that majored in subjects that were majority female make more money?
- The ShareWomen vs. Median scatter matrix shows a slight trend of declining median as ShareWomen increases.

Is there any link between the number of full-time employees and median salary?
- The Full_time vs. Median scatter matrix shows a slight trend of declining median as Full_time increases.
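
Slight trends like these are hard to judge by eye, so a correlation matrix can serve as a numeric companion to the scatter matrix. The following is a sketch on made-up data; the `sample` values are assumptions chosen so that Median falls as the other columns rise. `DataFrame.corr()` computes pairwise Pearson correlations by default.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame({
    "ShareWomen": [0.1, 0.3, 0.5, 0.7, 0.9],
    "Full_time": [1000, 5000, 20000, 40000, 90000],
    "Median": [70000, 55000, 45000, 38000, 30000],
})

# Pairwise Pearson correlation between every pair of columns.
corr = sample.corr()
print(corr.round(2))

# A negative value means Median tends to fall as ShareWomen rises.
print(corr.loc["ShareWomen", "Median"])
```

A coefficient near 0 would support a "no clear link" reading, while a value closer to -1 would indicate a stronger declining trend.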

Use bar plots to compare the percentages of women (ShareWomen) from the first ten rows and last ten rows of the recent_grads dataframe.

In [None]:
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', legend=False)
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen', legend=False)

Use bar plots to compare the unemployment rate (Unemployment_rate) from the first ten rows and last ten rows of the recent_grads dataframe.

In [None]:
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', legend=False)
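
To compare the first and last rows more easily, the two bar plots can share one figure and one y-axis. This is a minimal sketch using a made-up `sample` DataFrame with hypothetical major names and three rows per side instead of ten. The `matplotlib.use("Agg")` line is only there so the sketch runs outside a notebook and can be dropped when `%matplotlib inline` is active.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv.
sample = pd.DataFrame({
    "Major": [f"MAJOR {i}" for i in range(1, 7)],
    "ShareWomen": [0.12, 0.10, 0.15, 0.70, 0.80, 0.88],
})

n = 3  # number of rows to take from each end (10 in the project)

# One row, two columns; sharey keeps the bar heights directly comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sample.head(n).plot.bar(x="Major", y="ShareWomen", legend=False,
                        ax=axes[0], title=f"First {n} rows")
sample.tail(n).plot.bar(x="Major", y="ShareWomen", legend=False,
                        ax=axes[1], title=f"Last {n} rows")
axes[0].set_ylabel("ShareWomen")
fig.tight_layout()
```

Sharing the y-axis avoids the two plots being scaled independently, which can otherwise make small values look as large as big ones.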

## Future ideas

* Use a grouped bar plot to compare the number of men with the number of women in each category of majors.
* Use a box plot to explore the distributions of median salaries and unemployment rate.
* Use a hexagonal bin plot to visualize the columns that had dense scatter plots from earlier in the project.

http://pandas.pydata.org/pandas-docs/stable/visualization.html

In [None]:
# Group the rows by Major_category so majors in the same category are combined
grp_bar = recent_grads.groupby('Major_category')

# Sum the number of men and women within each category and plot them side by side
plt_grp_bar = grp_bar[['Men', 'Women']].sum()
plt_grp_bar.plot.bar()
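
The other two ideas from the list, box plots and a hexagonal bin plot, could be sketched along the same lines. The example below runs on randomly generated stand-in data; the `sample` DataFrame, its value ranges, and the `gridsize=20` setting are assumptions for illustration, not tuned for the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; not taken from recent-grads.csv.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Sample_size": rng.integers(10, 4000, size=200),
    "Median": rng.integers(22000, 110000, size=200),
    "Unemployment_rate": rng.uniform(0.0, 0.18, size=200),
})

fig, (ax_median, ax_unemp, ax_hex) = plt.subplots(1, 3, figsize=(15, 4))

# Box plots summarize each distribution: median, quartiles, and outliers.
sample["Median"].plot.box(ax=ax_median, title="Median")
sample["Unemployment_rate"].plot.box(ax=ax_unemp, title="Unemployment_rate")

# A hexbin plot colors each hexagon by how many points fall inside it,
# which helps where a scatter plot is too dense to read.
sample.plot.hexbin(x="Sample_size", y="Median", gridsize=20,
                   ax=ax_hex, title="Sample_size vs. Median")
fig.tight_layout()
```

The two box plots are kept on separate axes because Median and Unemployment_rate are on very different scales.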