## Importing Libraries

For this project, we will be using the `numpy`, `pandas`, and `altair` libraries. `altair` is a
data visualization library with many customization options.

If we wanted to use `matplotlib` instead, we would import it with `import matplotlib.pyplot as plt`, and to see the resulting `pyplot` plots as we execute cells we would add `%matplotlib inline`. For this exercise, however, charts created with `altair` are displayed by a JavaScript frontend integrated into our Jupyter notebook.

In [58]:
import numpy as np
import pandas as pd
import altair as alt

Now let's read in our data set as a `pandas` `DataFrame`.

In [59]:
recent_grads = pd.read_csv('recent-grads.csv')

## Exploring and cleaning the data

Our dataset is made up of college graduate data defined in this table:

Header | Description
---|---------
`Rank` | Rank by median earnings
`Major_code` | Major code, FO1DP in ACS PUMS
`Major` | Major description
`Major_category` | Category of major from Carnevale et al
`Total` | Total number of people with major
`Sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
`Men` | Male graduates
`Women` | Female graduates
`ShareWomen` | Women as share of total
`Employed` | Number employed (ESR == 1 or 2)
`Full_time` | Employed 35 hours or more
`Part_time` | Employed less than 35 hours
`Full_time_year_round` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
`Unemployed` | Number unemployed (ESR == 3)
`Unemployment_rate` | Unemployed / (Unemployed + Employed)
`Median` | Median earnings of full-time, year-round workers
`P25th` | 25th percentile of earnings
`P75th` | 75th percentile of earnings
`College_jobs` | Number with job requiring a college degree
`Non_college_jobs` | Number with job not requiring a college degree
`Low_wage_jobs` | Number in low-wage service jobs

Let's get some high-level information from our dataset, then clean any rows containing empty values.

In [60]:
recent_grads.describe(include='all')

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
count,173.0,173.0,173,172.0,172.0,172.0,173,172.0,173.0,173.0,...,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0
unique,,,173,,,,16,,,,...,,,,,,,,,,
top,,,"NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL ...",,,,Engineering,,,,...,,,,,,,,,,
freq,,,1,,,,29,,,,...,,,,,,,,,,
mean,87.0,3879.815029,,39370.081395,16723.406977,22646.674419,,0.522223,356.080925,31192.763006,...,8832.398844,19694.427746,2416.32948,0.068191,40151.445087,29501.445087,51494.219653,12322.635838,13284.49711,3859.017341
std,50.084928,1687.75314,,63483.491009,28122.433474,41057.33074,,0.231205,618.361022,50675.002241,...,14648.179473,33160.941514,4112.803148,0.030331,11470.181802,9166.005235,14906.27974,21299.868863,23789.655363,6944.998579
min,1.0,1100.0,,124.0,119.0,0.0,,0.0,2.0,0.0,...,0.0,111.0,0.0,0.0,22000.0,18500.0,22000.0,0.0,0.0,0.0
25%,44.0,2403.0,,4549.75,2177.5,1778.25,,0.336026,39.0,3608.0,...,1030.0,2453.0,304.0,0.050306,33000.0,24000.0,42000.0,1675.0,1591.0,340.0
50%,87.0,3608.0,,15104.0,5434.0,8386.5,,0.534024,130.0,11797.0,...,3299.0,7413.0,893.0,0.067961,36000.0,27000.0,47000.0,4390.0,4595.0,1231.0
75%,130.0,5503.0,,38909.75,14631.0,22553.75,,0.703299,338.0,31433.0,...,9948.0,16891.0,2393.0,0.087557,45000.0,33000.0,60000.0,14444.0,11783.0,3466.0


Do we have any rows which contain `NaN` or empty values?

In [61]:
na_rows = recent_grads[recent_grads.isna().any(axis=1)]
na_rows

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
21,22,1104,FOOD SCIENCE,,,,Agriculture & Natural Resources,,36,3149,...,1121,1735,338,0.096931,53000,32000,70000,1183,1274,485


Only one row contains `NaN` values. The columns with `NaN` values contain data key to our analysis, so let's just remove the entire row.

In [62]:
print(len(recent_grads))
recent_grads.dropna(inplace=True)
print(len(recent_grads))

173
172
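As a minimal sketch of what these two steps do, the toy frame below (hypothetical values, not taken from `recent-grads.csv`) shows how `isna().any(axis=1)` flags rows with at least one missing value and how `dropna()` removes them. Unlike the cell above, it assigns the result to a new name instead of using `inplace=True`; that is only a stylistic choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row is missing its 'Total' value.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Total': [100.0, np.nan, 250.0],
    'Median': [40000, 53000, 36000],
})

# True for every row that has at least one NaN in any column.
has_nan = toy.isna().any(axis=1)
na_rows = toy[has_nan]

# dropna() returns a new frame without those rows; the original is untouched.
toy_clean = toy.dropna()

print(len(toy), len(toy_clean))
```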


## In-depth analysis

Now that our data has been cleaned, we can begin some deeper analysis. We might be able to answer some interesting questions with the data available to us, such as:

* Does the sample size taken for each major correlate to its number of graduates?
* Does the number of graduates in a major have any correlation to median salary?
* Does gender have any correlation with the median salary of a major?
* Is there any correlation between number of full time (or part time) graduates in a major and median salary?
* What category of major has the highest number of majors? Which has the lowest?
* Which categories of major have the highest and lowest median salaries?
* When considering all median salaries in a category of major, which category has the highest average? Which category has the lowest?
* Are there any median salary outliers that are adversely affecting our calculated averages?

### Question 1: Does the sample size taken for each major correlate to its number of graduates?

In order to determine if the median salary for a major is properly represented, we need to establish a positive correlation between the sample size taken and the number of graduates in a given major. There are many factors that dictate a healthy sample size; more information can be found [here](https://www.statisticshowto.com/probability-and-statistics/find-sample-size/#Cochran). For now, we will simply identify whether there is a correlation between our `Sample_size` and `Total` columns by creating a scatter plot comparing the two values.

Note our chart is interactive and can be panned / zoomed.

In [63]:
S1 = alt.Chart(recent_grads).mark_point(color="#2d9fa1").encode(
    alt.X('Sample_size',title='Sample Size'),
    alt.Y('Total',title='Total Students')
).properties(
    title='Sample Size Taken Per Total Students by Major'
).interactive()

S1

There appears to be a positive correlation between the total number of students within a major and the sample size taken. This gives us some level of confidence that the median salary among majors is accurately represented.
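A scatter plot only lets us judge the relationship by eye. One way to put a number on it, which the analysis here does not rely on, is the Pearson correlation coefficient returned by `Series.corr`. The sketch below uses only the `Sample_size` and `Total` values of the five highest-ranked majors in the data set, so the value it prints describes that small sample and not the full data.

```python
import pandas as pd

# Sample_size and Total for the five highest-ranked majors in the data set.
sample = pd.DataFrame({
    'Sample_size': [36, 7, 3, 16, 289],
    'Total': [2339.0, 756.0, 856.0, 1258.0, 32260.0],
})

# Pearson's r ranges from -1 to 1; values near 1 indicate a strong
# positive linear relationship.
r = sample['Sample_size'].corr(sample['Total'])
print(round(r, 3))
```

On the full frame, the equivalent call would be `recent_grads['Sample_size'].corr(recent_grads['Total'])`.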

### Question 2: Does the number of graduates in a major have any correlation to median salary?

One might speculate that if a major has a high number of graduates, it is popular among students. We don't have data that asserts this, but we can at least see if there is a correlation between the `Total` number of students and `Median` salary in a given major (i.e. are certain majors popular because they have a high income?).

In [64]:
S2 = alt.Chart(recent_grads).mark_point(color="#2d9fa1").encode(
  alt.X('Total',title='Total Students'),
  alt.Y('Median',title='Median Salary')
).properties(
    title='Median Salary to Total Students in Each Major'
).interactive()

S2

There doesn't appear to be a correlation between the total number of graduates in a major and median salary. In fact, the highest-earning majors in our entire data set have a very low number of graduates. Perhaps this indicates that employees within these fields are in high demand due to the difficulty of the major's curriculum.

Since our data set is already sorted by rank (median salary in descending order, according to our column description table above), let's display the first 5 rows and see what we find.

In [65]:
recent_grads.head()

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,2339.0,2057.0,282.0,Engineering,0.120564,36,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,756.0,679.0,77.0,Engineering,0.101852,7,640,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,856.0,725.0,131.0,Engineering,0.153037,3,648,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,1258.0,1123.0,135.0,Engineering,0.107313,16,758,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,32260.0,21239.0,11021.0,Engineering,0.341631,289,25694,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972


Our 5 highest-ranked majors are all in the `Engineering` category. Since `Engineering` majors require lots of math, science, and critical thinking skills, it seems reasonable to suppose that difficult `Engineering` majors with few graduates (such as `PETROLEUM ENGINEERING`) earn higher incomes.

### Question 3: Does gender have any correlation with the median salary of a major?

Currently, our dataset only contains a `ShareWomen` column, which represents the share of graduates in a major who are women (stored as a fraction between 0 and 1). We implicitly know what a `ShareMen` column would look like, since "gender" is nominal, binary data (**in the capacity of the data we are working with**). Regardless, let's exercise `pandas` a little by adding a `ShareMen` column to our data set that represents the share of students in a major who are men.

Then, we'll plot each `ShareMen` and `ShareWomen` against `Median` in their own scatter plots. Lastly, we'll horizontally concatenate our plots and display them.

In [66]:
S3 = alt.Chart(recent_grads).mark_point(color="#2d9fa1").encode(
  alt.X('Median',title='Median Salary'),
  alt.Y('ShareWomen',title='Percentage of Women',axis=alt.Axis(format='%'))
).properties(
    title='Percentage of Women in Each Major to Median Salary'
).interactive()

# There is no ShareMen column. We will create our own.
# Keep it as a fraction between 0 and 1 (like ShareWomen): format='%' multiplies by 100 when rendering.

recent_grads['ShareMen'] = recent_grads['Men'] / recent_grads['Total']

S4 = alt.Chart(recent_grads).mark_point(color="#2d9fa1").encode(
  alt.X('Median',title='Median Salary'),
  alt.Y('ShareMen',title='Percentage of Men',axis=alt.Axis(format='%'))
).properties(
    title='Percentage of Men in Each Major to Median Salary'
).interactive()

alt.hconcat(S3,S4)

The resulting scatter plots indicate a weak negative correlation between the percentage of women in a major and median salary. Conversely, there is a weak positive correlation between the percentage of men in a major and median salary. Let's remember this as we continue to answer our questions.
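Because the data set records only two genders, the share of men is simply the complement of the share of women. A small sketch of that check, using the first two rows displayed by `recent_grads.head()` above:

```python
import pandas as pd

# Total, Men and ShareWomen for the two highest-ranked majors.
rows = pd.DataFrame({
    'Total': [2339.0, 756.0],
    'Men': [2057.0, 679.0],
    'ShareWomen': [0.120564, 0.101852],
})

# Share of men as a fraction between 0 and 1, the same scale ShareWomen uses.
rows['ShareMen'] = rows['Men'] / rows['Total']

# The two shares should add up to 1 (up to rounding in the stored values).
total_share = rows['ShareMen'] + rows['ShareWomen']
print(total_share.round(4).tolist())
```

Keeping both columns on the 0 to 1 scale matters because `format='%'` on an axis multiplies the value by 100 when it renders the label.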

### Question 4: Is there any correlation between number of full time (or part time) graduates in a major and median salary?

Does the number of graduates that enter full-time or part-time positions have any correlation with the median salary a major earns? Again, we can use scatter plots to answer this question.

In [67]:
S5 = alt.Chart(recent_grads).mark_point(color="#2d9fa1").encode(
  alt.X('Full_time',title='Full-time Graduates'),
  alt.Y('Median',title='Median Salary')
).properties(
    title='Full-Time Employed Graduates Within Major to Salary'
).interactive()

S6 = alt.Chart(recent_grads).mark_point(color="#2d9fa1").encode(
  alt.X('Part_time',title='Part-time Graduates'),
  alt.Y('Median',title='Median Salary')
).properties(
    title='Part-Time Employed Graduates Within Major to Salary'
).interactive()

alt.hconcat(S5,S6)

No, there does not appear to be any correlation between the number of graduates who are employed full or part time and median salary.

### Question 5: What category of major has the highest number of majors? Which has the lowest?

This question requires we take a different direction with our data visualization. Until now, we have been able to answer our questions using scatter plots because they are good at visualizing correlations between our data. To answer our 5th question, however, we will use different types of bar charts.

#### Specifying data types in `altair`

Before we proceed, we should know some things about `altair`. `altair` is capable of recognizing the type of data we pass to it, including `quantitative` data, `nominal` data, `ordinal` data, and a few others that are outside the scope of this project. When we select the `Major_category` column as the data to be displayed along the `X` axis, `altair` automatically recognizes this as nominal data and treats each category as its own discrete group along the X axis. This works for us since we are trying to determine the total number of majors within each category. However, if we *didn't* want `altair` to consider `Major_category` as nominal data and instead wanted this column to be treated as quantitative data, we could specify `Major_category:Q` in the `alt.X` channel passed to `encode`. Likewise, we could specify `Major_category:O` or `Major_category:N` for ordinal and nominal data, respectively.

To directly quote the [documentation](https://altair-viz.github.io/user_guide/encoding.html):

> If types are not specified for data input as a DataFrame, Altair defaults to quantitative for any numeric data, temporal for date/time data, and nominal for string data, but be aware that these defaults are by no means always the correct choice!

For the `Y` axis, we want to count the number of rows for each category of major. We can do this by replacing the column name in `alt.Y` with the `count()` aggregate.

In [68]:
B1 = alt.Chart(recent_grads).mark_bar(color="#2d9fa1").encode(
    alt.X('Major_category',title='Category of Major'),
    alt.Y('count()',title='Total number of majors')
).properties(
    title='Total Number of Majors Per Category',
    width=700
)

B1

The resulting bar graph answers our question: the `Engineering` major category has the most majors, while `Interdisciplinary` has the fewest.

### Question 6: Which categories of major have the highest and lowest median salaries?

Now that we have demonstrated how to display the total number of majors in a category, we can apply the same method used above to determine which major category has the single highest median salary, and which has the lowest. We will need to make two bar charts.

In the first bar chart, our `X` axis will be represented by the `Major_category` column of our data set. The `Y` axis will be represented by the `Median` column. Recall that `altair` will automatically consider the `Median` column as quantitative, since it contains numeric values. Note, however, that `altair` does not aggregate a column unless we ask it to, so to show the single highest value per category the aggregation we want for this chart is `max`.

Our second bar chart will look nearly identical, but this time we want the lowest value per category, so we pass `aggregate='min'` to `alt.Y`.

--------------------

**NOTE** If we wanted to display charts of this kind using another plotting library, like `matplotlib`, we might want to perform the data aggregation beforehand using `pandas`. We could achieve this using the DataFrame `groupby` method, which groups rows by a specified column and lets us apply a method to each group. If we wanted the highest median salary within each major category, our DataFrame operation might look something like this:

`recent_grads_category_max = recent_grads.groupby('Major_category',as_index=False).max()`

This would create a new DataFrame with one row per unique major category, in which each column holds the maximum of that column's values within the category (the maxima are taken column by column, so a row of the result does not necessarily describe any single major). We could then pass this data to `matplotlib` and create a plot however we please.
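To make the `groupby` behavior concrete, here is a minimal sketch on a hypothetical four-row frame (the major names and salaries are made up for illustration). It shows that `max()` is computed column by column, so a row of the result can mix values from different majors. Selecting the `Median` column first avoids any ambiguity.

```python
import pandas as pd

# Hypothetical data: two majors in each of two categories.
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Arts', 'Arts'],
    'Major': ['AEROSPACE', 'MINING', 'MUSIC', 'DRAMA'],
    'Median': [90000, 75000, 31000, 27000],
})

# Column-wise maximum per category: 'Major' becomes the alphabetically last
# name, while 'Median' becomes the largest salary, possibly from another row.
by_category_max = toy.groupby('Major_category', as_index=False).max()
print(by_category_max)

# Restricting the aggregation to one column gives just the highest median.
median_max = toy.groupby('Major_category')['Median'].max()
print(median_max)
```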

For now, we will let `altair` do this work for us:

In [69]:
B2 = alt.Chart(recent_grads).mark_bar(color="#2d9fa1").encode(
    alt.X('Major_category',title='Category of Major'),
    alt.Y('Median',title='Top Median Salary',aggregate='max')
).properties(
    title='Highest Median Salary Among Major Categories',
    width=600
)

B3 = alt.Chart(recent_grads).mark_bar(color="#2d9fa1").encode(
    alt.X('Major_category',title='Category of Major'),
    alt.Y('Median',title='Median Salary',aggregate='min')
).properties(
    title='Lowest Median Salary Among Major Categories',
    width=600
)

alt.vconcat(B2,B3)

Our first bar chart shows that a major within the `Engineering` category has the highest median salary among all other majors, while our second bar chart shows us that a major within the `Education` category has the lowest median salary among all other majors.

This result naturally leads us to our next question.

### Question 7: When considering all median salaries in a category of major, which category has the highest average? Which category has the lowest?

In question \#6, we identified `Engineering` as the category containing the major with the highest median salary, and `Education` as the category containing the major with the lowest. However, by knowing the average of all median salaries within a category, we might be able to draw conclusions about which majors a student should pursue if they are interested in a high salary, or which majors they should avoid if they don't want a low income.

Recall that while we were exploring relationships between our data, we established a weak, negative correlation between the percentage of women in a major and median salary. Using `altair`, we can colorize our bar chart where hotter colors represent high percentages of women in a major, and cooler colors represent low percentages of women in a major. We can integrate this utility into the bar chart we are creating to answer question \#7.

In [70]:
B4 = alt.Chart(recent_grads).mark_bar(color="#2d9fa1").encode(
    alt.X('Major_category',title='Category of Major'),
    alt.Y('Median',title='Median Salary',aggregate='mean'),
    color=alt.Color('ShareWomen', title='Percent of Women in Majors',legend=alt.Legend(format="%"),aggregate='mean')
).properties(title='Mean of Median Salary by Category of Major',
             width=600
)

B4

As we might have expected, `Engineering` majors have the highest average median salary by a fair margin. Consistent with our earlier findings, the chart shows that `Engineering` majors include relatively few women. Similarly, `Psychology & Social Work` majors, which have the lowest average median salary, are made up mostly of women.
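The per-category average that `aggregate='mean'` asks `altair` to compute can also be cross-checked in `pandas`. The sketch below uses only the ten highest-ranked majors from the data set, so its numbers illustrate the technique and are not the values plotted in the chart.

```python
import pandas as pd

# Major_category and Median for the ten highest-ranked majors.
top_ten = pd.DataFrame({
    'Major_category': ['Engineering'] * 6
                      + ['Business', 'Physical Sciences', 'Engineering', 'Engineering'],
    'Median': [110000, 75000, 73000, 70000, 65000,
               65000, 62000, 62000, 60000, 60000],
})

# Mean of the median salaries within each category.
mean_of_medians = top_ten.groupby('Major_category')['Median'].mean()
print(mean_of_medians)
```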

### Question 8: Are there any median salary outliers that are adversely affecting our calculated averages?

Now that we know the average salary for each category of major, can we determine if there are any outliers skewing our results in any particular direction? It seems this is a possibility on the higher end, since the highest ranking major earns `35000` more in salary than the second highest ranking major.

In [71]:
recent_grads[['Rank','Median','Major','Major_category']].head(10)

Unnamed: 0,Rank,Median,Major,Major_category
0,1,110000,PETROLEUM ENGINEERING,Engineering
1,2,75000,MINING AND MINERAL ENGINEERING,Engineering
2,3,73000,METALLURGICAL ENGINEERING,Engineering
3,4,70000,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,5,65000,CHEMICAL ENGINEERING,Engineering
5,6,65000,NUCLEAR ENGINEERING,Engineering
6,7,62000,ACTUARIAL SCIENCE,Business
7,8,62000,ASTRONOMY AND ASTROPHYSICS,Physical Sciences
8,9,60000,MECHANICAL ENGINEERING,Engineering
9,10,60000,ELECTRICAL ENGINEERING,Engineering


Luckily, we can quickly visualize our outlier data by creating a box and whisker plot.

Note our box and whisker plot uses a default extent of 1.5 * IQR, [which is standard by definition of a box and whisker plot](https://mathworld.wolfram.com/InterquartileRange.html). We could specify otherwise if we desired, but for now this will suffice.
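The 1.5 * IQR rule can be worked through by hand. As a sketch, the snippet below applies it to the ten `Median` values listed above, pooled together instead of split by category as the chart does. It uses the default linear-interpolation quantiles of `pandas`, which may differ slightly from how the plotting library computes quartiles.

```python
import pandas as pd

# Median salaries of the ten highest-ranked majors.
medians = pd.Series([110000, 75000, 73000, 70000, 65000,
                     65000, 62000, 62000, 60000, 60000])

q1 = medians.quantile(0.25)
q3 = medians.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = medians[(medians < lower_fence) | (medians > upper_fence)]
print(q1, q3, outliers.tolist())
```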

In [72]:
BW1 = alt.Chart(recent_grads).mark_boxplot().encode(
    alt.X('Major_category',title='Category of Major'),
    alt.Y('Median',title='Median Salary')
).properties(
    title='Median Salary of Majors by Category',
    width=700,
    height=400
)

BW1

It looks like our data contains some outliers, but nothing too significant. While `PETROLEUM ENGINEERING` has a significantly higher median salary than all other majors in the `Engineering` category, it's important to remember that `Engineering` is comprised of nearly 30 majors. Therefore, the effect `PETROLEUM ENGINEERING` has on the average might be inconsequential. 

Regardless, let's exclude some of our outliers and see what the resulting bar plot looks like. Note we should probably not indiscriminately exclude *all* outliers, such as in the case of `Communications & Journalism`, due to the low number of majors some categories have (which puts some of their outliers barely outside of the 1.5 * IQR range).

Let's omit the following outliers:

Max:
1. `Arts`
2. `Business`
3. `Engineering`
4. `Education`
5. `Physical Sciences`

Min:
1. `Education` -- **We are leaving the third outlier in place due to how close it is to the first quartile.**

#### Code it!

With our task laid out, let's perform these omissions by creating a new dataframe called `recent_grads_without_outliers`, which will be a copy of the `recent_grads` `DataFrame` without the aforementioned outliers. This requires a multi-step approach:

* First, group `recent_grads` by `Major_category` and use the `idxmax` and `idxmin` methods on the `Median` column to find the row label of the highest- and lowest-earning major in each category. Calling `max()` or `min()` on the whole grouped `DataFrame` would not work here, because those methods take the extreme of every column independently, so the `Major` column would simply hold the alphabetically last or first name in the category. We keep the rows for the categories listed above in two `DataFrame` objects called `recent_grads_max_outliers` and `recent_grads_min_outliers`.
* Create a new `DataFrame` called `recent_grads_outliers`, which consists of the maximum and minimum outliers concatenated with `pd.concat`.
* Create a new `DataFrame` called `recent_grads_without_outliers`, which is a copy of `recent_grads` without the rows we wish to exclude. This last step is a little confusing, so we will work from the innermost part of the code to the outermost parts:
    * `recent_grads_outliers['Major']` is a `Series` containing the majors we are considering to be outliers.
    * Now we will perform boolean indexing: `isin` checks, for every row of `recent_grads`, whether its major (e.g. `PETROLEUM ENGINEERING`) is one of those outliers. Apply the `~` operator at the beginning of the boolean array to flip all the `True` to `False` and vice versa, since we want to include all rows *except* those containing our outliers.
    * Apply the resulting boolean index to `recent_grads`.

To check our work, we will display the length of `recent_grads_without_outliers`. Recall that at the beginning of this guided project, `recent_grads` was 172 rows in length after we cleaned the rows containing `NaN` values. Since we wish to eliminate 6 outliers, `recent_grads_without_outliers` should be 166 rows in length.

In [73]:
outlier_max_categories = ['Arts','Business','Engineering','Education','Physical Sciences']
outlier_min_categories = ['Education']

# idxmax / idxmin return the row label of the highest / lowest Median in each category
median_by_category = recent_grads.groupby('Major_category')['Median']
recent_grads_max_outliers = recent_grads.loc[median_by_category.idxmax().loc[outlier_max_categories]]
recent_grads_min_outliers = recent_grads.loc[median_by_category.idxmin().loc[outlier_min_categories]]

# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
recent_grads_outliers = pd.concat([recent_grads_max_outliers, recent_grads_min_outliers])
recent_grads_without_outliers = recent_grads[~recent_grads['Major'].isin(recent_grads_outliers['Major'])]

# Recall that at the beginning of this notebook, our 'cleaned' data was 172 rows in length. Since we wish to remove
# 6 outliers, the length of the new DataFrame we just created should be 166 rows.

len(recent_grads_without_outliers)

166
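The boolean-indexing step described above can be sketched on a hypothetical four-row frame: `isin` marks the rows to exclude, and `~` inverts the mask so that everything else is kept.

```python
import pandas as pd

# Hypothetical data standing in for recent_grads.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Median': [110000, 60000, 45000, 22000],
})

# Hypothetical names standing in for the majors we treat as outliers.
outlier_majors = ['A', 'D']

# True where the row's major is one of the outliers.
is_outlier = toy['Major'].isin(outlier_majors)

# ~ flips the mask, keeping every row that is NOT an outlier.
toy_without_outliers = toy[~is_outlier]
print(len(toy), len(toy_without_outliers))
```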

Now we have data we can work with. Let's create a new bar chart of the averaged median salary among all majors in a category, excluding our outliers. We will vertically concatenate the bar chart containing outliers with the bar chart excluding outliers so we can directly compare them.

In [74]:
B5 = alt.Chart(recent_grads_without_outliers).mark_bar(color="#2d9fa1").encode(
    alt.X('Major_category',title='Category of Major'),
    alt.Y('Median',title='Median Salary',aggregate='mean'),
    color=alt.Color('ShareWomen', title='Percent of Women in Major',legend=alt.Legend(format="%"),aggregate='mean')
).properties(
    title='Aggregated mean of median salaries among major categories (without outliers)',
    width=600
)

alt.vconcat(B4,B5)

To answer question \#8, we can say that the outliers in our data have a decidedly minimal effect on our averaged median salaries.

## Recap

---

Let's recap our original set of questions and their answers:

* Does the sample size taken for each major correlate to its number of graduates?
    * Yes. We concluded the sample size taken for each major has a positive correlation with its number of graduates. We determined this is just one of many signs of a good sample size.


* Does the number of graduates in a major have any correlation to median salary?
    * No. We concluded there is no discernible correlation between the number of graduates a major has and the median salary of that major.
    
    
* Does gender have any correlation with the median salary of a major?
    * Yes. We found that there is a weak, negative correlation between the share of women in a major and the major's median salary. Since `ShareWomen` and `ShareMen` are complementary (they sum to 1), there is a correspondingly weak, positive correlation between the share of men in a major and median salary.
    

* Is there any correlation between number of full time (or part time) graduates in a major and median salary?
    * No. We concluded there is no discernible correlation between the number of full time (or part time) graduates in a major and median salary.
    
    
* What category of major has the highest number of majors? Which has the lowest?
    * We concluded `Engineering` has the most majors, while `Interdisciplinary` has the fewest majors.
    
    
* Which categories of major have the highest and lowest median salaries?
    * We concluded `Engineering` has the highest-earning major, while `Education` has the lowest-earning major.
    

* When considering all median salaries in a category of major, which category has the highest average? Which category has the lowest?
    * We concluded that `Engineering` has the highest average salary among median salaries within a category, while `Psychology & Social Work` has the lowest average salary among median salaries within a category.


* Are there any median salary outliers that are adversely affecting our calculated averages?
    * We located several outliers using a box and whisker plot, but found that none are adversely affecting our calculated averages.
    
## Conclusion

---

For this guided project, we were able to demonstrate basic data visualization techniques and make connections between the columns of our `recent_grads` data set. We created scatter plots to show how two columns of data may be connected, bar plots to show information about our data when aggregated, and a box and whisker plot to find outliers in our data. We concluded that `Engineering` majors have a clear lead on other majors in terms of salary, but at the same time this does not necessarily apply to *all* majors of this category.

Thanks for taking the time to read this guided project. Hopefully you were able to learn something new about `altair`, or isolating and removing data using `pandas` operations.

If you have any critique, even minor, please provide feedback so I can improve my knowledge and learning processes.