In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# UNC System Salary Information

The UNC System maintains a publicly available database of employees and their salaries that can be found on the web. From their website:

> **What information is provided in the database?** This database contains the names, position titles and salaries of permanent employees of the University, as furnished by UNC System Office and the constituent institutions of The University of North Carolina.
>
>**What information is not provided in the database?** We currently do not have data available for UNC Health Care, UNC Press or the North Carolina School of Science and Math.
>
>**How frequently is the data updated?** The data provided in this database is updated quarterly and represents a snapshot of employees, salaries, and titles as of the date listed. Because it is a snapshot, the database may not accurately reflect an employee's permanent salary or all sources of funding provided throughout the year. As examples, the database may not capture all external fund sources that may compensate some employees, or the information may capture temporary salary increases or temporary title changes for additional duties for an employee that could change in the course of the year.
>
>**What if I have questions about an employee's salary or other personnel information?** If you believe that information about a particular employee is inaccurate or if you have other questions about the information provided in the database, you may contact the HR department at the constituent institution to verify the employee's salary.
>
>**Why do we publish this data?** Employee salaries are public and we believe that publishing salaries benefits both the public by providing easily accessible salary information and the employing institutions by minimizing public personnel information requests.

You'll find a `.csv` file containing the full database **with names removed** in the `/activities/data/` folder as `unc_salaries.csv`. 

Load this file as a Table named `unc`. This data was downloaded in February 2021, so depending on when you're viewing this activity it may be out of date.

In [2]:
unc = Table.read_table('data/unc_salaries.csv')
unc

INSTITUTION NAME,AGE,INITIAL HIRE DATE,JOB CATEGORY,EMPLOYEE ANNUAL BASE SALARY,EMPLOYEE HOME DEPARTMENT,PRIMARY WORKING TITLE
ASU,69,"AUG 01, 1998","Librarian AC, Other",49507,Library,Adjunct Assistant Professor
ASU,63,"AUG 24, 1998","Librarian AC, Other",23558,Library,Adjunct Assistant Professor
ASU,47,"JAN 16, 2018",Lecturer,47000,Sustainable Technlgy & Built Envirn,Adjunct Instructor
ASU,41,"DEC 01, 2004",Lecturer,36000,English,Lecturer
ASU,59,"MAY 23, 2005",Food Prep Worker,31200,Sanford Commons,Food Service Technician
ASU,37,"AUG 26, 2011",Assistant Professor,68945,Psychology,Assistant Professor
ASU,42,"JUL 18, 2012","Student Activities Professional, Student Activities Assi ...",53542,Campus Activities,"Asst Dir, Org Leadership"
ASU,54,"JUL 31, 2001",Nursing Professional,51504,Health Services,Professional Nurse
ASU,50,"AUG 01, 2005","Administrative / Office / Clerical Support Staff, Other",35484,School Of Music,Administrative Support Spec
ASU,40,"APR 16, 2018",Chief Budget Officer,110000,University Budget,University Budget Director


## The features

You can see that this data has 7 features for each of the 47,653 observations. In this case, each observation refers to a specific person who was employed within the UNC System when the data was collected. The 7 features are:

* `INSTITUTION NAME`: An abbreviation for the UNC institution at which the individual is employed
  * `'ASU'`: Appalachian State University
  * `'ECSU'`: Elizabeth City State University
  * `'ECU'`: East Carolina University
  * `'FSU'`: Fayetteville State University
  * `'NCA&T'`: North Carolina A&T State University
  * `'NCCU'`: North Carolina Central University
  * `'NCSU'`: North Carolina State University
  * `'UNC-CH'`: University of North Carolina at Chapel Hill
  * `'UNCA'`:  University of North Carolina at Asheville
  * `'UNCC'`:  University of North Carolina at Charlotte
  * `'UNCG'`: University of North Carolina at Greensboro
  * `'UNCP'`: The University of North Carolina at Pembroke
  * `'UNCSA'`: University of North Carolina School of the Arts
  * `'UNCW'`: University of North Carolina at Wilmington
  * `'WCU'`: Western Carolina University
  * `'WSSU'`: Winston-Salem State University
* `AGE`: The age of the employee at the time this data was collected
* `INITIAL HIRE DATE`: The date the individual was first hired, formatted as `MON DD, YYYY` (for example, `AUG 01, 1998`)
* `JOB CATEGORY`: A broad category that describes the type of job the individual holds
* `EMPLOYEE ANNUAL BASE SALARY`: The base salary, in dollars, that the individual is paid for their work. This does not include bonus pay, additional stipends, or other benefits provided by the institution
* `EMPLOYEE HOME DEPARTMENT`: The name of the department that employs the individual at the indicated institution
* `PRIMARY WORKING TITLE`: The specific name of the job the individual holds
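
The `INITIAL HIRE DATE` values are stored as text, so they need to be parsed before you can do date arithmetic with them. The following is a minimal sketch using Python's standard `datetime` module, assuming an English locale and the upper-case month abbreviations seen in the table preview. The variable names are made up for illustration.

```python
from datetime import datetime

# Hire dates in the table look like "AUG 01, 1998" (upper-case month abbreviation).
raw_date = "AUG 01, 1998"

# .title() turns "AUG" into "Aug" so it matches the %b directive
# (abbreviated month name, assuming an English locale).
hire_date = datetime.strptime(raw_date.title(), "%b %d, %Y")

print(hire_date.year, hire_date.month, hire_date.day)
```

Once parsed, the year, month, and day are available as separate integers.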

**WARNING:** This data is NOT cleaned to ensure consistent job categories, home department names, or primary working titles. For example, for the feature `JOB CATEGORY` there are various types of Accounting Professional listed:

* Accounting Professional, Accountant
* Accounting Professional, Accountant Sr.
* Accounting Professional, Accounting Unit Supervisor
* Accounting Professional, Assistant Comptroller
* Accounting Professional, Asst/Assoc Bursar
* Accounting Professional, Collections Supervisor
* Accounting Professional, Compliance Officer
* Accounting Professional, Dept Business Mgr Sr.
* Accounting Professional, Head Cashier

There is also a generic descriptor of: Accounting Professional. For many applications you would likely want to "clean" this data by making a decision on how to lump all these roles together into one category, or determine what to do with the Accounting Professional categories that lack additional descriptions of their roles. Data cleaning is a difficult and time-consuming process that we will mostly ignore in this course so we can focus on the mathematical applications that can be applied to previously cleaned data.
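
As one possible illustration of such a cleaning decision, the sketch below lumps categories together by keeping only the text before the first comma. This rule is an assumption chosen for simplicity, and `broad_category` is a hypothetical helper, not part of the `datascience` library. Other rules may suit other questions better.

```python
# A few JOB CATEGORY values copied from the list above, plus the generic label.
job_categories = [
    "Accounting Professional, Accountant",
    "Accounting Professional, Accountant Sr.",
    "Accounting Professional, Head Cashier",
    "Accounting Professional",
]

def broad_category(category):
    """Return the text before the first comma, with surrounding spaces removed."""
    return category.split(",")[0].strip()

# Apply the rule to every label; all four collapse to the same broad category.
cleaned = [broad_category(c) for c in job_categories]
print(set(cleaned))
```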

## How many employees?

Using the `group` Table method, count how many individuals are employed at each institution. Assign the new table created by the `group` operation to the name `school_counts`.

In [8]:
# You complete some code in this cell
school_counts = unc.pivot('INSTITUTION NAME', 'AGE')
school_counts

AGE,ASU,ECSU,ECU,FSU,NCA&T,NCCU,NCSU,UNC-CH,UNCA,UNCC,UNCG,UNCP,UNCSA,UNCW,WCU,WSSU
19,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
20,2,0,0,1,0,0,2,3,0,1,0,0,0,1,1,0
21,0,0,4,0,2,1,4,7,1,0,0,0,0,0,1,0
22,8,0,9,0,0,0,18,40,2,3,5,2,1,1,6,0
23,12,2,18,1,3,6,42,103,5,12,13,5,0,4,12,1
24,17,3,30,2,3,4,64,131,1,18,18,7,1,20,15,2
25,25,0,47,3,9,7,107,169,6,32,29,5,1,19,23,2
26,37,3,63,2,12,3,128,156,10,40,26,9,5,17,20,5
27,38,3,64,5,18,7,156,189,5,38,33,16,5,27,22,6
28,41,3,81,7,22,11,157,240,7,70,37,18,2,34,24,5
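
The table above breaks the counts down by both institution and age. For comparison, grouping on a single column produces one count per distinct value. The sketch below shows that idea in plain Python with `collections.Counter` on an invented list of labels. It illustrates what grouping and counting computes and is not the `datascience` implementation of `group`.

```python
from collections import Counter

# Hypothetical INSTITUTION NAME values, one per employee (not the real data).
institutions = ["ASU", "ECU", "ASU", "NCSU", "ECU", "ASU"]

# Tally how many times each label appears: one count per distinct value.
counts = Counter(institutions)
print(counts["ASU"], counts["ECU"], counts["NCSU"])
```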


Sort this data by the column containing the count, and determine which school employs the most individuals and which school employs the fewest.

In [None]:
# You complete some code in this cell
...
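
To see why sorting helps here, consider a plain-Python sketch with invented labels and counts (the names and numbers below are placeholders, not UNC data). After a descending sort, the largest count is in the first row and the smallest is in the last.

```python
# Hypothetical (label, count) pairs; the numbers are invented for illustration.
toy_counts = [("AAA", 120), ("BBB", 45), ("CCC", 300)]

# Sort by the count (second item in each pair), largest first.
by_count = sorted(toy_counts, key=lambda pair: pair[1], reverse=True)

largest = by_count[0]    # first row after a descending sort
smallest = by_count[-1]  # last row after a descending sort
print(largest, smallest)
```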

## Average salary by age

The average salary for an employee in the UNC System is \$75,617.39. Do you think this average salary would vary for employees of different ages? Let's find out.

Use table methods on the `unc` table to create a new table that contains only the data on 25-year-old employees. Then select the base salary column for these 25-year-olds as an array, and determine their average salary.

In [None]:
# You complete some code in this cell
...
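
The filter-then-average pattern can be sketched with two NumPy arrays standing in for the age and salary columns. The values below are invented, and the boolean-mask indexing shown here is a NumPy analog of filtering rows, not the Table method itself.

```python
import numpy as np

# Invented ages and salaries for six employees (not taken from the UNC data).
ages = np.array([25, 40, 25, 31, 25, 40])
salaries = np.array([30000, 60000, 36000, 50000, 33000, 70000])

# Keep only the salaries whose matching age is 25, then average them.
salaries_25 = salaries[ages == 25]
print(np.mean(salaries_25))
```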

Now compute the average salary for a few other ages in the data set. Do you see a trend?

In [None]:
# You complete some code in this cell
...

In [None]:
# You complete some code in this cell
...

In [None]:
# You complete some code in this cell
...

In [None]:
# You complete some code in this cell
...

In [None]:
# You complete some code in this cell
...
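
When you repeat the same calculation for several ages, collecting the results side by side makes a trend easier to spot. Here is a small sketch with invented data and a loop. The dictionary name `average_by_age` is made up for this example.

```python
import numpy as np

# Invented ages and salaries (not taken from the UNC data).
ages = np.array([25, 40, 25, 31, 25, 40])
salaries = np.array([30000, 60000, 36000, 50000, 33000, 70000])

# Compute one average per age of interest and collect them in a dictionary.
average_by_age = {}
for age in [25, 31, 40]:
    average_by_age[age] = float(np.mean(salaries[ages == age]))

print(average_by_age)
```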

It looks like people are paid more, on average, as they get older. It's not immediately clear *why* that might be the case. Do employees in their 70s have many years of experience with UNC, which results in higher pay through annual raises? Or perhaps the jobs that people can keep doing into their 60s or 70s inherently pay more than the jobs that can't be done at that age? We'd need to dig deeper into the data to determine the cause of this relationship!

## Sampling from the data

This table has so many data points that it can be hard to work with all of the individuals at once. Let's take a sample from the table to create a smaller table. We'll learn a few ways to do this throughout the course, but for now let's take a sample by doing the following:

First, sort the `unc` table by age, with the younger individuals at the top of the table. Save this sorted table to `unc_sorted`.

In [None]:
# You complete some code in this cell
unc_sorted = ...
unc_sorted

Then, using `np.arange`, create an array that contains the integers from 0 to 47600, incrementing by 100. Call this array `employee_number`. Inspect the array to ensure it contains the integers 0 and 47600.

In [None]:
# You complete some code in this cell
employee_number = ...
employee_number
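
A common pitfall with `np.arange` is that the stop value is excluded from the result. The smaller example below (0 to 500 rather than the range in the task) shows the difference between a stop value that cuts the array short and one that keeps the final number.

```python
import numpy as np

# The stop value is excluded, so stopping at 500 ends the array at 400.
stops_short = np.arange(0, 500, 100)

# Nudging the stop just past 500 keeps 500 in the array.
includes_end = np.arange(0, 501, 100)

print(stops_short)
print(includes_end)
```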

Lastly, run the cell below, which uses the `take` method. The `take` method can be given an integer or an array of integers that represent row numbers, also called row indices. When the method is called, it creates a new table that contains only those rows whose indices were included in the call. We'll save this to the table named `sample` since it is a sample of our larger table named `unc`.

In [None]:
sample = unc_sorted.take(employee_number)
sample
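
Selecting rows by index can be pictured with a plain NumPy array in place of a table. In this sketch the values are invented and the array indexing is an analog of `take`, not the method itself.

```python
import numpy as np

# Ten invented salaries, standing in for the rows of a table.
salaries = np.array([31, 42, 55, 38, 61, 47, 70, 29, 66, 52]) * 1000

# Row indices to keep: every third row, starting from row 0.
row_indices = np.arange(0, 10, 3)

# Indexing with an array of integers returns just those positions.
picked = salaries[row_indices]
print(row_indices, picked)
```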

**NOTE:** This is NOT a random sample! We'll investigate different and better methods for sampling data later in the course.

Let's see how well our sample captures the patterns in the original table. Calculate the average salary of all individuals in `sample`.

In [None]:
# You complete some code in this cell
...

How does this compare to the average salary of the individuals in `unc`? Is it about the same, or very different? Why do you think that is?
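
As a rough illustration of what to expect, the sketch below takes every 100th value from an invented, evenly spaced set of numbers and compares the two averages. The data is made up, so the sizes of the numbers say nothing about the real UNC salaries. The point is only that a systematic sample's average tends to land near, but not exactly on, the full average.

```python
import numpy as np

# Invented, evenly spaced "salaries", already in sorted order (not real data).
population = np.linspace(30000, 120000, 1000)

# Systematic sample: every 100th value, starting from the first.
sample_values = population[np.arange(0, 1000, 100)]

print(np.mean(population), np.mean(sample_values))
```

Because this sample always starts at the smallest value, its average comes out somewhat below the full average in this made-up case.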

## You explore!

Make a calculation that you find interesting and share it with your classmates on the discussion board for this activity. Include your code so others can reproduce your work.