In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# UNC System Salary Information

The UNC System maintains a publicly available database of employees and their salaries that can be found on the web. From their website:

> **What information is provided in the database?** This database contains the names, position titles and salaries of permanent employees of the University, as furnished by UNC System Office and the constituent institutions of The University of North Carolina.
>
>**What information is not provided in the database?** We currently do not have data available for UNC Health Care, UNC Press or the North Carolina School of Science and Math.
>
>**How frequently is the data updated?** The data provided in this database is updated quarterly and represents a snapshot of employees, salaries, and titles as of the date listed. Because it is a snapshot, the database may not accurately reflect an employee's permanent salary or all sources of funding provided throughout the year. As examples, the database may not capture all external fund sources that may compensate some employees, or the information may capture temporary salary increases or temporary title changes for additional duties for an employee that could change in the course of the year.
>
>**What if I have questions about an employee's salary or other personnel information?** If you believe that information about a particular employee is inaccurate or if you have other questions about the information provided in the database, you may contact the HR department at the constituent institution to verify the employee's salary.
>
>**Why do we publish this data?** Employee salaries are public and we believe that publishing salaries benefits both the public by providing easily accessible salary information and the employing institutions by minimizing public personnel information requests.

You'll find a `.csv` file containing the full database **with names removed** in the `/activities/data/` folder as `unc_salaries.csv`. 

Load this file as a Table named `unc`. This data was downloaded in February 2021, so depending on when you're viewing this activity it may be out of date.

In [12]:
unc = Table.read_table('data/unc_salaries.csv')
unc

INSTITUTION NAME,AGE,INITIAL HIRE DATE,JOB CATEGORY,EMPLOYEE ANNUAL BASE SALARY,EMPLOYEE HOME DEPARTMENT,PRIMARY WORKING TITLE
ASU,69,"AUG 01, 1998","Librarian AC, Other",49507,Library,Adjunct Assistant Professor
ASU,63,"AUG 24, 1998","Librarian AC, Other",23558,Library,Adjunct Assistant Professor
ASU,47,"JAN 16, 2018",Lecturer,47000,Sustainable Technlgy & Built Envirn,Adjunct Instructor
ASU,41,"DEC 01, 2004",Lecturer,36000,English,Lecturer
ASU,59,"MAY 23, 2005",Food Prep Worker,31200,Sanford Commons,Food Service Technician
ASU,37,"AUG 26, 2011",Assistant Professor,68945,Psychology,Assistant Professor
ASU,42,"JUL 18, 2012","Student Activities Professional, Student Activities Assi ...",53542,Campus Activities,"Asst Dir, Org Leadership"
ASU,54,"JUL 31, 2001",Nursing Professional,51504,Health Services,Professional Nurse
ASU,50,"AUG 01, 2005","Administrative / Office / Clerical Support Staff, Other",35484,School Of Music,Administrative Support Spec
ASU,40,"APR 16, 2018",Chief Budget Officer,110000,University Budget,University Budget Director


## The features

You can see that this data has 7 features for each of the 47,653 observations. Each observation refers to one person who was employed within the UNC System when the data was collected. The 7 features are:

* `INSTITUTION NAME`: An abbreviation for the UNC institution at which the individual is employed
  * `'ASU'`: Appalachian State University
  * `'ECSU'`: Elizabeth City State University
  * `'ECU'`: East Carolina University
  * `'FSU'`: Fayetteville State University
  * `'NCA&T'`: North Carolina A&T State University
  * `'NCCU'`: North Carolina Central University
  * `'NCSU'`: North Carolina State University
  * `'UNC-CH'`: University of North Carolina at Chapel Hill
  * `'UNCA'`:  University of North Carolina at Asheville
  * `'UNCC'`:  University of North Carolina at Charlotte
  * `'UNCG'`: University of North Carolina at Greensboro
  * `'UNCP'`: The University of North Carolina at Pembroke
  * `'UNCSA'`: University of North Carolina School of the Arts
  * `'UNCW'`: University of North Carolina at Wilmington
  * `'WCU'`: Western Carolina University
  * `'WSSU'`: Winston-Salem State University
* `AGE`: The age of the employee at the time this data was collected
* `INITIAL HIRE DATE`: The date the individual was first hired, formatted as `MON DD, YYYY` (for example, `AUG 01, 1998`)
* `JOB CATEGORY`: A broad category that describes the type of job the individual holds
* `EMPLOYEE ANNUAL BASE SALARY`: The base salary, in dollars, that the individual is paid for their work. This does not include bonus pay, additional stipends, or other benefits provided by the institution
* `EMPLOYEE HOME DEPARTMENT`: The name of the department that employs the individual at the indicated institution
* `PRIMARY WORKING TITLE`: The specific name of the job the individual holds
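
Because `INITIAL HIRE DATE` is stored as text, it has to be parsed before you can do any date arithmetic with it. The following is a minimal sketch using Python's standard `datetime` module. The helper name `parse_hire_date` is hypothetical, and the `%b` month code assumes an English locale.

```python
from datetime import datetime

def parse_hire_date(text):
    """Parse a string such as 'AUG 01, 1998' into a datetime.date.

    Assumes an English locale so that %b matches month abbreviations
    like 'AUG'; strptime matches them case-insensitively.
    """
    return datetime.strptime(text, '%b %d, %Y').date()

# Hire date taken from the first row of the table shown earlier.
hired = parse_hire_date('AUG 01, 1998')
print(hired.year, hired.month, hired.day)
```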

**WARNING:** This data is NOT cleaned to ensure consistent job categories, home department names, or primary working titles. For example, the feature `JOB CATEGORY` lists several types of Accounting Professional:

* Accounting Professional, Accountant
* Accounting Professional, Accountant Sr.
* Accounting Professional, Accounting Unit Supervisor
* Accounting Professional, Assistant Comptroller
* Accounting Professional, Asst/Assoc Bursar
* Accounting Professional, Collections Supervisor
* Accounting Professional, Compliance Officer
* Accounting Professional, Dept Business Mgr Sr.
* Accounting Professional, Head Cashier

There is also a generic descriptor of: Accounting Professional. For many applications you would likely want to "clean" this data by making a decision on how to lump all these roles together into one category, or determine what to do with the Accounting Professional categories that lack additional descriptions of their roles. Data cleaning is a difficult and time-consuming process that we will mostly ignore in this course so we can focus on the mathematical applications that can be applied to previously cleaned data.
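
One possible cleaning decision, sketched below, is to keep only the text before the first comma so that every variant collapses into the broad category. This is an illustration on a few hand-copied strings, not the course's method, and the function name `broad_category` is hypothetical.

```python
# A few JOB CATEGORY strings copied from the list above.
job_categories = [
    'Accounting Professional, Accountant',
    'Accounting Professional, Accountant Sr.',
    'Accounting Professional, Head Cashier',
    'Accounting Professional',
]

def broad_category(job_category):
    """Return the text before the first comma, with surrounding spaces removed."""
    return job_category.split(',')[0].strip()

cleaned = [broad_category(category) for category in job_categories]
print(cleaned)
```

On the full table, the same function could likely be passed to the `apply` table method, for example `unc.apply(broad_category, 'JOB CATEGORY')`. Splitting on the first comma is only one choice: it would also shorten `Librarian AC, Other` to `Librarian AC`, which may or may not be what you want.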

## Who makes the most?

Using table operations, sort the `unc` table by employee salary so the highest paid employee is at the top and then format that column as a number so it is displayed with commas.

In [13]:
# You complete some code in this cell
# unc...
unc.sort('EMPLOYEE ANNUAL BASE SALARY', descending=True).set_format('EMPLOYEE ANNUAL BASE SALARY', NumberFormatter)

INSTITUTION NAME,AGE,INITIAL HIRE DATE,JOB CATEGORY,EMPLOYEE ANNUAL BASE SALARY,EMPLOYEE HOME DEPARTMENT,PRIMARY WORKING TITLE
NCSU,49,"DEC 01, 2012",Head Coach - Football,1625000.0,Football,Head Coach
NCSU,48,"MAR 18, 2017",Head Coach - Men's Basketball,1350000.0,Men's Basketball,Head Coach
UNC-CH,62,"APR 15, 2019",Professor,930000.04,Orthopaedics - Pediatrics,NODESCR
ECU,63,"SEP 15, 2014",Professor (Primary) and Department Chair/Head,901100.0,BSOM Cardiovascular Science,Professor
UNC-CH,68,"DEC 01, 1983",Professor,864910.08,Med-Nephrology,Allan Brewster Distinguished Professor
UNC-CH,56,"JUL 31, 1997",Professor,860250.84,Neurosurgery,Distinguished Professor & Chair
UNC-CH,72,"AUG 01, 2020",Professor,837720.0,Pediatrics,NODESCR
UNC-CH,52,"JUL 01, 2016",Professor,830637.04,Surgery,NODESCR
ECU,51,"NOV 18, 2019",Professor,813000.0,BSOM CVS StrucHeart Disease,Clinical Professor
ECU,63,"JUL 22, 2019",Professor,813000.0,BSOM CVS StrucHeart Disease,Professor


Using a search engine, can you name the highest paid employee at the top of the table?
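
To see what the sort-then-format step in this section does, here is a small sketch in plain Python using three salaries copied from the output above. It only imitates the idea of sorting in descending order and displaying numbers with commas; it does not use the `datascience` library.

```python
# Three salaries copied from the sorted output above, deliberately out of order.
salaries = [930000.04, 1625000.0, 1350000.0]

# Sort so the largest salary comes first.
ranked = sorted(salaries, reverse=True)

# The ',' in the format spec inserts thousands separators.
formatted = [f'{salary:,.2f}' for salary in ranked]
print(formatted)
```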

## Average salary by age

The average salary for an employee in the UNC System is \$75,617.39. Do you think this average salary would vary for employees of different ages? Let's find out.

Use table methods on the `unc` table to create a new table that contains only the 25-year-old employees. Then select the base salary column for these employees as an array and compute their average salary.

In [16]:
# You complete some code in this cell
# ...
np.average( unc.where('AGE', 25).column('EMPLOYEE ANNUAL BASE SALARY') )

41719.96280991736

Now compute the average salary for a few other ages in the data set. Do you see a trend?

In [18]:
# You complete some code in this cell
np.average( unc.where('AGE', 35).column('EMPLOYEE ANNUAL BASE SALARY') )

70198.36109347444

In [19]:
# You complete some code in this cell
np.average( unc.where('AGE', 45).column('EMPLOYEE ANNUAL BASE SALARY') )

78658.41542789224

In [20]:
# You complete some code in this cell
np.average( unc.where('AGE', 55).column('EMPLOYEE ANNUAL BASE SALARY') )

81855.78005199308

In [21]:
# You complete some code in this cell
np.average( unc.where('AGE', 65).column('EMPLOYEE ANNUAL BASE SALARY') )

93499.58783171521

In [22]:
# You complete some code in this cell
np.average( unc.where('AGE', 75).column('EMPLOYEE ANNUAL BASE SALARY') )

110481.54507042254

It looks like, on average, people are paid more as they get older. It's not immediately clear *why* that might be the case. Do employees in their 70s have many years of experience with UNC, which results in higher pay through annual raises? Or do the jobs that people can keep doing into their 60s and 70s inherently pay more than the jobs that employees of that age can no longer do? We'd need to dig deeper into the data to determine the cause of this relationship!
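
Repeating the same cell for each age gets tedious. As a rough sketch of the idea, the plain Python below groups salaries by age and averages each group. The ten rows are the youngest employees shown later in this notebook when the table is sorted by age, so the result illustrates the mechanics only and says nothing about the overall trend.

```python
from collections import defaultdict

# (AGE, EMPLOYEE ANNUAL BASE SALARY) pairs for ten rows that appear in
# this notebook's output. Ten rows are far too few to draw conclusions.
rows = [
    (19, 31200), (19, 31200), (20, 31846), (20, 31200), (20, 33390),
    (20, 38000), (20, 41000), (20, 31200), (20, 36000), (20, 36000),
]

# Collect every salary under its age.
salaries_by_age = defaultdict(list)
for age, salary in rows:
    salaries_by_age[age].append(salary)

# Average each group, listing the ages in increasing order.
average_by_age = {
    age: sum(salaries) / len(salaries)
    for age, salaries in sorted(salaries_by_age.items())
}
print(average_by_age)
```

On the full table, the `group` table method should give a similar one-step summary, for example `unc.select('AGE', 'EMPLOYEE ANNUAL BASE SALARY').group('AGE', np.average)`.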

## Sampling from the data

This table has so many rows that working with every individual can be unwieldy. Let's take a sample from the table to create a smaller table. We'll learn a few ways to do this throughout the course, but for now let's take a sample by doing the following:

First, sort the `unc` table by age, with the younger individuals at the top of the table. Save this sorted table to `unc_sorted`.

In [25]:
# You complete some code in this cell
# unc_sorted = ...
# unc_sorted
unc_sorted = unc.sort('AGE')
unc_sorted

INSTITUTION NAME,AGE,INITIAL HIRE DATE,JOB CATEGORY,EMPLOYEE ANNUAL BASE SALARY,EMPLOYEE HOME DEPARTMENT,PRIMARY WORKING TITLE
ASU,19,"SEP 16, 2019",Custodian / Housekeeper,31200,University Housing-Operations,Building & Environmental Tech
UNC-CH,19,"NOV 30, 2020","Research Assistants, Technicians, Technologists, Vet Science",31200,Comparative Medicine,Research Technician
ASU,20,"FEB 18, 2019","Custodian / Housekeeper, Floor Maintenance",31846,Environmental Services,Building & Environmental Tech
ASU,20,"NOV 02, 2020",Custodian / Housekeeper,31200,Environmental Services,Building & Environmental Tech
FSU,20,"NOV 02, 2020",Carpenter (Journeyman),33390,Assc VC for Facilities Mgmt,Fac Maint Tech Mech Trade
NCSU,20,"NOV 11, 2019",IT Technical Support/Paraprofessional,38000,Communication Technologies,Technology Support Technician
NCSU,20,"APR 01, 2020",Telecommunications Technical/Professional,41000,Information Technology-EHPS,IT Operations Technician
UNC-CH,20,"MAR 18, 2019",Grounds / Landscape Worker,31200,FS-Grounds-LS Maint Hsng/Pkng,Bldg & Env Services Technician
UNC-CH,20,"OCT 05, 2020","Research Asst/Tech, Health/Medicine",36000,Radiology - Research,Soc/Clin Research Assistant
UNC-CH,20,"NOV 02, 2020","Research Asst/Tech, Health/Medicine",36000,Lineberger Compr Cancer Center,Soc/Clin Research Assistant


Then, use `np.arange` to create an array that contains the integers from 0 to 47000, incrementing by 100. Call this array `employee_number`. Inspect the array to ensure it contains the integers 0 and 47000.
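
Keep in mind that `np.arange` stops *before* its stop value, which is why adding 1 to the stop value is a common way to include the endpoint. The small sketch below shows the difference on a shorter range than the one used in this activity.

```python
import numpy as np

# The stop value is excluded, so this array ends at 400, not 500.
without_stop = np.arange(0, 500, 100)

# Adding 1 to the stop value makes 500 the last element.
with_stop = np.arange(0, 500 + 1, 100)

print(without_stop)
print(with_stop)
```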

In [28]:
# You complete some code in this cell
# employee_number = ...
# employee_number
employee_number = np.arange(0, 47000 + 1, 100)
employee_number

array([    0,   100,   200,   300,   400,   500,   600,   700,   800,
         900,  1000,  1100,  1200,  1300,  1400,  1500,  1600,  1700,
        1800,  1900,  2000,  2100,  2200,  2300,  2400,  2500,  2600,
        2700,  2800,  2900,  3000,  3100,  3200,  3300,  3400,  3500,
        3600,  3700,  3800,  3900,  4000,  4100,  4200,  4300,  4400,
        4500,  4600,  4700,  4800,  4900,  5000,  5100,  5200,  5300,
        5400,  5500,  5600,  5700,  5800,  5900,  6000,  6100,  6200,
        6300,  6400,  6500,  6600,  6700,  6800,  6900,  7000,  7100,
        7200,  7300,  7400,  7500,  7600,  7700,  7800,  7900,  8000,
        8100,  8200,  8300,  8400,  8500,  8600,  8700,  8800,  8900,
        9000,  9100,  9200,  9300,  9400,  9500,  9600,  9700,  9800,
        9900, 10000, 10100, 10200, 10300, 10400, 10500, 10600, 10700,
       10800, 10900, 11000, 11100, 11200, 11300, 11400, 11500, 11600,
       11700, 11800, 11900, 12000, 12100, 12200, 12300, 12400, 12500,
       12600, 12700,

Lastly, run the cell below, which uses the `take` method. The `take` method can be given an integer or an array of integers that represent row numbers, also called row indices. When the method is called, it creates a new table that contains only the rows whose indices were included in the call. We'll save this to the table named `sample` since it is a sample of our larger table named `unc`.
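
If row indices are new to you, the plain Python sketch below mimics the idea behind `take` on a short list. It uses the `AGE` values from the first ten rows of `unc` shown at the top of this notebook, and it is only an analogy for how the table method selects rows.

```python
# AGE values from the first ten rows of `unc` shown at the top of the notebook.
ages = [69, 63, 47, 41, 59, 37, 42, 54, 50, 40]

# Row indices start at 0, so these pick the 1st, 4th, 7th, and 10th rows.
row_indices = list(range(0, 10, 3))

# Keep only the values whose positions are listed in row_indices.
taken = [ages[i] for i in row_indices]
print(row_indices)
print(taken)
```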

In [31]:
sample = unc.take(employee_number)
sample

INSTITUTION NAME,AGE,INITIAL HIRE DATE,JOB CATEGORY,EMPLOYEE ANNUAL BASE SALARY,EMPLOYEE HOME DEPARTMENT,PRIMARY WORKING TITLE
ASU,69,"AUG 01, 1998","Librarian AC, Other",49507,Library,Adjunct Assistant Professor
ASU,38,"JUL 01, 2019","IT Client Support Professional, Business Analyst",60000,Business Systems,IT Business Systems Analyst I
ASU,59,"AUG 15, 1994",Graphical Design Paraprofessional,45954,School Of Music,Arts Production Specialist
ASU,40,"SEP 03, 2008",Cashier,31200,Park Place Cafe,Support Services Associate
ASU,36,"JAN 29, 2008",Assistant Coach - Football,90000,Football,Assistant Football Coach
ASU,43,"MAR 06, 2017",Research / Sponsored Projects Development Professional,65000,College of Education,Dir of Assessment and Accredit
ASU,46,"JUL 01, 2005",Lecturer,49600,Reading Education & Specl Education,Adjunct Instructor
ASU,59,"AUG 28, 2008",Administrative Assistant,35000,Executive Director Alumni Affairs,Administrative Support Assoc
ASU,61,"OCT 27, 1986",IT Technical Support/Paraprofessional,105222,Information Technology Service,Systems Programmer/Specialist
ASU,29,"MAY 20, 2019",General Maintenance Worker,31200,University Housing-Operations,Facility Maint Tech - Building


**NOTE:** This is NOT a random sample! We'll investigate different and better methods for sampling data later in the course.

Let's see how well our sample captures the patterns in the original table. Calculate the average salary of all individuals in `sample`.

In [34]:
# You complete some code in this cell
# ...
np.average( sample.column('EMPLOYEE ANNUAL BASE SALARY') )

75314.3049044586

How does this compare to the average salary of the individuals in `unc`? Is it about the same, or very different? Why do you think that is?
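
One way to put a number on the comparison is to compute the gap between the two averages, both in dollars and as a percentage of the full-table average. The sketch below copies both averages from earlier in this notebook; the variable names are chosen for illustration.

```python
# Both averages are copied from earlier in this notebook.
population_average = 75617.39        # all employees in `unc`
sample_average = 75314.3049044586    # employees in `sample`

# Gap in dollars, then as a percentage of the full-table average.
difference = population_average - sample_average
percent_difference = 100 * difference / population_average

print(f'Difference: ${difference:,.2f}')
print(f'Relative difference: {percent_difference:.2f}%')
```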

## You explore!

Make a calculation that you find interesting and share with your classmates in the discussion board for this activity. Include your code so others can reproduce your work.