# SF Salaries Exercise 

Welcome to a quick exercise to practice your pandas skills! We will be using the [SF Salaries Dataset](https://www.kaggle.com/kaggle/sf-salaries) from Kaggle. Follow along and complete the tasks outlined in bold below. The tasks get progressively harder.

**Import pandas as pd.**

In [2]:
import pandas as pd

**Read Salaries.csv as a DataFrame called sal.**

In [3]:
sal = pd.read_csv('files/Salaries.csv')

**Check the head of the DataFrame.**

In [4]:
sal.head() # Get df head

| | Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411 | 0.0 | 400184.0 | | 567595.43 | 567595.43 | 2011 | | San Francisco | |
| 1 | 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966 | 245132.0 | 137811.0 | | 538909.28 | 538909.28 | 2011 | | San Francisco | |
| 2 | 3 | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739 | 106088.0 | 16452.6 | | 335279.91 | 335279.91 | 2011 | | San Francisco | |
| 3 | 4 | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916 | 56120.7 | 198307.0 | | 332343.61 | 332343.61 | 2011 | | San Francisco | |
| 4 | 5 | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134402 | 9737.0 | 182235.0 | | 326373.19 | 326373.19 | 2011 | | San Francisco | |


**Use the .info() method to find out how many entries there are.**

In [5]:
sal.info() # There are 148,654 entries

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object 
 2   JobTitle          148654 non-null  object 
 3   BasePay           148049 non-null  object 
 4   OvertimePay       148654 non-null  object 
 5   OtherPay          148654 non-null  object 
 6   Benefits          112495 non-null  object 
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object 
 12  Status            38119 non-null   object 
dtypes: float64(3), int64(2), object(8)
memory usage: 14.7+ MB


**What is the average BasePay?**

In [6]:
# Get the indices of the not provided base pays
notProvided = sal['BasePay'][sal['BasePay'] == 'Not Provided'].index

# Drop missing and 'Not Provided' values, cast the rest to float, then take the mean
sal['BasePay'].dropna().drop(notProvided).apply(lambda x:float(x)).mean()

66325.44884050643
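The clean-up above can also be written more compactly. The following is a minimal sketch, not the exercise's official solution: it assumes `pd.to_numeric` with `errors='coerce'` is acceptable for this column, and it uses a small made-up Series in place of `sal['BasePay']` so it runs on its own.

```python
import pandas as pd

# Toy stand-in for sal['BasePay']; the values are invented for illustration
base_pay = pd.Series(['100.0', '200.5', 'Not Provided', None, '300'])

# errors='coerce' turns anything that cannot be parsed (e.g. 'Not Provided') into NaN
numeric_pay = pd.to_numeric(base_pay, errors='coerce')

# mean() skips NaN values by default
average = numeric_pay.mean()
print(average)
```

On the real data the same idea would read `pd.to_numeric(sal['BasePay'], errors='coerce').mean()`. Note that coercion silently hides every unparseable value, so it is worth checking how many NaN values it produced.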

**What is the highest amount of OvertimePay in the dataset?**

In [7]:
# Same clean up 
notProvided = sal['OvertimePay'][sal['OvertimePay'] == 'Not Provided'].index
# Get the maximum overtime pay 
sal['OvertimePay'].dropna().drop(notProvided).apply(lambda x:float(x)).max()

245131.88

**What is the job title of JOSEPH DRISCOLL? Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll).**

In [8]:
# Select JOSEPH DRISCOLL's record then get the job title
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']

24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object

**How much does JOSEPH DRISCOLL make (including benefits)?**

In [9]:
# Get JOSEPH DRISCOLL's total pay and benefits
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']

24    270324.91
Name: TotalPayBenefits, dtype: float64

**What is the name of the highest paid person (including benefits)?**

In [10]:
# Remove not provided values
notProvided = sal['TotalPayBenefits'][sal['TotalPayBenefits'] == 'Not Provided'].index
# Get the max total pay and benefits
highestPaid = sal['TotalPayBenefits'].dropna().drop(notProvided).apply(lambda x:float(x)).max()
# Get the name of the highest paid person
sal[sal['TotalPayBenefits'] == highestPaid]['EmployeeName']

0    NATHANIEL FORD
Name: EmployeeName, dtype: object

**What is the name of the lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?**

In [11]:
# Remove not provided values
notProvided = sal['TotalPayBenefits'][sal['TotalPayBenefits'] == 'Not Provided'].index
# Get the minimum total pay and benefits
lowestPaid = sal['TotalPayBenefits'].dropna().drop(notProvided).apply(lambda x:float(x)).min()
# Get the name and pay of the lowest paid person
sal[sal['TotalPayBenefits'] == lowestPaid][['EmployeeName','TotalPayBenefits']] # Negative pay

| | EmployeeName | TotalPayBenefits |
|---|---|---|
| 148653 | Joe Lopez | -618.13 |
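Since `TotalPayBenefits` is already a float column according to the `.info()` output, the highest and lowest paid rows can also be looked up by index label. This is a hedged sketch on a made-up DataFrame (`toy` and its values are hypothetical), using `idxmax` and `idxmin` together with `.loc`.

```python
import pandas as pd

# Hypothetical miniature of the salary table
toy = pd.DataFrame({
    'EmployeeName': ['A ONE', 'B TWO', 'C THREE'],
    'TotalPayBenefits': [1500.0, -20.5, 990.0],
})

# idxmax / idxmin return the index label of the first extreme value
highest_row = toy.loc[toy['TotalPayBenefits'].idxmax()]
lowest_row = toy.loc[toy['TotalPayBenefits'].idxmin()]

print(highest_row['EmployeeName'], lowest_row['EmployeeName'])
```

One trade-off: `idxmax` and `idxmin` return only the first matching label, so tied rows are not reported, whereas the boolean-mask approach used in the cells above returns every tied row.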


**What was the average (mean) BasePay of all employees per year (2011-2014)?**

In [12]:
# Clean up
notProvided = sal['BasePay'][sal['BasePay'] == 'Not Provided'].index
mask = sal['BasePay'].dropna().drop(notProvided).apply(lambda x:float(x)).index

# Keep only the rows with numerical pay values
df = sal.loc[mask]
# Cast base pay from a string to a float in a separate dataframe
df2 = df['BasePay'].apply(lambda x:float(x))
# Remove the string base pay 
df = df.drop('BasePay',axis=1)
# Concatenate the 2 dataframes
df = pd.concat([df,df2],axis=1)

# Group by the year and take the mean
df[['BasePay','Year']].groupby(['Year']).mean()

| Year | BasePay |
|---|---|
| 2011 | 63595.956517 |
| 2012 | 65436.406857 |
| 2013 | 69630.030216 |
| 2014 | 66564.421924 |
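The drop-and-concatenate steps above can be avoided by converting the column first and grouping afterwards. A minimal sketch, assuming coercion to NaN is acceptable; the DataFrame `toy` and the helper column name `BasePayNum` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the salary table with string-typed pay values
toy = pd.DataFrame({
    'Year': [2011, 2011, 2012, 2012, 2012],
    'BasePay': ['100', 'Not Provided', '200', '400', None],
})

# Convert to numbers; unparseable entries become NaN
toy['BasePayNum'] = pd.to_numeric(toy['BasePay'], errors='coerce')

# groupby(...).mean() ignores NaN within each group
mean_by_year = toy.groupby('Year')['BasePayNum'].mean()
print(mean_by_year)
```

Selecting the single column after `groupby` yields a Series indexed by year rather than a one-column DataFrame.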


**How many unique job titles are there?**

In [13]:
sal['JobTitle'].nunique() # There are 2159 unique job titles

2159

**What are the top 5 most common jobs?**

In [14]:
sal['JobTitle'].value_counts().head() # 5 most common jobs

Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64

**How many job titles were represented by only one person in 2013 (i.e., job titles with only one occurrence in 2013)?**

In [15]:
sum(sal[sal['Year']==2013]['JobTitle'].value_counts() == 1) # Job titles only appearing once in 2013

202
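The one-liner above packs three steps together: filter to a year, count each title, then count the titles that occur once. Spelled out on a made-up DataFrame (`toy` is hypothetical), a sketch of the same logic looks like this.

```python
import pandas as pd

# Hypothetical miniature of the salary table
toy = pd.DataFrame({
    'Year': [2013, 2013, 2013, 2013, 2014],
    'JobTitle': ['A', 'A', 'B', 'C', 'D'],
})

# Step 1 and 2: keep only 2013 rows, then count occurrences of each title
title_counts = toy.loc[toy['Year'] == 2013, 'JobTitle'].value_counts()

# Step 3: a boolean Series sums to the number of True entries
single_person_titles = int((title_counts == 1).sum())
print(single_person_titles)
```

Here `'A'` appears twice in 2013 and `'D'` belongs to 2014, so only `'B'` and `'C'` are counted.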

**How many people have the word Chief in their job title? (This is pretty tricky)**

In [16]:
# Function that returns True if 'chief' is in the title (case-insensitive) and False if not
def getChief(title):
    return 'chief' in title.lower()

# Apply the function to all job titles; summing the booleans counts the True values
sal['JobTitle'].apply(getChief).sum() # You get 627 people with the word chief in their title

627
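The same count can be obtained without a helper function by using the vectorized string accessor. A minimal sketch on made-up titles (the Series `titles` is hypothetical), assuming a plain case-insensitive substring match is what is wanted:

```python
import pandas as pd

# Made-up job titles, including a missing value
titles = pd.Series(['FIRE CHIEF', 'Chief of Police', 'Transit Operator', 'deputy chief', None])

# case=False ignores capitalization; na=False treats missing titles as non-matches
has_chief = titles.str.contains('chief', case=False, na=False)

chief_count = int(has_chief.sum())
print(chief_count)
```

Like the `in` check above, this matches substrings, so a title such as "Mischief Manager" would also be counted. Whether that matters depends on the titles actually present in the data.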

**Bonus: Is there a correlation between the length of the Job Title string and Salary?**

In [17]:
# Get the length of each job title and correlate with total pay with benefits
sal['JobTitle'].apply(len).corr(sal['TotalPayBenefits']) # Essentially no correlation

-0.03687844593260676
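For reference, here is a small sketch of what the correlation call computes, using `.str.len()` in place of `.apply(len)`. The data is invented so that title length and pay are perfectly linearly related, which is not the case in the real dataset.

```python
import pandas as pd

# Made-up data: title lengths 1, 2, 3, 4 and pay rising in step with them
toy = pd.DataFrame({
    'JobTitle': ['A', 'AB', 'ABC', 'ABCD'],
    'TotalPayBenefits': [10.0, 20.0, 30.0, 40.0],
})

# .str.len() gives the character count of each title
title_length = toy['JobTitle'].str.len()

# Series.corr computes the Pearson correlation by default
r = title_length.corr(toy['TotalPayBenefits'])
print(r)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0, like the result on the real data above, indicates little to none.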

# Great Job!