# SF Salaries Exercise Solution

Welcome to a quick exercise for you to practice your pandas skills! We will be using the [SF Salaries Dataset](https://www.kaggle.com/kaggle/sf-salaries) from Kaggle! Just follow along and complete the tasks outlined in bold below. The tasks will get harder and harder as you go along.

### Import pandas as pd.

In [1]:
import pandas as pd

**Read Salaries.csv as a DataFrame called df.**

A preview of the file showed that its first column is an index column, so we pass index_col=0 when reading the CSV into pandas.

Alternatively, the index could have been fixed after loading, as shown in cell In [6].

In [2]:
df=pd.read_csv("Salaries.csv", index_col=0)

### Check the head of the DataFrame. 

In [3]:
df.head()

| Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.0 | 400184.25 | NaN | 567595.43 | 567595.43 | 2011 | NaN | San Francisco | NaN |
| 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 | NaN | 538909.28 | 538909.28 | 2011 | NaN | San Francisco | NaN |
| 3 | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739.13 | 106088.18 | 16452.6 | NaN | 335279.91 | 335279.91 | 2011 | NaN | San Francisco | NaN |
| 4 | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916.0 | 56120.71 | 198306.9 | NaN | 332343.61 | 332343.61 | 2011 | NaN | San Francisco | NaN |
| 5 | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134401.6 | 9737.0 | 182234.59 | NaN | 326373.19 | 326373.19 | 2011 | NaN | San Francisco | NaN |


### Use the .info() method to find out how many entries there are.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148654 entries, 1 to 148654
Data columns (total 12 columns):
EmployeeName        148654 non-null object
JobTitle            148654 non-null object
BasePay             148045 non-null float64
OvertimePay         148650 non-null float64
OtherPay            148650 non-null float64
Benefits            112491 non-null float64
TotalPay            148654 non-null float64
TotalPayBenefits    148654 non-null float64
Year                148654 non-null int64
Notes               0 non-null float64
Agency              148654 non-null object
Status              0 non-null float64
dtypes: float64(8), int64(1), object(3)
memory usage: 13.0+ MB


### Bonus: Check for columns that contain no entries and remove them

In [5]:
print(df.isnull().sum())
df.dropna(axis=1, how="all", inplace=True)
df.info()  # info() prints its report directly and returns None, so no print() is needed

EmployeeName             0
JobTitle                 0
BasePay                609
OvertimePay              4
OtherPay                 4
Benefits             36163
TotalPay                 0
TotalPayBenefits         0
Year                     0
Notes               148654
Agency                   0
Status              148654
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 148654 entries, 1 to 148654
Data columns (total 10 columns):
EmployeeName        148654 non-null object
JobTitle            148654 non-null object
BasePay             148045 non-null float64
OvertimePay         148650 non-null float64
OtherPay            148650 non-null float64
Benefits            112491 non-null float64
TotalPay            148654 non-null float64
TotalPayBenefits    148654 non-null float64
Year                148654 non-null int64
Agency              148654 non-null object
dtypes: float64(6), int64(1), object(3)
memory usage: 10.8+ MB
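
To see why how="all" matters here, the following minimal sketch uses a made-up three-row frame (the values are hypothetical, not taken from Salaries.csv). It first lists the columns that are entirely empty and then drops only those.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "Notes" is entirely missing, "Benefits" only partly.
toy = pd.DataFrame({
    "BasePay": [100.0, 200.0, 300.0],
    "Benefits": [np.nan, 50.0, 60.0],
    "Notes": [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()

# how="all" drops only the fully empty columns.
cleaned = toy.dropna(axis=1, how="all")

# The default how="any" would also drop the partly filled "Benefits" column.
too_aggressive = toy.dropna(axis=1)

print(empty_cols)
print(cleaned.columns.tolist())
print(too_aggressive.columns.tolist())
```

With the default how="any", a column like Benefits, which has values for most rows, would be lost as well.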


- If the CSV had been loaded without specifying the index column, the following could have been used instead.

    Replace the default index with the "Id" column so that the index starts at 1, and remove the original "Id" column.

In [6]:
# df.index = df["Id"]
# df.drop("Id", axis=1, inplace=True)
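
As a rough check that the two routes give the same result, here is a small sketch on an in-memory CSV. The three rows are hypothetical stand-ins for Salaries.csv, and set_index("Id") is used as a one-step alternative to assigning the index and dropping the column.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for Salaries.csv (not the real data).
csv_text = (
    "Id,EmployeeName,BasePay\n"
    "1,A SMITH,100.0\n"
    "2,B JONES,200.0\n"
    "3,C LEE,300.0\n"
)

# Option 1: choose the index column while reading.
at_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read with the default 0-based index, then promote "Id" to the index.
# set_index moves the column into the index, so no separate drop is needed.
after_read = pd.read_csv(io.StringIO(csv_text)).set_index("Id")

print(at_read.equals(after_read))
```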

### What is the average BasePay?
    
    Use the mean() method to find the mean of the series object.

In [7]:
df["BasePay"].mean()

66325.4488404877
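
Note that BasePay contains missing values (609 according to the isnull() count above). By default, mean() skips them. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the last entry is missing.
base_pay = pd.Series([100.0, 200.0, np.nan])

# Default behaviour: NaN is ignored, so this is (100 + 200) / 2.
mean_skip = base_pay.mean()

# With skipna=False a single NaN makes the whole result NaN.
mean_strict = base_pay.mean(skipna=False)

print(mean_skip, mean_strict)
```

So the average reported above is taken over the employees that have a recorded BasePay, not over all rows.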

### What is the highest amount of OvertimePay in the dataset?
    
    Use the max() method to determine the maximum entry in the column.

In [8]:
df["OvertimePay"].max()

245131.88

### What is the job title of JOSEPH DRISCOLL?

    Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll). 

- A boolean condition placed inside the DataFrame's selection brackets yields the desired result.

In [9]:
df[df["EmployeeName"] == "JOSEPH DRISCOLL"]["JobTitle"]

Id
25    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object
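
The same lookup can also be written with .loc, which selects rows and a column in a single step. The sketch below uses a hypothetical three-row frame (names and titles are made up) and also shows why the case of the name matters.

```python
import pandas as pd

# Hypothetical toy data with the same name in two different cases.
toy = pd.DataFrame(
    {
        "EmployeeName": ["JOSEPH DRISCOLL", "Joseph Driscoll", "ANN LEE"],
        "JobTitle": ["CAPTAIN", "Captain", "CLERK"],
    },
    index=pd.Index([1, 2, 3], name="Id"),
)

# Exact, case-sensitive match: only the all-caps row is selected.
mask = toy["EmployeeName"] == "JOSEPH DRISCOLL"
exact = toy.loc[mask, "JobTitle"]

# Case-insensitive match: upper-case the column before comparing.
any_case = toy.loc[toy["EmployeeName"].str.upper() == "JOSEPH DRISCOLL", "JobTitle"]

print(exact)
print(any_case)
```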

### How much does JOSEPH DRISCOLL make (including benefits)? 

    The same approach as the previous exercise

In [10]:
df[df["EmployeeName"] == "JOSEPH DRISCOLL"]["TotalPayBenefits"]

Id
25    270324.91
Name: TotalPayBenefits, dtype: float64

### What is the name of highest paid person (including benefits)?

- Select the rows where TotalPayBenefits equals the column's maximum value.
- Use square-bracket indexing to select only the EmployeeName and TotalPayBenefits columns.

In [11]:
df[(df["TotalPayBenefits"]) == (df["TotalPayBenefits"].max())][["EmployeeName", "TotalPayBenefits"]]

| Id | EmployeeName | TotalPayBenefits |
|---|---|---|
| 1 | NATHANIEL FORD | 567595.43 |


### What is the name of lowest paid person (including benefits)? 
#### Do you notice something strange about how much he or she is paid?

    Yes, the employee's total pay is negative (-618.13), which is quite strange for a salary.

In [12]:
df[(df["TotalPayBenefits"]) == (df["TotalPayBenefits"].min())][["EmployeeName", "TotalPayBenefits"]]

| Id | EmployeeName | TotalPayBenefits |
|---|---|---|
| 148654 | Joe Lopez | -618.13 |
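
Both the highest-paid and lowest-paid lookups can alternatively be done with idxmax() and idxmin(), which return the index label of the extreme value. This is only a sketch on hypothetical data. Unlike the boolean comparison used above, idxmax() and idxmin() return a single label even when several rows tie.

```python
import pandas as pd

# Hypothetical toy data indexed by "Id".
toy = pd.DataFrame(
    {
        "EmployeeName": ["A SMITH", "B JONES", "C LEE"],
        "TotalPayBenefits": [300.0, 900.0, -5.0],
    },
    index=pd.Index([1, 2, 3], name="Id"),
)

# Index labels of the largest and smallest TotalPayBenefits.
top_id = toy["TotalPayBenefits"].idxmax()
bottom_id = toy["TotalPayBenefits"].idxmin()

# Look the names up by label.
top_name = toy.loc[top_id, "EmployeeName"]
bottom_name = toy.loc[bottom_id, "EmployeeName"]

print(top_id, top_name)
print(bottom_id, bottom_name)
```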


### What was the average (mean) BasePay of all employees per year (2011-2014)?

- Use the groupby() method to group the DataFrame by year. This yields a GroupBy object.
- Select the BasePay column of the GroupBy object and take its mean, so that only numeric data is averaged.

In [13]:
df.groupby("Year")["BasePay"].mean()

Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64
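
A small sketch of the per-year mean on hypothetical data. It selects the column before aggregating, so text columns such as JobTitle are never averaged (recent pandas releases may raise an error when asked to take the mean of a text column).

```python
import pandas as pd

# Hypothetical toy data: two employees in each of two years.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2012, 2012],
    "JobTitle": ["CLERK", "NURSE", "CLERK", "NURSE"],
    "BasePay": [100.0, 300.0, 200.0, 400.0],
})

# Group by year, pick the numeric column, then aggregate.
per_year = toy.groupby("Year")["BasePay"].mean()

print(per_year)
```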

### How many unique job titles are there? 

    The nunique() method of a pandas series object can be used to check the total number of unique entries in a column

In [14]:
df["JobTitle"].nunique()

2159

### What are the top 5 most common jobs? 

- Use the value_counts() method on the dataframe selection to display the unique items in the column in descending order of frequency
- Use the head() method to display the first five results

In [15]:
df["JobTitle"].value_counts().head()

Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64

### How many Job Titles were represented by only one person in 2013? (i.e. Job Titles with only one occurrence in 2013?)



In [16]:
sum(df[df["Year"] == 2013]["JobTitle"].value_counts()==1)

202
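
The one-liner above filters the rows to 2013, counts how often each title occurs, and then counts the titles whose frequency is exactly 1. The same steps, spelled out on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data: in 2013, "Clerk" appears twice, "Nurse" and "Pilot" once.
toy = pd.DataFrame({
    "Year": [2013, 2013, 2013, 2013, 2014],
    "JobTitle": ["Clerk", "Clerk", "Nurse", "Pilot", "Diver"],
})

# Step 1: keep only the 2013 rows and count each job title.
counts_2013 = toy.loc[toy["Year"] == 2013, "JobTitle"].value_counts()

# Step 2: comparing with 1 gives booleans; summing them counts the True values.
singletons = int((counts_2013 == 1).sum())

print(singletons)
```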

### How many people have the word Chief in their job title? (This is pretty tricky) 

- Create a function to convert a string into lowercase and return true if "chief" is contained in it.
- Apply this function on the JobTitle column
- Use the sum function to sum the entries which yield a true value from the selection

In [17]:
def chief_finder(title):
    # True when "chief" appears anywhere in the title, ignoring case
    return "chief" in title.lower()

sum(df["JobTitle"].apply(chief_finder))

627
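
As an alternative to a custom function, pandas offers the vectorized str.contains() method. The sketch below runs on a few made-up titles and assumes a plain substring match is what is wanted. Like the function above, it also matches words that merely contain "chief".

```python
import pandas as pd

# Hypothetical job titles.
titles = pd.Series(["Fire Chief", "CHIEF OF POLICE", "Clerk", "Mischief Officer"])

# case=False makes the match case-insensitive.
has_chief = titles.str.contains("chief", case=False)

# Summing booleans counts the matching titles.
count = int(has_chief.sum())

print(count)
```

Here "Mischief Officer" is counted too, because the match is on substrings rather than whole words.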

### Bonus: Is there a correlation between length of the Job Title string and Salary? 

- Create a new column and assign the result of apply(len) operation on the JobTitle column to it.
- Use the corr() method on the two columns of interest to find their correlation

In [18]:
df["title_len"] = df["JobTitle"].apply(len)
df[["title_len", "TotalPayBenefits"]].corr()

| | title_len | TotalPayBenefits |
|---|---|---|
| title_len | 1.0 | -0.036878 |
| TotalPayBenefits | -0.036878 | 1.0 |
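
If only the single coefficient is needed rather than the full matrix, Series.corr() can be called on one column with the other as its argument. A minimal sketch on hypothetical data, using str.len() in place of apply(len):

```python
import pandas as pd

# Hypothetical toy data in which longer titles happen to pay more.
toy = pd.DataFrame({
    "JobTitle": ["Clerk", "Fire Chief", "Registered Nurse"],
    "TotalPayBenefits": [10.0, 20.0, 30.0],
})

# Length of each title, computed with the vectorized string accessor.
title_len = toy["JobTitle"].str.len()

# Pearson correlation between the two series (a single float).
r = title_len.corr(toy["TotalPayBenefits"])

# The same number appears off the diagonal of the correlation matrix.
matrix = toy.assign(title_len=title_len)[["title_len", "TotalPayBenefits"]].corr()

print(r)
```

For the real dataset, the coefficient of about -0.037 shown above is very close to zero, which suggests essentially no linear relationship between title length and total pay.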


# Great Job!