# Advanced Pandas Assignment 3

In this assignment, you will practice cleaning "dirty" data.

### Note about assignments
You can add lines of code according to your preferences. As long as the code required by the assignment is found in this notebook under the corresponding question header (i.e., the answer to question 1 is underneath the title "Question 1"), you will receive credit for it.

## About the data
The data used in this assignment is a table built from the Human Resources schema of the Adventure Works 2019 database. This data contains information about each time that Employee Pay History was changed (each line is a pay rate change). It also contains information about the employee and the department they were working in when they received the pay rate listed.

This data is similar to data used in previous assignments but has been "dirtied" in order to make this assignment possible. 

The actual data is stored in a CSV file located inside the `data` folder. The file is called `pay_history.csv`.

## Instructions
### Set up
##### Import Pandas
Import the Pandas library into Jupyter Lab.

<p style="font-size:.75rem">Expected output: None</p>

In [52]:
import pandas as pd

##### Disable column display limit
Use the following code to disable the default limit for displaying columns. If you don't use this code, a data set with more than 20 columns will be truncated when displayed to take up less space.

```python
pd.options.display.max_columns = None
```

<p style="font-size:.75rem">Expected output: None</p>

In [53]:
pd.options.display.max_columns = None
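
If the limit should be lifted only for a single display rather than for the whole notebook, a context manager is one option. The following is a minimal sketch on a made-up 30-column frame (the `wide` dataframe and its column names are assumptions for illustration, not part of the assignment data).

```python
import pandas as pd

# Hypothetical wide frame: one row, 30 columns.
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

default_limit = pd.get_option("display.max_columns")

# Inside the block the column limit is disabled; the previous
# setting is restored automatically when the block ends.
with pd.option_context("display.max_columns", None):
    limit_inside = pd.get_option("display.max_columns")
    print(wide)

limit_after = pd.get_option("display.max_columns")
```

The assignment itself still expects the global setting shown above.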

##### Create the dataframe
Use the `read_csv()` function from Pandas to read the data from the `pay_history.csv` file into a dataframe called `df`.

<p style="font-size:.75rem">Expected output: None</p>

In [54]:
df = pd.read_csv("./data/pay_history.csv")

##### Preview dataframe
Use the `.head()` method to print out the first 5 rows of the dataframe.

<p style="font-size:.75rem">Expected output: 5 rows, indexes 0-4</p>

In [55]:
df.head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
0,1,16,00:00.0,125.5,2,adventure-works\ken0,,Chief Executive Officer,1/29/1969,S,M,1/14/2009,1.0,99,69,1,1,1/14/2009,,00:00.0,Executive,Executive General and Administration
1,2,1,00:00.0,63.4615,2,adventure-works\terri0,1.0,Vice President of Engineering,8/1/1971,S,F,1/31/2008,,1,20,1,1,1/31/2008,,00:00.0,Engineering,Research and Development
2,3,1,00:00.0,43.2692,2,adventure-works\roberto0,2.0,Engineering Manager,11/12/1974,M,M,11/11/2007,,2,21,1,1,11/11/2007,,00:00.0,Engineering,Research and Development
3,4,1,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,,48,80,1,1,12/5/2007,5/30/2010,00:00.0,Engineering,Research and Development
4,4,2,00:00.0,8.62,2,adventure-works\rob0,3.0,Senior Tool Designer,12/23/1974,S,M,12/5/2007,,48,80,1,1,5/31/2010,,00:00.0,Tool Design,Research and Development


### Questions
The data you were given to analyze looks perfectly normal at first glance. However, knowing that humans are imperfect at inputting data, you decide to do a quick glance over the data to make sure that nothing strange is happening. If there are mistakes in the data, you are determined to fix them in such a way that they don't affect the analysis.


#### Find Null Values
##### Question 1: Describe the data 
Before anything else, use the `.info()` method to check the non-null count of each column in the data set.

<p style="font-size:.75rem">Expected output: Informational dataframe with 304 entries and 22 columns, showing non-null count and dtype for each. Ex. SalariedFlag column has 38 non-null values</p>

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         304 non-null    int64  
 1   DepartmentID       304 non-null    int64  
 2   RateChangeDate     304 non-null    object 
 3   Rate               302 non-null    float64
 4   PayFrequency       304 non-null    int64  
 5   LoginID            304 non-null    object 
 6   OrganizationLevel  303 non-null    float64
 7   JobTitle           304 non-null    object 
 8   BirthDate          304 non-null    object 
 9   MaritalStatus      304 non-null    object 
 10  Gender             304 non-null    object 
 11  HireDate           304 non-null    object 
 12  SalariedFlag       38 non-null     float64
 13  VacationHours      304 non-null    int64  
 14  SickLeaveHours     304 non-null    int64  
 15  CurrentFlag        304 non-null    object 
 16  ShiftID            304 non

##### Question 2: Count null values
Use the `.isna()` method in conjunction with the `.sum()` method to get a count of null values in each column.

<p style="font-size:.75rem">Expected output: Series of column names and counts of their null values. Rate has 2 null values and SalariedFlag has 266 null values.</p>

In [57]:
df.isna().sum()

EmployeeID             0
DepartmentID           0
RateChangeDate         0
Rate                   2
PayFrequency           0
LoginID                0
OrganizationLevel      1
JobTitle               0
BirthDate              0
MaritalStatus          0
Gender                 0
HireDate               0
SalariedFlag         266
VacationHours          0
SickLeaveHours         0
CurrentFlag            0
ShiftID                0
StartDate              0
EndDate              297
ModifiedDate           0
DepartmentName         0
Sub-Department         1
dtype: int64

##### Question 3: Which columns have null values?
Which columns have null values? What do you think should be done to these columns to prepare the data set for analysis (imputation or drop the column)?

Remember that imputation is usually only done when there are few null values, and dropping the column is necessary when there are many null values.

<p style="font-size:.75rem">Expected output: None</p>

1. Rate - Imputation
2. OrganizationLevel - Imputation
3. SalariedFlag - Drop Column
4. EndDate - Drop Column
5. Sub-Department - Imputation
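
One way to make the "few versus many" judgment concrete is to look at the share of missing values per column. The sketch below reuses the null counts reported in Question 2; the 50% cut-off (`DROP_THRESHOLD`) is an arbitrary assumption for illustration, not a rule from the assignment.

```python
import pandas as pd

# Null counts reported by df.isna().sum() in Question 2 (304 rows in total).
null_counts = pd.Series(
    {"Rate": 2, "OrganizationLevel": 1, "SalariedFlag": 266,
     "EndDate": 297, "Sub-Department": 1}
)
total_rows = 304

# Fraction of rows that are missing in each column.
null_share = null_counts / total_rows

# Hypothetical cut-off: above 50% missing, treat the column as a drop candidate.
DROP_THRESHOLD = 0.5
suggestion = null_share.apply(lambda s: "drop" if s > DROP_THRESHOLD else "impute")
print(suggestion)
```

On the real dataframe, `df.isna().mean()` would give the same shares directly.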

#### Rate - Imputation (part 1)
Because there are only two missing values in the `Rate` column, let's try to impute them. We can start to do this by discovering the average rate and then assigning it to both rows where `Rate` is null.

##### Question 4: Select the `Rate` column
Get the `Rate` column out of the dataframe.

<p style="font-size:.75rem">Expected output: Series of 304 Rates, where the first value is 125.5000 and the last is 23.0769</p>

In [58]:
df['Rate']

0      125.5000
1       63.4615
2       43.2692
3        8.6200
4        8.6200
         ...   
299     23.0769
300     48.1010
301         NaN
302     23.0769
303     23.0769
Name: Rate, Length: 304, dtype: float64

##### Question 5: Get the average rate
Use the `.mean()` method to get the average rate and save it to a variable `average_rate`. Print it out to show what the average rate is.

<p style="font-size:.75rem">Expected output: 17.665210596026487</p>

In [59]:
average_rate = df['Rate'].mean()
average_rate

17.665210596026487

##### Question 6: Get the rows where `Rate` is null
Use the `.isna()` method with the `.loc` property to create a filter that returns rows where `Rate` is null. Show these rows of the dataframe.

<p style="font-size:.75rem">Expected output: 2 rows with indexes 86 and 301</p>

In [60]:
df.loc[ df['Rate'].isna() ]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
86,83,7,00:00.0,,1,adventure-works\patrick1,4.0,Production Technician - WC40,12/23/1973,M,M,2/12/2010,,61,50,1,2,2/12/2010,,00:00.0,Production,Manufacturing
301,288,3,00:00.0,,2,adventure-works\rachel0,3.0,Sales Representative,7/9/1975,S,F,5/30/2013,,35,37,1,1,5/30/2013,,00:00.0,Sales,Sales and Marketing


##### Question 7: Get the indexes of the null rows
Use the `.index` property to get the indexes from the filtered dataframe and save them to a variable `null_rates`. Print out this variable as well.

<p style="font-size:.75rem">Expected output: Int64Index containing 86 and 301</p>

In [61]:
null_rates = df[ df['Rate'].isna() ].index
null_rates

Int64Index([86, 301], dtype='int64')

##### Question 8: Get `Rate` column of filtered dataframe
Using the filter from above (Question 6), get only the `Rate` column by passing `'Rate'` as the second argument (the column selector) to `.loc`.

You should get two rows back in a Series.

<p style="font-size:.75rem">Expected output: Series with two NaN values that have indexes 86 and 301</p>

In [62]:
df.loc[ df['Rate'].isna(), 'Rate' ]

86    NaN
301   NaN
Name: Rate, dtype: float64

##### Question 9: Set null rows equal to overall average rate
Using the code from the previous question, set the `Rate` of the two rows with null rates equal to the `average_rate`.

<p style="font-size:.75rem">Expected output: None</p>

In [63]:
df.loc[ df['Rate'].isna(), 'Rate' ] = average_rate
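
The same kind of mean imputation can also be written with `.fillna()`. This is a minimal sketch on a toy column (the values are made up), offered as an alternative spelling rather than the expected answer.

```python
import numpy as np
import pandas as pd

# Toy data: two known rates and two missing ones.
toy = pd.DataFrame({"Rate": [10.0, np.nan, 20.0, np.nan]})

# .mean() skips NaN, so this is (10 + 20) / 2 = 15.
toy_mean = toy["Rate"].mean()

# Fill every missing rate with the mean and assign the result back.
toy["Rate"] = toy["Rate"].fillna(toy_mean)
print(toy)
```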

##### Question 10: Print out the number of nulls
Using the code from Question 2, print out the number of null values in each column.

<p style="font-size:.75rem">Expected output: Count of null values. Rate should equal 0 now</p>

In [64]:
df.isna().sum()

EmployeeID             0
DepartmentID           0
RateChangeDate         0
Rate                   0
PayFrequency           0
LoginID                0
OrganizationLevel      1
JobTitle               0
BirthDate              0
MaritalStatus          0
Gender                 0
HireDate               0
SalariedFlag         266
VacationHours          0
SickLeaveHours         0
CurrentFlag            0
ShiftID                0
StartDate              0
EndDate              297
ModifiedDate           0
DepartmentName         0
Sub-Department         1
dtype: int64

#### Rate - Imputation (part 2)
##### Question 11: Is the overall average `Rate` the best?
Now we can see that the overall average `Rate` was successfully imputed into the rows where `Rate` was null. However, is the overall average an accurate estimate of what each employee's `Rate` most likely is? What could be done to estimate the rate more accurately?

<p style="font-size:.75rem">Expected output: None</p>

```
The overall average rate might be completely different from the actual average rate for each employee. Production Technicians and Sales Representatives might make more or less than the average employee. A more accurate representation of the actual likely rate for each one could be found by grouping the employees by department or job title and THEN finding the average rate in that group.
```
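
The idea in the answer above (average within a group, then fill) can be sketched in one step with `.groupby()` and `.transform()`. The toy job titles and rates below are invented for illustration; the questions that follow take a more manual route.

```python
import numpy as np
import pandas as pd

# Toy data: one missing rate per job title.
toy = pd.DataFrame({
    "JobTitle": ["Technician", "Technician", "Technician",
                 "Sales Rep", "Sales Rep", "Sales Rep"],
    "Rate": [15.0, 15.0, np.nan, 23.0, 25.0, np.nan],
})

# Mean rate of each job title, repeated on every row of that title.
group_mean = toy.groupby("JobTitle")["Rate"].transform("mean")

# Fill each missing rate with the mean of its own job title.
toy["Rate"] = toy["Rate"].fillna(group_mean)
print(toy)
```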

##### Question 12: Get `JobTitle` of each (previously) null row
Using the `.loc` property and the `null_rates` list of indexes that you obtained previously, print out the `JobTitle` column (a Series) for each of the rows which previously had a null value for `Rate`.

<p style="font-size:.75rem">Expected output: Series of JobTitle for indexes 86 and 301</p>

In [65]:
df.loc[null_rates, 'JobTitle']

86     Production Technician - WC40
301            Sales Representative
Name: JobTitle, dtype: object

##### Question 13: Turn the Series into a list
Add the `.tolist()` method onto the previous line of code to place the two job titles into a list. Save this list of job titles into a variable `job_titles` and then print it out.

<p style="font-size:.75rem">Expected output: List of job titles. Should be ['Production Technician - WC40', 'Sales Representative']</p>

In [66]:
job_titles = df.loc[null_rates, 'JobTitle'].tolist()
job_titles

['Production Technician - WC40', 'Sales Representative']

##### Question 14: Filter the dataframe by `job_titles`
Using the `.isin()` method and the `.loc` property, get back all rows of the dataframe that have a value in the `JobTitle` column that exists in the `job_titles` list.

<p style="font-size:.75rem">Expected output: 40 rows starting with index 82-89 and ending with indexes 289-303</p>

In [68]:
df.loc[ df['JobTitle'].isin(job_titles)]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
82,79,7,00:00.0,15.0,1,adventure-works\eric1,4.0,Production Technician - WC40,12/8/1966,M,M,1/24/2010,,60,50,1,2,1/24/2010,,00:00.0,Production,Manufacturing
83,80,7,00:00.0,15.0,1,adventure-works\sandeep0,4.0,Production Technician - WC40,12/3/1970,S,M,1/17/2010,,65,52,1,2,1/17/2010,,00:00.0,Production,Manufacturing
84,81,7,00:00.0,15.0,1,adventure-works\mihail0,4.0,Production Technician - WC40,3/9/1971,S,M,12/29/2009,,64,52,1,2,12/29/2009,,00:00.0,Production,Manufacturing
85,82,7,00:00.0,15.0,1,adventure-works\jack1,4.0,Production Technician - WC40,8/29/1973,S,M,3/3/2010,,62,51,1,2,3/3/2010,,00:00.0,Production,Manufacturing
86,83,7,00:00.0,17.665211,1,adventure-works\patrick1,4.0,Production Technician - WC40,12/23/1973,M,M,2/12/2010,,61,50,1,2,2/12/2010,,00:00.0,Production,Manufacturing
87,84,7,00:00.0,15.0,1,adventure-works\frank3,4.0,Production Technician - WC40,3/2/1952,M,M,2/5/2010,,66,53,1,2,2/5/2010,,00:00.0,Production,Manufacturing
88,85,7,00:00.0,15.0,1,adventure-works\brian2,4.0,Production Technician - WC40,12/23/1970,S,M,12/11/2009,,63,51,1,2,12/11/2009,,00:00.0,Production,Manufacturing
89,86,7,00:00.0,15.0,1,adventure-works\ryan0,4.0,Production Technician - WC40,6/13/1972,M,M,1/5/2009,,59,49,1,2,1/5/2009,,00:00.0,Production,Manufacturing
131,128,7,00:00.0,15.0,1,adventure-works\paul0,4.0,Production Technician - WC40,11/13/1980,S,M,12/4/2008,,68,54,1,3,12/4/2008,,00:00.0,Production,Manufacturing
132,129,7,00:00.0,15.0,1,adventure-works\gary0,4.0,Production Technician - WC40,5/16/1988,S,M,12/22/2008,,69,54,1,3,12/22/2008,,00:00.0,Production,Manufacturing


##### Question 15: Observe the filtered dataframe above
Observe the filtered dataframe returned above. What do you notice about the average rate of pay for the different job titles? How are they different from the overall average rate?

<p style="font-size:.75rem">Expected output: None</p>

```
The overall average rate was $17.67. However, the typical rate for a Production Technician is $15, and the typical rate for a Sales Representative is $23.08. Thus, there is a fairly significant difference between the actual rates of people with those job titles and the overall average rate.
```

##### Question 16: Group by `JobTitle` and find the mode
Using the `.groupby()` and `.agg()` methods, group the filtered dataframe from Question 14 by `JobTitle` and aggregate to find the most common value for `Rate`.

Note: Pandas doesn't have a built-in aggregate function to find the mode using `.agg()`. Thus, the code `.agg({'Rate': 'mode'})` won't work. Instead, you'll have to use the Pandas Series method, which will look like this: `.agg({'Rate': pd.Series.mode})`.

<p style="font-size:.75rem">Expected output: Aggregation table where the mode Rate for Production Technician - WC40 is 15.0000 and the mode Rate for Sales Representative is 23.0769</p>

In [20]:
df.loc[ df['JobTitle'].isin(job_titles)].groupby('JobTitle').agg({'Rate': pd.Series.mode})

JobTitle,Rate
Production Technician - WC40,15.0
Sales Representative,23.0769
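
One thing to keep in mind is that `pd.Series.mode` returns every value tied for most common, so a group can yield more than one number. The sketch below, on made-up data, shows one possible way to keep a single value per group by taking the first (smallest) mode; this tie-breaking choice is an assumption, not part of the assignment.

```python
import pandas as pd

# Toy data: group A has a clear mode, group B has a tie (20.0 and 21.0).
toy = pd.DataFrame({
    "JobTitle": ["A", "A", "A", "B", "B"],
    "Rate": [15.0, 15.0, 16.0, 20.0, 21.0],
})

# Series.mode() returns the tied values in ascending order;
# .iloc[0] keeps only the smallest of them.
first_mode = toy.groupby("JobTitle")["Rate"].agg(lambda s: s.mode().iloc[0])
print(first_mode)
```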


##### Question 17: Replace overall rates with average rate by job title
Using any way you would like, replace both of the (previously) null rates that were assigned the overall average rate with a more precise rate based on their `JobTitle`. You can either do this (1) by using the `.loc` property and the indexes created before, or (2) by using the `.loc` property with a filter by average `Rate` and a specific `JobTitle`. You can also just type in the numbers, rather than try to extract them from the dataframe.

It will be easiest to do this on more than one line.

<p style="font-size:.75rem">Expected output: None</p>

In [21]:
# Using the index created previously
df.loc[ null_rates[0], 'Rate'] = 15

In [22]:
# Using a filter
df.loc[ (df['Rate'] == average_rate) & (df['JobTitle'] == 'Sales Representative'), 'Rate' ] = 23.0769
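
The filter above relies on exact floating-point equality, which works here because `average_rate` was assigned unchanged. If the value had been rounded somewhere along the way (for example, after saving and reloading the data), an exact match could fail. A tolerance-based comparison with `numpy.isclose` is one possible safeguard; the toy rows and the `atol` value below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: the first rate is the overall average after rounding to 6 decimals.
toy = pd.DataFrame({
    "JobTitle": ["Sales Representative", "Sales Representative", "Stocker"],
    "Rate": [17.665211, 23.0769, 9.0],
})
average_rate = 17.665210596026487  # overall mean reported in Question 5

# Exact equality finds nothing because of the rounding.
exact_hits = int((toy["Rate"] == average_rate).sum())

# Tolerance-based match combined with the job title filter.
close = np.isclose(toy["Rate"].to_numpy(), average_rate, atol=1e-5)
mask = (toy["JobTitle"] == "Sales Representative") & close
toy.loc[mask, "Rate"] = 23.0769
print(exact_hits, toy["Rate"].tolist())
```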

##### Question 18: Print out number of nulls
Using the `.isna()` and the `.sum()` methods, show how many null values exist in the dataframe.

<p style="font-size:.75rem">Expected output: Count of null values. Rate should still equal 0.</p>

In [23]:
df.isna().sum()

EmployeeID             0
DepartmentID           0
RateChangeDate         0
Rate                   0
PayFrequency           0
LoginID                0
OrganizationLevel      1
JobTitle               0
BirthDate              0
MaritalStatus          0
Gender                 0
HireDate               0
SalariedFlag         266
VacationHours          0
SickLeaveHours         0
CurrentFlag            0
ShiftID                0
StartDate              0
EndDate              297
ModifiedDate           0
DepartmentName         0
Sub-Department         1
dtype: int64

#### SalariedFlag - Drop Column
The columns `SalariedFlag` and `EndDate` seem to have many null values. In a way, it makes sense that there might be null values in the `EndDate` column, since the database designers may have designed the system so that it doesn't record any value until an employee actually stops working at Adventure Works. However, the `SalariedFlag` field is a little more suspicious.

Most likely, we will end up dropping the `SalariedFlag` column because there are so many null values. However, let's look at it first to confirm that we *can't* impute it (i.e., if we can see that all of the recorded values are the same, we might be able to justify imputation in specific situations).

##### Question 19: Print out counts of each unique value in `SalariedFlag`
Before anything else, we should check to see the counts of distinct values in the `SalariedFlag` column. Use the `.value_counts()` method to print this out.

<p style="font-size:.75rem">Expected output: Value counts of distinct values in the SalariedFlag column are displayed. 1.0 = 20 and 0.0 = 18.</p>

In [24]:
df['SalariedFlag'].value_counts()

1.0    20
0.0    18
Name: SalariedFlag, dtype: int64

##### Question 20: What do you see when counting the distinct values?
Upon observing the count of distinct values in the `SalariedFlag` column, what do you notice? Is there any way we could possibly justify imputation?

<p style="font-size:.75rem">Expected output: None</p>

```
The counts of 1 and 0 are almost evenly split. It would be impossible to determine which rows should have a 1 or 0 without further evidence. For this reason, imputation cannot be used and the column should be dropped.
```
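
To see the nulls next to the recorded values, `.value_counts()` accepts `dropna=False`, and `normalize=True` turns counts into shares. The sketch below rebuilds a stand-in column from the counts reported above (20 ones, 18 zeros, 266 nulls) rather than reading the real file.

```python
import numpy as np
import pandas as pd

# Stand-in for the SalariedFlag column, built from the reported counts.
flags = pd.Series([1.0] * 20 + [0.0] * 18 + [np.nan] * 266, name="SalariedFlag")

# Include NaN as its own category; results are sorted by count, largest first.
counts_with_nulls = flags.value_counts(dropna=False)

# Share of each value among the non-null entries only.
share_known = flags.value_counts(normalize=True)
print(counts_with_nulls)
print(share_known)
```

Roughly half of the recorded values fall on each side, which is consistent with the conclusion that no single fill value can be justified.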

##### Question 21: Drop the `SalariedFlag` column
Use the `.drop()` method to drop the `SalariedFlag` column. 

<p style="font-size:.75rem">Expected output: None</p>

In [25]:
df.drop(columns='SalariedFlag', inplace=True)

##### Question 22: Print out the number of nulls
Use the `.isna()` and `.sum()` methods to show how many null values still exist in the dataframe. Make sure that the column `SalariedFlag` has been dropped successfully.

<p style="font-size:.75rem">Expected output: Count of nulls is shown. The SalariedFlag column no longer appears.</p>

In [26]:
df.isna().sum()

EmployeeID             0
DepartmentID           0
RateChangeDate         0
Rate                   0
PayFrequency           0
LoginID                0
OrganizationLevel      1
JobTitle               0
BirthDate              0
MaritalStatus          0
Gender                 0
HireDate               0
VacationHours          0
SickLeaveHours         0
CurrentFlag            0
ShiftID                0
StartDate              0
EndDate              297
ModifiedDate           0
DepartmentName         0
Sub-Department         1
dtype: int64

#### Check inconsistent format
To make sure that each of our columns has data in a consistent format, we should check to make sure that each column has the data type that we expect. If all of the quantitative fields are either `int64` or `float64`, we can move on to the categorical fields and get a count of each unique value to make sure each row has an expected value.

##### Question 23: Check for data types of quantitative fields
Use the `.info()` method to check the data types of each column in the dataframe.

<p style="font-size:.75rem">Expected output: Informational dataframe with info about 21 columns is shown. CurrentFlag column has a type of object.</p>

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         304 non-null    int64  
 1   DepartmentID       304 non-null    int64  
 2   RateChangeDate     304 non-null    object 
 3   Rate               304 non-null    float64
 4   PayFrequency       304 non-null    int64  
 5   LoginID            304 non-null    object 
 6   OrganizationLevel  303 non-null    float64
 7   JobTitle           304 non-null    object 
 8   BirthDate          304 non-null    object 
 9   MaritalStatus      304 non-null    object 
 10  Gender             304 non-null    object 
 11  HireDate           304 non-null    object 
 12  VacationHours      304 non-null    int64  
 13  SickLeaveHours     304 non-null    int64  
 14  CurrentFlag        304 non-null    object 
 15  ShiftID            304 non-null    int64  
 16  StartDate          304 non

##### Question 24: Does any column surprise you?
Looking at the information printed above and knowing a little about the data, do any of the column data types surprise you? Which column(s) surprise you and why? (yes, there should be at least one)

```
From what I have seen in previous data, the `CurrentFlag` column should be either a 1 or a 0 and thus have a Dtype of `int64`. However, it has a Dtype of `object`. Thus, I think that a string must have gotten introduced into the column's data.
```

##### Question 25: Check count of values of `CurrentFlag`
To be sure that there are issues with the `CurrentFlag` column, print out the count of unique values in the column using the `.value_counts()` method.

<p style="font-size:.75rem">Expected output: Value count of CurrentFlag column is returned. 1 = 299 and yes = 5.</p>

In [28]:
df['CurrentFlag'].value_counts()

1      299
yes      5
Name: CurrentFlag, dtype: int64

##### Question 26: What do you observe?
After counting up each unique value in the `CurrentFlag` column, what do you see?

<p style="font-size:.75rem">Expected output: None</p>

```
There is a mix of 1s and the word "yes", which probably should have been 1s.
```

##### Question 27: Get rows where `CurrentFlag` is `yes`
Filter the dataframe using `.loc` to show only rows where `CurrentFlag` is equal to `"yes"`.

<p style="font-size:.75rem">Expected output: 5 rows whose indexes are 73, 107, and 123-125.</p>

In [29]:
df.loc[ df['CurrentFlag'] == "yes" ]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
73,70,7,00:00.0,12.45,1,adventure-works\david2,4.0,Production Technician - WC60,12/29/1984,M,M,12/15/2008,33,36,yes,2,12/15/2008,,00:00.0,Production,Manufacturing
107,104,7,00:00.0,13.45,1,adventure-works\mary1,4.0,Production Technician - WC10,9/19/1986,M,F,12/25/2009,94,67,yes,2,12/25/2009,,00:00.0,Production,Manufacturing
123,120,7,00:00.0,11.0,1,adventure-works\kitti0,4.0,Production Technician - WC50,6/6/1987,S,F,3/4/2009,89,64,yes,1,3/4/2009,,00:00.0,Production,Manufacturing
124,121,15,00:00.0,19.2308,2,adventure-works\pilar0,3.0,Shipping and Receiving Supervisor,9/9/1972,S,M,1/2/2009,93,66,yes,1,1/2/2009,,00:00.0,Shipping and Receiving,Inventory Management
125,122,15,00:00.0,9.0,1,adventure-works\susan0,4.0,Stocker,2/17/1978,S,F,12/7/2008,97,68,yes,2,12/7/2008,,00:00.0,Shipping and Receiving,Inventory Management


##### Question 28: Change each `"yes"` to `1`
It's probably safe to assume that, since all of the values in the `CurrentFlag` column are either `1` or `"yes"`, all `"yes"` values can be changed to 1. Use the `.replace()` method to replace all occurrences of "yes" with **the integer** `1`. Save it to the original dataframe.

<p style="font-size:.75rem">Expected output: None</p>

In [30]:
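# Assign the result back to the column; calling replace(..., inplace=True)
# on a selected column may not update df in newer Pandas versions.
df['CurrentFlag'] = df['CurrentFlag'].replace("yes", 1)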

##### Question 29: Print out the value counts again
Use the `.value_counts()` method to print out the counts of unique values in the `CurrentFlag` again. Make sure that the only value that occurs now is `1`.

<p style="font-size:.75rem">Expected output: Value counts for the CurrentFlag column are displayed. The number 1 appears twice, once with a count of 299 and again with a count of 5.</p>

In [31]:
df['CurrentFlag'].value_counts()

1    299
1      5
Name: CurrentFlag, dtype: int64

##### Question 30: What do you notice?
What do you notice about the results of `.value_counts()`? Why do you think this occurred?

<p style="font-size:.75rem">Expected output: None</p>

```
The value `1` is counted twice as if there were two types of `1`. It must be because the first `1` is a string and the second `1` is an integer and so they are different.
```
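
One way to confirm this suspicion is to count the Python type of each value in the column. The following is a small sketch on a made-up five-row column that mimics the situation (strings `"1"` read from the CSV, plus `"yes"` replaced by the integer `1`).

```python
import pandas as pd

# Toy stand-in: a column that was read as strings because of the "yes" entries.
flag = pd.Series(["1", "1", "1", "yes", "yes"], dtype=object, name="CurrentFlag")

# Replacing "yes" with the integer 1 leaves the other values as strings.
flag = flag.replace("yes", 1)

# Count how many values of each Python type the column now holds.
type_counts = flag.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```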

#### Fix mismatched data type
Because the `CurrentFlag` column started out containing strings, the column still has a data type of `object`, and most of its values are still strings. Change it so that the `CurrentFlag` column has a data type of `int8`.

##### Question 31: Print out the data type of `CurrentFlag`
Using the `.dtype` property, print out the data type of the `CurrentFlag` column.

<p style="font-size:.75rem">Expected output: dtype('O')</p>

In [32]:
df['CurrentFlag'].dtype

dtype('O')

##### Question 32: Create `CurrentFlag` as data type `int8`
Using the `.astype()` method, print out a copy of the `CurrentFlag` column whose data type is `int8`.

<p style="font-size:.75rem">Expected output: Series of all 1s is printed with dtype = int8.</p>

In [33]:
df['CurrentFlag'].astype('int8')

0      1
1      1
2      1
3      1
4      1
      ..
299    1
300    1
301    1
302    1
303    1
Name: CurrentFlag, Length: 304, dtype: int8

##### Question 33: Override the `CurrentFlag` column
Using the code from above, save the new `CurrentFlag` column of data type `int8` to the `CurrentFlag` column of the dataframe.

<p style="font-size:.75rem">Expected output: None</p>

In [34]:
df['CurrentFlag'] = df['CurrentFlag'].astype('int8')
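
The conversion works here because every remaining value can be read as an integer; a leftover string such as `"yes"` would make `.astype('int8')` raise an error. As a hedged sketch, `pd.to_numeric` with `errors="coerce"` can be used beforehand to list values that would not convert. The toy column below is made up for illustration.

```python
import pandas as pd

# Toy column with one value that cannot be parsed as a number.
flag = pd.Series(["1", 1, "yes"], dtype=object, name="CurrentFlag")

# Unparseable entries become NaN instead of raising an error.
parsed = pd.to_numeric(flag, errors="coerce")

# The original values that would block a later .astype('int8').
unparsable = flag[parsed.isna()]
print(unparsable.tolist())
```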

##### Question 34: Print out the dataframe info
Use the `.info()` method to print out information about the data types of each column. Make sure that the `CurrentFlag` column now has a data type of `int8`.

<p style="font-size:.75rem">Expected output: Informational dataframe is returned that shows that the CurrentFlag column has a dtype of int8.</p>

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         304 non-null    int64  
 1   DepartmentID       304 non-null    int64  
 2   RateChangeDate     304 non-null    object 
 3   Rate               304 non-null    float64
 4   PayFrequency       304 non-null    int64  
 5   LoginID            304 non-null    object 
 6   OrganizationLevel  303 non-null    float64
 7   JobTitle           304 non-null    object 
 8   BirthDate          304 non-null    object 
 9   MaritalStatus      304 non-null    object 
 10  Gender             304 non-null    object 
 11  HireDate           304 non-null    object 
 12  VacationHours      304 non-null    int64  
 13  SickLeaveHours     304 non-null    int64  
 14  CurrentFlag        304 non-null    int8   
 15  ShiftID            304 non-null    int64  
 16  StartDate          304 non

#### Fix Extreme Values
You have already fixed some columns with null values and incorrect values. Now, check the data set for extreme/incorrect values.

##### Question 35: Describe the dataframe
Use the `.describe()` method on the dataframe to get some summary statistics about each quantitative column.

<p style="font-size:.75rem">Expected output: Descriptive dataframe is displayed.</p>

In [36]:
df.describe()

Unnamed: 0,EmployeeID,DepartmentID,Rate,PayFrequency,OrganizationLevel,VacationHours,SickLeaveHours,CurrentFlag,ShiftID
count,304.0,304.0,304.0,304.0,303.0,304.0,304.0,304.0,304.0
mean,146.825658,7.302632,17.674245,1.460526,3.465347,50.822368,45.207237,1.0,1.546053
std,85.129793,2.903371,13.028188,0.499261,0.804347,31.690771,14.704341,0.0,0.769371
min,1.0,1.0,-72.12,1.0,1.0,0.0,20.0,1.0,1.0
25%,72.75,7.0,11.0,1.0,3.0,26.75,33.0,1.0,1.0
50%,148.5,7.0,14.0,1.0,4.0,49.0,45.0,1.0,1.0
75%,224.0,7.0,23.0769,2.0,4.0,74.25,57.25,1.0,2.0
max,290.0,16.0,125.5,2.0,4.0,290.0,80.0,1.0,3.0


##### Question 36: What do you notice?
Observe the information created by the `.describe()` method above. What do you notice about the summary statistics that might indicate inaccurate data? You should see at least two columns with potential problems. Look closely at the min and maximum values for each column.

<p style="font-size:.75rem">Expected output: None</p>

```
The `Rate` column has a negative minimum value, which is impossible for a pay rate and should be further investigated. The `VacationHours` column also has a very high maximum value, which might be an error.
```

#### Fix Rate
##### Question 37: Get a filtered `Rate` column
Knowing that the `Rate` column has at least one negative number, create a filtered dataframe that contains rows where `Rate` is less than or equal to 0.

<p style="font-size:.75rem">Expected output: 1 row of index 286. Its Rate is -72.12.</p>

In [37]:
df.loc[ df['Rate'] <= 0 ]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
286,273,3,00:00.0,-72.12,2,adventure-works\brian3,1.0,Vice President of Sales,6/6/1977,S,M,2/15/2011,10,25,1,1,2/15/2011,,00:00.0,Sales,Sales and Marketing


##### Question 38: What should be done?
Looking at the row(s) returned, what do you think should be done? Can the true `Rate` of the rows with negative values be imputed, or should they be dropped?

<p style="font-size:.75rem">Expected output: None</p>

```
The `JobTitle` of the only row returned is Vice President of Sales, which makes me expect a higher rate. Simply making the `Rate` positive rather than negative makes sense to me.
```

##### Question 39: Make the negative `Rate` positive
Using the code from above, print out the `Rate` column of rows whose `Rate` is less than or equal to 0. Multiply it by `-1` to make it positive.

<p style="font-size:.75rem">Expected output: Series with one value of index 286. The Rate for this row is now 72.12.</p>

In [38]:
df.loc[ df['Rate'] <= 0, 'Rate' ] * -1

286    72.12
Name: Rate, dtype: float64

##### Question 40: Set the negative rate to positive rate
Using the code from above, filter the dataframe and change the negative rates to positive rates.

<p style="font-size:.75rem">Expected output: None</p>

In [39]:
df.loc[ df['Rate'] <= 0, 'Rate' ] = df.loc[ df['Rate'] <= 0, 'Rate' ] * -1 
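
An equivalent way to flip the sign is the `.abs()` method, which avoids repeating the multiplication. This is a minimal sketch on toy rates (the three values are picked for illustration), not a replacement for the expected answer.

```python
import pandas as pd

# Toy data: one negative rate among valid ones.
toy = pd.DataFrame({"Rate": [-72.12, 23.0769, 6.5]})

# Build the filter once and reuse it on both sides of the assignment.
negative = toy["Rate"] < 0
toy.loc[negative, "Rate"] = toy.loc[negative, "Rate"].abs()
print(toy)
```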

##### Question 41: Print out the dataframe description again
Use the `.describe()` method to print out the dataframe description again. Make sure that the minimum rate is greater than 0.

<p style="font-size:.75rem">Expected output: The min Rate is now 6.5.</p>

In [40]:
df.describe()

Unnamed: 0,EmployeeID,DepartmentID,Rate,PayFrequency,OrganizationLevel,VacationHours,SickLeaveHours,CurrentFlag,ShiftID
count,304.0,304.0,304.0,304.0,303.0,304.0,304.0,304.0,304.0
mean,146.825658,7.302632,18.148719,1.460526,3.465347,50.822368,45.207237,1.0,1.546053
std,85.129793,2.903371,12.356397,0.499261,0.804347,31.690771,14.704341,0.0,0.769371
min,1.0,1.0,6.5,1.0,1.0,0.0,20.0,1.0,1.0
25%,72.75,7.0,11.0,1.0,3.0,26.75,33.0,1.0,1.0
50%,148.5,7.0,14.0,1.0,4.0,49.0,45.0,1.0,1.0
75%,224.0,7.0,23.0769,2.0,4.0,74.25,57.25,1.0,2.0
max,290.0,16.0,125.5,2.0,4.0,290.0,80.0,1.0,3.0


#### Fix VacationHours
The maximum value in the `VacationHours` column is 290. That might be correct, but it also might be out of the ordinary. Let's do some work to try and fix it.

##### Question 42: Sort the dataframe by `VacationHours` descending
Use the `.sort_values()` method and the `by` and `ascending` arguments to print out the dataframe sorted by `VacationHours`, with the rows with the highest number of vacation hours on top. Add the `.head()` method after the `.sort_values()` method to get the top five rows.

<p style="font-size:.75rem">Expected output: 5 rows with indexes 292, 0, 91, 120, and 119.</p>

In [41]:
df.sort_values(by='VacationHours', ascending=False).head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
292,279,3,00:00.0,23.0769,2,adventure-works\tsvi0,3.0,Sales Representative,1/18/1974,M,M,5/31/2011,290,34,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing
0,1,16,00:00.0,125.5,2,adventure-works\ken0,,Chief Executive Officer,1/29/1969,S,M,1/14/2009,99,69,1,1,1/14/2009,,00:00.0,Executive,Executive General and Administration
91,88,7,00:00.0,13.45,1,adventure-works\betsy0,4.0,Production Technician - WC10,12/17/1966,S,F,12/18/2009,99,69,1,3,12/18/2009,,00:00.0,Production,Manufacturing
120,117,7,00:00.0,11.0,1,adventure-works\chad0,4.0,Production Technician - WC50,8/4/1990,M,M,2/18/2009,99,69,1,1,2/18/2009,,00:00.0,Production,Manufacturing
119,116,7,00:00.0,11.0,1,adventure-works\michael2,4.0,Production Technician - WC50,5/3/1974,S,M,1/31/2009,98,69,1,1,1/31/2009,,00:00.0,Production,Manufacturing


#### Check if outlier
##### Question 43: Import numpy
Before anything else, let's test to see whether `290` is really an outlier for the `VacationHours` column. To begin, import the NumPy library.

<p style="font-size:.75rem">Expected output: None</p>

In [42]:
import numpy as np

##### Question 44: Create function `isOutlier()`
Now create a function called `isOutlier()` that accepts one parameter `hours`. The function will be applied individually to each row of the `VacationHours` column and should perform the following:

1. Uses the `np.quantile()` function on the `VacationHours` column to get the first quartile (`0.25`) and save it to a variable `first_quartile`.
2. Uses the `np.quantile()` function on the `VacationHours` column to get the third quartile (`0.75`) and save it to a variable `third_quartile`.
3. Subtracts the `first_quartile` from the `third_quartile` and saves the results to a variable `iqr` (interquartile range).
4. Calculates the upper fence by performing `third_quartile + (1.5 * iqr)`, saving it to a variable `upper_fence`.
    - The upper fence is the upper limit for identifying outliers. Anything greater than the upper fence is considered an outlier.
5. If `hours` is greater than the `upper_fence`, return `True` (the hours for this row *is* an outlier). Otherwise, return `False`.

<p style="font-size:.75rem">Expected output: None</p>

In [43]:
def isOutlier(hours):
    # Get 25th percentile
    first_quartile = np.quantile(df['VacationHours'], 0.25)
    # Get 75th percentile
    third_quartile = np.quantile(df['VacationHours'], 0.75)
    # Get interquartile range
    iqr = third_quartile - first_quartile
    # Calculate upper fence
    upper_fence = third_quartile + (1.5 * iqr)
    
    if hours > upper_fence:
        return True
    else:
        return False
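
The function above recomputes the quartiles for every row it is applied to. As a sketch of the same upper-fence rule computed only once, the comparison can be made against the whole Series at the same time. The `hours` Series and its values are made up for illustration and are not the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not the assignment data
hours = pd.Series([10, 20, 30, 40, 290])

# Compute the quartiles once instead of once per row
first_quartile = np.quantile(hours, 0.25)
third_quartile = np.quantile(hours, 0.75)
iqr = third_quartile - first_quartile
upper_fence = third_quartile + (1.5 * iqr)

# Comparing the whole Series to the fence yields a boolean Series
is_outlier = hours > upper_fence

print(upper_fence)          # 70.0
print(is_outlier.tolist())  # [False, False, False, False, True]
```

This is an alternative for comparison only. The assignment still asks for the `isOutlier()` function and `.apply()`.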

##### Question 45: Apply function to the `VacationHours` column
Using the `.apply()` method, pass in the `isOutlier` function without parentheses to return a Series of True/False values. True indicates a row where `VacationHours` is an outlier.

<p style="font-size:.75rem">Expected output: Series of length 304 and dtype bool. All values will appear to be False.</p>

In [44]:
df['VacationHours'].apply(isOutlier)

0      False
1      False
2      False
3      False
4      False
       ...  
299    False
300    False
301    False
302    False
303    False
Name: VacationHours, Length: 304, dtype: bool
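
Because the printed Series is truncated, a `True` value can be hidden behind the `...`. One way to check, sketched here on a made-up boolean Series (`flags` is a hypothetical stand-in for the result above), is to sum the values and look up the index labels that are flagged.

```python
import pandas as pd

# Hypothetical boolean Series standing in for the result of .apply(isOutlier)
flags = pd.Series([False] * 99 + [True])

# True counts as 1, so the sum is the number of flagged rows
print(flags.sum())                  # 1

# Filtering the Series by itself keeps only the True rows
print(flags[flags].index.tolist())  # [99]
```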

##### Question 46: Save the Series above to a new column called `VacationHoursOutlier`
Using the code from above, create a new column in the dataframe called `VacationHoursOutlier` that has either a True or False if the `VacationHours` is an outlier.

<p style="font-size:.75rem">Expected output: None</p>

In [45]:
df['VacationHoursOutlier'] = df['VacationHours'].apply(isOutlier)

##### Question 47: Sort the dataframe by `VacationHours` descending again
In the same way that you did previously, use the `.sort_values()` method with the `by` and `ascending` arguments to view the rows with the highest number of `VacationHours`.

<p style="font-size:.75rem">Expected output: Dataframe rows where index 292 is at the top.</p>

In [46]:
df.sort_values(by='VacationHours', ascending=False).head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department,VacationHoursOutlier
292,279,3,00:00.0,23.0769,2,adventure-works\tsvi0,3.0,Sales Representative,1/18/1974,M,M,5/31/2011,290,34,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,True
0,1,16,00:00.0,125.5,2,adventure-works\ken0,,Chief Executive Officer,1/29/1969,S,M,1/14/2009,99,69,1,1,1/14/2009,,00:00.0,Executive,Executive General and Administration,False
91,88,7,00:00.0,13.45,1,adventure-works\betsy0,4.0,Production Technician - WC10,12/17/1966,S,F,12/18/2009,99,69,1,3,12/18/2009,,00:00.0,Production,Manufacturing,False
120,117,7,00:00.0,11.0,1,adventure-works\chad0,4.0,Production Technician - WC50,8/4/1990,M,M,2/18/2009,99,69,1,1,2/18/2009,,00:00.0,Production,Manufacturing,False
119,116,7,00:00.0,11.0,1,adventure-works\michael2,4.0,Production Technician - WC50,5/3/1974,S,M,1/31/2009,98,69,1,1,1/31/2009,,00:00.0,Production,Manufacturing,False


##### Question 48: Is `290` an outlier for `VacationHours`?
Looking at the dataframe returned above and the `VacationHoursOutlier` column, is `290` an outlier in the `VacationHours` column? If so, what could be done to fix this value?

<p style="font-size:.75rem">Expected output: None</p>

```
The number `290` is an outlier in the `VacationHours` column. This row could be dropped, or we could use imputation to determine what other employees with the same job title have as their `VacationHours` and assign an average value to the row.
```

##### Question 49: Get rows of employees with `JobTitle` Sales Representative
As seen above, the row with the outlier in `VacationHours` has a `JobTitle` of "Sales Representative". Use a filter to get back rows of employees whose `JobTitle` is also "Sales Representative".

<p style="font-size:.75rem">Expected output: Dataframe starting with index 288 and ending with index 303, including index 292.</p>

In [47]:
df.loc[ df['JobTitle'] == 'Sales Representative' ]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department,VacationHoursOutlier
288,275,3,00:00.0,23.0769,2,adventure-works\michael9,3.0,Sales Representative,12/25/1968,S,M,5/31/2011,38,39,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
289,276,3,00:00.0,23.0769,2,adventure-works\linda3,3.0,Sales Representative,2/27/1980,M,F,5/31/2011,27,33,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
290,277,3,00:00.0,23.0769,2,adventure-works\jillian0,3.0,Sales Representative,8/29/1962,S,F,5/31/2011,24,32,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
291,278,3,00:00.0,23.0769,2,adventure-works\garrett1,3.0,Sales Representative,2/4/1975,M,M,5/31/2011,33,36,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
292,279,3,00:00.0,23.0769,2,adventure-works\tsvi0,3.0,Sales Representative,1/18/1974,M,M,5/31/2011,290,34,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,True
293,280,3,00:00.0,23.0769,2,adventure-works\pamela0,3.0,Sales Representative,12/6/1974,S,F,5/31/2011,22,31,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
294,281,3,00:00.0,23.0769,2,adventure-works\shu0,3.0,Sales Representative,3/9/1968,M,M,5/31/2011,26,33,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
295,282,3,00:00.0,23.0769,2,adventure-works\josé1,3.0,Sales Representative,12/11/1963,M,M,5/31/2011,31,35,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
296,283,3,00:00.0,23.0769,2,adventure-works\david8,3.0,Sales Representative,2/11/1974,S,M,5/31/2011,23,31,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
297,284,3,00:00.0,23.0769,2,adventure-works\tete0,3.0,Sales Representative,1/5/1978,M,M,9/30/2012,39,39,1,1,9/30/2012,,00:00.0,Sales,Sales and Marketing,False


##### Question 50: Remove the row with outliers from the filter
Add to the filter that you created above so that the row with an outlier in the `VacationHours` field is not included in the filtered dataframe.

<p style="font-size:.75rem">Expected output: Dataframe starting with index 288 and ending with index 303, excluding index 292.</p>

In [48]:
df.loc[ (df['JobTitle'] == 'Sales Representative') & (df['VacationHours'] != 290) ]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department,VacationHoursOutlier
288,275,3,00:00.0,23.0769,2,adventure-works\michael9,3.0,Sales Representative,12/25/1968,S,M,5/31/2011,38,39,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
289,276,3,00:00.0,23.0769,2,adventure-works\linda3,3.0,Sales Representative,2/27/1980,M,F,5/31/2011,27,33,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
290,277,3,00:00.0,23.0769,2,adventure-works\jillian0,3.0,Sales Representative,8/29/1962,S,F,5/31/2011,24,32,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
291,278,3,00:00.0,23.0769,2,adventure-works\garrett1,3.0,Sales Representative,2/4/1975,M,M,5/31/2011,33,36,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
293,280,3,00:00.0,23.0769,2,adventure-works\pamela0,3.0,Sales Representative,12/6/1974,S,F,5/31/2011,22,31,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
294,281,3,00:00.0,23.0769,2,adventure-works\shu0,3.0,Sales Representative,3/9/1968,M,M,5/31/2011,26,33,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
295,282,3,00:00.0,23.0769,2,adventure-works\josé1,3.0,Sales Representative,12/11/1963,M,M,5/31/2011,31,35,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
296,283,3,00:00.0,23.0769,2,adventure-works\david8,3.0,Sales Representative,2/11/1974,S,M,5/31/2011,23,31,1,1,5/31/2011,,00:00.0,Sales,Sales and Marketing,False
297,284,3,00:00.0,23.0769,2,adventure-works\tete0,3.0,Sales Representative,1/5/1978,M,M,9/30/2012,39,39,1,1,9/30/2012,,00:00.0,Sales,Sales and Marketing,False
299,286,3,00:00.0,23.0769,2,adventure-works\lynn0,3.0,Sales Representative,2/14/1977,S,F,5/30/2013,36,38,1,1,5/30/2013,,00:00.0,Sales,Sales and Marketing,False


##### Question 51: Get the average vacation hours among Sales Representatives
Using the filtered dataframe above, get the average `VacationHours` for Sales Representatives. Store this number in a variable called `sales_rep_average_vac_hours` and then print it out.

<p style="font-size:.75rem">Expected output: 31.153846153846153</p>

In [49]:
sales_rep_average_vac_hours = df.loc[ (df['JobTitle'] == 'Sales Representative') & (df['VacationHours'] != 290), 'VacationHours' ].mean()
sales_rep_average_vac_hours

31.153846153846153
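
The filter above hard-codes the value `290`. As a sketch of an alternative, assuming a boolean flag column such as `VacationHoursOutlier` already exists, the flag can be negated with `~` so that the filter does not depend on a specific number. The `toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the assignment data
toy = pd.DataFrame({
    "JobTitle": ["Sales Representative", "Sales Representative",
                 "Sales Representative", "Buyer"],
    "VacationHours": [20, 30, 290, 50],
    "VacationHoursOutlier": [False, False, True, False],
})

# Keep Sales Representatives whose hours were not flagged as outliers
mask = (toy["JobTitle"] == "Sales Representative") & ~toy["VacationHoursOutlier"]

# Average only the rows that passed both conditions
average = toy.loc[mask, "VacationHours"].mean()
print(average)  # 25.0
```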

##### Question 52: Assign the average vacation hours to the employee with an outlier `VacationHours`
Using the `.loc` property, change the `VacationHours` of the employee with an outlier in the `VacationHours` column to be `sales_rep_average_vac_hours` instead.

<p style="font-size:.75rem">Expected output: None</p>

In [50]:
df.loc[ df['VacationHoursOutlier'], 'VacationHours'] = sales_rep_average_vac_hours

##### Question 53: Print out the dataframe description
Finally, print out the dataframe description using the `.describe()` method. Make sure the maximum value for `VacationHours` is no longer 290.

<p style="font-size:.75rem">Expected output: Dataframe description that shows max VacationHours as 99.000000.</p>

In [51]:
df.describe()

Unnamed: 0,EmployeeID,DepartmentID,Rate,PayFrequency,OrganizationLevel,VacationHours,SickLeaveHours,CurrentFlag,ShiftID
count,304.0,304.0,304.0,304.0,303.0,304.0,304.0,304.0,304.0
mean,146.825658,7.302632,18.148719,1.460526,3.465347,49.970901,45.207237,1.0,1.546053
std,85.129793,2.903371,12.356397,0.499261,0.804347,28.566691,14.704341,0.0,0.769371
min,1.0,1.0,6.5,1.0,1.0,0.0,20.0,1.0,1.0
25%,72.75,7.0,11.0,1.0,3.0,26.75,33.0,1.0,1.0
50%,148.5,7.0,14.0,1.0,4.0,49.0,45.0,1.0,1.0
75%,224.0,7.0,23.0769,2.0,4.0,74.0,57.25,1.0,2.0
max,290.0,16.0,125.5,2.0,4.0,99.0,80.0,1.0,3.0


#### Final Thoughts
In case you were curious, the value of the row with an outlier in `VacationHours` was originally `29` instead of `290`. That means the error was most likely caused by a person who accidentally typed an extra `0` at the end of the actual number. In any case, the average vacation hours among Sales Representatives (`31.15`) turned out to be fairly close to the original value, meaning that imputation worked fairly well in this example.
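
To put a number on "fairly close", here is a quick arithmetic sketch comparing the imputed average with the original value. The variable names are illustrative; the two numbers come from the text above and from Question 51.

```python
original_hours = 29                 # original value stated in the text
imputed_hours = 31.153846153846153  # average computed in Question 51

# Absolute and relative difference between the imputed and original values
difference = imputed_hours - original_hours
relative = difference / original_hours

print(round(difference, 2), round(relative * 100, 1))  # 2.15 7.4
```

The imputed value is about 2.15 hours, or roughly 7.4%, above the original.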